90 results for "Attention mechanisms"
Search Results
2. Explainable attention based breast tumor segmentation using a combination of UNet, ResNet, DenseNet, and EfficientNet models.
- Author
Anari, Shokofeh, Sadeghi, Soroush, Sheikhi, Ghazaal, Ranjbarzadeh, Ramin, and Bendechache, Malika
- Abstract
This study utilizes the Breast Ultrasound Image (BUSI) dataset to present a deep learning technique for breast tumor segmentation based on a modified UNet architecture. To improve segmentation accuracy, the model integrates attention mechanisms, such as the Convolutional Block Attention Module (CBAM) and Non-Local Attention, with advanced encoder architectures, including ResNet, DenseNet, and EfficientNet. These attention mechanisms enable the model to focus more effectively on relevant tumor areas, resulting in significant performance improvements. Models incorporating attention mechanisms outperformed those without, as reflected in superior evaluation metrics. The effects of Dice Loss and Binary Cross-Entropy (BCE) Loss on the model's performance were also analyzed. Dice Loss maximized the overlap between predicted and actual segmentation masks, leading to more precise boundary delineation, while BCE Loss achieved higher recall, improving the detection of tumor areas. Grad-CAM visualizations further demonstrated that attention-based models enhanced interpretability by accurately highlighting tumor areas. The findings indicate that combining advanced encoder architectures, attention mechanisms, and the UNet framework can yield more reliable and accurate breast tumor segmentation. Future research will explore the use of multi-modal imaging, real-time deployment for clinical applications, and more advanced attention mechanisms to further improve segmentation performance. [ABSTRACT FROM AUTHOR]
- Published
- 2025
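To make the Dice/BCE comparison in the abstract above concrete, here is a minimal PyTorch sketch of the two standard losses for binary masks. It is illustrative only; function names and tensor shapes are assumptions, not code from the paper.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    # logits, target: (N, 1, H, W); Dice directly rewards mask/prediction overlap
    p = torch.sigmoid(logits)
    inter = (p * target).sum(dim=(1, 2, 3))
    union = p.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return 1 - ((2 * inter + eps) / (union + eps)).mean()

def bce_loss(logits, target):
    # Per-pixel classification loss; the abstract reports it favoring recall
    return F.binary_cross_entropy_with_logits(logits, target)
```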
3. Hybrid-CT: a novel hybrid 2D/3D CNN-Transformer based on transfer learning and attention mechanisms for small object classification.
- Author
Bayoudh, Khaled and Mtibaa, Abdellatif
- Abstract
In recent years, convolutional neural networks (CNNs) have proven their effectiveness in many challenging computer vision-based tasks, including small object classification. However, according to recent literature, this task is mainly based on 2D CNNs, and the small size of object instances makes their recognition a challenging task. Since 3D CNNs are extremely tedious and time-consuming to train, they are ill-suited to applications that require a trade-off between accuracy and efficiency. Moreover, due to the great success of Transformers in the field of natural language processing (NLP), a spatial Transformer can also be used as a robust feature transformer and has recently been successfully applied to computer vision tasks, including image classification. By incorporating attention mechanisms into Transformers, many NLP and computer vision tasks can achieve excellent performance and help learn the contextual encoding of the input patches. However, the complexity of these tasks generally increases with the dimension of the input feature space. In this paper, we propose a novel hybrid 2D/3D CNN-Transformer based on transfer learning and attention mechanisms for better performance on a low-resolution dataset. First, the combination of a pre-trained deep CNN and a 3D CNN can significantly reduce the complexity and result in an accurate learning algorithm. Second, a pre-trained deep CNN model is used as a robust feature extractor and combined with a spatial Transformer to improve the representational power of the developed model and take advantage of the powerful global modeling capabilities of Transformers. Finally, spatial attention and channel attention are adaptively fused by focusing on all components in the input space to capture local and global spatial correlations on non-overlapping regions of the input representation. Experimental results show that the proposed framework has significant relevance in terms of efficiency and accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2025
4. Multi-scale Unet-based feature aggregation network for lightweight image deblurring.
- Author
Yang, Yancheng, Gai, Shaoyan, and Da, Feipeng
- Abstract
The single image deblurring task has made remarkable progress, with convolutional neural networks exhibiting extraordinary performance. However, existing methods maintain high-quality reconstruction through an excessive number of parameters and extremely deep network structures, which results in increased requirements for computational resources and memory storage, making it challenging to deploy on resource-constrained devices. Numerous experiments indicate that current models still possess redundant parameters. To address these issues, we introduce a multi-scale Unet-based feature aggregation network (MUANet). This network architecture is based on a single-stage Unet, which significantly simplifies the network's complexity. A lightweight Unet-based attention block, built on a progressive feature extraction module, is designed to enhance feature extraction across multi-scale attention modules. Given the extraordinary performance of the self-attention mechanism, we propose a self-attention mechanism based on the Fourier transform and a depthwise convolutional feed-forward network to enhance the network's feature extraction capability. This module contains extractors with different receptive fields for feature extraction at different spatial scales and capturing contextual information. Through the aggregation of multi-scale features from different attention mechanisms, our method learns a set of rich features that retain contextual information from multiple scales and high-resolution spatial details. Extensive experiments show that the proposed MUANet achieves competitive results in lightweight deblurring qualitative and quantitative evaluations. [ABSTRACT FROM AUTHOR]
- Published
- 2025
5. KTMN: Knowledge-driven Two-stage Modulation Network for visual question answering.
- Author
Shi, Jingya, Han, Dezhi, Chen, Chongqing, and Shen, Xiang
- Abstract
Existing visual question answering (VQA) methods introduce the Transformer as the backbone architecture for intra- and inter-modal interactions, demonstrating its effectiveness in dependency relationship modeling and information alignment. However, the Transformer’s inherent attention mechanisms tend to be affected by irrelevant information and do not utilize the positional information of objects in the image during the modelling process, which hampers its ability to adequately focus on key question words and crucial image regions during answer inference. Considering this issue is particularly pronounced on the visual side, this paper designs a Knowledge-driven Two-stage Modulation self-attention mechanism to optimize the internal interaction modeling of image sequences. In the first stage, we integrate textual context knowledge and the geometric knowledge of visual objects to modulate and optimize the query and key matrices. This effectively guides the model to focus on visual information relevant to the context and geometric knowledge during the information selection process. In the second stage, we design an information comprehensive representation to apply a secondary modulation to the interaction results from the first modulation. This further guides the model to fully consider the overall context of the image during inference, enhancing its global understanding of the image content. On this basis, we propose a Knowledge-driven Two-stage Modulation Network (KTMN) for VQA, which enables fine-grained filtering of redundant image information while more precisely focusing on key regions. Finally, extensive experiments conducted on the datasets VQA v2 and CLEVR yielded Overall accuracies of 71.36% and 99.20%, respectively, providing ample validation of the proposed method’s effectiveness and rationality. Source code is available at . [ABSTRACT FROM AUTHOR]
- Published
- 2024
6. CFFANet: category feature fusion and attention mechanism network for retinal vessel segmentation.
- Author
Chen, Qiyu, Wang, Jianming, Yin, Jiting, and Yang, Zizhong
- Abstract
Retinal vessel segmentation is a computer-aided diagnostic method for ophthalmic disease analysis. Owing to the complex structure of the retinal vasculature, it is difficult for the segmentation network to capture effective features, and the semantic gap between different layers of features leads to insufficient feature fusion and thus makes segmentation difficult. In this paper, we propose a new segmentation network called CFFANet. Firstly, to capture accurate and sufficient global and local features, we design a Multi-scale Residual Pooling Module. In addition, a Category Feature Fusion Module is proposed to fuse category features at different stages to reduce the semantic gap between layers. Finally, a Frequency Channel Fusion Cross Attention Module is incorporated to reduce redundant semantic information during feature fusion. We conducted experiments on the DRIVE, CHASEDB1 and STARE datasets. The Dice and MIoU scores of the CFFANet network are 83.0 and 82.9 on DRIVE, 84.2 and 84.1 on CHASEDB1, and 84.1 and 84.5 on STARE, respectively. The ablation experiments also validate the effectiveness of the main modules in the network. The experiments show the value of the method in retinal vessel segmentation tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2024
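For reference, the two metrics reported above have the standard definitions below, where P is the predicted vessel mask and G the ground truth (MIoU averages IoU over classes):

```latex
\mathrm{Dice}(P,G) = \frac{2\,|P \cap G|}{|P| + |G|}, \qquad
\mathrm{IoU}(P,G) = \frac{|P \cap G|}{|P \cup G|}
```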
7. MDA-YOLO Person: a 2D human pose estimation model based on YOLO detection framework.
- Author
Dong, Chengang, Tang, Yuhao, and Zhang, Liyan
- Subjects
- BODY image, HUMAN body, POSE estimation (Computer vision), ARCHAEOLOGICAL human remains, PERSONAL names, DETECTORS
- Abstract
Human pose estimation aims to locate and predict the key points of the human body in images or videos. Due to the challenges of capturing complex spatial relationships and handling different body scales, accurate estimation of human pose remains difficult. Our work proposes a real-time human pose estimation method based on the anchor-assisted YOLOv7 framework, named MDA-YOLO Person. In this study, we propose the Keypoint Augmentation Strategies (KAS) to overcome the challenges faced in human pose estimation and improve the model's ability to accurately predict keypoints. Furthermore, we introduce the Anchor Adjustment Module (AAM) as a replacement for the original YOLOv7's detection head. By adjusting the parameters associated with the detector's anchors, we achieve an increased recall rate and enhance the completeness of the pose estimation. Additionally, we incorporate the Multi-Scale Dual-Head Attention (MDA) module, which effectively models the weights of both channel and spatial dimensions at multiple scales, enabling the model to focus on more salient feature information. As a result, our approach outperforms other methods, as demonstrated by the promising results obtained on two large-scale public datasets. MDA-YOLO Person outperforms the baseline model YOLOv7-pose on both MS COCO 2017 and CrowdPose datasets, with improvements of 2.2% and 3.7% in precision and recall on MS COCO 2017, and 1.9% and 3.5% on CrowdPose, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2024
8. Deep Learning for Describing Breast Ultrasound Images with BI-RADS Terms.
- Author
Carrilero-Mardones, Mikel, Parras-Jurado, Manuela, Nogales, Alberto, Pérez-Martín, Jorge, and Díez, Francisco Javier
- Abstract
Breast cancer is the most common cancer in women. Ultrasound is one of the most used techniques for diagnosis, but an expert in the field is necessary to interpret the test. Computer-aided diagnosis (CAD) systems aim to help physicians during this process. Experts use the Breast Imaging-Reporting and Data System (BI-RADS) to describe tumors according to several features (shape, margin, orientation...) and estimate their malignancy in a common language. To aid in tumor diagnosis with BI-RADS explanations, this paper presents a deep neural network for tumor detection, description, and classification. An expert radiologist described 749 nodules taken from public datasets using BI-RADS terms. The YOLO detection algorithm is used to obtain Regions of Interest (ROIs), and then a model, based on a multi-class classification architecture, receives as input each ROI and outputs the BI-RADS descriptors, the BI-RADS classification (with 6 categories), and a Boolean classification of malignancy. Six hundred of the nodules were used for 10-fold cross-validation (CV) and 149 for testing. The accuracy of this model was compared with state-of-the-art CNNs for the same task. This model outperforms plain classifiers in the agreement with the expert (Cohen's kappa), with a mean over the descriptors of 0.58 in CV and 0.64 in testing, while the second best model yielded kappas of 0.55 and 0.59, respectively. Adding YOLO to the model significantly enhances the performance (0.16 in CV and 0.09 in testing). More importantly, training the model with BI-RADS descriptors enables the explainability of the Boolean malignancy classification without reducing accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2024
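Cohen's kappa, used above to measure agreement with the expert, corrects raw accuracy for chance agreement: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed and p_e expected agreement. A minimal sketch with scikit-learn; the labels are hypothetical shape descriptors, not the paper's data.

```python
from sklearn.metrics import cohen_kappa_score

expert = ["oval", "round", "irregular", "oval", "round"]  # hypothetical BI-RADS labels
model = ["oval", "round", "oval", "oval", "round"]
print(cohen_kappa_score(expert, model))  # 1.0 = perfect agreement, 0.0 = chance level
```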
9. Extraction of entity relationships serving the field of agriculture food safety regulation.
- Author
Zhao, Zhihua, Liu, Yiming, Lv, Dongdong, Li, Ruixuan, Yu, Xudong, and Mao, Dianhui
- Abstract
Agriculture food (agri-food) safety is closely related to all aspects of people's lives. In recent years, with the emergence of deep learning technology based on big data, the extraction of information relations in the field of agri-food safety supervision has become a research hotspot. However, most of the current work only expands the relationship recognition based on the traditional named entity recognition task, which makes it difficult to establish a true 'connection' between entities and relationships. The pipelined and joint extraction architectures that have emerged in this area are problematic in practice. In addition, the contextual information of the text corpus in the agri-food safety regulatory domain has not been fully utilized. To address the above issues, this paper proposes a semi-joint entity relationship extraction model (EB-SJRE) based on contextual entity boundary features. Firstly, a Token pair subject-object correspondence matrix label is designed to intuitively model the subject-object boundary, which is more friendly to complex entities in the field of agri-food safety regulation. Secondly, the dynamic fine-tuning of BERT makes the text embedding more relevant to the textual context of the agri-food safety regulation domain. Finally, we introduce an attention mechanism in the Token pair tagging framework to capture deep semantic subject-object boundary association information, which cleverly solves the problem of exposure bias due to the pipeline structure and the dimensional explosion due to the joint extraction structure. The experimental results show that our model achieves the best F1-score of 88.71% on agri-food safety regulation domain data and F1-scores of 92.36%, 92.80%, 88.91%, and 92.21% on NYT, NYT-star, WebNLG, and WebNLG-star, respectively. This indicates that EB-SJRE has excellent generalization ability in both the agri-food safety regulatory and public sectors. [ABSTRACT FROM AUTHOR]
- Published
- 2024
10. A Kernel Attention-based Transformer Model for Survival Prediction of Heart Disease Patients.
- Author
Kaushal, Palak, Singh, Shailendra, and Vijayvergiya, Rajesh
- Abstract
Survival analysis is employed to scrutinize time-to-event data, with emphasis on comprehending the duration until the occurrence of a specific event. In this article, we introduce two novel survival prediction models: CosAttnSurv and CosAttnSurv + DyACT. CosAttnSurv model leverages transformer-based architecture and a softmax-free kernel attention mechanism for survival prediction. Our second model, CosAttnSurv + DyACT, enhances CosAttnSurv with Dynamic Adaptive Computation Time (DyACT) control, optimizing computation efficiency. The proposed models are validated using two public clinical datasets related to heart disease patients. When compared to other state-of-the-art models, our models demonstrated an enhanced discriminative and calibration performance. Furthermore, in comparison to other transformer architecture-based models, our proposed models demonstrate comparable performance while exhibiting significant reduction in both time and memory requirements. Overall, our models offer significant advancements in the field of survival analysis and emphasize the importance of computationally effective time-based predictions, with promising implications for medical decision-making and patient care. [ABSTRACT FROM AUTHOR]
- Published
- 2024
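The abstract does not give the exact softmax-free formulation, but the kernel (linear) attention family it belongs to replaces softmax(QK^T)V with a feature map phi, cutting time and memory from quadratic to linear in sequence length. A generic sketch under that assumption, with phi(x) = elu(x) + 1; this is not the authors' CosAttnSurv kernel.

```python
import torch
import torch.nn.functional as F

def kernel_attention(q, k, v, eps=1e-6):
    # q, k: (B, T, d), v: (B, T, e); phi keeps scores nonnegative without softmax
    phi = lambda x: F.elu(x) + 1
    q, k = phi(q), phi(k)
    kv = torch.einsum('btd,bte->bde', k, v)           # linear-time key/value summary
    z = 1.0 / (torch.einsum('btd,bd->bt', q, k.sum(dim=1)) + eps)
    return torch.einsum('btd,bde,bt->bte', q, kv, z)  # normalized attention output
```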
11. Distributed CV classification with attention mechanisms.
- Author
Chafi, Soumia, Kabil, Mustapha, and Kamouss, Abdessamad
- Subjects
- RECURRENT neural networks, ARTIFICIAL intelligence, DEEP learning, TRANSFORMER models, SENTIMENT analysis
- Abstract
Text classification is a crucial domain within natural language processing (NLP), with applications ranging from document categorization to sentiment analysis. In this context, the use of attention mechanisms in neural networks has emerged as an effective method to enhance the performance of classification models. This comparative study focuses on the application of these mechanisms to CV classification, adopting a distributed approach with Apache Spark to handle large datasets. We explore several neural network architectures, including recurrent neural networks (RNNs) and transformer neural networks, integrating various attention mechanisms such as global attention, contextual attention, and multi-head attention. The performance of these models is compared to traditional text classification methods such as SVMs, taking into account the scalability and processing speed offered by Spark. Experiments are conducted on a diverse dataset of CVs. Results show that models based on neural networks with attention mechanisms, combined with a distributed Spark architecture, significantly outperform traditional approaches. Additionally, we analyze the impact of various Spark configuration parameters, such as the number of nodes and allocated memory, on model performance. In conclusion, this study demonstrates the effectiveness of attention mechanisms in the specific context of CV classification, highlighting the advantages of a distributed approach for efficient processing of large textual datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2024
12. AMTLUS: Attention-guided multi-task learning with uncertainty estimation in skin lesion segmentation and classification.
- Author
Kasukurthi, Aravinda and Davuluri, Rajya Lakshmi
- Subjects
- FEATURE extraction, DERMOSCOPY, DIAGNOSIS, EARLY diagnosis, DEEP learning, MELANOMA
- Abstract
Skin lesion segmentation and classification from dermoscopic images have emerged as pivotal research topics, playing a vital role in the early detection and diagnosis of skin diseases, including melanoma. Previous studies have employed various deep learning models for skin lesion segmentation and classification, enabling the automatic learning of complex and discriminative features from dermoscopic images. However, inherent challenges arise due to the variance in skin lesion shape, size, and contrast, leading to intrinsic limitations of former models, such as Isolated Representation Learning, Uniform Attention, Limited Model Generalization, Reduced Model Interpretability, and Uncertainty. To address these limitations and propel the field forward, this paper introduces a novel framework called AMTLUS that leverages Multi-Task Learning (MTL) in conjunction with deep Attention Mechanisms and Uncertainty Estimation. The integration of MTL facilitates joint training of segmentation and classification tasks, enabling shared representation learning and efficient utilization of data. Incorporating attention mechanisms dynamically focuses on informative regions within dermoscopic images, improving segmentation accuracy and feature extraction for classification. Uncertainty estimation techniques quantify model confidence, offering probabilistic interpretations for improved reliability and interpretability. Our extensive experiments conducted on the ISIC-2016 dataset demonstrate superior accuracy and reliability, showcasing the proposed model's capability to identify challenging cases. This deep learning framework represents a significant advancement in automated skin lesion analysis, enhancing early detection and diagnosis of skin diseases, including melanoma. [ABSTRACT FROM AUTHOR]
- Published
- 2024
13. Lightweight enhanced YOLOv8n underwater object detection network for low light environments.
- Author
Ding, Jifeng, Hu, Junquan, Lin, Jiayuan, and Zhang, Xiaotong
- Subjects
- ATTENUATION of light, DATA augmentation, FEATURE extraction, ALGORITHMS, GENERALIZATION
- Abstract
In response to the challenges of target misidentification, missed detection, and other issues arising from severe light attenuation, low visibility, and complex environments in current underwater target detection, we propose a lightweight low-light underwater target detection network, named PDSC-YOLOv8n. Firstly, we enhance the input images using the improved Pro MSRCR algorithm for data augmentation. Secondly, we replace the traditional convolutions in the backbone and neck networks of YOLOv8n with Ghost and GSConv modules respectively to achieve lightweight network modeling. Additionally, we integrate the improved DCNv3 module into the C2f module of the backbone network to enhance the capability of target feature extraction. Furthermore, we introduce the GAM into the SPPF and incorporate the CBAM attention mechanism into the last layer of the backbone network to enhance the model's perceptual and generalization capabilities. Finally, we optimize the training process of the model using WIoUv3 as the loss function. The model is successfully deployed on an embedded platform, achieving real-time detection of underwater targets. We first conduct experiments on the RUOD underwater dataset and also utilize the Pascal VOC2012 dataset to evaluate our approach. On the RUOD dataset, the original YOLOv8n algorithm achieved 79.6% mAP@0.5 and 58.2% mAP@0.5:0.95, whereas PDSC-YOLOv8n reaches 86.1% and 60.8%. The number of parameters of the model is reduced by about 15.5%, and the detection accuracy is improved by 6.5%. On the Pascal VOC dataset, the original YOLOv8n algorithm achieved 73% mAP@0.5 and 53.2% mAP@0.5:0.95, while PDSC-YOLOv8n achieved 78.5% and 57%, respectively. The superior performance of PDSC-YOLOv8n indicates its effectiveness in the field of underwater target detection. [ABSTRACT FROM AUTHOR]
- Published
- 2024
14. MAPM: multiscale attention pre-training model for TextVQA.
- Author
Yang, Yue, Yu, Yue, and Li, Yingying
- Subjects
- LANGUAGE models, QUESTION answering systems
- Abstract
The Text Visual Question Answering (TextVQA) task aims to enable models to read and answer questions based on images with text. Existing attention-based methods for TextVQA tasks often face challenges in effectively aligning local features between modalities during multimodal information interaction. This misalignment hinders their performance in accurately answering questions based on images with text. To address this issue, the Multiscale Attention Pre-training Model (MAPM) is proposed to enhance multimodal feature fusion. MAPM introduces multiscale attention modules, which facilitate fine-grained local feature enhancement and global feature fusion across modalities. By adopting these modules, MAPM achieves superior performance in aligning and integrating visual and textual information. Additionally, MAPM benefits from being pre-trained with scene text, employing three pre-training tasks: masked language model, visual region matching, and OCR visual text matching. This pre-training process establishes effective semantic alignment relationships among different modalities. Experimental evaluations demonstrate the superiority of MAPM, achieving 1.2% higher accuracy compared to state-of-the-art models on the TextVQA dataset, especially when handling numerical data within images. In summary, MAPM enhances local fine-grained features (Joint Attention Module) and effectively addresses redundancy in global features (Global Attention Module) in the TextVQA task, and its three pre-training tasks enhance the model's expressive power and address the issue of cross-modal semantic alignment. [ABSTRACT FROM AUTHOR]
- Published
- 2024
15. Enhanced multi-view anomaly detection on attribute networks by truncated singular value decomposition.
- Author
Lee, Baozhen, Su, Yuwei, Kong, Qianwen, and Zhang, Tingting
- Abstract
In the field of attribute network anomaly detection, current research methodologies, such as reconstruction and contrastive learning, frequently face challenges including the minimal differentiation in embedding representations of normal and anomalous nodes, an excessive dependence on local information, and a susceptibility to noise from adjacent nodes. To overcome these limitations, this paper presents a novel approach: the Enhanced Multi-view Anomaly Detection on Attribute Networks by Truncated Singular Value Decomposition (EMTSVD) method. EMTSVD leverages TSVD to generate improved views of both attributes and structures. Through the use of a low-rank approximation matrix, EMTSVD effectively filters out noise and isolates critical structural and attribute information. This isolated information is subsequently incorporated into the node embedding representations, significantly enhancing the differentiation between normal and anomalous nodes. Moreover, EMTSVD employs an attention mechanism to integrate multiple views, effectively minimizing spatial feature redundancy and further diminishing the effects of noise disturbances. Empirical evidence highlights EMTSVD's adeptness at accurately identifying essential node information within networks. By bolstering the distinction in embedding representations between normal and anomalous nodes, EMTSVD markedly advances the precision of anomaly detection in attribute networks. [ABSTRACT FROM AUTHOR]
- Published
- 2024
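The low-rank filtering step described above can be illustrated in a few lines of NumPy: keeping only the top-k singular values yields the best rank-k approximation of the attribute (or adjacency) matrix and discards the noisy tail. A minimal sketch of the TSVD idea, not EMTSVD itself; the matrix and k are illustrative.

```python
import numpy as np

def truncated_svd_denoise(a, k):
    # Best rank-k approximation (Eckart-Young): noise concentrates in tail singular values
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k]

x = np.random.rand(100, 32)                # e.g. node-by-attribute matrix
x_clean = truncated_svd_denoise(x, k=8)    # improved view fed to embedding learning
```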
16. AncientGlyphNet: an advanced deep learning framework for detecting ancient Chinese characters in complex scene.
- Author
Qi, Hengnian, Yang, Hao, Wang, Zhaojiang, Ye, Jiabin, Xin, Qiuyi, Zhang, Chu, and Lang, Qing
- Abstract
Detecting ancient Chinese characters in various media, including stone inscriptions, calligraphy, and couplets, is challenging due to the complex backgrounds and diverse styles. This study proposes an advanced deep-learning framework for detecting ancient Chinese characters in complex scenes to improve detection accuracy. First, the framework introduces an Ancient Character Haar Wavelet Transform downsampling block (ACHaar), effectively reducing feature maps’ spatial resolution while preserving key ancient character features. Second, a Glyph Focus Module (GFM) is introduced, utilizing attention mechanisms to enhance the processing of deep semantic information and generating ancient character feature maps that emphasize horizontal and vertical features through a four-path parallel strategy. Third, a Character Contour Refinement Layer (CCRL) is incorporated to sharpen the edges of characters. Additionally, to train and validate the model, a dedicated dataset was constructed, named Huzhou University-Ancient Chinese Character Dataset for Complex Scenes (HUSAM-SinoCDCS), comprising images of stone inscriptions, calligraphy, and couplets. Experimental results demonstrated that the proposed method outperforms previous text detection methods on the HUSAM-SinoCDCS dataset, with accuracy improved by 1.36–92.84%, recall improved by 2.24–85.61%, and F1 score improved by 1.84–89.08%. This research contributes to digitizing ancient Chinese character artifacts and literature, promoting the inheritance and dissemination of traditional Chinese character culture. The source code and the HUSAM-SinoCDCS dataset can be accessed at and . [ABSTRACT FROM AUTHOR]
- Published
- 2025
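A 2D Haar wavelet transform downsamples by 2x while keeping all information as four sub-bands (a low-frequency average plus three detail bands), which is presumably why a Haar-based block preserves character strokes better than strided convolution. A minimal PyTorch sketch of one Haar level; this is an illustration, not the paper's ACHaar module.

```python
import torch

def haar_downsample(x):
    # x: (B, C, H, W) with even H and W -> (B, 4C, H/2, W/2); lossless 2x downsampling
    a, b = x[..., ::2, ::2], x[..., ::2, 1::2]
    c, d = x[..., 1::2, ::2], x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2  # low-frequency average
    lh = (a - b + c - d) / 2  # horizontal detail
    hl = (a + b - c - d) / 2  # vertical detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return torch.cat([ll, lh, hl, hh], dim=1)
```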
17. Research on image captioning using dilated convolution ResNet and attention mechanism: H. Li et al.
- Author
Li, Haisheng, Yuan, Rongrong, Li, Qiuyi, and Hu, Cong
- Abstract
Image captioning, which refers to generating a textual description of the image content from a given image, has been recognized as a key problem in visual-to-linguistic tasks. In this work, we introduce dilated convolution to increase the perceptual field, which can better capture an image’s details and contextual information and extract richer image features. A sparse multilayer perceptron is introduced and combined with an attention mechanism to enhance the extraction of detailed features and attention to essential feature regions, thus improving the network’s expressive ability and feature selection. Furthermore, the residual squeeze-and-excitation module is added to help the model better understand the image content, thus improving the accuracy of the image captioning task. However, the main challenge is achieving high accuracy in capturing both local and global image features simultaneously while maintaining model efficiency and reducing computational costs. The experimental results on the Flickr8k and Flickr30k datasets show that our proposed method has improved the generation accuracy and diversity, which can better capture image features and improve the accuracy of generated captions. [ABSTRACT FROM AUTHOR]
- Published
- 2025
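The receptive-field effect of dilation mentioned above is easy to see in PyTorch: a 3x3 kernel with dilation d covers a (2d+1)x(2d+1) window at the cost of a plain 3x3 convolution. The channel counts below are illustrative, not the paper's configuration.

```python
import torch.nn as nn

# 3x3 conv with dilation 2: 5x5 receptive field, same parameter count as a dense 3x3.
# padding=dilation keeps the spatial resolution unchanged.
dilated = nn.Conv2d(256, 256, kernel_size=3, dilation=2, padding=2)
```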
18. Enhancing semantic scene segmentation for indoor autonomous systems using advanced attention-supported improved UNet.
- Author
Tran, Hoang N., Nguyen, Nghi V., Le, Nhi Q. P., Nguyen, Nam N. N., Le, Thu A. N., and Nguyen, Vinh D.
- Abstract
This paper introduces EFFB7-UNet, an advanced semantic segmentation framework tailored for Indoor Autonomous Vision Systems (IAVSs) utilizing the U-Net architecture. The framework employs EfficientNetB4 as its encoder, significantly enhancing feature extraction. It integrates a spatial and channel Squeeze-and-Excitation (scSE) attention block, emphasizing critical areas and features to refine segmentation outcomes. Comprehensive evaluations using the NYUv2 Dataset and various augmented datasets were conducted. This study systematically compares EFFB7-UNet’s performance with multiple U-Net encoders, including ResNet50, ResNet101, MobileNet V2, VGG16, VGG19, and EfficientNets B0-B6. The findings reveal that EFFB7-UNet not only surpasses these configurations in terms of accuracy but also highlights the effectiveness of the scSE attention block in achieving superior segmentation results. Without the utilization of depth information, EFFB7-UNet achieves a 12% improvement in mean Intersection over Union (mIOU). This significant enhancement demonstrates EFFB7-UNet’s adaptability across various domains, implying substantial progress in enhancing the effectiveness and reliability of Intelligent Autonomous Vision Systems (IAVS) technologies. [ABSTRACT FROM AUTHOR]
- Published
- 2025
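A common formulation of the scSE block used above (Roy et al.) runs channel and spatial squeeze-and-excitation in parallel and sums the two recalibrated maps. A compact PyTorch sketch; the reduction ratio is an assumed hyperparameter, and this may differ in detail from the paper's block.

```python
import torch.nn as nn

class SCSE(nn.Module):
    # Concurrent channel (cSE) and spatial (sSE) squeeze-and-excitation
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.sse = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        # Channel gate rescales whole feature maps; spatial gate rescales pixels
        return x * self.cse(x) + x * self.sse(x)
```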
19. Dual-attention transformer-based hybrid network for multi-modal medical image segmentation.
- Author
Zhang, Menghui, Zhang, Yuchen, Liu, Shuaibing, Han, Yahui, Cao, Honggang, and Qiao, Bingbing
- Subjects
- CONVOLUTIONAL neural networks, COMPUTER-assisted image analysis (Medicine), ARTIFICIAL intelligence, TRANSFORMER models, DIAGNOSTIC imaging
- Abstract
Accurate medical image segmentation plays a vital role in clinical practice. Convolutional Neural Networks and Transformers are mainstream architectures for this task. However, convolutional neural networks lack the ability to model global dependency, while Transformers cannot extract local details. In this paper, we propose DATTNet (Dual ATTention Network), an encoder-decoder deep learning model for medical image segmentation. DATTNet is organized in a hierarchical fashion with two novel components: (1) a Dual Attention module is designed to model global dependency in the spatial and channel dimensions, and (2) a Context Fusion Bridge is presented to remix the feature maps at multiple scales and construct their correlations. Experiments on the ACDC, Synapse and Kvasir-SEG datasets are conducted to evaluate the performance of DATTNet. Our proposed model shows superior performance, effectiveness and robustness compared to SOTA methods, with mean Dice Similarity Coefficient scores of 92.2%, 84.5% and 89.1% on the cardiac, abdominal organ and gastrointestinal polyp segmentation tasks. The quantitative and qualitative results demonstrate that our proposed DATTNet attains favorable capability across different modalities (MRI, CT, and endoscopy) and can be generalized to various tasks. Therefore, it is envisaged as having potential for practical clinical applications. The code has been released at https://github.com/MhZhang123/DATTNet/tree/main. [ABSTRACT FROM AUTHOR]
- Published
- 2024
20. Ea-yolo: efficient extraction and aggregation mechanism of YOLO for fire detection.
- Author
Wang, Dongmei, Qian, Ying, Lu, Jingyi, Wang, Peng, Yang, Dandi, and Yan, Tianhong
- Abstract
Fire detection involves challenges such as variable feature morphology, complex backgrounds, dense targets, small dataset sizes, and class imbalance, which lead to low accuracy and poor real-time performance in existing fire detection models. We propose EA-YOLO, a flame and smoke detection model based on efficient multi-scale feature enhancement. In order to improve the network's ability to extract flame and smoke features, an efficient attention mechanism, Multi Channel Attention (MCA), is integrated into the backbone network, and the number of parameters of the model is reduced by introducing the RepVB module; at the same time, we design a multi-weighted, multidirectional feature neck structure called the Multidirectional Feature Pyramid Network (MDFPN) to enhance the model's ability to fuse flame and smoke target feature information; finally, we redesign the CIoU loss function by introducing the Slide weighting function to improve the imbalance between difficult and easy samples. Additionally, to address the issue of small sample sizes in fire datasets, we establish two new fire datasets: Fire-smoke and Ro-fire-smoke. The latter includes a model robustness validation function. The experimental results show that the method of this paper is 6.5% and 7.3% higher than the baseline model YOLOv7 on the Fire-smoke and Ro-fire-smoke datasets, respectively. The detection speed is 74.6 frames per second. To fully demonstrate the superiority of EA-YOLO, we utilized the public FASDD dataset and compared several state-of-the-art (SOTA) models with EA-YOLO on this dataset; the results were highly favorable. This fully demonstrates that the method in this paper achieves high fire detection accuracy while preserving real-time performance. The source code and datasets are located at . [ABSTRACT FROM AUTHOR]
- Published
- 2024
21. DFGPD: a new distillation framework with global and positional distillation.
- Author
Su, Weixing, Wang, Haoyu, Liu, Fang, and Li, Linfeng
- Abstract
Knowledge distillation is a commonly used method for model compression that has been widely utilized in various computer vision tasks. Many efforts have utilized attention mechanisms to guide the student networks during training, encouraging them to mimic the important features of the teacher. However, most of these efforts use either the channel attention map or the spatial attention map to guide the student, ignoring the importance of positional features. In this paper, we propose a new distillation framework transferring global and positional features (DFGPD), which consists of three parts: global and positional distillation, a generic teacher framework and a two-stage distillation method. DFGPD takes positional information into consideration for a more effective distillation process. We conduct extensive comparison experiments, ablation studies, and sensitivity studies to demonstrate the effectiveness and stability of DFGPD. Our results show that (1) DFGPD achieves comparable or even better performance compared to state-of-the-art methods; (2) DFGPD can alleviate the bigger-models-not-always-better-teachers issue to a certain extent. [ABSTRACT FROM AUTHOR]
- Published
- 2024
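As background for the attention-guided distillation discussed above, the widely used attention-transfer loss (Zagoruyko & Komodakis) matches normalized spatial attention maps between teacher and student; DFGPD extends this direction with positional features. A sketch of the baseline loss only, not DFGPD itself.

```python
import torch
import torch.nn.functional as F

def attention_map(feat):
    # Spatial attention map: mean of squared activations over channels, L2-normalized
    a = feat.pow(2).mean(dim=1)          # (B, H, W)
    return F.normalize(a.flatten(1), dim=1)

def attention_transfer_loss(student_feat, teacher_feat):
    # Student is pushed to reproduce where the teacher "looks" in each feature map
    diff = attention_map(student_feat) - attention_map(teacher_feat)
    return diff.pow(2).sum(dim=1).mean()
```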
22. HCA-YOLO: a non-salient object detection method based on hierarchical attention mechanism.
- Author
Dong, Chengang, Tang, Yuhao, Zhu, Hanyue, and Zhang, Liyan
- Subjects
- OBJECT recognition (Computer vision), DEEP learning, VIDEOS
- Abstract
The objective of deep learning-based object detection is to accurately localize and recognize objects of interest from images or videos using neural networks. However, the detection and localization of non-salient objects pose challenges due to their small proportions, low contrast, and occlusion in images. To address this, we propose an improved object detection method, namely hierarchical coordinate attention (HCA)-YOLO, based on the YOLOv8 architecture. Specifically, we enhance the model's attention towards non-salient objects by introducing HCA, building upon the optimized YOLOv8 baseline. Additionally, we propose a novel object regression loss metric, β-VIoU, to improve YOLOv8's perception of non-salient object positions. Our method achieves competitive results on multiple metrics with two widely adopted open-source datasets, MS COCO 2017 and CrowdHuman. Compared to the YOLOv8x baseline model, HCA-YOLO improves the average precision (mAP) by 3.3% and 3.7% on these two datasets, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2024
23. The triple attention transformer: advancing contextual coherence in transformer models.
- Author
Ghaith, Shadi
- Abstract
This paper introduces the Triple Attention Transformer (TAT), a transformative approach in transformer models, tailored for enhancing long-term contextual coherence in dialogue systems. TAT innovates by representing dialogues as chunks of sequences, coupled with a triple attention mechanism. This novel architecture enables TAT to effectively manage extended sequences, addressing the coherence challenges inherent in traditional transformer models. Empirical evaluations using the Schema-Guided Dialogue Dataset from DSTC8 demonstrate TAT's enhanced performance, with significant improvements in Character Error Rate, Word Error Rate, and BLEU score. Importantly, TAT excels in generating coherent, extended dialogues, showcasing its advanced contextual comprehension. The integration of Conv1D networks, dual-level positional encoding, and decayed attention weighting are pivotal to TAT's robust context management. The paper also highlights the BERT variant of TAT, which leverages pre-trained language models to further enrich dialogue understanding and generation capabilities. Future developments include refining attention mechanisms, improving role distinction, and architectural optimizations. TAT's applicability extends to various complex NLP tasks, affirming its potential as a pioneering advancement in natural language processing. [ABSTRACT FROM AUTHOR]
- Published
- 2024
24. Attention-driven YOLOv5 for wildfire smoke detection from camera images.
- Author
Vaidya, Himadri, Gupta, Akansha, and Ghanshala, Kamal Kumar
- Abstract
Wildfires are serious hazards for the environment, and WFSD (Wildfire Smoke Detection) is a challenge for ensuring optimal response and mitigation efforts. Hence, this study proposes an attention-based YOLOv5 (You Only Look Once) network for detecting smoke instances within video frames, incorporating ECA (Efficient Channel Attention), GAM (Global Attention Module), and CA (Coordinate Attention). Here, an open-source wildfire smoke dataset divided into train, validation, and test sets is used for experimentation. The comprehensive research and evaluations show that the incorporation of attention mechanisms successfully enhances the accuracy and robustness of the YOLOv5 model for WFSD. Among the attention modules, GAM proves the most effective, attaining a 95% F1 score on the dataset. This research demonstrates the impact of attention mechanisms on object detection in the context of wildfire smoke. The findings of the research paper contribute to improving the capabilities of deep learning models for emergency response and environmental monitoring. The proposed methodology not only outperforms regular YOLOv5 but also sets up a benchmark for future research on WFSD. [ABSTRACT FROM AUTHOR]
- Published
- 2024
25. GHA-Inst: a real-time instance segmentation model utilizing YOLO detection framework.
- Author
Dong, Chengang, Tang, Yuhao, and Zhang, Liyan
- Subjects
- DEEP learning, NECK, NOISE, VIDEOS
- Abstract
The real-time instance segmentation task based on deep learning aims to accurately identify and distinguish all instance objects from images or videos. However, due to the existence of problems such as mutual occlusion between instances, limitations in model receptive fields, etc., achieving accurate and real-time segmentation continues to pose a formidable challenge. To alleviate the aforementioned issues, this paper proposes a real-time instance segmentation method based on a dual-branch structure, called GHA-Inst. Specifically, we made improvements to the feature fusion module (Neck) and output end (Head) of the YOLOv7-seg real-time instance segmentation framework to mitigate the accuracy reduction caused by feature loss and reduce the interference of background noise on the model. Secondly, we introduced a Global Hybrid-Domain Attention (GHA) module to improve the model's focus on significant information while retaining more original spatial features, alleviate incomplete segmentation caused by instance occlusion, and improve the quality of generated masks. Finally, our method achieved competitive results on multiple metrics of the MS COCO 2017 and KINS open-source datasets. Compared with the YOLOv7-seg baseline model, GHA-Inst improved the average precision (AP) by 3.4% and 2.6% on the two datasets, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2024
26. Exploiting recurrent graph neural networks for suffix prediction in predictive monitoring.
- Author
Rama-Maneiro, Efrén, Vidal, Juan C., Lama, Manuel, and Monteagudo-Lago, Pablo
- Subjects
- RECURRENT neural networks, HEURISTIC, HEURISTIC algorithms, PROCESS mining, SEARCH algorithms
- Abstract
Predictive monitoring is a subfield of process mining that aims to predict how a running case will unfold in the future. One of its main challenges is forecasting the sequence of activities that will occur from a given point in time —suffix prediction—. Most approaches to the suffix prediction problem learn to predict the suffix by learning how to predict the next activity only, while also disregarding structural information present in the process model. This paper proposes a novel architecture based on an encoder-decoder model with an attention mechanism that decouples the representation learning of the prefixes from the inference phase, predicting only the activities of the suffix. During the inference phase, this architecture is extended with a heuristic search algorithm that selects the most probable suffix according to both the structural information extracted from the process model and the information extracted from the log. Our approach has been tested using 12 public event logs against 6 different state-of-the-art proposals, showing that it significantly outperforms these proposals. [ABSTRACT FROM AUTHOR]
- Published
- 2024
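The inference-phase idea above, scoring whole candidate suffixes rather than greedily taking the next activity, can be illustrated with a generic beam search over a next-activity distribution. This sketch assumes a hypothetical `step_probs(seq)` callback returning next-activity probabilities; it is a stand-in for the authors' heuristic search, which additionally folds in process-model structure.

```python
import heapq
import math

def beam_search_suffix(step_probs, prefix, beam=5, max_len=20, eos="[EOS]"):
    # Keep the `beam` lowest negative-log-likelihood continuations at each step
    beams = [(0.0, list(prefix))]
    for _ in range(max_len):
        candidates = []
        for nll, seq in beams:
            if seq and seq[-1] == eos:        # finished suffixes carry over unchanged
                candidates.append((nll, seq))
                continue
            for act, p in step_probs(seq).items():
                candidates.append((nll - math.log(p + 1e-12), seq + [act]))
        beams = heapq.nsmallest(beam, candidates, key=lambda t: t[0])
    return beams[0][1][len(prefix):]          # most probable predicted suffix
```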
27. CF-YOLO: a capable forest fire identification algorithm founded on YOLOv7 improvement.
- Author
Liu, Wanjie, Shen, Zirui, and Xu, Sheng
- Abstract
Forest fire is an ecological catastrophe with great damage and rapid spread, which inflicts significant damage upon the ecological balance of forests and poses a threat to human well-being. Given the current problems of low forest fire recognition accuracy and weak local detection, we study an improved forest fire detection algorithm, Catch Fire YOLO (CF-YOLO), based on the YOLOv7 model. In global information processing, the plug-and-play coordinate attention mechanism is introduced into the YOLOv7 model, which enhances the visual depiction of the receptive field while aggregating features along different spatial directions to improve the depiction of the focal interest. We convert the three parallel max-pooling operations in the SPPCSPC module of the Neck to a serial mode, where the output of each pooling is used as the next pooling's input. In local information processing, we design a feature fusion module to replace part of the efficient layer aggregation network (ELAN), so that the network further improves the detection accuracy while speeding up the calculation. The proposed model was trained and verified on a forest fire dataset; the experimental results demonstrate improved detection capability, especially for small targets, and show that the model can meet the requirements of edge deployment in forest fire scenarios. [ABSTRACT FROM AUTHOR]
- Published
- 2024
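The parallel-to-serial max-pooling change described above mirrors the well-known SPP-to-SPPF refactoring in the YOLO family: chaining 5x5 pools reproduces the 5/9/13 receptive fields of three parallel pools while reusing intermediate results. A minimal sketch; the kernel size and concatenation layout are assumptions.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)

def serial_spp(x):
    # Each pooling output feeds the next: y2 acts like a 9x9 pool, y3 like a 13x13 pool
    y1 = pool(x)
    y2 = pool(y1)
    y3 = pool(y2)
    return torch.cat([x, y1, y2, y3], dim=1)
```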
28. HDA-pose: a real-time 2D human pose estimation method based on modified YOLOv8.
- Author
Dong, Chengang, Tang, Yuhao, and Zhang, Liyan
- Abstract
2D human pose estimation aims to accurately regress the keypoints of the human body from images or videos. However, it remains challenging due to the occlusion and intersection among multiple individuals and the difficulty of dealing with different body scales. In order to better tackle these issues, we propose a human pose estimation framework named HDA-Pose. By improving the real-time framework of YOLOv8, we achieve simultaneous regression of all individuals' keypoint locations in the image. Specifically, we propose the High-Grade Dual Attention (HDA) module to further enhance the focus of YOLOv8 on important features of individuals in the image. Additionally, we improve the original data augmentation strategy in YOLOv8 to better simulate cases where key points of individuals are occluded in the image. Lastly, we introduce a novel regression loss metric, Vertex Intersection over Union, to further enhance the effectiveness of the model in multi-person pose estimation. Our approach attains competitive results on multiple metrics of two open-source datasets, MS COCO 2017 and CrowdPose. Compared with the baseline model YOLOv8x-pose, HDA-Pose improves the average precision by 2.9% and 3.3% on the two datasets, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2024
29. A method for image–text matching based on semantic filtering and adaptive adjustment.
- Author
Jin, Ran, Hou, Tengda, Jin, Tao, Yuan, Jie, and Du, Chenjie
- Subjects
- ADAPTIVE filters, COMPUTER vision, VISUAL fields
- Abstract
As image–text matching (a critical task in the field of computer vision) links cross-modal data, it has captured extensive attention. Most of the existing methods intended for matching images and texts explore the local similarity levels between images and sentences to align images with texts. Even though this fine-grained approach achieves remarkable gains, how to further mine the deep semantics between data pairs and focus on the essential semantics in data remains an open question. In this work, a new semantic filtering and adaptive approach (FAAR) was proposed to ease the above problem. To be specific, the filtered attention (FA) module selectively focuses on typical alignments with the interference of meaningless comparisons eliminated. Next, the adaptive regulator (AR) further adjusts the attention weights of key segments for filtered regions and words. The superiority of our proposed method was validated by a number of qualitative experiments and analyses on the Flickr30K and MSCOCO data sets. [ABSTRACT FROM AUTHOR]
- Published
- 2024
30. Multimodal fake news detection through intra-modality feature aggregation and inter-modality semantic fusion.
- Author
Zhu, Peican, Hua, Jiaheng, Tang, Keke, Tian, Jiwei, Xu, Jiwei, and Cui, Xiaodong
- Subjects
- FAKE news, MISINFORMATION, USER-generated content, SOCIAL media
- Abstract
The prevalence of online misinformation, termed "fake news", has exponentially escalated in recent years. Such deceptive information, often rich in multimodal content, can easily mislead individuals into spreading it via various social media platforms. This has made the automatic detection of multimodal fake news a hot research topic. Existing works have made great progress on inter-modality feature fusion or semantic interaction yet largely ignore the importance of intra-modality entities and feature aggregation. This imbalance causes them to perform erratically on data with different emphases. In the realm of authentic news, the intra-modality contents and the inter-modality relationship should be in mutually supportive relationships. Inspired by this idea, we propose an innovative approach to multimodal fake news detection (IFIS), incorporating both intra-modality feature aggregation and inter-modality semantic fusion. Specifically, the proposed model implements an entity detection module and utilizes attention mechanisms for intra-modality feature aggregation, whereas inter-modality semantic fusion is accomplished via two concurrent Co-attention blocks. The performance of IFIS is extensively tested on two datasets, namely Weibo and Twitter, and has demonstrated superior performance, surpassing various advanced methods by 0.6. The experimental results validate the capability of our proposed approach in offering the most balanced performance for multimodal fake news detection tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2024
31. A multi-scale feature extraction and fusion method for bearing fault diagnosis based on hybrid attention mechanism.
- Author
Meng, Huan, Zhang, Jiakai, Zhao, Jingbo, and Wang, Daichao
- Abstract
Bearing failure is one of the most common failures in rotating machinery. Therefore, rapid and accurate diagnosis of bearing faults is of great significance for ensuring the reliability of equipment. In recent years, researchers have overlooked the correlation and complementarity between different sources of information, which has limited the accuracy and robustness of fault diagnosis. This paper proposes a multi-scale feature extraction and fusion method for bearing fault diagnosis. By extracting features from data at different scales, the method can comprehensively perceive the overall information of the signal and accurately capture bearing fault characteristics. Moreover, an improved CBAM mechanism is introduced to automatically adjust the weights of feature maps, enhancing the discriminability and anti-interference ability of the features. The effectiveness of the proposed method is verified on the Paderborn bearing dataset. The results show that the diagnostic accuracy of the proposed method can reach 99.43%, which is significantly better than other methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
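The improved CBAM above builds on the standard Convolutional Block Attention Module, which reweights features along the channel axis and then the spatial axis. A compact PyTorch sketch of vanilla CBAM for orientation; the paper's improvement is not reproduced here.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    # Channel attention (shared MLP over avg- and max-pooled vectors), then spatial attention
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel gate from global average- and max-pooled statistics
        w = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        x = x * w.view(b, c, 1, 1)
        # Spatial gate from channel-wise average and max maps
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```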
32. Salient Object Detection in Optical Remote Sensing Images Based on Global Context Mixed Attention.
- Author
Yan, Longquan, Yan, Ruixiang, Geng, Guohua, Zhou, Mingquan, and Chen, Rong
- Abstract
Optical remote sensing images exhibit complex characteristics such as high density, multiscale, and multi-angle features, posing significant challenges in the field of salient object detection. This academic exposition introduces an integrated model customized for the precise detection of salient objects in optical remote sensing images, presenting a comprehensive solution. At the core of this model lies a feature aggregation module based on the concept of hybrid attention. This module orchestrates the gradual fusion of multi-layer feature maps, thereby reducing information loss encountered during traversal of the inherent skip connections in the U-shaped architecture. Notably, this framework integrates a dual-channel attention mechanism, cleverly leveraging the spatial contours of salient regions within optical remote sensing images to enhance the efficiency of the proposed module. By implementing a hybrid loss function, the overall approach is further strengthened, facilitating multifaceted supervision during the network training phase, covering considerations at the pixel-level, region-level, and statistical levels. Through a series of comprehensive experiments, the effectiveness and robustness of the proposed method are validated, undergoing rigorous evaluation on two widely accessed benchmark datasets, meticulously catering to optical remote sensing scenarios. It is evident that our method exhibits certain advantages relative to other methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
33. Attention-based color consistency underwater image enhancement network.
- Author
Chang, Baocai, Li, Jinjiang, Wang, Haiyang, and Li, Mengjun
- Abstract
Underwater images often exhibit color deviation, reduced contrast, distortion, and other issues due to light refraction, scattering, and absorption. Therefore, restoring detailed information in underwater images and obtaining high-quality results are primary objectives in underwater image enhancement tasks. Recently, deep learning-based methods have shown promising results, but handling details in low-light underwater image processing remains challenging. In this paper, we propose an attention-based color consistency underwater image enhancement network. The method consists of three components: illumination detail network, balance stretch module, and prediction learning module. The illumination detail network is responsible for generating the texture structure and detail information of the image. We introduce a novel color restoration module to better match color and content feature information, maintaining color consistency. The balance stretch module compensates using pixel mean and maximum values, adaptively adjusting color distribution. Finally, the prediction learning module facilitates context feature interaction to obtain a reliable and effective underwater enhancement model. Experiments conducted on three real underwater datasets demonstrate that our approach produces more natural enhanced images, performing well compared to state-of-the-art methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
34. YOLOv8-CML: a lightweight target detection method for color-changing melon ripening in intelligent agriculture.
- Author
Chen, Guojun, Hou, Yongjie, Cui, Tao, Li, Huihui, Shangguan, Fengyang, and Cao, Lei
- Subjects
- MELONS, AGRICULTURE, DECORATION & ornament, SPINE, RECOGNITION (Psychology), COMPUTATIONAL neuroscience
- Abstract
Color-changing melon is an ornamental and edible fruit. Aiming at the problems of slow detection speed and high deployment cost for Color-changing melon in intelligent agriculture equipment, this study proposes a lightweight detection model, YOLOv8-CML. Firstly, a lightweight Faster-Block is introduced to reduce the number of memory accesses while reducing redundant computation, and a lighter C2f structure is obtained. Then, the lightweight C2f module fusing the EMA module is constructed in the Backbone to collect multi-scale spatial information more efficiently and reduce the interference of complex background on the recognition effect. Next, the idea of shared parameters is utilized to redesign the detection head to simplify the model further. Finally, the α-IoU loss function is adopted to better measure the overlap between the predicted and real frames using the α hyperparameter, improving the recognition accuracy. The experimental results show that compared to the YOLOv8n model, the parameter count and computation of the improved YOLOv8-CML model decreased by 42.9% and 51.8%, respectively. In addition, the model size is only 3.7 MB, and the inference speed is improved by 6.9%, while mAP@0.5, accuracy, and FPS are also improved. Our proposed model provides a vital reference for deploying Color-changing melon picking robots. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
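The α-IoU loss mentioned in the entry above power-transforms the IoU term with the hyperparameter α. A minimal sketch of the basic form follows; the paper builds on a CIoU-style penalty that is omitted here, and α = 3 is merely the value commonly suggested in the α-IoU literature.

```python
import torch

def alpha_iou_loss(pred, target, alpha=3.0, eps=1e-7):
    # pred, target: (N, 4) boxes as (x1, y1, x2, y2).
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    return (1 - iou.pow(alpha)).mean()   # alpha > 1 up-weights high-IoU boxes
```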
35. DSSE-net: dual stream skip edge-enhanced network with forgery loss for image forgery localization.
- Author
-
Zheng, Aokun, Huang, Tianqiang, Huang, Wei, Huang, Liqing, Ye, Feng, and Luo, Haifeng
- Abstract
As deep learning has continuously made breakthroughs in computer vision, the Image Forgery Localization (IFL) task has also begun to use deep learning frameworks. Currently, most deep learning-based IFL methods use binary cross-entropy as the loss function during training. However, the number of tampered pixels in a forged image is significantly smaller than the number of real pixels. This imbalance biases the model toward classifying pixels as real during training, reducing the F1 score. In this paper, we therefore propose a loss function for the IFL task: Forgery Loss. Forgery Loss assigns weight to the classification loss of tampered pixels and edges, strengthens the tampered-pixel constraints in the model, and amplifies the importance of difficult-to-classify samples. These enhancements help the model learn more informative features and consequently improve its F1 score. Additionally, we design an end-to-end, pixel-level detection network, DSSE-Net. It comprises a dual-stream codec network that extracts high-level and low-level image features, and an edge attention stream. The edge attention stream has an Edge Attention Module that strengthens the network's attention to high-frequency image edges and, in conjunction with the edge enhancement algorithm in Forgery Loss, improves the model's ability to detect tampered edges. Experiments demonstrate that Forgery Loss effectively improves the F1 score, while DSSE-Net outperforms current SOTA algorithms in accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
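The Forgery Loss in the entry above weights tampered pixels and edges and emphasizes hard samples, but the abstract gives no formula. The sketch below combines a weighted BCE with a focal-style modulation as one plausible reading; the weights w_tamper and w_edge and the exponent gamma are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def forgery_loss(logits, mask, edge, w_tamper=5.0, w_edge=2.0, gamma=2.0):
    # mask: 1 marks tampered pixels; edge: 1 marks tampered-region boundaries.
    p = torch.sigmoid(logits)
    weight = 1.0 + (w_tamper - 1.0) * mask + w_edge * edge   # up-weight rare pixels/edges
    focal = (mask * (1 - p) + (1 - mask) * p).pow(gamma)     # emphasize hard samples
    bce = F.binary_cross_entropy_with_logits(logits, mask, reduction="none")
    return (weight * focal * bce).mean()
```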
36. TGIE4REC: enhancing session-based recommendation with transition and global information.
- Author
-
Gao, Shiwei, Wang, Jingyu, Zeng, Yufeng, and Dong, Xiaohui
- Subjects
- *
GRAPH neural networks - Abstract
Predicting the next most likely interactive item based on the current session is the goal of session-based recommendation (SBR). To model adjacent item-transition information from previous session sequences and the current session sequence, the most advanced SBR techniques use graph neural networks and attention mechanisms. Position-aware attention incorporates reversed position information of each item to learn its importance in the session when generating the session representation. However, these methods have certain drawbacks. First, using data from previous sessions always introduces uncorrelated items (noise). Second, reverse position coding makes it challenging to learn the sequential transition relations between items in the session sequence. This study presents a novel SBR technique called TGIE4Rec. Specifically, TGIE4Rec learns two levels of session embedding: global information enhanced session embedding and transition information enhanced session embedding. The global information enhanced session representation learning layer employs information from other sessions together with the current session to learn global-level session embeddings. The transition information enhanced session representation learning layer employs the items of the current session to learn new session embeddings and integrates time information into the item representations for neighbor embedding learning, further strengthening the sequential transition relations in the session sequence. Experiments on three benchmark datasets demonstrate that TGIE4Rec is superior to other advanced methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
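For context on the position-aware attention that the entry above builds on (and whose reversed position coding it criticizes), here is a minimal PyTorch sketch of a standard attention readout over session item embeddings; the layer sizes and the reversed position indexing follow common SBR practice and are assumptions, not TGIE4Rec itself.

```python
import torch
import torch.nn as nn

class PositionAwareReadout(nn.Module):
    # Attention pooling over session item embeddings with reversed
    # position embeddings, as in common SBR session readouts.
    def __init__(self, dim, max_len=50):
        super().__init__()
        self.pos = nn.Embedding(max_len, dim)
        self.w = nn.Linear(2 * dim, dim)
        self.q = nn.Linear(dim, 1, bias=False)

    def forward(self, items):                  # items: (B, L, D)
        B, L, D = items.shape
        idx = torch.arange(L - 1, -1, -1, device=items.device)  # reversed positions
        pos = self.pos(idx).expand(B, L, D)
        h = torch.tanh(self.w(torch.cat([items, pos], dim=-1)))
        a = torch.softmax(self.q(h), dim=1)    # (B, L, 1) per-item importance
        return (a * items).sum(dim=1)          # session embedding
```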
37. Higher efficient YOLOv7: a one-stage method for non-salient object detection.
- Author
-
Dong, Chengang, Tang, Yuhao, and Zhang, Liyan
- Abstract
Compared to the remarkable progress within the discipline of object detection in recent years, real-time detection of non-salient objects remains a challenging research task. Most existing detection methods fail to adequately extract the global features of targets, leading to suboptimal performance on non-salient objects. In this paper, we propose a unified framework called Higher efficient (He)-YOLOv7 to enhance the detection capability of YOLOv7 for non-salient objects. Firstly, we introduce a refined Squeeze and Excitation Network (SENet) to dynamically adjust the weights of feature channels, thereby enhancing the model's perception of non-salient objects. Secondly, we design an Angle Intersection over Union (AIoU) loss function that considers relative positional information, optimizing the widely used Complete Intersection over Union (CIoU) loss function in YOLOv7 and significantly accelerating the model's convergence. Moreover, He-YOLOv7 adopts a blended data augmentation strategy to simulate occlusion among objects, further improving the model's ability to filter out noise and enhancing its robustness. Experimental results demonstrate a significant improvement of 2.4% mean Average Precision (mAP) on the Microsoft Common Objects in Context (MS COCO) dataset and a notable improvement of 1.2% mAP on the PASCAL VOC dataset. At the same time, our approach performs comparably to state-of-the-art real-time object detection methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
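The refined SENet in the entry above starts from the standard Squeeze-and-Excitation block, sketched below in PyTorch for reference; the refinement itself is not detailed in the abstract, so only the vanilla block is shown.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze-and-Excitation: global pooling -> bottleneck MLP -> channel gates.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                           # x: (B, C, H, W)
        w = self.fc(x).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1) gates
        return x * w                                # reweight feature channels
```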
38. Cross-modal pedestrian re-recognition based on attention mechanism.
- Author
-
Zhao, Yuyao, Zhou, Hang, Cheng, Hai, and Huang, Chunguang
- Subjects
- *
PEDESTRIANS, *VISIBLE spectra, *NETWORK performance, *RESEARCH personnel, *ATTENTION - Abstract
Person re-identification, an essential research direction in intelligent security, has attracted broad attention from researchers. In practical scenarios, visible-light cameras depend heavily on lighting conditions and have limited detection capability in poor light, so many scholars have shifted their focus to cross-modality person re-identification. However, relevant studies remain few, and resolving the differences between images of different modalities is still challenging. To solve these problems, this paper uses an attention-based approach to narrow the gap between the two modalities and guide the network in a more appropriate direction, improving its recognition performance. Although attention-based methods can improve training efficiency, they can easily make model training unstable. This paper proposes a cross-modal pedestrian re-recognition method based on the attention mechanism. A new attention mechanism module is designed that allows the network to focus on the more critical features of a person in less time. In addition, a cross-modality hard center triplet loss is designed to better supervise model training. Extensive experiments with both methods on two publicly available datasets show better performance than comparable current methods and verify the effectiveness and feasibility of the proposed approach. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
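The cross-modality hard center triplet loss in the entry above is not given in closed form in the abstract. The sketch below shows one plausible center-based variant: same-identity centers from the two modalities are pulled together while the hardest cross-modal negative center is pushed away. It assumes each identity appears in both modalities in the batch (PK-style sampling); the margin is illustrative.

```python
import torch
import torch.nn.functional as F

def cross_modal_center_triplet(vis, ir, labels, margin=0.3):
    # vis, ir: (B, D) features from visible / infrared images; labels: (B,).
    centers_v, centers_i = [], []
    for c in labels.unique():
        centers_v.append(vis[labels == c].mean(dim=0))
        centers_i.append(ir[labels == c].mean(dim=0))
    cv, ci = torch.stack(centers_v), torch.stack(centers_i)   # (C, D) centers
    pos = (cv - ci).norm(dim=1)                               # same identity
    dist = torch.cdist(cv, ci)                                # cross-modal pairs
    mask = torch.eye(dist.size(0), device=dist.device) * 1e9  # hide the diagonal
    hard_neg = (dist + mask).min(dim=1).values                # hardest negatives
    return F.relu(pos - hard_neg + margin).mean()
```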
39. Harmonious Mutual Learning for Facial Emotion Recognition.
- Author
-
Gan, Yanling, Xu, Luhui, Xia, Haiying, and Liu, Gan
- Abstract
Facial emotion recognition in the wild is an important task in computer vision, but it remains challenging due to the influence of backgrounds, occlusions, and illumination variations in facial images, as well as the ambiguity of expressions. This paper proposes a harmonious mutual learning framework for emotion recognition, mainly by utilizing attention mechanisms and probability distributions without additional information. Specifically, we build an architecture with two emotion recognition networks that cooperate and interact progressively. We first integrate a self-mutual attention module into the backbone to learn discriminative features against the influence of emotion-irrelevant facial information. In this process, we deploy a spatial attention module and a convolutional block attention module for the two networks respectively, guiding enhanced and complementary attention learning. Further, in the classification head, we propose to learn the latent ground-truth emotion probability distributions using a softmax function with temperature to characterize expression ambiguity. On this basis, a probability distribution distillation learning module performs class semantic interaction using a bi-directional KL loss, allowing mutual calibration between the two networks. Experimental results on three public datasets show the superiority of the proposed method over state-of-the-art ones. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
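The distribution distillation step in the entry above (temperature softmax plus bi-directional KL between the two networks) can be sketched compactly; the temperature value below is an assumption.

```python
import torch.nn.functional as F

def mutual_distillation(logits_a, logits_b, T=2.0):
    # Soften both networks' emotion distributions with temperature T, then
    # apply a symmetric KL so each network calibrates the other.
    pa = F.softmax(logits_a / T, dim=-1)
    pb = F.softmax(logits_b / T, dim=-1)
    kl_ab = F.kl_div(pa.log(), pb, reduction="batchmean")  # KL(pb || pa)
    kl_ba = F.kl_div(pb.log(), pa, reduction="batchmean")  # KL(pa || pb)
    return (kl_ab + kl_ba) * T * T   # usual T^2 scaling for soft targets
```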
40. Multiscale dilated denoising convolution with channel attention mechanism for micro-seismic signal denoising.
- Author
-
Cai, Jianxian, Duan, Zhijun, Wang, Li, Meng, Juan, and Yao, Zhenjing
- Subjects
SIGNAL denoising, STANDARD deviations, SEISMIC waves, SIGNAL-to-noise ratio, SEISMIC response - Abstract
Denoising micro-seismic signals is paramount for ensuring reliable data for localizing mining-related seismic events and analyzing the state of rock masses during mining operations. However, micro-seismic signals are commonly contaminated by various types of complex noise, which can hinder accurate P-wave picking and analysis. In this study, we propose the Multiscale Dilated Convolutional Attention denoising method, referred to as MSDCAN, to eliminate complex noise interference. The MSDCAN denoising model consists of an encoder, an improved attention mechanism, and a decoder. To effectively capture the neighborhood and multiscale features of the micro-seismic signal, we construct an initial dilated convolution block and a multiscale dilated convolution block in the encoder; the encoder focuses on extracting the relevant feature information, eliminating noise interference and improving the signal-to-noise ratio (SNR). In addition, an improved attention mechanism is introduced between the encoder and decoder to emphasize the key features of the micro-seismic signal, removing complex noise and further improving denoising performance. The MSDCAN denoising model is trained and evaluated using micro-seismic data from Stanford University. Experimental results demonstrate an increase in SNR of 11.237 dB and a reduction in root mean square error (RMSE) of 0.802. Compared to the DeepDenoiser, CNN-denoiser, and Neighbor2Neighbor methods, the MSDCAN denoising model enhances the SNR by 2.589 dB, 1.584 dB, and 2 dB, respectively, and reduces the RMSE by 0.219, 0.050, and 0.188, respectively. The MSDCAN denoising model effectively improves the SNR of micro-seismic signals and offers fresh insights into micro-seismic signal denoising methodologies. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
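A multiscale dilated convolution block of the kind used in the MSDCAN encoder above can be sketched as parallel 1-D dilated convolutions over the signal; the channel count and dilation rates below are assumptions.

```python
import torch
import torch.nn as nn

class MultiscaleDilatedBlock(nn.Module):
    # Parallel dilated 1-D convolutions capture waveform context at several
    # temporal scales; branch outputs are concatenated and fused.
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations)
        self.fuse = nn.Conv1d(channels * len(dilations), channels, 1)

    def forward(self, x):                 # x: (B, C, T) signal features
        y = torch.cat([b(x) for b in self.branches], dim=1)
        return torch.relu(self.fuse(y))
```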
41. Enhanced predictive modeling of hot rolling work roll wear using TCN-LSTM-Attention.
- Author
-
Hu, Xiaoke, Zhou, Xiaomin, Liu, Hongfei, Song, Hechuan, Wang, Shuaikun, and Zhang, Hongjia
- Subjects
- *
HOT rolling, *WORK clothes, *FEATURE selection, *TIME series analysis, *HOT working - Abstract
During the hot rolling process, the work rolls suffer severe wear, resulting in a relatively short lifespan. Severe roll wear can adversely affect the strip shape, while incorporating roll wear into the crown calculation model can enhance its accuracy. It is therefore crucial to quantify roll wear during the rolling process. Roll wear is a nonlinear time series, and existing mechanistic models of work roll wear are not highly accurate. In this paper, a novel prediction model for work roll wear based on TCN-LSTM-Attention is developed. The TCN uses convolutional structures over local and global information to extract data features, while the LSTM captures long-term and more complex sequence patterns, effectively handling nonlinear characteristics in the data. With the incorporation of attention mechanisms, the model becomes more adept at capturing relationships among different segments of the input sequence, which significantly improves predictive performance and reduces the risk of overfitting. Firstly, outlier cleaning and feature selection are performed using Boruta to construct the data set. Then, the predictions of the proposed model are compared with existing time series prediction models. The results indicate that TCN-LSTM-Attention has the highest prediction accuracy, with an R2 of 0.989 and an RMSE of 0.0082 μm. Finally, the predicted work roll wear is combined with the mechanistic model to correct the strip crown pre-calculation model, which significantly improves its calculation accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
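The TCN-LSTM-Attention pipeline in the entry above can be outlined as dilated temporal convolutions feeding an LSTM whose hidden states are attention-pooled into a single wear prediction; the layer widths below are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TCNLSTMAttention(nn.Module):
    # TCN extracts local/global features, LSTM captures long sequences,
    # attention weights time steps before the regression head.
    def __init__(self, n_feats, hidden=64):
        super().__init__()
        self.tcn = nn.Sequential(
            nn.Conv1d(n_feats, hidden, 3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=2, dilation=2), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (B, T, n_feats) rolling records
        h = self.tcn(x.transpose(1, 2)).transpose(1, 2)   # (B, T, hidden)
        h, _ = self.lstm(h)
        a = torch.softmax(self.attn(h), dim=1)            # per-step weights
        return self.head((a * h).sum(dim=1)).squeeze(-1)  # predicted wear
```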
42. Relational reasoning and adaptive fusion for visual question answering.
- Author
-
Shen, Xiang, Han, Dezhi, Zong, Liang, Guo, Zihan, and Hua, Jie
- Subjects
PSEUDOPOTENTIAL method - Abstract
Visual relationship modeling plays an indispensable role in visual question answering (VQA). VQA models need to fully understand the visual scene, and in particular the positional relationships within the image, to answer complex reasoning questions involving visual object relationships. Accurate reasoning about, and understanding of, the relationships between different visual objects is particularly crucial. However, most reasoning models in current VQA tasks use only simple attention mechanisms to model visual object relationships and overlook the rich visual object features available during learning. This work proposes an effective visual object Relationship Reasoning and Adaptive Fusion (RRAF) model to address these shortcomings. RRAF can simultaneously model the position, appearance, and semantic features of visual objects and uses an adaptive fusion mechanism to achieve fine-grained multimodal reasoning and fusion. Specifically, we design an effective image encoder to model and learn the relationship between the position and appearance features of visual objects. In addition, in the co-attention module, we employ semantic information from the question to focus on critical visual objects. Finally, we use an adaptive fusion mechanism to reassign weights and fuse features of different modalities to effectively predict the answer. Experimental results show that RRAF outperforms current state-of-the-art methods on the VQA 2.0 and GQA datasets, especially on visual object counting problems. Extensive ablation experiments further demonstrate the effectiveness of the RRAF model, which achieves overall accuracies of 71.33% and 57.83% on VQA 2.0 and GQA, respectively. Code is available at https://github.com/shenxiang-vqa/RRAF. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
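The adaptive fusion step in the entry above reassigns weights across modalities before answer prediction. A minimal gated-fusion sketch follows; the single sigmoid gate is an assumed simplification of RRAF's mechanism.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    # Learn a per-sample, per-dimension gate that redistributes weight
    # between pooled visual and question features.
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, v, q):              # v, q: (B, D) pooled features
        g = self.gate(torch.cat([v, q], dim=-1))
        return g * v + (1 - g) * q        # adaptively weighted joint feature
```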
43. SiamMaskAttn: inverted residual attention block fusing multi-scale feature information for multitask visual object tracking networks.
- Author
-
Bian, Xiaofeng and Guo, Chenggang
- Abstract
Multitask learning combining visual object tracking with other computer vision tasks has received increasing attention from researchers. Among such methods, the SiamMask algorithm accomplishes both object tracking and object segmentation by utilizing a Siamese backbone network and a three-branch regression head. The mask refinement branch is the core innovation of SiamMask, hierarchically integrating the features of the search region with the tracking correlation score maps. However, SiamMask and its subsequent improvements do not fully integrate the target semantic information contained in multi-scale features into the mask refinement branch. To address this problem, a module named the inverted residual attention block is proposed, combining the inverted residual structure with a channel attention mechanism. The channel attention mechanism effectively enhances the key information of the object and suppresses background noise by assigning weights to the feature channels output by different convolution kernels, thereby better handling the motion and deformation of the tracked object. Based on the proposed module and a spatial attention mechanism, a novel multi-scale feature fusion method for the search region and tracking correlation score maps is proposed. The spatial attention mechanism helps the network focus on the region where the object is located and reduces sensitivity to background interference, improving tracking accuracy and stability. Under the same hardware and datasets, ablation experiments show that the proposed improvements to the mask refinement branch are effective. Compared with the baseline SiamMask, the proposed method achieves comparable segmentation results on the DAVIS datasets with improved speed. The expected average overlap on VOT-2018 increases by 3.7%. The total number of parameters is reduced by 6.6%, including a 53.2% reduction in the mask refinement branch. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
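The inverted residual attention block in the entry above combines an inverted residual structure with channel attention. The sketch below composes a MobileNetV2-style expand/depthwise stage with an SE-style gate; the expansion factor and gate reduction are assumptions.

```python
import torch.nn as nn

class InvertedResidualAttention(nn.Module):
    # Expand -> depthwise conv -> channel-attention gate -> project, with a
    # residual connection back to the input.
    def __init__(self, channels, expand=4):
        super().__init__()
        mid = channels * expand
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid), nn.ReLU6(inplace=True))
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(mid, mid // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid // 4, mid, 1), nn.Sigmoid())
        self.project = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        y = self.block(x)
        return x + self.project(y * self.gate(y))   # attention-gated residual
```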
44. NDAM-YOLOseg: a real-time instance segmentation model based on multi-head attention mechanism.
- Author
-
Dong, Chengang, Tang, Yuhao, and Zhang, Liyan
- Abstract
The primary objective of deep learning-based instance segmentation is to achieve accurate segmentation of individual objects in input images or videos. However, challenges such as feature loss from down-sampling operations, as well as occlusion, deformation, and complex backgrounds, impede the precise delineation of object instance boundaries. To address these challenges, we introduce a novel visual attention network called the Normalized Deep Attention Mechanism (NDAM) into the YOLOv8-seg instance segmentation model, proposing a real-time instance segmentation method named NDAM-YOLOseg. Specifically, we optimize the feature processing of YOLOv8-seg to mitigate the accuracy degradation caused by information loss. Additionally, we introduce the NDAM to sharpen the model's focus on pivotal information, further improving segmentation accuracy. Furthermore, a Boundary Refinement Module (BRM) is introduced to refine instance boundaries, improving the quality of mask generation. Our proposed method demonstrates competitive performance on multiple evaluation metrics across two widely used benchmark datasets, MS COCO 2017 and KINS. In comparison to the baseline model YOLOv8x-seg, NDAM-YOLOseg achieves noteworthy improvements of 2.4% and 2.5% in Average Precision (AP) on these datasets, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
45. RAMFAE: a novel unsupervised visual anomaly detection method based on autoencoder.
- Author
-
Sun, Zhongju, Wang, Jian, and Li, Yakun
- Abstract
Traditional reconstruction-based visual anomaly detection methods often train an autoencoder on normal data and then use metric-distance detection to estimate whether test samples belong to the anomaly class. However, the autoencoder tends to produce blurry reconstructions, causing false detections at normal pixels. Moreover, owing to its large capacity, the autoencoder may still fully reconstruct unseen defects even when trained only on normal samples, and metric-distance detection ignores key local information. To solve these problems, this paper proposes the random anomaly multi-scale feature focused autoencoder (RAMFAE), an unsupervised visual anomaly detection technique incorporating three novel concepts. First, a multi-scale feature focused extraction (MFFE) network structure is designed and inserted between the encoder and decoder, which effectively resolves blurry reconstruction and improves the model's sensitivity to normal regions. Second, this article employs Delete Paste, a novel data augmentation strategy that generates two types of random anomalies by pasting a cut patch into a random location while filling the pixels at its original position with zeros. Even when given anomalous inputs, this strategy enables the model to produce normal images, avoiding anomaly reconstruction, and allows defects to be localized from the error between the measured and reconstructed images. Third, the study adopts an image quality assessment combining gradient magnitude similarity deviation (GMSD) and structural similarity (SSIM), so that key local information and texture details receive more attention, and alleviates the training pressure introduced by Delete Paste augmentation. We perform an extensive evaluation on the challenging MVTec AD dataset and compare against recent advanced visual anomaly detection methods. RAMFAE reaches a final AUC of 94.5, which is 3.6, 2.5, and 0.8 points higher than the advanced IGD, FCDD, and RIAD methods, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
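The Delete Paste augmentation described in the entry above cuts a patch, pastes it at a random location, and zero-fills the source region. A minimal sketch for a single image tensor follows; the fixed square patch size is an assumption, and the paper's second anomaly type is not shown.

```python
import torch

def delete_paste(img: torch.Tensor, size: int = 32) -> torch.Tensor:
    # img: (C, H, W) with H, W > size. Returns a synthetic random anomaly.
    img = img.clone()
    _, H, W = img.shape
    ys = torch.randint(0, H - size, (1,)).item()   # source corner
    xs = torch.randint(0, W - size, (1,)).item()
    yd = torch.randint(0, H - size, (1,)).item()   # destination corner
    xd = torch.randint(0, W - size, (1,)).item()
    patch = img[:, ys:ys + size, xs:xs + size].clone()
    img[:, yd:yd + size, xd:xd + size] = patch     # paste
    img[:, ys:ys + size, xs:xs + size] = 0.0       # delete: zero-fill source
    return img
```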
46. Improving Human Pose Estimation Based on Stacked Hourglass Network.
- Author
-
Zou, Xuelian, Bi, Xiaojun, and Yu, Changdong
- Subjects
POSE estimation (Computer vision), DEEP learning, PROBLEM solving, HUMAN beings - Abstract
The performance of multi-person pose estimation has been greatly improved by the rapid development of deep learning. However, the problems of self-occlusion, mutual occlusion, and complex backgrounds have not yet been effectively solved. To address these problems, we design a novel Global and Local Content-aware Feature Boosting Network (GLCFBNet) that includes an Intra-layer Feature Residual-like Module (IFRM), an Input Feature Aggregation Module (IFAM), and a Spatial and Channel Feature Hourglass Attention Module (SCFHAM). The proposed IFRM expands the receptive field of each convolution layer through feature aggregation. The IFAM fully extracts the edge information of the input image and effectively mitigates the negative impact of the background. The SCFHAM accurately locates occluded keypoints, assesses the global information of plausible keypoints, and extracts effective features for joint localization from redundant feature information. We evaluate the effectiveness of our proposed method on the MSCOCO keypoint detection dataset, the MPII Human Pose dataset, and the CrowdPose dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
47. DoubleU-NetPlus: a novel attention and context-guided dual U-Net with multi-scale residual feature fusion network for semantic segmentation of medical images.
- Author
-
Ahmed, Md. Rayhan, Ashrafi, Adnan Ferdous, Ahmed, Raihan Uddin, Shatabda, Swakkhar, Islam, A. K. M. Muzahidul, and Islam, Salekul
- Subjects
- *
COMPUTER-assisted image analysis (Medicine), *IMAGE segmentation, *DIAGNOSTIC imaging, *COMPUTER-aided diagnosis, *MULTICASTING (Computer networks), *MULTISPECTRAL imaging - Abstract
Accurate segmentation of the region of interest in medical images can provide an essential pathway for devising effective treatment plans for life-threatening diseases. It is still challenging for U-Net and its modern state-of-the-art variants to effectively model the higher-level output feature maps of the convolutional units of the network, mostly due to the varying scales of the region of interest, intricate context environments, ambiguous boundaries, and the multiformity of textures in medical images. In this paper, we exploit multi-contextual features and several attention strategies to increase the network's ability to model discriminative feature representations for more accurate medical image segmentation, and we present a novel dual-stacked U-Net-based architecture named DoubleU-NetPlus. The DoubleU-NetPlus incorporates several architectural modifications. In particular, we integrate EfficientNetB7 as the feature encoder module, a newly designed multi-kernel residual convolution module, and an adaptive feature re-calibrating attention-based atrous spatial pyramid pooling module to progressively and precisely accumulate discriminative multi-scale high-level contextual feature maps and emphasize the salient regions. In addition, we introduce a novel triple attention gate module and a hybrid triple attention module to encourage selective modeling of relevant medical image features. Moreover, to mitigate the gradient vanishing issue while incorporating high-resolution features with deeper spatial details, the standard convolution operation is replaced with attention-guided residual convolution operations, enabling the model to obtain effective and relevant feature maps from a significantly increased network depth. Empirical results confirm that the proposed model achieves superior semantic segmentation performance compared to other state-of-the-art approaches on six publicly available benchmark datasets of diverse modalities, with Dice scores of 85.17%, 99.34%, 94.30%, 96.40%, 95.76%, and 97.10% on the DRIVE, LUNA, BUSI, CVC-ClinicDB, 2018 DSB, and ISBI 2012 datasets, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
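As background for the attention gates in the entry above, here is a minimal PyTorch sketch of a standard attention gate on a U-Net skip connection (the paper's triple attention gate is more elaborate); it assumes the gating signal has already been resized to the skip feature's resolution.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    # The decoder signal g suppresses irrelevant encoder skip features x
    # before they are concatenated into the decoder.
    def __init__(self, ch_x, ch_g, ch_mid):
        super().__init__()
        self.wx = nn.Conv2d(ch_x, ch_mid, 1)
        self.wg = nn.Conv2d(ch_g, ch_mid, 1)
        self.psi = nn.Sequential(nn.Conv2d(ch_mid, 1, 1), nn.Sigmoid())

    def forward(self, x, g):              # x, g share spatial size here
        a = self.psi(torch.relu(self.wx(x) + self.wg(g)))  # (B, 1, H, W)
        return x * a                      # re-calibrated skip connection
```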
48. Making attention mechanisms more robust and interpretable with virtual adversarial training.
- Author
-
Kitada, Shunsuke and Iyatomi, Hitoshi
- Subjects
DEEP learning, SUPERVISED learning, SECURE Sockets Layer (Computer network protocol), PREDICTION models, AUTOMATIC speech recognition - Abstract
Although attention mechanisms have become fundamental components of deep learning models, they are vulnerable to perturbations that may degrade prediction performance and model interpretability. Adversarial training (AT) for attention mechanisms has successfully reduced such drawbacks by considering adversarial perturbations. However, this technique requires label information, and thus its use is limited to supervised settings. In this study, we explore the concept of incorporating virtual AT (VAT) into attention mechanisms, by which adversarial perturbations can be computed even from unlabeled data. To realize this approach, we propose two general training techniques, namely VAT for attention mechanisms (Attention VAT) and "interpretable" VAT for attention mechanisms (Attention iVAT), which extend AT for attention mechanisms to a semi-supervised setting. In particular, Attention iVAT focuses on differences in attention; it can thus efficiently learn clearer attention and improve model interpretability, even with unlabeled data. Empirical experiments on six public datasets revealed that our techniques provide better prediction performance than conventional AT-based and VAT-based techniques, and stronger agreement with human-provided evidence for important words in sentences. Moreover, our proposals offer these advantages without requiring careful selection of unlabeled data: even if a model using our VAT-based technique is trained on unlabeled data from a source other than the target task, both prediction performance and model interpretability improve. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
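The core VAT computation in the entry above needs no labels: a power-iteration step finds the perturbation of the attention logits that most changes the output distribution, and that change is penalized. The sketch below assumes a hypothetical `forward(attn_logits)` hook that maps (possibly perturbed) attention logits to class logits; xi and eps are standard VAT hyperparameters, not values from the paper.

```python
import torch
import torch.nn.functional as F

def attention_vat_loss(forward, attn_logits, xi=1e-6, eps=1.0, n_power=1):
    # forward: differentiable map from attention logits to class logits
    # (hypothetical interface standing in for the full model).
    with torch.no_grad():
        p = F.softmax(forward(attn_logits), dim=-1)        # clean prediction
    d = torch.randn_like(attn_logits)                      # random direction
    for _ in range(n_power):                               # power iteration
        d = (xi * F.normalize(d.flatten(1), dim=1).view_as(d)).requires_grad_(True)
        q = F.log_softmax(forward(attn_logits + d), dim=-1)
        adv = F.kl_div(q, p, reduction="batchmean")
        d = torch.autograd.grad(adv, d)[0].detach()        # steepest direction
    r_adv = eps * F.normalize(d.flatten(1), dim=1).view_as(d)
    q = F.log_softmax(forward(attn_logits + r_adv), dim=-1)
    return F.kl_div(q, p, reduction="batchmean")           # smoothness penalty
```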
49. When attention is not enough to unveil a text's author profile: Enhancing a transformer with a wide branch.
- Author
-
López-Santillán, Roberto, González, Luis C., Montes-y-Gómez, Manuel, and López-Monroy, A. Pastor
- Subjects
- *
TRANSFORMER models, *SOCIAL media, *ARTIFICIAL neural networks, *DEEP learning, *NATURAL language processing, *MACHINE learning, *SPANISH language - Abstract
Author profiling (AP) is a highly relevant natural language processing (NLP) problem; it deals with predicting author attributes such as gender, age, and personality traits by analyzing texts written by the authors themselves, for instance books, articles, and more recently posts on social media platforms. In the present study, we focus on the latter, a scenario with numerous applications in marketing, security, health, and other areas. Surprisingly, given the achievements of deep learning (DL) on other NLP tasks, DL architectures regularly underperform on AP, lagging behind classical machine learning (ML) approaches. In this study, we show how a transformer-based deep learning architecture offers competitive results by exploiting a joint-intermediate fusion strategy called the Wide & Deep Transformer (WD-T). Our methodology fuses contextualized word vector representations with handcrafted features using a self-attention mechanism and a novel encoding technique that incorporates stylistic, topic, and personal information from authors, enabling more accurate, fine-grained predictions. Our approach attained competitive performance against top-quartile results from the 2017–2019 editions of the Plagiarism analysis, Authorship identification, and Near-duplicate detection forum (PAN) in English and Spanish for gender and language-variety prediction, and on the Kaggle Myers-Briggs Type Indicator (MBTI) dataset for personality forecasting. Our proposal consistently surpasses all other deep learning methods on the PAN collections by as much as 2.4%, and by up to 3.4% on the MBTI dataset. These results suggest that this DL strategy effectively addresses the limitations of previous techniques and paves the way for new avenues of inquiry. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
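The joint-intermediate fusion in the entry above concatenates a contextual (deep) vector with handcrafted (wide) features before classification. A minimal sketch follows; the hidden size and dropout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WideDeepFusion(nn.Module):
    # Deep branch: pooled transformer representation of the text.
    # Wide branch: handcrafted stylistic/topic/personal features.
    def __init__(self, deep_dim, wide_dim, n_classes, hidden=256):
        super().__init__()
        self.joint = nn.Sequential(
            nn.Linear(deep_dim + wide_dim, hidden), nn.ReLU(), nn.Dropout(0.1))
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, deep_vec, wide_feats):   # (B, deep_dim), (B, wide_dim)
        z = self.joint(torch.cat([deep_vec, wide_feats], dim=-1))
        return self.out(z)                     # author-profile logits
```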
50. Continuous sign language recognition based on iterative alignment network and attention mechanism.
- Author
-
Xue, Cuihong, Yu, Ming, Yan, Gang, Gao, Yang, and Liu, Yuehao
- Subjects
SIGN language, SUPERVISED learning, PROBLEM solving - Abstract
The biggest challenge in continuous sign language recognition is the weak supervision provided by sentence-level sign language labels. This paper proposes a continuous sign language recognition framework based on an iterative alignment network and an attention mechanism to address this problem. The iterative alignment network uses a spatial-temporal residual network (STRN) to extract block-level features, a temporal convolutional module (TCM) to enhance the temporal correlation between block-level features, and a bidirectional gated recurrent unit network (BGRU) with connectionist temporal classification (CTC) to generate pseudo-labels for each block-level feature. These pseudo-labels in turn provide strong supervision for the STRN and TCM, whose parameters are optimized before CTC learns a new mapping from the optimized parameters in the next iteration. The word-level features generated by the iterative alignment network are then input into an attention-based encoder-decoder network, whose attention module attends to the relevant time steps of the input feature sequence during decoding to obtain more accurate results. The method is evaluated on three large-scale continuous sign language datasets (RWTH-Phoenix-Weather 2014, CSL, and CSL-Daily), and the experimental results demonstrate its effectiveness. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
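The pseudo-labels in the entry above come from CTC alignment. The sketch below shows greedy best-path decoding: the argmax path assigns one label per block, and collapsing repeats while dropping blanks yields the sentence hypothesis; the blank index 0 is an assumption.

```python
import torch

def ctc_best_path(log_probs, blank=0):
    # log_probs: (T, B, V) block-level log-probabilities over the gloss
    # vocabulary. Returns the per-block pseudo-label path and the collapsed
    # sentence hypotheses.
    path = log_probs.argmax(dim=-1)            # (T, B) one label per block
    sentences = []
    for b in range(path.size(1)):
        seq, prev = [], blank
        for t in path[:, b].tolist():
            if t != prev and t != blank:       # collapse repeats, drop blanks
                seq.append(t)
            prev = t
        sentences.append(seq)
    return path, sentences
```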