246 results for "vision transformer (ViT)"
Search Results
2. Attention-based deep learning for tire defect detection: Fusing local and global features in an industrial case study
- Author
- Saleh, Radhwan A.A. and Ertunç, H. Metin
- Published
- 2025
- Full Text
- View/download PDF
3. A survey of FPGA and ASIC designs for transformer inference acceleration and optimization
- Author
- Kang, Beom Jin, Lee, Hae In, Yoon, Seok Kyu, Kim, Young Chan, Jeong, Sang Beom, O, Seong Jun, and Kim, Hyun
- Published
- 2024
- Full Text
- View/download PDF
4. AD-Lite Net: A Lightweight and Concatenated CNN Model for Alzheimer’s Detection from MRI Images
- Author
- Roy, Santanu, Gupta, Archit, Tiwari, Shubhi, and Sahu, Palak
- Published
- 2025
- Full Text
- View/download PDF
5. A Cascading Approach with Vision Transformers for Age-Related Macular Degeneration Diagnosis and Explainability
- Author
- Osa-Sanchez, Ainhoa, Balaha, Hossam Magdy, Ali, Mahmoud, Abdelrahim, Mostafa, Khudri, Mohmaed, Garcia-Zapirain, Begonya, and El-Baz, Ayman
- Published
- 2025
- Full Text
- View/download PDF
6. A hybrid framework for plant leaf disease detection and classification using convolutional neural networks and vision transformer.
- Author
- Aboelenin, Sherihan, Elbasheer, Foriaa Ahmed, Eltoukhy, Mohamed Meselhy, El-Hady, Walaa M., and Hosny, Khalid M.
- Abstract
Recently, scientists have widely utilized Artificial Intelligence (AI) approaches in intelligent agriculture to increase the productivity of the agriculture sector and overcome a wide range of problems. Detection and classification of plant diseases is a challenging problem due to the vast numbers of plants worldwide and the numerous diseases that negatively affect the production of different crops. Early detection and accurate classification of plant diseases is the goal of any AI-based system. This paper proposes a hybrid framework to significantly improve classification accuracy for plant leaf diseases. The proposed model leverages the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViT), where an ensemble model, which consists of the well-known CNN architectures VGG16, Inception-V3, and DenseNet20, is used to extract robust global features. Then, a ViT model is used to extract local features to detect plant diseases precisely. The performance of the proposed model is evaluated using two publicly available datasets (Apple and Corn). Each dataset consists of four classes. The proposed hybrid model successfully detects and classifies multi-class plant leaf diseases and outperforms similar recently published methods, achieving accuracy rates of 99.24% and 98% for the apple and corn datasets, respectively. [ABSTRACT FROM AUTHOR] (A minimal sketch of this CNN-ensemble + ViT fusion pattern follows this entry.)
- Published
- 2025
- Full Text
- View/download PDF
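The hybrid design above pairs an ensemble of CNN feature extractors with a ViT branch. A minimal sketch of that fusion pattern, assuming timm backbones and a simple concatenation head; the `HybridCnnVit` class name and wiring are illustrative assumptions, not the authors' exact model:

```python
# Illustrative CNN-ensemble + ViT feature fusion (a sketch, not the paper's model).
import timm
import torch
import torch.nn as nn

class HybridCnnVit(nn.Module):  # hypothetical name
    def __init__(self, num_classes: int = 4):
        super().__init__()
        # num_classes=0 makes timm models return pooled features, not logits.
        self.cnns = nn.ModuleList([
            timm.create_model(name, pretrained=False, num_classes=0)
            for name in ("vgg16", "inception_v3", "densenet201")
        ])
        self.vit = timm.create_model("vit_base_patch16_224",
                                     pretrained=False, num_classes=0)
        feat_dim = sum(m.num_features for m in self.cnns) + self.vit.num_features
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):  # x: (B, 3, 224, 224)
        feats = [m(x) for m in self.cnns]   # global CNN features
        feats.append(self.vit(x))           # ViT features
        return self.head(torch.cat(feats, dim=1))

logits = HybridCnnVit()(torch.randn(2, 3, 224, 224))  # -> shape (2, 4)
```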
7. ViT-Based Face Diagnosis Images Analysis for Schizophrenia Detection.
- Author
- Liu, Huilin, Cao, Runmin, Li, Songze, Wang, Yifan, Zhang, Xiaohan, Xu, Hua, Sun, Xirong, Wang, Lijuan, Qian, Peng, Sun, Zhumei, Gao, Kai, and Li, Fufeng
- Subjects
- TRANSFORMER models, CHINESE medicine, MAGNETIC resonance imaging, IMAGE analysis, PATIENT compliance
- Abstract
Objectives: Computer-aided schizophrenia (SZ) detection methods mainly depend on electroencephalograms and brain magnetic resonance images, both of which capture physical signals from patients' brains. These inspection techniques take too much time, affect patients' compliance and cooperation, and make it difficult for clinicians to comprehend the principle behind detection decisions. This study proposes a novel method using face diagnosis images based on traditional Chinese medicine principles, providing a non-invasive, efficient, and interpretable alternative for SZ detection. Methods: An innovative face diagnosis image analysis method for SZ detection is proposed, which learns feature representations directly from face diagnosis images based on a Vision Transformer (ViT). It provides a visualization of the facial feature distribution and a quantitative importance score for each facial region, supplementing interpretation and increasing efficiency in SZ detection while keeping a high detection accuracy. Results: A benchmarking platform comprising 921 face diagnostic images, 6 benchmark methods, and 4 evaluation metrics was established. The experimental results demonstrate that our method significantly improves SZ detection performance with a 3–10% increase in accuracy scores. Additionally, facial regions rank in descending order of importance in SZ detection as eyes, mouth, forehead, cheeks, and nose, which is exactly consistent with clinical traditional Chinese medicine experience. Conclusions: Our method fully leverages semantic feature representations of face diagnosis images, introduced here for the first time in SZ, offering strong interpretability and visualization capabilities. It not only opens a new path for SZ detection but also brings new tools and concepts to research and application in the field of mental illness. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
8. Peering into the Heart: A Comprehensive Exploration of Semantic Segmentation and Explainable AI on the MnMs-2 Cardiac MRI Dataset.
- Author
- Ayoob, Mohamed, Nettasinghe, Oshan, Sylvester, Vithushan, Bowala, Helmini, and Mohideen, Hamdaan
- Subjects
- CARDIAC magnetic resonance imaging, COMPUTER-aided diagnosis, TRANSFORMER models, COMPUTER-assisted image analysis (Medicine), DIAGNOSTIC imaging
- Abstract
Accurate and interpretable segmentation of medical images is crucial for computer-aided diagnosis and image-guided interventions. This study explores the integration of semantic segmentation and explainable AI techniques on the MnMs-2 Cardiac MRI dataset. We propose a segmentation model that achieves competitive Dice scores (nearly 90%) and Hausdorff distances (less than 70), demonstrating its effectiveness for cardiac MRI analysis. Furthermore, we leverage Grad-CAM and Feature Ablation, two explainable AI techniques, to visualise the regions of interest guiding the model predictions for a target class. This integration enhances interpretability, allowing us to gain insights into the model's decision-making process and build trust in its predictions. [ABSTRACT FROM AUTHOR] (A minimal Dice-score sketch follows this entry.)
- Published
- 2025
- Full Text
- View/download PDF
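The Dice score reported above is a standard overlap metric; a minimal sketch of how it is typically computed on binary masks (the smoothing constant is an illustrative choice, not the paper's):

```python
# Dice coefficient for binary segmentation masks (standard definition).
import torch

def dice_score(pred: torch.Tensor, target: torch.Tensor,
               eps: float = 1e-6) -> torch.Tensor:
    """pred, target: binary masks of shape (B, H, W); returns mean Dice."""
    inter = (pred * target).sum(dim=(1, 2))
    denom = pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return ((2 * inter + eps) / (denom + eps)).mean()

mask = (torch.rand(2, 128, 128) > 0.5).float()
print(dice_score(mask, mask))  # identical masks -> tensor(1.)
```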
9. A Robust Tuberculosis Diagnosis Using Chest X-Rays Based on a Hybrid Vision Transformer and Principal Component Analysis.
- Author
- El-Ghany, Sameh Abd, Elmogy, Mohammed, A. Mahmood, Mahmood, and Abd El-Aziz, A. A.
- Subjects
- TRANSFORMER models, MYCOBACTERIUM tuberculosis, COMPUTER-aided diagnosis, MEDICAL personnel, BACTERIAL diseases
- Abstract
Background: Tuberculosis (TB) is a bacterial disease that mainly affects the lungs, but it can also impact other parts of the body, such as the brain, bones, and kidneys. The disease is caused by a bacterium called Mycobacterium tuberculosis and spreads through the air when an infected person coughs or sneezes. TB can be inactive or active; in its active state, noticeable symptoms appear, and it can be transmitted to others. There are ongoing challenges in fighting TB, including resistance to medications, co-infections, and limited resources in areas heavily affected by the disease. These issues make it challenging to eradicate TB. Objective: Timely and precise diagnosis is essential for effective control, especially since TB often goes undetected and untreated, particularly in remote and under-resourced locations. Chest X-ray (CXR) images are commonly used to diagnose TB. However, difficulties can arise due to unusual findings on X-rays and a shortage of radiologists in high-infection areas. Method: To address these challenges, a computer-aided diagnosis (CAD) system that uses the vision transformer (ViT) technique has been developed to accurately identify TB in CXR images. This innovative hybrid CAD approach combines ViT with Principal Component Analysis (PCA) and machine learning (ML) techniques for TB classification, introducing a new method in this field. In the hybrid CAD system, ViT is used for deep feature extraction as a base model, PCA is used to reduce feature dimensions, and various ML methods are used to classify TB. This system allows for quickly identifying TB, enabling timely medical action and improving patient outcomes. Additionally, it streamlines the diagnostic process, reducing time and costs for patients and lessening the workload on healthcare professionals. The TB chest X-ray dataset was utilized to train and evaluate the proposed CAD system; the images underwent pre-processing techniques like resizing, scaling, and noise removal to improve diagnostic accuracy. Results: The performance of our CAD model was assessed against existing models, yielding excellent results. The model achieved remarkable metrics: an average precision of 99.90%, recall of 99.52%, F1-score of 99.71%, accuracy of 99.84%, false negative rate (FNR) of 0.48%, specificity of 99.52%, and negative predictive value (NPV) of 99.90%. Conclusions: This evaluation highlights the superior performance of our model compared to the latest available classifiers. [ABSTRACT FROM AUTHOR] (A schematic ViT-features + PCA + classifier pipeline follows this entry.)
- Published
- 2024
- Full Text
- View/download PDF
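The hybrid CAD pipeline above chains ViT feature extraction, PCA reduction, and a classical ML classifier. A minimal sketch of that chain, assuming a timm ViT backbone and an SVM stand-in; the component choices and random stand-in data are illustrative, not the paper's exact configuration:

```python
# ViT deep features -> PCA -> classical classifier (illustrative pipeline).
import numpy as np
import timm
import torch
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

vit = timm.create_model("vit_base_patch16_224", pretrained=False,
                        num_classes=0).eval()  # pooled 768-d embeddings

@torch.no_grad()
def vit_features(batch: torch.Tensor) -> np.ndarray:
    """batch: (N, 3, 224, 224) preprocessed chest X-rays."""
    return vit(batch).cpu().numpy()

# Stand-in data; in practice extract features from the TB chest X-ray dataset.
X_train = vit_features(torch.randn(8, 3, 224, 224))
y_train = np.array([0, 1] * 4)  # 0 = normal, 1 = TB

clf = make_pipeline(PCA(n_components=0.95),  # keep 95% of the variance
                    SVC(kernel="rbf"))
clf.fit(X_train, y_train)
print(clf.predict(vit_features(torch.randn(2, 3, 224, 224))))
```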
10. HybridFusionNet: Deep Learning for Multi-Stage Diabetic Retinopathy Detection.
- Author
- Shukla, Amar, Tiwari, Shamik, and Jain, Anurag
- Subjects
- TRANSFORMER models, DIABETIC retinopathy, VISION disorders, CLASSIFICATION, DIAGNOSIS
- Abstract
Diabetic retinopathy (DR) is one of the most common causes of visual impairment worldwide and requires reliable automated detection methods. Numerous research efforts have developed various conventional methods for early detection of DR. Research in the field of DR remains insufficient, indicating the potential for advances in diagnosis. In this paper, a hybrid model (HybridFusionNet) that integrates a vision transformer (VIT) and attention processes is presented. It improves classification in the binary (Bcl) and multi-class (Mcl) stages by utilizing deep features from the DR stages. As a result, both the SAN and VIT models improve the recognition accuracy (Acc) in both stages. The HybridFusionNet mechanism achieves a competitive improvement in the binary and multi-class stages, with an Acc of 91% in Bcl and 99% in Mcl, respectively. This illustrates that this model is suitable for a better diagnosis of DR. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
11. ViTSen: Bridging Vision Transformers and Edge Computing With Advanced In/Near-Sensor Processing.
- Author
- Tabrizchi, Sepehr, Reidy, Brendan C., Najafi, Deniz, Angizi, Shaahin, Zand, Ramtin, and Roohi, Arman
- Abstract
This letter introduces ViTSen, optimizing vision transformers (ViTs) for resource-constrained edge devices. It features an in-sensor image compression technique to reduce data conversion and transmission power costs effectively. Further, ViTSen incorporates a ReRAM array, allowing efficient near-sensor analog convolution. This integration, novel pixel reading, and peripheral circuitry decrease the reliance on analog buffers and converters, significantly lowering power consumption. To make ViTSen compatible, several established ViT algorithms have undergone quantization and channel reduction. Circuit-to-application co-simulation results show that ViTSen maintains accuracy comparable to a full-precision baseline across various data precisions, achieving an efficiency of ~3.1 TOp/s/W. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
12. Comparison of Deep Learning approaches in classification of glacial landforms
- Author
- Paweł Nadachowski, Zbigniew Łubniewski, Karolina Trzcińska, and Jarosław Tęgowski
- Subjects
- convolutional neural network (cnn), deep learning, digital elevation model (dem), elise glacier, gardno-leba plain, glacial landforms, lubawa upland, residual neural network (resnet), supervised classification, svalbard, vgg, vision transformer (vit)
- Abstract
Glacial landforms, created by the continuous movements of glaciers over millennia, are crucial topics in geomorphological research. Their systematic analysis affords invaluable insights into past climatic oscillations and augments understanding of long-term climate change dynamics. The classification of these types of terrain traditionally depends on labor-intensive manual or semi-automated methods. However, the emergence of automated techniques driven by deep learning and neural networks holds promise for enhancing the efficiency of terrain classification workflows. This study evaluated the effectiveness of Convolutional Neural Network (CNN) architectures, particularly Residual Neural Network (ResNet) and VGG, in comparison with the Vision Transformer (ViT) architecture in the glacial landform classification task. Using preprocessed input data from a Digital Elevation Model (DEM) covering regions such as the Lubawa Upland and Gardno-Leba Plain in Poland, as well as the Elise Glacier in Svalbard, Norway, comprehensive assessments of those methods were conducted. The final results highlight the unique ability of deep learning methods to accurately classify glacial landforms. The classification process presented in this study can be an efficient, repeatable, and fast solution for automatic terrain classification.
- Published
- 2024
- Full Text
- View/download PDF
13. Intelligent tool wear prediction based on deep learning PSD-CVT model
- Author
- Sumei Si, Deqiang Mu, and Zekai Si
- Subjects
- Convolutional neural network (CNN), Deep learning, Tool wear prediction, Power spectral density (PSD), Vision transformer (ViT)
- Abstract
To ensure the reliability of machining quality, it is crucial to predict tool wear accurately. In this paper, a novel deep learning-based model is proposed, which synthesizes the advantages of power spectral density (PSD), convolutional neural networks (CNN), and the vision transformer model (ViT), namely PSD-CVT. PSD maps can provide a comprehensive understanding of the spectral characteristics of the signals, making those characteristics more obvious and easier to analyze and compare across different signals. CNN focuses on local feature extraction, which can capture local information such as the texture, edge, and shape of the image, while the attention mechanism in ViT can effectively capture the global structure and long-range dependencies present in the image. Two fully connected layers with a ReLU function are used to obtain the predicted tool wear values. The experimental results on the PHM 2010 dataset demonstrate that the proposed model has higher prediction accuracy than the CNN model or ViT model alone, as well as outperforms several existing methods in accurately predicting tool wear. The proposed prediction method can also be applied to predict tool wear in other machining fields. (A minimal PSD-map sketch follows this entry.)
- Published
- 2024
- Full Text
- View/download PDF
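The model above feeds PSD maps of sensor signals to a CNN/ViT stack. A minimal sketch of turning a raw signal into a log-power time-frequency map; the spectrogram parameters and the 50 kHz sampling rate are illustrative assumptions, not the paper's exact pipeline:

```python
# Signal -> log-power spectral map, usable as an image input for a CNN/ViT.
import numpy as np
from scipy import signal

def psd_map(sig: np.ndarray, fs: float = 50_000.0) -> np.ndarray:
    """Return a (freq x time) log-power map for one sensor channel."""
    f, t, Sxx = signal.spectrogram(sig, fs=fs, nperseg=256, noverlap=128)
    return 10 * np.log10(Sxx + 1e-12)  # dB scale; the epsilon avoids log(0)

demo = psd_map(np.random.randn(50_000))  # one second of stand-in data
print(demo.shape)  # (129, ~389): 129 frequency bins x time frames
```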
14. Enhancing Breast Cancer Detection in Ultrasound Images: An Innovative Approach Using Progressive Fine‐Tuning of Vision Transformer Models.
- Author
- Alruily, Meshrif, Mahmoud, Alshimaa Abdelraof, Allahem, Hisham, Mostafa, Ayman Mohamed, Shabana, Hosameldeen, Ezz, Mohamed, and Vocaturo, Eugenio
- Subjects
- TRANSFORMER models, IMAGE recognition (Computer vision), ULTRASONIC imaging, DEEP learning, DATA augmentation, BREAST
- Abstract
Breast cancer is ranked as the second most common cancer among women globally, highlighting the critical need for precise and early detection methods. Our research introduces a novel approach for classifying benign and malignant breast ultrasound images. We leverage advanced deep learning methodologies, mainly focusing on the vision transformer (ViT) model. Our method distinctively features progressive fine‐tuning, a tailored process that incrementally adapts the model to the nuances of breast tissue classification. Ultrasound imaging was chosen for its distinct benefits in medical diagnostics. This modality is noninvasive and cost‐effective and demonstrates enhanced specificity, especially in dense breast tissues where traditional methods may struggle. Such characteristics make it an ideal choice for the sensitive task of breast cancer detection. Our extensive experiments utilized the breast ultrasound images dataset, comprising 780 images of both benign and malignant breast tissues. The dataset underwent a comprehensive analysis using several pretrained deep learning models, including VGG16, VGG19, DenseNet121, Inception, ResNet152V2, DenseNet169, DenseNet201, and the ViT. The results presented were achieved without employing data augmentation techniques. The ViT model demonstrated robust accuracy and generalization capabilities with the original dataset size, which consisted of 637 images. Each model's performance was meticulously evaluated through a robust 10‐fold cross‐validation technique, ensuring a thorough and unbiased comparison. Our findings are significant, demonstrating that the progressive fine‐tuning substantially enhances the ViT model's capability. This resulted in a remarkable accuracy of 94.49% and an AUC score of 0.921, significantly higher than models without fine‐tuning. These results affirm the efficacy of the ViT model and highlight the transformative potential of integrating progressive fine‐tuning with transformer models in medical image classification tasks. The study solidifies the role of such advanced methodologies in improving early breast cancer detection and diagnosis, especially when coupled with the unique advantages of ultrasound imaging. [ABSTRACT FROM AUTHOR] (A minimal progressive fine‐tuning sketch follows this entry.)
- Published
- 2024
- Full Text
- View/download PDF
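Progressive fine-tuning, as used above, incrementally unfreezes the backbone. A minimal sketch of one common schedule, assuming a timm ViT; the stage schedule, learning rates, and the `set_trainable` helper are illustrative assumptions, not the paper's recipe:

```python
# Progressive fine-tuning: unfreeze ViT blocks stage by stage.
import timm
import torch

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=2)

def set_trainable(model, n_last_blocks: int) -> None:
    """Freeze everything, then unfreeze the last n transformer blocks + head."""
    for p in model.parameters():
        p.requires_grad = False
    blocks = model.blocks[len(model.blocks) - n_last_blocks:] if n_last_blocks else []
    for blk in blocks:
        for p in blk.parameters():
            p.requires_grad = True
    for p in model.head.parameters():   # classifier head always trains
        p.requires_grad = True

for stage, n_blocks in enumerate((0, 2, 6, 12), start=1):
    set_trainable(model, n_blocks)
    opt = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4 / stage)
    # ... train a few epochs per stage on the ultrasound images here ...
```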
15. Hybrid-DC: A Hybrid Framework Using ResNet-50 and Vision Transformer for Steel Surface Defect Classification in the Rolling Process.
- Author
- Jeong, Minjun, Yang, Minyeol, and Jeong, Jongpil
- Subjects
- TRANSFORMER models, COMPUTER vision, DEEP learning, SURFACE defects, FEATURE extraction
- Abstract
This study introduces Hybrid-DC, a hybrid deep-learning model integrating ResNet-50 and Vision Transformer (ViT) for high-accuracy steel surface defect classification. Hybrid-DC leverages ResNet-50 for efficient feature extraction at both low and high levels and utilizes ViT's global context learning to enhance classification precision. A unique hybrid attention layer and an attention fusion mechanism enable Hybrid-DC to adapt to the complex, variable patterns typical of steel surface defects. Experimental evaluations demonstrate that Hybrid-DC achieves substantial accuracy improvements and significantly reduced loss compared to traditional models like MobileNetV2 and ResNet, with a validation accuracy reaching 0.9944. The results suggest that this model, characterized by rapid convergence and stable learning, can be applied for real-time quality control in steel manufacturing and other high-precision industries, enhancing automated defect detection efficiency. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
16. High‐precision identification of highly similar Pinelliae Rhizoma and adulterated Rhizoma pinelliae pedatisectae through deep neural networks based on vision transformers.
- Author
- Chen, Rong, Zhang, Ying, Song, Wen‐Jun, Zhao, Ting‐Ting, Wang, Jiu‐Ning, and Zhao, Yong‐Hong
- Subjects
- CONVOLUTIONAL neural networks, ARTIFICIAL neural networks, TRANSFORMER models, IMAGE recognition (Computer vision), PLANT identification
- Abstract
Pinelliae Rhizoma is a key ingredient in botanical supplements and is often adulterated by Rhizoma Pinelliae Pedatisectae, which is similar in appearance but less expensive. Accurate identification of these materials is crucial for both scientific and commercial purposes. Traditional morphological identification relies heavily on expert experience and is subjective, while chemical analysis and molecular biological identification are typically time consuming and labor intensive. This study aims to employ a simpler, faster, and non‐invasive image recognition technique to distinguish between these two highly similar plant materials. In the realm of image recognition, we aimed to utilize the vision transformer (ViT) algorithm, a cutting‐edge image recognition technology, to differentiate these materials. All samples were verified using DNA molecular identification before image analysis. The result demonstrates that the ViT algorithm achieves a classification accuracy exceeding 94%, significantly outperforming the convolutional neural network model's 60%–70% accuracy. This highlights the efficiency of this technology in identifying plant materials with similar appearances. This study marks the pioneering application of the ViT algorithm to such a challenging task, showcasing its potential for precise botanical material identification and setting the stage for future advancements in the field. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
17. Fire and Smoke Detection in Complex Environments.
- Author
- Safarov, Furkat, Muksimova, Shakhnoza, Kamoliddin, Misirov, and Cho, Young Im
- Subjects
- TRANSFORMER models, EMERGENCY management, FEATURE extraction, ENVIRONMENTAL monitoring, REMOTE sensing, FIRE detectors
- Abstract
Fire detection is a critical task in environmental monitoring and disaster prevention, with traditional methods often limited in their ability to detect fire and smoke in real time over large areas. The rapid identification of fire and smoke in both indoor and outdoor environments is essential for minimizing damage and ensuring timely intervention. In this paper, we propose a novel approach to fire and smoke detection by integrating a vision transformer (ViT) with the YOLOv5s object detection model. Our modified model leverages the attention-based feature extraction capabilities of ViTs to improve detection accuracy, particularly in complex environments where fires may be occluded or distributed across large regions. By replacing the CSPDarknet53 backbone of YOLOv5s with ViT, the model is able to capture both local and global dependencies in images, resulting in more accurate detection of fire and smoke under challenging conditions. We evaluate the performance of the proposed model using a comprehensive Fire and Smoke Detection Dataset, which includes diverse real-world scenarios. The results demonstrate that our model outperforms baseline YOLOv5 variants in terms of precision, recall, and mean average precision (mAP), achieving a mAP@0.5 of 0.664 and a recall of 0.657. The modified YOLOv5s with ViT shows significant improvements in detecting fire and smoke, particularly in scenes with complex backgrounds and varying object scales. Our findings suggest that the integration of ViT as the backbone of YOLOv5s offers a promising approach for real-time fire detection in both urban and natural environments. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
18. Web-Enhanced Vision Transformers and Deep Learning for Accurate Event-Centric Management Categorization in Education Institutions.
- Author
- Albarrak, Khalied M. and Sorour, Shaymaa E.
- Subjects
- TRANSFORMER models, DEEP learning, CONVOLUTIONAL neural networks, DIGITAL technology, INTERNET content, DIGITAL communications
- Abstract
In the digital era, social media has become a cornerstone for educational institutions, driving public engagement and enhancing institutional communication. This study utilizes AI-driven image processing and Web-enhanced Deep Learning (DL) techniques to investigate the effectiveness of King Faisal University's (KFU's) social media strategy as a case study, particularly on Twitter. By categorizing images into five primary event management categories and subcategories, this research provides a robust framework for assessing the social media content generated by KFU's administrative units. Seven advanced models were developed, including an innovative integration of Vision Transformers (ViTs) with Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, VGG16, and ResNet. The AI-driven ViT-CNN hybrid model achieved perfect classification accuracy (100%), while the "Development and Partnerships" category demonstrated notable accuracy (98.8%), underscoring the model's unparalleled efficacy in strategic content classification. This study offers actionable insights for the optimization of AI-driven digital communication strategies and Web-enhanced data collection processes, aligning them with national development goals and Saudi Arabia's Vision 2030, thereby showcasing the transformative power of DL in event-centric management and the broader higher education landscape. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
19. CPH-Fmnet: An Optimized Deep Learning Model for Multi-View Stereo and Parameter Extraction in Complex Forest Scenes.
- Author
- Dai, Lingnan, Chen, Zhao, Zhang, Xiaoli, Wang, Dianchang, and Huo, Lishuo
- Subjects
- FOREST management, TRANSFORMER models, FEATURE selection, ENVIRONMENTAL monitoring, FOREST productivity, DEEP learning
- Abstract
The three-dimensional reconstruction of forests is crucial in remote sensing technology, ecological monitoring, and forestry management, as it yields precise forest structure and tree parameters, providing essential data support for forest resource management, evaluation, and sustainable development. Nevertheless, forest 3D reconstruction now encounters obstacles including higher equipment costs, reduced data collection efficiency, and complex data processing. This work introduces a unique deep learning model, CPH-Fmnet, designed to enhance the accuracy and efficiency of 3D reconstruction in intricate forest environments. CPH-Fmnet enhances the FPN Encoder-Decoder Architecture by meticulously incorporating the Channel Attention Mechanism (CA), Path Aggregation Module (PA), and High-Level Feature Selection Module (HFS), alongside the integration of the pre-trained Vision Transformer (ViT), thereby significantly improving the model's global feature extraction and local detail reconstruction abilities. We selected three representative sample plots in Haidian District, Beijing, China, as the study area and took forest stand sequence photos with an iPhone for the research. Comparative experiments with the conventional SfM + MVS and MVSFormer models, along with comprehensive parameter extraction and ablation studies, substantiated the enhanced efficacy of the proposed CPH-Fmnet model in addressing difficult circumstances such as intricate occlusions, poorly textured areas, and variations in lighting. The test results show that the model performs better on several evaluation criteria, with an RMSE of 1.353, an MAE of only 5.1%, an r value of 1.190, and a forest reconstruction rate of 100%, all better than current methods. Furthermore, the model produced a more compact and precise 3D point cloud while accurately determining the properties of the forest trees. The findings indicate that CPH-Fmnet offers an innovative approach for forest resource management and ecological monitoring, characterized by low cost, high accuracy, and high efficiency. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
20. EEG-VTTCNet: A loss joint training model based on the vision transformer and the temporal convolution network for EEG-based motor imagery classification.
- Author
- Shi, Xingbin, Li, Baojiang, Wang, Wenlong, Qin, Yuxin, Wang, Haiyan, and Wang, Xichao
- Subjects
- TRANSFORMER models, MOTOR imagery (Cognition), BRAIN-computer interfaces, TIME-varying networks, FEATURE extraction
- Abstract
• Increased decoding precision of motor imagery EEG signals. • A hybrid TCN and ViT method achieves good results. • A shared convolution strategy and a dual-branching strategy. Brain-computer interface (BCI) is a technology that directly connects signals between the human brain and a computer or other external device. Motor imagery electroencephalographic (MI-EEG) signals are considered a promising paradigm for BCI systems, with a wide range of potential applications in medical rehabilitation, human–computer interaction, and virtual reality. Accurate decoding of MI-EEG signals poses a significant challenge due to issues related to the quality of the collected EEG data and subject variability. Therefore, developing an efficient MI-EEG decoding network is crucial and warrants research. This paper proposes a loss joint training model based on the vision transformer (VIT) and the temporal convolutional network (EEG-VTTCNet) to classify MI-EEG signals. To take advantage of multiple modules together, the EEG-VTTCNet adopts a shared convolution strategy and a dual-branching strategy. The dual-branching modules perform complementary learning and jointly train the shared convolutional modules with better performance. We conducted experiments on the BCI Competition IV-2a and IV-2b datasets, and the proposed network outperformed the current state-of-the-art techniques with an accuracy of 84.58% and 90.94%, respectively, for the subject-dependent mode. In addition, we used t-SNE to visualize the features extracted by the proposed network, further demonstrating the effectiveness of the feature extraction framework. We also conducted extensive ablation and hyperparameter tuning experiments to construct a robust network architecture that can be well generalized. [ABSTRACT FROM AUTHOR] (A rough dual-branch sketch follows this entry.)
- Published
- 2024
- Full Text
- View/download PDF
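The shared-convolution, dual-branch idea above can be sketched as a single conv stem feeding a transformer branch and a dilated-convolution (TCN-style) branch, with two heads trained under a joint loss. All layer sizes and the `DualBranch` class below are illustrative assumptions, not the authors' architecture:

```python
# Shared stem + transformer branch + TCN-style branch with a joint loss (sketch).
import torch
import torch.nn as nn

class DualBranch(nn.Module):  # hypothetical name
    def __init__(self, n_ch: int = 22, n_cls: int = 4):
        super().__init__()
        self.shared = nn.Conv1d(n_ch, 32, kernel_size=25, padding=12)  # shared stem
        enc = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=2)
        self.tcn = nn.Sequential(   # dilated convs stand in for a TCN
            nn.Conv1d(32, 32, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv1d(32, 32, 3, padding=4, dilation=4), nn.ReLU())
        self.head_t = nn.Linear(32, n_cls)   # transformer-branch head
        self.head_c = nn.Linear(32, n_cls)   # TCN-branch head

    def forward(self, x):                             # x: (B, channels, time)
        h = self.shared(x)                            # (B, 32, T)
        zt = self.transformer(h.transpose(1, 2)).mean(dim=1)  # pooled tokens
        zc = self.tcn(h).mean(dim=-1)                 # pooled over time
        return self.head_t(zt), self.head_c(zc)       # two logits for a joint loss

logits_t, logits_c = DualBranch()(torch.randn(8, 22, 1000))
# joint training: loss = ce(logits_t, y) + ce(logits_c, y)
```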
21. CSFNet: a compact and efficient convolution-transformer hybrid vision model.
- Author
- Feng, Jian, Wu, Peng, Xu, Renjie, Zhang, Xiaoming, Wang, Tao, and Li, Xuan
- Subjects
- TRANSFORMER models, CONVOLUTIONAL neural networks, IMAGE recognition (Computer vision), LOW-rank matrices, MOBILE apps
- Abstract
The Vision Transformer (ViT) has demonstrated impressive performance in various visual tasks, but its high computational requirements limit its applicability on edge devices. Conversely, convolutional neural networks (CNNs) are commonly used in mobile applications, but their static and weak global properties hinder their performance. In this work, we propose a lightweight, high-density predictive classification hybrid-based model called CSFNet, which combines good local inductive bias capability with long-distance modeling property. To establish local-global information association, we introduce two layered structures. Firstly, we use the Local-Attention Block (LAB) with adaptive kernels and channel expansion ratio to aggregate n × n local information layer by layer, capturing multi-stage detail features and inducing efficient local inductive properties. Secondly, we introduce a linear complexity Channel-Spatial Fusion Attention (CSFA) that projects the attention matrix from both channel and tokens dimensions. The relationships between tokens are aggregated stage by stage to encode efficient contextual association information using low-rank matrix and element-by-element operations to reduce computational complexity. Experimental results demonstrate that our proposed CSFNet-XXS/XS/S models with 1.4M/2.4M/5.6M parameters and 0.3G/0.5G/1.1G multiply-adds (MAdds) achieve 70.23%/74.91%/78.82% top-1 accuracy on ImageNet-1k with competitive performance compared to recent mainstream methods. Furthermore, CSFNet performs well on small-scale datasets, MS-COCO2017 and ADE-20K. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
22. An Interpretable Target-Aware Vision Transformer for Polarimetric HRRP Target Recognition with a Novel Attention Loss.
- Author
- Gao, Fan, Lang, Ping, Yeh, Chunmao, Li, Zhangfeng, Ren, Dawei, and Yang, Jian
- Subjects
- TRANSFORMER models, AUTOMATIC target recognition, RADAR targets
- Abstract
Polarimetric high-resolution range profile (HRRP), with its rich polarimetric and spatial information, has become increasingly important in radar automatic target recognition (RATR). This study proposes an interpretable target-aware vision Transformer (ITAViT) for polarimetric HRRP target recognition with a novel attention loss. In ITAViT, we initially fuse the polarimetric features and the amplitude of the polarimetric HRRP with a polarimetric preprocessing layer (PPL) to obtain the feature map used as the input of the subsequent network. The vision Transformer (ViT) is then used as the backbone to automatically extract both local and global features. Most importantly, we introduce a novel attention loss to optimize the alignment between the attention map and the HRRP span. It can thus sharpen the difference between the target and the background and enable the model to focus more effectively on real target areas. Experiments on a simulated X-band dataset demonstrate that our proposed ITAViT outperforms comparative models under various experimental conditions. Ablation studies highlight the effectiveness of the polarimetric preprocessing and the attention loss. Furthermore, the visualization of the self-attention mechanism suggests that the attention loss enhances the interpretability of the network. [ABSTRACT FROM AUTHOR] (A schematic attention-loss sketch follows this entry.)
- Published
- 2024
- Full Text
- View/download PDF
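One way to "align an attention map with a target span", as described above, is to penalize divergence between the CLS-token attention distribution and a normalized mask over target range cells. The KL formulation and the weighting below are illustrative assumptions, not ITAViT's exact loss:

```python
# Attention loss: pull CLS-to-token attention toward a known target span.
import torch
import torch.nn.functional as F

def attention_loss(attn: torch.Tensor, target_mask: torch.Tensor) -> torch.Tensor:
    """attn: (B, T) attention weights over T tokens (rows sum to 1).
    target_mask: (B, T) binary mask, 1 over range cells covered by the target."""
    target = target_mask / target_mask.sum(dim=1, keepdim=True).clamp(min=1.0)
    # KL divergence between the attention distribution and the target span.
    return F.kl_div(attn.clamp(min=1e-8).log(), target, reduction="batchmean")

attn = torch.softmax(torch.randn(4, 64), dim=1)  # stand-in attention rows
mask = torch.zeros(4, 64)
mask[:, 20:30] = 1.0                             # target occupies cells 20..29
loss = attention_loss(attn, mask)                # add to the task loss,
print(loss)                                      # e.g. total = ce + 0.1 * loss
```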
23. A novel Skin lesion prediction and classification technique: ViT‐GradCAM.
- Author
- Shafiq, Muhammad, Aggarwal, Kapil, Jayachandran, Jagannathan, Srinivasan, Gayathri, Boddu, Rajasekhar, and Alemayehu, Adugna
- Subjects
- TRANSFORMER models, IMAGE recognition (Computer vision), DATABASES, DATA augmentation, IMAGE segmentation, DEEP learning
- Abstract
Background: Skin cancer is one of the most commonly occurring diseases in humans. Early detection and treatment are essential to reduce the malignancy of infections. Deep learning techniques are supplementary tools to assist clinical experts in detecting and localizing skin lesions. Vision transformers (ViT) based on multi-class image segmentation and classification provide fairly accurate detection and are gaining more popularity due to legitimate multiclass prediction capabilities. Materials and methods: In this research, we propose a new ViT Gradient‐Weighted Class Activation Mapping (GradCAM) based architecture named ViT‐GradCAM for detecting and classifying skin lesions by spreading ratio on the lesion's surface area. The proposed system is trained and validated using the HAM 10000 dataset, studying seven skin lesion types. The database comprises 10 015 dermatoscopic images of varied sizes. Data preprocessing and data augmentation techniques are applied to overcome class imbalance issues and improve the model's performance. Result: The proposed algorithm is based on ViT models that classify the dermatoscopic images into seven classes with an accuracy of 97.28%, a precision of 98.51%, a recall of 95.2%, and an F1 score of 94.6%. The proposed ViT‐GradCAM obtains better and more accurate detection and classification than other state‐of‐the‐art deep learning‐based skin lesion detection models. The architecture of ViT‐GradCAM is extensively visualized to highlight the actual pixels in essential regions associated with skin‐specific pathologies. Conclusion: This research proposes an alternate solution to overcome the challenges of detecting and classifying skin lesions using ViTs and GradCAM, which play a significant role in detecting and classifying skin lesions accurately rather than relying solely on deep learning models. [ABSTRACT FROM AUTHOR] (A minimal Grad-CAM-on-ViT sketch follows this entry.)
- Published
- 2024
- Full Text
- View/download PDF
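Grad-CAM on a ViT needs the token sequence reshaped back into a 2-D grid before the CAM is computed. A minimal sketch assuming the third-party `pytorch-grad-cam` package and a timm ViT; the package choice and target-layer selection are assumptions, not the paper's implementation:

```python
# Grad-CAM over a ViT: reshape the (B, tokens, dim) activations to a 2-D grid.
import timm
import torch
from pytorch_grad_cam import GradCAM

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

def reshape_transform(tensor, height=14, width=14):
    # Drop the CLS token, then lay the 196 patch tokens out as a 14x14 map.
    result = tensor[:, 1:, :].reshape(tensor.size(0), height, width, tensor.size(2))
    return result.permute(0, 3, 1, 2)  # (B, C, H, W), as Grad-CAM expects

cam = GradCAM(model=model,
              target_layers=[model.blocks[-1].norm1],
              reshape_transform=reshape_transform)
heatmap = cam(input_tensor=torch.randn(1, 3, 224, 224))  # (1, 224, 224) map
print(heatmap.shape)
```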
24. Automated Concrete Bridge Deck Inspection Using Unmanned Aerial System (UAS)-Collected Data: A Machine Learning (ML) Approach.
- Author
- Pokhrel, Rojal, Samsami, Reihaneh, Elmi, Saida, and Brooks, Colin N.
- Subjects
- CONVOLUTIONAL neural networks, TRANSFORMER models, INFRASTRUCTURE (Economics), MACHINE learning, BRIDGE floors
- Abstract
Bridges are crucial components of infrastructure networks that facilitate national connectivity and development. According to the National Bridge Inventory (NBI) and the Federal Highway Administration (FHWA), the cost to repair U.S. bridges was recently estimated at approximately USD 164 billion. Traditionally, bridge inspections are performed manually, which poses several challenges in terms of safety, efficiency, and accessibility. To address these issues, this research study introduces a method using Unmanned Aerial Systems (UASs) to help automate the inspection process. This methodology employs UASs to capture visual images of a concrete bridge deck, which are then analyzed using advanced machine learning techniques of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to detect damage and delamination. A case study on the Beyer Road Concrete Bridge in Michigan is used to demonstrate the developed methodology. The findings demonstrate that the ViT model outperforms the CNN in detecting bridge deck damage, with an accuracy of 97%, compared to 92% for the CNN. Additionally, the ViT model showed a precision of 96% and a recall of 97%, while the CNN model achieved a precision of 93% and a recall of 61%. This technology not only enhances the maintenance of bridges but also significantly reduces the risks associated with traditional inspection methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
25. End-to-End Deep Learning Framework for Arabic Handwritten Legal Amount Recognition and Digital Courtesy Conversion.
- Author
- Abdo, Hakim A., Abdu, Ahmed, Al-Antari, Mugahed A., Manza, Ramesh R., Talo, Muhammed, Gu, Yeong Hyeon, and Bawiskar, Shobha
- Subjects
- TRANSFORMER models, OBJECT recognition (Computer vision), CONVOLUTIONAL neural networks, BLENDED learning, ARTIFICIAL intelligence, HANDWRITING recognition (Computer science)
- Abstract
Arabic handwriting recognition and conversion are crucial for financial operations, particularly for processing handwritten amounts on cheques and financial documents. Compared to other languages, research in this area is relatively limited, especially concerning Arabic. This study introduces an innovative AI-driven method for simultaneously recognizing and converting Arabic handwritten legal amounts into numerical courtesy forms. The framework consists of four key stages. First, a new dataset of Arabic legal amounts in handwritten form (".png" image format) is collected and labeled by natives. Second, a YOLO-based AI detector extracts individual legal amount words from the entire input sentence images. Third, a robust hybrid classification model is developed, sequentially combining ensemble Convolutional Neural Networks (CNNs) with a Vision Transformer (ViT) to improve the prediction accuracy of single Arabic words. Finally, a novel conversion algorithm transforms the predicted Arabic legal amounts into digital courtesy forms. The framework's performance is fine-tuned and assessed using 5-fold cross-validation tests on the proposed novel dataset, achieving a word-level detection accuracy of 98.6% and a recognition accuracy of 99.02% at the classification stage. The conversion process yields an overall accuracy of 90%, with an inference time of 4.5 s per sentence image. These results demonstrate promising potential for practical implementation in diverse Arabic financial systems. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
26. CVTNet: A Fusion of Convolutional Neural Networks and Vision Transformer for Wetland Mapping Using Sentinel-1 and Sentinel-2 Satellite Data.
- Author
- Marjani, Mohammad, Mahdianpari, Masoud, Mohammadimanesh, Fariba, and Gill, Eric W.
- Subjects
- WETLANDS, TRANSFORMER models, CONVOLUTIONAL neural networks, DEEP learning, CLASS differences, FEATURE extraction
- Abstract
Wetland mapping is a critical component of environmental monitoring, requiring advanced techniques to accurately represent the complex land cover patterns and subtle class differences inherent in these ecosystems. This study aims to address these challenges by proposing CVTNet, a novel deep learning (DL) model that integrates convolutional neural networks (CNNs) and vision transformer (ViT) architectures. CVTNet uses channel attention (CA) and spatial attention (SA) mechanisms to enhance feature extraction from Sentinel-1 (S1) and Sentinel-2 (S2) satellite data. The primary goal of this model is to achieve a balanced trade-off between Precision and Recall, which is essential for accurate wetland mapping. The class-specific analysis demonstrated CVTNet's proficiency across diverse classes, including pasture, shrubland, urban, bog, fen, and water. Comparative analysis showed that CVTNet outperforms contemporary algorithms such as Random Forest (RF), ViT, multi-layer perceptron mixer (MLP-mixer), and hybrid spectral net (HybridSN) classifiers. Additionally, the attention mechanism (AM) analysis and sensitivity analysis highlighted the crucial role of CA, SA, and ViT in focusing the model's attention on critical regions, thereby improving the mapping of wetland regions. Despite challenges at class boundaries, particularly between bog and fen, and misclassifications of swamp pixels, CVTNet presents a solution for wetland mapping. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
27. Image Denoise Model Based on Structural Re-Parameterized Uniformer Transformer and UNet.
- Author
- Lu, Zhengwei, Zhang, Duzhen, Wang, Tao, Jiang, Hengchang, and Pu, Yingkai
- Subjects
- IMAGE denoising, TRANSFORMER models, STRUCTURAL models, COMPUTER vision, CONVOLUTIONAL neural networks
- Abstract
Image denoising is a fundamental problem in computer vision (CV). The Vision Transformer (ViT) is an improvement in CV after convolutional neural networks (CNNs). Recently, it has been demonstrated that the RepVGG network with structural re-parameterization performs well in image tasks. Experiments indicate that the ViT-based Uniformer transformer network successfully balances local and global information. As image denoising tasks require more local and global information about the image, a novel image denoising model named structural Re-parameterization Uniformer Transformer-UNet (Rep-UUNet) is proposed in this paper. The model structurally re-parameterizes the Uniformer Transformer network and uses UNet skip connections to reconstruct the output image. For downstream image tasks, the RepVGG model, which utilizes local image information, is used. Indicators such as PSNR, SSIM, and others are used to assess image denoising performance. Experimental results demonstrate that our Rep-UUNet network model outperforms five other models. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
28. Add-Vit: CNN-Transformer Hybrid Architecture for Small Data Paradigm Processing.
- Author
- Chen, Jinhui, Wu, Peng, Zhang, Xiaoming, Xu, Renjie, and Liang, Jia
- Abstract
The vision transformer (ViT), pre-trained on large datasets, outperforms convolutional neural networks (CNN) in computer vision (CV). However, if not pre-trained, the transformer architecture doesn't work well on small datasets and is surpassed by CNN. Through analysis, we found that: (1) the division and processing of tokens in the ViT discard the marginalized information between tokens; (2) the isolated multi-head self-attention (MSA) lacks prior knowledge; (3) the local inductive bias capability of stacked transformer blocks is much inferior to that of CNN. We propose a novel architecture for small data paradigms without pre-training, named Add-Vit, which uses progressive tokenization with feature supplementation in patch embedding. The model's representational ability is enhanced by using a convolutional prediction module shortcut to connect MSA and capture local features as additional representations of the token. Without the need for pre-training on large datasets, our best model achieved 81.25% accuracy when trained from scratch on CIFAR-100. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
29. Vehicle Detection in Adverse Weather: A Multi-Head Attention Approach with Multimodal Fusion.
- Author
- Tabassum, Nujhat and El-Sharkawy, Mohamed
- Subjects
- TRANSFORMER models, OBJECT recognition (Computer vision), WEATHER, AUTONOMOUS vehicles
- Abstract
In the realm of autonomous vehicle technology, the multimodal vehicle detection network (MVDNet) represents a significant leap forward, particularly in the context of adverse weather conditions. This paper focuses on the enhancement of MVDNet through the integration of a multi-head attention layer, aimed at refining its performance. The integrated multi-head attention layer in the MVDNet model is a pivotal modification, advancing the network's ability to process and fuse multimodal sensor information more efficiently. The paper validates the improved performance of MVDNet with multi-head attention through comprehensive testing, which includes a training dataset derived from the Oxford Radar RobotCar. The results clearly demonstrate that the multi-head MVDNet outperforms the other related conventional models, particularly in the average precision (AP) of estimation, under challenging environmental conditions. The proposed multi-head MVDNet not only contributes significantly to the field of autonomous vehicle detection but also underscores the potential of sophisticated sensor fusion techniques in overcoming environmental limitations. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
30. Shape-Sensitive Loss for Catheter and Guidewire Segmentation
- Author
- Kongtongvattana, Chayun, Huang, Baoru, Kang, Jingxuan, Nguyen, Hoan, Olufemi, Olajide, and Nguyen, Anh
- Published
- 2024
- Full Text
- View/download PDF
31. EnviroWatch: A Comprehensive Environmental Monitoring Web Frame and Cleanup Coordination System Using CNN
- Author
- Ahire, Rohan, Gage, Madhu, Mhatre, Darpan, Kadam, Chirag, Deshmukh, Neha, Deshpande, Kiran, and Kulkarni, Rucha
- Published
- 2024
- Full Text
- View/download PDF
32. Vision Transformer-Based LULC Classification Using Remotely Sensed Hyperspectral Image
- Author
- Chaudhri, S. N., Mallikarjuna Rao, Y., Rajput, N. S., and Subramanyam, M. V.
- Published
- 2024
- Full Text
- View/download PDF
33. CoViT-Net: A Pre-trained Hybrid Vision Transformer for COVID-19 Detection in CT-Scans
- Author
- Das, Ankit, Banik, Debapriya, Roy, Kaushiki, Chan, Gordon K., and Bhattacharjee, Debotosh
- Published
- 2024
- Full Text
- View/download PDF
34. Skin Lesion Classification Based on Vision Transformer (ViT)
- Author
- Rahmouni, Abla, Sabri, My Abdelouahed, Ennaji, Asmae, and Aarab, Abdellah
- Published
- 2024
- Full Text
- View/download PDF
35. Vision Transformer-Based Emotion Detection in HCI for Enhanced Interaction
- Author
- Soni, Jayesh, Prabakar, Nagarajan, and Upadhyay, Himanshu
- Published
- 2024
- Full Text
- View/download PDF
36. Crop Classification Using Deep Learning on Time Series SAR Images: A Survey
- Author
- Saini, Naman, Dhir, Renu, and Kaur, Kamalpreet
- Published
- 2024
- Full Text
- View/download PDF
37. Multimodal Learning for Road Safety Using Vision Transformer ViT
- Author
- Rhanizar, Asmae and El Akkaoui, Zineb
- Published
- 2024
- Full Text
- View/download PDF
38. Naturalize Revolution: Unprecedented AI-Driven Precision in Skin Cancer Classification Using Deep Learning
- Author
- Mohamad Abou Ali, Fadi Dornaika, Ignacio Arganda-Carreras, Hussein Ali, and Malak Karaouni
- Subjects
- convolutional neural net (CNN), vision transformer (ViT), ImageNet models, transfer learning (TL), machine learning (ML), deep learning (DL)
- Abstract
Background: In response to the escalating global concerns surrounding skin cancer, this study aims to address the imperative for precise and efficient diagnostic methodologies. Focusing on the intricate task of eight-class skin cancer classification, the research delves into the limitations of conventional diagnostic approaches, often hindered by subjectivity and resource constraints. The transformative potential of Artificial Intelligence (AI) in revolutionizing diagnostic paradigms is underscored, emphasizing significant improvements in accuracy and accessibility. Methods: Utilizing cutting-edge deep learning models on the ISIC2019 dataset, a comprehensive analysis is conducted, employing a diverse array of pre-trained ImageNet architectures and Vision Transformer models. To counteract the inherent class imbalance in skin cancer datasets, a pioneering "Naturalize" augmentation technique is introduced, leading to the creation of two new datasets: the Naturalized 2.4K ISIC2019 and the Naturalized 7.2K ISIC2019. The "Naturalize" technique involves segmenting skin cancer images using the Segment Anything Model (SAM) and systematically adding the segmented cancer images to a background image to generate new composite images. Results: The research showcases the pivotal role of AI in mitigating the risks of misdiagnosis and under-diagnosis in skin cancer. The proficiency of AI in analyzing vast datasets and discerning subtle patterns significantly augments the diagnostic prowess of dermatologists. Quantitative measures such as confusion matrices, classification reports, and visual analyses using Score-CAM across diverse dataset variations are meticulously evaluated, culminating in 100% average accuracy, precision, recall, and F1-score on the Naturalized 7.2K ISIC2019 dataset. Conclusion: This exploration highlights the transformative capabilities of AI-driven methodologies in reshaping the landscape of skin cancer diagnosis and patient care. The attainment of 100% across crucial metrics within the Naturalized 7.2K ISIC2019 dataset paves the way for a new era in dermatological diagnostics, with unprecedented precision in the identification and classification of skin cancers.
- Published
- 2024
- Full Text
- View/download PDF
39. HybridFusionNet: Deep Learning for Multi-Stage Diabetic Retinopathy Detection
- Author
- Amar Shukla, Shamik Tiwari, and Anurag Jain
- Subjects
- diabetic retinopathy, Self Attention Network (SAN), Vision Transformer (VIT), HybridFusionNet, binary classification, multi-class classification
- Abstract
Diabetic retinopathy (DR) is one of the most common causes of visual impairment worldwide and requires reliable automated detection methods. Numerous research efforts have developed various conventional methods for early detection of DR. Research in the field of DR remains insufficient, indicating the potential for advances in diagnosis. In this paper, a hybrid model (HybridFusionNet) that integrates a vision transformer (VIT) and attention processes is presented. It improves classification in the binary (Bcl) and multi-class (Mcl) stages by utilizing deep features from the DR stages. As a result, both the SAN and VIT models improve the recognition accuracy (Acc) in both stages. The HybridFusionNet mechanism achieves a competitive improvement in the binary and multi-class stages, with an Acc of 91% in Bcl and 99% in Mcl, respectively. This illustrates that this model is suitable for a better diagnosis of DR.
- Published
- 2024
- Full Text
- View/download PDF
40. An industrial product surface anomaly detection method based on masked image modeling.
- Author
- Tang, Shancheng, Li, Heng, Dai, Fenghua, Yang, Jiqing, Jin, Zicheng, Lu, Jianhui, and Zhang, Ying
- Abstract
Current unsupervised methods for industrial product surface anomaly detection suffer from poor reconstructed image quality and difficulty detecting low-contrast anomalies, resulting in low detection accuracy. To address these problems, we propose an unsupervised masked hybrid convolutional Transformer anomaly detection model. A mask reconstruction strategy forces the model to predict missing or edited regions from the unmasked information, while convolutional blocks and the Transformer self-attention mechanism extract local features and global context at different resolutions, enhancing the model's ability to understand the interrelationships among image parts and the overall structure and improving its reconstruction ability. A method based on Gaussian difference significance is then proposed, combined with gradient magnitude similarity and colour difference, to compare reconstructed and original images from multiple perspectives and improve the model's anomaly localisation performance. We conducted extensive experiments on the industrial datasets MVTec AD and MTD to validate the effectiveness of the proposed method. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
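As a rough illustration of the mask-reconstruction idea this abstract describes, the sketch below hides random patches so a reconstruction network must infer them from the unmasked context, then scores anomalies by the reconstruction residual. The patch size, mask ratio, and the simple residual map (standing in for the paper's Gaussian-difference, gradient-similarity, and colour-difference comparison) are assumptions:

```python
import torch
import torch.nn.functional as F

def mask_patches(img, patch=16, ratio=0.4):
    """Randomly hide a fraction of non-overlapping patches so a
    reconstruction network must predict them from unmasked context
    (the mask-reconstruction strategy; patch size and ratio are guesses)."""
    B, C, H, W = img.shape
    keep = (torch.rand(B, 1, H // patch, W // patch, device=img.device) > ratio).float()
    mask = F.interpolate(keep, scale_factor=patch, mode="nearest")
    return img * mask, mask

def anomaly_map(original, reconstructed):
    """Toy per-pixel anomaly score from the reconstruction residual. The
    paper's multi-view comparison (Gaussian difference significance,
    gradient magnitude similarity, colour difference) would replace it."""
    return (original - reconstructed).abs().mean(dim=1, keepdim=True)
```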
41. Lightweight Low-Rank Adaptation Vision Transformer Framework for Cervical Cancer Detection and Cervix Type Classification.
- Author
-
Hong, Zhenchen, Xiong, Jingwei, Yang, Han, and Mo, Yu K.
- Subjects
- *
TRANSFORMER models , *CERVICAL cancer , *CONVOLUTIONAL neural networks , *EARLY detection of cancer , *DEEP learning , *PREMATURE labor - Abstract
Cervical cancer is a major health concern worldwide, highlighting the urgent need for better early detection methods to improve patient outcomes. In this study, we present a novel digital pathology classification approach that combines Low-Rank Adaptation (LoRA) with the Vision Transformer (ViT) model, aiming to make cervix type classification more efficient with a deep learning classifier that requires less training data. The key innovation is the use of LoRA, which allows for effective training of the model on smaller datasets while making the most of ViT's ability to represent visual information. This approach performs better than traditional Convolutional Neural Network (CNN) models, including Residual Networks (ResNets), particularly in performance and generalization when data are limited. Through thorough experiments and analysis across various dataset sizes, we found that our streamlined classifier accurately identifies a range of cervical anomalies. This work advances the development of sophisticated computer-aided diagnostic systems, facilitating more rapid and accurate detection of cervical cancer and thereby significantly enhancing patient care outcomes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
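Low-Rank Adaptation itself is well documented; a minimal LoRA wrapper for a ViT projection layer, in the spirit of the approach described above, might look like this. The rank, scaling, and initialization follow common LoRA practice, not necessarily this paper's settings:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-Rank Adaptation sketch: freeze a pretrained linear layer W and
    learn a rank-r update so that y = x W^T + (alpha/r) * x A^T B^T.
    Wrapping, e.g., the attention q/v projections of a ViT this way
    leaves only a small fraction of parameters trainable."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Usage: replace a pretrained projection with its LoRA-wrapped version.
proj = LoRALinear(nn.Linear(768, 768))
out = proj(torch.randn(4, 768))
```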
42. A Renovated Framework of a Convolution Neural Network with Transformer for Detecting Surface Changes from High-Resolution Remote-Sensing Images.
- Author
-
Yao, Shunyu, Wang, Han, Su, Yalu, Li, Qing, Sun, Tao, Liu, Changjun, Li, Yao, and Cheng, Deqiang
- Subjects
- *
CONVOLUTIONAL neural networks , *TRANSFORMER models , *SURFACE of the earth , *FEATURE extraction , *REMOTE sensing - Abstract
Natural hazards are considered to have a strong link with climate change and human activities. With rapid advancements in remote sensing technology, real-time monitoring and high-resolution remote-sensing images have become increasingly available, providing precise details about the Earth's surface and enabling prompt updates to support risk identification and management. This paper proposes a new network framework with a Transformer architecture and a Residual network for detecting changes in high-resolution remote-sensing images. The proposed model is trained using remote-sensing images from Shandong and Anhui Provinces of China acquired in 2021 and 2022, while imagery of one district from 2023 is used to test prediction accuracy. The performance of the proposed model is evaluated using five metrics and further compared to both convolution-based and attention-based models. The results demonstrate that the proposed structure integrates the strong image-feature-extraction capability of convolutional neural networks with the attention mechanism's ability to capture global context, resulting in significant improvements in identifying positive samples while avoiding false positives in complex image change detection. Additionally, a toolkit supporting image preprocessing is developed for practical applications. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
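A common way to combine a residual CNN with a Transformer for bi-temporal change detection is a shared encoder followed by global-context mixing over the fused feature tokens. The sketch below is one such arrangement under assumed backbone and width choices, not the paper's exact framework:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ChangeDetectorSketch(nn.Module):
    """Illustrative change detector: a shared ResNet-18 encoder extracts
    features from the two-date images, a Transformer encoder adds global
    context over the fused feature tokens, and a 1x1 conv head predicts
    a (low-resolution) change map."""
    def __init__(self, dim=128):
        super().__init__()
        r = resnet18(weights=None)
        self.encoder = nn.Sequential(*list(r.children())[:-2])  # -> 512-ch map
        self.proj = nn.Conv2d(512 * 2, dim, 1)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Conv2d(dim, 1, 1)          # per-pixel change logit

    def forward(self, img_t1, img_t2):
        f = torch.cat([self.encoder(img_t1), self.encoder(img_t2)], dim=1)
        f = self.proj(f)                           # (B, dim, h, w)
        B, C, h, w = f.shape
        tokens = self.context(f.flatten(2).transpose(1, 2))  # (B, h*w, dim)
        f = tokens.transpose(1, 2).reshape(B, C, h, w)
        return self.head(f)

# Usage with two same-scene acquisitions (sizes divisible by 32):
logits = ChangeDetectorSketch()(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
```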
43. FAformer: parallel Fourier-attention architectures benefits EEG-based affective computing with enhanced spatial information.
- Author
-
Gao, Ziheng, Huang, Jiajin, Chen, Jianhui, and Zhou, Haiyan
- Subjects
- *
AFFECTIVE computing , *DEEP learning , *TRANSFORMER models , *PARALLEL processing , *FOURIER transforms , *HUMAN beings - Abstract
The balance of brain functional segregation (i.e., processing in specialized local subsystems) and integration (i.e., global cooperation among those subsystems) is crucial for human cognition, and many deep learning models have been used to evaluate spatial information in EEG-based affective computing. However, acquiring the intrinsic spatial representation in the topology of EEG channels remains challenging. To address this issue, we propose FAformer, which enhances spatial information in EEG signals with parallel-branch architectures based on a vision transformer (ViT). In the encoder, one branch utilizes Adaptive Fourier Neural Operators (AFNO) to model global spatial patterns via the Fourier transform along the electrode-channel dimension. The other branch utilizes multi-head self-attention (MSA) to explore the dependence of emotion on different channels, which helps build key local networks. Additionally, a self-supervised learning (SSL) task of adaptive feature dissociation (AdaptiveFD) is developed to improve the distinctiveness of the spatial features generated by the parallel branches and to guarantee robustness across subjects. FAformer achieves superior performance over competitive models on the DREAMER and DEAP datasets. Moreover, analyses of the model's design rationale and hyperparameters demonstrate its effectiveness. Finally, feature visualizations reveal the global spatial connections and key local patterns learned by FAformer, which benefits EEG-based affective computing. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
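To make the parallel-branch idea concrete, the sketch below mixes EEG channel tokens with an FFT along the electrode dimension in one branch (a simplified stand-in for AFNO, which in its published form uses block-diagonal complex MLPs with soft-thresholding) and multi-head self-attention in the other. Shapes, widths, and the additive merge are assumptions:

```python
import torch
import torch.nn as nn

class ParallelFourierAttention(nn.Module):
    """Illustrative parallel branches over EEG channel tokens.
    Input x has shape (batch, n_channels, dim)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.fourier_mlp = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Global branch: FFT over the electrode-channel axis, a pointwise
        # transform of the real part, then inverse FFT back.
        g = torch.fft.fft(x, dim=1)
        g = torch.fft.ifft(self.fourier_mlp(g.real) + 1j * g.imag, dim=1).real
        # Local branch: self-attention over the channel tokens.
        l, _ = self.attn(x, x, x)
        return self.norm(x + g + l)

# Usage: a batch of 8 recordings with 32 electrode tokens of width 64.
out = ParallelFourierAttention()(torch.randn(8, 32, 64))
```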
44. Audio-visual speech synthesis using vision transformer–enhanced autoencoders with ensemble of loss functions.
- Author
-
Ghosh, Subhayu, Sarkar, Snehashis, Ghosh, Sovan, Zalkow, Frank, and Jana, Nanda Dulal
- Subjects
SPEECH synthesis ,TRANSFORMER models ,IMAGE reconstruction algorithms ,SPEECH - Abstract
Audio-visual speech synthesis (AVSS) has garnered attention in recent years for its utility in the realm of audio-visual learning. AVSS transforms one speaker's speech into another's audio-visual stream while retaining linguistic content. This approach extends existing AVSS methods by first modifying vocal features from the source to the target speaker, akin to voice conversion (VC), and then synthesizing the audio-visual stream for the target speaker, termed audio-visual synthesis (AVS). In this work, a novel AVSS approach is proposed using vision transformer (ViT)-based Autoencoders (AEs), enriched with a combination of cycle consistency and reconstruction loss functions, with the aim of enhancing synthesis quality. Leveraging ViT's attention mechanism, this method effectively captures spectral and temporal features from input speech. The combination of cycle consistency and reconstruction loss improves synthesis quality and aids in preserving essential information. The proposed framework is trained and tested on benchmark datasets, and compared extensively with state-of-the-art (SOTA) methods. The experimental results demonstrate the superiority of the proposed approach over existing SOTA models, in terms of quality and intelligibility for AVSS, indicating the potential for real-world applications. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
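The abstract pairs cycle consistency with a reconstruction loss. A minimal sketch of such an ensemble of losses for the voice-conversion stage, assuming paired source/target mel-spectrograms and two generator networks (g_src2tgt and g_tgt2src are hypothetical names, and the paper may weight or choose its terms differently):

```python
import torch.nn.functional as F

def avss_losses(src_mel, tgt_mel, g_src2tgt, g_tgt2src, lambda_cyc=10.0):
    """Illustrative combined objective: an L1 reconstruction term pulls
    converted features toward the target, and a cycle-consistency term
    requires src -> tgt -> src to recover the input."""
    fake_tgt = g_src2tgt(src_mel)
    recon_loss = F.l1_loss(fake_tgt, tgt_mel)             # reconstruction loss
    cycle_loss = F.l1_loss(g_tgt2src(fake_tgt), src_mel)  # cycle consistency
    return recon_loss + lambda_cyc * cycle_loss
```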
45. Melanoma Skin Cancer Detection Using Ensemble of Machine Learning Models Considering Deep Feature Embeddings.
- Author
-
Ghosh, Subhayu, Dhar, Sandipan, Yoddha, Raktim, Kumar, Shivam, Thakur, Abhinav Kumar, and Jana, Nanda Dulal
- Subjects
SKIN cancer ,EARLY detection of cancer ,TRANSFORMER models ,CAPSULE neural networks ,CONVOLUTIONAL neural networks - Abstract
In today's AI-driven era, deep learning (DL) algorithms play a crucial role in automatically detecting life-threatening skin cancers, thereby significantly enhancing survival rates, which makes skin cancer detection with DL an active area of exploration. While much prior research has focused on single-model approaches, ensembles of multiple models can enhance classification accuracy. Previous studies relied mainly on deep convolutional neural networks (DCNNs), which have limitations in capturing global features; recent advancements have introduced capsule networks (Caps-Net) and vision transformers (ViT) for more effective feature extraction. In our study, we harness DCNN, Caps-Net, and ViT frameworks to extract diverse image embeddings. The resulting feature vectors are used to train an ensemble based on a majority-voting mechanism over five machine-learning models: Random Forest, XGBoost, SVM, KNN, and logistic regression. Incorporating this ensemble mechanism leads to a significant improvement in overall performance. Notably, the proposed ensemble is lightweight and achieves an impressive accuracy of 91.6% on the melanoma skin cancer dataset, underscoring its superiority over individual state-of-the-art (SOTA) models. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
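The described ensemble maps naturally onto scikit-learn's VotingClassifier. Below is a sketch with the five models the abstract names and hard (majority) voting; the embedding matrix is a random placeholder standing in for the concatenated DCNN/Caps-Net/ViT features, and xgboost must be installed alongside scikit-learn:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Placeholder data: rows would be deep feature embeddings per image,
# labels would be benign/malignant.
X, y = np.random.rand(200, 512), np.random.randint(0, 2, 200)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier()),
        ("xgb", XGBClassifier()),
        ("svm", SVC()),                  # hard voting needs only predictions
        ("knn", KNeighborsClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",                       # majority-vote mechanism
)
ensemble.fit(X, y)
preds = ensemble.predict(X)
```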
46. Synergistic Detection of Multimodal Fake News Leveraging TextGCN and Vision Transformer.
- Author
-
M, Visweswaran, Mohan, Jayanth, Sachin Kumar, S, and Soman, K P
- Subjects
TRANSFORMER models ,FAKE news ,CONVOLUTIONAL neural networks ,FEATURE extraction ,MULTIMODAL user interfaces ,DIGITAL technology ,CELL fusion - Abstract
In today's digital age, the rapid spread of fake news is a pressing concern. Fake news, whether intentional or inadvertent, manipulates public sentiment and threatens the integrity of online information, making effective detection and prevention methods vital. Detecting multimodal fake news is especially intricate: unlike traditional news articles that rely predominantly on textual content, multimodal fake news leverages the persuasive power of visual elements, and manipulated images can significantly sway individuals' perceptions and beliefs. Our research introduces a fusion-based methodology for multimodal fake news identification that harnesses Text Graph Convolutional Neural Networks (TextGCN) and Vision Transformers (ViT) to utilise both text and image modalities. The proposed methodology first preprocesses textual content with TextGCN, capturing intricate structural dependencies among words and phrases, while visual features are extracted from the associated images using ViT. A fusion mechanism then integrates these modalities, yielding superior embeddings. What sets our approach apart from existing techniques is its integration of graph-based feature extraction through TextGCN: whereas previous methods rely predominantly on text or image features alone, our approach harnesses the additional semantic information and intricate relationships within a graph structure, in addition to image embeddings, enabling a more comprehensive understanding of the data. Our experiments demonstrate the strong performance of this fusion-based approach, which outperformed single-modality text or image models and achieved an accuracy of 94.17% using a neural network after fusion. By integrating graph-based representations and semantic relationships, the technique represents a significant stride in addressing the challenges posed by multimodal fake news. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
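The fusion step itself can be sketched as a late fusion of the two modality embeddings followed by a small classifier. The embedding sizes and classifier shape below are assumptions; the TextGCN and ViT encoders that would produce the inputs are not shown:

```python
import torch
import torch.nn as nn

class MultimodalFusionSketch(nn.Module):
    """Illustrative late fusion: a TextGCN document embedding and a ViT
    image embedding are concatenated and classified as real vs. fake."""
    def __init__(self, text_dim=200, img_dim=768):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + img_dim, 256), nn.ReLU(),
            nn.Dropout(0.5), nn.Linear(256, 2),
        )

    def forward(self, text_emb, img_emb):
        return self.classifier(torch.cat([text_emb, img_emb], dim=1))

# Usage with dummy embeddings for a batch of 4 articles:
logits = MultimodalFusionSketch()(torch.randn(4, 200), torch.randn(4, 768))
```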
47. Ultrasound Image Analysis with Vision Transformers—Review.
- Author
-
Vafaeezadeh, Majid, Behnam, Hamid, and Gifani, Parisa
- Subjects
- *
TRANSFORMER models , *IMAGE analysis , *ULTRASONIC imaging , *CONVOLUTIONAL neural networks , *COMPUTER vision - Abstract
Ultrasound (US) has become a widely used imaging modality in clinical practice, characterized by rapidly evolving technology, distinct advantages, and unique challenges such as low imaging quality and high variability. There is a need to develop advanced automatic US image analysis methods to enhance its diagnostic accuracy and objectivity. Vision transformers, a recent innovation in machine learning, have demonstrated significant potential in various research fields, including general image analysis and computer vision, due to their capacity to process large datasets and learn complex patterns, and their suitability for automatic US image analysis tasks such as classification, detection, and segmentation has been recognized. This review provides an introduction to vision transformers and discusses their applications in specific US image analysis tasks, while also addressing open challenges and potential future trends in medical US image analysis. Vision transformers have shown promise in enhancing the accuracy and efficiency of ultrasound image analysis and, as the technology progresses, are expected to play an increasingly important role in ultrasound-based diagnosis and treatment. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
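As a concrete entry point to the classification use case this review surveys, one common recipe is fine-tuning an ImageNet-pretrained ViT via the timm library. The model variant and class count below are placeholders, and pretrained=True downloads weights on first use:

```python
import timm
import torch

# Fine-tuning starting point for, e.g., a 3-class ultrasound task.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=3)

dummy_batch = torch.randn(2, 3, 224, 224)  # B-mode frames resized to 224x224
logits = model(dummy_batch)                # shape: (2, 3)
```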
48. Naturalize Revolution: Unprecedented AI-Driven Precision in Skin Cancer Classification Using Deep Learning.
- Author
-
Abou Ali, Mohamad, Dornaika, Fadi, Arganda-Carreras, Ignacio, Ali, Hussein, and Karaouni, Malak
- Subjects
ARTIFICIAL intelligence ,SKIN cancer ,DEEP learning ,MACHINE learning ,CANCER diagnosis - Abstract
Background: In response to escalating global concern surrounding skin cancer, this study addresses the need for precise and efficient diagnostic methodologies. Focusing on the task of eight-class skin cancer classification, the research examines the limitations of conventional diagnostic approaches, which are often hindered by subjectivity and resource constraints, and underscores the potential of Artificial Intelligence (AI) to improve diagnostic accuracy and accessibility. Methods: A comprehensive analysis is conducted on the ISIC2019 dataset using a diverse array of pre-trained ImageNet architectures and Vision Transformer models. To counteract the inherent class imbalance in skin cancer datasets, a "Naturalize" augmentation technique is introduced, yielding two new datasets (Naturalized 2.4K ISIC2019 and Naturalized 7.2K ISIC2019) that drive gains in classification accuracy. "Naturalize" segments skin cancer images using the Segment Anything Model (SAM) and systematically adds the segmented cancer images to a background image to generate new composite images. Results: The research highlights the role of AI in mitigating the risks of misdiagnosis and under-diagnosis in skin cancer: its capacity to analyze vast datasets and discern subtle patterns augments the diagnostic capability of dermatologists. Confusion matrices, classification reports, and visual analyses using Score-CAM are evaluated across the dataset variations, culminating in 100% average accuracy, precision, recall, and F1-score on the Naturalized 7.2K ISIC2019 dataset. Conclusion: These results illustrate the potential of AI-driven methodologies to reshape skin cancer diagnosis and patient care. The attainment of 100% across key metrics on the Naturalized 7.2K ISIC2019 dataset demonstrates how AI-powered solutions can surmount the challenges inherent in skin cancer diagnosis, pointing toward greater precision and efficacy in the identification and classification of skin cancers. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
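From the description above, the core of "Naturalize" is compositing SAM-segmented lesions onto background skin images to synthesize minority-class samples. A minimal sketch of that compositing step, assuming a binary lesion mask already produced by a separate SAM pass (not shown) and matching the lesion crop in size:

```python
import numpy as np
from PIL import Image

def composite(lesion_rgb: np.ndarray, mask: np.ndarray,
              background: Image.Image, xy=(50, 50)) -> Image.Image:
    """Paste a segmented lesion onto a background image to create a new
    composite training sample. `lesion_rgb` is an HxWx3 uint8 crop and
    `mask` is an HxW binary array from a prior segmentation step; the
    paste position `xy` would normally be randomized per sample."""
    lesion = Image.fromarray(lesion_rgb)
    alpha = Image.fromarray(mask.astype(np.uint8) * 255).convert("L")
    out = background.copy()
    out.paste(lesion, xy, alpha)   # alpha mask keeps only the lesion pixels
    return out
```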
49. Class-Aware Self-Distillation for Remote Sensing Image Scene Classification
- Author
-
Bin Wu, Siyuan Hao, and Wei Wang
- Subjects
Deep learning ,knowledge distillation (KD) ,remote sensing image ,scene classification ,vision transformer (ViT) ,Ocean engineering ,TC1501-1800 ,Geophysics. Cosmic physics ,QC801-809 - Abstract
Currently, convolutional neural networks (CNNs) and vision transformers (ViTs) are widely adopted as the predominant neural network architectures for remote sensing image scene classification. Although CNNs have lower computational complexity, ViTs have a higher performance ceiling, making both suitable as backbone networks for remote sensing scene classification tasks. However, remote sensing imagery has high intraclass variation and interclass similarity, which poses a challenge for existing methods. To address this issue, we propose the class-aware self-distillation (CASD) framework, which uses an end-to-end distillation mechanism to mine class-aware knowledge and effectively reduce the impact of high intraclass variation and interclass similarity in remote sensing imagery. Specifically, our approach constructs pairs of images: similar pairs consisting of images from the same class, and dissimilar pairs consisting of images from different classes. We then apply a custom distillation loss over the corresponding probability distributions so that the distributions of similar pairs become more consistent and those of dissimilar pairs become more distinct. In addition, a learnable coefficient $\alpha$ in the distillation loss further strengthens the network's ability to capture class-aware knowledge. Experiments demonstrate that CASD outperforms other methods on four publicly available datasets, and ablation studies confirm the effectiveness of the approach.
- Published
- 2024
- Full Text
- View/download PDF
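The exact loss is not given in the abstract; the sketch below is a guess at its shape, pulling same-class distributions together via a KL term and pushing different-class distributions apart with a hinged margin, with the learnable $\alpha$ interpreted here as a sharpening scale on the paired distributions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassAwareDistillLoss(nn.Module):
    """Illustrative class-aware distillation objective over image pairs.
    This is a plausible reading of the abstract, not the authors' exact
    formulation."""
    def __init__(self, margin=1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(1.0))  # learnable scale
        self.margin = margin

    def forward(self, logits_a, logits_b, same_class):
        scale = self.alpha.exp()                      # keep the scale positive
        p = F.log_softmax(logits_a * scale, dim=1)
        q = F.softmax(logits_b * scale, dim=1)
        kl = F.kl_div(p, q, reduction="none").sum(dim=1)   # per-pair divergence
        # Similar pairs: minimize KL; dissimilar pairs: push KL past a margin.
        loss = torch.where(same_class, kl, F.relu(self.margin - kl))
        return loss.mean()

# Usage: logits for the two images of each pair plus a same-class flag.
crit = ClassAwareDistillLoss()
loss = crit(torch.randn(8, 10), torch.randn(8, 10), torch.rand(8) > 0.5)
```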
50. Pansharpening via Multiscale Embedding and Dual Attention Transformers
- Author
-
Wensheng Fan, Fan Liu, and Jingzhi Li
- Subjects
Attention mechanism ,image fusion ,multiscale embedding ,pansharpening ,remote sensing ,vision transformer (ViT) ,Ocean engineering ,TC1501-1800 ,Geophysics. Cosmic physics ,QC801-809 - Abstract
Pansharpening is a fundamental and crucial image processing task for many remote sensing applications: it generates a high-resolution multispectral image by fusing a low-resolution multispectral image and a high-resolution panchromatic image. Recently, vision transformers have been introduced into the pansharpening task to exploit global contextual information. However, modeling long-range and local dependencies and learning multiscale features are all essential to pansharpening, and learning and exploiting these diverse sources of information is challenging and limits the performance and efficiency of existing pansharpening methods. To solve this issue, we propose a pansharpening network based on multiscale embedding and dual attention transformers (MDPNet). Specifically, a multiscale embedding block embeds multiscale information from the images into vectors, so the transformers only need to process one multispectral and one panchromatic embedding sequence to use multiscale information efficiently. Furthermore, an additive hybrid attention transformer fuses the embedding sequences in an additive injection manner, and a channel self-attention transformer exploits channel correlations for high-quality detail generation. Experiments on the QuickBird and WorldView-3 datasets demonstrate that MDPNet outperforms state-of-the-art methods both visually and quantitatively, with low running time. Ablation studies further verify the effectiveness of the proposed multiscale embedding and transformers in pansharpening.
- Published
- 2024
- Full Text
- View/download PDF
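Of the proposed blocks, the channel self-attention transformer is the easiest to sketch: attention computed across spectral bands rather than spatial positions, with a residual detail injection. The widths, scaling, and 1x1-conv projections below are assumptions, not the MDPNet design:

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Illustrative channel self-attention for pansharpening-style detail
    generation: each output band is a correlation-weighted mix of all
    bands, added back to the input as a residual."""
    def __init__(self, channels=8):
        super().__init__()
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.out = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        q, k, v = self.qkv(x).flatten(2).chunk(3, dim=1)   # each (B, C, H*W)
        attn = torch.softmax(q @ k.transpose(1, 2) / (H * W) ** 0.5, dim=-1)
        y = (attn @ v).reshape(B, C, H, W)                 # (B, C, C) @ (B, C, H*W)
        return x + self.out(y)                             # residual injection

# Usage on an 8-band multispectral feature map:
out = ChannelSelfAttention()(torch.randn(2, 8, 64, 64))
```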