612 results for "Vision transformers"
Search Results
2. AdaGlimpse: Active Visual Exploration with Arbitrary Glimpse Position and Scale
- Author
-
Pardyl, Adam, Wronka, Michał, Wołczyk, Maciej, Adamczewski, Kamil, Trzciński, Tomasz, and Zieliński, Bartosz (Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül, editors)
- Published
- 2025
- Full Text
- View/download PDF
3. Uncertainty-Aware Vision Transformers for Medical Image Analysis
- Author
-
Erick, Franciskus Xaverius, Rezaei, Mina, Müller, Johanna Paula, and Kainz, Bernhard (Sudre, Carole H., Mehta, Raghav, Ouyang, Cheng, Qin, Chen, Rakic, Marianne, and Wells, William M., editors)
- Published
- 2025
- Full Text
- View/download PDF
4. Current status and prospects of artificial intelligence in breast cancer pathology: convolutional neural networks to prospective Vision Transformers.
- Author
-
Katayama, Ayaka, Aoki, Yuki, Watanabe, Yukako, Horiguchi, Jun, Rakha, Emad A., and Oyama, Tetsunari
- Subjects
-
Transformer models, Convolutional neural networks, Artificial intelligence, Deep learning, Cancer diagnosis, Surgical pathology
- Abstract
Breast cancer is the most prevalent cancer among women, and its diagnosis requires the accurate identification and classification of histological features for effective patient management. Artificial intelligence, particularly through deep learning, represents the next frontier in cancer diagnosis and management. Notably, the use of convolutional neural networks and emerging Vision Transformers (ViT) has been reported to automate pathologists' tasks, including tumor detection and classification, in addition to improving the efficiency of pathology services. Deep learning applications have also been extended to the prediction of protein expression, molecular subtype, mutation status, therapeutic efficacy, and outcome prediction directly from hematoxylin and eosin-stained slides, bypassing the need for immunohistochemistry or genetic testing. This review explores the current status and prospects of deep learning in breast cancer diagnosis with a focus on whole-slide image analysis. Artificial intelligence applications are increasingly applied to many tasks in breast pathology ranging from disease diagnosis to outcome prediction, thus serving as valuable tools for assisting pathologists and supporting breast cancer management. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
5. A systematic review of deep learning techniques for plant diseases.
- Author
-
Pacal, Ishak, Kunduracioglu, Ismail, Alma, Mehmet Hakki, Deveci, Muhammet, Kadry, Seifedine, Nedoma, Jan, Slany, Vlastimil, and Martinek, Radek
- Abstract
Agriculture is one of the most crucial sectors, meeting the fundamental food needs of humanity. Plant diseases increase food economic and food security concerns for countries and disrupt their agricultural planning. Traditional methods for detecting plant diseases require a lot of labor and time. Consequently, many researchers and institutions strive to address these issues using advanced technological methods. Deep learning-based plant disease detection offers considerable progress and hope compared to classical methods. When trained with large and high-quality datasets, these technologies robustly detect diseases on plant leaves in early stages. This study systematically reviews the application of deep learning techniques in plant disease detection by analyzing 160 research articles from 2020 to 2024. The studies are examined in three different areas: classification, detection, and segmentation of diseases on plant leaves, while also thoroughly reviewing publicly available datasets. This systematic review offers a comprehensive assessment of the current literature, detailing the most popular deep learning architectures, the most frequently studied plant diseases, datasets, encountered challenges, and various perspectives. It provides new insights for researchers working in the agricultural sector. Moreover, it addresses the major challenges in the field of disease detection in agriculture. Thus, this study offers valuable information and a suitable solution based on deep learning applications for agricultural sustainability. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
6. Computationally efficient deep learning models for diabetic retinopathy detection: a systematic literature review.
- Author
-
Haq, Nazeef Ul, Waheed, Talha, Ishaq, Kashif, Hassan, Muhammad Awais, Safie, Nurhizam, Elias, Nur Fazidah, and Shoaib, Muhammad
- Abstract
Diabetic retinopathy, often resulting from conditions like diabetes and hypertension, is a leading cause of blindness globally. With diabetes affecting millions worldwide and anticipated to rise significantly, early detection becomes paramount. The survey scrutinizes existing literature, revealing a noticeable absence of consideration for computational complexity aspects in deep learning models. Notably, most researchers concentrate on employing deep learning models, and there is a lack of comprehensive surveys on the role of vision transformers in enhancing the efficiency of these models for DR detection. This study stands out by presenting a systematic review, exclusively considering 84 papers published in reputable academic journals to ensure a focus on mature research. The distinctive feature of this Systematic Literature Review (SLR) lies in its thorough investigation of computationally efficient approaches and models for DR detection. It sheds light on the incorporation of vision transformers into deep learning models, highlighting their significant contribution to improving accuracy. Moreover, the research outlines clear objectives related to the identified problem, giving rise to specific research questions. Following an assessment of relevant literature, data is extracted from digital archives. Additionally, in light of the results obtained from this SLR, a taxonomy for the detection of diabetic retinopathy has been presented. The study also highlights key research challenges and proposes potential avenues for further investigation in the field of detecting diabetic retinopathy. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
7. ViTDroid: Vision Transformers for Efficient, Explainable Attention to Malicious Behavior in Android Binaries.
- Author
-
Syed, Toqeer Ali, Nauman, Mohammad, Khan, Sohail, Jan, Salman, and Zuhairi, Megat F.
- Subjects
-
Transformer models, Mobile operating systems, Visual fields, Deep learning, Modern society, Malware
- Abstract
Smartphones are intricately connected to the modern society. The two widely used mobile phone operating systems, iOS and Android, profoundly affect the lives of millions of people. Android presently holds a market share of close to 71% among these two. As a result, if personal information is not securely protected, it is at tremendous risk. On the other hand, mobile malware has seen a year-on-year increase of more than 42% globally in 2022 mid-year. Any group of human professionals would have a very tough time detecting and removing all of this malware. For this reason, deep learning in particular has been used recently to overcome this problem. Deep learning models, however, were primarily created for picture analysis. Despite the fact that these models have shown promising findings in the field of vision, it has been challenging to fully comprehend what the characteristics recovered by deep learning models are in the area of malware. Furthermore, the actual potential of deep learning for malware analysis has not yet been fully realized due to the translation invariance trait of well-known models based on CNN. In this paper, we present ViTDroid, a novel model based on vision transformers for the deep learning-based analysis of opcode sequences of Android malware samples from large real-world datasets. We have been able to achieve a false positive rate of 0.0019 as compared to the previous best of 0.0021. However, this incremental improvement is not the major contribution of our work. Our model aims to make explainable predictions, i.e., it not only performs the classification of malware with high accuracy, but it also provides insights into the reasons for this classification. The model is able to pinpoint the malicious behavior-causing instructions in the malware samples. This means that our model can actually aid in the field of malware analysis itself by providing insights to human experts, thus leading to further improvements in this field. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
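The ViTDroid entry above describes running a transformer over opcode sequences extracted from Android binaries rather than over image patches. The paper's actual architecture is not reproduced here; the snippet below is only a minimal, hypothetical PyTorch sketch of that general idea (opcode embedding, classification token, standard transformer encoder), with all names and hyperparameters chosen for illustration.

```python
import torch
import torch.nn as nn

class OpcodeTransformerClassifier(nn.Module):
    """Toy sketch: classify an opcode sequence as malicious/benign (not the actual ViTDroid model)."""
    def __init__(self, vocab_size=256, d_model=128, nhead=4, num_layers=2, max_len=512, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len + 1, d_model))   # learned positional embeddings
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))             # classification token
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, opcode_ids):                          # (batch, seq_len) integer opcode indices
        x = self.embed(opcode_ids)
        cls = self.cls.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, x], dim=1)                 # prepend the classification token
        tokens = tokens + self.pos[:, : tokens.size(1)]
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                      # logits read off the [CLS] position

logits = OpcodeTransformerClassifier()(torch.randint(0, 256, (4, 512)))
print(logits.shape)  # torch.Size([4, 2])
```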
8. Enhanced Infant Movement Analysis Using Transformer-Based Fusion of Diverse Video Features for Neurodevelopmental Monitoring.
- Author
-
Turner, Alexander and Sharkey, Don
- Subjects
-
Artificial neural networks, Convolutional neural networks, Transformer models, Infant development, Machine learning
- Abstract
Neurodevelopment is a highly intricate process, and early detection of abnormalities is critical for optimizing outcomes through timely intervention. Accurate and cost-effective diagnostic methods for neurological disorders, particularly in infants, remain a significant challenge due to the heterogeneity of data and the variability in neurodevelopmental conditions. This study recruited twelve parent–infant pairs, with infants aged 3 to 12 months. Approximately 25 min of 2D video footage was captured, documenting natural play interactions between the infants and toys. We developed a novel, open-source method to classify and analyse infant movement patterns using deep learning techniques, specifically employing a transformer-based fusion model that integrates multiple video features within a unified deep neural network. This approach significantly outperforms traditional methods reliant on individual video features, achieving an accuracy of over 90%. Furthermore, a sensitivity analysis revealed that the pose estimation contributed far less to the model's output than the pre-trained transformer and convolutional neural network (CNN) components, providing key insights into the relative importance of different feature sets. By providing a more robust, accurate and low-cost analysis of movement patterns, our work aims to enhance the early detection and potential prediction of neurodevelopmental delays, whilst providing insight into the functioning of the transformer-based fusion models of diverse video features. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
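The infant-movement entry above describes a transformer-based fusion model that integrates several per-clip video features (pose estimates, CNN features, pre-trained transformer features). The paper's exact fusion design is not given here, so the following is a generic, hypothetical sketch in which each feature stream is projected to a common width and treated as one token of a small transformer encoder; the feature dimensions and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class FeatureFusionTransformer(nn.Module):
    """Hypothetical sketch: fuse several per-clip feature vectors (e.g. pose, CNN, ViT embeddings) with a transformer."""
    def __init__(self, feature_dims=(34, 2048, 768), d_model=256, num_classes=2):
        super().__init__()
        # One linear projection per feature stream so all streams share a common token width.
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in feature_dims])
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, streams):                            # list of (batch, dim_i) tensors
        tokens = torch.stack([p(s) for p, s in zip(self.proj, streams)], dim=1)
        fused = self.encoder(tokens).mean(dim=1)           # average-pool the fused tokens
        return self.head(fused)

model = FeatureFusionTransformer()
out = model([torch.randn(8, 34), torch.randn(8, 2048), torch.randn(8, 768)])
print(out.shape)  # torch.Size([8, 2])
```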
9. Vision transformers in domain adaptation and domain generalization: a study of robustness.
- Author
-
Alijani, Shadi, Fayyad, Jamil, and Najjaran, Homayoun
- Subjects
-
Transformer models, Data augmentation, Data distribution, Research personnel, Scientific community
- Abstract
Deep learning models are often evaluated in scenarios where the data distribution is different from those used in the training and validation phases. The discrepancy presents a challenge for accurately predicting the performance of models once deployed on the target distribution. Domain adaptation and generalization are widely recognized as effective strategies for addressing such shifts, thereby ensuring reliable performance. The recent promising results in applying vision transformers in computer vision tasks, coupled with advancements in self-attention mechanisms, have demonstrated their significant potential for robustness and generalization in handling distribution shifts. Motivated by the increased interest from the research community, our paper investigates the deployment of vision transformers in domain adaptation and domain generalization scenarios. For domain adaptation methods, we categorize research into feature-level, instance-level, model-level adaptations, and hybrid approaches, along with other categorizations with respect to diverse strategies for enhancing domain adaptation. Similarly, for domain generalization, we categorize research into multi-domain learning, meta-learning, regularization techniques, and data augmentation strategies. We further classify diverse strategies in research, underscoring the various approaches researchers have taken to address distribution shifts by integrating vision transformers. The inclusion of comprehensive tables summarizing these categories is a distinct feature of our work, offering valuable insights for researchers. These findings highlight the versatility of vision transformers in managing distribution shifts, crucial for real-world applications, especially in critical safety and decision-making scenarios. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
10. Vision transformer introduces a new vitality to the classification of renal pathology.
- Author
-
Zhang, Ji, Lu, Jia Dan, Chen, Bo, Pan, ShuFang, Jin, LingWei, Zheng, Yu, and Pan, Min
- Abstract
Recent advancements in computer vision within the field of artificial intelligence (AI) have made significant inroads into the medical domain. However, the application of AI for classifying renal pathology remains challenging due to the subtle variations in multiple renal pathological classifications. Vision Transformers (ViT), an adaptation of the Transformer model for image recognition, have demonstrated superior capabilities in capturing global features and providing greater explainability. In our study, we developed a ViT model using a diverse set of stained renal histopathology images to evaluate its effectiveness in classifying renal pathology. A total of 1861 whole slide images (WSI) stained with HE, MASSON, PAS, and PASM were collected from 635 patients. Renal tissue images were then extracted, tiled, and categorized into 14 classes on the basis of renal pathology. We employed the classic ViT model from the Timm library, utilizing images sized 384 × 384 pixels with 16 × 16 pixel patches, to train the classification model. A comparative analysis was conducted to evaluate the performance of the ViT model against traditional convolutional neural network (CNN) models. The results indicated that the ViT model demonstrated superior recognition ability (accuracy: 0.96–0.99). Furthermore, we visualized the identification process of the ViT models to investigate potentially significant pathological ultrastructures. Our study demonstrated that ViT models outperformed CNN models in accurately classifying renal pathology. Additionally, ViT models are able to focus on specific, significant structures within renal histopathology, which could be crucial for identifying novel and meaningful pathological features in the diagnosis and treatment of renal disease. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
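The renal-pathology entry above states that the classic ViT from the Timm library was trained on 384 × 384 tiles with 16 × 16 patches for 14 classes. A minimal sketch of how such a model could be instantiated with timm follows; the specific pretrained checkpoint, preprocessing, and batch are illustrative assumptions, not the study's code.

```python
import timm
import torch

# ViT with 16x16 patches at 384x384 input, classification head resized to 14 renal-pathology classes.
model = timm.create_model("vit_base_patch16_384", pretrained=True, num_classes=14)

# timm exposes the preprocessing configuration that matches the chosen checkpoint.
cfg = timm.data.resolve_data_config({}, model=model)
transform = timm.data.create_transform(**cfg)

dummy = torch.randn(2, 3, 384, 384)        # stand-in for a batch of tile images
logits = model(dummy)
print(logits.shape)                        # torch.Size([2, 14])
```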
11. Combining Local and Global Feature Extraction for Brain Tumor Classification: A Vision Transformer and iResNet Hybrid Model.
- Author
-
Jaffar, Amar Y.
- Abstract
Early diagnosis of brain tumors is crucial for effective treatment and patient prognosis. Traditional Convolutional Neural Networks (CNNs) have shown promise in medical imaging but have limitations in capturing long-range dependencies and contextual information. Vision Transformers (ViTs) address these limitations by leveraging self-attention mechanisms to capture both local and global features. This study aims to enhance brain tumor classification by integrating an improved ResNet (iResNet) architecture with a ViT, creating a robust hybrid model that combines the local feature extraction capabilities of iResNet with the global feature extraction strengths of ViTs. This integration results in a significant improvement in classification accuracy, achieving an overall accuracy of 99.2%, outperforming established models such as InceptionV3, ResNet, and DenseNet. High precision, recall, and F1 scores were observed across all tumor classes, demonstrating the model's robustness and reliability. The significance of the proposed method lies in its ability to effectively capture both local and global features, leading to superior performance in brain tumor classification. This approach offers a powerful tool for clinical decision-making, improving early detection and treatment planning, ultimately contributing to better patient outcomes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
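The hybrid model in the brain-tumor entry above concatenates local CNN (iResNet) features with global ViT features before classification. An off-the-shelf iResNet is not assumed here, so the sketch below substitutes a standard ResNet-50 backbone purely for illustration; the fusion head and layer choices are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn
import timm

class HybridCnnVit(nn.Module):
    """Illustrative hybrid: concatenate pooled CNN and ViT features, then classify (ResNet-50 stands in for iResNet)."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.cnn = timm.create_model("resnet50", pretrained=True, num_classes=0)               # 2048-d pooled features
        self.vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)   # 768-d token features
        self.head = nn.Linear(self.cnn.num_features + self.vit.num_features, num_classes)

    def forward(self, x):                                  # x: (batch, 3, 224, 224)
        feats = torch.cat([self.cnn(x), self.vit(x)], dim=1)
        return self.head(feats)

print(HybridCnnVit()(torch.randn(2, 3, 224, 224)).shape)   # torch.Size([2, 4])
```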
12. Underwater image enhancement using lightweight vision transformer.
- Author
-
Daud, Muneeba, Afzal, Hammad, and Mahmood, Khawir
- Subjects
Transformer models, Image recognition (computer vision), Image reconstruction, Image intensifiers, Feature extraction, Deep learning
- Abstract
Deep learning-based models have recently shown a strong potential in Underwater Image Enhancement (UIE) that are satisfying and have the right colors and details, but these methods significantly increase the parameters and complexity of the image processing models and therefore cannot be deployed directly to the edge devices. Vision Transformers (ViT) based architectures have recently produced amazing results in many vision tasks such as image classification, super-resolution, and image restoration. In this study, we introduced a lightweight Context-Aware Vision Transformer (CAViT), based on the Mean Head tokenization strategy and uses a self-attention mechanism in a single branch module that is effective at simulating long-distance dependencies and global features. To further improve the image quality we proposed an efficient variant of our model which derived results by applying White Balancing and Gamma Correction methods. We evaluated our model on two standard datasets, i.e., Large-Scale Underwater Image (LSUI) and Underwater Image Enhancement Benchmark Dataset (UIEB), which subsequently contributed towards more generalized results. Overall findings indicate that our real-time UIE model outperforms other Deep Learning based models by reducing the model complexity and improving the image quality (i.e., 0.6 dB PSNR improvement while using only 0.3% parameters and 0.4% float operations). [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
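The CAViT entry above mentions an efficient variant whose outputs are post-processed with white balancing and gamma correction. Below is a generic NumPy sketch of those two classical operations (gray-world white balance and power-law gamma); the paper's exact parameters are unknown, so the values are illustrative.

```python
import numpy as np

def gray_world_white_balance(img):
    """Gray-world assumption: scale each channel so channel means match the global mean. img is float RGB in [0, 1]."""
    means = img.reshape(-1, 3).mean(axis=0)
    gain = means.mean() / (means + 1e-8)
    return np.clip(img * gain, 0.0, 1.0)

def gamma_correction(img, gamma=0.7):
    """Power-law correction; gamma < 1 brightens dark underwater frames."""
    return np.clip(img, 0.0, 1.0) ** gamma

enhanced = np.random.rand(480, 640, 3)          # stand-in for a network output frame
post = gamma_correction(gray_world_white_balance(enhanced), gamma=0.7)
print(post.shape, post.min() >= 0, post.max() <= 1)
```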
13. Dmg2Former-AR: Vision Transformers with Adaptive Rescaling for High-Resolution Structural Visual Inspection.
- Author
-
Eltouny, Kareem, Sajedi, Seyedomid, and Liang, Xiao
- Subjects
-
Transformer models, Structural health monitoring, Inspection & review, Computer vision, Deep learning
- Abstract
Developments in drones and imaging hardware technology have opened up countless possibilities for enhancing structural condition assessments and visual inspections. However, processing the inspection images requires considerable work hours, leading to delays in the assessment process. This study presents a semantic segmentation architecture that integrates vision transformers with Laplacian pyramid scaling networks, enabling rapid and accurate pixel-level damage detection. Unlike conventional methods that often lose critical details through resampling or cropping high-resolution images, our approach preserves essential inspection-related information such as microcracks and edges using non-uniform image rescaling networks. This innovation allows for detailed damage identification of high-resolution images while significantly reducing the computational demands. Our main contributions in this study are: (1) proposing two rescaling networks that together allow for processing high-resolution images while significantly reducing the computational demands; and (2) proposing Dmg2Former, a low-resolution segmentation network with a Swin Transformer backbone that leverages the saved computational resources to produce detailed visual inspection masks. We validate our method through a series of experiments on publicly available visual inspection datasets, addressing various tasks such as crack detection and material identification. Finally, we examine the computational efficiency of the adaptive rescalers in terms of multiply–accumulate operations and GPU-memory requirements. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
14. Data Augmentation in Histopathological Classification: An Analysis Exploring GANs with XAI and Vision Transformers.
- Author
-
Rozendo, Guilherme Botazzo, Garcia, Bianca Lançoni de Oliveira, Borgue, Vinicius Augusto Toreli, Lumini, Alessandra, Tosta, Thaína Aparecida Azevedo, Nascimento, Marcelo Zanchetta do, and Neves, Leandro Alves
- Subjects
Generative adversarial networks, Data augmentation, Transformer models, Artificial intelligence, Image recognition (computer vision)
- Abstract
Generative adversarial networks (GANs) create images by pitting a generator (G) against a discriminator (D) network, aiming to find a balance between the networks. However, achieving this balance is difficult because G is trained based on just one value representing D's prediction, and only D can access image features. We introduce a novel approach for training GANs using explainable artificial intelligence (XAI) to enhance the quality and diversity of generated images in histopathological datasets. We leverage XAI to extract feature information from D and incorporate it into G via the loss function, a unique strategy not previously explored in this context. We demonstrate that this approach enriches the training with relevant information and promotes improved quality and more variability in the artificial images, decreasing the FID by up to 32.7% compared to traditional methods. In the data augmentation task, these images improve the classification accuracy of Transformer models by up to 3.81% compared to models without data augmentation and up to 3.01% compared to traditional GAN data augmentation. The Saliency method provides G with the most informative feature information. Overall, our work highlights the potential of XAI for enhancing GAN training and suggests avenues for further exploration in this field. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
15. On-Edge Deployment of Vision Transformers for Medical Diagnostics Using the Kvasir-Capsule Dataset.
- Author
-
Varam, Dara, Khalil, Lujain, and Shanableh, Tamer
- Subjects
Transformer models, Image recognition (computer vision), Gastrointestinal diseases, Khat, Models & modelmaking
- Abstract
This paper aims to explore the possibility of utilizing vision transformers (ViTs) for on-edge medical diagnostics by experimenting with the Kvasir-Capsule image classification dataset, a large-scale image dataset of gastrointestinal diseases. Quantization techniques made available through TensorFlow Lite (TFLite), including post-training float-16 (F16) quantization and quantization-aware training (QAT), are applied to achieve reductions in model size, without compromising performance. The seven ViT models selected for this study are EfficientFormerV2S2, EfficientViT_B0, EfficientViT_M4, MobileViT_V2_050, MobileViT_V2_100, MobileViT_V2_175, and RepViT_M11. Three metrics are considered when analyzing a model: (i) F1-score, (ii) model size, and (iii) performance-to-size ratio, where performance is the F1-score and size is the model size in megabytes (MB). In terms of F1-score, we show that MobileViT_V2_175 with F16 quantization outperforms all other models with an F1-score of 0.9534. On the other hand, MobileViT_V2_050 trained using QAT was scaled down to a model size of 1.70 MB, making it the smallest model amongst the variations this paper examined. MobileViT_V2_050 also achieved the highest performance-to-size ratio of 41.25. Despite preferring smaller models for latency and memory concerns, medical diagnostics cannot afford poor-performing models. We conclude that MobileViT_V2_175 with F16 quantization is our best-performing model, with a small size of 27.47 MB, providing a benchmark for lightweight models on the Kvasir-Capsule dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
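The Kvasir-Capsule entry above shrinks ViT models with TensorFlow Lite post-training float-16 quantization and reports a performance-to-size ratio (F1-score divided by model size in MB). A minimal sketch of the standard TFLite float-16 conversion path and that ratio follows; the Keras model, class count, and F1 value are placeholders, not the study's models or results.

```python
import os
import tensorflow as tf

# Placeholder Keras model; in the study this would be one of the ViT variants (e.g. MobileViT_V2_050).
num_classes = 14          # assumed label count for the Kvasir-Capsule split
model = tf.keras.Sequential([
    tf.keras.layers.GlobalAveragePooling2D(input_shape=(224, 224, 3)),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

# Post-training float-16 quantization with the TFLite converter.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_bytes = converter.convert()

with open("model_f16.tflite", "wb") as f:
    f.write(tflite_bytes)

size_mb = os.path.getsize("model_f16.tflite") / (1024 * 1024)
f1_score = 0.95                                   # placeholder metric from a held-out test set
print(f"size: {size_mb:.2f} MB, performance-to-size ratio: {f1_score / size_mb:.2f}")
```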
16. A novel hybrid attention gate based on vision transformer for the detection of surface defects.
- Author
-
Üzen, Hüseyin, Turkoglu, Muammer, Ozturk, Dursun, and Hanbay, Davut
- Abstract
Many advanced models have been proposed for automatic surface defect inspection. Although CNN-based methods have achieved superior performance among these models, it is limited to extracting global semantic details due to the locality of the convolution operation. In addition, global semantic details can achieve high success for detecting surface defects. Recently, inspired by the success of Transformer, which has powerful abilities to model global semantic details with global self-attention mechanisms, some researchers have started to apply Transformer-based methods in many computer-vision challenges. However, as many researchers notice, transformers lose spatial details while extracting semantic features. To alleviate these problems, in this paper, a transformer-based Hybrid Attention Gate (HAG) model is proposed to extract both global semantic features and spatial features. The HAG model consists of Transformer (Trans), channel Squeeze-spatial Excitation (sSE), and merge process. The Trans model extracts global semantic features and the sSE extracts spatial features. The merge process which consists of different versions such as concat, add, max, and mul allows these two different models to be combined effectively. Finally, four versions based on HAG-Feature Fusion Network (HAG-FFN) were developed using the proposed HAG model for the detection of surface defects. The four different datasets were used to test the performance of the proposed HAG-FFN versions. In the experimental studies, the proposed model produced 83.83%, 79.34%, 76.53%, and 81.78% mIoU scores for MT, MVTec-Texture, DAGM, and AITEX datasets. These results show that the proposed HAGmax-FFN model provided better performance than the state-of-the-art models. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
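The HAG entry above combines a transformer branch (global semantics) with a channel squeeze-spatial excitation (sSE) branch and merges them with concat/add/max/mul variants. The sketch below shows a standard sSE block and those four merge options; it is a generic reading of the description, not the authors' code.

```python
import torch
import torch.nn as nn

class SpatialSE(nn.Module):
    """Channel squeeze, spatial excitation: a 1x1 conv squeezes channels into a spatial gate."""
    def __init__(self, channels):
        super().__init__()
        self.squeeze = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                        # x: (batch, C, H, W)
        gate = torch.sigmoid(self.squeeze(x))    # (batch, 1, H, W) spatial attention map
        return x * gate

def merge(global_feat, spatial_feat, mode="max"):
    """Merge variants named in the abstract: concat, add, max, mul."""
    if mode == "concat":
        return torch.cat([global_feat, spatial_feat], dim=1)
    if mode == "add":
        return global_feat + spatial_feat
    if mode == "max":
        return torch.maximum(global_feat, spatial_feat)
    return global_feat * spatial_feat             # "mul"

x = torch.randn(2, 64, 32, 32)
spatial = SpatialSE(64)(x)
print(merge(x, spatial, mode="max").shape)        # torch.Size([2, 64, 32, 32])
```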
17. Person re-identification using vision transformer and centroid triplet loss.
- Author
-
Ijjina, Earnest Paul, Medipelly, Rampavan, Beerukuri, Santosh Kumar, Vinnakota, Sowmya, and Nelakurthi, Vijay Chowdary
- Subjects
Convolutional neural networks, Transformer models, Video surveillance, Digital technology, Public safety
- Abstract
In the current digital era, video surveillance has become a part of daily life. The person re-identification(re-ID) task involves choosing a person as a target in one camera feed and recognizing that target in footage from a different camera or the same camera at various points in time. The goal is to accurately identify a person despite variations in their appearance due to changes in pose, illumination, and occlusions. Person re-ID is a key component in video surveillance with practical applications in public safety, retail, and transportation, among others. However, it remains a difficult problem due to the inherent variability in appearance and the lack of robust features to capture the subtle differences between individuals. Even the existing Convolution Neural Networks (CNNs) based approaches for person re-ID task struggle to address the issues due to variations in pose, occlusions, and background clutter. To tackle these issues in person re-ID task, we propose an approach using Vision Transformers (ViT) with Centroid Triplet Loss (CTL). Experimental studies conducted on Market1501 and DukeMTMC datasets, yielded better results than the existing approaches, indicating the effectiveness of our approach. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
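Centroid Triplet Loss (CTL), used in the re-identification entry above, compares an anchor embedding against class centroids rather than individual samples. The sketch below is one common margin-based formulation (distance to the own-class centroid versus the nearest other centroid); the details may differ from the cited paper, and the batch here is synthetic.

```python
import torch
import torch.nn.functional as F

def centroid_triplet_loss(embeddings, labels, margin=0.3):
    """Anchor-to-centroid triplet loss: pull anchors toward their class centroid, push away from the nearest other centroid."""
    classes = labels.unique()
    centroids = torch.stack([embeddings[labels == c].mean(dim=0) for c in classes])  # (num_classes, dim)
    dists = torch.cdist(embeddings, centroids)                                       # (batch, num_classes)
    same_class = labels.unsqueeze(1) == classes.unsqueeze(0)
    pos = dists.gather(1, same_class.float().argmax(dim=1, keepdim=True)).squeeze(1)
    neg = dists.masked_fill(same_class, float("inf")).min(dim=1).values
    return F.relu(pos - neg + margin).mean()

emb = F.normalize(torch.randn(16, 128), dim=1)          # e.g. ViT embeddings of person crops
ids = torch.randint(0, 4, (16,))                        # person identities in the batch
print(centroid_triplet_loss(emb, ids))
```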
18. Vision Transformers in Optimization of AI-Based Early Detection of Botrytis cinerea.
- Author
-
Christakakis, Panagiotis, Giakoumoglou, Nikolaos, Kapetas, Dimitrios, Tzovaras, Dimitrios, and Pechlivani, Eleftheria-Maria
- Subjects
-
Transformer models, Multispectral imaging, Agriculture, Artificial intelligence, Deep learning
- Abstract
Detecting early plant diseases autonomously poses a significant challenge for self-navigating robots and automated systems utilizing Artificial Intelligence (AI) imaging. For instance, Botrytis cinerea, also known as gray mold disease, is a major threat to agriculture, particularly impacting significant crops in the Cucurbitaceae and Solanaceae families, making early and accurate detection essential for effective disease management. This study focuses on the improvement of deep learning (DL) segmentation models capable of early detecting B. cinerea on Cucurbitaceae crops utilizing Vision Transformer (ViT) encoders, which have shown promising segmentation performance, in systemic use with the Cut-and-Paste method that further improves accuracy and efficiency addressing dataset imbalance. Furthermore, to enhance the robustness of AI models for early detection in real-world settings, an advanced imagery dataset was employed. The dataset consists of healthy and artificially inoculated cucumber plants with B. cinerea and captures the disease progression through multi-spectral imaging over the course of days, depicting the full spectrum of symptoms of the infection, ranging from early, non-visible stages to advanced disease manifestations. Research findings, based on a three-class system, identify the combination of U-Net++ with MobileViTV2-125 as the best-performing model. This model achieved a mean Dice Similarity Coefficient (mDSC) of 0.792, a mean Intersection over Union (mIoU) of 0.816, and a recall rate of 0.885, with a high accuracy of 92%. Analyzing the detection capabilities during the initial days post-inoculation demonstrates the ability to identify invisible B. cinerea infections as early as day 2 and increasing up to day 6, reaching an IoU of 67.1%. This study assesses various infection stages, distinguishing them from abiotic stress responses or physiological deterioration, which is crucial for accurate disease management as it separates pathogenic from non-pathogenic stress factors. The findings of this study indicate a significant advancement in agricultural disease monitoring and control, with the potential for adoption in on-site digital systems (robots, mobile apps, etc.) operating in real settings, showcasing the effectiveness of ViT-based DL segmentation models for prompt and precise botrytis detection. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
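The Botrytis entry above reports mean Dice Similarity Coefficient (mDSC) and mean Intersection over Union (mIoU) for its segmentation masks. These metrics are standard; a small NumPy sketch of per-class Dice and IoU averaged over classes is shown below, with random label maps standing in for real predictions.

```python
import numpy as np

def dice_and_iou(pred, target, num_classes):
    """Per-class Dice and IoU on integer label maps, averaged over classes present in prediction or ground truth."""
    dices, ious = [], []
    for c in range(num_classes):
        p, t = (pred == c), (target == c)
        inter = np.logical_and(p, t).sum()
        union = np.logical_or(p, t).sum()
        if p.sum() == 0 and t.sum() == 0:
            continue                                # class absent from both: skip
        dices.append(2 * inter / (p.sum() + t.sum()))
        ious.append(inter / union)
    return float(np.mean(dices)), float(np.mean(ious))

pred = np.random.randint(0, 3, (256, 256))          # three-class system as in the abstract
target = np.random.randint(0, 3, (256, 256))
mdsc, miou = dice_and_iou(pred, target, num_classes=3)
print(f"mDSC={mdsc:.3f}, mIoU={miou:.3f}")
```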
19. Comparative Analysis of Machine Learning Techniques Using RGB Imaging for Nitrogen Stress Detection in Maize.
- Author
-
Ghazal, Sumaira, Kommineni, Namratha, and Munir, Arslan
- Subjects
-
Convolutional neural networks, Transformer models, Artificial neural networks, Computer vision, Nitrogen deficiency
- Abstract
Proper nitrogen management in crops is crucial to ensure optimal growth and yield maximization. While hyperspectral imagery is often used for nitrogen status estimation in crops, it is not feasible for real-time applications due to the complexity and high cost associated with it. Much of the research utilizing RGB data for detecting nitrogen stress in plants relies on datasets obtained under laboratory settings, which limits its usability in practical applications. This study focuses on identifying nitrogen deficiency in maize crops using RGB imaging data from a publicly available dataset obtained under field conditions. We have proposed a custom-built vision transformer model for the classification of maize into three stress classes. Additionally, we have analyzed the performance of convolutional neural network models, including ResNet50, EfficientNetB0, InceptionV3, and DenseNet121, for nitrogen stress estimation. Our approach involves transfer learning with fine-tuning, adding layers tailored to our specific application. Our detailed analysis shows that while vision transformer models generalize well, they converge prematurely with a higher loss value, indicating the need for further optimization. In contrast, the fine-tuned CNN models classify the crop into stressed, non-stressed, and semi-stressed classes with higher accuracy, achieving a maximum accuracy of 97% with EfficientNetB0 as the base model. This makes our fine-tuned EfficientNetB0 model a suitable candidate for practical applications in nitrogen stress detection. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
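The maize entry above fine-tunes pretrained CNNs such as EfficientNetB0 with added task-specific layers for three stress classes. A minimal Keras transfer-learning sketch of that two-stage recipe follows; the input size, added layers, and learning rates are illustrative assumptions, not the study's settings.

```python
import tensorflow as tf

# Pretrained EfficientNetB0 backbone without its ImageNet head.
base = tf.keras.applications.EfficientNetB0(include_top=False, weights="imagenet",
                                            input_shape=(224, 224, 3), pooling="avg")
base.trainable = False                      # stage 1: train only the new head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(3, activation="softmax"),   # stressed / semi-stressed / non-stressed
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)

# Stage 2: unfreeze the backbone and fine-tune with a lower learning rate.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```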
20. ConVision Benchmark: A Contemporary Framework to Benchmark CNN and ViT Models.
- Author
-
Bangalore Vijayakumar, Shreyas, Chitty-Venkata, Krishna Teja, Arya, Kanishk, and Somani, Arun K.
- Subjects
-
Transformer models, Object recognition (computer vision), Computer vision, Convolutional neural networks, Deep learning
- Abstract
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have shown remarkable performance in computer vision tasks, including object detection and image recognition. These models have evolved significantly in architecture, efficiency, and versatility. Concurrently, deep-learning frameworks have diversified, with versions that often complicate reproducibility and unified benchmarking. We propose ConVision Benchmark, a comprehensive framework in PyTorch, to standardize the implementation and evaluation of state-of-the-art CNN and ViT models. This framework addresses common challenges such as version mismatches and inconsistent validation metrics. As a proof of concept, we performed an extensive benchmark analysis on a COVID-19 dataset, encompassing nearly 200 CNN and ViT models in which DenseNet-161 and MaxViT-Tiny achieved exceptional accuracy with a peak performance of around 95%. Although we primarily used the COVID-19 dataset for image classification, the framework is adaptable to a variety of datasets, enhancing its applicability across different domains. Our methodology includes rigorous performance evaluations, highlighting metrics such as accuracy, precision, recall, F1 score, and computational efficiency (FLOPs, MACs, CPU, and GPU latency). The ConVision Benchmark facilitates a comprehensive understanding of model efficacy, aiding researchers in deploying high-performance models for diverse applications. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
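The ConVision entry above reports accuracy alongside computational metrics such as parameter count and CPU/GPU latency. A small, generic PyTorch timing helper in that spirit is sketched below; FLOP/MAC counting would need an additional library, so only parameters and mean forward latency are measured, and the two torchvision models are stand-ins for the benchmarked zoo.

```python
import time
import torch
import torchvision

def benchmark(model, input_shape=(1, 3, 224, 224), runs=20, device="cpu"):
    """Report parameter count (millions) and mean forward-pass latency (ms) for one model on one device."""
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    with torch.no_grad():
        for _ in range(3):                      # warm-up iterations
            model(x)
        # Note: for GPU timing, add torch.cuda.synchronize() before and after the timed loop.
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        latency_ms = (time.perf_counter() - start) / runs * 1000
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    return params_m, latency_ms

for name, m in [("resnet18", torchvision.models.resnet18()), ("vit_b_16", torchvision.models.vit_b_16())]:
    params, lat = benchmark(m)
    print(f"{name}: {params:.1f}M params, {lat:.1f} ms/forward (CPU)")
```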
21. Computer-aided diagnosis of Alzheimer’s disease and neurocognitive disorders with multimodal Bi-Vision Transformer (BiViT)
- Author
-
Shah, S. Muhammad Ahmed Hassan, Khan, Muhammad Qasim, Rizwan, Atif, Jan, Sana Ullah, Samee, Nagwan Abdel, and Jamjoom, Mona M.
- Abstract
Cognitive disorders affect various cognitive functions that can have a substantial impact on individual’s daily life. Alzheimer’s disease (AD) is one of such well-known cognitive disorders. Early detection and treatment of cognitive diseases using artificial intelligence can help contain them. However, the complex spatial relationships and long-range dependencies found in medical imaging data present challenges in achieving the objective. Moreover, for a few years, the application of transformers in imaging has emerged as a promising area of research. A reason can be transformer’s impressive capabilities of tackling spatial relationships and long-range dependency challenges in two ways, i.e., (1) using their self-attention mechanism to generate comprehensive features, and (2) capture complex patterns by incorporating global context and long-range dependencies. In this work, a Bi-Vision Transformer (BiViT) architecture is proposed for classifying different stages of AD, and multiple types of cognitive disorders from 2-dimensional MRI imaging data. More specifically, the transformer is composed of two novel modules, namely Mutual Latent Fusion (MLF) and Parallel Coupled Encoding Strategy (PCES), for effective feature learning. Two different datasets have been used to evaluate the performance of proposed BiViT-based architecture. The first dataset contain several classes such as mild or moderate demented stages of the AD. The other dataset is composed of samples from patients with AD and different cognitive disorders such as mild, early, or moderate impairments. For comprehensive comparison, a multiple transfer learning algorithm and a deep autoencoder have been each trained on both datasets. The results show that the proposed BiViT-based model achieves an accuracy of 96.38% on the AD dataset. However, when applied to cognitive disease data, the accuracy slightly decreases below 96% which can be resulted due to smaller amount of data and imbalance in data distribution. Nevertheless, given the results, it can be hypothesized that the proposed algorithm can perform better if the imbalanced distribution and limited availability problems in data can be addressed. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
22. Advancing precision in breast cancer detection: a fusion of vision transformers and CNNs for calcification mammography classification.
- Author
-
Boudouh, Saida Sarra and Bouakkaz, Mustapha
- Subjects
Calcifications of the breast, Transformer models, Deep learning, Image recognition (computer vision), Visual learning
- Abstract
Breast cancer remains a substantial public health challenge, marked by a rising prevalence. Accurate early detection is paramount for effective treatment and improved patient outcomes in breast cancer. The diversity of breast tumors and the complexity of their microenvironment present significant challenges. Establishing a reliable breast calcification and micro-calcification detection approach is an ongoing issue that researchers must continue to investigate. The goal is to develop an effective methodology that contributes to increased patient survival. Therefore, this paper introduces a novel approach for classifying breast calcifications in mammography, aiming to distinguish between benign and malignant cases. Aiming to address these challenges, we proposed our hybrid approach for breast calcification classification in mammogram images. The proposed approach starts with an image pre-processing phase that includes noise reduction and enhancement filters. Afterward, we proposed our hybrid classification architecture. It includes two branches: First, the vision transformer (ViT++) branch for contextual features. Secondly, a CNN branch based on transfer learning techniques for visual features. Using the CBIS-DDSM dataset, the application of our proposed ViT++ architecture reached the maximum accuracy of 96.12%. Further, the application of the VGG16 as a single feature extractor had a much lower accuracy of 61.96%. Meanwhile, the combination of these techniques in the same architecture improved the accuracy to 99.22%. Three different pre-trained feature extractors were applied in the CNN branch: Xception, VGG16, and RegNetX002. However, the best-obtained outcomes were from the combination of the ViT++ and the VGG16. The experimental findings indicate that the proposed strategy for breast calcification detection has the potential to surpass the performance of currently top-ranked methods, particularly in terms of classification accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
23. Spatial entropy as an inductive bias for vision transformers.
- Author
-
Peruzzo, Elia, Sangineto, Enver, Liu, Yahui, De Nadai, Marco, Bi, Wei, Lepri, Bruno, and Sebe, Nicu
- Subjects
Transformer models, Natural language processing, Computer vision, Entropy (information theory), Entropy
- Abstract
Recent work on Vision Transformers (VTs) showed that introducing a local inductive bias in the VT architecture helps reducing the number of samples necessary for training. However, the architecture modifications lead to a loss of generality of the Transformer backbone, partially contradicting the push towards the development of uniform architectures, shared, e.g., by both the Computer Vision and the Natural Language Processing areas. In this work, we propose a different and complementary direction, in which a local bias is introduced using an auxiliary self-supervised task, performed jointly with standard supervised training. Specifically, we exploit the observation that the attention maps of VTs, when trained with self-supervision, can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised. Thus, we explicitly encourage the emergence of this spatial clustering as a form of training regularization. In more detail, we exploit the assumption that, in a given image, objects usually correspond to few connected regions, and we propose a spatial formulation of the information entropy to quantify this object-based inductive bias. By minimizing the proposed spatial entropy, we include an additional self-supervised signal during training. Using extensive experiments, we show that the proposed regularization leads to equivalent or better results than other VT proposals which include a local bias by changing the basic Transformer architecture, and it can drastically boost the VT final accuracy when using small-medium training sets. The code is available at https://github.com/helia95/SAR. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
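The spatial-entropy entry above regularizes training by minimizing an entropy defined over VT attention maps so that attention concentrates on a few connected regions (the authors' code is linked in the abstract). The paper's exact spatial formulation is not reproduced here; the snippet only illustrates the simpler ingredient it builds on, the Shannon entropy of a spatially normalized attention map used as an auxiliary loss term, with the weighting and attention source left as assumptions.

```python
import torch

def attention_entropy(attn_map, eps=1e-8):
    """Shannon entropy of a spatially normalized attention map (batch, H, W); lower means more concentrated."""
    flat = attn_map.flatten(1)
    p = flat / (flat.sum(dim=1, keepdim=True) + eps)        # normalize to a spatial distribution
    return -(p * (p + eps).log()).sum(dim=1).mean()

cls_attention = torch.rand(4, 14, 14)                        # e.g. CLS-token attention reshaped to the patch grid
aux_loss = attention_entropy(cls_attention)
# total_loss = supervised_loss + lambda_se * aux_loss        # hypothetical weighting of the auxiliary term
print(aux_loss)
```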
24. Detection and localization of anomalous objects in video sequences using vision transformers and U-Net model.
- Author
-
Berroukham, Abdelhafid, Housni, Khalid, and Lahraichi, Mohammed
- Abstract
The detection and localization of anomalous objects in video sequences remain a challenging task in video analysis. Recent years have witnessed a surge in deep learning approaches, especially with recurrent neural networks (RNNs). However, RNNs have limitations that vision transformers (ViTs) can address. We propose a novel solution that leverages ViTs, which have recently achieved remarkable success in various computer vision tasks. Our approach involves a two-step process. First, we utilize a pre-trained ViT model to generate an intermediate representation containing an attention map, highlighting areas critical for anomaly detection. In the second step, this attention map is concatenated with the original video frame, creating a richer representation that guides the U-Net model towards anomaly-prone regions. This enriched data is then fed into a U-Net model for precise localization of the anomalous objects. The model achieved a mean Intersection over Union (IoU) of 0.70, indicating a strong overlap between the predicted bounding boxes and the ground truth annotations. In the field of anomaly detection, a higher IoU score signifies better performance. Moreover, the pixel accuracy of 0.99 demonstrates a high level of precision in classifying individual pixels. Concerning localization accuracy, we conducted a comparison of our method with other approaches. The results obtained show that our method outperforms most of the previous methods and achieves a very competitive performance in terms of localization accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
25. Vision transformer introduces a new vitality to the classification of renal pathology
- Author
-
Ji Zhang, Jia Dan Lu, Bo Chen, ShuFang Pan, LingWei Jin, Yu Zheng, and Min Pan
- Subjects
Artificial intelligence, Convolutional neural networks, Vision transformers, Renal pathology, Whole-slide imaging, Diseases of the genitourinary system (Urology, RC870-923)
Abstract Recent advancements in computer vision within the field of artificial intelligence (AI) have made significant inroads into the medical domain. However, the application of AI for classifying renal pathology remains challenging due to the subtle variations in multiple renal pathological classifications. Vision Transformers (ViT), an adaptation of the Transformer model for image recognition, have demonstrated superior capabilities in capturing global features and providing greater explainability. In our study, we developed a ViT model using a diverse set of stained renal histopathology images to evaluate its effectiveness in classifying renal pathology. A total of 1861 whole slide images (WSI) stained with HE, MASSON, PAS, and PASM were collected from 635 patients. Renal tissue images were then extracted, tiled, and categorized into 14 classes on the basis of renal pathology. We employed the classic ViT model from the Timm library, utilizing images sized 384 × 384 pixels with 16 × 16 pixel patches, to train the classification model. A comparative analysis was conducted to evaluate the performance of the ViT model against traditional convolutional neural network (CNN) models. The results indicated that the ViT model demonstrated superior recognition ability (accuracy: 0.96–0.99). Furthermore, we visualized the identification process of the ViT models to investigate potentially significant pathological ultrastructures. Our study demonstrated that ViT models outperformed CNN models in accurately classifying renal pathology. Additionally, ViT models are able to focus on specific, significant structures within renal histopathology, which could be crucial for identifying novel and meaningful pathological features in the diagnosis and treatment of renal disease.
- Published
- 2024
- Full Text
- View/download PDF
26. On the differences between CNNs and vision transformers for COVID-19 diagnosis using CT and chest x-ray mono- and multimodality
- Author
-
El-Ateif, Sara, Idri, Ali, and Fernández-Alemán, José Luis
- Published
- 2024
- Full Text
- View/download PDF
27. ConVision Benchmark: A Contemporary Framework to Benchmark CNN and ViT Models
- Author
-
Shreyas Bangalore Vijayakumar, Krishna Teja Chitty-Venkata, Kanishk Arya, and Arun K. Somani
- Subjects
convolutional neural networks, vision transformers, deep-learning framework, PyTorch, COVID-19, ConVision Benchmark, Electronic computers. Computer science (QA75.5-76.95)
- Abstract
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have shown remarkable performance in computer vision tasks, including object detection and image recognition. These models have evolved significantly in architecture, efficiency, and versatility. Concurrently, deep-learning frameworks have diversified, with versions that often complicate reproducibility and unified benchmarking. We propose ConVision Benchmark, a comprehensive framework in PyTorch, to standardize the implementation and evaluation of state-of-the-art CNN and ViT models. This framework addresses common challenges such as version mismatches and inconsistent validation metrics. As a proof of concept, we performed an extensive benchmark analysis on a COVID-19 dataset, encompassing nearly 200 CNN and ViT models in which DenseNet-161 and MaxViT-Tiny achieved exceptional accuracy with a peak performance of around 95%. Although we primarily used the COVID-19 dataset for image classification, the framework is adaptable to a variety of datasets, enhancing its applicability across different domains. Our methodology includes rigorous performance evaluations, highlighting metrics such as accuracy, precision, recall, F1 score, and computational efficiency (FLOPs, MACs, CPU, and GPU latency). The ConVision Benchmark facilitates a comprehensive understanding of model efficacy, aiding researchers in deploying high-performance models for diverse applications.
- Published
- 2024
- Full Text
- View/download PDF
28. Comparative Analysis of Machine Learning Techniques Using RGB Imaging for Nitrogen Stress Detection in Maize
- Author
-
Sumaira Ghazal, Namratha Kommineni, and Arslan Munir
- Subjects
computer vision, transfer learning, convolutional neural networks, vision transformers, nitrogen stress detection, maize, Electronic computers. Computer science (QA75.5-76.95)
- Abstract
Proper nitrogen management in crops is crucial to ensure optimal growth and yield maximization. While hyperspectral imagery is often used for nitrogen status estimation in crops, it is not feasible for real-time applications due to the complexity and high cost associated with it. Much of the research utilizing RGB data for detecting nitrogen stress in plants relies on datasets obtained under laboratory settings, which limits its usability in practical applications. This study focuses on identifying nitrogen deficiency in maize crops using RGB imaging data from a publicly available dataset obtained under field conditions. We have proposed a custom-built vision transformer model for the classification of maize into three stress classes. Additionally, we have analyzed the performance of convolutional neural network models, including ResNet50, EfficientNetB0, InceptionV3, and DenseNet121, for nitrogen stress estimation. Our approach involves transfer learning with fine-tuning, adding layers tailored to our specific application. Our detailed analysis shows that while vision transformer models generalize well, they converge prematurely with a higher loss value, indicating the need for further optimization. In contrast, the fine-tuned CNN models classify the crop into stressed, non-stressed, and semi-stressed classes with higher accuracy, achieving a maximum accuracy of 97% with EfficientNetB0 as the base model. This makes our fine-tuned EfficientNetB0 model a suitable candidate for practical applications in nitrogen stress detection.
- Published
- 2024
- Full Text
- View/download PDF
29. Interpretable Detection of Malicious Behavior in Windows Portable Executables Using Multi-Head 2D Transformers
- Author
-
Sohail Khan and Mohammad Nauman
- Subjects
malware, Windows portable executable (PE), machine learning, vision transformers, Electronic computers. Computer science (QA75.5-76.95)
- Abstract
Windows malware is becoming an increasingly pressing problem as the amount of malware continues to grow and more sensitive information is stored on systems. One of the major challenges in tackling this problem is the complexity of malware analysis, which requires expertise from human analysts. Recent developments in machine learning have led to the creation of deep models for malware detection. However, these models often lack transparency, making it difficult to understand the reasoning behind the model’s decisions, otherwise known as the black-box problem. To address these limitations, this paper presents a novel model for malware detection, utilizing vision transformers to analyze the Operation Code (OpCode) sequences of more than 350000 Windows portable executable malware samples from real-world datasets. The model achieves a high accuracy of 0.9864, not only surpassing the previous results but also providing valuable insights into the reasoning behind the classification. Our model is able to pinpoint specific instructions that lead to malicious behavior in malware samples, aiding human experts in their analysis and driving further advancements in the field. We report our findings and show how causality can be established between malicious code and actual classification by a deep learning model, thus opening up this black-box problem for deeper analysis.
- Published
- 2024
- Full Text
- View/download PDF
30. Enhancing Image Copy Detection through Dynamic Augmentation and Efficient Sampling with Minimal Data.
- Author
-
Fawzy, Mohamed, Tawfik, Noha S., and Saleh, Sherine Nagy
- Subjects
Transformer models, Artificial intelligence, Deep learning, Social networks, Everyday life
- Abstract
Social networks have become deeply integrated into our daily lives, leading to an increase in image sharing across different platforms. Simultaneously, the existence of robust and user-friendly media editors not only facilitates artistic innovation, but also raises concerns regarding the ease of creating misleading media. This highlights the need for developing new advanced techniques for the image copy detection task, which involves evaluating whether photos or videos originate from the same source. This research introduces a novel application of the Vision Transformer (ViT) model to the image copy detection task on the DISC21 dataset. Our approach involves innovative strategic sampling of the extensive DISC21 training set using K-means clustering to achieve a representative subset. Additionally, we employ complex augmentation pipelines applied while training with varying intensities. Our methodology follows the instance discrimination concept, where the Vision Transformer model is used as a classifier to map different augmentations of the same image to the same class. Next, the trained ViT model extracts descriptors of original and manipulated images that subsequently underwent post-processing to reduce dimensionality. Our best-achieving model, tested on a refined query set of 10K augmented images from the DISC21 dataset, attained a state-of-the-art micro-average precision of 0.79, demonstrating the effectiveness and innovation of our approach. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
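The copy-detection entry above describes strategic sampling of the large DISC21 training set with K-means clustering to obtain a representative subset. A generic scikit-learn sketch of that idea is shown below (cluster image descriptors, then keep the images closest to each centroid); the feature extraction step, cluster count, and subset size are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_subset(features, n_clusters=50, per_cluster=5, seed=0):
    """Cluster image features and keep the `per_cluster` samples nearest each cluster centroid."""
    km = KMeans(n_clusters=n_clusters, random_state=seed).fit(features)
    dists = km.transform(features)                      # (n_samples, n_clusters) distances to centroids
    keep = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        order = members[np.argsort(dists[members, c])]  # cluster members sorted by distance to centroid
        keep.extend(order[:per_cluster].tolist())
    return sorted(keep)

feats = np.random.rand(5000, 256).astype(np.float32)    # stand-in for image descriptors
subset = representative_subset(feats)
print(len(subset))                                      # up to 250 representative training images
```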
31. A fine-tuned vision transformer based enhanced multi-class brain tumor classification using MRI scan imagery.
- Author
-
Reddy, C. Kishor Kumar, Reddy, Pulakurthi Anaghaa, Janapati, Himaja, Assiri, Basem, Shuaib, Mohammed, Alam, Shadab, and Sheneamer, Abdullah
- Subjects
Transformer models, Magnetic resonance imaging, Deep learning, Image processing, Computer-assisted image analysis (medicine), Brain tumors
- Abstract
Brain tumors occur due to the expansion of abnormal cell tissues and can be malignant (cancerous) or benign (not cancerous). Numerous factors such as the position, size, and progression rate are considered while detecting and diagnosing brain tumors. Detecting brain tumors in their initial phases is vital for diagnosis where MRI (magnetic resonance imaging) scans play an important role. Over the years, deep learning models have been extensively used for medical image processing. The current study primarily investigates the novel Fine-Tuned Vision Transformer models (FTVTs)--FTVT-b16, FTVT-b32, FTVT-l16, FTVT-l32--for brain tumor classification, while also comparing them with other established deep learning models such as ResNet50, MobileNet-V2, and EfficientNet-B0. A dataset with 7,023 images (MRI scans) categorized into four different classes, namely, glioma, meningioma, pituitary, and no tumor is used for classification. Further, the study presents a comparative analysis of these models including their accuracies and other evaluation metrics including recall, precision, and F1-score across each class. The deep learning models ResNet-50, EfficientNet-B0, and MobileNet-V2 obtained an accuracy of 96.5%, 95.1%, and 94.9%, respectively. Among all the FTVT models, the FTVT-l16 model achieved a remarkable accuracy of 98.70% whereas the other FTVT models FTVT-b16, FTVT-b32, and FTVT-l32 achieved an accuracy of 98.09%, 96.87%, and 98.62%, respectively, hence proving the efficacy and robustness of FTVTs in medical image processing. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
32. Deep learning approaches for recognizing facial emotions on autistic patients.
- Author
-
El Rhatassi, Fatima Ezzahrae, El Ghali, Btihal, and Daoudi, Najima
- Subjects
Emotion recognition, Convolutional neural networks, Transformer models, Facial expression & emotions (psychology), Autistic people, Chatbots
- Abstract
Autistic people need continuous assistance in order to improve their quality of life, and chatbots are one of the technologies that can provide this today. Chatbots can help with this task by providing assistance while accompanying the autist. The chatbot we plan to develop gives to autistic people an immediate personalized recommendation by determining the autist's state, intervene with him and build a profile of the individual that will assist medical professionals in getting to know their patients better so they can provide an individualized care. We attempted to identify the emotion from the image's face in order to gain an understanding of emotions. Deep learning methods like convolutional neural networks and vision transformers could be compared using the FER2013. After optimization, conventional neural network (CNN) achieved 74% accuracy, whereas the vision transformer (ViT) achieved 69%. Given that there is not a massive dataset of autistic individuals accessible, we combined a dataset of photos of autistic people from two distinct sources and used the CNN model to identify the relevant emotion. Our accuracy rate for identifying emotions on the face is 65%. The model still has some identification limitations, such as misinterpreting some emotions, particularly "neutral," "surprised," and "angry," because these emotions and facial traits are poorly expressed by autistic people, and because the model is trained with imbalanced emotion categories. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
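The class imbalance noted at the end of this abstract is commonly mitigated with a class-weighted loss. The sketch below is a minimal illustration with approximate FER2013-style counts; it is not the authors' training setup.
```python
# Minimal sketch: counteracting imbalanced emotion categories with a class-weighted loss.
# Class names and counts are approximate and illustrative only.
import torch
from torch import nn

emotions = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprised"]
class_counts = torch.tensor([3995., 436., 4097., 7215., 4965., 4830., 3171.])  # FER2013-like, illustrative

# Inverse-frequency weights: rare classes contribute more to the loss.
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, len(emotions))           # stand-in for CNN or ViT outputs
targets = torch.randint(0, len(emotions), (8,))  # stand-in for ground-truth labels
loss = criterion(logits, targets)
print(loss.item())
```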
33. Co-located OLCI optical imagery and SAR altimetry from Sentinel-3 for enhanced Arctic spring sea ice surface classification.
- Author
-
Weibin Chen, Tsamados, Michel, Willatt, Rosemary, So Takao, Brockley, David, de Rijke-Thomas, Claude, Francis, Alistair, Johnson, Thomas, Landy, Jack, Lawrence, Isobel R., Sanggyun Lee, Shirazi, Dorsa Nasrollahi, Wenxuan Liu, Nelson, Connor, Stroeve, Julienne C., Len Hirata, and Deisenroth, Marc Peter
- Subjects
SEA ice ,MACHINE learning ,TRANSFORMER models ,RADAR altimetry ,SYNTHETIC aperture radar ,SPECTRAL imaging ,AERIAL photography - Abstract
The Sentinel-3A and Sentinel-3B satellites, launched in February 2016 and April 2018 respectively, build on the legacy of CryoSat-2 by providing high-resolution Ku-band radar altimetry data over the polar regions up to 81° North. The combination of synthetic aperture radar (SAR) mode altimetry (the SRAL instrument) and the Ocean and Land Colour Instrument (OLCI) imaging spectrometer on Sentinel-3A and Sentinel-3B results in the first satellite platform offering coincident optical imagery and SAR radar altimetry. We utilise this synergy between altimetry and imagery to demonstrate a novel application of deep learning to distinguish sea ice from leads in spring. We use SRAL-classified leads as training input for pan-Arctic lead detection from OLCI imagery. This surface classification is an important step for estimating sea ice thickness and predicting future sea ice changes in the Arctic and Antarctic regions. We propose the use of Vision Transformers (ViT), an approach that adapts the popular Transformer deep learning architecture, for this task. Their effectiveness is demonstrated in terms of both quantitative metrics, including accuracy, and qualitative assessment, including model roll-outs on several entire OLCI images, and we show improved skill compared to previous machine learning and empirical approaches. We show the potential for this method to provide lead fraction retrievals at improved accuracy and spatial resolution for sunlit periods before melt onset. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
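A patch classifier of the kind trained here can be rolled out over a full scene to produce a lead map. The sliding-window sketch below is a generic illustration; patch size, stride, and the classifier itself are assumptions rather than the paper's configuration.
```python
# Minimal sketch: turning a trained patch-level ice/lead classifier into a lead-probability map
# by sliding a window over an OLCI scene. The model is assumed to output two logits (ice, lead).
import numpy as np
import torch

def lead_probability_map(scene: np.ndarray, model: torch.nn.Module, patch: int = 32, stride: int = 32):
    """scene: (H, W, C) reflectance array; returns per-window probability of 'lead'."""
    model.eval()
    h, w, _ = scene.shape
    rows, cols = (h - patch) // stride + 1, (w - patch) // stride + 1
    out = np.zeros((rows, cols), dtype=np.float32)
    with torch.no_grad():
        for i in range(rows):
            for j in range(cols):
                tile = scene[i * stride:i * stride + patch, j * stride:j * stride + patch]
                x = torch.from_numpy(tile).permute(2, 0, 1).unsqueeze(0).float()
                out[i, j] = torch.softmax(model(x), dim=1)[0, 1].item()  # class 1 = lead
    return out
```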
34. Four Transformer-Based Deep Learning Classifiers Embedded with an Attention U-Net-Based Lung Segmenter and Layer-Wise Relevance Propagation-Based Heatmaps for COVID-19 X-ray Scans.
- Author
-
Gupta, Siddharth, Dubey, Arun K., Singh, Rajesh, Kalra, Mannudeep K., Abraham, Ajith, Kumari, Vandana, Laird, John R., Al-Maini, Mustafa, Gupta, Neha, Singh, Inder, Viskovic, Klaudija, Saba, Luca, and Suri, Jasjit S.
- Subjects
- *
ARTIFICIAL intelligence , *TRANSFORMER models , *IMAGE recognition (Computer vision) , *CONVOLUTIONAL neural networks , *DEEP learning - Abstract
Background: Diagnosing lung diseases accurately is crucial for proper treatment. Convolutional neural networks (CNNs) have advanced medical image processing, but challenges remain in their accurate explainability and reliability. This study combines U-Net with attention and Vision Transformers (ViTs) to enhance lung disease segmentation and classification. We hypothesize that Attention U-Net will enhance segmentation accuracy and that ViTs will improve classification performance. The explainability methodologies will shed light on model decision-making processes, aiding in clinical acceptance. Methodology: A comparative approach was used to evaluate deep learning models for segmenting and classifying lung illnesses using chest X-rays. The Attention U-Net model is used for segmentation, and architectures consisting of four CNNs and four ViTs were investigated for classification. Methods like Gradient-weighted Class Activation Mapping plus plus (Grad-CAM++) and Layer-wise Relevance Propagation (LRP) provide explainability by identifying crucial areas influencing model decisions. Results: The results support the conclusion that ViTs are outstanding in identifying lung disorders. Attention U-Net obtained a Dice Coefficient of 98.54% and a Jaccard Index of 97.12%. ViTs outperformed CNNs in classification tasks by 9.26%, reaching an accuracy of 98.52% with MobileViT. An 8.3% increase in accuracy was seen while moving from raw data classification to segmented image classification. Techniques like Grad-CAM++ and LRP provided insights into the decision-making processes of the models. Conclusions: This study highlights the benefits of integrating Attention U-Net and ViTs for analyzing lung diseases, demonstrating their importance in clinical settings. Emphasizing explainability clarifies deep learning processes, enhancing confidence in AI solutions and perhaps enhancing clinical acceptance for improved healthcare results. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
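A minimal sketch of gradient-based attribution helps make the explainability step concrete. The code below implements plain Grad-CAM on a CNN; the study itself uses Grad-CAM++ and LRP, which are not reproduced here, and the model and input are placeholders.
```python
# Minimal sketch: plain Grad-CAM on a CNN classifier, to illustrate attributing a prediction
# to image regions. Model weights and the input are stand-ins, not the study's classifier.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=None)   # in practice, load the fine-tuned chest X-ray classifier
model.eval()

acts = {}
model.layer4.register_forward_hook(lambda module, inputs, output: acts.update(v=output))

x = torch.randn(1, 3, 224, 224)                      # stand-in for a preprocessed X-ray
logits = model(x)
score = logits[0].max()                              # score of the predicted class

# Gradient of the class score w.r.t. the last convolutional feature maps.
grad = torch.autograd.grad(score, acts["v"])[0]      # shape (1, 2048, 7, 7)

weights = grad.mean(dim=(2, 3), keepdim=True)        # channel importance
cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # heatmap in [0, 1]
```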
35. Comparison of deep learning architectures for predicting amyloid positivity in Alzheimer's disease, mild cognitive impairment, and healthy aging, from T1-weighted brain structural MRI.
- Author
-
Chattopadhyay, Tamoghna, Ozarkar, Saket S., Buwa, Ketaki, Joshy, Neha Ann, Komandur, Dheeraj, Naik, Jayati, Thomopoulos, Sophia I., Steeg, Greg Ver, Ambite, Jose Luis, and Thompson, Paul M.
- Subjects
ARTIFICIAL neural networks ,DEEP learning ,ALZHEIMER'S disease ,MILD cognitive impairment ,MACHINE learning ,SIGNAL convolution - Abstract
Abnormal β-amyloid (Aβ) accumulation in the brain is an early indicator of Alzheimer's disease (AD) and is typically assessed through invasive procedures such as PET (positron emission tomography) or CSF (cerebrospinal fluid) assays. As new anti-Alzheimer's treatments can now successfully target amyloid pathology, there is a growing interest in predicting Aβ positivity (Aβ+) from less invasive, more widely available types of brain scans, such as T1-weighted (T1w) MRI. Here we compare multiple approaches to infer Aβ + from standard anatomical MRI: (1) classical machine learning algorithms, including logistic regression, XGBoost, and shallow artificial neural networks, (2) deep learning models based on 2D and 3D convolutional neural networks (CNNs), (3) a hybrid ANN-CNN, combining the strengths of shallow and deep neural networks, (4) transfer learning models based on CNNs, and (5) 3D Vision Transformers. All models were trained on paired MRI/PET data from 1,847 elderly participants (mean age: 75.1 yrs. ± 7.6SD; 863 females/984 males; 661 healthy controls, 889 with mild cognitive impairment (MCI), and 297 with Dementia), scanned as part of the Alzheimer's Disease Neuroimaging Initiative. We evaluated each model's balanced accuracy and F1 scores. While further tests on more diverse data are warranted, deep learning models trained on standard MRI showed promise for estimating Aβ + status, at least in people with MCI. This may offer a potential screening option before resorting to more invasive procedures. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
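The evaluation metrics used in this comparison, balanced accuracy and F1, can be computed as in the short sketch below; the labels and predictions are random placeholders standing in for any of the compared classifiers.
```python
# Minimal sketch: the metrics used to compare the amyloid-positivity models.
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)   # 1 = amyloid-positive, 0 = amyloid-negative
y_pred = rng.integers(0, 2, size=200)   # stand-in for any classifier's predictions

print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("F1 score:", f1_score(y_true, y_pred))
```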
36. Art authentication with vision transformers.
- Author
-
Schaerf, Ludovica, Postma, Eric, and Popovici, Carina
- Subjects
- *
TRANSFORMER models , *ART authentication , *CONVOLUTIONAL neural networks , *IMAGE recognition (Computer vision) , *ATTRIBUTION of art - Abstract
In recent years, transformers, initially developed for language, have been successfully applied to visual tasks. Vision transformers have been shown to push the state of the art in a wide range of tasks, including image classification, object detection, and semantic segmentation. While ample research has shown promising results in art attribution and art authentication tasks using convolutional neural networks, this paper examines whether the superiority of vision transformers extends to art authentication, improving, thus, the reliability of computer-based authentication of artworks. Using a carefully compiled dataset of authentic paintings by Vincent van Gogh and two contrast datasets, we compare the art authentication performances of Swin transformers with those of EfficientNet. Using a standard contrast set containing imitations and proxies (works by painters with styles closely related to van Gogh), we find that EfficientNet achieves the best performance overall. With a contrast set that only consists of imitations, we find the Swin transformer to be superior to EfficientNet by achieving an authentication accuracy of over 85%. These results lead us to conclude that vision transformers represent a strong and promising contender in art authentication, particularly in enhancing the computer-based ability to detect artistic imitations. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
37. Enhancing brain tumor detection in MRI with a rotation invariant Vision Transformer.
- Author
-
Krishnan, Palani Thanaraj, Krishnadoss, Pradeep, Khandelwal, Mukund, Gupta, Devansh, Nihaal, Anupoju, and Kumar, T. Sunil
- Subjects
TRANSFORMER models ,BRAIN tumors ,DEEP learning ,MAGNETIC resonance imaging ,TUMOR classification - Abstract
Background: The Rotation Invariant Vision Transformer (RViT) is a novel deep learning model tailored for brain tumor classification using MRI scans. Methods: RViT incorporates rotated patch embeddings to enhance the accuracy of brain tumor identification. Results: Evaluation on the Brain Tumor MRI Dataset from Kaggle demonstrates RViT's superior performance with sensitivity (1.0), specificity (0.975), F1-score (0.984), Matthew's Correlation Coefficient (MCC) (0.972), and an overall accuracy of 0.986. Conclusion: RViT outperforms the standard Vision Transformer model and several existing techniques, highlighting its efficacy in medical imaging. The study confirms that integrating rotational patch embeddings improves the model's capability to handle diverse orientations, a common challenge in tumor imaging. The specialized architecture and rotational invariance approach of RViT have the potential to enhance current methodologies for brain tumor detection and extend to other complex imaging tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
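Rotation robustness can be illustrated, in a much simpler form than RViT's rotated patch embeddings, by averaging a classifier's predictions over rotated copies of the input. The sketch below shows that test-time variant only; it is not the paper's architecture.
```python
# Minimal sketch: rotation-averaged inference (test-time augmentation), shown only to
# illustrate the invariance idea, not RViT's rotated patch embeddings.
import torch

def rotation_averaged_probs(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """x: (N, C, H, W) batch of MRI slices; returns class probabilities averaged over 0/90/180/270 degrees."""
    model.eval()
    with torch.no_grad():
        probs = [
            torch.softmax(model(torch.rot90(x, k, dims=(2, 3))), dim=1)
            for k in range(4)
        ]
    return torch.stack(probs).mean(dim=0)
```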
38. Comparative Study of Adversarial Defenses: Adversarial Training and Regularization in Vision Transformers and CNNs.
- Author
-
Dingeto, Hiskias and Kim, Juntae
- Subjects
TRANSFORMER models ,CONVOLUTIONAL neural networks ,ARTIFICIAL intelligence ,VISUAL training ,MACHINE learning - Abstract
Transformer-based models are driving a significant revolution in the field of machine learning at the moment. Among these innovations, vision transformers (ViTs) stand out for their application of transformer architectures to vision-related tasks. By demonstrating performance as good as, if not better than, traditional convolutional neural networks (CNNs), ViTs have managed to capture considerable interest in the field. This study focuses on the resilience of ViTs and CNNs in the face of adversarial attacks. Such attacks, which introduce noise into the input of machine learning models to produce incorrect outputs, pose significant challenges to the reliability of machine learning models. Our analysis evaluated the adversarial robustness of CNNs and ViTs by using regularization techniques and adversarial training methods. Adversarial training, in particular, represents a traditional approach to boosting defenses against these attacks. Despite its prominent use, our findings reveal that regularization techniques enable vision transformers and, in most cases, CNNs to enhance adversarial defenses more effectively. Through testing on datasets like CIFAR-10 and CIFAR-100, we demonstrate that vision transformers, especially when combined with effective regularization strategies, exhibit adversarial robustness even without adversarial training. Two main inferences can be drawn from our findings. First, they emphasize how effectively vision transformers can strengthen artificial intelligence defenses against adversarial attacks. Second, they show how regularization, which requires far fewer computational resources and covers a wide range of adversarial attacks, can be effective for adversarial defenses. Understanding and improving a model's resilience to adversarial attacks is crucial for developing secure, dependable systems that can handle the complexity of real-world applications as artificial intelligence and machine learning technologies advance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
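The adversarial-training baseline that the study contrasts with regularization can be sketched as a single FGSM training step, as below; epsilon, the model, and the data pipeline are illustrative assumptions.
```python
# Minimal sketch: one FGSM adversarial-training step, the classical baseline contrasted with
# regularization-based defenses. The paper's exact attacks and regularizers are not reproduced.
import torch
from torch import nn

def fgsm_adversarial_step(model, x, y, optimizer, epsilon=8 / 255):
    criterion = nn.CrossEntropyLoss()

    # Craft FGSM examples: perturb inputs along the sign of the input gradient.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = criterion(model(x_adv), y)
    loss.backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()  # assumes inputs in [0, 1]

    # Train on the adversarial batch.
    optimizer.zero_grad()
    adv_loss = criterion(model(x_adv), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```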
39. Optimizing Strawberry Disease and Quality Detection with Vision Transformers and Attention-Based Convolutional Neural Networks.
- Author
-
Aghamohammadesmaeilketabforoosh, Kimia, Nikan, Soodeh, Antonini, Giorgio, and Pearce, Joshua M.
- Subjects
TRANSFORMER models ,CONVOLUTIONAL neural networks ,COMPUTER vision ,STRAWBERRIES ,MACHINE learning ,AGRICULTURAL productivity ,TASK performance - Abstract
Machine learning and computer vision have proven to be valuable tools for farmers to streamline their resource utilization to lead to more sustainable and efficient agricultural production. These techniques have been applied to strawberry cultivation in the past with limited success. To build on this past work, in this study, two separate sets of strawberry images, along with their associated diseases, were collected and subjected to resizing and augmentation. Subsequently, a combined dataset consisting of nine classes was utilized to fine-tune three distinct pretrained models: vision transformer (ViT), MobileNetV2, and ResNet18. To address the imbalanced class distribution in the dataset, each class was assigned weights to ensure nearly equal impact during the training process. To enhance the outcomes, new images were generated by removing backgrounds, reducing noise, and flipping them. The performances of ViT, MobileNetV2, and ResNet18 were then compared. Customization specific to the task was applied to all three algorithms, and their performances were assessed. Throughout this experiment, none of the layers were frozen, ensuring all layers remained active during training. Attention heads were incorporated into the first five and last five layers of MobileNetV2 and ResNet18, while the architecture of ViT was modified. The results indicated accuracies of 98.4%, 98.1%, and 97.9% for ViT, MobileNetV2, and ResNet18, respectively. Despite the data being imbalanced, the precision, which indicates the proportion of correctly identified positive instances among all predicted positive instances, approached nearly 99% with the ViT. MobileNetV2 and ResNet18 demonstrated similar results. Overall, the analysis revealed that the vision transformer model exhibited superior performance in strawberry ripeness and disease classification. The inclusion of attention heads in the early layers of ResNet18 and MobileNetV2, along with the inherent attention mechanism in ViT, improved the accuracy of image identification. These findings offer the potential for farmers to enhance strawberry cultivation through passive camera monitoring alone, promoting the health and well-being of the population. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
40. Comparing CNN-based and transformer-based models for identifying lung cancer: which is more effective?
- Author
-
Gai, Lulu, Xing, Mengmeng, Chen, Wei, Zhang, Yi, and Qiao, Xu
- Subjects
TRANSFORMER models ,LUNG cancer ,CONVOLUTIONAL neural networks ,RECEIVER operating characteristic curves ,TRAUMA registries ,COMPUTED tomography ,COMPUTER vision - Abstract
Lung cancer constitutes the most severe cause of cancer-related mortality. Recent evidence supports that early detection by means of computed tomography (CT) scans significantly reduces mortality rates. Given the remarkable progress of Vision Transformers (ViTs) in the field of computer vision, we compared the performance of ViTs and Convolutional Neural Networks (CNNs) for the automatic identification of lung cancer based on a dataset of 212 medical images. Importantly, neither ViTs nor CNNs require lung nodule annotations to predict the occurrence of cancer. To address the dataset limitations, we trained both ViTs and CNNs with three advanced techniques: transfer learning, self-supervised learning, and sharpness-aware minimization. Remarkably, we found that CNNs achieve highly accurate prediction of a patient's cancer status, with an outstanding recall (93.4%) and area under the Receiver Operating Characteristic curve (AUC) of 98.1%, when trained with self-supervised learning. Our study demonstrates that both CNNs and ViTs exhibit substantial potential with the three strategies. However, CNNs are more effective than ViTs when the amount of available data is insufficient. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
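Of the three training strategies compared here, sharpness-aware minimization is the least standard, so a compact sketch of one SAM update is given below; rho and the surrounding training loop are illustrative assumptions rather than the authors' settings.
```python
# Minimal sketch: one sharpness-aware minimization (SAM) update -- perturb weights toward the
# local worst case, take the gradient there, then step from the original weights.
import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    # First pass: gradient at the current weights.
    loss = loss_fn(model(x), y)
    loss.backward()

    # Climb to the (approximate) worst-case weights within an L2 ball of radius rho.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2) + 1e-12
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / grad_norm
            p.add_(e)
            eps.append(e)
    model.zero_grad()

    # Second pass: gradient at the perturbed weights, then undo the perturbation and step.
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()
```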
41. Tiny polyp detection from endoscopic video frames using vision transformers.
- Author
-
Liu, Entong, He, Bishi, Zhu, Darong, Chen, Yuanjiao, and Xu, Zhe
- Abstract
Deep learning techniques can be effective in helping doctors diagnose gastrointestinal polyps. Currently, however, polyp detection in video frame sequences that contain a large amount of spurious noise suffers in terms of recall and mean average precision. Moreover, the mean average precision is also low when the polyp target in the video frame shows large-scale variability. Therefore, we propose TPolyp, a method for tiny polyp detection from endoscopic video frames using Vision Transformers. The proposed method uses a cross-stage Swin Transformer as a multi-scale feature extractor to extract deep feature representations of data samples, improves the bidirectional sampling feature pyramid, and integrates the prediction heads of multiple channel self-attention mechanisms. Compared with convolutional neural networks, this approach focuses more on the feature information relevant to the tiny object detection task and retains relatively deeper semantic information. It additionally improves feature expression and discriminability without increasing the computational complexity. Experimental results show that TPolyp improves detection accuracy by 7%, recall by 7.3%, and average accuracy by 7.5% compared to the YOLOv5 model, and achieves better tiny object detection in scenarios with blurry artifacts. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
42. DAM-Net: Flood detection from SAR imagery using differential attention metric-based vision transformers.
- Author
-
Saleh, Tamer, Weng, Xingxing, Holail, Shimaa, Hao, Chen, and Xia, Gui-Song
- Subjects
- *
TRANSFORMER models , *CRISIS management , *SPECKLE interference , *SYNTHETIC aperture radar , *EMERGENCY management , *FLOODS , *BODIES of water - Abstract
Flood detection from synthetic aperture radar (SAR) imagery plays an important role in crisis and disaster management. Based on pre- and post-flood SAR images, flooded areas can be extracted by detecting changes in water bodies. Existing state-of-the-art change detection methods primarily target optical image pairs. The characteristics of SAR images, such as scarce visual information, similar backscatter signals, and ubiquitous speckle noise, pose great challenges to identifying water bodies and mining change features, thus resulting in unsatisfactory performance. Besides, the lack of large-scale annotated datasets hinders the development of accurate flood detection methods. In this paper, we focus on the difference between SAR image pairs and present a differential attention metric-based network (DAM-Net) to achieve flood detection. By introducing feature interaction during temporal feature representation, we guide the model to focus on changes of interest rather than fully understanding the scene of the image. On the other hand, we devise a class token to capture high-level semantic information about water body changes, increasing the ability to distinguish water body changes from pseudo changes caused by similar signals or speckle noise. To better train and evaluate DAM-Net, we create a large-scale flood detection dataset using Sentinel-1 SAR imagery, namely S1GFloods. This dataset consists of 5,360 image pairs, covering 46 flood events during 2015–2022 and spanning 6 continents. The experimental results on this dataset demonstrate that our method outperforms several advanced change detection methods. DAM-Net achieves 97.8% overall accuracy, 96.5% F1, and 93.2% IoU on the test set. Our dataset and code are available at https://github.com/Tamer-Saleh/S1GFlood-Detection. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
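The bi-temporal setup underlying this kind of change detection can be sketched with a shared (Siamese) encoder and a feature-difference head, as below; DAM-Net's differential attention metric and change class token are not reproduced, so this is only a generic illustration.
```python
# Minimal sketch: a generic Siamese change-detection head for pre-/post-flood SAR pairs --
# a shared encoder, a feature difference, and a pixel-wise classifier.
import torch
from torch import nn

class SiameseChangeDetector(nn.Module):
    def __init__(self, in_channels=1):
        super().__init__()
        self.encoder = nn.Sequential(               # shared weights for both acquisition dates
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, 2, 1)             # change / no-change logits per pixel

    def forward(self, pre, post):
        diff = torch.abs(self.encoder(pre) - self.encoder(post))
        return self.head(diff)

model = SiameseChangeDetector()
pre, post = torch.randn(1, 1, 256, 256), torch.randn(1, 1, 256, 256)
print(model(pre, post).shape)  # torch.Size([1, 2, 256, 256])
```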
43. GITPose: going shallow and deeper using vision transformers for human pose estimation.
- Author
-
Aidoo, Evans, Wang, Xun, Liu, Zhenguang, Abbam, Abraham Opanfo, Tenagyei, Edwin Kwadwo, Ejianya, Victor Nonso, Kodjiku, Seth Larweh, and Aggrey, Esther Stacy E. B.
- Subjects
TRANSFORMER models ,MULTILAYER perceptrons ,CONVOLUTIONAL neural networks ,FEATURE extraction - Abstract
In comparison to convolutional neural networks (CNN), the newly created vision transformer (ViT) has demonstrated impressive outcomes in human pose estimation (HPE). However, (1) there is a quadratic rise in complexity with respect to image size, which causes the traditional ViT to be unsuitable for scaling, and (2) the attention process at the transformer encoder as well as decoder also adds substantial computational costs to the detector's overall processing time. Motivated by this, we propose a novel Going shallow and deeper with vIsion Transformers for human Pose estimation (GITPose) without CNN backbones for feature extraction. In particular, we introduce a hierarchical transformer in which we utilize multilayer perceptrons to encode the richest local feature tokens in the initial phases (i.e., shallow), whereas self-attention modules are employed to encode long-term relationships in the deeper layers (i.e., deeper), and a decoder for keypoint detection. In addition, we offer a learnable deformable token association module (DTA) to non-uniformly and dynamically combine informative keypoint tokens. Comprehensive evaluation and testing on the COCO and MPII benchmark datasets reveal that GITPose achieves a competitive average precision (AP) on pose estimation compared to its state-of-the-art approaches. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
44. Automated Detection of Posterior Vitreous Detachment on OCT Using Computer Vision and Deep Learning Algorithms
- Author
-
Li, Alexa L, Feng, Moira, Wang, Zixi, Baxter, Sally L, Huang, Lingling, Arnett, Justin, Bartsch, Dirk-Uwe G, Kuo, David E, Saseendrakumar, Bharanidharan Radha, Guo, Joy, and Nudleman, Eric
- Subjects
Eye Disease and Disorders of Vision ,Bioengineering ,Clinical Research ,Neurosciences ,Biomedical Imaging ,AI ,artificial intelligence ,AUROC ,area under the receiver operator characteristic curve ,Automated detection ,CNN ,convolutional neural network ,DL ,deep learning ,Deep Learning ,ILM ,internal limiting membrane ,OCT ,PVD ,posterior vitreous detachment ,Posterior vitreous detachment ,ViT ,vision transformers - Abstract
Objective: To develop automated algorithms for the detection of posterior vitreous detachment (PVD) using OCT imaging. Design: Evaluation of a diagnostic test or technology. Subjects: Overall, 42 385 consecutive OCT images (865 volumetric OCT scans) obtained with Heidelberg Spectralis from 865 eyes from 464 patients at an academic retina clinic between October 2020 and December 2021 were retrospectively reviewed. Methods: We developed a customized computer vision algorithm based on image filtering and edge detection to detect the posterior vitreous cortex for the determination of PVD status. A second deep learning (DL) image classification model based on convolutional neural networks and ResNet-50 architecture was also trained to identify PVD status from OCT images. The training dataset consisted of 674 OCT volume scans (33 026 OCT images), while the validation testing set consisted of 73 OCT volume scans (3577 OCT images). Overall, 118 OCT volume scans (5782 OCT images) were used as a separate external testing dataset. Main Outcome Measures: Accuracy, sensitivity, specificity, F1-scores, and area under the receiver operator characteristic curves (AUROCs) were measured to assess the performance of the automated algorithms. Results: Both the customized computer vision algorithm and DL model results were largely in agreement with the PVD status labeled by trained graders. The DL approach achieved an accuracy of 90.7% and an F1-score of 0.932 with a sensitivity of 100% and a specificity of 74.5% for PVD detection from an OCT volume scan. The AUROC was 89% at the image level and 96% at the volume level for the DL model. The customized computer vision algorithm attained an accuracy of 89.5% and an F1-score of 0.912 with a sensitivity of 91.9% and a specificity of 86.1% on the same task. Conclusions: Both the computer vision algorithm and the DL model applied on OCT imaging enabled reliable detection of PVD status, demonstrating the potential for OCT-based automated PVD status classification to assist with vitreoretinal surgical planning. Financial Disclosures: Proprietary or commercial disclosure may be found after the references.
- Published
- 2023
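The record reports AUROC at both the image and the volume level. The sketch below illustrates one plausible way to aggregate per-image probabilities into a per-volume decision (max over slices) and score both levels; the aggregation rule and the synthetic data are assumptions, not the authors' method.
```python
# Minimal sketch: image-level vs. volume-level AUROC, with a max-over-slices aggregation rule.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_volumes, n_slices = 40, 49
volume_labels = np.array([0, 1] * (n_volumes // 2))               # half PVD-negative, half PVD-positive
# Positive volumes contain some PVD-positive slices; negative volumes contain none.
image_labels = (rng.random((n_volumes, n_slices)) < 0.3) & volume_labels[:, None].astype(bool)
# A reasonably good synthetic per-slice classifier: higher scores on positive slices.
image_probs = np.where(image_labels,
                       rng.uniform(0.6, 1.0, image_labels.shape),
                       rng.uniform(0.0, 0.5, image_labels.shape))

volume_probs = image_probs.max(axis=1)                            # max-pooling over the slices of a volume
print("image-level AUROC:", roc_auc_score(image_labels.ravel(), image_probs.ravel()))
print("volume-level AUROC:", roc_auc_score(volume_labels, volume_probs))
```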
45. Self-supervised learning of Vision Transformers for digital soil mapping using visual data
- Author
-
Paul Tresson, Maxime Dumont, Marc Jaeger, Frédéric Borne, Stéphane Boivin, Loïc Marie-Louise, Jérémie François, Hassan Boukcim, and Hervé Goëau
- Subjects
Self-supervised learning ,Vision transformers ,Digital soil mapping ,Arid lands ,Science - Abstract
In arid environments, prospecting cultivable land is challenging due to harsh climatic conditions and vast, hard-to-access areas. However, the soil is often bare, with little vegetation cover, making it easy to observe from above. Hence, remote sensing can drastically reduce the costs of exploring these areas. For the past few years, deep learning has extended remote sensing analysis, first with Convolutional Neural Networks (CNNs), then with Vision Transformers (ViTs). The main drawback of deep learning methods is their reliance on large calibration datasets, as data collection is a cumbersome and costly task, particularly in drylands. However, recent studies demonstrate that ViTs can be trained in a self-supervised manner to take advantage of large amounts of unlabelled data to pre-train models. These backbone models can then be fine-tuned to learn a supervised regression model with few labelled data. In our study, we trained ViTs in a self-supervised way on a 9,500 km2 satellite image of drylands in Saudi Arabia with a spatial resolution of 1.5 m per pixel. The resulting models were used to extract features describing the bare soil and predict soil attributes (pH H2O, pH KCl, Si composition). Using only RGB data, we can accurately predict these soil properties and achieve, for instance, an RMSE of 0.40 ± 0.03 when predicting alkaline soil pH. We also assess the effectiveness of adding additional covariates, such as elevation. The pretrained models can also be used as visual feature extractors. These features can be used to automatically generate a clustered map of an area or as input to random forest models, providing a versatile way to generate maps with limited labelled data and input variables.
- Published
- 2024
- Full Text
- View/download PDF
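The feature-extractor use of the pretrained backbone described here can be sketched by freezing a ViT, embedding image tiles, and fitting a random forest on the embeddings; the backbone name, tile pipeline, and pH values below are illustrative assumptions rather than the authors' pretrained model.
```python
# Minimal sketch: frozen ViT features for image tiles, regressed against a soil property with a
# random forest. The backbone and data are placeholders, not the paper's dryland-pretrained model.
import numpy as np
import timm
import torch
from sklearn.ensemble import RandomForestRegressor

backbone = timm.create_model("vit_small_patch16_224", pretrained=True, num_classes=0)  # 0 -> pooled features
backbone.eval()

def tile_features(tiles: torch.Tensor) -> np.ndarray:
    """tiles: (N, 3, 224, 224) normalized RGB tiles -> (N, feature_dim) embeddings."""
    with torch.no_grad():
        return backbone(tiles).cpu().numpy()

# Stand-ins for labelled sampling points: image tiles and measured pH values.
tiles = torch.randn(64, 3, 224, 224)
ph_values = np.random.uniform(7.5, 9.5, size=64)

features = tile_features(tiles)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(features, ph_values)
print("predicted pH for first tile:", rf.predict(features[:1])[0])
```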
46. Detection and classification of surface defects on hot-rolled steel using vision transformers
- Author
-
Vinod Vasan, Naveen Venkatesh Sridharan, Sugumaran Vaithiyanathan, and Mohammadreza Aghaei
- Subjects
Deep neural network ,Vision transformers ,Automated defect identification ,Steel surface defects ,Non-destructive testing ,Non-contact testing ,Science (General) ,Q1-390 ,Social sciences (General) ,H1-99 - Abstract
This study proposes a vision transformer to detect visual defects on steel surfaces. The proposed approach utilizes an open-source image dataset to classify steel surface conditions into six fault categories, namely crazing, inclusion, rolled-in, pitted surface, scratches, and patches. The defect images are first resized and then fed into a vision transformer under different hyperparameter configurations to determine the setting that renders the highest classification performance. The performance of the model is evaluated for different hyperparameter configurations, and the optimal configuration is examined using the associated confusion matrices. It was observed that the proposed model presents a high overall accuracy of 96.39% for the detection and classification of steel surface faults. The study presents a descriptive insight into the vision transformer architecture and, in addition, compares the performance of the current model with the results of other approaches suggested in the literature. Vision transformers can serve as standalone approaches and suitable alternatives to the widely used convolutional neural networks (CNNs) by performing complex defect detection and classification tasks in real time, enabling efficient and robust condition monitoring of a wide range of defects.
- Published
- 2024
- Full Text
- View/download PDF
47. ViTaL: An Advanced Framework for Automated Plant Disease Identification in Leaf Images Using Vision Transformers and Linear Projection for Feature Reduction
- Author
-
Sebastian, Abhishek, Fathima, A. Annis, Pragna, R., MadhanKumar, S., Kannan, G. Yaswanth, Murali, Vinay, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Bansal, Jagdish Chand, editor, Borah, Samarjeet, editor, Hussain, Shahid, editor, and Salhi, Said, editor
- Published
- 2024
- Full Text
- View/download PDF
48. Innovative Fusion of Transformer Models with SIFT for Superior Panorama Stitching
- Author
-
Xiang, Zheng, Fournier-Viger, Philippe, Series Editor, and Wang, Yulin, editor
- Published
- 2024
- Full Text
- View/download PDF
49. A Region-Based Approach to Diabetic Retinopathy Classification with Superpixel Tokenization
- Author
-
Playout, Clément, Legault, Zacharie, Duval, Renaud, Boucher, Marie Carole, Cheriet, Farida, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Linguraru, Marius George, editor, Dou, Qi, editor, Feragen, Aasa, editor, Giannarou, Stamatia, editor, Glocker, Ben, editor, Lekadir, Karim, editor, and Schnabel, Julia A., editor
- Published
- 2024
- Full Text
- View/download PDF
50. MuST: Multi-scale Transformers for Surgical Phase Recognition
- Author
-
Pérez, Alejandra, Rodríguez, Santiago, Ayobi, Nicolás, Aparicio, Nicolás, Dessevres, Eugénie, Arbeláez, Pablo, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Linguraru, Marius George, editor, Dou, Qi, editor, Feragen, Aasa, editor, Giannarou, Stamatia, editor, Glocker, Ben, editor, Lekadir, Karim, editor, and Schnabel, Julia A., editor
- Published
- 2024
- Full Text
- View/download PDF