
Token-Selective Vision Transformer for fine-grained image recognition of marine organisms

Authors :
Guangzhe Si
Ying Xiao
Bin Wei
Leon Bevan Bullock
Yueyue Wang
Xiaodong Wang
Source :
Frontiers in Marine Science, Vol 10 (2023)
Publication Year :
2023
Publisher :
Frontiers Media S.A., 2023.

Abstract

Introduction: The objective of fine-grained image classification of marine organisms is to distinguish subtle variations among organisms so as to classify them accurately into subcategories. The key to accurate classification is locating the distinguishing feature regions, such as a fish's eyes, fins, or tail. Images of marine organisms are hard to work with: they are often taken from multiple angles and in different scenes, usually have complex backgrounds, and often contain humans or other distractions, all of which make it difficult to focus on the marine organism itself and identify its most distinctive features.

Related work: Most existing fine-grained image classification methods based on Convolutional Neural Networks (CNNs) cannot locate the distinguishing feature regions accurately enough, and the regions they identify also contain a large amount of background data. The Vision Transformer (ViT) has strong global information-capturing abilities and performs strongly in traditional classification tasks. The core of ViT is a Multi-Head Self-Attention (MSA) mechanism, which first establishes connections between different patch tokens in a pair of images and then combines all the information of the tokens for classification.

Methods: However, not all tokens are conducive to fine-grained classification; many of them contain extraneous data (noise). We hope to eliminate the influence of interfering tokens, such as background data, on the identification of marine organisms, and then gradually narrow down the local feature area to accurately determine the distinctive features. To this end, this paper puts forward a novel Transformer-based framework, the Token-Selective Vision Transformer (TSVT), in which Token-Selective Self-Attention (TSSA) is proposed to select the discriminative, important tokens for attention computation, which helps limit the attention to more precise local regions. TSSA is applied to different layers, and the number of selected tokens in each layer decreases relative to the previous layer, so the method gradually locates the distinguishing regions in a hierarchical manner.

Results: The effectiveness of TSVT is verified on three marine organism datasets, and it is demonstrated that TSVT can achieve state-of-the-art performance.
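
The token selection described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how tokens are scored, so the sketch assumes that patch tokens are ranked by the attention the class token pays to them, that unselected tokens are dropped rather than masked, and that attention is single-head with no learned query/key/value projections. The function names and the schedule of kept tokens are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_selective_attention(tokens, keep):
    """One simplified selective-attention step.

    tokens: array of shape (n + 1, d); row 0 is the class token.
    keep:   number of patch tokens to retain (assumed keep <= n).
    Returns the attended tokens, shape (keep + 1, d), and the row
    indices (into `tokens`) that were kept.
    """
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)
    # ASSUMPTION: rank patches by class-token attention.
    cls_attn = softmax(scores[0, 1:])
    top = np.sort(np.argsort(cls_attn)[-keep:]) + 1  # +1 skips the class token
    idx = np.concatenate(([0], top))
    selected = tokens[idx]
    # Self-attention restricted to the selected tokens only.
    attn = softmax(selected @ selected.T / np.sqrt(d))
    return attn @ selected, idx

def hierarchical_select(tokens, schedule):
    """Apply the step layer by layer with a decreasing `schedule`,
    tracking which original token positions survive."""
    kept = np.arange(tokens.shape[0])
    for keep in schedule:
        tokens, idx = token_selective_attention(tokens, keep)
        kept = kept[idx]
    return tokens, kept
```

For example, with one class token plus 16 patch tokens and a hypothetical schedule of (12, 8, 4), the output holds the class token and 4 patch tokens, and `kept` lists their positions in the original sequence.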

Details

Language :
English
ISSN :
22967745
Volume :
10
Database :
Directory of Open Access Journals
Journal :
Frontiers in Marine Science
Publication Type :
Academic Journal
Accession number :
edsdoj.b29e327ea5cf4ab8917a96aa3f8355f5
Document Type :
article
Full Text :
https://doi.org/10.3389/fmars.2023.1174347