Transformer model incorporating local graph semantic attention for image caption.
- Source :
- Visual Computer. Sep 2024, Vol. 40, Issue 9, p6533-6544. 12p.
- Publication Year :
- 2024
Abstract
- To address the problem that existing transformer-based models isolate semantic information in image captioning tasks, a transformer model incorporating local graph semantic attention (TLGSA) is proposed. TLGSA consists of a multi-layer image convolutional encoder, a multi-label semantic recognizer, and a multi-layer natural language generation decoder. The image convolutional feature encoder outputs spatial-feature self-attention information, and the multi-label semantic recognizer combines the global corpus with the current image features to output image semantic graph node encoding features using a graph convolutional network (GCN). The natural language decoder combines the input semantic self-attention, the graph node encoding features, and the spatial-feature self-attention into a local semantic multi-head attention, and finally generates a natural-language caption of the image. Experimental results show that the proposed method achieves higher accuracy and more meaningful semantic information than existing SOTA methods, reaching BLEU-4 scores of 23%, 23.2%, and 31.57% on the widely used test datasets Flickr8K, Flickr30K, and MS COCO, respectively, without a reinforcement-optimization stage. [ABSTRACT FROM AUTHOR]
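- The fusion step the abstract describes can be sketched in code. The following is a minimal PyTorch illustration (not the authors' implementation) of a decoder layer that runs masked self-attention over the caption tokens, then attends separately over GCN graph-node encodings and CNN spatial features, and merges the two streams into one "local semantic" attention output. The dimensions, the concatenation-based fusion, and the name `LocalSemanticAttention` are assumptions for illustration only.

```python
# A minimal sketch of the local semantic multi-head attention described in the
# abstract. All shapes and the concat-then-project fusion are assumptions;
# the paper's exact formulation is not given in this record.
import torch
import torch.nn as nn

class LocalSemanticAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # Self-attention over the caption tokens generated so far.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention over semantic-graph node encodings (GCN output).
        self.graph_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention over spatial features from the convolutional encoder.
        self.spatial_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Fuse the two cross-attention streams back down to d_model.
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, words, graph_nodes, spatial, causal_mask=None):
        # words:       (B, T, d_model) caption-token embeddings
        # graph_nodes: (B, N, d_model) semantic graph node encodings
        # spatial:     (B, R, d_model) flattened spatial feature map
        q, _ = self.self_attn(words, words, words, attn_mask=causal_mask)
        sem, _ = self.graph_attn(q, graph_nodes, graph_nodes)
        spa, _ = self.spatial_attn(q, spatial, spatial)
        fused = self.fuse(torch.cat([sem, spa], dim=-1))
        return self.norm(q + fused)  # residual connection around the fusion

# Example shapes: batch of 2 captions, 12 tokens, 10 graph nodes, 49 regions.
layer = LocalSemanticAttention()
out = layer(torch.randn(2, 12, 512), torch.randn(2, 10, 512), torch.randn(2, 49, 512))
print(out.shape)  # torch.Size([2, 12, 512])
```

- In a full decoder this layer would be stacked several times and followed by a feed-forward sublayer and an output projection over the vocabulary; the sketch isolates only the attention-fusion idea.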
Details
- Language :
- English
- ISSN :
- 0178-2789
- Volume :
- 40
- Issue :
- 9
- Database :
- Academic Search Index
- Journal :
- Visual Computer
- Publication Type :
- Academic Journal
- Accession number :
- 179041397
- Full Text :
- https://doi.org/10.1007/s00371-023-03180-7