
Transformer model incorporating local graph semantic attention for image caption.

Authors :
Qian, Kui
Pan, Yuchen
Xu, Hao
Tian, Lei
Source :
Visual Computer. Sep2024, Vol. 40 Issue 9, p6533-6544. 12p.
Publication Year :
2024

Abstract

To address the problem of isolated semantic information in existing transformer-based image captioning models, a transformer model incorporating local graph semantic attention (TLGSA) is proposed. TLGSA consists of a multi-layer image convolutional encoder, a multi-label semantic recognizer, and a multi-layer natural language generation decoder. The image convolutional feature encoder outputs spatial-feature self-attention information, and the multi-label semantic recognizer combines a global corpus with the current image's features to output image semantic graph-node encoding features using a graph convolutional neural network (GCN). The natural language decoder fuses the input semantic self-attention, the graph-node encoding features, and the spatial-feature self-attention into a local semantic multi-head attention, and finally generates a natural language caption of the image. Experimental results show that the proposed method achieves higher accuracy with more meaningful semantic information than existing SOTA methods, reaching BLEU-4 scores of 23%, 23.2%, and 31.57% on the widely used Flickr8K, Flickr30K, and MS COCO test datasets, respectively, without a reinforcement optimization stage. [ABSTRACT FROM AUTHOR]
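The abstract describes the decoder fusing word-level self-attention, GCN graph-node features, and spatial features into a "local semantic multi-head attention". The paper's exact formulation is not given in this record; the NumPy sketch below (all function names, shapes, and the concatenation strategy are assumptions, not the authors' method) illustrates one plausible way a decoder query could attend jointly over spatial and semantic graph-node features:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_semantic_attention(word_q, spatial_feats, graph_feats, num_heads=4):
    """Hypothetical fused attention: decoder word queries (T, d) attend
    over the concatenation of spatial features (S, d) from the image
    encoder and graph-node features (G, d) from the GCN recognizer."""
    # Pool spatial and semantic graph features into one key/value bank.
    kv = np.concatenate([spatial_feats, graph_feats], axis=0)  # (S+G, d)
    d = word_q.shape[-1]
    dh = d // num_heads  # per-head dimension
    heads = []
    for h in range(num_heads):
        q = word_q[:, h * dh:(h + 1) * dh]        # (T, dh)
        k = kv[:, h * dh:(h + 1) * dh]            # (S+G, dh)
        v = k                                     # tied K/V for brevity
        attn = softmax(q @ k.T / np.sqrt(dh))     # (T, S+G) attention weights
        heads.append(attn @ v)                    # (T, dh) per-head context
    return np.concatenate(heads, axis=-1)         # (T, d) fused context
```

In a real model the Q/K/V would pass through learned linear projections per head; they are omitted here to keep the sketch focused on the fusion of the two feature sources.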

Details

Language :
English
ISSN :
0178-2789
Volume :
40
Issue :
9
Database :
Academic Search Index
Journal :
Visual Computer
Publication Type :
Academic Journal
Accession number :
179041397
Full Text :
https://doi.org/10.1007/s00371-023-03180-7