Transformer with sparse self‐attention mechanism for image captioning

Authors :: Dihu Chen; Duofeng Wang; Haifeng Hu
Source :: Electronics Letters. 56:764-766
Publication Year :: 2020
Publisher :: Institution of Engineering and Technology (IET), 2020.
Abstract: The transformer has recently been applied to image captioning, with a convolutional neural network and the transformer encoder together acting as the image encoder and the transformer decoder acting as the caption decoder. However, because its self-attention mechanism is dense, the transformer may suffer from interference by non-critical objects in a scene and have difficulty fully capturing image information. To address this issue, the authors of this Letter propose a novel transformer model with decreasing attention gates and an attention fusion module. First, an attention gate truncates all attention weights that are smaller than the gate threshold, forcing the transformer to overcome the interference of non-critical objects and to capture object information more efficiently. Second, by inheriting the attention matrix from the previous network layer, the attention fusion module enables each layer to consider other objects without losing the most critical ones. The method is evaluated on the benchmark Microsoft COCO dataset and achieves better performance than state-of-the-art methods.
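The two mechanisms named in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the abstract gives no formulas, so the function names (`gate`, `fuse`, `sparse_attention_layer`), the threshold values, the equal-weight averaging used for fusion, the rule that always keeps the largest weight in a row, and the final row renormalisation are all assumptions made for illustration. Learned query, key, and value projections are omitted, so each layer attends over its input features directly.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dense_attention(queries, keys):
    """Standard scaled dot-product attention weights, one row per query."""
    scale = math.sqrt(len(queries[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def gate(attn, threshold):
    """Attention gate: zero every weight below the threshold.

    Assumption: the largest weight in each row is always kept so that
    no row becomes entirely zero.
    """
    gated = []
    for row in attn:
        top = max(row)
        gated.append([w if (w >= threshold or w == top) else 0.0 for w in row])
    return gated


def fuse(attn, prev_attn, fusion_weight=0.5):
    """Attention fusion (assumed form): average with the previous layer's matrix."""
    if prev_attn is None:
        return attn
    return [
        [fusion_weight * p + (1.0 - fusion_weight) * a for a, p in zip(row, prev_row)]
        for row, prev_row in zip(attn, prev_attn)
    ]


def normalise(attn):
    """Rescale each row to sum to 1 (assumed; not stated in the abstract)."""
    result = []
    for row in attn:
        total = sum(row)
        result.append([w / total for w in row])
    return result


def sparse_attention_layer(x, threshold, prev_attn=None):
    """One self-attention layer with gating followed by fusion."""
    attn = normalise(fuse(gate(dense_attention(x, x), threshold), prev_attn))
    width = len(x[0])
    out = [[sum(w * v[j] for w, v in zip(row, x)) for j in range(width)] for row in attn]
    return out, attn


# Hypothetical setup: 5 image regions with 8-dimensional features, and
# gate thresholds that decrease from layer to layer.
random.seed(0)
features = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
layer_thresholds = [0.2, 0.1, 0.05]

hidden, attn = features, None
for threshold in layer_thresholds:
    hidden, attn = sparse_attention_layer(hidden, threshold, prev_attn=attn)
```

In this sketch the first layer, with the highest threshold, keeps only the strongest object-to-object links. Later layers use lower thresholds and so admit more objects, while the fused matrix carries forward the links selected earlier.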