Improving image captioning with Pyramid Attention and SC-GAN

Authors :: Tianyu Chen; Bianping Su; Huifang Ma; Zhixin Li; Jingli Wu
Source :: Image and Vision Computing. 117:104340
Publication Year :: 2022
Publisher :: Elsevier BV, 2022.
Abstract: Most existing image captioning models use global attention, which represents whole-image features, local attention, which represents object features, or a combination of the two; few models integrate the relationship information between the object regions of an image. Yet this relationship information is also very instructive for caption generation. For example, if a football appears, there is a high probability that the image also contains people near the football. In this article, the relationship feature is embedded into global-local attention to construct a new Pyramid Attention mechanism, which can explore the internal visual and semantic relationships between different object regions. In addition, to alleviate the exposure bias problem and make the training process more efficient, we propose a new method for applying a Generative Adversarial Network to sequence generation. The greedy decoding method is used to generate an efficient baseline reward for self-critical training. Finally, experiments on the MSCOCO dataset show that the model generates more accurate and vivid captions and outperforms many recent advanced models on various prevailing evaluation metrics on both local and online test sets.
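
The greedy-decoding baseline mentioned in the abstract can be illustrated with a minimal sketch of a self-critical policy-gradient loss. This is not the authors' implementation: the function name, its parameters, and the example rewards are hypothetical, the reward is assumed to be some scalar caption-quality score, and the adversarial (GAN) component described in the abstract is not modeled here.

```python
import math


def self_critical_loss(sample_log_probs, sample_reward, greedy_reward):
    """REINFORCE-style loss using the greedy caption's reward as baseline.

    sample_log_probs: log-probabilities of the tokens in a sampled caption
    sample_reward:    score of the sampled caption (assumed scalar metric)
    greedy_reward:    score of the greedily decoded caption (the baseline)
    """
    # Advantage is positive when the sampled caption beats greedy decoding.
    advantage = sample_reward - greedy_reward
    # Minimizing this loss raises the probability of samples with a
    # positive advantage and lowers it for samples with a negative one.
    return -advantage * sum(sample_log_probs)


# Hypothetical numbers: a two-token sampled caption that outscores greedy.
log_probs = [math.log(0.5), math.log(0.25)]
loss = self_critical_loss(log_probs, sample_reward=1.2, greedy_reward=0.9)
```

Because the baseline comes from the model's own greedy output, no separate value network is needed to estimate it. A sampled caption that scores the same as the greedy caption contributes zero loss.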