101. Image Caption Generation Based on Global and Sequential Variational Autoencoding (基于全局与序列变分自编码的图像描述生成).
- Author
刘明明, 刘浩, 王栋, and 张海燕
- Abstract
Transformer-based image captioning models have shown remarkable performance owing to their powerful sequence modeling capability. However, most of them focus only on learning deterministic mappings from image space to caption space, i.e., on improving the accuracy of predicting "average" captions, which generally favors common words, repeated phrases, and a single sentence pattern, leading to a severe mode collapse problem. To this end, this paper combines the conditional variational autoencoder (CVAE) with a Transformer-based image captioning model and proposes sentence-level and word-level diverse image captioning models. The proposed models introduce global and sequential latent embedding learning based on the evidence lower bound (ELBO), which promotes the diversity of Transformer-based image captioning. Quantitative and qualitative experiments on the MSCOCO dataset show that both models can learn one-to-many mappings between the image space and the caption space. Compared with the state-of-the-art COS-CVAE, the proposed method improves the CIDEr and Div-2 scores by 1.3 and 33% respectively with 20 samples, and by 11.4 and 14% respectively with 100 samples. The proposed method fits the distribution of ground-truth captions well and achieves a better balance between diversity and accuracy.
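For context, the following is a minimal sketch of the conditional ELBO that this class of CVAE captioning models optimizes; the exact factorization and notation are assumptions for illustration, not taken from the paper. Here x is the image, y = (y_1, ..., y_T) the caption, z a global latent, and z_t per-step latents:

```latex
% Sentence-level (global) latent: one z per caption (assumed form).
\log p_\theta(y \mid x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[\log p_\theta(y \mid x, z)\right]
  \;-\; D_{\mathrm{KL}}\!\left(q_\phi(z \mid x, y)\,\|\,p_\theta(z \mid x)\right)

% Word-level (sequential) latents: one z_t per decoding step (assumed form).
\log p_\theta(y \mid x) \;\ge\; \sum_{t=1}^{T}
  \mathbb{E}_{q_\phi}\!\left[\log p_\theta(y_t \mid y_{<t}, z_{\le t}, x)\right]
  \;-\; \mathbb{E}_{q_\phi}\!\left[
    D_{\mathrm{KL}}\!\left(q_\phi(z_t \mid z_{<t}, x, y)\,\|\,p_\theta(z_t \mid z_{<t}, y_{<t}, x)\right)
  \right]
```

Sampling multiple latents from the prior at inference time is what yields the one-to-many image-to-caption mapping the abstract describes; diversity metrics such as Div-2 (the ratio of distinct bigrams in the set of sampled captions) then measure the spread of those samples.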
- Published
2024