SPT: Spatial Pyramid Transformer for Image Captioning

Authors :: Zhang, Haonan
Zeng, Pengpeng
Gao, Lianli
Lyu, Xinyu
Song, Jingkuan
Shen, Heng Tao
Source :: IEEE Transactions on Circuits and Systems for Video Technology; 2024, Vol. 34 Issue: 6 p4829-4842, 14p
Publication Year :: 2024
Abstract: Existing approaches to image captioning tend to adopt Transformer-based architectures with grid features, which represent the state of the art. However, these strategies tend to process the grid features at a fixed resolution, which often hampers the perception of entities at various scales. In addition, applying them directly may also result in the loss of spatial and fine-grained semantic information. To this end, we propose a simple yet effective method, named Spatial Pyramid Transformer (SPT). Specifically, it adopts several parameter-shared pyramid structures to perform semantic interactions across different grid resolutions. In each layer, we design a Spatial-aware Pseudo-supervised (SP) module, which aims to adaptively recover the spatial information disrupted when grid features are flattened. Moreover, to maintain the model size and enhance semantics, we build a simple weighted residual connection, termed the Scale-wise Reinforcement (SR) module, to simultaneously explore both low- and high-level encoded features. Extensive experiments on the MS-COCO benchmark demonstrate that our method achieves new state-of-the-art performance without adding excessive parameters relative to a vanilla Transformer. In addition, our method is extended to the video captioning task, which further demonstrates the practicability of the proposed method. Code is available at https://github.com/zchoi/SPT.
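
As a rough illustration of two ideas named in the abstract (grid features at several resolutions, and a weighted residual mix of low- and high-level features), the following is a minimal sketch only. It is not the authors' implementation: the function names `avg_pool`, `build_pyramid`, and `weighted_residual`, the pooling factors, and the mixing weight `alpha` are all assumptions made for this example, and each grid cell holds a single number rather than a feature vector.

```python
def avg_pool(grid, k):
    """Average-pool a square 2D grid (list of lists) with a k x k window, stride k."""
    n = len(grid) // k
    return [
        [
            sum(grid[i * k + a][j * k + b] for a in range(k) for b in range(k)) / (k * k)
            for j in range(n)
        ]
        for i in range(n)
    ]


def build_pyramid(grid, factors=(1, 2, 4)):
    """Return the grid at several resolutions (hypothetical pooling factors)."""
    return [avg_pool(grid, k) for k in factors]


def weighted_residual(low, high, alpha=0.5):
    """Elementwise alpha * low + (1 - alpha) * high for two same-shape grids.

    alpha is an assumed fixed weight; a real model might learn it instead.
    """
    return [
        [alpha * l + (1 - alpha) * h for l, h in zip(row_l, row_h)]
        for row_l, row_h in zip(low, high)
    ]


# A 4 x 4 toy "feature grid" with values 0..15.
grid = [[float(4 * i + j) for j in range(4)] for i in range(4)]
pyramid = build_pyramid(grid)
mixed = weighted_residual([[1.0, 2.0]], [[3.0, 4.0]], alpha=0.25)
```

Here `pyramid` holds the 4 x 4 grid together with its 2 x 2 and 1 x 1 pooled versions, and `mixed` blends two same-shape grids with the chosen weight. The actual architecture, including how resolutions interact and how the SP and SR modules are defined, is described in the paper and the linked repository.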