SPT: Spatial Pyramid Transformer for Image Captioning
- Source :
- IEEE Transactions on Circuits and Systems for Video Technology; 2024, Vol. 34 Issue: 6 p4829-4842, 14p
- Publication Year :
- 2024
Abstract
- Existing approaches to image captioning tend to adopt Transformer-based architectures with grid features, which represent the state of the art. However, these strategies typically process the grid features at a fixed resolution, which often hampers the perception of entities at various scales. In addition, applying them directly may result in the loss of spatial and fine-grained semantic information. To this end, we propose a simple yet effective method named Spatial Pyramid Transformer (SPT). Specifically, it adopts several parameter-shared pyramid structures to perform semantic interactions across different grid resolutions. In each layer, we design a Spatial-aware Pseudo-supervised (SP) module, which aims to adaptively resort to disrupted spatial information among flattened grid features. Moreover, to maintain the model size and enhance semantics, we build a simple weighted residual connection, termed the Scale-wise Reinforcement (SR) module, to simultaneously explore both low- and high-level encoded features. Extensive experiments on the MS-COCO benchmark demonstrate that our method achieves new state-of-the-art performance without introducing excessive parameters compared with the vanilla Transformer. In addition, our method is extended to the video captioning task, which further demonstrates its practicability. Code is available at https://github.com/zchoi/SPT.
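The two ideas the abstract names, viewing grid features at several resolutions and combining low- and high-level features through a weighted residual connection, can be illustrated with a small sketch. This is not the authors' implementation (see their repository for that). The single-channel grid, the use of average pooling to form coarser grids, the function names `avg_pool`, `build_pyramid` and `weighted_residual`, and the scalar weight `alpha` are all assumptions made for illustration only.

```python
from typing import List

# Illustrative assumption: one feature channel laid out on an H x W grid.
Grid = List[List[float]]


def avg_pool(grid: Grid, factor: int) -> Grid:
    """Downsample a grid by averaging non-overlapping factor x factor windows."""
    h, w = len(grid), len(grid[0])
    if h % factor or w % factor:
        raise ValueError("grid size must be divisible by factor")
    pooled = []
    for i in range(0, h, factor):
        row = []
        for j in range(0, w, factor):
            window = [
                grid[i + di][j + dj]
                for di in range(factor)
                for dj in range(factor)
            ]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


def build_pyramid(grid: Grid, factors: List[int]) -> List[Grid]:
    """Return one pooled view of the same grid per downsampling factor."""
    return [avg_pool(grid, f) for f in factors]


def weighted_residual(low: List[float], high: List[float], alpha: float) -> List[float]:
    """Add a scaled copy of low-level features onto high-level features.

    alpha is a hypothetical scalar weight; a trained model would learn it.
    """
    if len(low) != len(high):
        raise ValueError("feature vectors must have the same length")
    return [h_val + alpha * l_val for l_val, h_val in zip(low, high)]


if __name__ == "__main__":
    # A 4 x 4 grid viewed at full, half and quarter resolution.
    grid = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    for level in build_pyramid(grid, [1, 2, 4]):
        print(len(level), "x", len(level[0]))
    print(weighted_residual([1.0, 2.0], [10.0, 20.0], alpha=0.5))
```

In this sketch the pyramid levels are only pooled views of one grid. How the levels interact, and how parameters are shared across them, is specific to the paper and is not modelled here.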
Details
- Language :
- English
- ISSN :
- 1051-8215 and 1558-2205
- Volume :
- 34
- Issue :
- 6
- Database :
- Supplemental Index
- Journal :
- IEEE Transactions on Circuits and Systems for Video Technology
- Publication Type :
- Periodical
- Accession number :
- ejs66588450
- Full Text :
- https://doi.org/10.1109/TCSVT.2023.3336371