1. Combining Self-attention and Depth-wise Convolution for Human Pose Estimation
- Author
- Zhang Fan, Shi Qingxuan, and Ma Yanli
- Abstract
The adoption of Convolutional Neural Networks (CNNs) and Transformers has significantly accelerated progress in human pose estimation. However, challenges persist in accurately estimating fine details of targets and in performing pose estimation in complex environments where targets may be blurred or cluttered. In this paper, we take TokenPose as our baseline and revisit the design of its CNN and Transformer components. TokenPose uses CNNs and Transformers sequentially, combining the advantage of CNNs in extracting low-level features with that of Transformers in establishing long-range dependencies. On this basis, we modify the CNN- and Transformer-related structures to further enhance the expressive ability of the model. In the original CNN section, we replace BasicBlock with AtBlock to maintain high-resolution information exchange and better excavate object details. Two further modifications are applied to the subsequent Transformer section: (1) we replace the self-attention layer in each encoder block with a Local Enhancement Self-Attention (LESA) module to capture local information; (2) we propose a Local Perception Feed-Forward Network (LPFFN) to substitute for the feed-forward network layer in each encoder block, which employs depth-wise convolution to strengthen the correlation of neighboring information in the spatial dimension (see the sketch after this entry). These modifications help the model analyze poses that are difficult to estimate due to occlusion. We name the resulting method, which combines self-attention and depth-wise convolution, CSDNet. It outperforms the TokenPose baseline on both the COCO 2017 and MPII datasets, and in complicated environments such as poor lighting it estimates blurred keypoints more accurately.
- Published
- 2023
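
As a rough illustration of the LPFFN idea described in the abstract, the PyTorch sketch below inserts a depth-wise convolution between the two point-wise layers of a Transformer feed-forward block. The abstract gives no implementation details, so the 3x3 kernel, the 4x hidden expansion ratio, and the assumption that all tokens lie on an H x W grid are our assumptions, not the authors' code; in TokenPose the extra keypoint tokens would presumably need to bypass the convolution.

```python
import torch
import torch.nn as nn

class LPFFN(nn.Module):
    """Minimal sketch of a Local Perception Feed-Forward Network.

    Assumptions (not specified in the abstract): tokens are visual tokens
    on an H x W grid, the hidden expansion ratio is 4, and local mixing is
    a 3x3 depth-wise convolution.
    """

    def __init__(self, dim, hidden_dim=None):
        super().__init__()
        hidden_dim = hidden_dim or dim * 4
        self.fc1 = nn.Linear(dim, hidden_dim)      # point-wise expansion
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim,
                                kernel_size=3, padding=1,
                                groups=hidden_dim)  # depth-wise: one filter per channel
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)       # point-wise projection

    def forward(self, x, h, w):
        # x: (B, N, C) with N == h * w visual tokens
        b, n, _ = x.shape
        x = self.act(self.fc1(x))                   # (B, N, hidden)
        # Fold the token sequence back into a 2D map so the depth-wise
        # convolution mixes each token with its spatial neighbors.
        x = x.transpose(1, 2).reshape(b, -1, h, w)
        x = self.act(self.dwconv(x))
        x = x.reshape(b, -1, n).transpose(1, 2)     # back to (B, N, hidden)
        return self.fc2(x)

# Usage: 64 visual tokens from a hypothetical 8x8 feature map, dim 192
tokens = torch.randn(2, 64, 192)
ffn = LPFFN(192)
out = ffn(tokens, h=8, w=8)
print(out.shape)  # torch.Size([2, 64, 192])
```

The design choice mirrors LocalViT-style feed-forward blocks: the two linear layers act per token (point-wise), while the grouped convolution adds spatial neighborhood mixing at negligible parameter cost, since each channel gets only a single 3x3 filter.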