1. Combining self-attention and depth-wise convolution for human pose estimation.
- Author
Zhang, Fan; Shi, Qingxuan; Ma, Yanli
- Abstract
The adoption of convolutional neural networks (CNNs) and Transformers has driven rapid advances in human pose estimation. However, accurately estimating fine target details remains challenging, and pose estimation is inherently difficult in complex environments marked by conditions such as motion blur and cluttered scenes. In this paper, we revisit the design of CNNs and Transformers, delving deeper into their internal structures. We use CNNs and Transformers sequentially, leveraging the proficiency of CNNs at extracting low-level features and the capability of Transformers to establish long-range dependencies. Building on this framework, we modify both the CNN and Transformer structures to enhance the model's overall expressive capacity. In the original CNN section, we replace the BasicBlock with an AtBlock to maintain high-resolution information exchange and better extract object details. Two further modifications are applied to the subsequent Transformer section: (1) we replace the self-attention layer in each encoder block with a local enhancement self-attention module to capture local information; (2) we propose a local perception feed-forward network to substitute for the feed-forward network layer in each encoder block, employing depth-wise convolution to strengthen the correlation of neighbouring information in the spatial dimension. These modifications help analyze poses that are hard to estimate due to occlusion. We name the resulting method, which combines self-attention and depth-wise convolution, CSDNet. Experiments on both the COCO2017 and MPII datasets show improved performance over the baseline. Compared to other models achieving similar accuracy, our model has fewer parameters and requires less computation. Additionally, in complicated environments such as poor lighting, our method estimates blurred keypoints more accurately.
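The local perception feed-forward network in point (2) can be illustrated with a short sketch. The PyTorch code below is an assumption-based illustration, not the paper's implementation: the class name `LocalPerceptionFFN`, the `hidden_dim` expansion, and the 3x3 kernel size are hypothetical choices. It only shows the general pattern the abstract describes: a depth-wise convolution inserted between the two point-wise linear layers of a Transformer feed-forward network, so that neighbouring tokens mix along the spatial dimension.

```python
# Hypothetical sketch of a depth-wise-convolution FFN; names and sizes are
# illustrative, not taken from the CSDNet paper.
import torch
import torch.nn as nn


class LocalPerceptionFFN(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)  # point-wise expansion
        # groups=hidden_dim makes the 3x3 convolution depth-wise: each channel
        # is filtered independently, mixing only spatial neighbours.
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)  # point-wise projection

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (batch, h*w, dim) token sequence from an encoder block
        x = self.fc1(x)
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)  # tokens -> feature map
        x = self.act(self.dwconv(x))               # local spatial mixing
        x = x.reshape(b, c, n).transpose(1, 2)     # feature map -> tokens
        return self.fc2(x)


# Example: 64 tokens from an 8x8 feature map, embedding dimension 32
ffn = LocalPerceptionFFN(dim=32, hidden_dim=128)
tokens = torch.randn(2, 64, 32)
out = ffn(tokens, h=8, w=8)
print(out.shape)  # torch.Size([2, 64, 32])
```

The depth-wise grouping keeps the added cost small: a 3x3 depth-wise convolution adds roughly 9 parameters per hidden channel rather than the quadratic cost of a full convolution, which is consistent with the abstract's claim of few parameters and low computation.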
- Published
- 2024