1. CSPNeXt: A new efficient token hybrid backbone.
- Author
- Chen, Xiangqi; Yang, Chengzhuan; Mo, Jiashuaizi; Sun, Yaxin; Karmouni, Hicham; Jiang, Yunliang; Zheng, Zhonglong
- Subjects
- *IMAGE recognition (Computer vision), *CONVOLUTIONAL neural networks, *LEARNING ability, *SPINE
- Abstract
Although the cross-stage partial network (CSPNet) enhances the learning ability of convolutional neural networks and reduces their computational effort while offering high flexibility and efficiency, the model suffers from a limited receptive field and weak mixing of high-frequency and low-frequency features, which significantly degrades its recognition performance. To alleviate this problem, we propose a "modernized" CSPNeXt model that effectively learns the high- and low-frequency information of the feature maps and extends the receptive field to improve the recognition performance of the CSPNet model. At the same time, CSPNeXt retains the corresponding advantages of CSPNet. Specifically, we introduce parallel large-kernel convolution and a simple average pooling method to capture different frequency information in the image. Unlike the original CSPNet channel splitting mechanism, the CSPNeXt mixer introduces a new channel splitting mechanism that fuses features more effectively. To obtain more high-frequency signals in the shallow layers and more low-frequency signals in the deep layers, we increase the number of channels fed to the high-frequency mixer in the shallow layers and expand the number fed to the low-frequency mixer in the deep layers. This mechanism efficiently captures high- and low-frequency signals at different levels. We extensively test the CSPNeXt model on various vision tasks, including image classification, object detection, and instance segmentation, where it demonstrates excellent performance, outperforming the previous CSPNet method. Our method achieves 81.6% top-1 accuracy on ImageNet-1K, 1.8% better than DeiT-S and slightly better than Swin-T (81.3%), while using fewer parameters and GFLOPs.
• High- and low-frequency information is combined in an efficient manner.
• Local and global feature representations are balanced in a new manner.
• A new channel segmentation mechanism is introduced.
• The proposed architecture achieves good performance on several vision tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2024
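The abstract above sketches the core mechanism: split the channels, route one share through a parallel large-kernel convolution (high-frequency, local detail) and the other through average pooling (low-frequency, global context), then fuse the two streams, with the high-frequency share shrinking at deeper stages. Below is a minimal PyTorch sketch of that idea; the class name CSPNeXtMixer, the hf_ratio parameter, the 7x7 kernel, and the pool/upsample details are illustrative assumptions, not the authors' reference implementation.

import torch
import torch.nn as nn


class CSPNeXtMixer(nn.Module):
    # Illustrative high/low-frequency mixer. Names, kernel size, and the
    # pool/upsample details are assumptions for exposition, not taken from
    # the paper's reference implementation.

    def __init__(self, dim: int, hf_ratio: float = 0.5, kernel_size: int = 7):
        super().__init__()
        # Channel split: hf_ratio of the channels feed the high-frequency
        # branch. The paper varies this ratio with depth (larger in shallow
        # stages, smaller in deep stages).
        self.hf_dim = int(dim * hf_ratio)
        self.lf_dim = dim - self.hf_dim

        # High-frequency branch: large-kernel depthwise convolution captures
        # local detail with an enlarged receptive field.
        self.hf_conv = nn.Conv2d(self.hf_dim, self.hf_dim, kernel_size,
                                 padding=kernel_size // 2, groups=self.hf_dim)

        # Low-frequency branch: average pooling plus a pointwise convolution
        # approximates global, low-frequency context cheaply.
        self.lf_pool = nn.AvgPool2d(kernel_size=2, stride=2)
        self.lf_conv = nn.Conv2d(self.lf_dim, self.lf_dim, kernel_size=1)
        self.lf_up = nn.Upsample(scale_factor=2, mode="nearest")

        # Pointwise fusion of the two frequency streams.
        self.fuse = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hf, lf = torch.split(x, [self.hf_dim, self.lf_dim], dim=1)
        hf = self.hf_conv(hf)
        lf = self.lf_up(self.lf_conv(self.lf_pool(lf)))
        return self.fuse(torch.cat([hf, lf], dim=1)) + x  # residual path


if __name__ == "__main__":
    # Shallow stage: more channels routed to the high-frequency mixer.
    shallow = CSPNeXtMixer(dim=64, hf_ratio=0.75)
    # Deep stage: more channels routed to the low-frequency mixer.
    deep = CSPNeXtMixer(dim=64, hf_ratio=0.25)
    out = deep(shallow(torch.randn(1, 64, 32, 32)))
    print(out.shape)  # torch.Size([1, 64, 32, 32])

Downsampling before the low-frequency convolution and upsampling after mirrors the abstract's point that simple average pooling is a cheap way to gather global context; in the actual model the split ratio would be fixed per stage rather than chosen per instance as in this toy usage.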