1. Local-aware spatio-temporal attention network with multi-stage feature fusion for human action recognition
- Author
- Hongwei Ge, Pengfei Wang, Dongsheng Zhou, Yaqing Hou, Jianxin Zhang, Qiang Zhang, and Hua Yu
- Subjects
0209 industrial biotechnology, business.industry, Computer science, Deep learning, Pooling, 02 engineering and technology, Machine learning, computer.software_genre, Support vector machine, 020901 industrial engineering & automation, Spatial network, Action (philosophy), Discriminative model, Artificial Intelligence, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, Representation (mathematics), computer, Software, Transformer (machine learning model)
- Abstract
In the study of human action recognition, two-stream networks have made excellent progress recently. However, distinguishing similar human actions in videos remains challenging. This paper proposes a novel local-aware spatio-temporal attention network with multi-stage feature fusion based on compact bilinear pooling for human action recognition. Taking two-stream networks as the backbone, the spatial network first employs multiple spatial transformer networks in parallel to locate the discriminative regions related to human actions. Then, we fuse the local and global features to enhance the human action representation. Furthermore, the output of the spatial network and the temporal information are fused at a particular layer to learn pixel-wise correspondences. After that, we combine the three outputs to generate the global descriptors of human actions. To verify the efficacy of the proposed approach, comparison experiments are conducted against traditional hand-engineered IDT algorithms, classical machine learning methods (e.g., SVM) and state-of-the-art deep learning methods (e.g., spatio-temporal multiplier networks). According to the results, our approach is reported to obtain the best performance among existing works, with accuracies of 95.3% on UCF101 and 72.9% on HMDB51. The experimental results thus demonstrate the superiority and significance of the proposed architecture for human action recognition.
- Published
- 2021
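The abstract names compact bilinear pooling as the fusion operator but gives no implementation details. As a rough illustration only, the sketch below shows the commonly used Tensor Sketch formulation of compact bilinear pooling in NumPy: each input vector is projected with a random count sketch, and the two sketches are combined by circular convolution computed through the FFT. The function names (`make_sketch`, `sketch_matrix`, `compact_bilinear_pool`), the dimensions, and the choice of Tensor Sketch are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def make_sketch(in_dim, out_dim, rng):
    """Draw the random hash indices and signs that define one count sketch."""
    h = rng.integers(0, out_dim, size=in_dim)            # bucket for each input dim
    s = rng.choice(np.array([-1.0, 1.0]), size=in_dim)   # random sign for each input dim
    return h, s


def sketch_matrix(h, s, out_dim):
    """Dense (in_dim, out_dim) matrix P such that x @ P is the count sketch of x."""
    P = np.zeros((h.shape[0], out_dim))
    P[np.arange(h.shape[0]), h] = s
    return P


def compact_bilinear_pool(x, y, P1, P2):
    """Fuse two feature batches of shape (n, c1) and (n, c2) into shape (n, d).

    Multiplying the FFTs of the two count sketches is equivalent to a
    circular convolution, which in turn is a count sketch of the outer
    product of x and y (the full bilinear feature).
    """
    d = P1.shape[1]
    fx = np.fft.rfft(x @ P1, axis=-1)
    fy = np.fft.rfft(y @ P2, axis=-1)
    return np.fft.irfft(fx * fy, n=d, axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c, d = 512, 2048                      # illustrative sizes, not from the paper
    h1, s1 = make_sketch(c, d, rng)
    h2, s2 = make_sketch(c, d, rng)
    spatial = rng.normal(size=(4, c))     # stand-in for spatial-stream features
    temporal = rng.normal(size=(4, c))    # stand-in for temporal-stream features
    fused = compact_bilinear_pool(
        spatial, temporal, sketch_matrix(h1, s1, d), sketch_matrix(h2, s2, d)
    )
    print(fused.shape)
```

The output dimension `d` trades fidelity for size: the full bilinear feature would have `c * c` entries, while the sketch keeps only `d`. In a trained network the hash indices and signs are typically drawn once and held fixed.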