
Spatio-Temporal Collaborative Module for Efficient Action Recognition.

Authors :
Hao, Yanbin
Wang, Shuo
Tan, Yi
He, Xiangnan
Liu, Zhenguang
Wang, Meng
Source :
IEEE Transactions on Image Processing. 2022, Vol. 31, p7279-7291. 13p.
Publication Year :
2022

Abstract

Efficient action recognition aims to classify a video clip into a specific action category at low computational cost. This is challenging because integrated spatial-temporal computation (e.g., 3D convolution) involves intensive operations and increases complexity. This paper explores integrating channel splitting and filter decoupling for efficient architecture design and feature refinement by proposing a novel spatio-temporal collaborative (STC) module. STC splits the video feature channels into two groups and learns spatio-temporal representations separately, in parallel, with decoupled convolutional operators. Specifically, STC consists of two computation-efficient blocks, S_T and T_S, which extract either spatial (S·) or temporal (T·) features and then refine those features globally with either temporal (·T) or spatial (·S) contexts. The spatial/temporal context refers to information dynamics aggregated along the temporal/spatial axis. To thoroughly examine the method's performance on video action recognition tasks, we conduct extensive experiments on five video benchmark datasets requiring temporal reasoning. Experimental results show that the proposed STC networks achieve a competitive trade-off between model efficiency and effectiveness. [ABSTRACT FROM AUTHOR]
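The block layout the abstract describes (channel splitting, decoupled spatial and temporal operators, cross-axis context refinement) can be sketched in NumPy. This is a minimal illustration under assumed choices, not the paper's actual operators: the 3×3/length-3 box filters, the mean-pooled context, and the tanh gating are placeholders for the decoupled convolutions and global refinement the STC module would learn.

```python
import numpy as np

def split_channels(x):
    # x: (C, T, H, W) video feature map; split channels into two equal groups
    c = x.shape[0] // 2
    return x[:c], x[c:]

def spatial_op(x):
    # placeholder decoupled spatial operator: 3x3 box filter over (H, W),
    # identity along T (stands in for a learned 2D convolution)
    pad = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)), mode="edge")
    out = np.zeros_like(x)
    for dh in range(3):
        for dw in range(3):
            out += pad[:, :, dh:dh + x.shape[2], dw:dw + x.shape[3]]
    return out / 9.0

def temporal_op(x):
    # placeholder decoupled temporal operator: length-3 box filter over T
    pad = np.pad(x, ((0, 0), (1, 1), (0, 0), (0, 0)), mode="edge")
    return (pad[:, :-2] + pad[:, 1:-1] + pad[:, 2:]) / 3.0

def global_context(x, axis):
    # global context: information aggregated along the complementary axis/axes
    return x.mean(axis=axis, keepdims=True)

def stc_module(x):
    # x: (C, T, H, W). S_T branch: spatial features refined by temporal
    # context; T_S branch: temporal features refined by spatial context.
    xs, xt = split_channels(x)
    s_t = spatial_op(xs) * (1.0 + np.tanh(global_context(xs, axis=1)))
    t_s = temporal_op(xt) * (1.0 + np.tanh(global_context(xt, axis=(2, 3))))
    # the two branches run in parallel and are concatenated back on channels
    return np.concatenate([s_t, t_s], axis=0)
```

The key efficiency idea survives the simplification: each channel group sees only a 2D spatial or 1D temporal operator plus a cheap pooled context, rather than a full 3D convolution over all channels.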

Details

Language :
English
ISSN :
1057-7149
Volume :
31
Database :
Academic Search Index
Journal :
IEEE Transactions on Image Processing
Publication Type :
Academic Journal
Accession number :
170077437
Full Text :
https://doi.org/10.1109/TIP.2022.3221292