Cross-modal guides spatio-temporal enrichment network for few-shot action recognition.

Authors :
Chen, Zhiwen
Yang, Yi
Li, Li
Li, Min
Source :
Applied Intelligence; Nov 2024, Vol. 54, Issue 22, p11196-11211, 16p
Publication Year :
2024

Abstract

Few-shot action recognition aims to learn a model that can be easily adapted to identify novel action classes using only a few labeled samples. Recent methods focus primarily on visual features and fail to fully utilize the available classification title of the video. In addition, they capture higher-order temporal relationships among video frames through averaging, which neglects long-range dependencies in the video. To address these issues, we design a novel cross-modal guided spatio-temporal enrichment network (X-STEN) for few-shot action recognition. The model includes a cross-modal spatial enrichment module (X-SEM), a temporal enrichment module (TEM), and a non-parametric metrics module (NMM). First, we extract and fuse multi-modal feature representations of videos. Then, we enhance the spatial context information of the video using the X-SEM and model the temporal context information of the video using the TEM. Finally, we generate the query and support prototypes and measure the similarity between them. Extensive experiments demonstrate that X-STEN achieves excellent results on few-shot splits of Kinetics, HMDB51, and UCF101. Notably, our method outperforms prior work on Kinetics by a wide margin (13.9%). [ABSTRACT FROM AUTHOR]
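
The final step of the abstract (building query and support prototypes and comparing them with a non-parametric metric) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the toy 2-dimensional frame features, the use of cosine similarity as the metric, and the mean pooling over frames are all assumptions made for illustration. In particular, plain averaging stands in for the X-SEM and TEM modules, which the abstract does not specify in detail.

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length feature vectors element-wise.
    Assumption: a placeholder for the paper's enrichment modules."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity, used here as one possible non-parametric metric."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def class_prototypes(support):
    """Build one prototype per class.
    support maps a class name to a list of videos; each video is a
    list of per-frame feature vectors (hypothetical toy format)."""
    return {
        label: mean_pool([mean_pool(video) for video in videos])
        for label, videos in support.items()
    }

def classify(query_video, prototypes):
    """Return the class whose prototype is most similar to the query."""
    query = mean_pool(query_video)
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))

# Toy 2-way 1-shot episode with made-up features.
support = {
    "jump": [[[1.0, 0.0], [0.9, 0.1]]],
    "run": [[[0.0, 1.0], [0.1, 0.9]]],
}
query = [[0.8, 0.2], [1.0, 0.0]]

prototypes = class_prototypes(support)
prediction = classify(query, prototypes)
print(prediction)
```

Because the comparison step has no learned parameters, the same functions work unchanged for any number of classes or shots per class; only the feature extractor feeding them would need training.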

Details

Language :
English
ISSN :
0924-669X
Volume :
54
Issue :
22
Database :
Complementary Index
Journal :
Applied Intelligence
Publication Type :
Academic Journal
Accession number :
179711605
Full Text :
https://doi.org/10.1007/s10489-024-05617-5