Cross-modal guides spatio-temporal enrichment network for few-shot action recognition.
- Source :
- Applied Intelligence; Nov2024, Vol. 54 Issue 22, p11196-11211, 16p
- Publication Year :
- 2024
Abstract
- Few-shot action recognition aims to learn a model that can be easily adapted to identify novel action classes using only a few labeled samples. Recent methods focus primarily on visual features and fail to fully utilize the classification title available for each video. In addition, they capture higher-order temporal relationships among video frames through averaging, which neglects the long-range dependency information in the video. To address these issues, we designed a novel cross-modal guided spatio-temporal enrichment network (X-STEN) for few-shot action recognition. The model includes a cross-modal spatial enrichment module (X-SEM), a temporal enrichment module (TEM), and a non-parametric metrics module (NMM). First, we extract and fuse multi-modal feature representations of videos. Then, we enhance the spatial context information of the video using the X-SEM and model the temporal context information of the video using the TEM. Finally, we generate the query and support prototypes and measure the similarity between them. Extensive experiments demonstrate that X-STEN achieves excellent results on few-shot splits of Kinetics, HMDB51, and UCF101. Importantly, our method outperforms prior work on Kinetics by a wide margin (13.9%). [ABSTRACT FROM AUTHOR]
- Subjects :
- RECOGNITION (Psychology)
VIDEOS
PROTOTYPES
CLASSIFICATION
- Language :
- English
- ISSN :
- 0924-669X
- Volume :
- 54
- Issue :
- 22
- Database :
- Complementary Index
- Journal :
- Applied Intelligence
- Publication Type :
- Academic Journal
- Accession number :
- 179711605
- Full Text :
- https://doi.org/10.1007/s10489-024-05617-5
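
Illustrative sketch (not from the article): the final step described in the abstract, generating support prototypes and measuring their similarity to a query, can be pictured with a generic nearest-prototype classifier. The abstract does not specify how the NMM is computed, so everything here is an assumption: each video is taken to be already reduced to a single feature vector, a class prototype is the mean of that class's support vectors, and cosine similarity stands in for the similarity measure. The function names and the toy data are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_prototypes(support_set):
    """Map each class label to the mean of its support feature vectors.

    Assumption: mean pooling over support videos; the article's own
    prototype construction may differ.
    """
    return {label: mean_vector(videos) for label, videos in support_set.items()}


def classify(query, prototypes):
    """Return the label whose prototype is most similar to the query."""
    return max(prototypes, key=lambda label: cosine_similarity(query, prototypes[label]))


# Toy 2-way 2-shot episode with made-up 3-dimensional video features.
support = {
    "jump": [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]],
    "run": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
prototypes = build_prototypes(support)
prediction = classify([0.8, 0.2, 0.0], prototypes)
print(prediction)
```

The classifier has no trainable parameters: once features are extracted, adding a novel class only requires averaging its few labeled samples, which is the property that makes prototype matching a common choice in few-shot settings.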