Multimodal action recognition: a comprehensive survey on temporal modeling.
- Authors
- Shabaninia, Elham; Nezamabadi-pour, Hossein; Shafizadegan, Fatemeh
- Subjects
- NATURAL language processing; RECURRENT neural networks; TRANSFORMER models; DEEP learning; HUMAN activity recognition; RECOGNITION (Psychology)
- Abstract
- In action recognition that relies on visual information, activities are recognized through spatio-temporal features drawn from different modalities. Temporal modeling has been a long-standing challenge in this field. Only a limited number of techniques, such as pre-computed motion features, three-dimensional (3D) filters, and recurrent neural networks (RNNs), are used in deep learning-based approaches to model motion information. However, the success of transformers in modeling long-range dependencies in natural language processing tasks has recently caught the attention of other domains, including speech, image, and video, since transformers can rely entirely on self-attention without using sequence-aligned RNNs or convolutions. Although the application of transformers to action recognition is relatively new, the amount of research proposed on this topic in the last few years is impressive. This paper aims to review recent progress in deep learning methods for modeling temporal variations in multimodal human action recognition. Specifically, it focuses on methods that use transformers for temporal modeling, highlighting their key features and the modalities they employ, while also identifying opportunities and challenges for future research. [ABSTRACT FROM AUTHOR]
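- For illustration only (not taken from the surveyed paper): a minimal sketch of how transformer-style temporal modeling differs from RNN or 3D-convolution pipelines, assuming per-frame features already extracted by an image backbone. The module name `TemporalSelfAttention`, the feature dimension, and the class count are hypothetical.

```python
# Hedged sketch: self-attention over the time axis of a clip's per-frame features,
# so every frame can attend to every other frame without recurrence or 3D filters.
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Illustrative temporal transformer block for clip-level action recognition."""
    def __init__(self, dim=512, num_heads=8, num_classes=400):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable class token
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frame_features):            # (batch, time, dim)
        b = frame_features.size(0)
        cls = self.cls_token.expand(b, -1, -1)     # prepend one token per clip
        x = torch.cat([cls, frame_features], dim=1)
        # Long-range temporal dependencies via all-pairs attention across frames.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm(x + attn_out)
        return self.head(x[:, 0])                  # classify from the class token

# Usage: 8 clips, 16 frames each, 512-D per-frame embeddings (e.g. from a 2D CNN).
logits = TemporalSelfAttention()(torch.randn(8, 16, 512))
print(logits.shape)  # torch.Size([8, 400])
```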
- Published
- 2024