
A Novel Two-Stream Transformer-Based Framework for Multi-Modality Human Action Recognition

Authors :
Jing Shi
Yuanyuan Zhang
Weihang Wang
Bin Xing
Dasha Hu
Liangyin Chen
Source :
Applied Sciences, Vol 13, Iss 4, p 2058 (2023)
Publication Year :
2023
Publisher :
MDPI AG, 2023.

Abstract

Due to the great success of the Vision Transformer (ViT) in image classification tasks, many pure Transformer architectures for human action recognition have been proposed. However, very few works have attempted to use Transformers for bimodal action recognition, i.e., using both the skeleton and RGB modalities. As shown in many previous works, the RGB and skeleton modalities are complementary in human action recognition tasks, and combining them within a Transformer-based framework remains a challenge. In this paper, we propose RGBSformer, a novel two-stream pure-Transformer framework for human action recognition that uses both RGB and skeleton modalities. From RGB videos alone, we can acquire skeleton data and generate the corresponding skeleton heatmaps. We then feed the skeleton heatmaps and RGB frames into the Transformer at different temporal and spatial resolutions. Because the skeleton heatmaps are higher-level features than the original RGB frames, we use fewer attention layers in the skeleton stream. In addition, two strategies are proposed to fuse the information from the two streams. Experiments demonstrate that the proposed framework achieves state-of-the-art performance on four benchmarks: three widely used datasets, Kinetics400, NTU RGB+D 60, and NTU RGB+D 120, and the fine-grained dataset FineGym99.
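The two-stream idea from the abstract can be sketched in miniature: each modality passes through its own stack of self-attention layers, with the skeleton stream deliberately shallower, and the resulting clip-level descriptors are fused by averaging. This is a toy illustration, not the paper's architecture: the layer counts, the simplified attention (no learned projections), and the averaging fusion are all assumptions for the sketch.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_layer(tokens):
    """Simplified self-attention: each output token is a softmax-weighted
    mix of all input tokens, weighted by scaled dot-product similarity.
    (Real ViT layers add learned Q/K/V projections, MLPs, and residuals.)"""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in tokens]
        w = softmax(scores)
        out.append([sum(wi * k[d] for wi, k in zip(w, tokens))
                    for d in range(dim)])
    return out

def stream(tokens, num_layers):
    """One modality stream: a stack of attention layers."""
    for _ in range(num_layers):
        tokens = attention_layer(tokens)
    return tokens

def pooled_feature(tokens):
    """Mean-pool token features into a single clip-level descriptor."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]

def fuse(rgb_tokens, skel_tokens, rgb_layers=4, skel_layers=2):
    """Two-stream sketch: the skeleton heatmaps are already high-level
    features, so that stream uses fewer attention layers (hypothetical
    counts). Fusion here is simple late averaging of clip descriptors."""
    rgb_feat = pooled_feature(stream(rgb_tokens, rgb_layers))
    skel_feat = pooled_feature(stream(skel_tokens, skel_layers))
    return [(r + s) / 2 for r, s in zip(rgb_feat, skel_feat)]
```

A classifier head would then map the fused descriptor to action logits; the paper's second fusion strategy (not sketched here) operates between the streams rather than after them.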

Details

Language :
English
ISSN :
2076-3417
Volume :
13
Issue :
4
Database :
Directory of Open Access Journals
Journal :
Applied Sciences
Publication Type :
Academic Journal
Accession number :
edsdoj.babb4b7e16ac424aaab7829e00f2e25b
Document Type :
article
Full Text :
https://doi.org/10.3390/app13042058