
Static hand gesture recognition method based on the Vision Transformer.

Authors :
Zhang, Yu
Wang, Junlin
Wang, Xin
Jing, Haonan
Sun, Zhanshuo
Cai, Yu
Source :
Multimedia Tools & Applications; Aug2023, Vol. 82 Issue 20, p31309-31328, 20p
Publication Year :
2023

Abstract

Hand gesture recognition (HGR) is a central component of human-computer interaction (HCI). Static hand gesture recognition is equivalent to classifying hand gesture images, a task currently dominated by Convolutional Neural Network (CNN) methods. The Vision Transformer (ViT) architecture dispenses with convolutional layers entirely and instead uses a multi-head attention mechanism to learn global information. This paper therefore proposes a static hand gesture recognition method based on the Vision Transformer, trained and evaluated on a self-made dataset and two publicly available American Sign Language (ASL) datasets. The self-made dataset is built by using the depth information provided by a Microsoft Kinect camera to capture hand gesture images and filter the background, then applying an eight-connected discrimination algorithm and a distance transformation algorithm to remove redundant arm information. The paper also studies the impact of several data augmentation strategies on recognition performance, using accuracy, F1 score, recall, and precision as evaluation metrics. The proposed model achieves validation accuracies of 99.44%, 99.37%, and 96.53% on the three datasets, respectively, outperforming the results obtained by several CNN structures. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
1380-7501
Volume :
82
Issue :
20
Database :
Complementary Index
Journal :
Multimedia Tools & Applications
Publication Type :
Academic Journal
Accession number :
167307523
Full Text :
https://doi.org/10.1007/s11042-023-14732-3