Appearance-posture fusion network for distracted driving behavior recognition.
- Source :
- Expert Systems with Applications. Dec 2024, Vol. 257, Article 124883.
- Publication Year :
- 2024
Abstract
- In recent years, detection techniques based on computer vision and deep learning have shown promise in assessing driver distraction. This paper proposes a fusion network that combines a Spatial-Temporal Graph Convolutional Network (ST-GCN) with a hybrid convolutional network, integrating multimodal input data to recognize distracted driving behavior. To address ST-GCN's limitations in modeling long-distance joint interactions and its inadequate temporal feature extraction, we design the Spatially Symmetric Configuration Partitioning Graph Convolutional Network (SSCP-GCN), which models the relative motion of symmetric limb pairs. In addition, densely connected blocks process multi-scale temporal information across consecutive frames, enhancing the reuse of low-level features, and a channel attention mechanism strengthens the expression of important temporal information. The Mixed Convolution (MC) design, which combines 3D and 2D convolutions, cannot extract higher-order temporal information and is limited in modeling global dependencies. We compensate for the former with the Temporal Shift Module (TSM), which consumes no additional computational resources, and for the latter with a 3D Multi-Head Self-Attention mechanism (3D MHSA) that integrates the global spatial-temporal information of high-level features while avoiding the rapid growth in model complexity caused by deeply stacked Convolutional Neural Networks (CNNs). Finally, we introduce a multistream framework that fuses driver posture and appearance features to exploit their complementary strengths. Experimental results show that the proposed network reaches 95.6% accuracy on the ASU dataset and 94.3% on the NTU RGB+D dataset, and its small size makes practical deployment feasible. [ABSTRACT FROM AUTHOR]
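The abstract does not detail the SSCP-GCN partitioning scheme; one plausible reading is that edges between symmetric left/right joints are added on top of the skeletal adjacency, so that graph convolution can model their relative motion directly. A minimal sketch under that assumption follows; the joint indices and symmetric pairs are illustrative placeholders, not the paper's actual configuration.

```python
import numpy as np

# Illustrative symmetric limb pairs (hypothetical indices,
# not the paper's actual SSCP configuration).
SYMMETRIC_PAIRS = [(4, 8), (5, 9), (6, 10), (11, 15), (12, 16), (13, 17)]

def build_symmetric_adjacency(num_joints, skeleton_edges):
    """Augment the skeletal adjacency with left/right symmetric-pair edges,
    then apply the usual GCN normalization: D^(-1/2) (A + I) D^(-1/2)."""
    A = np.zeros((num_joints, num_joints), dtype=np.float32)
    for i, j in skeleton_edges + SYMMETRIC_PAIRS:
        A[i, j] = A[j, i] = 1.0
    A += np.eye(num_joints, dtype=np.float32)   # self-loops
    D = np.diag(1.0 / np.sqrt(A.sum(axis=1)))   # degree normalization
    return D @ A @ D
```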
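The channel attention variant is likewise unspecified; a common choice, assumed here, is a squeeze-and-excitation-style gate that reweights the channels of an (N, C, T, V) skeleton feature map so that informative temporal channels are emphasized.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel gate (an assumed variant):
    global-average-pool over time and joints, then a bottleneck MLP
    produces per-channel weights in (0, 1)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                # x: (N, C, T, V)
        w = x.mean(dim=(2, 3))           # squeeze: (N, C)
        w = self.fc(w)[..., None, None]  # excite: (N, C, 1, 1)
        return x * w                     # reweight channels
```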
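The Temporal Shift Module itself is a published technique (Lin et al., 2019): a fraction of the channels is shifted one step forward or backward along the time axis, mixing information between neighboring frames at zero extra multiply-add cost. A standard sketch:

```python
import torch

def temporal_shift(x, shift_div=8):
    """Shift 1/shift_div of the channels one frame toward the past, another
    1/shift_div one frame toward the future, and leave the rest in place.
    x: (N, T, C, H, W). Costs no multiply-adds, only memory movement."""
    n, t, c, h, w = x.size()
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # backward shift
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # forward shift
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # unshifted remainder
    return out
```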
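For 3D MHSA, the abstract's description (integrating the global spatial-temporal information of high-level features) matches standard multi-head self-attention applied over all (t, h, w) positions of a clip-level feature map; the sketch below assumes that formulation with a pre-norm residual connection, which may differ in detail from the paper's design.

```python
import torch
import torch.nn as nn

class MHSA3D(nn.Module):
    """Self-attention over flattened spatio-temporal tokens: every (t, h, w)
    position of an (N, C, T, H, W) feature map attends to the whole clip."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):                      # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (N, T*H*W, C)
        q = self.norm(tokens)
        y, _ = self.attn(q, q, q)
        tokens = tokens + y                    # residual connection
        return tokens.transpose(1, 2).reshape(n, c, t, h, w)
```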
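Finally, the abstract does not say how the posture and appearance streams are combined; score-level weighted averaging of per-stream class probabilities is a common multistream choice and is what this sketch assumes (the equal weights are hypothetical).

```python
import torch

def fuse_streams(posture_logits, appearance_logits, weights=(0.5, 0.5)):
    """Late fusion: average the per-stream class probabilities with
    (assumed) fixed weights; argmax gives the predicted behavior class."""
    p = torch.softmax(posture_logits, dim=1)
    a = torch.softmax(appearance_logits, dim=1)
    scores = weights[0] * p + weights[1] * a
    return scores.argmax(dim=1)
```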
Details
- Language :
- English
- ISSN :
- 0957-4174
- Volume :
- 257
- Database :
- Academic Search Index
- Journal :
- Expert Systems with Applications
- Publication Type :
- Academic Journal
- Accession number :
- 179507002
- Full Text :
- https://doi.org/10.1016/j.eswa.2024.124883