Wanli Xue, Ze Kang, Leming Guo, Shourui Yang, Tiantian Yuan, and Shengyong Chen
Consumer communication is one of the most important factors affecting consumption, especially for hearing-impaired people. Continuous sign language recognition (CSLR) helps people understand the expressions of hearing-impaired people by predicting sign language sentences from sign language videos. The current CSLR paradigm suffers from overfitting, which leaves the model with insufficient supervisory information during backpropagation. To overcome these limitations, this work proposes the Self-Guidance Network (SGN), which fully exploits the model's own predictions for self-guidance. The SGN consists of a self-spatial guidance constraint (SSGC), a self-temporal guidance constraint (STGC), a self-category guidance constraint (SCGC), and a self-compounding spatial module (SCSM). The SSGC guides the SCSM to attend comprehensively to the spatial positions where sign sub-actions occur by combining the attention maps of the last and current models. In addition, the STGC enforces self-temporal consistency between temporal modules. Moreover, the SCGC produces informative soft labels as frame-wise supervision by combining the category information of the last and current models. Furthermore, the SCSM enriches feature diversity through a compounding architecture that acts as data augmentation. In experiments on the challenging CSLR benchmarks RWTH-2014, RWTH-2014T, and CSL-Daily, our SGN achieves strong word error rates (WER) of $19.5\%/20.2\%$, $19.1\%/20.4\%$, and $30.4\%/30.3\%$ (dev/test), respectively.
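The core self-guidance idea described above (combining predictions of the last and current models into soft labels for frame-wise supervision, as in the SCGC) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the mixing weight `alpha`, and the cross-entropy form of the frame-wise loss are all assumptions for demonstration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over class logits.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_guidance_soft_labels(last_logits, current_logits, alpha=0.5):
    """Combine category predictions of the last (previous-iteration) model
    and the current model into soft labels. `alpha` is a hypothetical
    mixing weight, not a value taken from the paper."""
    return alpha * softmax(last_logits) + (1.0 - alpha) * softmax(current_logits)

def frame_wise_loss(soft_labels, current_logits, eps=1e-12):
    """Cross-entropy of the current model's frame-wise predictions against
    the combined soft labels, averaged over frames (one row per frame)."""
    log_p = np.log(softmax(current_logits) + eps)
    return float(-(soft_labels * log_p).sum(axis=-1).mean())
```

In this sketch each row of the logit arrays corresponds to one video frame and each column to one sign category; the combined soft labels remain valid probability distributions per frame, so they can serve directly as frame-wise supervision.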