
Pronunciation error detection model based on feature fusion.

Authors :
Zhu, Cuicui
Wumaier, Aishan
Wei, Dongping
Fan, Zhixing
Yang, Jianlei
Yu, Heng
Kadeer, Zaokere
Wang, Liejun
Source :
Speech Communication. Jan2024, Vol. 156.
Publication Year :
2024

Abstract

Mispronunciation detection and diagnosis (MDD) is a specific speech recognition task that aims to recognize the phoneme sequence produced by a user, compare it with the standard phoneme sequence, and identify the type and location of any mispronunciations. However, the lack of large amounts of phoneme-level annotated data limits further improvement of model performance. In this paper, we propose a joint training approach, Acoustic Error_Type Linguistic (AEL), that utilizes the error-type information, acoustic information, and linguistic information in the annotated data and achieves feature fusion through multiple attention mechanisms. To address the uneven distribution of phonemes in MDD data, which can cause the model to make overconfident predictions under the CTC loss, we propose a new loss function, Focal Attention Loss, that improves the model's performance on F1 score, accuracy, and other metrics. The proposed method was evaluated on the TIMIT and L2-Arctic public corpora. Under ideal conditions, compared with the baseline CNN-RNN-CTC model, the F1 score, diagnostic accuracy, and precision improved by 31.24%, 16.6%, and 17.35%, respectively. Our model also reduced the phoneme error rate from 29.55% to 8.49% and showed significant improvements in other metrics. Furthermore, experimental results demonstrate that, given a model capable of accurately identifying pronunciation error types, our model can achieve results close to those ideal conditions.

• The utilization of pronunciation error types in the pronunciation error detection model significantly enhances its performance.
• Jointly using Focal loss and multi-task loss effectively resolves the overconfidence caused by CTC loss.
• The model excels across multiple evaluation metrics by incorporating joint loss functions and error type information. [ABSTRACT FROM AUTHOR]
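The abstract does not give the exact form of the proposed Focal Attention Loss, so the sketch below shows only the standard focal-loss idea it builds on: down-weighting frames the model already classifies confidently, so that rare, hard phonemes contribute more to the loss. The function name, the `gamma` parameter, and the use of frame-level phoneme posteriors are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0):
    """Focal loss over per-frame phoneme posteriors (illustrative sketch).

    probs:   (T, V) array of softmax outputs, one row per frame
    targets: (T,) array of integer phoneme labels
    gamma:   focusing parameter; gamma=0 recovers plain cross-entropy
    """
    # Probability assigned to the correct phoneme at each frame.
    p_t = probs[np.arange(len(targets)), targets]
    # The (1 - p_t)^gamma factor shrinks the loss on confident frames,
    # so infrequent phonemes with low p_t dominate the gradient.
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))
```

With `gamma=0` this reduces to the usual cross-entropy; increasing `gamma` progressively mutes the easy, high-confidence frames, which is the mechanism the paper leverages against CTC overconfidence on imbalanced phoneme distributions.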

Details

Language :
English
ISSN :
0167-6393
Volume :
156
Database :
Academic Search Index
Journal :
Speech Communication
Publication Type :
Academic Journal
Accession number :
174759307
Full Text :
https://doi.org/10.1016/j.specom.2023.103009