Multimodal audiovisual speech recognition architecture using a three-feature multi-fusion method for noise-robust systems
- Author
- Sanghun Jeon, Jieun Lee, Dohyeon Yeo, Yong-Ju Lee, and SeungJun Kim
- Subjects
- application programming interface, audiovisual speech recognition, lip reading, multimodal interaction, Telecommunication, Electronics
- Abstract
Exposure to varied noisy environments impairs the recognition performance of artificial intelligence-based speech recognition technologies. Services with degraded performance can be deployed only as limited systems that assure good performance in certain environments, which impairs the overall quality of speech recognition services. This study introduces an audiovisual speech recognition (AVSR) model that is robust to various noise settings and mimics the elements of human dialogue recognition. For audio recognition, the model converts word embeddings and log-Mel spectrograms into feature vectors. A dense spatial–temporal convolutional neural network model extracts features from log-Mel spectrograms, transformed for visual-based recognition. This approach exhibits improved aural and visual recognition capabilities. We assess the signal-to-noise ratio in nine synthesized noise environments, where the proposed model exhibits lower average error rates. The error rate of the AVSR model using the three-feature multi-fusion method is 1.711%, compared with the general rate of 3.939%. Owing to its enhanced stability and recognition rate, this model is applicable in noise-affected environments.
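The abstract's audio front end is built on log-Mel spectrogram features. As a minimal, self-contained sketch of that preprocessing step (NumPy only; the sampling rate, FFT size, hop length, and number of mel bands below are illustrative assumptions, not parameters taken from the paper):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Return a (frames, n_mels) log-Mel feature matrix for a 1-D signal."""
    # Frame the signal and apply a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2  # (frames, n_fft//2 + 1)

    # Triangular mel filterbank spanning 0 Hz to the Nyquist frequency.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    # Apply the filterbank, then compress dynamic range with a log.
    return np.log(power @ fb.T + 1e-10)

# Example: one second of a 440 Hz tone at 16 kHz.
t = np.arange(16000) / 16000.0
feats = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(feats.shape)  # (97, 40)
```

The resulting time-by-mel matrix is the kind of 2-D input a spectrogram-based encoder would consume; in practice a tuned library implementation (e.g. `librosa.feature.melspectrogram`) would typically replace this hand-rolled version.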
- Published
- 2024