CATNet: Cross-modal fusion for audio–visual speech recognition.
- Source :
- Pattern Recognition Letters. Feb 2024, Vol. 178, p216-222. 7p.
- Publication Year :
- 2024
Abstract
- Automatic speech recognition (ASR) is a typical pattern recognition technology that converts human speech into text. With the aid of advanced deep learning models, the performance of speech recognition has improved significantly. In particular, emerging Audio–Visual Speech Recognition (AVSR) methods achieve satisfactory performance by combining audio-modal and visual-modal information. However, various complex environments, especially noise, limit the effectiveness of existing methods. In response to the noise problem, in this paper we propose a novel cross-modal audio–visual speech recognition model, named CATNet. First, we devise a cross-modal bidirectional fusion model to analyze the close relationship between the audio and visual modalities. Second, we propose an audio–visual dual-modal network to preprocess audio and visual information, extract significant features, and filter redundant noise. The experimental results demonstrate the effectiveness of CATNet, which achieves excellent WER, CER, and convergence speed, outperforms other benchmark models, and overcomes the challenge posed by noisy environments.
- • Proposing a novel cross-modal audio–visual speech recognition network, named CATNet.
- • Devising a cross-modal bidirectional fusion model.
- • Devising an audio–visual dual-modal speech recognition network.
- • CATNet is robust against noise and outperforms other benchmarks.
- [ABSTRACT FROM AUTHOR]
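The abstract does not specify CATNet's fusion mechanism in detail, but "cross-modal bidirectional fusion" is commonly realized by letting each modality attend to the other. The sketch below is a minimal, illustrative version of that general idea (scaled dot-product cross-attention in both directions, with identity projections and plain concatenation), not the authors' actual architecture; the function names and the feature-list representation are assumptions for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Each query vector attends over the other modality's features
    (scaled dot-product attention; projections omitted for brevity)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        # Weighted sum of the other modality's feature vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                    for j in range(d)])
    return out

def bidirectional_fusion(audio_feats, visual_feats):
    """Toy bidirectional fusion: each modality attends to the other,
    and the attended result is concatenated with the original features."""
    a2v = cross_attend(audio_feats, visual_feats)   # audio queries visual
    v2a = cross_attend(visual_feats, audio_feats)   # visual queries audio
    fused_audio = [a + av for a, av in zip(audio_feats, a2v)]
    fused_visual = [v + va for v, va in zip(visual_feats, v2a)]
    return fused_audio, fused_visual
```

In a real AVSR model the fused streams would feed a downstream recognition network; here the point is only that information flows in both directions, so each modality can compensate when the other is corrupted by noise.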
Details
- Language :
- English
- ISSN :
- 01678655
- Volume :
- 178
- Database :
- Academic Search Index
- Journal :
- Pattern Recognition Letters
- Publication Type :
- Academic Journal
- Accession number :
- 175240628
- Full Text :
- https://doi.org/10.1016/j.patrec.2024.01.002