Self-organizing speech recognition that processes acoustic and articulatory features.

Authors :
Viana, Hesdras O.
Araújo, Aluízio F. R.
Barbosa, Danilo S.
Source :
Multimedia Tools & Applications; Apr 2024, Vol. 83 Issue 13, p39169-39195, 27p
Publication Year :
2024

Abstract

In automatic speech recognition (ASR) systems, minimizing the harmful effects of mismatched background noise between training and operating conditions has been a challenging task for many years. A noise-robust ASR system that can deal with different types of speech and various speakers remains an open research problem. Conventional ASR models for missing-feature reconstruction and robust speech descriptors typically employ acoustic features and statistical methods. Although such methods handle noise better, their performance still degrades when different background noises co-exist with the main signal. More recent approaches use neural networks, particularly deep learning models, for ASR. Such models increase performance at a high training cost. To mitigate these limitations, we propose an ASR model called Self-Organizing Speech Recognizer (SOSR). Unlike most conventional ASR systems, SOSR uses both acoustic and articulatory features, employs unsupervised and incremental learning, and is suitable for real-time applications because of its quick training stage. SOSR processes an audio signal along two paths simultaneously. In the first path, the acoustic features are extracted from the original signal, whereas in the second path an acoustic-to-articulatory inversion is performed by several Self-organizing Maps. The outputs of both paths are delivered to a Self-organizing Map with a time-varying structure, which is responsible for recognizing the input speech signal. Four datasets (TIMIT, Aurora 2, Aurora 4, and CHIME 2) were used to assess SOSR. The Word Error Rate (WER) was the metric chosen to compare the experimental results of tests with different noise levels and signal variations. The experimental results suggest that SOSR can learn quickly and can handle noisy signals, various speakers, different types of speech, and utterances of assorted lengths. [ABSTRACT FROM AUTHOR]
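
The abstract names Word Error Rate (WER) as the evaluation metric but does not spell out how it is computed. As a minimal sketch, assuming the standard definition (substitutions, deletions, and insertions from a word-level edit-distance alignment, divided by the number of reference words), it could be computed as follows. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not details taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Assumption: words are separated by whitespace; the paper's own
    tokenization and scoring tool are not stated in the abstract.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, used only to exercise the function.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition a lower value is better, and the value can exceed 1.0 when the hypothesis contains many inserted words.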

Details

Language :
English
ISSN :
1380-7501
Volume :
83
Issue :
13
Database :
Complementary Index
Journal :
Multimedia Tools & Applications
Publication Type :
Academic Journal
Accession number :
176408765
Full Text :
https://doi.org/10.1007/s11042-023-17080-4