1. Multimodal Emotion Recognition Using Contextualized Audio Information and Ground Transcripts on Multiple Datasets.
- Author
- Chauhan, Krishna; Sharma, Kamalesh Kumar; Varma, Tarun
- Subjects
- *EMOTION recognition, *CONVOLUTIONAL neural networks, *AMERICAN English language, *LANGUAGE models, *MOTION capture (Cinematography), *DEEP learning, *MOTION capture (Human mechanics)
- Abstract
The widespread applications of emotion recognition (ER) across various fields have recently attracted considerable attention from researchers. Consequently, an array of advanced techniques has emerged, driven by the need to enhance the accuracy and robustness of these recognition systems. Because emotional dialogue comprises both sound and spoken content, the proposed model encodes the information from audio and text sequences using two separate channels and merges them for emotion classification. The audio channel is encoded using a deep convolutional neural network with residual connections and further transformed using a self-attention-based multihead attention network called channel-wise global head pooling; unlike the vanilla multihead attention network, an adaptive global pooling is applied after concatenating all the heads. The text channel is encoded using a pre-trained BERT model. The proposed ER method is validated on four benchmark databases: the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus in English, the Berlin emotional speech dataset (EMO-DB) in German, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) in North American English, and the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D) in English. The classification accuracies on these corpora are 85.71%, 79.52%, 76.71%, and 73.91%, respectively. Furthermore, a cross-corpus analysis is presented to examine the variability of speech and text features and the robustness of the proposed architecture. [ABSTRACT FROM AUTHOR]
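The record describes the two-channel architecture but contains no implementation. Purely as an illustration, a minimal PyTorch sketch of such a design might look as follows; the layer counts, dimensions, the `bert-base-uncased` checkpoint, and the use of `nn.AdaptiveAvgPool1d` to stand in for the paper's "channel-wise global head pooling" step are all assumptions for illustration, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
from transformers import BertModel


class ResidualConvBlock(nn.Module):
    """1-D convolutional block with a residual (skip) connection."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # residual connection


class AudioTextER(nn.Module):
    """Sketch of a two-channel ER model: CNN+attention audio branch, BERT text branch.

    All hyperparameters below are illustrative assumptions, not values from the paper.
    """

    def __init__(self, n_mels=64, d_model=256, n_heads=8, n_classes=4):
        super().__init__()
        # Audio channel: deep CNN with residual connections.
        self.proj = nn.Conv1d(n_mels, d_model, kernel_size=1)
        self.res_blocks = nn.Sequential(*[ResidualConvBlock(d_model) for _ in range(4)])
        # Multihead self-attention over time; per the abstract, the concatenated
        # heads are followed by an adaptive global pooling (approximating the
        # "channel-wise global head pooling" step).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head_pool = nn.AdaptiveAvgPool1d(1)
        # Text channel: pre-trained BERT encoder (checkpoint choice is assumed).
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Fusion: merge both channel representations, then classify.
        self.classifier = nn.Linear(d_model + self.bert.config.hidden_size, n_classes)

    def forward(self, mel, input_ids, attention_mask):
        # mel: (batch, n_mels, time) log-mel spectrogram
        a = self.res_blocks(self.proj(mel))                  # (batch, d_model, time)
        a = a.transpose(1, 2)                                # (batch, time, d_model)
        a, _ = self.attn(a, a, a)                            # self-attention over frames
        a = self.head_pool(a.transpose(1, 2)).squeeze(-1)    # (batch, d_model)
        t = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).pooler_output  # (batch, 768)
        return self.classifier(torch.cat([a, t], dim=-1))


# Example forward pass with dummy inputs (shapes are illustrative).
model = AudioTextER()
mel = torch.randn(2, 64, 300)            # (batch, mel bins, frames)
ids = torch.randint(0, 30522, (2, 32))   # token ids
mask = torch.ones_like(ids)
logits = model(mel, ids, mask)           # (2, n_classes)
```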
- Published
- 2024