Author: "S. R. Mahadeva Prasanna" / Topic: speech recognition - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"S. R. Mahadeva Prasanna"' showing total 139 results

1. Overlapped speech detection using phase features

Author: Prithwijit Guha, Shikha Baghel, and S. R. Mahadeva Prasanna
Subjects: Male, Memory, Long-Term, Voice activity detection, Acoustics and Ultrasonics, Computer science, Speech recognition, Context (language use), Convolutional neural network, Instantaneous phase, Speaker diarisation, Arts and Humanities (miscellaneous), Cepstrum, Feature (machine learning), Humans, Speech, Female, Neural Networks, Computer, Mel-frequency cepstrum
Abstract: Simultaneous speech of multiple speakers is known as overlapped speech, which causes problems for speech recognition and speaker diarization systems. The present work uses previously less utilized signal phase information in the task of overlapped speech detection. In this context, Instantaneous Frequency Cosine Coefficient (IFCC) and Modified Group Delay Cepstral Coefficient (MGDCC) features are explored. IFCC captures the time-varying phase characteristics, while MGDCC represents the frequency-varying information of the phase spectrum. A Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM)-based classifier is used for the classification. The present work uses synthetically generated overlapped speech from the GRID corpus. The proposed method is benchmarked against three baseline approaches that use magnitude spectrum features. It is observed that the combination of IFCC and MGDCC features with CNN-LSTM classifier provides better performance than the baselines. The combination of phase features with magnitude-based MFCC feature provides the best performance, indicating the importance of complementary information. The present study also investigates the effect of segment duration, genders, and number of simultaneous speakers on the overlapped speech detection system. Finally, the proposed method is also evaluated on real overlapped data from the AMI corpus.
Published: 2021

2. Event-Based Transformation of Misarticulated Stops in Cleft Lip and Palate Speech

Author: Protima Nomo Sudro, C. M. Vikram, and S. R. Mahadeva Prasanna
Subjects: 0209 industrial biotechnology, Speech production, Computer science, Applied Mathematics, Speech recognition, Event based, 02 engineering and technology, Transformation (music), language.human_language, Kannada, Improved performance, 020901 industrial engineering & automation, Duration (music), Vowel, Signal Processing, language, Velopharyngeal dysfunction
Abstract: The cleft of the lip and palate (CLP) is a congenital disability affecting the craniofacial region and it impacts the speech production system. The current work focuses on the modification of misarticulations produced for unvoiced stop consonants in CLP speech. Three types of misarticulations are studied: glottal, palatal, and velar stop substitutions. The stop consonants are misarticulated due to inadequate buildup of intra-oral pressure caused by velopharyngeal dysfunction and oro-nasal fistula. The misarticulated stops affect the speech intelligibility and quality, and this further affects the use of speech-based applications. The misarticulated stops are analyzed and modified using the speech data collected from 60 Kannada speaking children (normal and CLP). An event-based modification approach is used to correct the misarticulated stops. At first, automatic detection of burst onset and vowel onset events is carried out. Then, the region from vowel onset to 20 ms duration of the vowel is extracted. Further, the region from burst onset point to 20 ms duration of the vowel is defined as the region for modification. It is transformed using the nonnegative matrix factorization (NMF) method. The objective and subjective evaluation results show that the proposed event-based transformation approach provides a relative improvement compared to the entire-word modification (signal processed without using the knowledge of burst and vowel onset events). The event-based transformed misarticulated stops showed close similarity with the normal stops in perceptual quality. The improved performance accuracy of modified stops suggests that the speech distortion is minimized.
Published: 2021

3. Sinusoidal model-based hypernasality detection in cleft palate speech using CVCV sequence

Author: Akhilesh Kumar Dubey, S. R. Mahadeva Prasanna, and Samarendra Dandapat
Subjects: Normalization (statistics), Linguistics and Language, Communication, Speech recognition, Magnitude (mathematics), 020206 networking & telecommunications, Sinusoidal model, 02 engineering and technology, Hypernasal speech, medicine.disease, 01 natural sciences, Language and Linguistics, Computer Science Applications, Formant, Velopharyngeal insufficiency, Feature (computer vision), Modeling and Simulation, Harmonics, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, medicine, Computer Vision and Pattern Recognition, 010301 acoustics, Software, Mathematics
Abstract: Hypernasality in the speech of children with cleft palate is a consequence of velopharyngeal insufficiency. The spectral analysis of hypernasal speech shows the presence of nasal formants and anti-formants in the spectrum, which affects the harmonic intensity. The nasal formants increase, whereas the anti-formants decrease, the magnitude of the harmonics around their locations. Hence, the spectra of hypernasal and normal speech differ from each other. To capture the spectral difference, three features, namely normalized harmonic amplitude (NHA), harmonic amplitude ratio (HAR), and prominent harmonics frequency (PHF), are proposed in this work. The NHA feature is the magnitude of the harmonics after normalization with respect to the maximum magnitude, the HAR feature is the relative magnitude of the harmonics with respect to their previous harmonics, and the PHF feature is the frequencies of prominent harmonics in the spectrum. The combination of the three features gives accuracies of 82.46%, 87.89%, and 84.25% for the /a/, /i/, and /u/ vowels, respectively, for the detection of hypernasality using a support vector machine classifier.
Published: 2020
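
The NHA and HAR definitions in the abstract above can be illustrated with a minimal sketch. It assumes the harmonic magnitudes have already been measured on a linear scale; how the harmonics are located is not described in the abstract, and the function name `harmonic_features` is hypothetical.

```python
import numpy as np

def harmonic_features(harmonic_mags):
    """Return (NHA, HAR) for a 1-D array of harmonic magnitudes.

    Assumption: magnitudes are positive, linear-scale values, one per
    harmonic, ordered from the first harmonic upward.
    """
    mags = np.asarray(harmonic_mags, dtype=float)
    if mags.ndim != 1 or mags.size < 2:
        raise ValueError("need a 1-D array with at least two harmonics")
    if np.any(mags <= 0):
        raise ValueError("harmonic magnitudes must be positive")
    # NHA: each harmonic magnitude normalized by the maximum magnitude.
    nha = mags / mags.max()
    # HAR: each harmonic magnitude relative to the previous harmonic.
    har = mags[1:] / mags[:-1]
    return nha, har

# Toy example with three harmonics.
nha, har = harmonic_features([2.0, 4.0, 1.0])
```

For `n` harmonics, NHA has `n` values and HAR has `n - 1`, since the first harmonic has no predecessor.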

4. Enhancement of cleft palate speech using temporal and spectral processing

Author: Protima Nomo Sudro and S. R. Mahadeva Prasanna
Subjects: Linguistics and Language, Communication, Speech recognition, Linear prediction, Intelligibility (communication), Hypernasal speech, medicine.disease, Residual, Language and Linguistics, Computer Science Applications, Weighting, Modeling and Simulation, Frequency domain, Vowel, otorhinolaryngologic diseases, medicine, Computer Vision and Pattern Recognition, Software, Vocal tract, Mathematics
Abstract: The speech of individuals with cleft palate (CP) is generally characterized by the presence of abnormal nasal resonances during the production of voiced sounds, primarily in vowels, and is called hypernasality. Hypernasality is present in more than 50% of the individuals with CP, and it often results in degraded speech, both in quality and intelligibility. The current work describes the signal processing based enhancement of CP speech, where specifically hypernasal speech modification is addressed. The hypernasal speech’s residual and vocal tract system characteristics are analyzed using an extended weighted linear prediction (XLP) method. The enhancement is performed for three different variants: XLP residual weighting in the time domain, Gaussian mixture model-based spectral conversion in the frequency domain, and combined modification of the XLP residual and vocal tract system characteristics. The modified hypernasal speech achieved by the proposed method is evaluated using different objective and subjective measures for the vowels /a/, /i/, and /u/. The evaluation results indicate that the combination of XLP residual and vocal tract system characteristics modification yields better results than XLP residual or vocal tract system characteristics modification alone.
Published: 2020

5. Exploration of excitation source information for shouted and normal speech classification

Author: Shikha Baghel, S. R. Mahadeva Prasanna, and Prithwijit Guha
Subjects: Noise, Speech production, Acoustics and Ultrasonics, Arts and Humanities (miscellaneous), Computer science, Speech recognition, Cepstrum, Discrete cosine transform, Feature (machine learning), Linear prediction, Mel-frequency cepstrum, Speech processing
Abstract: Discrimination between shouted and normal speech is an essential prerequisite for many speech processing applications. Existing works have established that excitation source information plays a significant role in shouted speech production. In speech processing literature, various features have been proposed to model different aspects of the excitation source. The principal contribution of this work is to explore three such features, Discrete Cosine Transform of Integrated Linear Prediction Residual (DCT-ILPR), Mel-Power Difference of Spectrum in Sub-bands (MPDSS), and Residual Mel-Frequency Cepstral Coefficient (RMFCC), for shouted and normal speech classification. The DCT-ILPR feature represents the shape of the glottal cycle, MPDSS estimates the periodicity of the excitation source spectrum, and RMFCC characterizes smoothed spectral information of the excitation source. The authors have also contributed a dataset containing shouted and normal speech. This work is evaluated on three datasets and benchmarked against three baseline methods. Deep neural networks are used to study the classification performance of individual features and their combinations. The generalization performance of features (and combinations) is also investigated. Fusion of excitation source features with Mel-Frequency Cepstral Coefficients (MFCC) provides the best performance compared to other combinations. Noise analysis shows that adding excitation features with MFCC+ΔΔ provides a more robust classification system.
Published: 2020

6. Vowel Onset Point Based Screening of Misarticulated Stops in Cleft Lip and Palate Speech

Author: S. R. Mahadeva Prasanna and Vikram C. Mathad
Subjects: Acoustics and Ultrasonics, Dental occlusion, Speech recognition, Speech processing, 030507 speech-language pathology & audiology, 03 medical and health sciences, Computational Mathematics, Feature (computer vision), Vowel, Computer Science (miscellaneous), Discrete cosine transform, Spectrogram, Mel-frequency cepstrum, Electrical and Electronic Engineering, 0305 other medical science, Articulation (phonetics), Mathematics
Abstract: The presence of velopharyngeal dysfunction, dental occlusion, and mislearned articulation in individuals with cleft lip and palate (CLP) results in the production of misarticulated stop consonants. The present work considers vowel onset points (VOPs) as the anchor points, around which the consonant-vowel (CV) transition regions are segmented to analyze the difference between normal and misarticulated stops. VOPs are located using an epoch-synchronously computed feature called maximum weighted inner product. Spectro-temporal dynamics of CV transitions anchored around VOP are analyzed using two-dimensional discrete cosine transform (2D-DCT) coefficients, where 2D-DCT coefficients are derived from single pole filter (SPF) based time-frequency representation. The SPF-based 2D-DCT coefficients are used to train a support vector machine for the classification of normal and misarticulated stops, where the class of misarticulated stops includes weak, nasalized, palatal, velar, pharyngeal, glottal, and devoicing errors produced by CLP speakers. The performance of the proposed VOP detection algorithm is evaluated on a database containing CV units of normal and misarticulated stops, and the results are compared with the state-of-the-art VOP detection methods. The classification results obtained for the proposed SPF-based 2D-DCT coefficients are compared with the short-time Fourier transform-based 2D-DCT coefficients and Mel-frequency cepstral coefficients. Further, the performance of the proposed system is compared with the hidden Markov model-based goodness of pronunciation approach.
Published: 2020

7. Spoken Language Diarization Using an Attention based Neural Network

Author: S. R. Mahadeva Prasanna, Jagabandhu Mishra, and Ayush Agarwal
Subjects: Speaker diarisation, Artificial neural network, Language change, Computer science, Speech recognition, Speech coding, Task analysis, Utterance, Term (time), Spoken language
Abstract: Spoken language diarization (SLD) is the task of automatically segmenting and labeling the languages present in a given code-switched speech utterance. Inspired by the way humans perform SLD (i.e., capturing language-specific long-term information), this work proposes an acoustic-phonetic approach to SLD. The approach consists of attention-based neural network modelling to capture the language-specific information and a Gaussian smoothing approach to locate the language change points. The experimental study shows that the proposed approach performs better on code-switched segments containing monolingual segments of longer duration. However, its performance decreases as the monolingual segment duration decreases. This issue poses a challenge for further exploration of the proposed approach.
Published: 2021

8. Detection of Speech Overlapped with Low-Energy Music using Pyknograms

Author: Mrinmoy Bhattacharjee, Prithwijit Guha, and S. R. Mahadeva Prasanna
Subjects: Voice activity detection, Artificial neural network, Computer Science::Sound, Computer science, Speech recognition, Classifier (linguistics), Task analysis, Spectrogram, Convolutional neural network, Discrete Fourier transform, Convolution
Abstract: Detection of speech overlapped with music is a challenging task. This work deals with discriminating clean speech from speech overlapped with low-energy music. The overlapped signals are generated synthetically. An enhanced spectrogram representation called Pyknogram has been explored for the current task. Pyknograms have been previously used in overlapped speech detection. The classification is performed using a neural network that is designed with only convolutional layers. The performance of Pyknograms at various high SNR levels is compared with that of discrete Fourier transform based spectrograms. The classification system is benchmarked on three publicly available datasets, viz., GTZAN, Scheirer-Slaney, and MUSAN. The Pyknogram representation with the fully convolutional classifier performs well, both individually and in combination with spectrograms.
Published: 2021
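
The abstract above states that the overlapped signals are generated synthetically at various SNR levels but does not give the mixing procedure. One common way to do this, shown here only as a sketch under that assumption, is to scale the music so that the speech-to-music power ratio equals the target SNR. The function name `mix_at_snr` and the random test signals are hypothetical.

```python
import numpy as np

def mix_at_snr(speech, music, snr_db):
    """Add music to speech so that speech power / music power = snr_db.

    Returns the mixture and the gain applied to the music.
    """
    speech = np.asarray(speech, dtype=float)
    music = np.asarray(music, dtype=float)[: len(speech)]
    if len(music) < len(speech):
        raise ValueError("music must be at least as long as speech")
    p_speech = np.mean(speech ** 2)
    p_music = np.mean(music ** 2)
    if p_music == 0:
        raise ValueError("music signal has zero power")
    # Solve p_speech / (gain**2 * p_music) = 10**(snr_db / 10) for gain.
    gain = np.sqrt(p_speech / (p_music * 10 ** (snr_db / 10)))
    return speech + gain * music, gain

# Toy example: random noise stands in for real speech and music.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
music = rng.standard_normal(16000)
mixture, gain = mix_at_snr(speech, music, snr_db=20.0)
```

A higher `snr_db` yields quieter music relative to the speech, which corresponds to the low-energy music condition the abstract describes.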

9. Analysis and modeling of dialect information in Ao, a low resource language

Author: Moakala Tzudir, S. R. Mahadeva Prasanna, and Priyankoo Sarmah
Subjects: Acoustics and Ultrasonics, Low resource, Speech recognition, Tone (linguistics), Tonal language, Normal Distribution, India, Acoustics, Mixture model, Identification (information), Arts and Humanities (miscellaneous), Speech Perception, Mel-frequency cepstrum, Mathematics, Language
Abstract: Ao is a Tibeto-Burman language spoken in Nagaland, India. It is a low resource, tonal language with three lexical tones, namely, high, mid, and low. However, tone assignment on lexical words may differ among the three dialects of Ao, namely, Chungli, Mongsen, and Changki. In this work, an acoustic study is conducted on the three tones in the three dialects of Ao. It was found that the acoustic characteristics of the tones in the Changki dialect are markedly different from those of the Chungli and the Mongsen dialects. Hence, in the latter part of the work, automatic dialect identification (DID) in the Ao dialects is attempted with Mel Frequency Cepstral Coefficients, Shifted Delta Cepstral coefficients, and F0 features using Gaussian mixture models. It is confirmed that in both text-dependent and text-independent DID, the F0 features improve the accuracy of classification.
Published: 2021

10. Sonority Measurement Using System, Source, and Suprasegmental Information

Author: Bidisha Sharma and S. R. Mahadeva Prasanna
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Acoustics and Ultrasonics, Computer science, Speech recognition, Linear prediction, 02 engineering and technology, Signal, Computer Science - Sound, 030507 speech-language pathology & audiology, 03 medical and health sciences, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Feature (machine learning), Electrical and Electronic Engineering, Group delay and phase delay, Sonorant, 020206 networking & telecommunications, Computational Mathematics, Formant, Sonority hierarchy, Mel-frequency cepstrum, 0305 other medical science, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Sonorant sounds are characterized by regions with prominent formant structure, high energy, and a high degree of periodicity. In this work, the vocal-tract system, excitation source, and suprasegmental features derived from the speech signal are analyzed to measure the sonority information present in each of them. Vocal-tract system information is extracted from the Hilbert envelope of the numerator of the group delay function. It is derived from the zero time windowed speech signal, which provides better resolution of the formants. A five-dimensional feature set is computed from the estimated formants to measure the prominence of the spectral peaks. A feature representing strength of excitation is derived from the Hilbert envelope of the linear prediction residual, which represents the source information. Correlation of speech over ten consecutive pitch periods is used as the suprasegmental feature representing periodicity information. The combination of evidence from the three different aspects of speech provides better discrimination among different sonorant classes, compared to the baseline MFCC features. The usefulness of the proposed sonority feature is demonstrated in the tasks of phoneme recognition and sonorant classification.
Published: 2021
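
Several entries in this list rely on the Hilbert envelope of a signal, for example of the linear prediction residual. As a minimal sketch, the envelope is the magnitude of the analytic signal, which can be computed with an FFT. The function name `hilbert_envelope` is hypothetical, and the input is assumed to be an already computed real-valued signal; the linear prediction step itself is not shown.

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal of a real 1-D signal."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if n == 0:
        raise ValueError("signal must not be empty")
    spectrum = np.fft.fft(x)
    # Keep DC (and Nyquist for even n), double the positive frequencies,
    # and zero the negative frequencies to form the analytic signal.
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1 : n // 2] = 2.0
    else:
        weights[1 : (n + 1) // 2] = 2.0
    analytic = np.fft.ifft(spectrum * weights)
    return np.abs(analytic)

# Toy example: a cosine with a whole number of cycles has a flat envelope.
t = np.arange(64)
envelope = hilbert_envelope(np.cos(2 * np.pi * 4 * t / 64))
```

Peaks of this envelope around glottal closure instants are what a strength-of-excitation style feature would measure; the exact feature definition used by the authors is not given in the abstract.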

11. Detection of Nasalized Voiced Stops in Cleft Palate Speech Using Epoch-Synchronous Features

Author: C M Vikram, Nagaraj Adiga, and S. R. Mahadeva Prasanna
Subjects: Consonant, Acoustics and Ultrasonics, Speech recognition, Linear prediction, Filter (signal processing), Nasalization, 030507 speech-language pathology & audiology, 03 medical and health sciences, Computational Mathematics, Computer Science (miscellaneous), Segmentation, Mel-frequency cepstrum, Electrical and Electronic Engineering, Syllable, 0305 other medical science, Hidden Markov model, Mathematics
Abstract: The presence of velopharyngeal dysfunction in individuals with cleft palate (CP) nasalizes the voiced stops. Due to this, the voiced stops /b/, /d/, /g/ tend to be perceived as the nasal consonants /m/, /n/, /ng/. In this work, a novel algorithm is proposed for the detection of nasalized voiced stops in CP speech using epoch-synchronous features. Speech regions corresponding to consonant and consonant-vowel transitions are segmented using the knowledge of glottal activity, syllable nucleus, low-frequency spectral dominance, and vowel onset point. The segmented regions are epoch-synchronously processed to analyze the spectral, spectro-temporal, excitation source, and periodicity characteristics of normal and nasalized voiced stops. Spectral and spectro-temporal features are computed using a single pole filter based time-frequency representation. The amplitude of the Hilbert envelope of the linear prediction residual, measured around the epoch, is used to analyze the effect of nasalization on the excitation source. Comparison of speech frames of successive inter-epoch intervals is carried out to analyze the periodicity characteristics. The proposed features are used to develop a support vector machine classifier for the classification of normal and nasalized voiced stops. Segmentation accuracy for the proposed knowledge-based method is found to be better than that of the hidden Markov model based forced-alignment approach. The detection rate of nasalized voiced stops is found to be higher for the proposed epoch-synchronous features than for the conventional Mel-frequency cepstral coefficients.
Published: 2019

12. Aspiration in fricative and nasal consonants: Properties and detection

Author: Saswati Rabha, S. R. Mahadeva Prasanna, and Priyankoo Sarmah
Subjects: Acoustics and Ultrasonics, Arts and Humanities (miscellaneous), Aspirated consonant, Computer science, Speech recognition, Speech sounds, otorhinolaryngologic diseases, Intelligibility (communication), Vocal tract
Abstract: Unlike aspiration in stops, occurrence of aspiration in non-stop consonants is quite rare. Most of the languages that have aspirated non-stop consonants are low-resource languages. Hence, data driven, quantitative, and statistical analysis of their aspiration phenomena is fairly limited. Rabha and Angami are considered in this study, as previous studies have confirmed the existence of aspiration contrast in fricatives and nasals. This study reports the acoustic characteristics of aspiration in stops, fricatives, and nasals. Among them, distinguishing the aspirated fricatives and aspirated nasals from their unaspirated counterparts is a challenging task. A set of acoustic features is proposed to automatically detect the presence of aspiration in fricatives and nasals. Acoustic features, such as vocal tract constriction (VTC), normalized autocorrelation peak strength (NAPS), strength of excitation (SoE), and variance of successive epoch intervals (VSEI) are used to detect aspiration in fricatives and nasals. These features are extracted from zero-frequency filtered signal of the speech sounds, as it preserves the aspiration information. Results show that VTC, NAPS, and SoE can detect aspiration in nasals, whereas SoE and VSEI can detect aspiration in fricatives. The proposed method improves the performance of an automatic phoneme recognizer by reducing the confusion between aspirated and unaspirated counterparts.
Published: 2019

13. Investigating Text-Independent Speaker Verification Systems Under Varied Data Conditions

Author: S. R. Mahadeva Prasanna and Rohan Kumar Das
Subjects: 0209 industrial biotechnology, Computer science, Applied Mathematics, Speech recognition, Word error rate, 02 engineering and technology, Field (computer science), 020901 industrial engineering & automation, Signal Processing, Cepstrum, Feature (machine learning), Kernel Fisher discriminant analysis, Vocal tract, Communication channel, Test data
Abstract: This work investigates speaker verification (SV) from the viewpoint of practical systems. Limited-data SV is preferred for user comfort and fast decision delivery in regular usage. However, reducing the amount of speech data degrades SV performance, which becomes a concern for field deployment. In this work, varied data conditions for SV are explored, and sufficient training data with limited test data is presented as a preferable arrangement for practical systems. Different explorations are made from the perspective of improving performance in varied data conditions. These explorations include a vocal tract constriction feature to capture speaker-specific acoustic-phonetic information, and different attributes of voice source features that carry alternative/complementary information to that carried by conventional mel-frequency cepstral coefficient features. Further, kernel discriminant analysis is performed at the back end of i-vector-based speaker modeling for channel/session compensation, which is found to work well for varied data conditions. Finally, a framework combining the stated explorations is proposed for better speaker characterization, which is more effective in the sufficient-train, limited-test data scenario. The proposed framework achieves a significant improvement in performance [equal error rate (EER): 11.20%, detection cost function (DCF): 0.1990] over the baseline (EER: 22.31%, DCF: 0.4128) for the sufficient-train, 2-s test segment case, showing scope toward application-oriented systems.
Published: 2019
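
The abstract above reports results as equal error rate (EER), the operating point where the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how an EER can be estimated from verification scores. It is not the authors' evaluation code; the function name `equal_error_rate` and the toy scores are hypothetical, and higher scores are assumed to mean "more likely the claimed speaker".

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER by sweeping a threshold over all observed scores."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    if genuine.size == 0 or impostor.size == 0:
        raise ValueError("both score sets must be non-empty")
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    # False acceptance: impostor trials scoring at or above the threshold.
    far = np.array([(impostor >= t).mean() for t in thresholds])
    # False rejection: genuine trials scoring below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # Take the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

# Toy example with perfectly separated scores.
eer = equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
```

With finite trial lists the two error curves rarely cross exactly, so this sketch averages the two rates at the closest threshold; evaluation toolkits may interpolate instead.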

14. Enhancing the Intelligibility of Cleft Lip and Palate Speech using Cycle-consistent Adversarial Networks

Author: Rohan Kumar Das, Rohit Sinha, Protima Nomo Sudro, and S. R. Mahadeva Prasanna
Subjects: Medical treatment, Adversarial network, Computer science, business.industry, Speech recognition, 020206 networking & telecommunications, Usability, 02 engineering and technology, Limiting, Intelligibility (communication), Speech therapy, Speech enhancement, 030507 speech-language pathology & audiology, 03 medical and health sciences, Adversarial system, Audio and Speech Processing (eess.AS), 0202 electrical engineering, electronic engineering, information engineering, otorhinolaryngologic diseases, FOS: Electrical engineering, electronic engineering, information engineering, 0305 other medical science, business, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Cleft lip and palate (CLP) refers to a congenital craniofacial condition that causes various speech-related disorders. As a result of structural and functional deformities, the affected subjects' speech intelligibility is significantly degraded, limiting the accessibility and usability of speech-controlled devices. Towards addressing this problem, it is desirable to improve the CLP speech intelligibility. Moreover, it would be useful during speech therapy. In this study, the cycle-consistent adversarial network (CycleGAN) method is exploited for improving CLP speech intelligibility. The model is trained on native Kannada-speaking children's speech data. The effectiveness of the proposed approach is also measured using automatic speech recognition performance. Further, subjective evaluation is performed, and those results also confirm the intelligibility improvement in the enhanced speech over the original. Comment: 8 pages, 4 figures, IEEE spoken language and technology workshop
Published: 2021

15. VOP Detection in Variable Speech Rate Condition

Author: S. R. Mahadeva Prasanna, Jagabandhu Mishra, and Ayush Agarwal
Subjects: Variable (computer science), Computer science, Speech recognition, Speech rate
Published: 2020

16. Overlapped/Non-Overlapped Speech Transition Point Detection Using Bag-of-Audio-Words

Author: S. R. Mahadeva Prasanna, Prithwijit Guha, and Shikha Baghel
Subjects: Audio signal, Computer science, Speech recognition, 020206 networking & telecommunications, Linear prediction, 02 engineering and technology, 01 natural sciences, Signal, Speaker diarisation, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Feature (machine learning), Mel-frequency cepstrum, 010301 acoustics, Vocal tract, Energy (signal processing)
Abstract: Overlapped speech refers to an audio signal that contains the speech of two or more speakers speaking simultaneously. Overlapped speech is one of the main sources of error for speaker diarization systems. This work presents an initial study to identify the transition points from overlapped to non-overlapped speech and vice versa. Characteristics of overlapped and non-overlapped speech are examined in terms of the vocal tract system, excitation source, and modulation spectrum. The Hilbert envelope (HE) of the Linear Prediction (LP) residual signal represents the excitation source characteristics of the speech signal. The Sum of Ten Largest Peaks (STLP) of the spectrum and Mel-Frequency Cepstral Coefficients (MFCCs) represent the vocal tract shape information. The modulation spectrum energy (ModSE) captures the information of the slowly varying temporal envelope of speech. A Bag-of-Audio-Words (BoAW) based approach is used to detect the transition points. News debates are one of the main sources of naturally occurring overlapped speech. Therefore, the present work is evaluated on an Indian news debate scenario. A high Identification Rate (IR) and a low Spurious Rate (SR) are observed when all the features are used together as a 16-dimensional feature (13 MFCCs, HE of LP residual, STLP, and ModSE) for the detection task.
Published: 2020
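
The 16-dimensional feature in the abstract above is 13 MFCCs plus one value each for the HE of the LP residual, STLP, and ModSE. The sketch below shows how an STLP value and the stacked vector might be formed. It assumes a "peak" means a sample strictly greater than both neighbours in the magnitude spectrum, which the abstract does not specify; the function names and the placeholder inputs are hypothetical.

```python
import numpy as np

def sum_of_ten_largest_peaks(magnitude_spectrum, n_peaks=10):
    """Sum the n_peaks largest local maxima of a magnitude spectrum."""
    spec = np.asarray(magnitude_spectrum, dtype=float)
    if spec.ndim != 1 or spec.size < 3:
        raise ValueError("need a 1-D spectrum with at least three bins")
    interior = spec[1:-1]
    # A local maximum is strictly greater than both of its neighbours.
    is_peak = (interior > spec[:-2]) & (interior > spec[2:])
    peaks = np.sort(interior[is_peak])[::-1]
    return float(peaks[:n_peaks].sum())

def stack_frame_feature(mfcc13, he_lp_residual, stlp, mod_se):
    """Concatenate 13 MFCCs with three scalar features into a 16-d vector."""
    mfcc13 = np.asarray(mfcc13, dtype=float)
    if mfcc13.shape != (13,):
        raise ValueError("expected exactly 13 MFCCs")
    return np.concatenate([mfcc13, [he_lp_residual, stlp, mod_se]])

# Toy example: the MFCC, HE, and ModSE values are placeholders.
spectrum = np.abs(np.fft.rfft(np.hanning(256) * np.sin(0.3 * np.arange(256))))
stlp = sum_of_ten_largest_peaks(spectrum)
feature = stack_frame_feature(np.zeros(13), 0.5, stlp, 0.1)
```

If a spectrum has fewer than ten peaks, this sketch sums whatever peaks exist rather than raising an error.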

17. Classification of Speech vs. Speech with Background Music

Author: Prithwijit Guha, Mrinmoy Bhattacharjee, and S. R. Mahadeva Prasanna
Subjects: Scheme (programming language), Computer science, Speech recognition, 020207 software engineering, 02 engineering and technology, A-weighting, Class (biology), Histogram, 0202 electrical engineering, electronic engineering, information engineering, Feature (machine learning), Spectrogram, 020201 artificial intelligence & image processing, computer, computer.programming_language
Abstract: Applications that perform enhancement of speech containing background music require a critical preprocessing step that can efficiently detect such segments. This work proposes such a preprocessing method to detect speech with background music that is mixed at different SNR levels. A bag-of-words approach is proposed in this work. Representative dictionaries from speech and music data are first learned. The signals are processed as spectrograms of 1s intervals. Rows of these spectrograms are used to learn separate speech and music dictionaries. This work proposes a weighting scheme to reduce confusion by suppressing codewords of one class that have similarities to the other class. The proposed feature is a weighted histogram of 1s audio intervals obtained from the learned dictionaries. The classification is performed using a deep neural network classifier. The proposed approach is validated against a baseline and benchmarked over two publicly available datasets. The proposed feature shows promising results, both individually and in combination with the baseline.
Published: 2020

18. Analysis of Excitation Source Characteristics for Shouted and Normal Speech Classification

Author: Prithwijit Guha, Shikha Baghel, and S. R. Mahadeva Prasanna
Subjects: Computer science, Flatness (systems theory), Speech recognition, Linear prediction, Residual, Speech processing, 01 natural sciences, Pitch period, Phase duration, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0103 physical sciences, Normal speech, 0305 other medical science, 010301 acoustics, Excitation
Abstract: The present work is aimed at analysing the excitation source characteristics of normal and shouted speech. In this context, we analyze the Differenced Electroglottogram (DEGG) signal corresponding to different vowels. This work proposes two novel excitation source features that are estimated from DEGG signal. These features are (a) Open Phase Triangle Area (OPTA) and (b) Flatness of Glottal Cycle (FoGC). OPTA captures the effect of open phase duration and slope of DEGG signal. FoGC measures the change in source characteristics due to strength of excitation (SoE) and pitch period. A practical issue in using the proposed features is the unavailability of DEGG signal in most speech processing applications. To overcome this problem, the integrated linear prediction residual (ILPR) signal estimated from speech is considered as an approximation of DEGG. We show that the proposed features can be computed from ILPR signal in the absence of DEGG. It is observed that the proposed features (estimated from either DEGG or ILPR) are successful in discriminating shouted from normal speech.
Published: 2020

19. Intelligibility assessment of cleft lip and palate speech using Gaussian posteriograms based on joint spectro-temporal features

Author: Samarendra Dandapat, Sishir Kalita, and S. R. Mahadeva Prasanna
Subjects: Dynamic time warping, Acoustics and Ultrasonics, Computer science, Gaussian, Speech recognition, Intelligibility (communication), Speech processing, 01 natural sciences, 030507 speech-language pathology & audiology, 03 medical and health sciences, Arts and Humanities (miscellaneous), Perception, 0103 physical sciences, Mel-frequency cepstrum, 0305 other medical science, 010301 acoustics
Abstract: Intelligibility is considered as one of the primary measures for speech rehabilitation of individuals with a cleft lip and palate (CLP). Currently, speech processing and machine-learning-based objective methods are gaining more research interest as a way to quantify speech intelligibility. In this work, joint spectro-temporal features computed from a time-frequency representation of speech are explored to derive speech representations based on Gaussian posteriograms. A comparative framework using dynamic time warping (DTW) is used to quantify the intelligibility of child CLP speech. The DTW distance is used to score sentence-level intelligibility and tested for correlation with perceptual intelligibility ratings obtained from expert speech-language pathologists. A baseline DTW system using the conventional Mel-frequency cepstral coefficients (MFCCs) is also developed to compare the performance of the proposed system. Spearman's rank correlation coefficient between the objective intelligibility scores and the perceptual intelligibility rating is studied. A Williams significance test is conducted to assess the statistical significance of the correlation difference between the methods. The results show that the system based on joint spectro-temporal features significantly outperforms the MFCC-based system.
Published: 2018
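
Note on entry 19: the abstract scores sentence-level intelligibility with a DTW distance between a test utterance and a reference. The following is a minimal sketch of a plain DTW distance, not the authors' implementation. The function name `dtw_distance`, the Euclidean local cost, and the unconstrained step pattern are assumptions made for illustration; the abstract does not state which local distance is used on the Gaussian posteriograms, nor whether the score is length-normalized.

```python
import math


def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of feature vectors.

    Assumption: Euclidean local cost and the basic three-way step
    pattern (insertion, deletion, match); no path-length normalization.
    """
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


# Hypothetical one-dimensional "frames": the second sequence repeats one frame.
reference = [[0.0], [1.0], [2.0]]
stretched = [[0.0], [1.0], [1.0], [2.0]]
distance = dtw_distance(reference, stretched)
```

Because DTW absorbs the repeated frame, the two toy sequences align at zero cost; a larger accumulated cost would indicate a poorer match to the reference.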

20. Exploring Text-Constraint Models and Source Information for Long-Enrollment with Short-Test Speaker Verification

Author: S. R. Mahadeva Prasanna, Rohan Kumar Das, and Sarfaraz Jelil
Subjects: 0209 industrial biotechnology, Computer science, Applied Mathematics, Speech recognition, Perspective (graphical), Linear prediction, 02 engineering and technology, Residual, Field (computer science), Constraint (information theory), 020901 industrial engineering & automation, Signal Processing, Discrete cosine transform, Mel-frequency cepstrum, Baseline (configuration management)
Abstract: This work focuses on long-enrollment with short-test speaker verification (SV) from the perspective of application-oriented systems. The importance of phonetic match between train and test models is explored using a text-constraint model-based framework on Part IV of the RedDots database. This database has text-dependent and text-prompted-based enrollment conditions for speaker modeling. Two different text-constraint setups are formalized for evaluating the effect of text match between train and test sessions. Further, the excitation source features, namely mel power difference of spectrum in subbands, residual mel frequency cepstral coefficients and discrete cosine transform of integrated linear prediction residual, are investigated to determine their significance for the text-constraint-based framework. Although the source features individually perform worse than the conventional mel frequency cepstral coefficient (MFCC) features, their significance is reflected in fusion due to the complementary nature of the information they carry. Additionally, in fusion with MFCC features, the source features become imperative for text-constraint-based models for long-enrollment with short-test SV and achieve a commendable improvement over the baseline framework of the text-prompted-based enrollment condition. This minimizes the performance difference between the text-dependent and text-prompted-based enrollment conditions, showing the importance of text-constraint models and source information in a long-enrollment with short-test framework, which is favorable from the perspective of field-deployable systems.
Published: 2018

21. Speech Enhancement Using Source Information for Phoneme Recognition of Speech with Background Music

Author: Abhishek Dey, Banriskhem K. Khonglah, and S. R. Mahadeva Prasanna
Subjects: 0209 industrial biotechnology, Computer science, Applied Mathematics, Speech recognition, Gaussian, Word error rate, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), 02 engineering and technology, Filter (signal processing), Markov model, Standard deviation, Speech enhancement, ComputingMethodologies_PATTERNRECOGNITION, 020901 industrial engineering & automation, Computer Science::Sound, Phone, Signal Processing, Subspace topology
Abstract: This work explores the significance of source information for speech enhancement resulting in better phoneme recognition of speech with background music segments. Standard procedure for speech enhancement in noisy conditions involves sequential processing in terms of the temporal, spectral and perceptual methods. This work follows the same sequential processing but with the additional modification of studying the effect of source, particularly in the temporal and perceptual-based enhancement techniques for enhancing speech with background music segments. The source information is studied in terms of the epoch locations and epoch strength, obtained after passing the sum of the mean and standard deviation of the component envelopes computed across frequencies obtained using the single frequency filter (SFF), through a zero frequency filter (ZFF). This method of obtaining epoch locations and epoch strength will be termed as SFF-ZFF in this work. The enhanced segments are passed through a phoneme recognizer built using Gaussian mixture model-hidden Markov model (GMM-HMM), subspace Gaussian mixture model-hidden Markov model (SGMM-HMM) and deep neural network-hidden Markov model (DNN-HMM) system, where the models are trained on clean speech. The enhanced audio files show a better phone error rate than the degraded audio files, which means that performing enhancement in terms of the source information is significant for the speech with background music regions.
Published: 2018

22. Acoustic analysis of misarticulated trills in cleft lip and palate children

Author: S. R. Mahadeva Prasanna, Sishir Kalita, C M Vikram, and Sashank Kumar Macha
Subjects: Male, Dynamic time warping, Sound Spectrography, Time Factors, Acoustics and Ultrasonics, Voice Quality, Bioacoustics, Cleft Lip, Speech recognition, Speech Acoustics, Speech Production Measurement, Arts and Humanities (miscellaneous), Humans, Child, Mathematics, Age Factors, Signal Processing, Computer-Assisted, Acoustics, Fundamental frequency, Cleft Palate, Formant, Female, Mel-frequency cepstrum, Trill (music), Vocal tract
Abstract: In this paper, acoustic analysis of misarticulated trills in cleft lip and palate speakers is carried out using excitation source based features: strength of excitation and fundamental frequency, derived from zero-frequency filtered signal, and vocal tract system features: first formant frequency (F1) and trill frequency, derived from the linear prediction analysis and autocorrelation approach, respectively. These features are found to be statistically significant while discriminating normal from misarticulated trills. Using acoustic features, dynamic time warping based trill misarticulation detection system is demonstrated. The performance of the proposed system in terms of the F1-score is 73.44%, whereas that for conventional Mel-frequency cepstral coefficients is 66.11%.
Published: 2018

23. End Point Detection Using Speech-Specific Knowledge for Text-Dependent Speaker Verification

Author: S. R. Mahadeva Prasanna, Biswajit Dev Sarma, and Ramesh K. Bhukya
Subjects: Consonant, Speaker verification, Sonorant, Computer science, Applied Mathematics, Speech recognition, 020206 networking & telecommunications, 02 engineering and technology, Obstruent, Speech segmentation, Background noise, 030507 speech-language pathology & audiology, 03 medical and health sciences, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, 0305 other medical science, Spurious relationship, Vocal tract
Abstract: This paper proposes a method using speech-specific knowledge to detect the begin and end points of speech under degraded conditions. The method is based on vowel-like region (VLR) detection and uses both excitation source and vocal tract system information. The existing method for VLR detection uses excitation source information. Vocal tract system information from the dominant resonant frequency is used to eliminate spurious VLRs in background noise. Foreground speech segmentation using excitation and vocal tract system information is carried out to remove spurious VLRs in the background speech region. Better localization of the end points is achieved using more detailed information about the excitation source in terms of glottal activity to detect the sonorant consonants and missed VLRs. To include an unvoiced consonant, obstruent region detection is done at the beginning of the first VLR and at the end of the last VLR. Detected begin and end points are evaluated by comparing with manually marked end points as well as by conducting text-dependent speaker verification experiments. The proposed method performs better than some of the existing techniques.
Published: 2018

24. Significance of sonority information for voiced/unvoiced decision in speech synthesis

Author: Bidisha Sharma and S. R. Mahadeva Prasanna
Subjects: Linguistics and Language, Sonorant, Computer science, Communication, Speech recognition, 020206 networking & telecommunications, Speech synthesis, 02 engineering and technology, Fundamental frequency, Obstruent, Language and Linguistics, Computer Science Applications, 030507 speech-language pathology & audiology, 03 medical and health sciences, Variation (linguistics), Quality (physics), Modeling and Simulation, Sonority hierarchy, 0202 electrical engineering, electronic engineering, information engineering, Voice, Computer Vision and Pattern Recognition, 0305 other medical science, Software
Abstract: The quality of synthesized speech obtained from statistical parametric speech synthesis (SPSS) relies significantly on excitation source generation. The voiced/unvoiced decision is an essential component in generating the excitation source. In the existing literature, it is obtained from the fundamental frequency and other excitation source evidence. The discontinuity at the point of contact in the vocal folds excites energy into the vocal tract, resulting in a voicing effect in the produced speech signal. The perceptual reflection of voicing over the sound produced is correlated with the sonority information, which is related to less vocal-tract constriction and significant glottal vibration. Therefore, the possible variation in voicing with the change in supraglottal pressure due to vocal-tract constriction, the rate of closing of the vocal folds and the regularity in structure of the signal are intact in the sonority associated with a sound unit. Voicing and the degree of opening of the vocal tract are the two most effective correlates of sonority, and they potentially contribute to the sonority hierarchy for sonorants and obstruents uniformly. Therefore, the voicing effect can be captured by the sonority measurement derived from system, source and suprasegmental information in the speech signal. In this work, a novel voiced/unvoiced decision method using sonority information is proposed and integrated in the SPSS framework for generation of the excitation source. It leads to a better voicing decision compared to the existing methods, resulting in synthesized speech of improved quality, as confirmed by objective and subjective analysis.
Published: 2018

25. Improved voicing decision using glottal activity features for statistical parametric speech synthesis

Author: S. R. Mahadeva Prasanna, Banriskhem K. Khonglah, and Nagaraj Adiga
Subjects: Computer science, Speech recognition, Speech synthesis, 02 engineering and technology, 030507 speech-language pathology & audiology, 03 medical and health sciences, Deep belief network, Artificial Intelligence, 0202 electrical engineering, electronic engineering, information engineering, Feature (machine learning), Electrical and Electronic Engineering, Hidden Markov model, Parametric statistics, Artificial neural network, Applied Mathematics, 020206 networking & telecommunications, Pattern recognition, Support vector machine, Computational Theory and Mathematics, Signal Processing, Voice, Computer Vision and Pattern Recognition, Artificial intelligence, Statistics, Probability and Uncertainty, 0305 other medical science
Abstract: A method to improve the voicing decision using glottal activity features is proposed for statistical parametric speech synthesis. In existing methods, the voicing decision relies mostly on the fundamental frequency F0, which may result in errors when the prediction is inaccurate. Even though F0 is a glottal activity feature, other features that characterize this activity may help in improving the voicing decision. The glottal activity features used in this work are the strength of excitation (SoE), normalized autocorrelation peak strength (NAPS), and higher-order statistics (HOS). These features are obtained from approximated source signals such as the zero-frequency filtered signal and the integrated linear prediction residual. To improve the voicing decision and to avoid a heuristic threshold for classification, glottal activity features are trained using different statistical learning methods such as the k-nearest neighbor, support vector machine (SVM), and deep belief network. The voicing decision works best with the SVM classifier, and its effectiveness is tested using statistical parametric speech synthesis. The glottal activity features SoE, NAPS, and HOS are modeled along with F0 and Mel-cepstral coefficients in hidden Markov model and deep neural network frameworks to get the voicing decision. The objective and subjective evaluations demonstrate that the proposed method improves the naturalness of synthetic speech.
Published: 2017

26. Multi-style speaker recognition database in practical conditions

Author: Rohan Kumar Das, S. R. Mahadeva Prasanna, and Sarfaraz Jelil
Subjects: Linguistics and Language, Speaker verification, Computer science, Speech recognition, 02 engineering and technology, Language and Linguistics, Style (sociolinguistics), 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, Session (computer science), Telephone network, Database, Process (computing), 020206 networking & telecommunications, Speaker recognition, Human-Computer Interaction, Speaker diarisation, Computer Vision and Pattern Recognition, Artificial intelligence, 0305 other medical science, Software, Utterance, Natural language processing
Abstract: This work describes the process of collection and organization of a multi-style database for speaker recognition. The multi-style database organization is based on three different categories of speaker recognition: voice-password, text-dependent and text-independent framework. Three Indian institutes collaborated for the collection of the database at respective sites. The database is collected over an online telephone network that is deployed for speech based student attendance system. This enables the collection of data for a longer period from different speakers having session variabilities, which is useful for speaker verification (SV) studies in practical scenario. The database contains data of 923 speakers for the three different modes of SV and hence termed as multi-style speaker recognition database. This database is useful for session variability, multi-style speaker recognition and short utterance based SV studies. Initial results are reported over the database for the three different modes of SV. A copy of the database can be obtained by contacting the authors.
Published: 2017

27. Exploring kernel discriminant analysis for speaker verification with limited test data

Author: Rohan Kumar Das, Akhil Babu Manam, and S. R. Mahadeva Prasanna
Subjects: Normalization (statistics), Computer science, Speech recognition, Pattern recognition, 02 engineering and technology, Covariance, Speaker recognition, Linear discriminant analysis, 01 natural sciences, Artificial Intelligence, 0103 physical sciences, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, NIST, 020201 artificial intelligence & image processing, Computer Vision and Pattern Recognition, Artificial intelligence, Kernel Fisher discriminant analysis, 010301 acoustics, Software, Utterance, Test data
Abstract: Speaker verification (SV) under limited test data conditions is desirable for practical application-oriented systems. The i-vector based speaker modeling has shown its significance for SV tasks, but its performance degrades as the utterance becomes shorter. The i-vectors, apart from being compact and dominant speaker representations, bear channel and session information, which has to be compensated for robust speaker modeling. The conventional techniques for channel/session compensation include linear discriminant analysis (LDA) followed by within class covariance normalization (WCCN) and Gaussian probabilistic linear discriminant analysis (GPLDA), which eliminate the channel/session variation across the i-vectors by assuming these are linearly separable. In this work, a novel method for channel/session compensation is proposed using kernel discriminant analysis (KDA), which projects the i-vectors into a higher dimensional space and performs discriminant analysis to remove the unwanted information for speaker modeling. The SV studies are performed on the standard NIST speaker recognition evaluation (SRE) 2003 and 2008 databases and convey the significance of the proposed compensation over the conventional methods, which is greater when using short test utterances. The achieved improvements are hypothesized to be due to the non-linearities of channel/session information in the i-vector domain.
Published: 2017

28. Speaker Verification from Short Utterance Perspective: A Review

Author: S. R. Mahadeva Prasanna and Rohan Kumar Das
Subjects: Speaker verification, Computer science, Speech recognition, Text independent, Perspective (graphical), 020206 networking & telecommunications, 02 engineering and technology, Field (computer science), 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, Artificial intelligence, Electrical and Electronic Engineering, 0305 other medical science, Natural language processing, Utterance
Abstract: Speaker recognition has emerged as an important field over the past several decades and has evolved over time. This work is attempted to investigate speaker verification (SV), particularly ...
Published: 2017

29. Consonant-vowel unit recognition using dominant aperiodic and transition region detection

Author: S. R. Mahadeva Prasanna, Priyankoo Sarmah, and Biswajit Dev Sarma
Subjects: Consonant, 0209 industrial biotechnology, Linguistics and Language, Communication, Transition (fiction), Speech recognition, 02 engineering and technology, Signal, Language and Linguistics, Computer Science Applications, 030507 speech-language pathology & audiology, 03 medical and health sciences, 020901 industrial engineering & automation, Aperiodic graph, Duration (music), Modeling and Simulation, Computer Vision and Pattern Recognition, 0305 other medical science, Unit (ring theory), Software, Vocal tract, Mathematics, Group delay and phase delay
Abstract: This work reports a method of Consonant-Vowel (CV) unit recognition by detecting the Dominant Aperiodic component Regions (DARs) and by predicting the Duration of Transition Regions (DTRs) in speech. DAR detection is performed using complementary information from the source and the vocal tract. While source information is extracted using sub-fundamental frequency filtering of speech, vocal tract information is extracted using a) Dominant Resonant Frequency (DRF) and b) High to Low Frequency component Ratio (HLFR), computed from the Hilbert envelope of Numerator Group Delay (HNGD) spectrum of the zero-time windowed signal. The DTR is predicted by using vocal tract constriction information. Subsequently, detected DARs and predicted DTRs are compared with manually marked regions and finally used for CV unit recognition of Indian languages. Conventionally, CV unit recognition is performed by anchoring the Vowel Onset Point (VOP) and assuming fixed durations for transition and consonant regions on either side of the VOP. However, in speech, the durations of transition and consonantal regions vary depending on the type of consonants and vowels. In the proposed method, the use of dynamic values for consonant duration and transition regions has resulted in better consonant recognition, improving CV unit recognition.
Published: 2017

30. Processing degraded speech for text dependent speaker verification

Author: Ramesh K. Bhukya, S. R. Mahadeva Prasanna, and Banriskhem K. Khonglah
Subjects: Linguistics and Language, Dynamic time warping, Speaker verification, Voice activity detection, business.industry, Computer science, Speech recognition, Pattern recognition, Spectral processing, 01 natural sciences, Language and Linguistics, Task (project management), Human-Computer Interaction, Speech enhancement, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0103 physical sciences, Computer Vision and Pattern Recognition, Noise (video), Artificial intelligence, 0305 other medical science, business, 010301 acoustics, Software, Degradation (telecommunications)
Abstract: This work explores the use of speech enhancement for enhancing degraded speech which may be useful for text dependent speaker verification system. The degradation may be due to noise or background speech. The text dependent speaker verification is based on the dynamic time warping (DTW) method. Hence there is a necessity of the end point detection. The end point detection can be performed easily if the speech is clean. However the presence of degradation tends to give errors in the estimation of the end points and this error propagates into the overall accuracy of the speaker verification system. Temporal and spectral enhancement is performed on the degraded speech so that ideally the nature of the enhanced speech will be similar to the clean speech. Results show that the temporal and spectral processing methods do contribute to the task by eliminating the degradation and improved accuracy is obtained for the text dependent speaker verification system using DTW.
Published: 2017

31. Acoustic–Phonetic Analysis for Speech Recognition: A Review

Author: S. R. Mahadeva Prasanna and Biswajit Dev Sarma
Subjects: Descriptive knowledge, Computer science, Speech recognition, 020206 networking & telecommunications, 02 engineering and technology, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, Artificial intelligence, Electrical and Electronic Engineering, 0305 other medical science, Natural language processing
Abstract: This paper reviews the literature related to the acoustic–phonetic analysis of speech and the speech recognition approaches that use these types of knowledge. At first, acoustic–phonetic cues that ...
Published: 2017

32. Enhancement of Spectral Tilt in Synthesized Speech

Author: S. R. Mahadeva Prasanna and Bidisha Sharma
Subjects: Computer science, Applied Mathematics, Speech recognition, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), 020206 networking & telecommunications, Speech synthesis, 02 engineering and technology, Intelligibility (communication), Linear predictive coding, Speech enhancement, 030507 speech-language pathology & audiology, 03 medical and health sciences, Computer Science::Sound, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, Electrical and Electronic Engineering, 0305 other medical science, Hidden Markov model, Natural language, Parametric statistics
Abstract: Research in statistical parametric speech synthesis aims at improving naturalness and intelligibility. In this work, the deviation in spectral tilt between natural and synthesized speech is analyzed, and a large gap is observed between the two. Furthermore, the same is analyzed for different classes of sounds, namely low-vowels, mid-vowels, high-vowels, semi-vowels, and nasals, and is found to vary with the category of sound units. Based on this variation, a novel method for spectral tilt enhancement is proposed, where the amount of enhancement introduced is different for different classes of sound units. The proposed method yields improvement in terms of intelligibility, naturalness, and speaker similarity of the synthesized speech.
Published: 2017

33. Epoch Extraction From Telephone Quality Speech Using Single Pole Filter

Author: S. R. Mahadeva Prasanna and C M Vikram
Subjects: 0209 industrial biotechnology, Voice activity detection, Acoustics and Ultrasonics, Computer science, Speech recognition, Speech coding, Linear prediction, 02 engineering and technology, Filter (signal processing), Filter bank, Speech processing, 030507 speech-language pathology & audiology, 03 medical and health sciences, Computational Mathematics, 020901 industrial engineering & automation, Computer Science (miscellaneous), Spectrogram, Electrical and Electronic Engineering, 0305 other medical science, Vocal tract
Abstract: Epoch extraction from speech involves the suppression of vocal tract resonances, either by linear prediction based inverse filtering or filtering at very low frequency. Degradations due to channel effect and significant attenuation of low frequency components (< 300 Hz) create challenges for the epoch extraction from telephone quality speech. An epoch extraction method is proposed that considers the vertical striations present in the time-frequency representation of voiced speech as the representative candidates for the epochs. Time-frequency representation with better localized vertical striations is estimated using single pole filter based filter bank. The time marginal of time-frequency representation is computed to locate the epochs. The proposed algorithm is evaluated on the database of five speakers, which provide simultaneous speech and electroglottographic recordings. Telephone quality speech is simulated using G.191 software tools. The identification rate of the state-of-the-art methods degrades substantially for the telephone quality speech whereas that of the proposed method remains the same, comparable to that of clean speech.
Published: 2017

34. Feature optimisation for stress recognition in speech

Author: H. Leonardo Rufiner, Leandro Daniel Vignolo, Samarendra Dandapat, S. R. Mahadeva Prasanna, and Diego H. Milone
Subjects: Computer science, Emotion classification, Speech recognition, SPEECH PROCESSING, Population, 02 engineering and technology, Machine learning, Artificial Intelligence, 0202 electrical engineering, electronic engineering, information engineering, Feature (machine learning), Speech technology, 020206 networking & telecommunications, Speech processing, Filter bank, Ciencias de la Computación, EVOLUTIONARY ALGORITHMS, Filter (video), Ciencias de la Computación e Información, Signal Processing, 020201 artificial intelligence & image processing, STRESSED SPEECH, Computer Vision and Pattern Recognition, Mel-frequency cepstrum, Artificial intelligence, EMOTIONAL SPEECH, CIENCIAS NATURALES Y EXACTAS, Software
Abstract: Mel-frequency cepstral coefficients introduced biologically-inspired features into speech technology, becoming the most commonly used representation for speech, speaker and emotion recognition, and even for applications in music. While this representation is quite popular, it is ambitious to assume that it would provide the best results for every application, as it is not designed for each specific objective. This work proposes a methodology to learn a speech representation from data by optimising a filter bank, in order to improve results in the classification of stressed speech. Since population-based metaheuristics have proved successful in related applications, an evolutionary algorithm is designed to search for a filter bank that maximises the classification accuracy. For the codification, spline functions are used to shape the filter banks, which allows reducing the number of parameters to optimise. The filter banks obtained with the proposed methodology improve the results in stressed and emotional speech classification. Affiliation: Vignolo, Leandro Daniel. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina. Affiliation: Prasanna, S.R. Mahadeva. Indian Institute of Technology Guwahati; India. Affiliation: Dandapat, Samarendra. Indian Institute of Technology Guwahati; India. Affiliation: Rufiner, Hugo Leonardo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina. Affiliation: Milone, Diego Humberto. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina
Published: 2016

35. Modification of Devoicing Error in Cleft Lip and Palate Speech

Author: Protima Nomo Sudro and S. R. Mahadeva Prasanna
Subjects: Computer science, Speech recognition
Published: 2019

36. Modelling Glottal Flow Derivative Signal for Detection of Replay Speech Samples

Author: S. R. Mahadeva Prasanna, Jagabandhu Mishra, and Debadatta Pati
Subjects: Signal processing, Feature (computer vision), Computer science, Distortion, Speech recognition, Word error rate, Context (language use), Mel-frequency cepstrum, Residual, Signal
Abstract: It is widely known that automatic speaker verification systems are quite vulnerable to replay speech. The present work deals with detecting replay speech by using the information available in the glottal flow derivative (GFD) signal. In signal processing terms, the speech signal can be represented as the response of a vocal-tract system excited by an excitation source in the form of glottal flow. The record and replay devices distort the spectral characteristics of the naturally uttered speech sample, resulting in distortion in the corresponding GFD signals. In this work, the GFD signals are parameterized by using standard mel filters, and Gaussian mixture models are built for detection. Although various methods are available, correlation analysis shows that, in the context of the present work, the dynamic programming phase slope algorithm (DYPSA) method is relatively more effective in estimating the GFD signals. The experimental studies are made on the ASVSpoof2017 database. The proposed glottal flow derivative mel frequency cepstral coefficients (GFDMFCC) feature provides 20.53% equal error rate (EER). This performance is poorer than that of speech and residual based features, mainly due to the absence of fine structure information in the estimated GFD signal. However, in fusion with speech signal based constant-Q cepstral coefficients (CQCC) features, the GFDMFCC feature provides an improvement of 10.30% with reference to the conventional residual feature. This shows the usefulness of modelling GFD signals for detection of replay signals.
Published: 2019

37. Robust Recognition of Tone Specified Mizo Digits Using CNN-LSTM and Nonlinear Spectral Resolution

Author: S. R. Nirmala, S. R. Mahadeva Prasanna, Vignesh Kothapalli, Priyankoo Sarmah, Biswajit Dev Sarma, Parismita Gogoi, Wendy Lalhminghlui, Abhishek Dey, and Rohit Sinha
Subjects: Computer science, Speech recognition, Word error rate, 020206 networking & telecommunications, 02 engineering and technology, Markov model, 030507 speech-language pathology & audiology, 03 medical and health sciences, Nonlinear system, Noise, Tone (musical instrument), 0202 electrical engineering, electronic engineering, information engineering, Spectrogram, Spectral resolution, 0305 other medical science, Test data
Abstract: In this work, we attempt Mizo digit recognition under degraded conditions, using spectrograms as visual inputs to a Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) network. Mizo is a tone language, and each digit is associated with a specific sequence of tones. To emphasize the tonal information, low frequency resolution is increased by applying a nonlinear spectral resolution method. The use of nonlinear spectral resolution improves the recognition rate of the system, as evident from the word error rate decrease of about 4% when the training data contains speech data with noise profiles similar to those in the testing data. When the training data is clean, the improvement in recognition rate using the nonlinear spectral resolution method is about 2%. The proposed method, compared with the Deep Neural Network-Hidden Markov Model (DNN-HMM) based baseline system, gives an improvement of around 40% and 15% for 0 dB and 5 dB SNRs, respectively, when the noise profiles of speech sounds included in training and testing conditions are similar.
Published: 2018

38. Modification of misarticulated fricative /s/ in cleft lip and palate speech

Author: Protima Nomo Sudro and S. R. Mahadeva Prasanna
Subjects: music.instrument, Speech recognition, 0206 medical engineering, Biomedical Engineering, Health Informatics, Glottal stop, 02 engineering and technology, White noise, Intelligibility (communication), 020601 biomedical engineering, 03 medical and health sciences, 0302 clinical medicine, Signal Processing, Normal children, music, 030217 neurology & neurosurgery, Energy (signal processing), Mathematics, Nasal air emission
Abstract: Cleft lip and palate is a craniofacial condition that affects the intelligibility of speech. The current work describes a modification of three types of misarticulated fricative /s/ in cleft lip and palate speech, which include palatalized /s/, phoneme-specific nasal air emission distorted /s/, and glottal stop substituted /s/. By using the knowledge of the glottal activity, frication, and silence, an approach is proposed for misarticulated fricative detection and categorization of the type of error. Based on the category of error, an appropriate modification technique is applied. The deviations of palatalized /s/ and phoneme-specific nasal air emission distorted /s/ are corrected by modifying the spectral energy. The high-frequency energy levels are emphasized to improve the perception of fricative /s/ using spectral tilt modification. Glottal stop substitution is modified by inserting artificially synthesized /s/, where the fricative /s/ signal is synthesized using a white noise signal and a linear prediction filter obtained from the fricative /s/ of normal children. Further, the modified speech samples are subjected to objective and subjective evaluation. The evaluation scores obtained from objective and subjective assessment indicate speech intelligibility improvement of the modified signals compared to the misarticulated fricative /s/.
Published: 2021

39. A better decomposition of speech obtained using modified Empirical Mode Decomposition

Author: S. R. Mahadeva Prasanna and Rajib Sharma
Subjects: 0209 industrial biotechnology, Speech recognition, Linear prediction, 02 engineering and technology, Hilbert–Huang transform, 020901 industrial engineering & automation, Artificial Intelligence, 0202 electrical engineering, electronic engineering, information engineering, Time domain, Electrical and Electronic Engineering, Mathematics, business.industry, Applied Mathematics, 020206 networking & telecommunications, Pattern recognition, Filter (signal processing), Filter bank, Formant, Computational Theory and Mathematics, Signal Processing, Computer Vision and Pattern Recognition, Artificial intelligence, Statistics, Probability and Uncertainty, business, Spline interpolation, Interpolation
Abstract: The objective of this work is to obtain meaningful time domain components, or Intrinsic Mode Functions (IMFs), of the speech signal, using Empirical Mode Decomposition (EMD), with reduced mode mixing, and in a time-efficient manner. This work focuses on two aspects – firstly, extracting IMFs of the speech signal which can better reflect its higher frequency spectrum; and secondly, obtaining a better representation and distribution of the vocal tract resonances of the speech signal in its IMFs, compared to that obtained from standard EMD. To this effect, modifications are proposed to the EMD algorithm for processing speech signals, based on the critical nature of the interpolation points (IPs) used for cubic spline interpolation in EMD. The effect of using different sets of IPs, other than the extrema of the residue – as used in standard EMD – is analyzed. It is found that having more IPs is beneficial only up to a certain limit, after which the characteristic dyadic filterbank nature of EMD breaks down. For certain sets of IPs, these modified EMD processes perform better than EMD, giving better frequency separability between the IMFs, and an enhanced representation of the higher frequency content of the signal. A detailed study of the distribution of the formants, in the IMFs of the speech signal, is done using Linear Prediction (LP) analysis of the IMFs. It is found that the IMFs of the EMD variants have a far better distribution of the formant structure within them, with reduced overlapping amongst their filter spectra, compared to that of standard EMD. Hence, when subjected to the task of formant estimation of voiced speech, using LP analysis, the IMFs of the modified EMD processes cumulatively exhibit performance superior to that of standard EMD, or the speech signal itself, under both clean and noisy conditions.
Published: 2016

40. Polyglot Speech Synthesis: A Review

Author: Bidisha Sharma and S. R. Mahadeva Prasanna
Subjects: business.industry, Computer science, Process (engineering), Speech recognition, 020206 networking & telecommunications, Speech synthesis, Polyglot, 02 engineering and technology, computer.software_genre, Focus (linguistics), 030507 speech-language pathology & audiology, 03 medical and health sciences, Text to speech synthesis, ComputingMethodologies_PATTERNRECOGNITION, 0202 electrical engineering, electronic engineering, information engineering, Natural (music), Artificial intelligence, Electrical and Electronic Engineering, 0305 other medical science, business, computer, Natural language processing
Abstract: The term polyglot speech synthesis refers to the process of producing speech in multiple languages and in a single speaker's voice from a single text-to-speech synthesis (TTS) system. This report reviews existing efforts in the literature to develop a polyglot TTS. The methods described in this review mainly focus on developing a natural, intelligible, and cost-effective TTS system for multilingual text input. Since multilingual text is becoming very common in all applications of TTS, recent work focuses on developing a cost-effective polyglot TTS system, instead of a conventional monolingual TTS. This review also discusses the pros and cons of different methods and mentions possible directions to overcome the limitations.
Published: 2016

41. Children's Speech Recognition Under Mismatched Condition: A Review

Author: S. R. Mahadeva Prasanna, Rohit Sinha, and Y. Sunil
Subjects: Motor theory of speech perception, Engineering, business.industry, Speech recognition, 020206 networking & telecommunications, 02 engineering and technology, General Medicine, Task (project management), 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, 0305 other medical science, business, Vocal tract
Abstract: Automatic speech recognition (ASR) is the task of converting speech to text. In this article, the case of an ASR system trained using the speech of adult speakers and tested using the speech of child speakers is termed children's speech recognition under mismatched condition. The mismatch refers to differences in the training and testing conditions due to the differences in the characteristics of speech signals belonging to adults and children. This mismatch results in significant degradation of the recognition performance. Therefore, there are several approaches in the literature to reduce the mismatch and hence improve the performance. The approaches may be broadly viewed as those trying to reduce the mismatch due to vocal tract length and excitation source characteristics. This article reviews the works that have been done in this direction. A discussion of the literature critically comments on the existing works and suggests possible directions for future research.
Published: 2016

42. Foreground Speech Segmentation and Enhancement Using Glottal Closure Instants and Mel Cepstral Coefficients

Author: S. R. Mahadeva Prasanna and K. T. Deepak
Subjects: 0209 industrial biotechnology, Speech production, Acoustics and Ultrasonics, business.industry, Computer science, Speech recognition, 02 engineering and technology, Filter (signal processing), Speech segmentation, Speech enhancement, Background noise, 030507 speech-language pathology & audiology, 03 medical and health sciences, Computational Mathematics, 020901 industrial engineering & automation, Signal-to-noise ratio, Computer Science (miscellaneous), Computer vision, Artificial intelligence, Mel-frequency cepstrum, Electrical and Electronic Engineering, 0305 other medical science, business, Vocal tract
Abstract: In this paper, the speech signal recorded from the desired speaker close to the microphone in a natural environment is regarded as foreground speech and the rest of the interfering sources as background noise. The paper exploits speech production features like glottal closure instants in the time domain and vocal tract information in the spectral domain to segment the desired speaker's speech and to further enhance it. The foreground speech is perceptually enhanced using the auditory perception feature in the mel-frequency domain using mel-cepstral coefficients and its inversion using a mel log spectrum approximation filter. The focus is on enhancing the production and perceptual features of foreground speech rather than relying on modeling the interfering sources. The speech data are collected in different natural environments from different speakers in order to evaluate the proposed method. The enhanced speech signals derived at three different stages of the proposed method are evaluated against state-of-the-art methods in terms of subjective and objective measures. The proposed method provides improved performance compared to the considered state-of-the-art methods. In terms of the proposed objective measure, foreground-to-background ratio, the enhancement approach presented in this paper gives an average improvement of 12 dB, as opposed to 3 dB for an existing spectral subtraction-based method. Moreover, subjective evaluation using 24 different subjects corroborates the objective test results.
Published: 2016

43. Development of Multi-Level Speech based Person Authentication System

Author: Rohan Kumar Das, S. R. Mahadeva Prasanna, and Sarfaraz Jelil
Subjects: Authentication, Telephone network, Computer science, Speech recognition, 020206 networking & telecommunications, 02 engineering and technology, Speaker recognition, Theoretical Computer Science, Speaker diarisation, Set (abstract data type), 030507 speech-language pathology & audiology, 03 medical and health sciences, Hardware and Architecture, Control and Systems Engineering, Software deployment, Human–computer interaction, Modeling and Simulation, Interactive voice response, Signal Processing, Pattern recognition (psychology), 0202 electrical engineering, electronic engineering, information engineering, 0305 other medical science, Information Systems
Abstract: This work presents the development of a multi-level speech based person authentication system with attendance as an application. The multi-level system consists of three different modules of speaker verification, namely voice-password, text-dependent and text-independent speaker verification. The three speaker verification modules are combined in a sequential manner to develop a multi-level framework which is ported over a telephone network through an interactive voice response (IVR) system for aiding remote authentication. The users call from a fixed set of mobile handsets to verify their claim against their respective models, which is then authenticated in a multi-level mode using the three modules stated above. An analysis of the performance of the multi-level system in attendance marking over a period of two months is shown. The multi-level framework, which combines the three modules, achieves better performance than the individual modules, which shows its potential for practical deployment.
Published: 2016

44. Speech / music classification using speech-specific features

Author: Banriskhem K. Khonglah and S. R. Mahadeva Prasanna
Subjects: Computer science, Speech recognition, Linear prediction, 02 engineering and technology, 030507 speech-language pathology & audiology, 03 medical and health sciences, Artificial Intelligence, 0202 electrical engineering, electronic engineering, information engineering, Feature (machine learning), Electrical and Electronic Engineering, Applied Mathematics, Autocorrelation, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), 020206 networking & telecommunications, Mixture model, Speech processing, Support vector machine, ComputingMethodologies_PATTERNRECOGNITION, Computational Theory and Mathematics, Computer Science::Sound, Signal Processing, Computer Vision and Pattern Recognition, Statistics, Probability and Uncertainty, 0305 other medical science, Energy (signal processing), Vocal tract
Abstract: This paper proposes the use of speech-specific features for speech / music classification. Features representing the excitation source, vocal tract system and syllabic rate of speech are explored. The normalized autocorrelation peak strength of zero frequency filtered signal, and peak-to-sidelobe ratio of the Hilbert envelope of linear prediction residual are the two source features. The log mel energy feature represents the vocal tract information. The modulation spectrum represents the slowly-varying temporal envelope corresponding to the speech syllabic rate. The novelty of the present work is in analyzing the behavior of these features for the discrimination of speech and music regions. These features are non-linearly mapped and combined to perform the classification task using a threshold based approach. Further, the performance of speech-specific features is evaluated using classifiers such as Gaussian mixture models, and support vector machines. It is observed that the performance of the speech-specific features is better compared to existing features. Additional improvement for speech / music classification is achieved when speech-specific features are combined with the existing ones, indicating different aspects of information exploited by the former.
Published: 2016

45. Epoch Extraction from Pathological Children Speech Using Single Pole Filtering Approach

Author: C M Vikram and S. R. Mahadeva Prasanna
Subjects: 030507 speech-language pathology & audiology, 03 medical and health sciences, Computer science, Epoch (reference date), Speech recognition, 0202 electrical engineering, electronic engineering, information engineering, 020206 networking & telecommunications, 02 engineering and technology, 0305 other medical science
Published: 2018

46. Self-similarity Matrix Based Intelligibility Assessment of Cleft Lip and Palate Speech

Author: Sishir Kalita, Samarendra Dandapat, and S. R. Mahadeva Prasanna
Subjects: Computer science, Speech recognition, 0202 electrical engineering, electronic engineering, information engineering, 020206 networking & telecommunications, 02 engineering and technology, Self-similarity matrix, Intelligibility (communication)
Published: 2018

47. Robust Mizo Continuous Speech Recognition

Author: Rohit Sinha, Wendy Lalhminghlui, Biswajit Dev Sarma, S. R. Nirmala, S. R. Mahadeva Prasanna, Lalnunsiami Ngente, Parismita Gogoi, Priyankoo Sarmah, and Abhishek Dey
Subjects: Computer science, Speech recognition
Published: 2018

48. Analysis of Breathiness in Contextual Vowel of Voiceless Nasals in Mizo

Author: Pamir Gogoi, S. R. Mahadeva Prasanna, Ratree Wayland, Priyankoo Sarmah, Parismita Gogoi, and Sishir Kalita
Subjects: 030507 speech-language pathology & audiology, 03 medical and health sciences, Computer science, Speech recognition, Vowel, 0103 physical sciences, 0305 other medical science, 010301 acoustics, 01 natural sciences, Breathy voice
Published: 2018

49. Spoken Keyword Detection Using Joint DTW-CNN

Author: Ravi Shankar, S. R. Mahadeva Prasanna, and C M Vikram
Subjects: 030507 speech-language pathology & audiology, 03 medical and health sciences, Computer science, Speech recognition, 0202 electrical engineering, electronic engineering, information engineering, 020206 networking & telecommunications, 02 engineering and technology, 0305 other medical science, Joint (audio engineering)
Published: 2018

50. Processing Transition Regions of Glottal Stop Substituted /S/ for Intelligibility Enhancement of Cleft Palate Speech

Author: S. R. Mahadeva Prasanna, Sishir Kalita, and Protima Nomo Sudro
Subjects: 03 medical and health sciences, 0302 clinical medicine, music.instrument, Computer science, Speech recognition, 0202 electrical engineering, electronic engineering, information engineering, 020206 networking & telecommunications, Glottal stop, 02 engineering and technology, Intelligibility (communication), music, 030217 neurology & neurosurgery
Published: 2018
