205 results for "Kodukula, Sri Rama Murty"
Search Results
2. Unsupervised Speech Signal-to-Symbol Transformation for Language Identification
- Author
- Bhati, Saurabhchand, Nayak, Shekhar, and Kodukula, Sri Rama Murty
- Published
- 2020
3. Multi-Feature Integration for Speaker Embedding Extraction
- Author
- Sankala, Sreekanth, Rafi B, Shaik Mohammad, and Kodukula, Sri Rama Murty
- Abstract
Automatic speaker recognition systems have become increasingly accurate with advances in deep learning methods. However, current systems remain sensitive to their training conditions, and performance degrades drastically even on slightly mismatched test data. Many remedies, such as data augmentation schemes, alternative loss functions, and multi-feature systems, have been proposed and shown to improve performance. This work focuses on integrating multiple features to improve speaker verification performance. Speaker information is represented in different kinds of features, and redundant or irrelevant information, such as noise and channel characteristics, affects the dimensions of different features in different ways. We aim to maximize the speaker information by reconstructing the speaker information extracted from one feature using the other features, while simultaneously minimizing the irrelevant information. Experiments with the multi-feature integration model demonstrate significant improvements over the stand-alone models, and the extracted speaker embeddings are found to be noise-robust. © 2022 IEEE
- Published
- 2022
4. Subtitle Synthesis using Inter and Intra utterance Prosodic Alignment for Automatic Dubbing
- Author
- Pamisetty, Giridhar and Kodukula, Sri Rama Murty
- Abstract
Automatic dubbing or machine dubbing is the process of replacing the speech in the source video with speech in the desired language, synthesized using a text-to-speech synthesis (TTS) system. The synthesized speech should align with the events in the source video to give a realistic experience. Most of the existing prosodic alignment processes operate on the synthesized speech by controlling the speaking rate. In this paper, we propose subtitle synthesis, a unified approach for prosodic alignment that operates at the feature level. Modifying the prosodic parameters at the feature level does not degrade the naturalness of the synthesized speech. We use both inter- and intra-utterance alignment in the prosodic alignment process. Performing alignment at the feature level requires control over the duration of the phonemes to achieve synchronization between the synthesized and the source speech, so we use the Prosody-TTS system, which provides control over phoneme durations and the fundamental frequency (f0) during synthesis. The subjective evaluation of the translated audiovisual content (lecture videos) resulted in a mean opinion score (MOS) of 4.104, which indicates the effectiveness of the proposed prosodic alignment process. © 2022 IEEE.
- Published
- 2022
5. Tabla Gharānā Recognition from Tabla Solo Recordings
- Author
- Gowriprasad, R, Venkatesh, V, and Kodukula, Sri Rama Murty
- Abstract
Tabla is a percussion instrument in the North Indian music tradition. Teaching practices and performances of tabla are based on stylistic schools called gharānā-s. Gharānā-s are characterized by their unique playing techniques, finger postures, improvisations, and compositional patterns (signature patterns). Recognizing the gharānā from a tabla performance is hence helpful in characterizing the performance. In this paper, we explore an approach for gharānā recognition from solo tabla recordings by searching for characteristic tabla phrases in these recordings. The tabla phrases are modeled as sequences of strokes, and characteristic phrases from the gharānā compositions are chosen as query patterns. The recording is automatically transcribed into a syllable sequence using hidden Markov models (HMMs). The Rough Longest Common Subsequence (RLCS) approach is used to search for the query pattern instances. A decision rule is proposed to recognize the gharānā from the patterns. © 2022 IEEE.
- Published
- 2022
6. Neural Comb Filtering Using Sliding Window Attention Network for Speech Enhancement
- Author
- Parvathala, Venkatesh, Andhavarapu, Sivaganesh, Pamisetty, Giridhar, and Kodukula, Sri Rama Murty
- Abstract
In this paper, we demonstrate the significance of restoring the harmonics of the fundamental frequency (pitch) in deep neural network (DNN)-based speech enhancement. The parameters of the DNN can be estimated by minimizing the mask loss, but this does not restore the pitch harmonics, especially at higher frequencies. In this paper, we propose to restore the pitch harmonics in the spectral domain by minimizing a cepstral loss around the pitch peak. Restoring the cepstral pitch peak, in turn, helps restore the pitch harmonics in the enhanced spectrum. The proposed cepstral pitch-peak loss acts as an adaptive comb filter on voiced segments and emphasizes the pitch harmonics in the speech spectrum. The network parameters are estimated using a combination of the mask loss and the cepstral pitch-peak loss. We show that this combination offers the complementary advantages of enhancing both the voiced and unvoiced regions. DNN-based methods primarily rely on the network architecture, and hence the prediction accuracy improves with increasing architectural complexity; lower-complexity models, however, are essential for real-time processing systems. In this work, we propose a compact model using a sliding-window attention network (SWAN). The SWAN is trained to regress the spectral magnitude mask (SMM) from the noisy speech signal. Our experimental results demonstrate that the proposed approach achieves performance comparable to state-of-the-art noncausal and causal speech enhancement methods with much lower computational complexity. Our three-layered noncausal SWAN achieves 2.99 PESQ on the Valentini database with only 10^9 floating-point operations (FLOPs). © 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
- Published
- 2022
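A minimal numpy sketch of the cepstral pitch-peak loss described in entry 6 above. The frame layout, quefrency window, and function name are illustrative assumptions, not the paper's exact formulation.

import numpy as np

def cepstral_pitch_peak_loss(clean_mag, enh_mag, fs=16000, f_min=60.0, f_max=400.0):
    # real cepstra of the log-magnitude spectra; frames along axis 0
    eps = 1e-8
    c_clean = np.fft.irfft(np.log(clean_mag + eps), axis=-1)
    c_enh = np.fft.irfft(np.log(enh_mag + eps), axis=-1)
    # quefrency band where a voiced pitch peak can occur: fs/f_max .. fs/f_min samples
    q_lo, q_hi = int(fs / f_max), int(fs / f_min)
    diff = c_clean[:, q_lo:q_hi] - c_enh[:, q_lo:q_hi]
    # penalizing this band acts like an adaptive comb filter on voiced frames
    return float(np.mean(diff ** 2))

In training, this term would be combined with the SMM mask loss, as the abstract describes, so that voiced and unvoiced regions are both enhanced.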
7. Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control
- Author
- Pamisetty, Giridhar and Kodukula, Sri Rama Murty
- Abstract
End-to-end text-to-speech synthesis systems have achieved immense success in recent times, with improved naturalness and intelligibility. However, the end-to-end models, which primarily depend on attention-based alignment, do not offer an explicit provision to modify or incorporate the desired prosody while synthesizing the speech. Moreover, the state-of-the-art end-to-end systems use autoregressive models for synthesis, making the prediction sequential, so the inference time and the computational complexity are quite high. This paper proposes Prosody-TTS, a data-efficient end-to-end speech synthesis model that combines the advantages of statistical parametric models and end-to-end neural network models. It also has a provision to modify or incorporate the desired prosody at a finer level by controlling the fundamental frequency (f0) and the phone duration. Generating speech utterances with appropriate prosody and rhythm helps improve the naturalness of the synthesized speech. We explicitly model the duration of the phoneme and the f0 to have finer-level control over them during synthesis. The model is trained in an end-to-end fashion to directly generate the speech waveform from the input text, which in turn depends on the auxiliary subtasks of predicting the phoneme duration, f0, and Mel spectrogram. Experiments on the Telugu language data of the IndicTTS database show that the proposed Prosody-TTS model achieves state-of-the-art performance with a mean opinion score of 4.08 and a very low inference time, using just 4 hours of training data. © 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
- Published
- 2022
8. Subtitle Synthesis using Inter and Intra utterance Prosodic Alignment for Automatic Dubbing
- Author
- Pamisetty, Giridhar and Kodukula, Sri Rama Murty
- Published
- 2022
9. Neural Comb Filtering using Sliding Window Attention Network for Speech Enhancement
- Author
- Parvathala, Venkatesh, Kodukula, Sri Rama Murty, and Andhavarapu, Siva Ganesh
- Published
- 2021
10. Self-Supervised Phonotactic Representations for Language Identification
- Author
- Ramesh, G., Kumar, C. Shiva, and Kodukula, Sri Rama Murty
- Abstract
Phonotactic constraints characterize the sequences of permissible phoneme structures in a language and hence form an important cue for the language identification (LID) task. As phonotactic constraints span multiple phonemes, short-term spectral analysis (20-30 ms) alone is not sufficient to capture them. The speech signal has to be analyzed over longer contexts (100s of milliseconds) in order to extract features representing the phonotactic constraints. Supervised senone classifiers, aimed at modeling triphone context, have been used for extracting language-specific features for the LID task. However, it is difficult to obtain large amounts of manually labeled data to train supervised models. In this work, we explore a self-supervised approach to extract long-term contextual features for the LID task. We use the wav2vec architecture to extract contextualized representations from multiple frames of the speech signal. The contextualized representations extracted from the pre-trained wav2vec model are used for the LID task. The performance of the proposed features is evaluated on a dataset containing 7 Indian languages. The proposed self-supervised embeddings achieved 23% absolute improvement over the acoustic features and 3% absolute improvement over their supervised counterparts. Copyright © 2021 ISCA.
- Published
- 2021
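A hedged sketch of the feature extraction in entry 10 above, with torchaudio's pretrained WAV2VEC2_BASE bundle standing in for the paper's wav2vec model; the mean pooling, the file name, and the linear classifier head are illustrative assumptions.

import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("utt.wav")  # hypothetical mono utterance
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    # list of per-layer contextualized representations, each (1, frames, dim)
    features, _ = model.extract_features(waveform)
embedding = features[-1].mean(dim=1)        # utterance-level pooled embedding

num_languages = 7                           # as in the paper's Indian-language set
classifier = torch.nn.Linear(embedding.shape[-1], num_languages)
logits = classifier(embedding)              # trained with cross-entropy in practice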
11. Virtual phone discovery for speech synthesis without text
- Author
- Nayak, Shekhar, Kumar, C Shiva, Kodukula, Sri Rama Murty, et al.
- Abstract
The objective of this work is to re-synthesize speech directly from speech signals, without using any text, in a different speaker's voice. The speech signals are transformed into a sequence of acoustic subword units, or virtual phones, which are discovered automatically from the given speech signals in an unsupervised manner. The speech signal is initially segmented into acoustically homogeneous segments through kernel-Gram segmentation using MFCC and autoencoder bottleneck features. These segments are then clustered using different clustering techniques. The cluster labels thus obtained are considered as virtual phone units, which are used to transcribe the speech signals. The virtual phones for the utterances to be re-synthesized are encoded as one-hot vector sequences. A deep neural network based duration model and acoustic model are trained for synthesis using these sequences. A vocoder is used to synthesize speech in the target speaker's voice from the features estimated by the acoustic model. The performance evaluation is done on the ZeroSpeech 2019 challenge on the English and Indonesian languages. The bitrate and speaker similarity were found to be better than the challenge baseline, with slightly lower intelligibility due to the compact encoding.
- Published
- 2020
12. Onset detection of Tabla Strokes using LP Analysis
- Author
- Gowriprasad, R and Kodukula, Sri Rama Murty
- Abstract
Onset detection is an important first step in music analysis. We propose a pre-processing scheme for improved onset detection of complex strokes of Indian percussion instruments with resonance characteristics. In this work, we explore the onset detection of tabla (an Indian percussion instrument) strokes. The resonance characteristics of tabla strokes pose challenges to onset detection: energy-based and spectral-flux-based onset detectors are often inaccurate on the raw signal. We propose an onset detection algorithm addressing these challenges using linear prediction (LP) analysis and the Hilbert envelope (HE) in tandem. The tabla signal is modeled using LP, and its residual highlights the onset time instances very well. The unipolar nature of the HE on top of the LP residual further enhances the onset instances. Onset detection is performed using energy-based, spectral-flux-based, and state-of-the-art machine-learning-based onset detectors on the Hilbert envelope of the LP residual (HELP). Experiments were performed on tabla solos played at various tempi, and the results show that the HELP-based approach provides 12% relative improvement in F-measure compared to the performance on the raw tabla signal. © 2020 IEEE.
- Published
- 2020
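A minimal sketch of the HELP front end of entry 12 above: LP inverse filtering, the Hilbert envelope of the residual, and simple peak picking. The LP order, smoothing window, and threshold are assumptions; the paper also feeds HELP to spectral-flux and learned onset detectors.

import numpy as np
import librosa
from scipy.signal import hilbert, lfilter, find_peaks

y, sr = librosa.load("tabla_solo.wav", sr=16000)   # hypothetical recording

a = librosa.lpc(y, order=12)             # LP coefficients, a[0] == 1
residual = lfilter(a, [1.0], y)          # inverse filtering -> LP residual
envelope = np.abs(hilbert(residual))     # unipolar Hilbert envelope (HELP)

# light smoothing before peak picking
win = np.hanning(int(0.005 * sr))
smooth = np.convolve(envelope, win / win.sum(), mode="same")

peaks, _ = find_peaks(smooth, distance=int(0.05 * sr), height=3.0 * smooth.mean())
onset_times = peaks / sr                 # candidate stroke onsets, in seconds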
13. Significance of Phase in DNN based speech enhancement algorithms
- Author
- Rani, P. Swetha, Andhavarapu, Sivaganesh, and Kodukula, Sri Rama Murty
- Abstract
Most speech enhancement algorithms rely on estimating the magnitude spectrum of the clean speech signal from that of the noisy speech signal using either spectral regression or spectral masking. Because of the difficulty in processing the phase of the short-time Fourier transform (STFT), the noisy phase is reused while synthesizing the waveform from the enhanced magnitude spectrum. In order to demonstrate the significance of phase in speech enhancement, we compare the phase obtained from different reconstruction methods, like Griffin-Lim and minimum phase, with the gold phase (clean phase). In this work, a spectral magnitude mask (SMM) is estimated using deep neural networks to enhance the magnitude spectrum of the speech signal. The experimental results show that the gold phase outperforms the phase reconstruction methods on all the objective measures, illustrating the significance of enhancing the noisy phase in speech enhancement.
- Published
- 2020
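The comparison in entry 13 above can be sketched as follows, with librosa's Griffin-Lim as the iterative reconstruction; the clean magnitude stands in for a DNN-enhanced (SMM-masked) magnitude, and the file names are hypothetical.

import numpy as np
import librosa

noisy, sr = librosa.load("noisy.wav", sr=16000)
clean, _ = librosa.load("clean.wav", sr=16000)

S_noisy, S_clean = librosa.stft(noisy), librosa.stft(clean)
enh_mag = np.abs(S_clean)   # stand-in for the enhanced magnitude spectrum

# (a) reuse the noisy phase, the common practice the paper questions
y_noisy_phase = librosa.istft(enh_mag * np.exp(1j * np.angle(S_noisy)))
# (b) Griffin-Lim iterative phase reconstruction from the magnitude alone
y_griffinlim = librosa.griffinlim(enh_mag, n_iter=60)
# (c) the "gold" phase taken from the clean signal
y_gold_phase = librosa.istft(enh_mag * np.exp(1j * np.angle(S_clean)))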
14. Self Attentive Context dependent Speaker Embedding for Speaker Verification
- Author
- Sankala, Sreekanth, Mohammad Rafi, B. Shaik, and Kodukula, Sri Rama Murty
- Abstract
In the recent past, deep neural networks have become the most successful approach for extracting speaker embeddings. Among the existing methods, the x-vector system, which extracts a fixed-dimensional representation from a varying-length speech signal, has been the most successful. The performance of the x-vector system was later improved by explicitly modeling the phonological variations in it, yielding the c-vector. Although the c-vector framework utilizes phonological variations in the speaker embedding extraction process, it gives equal attention to all frames through its stats pooling layer. Motivated by subjective analyses of the importance of nasals, vowels, and semivowels for speaker recognition, we extend the c-vector system by including a multi-head self-attention mechanism. In comparison with the earlier subjective analysis of the importance of different phonetic units for speaker recognition, we also analyzed the attention weights learnt by the network using TIMIT data. To examine the effectiveness of the proposed approach, we evaluate the performance of the proposed system on the NIST SRE10 database and obtain a relative improvement of 18.19% over the c-vector system in the short-duration case.
- Published
- 2020
15. End to End ASR Free Keyword Spotting with Transfer Learning from Speech Synthesis
- Author
- Bramhendra, Koilakuntla and Kodukula, Sri Rama Murty
- Abstract
Keyword spotting (KWS) is an important speech application, but it typically requires as much data as an automatic speech recognition (ASR) system, even though the task is much more specific than ASR. This work attempts to reduce the dependency on transcribed data. Traditional KWS architectures are built on top of ASR; lattice indexing and keyword-filler models are popular examples of this approach. Although they give very good accuracy, the former is an offline system and the latter suffers from lower accuracy. Here we propose an improvement to the end-to-end ASR-free keyword spotting approach. Inspired by traditional keyword spotting architectures, this system consists of three modules: an acoustic encoder, a phonetic encoder, and a keyword neural network. The acoustic encoder processes the speech features into a fixed-length representation; the phonetic encoder likewise produces a fixed-length representation; and the two are concatenated to form the input to the keyword network, which predicts whether the keyword exists. We propose to retain all the hidden representations so as to preserve the temporal resolution needed to identify the location of the query, and to pretrain the phonetic encoder to make it aware of the acoustic projection. With these changes, performance improves by 7.1% absolute. In addition, being end-to-end, the system has the advantage of being easily deployable.
- Published
- 2020
16. Speaker embedding extraction with virtual phonetic information
- Author
- Sreekanth, S, Rafi, B Shaik Mohammad, Kodukula, Sri Rama Murty, et al.
- Abstract
In the recent past, deep neural networks have been successfully employed to extract fixed-dimensional speaker embeddings from the speech signal. The commonly used x-vectors are extracted by projecting the magnitude spectral features extracted from the speech signal onto a speaker-discriminative space. As the x-vectors do not explicitly capture the speaker-specific phonological pronunciation variability, phonetic vectors extracted from an automatic speech recognition (ASR) engine were supplied as auxiliary information to improve the performance of the x-vector system. However, the development of an ASR engine requires a huge amount of manually transcribed speech data. In this paper, we propose to transcribe the speech signal in an unsupervised manner with the cluster labels obtained from a mixture of autoencoders (MoA) trained on a large amount of speech data. The unsupervised labels, referred to as virtual phonetic transcriptions, are used to extract the phonetic vectors. The virtual phonetic vectors extracted using the MoA are supplied as auxiliary information to the x-vector system. The performance of the proposed system is compared with the state-of-the-art x-vector system on NIST SRE-2010 data. The proposed unsupervised auxiliary information provides relative improvements of 12.08%, 3.61% and 16.66% over the x-vector system on the core-core, core-10sec and 10sec-10sec conditions, respectively.
- Published
- 2019
17. Instantaneous Frequency Features for Noise Robust Speech Recognition
- Author
- Nayak, Shekhar, Shashank, Dhar B., Kodukula, Sri Rama Murty, et al.
- Abstract
The analytic phase of the speech signal plays an important role in human speech perception, especially in the presence of noise. Generally, phase information is ignored in most recent speech recognition systems. In this paper, we illustrate the importance of the analytic phase of the speech signal for noise-robust automatic speech recognition. To avoid the phase wrapping problem involved in the computation of the analytic phase, features are extracted from the instantaneous frequency (IF), which is the time derivative of the analytic phase. Deep neural network (DNN) based acoustic models are trained on clean speech using features extracted from the IF of speech signals. The robustness of IF features in combination with mel-frequency cepstral coefficients (MFCCs) was evaluated in varied noisy conditions. System combination using minimum Bayes risk decoding of IF features with MFCCs delivered absolute improvements of up to 13% over MFCC features alone for DNN-based systems under noisy conditions. The impact of the system combination of magnitude- and phase-based features on different phonetic classes was studied under noisy conditions and was found to model both voiced and unvoiced phonetic classes efficiently.
- Published
- 2019
18. Importance of Analytic Phase of the Speech Signal for Detecting Replay Attacks in Automatic Speaker Verification Systems
- Author
- Rafi B, Shaik Mohammad and Kodukula, Sri Rama Murty
- Abstract
In this paper, the importance of the analytic phase of the speech signal in automatic speaker verification systems is demonstrated in the context of replay spoof attacks. In order to accurately detect replay spoof attacks, effective feature representations of speech signals are required to capture the distortion introduced by the intermediate playback/recording devices, which is convolutive in nature. Since convolutional distortion in the time domain translates to additive distortion in the phase domain, we propose to use IFCC features extracted from the analytic phase of the speech signal. The IFCC features contain information from both the clean speech and distortion components. The clean speech component has to be subtracted in order to highlight the distortion component introduced by the playback/recording devices. In this work, a dictionary learned from the IFCCs extracted from clean speech data is used to remove the clean speech component. The residual distortion component is used as a feature to build a binary classifier for replay spoof detection. The proposed phase-based features delivered a 9% absolute improvement over the baseline system built using magnitude-based CQCC features.
- Published
- 2019
19. Zero Resource Speaking Rate Estimation from Change Point Detection of Syllable-like Units
- Author
- Nayak, Shekhar, Bhati, Saurabhchand, and Kodukula, Sri Rama Murty
- Abstract
Speaking rate is an important attribute of the speech signal that plays a crucial role in the performance of automatic speech processing systems. In this paper, we propose to estimate the speaking rate by segmenting the speech into syllable-like units using end-point detection algorithms that do not require any training or fine-tuning. There are also no predefined constraints on the expected number of syllabic segments. The acoustic subword units are obtained from the speech signal alone, so the speaking rate is estimated without any transcriptions or phonetic knowledge of the speech data. A recent theta-rate oscillator based syllabification algorithm is also employed for speaking rate estimation. The performance is evaluated on the TIMIT corpus and on spontaneous speech from the Switchboard corpus. The correlation results are comparable to those of recent algorithms that are trained on a specific training set and/or make use of the available transcriptions.
- Published
- 2019
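In the spirit of entry 19 above, a training-free speaking-rate estimate can be sketched by counting syllable-like energy peaks; the RMS parameters and peak thresholds are assumptions, not the paper's change-point algorithm.

import librosa
from scipy.signal import find_peaks

y, sr = librosa.load("utt.wav", sr=16000)          # hypothetical utterance
rms = librosa.feature.rms(y=y, frame_length=400, hop_length=160)[0]
hop_t = 160 / sr                                   # 10 ms frame hop

# syllable nuclei ~ energy peaks at least 100 ms apart
peaks, _ = find_peaks(rms, distance=int(0.1 / hop_t), prominence=0.3 * rms.std())
speaking_rate = len(peaks) / (len(y) / sr)         # syllable-like units per second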
20. Unsupervised Acoustic Segmentation and Clustering Using Siamese Network Embeddings
- Author
- Bhati, Saurabhchand, Nayak, Shekhar, Kodukula, Sri Rama Murty, and Dehak, Najim
- Abstract
Unsupervised discovery of acoustic units from the raw speech signal forms the core objective of zero-resource speech processing. It involves identifying the acoustic segment boundaries and consistently assigning unique labels to acoustically similar segments. In this work, the possible candidates for segment boundaries are identified in an unsupervised manner from the kernel Gram matrix computed from the Mel-frequency cepstral coefficients (MFCC). These segment boundary candidates are used to train a siamese network that is intended to learn embeddings which minimize intra-segment distances and maximize inter-segment distances. The siamese embeddings capture phonetic information from longer contexts of the speech signal and enhance inter-segment discriminability. These properties make the siamese embeddings better suited for acoustic segmentation and clustering than the raw MFCC features. The Gram matrix computed from the siamese embeddings provides unambiguous evidence for boundary locations. The initial candidate boundaries are refined using this evidence, and siamese embeddings are extracted for the new acoustic segments. A graph growing approach is used to cluster the siamese embeddings, and a unique label is assigned to acoustically similar segments. The performance of the proposed method for acoustic segmentation and clustering is evaluated on the Zero Resource 2017 database.
- Published
- 2019
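A minimal PyTorch sketch of the siamese objective in entry 20 above: frame pairs from the same candidate segment are pulled together, pairs from different segments pushed apart. The encoder architecture, margin, and input dimensionality are assumptions.

import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    def __init__(self, in_dim=39, emb_dim=64):   # e.g. MFCCs with deltas in
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, emb_dim))
    def forward(self, x):
        return self.net(x)

def contrastive_loss(e1, e2, same_segment, margin=1.0):
    d = torch.norm(e1 - e2, dim=-1)
    pos = same_segment * d.pow(2)                    # shrink intra-segment distances
    neg = (1 - same_segment) * torch.clamp(margin - d, min=0).pow(2)
    return (pos + neg).mean()

enc = SiameseEncoder()
x1, x2 = torch.randn(32, 39), torch.randn(32, 39)    # placeholder frame pairs
labels = torch.randint(0, 2, (32,)).float()          # 1 = same segment
loss = contrastive_loss(enc(x1), enc(x2), labels)
loss.backward()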
21. Allpass Modeling of Phase Spectrum of Speech Signals for Formant Tracking
- Author
- Vijayan, Karthika, Kodukula, Sri Rama Murty, and Li, Haizhou
- Abstract
Formant tracking is a very important task in speech applications. Most current formant tracking methods bank on peak picking from the linear prediction (LP) spectrum of speech, which suffers from merged/spurious peaks in LP spectra, resulting in unreliable formant candidates. In this paper, we present the significance of the phase spectrum of speech in refining the formant candidates from LP analysis. The short-time phase spectrum of speech is modeled as the phase response of an allpass (AP) system, where the coefficients of the AP system are initialized with LP coefficients and estimated with an iterative procedure. This technique refines the initial formants from LP analysis using the phase spectrum of speech through AP analysis, thereby accomplishing fusion of information from the magnitude and phase spectra. The group delay of the resultant AP system exhibits unambiguous peaks at formants and delivers reliable formant candidates. The formant trajectories obtained by selecting formants from these candidates are reported to be more accurate than those obtained from LP analysis. The fused information from magnitude and phase spectra rendered relative improvements of 25%, 15% and 18% in the tracking accuracy of the first, second and third formants, respectively, over those from the magnitude spectrum alone.
- Published
- 2019
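A sketch of the core observation in entry 21 above: an allpass filter whose denominator is the LP polynomial has group-delay peaks at the resonances, yielding formant candidates. Only the LP initialization is shown; the paper's iterative re-estimation of the AP coefficients is omitted, and the file name, orders, and thresholds are assumptions.

import numpy as np
import librosa
from scipy.signal import group_delay, find_peaks

sr = 8000
frame = librosa.load("speech.wav", sr=sr)[0][:int(0.03 * sr)]  # one 30 ms frame
a = librosa.lpc(frame * np.hamming(len(frame)), order=10)

b = a[::-1]                                 # reversed A(z): H(z) = z^-p A(1/z)/A(z)
w, gd = group_delay((b, a), w=512, fs=sr)   # group delay (samples) vs frequency (Hz)

peaks, _ = find_peaks(gd, prominence=1.0)   # prominence threshold is an assumption
formant_candidates = w[peaks]               # unambiguous peaks at the resonances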
22. Unsupervised Speech Signal to Symbol Transformation for Zero Resource Speech Processing
- Author
- Nayak, Shekhar and Kodukula, Sri Rama Murty
- Abstract
Zero resource speech processing refers to techniques which do not require manually transcribed speech data. The inspiration for zero resource is drawn from language acquisition in infants, which is completely self-driven. Infants learn different abstraction levels, i.e. phones, words and some syntactic aspects of the language they are exposed to, without any supervision or feedback. This motivated research in the speech community towards the development of completely unsupervised speech algorithms which can discover subword/word units from the speech signal alone. The applications include spoken term discovery, language identification, keyword spotting etc. Zero resource techniques can be effective in solving problems associated with the development of speech systems for low resource languages. Low resource languages have a low amount of transcribed data and/or a low number of native speakers. Several languages of the world have become endangered languages with almost negligible resources. The lack of transcribed data for low resource languages has inspired many directions to address this problem, such as data augmentation, cross-lingual and multilingual techniques, with limited success. In this thesis, we explore better feature representations for low resource speech recognition and later build unsupervised algorithms for zero resource speech processing which could lead to directions to effective solutions to the low resource problem. Traditional speech recognition systems employed magnitude based features for building acoustic models. The phase of speech signals is generally ignored, as the human ear was traditionally considered to be indifferent to phase. Recent perceptual studies have shown the importance of phase in human speech recognition. Motivated by this fact, and in order to leverage the maximum information from the limited transcribed data available in low resource settings, we propose to extract features from the analytic phase of speech signals for speech recognition. In order t
- Published
- 2019
23. Automatic Gharana Recognition from Audio music recordings of Tabla Solo performances
- Author
- Gowriprasad, R and Kodukula, Sri Rama Murty
- Abstract
Tabla is a percussion instrument in Hindustani music (the North Indian music tradition). Tabla learning and performance in the Indian subcontinent is based on stylistic schools called the gharānā-s. This thesis aims to explore the problem of tabla gharānā recognition from solo tabla recordings by searching for characteristic tabla phrases in these recordings. The concept of rhythm in Indian music, the instrument tabla and its gharānā-s are briefly explained. At first, our work explores the onset detection of tabla strokes, then deals with automatic gharānā recognition, and the final part focuses on a tabla critic. Relevant research related to musical onset detection, segmentation, transcription and string search methods is reviewed and commented upon. Onset detection is an important first step in music analysis. We analyze the resonance characteristics of the tabla strokes, motivating the challenge in designing an onset detection algorithm. We propose an onset detection algorithm addressing these challenges using Linear Prediction (LP) analysis and the Hilbert envelope (HE) in tandem. The tabla signal is modeled using LP, and its residual highlights the onset time instances very well. The unipolar nature of HE on top of the LP residual further enhances the onset instances. Onset detection is performed using energy based and spectral flux based detectors on the Hilbert envelope of the LP residual (HELP). Experiments were performed on tabla solos played at various tempi, and the results show that the HELP based approach provides 12% relative improvement in F-measure compared to the performance on the raw tabla signal. Tabla learning and performance in the Indian subcontinent is based on stylistic schools called the gharānā-s. Each gharānā is characterized by its unique style of playing technique, dynamics of tabla strokes, improvisations and compositional patterns (signature patterns). Gharānā identification helps in characterizing tabla performances and provides valuable information for analysis
- Published
- 2019
24. Phoneme Based Embedded Segmental K-Means for Unsupervised Term Discovery
- Author
- Bhati, Saurabhchand, Kamper, Herman, and Kodukula, Sri Rama Murty
- Abstract
Identifying and grouping the frequently occurring word-like patterns from raw acoustic waveforms is an important task in zero resource speech processing. Embedded segmental K-means (ES-KMeans) discovers both the word boundaries and the word types from raw data. Starting from an initial set of subword boundaries, ES-KMeans iteratively eliminates some of the boundaries to arrive at frequently occurring longer word patterns. Note that the initial word boundaries are not adjusted during this process; as a result, the performance of ES-KMeans critically depends on the initial subword boundaries. Originally, syllable boundaries were used to initialize ES-KMeans. In this paper, we propose to use a phoneme segmentation method that produces boundaries closer to the true boundaries for ES-KMeans initialization. The use of shorter units increases the number of initial boundaries, which leads to a significant increase in computational complexity. To reduce the computational cost, we extract compact lower-dimensional embeddings from an auto-encoder. The proposed algorithm is benchmarked on the Zero Resource 2017 challenge, which consists of 70 hours of unlabeled data across three languages, viz. English, French, and Mandarin. The proposed algorithm outperforms the baseline system without any language-specific parameter tuning.
- Published
- 2018
25. Action Recognition Based on Discriminative Embedding of Actions Using Siamese Networks
- Author
- Roy, Debaditya, C, Krishna Mohan, and Kodukula, Sri Rama Murty
- Abstract
Actions can be recognized effectively when the various atomic attributes forming the action are identified and combined in the form of a representation. In this paper, a low-dimensional representation is extracted from a pool of attributes learned in a universal Gaussian mixture model using factor analysis. However, such a representation cannot adequately discriminate between actions with similar attributes. Hence, we propose to classify such actions by leveraging the corresponding class labels. We train a Siamese deep neural network with a contrastive loss on the low-dimensional representation. We show that Siamese networks allow effective discrimination even between similar actions. The efficacy of the proposed approach is demonstrated on two benchmark action datasets, HMDB51 and MPII Cooking Activities. On both the datasets, the proposed method improves the state-of-the-art performance considerably.
- Published
- 2018
26. Speech Source Separation Using ICA in Constant Q Transform Domain
- Author
- D.V.L.N, Dheeraj Sai, K.S, Kishor, and Kodukula, Sri Rama Murty
- Abstract
In order to separate individual sources from convoluted speech mixtures, complex-domain independent component analysis (ICA) is employed on the individual frequency bins of time-frequency representations of the speech mixtures, obtained using the short-time Fourier transform (STFT). The frequency components computed using the STFT are separated by a constant frequency difference with a constant frequency resolution. However, it is well known that the human auditory mechanism offers better resolution at lower frequencies. Hence, the perceptual quality of the extracted sources critically depends on the separation achieved in the lower frequency components. In this paper, we propose to perform source separation on the time-frequency representation computed through the constant Q transform (CQT), which offers non-uniform logarithmic binning in the frequency domain. Complex-domain ICA is performed on the individual bins of the CQT in order to get separated components in each frequency bin, which are suitably scaled and permuted to obtain separated sources in the CQT domain. The estimated sources are obtained by applying the inverse constant Q transform to the scaled and permuted sources. In comparison with the STFT-based frequency-domain ICA methods, there has been a consistent improvement of 3 dB or more in the signal-to-interference ratios of the extracted sources.
- Published
- 2018
27. Unsupervised Universal Attribute Modelling for Action Recognition
- Author
- Roy, Debaditya, Kodukula, Sri Rama Murty, and C, Krishna Mohan
- Abstract
A fixed dimensional representation for action clips of varying lengths has been proposed in the literature using aggregation models like bag-of-words and Fisher vector. These representations are high-dimensional and require classification techniques for action recognition. In this paper, we propose a framework for unsupervised extraction of a discriminative low-dimensional representation called the action-vector. To start with, local spatio-temporal features are utilized to capture the action attributes implicitly in a large Gaussian mixture model called the universal attribute model (UAM). To enhance the contribution of the significant attributes in each action clip, a maximum a posteriori (MAP) adaptation of the UAM means is performed for each clip. This results in a concatenated mean vector called the super action vector (SAV) for each action clip. However, the SAV is still high-dimensional because of the presence of redundant attributes. Hence, we employ factor analysis to represent every SAV only in terms of the few important attributes contributing to the action clip. This leads to a low-dimensional representation called the action-vector. This entire procedure requires no class labels and produces action-vectors that are distinct representations of each action irrespective of the inter-actor variability encountered in unconstrained videos. An evaluation on the trimmed action datasets UCF101 and HMDB51 demonstrates the efficacy of action-vectors for action classification over state-of-the-art techniques. Moreover, we also show that action-vectors can adequately represent untrimmed videos from the THUMOS14 dataset and produce classification results comparable to existing techniques.
- Published
- 2018
28. Unsupervised Segmentation of Speech Signals Using Kernel-Gram Matrices
- Author
- Bhati, Saurabhchand, Nayak, Shekhar, and Kodukula, Sri Rama Murty
- Abstract
The objective of this paper is to develop an unsupervised method for segmenting speech signals into phoneme-like units. The proposed algorithm is based on the observation that feature vectors from the same segment exhibit a higher degree of similarity than feature vectors across segments. The kernel-Gram matrix of an utterance is formed by computing the similarity between every pair of feature vectors in the Gaussian kernel space. The kernel-Gram matrix consists of square patches along the principal diagonal corresponding to the different phoneme-like segments in the speech signal. The method detects the number of segments, as well as their boundaries, automatically. The proposed approach does not assume any information about the input utterances, such as the exact distribution of segment lengths or the correct number of segments in an utterance. The proposed method outperforms state-of-the-art blind segmentation algorithms on the Zero Resource 2015 databases and the TIMIT database.
- Published
- 2018
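A sketch of the kernel-Gram construction in entry 28 above, paired with a standard checkerboard-novelty pass over the diagonal to locate the block boundaries; the novelty step and all parameter values are assumptions, since the paper defines its own boundary detection on the matrix.

import numpy as np
import librosa
from scipy.signal import find_peaks

y, sr = librosa.load("utt.wav", sr=16000)               # hypothetical utterance
F = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T       # (frames, 13)

# Gaussian-kernel Gram matrix: square blocks along the principal diagonal,
# one per acoustically homogeneous segment
d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)
G = np.exp(-d2 / (2 * np.median(d2)))

# checkerboard kernel slid along the diagonal; novelty peaks at block edges
L = 8
cb = np.kron(np.array([[1, -1], [-1, 1]]), np.ones((L, L)))
novelty = np.array([(G[i - L:i + L, i - L:i + L] * cb).sum()
                    for i in range(L, len(F) - L)])
boundaries, _ = find_peaks(novelty, distance=5)          # frame indices, offset by L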
29. An investigation into instantaneous frequency estimation methods for improved speech recognition features
- Author
- Nayak, Shekhar, Bhati, Saurabhchand, and Kodukula, Sri Rama Murty
- Abstract
There have been several studies in the recent past pointing to the importance of the analytic phase of the speech signal in human perception, especially in noisy conditions. However, phase information is still not used in state-of-the-art speech recognition systems. In this paper, we illustrate the importance of the analytic phase of the speech signal for automatic speech recognition. As the computation of the analytic phase suffers from the inevitable phase wrapping problem, we extract features from its time derivative, referred to as the instantaneous frequency (IF). In this work, we highlight the issues involved in IF extraction from speech-like signals and propose suitable modifications for IF extraction from speech signals. We used the deep neural network (DNN) framework to build a speech recognition system using features extracted from the IF of speech signals. The speech recognition system based on IF features delivered a phoneme error rate of 21.8% on the TIMIT database, while the baseline system based on mel-frequency cepstral coefficients (MFCCs) delivered a phoneme error rate of 18.4%. The combination of the IF and MFCC based systems, using minimum Bayes risk (MBR) decoding, provided a relative improvement of 8.7% over the baseline system, illustrating the significance of the analytic phase for speech recognition.
- Published
- 2017
30. A new approach for robust replay spoof detection in ASV systems
- Author
- Rafi, B Shaik Mohammad, Kodukula, Sri Rama Murty, and Nayak, Shekhar
- Abstract
The objective of this paper is to extract robust features for detecting replay spoof attacks on text-independent speaker verification systems. In a replay attack, a prerecorded utterance of the target speaker is played to the automatic speaker verification (ASV) system to gain unauthorized access. In such a scenario, the speech signal carries the characteristics of the intermediate recording device as well. In the proposed approach, the characteristics of the intermediate device are highlighted by subtracting the contribution of the live speech in the cepstral domain. An overcomplete dictionary learned on cepstral features extracted from live speech data is used to subtract the contribution of live speech. The residual captures the characteristics of the recording device and can be used to distinguish spoofed speech signals from live speech signals. The distributions of the residuals from live and spoofed speech signals are captured using Gaussian mixture models (GMMs). The likelihood ratio computed from the GMMs built on spoofed and live signals, respectively, is used to detect the spoof attack. The performance of the proposed approach is evaluated on the ASVspoof 2017 evaluation challenge database. The proposed feature extraction method achieved a 20.18% relative improvement over the baseline system built on constant-Q cepstral coefficients.
- Published
- 2017
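A hedged sketch of the residual feature in entry 30 above, with scikit-learn's DictionaryLearning and GaussianMixture standing in for the paper's dictionary and GMM back end; the placeholder features, dictionary size, and sparsity level are all assumptions.

import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
live_train = rng.standard_normal((2000, 30))    # placeholder live cepstral features
spoof_train = rng.standard_normal((2000, 30))   # placeholder replayed features
test = rng.standard_normal((200, 30))           # features from an unknown utterance

dico = DictionaryLearning(n_components=64, transform_algorithm="omp",
                          transform_n_nonzero_coefs=5, max_iter=10).fit(live_train)

def device_residual(X):
    # subtract the live-speech component; the remainder reflects the channel
    return X - dico.transform(X) @ dico.components_

gmm_live = GaussianMixture(n_components=4).fit(device_residual(live_train))
gmm_spoof = GaussianMixture(n_components=4).fit(device_residual(spoof_train))
llr = gmm_spoof.score(device_residual(test)) - gmm_live.score(device_residual(test))
# llr > 0 suggests a replayed recording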
31. Action-vectors: Unsupervised movement modeling for action recognition
- Author
- Roy, D, Kodukula, Sri Rama Murty, and C, Krishna Mohan
- Abstract
Representation and modelling of movements play a significant role in recognising actions in unconstrained videos. However, explicit segmentation and labelling of movements are non-trivial because of the variability associated with actors, camera viewpoints, duration etc. Therefore, we propose to train a GMM with a large number of components, termed a universal movement model (UMM). This UMM is trained using motion boundary histograms (MBH), which capture the motion trajectories associated with the movements across all possible actions. For a particular action video, the MAP-adapted mean vectors of the UMM are concatenated to form a fixed-dimensional representation referred to as a 'super movement vector' (SMV). However, the SMV is still high dimensional, and hence Baum-Welch statistics extracted from the UMM are used to arrive at a compact representation for each action video, which we refer to as an 'action-vector'. It is shown that even without the use of class labels, action-vectors provide a more discriminatory representation of action classes, translating to an 8% relative improvement in classification accuracy for action-vectors based on MBH features over naïve MBH features on the UCF101 dataset. Furthermore, action-vectors projected with LDA achieve 93% accuracy on the UCF101 dataset, which rivals state-of-the-art deep learning techniques.
- Published
- 2017
32. Unsupervised Speech Signal to Symbol Transformation for Zero Resource Speech Applications
- Author
- Bhati, Saurabhchand, Nayak, Shekhar, and Kodukula, Sri Rama Murty
- Abstract
Zero resource speech processing refers to a scenario where no or minimal transcribed data is available. In this paper, we propose a three-step unsupervised approach to zero resource speech processing which does not require any other information/dataset. In the first step, we segment the speech signal into phoneme-like units, resulting in a large number of varying-length segments. The second step involves clustering the varying-length segments into a finite number of clusters so that each segment can be labeled with a cluster index. The unsupervised transcriptions thus obtained can be thought of as a sequence of virtual phone labels. In the third step, a deep neural network classifier is trained to map the feature vectors extracted from the signal to their corresponding virtual phone labels. The virtual phone posteriors extracted from the DNN are used as features in zero resource speech processing. The effectiveness of the proposed approach is evaluated on both ABX and spoken term discovery (STD) tasks using spontaneous American English and Tsonga language datasets, provided as part of the Zero Resource 2015 challenge. It is observed that the proposed system outperforms the baselines supplied with the datasets in both tasks, without any task-specific modifications.
- Published
- 2017
33. IITG-Indigo System for NIST 2016 SRE Challenge
- Author
- Kumar, Nagendra, Das, Rohan Kumar, Kodukula, Sri Rama Murty, et al.
- Abstract
This paper describes the speaker verification (SV) system submitted to the NIST 2016 speaker recognition evaluation (SRE) challenge by the Indian Institute of Technology Guwahati (IITG) under the fixed training condition task. Various SV systems were developed following idea-level collaboration with two other Indian institutions. Unlike the previous SREs, this time the focus was on developing an SV system using non-target-language speech data and a small amount of unlabeled data from the target language/dialects. For addressing these novel challenges, we explored the fusion of systems created using different features, data conditioning, and classifiers. On the NIST 2016 SRE evaluation data, the presented fused system resulted in an actual detection cost function (actDCF) and equal error rate (EER) of 0.81 and 12.91%, respectively. Post-evaluation, we explored a recently proposed pairwise support vector machine classifier and applied adaptive S-norm to the decision scores before fusion. With these changes, the final system achieves an actDCF and EER of 0.67 and 11.63%, respectively.
- Published
- 2017
34. Unsupervised Spoken Term Discovery for Zero Resource Speech Processing
- Author
- Bhati, Saurabh Chand and Kodukula, Sri Rama Murty
- Abstract
Zero resource speech processing refers to a scenario where no or minimal transcribed data is available. Unsupervised acoustic segment modelling (ASM) is a technique for unsupervised discovery of acoustic segments from speech and building corresponding acoustic models without any prior knowledge or manual transcriptions. ASM mainly comprises three steps: a) segmentation of speech utterances into acoustic segments, b) segment labelling, and c) segment modelling. This work focuses on improving the initial segmentation and acoustic segment labelling (ASL). In the first step, we segment the speech signal into acoustically homogeneous regions, resulting in a large number of varying-length segments. We propose a new kernel-Gram matrix-based approach for segmentation. It determines the number of segments automatically and delivers performance comparable to state-of-the-art algorithms. The second step involves clustering the varying-length segments into a finite number of clusters so that each segment can be labeled with a cluster index. To improve labelling, a new graph clustering based framework is proposed. A major problem in ASM is the estimation of the number of ASM units that should be used for modelling the speech data. It is often left unaddressed, or an empirical number of ASM units is adopted. Our algorithm estimates the number of ASM units to be used reasonably well. Performance comparison with baseline approaches demonstrates the ability of our algorithm to model ASM with minimal supervision. In the third step, a deep neural network classifier is trained to map the feature vectors extracted from the signal to their corresponding virtual phone labels. The virtual phone posteriors extracted from the DNN are used as features in zero resource speech processing. The effectiveness of the proposed approach is evaluated on both ABX and spoken term discovery (STD) tasks using spontaneous American English and Tsonga language datasets, provided as part of zero resource 20
- Published
- 2017
35. Single channel speech enhancement using Deep Neural Networks
- Author
- K S, Kishor and Kodukula, Sri Rama Murty
- Abstract
Speech enhancement is an important first step in many applications like mobile communication, speech recognition, hearing aids, etc. Traditionally, speech enhancement was viewed as a pure signal processing problem, and several methods were proposed to design linear filters to suppress the noise. However, with the advent of deep neural networks, speech enhancement has come to be viewed as a machine learning problem, which aims at learning a nonlinear model that maps the noisy speech signal to the clean speech signal. In this thesis, we provide an overview of existing signal processing approaches for single-channel speech enhancement and compare their performance with their DNN counterparts. Even though the DNN-based approaches provide significant performance improvements, they do not use the phase information in the speech signal. In this work, we propose a speech enhancement framework using DNNs which incorporates both magnitude (cochleagram) and phase (instantaneous frequency) information. Experimental results demonstrate that the proposed framework achieves improved performance, especially at lower SNRs, over the conventional systems.
- Published
- 2017
36. Prosody Modification using Allpass Residual of Speech Signals
- Author
- Vijayan, Karthika and Kodukula, Sri Rama Murty
- Abstract
In this paper, we highlight the role of the phase spectrum of speech signals in acquiring an accurate estimate of the excitation source for prosody modification. The phase spectrum is parametrically modeled as the response of an allpass (AP) filter, and the filter coefficients are estimated by considering the linear prediction (LP) residual as the output of the AP filter. The resultant residual signal, namely the AP residual, exhibits unambiguous peaks corresponding to epochs, which are chosen as pitch markers for prosody modification. This strategy efficiently removes the ambiguities associated with pitch marking required for the pitch synchronous overlap-add (PSOLA) method. Prosody modification using the AP residual is more advantageous than time-domain PSOLA (TD-PSOLA) on speech signals, as it introduces fewer distortions owing to the flat magnitude spectrum of the AP residual. Windowing centered around the unambiguous peaks in the AP residual is used for segmentation, followed by pitch/duration modification of the AP residual through mapping of pitch markers. The modified speech signal is obtained from the modified AP residual using synthesis filters. Mean opinion scores are used for performance evaluation of the proposed method, and it is observed that the AP residual-based method delivers performance equivalent to that of the LP residual based method using epochs, and better performance than linear prediction PSOLA (LP-PSOLA).
- Published
- 2016
37. Epoch Extraction by Phase Modelling of Speech Signals
- Author
- Vijayan, K and Kodukula, Sri Rama Murty
- Abstract
Epochs are instants of significant excitation of the vocal-tract system in the speech production process. In this paper, we attempt to extract information about epochs from the phase spectra of speech signals. The phase spectrum of speech is modelled as the response of an allpass (AP) filter, and the resulting error signal is used for epoch extraction. The parameters of the AP model are estimated by imposing sparsity constraints on the error signal. The error signal thus obtained exhibits prominent peaks at epoch locations. The epochal candidates obtained from the error signal are refined using a dynamic programming algorithm. The performance of the proposed method is consistent across genders and is comparable with state-of-the-art methods.
- Published
- 2016
38. Significance of analytic phase of speech signals in speaker verification
- Author
- Vijayan, K, Reddy, P R, and Kodukula, Sri Rama Murty
- Abstract
The objective of this paper is to establish the importance of the phase of the analytic signal of speech, referred to as the analytic phase, in human perception of speaker identity, as well as in automatic speaker verification. Subjective studies are conducted using analytic phase distorted speech signals, and the difficulties that arise in the human speaker verification task are observed. Motivated by the perceptual studies, we propose a method for feature extraction from the analytic phase of speech signals. As unambiguous computation of the analytic phase is not possible due to the phase wrapping problem, feature extraction is attempted from its derivative, i.e., the instantaneous frequency (IF). The IF is computed by exploiting the properties of the Fourier transform, and this strategy is free from the phase wrapping problem. The IF is computed from narrowband components of the speech signal, and the discrete cosine transform is applied to deviations in IF to pack the information into a smaller number of coefficients, which are referred to as IF cosine coefficients (IFCCs). The nature of the information in the proposed IFCC features is studied using minimal-pair ABX (MP-ABX) tasks and t-distributed stochastic neighbor embedding (t-SNE) visualizations. The performance of IFCC features is evaluated on the NIST 2010 SRE database and compared with mel frequency cepstral coefficients (MFCCs) and frequency domain linear prediction (FDLP) features. All three features, IFCC, FDLP and MFCC, provided competitive speaker verification performance, with average EERs of 2.3%, 2.2% and 2.4%, respectively. The IFCC features are more robust to vocal effort mismatch, and provided relative improvements of 26% and 11% over MFCC and FDLP features, respectively, on the evaluation conditions involving vocal effort mismatch. Since magnitude and phase represent different components of the speech signal, we have attempted to fuse the evidence from them at the i-vector level of the speaker verification system. It is found that the i-vec
- Published
- 2016
39. Experimental studies on effect of speaking mode on spoken term detection
- Author
- Rout, K, Reddy, P R, and Kodukula, Sri Rama Murty
- Abstract
The objective of this paper is to study the effect of speaking mode on a spoken term detection (STD) system. The experiments are conducted with query words recorded in an isolated manner and with words cut out from continuous speech. The durations of phonemes in query words vary greatly between these two modes. Hence the pattern matching stage, which takes care of temporal variations, plays a crucial role. Matching is done using subsequence dynamic time warping (DTW) on posterior features of the query and reference utterances, obtained by training a multilayer perceptron (MLP). The difference in performance of the STD system for different phoneme groupings (45, 25, 15 and 6 classes) is also analyzed. Our STD system is tested on Telugu broadcast news. A major difference in STD system performance is observed between recorded and cut-out query words: performance is better with query words cut out from continuous speech than with words recorded in an isolated manner. This performance difference can be attributed to the large temporal variations.
- Published
- 2015
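The matching stage of entry 39 above can be sketched with librosa's subsequence DTW; the posteriorgrams here are random placeholders, and the cosine metric is an assumption.

import numpy as np
import librosa

rng = np.random.default_rng(0)
query = rng.random((25, 40))        # (posterior classes, query frames)
reference = rng.random((25, 900))   # (posterior classes, reference frames)

# subseq=True lets the query match anywhere inside the reference utterance
D, wp = librosa.sequence.dtw(X=query, Y=reference, subseq=True, metric="cosine")
end = int(np.argmin(D[-1, :]))      # best-matching end frame in the reference
cost = D[-1, end]                   # low cost -> likely occurrence of the term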
40. Feature selection using Deep Neural Networks
- Author
- Ray, Debjyoti, Kodukula, Sri Rama Murty, and C, Krishna Mohan
- Abstract
Feature descriptors involved in video processing are generally high dimensional in nature. Even though the extracted features are high dimensional, the task at hand often depends only on a small subset of these features. For example, if two actions like running and walking have to be distinguished, extracting features related to the leg movement of the person is enough. Since this subset is not known a priori, we tend to use all the features, irrespective of the complexity of the task at hand. Selecting task-aware features may improve not only the efficiency but also the accuracy of the system. In this work, we propose a supervised approach for task-aware selection of features using deep neural networks (DNNs) in the context of action recognition. The activation potentials contributed by each of the individual input dimensions at the first hidden layer are used for selecting the most appropriate features. The selected features are found to give better classification performance than the original high-dimensional features. It is also shown that the classification performance of the proposed feature selection technique is superior to the low-dimensional representation obtained by principal component analysis (PCA).
- Published
- 2015
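A sketch of the selection rule, under the assumption that a feature's relevance is the average magnitude of the activation potential it contributes at the first hidden layer; the variable names are illustrative, and W1 stands for an already-trained first-layer weight matrix:

    import numpy as np

    def select_features(X, W1, k):
        # X: (n_samples, d_in) inputs; W1: (d_in, d_hidden) trained weights
        # contribution of dim i = mean over samples and hidden units of |x_i * w_ij|
        contrib = np.mean(np.abs(X[:, :, None] * W1[None, :, :]), axis=(0, 2))
        return np.argsort(contrib)[::-1][:k]  # indices of the k most relevant dims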
41. Analysis of features from analytic representation of speech using MP-ABX measures
- Author
-
Pappagari, R R, Vijayan, K, Kodukula, Sri Rama Murty, Pappagari, R R, Vijayan, K, and Kodukula, Sri Rama Murty
- Abstract
[Not available]
- Published
- 2015
42. Analysis of Phase Spectrum of Speech Signals Using Allpass Modeling
- Author
-
Vijayan, K, Kodukula, Sri Rama Murty, Vijayan, K, and Kodukula, Sri Rama Murty
- Abstract
The phase spectrum of the Fourier transform has received less prominence than its magnitude counterpart in speech processing. In this paper, we propose a method for parametric modeling of the phase spectrum, and discuss its applications in speech signal processing. The phase spectrum is modeled as the response of an allpass (AP) filter, whose coefficients are estimated from knowledge of the speech production process, especially the impulse-like nature of the excitation source. A signal retaining only the phase spectral component of speech is derived by suppressing the magnitude spectral component, and is modeled as the output of an AP filter excited with a sequence of impulses. The entropy of the energy of the input signal is minimized to estimate the coefficients of the AP filter; a sketch of this objective follows this entry. The resulting objective function, being nonconvex in nature, is minimized using particle swarm optimization. The group delay response of the estimated AP filters can be used for accurate analysis of resonances of the vocal-tract system (VTS). The error signal associated with AP modeling provides unambiguous evidence about the instants of significant excitation of the VTS. The applications of the proposed AP modeling include, but are not limited to, formant tracking, extraction of glottal closure instants, speaker verification and speech synthesis.
- Published
- 2015
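A sketch of the entropy-of-residual-energy objective that the swarm would minimise. This is illustrative only: the paper's exact filtering and normalisation details may differ, and the sketch assumes the trailing allpass coefficient is nonzero so that the inverse filter is well defined:

    import numpy as np
    from scipy.signal import lfilter

    def ap_residual(y, a):
        # Undo an allpass filter whose numerator is the time-reversed denominator:
        # inverse filtering swaps numerator and denominator
        return lfilter(a, a[::-1], y)

    def entropy_of_energy(e):
        p = np.clip(e ** 2 / np.sum(e ** 2), 1e-12, None)  # energy distribution
        return -np.sum(p * np.log(p))      # low entropy <=> impulse-like residual

    def objective(a_tail, y):
        a = np.concatenate(([1.0], a_tail))  # a particle encodes [a1, ..., ap]
        return entropy_of_energy(ap_residual(y, a))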
43. Epoch extraction from allpass residual estimated using orthogonal matching pursuit
- Author
-
K, Vijayan, Kodukula, Sri Rama Murty, K, Vijayan, and Kodukula, Sri Rama Murty
- Abstract
This paper presents an epoch extraction method based on the phase spectrum of speech signals. The phase spectrum of speech is modelled as the response of an allpass (AP) system, whose coefficients are estimated by imposing a sparsity constraint on the input signal. The AP residual is estimated using orthogonal matching pursuit, and the residual signal is found to exhibit prominent peaks at epoch locations and negligible values elsewhere; a sketch of the pursuit step follows this entry. A dynamic programming algorithm is employed to select, from the AP residual, the peaks corresponding to epoch locations. Epoch extraction performance of the proposed algorithm is evaluated on a subset of the CMU Arctic database. The AP-modelling-based epoch extraction method is compared with two state-of-the-art epoch identification techniques, namely DYPSA and the zero frequency resonator (ZFR). It is observed that the proposed algorithm delivers performance equivalent to ZFR for clean speech and significantly outperforms ZFR for telephone speech. The AP-modelling-based method also provided a slightly better epoch identification rate than DYPSA for clean as well as telephone speech.
- Published
- 2014
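An illustrative use of orthogonal matching pursuit to recover a sparse excitation: given an (assumed known) filter impulse response h, find a sparse input x with y ≈ Hx, where H is the convolution matrix of h. The dictionary construction below is a simplified stand-in for the paper's AP system matrix:

    import numpy as np
    from scipy.linalg import toeplitz
    from sklearn.linear_model import OrthogonalMatchingPursuit

    def sparse_residual(y, h, n_impulses):
        # y: observed frame; h: filter impulse response
        col = np.r_[h, np.zeros(len(y) - len(h))]
        H = toeplitz(col, np.r_[h[0], np.zeros(len(y) - 1)])  # convolution matrix
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_impulses)
        omp.fit(H, y)
        return omp.coef_   # nonzero entries should align with epoch locations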
44. Feature extraction from analytic phase of speech signals for speaker verification
- Author
-
Vijayan, K, Kumar, V, Kodukula, Sri Rama Murty, Vijayan, K, Kumar, V, and Kodukula, Sri Rama Murty
- Abstract
The objective of this work is to study the speaker-specific nature of the analytic phase of speech signals. Since computation of the analytic phase suffers from the phase wrapping problem, we have used its derivative, the instantaneous frequency, for feature extraction. The cepstral coefficients extracted from smoothed subband instantaneous frequencies (IFCC) are used as features for speaker verification. The performance of the IFCC features is evaluated on the NIST-2003 speaker recognition evaluation database and compared with baseline mel-frequency cepstral coefficients (MFCC). The performance of the IFCC features is observed to be comparable with the MFCC features in terms of equal error rates and minimum detection cost function values. Different strategies for evaluating the speaker verification performance of IFCC and MFCC are explored, and it is found that evaluation based on cosine similarity delivers better performance than the other strategies under consideration; a minimal scoring sketch follows this entry.
- Published
- 2014
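A minimal sketch of cosine-similarity scoring between utterance-level representations; the threshold is an illustrative placeholder, not a tuned operating point:

    import numpy as np

    def cosine_score(enroll, test):
        return float(enroll @ test /
                     (np.linalg.norm(enroll) * np.linalg.norm(test)))

    def verify(enroll, test, threshold=0.5):  # illustrative threshold
        return cosine_score(enroll, test) >= threshold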
45. Comparative Study of Spectral Mapping Techniques for Enhancement of Throat Microphone Speech
- Author
-
Vijayan, K, Kodukula, Sri Rama Murty, Vijayan, K, and Kodukula, Sri Rama Murty
- Abstract
The objective of this work is to study the suitability of existing spectral mapping methods for enhancement of throat microphone (TM) speech, and to propose a more elegant method for spectral mapping. Gaussian mixture models (GMM) and neural networks (NN) have been used for spectral mapping. Though GMM-based mapping captures the variability among speech sounds through multiple mixtures, it can only provide a linear map between the source and the target. On the other hand, NN-based mapping is capable of providing a nonlinear map, but a single mapping scheme may not handle variability across different speech sounds. Combining the advantages of these approaches, we propose a spectral mapping method using multiple neural networks: the speech data is clustered using the k-means algorithm, and a separate neural network is employed to capture the mapping within each cluster (a sketch follows this entry). Objective evaluation has shown that the proposed method is better than both the GMM-based and NN-based mapping schemes.
- Published
- 2014
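A sketch of the cluster-wise mapping under illustrative hyperparameters (cluster count, network size); X_tm holds TM spectral features and Y_ct the corresponding close-talk targets:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neural_network import MLPRegressor

    def train_clustered_maps(X_tm, Y_ct, n_clusters=8):
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_tm)
        nets = []
        for c in range(n_clusters):
            idx = km.labels_ == c          # frames assigned to cluster c
            net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500)
            nets.append(net.fit(X_tm[idx], Y_ct[idx]))
        return km, nets

    def map_spectrum(x, km, nets):
        c = int(km.predict(x[None, :])[0])  # route the frame to its cluster's net
        return nets[c].predict(x[None, :])[0]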
46. Estimation of Allpass Transfer Functions by Introducing Sparsity Constraints to Particle Swarm Optimization
- Author
-
Vijayan, K, Kodukula, Sri Rama Murty, Vijayan, K, and Kodukula, Sri Rama Murty
- Abstract
An algorithm to estimate allpass transfer functions by assuming sparsity of the input signals is proposed in this paper. As a tractable measure of sparsity, the l1 norm of the input signal is minimized, and the set of allpass coefficients realizing this minimization is obtained. The estimation of allpass systems with sparse inputs is a nonconvex problem, and hence a nonconvex optimization method, particle swarm optimization (PSO), is used. With PSO, a large number of uniformly chosen points in a d-dimensional problem space are guided towards an optimum solution with respect to the l1 norm of the input signal; a bare-bones PSO sketch follows this entry. Experimental results show that PSO is successful in estimating allpass transfer functions. The application of allpass filter estimation to speech processing is also studied, and results portraying the effectiveness of the proposed method are reported.
- Published
- 2014
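A bare-bones PSO sketch for such an l1-norm objective. The constants are illustrative; practical implementations add inertia schedules, velocity clamping and restarts:

    import numpy as np

    def pso_minimize(f, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5):
        rng = np.random.default_rng(0)
        x = rng.uniform(-1.0, 1.0, (n_particles, dim))   # particle positions
        v = np.zeros_like(x)                             # particle velocities
        pbest = x.copy()
        pbest_f = np.array([f(p) for p in x])
        g = pbest[np.argmin(pbest_f)].copy()             # global best position
        for _ in range(iters):
            r1, r2 = rng.random(x.shape), rng.random(x.shape)
            v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
            x = x + v
            fx = np.array([f(p) for p in x])
            better = fx < pbest_f
            pbest[better], pbest_f[better] = x[better], fx[better]
            g = pbest[np.argmin(pbest_f)].copy()
        return g

    # e.g. minimise the l1 norm of the allpass input recovered from a signal y,
    # reusing ap_residual from the earlier allpass sketch:
    # a_opt = pso_minimize(lambda t: np.abs(ap_residual(y, np.r_[1.0, t])).sum(),
    #                      dim=p)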
47. Query word retrieval from continuous speech using GMM posteriorgrams
- Author
-
Reddy, P R, Rout, K, Kodukula, Sri Rama Murty, Reddy, P R, Rout, K, and Kodukula, Sri Rama Murty
- Abstract
The objective of this work is to study the issues involved in building an automatic query word retrieval system for broadcast news in an unsupervised framework, i.e., without using any labelled speech data. In the absence of labelled data, the sequence of feature vectors extracted from the query word has to be matched with those extracted from the test utterance. This is a non-trivial task, as typical feature vectors like mel-frequency cepstral coefficients (MFCC) carry both speech-specific and speaker-specific information. In this work, we have employed Gaussian mixture models (GMM) to extract speaker-independent features from the speech signal. A Gaussian mixture model, trained on a large amount of speech data, is used to derive posterior features for each frame of the speech signal; a sketch of this extraction follows this entry. The sequences of posterior features are matched using a dynamic time warping algorithm to detect the presence of the query word in the test utterance. The performance of the proposed method is evaluated on a Telugu broadcast news database. It is observed that the posterior features extracted from the GMM are better suited for query word retrieval than the MFCC features.
- Published
- 2014
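A sketch of GMM posteriorgram extraction with an illustrative model size; the resulting frame-wise posterior vectors can then be matched with DTW (e.g., the subsequence-DTW sketch given under entry 39):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_gmm(mfccs, n_components=128):
        # mfccs: (n_frames, d) features pooled over many unlabelled speakers
        return GaussianMixture(n_components=n_components,
                               covariance_type='diag').fit(mfccs)

    def posteriorgram(gmm, mfccs):
        return gmm.predict_proba(mfccs)  # (n_frames, n_components); rows sum to 1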
48. Unsupervised spoken word retrieval using gaussian-bernoulli restricted Boltzmann machines
- Author
-
Reddy, P R, Nayak, S, Kodukula, Sri Rama Murty, Reddy, P R, Nayak, S, and Kodukula, Sri Rama Murty
- Abstract
The objective of this work is to explore a novel unsupervised framework, using restricted Boltzmann machines, for spoken word retrieval (SWR). In the absence of labelled speech data, SWR is typically performed by matching the sequences of feature vectors of the query and test utterances using dynamic time warping (DTW). In such a scenario, the performance of the SWR system critically depends on the representation of the speech signal. Typical features, like mel-frequency cepstral coefficients (MFCC), carry significant speaker-specific information, and hence may not be used directly in an SWR system. To overcome this issue, we propose to capture the joint density of the acoustic space spanned by MFCCs using a Gaussian-Bernoulli restricted Boltzmann machine (GBRBM), and use the hidden activations of the GBRBM as features for the SWR system (a sketch of the feature computation follows this entry). Since the GBRBM is trained with speech data collected from a large number of speakers, the hidden activations are more robust to speaker variability than MFCCs. The performance of the proposed features is evaluated on Telugu broadcast news data, and an absolute improvement of 12% is observed compared to MFCCs.
- Published
- 2014
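A sketch of how GBRBM hidden activations would serve as features, assuming an already-trained model with weights W, hidden biases b_h and per-dimension standard deviations sigma; the contrastive-divergence training itself is omitted:

    import numpy as np

    def gbrbm_features(X, W, b_h, sigma):
        # X: (n_frames, d) MFCCs; W: (d, n_hidden); sigma: (d,) visible-unit stds
        # p(h_j = 1 | v) = sigmoid(b_j + sum_i (v_i / sigma_i) * w_ij)
        return 1.0 / (1.0 + np.exp(-((X / sigma) @ W + b_h)))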
49. Epoch Extraction from Allpass Residual of Speech Signals
- Author
-
Vijayan, K, Kodukula, Sri Rama Murty, Vijayan, K, and Kodukula, Sri Rama Murty
- Abstract
Identification of epochs from speech signals is a prominent task in speech processing. In this paper, epoch extraction is attempted from the phase spectrum of speech signals. The phase spectrum of speech is modelled as an allpass (AP) filter by minimizing the entropy of the energy in the associated error signal. The AP residual thus obtained contains prominent, unambiguous peaks at epoch locations. These peaks in the AP residual constitute a set of candidate epoch locations, from which the appropriate ones are identified using a dynamic programming algorithm; a toy version of this selection step follows this entry. The proposed method is evaluated on a subset of the CMU Arctic database and is observed to deliver better epoch extraction performance than the prominent speech-event estimation method DYPSA. For telephone-channel speech, the proposed method also significantly outperformed the zero-frequency-resonator-based method.
- Published
- 2014
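A toy dynamic-programming selection of epochs from candidate peaks. The cost, which favours strong peaks and near-constant pitch periods, is illustrative and not the paper's exact formulation:

    import numpy as np

    def select_epochs(cands, strengths, t0, lam=1.0):
        # cands: sorted candidate sample indices; t0: expected pitch period (samples)
        n = len(cands)
        peak_cost = -np.asarray(strengths, dtype=float)  # reward strong peaks
        best = peak_cost.copy()
        prev = np.full(n, -1)
        for j in range(1, n):
            for i in range(j):
                dev = abs((cands[j] - cands[i]) - t0) / t0  # period smoothness
                c = best[i] + peak_cost[j] + lam * dev
                if c < best[j]:
                    best[j], prev[j] = c, i
        path, j = [], int(np.argmin(best))
        while j >= 0:
            path.append(cands[j])
            j = prev[j]
        return path[::-1]   # selected epoch locations in time order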
50. Novel speech duration modifier for packet based communication system
- Author
-
Mani, S K, Dhiman, J K, Kodukula, Sri Rama Murty, Mani, S K, Dhiman, J K, and Kodukula, Sri Rama Murty
- Abstract
In this paper, we propose a real-time method for duration modification of speech for packet-based communication systems. While there is a rich literature on duration modification, it fails to clearly address the issues in real-time implementation. Most duration modification methods rely on accurate estimation of pitch marks, which is not feasible in a real-time scenario. The proposed method modifies the duration of the linear prediction residual of individual frames without using any look-ahead delay or knowledge of pitch marks: multiples of the pitch period are repeated in, or removed from, a frame depending on a scheduling algorithm (a toy sketch follows this entry). The subjective quality of the proposed method was found to be better than both the waveform similarity overlap-and-add (WSOLA) technique and the linear prediction pitch-synchronous overlap-and-add (LP-PSOLA) technique.
- Published
- 2014
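A toy sketch of frame-wise duration modification by repeating or dropping whole pitch periods of the LP residual, with no look-ahead and no pitch marks; the scheduling algorithm and the LP analysis/synthesis around it are simplified away:

    import numpy as np

    def modify_frame(residual, t0, factor):
        # residual: one frame of LP residual; t0: pitch period in samples;
        # factor > 1 lengthens the frame, factor < 1 shortens it
        n = len(residual) // t0
        if n == 0:
            return residual                  # frame shorter than one period
        periods = [residual[i * t0:(i + 1) * t0] for i in range(n)]
        m = max(1, int(round(n * factor)))   # target number of pitch periods
        idx = np.linspace(0, n - 1, m).astype(int)  # repeat or drop whole periods
        tail = residual[n * t0:]             # leftover partial period kept as-is
        return np.concatenate([periods[i] for i in idx] + [tail])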