18 results for "Tan, Zheng-Hua"
Search Results
2. On Training Targets and Activation Functions for Deep Representation Learning in Text-Dependent Speaker Verification.
- Author
Sarkar, Achintya Kumar and Tan, Zheng-Hua
- Subjects
ARTIFICIAL neural networks, DEEP learning, DATABASES, ERROR rates, SUPERVISED learning, AUTOMATIC speech recognition
- Abstract
Deep representation learning has gained significant momentum in advancing text-dependent speaker verification (TD-SV) systems. When designing deep neural networks (DNN) for extracting bottleneck (BN) features, the key considerations include training targets, activation functions, and loss functions. In this paper, we systematically study the impact of these choices on the performance of TD-SV. For training targets, we consider speaker identity, time-contrastive learning (TCL), and auto-regressive prediction coding, with the first being supervised and the last two being self-supervised. Furthermore, we study a range of loss functions when speaker identity is used as the training target. With regard to activation functions, we study the widely used sigmoid function, rectified linear unit (ReLU), and Gaussian error linear unit (GELU). We experimentally show that GELU is able to reduce the error rates of TD-SV significantly compared to sigmoid, irrespective of the training target. Among the three training targets, TCL performs the best. Among the various loss functions, cross-entropy, joint-softmax, and focal loss functions outperform the others. Finally, the score-level fusion of different systems is also able to reduce the error rates. To evaluate the representation learning methods, experiments are conducted on the RedDots 2016 challenge database consisting of short utterances for TD-SV systems based on classic Gaussian mixture model-universal background model (GMM-UBM) and i-vector methods. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
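The activation functions compared in the abstract above can be written out directly. This is a minimal numpy sketch, not the authors' code; the tanh form of GELU is a standard approximation and not necessarily the exact variant used in the paper.

```python
import numpy as np

def sigmoid(x):
    # classic logistic activation
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # rectified linear unit: zero for negative inputs
    return np.maximum(0.0, x)

def gelu(x):
    # tanh approximation of the Gaussian error linear unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-3.0, 3.0, 7)
print(np.round(gelu(x), 3))
```

Unlike ReLU, GELU is smooth and non-zero for small negative inputs, which is one common intuition for the error-rate gains the abstract reports over sigmoid.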
3. rVAD: An unsupervised segment-based robust voice activity detection method.
- Author
Tan, Zheng-Hua, Sarkar, Achintya kr., and Dehak, Najim
- Subjects
VOICE analysis software, VOICE frequency, AUTOMATIC speech recognition, ROBUST control, INTONATION (Phonetics), VOICEPRINTS
- Abstract
• Proposed an unsupervised segment-based method for robust voice activity detection.
• Proposed modified rVAD that uses computationally fast spectral flatness calculation.
• Evaluated rVAD in terms of VAD performance using RATS and Aurora-2 databases.
• Evaluated rVAD in terms of speaker verification performance using RedDots 2016.
• rVAD showed favorable performance on various difficult tasks over existing methods.
This paper presents an unsupervised segment-based method for robust voice activity detection (rVAD). The method consists of two passes of denoising followed by a voice activity detection (VAD) stage. In the first pass, high-energy segments in a speech signal are detected using an a posteriori signal-to-noise ratio (SNR) weighted energy difference; if no pitch is detected within a segment, the segment is considered a high-energy noise segment and set to zero. In the second pass, the speech signal is denoised by a speech enhancement method, for which several methods are explored. Next, neighbouring frames with pitch are grouped together to form pitch segments, and based on speech statistics, the pitch segments are further extended from both ends in order to include both voiced and unvoiced sounds and likely non-speech parts as well. Finally, the a posteriori SNR weighted energy difference is applied to the extended pitch segments of the denoised speech signal for detecting voice activity. We evaluate the VAD performance of the proposed method using two databases, RATS and Aurora-2, which contain a large variety of noise conditions. The rVAD method is further evaluated, in terms of speaker verification performance, on the RedDots 2016 challenge database and its noise-corrupted versions. Experimental results show that rVAD compares favourably with a number of existing methods. In addition, we present a modified version of rVAD in which computationally intensive pitch extraction is replaced by computationally efficient spectral flatness calculation. The modified version significantly reduces the computational complexity at the cost of moderately inferior VAD performance, which is an advantage when processing a large amount of data and running on low-resource devices. The source code of rVAD is made publicly available. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
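The a posteriori SNR weighted energy difference at the core of the rVAD abstract above can be sketched in a few lines. This is an assumed simplification for illustration: the noise-floor estimate (mean of the lowest-energy frames), the `noise_frac` parameter and the thresholding rule are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def snr_weighted_energy_difference(frame_energies, noise_frac=0.1):
    e = np.asarray(frame_energies, dtype=float)
    # crude noise-floor estimate from the lowest-energy frames (assumption)
    n_noise = max(1, int(len(e) * noise_frac))
    noise_energy = np.mean(np.sort(e)[:n_noise]) + 1e-12
    post_snr = e / noise_energy            # a posteriori SNR per frame
    diff = np.abs(np.diff(e))              # energy difference between neighbours
    return diff * np.sqrt(post_snr[1:])    # weight differences by frame SNR

def high_energy_segments(frame_energies, threshold):
    # True where a likely speech/high-energy onset or offset occurs
    return snr_weighted_energy_difference(frame_energies) > threshold

energies = [0.1, 0.1, 5.0, 6.0, 0.1, 0.1]
print(high_energy_segments(energies, threshold=1.0))
```

In the paper, segments flagged this way are kept only if pitch is detected within them; otherwise they are treated as high-energy noise and zeroed.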
4. Guided spectrogram filtering for speech dereverberation.
- Author
Zheng, Chengshi, Tan, Zheng-Hua, Peng, Renhua, and Li, Xiaodong
- Subjects
ACOUSTIC vibrations, AUDITORY pathways, SOUND reverberation, SPEECH processing systems, AUTOMATIC speech recognition
- Abstract
Guided filtering is a computationally efficient and powerful technique used in image processing applications such as edge-preserving smoothing, detail enhancement and single-image dehazing. In this paper, we propose a novel single-channel speech dereverberation method using guided spectrogram filtering, treating a speech spectrogram as an image. The proposed method requires neither room acoustic parameter estimation nor late reverberant spectral variance estimation. Objective test results show the validity of the guided spectrogram filtering method for speech dereverberation. Compared with state-of-the-art speech dereverberation methods, the proposed method performs better in terms of perceptual evaluation of speech quality (PESQ), speech-to-reverberation modulation energy ratio (SRMR) and short-time objective intelligibility (STOI) in most cases. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
5. A perceptually motivated LP residual estimator in noisy and reverberant environments.
- Author
Peng, Renhua, Tan, Zheng-Hua, Li, Xiaodong, and Zheng, Chengshi
- Subjects
AUTOMATIC speech recognition, ADDITIVE white Gaussian noise, SINGULAR value decomposition, PERFORMANCE evaluation, SIGNAL filtering
- Abstract
Both reverberation and additive noise can degrade the quality of recorded speech and thus should be suppressed simultaneously. Previous studies have shown that the generalized singular value decomposition (GSVD) is capable of suppressing additive noise effectively, but it is not often applied to speech dereverberation, since reverberation is considered to be convolutive as well as colored noise. Recently, we revealed that late reverberation is also an additive and relatively white interference component in the linear prediction (LP) residual domain. To suppress both late reverberation and additive noise, we have proposed an optimal filter for LP residual estimation (LPRE) based on a constrained minimum mean square error (CMMSE) criterion using GSVD in single-channel speech enhancement, referred to as CMMSE-GSVD-LPRE. Experimental results have shown better performance of CMMSE-GSVD-LPRE than spectral subtraction methods, but some residual noise and reverberation components remain audible and annoying. To solve this problem, this paper incorporates the masking properties of the human auditory system in the LP residual domain to further suppress these residual noise and reverberation components while reducing speech distortion at the same time. Various simulation experiments are conducted, and the results show an improved performance of the proposed algorithm. Experimental results with speech recorded in noisy and reverberant environments further confirm the effectiveness of the proposed algorithm in real-world environments. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
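The LP residual domain in which the abstract above operates is easy to construct. This is an assumed simplification: LP coefficients are fitted by least squares over the whole signal (rather than frame-wise Levinson-Durbin), and the order is illustrative.

```python
import numpy as np

def lp_residual(x, order=8):
    x = np.asarray(x, dtype=float)
    # delayed-sample matrix: predict x[n] from x[n-1] .. x[n-order]
    rows = [x[i:i + order][::-1] for i in range(len(x) - order)]
    A = np.array(rows)
    y = x[order:]
    a, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ a          # prediction-error (LP residual) signal
    return residual, a

t = np.arange(200)
x = np.sin(0.1 * np.pi * t)       # highly predictable signal
res, coeffs = lp_residual(x)
print(float(np.max(np.abs(res))))  # near zero for a pure sinusoid
```

A pure sinusoid is almost perfectly predictable, so its residual is essentially zero; the paper's point is that late reverberation survives this whitening as an additive, relatively white component that GSVD-based filtering can then attack.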
6. Incorporating pass-phrase dependent background models for text-dependent speaker verification.
- Author
Sarkar, Achintya Kumar and Tan, Zheng-Hua
- Subjects
PHRASE structure grammar, ORATORS, LIKELIHOOD ratio tests, HIDDEN Markov models, AUTOMATIC speech recognition
- Abstract
In this paper, we propose pass-phrase dependent background models (PBMs) for text-dependent (TD) speaker verification (SV) to integrate the pass-phrase identification process into the conventional TD-SV system, where a PBM is derived from a text-independent background model through adaptation using the utterances of a particular pass-phrase. During training, pass-phrase specific target speaker models are derived from the particular PBM using the training data for the respective target model. During testing, the best PBM is first selected for the test utterance in the maximum likelihood (ML) sense, and the selected PBM is then used for the log likelihood ratio (LLR) calculation with respect to the claimant model. The proposed method incorporates the pass-phrase identification step into the LLR calculation, which is not considered in conventional standalone TD-SV systems. The performance of the proposed method is compared to conventional text-independent background model based TD-SV systems using either the Gaussian mixture model (GMM)-universal background model (UBM), hidden Markov model (HMM)-UBM or i-vector paradigm. In addition, we consider two approaches to building PBMs: speaker-independent and speaker-dependent. We show that the proposed method significantly reduces the error rates of text-dependent speaker verification for the non-target types target-wrong and impostor-wrong, while maintaining TD-SV performance comparable to the conventional system when impostors speak a correct utterance. Experiments are conducted on the RedDots challenge and the RSR2015 databases, which consist of short utterances. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
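The scoring scheme in the abstract above, ML selection of a PBM followed by an LLR against the claimant model, can be sketched with toy models. Single diagonal Gaussians stand in for the paper's GMM/HMM models, and all data here is synthetic; none of the names below come from the paper.

```python
import numpy as np

def diag_gauss_loglik(X, mean, var):
    # total log-likelihood of the frames in X under a diagonal Gaussian
    return float(np.sum(-0.5 * (np.log(2.0 * np.pi * var) + (X - mean) ** 2 / var)))

def llr_with_pbm(X, target, pbms):
    # select the best PBM in the ML sense, then score against it
    best = max(pbms, key=lambda m: diag_gauss_loglik(X, m['mean'], m['var']))
    return diag_gauss_loglik(X, target['mean'], target['var']) - \
           diag_gauss_loglik(X, best['mean'], best['var'])

rng = np.random.default_rng(0)
pbm_a = {'mean': np.zeros(2), 'var': np.ones(2)}        # pass-phrase A background
pbm_b = {'mean': 5.0 * np.ones(2), 'var': np.ones(2)}   # pass-phrase B background
target = {'mean': np.ones(2), 'var': np.ones(2)}        # speaker model adapted from pbm_a
X = rng.normal(1.0, 1.0, size=(50, 2))                  # genuine-trial frames
score = llr_with_pbm(X, target, [pbm_a, pbm_b])
print(score > 0)                                        # genuine trial: positive LLR
```

Because the PBM is chosen per test utterance, the denominator of the LLR already encodes a pass-phrase decision, which is the integration the paper argues for.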
7. Speech Recognition on Mobile Devices
- Author
Tan, Zheng-Hua and Lindberg, Børge
- Subjects
automatic speech recognition, text entry, mobile device
- Abstract
The enthusiasm for deploying automatic speech recognition (ASR) on mobile devices is driven both by remarkable advances in ASR technology and by the demand for efficient user interfaces on such devices as mobile phones and personal digital assistants (PDAs). This chapter presents an overview of ASR in the mobile context, covering motivations, challenges, fundamental techniques and applications. Three ASR architectures are introduced: embedded speech recognition, distributed speech recognition and network speech recognition. Their pros and cons and implementation issues are discussed. Applications within command and control, text entry and search are presented, with an emphasis on mobile text entry.
- Published
- 2010
8. Variable Frame Rate Analysis for Automatic Speech Recognition
- Author
Tan, Zheng-Hua
- Subjects
Automatic speech recognition, Variable frame rate analysis
- Published
- 2007
9. Joint variable frame rate and length analysis for speech recognition under adverse conditions.
- Author
Tan, Zheng-Hua and Kraljevski, Ivan
- Subjects
AUTOMATIC speech recognition, SIGNAL-to-noise ratio, ROBUST control, DIGITAL signal processing, SPEECH processing systems
- Abstract
This paper presents a method that combines variable frame length and rate analysis for speech recognition in noisy environments, together with an investigation of the effect of different frame lengths on speech recognition performance. The method adopts frame selection using an a posteriori signal-to-noise ratio (SNR) weighted energy distance and increases the length of the selected frames according to the number of non-selected preceding frames. It assigns a higher frame rate and a normal frame length to rapidly changing, high-SNR regions of a speech signal, and a lower frame rate and an increased frame length to steady or low-SNR regions. The speech recognition results show that the proposed variable frame rate and length method outperforms fixed frame rate and length analysis, as well as standalone variable frame rate analysis, in terms of noise robustness. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
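The frame-selection rule described in the abstract above can be sketched as an accumulator: the SNR-weighted energy distance is summed frame by frame, and a frame is emitted when the accumulator crosses a threshold, so fast-changing, high-SNR regions yield more frames than steady or low-SNR regions. The noise-floor estimate and threshold here are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def select_frames(energies, threshold):
    e = np.asarray(energies, dtype=float)
    noise = np.min(e) + 1e-12              # crude noise-floor estimate (assumption)
    acc, selected = 0.0, []
    for i in range(1, len(e)):
        # a posteriori SNR weighted energy distance to the previous frame
        acc += abs(e[i] - e[i - 1]) * np.sqrt(e[i] / noise)
        if acc >= threshold:
            selected.append(i)             # keep this frame
            acc = 0.0                      # reset the accumulator
    return selected

energies = [1.0, 1.0, 1.0, 9.0, 16.0, 1.0, 1.0, 1.0]
print(select_frames(energies, threshold=5.0))   # frames kept around the burst
```

The paper's length half of the method would then widen the analysis window of a selected frame in proportion to how many preceding frames were skipped.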
10. A Joint Approach for Single-Channel Speaker Identification and Speech Separation.
- Author
Mowlaee, Pejman, Saeidi, Rahim, Christensen, Mads Græsbøll, Tan, Zheng-Hua, Kinnunen, Tomi, Franti, Pasi, and Jensen, Søren Holdt
- Subjects
AUTOMATIC speech recognition, SPEECH processing systems, MATHEMATICAL models, HIDDEN Markov models, SPEECH coding, PARAMETER estimation, SIGNAL-to-noise ratio, ALGORITHMS
- Abstract
In this paper, we present a novel system for joint speaker identification and speech separation. For speaker identification, a single-channel speaker identification algorithm is proposed which provides an estimate of the signal-to-signal ratio (SSR) as a by-product. For speech separation, we propose a sinusoidal model-based algorithm. The speech separation algorithm consists of a double-talk/single-talk detector followed by a minimum mean square error estimator of sinusoidal parameters for finding optimal codevectors from pre-trained speaker codebooks. In evaluating the proposed system, we start from a situation where we have prior information of codebook indices, speaker identities and SSR level, and then, by relaxing these assumptions one by one, we demonstrate the efficiency of the proposed fully blind system. In contrast to previous studies that mostly focus on automatic speech recognition (ASR) accuracy, here we report objective and subjective results as well. The results show that the proposed system performs as well as the best of the state-of-the-art in terms of perceived quality, while its speaker identification and ASR results are generally lower. It outperforms the state-of-the-art in terms of intelligibility, showing that the ASR results are not conclusive. The proposed method achieves, on average, 52.3% ASR accuracy, 41.2 points in MUSHRA and 85.9% in speech intelligibility. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
- Full Text
- View/download PDF
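The sinusoidal model underlying the separation stage in the abstract above represents a frame by a small set of (amplitude, frequency, phase) triples. A minimal sketch: pick the K strongest DFT bins and resynthesize. The paper instead fits such parameters via an MMSE search over pre-trained speaker codebooks; peak picking here is only an illustration, and the scaling ignores the DC/Nyquist special cases.

```python
import numpy as np

def sinusoidal_params(frame, k=3):
    spec = np.fft.rfft(frame)
    idx = np.argsort(np.abs(spec))[-k:]        # K strongest bins
    amps = 2.0 * np.abs(spec[idx]) / len(frame)
    freqs = idx / len(frame)                   # cycles per sample
    phases = np.angle(spec[idx])
    return amps, freqs, phases

def resynthesize(amps, freqs, phases, n):
    # sum of cosines defined by the extracted parameter triples
    t = np.arange(n)
    return sum(a * np.cos(2.0 * np.pi * f * t + p)
               for a, f, p in zip(amps, freqs, phases))

n = 256
frame = np.cos(2.0 * np.pi * 10.0 / n * np.arange(n))
amps, freqs, phases = sinusoidal_params(frame, k=1)
rec = resynthesize(amps, freqs, phases, n)
print(float(np.max(np.abs(rec - frame))))      # near zero for a single sinusoid
```

For mixed speech, each speaker's codebook constrains which parameter sets are plausible, which is what lets the MMSE estimator attribute sinusoids to speakers.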
11. Automatic speech recognition over error-prone wireless networks
- Author
Tan, Zheng-Hua, Dalsgaard, Paul, and Lindberg, Børge
- Subjects
AUTOMATIC speech recognition, COMPUTER input-output equipment, SPEECH perception, WIRELESS communications
- Abstract
The past decade has witnessed a growing interest in deploying automatic speech recognition (ASR) in communication networks. Networks such as wireless networks present a number of challenges due to, e.g., bandwidth constraints and transmission errors. The introduction of distributed speech recognition (DSR) largely eliminates the bandwidth limitations, and the presence of transmission errors becomes the key robustness issue. This paper reviews the techniques that have been developed for ASR robustness against transmission errors. In the paper, a model of network degradations and robustness techniques is presented. These techniques are classified into three categories: error detection, error recovery and error concealment (EC). A one-frame error detection scheme is described and compared with a frame-pair scheme. As opposed to vector-level techniques, a technique for error detection and EC at the sub-vector level is presented. A number of error recovery techniques, such as forward error correction and interleaving, are discussed, in addition to a review of both feature-reconstruction and ASR-decoder based EC techniques. To enable the comparison of some of these techniques, evaluation has been conducted on the basis of the same speech database and channel. Special attention is given to the unique characteristics of DSR as compared to streaming audio, e.g. voice-over-IP. Additionally, a technique for adapting ASR to the varying quality of networks is presented. The frame error rate is used here to adjust the discrimination threshold with the goal of optimising out-of-vocabulary detection. This paper concludes with a discussion of the applicability of different techniques based on the channel characteristics and the system requirements. [Copyright Elsevier]
- Published
- 2005
- Full Text
- View/download PDF
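One simple feature-reconstruction EC strategy of the kind reviewed in the abstract above is to interpolate lost feature vectors between the nearest correctly received neighbours. This sketch is an illustration of that idea only, not a specific scheme from the paper; the frame data is synthetic.

```python
import numpy as np

def conceal(features, lost):
    f = np.array(features, dtype=float)
    lost_set = set(lost)
    good = [i for i in range(len(f)) if i not in lost_set]
    for i in lost:
        prev = max((g for g in good if g < i), default=None)
        nxt = min((g for g in good if g > i), default=None)
        if prev is None:
            f[i] = f[nxt]                  # leading loss: repeat next good frame
        elif nxt is None:
            f[i] = f[prev]                 # trailing loss: repeat previous good frame
        else:
            w = (i - prev) / (nxt - prev)  # linear interpolation weight
            f[i] = (1.0 - w) * f[prev] + w * f[nxt]
    return f

feats = [[0.0], [0.0], [0.0], [3.0]]       # one-dimensional "feature" vectors
repaired = conceal(feats, lost=[1, 2])
print(repaired.ravel())
```

Repetition and interpolation are the cheapest end of the EC spectrum; the ASR-decoder-based techniques the paper reviews instead weight unreliable frames inside the recogniser.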
12. Speech Recognition in Mobile Phones
- Author
Varga, Imre, Kiss, Imre, Singh, Sameer, editor, Tan, Zheng-Hua, and Lindberg, Børge
- Published
- 2008
- Full Text
- View/download PDF
13. Speech Recognition Over IP Networks
- Author
Kim, Hong Kook, Singh, Sameer, editor, Tan, Zheng-Hua, and Lindberg, Børge
- Published
- 2008
- Full Text
- View/download PDF
14. Error Concealment
- Author
Haeb-Umbach, Reinhold, Ion, Valentin, Singh, Sameer, editor, Tan, Zheng-Hua, and Lindberg, Børge
- Published
- 2008
- Full Text
- View/download PDF
15. Fixed-Point Arithmetic
- Author
Bocchieri, Enrico, Singh, Sameer, editor, Tan, Zheng-Hua, and Lindberg, Børge
- Published
- 2008
- Full Text
- View/download PDF
16. Speech Recognition Over Mobile Networks
- Author
Kim, Hong Kook, Rose, Richard C., Singh, Sameer, editor, Tan, Zheng-Hua, and Lindberg, Børge
- Published
- 2008
- Full Text
- View/download PDF
17. Speech Coding and Packet Loss Effects on Speech and Speaker Recognition
- Author
Besacier, Laurent, Singh, Sameer, editor, Tan, Zheng-Hua, and Lindberg, Børge
- Published
- 2008
- Full Text
- View/download PDF
18. Automatic speech recognition on mobile devices and over communication networks
- Author
Tan, Zheng-Hua and Lindberg, Børge
- Subjects
Communication networks, Automatic speech recognition, Mobile devices
- Published
- 2008