Author: "Jan Cernocky" - Searchworks@Jio Institute Digital Library Search Results

1. Analysis of Speaker Diarization Based on Bayesian HMM With Eigenvoice Priors

Author: Jan Cernocky, Federico Landini, Mireia Diez, and Lukas Burget
Subjects: Acoustics and Ultrasonics, Computer science, Speech recognition, Bayesian probability, Probabilistic logic, Inference, Speech processing, Speaker diarisation, 030507 speech-language pathology & audiology, 03 medical and health sciences, Computational Mathematics, Local optimum, Computer Science::Sound, Prior probability, Computer Science (miscellaneous), Electrical and Electronic Engineering, 0305 other medical science, Hidden Markov model
Abstract: In our previous work, we introduced our Bayesian Hidden Markov Model with eigenvoice priors, which has been recently recognized as the state-of-the-art model for Speaker Diarization. In this article we present a more complete analysis of the Diarization system. The inference of the model is fully described and derivations of all update formulas are provided for a complete understanding of the algorithm. An extensive analysis of the effect, sensitivity and interactions of all model parameters is provided, which might be used as a guide for their optimal setting. The newly introduced speaker regularization coefficient allows us to control the number of speakers inferred in an utterance. A naive speaker model merging strategy is also presented, which allows driving the variational inference out of local optima. Experiments for the different diarization scenarios are presented on CALLHOME and DIHARD datasets.
Published: 2020

2. DPCCN: Densely-Connected Pyramid Complex Convolutional Network for Robust Speech Separation And Extraction

Author: Jiangyu Han, Yanhua Long, Lukas Burget, and Jan Cernocky
Subjects: Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In recent years, a number of time-domain speech separation methods have been proposed. However, most of them are very sensitive to the environments and wide domain coverage tasks. In this paper, from the time-frequency domain perspective, we propose a densely-connected pyramid complex convolutional network, termed DPCCN, to improve the robustness of speech separation under complicated conditions. Furthermore, we generalize the DPCCN to target speech extraction (TSE) by integrating a new specially designed speaker encoder. Moreover, we also investigate the robustness of DPCCN to unsupervised cross-domain TSE tasks. A Mixture-Remix approach is proposed to adapt the target domain acoustic characteristics for fine-tuning the source model. We evaluate the proposed methods not only under noisy and reverberant in-domain conditions, but also in clean but cross-domain conditions. Results show that for both speech separation and extraction, the DPCCN-based systems achieve significantly better performance and robustness than the currently dominating time-domain methods, especially for the cross-domain tasks. Particularly, we find that the Mixture-Remix fine-tuning with DPCCN significantly outperforms the TD-SpeakerBeam for unsupervised cross-domain TSE, with around 3.5 dB SISNR improvement on target domain test set, without any source domain performance degradation. Accepted by ICASSP 2022.
Published: 2021

3. Analysis of X-Vectors for Low-Resource Speech Recognition

Author: Jan Profant, Jiri Nytra, Martin Karafiat, Jan Cernocky, Tomas Pavlicek, Miroslav Hlavacek, and Karel Vesely
Subjects: Signal processing, Artificial neural network, Low resource, Computer science, Speech recognition, Usability, Speaker recognition, Robustness (computer science), Pashto, Adaptation (computer science)
Abstract: The paper presents a study of the usability of x-vectors for adaptation of automatic speech recognition (ASR) systems. X-vectors are Neural Network (NN)-based speaker embeddings recently proposed in speaker recognition (SR). They quickly replaced common i-vectors and became the new state-of-the-art technique. Here, the same approach is adopted for ASR with the hope of a similar outcome. All experiments were done on ASR for the latest IARPA MATERIAL evaluation running on the Pashto language. Over 1% absolute improvement was observed with x-vectors over traditional i-vectors, even when the x-vector extractor was not trained on target Pashto data.
Published: 2021

4. SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures

Author: Tsubasa Ochiai, Tomohiro Nakatani, Katerina Zmolikova, Jan Cernocky, Marc Delcroix, Keisuke Kinoshita, and Lukas Burget
Subjects: Artificial neural network, Computer science, Speech recognition, Deep learning, 020206 networking & telecommunications, 02 engineering and technology, Speech processing, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, Noise (video), Artificial intelligence, Electrical and Electronic Engineering, Hidden Markov model, Adaptation (computer science), Representation (mathematics), Utterance
Abstract: The processing of speech corrupted by interfering overlapping speakers is one of the challenging problems with regards to today's automatic speech recognition systems. Recently, approaches based on deep learning have made great progress toward solving this problem. Most of these approaches tackle the problem as speech separation, i.e., they blindly recover all the speakers from the mixture. In some scenarios, such as smart personal devices, we may however be interested in recovering one target speaker from a mixture. In this paper, we introduce SpeakerBeam, a method for extracting a target speaker from the mixture based on an adaptation utterance spoken by the target speaker. Formulating the problem as speaker extraction avoids certain issues such as label permutation and the need to determine the number of speakers in the mixture. With SpeakerBeam, we jointly learn to extract a representation from the adaptation utterance characterizing the target speaker and to use this representation to extract the speaker. We explore several ways to do this, mostly inspired by speaker adaptation in acoustic models for automatic speech recognition. We evaluate the performance on the widely used WSJ0-2mix and WSJ0-3mix datasets, and these datasets modified with more noise or more realistic overlapping patterns. We further analyze the learned behavior by exploring the speaker representations and assessing the effect of the length of the adaptation data. The results show the benefit of including speaker information in the processing and the effectiveness of the proposed method.
Published: 2019

5. EAT: Enhanced ASR-TTS for Self-supervised Speech Recognition

Author: Shinji Watanabe, Jan Cernocky, Ramón Fernandez Astudillo, Lukas Burget, and Murali Karthick Baskar
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Computer science, Computer Science - Artificial Intelligence, Speech recognition, Context (language use), Performance gap, Computer Science - Sound, Data modeling, Machine Learning (cs.LG), Speech enhancement, Self supervision, Artificial Intelligence (cs.AI), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Language model, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Self-supervised ASR-TTS models suffer in out-of-domain data conditions. Here we propose an enhanced ASR-TTS (EAT) model that incorporates two main features: 1) The ASR→TTS direction is equipped with a language model reward to penalize the ASR hypotheses before forwarding it to TTS. 2) In the TTS→ASR direction, a hyper-parameter is introduced to scale the attention context from synthesized speech before sending it to ASR to handle out-of-domain data. Training strategies and the effectiveness of the EAT model are explored under out-of-domain data conditions. The results show that EAT reduces the performance gap between supervised and self-supervised training significantly by absolute 2.6% and 2.7% on Librispeech and BABEL respectively.
Published: 2021

6. Detecting English Speech in the Air Traffic Control Voice Communication

Author: Santosh Kesiraju, Ondrej Novotny, Martin Kocour, Igor Szöke, Karel Vesely, and Jan Cernocky
Subjects: Data processing, Computer science, Speech recognition, Bayesian probability, 020206 networking & telecommunications, 02 engineering and technology, Air traffic control, Pipeline (software), Reduction (complexity), 030507 speech-language pathology & audiology, 03 medical and health sciences, Audio and Speech Processing (eess.AS), 0202 electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, 0305 other medical science, Publication, Subspace topology, Word (computer architecture), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We launched a community platform for collecting ATC speech worldwide in the ATCO2 project. Filtering out unseen non-English speech is one of the main components in the data processing pipeline. The proposed English Language Detection (ELD) system is based on the embeddings from a Bayesian subspace multinomial model. It is trained on the word confusion network from an ASR system. It is robust, easy to train, and lightweight. We achieved an equal error rate (EER) of 0.0439, a 50% relative reduction as compared to the state-of-the-art acoustic ELD system based on x-vectors, in the in-domain scenario. Further, we achieved an EER of 0.1352, a 33% relative reduction as compared to the acoustic ELD, in the unseen language (out-of-domain) condition. We plan to publish the evaluation dataset from the ATCO2 project.
Published: 2021

7. Integration of Variational Autoencoder and Spatial Clustering for Adaptive Multi-Channel Neural Speech Separation

Author: Jan Cernocky, Marc Delcroix, Tomohiro Nakatani, Katerina Zmolikova, and Lukas Burget
Subjects: Factorial, Noise measurement, Artificial neural network, Computer science, Speech coding, 020206 networking & telecommunications, Pattern recognition, 02 engineering and technology, Mixture model, Autoencoder, 030507 speech-language pathology & audiology, 03 medical and health sciences, Noise, Discriminative model, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Artificial intelligence, 0305 other medical science, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we propose a method combining a variational autoencoder model of speech with a spatial clustering approach for multi-channel speech separation. The advantage of integrating spatial clustering with a spectral model was shown in several works. As the spectral model, previous works used either factorial generative models of the mixed speech or discriminative neural networks. In our work, we combine the strengths of both approaches by building a factorial model based on a generative neural network, a variational autoencoder. By doing so, we can exploit the modeling power of neural networks, but at the same time, keep a structured model. Such a model can be advantageous when adapting to new noise conditions as only the noise part of the model needs to be modified. We show experimentally that our model significantly outperforms the previous factorial model based on a Gaussian mixture model (DOLPHIN), performs comparably to integration of permutation invariant training with spatial clustering, and enables us to easily adapt to new noise conditions. The code for the method is available at https://github.com/BUTSpeechFIT/vae_dolphin. Comment: 8 pages, 3 figures, to be published in SLT2021
Published: 2021

8. Investigation of SpecAugment for Deep Speaker Embedding Learning

Author: Kai Yu, Jan Cernocky, Johan Rohdin, Oldrich Plchot, Lukas Burget, and Shuai Wang
Subjects: Masking (art), Computer science, Time delay neural network, Speech recognition, 020206 networking & telecommunications, 02 engineering and technology, Set (abstract data type), 030507 speech-language pathology & audiology, 03 medical and health sciences, Softmax function, 0202 electrical engineering, electronic engineering, information engineering, Embedding, NIST, 0305 other medical science
Abstract: SpecAugment is a newly proposed data augmentation method for speech recognition. By randomly masking bands in the log Mel spectrogram, this method leads to impressive performance improvements. In this paper, we investigate the usage of SpecAugment for speaker verification tasks. Two different models, namely 1-D convolutional TDNN and 2-D convolutional ResNet34, trained with either Softmax or AAM-Softmax loss, are used to analyze SpecAugment’s effectiveness. Experiments are carried out on the VoxCeleb and NIST SRE 2016 datasets. By applying SpecAugment to the original clean data in an on-the-fly manner without complex off-line data augmentation methods, we obtained 3.72% and 11.49% EER for NIST SRE 2016 Cantonese and Tagalog, respectively. For the VoxCeleb1 evaluation set, we obtained 1.47% EER.
Published: 2020

9. A Hierarchical Subspace Model for Language-Attuned Acoustic Unit Discovery

Author: Murat Saraclar, Lucas Ondel, Jan Cernocky, Lukas Burget, and Bolaji Yusuf
Affiliations: Faculty of Information Technology [Brno] (FIT / BUT), Brno University of Technology [Brno] (BUT); BUSIM Speech Group, Department of Electrical and Electronic Engineering [Istanbul], Boğaziçi University [Istanbul]; Traitement du Langage Parlé (TLP), Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), CentraleSupélec, Université Paris-Saclay, Centre National de la Recherche Scientifique (CNRS); Sciences et Technologies des Langues (STL); Institut National de Recherche en Informatique et en Automatique (Inria)
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Computer science, Speech recognition, TIMIT, 02 engineering and technology, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], Computer Science - Sound, Machine Learning (cs.LG), Set (abstract data type), [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Audio and Speech Processing (eess.AS), 0202 electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, Cluster analysis, Hidden Markov model, ComputingMilieux_MISCELLANEOUS, Frame (networking), Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), Speech processing, ComputingMethodologies_PATTERNRECOGNITION, Computer Science::Sound, Embedding, 020201 artificial intelligence & image processing, Subspace topology, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this work, we propose a hierarchical subspace model for acoustic unit discovery. In this approach, we frame the task as one of learning embeddings on a low-dimensional phonetic subspace, and simultaneously specify the subspace itself as an embedding on a hyper-subspace. We train the hyper-subspace on a set of transcribed languages and transfer it to the target language. In the target language, we infer both the language and unit embeddings in an unsupervised manner, and in so doing, we simultaneously learn a subspace of units specific to that language and the units that dwell on it. We conduct our experiments on TIMIT and two low-resource languages: Mboshi and Yoruba. Results show that our model outperforms major acoustic unit discovery techniques, both in terms of clustering quality and segmentation accuracy. Comment: Submitted to ICASSP 2021
Published: 2020

10. Speaker Verification with Application-Aware Beamforming

Author: Oldrich Plchot, Ladislav Mosner, Jan Cernocky, Lukas Burget, and Johan Rohdin
Subjects: Speech enhancement, Beamforming, Spatial filter, Artificial neural network, Computer science, Speech recognition, Embedding, Data_CODINGANDINFORMATIONTHEORY, Speech processing, Function (engineering), Eigendecomposition of a matrix
Abstract: Multichannel speech processing applications usually employ beamformers as a means of speech enhancement through spatial filtering. Beamformers with learnable parameters require training to minimize a loss function that is not necessarily correlated with the final objective. In this paper, we present a framework employing a recent neural network based generalized eigenvalue beamformer and an application-specific model that allows for optimization of the beamformer w.r.t. the target application. In our case, the application is speaker verification, which utilizes a speaker embedding (x-vector) extractor that conveniently comes with a desired loss. We show that application-specific training of the beamformer brings performance improvements over a system trained in the standard way. We perform our analysis on the recently introduced VOiCES corpus which contains multichannel data and allows us to modify the evaluation trials such that enrollment recordings remain single-channel and test utterances are multichannel.
Published: 2019

11. A Multi Purpose and Large Scale Speech Corpus in Persian and English for Speaker and Speech Recognition: The DeepMine Database

Author: Jan Cernocky, Hossein Zeinali, and Lukas Burget
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Speaker verification, Artificial neural network, Database, Computer science, Speech recognition, Text independent, Speech corpus, Computer Science - Sound, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Hidden Markov model, Scale (map), Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing, Persian
Abstract: DeepMine is a speech database in Persian and English designed to build and evaluate text-dependent, text-prompted, and text-independent speaker verification, as well as Persian speech recognition systems. It contains more than 1850 speakers and 540 thousand recordings overall; more than 480 hours of speech are transcribed. It is the first public large-scale speaker verification database in Persian, the largest public text-dependent and text-prompted speaker verification database in English, and the largest public evaluation dataset for text-independent speaker verification. It has a good coverage of age, gender, and accents. We provide several evaluation protocols for each part of the database to allow for research on different aspects of speaker verification. We also provide the results of several experiments that can be considered as baselines: HMM-based i-vectors for text-dependent speaker verification, and HMM-based as well as state-of-the-art deep neural network based ASR. We demonstrate that the database can serve for training robust ASR models.
Published: 2019

12. Building and Evaluation of a Real Room Impulse Response Dataset

Author: Jakub Paliesek, Igor Szöke, Miroslav Skácel, Jan Cernocky, and Ladislav Mosner
Subjects: Reverberation, Noise measurement, Microphone, Computer science, Speech recognition, 020206 networking & telecommunications, 02 engineering and technology, Speaker recognition, Software, Audio and Speech Processing (eess.AS), Signal Processing, FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, NIST, Loudspeaker, Electrical and Electronic Engineering, Impulse response, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper presents BUT ReverbDB - a dataset of real room impulse responses (RIR), background noises and retransmitted speech data. The retransmitted data includes LibriSpeech test-clean, 2000 HUB5 English evaluation and part of 2010 NIST Speaker Recognition Evaluation datasets. We provide a detailed description of RIR collection (hardware, software, post-processing) that can serve as a "cook-book" for similar efforts. We also validate BUT ReverbDB in two sets of automatic speech recognition (ASR) experiments and draw conclusions for augmenting ASR training data with real and artificially generated RIRs. We show that a limited number of real RIRs, carefully selected to match the target environment, provide results comparable to a large number of artificially generated RIRs, and that both sets can be combined to achieve the best ASR results. The dataset is distributed for free under a non-restrictive license and it currently contains data from 8 rooms, a number that is growing. The distribution package also contains a Kaldi-based recipe for augmenting publicly available AMI close-talk meeting data and testing the results on an AMI single distant microphone set, allowing users to reproduce our experiments. Submitted to Journal of Selected Topics in Signal Processing, November 2018
Published: 2018

13. Optimization of Speaker-Aware Multichannel Speech Extraction with ASR Criterion

Author: Takuya Higuchi, Keisuke Kinoshita, Katerina Zmolikova, Tomohiro Nakatani, Jan Cernocky, and Marc Delcroix
Subjects: Beamforming, Artificial neural network, Computer science, Speech recognition, Acoustic model, 01 natural sciences, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0103 physical sciences, 0305 other medical science, Hidden Markov model, Adaptation (computer science), 010301 acoustics, Utterance
Abstract: This paper addresses the problem of recognizing speech corrupted by overlapping speakers in a multichannel setting. To extract a target speaker from the mixture, we use a neural network based beamformer which uses masks estimated by a neural network to compute statistically optimal spatial filters. Following our previous work, we inform the neural network about the target speaker using information extracted from an adaptation utterance, enabling the network to track the target speaker. While in the previous work, this method was used to separately extract the speaker and then pass such preprocessed speech to a speech recognition system, here we explore training both systems jointly with a common speech recognition criterion. We show that integrating the two systems and training for the final objective improves the performance. In addition, the integration enables further sharing of information between the acoustic model and the speaker extraction system, by making use of the predicted HMM-state posteriors to refine the masks used for beamforming.
Published: 2018

14. Dereverberation and Beamforming in Far-Field Speaker Recognition

Author: Ondrej Novotny, Ladislav Mosner, Jan Cernocky, and Pavel Matejka
Subjects: Beamforming, Reverberation, Artificial neural network, Computer science, Speech recognition, 020206 networking & telecommunications, 02 engineering and technology, Room acoustics, Speaker recognition, Weighting, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, NIST, 0305 other medical science
Abstract: This paper deals with far-field speaker recognition. On a corpus of NIST SRE 2010 data retransmitted in a real room with multiple microphones, we first demonstrate how room acoustics cause significant degradation of a state-of-the-art i-vector based speaker recognition system. We then investigate several techniques to improve the performance, ranging from probabilistic linear discriminant analysis (PLDA) re-training, through dereverberation, to beamforming. We found that weighted prediction error (WPE) based dereverberation combined with a generalized eigenvalue beamformer with power-spectral density (PSD) weighting masks generated by neural networks (NN) provides results approaching the clean close-microphone setup. Further improvement was obtained by re-training PLDA or the mask-generating NNs on simulated target data. The work shows that a speaker recognition system working robustly in the far-field scenario can be developed.
Published: 2018

15. Analysis of Multilingual BLSTM Acoustic Model on Low and High Resource Languages

Author: Martin Karafiat, Karel Vesely, Lukas Burget, Frantisek Grezl, Jan Cernocky, and Murali Karthick Baskar
Subjects: Artificial neural network, Computer science, Feature extraction, Acoustic model, Context (language use), Task (project management), Resource (project management), Artificial intelligence, Hidden Markov model, Natural language processing
Abstract: The paper provides an analysis of automatic speech recognition (ASR) systems based on multilingual BLSTM, where we used multi-task training with a separate classification layer for each language. The focus is on low resource languages, where only a limited amount of transcribed speech is available. In such a scenario, we found it essential to train the ASR systems in a multilingual fashion and we report superior results obtained with pre-trained multilingual BLSTM on this task. The high resource languages are also taken into account and we show the importance of language richness for multilingual training. Next, we present the performance of this technique as a function of the amount of target language data. The importance of including context information into BLSTM multilingual systems is also stressed, and we report increased resilience of large NNs to overtraining in case of multi-task training.
Published: 2018

16. Promising Accurate Prefix Boosting for sequence-to-sequence ASR

Author: Murali Karthick Baskar, Takaaki Hori, Martin Karafiat, Lukas Burget, Shinji Watanabe, and Jan Cernocky
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Computer Science - Computation and Language, Boosting (machine learning), Computer science, Speech recognition, Word error rate, 020206 networking & telecommunications, 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Computer Science - Sound, Machine Learning (cs.LG), Prefix, Discriminative model, Audio and Speech Processing (eess.AS), 0202 electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, Beam search, Sequence learning, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing, 0105 earth and related environmental sciences
Abstract: In this paper, we present promising accurate prefix boosting (PAPB), a discriminative training technique for attention based sequence-to-sequence (seq2seq) ASR. PAPB is devised to unify the training and testing scheme in an effective manner. The training procedure involves maximizing the score of each partial correct sequence obtained during beam search compared to other hypotheses. The training objective also includes minimization of token (character) error rate. PAPB shows its efficacy by achieving 10.8% and 3.8% WER with and without RNNLM respectively on Wall Street Journal dataset.
Published: 2018

17. How to Improve Your Speaker Embeddings Extractor in Generic Toolkits

Author: Lukas Burget, Hossein Zeinali, Themos Stafylakis, Jan Cernocky, and Johan Rohdin
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Time delay neural network, Computer science, Mechanism (biology), Speech recognition, 020206 networking & telecommunications, 02 engineering and technology, Rectifier (neural networks), Overfitting, Computer Science - Sound, Extractor, 030507 speech-language pathology & audiology, 03 medical and health sciences, Audio and Speech Processing (eess.AS), 0202 electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, 0305 other medical science, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, speaker embeddings extracted with deep neural networks became the state-of-the-art method for speaker verification. In this paper we aim to facilitate its implementation on a more generic toolkit than Kaldi, which we anticipate will enable further improvements on the method. We examine several tricks in training, such as the effects of normalizing input features and pooled statistics, different methods for preventing overfitting as well as alternative non-linearities that can be used instead of Rectified Linear Units. In addition, we investigate the difference in performance between TDNN and CNN, and between two types of attention mechanism. Experimental results on Speakers in the Wild, SRE 2016 and SRE 2018 datasets demonstrate the effectiveness of the proposed implementation.
Published: 2018

18. Bayesian phonotactic Language Model for Acoustic Unit Discovery

Author: Lukas Burget, Santosh Kesiraju, Lucas Ondel, and Jan Cernocky
Affiliations: Faculty of Information Technology [Brno] (FIT / BUT), Brno University of Technology [Brno] (BUT)
Subjects: Computer science, Bigram, Speech recognition, Bayesian probability, Pattern recognition, TIMIT, Mutual information, 010501 environmental sciences, 01 natural sciences, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], Data modeling, Dirichlet process, 010104 statistics & probability, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Language model, Artificial intelligence, 0101 mathematics, Hidden Markov model, ComputingMilieux_MISCELLANEOUS, 0105 earth and related environmental sciences
Abstract: Recent work on Acoustic Unit Discovery (AUD) has led to the development of a non-parametric Bayesian phone-loop model where the prior over the probability of the phone-like units is assumed to be sampled from a Dirichlet Process (DP). In this work, we propose to improve this model by incorporating a Hierarchical Pitman-Yor based bigram Language Model on top of the units' transitions. This new model makes use of the phonotactic context information but assumes a fixed number of units. To remedy this limitation, we first train a DP phone-loop model to infer the number of units; then the bigram phone-loop is initialized from the DP phone-loop and trained until convergence of its parameters. Results show an absolute improvement of 1–2% on the Normalized Mutual Information (NMI) metric. Furthermore, we show that, combined with Multilingual Bottleneck (MBN) features, the model yields the same or higher NMI as an English phone recogniser trained on TIMIT.
Published: 2017

19. Multilingual BLSTM and speaker-specific vector adaptation in 2016 BUT Babel system

Author: Murali Karthick Baskar, Frantisek Grezl, Jan Cernocky, Karel Vesely, Pavel Matejka, and Martin Karafiat
Subjects: Training set, Artificial neural network, business.industry, Computer science, Speech recognition, Feature extraction, computer.software_genre, Whole systems, 030507 speech-language pathology & audiology, 03 medical and health sciences, Long short term memory, 0302 clinical medicine, Artificial intelligence, 0305 other medical science, business, Adaptation (computer science), computer, 030217 neurology & neurosurgery, Natural language processing, Speaker adaptation
Abstract: This paper provides an extensive summary of the BUT 2016 system for the last IARPA Babel evaluations. It concentrates on multilingual training of both deep neural network (DNN)-based feature extraction and acoustic models, including multilingual training of bidirectional Long Short-Term Memory networks. Next, two low-dimensional vector approaches to speaker adaptation are investigated: i-vectors and sequence-summarizing neural networks (SSNN). The results provided on three Babel Year 4 languages show a clear advantage of both approaches when only a limited amount of training data is available. The time necessary for the development of a new system is addressed too, as some of the investigated techniques do not require extensive re-training of the whole system.
Published: 2016

20. Analysis of the DNN-based SRE systems in multi-language conditions

Author: Ondrej Glembek, Ondrej Novotny, Oldrich Plchot, Lukas Burget, Jan Cernocky, Pavel Matejka, and Frantisek Grezl
Subjects: Set (abstract data type), 030507 speech-language pathology & audiology, 03 medical and health sciences, Artificial neural network, Computer science, Speech recognition, 0202 electrical engineering, electronic engineering, information engineering, Multi language, NIST, 020206 networking & telecommunications, 02 engineering and technology, 0305 other medical science, Speaker recognition
Abstract: This paper analyzes the behavior of our state-of-the-art Deep Neural Network/i-vector/PLDA-based speaker recognition systems in multi-language conditions. On the “Language Pack” of the PRISM set, we evaluate the systems' performance using NIST's standard metrics. We show that not only does the gain from using DNNs vanish, and that using dedicated DNNs for target conditions does not help, but also that the DNN-based systems tend to produce de-calibrated scores under the studied conditions. This work gives suggestions for directions of future research rather than any particular solutions to these issues.
Published: 2016

21. Analysis of DNN approaches to speaker identification

Author: Ondrej Glembek, Oldrich Plchot, Ondrej Novotny, Jan Cernocky, Pavel Matejka, Lukas Burget, and Frantisek Grezl
Subjects: Normalization (statistics), Artificial neural network, business.industry, Computer science, Speech recognition, 020206 networking & telecommunications, Pattern recognition, 02 engineering and technology, Speaker recognition, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, Speaker identification, NIST, Mel-frequency cepstrum, Artificial intelligence, 0305 other medical science, business, Sufficient statistic
Abstract: This work studies the usage of the Deep Neural Network (DNN) Bottleneck (BN) features together with the traditional MFCC features in the task of i-vector-based speaker recognition. We decouple the sufficient statistics extraction by using separate GMM models for frame alignment, and for statistics normalization and we analyze the usage of BN and MFCC features (and their concatenation) in the two stages. We also show the effect of using full-covariance GMM models, and, as a contrast, we compare the result to the recent DNN-alignment approach. On the NIST SRE2010, telephone condition, we show 60% relative gain over the traditional MFCC baseline for EER (and similar for the NIST DCF metrics), resulting in 0.94% EER.
Published: 2016

22. Multilingual region-dependent transforms

Author: Lukas Burget, Jan Cernocky, Martin Karafiat, Frantisek Grezl, and Karel Vesely
Subjects: Artificial neural network, Time delay neural network, business.industry, Computer science, Feature extraction, Feedforward neural network, Bootstrapping (linguistics), Artificial intelligence, Machine learning, computer.software_genre, business, computer, Natural language processing
Abstract: In recent years, trained feature extraction (FE) schemes based on neural networks have replaced or complemented traditional approaches in top performing systems. This paper deals with FE in multilingual scenarios with a target language with a low amount of transcribed data. Continuing our previous work on multilingual training of Stacked Bottle-Neck Neural Network FE schemes, we concentrate on improving the discriminatively trained Region-Dependent Transforms (RDT). We show that multilingual training of RDT can be implemented by merging statistics from several languages. In our case we used up to 11 source languages to build an FE which generalizes well to a new language. This allows us to build a strong bootstrapping model for the final ASR system. The results are produced on IARPA Babel data.
Published: 2016

23. Robust speech recognition in unknown reverberant and noisy conditions

Author: Jan Cernocky, Frantisek Grezl, Richard Schwartz, Lukas Burget, William Hartmann, Stavros Tsakalidis, Igor Szöke, Hynek Hermansky, Shinji Watanabe, Jeff Z. Ma, Sri Harish Mallidi, Martin Karafiat, Roger Hsiao, and Zhuo Chen
Subjects: Speech enhancement, Voice activity detection, Artificial neural network, Computer science, Robustness (computer science), Speech recognition, Speech coding, Speech technology, Acoustic model, Speech processing
Abstract: In this paper, we describe our work on the ASpIRE (Automatic Speech recognition In Reverberant Environments) challenge, which aims to assess the robustness of automatic speech recognition (ASR) systems. The main characteristic of the challenge is developing a high-performance system without access to matched training and development data. While the evaluation data are recorded with far-field microphones in noisy and reverberant rooms, the training data are telephone speech and close talking. Our approach to this challenge includes speech enhancement, neural network methods and acoustic model adaptation. We show that these techniques can successfully alleviate the performance degradation due to noisy audio and data mismatch.
Published: 2015

24. Analysis of Feature Extraction and Channel Compensation in a GMM Speaker Recognition System

Author: Pavel Matejka, Ondrej Glembek, Petr Schwarz, Lukas Burget, and Jan Cernocky
Subjects: Acoustics and Ultrasonics, Channel (digital image), business.industry, Computer science, Speech recognition, Feature extraction, Word error rate, Pattern recognition, Linear discriminant analysis, Speaker recognition, Speech processing, Feature (machine learning), NIST, Artificial intelligence, Electrical and Electronic Engineering, business
Abstract: In this paper, several feature extraction and channel compensation techniques found in state-of-the-art speaker verification systems are analyzed and discussed. For the NIST SRE 2006 submission, cepstral mean subtraction, feature warping, RelAtive SpecTrAl (RASTA) filtering, heteroscedastic linear discriminant analysis (HLDA), feature mapping, and eigenchannel adaptation were incrementally added to minimize the system's error rate. This paper deals with eigenchannel adaptation in more detail and includes its theoretical background and implementation issues. The key part of the paper is, however, the post-evaluation analysis, undermining a common myth that “the more boxes in the scheme, the better the system.” All results are presented on NIST Speaker Recognition Evaluation (SRE) 2005 and 2006 data.
Published: 2007

25. Fusion of Heterogeneous Speaker Recognition Systems in the STBU Submission for the NIST Speaker Recognition Evaluation 2006

Author: Frantisek Grezl, Martin Karafiat, Jan Cernocky, Ondrej Glembek, Petr Schwarz, Pavel Matejka, Niko Brümmer, D.A. van Leeuwen, Albert Strasheim, Lukas Burget, and TNO Defensie en Veiligheid
Subjects: Acoustics and Ultrasonics, Computer science, Speech recognition, Cognitive neuroscience of visual object recognition, Linear prediction, Object recognition, Vectors, Speaker recognition, Speech processing, Support vector machine, Gaussian mixture model (GMM), Communication channels (information theory), Magnetostrictive devices, Nuisance attribute projection (NAP), NIST, Eigenchannel, Mel-frequency cepstrum, Electrical and Electronic Engineering, Fusion, Image retrieval, Maximum likelihood
Abstract: This paper describes and discusses the "STBU" speaker recognition system, which performed well in the NIST Speaker Recognition Evaluation 2006 (SRE). STBU is a consortium of four partners: Spescom DataVoice (Stellenbosch, South Africa), TNO (Soesterberg, The Netherlands), BUT (Brno, Czech Republic), and the University of Stellenbosch (Stellenbosch, South Africa). The STBU system was a combination of three main kinds of subsystems: 1) GMM, with short-time Mel frequency cepstral coefficient (MFCC) or perceptual linear prediction (PLP) features, 2) Gaussian mixture model-support vector machine (GMM-SVM), using GMM mean supervectors as input to an SVM, and 3) maximum-likelihood linear regression-support vector machine (MLLR-SVM), using MLLR speaker adaptation coefficients derived from an English large vocabulary continuous speech recognition (LVCSR) system. All subsystems made use of supervector subspace channel compensation methods, either eigenchannel adaptation or nuisance attribute projection. We document the design and performance of all subsystems, as well as their fusion and calibration via logistic regression. Finally, we also present a cross-site fusion that was done with several additional systems from other NIST SRE-2006 participants. © 2006 IEEE.
Published: 2007

26. Audio-Visual Processing in Meetings: Seven Questions and Current AMI Answers

Author: Jan Cernocky, Andrew Thean, Ronald Müller, Hervé Bourlard, Fabien Cardinaux, Pavel Zemcik, Marc Al-Hames, Mannes Poel, Jean-Marc Odobez, Silèye O. Ba, Daniel Gatica-Perez, Steve Renals, David A. van Leeuwen, Petr Motlicek, Kevin Smith, Jeroen van Rest, Gerhard Rigoll, Thomas Hain, Sébastien Marcel, Sascha Schreiber, Stephan Reiter, Adam Janin, and Rutger Rienks
Subjects: IR-63415, Multimedia, Computer science, EWI-6857, Image processing, computer.software_genre, METIS-242051, Content analysis, Informatics, Component (UML), Audio visual, EC Grant Agreement nr.: FP6/506811, User interface, computer, Abstraction (linguistics)
Abstract: The project Augmented Multi-party Interaction (AMI) is concerned with the development of meeting browsers and remote meeting assistants for instrumented meeting rooms – and the required component technologies R&D themes: group dynamics, audio, visual, and multimodal processing, content abstraction, and human-computer interaction. The audio-visual processing workpackage within AMI addresses the automatic recognition from audio, video, and combined audio-video streams that have been recorded during meetings. In this article we describe the progress that has been made in the first two years of the project. We show how the large problem of audio-visual processing in meetings can be split into seven questions, like “Who is acting during the meeting?” We then show which algorithms and methods have been developed and evaluated for the automatic answering of these questions.
Published: 2007

27. Coping with channel mismatch in Query-by-Example - BUT QUESST 2014

Author: Miroslav Skácel, Jan Cernocky, Igor Szöke, and Lukas Burget
Subjects: Information retrieval, Language identification, Computer science, Calibration (statistics), Query by Example, computer, Communication channel, computer.programming_language, Term (time)
Abstract: The paper investigates Query by Example (QbE), a spoken term detection technique with queries entered by voice. It describes the BUT QbE system that achieved the best accuracy in the MediaEval QUESST2014 evaluations. This evaluation was challenging because of severe mismatch between queries and utterances, and the introduction of new types of queries. The paper provides an analysis of the DTW sub-systems in mismatched conditions (especially targeting DTW metrics) and discusses approaches investigated for QUESST2014: generation of calibration side-information by a language identification system, and handling T2 and T3 queries by relaxing the constraints of an exact match. All results are provided on QUESST2014 development and evaluation data.
Published: 2015

28. BUT ASR system for BABEL Surprise evaluation 2014

Author: Frantisek Grezl, Karel Vesely, Mirko Hannemann, Igor Szöke, Jan Cernocky, Lukas Burget, and Martin Karafiat
Subjects: Training set, Artificial neural network, Computer science, business.industry, media_common.quotation_subject, Speech recognition, computer.software_genre, language.human_language, Surprise, Tamil, language, Deep neural networks, Noise (video), Artificial intelligence, business, computer, Natural language processing, media_common, Test data
Abstract: The paper describes the Brno University of Technology (BUT) ASR system for the 2014 BABEL Surprise language evaluation (Tamil). While largely based on our previous work, it brings two original contributions: (1) speaker-adapted bottle-neck neural network (BN) features were investigated as an input to the DNN recognizer, and semi-supervised training was found effective; (2) adding noise to the training data was found beneficial when dealing with noisy test data and outperformed a classical de-noising technique, and the performance of this approach was verified on a relatively clean training/test data setup from a different language. All results are reported on BABEL 2014 Tamil data.
Published: 2014

29. BUT neural network features for spontaneous Vietnamese in BABEL

Author: Jan Cernocky, Frantisek Grezl, Mirko Hannemann, and Martin Karafiat
Subjects: ComputingMethodologies_PATTERNRECOGNITION, Artificial neural network, Discriminative model, Computer science, Time delay neural network, Vietnamese, Speech recognition, Feature extraction, language, Adaptation (computer science), language.human_language
Abstract: This paper presents our work on speech recognition of Vietnamese spontaneous telephone conversations. It focuses on feature extraction by Stacked Bottle-Neck neural networks: several improvements such as semi-supervised training on untranscribed data, increasing the precision of state targets, and CMLLR adaptations were investigated. We have also tested speaker adaptive training of this architecture and found a significant gain. The results are reported on BABEL Vietnamese data. Index Terms: speech recognition, discriminative training, bottleneck neural networks, adaptation of neural networks, region-dependent transforms
Published: 2014

30. Calibration and fusion of query-by-example systems - BUT SWS 2013

Author: Jan Cernocky, Igor Szöke, Lucas Ondel, Frantisek Grezl, and Lukas Burget
Subjects: Normalization (statistics), Dynamic time warping, Artificial neural network, Calibration (statistics), business.industry, Computer science, Speech recognition, Speech processing, computer.software_genre, Task (project management), Keyword spotting, Query by Example, Artificial intelligence, business, computer, Natural language processing, computer.programming_language
Abstract: This paper summarizes our work for MediaEval 2013 Spoken Web Search task evaluations. The task was Query-by-Example (search of spoken queries within spoken data). We submitted a system composed of 26 subsystems, of which 13 are based on Acoustic Keyword Spotting and 13 on Dynamic Time Warping. All of them use three-state phoneme posteriors as input features. Our main contribution was m-norm normalization of particular subsystems together with the fusion based on binary logistic regression. The results, including per-language analysis, are provided on the MediaEval 2013 dataset.
Published: 2014

31. The ELISA Systems for the NIST'99 Evaluation in Speaker Detection and Tracking

Author: Jan Cernocky, Ivan Magrin-Chagnolleau, and Gérard Chollet
Subjects: Computational Theory and Mathematics, Artificial Intelligence, Applied Mathematics, Signal Processing, Computer Vision and Pattern Recognition, Electrical and Electronic Engineering, Statistics, Probability and Uncertainty
Published: 2000

32. Manual and semi-automatic approaches to building a multilingual phoneme set

Author: Martin Karafiat, Ekaterina Egorova, Milos Janda, Karel Vesely, and Jan Cernocky
Subjects: Artificial neural network, business.industry, Computer science, Speech recognition, Confusion matrix, computer.software_genre, ComputingMethodologies_ARTIFICIALINTELLIGENCE, Expert system, Reduction (complexity), Set (abstract data type), Semi automatic, Artificial intelligence, business, computer, Natural language processing
Abstract: The paper addresses manual and semi-automatic approaches to building a multilingual phoneme set for automatic speech recognition. The first approach involves mapping and reduction of the phoneme set based on IPA and expert knowledge; the latter involves a phoneme confusion matrix generated by a neural network. The comparison is done for 8 languages selected from GlobalPhone on three scenarios: 1) multilingual system with abundant data for all the languages, 2) multilingual systems excluding the target language, 3) multilingual systems with a small amount of data for target languages. For 3), the multilingual system brought improvement for languages close enough to the others in the set.
Published: 2013

33. A factorized representation of FMLLR transform based on QR-decomposition

Author: Rath, S. P., Karafiat, M., Glembek, O., and Jan Cernocky
Published: 2012

34. Region dependent linear transforms in multilingual speech recognition

Author: Jan Cernocky, Milos Janda, Martin Karafiat, and Lukas Burget
Subjects: Computer science, business.industry, Speech recognition, Feature extraction, Acoustic model, Pattern recognition, symbols.namesake, Discriminative model, symbols, Feature (machine learning), Artificial intelligence, Hidden Markov model, business, Focus (optics), Gaussian process
Abstract: In today's speech recognition systems, linear or nonlinear transformations are usually applied to post-process speech features forming input to HMM based acoustic models. In this work, we experiment with three popular transforms: HLDA, MPE-HLDA and Region Dependent Linear Transforms (RDLT), which are trained jointly with the acoustic model to extract the maximum discriminative information from the raw features and to represent it in a form suitable for the following GMM-HMM based acoustic model. We focus on multilingual environments, where limited resources are available for training recognizers of many languages. Using data from the GlobalPhone database, we show that, under such restrictive conditions, the feature transformations can be advantageously shared across languages and robustly trained using data from several languages.
Published: 2012

35. Discriminative Classifiers for Phonotactic Language Recognition with iVectors

Author: Jan Cernocky, Mehdi Soufifar, Sandro Cumani, and Lukas Burget
Subjects: Training set, Computer science, business.industry, Feature vector, Speech recognition, Feature extraction, Pattern recognition, Set (abstract data type), Support vector machine, Random subspace method, ComputingMethodologies_PATTERNRECOGNITION, Discriminative model, NIST, Artificial intelligence, business, Subspace topology
Abstract: Phonotactic models based on bag-of-n-grams representations and discriminative classifiers are a popular approach to the language recognition problem. However, the large size of n-gram count vectors brings about some difficulties in discriminative classifiers. The subspace Multinomial model was recently proposed to effectively represent information contained in the n-grams using low-dimensional iVectors. The availability of a low-dimensional feature vector allows investigating different post-processing techniques and different classifiers to improve recognition performance. In this work, we analyze a set of discriminative classifiers based on Support Vector Machines and Logistic Regression and we propose an iVector post-processing technique which improves recognition performance. The proposed systems are evaluated on the NIST LRE 2009 task.
Published: 2012

36. iVector-based discriminative adaptation for automatic speech recognition

Author: Jan Cernocky, Lukas Burget, Martin Karafiat, Ondrej Glembek, and Pavel Matejka
Subjects: Novel technique, Discriminative model, Computer science, business.industry, Speech recognition, Feature extraction, Feature (machine learning), Pattern recognition, Artificial intelligence, Adaptation (computer science), Speaker recognition, business, Relevant information
Abstract: We present a novel technique for discriminative feature-level adaptation of an automatic speech recognition system. The concept of iVectors, popular in Speaker Recognition, is used to extract information about the speaker or acoustic environment from a speech segment. An iVector is a low-dimensional fixed-length vector representing such information. To utilize iVectors for adaptation, Region Dependent Linear Transforms (RDLT) are discriminatively trained using the MPE criterion on a large amount of annotated data to extract the relevant information from iVectors and to compensate speech features. The approach was tested on standard CTS data. We found it to be complementary to common adaptation techniques. On a well tuned RDLT system with standard CMLLR adaptation we reached 0.8% additive absolute WER improvement.
Published: 2011

37. Strategies for training large scale neural network language models

Author: Daniel Povey, Jan Cernocky, Lukas Burget, Anoop Deoras, and Tomas Mikolov
Subjects: Computational complexity theory, Artificial neural network, Computer science, Time delay neural network, business.industry, Speech recognition, Principle of maximum entropy, Hash function, Word error rate, Machine learning, computer.software_genre, Reduction (complexity), Artificial intelligence, Language model, business, computer
Abstract: We describe how to effectively train neural network based language models on large data sets. Fast convergence during training and better overall performance are observed when the training data are sorted by their relevance. We introduce a hash-based implementation of a maximum entropy model that can be trained as a part of the neural network model. This leads to a significant reduction of computational complexity. We achieved around 10% relative reduction of word error rate on an English Broadcast News speech recognition task, against a large 4-gram model trained on 400M tokens.
Published: 2011

38. Extensions of recurrent neural network language model

Author: Jan Cernocky, Tomas Mikolov, Stefan Kombrink, Sanjeev Khudanpur, and Lukas Burget
Subjects: Speedup, Artificial neural network, Computational complexity theory, Computer science, business.industry, Feed forward, Recurrent neural nets, Machine learning, computer.software_genre, Backpropagation, Recurrent neural network, Backpropagation through time, Probability distribution, Language model, Artificial intelligence, business, computer
Abstract: We present several modifications of the original recurrent neural network language model (RNN LM). While this model has been shown to significantly outperform many competitive language modeling techniques in terms of accuracy, the remaining problem is the computational complexity. In this work, we show approaches that lead to more than 15 times speedup for both training and testing phases. Next, we show the importance of using a backpropagation through time algorithm. An empirical comparison with feedforward networks is also provided. In the end, we discuss ways to reduce the number of parameters in the model. The resulting RNN model can thus be smaller, faster both during training and testing, and more accurate than the basic one.
Published: 2011

39. Recent progress in prosodic speaker verification

Author: Marcel Kockmann, Jan Cernocky, Lukas Burget, Elizabeth Shriberg, and Luciana Ferrer
Subjects: Normalization (statistics), business.industry, Computer science, Speech recognition, Feature extraction, Probabilistic logic, Word error rate, Pattern recognition, Covariance, Linear discriminant analysis, Speaker recognition, Support vector machine, NIST, Artificial intelligence, business
Abstract: We describe recent progress in the field of prosodic modeling for speaker verification. In a previous paper, we proposed a technique for modeling syllable-based prosodic features that uses a multinomial subspace model for feature extraction and within-class covariance normalization or linear discriminant analysis for session variability compensation. In this paper, we show that performance can be significantly improved with the use of probabilistic linear discriminant analysis (PLDA) for session variability compensation. This system does not require score normalization. We report an equal error rate below 7% on a NIST 2008 task. To our knowledge, this is the best reported result to date for a prosodic system for speaker recognition. Fusion of this system with a state-of-the-art acoustic baseline system yields 10% relative improvement in the new detection cost function (DCF) as defined by NIST.
Published: 2011

40. Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification

Author: Jan Cernocky, Oldrich Plchot, Ondrej Glembek, Lukas Burget, Patrick Kenny, Md. Jahangir Alam, Fabio Castaldo, and Pavel Matejka
Subjects: Speaker verification, Computer science, business.industry, Covariance matrix, Speech recognition, Dimensionality reduction, Feature extraction, NIST, Pattern recognition, Artificial intelligence, Covariance, business, Speaker recognition
Abstract: In this paper, we describe recent progress in i-vector based speaker verification. The use of universal background models (UBM) with full-covariance matrices is suggested and thoroughly experimentally tested. The i-vectors are scored using a simple cosine distance and advanced techniques such as Probabilistic Linear Discriminant Analysis (PLDA) and the heavy-tailed variant of PLDA (PLDA-HT). Finally, we investigate dimensionality reduction of i-vectors before entering the PLDA-HT modeling. The results are very competitive: on the NIST 2010 SRE task, the results of a single full-covariance LDA-PLDA-HT system approach those of a complex fused system.
Published: 2011

41. Acoustic keyword spotter - optimization from end-user perspective

Author: Frantisek Grezl, Jan Cernocky, Tomas Cipr, Igor Szöke, and Michal Fapso
Subjects: Normalization (statistics), Artificial neural network, End user, business.industry, Computer science, Keyword spotting, Speech recognition, Discrete cosine transform, Artificial intelligence, business, computer.software_genre, computer, Natural language processing
Abstract: The paper deals with the development of an acoustic keyword spotter (KWS) meeting the requirements of a real user from the security community. While the basic scheme of the KWS is relatively standard, it uses novel features derived by a hierarchy of neural networks, and score normalization trained to maximize a user-like evaluation metric. The results are reported on a selection of Czech conversational telephone speech (CTS), radio and read data.
Published: 2010

42. Speech@FIT lecture browser

Author: Igor Szöke, J. Zizka, Jan Cernocky, and Michal Fapso
Subjects: Video recording, Access to information, Multimedia, Point (typography), Computer science, E-learning (theory), Server, Information access, Image processing, computer.software_genre, Speech processing, computer
Abstract: This paper describes an innovative web-based browser for video recordings of lectures that is built on speech and image processing technologies. The aim of this project is to simplify access to information that is spread across video recordings. This is mainly achieved by coupling to the speech search engine and by the possibility of quickly navigating through an automatically generated list of the slides presented. The reader is briefly acquainted with the technological background of the browser; the emphasis is laid on the use of the browser from the user's point of view.
Published: 2010

43. Prosodic speaker verification using subspace multinomial models with intersession compensation

Author: Kockmann, M., Burget, L., Glembek, O., Ferrer, L., and Jan Cernocky
Published: 2010

44. Tuning phone decoders for language identification

Author: Pavel Matejka, Jan Cernocky, Haizhou Li, Lukas Burget, Rong Tong, and C. Santhosh Kumar
Subjects: Artificial neural network, Language identification, Computer science, Phone, Speech recognition, Speech coding, Language model, Mutual information, Hidden Markov model, Decoding methods
Abstract: The phonotactic approach, phone recognition followed by language modeling, is one of the most popular approaches to language identification (LID). In this work, we explore how the language identification accuracy of a phone decoder can be enhanced by varying the acoustic resolution of the phone decoder, and subsequently how multiresolution versions of the same decoder can be integrated to improve the LID accuracy. We use mutual information to select the optimum set of phones for a specific acoustic resolution. Further, we propose strategies for building multilingual systems suitable for LID applications, and subsequently fine-tune these systems to enhance the overall accuracy.
Published: 2010

45. Investigations into prosodic syllable contour features for speaker recognition

Author: Marcel Kockmann, Jan Cernocky, and Lukas Burget
Subjects: Computer science, business.industry, Speech recognition, Feature extraction, Pattern recognition, 02 engineering and technology, Speaker recognition, Speech segmentation, Speaker diarisation, 030507 speech-language pathology & audiology, 03 medical and health sciences, ComputingMethodologies_PATTERNRECOGNITION, 0202 electrical engineering, electronic engineering, information engineering, NIST, 020201 artificial intelligence & image processing, Segmentation, Loudspeaker, Artificial intelligence, Syllable, 0305 other medical science, business
Abstract: We investigate various ways of generating prosodic syllable contour features that have recently been applied to enhance systems for speaker recognition. We compare different approaches for segmentation of speech into syllable-like units, techniques for contour modeling and the extraction of pitch and energy, taking into account the computational complexity and gender dependence. We show that the performance is especially affected by the segmentation and the quality of the pitch tracking algorithm and that the features are highly gender dependent. Still, computationally simple ways of segmenting speech can be used to achieve good results, as experiments on the 2006 NIST speaker recognition evaluation task indicate.
Published: 2010

46. Neural network based language models for highly inflective languages

Author: Ondrej Glembek, Lukas Burget, Jan Cernocky, Tomas Mikolov, and Jiri Kopecky
Subjects: Czech, Class (computer programming), Artificial neural network, Computer science, business.industry, Speech recognition, Decision tree, Word error rate, computer.software_genre, language.human_language, Data modeling, language, Language model, Artificial intelligence, business, computer, Natural language processing, Smoothing
Abstract: Speech recognition of inflectional and morphologically rich languages like Czech is currently quite a challenging task, because simple n-gram techniques are unable to capture important regularities in the data. Several possible solutions were proposed, namely class based models, factored models, decision trees and neural networks. This paper describes improvements obtained in recognition of spoken Czech lectures using language models based on neural networks. Relative reductions in word error rate are more than 15% over baseline obtained with adapted 4-gram backoff language model using modified Kneser-Ney smoothing.
Published: 2009

47. Morphological random forests for language modeling of inflectional languages

Author: Ondrej Glembek, Jan Cernocky, Ilya Oparin, and Lukas Burget
Subjects: Czech, Perplexity, business.industry, Computer science, Speech recognition, Decision tree, computer.software_genre, language.human_language, Random forest, Reduction (complexity), Feature (machine learning), language, Trigram, Artificial intelligence, Language model, business, computer, Natural language processing
Abstract: In this paper, we are concerned with using decision trees (DT) and random forests (RF) in language modeling for Czech LVCSR. We show that the RF approach can be successfully implemented for language modeling of an inflectional language. Performance of word-based and morphological DTs and RFs was evaluated on a lecture recognition task. We show that while DTs perform worse than conventional trigram language models (LM), RFs of both kinds outperform the latter. WER (up to 3.4% relative) and perplexity (10%) reduction over the trigram model can be gained with morphological RFs. Further improvement is obtained after interpolation of DT and RF LMs with the trigram one (up to 15.6% perplexity and 4.8% WER relative reduction). In this paper we also investigate the distribution of morphological feature types chosen for splitting data at different levels of DTs.
Published: 2008

48. Discriminative training of narrow band - wide band adapted systems for meeting recognition

Author: Karafiát, M., Burget, L., Hain, T., and Jan Cernocky
Published: 2008

49. Combination of strongly and weakly constrained recognizers for reliable detection of OOVs

Author: Lukas Burget, Christopher White, Ariya Rastrow, Mirko Hannemann, Hynek Hermansky, Pavel Matejka, Petr Schwarz, Sanjeev Khudanpur, and Jan Cernocky
Subjects: Vocabulary, Computer science, business.industry, Speech recognition, media_common.quotation_subject, Frame (networking), ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Estimator, computer.software_genre, Task (project management), ComputingMethodologies_PATTERNRECOGNITION, Phone, Artificial intelligence, business, computer, Natural language processing, Word (computer architecture), media_common
Abstract: This paper addresses the detection of OOV segments in the output of a large vocabulary continuous speech recognition (LVCSR) system. First, standard confidence measures from frame-based word- and phone-posteriors are investigated. Substantial improvement is obtained when posteriors from two systems - strongly constrained (LVCSR) and weakly constrained (phone posterior estimator) are combined. We show that this approach is also suitable for detection of general recognition errors. All results are presented on WSJ task with reduced recognition vocabulary.
Published: 2008

50. TRAP-Based Techniques for Recognition of Noisy Speech

Author: Frantisek Grezl and Jan Cernocky
Subjects: Trap (computing), Noise, Critical band, Training set, Computer science, Speech recognition, Concatenation, Discrete cosine transform, Word error rate, Noise level
Abstract: This paper presents a systematic study of performance of TempoRAl Patterns (TRAP) based features and their proposed modifications and combinations for speech recognition in noisy environment. The experimental results are obtained on AURORA 2 database with clean training data. We observed large dependency of performance of different TRAP modifications on noise level. Earlier proposed TRAP system modifications help in clean conditions but degrade the system performance in presence of noise. The combination techniques on the other hand can bring large improvement in case of weak noise and degrade only slightly for strong noise cases. The vector concatenation combination technique is improving the system performance up to strong noise.
Published: 2007

75 results on '"Jan Cernocky"'
