48 results for "Simon King"
Search Results
2. Speech Audio Corrector: using speech from non-target speakers for one-off correction of mispronunciations in grapheme-input text-to-speech
- Author
-
Jason Fong, Daniel Lyth, Gustav Eje Henter, Hao Tang, and Simon King
- Published
- 2022
3. Factors Affecting the Evaluation of Synthetic Speech in Context
- Author
-
Catherine Lai, Simon King, Pilar Oplustil-Gallegos, and Johannah O'Mahony
- Subjects
Computer science, Context (language use), Cognitive psychology
- Published
- 2021
4. Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech
- Author
-
Simon King, Pilar Oplustil-Gallegos, and Johannah O'Mahony
- Subjects
Linguistic context, Computer science, Speech synthesis, Linguistics
- Published
- 2021
5. Analysing Temporal Sensitivity of VQ-VAE Sub-Phone Codebooks
- Author
-
Jason Fong, Jennifer Williams, and Simon King
- Subjects
Computer science, Phone, Pattern recognition, Sensitivity (control systems), Artificial intelligence
- Published
- 2021
6. The Blizzard Challenge 2020
- Author
-
Xiao Zhou, Zhen-Hua Ling, and Simon King
- Published
- 2020
7. Testing the Limits of Representation Mixing for Pronunciation Correction in End-to-End Speech Synthesis
- Author
-
Simon King, Jason Taylor, and Jason Fong
- Subjects
End-to-end principle, Computer science, Speech recognition, Representation (systemics), Speech synthesis, Pronunciation
- Published
- 2020
8. Hider-Finder-Combiner: An Adversarial Architecture for General Speech Signal Modification
- Author
-
Jacob J. Webber, Simon King, and Olivier Perrotin (Centre for Speech Technology Research, University of Edinburgh; GIPSA-lab, Université Grenoble Alpes)
- Subjects
adversarial networks, Computer science, speech modification, Speech recognition, speech synthesis
- Abstract
We introduce a prototype system for modifying an arbitrary parameter of a speech signal. Unlike signal processing approaches that require dedicated methods for different parameters, our system can, in principle, modify any control parameter that the signal can be annotated with. Our system comprises three neural networks. The 'hider' removes all information related to the control parameter, outputting a hidden embedding. The 'finder' is an adversary used to train the 'hider', attempting to detect the value of the control parameter from the hidden embedding. The 'combiner' network recombines the hidden embedding with a desired new value of the control parameter. The input and output to the system are mel-spectrograms and we employ a neural vocoder to generate the output speech waveform. As a proof of concept, we use F0 as the control parameter. The system was evaluated in terms of control parameter accuracy and naturalness against a high-quality signal processing method of F0 modification that also works in the spectrogram domain. We also show that, with modifications only to training data, the system is capable of modifying the 1st and 2nd vocal tract formants, showing progress towards universal signal modification.
- Published
- 2020
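The three-network scheme described in the record above can be sketched compactly. Below is a minimal PyTorch illustration, assuming frame-level mel-spectrogram inputs and a quantised F0 control; the layer sizes, optimisers, and the simple negated-cross-entropy adversarial objective are illustrative assumptions, not the paper's exact configuration. As the abstract notes, a neural vocoder (not shown) renders the output waveform.

```python
# Minimal hider/finder/combiner sketch: the finder tries to recover the
# control parameter from the hidden embedding; the hider is trained to
# defeat it while the combiner reconstructs the frame from embedding + F0.
import torch
import torch.nn as nn

N_MEL, N_HID, N_F0_BINS = 80, 256, 64  # illustrative sizes

hider = nn.Sequential(nn.Linear(N_MEL, N_HID), nn.ReLU(), nn.Linear(N_HID, N_HID))
finder = nn.Sequential(nn.Linear(N_HID, N_HID), nn.ReLU(), nn.Linear(N_HID, N_F0_BINS))
combiner = nn.Sequential(nn.Linear(N_HID + N_F0_BINS, N_HID), nn.ReLU(),
                         nn.Linear(N_HID, N_MEL))

opt_f = torch.optim.Adam(finder.parameters(), lr=1e-4)
opt_hc = torch.optim.Adam(list(hider.parameters()) + list(combiner.parameters()), lr=1e-4)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

def train_step(mel, f0_bin):
    """mel: (B, N_MEL) frames; f0_bin: (B,) integer control labels."""
    # 1) Train the finder (adversary) to detect F0 in the hidden embedding.
    h = hider(mel).detach()
    opt_f.zero_grad()
    ce(finder(h), f0_bin).backward()
    opt_f.step()
    # 2) Train hider + combiner: reconstruct the frame while pushing F0
    #    information out of the embedding (crude negated-loss proxy).
    opt_hc.zero_grad()
    h = hider(mel)
    f0_onehot = torch.nn.functional.one_hot(f0_bin, N_F0_BINS).float()
    recon = combiner(torch.cat([h, f0_onehot], dim=-1))
    adv = -ce(finder(h), f0_bin)
    (mse(recon, mel) + 0.1 * adv).backward()
    opt_hc.step()
```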
9. A Comparison of Letters and Phones as Input to Sequence-to-Sequence Models for Speech Synthesis
- Author
-
Simon King, Jason Taylor, Korin Richmond, and Jason Fong
- Subjects
Computer science, Speech recognition, Speech synthesis
- Abstract
Neural sequence-to-sequence (S2S) models for text-to-speech synthesis (TTS) may take letter or phone input sequences. Since for many languages phones have a more direct relationship to the acoustic signal, they lead to improved quality. But generating phone transcriptions from text requires an expensive dictionary and an error-prone grapheme-to-phoneme (G2P) model, and the relative improvement over using letters has yet to be quantified. In approaching this question, we presume that letter-input S2S models must implicitly learn an internal counterpart to G2P conversion and therefore inevitably make errors. Such a model may thus be viewed as phone-input S2S with inaccurate phone input. To quantify this inaccuracy, we compare in this paper a letter-input S2S system to several phone-input systems trained on data with a varying level of error in the phonetic transcription. Our findings show our letter-input system is equivalent in quality to the phone-input system in which 25% of word tokens in the training data have incorrect phonetic transcriptions. Furthermore, we find that for phone-input systems up to 15% of word tokens in the training data can have incorrect phonetic transcriptions without any significant difference in performance to a 0% error rate system. This suggests it is acceptable to use G2P to predict pronunciations for out-of-vocabulary words (OOVs) provided they are less than around 15% of the training data, removing the need to manually add OOVs to the dictionary for every new training set.
- Published
- 2019
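The corruption protocol implied by this abstract (training phone-input systems on transcriptions with a controlled word-token error rate) can be sketched as follows. The substitution model, which swaps in another word's pronunciation drawn at random, is an assumption for illustration; the paper's actual error model may differ.

```python
# Flip the phonetic transcription of a chosen fraction of word tokens.
import random

def corrupt_transcriptions(utterances, error_rate, seed=0):
    """utterances: list of [(word, phone_seq), ...]; error_rate in [0, 1]."""
    rng = random.Random(seed)
    all_prons = [p for utt in utterances for _, p in utt]
    corrupted = []
    for utt in utterances:
        new_utt = []
        for word, phones in utt:
            if rng.random() < error_rate:
                phones = rng.choice(all_prons)   # substitute a wrong pronunciation
            new_utt.append((word, phones))
        corrupted.append(new_utt)
    return corrupted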
10. Disentangling Style Factors from Speaker Representations
- Author
-
Jennifer Williams and Simon King
- Subjects
Computer science, Linguistics, Style (sociolinguistics)
- Published
- 2019
11. Investigating the Robustness of Sequence-to-Sequence Text-to-Speech Models to Imperfectly-Transcribed Training Data
- Author
-
Zack Hodari, Pilar Oplustil Gallegos, Jason Fong, and Simon King
- Subjects
Training set, Computer science, Robustness (computer science), Speech recognition, Speech synthesis
- Published
- 2019
12. Learning Interpretable Control Dimensions for Speech Synthesis by Using External Data
- Author
-
Oliver Watts, Srikanth Ronanki, Zack Hodari, and Simon King
- Subjects
External data, Computer science, Speech recognition, Speech synthesis, Control (linguistics)
- Abstract
There are many aspects of speech that we might want to control when creating text-to-speech (TTS) systems. We present a general method that enables control of arbitrary aspects of speech, which we demonstrate on the task of emotion control. Current TTS systems use supervised machine learning and are therefore heavily reliant on labelled data. If no labels are available for a desired control dimension, then creating interpretable control becomes challenging. We introduce a method that uses external, labelled data (i.e. not the original data used to train the acoustic model) to enable the control of dimensions that are not labelled in the original data. Adding interpretable control allows the voice to be manually controlled to produce more engaging speech, for applications such as audiobooks. We evaluate our method using a listening test.
- Published
- 2018
13. Exemplar-based Speech Waveform Generation
- Author
-
Felipe Espic, Simon King, Cassia Valentini-Botinhao, and Oliver Watts
- Subjects
Computer science, Speech recognition, Waveform
- Abstract
This paper presents a simple but effective method for generating speech waveforms by selecting small units of stored speech to match a low-dimensional target representation. The method is designed as a drop-in replacement for the vocoder in a deep neural network-based text-to-speech system. Most previous work on hybrid unit selection waveform generation relies on phonetic annotation for determining unit boundaries, or for specifying target cost, or for candidate preselection. In contrast, our waveform generator requires no phonetic information, annotation, or alignment. Unit boundaries are determined by epochs, and spectral analysis provides representations which are compared directly with target features at runtime. As in unit selection, we minimise a combination of target cost and join cost, but find that greedy left-to-right nearest-neighbour search gives similar results to dynamic programming. The method is fast and can generate the waveform incrementally. We use publicly available data and provide a permissively-licensed open source toolkit for reproducing our results.
- Published
- 2018
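The greedy left-to-right search described above reduces to a few lines. This sketch assumes Euclidean distances over placeholder unit features, and leaves epoch detection, spectral analysis, and waveform concatenation aside.

```python
# Greedy nearest-neighbour unit selection: at each target frame, pick the
# stored unit minimising target cost + join cost to the previous unit.
import numpy as np

def greedy_select(targets, unit_feats, unit_starts, unit_ends, w_join=1.0):
    """targets: (T, D) target features; unit_feats: (N, D) per-unit features;
    unit_starts/unit_ends: (N, D) features at each unit's left/right edge."""
    chosen, prev_end = [], None
    for t in targets:
        target_cost = np.linalg.norm(unit_feats - t, axis=1)
        join_cost = 0.0 if prev_end is None else \
            w_join * np.linalg.norm(unit_starts - prev_end, axis=1)
        best = int(np.argmin(target_cost + join_cost))
        chosen.append(best)
        prev_end = unit_ends[best]
    return chosen
```

As the abstract notes, this greedy pass gives results similar to full dynamic programming while permitting incremental generation.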
14. Merlin: An Open Source Neural Network Speech Synthesis System
- Author
-
Oliver Watts, Simon King, and Zhizheng Wu
- Subjects
Artificial neural network, Computer science, Speech recognition, Speech synthesis, Open source, Artificial intelligence
- Abstract
We introduce the Merlin speech synthesis toolkit for neural network-based speech synthesis. The system takes linguistic features as input, and employs neural networks to predict acoustic features, which are then passed to a vocoder to produce the speech waveform. Various neural network architectures are implemented, including a standard feedforward neural network, mixture density neural network, recurrent neural network (RNN), and long short-term memory (LSTM) recurrent neural network, amongst others. The toolkit is Open Source, written in Python, and is extensible. This paper briefly describes the system, and provides some benchmarking results on a freely available corpus.
- Published
- 2016
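The core pipeline the abstract describes (linguistic features in, acoustic features out, vocoder afterwards) amounts to a regression network. The sketch below uses illustrative dimensions, not Merlin's actual defaults, and omits the vocoder stage.

```python
# Linguistic-to-acoustic regression at the heart of the toolkit's pipeline.
import torch.nn as nn

N_LINGUISTIC, N_ACOUSTIC = 400, 187   # placeholder feature sizes

dnn = nn.Sequential(
    nn.Linear(N_LINGUISTIC, 1024), nn.Tanh(),
    nn.Linear(1024, 1024), nn.Tanh(),
    nn.Linear(1024, N_ACOUSTIC),      # e.g. spectral + F0 + aperiodicity params
)
```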
15. GlottDNN — A Full-Band Glottal Vocoder for Statistical Parametric Speech Synthesis
- Author
-
Paavo Alku, Simon King, Manu Airaksinen, Lauri Juvela, Zhizheng Wu, and Bajibabu Bollepalli
- Subjects
Artificial neural network, Computer science, Speech recognition, Full band, Speech synthesis, Hidden Markov model, Vocal tract, Human voice, Parametric statistics
- Abstract
GlottHMM is a previously developed vocoder that has been successfully used in HMM-based synthesis by parameterizing speech into two parts (glottal flow, vocal tract) according to the functioning of the real human voice production mechanism. In this study, a new glottal vocoding method, GlottDNN, is proposed. The GlottDNN vocoder is built on the principles of its predecessor, GlottHMM, but the new vocoder introduces three main improvements: GlottDNN (1) takes advantage of a new, more accurate glottal inverse filtering method, (2) uses a new method of deep neural network (DNN) -based glottal excitation generation, and (3) proposes a new approach of band-wise processing of full-band speech. The proposed GlottDNN vocoder was evaluated as part of a full-band state-of-the-art DNN-based text-to-speech (TTS) synthesis system, and compared against the release version of the original GlottHMM vocoder, and the well-known STRAIGHT vocoder. The results of the subjective listening test indicate that GlottDNN improves the TTS quality over the compared methods.
- Published
- 2016
16. A Template-Based Approach for Speech Synthesis Intonation Generation Using LSTMs
- Author
-
Zhizheng Wu, Simon King, Srikanth Ronanki, and Gustav Eje Henter
- Subjects
Computer science, Speech recognition, Intonation (music), Speech synthesis, Template based
- Published
- 2016
17. The Use of Locally Normalized Cepstral Coefficients (LNCC) to Improve Speaker Recognition Accuracy in Highly Reverberant Rooms
- Author
-
Juan Pablo Escudero, Richard M. Stern, Néstor Becerra Yoma, José Novoa, Simon King, Víctor Poblete, and Josué Fredes
- Subjects
Computer science, Speech recognition, Mel-frequency cepstrum, Speaker recognition
- Published
- 2016
18. Minimum trajectory error training for deep neural networks, combined with stacked bottleneck features
- Author
-
Zhizheng Wu and Simon King
- Subjects
Mean squared error, Estimation theory, Computer science, Speech recognition, Acoustic model, Context (language use), Speech synthesis, Pattern recognition, Bottleneck, Artificial intelligence, Parametric statistics
- Abstract
Recently, Deep Neural Networks (DNNs) have shown promise as an acoustic model for statistical parametric speech synthesis. Their ability to learn complex mappings from linguistic features to acoustic features has advanced the naturalness of synthesised speech significantly. However, because DNN parameter estimation methods typically attempt to minimise the mean squared error of each individual frame in the training data, the dynamic and continuous nature of speech parameters is neglected. In this paper, we propose a training criterion that minimises speech parameter trajectory errors, and so takes dynamic constraints from a wide acoustic context into account during training. We combine this novel training criterion with our previously proposed stacked bottleneck features, which provide wide linguistic context. Both objective and subjective evaluation results confirm the effectiveness of the proposed training criterion for improving model accuracy and naturalness of synthesised speech.
- Published
- 2015
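The proposed criterion trains on generated trajectories rather than frame-wise means. Below is a simplified sketch assuming unit variances, so that parameter generation reduces to a least-squares solve over a statics-plus-deltas window matrix; the paper's formulation, with predicted variances and stacked bottleneck features, is richer.

```python
# Trajectory-error loss: generate the smoothed trajectory from predicted
# statics + deltas, then compare it against the natural trajectory.
import torch

def delta_matrix(T):
    """Stack identity (statics) and a central-difference delta window."""
    I = torch.eye(T)
    D = torch.zeros(T, T)
    for t in range(1, T - 1):
        D[t, t - 1], D[t, t + 1] = -0.5, 0.5
    return torch.cat([I, D], dim=0)          # (2T, T)

def trajectory_loss(pred_mu, natural):
    """pred_mu: (2T,) predicted statics + deltas; natural: (T,) target."""
    T = natural.shape[0]
    W = delta_matrix(T)
    # c = (W^T W)^{-1} W^T mu, differentiable so gradients reach the network.
    c = torch.linalg.solve(W.T @ W, W.T @ pred_mu)
    return torch.nn.functional.mse_loss(c, natural)
```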
19. Robustness to additive noise of locally-normalized cepstral coefficients in speaker verification
- Author
-
José Novoa, Richard M. Stern, Víctor Poblete, Josué Fredes, Simon King, and Néstor Becerra Yoma
- Subjects
Speaker verification, Robustness (computer science), Computer science, Speech recognition, Mel-frequency cepstrum
- Published
- 2015
20. Towards minimum perceptual error training for DNN-based speech synthesis
- Author
-
Zhizheng Wu, Simon King, and Cassia Valentini-Botinhao
- Subjects
Computer science, Speech recognition, Speech synthesis, Pattern recognition, Fundamental frequency, Noise, Cepstrum, Mel-frequency cepstrum, Artificial intelligence
- Abstract
We propose to use a perceptually-oriented domain to improve the quality of text-to-speech generated by deep neural networks (DNNs). We train a DNN that predicts the parameters required for speech reconstruction but whose cost function is calculated in another domain. In this paper, to represent this perceptual domain we extract an approximated version of the Spectro-Temporal Excitation Pattern that was originally proposed as part of a model of hearing speech in noise. We train DNNs that predict band aperiodicity, fundamental frequency and Mel cepstral coefficients and compare generated speech when the spectral cost function is defined in the Mel cepstral, warped log spectrum or perceptual domains. Objective results indicate that the perceptual domain system achieves the highest quality.
- Published
- 2015
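The key idea above (predict in one domain, compute the loss in another) can be sketched with a fixed linear cepstrum-to-log-spectrum transform standing in for the perceptual domain. The paper's Spectro-Temporal Excitation Pattern approximation is considerably more elaborate than this linear stand-in.

```python
# Compute the spectral loss in a different domain from the prediction domain:
# map cepstra to a log-spectral envelope via a fixed DCT-like basis first.
import math
import torch

N_CEP, N_FREQ = 60, 257
k = torch.arange(N_CEP).float()
f = torch.arange(N_FREQ).float().unsqueeze(1)
BASIS = torch.cos(math.pi * f * k / N_FREQ)        # (N_FREQ, N_CEP)

def domain_loss(pred_cep, target_cep):
    """MSE between log-spectra reconstructed from predicted/target cepstra."""
    return torch.nn.functional.mse_loss(pred_cep @ BASIS.T, target_cep @ BASIS.T)
```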
21. Reconstructing voices within the multiple-average-voice-model framework
- Author
-
Mark J. F. Gales, Pierre Lanchantin, Junichi Yamagishi, Christophe Veaux, and Simon King
- Subjects
Computer science, Speech recognition, Speech synthesis, Speaker recognition, Speech disorder, Adaptation (computer science), Hidden Markov model, Interpolation
- Abstract
Personalisation of voice output communication aids (VOCAs) makes it possible to preserve the vocal identity of people suffering from speech disorders. This can be achieved by the adaptation of HMM-based speech synthesis systems using a small amount of adaptation data. When the voice has begun to deteriorate, reconstruction is still possible in the statistical domain by correcting the parameters of the models associated with the speech disorder. This can be done by substituting those with parameters from a donor's voice, at the risk of losing part of the identity of the patient. Recently, the Multiple-Average-Voice-Model (Multiple AVM) framework has been proposed for speaker adaptation. Adaptation is performed via interpolation into a speaker eigenspace spanned by the mean vectors of speaker-adapted AVMs which can be tuned to the individual speaker. In this paper, we present the benefits of this framework for voice reconstruction: it requires only a very small amount of adaptation data, interpolation can be performed in a clean speech eigenspace, and the resulting voice can be easily fine-tuned by acting on the interpolation weights. We illustrate our points with a subjective assessment of the reconstructed voice. Index Terms: HMM-based speech synthesis, speaker adaptation, multiple average voice model, cluster adaptive training, voice reconstruction, voice output communication aids.
- Published
- 2015
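The interpolation step at the heart of this framework is a weighted combination of speaker-adapted average-voice-model means. A sketch follows, with estimation of the weights from adaptation data omitted.

```python
# Interpolate a reconstructed voice inside the speaker eigenspace spanned
# by AVM mean vectors; fine-tuning the voice means moving the weights.
import numpy as np

def interpolate_voice(avm_means, weights):
    """avm_means: (K, D) mean vectors of K speaker-adapted AVMs;
    weights: (K,) interpolation weights (renormalised onto the simplex)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return weights @ np.asarray(avm_means)     # (D,) interpolated mean vector
```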
22. Investigating automatic & human filled pause insertion for speech synthesis
- Author
-
William Byrne, Rasmus Dall, Marcus Tomalin, Simon King, and Mirjam Wester
- Subjects
Conversational speech, Computer science, Speech recognition, Speech synthesis, Support vector machine, Recurrent neural network, Perception, Language model, Artificial intelligence, Hidden Markov model, Natural language processing
- Abstract
Filled pauses are pervasive in conversational speech and have been shown to serve several psychological and structural purposes. Despite this, they are seldom modelled overtly by state-of-the-art speech synthesis systems. This paper seeks to motivate the incorporation of filled pauses into speech synthesis systems by exploring their use in conversational speech, and by comparing the performance of several automatic systems inserting filled pauses into fluent text. Two initial experiments are described which seek to determine whether people's predicted insertion points are consistent with actual practice and/or with each other. The experiments also investigate whether there are 'right' and 'wrong' places to insert filled pauses. The results show good consistency between people's predictions of usage and their actual practice, as well as a perceptual preference for the 'right' placement. The third experiment contrasts the performance of several automatic systems that insert filled pauses into fluent sentences. The best performance (determined by F-score) was achieved through the by-word interpolation of probabilities predicted by Recurrent Neural Network and 4-gram Language Models. The results offer insights into the use and perception of filled pauses by humans, and how automatic systems can be used to predict their locations. Index Terms: filled pause, HMM TTS, SVM, RNN
- Published
- 2014
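The best-performing insertion scheme reported (by-word interpolation of RNN and 4-gram language model probabilities) can be sketched as below. `rnn_prob` and `ngram_prob` are hypothetical callables returning the probability of a filled pause given the preceding context, and the threshold is illustrative.

```python
# Insert a filled pause wherever the interpolated LM probability of one
# exceeds a threshold, scanning the sentence word by word.
def insert_filled_pauses(words, rnn_prob, ngram_prob, lam=0.5, thresh=0.3):
    out = []
    for i, word in enumerate(words):
        context = words[:i]
        p = lam * rnn_prob(context) + (1 - lam) * ngram_prob(context)
        if p > thresh:
            out.append("uh")          # one possible filled-pause token
        out.append(word)
    return out
```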
23. Measuring the perceptual effects of modelling assumptions in speech synthesis using stimuli constructed from repeated natural speech
- Author
-
Gustav Eje Henter, Catherine Mayo, Simon King, Matt Shannon, and Thomas Merritt
- Subjects
Computer science, Speech recognition, Speech synthesis, Speech corpus, Filter (signal processing), Covariance, Independence (probability theory), Parametric statistics
- Abstract
Acoustic models used for statistical parametric speech synthesis typically incorporate many modelling assumptions. It is an open question to what extent these assumptions limit the naturalness of synthesised speech. To investigate this question, we recorded a speech corpus where each prompt was read aloud multiple times. By combining speech parameter trajectories extracted from different repetitions, we were able to quantify the perceptual effects of certain commonly used modelling assumptions. Subjective listening tests show that taking the source and filter parameters to be conditionally independent, or using diagonal covariance matrices, significantly limits the naturalness that can be achieved. Our experimental results also demonstrate the shortcomings of mean-based parameter generation. Index terms: speech synthesis, acoustic modelling, stream independence, diagonal covariance matrices, repeated speech
- Published
- 2014
24. Combining perceptually-motivated spectral shaping with loudness and duration modification for intelligibility enhancement of HMM-based synthetic speech in noise
- Author
-
Yannis Stylianou, Junichi Yamagishi, Cassia Valentini-Botinhao, and Simon King
- Subjects
Computer science, Speech recognition, Intelligibility (communication), Loudness, Spectral shaping, Speech in noise, Dynamic range compression, Hidden Markov model
- Abstract
This paper presents our entry to a speech-in-noise intelligibility enhancement evaluation: the Hurricane Challenge. The system consists of a Text-To-Speech voice manipulated through a combination of enhancement strategies, each of which is known to be individually successful: a perceptually-motivated spectral shaper based on the Glimpse Proportion measure, dynamic range compression, and adaptation to Lombard excitation and duration patterns. We achieved substantial intelligibility improvements relative to unmodified synthetic speech: 4.9 dB in competing speaker and 4.1 dB in speech-shaped noise. An analysis conducted across this and other two similar evaluations shows that the spectral shaper and the compressor (both of which are loudness boosters) contribute most under higher SNR conditions, particularly for speech-shaped noise. Duration and excitation Lombard-adapted changes are more beneficial in lower SNR conditions, and for competing speaker noise.
- Published
- 2013
25. TUNDRA: a multilingual corpus of found data for TTS research created with light supervision
- Author
-
Robert A. J. Clark, Mircea Giurgiu, Simon King, Junichi Yamagishi, Adriana Stan, Oliver Watts, and Yoshitaka Mamiya
- Subjects
Computer science, Annotation, Artificial intelligence, Natural language processing
- Abstract
Simple4All Tundra (version 1.0) is the first release of a standardised multilingual corpus designed for text-to-speech research with imperfect or found data. The corpus consists of approximately 60 hours of speech data from audiobooks in 14 languages, as well as utterance-level alignments obtained with a lightly-supervised process. Future versions of the corpus will include finer-grained alignment and prosodic annotation, all of which will be made freely available. This paper gives a general outline of the data collected so far, as well as a detailed description of how this has been done, emphasizing the minimal language-specific knowledge and manual intervention used to compile the corpus. To demonstrate its potential use, text-to-speech systems have been built for all languages using unsupervised or lightly supervised methods, also briefly presented in the paper. Index Terms: multilingual corpus, light supervision, imperfect data, found data, text-to-speech, audiobook data
- Published
- 2013
26. Detecting acronyms from capital letter sequences in Spanish
- Author
-
San-Segundo, R., Montero, J. M., López-Ludeña, V., and Simon King
- Subjects
Telecommunications
- Abstract
This paper presents an automatic strategy to decide how to pronounce a Capital Letter Sequence (CLS) in a Text-to-Speech (TTS) system. If the CLS is well known to the TTS, it can be expanded into several words. But when the CLS is unknown, the system has two alternatives: spelling it out (abbreviation) or pronouncing it as a new word (acronym). In Spanish, there is a close relationship between letters and phonemes, so when a CLS is similar to other words in Spanish, there is a strong tendency to pronounce it as a standard word. This paper proposes an automatic method for detecting acronyms. Additionally, it analyses the discrimination capability of several features, and several strategies for combining them in order to obtain the best classifier. The best classifier achieves a classification error of 8.45%. In the feature analysis, the best features were the Letter Sequence Perplexity and the Average N-gram order.
- Published
- 2012
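The strongest feature reported, Letter Sequence Perplexity, is the perplexity of the capital letter sequence under a letter n-gram model: a CLS with low perplexity looks like an ordinary Spanish word, so it is more likely to be read as an acronym than spelled out. Below is a bigram sketch with add-alpha smoothing; the paper's exact model order and smoothing are assumptions here.

```python
# Letter-bigram perplexity of a capital letter sequence.
import math
from collections import Counter

def train_letter_bigram(words):
    big, uni = Counter(), Counter()
    for w in words:
        w = "^" + w.lower() + "$"      # boundary markers
        uni.update(w[:-1])
        big.update(zip(w, w[1:]))
    return big, uni

def letter_perplexity(cls, big, uni, alpha=1.0, vocab=30):
    s = "^" + cls.lower() + "$"
    logp = 0.0
    for a, b in zip(s, s[1:]):
        p = (big[(a, b)] + alpha) / (uni[a] + alpha * vocab)  # add-alpha smoothing
        logp += math.log(p)
    return math.exp(-logp / (len(s) - 1))
```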
27. Using HMM-based speech synthesis to reconstruct the voice of individuals with degenerative speech disorders
- Author
-
Veaux, C., Yamagishi, J., and Simon King
- Published
- 2012
28. Mel cepstral coefficient modification based on the glimpse proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise
- Author
-
Junichi Yamagishi, Cassia Valentini-Botinhao, and Simon King
- Subjects
Computer science, Speech recognition, Intelligibility (communication), Speech in noise, Cepstrum, Mel-frequency cepstrum, Hidden Markov model
- Abstract
We propose a method that modifies the Mel cepstral coefficients of HMM-generated synthetic speech in order to increase the intelligibility of the generated speech when heard by a listener in the presence of a known noise. This method is based on an approximation we previously proposed for the Glimpse Proportion measure. Here we show how to update the Mel cepstral coefficients using this measure as an optimization criterion and how to control the amount of distortion by limiting the frequency resolution of the modifications. To evaluate the method we built eight different voices from normal read-text speech data from a male speaker. Some voices were also built from Lombard speech data produced by the same speaker. Listening experiments with speech-shaped noise and with a single competing talker indicate that our method significantly improves intelligibility when compared to unmodified synthetic speech. The voices built from Lombard speech outperformed the proposed method particularly for the competing talker case. However, compared to a voice using only the spectral parameters from Lombard speech, the proposed method obtains similar or higher performance.
- Published
- 2012
29. Can objective measures predict the intelligibility of modified HMM-based synthetic speech in noise?
- Author
-
Valentini-Botinhao, C., Yamagishi, J., and Simon King
- Abstract
Synthetic speech can be modified to improve intelligibility in noise. In order to perform modifications automatically, it would be useful to have an objective measure that could predict the intelligibility of modified synthetic speech for human listeners. We analysed the impact on intelligibility, and on how well objective measures predict it, when we separately modify speaking rate, fundamental frequency, line spectral pairs and spectral peaks. Shifting LSPs can increase intelligibility for human listeners; other modifications had weaker effects. Among the objective measures we evaluated, the Dau model and the Glimpse proportion were the best predictors of human performance.
- Published
- 2011
30. Thousands of voices for HMM-based speech synthesis
- Author
-
Yamagishi, J., Usabaev, B., Simon King, Watts, O., Dines, J., Tian, J., Hu, R., Guan, Y., Oura, K., Tokuda, K., Karhila, R., and Kurimo, M.
- Abstract
Our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an 'average voice model' plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on 'non-TTS' corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper we show thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal databases (WSJ0/WSJ1/WSJCAM0), Resource Management, Globalphone and Speecon. We report some perceptual evaluation results and outline the outstanding issues.
- Published
- 2009
31. A posterior approach for microphone array based speech recognition
- Author
-
Wang, D., Himawan, I., Frankel, J., and Simon King
- Abstract
Automatic speech recognition (ASR) becomes rather difficult in the meetings domain because of the adverse acoustic conditions, including more background noise, more echo and reverberation, and frequent cross-talk. Microphone arrays have been demonstrated to boost ASR performance dramatically in such noisy and reverberant environments, using various beamforming algorithms. However, almost all existing beamforming measures work in the acoustic domain, resorting to signal processing theories and geometric explanation. This limits their application, and induces significant performance degradation when the geometric property is unavailable or hard to estimate, or if heterogeneous channels exist in the audio system. In this paper, we present a new posterior-based approach for array-based speech recognition. The main idea is that, instead of enhancing speech signals, we try to enhance the posterior probabilities that frames belong to recognition units, e.g., phones. These enhanced posteriors are then converted into posterior-probability-based features and are modelled by HMMs, leading to a tandem ANN-HMM hybrid system as presented by Hermansky et al. Experimental results demonstrated the validity of this posterior approach. With posterior accumulation or enhancement, significant improvement was achieved over the single-channel baseline. Moreover, we can combine acoustic enhancement and posterior enhancement together, leading to a hybrid acoustic-posterior beamforming approach, which works significantly better than acoustic beamforming alone, especially in scenarios with moving speakers.
- Published
- 2008
32. A shrinkage estimator for speech recognition with full covariance HMMs
- Author
-
Bell, P. and Simon King
- Abstract
We consider the problem of parameter estimation in full-covariance Gaussian mixture systems for automatic speech recognition. Due to the high dimensionality of the acoustic feature vector, the standard sample covariance matrix has a high variance and is often poorly-conditioned when the amount of training data is limited. We explain how the use of a shrinkage estimator can solve these problems, and derive a formula for the optimal shrinkage intensity. We present results of experiments on a phone recognition task, showing that the estimator gives a performance improvement over a standard full-covariance system.
- Published
- 2008
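The estimator's shape is simple: shrink the sample covariance towards a better-conditioned target. A sketch with a diagonal target follows, leaving the paper's derived optimal intensity as an argument (scikit-learn's LedoitWolf implements a closely related estimator).

```python
# Shrinkage covariance estimate: a convex combination of the sample
# covariance and a well-conditioned diagonal target.
import numpy as np

def shrinkage_covariance(X, lam):
    """X: (n_frames, dim) acoustic features; lam in [0, 1] is the intensity."""
    S = np.cov(X, rowvar=False)               # sample covariance
    T = np.diag(np.diag(S))                   # diagonal shrinkage target
    return lam * T + (1.0 - lam) * S          # better conditioned than S alone
```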
33. Investigating festival's target cost function using perceptual experiments
- Author
-
Volker Strom and Simon King
- Subjects
Computer science, Speech recognition, Selection (linguistics), Speech synthesis, Context (language use)
- Abstract
We describe an investigation of the target cost used in the Festival unit selection speech synthesis system [1]. Our ultimate goal is to automatically learn a perceptually optimal target cost function. In this study, we investigated the behaviour of the target cost for one segment type. The target cost is based on counting the mismatches in several context features. A carrier sentence (“My name is Roger”) was synthesised using all 147,820 possible combinations of the diphones /n ei/ and /ei m/. 92 representative versions were selected and presented to listeners as 460 pairwise comparisons. The listeners’ preference votes were used to analyse the behaviour of the target cost, with respect to the values of its component linguistic context features.
- Published
- 2008
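The target cost under investigation is a weighted count of context-feature mismatches between the target and a candidate unit. In sketch form, with illustrative feature names and weights:

```python
# Festival-style target cost: sum the weights of mismatching context features.
DEFAULT_WEIGHTS = {"stress": 1.0, "phrase_pos": 1.0,
                   "left_phone": 0.5, "right_phone": 0.5}

def target_cost(target_ctx, candidate_ctx, weights=DEFAULT_WEIGHTS):
    """target_ctx/candidate_ctx: dicts of context feature values."""
    return sum(w for feat, w in weights.items()
               if target_ctx.get(feat) != candidate_ctx.get(feat))
```

The paper's listening tests probe how listeners' preferences relate to which of these component features mismatch.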
34. Unsupervised adaptation for HMM-based speech synthesis
- Author
-
Heiga Zen, Junichi Yamagishi, Simon King, Keiichi Tokuda, Engineering and Physical Sciences Research Council (EPSRC), and European Commission (Seventh Framework Programme)
- Subjects
Computer science, Speech recognition, HMM speech synthesis, Speech synthesis, Pattern recognition, Speaker recognition, Artificial intelligence, Hidden Markov model, Adaptation (computer science)
- Abstract
It is now possible to synthesise speech using HMMs with a comparable quality to unit-selection techniques. Generating speech from a model has many potential advantages over concatenating waveforms. The most exciting is model adaptation. It has been shown that supervised speaker adaptation can yield high-quality synthetic voices with an order of magnitude less data than required to train a speaker-dependent model or to build a basic unit-selection system. Such supervised methods require labelled adaptation data for the target speaker. In this paper, we introduce a method capable of unsupervised adaptation, using only speech from the target speaker without any labelling. Index Terms: speech synthesis, HMM-based speech synthesis, HTS, trajectory HMMs, speaker adaptation, MLLR
- Published
- 2008
35. Sparse Gaussian graphical models for speech recognition
- Author
-
Simon King and Peter Bell
- Subjects
Computer science, Speech recognition, Gaussian, Speech technology, Pattern recognition, Covariance, Graphical model, Artificial intelligence
- Abstract
We address the problem of learning the structure of Gaussian graphical models for use in automatic speech recognition, a means of controlling the form of the inverse covariance matrices of such systems. With particular focus on data sparsity issues, we implement a method for imposing graphical model structure on a Gaussian mixture system, using a convex optimisation technique to maximise a penalised likelihood expression. The results of initial experiments on a phone recognition task show a performance improvement over an equivalent full-covariance system.
- Published
- 2007
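One established convex formulation of this penalised-likelihood structure learning is the graphical lasso, which estimates a sparse inverse covariance via an l1-penalised likelihood. Whether it matches the paper's exact objective is an assumption, but it conveys the idea using scikit-learn:

```python
# Learn a sparse Gaussian graphical model: zeros in the estimated precision
# matrix correspond to missing edges (conditional independences).
import numpy as np
from sklearn.covariance import GraphicalLasso

X = np.random.default_rng(0).normal(size=(500, 39))   # stand-in acoustic frames
gl = GraphicalLasso(alpha=0.05).fit(X)
precision = gl.precision_      # sparse inverse covariance = graph structure
```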
36. Modelling prominence and emphasis improves unit-selection synthesis
- Author
-
Dan Jurafsky, Simon King, Ani Nenkova, Jason Brenier, Volker Strom, Robert A. J. Clark, and Yolanda Vazquez-Alvarez
- Subjects
Pitch accent, Computer science, Speech recognition, Speech synthesis, Perception, Selection (linguistics), Artificial intelligence, Prosody, Natural language processing
- Abstract
We describe the results of large scale perception experiments showing improvements in synthesising two distinct kinds of prominence: standard pitch-accent and strong emphatic accents. Previously prominence assignment has been mainly evaluated by computing accuracy on a prominence-labelled test set. By contrast we integrated an automatic pitch-accent classifier into the unit selection target cost and showed that listeners preferred these synthesised sentences. We also describe an improved recording script for collecting emphatic accents, and show that generating emphatic accents leads to further improvements in the fiction genre over incorporating pitch accent only. Finally, we show differences in the effects of prominence between child-directed speech and news and fiction genres. Index Terms: speech synthesis, prosody, prominence, pitch accent, unit selection
- Published
- 2007
37. Articulatory feature classifiers trained on 2000 hours of telephone speech
- Author
-
Joe Frankel, Simon King, Karen Livescu, Mathew Magimai-Doss, and Özgür Çetin
- Subjects
Computer science, Speech recognition, Speech technology, Feature (machine learning), Perceptron
- Abstract
The so-called tandem approach, where the posteriors of a multilayer perceptron (MLP) classifier are used as features in an automatic speech recognition (ASR) system, has proven to be a very effective method. Most tandem approaches to date have relied on MLPs trained for phone classification, and have appended the posterior features to the standard features of a hidden Markov model (HMM) system. In this paper, we develop an alternative tandem approach based on MLPs trained for articulatory feature (AF) classification. We also develop a factored observation model for characterizing the posterior and standard features at the HMM outputs, allowing for separate hidden mixture and state-tying structures for each factor. In experiments on a subset of Switchboard, we show that the AF-based tandem approach is as effective as the phone-based approach, and that the factored observation model significantly outperforms the simple feature concatenation approach while using fewer parameters.
- Published
- 2007
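The tandem recipe this abstract builds on, reduced to its feature-level core: compress the MLP's articulatory-feature posteriors and append them to the standard acoustic features. The log-plus-normalisation step is one common choice, and the MLP itself and the factored observation model are not shown.

```python
# Build tandem features by appending processed AF posteriors to MFCCs.
import numpy as np

def tandem_features(mfcc, af_posteriors, eps=1e-8):
    """mfcc: (T, 39) standard features; af_posteriors: (T, n_af) MLP outputs."""
    logp = np.log(af_posteriors + eps)          # compress the posteriors
    logp -= logp.mean(axis=0, keepdims=True)    # simple per-utterance normalisation
    return np.concatenate([mfcc, logp], axis=1)
```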
38. Predicting consonant duration with Bayesian belief networks
- Author
-
Simon King and Olga Goubanova
- Subjects
Consonant, Bayesian statistics, Computer science, Statistics, Bayesian network, Graphical model
- Abstract
Consonant duration is influenced by a number of linguistic factors such as the consonant's identity, within-word position, stress level of the previous and following vowels, phrasal position of the word containing the target consonant, its syllabic position, and the identity of the previous and following segments. In our work, consonant duration is predicted from a Bayesian belief network (BN) consisting of discrete nodes for the linguistic factors and a single continuous node for the consonant's duration. Interactions between factors are represented as conditional dependency arcs in this graphical model. Given the parameters of the belief network, the duration of each consonant in the test set is then predicted as the value with the maximum probability. We compare the results of the belief network model with those of sums-of-products (SoP) and classification and regression tree (CART) models using the same data. In terms of RMS error, our BN model performs better than both the CART and SoP models. In terms of the correlation coefficient, our BN model performs better than the SoP model, and no worse than the CART model. In addition, the Bayesian model reliably predicts consonant duration in cases of missing or hidden linguistic factors.
- Published
- 2005
39. A hybrid ANN/DBN approach to articulatory feature recognition
- Author
-
Joe Frankel and Simon King
- Subjects
Artificial neural network, Computer science, Speech recognition, Articulatory feature recognition, Dynamic Bayesian network, Pattern recognition, Viterbi algorithm, Machine learning, Artificial intelligence
- Abstract
Artificial neural networks (ANN) have proven to be well suited to the task of articulatory feature (AF) recognition. Previous studies have taken a cascaded approach where separate ANNs are trained for each feature group, making the assumption that features are statistically independent. We address this by using ANNs to provide virtual evidence to a dynamic Bayesian network (DBN). This gives a hybrid ANN/DBN model and allows modelling of inter-feature dependencies. We demonstrate significant increases in AF recognition accuracy from modelling dependencies between features, and present the results of embedded training experiments in which a set of asynchronous feature changes are learned. Furthermore, we report on the application of a Viterbi training scheme in which we alternate between realigning the AF training labels and retraining the ANNs.
- Published
- 2005
40. SVitchboard 1: small vocabulary tasks from Switchboard
- Author
-
Chris Bartels, Jeff A. Bilmes, and Simon King
- Subjects
Vocabulary, Markov chain, Computer science, Speech recognition, Artificial intelligence, Hidden Markov model, Natural language processing
- Published
- 2005
41. Genetic triangulation of graphical models for speech and language processing
- Author
-
Katrin Kirchhoff, Kevin Duh, Simon King, Jeff A. Bilmes, and Chris Bartels
- Subjects
Language identification, Computer science, Triangulation (computer vision), Graphical model, Artificial intelligence, Natural language processing
- Published
- 2005
42. Objective distance measures for spectral discontinuities in concatenative speech synthesis
- Author
-
Jithendra Vepa, Simon King, and Paul Taylor
- Published
- 2002
43. An automatic speech recognition system using neural networks and linear dynamic models to recover and model articulatory traces
- Author
-
Frankel, J., Korin Richmond, Simon King, and Paul Taylor
- Abstract
We describe a speech recognition system which uses articulatory parameters as basic features and phone-dependent linear dynamic models. The system first estimates articulatory trajectories from the speech signal. Estimates of the x and y coordinates of 7 actual articulator positions in the midsagittal plane are produced every 2 milliseconds by a recurrent neural network, trained on real articulatory data. The output of this network is then passed to a set of linear dynamic models, which perform phone recognition.
- Published
- 2000
44. Measuring the Cognitive Load of Synthetic Speech Using a Dual Task Paradigm
- Author
-
Simon King and Avashna Govender
- Subjects
Dual-task paradigm, Computer science, Speech recognition, Cognitive load, Speech synthesis, Speech communication
- Abstract
We present a methodology for measuring the cognitive load (listening effort) of synthetic speech using a dual task paradigm. Cognitive load is calculated from changes in a listener’s performance on a secondary task (e.g., reaction time to decide if a visually-displayed digit is odd or even). Previous related studies have only found significant differences between the best and worst quality systems but failed to separate the systems that lie in between. A paradigm that is sensitive enough to detect differences between state-of-the-art, high quality speech synthesizers would be very useful for advancing the state of the art. In our work, four speech synthesis systems from a previous Blizzard Challenge, and the corresponding natural speech, were compared. Our results show that reaction times slow down as speech quality reduces, as we expected: lower quality speech imposes a greater cognitive load, taking resources away from the secondary task. However, natural speech did not have the fastest reaction times. This intriguing result might indicate that, as speech synthesizers attain near-perfect intelligibility, this paradigm is measuring something like the listener’s level of sustained attention and not listening effort.
45. Using Pupillometry to Measure the Cognitive Load of Synthetic Speech
- Author
-
Avashna Govender and Simon King
- Subjects
Computer science, Speech recognition, Cognitive load, Pupillometry, Speech synthesis
- Abstract
It is common to evaluate synthetic speech using listening tests in which intelligibility is measured by asking listeners to transcribe the words heard, and naturalness is measured using Mean Opinion Scores. But, for real-world applications of synthetic speech, the effort (cognitive load) required to understand the synthetic speech may be a more appropriate measure. Cognitive load has been investigated in the past, when rule-based speech synthesizers were popular, but there is little or no recent work using state-of-the-art text-to-speech. Studies on the understanding of natural speech have shown that the pupil dilates when increased mental effort is exerted to perform a task. We use pupillometry to measure the cognitive load of synthetic speech submitted to two of the Blizzard Challenge evaluations. Our results show that pupil dilation is sensitive to the quality of synthetic speech. In all cases, synthetic speech imposes a higher cognitive load than natural speech. Pupillometry is therefore proposed as a sensitive measure that can be used to evaluate synthetic speech.
46. A Sound Engineering Approach to Near End Listening Enhancement
- Author
-
Simon King and Carol Chermaz
- Subjects
Computer science, Acoustics
47. 18th Blizzard Challenge Workshop, Grenoble, France, August 29, 2023
- Author
-
Olivier Perrotin, Gérard Bailly, and Simon King
- Published
- 2023
48. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, Shanghai, China, October 30, 2020
- Author
-
Junichi Yamagishi, Zhenhua Ling, Rohan Kumar Das, Simon King, Tomi Kinnunen, Tomoki Toda, Wen-Chin Huang, Xiao Zhou, Xiaohai Tian, and Yi Zhao
- Published
- 2020