Author: "Bonafonte Cávez, Antonio" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Bonafonte Cávez, Antonio"' showing total 176 results

1. Efficient, end-to-end and self-supervised methods for speech processing and generation

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, Serra Julià, Joan, and Pascual de la Puente, Santiago
Abstract: UPC Extraordinary Doctoral Award, academic year 2019-2020, ICT Engineering field, Deep learning has affected the speech processing and generation fields in many directions. First, end-to-end architectures allow the direct injection and synthesis of waveform samples. Second, the exploration of efficient solutions allows these systems to be implemented in computationally restricted environments, such as smartphones. Finally, the latest trends exploit audio-visual data with minimal supervision. This thesis explores these three directions. First, we propose the use of recent pseudo-recurrent structures, such as self-attention models and quasi-recurrent networks, to build acoustic models for text-to-speech. The proposed system, QLAD, synthesizes faster on CPU and GPU than its recurrent counterpart while preserving synthesis quality, which is competitive with state-of-the-art vocoder-based models. Then, a generative adversarial network for speech enhancement, named SEGAN, is proposed. This model works as a speech-to-speech conversion system in the time domain, where a single inference operation through a fully convolutional structure processes all samples. This implies an increase in modeling efficiency with respect to other existing models, which are auto-regressive and also work in the time domain. SEGAN achieves prominent results in noise suppression and in preserving speech naturalness and intelligibility when compared to other classic and deep regression-based systems. We also show that SEGAN is efficient in transferring its operations to new languages and noises. A SEGAN trained for English performs on Catalan and Korean similarly to how it performs on English, with only 24 seconds of adaptation data. Finally, we unveil the generative capacity of the model to recover signals from several distortions. We hence propose the concept of generalized speech enhancement. First, the model proves to be effective at recovering voiced speech from whispered speech. Then the model is scaled up to solve other distortions that require a recompos, Award-winning, Postprint (published version)
Published: 2020

2. Time-domain speech enhancement using generative adversarial networks

Author: Universitat Politècnica de Catalunya. Doctorat en Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Pascual de la Puente, Santiago, Serra, Joan, and Bonafonte Cávez, Antonio
Abstract: Speech enhancement improves recorded voice utterances to eliminate noise that might be impeding their intelligibility or compromising their quality. Typical speech enhancement systems are based on regression approaches that subtract noise or predict clean signals. Most of them do not operate directly on waveforms. In this work, we propose a generative approach to regenerate corrupted signals into a clean version by using generative adversarial networks on the raw signal. We also explore several variations of the proposed system, obtaining insights into proper architectural choices for an adversarially trained, convolutional autoencoder applied to speech. We conduct both objective and subjective evaluations to assess the performance of the proposed method. The former helps us choose among variations and better tune hyperparameters, while the latter is used in a listening experiment with 42 subjects, confirming the effectiveness of the approach in the real world. We also demonstrate the applicability of the approach for more generalized speech enhancement, where we have to regenerate voices from whispered signals., Peer Reviewed, Postprint (author's final draft)
Published: 2019

3. Exploring efficient neural architectures for linguistic-acoustic mapping in text-to-speech

Author: Universitat Politècnica de Catalunya. Doctorat en Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Pascual de la Puente, Santiago, Serra, Joan, and Bonafonte Cávez, Antonio
Abstract: Conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models such as recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure with intermediate affine transformations tends to make them slow to train and to sample from. In this work, we explore two different mechanisms that enhance the operational efficiency of recurrent neural networks, and study their performance–speed trade-off. The first mechanism is based on the quasi-recurrent neural network, where expensive affine transformations are removed from temporal connections and placed only on feed-forward computational directions. The second mechanism includes a module based on the transformer decoder network, designed without recurrent connections but emulating them with attention and positioning codes. Our results show that the proposed decoder networks are competitive in terms of distortion when compared to a recurrent baseline, whilst being significantly faster in terms of CPU and GPU inference time. The best performing model is the one based on the quasi-recurrent mechanism, reaching the same level of naturalness as the recurrent neural network based model with a speedup of 11.2 on CPU and 3.3 on GPU., Peer Reviewed, Postprint (published version)
Published: 2019

4. Visualizing punctuation restoration in speech transcripts with prosograph

Author: Oktem, A., Farrús, M., Bonafonte Cávez, Antonio, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, and Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla
Subjects: Precision and recall, Speech communication, Automatic speech recognition, Reconeixement automàtic de la parla, Prosody, Speech transcripts, Speech recognition, Speech transcriptions, Testing environment, Neural architectures, Punctuation, ComputingMethodologies_PATTERNRECOGNITION, Speech processing, Speech transmission, Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC], Transcription
Abstract: Paper presented at Interspeech 2018, held 2-6 September 2018 in Hyderabad, India. We have developed a neural architecture that tests the effect of lexical, morphosyntactic and prosodic features in restoring punctuation in speech transcriptions. Having outperformed a baseline model in terms of precision and recall, we further extend our performance tests by integrating it into a speech recognition pipeline. The visual and interactive testing environment that we prepared helps us observe how our models generalize to unseen data and plan our next steps for improvement. The first author received the Maria de Maeztu Reproducibility Award from the Department of Information and Communication Technologies of Universitat Pompeu Fabra in 2018 for the presentation of this work. The second author is funded by the Spanish Ministry of Economy, Industry and Competitiveness through the Ramón y Cajal program.
Published: 2018

5. End-to-End photoplethysmography-based biometric authentication system by using deep neural networks

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Telefónica I+D, Bonafonte Cávez, Antonio, Luque Serrano, Jordi, and Cortès Sebastià, Guillem
Abstract: Whilst research efforts have traditionally focused on electrocardiographic (ECG) signals and handcrafted features as potential biometric traits, few works have explored systems based on the raw photoplethysmogram (PPG) signal. This work proposes an end-to-end architecture that offers biometric authentication using PPG biosensors through Convolutional Neural Networks. We evaluate the performance of our approach on two different databases: Troika and PulseID, the latter a publicly available database specifically collected by the authors for this purpose. Our verification approach, based on convolutional network models and raw PPG signals, appears to be viable in current monitoring procedures within e-health and fitness environments, and shows remarkable potential as a biometric identifier. When tested on a verification task with one-second trials, the approach achieved an AUC of $78.2\%$ and $83.2\%$, averaged among target subjects, on the PulseID and Troika datasets respectively. Our experimental results on other small datasets support the usefulness of PPG-extracted biomarkers as viable traits for multi-biometric or standalone biometrics. Furthermore, the approach has a low input throughput and low complexity, which allows for continuous authentication in real-world scenarios and implementation in small wearable devices. Nevertheless, the reported experiments also suggest that further research is necessary to develop a definitive system.
Published: 2018

6. Visualizing punctuation restoration in speech transcripts with prosograph

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Oktem, A., Farrús, M., and Bonafonte Cávez, Antonio
Abstract: We have developed a neural architecture that tests the effect of lexical, morphosyntactic and prosodic features in restoring punctuation in speech transcriptions. Having outperformed a baseline model in terms of precision and recall, we further extend our performance tests by integrating it into a speech recognition pipeline. The visual and interactive testing environment that we prepared helps us observe how our models generalize to unseen data and plan our next steps for improvement., Peer Reviewed, Postprint (published version)
Published: 2018

7. Expressive speech synthesis using sentiment embeddings

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Jauk, Igor, Lorenzo Trueba, J., Yamagishi, J., and Bonafonte Cávez, Antonio
Abstract: In this paper we present a DNN-based speech synthesis system trained on an audiobook, including sentiment features predicted by the Stanford sentiment parser. The baseline system uses a DNN to predict acoustic parameters from conventional linguistic features, as used in statistical parametric speech synthesis. The predicted parameters are transformed into speech using a conventional high-quality vocoder. In this paper, the conventional linguistic features are enriched with sentiment features. Different sentiment representations have been considered, combining sentiment probabilities with hierarchical distance and context. After a preliminary analysis, a listening experiment is conducted in which participants evaluate the different systems. The results show the usefulness of the proposed features and reveal differences between expert and non-expert TTS users., Peer Reviewed, Postprint (published version)
Published: 2018

8. Language and noise transfer in speech enhancement generative adversarial network

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Pascual de la Puente, Santiago, Park, Maruchan, Serra, Joan, Bonafonte Cávez, Antonio, and Ahn, Kang-hun
Abstract: ©2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works., Speech enhancement deep learning systems usually require large amounts of training data to operate in broad conditions or real applications. This makes the adaptability of those systems into new, low resource environments an important topic. In this work, we present the results of adapting a speech enhancement generative adversarial network by fine-tuning the generator with small amounts of data. We investigate the minimum requirements to obtain a stable behavior in terms of several objective metrics in two very different languages: Catalan and Korean. We also study the variability of test performance to unseen noise as a function of the amount of different types of noise available for training. Results show that adapting a pre-trained English model with 10 min of data already achieves a comparable performance to having two orders of magnitude more data. They also demonstrate the relative stability in test performance with respect to the number of training noise types., Peer Reviewed, Postprint (published version)
Published: 2018

9. Study of the signal properties of music genres

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, Sýkora, Martin, and Boadas Rabassedas, Andreu
Abstract: The amount of music available is huge, especially if we consider free non-commercial music. So that users can discover artists, we need to design retrieval or recommendation systems that are able to organize all the available resources. Genre classification is a first step in that direction. Genre classification is typically based on supervised classification. First, the system extracts classic features related to dynamic, rhythmic, spectral, and harmonic properties. Then a classifier (e.g. SVM) is trained to decide the genre from the extracted features. Recently, SoundNet, In this project different features of music signals are studied and evaluated, in order to better understand their behaviour from a technical point of view, to attribute similar patterns to the different genres studied, and to be able to differentiate them. To differentiate one genre from another and to find similarities between audio tracks of the same genre, some specific features are extracted: statistical descriptors, dynamic and psycho-acoustical features, and frequency component features. The information extracted from these features is compared with aspects of the theoretical definitions of music genres to corroborate their specific characteristics. In the last part of the project, rhythm features are extracted and classified by a support vector machine in order to evaluate the results. The results show that some music genres, like classical music or jazz, are easy to differentiate from the other genres. Moreover, there are other genres, like metal, disco or hip hop, for which some features are nearly identical across the whole data set, making it easy to establish common patterns for them.
Published: 2018

10. Corpus for cyberbullying prevention

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Moreno Bilbao, M. Asunción, Bonafonte Cávez, Antonio, Jauk, Igor, Tarrés, Laia, and Pereira, Victor
Abstract: Cyberbullying is the use of digital media to harass a person or group of people through personal attacks, disclosure of confidential or false information, and other means. That is to say, anything done through electronic communication devices with the intended purpose of harming or attacking a person or a group is considered cyberbullying or cyber-aggression. In this paper we present a project, in its initial stage, to prevent cyberbullying among kids and teenagers. The idea is to create a prevention system that is installed on a kid's mobile phone and, if harassment is detected, gives the child some advice. In case of serious or repeated behavior, the parents are alerted. The focus of this paper is to describe the characteristics of the database to be used to train the system, Peer Reviewed, Postprint (published version)
Published: 2018

11. Ultrasound based Silent Speech Interface using Deep Learning

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Budapesti Műszaki és Gazdaságtudományi Egyetem, Bonafonte Cávez, Antonio, Gábor Csapó, Tamás, and Moliner Juanpere, Eloi
Abstract: Silent Speech Interface (SSI) is a technology able to synthesize speech in the absence of any acoustic signal. It can be useful in cases such as laryngectomy patients, noisy environments or silent calls. This thesis explores the particular case of SSI using ultrasound images of the tongue as input signals. A 'direct synthesis' approach based on Deep Neural Networks and Mel-generalized cepstral coefficients is proposed. This document is an extension of Csapó et al. "DNN-based Ultrasound-to-Speech Conversion for a Silent Speech Interface". Several deep learning models, such as basic Feed-forward Neural Networks, Convolutional Neural Networks and Recurrent Neural Networks, are presented and discussed. A denoising pre-processing step based on a Deep Convolutional Autoencoder has also been studied. A considerable number of experiments using a set of different deep learning architectures, along with an extensive hyperparameter optimization study, have been carried out. The experiments have been evaluated and rated using several objective and subjective quality measures. According to the experiments, an architecture based on a CNN and bidirectional LSTM layers has shown the best results in both objective and subjective terms.
Published: 2018

12. Multi-speaker Neural Vocoder

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, and Barbany Mayor, Oriol
Abstract: Deep learning has revolutionized almost every engineering branch over the past decades and has also been successfully applied to text-to-speech, where it yields state-of-the-art performance and surpasses classical approaches. This work focuses on the implementation of a speech synthesis system based on Recurrent Neural Networks (RNNs) that handles many speakers with a single model. Whereas other systems share only some layers across speakers and maintain independent blocks for every identity, this dissertation explores the possibilities of implementing an adaptation of the end-to-end model SampleRNN, conditioned on both speech parameters and speaker identity, that allows an entirely shared framework.
Published: 2018

13. Spanish statistical parametric speech synthesis using a neural vocoder

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Bonafonte Cávez, Antonio, Pascual de la Puente, Santiago, and Dorca, G.
Abstract: During the 2000s, unit-selection based text-to-speech was the dominant commercial technology. Meanwhile, the TTS research community made a big effort to push statistical-parametric speech synthesis towards similar quality and more flexibility in the synthetically generated voice. In recent years, deep learning advances applied to speech synthesis have closed the gap, especially when neural vocoders replace traditional signal-processing based vocoders. In this paper we propose to replace the waveform generation vocoder of MUSA, our Spanish TTS, with SampleRNN, a neural vocoder recently proposed as a deep autoregressive raw waveform generation model. MUSA uses recurrent neural networks to predict vocoder parameters (MFCC and logF0) from linguistic features. Then, the Ahocoder vocoder is used to recover the speech waveform from the predicted parameters. SampleRNN is extended to generate speech conditioned on the Ahocoder parameters (MFCC and logF0), and two configurations are considered for training the system. In the first, the parameters derived from the signal using Ahocoder are used. In the second, the system is trained with the parameters predicted by MUSA, with SampleRNN and MUSA jointly optimized. The subjective evaluation shows that the second system outperforms both the original Ahocoder and SampleRNN as an independent neural vocoder., Peer Reviewed, Postprint (published version)
Published: 2018

14. Neural Audio Generation for Speech Synthesis

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, and Dorca Saez, Georgina
Abstract: Most speech synthesis systems require a linguistic module to produce the features that drive the speech generation module. In this project, a system will be designed using a deep architecture and trained automatically to produce either linguistic features or speech from the raw letter representation., Recently, neural networks have become the state of the art for speech synthesis from raw text, and they currently represent a powerful force in the industry. In this project, we present an end-to-end deep learning-based TTS system able to generate a voice signal from characters. To accomplish this, we developed an adaptation of MUSA, which predicts vocoder parameters from text, and used it to condition a re-implementation of the Unconditional-SampleRNN neural vocoder.
Published: 2018

15. Trombone Synthesis using Deep Learning

Author: Bonafonte Cávez, Antonio, and Badenas Crespo, Víctor
Abstract: In this project we present a possible extension of the sampleRNN project for trombone synthesis. The project is divided into three main blocks: the database creation, the sampleRNN analysis and the sampleRNN modification. The first block covers the generation of a database suitable for the application pursued in the second and third blocks. The second block consists of an analysis of the published sampleRNN code and a first attempt at training the model with a portion of the database. The third and final block explains the modifications made to the code so that it accepts note and length values in the input vector of the neural network. Unfortunately, due to time constraints, the expected results were not obtained, as explained in the results and conclusion of the document.
Published: 2017

16. An evaluation of the impact of body movement data in automatic music generation processes with long short-term memory neural networks

Author: University of Limerick, Torre, Giuseppe, Bonafonte Cávez, Antonio, and Tantinyà Vidal, Àgata
Abstract: Machine learning is gaining popularity in the arts and in music generation. Using deep learning to create subjectively convincing songs has been an active area of research that has drawn a lot of attention lately. Previous work in music generation has focused mainly on creating music by learning only from the music itself. In this project we use Google's Magenta platform to investigate whether adding choreography to the equation improves the generated pieces. The goal is to generate music with a model trained on music and movement information and then evaluate the results. Generating and serializing the movement data is one of the main parts of the project, as the data must be merged with the music data already in use. Different Long Short-Term Memory (LSTM) neural network models will be trained with this data and later evaluated in order to explore the usefulness of this approach in the context of music generation through deep learning.
Published: 2017

17. Voice conversion using Deep Learning

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, Pascual de la Puente, Santiago, and Aparicio Isarn, Albert
Abstract: In this project we present a first attempt at a Voice Conversion system based on Deep Learning in which the alignment between the training data is intrinsic to the model. Our system is structured in three main blocks. The first performs a vocoding of the speech (we used Ahocoder for this task) and a normalization of the data. The second and main block is a Sequence-to-Sequence model: an RNN-based encoder-decoder structure with an Attention Mechanism. Its main strengths are the ability to process variable-length sequences and to align them internally. The third block of the system performs a denormalization and reconstructs the speech signal. For the development of our system we used the Voice Conversion Challenge 2016 dataset, as well as a part of the TC-STAR dataset. Unfortunately, we did not obtain the results we expected. At the end of this thesis we present them and discuss some hypotheses that may explain them.
Published: 2017

18. Word and paragraph embeddings for expresive speech synthesis

Author: Bonafonte Cávez, Antonio, and Gómez Bajo, Germán
Abstract: Speech synthesis is the task of generating speech using computers. Due to the limitations of classical techniques, these systems are normally not suitable for applications that would benefit from expressive speech, such as audiobook reading. In this project, we attempt to develop a text-to-speech synthesizer that is capable of reacting to the semantic content of the input text to produce expressive speech. The system is based on the Socrates text-to-speech framework, developed in the VEU research lab at UPC, and on the Keras deep learning library.
Published: 2017

19. Measuring the evolution of timbre in Billboard Hot 100

Author: Bonafonte Cávez, Antonio, Pons Puig, Jordi, and Pons Albà, Aleu
Abstract: This project analyzes the timbre blend of some of the most popular songs of the last 60 years: the top positions of the Billboard Hot 100 for every week of the analysed years. The study focuses on those years that brought a revolutionary change in Western popular music, according to the musicological literature. After studying musical timbre and its state of the art in order to define the methodology, the dataset is defined, obtained, and perceptually analysed and classified. Then a two-part system is developed. The first part consists of timbre characterization using MFCCs (Mel-frequency cepstral coefficients), ZCC (zero-crossing count) and the brightness musical descriptor. The second part is the automatic classification of the whole dataset using trained and validated SVM (support vector machine) models. Finally, the results obtained show an approximation of the evolution of timbre in mainstream music. Despite their questionable validity and the risk of false attributions or generalizations, the final conclusions point, with regard to the Billboard Hot 100, to the non-influence of punk, an increase over time in heavy low-frequency content and electronic sounds, and the influence of hip hop and trap.
Published: 2017
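
The zero-crossing count (ZCC) named in the abstract above can be illustrated with a minimal sketch. This is not the project's code: the function name `zero_crossing_count`, the convention that a sample equal to zero counts as non-negative, and the test tone are all assumptions made for illustration.

```python
import math


def zero_crossing_count(samples):
    """Count sign changes between consecutive samples.

    Assumption: a sample equal to zero is treated as non-negative.
    """
    signs = [s >= 0 for s in samples]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


# Hypothetical test tone: 5 sine cycles over 1000 samples, offset by half a
# sample so that no sample lands exactly on a zero of the sine.
tone = [math.sin(2 * math.pi * 5 * (n + 0.5) / 1000) for n in range(1000)]
tone_crossings = zero_crossing_count(tone)
alternating_crossings = zero_crossing_count([1.0, -1.0, 1.0, -1.0])
```

A higher count per frame generally indicates more high-frequency or noisy content, which is one reason the descriptor is commonly paired with MFCCs for timbre characterization.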

20. Effects of room acoustics on players' perceptions in audio games

Author: University of Limerick, Bonafonte Cávez, Antonio, Hagan, Kerry, and Sánchez Cervera, Ariadna
Abstract: Design and develop a system that uses acoustic principles to create an interactive system in a VR environment. The user will be able to interact with the environment, triggering a sound response from the system., In recent times, the evolution of video games has been made possible by the improvement of their visual content, which aims to recreate reality as accurately as possible. Nevertheless, audio content has not followed the same path, even though its importance for a fully immersive experience has been demonstrated (Lokki and Gröhn 2005). Aural characteristics and room acoustics play an important role in recreating a real environment (Larsson et al 2001; Gonot et al 2006; Podkosova et al 2016), leading to a fully immersive experience. For that reason, the main question this project addresses is the importance of room acoustics for the player's full immersion. By developing a game, the project works with different types of room acoustics and spaces to evaluate the player's reactions to them. Additionally, different audio sources are used to give the player a sense of spatial location, and the usefulness of this technique is evaluated.
Published: 2017

21. Unsupervised learning for expressive speech synthesis

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, and Jauk, Igor
Abstract: Nowadays, especially with the upswing of neural networks, speech synthesis is almost totally data driven. The goal of this thesis is to provide methods for automatic and unsupervised learning from data for expressive speech synthesis. In comparison to "ordinary" synthesis systems, it is more difficult to find reliable expressive training data, despite the huge availability of sources such as the Internet. The main difficulty lies in the highly speaker- and situation-dependent nature of expressiveness, which causes many acoustically substantial variations. The consequences are, first, that it is very difficult to define labels which reliably identify expressive speech with all its nuances. The typical definition of 6 basic emotions, or the like, is a simplification which will have inexcusable consequences when dealing with data outside the lab. Second, even if a label set is defined, apart from the enormous manual effort, it is difficult to obtain sufficient training data for the models while respecting all the nuances and variations. The goal of this thesis is to study automatic training methods for expressive speech synthesis that avoid labeling and to develop applications from these proposals. The focus lies on the acoustic and the semantic domains. In the acoustic domain, the goal is to find suitable acoustic features to represent expressive speech, especially in the multi-speaker domain, thereby getting closer to real-life, uncontrolled data. For this, the perspective shifts away from traditional, mainly prosody-based, features towards features obtained with factor analysis, trying to identify the principal components of expressiveness, namely using i-vectors. Results show that a combination of traditional and i-vector based features performs better in unsupervised clustering of expressive speech than traditional features alone, and even better than large state-of-the-art sets in the multi-speaker domain. Once the feature set is defined, it is used for unsupervised clustering of an audiobook…, Postprint (published version)
Published: 2017

22. Speech enhancement using deep learning

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, and Badescu, Dan Mihai
Abstract: This thesis explores the possibility of enhancing noisy speech signals using Deep Neural Networks. Signal enhancement is a classic problem in speech processing. In recent years, deep learning has been applied to many speech processing tasks, where it has provided very satisfactory results. As a first step, a Signal Analysis Module was implemented to calculate the magnitude and phase of each audio file in the database. The signal is decomposed into its magnitude and its phase; the magnitude is modified by the neural network, and the signal is then reconstructed with the original phase. The implementation of the Neural Networks is divided into two stages. The first stage is a Speech Activity Detection Deep Neural Network (SAD-DNN). The previously calculated magnitude of the noisy data is used to train the SAD-DNN to classify each frame as speech or non-speech. This classification is useful for the network that does the final cleaning. The SAD-DNN is followed by a Denoising Auto-Encoder (DAE). The magnitude and the speech or non-speech label are the input of this second Deep Neural Network, which is in charge of denoising the speech signal. In this second stage, the first stage is also optimized to suit the final task. Training the Neural Networks requires datasets. In this project the TIMIT corpus [9] is used as the dataset for clean voice (target) and the QUT-NOISE TIMIT corpus [4] as the noisy dataset (source). Finally, the Signal Synthesis Module reconstructs the clean speech signal from the enhanced magnitudes and the original phase. The results provided by the system were analysed using both objective and subjective measures.
Published: 2017
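
The abstract above describes splitting the signal into magnitude and phase, modifying only the magnitude, and resynthesizing with the original phase. The following is a minimal sketch of that idea for a single frame, not the thesis implementation: the function names (`dft`, `idft`, `enhance_frame`) and the `modify_magnitude` callback, which stands in for the neural network, are hypothetical, and a plain DFT is assumed in place of whatever analysis the Signal Analysis Module actually uses.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT; returns the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def enhance_frame(frame, modify_magnitude):
    """Modify the magnitude spectrum and rebuild the frame with the original phase.

    modify_magnitude is a placeholder for the denoising network (an assumption):
    it takes a list of magnitudes and returns a list of the same length.
    """
    spectrum = dft(frame)
    magnitudes = [abs(c) for c in spectrum]
    phases = [cmath.phase(c) for c in spectrum]   # kept untouched
    new_magnitudes = modify_magnitude(magnitudes)
    rebuilt = [cmath.rect(m, p) for m, p in zip(new_magnitudes, phases)]
    return idft(rebuilt)
```

Passing an identity function as `modify_magnitude` should return the input frame up to rounding error, which is a quick sanity check before substituting a real model.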

23. Acoustic feature prediction from semantic features for expressive speech using deep neural networks

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Jauk, Igor, Bonafonte Cávez, Antonio, Pascual, Santiago, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Jauk, Igor, Bonafonte Cávez, Antonio, and Pascual, Santiago
Abstract: The goal of the study is to predict acoustic features of expressive speech from semantic vector space representations. Though a lot of successful work was invested in expressiveness analysis and prediction, the results often involve manual labeling, or indirect prediction evaluation such as speech synthesis. The proposed analysis aims at direct acoustic feature prediction and comparison to original acoustic features from an audiobook. The audiobook is mapped in a semantic vector space. A set of acoustic features is extracted from the same utterances, involving iVectors trained on MFCC and F0 basis. Two regression models are trained with the semantic coordinates, DNNs and a baseline CART. Later, semantic and acoustic context features are combined for the prediction. The prediction is achieved successfully using the DNNs. A closer analysis shows that the prediction works best for larger utterances or utterances with specific contexts, and worst for general short utterances and proper names., Peer Reviewed, Postprint (published version)
Published: 2016

24. Prosodic and spectral iVectors for expressive speech synthesis

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Jauk, Igor, Bonafonte Cávez, Antonio, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Jauk, Igor, and Bonafonte Cávez, Antonio
Abstract: This work presents a study on the suitability of prosodic and acoustic features, with a special focus on i-vectors, in expressive speech analysis and synthesis. For each utterance of two different databases, a laboratory recorded emotional acted speech, and an audiobook, several prosodic and acoustic features are extracted. Among them, i-vectors are built not only on the MFCC base, but also on F0, power and syllable durations. Then, unsupervised clustering is performed using different feature combinations. The resulting clusters are evaluated calculating cluster entropy for labeled portions of the databases. Additionally, synthetic voices are trained, applying speaker adaptive training, from the clusters built from the audiobook. The voices are evaluated in a perceptual test where the participants have to edit an audiobook paragraph using the synthetic voices. The objective results suggest that i-vectors are very useful for the audiobook, where different speakers (book characters) are imitated. On the other hand, for the laboratory recordings, traditional prosodic features outperform i-vectors. Also, a closer analysis of the created clusters suggests that different speakers use different prosodic and acoustic means to convey emotions. The perceptual results suggest that the proposed i-vector based feature combinations can be used for audiobook clustering and voice training., Peer Reviewed, Postprint (published version)
Published: 2016
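
The abstract above evaluates clusters by their entropy over labeled data but does not give the formula. As a rough illustration only, the sketch below assumes one common definition: the Shannon entropy of the label distribution inside each cluster, averaged with weights proportional to cluster size. The function name `cluster_entropy` and its arguments are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cluster_entropy(cluster_ids, labels):
    """Size-weighted average label entropy (in bits) over all clusters.

    cluster_ids[i] is the cluster assigned to utterance i and labels[i] its
    reference label (e.g. an emotion). Lower values mean purer clusters.
    """
    members_by_cluster = defaultdict(list)
    for cluster, label in zip(cluster_ids, labels):
        members_by_cluster[cluster].append(label)

    total = len(labels)
    weighted = 0.0
    for members in members_by_cluster.values():
        counts = Counter(members)
        size = len(members)
        # Shannon entropy of the label distribution within this cluster.
        entropy = -sum((n / size) * math.log2(n / size) for n in counts.values())
        weighted += (size / total) * entropy
    return weighted
```

Under this definition, clusters that each contain a single label score 0, and two clusters that each mix two labels evenly score 1 bit.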

25. Multi-output RNN-LSTM for multiple speaker speech synthesis with a-interpolation model

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Pascual, Santiago, Bonafonte Cávez, Antonio, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Pascual, Santiago, and Bonafonte Cávez, Antonio
Abstract: Deep Learning has been applied successfully to speech processing. In this paper we propose an architecture for speech synthesis using multiple speakers. Some hidden layers are shared by all the speakers, while there is a specific output layer for each speaker. Objective and perceptual experiments prove that this scheme produces much better results in comparison with a single-speaker model. Moreover, we also tackle the problem of speaker interpolation by adding a new output layer (a-layer) on top of the multi-output branches. An identifying code is injected into the layer together with acoustic features of many speakers. Experiments show that the a-layer can effectively learn to interpolate the acoustic features between speakers., Peer Reviewed, Postprint (published version)
Published: 2016

26. Prosodic break prediction with RNNs

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Pascual de la Puente, Santiago, Bonafonte Cávez, Antonio, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Pascual de la Puente, Santiago, and Bonafonte Cávez, Antonio
Abstract: Prosodic break prediction from text is a fundamental task for achieving naturalness in text-to-speech applications. In this work we build a data-driven break predictor out of linguistic features like the Part of Speech (POS) tags and forward-backward word distance to punctuation marks, and to do so we use a basic Recurrent Neural Network (RNN) model to exploit the sequence dependency in decisions. In the experiments we evaluate the performance of a logistic regression model and the recurrent one. The results show that the logistic regression outperforms the baseline (CART) by 9.5% in F-score, and the addition of the recurrent layer in the model further improves the predictions of the baseline by 11%., Peer Reviewed, Postprint (published version)
Published: 2016

27. Deep neural networks for i-vector language identification of short utterances in cars

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Ghahabi Esfahani, Omid, Bonafonte Cávez, Antonio, Hernando Pericás, Francisco Javier, Moreno Bilbao, M. Asunción, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Ghahabi Esfahani, Omid, Bonafonte Cávez, Antonio, Hernando Pericás, Francisco Javier, and Moreno Bilbao, M. Asunción
Abstract: This paper is focused on the application of the Language Identification (LID) technology for intelligent vehicles. We cope with short sentences or words spoken in moving cars in four languages: English, Spanish, German, and Finnish. As the response time of the LID system is crucial for user acceptance in this particular task, speech signals of different durations with total average of 3.8s are analyzed. In this paper, the authors propose the use of Deep Neural Networks (DNN) to model effectively the i-vector space of languages. Both raw i-vectors and session variability compensated i-vectors are evaluated as input vectors to DNNs. The performance of the proposed DNN architecture is compared with both conventional GMM-UBM and i-vector/LDA systems considering the effect of durations of signals. It is shown that the signals with durations between 2 and 3s meet the requirements of this application, i.e., high accuracy and fast decision, in which the proposed DNN architecture outperforms GMM-UBM and i-vector/LDA systems by 37% and 28%, respectively., Peer Reviewed, Postprint (published version)
Published: 2016

28. Emotion recognition based on the speech, using a Naive Bayes classifier

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Institut für Computertechnik, Bonafonte Cávez, Antonio, Taherinejad, Nima, Urbano Romeu, Ángel, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Institut für Computertechnik, Bonafonte Cávez, Antonio, Taherinejad, Nima, and Urbano Romeu, Ángel
Abstract: The goal of this work is to develop emotion classifiers from speech. The framework is the interaction of autistic children with robots., Speech emotion recognition is one of the latest challenges in speech processing. Besides facial expressions or gestures, speech has proven to be one of the most promising modalities for automatic emotion recognition. Many systems have been developed to identify emotions from the speech signal. This project presents the results of applying a Naive Bayes classifier to different types of features. Automatic detection of emotions has been evaluated using standard Mel-Frequency Cepstral Coefficients (MFCCs) and pitch-related features extracted from a speech corpus. This corpus contains a set of sentences recorded by actors and actresses who express different emotions. The classification performance depends on the extracted features. The best results are around 78% accuracy, using proper layers and weights in the classifier. Classifying emotions with Naive Bayes provides quick probabilistic results and performs better than more sophisticated classifiers., El reconocimiento de emociones es uno de los últimos retos en el procesamiento de la voz. Además de las expresiones faciales o gestuales, la voz ha sido demostrada como una de las modalidades más prometedoras en el ámbito del reconocimiento de emociones. Para identificar emociones a partir de una señal de voz, muchos sistemas han sido desarrollados. Este proyecto presenta los resultados de la aplicación de un clasificador Naïve Bayes utilizando una variedad de características. La detección automática de emociones ha sido posible utilizando MFCCs (Mel Frequency Cepstral Coefficients) y características relacionadas con el 'pitch', extraídas de una base de datos de audio. Este corpus contiene un conjunto de frases representadas por actores y actrices que expresan diferentes emociones. 
El rendimiento de la clasificación depende de las características extraídas. Los mejores resultados están alrededor de un 78% de precisión, utilizando pesos y capas apropiadas en el clasificador. Clasificar emociones con el clasificador Naïve Bayes proporciona resultados probabilísticos, rápidos y actúa mejor que clasificadores más sofisticados., El reconeixement d'emocions és un dels últims reptes en el processament de la veu. A més a més de les expressions facials o gestuals, la veu ha estat demostrada com una de les modalitats més prometedores en l'àmbit del reconeixement d'emocions. Per a identificar emocions a partir d'un senyal de veu, molts sistemes han sigut desenvolupats. Aquest projecte presenta els resultats de l'aplicació d'un classificador Naïve Bayes utilitzant una varietat de característiques. La detecció automàtica d'emocions ha sigut possible utilitzant MFCCs (Mel Frequency Cepstral Coefficients) i característiques relacionades amb el 'pitch', extretes d'una base de dades d'àudio. Aquest corpus conté un conjunt de frases representades per actors i actrius que expressen diferents emocions. El rendiment de la classificació depèn de les característiques extretes. Els millors resultats estan al voltant d'un 78% de precisió, utilitzant pesos i capes apropiades en el classificador. Classificar emocions amb el classificador Naïve Bayes proporciona resultats probabilístics, ràpids i actua millor que classificadors més sofisticats.
Published: 2016

29. Expressive speech synthesis from Broadcast News

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, Jauk, Igor, Luzón Tuells, Joaquín, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, Jauk, Igor, and Luzón Tuells, Joaquín
Abstract: Speech synthesis is the computer process of converting text to voice. This project consists of synthesizing voices that can tell the news with an appropriate expression, since expressiveness in the generated speech is important for obtaining natural-sounding voices. Conventional speech synthesis systems use as training data audio signals specifically recorded for training voice models. In this project, however, the data was obtained from a TV news station, in order to test a different kind of database in speech synthesis. An important part of the work done in this TFG has been preparing the data later used in synthesis. The audio and its transcriptions were labeled so as to differentiate the expressions recorded: explaining good or bad news, or talking about relevant or trivial topics. A phonetic segmentation of the database was obtained in order to create the models used in the speech synthesis. After preparing all the audio and transcription data, statistical-parametric models were estimated and used to synthesize test voices, in order to evaluate the previous setup work. The whole project has been developed in a Linux environment, using Ogmios, AHOCoder and HTS-toolkit as the main software. The results obtained after synthesizing the voices show that the data preparation process is correct, but the synthesized voices did not have sufficient quality. This is due to the adaptation of the voices towards heterogeneous samples, caused by the number of different speakers used to train the models., La síntesis de voz es el proceso informático mediante el cual se transforma texto a voz. Este proyecto consiste en la síntesis de voces que puedan explicar noticias con una expresión adecuada, ya que es importante obtener expresividad en el habla generada para poder generar voces con naturalidad expresiva. 
Los sistemas de síntesis del habla convencionales utilizan como datos de entrenamiento voces grabadas expresamente para el entrenamiento de los modelos. No obstante, en este proyecto se ha creado una base de datos a partir de unas grabaciones de un canal de televisión especializado en noticias, ya que se quería probar la síntesis de voz con una base de datos diferente. Una parte importante del trabajo llevado a cabo en este TFG ha sido la preparación de los datos utilizados en la grabación. Las grabaciones y sus transcripciones se etiquetaron con la intención de diferenciar las expresiones grabadas: explicando buenas o malas noticias, o hablando de temas relevantes o triviales. Se ha obtenido una segmentación de la base de datos con tal de crear los modelos utilizados en la síntesis del habla. Una vez preparados los audios y sus respectivas transcripciones, se estimaron los modelos estadístico-paramétricos y se utilizaron para sintetizar las voces de prueba, con el objetivo de evaluar el trabajo de preparación anterior. Todo el proyecto se ha realizado en un entorno Linux, utilizando Ogmios, AHOCoder y HTS-toolkit como software principal. Los resultados obtenidos después de la síntesis muestran que la preparación de los datos es correcta, pero las voces sintetizadas no tenían la calidad suficiente. Esto se debe a la adaptación de las voces a partir de una base de datos muy heterogénea, debido a la cantidad de hablantes diferentes contemplados en el entrenamiento de los modelos., La síntesi de veu es el procés informàtic que transforma text a veu. Aquest projecte consisteix en la síntesi de veus que poden explicar notícies amb una expressió adient, ja que és important obtenir expressivitat en la parla generada per tal d'obtenir veus amb naturalitat expressiva. Els sistemes de síntesis de la parla convencionals utilitzen com a dades d'entrenament veus gravades expressament pel entrenament dels models. 
No obstant, en aquest projecte s'ha creat una base de dades a partir d'unes gravacions d'un canal de televisió especialitzat en notícies, ja que es volia provar a sintetizar veu amb una base de dades diferent. Una part important del treball dut a terme en aquest TFG ha sigut preparar les dades després utilitzades en l'entrenament. Les gravacions i les seves transcripcions van ser etiquetades amb la intenció de diferenciar les expressions gravades: explicant males o bones notícies, o parlant de temes rellevants o trivials. S'ha obtingut una segmentació de la base de dades per tal de crear els models utilitzats en la síntesi de la parla. Una vegada preparat els audios i les seves transcripcions, es van estimar models estadístic-paramètrics i es van utilitzar per sintetizar les veu de prova, amb l'objectiu de evaluar el treball de preparació anterior. Tot el projecte s'ha realitzat en un entorn Linux, fent servir Ogmios, AHOCoder i HTS-toolkit com a software principal. Els resultats obtinguts després de la síntesi mostren que la preparació de les dades es correcta, però les veus sintetitzades no teníen qualitat suficient. Això es deu a l'adaptació de les veus a partir d'una base de dades molt heterogènia, degut a la quantitat de parlants diferents contemplats en l'entrenament dels models.
Published: 2016

30. Voice generation using deep learning

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, Pascual de la Puente, Santiago, Gómez Sánchez, Gonzalo, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, Pascual de la Puente, Santiago, and Gómez Sánchez, Gonzalo
Abstract: Las técnicas de aprendizaje profundo están teniendo unas excelentes prestaciones en muchas tareas relacionadas con el habla, tales como reconocimiento o síntesis. Muchos de los trabajos se apoyan en modelos de voz, o técnicas de análisis clásicas, como el espectrograma o el MFCC. En este proyecto se desea sustituir estas técnicas por redes neuronales profundas que puedan autodiseñarse para modelar la señal. Una aplicación que puede plantearse para validar esta tecnología es codificación., Voice generation, also known as Speech Synthesis, is the artificial production of human speech. In the last decade, the Speech Synthesis research has been focused on a technique called Statistical Parametric Speech Synthesis. This technique uses a statistical model that obtains parameters (acoustic features) to define the signal out of a text. These parameters are then converted into a waveform using a vocoder. The use of the vocoder is needed but it decreases the quality of the obtained audio. In the past few years, Deep Learning techniques have shown great performance in many fields. One of them is Speech Synthesis, where Deep Learning is used as a substitute for the statistical model, obtaining the parameters that define the signal with great effectiveness. However, the quality of the synthesis is still affected by the use of the vocoder. For this reason, in this work, we investigate how to generate the audio waveform out of the parameters using Deep Neural Networks. If it results to work, it could be possible to build a DNN system that generates an audio waveform using text as input, leaving the vocoder out of the scheme. Different architectures were tested before getting to the final model. The first attempt was to directly map the frames of the signal using a Long Short-Term Memory Recurrent Neural Network. In the second one, instead of generating the signal frame by frame we did it sample by sample. 
We tried a different architecture in the third model, using a Clockwork RNN. Finally, in the fourth model we used again an LSTM, but this time, we generated the signal by frequency sub-bands, using Pseudo-Quadrature Mirror Filter banks. The models that showed better performance were the second and the fourth. Nevertheless, the computational cost of the second one is too high. We solved this problem in the fourth model. Generating the signal by sub-bands allows us to parallelize the problem and decrease the computational cost significantly. Although it is a great, La generación de voz, también conocida como Síntesis de Habla, es la producción artificial de habla humana. En la última década, la investigación de Síntesis de Habla se ha centrado en una técnica llamada Síntesis Estadística Paramétrica de Habla. Esta técnica utiliza un modelo estadístico y genera los parámetros acústicos más probables, condicionados al texto de entrada. Estos parámetros son convertidos en forma de onda utilizando un vocoder. El uso de este vocoder es necesario en la síntesis estadística, pero limita la calidad del audio que puede obtenerse. En los últimos años, las técnicas de Aprendizaje Profundo han obtenido importantes resultados en muchos campos. Uno de ellos es la Síntesis de Habla, donde el Aprendizaje Profundo es usado como sustituto de los modelos estadísticos tradicionales, basados en Modelos Ocultos de Markov, obteniendo los parámetros que definen la señal. Sin embargo, la calidad sigue afectada por el uso del vocoder. Por esta razón, en este trabajo hemos investigado como generar una forma de onda, partiendo de parámetros, mediante Redes Neuronales Profundas. Si funcionara, sería posible construir un sistema basado en Redes Neuronales Profundas que genere una forma de onda utilizando texto como entrada, sin necesitar el vocoder. Se han probado diferentes arquitecturas antes de llegar al modelo final. 
El primer intento fue mapear directamente las muestras de la señal de audio utilizando una Red Neuronal Recurrente con Memoria a Largo y Corto Plazo (LSTM-RNN). En el segundo, en vez de generar la señal trama a trama, se ha generado muestra a muestra. Se ha probado también una arquitectura diferente en el tercer modelo, utilizando una Red Neuronal Recurrente 'Clockwork'. Finalmente, en el cuarto modelo, usamos de nuevo una LSTM-RNN, pero esta vez, generamos la señal por bandas frecuenciales, usando Pseudo Quadrature-Mirror Filters (PQMF). Los modelos que han obtenido mejores resultados han sido el segundo y el cuarto. Sin embargo
Published: 2016

31. Automatic transcription for polyphonic music

Author: Bonafonte Cávez, Antonio, Martín Valero, Juan, Bonafonte Cávez, Antonio, and Martín Valero, Juan
Abstract: In this TFG you should develop a VST plugin to transform an audio signal (singing voice, played instrument) into MIDI notation. MIDI is a symbolic representation of music, an electronic extension of the traditional music score. The MIDI track can be easily manipulated, for instance to change the tempo, transpose, correct player errors, change instrument, etc. VST is a de facto standard which allows 3rd party audio software to be integrated into many Digital Audio Workstations (DAW). The main focus of this TFG is on the development, integration and usability of this plugin. However, it a, We are living in times where technology and music go hand in hand, and every day more musicians and producers integrate new music software into their workflow. This project is the study and development of a first prototype for a complete automatic transcription system for polyphonic music, which is able to generate a MIDI file from an audio signal. This first prototype is focused on piano music. The developed algorithm combines different techniques to put together the six different blocks that form the final system. The core of the project is the multiple fundamental frequency estimation and it is based on the “Iterative Estimation and Cancellation” method, initially proposed by Anssi Klapuri. The general system combines this multi F0 estimation method with techniques from other research papers and some new methods proposed in this work. With an accuracy of 0.7402 for melodies and 0.6142 for low polyphonic recordings, we can say that the results are good for simple scenarios., Vivimos tiempos en los que la tecnología y la música van de la mano, y cada día más músicos y productores integran nuevo software musical en su forma de trabajar. Este proyecto es el estudio y desarrollo de un primer prototipo para un sistema completo de transcripción musical automática para música polifónica, que es capaz de generar un fichero MIDI a partir de una señal de audio. 
Este primer prototipo se centra en la música de piano. El algoritmo desarrollado combina diferentes técnicas para crear los seis bloques que forman el sistema final. El núcleo del proyecto es la estimación de múltiples frecuencias fundamentales y se basa en el método de “Estimación y Cancelación Iterativa”, propuesto inicialmente por Anssi Klapuri. El sistema general combina este método de estimación de múltiples F0 con técnicas de otras publicaciones y algunos nuevos métodos propuestos en este trabajo. Con una accuracy de 0.7402 para melodías y de 0.6142 para grabaciones con baja polifonía, podemos decir que los resultados obtenidos son buenos para casos simples., Estem vivint uns temps en els quals la tecnologia i la música van de la mà, i cada dia més músics i productors estan integrant nou software musical a la seva manera de treballar. Aquest projecte és l’estudi i el desenvolupament d’un primer prototip per a un sistema complet de transcripció musical automàtic per a música polifònica, que és capaç de generar un fitxer MIDI a partir d’un senyal d’àudio. Aquest primer prototip se centra en la música de piano. L’algoritme desenvolupat combina diferents tècniques per a crear els sis blocs que formen el sistema final. El nucli del projecte és l’estimació de múltiples freqüències fonamentals i es basa en el mètode de “Estimació i Cancel·lació Iterativa”, proposat inicialment per Anssi Klapuri. El sistema general combina aquest mètode d’estimació de multiples F0 amb tècniques d’altres publicacions i alguns nous mètodes proposats en aquest treball. Amb una accuracy de 0.7402 per a melodies i de 0.6142 per a gravacions amb baixa polifonia, podem dir que els resultats obtinguts són bons per a casos simples.
Published: 2016

32. Speech activity detection: Application-specific tuning and context-based neural approaches

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, Luque Serrano, Jordi, Balcells Eichenberger, Daniel, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, Luque Serrano, Jordi, and Balcells Eichenberger, Daniel
Published: 2016

33. Multi-output RNN-LSTM for multiple speaker speech synthesis and adaptation

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Pascual, Santiago, Bonafonte Cávez, Antonio, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Pascual, Santiago, and Bonafonte Cávez, Antonio
Abstract: Deep Learning has been applied successfully to speech processing. In this paper we propose an architecture for speech synthesis using multiple speakers. Some hidden layers are shared by all the speakers, while there is a specific output layer for each speaker. Objective and perceptual experiments prove that this scheme produces much better results in comparison with single speaker model. Moreover, we also tackle the problem of speaker adaptation by adding a new output branch to the model and successfully training it without the need of modifying the base optimized model. This fine tuning method achieves better results than training the new speaker from scratch with its own model., Peer Reviewed, Postprint (published version)
Published: 2016

34. Direct expressive voice training based on semantic selection

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Jauk, Igor, Bonafonte Cávez, Antonio, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Jauk, Igor, and Bonafonte Cávez, Antonio
Abstract: This work aims at creating expressive voices from audiobooks using semantic selection. First, for each utterance of the audiobook an acoustic feature vector is extracted, including iVectors built on MFCC and on F0 basis. Then, the transcription is projected into a semantic vector space. A seed utterance is projected to the semantic vector space and the N nearest neighbors are selected. The selection is then filtered by selecting only acoustically similar data. The proposed technique can be used to train emotional voices by using emotional keywords or phrases as seeds, obtaining training data semantically similar to the seed. It can also be used to read larger texts in an expressive manner, creating specific voices for each sentence. That later application is compared to a DNN predictor, which predicts acoustic features from semantic features. The selected data is used to adapt statistical speech synthesis models. The performance of the technique is analyzed objectively and in a perceptive experiment. In the first part of the experiment, subjects clearly show preference for particular expressive voices to synthesize semantically expressive utterances. In the second part, the proposed method is shown to achieve similar or better performance than the DNN based prediction. Copyright © 2016 ISCA., Peer Reviewed, Postprint (published version)
Published: 2016

35. Deep learning applied to speech synthesis

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, Pascual de la Puente, Santiago, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, and Pascual de la Puente, Santiago
Abstract: Deep Learning has been applied successfully to speech processing problems. In this work we explore its capabilities, focusing specifically on recurrent neural architectures to build a state-of-the-art Text-To-Speech system from scratch. The different steps to build the full TTS system are shown. Also, a post-filtering method to improve the naturalness of the generated speech is applied and evaluated. The objective results show which architecture best fits our problem, achieving low error rates in terms of cepstral distortion, pitch estimation error and voiced/unvoiced classification error. Also, subjective results suggest that the model achieves state-of-the-art synthesis quality, where the post-filtering factor seems to be a key component in reaching a good level of naturalness. A novel architecture called Multi-Output TTS is also proposed to hold multiple speakers inside the same structure. Some hidden layers are shared by all the speakers, while there is a specific output layer for each speaker. Objective and perceptual experiments show that this scheme produces much better results than single-speaker models. Moreover, we also tackle the problem of speaker adaptation by adding a new output branch to the model and successfully training it without needing to modify the base optimized model. This fine-tuning method achieves better results than training the new speaker from scratch with its own model. Finally, we also tackle the problem of speaker interpolation by adding a new output layer (alpha-layer) on top of the Multi-Output branches. An identifying code is injected into the layer together with acoustic features of many speakers. Experiments show that the alpha-layer can effectively learn to interpolate the acoustic features between speakers.
Published: 2016
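
The Multi-Output idea described in this abstract (hidden layers shared by all speakers, one output layer per speaker, and adaptation by adding a new output branch) can be illustrated with a minimal sketch. This is not the thesis implementation: the thesis uses recurrent architectures, while the sketch below assumes a single feed-forward shared layer for brevity, and the names `MultiOutputModel`, `random_layer`, `linear` and `add_speaker`, as well as the layer sizes and speaker labels, are hypothetical.

```python
import math
import random


def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def random_layer(n_out, n_in, rng):
    """Create a (weights, bias) pair with small random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultiOutputModel:
    """Shared hidden layer plus one output branch per speaker (illustrative)."""

    def __init__(self, n_in, n_hidden, n_out, speakers, seed=0):
        self.rng = random.Random(seed)
        self.n_hidden, self.n_out = n_hidden, n_out
        # Shared by every speaker.
        self.shared = random_layer(n_hidden, n_in, self.rng)
        # One speaker-specific output layer each.
        self.branches = {s: random_layer(n_out, n_hidden, self.rng)
                         for s in speakers}

    def add_speaker(self, speaker):
        # Adaptation sketch: only a new branch is created (and would then
        # be trained); the shared layer is left untouched.
        self.branches[speaker] = random_layer(self.n_out, self.n_hidden, self.rng)

    def forward(self, x, speaker):
        hidden = [math.tanh(h) for h in linear(*self.shared, x)]
        return linear(*self.branches[speaker], hidden)


model = MultiOutputModel(n_in=3, n_hidden=4, n_out=2, speakers=["anna", "bob"])
features = [0.1, -0.2, 0.3]          # made-up linguistic input features
model.add_speaker("carla")           # new branch, shared layer unchanged
print(model.forward(features, "carla"))
```

Because existing branches and the shared layer are not modified when a speaker is added, the outputs for previously trained speakers stay exactly the same.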

36. Building synthetic voices in the META-NET framework

Author: Garcia Casademont, Emília, Bonafonte Cávez, Antonio (ORCID 0000-0002-6240-9915), Moreno Bilbao, M. Asunción (ORCID 0000-0002-1823-5970), Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, and Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla
Subjects: Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC], Processament de la parla, Speech processing systems
Abstract: METANET 4 U is a European project aiming at supporting language technology for European languages and multilingualism. It is a project in the META-NET Network of Excellence, a cluster of projects aiming at fostering the mission of META, which is the Multilingual Europe Technology Alliance, dedicated to building the technological foundations of a multilingual European information society. This paper describes the resources produced at our lab to provide synthetic voices. Using existing 10-hour corpora for a male and a female Spanish speaker, voices have been developed to be used in Festival, with both unit-selection and statistical technologies. Furthermore, using data produced to support research on intra- and inter-lingual voice conversion, four bilingual voices (English/Spanish) have been developed. The paper describes these resources, which are available through META. Furthermore, an evaluation is presented to compare different synthesis techniques, the influence of the amount of data in statistical speech synthesis and the effect of sharing data in bilingual voices.
Published: 2012

37. Search engine for multilingual audiovisual contents

Author: Pérez, José David, Bonafonte Cávez, Antonio, Ruiz Costa-Jussà, Marta, Cardenal, Antonio, Rodríguez Fonollosa, José Adrián, Moreno Bilbao, M. Asunción, Navas, Eva, Rodríguez Banga, Eduardo, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, and Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla
Subjects: Traducció automàtica, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC], Machine translating
Abstract: This paper describes the BUCEADOR search engine, a web server that allows retrieving multimedia documents (text, audio, video) in different languages. All the documents are translated into the user's language and are presented either as text (for instance, subtitles in video documents) or as dubbed audio. The user query consists of a sequence of keywords and can be typed or spoken. Multiple Spoken Language Technologies (SLT) servers have been implemented, such as speech recognition, speech machine translation and text-to-speech conversion. The platform can be used in the four official languages of Spain (Spanish, Basque, Catalan and Galician) and in English.
Published: 2012

38. BUCEADOR hybrid TTS for blizzard challenge 2011

Author: Sainz, Iñaki, Erro Eslava, Daniel, Navas, Eva, Adell Mercado, Jordi, Bonafonte Cávez, Antonio, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, and Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla
Subjects: Automatic speech recognition, Reconeixement automàtic de la parla, Informàtica::Intel·ligència artificial::Llenguatge natural [Àrees temàtiques de la UPC]
Abstract: This paper describes the Text-to-Speech (TTS) systems presented by the Buceador Consortium in the Blizzard Challenge 2011 evaluation campaign. The main system is a concatenative hybrid one that tries to combine the strong points of both statistical and unit selection synthesis (i.e. robustness and segmental naturalness respectively). The hybrid system has reached results significantly above average as far as similarity and naturalness are concerned, with no significant differences with most of the systems in the intelligibility task. This clearly improves the performance achieved in previous participations, and shows the validity of the hybrid approach proposed. Besides, an HMM-based system was built for the ES1 intelligibility tasks, using an HNM-based vocoder.
Published: 2011

39. Work in progress - Cooperative and competitive projects for engaging students in advanced ICT subjects

Author: Pardàs Feliu, Montse, Bonafonte Cávez, Antonio, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. GPI - Grup de Processament d'Imatge i Vídeo, and Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla
Subjects: Engineering -- Study and teaching, competitive projects, Cooperative projects, project-based learning, ComputingMilieux_COMPUTERSANDEDUCATION, Ensenyament i aprenentatge::Ensenyament universitari [Àrees temàtiques de la UPC], teaching in information and communication technologies, Ensenyament -- Treball en equip, Enginyeria de la telecomunicació [Àrees temàtiques de la UPC], Group work in education, Enginyeria -- Ensenyament
Abstract: In this paper we present a specific kind of project that can be used for project-based learning in engineering subjects. The subjects must combine lectures with projects, in order to provide the technical competences together with additional skills such as teamwork, oral and written communication and the application of theory to practice. The proposed projects consist of improving an elementary baseline system. The system is decomposed into modules that correspond to the topics covered in the lectures. To improve the system, the class is divided into groups and each group has to propose, implement, assess and report a better system. To be able to improve the system with a limited amount of time and effort, the students need to make a coherent proposal and split the project into several tasks, each usually carried out by one or two students. The students within a group cooperate to achieve a better system, but groups compete for the best results. We have already implemented this kind of project in a Speech Processing course and we plan to apply it in a Video Coding course.
Published: 2011

40. Query by Humming (Android app)

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, and Siquier Penyafort, Marc
Abstract: Query by Humming/Singing is the technology to retrieve information about a song (title, artist, etc.) from singing (or humming) a small excerpt. This TFG should develop and integrate the required technology to create an application. In this thesis, a Query by Singing/Humming (QbSH) system has been developed. A QbSH system tries to retrieve information about a song given a melody recorded by the user. It has been developed as a client/server system, where the client is an Android application (programmed in Java) and the server runs on a Unix system and is written in C++. The system compares a melody recorded by the user with other melodies previously recorded by other users and tagged with song information by the system administrator. A pitch extraction algorithm is applied to extract the melody from the query recordings, followed by a processing algorithm that enhances the signal and prepares it for matching. In the matching step, Dynamic Time Warping (DTW) has been applied, which computes a distance between two signals while absorbing tempo variations. As a result, this thesis contains a full experience of audio processing, systems administration, communications and programming skills.
Published: 2015
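
The matching step mentioned in this abstract, Dynamic Time Warping, can be illustrated with a short sketch. It is the textbook DTW recurrence, not the thesis code (the thesis server is written in C++); the function name `dtw_distance`, the absolute-difference local cost and the example pitch sequences (given as MIDI note numbers) are assumptions made for illustration.

```python
import math


def dtw_distance(query, reference):
    """Return the accumulated DTW cost between two 1-D pitch sequences."""
    n, m = len(query), len(reference)
    # cost[i][j]: minimal accumulated cost of aligning query[:i] with reference[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(query[i - 1] - reference[j - 1])
            # Allow a step in the query, in the reference, or in both.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


melody = [60, 62, 64, 62, 60]                          # hummed query
slow_melody = [60, 60, 62, 62, 64, 64, 62, 62, 60, 60]  # same tune, half tempo
other_melody = [60, 65, 67, 65, 60]                     # a different tune

print(dtw_distance(melody, slow_melody))   # same contour at another tempo
print(dtw_distance(melody, other_melody))  # different contour
```

The half-tempo version aligns with zero cost because each query note can be matched to both copies of the corresponding reference note, which is what "absorbing tempo variations" means here; the different tune yields a strictly positive cost.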

41. Creating expressive synthetic voices by unsupervised clustering of audiobooks

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Jauk, Igor, Bonafonte Cávez, Antonio, López Otero, Paula, and Docio Fernández, Laura
Abstract: In this work we design an approach for automatic feature selection and voice creation for expressive synthesis. Our approach is guided by two main goals: (1) increasing the flexibility of expressive voice creation and (2) overcoming the limitations of speaking styles in expressive synthesis. We define a novel set of features, combining traditionally used prosodic features with spectral features and proposing the use of iVectors. With these features we perform unsupervised clustering of an audiobook excerpt and, from these clusters, we create synthetic voices using the SAT technique. To evaluate the clustering performance we propose an objective evaluation technique for the unsupervised clustering results, based on perplexity reduction. This objective evaluation indicates that both prosodic and spectral features help to separate speaking styles and emotions, with the best results obtained when iVectors are included in the feature set, which reduces the perplexity of the expressions and audiobook characters by factors of 14 and 2, respectively. We also designed a novel subjective evaluation method in which the participants have to edit a small excerpt of an audiobook using synthetic voices created from clusters. The results suggest that our feature set is effective in the task of expressiveness and character detection., Peer Reviewed, Postprint (published version)
Published: 2015

42. Grapheme-to-phoneme conversion in the era of globalization

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, and Polyàkova, Tatyana V.
Abstract: This thesis focuses on phonetic transcription in the framework of text-to-speech conversion, especially on improving adaptability, reliability and multilingual support in the phonetic module. Language is constantly evolving, making adaptability one of the major concerns in phonetic transcription. Phonetic transcription has been addressed from a data-based approach. On the one hand, several classifiers such as Decision Trees, Finite State Transducers and Hidden Markov Models were studied and applied to the grapheme-to-phoneme conversion task. In addition, we analyzed a method for generating pronunciations by analogy, considering different strategies. Further improvements were obtained by applying the transformation-based error-driven learning algorithm. The most significant improvements were obtained for classifiers with higher error rates. The experimental results show that the adaptability of the phonetic module was improved, with word error rates as low as 12% (for English). Next, steps were taken towards increasing the reliability of the output of the phonetic module. Although the G2P results were quite good, we propose using dictionary fusion to achieve a higher level of reliability. The way pronunciations are represented in different lexica depends on many factors, such as the expert's opinion, local accent specifications, the phonetic alphabet chosen, the assimilation level (for proper names), etc. There are often discrepancies between pronunciations of the same word found in different lexica. The fusion system learns phoneme-to-phoneme transformations and converts pronunciations from the source lexicon into pronunciations from the target lexicon. Another important part of this thesis consisted in facing the challenge of multilingualism, a phenomenon that is becoming a usual part of our daily lives. Our goal was to obtain such pronunciations for foreign inclusions that would not be totally unfamiliar either to a native, Only about ten years ago, the applications of TTS systems were much more limited, although such a recent past seems more distant because of the changes brought to our lives by the massive invasion of smart technologies. Service automation processes have also reached new levels. What defines a good TTS system today? The market demands that it be highly adaptable to any kind of domain. A high level of reliability is also essential, since a simple TTS error can cause serious problems in our daily lives. Our schedules are increasingly demanding and we have to cope with larger volumes of information in less time. We delegate our everyday tasks to our smart devices, which help us read books, choose products, find a place on the map, etc. We also travel more and more every day. We learn to speak new languages and mix them, in an increasingly globalized world. A TTS system that cannot cope with multilingual input will not be able to withstand the competition. Modern TTS systems must be multilingual. Phonetic transcription is the first module of a TTS system, so its correct operation is fundamental. This thesis focuses on improving the adaptability, reliability and multilingual support of the phonetic module of our TTS system. The phonetic transcription module of the TTS went from being rule- or dictionary-based to being automatic and data-driven. Language is constantly evolving, like all living organisms. That is why adaptability is one of the main problems of phonetic transcription. Improving it requires a data-driven method that works well for deriving the pronunciation of words not found in the system's lexicon. This thesis compares different data-driven G2P methods using the same training and test data, and proposes improvements. Several classifiers based on da…, Postprint (published version)
Published: 2015

43. Synthesis using speaker adaptation from speech recognition DB

Author: Oller Moreno, Sergio, Moreno Bilbao, M. Asunción, Bonafonte Cávez, Antonio, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, and Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla
Subjects: Automatic speech recognition, Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC], Reconeixement automàtic de la parla, HMM, Adaptation, Speech synthesis
Abstract: This paper deals with the creation of multiple voices from a Hidden Markov Model based speech synthesis system (HTS). More than 150 Catalan synthetic voices were built using Hidden Markov Models (HMM) and speaker adaptation techniques. Training data for building a Speaker-Independent (SI) model were selected from both a general-purpose speech synthesis database (FestCat) and a database designed for training Automatic Speech Recognition (ASR) systems (the Catalan SpeeCon database). The SpeeCon database was also used to adapt the SI model to different speakers. Using a database designed for ASR for TTS purposes provided many different amateur voices, each with a few minutes of recordings not made in studio conditions. This paper shows how speaker adaptation techniques provide the right tools to generate multiple voices with very little adaptation data. A subjective evaluation was carried out to assess the intelligibility and naturalness of the generated voices as well as the similarity of the adapted voices to both the original speaker and the average voice from the SI model.
Published: 2010

44. Synthesis of filled pauses based on a disfluent speech model

Author: Adell Roig, Jordi, Bonafonte Cávez, Antonio, Escudero Mancebo, David, Universitat Politècnica de Catalunya. Departament de Projectes Arquitectònics, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, and Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla
Subjects: Signal processing, Processament de la parla, Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC], Speech processing systems, Tractament del senyal
Published: 2010

45. Defining analogy for non-native inclusions in Spanish utterances

Author: Polyakova, Tatyana, Bonafonte Cávez, Antonio, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, and Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla
Subjects: Machine learning, Processament de la parla, Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC], Speech processing systems
Abstract: Mass media globalization introduces the challenge of multilingualism into the most popular speech applications, such as text-to-speech synthesis and automatic speech recognition. In Spain, as in other countries, the usage of English words is growing rapidly; however, due to the linguistic diversity of the languages spoken across the country, Spanish is no less influenced by inclusions from the four official languages. This work focuses on the pronunciation of Catalan inclusions in Spanish utterances. Our goal was to approach the nativization phenomenon with data-driven methods, making it easily transferable to other languages without loss in performance. For this particular task, training and test nativization corpora were manually crafted and the task itself was approached using pronunciation by analogy. The results were encouraging and showed that even a small corpus of 1000 words is enough to capture the analogy in the nativization process. The resulting pronunciations allowed significant improvements in the intelligibility of Catalan inclusions in Spanish utterances.
Published: 2010

46. Main issues in grapheme-to-phoneme conversion for TTS

Author: Polyakova, Tatyana and Bonafonte Cávez, Antonio
Subjects: Alineado, CART, Letras en fonemas, Traductores de estados finitos, Finite-state transducers, X-grams, Grapheme-to-phoneme, Alignment
Abstract: A grapheme-to-phoneme conversion system for English is being developed for later integration into a speech synthesis system within the TC-STAR project. In this work we describe experiments performed using two different machine learning techniques. The pronunciation was predicted for both stressed and unstressed lexica and the results were compared. The influence of different parameters on the error rate in grapheme-to-phoneme conversion was analyzed. The error rate as a function of the word length was also studied. This work has been funded by the European Union under the integrated project TC-STAR - Technology and Corpora for Speech to Speech Translation (IST-2002-FP6-506738, http://www.tc-star.org ).
Published: 2005

47. Including dynamic information in voice conversion systems

Author: Duxans Barrobes, Helenca, Bonafonte Cávez, Antonio, Kain, Alexander, and Santen, Jan van
Subjects: Conversión de voz, GMM, HMM, Voice conversion
Abstract: Voice Conversion (VC) systems modify a speaker's voice (source speaker) so that it is perceived as if another speaker (target speaker) had uttered it. Previously published VC approaches using Gaussian Mixture Models of the joint features of both speakers perform the conversion on a frame-by-frame basis, assuming independence between frames. In this paper, the inclusion of dynamic information of the source speaker, the target speaker or both in the conversion is studied. Objective and perceptual results compare the performance of the proposed systems. This work has been partially sponsored by the European Union under grant FP6-506738 (TC-STAR project, http://www.tc-star.org) and the Spanish Government under grant TIC2002-04447-C02 (ALIADO project, http://gps-tsc.upc.es/veu/aliado).
Published: 2004

48. Voice conversion using exclusively unaligned training data

Author: Sündermann, David, Bonafonte Cávez, Antonio, Höge, Harald, and Ney, Hermann
Subjects: Unaligned training data, Linear transformation of the spectral envelope, Voice conversion
Abstract: Although all conventional voice conversion approaches require equivalent training utterances of source and target speaker, several recently proposed applications call for breaking this demand. In this paper, we present an algorithm which finds corresponding time frames within unaligned training data. The performance of this algorithm is tested by means of a voice conversion framework based on linear transformation of the spectral envelope. Experimental results are reported on a Spanish cross-gender corpus utilizing several objective error measures.
Published: 2004

49. Automatic Drums Transcription for polyphonic music using Non-Negative Matrix Factor Deconvolution

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, Roebel, Axel, and Pons i Puig, Jordi
Published: 2014

50. Query by Humming

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, and Tur Vallés, Pau
Abstract: This TFG would explore different methods to retrieve song information from a query humming the song. In this thesis, a Query by Singing/Humming (QbSH) system has been developed. A QbSH system tries to retrieve information about a song given a melody recorded by the user. The system compares human queries with melodies extracted from audio files. A pitch extraction algorithm has been used to obtain the melodies for both queries and database songs. The preprocessing of the signals turned out to be crucial, and has been studied in depth. The matching step used Dynamic Time Warping, which computes a distance between two signals while absorbing tempo variations. Several databases have been built to assess the system. Finally, a complete Graphical User Interface has been programmed to allow the user to analyze the system step by step. In the end, this thesis contains a thorough account of the creation of the system which, obtaining competitive results, provides a solid basis for further development.
Published: 2014
