Author: "de la Puente, Santiago" - Searchworks@Jio Institute Digital Library Search Results

45 results on '"de la Puente, Santiago"'

1. Linking catch reconstructions with downstream supply chain nodes can help strengthen management actions in favour of just, sustainable and resilient futures

Author: de la Puente, Santiago and Christensen, Villy
Published: 2024
Full Text: View/download PDF

2. WTO must ban harmful fisheries subsidies

Author: Sumaila, U Rashid, Skerritt, Daniel J, Schuhbauer, Anna, Villasante, Sebastian, Cisneros-Montemayor, Andrés M, Sinan, Hussain, Burnside, Duncan, Abdallah, Patrízia Raggi, Abe, Keita, Addo, Kwasi A, Adelsheim, Julia, Adewumi, Ibukun J, Adeyemo, Olanike K, Adger, Neil, Adotey, Joshua, Advani, Sahir, Afrin, Zahidah, Aheto, Denis, Akintola, Shehu L, Akpalu, Wisdom, Alam, Lubna, Alava, Juan José, Allison, Edward H, Amon, Diva J, Anderies, John M, Anderson, Christopher M, Andrews, Evan, Angelini, Ronaldo, Anna, Zuzy, Antweiler, Werner, Arizi, Evans K, Armitage, Derek, Arthur, Robert I, Asare, Noble, Asche, Frank, Asiedu, Berchie, Asuquo, Francis, Badmus, Lanre, Bailey, Megan, Ban, Natalie, Barbier, Edward B, Barley, Shanta, Barnes, Colin, Barrett, Scott, Basurto, Xavier, Belhabib, Dyhia, Bennett, Elena, Bennett, Nathan J, Benzaken, Dominique, Blasiak, Robert, Bohorquez, John J, Bordehore, Cesar, Bornarel, Virginie, Boyd, David R, Breitburg, Denise, Brooks, Cassandra, Brotz, Lucas, Campbell, Donovan, Cannon, Sara, Cao, Ling, Cardenas Campo, Juan C, Carpenter, Steve, Carpenter, Griffin, Carson, Richard T, Carvalho, Adriana R, Castrejón, Mauricio, Caveen, Alex J, Chabi, M Nicole, Chan, Kai MA, Chapin, F Stuart, Charles, Tony, Cheung, William, Christensen, Villy, Chuku, Ernest O, Church, Trevor, Clark, Colin, Clarke, Tayler M, Cojocaru, Andreea L, Copeland, Brian, Crawford, Brian, Crépin, Anne-Sophie, Crowder, Larry B, Cury, Philippe, Cutting, Allison N, Daily, Gretchen C, Da-Rocha, Jose Maria, Das, Abhipsita, de la Puente, Santiago, de Zeeuw, Aart, Deikumah, Savior KS, Deith, Mairin, Dewitte, Boris, Doubleday, Nancy, Duarte, Carlos M, Dulvy, Nicholas K, Eddy, Tyler, Efford, Meaghan, Ehrlich, Paul R, Elsler, Laura G, and Fakoya, Kafayat A
Subjects: General Science & Technology
Published: 2021

3. Constrained public benefits from global catch share fisheries

Author: Ben-Hasan, Abdulrahman, De La Puente, Santiago, Flores, Diana, Melnychuk, Michael C., Tivoli, Emily, Christensen, Villy, Cui, Wei, and Walters, Carl J.
Published: 2021

4. Adoption of sustainable low-impact fishing practices is not enough to secure sustainable livelihoods and social wellbeing in small-scale fishing communities

Author: de la Puente, Santiago, López de la Lama, Rocío, Llerena-Cayo, Camila, Martínez, Benny R., Rey-Cama, Gonzalo, Christensen, Villy, Rivera-Ch, María, and Valdés-Velasquez, Armando
Published: 2022
Full Text: View/download PDF

5. A Fish-Focused Menu: An Interdisciplinary Reconstruction of Ancestral Tsleil-Waututh Diets.

Author: Efford, Meaghan, de la Puente, Santiago, George, Micheal, George, Michelle, Testani, Alessandria, Taft, Spencer, Morin, Jesse, Hilsden, Jay, Zhu, Jennifer, Chen, Pengpeng, Paskulin, Lindsey, Toniello, Ginevra, Christensen, Villy, and Speller, Camilla
Subjects: SEA birds, FORAGE fishes, SEX (Biology), ANIMAL species, FOOD chains
Abstract: The study of past subsistence offers archeologists a lens through which we can understand relationships between people and their homelands. səl̓ilwətaɬ (Tsleil-Waututh) is a Coast Salish Nation whose traditional and unceded territory centers on səl̓ilwət (Tsleil-Wat, Burrard Inlet, British Columbia, Canada). səl̓ilwətaɬ people were fish specialists whose traditional diet focused primarily on marine and tidal protein sources. In this research, we draw on the archeological record, ecology, historical and archival records, and səl̓ilwətaɬ oral histories and community knowledge to build an estimated precontact diet that ancestral səl̓ilwətaɬ people obtained from səl̓ilwət. Based on prior archeological research, we assume a high protein diet that is primarily (90–100 percent) from marine and tidal sources. The four pillars of səl̓ilwətaɬ precontact diets (salmon, forage fish, shellfish, and marine birds) offer anchor points that ensure the diet is realistic, evidence-based, and representative of community knowledge. We consider the caloric needs of adults, children, elders, and those who are pregnant or lactating. Finally, we consider the variation in the edible yield from different animal species and their relationships in the food web. Together, these data and anchor points build an estimated precontact diet averaged across seasons, ages, and biological sex from approximately 1000 CE up until early European contact in approximately 1792 CE. The reconstruction of səl̓ilwətaɬ lifeways and subsistence practices, which were based on a myriad of stewardship techniques, aids our understanding of the precontact səl̓ilwətaɬ diet and the relationship between səl̓ilwətaɬ and their territory.
Published: 2024
Full Text: View/download PDF

6. Open-Ended Visual Question-Answering

Author: Masuda, Issey, de la Puente, Santiago Pascual, and Giro-i-Nieto, Xavier
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: This thesis report studies methods to solve Visual Question-Answering (VQA) tasks with a Deep Learning framework. As a preliminary step, we explore Long Short-Term Memory (LSTM) networks used in Natural Language Processing (NLP) to tackle text-based Question-Answering. We then modify the previous model to accept an image as an input in addition to the question. For this purpose, we explore the VGG-16 and K-CNN convolutional neural networks to extract visual features from the image. These are merged with the word embedding or with a sentence embedding of the question to predict the answer. This work was successfully submitted to the Visual Question Answering Challenge 2016, where it achieved 53.62% accuracy on the test dataset. The developed software has followed the best programming practices and Python code style, providing a consistent baseline in Keras for different configurations., Comment: Bachelor thesis report graded with A with honours at ETSETB Telecom BCN school, Universitat Politècnica de Catalunya (UPC). June 2016. Source code and models are publicly available at http://imatge-upc.github.io/vqa-2016-cvprw/
Published: 2016

7. Not in it for the money: Meaningful relationships sustain voluntary land conservation initiatives in Peru.

Author: López de la Lama, Rocío, Bennett, Nathan, Bulkan, Janette, de la Puente, Santiago, and Chan, Kai M. A.
Subjects: CONSERVATION easements, NATURE reserves, PROTECTED areas, SEMI-structured interviews, HUMAN beings, PERIODICAL articles
Published: 2024
Full Text: View/download PDF

8. A delay‐differential model for representing small pelagic fish stock dynamics and its application for assessing alternative management strategies under environmental uncertainty

Author: Licandeo, Roberto, de la Puente, Santiago, Christensen, Villy, Hilborn, Ray, and Walters, Carl
Published: 2023
Full Text: View/download PDF

9. Pollution, habitat loss, fishing, and climate change as critical threats to penguins

Author: Trathan, Phil N., García-Borboroglu, Pablo, Boersma, Dee, Bost, Charles-André, Crawford, Robert J. M., Crossin, Glenn T., Cuthbert, Richard J., Dann, Peter, Davis, Lloyd Spencer, De La Puente, Santiago, Ellenberg, Ursula, Lynch, Heather J., Mattern, Thomas, Pütz, Klemens, Seddon, Philip J., Trivelpiece, Wayne, and Wienecke, Barbara
Published: 2015
Full Text: View/download PDF

10. Valuing seafood: The Peruvian fisheries sector

Author: Christensen, Villy, de la Puente, Santiago, Sueiro, Juan Carlos, Steenbeek, Jeroen, and Majluf, Patricia
Published: 2014
Full Text: View/download PDF

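11. Desafiando la tradición de país harinero: Una mirada económica de la actividad pesquera de Piura, Perú [Challenging the tradition of a fishmeal-producing country: An economic look at fishing activity in Piura, Peru]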

Author: Gozzer Wuest, Renato, Sueiro, Juan Carlos, Grillo-Núñez, Jorge, De La Puente, Santiago, Correa, Mario, Mendo, Tania, and Mendo, Jaime
Published: 2022
Full Text: View/download PDF

12. Successful Sirolimus Treatment for Recurrent Pericardial Effusion in a Large Cervico-Mediastinal Provisionally Unclassified Vascular Anomaly: Case Report

Author: Moreno, Julio, Berenguer, María San Basilio, Caldas, Maria Sarmiento, Cayón, Jesús González, De La Puente, Santiago, Junco, Paloma Elena Triana, and Gutiérrez, Juan Carlos López
Published: 2022
Full Text: View/download PDF

13. Successful Sirolimus Treatment for Recurrent Pericardial Effusion in a Large Cervicomediastinal Provisionally Unclassified Vascular Anomaly: A Case Report

Author: Moreno-Alfonso, Julio César, San Basilio Berenguer, María, Sarmiento Caldas, María del Carmen, González Cayón, Jesús, de la Puente, Santiago, Triana, Paloma, and López-Gutiérrez, Juan Carlos
Published: 2023
Full Text: View/download PDF

14. Efficient, end-to-end and self-supervised methods for speech processing and generation

Author: Pascual de la Puente, Santiago, Bonafonte Cávez, Antonio, Serra Julià, Joan, and Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions
Subjects: Telecommunications engineering [UPC subject areas]
Abstract: Extraordinary Doctoral Award, UPC, academic year 2019-2020, ICT Engineering field. Deep learning has affected the speech processing and generation fields in many directions. First, end-to-end architectures allow the direct injection and synthesis of waveform samples. Second, the exploration of efficient solutions allows these systems to be implemented in computationally restricted environments, like smartphones. Finally, the latest trends exploit audio-visual data with minimal supervision. In this thesis these three directions are explored. First, we propose the use of recent pseudo-recurrent structures, like self-attention models and quasi-recurrent networks, to build acoustic models for text-to-speech. The proposed system, QLAD, synthesizes faster on CPU and GPU than its recurrent counterpart whilst preserving synthesis quality, which is competitive with state-of-the-art vocoder-based models. Then, a generative adversarial network named SEGAN is proposed for speech enhancement. This model works as a speech-to-speech conversion system in the time domain, where a single inference operation is needed for all samples to pass through a fully convolutional structure. This implies an increase in modeling efficiency with respect to other existing models, which are auto-regressive and also work in the time domain. SEGAN achieves prominent results in noise suppression and preservation of speech naturalness and intelligibility when compared to other classic and deep regression based systems. We also show that SEGAN is efficient in transferring its operations to new languages and noises. A SEGAN trained on English achieves performance on Catalan and Korean comparable to that on English with only 24 seconds of adaptation data. Finally, we unveil the generative capacity of the model to recover signals from several distortions. We hence propose the concept of generalized speech enhancement.
First, the model proves effective at recovering voiced speech from whispered speech. Then the model is scaled up to solve other distortions that require a recomposition of damaged parts of the signal, like extending the bandwidth or recovering lost temporal sections, among others. The model improves by including additional acoustic losses in a multi-task setup to impose a relevant perceptual weighting on the generated result. Moreover, a two-step training schedule is also proposed to stabilize the adversarial training after the addition of such losses, and both components boost SEGAN's performance across distortions.
Finally, we propose a problem-agnostic speech encoder, named PASE, together with the framework to train it. PASE is a fully convolutional network that yields compact representations from speech waveforms. These representations contain abstract information like the speaker identity, the prosodic features or the spoken contents. A self-supervised framework is also proposed to train this encoder, which represents a new step towards unsupervised learning for speech processing. Once the encoder is trained, it can be exported to solve different tasks that require speech as input. We first explore the performance of PASE codes to solve speaker recognition, emotion recognition and speech recognition. PASE works competitively well compared to well-designed classic features in these tasks, especially after some supervised adaptation. Finally, PASE also provides good descriptors of identity for multi-speaker modeling in text-to-speech, which is advantageous to model novel identities without retraining the model. Award-winning
Published: 2020

15. Efficient, end-to-end and self-supervised methods for speech processing and generation

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, Serra Julià, Joan, and Pascual de la Puente, Santiago
Abstract: Extraordinary Doctoral Award, UPC, academic year 2019-2020, ICT Engineering field. Deep learning has affected the speech processing and generation fields in many directions. First, end-to-end architectures allow the direct injection and synthesis of waveform samples. Second, the exploration of efficient solutions allows these systems to be implemented in computationally restricted environments, like smartphones. Finally, the latest trends exploit audio-visual data with minimal supervision. In this thesis these three directions are explored. First, we propose the use of recent pseudo-recurrent structures, like self-attention models and quasi-recurrent networks, to build acoustic models for text-to-speech. The proposed system, QLAD, synthesizes faster on CPU and GPU than its recurrent counterpart whilst preserving synthesis quality, which is competitive with state-of-the-art vocoder-based models. Then, a generative adversarial network named SEGAN is proposed for speech enhancement. This model works as a speech-to-speech conversion system in the time domain, where a single inference operation is needed for all samples to pass through a fully convolutional structure. This implies an increase in modeling efficiency with respect to other existing models, which are auto-regressive and also work in the time domain. SEGAN achieves prominent results in noise suppression and preservation of speech naturalness and intelligibility when compared to other classic and deep regression based systems. We also show that SEGAN is efficient in transferring its operations to new languages and noises. A SEGAN trained on English achieves performance on Catalan and Korean comparable to that on English with only 24 seconds of adaptation data. Finally, we unveil the generative capacity of the model to recover signals from several distortions. We hence propose the concept of generalized speech enhancement.
First, the model proves effective at recovering voiced speech from whispered speech. Then the model is scaled up to solve other distortions that require a recompos, Award-winning, Postprint (published version)
Published: 2020

16. Growing Into Poverty: Reconstructing Peruvian Small-Scale Fishing Effort Between 1950 and 2018

Author: De la Puente, Santiago, López de la Lama, Rocío, Benavente, Selene, Sueiro, Juan Carlos, and Pauly, Daniel
Published: 2020
Full Text: View/download PDF

17. Time-domain speech enhancement using generative adversarial networks

Author: Universitat Politècnica de Catalunya. Doctorat en Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Pascual de la Puente, Santiago, Serra, Joan, and Bonafonte Cávez, Antonio
Abstract: Speech enhancement improves recorded voice utterances to eliminate noise that might be impeding their intelligibility or compromising their quality. Typical speech enhancement systems are based on regression approaches that subtract noise or predict clean signals. Most of them do not operate directly on waveforms. In this work, we propose a generative approach to regenerate corrupted signals into a clean version by using generative adversarial networks on the raw signal. We also explore several variations of the proposed system, obtaining insights into proper architectural choices for an adversarially trained, convolutional autoencoder applied to speech. We conduct both objective and subjective evaluations to assess the performance of the proposed method. The former helps us choose among variations and better tune hyperparameters, while the latter is used in a listening experiment with 42 subjects, confirming the effectiveness of the approach in the real world. We also demonstrate the applicability of the approach for more generalized speech enhancement, where we have to regenerate voices from whispered signals., Peer Reviewed, Postprint (author's final draft)
Published: 2019

18. Wav2Pix: speech-conditioned face generation using generative adversarial networks

Author: Universitat Politècnica de Catalunya. Doctorat en Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Universitat Politècnica de Catalunya. GPI - Grup de Processament d'Imatge i Vídeo, Cardoso Duarte, Amanda, Roldan, Francisco, Tubau, Miquel, Escur, Janna, Pascual de la Puente, Santiago, Salvador Aguilera, Amaia, Mohedano, Eva, McGuinness, Kevin, Torres Viñals, Jordi, and Giró Nieto, Xavier
Abstract: Speech is a rich biometric signal that contains information about the identity, gender and emotional state of the speaker. In this work, we explore its potential to generate face images of a speaker by conditioning a Generative Adversarial Network (GAN) with raw speech input. We propose a deep neural network that is trained from scratch in an end-to-end fashion, generating a face directly from the raw speech waveform without any additional identity information (e.g., a reference image or one-hot encoding). Our model is trained in a self-supervised approach by exploiting the audio and visual signals naturally aligned in videos. With the purpose of training from video data, we present a novel dataset collected for this work, with high-quality videos of YouTubers with notable expressiveness in both the speech and visual signals., Peer Reviewed, Postprint (published version)
Published: 2019

19. Exploring efficient neural architectures for linguistic-acoustic mapping in text-to-speech

Author: Universitat Politècnica de Catalunya. Doctorat en Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Pascual de la Puente, Santiago, Serra, Joan, and Bonafonte Cávez, Antonio
Abstract: Conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models such as recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure with intermediate affine transformations tends to make them slow to train and to sample from. In this work, we explore two different mechanisms that enhance the operational efficiency of recurrent neural networks, and study their performance–speed trade-off. The first mechanism is based on the quasi-recurrent neural network, where expensive affine transformations are removed from temporal connections and placed only on feed-forward computational directions. The second mechanism includes a module based on the transformer decoder network, designed without recurrent connections but emulating them with attention and positioning codes. Our results show that the proposed decoder networks are competitive in terms of distortion when compared to a recurrent baseline, whilst being significantly faster in terms of CPU and GPU inference time. The best performing model is the one based on the quasi-recurrent mechanism, reaching the same level of naturalness as the recurrent neural network based model with a speedup of 11.2 on CPU and 3.3 on GPU., Peer Reviewed, Postprint (published version)
Published: 2019

20. Establishing company level fishing revenue and profit losses from fisheries: A bottom-up approach

Author: Cashion, Tim, de la Puente, Santiago, Belhabib, Dyhia, Pauly, Daniel, Zeller, Dirk, and Sumaila, U. Rashid
Published: 2018
Full Text: View/download PDF

21. Bringing sustainable seafood back to the table: exploring chefs’ knowledge, attitudes and practices in Peru

Author: López De La Lama, Rocio, De La Puente, Santiago, and Valdés-Velásquez, Armando
Published: 2018
Full Text: View/download PDF

22. Towards cleaner shores: Assessing the Great Canadian Shoreline Cleanup's most recent data on volunteer engagement and litter removal along the coast of British Columbia, Canada

Author: Konecny, Cassandra, Fladmark, Vanessa, and De la Puente, Santiago
Published: 2018
Full Text: View/download PDF

23. Language and noise transfer in speech enhancement generative adversarial network

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Pascual de la Puente, Santiago, Park, Maruchan, Serra, Joan, Bonafonte Cávez, Antonio, and Ahn, Kang-hun
Abstract: ©2018 IEEE. Speech enhancement deep learning systems usually require large amounts of training data to operate in broad conditions or real applications. This makes the adaptability of those systems to new, low-resource environments an important topic. In this work, we present the results of adapting a speech enhancement generative adversarial network by fine-tuning the generator with small amounts of data. We investigate the minimum requirements to obtain a stable behavior in terms of several objective metrics in two very different languages: Catalan and Korean. We also study the variability of test performance to unseen noise as a function of the amount of different types of noise available for training. Results show that adapting a pre-trained English model with 10 min of data already achieves a comparable performance to having two orders of magnitude more data. They also demonstrate the relative stability in test performance with respect to the number of training noise types., Peer Reviewed, Postprint (published version)
Published: 2018

24. Spanish statistical parametric speech synthesis using a neural vocoder

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Bonafonte Cávez, Antonio, Pascual de la Puente, Santiago, and Dorca, G.
Abstract: During the 2000s, unit-selection-based text-to-speech was the dominant commercial technology. Meanwhile, the TTS research community made a big effort to push statistical-parametric speech synthesis toward similar quality and more flexibility in the synthetically generated voice. In recent years, deep learning advances applied to speech synthesis have closed the gap, especially when neural vocoders replace traditional signal-processing-based vocoders. In this paper we propose to replace the waveform generation vocoder of MUSA, our Spanish TTS, with SampleRNN, a neural vocoder recently proposed as a deep autoregressive raw waveform generation model. MUSA uses recurrent neural networks to predict vocoder parameters (MFCC and logF0) from linguistic features. Then, the Ahocoder vocoder is used to recover the speech waveform from the predicted parameters. In the first system SampleRNN is extended to generate speech conditioned on the Ahocoder generated parameters (MFCC and logF0), where two configurations have been considered to train the system. First, the parameters derived from the signal using Ahocoder are used. Second, the system is trained with the parameters predicted by MUSA, where SampleRNN and MUSA are jointly optimized. The subjective evaluation shows that the second system outperforms both the original Ahocoder and SampleRNN as an independent neural vocoder. Peer Reviewed, Postprint (published version)
Published: 2018

25. Bringing sustainable seafood back to the table: exploring chefs' knowledge, attitudes and practices in Peru.

Author: López De La Lama, Rocio, De La Puente, Santiago, and Valdés-Velásquez, Armando
Subjects: *SEAFOOD, *ATTITUDE (Psychology), *BEHAVIOR, *COOKS, *EDUCATIONAL background, *PROFIT & loss, *FISH anatomy, *FISH morphology
Abstract: Conservation organizations promoting sustainable seafood have had greater success when chefs are empowered as agents of change in favour of sustainable seafood. Peru is experiencing a gastronomic revolution with seafood at its core, and Peruvian top chefs are being approached by conservation organizations to become environmental advocates. Within this context we characterize the factors that influence chefs' behaviours regarding sustainable seafood. A total of 52 Peruvian top chefs were surveyed using the Knowledge, Attitudes and Practices Framework, complemented by a focus group with a subset of the surveyed population. Our results suggest that, regardless of their age or academic background, chefs are aware of the negative consequences that human activities have on the ocean and believe that restaurants have an obligation to become part of the solution by promoting the use of sustainable seafood. Nonetheless, three factors limit chefs' understanding of key concepts and prevent them from fully internalizing the environmental consequences of their actions in restaurants: (1) sustainability is a new topic for them, particularly for older chefs; (2) the fish species commonly used at restaurants are poorly regulated, and (3) chefs are risk averse to actions that could result in profit loss. Additionally, the structure of the seafood supply chain further limits chefs' capacity to act sustainably, even if they are aware of the need to change their behaviour. Recommendations are provided for future conservation campaigns advocating use of sustainable seafood, some of which have now been implemented. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

26. Attitudes and misconceptions towards sharks and shark meat consumption along the Peruvian coast

Author: López de la Lama, Rocío, primary, De la Puente, Santiago, additional, and Riveros, Juan Carlos, additional
Published: 2018
Full Text: View/download PDF

27. ASSESSMENT OF POLYCHLORINATED BIPHENYLS, ORGANOCHLORINE PESTICIDES, AND POLYBROMINATED DIPHENYL ETHERS IN THE BLOOD OF HUMBOLDT PENGUINS (SPHENISCUS HUMBOLDTI) FROM THE PUNTA SAN JUAN MARINE PROTECTED AREA, PERU

Author: Adkesson, Michael J., primary, Levengood, Jeffrey M., additional, Scott, John W., additional, Schaeffer, David J., additional, Langan, Jennifer N., additional, Cárdenas-Alayza, Susana, additional, de la Puente, Santiago, additional, Majluf, Patricia, additional, and Yi, Sandra, additional
Published: 2018
Full Text: View/download PDF

28. Voice conversion using Deep Learning

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, Pascual de la Puente, Santiago, and Aparicio Isarn, Albert
Abstract: In this project we present a first attempt at a Voice Conversion system based on Deep Learning in which the alignment between the training data is intrinsic to the model. Our system is structured in three main blocks. The first performs a vocoding of the speech (we have used Ahocoder for this task) and a normalization of the data. The second and main block consists of a Sequence-to-Sequence model: an RNN-based encoder-decoder structure with an Attention Mechanism. Its main strengths are the ability to process variable-length sequences and to align them internally. The third block of the system performs a denormalization and reconstructs the speech signal. For the development of our system we have used the Voice Conversion Challenge 2016 dataset, as well as a part of the TC-STAR dataset. Unfortunately we have not obtained the results we expected. At the end of this thesis we present them and discuss some hypotheses to explain the reasons behind them.
Published: 2017

29. Deep Learning aplicado a síntesis de voz

Author: Pascual de la Puente, Santiago, Bonafonte Cávez, Antonio, and Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions
Subjects: Machine Learning, speaker interpolation, speech synthesis, Computer Vision, Visió per ordinador, Aprenentatge automàtic, deep learning, speaker adaptation, recurrent neural networks, Enginyeria de la telecomunicació [Àrees temàtiques de la UPC], neural networks, TTS
Abstract: Deep Learning has been applied successfully to speech processing problems. In this work we explore its capabilities, focusing specifically on recurrent neural architectures to build a state-of-the-art Text-To-Speech system from scratch. The different steps to make the full TTS system are shown. Also, a post-filtering method to improve the naturalness of the generated speech is applied and evaluated. The objective results show which architecture best fits our problem, achieving low error rates in terms of cepstral distortion, pitch estimation error and voiced/unvoiced classification error. Also, subjective results suggest that the model achieves state-of-the-art quality in the synthesis, where the post-filtering factor seems to be a key component to get a good level of naturalness. A novel architecture called Multi-Output TTS is also proposed to hold multiple speakers inside the same structure. Some hidden layers are shared by all the speakers, while there is a specific output layer for each speaker. Objective and perceptual experiments prove that this scheme produces much better results in comparison with single-speaker models. Moreover, we also tackle the problem of speaker adaptation by adding a new output branch to the model and successfully training it without the need to modify the base optimized model. This fine-tuning method achieves better results than training the new speaker from scratch with its own model. Finally, we also tackle the problem of speaker interpolation by adding a new output layer (alpha-layer) on top of the Multi-Output branches. An identifying code is injected into the layer together with acoustic features of many speakers. Experiments show that the alpha-layer can effectively learn to interpolate the acoustic features between speakers.
Published: 2016

30. The little fish that can feed the world

Author: Majluf, Patricia, primary, De la Puente, Santiago, additional, and Christensen, Villy, additional
Published: 2017
Full Text: View/download PDF

31. Prosodic break prediction with RNNs

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Pascual de la Puente, Santiago, and Bonafonte Cávez, Antonio
Abstract: Prosodic break prediction from text is a fundamental task for obtaining naturalness in text-to-speech applications. In this work we build a data-driven break predictor out of linguistic features such as Part of Speech (POS) tags and forward-backward word distance to punctuation marks, and to do so we use a basic Recurrent Neural Network (RNN) model to exploit the sequence dependency in decisions. In the experiments we evaluate the performance of a logistic regression model and the recurrent one. The results show that the logistic regression outperforms the baseline (CART) by 9.5% in the F-score, and the addition of the recurrent layer in the model further improves the predictions of the baseline by 11%. Peer Reviewed, Postprint (published version)
Published: 2016

32. Voice generation using deep learning

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, Pascual de la Puente, Santiago, and Gómez Sánchez, Gonzalo
Abstract: Deep learning techniques are performing excellently in many speech-related tasks, such as recognition or synthesis. Much of this work relies on speech models or classical analysis techniques, such as the spectrogram or MFCC. This project aims to replace these techniques with deep neural networks that can design themselves to model the signal. One application that can be considered to validate this technology is coding. Voice generation, also known as Speech Synthesis, is the artificial production of human speech. In the last decade, Speech Synthesis research has focused on a technique called Statistical Parametric Speech Synthesis. This technique uses a statistical model that obtains parameters (acoustic features) defining the signal from a text. These parameters are then converted into a waveform using a vocoder. The vocoder is needed, but it decreases the quality of the obtained audio. In the past few years, Deep Learning techniques have shown great performance in many fields. One of them is Speech Synthesis, where Deep Learning is used as a substitute for the statistical model, obtaining the parameters that define the signal with great effectiveness. However, the quality of the synthesis is still affected by the use of the vocoder. For this reason, in this work, we investigate how to generate the audio waveform from the parameters using Deep Neural Networks. If this works, it could be possible to build a DNN system that generates an audio waveform using text as input, leaving the vocoder out of the scheme. Different architectures were tested before arriving at the final model. The first attempt was to directly map the frames of the signal using a Long Short-Term Memory Recurrent Neural Network. In the second one, instead of generating the signal frame by frame we did it sample by sample. We tried a different architecture in the third model, using a Clockwork RNN. Finally, in the fourth model we again used an LSTM, but this time we generated the signal by frequency sub-bands, using Pseudo-Quadrature Mirror Filter banks. The models that showed the best performance were the second and the fourth. Nevertheless, the computational cost of the second one is too high. We solved this problem in the fourth model. Generating the signal by sub-bands allows us to parallelize the problem and decrease the computational cost significantly. Although it is a great
Published: 2016

33. Open-ended visual question answering

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Giró Nieto, Xavier, Pascual de la Puente, Santiago, and Masuda Mora, Issey
Abstract: Wearable cameras generate a large number of photos which are, in many cases, useless or redundant. On the other hand, these devices provide an excellent opportunity to create automatic questions and answers for reminiscence therapy. This is a follow-up to the BSc thesis developed by Ricard Mestre during Fall 2014, and the MSc thesis developed by Aniol Lidon. This thesis studies methods to solve Visual Question-Answering (VQA) tasks with a Deep Learning framework. As a preliminary step, we explore Long Short-Term Memory (LSTM) networks used in Natural Language Processing (NLP) to tackle text-based Question-Answering. We then modify the previous model to accept an image as an input in addition to the question. For this purpose, we explore the VGG-16 and K-CNN convolutional neural networks to extract visual features from the image. These are merged with the word embedding or with a sentence embedding of the question to predict the answer. This work was successfully submitted to the Visual Question Answering Challenge 2016, where it achieved an accuracy of 53.62% on the test dataset. The developed software has followed the best programming practices and Python code style, providing a consistent baseline in Keras for different configurations. The source code and models are publicly available at https://github.com/imatge-upc/vqa-2016-cvprw.
Published: 2016

34. Deep learning applied to speech synthesis

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, and Pascual de la Puente, Santiago
Abstract: Deep Learning has been applied successfully to speech processing problems. In this work we explore its capabilities, focusing specifically on recurrent neural architectures to build a state-of-the-art Text-To-Speech system from scratch. The different steps to make the full TTS system are shown. Also, a post-filtering method to improve the naturalness of the generated speech is applied and evaluated. The objective results show which architecture best fits our problem, achieving low error rates in terms of cepstral distortion, pitch estimation error and voiced/unvoiced classification error. Also, subjective results suggest that the model achieves state-of-the-art quality in the synthesis, where the post-filtering factor seems to be a key component to get a good level of naturalness. A novel architecture called Multi-Output TTS is also proposed to hold multiple speakers inside the same structure. Some hidden layers are shared by all the speakers, while there is a specific output layer for each speaker. Objective and perceptual experiments prove that this scheme produces much better results in comparison with single-speaker models. Moreover, we also tackle the problem of speaker adaptation by adding a new output branch to the model and successfully training it without the need to modify the base optimized model. This fine-tuning method achieves better results than training the new speaker from scratch with its own model. Finally, we also tackle the problem of speaker interpolation by adding a new output layer (alpha-layer) on top of the Multi-Output branches. An identifying code is injected into the layer together with acoustic features of many speakers. Experiments show that the alpha-layer can effectively learn to interpolate the acoustic features between speakers.
Published: 2016

35. Graphical interface design to control speech synthesis

Author: Pascual de La Puente, Santiago, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, and Bonafonte Cávez, Antonio
Subjects: internet protocols, Sintetitzadors de veu, Android, programación multi-hilo, Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC], Processament de la parla, Speech synthesizers, Speech processing systems, protocolos de internet, multi-thread programming, Síntesis de voz, Speech synthesis
Abstract: In this project we developed a set of Android interfaces to control a speech synthesis system in real time. This involved designing and implementing every component of the interaction: the Android client, the synthesis server, and the communications between them. The control lets the user modify voice parameters to change the speaker being heard and to adjust features such as pitch or speed. The main challenge was controlling these synthesis parameters in real time, which required analyzing the communication methods that would allow this interactivity. The communication is built on the UDP and TCP transport protocols, which carry the voice signal and the session information respectively, and on the OSC application protocol, which is used to send requests from the Android client to the server. The client was developed for Android because it is an emerging system in today's mobile technology market. An important design feature of the client is that it adapts to the different devices running this operating system, both handsets and tablets, offering an interface suited to each type of device to enhance the user experience. On the server side, we have a statistical parametric synthesis system based on hidden Markov models. The HTS synthesis system is the starting point for this technique, but it does not offer a mechanism for interacting with the parameters in real time. The server was therefore developed with a framework called mage, which works on top of HTS and does allow synthesis and modification, thus meeting our requirement to change the parameters in real time.
Published: 2013

36. Editorial

Author: Pascual de la Puente, Santiago
Subjects: education, Enginyeria electrònica [Àrees temàtiques de la UPC], Telecommunication, Telecomunicació -- Revistes, Electronics
Published: 2013

37. Diseño de interfície de control gráfica para transformación de voz / Disseny d’interfície de control gràfica per transformació de veu

Author: Pascual de La Puente, Santiago and Bonafonte Cávez, Antonio
Subjects: internet protocols, Sintetitzadors de veu, Android, programación multi-hilo, Processament de la parla, Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC], Speech synthesizers, Speech processing systems, protocolos de internet, multi-thread programming, Síntesis de voz, Speech synthesis
Abstract: In this project we developed a set of Android interfaces to control a speech synthesis system in real time. This involved designing and implementing every component of the interaction: the Android client, the synthesis server, and the communications between them. The control lets the user modify voice parameters to change the speaker being heard and to adjust features such as pitch or speed. The main challenge was controlling these synthesis parameters in real time, which required analyzing the communication methods that would allow this interactivity. The communication is built on the UDP and TCP transport protocols, which carry the voice signal and the session information respectively, and on the OSC application protocol, which is used to send requests from the Android client to the server. The client was developed for Android because it is an emerging system in today's mobile technology market. An important design feature of the client is that it adapts to the different devices running this operating system, both handsets and tablets, offering an interface suited to each type of device to enhance the user experience. On the server side, we have a statistical parametric synthesis system based on hidden Markov models. The HTS synthesis system is the starting point for this technique, but it does not offer a mechanism for interacting with the parameters in real time. The server was therefore developed with a framework called mage, which works on top of HTS and does allow synthesis and modification, thus meeting our requirement to change the parameters in real time.
Published: 2013

38. Editorial

Author: Pascual de la Puente, Santiago (ORCID: 0000-0002-8365-7387)
Subjects: education, Telecommunication, Enginyeria electrònica [Àrees temàtiques de la UPC], Telecomunicació -- Revistes, Electronics
Published: 2013

39. Pollution, habitat loss, fishing, and climate change as critical threats to penguins

Author: Trathan, Phil N., García‐Borboroglu, Pablo, Boersma, Dee, Bost, Charles‐André, Crawford, Robert J. M., Crossin, Glenn T., Cuthbert, Richard J., Dann, Peter, Davis, Lloyd Spencer, De La Puente, Santiago, Ellenberg, Ursula, Lynch, Heather J., Mattern, Thomas, Pütz, Klemens, Seddon, Philip J., Trivelpiece, Wayne, and Wienecke, Barbara
Published: 2014
Full Text: View/download PDF

40. Disseny d'interfície de control gràfica per transformació de veu

Author: Bonafonte Cávez, Antonio, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, and Pascual de La Puente, Santiago
Abstract: In this project we developed a set of Android interfaces to control a speech synthesis system in real time. This involved designing and implementing every component of the interaction: the Android client, the synthesis server, and the communications between them. The control lets the user modify voice parameters to change the speaker being heard and to adjust features such as pitch or speed. The main challenge was controlling these synthesis parameters in real time, which required analyzing the communication methods that would allow this interactivity. The communication is built on the UDP and TCP transport protocols, which carry the voice signal and the session information respectively, and on the OSC application protocol, which is used to send requests from the Android client to the server. The client was developed for Android because it is an emerging system in today's mobile technology market. An important design feature of the client is that it adapts to the different devices running this operating system, both handsets and tablets, offering an interface suited to each type of device to enhance the user experience. On the server side, we have a statistical parametric synthesis system based on hidden Markov models. The HTS synthesis system is the starting point for this technique, but it does not offer a mechanism for interacting with the parameters in real time. The server was therefore developed with a framework called mage, which works on top of HTS and does allow synthesis and modification, thus meeting our requirement to change the parameters in real time.
Published: 2013

41. Disseny d'interfície de control gràfica per transformació de veu

Author: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, and Pascual de La Puente, Santiago
Abstract: In this project we developed a set of Android interfaces to control a speech synthesis system in real time. This involved designing and implementing every component of the interaction: the Android client, the synthesis server, and the communications between them. The control lets the user modify voice parameters to change the speaker being heard and to adjust features such as pitch or speed. The main challenge was controlling these synthesis parameters in real time, which required analyzing the communication methods that would allow this interactivity. The communication is built on the UDP and TCP transport protocols, which carry the voice signal and the session information respectively, and on the OSC application protocol, which is used to send requests from the Android client to the server. The client was developed for Android because it is an emerging system in today's mobile technology market. An important design feature of the client is that it adapts to the different devices running this operating system, both handsets and tablets, offering an interface suited to each type of device to enhance the user experience. On the server side, we have a statistical parametric synthesis system based on hidden Markov models. The HTS synthesis system is the starting point for this technique, but it does not offer a mechanism for interacting with the parameters in real time. The server was therefore developed with a framework called mage, which works on top of HTS and does allow synthesis and modification, thus meeting our requirement to change the parameters in real time.
Published: 2013

42. WTO must ban harmful fisheries subsidies

Author: Gert van Santen, John M. Anderies, Donovan Campbell, Tyler D. Eddy, Omu Kakujaha-Matundu, Bryce D. Stewart, Marten Scheffer, Jessica Fanzo, Rowenna Gryba, F. Stuart Chapin, Denis Worlanyo Aheto, Katina Roumbedakis, Ibrahim Issifu, Gordon R. Munro, Shakuntala H. Thilsted, Ibukun Jacob Adewumi, Evgeny A. Pakhomov, Grant Murray, Jason F. Shogren, Unai Pascual, Satoshi Yamazaki, Margaret Spring, Carlos M. Duarte, Kathleen Segerson, U. Rashid Sumaila, Precious Agbeko Dzorgbe Mattah, Kyle Gillespie, Saleem Mustafa, Lan Xiao, Joshua Adotey, Frances Westley, Francis K. E. Nunoo, Frank Asche, Zuzy Anna, Boris Worm, D. R. Fraser Taylor, Diva J. Amon, Roshni S. Mangar, Cassandra M. Brooks, Frederik Noack, Brooks Kaiser, Nathan J. Bennett, William W. L. Cheung, Dwight Owens, S. Kim Juniper, Derek Armitage, Karly McMullen, Dawn Kotowicz, Enric Sala, Paul O. Onyango, Francis E. Asuquo, Kristin M. Kleisner, Monirul Islam, Juliano Palacios Abrantes, Tony Charles, Dana D. Miller, Sarah Harper, Louise S. L. Teh, Juan José Alava, Aurélien Paulmier, Jeremy B. C. Jackson, Santiago de la Puente, Colin W. Clark, Jennifer J. Silver, Robert Blasiak, Colette C. C. Wabnitz, Gretchen C. Daily, Lydia C. L. Teh, John A. List, Alessandro Tavoni, Philippe D. Tortell, Tabitha Mallory, Jaime Mendo, Amadou Tall, Essam Yassin Mohammed, Romola V. Thumbadoo, Kristen Hopewell, Rebecca R. Helm, Mauricio Castrejón, Elena M. Bennett, Jean-Baptiste Thiebot, Jorge Jimenez Ramon, Patrick Kimani, Gerald G. Singh, Kátia Meirelles Felizola Freire, Johannes A. Iitembu, Sara E. Cannon, Jorge Ramírez, Richard S.J. Tol, Evelyn Pinkerton, Andrew Forrest, Juan Camilo Cárdenas Campo, Sadique Isahaku, Dyhia Belhabib, Moenieba Isaacs, Laura G. Elsler, Alessandro Tagliabue, Tom Okey, Tessa Owens, Alex J. Caveen, José-María Da-Rocha, Isigi Kadagi, Hong Yang, Ekow Prah, Glenn-Marie Lange, Mary S. Wisz, Vicky W. L. Lam, Maartje Oostdijk, Daniel Pauly, Torsten Thiele, Michel J. Kaiser, Christina C. Hicks, Nancy C. 
Doubleday, Nicholas K. Dulvy, Line Gordon, Thomas L. Frölicher, Kwasi Appeaning Addo, Katherine Millage, Alfredo Giron-Nava, Heike K. Lotze, Lincoln Hood, Michelle Tigchelaar, Keita Abe, S. Karuaihe, Nancy Knowlton, Jessica A. Gephart, Noble K. Asare, Werner Antweiler, Christopher D. G. Harley, Kai M. A. Chan, Rodrigue Orobiyi Edéya Pèlèbè, Duncan Burnside, Sarah Glaser, Hussain Sinan, Garry D. Peterson, Olaf P. Jensen, Don Robadue, Mafaniso Hara, Sahir Advani, Andreea L. Cojocaru, Fiorenza Micheli, Gakushi Ishimura, Berchie Asiedu, Tu Nguyen, Mohammed Oyinlola, Lubna Alam, Maria A. Gasalla, Priscila F. M. Lopes, Mary Karumba, Austin J. Gallagher, Sufian Jusoh, Brian R. Copeland, Christopher M. Anderson, Alberta Jonah, Christopher D. Golden, Fabrice Stephenson, Douglas J. McCauley, Isaac Okyere, Jennifer Jacquet, Elke U. Weber, Benjamin S. Halpern, Olanike Kudirat Adeyemo, Neil Adger, Nina Wambiji, Kristina M. Gjerde, A. Eyiwunmi Falaye, Polina Orlov, Umi Muawanah, Trevor Church, Denise Breitburg, J. P. Walsh, Edward H. Allison, Cullen S. Hendrix, Curtis A. Suttle, Thuy Thi Thanh Pham, Cesar Bordehore, Michael Harte, Xavier Basurto, Carol McAusland, Rainer Froese, Adibi R. M. Nor, Anne-Sophie Crépin, Karen C. Seto, Abhipsita Das, Philippe Cury, Masahide Kaeriyama, Peter Freeman, Dacotah-Victoria Splichalova, Nobuyuki Yagi, Natalie C. Ban, Larry B. Crowder, Véronique Garçon, Amanda T. Lombard, Katie R. N. Florko, Nicolás Talloni-Álvarez, Riad Sultan, Lisa A. Levin, Mimi E. Lam, Evans K. Arizi, Richard T. Carson, Megan Bailey, Steven J. Lade, Zahidah Afrin, Dianne Newell, Shanta C. Barley, Colin Barnes, Villy Christensen, Dirk Zeller, Simon A. Levin, Kolliyil Sunil Mohamed, Marta Flotats Aviles, Jonathan D. R. Houghton, Daniel J. Skerritt, Karin E. Limburg, Meaghan Efford, Michael C. Melnychuk, Lanre Badmus, Sebastián Villasante, Carie Hoover, Evan Andrews, Daniel Peñalosa, Allison N. 
Cutting, Nathan Pacoureau, Melissa Walsh, Wisdom Akpalu, Kafayat Adetoun Fakoya, Ling Cao, Edward B. Barbier, Clare Fitzsimmons, Alex Rogers, Robert Arthur, Daniel Marszalec, Jean-Baptiste Jouffray, Carl Folke, Anna Schuhbauer, Mazlin Mokhtar, Juan Mayorga, Ingrid van Putten, S.L. Akintola, Stephen Polasky, Lance Morgan, Jesper Stage, Lucas Brotz, M. Selçuk Uzmanoğlu, Boris Dewitte, Ahmed Khan, Ernest Obeng Chuku, Veronica Relano, Nicholas Polunin, Griffin Carpenter, Virginie Bornarel, Max Troell, Bárbara Horta e Costa, Lian E. Kwong, Mairin C. M. Deith, Valérie Le Brenne, Dan Laffoley, Hugh Govan, Ronaldo Angelini, Juan Carlos Villaseñor-Derbez, Mark J. Gibbons, Ambre Soszynski, Ola Flaaten, Stella Williams, M. Nicole Chabi, S. R. Carpenter, Prateep Kumar Nayak, David Obura, Scott Barrett, Philippe Le Billon, Patrízia Raggi Abdallah, John J. Bohorquez, Adriana Rosa Carvalho, Andrés M. Cisneros-Montemayor, Paul R. Ehrlich, John Kurien, Juan Carlos Seijo, Dominique Benzaken, Brian Crawford, Callum M. Roberts, Gabriel Reygondeau, Xue Jin, Julia Adelsheim, Mohd Talib Latif, Annie Mejaes, Frank Meere, Jeffrey McLean, Jennifer Dianto Kemmerly, Henrik Österblom, Savior K. S. Deikumah, Tayler M. Clarke, Aart de Zeeuw, Frédéric Le Manach, Maria Grazia Pennino, Quentin A Hanich, David R. 
Boyd, Sumaila, U Rashid, Skerritt, Daniel J, Schuhbauer, Anna, Villasante, Sebastian, Cisneros-Montemayor, Andrés M, Sinan, Hussain, Burnside, Duncan, Abdallah, Patrízia Raggi, Abe, Keita, Addo, Kwasi A, Adelsheim, Julia, Adewumi, Ibukun J, Adeyemo, Olanike K, Adger, Neil, Adotey, Joshua, Advani, Sahir, Afrin, Zahidah, Aheto, Denis, Akintola, Shehu L, Akpalu, Wisdom, Alam, Lubna, Alava, Juan José, Allison, Edward H, Amon, Diva J, Anderies, John M, Anderson, Christopher M, Andrews, Evan, Angelini, Ronaldo, Anna, Zuzy, Antweiler, Werner, Arizi, Evans K, Armitage, Derek, Arthur, Robert I, Asare, Noble, Asche, Frank, Asiedu, Berchie, Asuquo, Francis, Badmus, Lanre, Bailey, Megan, Ban, Natalie, Barbier, Edward B, Barley, Shanta, Barnes, Colin, Barrett, Scott, Basurto, Xavier, Belhabib, Dyhia, Bennett, Elena, Bennett, Nathan J, Benzaken, Dominique, Blasiak, Robert, Bohorquez, John J, Bordehore, Cesar, Bornarel, Virginie, Boyd, David R, Breitburg, Denise, Brooks, Cassandra, Brotz, Lucas, Campbell, Donovan, Cannon, Sara, Cao, Ling, Cardenas Campo, Juan C, Carpenter, Steve, Carpenter, Griffin, Carson, Richard T, Carvalho, Adriana R, Castrejón, Mauricio, Caveen, Alex J, Chabi, M Nicole, Chan, Kai M A, Chapin, F Stuart, Charles, Tony, Cheung, William, Christensen, Villy, Chuku, Ernest O, Church, Trevor, Clark, Colin, Clarke, Tayler M, Cojocaru, Andreea L, Copeland, Brian, Crawford, Brian, Crépin, Anne-Sophie, Crowder, Larry B, Cury, Philippe, Cutting, Allison N, Daily, Gretchen C, Da-Rocha, Jose Maria, Das, Abhipsita, de la Puente, Santiago, de Zeeuw, Aart, Deikumah, Savior K S, Deith, Mairin, Dewitte, Boris, Doubleday, Nancy, Duarte, Carlos M, Dulvy, Nicholas K, Eddy, Tyler, Efford, Meaghan, Ehrlich, Paul R, Elsler, Laura G, Fakoya, Kafayat A, Falaye, A Eyiwunmi, Fanzo, Jessica, Fitzsimmons, Clare, Flaaten, Ola, Florko, Katie R N, Aviles, Marta Flotats, Folke, Carl, Forrest, Andrew, Freeman, Peter, Freire, Kátia M F, Froese, Rainer, Frölicher, Thomas L, Gallagher, Austin, Garcon, 
Veronique, Gasalla, Maria A, Gephart, Jessica A, Gibbons, Mark, Gillespie, Kyle, Giron-Nava, Alfredo, Gjerde, Kristina, Glaser, Sarah, Golden, Christopher, Gordon, Line, Govan, Hugh, Gryba, Rowenna, Halpern, Benjamin S, Hanich, Quentin, Hara, Mafaniso, Harley, Christopher D G, Harper, Sarah, Harte, Michael, Helm, Rebecca, Hendrix, Cullen, Hicks, Christina C, Hood, Lincoln, Hoover, Carie, Hopewell, Kristen, Horta E Costa, Bárbara B, Houghton, Jonathan D R, Iitembu, Johannes A, Isaacs, Moenieba, Isahaku, Sadique, Ishimura, Gakushi, Islam, Monirul, Issifu, Ibrahim, Jackson, Jeremy, Jacquet, Jennifer, Jensen, Olaf P, Ramon, Jorge Jimenez, Jin, Xue, Jonah, Alberta, Jouffray, Jean-Baptiste, Juniper, S Kim, Jusoh, Sufian, Kadagi, Isigi, Kaeriyama, Masahide, Kaiser, Michel J, Kaiser, Brooks Alexandra, Kakujaha-Matundu, Omu, Karuaihe, Selma T, Karumba, Mary, Kemmerly, Jennifer D, Khan, Ahmed S, Kimani, Patrick, Kleisner, Kristin, Knowlton, Nancy, Kotowicz, Dawn, Kurien, John, Kwong, Lian E, Lade, Steven, Laffoley, Dan, Lam, Mimi E, Lam, Vicky W L, Lange, Glenn-Marie, Latif, Mohd T, Le Billon, Philippe, Le Brenne, Valérie, Le Manach, Frédéric, Levin, Simon A, Levin, Lisa, Limburg, Karin E, List, John, Lombard, Amanda T, Lopes, Priscila F M, Lotze, Heike K, Mallory, Tabitha G, Mangar, Roshni S, Marszalec, Daniel, Mattah, Preciou, Mayorga, Juan, McAusland, Carol, McCauley, Douglas J, McLean, Jeffrey, McMullen, Karly, Meere, Frank, Mejaes, Annie, Melnychuk, Michael, Mendo, Jaime, Micheli, Fiorenza, Millage, Katherine, Miller, Dana, Mohamed, Kolliyil Sunil, Mohammed, Essam, Mokhtar, Mazlin, Morgan, Lance, Muawanah, Umi, Munro, Gordon R, Murray, Grant, Mustafa, Saleem, Nayak, Prateep, Newell, Dianne, Nguyen, Tu, Noack, Frederik, Nor, Adibi M, Nunoo, Francis K E, Obura, David, Okey, Tom, Okyere, Isaac, Onyango, Paul, Oostdijk, Maartje, Orlov, Polina, Österblom, Henrik, Owens, Dwight, Owens, Tessa, Oyinlola, Mohammed, Pacoureau, Nathan, Pakhomov, Evgeny, Abrantes, Juliano Palacio, 
Pascual, Unai, Paulmier, Aurélien, Pauly, Daniel, Pèlèbè, Rodrigue Orobiyi Edéya, Peñalosa, Daniel, Pennino, Maria G, Peterson, Garry, Pham, Thuy T T, Pinkerton, Evelyn, Polasky, Stephen, Polunin, Nicholas V C, Prah, Ekow, Ramírez, Jorge, Relano, Veronica, Reygondeau, Gabriel, Robadue, Don, Roberts, Callum, Rogers, Alex, Roumbedakis, Katina, Sala, Enric, Scheffer, Marten, Segerson, Kathleen, Seijo, Juan Carlo, Seto, Karen C, Shogren, Jason F, Silver, Jennifer J, Singh, Gerald, Soszynski, Ambre, Splichalova, Dacotah-Victoria, Spring, Margaret, Stage, Jesper, Stephenson, Fabrice, Stewart, Bryce D, Sultan, Riad, Suttle, Curti, Tagliabue, Alessandro, Tall, Amadou, Talloni-Álvarez, Nicolá, Tavoni, Alessandro, Taylor, D R Fraser, Teh, Louise S L, Teh, Lydia C L, Thiebot, Jean-Baptiste, Thiele, Torsten, Thilsted, Shakuntala H, Thumbadoo, Romola V, Tigchelaar, Michelle, Tol, Richard S J, Tortell, Philippe, Troell, Max, Uzmanoğlu, M Selçuk, van Putten, Ingrid, van Santen, Gert, Villaseñor-Derbez, Juan Carlo, Wabnitz, Colette C C, Walsh, Melissa, Walsh, J P, Wambiji, Nina, Weber, Elke U, Westley, France, Williams, Stella, Wisz, Mary S, Worm, Bori, Xiao, Lan, Yagi, Nobuyuki, Yamazaki, Satoshi, Yang, Hong, and Zeller, Dirk
Subjects: 0106 biological sciences, Aquatic Ecology and Water Quality Management, Multidisciplinary, WIMEK, 010504 meteorology & atmospheric sciences, Natural resource economics, 530 Physics, 010604 marine biology & hydrobiology, Subsidy, Aquatische Ecologie en Waterkwaliteitsbeheer, 01 natural sciences, WTO, fishery, subsidy, 13. Climate action, 550 Earth sciences & geology, SUBSÍDIOS, Life Science, 14. Life underwater, Business, 0105 earth and related environmental sciences
Abstract: Sustainably managed wild fisheries support food and nutritional security, livelihoods, and cultures (1). Harmful fisheries subsidies—government payments that incentivize overcapacity and lead to overfishing—undermine these benefits yet are increasing globally (2). World Trade Organization (WTO) members have a unique opportunity at their ministerial meeting in November to reach an agreement that eliminates harmful subsidies (3). We—a group of scientists spanning 46 countries and 6 continents—urge the WTO to make this commitment...
Published: 2021

43. Conversión de voz mediante Deep Learning [Voice Conversion Using Deep Learning]

Author: Aparicio Isarn, Albert, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, and Pascual de la Puente, Santiago
Subjects: Neural Networks, Speech Processing, Procesado de voz, Voice Conversion, Enginyeria de la telecomunicació [Àrees temàtiques de la UPC], Tractament del senyal, Neural networks (Computer science), Deep Learning, Machine learning, Aprenentatge automàtic, Conversión de Voz, Redes Neuronales, Xarxes neuronals (Informàtica), Processament de la parla, Speech processing systems, Aprendizaje Automático
Abstract: In this project we present a first attempt at a Voice Conversion system based on Deep Learning in which the alignment between the training data is intrinsic to the model. Our system is structured in three main blocks. The first performs a vocoding of the speech, encoding the signal into parameters (we have used Ahocoder for this task), and a normalization of the data. The second and main block consists of a Sequence-to-Sequence model: an RNN-based encoder-decoder structure with an Attention Mechanism. Its main strengths are the ability to process variable-length sequences and to align them internally. The third block of the system performs a denormalization and reconstructs the speech signal. For the development of our system we have used the Voice Conversion Challenge 2016 dataset, as well as a part of the TC-STAR dataset. Unfortunately we have not obtained the results we expected. At the end of this thesis we present them and discuss some hypotheses to explain the reasons behind them.
Published: 2017

44. Generación de Voz utilizando Aprendizaje Profundo [Voice Generation Using Deep Learning]

Author: Gómez Sánchez, Gonzalo, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Bonafonte Cávez, Antonio, and Pascual de la Puente, Santiago
Subjects: Neural networks (Computer science), Deep Learning, generación de voz, Aprendizaje profundo, Machine learning, voice synthesis, Aprenentatge automàtic, Xarxes neuronals (Informàtica), síntesis de habla, Enginyeria de la telecomunicació [Àrees temàtiques de la UPC], neural networks
Abstract: Deep learning techniques are performing excellently in many speech-related tasks, such as recognition and synthesis. Much of this work relies on speech models or on classical analysis techniques such as the spectrogram or MFCCs. This project aims to replace these techniques with deep neural networks that can design themselves to model the signal. One application that could be used to validate this technology is coding. Voice generation, also known as Speech Synthesis, is the artificial production of human speech. In the last decade, Speech Synthesis research has focused on a technique called Statistical Parametric Speech Synthesis. This technique uses a statistical model that generates the most likely parameters (acoustic features) defining the signal, conditioned on the input text. These parameters are then converted into a waveform using a vocoder. The vocoder is needed in statistical synthesis, but it limits the quality of the audio that can be obtained. In the past few years, Deep Learning techniques have shown great performance in many fields. One of them is Speech Synthesis, where Deep Learning is used as a substitute for the traditional statistical models, which are based on Hidden Markov Models, obtaining the parameters that define the signal with great effectiveness. However, the quality of the synthesis is still affected by the use of the vocoder. For this reason, in this work, we investigate how to generate the audio waveform from the parameters using Deep Neural Networks. If this works, it could be possible to build a DNN system that generates an audio waveform using text as input, leaving the vocoder out of the scheme. Different architectures were tested before getting to the final model. The first attempt was to directly map the frames of the signal using a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN). In the second one, instead of generating the signal frame by frame, we did it sample by sample. We tried a different architecture in the third model, using a Clockwork RNN. Finally, in the fourth model we again used an LSTM, but this time we generated the signal by frequency sub-bands, using Pseudo-Quadrature Mirror Filter (PQMF) banks. The models that showed the best performance were the second and the fourth. Nevertheless, the computational cost of the second one is too high. We solved this problem in the fourth model: generating the signal by sub-bands allows us to parallelize the problem and decrease the computational cost significantly. Although it is a great success that the system is able to generate an intelligible audio waveform without a parametric description, the voice obtained is not natural enough to be a competitive technology. These experiments leave the door open to a Text-to-Speech system completely based on Deep Learning, avoiding the use of the vocoder. We think that with deeper research, this architecture could surpass the quality of state-of-the-art systems.
Published: 2016

45. Open-ended visual question answering

Author: Masuda Mora, Issey, Giró Nieto, Xavier, Pascual de la Puente, Santiago, and Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions
Subjects: procesado de lenguaje natural, redes neuronales, Imatges--Processament, deep learning, Ordinadors neuronals, Enginyeria de la telecomunicació [Àrees temàtiques de la UPC], Neural computers, procesado de imágenes, Neural networks (Computer science), Image processing, Natural language processing (Computer science), Machine learning, aprendizaje automático, Aprenentatge automàtic, Xarxes neuronals (Informàtica), Tractament del llenguatge natural (Informàtica)
Abstract: Wearable cameras generate a large number of photos which are, in many cases, useless or redundant. On the other hand, these devices provide an excellent opportunity to create automatic questions and answers for reminiscence therapy. This is a follow-up to the BSc thesis developed by Ricard Mestre during Fall 2014 and the MSc thesis developed by Aniol Lidon. This thesis studies methods to solve Visual Question-Answering (VQA) tasks with a Deep Learning framework. As a preliminary step, we explore Long Short-Term Memory (LSTM) networks used in Natural Language Processing (NLP) to tackle text-only Question-Answering. We then modify the previous model to accept an image as an input in addition to the question. For this purpose, we explore the VGG-16 and K-CNN convolutional neural networks to extract visual features from the image. These are merged with the word embedding or with a sentence embedding of the question to predict the answer. This work was successfully submitted to the Visual Question Answering Challenge 2016, where it achieved an accuracy of 53.62% on the test dataset. The developed software has followed the best programming practices and Python code style, providing a consistent baseline in Keras for different configurations. The source code and models are publicly available at https://github.com/imatge-upc/vqa-2016-cvprw.
Published: 2016
