125 results on '"Sanchis Trilles, Germán"'
Search Results
2. Multi-input CNN for Text Classification in Commercial Scenarios
- Author
-
Parcheta, Zuzanna, Sanchis-Trilles, Germán, Casacuberta, Francisco, Redahl, Robin, Hutchison, David, Editorial Board Member, Kanade, Takeo, Editorial Board Member, Kittler, Josef, Editorial Board Member, Kleinberg, Jon M., Editorial Board Member, Mattern, Friedemann, Editorial Board Member, Mitchell, John C., Editorial Board Member, Naor, Moni, Editorial Board Member, Pandu Rangan, C., Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Terzopoulos, Demetri, Editorial Board Member, Tygar, Doug, Editorial Board Member, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Rojas, Ignacio, editor, Joya, Gonzalo, editor, and Catala, Andreu, editor
- Published
- 2019
3. Vector sentences representation for data selection in statistical machine translation
- Author
-
Chinea-Rios, Mara, Sanchis-Trilles, Germán, and Casacuberta, Francisco
- Published
- 2019
4. Discriminative ridge regression algorithm for adaptation in statistical machine translation
- Author
-
Chinea-Rios, Mara, Sanchis-Trilles, Germán, and Casacuberta, Francisco
- Published
- 2019
5. Log-Linear Weight Optimization Using Discriminative Ridge Regression Method in Statistical Machine Translation
- Author
-
Chinea-Rios, Mara, Sanchis-Trilles, Germán, Casacuberta, Francisco, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Alexandre, Luís A., editor, Salvador Sánchez, José, editor, and Rodrigues, João M. F., editor
- Published
- 2017
6. Learning Advanced Post-editing
- Author
-
Alabau, Vicent, Carl, Michael, Casacuberta, Francisco, Martínez, Mercedes García, González-Rubio, Jesús, Mesa-Lao, Bartolomé, Ortiz-Martínez, Daniel, Schaeffer, Moritz, Sanchis-Trilles, Germán, Li, Defeng, Series editor, Carl, Michael, editor, Bangalore, Srinivas, editor, and Schaeffer, Moritz, editor
- Published
- 2016
7. Integrating Online and Active Learning in a Computer-Assisted Translation Workbench
- Author
-
Ortiz-Martínez, Daniel, González-Rubio, Jesús, Alabau, Vicent, Sanchis-Trilles, Germán, Casacuberta, Francisco, Li, Defeng, Series editor, Carl, Michael, editor, Bangalore, Srinivas, editor, and Schaeffer, Moritz, editor
- Published
- 2016
8. Bilingual Data Selection Using a Continuous Vector-Space Representation
- Author
-
Chinea-Rios, Mara, Sanchis-Trilles, Germán, Casacuberta, Francisco, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Robles-Kelly, Antonio, editor, Loog, Marco, editor, Biggio, Battista, editor, Escolano, Francisco, editor, and Wilson, Richard, editor
- Published
- 2016
9. Sentence Clustering Using Continuous Vector Space Representation
- Author
-
Chinea-Rios, Mara, Sanchis-Trilles, Germán, Casacuberta, Francisco, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Paredes, Roberto, editor, Cardoso, Jaime S., editor, and Pardo, Xosé M., editor
- Published
- 2015
10. Multi-input CNN for Text Classification in Commercial Scenarios
- Author
-
Parcheta, Zuzanna, primary, Sanchis-Trilles, Germán, additional, Casacuberta, Francisco, additional, and Redahl, Robin, additional
- Published
- 2019
11. Improving translation quality stability using Bayesian predictive adaptation
- Author
-
Sanchis-Trilles, Germán and Casacuberta, Francisco
- Published
- 2015
12. Shay Cohen: Bayesian analysis in natural language processing. Morgan and Claypool, San Rafael, California, 2016, xxvii + 246 pp, ISBN 9781627058735
- Author
-
Sanchis-Trilles, Germán
- Published
- 2017
13. Online Learning of Log-Linear Weights in Interactive Machine Translation
- Author
-
López-Salcedo, Francisco-Javier, Sanchis-Trilles, Germán, Casacuberta, Francisco, Torre Toledano, Doroteo, editor, Ortega Giménez, Alfonso, editor, Teixeira, António, editor, González Rodríguez, Joaquín, editor, Hernández Gómez, Luis, editor, San Segundo Hernández, Rubén, editor, and Ramos Castro, Daniel, editor
- Published
- 2012
14. Passive-Aggressive for On-Line Learning in Statistical Machine Translation
- Author
-
Martínez-Gómez, Pascual, Sanchis-Trilles, Germán, Casacuberta, Francisco, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Vitrià, Jordi, editor, Sanches, João Miguel, editor, and Hernández, Mario, editor
- Published
- 2011
15. Online Learning via Dynamic Reranking for Computer Assisted Translation
- Author
-
Martínez-Gómez, Pascual, Sanchis-Trilles, Germán, Casacuberta, Francisco, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, and Gelbukh, Alexander, editor
- Published
- 2011
16. Bayesian Adaptation for Statistical Machine Translation
- Author
-
Sanchis-Trilles, Germán, Casacuberta, Francisco, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Hancock, Edwin R., editor, Wilson, Richard C., editor, Windeatt, Terry, editor, Ulusoy, Ilkay, editor, and Escolano, Francisco, editor
- Published
- 2010
17. Similarity Word-Sequence Kernels for Sentence Clustering
- Author
-
Andrés-Ferrer, Jesús, Sanchis-Trilles, Germán, Casacuberta, Francisco, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Hancock, Edwin R., editor, Wilson, Richard C., editor, Windeatt, Terry, editor, Ulusoy, Ilkay, editor, and Escolano, Francisco, editor
- Published
- 2010
18. Interactive translation prediction versus conventional post-editing in practice: a study with the CasMaCat workbench
- Author
-
Sanchis-Trilles, Germán, Alabau, Vicent, Buck, Christian, Carl, Michael, Casacuberta, Francisco, García-Martínez, Mercedes, Germann, Ulrich, González-Rubio, Jesús, Hill, Robin L., Koehn, Philipp, Leiva, Luis A., Mesa-Lao, Bartolomé, Ortiz-Martínez, Daniel, Saint-Amand, Herve, Tsoukala, Chara, and Vidal, Enrique
- Published
- 2014
19. Introducing Additional Input Information into Interactive Machine Translation Systems
- Author
-
Sanchis-Trilles, Germán, González, Maria-Teresa, Casacuberta, Francisco, Vidal, Enrique, Civera, Jorge, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Popescu-Belis, Andrei, editor, and Stiefelhagen, Rainer, editor
- Published
- 2008
20. Log-Linear Weight Optimization Using Discriminative Ridge Regression Method in Statistical Machine Translation
- Author
-
Chinea-Rios, Mara, primary, Sanchis-Trilles, Germán, additional, and Casacuberta, Francisco, additional
- Published
- 2017
21. Online adaptation strategies for statistical machine translation in post-editing scenarios
- Author
-
Martínez-Gómez, Pascual, Sanchis-Trilles, Germán, and Casacuberta, Francisco
- Published
- 2012
22. Learning Advanced Post-editing
- Author
-
Alabau, Vicent, primary, Carl, Michael, additional, Casacuberta, Francisco, additional, Martínez, Mercedes García, additional, González-Rubio, Jesús, additional, Mesa-Lao, Bartolomé, additional, Ortiz-Martínez, Daniel, additional, Schaeffer, Moritz, additional, and Sanchis-Trilles, Germán, additional
- Published
- 2016
23. Integrating Online and Active Learning in a Computer-Assisted Translation Workbench
- Author
-
Ortiz-Martínez, Daniel, primary, González-Rubio, Jesús, additional, Alabau, Vicent, additional, Sanchis-Trilles, Germán, additional, and Casacuberta, Francisco, additional
- Published
- 2016
24. Sentence Clustering Using Continuous Vector Space Representation
- Author
-
Chinea-Rios, Mara, primary, Sanchis-Trilles, Germán, additional, and Casacuberta, Francisco, additional
- Published
- 2015
25. Combining Embeddings of Input Data for Text Classification
- Author
-
Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació, Ministerio de Economía y Competitividad, Parcheta, Zuzanna, Sanchis Trilles, Germán, Casacuberta Nolla, Francisco, and Rendahl, Robin
- Abstract
[EN] The problem of automatic text classification is an essential part of text analysis. The improvement of text classification can be done at different levels such as a preprocessing step, network implementation, etc. In this paper, we focus on how the combination of different methods of text encoding may affect classification accuracy. To do this, we implemented a multi-input neural network that is able to encode input text using several text encoding techniques such as BERT, neural embedding layer, GloVe, skip-thoughts and ParagraphVector. The text can be represented at different levels of tokenised input text such as the sentence level, word level, byte pair encoding level and character level. Experiments were conducted on seven datasets from different language families: English, German, Swedish and Czech. Some of those languages contain agglutinations and grammatical cases. Two out of seven datasets originated from real commercial scenarios: (1) classifying ingredients into their corresponding classes by means of a corpus provided by Northfork; and (2) classifying texts according to the English level of their corresponding writers by means of a corpus provided by ProvenWord. The developed architecture achieves an improvement with different combinations of text encoding techniques depending on the different characteristics of the datasets. Once the best combination of embeddings at different levels was determined, different architectures of multi-input neural networks were compared. The results obtained with the best embedding combination and best neural network architecture were compared with state-of-the-art approaches. The results obtained with the dataset used in the experiments were better than the state-of-the-art baselines.
- Published
- 2021
26. Online Learning of Log-Linear Weights in Interactive Machine Translation
- Author
-
López-Salcedo, Francisco-Javier, primary, Sanchis-Trilles, Germán, additional, and Casacuberta, Francisco, additional
- Published
- 2012
27. Passive-Aggressive for On-Line Learning in Statistical Machine Translation
- Author
-
Martínez-Gómez, Pascual, primary, Sanchis-Trilles, Germán, additional, and Casacuberta, Francisco, additional
- Published
- 2011
28. Online Learning via Dynamic Reranking for Computer Assisted Translation
- Author
-
Martínez-Gómez, Pascual, primary, Sanchis-Trilles, Germán, additional, and Casacuberta, Francisco, additional
- Published
- 2011
29. Combining Embeddings of Input Data for Text Classification
- Author
-
Parcheta, Zuzanna, primary, Sanchis-Trilles, Germán, additional, Casacuberta, Francisco, additional, and Rendahl, Robin, additional
- Published
- 2020
30. Bayesian analysis in natural language processing Shay Cohen
- Author
-
Sanchis-Trilles, Germán
- Published
- 2017
31. Vector sentences representation for data selection in statistical machine translation
- Author
-
Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació, Generalitat Valenciana, Universitat Politècnica de València, Chinea-Rios, Mara, Sanchis Trilles, Germán, and Casacuberta Nolla, Francisco
- Abstract
[EN] One of the most popular approaches to machine translation consists in formulating the problem as a pattern recognition problem. Under this perspective, bilingual corpora are precious resources, as they allow for a proper estimation of the underlying models. In this framework, selecting the best possible corpus is critical, and data selection aims to find the best subset of the bilingual sentences from an available pool of sentences such that the final translation quality is improved. In this paper, we present a new data selection technique that leverages a continuous vector-space representation of sentences. Experimental results report improvements compared not only with a system trained only with in-domain data, but also compared with a system trained on all the available data. Finally, we compared our proposal with other state-of-the-art data selection techniques (cross-entropy selection and infrequent n-grams recovery) in two different scenarios, obtaining very promising results with our proposal: our data selection strategy is able to yield results that are at least as good as the best-performing strategy for each scenario. The empirical results reported are coherent across different language pairs.
- Published
- 2019
32. Discriminative ridge regression algorithm for adaptation in statistical machine translation
- Author
-
Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació, Generalitat Valenciana, Ministerio de Economía y Empresa, Chinea-Ríos, Mara, Sanchis-Trilles, Germán, and Casacuberta Nolla, Francisco
- Abstract
[EN] We present a simple and reliable method for estimating the log-linear weights of a state-of-the-art machine translation system, which takes advantage of the method known as discriminative ridge regression (DRR). Since inappropriate weight estimations lead to a wide variability of translation quality results, reaching a reliable estimate for such weights is critical for machine translation research. For this reason, a variety of methods have been proposed to reach reasonable estimates. In this paper, we present an algorithmic description and empirical results proving that DRR is able to provide comparable translation quality when compared to state-of-the-art estimation methods [i.e. MERT and MIRA], with a reduction in computational cost. Moreover, the empirical results reported are coherent across different corpora and language pairs.
- Published
- 2019
33. Advanced techniques for domain adaptation in Statistical Machine Translation
- Author
-
Casacuberta Nolla, Francisco, Sanchis Trilles, Germán, Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació, and Chinea Ríos, Mara
- Abstract
[EN] Statistical Machine Translation is a subfield of computational linguistics that investigates how to use computers in the process of translating a text from one human language into another. Statistical machine translation is the most popular approach used to build such automatic translation systems. The quality of these systems depends largely on the translation examples used during model training and adaptation. The datasets used are obtained from a wide variety of sources, and in many cases the most suitable data for a specific domain may not be at hand. Given this data scarcity problem, the main idea for solving it is to find the datasets best suited to training or adapting a translation system. To this end, this thesis proposes a set of data selection techniques that identify the bilingual data most relevant to a task from a large pool of data. As a first step in this thesis, the data selection techniques are applied to improve the translation quality of translation systems under the phrase-based paradigm. These techniques are based on the concept of continuous representation of words or sentences in a vector space. The experimental results show that the techniques used are effective for different languages and domains. The Neural Machine Translation paradigm was also applied in this thesis. Within this paradigm, we investigate how the data selection techniques previously validated in the phrase-based paradigm can be applied. The work focused on two different system adaptation tasks. On the one hand, we investigate how to increase the translation quality of the system by increasing the size of the training set. On the other hand, the data selection method was used to create a dataset ...
- Published
- 2019
34. Filtering of Noisy Parallel Corpora Based on Hypothesis Generation
- Author
-
Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació, Nvidia, Ministerio de Economía y Competitividad, Parcheta, Zuzanna, Sanchis Trilles, Germán, and Casacuberta Nolla, Francisco
- Abstract
[EN] The WMT2019 shared task on filtering noisy parallel corpora challenges participants to create filtering methods that are useful for training machine translation systems. In this work, we introduce a noisy parallel corpora filtering system based on generating hypotheses by means of a translation model. We train translation models for both language pairs, Nepali-English and Sinhala-English, using the provided parallel corpora. To create the best possible translation model, we first join all provided parallel corpora (Nepali, Sinhala and Hindi to English) and then apply bilingual cross-entropy selection for both language pairs (Nepali-English and Sinhala-English). Once the translation models are trained, we translate the noisy corpora and generate a hypothesis for each sentence pair. We compute the smoothed BLEU score between the target sentence and the generated hypothesis. In addition, we apply several rules to discard very noisy or inadequate sentences which can lower the translation score. These heuristics are based on sentence length, source and target similarity and source language detection. We compare our results with the baseline published on the shared task website, which uses the Zipporah model, over which we achieve significant improvements in one of the conditions in the shared task. The designed filtering system is domain independent and all experiments are conducted using neural machine translation.
- Published
- 2019
35. Implementing a neural machine translation engine for mobile devices: the Lingvanex use case
- Author
-
Parcheta, Zuzanna, Sanchis-Trilles, Germán, Rudak, Aliaksei, and Bratchenia, Siarhei
- Subjects
Machine Translation, Computer Languages and Systems
- Abstract
In this paper, we present the challenge entailed by implementing a mobile version of a neural machine translation system, where the goal is to maximise translation quality while minimising model size. We explain the whole process of implementing the translation engine on an English–Spanish example and we describe all the difficulties found and the solutions implemented. The main techniques used in this work are data selection by means of Infrequent n-gram Recovery, appending a special word at the end of each sentence, and generating additional samples without the final punctuation marks. The last two techniques were devised with the purpose of achieving a translation model that generates sentences without the final full stop, or other punctuation marks. Also, in this work, the Infrequent n-gram Recovery was used for the first time to create a new corpus, and not enlarge the in-domain dataset. Finally, we get a small size model with quality good enough to serve for daily use. Work partially supported by MINECO under grant DI-15-08169 and by Sciling under its R+D programme.
- Published
- 2018
36. Creating the best development corpus for Statistical Machine Translation systems
- Author
-
Chinea-Rios, Mara, Sanchis-Trilles, Germán, and Casacuberta, Francisco
- Subjects
Machine Translation, Computer Languages and Systems
- Abstract
We propose and study three different novel approaches for tackling the problem of development set selection in Statistical Machine Translation. We focus on a scenario where a machine translation system is leveraged for translating a specific test set, without further data from the domain at hand. Such test set stems from a real application of machine translation, where the texts of a specific e-commerce were to be translated. For developing our development-set selection techniques, we first conducted experiments in a controlled scenario, where labelled data from different domains was available, and evaluated the techniques both with classification and translation quality metrics. Then, the best-performing techniques were evaluated on the e-commerce data at hand, yielding consistent improvements across two language directions. The research leading to these results was partially supported by projects CoMUN-HaT-TIN2015-70924-C2-1-R (MINECO/FEDER) and PROMETEO/2018/004.
- Published
- 2018
37. Filtering of Noisy Parallel Corpora Based on Hypothesis Generation
- Author
-
Parcheta, Zuzanna, primary, Sanchis-Trilles, Germán, additional, and Casacuberta, Francisco, additional
- Published
- 2019
38. Discriminative ridge regression algorithm for adaptation in statistical machine translation
- Author
-
Chinea-Rios, Mara, primary, Sanchis-Trilles, Germán, additional, and Casacuberta, Francisco, additional
- Published
- 2018
39. Data selection for NMT using Infrequent n-gram Recovery
- Author
-
Parcheta, Zuzanna, Sanchis-Trilles, Germán, and Casacuberta, Francisco
- Abstract
Neural Machine Translation (NMT) has achieved promising results comparable with Phrase-Based Statistical Machine Translation (PBSMT). However, training a neural translation engine requires much more powerful machines than developing translation engines based on PBSMT. One solution to reduce the training cost of NMT systems is the reduction of the training corpus through data selection (DS) techniques. Many DS techniques applied in PBSMT bring good results. In this work, we show that the data selection technique based on infrequent n-gram occurrence described in (Gascó et al., 2012), commonly used for PBSMT systems, also works well for NMT systems. We focus our work on selecting data according to specific corpora using the previously mentioned technique. The specific-domain corpora used for our experiments are from the IT domain and the medical domain. The DS technique significantly reduces the execution time required to train the model, by between 87% and 93%. It also improves translation quality by up to 2.8 BLEU points. The improvements are obtained with just a small fraction of the data that accounts for between 6% and 20% of the total data.
- Published
- 2018
40. An Empirical Analysis of Data Selection Techniques in Statistical Machine Translation
- Author
-
Chinea Ríos, Mara, Sanchis Trilles, Germán, and Casacuberta Nolla, Francisco
- Subjects
Domain adaptation, Statistical machine translation, Bilingual sentence selection, Infrequent n-gram, Cross-entropy, Computer Languages and Systems
- Abstract
[EN] Domain adaptation has recently gained interest in statistical machine translation. One of the adaptation techniques is based on data selection. Data selection aims to select the best subset of bilingual sentences from an available pool of sentences, with which to train an SMT system. In this paper, we study how the bilingual corpora used by the data selection methods affect translation quality. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement No. 287576 (CasMaCat). Also funded by the Generalitat Valenciana under grant Prometeo/2009/014.
- Published
- 2015
41. Log-Linear Weight Optimization Using Discriminative Ridge Regression Method in Statistical Machine Translation
- Author
-
Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació, Generalitat Valenciana, Universitat Politècnica de València, Chinea-Ríos, Mara, Sanchis Trilles, Germán, and Casacuberta Nolla, Francisco
- Abstract
[EN] We present a simple and reliable method for estimating the log-linear weights of a state-of-the-art machine translation system, which takes advantage of the method known as discriminative ridge regression (DRR). Since inappropriate weight estimates lead to wide variability in translation quality, reaching a reliable estimate of these weights is critical for machine translation research. For this reason, a variety of methods have been proposed to reach reasonable estimates. In this paper, we present an algorithmic description and empirical results showing that DRR, applied in a pseudo-batch scenario, provides translation quality comparable to that of state-of-the-art estimation methods (i.e., MERT [1] and MIRA [2]). Moreover, the empirical results reported are consistent across different corpora and language pairs.
- Published
- 2017
42. Domain adaptation problem in statistical machine translation systems
- Author
-
Chinea Ríos, Mara, Sanchis Trilles, Germán, and Casacuberta Nolla, Francisco
- Subjects
Domain adaptation ,Data combination ,Data selection ,Statistical machine translation ,Phrase tables ,LENGUAJES Y SISTEMAS INFORMATICOS
- Abstract
Globalization has suddenly brought many people from different countries into contact with each other, requiring them to be able to speak several languages. Because human translators are slow and expensive, there is a need to develop machine translation systems to automate the task. Researchers have developed several approaches to machine translation. In this work, we use the Statistical Machine Translation approach. Statistical Machine Translation systems perform poorly when applied to new domains. The domain adaptation problem has recently gained interest in Statistical Machine Translation. The basic idea is to improve the performance of a system trained and tuned on a different domain than the one to be translated. This article studies different paradigms of domain adaptation. The results show improvements over a system trained only with in-domain data and over a system trained with all the available data.
- Published
- 2015
- Full Text
- View/download PDF
43. Does more data always yield better translations?
- Author
-
Gascó Mora, Guillem, Rocha Sánchez, Martha Alicia, Sanchis Trilles, Germán, Andrés Ferrer, Jesús, and Casacuberta Nolla, Francisco
- Subjects
Infrequent n-gram occurrence ,Training data selection techniques ,ESTADISTICA E INVESTIGACION OPERATIVA ,Bilingual corpora ,Probability of an indomain corpus ,LENGUAJES Y SISTEMAS INFORMATICOS
- Abstract
Nowadays, there are large amounts of data available to train statistical machine translation systems. However, it is not clear whether all the training data actually help. A system trained on a subset of such huge bilingual corpora might outperform one trained on all the bilingual data. This paper studies these issues by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus, and another based on infrequent n-gram occurrence. Experimental results report not only significant improvements over random sentence selection but also an improvement over a system trained with all the available data. Surprisingly, the improvements are obtained with just a small fraction of the data, accounting for less than 0.5% of the sentences. Afterwards, we show that much larger room for improvement exists, although this is done under non-realistic conditions. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement nr. 287755. This work was also supported by the Spanish MEC/MICINN under the MIPRCV "Consolider Ingenio 2010" program (CSD2007-00018), and iTrans2 (TIN2009-14511) project. Also supported by the Spanish MITyC under the erudito.com (TSI-020110-2009-439) project and Instituto Tecnológico de León, DGEST-PROMEP y CONACYT, México.
- Published
- 2012
44. SITGs for Phrase Extraction and Mouse Actions in IMT
- Author
-
Sanchis Trilles, Germán
- Subjects
Stochastic inversion transduction grammars ,Statistical machine translation ,Máster Universitario en Inteligencia Artificial, Reconocimiento de Formas e Imagen Digital-Màster Universitari en Intel·Ligència Artificial: Reconeixement de Formes i Imatge Digital ,Interactive machine translation ,LENGUAJES Y SISTEMAS INFORMATICOS
- Abstract
This thesis presents two main contributions in the fields of Statistical Machine Translation and Interactive Machine Translation. In the field of Statistical Machine Translation, the efforts have been focused on obtaining high quality, linguistically motivated phrase pairs by means of Stochastic Inversion Transduction Grammars (SITGs). By using a SITG for parsing a bilingual corpus, spans are defined over both input and output strings, yielding the possibility of considering these spans as translations of each other. By doing so, phrase tables can be built from the bilingual corpus and fed to an off-the-shelf Statistical Machine Translation decoder. Moreover, novel syntax-based models are introduced in this thesis, and experimental results are shown which back up the inclusion of such models in the standard phrase translation table. Since these models are inherent to SITGs, they cannot be included in other standard phrase-based models. In the field of Interactive Machine Translation, a new interface between the user and the machine is proposed. By considering the Mouse Actions the user performs as an important input source for the system, it is shown that important and consistent performance gains may be achieved. These gains come in some cases at the cost of having the user ask for new suffix hypotheses, but in other cases these gains come at no cost, hence yielding true improvements to the state of the art.
- Published
- 2011
45. An Empirical Analysis of Data Selection Techniques in Statistical Machine Translation
- Author
-
Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació, European Commission, Generalitat Valenciana, Chinea Ríos, Mara, Sanchis Trilles, Germán, and Casacuberta Nolla, Francisco
- Abstract
[EN] Domain adaptation has recently gained interest in statistical machine translation. One of the adaptation techniques is based on data selection, which aims to select the best subset of bilingual sentences from an available pool of sentences with which to train an SMT system. In this paper, we study how the bilingual corpora used by the data selection methods affect translation quality.
- Published
- 2015
46. Domain adaptation problem in statistical machine translation systems
- Author
-
Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació, Universitat Politècnica de València. Centro de Investigación Pattern Recognition and Human Language Technology, Chinea Ríos, Mara, Sanchis Trilles, Germán, and Casacuberta Nolla, Francisco
- Abstract
Globalization has suddenly brought many people from different countries into contact with each other, requiring them to be able to speak several languages. Because human translators are slow and expensive, there is a need to develop machine translation systems to automate the task. Researchers have developed several approaches to machine translation. In this work, we use the Statistical Machine Translation approach. Statistical Machine Translation systems perform poorly when applied to new domains. The domain adaptation problem has recently gained interest in Statistical Machine Translation. The basic idea is to improve the performance of a system trained and tuned on a different domain than the one to be translated. This article studies different paradigms of domain adaptation. The results show improvements over a system trained only with in-domain data and over a system trained with all the available data.
- Published
- 2015
47. Improving translation quality stability using Bayesian predictive adaptation
- Author
-
Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació, Generalitat Valenciana, European Commission, Sanchis Trilles, Germán, and Casacuberta Nolla, Francisco
- Abstract
[EN] We introduce a Bayesian approach for the adaptation of the log-linear weights present in state-of-the-art statistical machine translation systems. Typically, these weights are estimated by optimising a given translation quality criterion, taking into account only a certain set of development data (e.g., the adaptation data). In this article, we show that the Bayesian framework provides appropriate estimates of such weights in conditions where adaptation data is scarce. The theoretical framework is presented, along with thorough experimentation and a comparison with other weight estimation methods. We provide a comparison of different sampling strategies, including an effective heuristic strategy and a theoretically sound Markov chain Monte-Carlo algorithm. Experimental results show that Bayesian predictive adaptation (BPA) outperforms re-estimation from scratch in conditions where adaptation data is scarce. Further analysis reveals that the improvements obtained are due to the greater stability of the estimation procedure. In addition, the proposed BPA framework has a much lower computational cost than raw re-estimation. © 2015 Elsevier Ltd. All rights reserved.
- Published
- 2015
48. Sentence clustering using continuous vector space representation
- Author
-
Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació, Chinea Ríos, Mara, Sanchis Trilles, Germán, and Casacuberta Nolla, Francisco
- Abstract
The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-19390-8_49. In this paper, we present a clustering approach based on the combined use of a continuous vector space representation of sentences and the k-means algorithm. The principal motivation of this proposal is to split a big heterogeneous corpus into clusters of similar sentences. We use the word2vec toolkit to obtain the representation of a given word as a vector in a continuous space. We provide empirical evidence that our technique can lead to better clusters, in terms of intra-cluster perplexity and F1 score.
- Published
- 2015
49. Representatively Memorable: Sampling the Right Phrase Set to Get the Text Entry Experiment Right
- Author
-
European Commission, Leiva, Luis A., and Sanchis-Trilles, Germán
- Abstract
[EN] In text entry experiments, memorability is a desired property of the phrases used as stimuli. Unfortunately, to date there is no automated method to achieve this effect. As a result, researchers have to use either manually curated English-only phrase sets or sampling procedures that do not guarantee that phrases are memorable. In response to this need, we present a novel sampling method based on two core ideas: a multiple regression model over language-independent features, and the statistical analysis of the corpus from which phrases will be drawn. Our results show that researchers can finally use a method to successfully curate their own stimuli targeting potentially any language or domain. The source code as well as our phrase sets are publicly available.
- Published
- 2014
50. Estrategias de aprendizaje online de los pesos del modelo log-lineal en traducción automática interactiva
- Author
-
Casacuberta Nolla, Francisco, Sanchis Trilles, Germán, Universitat Politècnica de València. Servicio de Alumnado - Servei d'Alumnat, and Chinea Ríos, Mara
- Abstract
[EN] In a post-editing scenario, the intervention of human translators to correct the output of machine translation systems is still very much needed to reach the desired quality. The Interactive Machine Translation (IMT) paradigm can reduce the effort and time that the human translator has to invest in the correction process. This master's thesis proposes using the IMT paradigm combined with an approach that adapts the weights of the log-linear model to each translation by means of different online learning algorithms. The goal is for the system to learn from the corrected errors, easing the correction of subsequent translations. To this end, three online learning algorithms were used: Discriminative Ridge Regression, Perceptron-Like and Passive-Aggressive, all of which had previously been applied in post-editing with positive results. Using these algorithms within the IMT scenario required a new formulation of each of them. With these new formulations, this thesis obtains varied results, which open the possibility of using them in new approaches to reach the desired translation quality and thus reduce the effort of the human translator.
- Published
- 2014