
Overcoming statistical machine translation limitations: error analysis and proposed solutions for the Catalan-Spanish language pair

Authors :
Mireia Farrús
Adolfo Hernández
Carlos A. Henríquez
Marc Poch
José A. R. Fonollosa
Marta R. Costa-jussà
José B. Mariño
Universitat Oberta de Catalunya. Internet Interdisciplinary Institute (IN3)
Universitat Politècnica de Catalunya
Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions
Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla
Source :
O2, repositorio institucional de la UOC, Universitat Oberta de Catalunya (UOC); UPCommons. Portal del coneixement obert de la UPC, Universitat Politècnica de Catalunya (UPC); Recercat. Dipòsit de la Recerca de Catalunya
Publication Year :
2011
Publisher :
Language Resources and Evaluation, 2011.

Abstract

This work aims to improve an N-gram-based statistical machine translation system between Catalan and Spanish, trained on an aligned Spanish-Catalan parallel corpus of 1.7 million sentences taken from the newspaper El Periódico. Starting from a linguistic error analysis of this baseline system, orthographic, morphological, lexical, semantic and syntactic problems are addressed using a set of techniques. The proposed solutions include the development and application of additional statistical techniques, text pre- and post-processing tasks, and rules based on grammatical categories, as well as lexical categorization. Both human and automatic evaluations show that the improved system performs clearly better than the baseline, with a gain of about 1.1 BLEU points in the Spanish-to-Catalan direction and about 0.5 points in the reverse direction. The final system is freely available online as a linguistic resource.

This work has been partially funded by the Spanish Department of Science and Innovation through the Juan de la Cierva fellowship program and by the Spanish Government under the BUCEADOR project (TEC2009-14094-C04-01).

Details

Database :
OpenAIRE
Accession number :
edsair.doi.dedup.....36b6ec50cf6ded30cf8f2821e6b41af4