Overcoming statistical machine translation limitations: error analysis and proposed solutions for the Catalan-Spanish language pair
- Source :
- O2, repositorio institucional de la UOC, Universitat Oberta de Catalunya (UOC); UPCommons. Portal del coneixement obert de la UPC, Universitat Politècnica de Catalunya (UPC); Recercat. Dipòsit de la Recerca de Catalunya
- Publication Year :
- 2011
- Publisher :
- Language Resources and Evaluation, 2011.
Abstract
- This work aims to improve an N-gram-based statistical machine translation system between Catalan and Spanish, trained on an aligned Spanish-Catalan parallel corpus of 1.7 million sentences taken from the newspaper El Periódico. Starting from a linguistic error analysis of this baseline system, orthographic, morphological, lexical, semantic and syntactic problems are addressed using a set of techniques. The proposed solutions include the development and application of additional statistical techniques, text pre- and post-processing tasks, and rules based on grammatical categories, as well as lexical categorization. The improved system clearly outperforms the baseline, as shown in both human and automatic evaluations, with a gain of about 1.1 BLEU points in the Spanish-to-Catalan direction of translation and a gain of about 0.5 points in the reverse direction. The final system is freely available online as a linguistic resource. This work has been partially funded by the Spanish Department of Science and Innovation through the Juan de la Cierva fellowship program and by the Spanish Government under the BUCEADOR project (TEC2009-14094-C04-01).
- Subjects :
- Linguistics and Language
Statistical machine translation
Spanish language
traducció automàtica estadística
Machine translation
Grammatical categories
Computer science
Speech recognition
conocimientos lingüísticos
Grammatical category
Library and Information Sciences
statistical machine translation
Translation (geometry)
Language and Linguistics
Parla
Education
Set (abstract data type)
categories gramaticals
Rule-based machine translation
Linguistic knowledge
categorías gramaticales
Traducció automàtica
Traducción automática
N-gram-based translation
BLEU
Language and Speech Technologies
traducción automática estadística
Signal theory (Telecommunication)
Enginyeria de la telecomunicació [Àrees temàtiques de la UPC]
Syntax
traducció basada en n-grames
Categorization
coneixements lingüístics
n-gram-based translation
Catalan
Artificial intelligence
traducción basada en n-gramas
Computational linguistics
grammatical categories
Machine translating
Natural language processing
linguistic knowledge
Details
- Database :
- OpenAIRE
- Journal :
- O2, repositorio institucional de la UOC, Universitat Oberta de Catalunya (UOC); UPCommons. Portal del coneixement obert de la UPC, Universitat Politècnica de Catalunya (UPC); Recercat. Dipòsit de la Recerca de Catalunya
- Accession number :
- edsair.doi.dedup.....36b6ec50cf6ded30cf8f2821e6b41af4