1. Mapping and Aligning Units from Comparable Corpora
- Author
Dan Ștefănescu, Sabine Hunsicker, Yang Feng, Alexandru Ceaușu, Dan Tufiș, Elena Irimia, Robert Gaizauskas, Radu Ion, and Ahmet Aker
- Subjects
Machine translation, Computer science, business.industry, Computation, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Comparability, Translation (geometry), computer.software_genre, Extractor, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Artificial intelligence, business, computer, Natural language processing
- Abstract
Extracting parallel units (e.g. sentences or phrases) from comparable corpora in order to enrich existing statistical translation models is an avenue that has attracted a lot of research in recent years. Experiments have convincingly shown that parallel sentences extracted from comparable corpora can improve statistical machine translation (SMT). Yet the existing body of research on the subject takes into account neither the degree of comparability of the corpus being processed nor the computation time needed to extract translationally similar pairs from a corpus of a given size. We will show that the performance of a parallel unit extractor crucially depends on the degree of comparability: it is more difficult to mine for parallel data in a weakly comparable corpus than in a strongly comparable one.
- Published
2019