1. Building subject-aligned comparable corpora and mining it for truly parallel sentence pairs.
- Author
Wołk, Krzysztof and Marasek, Krzysztof
- Subjects
SENTENCES (Grammar), CORPORA, DATA mining, INFORMATION retrieval research, MACHINE translating
- Abstract
Parallel sentences are a relatively scarce but extremely useful resource for many applications including cross-lingual retrieval and statistical machine translation. This research explores our methodology for mining such data from previously obtained comparable corpora. The task is highly practical since non-parallel multilingual data exist in far greater quantities than parallel corpora, but parallel sentences are a much more useful resource. Here we propose a web crawling method for building subject-aligned comparable corpora from Wikipedia articles. We also introduce a method for extracting truly parallel sentences that are filtered out from noisy or just comparable sentence pairs. We describe our implementation of a specialized tool for this task as well as training and adaption of a machine translation system that supplies our filter with additional information about the similarity of comparable sentence pairs. [ABSTRACT FROM AUTHOR]
- Published
2014