1. Processing Annotated TMX Parallel Corpora
- Author
Brito, Rui Miguel Magalhães; Almeida, J. J.; Simões, Alberto; Universidade do Minho
- Subjects
Corpora paralelos, TMX, Parallel corpora, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Ciências Naturais::Ciências da Computação e da Informação, Annotated corpora, PLN
- Abstract
In recent years the number of freely available multilingual corpora has grown exponentially. Unfortunately, these corpora are made available in very diverse ways, ranging from simple text files or specific XML schemas to supposedly standard formats like the XML Corpus Encoding Initiative, the Text Encoding Initiative, or even the Translation Memory Exchange format. In this document we defend the usage of Translation Memory Exchange documents, but we enrich their structure in order to support the annotation of the documents with different information such as lemmas, multi-words, or entities. To support the adoption of the proposed formats, we present a set of tools to manipulate the different formats in an agile way.
- Published
- 2014