Detecting the same text in different languages

Authors :
Kostadin Koroutchev
Manuel Cebrián
UAM. Departamento de Ingeniería Informática
Aprendizaje Automático (ING EPS-001)
Neurocomputación Biológica (ING EPS-005)
Source :
Biblos-e Archivo. Repositorio Institucional de la UAM
Publication Year :
2006
Publisher :
IEEE, 2006.

Abstract

Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

K. Koroutchev and M. Cebrian, “Detecting the same text in different languages”, in Information Theory Workshop, 2006. ITW '06 Punta del Este. IEEE, Punta del Este, Uruguay, 2006, pp. 337-341.

Compression-based similarity distances have the main drawback of needing the same coding scheme for the objects to be compared. When two texts are translations of each other, they are significantly similar yet share no literal coincidences. In this article, we present an algorithm that compares the redundancy structure of the data, extracted by means of a Lempel-Ziv compression scheme. Each text is represented as a graph, and two texts are considered similar under our measure if they have the same referential topology when compressed. We give empirical evidence that this measure detects similarity between data coded in different languages.

This work was partially supported by grants TIN 2004-07676-G01 and TSI 2005-08255-C07-06 of the Spanish Ministry of Education and Culture.
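
The idea stated in the abstract, comparing which parts of a text refer back to which earlier parts rather than comparing the symbols themselves, can be illustrated with a toy sketch. This is not the authors' algorithm or measure: the greedy LZ77-style word-level parse, the function names (`backref_edges`, `topology_signature`, `similarity`), the position binning, and the Jaccard comparison are all assumptions chosen only to show why a back-reference structure can survive a change of vocabulary.

```python
def backref_edges(tokens, min_len=1):
    """Greedy LZ77-style parse over a token list.

    Returns one (target_pos, source_pos, length) tuple for every phrase
    that is copied from an earlier position. Hypothetical helper, not
    taken from the paper.
    """
    edges = []
    i, n = 0, len(tokens)
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):
            k = 0
            # Extend the match; the source may overlap the target, as in LZ77.
            while i + k < n and tokens[j + k] == tokens[i + k]:
                k += 1
            if k > best_len:
                best_len, best_src = k, j
        if best_len >= min_len:
            edges.append((i, best_src, best_len))
            i += best_len
        else:
            i += 1  # literal token: nothing earlier to refer to
    return edges


def topology_signature(tokens, bins=8):
    """Quantize each back-reference to (source_bin, target_bin).

    Positions are scaled by text length so that texts of different
    lengths can be compared. The binning is an illustrative assumption.
    """
    n = len(tokens)
    return {(src * bins // n, tgt * bins // n)
            for tgt, src, _ in backref_edges(tokens)}


def similarity(tokens_a, tokens_b, bins=8):
    """Jaccard overlap of the two signatures (a stand-in measure)."""
    a = topology_signature(tokens_a, bins)
    b = topology_signature(tokens_b, bins)
    if not a and not b:
        return 0.0  # no repetition in either text, nothing to compare
    return len(a & b) / len(a | b)


english = "the cat sees the cat".split()
spanish = "el gato ve el gato".split()
print(similarity(english, spanish))
```

The two example sentences share no words, yet both repeat their first two tokens at the same relative position, so their back-reference edges coincide. A realistic measure would have to tolerate word-order and length differences between languages, which this sketch does not attempt.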

Details

Language :
English
Database :
OpenAIRE
Journal :
Biblos-e Archivo. Repositorio Institucional de la UAM
Accession number :
edsair.doi.dedup.....63cc43f4850e15739bca0e7c01ba2c0d