A hybrid approach for transliterated word-level language identification: CRF with post processing heuristics

Authors :
Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació
Department of Electronics and Information Technology, Ministry of Communications and Information Technology, India
European Commission
Universitat de València
Ministerio de Economía y Competitividad
Banerjee, Somnath
Kuila, Alapan
Roy, Aniruddha
Naskar, Sudip Kumar
Rosso, Paolo
Bandyopadhyay, Sivaji
Publication Year :
2014

Abstract

© {Owner/Author | ACM} {Year}. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in FIRE '14 Proceedings of the Forum for Information Retrieval Evaluation, http://dx.doi.org/10.1145/2824864.2824876

In this paper, we describe a hybrid approach for word-level language (WLL) identification of Bangla words written in Roman script and mixed with English words, as part of our participation in the shared task on transliterated search at the Forum for Information Retrieval Evaluation (FIRE) in 2014. A CRF-based machine learning model and post-processing heuristics are employed for the WLL identification task. In addition to language identification, two transliteration systems were built to transliterate detected Bangla words written in Roman script into native Bangla script. The system demonstrated an overall token-level language identification accuracy of 0.905. The token-level Bangla and English language identification F-scores are 0.899 and 0.920, respectively. The two transliteration systems achieved accuracies of 0.062 and 0.037. The word-level language identification system presented in this paper achieved the best scores across almost all metrics among all the participating systems for the Bangla-English language pair.

Details

Database :
OAIster
Notes :
TEXT, English
Publication Type :
Electronic Resource
Accession number :
edsoai.on1228689460
Document Type :
Electronic Resource