A hybrid approach for transliterated word-level language identification: CRF with post processing heuristics

Authors :: Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació
Department of Electronics and Information Technology, Ministry of Communications and Information Technology, India
European Commission
Universitat de València
Ministerio de Economía y Competitividad
Banerjee, Somnath
Kuila, Alapan
Roy, Aniruddha
Naskar, Sudip Kumar
Rosso, Paolo
Bandyopadhyay, Sivaji
Publication Year :: 2014
Abstract: © {Owner/Author | ACM} {Year}. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in FIRE '14 Proceedings of the Forum for Information Retrieval Evaluation, http://dx.doi.org/10.1145/2824864.2824876

In this paper, we describe a hybrid approach for word-level language (WLL) identification of Bangla words written in Roman script and mixed with English words, as part of our participation in the shared task on transliterated search at the Forum for Information Retrieval Evaluation (FIRE) in 2014. A CRF-based machine learning model and post-processing heuristics are employed for the WLL identification task. In addition to language identification, two transliteration systems were built to transliterate detected Bangla words written in Roman script into native Bangla script. The system demonstrated an overall token-level language identification accuracy of 0.905. The token-level Bangla and English language identification F-scores are 0.899 and 0.920, respectively. The two transliteration systems achieved accuracies of 0.062 and 0.037. The word-level language identification system presented in this paper achieved the best scores across almost all metrics among all the participating systems for the Bangla-English language pair.
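
For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal sketch of how token-level accuracy and a per-language F-score are conventionally computed. It is not the authors' evaluation code, and the shared task may have defined its metrics differently. The function names, the label strings "B" (Bangla) and "E" (English), and the toy data are all assumptions made for illustration.

```python
def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted language label matches the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f_score(gold, pred, label):
    """Balanced F-score (F1) for one language label, from token-level counts."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)  # correctly tagged as label
    fp = sum(g != label and p == label for g, p in pairs)  # wrongly tagged as label
    fn = sum(g == label and p != label for g, p in pairs)  # label tokens that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: "B" = Bangla, "E" = English (illustrative labels only).
gold = ["B", "E", "B", "E"]
pred = ["B", "E", "E", "E"]

accuracy = token_accuracy(gold, pred)
f_bangla = f_score(gold, pred, "B")
f_english = f_score(gold, pred, "E")
```

In this toy example one Bangla token is mislabeled as English, which lowers Bangla recall and English precision while leaving the other two quantities at 1.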

Details

Database :: OAIster
Notes :: TEXT, TEXT, English
Publication Type :: Electronic Resource
Accession number :: edsoai.on1228689460
Document Type :: Electronic Resource