A semi-automatic approach to identifying and unifying ambiguously encoded Arabic-based characters

Authors :
Sardar Jaf
Dong, Minghui
Tseng, Yuen-Hsien
Lu, Yanfeng
Yu, Liang-Chih
Lee, Lung-Hao
Wu, Chung-Hsien
Li, Haizhou
Source :
Dong, Minghui & Tseng, Yuen-Hsien & Lu, Yanfeng & Yu, Liang-Chih & Lee, Lung-Hao & Wu, Chung-Hsien & Li, Haizhou (Eds.). (2017). Proceedings of the 2016 International Conference on Asian Language Processing (IALP), 21-23 November 2016, Tainan, Taiwan. Los Alamitos: IEEE, pp. 228-231, IALP
Publication Year :
2017
Publisher :
IEEE, 2017.

Abstract

In this study, we outline a potential problem in normalising texts written in a modified version of the Arabic alphabet. One of the main resources available for processing resource-scarce languages is raw text collected from the Internet. Many less-resourced languages, such as Kurdish, Farsi, Urdu and Pashtu, use a modified version of the Arabic writing system. Many characters in data harvested from the Internet may have exactly the same form but be encoded with different Unicode values (ambiguous characters). Ambiguous characters in words lead to word duplication, so it is important to identify and unify them during the normalisation stage. Here, we demonstrate cases related to ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying them.
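
The duplication problem described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' method; it only shows how two look-alike code points can split one word into two distinct strings, and how a unification table merges them again. The table `UNIFY`, the choice of canonical code points, and the helpers `describe`, `unify` and `find_duplicates` are all assumptions made for illustration. The two pairs used (KAF/KEHEH and YEH/FARSI YEH) render alike in initial and medial positions in most fonts, although they can differ in final or isolated position.

```python
import unicodedata
from collections import defaultdict

# Hypothetical unification table (an assumption, not taken from the paper):
# look-alike code point -> code point chosen here as canonical.
UNIFY = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}
_TABLE = str.maketrans(UNIFY)


def describe(ch):
    """Return a readable label for one character, e.g. for manual review."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def unify(word):
    """Replace every code point listed in UNIFY with its canonical form."""
    return word.translate(_TABLE)


def find_duplicates(words):
    """Group words that become identical after unification.

    Returns a dict mapping each unified form to the sorted list of
    distinct original spellings, keeping only forms with 2+ spellings.
    """
    groups = defaultdict(set)
    for w in words:
        groups[unify(w)].add(w)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


if __name__ == "__main__":
    with_kaf = "\u0643\u062A\u0627\u0628"    # first letter encoded as KAF
    with_keheh = "\u06A9\u062A\u0627\u0628"  # first letter encoded as KEHEH
    for form, variants in find_duplicates([with_kaf, with_keheh]).items():
        # Print the first code point of each variant so a reviewer can compare.
        print(form, [describe(v[0]) for v in variants])
```

In a semi-automatic workflow of the kind the abstract mentions, groups flagged this way would presumably be shown to a human reviewer before any mapping is added to the table, since whether two code points should be merged depends on the language.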

Details

Database :
OpenAIRE
Accession number :
edsair.doi.dedup.....144506809a84ad538049daedb580c8a4