A semi-automatic approach to identifying and unifying ambiguously encoded Arabic-based characters

Authors :: Sardar Jaf
Dong, Minghui
Tseng, Yuen-Hsien
Lu, Yanfeng
Yu, Liang-Chih
Lee, Lung-Hao
Wu, Chung-Hsien
Li, Haizhou
Source :: Dong, Minghui & Tseng, Yuen-Hsien & Lu, Yanfeng & Yu, Liang-Chih & Lee, Lung-Hao & Wu, Chung-Hsien & Li, Haizhou (Eds.). (2017). Proceedings of the 2016 International Conference on Asian Language Processing (IALP), 21-23 November 2016, Tainan, Taiwan. Los Alamitos: IEEE, pp. 228-231, IALP
Publication Year :: 2017
Publisher :: IEEE, 2017.
Abstract: In this study, we outline a potential problem in normalising texts that are based on a modified version of the Arabic alphabet. One of the main resources available for processing resource-scarce languages is raw text collected from the Internet. Many less-resourced languages, such as Kurdish, Farsi, Urdu and Pashtu, use a modified version of the Arabic writing system. Many characters in data harvested from the Internet may have exactly the same form but be encoded with different Unicode values (ambiguous characters). The existence of ambiguous characters in words leads to word duplication, so it is important to identify and unify ambiguous characters during the normalisation stage. Here, we demonstrate cases related to ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying them.
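
The word-duplication problem described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the mapping `UNIFY` and the helpers `unify` and `find_variants` are hypothetical names introduced here, and the two character pairs are assumptions chosen only because they are well-known Arabic/Persian look-alikes in some positions and fonts. The sketch flags words that collapse to the same string after unification, leaving the decision to merge them to a human reviewer.

```python
from collections import defaultdict

# Assumed, illustrative mapping (not taken from the paper): Arabic code points
# that can render like their Persian/Kurdish counterparts.
UNIFY = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}
_TABLE = str.maketrans(UNIFY)


def unify(text: str) -> str:
    """Replace every mapped code point with its chosen canonical code point."""
    return text.translate(_TABLE)


def find_variants(words):
    """Group distinct words that become identical after unification.

    Only groups with more than one member are returned; these are the
    candidates a reviewer would inspect before merging.
    """
    groups = defaultdict(set)
    for word in words:
        groups[unify(word)].add(word)
    return {key: members for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    # The same four-letter word typed with KAF (U+0643) and with KEHEH (U+06A9).
    with_kaf = "\u0643\u062A\u0627\u0628"
    with_keheh = "\u06A9\u062A\u0627\u0628"
    for canonical, members in find_variants([with_kaf, with_keheh]).items():
        # Print code points rather than glyphs so the difference is visible.
        print([[f"U+{ord(ch):04X}" for ch in m] for m in sorted(members)])
```

Keeping the mapping as an explicit, reviewable table rather than applying a generic Unicode normalisation form is a deliberate choice in this sketch: which code point counts as canonical depends on the target language, so that decision stays with the person curating the table.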

Subjects :: business.industry
Arabic
Computer science
computer.software_genre
Unicode
language.human_language
Linguistics
Writing system
Web page
language
Encoding (semiotics)
The Internet
Urdu
Artificial intelligence
business
computer
Word (computer architecture)
Natural language processing

Database :: OpenAIRE
Accession number :: edsair.doi.dedup.....144506809a84ad538049daedb580c8a4
