A semi-automatic approach to identifying and unifying ambiguously encoded Arabic-based characters
- Source :
- Dong, Minghui & Tseng, Yuen-Hsien & Lu, Yanfeng & Yu, Liang-Chih & Lee, Lung-Hao & Wu, Chung-Hsien & Li, Haizhou (Eds.). (2017). Proceedings of the 2016 International Conference on Asian Language Processing (IALP), 21-23 November 2016, Tainan, Taiwan. Los Alamitos: IEEE, pp. 228-231, IALP
- Publication Year :
- 2017
- Publisher :
- IEEE, 2017.
Abstract
- In this study, we outline a potential problem in normalising texts that are written in a modified version of the Arabic alphabet. One of the main resources available for processing resource-scarce languages is raw text collected from the Internet. Many less-resourced languages, such as Kurdish, Farsi, Urdu and Pashtu, use a modified version of the Arabic writing system. Many characters in data harvested from the Internet may have exactly the same form but be encoded with different Unicode values (ambiguous characters). The existence of ambiguous characters in words leads to word duplication, so it is important to identify and unify ambiguous characters during the normalisation stage. Here, we demonstrate cases related to ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying them.
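The problem described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the function names `report_arabic_codepoints` and `unify` and the two-entry `UNIFY` table are hypothetical and chosen only for illustration. The sketch assumes a reviewer first inspects a frequency report of Arabic-block code points (the manual half of a semi-automatic workflow) and then applies a hand-approved mapping. The two pairs used here, ARABIC LETTER KAF (U+0643) versus ARABIC LETTER KEHEH (U+06A9) and ARABIC LETTER YEH (U+064A) versus ARABIC LETTER FARSI YEH (U+06CC), are well-known look-alikes in some positions and fonts; which member of each pair should be the target depends on the language and is an assumption here.

```python
import unicodedata
from collections import Counter

# Hypothetical, hand-reviewed mapping (an assumption, not taken from the paper).
UNIFY = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}


def report_arabic_codepoints(text):
    """List each code point in the Arabic block (U+0600..U+06FF) with its
    Unicode name and frequency, so a reviewer can spot look-alike characters."""
    counts = Counter(ch for ch in text if "\u0600" <= ch <= "\u06FF")
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]


def unify(text, mapping=UNIFY):
    """Replace every source character in the mapping with its target."""
    return text.translate(str.maketrans(mapping))


# Two encodings of the same written word, differing only in the first letter.
with_kaf = "\u0643\u062A\u0627\u0628"    # starts with U+0643
with_keheh = "\u06A9\u062A\u0627\u0628"  # starts with U+06A9

for row in report_arabic_codepoints(with_kaf + " " + with_keheh):
    print(row)

# After unification both spellings collapse to a single token.
print(unify(with_kaf) == unify(with_keheh))
```

Keeping the report step separate from the replacement step means the mapping is only applied after a person has confirmed it, which matters because some look-alike pairs are distinct letters in one language and interchangeable in another.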
Details
- Database :
- OpenAIRE
- Journal :
- Dong, Minghui & Tseng, Yuen-Hsien & Lu, Yanfeng & Yu, Liang-Chih & Lee, Lung-Hao & Wu, Chung-Hsien & Li, Haizhou (Eds.). (2017). Proceedings of the 2016 International Conference on Asian Language Processing (IALP), 21-23 November 2016, Tainan, Taiwan. Los Alamitos: IEEE, pp. 228-231, IALP
- Accession number :
- edsair.doi.dedup.....144506809a84ad538049daedb580c8a4