201. Towards the automatic generation of Arabic Lexical Recognition Tests using orthographic and phonological similarity maps
- Author
- Saeed Salah, Raid Zaghal, Mohammad Nassar, and Osama Hamed
- Subjects
Vocabulary, General Computer Science, business.industry, Computer science, media_common.quotation_subject, Confusion matrix, 020206 networking & telecommunications, Phonology, 02 engineering and technology, Pronunciation, computer.software_genre, Word lists by frequency, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Language proficiency, Artificial intelligence, Precision and recall, business, computer, Orthography, Natural language processing, media_common
- Abstract
Lexical Recognition Tests (LRTs) are among the most widely used methods for measuring proficiency in common languages such as English, German, and Spanish. Comparable research for Arabic is still at an early stage, and existing proposals rely mainly on human-crafted methods. This paper proposes a new methodology, based on a newly developed algorithm, for automatically constructing high-quality nonwords for a quick measurement of Arabic proficiency levels (an Arabic LRT). The algorithm generates nonwords automatically from characteristics specific to Arabic: orthography (spelling), phonology (pronunciation), n-grams, and the word frequency map, which is an important factor in creating a multi-level test. The algorithm was evaluated on a large dataset of Arabic vocabulary. For this purpose, a Web-based application that follows the suggested methodology was designed and implemented to facilitate collecting and analyzing learners’ responses. The experimental results show that the automatically generated LRT questions confused the learners: according to the confusion matrix, one third of the generated nonwords were able to distract the learners (with an accuracy of 65%). Consequently, recall and precision have smaller values, 0.52 and 0.48, respectively.
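The abstract does not give the algorithm itself, so the following is only a minimal sketch of one ingredient it names: generating nonword candidates from an orthographic similarity map. The map `SIMILAR`, the function `generate_nonwords`, and the toy lexicon are all hypothetical and are not taken from the paper. The map simply groups Arabic letters that share a base shape and differ only in their dots. The phonological map, n-grams, and word frequency levels mentioned in the abstract are not modeled here.

```python
# Hypothetical orthographic similarity map (an assumption, not the authors' map):
# letters grouped because they share a base shape and differ only in dots.
GROUPS = [
    ["ب", "ت", "ث"],
    ["ج", "ح", "خ"],
    ["د", "ذ"],
    ["ر", "ز"],
    ["س", "ش"],
    ["ص", "ض"],
    ["ط", "ظ"],
    ["ع", "غ"],
]

# Expand the groups into a lookup: letter -> list of similar letters.
SIMILAR = {
    letter: [other for other in group if other != letter]
    for group in GROUPS
    for letter in group
}


def generate_nonwords(word, lexicon):
    """Return candidates that differ from `word` by one similar-letter swap.

    Candidates that already exist in `lexicon` are real words, so they are
    discarded. Order follows letter position, then the order in SIMILAR.
    """
    candidates = []
    for i, ch in enumerate(word):
        for sub in SIMILAR.get(ch, []):
            cand = word[:i] + sub + word[i + 1:]
            if cand not in lexicon and cand not in candidates:
                candidates.append(cand)
    return candidates


if __name__ == "__main__":
    toy_lexicon = {"كتب", "درس", "كتاب"}  # toy data for illustration only
    for w in ["درس", "كتب"]:
        print(w, generate_nonwords(w, toy_lexicon))
```

In a fuller pipeline, the lexicon check would run against the large vocabulary dataset, and candidates would presumably be filtered further (for example by n-gram plausibility) before being used as test items.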
- Published
- 2022