1. Characteristics of Malay translated hadith corpus
- Author
Siti Syakirah Sazali, Nurazzah Abd Rahman, and Zainab Binti Abu Bakar
- Subjects
General Computer Science, Zipf's law, Computer science, Search engine indexing, Context (language use), Ambiguity, Field (computer science), Benchmark (computing), Artificial intelligence, Cluster analysis, Natural language processing, Malay
- Abstract
An annotated corpus can greatly assist work in the natural language processing field. For example, computers can understand more of the document context, and indexing and clustering in information retrieval can be done precisely with little or no word ambiguity. However, there are only a few annotated corpora in the Malay language, and they are not publicly shared. In this paper, we analyse and annotate Malay translated hadith documents in terms of tagging and entities. The work has three phases: manual filtering and cleaning, analysing the corpus, and creating the benchmark. As a result, an analysis and a benchmark of the Malay translated hadith corpus were produced in terms of part-of-speech and named-entity tags, which follow a Zipf's law distribution.
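The abstract says the tag frequencies follow Zipf's law, which means a tag's frequency is roughly inversely proportional to its frequency rank. One common way to check this is to fit a line to log(frequency) against log(rank) and see whether the slope is close to -1. The sketch below illustrates that check only. The function name `zipf_slope` and the toy tag list are assumptions made for this example; they are not the authors' method or data.

```python
from collections import Counter
import math


def zipf_slope(tags):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is consistent with a Zipf's law distribution.
    """
    # Frequencies sorted from most to least common; rank 1 is the most common tag.
    freqs = sorted(Counter(tags).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tags to fit a slope")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy tag sequence, invented for illustration (not taken from the hadith corpus).
# The counts 60, 30, 20, 15, 12 were chosen to equal 60 / rank.
toy_tags = ["NOUN"] * 60 + ["VERB"] * 30 + ["ADJ"] * 20 + ["ADV"] * 15 + ["PRON"] * 12
slope = zipf_slope(toy_tags)
print(f"fitted log-log slope: {slope:.3f}")
```

On real part-of-speech or named-entity tags the fit will not be exact, so the slope is best read as a rough indicator, not a formal test.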
- Published
- 2022