1. Alma: Fast Lemmatizer and POS Tagger for Arabic.
- Author
Jarrar, Mustafa; Akra, Diyam; Hammouda, Tymaa
- Subjects
PARTS of speech, ENCYCLOPEDIAS & dictionaries, DATABASES, SPEED, MORPHOLOGY
- Abstract
We introduce Alma (ﺍ ﺍ ﻠ ى), an open-source and state-of-the-art lemmatizer, POS tagger, and root tagger for Arabic, boasting both high speed and accuracy. Alma relies on a dictionary of morphological solutions ordered by the frequency of these solutions. This dictionary was developed based on the Qabas lexicographic database. Unlike many Arabic lemmatizers, which return a lemma after stripping diacritics, shadda, and hamza (i.e., an ambiguous lemma), Alma retrieves unambiguous lemmas (which we call true lemmatization). Our POS tagger uses a rich tagset of 40 POS tags. Additionally, our root tagger is the first fully-featured tagger since it uses Qabas, the largest Arabic lexicographic database. We evaluated Alma on the LDC Arabic Treebank (ATB), which contains 339,710 tokens, and achieved an 88% F1 score. We also evaluated Alma on the Salma corpus (34k tokens) and obtained a 90% F1 score. Compared to the Farasa, MADAMIRA, and Camelira lemmatizers and POS taggers, Alma outperformed all of them in both tasks, excelling in both speed and accuracy. Alma demonstrated superior processing speed, handling 339k tokens in 10.00. Alma is open-source and publicly available at (https://sina.birzeit.edu/alma). [ABSTRACT FROM AUTHOR]
- Published
2024