BaNeL: an encoder-decoder based Bangla neural lemmatizer

Authors :
Md. Ashraful Islam
Md. Towhiduzzaman
Md. Tauhidul Islam Bhuiyan
Abdullah Al Maruf
Jesan Ahammed Ovi
Source :
SN Applied Sciences, Vol 4, Iss 5, Pp 1-15 (2022)
Publication Year :
2022
Publisher :
Springer, 2022.

Abstract

This study presents an efficient framework for deriving the lemma of an inflected Bangla word, using its part of speech as context. Bangla is a morphologically rich Indo-Aryan language in which around 70% of words are inflected, and some words have around 90 different inflected forms, making it one of the most challenging languages for lemmatization. The lack of a sufficiently large, appropriate Bangla dataset makes the task even harder. A reliable, robust Bangla lemmatizer would open new possibilities for dependent fields such as automatic translation and grammatical correction in Bangla. In this paper, we describe a new, larger Bangla dataset for lemmatization and an encoder-decoder-based sequence-to-sequence framework for the task. After hyper-parameter tuning, the proposed framework achieved 95.75% character accuracy and 91.81% exact match on the test split of the prepared dataset, which is significantly higher than existing approaches to Bangla lemmatization.

Article Highlights

This article:
- Discusses the lemmatization task in Bangla and shows how it differs from stemming
- Presents an efficient artificial-neural-network-based model for lemmatization that performs better than existing ones
- Describes a new, large dataset for lemmatization in Bangla
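The abstract reports two scores, character accuracy and exact match, without defining them in this record. The sketch below shows one plausible way such scores could be computed. It assumes that exact match is the share of predictions identical to the reference lemma and that character accuracy is one minus the total edit distance divided by the total number of reference characters. The function names and the toy English word pairs are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def exact_match(preds: Sequence[str], refs: Sequence[str]) -> float:
    """Fraction of predicted lemmas that equal the reference lemma exactly."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


def char_accuracy(preds: Sequence[str], refs: Sequence[str]) -> float:
    """ASSUMED definition: 1 - (edit errors / reference characters).

    Errors per word are capped at the reference length so the score
    stays in [0, 1]. Python strings are sequences of code points, so a
    Bangla vowel sign counts as its own character here.
    """
    total_chars = sum(len(r) for r in refs)
    errors = sum(min(levenshtein(p, r), len(r)) for p, r in zip(preds, refs))
    return 1 - errors / total_chars


# Toy placeholder data (not from the paper's dataset).
predictions = ["run", "book", "goes"]
references = ["run", "book", "go"]

em = exact_match(predictions, references)
ca = char_accuracy(predictions, references)
print(f"exact match: {em:.4f}, character accuracy: {ca:.4f}")
```

Under these assumptions a single wrong word lowers exact match by a full word but character accuracy only by the characters that differ, which is consistent with the reported character accuracy being higher than the reported exact match.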

Details

Language :
English
ISSN :
2523-3963 and 2523-3971
Volume :
4
Issue :
5
Database :
Directory of Open Access Journals
Journal :
SN Applied Sciences
Publication Type :
Academic Journal
Accession number :
edsdoj.88511ea89b3b4dc69e9785a36cef10f1
Document Type :
article
Full Text :
https://doi.org/10.1007/s42452-022-04985-2