
BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla

Authors:
Bhattacharjee, Abhik
Hasan, Tahmid
Ahmad, Wasi Uddin
Samin, Kazi
Islam, Md Saiful
Iqbal, Anindya
Rahman, M. Sohel
Shahriyar, Rifat
Publication Year:
2021

Abstract

In this work, we introduce BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed `Bangla2B+') by crawling 110 popular Bangla sites. We introduce two downstream task datasets on natural language inference and question answering and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Benchmark (BLUB). BanglaBERT achieves state-of-the-art results outperforming multilingual and monolingual models. We are making the models, datasets, and a leaderboard publicly available at https://github.com/csebuetnlp/banglabert to advance Bangla NLP.

Comment: Findings of North American Chapter of the Association for Computational Linguistics, NAACL 2022 (camera-ready)
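
The abstract states that the pretrained models are publicly released. The sketch below is an illustrative, non-authoritative example of how such a checkpoint might be loaded and applied to a BLUB-style task with the Hugging Face transformers library; the model identifier "csebuetnlp/banglabert" and the 3-label NLI setup are assumptions not stated in this record, so consult the linked repository for the actual checkpoint names and task configurations.

```python
# Minimal sketch: load the released BanglaBERT checkpoint for a sentence-pair
# classification task (e.g., natural language inference from BLUB).
# NOTE: the model name and num_labels below are assumptions for illustration.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "csebuetnlp/banglabert"  # assumed identifier; see the GitHub repo
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Encode a Bangla premise/hypothesis pair and run a forward pass.
inputs = tokenizer("উদাহরণ বাক্য", "আরেকটি বাক্য", return_tensors="pt", truncation=True)
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, num_labels)
```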

Details

Database:
OAIster
Publication Type:
Electronic Resource
Accession number:
edsoai.on1269521370
Document Type:
Electronic Resource