BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis

Authors :
Alam, Sadia
Ishmam, Md Farhan
Alvee, Navid Hasin
Siddique, Md Shahnewaz
Hossain, Md Azam
Kamal, Abu Raihan Mostofa
Publication Year :
2024

Abstract

The widespread availability of code-mixed data can provide valuable insights into low-resource languages like Bengali, which have limited datasets. Sentiment analysis has been a fundamental text classification task across several languages for code-mixed data. However, there has yet to be a large-scale and diverse sentiment analysis dataset on code-mixed Bengali. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with 4 sentiment labels from Facebook, YouTube, and e-commerce sites. We ensure diversity in data sources to replicate realistic code-mixed scenarios. Additionally, we propose 14 baseline methods including novel transformer encoders further pre-trained on code-mixed Bengali-English, achieving an overall accuracy of 69.8% and an F1 score of 69.1% on sentiment classification tasks. Detailed analyses reveal variations in performance across different sentiment labels and text types, highlighting areas for future improvement.

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2408.08964
Document Type :
Working Paper