BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis

Authors :: Alam, Sadia
Ishmam, Md Farhan
Alvee, Navid Hasin
Siddique, Md Shahnewaz
Hossain, Md Azam
Kamal, Abu Raihan Mostofa
Publication Year :: 2024
Abstract: The widespread availability of code-mixed data can provide valuable insights into low-resource languages like Bengali, which have limited datasets. Sentiment analysis has been a fundamental text classification task across several languages for code-mixed data. However, there has yet to be a large-scale and diverse sentiment analysis dataset on code-mixed Bengali. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with $4$ sentiment labels from Facebook, YouTube, and e-commerce sites. We ensure diversity in data sources to replicate realistic code-mixed scenarios. Additionally, we propose $14$ baseline methods including novel transformer encoders further pre-trained on code-mixed Bengali-English, achieving an overall accuracy of $69.8\%$ and an F1 score of $69.1\%$ on sentiment classification tasks. Detailed analyses reveal variations in performance across different sentiment labels and text types, highlighting areas for future improvement.