Native Language Identification with Cross-Corpus Evaluation Using Social Media Data: 'Reddit'

Authors :: Bassas, Yasmeen
Kuebler, Sandra
Riddell, Allen
Publication Year :: 2023
Publisher :: Zenodo, 2023.
Abstract: Native Language Identification (NLI) is a growing subfield of Natural Language Processing (NLP). The task is to predict an author's native language from their writing in a second language. In this paper, we investigate how two types of features, content-based and content-independent, perform when they are evaluated on a different corpus drawn from social media (Reddit). The predefined models are trained on one corpus (TOEFL) and then evaluated on an external corpus (Reddit). Three classifiers are used: a baseline, a linear SVM, and Logistic Regression. Results show that content-based features are more accurate and more robust than content-independent ones, both within corpus and across corpora.

Subjects :: social media corpus
content-based features
NLI
content independent features
NLP
ML
