
Native Language Identification with Cross-Corpus Evaluation Using Social Media Data: 'Reddit'

Authors :
Bassas, Yasmeen
Kuebler, Sandra
Riddell, Allen
Publication Year :
2023
Publisher :
Zenodo, 2023.

Abstract

Native Language Identification (NLI) is a growing subfield of Natural Language Processing (NLP). The task is to predict an author's native language from their writing in a second language. In this paper, we compare the performance of two types of features, content-based and content-independent, when they are evaluated on a different corpus (social media data from Reddit). The predefined models are trained on one corpus (TOEFL) and then evaluated on an external corpus (Reddit). Three classifiers are used: a baseline, a linear SVM, and Logistic Regression. Results show that content-based features are more accurate and robust than content-independent ones, both within corpus and across corpora.
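The cross-corpus protocol described in the abstract can be illustrated with a minimal sketch: fit on one corpus, then score the same fitted model on held-out text from that corpus and on text from a second corpus. Everything below is an assumption made for illustration and is not taken from the paper: the toy sentences, the labels "A" and "B", the function-word list, and the feature definitions are hypothetical, and a small multinomial Naive Bayes classifier stands in for the linear SVM and Logistic Regression so that the example runs with the standard library only.

```python
import math
from collections import Counter

# Hypothetical function-word list (assumption, not the paper's feature set).
FUNCTION_WORDS = {"i", "am", "with", "this", "that", "he", "she",
                  "have", "many", "about", "it", "for", "the", "they"}

def content_features(text):
    """Content-based stand-in: all lowercased word tokens."""
    return text.lower().split()

def function_word_features(text):
    """Content-independent stand-in: function words only."""
    return [w for w in text.lower().split() if w in FUNCTION_WORDS]

class MajorityBaseline:
    """Always predicts the most frequent training label."""
    def fit(self, texts, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, texts):
        return [self.label for _ in texts]

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (stand-in classifier)."""
    def __init__(self, featurize):
        self.featurize = featurize

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(self.featurize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, texts):
        total = sum(self.class_counts.values())
        preds = []
        for text in texts:
            scores = {}
            for c, n_docs in self.class_counts.items():
                n_tokens = sum(self.word_counts[c].values())
                score = math.log(n_docs / total)
                for w in self.featurize(text):
                    if w in self.vocab:  # ignore words unseen in training
                        score += math.log((self.word_counts[c][w] + 1)
                                          / (n_tokens + len(self.vocab)))
                scores[c] = score
            preds.append(max(scores, key=scores.get))
        return preds

def accuracy(model, texts, labels):
    preds = model.predict(texts)
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Toy placeholder data (invented; not drawn from TOEFL or Reddit).
train_texts = ["i am agree with this opinion",
               "i am agree that students must study",
               "he have many informations about it",
               "she have many advices for students"]
train_labels = ["A", "A", "B", "B"]
within_texts = ["i am agree with the teacher", "they have many informations"]
within_labels = ["A", "B"]
cross_texts = ["lol i am agree tbh", "this sub have many informations"]
cross_labels = ["A", "B"]

models = {
    "baseline": MajorityBaseline(),
    "content-based": NaiveBayes(content_features),
    "content-independent": NaiveBayes(function_word_features),
}
results = {}
for name, model in models.items():
    model.fit(train_texts, train_labels)  # train on the first corpus only
    results[name] = (accuracy(model, within_texts, within_labels),
                     accuracy(model, cross_texts, cross_labels))
    print(name, "within:", results[name][0], "cross:", results[name][1])
```

The key design point is that each model is fitted once and never sees the second corpus during training, so any drop between the two accuracy figures reflects how well the features transfer. The numbers produced by this toy data say nothing about the paper's actual results.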

Details

Language :
English
Database :
OpenAIRE
Accession number :
edsair.doi.dedup.....4898c567a003b1ffe7f71adfdcab0903
Full Text :
https://doi.org/10.5281/zenodo.7563500