Automated Grading of Exam Responses: An Extensive Classification Benchmark

Authors :: Alexandra Farazouli
Zed Lee
Vanessa Lislevand
Jimmy Ljungman
John Pavlopoulos
Panagiotis Papapetrou
Uno Fors
Source :: Discovery Science ISBN: 9783030889418, DS
Publication Year :: 2021
Publisher :: Springer International Publishing, 2021.
Abstract: Automated grading of free-text exam responses is a challenging task, owing to problems such as scarce training data and bias in the graders' ground-truth labels. In this paper, we formulate the task as a binary classification problem with two class labels: low-grade and high-grade. We present a benchmark of four machine learning methods under three experimental protocols on two real-world datasets, one from Cyber-crime exams in Arabic and one from Data Mining exams in English, the latter presented for the first time in this work. By reporting various metrics for binary classification and answer ranking, we illustrate the benefits and drawbacks of the benchmarked methods. Our results suggest that standard models with individual word representations can in some cases achieve predictive performance competitive with deep neural language models using context-based representations, on both binary classification and answer ranking for free-text response grading. Lastly, we discuss the pedagogical implications of our findings by identifying potential pitfalls and challenges when building predictive models for such tasks.
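
Illustrative sketch (not from the paper): the abstract frames grading as a low-/high-grade binary classification task that is evaluated with both classification and answer-ranking metrics. The snippet below shows one way such an evaluation could look. The grade threshold, the example scores, and the choice of F1 and ROC AUC as metrics are all assumptions made for illustration; the abstract does not name the specific metrics or the cut-off used.

```python
from typing import List, Sequence


def binarize_grades(grades: Sequence[float], threshold: float) -> List[int]:
    """Map numeric grades to 1 (high-grade) or 0 (low-grade).

    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    return [1 if g >= threshold else 0 for g in grades]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive (high-grade) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Ranking quality: the fraction of (high, low) answer pairs in which
    the high-grade answer receives the higher score (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up grader marks and model scores, for illustration only.
grades = [2, 9, 5, 7, 3, 8]
model_scores = [0.2, 0.9, 0.4, 0.7, 0.6, 0.8]

y_true = binarize_grades(grades, threshold=5)          # hypothetical cut-off
y_pred = [1 if s >= 0.5 else 0 for s in model_scores]  # hypothetical decision rule

print("F1:", f1_score(y_true, y_pred))
print("ROC AUC:", roc_auc(y_true, model_scores))
```

The classification metric depends on the decision rule applied to the model scores, whereas the ranking metric uses the raw scores directly, which is why the two views can disagree about which model is better.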