Gombert, Sebastian, Di Mitri, Daniele, Karademir, Onur, Kubsch, Marcus, Kolbe, Hannah, Tautz, Simon, Grimm, Adrian, Bohm, Isabell, Neumann, Knut, and Drachsler, Hendrik
Background: Formative assessments are needed to monitor how student knowledge develops throughout a unit. Constructed response items, which require learners to formulate their own free‐text responses, are well suited for testing their active knowledge. However, assessing such constructed responses in an automated fashion is a complex task and requires the application of natural language processing methodology. In this article, we implement and evaluate multiple machine learning models for coding energy knowledge in free‐text responses of German K‐12 students to items in formative science assessments which were conducted during synchronous online learning sessions.
Dataset: The dataset we collected for this purpose consists of German constructed responses from 38 different items dealing with aspects of energy such as manifestation and transformation. The units and items were implemented with the help of project‐based pedagogy and evidence‐centered design, and the responses were coded for seven core ideas concerning the manifestation and transformation of energy. The data was collected from students in seventh, eighth and ninth grade.
Methodology: We train various transformer‐ and feature‐based models and compare their ability to recognize the respective ideas in students' writing. Moreover, as domain knowledge and its development can be formally modeled through knowledge networks, we evaluate how well the detection of the ideas within responses translates into accurate co‐occurrence‐based knowledge networks. Finally, in terms of the descriptive accuracy of our models, we inspect which features play a role in which prediction outcome and whether the models pick up on undesired shortcuts. In addition, we analyze how closely the models match human coders in the evidence within responses that they consider important for their coding decisions.
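As a rough illustration of what a co‐occurrence‐based knowledge network could look like, the following minimal sketch counts how often two coded ideas appear in the same response and compares the network built from human codes with the one built from model predictions. The idea labels, the function names and the Jaccard comparison of edge sets are assumptions made for illustration only; the abstract does not specify how the networks were constructed or compared.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(coded_responses):
    """Count how often each pair of ideas is coded in the same response.

    coded_responses: iterable of sets of idea labels, one set per response.
    Returns a Counter mapping (idea_a, idea_b) pairs, sorted alphabetically,
    to the number of responses in which both ideas occur.
    """
    edges = Counter()
    for ideas in coded_responses:
        for a, b in combinations(sorted(set(ideas)), 2):
            edges[(a, b)] += 1
    return edges


def edge_jaccard(edges_a, edges_b):
    """Jaccard overlap of the edge sets of two networks (weights ignored).

    This comparison measure is an assumption, not the measure of the paper.
    """
    a, b = set(edges_a), set(edges_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical idea labels; the seven core ideas are not listed in the abstract.
human_codes = [
    {"kinetic", "transformation"},
    {"thermal"},
    {"kinetic", "potential", "transformation"},
]
model_codes = [
    {"kinetic", "transformation"},
    {"thermal", "kinetic"},
    {"kinetic", "potential"},
]

human_net = cooccurrence_network(human_codes)
model_net = cooccurrence_network(model_codes)
overlap = edge_jaccard(human_net, model_net)
```

In this toy example a single spurious or missed code per response already changes the set of edges, which is why per‐idea detection accuracy alone may not tell how faithful the resulting network is.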
Results: A model based on a modified GBERT‐large achieves the most promising results overall, although descriptive accuracy varies much more than predictive accuracy across the different ideas assessed. For reasons of comparability, we also evaluate the same machine learning architecture on the SciEntsBank 3‐Way benchmark with an English RoBERTa‐large model, where it achieves state‐of‐the‐art results in two out of three evaluation categories.
Lay Description: What is already known about this topic?: Formative assessments are needed to test and monitor the development of learners' knowledge throughout a unit to provide them with appropriate automated feedback. Constructed response items, which require learners to formulate their own free‐text responses, are well suited for testing their active knowledge. Assessing constructed responses in an automated fashion is a widely researched topic, but the problem is far from solved and most of the work focuses on predicting holistic scores or grades. To allow for a more fine‐grained and analytic assessment of learners' knowledge, systems which go beyond predicting simple grades are required. To guarantee that models are stable and make their predictions for the correct reasons, methods for explaining the models are required.
What this paper adds: A core topic in physics education is the concept of energy. We implement and evaluate multiple systems based on natural language processing technology for assessing learners' conceptual knowledge about energy physics, using transformer language models as well as feature‐based approaches. The systems assess students' knowledge about various forms of energy, indicators of these forms, and the transformation of energy from one form into another. As our systems are based on machine learning methodology, we introduce a novel German short answer dataset for training them to detect the respective knowledge elements within students' free‐text responses. We evaluate the performance of these systems using this dataset as well as the well‐established SciEntsBank 3‐Way dataset and achieve, to the best of our knowledge, new state‐of‐the‐art results for the latter. Moreover, we apply methodology for explaining model predictions to assess whether predictions are made for the correct reasons.
Implications for practice and/or policy: It is possible to assess constructed responses for the demonstrated knowledge about energy physics in an analytic fashion using natural language processing. Transformer language models can outperform more specialized feature‐based approaches for this task in terms of predictive and descriptive accuracy. Co‐occurrences of different concepts within the same responses can lead models to learn undesired shortcuts which make them unstable. [ABSTRACT FROM AUTHOR]