
ChatGPT's quality: Reliability and validity of concept inventory items

Authors :
Stefan Küchemann
Martina Rau
Albrecht Schmidt
Jochen Kuhn
Source :
Frontiers in Psychology, Vol 15 (2024)
Publication Year :
2024
Publisher :
Frontiers Media S.A., 2024.

Abstract

Introduction: The recent advances of large language models (LLMs) have opened a wide range of opportunities, but at the same time they pose numerous challenges and questions that research needs to answer. One of the main challenges is the quality and correctness of the output of LLMs, as well as students' overreliance on that output without critically reflecting on it. This raises the question of the quality of LLM output in educational tasks and of what students and teachers need to consider when using LLMs to create educational items. In this work, we focus on the quality and characteristics of conceptual items developed using ChatGPT without user-generated improvements.

Methods: For this purpose, we optimized prompts and created 30 conceptual items in kinematics, a standard topic in high-school physics. The items were rated by two independent experts. The 15 items that received the highest ratings were included in a conceptual survey. The dimensions were designed to align with those of the most commonly used concept inventory, the Force Concept Inventory (FCI). We administered the designed items together with the FCI to 172 first-year university students. The results show that the ChatGPT items have medium difficulty and discrimination indices, but overall they exhibit slightly lower average values than the FCI. Moreover, a confirmatory factor analysis confirmed a three-factor model that closely aligns with a previously suggested expert model.

Results and discussion: In this way, after careful prompt engineering and thorough analysis and selection of fully automatically generated ChatGPT items, we were able to create concept items of only slightly lower quality than carefully human-generated concept items. The procedures required to create and select such a high-quality, fully automatically generated set of items demand considerable effort and point toward the cognitive demands placed on teachers when using LLMs to create items. Moreover, the results demonstrate that human oversight or student interviews are necessary when creating one-dimensional assessments and distractors that are closely aligned with students' difficulties.

Details

Language :
English
ISSN :
1664-1078
Volume :
15
Database :
Directory of Open Access Journals
Journal :
Frontiers in Psychology
Publication Type :
Academic Journal
Accession number :
edsdoj.590ec3ff08654bba98939decbf3e749a
Document Type :
article
Full Text :
https://doi.org/10.3389/fpsyg.2024.1426209