1,702 results for "Test format"
Search Results
2. Evaluating the Evaluators: A Comparative Study of AI and Teacher Assessments in Higher Education
- Author
- Tugra Karademir Coskun and Ayfer Alper
- Abstract
This study aims to examine the potential differences between teacher evaluations and artificial intelligence (AI) tool-based assessment systems in university examinations. The research evaluated a wide spectrum of exams, including numerical and verbal course exams, exams with different assessment styles (project, test exam, traditional exam), and both theoretical and practical course exams. These exams were selected using a criterion sampling method and were analyzed using Bland-Altman analysis and the Intraclass Correlation Coefficient (ICC) to assess how AI and teacher evaluations performed across a broad range. The findings indicate that while there is a high level of consistency between the total exam scores assigned by AI and by teachers, consistency varied by exam type: medium for visually based exams, low for video exams, high for test exams, and low for traditional exams. This research is crucial as it helps to identify specific areas where AI can complement educational assessment and areas where it needs improvement, guiding the development of more accurate and fair evaluation tools.
- Published
- 2024
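A minimal Python sketch of the two agreement analyses named in this abstract, Bland-Altman limits of agreement and an intraclass correlation (computed here as a one-way ICC(1,1)); all scores are invented for illustration, and the ICC variant used in the study may differ:

```python
import numpy as np

# Hypothetical paired scores: each exam graded once by a teacher and once by an AI tool.
teacher = np.array([78, 85, 62, 90, 71, 88, 54, 95, 67, 80], dtype=float)
ai      = np.array([75, 88, 60, 86, 74, 90, 50, 97, 65, 78], dtype=float)

# Bland-Altman: mean difference (bias) and 95% limits of agreement.
diff = teacher - ai
bias = diff.mean()
loa = 1.96 * diff.std(ddof=1)
print(f"bias = {bias:.2f}, limits of agreement = [{bias - loa:.2f}, {bias + loa:.2f}]")

# One-way ICC(1,1) from the ANOVA decomposition (raters treated as random).
scores = np.stack([teacher, ai], axis=1)          # n exams x k raters
n, k = scores.shape
grand = scores.mean()
ms_between = k * ((scores.mean(axis=1) - grand) ** 2).sum() / (n - 1)
ms_within = ((scores - scores.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(f"ICC(1,1) = {icc:.3f}")
```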
3. Results of Mathematics Examinations before, during, and after the COVID-19 Related Restrictions
- Author
- Eva Ulrychová, Renata Majovská, and Petr Tesar
- Abstract
The article deals with the results of mathematics examinations at the University of Finance and Administration in Prague before, during, and immediately after the COVID-19 pandemic-related restrictions. The first objective is to evaluate whether the non-standard forms of testing (correspondence and online), used on an emergency basis during the pandemic, were adequate compared to the standard form (face-to-face) applied before the pandemic. The second objective is to assess whether and to what extent the results of the examinations changed after the return of teaching and testing methods to normal. It turns out that the use of non-standard forms, although more challenging for teachers to control, did not lead to better results: the results in the correspondence form were similar to those in the standard form, and results in the online form were even worse. The results of examinations administered in the standard form after the return to normal teaching were significantly better than in any of the periods studied, including the standard form of examination before the pandemic. Possible reasons for these results are analysed in the paper.
- Published
- 2024
4. Pilot Comparison of Reading Quiz Formats in a Graduate Speech Sound Disorders Course
- Author
- Sheri Bayley
- Abstract
The purpose of this study was to explore student performance, self-ratings of learning and preference, and student comments on a variety of reading quiz formats in a first-semester speech-language pathology graduate course. Students from two cohorts (n = 34) completed four types of quizzes (closed-book, open-book, open-note, and collaborative group), in addition to a note-review study option, in self-selected order. Scores and reported preference were significantly lower on closed-book quizzes compared to other formats, but few other significant differences were observed across formats. Ranges of preferences, low variability in scores, and student comments supported the practice of allowing students to choose their own format, consistent with a needed move in the field towards learner-centered teaching. While additional research is warranted, this pilot study suggests that adding the learner-centered element of choice to assessments such as quizzes can provide flexibility for student preferences while also increasing adherence to reading assignments.
- Published
- 2024
5. Key Issues and Considerations in Measuring Vocabulary Growth: A Methodological Overview
- Author
- Abdullah Albalawi
- Abstract
Despite the substantial expansion in vocabulary research since the 1980s, we still know very little about how vocabulary develops over time and what factors influence this development. This methodological overview discusses key issues and considerations in vocabulary breadth growth assessment to help advance research in this area. The report begins by discussing general issues in vocabulary assessment such as sampling rate and the effect of cognates. This is followed by an overview and an evaluation of common vocabulary breadth tests. The report ends with recommendations for choosing vocabulary tests for vocabulary growth research.
- Published
- 2024
6. Transforming Assessments of Clinician Knowledge: A Randomized Controlled Trial Comparing Traditional Standardized and Longitudinal Assessment Modalities
- Author
- Shahid A. Choudhry, Timothy J. Muckle, Christopher J. Gill, Rajat Chadha, Magnus Urosev, Matt Ferris, and John C. Preston
- Abstract
The National Board of Certification and Recertification for Nurse Anesthetists (NBCRNA) conducted a one-year research study comparing performance on the traditional continued professional certification assessment, administered at a test center or online with remote proctoring, to a longitudinal assessment that required answering quarterly questions online on demand. A randomized controlled trial of 1,000 certified registered nurse anesthetists (500 randomly assigned to each of the traditional and longitudinal assessment groups) aimed to (1) compare assessment performance between groups, (2) compare perceptions and user experience between groups, and (3) describe participant feedback about usability of the longitudinal assessment platform. The mean scaled score for the traditional assessment group exceeded that of the longitudinal assessment group when first responses were scored; however, when the longitudinal assessment group's most recent responses on repeat questions previously answered incorrectly were scored, its mean scaled score was higher than the traditional assessment group's. Both groups were satisfied with their experience, with slightly higher feedback ratings from the longitudinal assessment group, who also found the platform easy to use and navigate. Overall results suggest the longitudinal assessment is a feasible, acceptable, and usable format to assess specialized knowledge for continued healthcare professional certification.
- Published
- 2024
7. Impacts of Differences in Group Abilities and Anchor Test Features on Three Non-IRT Test Equating Methods
- Author
- Inga Laukaityte and Marie Wiberg
- Abstract
The overall aim was to examine the effects of differences in group ability and features of the anchor test form on equating bias and the standard error of equating (SEE), using both real and simulated data. Chained kernel equating, poststratification kernel equating, and circle-arc equating were studied. A college admissions test with four different anchor test forms administered at three test administrations was used. The simulation study examined differences in the ability of the test groups and differences in the anchor test form with respect to item difficulty and discrimination. In the empirical study, the equated values from the three methods differed only slightly. The simulation study indicated that an easier anchor test form and/or an easier regular test form, and anchor items with a wider spread in difficulty, negatively affected the SEE and bias. The ability level of the groups was also important: equating with only less capable or only more capable groups resulted in high SEEs at higher and lower test scores, respectively. The discussion includes practical recommendations on whom an anchor test should be given to, when there is a choice, and on how to select an anchor test form when equating is the primary purpose.
- Published
- 2024
8. Transitioning from Paper to Touch Interface: Phoneme-Grapheme Recognition Testing and Gamification in Primary School Classrooms
- Author
- Lishi Liang, W. L. Quint Oga-Baldwin, Kaori Nakao, Luke K. Fryer, and Alex Shum
- Abstract
Phonological processing of written characters has been recognized as a crucial element in acquiring literacy in any language, both native and foreign. This study aimed to assess Japanese primary school students' phoneme-grapheme recognition skills using both paper-based and touch-interface tests. Differences between the two test formats and the relationship between phoneme-grapheme recognition skills and interaction with digital tests were investigated. We hypothesized a relationship between paper test performance and digital item performance. Participants were sixth-grade students from two public schools. Comparison tests indicated that the touch-interface test had lower success rates than the paper-based test for most items, suggesting a difference in performance patterns. A consistent relationship between phoneme-grapheme knowledge tested on paper and successful digital interaction was found. The findings highlight the potential of touch-interface tests for assessing phoneme-grapheme recognition skills in primary school classrooms and suggest incorporating more digital tasks to enhance student adaptation. [Note: The issue number (1) shown in the citation on the PDF is incorrect. The correct issue number is 2.]
- Published
- 2024
9. Input as a Key Element in Test Design: A Narrative of Designing an Innovative Critical Thinking Assessment
- Author
- Khagendra Raj Dhakal, Richard Watson Todd, and Natjiree Jaturapitakkul
- Abstract
Test input has often been taken as a given in test design practice. Nearly all guides for test designers provide extensive coverage of how to design test items but pay little attention to test input. This paper presents the case that test input plays a crucial role in designing tests of soft skills that have rarely been assessed in existing tests. In the process of designing a test of critical thinking, several attempts following existing test design guides resulted in poor tests that did not truly assess the intended objectives. These initial attempts used the norm of short passages as test input. Following these failures, we switched to using real-world input, such as tweets, numerical tables, and spam emails. In doing so, it was found that a particular input type favored a particular sub-skill of critical thinking and a particular item type. For example, using tweets as input enabled the assessment of the Perspective Taking sub-skill of critical thinking. This paper concludes that in designing skill tests, integrating appropriate input is at least as important as item design and calls for reevaluating the functions of test input as a distinct and dynamic element.
- Published
- 2024
10. The Effects of Reverse Items on Psychometric Properties and Respondents' Scale Scores According to Different Item Reversal Strategies
- Author
- Mustafa Ilhan, Nese Güler, Gülsen Tasdelen Teker, and Ömer Ergenekon
- Abstract
This study aimed to examine the effects of reverse items created with different strategies on psychometric properties and respondents' scale scores. To this end, three versions of a 10-item scale were developed: the first form contained 10 positive items (Form-P), while the other two forms each contained five positive and five reverse items. The reverse items in the second and third forms were crafted using antonyms (Form-RA) and negations (Form-RN), respectively. Based on the results, Form-P was unidimensional, while the other forms were two-dimensional. Moreover, although the reliability coefficients of all forms were above 0.80, the lowest was obtained for Form-RN. There were strong positive relationships between students' scores on the three scale forms; however, the lowest was estimated between Form-P and Form-RN. Finally, there was a significant difference between the students' mean scores on Form-RN and the other two versions, but the effect size of this difference was small. In conclusion, these results indicate that different types of reverse items influence psychometric properties and respondents' scale scores differently.
- Published
- 2024
11. An Exploratory Criterion Validation of Three Meaning-Recall Vocabulary Test Item Formats
- Author
- Tim Stoeckel and Tomoko Ishii
- Abstract
In an upcoming coverage-comprehension study, we plan to assess learners' meaning-recall knowledge of words as they occur in the study's reading passage. As several meaning-recall test formats exist, the purpose of this small-scale study (N = 10) was to determine which of three formats was most similar to a criterion interview regarding mean score and the consistency of correct/incorrect classifications (match rate, k = 30). In Test 1, the prompt consisted of only the target item, and a written translation of its meaning was elicited. In Test 2, the prompt was a short sentence in which a target item was highlighted, and a written translation of only that target item was requested. In Test 3, the prompt was the same sentence as in Test 2, but the target item was unhighlighted, and participants were requested to translate the entire sentence. Finally, in the criterion interview, participants were asked to demonstrate their understanding of the target items in the same prompt sentences as in Tests 2-3. The results indicated that Test 3 produced a mean score and match rate most similar to the interview, followed by Test 2, with Test 1 being the least similar. The paper discusses several factors explaining differences in test performance that were explored during the interview.
- Published
- 2024
12. How Assessment Choice Affects Student Perception and Performance
- Author
- Sanne Unger and Alanna Lecher
- Abstract
This action research project sought to understand how giving students a choice in how to demonstrate mastery of a reading would affect both grades and evaluations of the instructor, given that assessment choice might increase student engagement. We examined the effect of student assessment choice on grades and course evaluations, the two assessment options being a reading quiz or a two-minute video recording of themselves "recalling" what they could about the text (a "recall"). In Year 1, students were required to complete a multiple-choice reading quiz, with the option to complete a recall video for the opportunity to revise essays (revision tokens). In Year 2, students were allowed to choose whether they submitted a recall video or a quiz, with the option to submit the other to earn revision tokens. The data included student submissions, grades, and course evaluations. Students completed more recall assignments when the recall replaced the quiz requirement than during Year 1 when recalls only earned the students revision tokens. In addition, the instances of students completing both the quiz and recall increased in Year 2. Average course grades did not change from year to year, but students with higher course grades were significantly more likely to have completed recalls in both years. Student evaluations of the instructor were significantly higher for "responses to diverse learning styles" in Year 2 compared to Year 1. The study shows that letting students choose the assessment type they prefer can lead to increased student engagement and improve their perception of the instructor's responsiveness to learning styles, without causing grade inflation.
- Published
- 2024
13. A Two-Level Adaptive Test Battery
- Author
- Wim J. van der Linden, Luping Niu, and Seung W. Choi
- Abstract
A test battery with two different levels of adaptation is presented: a within-subtest level for the selection of the items in the subtests and a between-subtest level to move from one subtest to the next. The battery runs on a two-level model consisting of a regular response model for each of the subtests, extended with a second level for the joint distribution of their abilities. The presentation of the model is followed by an optimized MCMC algorithm to update the posterior distribution of each of its ability parameters, select items according to a Bayesian optimality criterion, and adaptively move from one subtest to the next. Thanks to extremely rapid convergence of the Markov chain and simple posterior calculations, the algorithm can be used in real-world applications without any noticeable latency. Finally, an empirical study with a battery of short diagnostic subtests is shown to yield score accuracies close to traditional one-level adaptive testing with subtests of double length.
- Published
- 2024
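The abstract describes Bayesian item selection with posterior updating. Below is a sketch of that core loop for a single subtest, using a grid approximation in place of the paper's optimized MCMC sampler; the 2PL item parameters and responses are invented for illustration:

```python
import numpy as np

# Hypothetical 2PL item bank for one subtest (discrimination a, difficulty b).
a = np.array([1.2, 0.8, 1.5, 1.0, 1.3])
b = np.array([-1.0, 0.0, 0.5, 1.0, -0.5])

theta = np.linspace(-4, 4, 161)          # ability grid
posterior = np.exp(-0.5 * theta ** 2)    # N(0, 1) prior, unnormalised
posterior /= posterior.sum()

def p_correct(a_i, b_i):
    return 1.0 / (1.0 + np.exp(-a_i * (theta - b_i)))

def expected_information(a_i, b_i):
    p = p_correct(a_i, b_i)
    return float((posterior * a_i ** 2 * p * (1 - p)).sum())  # posterior-weighted Fisher info

administered = []
for u in [1, 0, 1]:                      # pretend responses, 1 = correct
    info = [expected_information(a[j], b[j]) if j not in administered else -np.inf
            for j in range(len(a))]
    j = int(np.argmax(info))             # most informative unused item
    administered.append(j)
    p = p_correct(a[j], b[j])
    posterior *= p if u == 1 else (1 - p)   # Bayesian update with the response
    posterior /= posterior.sum()

print("items given:", administered, "EAP estimate:", round(float((theta * posterior).sum()), 3))
```

The between-subtest level of the paper would carry this posterior forward as the prior for the next subtest via the joint ability distribution.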
14. Middle School Students' Conceptualizations and Reasoning about the Fairness of Math Tests
- Author
- Morgan McCracken, Jonathan D. Bostic, and Timothy D. Folger
- Abstract
Assessment is central to teaching and learning, and recently there has been a substantive shift from paper-and-pencil assessments towards technology-delivered assessments such as computer-adaptive tests. Fairness is an important aspect of the assessment process, including design, administration, test-score interpretation, and data utility. The Universal Design for Learning (UDL) guidelines can inform assessment development to promote fairness; however, it is not explicitly clear how UDL and fairness may be linked through students' conceptualizations of assessment fairness. This phenomenological study explores how middle grades students conceptualize and reason about the fairness of mathematics tests, including paper-and-pencil and technology-delivered assessments. Findings indicate that (a) students conceptualize fairness through unique notions related to educational opportunities and (b) students reason about fairness non-linearly. The implications of this study have the potential to inform test developers and users about aspects of test fairness, as well as educators' use of data from fixed-form, paper-and-pencil tests and computer-adaptive, technology-delivered tests.
- Published
- 2024
15. Measuring Mathematical Skills in Early Childhood: A Systematic Review of the Psychometric Properties of Early Maths Assessments and Screeners
- Author
- Laura A. Outhwaite, Pirjo Aunio, Jaimie Ka Yu Leung, and Jo Van Herwegen
- Abstract
Successful early mathematical development is vital to children's later education, employment, and wellbeing outcomes. However, established measurement tools are infrequently used to (i) assess children's mathematical skills and (ii) identify children with or at-risk of mathematical learning difficulties. In response, this pre-registered systematic review aimed to provide an overview of measurement tools that have been evaluated for their psychometric properties for measuring the mathematical skills of children aged 0-8 years. The reliability and validity evidence reported for the identified measurement tools were then synthesised, including in relation to common acceptability thresholds. Overall, 41 mathematical assessments and 25 screeners were identified. Our study revealed five main findings. Firstly, most measurement tools were categorised as child-direct measures delivered individually with a trained assessor in a paper-based format. Secondly, the majority of the identified measurement tools have not been evaluated for aspects of reliability and validity most relevant to education measures, and only 15 measurement tools met the common acceptability thresholds for more than two areas of psychometric evidence. Thirdly, only four screeners demonstrated an acceptable ability to distinguish between typically developing children and those with or at-risk of mathematical learning difficulties. Fourthly, only one mathematical assessment and one screener met the common acceptability threshold for predictive validity. Finally, only 11 mathematical assessments and one screener were found to concurrently align with other validated measurement tools. Building on this current evidence and improving measurement quality is vital for raising methodological standards in mathematical learning and development research.
- Published
- 2024
16. Checkbox Grading of Handwritten Mathematics Exams with Multiple Assessors: How Do Students React to the Resulting Atomic Feedback? A Mixed-Method Study
- Author
- Filip Moons, Paola Iannone, and Ellen Vandervieren
- Abstract
Handwritten tasks are better suited than digital ones to assess higher-order mathematics skills, as students can express themselves more freely. However, maintaining reliability and providing feedback can be challenging when assessing high-stakes, handwritten mathematics exams involving multiple assessors. This paper discusses a new semi-automated grading approach called 'checkbox grading'. Checkbox grading gives each assessor a list of checkboxes consisting of feedback items for each task. The assessor then ticks those feedback items which apply to the student's solution. Dependencies between the checkboxes can be set to ensure all assessors take the same route on the grading scheme. The system then automatically calculates the grade and provides atomic feedback to the student, giving a detailed insight into what went wrong and how the grade was obtained. Atomic feedback consists of a set of format requirements for mathematical feedback items, which has been shown to increase feedback's reusability. Checkbox grading was tested during the final high school mathematics exam (grade 12) organised by the Flemish Exam Commission, with 60 students and 10 assessors. This paper focuses on students' perceptions of the received checkbox grading feedback and how easily they interpreted it. After the exam was graded, all students were sent an online questionnaire, including their personalised exam feedback. The questionnaire was filled in by 36 students, and 4 of them participated in semi-structured interviews. Findings suggest that students could interpret the feedback from checkbox grading well, with no correlation between students' exam scores and feedback understanding. Therefore, we suggest that checkbox grading is an effective way to provide feedback, also for students with shaky subject matter knowledge.
- Published
- 2024
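A sketch of what a checkbox grading scheme with dependencies and automatic grade calculation could look like; the feedback items, score deltas, and dependency rule below are hypothetical, not the Flemish Exam Commission's actual scheme:

```python
# Hypothetical checkbox scheme for one exam task: each feedback item carries a score
# delta, and an item may depend on another being ticked (the dependency idea in the paper).
CHECKBOXES = {
    "correct_method":  {"text": "Chooses a valid solution method", "delta": 2.0, "requires": None},
    "correct_algebra": {"text": "Algebraic steps are correct",     "delta": 1.5, "requires": "correct_method"},
    "correct_answer":  {"text": "Final answer is correct",         "delta": 0.5, "requires": "correct_algebra"},
    "units_missing":   {"text": "Units are missing",               "delta": -0.5, "requires": None},
}

def grade(ticked):
    """Return (score, atomic feedback lines) for the boxes an assessor ticked."""
    score, feedback = 0.0, []
    for key in sorted(ticked):
        box = CHECKBOXES[key]
        if box["requires"] and box["requires"] not in ticked:
            raise ValueError(f"{key} requires {box['requires']} to be ticked first")
        score += box["delta"]
        feedback.append(box["text"])
    return max(score, 0.0), feedback

print(grade({"correct_method", "correct_algebra", "units_missing"}))
```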
17. The Structural and Convergent Validity of the FMS² Assessment Tool among 8- to 12-Year-Old Children
- Author
- Nathan Gavigan, Sarahjane Belton, Una Britton, Shane Dalton, and Johann Issartel
- Abstract
Although there is a plethora of tools available to assess children's movement competence (MC), the literature suggests that many have significant limitations (e.g. not being practical for use in many 'real-world' settings). The FMS² assessment tool has recently been developed as a targeted solution to many of the existing barriers preventing practitioners from utilising MC assessments. The aim of this study was to investigate the structural and convergent validity of this new tool among 8- to 12-year-old Irish primary school children. As part of this study, 102 children (56.8% female, mean age = 9.8 years) were assessed using the FMS², the Test of Gross Motor Development (3rd edition) (TGMD-3) (short version) and the Functional Movement Screen™ (FMS™). Structural validity was assessed using confirmatory factor analysis (CFA). The convergent validity between the FMS², the TGMD-3 (short version) and the FMS™ was investigated using the Pearson product-moment correlation coefficient. Results of CFA for the FMS² indicate a good model fit, supporting a three-factor structure (locomotor, object manipulation, and stability). Additional findings indicate a moderate, positive correlation between the FMS² and the TGMD-3 (short version) (r = 0.66), and a low, positive correlation between the FMS² and the FMS™ (r = 0.48). This study presents the first preliminary findings to suggest that the FMS² may be a versatile, time-efficient, and ecologically valid tool to measure children's MC in multiple settings (e.g. research, education, sport, athletic therapy, and physiotherapy). Future research should seek to consolidate the existing validity findings with a larger and more diverse sample and further explore the feasibility of the tool in 'real-world' settings.
- Published
- 2024
18. Evaluating Psychometric Differences between Fast versus Slow Responses on Rating Scale Items
- Author
- Nana Kim and Daniel M. Bolt
- Abstract
Some previous studies suggest that response times (RTs) on rating scale items can be informative about the content trait, but a more recent study suggests they may also be reflective of response styles. The latter result raises questions about the possible consideration of RTs for content trait estimation, as response styles are generally viewed as nuisance dimensions in the measurement of noncognitive constructs. In this article, we extend previous work exploring the simultaneous relevance of content and response style traits on RTs in self-report rating scale measurement by examining psychometric differences related to fast versus slow item responses. Following a parallel methodology applied with cognitive measures, we provide empirical illustrations of how RTs appear to be simultaneously reflective of both content and response style traits. Our results demonstrate that respondents may exhibit different response behaviors for fast versus slow responses and that both the content trait and response styles are relevant to such heterogeneity. These findings suggest that using RTs as a basis for improving the estimation of noncognitive constructs likely requires simultaneously attending to the effects of response styles.
- Published
- 2024
19. Examining Adaptations in Study Time Allocation and Restudy Selection as a Function of Expected Test Format
- Author
- Skylar J. Laursen, Dorina Sluka, and Chris M. Fiacconi
- Abstract
Previous literature suggests learners can adjust their encoding strategies to match the demands of the expected test format. However, it is unclear whether other forms of metacognitive control, namely, study time allocation and restudy selection, are also sensitive to expected test format. Across four experiments we examined whether learners qualitatively adjust their allocation of study time (Experiment 1) and restudy selections (Experiments 2a, 2b, and 3) when expecting a more difficult generative memory test (i.e., cued-recall) as compared to a less difficult non-generative memory test (i.e., forced-choice recognition). Counter to our predictions, we found little evidence that learners shift their study time allocation and restudy selection choices toward easier material when expecting a relatively more difficult cued recall test, even after acquiring experience with each test format. Instead, based on exploratory analyses conducted post-hoc, learners appeared to rely heavily on the success with which they retrieved associated studied information at the time that restudy selections were solicited. Moreover, counter to some extant models of self-regulated learning, learners tended to first choose difficult rather than easy items when making their restudy selections, regardless of expected test format. Together, these novel findings place new constraints on our current understanding of learners' metacognitive sensitivity to expected test format, and have important implications for current theoretical accounts of self-regulated learning.
- Published
- 2024
20. An Experimental Comparison of Multiple-Choice and Short-Answer Questions on a High-Stakes Test for Medical Students
- Author
- Janet Mee, Ravi Pandian, Justin Wolczynski, Amy Morales, Miguel Paniagua, Polina Harik, Peter Baldwin, and Brian E. Clauser
- Abstract
Recent advances in automated scoring technology have made it practical to replace multiple-choice questions (MCQs) with short-answer questions (SAQs) in large-scale, high-stakes assessments. However, most previous research comparing these formats has used small examinee samples testing under low-stakes conditions. Additionally, previous studies have not reported on the time required to respond to the two item types. This study compares the difficulty, discrimination, and time requirements for the two formats when examinees responded as part of a large-scale, high-stakes assessment. Seventy-one MCQs were converted to SAQs. These matched items were randomly assigned to examinees completing a high-stakes assessment of internal medicine; no examinee saw the same item in both formats. Items administered in the SAQ format were generally more difficult than items in the MCQ format. The discrimination index for SAQs was modestly higher than that for MCQs, and response times were substantially higher for SAQs. These results support the interchangeability of MCQs and SAQs. When it is important that the examinee generate the response rather than select it, SAQs may be preferred. The results relating to difficulty and discrimination reported in this paper are consistent with those of previous studies. The results on the relative time requirements suggest that with a fixed testing time fewer SAQs can be administered; this limitation may more than offset the higher discrimination that has been reported for SAQs. We additionally examine the extent to which increased difficulty may directly impact the discrimination of SAQs.
- Published
- 2024
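For reference, the two classical item statistics compared across formats here, difficulty (proportion correct) and a corrected point-biserial discrimination index, can be computed as follows; the response matrix is simulated, not the study's data:

```python
import numpy as np

# responses: 1 = correct, 0 = incorrect, one column per item (hypothetical data).
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(200, 5))        # 200 examinees x 5 items

difficulty = responses.mean(axis=0)                  # p-value: proportion correct
total = responses.sum(axis=1)
discrimination = np.array([
    # corrected point-biserial: item score vs. rest-score (total minus the item itself)
    np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
    for j in range(responses.shape[1])
])
print("difficulty:", difficulty.round(2))
print("discrimination:", discrimination.round(2))
```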
21. Cheating Automatic Short Answer Grading with the Adversarial Usage of Adjectives and Adverbs
- Author
- Anna Filighera, Sebastian Ochs, Tim Steuer, and Thomas Tregel
- Abstract
Automatic grading models are valued for the time and effort saved during the instruction of large student bodies. Especially with the increasing digitization of education and interest in large-scale standardized testing, the popularity of automatic grading has risen to the point where commercial solutions are widely available and used. However, for short answer formats, automatic grading is challenging due to natural language ambiguity and versatility. While automatic short answer grading models are beginning to compare to human performance on some datasets, their robustness, especially to adversarially manipulated data, is questionable. Exploitable vulnerabilities in grading models can have far-reaching consequences ranging from cheating students receiving undeserved credit to undermining automatic grading altogether--even when most predictions are valid. In this paper, we devise a black-box adversarial attack tailored to the educational short answer grading scenario to investigate the grading models' robustness. In our attack, we insert adjectives and adverbs into natural places of incorrect student answers, fooling the model into predicting them as correct. We observed a loss of prediction accuracy between 10 and 22 percentage points using the state-of-the-art models BERT and T5. While our attack made answers appear less natural to humans in our experiments, it did not significantly increase the graders' suspicions of cheating. Based on our experiments, we provide recommendations for utilizing automatic grading systems more safely in practice.
- Published
- 2024
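A schematic Python sketch of the kind of black-box insertion attack the paper describes: greedily insert adjectives/adverbs that raise the grader's "correct" probability. The word list, grader interface, and toy scoring function are stand-ins, not the authors' implementation:

```python
import itertools

ADJECTIVES = ["really", "basically", "mostly", "quite", "actually"]  # insertion candidates

def insert_word(answer, word, position):
    tokens = answer.split()
    return " ".join(tokens[:position] + [word] + tokens[position:])

def attack(answer, p_correct, max_insertions=2):
    """Greedy black-box search: repeatedly insert the adverb/adjective that most
    raises the grader's 'correct' probability. `p_correct` stands in for querying
    any automatic short-answer grading model."""
    current = answer
    for _ in range(max_insertions):
        if p_correct(current) > 0.5:
            return current                        # already (mis)graded as correct
        candidates = [
            insert_word(current, w, i)
            for w, i in itertools.product(ADJECTIVES, range(len(current.split()) + 1))
        ]
        current = max(candidates, key=p_correct)  # climb the model's score
    return current if p_correct(current) > 0.5 else None

# Toy grader that happens to reward the word "basically" (for demonstration only):
toy = lambda s: 0.3 + 0.4 * ("basically" in s)
print(attack("the mitochondria store genetic material", toy))
```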
22. Reducing Workload in Short Answer Grading Using Machine Learning
- Author
- Rebecka Weegar and Peter Idestam-Almquist
- Abstract
Machine learning methods can be used to reduce the manual workload in exam grading, making it possible for teachers to spend more time on other tasks. However, when it comes to grading exams, fully eliminating manual work is not yet possible even with very accurate automated grading, as any grading mistakes could have significant consequences for the students. Here, the evaluation of an automated grading approach is therefore extended from measuring workload in relation to the accuracy of automated grading, to also measuring the overall workload required to correctly grade a full exam, with and without the support of machine learning. The evaluation was performed during an introductory computer science course with over 400 students. The exam consisted of 64 questions with relatively short answers and a two-step approach for automated grading was applied. First, a subset of answers to the exam questions was manually graded and next used as training data for machine learning models classifying the remaining answers. A number of different strategies for how to select which answers to include in the training data were evaluated. The time spent on different grading actions was measured along with the reduction of effort using clustering of answers and automated scoring. Compared to fully manual grading, the overall reduction of workload was substantial--between 64% and 74%--even with a complete manual review of all classifier output to ensure a fair grading.
- Published
- 2024
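The two-step approach (grade a subset manually, then train a classifier on it to score the rest) can be sketched with scikit-learn; the answers, labels, and model choice below are illustrative assumptions, not the authors' setup:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical data: a manually graded subset and the remaining ungraded answers.
graded_answers = ["a stack is last in first out", "a queue is first in first out",
                  "a stack removes the oldest element", "no idea"]
grades = [1, 1, 0, 0]                       # 1 = correct, 0 = incorrect
ungraded = ["stacks pop the most recently pushed item", "queues are last in first out"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(graded_answers, grades)

# Predict the rest; uncertain answers would go back to the teacher, mirroring
# the paper's complete manual review of classifier output.
for answer, proba in zip(ungraded, model.predict_proba(ungraded)[:, 1]):
    label = "correct" if proba > 0.5 else "incorrect"
    print(f"{proba:.2f} {label}: {answer}")
```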
23. Impact of Different Practice Testing Methods on Learning Outcomes
- Author
- Yavuz Akbulut
- Abstract
The testing effect refers to the gains in learning and retention that result from taking practice tests before the final test. Understanding the conditions under which practice tests improve learning is crucial, so four experiments were conducted with a total of 438 undergraduate students in Turkey. In the first study, students who took graded practice tests outperformed those who took them as ungraded practice. In the second study, students who took short-answer questions before the first exam and multiple-choice questions before the second exam scored higher on the second exam. In the third study, multiple-choice, short-answer and hybrid questions produced similar learning gains. In the fourth study, students who received detailed feedback immediately after class performed similarly to those who received feedback at the beginning of the next class. The results suggested the contribution of graded practice tests in general; however, the type of questions or the timing of feedback did not predict learning outcomes.
- Published
- 2024
24. Evaluating Equating Methods for Varying Levels of Form Difference
- Author
- Ting Sun and Stella Yun Kim
- Abstract
Equating is a statistical procedure used to adjust for differences in form difficulty so that scores on those forms can be used and interpreted comparably. In practice, however, equating methods are often implemented without considering the extent to which two forms differ in difficulty. This study examines the effect of the magnitude of the form difficulty difference on equating results under random group (RG) and common-item nonequivalent group (CINEG) designs. Specifically, it evaluates the performance of six equating methods under a set of simulation conditions including varying levels of form difference. Results revealed that, under the RG design, mean equating was the most accurate method when there was no or a small form difference, whereas equipercentile equating was the most accurate when the difficulty difference was medium or large. Under the CINEG design, Tucker linear equating was the most accurate method when the difficulty difference was medium or small, and either chained equipercentile or frequency estimation was preferred with a large difficulty difference. The study provides practitioners with evidence-based guidance on the choice of equating methods across varying levels of form difference. Because the condition of no form difficulty difference is also included, the study can inform testing companies of appropriate equating methods when two forms are similar in difficulty level.
- Published
- 2024
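A minimal sketch of two of the methods compared under the RG design, mean equating and equipercentile equating; the score distributions are simulated, and the presmoothing steps used in operational equating are omitted:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
form_x = rng.normal(50, 10, 2000).round()    # scores on the new form (hypothetical)
form_y = rng.normal(53, 10, 2000).round()    # scores on the old form, slightly easier

# Mean equating: shift X scores by the difference in form means.
mean_equated = form_x + (form_y.mean() - form_x.mean())

# Equipercentile equating: map an X score to the Y score with the same percentile rank.
def equipercentile(x, ref_x, ref_y):
    pr = stats.percentileofscore(ref_x, x, kind="mean")
    return np.percentile(ref_y, pr)

print(equipercentile(60, form_x, form_y))    # Y-scale equivalent of an X score of 60
```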
25. Exploring Interaction in Video-Call Paired Speaking Tests: A Look at Scores, Language, and Perceptions
- Author
- Hye-won Lee, Andrew Mullooly, Amy Devine, and Evelina Galaczi
- Abstract
In the assessment of second language oral communication, the video-call speaking test has received increasing attention as a test method with higher practicality than its in-person counterpart, but still with broad coverage of the test construct. Previous studies into video-call assessment have focussed on the individual (as opposed to paired or group) interactional format. The current study extends this line of research by focussing on paired speaking interactions, with a specific focus on the construct of interactional competence. A concurrent triangulation design was adopted with the use of both quantitative and qualitative data through recordings and scores of test performances, questionnaires, and focus groups. Findings indicate that video-call paired interactions in the assessment context of interest in this study are largely comparable to in-person interactions in terms of scores, with statistically small-effect size differences identified. Some differences in terms of turn-taking management, examiner, and test-taker perceptions were also identified. We argue for a more in-depth awareness of the characteristics of video-call speaking in its own right, which can inform both assessment and learning contexts.
- Published
- 2024
26. Delineating Discrepancies between TOEFL PBT and CBT
- Author
- Ahmad Yulianto, Anastasia Pudjitriherwanti, Chevy Kusumah, and Dies Oktavia
- Abstract
The increasing use of computer-based modes in language testing raises concern over their similarities with and differences from paper-based formats. The present study aimed to delineate discrepancies between the TOEFL PBT and CBT. To that end, a quantitative method was employed to probe score equivalence, the performance of male and female participants, the relationship between completion time and test score, and the test mode's effects on participants' performance. In total, 124 undergraduates aged 19-21 years (M = 20, SD = 0.66) took part in the research. To analyze the data, MANOVA, Pearson correlation, and regression tests were run. The findings uncovered that: (1) PBT and CBT were equivalent in scores; (2) male and female participants' scores were not significantly different; (3) there was a moderately negative correlation between completion time and score; and (4) computer familiarity, habit in using computers, and perception toward CBT did not affect performance on the TOEFL. For researchers, the implication of this study concerns the interchangeability of the two test modes. For CBT test designers, it concerns the appropriate inclusion of visuals, time-related measurement, and procedures for designing computer-based tests.
- Published
- 2023
27. Do Different Devices Perform Equally Well with Different Numbers of Scale Points and Response Formats? A Test of Measurement Invariance and Reliability
- Author
- Natalja Menold and Vera Toepoel
- Abstract
Research on mixed devices in web surveys is in its infancy. Using a randomized experiment, we investigated device effects (desktop PC, tablet, and mobile phone) for six response formats and four different numbers of scale points. N = 5,077 members of an online access panel participated in the experiment. An exact test of measurement invariance and composite reliability were investigated. The results showed full data comparability across devices and formats, with the exception of the continuous Visual Analog Scale (VAS), but limited comparability across different numbers of scale points. There were device effects on reliability in the interactions with formats and numbers of scale points: the VAS, mobile phones, and five-point scales consistently yielded lower reliability. We suggest technically less demanding implementations as well as a unified design for mixed-device surveys.
- Published
- 2024
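Composite reliability, one of the two criteria investigated, has a standard closed form computable from standardized factor loadings; a small sketch with hypothetical loadings:

```python
# Composite reliability from standardized factor loadings:
# CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances),
# where the error variance of a standardized item is 1 - loading^2.
def composite_reliability(loadings):
    s = sum(loadings)
    errors = sum(1 - l ** 2 for l in loadings)
    return s ** 2 / (s ** 2 + errors)

print(composite_reliability([0.7, 0.8, 0.75, 0.65]))  # hypothetical loadings for one format
```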
28. Eye Movements and Reading Comprehension Performance: Examining the Relationships among Test Format, Working Memory Capacity and Reading Comprehension
- Author
- Corrin Moss, Sharon Kwabi, Scott P. Ardoin, and Katherine S. Binder
- Abstract
The ability to form a mental model of a text is an essential component of successful reading comprehension (RC), and purpose for reading can influence mental model construction. Participants were assigned to one of two conditions during an RC test to alter their purpose for reading: concurrent (texts and questions were presented simultaneously) and sequential (texts were presented first, then questions were shown without text access). Their eye movements were recorded during testing. Working memory capacity (WMC) and centrality of textual information were measured. Participants in the sequential condition had longer first-pass reading times compared to participants in the concurrent condition, while participants in the concurrent condition had longer total processing times per word. In addition, participants with higher WMC had longer total reading times per word. Finally, participants in the sequential condition with higher WMC had longer processing times in central regions. Even among skilled college readers, participants with lower WMC had difficulty adjusting their reading behaviors to meet the task demands such as distinguishing central and peripheral ideas. However, participants with higher WMC increased attention to important text areas. One potential explanation is that participants with higher WMC are better able to construct a coherent mental model of the text, and attending to central text areas is an essential component of mental model formation. Therefore, these results help clarify the relationship between the purpose for reading and mental model development.
- Published
- 2024
29. A Method for Converting 4-Option Multiple-Choice Items to 3-Option Multiple-Choice Items without Re-Pretesting
- Author
- Amanda A. Wolkowitz, Brett Foley, and Jar Zurn
- Abstract
The purpose of this study is to introduce a method for converting scored 4-option multiple-choice (MC) items into scored 3-option MC items without re-pretesting the 3-option items. The study describes a six-step process for achieving this goal. Data from a professional credentialing exam were used, and the method was applied to 24 forms of the exam. The method predicted the rounded passing score with 100% accuracy for all forms.
- Published
- 2023
30. The Correlation among Different Types of Exams in Azerbaijan
- Author
- Gunel A. Alasgarova
- Abstract
It is crucial to examine the alignment of results across exams conducted by various organizations in order to improve the quality of assessment. The research used a document analysis method with recent, publicly available national and international reports. The main question examined was: which exams have the highest correlation and are most trustworthy in Azerbaijan for short- and long-term outcomes? The data were analyzed for statistical comparisons of university admission exams with the 9th- and 11th-grade SEC exams, school grades, and other assessments. The research shows that the State Examination Center's exams align with its own evaluations and with international assessment (OECD); they can be considered methodologically rigorous, providing a more valid yardstick for measuring student knowledge and achievement. Overall, exams by the SEC had a high correlation coefficient compared to Higher Education Institutions' assessments. As more international students want to pursue their education in Azerbaijan, these findings can be valuable for their decision-making at the tertiary level.
- Published
- 2023
31. Assessment Literacy Components Predicting EFL Teachers' Job Demand-Resources: A Focus on Burnout and Engagement
- Author
- Behnaz Rastegr and Abbas Ali Zarei
- Abstract
Much has been done on assessment literacy (AL) components and job demand-resources (JD-R). However, an interdisciplinary look at AL components as the predictors of JD-R and its possible consequences for the engagement and burnout of teachers' assessment performance has been neglected. To fill this gap, the present study explored this issue in the context of Iran. To this end, through convenience sampling, 146 Iranian EFL teachers were selected to answer questionnaires on AL, JD-R, burnout, and engagement. A series of multiple regression analyses were run to analyze the collected data. The results showed that some components of AL such as 'test construction', 'administering, rating, and interpreting test', 'psychometric properties of a test', 'using and interpreting statistics', and 'authenticity' were significant predictors of job demand. Moreover, the results revealed that alternative and digital-based assessment, recognizing test type, distinction and function, and authenticity were significant predictors of job resources. Furthermore, test construction, administering, rating, and interpreting test, psychometric properties of a test, and using and interpreting statistics could significantly predict teachers' burnout. In addition, alternative and digital-based assessment, giving feedback in assessment, and ethical and cultural considerations in assessment turned out to significantly predict teachers' engagement. These findings can have theoretical and practical implications for stakeholders.
- Published
- 2023
32. Question Format Biases College Students' Metacognitive Judgments for Exam Performance
- Author
- Michael J. McGuire
- Abstract
College students in a lower-division psychology course made metacognitive judgments by predicting and postdicting performance for true-false, multiple-choice, and fill-in-the-blank question sets on each of three exams. This study investigated which question format would result in the most accurate metacognitive judgments. Extending Koriat's (1997) cue-utilization framework to these judgments, each format gave students different cues on which to base judgments. Further, each format has a different probability of a correct guess, which can skew accuracy. Students reported the lowest estimates for fill-in-the-blank questions. Accuracy measured using bias scores showed students' predictions and postdictions were most accurate for multiple-choice items. Accuracy measured using gamma correlations showed students' predictions were most accurate for multiple-choice items and postdictions were most accurate for fill-in-the-blank items. Based on the findings, educators are encouraged to consider what implications question format has for metacognitive processes when testing students on studied material. For researchers, the findings support the use of different accuracy measures to get a more detailed understanding of factors influencing metacognitive judgment accuracy.
- Published
- 2023
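The two accuracy measures contrasted in this abstract, bias scores and Goodman-Kruskal gamma correlations, can be computed directly; the judgment and performance values below are invented for illustration:

```python
import numpy as np

predicted = np.array([80, 70, 90, 60, 75], dtype=float)   # predicted % correct
actual    = np.array([72, 68, 85, 65, 60], dtype=float)   # actual exam performance

# Bias score: signed over/underconfidence (positive = overconfidence).
bias = (predicted - actual).mean()

# Goodman-Kruskal gamma: (concordant - discordant) / (concordant + discordant) pairs.
conc = disc = 0
for i in range(len(predicted)):
    for j in range(i + 1, len(predicted)):
        s = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
        conc += s > 0
        disc += s < 0
gamma = (conc - disc) / (conc + disc)
print(f"bias = {bias:.1f}, gamma = {gamma:.2f}")
```

Bias captures calibration (how far off the judgments are on average), while gamma captures relative accuracy (whether better-known items get higher judgments), which is why the two measures can rank formats differently.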
33. EFL Teachers' Knowledge, Beliefs, and Practices Regarding Fairness and Justice in Technology-Enhanced Classroom Assessment: A Duoethnography
- Author
- Teymour Rahmati and Musa Nushi
- Abstract
Drawing on duoethnography, the teacher researchers in the present study interacted with the relevant literature, engaged in dialogs, and shared artifacts to examine their knowledge, beliefs, and practices regarding fairness and justice considerations in technology-enhanced language classroom assessment. Under the domain of knowledge, they conceptualized fairness and justice and identified their components. Within beliefs, the difference between high-stakes and low-stakes assessments, the significance of students' perceptions, and the role of computer literacy in relation to fairness and justice in technology-enhanced classroom assessment were debated. To operationalize their knowledge and beliefs, the researchers inspected their assessment practices during and following COVID-19. They agreed that fairness was distinct from justice in that the former pertained to test internal characteristics and its administration procedures while the latter referred to test external consequences at a broader social level. They believed that fairness and justice were equally important in high-stakes and low-stakes assessments, and students' perceptions were valuable sources of feedback regarding fair and just classroom assessments. Moreover, the teachers argued that computer literacy cannot yet be considered an aspect of language ability. Finally, it was revealed that although their practice regarding fairness and justice was affected by the pandemic, they learned valuable lessons (e.g., combining online and paper assessment modalities and giving oral exams) in this respect for the future. The findings imply that language teachers should theoretically adopt a clear conception of fairness and justice while being practically prepared for future developments (e.g., technological advances) and unexpected circumstances (e.g., a pandemic).
- Published
- 2023
34. Historical Sources in Exam Tests: An Analysis of School Leaving Exam Tasks in Terms of the Use of Primary Sources
- Author
- László Kojanitz
- Abstract
In 2005 the Hungarian school-leaving examination system underwent a significant transformation. In the case of history, the aim was to give a greater role to the development of students' knowledge acquisition and source analysis skills by focusing more on students' work with historical sources in class. However, it was clear that the achievement of these goals would also depend on the new exam tasks, since these determine whether the reform can bring about real change. I therefore carefully examined the exam tasks from the past fifteen years that contained primary sources. I wanted to give an accurate picture of which types of tasks were most frequent and how they could be assessed in terms of the original objectives of the reform and the competency requirements of the school-leaving examination. Based on the conclusions drawn from the results of the investigation, I formulate proposals for changing the composition of the exam tasks and for preparing the writing of new tasks.
- Published
- 2023
35. The Influence of Two Stage Collaborative Testing on Peer Relationships: A Study of First Year University Student Perceptions
- Author
- Brian Rempel, Elizabeth McGinitie, and Maria Dirks
- Abstract
Two-stage testing is a form of collaborative assessment that creates an active learning environment during test taking. In two-stage testing, students first complete an exam individually, and then complete a subset of the same questions as part of a learning team with the ultimate exam score being a weighted average of the individual and team portions. In the second (team-based) part of the exam, students are encouraged to discuss solutions until a consensus among team members is achieved, thus actively engaging students with course material and each other during the exam. A short open-ended survey was administered to students at the end of the semester, and the responses coded by thematic analysis, with themes generated using inductive coding based on the principles of grounded theory. The most important conclusion was that students overwhelmingly preferred two-stage tests for the development of positive peer relationships in class. The most common themes that emerged from student responses involved positive feelings from forced interaction with their peers, the benefits of meeting and socializing with other students, sharing of knowledge with others, and solidarity or positive affect towards the process of working as part of a team. Finally, students also expressed an overall preference for two-stage exams when compared to solely individual, one-stage exams.
- Published
- 2023
36. The Examination of Online and Paper-Pencil Test Scores of Students Engaged in Online Learning
- Author
- Necati Taskin and Kerem Erzurumlu
- Abstract
In this study, online test scores and paper-pencil test scores of students studying through online learning were examined. Causal-comparative research was used to determine the distribution of students' test scores and to examine the relationship between them. The participants were freshman students studying in 12 faculties and 8 colleges of a state university in Türkiye. The distribution of students' test scores is depicted by means, standard deviations, percentages, and graphs. The correlation coefficient was examined to find and interpret the strength of the relationship between the students' test scores. According to the findings, students' online test scores were higher than their paper-pencil test scores, and course pass rates on online exams were higher than on paper-pencil exams. The relationship between students' paper-pencil test scores and their online test scores was lower than the relationship among paper-pencil test scores themselves; there is thus an inconsistency between students' paper-pencil and online test scores. The rise in students' scores on un-proctored online exams is the likely reason for this inconsistency. Moving online exams to proctored exam environments, using computerized adaptive testing, or including online activities in the assessment may reduce it.
- Published
- 2023
37. Reliability and Validity of Methods to Assess Undergraduate Healthcare Student Performance in Pharmacology: Comparison of Open Book versus Time-Limited Closed Book Examinations
- Author
- David Bell, Vikki O'Neill, and Vivienne Crawford
- Abstract
We compared the influence of open-book extended-duration versus closed-book time-limited formats on the reliability and validity of written assessments of pharmacology learning outcomes within our medical and dental courses. Our dental cohort undertakes a mid-year test (30 free-response short-answer questions, SAQ) and an end-of-year paper (4 SAQs, 1 essay, 1 case) in pharmacology. For our first-year medical cohort, pharmacology is integrated within a larger course, contributing 20 clinical vignette questions (selecting the single best answer, SBA, to each question from a choice of 5 plausible answers) to a mid-year test and 3-5 SAQs to an end-of-year paper. Our experience indicates that SAQs are as reliable as SBAs for closed-book time-limited assessments; reliability correlates with the number of questions employed. We have found good correlation between mid-year and end-of-year performance (predictive validity), between questions (factorial validity), and between pharmacology and other subjects within the assessment (concurrent validity). Adoption of open-book extended-duration assessments resulted in only a modest reduction in reliability and validity.
- Published
- 2023
38. The Influence of Passage Cohesion on Cloze Test Item Difficulty
- Author
- Jonathan Trace
- Abstract
The role of context in cloze tests has long been seen as both a benefit and a complication in their usefulness as a measure of second language comprehension (Brown, 2013). Passage cohesion, in particular, would seem to have a relevant and important effect on the degree to which cloze items function and on the interpretability of performances (Brown, 1983; Dastjerdi & Talebinezhad, 2006; Oller & Jonz, 1994). With recent evidence showing that cloze items can require examinees to access information at both the sentence and passage level (Trace, 2020), it is worthwhile to look back and examine the relationship between aspects of passage cohesion--referential cohesion, semantic overlap, and incidence of conjunctives--and item difficulty by classification. The current study draws upon a large pool of cloze test passages and items (k = 377) originally used by Brown (1993), along with automated text analysis of cohesion ("Coh-Metrix," McNamara et al., 2014), to examine the impact of passage cohesion on item function. Correlations, factor analysis, and linear regression point to clear though minimal differences for both sentential and intersentential items as they relate to aspects of passage cohesion; the results may inform future test design and the interpretation of cloze performance.
- Published
- 2023
39. Achieving Technical Economy: A Modification of Cloze Procedure
- Author
- Albert Weideman and Tobie van Dyk
- Abstract
This contribution investigates gains in technical economy in measuring language ability by considering one recurrent interest of JD Brown: cloze tests. In the various versions of the Test of Academic Literacy Levels (TALL), its Sesotho and Afrikaans (Toets van Akademiese Geletterdheidsvlakke -- TAG) counterparts, as well as related other tests used in South Africa, the test designers have used a modification of this procedure to very good effect. This paper reports on the steady evolution of the format over many years, how it is currently used, what its outstanding empirical properties are, and how the kind of technical economy it brings to measuring the ability to handle the demands of academic language in tertiary education can be further applied. The modification involves the conventional, more or less systematic mutilation of a selected text, with two multiple-choice questions about every gap in it: where the gap is, and which word has been omitted. We have not seen analyses of this format anywhere else, which in itself may be of interest to test designers. We proceed by defining technical economy and then develop an argument, on the basis of the empirical properties of TALL, about how that idea can be applied to the design and task selection of such tests, before illustrating how such choices may contribute to further productive and responsible designs and test formats.
- Published
- 2023
40. The Impact of the Images in Multiple-Choice Questions on Anatomy Examination Scores of Nursing Students
- Author
- Narnaware, Yuwaraj and Cuschieri, Sarah
- Abstract
The beneficial effect of images on anatomical knowledge is well documented in medical and allied health students, but it has rarely been assessed in nursing students. To assess the effect of images on improving anatomical knowledge, and to use images as one method of assessing gross anatomical knowledge in nursing students, the present study was repeated over two semesters. The results show that the class average (%) increased significantly (P < 0.006) when more anatomical images were included in a multiple-choice anatomy exam compared to a similar exam with fewer images, and decreased significantly (P < 0.002) when the number of images was reduced by 50% compared to image-rich exams. Examinations with an equal number of images did not alter the class average. The percent score on individual questions presented with images plus text was significantly (P < 0.001) higher than on the same questions with text only in both semesters. The findings indicate that including images in anatomy examinations can improve learning and knowledge, may help reduce cognitive load, aid recall of anatomical knowledge, and provide a cue for answering an exam question.
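The per-question comparison described (percent correct with image plus text versus text only) is the kind of contrast a two-proportion z-test captures. The sketch below illustrates that comparison with invented counts; it is not the study's reported analysis.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical per-question results: students answering one anatomy
# question correctly with image+text vs. text only (invented counts).
correct = np.array([96, 78])    # correct answers in each condition
n       = np.array([120, 120])  # students attempting the question

z, p = proportions_ztest(count=correct, nobs=n)
print(f"z = {z:.2f}, p = {p:.4f}")
```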
- Published
- 2023
41. Effect of Missing Data on Test Equating Methods Under NEAT Design
- Author
- Semih Asiret and Seçil Ömür Sünbül
- Abstract
This study aimed to examine the effect of missing data of different patterns and sizes on test equating methods under the NEAT design for different factors. For this purpose, factors such as sample size, average difficulty difference between the test forms, difference between the ability distributions, missing data rate, and missing data mechanism were manipulated. The effects of these factors on the equating error of four test equating methods (chained equipercentile equating, Tucker, frequency estimation equating, and Braun-Holland) were investigated. Two separate sets of 10,000 dichotomous responses were generated consistent with a 2-parameter logistic model, using the MCAR and MAR missing data mechanisms. All analyses were conducted in R 4.2.2. The RMSE of the equating methods increased significantly as the missing data rate increased, and RMSE with imputed missing data was reduced compared to equating without imputation. Furthermore, the percentage of missing data, along with the difference between ability levels and the average difficulty difference between forms, was found to significantly affect equating errors in the presence of missing data. Although increasing sample size did not have a significant effect on equating error in the presence of missing data, it did lead to more accurate equating when no missing data were present.
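A compressed sketch of the data-generation side of such a simulation is shown below: 2PL response generation followed by MCAR deletion and an RMSE helper. All settings are assumptions for illustration; the equating methods themselves (chained equipercentile, Tucker, frequency estimation, Braun-Holland) are omitted.

```python
import numpy as np

rng = np.random.default_rng(42)
n_persons, n_items = 10_000, 40

# 2PL model: P(X_ij = 1) = 1 / (1 + exp(-a_j * (theta_i - b_j)))
theta = rng.normal(0.0, 1.0, n_persons)   # abilities
a = rng.lognormal(0.0, 0.3, n_items)      # discriminations
b = rng.normal(0.0, 1.0, n_items)         # difficulties
p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
responses = (rng.random((n_persons, n_items)) < p).astype(float)

# MCAR: every response has the same deletion probability.
missing_rate = 0.10
responses[rng.random(responses.shape) < missing_rate] = np.nan

# Equating error summary of the kind reported in such studies.
def rmse(equated_scores, criterion_scores):
    return float(np.sqrt(np.mean((equated_scores - criterion_scores) ** 2)))
```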
- Published
- 2023
42. Investigating Different Kinds of Stems in Multiple-Choice Tests: Interruptive vs. Cumulative
- Author
- Sharareh Sadat Sarsarabi and Zeinab Sazegar
- Abstract
The stem of a multiple-choice question can be written using two types of sentences: interruptive (periodic) and cumulative (loose). This study deals with these kinds of stems in designing multiple-choice (MC) items. To fill the existing gap in the literature, two groups of student teachers taking general English courses at Farhangian University were selected based on the Cambridge Placement Test. The design was a comparison-group design. To compare the effectiveness of the two stem types, two tests based on the book Thoughts and Notions 2, taught in the General English classes, were designed; the tests were similar in content but differed in their stems. Each test contained 40 items: 25 vocabulary items and 15 reading comprehension items. The first group took the test whose stems were all interruptive sentences; the second group took the test whose stems were all cumulative sentences. Analysis via an independent t-test showed that the first group outperformed the second. It was therefore concluded that interruptive sentences as stems in multiple-choice tests were more reliable and valid than cumulative ones. One implication is that interruptive stems can assist policymakers, materials designers, and language teachers in future decision-making and materials design.
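The group comparison reported rests on an independent-samples t-test; a minimal sketch with invented score vectors follows (the numbers are illustrative, not the study's data).

```python
import numpy as np
from scipy import stats

# Hypothetical test scores (out of 40) for the two stem conditions.
rng = np.random.default_rng(5)
interruptive = rng.normal(loc=29, scale=4, size=60).clip(0, 40)
cumulative   = rng.normal(loc=26, scale=4, size=60).clip(0, 40)

t, p = stats.ttest_ind(interruptive, cumulative)
print(f"t = {t:.2f}, p = {p:.4f}")
```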
- Published
- 2023
43. TOEFL iBT Speaking Subtest: The Efficacy of Preparation Time on Test-Takers' Performance
- Author
- Ali Akbar Ariamanesh, Hossein Barati, and Manijeh Youhanaee
- Abstract
The present study investigates the efficacy of preparation time in four speaking tasks of the TOEFL iBT. As the pre-task planning time currently offered by ETS is very short (15 to 30 seconds), we explored how test-takers' speaking quality would change if the preparation time were added to the response time, giving respondents a relatively longer online planning opportunity. To this aim, two groups of TOEFL iBT candidates were studied under pre-task and online planning conditions. In total, 384 elicited speaking samples were first transcribed and then measured in terms of complexity, accuracy, and fluency (CAF). A series of one-way MANOVAs revealed that the online planning group significantly outperformed the pre-task planning group in accuracy and fluency across all four speaking tasks. Although less robustly, the online planners also showed significantly higher speech complexity, represented by lexical diversity and left-embeddedness. These results may challenge the efficacy of the preparation time currently provided in the TOEFL iBT speaking subsection.
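For readers unfamiliar with the analysis, a one-way MANOVA tests group differences on several outcomes jointly. A minimal sketch with invented CAF measures follows; the variables, group labels, and values are assumptions, not the study's data.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Hypothetical CAF measures for two planning conditions (invented data).
rng = np.random.default_rng(7)
n = 48
df = pd.DataFrame({
    "group": ["pretask"] * n + ["online"] * n,
    "complexity": np.r_[rng.normal(4.0, 0.5, n), rng.normal(4.2, 0.5, n)],
    "accuracy":   np.r_[rng.normal(0.70, 0.08, n), rng.normal(0.78, 0.08, n)],
    "fluency":    np.r_[rng.normal(120, 15, n), rng.normal(132, 15, n)],
})

# One-way MANOVA on the three CAF outcomes by planning condition.
fit = MANOVA.from_formula("complexity + accuracy + fluency ~ group", data=df)
print(fit.mv_test())
```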
- Published
- 2023
44. Application of Two-Parameter Item Response Theory for Determining Form-Dependent Items on Exams Using Different Item Orders
- Author
- Pentecost, Thomas C., Raker, Jeffery R., and Murphy, Kristen L.
- Abstract
Using multiple versions of an assessment has the potential to introduce item environment effects. These effects produce version-dependent item characteristics (i.e., difficulty and discrimination). Methods to detect such effects, and their implications, are important at all levels of assessment where multiple forms are created. This report describes a novel method for identifying items that do and do not display form dependence. The first two steps identify form-dependent items using a differential item functioning (DIF) analysis of item parameters estimated by Item Response Theory. The method is illustrated using items that appeared in four forms (two trial and two released versions) of a first-semester general chemistry examination. Eighteen of fifty-six items were identified as having form-dependent item parameters. Thirteen of those items displayed form dependence consistent with reasons previously identified in the literature: preceding item difficulty, content priming, and a combination of the two. The remaining five items had form dependence that did not align with reasons reported in the literature. An analysis was then done to determine whether all predicted instances of form dependence could be found: several items that could have displayed form dependence, based on preceding item difficulty or content priming, did not. In short, we identify and rationalize form dependence for thirteen of the eighteen flagged items, but we were unable to predict which items would display form dependence.
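A bare-bones version of the flagging step (comparing IRT difficulty estimates for the same item across two forms) might look like the following. The Wald-style rule, parameter values, and standard errors are invented for illustration and are not the authors' exact procedure.

```python
import numpy as np

# Hypothetical 2PL difficulty estimates (b) for the same four items on
# two forms, with standard errors; all values are illustrative.
b_form1 = np.array([0.20, -0.50, 1.10, 0.00])
b_form2 = np.array([0.25, -0.45, 1.80, 0.05])
se      = np.array([0.08, 0.07, 0.10, 0.09])

# A simple Wald-style flag: |b1 - b2| large relative to its pooled SE
# suggests the item behaves differently across forms (form dependence).
z = (b_form1 - b_form2) / (se * np.sqrt(2))
flags = np.abs(z) > 1.96
print(flags)  # only item 3 (index 2) is flagged in this toy example
```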
- Published
- 2023
45. Exploring Confidence Accuracy and Item Difficulty in Changing Multiple-Choice Answers of Scientific Reasoning Test
- Author
- Fadillah, Sarah Meilani, Ha, Minsu, Nuraeni, Eni, and Indriyanti, Nurma Yunita
- Abstract
Purpose: Researchers have found that when students are given the opportunity to change their answers, a majority change responses from incorrect to correct, and this often increases overall test scores. What prompts students to modify their answers? This study examines answer changing on a scientific reasoning test, with additional exploration of confidence accuracy and its relation to item difficulty. Methodology: A pre-test/post-test design was used, built around a 20-item scientific reasoning test with a confidence judgement for each item. The items were validated by analysing their psychometric properties under the three-parameter logistic (3PL) Item Response Theory model, carried out in RStudio. The items were administered in random order to 205 Indonesian undergraduate students in science education-related majors. Confidence accuracy was determined by categorising correct and incorrect answers to the scientific reasoning questions according to the reported level of confidence. Findings: Responses were modified more often from incorrect to correct than from correct to incorrect, producing a significant gain in overall scientific reasoning scores, although these modifications were not shown to be connected to item difficulty. Even though confidence also increased significantly, Indonesian students repeatedly responded with overconfidence even after sitting the same test three weeks later, which could indicate a lack of metacognitive ability. These findings should spur educators to incorporate metacognitive training into their teaching and learning activities, given the overconfidence that frequently occurs among Indonesian students in examinations. Significance: This study provides further substantiation in the fields of scientific reasoning and cognitive science that confidence accuracy changes over repeated attempts at a scientific reasoning test, and it contributes to uncovering the true ability of Indonesian students when performing such reasoning tests.
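The 3PL model the authors used for psychometric screening has a closed form; the small sketch below states it, with an invented item showing how the pseudo-guessing parameter lifts the lower asymptote. This is the standard 3PL formula, not the study's code.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """Three-parameter logistic IRT model: probability of a correct
    response given ability theta, discrimination a, difficulty b,
    and pseudo-guessing parameter c."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

# Illustrative values: a moderately discriminating item with guessing.
print(round(p_3pl(theta=0.0, a=1.2, b=0.5, c=0.20), 2))  # ~0.48
```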
- Published
- 2023
46. Comparing the Effectiveness of Newer Linework on the Mental Cutting Test (MCT) to Investigate Its Delivery in Online Educational Settings
- Author
- Green, Theresa, Goodridge, Wade H., Anderson, Jon, Davishahl, Eric, and Kane, Daniel
- Abstract
The purpose of this study was to examine differences in test scores across three online versions of the Mental Cutting Test (MCT). The MCT was developed to quantify a rotational and proportional construct of spatial ability and has been used extensively to assess spatial ability. The test was developed in 1938 as a paper-and-pencil test in which examinees are presented with a two-dimensional drawing of a 3D object containing a cutting plane passing through the object; the examinee must determine the cross-sectional shape that would result from cutting along the imaginary plane. This work explored three versions of the test (the original and two adapted versions), administered online, to see whether student performance differed across versions. The versions differed in the quality of the linework displayed and in the shading shown on the surfaces. The study analyzed statics students' scores on the three online versions and on the original paper version of the MCT to identify which version may be most suitable for administering to engineering students. Results showed a statistically significant difference in students' scores between multiple versions. Understanding which representations of the MCT items are clearest to students will provide insights for educators looking to improve and understand the spatial ability of their students.
- Published
- 2023
47. The Oral Exam--Learning for Mastery and Appreciating It
- Author
- Akkaraju, Shylaja
- Abstract
To reduce academic dishonesty and strengthen learning outcomes, I adopted in-depth oral examinations as my benchmark and summative assessments in a Human Anatomy & Physiology course taught in an online asynchronous setting. This decision led my students and me down the transformative path of mastery learning. It was a "threshold experience" for my students, who were learning how to think and express themselves as physiologists, and for me, as I explored the scope of the oral examination in promoting skill acquisition while nurturing a relationship-rich learning environment. By employing "deliberate practice" principles, including basic drills, one-on-one weekly check-ins, and small-group recitation sessions, students exceeded benchmarks for conceptual understanding, mastery of fundamentals, and application of concepts to clinical scenarios. Students consistently reported that they were happy within this learning environment. With meticulous planning, it is possible to motivate students to learn for mastery and acquire expertise by employing oral exams as the pivotal assessment strategy in an online course, thereby also making academic dishonesty almost irrelevant.
- Published
- 2023
48. Automatic Item Generation for Non-Verbal Reasoning Items
- Author
- Ayfer Sayin, Sabiha Bozdag, and Mark J. Gierl
- Abstract
The purpose of this study is to generate non-verbal items for a visual reasoning test using template-based automatic item generation (AIG). The research followed the three stages of template-based AIG. An item from the 2016 4th-grade entrance exam of the Science and Art Center (known as BILSEM) was chosen as the parent item. A cognitive model and an item model were developed for non-verbal reasoning, and items were then generated using computer algorithms. The first item model yielded 112 items; the second yielded 1,728. The generated items were evaluated by subject matter experts (SMEs), who indicated that the items met the criteria of a single correct answer, single content and behavior, non-trivial content, and homogeneous choices. SME judgements also indicated that the items vary in difficulty. The results demonstrate the feasibility of AIG for creating an extensive repository of non-verbal visual reasoning items.
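Template-based AIG works by fixing an item model and letting algorithmic substitution of slot values enumerate variants. The toy sketch below shows the mechanism; the template, slots, and values are invented, not the BILSEM item model.

```python
from itertools import product

# A toy template-based AIG sketch: an item model with slots, each slot
# drawing from a small value set; every combination yields one item.
template = ("Which figure completes the pattern: a {shape} rotated "
            "{angle} degrees, {count} times?")
slots = {
    "shape": ["triangle", "square", "pentagon"],
    "angle": [45, 90, 120],
    "count": [2, 3],
}

items = [
    template.format(shape=s, angle=a, count=c)
    for s, a, c in product(slots["shape"], slots["angle"], slots["count"])
]
print(len(items))  # 3 * 3 * 2 = 18 generated items
```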
- Published
- 2023
49. A Proposed Taxonomy of Test-Taking Action and Item Format in Written Receptive Vocabulary Testing
- Author
- Jeffrey Martin
- Abstract
The functioning of a vocabulary testing instrument rests in part on the test-taking actions that item format makes possible for examinees, an aspect of test development that warrants consideration in second-language vocabulary research. For example, although iterations of the written receptive vocabulary levels test (VLT) have integrated improvements in lexis sampling and distractor creation (i.e., Beglar & Hunt, 1999; Nation, 1983, 1990; Schmitt et al., 2001; Webb et al., 2017), its clustered form-meaning matching format has remained fundamentally unchanged. This study qualitatively explores the influence of this item format on the test-taking actions observed during the updated vocabulary levels test (UVLT; Webb et al., 2017). Data from a think-aloud protocol and retrospective interviewing indicated the predominant use of test-taking strategies, such as bidirectional matching and elimination of cluster options, and that these actions enabled correct responses for clusters of target vocabulary about which the test taker demonstrated partial or even no knowledge. This evidence at the interface of test taker and test draws attention to the interconnection between estimating learners' vocabulary knowledge and the action possibilities provided by item format on vocabulary tests. Such affordances are hierarchically structured in a proposed "Taxonomy of Test-taking Actions Afforded by Receptive Vocabulary Test Format" as a heuristic to evaluate the influences of test format on written receptive vocabulary assessment.
- Published
- 2022
50. Influence of Selected-Response Format Variants on Test Characteristics and Test-Taking Effort: An Empirical Study. Research Report. ETS RR-22-01
- Author
- Guo, Hongwen, Rios, Joseph A., Ling, Guangming, Wang, Zhen, Gu, Lin, Yang, Zhitong, and Liu, Lydia O.
- Abstract
Different variants of the selected-response (SR) item type have been developed for various reasons (e.g., simulating realistic situations, examining critical-thinking and/or problem-solving skills). These variants are generally more complex than traditional multiple-choice (MC) items, which may make them more challenging for test takers and thus discourage test engagement on low-stakes assessments. Low test-taking effort has been shown to distort test scores and thereby diminish score validity. We used data from a large-scale assessment to investigate how variants of the SR item format may impact test properties and test engagement. Results show that the studied SR variants were generally harder and more time-consuming than the traditional MC format, but they did not show a negative impact on test-taking effort. However, item position had a dominant, cumulative influence on nonresponse rates and rapid-guessing rates, even though the effect sizes in the studied data were relatively small.
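Rapid-guessing rates of the kind reported here are typically operationalized with a response-time threshold per item; the sketch below illustrates one common rule. The threshold, data shapes, and values are invented assumptions, not ETS's procedure.

```python
import numpy as np

# Flag a response as a rapid guess when its response time falls below
# an item-level threshold (rule and numbers are illustrative).
rng = np.random.default_rng(3)
response_times = rng.lognormal(mean=3.0, sigma=0.6, size=(500, 30))  # seconds
thresholds = np.full(30, 5.0)  # e.g., 5 seconds per item

rapid = response_times < thresholds
rapid_guess_rate_by_item = rapid.mean(axis=0)
print(rapid_guess_rate_by_item.round(3))
```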
- Published
- 2022