3,126 results for "Test format"
Search Results
2. Evaluating the Evaluators: A Comparative Study of AI and Teacher Assessments in Higher Education
- Author
- Tugra Karademir Coskun and Ayfer Alper
- Abstract
This study examines potential differences between teacher evaluations and artificial intelligence (AI) tool-based assessment systems in university examinations. The research evaluated a wide spectrum of exams, including numerical and verbal course exams, exams with different assessment styles (project, test exam, traditional exam), and both theoretical and practical course exams. These exams were selected using a criterion sampling method and analyzed with Bland-Altman Analysis and Intraclass Correlation Coefficient (ICC) analysis to assess how AI and teacher evaluations compared across a broad range. The findings indicate a high level of agreement between total exam scores assessed by AI and by teachers, but medium consistency in the evaluation of visually based exams, low consistency in video exams, high consistency in test exams, and low consistency in traditional exams. This research is crucial because it identifies specific areas where artificial intelligence can complement educational assessment and areas where it needs improvement, guiding the development of more accurate and fair evaluation tools. [A minimal sketch of the Bland-Altman calculation follows this entry.]
- Published
- 2024
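As a rough companion to the agreement analysis described above, here is a minimal Python sketch of the Bland-Altman statistics; the paired scores and the 0-100 scale are invented for illustration, and the study's ICC analysis is not reproduced.

```python
import numpy as np

# Hypothetical paired scores (0-100) for the same set of exams.
ai = np.array([78, 65, 90, 55, 83, 71, 60, 88], dtype=float)
teacher = np.array([75, 70, 92, 50, 80, 74, 58, 85], dtype=float)

# Bland-Altman statistics: mean difference (bias) and 95% limits of agreement.
diff = ai - teacher
bias = diff.mean()
half_width = 1.96 * diff.std(ddof=1)
print(f"bias = {bias:.2f}, LoA = [{bias - half_width:.2f}, {bias + half_width:.2f}]")
```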
3. Exploring Speededness in Pre-Reform GCSEs (2009 to 2016)
- Author
- Emma Walland
- Abstract
GCSE examinations (taken by students aged 16 years in England) are not intended to be speeded (i.e. to be partly a test of how quickly students can answer questions). However, there has been little research exploring this. The aim of this research was to explore the speededness of past GCSE written examinations, using only the data from scored responses to items from a sample of 340 GCSE components. Speededness was calculated as the average (mean) percentage marks lost from the longest string of unanswered items at the end of each student's examination paper. The potential impact of student ability on examination completion patterns was taken into account. The data suggested that most GCSEs analysed were unlikely to have been speeded. This method of exploring the speededness of exams using only scored responses has potential (although there are limitations), and it can flag potentially problematic components for further investigation.
- Published
- 2024
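The speededness index defined in this abstract (mean percentage of marks lost to the trailing run of unanswered items) is easy to sketch; the four-item paper and scores below are hypothetical.

```python
import numpy as np

def speededness(score_matrix, max_marks):
    """Mean percentage of marks lost to the trailing string of unanswered
    items at the end of each paper; np.nan marks an unanswered item."""
    losses = []
    for row in score_matrix:
        n_trailing = 0
        for value in row[::-1]:               # walk backwards from the last item
            if np.isnan(value):
                n_trailing += 1
            else:
                break
        lost = max_marks[len(row) - n_trailing:].sum()
        losses.append(100 * lost / max_marks.sum())
    return float(np.mean(losses))

# Hypothetical paper: four items worth 2, 4, 6, and 8 marks.
max_marks = np.array([2, 4, 6, 8])
scores = np.array([[2.0, 3.0, 5.0, 7.0],          # finished the paper
                   [1.0, 4.0, np.nan, np.nan]])   # last two items unanswered
print(speededness(scores, max_marks))              # (0 + 70) / 2 = 35.0
```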
4. The Effects of Reverse Items on Psychometric Properties and Respondents' Scale Scores According to Different Item Reversal Strategies
- Author
- Mustafa Ilhan, Nese Güler, Gülsen Tasdelen Teker, and Ömer Ergenekon
- Abstract
This study aimed to examine the effects of reverse items created with different strategies on psychometric properties and respondents' scale scores. To this end, three versions of a 10-item scale were developed: the first form consisted of 10 positive items (Form-P), while the other two forms each contained five positive and five reverse items. The reverse items in the second and third forms were crafted using antonyms (Form-RA) and negations (Form-RN), respectively. Based on the results, Form-P was unidimensional, while the other forms were two-dimensional. Moreover, although the reliability coefficients of all forms were above 0.80, the lowest was obtained for Form-RN. There were strong positive relationships between students' scores on the three scale forms; however, the weakest was between Form-P and Form-RN. Finally, there was a significant difference between students' mean scores on Form-RN and the other two versions, but the effect size of this difference was small. In conclusion, these results indicate that different types of reverse items influence psychometric properties and respondents' scale scores differently. [A minimal reliability-coefficient sketch follows this entry.]
- Published
- 2024
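Since the study compares reliability coefficients across the three forms, a minimal Cronbach's alpha sketch may help; the simulated 5-point responses below are an assumption, not the study's data.

```python
import numpy as np

def cronbach_alpha(items):
    """items: respondents x items matrix, reverse items already recoded."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Simulated 5-point Likert responses to a 10-item form.
rng = np.random.default_rng(0)
base = rng.integers(1, 6, size=(50, 1))                        # person effect
form = np.clip(base + rng.integers(-1, 2, size=(50, 10)), 1, 5)
print(round(cronbach_alpha(form), 2))                          # alpha estimate
```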
5. An Exploratory Criterion Validation of Three Meaning-Recall Vocabulary Test Item Formats
- Author
- Tim Stoeckel and Tomoko Ishii
- Abstract
In an upcoming coverage-comprehension study, we plan to assess learners' meaning-recall knowledge of words as they occur in the study's reading passage. As several meaning-recall test formats exist, the purpose of this small-scale study (N = 10) was to determine which of three formats was most similar to a criterion interview regarding mean score and the consistency of correct/incorrect classifications (match rate, k = 30). In Test 1, the prompt consisted of only the target item, and a written translation of its meaning was elicited. In Test 2, the prompt was a short sentence in which a target item was highlighted, and a written translation of only that target item was requested. In Test 3, the prompt was the same sentence as in Test 2, but the target item was unhighlighted, and participants were requested to translate the entire sentence. Finally, in the criterion interview, participants were asked to demonstrate their understanding of the target items in the same prompt sentences as in Tests 2-3. The results indicated that Test 3 produced a mean score and match rate most similar to the interview, followed by Test 2, with Test 1 being the least similar. The paper discusses several factors explaining differences in test performance that were explored during the interview.
- Published
- 2024
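A minimal sketch of the match-rate statistic used above, with hypothetical correct/incorrect classifications for k = 30 items:

```python
import numpy as np

# Hypothetical correct/incorrect (1/0) classifications on k = 30 items from a
# written meaning-recall test and from the criterion interview.
rng = np.random.default_rng(1)
interview = rng.integers(0, 2, size=30)
test = interview.copy()
flip = rng.choice(30, size=3, replace=False)
test[flip] = 1 - test[flip]        # the two measures disagree on three items

print(np.mean(test == interview))  # match rate = 0.9
```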
6. How Assessment Choice Affects Student Perception and Performance
- Author
- Sanne Unger and Alanna Lecher
- Abstract
This action research project sought to understand how giving students a choice in how to demonstrate mastery of a reading would affect both grades and evaluations of the instructor, given that assessment choice might increase student engagement. We examined the effect of student assessment choice on grades and course evaluations, the two assessment options being a reading quiz or a two-minute video recording of themselves "recalling" what they could about the text (a "recall"). In Year 1, students were required to complete a multiple-choice reading quiz, with the option to complete a recall video for the opportunity to revise essays (revision tokens). In Year 2, students were allowed to choose whether they submitted a recall video or a quiz, with the option to submit the other to earn revision tokens. The data included student submissions, grades, and course evaluations. Students completed more recall assignments when the recall replaced the quiz requirement than during Year 1 when recalls only earned the students revision tokens. In addition, the instances of students completing both the quiz and recall increased in Year 2. Average course grades did not change from year to year, but students with higher course grades were significantly more likely to have completed recalls in both years. Student evaluations of the instructor were significantly higher for "responses to diverse learning styles" in Year 2 compared to Year 1. The study shows that letting students choose the assessment type they prefer can lead to increased student engagement and improve their perception of the instructor's responsiveness to learning styles, without causing grade inflation.
- Published
- 2024
7. New York State Testing Program: Grades 6 and 7 English Language Arts Paper-Based Tests. Teacher's Directions. Spring 2024
- Author
- New York State Education Department and NWEA
- Abstract
The New York State Education Department (NYSED) has a partnership with NWEA for the development of the 2024 Grades 3-8 English Language Arts Tests. Teachers from across the State work with NYSED in a variety of activities to ensure the validity and reliability of the New York State Testing Program (NYSTP). The 2024 Grades 6 and 7 English Language Arts Tests are administered in two sessions on two consecutive school days. Students are asked to demonstrate their knowledge and skills in the areas of reading and writing. Students will have as much time as they need each day to answer the questions in the test sessions within the confines of the regular school day. For Grades 6 and 7, the tests consist of multiple-choice (1-credit) and constructed-response (2- and 4-credit) questions. Each multiple-choice question is followed by four choices, one of which is the correct answer. Students record their multiple-choice responses on a separate answer sheet. For Session 1, students will write their responses to the constructed-response questions in their separate answer booklets. For Session 2, students will write their responses to these questions directly in their test booklets. By following the guidelines in this document, teachers help ensure that the test is valid, reliable, and equitable for all students. A series of instructions helps teachers organize the materials and the testing schedule.
- Published
- 2024
8. New York State Testing Program: English Language Arts, Mathematics, and Science Tests. School Administrator's Manual, 2024. Grades 3-8
- Author
- New York State Education Department and NWEA
- Abstract
The instructions in this manual explain the responsibilities of school administrators for the New York State Testing Program (NYSTP) Grades 3-8 English Language Arts, Mathematics, and Grades 5 & 8 Science Tests. School administrators must be thoroughly familiar with the contents of the manual, and the policies and procedures must be followed as written so that testing conditions are uniform statewide. The appendices include: (1) Certificates; (2) A tracking log of secure materials; (3) Procedures for testing students with disabilities; (4) Testing accommodation information; (5) Documents to assist with material return; (6) Contact information; and (7) Information on the Nextera™ Administration System for computer-based testing. This "School Administrator's Manual" serves to guide school administrators in general test administration activities for both paper- and computer-based testing.
- Published
- 2024
9. Measuring Mathematical Skills in Early Childhood: A Systematic Review of the Psychometric Properties of Early Maths Assessments and Screeners
- Author
- Laura A. Outhwaite, Pirjo Aunio, Jaimie Ka Yu Leung, and Jo Van Herwegen
- Abstract
Successful early mathematical development is vital to children's later education, employment, and wellbeing outcomes. However, established measurement tools are infrequently used to (i) assess children's mathematical skills and (ii) identify children with, or at risk of, mathematical learning difficulties. In response, this pre-registered systematic review aimed to provide an overview of measurement tools that have been evaluated for their psychometric properties for measuring the mathematical skills of children aged 0-8 years. The reliability and validity evidence reported for the identified measurement tools was then synthesised, including in relation to common acceptability thresholds. Overall, 41 mathematical assessments and 25 screeners were identified. Our study revealed five main findings. Firstly, most measurement tools were categorised as child-direct measures delivered individually, in a paper-based format, by a trained assessor. Secondly, the majority of the identified measurement tools have not been evaluated for the aspects of reliability and validity most relevant to education measures, and only 15 measurement tools met the common acceptability thresholds for more than two areas of psychometric evidence. Thirdly, only four screeners demonstrated an acceptable ability to distinguish between typically developing children and those with, or at risk of, mathematical learning difficulties. Fourthly, only one mathematical assessment and one screener met the common acceptability threshold for predictive validity. Finally, only 11 mathematical assessments and one screener were found to align concurrently with other validated measurement tools. Building on this evidence and improving measurement quality is vital for raising methodological standards in mathematical learning and development research.
- Published
- 2024
10. Digital SAT® Pilot Predictive Validity Study -- A Comprehensive Analysis of First-Year College Outcomes
- Author
- College Board, Paul A. Westrick, Jessica P. Marini, Linda Young, Helen Ng, and Emily J. Shaw
- Abstract
This pilot study examines digital SAT® score relationships with first-year college performance. Results show that digital SAT scores predict college performance as well as paper and pencil SAT scores, and that digital SAT scores meaningfully improve our understanding of a student's readiness for college above high school grade point average (HSGPA) alone. In this study, there was a 22% improvement in the prediction of college performance when the SAT and HSGPA were used together, instead of using the HSGPA alone. For STEM majors, the added SAT value was 38%. Similar results were found when the outcome examined was course credits earned in the first year, a metric for understanding student progress toward degree completion. Findings from this study show that the SAT remains a powerful tool for understanding students' readiness for college, for course placement and academic major field decisions, scholarship and honors program decisions, and identifying students who may need academic support.
- Published
- 2023
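One common way to quantify this kind of incremental prediction is to compare a baseline model with an augmented one; the sketch below uses R-squared on simulated data (the study's exact metric and data are not public, so everything here is an assumption):

```python
import numpy as np

# Simulated standardized HSGPA, SAT, and first-year GPA (FYGPA).
rng = np.random.default_rng(2)
n = 500
hsgpa = rng.normal(size=n)
sat = 0.6 * hsgpa + 0.8 * rng.normal(size=n)
fygpa = 0.5 * hsgpa + 0.3 * sat + rng.normal(size=n)

def r_squared(predictors, y):
    """R-squared of an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - (y - X @ beta).var() / y.var()

r2_base = r_squared([hsgpa], fygpa)          # HSGPA alone
r2_full = r_squared([hsgpa, sat], fygpa)     # HSGPA + SAT
print(f"relative improvement: {100 * (r2_full - r2_base) / r2_base:.0f}%")
```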
11. Checkbox Grading of Handwritten Mathematics Exams with Multiple Assessors: How Do Students React to the Resulting Atomic Feedback? A Mixed-Method Study
- Author
- Filip Moons, Paola Iannone, and Ellen Vandervieren
- Abstract
Handwritten tasks are better suited than digital ones to assess higher-order mathematics skills, as students can express themselves more freely. However, maintaining reliability and providing feedback can be challenging when assessing high-stakes, handwritten mathematics exams involving multiple assessors. This paper discusses a new semi-automated grading approach called 'checkbox grading'. Checkbox grading gives each assessor a list of checkboxes consisting of feedback items for each task. The assessor then ticks those feedback items which apply to the student's solution. Dependencies between the checkboxes can be set to ensure all assessors take the same route through the grading scheme. The system then automatically calculates the grade and provides atomic feedback to the student, giving detailed insight into what went wrong and how the grade was obtained. Atomic feedback consists of a set of format requirements for mathematical feedback items, which has been shown to increase feedback's reusability. Checkbox grading was tested during the final high school mathematics exam (grade 12) organised by the Flemish Exam Commission, with 60 students and 10 assessors. This paper focuses on students' perceptions of the checkbox grading feedback they received and how easily they interpreted it. After the exam was graded, all students were sent an online questionnaire, including their personalised exam feedback. The questionnaire was filled in by 36 students, and 4 of them participated in semi-structured interviews. Findings suggest that students could interpret the feedback from checkbox grading well, with no correlation between students' exam scores and feedback understanding. We therefore suggest that checkbox grading is an effective way to provide feedback, including for students with shaky subject-matter knowledge. [A minimal sketch of the checkbox data model follows this entry.]
- Published
- 2024
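A minimal sketch of a checkbox-grading data model under our own assumptions (feedback items carry mark deductions, and a `requires` field encodes the dependencies the abstract mentions); the paper's actual scheme may differ:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FeedbackItem:
    text: str
    deduction: float = 0.0
    requires: Optional[str] = None    # checkbox that must also be ticked

ITEMS = {
    "sign_error": FeedbackItem("Sign error when isolating x", 0.5),
    "no_check": FeedbackItem("Solution not verified in the original equation", 0.5),
    "follow_through": FeedbackItem("Correct follow-through after the sign error",
                                   -0.5, requires="sign_error"),
}

def grade(ticked, max_marks=4.0):
    """Validate dependencies, then derive the grade and the atomic feedback."""
    for item_id in ticked:
        dep = ITEMS[item_id].requires
        if dep is not None and dep not in ticked:
            raise ValueError(f"{item_id!r} requires {dep!r} to be ticked")
    score = max(max_marks - sum(ITEMS[i].deduction for i in ticked), 0.0)
    return score, [ITEMS[i].text for i in ticked]

print(grade({"sign_error", "follow_through"}))   # deductions cancel: 4.0
```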
12. The Structural and Convergent Validity of the FMS[superscript 2] Assessment Tool among 8- to 12-Year-Old Children
- Author
- Nathan Gavigan, Sarahjane Belton, Una Britton, Shane Dalton, and Johann Issartel
- Abstract
Although there is a plethora of tools available to assess children's movement competence (MC), the literature suggests that many have significant limitations (e.g. not being practical for use in many 'real-world' settings). The FMS[superscript 2] assessment tool has recently been developed as a targeted solution to many of the existing barriers preventing practitioners from utilising MC assessments. The aim of this study was to investigate the structural and convergent validity of this new tool among 8- to 12-year-old Irish primary school children. As part of this study, 102 children (56.8% female, mean age = 9.8 years) were assessed using the FMS[superscript 2], the Test of Gross Motor Development (3rd edition) (TGMD-3) (short version) and the Functional Movement Screen™ (FMS™). Structural validity was assessed using confirmatory factor analysis (CFA). The convergent validity between the FMS[superscript 2], the TGMD-3 (short version) and the FMS™ was investigated using the Pearson product-moment correlation coefficient. Results of CFA for the FMS[superscript 2] indicate a good model fit, supporting a three-factor structure (locomotor, object manipulation, and stability). Additional findings indicate a moderate, positive correlation between the FMS[superscript 2] and the TGMD-3 (short version) (r = 0.66), with a low, positive correlation between the FMS[superscript 2] and the FMS™ (r = 0.48). This study presents the first preliminary findings to suggest that the FMS[superscript 2] may be a versatile, time-efficient, and ecologically valid tool to measure children's MC in multiple settings (e.g. research, education, sport, athletic therapy, and physiotherapy). Future research should continue to implement this solution, consolidate the existing validity findings with a larger and more diverse sample, and further explore the feasibility of the tool in 'real-world' settings.
- Published
- 2024
13. Evaluating Psychometric Differences between Fast versus Slow Responses on Rating Scale Items
- Author
- Nana Kim and Daniel M. Bolt
- Abstract
Some previous studies suggest that response times (RTs) on rating scale items can be informative about the content trait, but a more recent study suggests they may also be reflective of response styles. The latter result raises questions about the possible consideration of RTs for content trait estimation, as response styles are generally viewed as nuisance dimensions in the measurement of noncognitive constructs. In this article, we extend previous work exploring the simultaneous relevance of content and response style traits on RTs in self-report rating scale measurement by examining psychometric differences related to fast versus slow item responses. Following a parallel methodology applied with cognitive measures, we provide empirical illustrations of how RTs appear to be simultaneously reflective of both content and response style traits. Our results demonstrate that respondents may exhibit different response behaviors for fast versus slow responses and that both the content trait and response styles are relevant to such heterogeneity. These findings suggest that using RTs as a basis for improving the estimation of noncognitive constructs likely requires simultaneously attending to the effects of response styles.
- Published
- 2024
14. Examining Adaptations in Study Time Allocation and Restudy Selection as a Function of Expected Test Format
- Author
- Skylar J. Laursen, Dorina Sluka, and Chris M. Fiacconi
- Abstract
Previous literature suggests learners can adjust their encoding strategies to match the demands of the expected test format. However, it is unclear whether other forms of metacognitive control, namely, study time allocation and restudy selection, are also sensitive to expected test format. Across four experiments we examined whether learners qualitatively adjust their allocation of study time (Experiment 1) and restudy selections (Experiments 2a, 2b, and 3) when expecting a more difficult generative memory test (i.e., cued-recall) as compared to a less difficult non-generative memory test (i.e., forced-choice recognition). Counter to our predictions, we found little evidence that learners shift their study time allocation and restudy selection choices toward easier material when expecting a relatively more difficult cued recall test, even after acquiring experience with each test format. Instead, based on exploratory analyses conducted post-hoc, learners appeared to rely heavily on the success with which they retrieved associated studied information at the time that restudy selections were solicited. Moreover, counter to some extant models of self-regulated learning, learners tended to first choose difficult rather than easy items when making their restudy selections, regardless of expected test format. Together, these novel findings place new constraints on our current understanding of learners' metacognitive sensitivity to expected test format, and have important implications for current theoretical accounts of self-regulated learning.
- Published
- 2024
15. An Experimental Comparison of Multiple-Choice and Short-Answer Questions on a High-Stakes Test for Medical Students
- Author
- Janet Mee, Ravi Pandian, Justin Wolczynski, Amy Morales, Miguel Paniagua, Polina Harik, Peter Baldwin, and Brian E. Clauser
- Abstract
Recent advances in automated scoring technology have made it practical to replace multiple-choice questions (MCQs) with short-answer questions (SAQs) in large-scale, high-stakes assessments. However, most previous research comparing these formats has used small examinee samples testing under low-stakes conditions. Additionally, previous studies have not reported on the time required to respond to the two item types. This study compares the difficulty, discrimination, and time requirements of the two formats when examinees responded as part of a large-scale, high-stakes assessment. Seventy-one MCQs were converted to SAQs. These matched items were randomly assigned to examinees completing a high-stakes assessment of internal medicine; no examinee saw the same item in both formats. Items administered in the SAQ format were generally more difficult than items in the MCQ format. The discrimination index for SAQs was modestly higher than that for MCQs, and response times were substantially higher for SAQs. These results support the interchangeability of MCQs and SAQs. When it is important that the examinee generate the response rather than select it, SAQs may be preferred. The results relating to difficulty and discrimination reported in this paper are consistent with those of previous studies. The results on the relative time requirements suggest that with a fixed testing time fewer SAQs can be administered; this limitation may more than offset the higher discrimination reported for SAQs. We additionally examine the extent to which increased difficulty may directly impact the discrimination of SAQs. [A minimal difficulty-and-discrimination sketch follows this entry.]
- Published
- 2024
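The classical item statistics named above are straightforward to compute; this sketch simulates a hypothetical 0/1 score matrix and derives difficulty (proportion correct) and discrimination (corrected item-total correlation):

```python
import numpy as np

# Simulate a hypothetical 0/1 score matrix: 200 examinees x 10 items.
rng = np.random.default_rng(3)
ability = rng.normal(size=200)
difficulty = np.linspace(-1, 1, 10)
prob = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
scores = (rng.random((200, 10)) < prob).astype(int)

p_values = scores.mean(axis=0)        # classical difficulty: proportion correct
rest = scores.sum(axis=1, keepdims=True) - scores   # corrected (rest) totals
discrimination = [np.corrcoef(scores[:, j], rest[:, j])[0, 1] for j in range(10)]
print(np.round(p_values, 2))
print(np.round(discrimination, 2))
```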
16. Cheating Automatic Short Answer Grading with the Adversarial Usage of Adjectives and Adverbs
- Author
- Anna Filighera, Sebastian Ochs, Tim Steuer, and Thomas Tregel
- Abstract
Automatic grading models are valued for the time and effort saved during the instruction of large student bodies. Especially with the increasing digitization of education and interest in large-scale standardized testing, the popularity of automatic grading has risen to the point where commercial solutions are widely available and used. However, for short answer formats, automatic grading is challenging due to natural language ambiguity and versatility. While automatic short answer grading models are beginning to compare to human performance on some datasets, their robustness, especially to adversarially manipulated data, is questionable. Exploitable vulnerabilities in grading models can have far-reaching consequences ranging from cheating students receiving undeserved credit to undermining automatic grading altogether--even when most predictions are valid. In this paper, we devise a black-box adversarial attack tailored to the educational short answer grading scenario to investigate the grading models' robustness. In our attack, we insert adjectives and adverbs into natural places of incorrect student answers, fooling the model into predicting them as correct. We observed a loss of prediction accuracy between 10 and 22 percentage points using the state-of-the-art models BERT and T5. While our attack made answers appear less natural to humans in our experiments, it did not significantly increase the graders' suspicions of cheating. Based on our experiments, we provide recommendations for utilizing automatic grading systems more safely in practice.
- Published
- 2024
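A minimal sketch of the attack idea described above, under our own simplifications: single adverbs are inserted at each word boundary of an incorrect answer, and a stand-in grader plays the role of the black-box model (the paper's insertion strategy and its BERT/T5 models are more sophisticated):

```python
import itertools

ADVERBS = ["basically", "actually", "really", "essentially"]

def adversarial_probe(answer, grade_fn):
    """Insert single adverbs at each word boundary of an incorrect answer and
    return the first variant the black-box grader accepts as correct."""
    words = answer.split()
    for adverb, pos in itertools.product(ADVERBS, range(len(words) + 1)):
        candidate = " ".join(words[:pos] + [adverb] + words[pos:])
        if grade_fn(candidate):
            return candidate
    return None

# Stand-in for a fine-tuned grading model; a real attack would query the
# actual model. This toy grader exists only to make the sketch executable.
toy_grader = lambda text: "really" in text
print(adversarial_probe("the stack grows downward", toy_grader))
```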
17. Reducing Workload in Short Answer Grading Using Machine Learning
- Author
- Rebecka Weegar and Peter Idestam-Almquist
- Abstract
Machine learning methods can be used to reduce the manual workload in exam grading, making it possible for teachers to spend more time on other tasks. However, when it comes to grading exams, fully eliminating manual work is not yet possible even with very accurate automated grading, as any grading mistakes could have significant consequences for the students. Here, the evaluation of an automated grading approach is therefore extended from measuring workload in relation to the accuracy of automated grading, to also measuring the overall workload required to correctly grade a full exam, with and without the support of machine learning. The evaluation was performed during an introductory computer science course with over 400 students. The exam consisted of 64 questions with relatively short answers and a two-step approach for automated grading was applied. First, a subset of answers to the exam questions was manually graded and next used as training data for machine learning models classifying the remaining answers. A number of different strategies for how to select which answers to include in the training data were evaluated. The time spent on different grading actions was measured along with the reduction of effort using clustering of answers and automated scoring. Compared to fully manual grading, the overall reduction of workload was substantial--between 64% and 74%--even with a complete manual review of all classifier output to ensure a fair grading.
- Published
- 2024
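A minimal sketch of the two-step idea: hand-graded answers train a text classifier whose predictions are then routed to manual review. TF-IDF plus logistic regression is our assumption here; the paper's models may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step 1: a manually graded subset (labels assigned by the teacher).
graded_answers = ["a stack is last in first out", "a stack is first in first out"]
graded_labels = [1, 0]   # 1 = correct

# Step 2: train a classifier and score the remaining answers.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(graded_answers, graded_labels)

ungraded = ["last in, first out structure", "elements leave in arrival order"]
for answer, label in zip(ungraded, model.predict(ungraded)):
    print(label, answer)   # suggestions only; the teacher reviews every one
```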
18. Impact of Different Practice Testing Methods on Learning Outcomes
- Author
- Yavuz Akbulut
- Abstract
The testing effect refers to the gains in learning and retention that result from taking practice tests before the final test. Understanding the conditions under which practice tests improve learning is crucial, so four experiments were conducted with a total of 438 undergraduate students in Turkey. In the first study, students who took graded practice tests outperformed those who took them as ungraded practice. In the second study, students who took short-answer questions before the first exam and multiple-choice questions before the second exam scored higher on the second exam. In the third study, multiple-choice, short-answer and hybrid questions produced similar learning gains. In the fourth study, students who received detailed feedback immediately after class performed similarly to those who received feedback at the beginning of the next class. The results suggested a general contribution of graded practice tests; however, neither the type of questions nor the timing of feedback predicted learning outcomes.
- Published
- 2024
19. Evaluating Equating Methods for Varying Levels of Form Difference
- Author
- Ting Sun and Stella Yun Kim
- Abstract
Equating is a statistical procedure used to adjust for differences in form difficulty so that scores on those forms can be used and interpreted comparably. In practice, however, equating methods are often implemented without considering the extent to which two forms differ in difficulty. This study examines the effect of the magnitude of the form difficulty difference on equating results under random group (RG) and common-item nonequivalent group (CINEG) designs. Specifically, it evaluates the performance of six equating methods under a set of simulation conditions including varying levels of form difference. Results revealed that, under the RG design, mean equating was the most accurate method when the form difference was absent or small, whereas equipercentile equating was the most accurate when the difficulty difference was medium or large. Under the CINEG design, Tucker linear equating was the most accurate method when the difficulty difference was medium or small, and either chained equipercentile or frequency estimation is preferred with a large difficulty difference. The study provides practitioners with evidence-based guidance on the choice of equating methods for varying levels of form difference. Because the condition of no form difficulty difference is also included, it informs testing companies of appropriate equating methods when two forms are similar in difficulty. [A minimal sketch of mean and equipercentile equating follows this entry.]
- Published
- 2024
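For readers unfamiliar with the two simplest methods named above, here is a minimal sketch of mean and equipercentile equating under a random-groups design, on simulated scores:

```python
import numpy as np

# Hypothetical raw scores on two 50-item forms from randomly equivalent groups.
rng = np.random.default_rng(4)
form_x = rng.binomial(50, 0.55, size=2000)   # the slightly easier form
form_y = rng.binomial(50, 0.50, size=2000)

def mean_equate(x):
    """Shift an X score by the difference between the form means."""
    return x - (form_x.mean() - form_y.mean())

def equipercentile_equate(x):
    """Map an X score to the Y score at the same percentile rank."""
    pr = 100 * np.mean(form_x <= x)
    return np.percentile(form_y, pr)

print(mean_equate(30), equipercentile_equate(30))
```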
20. Exploring Interaction in Video-Call Paired Speaking Tests: A Look at Scores, Language, and Perceptions
- Author
- Hye-won Lee, Andrew Mullooly, Amy Devine, and Evelina Galaczi
- Abstract
In the assessment of second language oral communication, the video-call speaking test has received increasing attention as a test method with higher practicality than its in-person counterpart, but still with broad coverage of the test construct. Previous studies into video-call assessment have focussed on the individual (as opposed to paired or group) interactional format. The current study extends this line of research by focussing on paired speaking interactions, with a specific focus on the construct of interactional competence. A concurrent triangulation design was adopted with the use of both quantitative and qualitative data through recordings and scores of test performances, questionnaires, and focus groups. Findings indicate that video-call paired interactions in the assessment context of interest are largely comparable to in-person interactions in terms of scores, with only small effect-size differences identified. Some differences in turn-taking management and in examiner and test-taker perceptions were also identified. We argue for a more in-depth awareness of the characteristics of video-call speaking in its own right, which can inform both assessment and learning contexts.
- Published
- 2024
21. Delineating Discrepancies between TOEFL PBT and CBT
- Author
- Ahmad Yulianto, Anastasia Pudjitriherwanti, Chevy Kusumah, and Dies Oktavia
- Abstract
The increasing use of computer-based modes in language testing raises concerns over their similarities with and differences from paper-based formats. The present study aimed to delineate discrepancies between the TOEFL PBT and CBT. To that end, a quantitative method was employed to probe score equivalence, the performance of male and female participants, the relationship between completion time and test score, and the test mode's effect on participants' performance. In total, 124 undergraduates aged 19-21 years (M = 20, SD = 0.66) took part in the research. To analyze the data, MANOVA, Pearson correlation, and regression tests were run. The findings revealed that: (1) the PBT and CBT were equivalent in scores; (2) male and female participants' scores were not significantly different; (3) there was a moderately negative correlation between completion time and score; and (4) computer familiarity, habits in using computers, and perceptions toward the CBT did not affect TOEFL performance. For researchers, the implication of this study concerns the interchangeability of the two test modes. For CBT test designers, it concerns the appropriate inclusion of visuals, time-related measurement, and procedures for designing computer-based tests.
- Published
- 2023
22. Do Different Devices Perform Equally Well with Different Numbers of Scale Points and Response Formats? A Test of Measurement Invariance and Reliability
- Author
- Natalja Menold and Vera Toepoel
- Abstract
Research on mixed devices in web surveys is in its infancy. Using a randomized experiment, we investigated device effects (desktop PC, tablet, and mobile phone) for six response formats and four different numbers of scale points. N = 5,077 members of an online access panel participated in the experiment. An exact test of measurement invariance and composite reliability were investigated. The results showed full data comparability across devices and formats, with the exception of the continuous Visual Analog Scale (VAS), but limited comparability across different numbers of scale points. There were device effects on reliability in the interactions with formats and numbers of scale points: the VAS, mobile phones, and five-point scales consistently yielded lower reliability. We suggest technically less demanding implementations as well as a unified design for mixed-device surveys.
- Published
- 2024
23. Eye Movements and Reading Comprehension Performance: Examining the Relationships among Test Format, Working Memory Capacity and Reading Comprehension
- Author
- Corrin Moss, Sharon Kwabi, Scott P. Ardoin, and Katherine S. Binder
- Abstract
The ability to form a mental model of a text is an essential component of successful reading comprehension (RC), and purpose for reading can influence mental model construction. Participants were assigned to one of two conditions during an RC test to alter their purpose for reading: concurrent (texts and questions were presented simultaneously) and sequential (texts were presented first, then questions were shown without text access). Their eye movements were recorded during testing. Working memory capacity (WMC) and centrality of textual information were measured. Participants in the sequential condition had longer first-pass reading times compared to participants in the concurrent condition, while participants in the concurrent condition had longer total processing times per word. In addition, participants with higher WMC had longer total reading times per word. Finally, participants in the sequential condition with higher WMC had longer processing times in central regions. Even among skilled college readers, participants with lower WMC had difficulty adjusting their reading behaviors to meet the task demands such as distinguishing central and peripheral ideas. However, participants with higher WMC increased attention to important text areas. One potential explanation is that participants with higher WMC are better able to construct a coherent mental model of the text, and attending to central text areas is an essential component of mental model formation. Therefore, these results help clarify the relationship between the purpose for reading and mental model development.
- Published
- 2024
24. A Method for Converting 4-Option Multiple-Choice Items to 3-Option Multiple-Choice Items without Re-Pretesting
- Author
- Amanda A. Wolkowitz, Brett Foley, and Jar Zurn
- Abstract
The purpose of this study is to introduce a method for converting scored 4-option multiple-choice (MC) items into scored 3-option MC items without re-pretesting the 3-option items. The study describes a six-step process for achieving this goal. Data from a professional credentialing exam were used, and the method was applied to 24 forms of the exam. The method predicted the rounded passing score with 100% accuracy for all forms. [A toy sketch of one distractor-removal assumption follows this entry.]
- Published
- 2023
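The paper's six-step method is not reproduced here, but one simple assumption sometimes used in this setting (and purely illustrative below) is that examinees who chose the removed distractor redistribute across the remaining options in proportion to those options' popularity:

```python
# Hypothetical response proportions for a 4-option item (key + 3 distractors).
p4 = {"key": 0.55, "d1": 0.20, "d2": 0.15, "d3": 0.10}

# Drop distractor d3 and redistribute its share across the remaining options
# in proportion to their existing popularity (our assumption, for illustration).
remaining = {k: v for k, v in p4.items() if k != "d3"}
total = sum(remaining.values())
p3 = {k: v + p4["d3"] * v / total for k, v in remaining.items()}
print({k: round(v, 3) for k, v in p3.items()})   # predicted 3-option proportions
```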
25. Rajah's Quest: A Gamified Offline Assessment of Least Learned Competency in English 8 in the Post-Pandemic Pedagogy
- Author
- Christine N. Souribio
- Abstract
"Gamification is the process of using game thinking and game dynamics to engage audiences and solve problems", Zichermann (2011). The aim of this research is to facilitate the least-learned competencies in the English subject of students of Grade 8 in Tupi National High School for the academic year 2022-2023. With this, the researcher addressed the gap that had been created on the post-pandemic year through an assessment using a supplementary program which was later tested through a competency-based test (pre-test and post-test) evaluation for its effectivity. A gamified supplementary program was created and utilized by the students in Grade 8 SPED sections. To test the effectivity of the program, the researcher made an experiment with two sets of groups: control group (traditional teaching) and experimental group (traditional with GASP). After conducting the experiment for the supplementary program, a two-sample t-test was performed to compare the scores result between the students' competency-based test. The result of the control group; pretest [mu]=25.74, which is interpreted as learned (see Appendix B), and post-test [mu]=33.32, interpreted as highly learned. While the experimental group pretest [mu]=29.74, also interpreted as learned, and post-test [mu]=50.93, which is interpreted as highly learned. Hence, the result shows that the mean of pretest in both groups showed that there is slight rise of level after the conduct of posttest in each group.
- Published
- 2023
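A minimal sketch of the two-sample t-test the abstract reports, on hypothetical post-test scores (the study's raw data are not available):

```python
import numpy as np
from scipy import stats

# Hypothetical post-test scores for the two groups.
rng = np.random.default_rng(5)
control = rng.normal(33.3, 8.0, size=30)        # traditional teaching
experimental = rng.normal(50.9, 8.0, size=30)   # traditional teaching + GASP

t, p = stats.ttest_ind(experimental, control)
print(f"t = {t:.2f}, p = {p:.4f}")
```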
26. The Correlation among Different Types of Exams in Azerbaijan
- Author
- Gunel A. Alasgarova
- Abstract
It is crucial to examine the alignment of exam results from different organizations to improve the quality of assessment. The research used a document analysis method with recent, publicly available national and international reports addressing the research question. The main question examined was: Which exams have the highest correlation and are most trustworthy in Azerbaijan for short- and long-term outcomes? The data were analyzed to discover statistical comparisons of university admission exams with the 9th and 11th grade SEC exams, school grades, and other assessments. The research shows that the State Examination Center's exams align with its own evaluations and with international assessments (OECD). They can be considered methodologically rigorous, providing a more valid yardstick for measuring student knowledge and achievement. Overall, exams by the SEC had a high correlation coefficient compared to Higher Education Institutions' assessments. As more international students seek to pursue their education in Azerbaijan, these findings can be valuable for their decision-making at the tertiary level.
- Published
- 2023
27. Assessment Literacy Components Predicting EFL Teachers' Job Demand-Resources: A Focus on Burnout and Engagement
- Author
- Behnaz Rastegr and Abbas Ali Zarei
- Abstract
Much research has been done on assessment literacy (AL) components and job demand-resources (JD-R). However, an interdisciplinary look at AL components as predictors of JD-R, and its possible consequences for the engagement and burnout of teachers' assessment performance, has been neglected. To fill this gap, the present study explored this issue in the context of Iran. Through convenience sampling, 146 Iranian EFL teachers were selected to answer questionnaires on AL, JD-R, burnout, and engagement. A series of multiple regression analyses was run to analyze the collected data. The results showed that some components of AL, such as 'test construction', 'administering, rating, and interpreting test', 'psychometric properties of a test', 'using and interpreting statistics', and 'authenticity', were significant predictors of job demand. The results also revealed that alternative and digital-based assessment, recognizing test type, distinction, and function, and authenticity were significant predictors of job resources. Furthermore, test construction, administering, rating, and interpreting test, psychometric properties of a test, and using and interpreting statistics significantly predicted teachers' burnout. In addition, alternative and digital-based assessment, giving feedback in assessment, and ethical and cultural considerations in assessment turned out to significantly predict teachers' engagement. These findings have theoretical and practical implications for stakeholders.
- Published
- 2023
28. Question Format Biases College Students' Metacognitive Judgments for Exam Performance
- Author
- Michael J. McGuire
- Abstract
College students in a lower-division psychology course made metacognitive judgments by predicting and postdicting performance for true-false, multiple-choice, and fill-in-the-blank question sets on each of three exams. This study investigated which question format would result in the most accurate metacognitive judgments. Extending Koriat's (1997) cue-utilization framework to these judgments, each format gave students different cues on which to base judgments. Further, each format has a different probability of a correct guess, which can skew accuracy. Students reported the lowest estimates for fill-in-the-blank questions. Accuracy measured using bias scores showed students' predictions and postdictions were most accurate for multiple-choice items. Accuracy measured using gamma correlations showed students' predictions were most accurate for multiple-choice items and postdictions were most accurate for fill-in-the-blank items. Based on the findings, educators are encouraged to consider what implications question format has for metacognitive processes when testing students on studied material. For researchers, the findings support the use of different accuracy measures to obtain a more detailed understanding of the factors influencing metacognitive judgment accuracy. [A minimal bias-score and gamma sketch follows this entry.]
- Published
- 2023
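Both accuracy measures named above are easy to sketch; the per-student predictions and scores below are hypothetical:

```python
import numpy as np
from itertools import combinations

# Hypothetical per-student predicted and actual exam scores (percent).
predicted = np.array([80, 70, 90, 60, 85])
actual = np.array([75, 72, 80, 50, 88])

# Bias score: mean signed difference (positive = overconfidence).
bias = (predicted - actual).mean()

# Goodman-Kruskal gamma: (C - D) / (C + D) over all pairs, ignoring ties.
c = d = 0
for (p1, a1), (p2, a2) in combinations(zip(predicted, actual), 2):
    s = (p1 - p2) * (a1 - a2)
    c += s > 0
    d += s < 0
gamma = (c - d) / (c + d)
print(f"bias = {bias:.1f}, gamma = {gamma:.2f}")   # bias = 4.0, gamma = 0.80
```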
29. Auto-Scoring Student Responses with Images in Mathematics
- Author
- Sami Baral, Anthony Botelho, Abhishek Santhanam, Ashish Gurung, Li Cheng, and Neil Heffernan
- Abstract
Teachers often rely on the use of a range of open-ended problems to assess students' understanding of mathematical concepts. Beyond traditional conceptions of student open-ended work, commonly in the form of textual short-answer or essay responses, the use of figures, tables, number lines, graphs, and pictographs are other examples of open-ended work common in mathematics. While recent developments in areas of natural language processing and machine learning have led to automated methods to score student open-ended work, these methods have largely been limited to textual answers. Several computer-based learning systems allow students to take pictures of hand-written work and include such images within their answers to open-ended questions. With that, however, there are few-to-no existing solutions that support the auto-scoring of student hand-written or drawn answers to questions. In this work, we build upon an existing method for auto-scoring textual student answers and explore the use of OpenAI/CLIP, a deep learning embedding method designed to represent both images and text, as well as Optical Character Recognition (OCR) to improve model performance. We evaluate the performance of our method on a dataset of student open-responses that contains both text- and image-based responses, and find a reduction of model error in the presence of images when controlling for other answer-level features. [For the complete proceedings, see ED630829.]
- Published
- 2023
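One plausible pipeline matching the abstract's description, sketched under our own assumptions (the model choice, file name, and OCR text are all hypothetical; the paper's actual method may differ):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("student_work.png")   # hypothetical scanned answer
ocr_text = "x = 4"                        # e.g. output of an OCR engine

inputs = processor(text=[ocr_text], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Concatenated features could feed a downstream score regressor.
features = torch.cat([image_emb, text_emb], dim=-1)
print(features.shape)   # (1, 1024) for this base model
```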
30. Use of Technology-Based Assessments: A Systematic Review. Global Education Monitoring Report
- Author
- United Nations Educational, Scientific, and Cultural Organization (UNESCO) (France) and Dandan Chen
- Abstract
Technology-driven shifts have created opportunities to improve efficiency and quality of assessments. Meanwhile, they may have exacerbated underlying socioeconomic issues in relation to educational equity. The increased implementation of technology-based assessments during the COVID-19 pandemic compounds the concern about the digital divide, as digital access, connectivity, and coping strategies vary across the globe. This systematic review was intended to answer how the use of technology-based assessments has affected the education system's functioning, compared to traditional assessments that do not employ any technology solution. It covered 34 countries from 34 full-text sources in English published in 2018-2022. A total of 12 themes emerged corresponding to six hypotheses about technology-based assessments. In summary, when compared with traditional paper-based exams, mixed evidence was found when testing assumptions about technology-based assessments' roles in cheating reduction, learning boost, monitoring support, instructional improvement, and non-teaching workload reduction. Strong supporting evidence was found when testing assumptions about technology-based assessments' higher measurement precision, easier interpretation, higher learner engagement, and more interaction with others at the learning level, in addition to smoother communication with parents at the educating level. Limited but positive evidence at the management level suggested that technology-based assessments are more cost-effective and time-efficient.
- Published
- 2023
31. New York State Testing Program: English Language Arts and Mathematics Tests. School Administrator's Manual, 2023. Grades 3-8
- Author
- New York State Education Department and Questar Assessment Inc.
- Abstract
The instructions in this manual explain the responsibilities of school administrators for the New York State Testing Program (NYSTP) Grades 3-8 English Language Arts and Mathematics Tests. School administrators must be thoroughly familiar with the contents of the manual, and the policies and procedures must be followed as written so that testing conditions are uniform statewide. The appendices include: (1) Certificates; (2) A tracking log of secure materials; (3) Procedures for testing students with disabilities; (4) Testing accommodation information; (5) Documents to assist with material return; (6) Contact information; and (7) Information on the Nextera™ Administration System for computer-based testing. This School Administrator's Manual serves to guide school administrators in general test administration activities for both paper- and computer-based testing. [For the 2022 Manual for Computer-Based Field Testing, see ED628919. For the 2022 Manual for Paper-Based Field Testing, see ED628920.]
- Published
- 2023
32. New York State Testing Program: 2023 Elementary-Level (Grade 5) and Intermediate-Level (Grade 8) Science Field Tests. Teacher's Directions for Computer-Based Field Testing. May 15-June 2, 2023
- Author
- New York State Education Department and Questar Assessment Inc.
- Abstract
The New York State Education Department (NYSED) has a partnership with Questar Assessment Inc. (Questar) for the online delivery of the 2023 Elementary-Level (Grade 5) and Intermediate-Level (Grade 8) Science Computer-Based Field Tests. Teachers from across the State work with NYSED in a variety of activities to ensure the validity and reliability of the New York State Testing Program (NYSTP). The guidelines in this document help ensure that the field tests are valid, reliable, and equitable for all students. A series of instructions helps to follow the steps necessary for administering the computer-based field tests within the field test window. [For the 2022 manual, see ED628890.]
- Published
- 2023
33. EFL Teachers' Knowledge, Beliefs, and Practices Regarding Fairness and Justice in Technology-Enhanced Classroom Assessment: A Duoethnography
- Author
- Teymour Rahmati and Musa Nushi
- Abstract
Drawing on duoethnography, the teacher researchers in the present study interacted with the relevant literature, engaged in dialogs, and shared artifacts to examine their knowledge, beliefs, and practices regarding fairness and justice considerations in technology-enhanced language classroom assessment. Under the domain of knowledge, they conceptualized fairness and justice and identified their components. Within beliefs, the difference between high-stakes and low-stakes assessments, the significance of students' perceptions, and the role of computer literacy in relation to fairness and justice in technology-enhanced classroom assessment were debated. To operationalize their knowledge and beliefs, the researchers inspected their assessment practices during and following COVID-19. They agreed that fairness was distinct from justice in that the former pertained to test internal characteristics and its administration procedures while the latter referred to test external consequences at a broader social level. They believed that fairness and justice were equally important in high-stakes and low-stakes assessments, and students' perceptions were valuable sources of feedback regarding fair and just classroom assessments. Moreover, the teachers argued that computer literacy cannot yet be considered an aspect of language ability. Finally, it was revealed that although their practice regarding fairness and justice was affected by the pandemic, they learned valuable lessons (e.g., combining online and paper assessment modalities and giving oral exams) in this respect for the future. The findings imply that language teachers should theoretically adopt a clear conception of fairness and justice while being practically prepared for future developments (e.g., technological advances) and unexpected circumstances (e.g., a pandemic).
- Published
- 2023
34. Historical Sources in Exam Tests: An Analysis of School Leaving Exam Tasks in Terms of the Use of Primary Sources
- Author
- László Kojanitz
- Abstract
In 2005 the Hungarian school-leaving examination system underwent a significant transformation. In the case of history, the aim was to give a greater role to the development of students' knowledge-acquisition and source-analysis skills by focusing more on students' work with historical sources in class. However, it was clear that achieving these goals would also depend on the new exam tasks, since these determine whether the reform can bring about real change. I therefore carefully examined the tasks containing primary sources from the past fifteen years of exams. I wanted to give an accurate picture of which types of tasks were most frequent and how they could be assessed against the original objectives of the reform and the competency requirements of the school-leaving examination. Based on the conclusions drawn from the investigation, I formulate proposals for changing the composition of the exam tasks and for the preparation of exam-task writing.
- Published
- 2023
35. The Influence of Two Stage Collaborative Testing on Peer Relationships: A Study of First Year University Student Perceptions
- Author
- Brian Rempel, Elizabeth McGinitie, and Maria Dirks
- Abstract
Two-stage testing is a form of collaborative assessment that creates an active learning environment during test taking. In two-stage testing, students first complete an exam individually, and then complete a subset of the same questions as part of a learning team with the ultimate exam score being a weighted average of the individual and team portions. In the second (team-based) part of the exam, students are encouraged to discuss solutions until a consensus among team members is achieved, thus actively engaging students with course material and each other during the exam. A short open-ended survey was administered to students at the end of the semester, and the responses coded by thematic analysis, with themes generated using inductive coding based on the principles of grounded theory. The most important conclusion was that students overwhelmingly preferred two-stage tests for the development of positive peer relationships in class. The most common themes that emerged from student responses involved positive feelings from forced interaction with their peers, the benefits of meeting and socializing with other students, sharing of knowledge with others, and solidarity or positive affect towards the process of working as part of a team. Finally, students also expressed an overall preference for two-stage exams when compared to solely individual, one-stage exams.
- Published
- 2023
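A small worked example of the two-stage score described above; the 85/15 weighting is an assumption, as the abstract does not report the study's weights:

```python
# Two-stage exam score as a weighted average of the two portions.
w_individual = 0.85                       # assumed weight
individual_score, team_score = 72.0, 90.0
final = w_individual * individual_score + (1 - w_individual) * team_score
print(final)   # 74.7
```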
36. The Examination of Online and Paper-Pencil Test Scores of Students Engaged in Online Learning
- Author
- Necati Taskin and Kerem Erzurumlu
- Abstract
In this study, the online test scores and paper-pencil test scores of students studying through online learning were examined. Causal-comparative research was used to determine the distribution of students' test scores and to examine the relationship between them. The participants were freshman students in 12 faculties and 8 colleges of a state university in Türkiye. The distribution of test scores is depicted with means, standard deviations, percentages, and graphs. The correlation coefficient was examined to interpret the strength of the relationship between students' test scores. According to the findings, students' online test scores were higher than their paper-pencil test scores, and course pass rates were higher on online exams than on paper-pencil exams. The relationship between students' paper-pencil test scores and their online test scores was weaker than the relationship among their paper-pencil test scores, indicating an inconsistency between the two modes. The rise in students' scores on un-proctored online exams is the likely reason for this inconsistency. Moving online exams to proctored environments, using computerized adaptive testing, or including online activities in the assessment may reduce this inconsistency.
- Published
- 2023
37. Reliability and Validity of Methods to Assess Undergraduate Healthcare Student Performance in Pharmacology: Comparison of Open Book versus Time-Limited Closed Book Examinations
- Author
- David Bell, Vikki O'Neill, and Vivienne Crawford
- Abstract
We compared the influence of open-book extended-duration versus closed-book time-limited formats on the reliability and validity of written assessments of pharmacology learning outcomes within our medical and dental courses. Our dental cohort takes a mid-year test (30 free-response short-answer questions, SAQ) and an end-of-year paper (4 SAQ, 1 essay, 1 case) in pharmacology. For our first-year medical cohort, pharmacology is integrated within a larger course, contributing 20 clinical-vignette questions (selecting the single best answer, SBA, from 5 plausible options) to a mid-year test and 3-5 SAQ to an end-of-year paper. Our experience indicates that SAQ are as reliable as SBA for closed-book time-limited assessments; reliability correlates with the number of questions employed. We have found good correlation between mid-year and end-of-year performance (predictive validity), between questions (factorial validity), and between pharmacology and other subjects within the assessment (concurrent validity). Adoption of open-book extended-duration assessments resulted in only a modest reduction in reliability and validity.
- Published
- 2023
38. The Influence of Passage Cohesion on Cloze Test Item Difficulty
- Author
- Jonathan Trace
- Abstract
The role of context in cloze tests has long been seen as both a benefit and a complication in their usefulness as a measure of second language comprehension (Brown, 2013). Passage cohesion, in particular, would seem to have a relevant and important effect on the degree to which cloze items function and on the interpretability of performances (Brown, 1983; Dastjerdi & Talebinezhad, 2006; Oller & Jonz, 1994). With recent evidence showing that cloze items can require examinees to access information at both the sentence and passage level (Trace, 2020), it is worthwhile now to look back and examine the relationship between aspects of passage cohesion--referential cohesion, semantic overlap, and incidence of conjunctives--and item difficulty by classification. The current study draws upon a large pool of cloze test passages and items (k = 377) originally used by Brown (1993), along with automated text analysis of cohesion ("Coh-Metrix," McNamara et al., 2014), to examine the impact of passage cohesion on item function. Correlations, factor analysis, and linear regression point to clear though minimal differences for both sentential and intersentential items as they relate to aspects of passage cohesion, results which may inform future test design and the interpretation of cloze performance.
- Published
- 2023
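A sketch of the regression step described above, with synthetic cohesion indices and difficulties standing in for the Coh-Metrix output (the variable names and coefficients are assumptions, not the study's):

    # Regressing cloze item difficulty on passage-cohesion indices.
    # All numbers are synthetic; the study's item pool is not reproduced here.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 377                                    # number of items, matching k = 377
    referential = rng.uniform(0, 1, n)         # referential cohesion index
    semantic    = rng.uniform(0, 1, n)         # semantic overlap index
    conjunctive = rng.uniform(0, 1, n)         # incidence of conjunctives
    difficulty  = 0.5 - 0.1 * referential + rng.normal(0, 0.1, n)

    X = np.column_stack([np.ones(n), referential, semantic, conjunctive])
    beta, *_ = np.linalg.lstsq(X, difficulty, rcond=None)
    print(dict(zip(["intercept", "referential", "semantic", "conjunctive"],
                   beta.round(3))))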
39. Achieving Technical Economy: A Modification of Cloze Procedure
- Author
-
Albert Weideman and Tobie van Dyk
- Abstract
This contribution investigates gains in technical economy in measuring language ability by considering one recurrent interest of JD Brown: cloze tests. In the various versions of the Test of Academic Literacy Levels (TALL), its Sesotho and Afrikaans (Toets van Akademiese Geletterdheidsvlakke -- TAG) counterparts, and other related tests used in South Africa, the test designers have used a modification of this procedure to very good effect. This paper reports on the steady evolution of its format over many years, how it is currently used, what its outstanding empirical properties are, and how the kind of technical economy it brings to the measurement of the ability to handle the demands of academic language at tertiary level can be further applied. The modification involves the conventional, more or less systematic mutilation of a selected text, with two multiple-choice questions about every gap in it: where the gap is, and which word has been omitted (a minimal sketch of this two-question format follows this record). We have not seen analyses of this format elsewhere, which in itself may be of interest to test designers. We proceed by defining technical economy, and then develop an argument from the empirical properties of TALL on how that idea can be applied, in particular to the design and task selection of such tests, before illustrating how such choices may contribute to further productive and responsible designs and test formats.
- Published
- 2023
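A minimal sketch of the two-questions-per-gap format described above, with an invented sentence and invented options (the actual TALL/TAG item wording is not reproduced in the abstract):

    # One gap in a mutilated text, carrying the two multiple-choice questions
    # the abstract describes: where the gap is, and which word was omitted.
    gap_item = {
        "sentence": "Academic writing a formal register.",   # word silently deleted
        "q1_where": {                       # Q1: at which position is a word missing?
            "options": ["after word 1", "after word 2", "after word 3", "after word 4"],
            "answer": "after word 2",
        },
        "q2_which": {                       # Q2: which word was omitted?
            "options": ["requires", "refuses", "repeats", "reduces"],
            "answer": "requires",
        },
    }
    print(gap_item["q2_which"]["answer"])

Each deleted word thus yields two selected-response scores, which is where the format's technical economy comes from: double the item count per mutilated text, with machine-scorable responses.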
40. The Impact of the Images in Multiple-Choice Questions on Anatomy Examination Scores of Nursing Students
- Author
-
Narnaware, Yuwaraj and Cuschieri, Sarah
- Abstract
The visualizing effect of images on improving anatomical knowledge is well documented in medical and allied health students, but this phenomenon has rarely been assessed in nursing students. To assess the effect of images on improving anatomical knowledge, and to use images as one method of assessing gross anatomical knowledge in nursing students, the present study was repeated over two semesters. The results show that the percent class average was significantly (P<0.006) higher when more anatomical images were included in a multiple-choice anatomy exam than in a similar exam with fewer images, and significantly (P<0.002) lower when the number of images was reduced by 50% compared to image-rich exams. Examinations with an equal number of images did not alter the class average. The percent score of individual questions presented with images plus text was significantly (P<0.001) higher than for the same questions with text only in both semesters. The findings indicate that including images in anatomy examinations can improve learning and knowledge, may help reduce cognitive load, aid recall of anatomical knowledge, and provide a hint to an exam question.
- Published
- 2023
41. Effect of Missing Data on Test Equating Methods Under NEAT Design
- Author
-
Semih Asiret and Seçil Ömür Sünbül
- Abstract
This study aimed to examine the effect of missing data of different patterns and sizes on test equating methods under the NEAT design. To this end, factors such as sample size, average difficulty difference between the test forms, difference between the ability distributions, missing data rate, and missing data mechanism were manipulated, and their effects on the equating error of four test equating methods (chained equipercentile equating, Tucker, frequency estimation equating, and Braun-Holland) were investigated. Two separate sets of 10,000 dichotomous responses were generated under a 2-parameter logistic (2PL) model (a data-generation sketch follows this record), using the MCAR and MAR missing data mechanisms. All analyses were conducted in R 4.2.2. The RMSE of the equating methods increased significantly as the missing data rate increased, and equating with imputed missing data yielded lower RMSE than equating without imputation. Furthermore, the percentage of missing data, together with the difference between ability distributions and the average difficulty difference between forms, significantly affected equating errors in the presence of missing data. Although increasing the sample size did not have a significant effect on equating error in the presence of missing data, it did lead to more accurate equating when no data were missing.
- Published
- 2023
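A sketch of the data-generation step named in the abstract: dichotomous responses under a 2PL model for 10,000 examinees. The parameter distributions below are assumptions for illustration; the study's simulation conditions varied them systematically:

    # Generate dichotomous responses consistent with a 2PL IRT model.
    import numpy as np

    rng = np.random.default_rng(42)
    n_persons, n_items = 10_000, 40
    theta = rng.normal(0, 1, n_persons)        # person abilities
    a = rng.lognormal(0, 0.3, n_items)         # item discriminations
    b = rng.normal(0, 1, n_items)              # item difficulties

    # 2PL: P(X=1 | theta) = 1 / (1 + exp(-a * (theta - b)))
    p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))
    responses = (rng.uniform(size=p.shape) < p).astype(int)
    print(responses.shape, responses.mean().round(3))

Missingness would then be imposed on `responses` under MCAR or MAR before equating, and RMSE computed between equated and criterion scores across replications.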
42. Investigating Different Kinds of Stems in Multiple-Choice Tests: Interruptive vs. Cumulative
- Author
-
Sharareh Sadat Sarsarabi and Zeinab Sazegar
- Abstract
The stem of a multiple-choice question can be written as one of two sentence types: interruptive (periodic) or cumulative (loose). This study deals with these kinds of stems in designing multiple-choice (MC) items. To fill the existing gap in the literature, two groups of student teachers taking general English courses at Farhangian University were selected based on the Cambridge Placement Test, in a comparison-group design. To compare the effectiveness of the two stem types, two tests based on the book Thoughts and Notions 2, which was taught in the General English classes, were designed; the tests were similar in content but differed in their stems. Each test contained 40 items: 25 vocabulary and 15 reading comprehension. The first group took the test whose stems were written only as interruptive sentences; the second group took the test whose stems were written only as cumulative sentences. An independent t-test (sketched after this record) showed that the first group outperformed the second. It was therefore concluded that interruptive stems in multiple-choice tests were more reliable and valid than cumulative ones. One implication of the study is that interruptive stems can assist policymakers, materials designers, and language teachers in future decision making and materials design.
- Published
- 2023
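An independent-samples t-test of the kind used to compare the two stem conditions; the scores are fabricated for illustration, not taken from the study:

    # Compare mean test scores of the interruptive-stem and cumulative-stem groups.
    from scipy import stats

    interruptive = [31, 28, 33, 35, 30, 29, 34, 32]   # hypothetical scores, group 1
    cumulative   = [27, 25, 30, 28, 26, 24, 29, 27]   # hypothetical scores, group 2
    t, p = stats.ttest_ind(interruptive, cumulative)
    print(f"t = {t:.2f}, p = {p:.4f}")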
43. TOEFL iBT Speaking Subtest: The Efficacy of Preparation Time on Test-Takers' Performance
- Author
-
Ali Akbar Ariamanesh, Hossein Barati, and Manijeh Youhanaee
- Abstract
The present study investigates the efficacy of preparation time in four speaking tasks of the TOEFL iBT. As the pre-task planning time offered by ETS is very short, 15 to 30 seconds, we explored how test-takers' speaking quality would change if the preparation time were added to the response time, giving respondents a relatively longer online planning opportunity. To this aim, two groups of TOEFL iBT candidates were studied under pre-task and online planning conditions. In total, 384 elicited speaking samples were first transcribed and then measured in terms of complexity, accuracy, and fluency (CAF). A series of one-way MANOVAs (sketched after this record) revealed that the online planning group significantly outperformed the pre-task planning group in accuracy and fluency across all four speaking tasks. Though less robustly, the online planners also showed significantly higher speech complexity, as represented by lexical diversity and left-embeddedness. These results may challenge the efficacy of the preparation time currently provided in the TOEFL iBT speaking subsection.
- Published
- 2023
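A one-way MANOVA on CAF measures, mirroring the analysis named above. The data frame, group labels, and effect sizes are synthetic assumptions, not the study's data:

    # One-way MANOVA: do CAF measures differ between planning conditions?
    import numpy as np
    import pandas as pd
    from statsmodels.multivariate.manova import MANOVA

    rng = np.random.default_rng(1)
    n = 96                                     # per-group sample size (illustrative)
    df = pd.DataFrame({
        "group": ["pretask"] * n + ["online"] * n,
        "complexity": np.r_[rng.normal(4.0, 0.5, n), rng.normal(4.2, 0.5, n)],
        "accuracy":   np.r_[rng.normal(0.60, 0.1, n), rng.normal(0.70, 0.1, n)],
        "fluency":    np.r_[rng.normal(2.5, 0.4, n), rng.normal(2.9, 0.4, n)],
    })
    m = MANOVA.from_formula("complexity + accuracy + fluency ~ group", data=df)
    print(m.mv_test())                          # Wilks' lambda, Pillai's trace, etc.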
44. Testing Vocabulary Associations for Effective Long Term Learning
- Author
-
Al-Jarf, Reima
- Abstract
This article aims to give a comprehensive guide to planning and designing vocabulary tests: identifying the skills to be covered by the test; outlining the course content covered; preparing a table of specifications that shows the skills, content topics, and number of questions allocated to each (a minimal sketch follows this record); and preparing the test instructions. The test should meet several criteria: the instructions should be brief and clear; the questions should cover all the skills, tasks, and exercises covered in the classroom and textbook; and the test items should require students to perform tasks at the phoneme, grapheme, affix, word, phrase, and paragraph levels. The questions should test students' ability to think, apply, infer, connect, and synthesize information, not mere recall, and should not reuse exact sentences and examples from the textbook. The test should have as many production questions as possible; it should have adequate discrimination power; it should be reliable and valid; and it should serve as both a power and a speed test. In addition, the article describes the optimal test length, when to give tests during the semester, and the test duration. It describes the test paper format; how tests are scored; the marks allocated to each question type and to the whole test, using whole marks rather than fractions; and deducting points for spelling and grammatical mistakes. After scoring, the instructor returns the marked answer sheets to the students, explains the marking system, goes through the questions one by one, gives the correct answers, and notes common errors. Follow-up issues such as calculating test validity, reliability, and discrimination power, and using the test results for diagnosing weaknesses and providing remedial work, are also covered, along with the effects of the proposed test model on learning outcomes and students' views.
- Published
- 2023
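A minimal table of specifications of the kind the article recommends, with invented skills, topics, and question counts:

    # Table of specifications: skill x content topic x number of questions.
    # All entries are illustrative placeholders.
    spec = [
        {"skill": "word recognition", "topic": "Unit 1 vocabulary", "n_questions": 10},
        {"skill": "affix analysis",   "topic": "Units 1-3 affixes", "n_questions": 5},
        {"skill": "inference",        "topic": "Reading passages",  "n_questions": 8},
    ]
    total = sum(row["n_questions"] for row in spec)
    print(f"Total items: {total}")

Summing the question column against the planned test length is a quick check that the blueprint and the drafted paper agree before the test is administered.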
45. Application of Two-Parameter Item Response Theory for Determining Form-Dependent Items on Exams Using Different Item Orders
- Author
-
Pentecost, Thomas C., Raker, Jeffery R., and Murphy, Kristen L.
- Abstract
Using multiple versions of an assessment has the potential to introduce item environment effects, which result in version-dependent item characteristics (i.e., difficulty and discrimination). Methods to detect such effects, and their implications, are important wherever multiple forms of an assessment are created. This report describes a novel method for identifying items that do and do not display form dependence. The first two steps identify form-dependent items using a differential item functioning (DIF) analysis of item parameters estimated by Item Response Theory (a parameter-comparison sketch follows this record). The method is illustrated using items that appeared in four forms (two trial and two released versions) of a first-semester general chemistry examination. Eighteen of fifty-six items were identified as having form-dependent item parameters. Thirteen of those items displayed form dependence consistent with reasons previously identified in the literature: preceding item difficulty, content priming, or a combination of the two. The remaining five items showed form dependence that did not align with reasons reported in the literature. A further analysis examined whether all predicted instances of form dependence could be found; several items that could have displayed form dependence, based on preceding item difficulty or content priming, did not. We identify and rationalize form dependence for thirteen of the eighteen flagged items; however, we are unable to predict a priori which items will display form dependence.
- Published
- 2023
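A sketch of the parameter-comparison idea behind the DIF step: flag an item whose 2PL parameters shift across forms beyond a tolerance. The estimates and tolerances below are invented; the report's actual flagging criteria may differ:

    # Flag an item as form-dependent if its calibrated 2PL parameters
    # differ between forms by more than a chosen tolerance.
    form_A = {"a": 1.10, "b": -0.20}   # discrimination, difficulty on form A
    form_B = {"a": 1.05, "b":  0.55}   # same item calibrated on form B

    def form_dependent(p1, p2, tol_a=0.30, tol_b=0.50):
        """Flag an item if either parameter shifts beyond its tolerance."""
        return abs(p1["a"] - p2["a"]) > tol_a or abs(p1["b"] - p2["b"]) > tol_b

    print(form_dependent(form_A, form_B))   # True: difficulty shifted by 0.75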
46. Exploring Confidence Accuracy and Item Difficulty in Changing Multiple-Choice Answers of Scientific Reasoning Test
- Author
-
Fadillah, Sarah Meilani, Ha, Minsu, Nuraeni, Eni, and Indriyanti, Nurma Yunita
- Abstract
Purpose: Researchers have discovered that when students are given the opportunity to change their answers, a majority change their responses from incorrect to correct, and this often increases overall test results. What prompts students to modify their answers? This study examines answer changing on a scientific reasoning test, with additional exploration of confidence accuracy and its relation to item difficulty. Methodology: A pre-test/post-test design was used with 20 scientific reasoning items, each carrying a confidence judgement. The items were assessed for validity by analysing their psychometric properties under the three-parameter (3PL) Item Response Theory model (the 3PL response function is sketched after this record), carried out in RStudio. The items were administered in random order to 205 Indonesian undergraduate students with a background in a science education related major. Confidence accuracy was determined by categorising correct and incorrect answers to the scientific reasoning questions against the reported level of confidence. Findings: Responses were modified more frequently from incorrect to correct than from correct to incorrect, resulting in a significant gain in overall scientific reasoning score, although these modifications were not connected to item difficulty. Even though confidence levels also increased significantly, Indonesian students repeatedly responded with overconfidence even after sitting the same test three weeks later, which could indicate a lack of metacognitive ability. Given how frequently overconfidence occurred in these examinations, the findings should spur educators to engage actively in metacognitive training in their teaching and learning activities. Significance: This study provides further substantiation in the fields of scientific reasoning and cognitive science, documenting a trend of confidence accuracy change in scientific reasoning tests. It also contributes to uncovering the true ability of Indonesian students when performing such reasoning tests through repeated attempts.
- Published
- 2023
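The 3PL response function used to calibrate items like those above; the parameter values here are illustrative, not the study's estimates:

    # 3PL IRT: probability of a correct response with a guessing floor c.
    import math

    def p_correct_3pl(theta, a, b, c):
        """P(correct) = c + (1 - c) / (1 + exp(-a * (theta - b)))."""
        return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

    # An average-ability examinee on a moderately hard, guessable item.
    print(round(p_correct_3pl(theta=0.0, a=1.2, b=0.3, c=0.20), 3))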
47. Comparing the Effectiveness of Newer Linework on the Mental Cutting Test (MCT) to Investigate Its Delivery in Online Educational Settings
- Author
-
Green, Theresa, Goodridge, Wade H., Anderson, Jon, Davishahl, Eric, and Kane, Daniel
- Abstract
The purpose of this study was to examine differences in test scores between three online versions of the Mental Cutting Test (MCT). The MCT was developed to quantify a rotational and proportion construct of spatial ability and has been used extensively to assess it. The test was developed in 1938 as a paper-and-pencil test in which examinees are presented with a two-dimensional drawing of a 3D object with a cutting plane passing through it; the examinee must determine the cross-sectional shape that would result from cutting along the imaginary plane. This work explored three versions of the test (the original and two adapted versions), administered online, to see whether student performance differed across versions. The versions differed in the quality of the linework displayed and in the shading shown on the surfaces. The study analyzed statics students' scores on the three online versions and on the original paper version of the MCT to identify which version may be most suitable for administering to engineering students. Results showed statistically significant differences in students' scores between multiple versions. Understanding which representations of MCT items are clearest to students will provide insights for educators looking to improve and understand their students' spatial ability.
- Published
- 2023
48. The Oral Exam--Learning for Mastery and Appreciating It
- Author
-
Akkaraju, Shylaja
- Abstract
To reduce academic dishonesty and strengthen learning outcomes, I adopted in-depth oral examinations as my benchmark and summative assessments in a Human Anatomy & Physiology course taught in an online asynchronous setting. This decision led my students and me down the transformative path of mastery learning. It was a "threshold experience" for my students, who were learning how to think and express themselves as physiologists, and for me, as I explored the scope of the oral examination in promoting skill acquisition while nurturing a relationship-rich learning environment. By employing "deliberate practice" principles, including basic drills, one-on-one weekly check-ins, and small group recitation sessions, students exceeded benchmarks for conceptual understanding, mastery of fundamentals, and application of concepts to clinical scenarios. Students consistently reported that they were happy within this learning environment. With meticulous planning, it is possible to motivate students to learn for mastery and acquire expertise by employing oral exams as the pivotal assessment strategy in an online course, thereby also making academic dishonesty almost irrelevant.
- Published
- 2023
49. Automatic Item Generation for Non-Verbal Reasoning Items
- Author
-
Ayfer Sayin, Sabiha Bozdag, and Mark J. Gierl
- Abstract
The purpose of this study is to generate non-verbal items for a visual reasoning test using template-based automatic item generation (AIG). The research followed the three stages of template-based AIG. An item from the 2016 4th-grade entrance exam of the Science and Art Center (known as BILSEM) was chosen as the parent item; a cognitive model and an item model were developed for non-verbal reasoning; and items were then generated by computer algorithms (a toy generation sketch follows this record). The first item model produced 112 items and the second 1,728 items. The items were evaluated by subject matter experts (SMEs), who indicated that they met the criteria of a single right answer, single content and behavior, non-trivial content, and homogeneous options, and judged the items to vary in difficulty. The results demonstrate the feasibility of AIG for creating an extensive repository of non-verbal visual reasoning items.
- Published
- 2023
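A toy version of template-based generation: a stem template whose slots are filled from value lists to enumerate every variant. The template and slot values are invented and far simpler than the study's visual item models:

    # Template-based AIG in miniature: enumerate all slot combinations.
    from itertools import product

    template = "Which figure completes the {rule} pattern of {n} shapes?"
    slots = {
        "rule": ["rotation", "reflection", "size-increase", "shading"],
        "n":    ["three", "four", "five", "six", "seven", "eight", "nine"],
    }
    items = [template.format(rule=r, n=n)
             for r, n in product(slots["rule"], slots["n"])]
    print(len(items))   # 4 rules x 7 counts = 28 generated variants

Item-model yields such as the 112 and 1,728 reported above come from the same multiplication of slot cardinalities, which is why a single validated model can seed a large item repository.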
50. Influence of Selected-Response Format Variants on Test Characteristics and Test-Taking Effort: An Empirical Study. Research Report. ETS RR-22-01
- Author
-
Guo, Hongwen, Rios, Joseph A., Ling, Guangming, Wang, Zhen, Gu, Lin, Yang, Zhitong, and Liu, Lydia O.
- Abstract
Different variants of the selected-response (SR) item type have been developed for various reasons (e.g., simulating realistic situations or examining critical-thinking and problem-solving skills). These variants are generally more complex than traditional multiple-choice (MC) items, which may make them more challenging to test takers and may therefore discourage test engagement on low-stakes assessments. Low test-taking effort has been shown to distort test scores and thereby diminish score validity. We used data collected from a large-scale assessment to investigate how variants of the SR item format may affect test properties and test engagement. Results show that the studied SR variants were generally harder and more time-consuming than the traditional MC format but showed no negative impact on test-taking effort. However, item position had a dominant, cumulative influence on nonresponse rates and rapid-guessing rates (a rapid-guessing sketch follows this record), even though the effect sizes were relatively small in the studied data.
- Published
- 2022
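A common operationalization of rapid guessing on low-stakes tests is a response faster than a per-item time threshold; the report's exact method may differ, so the thresholds and times below are illustrative:

    # Flag rapid guesses: responses faster than an item's time threshold.
    thresholds = {"item_01": 5.0, "item_02": 8.0}      # seconds per item (assumed)
    response_times = {"item_01": 2.1, "item_02": 14.3}  # one examinee's times

    rapid = {item: rt < thresholds[item] for item, rt in response_times.items()}
    print(rapid)                                # {'item_01': True, 'item_02': False}
    print(sum(rapid.values()) / len(rapid))     # examinee's rapid-guessing rate

Aggregating such flags by item position would reproduce the kind of positional nonresponse and rapid-guessing analysis the abstract describes.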