Monitoring Reader Performance and DRIFT in the AP® English Literature and Composition Examination Using Benchmark Essays. Research Report No. 2007-2

Authors :
College Board
Wolfe, Edward W.
Myford, Carol M.
Engelhard, George
Source :
College Board. 2007.
Publication Year :
2007

Abstract

In this study, we investigated a variety of Reader effects that may influence the validity of ratings assigned to AP® English Literature and Composition essays. Specifically, we investigated whether Readers exhibit changes in their levels of severity and accuracy, and their use of individual scale categories over time. We refer to changes in these characteristics of Readers as Differential Reader Functioning over Time (DRIFT). Our literature review points out several weaknesses in the way Reader effects have been addressed in prior studies, and the study sought to address several of those weaknesses. The study is relevant to operational AP Readings because it addresses several existing challenges: (a) difficulties in monitoring Reader performance due to the assignment of one rating per essay; (b) difficulties in tracking changes in Reader performance over time; (c) difficulties in identifying diagnostically informative indices of DRIFT; and (d) lack of knowledge about how and when DRIFT is likely to occur during an operational AP Reading. In addition, the study suggests how to approach Reader monitoring in an automated, online reading system, should AP choose to pursue such a system in the future. The study sought to answer research questions relating to the implications of three types of DRIFT ("differential severity," "differential accuracy," and "differential scale category use") in AP English Literature and Composition essay ratings by collecting data during an operational AP Reading. Prior to the Reading, a panel of highly experienced AP Readers identified benchmark essays and assigned them consensus ratings. These benchmark essays were copied and distributed to AP Readers during the Reading so that the single ratings assigned to each essay could be connected via the benchmark essays. In addition, the time that each Reader began and completed rating each packet of essays during the Reading was recorded. 
A variety of analyses were performed to assess the types and seriousness of DRIFT, to determine whether different methods of detecting DRIFT provided more or less diagnostically useful information, and to determine whether connecting Readers through their ratings of the benchmark essays would result in a stronger rating design that might improve the detection of DRIFT. A scoring worksheet is appended.

Details

Language :
English
Database :
ERIC
Journal :
College Board
Publication Type :
Report
Accession number :
ED561038
Document Type :
Reports - Research