Urban Institute, National Center for Analysis of Longitudinal Data in Education Research (CALDER); Boyd, Donald; Grossman, Pamela; Lankford, Hamilton; Loeb, Susanna; Wyckoff, James
Value-added models in education research allow researchers to explore how a wide variety of policies and measured school inputs affect the academic performance of students. Researchers typically quantify the impacts of such interventions in terms of "effect sizes," i.e., the estimated effect of a one standard deviation change in the variable divided by the standard deviation of test scores in the relevant population of students. Effect size estimates based on administrative databases are typically quite small. Research has shown that high-quality teachers have large effects on student learning but that measures of teacher qualifications seem to matter little. This has led some observers to conclude that, even though choosing teachers effectively can make an important difference in student outcomes, attempting to differentiate teacher candidates based on pre-employment credentials is of little value. This illustrates how the perception that many educational interventions have small effect sizes, as traditionally measured, is having important consequences for policy. In this paper we focus on two issues pertaining to how effect sizes are measured. First, we argue that model coefficients should be compared to the standard deviation of gain scores, not the standard deviation of scores, in calculating most effect sizes. The second issue concerns the need to account for test measurement error. The standard deviation of observed scores in the denominator of the effect-size measure reflects such measurement error as well as the dispersion in the true academic achievement of students, thus overstating variability in achievement. It is the size of an estimated effect relative to the dispersion in true achievement, or in the gain in true achievement, that is of interest. Adjusting effect-size estimates to account for these considerations is straightforward if one knows the extent of test measurement error.
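The adjustment described above can be illustrated with a minimal sketch. It assumes that measurement errors are uncorrelated across the two tests that form a gain score and have the same variance on each test, so that the variance of observed gains equals the variance of true gains plus twice the error variance. The function names and all numbers are hypothetical illustrations, not values taken from the paper.

```python
import math


def effect_size(coef, sd):
    """Effect size: an estimated coefficient divided by a standard deviation."""
    return coef / sd


def true_gain_sd(var_observed_gain, var_error_per_test):
    """Standard deviation of true gains, net of measurement error.

    Assumes (for illustration) uncorrelated errors with equal variance on
    both tests, so Var(observed gain) = Var(true gain) + 2 * Var(error).
    """
    var_true_gain = var_observed_gain - 2.0 * var_error_per_test
    if var_true_gain <= 0:
        raise ValueError("error variance too large for the observed gain variance")
    return math.sqrt(var_true_gain)


# Hypothetical inputs, with test scores standardized to unit variance.
coef = 0.04          # estimated coefficient, in test-score units
sd_scores = 1.0      # SD of observed score levels
var_obs_gain = 0.36  # variance of observed gain scores
var_error = 0.10     # measurement-error variance on each test

conventional = effect_size(coef, sd_scores)
adjusted = effect_size(coef, true_gain_sd(var_obs_gain, var_error))
print(f"conventional: {conventional:.3f}, adjusted: {adjusted:.3f}")
```

With these made-up inputs the adjusted figure is larger than the conventional one simply because the denominator shrinks; the size of the change in any real application depends on the estimated error variance.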
Technical reports provided by test vendors typically provide information only about the measurement error associated with the test instrument. However, a number of other factors, including variation in scores associated with students having particularly good or bad days, can result in test scores not accurately reflecting true academic achievement. Using the covariance structure of student test scores across grades in New York City from 1999 to 2007, we estimate the overall extent of test measurement error and how measurement error varies across students. Our estimation strategy follows from two key assumptions: (1) there is no persistence (correlation) in each student's test measurement error across grades; (2) there is at least some persistence in learning across grades, with the degree of persistence constant across grades. Employing the covariance structure of test scores for NYC students and alternative models characterizing the growth in academic achievement, we find estimates of the overall extent of test measurement error to be quite robust. Returning to the analysis of effect sizes, our effect-size estimates based on the dispersion in gain scores net of test measurement error are four times larger than effect sizes typically measured. To illustrate the importance of this difference, we consider results from a recent paper analyzing how various attributes of teachers affect the test-score gains of their students (Boyd et al., in press). Many of the estimated effects appear small when compared to the standard deviation of student achievement, that is, effect sizes of less than 0.05. However, when measurement error is taken into account, the associated effect sizes often are about 0.16.
Furthermore, when teacher attributes are considered jointly, based on the teacher attribute combinations commonly observed, the overall effect of teacher attributes is roughly half a standard deviation of universe score gains, and even larger when teaching experience is also allowed to vary. The bottom line is that there are important differences in teacher effectiveness that are systematically related to observed teacher attributes. Such effects are important from a policy perspective and should be taken into account in the formulation and implementation of personnel policies. An appendix is included. (Contains 34 footnotes, 4 figures, and 9 tables.) ["Overview of Measuring Effect Sizes: The Effect of Measurement Error. Brief 2" (ED508284) was based on this report.]
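The two identifying assumptions stated in the abstract (no correlation in measurement error across grades, and constant persistence in learning) can be made concrete with a simplified sketch. It assumes a first-order model in which true achievement in one grade equals a constant rho times true achievement in the previous grade plus an independent innovation; this is an illustrative stand-in, not necessarily the specification estimated in the paper. Under that model, covariances between observed scores in different grades involve only true achievement, so three consecutive grades suffice to back out the error variance in the middle grade. All names and parameter values are hypothetical.

```python
def middle_grade_error_variance(cov):
    """Recover persistence and middle-grade error variance from a 3x3
    covariance matrix of observed scores in three consecutive grades.

    Assumes (for illustration): errors uncorrelated across grades, and
    true achievement following tau[g+1] = rho * tau[g] + innovation.
    Then Cov(S1,S2) = rho*Var(tau1), Cov(S1,S3) = rho^2*Var(tau1),
    and Cov(S2,S3) = rho*Var(tau2).
    """
    c12, c13, c23 = cov[0][1], cov[0][2], cov[1][2]
    rho = c13 / c12                      # persistence parameter
    var_true_2 = c23 / rho               # variance of true grade-2 achievement
    var_error_2 = cov[1][1] - var_true_2 # what remains is measurement error
    return rho, var_true_2, var_error_2


def implied_cov(var_true_1, rho, var_innov, var_error):
    """Covariance matrix of observed scores implied by the assumed model."""
    v1 = var_true_1
    v2 = rho ** 2 * v1 + var_innov
    v3 = rho ** 2 * v2 + var_innov
    return [
        [v1 + var_error, rho * v1, rho ** 2 * v1],
        [rho * v1, v2 + var_error, rho * v2],
        [rho ** 2 * v1, rho * v2, v3 + var_error],
    ]


# Hypothetical parameters, used only to check that the method recovers them.
cov = implied_cov(var_true_1=1.0, rho=0.9, var_innov=0.2, var_error=0.15)
rho_hat, var_true_hat, var_error_hat = middle_grade_error_variance(cov)
print(rho_hat, var_true_hat, var_error_hat)
```

In practice the covariance matrix would be estimated from student-level data rather than constructed from known parameters, and sampling error would then affect the recovered values.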