Background: Evidence from education randomized controlled trials (RCTs) in low- and middle-income countries (LMICs) demonstrates how interventions can improve children's educational achievement [1, 2, 3, 4]. RCTs assess the impact of an intervention by comparing outcomes (aggregate test scores) between treatment and control groups. A review of education RCTs in LMICs [5] demonstrates the need to improve outcome measurement properties and analysis.

Objective: We examine how treatment effects vary across items; specifically, whether the effects of interventions manifest as general increases in the targeted domain or as narrow gains on subsets of items associated with specific skills. We use item-level response data (from over 200,000 students responding to over 2,000 items) from 15 RCTs [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. Given the centrality of the outcome measure to inferences about educational intervention efficacy, our study is highly relevant to the design of outcome measures for future RCTs.

Data: We obtained item-level response data on academic outcomes from 15 RCTs. The data cover 84 tests (defined by grade-subject combinations), 265,732 respondents (162,359 in JPAL and 103,373 in REAP), and 2,119 items (1,466 in JPAL and 653 in REAP; the same item may appear on many tests). In total, we analyze over seven million item responses; Figure 1 summarizes the data.

Analysis: Our approach is motivated by ``Differential Item Functioning'' (DIF; [18]). We anticipate findings of DIF if the effects of an RCT intervention are specific to certain items, i.e., if items vary in their sensitivity to treatment. We use techniques from item response theory (IRT; [19]) to analyze responses. [equations omitted]. If the treatment is effective, we would anticipate item $j$ being easier for respondents in the treatment group, so that $\E(\beta_j)>0$. If the effect of treatment is homogeneous across items (i.e., all items are equally sensitive to the treatment), then $\beta_j=\beta_{j'}$ for all items $j$ and $j'$. The $a$ coefficient is a test-level loading that we estimate. To estimate Eqn 2, we use the multiple-group approach [20] estimated via the EM algorithm [21] as implemented in mirt [22]. Estimates of $\beta_j$, denoted $\widehat{\beta_j}$, are informative about the role of items in a specific test context. To index the sensitivity of treatment effects to individual items, we consider a ``leave-one-out'' (LOO) analysis: for each item, we re-estimate the treatment effect after removing that item from the test and subtract this updated effect size from the original estimate; a value of 0 indicates that removing the item does not change the estimated treatment effect. To index the overall homogeneity of item-level treatment sensitivity, we consider the ratio, denoted $R$, of the variation in the treatment-specific offsets to the variation in the difficulty parameters, [equation omitted]. Relatively large values of $R$ indicate tests with pronounced item-treatment interactions compared to tests with smaller $R$ values. We also conduct a simulation study in which we construct alternative forms of each test by resampling from the $\widehat{\beta_j}$ parameters to examine the imprecision in treatment effect estimates associated with variation in item-level sensitivity to treatment.
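To make the analysis pipeline concrete, the following is a minimal illustrative sketch, not the study code: it assumes a 2PL-style response model with a shared test-level loading $a$, item difficulties $b_j$, and treatment-specific offsets $\beta_j$, simulates data under that model, and replaces the multiple-group EM estimation in mirt with a simple standardized mean difference of sum scores; all parameter values and variable names are assumptions chosen for illustration.

```python
# Illustrative sketch only (not the authors' code): a 2PL-style model with a shared
# test-level loading, item difficulties b_j, and treatment offsets beta_j that make
# items easier for treated respondents. Moment-based summaries stand in for the
# multiple-group EM estimation done in mirt.

import numpy as np

rng = np.random.default_rng(0)

# --- Simulate item responses under the hypothesized model -------------------
n_students, n_items = 2000, 20
treated = rng.integers(0, 2, n_students)          # 1 = treatment group
theta = rng.normal(0, 1, n_students)              # latent ability
a = 1.2                                           # test-level loading (shared across items)
b = rng.normal(0, 1, n_items)                     # item difficulties
beta = np.abs(rng.normal(0.2, 0.15, n_items))     # treatment-specific offsets, E(beta_j) > 0

# P(correct) = logistic(a*theta - b_j + beta_j*treated): items are easier for treated students
logits = a * theta[:, None] - b[None, :] + beta[None, :] * treated[:, None]
responses = rng.binomial(1, 1 / (1 + np.exp(-logits)))

# --- Aggregate treatment effect (standardized mean difference of sum scores) ---
def effect_size(resp, treated):
    scores = resp.sum(axis=1)
    diff = scores[treated == 1].mean() - scores[treated == 0].mean()
    return diff / scores.std(ddof=1)

full_effect = effect_size(responses, treated)

# --- Leave-one-out (LOO) sensitivity: drop each item, re-estimate, take the difference ---
loo_diff = np.array([
    full_effect - effect_size(np.delete(responses, j, axis=1), treated)
    for j in range(n_items)
])

# --- Homogeneity index R: variation in treatment offsets relative to difficulties ---
# Computed here from the simulated parameters; in practice it would use the estimates
# beta_hat_j and b_hat_j from the fitted multiple-group IRT model.
R = beta.var(ddof=1) / b.var(ddof=1)

print(f"effect size: {full_effect:.3f}")
print(f"largest single-item LOO change: {loo_diff.max():.3f}")
print(f"R (var(beta) / var(b)): {R:.3f}")
```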
Results: In Panel A of Figure 2, the values of $\widehat{\beta_j}$ show a right skew (skew = 0.85), consistent with items tending to be marginally easier for treatment respondents than for control respondents. Overall, 25% of items (527 of 2,119) exhibited DIF at the $\alpha=0.05$ level. We consider LOO differences as a function of $\beta_j$ (Figure 3). The point in the black circle comes from an intervention with an effect size of 0.11 (p = 0.036). After we remove this item, the adjusted effect size falls to 0.066; standardizing the LOO difference by the original effect size, this single item accounts for 37% of the original effect. Without this item, a different inference about the efficacy of this intervention would result. Turning to the $R$ quantities (Eqn 3), the average test has $R=0.14$, but there is clear variation (Figure 4, Panel A). Figure 4, Panel B is a scatterplot of effect sizes for the RCTs versus $\widehat{\beta_j}$ values. The RCTs in blue (items with homogeneous sensitivity; $R=0.076$) and red (items with heterogeneous sensitivity; $R=0.21$) both had effect estimates that were near zero and insignificant. For the blue RCT, a similar set of items would yield a similar inference. In contrast, inferences about the red RCT seem highly sensitive to the specific configuration of items; for example, had the two items with the lowest $\widehat{\beta_j}$ values not been included on the assessment, the effect estimate may have been significant. We build on this idea with the simulation exercise (sketched after the conclusions below), focusing on the 29 tests that had a significant treatment effect; results are in Figure 5. We re-examine effects after accounting for the uncertainty related to variation in treatment effects across items (Panel C). The uncertainty due to variation in item-level sensitivity was 19% larger on average than the uncertainty based on the standard error for the difference. This additional uncertainty has implications: in 41% of those cases (12 of 29), the CI that accounts for item-level sensitivity suggests that the result may be a false positive.

Conclusions: RCT outcomes are the crux of measuring intervention effectiveness. Researchers and policymakers can incorporate DIF analysis of item-level test data to move beyond using only aggregate test scores to evaluate intervention effectiveness. Items vary in their sensitivity to the intervention. Outcome measures more closely aligned with the intervention can better answer whether and how learning outcomes were improved. DIF analysis is one tool that can be used to explicate which items are more sensitive to an intervention and thus which skills were improved. Cost-effectiveness analyses of RCTs could use DIF analysis to report effects per dollar spent within the specific subdomains tested, informing how to target future resources. We demonstrate a method to analyze RCT outcomes at the item level and provide a better understanding of ``what works'' to improve children's academic skills globally.
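The resampling exercise referenced in the Results can be sketched as follows. This is an assumption-laden illustration, not the study code: the item-level offsets $\widehat{\beta_j}$, the conventional standard error, and the summary of a resampled form by its mean offset are all stand-ins chosen for this sketch.

```python
# Illustrative sketch of the resampling idea: build alternative test forms by resampling
# the estimated item-level treatment offsets beta_hat_j with replacement, and compare the
# resulting spread of form-level effects with the conventional standard error. All values
# and the independence assumption for combining uncertainties are illustrative.

import numpy as np

rng = np.random.default_rng(1)

beta_hat = rng.normal(0.15, 0.10, 25)   # stand-in for estimated item offsets on one test
se_conventional = 0.04                  # stand-in for the usual standard error of the effect

n_forms = 5000
# Each alternative form keeps the same test length but resamples items (their offsets)
# with replacement; the form-level effect is summarized here by the mean offset.
form_effects = np.array([
    rng.choice(beta_hat, size=beta_hat.size, replace=True).mean()
    for _ in range(n_forms)
])

se_items = form_effects.std(ddof=1)     # uncertainty attributable to item composition
se_total = np.sqrt(se_conventional**2 + se_items**2)   # if treated as independent sources

print(f"conventional SE: {se_conventional:.3f}")
print(f"item-composition SE: {se_items:.3f}")
print(f"combined SE: {se_total:.3f}")
print(f"illustrative 95% CI under combined SE: "
      f"({beta_hat.mean() - 1.96 * se_total:.3f}, {beta_hat.mean() + 1.96 * se_total:.3f})")
```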