Background: Randomized controlled trials (RCTs) give unbiased estimates of average effects. However, positive effects for the majority of students may mask harmful effects for smaller subgroups, and RCTs often have samples too small to estimate these subgroup effects. In many RCTs, covariate and outcome data are drawn from a larger database. For instance, educational efficacy studies may target state standardized test scores while adjusting for student demographics and prior achievement, all of which are drawn from a state longitudinal data system. In these situations, typically only a subset of students in the database participate in the RCT. We refer to RCT non-participants in the database as the "remnant" from the RCT. The remnant is often much larger than the RCT sample, an advantage that matters most when estimating subgroup effects, but it is not randomized, so including it naively may cause bias. This talk will describe and illustrate a method for incorporating the remnant into the analysis of RCTs that gives unbiased estimates and conservative standard errors and has the potential to improve the precision of effect estimates from an RCT, sometimes dramatically.

Research Questions: The talk will use data from RCTs that were run within an online tutoring system to answer three research questions: (1) To what extent might remnant data improve the precision of effect estimates from RCTs? (2) Can incorporating the remnant improve estimation of subgroup effects? (3) If the remnant is known to be drawn from a different population than the participants in A/B tests, can it still be useful?

Data: E-Trials is a platform that allows researchers to design educational experiments that are then run within the ASSISTments online tutor. Students working on "skill-builder" modules are randomized between two or more conditions, which vary features such as how subject matter is presented, the hints available, and the feedback given to students.
We analyzed data from 277 contrasts between pairs of treatment arms drawn from 68 multi-armed RCTs, including 113,963 students. We also collected rich clickstream-level log data from RCT participants, as well as from a remnant of 193,218 ASSISTments users who did not participate in any of the RCTs we analyzed. Assignment completion was the outcome of interest across RCTs.

Methodology: The methodology (Gagnon-Bartsch et al. 2023) builds on prior literature, including Sales et al. (2018a,b), Wu and Gagnon-Bartsch (2018), and Aronow and Middleton (2013). It involves three steps: (1) Using only remnant data, train a model $f^R(\cdot):\mathbb{R}^p\rightarrow\mathbb{R}$ predicting outcomes $Y$ as a function of a $p$-dimensional covariate matrix $X$. This model can be of any form, including machine learning algorithms, and may be fit and tested multiple times, as long as only remnant data are used. It need not be correct, unbiased, or consistent in any sense, but should ideally yield accurate out-of-sample predictions. (2) Use the fitted model $\hat{f}^R(\cdot)$, along with covariate data from RCT participants $X^{RCT}$, to predict RCT outcomes as $\hat{y}_C=\hat{f}^R(X^{RCT})$. (3) Use $\hat{y}_C$ as a covariate in a design-based covariate-adjusted effect estimator (e.g., Rosenbaum 2002; Wager et al. 2016), perhaps alongside other covariates. In our analysis we use the LOOP estimator of Wu and Gagnon-Bartsch (2018), and refer to the estimator including only $\hat{y}_C$ as a covariate as "ReLOOP" (i.e., Remnant-based LOOP) and the estimator also including other covariates as "ReLOOP+." To carry out the method, we fit an ensemble model (Fig. 1) comprising three deep neural networks (e.g., Goodfellow et al. 2016), each trained on a different set of covariates: data on each of the previous 20 assignments each student started, actions students took on each of the prior 60 days, and aggregated student-level data. These three models were then combined via a fourth feed-forward neural network.
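The three steps above can be sketched in code. The following is a minimal illustration on simulated data: it substitutes ordinary least squares for the neural-network ensemble $f^R(\cdot)$ and a simple difference in mean residuals for the full LOOP estimator, and all variable names and the data-generating process are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Purely illustrative simulated data -- not from the study.
n_rem, n_rct, p = 5000, 200, 3
coef = np.array([1.0, -0.5, 0.3])

# "Remnant": large non-randomized sample with covariates and outcomes.
X_rem = rng.normal(size=(n_rem, p))
y_rem = X_rem @ coef + rng.normal(size=n_rem)

# RCT: smaller randomized sample; the true treatment effect is 2.0.
X_rct = rng.normal(size=(n_rct, p))
T = rng.integers(0, 2, size=n_rct)            # Bernoulli(1/2) assignment
y_rct = X_rct @ coef + 2.0 * T + rng.normal(size=n_rct)

def add_intercept(X):
    return np.column_stack([np.ones(len(X)), X])

# Step 1: train f^R on remnant data only (OLS here; any ML model would do).
beta = np.linalg.lstsq(add_intercept(X_rem), y_rem, rcond=None)[0]

# Step 2: predict RCT outcomes, y_hat_c = f^R(X^RCT).
y_hat_c = add_intercept(X_rct) @ beta

# Step 3: use y_hat_c as a covariate in a design-based adjusted estimator.
# Stand-in for LOOP: a difference in mean residuals, which is unbiased under
# randomization because y_hat_c was built without any RCT outcome data.
resid = y_rct - y_hat_c
tau_hat = resid[T == 1].mean() - resid[T == 0].mean()
print(round(tau_hat, 2))   # close to the true effect of 2.0
```

In the actual analysis, step 1 is the ensemble of neural networks described above and step 3 is LOOP (or LOOP with additional covariates, for ReLOOP+); the sketch only shows how the three steps fit together.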
The same fitted model was used for all subgroup comparisons. To test the performance of our methods when the remnant is demographically distinct from the RCT, we used the "gender guesser" Python package (https://pypi.org/project/gender-guesser/) on students' first names, categorizing students as Male, Female, or Unknown. We fit a model to the subset of the remnant categorized as "Male" (since gender guesser is imperfect and trained using Eurasian names, we assume this group is mostly male and disproportionately white or Asian) and estimated effects for RCT participants from the other categories.

Results: All of the estimators we consider are exactly unbiased, so our results focus on the estimated sampling variance of ReLOOP, ReLOOP+, the difference-in-means estimator (Neyman 1923), and the LOOP estimator without remnant data. Specifically, we consider sampling-variance ratios: since sampling variance scales as $1/n$, these can be thought of as "sample size multipliers"; a variance ratio of $r$ is equivalent to multiplying the sample size by $1/r$. RQ1: Figure 2 shows boxplots of sampling-variance ratios comparing ReLOOP or ReLOOP+ to difference-in-means (labeled "T-Test") or LOOP. Incorporating $\hat{y}_C$ into the analysis never substantially hurt precision, but in many cases it improved precision substantially, by 30% or more. RQ2: Figure 3 shows boxplots of sampling-variance ratios for subgroup effects, and Figure 4 plots these ratios as a function of sample size within the subgroup. On average, ReLOOP+ improved precision for subgroup estimation at all sample sizes. For small subgroup samples the improvement could be dramatic (equivalent to more than doubling the sample size in some cases), but in some other cases it hurt precision noticeably. RQ3: Figure 5 shows results from the experiment using the "male" remnant to estimate effects in the non-"male" subset of the RCTs.
Surprisingly, ReLOOP improved precision roughly equally in the "male" and non-"male" subgroups of the RCTs, and ReLOOP+ improved precision more in the non-"male" subgroup, despite the unrepresentativeness of the remnant.

Conclusion: Machine learning methods have impressive potential, but they can also reproduce biases present in their training samples (e.g., Bolukbasi et al. 2016). Randomized controlled trials give unbiased estimates (for the RCT participants), but these may be imprecise and mask treatment effect heterogeneity. The methods we propose, ReLOOP and ReLOOP+, use machine learning models fit to auxiliary data; the resulting effect estimates are unbiased, like those from RCTs alone, but can also be more precise, especially for subgroup estimates, even when the remnant is itself biased. They can be valuable tools for using all available data to evaluate programs for the majority of students as well as for vulnerable minorities.
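As a closing note on interpreting the results: the "sample size multiplier" reading of the variance ratios follows from variance scaling as $1/n$, and can be made concrete with a one-line calculation (the numbers below are hypothetical, not results from the study).

```python
def sample_size_multiplier(v_plain: float, v_adj: float) -> float:
    """If sampling variance scales as c / n, matching the adjusted
    estimator's variance v_adj with the unadjusted estimator (variance
    v_plain at the same n) requires v_plain / v_adj times as many students."""
    return v_plain / v_adj

# A hypothetical variance ratio v_adj / v_plain = 0.5 is equivalent to
# doubling the RCT's sample size:
print(sample_size_multiplier(1.0, 0.5))  # -> 2.0
```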