Chapter 1 (Introduction). In line with the US EPA's "Toxicity Testing in the 21st Century: A Vision and Strategy" and the Replacement, Refinement, and Reduction (3Rs) goals for animals in toxicity testing, the current two-species preclinical testing paradigm for small molecules outlined in ICH M3 (R2) is under increasing pressure (Krewski et al., 2010). These pressures reflect not only the ethical concerns around animal testing (Prior et al., 2021) but also the lack of translatability of preclinical findings to humans. For example, despite animal testing forming the cornerstone of modern safety assessment (Greaves, 2012), undetected clinical safety risks (25.5%) were found to be the principal cause of drug attrition upon entering phase I clinical trials between 2000 and 2010, followed by lack of efficacy (8.9%) (Waring et al., 2015). The high clinical failure rate is particularly pertinent given the significant cost associated with phase I and II clinical trials (Bender and Cortés-Ciriano, 2020). This lack of translatability has led researchers to question not only the external validity of increasingly standardised animal tests conducted with small sample sizes (Karp and Fry, 2021), but also whether the use of animals for toxicity testing is scientifically valid or is based on mere historical precedent (Monticello, 2015; Zbinden, 1993). Such concerns have driven changes in regulatory attitudes, such as the recent passing of the FDA Modernization Act 2021, which would remove the mandate for animal testing to evaluate the safety of drugs in the USA (Buchanan, 2021), and the European Parliament 2021 resolution to phase out animal testing for research, testing and education (Marshall et al., 2022).
These developments follow in the wake of the 2006 REACH legislation in Europe and the subsequent banning of animal testing for new cosmetic products (Yang et al., 2021), and aim to drive the development of New Approach Methodologies (NAMs) for the toxicity testing of drugs (Ball et al., 2022; Fischer et al., 2020; Parish et al., 2020). These issues have motivated two lines of research inquiry, upon which this thesis builds, that aim to understand and increase the predictivity of clinical safety risks. The first is the potential use of Historical Control Data (HCD) in the form of Virtual Control Groups (VCGs) to replace or supplement Concurrent Control Groups (CCGs) in preclinical toxicity assessment. The scientific motivation for using HCD is its ability to increase the external validity of animal testing: it provides context to findings through a better approximation of the biologically plausible range of outcomes than the small CCG can offer, resulting in an enhanced assessment of treatment-related effects (Kluxen et al., 2021; Pinches et al., 2019; Steger-Hartmann et al., 2020). Moreover, from a 3Rs perspective, the preclinical use of HCD has gained support as it has the potential to reduce animal usage by up to 25% (Steger-Hartmann et al., 2020). However, the idea has seen little uptake, even though the use of HCD is well established in randomised human clinical trials (Berry et al., 2017) and many preclinical studies are routinely performed under similar conditions. Despite several guidelines describing best practices for using HCD preclinically (Greim et al., 2003; Keenan et al., 2009; Kluxen et al., 2021), there is limited understanding of how variability and drift in HCD findings relative to CCGs could lead to differences in study outcomes across a large collection of studies (Steger-Hartmann et al., 2020). Therefore, we first aimed to conduct a large-scale retrospective analysis of the use of HCD in the form of VCGs and its potential impact on study outcomes (Chapter 3).
Secondly, studies have been conducted to statistically quantify whether historical findings observed in preclinical studies are predictive of those later observed in humans, termed concordance analysis (Clark and Steger-Hartmann, 2018). However, publications have often come to contradictory conclusions, with some authors claiming a lack of predictivity (Bailey et al., 2015, 2014, 2013; Van Norman, 2019), whilst others claim that their results support the current regulatory paradigm of animal testing (Monticello et al., 2017; Olson et al., 2000). Nevertheless, studies have consistently demonstrated that observing concordant findings across preclinical animal models is associated with increased translatability of those findings to humans. These studies nonetheless suffered from various limitations, including small dataset size and a lack of control of experimental variables when comparing findings (Bailey et al., 2015). Therefore, we next aimed to quantify the inter-species concordance between preclinical findings whilst implementing methodological improvements and using a larger dataset (Chapter 4). Overall, both lines of inquiry were pursued through retrospective analyses of the eTOX preclinical toxicity dataset.
Chapter 2 (Curation and Characterisation of the Histopathology and Pharmacokinetic Data in the eTOX Dataset) presents a methodology to curate the eTOX dataset, which, at the time of writing, is the largest preclinical toxicity dataset (Briggs et al., 2015; Cases et al., 2014; Sanz et al., 2017). This warranted its own research chapter, as previous studies have highlighted the need to perform a multi-step curation of the dataset before any formal analysis is possible. The methodology included basic quality assurance regarding missing values and term standardisation, as well as the detailed steps required to handle the way in which the histopathology data were aggregated at the study, dose, time point, and severity grade level.
We also discussed key characteristics of the dataset that have potential implications for the results presented in Chapter 3 and Chapter 4.
Chapter 3 (Retrospective Analysis of the Potential Use of Virtual Control Groups in Preclinical Toxicity Assessment using the eTOX Dataset) investigated the potential impact of replacing CCGs with VCGs based on HCD on preclinical toxicological study outcomes, namely the designation of histopathological findings as treatment-related. To this end, we developed a novel methodology whereby statistical predictions of treatment-relatedness, made using either CCGs or VCGs of varying covariate similarity to CCGs, were compared to the designations in the original toxicologist reports, and changes in agreement were used to quantify changes in study outcomes. Generally, the best agreement was achieved when CCGs were replaced with VCGs of the highest level of covariate similarity, that is, the same species, strain, sex, administration route, and vehicle. As HCD of increasing covariate dissimilarity were incorporated into VCGs, we observed increasingly poor agreement and found this to be related to a concurrent increase in incidence rate divergence between HCD and CCGs. This result provided quantitative evidence that the CCG is the most relevant comparator for determining treatment-related findings and, more importantly, systematically demonstrated that using increasingly heterogeneous HCD leads to a divergence in study outcomes compared to using the CCG. We therefore also presented the first identification of study covariates that impact study outcomes when using HCD to replace CCGs, which could help set future suitability criteria for the use of HCD in preclinical toxicity assessment.
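As a rough illustration of the comparison described above, the following Python sketch tests a treated group against either a CCG or a VCG and scores agreement with expert designations. It is a minimal sketch under stated assumptions, not the method used in the thesis: the choice of a one-sided Fisher exact test, the 0.05 threshold, the function names, and all incidence counts are hypothetical.

```python
from math import comb


def fisher_one_sided(affected_treated, n_treated, affected_control, n_control):
    """One-sided Fisher exact p-value for an increased incidence in the treated group.

    Assumption: the source does not name the statistical test; Fisher's exact
    test is used here only as a plausible stand-in.
    """
    total_affected = affected_treated + affected_control
    total = n_treated + n_control
    upper = min(total_affected, n_treated)
    # Hypergeometric tail: probability of observing at least `affected_treated`
    # affected animals in the treated group under the null hypothesis.
    tail = sum(
        comb(total_affected, k) * comb(total - total_affected, n_treated - k)
        for k in range(affected_treated, upper + 1)
    )
    return tail / comb(total, n_treated)


def is_treatment_related(treated, control, alpha=0.05):
    """Statistical call for one finding; `treated` and `control` are (affected, n) pairs."""
    return fisher_one_sided(treated[0], treated[1], control[0], control[1]) < alpha


def agreement(calls, expert_designations):
    """Fraction of findings where the statistical call matches the expert designation."""
    matches = sum(c == e for c, e in zip(calls, expert_designations))
    return matches / len(expert_designations)


# Hypothetical findings: (affected, n) for the treated group, CCG and VCG,
# plus the toxicologist's designation. None of these numbers come from eTOX.
findings = [
    {"treated": (5, 10), "ccg": (0, 10), "vcg": (2, 100), "expert": True},
    {"treated": (2, 10), "ccg": (1, 10), "vcg": (30, 100), "expert": False},
]
expert = [f["expert"] for f in findings]
ccg_calls = [is_treatment_related(f["treated"], f["ccg"]) for f in findings]
vcg_calls = [is_treatment_related(f["treated"], f["vcg"]) for f in findings]
print("agreement with CCG:", agreement(ccg_calls, expert))
print("agreement with VCG:", agreement(vcg_calls, expert))
```

Comparing the two agreement values for VCGs built from progressively less similar HCD is one way to express a change in study outcomes as a single number.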
We next investigated a key choice when sampling HCD from a preclinical dataset, termed the Control Total Assumption, and found that assuming the lack of reporting of a finding to be equivalent to the absence of that finding systematically resulted in poorer agreement and a hypersensitivity towards designating findings as treatment-related. Finally, although it is one of the largest and most comprehensive preclinical datasets, eTOX was found to lack sufficient documentation of study details previously highlighted as important when evaluating the suitability of HCD (Greim et al., 2003; Keenan et al., 2009; Kluxen et al., 2021). Therefore, we also highlight the features that future preclinical datasets require in order to construct adequate VCGs that could potentially comply with future regulatory guidance. A more thorough analysis of study covariates gathered in Standard for Exchange of Nonclinical Data (SEND) format and their impact on study outcomes is currently being considered as part of an eTRANSAFE initiative (Steger-Hartmann et al., 2020). Overall, these results provide preliminary guidance for future industrial research into the VCG concept when sampling HCD from an internal or external database.
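The effect of the Control Total Assumption on the background incidence of a VCG can be illustrated with a small Python sketch. It assumes a hypothetical record layout (`n_controls`, `findings`) and assumes that the alternative to the assumption is to pool only those studies that explicitly report the finding; neither detail is taken from the thesis, and the counts are invented.

```python
def pooled_incidence(studies, finding, assume_unreported_is_absent):
    """Pooled control incidence of `finding` across historical studies.

    If `assume_unreported_is_absent` is True, studies that do not report the
    finding contribute zero affected animals but all of their controls to the
    denominator. Otherwise such studies are skipped (an assumed alternative).
    """
    affected = 0
    animals = 0
    for study in studies:
        count = study["findings"].get(finding)
        if count is None:
            if not assume_unreported_is_absent:
                continue  # no report: exclude the study from the pool
            count = 0  # no report: treat as zero affected animals
        affected += count
        animals += study["n_controls"]
    return affected / animals if animals else None


# Hypothetical control groups; the counts are illustrative only.
historical_controls = [
    {"n_controls": 10, "findings": {"hepatocellular hypertrophy": 2}},
    {"n_controls": 10, "findings": {}},  # finding not reported at all
    {"n_controls": 20, "findings": {"hepatocellular hypertrophy": 2}},
]

with_assumption = pooled_incidence(
    historical_controls, "hepatocellular hypertrophy", True
)
without_assumption = pooled_incidence(
    historical_controls, "hepatocellular hypertrophy", False
)
print(with_assumption, without_assumption)
```

In this toy example, treating unreported findings as absent lowers the pooled background incidence, which makes any given incidence in a treated group look more unusual. This is consistent with the hypersensitivity described above, although the sketch does not reproduce the analysis itself.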