Background: Over the past 20 years, much has been written about the merits and limitations of matching methods, particularly propensity score matching (see, e.g., Stuart, 2010 for a review of matching approaches), and the myriad ways matching can be implemented to identify a comparison group that is similar to the treatment group on observable characteristics. However, relatively little methodological work focuses on using matching methods with multilevel data structures that are common in education research. For example, with students nested within schools, a study might use (a) a multisite design when the study includes multiple schools and treatment assignment is at the student level, or (b) a cluster design when treatment assignment is at the school level. Over the past 15 years, some studies have addressed matching for a multisite design (e.g., Arpino & Cannas, 2016; Lee, Nguyen, & Stuart, 2021; Rickles & Seltzer, 2014) and some studies have addressed matching for a cluster design (e.g., Pimentel, Page, Lenard, & Keele, 2018; Zubizarreta & Keele, 2017). Under both designs, a critical complication is the extent to which, and the way in which, the matching method should account for possible confounding at different levels of the data structure (e.g., student-level and school-level). We are aware of no study, however, that has addressed matching for a multisite cluster design that may include confounding factors at three levels. Such a design is common in education research, where, for example, students are the unit of analysis, classes or teachers are the unit of treatment assignment, and classes/teachers are nested within schools. Throughout this paper, we use the terms unit to refer to the level of analysis (e.g., students at level 1), cluster to refer to the level of treatment assignment (e.g., teachers at level 2), and site to refer to the top level of nesting (e.g., schools at level 3). 
Purpose: This paper addresses the lack of information and guidance about matching in a multisite cluster design by comparing how different matching methods perform within this design and by discussing the trade-offs among matching approaches under different data conditions. The paper is primarily intended for applied researchers who are planning observational studies of interventions at the classroom/teacher level or the school level and who are interested in student-level outcomes. The findings will also inform needs for future methodological work on matching methods for multisite cluster designs.
Setting/Population: Our analysis is based on an empirical data example and a simulation study. For the empirical example, we use data from an evaluation of an alternative teacher training program in a Southeastern metropolitan area, where some teachers of grade 4-8 students entered teaching through this alternative program (treatment condition) and other teachers did not (comparison condition). These data allow us to demonstrate how different matching methods function when estimating the effect of a teacher-level treatment on student achievement. For the simulation study, we generate data to mimic a three-level design with outcomes at level 1 and treatment at level 2. The data generation process is outlined in Table 1. To test the performance of matching methods under different study contexts, we generate data for 16 conditions (see Table 2) that vary the level 2 and level 3 sample sizes and the amount of outcome variance at level 2 and level 3. These parameters were selected because they are the most likely to affect the relative performance of the matching approaches (see, e.g., Arpino & Cannas, 2016; Rickles & Seltzer, 2014). The parameter values were based on the empirical data example to reflect real-world values.
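To make the three-level structure concrete, the following is a minimal sketch of how data with outcomes at level 1 and treatment at level 2 could be simulated. It is not the data generation process in Table 1: the function name `simulate_three_level`, the coefficient values, the logistic selection model, and the default sample sizes and variance shares are all illustrative assumptions.

```python
import numpy as np


def simulate_three_level(n_sites=20, clusters_per_site=6, units_per_cluster=25,
                         icc_cluster=0.10, icc_site=0.10, effect=0.20, seed=0):
    """Simulate units (level 1) nested in clusters (level 2) nested in sites (level 3).

    Every numeric value here is an illustrative assumption, not a value
    taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n_clusters = n_sites * clusters_per_site
    n_units = n_clusters * units_per_cluster

    # Index arrays that encode the nesting.
    site_of_cluster = np.repeat(np.arange(n_sites), clusters_per_site)
    cluster_of_unit = np.repeat(np.arange(n_clusters), units_per_cluster)
    site_of_unit = site_of_cluster[cluster_of_unit]

    # One observed covariate at each level.
    site_x = rng.normal(size=n_sites)
    cluster_x = rng.normal(size=n_clusters)
    unit_x = rng.normal(size=n_units)

    # Treatment is assigned at the cluster level and depends on cluster-
    # and site-level covariates, so both levels can confound the effect.
    logit = 0.5 * cluster_x + 0.5 * site_x[site_of_cluster]
    treat_cluster = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

    # Random effects: the site, cluster, and unit error variances sum to 1,
    # so icc_site and icc_cluster are variance shares conditional on the
    # covariates.
    u_site = rng.normal(scale=np.sqrt(icc_site), size=n_sites)
    u_cluster = rng.normal(scale=np.sqrt(icc_cluster), size=n_clusters)
    e_unit = rng.normal(scale=np.sqrt(1.0 - icc_site - icc_cluster), size=n_units)

    treat = treat_cluster[cluster_of_unit]
    y = (effect * treat
         + 0.3 * unit_x
         + 0.3 * cluster_x[cluster_of_unit]
         + 0.3 * site_x[site_of_unit]
         + u_cluster[cluster_of_unit]
         + u_site[site_of_unit]
         + e_unit)

    return {"y": y, "treat": treat, "unit_x": unit_x,
            "cluster": cluster_of_unit, "site": site_of_unit,
            "cluster_x": cluster_x[cluster_of_unit],
            "site_x": site_x[site_of_unit]}
```

Varying `n_sites`, `clusters_per_site`, `icc_cluster`, and `icc_site` in a sketch like this corresponds to the kinds of conditions the simulation study compares, although the actual values used are those reported in Table 2.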
Research Design: We examine 12 different matching methods that cover the range of approaches currently available to address multisite designs or cluster designs separately (see Table 3). Matching methods for multisite designs differ in the extent to which they prioritize within-site versus between-site matching. Within-site matching is desirable because it ensures that matched units are equivalent on observed and unobserved site-level covariates, but it may not produce balance on unit-level covariates, and it may exclude many treated units if there is limited common support among units in the same site. Between-site matching allows for a larger pool of potential matches, but if an unobserved site-level characteristic influences both treatment and the outcome of interest, it can bias the estimated causal effect. Within-group matching is a compromise between the two: sites are first grouped into strata of similar sites, and matches are then restricted to units in the same group. Matching methods for cluster designs differ based on whether and how the matching process accounts for unit-level and/or cluster-level covariates. We will examine all 12 combinations of matching approaches on our three-level study design, using matching specifications tailored to each method (e.g., propensity score estimation, optimal matching, caliper setting). For all methods, we pair matching with regression-based covariate adjustment, following best practices.
Results: Our results will compare the relative performance of the methods along four dimensions: (1) match rates at all levels, (2) covariate balance at all levels, (3) bias in the estimated mean treatment effect, and (4) root mean squared error.
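The contrast between within-site and between-site matching, and the match-rate and balance criteria, can be illustrated with a minimal sketch. It uses greedy 1:1 nearest-neighbor matching of clusters on a single score (for example, an estimated cluster-level propensity score) with a caliper. This is a simplified stand-in chosen for brevity, not one of the 12 methods in Table 3, and the function names and default caliper are hypothetical.

```python
import numpy as np


def greedy_match(score, treat, site, within_site=True, caliper=0.5):
    """Greedy 1:1 nearest-neighbor matching of clusters without replacement.

    score: one matching score per cluster (assumed to be estimated already)
    treat: 1 for treated clusters, 0 for comparison clusters
    site:  site identifier for each cluster
    Returns a list of (treated_index, comparison_index) pairs.
    """
    score = np.asarray(score, dtype=float)
    treat = np.asarray(treat)
    site = np.asarray(site)
    available = set(np.flatnonzero(treat == 0).tolist())
    pairs = []
    for t in np.flatnonzero(treat == 1).tolist():
        # Within-site matching only considers comparison clusters in the
        # treated cluster's own site; between-site matching considers all.
        candidates = [c for c in sorted(available)
                      if not within_site or site[c] == site[t]]
        if not candidates:
            continue  # no comparison cluster left: treated cluster is dropped
        best = min(candidates, key=lambda c: abs(score[c] - score[t]))
        if abs(score[best] - score[t]) <= caliper:
            pairs.append((t, best))
            available.remove(best)
    return pairs


def match_rate(pairs, treat):
    """Share of treated clusters that received a match."""
    return len(pairs) / int(np.sum(np.asarray(treat) == 1))


def standardized_mean_difference(x, pairs):
    """Balance on covariate x in the matched sample (needs at least 2 pairs)."""
    x = np.asarray(x, dtype=float)
    xt = x[[t for t, _ in pairs]]
    xc = x[[c for _, c in pairs]]
    pooled_sd = np.sqrt((xt.var(ddof=1) + xc.var(ddof=1)) / 2.0)
    return (xt.mean() - xc.mean()) / pooled_sd
```

Calling `greedy_match` twice on the same clusters, once with `within_site=True` and once with `within_site=False`, and comparing `match_rate` and `standardized_mean_difference` across the two results shows the trade-off described above: restricting matches to the same site tends to lower the match rate, while pooling across sites gives up the guarantee of equivalence on site-level characteristics.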
We hypothesize that approaches that directly account for both unit-level and cluster-level covariates and that match within sites will have better covariate balance and lower bias than other approaches when cluster-level and site-level factors are stronger confounders (larger intraclass correlations) and the sample size is larger, but that they may result in worse match rates (and reduced generalizability). Conversely, we hypothesize that approaches that focus only on unit-level or cluster-level covariates and that allow for between-site matches will perform better than other approaches when cluster-level and site-level factors are relatively weak confounders (smaller intraclass correlations) and the sample size is smaller.
Conclusions: This paper addresses a limitation in the current matching literature by providing researchers with information about different matching approaches for a multisite cluster observational study, as well as considerations for selecting an approach. The findings will help applied researchers select an appropriate matching approach for a common three-level design found in education.