Many classroom-observation instruments have been developed (e.g., Gleason et al., 2017; Nava et al., 2019; Sawada et al., 2002), but few studies published in refereed journals have rigorously examined the quality of the ratings and the instruments using measurement models. For example, Gleason et al. developed a mathematics classroom observation protocol, provided evidence of the content validity and internal structure of the instrument, and reported interrater reliability, which is often considered the primary measure of reliability for classroom-observation instruments in mathematics education research. Hill et al. (2012) argued that reporting only interrater reliability is not sufficient and demonstrated the use of generalizability theory (Brennan, 2001) to improve systems of teacher observation. More recently, Wind and Jones (2019) illustrated the use of the many-facet Rasch model (Linacre, 1989) to assess the quality of a teacher evaluation system using real data. In the many-facet Rasch model, all facets interact to produce the final ratings; a rater's apparently aberrant behavior, for example, may be caused by problematic items in the observation scale. Therefore, to comprehensively assess ratings of teachers' classroom teaching, all facets in the model need to be examined. The present study aims to assess the quality of a classroom observation instrument, raters' rating behavior, and teachers' teaching quality using the many-facet Rasch model.

Instrument: The Mathematics Cognition, Language, Interaction, and Problem Solving (M-CLIPS; Riddell, Bray, & Schoen, 2021) classroom-observation scale consists of 11 rubrics/items measuring four aspects of classroom teaching that are consistent with a Cognitively Guided Instruction (CGI) approach to teaching: 1) problem solving; 2) attention to student thinking; 3) teacher-student interaction; and 4) communication. Each item uses a 5-point, Likert-type scale ranging from 0 to 4, with a rating of 4 indicating that teacher practice is very consistent with the topic of the rubric and a rating of 0 indicating that instructional practice is antithetical to the topic of the rubric.

Teachers and Raters: Forty-seven first- or second-grade teachers/classrooms were selected at random from among the available teachers who consented to participate in a two-year, randomized controlled trial of a teacher professional-development program. A team of 16 raters was trained to use M-CLIPS. The raters represented a diverse range of experience and formal education in mathematics, mathematics education, and elementary teaching.

Research Design: The judging plan was based on several considerations. First, to use the many-facet Rasch model, the elements of the facets must be linked to one another either directly or indirectly. Second, based on practicality and the desired precision of the estimates of teachers' teaching ability, we had each video scored by four raters. Third, because some raters were novices, we paired experienced raters with novice raters at the beginning of the video observations to provide additional training. Based on these considerations, we adopted a hybrid rating design. In the first four rating sessions, each group included at least one experienced rater, and two members of each team rotated each session. For the remaining 12 sessions, we used a partially crossed, incomplete block design in which raters were randomly assigned to the teaching videos.
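To make the judging plan concrete, the short Python sketch below generates one hypothetical assignment for the partially crossed, incomplete block sessions, with each video scored by four randomly selected raters. The numbers of raters and videos match the study, but the assignment itself, the seed, and the labels are illustrative only and are not the study's actual judging plan.

import random

RATERS = [f"R{i:02d}" for i in range(1, 17)]   # 16 trained raters
VIDEOS = [f"V{i:02d}" for i in range(1, 48)]   # 47 teacher/classroom videos
RATERS_PER_VIDEO = 4

random.seed(2021)  # fixed seed so the illustration is reproducible

# Partially crossed, incomplete block assignment: each video is scored by a
# random subset of four raters, so raters become linked through shared videos.
assignment = {video: sorted(random.sample(RATERS, RATERS_PER_VIDEO))
              for video in VIDEOS}

# Preview the first three assignments
for video, raters in list(assignment.items())[:3]:
    print(video, raters)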
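The analyses reported below rest on a three-facet formulation of the many-facet Rasch model. As a sketch, in the commonly used rating-scale form (Linacre, 1989), the model can be written as

\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k,

where P_{nijk} is the probability that the teaching in video n receives a rating in category k on item i from rater j, P_{nij(k-1)} is the probability of the adjacent lower category, B_n is the teaching quality reflected in video n, D_i is the difficulty of item i, C_j is the severity of rater j, and F_k is the threshold between categories k-1 and k. The notation here is illustrative; a partial-credit form, in which thresholds vary by item, is an equally common specification.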
Data Analysis: We used the many-facet Rasch model (Linacre, 1989) to analyze the video rating data. Specifically, we examined the dimensionality of the scale; the hierarchical structure of rater severity, person ability, and item difficulty; the reliability and fit statistics of the three facets; and the interaction between rater severity and rating session.

Results: Analysis of the Wright map (Figure 1) indicated a small spread in rater severity. In contrast, the estimated measures of instructional practice varied widely, with logit measures ranging from -6.0 to 1.7. The rater fit statistics (Table 1) showed that one rater had large infit and outfit values, suggesting aberrant rating behavior. Item difficulty showed slightly more variation than rater severity. The most difficult item was "PS1: Innovation" and the easiest item was "LN1: Display." Examination of the item fit statistics (Table 2) showed that all items fit well except item "LN1"; its large fit statistics suggest potential problems with this item. Inspection of the item characteristic curves of the 11 items provided further evidence that some categories of item "LN1" had disordered category measures and thresholds. Analysis of the interaction between rater severity and rating session (Figure 2) showed that the variation in rater severity decreased slightly over time. Dimensionality analysis of the residuals suggested a single dimension underlying the instrument. Reliability statistics showed that rater reliability, video reliability, and item reliability were 0.78, 0.98, and 0.98, respectively. These and other results, and how we arrived at them, will be explained in more detail and with visualizations during the brief presentation.

Discussion: Our results suggest that the M-CLIPS scale can be considered unidimensional and that one item may need to be revised. The rater fit statistics also identified unexpected ratings from some raters, and the model allows inspection of trends in rater severity (or leniency) over time. The results also showed that the rating scores had high reliability and that teachers differed substantially in their classroom teaching practices. The many-facet Rasch model can yield important insight into, as the name suggests, many facets related to the measurement of classroom instruction, including item quality, rater severity/leniency, and much more. The method represents an important alternative to standard methods of estimating reliability among raters, such as kappa or interrater agreement. This is especially important for high-inference, rater-mediated observation protocols, which are used frequently in both research and practice.
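As a brief illustration of the fit statistics referenced above (Tables 1 and 2), the following Python sketch shows how infit and outfit mean-square statistics are conventionally computed from model residuals; the example data and the function name are hypothetical and are not taken from the study's analysis.

import numpy as np

def fit_mean_squares(observed, expected, variance):
    # observed: ratings involving one facet element (e.g., one rater)
    # expected, variance: model-implied mean and variance of each rating
    z_sq = (observed - expected) ** 2 / variance       # squared standardized residuals
    outfit = z_sq.mean()                               # unweighted mean square
    infit = (variance * z_sq).sum() / variance.sum()   # information-weighted mean square
    return infit, outfit

# Hypothetical ratings from one rater and the corresponding model expectations
obs = np.array([3.0, 2.0, 4.0, 1.0, 3.0])
exp_vals = np.array([2.6, 2.1, 3.5, 1.4, 2.9])
var_vals = np.array([0.90, 1.00, 0.70, 0.80, 0.95])
infit_ms, outfit_ms = fit_mean_squares(obs, exp_vals, var_vals)
print(f"infit MS = {infit_ms:.2f}, outfit MS = {outfit_ms:.2f}")

Mean-square values near 1.0 indicate ratings consistent with model expectations, while values substantially greater than 1.0 indicate more unexpected ratings than the model predicts, which is the usual interpretation of the large values noted above for one rater and for item "LN1."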