Patient safety has remained a national concern since the Institute of Medicine's landmark report on medical errors (Kohn, Corrigan, and Donaldson 2000). The Agency for Healthcare Research and Quality (AHRQ) recently released a methodology, the Patient Safety Indicators (PSIs), to screen for potential patient safety events using administrative data from acute care hospitals. The PSIs are an attractive tool because they use readily available data and standardized algorithms; they are risk adjusted and therefore potentially useful for benchmarking; and they are easy to implement using free, downloadable software (AHRQ 2007a, 2008). The evidence published to date suggests that the PSIs generally have high specificity (i.e., low false-positive rates) and modest sensitivity (i.e., moderate false-negative rates) (Gallagher, Cen, and Hannan 2005b; Zhan et al. 2007; Houchens, Elixhauser, and Romano 2008). Although several recent studies have used the PSIs to identify significant gaps and variations in safety (Romano et al. 2003; Rosen et al. 2005, 2006), both AHRQ and the user community still regard the PSIs principally as screening tools to flag potential safety-related events rather than as definitive measures (AHRQ 2007a). The increasing use of the PSIs for public reporting and pay-for-performance (HealthGrades 2008; Premier Inc. 2008) makes it imperative that they undergo more rigorous evaluation. Although previous studies have demonstrated the face, content, and predictive validity of the PSIs, there is insufficient evidence of their criterion validity to support some of these new applications. The few published studies examining the criterion validity of the PSIs are limited by small sample sizes or the lack of a true gold standard (Weller et al. 2004; Gallagher, Cen, and Hannan 2005a, b; Shufelt, Hannan, and Gallagher 2005; Polancich, Restrepo, and Prosser 2006; Zhan et al. 2007).
As a national leader in patient safety (Leape 2005), the Veterans Health Administration (VA) is well positioned to evaluate the criterion validity of the PSIs. The VA has several data sources that can serve as valuable resources for this endeavor. VA administrative data, necessary for estimating risk-adjusted PSI rates, contain detailed diagnostic and utilization information on inpatient episodes of care. The VA also collects rich chart-abstracted data on major noncardiac surgeries through the National Surgical Quality Improvement Program (NSQIP) (Khuri et al. 1998). NSQIP was designed to promote continuous quality monitoring and improvement by providing reliable, valid, comparative information regarding surgical outcomes to all facilities performing major noncardiac surgery (Daley et al. 1997; Khuri et al. 1998). NSQIP data were used as a “gold standard” for identifying postoperative complications in one previous study (Best et al. 2002), although the mapping of clinically defined events to ICD-9-CM complication codes was somewhat inexact (Romano 2003). The purpose of this paper is to evaluate the criterion validity of surgical PSIs that match NSQIP adverse events. Our specific objectives were to (1) estimate the sensitivity, specificity, positive predictive value (PPV), and likelihood ratio of the PSIs using NSQIP data as the gold standard; and (2) improve the sensitivity and PPV of the PSIs, if possible, through revisions to PSI algorithms. If the PSIs demonstrate high criterion validity, then public reporting and pay-for-performance activities using these indicators will likely multiply.