Background: Research clearinghouses (CHs) seek to play an important role in identifying successful social programs and practices. The evidence-based policy movement has grown tremendously over the last few decades and has become institutionalized in the US in the form of CHs. CHs (1) specify standards for evaluating the quality of evidence from single studies; (2) judge how well each study meets these quality criteria to determine the weight its results should be given; and (3) when more than one acceptable study exists, synthesize causal estimates to reach a conclusion about whether an intervention is clearly effective, merely promising, not yet demonstrated to be even promising, or ineffective or even harmful.

Purpose: This paper explores the construct validity of "evidence-based" in the social sciences, with a focus on education and the socio-behavioral development of youth. It asks how much the effectiveness ratings of a single program differ across CHs, and whether the size of the difference reflects disparate understandings of the meaning of "evidence-based". Specifically, this paper (1) compares the scientific criteria 24 CHs use to determine intervention effectiveness, (2) estimates how comparable CH ratings of the same program are in the fields of education and socio-behavioral development, and (3) probes why CHs differ in their judgments of how evidence-based a program is.

Data and Methods: We found a total of 43 CHs in the US and UK. Of these, we kept the 24 that conducted effectiveness ratings of programs or policies, did not borrow their standards from other CHs, and could be found on the internet. CHs vary greatly in the policy domains they focus on, and some deal with multiple domains. We identified 12 CHs that rated educational interventions and 13 that explicitly deal with socio-behavioral development. We first compared the methodological criteria these 24 CHs use to rate the quality of individual studies as well as the standards of evidence they use to declare a program evidence-based. To estimate the consistency in effectiveness ratings, we collected the names and ratings of individual programs from CH websites. The sample consists of 1,359 programs in the field of education and 2,525 in socio-behavioral development. Since each CH uses its own rating scheme, we rescaled the ratings of all CHs to a three-point scale: 1 (Recommended), 2 (Promising), and 3 (Not Recommended). Ratings are fully consistent when both CHs give a program the same score (both 1, both 2, or both 3). In contrast, when a program receives ratings of 1 (Recommended) and 3 (Not Recommended), it is considered fully inconsistent. Ratings of 1 and 2 indicate moderate consistency, as the CHs differ only in the confidence with which they endorse the program; moderate inconsistency occurs when a program is rated 2 and 3 (this classification is illustrated in the sketch below). Thereafter, to understand why ratings of the same program differ, we conducted case studies of five programs in each policy field and probed four hypotheses for differences in ratings: that the CHs included different studies, examined different versions of the program, analyzed effects on different outcomes, or applied different standards of evidence when judging effectiveness.
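To make the harmonization scheme concrete, here is a minimal sketch (in Python) of how rescaled rating pairs could be classified into the four consistency categories described above. This is not the authors' code; the function names and category labels are illustrative assumptions that follow the scheme stated in the Methods.

```python
from itertools import combinations

# Rescaled ratings: 1 = Recommended, 2 = Promising, 3 = Not Recommended.
# Illustrative sketch only; names and labels follow the Methods description,
# not the authors' actual analysis code.

def pair_consistency(r1: int, r2: int) -> str:
    """Classify a pair of clearinghouse ratings for the same program."""
    gap = abs(r1 - r2)
    if gap == 0:
        return "full consistency"        # both 1, both 2, or both 3
    if gap == 2:
        return "full inconsistency"      # ratings of 1 and 3
    if {r1, r2} == {1, 2}:
        return "moderate consistency"    # endorsed, but with different confidence
    return "moderate inconsistency"      # ratings of 2 and 3

def program_consistency(ratings: list[int]) -> list[str]:
    """Classify every pair of ratings a program received (two or more CHs)."""
    return [pair_consistency(a, b) for a, b in combinations(ratings, 2)]

# Example: a program rated by three clearinghouses.
print(program_consistency([1, 2, 3]))
# ['moderate consistency', 'full inconsistency', 'moderate inconsistency']
```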
Results: Comparing the standards of the 24 CHs, we find that all prefer RCTs over other study designs. Even though CHs accept quasi-experimental studies, these are almost always rated lower than RCTs (though some exceptions exist). There is some variation in the scientific criteria CHs use, especially regarding methodological attributes such as replication, post-intervention follow-up, and evaluator independence. We find that only about 17% of education programs and 18% of socio-behavioral development programs are rated by two or more CHs. When two CHs rate an education program, approximately 18% are found effective by both, i.e., both CHs rated it 1, both rated it 2, or one rated it 1 and the other 2. Full consistency in rating pairs stands at around 30% for most education programs. However, CHs agree more on programs that they do not recommend, i.e., programs that receive two ratings of 3; comparatively few programs are endorsed by both CHs at the highest level. Moderate consistency (ratings of 1 and 2) is less common, ranging from 8.5% to 20%, but its probability increases with the number of clearinghouse ratings per program. Moreover, at least a quarter of programs show maximal disagreement (ratings of 1 and 3). In the field of socio-behavioral development, we find a relatively higher level of full consistency, though the other patterns of results replicate: 55% of programs rated by 2 CHs show full consistency, as do nearly 48% of programs rated by 3 CHs, and 38% and 37% of programs rated by 4 and 5 CHs, respectively. Again, agreement is most common for programs that are not endorsed and least common for programs that are endorsed. Among programs rated by 2 CHs, nearly 42% received ratings of 3 from both CHs and only 4% were endorsed by both with ratings of 1. However, unlike in education, maximal disagreement in ratings is less common, ranging from 9.8% to 15.6% across the entire sample. From the case studies, we concluded that the most likely explanation is that CHs differ in the design features they require before giving a program a high rating; the bar for attaining the highest rating differs. In particular, requirements for replication and for the temporal persistence of effects can play an important role.

Discussion: Evidence-based language is common in public rhetoric, but we find that it may currently be more an aspiration than a matter of scientific consensus. The levels of consistency we found in program ratings are perhaps lower than one might expect. Complete consensus on methodological issues is neither possible nor desirable, but it may be beneficial to reach partial consensus on which methodological features of the evidence are necessary to recommend a program as effective.