Levels of evidence

Authors :: Maurits Van Tulder
Chris Maher
Rob Herbert
Lex Bouter
Henrica De Vet
Source :: Journal of Clinical Epidemiology. 56:917-918
Publication Year :: 2003
Publisher :: Elsevier BV, 2003.
Abstract: In systematic reviews, statistical pooling is not always possible because of inadequate reporting of the results of original studies. In these reviews, a qualitative analysis using levels of evidence may be performed to summarize the evidence and to formulate conclusions [1]. This approach is explicit and reproducible, because it explains the exact meaning of labels like strong, moderate, and limited evidence. Over time, different sets of levels of evidence have been published [2–5]. All these sets are arbitrary and based on common sense at best. Ferreira et al. [6] show in their article that these different criteria may lead to different conclusions. They advise readers to be cautious when interpreting conclusions of systematic reviews that use levels of evidence. We fully agree, but not merely because different sets of criteria exist. If only one set of levels of evidence existed, this warning would be even more necessary. The use of levels of evidence is essentially an arbitrary and subjective way of summarizing evidence. Typically, levels of evidence take into account the quality of the studies and the consistency of the results. In contrast to meta-analysis (statistical pooling), levels of evidence do not include the size of the effect. Interpreting both the conclusions of a study and its methodologic quality involves some subjectivity. For example, grading the conclusions is difficult when there is borderline statistical significance, when a positive effect is observed for only some of the many outcome measures studied, or when the reviewers do not agree with the authors’ conclusions. A similar example of arbitrary definitions is the interpretation of kappa values. Ferreira et al. [6] use the benchmarks proposed by Landis and Koch [7] to interpret their kappa values. Other often-used benchmarks are the ones proposed by Fleiss [8], which would have slightly changed the conclusions about the disagreements between different sets of levels of evidence in Ferreira’s paper [6]. The existence of different rating systems emphasizes their subjectivity. Kappa
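
The point about kappa benchmarks can be illustrated with a minimal sketch. The abstract does not state the cutoffs, so the thresholds below are an assumption: they are the values commonly attributed to Landis and Koch and to Fleiss. The function names `landis_koch`, `fleiss`, and `compare` are hypothetical and chosen only for this illustration.

```python
def landis_koch(kappa: float) -> str:
    """Label a kappa value using cutoffs commonly attributed to Landis and Koch.

    Assumed cutoffs (not given in the abstract):
    <0 poor, 0.00-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate,
    0.61-0.80 substantial, 0.81-1.00 almost perfect.
    """
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


def fleiss(kappa: float) -> str:
    """Label a kappa value using cutoffs commonly attributed to Fleiss.

    Assumed cutoffs (not given in the abstract):
    <0.40 poor, 0.40-0.75 fair to good, >0.75 excellent.
    """
    if kappa < 0.40:
        return "poor"
    if kappa <= 0.75:
        return "fair to good"
    return "excellent"


def compare(kappas):
    """Return (kappa, Landis-Koch label, Fleiss label) for each input value."""
    return [(k, landis_koch(k), fleiss(k)) for k in kappas]


if __name__ == "__main__":
    # Illustrative kappa values only; these are not results from any study.
    for k, lk_label, fl_label in compare([0.30, 0.55, 0.70, 0.78]):
        print(f"kappa={k:.2f}  Landis-Koch: {lk_label:<15} Fleiss: {fl_label}")
```

Under these assumed cutoffs, the same kappa value can receive a different verbal label depending on the benchmark chosen, which is the kind of arbitrariness the commentary describes.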