Objectives: To review, for acute abdominal pain (AAP), the diagnostic accuracies of decision tools (DTs) and of doctors aided by DTs compared with those of unaided doctors. Also, to evaluate the impact of providing doctors with an AAP DT on patient outcomes, clinical decisions and actions; the factors likely to determine the usage rates and usability of a DT; and the associated costs and likely cost-effectiveness of these DTs in routine use in the UK. Data sources: Electronic databases were searched up to 1 July 2003. Review methods: Data from each eligible study were extracted. Potential sources of heterogeneity were extracted for both questions. For the accuracy review, meta-analysis was conducted. Among studies comparing the diagnostic accuracies of DTs with those of unaided doctors, error rate ratios estimated how the false-negative and false-positive rates of DTs differed from those of unaided doctors. Pooled error rate ratios and 95% confidence intervals (CIs) for false-negative rates and false-positive rates were computed. Metaregression was used to explore heterogeneity. Results: Thirty-two studies from 27 articles, all based in secondary care, were eligible for the review of DT accuracies, while two were eligible for the review of the accuracy of hospital doctors aided by DTs. Sensitivities and specificities for DTs ranged from 53 to 99% and from 30 to 99%, respectively. Those for unaided doctors ranged from 64 to 93% and from 39 to 91%, respectively. Thirteen studies reported false-positive and false-negative rates for both DTs and unaided doctors, enabling a direct comparison of their performance. In random effects meta-analyses, DTs had significantly lower false-positive rates (error rate ratio 0.62, 95% CI 0.46 to 0.83) than unaided doctors. DTs may have higher false-negative rates than unaided doctors (error rate ratio 1.34, 95% CI 0.93 to 1.93). Significant heterogeneity was present.
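As an illustration of the error rate ratio described above, the following Python sketch derives false-negative and false-positive rates from a 2x2 table and expresses a DT's error rate relative to that of unaided doctors, with a log-scale Wald interval. The counts are hypothetical, the function names are introduced here for illustration only, and the interval assumes two independent samples; it is a single-study sketch, not the random effects pooling used in the review.

```python
import math

def error_rates(tp, fp, fn, tn):
    """Return (false_negative_rate, false_positive_rate) from a 2x2 table."""
    fnr = fn / (tp + fn)  # 1 - sensitivity
    fpr = fp / (fp + tn)  # 1 - specificity
    return fnr, fpr

def rate_ratio_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Ratio of two proportions (a over b) with a log-scale Wald interval.

    Assumes the two groups are independent samples (an assumption made
    for this sketch; paired designs need a different variance).
    """
    ratio = (events_a / n_a) / (events_b / n_b)
    se_log = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper

# Hypothetical 2x2 counts, NOT taken from the review.
dt = {"tp": 80, "fp": 10, "fn": 20, "tn": 90}
doctor = {"tp": 85, "fp": 20, "fn": 15, "tn": 80}

dt_fnr, dt_fpr = error_rates(**dt)
doc_fnr, doc_fpr = error_rates(**doctor)

# Error rate ratio below 1 means the DT makes fewer errors of that type.
fpr_ratio, fpr_lo, fpr_hi = rate_ratio_ci(
    dt["fp"], dt["fp"] + dt["tn"], doctor["fp"], doctor["fp"] + doctor["tn"]
)
fnr_ratio, fnr_lo, fnr_hi = rate_ratio_ci(
    dt["fn"], dt["tp"] + dt["fn"], doctor["fn"], doctor["tp"] + doctor["fn"]
)

print(f"False-positive error rate ratio: {fpr_ratio:.2f} ({fpr_lo:.2f} to {fpr_hi:.2f})")
print(f"False-negative error rate ratio: {fnr_ratio:.2f} ({fnr_lo:.2f} to {fnr_hi:.2f})")
```

Because the interval is built on the log scale, the point estimate sits at the geometric mean of its limits, which is one way to sanity-check a reported ratio against its CI.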
Two studies compared the diagnostic accuracies of doctors aided by DTs with those of unaided doctors. In a multiarm cluster randomised controlled trial (n = 5193), the diagnostic accuracy of doctors not given access to DTs (sensitivity 28.4%, specificity 96.0%) was not significantly worse than that of three groups of aided doctors (sensitivities of 42.4 to 47.9% and specificities of 95.5 to 96.5%). In an uncontrolled before-and-after study (n = 1484), the sensitivities of aided and unaided doctors were 95.5% and 91.5% (p = 0.24), and the specificities were 78.1% and 86.4% (p < 0.001), respectively. The metaregression of DTs showed that prospective test-set validation at the site of the tool's development was associated with considerably higher diagnostic accuracy than prospective test-set validation at an independent centre [relative diagnostic odds ratio (RDOR) 8.2; 95% CI 3.1 to 21.7]. It also showed that the earlier the year in which the study was performed, the higher the performance (RDOR 0.88, 95% CI 0.83 to 0.92); that performance was better when developers evaluated their own DT than when independent evaluators carried out the study (RDOR 3.0, 95% CI 1.3 to 6.8); and that there was no evidence of association between other quality indicators and DT accuracy. The one study eligible for the impact review, a four-arm cluster randomised trial (n = 5193), showed that hospital admission rates for patients seen by doctors not allocated to a DT (42.8%) were significantly higher than those for patients seen by doctors allocated to one of three combinations of decision support (34.2 to 38.5%) (p < 0.001). There was no evidence of a difference among the four trial arms in perforation rates (p = 0.19) or negative laparotomy rates (p = 0.46). Usage rates of DTs by doctors in accident and emergency departments ranged from 10 to 77% in the six studies that reported them. Possible determinants of usability include the reasoning method used, the number of items used and the output format.
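To make the relative diagnostic odds ratio (RDOR) reported above more concrete, the sketch below computes a diagnostic odds ratio (DOR) from a sensitivity and specificity and then takes the ratio of two DORs. The input values are hypothetical and the function names are introduced here for illustration; in the review the RDORs come from a metaregression across studies, not from a direct comparison of two figures as shown here.

```python
import math

def diagnostic_odds_ratio(sensitivity, specificity):
    """DOR = [sens / (1 - sens)] * [spec / (1 - spec)].

    Inputs are proportions strictly between 0 and 1.
    """
    return (sensitivity / (1 - sensitivity)) * (specificity / (1 - specificity))

def relative_dor(dor_a, dor_b):
    """RDOR: how many times larger the DOR is in setting a than in setting b."""
    return dor_a / dor_b

# Hypothetical accuracies, NOT taken from the review.
dor_development_site = diagnostic_odds_ratio(0.90, 0.80)
dor_independent_site = diagnostic_odds_ratio(0.80, 0.60)

rdor = relative_dor(dor_development_site, dor_independent_site)
print(f"RDOR (development site vs independent centre): {rdor:.1f}")
```

An RDOR above 1 indicates higher apparent accuracy in the first setting; an RDOR below 1, such as a per-covariate value of 0.88, indicates lower accuracy as the covariate increases.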
A deterministic cost-effectiveness comparison indicated that, under the stated assumptions, a paper checklist is likely to be 100 to 900 times more cost-effective than a computer-based DT. Conclusions: With their significantly greater specificity and lower false-positive rates than doctors, DTs are potentially useful in confirming a diagnosis of acute appendicitis, but not in ruling it out. The clinical use of well-designed, condition-specific paper or computer-based structured checklists is promising as a way to improve patient outcomes, subject to further research. © Queen's Printer and Controller of HMSO 2006. All rights reserved.