
Artificial intelligence for fracture detection on computed tomography: a comprehensive systematic review and meta-analysis of diagnostic test accuracy in non-commercial and commercial solutions.

February 7, 2026

Authors

Husarek J, Fuchss AMC, Ruder TD, Mougiakakou S, Exadaktylos A, Wahedi K, Müller M

Affiliations (5)

  • Department of Emergency Medicine, Inselspital, Bern University Hospital, University of Bern, Rosenbühlgasse 27, 3010, Bern, Switzerland. [email protected].
  • Department of Emergency Medicine, Inselspital, Bern University Hospital, University of Bern, Rosenbühlgasse 27, 3010, Bern, Switzerland.
  • University of Bern, Bern, Switzerland.
  • Department of Diagnostic, Interventional and Pediatric Radiology, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland.
  • ARTORG Center for Biomedical Engineering Research, University of Bern, Bern, Switzerland.

Abstract

Rising patient volumes, the increasing use of computed tomography (CT) imaging in emergency departments and the resulting prolonged waiting times highlight the urgent need for efficient and accurate diagnostic tools, especially given that the number of experienced healthcare professionals is not increasing at the same pace. Artificial intelligence (AI) has emerged as a promising tool to support fracture detection on CT scans, with the potential to streamline diagnostic workflows in emergency care. However, concerns exist regarding dataset bias, limited external testing, and methodological variability. This systematic review and diagnostic test accuracy (DTA) meta-analysis aimed to comprehensively assess the diagnostic accuracy of AI-driven fracture detection solutions, with a particular focus on the effect of testing strategy, cohort composition and commercial availability on diagnostic accuracy. We followed the Cochrane Handbook for Systematic Reviews of DTA and reported according to PRISMA-DTA guidelines. We systematically searched Embase, MEDLINE, Cochrane Library, Web of Science, and Google Scholar for studies published from January 2010 onward, complemented by citation chasing and manual searches for commercially available AI fracture detection solutions (CAAI-FDS). Two reviewers independently conducted study selection, data extraction, and risk of bias assessment using a modified QUADAS-2 tool. Statistical analysis was conducted using STATA 18.1 and the -metadta- command. Primary analyses evaluated diagnostic accuracy (sensitivity and specificity) of stand-alone AI based on (1) cohort type (selected vs. unselected), (2) test dataset origin (internal vs. external), and (3) level of analysis (patient-wise, vertebra-wise, rib-wise). Secondary analyses explored accuracy differences according to (1) CAAI-FDS, (2) anatomical region and (3) reader type (stand-alone AI, human unaided, human aided by AI).
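The sensitivity and specificity pooled in the analyses above each reduce to a simple ratio over a study's 2×2 confusion table. A minimal Python sketch of these definitions (illustrative only, with made-up counts; the review itself used Stata's -metadta- for bivariate pooling):

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: fraction of actual fractures the reader/AI detects."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True negative rate: fraction of fracture-free cases correctly cleared."""
    return tn / (tn + fp)

# Hypothetical counts from one study's patient-wise 2x2 table
tp, fn, fp, tn = 85, 15, 8, 92
print(f"sensitivity = {sensitivity(tp, fn):.2f}")  # 0.85
print(f"specificity = {specificity(tn, fp):.2f}")  # 0.92
```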
Forest plots visualized results, and heterogeneity was measured using generalized I<sup>2</sup> statistics. Of 7683 identified articles, 44 studies were included in the meta-analysis, and 14 CAAI-FDS were identified. Primary analyses of stand-alone AI showed moderate sensitivity (0.85, 95% CI: 0.77, 0.90) and good specificity (0.92, 95% CI: 0.87, 0.95) in unselected patient cohorts, whereas selected cohorts achieved slightly higher sensitivity (0.89, 95% CI: 0.80, 0.94). Diagnostic accuracy was higher when studies used internal test datasets (sensitivity 0.94, 95% CI: 0.88, 0.97; specificity 0.91, 95% CI: 0.86, 0.94) compared to external test datasets (sensitivity 0.85, 95% CI: 0.77, 0.91; specificity 0.92, 95% CI: 0.89, 0.95). Vertebra- and rib-wise analyses achieved higher specificity (0.98) compared to patient-wise analysis (0.92, 95% CI: 0.89, 0.95), although sensitivity remained moderate across all levels (0.85-0.89). Secondary analyses showed variability among CAAI-FDS (sensitivities 0.68-0.80; specificities 0.87-0.97) and by anatomical region, with the highest sensitivity for skull (0.90, 95% CI: 0.85, 0.93), rib (0.92, 95% CI: 0.83, 0.96) and pelvis fractures (1.00), and the lowest for spine fractures (0.82, 95% CI: 0.73, 0.88). Stand-alone AI showed moderate to good diagnostic accuracy, slightly outperforming unaided human readers, with minimal further improvement when humans were aided by AI. While AI demonstrates promising diagnostic accuracy in fracture detection, study biases, stringent patient selection, and the lack of external testing raise concerns about real-world applicability. Commercially available solutions tend to underperform compared to pooled study results, highlighting the gap between research settings and clinical practice. Future efforts should focus on reducing bias, improving generalizability and robustness, and conducting prospective trials to assess AI's true impact on clinical outcomes.
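The heterogeneity statistic mentioned above can be illustrated in its classical univariate form, where I<sup>2</sup> is derived from Cochran's Q (the paper uses a generalized, bivariate version; this sketch with hypothetical effect estimates only shows the standard textbook formula):

```python
def cochran_q(effects: list[float], variances: list[float]) -> float:
    """Cochran's Q: weighted squared deviations from the fixed-effect mean."""
    weights = [1.0 / v for v in variances]
    mean = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return sum(w * (e - mean) ** 2 for w, e in zip(weights, effects))

def i_squared(effects: list[float], variances: list[float]) -> float:
    """I^2 (%): share of total variability attributable to between-study heterogeneity."""
    q = cochran_q(effects, variances)
    df = len(effects) - 1
    return max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

# Hypothetical per-study logit-sensitivity estimates and within-study variances
effects = [1.8, 2.3, 1.2, 2.9, 1.5]
variances = [0.05, 0.08, 0.04, 0.10, 0.06]
print(f"I^2 = {i_squared(effects, variances):.1f}%")
```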

Topics

Journal Article, Review
