Independent bone-level diagnostic accuracy study of an AI tool for detecting appendicular skeletal fractures on radiographs.

April 17, 2026


DOI: 10.1007/s00330-026-12489-5 PMID: 41995742

Authors

Bruun FJ, Müller FC, Nybing JU, Hansen P, Gosvig KK, Boesen MP, Brejnebøl MW

Affiliations (5)

Radiological Artificial Intelligence Testcenter, København, Denmark.
Department of Radiology, Bispebjerg og Frederiksberg Hospitaler, København, Denmark.
Radiological Artificial Intelligence Testcenter, København, Denmark.
Department of Radiology, Herlev og Gentofte Hospitaler, Herlev, Denmark.
Department of Radiology, Bispebjerg og Frederiksberg Hospitaler, København, Denmark.

Abstract

To perform an in-depth evaluation of the diagnostic test accuracy of a commercially available AI tool for assistance in fracture detection on radiographs.

This retrospective study included consecutive patients with trauma radiographs at seven Danish hospitals. The AI output was evaluated using the clinical radiologic report as a reference standard for a binary fracture outcome. The report is based on assessments by an emergency physician, a senior orthopedic surgeon, and a radiology expert. Sensitivity, specificity, and positive and negative predictive values were calculated. Sensitivity and specificity were additionally stratified for children, degenerative disease, metal, old fractures, casting, obvious fractures, and inter-hospital differences. Bone-wise sensitivity and specificity were assessed for multiple fracture cases and individual bones.

The study sample consisted of 2783 patients (median age 38 years; IQR: 21-64; 1443 female), and 948 (34%) had the target finding. The AI tool demonstrated an overall sensitivity of 89% (95% CI: 87%-91%) and specificity of 88% (95% CI: 86%-89%). The specificity was 57% (95% CI: 49%-65%) in examinations with old fractures. Bone-wise sensitivity for carpal fractures ranged from 25% (95% CI: 1%-81%) for other carpals to 75% (95% CI: 43%-95%) for the triquetrum. For tarsal fractures, it ranged from 0% (95% CI: 0%-60%) for the medial cuneiform to 53% (95% CI: 27%-79%) for the talus.

The AI tool demonstrated high overall diagnostic accuracy and performed robustly across most specific situations. However, specificity was substantially reduced in the presence of old fractures. The bone-wise analysis showed great variability, with a pattern of poor accuracy for short, irregular bones.

Question: Can a commercially available AI tool reliably detect fractures across anatomical regions, confounding factors, and individual bones, and are there patterns in diagnostic limitations?

Findings: The AI tool achieved 89% sensitivity and 88% specificity with consistent accuracy across subgroups. However, accuracy dropped for old fractures and irregular short bones.

Clinical relevance: Despite broad regulatory approval, AI fracture tools may overlook clinically relevant weaknesses. Our in-depth evaluation highlights limitations, guiding responsible clinical use and future research to support safe AI implementation in radiology and informed medicolegal regulation.
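The very wide bone-wise intervals (for example 0%-60% around a point estimate of 0%) are typical of small per-bone sample sizes. The abstract states neither the interval method nor the per-bone counts, so the following is only a minimal sketch under two assumptions: that an exact (Clopper-Pearson) binomial interval is used, and that the counts shown are purely illustrative. The helper names `binom_cdf`, `_bisect`, and `clopper_pearson` are hypothetical and not taken from the study.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); intended for small n only."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, lo=0.0, hi=1.0, iters=100):
    """Find the root of a decreasing function f on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a proportion k/n.

    Assumption: this is one common choice of interval; the abstract
    does not say which method the authors used.
    """
    # Lower bound solves P(X >= k | p) = alpha/2 (defined as 0 when k == 0).
    lower = 0.0 if k == 0 else _bisect(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p))
    )
    # Upper bound solves P(X <= k | p) = alpha/2 (defined as 1 when k == n).
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p) - alpha / 2
    )
    return lower, upper


# Illustrative counts only (NOT reported in the abstract):
# sensitivity = detected fractures / all fractures of that bone.
for detected, total in [(0, 4), (1, 4)]:
    lo, hi = clopper_pearson(detected, total)
    print(f"{detected}/{total}: {detected / total:.0%} (95% CI: {lo:.0%}-{hi:.0%})")
```

With only four fractures of a given bone, even zero detections leaves an upper bound near 60%, so such estimates say little about the true sensitivity for that bone.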


Topics

Journal Article
