Evaluation of commercial AI algorithms for the detection of fractures, effusions, and dislocations on real-world clinical data: A prospective registry study.
Authors
Affiliations (4)
- Institute for Diagnostic and Interventional Radiology, TUM School of Medicine and Health, TUM University Hospital Rechts der Isar, Munich, Germany.
- Institute for Diagnostic and Interventional Radiology, TUM School of Medicine and Health, TUM University Hospital Rechts der Isar, Munich, Germany; Institute of Diagnostic and Interventional Neuroradiology, TUM School of Medicine and Health, TUM University Hospital Rechts der Isar, Munich, Germany.
- Institute for Diagnostic and Interventional Radiology, TUM School of Medicine and Health, TUM University Hospital Rechts der Isar, Munich, Germany; Institute for Cardiovascular Radiology and Nuclear Medicine, TUM School of Medicine and Health, German Heart Center Munich, Munich, Germany.
- Institute for Diagnostic and Interventional Radiology, TUM School of Medicine and Health, TUM University Hospital Rechts der Isar, Munich, Germany. Electronic address: [email protected].
Abstract
This study prospectively evaluated and directly compared the performance of three commercial AI algorithms (Gleamer, AZmed, and Radiobotics) for detecting fractures, dislocations, and joint effusions across multiple anatomical regions in real-world adult clinical radiography.

In this single-center, prospective technical performance evaluation study, we assessed these algorithms on radiographs from adult patients (n = 1037; 2926 radiographs; 22 anatomical regions) at [anonymized] (January-March 2025). Radiologists' reports served as the reference standard, with CT adjudication when available. Sensitivity, specificity, accuracy, and AUC were calculated; AUCs were compared using Bonferroni-corrected DeLong tests.

Fractures were identified in 29.60 % of patients; 13.69 % had acute fractures and 6.65 % had multiple fractures. For all fractures, Gleamer (AUC 83.95 %, sensitivity 75.57 %, specificity 92.33 %) and AZmed (AUC 84.88 %, sensitivity 79.48 %, specificity 90.27 %) outperformed Radiobotics (AUC 77.24 %, sensitivity 60.91 %, specificity 93.56 %). For acute fractures, AUCs were comparable (range: 84.81-87.78 %). For multiple fractures, performance was limited (AUCs 64.17-73.40 %). AZmed had a higher AUC for dislocation than Gleamer (61.85 % vs. 54.48 %), while Gleamer and Radiobotics outperformed AZmed for effusion (AUC 69.59 % and 73.63 % vs. 57.99 %). No algorithm exceeded 91 % accuracy for acute fractures.

In this real-world, single-center study, commercial AI algorithms showed moderate to high performance for straightforward fracture detection but limited accuracy for complex scenarios such as multiple fractures and dislocations. Current tools should be used as adjuncts rather than replacements for radiologists and reporting radiographers. Multicenter validation and more diverse training data are necessary to improve generalizability and robustness.
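The metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes each algorithm returns a binary finding per case, so that every metric derives from a 2x2 confusion matrix; the abstract does not describe the authors' actual computation. Under that assumption, the ROC curve passes through a single operating point and its area reduces to the mean of sensitivity and specificity. The all-fracture AUCs reported above match that mean to within rounding (for Gleamer, (75.57 + 92.33) / 2 = 83.95), which is consistent with, but does not confirm, this reading. The class `Confusion`, the function names, the example counts, and the number of comparisons passed to the Bonferroni helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2x2 confusion matrix for one algorithm and one finding (hypothetical helper)."""
    tp: int  # finding present, algorithm positive
    fp: int  # finding absent, algorithm positive
    tn: int  # finding absent, algorithm negative
    fn: int  # finding present, algorithm negative


def sensitivity(c: Confusion) -> float:
    """Fraction of cases with the finding that the algorithm flags."""
    return c.tp / (c.tp + c.fn)


def specificity(c: Confusion) -> float:
    """Fraction of cases without the finding that the algorithm clears."""
    return c.tn / (c.tn + c.fp)


def accuracy(c: Confusion) -> float:
    """Fraction of all cases classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def single_point_auc(sens: float, spec: float) -> float:
    """Area under a ROC curve drawn through (0,0), (1-spec, sens), (1,1).

    By the trapezoid rule this equals the mean of sensitivity and specificity.
    """
    return (sens + spec) / 2


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


if __name__ == "__main__":
    # Made-up counts for illustration only; not data from the study.
    example = Confusion(tp=80, fp=20, tn=180, fn=20)
    print(f"sensitivity = {sensitivity(example):.4f}")
    print(f"specificity = {specificity(example):.4f}")
    print(f"accuracy    = {accuracy(example):.4f}")

    # Percentages taken from the abstract (all fractures).
    reported = {
        "Gleamer": (75.57, 92.33),
        "AZmed": (79.48, 90.27),
        "Radiobotics": (60.91, 93.56),
    }
    for name, (sens, spec) in reported.items():
        print(f"{name}: single-point AUC = {single_point_auc(sens, spec):.3f} %")

    # Assumption: three pairwise AUC comparisons at a family-wise alpha of 0.05.
    print(f"Bonferroni threshold = {bonferroni_alpha(0.05, 3):.4f}")
```

A single-point AUC of this kind says nothing about performance at other decision thresholds, so it is best read alongside the sensitivity and specificity it is derived from.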