Assessing the Reliability of Large Language Models in Detecting Acute Knee Fractures on Radiographs: A Comparative Study.

May 14, 2026

DOI: 10.14744/cpr.2026.37887 PMID: 42305281

Authors

Konukoglu O, Kaya M, Arslan BC, Gunaydin I

Affiliations (2)

Department of Radiology, Gaziantep City Hospital, Gaziantep, Türkiye.
Department of Emergency Medicine, Gaziantep City Hospital, Gaziantep, Türkiye.

Abstract

To evaluate the diagnostic accuracy and reliability of three closed-source, multimodal large language models (LLMs), namely ChatGPT-4o, ChatGPT-4.5, and Gemini 2.5 Pro, in detecting acute knee fractures on radiographs, compared with an emergency medicine specialist and a radiologist.

This retrospective study included 252 patients who underwent both knee radiography and CT between September 2023 and July 2025. Fracture status was determined by CT and reviewed by radiologists. Anteroposterior and lateral radiographs were independently assessed by an emergency medicine specialist, a radiologist, and the three LLMs. Diagnostic performance was evaluated using sensitivity, specificity, predictive values, likelihood ratios, accuracy, and area under the curve (AUC). Reliability was assessed using Cohen's kappa and McNemar's test.

According to CT findings, fractures were present in 23.08% (n=58) of patients. The LLMs demonstrated low sensitivity (ChatGPT-4o, 37.9%; ChatGPT-4.5, 13.8%; Gemini 2.5 Pro, 10.3%) with moderate overall accuracy (72-77%). In contrast, the radiologist achieved 92.1% accuracy, with high sensitivity (77.6%) and specificity (96.4%), whereas the emergency medicine specialist showed 83.7% accuracy. AUC comparisons revealed significantly higher diagnostic performance for the clinicians, particularly the radiologist, than for all LLMs (p<0.05). Consistency analysis showed moderate agreement for ChatGPT-4o, slight agreement for ChatGPT-4.5, and substantial agreement for Gemini 2.5 Pro.

Closed-source LLMs performed worse than clinicians in diagnosing acute knee fractures on radiographs, with a high risk of missed fractures. Although they may support triage by reliably identifying normal cases, they are not sufficient for standalone diagnostic use.
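As an illustration of how the performance measures listed in the abstract relate to one another, the following is a minimal sketch that computes them from a 2x2 confusion table. The cell counts are not reported in the abstract; they are an assumption, back-calculated from the radiologist's reported sensitivity (77.6%), specificity (96.4%), and accuracy (92.1%) with 58 fractures among 252 patients. The names ConfusionCounts and diagnostic_metrics are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cells of a 2x2 table, with CT as the reference standard."""
    tp: int  # fracture on CT, reader called fracture
    fn: int  # fracture on CT, reader called normal
    tn: int  # no fracture on CT, reader called normal
    fp: int  # no fracture on CT, reader called fracture


def diagnostic_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Return the standard diagnostic accuracy measures as fractions."""
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    total = c.tp + c.fn + c.tn + c.fp
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": c.tp / (c.tp + c.fp),          # positive predictive value
        "npv": c.tn / (c.tn + c.fn),          # negative predictive value
        "accuracy": (c.tp + c.tn) / total,
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }


# ASSUMPTION: counts reconstructed from the reported percentages,
# not taken from the paper (58 fractures, 194 non-fractures, 252 total).
radiologist = ConfusionCounts(tp=45, fn=13, tn=187, fp=7)
metrics = diagnostic_metrics(radiologist)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts, the sensitivity, specificity, and accuracy round to the 77.6%, 96.4%, and 92.1% reported for the radiologist. The predictive values and likelihood ratios printed by the sketch follow from the same assumed counts and should not be read as figures from the study.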

Topics

Journal Article