Diagnostic Performance of ChatGPT-4o in Detecting Hip Fractures on Pelvic X-rays.
Erdem TE, Kirilmaz A, Kekec AF
Jun 1 2025
Hip fractures are a major orthopedic problem, especially in the elderly population. They are usually diagnosed by clinical evaluation and imaging, primarily X-rays. In recent years, artificial intelligence (AI) and deep learning techniques have opened new approaches to fracture detection in medical imaging. In this study, we aimed to evaluate the diagnostic performance of ChatGPT-4o, an AI model, in diagnosing hip fractures. A total of 200 anteroposterior pelvic X-ray images were retrospectively analyzed. Half of the images belonged to patients with surgically confirmed hip fractures, including both displaced and non-displaced types, while the other half came from patients with soft tissue trauma and no fractures. Each image was evaluated by ChatGPT-4o through a standardized prompt, and its predictions (fracture vs. no fracture) were compared against the gold standard diagnoses. Diagnostic performance was assessed with sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), the receiver operating characteristic (ROC) curve, Cohen's kappa, and F1 score. ChatGPT-4o demonstrated an overall accuracy of 82.5% in detecting hip fractures on pelvic radiographs, with a sensitivity of 78.0% and a specificity of 87.0%. PPV and NPV were 85.7% and 79.8%, respectively. The area under the ROC curve (AUC) was 0.825, indicating good discriminative performance. Of the 22 false-negative cases, 15 (68.2%) were non-displaced fractures, suggesting the model had greater difficulty identifying subtle radiographic findings. Cohen's kappa coefficient was 0.65, showing substantial agreement with the gold standard diagnoses. Chi-square analysis revealed a strong association between the model's predictions and the gold standard diagnoses (χ² = 82.59, P < 0.001), while McNemar's test (P = 0.176) showed no significant asymmetry between false-negative and false-positive errors.
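The reported figures can be cross-checked from a 2×2 table. The sketch below is not the authors' analysis code: it assumes the counts implied by the abstract (100 fracture and 100 non-fracture images, so 78 true positives, 22 false negatives, 87 true negatives, 13 false positives). It also assumes the continuity-corrected forms of the chi-square and McNemar tests, which the abstract does not specify but which reproduce the reported values. Function names are illustrative.

```python
import math

def diagnostic_metrics(tp, fn, tn, fp):
    """Standard 2x2 diagnostic metrics for a binary classifier."""
    n = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    # Chance agreement for Cohen's kappa, from the row and column marginals.
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        # F1 is listed among the metrics, but its value is not given in the abstract.
        "f1": 2 * tp / (2 * tp + fp + fn),
        # For a single binary output, the ROC curve has one operating point,
        # so AUC reduces to the mean of sensitivity and specificity.
        "auc": (sensitivity + specificity) / 2,
        "kappa": (accuracy - p_exp) / (1 - p_exp),
    }

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

def yates_chi_square(tp, fn, tn, fp):
    """Chi-square statistic for a 2x2 table with Yates continuity correction (assumed)."""
    n = tp + fn + tn + fp
    num = n * (abs(tp * tn - fn * fp) - n / 2) ** 2
    den = (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    return num / den

def mcnemar_corrected(fn, fp):
    """Continuity-corrected McNemar test on the discordant counts (assumed variant)."""
    stat = (abs(fn - fp) - 1) ** 2 / (fn + fp)
    return stat, chi2_sf_1df(stat)

if __name__ == "__main__":
    # Counts reconstructed from the reported rates (assumption, see lead-in).
    TP, FN, TN, FP = 78, 22, 87, 13
    for name, value in diagnostic_metrics(TP, FN, TN, FP).items():
        print(f"{name:12s} {value:.3f}")
    chi2 = yates_chi_square(TP, FN, TN, FP)
    print(f"chi-square   {chi2:.2f} (P = {chi2_sf_1df(chi2):.2e})")
    stat, p = mcnemar_corrected(FN, FP)
    print(f"McNemar      {stat:.3f} (P = {p:.3f})")
```

With these counts, accuracy, PPV, NPV, AUC, and kappa all match the reported values, which suggests the abstract's figures are internally consistent.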
ChatGPT-4o shows promising accuracy in identifying hip fractures on pelvic X-rays, especially when fractures are displaced. However, its sensitivity is markedly lower for non-displaced fractures, which accounted for most of the false negatives. This highlights the need for caution when interpreting negative AI results, particularly when clinical suspicion remains high. While not a replacement for expert assessment, ChatGPT-4o may assist in settings with limited specialist access.
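The caution about negative results can be made concrete with Bayes' rule. The following sketch uses the reported sensitivity (78.0%) and specificity (87.0%) to estimate how likely a fracture remains after a negative model output. The pretest probabilities are hypothetical illustrations, not values from the study, and the function name is an assumption.

```python
def post_test_probability_negative(pretest, sensitivity, specificity):
    """Probability of disease after a negative result, via the negative likelihood ratio."""
    lr_neg = (1 - sensitivity) / specificity   # negative likelihood ratio
    pre_odds = pretest / (1 - pretest)         # convert probability to odds
    post_odds = pre_odds * lr_neg              # Bayes' rule in odds form
    return post_odds / (1 + post_odds)         # convert odds back to probability

if __name__ == "__main__":
    SENS, SPEC = 0.78, 0.87  # values reported in the abstract
    # Hypothetical levels of clinical suspicion before imaging.
    for pretest in (0.10, 0.50, 0.80):
        post = post_test_probability_negative(pretest, SENS, SPEC)
        print(f"pretest {pretest:.0%} -> fracture probability after negative result {post:.1%}")
```

At the study's 50% prevalence this reproduces 1 − NPV (about 20%). At a hypothetical 80% pretest probability, a fracture remains slightly more likely than not after a negative result.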