Staging Femoral Head Osteonecrosis with General-Purpose AI: Lessons from the Ficat-Arlet Classification.

December 16, 2025

Authors

Alyanak B, Dede BT, Bağcıer F

Affiliations (3)

  • Department of Physical Medicine and Rehabilitation, Gölcük Necati Çelik State Hospital, Kocaeli, Turkey. [email protected].
  • Department of Physical Medicine and Rehabilitation, Prof. Dr. Cemil Taşcıoğlu City Hospital, Istanbul, Turkey.
  • Department of Physical Medicine and Rehabilitation, University of Health Sciences, Başakşehir Çam and Sakura City Hospital, Istanbul, Turkey.

Abstract

This study aimed to conceptually evaluate the diagnostic capacity and limitations of a general-purpose vision-language model (ChatGPT-4o) in the radiographic staging of femoral head avascular necrosis (AVN) according to the Ficat-Arlet (FA) classification. Rather than proposing clinical implementation, this proof-of-concept investigation was designed to delineate the boundaries of general-purpose artificial intelligence (AI) in radiographic reasoning. A total of 240 anteroposterior pelvic X-ray images with corresponding clinical and magnetic resonance imaging (MRI)-based reference staging were retrospectively analyzed. Only plain radiographs were uploaded to ChatGPT-4o; no MRI data were provided. Each image was evaluated in a separate, independent session using a standardized English-language prompt. The model's staging outputs were compared with ground-truth labels using sensitivity, specificity, precision, accuracy, and area under the receiver operating characteristic curve (AUC). ChatGPT-4o achieved a micro-averaged sensitivity of 25.8%, specificity of 79.5%, and mean AUC of 0.55 across all FA stages. The highest sensitivity was observed in Stage 0 (54.2%) and the lowest in Stage 4 (10.4%). Statistical comparison across stages revealed significant variation in diagnostic metrics, indicating stage-dependent inconsistency. While ChatGPT-4o demonstrated partial capacity for radiographic reasoning, its limited sensitivity, low AUC, and inconsistency across stages underscore the conceptual boundaries of general-purpose AI in diagnostic image interpretation. These findings highlight the necessity for task-specific training and balanced datasets to achieve reliable performance in orthopedic imaging.
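The abstract reports per-stage and micro-averaged diagnostic metrics computed by treating each Ficat-Arlet stage as a one-vs-rest classification problem. As an illustration of how such metrics are derived from paired ground-truth and model-predicted stage labels, here is a minimal Python sketch. The labels below are invented toy data, not the study's dataset, and the function names are our own:

```python
def one_vs_rest_counts(y_true, y_pred, stage):
    """Confusion-matrix cells for a single stage treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == stage and p == stage)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == stage and p != stage)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != stage and p == stage)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != stage and p != stage)
    return tp, fp, fn, tn

def micro_averaged_metrics(y_true, y_pred, stages):
    """Pool one-vs-rest counts over all stages, then compute the metrics once.

    Micro-averaging sums TP/FP/FN/TN across stages before dividing, so
    frequent stages weigh more than rare ones -- relevant when, as the
    abstract notes, class balance affects reported performance.
    """
    TP = FP = FN = TN = 0
    for s in stages:
        tp, fp, fn, tn = one_vs_rest_counts(y_true, y_pred, s)
        TP += tp; FP += fp; FN += fn; TN += tn
    sensitivity = TP / (TP + FN)   # recall over pooled positives
    specificity = TN / (TN + FP)   # pooled true-negative rate
    precision = TP / (TP + FP)
    return sensitivity, specificity, precision

# Toy example: six radiographs, reference stages vs. model-predicted stages
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
sens, spec, prec = micro_averaged_metrics(y_true, y_pred, stages=[0, 1, 2])
print(sens, spec, prec)  # 0.5 0.75 0.5
```

AUC, which the study also reports, additionally requires per-stage scores or probabilities rather than hard stage labels, so it is not reproducible from label pairs alone.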

Topics

Journal Article
