
Assessment of ChatGPT performance in orbital MRI reporting with multimetric evaluation of transformer-based language models.

Authors

Tel A, Bolognesi F, Michelutti L, Biglioli F, Robiony M

Affiliations (6)

  • Maxillofacial Surgery Clinic, Department of Head & Neck and Neuroscience, Academic Hospital of Udine, University of Udine, Udine, Italy. [email protected].
  • Division of Maxillofacial Surgery, Head and Neck Department, San Paolo Hospital, University of Milan, Milan, Italy.
  • Maxillofacial Surgery Clinic, Department of Head & Neck and Neuroscience, Academic Hospital of Udine, University of Udine, Udine, Italy.
  • Division of Maxillofacial Surgery, Maxillofacial Surgery Department, Postgraduate School of Maxillofacial Surgery, San Paolo Hospital, University of Milan, Milan, Italy.
  • Division of Maxillofacial Surgery, Head and Neck Department, Postgraduate School of Maxillofacial Surgery, San Paolo Hospital, University of Milan, Milan, Italy.
  • Department of Head & Neck and Neuroscience, Maxillofacial Surgery Clinic, Postgraduate School of Maxillofacial Surgery, Academic Hospital of Udine, Udine, Italy.

Abstract

Transformer-based large language models (LLMs), such as ChatGPT-4, are increasingly used to streamline clinical practice, with radiology reporting a prominent application. However, their performance in interpreting complex anatomical regions from MRI data remains largely unexplored. This study investigates the capability of ChatGPT-4 to produce clinically reliable reports from orbital MR images, applying a multimetric, quantitative evaluation framework in 25 patients with orbital lesions. Owing to inherent limitations of the current version of GPT-4, the model was not given volumetric MR data but only key 2D images. For each case, ChatGPT-4 generated a free-text report, which was then compared with the corresponding ground-truth report authored by a board-certified radiologist. Evaluation included established NLP metrics (BLEU-4, ROUGE-L, BERTScore), clinical content recognition scores (RadGraph F1, CheXbert), and expert human judgment. Among the automated metrics, BERTScore demonstrated the highest language similarity, while RadGraph F1 best captured clinical entity recognition. Clinician assessment showed moderate agreement with the LLM outputs, with performance decreasing in complex or infiltrative cases. The study highlights both the promise and the current limitations of LLMs in radiology, particularly their inability to process volumetric data and maintain spatial consistency. These findings suggest that while LLMs may assist in structured reporting, effective integration into diagnostic imaging workflows will require coupling with advanced vision models capable of full 3D interpretation.
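As a rough illustration of the automated comparison described in the abstract, the sketch below scores a model-generated report against a radiologist's ground-truth report with BLEU-4, ROUGE-L, and BERTScore. It is a minimal example, not the authors' pipeline: it assumes the `nltk`, `rouge-score`, and `bert-score` Python packages are installed, and the two report strings are hypothetical stand-ins.

```python
# Minimal sketch of a report-level multimetric comparison.
# Assumes: pip install nltk rouge-score bert-score
# and nltk.download("punkt") for tokenization.
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

# Hypothetical ground-truth and generated reports.
ground_truth = "Well-circumscribed intraconal lesion of the right orbit, hyperintense on T2."
generated = "Right orbital intraconal mass, well defined, with high T2 signal."

# BLEU-4: up-to-4-gram precision, smoothed for short clinical texts.
bleu4 = sentence_bleu(
    [word_tokenize(ground_truth.lower())],
    word_tokenize(generated.lower()),
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L: longest-common-subsequence F-measure.
rouge_l = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True) \
    .score(ground_truth, generated)["rougeL"].fmeasure

# BERTScore: contextual-embedding similarity (downloads a model on first run).
_, _, f1 = bert_score([generated], [ground_truth], lang="en")

print(f"BLEU-4: {bleu4:.3f}  ROUGE-L: {rouge_l:.3f}  BERTScore F1: {f1.item():.3f}")
```

In this setup, the n-gram metrics penalize the paraphrased wording while BERTScore credits the semantic overlap, which is consistent with the abstract's finding that BERTScore showed the highest language similarity.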

Topics

Magnetic Resonance Imaging, Natural Language Processing, Orbit, Journal Article
