
Intra-axial primary brain tumor differentiation: comparing large language models on structured MRI reports vs. radiologists on images.

Authors

Nakaura T, Uetani H, Yoshida N, Kobayashi N, Nagayama Y, Kidoh M, Kuroda JI, Mukasa A, Hirai T

Affiliations (3)

  • Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, Kumamoto, Japan.
  • Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, Kumamoto, Japan.
  • Department of Neurosurgery, Graduate School of Medical Sciences, Kumamoto University, Kumamoto, Japan.

Abstract

This study aimed to evaluate the potential of large language models (LLMs) in differentiating intra-axial primary brain tumors using structured magnetic resonance imaging (MRI) reports, and to compare their performance with that of radiologists.

Structured reports of preoperative MRI findings from 137 surgically confirmed intra-axial primary brain tumors, including Glioblastoma (n = 77), Central Nervous System (CNS) Lymphoma (n = 22), Astrocytoma (n = 9), Oligodendroglioma (n = 9), and others (n = 20), were analyzed by multiple LLMs: GPT-4, Claude-3-Opus, Claude-3-Sonnet, GPT-3.5, Llama-2-70B, Qwen1.5-72B, and Gemini-Pro-1.0. Each model provided its top 5 differential diagnoses based on the preoperative MRI findings, and the models' top 1, 3, and 5 accuracies were compared with those of board-certified neuroradiologists interpreting the actual preoperative MRI images.

Radiologists achieved top 1, 3, and 5 accuracies of 85.4%, 94.9%, and 94.9%, respectively. Among the LLMs, GPT-4 performed best, with top 1, 3, and 5 accuracies of 65.7%, 84.7%, and 90.5%, respectively. Notably, GPT-4's top 3 accuracy of 84.7% approached the radiologists' top 1 accuracy of 85.4%. Other LLMs showed varying performance levels, with average accuracies ranging from 62.3% to 75.9%. LLMs demonstrated high accuracy for Glioblastoma but struggled with CNS Lymphoma and other less common tumors, particularly in top 1 accuracy.

LLMs show promise as assistive tools for differentiating intra-axial primary brain tumors using structured MRI reports. However, a significant gap remains between their performance and that of board-certified neuroradiologists interpreting actual images. The choice of LLM and the tumor type both significantly influence the results.

Question: How do large language models (LLMs) perform when differentiating complex intra-axial primary brain tumors from structured MRI reports, compared with radiologists interpreting images?

Findings: Radiologists outperformed all tested LLMs in diagnostic accuracy. The best model, GPT-4, showed promise but lagged considerably behind radiologists, particularly for less common tumors.

Clinical relevance: LLMs show potential as assistive tools for generating differential diagnoses from structured MRI reports, particularly for non-specialists, but they cannot currently replace the nuanced diagnostic expertise of a board-certified radiologist interpreting the primary image data.
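The top 1, 3, and 5 accuracies reported in the abstract can be read as the fraction of cases in which the surgically confirmed diagnosis appears among the first k entries of a ranked differential list. The sketch below is only an illustration of that reading, not the authors' code: the function name `top_k_accuracy`, the exact-string matching of diagnosis labels, and the four toy cases are all assumptions made here, and none of the toy values come from the study.

```python
from typing import Sequence


def top_k_accuracy(
    truths: Sequence[str],
    ranked: Sequence[Sequence[str]],
    k: int,
) -> float:
    """Fraction of cases whose confirmed diagnosis is among the first k ranked differentials.

    Assumes diagnosis labels are already normalized so that exact string
    comparison is meaningful (a hypothetical simplification).
    """
    if len(truths) != len(ranked):
        raise ValueError("truths and ranked must have the same number of cases")
    if not truths:
        raise ValueError("at least one case is required")
    hits = sum(1 for truth, diffs in zip(truths, ranked) if truth in diffs[:k])
    return hits / len(truths)


# Hypothetical toy cases (NOT data from the study): confirmed diagnoses and
# a ranked list of five differentials per case.
truths = ["Glioblastoma", "CNS Lymphoma", "Astrocytoma", "Oligodendroglioma"]
ranked = [
    ["Glioblastoma", "Metastasis", "CNS Lymphoma", "Astrocytoma", "Abscess"],
    ["Glioblastoma", "CNS Lymphoma", "Metastasis", "Astrocytoma", "Abscess"],
    ["Oligodendroglioma", "Ganglioglioma", "Ependymoma", "Astrocytoma", "Glioblastoma"],
    ["Glioblastoma", "Astrocytoma", "Metastasis", "CNS Lymphoma", "Ependymoma"],
]

for k in (1, 3, 5):
    print(f"top {k} accuracy: {top_k_accuracy(truths, ranked, k):.2f}")
```

Because a hit at rank k is also a hit at every larger k, top k accuracy can only stay the same or rise as k grows, which is consistent with the 65.7%, 84.7%, and 90.5% progression reported for GPT-4.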

Topics

Journal Article
