In-context learning enables large language models to achieve human-level performance in spinal instability neoplastic score classification from synthetic CT and MRI reports.

Authors

Russe MF, Reisert M, Fink A, Hohenhaus M, Nakagawa JM, Wilpert C, Simon CP, Kotter E, Urbach H, Rau A

Affiliations (6)

  • Department of Diagnostic and Interventional Radiology, Faculty of Medicine, Medical Center - University of Freiburg, University of Freiburg, 79106, Freiburg, Germany.
  • Medical Physics, Department of Diagnostic and Interventional Radiology, Faculty of Medicine, Medical Center - University of Freiburg, University of Freiburg, 79106, Freiburg, Germany.
  • Department of Stereotactic and Functional Neurosurgery, Faculty of Medicine, Medical Center - University of Freiburg, University of Freiburg, 79106, Freiburg, Germany.
  • Department of Neurosurgery, Faculty of Medicine, Medical Center - University of Freiburg, University of Freiburg, 79106, Freiburg, Germany.
  • Department of Neuroradiology, Faculty of Medicine, Medical Center - University of Freiburg, University of Freiburg, Hugstetter Str. 55, 79106, Freiburg, Germany.
  • Department of Neuroradiology, Faculty of Medicine, Medical Center - University of Freiburg, University of Freiburg, Hugstetter Str. 55, 79106, Freiburg, Germany. [email protected].

Abstract

To assess the performance of state-of-the-art large language models in classifying vertebral metastasis stability using the Spinal Instability Neoplastic Score (SINS) compared to human experts, and to evaluate the impact of task-specific refinement, including in-context learning, on their performance. This retrospective study analyzed 100 synthetic CT and MRI reports encompassing a broad range of SINS scores. Four human experts (two radiologists and two neurosurgeons) and four large language models (Mistral, Claude, GPT-4 turbo, and GPT-4o) evaluated the reports. The large language models were tested both in generic form and with task-specific refinement. Performance was assessed based on correct SINS category assignment and the attributed SINS points. Human experts demonstrated high median performance in SINS classification (98.5% correct) and points calculation (92% correct), with a median point offset of 0 [0-0]. Generic large language models performed poorly, with 26-63% correct category assignment and 4-15% correct SINS points allocation. In-context learning significantly improved chatbot performance to near-human levels (96-98/100 correct for classification, 86-95/100 for scoring; no significant difference from human experts). Refined large language models performed 71-85% better in SINS points allocation than their generic counterparts. In-context learning enables state-of-the-art large language models to perform at near-human expert levels in SINS classification, offering potential for automating vertebral metastasis stability assessment. The poor performance of generic large language models highlights the importance of task-specific refinement in medical applications of artificial intelligence.
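The abstract does not reproduce the study's refinement prompt. As a rough illustration only, the sketch below shows how the published SINS cut-offs (totals of 0-6 stable, 7-12 indeterminate/potentially unstable, 13-18 unstable) and a few worked example reports might be assembled into an in-context learning prompt. The function names, example reports, and prompt wording are hypothetical and are not taken from the paper.

```python
# Minimal sketch of an in-context learning setup for SINS classification.
# Category cut-offs follow the published SINS definition; example reports
# and prompt text are illustrative, not the study's actual material.

def sins_category(points: int) -> str:
    """Map a total SINS score (0-18) to its stability category."""
    if not 0 <= points <= 18:
        raise ValueError("SINS total must be between 0 and 18")
    if points <= 6:
        return "stable"
    if points <= 12:
        return "indeterminate (potentially unstable)"
    return "unstable"


# Hypothetical worked examples prepended to the task prompt so the model
# sees scored reports before assessing a new one.
FEW_SHOT_EXAMPLES = [
    {
        "report": "Osteolytic metastasis of L3 with vertebral body collapse, "
                  "mechanical pain, and bilateral posterior element involvement.",
        "points": 14,
    },
    {
        "report": "Small blastic lesion of T7 without pain, collapse, "
                  "or posterior element involvement.",
        "points": 2,
    },
]


def build_prompt(new_report: str) -> str:
    """Assemble scoring instructions plus worked examples into one prompt."""
    lines = [
        "You are scoring vertebral metastases with the Spinal Instability "
        "Neoplastic Score (SINS). Sum the six components and report the "
        "total points and the stability category.",
        "",
    ]
    for ex in FEW_SHOT_EXAMPLES:
        lines += [
            f"Report: {ex['report']}",
            f"SINS total: {ex['points']} -> {sins_category(ex['points'])}",
            "",
        ]
    lines += [f"Report: {new_report}", "SINS total:"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("Mixed lesion of L1 with moderate collapse and "
                       "pain relieved by recumbency."))
```

The resulting prompt string would then be sent to the chosen chat model; the study's actual refinement may differ in wording, number of examples, and output format.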

Topics

Journal Article
