
In-context learning enables large language models to achieve human-level performance in spinal instability neoplastic score classification from synthetic CT and MRI reports.

September 24, 2025 · PubMed

Authors

Russe MF, Reisert M, Fink A, Hohenhaus M, Nakagawa JM, Wilpert C, Simon CP, Kotter E, Urbach H, Rau A

Affiliations (6)

  • Department of Diagnostic and Interventional Radiology, Faculty of Medicine, Medical Center - University of Freiburg, University of Freiburg, 79106, Freiburg, Germany.
  • Medical Physics, Department of Diagnostic and Interventional Radiology, Faculty of Medicine, Medical Center - University of Freiburg, University of Freiburg, 79106, Freiburg, Germany.
  • Department of Stereotactic and Functional Neurosurgery, Faculty of Medicine, Medical Center - University of Freiburg, University of Freiburg, 79106, Freiburg, Germany.
  • Department of Neurosurgery, Faculty of Medicine, Medical Center - University of Freiburg, University of Freiburg, 79106, Freiburg, Germany.
  • Department of Neuroradiology, Faculty of Medicine, Medical Center - University of Freiburg, University of Freiburg, Hugstetter Str. 55, 79106, Freiburg, Germany.
  • Department of Neuroradiology, Faculty of Medicine, Medical Center - University of Freiburg, University of Freiburg, Hugstetter Str. 55, 79106, Freiburg, Germany. [email protected].

Abstract

To assess the performance of state-of-the-art large language models in classifying vertebral metastasis stability using the Spinal Instability Neoplastic Score (SINS) compared to human experts, and to evaluate the impact of task-specific refinement, including in-context learning, on their performance.

This retrospective study analyzed 100 synthetic CT and MRI reports encompassing a broad range of SINS scores. Four human experts (two radiologists and two neurosurgeons) and four large language models (Mistral, Claude, GPT-4 turbo, and GPT-4o) evaluated the reports. The large language models were tested both in generic form and with task-specific refinement. Performance was assessed on correct SINS category assignment and on the SINS points attributed.

Human experts demonstrated high median performance in SINS classification (98.5% correct) and points calculation (92% correct), with a median point offset of 0 [0-0]. Generic large language models performed poorly, with 26-63% correct category assignment and 4-15% correct SINS points allocation. In-context learning significantly improved chatbot performance to near-human levels (96-98/100 correct for classification and 86-95/100 for scoring, with no significant difference from human experts); refined large language models allocated SINS points correctly 71-85 percentage points more often than their generic counterparts.

In-context learning enables state-of-the-art large language models to perform at near-human expert level in SINS classification, offering potential for automating vertebral metastasis stability assessment. The poor performance of generic large language models highlights the importance of task-specific refinement in medical applications of artificial intelligence.
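For readers unfamiliar with SINS, the sketch below illustrates how the six component scores are summed and mapped to a stability category. It uses the standard Spine Oncology Study Group component weights and thresholds (total 0-18; stable 0-6, potentially unstable 7-12, unstable 13-18), not anything specific to this paper's synthetic reports or prompts; component labels and the example findings are illustrative assumptions.

```python
# Minimal sketch of SINS scoring, assuming the standard SOSG component
# weights and category thresholds (not taken from this paper).

SINS_COMPONENTS = {
    "location": {"junctional": 3, "mobile": 2, "semi_rigid": 1, "rigid": 0},
    "pain": {"mechanical": 3, "occasional_non_mechanical": 1, "none": 0},
    "bone_lesion": {"lytic": 2, "mixed": 1, "blastic": 0},
    "alignment": {"subluxation_translation": 4, "de_novo_deformity": 2, "normal": 0},
    "vertebral_body_collapse": {">50%": 3, "<50%": 2, ">50%_involved_no_collapse": 1, "none": 0},
    "posterolateral_involvement": {"bilateral": 3, "unilateral": 1, "none": 0},
}


def sins_total(findings: dict) -> int:
    """Sum the six SINS component scores (range 0-18)."""
    return sum(SINS_COMPONENTS[component][value] for component, value in findings.items())


def sins_category(total: int) -> str:
    """Map a SINS total to its stability category."""
    if total <= 6:
        return "stable"
    if total <= 12:
        return "potentially unstable"
    return "unstable"


# Example report findings: lytic lesion in the mobile spine, mechanical pain,
# normal alignment, <50% collapse, unilateral posterolateral involvement.
example = {
    "location": "mobile",
    "pain": "mechanical",
    "bone_lesion": "lytic",
    "alignment": "normal",
    "vertebral_body_collapse": "<50%",
    "posterolateral_involvement": "unilateral",
}
total = sins_total(example)         # 2 + 3 + 2 + 0 + 2 + 1 = 10
print(total, sins_category(total))  # 10 potentially unstable
```

In the study, this kind of scoring rubric and worked examples supplied via in-context learning is what closed the gap between generic and near-expert model performance; the exact prompts used by the authors are not reproduced here.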

Topics

Journal Article
