In-context learning enables large language models to achieve human-level performance in spinal instability neoplastic score classification from synthetic CT and MRI reports.
Authors
Affiliations (6)
- Department of Diagnostic and Interventional Radiology, Faculty of Medicine, Medical Center - University of Freiburg, University of Freiburg, 79106, Freiburg, Germany.
- Medical Physics, Department of Diagnostic and Interventional Radiology, Faculty of Medicine, Medical Center - University of Freiburg, University of Freiburg, 79106, Freiburg, Germany.
- Department of Stereotactic and Functional Neurosurgery, Faculty of Medicine, Medical Center - University of Freiburg, University of Freiburg, 79106, Freiburg, Germany.
- Department of Neurosurgery, Faculty of Medicine, Medical Center - University of Freiburg, University of Freiburg, 79106, Freiburg, Germany.
- Department of Neuroradiology, Faculty of Medicine, Medical Center - University of Freiburg, University of Freiburg, Hugstetter Str. 55, 79106, Freiburg, Germany.
- Department of Neuroradiology, Faculty of Medicine, Medical Center - University of Freiburg, University of Freiburg, Hugstetter Str. 55, 79106, Freiburg, Germany. [email protected].
Abstract
To assess the performance of state-of-the-art large language models in classifying vertebral metastasis stability using the Spinal Instability Neoplastic Score (SINS) compared with human experts, and to evaluate the impact of task-specific refinement, including in-context learning, on their performance. This retrospective study analyzed 100 synthetic CT and MRI reports covering a broad range of SINS scores. Four human experts (two radiologists and two neurosurgeons) and four large language models (Mistral, Claude, GPT-4 Turbo, and GPT-4o) evaluated the reports. The large language models were tested both in generic form and with task-specific refinement. Performance was assessed by correct SINS category assignment and by the SINS points attributed. Human experts demonstrated high median performance in SINS classification (98.5% correct) and points calculation (92% correct), with a median point offset of 0 [0-0]. Generic large language models performed poorly, with 26-63% correct category assignment and 4-15% correct SINS points allocation. In-context learning significantly improved large language model performance to near-human levels (96-98/100 correct classifications and 86-95/100 correct scores, with no significant difference from human experts). Refined large language models performed 71-85% better in SINS points allocation. In-context learning enables state-of-the-art large language models to perform at near-human expert levels in SINS classification, offering potential for automating vertebral metastasis stability assessment. The poor performance of generic large language models highlights the importance of task-specific refinement in medical applications of artificial intelligence.
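The abstract describes task-specific refinement via in-context learning, i.e., supplying worked SINS examples in the prompt before the report to be scored. As a rough illustration only (the prompt wording, exemplar report, model settings, and helper function below are assumptions for this sketch, not the study's actual materials), the following Python snippet shows how such a few-shot SINS prompt could be sent to one of the named models through the OpenAI chat API; the category cut-offs follow the standard published SINS thresholds (0-6 stable, 7-12 potentially unstable, 13-18 unstable).

```python
# Minimal sketch of an in-context learning (few-shot) prompt for SINS
# classification with the OpenAI chat completions API. The exemplar report,
# its scoring, and the prompt text are illustrative assumptions, not the
# study's actual prompt or data.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a spine oncology expert. Score the report with the Spinal "
    "Instability Neoplastic Score (SINS): rate location, pain, bone lesion "
    "quality, spinal alignment, vertebral body collapse, and posterolateral "
    "involvement, sum the points (0-18), and assign the category "
    "(0-6 stable, 7-12 potentially unstable, 13-18 unstable)."
)

# One worked exemplar provided as an in-context example (few-shot learning).
EXAMPLE_REPORT = (
    "MRI: Osteolytic metastasis of L3 with >50% vertebral body collapse, "
    "mechanical pain on axial loading, preserved alignment, involvement of "
    "the left pedicle."
)
EXAMPLE_ANSWER = (
    "Location L3 (mobile spine) = 2, mechanical pain = 3, lytic lesion = 2, "
    "normal alignment = 0, >50% collapse = 3, unilateral posterolateral "
    "involvement = 1. Total = 11 -> potentially unstable (7-12)."
)

def classify_sins(report_text: str, model: str = "gpt-4o") -> str:
    """Return the model's SINS points and category for a single report."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output for reproducible scoring
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": EXAMPLE_REPORT},
            {"role": "assistant", "content": EXAMPLE_ANSWER},
            {"role": "user", "content": report_text},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(classify_sins(
        "CT: Blastic metastasis of T8, no pain, no collapse, "
        "normal alignment, no posterolateral involvement."
    ))
```

In this sketch the exemplar exchange (user report followed by assistant scoring) is what distinguishes the refined, in-context-learning condition from a generic prompt; removing those two messages would correspond to the generic setting evaluated in the study.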