Comparing performance of seven fine-tuned open-source large language models in summarizing and predicting outcome-relevant information from mechanical thrombectomy reports in patients with acute ischemic stroke.
Affiliations (4)
- Department of Radiology, Charité - Universitätsmedizin Berlin, Humboldt-Universität zu Berlin, Freie Universität Berlin, Berlin Institute of Health, Berlin, Germany.
- Department of Radiology, Charité - Universitätsmedizin Berlin, Humboldt-Universität zu Berlin, Freie Universität Berlin, Berlin Institute of Health, Berlin, Germany.
- Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany.
- Department of Neuroradiology, Charité - Universitätsmedizin Berlin, Humboldt-Universität zu Berlin, Freie Universität Berlin, Berlin Institute of Health, Berlin, Germany.
Abstract
This study evaluates seven open-source large language models (LLMs) in summarizing radiology reports of acute ischemic stroke patients treated with mechanical thrombectomy and in predicting angiography-based outcome measures relevant to post-thrombectomy reperfusion. A total of 2000 mechanical thrombectomy reports (findings, with the summarizing impression section serving as gold standard) were split into a training set (N = 1900) for model fine-tuning and a test set (N = 100). A two-step evaluation was performed: (1) quantitative analysis of all seven LLMs using the metrics ROUGE-1, -2, -L, METEOR, BERTScore (F1), and BLEU, comparing LLM-generated summaries against the gold-standard impressions; (2) qualitative manual evaluation of the four best-performing models by two radiologists, assessing correctness and completeness across key parameters: outcome-relevant scores, vessel information, occlusion side, number of passes, relevant additional information, hallucinations, and grammar quality. Statistical significance was assessed via a two-tailed, four-sample χ² test, followed by post hoc pairwise χ² comparisons. BioMistral-7b scored highest across most quantitative metrics (ROUGE-1: 0.47, ROUGE-2: 0.30, ROUGE-L: 0.43, METEOR: 0.46, BERTScore (F1): 0.82). Manual evaluation revealed that gemma-2-9b documented pass counts most frequently (56 of 100 cases, 56%; p < 0.02 vs. Llama-3.1-8b and mistral-7b-instruct), while mistral-7b-instruct described them correctly most often (29 of 38 mentioned passes, 76.32%; p < 0.02 vs. BioMistral-7b and p < 0.01 vs. gemma-2-9b). All four manually evaluated LLMs performed moderately well in predicting the Thrombolysis in Cerebral Ischemia (TICI) score (correctness rate 66-71%; p = 0.89). All four manually evaluated LLMs effectively summarized thrombectomy reports and demonstrated moderate accuracy in predicting TICI scores. Their integration into radiology workflows could enhance efficiency, warranting further clinical validation.
Question
Can specifically fine-tuned large language models (LLMs) improve radiology workflow by automatically summarizing thrombectomy reports and inferring angiographic classifications from textual descriptions?
Findings
Fine-tuned LLMs achieve similar performance in summarizing thrombectomy reports, with each model performing best in specific categories and showing moderate accuracy in correct Thrombolysis in Cerebral Ischemia (TICI) score prediction (66-71%).
Clinical relevance
Integrating fine-tuned LLMs into radiology workflows may accelerate decision-making and improve patient outcomes by automatically summarizing reports and assessing recanalization success; future work should enhance contextual understanding, address ambiguous inputs, and limit hallucinations.
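A minimal sketch of the kind of quantitative evaluation described in the abstract, using the Hugging Face evaluate library. The example texts, variable names, and the choice of lang="de" for BERTScore are illustrative assumptions; the study's actual pipeline, checkpoints, and preprocessing are not shown here.

```python
# Sketch: scoring LLM-generated summaries against gold-standard impressions
# with ROUGE, METEOR, BERTScore, and BLEU (hypothetical data).
import evaluate

# Hypothetical test-set examples: model outputs and gold-standard impressions.
predictions = [
    "Successful recanalization of the left M1 occlusion after one pass, TICI 3.",
    "Partial reperfusion of the right M2 segment after three passes, TICI 2b.",
]
references = [
    "Complete recanalization (TICI 3) of the left M1 occlusion after a single pass.",
    "Right M2 occlusion, TICI 2b reperfusion after three thrombectomy passes.",
]

rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")
bleu = evaluate.load("bleu")

rouge_scores = rouge.compute(predictions=predictions, references=references)
meteor_score = meteor.compute(predictions=predictions, references=references)
# lang="de" is an assumption (German-language reports); it selects a default
# multilingual model for BERTScore.
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="de")
bleu_score = bleu.compute(predictions=predictions, references=[[r] for r in references])

print(f"ROUGE-1: {rouge_scores['rouge1']:.2f}, ROUGE-2: {rouge_scores['rouge2']:.2f}, "
      f"ROUGE-L: {rouge_scores['rougeL']:.2f}")
print(f"METEOR: {meteor_score['meteor']:.2f}")
print(f"BERTScore (F1, mean): {sum(bert_scores['f1']) / len(bert_scores['f1']):.2f}")
print(f"BLEU: {bleu_score['bleu']:.2f}")
```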
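The statistical comparison can be sketched along similar lines with SciPy: an omnibus four-sample χ² test on a models-by-correctness contingency table, followed by pairwise post hoc χ² tests. The counts below are placeholders, not the study's results, and the Bonferroni correction is an assumed choice for the post hoc procedure.

```python
# Sketch: four-sample chi-square test of correctness counts across the four
# manually evaluated models, followed by post hoc pairwise chi-square tests.
# Counts are illustrative placeholders, not the study's data.
from itertools import combinations
from scipy.stats import chi2_contingency

# (correct, incorrect) counts per model on the 100 test reports (hypothetical)
counts = {
    "BioMistral-7b":       (68, 32),
    "gemma-2-9b":          (71, 29),
    "Llama-3.1-8b":        (66, 34),
    "mistral-7b-instruct": (70, 30),
}

# Omnibus test on the 4x2 contingency table.
table = [list(v) for v in counts.values()]
chi2, p, dof, _ = chi2_contingency(table)
print(f"Omnibus chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")

# Post hoc pairwise comparisons (Bonferroni-adjusted p-values as an assumption).
pairs = list(combinations(counts, 2))
for a, b in pairs:
    chi2_ab, p_ab, _, _ = chi2_contingency([counts[a], counts[b]])
    p_adj = min(p_ab * len(pairs), 1.0)
    print(f"{a} vs. {b}: p = {p_ab:.3f} (adjusted: {p_adj:.3f})")
```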