Large language models for patient education prior to interventional radiology procedures: a comparative study.
Affiliations (5)
- Department of Radiology, Charité - Universitätsmedizin Berlin, Humboldt-Universität zu Berlin, Freie Universität Berlin, Berlin Institute of Health, Augustenburger Platz 1, 13353, Berlin, Germany. [email protected].
- Department of Radiology, Charité - Universitätsmedizin Berlin, Humboldt-Universität zu Berlin, Freie Universität Berlin, Berlin Institute of Health, Augustenburger Platz 1, 13353, Berlin, Germany.
- Experimental and Clinical Research Center (ECRC) at Charité - Universitätsmedizin Berlin and Max-Delbrück-Centrum für Molekulare Medizin (MDC), Robert-Rössle-Straße 10, 13125, Berlin, Germany.
- Department of Neuroradiology, Charité - Universitätsmedizin Berlin, Humboldt-Universität zu Berlin, Freie Universität Berlin, Berlin Institute of Health, Berlin, Germany.
- Berlin Institute of Health (BIH), Berlin, Germany.
Abstract
This study evaluates the ability of four large language models (LLMs) to answer common patient questions preceding transarterial periarticular embolization (TAPE), computed tomography (CT)-guided high-dose-rate (HDR) brachytherapy, and bleomycin electrosclerotherapy (BEST), assessing their potential to enhance clinical workflows and patient comprehension as well as the associated risks. Thirty-five TAPE-related, 34 CT-HDR brachytherapy-related, and 36 BEST-related questions were presented to ChatGPT-4o, DeepSeek-V3, OpenBioLLM-8b, and BioMistral-7b. The LLM-generated responses were independently assessed by two board-certified radiologists, with accuracy rated on a 5-point Likert scale. Statistical analyses compared LLM performance across models and question categories with respect to suitability for patient education. DeepSeek-V3 attained the highest mean scores for BEST (4.49 ± 0.77) and CT-HDR brachytherapy (4.24 ± 0.81) and performed comparably to ChatGPT-4o on TAPE-related questions (4.20 ± 0.77 vs. 4.17 ± 0.64; p = 1.000). In contrast, OpenBioLLM-8b (BEST 3.51 ± 1.15, CT-HDR 3.32 ± 1.13, TAPE 3.34 ± 1.16) and BioMistral-7b (BEST 2.92 ± 1.35, CT-HDR 3.03 ± 1.06, TAPE 3.33 ± 1.28) performed significantly worse than DeepSeek-V3 and ChatGPT-4o across all procedures. Preparation/Planning was the only question category without statistically significant differences across all three procedures. DeepSeek-V3 and ChatGPT-4o excelled on TAPE, BEST, and CT-HDR brachytherapy questions, indicating their potential to enhance patient education in interventional radiology, where complex but minimally invasive procedures are often explained in brief consultations. OpenBioLLM-8b and BioMistral-7b, however, produced more frequent inaccuracies, suggesting that LLMs cannot yet replace comprehensive clinical consultations. These findings should be validated through patient feedback and implementation in clinical workflows.
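The abstract does not name the statistical tests used. For illustration only, the sketch below shows one common way such repeated Likert ratings are compared: a Friedman omnibus test across the four models, followed by pairwise Wilcoxon signed-rank tests with Bonferroni correction (a correction under which adjusted p-values are capped at 1.000, consistent with the p = 1.000 reported above). The model names are from the study; the ratings are random placeholders, and the choice of tests is an assumption, not the authors' published analysis.

```python
# Hypothetical illustration: the study's analysis code is not published.
# Assumes a Friedman test on the four models' Likert ratings of the same
# questions, with Bonferroni-corrected Wilcoxon signed-rank post-hoc tests.
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
models = ["ChatGPT-4o", "DeepSeek-V3", "OpenBioLLM-8b", "BioMistral-7b"]

# Placeholder data: rows = questions, columns = models, values = 1-5 Likert
# ratings (e.g., 35 TAPE-related questions, as in the study).
ratings = rng.integers(1, 6, size=(35, len(models)))

# Omnibus test: do the four models differ on the same set of questions?
stat, p = friedmanchisquare(*(ratings[:, j] for j in range(len(models))))
print(f"Friedman chi2 = {stat:.2f}, p = {p:.4f}")

# Post-hoc pairwise comparisons with Bonferroni correction.
pairs = list(combinations(range(len(models)), 2))
for i, j in pairs:
    _, p_raw = wilcoxon(ratings[:, i], ratings[:, j])
    p_adj = min(p_raw * len(pairs), 1.0)  # Bonferroni: adjusted p capped at 1.000
    print(f"{models[i]} vs. {models[j]}: adjusted p = {p_adj:.3f}")
```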