Multi-center benchmarking of large language models for clinical decision support in lung cancer screening.
Authors
Affiliations (12)
- Department of Thoracic Surgery, Zhongshan Hospital, Fudan University, Shanghai, China; Intelligent Medicine Institute, Fudan University, Shanghai, China.
- Department of Thoracic Surgery, Zhongshan Hospital, Fudan University, Shanghai, China.
- Department of Radiology, Huashan Hospital, Fudan University, Shanghai, China.
- Department of Statistics, University of Illinois Urbana-Champaign, Champaign, IL, USA.
- Department of Immunology, School of Basic Medical Sciences, Shanghai Medical College, Fudan University, Shanghai, China.
- School of Clinical Medicine, Shanghai Medical College, Fudan University, Shanghai, China.
- Department of Radiology, Zhongshan Hospital, Fudan University, Shanghai, China.
- MR Research Collaboration Team, Siemens Healthineers Ltd, Shanghai, China.
- Department of Cardiothoracic Surgery, Lu'an Affiliated Hospital of Anhui Medical University, Lu'an, China.
- Shanghai Key Laboratory of Medical Epigenetics, State International Co-laboratory of Medical Epigenetics and Metabolism, Institutes of Biomedical Sciences, Fudan University, Shanghai, China; Key Laboratory of Carcinogenesis and Cancer Invasion, Ministry of Education, Liver Cancer Institute, Zhongshan Hospital, Fudan University, Shanghai, China.
- Intelligent Medicine Institute, Fudan University, Shanghai, China; Zhongshan Hospital, Fudan University, Shanghai, China; Center for Digital Health, Berlin Institute of Health (BIH), Charité - Universitätsmedizin Berlin, Berlin, Germany. Electronic address: [email protected].
- Department of Thoracic Surgery, Zhongshan Hospital, Fudan University, Shanghai, China; Department of Thoracic Surgery, Zhongshan Hospital Xiamen Branch, Fudan University, Xiamen, China. Electronic address: [email protected].
Abstract
Large language models (LLMs) are increasingly explored for clinical applications, but their ability to generate management recommendations for lung cancer screening remains uncertain. In this cross-sectional, multi-center study, 148 anonymized low-dose computed tomography (CT) reports from three healthcare institutions are used to assess the readability, accuracy, and consistency of four widely adopted models (GPT-3.5, GPT-4, Claude 3 Sonnet, and Claude 3 Opus). Among them, Claude 3 Opus produces the most readable recommendations, while GPT-4 achieves the highest clinical accuracy. Importantly, performance does not differ significantly across institutions, underscoring the robustness of these models to variations in reporting templates and their utility in diverse healthcare settings. In an exploratory analysis, two state-of-the-art models, the proprietary GPT-4o and its open-source counterpart DeepSeek-R1, show performance comparable to GPT-4 and outperform GPT-3.5. These findings highlight the potential of LLMs to enhance clinical decision support in lung cancer screening across diverse healthcare settings.