Comparative evaluation of large language models for generating CAD-RADS 2.0-compliant diagnostic conclusions in cardiac CT reports.
Affiliations (2)
- Interdisciplinary Department of Medicine, Section of Radiology and Radiation Oncology, University of Bari "Aldo Moro", Bari, Italy.
- Interdisciplinary Department of Medicine, Section of Radiology and Radiation Oncology, University of Bari "Aldo Moro", Bari, Italy. [email protected].
Abstract
Coronary computed tomography angiography (CCTA) has become a cornerstone in non-invasive CAD diagnosis and risk stratification. To standardize reporting and improve clinical decision-making, the CAD-RADS 2.0 system was introduced. This study evaluates the performance of four LLMs, GPT-4o, Gemini 2.0 Flash, DeepSeek V, and Copilot in generating CAD-RADS 2.0-compliant conclusions from standardized CCTA reports. A total of 196 anonymized CCTA reports were retrospectively analyzed. Each LLM was prompted to provide CAD-RADS 2.0 classifications and follow-up recommendations. Ground truth labels were assigned by a senior radiologist. Performance metrics (accuracy, precision, recall, F1-score), execution times, and agreement (Cohen's kappa) with expert interpretation were computed. Interobserver agreement between junior and senior radiologists was also assessed. LLMs demonstrated good-to-excellent agreement with expert classifications: DeepSeek V (κ = 0.771), Copilot (κ = 0.761), GPT-4o (κ = 0.759), and Gemini 2.0 Flash (κ = 0.634). DeepSeek V achieved the highest accuracy (91.83%). Intra-model consistency was perfect (κ = 1). However, LLMs failed to assign CAD-RADS modifiers. ChatGPT-4o provided the most accurate follow-up recommendations (71.94%). All LLMs outperformed radiologists in execution time (3-9 s vs. 15-20 s; p < 0.05). Generic LLMs demonstrate promising performance in automating CAD-RADS 2.0 classification from CCTA reports. However, limitations in modifier assignment and recommendation accuracy highlight areas for refinement before clinical integration. This study explores the potential of large language models to facilitate standardized CAD-RADS 2.0 reporting from coronary CT angiography, highlighting a possible avenue to support workflow efficiency and clinical decision-making in non-invasive coronary artery disease evaluation. LLMs demonstrated strong potential in automating CAD-RADS 2.0-compliant structured reporting for CCTA. 
LLMs could significantly enhance efficiency in radiological reporting, but they need further optimization before clinical integration.
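To illustrate the agreement statistic reported above, the following minimal sketch computes Cohen's kappa between an LLM's CAD-RADS category assignments and a radiologist's. The labels are hypothetical examples, not the study's data, and this is not the authors' actual analysis code:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of cases where the two raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement: sum over categories of the product of marginals.
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum((ca[k] / n) * (cb[k] / n) for k in set(ca) | set(cb))
    return (observed - expected) / (1 - expected)

# Hypothetical CAD-RADS categories: LLM output vs. senior radiologist.
llm    = ["CAD-RADS 2", "CAD-RADS 3", "CAD-RADS 1", "CAD-RADS 3", "CAD-RADS 4A"]
expert = ["CAD-RADS 2", "CAD-RADS 3", "CAD-RADS 1", "CAD-RADS 2", "CAD-RADS 4A"]
print(round(cohens_kappa(llm, expert), 3))
```

Unlike raw accuracy, kappa discounts agreement expected by chance, which is why the study reports it alongside accuracy; values above roughly 0.6 are conventionally read as good agreement and above 0.8 as excellent.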