Application of LLMs in CAD-RADS Classification and Patient Management.
Authors
Affiliations (14)
- Department of Radiology and Nuclear Medicine, University Hospital No 4 of Lublin, Lublin, Poland.
- Department of Diagnostic Imaging, University Hospital No 1 of Lublin, Lublin, Poland.
- School of Medicine, University of Milano-Bicocca, Milan, Italy.
- Department of Radiology, ASST Papa Giovanni XXIII Hospital, Bergamo, Italy.
- Department of Clinical Science, Intervention and Technology, Unit of Radiology, Karolinska Institute, Stockholm, Sweden.
- Department of Biomedical Sciences and Public Health, Marche Polytechnic University, Ancona, Italy.
- 1st Department of Radiology, Medical University of Lublin, Lublin, Poland.
- Department of Correct, Clinical and Imaging Anatomy, Medical University of Lublin, Lublin, Poland.
- Department of Radiology, Medical University of Gdansk, Gdańsk, Poland.
- Doctoral School of Medicine and Pharmacy, Science and Technology, George Emil Palade University of Medicine, Pharmacy, Science and Technology, Târgu Mureș, Romania.
- Department of Radiology, County Emergency Clinical Hospital, Târgu Mureș, Romania.
- Department of Radiation Diagnostics, Danylo Halytsky Lviv National Medical University, Lviv, Ukraine.
- Oxford Medical West, Lviv, Ukraine.
- University Medical Center Utrecht, Utrecht, the Netherlands.
Abstract
This study evaluated the capability of four publicly available large language models (LLMs) to assign Coronary Artery Disease-Reporting and Data System (CAD-RADS) scores and provide patient management recommendations based on synthetic coronary CT angiography (CCTA) reports. The four LLMs (ChatGPT 4o, Claude 3.7, DeepSeek, and Gemini 2.5 Pro) were tasked with analyzing the reports and suggesting next steps. Prompts were framed from the perspective of both a cardiologist and a radiologist. Agreement with a human reference standard was assessed using weighted Cohen's kappa, Fleiss' kappa, and Krippendorff's alpha for CAD-RADS scoring, and unweighted Cohen's kappa for management recommendations. A Bayesian Wilcoxon signed-rank test was performed to assess directional bias.
Performance varied across LLMs and prompt identities. Claude 3.7 achieved almost perfect agreement for CAD-RADS scoring (κ = 0.997) regardless of prompt identity. Gemini similarly achieved almost perfect agreement (radiologist: κ = 0.962; cardiologist: κ = 0.990). ChatGPT demonstrated almost perfect agreement when prompted as a radiologist (κ = 0.896) but only substantial agreement when prompted as a cardiologist (κ = 0.715). DeepSeek showed the lowest overall performance (radiologist: κ = 0.637; cardiologist: κ = 0.768). By category, all LLMs correctly identified CAD-RADS 0, whereas higher-grade stenosis (4A/4B) remained the most challenging, with non-Claude models showing low-to-null agreement in some configurations. The LLMs' accuracy in proposing further management was considerably lower than their scoring accuracy, with CAD-RADS 3 showing the greatest variability in management recommendations across models and between human specialists. Furthermore, both CAD-RADS scoring and management recommendations varied depending on the professional identity specified in the prompt.
While LLMs demonstrated reliable scoring performance for lower-grade CAD-RADS categories (0-2), agreement was substantially reduced for higher-grade stenosis categories (4A/4B) and non-diagnostic studies, which could pose risks to patients. Their current ability to generate dependable clinical management recommendations is limited.
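For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of weighted Cohen's kappa between a reference rater and a model over ordered CAD-RADS categories. It is illustrative only and is not the authors' analysis code. The abstract does not state the weighting scheme, so linear weights are assumed by default. The function names, the category ordering, and the interpretation thresholds (the Landis and Koch scale, which appears consistent with the "substantial" and "almost perfect" labels used above) are likewise assumptions.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories, weights="linear"):
    """Weighted Cohen's kappa for two raters over ordered categories.

    `categories` lists the labels from lowest to highest grade.
    `weights` is "linear" or "quadratic" (an assumed choice; the
    abstract does not say which scheme was used).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if len(categories) < 2:
        raise ValueError("at least two categories are required")
    if weights not in ("linear", "quadratic"):
        raise ValueError("weights must be 'linear' or 'quadratic'")

    index = {label: i for i, label in enumerate(categories)}
    k, n = len(categories), len(rater_a)

    def disagreement(i, j):
        # Penalty grows with the distance between the two grades.
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    pairs = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    # Observed and chance-expected weighted disagreement.
    observed = sum(disagreement(i, j) * c for (i, j), c in pairs.items()) / n
    expected = sum(
        disagreement(i, j) * marg_a[i] * marg_b[j]
        for i in range(k)
        for j in range(k)
    ) / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - observed / expected


def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch agreement label (assumed scale)."""
    if kappa < 0:
        return "poor"
    for upper, label in [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
    ]:
        if kappa <= upper:
            return label
    return "almost perfect"


if __name__ == "__main__":
    # Hypothetical toy ratings, not data from the study.
    grades = ["0", "1", "2", "3", "4A", "4B", "5"]
    reference = ["0", "1", "2", "3", "4A", "4B", "5", "3"]
    model = ["0", "1", "2", "3", "4B", "4A", "5", "2"]
    kappa = weighted_kappa(reference, model, grades)
    print(f"weighted kappa = {kappa:.3f} ({landis_koch(kappa)})")
```

Under this assumed scale, the values reported above map to the same labels the abstract uses: 0.896 falls in the "almost perfect" band and 0.715 in the "substantial" band. Non-diagnostic studies (CAD-RADS N) do not sit on the ordinal scale, so this sketch leaves them out of the ordered category list.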