Enhancing the CAD-RADS™ 2.0 Category Assignment Performance of ChatGPT and DeepSeek Through "Few-shot" Prompting.
Authors
Affiliations (1)
Affiliations (1)
- Department of Radiology, School of Medicine, Bursa Uludağ University, Bursa, Turkey.
Abstract
To assess whether few-shot prompting improves the performance of 2 popular large language models (LLMs) (ChatGPT o1 and DeepSeek-R1) in assigning Coronary Artery Disease Reporting and Data System (CAD-RADS™ 2.0) categories. A detailed few-shot prompt based on CAD-RADS™ 2.0 framework was developed using 20 reports from the MIMIC-IV database. Subsequently, 100 modified reports from the same database were categorized using zero-shot and few-shot prompts through the models' user interface. Model accuracy was evaluated by comparing assignments to a reference radiologist's classifications, including stenosis categories and modifiers. To assess reproducibility, 50 reports were reclassified using the same few-shot prompt. McNemar tests and Cohen kappa were used for statistical analysis. Using zero-shot prompting, accuracy was low for both models (ChatGPT: 14%, DeepSeek: 8%), with correct assignments occurring almost exclusively in CAD-RADS 0 cases. Hallucinations occurred frequently (ChatGPT: 19%, DeepSeek: 54%). Few-shot prompting significantly improved accuracy to 98% for ChatGPT and 93% for DeepSeek (both P<0.001) and eliminated hallucinations. Kappa values for agreement between model-generated and radiologist-assigned classifications were 0.979 (0.950, 1.000) (P<0.001) for ChatGPT and 0.916 (0.859, 0.973) (P<0.001) for DeepSeek, indicating almost perfect agreement for both models without a significant difference between the models (P=0.180). Reproducibility analysis yielded kappa values of 0.957 (0.900, 1.000) (P<0.001) for ChatGPT and 0.873 [0.779, 0.967] (P<0.001) for DeepSeek, indicating almost perfect and strong agreement between repeated assignments, respectively, with no significant difference between the models (P=0.125). Few-shot prompting substantially enhances LLMs' accuracy in assigning CAD-RADS™ 2.0 categories, suggesting potential for clinical application and facilitating system adoption.