Large Language Models Versus Human Readers in CAD-RADS 2.0 Categorization of Coronary CT Angiography Reports.
Authors
Affiliations (5)
- Department of Radiology, Research Institute of Radiological Science, Severance Hospital, Yonsei University College of Medicine, 50-1 Yonsei-ro, Seodaemun-gu, Seoul, 03722, Korea.
- Department of Medicine, Graduate School, Yonsei University College of Medicine, Seoul, Korea.
- Department of Radiology, Dongsan Medical Center, Keimyung University College of Medicine, Daegu, Korea.
- Department of Radiology, Ansan Hospital, Korea University College of Medicine, Ansan-si, Korea.
- Department of Radiology, Research Institute of Radiological Science, Severance Hospital, Yonsei University College of Medicine, 50-1 Yonsei-ro, Seodaemun-gu, Seoul, 03722, Korea. [email protected].
Abstract
This study evaluated the accuracy of large language models (LLMs) in assigning Coronary Artery Disease Reporting and Data System (CAD-RADS) 2.0 categories and modifiers based on real-world coronary CT angiography (CCTA) reports and compared their accuracy with that of human readers. From 2752 eligible CCTA reports generated at an academic hospital between January and September 2024, 180 were randomly selected to achieve a balanced distribution of categories and modifiers. The reference standard was established by consensus between two expert cardiac radiologists with 15 and 14 years of experience, respectively. Four LLMs (O1, GPT-4o, GPT-4, GPT-3.5-turbo) and four human readers (a cardiac radiologist, a fellow, and two residents) independently assigned CAD-RADS categories and modifiers for each report. For the LLMs, the input prompt consisted of the report and a summary of CAD-RADS 2.0. The accuracy of each evaluator in full CAD-RADS categorization was compared with that of O1 using McNemar tests. O1 demonstrated the highest accuracy (90.7%) in full CAD-RADS categorization, outperforming GPT-4o (73.8%), GPT-4 (59.7%), GPT-3.5-turbo (25.8%), the fellow (83.3%), and resident 1 (83.3%; all P ≤ 0.01). However, its accuracy did not differ significantly from that of the cardiac radiologist (86.1%; P = 0.12) or resident 2 (89.4%; P = 0.68). Processing time per report ranged from 1.34 to 16.61 s for the LLMs, whereas human readers required 32.10 to 55.06 s. In the external validation dataset (n = 327), derived from two independent institutions, O1 achieved 95.7% accuracy for full CAD-RADS categorization. In conclusion, O1 exhibited accuracy similar to or higher than that of human readers, with shorter processing times, when producing full CAD-RADS 2.0 categorizations from CCTA reports.
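To make the prompting setup concrete, the following is a minimal sketch, assuming the OpenAI Python SDK (v1+). The system-message wording, the `CADRADS_SUMMARY` text, and the use of `gpt-4o` here are illustrative placeholders, not the authors' actual prompt or configuration.

```python
# Hedged sketch of the described input: CCTA report + CAD-RADS 2.0 summary.
# Prompt wording and variable contents are hypothetical, not from the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CADRADS_SUMMARY = "..."  # condensed CAD-RADS 2.0 category/modifier definitions
report_text = "..."      # one free-text CCTA report

response = client.chat.completions.create(
    model="gpt-4o",  # the study also evaluated O1, GPT-4, and GPT-3.5-turbo
    messages=[
        {"role": "system",
         "content": "Assign CAD-RADS 2.0 categories and modifiers to CCTA reports.\n"
                    + CADRADS_SUMMARY},
        {"role": "user",
         "content": f"Report:\n{report_text}\n\n"
                    "Return the full CAD-RADS 2.0 categorization."},
    ],
)
print(response.choices[0].message.content)
```

The statistical comparison named in the abstract, the McNemar test, can likewise be sketched with statsmodels; the 2x2 counts below are fabricated for illustration only and do not reproduce the study's data.

```python
# Paired comparison of per-report correctness between O1 and one comparator,
# as in a McNemar test over the same 180 reports. Counts are made up.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Rows: O1 correct / incorrect; columns: comparator correct / incorrect.
table = np.array([[150, 13],   # O1 correct
                  [  5, 12]])  # O1 incorrect
result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"P = {result.pvalue:.3f}")
```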