Retrieval-augmented generation-enhanced large language models for comprehensive CAD-RADS 2.0 categorization from structured coronary CTA reports.

February 20, 2026

DOI: 10.1007/s00117-026-01580-z PMID: 41718766

Authors

Kaba E, Çubukçu Y, Uzunibrahimoğlu B, Yılmaz YE, Çınar M, Tabakoğlu S, Bal EM, Beydüz YS, Solak M, Oğuz E, Varlık AT, Malkoç G, Beyazal M, Celiker FB, Akkaya S

Affiliations (3)

Department of Radiology, Training and Research Hospital, Recep Tayyip Erdogan University, 53100, Rize, Türkiye.
Department of Radiology, Training and Research Hospital, Recep Tayyip Erdogan University, 53100, Rize, Türkiye.
Department of Radiology, Karadeniz Technical University School of Medicine, Trabzon, Türkiye.

Abstract

To evaluate the performance of large language models (LLMs), including retrieval-augmented generation (RAG)-based approaches, in extracting components and management recommendations from structured coronary computed tomography angiography (CCTA) reports according to the Coronary Artery Disease Reporting and Data System (CAD-RADS 2.0). A total of 320 fully structured CCTA reports were analyzed using three LLMs: the standard closed-source ChatGPT-5, NotebookLM (a RAG-based model), and a RAG-adapted ChatGPT-5 model (ChatGPT-5-RAG). Each model extracted the CAD-RADS category, plaque burden, presence of high-risk plaque (HRP), other modifiers, full score, and management recommendations in accordance with the CAD-RADS 2.0 guidelines. LLM outputs were compared with reference standards determined by two expert cardiovascular radiologists. ChatGPT-5-RAG showed the highest accuracy for CAD-RADS classification (0.959, 95% CI: 0.932-0.976), plaque burden (0.912, 95% CI: 0.876-0.939), HRP detection (0.988, 95% CI: 0.968-0.995), other modifiers (0.950, 95% CI: 0.920-0.969), and full score (0.828, 95% CI: 0.783-0.866). Standard ChatGPT-5 showed the weakest performance across all components. The differences among the three models were statistically significant (p < 0.001). Management recommendations were rated qualitatively on a three-point Likert scale; although agreement between models was low, ChatGPT-5-RAG and NotebookLM performed almost perfectly (median 3 points). This study demonstrates that RAG-enhanced LLMs significantly improve accuracy and reliability in extracting CAD-RADS 2.0 components and generating clinical management recommendations. The findings highlight the potential of RAG-based LLMs as innovative, explainable tools for automated and standardized CCTA reporting in clinical radiology workflows.
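
The abstract reports each accuracy with a 95% confidence interval but does not say how the intervals were computed. As a minimal sketch, the following Python function computes a Wilson score interval for a proportion, which is one common choice for accuracy estimates. The use of the Wilson method, the function name `wilson_interval`, and the count of 307 correct reports (back-calculated from the reported 0.959 accuracy on 320 reports) are assumptions for illustration, not details taken from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumed count: 307 of 320 correct, back-calculated from the reported 0.959.
low, high = wilson_interval(307, 320)
print(f"{307 / 320:.3f} (95% CI: {low:.3f}-{high:.3f})")
# prints: 0.959 (95% CI: 0.932-0.976)
```

Under these assumptions the result agrees with the interval reported for CAD-RADS classification, which suggests (but does not confirm) that a Wilson-type interval was used.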

Topics

Journal Article
