Empowering Radiologists With ChatGPT-4o: Comparative Evaluation of Large Language Models and Radiologists in Cardiac Cases.
Authors
Affiliations (4)
- Department of Radiology, Mamak State Hospital, Ankara, Türkiye.
- Department of Radiology, Kirikkale Yuksek Ihtisas Hospital, Kirikkale, Türkiye.
- Department of Radiology, Ministry of Health Ankara 29 Mayis State Hospital, Ankara, Türkiye.
- Department of Radiology, Bilkent City Hospital, Ankara, Türkiye.
Abstract
This study evaluated the diagnostic accuracy and differential diagnostic capabilities of 12 large language models (LLMs), 1 cardiac radiologist, and 3 general radiologists in cardiac radiology. The impact of ChatGPT-4o assistance on radiologist performance was also investigated. We collected 80 publicly available "Cardiac Case of the Month" cases from the Society of Thoracic Radiology website. The LLMs and general radiologist-III were provided with text-based information, whereas the other radiologists visually assessed the cases with and without ChatGPT-4o assistance. Diagnostic accuracy and differential diagnosis scores (DDx scores) were analyzed using the χ², Kruskal-Wallis, Wilcoxon, McNemar, and Mann-Whitney U tests. The unassisted diagnostic accuracy was 72.5% for the cardiac radiologist, 53.8% for general radiologist-I, and 51.3% for general radiologist-II. With ChatGPT-4o assistance, accuracy improved to 78.8%, 70.0%, and 63.8%, respectively. The improvements for general radiologists-I and II were statistically significant (P≤0.006). All radiologists' DDx scores improved significantly with ChatGPT-4o assistance (P≤0.05). Notably, general radiologist-I's ChatGPT-4o-assisted diagnostic accuracy and DDx score were not significantly different from the cardiac radiologist's unassisted performance (P>0.05). Among the LLMs, Claude 3 Opus and Claude 3.5 Sonnet had the highest accuracy (81.3%), followed by Claude 3 Sonnet (70.0%). Regarding the DDx score, Claude 3 Opus outperformed all other models and general radiologist-III (P<0.05). The accuracy of general radiologist-III improved significantly, from 48.8% to 63.8%, with ChatGPT-4o assistance (P<0.001). ChatGPT-4o may enhance the diagnostic performance of general radiologists in cardiac imaging, suggesting its potential as a diagnostic support tool. Further studies are required to assess its clinical integration.
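To make the reported percentages and the paired comparison more concrete, the following Python sketch converts each accuracy back to a case count out of the 80 cases and shows how an exact McNemar test could be computed for one reader with and without assistance. The function names `correct_count` and `mcnemar_exact` are illustrative, and the discordant-pair counts in the usage lines are hypothetical: the abstract reports only marginal accuracies, so the real per-case agreement table (and therefore the authors' exact P values) cannot be recovered from it. This is a sketch under those assumptions, not the authors' analysis code.

```python
from math import comb

N_CASES = 80  # number of cases stated in the abstract


def correct_count(accuracy_pct, n=N_CASES):
    """Convert a reported accuracy percentage to a whole number of correct cases."""
    return round(accuracy_pct * n / 100)


def mcnemar_exact(b, c):
    """Two-sided exact (binomial) McNemar p-value for paired binary outcomes.

    b: cases correct only WITHOUT assistance
    c: cases correct only WITH assistance
    Cases that are correct (or incorrect) in both conditions do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least this lopsided under a fair 50/50 split
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Case counts implied by the reported accuracies (arithmetic on the abstract's figures)
unassisted = correct_count(53.8)  # general radiologist-I, unassisted
assisted = correct_count(70.0)    # general radiologist-I, with ChatGPT-4o
net_gain = assisted - unassisted

# HYPOTHETICAL discordant counts chosen only so that c - b equals the net gain;
# the true split is not given in the abstract.
b_hypothetical, c_hypothetical = 2, 15
p_value = mcnemar_exact(b_hypothetical, c_hypothetical)
print(unassisted, assisted, net_gain, p_value)
```

The exact binomial form is used here because it needs only the standard library and behaves well when the number of discordant pairs is small; whether the authors used the exact or the χ²-approximated version of the McNemar test is not stated in the abstract.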