Large Language Models in Clinical Decision Support: A Comparative Analysis of ChatGPT and Breast Radiologists on ACR Appropriateness Criteria.
Affiliations (3)
- The Warren Alpert Medical School, Brown University, Providence, RI (S.A.E., J.R., H.A.Z., G.L.B.).
- Department of Diagnostic Imaging, The Warren Alpert Medical School, Brown University, and Brown University Health, Providence, RI (E.H.D., R.C.W., A.P.L., G.L.B.).
- Brown Radiology Human Factors Lab, Providence, RI (G.L.B.). Electronic address: [email protected].
Abstract
This study evaluates the performance of ChatGPT, a large language model (LLM), in selecting appropriate imaging modalities for breast imaging scenarios using the American College of Radiology (ACR) Appropriateness Criteria (AC). We compared the agreement of ChatGPT with the ACR AC to that of breast radiologists at a single institution in selecting appropriate imaging modalities. The study used ten randomly selected clinical variants from the ACR AC breast imaging category. Outputs were obtained from ChatGPT-3.5, ChatGPT-4, and ChatGPT-4o using the versions available in July 2024. The ChatGPT versions and four breast radiologists rated the appropriateness of 81 imaging decisions on a scale from 1 to 9. For each imaging option within a clinical scenario, the ratings provided by the radiologists and by four independent samplings of each ChatGPT version were aggregated. Agreement among ratings from ChatGPT, the radiologists, and the ACR AC was analyzed using generalized estimating equations (GEEs) and Bland-Altman plots to assess consistency and bias. Radiologists had the lowest overall mean bias relative to the ACR AC (0.2438, p = 0.489). All versions of ChatGPT had larger, statistically significant mean biases (GPT-4o: 2.463; GPT-4: 1.7623; GPT-3.5: 2.4691; all p < 0.001). All raters showed a slope bias (p < 0.001), but radiologists had the smallest. In summary, the radiologists' ratings were closer to the ACR AC, and as a group the radiologists were often as variable as, or even less variable than, the same ChatGPT version sampled at the same time. ChatGPT shows promise as an AI tool for imaging decision-making, but current versions lack the accuracy, consistency, and reproducibility demonstrated by experienced radiologists. The study underscores the importance of human oversight in clinical applications and the need for further development to improve the reliability of ChatGPT and other LLMs and their alignment with established guidelines.
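For readers unfamiliar with the two bias measures reported above, the sketch below illustrates a Bland-Altman-style analysis: the mean bias is the average difference between a rater's scores and the ACR AC reference, and the slope bias tests whether that difference grows or shrinks with rating magnitude. This is a minimal sketch, not the study's analysis code: the study fit GEE models to account for correlated ratings within a clinical scenario, whereas this version uses ordinary least squares, synthetic data, and an illustrative helper name (`bland_altman_bias`).

```python
# Minimal sketch of a Bland-Altman bias analysis against the ACR AC reference.
# Simplification: ordinary least squares instead of the GEE models used in the
# study, so within-scenario correlation of ratings is ignored here.
import numpy as np
import statsmodels.api as sm

def bland_altman_bias(rater, reference):
    """Return mean bias and slope (proportional) bias with p-values."""
    rater = np.asarray(rater, dtype=float)
    reference = np.asarray(reference, dtype=float)
    diffs = rater - reference          # per-decision disagreement
    means = (rater + reference) / 2.0  # Bland-Altman x-axis

    # Mean bias: intercept-only model testing whether diffs differ from 0.
    mean_fit = sm.OLS(diffs, np.ones_like(diffs)).fit()

    # Slope bias: regress diffs on pairwise means; a nonzero slope means the
    # disagreement depends on the magnitude of the rating itself.
    slope_fit = sm.OLS(diffs, sm.add_constant(means)).fit()

    return {
        "mean_bias": mean_fit.params[0],
        "mean_bias_p": mean_fit.pvalues[0],
        "slope_bias": slope_fit.params[1],
        "slope_bias_p": slope_fit.pvalues[1],
    }

# Hypothetical example: 81 imaging decisions on the 1-9 appropriateness scale,
# with one rater systematically scoring higher than the ACR AC reference.
rng = np.random.default_rng(0)
acr = rng.integers(1, 10, size=81)
gpt = np.clip(acr + rng.normal(2.0, 1.5, size=81).round(), 1, 9)
print(bland_altman_bias(gpt, acr))
```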