GPT-4 vs. Radiologists: who advances mediastinal tumor classification better across report quality levels? A cohort study.
Authors
Affiliations (8)
- 7T Magnetic Resonance Translational Medicine Research Center, Department of Radiology, Southwest Hospital, Army Medical University (Third Military Medical University), Chongqing, China.
- College of Mathematics and Statistics, Chongqing University, Chongqing, China.
- Department of Radiology, the Third Affiliated Hospital of Chongqing Medical University, Chongqing, China.
- Department of Radiology, The Affiliated Yongchuan Hospital of Chongqing Medical University, Chongqing, China.
- Department of Radiology, Second Affiliated Hospital of Chongqing Medical University, Chongqing, China.
- Department of Radiology, The People's Hospital of Tongliang District, Chongqing, China.
- School of Big Data and Software Engineering, Chongqing University.
- Department of Medicine, Yidu Cloud (Beijing) Technology Co. Ltd., Beijing, China.
Abstract
Accurate mediastinal tumor classification is crucial for treatment planning, but diagnostic performance varies with radiologists' experience and report quality. We aimed to evaluate GPT-4's diagnostic accuracy in classifying mediastinal tumors from radiological reports, compared with radiologists of different experience levels, using reports of varying quality. We conducted a retrospective study of 1,494 patients from five tertiary hospitals with mediastinal tumors diagnosed via chest CT and pathology. Radiological reports were categorized as low, medium, or high quality based on predefined criteria assessed by experienced radiologists. Six radiologists (two residents, two attending radiologists, and two associate senior radiologists) and GPT-4 evaluated the chest CT reports. Diagnostic performance was analyzed overall, by report quality, and by tumor type using Wald χ² tests, with 95% CIs calculated via the Wilson method. GPT-4 achieved an overall diagnostic accuracy of 73.3% (95% CI: 71.0-75.5), comparable to associate senior radiologists (74.3%, 95% CI: 72.0-76.5; p > 0.05). For low-quality reports, GPT-4 outperformed associate senior radiologists (60.8% vs. 51.1%, p < 0.001). For high-quality reports, GPT-4 was comparable to attending radiologists (80.6% vs. 79.4%, p > 0.05). Diagnostic performance varied by tumor type: GPT-4 was comparable to radiology residents for neurogenic tumors (44.9% vs. 50.3%, p > 0.05), similar to associate senior radiologists for teratomas (68.1% vs. 65.9%, p > 0.05), and superior in diagnosing lymphoma (75.4% vs. 60.4%, p < 0.001). Overall, GPT-4 demonstrated interpretation accuracy comparable to associate senior radiologists, excelling on low-quality reports and outperforming them in diagnosing lymphoma. These findings underscore GPT-4's potential to enhance diagnostic performance in challenging diagnostic scenarios.
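The abstract states that 95% CIs were computed via the Wilson method. As an illustration only (the success count below is back-calculated from the reported 73.3% accuracy on 1,494 reports, not taken from the study data), a minimal Wilson score interval can be sketched as:

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a binomial proportion (k successes in n trials),
    using the normal critical value z (1.96 for a 95% interval)."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical check against the reported overall accuracy:
# k is approximated as round(0.733 * 1494), an assumption, not a figure from the paper.
lo, hi = wilson_ci(round(0.733 * 1494), 1494)
print(f"95% CI: {lo*100:.1f}-{hi*100:.1f}")  # → 95% CI: 71.0-75.5
```

With these assumed counts, the interval reproduces the reported 71.0-75.5 bound; the Wilson interval is preferred over the Wald interval here because it remains well-behaved for proportions near 0 or 1.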