Performance analysis of large language models in multi-disease detection from chest computed tomography reports: a comparative study: Experimental Research.

Authors

Luo P, Fan C, Li A, Jiang T, Jiang A, Qi C, Gan W, Zhu L, Mou W, Zeng D, Tang B, Xiao M, Chu G, Liang Z, Shen J, Liu Z, Wei T, Cheng Q, Lin A, Chen X

Affiliations (17)

  • Donghai County People's Hospital - Jiangnan University Smart Healthcare Joint Laboratory, Donghai County People's Hospital, Lianyungang, Jiangsu Province, 222000, China.
  • Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou, Guangdong, 510282, China.
  • Cancer Centre and Institute of Translational Medicine, Faculty of Health Sciences, University of Macau, Macau SAR, 999078, China.
  • Department of Pulmonary and Critical Care Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, Guangdong, 510282, China.
  • Department of Urology, Changhai Hospital, Naval Medical University (Second Military Medical University), Shanghai, China.
  • Department of Microbiology, State Key Laboratory of Emerging Infectious Diseases, Carol Yu Centre for Infection, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong.
  • Department of Joint Surgery and Sports Medicine, Zhuhai People's Hospital (Zhuhai Hospital affiliated with Jinan University), Guangdong, China.
  • Department of Urology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China.
  • Department of Oncology, Nanfang Hospital, Southern Medical University.
  • Cancer Center, the Sixth Affiliated Hospital, School of Medicine, South China University of Technology.
  • Department of Radiation Oncology, Zhongshan Hospital Affiliated to Fudan University, Shanghai, China.
  • Hepatobiliary Surgery Department, Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou People's Hospital, China.
  • Department of Urology, The Affiliated Hospital of Qingdao University.
  • State Key Laboratory of Respiratory Disease, National Clinical Research Center for Respiratory Disease, Guangzhou Institute of Respiratory Health, the First Affiliated Hospital of Guangzhou Medical University, Guangzhou 510120, China.
  • Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100730, China.
  • Department of Neurosurgery, Xiangya Hospital, Central South University, Changsha, Hunan, 410008, China.
  • National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, Hunan, 410008, China.

Abstract

Computed tomography (CT) is widely acknowledged as the gold standard for diagnosing thoracic diseases; however, interpretation accuracy depends heavily on radiologists' expertise. Large language models (LLMs) have shown considerable promise in various medical applications, particularly in radiology. This study aimed to assess the performance of leading LLMs in analyzing unstructured chest CT reports and to examine how different questioning methodologies and fine-tuning strategies influence their effectiveness in chest CT diagnosis. This retrospective analysis evaluated 13,489 chest CT reports encompassing 13 common thoracic conditions across pulmonary, cardiovascular, pleural, and upper abdominal systems. Five LLMs (Claude-3.5-Sonnet, GPT-4, GPT-3.5-Turbo, Gemini-Pro, and Qwen-Max) were assessed using dual questioning methodologies: multiple-choice and open-ended. Radiologist-curated datasets underwent rigorous preprocessing, including RadLex terminology standardization, multi-step diagnostic validation, and exclusion of ambiguous cases. Model performance was quantified via Subjective Answer Accuracy Rate (SAAR), Reference Answer Accuracy Rate (RAAR), and area under the receiver operating characteristic (ROC) curve analysis. GPT-3.5-Turbo underwent fine-tuning (100 iterations with one training epoch) on 200 high-performing cases to enhance diagnostic precision for initially misclassified conditions. GPT-4 demonstrated superior performance, with the highest RAAR of 75.1% in multiple-choice questioning, followed by Qwen-Max (66.0%) and Claude-3.5 (63.5%), significantly outperforming GPT-3.5-Turbo (41.8%) and Gemini-Pro (40.8%) across the entire patient cohort. Multiple-choice questioning consistently improved both RAAR and SAAR for all models compared with open-ended questioning, with RAAR consistently surpassing SAAR. Model performance varied notably across diseases and organ conditions. Notably, fine-tuning substantially enhanced the performance of GPT-3.5-Turbo, which initially exhibited suboptimal results in most scenarios. This study demonstrates that general-purpose LLMs can effectively interpret chest CT reports, with performance varying significantly across models depending on the questioning methodology and fine-tuning approach employed. For surgical practice, these findings provide evidence-based guidance for selecting appropriate AI tools to enhance preoperative planning, particularly for thoracic procedures. Integrating optimized LLMs into surgical workflows may improve decision-making efficiency, risk stratification, and diagnostic speed, potentially contributing to better surgical outcomes through more accurate preoperative assessment.
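The abstract's two headline metrics, SAAR and RAAR, are not defined in detail here. As a minimal illustrative sketch, assuming both are simple proportions of model answers that match an answer key (a subjective, radiologist-judged key for SAAR; a reference key for RAAR), they could be computed as follows. The function name and the toy labels are hypothetical, not taken from the paper:

```python
# Hypothetical sketch: SAAR/RAAR as simple match proportions.
# Assumption (not stated in the abstract): each metric is the fraction of
# cases where the model's answer equals the corresponding answer key.

def accuracy_rate(model_answers, answer_key):
    """Fraction of cases where the model answer matches the key."""
    if len(model_answers) != len(answer_key):
        raise ValueError("answer lists must be the same length")
    matches = sum(m == k for m, k in zip(model_answers, answer_key))
    return matches / len(answer_key)

# Toy example with four cases and abbreviated condition labels.
model      = ["pneumonia", "effusion", "nodule", "normal"]
reference  = ["pneumonia", "effusion", "mass",   "normal"]
subjective = ["pneumonia", "edema",    "mass",   "normal"]

raar = accuracy_rate(model, reference)   # 3 of 4 match -> 0.75
saar = accuracy_rate(model, subjective)  # 2 of 4 match -> 0.50
```

Under this reading, RAAR exceeding SAAR (as reported for all models) would simply mean the models agree more often with the reference key than with the subjective one.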

Topics

Journal Article
