Evaluation of large language models in generating pulmonary nodule follow-up recommendations.

Authors

Wen J,Huang W,Yan H,Sun J,Dong M,Li C,Qin J

Affiliations (3)

  • Department of Radiology, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China.
  • Department of Interventional Radiology, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China.
  • Big Data and Artificial Intelligence Center, The Third Affiliated Hospital of Sun Yat-Sen University, Guangzhou, China.

Abstract

To evaluate the performance of large language models (LLMs) in generating clinically follow-up recommendations for pulmonary nodules by leveraging radiological report findings and management guidelines. This retrospective study included CT follow-up reports of pulmonary nodules documented by senior radiologists from September 1st, 2023, to April 30th, 2024. Sixty reports were collected for prompting engineering additionally, based on few-shot learning and the Chain of Thought methodology. Radiological findings of pulmonary nodules, along with finally prompt, were input into GPT-4o-mini or ERNIE-4.0-Turbo-8K to generate follow-up recommendations. The AI-generated recommendations were evaluated against radiologist-defined guideline-based standards through binary classification, assessing nodule risk classifications, follow-up intervals, and harmfulness. Performance metrics included sensitivity, specificity, positive/negative predictive values, and F1 score. On 1009 reports from 996 patients (median age, 50.0 years, IQR, 39.0-60.0 years; 511 male patients), ERNIE-4.0-Turbo-8K and GPT-4o-mini demonstrated comparable performance in both accuracy of follow-up recommendations (94.6 % vs 92.8 %, P = 0.07) and harmfulness rates (2.9 % vs 3.5 %, P = 0.48). In nodules classification, ERNIE-4.0-Turbo-8K and GPT-4o-mini performed similarly with accuracy rates of 99.8 % vs 99.9 % sensitivity of 96.9 % vs 100.0 %, specificity of 99.9 % vs 99.9 %, positive predictive value of 96.9 % vs 96.9 %, negative predictive value of 100.0 % vs 99.9 %, f1-score of 96.9 % vs 98.4 %, respectively. LLMs show promise in providing guideline-based follow-up recommendations for pulmonary nodules, but require rigorous validation and supervision to mitigate potential clinical risks. This study offers insights into their potential role in automated radiological decision support.

Topics

Journal Article

Ready to Sharpen Your Edge?

Join hundreds of your peers who rely on RadAI Slice. Get the essential weekly briefing that empowers you to navigate the future of radiology.

We respect your privacy. Unsubscribe at any time.