GPT-4o for Automated Determination of Follow-up Examinations Based on Radiology Reports from Clinical Routine.

April 16, 2026

Authors

Kaya K, Müller L, Persigehl T, Zopfs D, Nelles C, Dratsch T, Iuga AI, Janssen JP, Schömig T, Kottlors J, Gietzen C, Maintz D, Pennig L, Jorg T, Beste N, Terzis R

Affiliations (4)

  • Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, Cologne, Germany.
  • Department of Diagnostic and Interventional Radiology, University Medical Center of the Johannes Gutenberg-University, Mainz, Germany.
  • Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, Cologne, Germany.
  • Institute for Diagnostic and Interventional Radiology, University Hospital Cologne, Kerpener Str. 62, 50937, Cologne, Köln, Germany.

Abstract

Follow-up imaging recommendations vary across radiologists despite established guidelines. We evaluated whether a large language model (GPT-4o) can standardize follow-up timing and modality selection from routine radiology reports. In this retrospective, two-center study, 100 CT/MRI cases (25 each: head/neck, liver, lung, pancreas) were randomly sampled. GPT-4o and two human readers (R1, resident; R2, board-certified) generated follow-up recommendations from report text. An expert consensus panel, blinded to rater, assessed completeness, appropriateness of imaging modality, timing accuracy, and global follow-up quality on a 5-point scale, with 5 being the highest score. Median global quality was 4 (2–5) for GPT-4o, 4 (1–5) for R1, and 4 (1–5) for R2; GPT-4o exceeded R1 (p < 0.01) and did not differ significantly from R2 (p = 0.06). Relative treatment effects showed a reader effect (GPT-4o 0.56 vs. R1 0.43 vs. R2 0.51; p < 0.01) and no center effect (0.50 vs. 0.50; p = 0.91). Correctness of follow-up timing was 96% (96/100) for GPT-4o, 75% (75/100) for R1, and 90% (90/100) for R2 (p < 0.001 for GPT-4o vs. R1). Completeness of follow-up was 92% (92/100) for GPT-4o, 91% (91/100) for R1, and 80% (80/100) for R2 (p = 0.014 for GPT-4o vs. R2). No significant differences were observed regarding appropriateness of imaging modality. GPT-4o generated follow-up recommendations with overall quality comparable to an experienced radiologist and superior to a trainee, with high completeness and generally appropriate follow-up timing and modality, supporting its role as decision support for standardized, guideline-aligned follow-up. The online version contains supplementary material available at 10.1038/s41598-026-40317-9.

Topics

Journal Article
