GPT-4o for Automated Determination of Follow-up Examinations Based on Radiology Reports from Clinical Routine.

April 16, 2026

DOI: 10.1038/s41598-026-40317-9 PMID: 41991937

Authors

Kaya K, Müller L, Persigehl T, Zopfs D, Nelles C, Dratsch T, Iuga AI, Janssen JP, Schömig T, Kottlors J, Gietzen C, Maintz D, Pennig L, Jorg T, Beste N, Terzis R

Affiliations (4)

Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, Cologne, Germany.
Department of Diagnostic and Interventional Radiology, University Medical Center of the Johannes Gutenberg-University, Mainz, Germany.
Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, Cologne, Germany.
Institute for Diagnostic and Interventional Radiology, University Hospital Cologne, Kerpener Str. 62, 50937, Cologne, Köln, Germany.

Abstract

Follow-up imaging recommendations vary across radiologists despite established guidelines. We evaluated whether a large language model (GPT-4o) can standardize follow-up timing and modality selection from routine radiology reports. In this retrospective, two-center study, 100 CT/MRI cases (25 each: head/neck, liver, lung, pancreas) were randomly sampled. GPT-4o and two human readers (R1, resident; R2, board-certified) generated follow-up recommendations from report text. Expert consensus, blinded to rater, assessed completeness, appropriateness of imaging modality, timing accuracy, and global follow-up quality on a 5-point scale, with 5 being the highest score. Median global quality was 4 (2–5) for GPT-4o, 4 (1–5) for R1, and 4 (1–5) for R2; GPT-4o exceeded R1 (p < 0.01) and did not differ from R2 (p = 0.06). Relative treatment effects showed a reader effect (GPT-4o 0.56 vs. R1 0.43 vs. R2 0.51; p < 0.01) and no center effect (0.50 vs. 0.50; p = 0.91). Correctness of follow-up timing was 96% (96/100) for GPT-4o, 75% (75/100) for R1, and 90% (90/100) for R2 (p < 0.001 for GPT-4o vs. R1). Completeness of follow-up was 92% (92/100) for GPT-4o, 91% (91/100) for R1, and 80% (80/100) for R2 (p = 0.014 for GPT-4o vs. R2). No significant differences were observed regarding appropriateness of imaging modality. GPT-4o generated follow-up recommendations with overall quality comparable to an experienced radiologist and superior to a trainee, with high completeness and generally appropriate follow-up timing and modality, supporting its role as decision support for standardized, guideline-aligned follow-up. The online version contains supplementary material available at 10.1038/s41598-026-40317-9.
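As a reading aid, the counts reported in the abstract can be tabulated and the percentages and percentage-point gaps recomputed. The following is a minimal sketch, not the authors' analysis: the names `REPORTED`, `percent`, and `difference_in_points` are hypothetical, and the script performs no significance testing because the per-case paired ratings behind the reported p-values are not part of the abstract.

```python
from fractions import Fraction

# Counts copied from the abstract: (cases rated correct/complete, total cases).
# The dictionary layout and all names here are illustrative assumptions.
REPORTED = {
    "timing_correct": {"GPT-4o": (96, 100), "R1": (75, 100), "R2": (90, 100)},
    "follow_up_complete": {"GPT-4o": (92, 100), "R1": (91, 100), "R2": (80, 100)},
}


def percent(successes: int, total: int) -> float:
    """Return successes/total as a percentage, using exact rational arithmetic."""
    return float(Fraction(successes, total) * 100)


def difference_in_points(metric: str, rater_a: str, rater_b: str) -> float:
    """Return the gap between two raters on one metric, in percentage points."""
    return percent(*REPORTED[metric][rater_a]) - percent(*REPORTED[metric][rater_b])


if __name__ == "__main__":
    for metric, by_rater in REPORTED.items():
        for rater, (successes, total) in by_rater.items():
            print(f"{metric:20s} {rater:7s} {successes}/{total} = {percent(successes, total):.0f}%")
    # Gaps for the two comparisons the abstract reports p-values for.
    print("timing, GPT-4o vs R1:", difference_in_points("timing_correct", "GPT-4o", "R1"), "points")
    print("completeness, GPT-4o vs R2:", difference_in_points("follow_up_complete", "GPT-4o", "R2"), "points")
```

The gaps computed here are descriptive only. Reproducing the reported p-values would require the case-level ratings and the statistical tests described in the full text.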

Topics

Journal Article
