Automated identification of incidentalomas requiring follow-up: A multi-anatomy evaluation of LLM-based and supervised approaches.

April 28, 2026 (PubMed)

Authors

Park N, Ahmed F, Sun Z, Lybarger K, Breinhorst E, Hu J, Uzuner Ö, Gunn M, Yetisgen M

Affiliations (5)

  • Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA.
  • Department of Information Sciences and Technology, George Mason University, Fairfax, VA, USA.
  • Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA.
  • Department of Radiology, Te Whatu Ora Health New Zealand, Te Toka Tumai Auckland, Auckland, New Zealand.
  • Department of Radiology, School of Medicine, University of Washington, Seattle, WA, USA.

Abstract

This study evaluates large language models (LLMs) against supervised baselines for fine-grained, lesion-level detection of incidentalomas requiring follow-up, addressing the limitations of current document-level classification systems.

We used a dataset of 400 annotated radiology reports containing 1623 verified lesion findings. We compared two supervised transformer-based encoders (BioClinicalModernBERT, ModernBERT) against four generative LLM configurations (Llama 3.1-8B, fine-tuned Llama 3.1-8B, GPT-4o, GPT-OSS-20B). We introduced a novel inference strategy that uses lesion-tagged inputs and anatomy-aware prompting to ground model reasoning. Performance was evaluated using class-specific F1-scores.

The anatomy-informed GPT-OSS-20B model achieved the highest performance, yielding an incidentaloma-positive macro-F1 of 0.79. This surpassed all supervised baselines (maximum macro-F1: 0.70) and closely matched the inter-annotator agreement of 0.76. Explicit anatomical grounding yielded statistically significant performance gains across GPT-based models (p<0.05), while a majority-vote ensemble of the top systems further improved the macro-F1 to 0.90. Error analysis revealed that anatomy-aware LLMs demonstrated superior contextual reasoning in distinguishing actionable findings from benign lesions.

Generative LLMs, when enhanced with structured lesion tagging and anatomical context, significantly outperform traditional supervised encoders and achieve performance comparable to human experts. This approach offers a reliable, interpretable pathway for automated incidental finding surveillance in radiology workflows.
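As a rough illustration of the evaluation ideas named in the abstract (class-specific F1, a macro average, and a majority-vote ensemble), the following is a minimal Python sketch and not the authors' implementation. The label names (`followup`, `no_followup`, `none`), the toy predictions, the tie-breaking rule, and the choice of which classes enter the macro average are all assumptions made for the example; the abstract does not specify them.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-lesion labels from several systems by majority vote.

    predictions: list of equal-length label lists, one list per system.
    Assumption: ties go to the tied label cast by the earliest-listed system.
    """
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all systems must label the same lesions")
    ensemble = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Keep vote order so the earliest system wins a tie.
        winners = [label for label in votes if counts[label] == top]
        ensemble.append(winners[0])
    return ensemble


def f1_per_class(gold, pred):
    """Return {label: F1} computed one class at a time."""
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold, pred, labels=None):
    """Unweighted mean of class-specific F1 over the chosen labels."""
    scores = f1_per_class(gold, pred)
    chosen = list(labels) if labels is not None else list(scores)
    return sum(scores.get(label, 0.0) for label in chosen) / len(chosen)


# Hypothetical lesion-level labels for six lesions (toy data, not from the study).
gold = ["followup", "none", "none", "followup", "no_followup", "none"]
system_a = ["followup", "none", "followup", "followup", "no_followup", "none"]
system_b = ["followup", "none", "none", "no_followup", "no_followup", "none"]
system_c = ["none", "none", "none", "followup", "followup", "none"]

ensemble = majority_vote([system_a, system_b, system_c])
print("system A macro-F1:", round(macro_f1(gold, system_a), 3))
print("ensemble macro-F1:", round(macro_f1(gold, ensemble), 3))
```

In this toy setup each system makes a different mistake, so the vote corrects all of them; real systems tend to share errors, and the gain from an ensemble depends on how independent those errors are.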

Topics

Journal Article
