Automated identification of incidentalomas requiring follow-up: A multi-anatomy evaluation of LLM-based and supervised approaches.

April 28, 2026

DOI: 10.1016/j.jbi.2026.105048 PMID: 42061667

Authors

Park N, Ahmed F, Sun Z, Lybarger K, Breinhorst E, Hu J, Uzuner Ö, Gunn M, Yetisgen M

Affiliations (5)

Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA.
Department of Information Sciences and Technology, George Mason University, Fairfax, VA, USA.
Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA.
Department of Radiology, Te Whatu Ora Health New Zealand, Te Toka Tumai Auckland, Auckland, New Zealand.
Department of Radiology, School of Medicine, University of Washington, Seattle, WA, USA.

Abstract

This study evaluates large language models (LLMs) against supervised baselines for fine-grained, lesion-level detection of incidentalomas requiring follow-up, addressing the limitations of current document-level classification systems. We used a dataset of 400 annotated radiology reports containing 1623 verified lesion findings. We compared two supervised transformer-based encoders (BioClinicalModernBERT, ModernBERT) against four generative LLM configurations (Llama 3.1-8B, fine-tuned Llama 3.1-8B, GPT-4o, GPT-OSS-20B). We introduced a novel inference strategy that uses lesion-tagged inputs and anatomy-aware prompting to ground model reasoning. Performance was evaluated using class-specific F1-scores. The anatomy-informed GPT-OSS-20B model achieved the highest performance, yielding an incidentaloma-positive macro-F1 of 0.79. This surpassed all supervised baselines (maximum macro-F1: 0.70) and closely matched the inter-annotator agreement of 0.76. Explicit anatomical grounding yielded statistically significant performance gains across GPT-based models (p<0.05), while a majority-vote ensemble of the top systems further improved the macro-F1 to 0.90. Error analysis revealed that anatomy-aware LLMs demonstrated superior contextual reasoning in distinguishing actionable findings from benign lesions. Generative LLMs, when enhanced with structured lesion tagging and anatomical context, significantly outperform traditional supervised encoders and achieve performance comparable to human experts. This approach offers a reliable, interpretable pathway for automated incidental finding surveillance in radiology workflows.
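
The abstract reports class-specific F1-scores and a majority-vote ensemble but does not give implementation details. The following is a minimal illustrative sketch of how such an evaluation could be computed, not the authors' code. The label names ("follow_up", "no_follow_up"), the three toy systems, and all function names are hypothetical, and the sketch assumes an odd number of systems so that binary votes cannot tie.

```python
from collections import Counter


def majority_vote(system_predictions):
    """Combine per-lesion labels from several systems by majority vote.

    system_predictions: list of label lists, one per system, all the same length.
    Assumes an odd number of systems so binary votes cannot tie.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*system_predictions)]


def f1_for_class(gold, pred, cls):
    """F1-score for a single class, treating `cls` as the positive label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, classes):
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


# Toy data (hypothetical): one label per lesion finding.
gold = ["follow_up", "no_follow_up", "follow_up", "no_follow_up"]
system_a = ["follow_up", "no_follow_up", "no_follow_up", "no_follow_up"]
system_b = ["follow_up", "follow_up", "follow_up", "no_follow_up"]
system_c = ["no_follow_up", "no_follow_up", "follow_up", "no_follow_up"]

ensemble = majority_vote([system_a, system_b, system_c])
print(f1_for_class(gold, system_a, "follow_up"))
print(f1_for_class(gold, ensemble, "follow_up"))
```

In this toy example each system makes a different single error, so the vote corrects all of them. Real gains depend on how correlated the systems' errors are.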

Topics

Journal Article
