Human-in-the-Loop Large Language Model-Augmented Diagnostic Reasoning in Thoracic Imaging: Impact of Radiologic Expertise.

May 20, 2026

DOI: 10.2214/AJR.26.34999 PMID: 42160120

Authors

Song J, Ko H, Han DH, Hwang EJ, Cho HS, Lee JY, Jeong WG, Yoon SH, Kim H, Kim J, Park J, Park J, Son Y, Lee JE, Lee T

Affiliations (5)

Department of Radiology, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea.
Department of Radiology, Seoul St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Korea.
Department of Radiology, Uijeongbu Eulji University Hospital and Daejeon Eulji University, Daejeon, Korea.
Department of Radiology, National Cancer Center, Goyang, Korea.
Department of Radiology, National Jewish Health, Denver, Colorado, United States.

Abstract

Background: Technical and regulatory constraints limit the application of large language models (LLMs) for augmenting diagnostic reasoning in radiology. Reader-mediated text-based workflows may provide a practical alternative.

Objective: To evaluate the impact on diagnostic performance of LLM assistance using reader-generated free-text image descriptions and to assess the effect of reader expertise on this LLM-augmented diagnostic workflow.

Methods: This retrospective study included 93 cases (encompassing radiographic, CT, MRI, and PET/CT images) from the Korean Society of Thoracic Radiology quiz platform from January 2014 to December 2017. Five differential diagnoses (the correct diagnosis and four distractors) were assembled for each case. Ten readers (five thoracic radiologists, five radiology residents) independently interpreted the cases. In session 1, readers selected the most likely diagnosis and provided a free-text description of key findings. An LLM (Gemini 3.0 Pro) received the free-text description (without case images) and returned a ranking of the five differential diagnoses, along with explanatory rationales for the top three options. In session 2, readers were provided the LLM output from their own free-text description and re-selected a most likely diagnosis. LLM performance using images (without free-text descriptions) was also assessed. LLM accuracy was determined using top-ranked diagnoses. Reader groups were compared using generalized estimating equations.

Results: LLM accuracy was 52.7% when given images and 63.9% when given reader-generated descriptions. LLM accuracy was greater using descriptions generated by thoracic radiologists than by residents (67.3% vs 60.4%; P<.001). From session 1 to session 2, accuracy increased for thoracic radiologists from 56.3% to 65.6% and for residents from 42.4% to 58.5%. Accuracy improvement between sessions was greater for residents than for thoracic radiologists (16.1 vs 9.2 percentage points; P=.02). Residents, compared with thoracic radiologists, demonstrated a greater rate of accepting LLM-favored diagnoses (73.2% vs 48.9%; P<.001), including a greater rate of switching to an incorrect diagnosis following misleading LLM output (60.6% vs 32.4%; P=.009).

Conclusions: The text-based LLM-assisted workflow improved reader accuracy, although it was heavily influenced by reader expertise.

Clinical Impact: The utility of human-in-the-loop workflows arises from dynamic reader-LLM interactions shaped by the expertise of the operator formulating model inputs and critically evaluating model outputs.
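
The abstract scores the LLM by its top-ranked diagnosis and reports between-session gains in percentage points. The minimal Python sketch below illustrates those two calculations only. The function names and the toy rankings are hypothetical and are not taken from the study; the sketch does not reproduce the generalized estimating equations used for the group comparisons.

```python
from typing import Sequence


def top1_accuracy(rankings: Sequence[Sequence[str]], truths: Sequence[str]) -> float:
    """Fraction of cases whose top-ranked diagnosis matches the reference diagnosis."""
    if len(rankings) != len(truths):
        raise ValueError("rankings and truths must have the same length")
    if not truths:
        raise ValueError("at least one case is required")
    hits = sum(
        1 for ranked, truth in zip(rankings, truths) if ranked and ranked[0] == truth
    )
    return hits / len(truths)


def percentage_point_change(before: float, after: float) -> float:
    """Absolute change between two proportions, expressed in percentage points."""
    return (after - before) * 100.0


# Hypothetical toy data: four cases, each with five ranked candidate diagnoses.
toy_rankings = [
    ["A", "B", "C", "D", "E"],
    ["B", "A", "C", "D", "E"],
    ["C", "A", "B", "D", "E"],
    ["D", "A", "B", "C", "E"],
]
toy_truths = ["A", "A", "C", "E"]

# Two of the four top-ranked diagnoses match the reference.
toy_accuracy = top1_accuracy(toy_rankings, toy_truths)

# Resident accuracies reported in the abstract (42.4% to 58.5%).
resident_gain = percentage_point_change(0.424, 0.585)

print(f"toy top-1 accuracy: {toy_accuracy:.2f}")
print(f"resident gain: {resident_gain:.1f} percentage points")
```

Expressing the gain in percentage points (an absolute difference) rather than as a relative change matches how the abstract reports the between-session improvement.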

Topics

Journal Article
