
Out-of-the-Box Large Language Models for Detecting and Classifying Critical Findings in Radiology Reports Using Various Prompt Strategies.

Authors

Talati IA, Chaves JMZ, Das A, Banerjee I, Rubin DL

Affiliations (6)

  • Department of Radiology, Stanford University, Stanford, CA, USA.
  • Microsoft Research, Redmond, WA, USA.
  • Arizona Advanced AI & Innovation (A3I) Hub, Mayo Clinic Arizona, Phoenix, AZ, USA.
  • Department of Radiology, Mayo Clinic Arizona, Phoenix, AZ, USA.
  • School of Computing and Augmented Intelligence, Arizona State University, Tempe, AZ, USA.
  • Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.

Abstract

Background: The increasing complexity and volume of radiology reports present challenges for the timely communication of critical findings.

Purpose: To evaluate the performance of two out-of-the-box large language models (LLMs) in detecting and classifying critical findings in radiology reports using various prompt strategies.

Methods: The analysis included 252 radiology reports of varying modalities and anatomic regions extracted from the MIMIC-III database, divided into a prompt engineering tuning set of 50 reports, a holdout test set of 125 reports, and a pool of 77 remaining reports used as examples for few-shot prompting. An external test set of 180 chest radiography reports was extracted from the CheXpert Plus database. Reports were manually reviewed to identify critical findings and to classify each finding into one of three categories (true critical finding, known/expected critical finding, equivocal critical finding). Following prompt engineering with various prompt strategies, a final prompt optimized for detection of true critical findings was selected. Two general-purpose LLMs, GPT-4 and Mistral-7B, processed the test-set reports using the final prompt. Evaluation included automated text similarity metrics (BLEU-1, ROUGE-F1, G-Eval) and manual performance metrics (precision, recall).

Results: For true critical findings, zero-shot, few-shot static (five examples), and few-shot dynamic (five examples) prompting yielded BLEU-1 scores of 0.691, 0.778, and 0.748; ROUGE-F1 scores of 0.706, 0.797, and 0.773; and G-Eval scores of 0.428, 0.573, and 0.516, respectively. Precision and recall for true critical findings, known/expected critical findings, and equivocal critical findings were 90.1% and 86.9%, 80.9% and 85.0%, and 80.5% and 94.3% for GPT-4 on the holdout test set; 75.6% and 77.4%, 34.1% and 70.0%, and 41.3% and 74.3% for Mistral-7B on the holdout test set; 82.6% and 98.3%, 76.9% and 71.4%, and 70.8% and 85.0% for GPT-4 on the external test set; and 75.0% and 93.1%, 33.3% and 92.9%, and 34.0% and 80.0% for Mistral-7B on the external test set.

Conclusion: Out-of-the-box LLMs were used to detect and classify arbitrary numbers of critical findings in radiology reports. The best-performing configuration for true critical findings used a few-shot static prompting approach.

Clinical Impact: The study demonstrates a role for contemporary general-purpose models in adapting to specialized medical tasks with minimal data annotation.
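
The abstract does not reproduce the prompts themselves. The sketch below is a minimal, hypothetical illustration of how a few-shot static strategy of the kind described in the Methods could be wired up against an OpenAI-style chat API: the system prompt wording, example reports, annotated findings, and output format are all assumptions, and only the three finding categories come from the abstract.

```python
# Minimal sketch of a few-shot static prompting strategy for critical
# findings detection/classification. All prompt wording and examples are
# illustrative assumptions, not the authors' actual prompt.
# Requires the `openai` package (>=1.0) and an OPENAI_API_KEY in the environment.

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a radiology assistant. Extract every critical finding from the "
    "report and label each as one of: true critical finding, "
    "known/expected critical finding, or equivocal critical finding. "
    "Return one finding per line as '<category>: <finding>'."
)

# Few-shot static: the same small, fixed set of annotated example reports is
# prepended to every query (the paper used five examples; two shown here).
FEW_SHOT_EXAMPLES = [
    {
        "report": "CT head: New 8 mm acute subdural hematoma along the left convexity.",
        "findings": "true critical finding: acute subdural hematoma",
    },
    {
        "report": "Chest radiograph: Stable right pneumothorax, unchanged from prior study.",
        "findings": "known/expected critical finding: stable right pneumothorax",
    },
]


def classify_critical_findings(report_text: str, model: str = "gpt-4") -> str:
    """Send one radiology report through the few-shot static prompt."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for example in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": example["report"]})
        messages.append({"role": "assistant", "content": example["findings"]})
    messages.append({"role": "user", "content": report_text})

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,  # deterministic output for an extraction task
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    report = (
        "CT abdomen/pelvis: Free air beneath the right hemidiaphragm, "
        "concerning for hollow viscus perforation."
    )
    print(classify_critical_findings(report))
```

A few-shot dynamic variant would differ only in how FEW_SHOT_EXAMPLES is built, selecting for each incoming report the most similar annotated examples (e.g., by embedding similarity) rather than reusing a fixed set. For the automated text-similarity evaluation, a rough sketch is shown below; it assumes ROUGE-F1 refers to the ROUGE-1 F-measure and omits G-Eval, which is an LLM-based judging metric.

```python
# Illustrative computation of BLEU-1 and ROUGE-1 F1 between a predicted
# findings string and a manually annotated reference.
# Requires the `nltk` and `rouge-score` packages.

from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

reference = "true critical finding: acute subdural hematoma"
prediction = "true critical finding: left acute subdural hematoma"

# BLEU-1: all weight on unigram precision.
bleu1 = sentence_bleu([reference.split()], prediction.split(), weights=(1.0, 0, 0, 0))

# ROUGE-1 F-measure between reference and prediction.
rouge1_f1 = rouge_scorer.RougeScorer(["rouge1"]).score(reference, prediction)["rouge1"].fmeasure

print(f"BLEU-1: {bleu1:.3f}, ROUGE-1 F1: {rouge1_f1:.3f}")
```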

