Stanford and Mayo Clinic Arizona researchers demonstrated that LLMs like GPT-4 can categorize critical findings in radiology reports using few-shot prompting.
Key Details
- GPT-4 and Mistral-7B LLMs tested for classifying critical findings in radiology reports from ICU patients.
- 252 MIMIC-III reports (mixed modalities: 56% CT, ~30% x-ray, 9% MRI) and 180 external chest x-ray reports evaluated.
- LLMs categorized findings as true, known/expected, or equivocal critical findings.
- GPT-4 achieved 90.1% precision and 86.9% recall for true critical findings in the internal test set; 82.6% precision and 98.3% recall in the external test set.
- Mistral-7B showed lower precision (75.6%) but comparable recall (77.4%-93.1%).
- Study highlights few-shot prompting as an efficient strategy; real-world deployment requires further refinement.
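The few-shot strategy the study describes amounts to placing a handful of labeled example reports in the prompt ahead of the report to be classified. The sketch below shows one way to assemble such a prompt; the category names follow the study's three labels, but the example reports and the `build_messages` helper are hypothetical illustrations, not the authors' actual prompts or data.

```python
# Minimal sketch of few-shot prompting for critical-finding classification.
# The three labels mirror the study's categories; the example reports are
# invented placeholders, not MIMIC-III text.

CATEGORIES = [
    "true critical finding",
    "known/expected critical finding",
    "equivocal critical finding",
]

# Hypothetical labeled examples that would be shown to the model in-context.
FEW_SHOT_EXAMPLES = [
    ("Impression: New large right pneumothorax.",
     "true critical finding"),
    ("Impression: Stable, previously reported liver metastases.",
     "known/expected critical finding"),
    ("Impression: Possible subtle free air; clinical correlation advised.",
     "equivocal critical finding"),
]

def build_messages(report_text: str) -> list[dict]:
    """Assemble a chat prompt: instructions, labeled examples, then the new report."""
    messages = [{
        "role": "system",
        "content": ("Classify the critical finding in each radiology report as one of: "
                    + ", ".join(CATEGORIES) + ". Answer with the category only."),
    }]
    # Each few-shot example is a user/assistant turn pair.
    for example, label in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": example})
        messages.append({"role": "assistant", "content": label})
    # The report to classify goes last.
    messages.append({"role": "user", "content": report_text})
    return messages
```

The resulting message list can then be passed to any chat-style LLM endpoint (GPT-4, Mistral-7B, or another model) to obtain a category label.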
Why It Matters

Source
AuntMinnie
Related News

Toronto Study: LLMs Must Cite Sources for Radiology Decision Support
University of Toronto researchers found that large language models (LLMs) such as DeepSeek V3 and GPT-4o offer promising support for radiology decision-making in pancreatic cancer when their recommendations cite guideline sources.

AI Model Using Mammograms Enhances Five-Year Breast Cancer Risk Assessment
A new image-only AI model more accurately predicts five-year breast cancer risk than breast density alone, according to multinational research presented at RSNA 2025.

AI Model Uses CT Scans to Reveal Biomarker for Chronic Stress
Researchers developed an AI model to measure chronic stress using adrenal gland volume on routine CT scans.