Stanford and Mayo Clinic Arizona researchers demonstrated that LLMs like GPT-4 can categorize critical findings in radiology reports using few-shot prompting.
Key Details
- 1GPT-4 and Mistral-7B LLMs tested for classifying critical findings in radiology reports from ICU patients.
- 2252 MIMIC-III reports (mixed modalities: 56% CT, ~30% x-ray, 9% MRI) and 180 external chest x-ray reports evaluated.
- 3LLMs categorized findings as true, known/expected, or equivocal critical findings.
- 4GPT-4 achieved 90.1% precision and 86.9% recall for true critical findings in internal test set; 82.6% precision and 98.3% recall in external test set.
- 5Mistral-7B showed lower precision (75.6%) but comparable recall (77.4%-93.1%).
- 6Study highlights few-shot prompting as an efficient strategy; real-world deployment requires further refinement.
Why It Matters

Source
AuntMinnie
Related News

Study: Computer Vision Models Best LLMs in Chest CT Breast Abnormality Detection
Computer vision models (CVMs) surpass large language models (LLMs) in accurately labeling incidental breast abnormalities on chest CT scans.

Private Equity Backs AIRS Medical to Expand MRI AI Globally
TA Associates is investing in AIRS Medical to accelerate its global expansion of AI-powered MRI efficiency solutions.

Radiology Maintains Lead in FDA-Cleared AI Algorithms, Cardiology Follows
Radiology remains the top specialty for FDA-cleared AI, with cardiology as a strong second, particularly in cardiovascular imaging.