SIIM Study: LLMs Excel in Numerical Radiology Tasks

In a study reported at SIIM 2026, large language models (LLMs) achieved high accuracy on numerical tasks based on radiology reports, specifically extraction and judgment tasks.

Key Details

1. The LLMs evaluated were Llama 3.1 8B, DeepSeek R1-distilled Llama 8B, OpenAI o1-mini, and OpenAI GPT 5-mini.
2. The tasks involved extraction and judgment from DEXA, ultrasound, CT, and PET radiology reports.
3. Most models achieved over 95% accuracy on extraction tasks; the exception was Llama, which ranged from 86% to 98.7%.
4. GPT 5-mini had the highest minimum accuracy among the tested models (91.7%, on judgment tasks).
5. o1-mini and GPT 5-mini reached perfect accuracy in detecting osteoporosis and made no mathematical errors.
6. Answer-only output formats reduced accuracy for Llama and DeepSeek, but not for the OpenAI models.
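
To make the two task types concrete, here is a minimal rule-based sketch of an extraction step (pulling T-scores out of DEXA report text) followed by a judgment step (deciding whether the values indicate osteoporosis). It is not the study's method, which used LLMs; the regular expression, function names, and sample report are hypothetical and only illustrate what such a task asks for. The threshold is the standard WHO densitometric criterion (T-score of -2.5 or lower).

```python
import re

# WHO densitometric criterion: a T-score at or below -2.5 indicates osteoporosis.
OSTEOPOROSIS_T_SCORE = -2.5

# Hypothetical pattern: "T-score", optional filler such as ": " or " of ",
# then a signed decimal number. Real reports vary far more than this.
T_SCORE_PATTERN = re.compile(r"T-score[^-+\d]*([-+]?\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_t_scores(report: str) -> list[float]:
    """Extraction task: return every T-score value found in the report text."""
    return [float(value) for value in T_SCORE_PATTERN.findall(report)]


def has_osteoporosis(report: str):
    """Judgment task: apply the threshold to the lowest extracted T-score.

    Returns None when no T-score can be found, rather than guessing.
    """
    scores = extract_t_scores(report)
    if not scores:
        return None
    return min(scores) <= OSTEOPOROSIS_T_SCORE


# Made-up report text, for illustration only.
sample_report = "Lumbar spine T-score: -2.7. Femoral neck T-score of -1.9."
print(extract_t_scores(sample_report))
print(has_osteoporosis(sample_report))
```

A fixed pattern like this breaks as soon as the report wording changes, which is one reason free-text extraction is a natural task to test LLMs on.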

Why It Matters

Reliable numerical extraction and judgment by LLMs could streamline radiology workflows by making repetitive data extraction more efficient. Caution is still warranted, however, because the models remain prone to non-mathematical errors and errors rooted in medical knowledge.

Read the full article on AuntMinnie
