At SIIM 2026, large language models (LLMs) demonstrated high accuracy on numerical tasks based on radiology reports, particularly extraction and judgment tests.
Key Details
- LLMs evaluated included Llama 3.1 8B, DeepSeek R1-distilled Llama 8B, OpenAI o1-mini, and OpenAI GPT 5-mini.
- Tasks tested involved extraction and judgment from DEXA, ultrasound, CT, and PET radiology reports.
- Models other than Llama achieved over 95% accuracy on extraction tasks; Llama ranged from 86% to 98.7%.
- GPT 5-mini achieved the highest minimum accuracy among the tested models (91.7%, on judgment tasks).
- o1-mini and GPT 5-mini reached perfect accuracy in detecting osteoporosis and made no mathematical errors.
- Answer-only output formats reduced accuracy for Llama and DeepSeek, but not OpenAI models.
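
To make the distinction between the two task types concrete, here is a minimal Python sketch of what an extraction step and a judgment step could look like for a DEXA report. It is not the study's method: the report text, the function names `extract_t_scores` and `classify`, and the regular expression are all hypothetical, and the study's actual prompts and scoring are not described in this summary. The thresholds are the standard WHO T-score cutoffs (at or below -2.5 for osteoporosis, between -2.5 and -1.0 for osteopenia).

```python
import re

# Hypothetical report snippet, invented purely for illustration.
REPORT = "DEXA: Lumbar spine (L1-L4) T-score -2.7. Left femoral neck T-score -1.9."


def extract_t_scores(text: str) -> list[float]:
    """Extraction step: pull every T-score value out of the free text.

    Assumes ASCII hyphen-minus signs and the literal label "T-score".
    """
    pattern = r"T-score\s*(?:of\s*)?(-?\d+(?:\.\d+)?)"
    return [float(value) for value in re.findall(pattern, text)]


def classify(t_scores: list[float]) -> str:
    """Judgment step: apply WHO T-score cutoffs to the lowest value."""
    lowest = min(t_scores)
    if lowest <= -2.5:
        return "osteoporosis"
    if lowest < -1.0:
        return "osteopenia"
    return "normal"


scores = extract_t_scores(REPORT)
label = classify(scores)
print(scores, label)
```

Extraction only has to copy numbers out of the text correctly, while judgment also requires comparing them against a clinical threshold, so the second step can fail even when the first succeeds.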
Why It Matters
Reliable numerical extraction and judgment by LLMs could streamline radiology workflows by making repetitive data extraction more efficient. However, caution is still warranted: non-mathematical errors and errors rooted in medical knowledge remain a risk.

Source
AuntMinnie