
SIIM Study: LLMs Excel in Numerical Radiology Tasks

AuntMinnie | Industry

In a study presented at SIIM 2026, LLMs achieved high accuracy on numerical tasks based on radiology reports, particularly extraction and judgment tests.

Key Details

  • LLMs evaluated included Llama 3.1 8B, DeepSeek R1-distilled Llama 8B, OpenAI o1-mini, and OpenAI GPT 5-mini.
  • Tasks involved extraction and judgment from DEXA, ultrasound, CT, and PET radiology reports.
  • Models other than Llama achieved over 95% accuracy on extraction tasks; Llama ranged from 86% to 98.7%.
  • GPT 5-mini achieved the highest minimum accuracy among the tested models (91.7%, on judgment tasks).
  • o1-mini and GPT 5-mini reached perfect accuracy in detecting osteoporosis and made no mathematical errors.
  • Answer-only output formats reduced accuracy for Llama and DeepSeek, but not for the OpenAI models.
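
To make the distinction between the two task types concrete, here is a minimal rule-based sketch of what an extraction step and a judgment step might look like for a DEXA report. It is an illustration only, not the study's method or prompts: the report text, the regular expression, and the function names `extract_t_scores` and `classify_bone_density` are all hypothetical. The thresholds follow the widely used WHO T-score criteria (osteoporosis at -2.5 or below), and taking the lowest score across sites is a simplifying assumption.

```python
import re

# Hypothetical report snippet, written for this example (not from the study).
REPORT = (
    "Lumbar spine (L1-L4): BMD 0.712 g/cm2, T-score -2.7. "
    "Left femoral neck: BMD 0.655 g/cm2, T-score -2.1."
)

# Assumed phrasing: "T-score" followed by an optional "of", ":" or "=" and a
# signed decimal number. Real reports vary far more than this pattern allows.
T_SCORE_PATTERN = re.compile(
    r"T-score\s*(?:of|:|=)?\s*([+-]?\d+(?:\.\d+)?)", re.IGNORECASE
)


def extract_t_scores(report: str) -> list[float]:
    """Extraction step: pull every T-score value out of the report text."""
    return [float(value) for value in T_SCORE_PATTERN.findall(report)]


def classify_bone_density(t_scores: list[float]) -> str:
    """Judgment step: apply WHO T-score thresholds to the lowest value found."""
    if not t_scores:
        return "indeterminate"
    lowest = min(t_scores)
    if lowest <= -2.5:
        return "osteoporosis"
    if lowest < -1.0:
        return "low bone mass"
    return "normal"


if __name__ == "__main__":
    scores = extract_t_scores(REPORT)
    print(scores, classify_bone_density(scores))
```

The extraction step only has to find the numbers; the judgment step has to compare them against a clinical threshold, which is where a model can go wrong even when it reads the values correctly.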

Why It Matters

Reliable numerical extraction and judgment by LLMs could streamline radiology workflows by making repetitive data extraction more efficient. However, caution is still warranted: non-mathematical errors and errors of medical knowledge remain a risk.
