Influence of structured output constraints on GPT-5-Thinking, Gemini 2.5 Pro, and open-weight LLMs for radiology protocol selection.

April 10, 2026

Authors

Bahaaeldin M, Nowak S, Seidel O, Isaak A, Kravchenko D, Proff A, Dell T, Kuetting D, Luetkens JA, Mesropyan N

Affiliations (4)

  • Department of Diagnostic and Interventional Radiology, University Hospital Bonn, Bonn, Germany.
  • Quantitative Imaging Laboratory Bonn (QILaB), University Hospital Bonn, Bonn, Germany.
  • Department of Diagnostic and Interventional Radiology, University Hospital Bonn, Bonn, Germany. [email protected].
  • Quantitative Imaging Laboratory Bonn (QILaB), University Hospital Bonn, Bonn, Germany. [email protected].

Abstract

To evaluate the impact of constraining proprietary and open large language models (LLMs) to structured outputs when processing radiology request forms (RRFs).

We evaluated five LLMs, two proprietary (GPT-5-Thinking, Gemini 2.5 Pro) and three open (Qwen3-235B-A22B-Thinking, gpt-oss-120b, medgemma-27b-it), on 100 RRFs (50 computed tomography, 50 magnetic resonance imaging). Each model processed all cases both with and without structured output constraints. Endpoints included accuracy for modality, anatomical region, contrast phase, urgency, "all correct" (all four categories correct), and "indication improved" (clarity of the rewritten indication text). Outputs were evaluated against a reference standard defined by board-certified radiologists and compared with two radiology residents (first-year and third-year). Accuracies with 95% confidence intervals were calculated.

Constraining to structured outputs had model-dependent effects: it improved Gemini 2.5 Pro ("all correct": from 53.0% [43.3-62.5] to 66.0% [56.3-74.5]) but reduced GPT-5-Thinking accuracy (from 76.0% [66.8-83.3] to 53.0% [43.3-62.5]), with minimal influence on the open models. Both proprietary LLMs outperformed the best open models (up to 41.0% [31.9-50.8]). All LLMs exceeded the unassisted first-year resident's performance (19.0% [12.5-27.8]). LLM assistance improved the first-year resident's accuracy to 65.0% [55.3-73.6], approaching the third-year resident's performance (80.0% [71.1-86.7]), which was comparable to that of the best LLMs. Across models, performance was highest for modality and anatomical region and lowest for urgency. Indication reformulation was judged clearer in >90% of cases across all models, with no hallucinations observed.

Constraining to structured outputs exerted model-specific effects. Proprietary LLMs achieved the highest accuracy in RRF-based protocol selection and raised first-year resident performance to the level of an experienced resident. LLMs may serve as valuable decision-support tools in the radiology workflow.

Constraining LLMs to structured outputs produced divergent, model-specific effects in radiology protocol selection, improving Gemini 2.5 Pro, reducing GPT-5-Thinking, and minimally affecting open models, which highlights the need for model-specific prompting strategies before adopting LLMs in radiology decision support.

  • Structured output constraints affect LLM performance differently.
  • Gemini 2.5 Pro benefits from structured prompting, while GPT-5-Thinking declines.
  • Open-weight models show minimal impact from output constraining.
  • Proprietary models outperform open models in radiology protocol selection.
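
The study's central manipulation, constraining a model to structured outputs, generally means forcing the reply into a fixed machine-readable schema rather than free text. Below is a minimal Python sketch of that idea using a Pydantic schema with post-hoc validation; the paper does not publish its exact schema, prompts, or label sets, so the class name, field names, and allowed values here are illustrative assumptions only.

```python
# Minimal sketch of "constraining an LLM to structured outputs": the reply
# must be JSON that conforms to a fixed schema instead of free text.
# The schema, field names, and allowed values are illustrative assumptions;
# the paper does not publish its exact schema or prompts.
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError


class ProtocolSelection(BaseModel):
    modality: Literal["CT", "MRI"]                      # assumed label set
    anatomical_region: str                              # e.g. "abdomen", "head"
    contrast_phase: str                                 # e.g. "native", "portal venous"
    urgency: Literal["routine", "urgent", "emergency"]  # assumed label set
    rewritten_indication: str                           # clarified free-text indication


def parse_llm_reply(raw_json: str) -> Optional[ProtocolSelection]:
    """Accept the model's reply only if it conforms to the schema."""
    try:
        return ProtocolSelection.model_validate_json(raw_json)
    except ValidationError:
        return None  # schema-constrained decoding would prevent this branch


# A well-formed reply passes validation; anything off-schema is rejected.
reply = (
    '{"modality": "CT", "anatomical_region": "abdomen", '
    '"contrast_phase": "portal venous", "urgency": "urgent", '
    '"rewritten_indication": "Suspected diverticulitis; rule out abscess."}'
)
print(parse_llm_reply(reply))
```

In practice the same schema is often enforced at decoding time through an API's JSON-schema or structured-output mode rather than validated afterwards; the divergent results above suggest the effect of such constraints is worth verifying per model.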

Topics

Radiology, Radiology Information Systems, Programming Languages, Journal Article
