
Large Language Models in Radiologic Numerical Tasks: A Thorough Evaluation and Error Analysis.

January 21, 2026

Authors

Nowroozi A, Bondarenko M, Serapio A, Schnitzler T, Brar SS, Sohn JH

Affiliations (3)

  • Center for Intelligent Imaging, Department of Radiology and Biomedical Imaging, University of California, San Francisco (UCSF), San Francisco, CA, USA.
  • All India Institute of Medical Sciences, Bhopal, India.
  • Center for Intelligent Imaging, Department of Radiology and Biomedical Imaging, University of California, San Francisco (UCSF), San Francisco, CA, USA. [email protected].

Abstract

The purpose of this study was to investigate the performance of large language models (LLMs) in radiology numerical tasks and to perform a comprehensive error analysis. We defined six tasks: extracting (1) the minimum T-score from a DEXA report, (2) the maximum common bile duct (CBD) diameter from an ultrasound report, and (3) the maximum lung nodule size from a CT report; and judging (1) the presence of a highly hypermetabolic region on a PET report, (2) whether a patient is osteoporotic based on a DEXA report, and (3) whether a patient has a dilated CBD based on an ultrasound report. Reports were extracted from the MIMIC-III database and our institution's databases, and ground truths were extracted manually. The models evaluated were Llama 3.1 8B, DeepSeek R1 distilled Llama 8B, OpenAI o1-mini, and OpenAI GPT-5-mini. We manually reviewed all incorrect outputs and performed a comprehensive error analysis. In extraction tasks, Llama showed relatively variable results across tasks (accuracies ranging from 86% to 98.7%), while the other models performed consistently well (accuracies > 95%). In judgment tasks, the lowest accuracies of Llama, DeepSeek distilled Llama, o1-mini, and GPT-5-mini were 62.0%, 91.7%, 91.7%, and 99.0%, respectively, while o1-mini and GPT-5-mini reached 100% accuracy in detecting osteoporosis. We found no mathematical errors in the outputs of o1-mini and GPT-5-mini. An answer-only output format significantly reduced performance in Llama and DeepSeek distilled Llama but not in o1-mini or GPT-5-mini. In conclusion, reinforcement learning (RL) reasoning models perform consistently well in radiology numerical tasks and show no mathematical errors. Simpler non-RL reasoning models may also achieve acceptable performance depending on the task.

Topics

Journal Article
