Large language models in radiologic numerical tasks: A thorough evaluation and error analysis
Authors
Affiliations
- University of California San Francisco Medical Center
Abstract
Purpose
To investigate the performance of large language models (LLMs) in radiology numerical tasks and to perform a comprehensive error analysis.

Materials and Methods
We defined six tasks: three extraction tasks (1: the minimum T-score from a DEXA report; 2: the maximum common bile duct (CBD) diameter from an ultrasound report; 3: the maximum lung nodule size from a CT report) and three judgment tasks (1: whether a PET report describes a highly hypermetabolic region; 2: whether a patient is osteoporotic based on a DEXA report; 3: whether a patient has a dilated CBD based on an ultrasound report). Reports were extracted from the MIMIC-III database and our institution's databases, and ground truths were extracted manually. The models evaluated were Llama 3.1 8B, DeepSeek-R1-Distill-Llama-8B, OpenAI o1-mini, and OpenAI GPT-5-mini. We manually reviewed all incorrect outputs and performed a comprehensive error analysis.

Results
In the extraction tasks, Llama showed relatively variable accuracy across tasks (86%-98.7%), whereas the other models performed consistently well (accuracies >95%). In the judgment tasks, the lowest accuracies of Llama, DeepSeek, o1-mini, and GPT-5-mini were 62.0%, 91.7%, 91.7%, and 99.0%, respectively; o1-mini and GPT-5-mini reached 100% accuracy in detecting osteoporosis. We found no mathematical errors in the outputs of o1-mini or GPT-5-mini. An answer-only output format significantly reduced the performance of Llama and DeepSeek but not of o1-mini or GPT-5-mini.

Conclusion
True reasoning models perform consistently well in radiology numerical tasks and show no mathematical errors. Simpler, non-true-reasoning models may also achieve acceptable performance, depending on the task.
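The abstract does not specify the prompts or scoring harness, so as a rough illustration, here is a minimal Python sketch of how one of the extraction tasks (minimum T-score from a DEXA report) could be scored against manually extracted ground truth. The prompt wording, the `query_model` callable, the regex-based answer parsing, and the numeric tolerance are all assumptions for illustration, not the authors' method.

```python
import re

def parse_numeric_answer(text: str) -> float | None:
    """Pull the first signed decimal number out of a model response.

    Returns None when no number is present (counted as incorrect).
    """
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

def score_extraction_task(reports, ground_truths, query_model, tol=1e-6):
    """Score one extraction task (e.g. minimum T-score from a DEXA report).

    query_model is any callable mapping a prompt string to a response
    string; accuracy is the fraction of responses whose parsed number
    matches the manually extracted ground truth within `tol`.
    """
    prompt_template = (
        "You are reading a radiology report.\n"
        "Report:\n{report}\n\n"
        "What is the minimum T-score? Answer with the number only."
    )
    correct = 0
    for report, truth in zip(reports, ground_truths):
        response = query_model(prompt_template.format(report=report))
        value = parse_numeric_answer(response)
        if value is not None and abs(value - truth) <= tol:
            correct += 1
    return correct / len(reports)

if __name__ == "__main__":
    # Toy stand-in for an LLM call; in practice this would wrap a real
    # client (e.g. an OpenAI endpoint or a locally served Llama model).
    def fake_model(prompt: str) -> str:
        return "The minimum T-score is -2.7."

    reports = ["DEXA: lumbar spine T-score -2.7, femoral neck T-score -1.9."]
    truths = [-2.7]
    print(f"accuracy: {score_extraction_task(reports, truths, fake_model):.1%}")
```

The same harness generalizes to the other two extraction tasks by swapping the question in the prompt; the "answer-only" condition mentioned in the Results corresponds to instructing the model to return the number alone, as in the template above.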