GPT-4.1 and Llama 3.3 70B fail to detect clinically relevant errors in radiology reports in zero-shot evaluation.
Authors
Affiliations (4)
- Division of Diagnostic and Interventional Neuroradiology, Department of Radiology, University Hospital Basel, Basel, Switzerland.
- Department of Pediatric Radiology, University Children's Hospital Basel, Basel, Switzerland.
- Institute of Diagnostic and Interventional Radiology, TUM University Hospital, TUM School of Medicine and Health, Technical University of Munich, Munich, Germany.
- Department of Cardiovascular Radiology and Nuclear Medicine, German Heart Center Munich, TUM University Hospital, Munich, Germany.
Abstract
This study aimed to evaluate whether GPT-4.1 and Llama 3.3 70B, large language models (LLMs) assessed in zero-shot, baseline configurations, detect and categorize clinically consequential errors across types that range from pattern-based to reasoning-dependent.
Two hundred fifty-six radiology reports encompassing CT (n = 104), MRI (n = 104), and X-ray (n = 48) studies across multiple anatomical regions were retrospectively analyzed. For each original report, four variants were generated (n = 1024 in total), each incorporating one of four predefined error types: E1, anatomical mislabeling that could cause wrong-site actions; E2, physiologically impossible or nonsensical findings; E3, diagnostic inconsistencies that affect staging or diagnosis; E4, inappropriate recommendations. The evaluated models were GPT-4.1 (04-14 snapshot) and Llama 3.3 70B, both used without domain-specific training or prompt optimization to assess baseline model performance.
Model performance revealed a systematic hierarchy governed by error type and imaging modality. Physiologically impossible errors (E2) showed the lowest performance: 46.2% (CT) and 33.7% (MRI) for GPT-4.1, compared with 32.7% and 25.0%, respectively, for Llama 3.3. Overall success for GPT-4.1 on E2 was 16.3% (CT), 8.7% (MRI), and 12.5% (X-ray). On MRI, mislabeling errors (E1) were detected in 49.0% of cases by GPT-4.1 and in 33.7% by Llama 3.3. The best performance occurred for inappropriate recommendations (E4), with GPT-4.1 achieving 85.4% detection in X-ray with high classification accuracy.
The evaluation framework and benchmark dataset provide a methodology for assessing LLM performance on clinically significant errors. Applied to GPT-4.1 and Llama 3.3 70B in zero-shot settings, the framework reveals a performance gap between pattern-based and reasoning-dependent error detection that warrants investigation across additional models and optimization strategies.
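The dataset arithmetic in the abstract can be checked with a short sketch. It assumes, based on the stated design of one variant per error type per report, that each reported percentage for a given error type and modality has that modality's report count as its denominator. The function names and the implied counts are illustrative and are not taken from the study.

```python
# Report counts per modality and the error types, as stated in the abstract.
REPORTS_PER_MODALITY = {"CT": 104, "MRI": 104, "X-ray": 48}
ERROR_TYPES = ("E1", "E2", "E3", "E4")


def total_variants(reports=REPORTS_PER_MODALITY, error_types=ERROR_TYPES):
    """Number of error-containing variants: one per error type per report."""
    return sum(reports.values()) * len(error_types)


def implied_count(rate_percent, modality, reports=REPORTS_PER_MODALITY):
    """Whole number of reports closest to a reported percentage.

    Assumption (not stated in the source): the denominator is the number
    of reports in that modality.
    """
    return round(rate_percent / 100 * reports[modality])


def as_percent(count, modality, reports=REPORTS_PER_MODALITY):
    """Convert a count back to a percentage rounded to one decimal place."""
    return round(100 * count / reports[modality], 1)


if __name__ == "__main__":
    print("total variants:", total_variants())
    # GPT-4.1 percentages for E2 quoted in the abstract.
    for modality, rate in [("CT", 46.2), ("MRI", 33.7)]:
        n = implied_count(rate, modality)
        print(f"E2 {modality}: {rate}% ~ {n}/{REPORTS_PER_MODALITY[modality]}")
```

Under this assumption, every percentage quoted in the abstract corresponds to a whole number of reports, which is consistent with per-modality denominators of 104, 104, and 48.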
Question: LLMs are increasingly used for quality assurance of radiology reports, but whether their linguistic competence translates into the detection of clinically significant errors remains unclear.
Findings: Error detection was type-dependent; both GPT-4.1 and Llama 3.3 70B performed poorly on physiologic and anatomical errors but better on inappropriate recommendations.
Clinical relevance: LLMs failed to detect most clinically consequential errors in radiology reports, especially physiologically impossible statements that trained radiologists would rapidly identify.