Detection without calibration: benchmarking domestic and international large language models for quality control of Mandarin 18F-FDG PET/CT reports

June 26, 2026

preprint

DOI: 10.64898/2026.06.24.26356406

Authors

Wang, J., Tang, W., Ma, X., Yan, H. m., Yuan, Y.

Affiliations (1)

Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine

Abstract

Large language models (LLMs) are increasingly used for automated quality control (QC) of radiology reports. However, their reliability on Mandarin reports, and the relative performance of domestic versus international flagship models, remain unknown. We benchmarked 14 LLM configurations, seven Chinese-developed ("domestic") and seven international, on 1,000 whole-body 18F-FDG PET/CT reports split into an error-injected "junior-doctor" arm and a low-residual "finalised" arm (500 each), using a controlled error-injection gold standard. Under each blinded zero-shot prompt, each model flagged six error types and assigned a 1-5 overall score. Two distinct abilities, error-detection macro-F1 (0.356-0.667) and overall-score calibration (ICC[2,1] 0.099-0.627), were weakly and not significantly correlated across models (Spearman ρ = 0.38, P = 0.18). The dissociation was instead evident in sharp rank reversals: the strongest detector (Claude-Opus-4.8, macro-F1 0.667) calibrated poorly (ICC 0.491), whereas the three best-calibrated models were all domestic (MiMo 0.627, GLM-5 0.612, DeepSeek 0.609). Once the access channel was controlled, domestic and international error detection were statistically indistinguishable (Δmacro-F1 = -0.011, P = 0.84); domestic models showed consistent but not significant advantages in calibration (ΔICC = +0.142) and Chinese-character-error detection (ΔF1 = +0.109), accompanied by substantially lower cost (US$0.09-2.71 vs US$0.26-14.5 per 1,000 reports) and by on-premise deployability. Re-running two flagships through both agent channels and clean APIs showed that the agent channel inflated both detection and calibration (GPT-5.5 ΔICC = +0.098, 95% CI 0.070-0.128), confirming that uncontrolled benchmarks over-credit agent-channel models. Missed-diagnosis detection was the universal weakness (best 0.467) and the one category in which human physicians outperformed every model.
Raw detection ability does not guarantee a trustworthy score, and domestic and international models differ in deployment-relevant profile rather than in overall performance rank; both distinctions are essential for clinical nuclear-medicine QC.
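
The two headline metrics, macro-F1 over error types and ICC(2,1) for the overall score, can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes that each report carries a set of gold error labels and a set of model-flagged labels, and that calibration is measured as agreement between a reference score and the model's 1-5 score. The function names, the label strings, and the choice to score a category with no positives as 0 are all hypothetical. The ICC follows the standard Shrout and Fleiss two-way random-effects, absolute-agreement, single-rater form.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1.

    gold, pred: lists of sets, one set of error labels per report.
    labels: the error types to average over (hypothetical names).
    """
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # Assumption: a label with no gold or predicted positives scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one per report; each row holds k scores,
    e.g. [reference_score, model_score].
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between reports
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Because ICC(2,1) measures absolute agreement, a model that ranks reports perfectly but scores every report one point too high is still penalised, which is one way a strong detector can end up calibrating poorly.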


Topics

radiology and imaging
