Reasoning Model-Assisted Second-Reader Quality Control of Chinese-Language Ultrasound Reports: A Retrospective Imaging Informatics Study.
Authors
Affiliations (5)
- General Affairs Office, Sichuan Clinical Research Center for Cancer, Sichuan Cancer Hospital & Institute, Sichuan Cancer Center, Affiliated Cancer Hospital of University of Electronic Science and Technology of China, Chengdu, 610041, China.
- School of Health Sciences, College of Health and Human Sciences, Purdue University, West Lafayette, IN, 47907, USA.
- Finance Department, Sichuan Clinical Research Center for Cancer, Sichuan Cancer Hospital & Institute, Sichuan Cancer Center, Affiliated Cancer Hospital of University of Electronic Science and Technology of China, Chengdu, 610041, China.
- Ultrasound Medical Center, Sichuan Clinical Research Center for Cancer, Sichuan Cancer Hospital & Institute, Sichuan Cancer Center, Affiliated Cancer Hospital of University of Electronic Science and Technology of China, Chengdu, 610041, China.
- Ultrasound Medical Center, Sichuan Clinical Research Center for Cancer, Sichuan Cancer Hospital & Institute, Sichuan Cancer Center, Affiliated Cancer Hospital of University of Electronic Science and Technology of China, Chengdu, 610041, China. [email protected].
Abstract
This study evaluated whether the reasoning model DeepSeek-R1 can function as a second-reader quality control (QC) tool for Chinese-language ultrasound reports. In this retrospective diagnostic-accuracy study with a parallel blinded review design, 500 deidentified ultrasound reports were randomly sampled from 9711 eligible reports finalized in 2024 at a tertiary cancer center. DeepSeek-R1 and physician reviewer groups independently evaluated the same reports, and no review condition had access to the outputs of the others or to the consensus reference standard. DeepSeek-R1 achieved 69.1% sensitivity and 98.1% specificity. Its sensitivity was numerically higher than that of senior physicians (69.1% vs 47.1%), though this difference did not reach significance after Bonferroni correction (adjusted p = 0.147); specificity was identical at 98.1% for both. By error type, the model detected 36 of 42 findings-impression discordances but only 7 of 21 completeness/template/indicator violations. In a post hoc exploratory OR-rule simulation, the combined workflow yielded 95.6% sensitivity (95% CI 87.8-98.5) and 96.3% specificity (95% CI 94.1-97.7). This retrospective single-center study provides workflow-level feasibility evidence that a reasoning model can serve as a high-specificity second-reader control for finalized ultrasound report text, with human review retained for local rules, exceptions, and final sign-off.
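The OR-rule simulation described above can be illustrated with a minimal Python sketch: a report counts as flagged if either the model or the physician flags it, and sensitivity and specificity are then computed against the reference standard. The function names, the toy flag lists, and the use of a Wilson score interval are assumptions made for illustration; the abstract does not state which interval method was used or give the underlying counts. The counts passed to `wilson_interval` (65 of 68) are hypothetical values chosen only because they are consistent with the reported 95.6%, not figures taken from the study.

```python
import math


def or_rule(model_flags, physician_flags):
    """Combine two reviewers: a report is flagged if either one flags it."""
    return [m or p for m, p in zip(model_flags, physician_flags)]


def sensitivity_specificity(flags, truth):
    """Compute (sensitivity, specificity) of flags against reference labels."""
    tp = sum(1 for f, t in zip(flags, truth) if f and t)
    fn = sum(1 for f, t in zip(flags, truth) if not f and t)
    tn = sum(1 for f, t in zip(flags, truth) if not f and not t)
    fp = sum(1 for f, t in zip(flags, truth) if f and not t)
    return tp / (tp + fn), tn / (tn + fp)


def wilson_interval(successes, total, z=1.959964):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Toy data (hypothetical): True means "report contains an error" / "flagged".
truth = [True, True, True, False, False, False]
model = [True, False, False, False, False, True]
physician = [False, True, False, False, False, False]

combined = or_rule(model, physician)
sens, spec = sensitivity_specificity(combined, truth)
print(f"combined sensitivity={sens:.3f}, specificity={spec:.3f}")

# Hypothetical counts, chosen to be consistent with a 95.6% sensitivity.
low, high = wilson_interval(65, 68)
print(f"Wilson 95% CI: {low:.3f} to {high:.3f}")
```

The toy data also shows the trade-off behind the rule: an OR combination can only add detections, so sensitivity cannot fall below that of either reviewer, while every false positive from either reviewer carries over, so specificity cannot exceed that of either one.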