DeepSeek-R1 for automated scoring in radiology residency examinations: an agreement and test-retest reliability study.
Authors
Affiliations (3)
- Department of Radiology, Fifth Affiliated Hospital of Sun Yat-sen University, Zhuhai, Guangdong, China.
- Department of Radiology, Fifth Affiliated Hospital of Sun Yat-sen University, Zhuhai, Guangdong, China.
- Department of Radiology, Fifth Affiliated Hospital of Sun Yat-sen University, Zhuhai, Guangdong, China.
Abstract
This study evaluates the feasibility of using DeepSeek-R1 for automated scoring in radiology residency examinations by comparing its scores with those of radiologists. In a cross-sectional study, 504 diagnostic radiology reports written by eighteen third-year radiology residents were scored independently by Radiologist A, Radiologist B, and DeepSeek-R1 (as of June 15, 2025), using standardized scoring rubrics and predefined evaluation criteria. One month after the initial evaluation, DeepSeek-R1 and Radiologist A repeated the assessment. Inter-rater reliability among Radiologist A, Radiologist B, and DeepSeek-R1, as well as test-retest reliability, was analyzed using intraclass correlation coefficients (ICC). The ICC was 0.879 between DeepSeek-R1 and Radiologist A, 0.820 between DeepSeek-R1 and Radiologist B, and 0.862 between Radiologist A and Radiologist B. The test-retest ICC was 0.922 for DeepSeek-R1 and 0.952 for Radiologist A. The ICC between DeepSeek-R1 (re-test) and Radiologist A (re-test) was 0.885. DeepSeek-R1 performed comparably to radiologists in evaluating radiology residents' reports. Integrating DeepSeek-R1 into medical education could assist with assessment tasks, potentially reducing faculty workload while preserving the quality of evaluations.
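For readers unfamiliar with the statistic, the following is a minimal sketch of how an ICC between raters can be computed. The abstract does not state which ICC form was used; this sketch assumes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1). The function name `icc_2_1` and the toy scores are hypothetical and are not taken from the study.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: list of rows, one row per report, one column per rater.
    The choice of ICC(2,1) is an assumption; the abstract does not name the form.
    """
    n = len(scores)        # number of reports (subjects)
    k = len(scores[0])     # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between reports
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Made-up scores for illustration only (not study data).
identical = [[1, 1], [2, 2], [3, 3]]   # both raters give the same score
offset = [[1, 2], [2, 3], [3, 4]]      # second rater is always one point higher

icc_identical = icc_2_1(identical)
icc_offset = icc_2_1(offset)
```

In this toy example, identical scores give an ICC of 1, while a constant one-point offset between raters lowers it to 2/3, because the absolute-agreement form penalizes systematic differences between raters as well as random disagreement. A consistency-type ICC would treat the offset case as perfect agreement, so the choice of form affects how values such as those reported above should be read.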