
Comparative performance of Chinese and international large language models on the Chinese radiology attending physician qualification examination.

November 10, 2025 · PubMed

Authors

Luo D, Liu M, Zhang H, Wang X, Gao Q, Kuang N, Yin T, Zheng Z

Affiliations

  • Department of Rehabilitation Medicine Center, Affiliated Tai'an Central Hospital, Qingdao University, No. 29, Longtan Road, Taishan District, Tai'an, 271000, Shandong, China.
  • Department of Radiology, Affiliated Shandong Provincial Hospital, Shandong First Medical University, Shandong, 250021, China.

Abstract

This study evaluates the accuracy and reliability of six large language models (LLMs), three Chinese (Doubao, Kimi, DeepSeek) and three international (ChatGPT-4o, Gemini 2.0 Pro, Grok 3), in radiology, using simulated questions from the 2025 Chinese Radiology Attending Physician Qualification Examination (CRAPQE). The analysis covered 400 CRAPQE-simulated questions spanning multiple formats (A1, A2-A4, B, and C types) and modalities (text-only and image-based). Expert radiologists scored responses against official answer keys. Performance was compared within and between the Chinese and international LLM groups on overall, unit-specific, question-type-specific, and modality-specific accuracy. All LLMs passed the CRAPQE simulation, demonstrating proficiency comparable to that of a radiology attending physician. The Chinese LLMs achieved higher mean accuracy (87.2%) than the international LLMs (80.4%, P < 0.05), excelling on text-only and A1-type questions (P < 0.05). Within the Chinese group, DeepSeek (91.6%) and Doubao (89.5%) outperformed Kimi (80.5%, P < 0.0167), whereas the international LLMs showed no significant pairwise differences (P > 0.05). All models surpassed the passing threshold on image-based questions but performed worse on them than on text-only questions, with no difference between the groups (P > 0.05). This pioneering comparison highlights the potential of LLMs in radiology, with the Chinese models outperforming their international counterparts, likely owing to localized training data, and provides evidence to guide the development of medical AI.
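The pairwise threshold P < 0.0167 is consistent with a Bonferroni correction of α = 0.05 over the three pairwise comparisons among the Chinese models (0.05 / 3 ≈ 0.0167). Below is a minimal sketch of how such a comparison can be run, assuming correct-answer counts reconstructed from the reported accuracies and a chi-square test of proportions; the abstract does not specify the authors' exact statistical procedure, so the test choice and counts here are illustrative only.

# Hypothetical sketch, not the authors' code: pairwise chi-square tests of
# LLM accuracy on the 400 CRAPQE-simulated questions, with a Bonferroni-
# corrected significance threshold. Correct-answer counts are reconstructed
# from the accuracies reported in the abstract.
from itertools import combinations
from scipy.stats import chi2_contingency

N_QUESTIONS = 400
reported_accuracy = {  # overall accuracy of the three Chinese LLMs
    "DeepSeek": 0.916,
    "Doubao": 0.895,
    "Kimi": 0.805,
}
correct = {m: round(acc * N_QUESTIONS) for m, acc in reported_accuracy.items()}

pairs = list(combinations(correct, 2))
alpha = 0.05 / len(pairs)  # Bonferroni: 0.05 / 3 comparisons ≈ 0.0167

for model_a, model_b in pairs:
    # 2x2 contingency table: correct vs. incorrect counts for each model
    table = [
        [correct[model_a], N_QUESTIONS - correct[model_a]],
        [correct[model_b], N_QUESTIONS - correct[model_b]],
    ]
    chi2, p, _, _ = chi2_contingency(table)
    verdict = "significant" if p < alpha else "not significant"
    print(f"{model_a} vs {model_b}: chi2 = {chi2:.2f}, p = {p:.4f} ({verdict})")

Because every model answered the same 400 questions, a paired test such as McNemar's would be an equally natural choice; the chi-square version above treats the two sets of counts as independent samples.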

Topics

Journal Article
