Language-dependent diagnostic safety of medical AI systems: a cross-lingual benchmarking and prospective clinical study

May 21, 2026

DOI: 10.64898/2026.05.19.26353490

Authors

Wang, Y., He, H., Zhu, R., Lu, Y., Phadungsaksawasdi, P., Peng, M., Liu, Z., Zou, K., Zhang, Y., Chew, S. P., Tham, Y. C., Khorasani, A., Deng, H., Cheng, C.-Y., Yang, J., Liu, D.

Affiliations (1)

Yong Loo Lin School of Medicine and College of Design and Engineering, NUS, Singapore

Abstract

Background: Patients worldwide receive healthcare in many languages, yet medical AI systems are validated almost exclusively in high-resource languages such as English and Chinese, exposing patients in other linguistic settings to unquantified diagnostic risk. Existing multilingual evaluations rely on translated research-style benchmarks that fail to capture authentic clinical work. We aimed to characterise the patient safety consequences of multilingual medical AI deployment in real-world clinical settings and to develop an auditable detection method for unsafe outputs.

Methods: We evaluated different large language models (LLMs) and visual language models (VLMs) across four real-world clinical tasks (conversational QA, radiology report generation, glaucoma diagnosis, ICU re-intubation prediction) in five languages (English, Chinese, Malay, Thai, Persian). We developed a token-level uncertainty toolkit to localise reasoning instability, compared three inference paradigms (native-language, English chain-of-thought, back-translation pivot), and conducted a prospective study (50 dialogues, 150 physician-reviewed records).

Findings: LLM and VLM performance degraded consistently from high- to low-resource languages across all tasks. Key gaps included: HealthBench score declining from 0·3743 to 0·3180; radiology macro-F1 from 0·2938 to 0·2149-0·2424, consistent with selective pathology suppression; glaucoma accuracy from 50·7% to 32·7%; ICU parameter recall from 100·0% to 48·5%. Multimodal inputs amplified degradation. Qwen3 VL 235B showed attenuated decline with no resource-ordered pattern in glaucoma classification. Token-level analysis localised instability to mid-chain stages (40-70% of the normalised trajectory); perplexity-based confidence failed to flag errors (AUROC 0·41-0·66). Back-translation pivot consistently restored performance. In the prospective study, 98·7% of records required physician edits (overall modification score 53·6%); Thai-pivot correction burden (59·0%) exceeded English-pivot (50·7%, p=0·003) and Chinese-direct (51·0%, p=0·004).

Interpretation: Multilingual deployment produced clinically consequential failures, including missed pathology, distorted physiological extraction, and amplified multimodal misclassification, that were invisible to monolingual validation and not reliably flagged by model confidence. Pre-training data composition may contribute to multilingual safety risk. Language-specific safety auditing should precede deployment in non-dominant-language healthcare settings; the open-source detection toolkit enables this without model retraining.
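To make the confidence finding concrete, the following is a minimal sketch of how a perplexity-based error detector can be scored with AUROC. It is not the authors' toolkit: the function names, the toy log-probabilities, and the error labels are all hypothetical, and the abstract does not specify how perplexity or AUROC were computed in the study. An AUROC near 0·5 means the score separates correct from incorrect outputs no better than chance, which is the range the abstract reports (0·41-0·66).

```python
import math

def perplexity(token_logprobs):
    """Perplexity of one output: exp of the mean negative token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def auroc(scores, labels):
    """Probability that a random erroneous output (label 1) receives a higher
    score than a random correct output (label 0); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one erroneous and one correct output")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: (token log-probabilities, 1 if the output was wrong).
outputs = [
    ([-0.1, -0.2, -0.1], 0),
    ([-0.9, -1.2, -0.7], 1),
    ([-0.3, -0.2, -0.4], 1),  # a confident but wrong output
    ([-0.5, -0.6, -0.4], 0),  # a hesitant but correct output
]
scores = [perplexity(logprobs) for logprobs, _ in outputs]
labels = [is_error for _, is_error in outputs]
print(round(auroc(scores, labels), 2))  # 0.75 on this toy data
```

In this toy set, the confident-but-wrong output is ranked below a hesitant-but-correct one, which is the kind of mismatch that pulls AUROC toward 0·5 when it occurs often.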

Topics

health informatics
