Demographic prompt cues shift clinical recommendations in multimodal large language models: a multi-model audit of 25,380 dental radiograph assessments.

May 28, 2026


DOI: 10.1186/s12903-026-08644-5 PMID: 42210248

Authors

Saravi B, Winkler I, Singh DD, Lommen J, Wilkat M, Schorn L, Vollmer A, Vollmer M, Kübler N, Sproll C, Schrader F

Affiliations (5)

Department of Oral, Maxillofacial and Facial Plastic Surgery, Medical Faculty, University Hospital Düsseldorf, Heinrich-Heine-University Düsseldorf, Düsseldorf, 40225, Germany.
Department of Oral and Maxillofacial Plastic Surgery, University Hospital Cologne, Cologne, 50937, Germany.
Department of Oral, Maxillofacial and Facial Plastic Surgery, Medical Faculty, University Hospital Düsseldorf, Heinrich-Heine-University Düsseldorf, Düsseldorf, 40225, Germany.
Department of Oral and Maxillofacial Plastic Surgery, University Hospital of Würzburg, Würzburg, 97070, Germany.
Department of Oral and Maxillofacial Surgery, Tübingen University Hospital, Tübingen, 72076, Germany.

Abstract

Multimodal large language models (LLMs) are increasingly being evaluated for clinical image interpretation, but whether patient demographic cues influence their recommendations remains unclear in dental radiology, a leading domain for AI-assisted diagnostics and one marked by pronounced disparities in oral health.
We conducted a within-image, between-condition audit study using 705 panoramic dental radiographs from the publicly available DENTEX dataset. Each image was submitted to three commercial multimodal LLMs (Gemini 2.5 Flash, GPT-5.4, Claude Sonnet 4.6) under 12 experimental conditions (five race/ethnicity groups × two sex categories, plus two controls), yielding 25,380 independent zero-temperature API calls. Each call returned a structured JSON object containing ten prespecified clinical recommendation variables. Delta variables were computed against the primary control, and linear mixed-effects models with image identity as a random intercept tested for race/ethnicity (H1), sex (H2), and Race × Sex interaction (H3) effects. The five primary ordinal outcomes were evaluated against a Bonferroni-corrected significance threshold of α = 0.01.
All three models altered recommendations when demographic cues were introduced. Claude showed the broadest sensitivity, with significant race effects on treatment invasiveness, prognosis, and confidence (all p < .001), and significant sex effects on prognosis, confidence, and invasiveness. GPT-5.4 effects were concentrated in confidence and in invasiveness for Hispanic patients (β = +0.052, p < .001). Gemini produced higher invasiveness for Black patients (β = +0.026, p < .001). No Race × Sex interactions reached significance. All ordinal shifts were below 6% of one category width. Inter-model diagnostic agreement under control prompts was near chance (Cohen's κ = 0.009–0.159). Findings were robust across four prespecified sensitivity analyses.
Demographic prompt cues produced small but systematic shifts in multimodal LLM clinical recommendations for dental radiographs, with model-specific sensitivity profiles. Because clinical accuracy was not benchmarked, these findings should be read as demographic-cue sensitivity rather than clinical bias. Inter-model agreement was near chance, indicating that model choice currently introduces more variability than any demographic label. Demographic-cue audits and model-specific reliability evaluations should both accompany clinical deployment, particularly in oral health, where racial and socioeconomic disparities are already documented.
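
The design arithmetic and the agreement statistic reported above can be checked with a short sketch. The Python below is illustrative only and is not the authors' code: it enumerates the 12 conditions, reproduces the 25,380 call count, and computes an unweighted Cohen's κ for two raters. Several names are assumptions: only "Black" and "Hispanic" are named in the abstract, so the remaining race/ethnicity labels are placeholders; the sex labels and the two control names are hypothetical; and the family-wise α of 0.05 is assumed because it yields the reported threshold of 0.01 across five outcomes.

```python
from collections import Counter
from itertools import product

N_IMAGES = 705
MODELS = ["Gemini 2.5 Flash", "GPT-5.4", "Claude Sonnet 4.6"]

# Only "Black" and "Hispanic" appear in the abstract; the other three
# labels are placeholders, not the groups actually used in the study.
RACE_GROUPS = ["Black", "Hispanic", "group_3", "group_4", "group_5"]
SEXES = ["female", "male"]  # assumed labels
CONTROLS = ["control_primary", "control_secondary"]  # hypothetical names


def build_conditions():
    """Return the 5 x 2 demographic cells followed by the two controls."""
    cells = [f"{race}/{sex}" for race, sex in product(RACE_GROUPS, SEXES)]
    return cells + CONTROLS


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


conditions = build_conditions()
total_calls = N_IMAGES * len(MODELS) * len(conditions)

# Bonferroni: assumed family-wise alpha of 0.05 split over 5 primary outcomes.
alpha_per_outcome = 0.05 / 5

# Toy labels (invented for illustration): agreement exactly at chance level.
model_a = ["caries", "caries", "healthy", "healthy"]
model_b = ["caries", "healthy", "caries", "healthy"]
kappa_toy = cohens_kappa(model_a, model_b)
```

In the toy example the two raters agree on half the items, which is exactly what their marginal frequencies predict by chance, so κ is 0. The reported range of 0.009 to 0.159 sits close to that floor, which is the basis for the statement that model choice contributes more variability than any demographic label.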


Topics

Journal Article
