Diagnostic and localization performance of multimodal large language models in the interpretation of dental periapical radiographs.
Authors
Affiliations (3)
- Department of Restorative Dentistry, Faculty of Dentistry, Recep Tayyip Erdoğan University, Rize, Turkey.
- Department of Restorative Dentistry, Faculty of Dentistry, Recep Tayyip Erdoğan University, Rize, Turkey.
- Department of Endodontics, Recep Tayyip Erdoğan University, Rize, Turkey.
Abstract
This study aimed to compare the performance of multiple multimodal large language models (MLLMs) in interpreting dental periapical radiographs using a structured framework that incorporates diagnostic accuracy, tooth-level anatomical precision, localization, and composite correctness across various attributes. This paired, cross-sectional study assessed six contemporary MLLMs (Microsoft Copilot, ChatGPT-o4-mini-high, ChatGPT-5.2 Thinking, Gemini 2.5 Flash, Gemini 3 Pro, and Grok-3) using 55 de-identified dental periapical radiographs selected from a clinical archive according to predefined inclusion and exclusion criteria. Radiographs were selected to represent mutually exclusive tooth-status categories: dental caries, composite restorations, amalgam restorations, and sound teeth. The reference labels for tooth status and anatomical identifiers were established before model evaluation through independent assessment and consensus by experienced clinicians. Each image was evaluated by all models under standardized prompting conditions. Performance was assessed across tasks of increasing complexity: tooth-status diagnosis, jaw identification, tooth-region identification, side determination, FDI tooth numbering, localization accuracy, and overall correctness across all attributes. Paired statistical analyses were performed using Cochran's Q test and generalized estimating equations with Holm's adjustment for multiple comparisons. Tooth-status diagnostic accuracy was moderate and comparable across models (45.5-60.0%), with no significant between-model differences. In contrast, significant heterogeneity was observed for localization-related tasks, including side determination, FDI tooth numbering, composite localization, and complete accuracy (p < 0.05).
In adjusted analyses, Gemini 3 Pro had the highest accuracy for localization-dependent outcomes and overall correctness, with higher odds of correctness than several other models in pairwise comparisons. No model showed a consistent advantage in tooth-status diagnosis alone. The performance of MLLMs in dental periapical radiograph interpretation is task-dependent and becomes increasingly model-specific as anatomical precision requirements increase. While MLLMs may support education and structured discussions of radiographic findings, their reliability for autonomous tooth-level diagnosis and localization remains limited. Clinically, MLLMs may support dental education and patient communication by helping to explain basic radiographic features, but because their reliability decreases for tooth-level localization and comprehensive anatomical accuracy, they should not be used as stand-alone tools for clinical decision-making based on periapical radiographs.
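The abstract names Cochran's Q test and Holm's adjustment but gives no computational detail. The following is a minimal sketch of how such a paired comparison might be set up, not the authors' analysis code. The function names `cochran_q` and `holm_adjust`, the toy correctness table, and the example p-values are all hypothetical and invented for illustration; none of them are study data. The generalized estimating equations step is not shown, and in practice the p-value for Q would be taken from a chi-square distribution with the returned degrees of freedom using a statistics library.

```python
from typing import List, Sequence, Tuple


def cochran_q(correct: Sequence[Sequence[int]]) -> Tuple[float, int]:
    """Cochran's Q for a table of 0/1 outcomes.

    Rows are images (the paired units); columns are models.
    Returns the Q statistic and its degrees of freedom (models - 1).
    """
    n_models = len(correct[0])
    col_totals = [sum(row[j] for row in correct) for j in range(n_models)]
    row_totals = [sum(row) for row in correct]
    grand_total = sum(row_totals)

    numerator = (n_models - 1) * (
        n_models * sum(c * c for c in col_totals) - grand_total ** 2
    )
    denominator = n_models * grand_total - sum(r * r for r in row_totals)
    if denominator == 0:
        # Happens only when every image is all-correct or all-wrong.
        raise ValueError("Q is undefined: no image discriminates between models")
    return numerator / denominator, n_models - 1


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical toy table: 6 images x 3 models, 1 = correct, 0 = incorrect.
toy_correct = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
]

q_stat, dof = cochran_q(toy_correct)
print(f"Q = {q_stat:.2f}, df = {dof}")  # Q = 6.00, df = 2

# Hypothetical unadjusted p-values from three pairwise model comparisons.
print([round(p, 3) for p in holm_adjust([0.01, 0.04, 0.03])])  # [0.03, 0.06, 0.06]
```

Cochran's Q suits this design because every model reads the same images, so the outcomes are paired binary responses, not independent samples. Holm's procedure then controls the family-wise error rate across the pairwise model comparisons while being less conservative than a plain Bonferroni correction.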