Assessing diagnostic performance of multimodal LLMs and a custom convolutional neural network in tooth-level caries detection and localization.
Authors
Affiliations (4)
- Department of Surgery, Section of Dentistry, The Aga Khan University Hospital, National Stadium Road, Karachi, Pakistan.
- Department of Operative Dentistry, Liaquat University of Medical Health Sciences Jamshoro, Deh Soun Valhar, Jamshoro, Sindh, Pakistan.
- Department of Paediatrics and Child Health, The Aga Khan University Hospital, National Stadium Road, Karachi, Pakistan.
- Department of Surgery, Section of Dentistry, The Aga Khan University Hospital, National Stadium Road, Karachi, Pakistan.
Abstract
Artificial Intelligence is reshaping dental diagnostics through automated interpretation of images. While Convolutional Neural Networks (CNNs) demonstrate high accuracy via domain-specific training, multimodal Large Language Models (LLMs) such as ChatGPT-4o and Gemini 2.5 Flash have recently acquired visual-reasoning capabilities without task-specific fine-tuning. This study compared the diagnostic performance of these LLMs with a custom CNN for detecting and localizing dental caries on intraoral images. This cross-sectional diagnostic accuracy study used 22 occlusal-view intraoral images. ChatGPT-4o, Gemini 2.5 Flash, and a YOLOv5s-based CNN analyzed each image for caries detection and localization. Quantitative evaluation assessed decay detection using accuracy, sensitivity, specificity, precision, Positive Predictive Value (PPV), Negative Predictive Value (NPV), and F1 score. Inter-model differences were analyzed using McNemar's test. Additionally, a descriptive qualitative evaluation was performed by specialist dentists, who rated each model's output for realism, diagnostic accuracy, bounding-box precision, and absence of unnecessary annotations using a 3-point Likert scale. The CNN achieved the highest diagnostic accuracy (97.2%), sensitivity (86.7%), and F1 score (88.0%). Gemini 2.5 Flash outperformed ChatGPT-4o in sensitivity (76.4% vs. 66.2%) and F1 score (74.3% vs. 68.7%). Overall, the CNN's performance was significantly superior (p < 0.001), whereas no difference was found between the two LLMs (p = 0.541). Qualitatively, the CNN scored best for realism (90.9%), decay accuracy (79.5%), and bounding-box precision (93.1%). CNNs provide superior accuracy for caries localization compared with multimodal LLMs. However, LLMs demonstrate potential for generating clinically interpretable diagnostic summaries. Hybrid systems integrating CNN-based detection with LLM-driven reasoning may enhance decision-making and improve efficiency in dental diagnostic workflows.
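The quantitative measures named in the abstract can be illustrated with a minimal Python sketch. It assumes tooth-level true/false positive and negative counts are available, and it uses the exact (binomial) form of McNemar's test; the abstract does not state which variant the authors applied, and it reports no raw counts, so the function names and all example numbers below are hypothetical. Note that precision and PPV are the same quantity, so the sketch computes it once.

```python
from math import comb


def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic accuracy measures from confusion-matrix counts.

    tp/fp/tn/fn are tooth-level counts (hypothetical inputs, not study data).
    """
    sensitivity = tp / (tp + fn)          # recall: decayed teeth correctly flagged
    specificity = tn / (tn + fp)          # sound teeth correctly left unflagged
    ppv = tp / (tp + fp)                  # precision and PPV are the same measure
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: teeth classified correctly by model A but incorrectly by model B
    c: teeth classified correctly by model B but incorrectly by model A
    Only these discordant pairs carry information about the difference.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(diagnostic_metrics(tp=8, fp=2, tn=88, fn=2))
    print(mcnemar_exact(b=0, c=10))
```

McNemar's test suits this design because all three models were run on the same images, so their predictions are paired rather than independent.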