Assessing diagnostic performance of multimodal AI and human experts in oral and maxillofacial radiography: a comparative analysis of ChatGPT, Grok, and MANUS.
Affiliations (2)
- Department of Restorative Dental Science, College of Dentistry, University of Ha'il, Ha'il, Kingdom of Saudi Arabia.
- Department of Basic Dental and Medical Science, College of Dentistry, University of Ha'il, Ha'il, Kingdom of Saudi Arabia.
Abstract
Artificial intelligence (AI), particularly large language models (LLMs), is increasingly applied to radiographic interpretation in healthcare. In dentistry, radiographic imaging is essential for diagnosis and treatment planning, yet interpretation remains subject to variability and human error; AI may enhance diagnostic accuracy and consistency. This study evaluated and compared the diagnostic accuracy, consistency, and interpretive performance of three multimodal AI models (ChatGPT, Grok, and MANUS) against expert radiologists in dental radiograph interpretation.

A total of 120 anonymised radiographs (40 orthopantomograms (OPGs), 40 periapical radiographs, and 40 CT slices) were selected from validated academic sources. Two board-certified oral and maxillofacial radiologists established the gold-standard diagnoses. Each image was independently assessed by the three AI models under standardised prompting. Diagnostic accuracy and intra-model consistency were analysed using descriptive statistics, Cohen's kappa, McNemar's test, and logistic regression.

In the first assessment, MANUS and ChatGPT each achieved 92.5% accuracy (111/120), while Grok reached 88.3% (106/120). Performance improved in the second round: MANUS 95.0%, ChatGPT 93.3%, and Grok 90.8%, compared with 96.7% for the radiologists. ChatGPT showed the highest reproducibility (κ = 0.937), whereas MANUS demonstrated the highest overall accuracy. Agreement was strong between ChatGPT and MANUS, with greater variability in Grok. No significant systematic bias was detected between AI outputs and the radiologist benchmarks.

The evaluated LLMs demonstrated diagnostic performance comparable to that of expert radiologists. MANUS excelled in accuracy and ChatGPT in reproducibility, supporting their potential as adjunct tools in dental radiology, provided expert oversight is maintained.

Clinical trial number: Not applicable.
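The statistics named in the abstract (per-model accuracy, Cohen's kappa for agreement, and McNemar's test for paired disagreement) can be illustrated with a minimal sketch. The helper names and the binary correct/incorrect scoring below are assumptions for illustration, not the authors' analysis pipeline; only the accuracy counts (e.g. 111/120) come from the abstract.

```python
import math

def accuracy(correct, total):
    """Proportion of diagnoses matching the gold standard."""
    return correct / total

def cohen_kappa(a, b):
    """Cohen's kappa for two binary ratings of the same cases
    (1 = correct vs. gold standard, 0 = incorrect)."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    p1a, p1b = sum(a) / n, sum(b) / n                # marginal rates of "1"
    pe = p1a * p1b + (1 - p1a) * (1 - p1b)           # chance agreement
    return (po - pe) / (1 - pe)

def mcnemar_exact_p(b, c):
    """Exact two-sided McNemar p-value from the discordant counts:
    b = cases rater 1 got right and rater 2 got wrong, c = the reverse."""
    n = b + c
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# First-round accuracy for MANUS/ChatGPT as reported: 111 of 120 correct.
print(round(accuracy(111, 120), 3))  # 0.925
```

A kappa near 0.94, as reported for ChatGPT's repeated readings, indicates almost-perfect test-retest agreement on the conventional Landis-Koch scale; a non-significant McNemar p-value is consistent with the abstract's finding of no systematic bias between AI outputs and the radiologist benchmark.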