Comparative assessment of the accuracy of different artificial intelligence models in answering analytical and knowledge-based questions in oral and maxillofacial radiology and oral and maxillofacial surgery; a research article.

June 26, 2026

Authors

Samunahmetoglu E, Yilmaz A

Affiliations (2)

  • Oral and Maxillofacial Radiology Department, Faculty of Dentistry, Yozgat Bozok University, Çapanoğlu, Cemil Çiçek Bv No: 217/1, Yozgat, 66100, Türkiye.
  • Oral and Maxillofacial Surgery Department, Faculty of Dentistry, Gazi University, Ankara, Türkiye.

Abstract

Artificial intelligence (AI) models are increasingly used in healthcare education; however, their ability to handle both factual knowledge and analytical clinical reasoning in dentistry remains unclear. This study aimed to compare the accuracy of different AI models in answering knowledge-based and analytical multiple-choice questions in Oral and Maxillofacial Radiology (OMFR) and Oral and Maxillofacial Surgery (OMFS), and to evaluate performance differences according to cognitive task type. This cross-sectional comparative study analyzed 258 multiple-choice questions (202 knowledge-based, 56 analytical) from the Turkish Dental Specialty Examination (DUS) conducted between 2012 and 2021. Five AI models (ChatGPT-5.2 Go, ChatGPT-5.2 Plus, DeepSeek V3, Claude Sonnet 4.5, and Gemini 3 Flash) answered all questions under default settings in a single session. Accuracy rates were compared using Chi-square and Kruskal-Wallis tests with Bonferroni correction. Inter-model agreement and reliability were assessed using Cohen's kappa and the intraclass correlation coefficient (ICC) (α = 0.05). Significant differences among models were observed in knowledge-based questions (p = 0.048), analytical questions (p = 0.032), and overall accuracy (p = 0.006). Gemini achieved the highest accuracy in knowledge-based questions, while Claude demonstrated the lowest. Although an overall difference was detected in analytical questions, pairwise comparisons did not identify a clearly superior model. Overall performance largely reflected success in knowledge-based tasks. Agreement analysis showed low kappa values (κ = 0.226-0.339) but moderate ICC levels (0.597-0.728). AI models demonstrate strong factual recall but remain limited in analytical clinical reasoning tasks. While these models may serve as supportive tools in dental education, their use as independent clinical decision-making systems is not yet reliable.
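For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Cohen's kappa can be computed for one pair of models. It is not the authors' analysis code: the abstract does not say whether agreement was computed on the chosen answer options or on correct/incorrect outcomes, so the sketch assumes chosen options, and the function name `cohens_kappa` and the toy answer lists are hypothetical and not taken from the study.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: fraction of items on which both raters give the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement expected from each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    p_e = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch treats it as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data (NOT from the study): options chosen by two models
# on eight multiple-choice questions.
model_a = ["A", "B", "C", "D", "A", "B", "C", "D"]
model_b = ["A", "B", "C", "A", "A", "C", "C", "E"]

kappa = cohens_kappa(model_a, model_b)
print(f"kappa = {kappa:.3f}")
```

In this toy example the two models agree on 5 of 8 questions (observed agreement 0.625), chance agreement is 0.21875, and kappa comes out at about 0.52. Kappa discounts agreement expected by chance, which is why it can be low even when raw agreement looks reasonable.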

Topics

Journal Article
