Evaluating vision transformers and convolutional neural networks in the context of dental image processing: a systematic review.
Authors
Affiliations (4)
- Institute of Natural and Applied Sciences, Akdeniz University, Antalya, Türkiye.
- Department of Information Technologies, Faculty of Dentistry, Akdeniz University, Antalya, Türkiye.
- Department of Oral and Maxillofacial Radiology, Faculty of Dentistry, Akdeniz University, Antalya, Türkiye.
- Department of Oral and Maxillofacial Radiology, Faculty of Dentistry, Antalya Bilim University, Antalya, Türkiye.
Abstract
This systematic review compares the efficacy of convolutional neural networks (CNNs) and Vision Transformers (ViTs) in dental imaging and examines in depth the potential, advantages, and limitations of both model families in this domain. The search string used in the study was "(("Vision Transformer" OR ViT OR "Transformer architecture") AND ("Convolutional Neural Network" OR CNN OR ConvNet) AND (Dental OR Dentistry OR "Maxillofacial" OR "Oral Radiology") AND (Image OR Imaging OR Radiograph))". The search was conducted in January 2025. Two investigators independently evaluated the full texts of all eligible articles and excluded those that did not satisfy the eligibility criteria. Of 2596 articles, 21 met the inclusion criteria. By task category, 14 of the 21 reviewed studies (66.7%) addressed classification and 7 (33.3%) addressed segmentation. Panoramic radiography was the most commonly used imaging modality (52.3%), and ViT-based models were observed to have the highest performance in 58% of the studies. ViT-based deep learning models tend to outperform traditional convolutional neural networks in many dental image analysis scenarios. In practice, however, CNN and ViT approaches can be used in a complementary manner.
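The Boolean structure of the search string above (four term groups joined by AND, with OR inside each group) can be illustrated with a minimal Python sketch. This is not the authors' screening procedure: the function names, the case-insensitive whole-word matching rule, and the two example titles are assumptions made for illustration, and real bibliographic databases apply their own query parsing, field restrictions, and stemming.

```python
import re

# Term groups copied from the search string in the abstract.
# Groups are combined with AND; terms inside a group with OR.
SEARCH_GROUPS = [
    ["Vision Transformer", "ViT", "Transformer architecture"],
    ["Convolutional Neural Network", "CNN", "ConvNet"],
    ["Dental", "Dentistry", "Maxillofacial", "Oral Radiology"],
    ["Image", "Imaging", "Radiograph"],
]


def matches_term(text, term):
    """Case-insensitive whole-word (or whole-phrase) match.

    Assumption: no stemming, so "Radiograph" does not match "radiographs".
    """
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def matches_query(text):
    """Return True if every AND group has at least one matching OR term."""
    return all(
        any(matches_term(text, term) for term in group)
        for group in SEARCH_GROUPS
    )


# Hypothetical record titles, invented for illustration only.
hit = "A ViT and CNN comparison for dental radiograph classification"
miss = "CNN-based caries detection in dental imaging"

print(matches_query(hit))   # every group is satisfied
print(matches_query(miss))  # no term from the first (ViT) group
```

Because the matcher in this sketch does no stemming, a plural such as "vision transformers" would not satisfy the first group; a database with automatic term expansion could behave differently, which is one reason the same string can return different counts across sources.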