Efficient Vision Transformers for Ophthalmic Image Classification: A Comparative Study of Supervised, Semi-Supervised, and Unsupervised Learning Approaches.
Affiliations (6)
- MBChB, FIBMS (ophthalmology), FICO, FRCS (Glasg), College of Medicine, University of Baghdad, Baghdad, Baghdad Governorate, Iraq.
- Department of Surgery, Ophthalmology unit, College of Medicine, University of Baghdad, Baghdad, Baghdad Governorate, Iraq. [email protected].
- Department of Surgery, Ophthalmology unit, College of Medicine, University of Baghdad, Baghdad, Baghdad Governorate, Iraq. [email protected].
- Department of Ophthalmology, Baghdad Medical City, Baghdad, Baghdad Governorate, Iraq.
- MBChB, FICM-ICO, CAB, Department of Ophthalmology, College of Medicine, Al Iraqia University, Baghdad, Baghdad Governorate, Iraq.
- Department of Internal Medicine, College of Medicine, Baghdad, Baghdad Governorate, Iraq.
Abstract
This study explored the integration of supervised, semi-supervised, and unsupervised learning strategies to classify ophthalmic images under label-scarce conditions. Given the high cost of annotation in medical imaging, the goal was to improve diagnostic performance using minimal labeled data and robust feature representations. A dataset of 18,767 multimodal ophthalmic images was collected, of which 1,877 were labeled and 16,890 unlabeled. Three transformer-based architectures (ViT-Base, DeiT-Base, and MaxViT-L) were used for supervised learning. Semi-supervised learning employed pseudo-labeling with a confidence threshold ≥ 0.98. For unsupervised learning, SimCLR-based contrastive learning and K-means clustering were applied to the extracted features. Performance was evaluated using classification accuracy, AUC, F1-score, clustering indices (Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index), and computational metrics. In supervised learning, ViT-Base achieved the highest accuracy (92.47%), followed by DeiT-Base (89.38%) and MaxViT-L (85.27%). After pseudo-labeling, MaxViT-L achieved the best accuracy (97.49%) and AUC (0.9982). Contrastive learning substantially improved feature clustering, with MaxViT-L reaching a Silhouette Score of 0.556 and a reduced Davies-Bouldin Index of 0.541. However, computational analysis revealed that MaxViT-L had the highest computational complexity (81,713 MFLOPs) and the longest inference time (~102 ms), whereas ViT-Base and DeiT-Base showed considerably lower FLOPs (39,120.6 MFLOPs) and faster inference (~52 ms). On the external validation set, MaxViT-L demonstrated the best overall performance. Although ViT-Base achieved the highest accuracy in supervised training, MaxViT-L offered the most favorable trade-off between performance and generalization in semi-supervised and unsupervised settings. Despite its higher computational complexity and longer inference time, MaxViT-L consistently achieved strong accuracy and clustering performance. This approach minimizes dependence on expert annotations, supporting scalable and automated ophthalmic diagnosis.
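The abstract describes pseudo-labeling with a confidence threshold of ≥ 0.98 but gives no implementation details. A minimal PyTorch-style sketch of how such confidence-filtered pseudo-labels could be generated is shown below; the function name, the assumption that the loader yields image batches only, and the device handling are illustrative, not the authors' actual code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_pseudo_labels(model, unlabeled_loader, threshold=0.98, device="cuda"):
    """Keep unlabeled images whose softmax confidence meets the
    threshold and assign the argmax class as a pseudo-label."""
    model.eval()
    kept_images, kept_labels = [], []
    for images in unlabeled_loader:            # assumes the loader yields image tensors
        images = images.to(device)
        probs = F.softmax(model(images), dim=1)
        conf, preds = probs.max(dim=1)
        mask = conf >= threshold               # confidence filter (>= 0.98 in the paper)
        kept_images.append(images[mask].cpu())
        kept_labels.append(preds[mask].cpu())
    return torch.cat(kept_images), torch.cat(kept_labels)
```

The retained image/pseudo-label pairs would then be merged with the 1,877 labeled samples for a second round of supervised fine-tuning.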
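For the unsupervised branch, the paper names SimCLR-based contrastive learning. A compact sketch of the standard SimCLR NT-Xent loss is given below for reference; the temperature value and variable names are assumptions, and the projection-head outputs `z1`, `z2` are taken to be two augmented views of the same image batch:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR NT-Xent loss. z1, z2: (N, D) projection-head outputs
    of two augmented views of the same N images."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # (2N, D) unit vectors
    sim = z @ z.t() / temperature                 # cosine similarity logits
    sim.fill_diagonal_(float("-inf"))             # mask self-similarity
    # the positive for row i is row i+N (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)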
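Finally, the clustering step and the three reported indices (Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index) can be reproduced with scikit-learn. The sketch below is a plausible reconstruction under the assumption that K-means is run directly on the encoder features; the number of clusters is not stated in the abstract and is left as a parameter:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def cluster_and_score(features, n_clusters):
    """Cluster encoder features with K-means and report the three
    internal clustering indices used in the paper."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)
    return {
        "silhouette": silhouette_score(features, labels),   # higher is better
        "dbi": davies_bouldin_score(features, labels),       # lower is better
        "ch": calinski_harabasz_score(features, labels),     # higher is better
    }
```

Under this reading, the reported MaxViT-L results (Silhouette 0.556, DBI 0.541) would correspond to the output of such an evaluation on its contrastively learned features.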