
Efficient Vision Transformers for Ophthalmic Images Classification: A Comparative Study of Supervised, Semi-Supervised, and Unsupervised Learning Approaches.

November 17, 2025

Authors

Al-Wassiti AS, Mutar MT, Al Sakini AS, Rasheed LS, Yosif W, Abbas MA, Raouf NR, Al-Shammari AS

Affiliations (6)

  • MBChB, FIBMS (ophthalmology), FICO, FRCS (Glasg), College of Medicine, University of Baghdad, Baghdad, Baghdad Governorate, Iraq.
  • Department of Surgery, Ophthalmology Unit, College of Medicine, University of Baghdad, Baghdad, Baghdad Governorate, Iraq. [email protected].
  • Department of Surgery, Ophthalmology Unit, College of Medicine, University of Baghdad, Baghdad, Baghdad Governorate, Iraq. [email protected].
  • Department of Ophthalmology, Baghdad Medical City, Baghdad, Baghdad Governorate, Iraq.
  • MBChB, FICM-ICO, CAB, Department of Ophthalmology, College of Medicine, Al Iraqia University, Baghdad, Baghdad Governorate, Iraq.
  • Department of Internal Medicine, College of Medicine, Baghdad, Baghdad Governorate, Iraq.

Abstract

This study explored the integration of supervised, semi-supervised, and unsupervised learning strategies to classify ophthalmic images under label-scarce conditions. Given the high cost of annotation in medical imaging, the goal was to improve diagnostic performance using minimal labeled data and robust feature representations. A dataset of 18,767 multimodal ophthalmic images was collected: 1,877 labeled and 16,890 unlabeled. Three transformer-based architectures (ViT-Base, DeiT-Base, and MaxViT-L) were used for supervised learning. Semi-supervised learning employed pseudo-labeling with a confidence threshold ≥ 0.98. For unsupervised learning, SimCLR-based contrastive learning and K-means clustering were applied to the extracted features. Performance was evaluated using classification accuracy, AUC, F1-score, clustering indices (Silhouette Score, Davies-Bouldin Index (DBI), and Calinski-Harabasz (CH) Index), and computational metrics. In supervised learning, ViT-Base achieved the highest accuracy (92.47%), followed by DeiT-Base (89.38%) and MaxViT-L (85.27%). After pseudo-labeling, MaxViT-L achieved the best accuracy (97.49%) and AUC (0.9982). Contrastive learning significantly improved feature clustering, with MaxViT-L reaching a Silhouette Score of 0.556 and a reduced DBI of 0.541. However, computational analysis revealed that MaxViT-L had the highest complexity (81,713 MFLOPs) and the longest inference time (~102 ms), while ViT-Base and DeiT-Base showed considerably lower FLOPs (39,120.6 MFLOPs) and faster inference (~52 ms). On the external validation set, MaxViT-L demonstrated the best overall performance. Although ViT-Base achieved the highest accuracy in supervised training, MaxViT-L offered the most favorable trade-off between performance and generalization in the semi-supervised and unsupervised settings. Despite its higher computational complexity and longer inference time, MaxViT-L consistently achieved strong accuracy and clustering performance. This approach minimizes dependence on expert annotations, supporting scalable, automated ophthalmic diagnosis.
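As a concrete illustration of the semi-supervised step, the sketch below applies the ≥ 0.98 confidence filter described in the abstract. It assumes a PyTorch classifier already fine-tuned on the labeled subset; `model`, `unlabeled_loader`, and the device handling are illustrative placeholders, not the authors' actual code.

```python
# Minimal pseudo-labeling sketch (assumed PyTorch; not the authors' code).
# An unlabeled image receives a pseudo-label only when the model's softmax
# confidence meets the 0.98 threshold reported in the abstract.
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_pseudo_labels(model, unlabeled_loader, threshold=0.98, device="cuda"):
    model.eval()
    kept_images, kept_labels = [], []
    for images in unlabeled_loader:      # loader assumed to yield image tensors only
        images = images.to(device)
        probs = F.softmax(model(images), dim=1)
        conf, preds = probs.max(dim=1)   # per-sample top-1 confidence and class
        keep = conf >= threshold         # confidence filter
        kept_images.append(images[keep].cpu())
        kept_labels.append(preds[keep].cpu())
    return torch.cat(kept_images), torch.cat(kept_labels)
```

In the usual pseudo-labeling loop, the accepted (image, pseudo-label) pairs would then be merged with the 1,877 labeled images and the classifier retrained on the combined set.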
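For the unsupervised branch, SimCLR trains the backbone by pulling two augmented views of the same image together in embedding space while pushing apart all other samples in the batch. A common formulation of its NT-Xent loss is sketched below; the temperature value and batch handling are generic assumptions, not details taken from the paper.

```python
# Standard NT-Xent (normalized temperature-scaled cross-entropy) loss, as used
# in SimCLR. z1[i] and z2[i] are projected embeddings of two augmentations of
# the same image; all other samples in the batch act as negatives.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit norm
    sim = z @ z.t() / temperature                        # cosine similarity matrix
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))           # exclude self-similarity
    # Positive for row i is row i+N (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```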
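The three clustering indices reported (Silhouette Score, DBI, CH Index) are all available in scikit-learn, so the K-means evaluation over backbone features can be reproduced along these lines; `features` here is a hypothetical (N, D) matrix of embeddings from a frozen, contrastively pre-trained encoder.

```python
# Evaluate K-means clusters on extracted features with the indices from the
# abstract. Cluster count and seed are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             davies_bouldin_score,
                             calinski_harabasz_score)

def evaluate_clustering(features: np.ndarray, n_clusters: int, seed: int = 0):
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(features)
    return {
        "silhouette": silhouette_score(features, labels),        # higher is better
        "dbi": davies_bouldin_score(features, labels),           # lower is better
        "ch_index": calinski_harabasz_score(features, labels),   # higher is better
    }
```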

Topics

  • Supervised Machine Learning
  • Unsupervised Machine Learning
  • Image Processing, Computer-Assisted
  • Journal Article
  • Comparative Study
