
Efficient Vision Transformers for Ophthalmic Images Classification: A Comparative Study of Supervised, Semi-Supervised, and Unsupervised Learning Approaches.

November 17, 2025

Authors

Al-Wassiti AS, Mutar MT, Al Sakini AS, Rasheed LS, Yosif W, Abbas MA, Raouf NR, Al-Shammari AS

Affiliations (6)

  • MBChB, FIBMS (ophthalmology), FICO, FRCS (Glasg), College of Medicine, University of Baghdad, Baghdad, Baghdad Governorate, Iraq.
  • Department of Surgery, Ophthalmology unit, College of Medicine, University of Baghdad, Baghdad, Baghdad Governorate, Iraq. [email protected].
  • Department of Surgery, Ophthalmology unit, College of Medicine, University of Baghdad, Baghdad, Baghdad Governorate, Iraq. [email protected].
  • Department of Ophthalmology, Baghdad Medical City, Baghdad, Baghdad Governorate, Iraq.
  • MBChB, FICM-ICO, CAB, Department of Ophthalmology, College of Medicine, Al Iraqia University, Baghdad, Baghdad Governorate, Iraq.
  • Department of Internal Medicine, College of Medicine, Baghdad, Baghdad Governorate, Iraq.

Abstract

This study explored the integration of supervised, semi-supervised, and unsupervised learning strategies for classifying ophthalmic images under label-scarce conditions. Given the high cost of annotation in medical imaging, the goal was to improve diagnostic performance using minimal labeled data and robust feature representations. A dataset of 18,767 multimodal ophthalmic images was collected: 1,877 labeled and 16,890 unlabeled. Three transformer-based architectures (ViT-Base, DeiT-Base, and MaxViT-L) were used for supervised learning. Semi-supervised learning employed pseudo-labeling with a confidence threshold ≥ 0.98. For unsupervised learning, SimCLR-based contrastive learning and K-means clustering were applied to the extracted features. Performance was evaluated using classification accuracy, AUC, F1-score, clustering indices (Silhouette Score, Davies-Bouldin Index [DBI], and Calinski-Harabasz [CH] Index), and computational metrics. In supervised learning, ViT-Base achieved the highest accuracy (92.47%), followed by DeiT-Base (89.38%) and MaxViT-L (85.27%). After pseudo-labeling, MaxViT-L achieved the best accuracy (97.49%) and AUC (0.9982). Contrastive learning substantially improved feature clustering, with MaxViT-L reaching a Silhouette Score of 0.556 and a reduced DBI of 0.541. However, computational analysis revealed that MaxViT-L had the highest complexity (81,713 MFLOPs) and the longest inference time (~102 ms), while ViT-Base and DeiT-Base required considerably fewer FLOPs (39,120.6 MFLOPs) and inferred faster (~52 ms). On the external validation set, MaxViT-L demonstrated the best overall performance. Although ViT-Base achieved the highest accuracy in supervised training, MaxViT-L offered the most favorable trade-off between performance and generalization in the semi-supervised and unsupervised settings: despite its higher computational complexity and longer inference time, it consistently achieved strong accuracy and clustering performance. This approach minimizes dependence on expert annotations, supporting scalable and automated ophthalmic diagnosis.
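The pseudo-labeling step the abstract describes is concrete enough to sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' code: it assumes a trained classifier and an unlabeled data loader that yields batches of image tensors, and retains only predictions whose softmax confidence meets the reported ≥ 0.98 cutoff.

```python
import torch

@torch.no_grad()
def select_pseudo_labels(model, unlabeled_loader, threshold=0.98, device="cpu"):
    """Keep unlabeled samples whose max softmax probability meets the threshold."""
    model.eval()
    kept_images, kept_labels = [], []
    for images in unlabeled_loader:   # assumed: each batch is a tensor of images
        images = images.to(device)
        probs = torch.softmax(model(images), dim=1)
        conf, preds = probs.max(dim=1)
        mask = conf >= threshold      # the abstract's >= 0.98 confidence cutoff
        kept_images.append(images[mask].cpu())
        kept_labels.append(preds[mask].cpu())
    return torch.cat(kept_images), torch.cat(kept_labels)
```

The retained (image, pseudo-label) pairs would then be merged with the 1,877 labeled images for a further round of supervised training. Similarly, the three clustering indices named in the abstract have standard scikit-learn implementations; a sketch, assuming `features` holds encoder embeddings (e.g., from the SimCLR-trained backbone):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def clustering_indices(features: np.ndarray, n_clusters: int, seed: int = 0):
    """Cluster embeddings with K-means and report the abstract's three indices."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(features)
    return {
        "silhouette": silhouette_score(features, labels),       # higher is better
        "dbi": davies_bouldin_score(features, labels),          # lower is better
        "ch_index": calinski_harabasz_score(features, labels),  # higher is better
    }
```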

Topics

  • Supervised Machine Learning
  • Unsupervised Machine Learning
  • Image Processing, Computer-Assisted
  • Journal Article
  • Comparative Study
