Pre-trained Vision Transformer With Masked Autoencoder for Automated Diabetic Macular Edema Detection from Optical Coherence Tomography Images
Authors
Affiliations (1)
- University of Tsukuba
Abstract
Purpose: To develop and evaluate a novel self-supervised learning approach using a Masked Autoencoder (MAE) pre-trained Vision Transformer (ViT) for automated detection of diabetic macular edema (DME) from optical coherence tomography (OCT) images, addressing the critical need for scalable screening solutions in diabetic eye care.
Study Design: Artificial intelligence model training.
Methods: We utilized the publicly available Kermany dataset containing 109,312 OCT images, defining DME detection as a binary classification task (11,559 DME vs. 97,753 non-DME images). Five deep learning architectures were compared: MAE-pretrained ViT (MAE_ViT), standard ViT, ResNet18, VGG19_bn, and EfficientNetV2. MAE_ViT underwent two-stage training: (1) self-supervised pre-training with 75% patch masking for 1,000 epochs to learn robust visual representations, and (2) supervised fine-tuning for DME classification. Model performance was evaluated using accuracy, sensitivity, specificity, F1 score, and area under the receiver operating characteristic curve (AU-ROC), with 95% confidence intervals calculated via bootstrap resampling.
Results: MAE_ViT achieved superior performance with an AU-ROC of 0.999 (95% CI: 0.999-1.000), accuracy of 98.5% (95% CI: 97.7-99.2%), sensitivity of 99.6% (95% CI: 98.7-100%), and specificity of 98.1% (95% CI: 97.2-99.1%). VGG19_bn showed the second-best performance (AU-ROC 0.997), while ResNet18 demonstrated poor specificity (28.3%) despite perfect sensitivity. The self-supervised MAE_ViT outperformed the standard supervised ViT (AU-ROC 0.995), demonstrating the effectiveness of learning from unlabeled data.
Conclusion: The MAE pre-trained Vision Transformer establishes a new benchmark for automated DME detection, offering exceptional diagnostic accuracy and potential for deployment in resource-constrained settings through reduced annotation requirements.
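The Methods describe MAE-style self-supervised pre-training in which 75% of image patches are masked and the model reconstructs them. The sketch below is a minimal, hypothetical illustration in PyTorch of how such random patch masking is commonly implemented (it is not the authors' code; the function name and tensor shapes are assumptions).

```python
import torch

def random_patch_masking(patch_tokens: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style random masking of embedded image patches (illustrative sketch).

    patch_tokens: (B, N, D) tensor of patch embeddings.
    Returns the visible (unmasked) tokens, a binary mask in the original
    patch order (1 = masked), and indices to restore that order.
    """
    B, N, D = patch_tokens.shape
    n_keep = int(N * (1.0 - mask_ratio))          # e.g. keep 25% of patches

    noise = torch.rand(B, N, device=patch_tokens.device)   # per-patch random scores
    shuffle_idx = torch.argsort(noise, dim=1)               # random permutation per sample
    restore_idx = torch.argsort(shuffle_idx, dim=1)         # inverse permutation

    keep_idx = shuffle_idx[:, :n_keep]
    visible = torch.gather(
        patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D)
    )

    mask = torch.ones(B, N, device=patch_tokens.device)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, restore_idx)                # back to original patch order
    return visible, mask, restore_idx
```

In an MAE pipeline, only the visible tokens are passed through the ViT encoder; a lightweight decoder then reconstructs the masked patches, and the pre-trained encoder is subsequently fine-tuned with labels for DME classification.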
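The abstract reports 95% confidence intervals obtained by bootstrap resampling. A minimal sketch of a percentile bootstrap for AU-ROC is shown below, assuming scikit-learn and a held-out array of binary labels and predicted scores (function and parameter names are illustrative, not taken from the paper).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot: int = 1000, alpha: float = 0.05, seed: int = 0):
    """Percentile-bootstrap 95% CI for AU-ROC (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))      # resample with replacement
        if len(np.unique(y_true[idx])) < 2:                  # AU-ROC needs both classes
            continue
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lo, hi)
```

The same resampling loop can be reused for accuracy, sensitivity, specificity, and F1 by swapping the metric function.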