DenseViT-OCT: A Hybrid CNN-Transformer Architecture with Multi-Scale Dense Feature Aggregation for Automated Epiretinal Membrane Severity Classification.

May 22, 2026

DOI: 10.3390/tomography12060076 PMID: 42347131

Authors

Yusufoğlu E, Özçelik STA, Atila O, Guldemir NH, Sengur A

Affiliations (4)

Department of Ophthalmology, Elazig Fethi Sekin City Hospital, 23100 Elazig, Turkey.
Department of Electrical-Electronics Engineering, Faculty of Engineering, Bingol University, 12000 Bingol, Turkey.
Department of Electrical-Electronics Engineering, Faculty of Technology, Firat University, 23100 Elazig, Turkey.
School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast BT9 5BN, UK.

Abstract

Epiretinal membrane (ERM) is a common vitreoretinal disorder characterized by fibrocellular proliferation on the inner retinal surface, often leading to progressive visual impairment. Accurate grading of ERM severity using optical coherence tomography (OCT) is critical for treatment planning and surgical decision-making; however, manual grading is labor-intensive and subjective. This study aims to develop an automated and reliable deep learning-based method for ERM severity classification. We propose DenseViT-OCT, a hybrid deep learning model that integrates dense convolutional neural networks (CNN) and vision transformers (ViT). The model introduces three key modules: Multi-Scale Dense Feature Aggregation (MDFA) for capturing hierarchical features across multiple spatial scales, Adaptive Feature Calibration (AFC) for enhancing feature discrimination through channel and spatial attention, and Cross-Attention Feature Fusion (CAFF) for enabling bidirectional interaction between convolutional and transformer representations. The model was trained and evaluated on 2195 OCT B-scan images obtained from 397 patients. DenseViT-OCT achieved an overall accuracy of 94.76% on the internal four-class test set, outperforming 19 benchmark models, including ConvNeXt, EfficientNet, ViT, and Swin Transformers. The model demonstrated balanced performance with a macro-averaged precision of 93.76%, recall of 93.22%, F1-score of 93.47%, Cohen's kappa of 92.62%, and macro-Area Under the Curve (AUC) of 98.95%. Ablation experiments confirmed the contribution of the proposed MDFA, AFC, CAFF, and deep supervision components, with the full model consistently outperforming reduced variants and standalone DenseNet121 and ViT-B/16 backbones. In repeated experiments across five random seeds, DenseViT-OCT also achieved the best mean accuracy (0.9399 ± 0.0052). 
External validation on the public multicenter OCTDL dataset, performed as binary ERM-versus-normal classification because of label availability, yielded 90.76% accuracy and 97.61% AUC, indicating promising generalization beyond the development cohort. DenseViT-OCT provides a robust framework for automated ERM severity classification from OCT B-scans. The combination of local CNN features, global transformer context, and dedicated fusion modules improves classification performance and yields clinically meaningful error patterns. Although further stage-wise multicenter validation, volumetric OCT analysis, and prospective clinical assessment are required, the proposed method shows promise as a research-oriented decision-support framework for B-scan-level ERM assessment.
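The summary metrics reported above (accuracy, macro-averaged precision, recall, F1-score, and Cohen's kappa) can all be derived from a multi-class confusion matrix. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. The function name `macro_metrics`, the 4×4 matrix `toy_cm`, and its counts are hypothetical and chosen only for illustration. Macro F1 is computed here as the mean of per-class F1 scores, which is one common convention; the abstract does not state which one the study used.

```python
def macro_metrics(cm):
    """Compute summary metrics from a square confusion matrix.

    cm[i][j] is the number of samples whose true class is i and whose
    predicted class is j (hypothetical layout, assumed for this sketch).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]                              # true counts per class
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]   # predicted counts per class

    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        p = tp / col_sums[k] if col_sums[k] else 0.0
        r = tp / row_sums[k] if row_sums[k] else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    accuracy = sum(cm[k][k] for k in range(n)) / total
    # Agreement expected by chance, from the marginal distributions.
    p_e = sum(row_sums[k] * col_sums[k] for k in range(n)) / (total * total)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e < 1 else 0.0

    return {
        "accuracy": accuracy,
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
        "kappa": kappa,
    }


# Hypothetical four-class confusion matrix (illustrative counts, not study data).
toy_cm = [
    [9, 1, 0, 0],
    [1, 8, 1, 0],
    [0, 1, 8, 1],
    [0, 0, 1, 9],
]
metrics = macro_metrics(toy_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because macro averaging weights every class equally regardless of its sample count, the macro scores can differ from overall accuracy when the severity grades are imbalanced. Kappa is computed here as a coefficient between -1 and 1, whereas the abstract expresses it as a percentage.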

Topics

Epiretinal Membrane; Tomography, Optical Coherence; Deep Learning; Journal Article
