Capsule-enhanced hierarchical vision transformers for rare disease classification from medical images.
Authors
Affiliations (6)
- GITAM School of Computer Science and Engineering, GITAM University - Bengaluru Campus, Bengaluru, India.
- Department of CSE, St. Peter's Engineering College, Hyderabad, India.
- Department of CSE - Data Science, Chalapathi Institute of Technology, Guntur, 522016, India.
- Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, 576104, India.
- Department of Computer Science and Engineering, Lakireddy Bali Reddy College of Engineering, Mylavaram, 521230, India.
- Department of AI & ML, School of Computing, Mohan Babu University, Tirupati, India.
Abstract
Automated medical image analysis plays a vital role in rare disease detection, yet existing deep learning models often struggle with severe class imbalance, limited labeled data, and subtle morphological variations. To address these challenges, this paper proposes Swin-CapsuleNet, a hybrid architecture for rare disease classification that combines a hierarchical Swin Transformer backbone, used for multi-scale contextual feature extraction, with a capsule-based classification head that preserves part-whole spatial relationships through dynamic routing. A class-balanced capsule loss is introduced to improve sensitivity toward under-represented disease categories. Extensive experiments on a multi-center rare disease dataset show that Swin-CapsuleNet consistently outperforms state-of-the-art CNN, transformer, and capsule-based baselines. The proposed model achieves 94.1% accuracy, a 93.2% F1-score, and an AUC of 0.972, while attaining a macro-F1 of 0.899 on the rare disease classes. Ablation studies validate the complementary contributions of hierarchical attention, capsule representations, and the proposed loss function. Furthermore, computational analysis shows that Swin-CapsuleNet offers a favorable balance between performance and efficiency, supporting its applicability in real-world clinical decision-support systems.
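
The abstract names two mechanisms without giving their formulas: a capsule head with dynamic routing and a class-balanced capsule loss. The NumPy sketch below illustrates one plausible reading of each and is not the authors' implementation. It assumes the standard routing-by-agreement procedure and margin loss from the original capsule network literature, and it assumes that "class-balanced" means reweighting each sample by the inverse effective number of samples in its class. The function names, the tensor layout of `u_hat`, and the values of `beta`, `m_pos`, `m_neg`, and `lam` are all illustrative assumptions; in the described architecture `u_hat` would be derived from Swin Transformer features, which is not shown here.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Capsule nonlinearity: keeps the direction, maps the length into [0, 1)."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement (assumed standard form).

    u_hat: prediction vectors, shape (batch, n_in, n_out, dim).
    Returns class capsules of shape (batch, n_out, dim).
    """
    b = np.zeros(u_hat.shape[:3])  # routing logits, (batch, n_in, n_out)
    for _ in range(num_iters):
        # Coupling coefficients: softmax over the output capsules.
        e = np.exp(b - b.max(axis=2, keepdims=True))
        c = e / e.sum(axis=2, keepdims=True)
        # Weighted sum of predictions, then squash.
        s = np.sum(c[..., None] * u_hat, axis=1)
        v = squash(s)
        # Raise logits where a prediction agrees with the output capsule.
        b = b + np.sum(u_hat * v[:, None, :, :], axis=-1)
    return v

def class_balanced_margin_loss(v, labels, class_counts, beta=0.999,
                               m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss on capsule lengths, reweighted per class (assumed form).

    v: class capsules, shape (batch, n_classes, dim).
    labels: integer class ids, shape (batch,).
    class_counts: training samples per class, shape (n_classes,).
    """
    lengths = np.linalg.norm(v, axis=-1)           # (batch, n_classes)
    n_classes = lengths.shape[1]
    t = np.eye(n_classes)[labels]                  # one-hot targets
    # Assumption: weights are the inverse "effective number" of samples.
    eff_num = (1.0 - np.power(beta, class_counts)) / (1.0 - beta)
    w = 1.0 / eff_num
    w = w * n_classes / w.sum()                    # normalize to mean 1
    pos = t * np.maximum(0.0, m_pos - lengths) ** 2
    neg = lam * (1.0 - t) * np.maximum(0.0, lengths - m_neg) ** 2
    per_sample = np.sum(pos + neg, axis=1) * w[labels]
    return per_sample.mean()
```

Under these assumptions, a misclassified sample from a class with few training images contributes more to the loss than the same error on a common class, which is one way to obtain the improved sensitivity toward under-represented categories that the abstract describes.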