Adaptive distribution-aware transformer for multi-scale visual representation learning on imbalanced and low-resolution data.
Authors
Affiliations (4)
- Department of Computing and Mathematics, Manchester Metropolitan University, The Dalton Building, Chester Street, Manchester, M1 5GD, UK. Electronic address: [email protected].
- Department of Computing and Mathematics, Manchester Metropolitan University, The Dalton Building, Chester Street, Manchester, M1 5GD, UK. Electronic address: [email protected].
- Department of Computing and Mathematics, Manchester Metropolitan University, The Dalton Building, Chester Street, Manchester, M1 5GD, UK. Electronic address: [email protected].
- Department of Computing and Mathematics, Manchester Metropolitan University, The Dalton Building, Chester Street, Manchester, M1 5GD, UK. Electronic address: [email protected].
Abstract
Deep learning models often struggle with class imbalance and low-resolution medical images, where critical spatial details and minority-class features are underrepresented. We introduce the Adaptive Distribution-aware Vision Transformer (AdaptiveViT), a novel hybrid CNN-Transformer architecture that unifies fine-grained local feature extraction with global contextual modelling. AdaptiveViT incorporates a distribution-aware modulation mechanism that adjusts feature emphasis according to the severity of class imbalance. In addition, a Distribution-aware Adaptive (DA) Loss incorporates the dataset imbalance ratio into an adaptive focusing scheme, enhancing minority-class sensitivity. Experiments on five skin lesion datasets with varying image resolutions and imbalance ratios (as high as 1:10 for melanoma versus non-melanoma) demonstrate that AdaptiveViT consistently outperforms state-of-the-art CNN, Transformer, and hybrid baselines in F1 and AUC, while maintaining stable convergence across imbalance levels. Validation on gastrointestinal endoscopy datasets, which share similar imbalance characteristics, further demonstrates that AdaptiveViT generalises beyond skin lesion data. All experiments use patient-disjoint splits and a threshold-free evaluation protocol to ensure fair, unbiased, and clinically reliable comparisons. Overall, AdaptiveViT establishes a hybrid framework for medical image classification under class imbalance and image-resolution variability. The code is available at https://github.com/mmu-dermatology-research/AdaptiveViT.
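The abstract does not give the formula for the DA Loss, so the following is only a minimal illustrative sketch of one way an imbalance ratio could drive a focusing scheme, not the authors' formulation. It assumes a binary focal-style loss whose exponent grows with the logarithm of the imbalance ratio; the function names, the `base_gamma` and `scale` parameters, and the logarithmic mapping are all hypothetical.

```python
import math


def focusing_gamma(imbalance_ratio, base_gamma=2.0, scale=1.0):
    """Hypothetical mapping from imbalance ratio to a focal exponent.

    imbalance_ratio is majority count / minority count (>= 1).
    A balanced dataset (ratio 1) falls back to base_gamma.
    """
    if imbalance_ratio < 1.0:
        raise ValueError("imbalance_ratio must be >= 1")
    return base_gamma + scale * math.log(imbalance_ratio)


def imbalance_aware_focal_loss(probs, labels, imbalance_ratio,
                               base_gamma=2.0, scale=1.0, eps=1e-7):
    """Mean focal-style loss for binary labels (assumed form, see lead-in).

    probs  : predicted probability of the positive class per sample
    labels : 0 or 1 per sample
    """
    if len(probs) == 0 or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and equal length")
    gamma = focusing_gamma(imbalance_ratio, base_gamma, scale)
    total = 0.0
    for p, y in zip(probs, labels):
        # Probability assigned to the true class, clipped for numerical safety.
        p_t = p if y == 1 else 1.0 - p
        p_t = min(max(p_t, eps), 1.0 - eps)
        # Well-classified samples (p_t near 1) are down-weighted more strongly
        # as gamma, and therefore the imbalance ratio, increases.
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


if __name__ == "__main__":
    # Same predictions scored under a balanced and a 1:10 imbalanced setting.
    probs, labels = [0.9, 0.1], [1, 1]
    print(imbalance_aware_focal_loss(probs, labels, imbalance_ratio=1.0))
    print(imbalance_aware_focal_loss(probs, labels, imbalance_ratio=10.0))
```

Under these assumptions, a larger imbalance ratio shifts relative weight toward hard, misclassified samples, which is the general intent of an adaptive focusing scheme. The authors' actual loss is in the linked repository.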