Spec-ViT: A Vision Transformer with Wavelet-Based Anti-aliasing and Denoising for Medical Image Classification
Authors
Abstract
Medical image analysis remains challenging due to inherent limitations of imaging modalities, where structural aliasing and noise artifacts persistently compromise diagnostic accuracy. While convolutional neural networks (CNNs) and vision transformers (ViTs) have achieved remarkable progress in feature extraction, their inherent sampling mechanisms and spectral biases often exacerbate these high-frequency distortions, leading to suboptimal lesion characterization. To address this critical limitation, we propose Spec-ViT, a novel wavelet-based anti-aliasing Transformer architecture that synergistically integrates adaptive spectral purification with hierarchical attentive learning. The Wavelet Anti-aliasing Module (WAM) first applies a learnable smoothing factor in the wavelet domain to suppress high-frequency artifacts while preserving clinically relevant low-frequency structures and fine diagnostic details. Building upon this spectral foundation, the Lightweight Enhanced Attention (LEA) module refines feature representations through a dual-path mechanism, coupling channel-spatial attention with global multi-head self-attention to enhance lesion context modeling. Finally, the Smoothed Convolutional Gate (SCG) further sharpens local discriminability through depth-wise convolution and adaptive Swish gating, completing a coherent pipeline from frequency-aware purification to global-local attentive analysis. Extensive experiments on five benchmark medical image classification datasets demonstrate that Spec-ViT consistently outperforms both baseline and state-of-the-art methods, notably achieving up to 84.04% accuracy on the Pediatric Pneumonia Chest X-rays dataset.
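The wavelet-domain smoothing that the abstract attributes to the WAM can be illustrated with a minimal one-level Haar transform on a 1D signal: decompose into low- and high-frequency bands, attenuate the high band by a smoothing factor, and reconstruct. This is only a hedged sketch of the general idea; the function name `haar_smooth` and the fixed scalar `alpha` (standing in for the paper's learnable factor, which would be optimized jointly with the network) are illustrative assumptions, not the paper's implementation.

```python
import math

def haar_smooth(x, alpha=0.2):
    """One-level Haar decomposition of a 1D signal, with the
    high-frequency band shrunk by a smoothing factor alpha
    (a fixed stand-in for the paper's learnable factor),
    followed by the inverse transform."""
    assert len(x) % 2 == 0, "signal length must be even"
    # Analysis: low band = scaled pairwise sums, high band = scaled pairwise differences.
    lo = [(x[2 * i] + x[2 * i + 1]) / math.sqrt(2) for i in range(len(x) // 2)]
    hi = [(x[2 * i] - x[2 * i + 1]) / math.sqrt(2) for i in range(len(x) // 2)]
    # Spectral purification: suppress high-frequency content (aliasing/noise).
    hi = [alpha * h for h in hi]
    # Synthesis: inverse Haar transform back to the signal domain.
    out = []
    for l, h in zip(lo, hi):
        out.append((l + h) / math.sqrt(2))
        out.append((l - h) / math.sqrt(2))
    return out
```

With `alpha=1.0` the transform reconstructs the input exactly; with `alpha=0.0` each adjacent pair collapses to its mean, i.e. full low-pass smoothing. In Spec-ViT the analogous operation would act on 2D feature maps with the factor learned end-to-end.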