A General Framework for Efficient Medical Image Analysis via Shared Attention Vision Transformer.
Authors
Abstract
Vision Transformers (ViTs) demonstrate significant promise in medical image analysis but face two critical challenges: 1) a limited ability to capture local features in data-scarce scenarios, leading to data inefficiency, and 2) the high computational and storage demands of full fine-tuning in transfer learning, resulting in parameter inefficiency. To achieve efficient and accurate medical image analysis, we propose the Shared Attention Vision Transformer (SAViT), which comprises three innovative modules: i) Shared Prior Attention (SPA), which improves data efficiency by employing a visual prompt to sequentially share consistent attention weights across local image regions, enabling the model to learn translational invariance and capture locality; ii) MixPool, which preserves global modeling ability by aggregating local features after SPA through a multi-pooling mechanism, thereby facilitating long-range dependencies across local image regions; and iii) Low-rank Multi-head Self-Attention (Lr-MSA), which improves parameter efficiency by using low-rank weights in multi-head self-attention, reducing computational complexity while maintaining accuracy in medical image analysis. SAViT generalizes well across multiple medical imaging modalities, including retinopathy, dermoscopy, and radiography. Extensive experiments demonstrate its high data efficiency and strong performance against more than 20 medical-specific and ViT-based models when all models are trained from scratch. It also excels at parameter-efficient tuning, surpassing 17 models across 6 datasets in transfer learning with only 0.17M/0.23M trainable parameters on ViT-B/SwinViT-B backbones, which contain 86.60M/88.00M parameters in total. Source code is available at: https://github.com/LYH-hh/SAViT.
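As an illustration of the low-rank idea behind Lr-MSA, the sketch below (PyTorch) shows a hypothetical low-rank reparameterization of the multi-head self-attention projections. The class name, rank, dimensions, and freezing strategy are assumptions made for this example, not the paper's exact formulation, which is defined in the method section and the released code.

    # Illustrative sketch only (not the paper's exact Lr-MSA): a low-rank,
    # trainable update added to frozen query/key/value projections, so that
    # only a small number of parameters are tuned during transfer learning.
    import torch
    import torch.nn as nn

    class LowRankMSA(nn.Module):
        def __init__(self, dim=768, num_heads=12, rank=8):
            super().__init__()
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            self.scale = self.head_dim ** -0.5
            # Frozen full-rank projections (e.g., from a pre-trained ViT backbone).
            self.qkv = nn.Linear(dim, 3 * dim, bias=False)
            self.qkv.weight.requires_grad = False
            # Trainable low-rank factors: (dim -> rank -> 3*dim) adds far fewer
            # parameters than the full dim x 3*dim projection when rank << dim.
            self.down = nn.Linear(dim, rank, bias=False)
            self.up = nn.Linear(rank, 3 * dim, bias=False)
            nn.init.zeros_(self.up.weight)  # start from the frozen backbone's behavior
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):
            B, N, C = x.shape
            qkv = self.qkv(x) + self.up(self.down(x))  # frozen term + low-rank term
            qkv = qkv.reshape(B, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
            q, k, v = qkv[0], qkv[1], qkv[2]
            attn = (q @ k.transpose(-2, -1)) * self.scale
            attn = attn.softmax(dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(B, N, C)
            return self.proj(out)

With dim = 768 and rank = 8, the trainable low-rank factors contribute roughly 768 * 8 + 8 * 3 * 768 parameters per attention block, a small fraction of the frozen 768 * 3 * 768 projection, which is the kind of saving that makes the reported 0.17M/0.23M trainable-parameter budgets plausible.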