SwiftMSeg: lightweight multi-scale local-global context modeling with transformer for medical image segmentation.
Authors
Affiliations (2)
- Department of CSE, Dhaka University of Engineering and Technology, Gazipur, Bangladesh.
- School of Informatics, Kochi University of Technology, Kami, 782-8502, Japan.
Abstract
Accurate medical image segmentation requires both fine boundary localization and robust contextual understanding, which are difficult to achieve simultaneously, particularly in lightweight architectures. In this paper, we propose SwiftMSeg, a lightweight encoder-decoder framework that integrates a convolutional encoder, a transformer-based local-global-local module, and a hierarchical multi-scale decoder. The framework addresses the boundary-context trade-off by combining progressive multi-scale refinement, which separates fine boundaries, with global context modeling through long-range dependency aggregation. Extensive evaluations on publicly available colonoscopy, pathology, ultrasound, and magnetic resonance imaging (MRI) datasets demonstrated that SwiftMSeg accurately segments diverse anatomical structures, ranging from tiny nuclei to polyps and large tumor regions. The model further demonstrated moderate domain-independent generalization on an external dataset, achieving Dice scores of 0.896 (colonoscopy), 0.860 (pathology), 0.850 (ultrasound), and 0.870 (MRI) and outperforming most baseline methods. In addition, it achieved improved boundary localization, with lower Hausdorff distance (e.g., 16.43 in MRI and 33.89 in ultrasound) and reduced average symmetric surface distance, indicating more precise and stable segmentation. Statistical analysis confirmed that the improvements of SwiftMSeg are significant ([Formula: see text]) with large effect sizes across modalities, as validated by both paired t-tests and Wilcoxon tests. Despite its strong performance, SwiftMSeg remains highly efficient, requiring only 4.48M parameters and 0.940 giga floating-point operations (GFLOPs). Relative to U-Net-based baselines (a standard U-Net has ∼31M parameters and ∼50 GFLOPs), this reduces the floating-point operation count by approximately 53× while maintaining high segmentation accuracy.
These results highlight the effectiveness of SwiftMSeg as a practical and scalable solution for real-world medical image segmentation across diverse modalities.
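For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal sketch of how a Dice score and a symmetric Hausdorff distance can be computed for a pair of binary masks, together with the arithmetic behind the reported cost reduction. It is not the authors' evaluation code: the function names, the toy masks, and the choice to measure the Hausdorff distance over all foreground pixels in pixel units (rather than over extracted boundaries in physical units) are assumptions made for illustration.

```python
import math


def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks given as nested lists of 0/1."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    total = sum(1 for a in p if a) + sum(1 for b in t if b)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def foreground_points(mask):
    """Return (row, col) coordinates of all non-zero pixels."""
    return [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]


def hausdorff_distance(pred, target):
    """Symmetric Hausdorff distance between the foreground pixel sets (pixel units)."""
    a, b = foreground_points(pred), foreground_points(target)
    if not a or not b:
        raise ValueError("Hausdorff distance is undefined for an empty mask")

    def directed(src, dst):
        # Largest distance from any point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(a, b), directed(b, a))


# Toy masks (hypothetical): the prediction misses one pixel of the target.
pred = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
target = [[0, 1, 1, 1],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]

print(dice_score(pred, target))          # 8/9, about 0.889
print(hausdorff_distance(pred, target))  # 1.0

# Cost ratios implied by the figures quoted in the abstract.
flops_ratio = 50 / 0.940   # about 53x fewer GFLOPs than a standard U-Net
params_ratio = 31 / 4.48   # about 7x fewer parameters
print(round(flops_ratio, 1), round(params_ratio, 1))
```

The last two lines show that the roughly 53× figure corresponds to the GFLOPs ratio, while the parameter count is reduced by a factor of about 7.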