Semi-SwinUNeTR: Towards 3D Swin Vision Transformer-Based UNet for Medical Image Segmentation with Limited Annotations.
Affiliations
- School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China.
- Engineering Research Center of Blockchain and Network Convergence Technology, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing 100876, China.
- National Engineering Research Center for Mobile Internet Security Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China.
- Key Laboratory of Universal Wireless Communications, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing 100876, China.
- School of Computer Science and Digital Technologies, Aston University, Birmingham B4 7ET, UK.
Abstract
Accurate brain tumor segmentation from magnetic resonance imaging (MRI) is essential for computer-assisted diagnosis, treatment planning, and disease monitoring. However, brain tumors usually exhibit irregular, heterogeneous, and multi-scale spatial patterns with complex and ambiguous boundaries. In addition, the performance of deep segmentation models is often limited by the scarcity of voxel-level annotations, which are expensive and time-consuming to obtain. To address these challenges, this paper proposes Semi-SwinUNeTR, a semi-supervised framework for 3D brain tumor segmentation with limited annotated data. The method adopts SwinUNeTR as the segmentation backbone, which learns hierarchical volumetric representations through shifted-window self-attention while retaining the encoder-decoder structure required for dense prediction. On top of this backbone, we introduce a dual-consistency semi-supervised learning strategy that combines model consistency, based on a mean teacher, with data consistency, based on interpolation consistency. Furthermore, voxel-wise consistency weights redistribute the consistency supervision toward structurally complex and boundary-irregular tumor regions without modifying the SwinUNeTR backbone. Experiments on the BraTS 2019 benchmark show that the framework achieves strong performance across different annotation ratios. The base Semi-SwinUNeTR achieves Dice scores of 84.93%, 86.25%, 87.05%, and 87.83% under the 10%, 20%, 40%, and 80% labeled-data settings, respectively. With the weighted consistency extension, the Dice scores further improve to 85.64%, 87.94%, and 88.59% under the 10%, 20%, and 80% labeled-data settings, respectively, while the corresponding HD95 values decrease to 8.9826, 8.1854, and 7.4533.
These results indicate that combining a SwinUNeTR backbone with complementary model consistency, data consistency, and voxel-wise consistency weighting is an effective strategy for semi-supervised volumetric medical image segmentation under limited annotation.
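The dual-consistency objective described in the abstract can be illustrated with a minimal NumPy sketch. It rests on several assumptions that are not taken from the paper: a toy per-voxel linear classifier stands in for SwinUNeTR; both consistency terms use mean-squared error; data consistency follows the interpolation consistency training pattern (student prediction on a mixed input should match the same mixture of teacher predictions, with λ drawn from Beta(α, α)); and the voxel-wise weights come from a hypothetical gradient-magnitude boundary proxy, since the abstract does not say how the weights are computed. All names (`toy_model`, `boundary_weights`, `dual_consistency_loss`, and so on) are illustrative, not the authors' code.

```python
import numpy as np

def softmax(logits, axis=0):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def toy_model(params, x):
    # Stand-in for SwinUNeTR (assumption): a per-voxel linear classifier.
    # x: (C_in, D, H, W) -> class probabilities (K, D, H, W).
    logits = np.einsum("kc,cdhw->kdhw", params["W"], x) + params["b"][:, None, None, None]
    return softmax(logits, axis=0)

def ema_update(teacher, student, alpha=0.99):
    # Mean teacher: teacher weights are an exponential moving average of student weights.
    return {k: alpha * teacher[k] + (1.0 - alpha) * student[k] for k in teacher}

def boundary_weights(p_fg, beta=1.0):
    # HYPOTHETICAL voxel weights: up-weight voxels where the foreground probability
    # changes quickly (a crude boundary proxy), then normalize to mean 1.
    grads = np.gradient(p_fg)  # one array per spatial axis
    mag = np.sqrt(sum(g ** 2 for g in grads))
    w = 1.0 + beta * mag / (mag.max() + 1e-8)
    return w / w.mean()

def weighted_mse(p_a, p_b, w):
    # Squared error averaged over classes, then a weighted mean over voxels.
    per_voxel = ((p_a - p_b) ** 2).mean(axis=0)
    return float((w * per_voxel).sum() / w.sum())

def dual_consistency_loss(student, teacher, u1, u2, rng, mix_alpha=0.2, beta=1.0):
    # Model consistency: student and teacher predict the same unlabeled volume.
    p_s1 = toy_model(student, u1)
    p_t1 = toy_model(teacher, u1)
    l_model = weighted_mse(p_s1, p_t1, boundary_weights(p_t1[1], beta))

    # Data consistency: the student's prediction on a mixed input should match
    # the same mixture of the teacher's predictions on the two inputs.
    lam = rng.beta(mix_alpha, mix_alpha)
    u_mix = lam * u1 + (1.0 - lam) * u2
    target_mix = lam * p_t1 + (1.0 - lam) * toy_model(teacher, u2)
    w_mix = boundary_weights(target_mix[1], beta)
    l_data = weighted_mse(toy_model(student, u_mix), target_mix, w_mix)
    return l_model, l_data

def dice(pred, gt):
    # Dice = 2|A ∩ B| / (|A| + |B|) for binary masks.
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(2.0 * np.logical_and(pred, gt).sum() / denom)

rng = np.random.default_rng(0)
C_in, K, shape = 4, 2, (8, 8, 8)  # e.g. 4 MRI modalities; background vs. tumor
student = {"W": rng.normal(size=(K, C_in)), "b": np.zeros(K)}
teacher = {k: v.copy() for k, v in student.items()}
u1, u2 = rng.normal(size=(2, C_in, *shape))  # two unlabeled volumes

l_model, l_data = dual_consistency_loss(student, teacher, u1, u2, rng)
print(f"model consistency={l_model:.4f}, data consistency={l_data:.4f}")

# Pretend an optimizer step changed the student, then move the teacher by EMA.
w_before = teacher["W"].copy()
student["W"] = student["W"] + 0.1
teacher = ema_update(teacher, student, alpha=0.99)
```

Because teacher and student start identical here, the model-consistency term is zero while the interpolation term is not, since softmax is nonlinear in its input. In a full training loop these terms would typically be added to a supervised loss on the labeled volumes with a ramp-up weight; that schedule is an assumption, not something the abstract specifies.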