DMformer: Difficulty-adapted Masked Transformer for Semi-Supervised Medical Image Segmentation
Authors
Abstract
The shared anatomy among different human bodies can serve as a strong prior for effectively leveraging unlabeled data in semi-supervised medical image segmentation. Inspired by the success of masked image modeling, we notice that this prior can be explicitly realized by incorporating an auxiliary unsupervised gross-anatomy reconstruction task into a teacher-student semi-supervised segmentation framework. In this auxiliary task, consistency is enforced between the student's predictions on masked images and the teacher's predictions on the original images. Despite its potential, we observe that the reconstruction difficulty varies significantly across organs and tissues, so reconstructing them requires tailored learning strategies. To address this issue, we introduce a difficulty-adapted mask mechanism into the teacher-student framework, wherein the reconstruction difficulty is adapted to facilitate training. Specifically, we control the reconstruction difficulty by modulating two key factors: the masked region ratio and the masked class ratio. Accordingly, we design two corresponding masking strategies: 1) region-based masking, which randomly masks a fraction of each class according to an automatically computed mask ratio; and 2) class-based masking, which masks the entire regions of specific classes according to the class confidence predicted by the teacher model. During training, a conflict-aware gradient computation strategy is introduced to mitigate potential optimization conflicts arising from modulating the two reconstruction factors simultaneously. Building on vision transformers, we develop the Difficulty-adapted Masked Transformer (DMformer) for semi-supervised medical image segmentation. Extensive experiments demonstrate the superiority of DMformer, which outperforms the previous state of the art by 9.53% and 4.63% DSC on the ACDC dataset with 5% labeled images and the Synapse dataset with 30% labeled images, respectively. Code is available at: https://github.com/SJTU-DeepVisionLab/DMformer.
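To make the two masking strategies concrete, below is a minimal PyTorch sketch of region-based masking. It assumes per-pixel pseudo-labels from the teacher and takes the mask ratio as an input; how DMformer automatically computes this ratio is not specified in the abstract, so `mask_ratio` here is a free parameter and the function name is hypothetical.

```python
import torch

def region_based_mask(image, pseudo_label, mask_ratio):
    """Randomly mask a fraction of the pixels belonging to each class.

    image:        (C, H, W) float tensor.
    pseudo_label: (H, W) long tensor of per-pixel class ids
                  (e.g., argmax of the teacher's softmax output).
    mask_ratio:   fraction of each class's pixels to mask, in [0, 1]
                  (assumed given; the paper computes it automatically).
    """
    mask = torch.zeros_like(pseudo_label, dtype=torch.bool)
    for cls in pseudo_label.unique():
        coords = (pseudo_label == cls).nonzero(as_tuple=False)  # (N, 2) pixel coords
        n_mask = int(mask_ratio * coords.shape[0])
        if n_mask == 0:
            continue
        picked = coords[torch.randperm(coords.shape[0])[:n_mask]]
        mask[picked[:, 0], picked[:, 1]] = True
    masked_image = image.clone()
    masked_image[:, mask] = 0.0  # zero out masked pixels in every channel
    return masked_image, mask

# Example: mask 40% of every class's pixels in a single-channel 2D slice.
img = torch.randn(1, 256, 256)
plab = torch.randint(0, 4, (256, 256))
masked, m = region_based_mask(img, plab, mask_ratio=0.4)
```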
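Class-based masking can be sketched analogously. The selection rule below, masking every class whose mean teacher confidence over its predicted region falls under a threshold, is a stand-in for the paper's confidence-based criterion; `conf_threshold` and the mean-confidence statistic are assumptions, not DMformer's exact rule.

```python
import torch

def class_based_mask(image, teacher_probs, conf_threshold=0.9):
    """Mask the entire region of every low-confidence class.

    image:          (C, H, W) float tensor.
    teacher_probs:  (K, H, W) softmax output of the teacher model.
    conf_threshold: hypothetical cutoff; classes whose mean confidence
                    over their predicted region falls below it are masked.
    """
    pseudo_label = teacher_probs.argmax(dim=0)  # (H, W) class ids
    mask = torch.zeros_like(pseudo_label, dtype=torch.bool)
    for cls in range(teacher_probs.shape[0]):
        region = pseudo_label == cls
        if not region.any():
            continue
        mean_conf = teacher_probs[cls][region].mean()
        if mean_conf < conf_threshold:
            mask |= region  # drop this class's whole region from the input
    masked_image = image.clone()
    masked_image[:, mask] = 0.0
    return masked_image, mask
```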
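The abstract does not detail the conflict-aware gradient computation. One common way to realize the idea is a PCGrad-style projection (Yu et al., 2020) that removes the conflicting component of one task gradient when the two gradients have negative inner product; the sketch below illustrates that general technique applied to the two mask-driven objectives, not DMformer's exact strategy.

```python
import torch

def project_conflicting(g_region, g_class, eps=1e-12):
    """If the two task gradients conflict (negative inner product),
    project the first onto the normal plane of the second, PCGrad-style.

    g_region, g_class: flattened gradient vectors of the region-based and
    class-based reconstruction objectives (hypothetical names).
    """
    dot = torch.dot(g_region, g_class)
    if dot < 0:  # gradients point in opposing directions
        g_region = g_region - dot / (g_class.norm() ** 2 + eps) * g_class
    return g_region, g_class

# Example: combine the two objectives' gradients for a shared parameter vector.
g1, g2 = torch.randn(1000), torch.randn(1000)
g1, g2 = project_conflicting(g1, g2)
update = g1 + g2
```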