CMT-Unet: leveraging stage-wise hybrid framework for enhanced accuracy and efficiency in medical image segmentation.
Authors
Affiliations (3)
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China.
- Department of Radiology, Southeast University Affiliated Xuzhou Central Hospital, Xuzhou, 221009, China.
- Department of Radiology, Southeast University Affiliated Xuzhou Central Hospital, Xuzhou, 221009, China. [email protected].
Abstract
Achieving high-precision pixel-level segmentation in medical imaging requires both the preservation of fine-grained local details and the modeling of long-range contextual dependencies; nevertheless, convolutional, Transformer, and naive hybrid architectures struggle to adaptively balance these two requirements. To alleviate this limitation, we propose CMT-Unet, which incorporates a Mamba-based state space model to enable adaptive adjustment between local and global feature modeling. The model adopts a task-driven, hierarchically integrated architecture in which inverted residual convolutional units are combined with Mamba and Transformer modules. This design captures both local and global feature representations, avoids excessive computational complexity, strengthens early-stage local feature extraction, and prevents the local blurring caused by prematurely large receptive fields. HiLo attention further complements texture and boundary cues by jointly modeling high- and low-frequency information that standard multi-head self-attention (MHSA) often misses. This staged integration leverages the inherent progression of the encoder from spatial to semantic abstraction, thereby enriching its representational capacity and enhancing efficiency. Experiments conducted on the Synapse and ACDC datasets suggest that CMT-Unet performs reasonably well in terms of efficiency and accuracy when compared to baseline Transformer-UNet and other hybrid approaches. These results demonstrate the feasibility and robustness of stage-specific hybrid designs for advanced medical image segmentation.
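The idea of separating high- and low-frequency content, which the abstract attributes to HiLo attention, can be illustrated with a small toy sketch. This is not the authors' implementation and contains no attention mechanism. It only assumes that "low frequency" can be approximated by window-wise average pooling and "high frequency" by the residual that pooling leaves behind. The function name `split_frequencies` and the `window` parameter are hypothetical, introduced here for illustration.

```python
def split_frequencies(feature_map, window=2):
    """Split a 2D map (list of rows) into low- and high-frequency parts.

    Assumption for illustration only: the low-frequency part is the map
    after window-wise average pooling (each window filled with its mean),
    and the high-frequency part is the residual, so low + high == input.
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")

    low = [[0.0] * width for _ in range(height)]
    for top in range(0, height, window):
        for left in range(0, width, window):
            # Collect the values inside one non-overlapping window.
            block = [
                feature_map[r][c]
                for r in range(top, top + window)
                for c in range(left, left + window)
            ]
            mean = sum(block) / len(block)
            # Fill the whole window with its mean (pool, then upsample).
            for r in range(top, top + window):
                for c in range(left, left + window):
                    low[r][c] = mean

    # The residual keeps the sharp, edge-like detail removed by pooling.
    high = [
        [feature_map[r][c] - low[r][c] for c in range(width)]
        for r in range(height)
    ]
    return low, high


# Toy 4x4 map with a sharp vertical edge between columns 0 and 1.
toy = [[0.0, 8.0, 8.0, 8.0] for _ in range(4)]
low, high = split_frequencies(toy, window=2)
```

In this toy case the pooled map smooths the edge away, while the residual is non-zero only in the windows that contain the edge. That residual corresponds to the kind of boundary cue the abstract says standard MHSA often misses.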