A Multi-Resolution Hybrid CNN-Transformer Network With Scale-Guided Attention for Medical Image Segmentation.
Authors
Abstract
Medical image segmentation remains a challenging task due to the intricate nature of anatomical structures and the wide range of target sizes. In this paper, we propose a novel U -shaped segmentation network that integrates CNN and Transformer architectures to address these challenges. Specifically, our network architecture consists of three main components. In the encoder, we integrate an attention-guided multi-scale feature extraction module with a dual-path downsampling block to learn hierarchical features. The decoder employs an advanced feature aggregation and fusion module that effectively models inter-dependencies across different hierarchical levels. For the bottleneck, we explore multi-scale feature activation and multi-layer context Transformer modules to facilitate high-level semantic feature learning and global context modeling. Additionally, we implement a multi-resolution input-output strategy throughout the network to enrich feature representations and ensure fine-grained segmentation outputs across different scales. The experimental results on diverse multi-modal medical image datasets (ultrasound, gastrointestinal polyp, MR, and CT images) demonstrate that our approach can achieve superior performance over state-of-the-art methods in both quantitative measurements and qualitative assessments. The code is available at https://github.com/zsj0577/MSAGHNet.