Integrating visual and language cues via state space models for medical image segmentation.
Authors
Affiliations (6)
- School of Microelectronics and Communication Engineering, Chongqing University, Chongqing, 400044, China.
- School of Microelectronics and Communication Engineering, Chongqing University, Chongqing, 400044, China; Key Laboratory of Dependable Service Computing in Cyber Physical Society (Chongqing University), Ministry of Education of China, Chongqing University, Chongqing, 400044, China. Electronic address: [email protected].
- Bioengineering College of Chongqing University, Chongqing, 400044, China.
- Health Management Institute, The Second Medical Center and National Clinical Research Center for Geriatric Diseases, Chinese PLA General Hospital, Beijing, 100853, China.
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China.
- Department of Gastroenterology, Daping Hospital (Army Medical University), No. 10 Changjiang Branch Road, Daping, Yuzhong, Chongqing, 400042, PR China.
Abstract
Reliable medical image segmentation is essential for clinical applications, but it is often hindered by low image contrast, ambiguous boundaries, and a scarcity of annotated data. Integrating language guidance from clinical text prompts offers a promising solution. However, effectively modeling the complex, long-range dependencies within and across visual and textual modalities remains a significant hurdle for current deep learning architectures. In this paper, we introduce a novel neural framework that leverages State Space Models (SSMs) to achieve a dynamic and selective fusion of multi-modal information. Our core innovation is a Multimodal Interactive Guide Decoder (MIGD), which employs SSMs with a selective scanning mechanism to capture global context in both images and text with linear complexity, followed by a cross-attention module for fine-grained feature alignment. To improve prediction reliability, we propose a Multi-Expert Uncertainty Refinement (MEUR) module. Grounded in the theory of Choquet integrals, MEUR aggregates the opinions of multiple expert networks to produce well-calibrated, pixel-wise uncertainty estimates, which are used to identify and refine unreliable segmentations. Extensive experiments on three public benchmarks (QaTa-COVID19, MosMedData+, and MoNuSeg) demonstrate that our framework achieves state-of-the-art or competitive performance across radiology and histopathology tasks, outperforming strong competitors such as LViT and RecLMIS in both accuracy (Dice/mIoU) and computational efficiency (GFLOPs). More importantly, it delivers superior prediction stability in challenging scenarios. Our code is publicly available at https://github.com/394481125/ViLSSeg.
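As an illustration of the Choquet-integral aggregation mentioned in the abstract, the following is a minimal sketch and not the authors' MEUR implementation. The abstract does not specify the fuzzy measure, so the sketch assumes a simple symmetric measure mu(A) = (|A| / n) ** q; the function names, the parameter `q`, and the disagreement-based uncertainty proxy are all hypothetical choices made for illustration.

```python
import numpy as np


def cardinality_measure(set_size, n_experts, q=1.0):
    """Assumed symmetric fuzzy measure: mu(A) = (|A| / n) ** q (hypothetical choice)."""
    return (set_size / n_experts) ** q


def choquet_integral(expert_probs, q=1.0):
    """Pixel-wise discrete Choquet integral over expert foreground probabilities.

    expert_probs: array of shape (n_experts, H, W) with values in [0, 1].
    Returns an array of shape (H, W).
    """
    probs = np.asarray(expert_probs, dtype=np.float64)
    n = probs.shape[0]
    # Sort expert outputs in ascending order at every pixel.
    sorted_probs = np.sort(probs, axis=0)
    # Increments x_(i) - x_(i-1), with x_(0) = 0.
    previous = np.concatenate(
        [np.zeros_like(sorted_probs[:1]), sorted_probs[:-1]], axis=0
    )
    increments = sorted_probs - previous
    # The i-th increment is weighted by the measure of the set of experts whose
    # output is at least x_(i); that set has n - i + 1 members.
    set_sizes = n - np.arange(n)  # n, n-1, ..., 1
    weights = cardinality_measure(set_sizes, n, q).reshape(n, 1, 1)
    return (increments * weights).sum(axis=0)


def disagreement_map(expert_probs):
    """Hypothetical uncertainty proxy: per-pixel range of the expert outputs."""
    probs = np.asarray(expert_probs, dtype=np.float64)
    return probs.max(axis=0) - probs.min(axis=0)


# Three toy "experts" predicting on a 1x2 image.
experts = np.array([
    [[0.2, 0.9]],
    [[0.6, 0.9]],
    [[1.0, 0.9]],
])
fused_mean = choquet_integral(experts, q=1.0)         # reduces to the plain mean
fused_cautious = choquet_integral(experts, q=2.0)     # leans toward the lowest expert
uncertain = disagreement_map(experts) > 0.5           # flags the first pixel only
```

With this assumed measure, `q = 1` recovers the arithmetic mean of the experts, while larger `q` pulls the fused value toward the most pessimistic expert wherever the experts disagree. Pixels where all experts agree are unaffected by `q`.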