
Multimodal sparse fusion transformer network with spatio-temporal decoupling for breast tumor classification.

January 28, 2026 · PubMed

Authors

Xu J, Zhuang S, He Y, Wang H, Zhuang Z, Zeng H

Affiliations (5)

  • Engineering College, Shantou University, Shantou, Guangdong 515041, China.
  • School of Biomedical Engineering, Sun Yat-sen University, Shenzhen, Guangdong 518107, China. Electronic address: [email protected].
  • Shantou University Medical College, Shantou, Guangdong 515000, China; Department of Ultrasound, Shantou Central Hospital, Shantou, Guangdong 515000, China.
  • Engineering College, Shantou University, Shantou, Guangdong 515041, China. Electronic address: [email protected].
  • The Breast Surgery, Cancer Hospital of Shantou University Medical College, Shantou, Guangdong 515041, China. Electronic address: [email protected].

Abstract

Accurate analysis of tumor morphology, vascularity, and tissue stiffness under multimodal ultrasound imaging plays a critical role in the diagnosis of breast cancer. However, manual interpretation across multiple modalities is time-consuming and heavily dependent on the radiologist's expertise. Computer-aided classification offers an efficient alternative, yet remains challenging due to significant modality heterogeneity, inconsistent image quality, and redundant information across modalities. To address these issues, we propose a novel Multimodal Sparse Fusion Transformer Network (MSFT-Net). First, a Spatio-Temporal Decoupling Attention architecture (STDA) is introduced to disentangle and extract dynamic and static features from different modalities along spatial and temporal dimensions, capturing modality-specific motion and morphological characteristics independently. Second, the Mixed-Scale Convolution Module (MSCM) obtains tumor features at multiple scales, enhancing geometric detail representation and improving receptive field coverage. Third, the Sparse Cross-Attention Module (SCAM) adaptively retains the most effective query-key interactions between modalities, thereby facilitating the aggregation of high-quality features for robust multimodal information fusion. MSFT-Net is trained and tested on a curated dataset comprising multimodal breast tumor videos collected from 458 patients, including ultrasound (US), superb microvascular imaging (SMI), and strain elastography (SE), and its generalizability is further validated on the public BraTS'21 MRI dataset. Extensive experiments demonstrate that MSFT-Net achieves superior performance in multimodal breast tumor classification compared to state-of-the-art methods, providing fast and reliable support for radiologists in diagnostic tasks.
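The Sparse Cross-Attention Module described above retains only the most effective query-key interactions between modalities. The paper does not give implementation details, so the following is a minimal illustrative sketch of one common way to realize such sparsity: compute standard cross-attention scores between tokens of two modalities, keep only the top-k scores per query, and mask the rest before the softmax. All names, shapes, and the `top_k` parameter here are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def sparse_cross_attention(q, k, v, top_k=4):
    """Cross-attention that keeps only the top_k strongest
    query-key scores per query; all other scores are masked
    out before the softmax (a common sparsification scheme,
    assumed here for illustration)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (Nq, Nk) similarity
    # threshold: each query's top_k-th largest score
    kth = np.sort(scores, axis=-1)[:, -top_k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    # numerically stable softmax over the surviving scores
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                               # (Nq, d) fused features

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 16))    # e.g. tokens from one modality (US)
k = rng.normal(size=(10, 16))   # e.g. tokens from another modality (SMI)
v = rng.normal(size=(10, 16))
out = sparse_cross_attention(q, k, v, top_k=3)
print(out.shape)
```

With `top_k` equal to the number of keys, this reduces to ordinary dense cross-attention; smaller values force each query to aggregate information from only its strongest cross-modal matches, which is the intuition behind discarding redundant or low-quality interactions.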

Topics

  • Breast Neoplasms
  • Multimodal Imaging
  • Ultrasonography, Mammary
  • Image Interpretation, Computer-Assisted
  • Neural Networks, Computer
  • Journal Article
