HyFormer-Net: A Synergistic CNN-Transformer with Interpretable Multi-Scale Fusion for Breast Lesion Segmentation and Classification in Ultrasound Images
Authors
Abstract
B-mode ultrasound for breast cancer diagnosis faces challenges: speckle, operator dependency, and indistinct boundaries. Existing deep learning suffers from single-task learning, architectural constraints (CNNs lack global context, Transformers local features), and black-box decision-making. These gaps hinder clinical adoption. We propose HyFormer-Net, a hybrid CNN-Transformer for simultaneous segmentation and classification with intrinsic interpretability. Its dual-branch encoder integrates EfficientNet-B3 and Swin Transformer via multi-scale hierarchical fusion blocks. An attention-gated decoder provides precision and explainability. We introduce dual-pipeline interpretability: (1) intrinsic attention validation with quantitative IoU verification (mean: 0.86), and (2) Grad-CAM for classification reasoning. On the BUSI dataset, HyFormer-Net achieves Dice Score 0.761 +/- 0.072 and accuracy 93.2%, outperforming U-Net, Attention U-Net, and TransUNet. Malignant Recall of 92.1 +/- 2.2% ensures minimal false negatives. Ensemble modeling yields exceptional Dice 90.2%, accuracy 99.5%, and perfect 100% Malignant Recall, eliminating false negatives. Ablation studies confirm multi-scale fusion contributes +16.8% Dice and attention gates add +5.9%. Crucially, we conduct the first cross-dataset generalization study for hybrid CNN-Transformers in breast ultrasound. Zero-shot transfer fails (Dice: 0.058), confirming domain shift. However, progressive fine-tuning with only 10% target-domain data (68 images) recovers 92.5% performance. With 50% data, our model achieves 77.3% Dice, exceeding source-domain performance (76.1%) and demonstrating true generalization.