Cross-attention guided multi-modal network for breast ultrasound diagnosis incorporating objective clinical semantics.
Authors
Affiliations (1)
- Department of Ultrasound Diagnostics, The First People's Hospital of Taicang, Suzhou, Jiangsu, China.
Abstract
Breast ultrasound diagnosis is significantly constrained by operator subjectivity. While deep learning shows promise, existing models often neglect the structured morphological semantics essential for radiological reasoning. We propose the Cross-Attention Guided Network (CGA-Net), a multi-modal framework that fuses visual data with objective clinical descriptors via a cross-attention mechanism. Specifically, clinical features such as shape and margin act as semantic queries that dynamically highlight pathological regions. Validated on 252 patients using a rigorous Out-Of-Fold (OOF) prediction strategy to prevent data leakage, CGA-Net trained from scratch demonstrated the most balanced clinical utility: an OOF AUC of 0.905, the highest overall accuracy (0.857), and a well-balanced specificity (0.831), while maintaining high sensitivity (0.898). A pre-trained version of CGA-Net achieved the highest overall AUC, 0.915. Both multi-modal configurations outperformed the clinical-only baseline (0.890) and the image-only baseline (0.795). This suggests that while transfer learning aids general feature extraction, strong cross-modal semantic guidance alone is highly effective at reducing false-positive diagnoses and optimizing practical clinical thresholds. Attention map visualization confirmed that the model aligns closely with expert focus on the tumor periphery. CGA-Net offers a robust, interpretable, and data-efficient "second opinion" tool to reduce diagnostic variability.
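The abstract does not specify how CGA-Net is implemented, so the following is only a minimal sketch of the general idea it describes: clinical descriptors serve as queries that attend over image patch features. It uses generic scaled dot-product cross-attention in NumPy. The function names, the token counts, the dimensions, and the random projection matrices are all illustrative assumptions, not the authors' architecture.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(clinical_tokens, image_tokens, w_q, w_k, w_v):
    """Scaled dot-product cross-attention (hypothetical helper).

    Queries come from the clinical descriptors; keys and values come
    from the image patch features.
    """
    q = clinical_tokens @ w_q            # (n_clinical, d)
    k = image_tokens @ w_k               # (n_patches, d)
    v = image_tokens @ w_v               # (n_patches, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # one distribution over patches per query
    fused = weights @ v                  # (n_clinical, d)
    return fused, weights


# Assumed sizes: 2 clinical descriptors (e.g. shape, margin) and a 7x7 patch grid.
rng = np.random.default_rng(0)
n_clinical, n_patches = 2, 49
d_clinical, d_image, d_model = 8, 32, 16

clinical_tokens = rng.normal(size=(n_clinical, d_clinical))
image_tokens = rng.normal(size=(n_patches, d_image))

# Random projections stand in for learned weights.
w_q = rng.normal(size=(d_clinical, d_model))
w_k = rng.normal(size=(d_image, d_model))
w_v = rng.normal(size=(d_image, d_model))

fused, weights = cross_attention(clinical_tokens, image_tokens, w_q, w_k, w_v)
attention_maps = weights.reshape(n_clinical, 7, 7)  # one spatial map per descriptor
```

In this sketch each row of `weights` is a probability distribution over image patches, so reshaping it to the patch grid gives one attention map per clinical descriptor. Maps of that kind are what one would inspect when checking whether attention concentrates on a region such as the tumor periphery.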