Hierarchical contrastive disentanglement architecture for multi-modal breast cancer detection.
Authors
Affiliations (2)
- Indian Institute of Information Technology, Kottayam, 686635, Kerala, India.
- Indian Institute of Information Technology, Kottayam, 686635, Kerala, India. Electronic address: [email protected].
Abstract
Breast cancer is the most commonly diagnosed cancer among women worldwide, and early detection improves survival outcomes. While medical imaging offers complementary diagnostic information through mammography, ultrasound, and thermal imaging, existing multi-modal computer-aided diagnosis systems suffer from several critical limitations: (a) rigid architectural assumptions that require simultaneous availability of all imaging modalities, (b) entangled feature representations that conflate disease-specific patterns with modality-specific artifacts, and (c) an inability to exploit unpaired multimodal datasets in which patient-level correspondence is unavailable. This work proposes a Hierarchical Contrastive Disentanglement Architecture (HCDA) to address the challenges of multi-modal fusion for breast cancer detection. The proposed work includes: (a) a flexible multi-modal design that adapts to variable availability of imaging modalities, (b) a Swin-transformer backbone with three encoders and orthogonality constraints for explicit hierarchical feature extraction and disentanglement, (c) within-modality contrastive learning to capture semantic representations from unpaired datasets, (d) conditional weighted voting for multimodal feature fusion, and (e) a two-phase sequential training strategy in which disentangled representations are learned in the first phase and fine-tuned for classification in the second. The approach is evaluated on a total of 14,500 images collected from eight publicly available datasets spanning three modalities: (a) mammography (DDSM, MIAS, INbreast), (b) ultrasound (BrEaST, BUSI, Thammasat, HMSS), and (c) thermal imaging (DMR-IR). The proposed HCDA achieves strong single-modality performance, with mammography attaining the highest accuracy (86.0%) and AUC (0.938), followed by ultrasound (84.9% accuracy, 0.914 AUC) and thermal imaging (80.9% accuracy, 0.954 AUC). In multimodal settings, the model achieves its best dual-modality performance with mammography + ultrasound (84.72% accuracy, 0.926 AUC), while triple-modality fusion yields 80.94% accuracy and a 0.937 AUC. Evaluation reveals that multi-modal fusion without patient-level correspondence underperforms the best single modality (84.7% vs. 86.0%), demonstrating that data structure fundamentally constrains fusion benefits. Nevertheless, the flexible architecture establishes HCDA as a foundation for future multi-modal medical AI research.
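To make two of the abstract's components concrete, the sketch below illustrates (i) an orthogonality constraint that discourages overlap between disease-specific and modality-specific embeddings and (ii) a within-modality contrastive (NT-Xent-style) loss over unpaired images. This is a minimal illustration under assumed names and shapes (orthogonality_loss, within_modality_contrastive_loss, a 128-dimensional feature size), not the authors' released implementation; the random tensors stand in for Swin encoder outputs.

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(disease_feat, modality_feat):
    # Discourage overlap between disease-specific and modality-specific
    # embeddings by driving their cross-correlation toward zero.
    d = F.normalize(disease_feat, dim=1)   # (batch, dim)
    m = F.normalize(modality_feat, dim=1)  # (batch, dim)
    cross = d.T @ m / d.size(0)            # (dim, dim) cross-correlation
    return (cross ** 2).sum()

def within_modality_contrastive_loss(z1, z2, temperature=0.1):
    # NT-Xent-style loss over two augmented views of images from a single
    # modality; matching row indices are the positive pairs, so no
    # patient-level correspondence across modalities is needed.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature        # (batch, batch) similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random features standing in for encoder outputs.
if __name__ == "__main__":
    batch, dim = 8, 128
    disease = torch.randn(batch, dim)
    modality = torch.randn(batch, dim)
    view_a, view_b = torch.randn(batch, dim), torch.randn(batch, dim)
    loss = orthogonality_loss(disease, modality) \
        + within_modality_contrastive_loss(view_a, view_b)
    print(loss.item())
```

In a two-phase schedule such as the one described above, losses of this kind would drive the first (representation-learning) phase, with the classification head and conditional weighted voting fine-tuned in the second phase.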