Multiple instance learning approach for automated gallbladder cancer detection using ultrasound imaging: multi-center validation of a deep learning model with the public dataset contribution.
Authors
Affiliations (10)
- Department of Radiodiagnosis, Postgraduate Institute of Medical Education and Research, Chandigarh, 160012, India.
- Department of Clinical Hematology and Medical Oncology, Postgraduate Institute of Medical Education and Research, Chandigarh, 160012, India.
- Department of GI Surgery, HBP and Liver Transplantation, Postgraduate Institute of Medical Education and Research, Chandigarh, 160012, India.
- Department of General Surgery, Postgraduate Institute of Medical Education and Research, Chandigarh, 160012, India.
- Division of Radiation Oncology, Postgraduate Institute of Medical Education and Research, Chandigarh, 160012, India.
- Department of Radiodiagnosis, Tata Memorial Hospital, Mumbai, 400012, India.
- Department of Radiodiagnosis, Institute of Medical Sciences, Banaras Hindu University, Varanasi, 221005, India.
- Department of Radiodiagnosis, Pandit Bhagwat Dayal Sharma, Post Graduate Institute of Medical Sciences, Rohtak, 124001, India.
- Department of Computer Sciences, Indian Institute of Technology, New Delhi, 110016, India.
- Department of Medical Gastroenterology, Postgraduate Institute of Medical Education and Research, Chandigarh, 160012, India.
Abstract
Gallbladder cancer (GBC) is difficult to diagnose because its imaging features overlap with those of benign conditions. We developed and validated a multiple instance learning (MIL) model for automated GBC detection using a large-scale multi-center ultrasound dataset and benchmarked it against state-of-the-art architectures. This was a retrospective and prospective multi-center cohort study. We trained a gated attention MIL (GAIA-MIL) model on the prospective AURORA-GB dataset (August 2022-July 2024) and two public datasets. The model was evaluated on a temporally independent internal test set (August 2024-December 2024) and three retrospective external cohorts. The area under the curve (AUC), sensitivity, and specificity of GAIA-MIL were compared with those of Clustering-constrained Attention MIL (CLAM), Dual-Stream MIL (DS-MIL), and Transformer-based MIL (TransMIL). The datasets comprised 11,012 images from 1151 patients. Cross-validation achieved a mean AUC of 0.874 (95% CI 0.846-0.902). On the internal test set (n = 97), GAIA-MIL achieved 87.7% sensitivity (78.9-95.1%), 86.2% specificity (72.4-96.9%), and an AUC of 0.883 (0.786-0.963). Pooled external validation (n = 122) showed an AUC of 0.778 (0.698-0.852). Performance varied by external center (AUCs: 0.722, 0.950, and 0.749). In comparative benchmarking, TransMIL performed well internally (AUC 0.871), but its performance degraded significantly in external validation (pooled AUC 0.654). GAIA-MIL was more stable: pooled across the external centers, it maintained a sensitivity of 78.2%, a specificity of 73.4%, and an AUC of 0.778, whereas the transformer-based model struggled. Interpretability analysis confirmed that the model focused on clinically relevant features such as wall thickening. While complex architectures like TransMIL perform well internally, GAIA-MIL offered the best balance of performance and generalizability among the models compared for multi-center deployment. The AURORA-GB benchmark dataset is publicly released to advance research.
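To illustrate what gated attention pooling means in a MIL setting, the sketch below implements the generic gated-attention formulation commonly used in attention-based MIL, in which each patient is a bag of image embeddings and a learned attention weight decides how much each image contributes to the patient-level representation. It is not the authors' GAIA-MIL implementation: the function name, the dimensions, and the random weights are hypothetical placeholders chosen only to make the example runnable.

```python
import math
import random


def gated_attention_pool(H, V, U, w):
    """Generic gated-attention MIL pooling (illustrative, not GAIA-MIL itself).

    H: list of n instance embeddings, each of length d (one per image).
    V, U: d x k weight matrices (lists of rows); w: length-k vector.
    Returns (bag_embedding of length d, attention weights of length n).
    """
    d, k = len(H[0]), len(w)
    scores = []
    for h in H:
        # tanh branch and sigmoid gate, combined element-wise
        t = [math.tanh(sum(h[i] * V[i][j] for i in range(d))) for j in range(k)]
        g = [1.0 / (1.0 + math.exp(-sum(h[i] * U[i][j] for i in range(d))))
             for j in range(k)]
        scores.append(sum(w[j] * t[j] * g[j] for j in range(k)))

    # Softmax over the instances in the bag (shifted for numerical stability)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]

    # Bag embedding = attention-weighted sum of the instance embeddings
    bag = [sum(a * h[i] for a, h in zip(attention, H)) for i in range(d)]
    return bag, attention


# Hypothetical bag: 8 ultrasound images, 16-dim embeddings, 4 attention units
random.seed(0)
n_images, d, k = 8, 16, 4
H = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_images)]
V = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
U = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
w = [random.gauss(0, 1) for _ in range(k)]

bag_embedding, attention = gated_attention_pool(H, V, U, w)
```

In a trained model, the bag embedding would feed a classifier that outputs a patient-level prediction, and the per-image attention weights are one common way to inspect which images drove that prediction.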