ThyroFusion: A Multi-modal Deep Learning Framework Integrating Vision and Language for Thyroid Nodule Malignancy Risk Assessment.
Authors
Affiliations (2)
- School of Electronics and Information, Xi'an Polytechnic University, Xi'an, 710048, China. [email protected].
- School of Electronics and Information, Xi'an Polytechnic University, Xi'an, 710048, China.
Abstract
Accurate differentiation between benign and malignant thyroid nodules remains challenging in clinical practice. Current deep learning approaches predominantly rely on single-modality analysis, failing to leverage complementary information from multiple clinical data sources. This study aims to develop and validate ThyroFusion, a multi-modal deep learning framework that integrates ultrasound images, segmentation masks, and clinical text reports for improved thyroid nodule malignancy risk assessment. In this retrospective multi-center study, we developed ThyroFusion, a multi-modal fusion framework comprising: (1) a dual-stream ResNet-50 encoder with partially shared parameters for extracting features from ultrasound images and segmentation masks; (2) a Set Transformer module for aggregating variable numbers of image features; and (3) a bidirectional cross-modal attention mechanism for fusing visual features with textual features extracted by a frozen BioBERT. The framework was trained on 1472 cases from Xi'an International Medical Center Hospital and validated on four independent external test sets totaling 4530 cases from two clinical centers and two public datasets (DDTI and TN3K). Performance was compared against state-of-the-art deep learning models and radiologists with varying experience levels. ThyroFusion achieved an AUC of 0.937 (95% CI 0.914-0.960) on internal validation and 0.896 (95% CI 0.887-0.905) on combined external validation. Compared with single-modality approaches, ThyroFusion significantly outperformed ResNet-50 (AUC 0.841), DenseNet-121 (AUC 0.848), EfficientNet-B4 (AUC 0.859), and Vision Transformer (AUC 0.835) on external validation (all p < 0.001). The model also outperformed senior radiologists (AUC 0.809) and substantially improved junior radiologists' performance when used as an assistive tool (ΔAUC = 0.126). On the public datasets, ThyroFusion achieved AUCs of 0.893 on DDTI and 0.881 on TN3K, demonstrating robust cross-domain generalization.
ThyroFusion demonstrates robust performance in thyroid nodule malignancy risk assessment across multiple centers and public benchmarks, significantly outperforming state-of-the-art single-modality methods and experienced radiologists. The integration of visual and textual information through bidirectional cross-modal attention offers a promising tool for clinical decision support.
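To make the fusion step concrete, the bidirectional cross-modal attention described in the abstract can be sketched as two cross-attention passes: visual tokens attend over text tokens and text tokens attend over visual tokens, after which the attended features are pooled and concatenated. This is an illustrative numpy sketch under stated assumptions, not the authors' implementation; the token counts, feature dimension, and mean-pooling fusion are placeholders chosen for clarity (the paper's actual model uses ResNet-50/Set Transformer visual features and frozen BioBERT text embeddings).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Scaled dot-product attention: each query row attends over
    # the rows of keys_values (single head, no learned projections).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_q, n_kv)
    return softmax(scores, axis=-1) @ keys_values   # (n_q, d)

rng = np.random.default_rng(0)
visual = rng.normal(size=(5, 64))    # hypothetical: 5 aggregated image tokens
text = rng.normal(size=(12, 64))     # hypothetical: 12 text-report token embeddings

vis_attended = cross_attention(visual, text)   # vision attends to text
txt_attended = cross_attention(text, visual)   # text attends to vision

# Pool each attended stream and concatenate into one fused vector.
fused = np.concatenate([vis_attended.mean(axis=0), txt_attended.mean(axis=0)])
print(fused.shape)  # (128,)
```

A downstream classification head (e.g. a small MLP producing a malignancy probability) would consume `fused`; that head, like the shapes above, is an assumption for illustration.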