Deep Learning-Based Multimodal Fusion of Ultrasound, Cytology, and Clinical Features to Distinguish Follicular Thyroid Carcinoma from Adenoma: A Multicenter Study.
Affiliations (6)
- Department of Ultrasound, The Second Affiliated Hospital of Soochow University, Suzhou, Jiangsu, China (X.-F.G., L.Z., S.-Y.Z.).
- Department of Ultrasound, Zhangjiagang Hospital of Traditional Chinese Medicine, Nanjing, Jiangsu, China (X.-Y.B.).
- Department of Ultrasound, The Third Affiliated Hospital of Soochow University, Changzhou First People's Hospital, Changzhou, Jiangsu, China (S.-Q.L.).
- Department of Thyroid Surgery, The Third Affiliated Hospital of Soochow University, Changzhou First People's Hospital, Changzhou, Jiangsu, China (J.-W.F., Y.J.).
- Department of Gastrointestinal Surgery, Southeast University Affiliated Xuzhou Central Hospital, Xuzhou, Jiangsu, China (Y.-L.Z.).
- Department of Ultrasound, The Second Affiliated Hospital of Soochow University, Suzhou, Jiangsu, China (X.-F.G., L.Z., S.-Y.Z.). Electronic address: [email protected].
Abstract
Preoperative differentiation between follicular thyroid carcinoma (FTC) and follicular thyroid adenoma (FTA) remains challenging because of overlapping cytological and ultrasonographic features. This study aimed to develop and validate a multimodal deep learning model integrating ultrasound images, fine-needle aspiration cytology (FNAC) images, and clinical features for preoperative differentiation of FTC from FTA in patients with cytologically indeterminate follicular thyroid neoplasms.

This retrospective multicenter study included 714 patients with pathologically confirmed follicular thyroid neoplasms from three medical centers. Patients were divided into a training set (n = 304), an internal validation set (n = 130), and two external validation sets (n = 201 and n = 79). The multimodal model employed a Swin Transformer for ultrasound feature extraction (intratumoral and peritumoral regions), attention-based multiple instance learning for cytological image processing, and a self-attention multilayer perceptron for clinical feature encoding. Cross-modal feature fusion was achieved through a Transformer module. Model performance was evaluated using the area under the receiver operating characteristic curve (AUC), calibration curves, and decision curve analysis (DCA).

The multimodal fusion model achieved AUCs of 0.947, 0.933, 0.936, and 0.928 in the training set, internal validation set, external validation set 1, and external validation set 2, respectively. Compared with the unimodal models, the multimodal model showed significant improvements (all P < 0.001): AUC increased by 0.080-0.098 versus the ultrasound model, 0.165-0.183 versus the cytological model, and 0.236-0.254 versus the clinical model. Adding the peritumoral region improved the ultrasound model's AUC by 0.042-0.043 (P < 0.05). Modality contribution analysis revealed that the peritumoral ultrasound attention weight was significantly higher in FTC than in FTA cases (29.3-30.5% vs. 23.9-24.6%, P < 0.001).
DCA demonstrated a superior net benefit for the multimodal model across threshold probabilities of 0.1-0.8. The multimodal deep learning model integrating ultrasound, cytological, and clinical features demonstrated favorable diagnostic performance for preoperative differentiation between FTC and FTA. The inclusion of the peritumoral region provided significant incremental diagnostic value. This model may serve as an effective auxiliary tool for individualized diagnosis and treatment decision-making in patients with cytologically indeterminate follicular thyroid neoplasms.
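To illustrate the general idea behind the Transformer-based cross-modal fusion and the modality-contribution analysis described above, the following is a minimal NumPy sketch. It is not the authors' implementation: the embedding dimension, the four modality tokens (intratumoral ultrasound, peritumoral ultrasound, MIL-pooled cytology, clinical), and the single-head attention layer are all illustrative assumptions standing in for the paper's trained encoders and fusion module.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # Scaled dot-product self-attention over the modality tokens:
    # each modality attends to every other, mixing cross-modal information.
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))  # (4, 4) attention matrix
    return attn @ V, attn

d = 16  # illustrative shared embedding dimension
# Hypothetical per-modality embeddings (stand-ins for the paper's encoders):
# row 0: intratumoral US, row 1: peritumoral US,
# row 2: MIL-pooled cytology, row 3: clinical features.
tokens = rng.standard_normal((4, d))

# Randomly initialized projections; in practice these are learned.
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
fused, attn = self_attention(tokens, Wq, Wk, Wv)

# A rough analogue of the modality-weight analysis: the average attention
# mass each modality token receives from all tokens (columns of attn).
contrib = attn.mean(axis=0)
print(contrib.round(3))
```

In this sketch the contributions sum to 1 by construction, so comparing them across cases is analogous to the paper's observation that peritumoral ultrasound carried more weight in FTC than in FTA.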