Application of transformer attention mechanism-based multimodal deep learning model in the diagnosis of papillary thyroid carcinoma.

June 27, 2026

DOI: 10.1186/s12880-026-02530-w PMID: 42374243

Authors

Xu M, Zhu X, Lu Z, Xu W, Shen C, Yang J

Affiliations (3)

Department of Ultrasound, Suzhou Ninth People's Hospital, Suzhou, Jiangsu Province, China.
Department of Ultrasound, The Affiliated Jiangsu Shengze Hospital of Nanjing Medical University, Suzhou, Jiangsu Province, China.
Oncology Center, The Affiliated Jiangsu Shengze Hospital of Nanjing Medical University, Suzhou, Jiangsu Province, China.

Abstract

A Transformer-based multimodal deep learning model was developed to improve ultrasound-based differentiation of papillary thyroid carcinoma (PTC) from benign thyroid nodules. This study included 491 patients with thyroid nodules from two centers: 406 from Suzhou Ninth People's Hospital, divided into training (n = 284) and validation (n = 122) sets, and 85 from Jiangsu Shengze Hospital, used as an external test set. The dataset comprised 232 benign nodules and 259 PTC cases. Thirty-four deep learning architectures and four traditional models were compared, leading to the proposed Efficientnetv2scbamTrans fusion model. Five ablation experiments assessed module contributions. SHAP and Grad-CAM were used for interpretability, with energy concentration, peak center distance, and spatial IoU as performance indicators. DenseNet201 performed best on the internal validation set but overfit, with an AUC of 0.702 on the external test set. The unimodal imaging baseline had an AUC of 0.782, which improved to 0.811 when clinical features were added, reducing overfitting risk. Directly fusing radiomics features yielded no improvement, with an AUC of 0.751. The proposed Efficientnetv2scbamTrans model reached an AUC of 0.985 on the internal validation set. More importantly, on the independent external test set (Jiangsu Shengze Hospital), it achieved an AUC of 0.986 (95% CI: 0.967-1.000), with 96.4% sensitivity and 96.7% specificity, and significantly outperformed the traditional DLR model by an AUC margin of 0.199 (p < 0.001). SHAP analysis indicated that clinical features altered radiomics decision weights, and Grad-CAM showed an increase in spatial IoU from 0.097 to 0.349, indicating improved visual localization. In summary, the study developed a multimodal deep learning model for thyroid ultrasound in which attention and feature-interaction mechanisms, implemented via CBAM and Transformer modules, improved diagnostic performance.
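
The abstract names three Grad-CAM localization indicators (spatial IoU, energy concentration, peak center distance) without defining them. The sketch below shows one plausible way such indicators could be computed from a heatmap and a lesion mask. The function name `cam_localization_metrics`, the 0.5 threshold, and all three formulas are assumptions chosen for illustration, not the authors' definitions.

```python
import numpy as np


def cam_localization_metrics(cam, mask, threshold=0.5):
    """Hypothetical localization metrics for a Grad-CAM heatmap.

    cam:  2D array of raw activation values (any range).
    mask: 2D boolean array marking the annotated nodule region.
    All definitions here are illustrative assumptions.
    """
    cam = np.asarray(cam, dtype=float)
    mask = np.asarray(mask, dtype=bool)

    # Min-max normalize the heatmap to [0, 1].
    span = cam.max() - cam.min()
    cam = (cam - cam.min()) / span if span > 0 else np.zeros_like(cam)

    # Spatial IoU: overlap between the thresholded heatmap and the mask.
    hot = cam >= threshold
    union = np.logical_or(hot, mask).sum()
    iou = np.logical_and(hot, mask).sum() / union if union else 0.0

    # Energy concentration: share of total heatmap mass inside the mask.
    total = cam.sum()
    energy = cam[mask].sum() / total if total > 0 else 0.0

    # Peak center distance: pixels between heatmap peak and mask centroid.
    peak = np.array(np.unravel_index(np.argmax(cam), cam.shape), dtype=float)
    if mask.any():
        centroid = np.argwhere(mask).mean(axis=0)
        distance = float(np.linalg.norm(peak - centroid))
    else:
        distance = float("nan")

    return {
        "spatial_iou": float(iou),
        "energy_concentration": float(energy),
        "peak_center_distance": distance,
    }
```

Under these assumed definitions, a heatmap that concentrates on the nodule gives a higher IoU and energy concentration and a smaller peak center distance, which is the direction of change the abstract reports for spatial IoU (0.097 to 0.349).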

Topics

Journal Article
