Application of transformer attention mechanism-based multimodal deep learning model in the diagnosis of papillary thyroid carcinoma.

June 27, 2026

Authors

Xu M, Zhu X, Lu Z, Xu W, Shen C, Yang J

Affiliations (3)

  • Department of Ultrasound, Suzhou Ninth People's Hospital, Suzhou, Jiangsu Province, China.
  • Department of Ultrasound, The Affiliated Jiangsu Shengze Hospital of Nanjing Medical University, Suzhou, Jiangsu Province, China.
  • Oncology Center, The Affiliated Jiangsu Shengze Hospital of Nanjing Medical University, Suzhou, Jiangsu Province, China.

Abstract

A Transformer-based multimodal deep learning model was developed to improve ultrasound-based differentiation of papillary thyroid carcinoma (PTC) from benign thyroid nodules. The study included 491 patients with thyroid nodules from two centers: 406 from Suzhou Ninth People's Hospital, divided into training (n = 284) and validation (n = 122) sets, and 85 from Jiangsu Shengze Hospital, used as an external test set. The dataset comprised 232 benign nodules and 259 PTC cases. After comparing 34 deep learning architectures and four traditional models, the authors proposed the Efficientnetv2scbamTrans fusion model. Five ablation experiments assessed the contribution of each module. SHAP and Grad-CAM were used for interpretability, with energy concentration, peak center distance, and spatial IoU as quantitative indicators. DenseNet201 performed best on the internal validation set but overfit, with an AUC of 0.702 on the external test set. The unimodal imaging baseline had an AUC of 0.782, which rose to 0.811 when clinical features were added, reducing the risk of overfitting. Directly fusing radiomics features yielded no improvement, with an AUC of 0.751. The proposed Efficientnetv2scbamTrans model reached an AUC of 0.985 on the internal validation set. More importantly, on the independent external test set from Jiangsu Shengze Hospital, it achieved an AUC of 0.986 (95% CI: 0.967-1.000), with 96.4% sensitivity and 96.7% specificity, and significantly outperformed the traditional DLR model by an AUC margin of 0.199 (p < 0.001). SHAP analysis indicated that clinical features altered the decision weights of the radiomics features, and Grad-CAM showed spatial IoU increasing from 0.097 to 0.349, indicating improved visual localization. In summary, the study developed a multimodal deep learning model for thyroid ultrasound that improves diagnostic performance through feature interaction and attention mechanisms implemented with CBAM and Transformer modules.
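The abstract names three Grad-CAM indicators (energy concentration, peak center distance, and spatial IoU) without defining them. The sketch below shows one plausible reading of each, in plain Python on a toy 4x4 heatmap. Everything in it is an assumption made for illustration, not the authors' method: the function names, the 0.5 binarization threshold, the use of a binary nodule mask as the reference region, and the definitions themselves.

```python
import math

# Toy Grad-CAM heatmap (values in [0, 1]) and a binary nodule mask.
# Both are made-up illustration data, not taken from the paper.
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.6, 0.0],
    [0.0, 0.7, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]


def spatial_iou(heatmap, mask, threshold=0.5):
    """IoU between the thresholded heatmap and the mask (assumed definition)."""
    inter = union = 0
    for heat_row, mask_row in zip(heatmap, mask):
        for value, m in zip(heat_row, mask_row):
            hot = 1 if value >= threshold else 0
            inter += hot & m
            union += hot | m
    return inter / union if union else 0.0


def energy_concentration(heatmap, mask):
    """Share of total heatmap activation that falls inside the mask (assumed)."""
    total = sum(sum(row) for row in heatmap)
    inside = sum(
        value
        for heat_row, mask_row in zip(heatmap, mask)
        for value, m in zip(heat_row, mask_row)
        if m
    )
    return inside / total if total else 0.0


def peak_center_distance(heatmap, mask):
    """Pixel distance from the heatmap peak to the mask centroid (assumed)."""
    cells = [(r, c) for r, row in enumerate(heatmap) for c in range(len(row))]
    peak_r, peak_c = max(cells, key=lambda rc: heatmap[rc[0]][rc[1]])
    inside = [(r, c) for r, row in enumerate(mask) for c, m in enumerate(row) if m]
    if not inside:
        return float("nan")
    center_r = sum(r for r, _ in inside) / len(inside)
    center_c = sum(c for _, c in inside) / len(inside)
    return math.hypot(peak_r - center_r, peak_c - center_c)


iou = spatial_iou(heatmap, mask)
energy = energy_concentration(heatmap, mask)
distance = peak_center_distance(heatmap, mask)
print(f"IoU={iou:.3f} energy={energy:.3f} peak distance={distance:.3f} px")
```

Under these assumed definitions, higher IoU and energy concentration and a smaller peak distance would all mean the model attends more closely to the nodule, which is the direction of the reported IoU change from 0.097 to 0.349.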

Topics

Journal Article
