Performance Evaluation of Deep Learning for the Detection and Segmentation of Thyroid Nodules: Systematic Review and Meta-Analysis.
Authors
Affiliations (5)
Affiliations (5)
- Departement of Otolaryngology-Head and Neck Surgery, Affiliated Hospital of Hangzhou Normal University, No. 126, Wenzhou Road, Hangzhou, 310015, China, 86 15005812373.
- Departement of Ultrasound, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, China.
- Departement of Ultrasound, Yangming Hospital Affiliated to Ningbo University, Yuyao, China.
- Departement of Otolaryngology-Head and Neck Surgery, Hangzhou Normal University, Hangzhou, China.
- Department of Otorhinolaryngology, Deqing Hospital of Hangzhou Normal University (The Third People's Hospital of Deqing), Huzhou, China.
Abstract
Thyroid cancer is one of the most common endocrine malignancies. Its incidence has steadily increased in recent years. Distinguishing between benign and malignant thyroid nodules (TNs) is challenging due to their overlapping imaging features. The rapid advancement of artificial intelligence (AI) in medical image analysis, particularly deep learning (DL) algorithms, has provided novel solutions for automated TN detection. However, existing studies exhibit substantial heterogeneity in diagnostic performance. Furthermore, no systematic evidence-based research comprehensively assesses the diagnostic performance of DL models in this field. This study aimed to execute a systematic review and meta-analysis to appraise the performance of DL algorithms in diagnosing TN malignancy, identify key factors influencing their diagnostic efficacy, and compare their accuracy with that of clinicians in image-based diagnosis. We systematically searched multiple databases, including PubMed, Cochrane, Embase, Web of Science, and IEEE, and identified 41 eligible studies for systematic review and meta-analysis. Based on the task type, studies were categorized into segmentation (n=14) and detection (n=27) tasks. The pooled sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC) were calculated for each group. Subgroup analyses were performed to examine the impact of transfer learning and compare model performance against clinicians. For segmentation tasks, the pooled sensitivity, specificity, and AUC were 82% (95% CI 79%-84%), 95% (95% CI 92%-96%), and 0.91 (95% CI 0.89-0.94), respectively. For detection tasks, the pooled sensitivity, specificity, and AUC were 91% (95% CI 89%-93%), 89% (95% CI 86%-91%), and 0.96 (95% CI 0.93-0.97), respectively. Some studies demonstrated that DL models could achieve diagnostic performance comparable with, or even exceeding, that of clinicians in certain scenarios. The application of transfer learning contributed to improved model performance. DL algorithms exhibit promising diagnostic accuracy in TN imaging, highlighting their potential as auxiliary diagnostic tools. However, current studies are limited by suboptimal methodological design, inconsistent image quality across datasets, and insufficient external validation, which may introduce bias. Future research should enhance methodological standardization, improve model interpretability, and promote transparent reporting to facilitate the sustainable clinical translation of DL-based solutions.