Preliminary analysis of AI-based thyroid nodule evaluation in a non-subspecialist endocrinology setting.
Authors
Affiliations (4)
Affiliations (4)
- Department of Endocrinology and Nutrition, Hospital Clínico Universitario Valladolid, Valladolid, Spain. [email protected].
- Centro de Investigación de Endocrinología y Nutrición Clínica (CIENC), Facultad de Medicina, Universidad de Valladolid, Valladolid, Spain. [email protected].
- Department of Endocrinology and Nutrition, Hospital Clínico Universitario Valladolid, Valladolid, Spain.
- Centro de Investigación de Endocrinología y Nutrición Clínica (CIENC), Facultad de Medicina, Universidad de Valladolid, Valladolid, Spain.
Abstract
Thyroid nodules are commonly evaluated using ultrasound-based risk stratification systems, which rely on subjective descriptors. Artificial intelligence (AI) may improve assessment, but its effectiveness in non-subspecialist settings is unclear. This study evaluated the impact of an AI-based decision support system (AI-DSS) on thyroid nodule ultrasound assessments by general endocrinologists (GE) without subspecialty thyroid imaging training. A prospective cohort study was conducted on 80 patients undergoing thyroid ultrasound in GE outpatient clinics. Thyroid ultrasound was performed based on clinical judgment as part of routine care by GE. Images were retrospectively analyzed using an AI-DSS (Koios DS), independently of clinician assessments. AI-DSS results were compared with initial GE evaluations and, when referred, with expert evaluations at a subspecialized thyroid nodule clinic (TNC). Agreement in ultrasound features, risk classification by the American College of Radiology Thyroid Imaging Reporting and Data System (ACR TI-RADS) and American Thyroid Association guidelines, and referral recommendations was assessed. AI-DSS differed notably from GE, particularly assessing nodule composition (solid: 80%vs.36%,p < 0.01), echogenicity (hypoechoic:52%vs.16%,p < 0.01), and echogenic foci (microcalcifications:10.7%vs.1.3%,p < 0.05). AI-DSS classification led to a higher referral rate compared to GE (37.3%vs.30.7%, not statistically significant). Agreement between AI-DSS and GE in ACR TI-RADS scoring was moderate (r = 0.337;p < 0.001), but improved when comparing GE to AI-DSS and TNC subspecialist (r = 0.465;p < 0.05 and r = 0.607;p < 0.05, respectively). In a non-subspecialist setting, non-adjunct AI-DSS use did not significantly improve risk stratification or reduce hypothetical referrals. The system tended to overestimate risk, potentially leading to unnecessary procedures. Further optimization is required for AI to function effectively in low-prevalence environment.