Multimodal deep learning for papillary thyroid carcinoma diagnosis using ultrasound and cytology.
Authors
Affiliations (3)
- Department of Computer Science, School of Engineering, Central Asian University, Tashkent, Uzbekistan.
- Faculty of Data Science and Information Technology, INTI International University, Negeri Sembilan, Malaysia.
- Faculty of Data Science and Information Technology, INTI International University, Negeri Sembilan, Malaysia.
Abstract
Papillary thyroid carcinoma (PTC) is the most common thyroid malignancy, and pre-operative diagnosis depends on integrating ultrasound (US) imaging with fine-needle aspiration cytology (FNAC). Pang et al. (2025) recently released a paired US/cytology dataset and demonstrated that classical radiomics combined with classifiers such as support vector machines, random forests, and XGBoost can reach AUROC ≈ 0.99 on a single random split. We re-examine this result under a stricter evaluation protocol and contribute a calibrated multimodal deep learning model for PTC diagnosis. Using 384 patients from the Pang cohort (220 PTC, 164 benign), we created a stratified, untouched 20% holdout (n = 77). On the development set (n = 307), we performed 5-fold cross-validation and trained a multimodal model (v2) combining a ConvNeXt-Tiny ultrasound encoder, a domain-pretrained CTransPath cytology encoder, gated-attention multiple-instance learning (MIL) over cytology patches, and bidirectional cross-attention fusion. We compared this against a self-attention multimodal baseline (v1), unimodal ablations, and a modified Pang-style classical comparator that, because lesion masks were not released with the public dataset, used whole-image radiomics-style features rather than ROI-based features. Evaluation included bootstrap and Wilson confidence intervals, paired DeLong tests, McNemar tests, and a four-method calibration analysis (none, temperature, Platt, isotonic) using ensemble out-of-fold predictions and equal-mass binning. Multimodal v2 achieved holdout AUROC 0.977 (95% CI 0.949-0.996), Brier 0.042 (95% CI 0.014-0.079), sensitivity 0.977 (Wilson 0.882-0.996), and specificity 0.939 (Wilson 0.804-0.983) at the cross-validation Youden threshold. Across three random seeds (42, 7, 123), AUROC was 0.977 ± 0.0004 (mean ± SD). Paired DeLong tests showed that our model significantly outperformed the reimplemented Pang-style Random Forest (p = 0.017) and XGBoost (p = 0.022) on identical holdout patients.
Paired DeLong vs. the v1 baseline showed identical patient ranking (p = 1.0), but v2 showed numerically better operating-point performance and probability quality: Brier was roughly halved (0.042 vs. 0.083), MCC at threshold 0.5 increased by 0.105 (0.894 vs. 0.789), and uncalibrated expected calibration error (ECE) fell from 0.114 to 0.042. The difference in paired binary decisions between v1 and v2 was, however, not statistically significant (McNemar p = 0.221). Calibration analysis revealed that temperature scaling, the de facto standard, was inappropriate for cross-validated ensembles, while isotonic regression on out-of-fold predictions reduced ECE by 53% on the v1 baseline. A multimodal model with domain-pretrained encoders and proper MIL aggregation achieves discriminative performance comparable to a modified Pang-style classical comparator and shows improved probability quality (lower Brier and calibration error), although these gains over the simpler v1 baseline did not reach statistical significance on the present holdout. We argue that AUROC alone is insufficient for evaluating clinical AI: discrimination and calibration are distinct properties, and operating-point performance is more directly relevant to deployment. We acknowledge that the present ultrasound branch operates on whole-image crops without explicit lesion localization and therefore should not yet be regarded as a lesion-specific clinical decision tool. Code and analysis scripts are released on GitHub.
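As an illustration of the interval and calibration metrics named above, the following is a minimal sketch in plain Python, not the authors' released code. The function names are hypothetical. The counts 43/44 and 31/33 are assumptions inferred from the reported sensitivity, specificity, and holdout size (n = 77); they are not stated in the abstract. The ECE shown here compares mean predicted positive-class probability with observed positive frequency in equal-size bins, which is one common reading of "equal-mass binning" and may differ from the authors' definition.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (default: 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def ece_equal_mass(probs, labels, n_bins=10):
    """Expected calibration error with equal-mass (equal-count) bins.

    Predictions are sorted and split into bins holding roughly the same
    number of samples; each bin contributes |observed rate - mean prob|,
    weighted by its share of the samples.
    """
    pairs = sorted(zip(probs, labels))
    n = len(pairs)
    n_bins = min(n_bins, n)
    ece = 0.0
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        mean_prob = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        ece += len(chunk) / n * abs(observed - mean_prob)
    return ece


# Assumed counts (inferred, not reported): 43 of 44 PTC cases detected,
# 31 of 33 benign cases correctly ruled out.
sens_lo, sens_hi = wilson_interval(43, 44)
spec_lo, spec_hi = wilson_interval(31, 33)
print(f"sensitivity Wilson 95% CI: {sens_lo:.3f}-{sens_hi:.3f}")
print(f"specificity Wilson 95% CI: {spec_lo:.3f}-{spec_hi:.3f}")
```

Under these assumed counts the Wilson intervals agree with the reported ranges to three decimals, which shows how wide the uncertainty remains with a holdout of this size. `brier_score` and `ece_equal_mass` take a list of predicted PTC probabilities and a matching list of 0/1 labels.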