Evaluating Consistency and Accuracy of GPT-4 Omni to Analyze Thyroid Ultrasound Features and ACR TR Categories to Aid Report Generation.

March 12, 2026

DOI: 10.2174/0115734056437835260214191655 PMID: 41832723

Authors

Yang Z, Huang T, Huang L, Yao J, Jiang L, Wu W, Xie X, Xu M, Zhang X

Affiliations (1)

Department of Medical Ultrasonics, Institution of Diagnostic and Interventional Ultrasound, the First Affiliated Hospital of Sun Yat-sen University, No.58 Zhongshan Er Road, Guangzhou 510080, Guangzhou, People's Republic of China.

Abstract

Multimodal large language models, including GPT-4 Omni (GPT-4o), have been applied to facilitate healthcare processes, but their capacity to interpret thyroid sonography images to aid report generation, and ways to improve it, remain unclear. A total of 120 thyroid nodules were retrospectively included to evaluate GPT-4o's analysis of ultrasound features and ACR TR categories (2017 version). In the zero-shot setting, 80 original images of unmarked nodules (zero-shot unmarked group) and images with nodule boundaries manually outlined in red circles by senior radiologists (zero-shot marked group) were each input into GPT-4o 3 times with identical prompts and no examples. In the few-shot setting, another 40 images with manually marked nodule boundaries (few-shot marked group) were input after 3 examples. The marking gold standard was established by 2 senior radiologists with over 10 years of experience in thyroid sonography. The consistency of GPT-4o was evaluated using Gwet's agreement coefficient (AC1). The mean accuracy of GPT-4o was compared across settings, and against the mean accuracy of 2 junior radiologists with 1 and 3 years of experience in thyroid sonography, using the Mann-Whitney test with Bonferroni correction. The AC1 values were 0.466 [0.367, 0.564], 0.778 [0.696, 0.860], and 0.823 [0.711, 0.934] for the zero-shot unmarked, zero-shot marked, and few-shot marked groups, respectively. The mean accuracy of the 3 groups in judging TR categories was 18.75% [13.78%, 23.72%], 42.50% [36.20%, 48.80%], and 79.17% [71.80%, 86.54%], respectively. The zero-shot marked group outperformed the zero-shot unmarked group, and the few-shot setting performed even better (p<0.001). In particular, segmentation helped GPT-4o detect the composition, shape, and margin of nodules, and the few-shot setting helped it detect echogenicity, margin, and calcification (p<0.001). Compared with junior radiologists, the few-shot marked group achieved similar accuracy in identifying composition, echogenicity, calcification, and TR categories (p>0.05) and performed even better in identifying the margin of thyroid nodules (p=0.004). GPT-4o's performance in analyzing original images of thyroid nodules was insufficient, possibly owing to incorrect nodule recognition and a lack of standardized reference. After segmentation methods and a few-shot setting were adopted, its performance improved significantly. The consistency and accuracy of GPT-4o in analyzing thyroid sonography images improved stepwise with segmentation and a few-shot setting, ultimately reaching the level of junior radiologists in this preliminary study. This could benefit report generation, although multicenter validation is needed.
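
The abstract reports consistency across 3 repeated attempts as Gwet's AC1 but does not say how it was computed. As a rough illustration only, the sketch below applies the standard multi-rater form of AC1 and treats each repeated attempt as one "rater". The function name `gwet_ac1`, the label strings, and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter


def gwet_ac1(ratings, categories):
    """Gwet's AC1 for several ratings per item (multi-rater form).

    ratings: list of per-item label lists, e.g. the labels returned by
             repeated attempts on the same image (hypothetical layout).
    categories: all labels that could have been assigned.
    """
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two possible categories")
    n = len(ratings)
    if n == 0:
        raise ValueError("no items to rate")

    agreement_sum = 0.0
    prevalence_sum = {c: 0.0 for c in categories}
    for item in ratings:
        r = len(item)
        if r < 2:
            raise ValueError("each item needs at least two ratings")
        counts = Counter(item)
        if any(label not in prevalence_sum for label in counts):
            raise ValueError("rating outside the declared categories")
        # Share of rating pairs on this item that agree with each other.
        agreement_sum += sum(c * (c - 1) for c in counts.values()) / (r * (r - 1))
        # Accumulate the share of this item's ratings given to each category.
        for label, c in counts.items():
            prevalence_sum[label] += c / r

    p_a = agreement_sum / n  # observed agreement
    # Chance agreement: sum over k of pi_k * (1 - pi_k), divided by (q - 1).
    p_e = sum((s / n) * (1 - s / n) for s in prevalence_sum.values()) / (q - 1)
    return (p_a - p_e) / (1 - p_e)


# Toy data (made up): two images, three attempts each, two possible labels.
toy = [["TR4", "TR4", "TR3"], ["TR3", "TR3", "TR3"]]
ac1_toy = gwet_ac1(toy, ["TR3", "TR4"])
```

Unlike Cohen's kappa, AC1 bases chance agreement on how evenly the categories are used overall, so it tends to remain stable when one category dominates. The bracketed intervals in the abstract would require a variance estimate, which this sketch omits.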

Topics

Journal Article
