A comparative study of large language models (ChatGPT/DeepSeek) in generating structured thyroid ultrasound reports.

June 6, 2026


DOI: 10.1038/s41598-026-55656-w PMID: 42248992

Authors

Zhang L, Li X, Qi R, Wang S, Shi X, Song Y, Li L, Zhang N, Ge H

Affiliations (3)

Department of Ultrasound, Beijing Chao-yang Hospital, Capital Medical University, 8 Gongti South Road, Beijing, 100020, China.
Department of Radiation Oncology, The First Medical Center of PLA General Hospital, Beijing, 100853, China.
Department of Ultrasound, Beijing Chao-yang Hospital, Capital Medical University, 8 Gongti South Road, Beijing, 100020, China. [email protected].

Abstract

The importance of structured radiology reports is well recognized for enabling efficient data extraction and facilitating multidisciplinary collaboration. This study aimed to evaluate the agreement rate and reproducibility of the large language models ChatGPT-4o and DeepSeek-R1 in generating structured thyroid ultrasound reports. This study retrospectively included 174 thyroid ultrasound reports from 174 patients, encompassing a total of 230 nodules. ChatGPT-4o and DeepSeek-R1 were used to convert the reports into structured formats according to the C-TIRADS guidelines. Two ultrasonographers assessed the agreement rate of the generated nodule classifications and the appropriateness of the management recommendations. Each report was submitted twice to evaluate the consistency of nodule classification and management suggestions. Among 174 patients (mean age 44 ± 11 years; 32 males), there was no significant difference in nodule classification agreement rate between ChatGPT-4o and DeepSeek-R1 (80.4% vs. 77.2%; OR = 1.636; 95% CI: 0.976-2.741; P = 0.205). ChatGPT-4o outperformed DeepSeek-R1 in providing more comprehensive or correct management recommendations (OR = 7.362, 95% CI: 4.255-12.735, P < 0.001). Furthermore, both ChatGPT-4o and DeepSeek-R1 demonstrated moderate consistency in nodule classification (AC1 = 0.767 vs. 0.713). Specifically, for category 3 nodules, both models showed high consistency (AC1 = 0.983 vs. 0.929). Compared to DeepSeek-R1, ChatGPT-4o exhibited higher consistency in providing management recommendations (AC1 = 0.809 vs. AC1 = 0.632). This study suggests that both ChatGPT and DeepSeek showed potential for converting free-text thyroid ultrasound reports into a structured format. While ChatGPT-4o and DeepSeek-R1 performed similarly in nodule classification agreement rate, ChatGPT-4o demonstrated a clear advantage in the agreement rate of management recommendations.
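The consistency figures above are reported as Gwet's AC1 between two submissions of the same report. As a rough illustration of how such a coefficient can be computed for two runs, here is a minimal sketch. It is not the authors' analysis code: the function name `gwet_ac1`, the optional `categories` argument, and the toy labels are all assumptions made for illustration, and the labels are not study data.

```python
from collections import Counter


def gwet_ac1(run_a, run_b, categories=None):
    """Gwet's AC1 agreement coefficient for two label sequences.

    run_a, run_b: category labels for the same items from two runs.
    categories:   optional full list of possible categories (hypothetical
                  convenience argument); defaults to the labels observed.
    """
    if len(run_a) != len(run_b) or len(run_a) == 0:
        raise ValueError("runs must be non-empty and of equal length")

    n = len(run_a)
    if categories is None:
        categories = sorted(set(run_a) | set(run_b))
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two categories")

    # Observed agreement: share of items labelled identically in both runs.
    p_a = sum(a == b for a, b in zip(run_a, run_b)) / n

    # Chance agreement: based on each category's average prevalence
    # across the two runs, pi_k = (count in run A + count in run B) / 2n.
    counts = Counter(run_a) + Counter(run_b)
    p_e = sum(
        (counts[c] / (2 * n)) * (1 - counts[c] / (2 * n)) for c in categories
    ) / (q - 1)

    return (p_a - p_e) / (1 - p_e)


# Toy labels only (not taken from the study): four nodules, two submissions.
first_run = ["3", "3", "4A", "4B"]
second_run = ["3", "3", "4A", "4A"]
ac1 = gwet_ac1(first_run, second_run)
```

In this toy case three of the four labels match, so observed agreement is 0.75 and the chance-corrected coefficient comes out lower than that. Passing the full category list explicitly matters when some categories never appear in a small sample, since the chance term depends on the number of categories.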


Topics

Journal Article
