Transforming adnexal mass assessment: how large language model improve ovarian-adnexal reporting and data system interpretation and sonographer performance.

April 18, 2026

DOI: 10.1007/s00261-026-05498-x PMID: 41999424

Authors

Sun Y, Shen H, Zhang L, Jiang Y, Zheng Q, Du L, He M, Wu L, Xie H

Affiliations (2)

First Affiliated Hospital of Sun Yat-sen University, Guangzhou, China.
First Affiliated Hospital of Sun Yat-sen University, Guangzhou, China.

Abstract

The aim was to assess the accuracy of large language models (LLMs) in Ovarian-Adnexal Reporting and Data System (O-RADS) categorization based on free-text ultrasound reports and to explore their potential to support radiologists with varying levels of experience.

This retrospective study included patients with suspected adnexal masses from October 2022 to May 2024. A reference standard for O-RADS categorization was established by consensus of three senior radiologists. Surgical pathology served as the gold standard for determining benign versus malignant nature. Three LLMs (ChatGPT-4o, ChatGPT-5, Gemini 2.5 Pro) were prompted with O-RADS rules using few-shot learning. Intra-LLM agreement and accuracy against the reference standard were evaluated, along with structured error analysis for systematic misclassification patterns. In a crossover design, a subset of 150 lesions was interpreted independently by two junior and two senior readers with and without LLM assistance. Diagnostic performance (area under the receiver operating characteristic curve [AUC]), inter-reader agreement, and agreement with the reference standard were compared between assisted and unassisted readings.

A total of 302 patients with 324 lesions were analyzed. All three LLMs demonstrated substantial intra-LLM agreement (κ = 0.65-0.77) and high categorization accuracy against the reference standard, with no significant differences among models (p = 0.70). Error analysis revealed that misclassifications were concentrated in lesions with borderline morphologic features. With LLM assistance, junior radiologists showed higher inter-reader agreement and agreement with the reference standard (weighted κ ≥ 0.85), along with a non-significant trend toward better diagnostic performance (mean ΔAUC = 0.10; p = 0.06). In contrast, senior readers showed stable performance (mean ΔAUC = 0.01; p = 0.22). In discordant cases, junior readers accepted LLM suggestions in 60.78%-72.34% of instances, compared with only 2.04%-2.33% among senior readers.

LLMs demonstrated high accuracy and reliability in O-RADS ultrasound categorization based on free-text reports. Their assistance was associated with increased O-RADS categorization performance and a non-significant trend toward better diagnostic performance among junior readers, suggesting their potential as an adjunct tool for less experienced radiologists. However, these observed effects predominantly reflect substantial reliance on LLM suggestions rather than independent skill improvement.
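
Agreement with the reference standard is reported above as a weighted κ. As a rough illustration of that statistic only, the following Python sketch computes Cohen's weighted kappa for two sets of ordinal ratings. The function name, the choice of linear versus quadratic weights, and the category list are assumptions made for this example; the abstract does not state which weighting scheme or software the authors used.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories, weights="linear"):
    """Cohen's weighted kappa for two raters over ordered categories.

    Illustrative helper (hypothetical, not taken from the study).
    `categories` lists every possible category in ascending order.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    if weights not in ("linear", "quadratic"):
        raise ValueError("weights must be 'linear' or 'quadratic'")

    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    power = 1 if weights == "linear" else 2

    def disagreement(i, j):
        # Penalty grows with the distance between the two ordered categories.
        return (abs(i - j) / (k - 1)) ** power

    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    # Observed and chance-expected weighted disagreement.
    obs = sum(disagreement(i, j) * c for (i, j), c in observed.items()) / n
    exp = sum(
        disagreement(i, j) * marg_a[i] * marg_b[j]
        for i in range(k)
        for j in range(k)
    ) / (n * n)
    if exp == 0:
        return 1.0  # both raters used one and the same category throughout
    return 1.0 - obs / exp


if __name__ == "__main__":
    # Made-up ratings for illustration; these are NOT data from the study.
    reference = [2, 3, 4, 5, 2, 3, 4, 2]
    reader = [2, 3, 4, 4, 2, 3, 5, 2]
    print(weighted_kappa(reader, reference, categories=[1, 2, 3, 4, 5]))
```

Weighting treats a one-category disagreement (for example O-RADS 3 versus 4) as less severe than a disagreement spanning several categories, which is why it is commonly preferred over unweighted κ for ordinal scales.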

Topics

Journal Article
