
Automated O-RADS Risk Stratification Using a Large Language Model Analysis of Narrative Ultrasound Reports.

April 10, 2026

Authors

Guo Y, Gong J, Jiang R, Agarwal A, Goel R, Selingreund R, Liu Y, Ren M

Affiliations (7)

  • Department of Computer Science, University of Illinois Springfield, Springfield, IL, USA. Electronic address: [email protected].
  • School of Medicine, Tongji University, Shanghai, China; Department of Medical Ultrasound, Shanghai Changning Maternity and Infant Health Hospital, Shanghai, China.
  • Third Clinical Medical College, Zhengzhou University, Zhengzhou, China.
  • Southern Illinois University School of Medicine, Springfield, IL, USA.
  • Southern Illinois University School of Medicine, Springfield, IL, USA; Johns Hopkins University School of Medicine, Baltimore, MD, USA.
  • Department of Ultrasound Medicine, Sanya Central Hospital (The Third People's Hospital of Hainan Province), Sanya, China. Electronic address: [email protected].
  • Department of Ultrasound Medicine, Shanghai First Maternity and Infant Hospital, School of Medicine, Tongji University, Shanghai, China. Electronic address: [email protected].

Abstract

The Ovarian-Adnexal Reporting and Data System (O-RADS) is essential for standardizing the risk stratification of ovarian lesions detected on ultrasound. However, manual assignment of O-RADS scores is time-consuming and can vary between observers. This study investigates an automated method for O-RADS scoring using a large language model (LLM) to analyze narrative ultrasound reports.

A two-stage pipeline was developed for automated O-RADS classification. Initially, the Lingshu LLM, specialized in medical language, extracted and embedded features from free-text descriptions of ovarian lesions, identifying the key diagnostic features mentioned by sonologists. Subsequently, these features were used to train and evaluate several machine learning algorithms, including logistic regression (LR), support vector machines, and random forests, to predict O-RADS scores (1-5).

The proposed method was evaluated on a dataset of 513 cases using fivefold cross-validation. The pipeline using Lingshu model embeddings with LR achieved the highest accuracy of 0.803 [95% CI: 0.753, 0.853], a weighted-average F1-score of 0.819 [95% CI: 0.777, 0.861], and a macro-averaged AUROC of 0.948 [95% CI: 0.937, 0.959]. This outperformed the MedGemma model's pipeline, which had an accuracy of 0.760 [95% CI: 0.700, 0.820], an F1-score of 0.787 [95% CI: 0.739, 0.835], and an AUROC of 0.941 [95% CI: 0.911, 0.971].

This study introduces a novel approach to automating O-RADS scoring using LLMs for feature extraction and traditional machine learning for classification. The results indicate that this method can accurately stratify ovarian cancer risk, potentially improving clinical workflow efficiency and reducing diagnostic variability. This approach may support radiologists in making more consistent and timely assessments.
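The second stage of the pipeline described above can be sketched with standard tooling. The sketch below is illustrative only: it assumes each narrative report has already been converted to an embedding vector by an LLM (the paper uses Lingshu; here random placeholder vectors and labels stand in, since the actual model and dataset are not available), and uses scikit-learn's logistic regression with stratified fivefold cross-validation to mirror the paper's evaluation protocol. The metric values it prints are meaningless on placeholder data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Stage 1 stand-in: in the paper, the Lingshu LLM embeds each free-text
# ultrasound report. Here, random vectors play that role (placeholder only).
rng = np.random.default_rng(0)
n_cases, embed_dim = 513, 256                # 513 cases, as in the study
X = rng.normal(size=(n_cases, embed_dim))    # placeholder report embeddings
y = rng.integers(1, 6, size=n_cases)         # placeholder O-RADS scores 1-5

# Stage 2: logistic regression evaluated with fivefold cross-validation.
clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pred = cross_val_predict(clf, X, y, cv=cv)

acc = accuracy_score(y, pred)
f1 = f1_score(y, pred, average="weighted")   # weighted F1, as reported
print(f"accuracy={acc:.3f}  weighted F1={f1:.3f}")
```

On real Lingshu embeddings and labeled reports, the same evaluation loop would yield the accuracy and F1 figures the study reports; with random placeholders it produces chance-level numbers.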

Topics

Journal Article
