
Automated O-RADS Risk Stratification Using a Large Language Model Analysis of Narrative Ultrasound Reports.

April 10, 2026

Authors

Guo Y, Gong J, Jiang R, Agarwal A, Goel R, Selingreund R, Liu Y, Ren M

Affiliations (7)

  • Department of Computer Science, University of Illinois Springfield, Springfield, IL, USA. Electronic address: [email protected].
  • School of Medicine, Tongji University, Shanghai, China; Department of Medical Ultrasound, Shanghai Changning Maternity and Infant Health Hospital, Shanghai, China.
  • Third Clinical Medical College, Zhengzhou University, Zhengzhou, China.
  • Southern Illinois University School of Medicine, Springfield, IL, USA.
  • Southern Illinois University School of Medicine, Springfield, IL, USA; Johns Hopkins University School of Medicine, Baltimore, MD, USA.
  • Department of Ultrasound Medicine, Sanya Central Hospital (The Third People's Hospital of Hainan Province), Sanya, China. Electronic address: [email protected].
  • Department of Ultrasound Medicine, Shanghai First Maternity and Infant Hospital, School of Medicine, Tongji University, Shanghai, China. Electronic address: [email protected].

Abstract

The Ovarian-Adnexal Reporting and Data System (O-RADS) is essential for standardizing the risk stratification of ovarian lesions detected on ultrasound. However, manual assignment of O-RADS scores is time-consuming and can vary between observers. This study investigates an automated method for O-RADS scoring using a large language model (LLM) to analyze narrative ultrasound reports.

A two-stage pipeline was developed for automated O-RADS classification. Initially, the Lingshu LLM, specialized in medical language, extracted and embedded features from free-text descriptions of ovarian lesions, identifying the key diagnostic features mentioned by sonologists. Subsequently, these features were used to train and evaluate several machine learning algorithms, including logistic regression (LR), support vector machines, and random forests, to predict O-RADS scores (1-5).

The proposed method was evaluated on a dataset of 513 cases using fivefold cross-validation. The pipeline using Lingshu model embeddings with LR achieved the highest accuracy of 0.803 [95% CI: 0.753, 0.853], a weighted-average F1-score of 0.819 [95% CI: 0.777, 0.861], and a macro-averaged AUROC of 0.948 [95% CI: 0.937, 0.959]. This outperformed the MedGemma model's pipeline, which had an accuracy of 0.760 [95% CI: 0.700, 0.820], an F1-score of 0.787 [95% CI: 0.739, 0.835], and an AUROC of 0.941 [95% CI: 0.911, 0.971].

This study introduces a novel approach to automating O-RADS scoring using LLMs for feature extraction and traditional machine learning for classification. The results indicate that this method can accurately stratify ovarian cancer risk, potentially improving clinical workflow efficiency and reducing diagnostic variability. This approach may support radiologists in making more consistent and timely assessments.
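The second stage of the pipeline described above can be sketched with standard tooling. The sketch below is illustrative only: it assumes each narrative report has already been converted to an embedding vector by an LLM (the paper uses Lingshu; here random placeholder vectors and labels stand in, since the actual model and dataset are not available), and uses scikit-learn's logistic regression with stratified fivefold cross-validation to mirror the paper's evaluation protocol. The metric values it prints are meaningless on placeholder data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Stage 1 stand-in: in the paper, the Lingshu LLM embeds each free-text
# ultrasound report. Here, random vectors play that role (placeholder only).
rng = np.random.default_rng(0)
n_cases, embed_dim = 513, 256                # 513 cases, as in the study
X = rng.normal(size=(n_cases, embed_dim))    # placeholder report embeddings
y = rng.integers(1, 6, size=n_cases)         # placeholder O-RADS scores 1-5

# Stage 2: logistic regression evaluated with fivefold cross-validation.
clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pred = cross_val_predict(clf, X, y, cv=cv)

acc = accuracy_score(y, pred)
f1 = f1_score(y, pred, average="weighted")   # weighted F1, as reported
print(f"accuracy={acc:.3f}  weighted F1={f1:.3f}")
```

On real Lingshu embeddings and labeled reports, the same evaluation loop would yield the accuracy and F1 figures the study reports; with random placeholders it produces chance-level numbers.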

Topics

Journal Article
