
Exploring the feasibility of inferring prostate cancer pathological grade from multiparametric MRI text reports using natural language processing: assessment of four large language models.

April 7, 2026

Authors

Niu Y, Shen L, Liu J, Yang Z, Wang L, Cheng Y

Affiliations (4)

  • Department of Interventional Medicine, Beijing Chaoyang Hospital, Capital Medical University, Beijing, China.
  • Department of Radiology, Beijing Friendship Hospital, Capital Medical University, Beijing, China.
  • Department of Radiology, Chuiyangliu Hospital Affiliated with Tsinghua University, Beijing, China.
  • Department of Radiology, Beijing Friendship Hospital, Capital Medical University, Beijing, China. [email protected].

Abstract

This study conducted a natural language processing feasibility analysis comparing four large language models (LLMs) on (a) reproducibility and (b) predictive accuracy for International Society of Urological Pathology Grade Groups (ISUP GGs) based on structured text reports from prostate multiparametric magnetic resonance imaging (mpMRI). The LLMs first performed an initial round of ISUP GG predictions based solely on the mpMRI text reports, followed by a second round that incorporated clinical information. Each prediction round was repeated three times to assess consistency. Three radiologists independently completed the first two rounds of ISUP GG predictions and then performed a third round of assessment after reviewing the LLMs' predictions. Response times were also recorded. The study included 150 patients (median age, 69 years). Statistically significant differences were observed among the ISUP GGs in age, PSA level, prostate volume, PSA density, and PI-RADS score. The four LLMs demonstrated good to excellent reproducibility (Kappa 0.671-0.861). ChatGPT-4.1 had the shortest response time (0.95-17.19 s). The accuracy of the LLMs (32.7-50.0%) was significantly lower than that of the senior radiologist (72.7-76.0%) and the intermediate-level radiologist (66.0-68.7%), but comparable to that of the junior radiologist (59.3-65.3%). General-purpose LLMs demonstrate excellent reproducibility. While ChatGPT-4.1 outperforms the other LLMs in ISUP GG prediction and response time, its predictive accuracy remains inferior to that of intermediate and senior radiologists. Task-specific fine-tuning is therefore necessary before general-purpose LLMs can be applied in clinical practice.
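The reproducibility figures quoted above (Kappa 0.671-0.861) come from repeating each prediction round three times per report. The abstract does not state which kappa variant the authors used; as a minimal sketch, Fleiss' kappa is one standard choice for agreement across three repeated categorical ratings. The `runs` data below are hypothetical, not from the study.

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa: agreement across a fixed number of ratings per subject.
    ratings: one inner list of category labels per subject (here, per report)."""
    n = len(ratings[0])   # ratings per subject (e.g., 3 repeated LLM runs)
    N = len(ratings)      # number of subjects (reports)
    counts = [Counter(r) for r in ratings]
    # Observed per-subject agreement P_i, averaged into P_bar
    P_i = [(sum(c * c for c in cnt.values()) - n) / (n * (n - 1)) for cnt in counts]
    P_bar = sum(P_i) / N
    # Chance agreement P_e from overall category proportions
    p_j = [sum(cnt[cat] for cnt in counts) / (N * n) for cat in categories]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical data: 3 repeated runs over 5 reports, ISUP GG labels 1-5
runs = [[1, 1, 1], [2, 2, 3], [4, 4, 4], [5, 5, 5], [3, 3, 3]]
print(round(fleiss_kappa(runs, categories=[1, 2, 3, 4, 5]), 3))  # 0.831
```

A value near 0.83, as in this toy example, would fall in the "excellent" band the abstract reports for the best-performing models.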

Topics

Journal Article
