Exploring the feasibility of inferring prostate cancer pathological grade from multiparametric MRI text reports using natural language processing: assessment of four large language models.
Affiliations (4)
- Department of Interventional Medicine, Beijing Chaoyang Hospital, Capital Medical University, Beijing, China.
- Department of Radiology, Beijing Friendship Hospital, Capital Medical University, Beijing, China.
- Department of Radiology, Chuiyangliu Hospital Affiliated with Tsinghua University, Beijing, China.
- Department of Radiology, Beijing Friendship Hospital, Capital Medical University, Beijing, China. [email protected].
Abstract
This study conducted a natural language processing feasibility analysis comparing four large language models (LLMs) in terms of (a) reproducibility and (b) predictive accuracy for International Society of Urological Pathology Grade Groups (ISUP GGs) based on structured text reports from prostate multiparametric magnetic resonance imaging (mpMRI). The LLMs first performed an initial round of ISUP GG predictions based solely on the mpMRI text reports, followed by a second round that incorporated clinical information. Each prediction round was repeated three times to assess consistency. Three radiologists independently completed the first two rounds of ISUP GG predictions and then performed a third round of assessment after reviewing the LLMs' predictions. Response times were recorded. The study included 150 patients (median age, 69 years). Statistically significant differences were observed among ISUP GGs in age, PSA level, prostate volume, PSA density, and PI-RADS score. The four LLMs demonstrated good to excellent reproducibility (kappa, 0.671-0.861). ChatGPT-4.1 had the shortest response time (0.95-17.19 s). The accuracy of the LLMs (32.7-50.0%) was significantly lower than that of the senior radiologist (72.7-76.0%) and the intermediate-level radiologist (66.0-68.7%), but comparable to that of the junior radiologist (59.3-65.3%). General-purpose LLMs demonstrate excellent reproducibility. While ChatGPT-4.1 outperformed the other LLMs in ISUP GG prediction accuracy and response time, its predictive accuracy remained inferior to that of the intermediate-level and senior radiologists. Therefore, task-specific fine-tuning is necessary before general-purpose LLMs can be applied in clinical practice.