
Limitations of Large Language Models in Assisting PI-RADS Scoring on Prostate Biparametric MRI Text Reports.

January 10, 2026 (PubMed)

Authors

Zhang S, Wu Z, Guo M, Liu C, Cui M, Yang S, Chen F

Affiliations (4)

  • Department of Radiology, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310006, China (S.Z., M.G., C.L., F.C.).
  • Department of Intensive Care Unit, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310006, China (Z.W.).
  • Department of Ultrasound, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310006, China (M.C.).
  • Department of Radiology, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310006, China (S.Z., M.G., C.L., F.C.). Electronic address: [email protected].

Abstract

Prostate cancer (PCa) is a significant global health challenge, and the Prostate Imaging Reporting and Data System (PI-RADS) is crucial for MRI-based risk stratification. However, inter-reader variability, especially in the transition zone and among practitioners with differing experience levels, compromises diagnostic consistency. Large language models (LLMs) show potential in medical image analysis, particularly in standardizing reports to improve diagnostic consistency and efficiency.

This study aimed to evaluate the performance of LLMs in assisting PI-RADS scoring based on biparametric MRI text reports, to compare them with radiologists of varying experience levels, and to identify independent predictors of PCa and clinically significant PCa (csPCa) using multivariable logistic regression analysis.

This retrospective single-center study included 210 patients who underwent transperineal cognitive fusion-targeted biopsy for clinically suspected prostate cancer between December 2024 and July 2025. Three radiologists and two LLMs (DeepSeek and ChatGPT-4.1) independently reviewed anonymized reports and assigned PI-RADS v2.1 scores. Diagnostic performance was assessed using biopsy pathological results as the gold standard. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and area under the receiver operating characteristic curve (AUC) were calculated at the lesion level (PI-RADS ≥3 as positive) and at the participant level (PI-RADS ≥3 and ≥4 as positive thresholds). Decision curve analysis was performed to evaluate clinical utility. Subgroup analyses were conducted by lesion location (peripheral zone vs. transition zone).

The senior radiologist demonstrated the highest diagnostic performance, with AUC values of 0.847 for PCa and 0.859 for csPCa. The attending physician achieved perfect sensitivity but had the lowest specificity and PPV. The resident physician had comparable sensitivity but lower specificity and PPV, resulting in the lowest AUC values. Both LLMs exhibited high sensitivity but extremely low specificity, leading to lower PPV than human readers. DeepSeek outperformed ChatGPT-4.1 in AUC but still fell short of the senior radiologist's performance. In region-specific analyses, the senior radiologist significantly outperformed the LLMs in the transition zone, while the LLMs showed high sensitivity but low specificity in the peripheral zone. At the participant level, raising the threshold to PI-RADS ≥4 substantially improved specificity for all readers. Decision curve analysis confirmed the superior clinical utility of the PI-RADS ≥4 threshold, with the senior radiologist's ratings achieving the highest net benefit. Multivariable logistic regression analysis identified PSA density as the strongest independent predictor of both PCa (OR = 109.49, 95% CI: 14.89-1000.00, P < 0.001) and csPCa (OR = 152.16, 95% CI: 21.06-1000.00, P < 0.001). Among all PI-RADS ratings, only the senior radiologist's scores retained independent predictive value for both PCa (OR = 17.94, P < 0.001) and csPCa (OR = 22.69, P = 0.001).

While LLMs demonstrated high sensitivity in detecting PCa and csPCa, they had significant limitations in specificity and PPV, particularly in the transition and peripheral zones. The optimal utilization strategy involves deploying LLMs as adjuncts for indeterminate cases or when using higher diagnostic thresholds (PI-RADS ≥4). Experienced radiologists achieved better diagnostic performance, highlighting the need for cautious clinical application of LLMs. Future research should focus on optimizing LLMs to improve specificity and reliability, and on combining them with human radiologists' expertise to enhance diagnostic accuracy and efficiency.
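
As a rough illustration of the threshold-based metrics named in the abstract, the sketch below computes sensitivity, specificity, PPV, and NPV for a PI-RADS cutoff, plus the standard decision-curve net benefit, TP/n - (FP/n) * p_t / (1 - p_t). It is not the authors' analysis code: the function names, the threshold probability `p_t`, and the ten toy scores and biopsy labels are all hypothetical and chosen only to show how a higher cutoff can trade sensitivity for specificity.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def _ratio(num: int, den: int) -> float:
    # Return NaN instead of raising when a denominator is empty.
    return num / den if den else float("nan")


def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when 'score >= threshold' is called positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        positive = score >= threshold
        if positive and label:
            tp += 1
        elif positive and not label:
            fp += 1
        elif not positive and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def diagnostic_metrics(scores, labels, threshold) -> Metrics:
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    return Metrics(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
    )


def net_benefit(scores, labels, threshold, p_t) -> float:
    """Decision-curve net benefit at threshold probability p_t (0 < p_t < 1)."""
    tp, fp, _, _ = confusion_counts(scores, labels, threshold)
    n = len(labels)
    return tp / n - (fp / n) * (p_t / (1 - p_t))


# Hypothetical toy data: PI-RADS scores and biopsy results (1 = cancer).
toy_scores = [2, 3, 3, 4, 4, 5, 5, 2, 3, 4]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]

at_3 = diagnostic_metrics(toy_scores, toy_labels, threshold=3)
at_4 = diagnostic_metrics(toy_scores, toy_labels, threshold=4)
nb_4 = net_benefit(toy_scores, toy_labels, threshold=4, p_t=0.2)

print(at_3)
print(at_4)
print(nb_4)
```

On this toy data the PI-RADS ≥3 cutoff catches every cancer but calls three of five benign cases positive, while the ≥4 cutoff misses one cancer and calls only one benign case positive. The numbers carry no clinical meaning and do not reproduce any result from the study.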

Topics

Journal Article
