Real-world text-only inference of PI-RADS v2.1 from prostate MRI reports using large language models: a lesion-level, zone-aware study.
Authors
Affiliations (2)
- Department of Radiology, Sultan Abdülhamid Han Training and Research Hospital, University of Health Sciences Türkiye, Istanbul, Türkiye. Electronic address: [email protected].
- Department of Radiology, Sultan Abdülhamid Han Training and Research Hospital, University of Health Sciences Türkiye, Istanbul, Türkiye. Electronic address: [email protected].
Abstract
To evaluate the feasibility and limitations of real-world, text-only inference of PI-RADS v2.1 categories from prostate MRI reports using large language models, with lesion-level and zone-aware analysis. This single-center retrospective study included 1,205 lesion-level entries from 1,118 patients, derived from semi-structured prostate MRI reports after removal of all explicit PI-RADS elements. ChatGPT-4o was prompted to assign numeric PI-RADS categories based solely on report text. Agreement with radiologist-assigned reference categories was assessed using exact agreement, Cohen's κ, and class-wise metrics. Analyses were performed overall, by zone (peripheral vs transition), and using collapsed risk strata (1-2/3/4-5). Discordant cases were reviewed to identify error mechanisms and severity. Human interobserver agreement, intra-model reproducibility, temporal stability, and a paired model-version sensitivity analysis comparing ChatGPT-4o with GPT-5.2 were also evaluated.

Overall exact agreement was 72.9% (κ = 0.538; macro-F1 = 61.2%), with a systematic tendency toward overcalling. Agreement was higher in the peripheral zone than in the transition zone (κ = 0.476 vs 0.077, reference PI-RADS 3-5). PI-RADS 3 showed the lowest precision and recall, with frequent bidirectional misclassification. Collapsing categories improved agreement (κ = 0.610). Incorrect diffusion-weighted imaging subscores were the most common error mechanism, with zone-specific differences. Clinically high-impact downgrades of PI-RADS 4-5 to 1-2 were rare (1.6%). Human interobserver agreement was excellent (κ = 0.916-0.967). GPT-5.2 outperformed ChatGPT-4o in paired analyses but produced invalid outputs in a minority of cases.

Text-only large language models can infer radiologist-assigned PI-RADS v2.1 categories from real-world prostate MRI reports with moderate agreement, but performance is zone dependent and limited around PI-RADS 3, particularly in the transition zone. These models are best suited as supervised tools for quality control rather than autonomous decision-making.
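The agreement metrics reported above (exact agreement, Cohen's κ, and macro-F1) can be reproduced from paired lesion-level labels. The sketch below, which uses only the Python standard library and illustrative toy data rather than the study's dataset, shows how each metric is defined over PI-RADS categories 1-5:

```python
from collections import Counter

def agreement_metrics(ref, pred, labels=(1, 2, 3, 4, 5)):
    """Exact agreement, Cohen's kappa, and macro-F1 between two raters.

    ref, pred: equal-length sequences of PI-RADS categories
    (e.g. radiologist reference vs model-assigned)."""
    n = len(ref)
    # Exact agreement: fraction of lesions with identical categories.
    po = sum(r == p for r, p in zip(ref, pred)) / n
    # Chance agreement from the two raters' marginal distributions.
    ref_c, pred_c = Counter(ref), Counter(pred)
    pe = sum(ref_c[c] * pred_c[c] for c in labels) / n ** 2
    kappa = (po - pe) / (1 - pe) if pe < 1 else 1.0
    # Macro-F1: unweighted mean of per-class F1 scores.
    f1s = []
    for c in labels:
        tp = sum(r == p == c for r, p in zip(ref, pred))
        prec = tp / pred_c[c] if pred_c[c] else 0.0
        rec = tp / ref_c[c] if ref_c[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return po, kappa, sum(f1s) / len(f1s)

# Toy example only; these values are not from the study.
ref  = [2, 3, 4, 5, 3, 2, 4, 1, 3, 5]
pred = [2, 4, 4, 5, 3, 3, 4, 1, 4, 5]
acc, kappa, macro_f1 = agreement_metrics(ref, pred)
print(f"exact agreement={acc:.3f}, kappa={kappa:.3f}, macro-F1={macro_f1:.3f}")
```

Collapsed risk strata (1-2/3/4-5) can be evaluated with the same function by first mapping each category into its stratum and passing `labels=(0, 1, 2)` (or any consistent stratum codes).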