Staging Prostate Cancer with AI: A Comparative Study of Large Language Models and Expert Interpretation on PSMA PET-CT Reports.
Authors
Affiliations (4)
- Department of Medical Oncology, Baskent University, Faculty of Medicine, Ankara, Türkiye.
- Department of Nuclear Medicine, Baskent University, Faculty of Medicine, Ankara, Türkiye.
- Department of Medical Oncology, Baskent University, Faculty of Medicine, Ankara, Türkiye.
- Department of Medical Informatics, Baskent University, Faculty of Medicine, Ankara, Türkiye.
Abstract
Accurate staging of prostate cancer is essential for therapeutic decision-making. While PSMA PET-CT reports offer rich clinical data, their unstructured format hinders large-scale analysis. Recent advances in large language models (LLMs) offer new opportunities to extract structured information from narrative radiology reports. However, their ability to perform multi-step clinical reasoning, particularly for cancer staging, remains underexplored. In this feasibility study, 80 anonymized, Turkish-language PSMA PET-CT reports were independently interpreted by two LLMs: Gemini 2.5 Pro (Google) and ChatGPT 4o (OpenAI). Using a structured prompt containing an embedded knowledge base (AJCC/CHAARTED criteria) and few-shot examples, both LLMs generated classifications for T, N, M, and overall clinical stage/disease volume. Outputs were benchmarked against expert classifications by a senior nuclear medicine specialist. Performance was evaluated using accuracy, precision, recall, F1-score, and Cohen's kappa. For the composite task of classifying clinical stage and disease volume, Gemini 2.5 Pro achieved an accuracy of 93.8% (95% CI: 86.0-97.9) and a Cohen's kappa of 0.910 (95% CI: 0.834-0.986), while ChatGPT 4o achieved 91.3% accuracy (95% CI: 82.8-96.4) with a kappa of 0.874 (95% CI: 0.786-0.962). For T staging, Gemini showed a higher accuracy point estimate (95.0% [95% CI: 87.7-98.6] vs. 91.3% [95% CI: 82.8-96.4]), while both models excelled at the binary N and M classifications, achieving accuracies above 95% and kappa values indicating near-perfect agreement (κ > 0.900). LLMs, when guided by expert-informed prompt engineering, can accurately stage prostate cancer from free-text PSMA PET-CT reports and may serve as a powerful assistive tool for data automation, research acceleration, and quality assurance.
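The agreement metrics named in the abstract can be illustrated with a short sketch. The abstract does not say how the confidence intervals were computed, so the exact (Clopper-Pearson) binomial interval used here is an assumption. The function names and the count of 75 correct reports out of 80 (inferred from the reported 93.8%) are illustrative, not taken from the study's code or data.

```python
from collections import Counter
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target, iters=60):
    """Bisection on [0, 1] for f(p) = target, where f is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact binomial CI for k successes in n trials (assumed method)."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa between two label sequences."""
    n = len(y_true)
    observed = sum(a == b for a, b in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement from the product of the marginal label frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical count: 75 of 80 reports staged in agreement with the expert.
    low, high = clopper_pearson(75, 80)
    print(f"accuracy = {75 / 80:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Under these assumptions, the interval for 75 of 80 should come out close to the reported 86.0-97.9. That suggests an exact binomial method, although the abstract does not confirm it.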