Phenotyping Prostate Cancer in a National Health System Using Large Language Models.

June 12, 2026

DOI: 10.1200/CCI-25-00314 PMID: 42284546

Authors

Dykstra MP, Tsao PA, Caram MEV, Nieto J, Schipper M, Stensland KD, Elliott D, Rose BS, Bryant AK

Affiliations (9)

Department of Radiation Oncology, Veterans Affairs Ann Arbor Healthcare System, Ann Arbor, MI.
Department of Radiation Oncology, University of Michigan, Ann Arbor, MI.
Division of Medical Oncology, Department of Internal Medicine, Veterans Affairs Ann Arbor Healthcare System, Ann Arbor, MI.
Division of Hematology/Oncology, Department of Internal Medicine, University of Michigan, Ann Arbor, MI.
Veterans Affairs Center for Clinical Management Research, Ann Arbor, MI.
Department of Biostatistics, University of Michigan, Ann Arbor, MI.
Section of Urology, Veterans Affairs Ann Arbor Healthcare System, Ann Arbor, MI.
Department of Urology, University of Michigan, Ann Arbor, MI.
Department of Radiation Medicine and Applied Sciences, University of California San Diego, La Jolla, CA.

Abstract

Large language models (LLMs) may improve extraction of prognostic variables in prostate cancer from unstructured clinical text compared with traditional, rule-based natural language processing. We used iterative prompt engineering with few-shot examples to develop LLM prompts for 30 phenotypes from prostate biopsy, radical prostatectomy (RP), and transurethral resection of the prostate (TURP) pathology reports, as well as magnetic resonance imaging (MRI) pelvis, computed tomography (CT) abdomen/pelvis, Tc-99m bone scan, and prostate-specific membrane antigen (PSMA) PET/CT reports. Data were drawn from >130 Veterans Affairs facilities (1999-2025). Inference was performed with Llama 3.3 70B or GPT-4o depending on the task. Performance was evaluated on independent test sets with metrics including overall accuracy, sensitivity, positive predictive value (PPV), negative predictive value (NPV), and macro-F1. Pathology extraction tasks achieved near-perfect accuracy. For prostate biopsy reports, exact extraction of total cores and involved cores was highly accurate (total cores: accuracy 98.0% [95% CI, 93.0 to 99.5]; involved cores: accuracy 95.0% [95% CI, 88.8 to 98.5]). Performance was similarly strong for RP and TURP reports. On MRI pelvis, extraction of PI-RADS scores (accuracy 98.0% [95% CI, 93.0 to 99.5]), lesion locations (accuracy 100% [95% CI, 96.3 to 100]), and lesion dimensions (accuracy 100% [95% CI, 96.3 to 100]) was excellent. For PSMA PET/CT, PPVs were 100% (95% CI, 93.5 to 100) for nodal metastases and 97.9% (95% CI, 89.9 to 99.6) for bone metastases; Tc-99m bone scan performance was comparable. Lower PPVs were observed for nodal and bone metastases on pelvic MRI (84.2%-86.4%) and CT (88.0%-90.3%), largely due to ambiguous language in radiology report texts. LLMs can reliably extract key prostate cancer phenotypes from unstructured text across multiple pathology and radiology report types. Ambiguous or indeterminate language remains the principal challenge for optimal performance.
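
The evaluation metrics named in the abstract (accuracy, sensitivity, PPV, NPV, macro-F1, each with a 95% CI) can be illustrated with a minimal sketch. The abstract does not state how the confidence intervals were computed or how large the test sets were, so the Wilson score interval used here is an assumption, one common choice for binomial proportions. The function names and the toy labels are hypothetical and are not drawn from the study.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed CI method)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


def binary_metrics(y_true: list[bool], y_pred: list[bool]) -> dict[str, float]:
    """Accuracy, sensitivity, PPV, and NPV for a binary phenotype."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "sensitivity": ratio(tp, tp + fn),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def macro_f1(y_true: list, y_pred: list) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        den = 2 * tp + fp + fn
        scores.append(2 * tp / den if den else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels (e.g. "bone metastasis present" per report).
toy_true = [True, True, True, False, False, False, False, False]
toy_pred = [True, True, False, True, False, False, False, False]
toy_metrics = binary_metrics(toy_true, toy_pred)
toy_macro_f1 = macro_f1(toy_true, toy_pred)
```

Because PPV depends on how often the model flags a finding that the reference label does not confirm, hedged report language ("possibly", "cannot exclude") that the model reads as positive would lower PPV while leaving sensitivity largely unchanged, which is in line with the pattern the abstract describes for pelvic MRI and CT.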

Topics

Prostatic Neoplasms, Large Language Models, Journal Article
