Can ChatGPT-4o reliably standardize PSMA PET/CT and PET/MRI reports using PROMISE V2 criteria? - An exploratory study.
Authors
Affiliations (10)
- DKFZ Hector Cancer Institute at the University Medical Center Mannheim, Heidelberg, Germany.
- Junior Clinical Cooperation Unit Translational Molecular Imaging in Oncologic Therapy Monitoring (E310), German Cancer Research Center, Heidelberg, Germany.
- Junior Clinical Cooperation Unit Intelligent Systems and Robotics in Urology (ISRU), German Cancer Research Center, Heidelberg, Germany.
- Department of Urology and Urologic Surgery, University Medical Centre Mannheim, University of Heidelberg, Mannheim, Germany.
- Department of Radiology and Nuclear Medicine, University Medical Center Mannheim, Heidelberg University, Mannheim, Germany.
- Department of Radiology, LMU University Hospital, LMU Munich, Munich, Germany.
- DKFZ Hector Cancer Institute at the University Medical Center Mannheim, Heidelberg, Germany.
- Junior Clinical Cooperation Unit Translational Molecular Imaging in Oncologic Therapy Monitoring (E310), German Cancer Research Center, Heidelberg, Germany.
- Department of Radiology and Nuclear Medicine, University Medical Center Mannheim, Heidelberg University, Mannheim, Germany.
- German Cancer Research Center (DKFZ), Im Neuenheimer Feld 280, 69120, Heidelberg, Germany.
Abstract
Structured reporting standardizes and facilitates reporting, improves the accuracy of communication, and ultimately supports clinical decision-making. Although standardized frameworks such as the PROMISE criteria are available for prostate-specific membrane antigen positron emission tomography (PSMA PET) in patients with prostate cancer, free-text reporting remains predominant in both clinical routine and trials. Large language models (LLMs) may enable low-effort, time-efficient extraction of structured classifications from narrative reports. This study evaluated the performance of ChatGPT-4o in extracting PROMISE V2-based classifications from unstructured PSMA PET/CT and PET/MRI reports. For PSMA PET/CT, overall miTNM accuracy was 79.8%, whereas PSMA PET/MRI achieved a significantly higher accuracy of 91.0% (OR = 2.80, 95% CI: 1.32-6.51, p = 0.003). Component-wise, PET/MRI outperformed PET/CT in T-stage classification (83.8% vs. 57.7%; OR = 3.83, 95% CI: 1.34-12.69, p = 0.006) and demonstrated numerically higher N-stage classification accuracy (100% vs. 85.9%, p = 0.014), while M-stage classification was comparable between modalities (89.1% vs. 95.7%; OR = 0.84, 95% CI: 0.20-4.19, p = 0.748). PRIMARY score accuracy was also comparable for PET/CT and PET/MRI (70.4% vs. 88.1%; OR = 0.43, 95% CI: 0.05-2.14, p = 0.315). ChatGPT-4o's rationale for its classifications was rated highly plausible across modalities, with minimum Likert scores of 4.8 for miTNM and 4.1 for PRIMARY. ChatGPT-4o enables reliable extraction of PROMISE V2-based N- and M-stage classifications from free-text PSMA PET reports, but its accuracy for T-stage is limited. This work provides a first step toward leveraging LLMs to support structured and efficient reporting in PSMA PET imaging and identifies current limitations.
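To make the reported comparisons easier to interpret, the following minimal sketch shows how an accuracy and an odds ratio with a 95% confidence interval can be derived from a 2x2 table of correct and incorrect classifications. The abstract does not state the underlying counts or the interval method, so everything here is an assumption: the function names `accuracy` and `odds_ratio_wald` are hypothetical, the counts are invented for illustration and are not the study's data, and the Wald interval on the log scale is one common choice that may differ from the method the authors used.

```python
import math


def accuracy(correct, incorrect):
    """Fraction of reports classified correctly."""
    return correct / (correct + incorrect)


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Sample odds ratio with a Wald confidence interval on the log scale.

    a, b: correct / incorrect classifications in group 1
    c, d: correct / incorrect classifications in group 2
    z:    normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        # A zero cell (e.g. 100% accuracy in one group) makes this
        # estimator undefined; an exact method would be needed instead.
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * standard_error)
    upper = math.exp(log_or + z * standard_error)
    return odds_ratio, lower, upper


# Hypothetical counts for illustration only (NOT taken from the study):
# group 1: 45 correct, 5 incorrect; group 2: 40 correct, 10 incorrect.
or_value, ci_low, ci_high = odds_ratio_wald(45, 5, 40, 10)
print(f"accuracy group 1: {accuracy(45, 5):.1%}")
print(f"accuracy group 2: {accuracy(40, 10):.1%}")
print(f"OR = {or_value:.2f}, 95% CI: {ci_low:.2f}-{ci_high:.2f}")
```

An odds ratio above 1 with an interval that excludes 1 indicates higher odds of a correct classification in group 1, which is how the overall miTNM and T-stage comparisons above read; an interval that includes 1, as for M-stage and PRIMARY, is consistent with no detectable difference.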