Automated PROMISE V2 Scoring from PSMA PET/CT Reports Using Large Language Models: A Comparative Evaluation of Prompt Design and Model Performance.
Authors
Affiliations (2)
- Department of Nuclear Medicine, Saarland University-Medical Center, 66421 Homburg, Germany.
- Department of Nuclear Medicine, Friedrich-Alexander-Universität Erlangen-Nürnberg and Universitätsklinikum Erlangen, 91054 Erlangen, Germany.
Abstract
Large language models (LLMs) are increasingly explored for clinical use. However, how reliably such models can support physicians in reporting, staging, and classification remains an active area of research. This study aimed to evaluate and compare multiple LLMs for automated PROMISE V2 classification of prostate cancer. A total of 126 unambiguous German-language PSMA PET/CT text reports were retrospectively analyzed, with reference standards established by expert consensus based on image interpretation and the original report text. Five LLMs (GPT-5.4, DeepSeek-V3.2, Claude Sonnet 4.6, Gemini 3 Flash, and Grok 4) were assessed using two English-language prompting strategies of differing complexity, a short prompt and a long prompt. Agreement with the reference standard served as the primary endpoint. Agreement varied widely across models with the short prompt (36.5-79.4%) but improved consistently with the long prompt (74.6-86.5%), with Gemini 3 Flash achieving the highest agreement. For the individual PROMISE V2 subcategories, agreement rates were high (miT: 81.0-92.1%, miN: 92.9-96.0%, miM: 92.9-95.2%), despite inter-model differences. In conclusion, contemporary LLMs demonstrate promising performance in deriving PROMISE V2 scores from unambiguous original report texts, particularly when guided by detailed prompts.
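The agreement endpoint described above can be illustrated with a minimal Python sketch. It assumes that overall agreement means an exact match of all three components (miT, miN, miM) with the reference standard, and that subcategory agreement is computed per component; the abstract does not spell out either definition. The `Promise` type, the `agreement` function, and the four example reports are hypothetical placeholders, not study data.

```python
from typing import NamedTuple


class Promise(NamedTuple):
    """Hypothetical container for one PROMISE V2 result (placeholder labels)."""
    miT: str
    miN: str
    miM: str


def agreement(pred: list[Promise], ref: list[Promise]) -> tuple[float, dict[str, float]]:
    """Return (overall exact-match rate, per-component match rates)."""
    if not ref or len(pred) != len(ref):
        raise ValueError("pred and ref must be non-empty and of equal length")
    n = len(ref)
    # Overall agreement: all three components must match the reference.
    overall = sum(p == r for p, r in zip(pred, ref)) / n
    # Subcategory agreement: each component is compared on its own.
    per_component = {
        field: sum(getattr(p, field) == getattr(r, field) for p, r in zip(pred, ref)) / n
        for field in Promise._fields
    }
    return overall, per_component


# Illustrative toy data only (assumption): four invented reports.
reference = [
    Promise("miT2", "miN0", "miM0"),
    Promise("miT3a", "miN1", "miM0"),
    Promise("miT0", "miN0", "miM1b"),
    Promise("miT2", "miN2", "miM1a"),
]
model_output = [
    Promise("miT2", "miN0", "miM0"),
    Promise("miT3b", "miN1", "miM0"),   # miT differs from the reference
    Promise("miT0", "miN0", "miM1b"),
    Promise("miT2", "miN2", "miM0"),    # miM differs from the reference
]

overall, per_component = agreement(model_output, reference)
print(f"overall: {overall:.1%}")
for name, rate in per_component.items():
    print(f"{name}: {rate:.1%}")
```

Under this definition, overall agreement can never exceed the lowest per-component rate, which would be consistent with the reported subcategory rates being higher than the overall rates.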