Artificial intelligence for immunotherapy response assessment in lung cancer using PET/CT reports.
Authors
Affiliations (5)
- Department of Medical Oncology, Baskent University Faculty of Medicine, Ankara, Türkiye.
- Department of Medical Oncology, Baskent University Faculty of Medicine, Ankara, Türkiye.
- Department of Nuclear Medicine, Baskent University Faculty of Medicine, Ankara, Türkiye.
- Baskent University Faculty of Medicine, Ankara, Türkiye.
- Department of Medical Informatics, Baskent University Faculty of Medicine, Ankara, Türkiye.
Abstract
Accurate and timely assessment of immunotherapy response is vital for optimizing lung cancer management. This study evaluates how well large language models (LLMs) can automate response assessment from positron emission tomography/computed tomography (PET/CT) reports according to the European Organization for Research and Treatment of Cancer (EORTC) criteria. A prompting strategy was developed using Google Gemini 2.5 Pro Experimental 03-25, combining explicit instructions for applying the EORTC criteria with few-shot examples. The same prompt was then tested with both Gemini 2.5 Pro and OpenAI ChatGPT 4o to assess cross-model performance. Pre- and post-immunotherapy PET/CT reports in text format from 36 lung cancer patients were classified independently by the LLMs and by an experienced nuclear medicine specialist. Precision, recall, F1-score, and support were calculated for each response category, and inter-rater agreement was assessed using Cohen's Kappa. The nuclear medicine specialist classified 5, 21, 6, and 4 reports as complete metabolic response (CMR), progressive metabolic disease (PMD), partial metabolic response (PMR), and stable metabolic disease (SMD), respectively, while Gemini 2.5 Pro assigned 4, 21, 8, and 3 reports to the same categories. Gemini achieved an overall accuracy of 94% and demonstrated strong agreement with the expert (overall Cohen's Kappa: 0.907). F1-scores were 0.86 for PMR and SMD, 0.89 for CMR, and 1.00 for PMD, with per-label Kappa scores ranging from 0.824 (PMR) to 1.00 (PMD). In comparison, ChatGPT 4o achieved perfect agreement with the expert across all 36 cases (accuracy = 100%, Cohen's Kappa = 1.000). When guided by a structured, task-specific prompt, both Gemini 2.5 Pro and ChatGPT 4o demonstrated strong capability for automating accurate immunotherapy response assessment in lung cancer using PET/CT reports. These results underscore the potential of LLMs to streamline clinical workflows and improve efficiency. Validation with larger datasets is warranted to support clinical implementation.
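The agreement metrics reported above can be illustrated with a short Python sketch. The abstract gives only the per-category totals and the summary statistics, not the full confusion matrix, so the cell counts below are an assumption: they are inferred from the reported totals (expert 5/21/6/4, Gemini 4/21/8/3) and the 94% accuracy, which together imply two disagreements, one CMR and one SMD report that Gemini labeled PMR. The function and variable names are illustrative and are not taken from the study.

```python
from collections import Counter

# ASSUMPTION: confusion counts inferred from the abstract's category totals
# and 94% accuracy; the paper's actual matrix is not given in the abstract.
# Keys are (expert label, Gemini label); values are numbers of reports.
CONFUSION = {
    ("CMR", "CMR"): 4, ("CMR", "PMR"): 1,
    ("PMD", "PMD"): 21,
    ("PMR", "PMR"): 6,
    ("SMD", "SMD"): 3, ("SMD", "PMR"): 1,
}


def accuracy(pairs):
    """Fraction of reports where both raters chose the same label."""
    total = sum(pairs.values())
    return sum(n for (a, b), n in pairs.items() if a == b) / total


def cohen_kappa(pairs):
    """Cohen's Kappa: (observed - chance agreement) / (1 - chance agreement)."""
    total = sum(pairs.values())
    rater_a, rater_b = Counter(), Counter()
    for (a, b), n in pairs.items():
        rater_a[a] += n
        rater_b[b] += n
    labels = set(rater_a) | set(rater_b)
    expected = sum(rater_a[l] * rater_b[l] for l in labels) / total ** 2
    return (accuracy(pairs) - expected) / (1 - expected)


def label_scores(pairs, label):
    """Precision, recall, F1, and support for one response category."""
    tp = pairs.get((label, label), 0)
    predicted = sum(n for (_, b), n in pairs.items() if b == label)
    support = sum(n for (a, _), n in pairs.items() if a == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / support if support else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1, support


def one_vs_rest(pairs, label):
    """Collapse the table to 'label' vs 'not label' for a per-label Kappa."""
    collapsed = Counter()
    for (a, b), n in pairs.items():
        collapsed[(a == label, b == label)] += n
    return collapsed


if __name__ == "__main__":
    print(f"accuracy = {accuracy(CONFUSION):.3f}")
    print(f"overall kappa = {cohen_kappa(CONFUSION):.3f}")
    for label in ("CMR", "PMD", "PMR", "SMD"):
        p, r, f1, support = label_scores(CONFUSION, label)
        kappa = cohen_kappa(one_vs_rest(CONFUSION, label))
        print(f"{label}: P={p:.2f} R={r:.2f} F1={f1:.2f} "
              f"support={support} kappa={kappa:.3f}")
```

Under these assumed counts, the sketch reproduces the figures quoted in the abstract: an overall Kappa of 0.907, F1-scores of 0.89 (CMR), 1.00 (PMD), and 0.86 (PMR and SMD), and per-label Kappa values of 0.824 for PMR and 1.00 for PMD.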