Artificial intelligence for TNM staging in NSCLC: a critical appraisal of segmentation utility in [<sup>18</sup>F]FDG PET/CT.
Authors
Affiliations (8)
- Department of Radiology, LMU University Hospital, LMU Munich, Munich, Germany. [email protected].
- Department of Radiology, LMU University Hospital, LMU Munich, Munich, Germany.
- Munich Center for Machine Learning (MCML), Munich, Germany.
- Comprehensive Pneumology Center (CPC-M), Member of the German Center for Lung Research (DZL), Munich, Germany.
- Department of Radiology, TUM University Hospital, TU Munich, Munich, Germany.
- Department of Medicine V, LMU University Hospital, LMU Munich, Munich, Germany, and Bavarian Center for Cancer Research (BZKF), Munich, Germany.
- Department of Nuclear Medicine, LMU University Hospital, LMU Munich, Munich, Germany.
- The Russell H. Morgan Department of Radiology and Radiological Sciences, Division of Nuclear Medicine and Molecular Imaging, Johns Hopkins University School of Medicine, Baltimore, MD, USA.
Abstract
This study aims to investigate whether a diagnostic AI model can effectively support lesion detection and staging in [<sup>18</sup>F]FDG PET/CT studies of non-small cell lung cancer (NSCLC), focusing on the distinction between technical segmentation accuracy and clinically meaningful performance. In this retrospective single-centre study, [<sup>18</sup>F]FDG PET/CT scans from 306 treatment-naïve NSCLC patients were reviewed with reference to multidisciplinary team decisions. Tumour lesions were manually segmented as the reference standard and compared with predictions from the top-performing algorithm of the autoPET III challenge. Quantitative segmentation metrics were calculated, and lesion-level errors were assessed for their impact on patient-level TNM and UICC staging. The algorithm achieved a mean Dice Similarity Coefficient (DSC) of 0.64. Lesion-level sensitivity was 95.8% across all patients, with a precision of 87.5%. False-positive M-category lesions (n = 196) were the most frequent error. Of all false positives, 35.7% were benign and 34.7% were non-oncologic pathologies. UICC staging matched the ground truth in 207/306 patients, with most discordances due to upstaging (88/306). Clinically driven metrics and cause-based error analysis offer valuable insight into AI segmentation performance. The evaluated model showed excellent lesion sensitivity but a tendency towards systematic overprediction across TNM categories. At the lesion level, M-category false positives and undersegmentation in the hilar region emerged as the main drivers of clinically relevant upstaging. Despite promising lesion detection sensitivity, only 67.7% of UICC stagings were accurate when based on AI masks, indicating that diagnostic AI may support, though not yet replace, manual lesion evaluation in NSCLC [<sup>18</sup>F]FDG PET/CT.
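For readers less familiar with the metrics reported above, the following minimal sketch illustrates the conventional definitions of the Dice Similarity Coefficient, lesion-level sensitivity and precision. It is not the study's evaluation code: the function names, the representation of masks as sets of voxel coordinates, the handling of empty masks and the toy values are all assumptions made for illustration.

```python
# Illustrative definitions only; names and conventions here are hypothetical,
# not taken from the study or from the autoPET III challenge code.

def dice(reference: set, prediction: set) -> float:
    """DSC = 2 * |A ∩ B| / (|A| + |B|) for two sets of voxel coordinates."""
    total = len(reference) + len(prediction)
    if total == 0:
        # Assumed convention: two empty masks are treated as perfect agreement.
        return 1.0
    return 2 * len(reference & prediction) / total


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of reference lesions that were detected: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else float("nan")


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted lesions that are true lesions: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else float("nan")


# Toy example: a 4-voxel reference mask and a 3-voxel prediction sharing 2 voxels.
reference_mask = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
predicted_mask = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}

toy_dsc = dice(reference_mask, predicted_mask)   # 2 * 2 / (4 + 3)
toy_sensitivity = sensitivity(tp=3, fn=1)        # 3 of 4 reference lesions found
toy_precision = precision(tp=3, fp=1)            # 3 of 4 predicted lesions correct
```

The sketch also shows why the two kinds of metric can diverge: DSC is computed on voxel overlap, whereas sensitivity and precision count whole lesions, so a model can detect nearly every lesion while still delineating it imperfectly.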