CALM-RAD: Calibrated GPT-4o Confidence-Based Triage Enables 96-100% Accuracy in Automated TNM Staging of Head and Neck Cancer Reports.
Authors
Affiliations (3)
- Department of Diagnostic and Interventional Onco-Radiology, NCI Jhajjar, All India Institute of Medical Sciences, New Delhi, India.
- Department of Diagnostic and Interventional Radiology, Klinikum Rechts Der Isar, School of Medicine and Health, Technical University of Munich, TUM University Hospital, Ismaninger Str 22, 81675, Munich, Germany.
- Department of Diagnostic and Interventional Onco-Radiology, Dr. BRAIRCH, All India Institute of Medical Sciences, Ansari Nagar, 110029, New Delhi, India. [email protected].
Abstract
This study assesses CALM-RAD, a framework that converts GPT-4o token-level log-probabilities (TLPs) into calibrated confidence scores for selective auto-labeling of TNM stage and primary site in head and neck cancer CT reports. A total of 150 anonymized CT reports were retrospectively curated from a tertiary cancer center. A radiologist assigned T, N, M, and site labels, which were hidden from the model. GPT-4o was queried once per report with each of two prompt types, simple and structured (knowledge-guided), at temperature 0.2, returning single-token predictions and their TLPs. Reports were split into calibration (n = 50) and testing (n = 100) sets. TLPs were converted to calibrated confidence scores using isotonic regression (M-stage excluded). Confidence-based triage was simulated on the test set to evaluate accuracy-coverage trade-offs. Correct predictions showed higher TLPs than errors (e.g., N-stage U = 1302.5, P < 0.001). Simple-prompt accuracies were 0.73 for T, 0.80 for N, and 0.96 for site; structured prompts raised T to 0.81 and N to 0.83 but reduced site to 0.91. Triage thresholds accepting 37% (simple) and 34% (structured) of T-stage predictions raised accuracies to 0.89 and 1.00, respectively; accepting 55% of N-stage cases achieved up to 1.00 accuracy. Accepted predictions were enriched for prototypical labels (e.g., N0 = 91%, χ<sup>2</sup> = 79.9, P < 0.001). CALM-RAD demonstrates that calibrated log-probabilities can support trustworthy selective automation in radiology report labeling while deferring uncertain cases to human review. This framework offers a practical path toward safer LLM integration in clinical radiology workflows.
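The calibrate-then-triage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`fit_isotonic`, `calibrated_confidence`, `triage`), the step-function form of the isotonic fit, the use of exp(TLP) as the raw score, and all numbers in the toy data are assumptions made purely for illustration. The sketch fits a non-decreasing map from raw token probability to observed accuracy on a calibration set (pool-adjacent-violators), then accepts only test predictions whose calibrated confidence meets a threshold and reports coverage and accuracy among the accepted cases.

```python
import math
from bisect import bisect_right


def fit_isotonic(scores, correct):
    """Fit a non-decreasing step function score -> P(correct) with pool-adjacent-violators."""
    # Pool tied scores first so every block starts at a distinct score.
    pooled = {}
    for s, c in zip(scores, correct):
        total, count = pooled.get(s, (0.0, 0))
        pooled[s] = (total + c, count + 1)

    blocks = []  # each block: [lowest score, sum of labels, number of points]
    for s in sorted(pooled):
        total, count = pooled[s]
        blocks.append([s, total, count])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][1] / blocks[-2][2] > blocks[-1][1] / blocks[-1][2]:
            _, total, count = blocks.pop()
            blocks[-1][1] += total
            blocks[-1][2] += count

    starts = [b[0] for b in blocks]
    values = [b[1] / b[2] for b in blocks]
    return starts, values


def calibrated_confidence(model, score):
    """Look up the calibrated confidence for a raw score (clamped at the lowest block)."""
    starts, values = model
    idx = bisect_right(starts, score) - 1
    return values[max(idx, 0)]


def triage(model, scores, correct, threshold):
    """Return (coverage, accuracy among accepted) for a given confidence threshold."""
    accepted = [c for s, c in zip(scores, correct)
                if calibrated_confidence(model, s) >= threshold]
    coverage = len(accepted) / len(scores)
    accuracy = sum(accepted) / len(accepted) if accepted else float("nan")
    return coverage, accuracy


# Toy data (invented for illustration): token log-probabilities of the predicted
# label and whether that label matched the radiologist's reference (1) or not (0).
calib_logprobs = [-1.2, -0.9, -0.7, -0.5, -0.3, -0.2, -0.1, -0.05, -0.02, -0.01]
calib_correct = [0, 1, 0, 0, 1, 1, 0, 1, 1, 1]
test_logprobs = [-1.0, -0.6, -0.25, -0.15, -0.04, -0.03, -0.015, -0.005]
test_correct = [0, 1, 0, 1, 1, 1, 1, 1]

# Assumption: the raw score is the linear token probability exp(TLP).
calib_probs = [math.exp(lp) for lp in calib_logprobs]
test_probs = [math.exp(lp) for lp in test_logprobs]

model = fit_isotonic(calib_probs, calib_correct)

for threshold in (0.0, 0.6, 0.9):
    coverage, accuracy = triage(model, test_probs, test_correct, threshold)
    print(f"threshold={threshold:.1f} coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

Raising the threshold trades coverage for accuracy: fewer predictions are auto-accepted, and the remainder would be deferred to human review. A library implementation such as scikit-learn's `IsotonicRegression` could replace the hand-written fit; it interpolates between blocks rather than using a step function, so its confidence values may differ slightly from this sketch.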