CALM-RAD: Calibrated GPT-4o Confidence-Based Triage Enables 96-100% Accuracy in Automated TNM Staging of Head and Neck Cancer Reports.

March 12, 2026

Authors

Gupta A, Adams LC, Rangarajan K

Affiliations (3)

  • Department of Diagnostic and Interventional Onco-Radiology, NCI Jhajjar, All India Institute of Medical Sciences, New Delhi, India.
  • Department of Diagnostic and Interventional Radiology, Klinikum Rechts Der Isar, School of Medicine and Health, Technical University of Munich, TUM University Hospital, Ismaninger Str 22, 81675, Munich, Germany.
  • Department of Diagnostic and Interventional Onco-Radiology, Dr. BRAIRCH, All India Institute of Medical Sciences, Ansari Nagar, 110029, New Delhi, India.

Abstract

This study assesses CALM-RAD, a framework that converts GPT-4o token-level log-probabilities (TLPs) into calibrated confidence scores for selective auto-labeling of TNM stage and primary site in head and neck cancer CT reports. A total of 150 anonymized CT reports were retrospectively curated from a tertiary cancer center. A radiologist assigned T, N, M, and site labels, which were hidden from the model. GPT-4o was queried once per report with a simple prompt and with a structured (knowledge-guided) prompt at temperature 0.2, returning single-token predictions and their TLPs. Reports were split into a calibration set (n = 50) and a test set (n = 100). TLPs were converted to calibrated confidence scores using isotonic regression (M-stage excluded). Confidence-based triage was simulated on the test set to evaluate accuracy-coverage trade-offs. Correct predictions showed higher TLPs than errors (e.g., N-stage U = 1302.5, P < 0.001). With the simple prompt, accuracy was 0.73 for T, 0.80 for N, and 0.96 for site; the structured prompt raised T to 0.81 and N to 0.83 but reduced site to 0.91. Triage thresholds that accepted 37% (simple) and 34% (structured) of T-stage predictions raised accuracy to 0.89 and 1.00, respectively; accepting 55% of N-stage cases achieved up to 1.00 accuracy. Accepted predictions were enriched for prototypical labels (e.g., N0 = 91%, χ² = 79.9, P < 0.001). CALM-RAD demonstrates that calibrated log-probabilities can support trustworthy selective automation in radiology report labeling while deferring uncertain cases to human review. This framework offers a practical path toward safer LLM integration in clinical radiology workflows.
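
The calibrate-then-triage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`fit_isotonic`, `calibrate`, `triage`), the step-function lookup, the 0.9 threshold, and all data values are assumptions made up for illustration. Isotonic regression is implemented here with the pool-adjacent-violators algorithm using only the standard library; because the fit is monotone, raw log-probabilities can be used directly as the input score.

```python
from bisect import bisect_left
from collections import defaultdict


def fit_isotonic(scores, correct):
    """Fit a non-decreasing step map from score to P(correct) via pool-adjacent-violators."""
    # Aggregate tied scores first so each distinct score forms one weighted point.
    agg = defaultdict(lambda: [0.0, 0])
    for s, y in zip(scores, correct):
        agg[s][0] += y
        agg[s][1] += 1
    blocks = []  # each block: [sum of labels, count, largest score in block]
    for s in sorted(agg):
        total, count = agg[s]
        blocks.append([total, count, s])
        # Merge backwards while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            t, c, upper = blocks.pop()
            blocks[-1][0] += t
            blocks[-1][1] += c
            blocks[-1][2] = upper
    uppers = [b[2] for b in blocks]
    means = [b[0] / b[1] for b in blocks]
    return uppers, means


def calibrate(score, uppers, means):
    """Look up the calibrated confidence for one score (step function, clamped at the ends)."""
    i = min(bisect_left(uppers, score), len(means) - 1)
    return means[i]


def triage(confidences, correct, threshold):
    """Accept predictions with confidence >= threshold; return (coverage, accuracy)."""
    accepted = [y for conf, y in zip(confidences, correct) if conf >= threshold]
    coverage = len(accepted) / len(correct)
    accuracy = sum(accepted) / len(accepted) if accepted else float("nan")
    return coverage, accuracy


# Toy calibration set (made-up values): token log-probability and whether
# the model's label matched the radiologist's label (1) or not (0).
cal_logprobs = [-0.01, -0.02, -0.05, -0.10, -0.30, -0.60, -0.90, -1.20]
cal_correct = [1, 1, 1, 0, 1, 0, 0, 0]
uppers, means = fit_isotonic(cal_logprobs, cal_correct)

# Toy test set (made-up values): calibrate, then simulate triage at one threshold.
test_logprobs = [-0.01, -0.03, -0.20, -0.40, -1.00]
test_correct = [1, 1, 0, 1, 0]
test_conf = [calibrate(lp, uppers, means) for lp in test_logprobs]
coverage, accuracy = triage(test_conf, test_correct, threshold=0.9)
print(f"coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

Sweeping `threshold` over a range of values traces the accuracy-coverage curve; predictions below the chosen threshold would be deferred to human review. A library implementation such as scikit-learn's `IsotonicRegression` interpolates between fitted points rather than using the plain step lookup shown here.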

Topics

Journal Article
