CALM-RAD: Calibrated GPT-4o Confidence-Based Triage Enables 96-100% Accuracy in Automated TNM Staging of Head and Neck Cancer Reports.

March 12, 2026


DOI: 10.1007/s10278-026-01904-4 PMID: 41820628

Authors

Gupta A, Adams LC, Rangarajan K

Affiliations (3)

Department of Diagnostic and Interventional Onco-Radiology, NCI Jhajjar, All India Institute of Medical Sciences, New Delhi, India.
Department of Diagnostic and Interventional Radiology, Klinikum Rechts Der Isar, School of Medicine and Health, Technical University of Munich, TUM University Hospital, Ismaninger Str 22, 81675, Munich, Germany.
Department of Diagnostic and Interventional Onco-Radiology, Dr. BRAIRCH, All India Institute of Medical Sciences, Ansari Nagar, 110029, New Delhi, India. [email protected].

Abstract

This study assesses CALM-RAD, a framework that converts GPT-4o token-level log-probabilities (TLPs) into calibrated confidence scores for selective auto-labeling of TNM stage and primary site in head and neck cancer CT reports. A total of 150 anonymized CT reports were retrospectively curated from a tertiary cancer center. A radiologist assigned T, N, M, and site labels, which were hidden from the model. GPT-4o was queried once per report with each of two prompt types, simple and structured (knowledge-guided), at temperature 0.2, returning single-token predictions and their TLPs. Reports were split into calibration (n = 50) and testing (n = 100) sets. TLPs were converted to calibrated confidence scores using isotonic regression (M-stage excluded). Confidence-based triage was simulated on the test set to evaluate accuracy-coverage trade-offs. Correct predictions showed higher TLPs than errors (e.g., N-stage U = 1302.5, P < 0.001). With simple prompts, accuracy was 0.73 for T, 0.80 for N, and 0.96 for site; structured prompts raised T to 0.81 and N to 0.83 but reduced site to 0.91. Triage thresholds that accepted 37% (simple) and 34% (structured) of T-stage predictions raised accuracy on the accepted cases to 0.89 and 1.00, respectively; accepting 55% of N-stage cases achieved up to 1.00 accuracy. Accepted predictions were enriched for prototypical labels (e.g., N0 = 91%, χ² = 79.9, P < 0.001). CALM-RAD demonstrates that calibrated log-probabilities can support trustworthy selective automation in radiology report labeling while deferring uncertain cases to human review. This framework offers a practical path toward safer LLM integration in clinical radiology workflows.
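
The calibration and triage steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`fit_isotonic`, `calibrated_confidence`, `triage`) are hypothetical, the isotonic fit is a plain pool-adjacent-violators step function rather than any particular library's interpolating version, and every number in the demo is made-up toy data, not a result from the study. The sketch assumes each prediction comes with one token log-probability and a correct/incorrect flag from the reference labels.

```python
import math
from bisect import bisect_left


def fit_isotonic(scores, correct):
    """Fit a non-decreasing step function P(correct | score)
    with the pool-adjacent-violators algorithm."""
    # Pool tied scores first so each distinct score maps to one value.
    pooled = {}
    for s, y in zip(scores, correct):
        total, count = pooled.get(s, (0.0, 0))
        pooled[s] = (total + y, count + 1)

    blocks = []  # each block: [sum of labels, number of cases, highest score]
    for s in sorted(pooled):
        total, count = pooled[s]
        blocks.append([total, count, s])
        # Merge while the previous block's mean exceeds the current one's
        # (compared by cross-multiplication to avoid division).
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top

    uppers = [b[2] for b in blocks]        # upper score bound of each block
    means = [b[0] / b[1] for b in blocks]  # calibrated confidence per block
    return uppers, means


def calibrated_confidence(model, score):
    """Look up the calibrated confidence for a raw score (step function)."""
    uppers, means = model
    idx = min(bisect_left(uppers, score), len(means) - 1)
    return means[idx]


def triage(confidences, correct, threshold):
    """Accept predictions at or above the threshold; defer the rest.
    Returns (coverage, accuracy on accepted cases)."""
    accepted = [y for c, y in zip(confidences, correct) if c >= threshold]
    coverage = len(accepted) / len(confidences)
    accuracy = sum(accepted) / len(accepted) if accepted else float("nan")
    return coverage, accuracy


# Toy calibration set (illustrative values only): log-probabilities of the
# predicted token and whether the prediction matched the reference label.
calib_logprobs = [-0.69, -0.51, -0.36, -0.22, -0.11, -0.05]
calib_correct = [0, 1, 0, 1, 1, 1]
model = fit_isotonic([math.exp(lp) for lp in calib_logprobs], calib_correct)

# Toy test set: calibrate, then sweep thresholds to see the trade-off.
test_logprobs = [-0.60, -0.30, -0.08, -0.02]
test_correct = [0, 1, 1, 1]
test_conf = [calibrated_confidence(model, math.exp(lp)) for lp in test_logprobs]

for threshold in (0.0, 0.5, 0.9):
    coverage, accuracy = triage(test_conf, test_correct, threshold)
    print(f"threshold={threshold:.1f} coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

Raising the threshold lowers coverage (the share of reports labeled automatically) and, when the confidence scores are informative, raises accuracy on the accepted cases; the deferred remainder would go to human review.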


Topics

Journal Article
