Development and retrospective validation of SCOUT: scalable clinical oversight of large language models via uncertainty triangulation
Authors
Affiliations (1)
- Department of Cardiology, State Key Laboratory of Cardiovascular Disease, National Center for Cardiovascular Diseases, Fuwai Hospital, Chinese Academy of Medical Sciences
Abstract
Large language models (LLMs) are increasingly used in clinical workflows, yet requiring clinician review of every AI output negates the efficiency gains that motivate their adoption. We present SCOUT (Scalable Clinical Oversight via Uncertainty Triangulation), a model-agnostic meta-verification framework that selectively defers unreliable LLM predictions to clinicians by triangulating three orthogonal signals: model heterogeneity, stochastic inconsistency, and reasoning critique. In this retrospective development and validation study, we derived the framework on a discovery cohort (n = 405) and validated it across three clinically distinct tasks using four independent retrospective cohorts: coronary heart disease subtyping (n = 2,271), liver cancer screening from radiology reports (n = 3,373), and diseased coronary vessel counting (n = 286). SCOUT reduced the volume of cases requiring human review by 45% to 83%, with projected final accuracy of 99.1% to 100.0% assuming expert correction of all flagged cases. SCOUT provides a scalable, retrospectively validated approach for deploying generative AI in clinical medicine without compromising patient safety. Prospective randomized validation is underway to confirm real-world clinical utility.
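The triangulation idea described in the abstract can be sketched as a simple deferral rule: a case is routed to a clinician if any one of the three orthogonal signals indicates unreliability. The sketch below is purely illustrative; all names, data structures, and the zero-disagreement threshold are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of SCOUT-style uncertainty triangulation.
# All identifiers and thresholds here are illustrative assumptions,
# not the authors' actual implementation.
from dataclasses import dataclass


@dataclass
class Case:
    model_votes: list[str]      # predictions from heterogeneous LLMs (signal 1)
    resample_votes: list[str]   # repeated stochastic samples of one LLM (signal 2)
    critique_flags_issue: bool  # did a reasoning-critique pass object? (signal 3)


def disagreement(votes: list[str]) -> float:
    """Fraction of votes that deviate from the majority label."""
    majority = max(set(votes), key=votes.count)
    return 1 - votes.count(majority) / len(votes)


def defer_to_clinician(case: Case, threshold: float = 0.0) -> bool:
    """Defer when ANY of the three orthogonal signals indicates unreliability."""
    return (
        disagreement(case.model_votes) > threshold        # model heterogeneity
        or disagreement(case.resample_votes) > threshold  # stochastic inconsistency
        or case.critique_flags_issue                      # reasoning critique
    )
```

Under this kind of rule, only the (presumably small) set of flagged cases is reviewed by clinicians, which is how a 45% to 83% reduction in review volume could coexist with near-perfect projected final accuracy.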