An ethics-informed computable audit framework for monitoring misdiagnosis risk in AI-assisted diagnosis.
Affiliations (7)
- School of Humanities and Social Sciences, Shanxi Medical University, Jinzhong, China.
- Department of Ideological and Political Education, Shanxi University of Medicine, Fenyang, Shanxi, China.
- School of Management, Shanxi Medical University, Jinzhong, China.
- Department of Nursing, Shanxi University of Medicine, Fenyang, Shanxi, China.
- Department of Radiation Therapy, Shanxi Cancer Hospital, Taiyuan, China.
- School of Humanities and Social Sciences, Shanxi Medical University, Jinzhong, China. [email protected].
- School of Management, Shanxi Medical University, Jinzhong, China. [email protected].
Abstract
Diagnostic AI can misclassify under distribution shift and subgroup imbalance, yet governance signals are rarely computable at deployment. We target deployed diagnostic decision-support systems that perform binary classification and output a continuous risk score (preferably a calibrated probability). The audit layer ingests deploy-time streams, including features X, model score S, subgroup tag g, and clinician action a; outcome labels Y may arrive later via adjudication or follow-up. We define a unit-scaled Misdiagnosis Risk Index (MRI-AI) that aggregates shift, fairness, calibration, and human-AI interaction signals; implement a streaming sentinel with starter bands and stop rules; and log signals and actions in an accountability ledger. A minimal simulation emulates device/site drift and subgroup imbalance. Outcomes include deploy-time trigger behavior from label-free indicators and delayed updates of label-dependent metrics, together with trigger rate, top-decile error share, and decision-curve net benefit. Using a controlled, scenario-based synthetic stress-test suite, designed to evaluate the audit and monitoring layer rather than to claim clinical performance for any particular diagnostic model, we report predictive metrics (overall and worst-group AUC/FPR, ECE, and Brier score) and alert-centric endpoints (window-level alert rate, time to first trigger, and persistence). Alert burden remains low under stable conditions and increases in a graded, interpretable manner with shift type and severity, supporting scenario-dependent monitoring and risk-tiered governance actions. Temperature scaling improves calibration while preserving rank-based decision behavior, and subgroup disparities remain explicitly auditable. A computable audit layer (MRI-AI, streaming sentinel, and accountability ledger) turns fairness and transparency into actionable controls for diagnostic decision support, enabling auditable monitoring and risk-tiered interventions.
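To make the index and trigger logic concrete, the following is a minimal sketch of how a unit-scaled aggregate and a band-based sentinel could be evaluated per monitoring window. The component names, weights, and band thresholds are illustrative assumptions for exposition, not the specification published in the paper.

```python
# Illustrative sketch of a unit-scaled misdiagnosis risk index and a
# band-based streaming sentinel. Component names, weights, and the
# starter bands below are assumptions, not the paper's exact design.
from dataclasses import dataclass


@dataclass
class AuditSignals:
    shift: float        # label-free distribution-shift score in [0, 1]
    fairness: float     # subgroup-disparity score in [0, 1]
    calibration: float  # calibration-error score in [0, 1]
    interaction: float  # human-AI interaction score in [0, 1], e.g. override rate


# Hypothetical convex weights; a deployment would tune these.
WEIGHTS = {"shift": 0.3, "fairness": 0.3, "calibration": 0.2, "interaction": 0.2}


def mri_ai(sig: AuditSignals) -> float:
    """Unit-scaled index: weighted average of unit-scaled components."""
    return (WEIGHTS["shift"] * sig.shift
            + WEIGHTS["fairness"] * sig.fairness
            + WEIGHTS["calibration"] * sig.calibration
            + WEIGHTS["interaction"] * sig.interaction)


def sentinel_action(index: float, amber: float = 0.4, red: float = 0.7) -> str:
    """Map the index to a governance action via hypothetical starter bands."""
    if index >= red:
        return "stop"    # escalate per stop rule, e.g. pause auto-triage
    if index >= amber:
        return "review"  # trigger human review plus a ledger annotation
    return "log"         # routine accountability-ledger entry


if __name__ == "__main__":
    window = AuditSignals(shift=0.6, fairness=0.5, calibration=0.3, interaction=0.2)
    idx = mri_ai(window)
    print(f"MRI-AI = {idx:.2f} -> action: {sentinel_action(idx)}")
```

Because every component and the index itself live on [0, 1], each window's signals, index value, and resulting action can be written to the accountability ledger as a single comparable record.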
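The claim that temperature scaling improves calibration while preserving rank-based decision behavior can be checked directly. The generic, self-contained example below fits a single temperature by 1-D grid search on synthetic overconfident scores and recomputes ECE and Brier; the data-generating process and fitting routine are our assumptions, not the authors' code.

```python
# Generic temperature-scaling demo: fit one scalar T on logits, then
# verify that ECE/Brier improve while the score ranking is unchanged.
import numpy as np


def ece(p: np.ndarray, y: np.ndarray, n_bins: int = 10) -> float:
    """Expected calibration error with equal-width probability bins."""
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            total += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return total


def nll(T: float, logit: np.ndarray, y: np.ndarray) -> float:
    """Negative log-likelihood of temperature-scaled probabilities."""
    p = 1.0 / (1.0 + np.exp(-logit / T))
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))


rng = np.random.default_rng(0)
q = rng.uniform(0.05, 0.95, 5000)              # true event probabilities
y = (rng.uniform(size=q.size) < q).astype(float)
logit = 3.0 * np.log(q / (1 - q))              # overconfident by a factor of 3

# 1-D grid search for the temperature minimizing NLL (should recover T ~ 3).
grid = np.linspace(0.5, 5.0, 200)
T = grid[np.argmin([nll(t, logit, y) for t in grid])]

p_raw = 1.0 / (1.0 + np.exp(-logit))
p_cal = 1.0 / (1.0 + np.exp(-logit / T))

print(f"T = {T:.2f}")
print(f"ECE   raw {ece(p_raw, y):.3f} -> calibrated {ece(p_cal, y):.3f}")
print(f"Brier raw {np.mean((p_raw - y) ** 2):.3f} -> calibrated {np.mean((p_cal - y) ** 2):.3f}")
# Dividing logits by a positive T is strictly monotone, so the ordering
# of scores (hence AUC and any rank-based triage) is preserved exactly.
assert np.array_equal(np.argsort(p_raw), np.argsort(p_cal))
```

The final assertion makes the rank-preservation point explicit: scaling by a positive temperature changes probability values, and therefore calibration metrics, without reordering any pair of cases.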