Locally calibrated error rates improve interpretability of AI scores and influence radiologist decision-making
Authors
Affiliations (1)
- Brown University Health
Abstract
Introduction
Artificial intelligence (AI) systems in radiology commonly generate case-level numeric scores intended to reflect the likelihood of underlying pathology. However, these scores are often difficult to interpret in clinical practice. We propose a framework for translating AI scores into clinically meaningful, locally calibrated error probabilities by providing the corresponding false discovery rate (FDR) and false omission rate (FOR) at each score threshold.
Methods
Using an open-source mammography AI model (Mirai), we estimated score-specific FDR and FOR across a range of thresholds using a retrospective cohort of 130,712 digital screening mammograms (907 positive, 129,805 negative). We then conducted a decision-making study to evaluate whether presenting FDR/FOR alongside AI scores influenced radiologist recall recommendations and confidence compared with AI scores alone.
Results
FDR and FOR varied substantially across AI score thresholds, ranging from 60.87% and 0.03%, respectively, at the low end of the score distribution to 99.26% and 0.65% at the high end. In the decision-making study (n=21; 20 assessments per radiologist), recall increased with AI score in both conditions; however, recall was higher when AI scores were presented alone compared with scores accompanied by FDR/FOR (odds ratio 2.9, 95% CI [1.331, 6.417], p=0.0077). Confidence followed a U-shaped relationship with score and was higher when FDR/FOR were provided, particularly at intermediate scores.
Conclusion
Locally calibrated FDR and FOR provide a practical approach for translating AI scores into clinically interpretable probabilities. Presenting these measures alongside AI scores improves interpretability and is associated with changes in radiologist decision-making, supporting their use as a framework for clinical implementation of AI.
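For readers unfamiliar with the two error rates, the following is a minimal sketch of how FDR and FOR can be computed at a given score threshold from a labeled local cohort. It uses the standard definitions FDR = FP / (FP + TP) and FOR = FN / (FN + TN), and assumes a case is flagged when its score is at or above the threshold. The abstract does not describe the authors' exact estimation procedure, so the function name `fdr_for_at_threshold`, the flagging rule, and the toy data are all illustrative assumptions, not the study's code or data.

```python
def fdr_for_at_threshold(scores, labels, threshold):
    """Return (FDR, FOR) when cases with score >= threshold are flagged.

    FDR = FP / (FP + TP): share of flagged cases that are truly negative.
    FOR = FN / (FN + TN): share of unflagged cases that are truly positive.
    A rate is None when its denominator is zero (no cases on that side).
    Hypothetical helper for illustration; not taken from the study.
    """
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    fdr = fp / (fp + tp) if (fp + tp) else None
    for_ = fn / (fn + tn) if (fn + tn) else None
    return fdr, for_


# Synthetic toy data (NOT from the study): 1 = positive, 0 = negative.
toy_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95]
toy_labels = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

for t in (0.25, 0.50, 0.75):
    fdr, for_ = fdr_for_at_threshold(toy_scores, toy_labels, t)
    print(f"threshold={t:.2f}  FDR={fdr:.3f}  FOR={for_:.3f}")
```

Because both rates depend on disease prevalence, the same threshold can yield different FDR and FOR values at different sites, which is why estimating them on a local cohort matters.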