Diagnostic accuracy, fairness and clinical implementation of AI for breast cancer screening: results of multicenter retrospective and prospective technical feasibility studies.
Authors
Affiliations (13)
- Google Research, Mountain View, CA, USA. [email protected].
- Google Research, Mountain View, CA, USA.
- Royal Surrey NHS Foundation Trust, Guildford, UK.
- The Royal Marsden NHS Foundation Trust, London, UK.
- University of Surrey, Guildford, UK.
- University of Cambridge, Cambridge, UK.
- Imperial College Healthcare NHS Trust, London, UK.
- St George's University Hospitals NHS Foundation Trust, London, UK.
- Imperial College London, London, UK.
- AIMS Public Engagement Group, London, UK.
- Google Research, Mountain View, CA, USA. [email protected].
- Imperial College London, London, UK. [email protected].
- Imperial College Healthcare NHS Trust, London, UK. [email protected].
Abstract
Artificial intelligence (AI) promises to enhance breast cancer screening. Here we evaluated Google's mammography AI system (version 1.2) across two phases: a retrospective study using 115,973 mammograms from five National Health Service screening services with 39-month follow-up and prospective noninterventional feasibility deployment at 12 sites (9,266 cases). The primary endpoint was AI sensitivity and specificity versus first reader using a 5% noninferiority margin. The secondary endpoints were performance versus second or consensus readers and breast-level analyses. Retrospectively, AI achieved superior sensitivity (0.541 versus 0.437 for first reader, P < 0.001) and noninferior specificity (0.943 versus 0.952, P < 0.001). Cancer detection rate increased from 7.54 to 9.33 per 1,000 women, with AI detecting 25.0% of interval cancers. Performance was particularly strong for first screens (39.3% fewer recalls, 8.8% higher detection) and invasive cancers. No systematic demographic disparities were observed. Simulated second-reader replacement reduced reading time by 32% while increasing detection by 17.7%. Prospective deployment confirmed technical feasibility but revealed a distribution shift requiring threshold recalibration. Implementation requires adaptive calibration and continuous monitoring to ensure safety and equity.
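The headline comparisons can be checked with simple arithmetic on the reported point estimates. The sketch below is illustrative only and is not the study's statistical analysis. It assumes the 5% noninferiority margin is absolute (the abstract does not say whether it is absolute or relative), and it compares point estimates rather than confidence bounds, which a formal noninferiority test would use. The function names `noninferior` and `relative_change` are hypothetical helpers.

```python
def noninferior(ai: float, reader: float, margin: float = 0.05) -> bool:
    """Point-estimate check: AI is less than `margin` below the reader.

    Assumption: the margin is absolute. A formal test would compare the
    lower confidence bound of the difference against -margin instead.
    """
    return (ai - reader) > -margin


def relative_change(old: float, new: float) -> float:
    """Relative change from `old` to `new`, as a fraction of `old`."""
    return (new - old) / old


# Point estimates reported in the abstract (retrospective phase).
sens_ai, sens_reader = 0.541, 0.437
spec_ai, spec_reader = 0.943, 0.952
cdr_before, cdr_after = 7.54, 9.33  # cancers detected per 1,000 women

sens_diff = sens_ai - sens_reader  # absolute gain in sensitivity
spec_diff = spec_ai - spec_reader  # absolute change in specificity
cdr_rel = relative_change(cdr_before, cdr_after)

print(f"Sensitivity difference: {sens_diff:+.3f}")
print(f"Specificity difference: {spec_diff:+.3f}")
print(f"Specificity within margin: {noninferior(spec_ai, spec_reader)}")
print(f"Relative change in detection rate: {cdr_rel:.1%}")
```

On these figures the specificity difference is about 0.9 percentage points, well inside an absolute 5% margin, and the sensitivity gain is about 10.4 percentage points.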