Diagnostic accuracy, fairness and clinical implementation of AI for breast cancer screening: results of multicenter retrospective and prospective technical feasibility studies.

March 10, 2026

Authors

Kelly CJ, Wilson M, Warren LM, Sidebottom R, Halling-Brown M, Yang L, Morigami M, Venton J, Chopra R, Chang J, Dixon J, Gilbert FJ, Golden DI, Gruzewska E, Honeyfield L, Hujan A, Khodabakhshi D, Lewis E, Malhotra N, Mallya R, Ogunleye D, Purdy C, Sayres R, Sieniek M, Stoycheva T, Sy A, Thomas S, Ward D, Xi L, Xu S, Shetty S, Darzi A, Young K, Purushothaman H, Khoo L, Reddy M, Ashrafian H, Cunningham D

Affiliations (13)

  • Google Research, Mountain View, CA, USA. [email protected].
  • Google Research, Mountain View, CA, USA.
  • Royal Surrey NHS Foundation Trust, Guildford, UK.
  • The Royal Marsden NHS Foundation Trust, London, UK.
  • University of Surrey, Guildford, UK.
  • University of Cambridge, Cambridge, UK.
  • Imperial College Healthcare NHS Trust, London, UK.
  • St George's University Hospitals NHS Foundation Trust, London, UK.
  • Imperial College London, London, UK.
  • AIMS Public Engagement Group, London, UK.
  • Google Research, Mountain View, CA, USA. [email protected].
  • Imperial College London, London, UK. [email protected].
  • Imperial College Healthcare NHS Trust, London, UK. [email protected].

Abstract

Artificial intelligence (AI) promises to enhance breast cancer screening. Here we evaluated Google's mammography AI system (version 1.2) across two phases: a retrospective study using 115,973 mammograms from five National Health Service screening services with 39-month follow-up, and a prospective noninterventional feasibility deployment at 12 sites (9,266 cases). The primary endpoint was AI sensitivity and specificity versus the first reader using a 5% noninferiority margin. The secondary endpoints were performance versus second or consensus readers and breast-level analyses. Retrospectively, AI achieved superior sensitivity (0.541 versus 0.437 for first reader, P < 0.001) and noninferior specificity (0.943 versus 0.952, P < 0.001). Cancer detection rate increased from 7.54 to 9.33 per 1,000 women, with AI detecting 25.0% of interval cancers. Performance was particularly strong for first screens (39.3% fewer recalls, 8.8% higher detection) and invasive cancers. No systematic demographic disparities were observed. Simulated second-reader replacement reduced reading time by 32% while increasing detection by 17.7%. Prospective deployment confirmed technical feasibility but revealed a distribution shift requiring threshold recalibration. Implementation requires adaptive calibration and continuous monitoring to ensure safety and equity.
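The retrospective figures quoted in the abstract can be checked with simple arithmetic. The following is a minimal sketch, not the study's statistical analysis: it compares point estimates only, whereas a formal noninferiority test uses confidence intervals. It also assumes the 5% margin is an absolute difference, which the abstract does not specify. The function names are illustrative.

```python
# Point estimates quoted in the abstract (retrospective phase).
AI_SENSITIVITY = 0.541
FIRST_READER_SENSITIVITY = 0.437
AI_SPECIFICITY = 0.943
FIRST_READER_SPECIFICITY = 0.952

# Assumption: the 5% noninferiority margin is an absolute difference.
MARGIN = 0.05


def difference(ai: float, reader: float) -> float:
    """Absolute difference (AI minus reader), rounded to three decimals."""
    return round(ai - reader, 3)


def within_margin(ai: float, reader: float, margin: float = MARGIN) -> bool:
    """True if the AI point estimate is less than `margin` below the reader's.

    This is a point-estimate check only, not a formal noninferiority test.
    """
    return (ai - reader) > -margin


def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a fraction."""
    return (after - before) / before


sens_diff = difference(AI_SENSITIVITY, FIRST_READER_SENSITIVITY)
spec_diff = difference(AI_SPECIFICITY, FIRST_READER_SPECIFICITY)
spec_ok = within_margin(AI_SPECIFICITY, FIRST_READER_SPECIFICITY)

# Cancer detection rate per 1,000 women, before and after, from the abstract.
cdr_change = relative_change(7.54, 9.33)

print(f"Sensitivity difference: {sens_diff:+.3f}")
print(f"Specificity difference: {spec_diff:+.3f} (within margin: {spec_ok})")
print(f"Relative change in cancer detection rate: {cdr_change:.1%}")
```

Under these assumptions, the specificity gap of 0.009 sits well inside the margin, and the change in detection rate from 7.54 to 9.33 per 1,000 women corresponds to a relative increase of roughly 24%.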

Topics

Journal Article
