
Commercial Artificial Intelligence Versus Radiologists: NPV and Recall Rate in Large Population-Based Digital Mammography and Tomosynthesis Screening Mammography Cohorts.

Authors

Chen IE, Joines M, Capiro N, Dawar R, Sears C, Sayre J, Chalfant J, Fischer C, Hoyt AC, Hsu W, Milch HS

Affiliations (1)

  • Department of Radiology, University of California, Los Angeles, 200 UCLA Medical Plaza, Los Angeles, CA, 90095.

Abstract

<b>Background:</b> By reliably classifying screening mammograms as negative, artificial intelligence (AI) could minimize radiologists' time spent reviewing high volumes of normal examinations and help prioritize examinations with high likelihood of malignancy.

<b>Objective:</b> To compare performance of AI, classified as positive at different thresholds, with that of radiologists, focusing on NPV and recall rates, in large population-based digital mammography (DM) and digital breast tomosynthesis (DBT) screening cohorts.

<b>Methods:</b> This retrospective single-institution study included women enrolled in the observational population-based Athena Breast Health Network. Stratified random sampling was used to identify cohorts of DM and DBT screening examinations performed from January 2010 through December 2019. Radiologists' interpretations were extracted from clinical reports. A commercial AI system classified examinations as low, intermediate, or elevated risk. Breast cancer diagnoses within 1 year after screening examinations were identified from a state cancer registry. AI and radiologist performance were compared.

<b>Results:</b> The DM cohort included 26,693 examinations in 20,409 women (mean age, 58.1 years). AI classified 58.2%, 27.7%, and 14.0% of examinations as low, intermediate, and elevated risk, respectively. Sensitivity, specificity, recall rate, and NPV for radiologists were 88.6%, 93.3%, 7.2%, and 99.9%; for AI (defining positive results as elevated risk), 74.4%, 86.3%, 14.0%, and 99.8%; and for AI (defining positive results as intermediate/elevated risk), 94.0%, 58.6%, 41.8%, and 99.9%. The DBT cohort included 4824 examinations in 4379 women (mean age, 61.3 years). AI classified 68.1%, 19.8%, and 12.1% of examinations as low, intermediate, and elevated risk, respectively. Sensitivity, specificity, recall rate, and NPV for radiologists were 83.8%, 93.7%, 6.9%, and 99.9%; for AI (defining positive results as elevated risk), 78.4%, 88.4%, 12.1%, and 99.8%; and for AI (defining positive results as intermediate/elevated risk), 89.2%, 68.5%, 31.9%, and 99.8%.

<b>Conclusion:</b> In large DM and DBT cohorts, AI at either diagnostic threshold achieved high NPV but had higher recall rates than radiologists. Defining positive AI results to include intermediate-risk examinations, versus only elevated-risk examinations, detected additional cancers but yielded markedly increased recall rates.

<b>Clinical Impact:</b> The findings support AI's potential to aid radiologists' workflow efficiency. Yet, strategies are needed to address frequent false-positive results, particularly in the intermediate-risk category.
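
For readers less familiar with the four metrics reported above, the following is a minimal sketch of how they are conventionally computed from a 2×2 table of screening calls versus cancer outcomes. It is not the authors' analysis code. The names `ScreeningCounts` and `screening_metrics` are hypothetical, and the counts in the example are invented for illustration, not taken from the study. Recall rate is taken here as the share of all examinations called positive, which matches the abstract, where the AI recall rate at the elevated-risk threshold (14.0%) equals the share of examinations classified as elevated risk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningCounts:
    """2x2 table of screening calls vs. cancer outcome (hypothetical helper)."""
    tp: int  # called positive, cancer diagnosed within follow-up
    fp: int  # called positive, no cancer diagnosed
    fn: int  # called negative, cancer diagnosed within follow-up
    tn: int  # called negative, no cancer diagnosed


def screening_metrics(c: ScreeningCounts) -> dict:
    """Return sensitivity, specificity, recall rate, and NPV as fractions."""
    total = c.tp + c.fp + c.fn + c.tn
    return {
        # Share of cancers that were called positive.
        "sensitivity": c.tp / (c.tp + c.fn),
        # Share of cancer-free examinations that were called negative.
        "specificity": c.tn / (c.tn + c.fp),
        # Share of all examinations called positive (assumed definition).
        "recall_rate": (c.tp + c.fp) / total,
        # Share of negative calls that were truly cancer-free.
        "npv": c.tn / (c.tn + c.fn),
    }


# Invented counts for illustration only; NOT data from the study.
example = ScreeningCounts(tp=8, fp=92, fn=2, tn=898)
metrics = screening_metrics(example)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

In this invented example, cancers are rare (10 of 1000 examinations), so NPV stays close to 100% even though 2 of the 10 cancers are missed. The same low-prevalence effect helps explain why NPV in the abstract varies so little across readers and thresholds while sensitivity and recall rate vary widely.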

Topics

Journal Article
