Artificial Intelligence Versus Radiologist False Positives on Digital Breast Tomosynthesis Examinations in a Population-Based Screening Program.
Authors
Affiliations (3)
- David Geffen School of Medicine at UCLA, Los Angeles, CA 90095, USA.
- Department of Radiology, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095, USA.
- Department of Bioengineering at UCLA, Los Angeles, CA 90095, USA.
Abstract
<b>Background:</b> Insights into the nature of false-positive findings flagged by contemporary mammography artificial intelligence (AI) systems could inform the potential use of AI to reduce false-positive recall rates.
<b>Objective:</b> To compare AI and radiologists in terms of characteristics of false-positive digital breast tomosynthesis (DBT) examinations in a breast cancer screening population.
<b>Methods:</b> This retrospective study included 2977 women (mean age, 58 years) participating in an observational population-based screening study who underwent 3183 screening DBT examinations from January 2013 to June 2017. A commercial AI tool analyzed the DBT examinations. Positive examinations were defined for AI as an elevated-risk result and for interpreting radiologists as a BI-RADS category 0 assessment. A positive examination was considered false positive if no breast cancer was diagnosed within 1 year. Radiologists re-reviewed the imaging for AI-flagged false-positive findings.
<b>Results:</b> The false-positive rate was 10% for both AI (308/3183) and radiologists (304/3183). Of 541 total false-positive examinations, 233 (43%) were false positives for AI only, 237 (44%) for radiologists only, and 71 (13%) for both. AI-only versus radiologist-only false positives were associated with greater mean patient age (60 vs 52 years, p<.001), lower frequency of dense breasts (24% vs 57%, p<.001), and greater frequencies of a personal history of breast cancer (13% vs 4%, p<.001), prior breast imaging studies (95% vs 78%, p<.001), and prior breast surgical procedures (37% vs 11%, p<.001). The false-positive examinations included 932 AI-only flagged findings, 315 radiologist-only flagged findings, and 49 flagged findings concordant between AI and radiologists. AI-only flagged findings were most commonly benign calcifications (40%), asymmetries (13%), and benign postsurgical change (12%); radiologist-only flagged findings were most commonly masses (47%), asymmetries (19%), and indeterminate calcifications (15%). Of 18 concordant flagged findings undergoing biopsy, 44% yielded high-risk lesions.
<b>Conclusion:</b> Imaging and patient-level differences were observed between AI and radiologist false-positive DBT examinations. Although only a small fraction of false-positive examinations overlapped between AI and radiologists, a high proportion of the biopsied concordant flagged findings were high-risk lesions.
<b>Clinical Impact:</b> The findings may help guide strategies for using AI to improve DBT recall specificity. In particular, concordant findings may represent an enriched subset of actionable abnormalities.