The evaluation of artificial intelligence in mammography-based breast cancer screening: Is breast-level analysis enough?

Authors

Taib AG, Partridge GJW, Yao L, Darker I, Chen Y

Affiliations (2)

  • Translational Medical Sciences, School of Medicine, University of Nottingham, Clinical Sciences Building, City Hospital Campus, Hucknall Road, Nottingham, NG5 1PB, UK.
  • Translational Medical Sciences, School of Medicine, University of Nottingham, Clinical Sciences Building, City Hospital Campus, Hucknall Road, Nottingham, NG5 1PB, UK. [email protected].

Abstract

To assess whether the diagnostic performance of a commercial artificial intelligence (AI) algorithm for mammography differs between breast-level and lesion-level interpretations, and to compare its performance with that of a large population of specialised human readers. We retrospectively analysed 1200 mammograms from the NHS breast cancer screening programme using a commercial AI algorithm and assessments from 1258 trained human readers from the Personal Performance in Mammographic Screening (PERFORMS) external quality assurance programme. For breasts containing pathologically confirmed malignancies, both breast-level and lesion-level analyses were performed; the latter considered the locations of the regions of interest marked by AI and by human readers, recording the highest score per lesion. For non-malignant breasts, a breast-level analysis recorded the highest score per breast. Area under the curve (AUC), sensitivity and specificity were calculated at the developer's recommended recall threshold. The study was powered to detect a medium-sized effect (odds ratio 3.5 or 0.29) for sensitivity. The test set contained 882 non-malignant breasts (73%) and 318 malignant breasts (27%), with 328 cancer lesions. The AI AUC was 0.942 at breast level and 0.929 at lesion level (difference -0.013, p < 0.01). The mean human AUC was 0.878 at breast level and 0.851 at lesion level (difference -0.027, p < 0.01). By AUC, AI outperformed human readers at both the breast level and the lesion level (both p < 0.01). AI's diagnostic performance decreased significantly at the lesion level, indicating reduced accuracy in localising malignancies; nevertheless, its overall performance exceeded that of human readers.

Question: AI often recalls mammography cases not recalled by humans; to understand why, readers must consider the regions of interest it has marked as cancerous.

Findings: Evaluations of AI typically occur at the breast level, but performance decreases when AI is evaluated at the lesion level. The same holds for human readers.

Clinical relevance: To improve human-AI collaboration, AI should be assessed at the lesion level; poor localisation accuracy may lead to automation bias and unnecessary patient procedures.
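The two aggregation schemes described above can be made concrete with a small sketch. The example below is hypothetical and not the study's actual pipeline: the data are invented, and the AUC is computed with the Mann-Whitney (rank) formulation. At breast level, the highest score anywhere in the breast counts; at lesion level, only marks overlapping the lesion count, so a confident mark in the wrong location inflates breast-level but not lesion-level performance.

```python
def auc(pos, neg):
    """Mann-Whitney AUC: probability that a positive outscores a negative,
    counting ties as half a win."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical malignant breasts: each AI mark is (score, hits_lesion).
malignant = [
    {"marks": [(0.95, True)]},                 # correctly localised
    {"marks": [(0.90, False), (0.30, True)]},  # high score off-lesion
    {"marks": [(0.80, True), (0.60, False)]},
]
# Hypothetical non-malignant breasts: highest score per breast.
normal_scores = [0.20, 0.40, 0.55, 0.10]

# Breast level: highest score anywhere in the breast.
breast_pos = [max(s for s, _ in b["marks"]) for b in malignant]
# Lesion level: highest score among marks hitting the lesion (0 if none).
lesion_pos = [max((s for s, hit in b["marks"] if hit), default=0.0)
              for b in malignant]

breast_auc = auc(breast_pos, normal_scores)  # 1.0 on this toy data
lesion_auc = auc(lesion_pos, normal_scores)  # ~0.833 on this toy data
```

On this toy data the mislocalised mark in the second breast drops the lesion-level AUC below the breast-level AUC, mirroring the direction of the effect reported for both AI and human readers.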

Topics

Journal Article
