Performance across different versions of an artificial intelligence model for screen-reading of mammograms.
Authors
Affiliations (14)
- Department of Breast Cancer Screening, The Cancer Registry, Norwegian Institute of Public Health, Oslo, Norway.
- Department of Radiology, University of Wisconsin-Madison School of Medicine and Public Health, Madison, WI, USA.
- Department of Radiology and Nuclear Medicine, St Olavs University Hospital, Trondheim, Norway.
- Department of Radiology, Møre og Romsdal Hospital Trust, Ålesund, Norway.
- Department of Health Sciences in Ålesund, Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology (NTNU), Trondheim, Norway.
- Department of Registry Informatics, The Cancer Registry, Norwegian Institute of Public Health, Oslo, Norway.
- Diagnostic Radiology, Translational Medicine, Lund University, Lund, Sweden.
- Unilabs Mammography Unit, Skåne University Hospital, Malmö, Sweden.
- School of Medicine, University of Nottingham, Clinical Science Building, Nottingham City Hospital, Nottingham, United Kingdom.
- The Cancer Registry, Norwegian Institute of Public Health, Oslo, Norway.
- Institute of Basic Medical Sciences, University of Oslo, Oslo, Norway.
- Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA.
- Department of Breast Cancer Screening, The Cancer Registry, Norwegian Institute of Public Health, Oslo, Norway. [email protected].
- Department of Health and Care Sciences, Faculty of Health Sciences, UiT The Arctic University of Norway, Tromsø, Norway. [email protected].
Abstract
Studies have reported promising results regarding artificial intelligence (AI) as a tool for improving interpretive performance in mammographic screening. We analyzed AI malignancy risk scores from two versions of the same commercial AI model. This retrospective cohort study used data from 117,709 screening examinations performed in BreastScreen Norway in 2009-2018. The mammograms were processed by two versions of the commercially available AI model Transpara (versions 1.7 and 2.1). The distributions of exam-level risk scores (AI score 1-10) and risk categories were evaluated for both AI versions on all examinations, including 737 screen-detected and 200 interval cancers. Scores of 1-7 were categorized as low risk, 8-9 as intermediate risk, and 10 as high risk of malignancy. The area under the receiver operating characteristic curve was 0.908 (95% CI: 0.896-0.920) for version 1.7 and 0.928 (95% CI: 0.917-0.939) for version 2.1 when screen-detected and interval cancers were considered positive cases (p < 0.001). A total of 87.1% (642/737) and 93.5% (689/737) of the screen-detected cancers had an AI score of 10 with versions 1.7 and 2.1, respectively. Among interval cancers, 45.0% (90/200) had an AI score of 10 with version 1.7 and 44.5% (89/200) with version 2.1. A higher proportion of screen-detected breast cancers received the highest AI score of 10 with the newer version of the AI model compared to the older version. For interval cancers, there was no difference between the two versions in the proportion of cases assigned the highest score.

Question
Studies have reported promising results regarding the use of AI in mammography screening, but comparisons of updated versus older versions of the same model are less studied.

Findings
In our study, 87.1% (642/737) of the screen-detected cancers were classified with a high malignancy risk score by the older version, compared with 93.5% (689/737) for the newer version.

Clinical relevance
Understanding how version updates of AI models might impact screening mammography performance will be important for future quality assurance and validation of AI models.
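For readers who want to reproduce this style of analysis on their own data, the sketch below (Python) illustrates the exam-level workflow the abstract describes: mapping AI scores of 1-10 to the study's three risk categories, computing the AUC with screen-detected and interval cancers as positive cases, and tallying the proportion of cancers assigned the top score of 10. This is not the authors' code; the data are synthetic, and the percentile-bootstrap confidence interval is an assumption, since the abstract does not state how its 95% CIs or the version comparison (p < 0.001) were computed.

```python
# Minimal sketch of the exam-level analysis described in the abstract.
# NOT the authors' code: data are synthetic, and the bootstrap CI is an
# assumption (the abstract does not state how the 95% CIs were obtained).
import numpy as np
from sklearn.metrics import roc_auc_score

def risk_category(score: int) -> str:
    """Map an exam-level AI score (1-10) to the study's risk categories."""
    if score <= 7:
        return "low"           # scores 1-7
    if score <= 9:
        return "intermediate"  # scores 8-9
    return "high"              # score 10

rng = np.random.default_rng(0)

# Hypothetical exam-level data: one AI score (1-10) per examination;
# y = 1 when the exam is a screen-detected or interval cancer (positive case).
scores = rng.integers(1, 11, size=5000)
y = (scores >= 8) & (rng.random(5000) < 0.4)  # toy labels for illustration

auc = roc_auc_score(y, scores)
print(f"AUC: {auc:.3f}")

# Percentile-bootstrap 95% CI for the AUC (one common choice of method).
boot = []
n = len(y)
for _ in range(2000):
    idx = rng.integers(0, n, size=n)
    if y[idx].any() and not y[idx].all():  # need both classes in resample
        boot.append(roc_auc_score(y[idx], scores[idx]))
print(f"95% CI: ({np.percentile(boot, 2.5):.3f}, {np.percentile(boot, 97.5):.3f})")

# Proportion of cancers assigned the highest score (10), reported in the
# study separately for screen-detected and interval cancers.
prop_top = (scores[y] == 10).mean()
print(f"Proportion of cancers with AI score 10: {prop_top:.1%}")
```

To compare two model versions as in the study, the same AUC computation would be run on each version's scores for the identical set of examinations; a paired method such as the DeLong test is a common choice for comparing correlated AUCs, though the abstract does not specify which test produced its p-value.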