Transformer-based Deep Learning Models with Shape Guidance for Predicting Breast Cancer in Mammography Images.
Affiliations (6)
- Department of Radiological Imaging and Informatics, Tohoku University Graduate School of Medicine, 2-1 Seiryo-Machi, Aoba-Ku, Sendai, Miyagi, 980-8575, Japan.
- Tohoku University Advanced Institute of So-Go-Chi (Convergence Knowledge) Informatics, 2-1-1 Katahira, Aoba-Ku, Sendai, Miyagi, 980-0812, Japan. [email protected].
- Center for Data-Driven Science and Artificial Intelligence, Tohoku University, 41 Kawauchi, Aoba-Ku, Sendai, Miyagi, 980-8576, Japan.
- Department of Diagnostic Radiology, Tohoku University Hospital, 1-1 Seiryo-Machi, Aoba-Ku, Sendai, 980-8574, Japan.
- Department of Diagnostic Imaging, Tohoku University Graduate School of Medicine, 2-1 Seiryo-Machi, Aoba-Ku, Sendai, Miyagi, 980-8575, Japan.
- Intelligent Biomedical Systems Engineering Laboratory, Graduate School of Biomedical Engineering, Tohoku University, 2-1 Seiryo-Machi, Aoba-Ku, Sendai, Miyagi, 980-8575, Japan.
Abstract
Recent breast cancer research has investigated shape-based attention guidance in Vision Transformer (ViT) models, focusing on anatomical structures and the heterogeneity surrounding tumors. However, few studies have clarified the optimal transformer encoder layer stage at which to apply attention guidance. Our study aimed to evaluate the effectiveness of shape-guidance strategies by varying the combinations of encoder layers that guide attention to breast structures and by comparing the proposed models with conventional models. For the shape-guidance strategy, we applied breast masks to the attention mechanism to emphasize spatial dependencies and enhance the learning of positional relationships within breast anatomy. We then compared the representative models, namely the Masked Transformer models that performed best across layer combinations, with the conventional ResNet50, ViT, and Swin Transformer V2 (SwinT V2). A total of 2,436 publicly available mammography images from the Chinese Mammography Database (CMMD), obtained via The Cancer Imaging Archive, were analyzed. Three-fold cross-validation was employed, with a patient-wise split of 70% for training and 30% for validation. Model performance in differentiating breast cancer from non-cancer images was assessed by the area under the receiver operating characteristic curve (AUROC). Applying masks at the Shallow and Deep stages yielded the highest AUROC for the Masked ViT, which achieved an AUROC of 0.885 [95% confidence interval: 0.849-0.918], a sensitivity of 0.876, and a specificity of 0.802, outperforming all conventional models. These results indicate that incorporating mask guidance into specific Transformer encoder layers promotes representation learning, highlighting the potential of such models as decision-support tools in breast cancer diagnosis.
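To make the shape-guidance idea concrete, below is a minimal PyTorch sketch of one plausible reading: a binary breast mask, resampled to the ViT patch grid, biases self-attention toward in-breast tokens at selected encoder stages. The module name `MaskedAttention`, the additive-bias formulation, and the example layer indices are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of mask-guided self-attention (illustrative; not the paper's code).
# Assumption: a binary breast mask has been downsampled to one value per patch token.
import torch
import torch.nn as nn


class MaskedAttention(nn.Module):
    """Multi-head self-attention with an optional breast-mask bias (hypothetical)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, patch_mask=None):
        # x: (B, N, C) patch tokens; patch_mask: (B, N), 1 inside the breast, 0 outside.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N) logits
        if patch_mask is not None:
            # Additive bias: push attention logits for out-of-breast keys toward -inf,
            # so softmax concentrates on anatomically relevant tokens.
            attn = attn + (1.0 - patch_mask)[:, None, None, :] * -1e4
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Example of a "Shallow and Deep" configuration in a 12-layer encoder:
# the mask bias is active only in the first and last four blocks (indices assumed).
guided_layers = set(range(0, 4)) | set(range(8, 12))
```

Under this reading, "Shallow and Deep" guidance simply means `patch_mask` is passed to `MaskedAttention` in the early and late blocks while the middle blocks use plain self-attention; other bias strengths, or multiplicative masking of the attention weights, would be equally plausible variants.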