Hybrid Multi-View MRI Fusion for csPCa Diagnosis via Intra- and Inter-View Transformers
Authors
Abstract
Accurate diagnosis of clinically significant prostate cancer (csPCa) from multi-view MRI scans (axial, sagittal, and coronal) is essential for effective treatment planning and improved outcomes. Although deep learning has advanced prostate MRI analysis, many existing approaches adopt late fusion strategies that aggregate one-dimensional feature vectors extracted independently from each view; this discards spatial information and anatomical correspondence across views, ultimately limiting diagnostic performance. While Vision Transformers offer flexibility in processing multi-view patches, their memory requirements scale quadratically with the number of patches, hindering efficient concurrent processing. In contrast, Swin Transformers efficiently capture local features but are typically restricted to single-view processing by their regular-grid input constraints. To overcome these limitations, we propose a hybrid fusion framework that decomposes multi-view information integration into iterative intra-view and inter-view interactions across multiple resolutions. The framework preserves spatial coherence and enables fine-grained feature integration while maintaining computational efficiency. Specifically, the inter-view feature exchange module, based on the Vision Transformer, employs bridge tokens to summarize information from localized patch windows, reducing memory usage while preserving spatial relationships across views. The intra-view feature extraction module, built on the Swin Transformer, facilitates dynamic, attention-driven interactions among image patches and bridge tokens within each window. Moreover, shared positional embeddings are explicitly incorporated to strengthen spatial correspondence across views. Extensive experiments on a public dataset demonstrate the superiority of our method in csPCa classification.
Ablation studies highlight the contributions of individual components, while attention-map visualizations confirm the integration of anatomical structures across views.
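The core mechanism described above can be illustrated in code. The following is a minimal PyTorch sketch, not the authors' implementation: it assumes one learnable bridge token per local patch window, intra-view attention in which the bridge token summarizes its window (with a positional embedding shared across views), and inter-view attention in which bridge tokens from all views exchange information. Dimensions, window size, and head count are placeholder choices.

```python
import torch
import torch.nn as nn

class BridgeTokenFusion(nn.Module):
    """Illustrative sketch of bridge-token multi-view fusion (hypothetical,
    not the paper's code). Each view's patch tokens are split into fixed
    windows; a bridge token attends to the patches in its window
    (intra-view), then bridge tokens at corresponding positions from all
    views attend to one another (inter-view)."""

    def __init__(self, dim=64, window=16, heads=4):
        super().__init__()
        self.window = window
        # One learnable bridge token, expanded to every window.
        self.bridge = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        # Positional embedding shared across views, as in the abstract.
        self.shared_pos = nn.Parameter(torch.randn(1, window, dim) * 0.02)
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, views):
        # views: list of (B, N, D) patch-token tensors, one per MRI view,
        # with N divisible by the window size.
        bridges = []
        for x in views:
            B, N, D = x.shape
            # Partition into windows and add the shared positional embedding.
            w = x.view(B * N // self.window, self.window, D) + self.shared_pos
            b = self.bridge.expand(w.size(0), 1, D)
            # Intra-view: bridge token summarizes its local window.
            b, _ = self.intra(b, w, w)
            bridges.append(b.reshape(B, N // self.window, D))
        # Inter-view: bridge tokens from all views attend to one another.
        z = torch.cat(bridges, dim=1)
        z, _ = self.inter(z, z, z)
        return z  # fused tokens: (B, num_views * N / window, D)
```

Because only one bridge token per window participates in cross-view attention, the inter-view step attends over far fewer tokens than full patch-level attention, which is the memory saving the abstract refers to.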