MM-FD-ConvFormer: a multimodal frequency-aware deformable CNN-Transformer network for robust brain tumor classification
Authors
Affiliations (6)
- Department of Data Science and Analytics, College of Computing, Grand Valley State University, Michigan, USA.
- School of Computing Science and Engineering, Galgotias University, Greater Noida, UP, India. [email protected].
- Department of Computer Science, College of Computers and Information Technology, Taif University, P. O. Box 11099, Taif, 21944, Saudi Arabia.
- Department of Information Technology, College of Computers and Information Technology, Taif University, Taif, 21974, Saudi Arabia.
- Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, 21589, Saudi Arabia.
- Research Department, Arba Minch University, Arba Minch, Ethiopia. [email protected].
Abstract
Accurate brain tumor classification from magnetic resonance imaging (MRI) is critical for early diagnosis and effective clinical decision-making. Although recent CNN-Transformer hybrid models have shown promising performance, most approaches rely primarily on single-modal spatial information, limiting their ability to capture complementary spectral features, model tumor heterogeneity, and generalize across datasets. To address these challenges, this paper proposes MM-FD-ConvFormer, a multimodal frequency-aware deformable CNN-Transformer network for robust brain tumor classification with enhanced interpretability. The proposed model integrates three complementary modalities: (1) spatial MRI representations from original images, (2) frequency-domain MRI representations obtained via Fourier or wavelet transforms to capture texture and intensity variations, and (3) multi-scale contextual features for modeling global dependencies. A ConvNeXt V2 backbone is employed to extract discriminative spatial features, while a parallel lightweight ConvNeXt-based branch processes frequency-domain inputs. These features are subsequently fused and refined using a Swin Transformer V2 to capture long-range contextual relationships. To effectively integrate heterogeneous modalities and adapt to irregular tumor boundaries, a deformable cross-modal attention mechanism is introduced, enabling dynamic, shape-aware feature fusion. Final classification is performed on a unified multimodal representation, with an optional uncertainty-aware prediction head to improve reliability. The proposed model is evaluated on multiple public datasets, including the Kaggle Brain Tumor MRI and Figshare datasets for training, with external validation on the clinically relevant BraTS 2020/2021 dataset and optional testing on TCIA/REMBRANDT to assess cross-dataset generalization. Extensive experiments demonstrate that MM-FD-ConvFormer consistently outperforms standard CNN baselines, advanced transformer-based models, and hybrid approaches in terms of accuracy, macro-F1 score, and AUC. Furthermore, qualitative analyses using Grad-CAM, attention map visualization, and weakly supervised pseudo-segmentation provide interpretable insights into tumor localization and model decision-making. Overall, MM-FD-ConvFormer offers a robust, interpretable, and generalizable solution for automated brain tumor classification in real-world clinical imaging applications.
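To make the described pipeline concrete, the sketch below illustrates one plausible realization of the multimodal design in PyTorch: a 2D Fourier log-magnitude spectrum as the frequency-domain input (one of the two transforms the abstract mentions), two lightweight convolutional branches standing in for the ConvNeXt V2 backbones, and standard multi-head attention as a placeholder for the paper's deformable cross-modal attention. All layer widths, depths, names (`ConvBranch`, `MMFDConvFormerSketch`), and the four-class output are illustrative assumptions, not the authors' published configuration.

```python
import torch
import torch.nn as nn


def frequency_representation(x: torch.Tensor) -> torch.Tensor:
    """Map a spatial MRI batch (B, C, H, W) to a frequency-domain input:
    centered 2D Fourier log-magnitude spectrum (one option named in the abstract)."""
    spectrum = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    return torch.log1p(spectrum.abs())


class ConvBranch(nn.Module):
    """Stand-in for a ConvNeXt-style feature extractor (spatial or frequency branch).
    The real model uses ConvNeXt V2; this is a minimal depthwise-conv approximation."""
    def __init__(self, in_ch: int, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, kernel_size=4, stride=4),                 # patchify stem
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),      # depthwise 7x7, ConvNeXt-like
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MMFDConvFormerSketch(nn.Module):
    """Hypothetical two-branch fusion sketch; placeholder components throughout."""
    def __init__(self, num_classes: int = 4, dim: int = 256):
        super().__init__()
        self.spatial = ConvBranch(1, dim)      # spatial MRI branch
        self.frequency = ConvBranch(1, dim)    # lightweight frequency branch
        # Ordinary multi-head attention standing in for the paper's
        # deformable cross-modal attention (which adapts sampling to tumor shape).
        self.fusion = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_spatial = self.spatial(x).unsqueeze(1)                              # (B, 1, dim)
        f_freq = self.frequency(frequency_representation(x)).unsqueeze(1)     # (B, 1, dim)
        fused, _ = self.fusion(f_spatial, f_freq, f_freq)  # spatial queries attend to frequency features
        return self.head(fused.squeeze(1))


# Usage on a dummy single-channel MRI batch:
logits = MMFDConvFormerSketch()(torch.randn(2, 1, 224, 224))
print(logits.shape)  # torch.Size([2, 4])
```

The two-branch layout mirrors the abstract's fusion order (CNN features first, attention-based cross-modal fusion second); the Swin Transformer V2 refinement stage and the uncertainty-aware head are omitted here for brevity.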