Taming diffusion transformers for high-fidelity MRI super-resolution.
Authors
Affiliations (3)
- Division of Hepatobiliary and Pancreatic Surgery, Department of Surgery, First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, 310003, China.
- College of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, China.
- Division of Hepatobiliary and Pancreatic Surgery, Department of Surgery, First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, 310003, China. Electronic address: [email protected].
Abstract
Recent MRI super-resolution (SR) methods increasingly adopt diffusion models to enhance reconstruction quality. While these approaches have achieved promising improvements in image quality, they still face two challenges. First, the diffusion models employed in these methods are typically built upon U-Net architectures. Although U-Net excels at capturing local structures, it struggles to model the global context and long-range dependencies that are essential for faithfully recovering complex anatomical details; it also exhibits limited scalability when handling images with complex anatomical structures. In contrast, Transformer-based diffusion models offer stronger global reasoning and multi-scale dependency modeling, yet they remain underexplored for MRI super-resolution. Second, existing methods often lack a powerful and efficient decoding mechanism in the latent space, making it difficult to accurately reconstruct high-fidelity MR images from the generated latent representations. To address these challenges, we tame the Diffusion Transformer for MRI super-resolution and propose a novel framework, DiTMSR. Specifically, we design a conditional Diffusion Transformer that progressively denoises a noisy latent input to recover structurally faithful MR latent features. To reconstruct high-fidelity MR images from these latent features, we further introduce a hybrid Mamba decoder with two key components: a content preservation module that retains structural information, and a hybrid Mamba block that combines a MambaVision Mixer with a feedforward network to improve decoding performance and efficiency. Extensive experiments on both public and clinical datasets demonstrate that DiTMSR outperforms state-of-the-art methods.
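To make the decoder description concrete, below is a minimal PyTorch sketch of a "token mixer + feedforward network" block of the kind the abstract describes. This is an illustration under stated assumptions, not the authors' implementation: the abstract gives no internals for the MambaVision Mixer or the content preservation module, so a hypothetical gated depthwise-convolution mixer (PlaceholderMixer) stands in for the real MambaVision Mixer, the content preservation module is omitted, and all dimensions, names, and the pre-norm residual layout are assumptions.

```python
# Hedged sketch of a hybrid Mamba-style decoder block: a sequence mixer
# followed by a feedforward network, each in a pre-norm residual branch.
# PlaceholderMixer is a hypothetical stand-in, NOT the actual MambaVision
# Mixer from the paper or from NVIDIA's MambaVision library.
import torch
import torch.nn as nn


class PlaceholderMixer(nn.Module):
    """Hypothetical gated depthwise-conv mixer standing in for the real one."""

    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)
        # Depthwise 1-D convolution over the token sequence, a common
        # ingredient of Mamba-style mixers (assumption, not from the paper).
        self.dwconv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        u = self.dwconv(u.transpose(1, 2)).transpose(1, 2)  # mix along tokens
        return self.out_proj(u * torch.sigmoid(gate))       # gated output


class HybridMambaBlock(nn.Module):
    """Sequence mixer + feedforward network, each with a pre-norm residual."""

    def __init__(self, dim: int, ffn_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = PlaceholderMixer(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_ratio * dim),
            nn.GELU(),
            nn.Linear(ffn_ratio * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mixer(self.norm1(x))  # token-level (sequence) mixing
        x = x + self.ffn(self.norm2(x))    # per-token channel mixing
        return x


if __name__ == "__main__":
    block = HybridMambaBlock(dim=64)
    latent_tokens = torch.randn(1, 256, 64)  # (batch, tokens, channels)
    print(block(latent_tokens).shape)        # torch.Size([1, 256, 64])
```

The mixer-then-FFN layout mirrors the abstract's pairing of a MambaVision Mixer with a feedforward network; the split assigns sequence-level mixing to the first branch and channel-level refinement to the second, which is the usual motivation for such hybrid blocks.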