Capturing Finer-grained Long-range Dependency for Dense Prediction in Medical Images: An Empirical Investigation of MLPs
Authors
Abstract
Dense prediction is a fundamental problem in medical image analysis. As Convolutional Neural Networks (CNNs) are limited by the intrinsic locality of convolution operations, transformers, which can capture long-range visual dependency, have been widely adopted for dense prediction. However, due to the high computation and memory loads of self-attention operations, transformers are typically applied at downsampled resolutions (e.g., after patch embedding) and therefore cannot effectively leverage the tissue-level textural information that is recognizable only in high-resolution image features (e.g., at full or half of the image resolution). This textural information is crucial for differentiating subtle human anatomy/pathology in medical images. In this study, we hypothesize that Multi-Layer Perceptrons (MLPs) are superior alternatives to transformers for medical dense prediction, as they can capture finer-grained long-range dependency in higher-resolution features under equal computation/memory constraints. To validate this hypothesis, we conducted a comprehensive empirical investigation of MLPs in various medical scenarios. We built a hierarchical MLP framework that applies MLPs to extract image feature pyramids beginning from the full image resolution, and then evaluated it with various MLP blocks on diverse dense prediction tasks, including medical image restoration, registration, and segmentation. Extensive experiments on six public datasets show that applying MLPs at higher resolutions yields superior performance over CNN- and transformer-based counterparts across all evaluation tasks. Our findings suggest that MLPs can serve as medical vision backbones superior to CNNs and transformers, with significant potential to influence future model designs for medical dense prediction.
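As a rough illustration of the resolution argument above, the following sketch counts how the pairwise attention map of global self-attention grows with feature resolution, and contrasts it with the cost of a simple axial (row-wise plus column-wise) token-mixing MLP. The function names, the 256×256 example size, and the axial mixing scheme are assumptions chosen for illustration; they are not the MLP blocks evaluated in this study, and channel dimensions, attention heads, and windowed-attention variants are ignored.

```python
def num_tokens(height: int, width: int, downsample: int) -> int:
    """Number of spatial tokens after downsampling each side by `downsample`."""
    return (height // downsample) * (width // downsample)


def attention_map_entries(height: int, width: int, downsample: int) -> int:
    """Entries in one global self-attention map: every token attends to every token."""
    n = num_tokens(height, width, downsample)
    return n * n


def axial_mlp_mixing_macs(height: int, width: int, downsample: int) -> int:
    """Multiply-accumulates (per channel) for a hypothetical axial token-mixing MLP.

    Mixing along the width costs h * w * w, and mixing along the height
    costs w * h * h, giving h * w * (h + w) in total.
    """
    h, w = height // downsample, width // downsample
    return h * w * (h + w)


if __name__ == "__main__":
    for ds in (1, 2, 4):
        print(
            f"downsample={ds}: "
            f"tokens={num_tokens(256, 256, ds)}, "
            f"attention entries={attention_map_entries(256, 256, ds)}, "
            f"axial MLP MACs/channel={axial_mlp_mixing_macs(256, 256, ds)}"
        )
```

Under these assumptions, the attention map at full resolution is far larger than at 1/4 resolution, while the axial mixing cost at full resolution stays within a small factor of the attention map at 1/4 resolution, which is the kind of trade-off the hypothesis relies on.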