A data-efficient 3D medical vision-language model using only a 2D encoder.
Authors
Affiliations (6)
- Department of Orthopaedics, General Hospital of Northern Theater Command, Shenyang, China.
- Department of Orthopaedics, General Hospital of Northern Theater Command, Shenyang, China. [email protected].
- Faculty of Robot Science and Engineering, Northeastern University, Shenyang, China.
- CCTEG (Liaoning) Embodied Intelligence Technology Co., Ltd., Shenyang, China.
- CCTEG Robot Technology Co., Ltd., Shenzhen, China.
- Department of Orthopaedics, General Hospital of Northern Theater Command, Shenyang, China. [email protected].
Abstract
The success of Vision-Language Models in 2D medical image analysis has motivated extending them to 3D volumetric data for tasks such as report generation and visual question answering. A primary obstacle is that current approaches rely on specialized 3D vision encoders, whose performance is constrained by the scarcity of large-scale annotated datasets. This paper presents a data-efficient framework that bypasses the need for a 3D encoder and instead uses a pre-trained 2D vision encoder to process volumetric data. The pipeline refines the visual representation in three stages. First, a cosine-similarity strategy prunes redundant 2D slices to improve computational efficiency. Next, a spatial-frequency fusion module integrates spatial and frequency-domain information to model inter-slice correlations from the 2D features. Finally, a fine-grained feature injection mechanism mitigates information loss during feature compression by re-introducing high-resolution details into the final visual tokens passed to the Large Language Model. Evaluated on public 3D benchmarks, our framework achieves a METEOR score of 50.13 on M3D-Cap report generation and 82.90% accuracy on M3D-VQA, significantly outperforming previous models. Our work demonstrates a scalable and efficient paradigm for 3D medical vision-language tasks that avoids 3D-specific pre-training and offers an alternative to data-intensive 3D encoders.
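The first stage, slice pruning by cosine similarity, can be illustrated with a minimal sketch. The abstract does not state the comparison rule or the threshold, so everything below is an assumption made for illustration: each slice is assumed to be summarized by a single feature vector from the 2D encoder, a slice is compared with the most recently kept slice, and the function name `prune_redundant_slices` and the threshold of 0.95 are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_redundant_slices(
    features: Sequence[Sequence[float]], threshold: float = 0.95
) -> List[int]:
    """Return the indices of the slices to keep.

    Assumed rule (not taken from the paper): a slice is dropped when its
    feature vector has cosine similarity >= `threshold` with the most
    recently kept slice.
    """
    if not features:
        return []
    kept = [0]  # the first slice always serves as the initial reference
    for i in range(1, len(features)):
        if cosine_similarity(features[kept[-1]], features[i]) < threshold:
            kept.append(i)
    return kept


# Toy per-slice feature vectors standing in for 2D encoder outputs.
slice_features = [
    [1.0, 0.0, 0.0],
    [0.99, 0.01, 0.0],  # nearly identical to slice 0
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.02],  # nearly identical to slice 2
    [0.0, 0.0, 1.0],
]
kept_indices = prune_redundant_slices(slice_features, threshold=0.95)
print(kept_indices)  # [0, 2, 4]
```

Comparing against the last kept slice rather than the immediately preceding one prevents a slow drift across many near-identical slices from discarding an entire region of the volume; other reference choices are equally plausible and may be what the authors use.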