A data-efficient 3D medical vision-language model using only a 2D encoder.

February 13, 2026

DOI: 10.1038/s41598-026-39526-z PMID: 41688675

Authors

Lian Y, Xie Y, Jiang Y, Wang L, Yu H

Affiliations (6)

Department of Orthopaedics, General Hospital of Northern Theater Command, Shenyang, China.
Department of Orthopaedics, General Hospital of Northern Theater Command, Shenyang, China.
Faculty of Robot Science and Engineering, Northeastern University, Shenyang, China.
CCTEG (Liaoning) Embodied Intelligence Technology Co., Ltd., Shenyang, China.
CCTEG Robot Technology Co., Ltd., Shenzhen, China.
Department of Orthopaedics, General Hospital of Northern Theater Command, Shenyang, China.

Abstract

The success of vision-language models in 2D medical image analysis has motivated extending them to 3D volumetric data for tasks such as report generation and visual question answering. A primary obstacle is that current approaches rely on specialized 3D vision encoders, whose performance is constrained by the scarcity of large-scale annotated datasets. This paper presents a data-efficient framework that bypasses the need for a 3D encoder, instead leveraging a pre-trained 2D vision encoder to process volumetric data. Our pipeline refines the visual representation in three sequential stages. First, a cosine similarity strategy prunes redundant 2D slices to improve computational efficiency. Next, a spatial-frequency fusion module integrates spatial and frequency-domain information to model inter-slice correlations from the 2D features. Finally, a fine-grained feature injection mechanism mitigates information loss during feature compression by re-introducing high-resolution details into the final visual tokens for the large language model. Evaluated on public 3D benchmarks, our framework achieves a METEOR score of 50.13 on M3D-Cap report generation and 82.90% accuracy on M3D-VQA, significantly outperforming previous models. Our work demonstrates a scalable and efficient paradigm for 3D medical vision-language tasks that avoids 3D-specific pre-training and offers an alternative to data-intensive 3D encoders.
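The first pipeline stage, pruning redundant slices by cosine similarity, can be illustrated with a minimal sketch. The abstract does not say which slices are compared or what cutoff is used, so everything here is an assumption made for illustration: the function names `cosine_similarity` and `prune_redundant_slices` are hypothetical, each slice is assumed to be already encoded as a feature vector by the 2D encoder, a slice is compared only against the most recently kept slice, and the 0.95 threshold is arbitrary.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_redundant_slices(
    slice_features: Sequence[Sequence[float]],
    threshold: float = 0.95,  # assumed cutoff, not taken from the paper
) -> List[int]:
    """Return indices of slices to keep.

    Assumed rule: walk the slices in order and drop a slice when its
    similarity to the most recently kept slice is at or above `threshold`.
    """
    if not slice_features:
        return []
    kept = [0]  # always keep the first slice
    for i in range(1, len(slice_features)):
        similarity = cosine_similarity(slice_features[i], slice_features[kept[-1]])
        if similarity < threshold:
            kept.append(i)
    return kept


# Toy per-slice features standing in for 2D encoder outputs.
features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
print(prune_redundant_slices(features))  # [0, 2, 4]
```

Comparing against the last kept slice rather than the immediate neighbor keeps a slow drift across many near-identical slices from being pruned away entirely; whether the paper makes the same choice is not stated in the abstract.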

Topics

Journal Article
