Read like a radiologist: Efficient vision-language model for 3D medical imaging interpretation.

April 8, 2026

Authors

Lee C, Park S, Shin CI, Choi WH, Park HJ, Lee JE, Ye JC

Affiliations (7)

  • Kim Jaechul Graduate School of AI, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea.
  • Department of Radiation Oncology, Yonsei Cancer Center, Heavy Ion Therapy Research Institute, Yonsei University College of Medicine, Seoul, Republic of Korea; Yonsei Institute for Digital Health, Yonsei University, Seoul, Republic of Korea.
  • Department of Radiology, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Republic of Korea.
  • Division of Nuclear Medicine, Department of Radiology, St. Vincent's Hospital, College of Medicine, The Catholic University of Korea, Suwon, Republic of Korea.
  • Department of Radiology, Chung-Ang University Hospital, Chung-Ang University College of Medicine, Seoul, Republic of Korea.
  • Department of Radiology, Chungnam National University Hospital, Chungnam National University College of Medicine, Daejeon, Republic of Korea. Electronic address: [email protected].
  • Kim Jaechul Graduate School of AI, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea. Electronic address: [email protected].

Abstract

Recent medical vision-language models (VLMs) have shown promise in 2D medical image interpretation. However, extending them to 3D medical imaging has been challenging because of computational complexity and data scarcity. Although a few recent VLMs designed for 3D medical imaging have emerged, all of them learn the volumetric representation of a 3D medical image as a set of sub-volumetric features. This process introduces overly correlated representations along the z-axis that neglect slice-specific clinical details, particularly for 3D medical images in which adjacent slices have low redundancy. To address this limitation, we introduce MS-VLM, which mimics the radiologist's workflow in 3D medical image interpretation. Specifically, radiologists analyze 3D medical images by examining individual slices sequentially and synthesizing information across slices and views. Likewise, MS-VLM leverages self-supervised 2D transformer encoders to learn a volumetric representation that captures inter-slice dependencies from a sequence of slice-specific features. Unbound by sub-volumetric patchification, MS-VLM can obtain useful volumetric representations from 3D medical images with any number of slices and from multiple images acquired in different planes and phases. We evaluate MS-VLM on the publicly available chest CT dataset CT-RATE and on an in-house rectal MRI dataset. In both scenarios, MS-VLM surpasses existing methods in radiology report generation, producing more coherent and clinically relevant reports. These findings highlight the potential of MS-VLM to advance 3D medical image interpretation and improve the robustness of medical VLMs.
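
The slice-wise idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear projection stands in for a pretrained 2D transformer encoder, the single unparameterized attention layer stands in for whatever inter-slice module MS-VLM uses, and all names (`encode_slice`, `self_attention`, `encode_volume`, `encode_study`) and sizes are hypothetical. It only shows why encoding each slice separately and then attending over the slice sequence places no constraint on the number of slices or the number of input series.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, DIM = 16, 16, 32  # toy slice size and feature width (assumed values)
# Hypothetical stand-in for a 2D encoder: one fixed random linear projection.
proj = rng.standard_normal((H * W, DIM)) / np.sqrt(H * W)


def encode_slice(slice_2d):
    """Map one 2D slice of shape (H, W) to a feature vector of shape (DIM,)."""
    return slice_2d.reshape(-1) @ proj


def self_attention(tokens):
    """Single-head scaled dot-product self-attention without learned weights.

    tokens: (num_tokens, DIM). Each output row mixes information from every slice.
    """
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens


def encode_volume(volume):
    """Encode a volume of shape (num_slices, H, W) slice by slice, then mix across slices."""
    slice_feats = np.stack([encode_slice(s) for s in volume])
    return self_attention(slice_feats)


def encode_study(volumes):
    """Encode several series (e.g. different planes or phases) as one slice sequence."""
    slice_feats = np.concatenate(
        [np.stack([encode_slice(s) for s in v]) for v in volumes]
    )
    return self_attention(slice_feats)


# Volumes with different slice counts go through the same code path.
feats_short = encode_volume(rng.standard_normal((12, H, W)))
feats_long = encode_volume(rng.standard_normal((40, H, W)))
feats_study = encode_study(
    [rng.standard_normal((12, H, W)), rng.standard_normal((20, H, W))]
)
print(feats_short.shape, feats_long.shape, feats_study.shape)
```

Because the sequence dimension is simply the number of slices, nothing in the sketch depends on a fixed depth, in contrast to a scheme that cuts the volume into fixed-size 3D patches.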

Topics

  • Imaging, Three-Dimensional
  • Image Interpretation, Computer-Assisted
  • Journal Article
