Read like a radiologist: Efficient vision-language model for 3D medical imaging interpretation.

April 8, 2026


DOI: 10.1016/j.media.2026.104077 PMID: 41990528

Authors

Lee C, Park S, Shin CI, Choi WH, Park HJ, Lee JE, Ye JC

Affiliations (7)

Kim Jaechul Graduate School of AI, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea.
Department of Radiation Oncology, Yonsei Cancer Center, Heavy Ion Therapy Research Institute, Yonsei University College of Medicine, Seoul, Republic of Korea; Yonsei Institute for Digital Health, Yonsei University, Seoul, Republic of Korea.
Department of Radiology, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Republic of Korea.
Division of Nuclear Medicine, Department of Radiology, St. Vincent's Hospital, College of Medicine, The Catholic University of Korea, Suwon, Republic of Korea.
Department of Radiology, Chung-Ang University Hospital, Chung-Ang University College of Medicine, Seoul, Republic of Korea.
Department of Radiology, Chungnam National University Hospital, Chungnam National University College of Medicine, Daejeon, Republic of Korea. Electronic address: [email protected].
Kim Jaechul Graduate School of AI, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea. Electronic address: [email protected].

Abstract

Recent medical vision-language models (VLMs) have shown promise in 2D medical image interpretation. However, extending them to 3D medical imaging has been challenging due to computational complexity and data scarcity. Although a few recent VLMs designed for 3D medical imaging have emerged, all are limited to learning the volumetric representation of a 3D medical image as a set of sub-volumetric features. This process introduces overly correlated representations along the z-axis that neglect slice-specific clinical details, particularly for 3D medical images where adjacent slices have low redundancy. To address this limitation, we introduce MS-VLM, which mimics radiologists' workflow in 3D medical image interpretation. Specifically, radiologists analyze 3D medical images by examining individual slices sequentially and synthesizing information across slices and views. Likewise, MS-VLM leverages self-supervised 2D transformer encoders to learn a volumetric representation that captures inter-slice dependencies from a sequence of slice-specific features. Unbound by sub-volumetric patchification, MS-VLM can obtain useful volumetric representations from 3D medical images with any number of slices and from multiple images acquired in different planes and phases. We evaluate MS-VLM on the publicly available chest CT dataset CT-RATE and an in-house rectal MRI dataset. In both settings, MS-VLM surpasses existing methods in radiology report generation, producing more coherent and clinically relevant reports. These findings highlight the potential of MS-VLM to advance 3D medical image interpretation and improve the robustness of medical VLMs.
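
The slice-wise idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the stand-in encoder (a fixed random linear projection), the single unparameterized self-attention step, and every name below (encode_slice, inter_slice_attention, encode_volume, make_volume) are assumptions chosen only to show that encoding each slice separately and then mixing features across the slice sequence places no constraint on the number of slices.

```python
import math
import random


def encode_slice(slice_2d, weights):
    """Hypothetical stand-in for a pretrained 2D encoder: flatten one slice
    and project it to a feature vector with a fixed linear map."""
    pixels = [p for row in slice_2d for p in row]
    return [sum(w * p for w, p in zip(w_row, pixels)) for w_row in weights]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def inter_slice_attention(features):
    """Single-head self-attention across the slice sequence, with no learned
    projections (an assumption made to keep the sketch short)."""
    dim = len(features[0])
    mixed = []
    for q in features:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in features]
        attn = softmax(scores)
        mixed.append([sum(w * v[d] for w, v in zip(attn, features)) for d in range(dim)])
    return mixed


def encode_volume(volume, weights):
    """Encode every slice independently, then let slices attend to each other."""
    slice_features = [encode_slice(s, weights) for s in volume]
    return inter_slice_attention(slice_features)


def make_volume(num_slices, height, width, rng):
    """Synthetic volume of random intensities; real data would be CT or MRI slices."""
    return [[[rng.random() for _ in range(width)] for _ in range(height)]
            for _ in range(num_slices)]


rng = random.Random(0)
H, W, DIM = 4, 4, 8
weights = [[rng.uniform(-0.1, 0.1) for _ in range(H * W)] for _ in range(DIM)]

# The same encoder handles volumes with different slice counts.
short_scan = encode_volume(make_volume(5, H, W, rng), weights)
long_scan = encode_volume(make_volume(12, H, W, rng), weights)
print(len(short_scan), len(short_scan[0]))  # 5 8
print(len(long_scan), len(long_scan[0]))    # 12 8
```

In this sketch the output has one feature vector per input slice, so a 5-slice and a 12-slice volume pass through the same code path without resampling along the z-axis. A sub-volumetric patch encoder would instead need the depth to fit its patch size.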


Topics

Imaging, Three-Dimensional; Image Interpretation, Computer-Assisted; Journal Article
