Multimodal Large Language Model for Zero-Shot L3 Body Composition Segmentation on CT: Improved Accuracy via Automated Candidate Selection.
Authors
Affiliations (5)
- Department of Medical Imaging, The Ottawa Hospital, University of Ottawa, 501 Smyth Road, Ottawa, ON, K1H 8L6, Canada.
- Department of Diagnostic Radiology, McGill University, Montreal, QC, Canada.
- Augmented Intelligence and Precision Health Laboratory (AIPHL), Research Institute of the McGill University Health Centre, Montreal, Canada.
- Diagnostic Radiology and Radiation Oncology, Chiba University Graduate School of Medicine, Chiba, Japan.
- Department of Radiology, Institute of Medical Science, The University of Tokyo, Tokyo, Japan.
Abstract
This study evaluated zero-shot L3 body composition segmentation on computed tomography (CT) using a general-purpose multimodal large language model (MLLM) and assessed whether automated candidate selection improves segmentation accuracy. This retrospective study used the publicly available TCIA Colorectal-Liver-Metastases CT dataset. One mid-L3 axial image was selected per case. Radiologist A segmented skeletal muscle (SM), subcutaneous adipose tissue (SAT), and visceral adipose tissue (VAT), and Radiologist B independently segmented all cases to assess interobserver reproducibility. For each of 192 cases, gemini-3-pro-image-preview generated 10 candidate masks, and gemini-3-pro-preview served as the evaluator model and selected the most anatomically plausible candidate. The Dice similarity coefficient (DSC) was used to compare model masks with Radiologist A reference masks. Automated candidate selection achieved mean DSCs of 0.900 ± 0.102 for SM, 0.902 ± 0.096 for SAT, and 0.714 ± 0.245 for VAT. Compared with the best cohort-level single run, automated candidate selection improved DSC for SM (0.879 ± 0.128; adjusted P = .018) and SAT (0.860 ± 0.166; adjusted P < .001), but not for VAT (0.715 ± 0.237; adjusted P = .934). Compared with the mean of 10 runs, automated candidate selection improved DSC for all compartments. Interobserver DSCs were 0.972 ± 0.094 for SM, 0.976 ± 0.095 for SAT, and 0.933 ± 0.099 for VAT. Zero-shot L3 body composition segmentation with a general-purpose MLLM appeared feasible, and automated candidate selection improved segmentation accuracy for SM and SAT, although performance remained below the interobserver DSCs between radiologists, particularly for VAT.
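The two quantitative steps named in the abstract, scoring a mask with the DSC and picking one of several candidate masks, can be illustrated with a minimal Python sketch. This is not the authors' code. The DSC follows the standard definition 2|A ∩ B| / (|A| + |B|). The names `dice`, `select_candidate`, and `score` are hypothetical, and `score` is only a stand-in for the evaluator model, whose prompts and outputs are not described in the abstract. Returning 1.0 for two empty masks is likewise an assumed convention.

```python
from typing import Callable, Sequence

# A 2D binary mask: 1 = pixel belongs to the compartment, 0 = background.
Mask = Sequence[Sequence[int]]


def dice(pred: Mask, ref: Mask) -> float:
    """Dice similarity coefficient, 2|A ∩ B| / (|A| + |B|), for two binary masks."""
    if len(pred) != len(ref) or any(len(p) != len(r) for p, r in zip(pred, ref)):
        raise ValueError("masks must have the same shape")
    intersection = sum(
        1
        for pred_row, ref_row in zip(pred, ref)
        for p, r in zip(pred_row, ref_row)
        if p and r
    )
    size_pred = sum(1 for row in pred for p in row if p)
    size_ref = sum(1 for row in ref for r in row if r)
    if size_pred + size_ref == 0:
        # Assumed convention: two empty masks are treated as perfect agreement.
        return 1.0
    return 2.0 * intersection / (size_pred + size_ref)


def select_candidate(candidates: Sequence[Mask], score: Callable[[Mask], float]) -> int:
    """Return the index of the candidate with the highest score.

    `score` is a hypothetical placeholder for the evaluator model's judgement
    of anatomical plausibility; it must not use the reference mask.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(range(len(candidates)), key=lambda i: score(candidates[i]))


if __name__ == "__main__":
    reference = [[0, 1, 1], [0, 1, 0]]
    candidate = [[0, 1, 0], [0, 1, 1]]
    print(round(dice(candidate, reference), 3))
```

In the study, selection happens without access to the reference masks, and the DSC is computed only afterwards to evaluate the selected candidate against Radiologist A's segmentation.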