A Comparative Evaluation of Zero-Shot Performance of SAM, SAM2, MedSAM, and MedSAM2 Models on Lung Segmentation.
Authors
Affiliations (3)
- Department of Computer Engineering, Bartin University, Bartin, 74100, Turkey. [email protected].
- Department of Computer Engineering, Hacettepe University, Ankara, 06100, Turkey.
- Department of Computer Engineering, Ankara University, Ankara, 06100, Turkey.
Abstract
Lung diseases require accurate and early diagnosis to ensure effective treatment planning and close monitoring of disease progression. High-resolution computed tomography (HRCT) provides detailed visualization of lung structures, while automated lung segmentation in HRCT images supports the diagnostic process and improves clinical accuracy. This study presents a comprehensive evaluation of the segmentation performance of the Segment Anything Model (SAM), SAM2, Medical SAM (MedSAM), and MedSAM2 within a zero-shot learning framework, using the MedGIFT database as the experimental benchmark. Notably, no retraining or fine-tuning was applied to the models, thereby enabling an objective assessment of segmentation performance as a function of prompt type. In the experiments, bounding box (BB) prompts were automatically derived from the ground truth masks, while point-based prompts were generated with positive-only, negative-only, and combined strategies. The number of points varied between 1 and 10, with selections made both randomly and in a balanced manner across lung regions. Experimental findings revealed that, contrary to initial expectations, earlier model versions (SAM and MedSAM) outperformed their newer counterparts (SAM2 and MedSAM2) in BB-based segmentation tasks. Regarding point-based prompts, SAM and SAM2 exhibited complementary strengths: SAM2 achieved higher accuracy with fewer input points, whereas SAM demonstrated superior performance in more densely labeled scenarios. Disease-specific analysis showed that point-based prompting was most effective in tuberculosis, while BB-based prompts performed poorly; pulmonary fibrosis had the lowest overall segmentation performance. The highest Dice scores obtained were 96.076% for SAM, 92.912% for SAM2, 94.326% for MedSAM, and 84.979% for MedSAM2. These results underscore the importance of selecting an appropriate model and prompting strategy based on labeling density and disease characteristics.
This study presents the first systematic evaluation of the SAM model family for computed tomography lung segmentation on the MedGIFT database, demonstrating their potential as flexible and robust tools for clinical use. Moreover, this study highlights prompt selection as a key determinant of SAM-based segmentation performance in clinical lung imaging.
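As an illustration of the prompt types and the metric named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes binary 2D ground-truth masks stored as NumPy arrays, boxes in (x_min, y_min, x_max, y_max) order, and point labels of 1 for positive and 0 for negative prompts. The function names `bbox_from_mask`, `sample_points`, and `dice` are hypothetical, and only the random point-selection strategy is shown; the balanced strategy across lung regions is not specified in the abstract and is not reproduced here.

```python
import numpy as np


def bbox_from_mask(mask):
    """Tight bounding box (x_min, y_min, x_max, y_max) around nonzero pixels.

    Assumption: the box prompt is the tight box of the ground-truth mask.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask is empty, no bounding box can be derived")
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def sample_points(mask, n_pos, n_neg, rng):
    """Randomly sample positive (inside) and negative (outside) point prompts.

    Returns coordinates as an (n_pos + n_neg, 2) array of (x, y) pairs and
    labels where 1 marks a positive point and 0 a negative one.
    Setting n_neg=0 or n_pos=0 gives positive-only or negative-only prompts.
    """
    mask = np.asarray(mask, dtype=bool)
    pos = np.argwhere(mask)    # rows of (y, x) inside the mask
    neg = np.argwhere(~mask)   # rows of (y, x) outside the mask
    pos_sel = pos[rng.choice(len(pos), size=n_pos, replace=False)]
    neg_sel = neg[rng.choice(len(neg), size=n_neg, replace=False)]
    coords = np.concatenate([pos_sel, neg_sel])[:, ::-1]  # reorder to (x, y)
    labels = np.concatenate(
        [np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)]
    )
    return coords, labels


def dice(pred, truth):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)


# Toy example: a rectangular "lung" region in an 8x8 slice.
gt = np.zeros((8, 8), dtype=np.uint8)
gt[2:5, 3:7] = 1

box = bbox_from_mask(gt)
points, labels = sample_points(gt, n_pos=3, n_neg=2, rng=np.random.default_rng(0))

# A prediction covering only the left half of the region.
pred = np.zeros_like(gt)
pred[2:5, 3:5] = 1
score = dice(pred, gt)
```

In a zero-shot setup of the kind described, `box` or `points` with `labels` would be passed to the segmentation model as the prompt, and `dice` would compare the returned mask with the ground truth. How each model expects its prompts to be formatted depends on the specific implementation used.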