Evaluating Conversational Image Segmentation for Medicine: Performance, Failure Modes, and a Fairness Audit Across Seven Modalities

November 23, 2025 · medRxiv preprint

Authors

Do, J., Guggilla, N., Suresh, V., Kothari, R.

Affiliations (1)

  • Sidney Kimmel Medical College at Thomas Jefferson University

Abstract

Introduction: Medical image segmentation underpins quantitative diagnostics and research, yet state-of-the-art models remain task-specific and data-hungry. The recent emergence of powerful multimodal large language models (LLMs) presents a generalizable option; however, their efficacy in the specialized medical domain remains largely unquantified. We aim to benchmark the foundational Gemini 2.5 Flash and Gemini 2.5 Flash-Lite models for zero-shot medical image segmentation and evaluate them for potential bias in performance.

Methods: The models were tested on 13,086 medical images spanning seven distinct imaging modalities (endoscopy, fundoscopy, dermoscopy, laparoscopy, ultrasound, radiography, and CT) and 10 clinically relevant segmentation targets (e.g., colorectal polyps, skin lesions, liver tumors). Prompts followed a standard template ("Segment the ... in this image."). Per-image Dice and Intersection-over-Union (IoU) were computed against publicly released expert masks. Bias was assessed on 4,057 dermoscopy images split by Individual Typology Angle (ITA) into Light (> 28°, n = 3,499) and Dark (≤ 28°, n = 558) groups.

Results: Flash achieved a mean Dice = 0.766 (IoU = 0.680) for colorectal polyp segmentation, Dice = 0.761 (IoU = 0.672) for skin lesion segmentation, Dice = 0.824 (IoU = 0.736) for optic disc segmentation, and Dice = 0.718 (IoU = 0.616) for surgical tool segmentation, outperforming Flash-Lite by approximately 0.1 Dice points. Accuracy declined on low-contrast radiological tasks (Liver Mass CT Dice = 0.071). In the fairness audit, Flash produced a successful mask for 2,839/3,499 light-tone images (81.2%) versus 378/558 dark-tone images (67.7%); χ² = 51.8, p < 0.001. Using all images, mean IoU = 0.686 for light tones and IoU = 0.591 for dark tones (Kruskal-Wallis H = 62.6, p < 0.001); Cliff's δ = -0.208 (95% CI -0.259 to -0.159).

Discussion: Gemini 2.5 Flash delivers competitive accuracy on high-contrast photographic datasets at negligible cost. Performance is weaker on radiological modalities (ultrasound, CT, chest radiography), and in dermoscopy we observe lower accuracy on darker ITA skin-tone groups. This study identifies where foundational LLMs are deployment-ready for medical image segmentation and where targeted debiasing or domain adaptation is required.
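
The abstract does not describe how the metrics or the fairness test were implemented, so the following is only a minimal sketch of the two calculations it names: per-image Dice and IoU between a predicted and an expert mask, and a 2×2 chi-square test on mask success counts by skin-tone group. The function names `dice_iou` and `chi_square_2x2_yates`, the toy masks, and the treatment of two empty masks as perfect agreement are assumptions made for illustration. Whether the authors applied Yates' continuity correction is not stated; it is assumed here, and under that assumption the reported counts give a statistic of about 51.8.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # 2-D binary mask, 1 = target pixel


def dice_iou(pred: Mask, truth: Mask) -> Tuple[float, float]:
    """Return (Dice, IoU) for two binary masks of equal shape."""
    inter = pred_area = truth_area = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            pred_area += 1 if p else 0
            truth_area += 1 if t else 0
    union = pred_area + truth_area - inter
    if union == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0, 1.0
    dice = 2 * inter / (pred_area + truth_area)
    iou = inter / union
    return dice, iou


def chi_square_2x2_yates(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic with Yates' correction for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    # Shrink |ad - bc| by n/2 (continuity correction), never below zero.
    corrected = max(abs(a * d - b * c) - n / 2, 0)
    return n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical toy masks: 3 predicted pixels, 3 true pixels, 2 shared.
pred = [[1, 1, 0],
        [1, 0, 0]]
truth = [[1, 1, 0],
         [0, 1, 0]]
dice, iou = dice_iou(pred, truth)
print(f"Dice={dice:.3f} IoU={iou:.3f}")  # Dice=0.667 IoU=0.500

# Success counts as reported in the abstract (successful masks / group size).
light_ok, light_n = 2839, 3499
dark_ok, dark_n = 378, 558
chi2 = chi_square_2x2_yates(light_ok, light_n - light_ok,
                            dark_ok, dark_n - dark_ok)
print(f"chi-square={chi2:.1f}")  # chi-square=51.8
```

Dice and IoU are monotonically related (Dice = 2·IoU / (1 + IoU)), which is why the two metrics rank the tasks in the abstract in the same order.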

Topics

health informatics
