Evaluating Multiple Input Strategies of Large Language Models for Gallbladder Polyps on Ultrasound: Comparative Study.

December 23, 2025 (PubMed)

Authors

Jiang L, Yao J, Yang Z, Tang F, Zheng X, Zhang X, Xie X, Xu M, Huang T

Affiliations (1)

  • Department of Medical Ultrasonics, Institute of Diagnostic and Interventional Ultrasound, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, China.

Abstract

Gallbladder polyps are highly prevalent, predominantly benign lesions that are often detected on ultrasound. They impose a diagnostic burden on radiologists and generate substantial patient demand for report interpretation. Benign polyps include nonneoplastic polyps without malignant potential and premalignant adenomas that require cholecystectomy. Current guidelines recommending surgery for polyps ≥1.0 cm may lead to unnecessary interventions. Advanced multimodal large language models (LLMs) such as ChatGPT-4o (OpenAI) and Claude 3.5 Sonnet (Anthropic PBC) demonstrate emerging capabilities in medical image analysis. Implementing LLMs in gallbladder polyp ultrasound evaluation could alleviate radiologists' workload, provide patient-accessible consultation platforms, and even reduce overtreatment. We aimed to assess the feasibility of, and conduct an early-stage evaluation of, using ChatGPT-4o and Claude 3.5 Sonnet to differentiate adenomatous from nonneoplastic gallbladder polyps (≥1.0 cm), compared with assessments by radiologists and the guideline. Ultrasound images and reports of gallbladder polyps ≥1.0 cm with pathology results were retrospectively collected from a hospital between January 2011 and January 2022. LLM performance was evaluated using three input strategies: (1) direct image analysis (LLMs-image), (2) feature-based text analysis (LLMs-text), and (3) scoring model-based text analysis (LLMs-model). Intra- and interreader agreement and the diagnostic performance of the LLMs were evaluated for all three strategies. The diagnostic performance metrics of the LLMs under the three strategies (sensitivity, specificity, accuracy, area under the receiver operating characteristic curve, and unnecessary resection rate of nonneoplastic polyps) were compared with those of the guideline. Additionally, the LLMs-model strategy was specifically compared with radiologists using the same scoring system (readers-model strategy).
This study included 223 patients (aged 18-72 years; 132/223, 59.2% female) as the initial cohort, with 48 adenomatous polyps and 175 nonneoplastic polyps. The external test set comprised 100 patients. The intrareader agreement coefficients for the LLMs-model strategy were significantly higher than those for the LLMs-image and LLMs-text strategies (all P<.01). Interreader agreement across the three diagnostic strategies ranked as LLMs-model > LLMs-text > LLMs-image. The sensitivity of the LLMs-image and LLMs-text strategies was significantly lower than that of the guideline (all P<.001). When a scoring model was applied (readers-model and LLMs-model strategies), both radiologists and the LLMs achieved significantly higher accuracy than the guideline (0.34, 0.35, and 0.34 vs 0.22, all P<.01) and a significantly lower unnecessary resection rate of nonneoplastic polyps (82%, 83%, and 83% vs 100%, all P<.01), while sensitivity was comparable to that of the guideline (0.94, 0.98, and 0.98 vs 1.00, all P>.05). None of the diagnostic performance indicators for GPT-model and Claude-model differed significantly from those of radiologists (all P>.05). The ability of LLMs to recognize and interpret medical images requires further improvement. The text strategy with a scoring system is currently the most appropriate diagnostic strategy for LLMs.
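The guideline figures above can be checked with a short calculation. The following is a minimal sketch, not the authors' analysis code: it assumes the guideline numbers refer to the initial cohort (48 adenomatous, 175 nonneoplastic polyps), that adenoma is the positive class, and that the unnecessary resection rate is the share of nonneoplastic polyps recommended for surgery. The names `ConfusionCounts` and `diagnostic_metrics` are hypothetical. Because every included polyp is ≥1.0 cm, a size-only guideline recommends surgery for all of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # adenomatous polyps recommended for surgery
    fn: int  # adenomatous polyps not recommended for surgery
    fp: int  # nonneoplastic polyps recommended for surgery
    tn: int  # nonneoplastic polyps spared surgery


def diagnostic_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute basic diagnostic metrics (definitions assumed, see lead-in)."""
    positives = c.tp + c.fn  # all adenomatous polyps
    negatives = c.fp + c.tn  # all nonneoplastic polyps
    return {
        "sensitivity": c.tp / positives,
        "specificity": c.tn / negatives,
        "accuracy": (c.tp + c.tn) / (positives + negatives),
        # Assumed definition: nonneoplastic polyps sent to surgery.
        "unnecessary_resection_rate": c.fp / negatives,
    }


# Size-only guideline on a cohort where every polyp is >= 1.0 cm:
# all 48 adenomas and all 175 nonneoplastic polyps are flagged for surgery.
guideline = ConfusionCounts(tp=48, fn=0, fp=175, tn=0)
guideline_metrics = diagnostic_metrics(guideline)

for name, value in guideline_metrics.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the guideline's accuracy equals the adenoma prevalence, 48/223 ≈ 0.22, with sensitivity 1.00 and an unnecessary resection rate of 100%, which agrees with the reported values.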

Topics

Journal Article
