
Large Language Models for the Differentiation of Benign and Malignant Liver Nodules based on Multimodal Prompts in Liver US Cases.

April 16, 2026

Authors

Lin S, Li Y, Mao R, Zou X, Hu Y, Ye H, Wu X, Yang L, He J, Lu S, Li L, Zhou J

Affiliations (3)

  • Department of Ultrasound, Sun Yat-sen University Cancer Center, State Key Laboratory of Oncology in South China, Guangdong Provincial Clinical Research Center for Cancer, Guangzhou, P. R. China.
  • Department of Ultrasound, Sun Yat-sen University Cancer Center, State Key Laboratory of Oncology in South China, Guangdong Provincial Clinical Research Center for Cancer, Guangzhou, P. R. China. Electronic address: [email protected].
  • Department of Ultrasound, Sun Yat-sen University Cancer Center, State Key Laboratory of Oncology in South China, Guangdong Provincial Clinical Research Center for Cancer, Guangzhou, P. R. China. Electronic address: [email protected].

Abstract

Large language models (LLMs) that can process both images and text are increasingly being used in radiology. This study aimed to evaluate the performance of LLMs, including GPT-4 Omni (GPT-4o), Claude-3.5-Sonnet (Claude), and Gemini 1.5 Pro (Gemini), in differentiating benign and malignant nodules in liver US cases and to compare it with that of human readers. In this retrospective study, 400 liver US cases with pathologically confirmed liver nodules visible on B-mode US from January 2020 to November 2024 were randomly selected. They were divided into a development set (n = 100) and a test set (n = 300). Five prompt groups for the LLMs were evaluated in the development set to identify the optimal input: US image [I-only], image description [D-only], image and description [I+D], image and liver US e-textbook [I+T], and image and medical history [I+H]. In the test set, the accuracy of the LLMs in differentiating benign and malignant liver nodules was compared with that of human readers using McNemar's test. In the development set, the prompt group I+H exhibited the highest diagnostic accuracy for all LLMs and was therefore considered the optimal input (taking GPT-4o as an example: I-only, 57.0% [as reference]; D-only, 62.0%, p = 0.55; I+D, 62.0%, p = 0.54; I+T, 62.0%, p = 0.36; I+H, 77.0%, p = 0.01). In the test set, LLMs with I+H outperformed the junior group and showed similar accuracy to the senior group (Junior, 70.0% [as reference 1]; Senior, 78.3% [as reference 2]; GPT-4o, 83.3%, p1 < 0.001, p2 = 0.10; Claude, 77.0%, p1 = 0.04, p2 = 0.72; Gemini, 75.3%, p1 = 0.14, p2 = 0.36). Large language models with US image and medical history inputs achieved accuracy comparable to senior radiologists and superior to junior radiologists in differentiating benign and malignant liver nodules.
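The paired accuracy comparison named in the abstract can be illustrated with a minimal sketch of an exact (binomial) McNemar's test. This is not the authors' code: the abstract does not say whether an exact or chi-square variant was used, the function name `mcnemar_exact` is hypothetical, and the ten toy cases below are invented purely for illustration and are unrelated to the study data.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar's test on paired per-case correctness flags.

    correct_a, correct_b: equal-length sequences of booleans, one entry per
    case, True if that reader (or model) classified the case correctly.
    Returns (b, c, p_value), where b counts cases only A got right and
    c counts cases only B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both readers must be scored on the same cases")

    # Only discordant pairs carry information about a difference in accuracy.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0

    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (hypothetical): 10 cases scored for two readers.
reader_a = [True] * 8 + [False] * 2
reader_b = [True, True] + [False] * 6 + [True, False]

print(mcnemar_exact(reader_a, reader_b))  # (6, 1, 0.125)
```

The test looks only at the cases where the two readers disagree, which is why two readers with different overall accuracies can still yield a non-significant p-value when few cases are discordant.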

Topics

Journal Article
