
Benchmarking Vision LLMs in fetal ultrasound interpretation: a five-point expert evaluation of standard vs. custom prompts.

December 9, 2025

Authors

ELsharif W, Alzubaidi M, Tukur M, Magram A, Anver F, Hamza A, Said S, Khan R, Househ M, Agus M

Affiliations (4)

  • College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar.
  • College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar. [email protected].
  • Advanced AlRazi Diagnostic Center, Al-Hodeidah, Yemen.
  • Clinical Imaging Department, Hamad Medical Corporation (HMC), Doha, Qatar.

Abstract

Fetal ultrasound imaging is critical for prenatal care and demands accurate anatomical interpretation. This study evaluates the potential of Vision Large Language Models (LLMs) to interpret fetal ultrasound images, explores whether tailored prompts improve performance over standard prompts, and assesses the models' utility in clinical settings. Nine fetal ultrasound images were analyzed using six advanced Vision LLMs via the Chatbot Arena platform. Standard prompts were compared against expert-crafted tailored questions. Three expert sonographers assessed the models' outputs across five criteria (anatomical recognition, biometric potential, picture quality, normalcy assessment, and clinical recommendations) using a Likert scale (1-5). Standard prompts yielded limited interpretative accuracy. In contrast, custom prompts significantly improved performance, with Claude Sonnet 3.5 and ChatGPT4o achieving median scores of 19 and 18, respectively. Models performed best on fetal femur and trans-cerebellum images, and clinical advice was the easiest criterion for them to address. Challenges persisted in precise anatomical identification and image quality assessment, revealing limitations in visual recognition. Smaller models such as pixtral-12b showed notable improvement with tailored prompts, suggesting fine-tuning potential, while larger models did not consistently outperform smaller ones, indicating that factors beyond model size influence efficacy. Tailored prompts markedly enhance Vision LLMs' ability to interpret fetal ultrasound images, supporting their potential as aids in prenatal diagnosis and education. However, limitations in anatomical precision and image quality assessment persist. Future research should focus on refining models with specialized datasets, optimizing architectures, and advancing prompt engineering to maximize clinical utility.
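
The abstract does not say how the five per-criterion Likert ratings were combined into the reported median scores. One reading consistent with medians of 19 and 18 is that the five 1-5 ratings are summed into a 5-25 total and the median is taken over those totals. The Python sketch below illustrates that assumed aggregation only. The function names, the criterion keys, and the example ratings are hypothetical and are not taken from the study.

```python
from statistics import median

# Criterion names follow the abstract; the snake_case keys are an assumption.
CRITERIA = (
    "anatomical_recognition",
    "biometric_potential",
    "picture_quality",
    "normalcy_assessment",
    "clinical_recommendations",
)


def total_score(ratings):
    """Sum five 1-5 Likert ratings into one total (assumed range 5-25)."""
    for name in CRITERIA:
        value = ratings[name]
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return sum(ratings[name] for name in CRITERIA)


def median_total(all_ratings):
    """Median of the summed totals over a collection of rating sheets."""
    return median(total_score(r) for r in all_ratings)


# Hypothetical rating sheets for one model; values are illustrative only.
example_ratings = [
    dict(zip(CRITERIA, (4, 4, 3, 4, 4))),  # total 19
    dict(zip(CRITERIA, (3, 4, 3, 4, 4))),  # total 18
    dict(zip(CRITERIA, (4, 4, 4, 4, 5))),  # total 21
]

print(median_total(example_ratings))  # median of 18, 19, 21
```

If the study instead aggregated per criterion or per rater before taking the median, only `median_total` would need to change; the range check in `total_score` stays the same.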

Topics

Journal Article
