How well do multimodal LLMs interpret CT scans? An auto-evaluation framework for analyses.

Authors

Zhu Q, Hou B, Mathai TS, Mukherjee P, Jin Q, Chen X, Wang Z, Cheng R, Summers RM, Lu Z

Affiliations (5)

  • National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, 20894, MD, USA.
  • Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Clinical Center, National Institutes of Health, 10 Center Drive, Bethesda, 20892, MD, USA.
  • Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, 999041, United Arab Emirates.
  • Center for Information Technology, National Institutes of Health, Bethesda, 20894, MD, USA.
  • National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, 20894, MD, USA. Electronic address: [email protected].

Abstract

This study introduces GPTRadScore, a novel evaluation framework for systematically assessing how well multimodal large language models (MLLMs) generate clinically accurate findings from CT imaging. GPTRadScore uses an LLM (GPT-4) as the evaluation metric, aiming to provide a more accurate and clinically informed assessment than traditional natural-language similarity metrics. In this retrospective study, a subset of the public DeepLesion dataset was used to evaluate several MLLMs, including GPT-4 with Vision (GPT-4V), Gemini Pro Vision, LLaVA-Med, and RadFM, on their ability to describe findings in CT slices. GPTRadScore assessed the generated descriptions along three dimensions (location, body part, and finding type), alongside traditional metrics. RadFM was then fine-tuned on a subset of the DeepLesion dataset with additional labeled examples targeting complex findings, and its performance was reassessed with GPTRadScore to measure accuracy improvements. GPTRadScore correlated strongly with clinician assessments, with Pearson's correlation coefficients of 0.87, 0.91, 0.75, 0.90, and 0.89. These results highlight its superiority over traditional metrics such as BLEU, METEOR, and ROUGE, and indicate that GPTRadScore can serve as a reliable evaluation metric. Under GPTRadScore, GPT-4V and Gemini Pro Vision outperformed the other models, yet significant room for improvement remains, primarily due to limitations in the datasets used for training. Fine-tuning RadFM yielded substantial accuracy gains: location accuracy increased from 3.41% to 12.8%, body part accuracy from 29.12% to 53%, and type accuracy from 9.24% to 30%, confirming that fine-tuning can significantly enhance RadFM's performance.
GPT-4 correlates well with expert assessments, validating its use as a reliable metric for evaluating multimodal LLMs in radiological diagnostics. The results also underscore the efficacy of fine-tuning in improving the descriptive accuracy of LLM-generated medical imaging findings.
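The validation step described above boils down to correlating per-case scores from the LLM-based judge with per-case clinician ratings. As a minimal sketch of that computation (the score values below are hypothetical, not the study's data; the paper does not specify its exact implementation), Pearson's correlation coefficient can be computed directly:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-case scores in [0, 1]: one from an LLM judge such as
# GPTRadScore, one from a clinician rating the same generated findings.
llm_scores = [0.9, 0.4, 0.7, 1.0, 0.2, 0.6]
clinician_scores = [1.0, 0.5, 0.6, 0.9, 0.1, 0.7]
r = pearson(llm_scores, clinician_scores)
```

A coefficient near 1 (as in the 0.75-0.91 range the study reports across its evaluation dimensions) indicates that the automated judge ranks model outputs much as a clinician would.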

Topics

Journal Article
