A Clinically-Informed Framework for Evaluating Vision-Language Models in Radiology Report Generation: Taxonomy of Errors and Risk-Aware Metric

July 14, 2025 (medRxiv preprint)

Authors

Guan, H., Hou, P. C., Hong, P., Wang, L., Zhang, W., Du, X., Zhou, Z., Zhou, L.

Affiliations (1)

  • Brigham and Women's Hospital, Harvard Medical School

Abstract

Recent advances in vision-language models (VLMs) have enabled automatic radiology report generation, yet current evaluation methods remain limited to general-purpose NLP metrics or coarse classification-based clinical scores. In this study, we propose a clinically informed evaluation framework for VLM-generated radiology reports that goes beyond traditional performance measures. We define a taxonomy of 12 radiology-specific error types, each annotated with clinical risk levels (low, medium, high) in collaboration with physicians. Using this framework, we conduct a comprehensive error analysis of three representative VLMs, i.e., DeepSeek VL2, CXR-LLaVA, and CheXagent, on 685 gold-standard, expert-annotated MIMIC-CXR cases. We further introduce a risk-aware evaluation metric, the Clinical Risk-weighted Error Score for Text-generation (CREST), to quantify safety impact. Our findings reveal critical model vulnerabilities, common error patterns, and condition-specific risk profiles, offering actionable insights for model development and deployment. This work establishes a safety-centric foundation for evaluating and improving medical report generation models. The source code of our evaluation framework, including CREST computation and error taxonomy analysis, is available at https://github.com/guanharry/VLM-CREST.
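
The abstract does not state the CREST formula; the authors' implementation is in the repository linked above. Purely to illustrate the general idea of weighting errors by clinical risk, here is a minimal sketch. The numeric weights, the function names, and the averaging over reports are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter

# Hypothetical weights: the abstract names three risk levels (low, medium, high)
# but gives no numeric values, so these numbers are illustrative only.
RISK_WEIGHTS = {"low": 1.0, "medium": 2.0, "high": 3.0}


def risk_weighted_score(error_risks):
    """Sum the assumed weight of every annotated error in a single report.

    `error_risks` is a list of risk-level strings, one per error found.
    """
    counts = Counter(error_risks)
    unknown = set(counts) - set(RISK_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown risk level(s): {sorted(unknown)}")
    return sum(RISK_WEIGHTS[level] * n for level, n in counts.items())


def mean_risk_weighted_score(reports):
    """Average the per-report score over a set of reports (an assumed aggregation)."""
    if not reports:
        return 0.0
    return sum(risk_weighted_score(r) for r in reports) / len(reports)
```

Under these assumed weights, a report with one low-risk and two high-risk errors scores higher than a report with three low-risk errors, which is the property a safety-oriented metric is meant to capture and a plain error count would miss.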

Topics

health informatics
