Comparative evaluation of generative AI models for chest radiograph report generation in the emergency department.
Authors
Affiliations (6)
- Department of Radiology, Seoul National University Hospital, Jongno-gu, Korea.
- Department of Radiology, Seoul National University College of Medicine, Jongno-gu, Korea.
- Soombit.ai, Bundang-gu, Korea.
- Department of Radiology, Seoul National University Hospital, Jongno-gu, Korea. [email protected].
- Department of Radiology, Seoul National University College of Medicine, Jongno-gu, Korea. [email protected].
- Soombit.ai, Bundang-gu, Korea. [email protected].
Abstract
To benchmark medical image-specific vision-language models (VLMs) against real-world radiologist-written reports, focusing on diagnostic quality, clinical acceptability, hallucinations, and language clarity.
This retrospective study included adult patients who presented to the emergency department of a tertiary center between January 2022 and April 2025 and underwent same-day chest radiograph (CXR) and CT for febrile or respiratory symptoms. Reports from five VLMs (AIRead, Lingshu, MAIRA-2, MedGemma, and MedVersa) and radiologist-written reports were presented in random order and evaluated in a blinded manner by three thoracic radiologists using four criteria: RADPEER, clinical acceptability, hallucination, and language clarity. Comparative performance was assessed using generalized linear mixed models, with radiologist-written reports treated as the reference. Finding-level analyses were also performed with CT as the reference.
A total of 478 patients (median age, 67 years [interquartile range, 50-78]; 282 males [59.0%]) were included. AIRead demonstrated the lowest RADPEER 3b rate (5.3% [76/1434] vs radiologists 13.9% [200/1434]; p < 0.001) and the highest clinical acceptability (84.5% [1212/1434] vs radiologists 74.3% [1065/1434]; p < 0.001), with hallucination rates comparable to radiologists (0.3% [4/1425] vs 0.1% [1/1425]; p = 0.21). Other VLMs showed higher disagreement (16.8-43.0%; p < 0.05), lower acceptability (41.1-71.4%; p < 0.05), and more frequent hallucinations (5.4-17.4%; p < 0.05). Language clarity was higher for several VLMs (AIRead, Lingshu, and MedVersa) than for radiologist-written reports (82.9-88.4% [1189-1268/1434] vs 78.1% [1120/1434]; p < 0.05). Finding-level analyses showed substantial variability in sensitivity across VLMs for common thoracic findings.
Medical VLMs for CXR report generation exhibited variable performance in report quality and diagnostic measures.
Question: How do medical VLMs perform compared with radiologist-written reports regarding diagnostic quality, clinical acceptability, hallucinations, and language clarity for CXRs?
Findings: One non-open-source VLM achieved the lowest RADPEER 3b rate and highest acceptability with radiologist-level hallucinations, whereas other models showed inferior performance.
Clinical relevance: VLMs may support automated preliminary CXR reporting and serve as adjunct tools to enhance workflow efficiency and consistency. Although performance varied widely, carefully developed models approached radiologist-level quality, supporting continued refinement and targeted clinical integration.
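As a reading aid, the percentages reported in the abstract can be recomputed from the bracketed counts. The sketch below is not part of the study's analysis (which used generalized linear mixed models); it only checks the arithmetic. The dictionary labels and the `pct` helper are illustrative names, and reading the denominator of 1434 as 478 patients times three readers is an inference from the abstract, not something it states outright.

```python
# Counts copied from the abstract. Labels and helper names are illustrative.
N_PATIENTS = 478
N_READERS = 3  # three thoracic radiologists rated every report

counts = {
    "male patients": (282, 478),
    "AIRead RADPEER 3b": (76, 1434),
    "radiologist RADPEER 3b": (200, 1434),
    "AIRead acceptable": (1212, 1434),
    "radiologist acceptable": (1065, 1434),
    # The abstract reports 1425 (not 1434) as the hallucination denominator
    # and does not explain the difference; the value is used as given.
    "AIRead hallucination": (4, 1425),
    "radiologist hallucination": (1, 1425),
    "radiologist language clarity": (1120, 1434),
}


def pct(numerator: int, denominator: int) -> float:
    """Return a proportion as a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


rates = {label: pct(n, d) for label, (n, d) in counts.items()}

for label, rate in rates.items():
    print(f"{label}: {rate}%")
```

Each recomputed value matches the figure printed in the abstract, and 478 x 3 = 1434 agrees with the denominator used for most report-level ratings.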