Novel Artificial Intelligence Chest X-ray Diagnostics: A Quality Assessment of Their Agreement with Human Doctors in Clinical Routine.
Authors
Affiliations (5)
- Department of Nuclear Medicine, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland.
- Department of Diagnostic, Interventional and Paediatric Radiology (DIPR), Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland.
- Department of Mathematics and Computer Science, University of Bremen, Bremen, Germany.
- Dioscuri Centre in Topological Data Analysis, Polish Academy of Sciences, Warsaw, Poland.
- Institute of Diagnostic and Interventional Radiology, Pediatric Radiology and Neuroradiology, University Medical Center Rostock, Rostock, Germany.
Abstract
The rising demand for radiology services calls for innovative solutions to sustain diagnostic quality and efficiency. This study evaluated the diagnostic agreement between two commercially available artificial intelligence (AI) chest X-ray systems and human radiologists in routine clinical practice.

We retrospectively analyzed 279 chest X-rays (204 standing, 63 supine, 12 sitting) from a Swiss university hospital. Seven thoracic pathologies - cardiomegaly, consolidation, mediastinal mass, nodule, pleural effusion, pneumothorax, and pulmonary oedema - were assessed. Radiologists' routine reports were compared against Rayvolve (AZmed) and ChestView (Gleamer; both companies based in Paris, France). A Python script, provided as an open-access supplement, calculated performance metrics, agreement measures, and effect-size quantification.

Agreement between radiologists and AI ranged from moderate to almost perfect: human-AZmed (Gwet's AC1: 0.47-0.72, moderate to substantial) and human-Gleamer (Gwet's AC1: 0.56-0.96, moderate to almost perfect). Balanced accuracies ranged from 0.67 to 0.85 for human-AZmed and from 0.71 to 0.85 for human-Gleamer, with peak performance for pleural effusion (0.85 for both systems). Specificity consistently exceeded sensitivity across pathologies (0.70-0.98 vs 0.45-0.85). Common findings showed strong agreement: pleural effusion (MCC 0.70-0.73), cardiomegaly (MCC 0.51), and consolidation (MCC 0.45-0.46). Rare pathologies showed lower agreement: mediastinal mass and nodules (MCC 0.23-0.31). Standing radiographs yielded superior agreement compared with supine studies. The two AI systems showed substantial inter-system agreement for consolidation and pleural effusion (balanced accuracy 0.81-0.84).

Both commercial AI chest X-ray systems demonstrated performance comparable to human radiologists for common thoracic pathologies, with no meaningful differences between platforms.
Performance was strongest for standing radiographs but declined for rare findings and supine studies. Position-dependent variability and reduced sensitivity for uncommon pathologies underscore the continued need for human oversight in clinical practice.

· AI systems matched radiologists for common chest X-ray findings.
· Standing radiographs achieved the highest diagnostic agreement.
· Rare pathologies showed weaker AI-human agreement.
· Supine studies reduced diagnostic performance.
· Human oversight remains essential in clinical practice.

Citation: Bosbach WA, Schoeni L, Senge JF et al. Novel Artificial Intelligence Chest X-ray Diagnostics: A Quality Assessment of Their Agreement with Human Doctors in Clinical Routine. Rofo 2025; DOI 10.1055/a-2778-3892.
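The study's open-access Python supplement is not reproduced here; as an illustrative sketch only, the metrics it reports (sensitivity, specificity, balanced accuracy, MCC, and Gwet's AC1) can be computed for a single binary finding as below. The function name and the example ratings are our own invention, with the human report treated as the reference standard.

```python
import math

def agreement_metrics(human, ai):
    """Compare binary ratings (1 = finding present) from a human reader
    and an AI system, treating the human report as the reference."""
    tp = sum(1 for h, a in zip(human, ai) if h == 1 and a == 1)
    tn = sum(1 for h, a in zip(human, ai) if h == 0 and a == 0)
    fp = sum(1 for h, a in zip(human, ai) if h == 0 and a == 1)
    fn = sum(1 for h, a in zip(human, ai) if h == 1 and a == 0)
    n = tp + tn + fp + fn

    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    bal_acc = (sens + spec) / 2

    # Matthews correlation coefficient (MCC)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )

    # Gwet's AC1: chance agreement is based on the mean prevalence of
    # the positive category across both raters, which keeps the
    # statistic more stable than Cohen's kappa for rare findings.
    p_obs = (tp + tn) / n
    pi = ((tp + fn) / n + (tp + fp) / n) / 2
    p_chance = 2 * pi * (1 - pi)
    ac1 = (p_obs - p_chance) / (1 - p_chance)

    return {"sensitivity": sens, "specificity": spec,
            "balanced_accuracy": bal_acc, "mcc": mcc, "gwet_ac1": ac1}

# Hypothetical ratings for one pathology across eight radiographs
human = [1, 1, 0, 0, 1, 0, 0, 0]
ai    = [1, 0, 0, 0, 1, 0, 0, 1]
print(agreement_metrics(human, ai))
```

Repeating this per pathology and per patient position (standing, supine, sitting) yields tables of the kind summarized in the abstract.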