Reproducibility of echocardiographic measurements of left ventricular systolic function: a systematic review and meta-analysis comparing artificial intelligence and clinician estimates.
Authors
Affiliations (3)
- Division of Population Medicine, Cardiff University, 3rd floor, Neuadd Meirionnydd, Heath Park, Cardiff CF14 4XN, Wales, UK.
- Institute of Cardiovascular Medicine, University College London, 74 Huntley Street, London WC12 6BT, England, UK.
- Cardiology Department, University Hospital of Wales, Heath Park, Cardiff CF14 4XW, Wales, UK.
Abstract
Echocardiography underpins the diagnosis and management of cardiovascular disease, yet measurement variability can influence treatment decisions. Artificial intelligence (AI) may standardize interpretation, but its reproducibility and clinical impact require systematic evaluation. We aimed to compare the reproducibility of AI-derived and clinician-derived measurements of left ventricular (LV) systolic function, specifically global longitudinal strain (GLS) and ejection fraction (EF), in adults. We searched Medline, Embase, Web of Science, and CENTRAL from inception to May 2025 for peer-reviewed studies assessing the reproducibility of AI-derived EF and/or GLS from two-dimensional (2D) or three-dimensional (3D) transthoracic echocardiography. Reporting quality was assessed with the Checklist for Artificial Intelligence in Medical Imaging (CLAIM). Random-effects meta-analyses of intraclass correlation coefficients (ICCs) and Bland-Altman plots were used to compare the reproducibility of AI- and clinician-derived measures. Nineteen studies (17 984 participants; mean age 59 ± 8 years, 52.8% male) were included. Mean CLAIM adherence was 72.9%. Pooled ICCs demonstrated high reproducibility for both AI- and clinician-derived EF and GLS. Bland-Altman analyses showed limits of agreement of -13.4% to +12.7% for 2D EF and -4.3% to +2.3% for 2D GLS. Agreement for 3D EF was slightly better, with pooled limits of agreement of 11.26-12.61%. The pooled mean absolute differences (MADs) were 5.17% for 2D EF, 5.27% for 3D EF, and 1.32% for 2D GLS. AI-derived GLS and 3D EF achieve reproducibility comparable to, or exceeding, that of clinicians' estimates. However, the limits of agreement between clinician and AI estimates are sufficiently wide that reclassification is possible around key thresholds, which could affect patient management decisions. Large-scale, real-world validation remains essential to confirm generalizability.
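For readers less familiar with the agreement statistics reported above, the following is a minimal sketch of how Bland-Altman limits of agreement and a mean absolute difference could be computed for a single set of paired readings. It assumes the conventional definition of the limits (bias ± 1.96 standard deviations of the paired differences); the function name `bland_altman` and the EF values are hypothetical and are not taken from the included studies, and the pooled estimates in the review would come from a meta-analytic model rather than from this per-study calculation.

```python
from statistics import mean, stdev


def bland_altman(ai_values, clinician_values):
    """Return (bias, lower_loa, upper_loa, mad) for paired measurements.

    Assumes the conventional 95% limits of agreement: bias +/- 1.96 * SD
    of the paired differences (AI minus clinician).
    """
    if len(ai_values) != len(clinician_values):
        raise ValueError("paired measurements must have equal length")
    if len(ai_values) < 2:
        raise ValueError("at least two pairs are needed to estimate the SD")

    # Paired differences, AI minus clinician, in the measurement's own units.
    diffs = [a - c for a, c in zip(ai_values, clinician_values)]

    bias = mean(diffs)                      # mean difference (systematic offset)
    sd = stdev(diffs)                       # sample SD of the differences
    half_width = 1.96 * sd                  # half-width of the 95% limits
    mad = mean([abs(d) for d in diffs])     # mean absolute difference

    return bias, bias - half_width, bias + half_width, mad


# Hypothetical EF readings (%) for five patients; illustrative only.
ai_ef = [55.0, 60.0, 42.0, 35.0, 65.0]
clinician_ef = [57.0, 58.0, 45.0, 30.0, 66.0]

bias, lower, upper, mad = bland_altman(ai_ef, clinician_ef)
print(f"bias={bias:.2f}%, limits of agreement=({lower:.2f}%, {upper:.2f}%), MAD={mad:.2f}%")
```

Limits of this kind matter clinically when their width is comparable to the distance between a patient's measured value and a decision threshold: a reading close to a cut-off could fall on either side depending on whether the AI or the clinician produced it.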