Evaluation of Image-Level Harmonization Methods for Multi-Center MR Neuroimaging.
Authors
Affiliations (2)
- Department of Radiology, Stanford University, Stanford, California, USA.
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, California, USA.
Abstract
Background: Multi-center imaging studies generate large-scale data that are useful for identifying pathological patterns and for robust training of deep learning models. However, variation due to site and scanner differences can confound analyses, emphasizing the need for harmonization.
Purpose: To evaluate scanner-related differences in T1w and T2-FLAIR images in the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset and to assess the performance of publicly available image-level harmonization tools.
Study Type: Retrospective.
Population: Scanner group analysis: 1143 ADNI3 subjects (233 GE, 173 Philips, and 250 Siemens, plus 487 additional Siemens subjects used as an independent reference group). Within-subject comparison: paired multi-vendor scan sessions from 8 subjects.
Field Strength/Sequence: 3.0T; T1w and T2-FLAIR MRI sequences.
Assessment: Gray/white matter contrast ratio (G/W ratio), white matter hyperintensity (WMH) volume, and image feature similarity metrics (Fréchet Inception Distance [FID], Learned Perceptual Image Patch Similarity [LPIPS]) were compared across scanner vendors before and after harmonization with a statistical algorithm (ComBat) and a deep learning algorithm (HACA3).
Statistical Tests: One-way ANOVA and post hoc Games-Howell tests were used to assess differences between scanner groups within each image pipeline (baseline and post-harmonization). Repeated-measures ANOVA and post hoc paired t-tests with Bonferroni correction were used to evaluate changes in similarity metrics before and after harmonization for multi-vendor subjects. Statistical significance was defined as p < 0.05.
Results: At baseline, G/W ratio and WMH volume differed significantly between vendors. Both ComBat and HACA3 harmonization improved G/W ratio consistency across vendors for T1w and T2-FLAIR imaging, particularly for GE T2-FLAIR images. HACA3 yielded the greatest similarity between scanner datasets: mean FID for T1w/T2-FLAIR was 10.45/14.62 at baseline, 7.45/11.71 after ComBat, and 5.60/8.91 after HACA3. Only HACA3 harmonization resulted in non-significant differences between vendors for WMH volume.
Data Conclusion: HACA3 deep learning harmonization outperformed the statistical method ComBat, improving MR contrast consistency and feature similarity across vendors. However, difficulties in harmonizing T2-FLAIR images highlight limitations of current multi-contrast MR harmonization tools.
Evidence Level: 3.
Technical Efficacy: Stage 1.
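The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The abstract does not define the G/W ratio, so the sketch assumes it is the mean gray matter intensity divided by the mean white matter intensity. The function names (`gw_ratio`, `one_way_anova_f`, `bonferroni`) and all numeric values are hypothetical placeholders, not study data. The sketch computes only the one-way ANOVA F statistic and applies a Bonferroni adjustment to p-values that are already available; p-value computation and the Games-Howell test are left to a statistics package.

```python
from statistics import mean


def gw_ratio(gm_values, wm_values):
    """G/W contrast ratio.

    Assumed definition (not stated in the abstract):
    mean gray matter intensity / mean white matter intensity.
    """
    return mean(gm_values) / mean(wm_values)


def one_way_anova_f(groups):
    """One-way ANOVA F statistic: between-group mean square
    divided by within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    group_means = [mean(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: multiply each p-value by the number of
    comparisons (capped at 1.0) and compare against alpha."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return [(p_adj, p_adj < alpha) for p_adj in adjusted]


if __name__ == "__main__":
    # Hypothetical per-subject G/W ratios for three vendor groups
    # (placeholder numbers, not taken from the study).
    ge = [0.78, 0.80, 0.79, 0.81]
    philips = [0.83, 0.84, 0.82, 0.85]
    siemens = [0.80, 0.81, 0.82, 0.80]
    print("F statistic:", one_way_anova_f([ge, philips, siemens]))

    # Hypothetical raw p-values from three post hoc paired comparisons.
    for p_adj, significant in bonferroni([0.01, 0.02, 0.5]):
        print(f"adjusted p = {p_adj:.3f}, significant = {significant}")
```

In practice the F statistic would be converted to a p-value using the F distribution with (k - 1, n - k) degrees of freedom, and the comparison would be repeated for each image pipeline (baseline, ComBat, HACA3).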