Pretraining Diversity and Clinical Metric Optimization Achieve State-of-the-Art Performance on ChestX-ray14
Authors
Affiliations (1)
- Independent Researcher
Abstract
We achieve state-of-the-art performance on the NIH ChestX-ray14 multi-label classification task with a simple 3-model ensemble: mean ROC-AUC 0.940, F1 0.821 (95% CI: 0.799-0.845), PR-AUC 0.827, sensitivity 76.0%, and specificity 98.8% across 14 thoracic diseases. Our primary finding challenges current research priorities: pretraining diversity dominates architectural diversity. A systematic evaluation of 255 ensemble combinations drawn from 8 models spanning three architecture families (ConvNeXt, Vision Transformers, EfficientNet) at multiple resolutions (224×224 to 384×384) revealed that a simple 3-model ConvNeXt ensemble combining ImageNet-1K, ImageNet-21K, and ImageNet-21K-384 pretrained variants outperformed every alternative combination, including modern Vision Transformers and efficiency-optimized architectures. At mean ROC-AUC 0.940, this ensemble exceeds recent hybrid transformer approaches (LongMaxViT [1]: 0.932) with substantially lower computational requirements. A systematic comparison of five optimization strategies (F1, F_SS, pure sensitivity, Youden's J, validation loss) established that clinical metric optimization outperforms traditional validation-loss optimization by 19.5% in F1 score. F_SS optimization (the harmonic mean of sensitivity and specificity) achieved the best clinical balance: highest sensitivity (73.9%), best Youden's J (0.727), and superior threshold-independent performance (ROC-AUC, PR-AUC). Traditional validation-loss optimization failed to align with diagnostic utility despite achieving mathematical convergence. Strategic pretraining selection and clinical metric optimization thus provide greater performance improvements than architectural innovation alone, enabling state-of-the-art results on accessible computational resources (an AWS g5.2xlarge at $1.21/hr).
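To make the two ingredients named above concrete, the sketch below averages per-model probabilities into a mean ensemble prediction and then selects a decision threshold by maximizing F_SS, the harmonic mean of sensitivity and specificity. This is a minimal illustration, not the paper's implementation: the function names, the threshold grid, and the per-label treatment are assumptions.

```python
import numpy as np

def ensemble_probs(prob_list):
    """Simple mean ensemble: average per-model probability arrays."""
    return np.mean(prob_list, axis=0)

def f_ss(sensitivity, specificity):
    """Harmonic mean of sensitivity and specificity (the F_SS metric)."""
    denom = sensitivity + specificity
    if denom == 0:
        return 0.0
    return 2.0 * sensitivity * specificity / denom

def best_threshold(y_true, y_prob, thresholds=np.linspace(0.05, 0.95, 91)):
    """Pick the threshold maximizing F_SS on held-out validation labels.

    y_true: binary ground-truth vector for one disease label.
    y_prob: ensemble probabilities for the same label.
    """
    best_t, best_score = 0.5, -1.0
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        sens = tp / (tp + fn) if (tp + fn) else 0.0
        spec = tn / (tn + fp) if (tn + fp) else 0.0
        score = f_ss(sens, spec)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score
```

In a multi-label setting like ChestX-ray14, `best_threshold` would be run independently for each of the 14 disease labels, since class prevalence (and therefore the sensitivity-specificity trade-off) differs sharply across labels.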