Investigating the Data Addition Dilemma in Longitudinal TBI MRI
Authors
Affiliations (1)
- Carnegie Mellon University
Abstract
Clinical machine learning (CML) in brain MRI analysis often assumes that more data means better performance. However, when added samples come from a different distribution than the training set, accuracy can decline, a phenomenon known as the Data Addition Dilemma. Here, we present the first systematic study of this dilemma in longitudinal traumatic brain injury (TBI) MRI, where acute baseline scans (session 1, S1) and follow-up scans (session 2, S2) exhibit pronounced distributional shifts. We make three key contributions. First, we quantify how intra-subject shifts (S1 → S2) and inter-subject variability jointly affect classifier performance in a 14-subject (28-scan) cohort spanning mild to severe TBI. Second, we compare four training schemes using a lightweight logistic-regression model on five-component PCA features: (1) an intra-session upper bound (S1 → S1), (2) a cross-session out-of-distribution (OOD) test (S1 → S2), (3) pooled training (S1+S2 → S1, S2), and (4) LOSO-IPA, which augments training with one unlabeled S2 scan per patient. Third, we derive actionable deployment insights: naive pooling can impair accuracy; pooled training trades baseline performance for robustness; and LOSO-IPA recovers near-intra-session accuracy. Accordingly, we recommend unlabeled per-subject follow-up anchoring and diagonal CORrelation ALignment (CORAL) covariance adjustment prior to inference. These findings clarify when additional data aid versus hinder CML in medical imaging and establish a minimally invasive framework for reliable longitudinal severity assessment in TBI.
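The diagonal CORAL adjustment mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "diagonal" means matching only per-feature variances (the diagonal of the covariance matrix) and, as a further assumption, per-feature means. The function name `diagonal_coral`, the `eps` parameter, and the synthetic arrays are hypothetical stand-ins, not the authors' implementation or data.

```python
import numpy as np


def diagonal_coral(target, source, eps=1e-8):
    """Match each target feature's mean and variance to the source's.

    Only the diagonal of the covariance is aligned; cross-feature
    correlations are left untouched (an assumption about "diagonal" CORAL).
    """
    target = np.asarray(target, dtype=float)
    source = np.asarray(source, dtype=float)
    mu_s, sd_s = source.mean(axis=0), source.std(axis=0)
    mu_t, sd_t = target.mean(axis=0), target.std(axis=0)
    # Standardize the target per feature, then rescale to source statistics.
    return (target - mu_t) / (sd_t + eps) * sd_s + mu_s


# Synthetic stand-ins for five-component PCA features (not study data).
rng = np.random.default_rng(0)
s1_features = rng.normal(loc=0.0, scale=1.0, size=(200, 5))  # "baseline"
s2_features = rng.normal(loc=3.0, scale=2.5, size=(200, 5))  # "follow-up"

s2_aligned = diagonal_coral(s2_features, s1_features)
print("S1 mean:", s1_features.mean(axis=0).round(3))
print("S2 aligned mean:", s2_aligned.mean(axis=0).round(3))
```

In this sketch the aligned S2 features would then be passed to a classifier fitted on S1 features. No labels from S2 are needed, which is consistent with the unlabeled anchoring the abstract recommends.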