External Validation of a CT-Based Radiogenomics Model for the Detection of EGFR Mutation in NSCLC and the Impact of Prevalence in Model Building by Using Synthetic Minority Over Sampling (SMOTE): Lessons Learned.
Authors
Affiliations (6)
- Joint Department of Medical Imaging, Princess Margaret Hospital, University Health Network, University of Toronto, Toronto, ON, Canada (A.A.K., S.A.M., R.H., R.K., C.O., U.M., P.V.-H.); Joint Department of Medical Imaging, University Health Network, 263 McCaul St 4th Floor, Toronto, ON M5T 1W7, Canada (A.A.K.). Electronic address: [email protected].
- Joint Department of Medical Imaging, Princess Margaret Hospital, University Health Network, University of Toronto, Toronto, ON, Canada (A.A.K., S.A.M., R.H., R.K., C.O., U.M., P.V.-H.).
- Joint Department of Medical Imaging, Princess Margaret Hospital, University Health Network, University of Toronto, Toronto, ON, Canada (A.A.K., S.A.M., R.H., R.K., C.O., U.M., P.V.-H.); Institute of Diagnostic and Interventional Radiology, University Hospital Zurich, University of Zurich, Zurich, Switzerland (R.H.).
- Joint Department of Medical Imaging, Princess Margaret Hospital, University Health Network, University of Toronto, Toronto, ON, Canada (A.A.K., S.A.M., R.H., R.K., C.O., U.M., P.V.-H.); Department of Diagnostics Radiology, Queen's University, Kingston, ON, Canada (R.K.).
- Department of Biostatistics, Princess Margaret Cancer Centre, University Health Network, University of Toronto, Toronto, ON, Canada (L.A.).
- Department of Radiation Oncology, University Health Network, University of Toronto, Toronto, ON, Canada (A.H.).
Abstract
Radiogenomics holds promise in identifying molecular alterations in non-small cell lung cancer (NSCLC) using imaging features. Previously, we developed a radiogenomics model to predict epidermal growth factor receptor (EGFR) mutations based on contrast-enhanced computed tomography (CECT) in NSCLC patients. The current study aimed to externally validate this model using a publicly available National Institutes of Health (NIH)-based NSCLC dataset and to assess the effect of EGFR mutation prevalence on model performance using the synthetic minority oversampling technique (SMOTE). The original radiogenomics model was validated on an independent NIH cohort (n=140). To assess the influence of disease prevalence, six SMOTE-augmented datasets were created, simulating EGFR mutation prevalence from 25% to 50%. Seven models were developed (one from the original data, six from SMOTE-augmented data), each undergoing rigorous cross-validation, feature selection, and logistic regression modeling. Models were tested against the NIH cohort. Performance was compared using the area under the receiver operating characteristic curve (AUC), and differences between radiomic-only, clinical-only, and combined models were statistically assessed. External validation revealed poor diagnostic performance for both our model and a previously published EGFR radiomics model (AUC ∼0.5). The clinical model alone achieved higher diagnostic accuracy (AUC 0.74). SMOTE-augmented models showed increased sensitivity but did not improve overall AUC compared to the clinical-only model. Changing EGFR mutation prevalence had minimal impact on AUC, challenging previous assumptions about the influence of sample imbalance on model performance. External validation failed to reproduce prior radiogenomics model performance, while clinical variables alone retained strong predictive value.
SMOTE-based oversampling did not improve diagnostic accuracy, suggesting that, in EGFR prediction, radiomics may offer limited value beyond clinical data. Emphasis on robust external validation and data-sharing is essential for future clinical implementation of radiogenomic models.
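As a rough illustration of how SMOTE can raise the minority-class prevalence to a chosen target, the following minimal sketch computes how many synthetic mutation-positive cases are needed and generates them by interpolating between a minority sample and one of its nearest minority neighbours. It is not the authors' pipeline: the abstract does not state which implementation or settings were used, and the function names (`synthetic_count`, `smote`), the neighbour count `k`, and the toy feature values are all assumptions made for illustration.

```python
import math
import random


def synthetic_count(n_minority, n_majority, target_prevalence):
    """Synthetic minority samples needed to reach the target prevalence.

    Solves (n_minority + s) / (n_minority + n_majority + s) = p for s.
    """
    total = n_minority + n_majority
    needed = (target_prevalence * total - n_minority) / (1.0 - target_prevalence)
    return max(0, math.ceil(needed))


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by SMOTE-style interpolation.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: math.dist(base, m),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy example (made-up feature values): 4 mutation-positive cases,
# 16 wild-type cases, target prevalence of 50%.
positives = [[0.2, 1.1], [0.4, 0.9], [0.3, 1.4], [0.6, 1.0]]
n_new = synthetic_count(len(positives), 16, 0.50)
augmented_positives = positives + smote(positives, n_new, k=3)
```

Because every synthetic case is interpolated from existing minority cases, oversampling changes the class balance seen during training without adding new information about the feature distribution, which is consistent with the abstract's observation that changing prevalence had minimal impact on AUC.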