Feature Selection and Machine Learning Strategies for CT Radiomics-Based Survival Prediction in Non-Small Cell Lung Cancer: A Comparative Study.
Authors
Affiliations (1)
- School of Medical and Health Sciences, Tung Wah College, Homantin, Hong Kong SAR, China.
Abstract
<b>Background/Objectives</b>: Computed tomography (CT)-based radiomics shows promise for predicting prognosis in non-small cell lung cancer (NSCLC), but model performance varies widely with the feature selection and machine learning strategies used, and the optimal combinations remain unclear. This study systematically compares feature selection methods and machine learning algorithms for 12-month overall survival prediction using CT radiomics in NSCLC patients. <b>Methods</b>: We analyzed 385 patients from The Cancer Imaging Archive (TCIA) NSCLC-Radiomics dataset. Radiomic features from primary tumor volumes were combined with clinical variables. We compared three feature selection methods: sequential forward selection (SFS), maximum relevance minimum redundancy (mRMR), and least absolute shrinkage and selection operator (LASSO). Each was paired with five classifiers: k-nearest neighbors (KNN), support vector machine (SVM), random forest (RF), logistic regression (LR), and gradient boosting classifier (GBC). Performance was assessed using area under the receiver operating characteristic curve (AUC) and accuracy on independent test sets. Cox regression and Kaplan-Meier analyses were used to evaluate survival risk stratification. <b>Results</b>: Logistic regression showed the most stable classification performance across feature selection strategies (test AUC 0.60-0.65, accuracy 0.72-0.73). The mRMR-LR model achieved the highest AUC (0.65), and the LASSO-LR model achieved the highest accuracy (0.73). For survival analysis, LASSO-based Cox modeling demonstrated superior risk stratification, with significant separation between high- and low-risk groups in both training and testing sets (<i>p</i> = 0.0095). <b>Conclusions</b>: Simpler models such as logistic regression provide robust performance in CT radiomics, while LASSO excels for survival risk stratification. Because validation was performed within a single public dataset, clinical applicability remains limited. Nevertheless, the findings provide methodological insights into the choice of feature selection and machine learning strategies for CT radiomics-based prognostic modeling in NSCLC.
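The evaluation metrics named in the Methods (AUC, accuracy, and Kaplan-Meier survival estimates) can be illustrated with a minimal, standard-library Python sketch. This is not the authors' implementation: the function names are illustrative, the 0.5 decision threshold is an assumption, and every number below is hypothetical toy data rather than data from the study.

```python
from typing import List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win
    (the Mann-Whitney U formulation of the AUC).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of cases classified correctly at an assumed score threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def kaplan_meier(times: Sequence[float],
                 events: Sequence[int]) -> List[Tuple[float, float]]:
    """Kaplan-Meier estimate: (time, S(time)) at each distinct event time.

    events[i] is 1 if the event (death) was observed at times[i],
    and 0 if the patient was censored at times[i].
    """
    curve = []
    surv = 1.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve


# Hypothetical toy data: label 1 = event within 12 months, score = model output.
labels = [1, 1, 1, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.5]
auc = roc_auc(labels, scores)    # 13 of 15 pairs ranked correctly, about 0.867
acc = accuracy(labels, scores)   # 6 of 8 cases correct, 0.75

# Hypothetical follow-up times in months, with event indicators (0 = censored).
times = [5, 8, 8, 12, 14, 20]
events = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(times, events)

print(f"AUC = {auc:.3f}, accuracy = {acc:.2f}")
for t, s in km_curve:
    print(f"S({t}) = {s:.3f}")
```

In a risk stratification analysis of the kind described, a curve like this would be estimated separately for the high- and low-risk groups and the two curves compared, for example with a log-rank test. In practice an established library would normally be used instead of hand-written estimators.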