Robust Semi-Supervised CT Radiomics for Lung Cancer Prognosis: Cost-Effective Learning with Limited Labels and SHAP Interpretation.
Authors
Abstract
Computed tomography (CT) imaging is essential for lung cancer (LCa) management, offering detailed visualization and valuable information for artificial intelligence-guided prognosis. However, supervised learning (SL) models require extensive labeled data, which limits their real-world utility where annotations are scarce. We analyzed CT scans from 977 patients across 12 public and private datasets, extracting 1,218 radiomics features using Laplacian of Gaussian and wavelet filters via standardized PyRadiomics. Dimensionality was reduced using 56 feature selection and attribute extraction algorithms, and 27 classifiers were benchmarked. A semi-supervised learning (SSL) framework with pseudo-labeling used 478 unlabeled and 499 labeled cases. Model sensitivity was assessed in three scenarios: varying the amount of labeled data in SL, increasing the amount of unlabeled data in SSL, and jointly scaling both from 10% to 100%. SHapley Additive exPlanations (SHAP) analysis was used to interpret the predictions of the top models. Five-fold cross-validation and external testing in two cohorts were performed. SSL outperformed SL across all metrics, improving overall survival prediction by up to 17%. The top SSL model (Feature Importance by Random Forest + XGBoost) achieved an accuracy of 0.90±0.01 in cross-validation and 0.88±0.01 in external testing. SHAP revealed enhanced feature discriminability in SSL and SL, particularly for class 1 (survival > 4 years), and helped explain model decisions. SSL maintained strong performance even with only 10% labeled data. Both SSL-based scenarios showed more stable performance than SL, with lower variance across external testing, underscoring the robustness and cost-efficiency of SSL. We propose an interpretable, cost-effective SSL framework for CT-based LCa survival prediction. It improves performance, generalizability, and clinical readiness through SHAP explainability and the use of unlabeled data.
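The pseudo-labeling step mentioned above can be illustrated with a minimal sketch. This is not the pipeline used in the study: a toy nearest-centroid classifier stands in for the Random Forest feature ranking and XGBoost classifier, the data are synthetic two-dimensional points rather than radiomics features, and the names `pseudo_label`, `predict_with_confidence`, `threshold`, and `max_rounds` are hypothetical choices made for illustration only. The idea shown is the general one: train on labeled cases, assign labels to unlabeled cases only when the model is confident, and retrain on the enlarged set.

```python
import math
import random


def centroids(X, y):
    """Mean feature vector per class label (toy stand-in for a real classifier)."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}


def predict_with_confidence(cents, row):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    dists = {c: math.dist(row, m) for c, m in cents.items()}
    d_min = min(dists.values())  # shift distances for numerical stability
    weights = {c: math.exp(-(d - d_min)) for c, d in dists.items()}
    label = max(weights, key=weights.get)
    return label, weights[label] / sum(weights.values())


def pseudo_label(X_lab, y_lab, X_unlab, threshold=0.8, max_rounds=5):
    """Iteratively add confidently predicted unlabeled cases to the training set."""
    X_train, y_train = list(X_lab), list(y_lab)
    remaining = list(X_unlab)
    for _ in range(max_rounds):
        cents = centroids(X_train, y_train)
        still_unlabeled, added = [], 0
        for row in remaining:
            label, conf = predict_with_confidence(cents, row)
            if conf >= threshold:
                # Accept the model's own prediction as a pseudo-label.
                X_train.append(row)
                y_train.append(label)
                added += 1
            else:
                still_unlabeled.append(row)
        remaining = still_unlabeled
        if added == 0 or not remaining:
            break
    return centroids(X_train, y_train), X_train, y_train, remaining


# Synthetic demo: two well-separated clusters (assumed data, not from the study).
random.seed(0)
def cluster(center, n):
    return [[random.gauss(center, 0.5), random.gauss(center, 0.5)] for _ in range(n)]

X_lab = cluster(0.0, 10) + cluster(4.0, 10)
y_lab = [0] * 10 + [1] * 10
X_unlab = cluster(0.0, 20) + cluster(4.0, 20)

model, X_train, y_train, leftover = pseudo_label(X_lab, y_lab, X_unlab)
print(f"training cases after pseudo-labeling: {len(X_train)}, left unlabeled: {len(leftover)}")
```

The confidence threshold controls the trade-off between using more unlabeled cases and admitting noisy pseudo-labels; cases that never reach the threshold simply stay out of the training set.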