
Data-Efficient and Explainable Multimodal Survival Prediction in NSCLC Using Deep Image Embeddings, Clinical Variables, and Gradient-Boosted Trees.

June 22, 2026

Authors

Sahin S, Karacor AG

Affiliations (2)

  • Department of Electrical and Electronics Engineering, Faculty of Engineering and Natural Sciences, Fenerbahce University, Istanbul 34758, Türkiye.
  • Department of Industrial Engineering, Faculty of Engineering and Natural Sciences, Fenerbahce University, Istanbul 34758, Türkiye.

Abstract

Background/Objectives: Survival prediction in non-small cell lung cancer (NSCLC) remains challenging, particularly in limited-sample settings where end-to-end deep learning models may suffer from limited generalization. This study aimed to develop a data-efficient, multimodal, and explainable framework integrating computed tomography (CT)-derived imaging information with clinical variables for NSCLC survival prediction.

Methods: CT images, tumor segmentations, and clinical data from the publicly available NSCLC Radiomics (LUNG1) dataset (377 patients) were used. Tumor-focused regions were extracted using segmentation masks, and pretrained RadImageNet-InceptionV3 embeddings were obtained from the largest tumor-containing slice and neighboring-slice summaries. Deep imaging embeddings, engineered imaging features, and clinical variables were fused into a unified tabular representation. To improve robustness under limited-sample conditions, feature blocks were compressed using principal component analysis. CatBoost, XGBoost, and LightGBM models were trained on a development set and evaluated on a strictly held-out final validation set.

Results: In three-class survival stratification, assigning censored/non-event patients to the upper survival group produced the strongest ordinal prognostic performance. Under the EX_PLUS_NON_EX_TOP setting, CatBoost achieved the best holdout score-based class C-index of 0.655. In continuous survival regression, LightGBM achieved the best holdout event-patient C-index of 0.576. Clinical variables provided the dominant prognostic signal, while compact deep image embeddings contributed complementary information, particularly in separating short- and long-survival groups. SHAP analysis confirmed contributions from both clinical and image-derived features.

Conclusions: The proposed framework provides a proof-of-concept demonstration of a data-efficient and explainable image-to-tabular approach for NSCLC survival prediction under strict internal holdout validation. The results suggest that pretrained CT embeddings, clinical variables, gradient-boosted trees, and SHAP-based interpretation can be combined in a feasible, limited-sample survival modeling pipeline, while external validation remains necessary before clinical translation.
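The reported results are concordance indices (C-index), so a small worked example may help in reading values such as 0.655 and 0.576. The sketch below is a generic Harrell-style C-index for right-censored data, not the authors' implementation: the abstract does not define its "score-based class" or "event-patient" variants, and the function name, the tie handling, and the toy data are assumptions made for illustration only.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell-style C-index for right-censored survival data.

    times:       observed follow-up time per patient
    events:      1 if death was observed, 0 if the patient was censored
    risk_scores: model output, where higher means shorter predicted survival
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            # Assumption: pairs with tied times are skipped for simplicity.
            continue
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            # The shorter time is censored, so the true ordering is unknown.
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # the model ranks the pair correctly
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical, not from LUNG1): four patients, one censored.
times = [5, 10, 15, 20]
events = [1, 0, 1, 1]
risk = [0.9, 0.2, 0.1, 0.6]
c_index = concordance_index(times, events, risk)
print(c_index)
```

In this toy case four pairs are comparable and three are ranked correctly, giving 0.75. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the holdout figures above should be read.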

Topics

Journal Article
