Feature Selection in Healthcare Datasets: Towards a Generalizable Solution.
Authors
Affiliations (3)
- Institute of Biomedical and Neural Engineering, Reykjavik University, Reykjavik, Iceland.
- Institute of Biomedical and Neural Engineering, Reykjavik University, Reykjavik, Iceland; Department of Science, Landspitali University Hospital, Reykjavik, Iceland.
- Institute of Biomedical and Neural Engineering, Reykjavik University, Reykjavik, Iceland. Electronic address: [email protected].
Abstract
The increasing dimensionality of healthcare datasets presents major challenges for clinical data analysis and interpretation. This study introduces a scalable ensemble feature selection (FS) strategy optimized for multi-biometric healthcare datasets, with four aims: reducing dimensionality, identifying the most significant features, improving the performance of machine learning models, and enhancing interpretability in a clinical context. The novel waterfall selection, which sequentially integrates (a) tree-based feature ranking and (b) greedy backward feature elimination, produces several feature subsets. These subsets are then combined using a specific merging strategy to produce a single set of clinically relevant features. The overall method is applied to two healthcare datasets: the biosignal-based BioVRSea dataset, containing electromyography, electroencephalography, and center-of-pressure data for postural control and motion sickness assessment, and the image-based SinPain dataset, which includes MRI and CT-scan data to study knee osteoarthritis. Our ensemble FS approach achieved effective dimensionality reduction, shrinking certain feature subsets by more than 50%. The reduced feature set maintained or improved classification metrics when tested with Support Vector Machine and Random Forest models. The proposed ensemble FS method retains the features essential for distinguishing clinical outcomes, leading to models that are both computationally efficient and clinically interpretable. Furthermore, the adaptability of this method across two heterogeneous healthcare datasets and the scalability of the algorithm indicate its potential as a generalizable tool in healthcare studies. This approach can advance clinical decision support systems, making high-dimensional healthcare datasets more accessible and clinically interpretable.
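The two-stage waterfall and the subsequent merge can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation, and several pieces are assumptions made for illustration: a single-split (depth-1) tree stands in for the tree-based ranker, training accuracy of a nearest-centroid classifier stands in for the Support Vector Machine and Random Forest scores, the merge is a plain union because the abstract does not specify the merging strategy, and the toy data, the feature names, and the `keep_top` parameter are hypothetical.

```python
from statistics import mean

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def stump_importance(values, y):
    """Best impurity decrease of a depth-1 tree on one feature (assumed ranker)."""
    base, n, best = gini(y), len(y), 0.0
    for t in sorted(set(values))[:-1]:
        left = [c for v, c in zip(values, y) if v <= t]
        right = [c for v, c in zip(values, y) if v > t]
        gain = base - (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, gain)
    return best

def centroid_accuracy(data, y, feats):
    """Training accuracy of a nearest-centroid classifier (assumed toy scorer)."""
    if not feats:
        return 0.0
    classes = sorted(set(y))
    cents = {c: [mean(v for v, lab in zip(data[f], y) if lab == c) for f in feats]
             for c in classes}
    hits = 0
    for i, lab in enumerate(y):
        row = [data[f][i] for f in feats]
        pred = min(classes,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(row, cents[c])))
        hits += pred == lab
    return hits / len(y)

def waterfall(data, y, feats, keep_top):
    """Stage (a): rank and truncate. Stage (b): greedy backward elimination."""
    ranked = sorted(feats, key=lambda f: stump_importance(data[f], y), reverse=True)
    selected = ranked[:keep_top]
    score = centroid_accuracy(data, y, selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for f in reversed(selected):  # try the lowest-ranked feature first
            trial = [g for g in selected if g != f]
            trial_score = centroid_accuracy(data, y, trial)
            if trial_score >= score:  # drop it if the score is maintained
                selected, score, improved = trial, trial_score, True
                break
    return selected, score

def merge(subsets):
    """Assumed merging rule: order-preserving union of the per-group subsets."""
    merged = []
    for s in subsets:
        merged += [f for f in s if f not in merged]
    return merged

# Hypothetical toy data: two feature groups, one informative feature in each.
y = [0, 0, 0, 0, 1, 1, 1, 1]
data = {
    "emg_a": [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1],
    "emg_b": [5.0, 4.0, 5.0, 4.0, 5.0, 4.0, 5.0, 4.0],
    "emg_c": [2.0, 2.1, 1.9, 2.0, 2.0, 2.1, 1.9, 2.0],
    "eeg_a": [0.5, 0.4, 0.6, 0.5, 0.1, 0.2, 0.0, 0.1],
    "eeg_b": [7.0, 7.0, 8.0, 8.0, 7.0, 7.0, 8.0, 8.0],
}
groups = {"emg": ["emg_a", "emg_b", "emg_c"], "eeg": ["eeg_a", "eeg_b"]}

subsets = [waterfall(data, y, feats, keep_top=2)[0] for feats in groups.values()]
merged = merge(subsets)
print(merged)
```

In this sketch each feature group is reduced independently and the surviving features are pooled afterwards, which mirrors the "several subsets, then one merged set" structure described above. In practice the scorer would more plausibly be a cross-validated model score than a training accuracy, which is used here only to keep the example short.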