Statistical Learning in High-Dimensional Data

The performance of random forests compared to penalized linear regression in scenarios with sparse informative features

Bachelor Thesis (2025)
Author(s)

E.C. Brouwer (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

H.N. Kekkonen – Mentor (TU Delft - Statistics)

J.C. van der Voort – Mentor (TU Delft - Mathematical Physics)

Marleen Keijzer – Graduation committee member (TU Delft - Mathematical Physics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
20-06-2025
Awarding Institution
Delft University of Technology
Programme
Applied Mathematics
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

In modern data analysis, high-dimensional datasets where the number of features far exceeds the number of observations are increasingly common. In such settings, identifying sparse informative features is a critical challenge. This thesis investigates the comparative performance of random forests and two penalized linear regression techniques, Ridge and Lasso regression, in scenarios with few informative features. Synthetic datasets were generated to simulate both linear and non-linear relationships between features and the target variable. The models were evaluated using mean squared error (MSE) to measure prediction accuracy and two custom metrics assessing the ability to identify informative features among the top-ranked variables. Additionally, computational efficiency was assessed.
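As an illustration of the kind of experimental setup described above, the following sketch generates a sparse high-dimensional linear regression problem and compares test MSE for Ridge, Lasso, and a random forest using scikit-learn. The dataset sizes, noise level, and hyperparameters here are assumptions for illustration, not the configuration used in the thesis.

```python
# Minimal sketch of a sparse high-dimensional regression benchmark.
# All sizes and hyperparameters below are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# p >> n with only a few informative features (sparse signal).
X, y = make_regression(n_samples=100, n_features=1000,
                       n_informative=10, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1, max_iter=10000),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
mses = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mses[name] = mean_squared_error(y_te, model.predict(X_te))
    print(f"{name}: test MSE = {mses[name]:.2f}")
```

On data with a truly linear, sparse signal, this setup typically reproduces the pattern reported in the abstract: Lasso's sparsity-inducing penalty gives it a clear prediction advantage over Ridge and the random forest.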
In linear settings, Lasso consistently outperformed the other models in both prediction and feature selection, particularly at the lower informative ratios. Ridge regression demonstrated reasonable feature selection accuracy in linear settings but underperformed in prediction, likely due to a limited hyperparameter grid. In non-linear settings, random forests were most effective at identifying informative features, though the prediction performance of Lasso remained competitive and slightly outperformed random forests under high sparsity. Training time analysis further showed that the penalized linear models were substantially faster than random forests. These findings underscore that no single model is universally optimal, as performance varies depending on data structure, sparsity, and the modelling objective. They also demonstrate that improved feature selection does not necessarily guarantee better prediction in complex, high-dimensional settings. These results have practical implications for domains like genomics, where balancing accuracy, interpretability, and computational efficiency is critical.
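The feature-identification comparison can be sketched as follows: rank features by the magnitude of Lasso coefficients versus random-forest importances, then count how many truly informative features land in the top k. The top-k hit count here is a hypothetical stand-in for the thesis's custom metrics, and all dataset settings are assumptions.

```python
# Illustrative sketch: how many truly informative features does each model
# rank among its top-k? The "hit count" metric is a simplified stand-in for
# the thesis's custom feature-selection metrics.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

# coef=True returns the true coefficients, so we know which
# features are informative (nonzero coefficient).
X, y, coef = make_regression(n_samples=100, n_features=500,
                             n_informative=10, noise=1.0,
                             coef=True, random_state=0)
true_idx = set(np.flatnonzero(coef))
k = 10

lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

def top_k_hits(scores, k, true_idx):
    """Count informative features among the k highest-scoring ones."""
    top = set(np.argsort(np.abs(scores))[-k:])
    return len(top & true_idx)

lasso_hits = top_k_hits(lasso.coef_, k, true_idx)
rf_hits = top_k_hits(rf.feature_importances_, k, true_idx)
print("lasso top-10 hits:", lasso_hits)
print("rf    top-10 hits:", rf_hits)
```

With a linear signal and this signal-to-noise ratio, Lasso tends to recover most informative features; replacing `y` with a non-linear function of the informative features is the natural way to probe the regime where the abstract reports random forests pulling ahead.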
