Statistical Learning in High-Dimensional Data

The performance of random forests compared to penalized linear regression in scenarios with sparse informative features

Bachelor Thesis (2025)
Author(s)

E.C. Brouwer (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

H.N. Kekkonen – Mentor (TU Delft - Statistics)

J.C. van der Voort – Mentor (TU Delft - Mathematical Physics)

Marleen Keijzer – Graduation committee member (TU Delft - Mathematical Physics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
20-06-2025
Awarding Institution
Delft University of Technology
Programme
Applied Mathematics
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

In modern data analysis, high-dimensional datasets where the number of features far exceeds the number of observations are increasingly common. In such settings, identifying sparse informative features is a critical challenge. This thesis investigates the comparative performance of random forests and two penalized linear regression techniques, Ridge and Lasso regression, in scenarios with few informative features. Synthetic datasets were generated to simulate both linear and non-linear relationships between features and the target variable. The models were evaluated using mean squared error (MSE) to measure prediction accuracy and two custom metrics assessing the ability to identify informative features among the top-ranked variables. Additionally, computational efficiency was assessed.
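As an illustration of the kind of experimental setup described above, the following sketch generates a sparse high-dimensional linear regression problem and compares test MSE for Ridge, Lasso, and a random forest using scikit-learn. The dataset sizes, noise level, and hyperparameters here are assumptions for illustration, not the configuration used in the thesis.

```python
# Minimal sketch of a sparse high-dimensional regression benchmark.
# All sizes and hyperparameters below are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# p >> n with only a few informative features (sparse signal).
X, y = make_regression(n_samples=100, n_features=1000,
                       n_informative=10, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1, max_iter=10000),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
mses = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mses[name] = mean_squared_error(y_te, model.predict(X_te))
    print(f"{name}: test MSE = {mses[name]:.2f}")
```

On data with a truly linear, sparse signal, this setup typically reproduces the pattern reported in the abstract: Lasso's sparsity-inducing penalty gives it a clear prediction advantage over Ridge and the random forest.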
In linear settings, Lasso consistently outperformed the other models in both prediction and feature selection, particularly at the lower informative ratios. Ridge regression demonstrated reasonable feature selection accuracy in linear settings but underperformed in prediction, likely due to a limited hyperparameter grid. In non-linear settings, random forests were most effective at identifying informative features, though the prediction performance of Lasso remained competitive and slightly outperformed random forests under high sparsity. Training time analysis further showed that the penalized linear models were substantially faster than random forests. These findings underscore that no single model is universally optimal, as performance varies depending on data structure, sparsity, and the modelling objective. They also demonstrate that improved feature selection does not necessarily guarantee better prediction in complex, high-dimensional settings. These results have practical implications for domains like genomics, where balancing accuracy, interpretability, and computational efficiency is critical.
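The feature-identification comparison can be sketched as follows: rank features by the magnitude of Lasso coefficients versus random-forest importances, then count how many truly informative features land in the top k. The top-k hit count here is a hypothetical stand-in for the thesis's custom metrics, and all dataset settings are assumptions.

```python
# Illustrative sketch: how many truly informative features does each model
# rank among its top-k? The "hit count" metric is a simplified stand-in for
# the thesis's custom feature-selection metrics.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

# coef=True returns the true coefficients, so we know which
# features are informative (nonzero coefficient).
X, y, coef = make_regression(n_samples=100, n_features=500,
                             n_informative=10, noise=1.0,
                             coef=True, random_state=0)
true_idx = set(np.flatnonzero(coef))
k = 10

lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

def top_k_hits(scores, k, true_idx):
    """Count informative features among the k highest-scoring ones."""
    top = set(np.argsort(np.abs(scores))[-k:])
    return len(top & true_idx)

lasso_hits = top_k_hits(lasso.coef_, k, true_idx)
rf_hits = top_k_hits(rf.feature_importances_, k, true_idx)
print("lasso top-10 hits:", lasso_hits)
print("rf    top-10 hits:", rf_hits)
```

With a linear signal and this signal-to-noise ratio, Lasso tends to recover most informative features; replacing `y` with a non-linear function of the informative features is the natural way to probe the regime where the abstract reports random forests pulling ahead.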
