In modern data analysis, high-dimensional datasets in which the number of features far exceeds the number of observations are increasingly common. In such settings, identifying the few informative features is a critical challenge. This thesis investigates the comparative performance of random forests and two penalized linear regression techniques, Ridge and Lasso regression, in scenarios with few informative features. Synthetic datasets were generated to simulate both linear and non-linear relationships between the features and the target variable. The models were evaluated using mean squared error (MSE) to measure prediction accuracy, together with two custom metrics that assess how well each model places the informative features among its top-ranked variables. Computational efficiency was also assessed.
In linear settings, Lasso consistently outperformed the other models in both prediction and feature selection, particularly at lower ratios of informative features. Ridge regression demonstrated reasonable feature selection accuracy in linear settings but underperformed in prediction, likely due to a limited hyperparameter grid. In non-linear settings, random forests were most effective at identifying informative features, though the prediction performance of Lasso remained competitive and slightly exceeded that of random forests under high sparsity. Training time analysis further showed that the penalized linear models were substantially faster than random forests. These findings underscore that no single model is universally optimal, as performance depends on data structure, sparsity, and the modelling objective. They also demonstrate that improved feature selection does not necessarily guarantee better prediction in complex, high-dimensional settings. These results have practical implications for domains like genomics, where balancing accuracy, interpretability, and computational efficiency is critical.
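The kind of experiment summarized above can be illustrated with a minimal sketch. It is not the thesis's actual pipeline, and everything in it is an assumption made for illustration: the helper names (`make_sparse_linear`, `ridge_coefficients`, `top_k_precision`), the sample sizes, the noise level, and the penalty `alpha`. The abstract does not define the two custom feature-selection metrics, so a plain "precision among the top k ranked features" stands in for them here. Ridge is fitted in closed form with NumPy only to keep the example dependency-free.

```python
import numpy as np


def make_sparse_linear(n_samples=100, n_features=200, n_informative=5,
                       noise=0.1, seed=0):
    """Hypothetical generator: a linear target driven by a few features."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))
    informative = rng.choice(n_features, size=n_informative, replace=False)
    coef = np.zeros(n_features)
    # Non-zero coefficients with magnitude in [1, 2] and a random sign.
    coef[informative] = (rng.uniform(1.0, 2.0, size=n_informative)
                         * rng.choice([-1.0, 1.0], size=n_informative))
    y = X @ coef + noise * rng.standard_normal(n_samples)
    return X, y, informative


def ridge_coefficients(X, y, alpha=1.0):
    """Closed-form ridge solution: solve (X^T X + alpha * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def top_k_precision(scores, informative, k):
    """Assumed stand-in metric: share of the k highest |score| features
    that are truly informative."""
    top_k = np.argsort(-np.abs(scores))[:k]
    hits = set(top_k.tolist()) & set(np.asarray(informative).tolist())
    return len(hits) / k


X, y, informative = make_sparse_linear()

# Simple hold-out split: first half for fitting, second half for testing.
half = X.shape[0] // 2
X_train, y_train = X[:half], y[:half]
X_test, y_test = X[half:], y[half:]

w = ridge_coefficients(X_train, y_train, alpha=1.0)
test_mse = float(np.mean((X_test @ w - y_test) ** 2))
precision = top_k_precision(w, informative, k=len(informative))

print(f"test MSE: {test_mse:.3f}, top-k precision: {precision:.2f}")
```

In this sketch the features outnumber the training observations (200 versus 50), which mirrors the high-dimensional regime the thesis studies. Swapping in a Lasso or random-forest fit and ranking by coefficient magnitude or feature importance would leave the two evaluation steps unchanged.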