Principal Component Analysis of Education-Related Data Sets

Bachelor thesis (2020)

Authors

T.P.K. Nguyen Electrical Engineering, Mathematics and Computer Science

Contributors

Cornelis Vuik Numerical Analysis - (supervisor 1)

K.P. Hart Analysis - (supervisor 2)

Elizaveta Wobbes Numerical Analysis - (supervisor 1)

Erik Fleur (supervisor 1)

Faculty

Electrical Engineering, Mathematics and Computer Science

Education, Data Science, Principal Component Analysis, PCA

To reference this document use:

http://resolver.tudelft.nl/uuid:11a166e3-cd94-45e8-91ed-660a0cfe8b9e

Published Date

21-07-2020

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Principal Component Analysis (PCA) is a mathematical technique that reduces the dimension of a data set while retaining its most important information. Because of this property, PCA is chosen here to handle a substantial amount of data. This thesis answers two questions: which variables influence a pupil's attainment test score according to linear regression, and does PCA yield better linear regression models? The data used in this thesis were provided by DUO, the Dutch Executive Agency for Education, and contain information about pupils who completed the attainment test in 2008-2013. The thesis starts with a brief description of the data set and some background on PCA. The data are preprocessed before linear regression is applied. A linear model with all variables gives the largest absolute coefficients for teachers' secondary school recommendations. Applying PCA gives insight into which variables are (likely) dependent on each other, not only in the sense of linear dependency but also in how they influence each other more generally. PCA also indicates which variables are most likely to have a significant impact. When the data set is free of linearly dependent variables, PCA may give worse-fitting models. However, these models are still better than models built from randomly chosen variables.
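The comparison described in the abstract, a linear regression on all variables versus a regression on a reduced set of principal components, can be illustrated with a minimal sketch. This is not the thesis's code and does not use the DUO data: the three synthetic variables, the helper names `pca` and `r_squared`, and the choice of two components are all assumptions made for illustration only.

```python
import numpy as np


def pca(X, n_components):
    """PCA via SVD on standardized columns (hypothetical helper).

    Returns the component scores, the component directions, and the
    fraction of total variance explained by each kept component.
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    _, S, Vt = np.linalg.svd(Xs, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    components = Vt[:n_components]
    scores = Xs @ components.T
    return scores, components, explained[:n_components]


def r_squared(features, y):
    """Coefficient of determination of a least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), features])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


# Synthetic data (an assumption, not the thesis data): x2 is almost a
# copy of x1, so the three variables carry roughly two dimensions of
# information.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 1.0 * x3 + 0.5 * rng.normal(size=n)

scores, components, ratio = pca(X, n_components=2)

r2_full = r_squared(X, y)      # regression on all original variables
r2_pca = r_squared(scores, y)  # regression on two principal components

print(f"explained variance ratio: {ratio}")
print(f"R^2, all variables: {r2_full:.4f}")
print(f"R^2, two components: {r2_pca:.4f}")
```

Because the component scores are linear combinations of the standardized variables, the in-sample fit on the components can never exceed the fit on all variables. This is consistent with the abstract's remark that PCA may give worse-fitting models once linearly dependent variables are removed. In this toy setup the loss is small because the dropped component mostly reflects the redundancy between `x1` and `x2`.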

Files

Bep_ThaoNguyen.pdf

(.pdf | 0.665 Mb)