Minimize experimentation overhead through dataset selection, ensemble feature attention, and feature selection with reduced subset sizes
M. Anton (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Luis Cruz – Mentor (TU Delft - Software Engineering)
A. Shome – Mentor (TU Delft - Software Engineering)
A. van Deursen – Graduation committee member (TU Delft - Software Engineering)
J.C. van Gemert – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)
Vincent Cohen-Addad – Mentor (Google Research)
Sammy Jerome – Mentor (Google Research)
Abstract
In large-scale ML, dataset size becomes a critical variable, especially at large companies where models already exist and are hard to change or fine-tune. Because time to market and model quality are essential metrics, methods that select, prune, and augment the input data while treating the model as a black box can speed up the path from raw data to productionized model.
Datasets can contain thousands of features and many redundant or duplicate samples, for various business-logic reasons. In some ML flows, only a subset of these features contributes most of the final accuracy. Likewise, insight into which data points are the most meaningful can help engineers collect more relevant samples or focus their attention on specific parts of the data distribution.
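As a purely illustrative sketch (not the method developed in this work), the snippet below shows one way to do black-box feature selection with a reduced subset size: a greedy forward search that scores candidate feature subsets only by the held-out accuracy of a re-trained model. The dataset, model choice, and subset budget are placeholder assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder dataset and black-box model; any fit/predict model works here.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def score_subset(features):
    """Fit the black-box model on a feature subset and return held-out accuracy."""
    model = LogisticRegression(max_iter=5000)
    model.fit(X_train[:, features], y_train)
    return accuracy_score(y_val, model.predict(X_val[:, features]))

budget = 5          # reduced subset size (illustrative)
selected = []
for _ in range(budget):
    remaining = [f for f in range(X.shape[1]) if f not in selected]
    # Add the feature whose inclusion most improves held-out accuracy.
    best = max(remaining, key=lambda f: score_subset(selected + [f]))
    selected.append(best)
    print(f"subset={selected} accuracy={score_subset(selected):.3f}")
```

The key point of the sketch is that the model is never inspected or modified: feature subsets are compared purely through retraining and evaluation, which is what makes this style of data-side optimization applicable when the model itself is hard to change.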