GPU Implementation of Grid Search based Feature Selection

Using Machine Learning to Predict Hydrocarbons using High Dimensional Datasets

Master Thesis (2020)
Author(s)

T.A. Ament (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Haixiang Lin – Mentor (TU Delft - Mathematical Physics)

Chris te Stroet – Graduation committee member (Biodentify)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2020 Tjalling Ament
Publication Year
2020
Language
English
Graduation Date
21-02-2020
Awarding Institution
Delft University of Technology
Programme
Computer Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

To optimize the exploitation of oil and gas reservoirs both on- and offshore, Biodentify has developed a method to predict the prospectivity of hydrocarbons before drilling. This method uses microbiological DNA analysis of shallow soil or seabed samples to detect vertical upward microseepage from hydrocarbon accumulations, which changes the composition of microbes at the surface.
Microbiological DNA analysis of shallow soil or seabed samples results in a high-dimensional dataset, which is interpreted using machine learning. Using the machine learning method Elastic Net, features (microbes) are selected from an existing DNA database to classify new shallow soil or seabed samples. Multiple models, each with a different combination of externally set parameters (called hyperparameters), are trained to improve accuracy, essentially creating a grid of models.

The aim of this thesis is to investigate whether feature selection on high-dimensional datasets can be accelerated by implementing a parallel design on a GPU to train this grid of models, and to evaluate the performance of this GPU implementation. Inspired by Shotgun, an implementation that improves performance by exploiting parallelism across features when training a single model on a CPU, a new implementation named GPU Shotgun (GPU-SG) was devised, which exploits parallelism across samples, features, and the multiple models in the grid of hyperparameter combinations.

Depending on the size of the grid and the hardware, GPU-SG achieves a speedup between 0.2 and 5.26 for sparse datasets (datasets with many zero values) compared to standard CPU implementations. For dense datasets (datasets with few zero values), GPU-SG achieves a speedup between 0.5 and 10. The memory available to store a dataset is smaller on GPUs than on CPUs, and the current design is limited by this, because it cannot handle a dataset larger than the available GPU memory. GPU-SG can serve as a basis for improved implementations that reduce the time the GPU or CPU is idle, further improving performance.
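The grid-of-models idea from the abstract can be illustrated with a minimal sketch: one Elastic Net model is trained per hyperparameter combination, and the features with nonzero coefficients are the ones "selected". This sketch uses scikit-learn and synthetic data purely for illustration; the library, data, and grid values are assumptions, not the thesis implementation (which runs this grid in parallel on a GPU).

```python
# Illustrative sketch of grid search based feature selection with Elastic Net.
# scikit-learn and the synthetic data below are assumptions for demonstration;
# the thesis trains this grid of models in parallel on a GPU instead.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 500))   # high-dimensional: 500 candidate features
# Only the first 5 features actually influence the target.
y = X[:, :5] @ np.ones(5) + 0.1 * rng.standard_normal(100)

# Grid of externally set parameters (hyperparameters): one model per combination.
alphas = [0.01, 0.1, 1.0]             # overall regularization strength
l1_ratios = [0.2, 0.5, 0.8]           # mix between L1 (sparse) and L2 penalties

results = {}
for alpha in alphas:
    for l1_ratio in l1_ratios:
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10_000)
        model.fit(X, y)
        # Features with nonzero coefficients are "selected" by this model.
        results[(alpha, l1_ratio)] = int(np.count_nonzero(model.coef_))

for params, n_selected in sorted(results.items()):
    print(params, "selected features:", n_selected)
```

Each cell of the grid is an independent training run, which is what makes the problem amenable to the model-level parallelism GPU-SG exploits alongside parallelism across samples and features.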

Files

Tjalling_Ament_thesis.pdf
(pdf | 4.75 MB)
License info not available