GPU Implementation of Grid Search based Feature Selection

Using Machine Learning to Predict Hydrocarbons using High Dimensional Datasets

Abstract

To optimize the exploitation of oil and gas reservoirs both on- and offshore, Biodentify has developed a method to predict the prospectivity of hydrocarbons before drilling. This method uses microbiological DNA analysis of shallow soil or seabed samples to detect vertical upward microseepage from hydrocarbon accumulations, which changes the composition of microbes at the surface.
Microbiological DNA analysis of shallow soil or seabed samples results in a high-dimensional dataset, which is interpreted using machine learning. Using the machine learning method Elastic Net, features (microbes) are selected from an existing DNA database to classify new shallow soil or seabed samples. Multiple models, each with a different combination of externally set parameters (called hyperparameters), are trained to improve accuracy, essentially creating a grid of models.

The aim of this thesis is to investigate whether feature selection on high-dimensional datasets can be accelerated by training this grid of models with a parallel design on a GPU, and to evaluate the performance of this GPU implementation. Inspired by an implementation called Shotgun, which improves the performance of training a single model on a CPU by exploiting parallelism across features, an implementation named GPU Shotgun (GPU-SG) was devised that exploits parallelism across samples, features, and the multiple models in the grid of hyperparameter combinations.

Depending on the size of the grid and the hardware, GPU-SG achieves a speedup between 0.2 and 5.26 on sparse datasets (datasets with many zero values) compared to standard CPU implementations. On dense datasets (datasets with few zero values), GPU-SG achieves a speedup between 0.5 and 10. The memory available to store a dataset is smaller on a GPU than on a CPU, and the current design is limited by this, because it does not support datasets larger than the available GPU memory. GPU-SG can serve as a basis for improved implementations that reduce GPU and CPU idle time to further improve performance.
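The "grid of models" described above can be sketched as the Cartesian product of hyperparameter values. The sketch below is a minimal illustration, not code from the thesis; the parameter names (alpha for the overall regularization strength, l1_ratio for the L1/L2 mix) follow common Elastic Net conventions and the specific values are placeholders.

```python
# Hypothetical sketch: enumerating a grid of Elastic Net hyperparameter
# combinations. Each pair defines one model; a GPU design such as GPU-SG
# could train all of them in parallel instead of one at a time.
from itertools import product

alphas = [0.01, 0.1, 1.0]     # placeholder regularization strengths
l1_ratios = [0.2, 0.5, 0.8]   # placeholder L1/L2 mixing parameters

# The grid: every combination of the two hyperparameters.
grid = list(product(alphas, l1_ratios))

print(len(grid))        # 9 models in this example grid
print(grid[0])          # (0.01, 0.2)
```

In a sequential CPU implementation, each of these combinations would be trained one after another; the thesis exploits the fact that they are independent and can be trained concurrently.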