Enabling Big Data Analytics For MATLAB Programs Using High Performance Compute Methods

Master thesis (2018)

Authors

Y. Lu (Electrical Engineering, Mathematics and Computer Science)

Contributors

Z. Al-Ars (mentor)

Faculty

Electrical Engineering, Mathematics and Computer Science

To reference this document use:

http://resolver.tudelft.nl/uuid:88ece5de-f233-4c13-a940-f5c862a9b154

Published Date

14-05-2018

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This work introduces a solution for scalable MATLAB deployment on big data clusters through Spark that does not require the official MATLAB toolbox. Other options for accelerating existing MATLAB code are also investigated, including calling modules written for Graphics Processing Units (GPUs) and using a Python Pool with multiple processors. Of these approaches, the Spark solution is achieved by accessing PySpark through Python. Unlike the official Spark approach in the newest MATLAB version, which requires the MATLAB distributed computing server, our approach is low-cost, easy to set up, flexible and general enough to handle changes, and able to scale up. All solutions are analyzed for bottlenecks in terms of initialization, memory transfer, data conversion and computational throughput. Our analysis shows that initialization and memory transfer can be bottlenecks for the GPU, and that data conversion can be a bottleneck for Python/PySpark when the input or output data has high dimensions. As a use case, a medical image registration application written in MATLAB and based on NCC was accelerated with multiple solutions. The results indicate that the GPU and PySpark on a cluster perform best, running 5.7x and 7.8x faster, respectively, than MATLAB with Pool. Based on the overall performance of these solutions, a decision tree for choosing the most suitable solution is built for future research.
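
To illustrate why the registration use case parallelizes well, the following is a minimal Python sketch and not the implementation from the thesis. It assumes NCC stands for normalized cross-correlation and replaces image registration with a toy 1-D problem: finding the circular shift that best aligns two signals. The names `ncc`, `score_shift`, `register_1d` and the `mapper` parameter are hypothetical and chosen only for this example. Each candidate shift is scored independently, which is the property that would let a Pool, a GPU or a Spark cluster divide the work.

```python
from functools import partial
from math import sqrt


def ncc(a, b):
    """Normalized cross-correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        # A constant signal has no defined correlation; this sketch returns 0.
        return 0.0
    return sum(x * y for x, y in zip(da, db)) / denom


def score_shift(fixed, moving, shift):
    """Score one candidate: compare `fixed` with `moving` circularly shifted."""
    n = len(moving)
    shifted = [moving[(i - shift) % n] for i in range(n)]
    return shift, ncc(fixed, shifted)


def register_1d(fixed, moving, shifts, mapper=map):
    """Return the (shift, score) pair with the highest NCC.

    `mapper` is an assumption of this sketch: any map-like callable works,
    so the independent per-shift evaluations can be handed to a parallel
    backend without changing the scoring code.
    """
    results = mapper(partial(score_shift, fixed, moving), shifts)
    return max(results, key=lambda r: r[1])
```

With the default `mapper=map` the search runs serially. Passing the `map` method of a `multiprocessing.Pool` instead would distribute the candidate shifts over worker processes, since `score_shift` is a top-level function and the `partial` built from it can be pickled. The same function could in principle be applied to a parallelized collection of shifts in PySpark, at the cost of the data conversion overhead that the abstract identifies as a possible bottleneck.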

Files

Thesis_yunlu_ce_final.pdf

(.pdf | 1.88 Mb)