Enabling Big Data Analytics For MATLAB Programs Using High Performance Compute Methods

Abstract

This thesis introduces a way to deploy MATLAB programs scalably on big data clusters through Spark without using the official MATLAB toolbox. It also investigates other options for accelerating existing MATLAB code, including calling modules written for the Graphics Processing Unit (GPU) and using a Python process Pool on multiple processors. The Spark solution works by accessing PySpark through Python. The official Spark approach in the newest MATLAB version requires the MATLAB distributed computing server; our approach does not, and it is low-cost, easy to set up, flexible and general enough to handle changes, and able to scale up. All solutions are analyzed for bottlenecks in four areas: initialization, memory transfer, data conversion, and computational throughput. The analysis shows that initialization and memory transfer can be bottlenecks for the GPU, and that data conversion can be a bottleneck for Python/PySpark when the input or output data has high dimensions. As a use case, a MATLAB medical image registration application based on normalized cross-correlation (NCC) was accelerated with several of these solutions. The results indicate that the GPU and PySpark on a cluster perform best, running 5.7x and 7.8x faster, respectively, than MATLAB with Pool. Based on the overall performance of these solutions, a decision tree for choosing the most suitable solution is built for future research.
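To make the NCC-based use case concrete, the following is a minimal Python sketch of how a normalized cross-correlation score might be computed and how a search over candidate alignments could be spread across a process pool. It is an illustration under stated assumptions, not the thesis code: the function names (`ncc`, `score_shift`, `best_shift`, `best_shift_parallel`), the 1-D signals, and the circular-shift search are all hypothetical stand-ins for the real image registration pipeline.

```python
import math
from functools import partial
from multiprocessing import Pool


def ncc(a, b):
    """Normalized cross-correlation of two equal-length sequences, in [-1, 1]."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        # A constant signal has no defined correlation; 0.0 is a convention
        # chosen for this sketch.
        return 0.0
    return sum(x * y for x, y in zip(da, db)) / denom


def score_shift(fixed, moving, shift):
    """NCC between `fixed` and `moving` circularly shifted by `shift` samples."""
    n = len(moving)
    shifted = [moving[(i - shift) % n] for i in range(n)]
    return ncc(fixed, shifted)


def best_shift(fixed, moving, shifts, map_fn=map):
    """Return (shift, score) with the highest NCC; `map_fn` may be a pool's map."""
    shifts = list(shifts)
    scores = list(map_fn(partial(score_shift, fixed, moving), shifts))
    best = max(range(len(shifts)), key=scores.__getitem__)
    return shifts[best], scores[best]


def best_shift_parallel(fixed, moving, shifts, workers=2):
    """Same search, with the candidate shifts scored by a multiprocessing Pool.

    Call this from under an `if __name__ == "__main__":` guard on platforms
    that start worker processes by spawning.
    """
    with Pool(processes=workers) as pool:
        return best_shift(fixed, moving, shifts, map_fn=pool.map)
```

Each candidate shift is scored independently, so the search maps naturally onto a pool of workers. The same per-candidate function could in principle be handed to a PySpark `map` over a parallelized list of shifts, which is the kind of scaling the abstract describes.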