Implicit Parallelization of the Nested Relational Calculus for Semi-Structured Data

Master thesis (2016)

Authors

P.A. Hameete

Contributors

A.J.H. Hidders (mentor)

Department

Web Information Systems (TU Delft)

Big data, Data analysis, Data analytics, Query, Flink, Nested relational calculus, Semi-structured data, Nested data, Input projection

To reference this document use:

http://resolver.tudelft.nl/uuid:5e0bf5b4-f361-4d36-8558-7cb76337fb0c

Published Date

05-07-2016

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Faculty

Electrical Engineering, Mathematics and Computer Science

Abstract

Large volumes of data are produced, published and exchanged over the Internet. Such data is often in a semi-structured format, which is typically irregular and therefore challenging to analyze. High-level data analysis languages are built on top of implicit parallel data processing platforms that handle distribution of computations and data. Currently, work is being performed on the Nested Relational Calculus for Semi-structured Data (sNRC), which combines well-known formalisms from the Nested Relational Calculus for querying nested data with modern approaches for large-scale data analysis. This work presents a first-of-its-kind system for parallel evaluation of sNRC queries, built on top of an implicit parallel framework called Flink. Previous work on an optimization called input projection is recombined and modified to present an input projection algorithm for sNRC. The goal of this optimization is to improve the performance and scalability of the parallel sNRC system by reducing the size of the input dataset. The system is evaluated with the XMark benchmark on a cluster of up to 16 quad CPU nodes, and for datasets of up to 141 GB. We show that the presented parallel sNRC system is capable of processing large-scale datasets and that it can facilitate future work on sNRC. Moreover, it is shown that the presented input projection algorithm strongly improves the performance and scalability for sNRC queries that require partitioning.

Files

Phameete-thesis-finaldraft.pdf

(pdf | 2.24 Mb)