Implicit Parallelization of the Nested Relational Calculus for Semi-Structured Data

Abstract

Large volumes of data are produced, published and exchanged over the Internet. Such data is often in a semi-structured format, which is typically irregular and therefore challenging to analyze. High-level data analysis languages are built on top of implicitly parallel data processing platforms that handle the distribution of computations and data. Work is currently under way on the Nested Relational Calculus for Semi-structured Data (sNRC), which combines well-known formalisms from the Nested Relational Calculus for querying nested data with modern approaches to large-scale data analysis. This work presents a first-of-its-kind system for parallel evaluation of sNRC queries, built on top of an implicitly parallel framework called Flink. Previous work on an optimization called input projection is recombined and modified to present an input projection algorithm for sNRC. This optimization aims to improve the performance and scalability of the parallel sNRC system by reducing the size of the input dataset. The system is evaluated with the XMark benchmark on a cluster of up to 16 quad-CPU nodes, and for datasets of up to 141 GB. We show that the presented parallel sNRC system is capable of processing large-scale datasets and that it can facilitate future work on sNRC. Moreover, we show that the presented input projection algorithm strongly improves the performance and scalability of sNRC queries that require partitioning.