Arrow Prefetching: Prefetching Techniques for High Performance Big Data Processors using the Apache Arrow Data Format

Abstract

As the digitisation of the world progresses at an accelerating pace, overwhelming quantities of data from a variety of sources, of different types, and organised in a multitude of forms or not at all, are subjected to diverse analytic processes in order to extract specific kinds of value from them. The aggregation of these analytic processes, together with the software and hardware infrastructure implementing and facilitating them, constitutes the field of big data analytics, which has characteristics distinct from those of conventional data analytics. The systems executing the analysis were found to exhibit performance weaknesses, specifically significant front-end-bounding Level 1 Data cache miss rates, for certain queries, including but not limited to Natural Language Processing analytics. Based on this observation, we investigated whether, for data stored in the Apache Arrow format, its metadata could be exploited by prefetching techniques to improve cache behaviour, and profiled the datasets and workloads that could profit from such techniques. Architectural simulations of various microbenchmarks were performed on an In-Order and an Out-Of-Order core, using different popular prefetchers and comparing the resulting statistics with those of a perfect prefetcher. The performance of systems featuring the tested prefetchers was found to lag behind that of the perfect-prefetcher system by a considerable margin for workloads featuring indirect indexing access patterns over variable-width datatypes. These patterns, which arise naturally in the access mechanism of Apache Arrow, were therefore identified as candidates for acceleration by more sophisticated prefetching techniques. The first step in the investigation of such techniques was the introduction of a software prefetching scheme.
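To make the problematic access pattern concrete, the following is a minimal sketch (not taken from the thesis) of indirect indexing into an Arrow-style variable-width column: the column's values are reached through an offsets buffer, and rows are visited through a separate index array, so the loaded addresses are data-dependent and defeat simple stride prefetchers. The function name and data are illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical example of the indirect indexing access pattern:
   an Arrow-like variable-width column stores n+1 offsets into a
   values buffer; rows are selected through an index array. */
int64_t sum_selected_lengths(const int32_t *offsets,  /* n+1 entries      */
                             const int32_t *indices,  /* m selected rows  */
                             size_t m)
{
    int64_t total = 0;
    for (size_t i = 0; i < m; i++) {
        int32_t row = indices[i];     /* regular, streaming load          */
        int32_t beg = offsets[row];   /* indirect load: data-dependent    */
        int32_t end = offsets[row + 1];
        total += end - beg;           /* length of the row's value        */
    }
    return total;
}
```

The two loads from `offsets` are the accesses a conventional prefetcher cannot anticipate, since their addresses depend on the contents of `indices`.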
This scheme, which alternates prefetching and execution bursts, was embedded in a synthetic benchmark designed specifically to investigate the effect that an algorithm's computation intensity has on the successful prefetching of indirect indexing access patterns. Its performance closely approached that of ideal prefetching for all computation intensities except the very low ones, although its resilience to variance in memory congestion remains questionable. Since hardware prefetching adapts better to runtime parameters such as memory congestion, a hardware prefetching module was subsequently designed, with information about the memory locations and hierarchy of the Apache Arrow columns' buffers communicated to it by the software. Using this information, the prefetcher can distinguish between index and value accesses and successfully prefetch the regular, iterative ones. The hardware technique almost matched the performance of the ideal prefetcher for medium and high computation intensities and approached it, to the extent that bandwidth constraints allow, for the rest. In terms of speed-up, the maximum attained by the hardware prefetcher over the best-performing reference prefetchers was almost 7x for In-Order and 4x for Out-Of-Order execution, with the software prefetcher performing marginally worse in both cases. These speed-ups were achieved at computation intensities for which execution was constrained neither by the memory system's bandwidth nor by in-core computation resources.
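A burst-based software prefetching scheme of the kind described above can be sketched as follows. This is an illustrative reconstruction, not the thesis implementation: the burst length, the stand-in computation, and all names are assumptions, and `__builtin_prefetch` is the GCC/Clang intrinsic for issuing a non-binding prefetch hint.

```c
#include <stdint.h>
#include <stddef.h>

#define BURST 16  /* hypothetical burst length; would be tuned per workload */

/* Sketch of prefetch/execute bursts: for each group of BURST rows,
   first issue software prefetches for the indirectly addressed values,
   then run the computation on the group, overlapping the memory
   latency of one burst with the work of the same burst. */
void process_bursts(const int32_t *indices, const double *values,
                    double *out, size_t n)
{
    for (size_t base = 0; base < n; base += BURST) {
        size_t end = base + BURST < n ? base + BURST : n;

        /* prefetch phase: touch the cache lines the burst will need */
        for (size_t i = base; i < end; i++)
            __builtin_prefetch(&values[indices[i]], 0 /* read */, 1);

        /* execute phase: the data should now be (mostly) cache-resident */
        for (size_t i = base; i < end; i++)
            out[i] = values[indices[i]] * 2.0;  /* stand-in computation */
    }
}
```

At very low computation intensities the execute phase finishes before the prefetches complete, which matches the abstract's observation that the scheme falls short of ideal prefetching only in that regime.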