Scaling up data analytics in Python using multiple FPGAs

Abstract

Big data applications are becoming more commonplace due to an abundance of digital data and increasingly powerful hardware. One such class of hardware devices is the FPGA, which is used today in settings ranging from data centers to embedded systems. High performance, power efficiency, and reprogrammability are the primary reasons behind their wide adoption. Another trend in recent years has been the use of distributed data processing frameworks such as Apache Spark to improve the performance of big data applications. Traditionally, such frameworks are deployed on commodity hardware to save costs. This approach is widespread, with organizations often operating on-premise compute clusters or using a cloud provider to access a managed cluster. This project attempts to combine the two worlds mentioned above: FPGAs and distributed data processing. We have designed an architecture that allows us to use FPGAs as end-devices in a compute cluster, performing the actual computation instead of CPUs. The architecture is composed of several open source technologies and allows us to interact with an FPGA cluster from Python. Using a high-level programming language such as Python makes the system easy to use for software developers and data scientists, and also abstracts away the internal communication within the cluster. We have built prototypes based on this architecture for three hardware platforms (FPGA families) and three specific applications to demonstrate general applicability. We have observed noticeable performance gains in these applications by scaling up the FPGA cluster.
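To illustrate the style of usage the abstract describes, the following is a minimal sketch of how a data scientist might drive such an FPGA cluster from Python via Apache Spark. It assumes one FPGA per worker node; the wrapper function run_fpga_kernel is purely hypothetical and stands in for whatever device interface the worker nodes actually expose, which the abstract does not specify.

from pyspark.sql import SparkSession

def run_fpga_kernel(rows):
    # Hypothetical placeholder: a real implementation would stream the
    # partition to the worker's local FPGA and yield the accelerated results.
    for row in rows:
        yield row

spark = SparkSession.builder.appName("fpga-cluster-demo").getOrCreate()
df = spark.read.parquet("input.parquet")

# Each partition is handled on a worker node, where the heavy computation
# is offloaded to the FPGA instead of the CPU.
result = df.rdd.mapPartitions(run_fpga_kernel)
print(result.count())

From the user's point of view this looks like ordinary PySpark code; the FPGA offload and the communication inside the cluster are hidden behind the architecture described above.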