J.W. Peltenburg | TU Delft Repository

An Intermediate Representation for Composable Typed Streaming Dataflow Designs

Journal article (2023) - Matthijs A. Reukers (author) , Yongding Tian (author) , Z. Al-Ars (author) , H. Peter Hofstee (author) , Matthijs Brobbel (author) , J.W. Peltenburg (author) , J. Van Straten (author)

Tydi is an open specification for streaming dataflow designs in digital circuits, allowing designers to express how composite and variable-length data structures are transferred over streams using clear, data-centric types. These data types are extensively used in many applicat ...

Battling the CPU Bottleneck in Apache Parquet to Arrow Conversion Using FPGA

Conference paper (2021) - Johan Peltenburg (author) , Lars Van Leeuwen (author) , J.J. Hoozemans (author) , J. Fang (author) , Zaid Al-Ars (author) , Peter Hofstee (author)

In the domain of big data analytics, the bottleneck of converting storage-focused file formats to in-memory data structures has shifted from the bandwidth of storage to the performance of decoding and decompression software. Two widely used formats for big data storage and in-mem ...

Generating high-performance FPGA accelerator designs for big data analytics with Fletcher and Apache Arrow

Journal article (2021) - J.W. Peltenburg (author) , Jeroen Van Straten (author) , Matthijs Brobbel (author) , Z. Al-Ars (author) , H.P. Hofstee (author)

As big data analytics systems are squeezing out the last bits of performance of CPUs and GPUs, the next near-term and widely available alternative that industry is considering for higher performance in the data center and cloud is the FPGA accelerator. We discuss several challenges a ...

Tens of gigabytes per second JSON-to-Arrow conversion with FPGA accelerators

Conference paper (2021) - J.W. Peltenburg (author) , A. Hadnagy (author) , Matthijs Brobbel (author) , Robert Morrow (author) , Zaid Al-Ars (author)

JSON is a popular data interchange format for many web, cloud, and IoT systems due to its simplicity, human readability, and widespread support. However, applications must first parse and convert the data to a native in-memory format before being able to perform useful computatio ...

FPGA Acceleration for Big Data Analytics

Challenges and Opportunities

Journal article (2021) - J.J. Hoozemans (author) , Johan Peltenburg (author) , Fabian Nonnenmacher (author) , Ákos Hadnagy (author) , Zaid Al-Ars (author) , Peter Hofstee (author)

The big data revolution has ushered in an era with ever-increasing volumes and complexity of data requiring ever faster computational analysis. During this very same era, CPU performance growth has been stagnating, pushing the industry to either scale their computation horizontally using multiple nodes in datacenters, or to scale vertically using heterogeneous components to reduce compute time. However, networking and storage continue to provide both higher throughput and lower latency, which allows for leveraging heterogeneous components, deployed in data centers around the world. Still, the integration of big data analytics frameworks with heterogeneous hardware components such as GPGPUs and FPGAs is challenging, because there is an increasing gap in the level of abstraction between analytics solutions developed with big data analytics frameworks, and accelerated kernels developed with heterogeneous components. In this article, we focus on FPGA accelerators that have seen wide-scale deployment in large cloud infrastructures. FPGAs allow the implementation of highly optimized hardware architectures, tailored exactly to an application, and unburdened by the overhead associated with traditional general-purpose computer architectures. FPGAs implementing dataflow-oriented architectures with high levels of (pipeline) parallelism can provide high application throughput, often with high energy efficiency. Latency-sensitive applications can leverage FPGA accelerators by directly connecting to the physical layer of a network, and perform data transformations without going through the software stacks of the host system. While these advantages of FPGA accelerators hold promise, difficulties associated with programming and integration limit their use. This article explores the existing practices in big data analytics frameworks, discusses the aforementioned gap in development abstractions, and provides some perspectives on how to address these challenges in the future.

Tydi

An open specification for complex data structures over hardware streams

Journal article (2020) - J.W. Peltenburg (author) , Matthijs Brobbel (author) , Jeroen Van Straten (author) , Z. Al-Ars (author) , H.P. Hofstee (author)

Streaming dataflow designs describe hardware by connecting components through streams that transport data structures. We introduce a stream-oriented specification and type system that provides a clear and intuitive way to map complex, dynamically-sized data structures onto hardwa ...

Methods for Efficient Integration of FPGA Accelerators with Big Data Systems

Doctoral thesis (2020) - J.W. Peltenburg (author)

Because of fundamental limitations of CMOS technology, computing researchers and the computing industry are focusing on using transistors in integrated circuits more efficiently towards obtaining a computational goal. At the architectural level, this has led to an era of heterogeneous computing, where various types of computational components are used to solve problems. In this dissertation, we focus on the integration of one such heterogeneous component, the FPGA accelerator, with one of the main drivers behind the increasing need for computational performance: big data systems. With the increased availability of these FPGA accelerators in data centers and clouds, and with an increasing amount of I/O bandwidth between accelerated systems and their host, the industry is trying to push these components into more widespread usage in big data applications. For big data systems, three related challenges are observed. First, the software systems consist of many layered run-time systems that have been designed to raise the level of abstraction, often at the cost of potential performance. Second, hardware-unfriendly in-memory data structures and metadata that is uninteresting to the accelerator may complicate the designs required to integrate FPGA accelerators with big data systems software. Last, serialization is applied to address the second challenge, but the rate at which serialization is performed is much lower than the rate at which accelerators may absorb data. For FPGA accelerators, we also observe three challenges. First, highly vendor-specific styles of designing hardware accelerators hamper the widespread reuse of existing solutions. Second, developers spend a lot of time on designing interfaces appropriate for their data structure, since they are typically provided with just a byte-addressable memory interface. Third, developers spend a lot of time on the infrastructure or ‘plumbing’ around their computational kernels, while their focus should be the kernel itself.
We describe a toolchain named Fletcher, based on the Apache Arrow in-memory format for tabular data structures, that uses Arrow to deal with the challenges on the big data systems software side, and also deals with the challenges on the FPGA accelerator development side. The toolchain allows rapid generation of platform-agnostic FPGA accelerator designs where kernels operate on tabular data sets, requiring the developer to implement only the kernel and automating all other aspects of the design, including hardware interfaces, hardware infrastructure, and software integration. We describe applications in regular expression matching, k-means clustering, Hidden Markov Models with the posit numeric format, and decoding Parquet files. Finally, we apply the lessons learned from the work on the Fletcher framework in a new interface specification for streaming dataflow designs, named Tydi. We introduce a hardware-oriented type system that allows expressing the complex, dynamically sized data structures often found in the domain of big data analytics. The type system helps increase productivity when designing hardware that transports such data structures over streams, abstracting their use in hardware without losing the ability to make common design trade-offs.

ArrowSAM

In-Memory Genomics Data Processing Using Apache Arrow

Conference paper (2020) - T. Ahmad (author) , N. Ahmed (author) , Johan Peltenburg (author) , Z. Al-Ars (author)

The rapidly growing size of genomics data bases, driven by advances in sequencing technologies, demands fast and cost-effective processing. However, processing this data creates many challenges, particularly in selecting appropriate algorithms and computing platforms. Computing s ...

Fletcher

A framework to efficiently integrate FPGA accelerators with Apache Arrow

Conference paper (2019) - J.W. Peltenburg (author) , J. Van Straten (author) , Lars Wijtemans (author) , Lars T.J. Van Leeuwen (author) , Zaid Al-Ars (author) , H. Peter Hofstee (author)

Modern big data systems are highly heterogeneous. The components found in their many layers of abstraction are often implemented in a wide variety of programming languages and frameworks. Due to language implementation differences, interfaces between these components, including hardware accelerated components, are often burdened by serialization overhead. Serialization bandwidth of many high-level language frameworks is an order of magnitude lower than contemporary FPGA accelerator interface bandwidth, especially when objects are small but numerous. Therefore, serialization bounds the effective end-to-end performance of FPGA-accelerated solutions integrated with applications written in high-level languages. The Apache Arrow project defines a language agnostic columnar in-memory format optimized for big data applications, avoiding the need to serialize or even make copies during communication between components. To enable FPGA accelerators to benefit from the approach of Arrow, we first investigate the properties of its format in relation to hardware interfaces and establish that the format is usable. Second, we present the Fletcher framework, which automatically generates highly efficient hardware interfaces to access data of potentially complex, nested Arrow data types. Our approach allows 11 of the languages supported by Apache Arrow libraries to efficiently communicate large data sets with FPGA accelerators at system bandwidth. Furthermore, on the hardware side, the generated interfaces deliver any data type that Arrow can represent as groups of streams, providing a better starting point for data-flow-oriented kernel development, compared to manually creating custom interfaces to address issues related to pointer arithmetic, bus word misalignment and latency.
For example applications, as measured on an AWS EC2 F1 and CAPI2-enabled POWER9 system, accelerated end-to-end application performance improves by 1.3x-49x compared to a hardware accelerated solution that still requires serialization.

An Accelerator for Posit Arithmetic Targeting Posit Level 1 BLAS Routines and Pair-HMM

Conference paper (2019) - Laurens van Dam (author) , Johan Peltenburg (author) , Zaid Al-Ars (author) , H. Hofstee (author)

The newly proposed posit number format uses a significantly different approach to represent floating point numbers. This paper introduces a framework for posit arithmetic in reconfigurable logic that maintains full precision in intermediate results. We present the design and impl ...

Diminished-1 Fermat Number Transform for Integer Convolutional Neural Networks

Conference paper (2019) - B. Baozhou (author) , N. Ahmed (author) , J.W. Peltenburg (author) , Koen Bertels (author) , Zaid Al-Ars (author)

Convolutional Neural Networks (CNNs) are a class of widely used deep artificial neural networks. However, training large CNNs to produce state-of-the-art results can take a long time. In addition, we need to reduce compute time of the inference stage for trained networks to make ...

Supporting Columnar In-memory Formats on FPGA

The Hardware Design of Fletcher for Apache Arrow

Conference paper (2019) - Johan Peltenburg (author) , J. Van Straten (author) , Matthijs Brobbel (author) , H. Peter Hofstee (author) , Zaid Al-Ars (author)

As a columnar in-memory format, Apache Arrow has seen increased interest from the data analytics community. Fletcher is a framework that generates hardware interfaces based on this format, to be used in FPGA accelerators. This allows efficient integration of FPGA accelerators wit ...

Pushing Big Data into Accelerators

Can the JVM Saturate Our Hardware?

Conference paper (2017) - Johan Peltenburg (author) , Ahmad Hesam (author) , Zaid Al-Ars (author)

Advancements in the field of big data have led to an increasing interest in accelerator-based computing as a solution for computationally intensive problems. However, many prevalent big data frameworks are built and run on top of the Java Virtual Machine (JVM), which does not e ...

Maximizing Systolic Array Efficiency to Accelerate the PairHMM Forward Algorithm

Conference paper (2016) - Johan Peltenburg (author) , S. Ren (author) , Zaid Al-Ars (author)

In the analysis of next-generation DNA sequencing data, Hidden Markov Models (HMMs) are used to perform variant calling between DNA sequences and a reference genome. The PairHMM model is solved by the Forward Algorithm, for which the performance and power efficiency can be increa ...