JF

J. Fang

info

Please Note

9 records found

In the domain of big data analytics, the bottleneck of converting storage-focused file formats to in-memory data structures has shifted from the bandwidth of storage to the performance of decoding and decompression software. Two widely used formats for big data storage and in-memory data are Apache Parquet and Apache Arrow, respectively. In order to improve the speed at which data can be loaded from disk to memory, we propose an FPGA accelerator design that converts Parquet files to Arrow in-memory data structures. We describe an extensible, publicly available, free and open-source implementation of the proposed converter that supports various Parquet file configurations. The performance of the converter is measured on an AWS EC2 F1 system and on a POWER9 system using the recently released OpenCAPI interface. A single instance of the converter can reach between 6 and 12 GB/s of end-to-end throughput, and shows up to a threefold improvement over the fastest single-thread CPU implementation. It has a low resource utilization (less than 5% for all types of FPGA resources). This allows scaling out the design to match the bandwidth of the coming generation of accelerator interfaces. The proposed design and implementation can be extended to support more of the many possible Parquet file configurations. ...
Due to the critical need for reducing carbon emissions, the demand for energy-efficient building design is urgent. Studies have shown that space layouts affect energy performance considerably. Energy performance optimisation is able to improve energy performance significantly. However, in order to apply energy performance optimisation to space layouts (EPO), abundant layout alternatives are needed. With the development of computational methods, automatic generation of space layouts (GSL) helps to generate abundant layouts quickly. Therefore, combining GSL with EPO is expected to be greatly helpful for energy-efficient design. This paper investigates 10 relevant studies combining GSL and EPO and analyses their gaps. Furtherly, we extend the analysis to the research on GSL and EPO. 7 GSL methods are categorised and evaluated based on 66 studies, and the requirements for the combination with optimisation are inspected. Regarding EPO, the requirements for energy performance assessment and optimisation are analysed. ...
Journal article (2020) - Jian Fang, Jianyu Chen, Jinho Lee, Zaid Al-Ars, Peter Hofstee
To best leverage high-bandwidth storage and network technologies requires an improvement in the speed at which we can decompress data. We present a “refine and recycle” method applicable to LZ77-type decompressors that enables efficient high-bandwidth designs and present an implementation in reconfigurable logic. The method refines the write commands (for literal tokens) and read commands (for copy tokens) to a set of commands that target a single bank of block ram, and rather than performing all the dependency calculations saves logic by recycling (read) commands that return with an invalid result. A single “Snappy” decompressor implemented in reconfigurable logic leveraging this method is capable of processing multiple literal or copy tokens per cycle and achieves up to 7.2GB/s, which can keep pace with an NVMe device. The proposed method is about an order of magnitude faster and an order of magnitude more power efficient than a state-of-the-art single-core software implementation. The logic and block ram resources required by the decompressor are sufficiently low so that a set of these decompressors can be implemented on a single FPGA of reasonable size to keep up with the bandwidth provided by the most recent interface technologies. ...
Doctoral thesis (2019) - Jian Fang
Though field-programmable gate arrays (FPGAs) have been used to accelerate database systems, they have not been widely adopted for the following reasons. As databases have transitioned to higher bandwidth technology such as in-memory and NVMe, the communication overhead associated with accelerators has become more of a burden. Also, FPGAs are more difficult to program, and GPUs have emerged as an alternative technology with better programming support. However, with the development of new interconnect technology, memory technology, and improved FPGA design tool chains, FPGAs again provide significant opportunities. Therefore, we believe that FPGAs can be attractive again in the database field. This thesis focuses on FPGAs as a high-performance compute platform, and explores using FPGAs to accelerate database systems. It investigates the current challenges that have held FPGAs back in the database field as well as the opportunities resulting from recent technology developments. The investigation illustrates that FPGAs can provide significant advantages for integration in database systems. However, to make further progress, studies in a number of areas, including new database architectures, new types of accelerators, deep performance analysis, and the development of the tool chains are required. Our contributions focus on accelerators for databases implemented in reconfigurable logic. We provide an overview of prior work and make contributions to two specific types of accelerators: both a compute-intensive (decompression) and a memory-intensive (hash join) accelerator. ...
Journal article (2019) - Jian Fang, Yvo T.B. Mulder, Jan Hidders, Jinho Lee, H. Peter Hofstee
While FPGAs have seen prior use in database systems, in recent years interest in using FPGA to accelerate databases has declined in both industry and academia for the following three reasons. First, specifically for in-memory databases, FPGAs integrated with conventional I/O provide insufficient bandwidth, limiting performance. Second, GPUs, which can also provide high throughput, and are easier to program, have emerged as a strong accelerator alternative. Third, programming FPGAs required developers to have full-stack skills, from high-level algorithm design to low-level circuit implementations. The good news is that these challenges are being addressed. New interface technologies connect FPGAs into the system at main-memory bandwidth and the latest FPGAs provide local memory competitive in capacity and bandwidth with GPUs. Ease of programming is improving through support of shared coherent virtual memory between the host and the accelerator, support for higher-level languages, and domain-specific tools to generate FPGA designs automatically. Therefore, this paper surveys using FPGAs to accelerate in-memory database systems targeting designs that can operate at the speed of main memory. ...
Conference paper (2019) - Jian Fang, Jianyu Chen, Jinho Lee, Zaid Al-Ars, Peter Hofstee
Snappy is a widely used (de) compression algorithm in many big data applications. Such a data compression technique has been proven to be successful to save storage space and to reduce the amount of data transmission from/to storage devices. In this paper, we present a fine-grained parallel Snappy decompressor on FPGAs running under a relaxed execution model that addresses the following main challenges in existing solutions. First, existing designs either can only process one token per cycle or can process multiple tokens per cycle with low area efficiency and/or low clock frequency. Second, the high read-after-write data dependency during decompression introduces stalls which pull down the throughput. ...

A method to increase decompression parallelism

Conference paper (2019) - Jian Fang, Jianyu Chen, Jinho Lee, Zaid Al-Ars, H. Peter Hofstee
Rapid increases in storage bandwidth, combined with a desire for operating on large datasets interactively, drives the need for improvements in high-bandwidth decompression. Existing designs either process only one token per cycle or process multiple tokens per cycle with low area efficiency and/or low clock frequency. We propose two techniques to achieve high single-decoder throughput at improved efficiency by keeping only a single copy of the history data across multiple BRAMs and operating on each BRAM independently. A first stage efficiently refines the tokens into commands that operate on a single BRAM and steers the commands to the appropriate one. In the second stage, a relaxed execution model is used where each BRAM command executes immediately and those with invalid data are recycled to avoid stalls caused by the read-after-write dependency. We apply these techniques to Snappy decompression and implement a Snappy decompression accelerator on a CAPI2-attached FPGA platform equipped with a Xilinx VU3P FPGA. Experimental results show that our proposed method achieves up to 7.2 GB/s output throughput per decompressor, with each decompressor using 14.2% of the logic and 7% of the BRAM resources of the device. Therefore, a single decompressor can easily keep pace with an NVMe device (PCIe Gen3 x4) on a small FPGA, while a larger device, integrated on a host bridge adapter and instantiating multiple decompressors, can keep pace with the full OpenCAPI 3.0 bandwidth of 25 GB/s. ...
Conference paper (2018) - Jian Fang, Jianyu Chen, Zaid Al-Ars, Peter Hofstee, Jan Hidders
While in-memory databases have largely removed I/O as a bottleneck for database operations, loading the data from storage into memory remains a significant limiter to end-to end performance. Snappy is a widely used compression algorithm in the Hadoop ecosystem and in database systems and is an option in often-used file formats such as Parquet and ORC. Compression reduces the amount of data that must be transferred from/to the storage saving both storage space and storage bandwidth. While it is easy for a CPU Snappy decompressor to keep up with the bandwidth of a hard disk drive, when moving to NVMe devices attached with high bandwidth connections such as PCIe Gen4 or OpenCAPI, the decompression speed in a CPU is insufficient. We propose an FPGA-based Snappy decompressor that can process multiple tokens in parallel and operates on each FPGA block ram independently. Read commands are recycled until the read data is valid dramatically reducing control complexity. One instance of our decompression engine takes 9% of the LUTs in the XCKU15P FPGA, and achieves up to 3GB/s (5GB/s) decompression rate from the input (output) side, about an order of magnitude faster than a CPU (single thread). Parquet allows for independent decompression of multiple pages and instantiating eight of these units on a XCKU15P FPGA can keep up with the highest performance interface bandwidths. ...
Abstract (2016) - Jian Fang, Jan Hidders, Koen Bertels, Jinho Lee, Peter Hofstee
The join is a commonly used operation in databases systems. As data volumes explode, join operations between two large relations become challenging. To overcome this challenge, some research adopts FPGAs (field programmable gate arrays) to accelerate this operation. However, increasing the bandwidth between accelerators and main memory provides additional opportunity for improvement. In this paper, we propose a locality-aware hash-join algorithm, and apply this to explore the potential of using FPGAs to accelerate hash-join operations in databases. The tuples in the input tables are partitioned into small buckets such that each bucket can fit into BRAM (Block RAM) of the FPGA. Hash-join probe operations are only performed on buckets with the same bucket number in these two tables. Analysis shows that the proposed method can take advantage of large memory bandwidth and thus provide a high throughput. ...