Jianyu Chen

6 records found

Conference paper (2021) - Jianyu Chen, Maurice Daverveldt, Zaid Al-Ars
With the continued increase in the amount of big data generated and stored in various application domains, such as high-frequency trading, compression techniques are becoming ever more important to reduce the requirements on communication bandwidth and storage capacity. Zstandard (Zstd) is emerging as an important compression algorithm for big data sets, capable of achieving a good compression ratio at a higher speed than comparable algorithms. In this paper, we introduce the architecture of a new hardware compression kernel for Zstd that allows the algorithm to be used for real-time compression of big data streams. In addition, we optimize the proposed architecture for the specific use case of streaming high-frequency trading data. The optimized kernel is implemented on a Xilinx Alveo U200 board. Our optimized implementation allows us to fit ten kernel blocks on one board, achieving a compression throughput of about 8.6 GB/s and a compression ratio of about 23.6%. The hardware implementation is open source and publicly available at https://github.com/ChenJianyunp/Hardware-Zstd-Compression-Unit. ...
Journal article (2020) - Jian Fang, Jianyu Chen, Jinho Lee, Zaid Al-Ars, Peter Hofstee
Making the best use of high-bandwidth storage and network technologies requires an improvement in the speed at which we can decompress data. We present a “refine and recycle” method applicable to LZ77-type decompressors that enables efficient high-bandwidth designs, and present an implementation in reconfigurable logic. The method refines the write commands (for literal tokens) and read commands (for copy tokens) to a set of commands that each target a single bank of block RAM and, rather than performing all the dependency calculations, saves logic by recycling (read) commands that return with an invalid result. A single “Snappy” decompressor implemented in reconfigurable logic leveraging this method is capable of processing multiple literal or copy tokens per cycle and achieves up to 7.2 GB/s, which can keep pace with an NVMe device. The proposed method is about an order of magnitude faster and an order of magnitude more power efficient than a state-of-the-art single-core software implementation. The logic and block RAM resources required by the decompressor are sufficiently low that a set of these decompressors can be implemented on a single FPGA of reasonable size to keep up with the bandwidth provided by the most recent interface technologies. ...

A method to increase decompression parallelism

Conference paper (2019) - Jian Fang, Jianyu Chen, Jinho Lee, Zaid Al-Ars, H. Peter Hofstee
Rapid increases in storage bandwidth, combined with a desire for operating on large datasets interactively, drive the need for improvements in high-bandwidth decompression. Existing designs either process only one token per cycle or process multiple tokens per cycle with low area efficiency and/or low clock frequency. We propose two techniques to achieve high single-decoder throughput at improved efficiency by keeping only a single copy of the history data across multiple BRAMs and operating on each BRAM independently. A first stage efficiently refines the tokens into commands that operate on a single BRAM and steers the commands to the appropriate one. In the second stage, a relaxed execution model is used where each BRAM command executes immediately and those with invalid data are recycled to avoid stalls caused by the read-after-write dependency. We apply these techniques to Snappy decompression and implement a Snappy decompression accelerator on a CAPI2-attached FPGA platform equipped with a Xilinx VU3P FPGA. Experimental results show that our proposed method achieves up to 7.2 GB/s output throughput per decompressor, with each decompressor using 14.2% of the logic and 7% of the BRAM resources of the device. Therefore, a single decompressor can easily keep pace with an NVMe device (PCIe Gen3 x4) on a small FPGA, while a larger device, integrated on a host bridge adapter and instantiating multiple decompressors, can keep pace with the full OpenCAPI 3.0 bandwidth of 25 GB/s. ...
Conference paper (2019) - Jian Fang, Jianyu Chen, Jinho Lee, Zaid Al-Ars, Peter Hofstee
Snappy is a widely used (de)compression algorithm in many big data applications. Such a data compression technique has proven successful at saving storage space and reducing the amount of data transferred from/to storage devices. In this paper, we present a fine-grained parallel Snappy decompressor on FPGAs running under a relaxed execution model that addresses the following main challenges in existing solutions. First, existing designs either can only process one token per cycle or can process multiple tokens per cycle with low area efficiency and/or low clock frequency. Second, the high read-after-write data dependency during decompression introduces stalls that reduce the throughput. ...
Conference paper (2018) - Jian Fang, Jianyu Chen, Zaid Al-Ars, Peter Hofstee, Jan Hidders
While in-memory databases have largely removed I/O as a bottleneck for database operations, loading the data from storage into memory remains a significant limiter to end-to-end performance. Snappy is a widely used compression algorithm in the Hadoop ecosystem and in database systems and is an option in often-used file formats such as Parquet and ORC. Compression reduces the amount of data that must be transferred from/to the storage, saving both storage space and storage bandwidth. While it is easy for a CPU Snappy decompressor to keep up with the bandwidth of a hard disk drive, when moving to NVMe devices attached with high-bandwidth connections such as PCIe Gen4 or OpenCAPI, the decompression speed in a CPU is insufficient. We propose an FPGA-based Snappy decompressor that can process multiple tokens in parallel and operates on each FPGA block RAM independently. Read commands are recycled until the read data is valid, dramatically reducing control complexity. One instance of our decompression engine takes 9% of the LUTs in the XCKU15P FPGA, and achieves up to 3 GB/s (5 GB/s) decompression rate from the input (output) side, about an order of magnitude faster than a CPU (single thread). Parquet allows for independent decompression of multiple pages, and instantiating eight of these units on an XCKU15P FPGA can keep up with the highest performance interface bandwidths. ...
Conference paper (2018) - Jianyu Chen, Zaid Al-Ars, H. Peter Hofstee
In this paper, we present the design in reconfigurable logic of a matrix multiplier for matrices of 32-bit posit numbers with es=2 [1]. Vector dot products are computed without intermediate rounding, as suggested by the proposed posit standard, to maximally retain precision. An initial implementation targets the CAPI 1.0 interface on the POWER8 processor and achieves about 10 Gpops (giga posit operations per second). Follow-on implementations targeting CAPI 2.0 and OpenCAPI 3.0 on POWER9 are expected to achieve up to 64 Gpops. Our design is available under a permissive open source license at https://github.com/ChenJianyunp/Unum_matrix_multiplier. We hope the current work, which targets CAPI 1.0, along with future community contributions, will help enable a more extensive exploration of this proposed new format. ...