Refine and recycle

A method to increase decompression parallelism

Conference Paper (2019)

Author(s)

Jian Fang (TU Delft - Computer Engineering)

Jianyu Chen (Student TU Delft)

Jinho Lee (IBM Austin)

Zaid Al-Ars (TU Delft - Computer Engineering)

H. Peter Hofstee (IBM Austin, TU Delft - Computer Engineering)

Research Group

Computer Engineering

DOI related publication

https://doi.org/10.1109/ASAP.2019.00017

FPGA Acceleration Decompression Snappy Decompression

To reference this document use:

https://resolver.tudelft.nl/uuid:994ce1fc-0130-4885-bbdc-e6874103dbe1

Publication Year

2019

Language

English

Pages (from-to)

272-280

ISBN (print)

978-1-7281-1602-0

ISBN (electronic)

978-1-7281-1601-3

Abstract

Rapid increases in storage bandwidth, combined with a desire to operate on large datasets interactively, drive the need for improvements in high-bandwidth decompression. Existing designs either process only one token per cycle or process multiple tokens per cycle with low area efficiency and/or low clock frequency. We propose two techniques to achieve high single-decoder throughput at improved efficiency by keeping only a single copy of the history data across multiple BRAMs and operating on each BRAM independently. A first stage efficiently refines the tokens into commands that operate on a single BRAM and steers the commands to the appropriate one. In the second stage, a relaxed execution model is used where each BRAM command executes immediately and those with invalid data are recycled to avoid stalls caused by the read-after-write dependency. We apply these techniques to Snappy decompression and implement a Snappy decompression accelerator on a CAPI2-attached FPGA platform equipped with a Xilinx VU3P FPGA. Experimental results show that our proposed method achieves up to 7.2 GB/s output throughput per decompressor, with each decompressor using 14.2% of the logic and 7% of the BRAM resources of the device. Therefore, a single decompressor can easily keep pace with an NVMe device (PCIe Gen3 x4) on a small FPGA, while a larger device, integrated on a host bridge adapter and instantiating multiple decompressors, can keep pace with the full OpenCAPI 3.0 bandwidth of 25 GB/s.
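The two stages described in the abstract can be illustrated with a small software model. The sketch below is not the authors' hardware design: the line width (`LINE_BYTES`), the bank count (`NUM_BANKS`), the abstract token tuples, and the function names `refine` and `execute` are all assumptions made for illustration. It splits literal and copy tokens into commands that each touch one source line and one destination line, then executes the commands from a queue, sending any copy whose source bytes are not yet valid back to the end of the queue instead of stalling.

```python
from collections import deque

LINE_BYTES = 8   # assumed bytes per BRAM line (illustrative, not from the paper)
NUM_BANKS = 4    # assumed number of BRAM banks (illustrative, not from the paper)


def bank_of(addr):
    """Bank that holds the history line containing byte address addr."""
    return (addr // LINE_BYTES) % NUM_BANKS


def refine(tokens):
    """Stage 1 (sketch): split tokens into single-line commands.

    tokens are hypothetical tuples: ("literal", data) or ("copy", offset, length).
    Returns (commands, total_output_length).
    """
    cmds, pos = [], 0
    for tok in tokens:
        if tok[0] == "literal":
            data, i = tok[1], 0
            while i < len(data):
                # Stay inside the current destination line.
                n = min(len(data) - i, LINE_BYTES - pos % LINE_BYTES)
                cmds.append(("write", bank_of(pos), pos, data[i:i + n]))
                i += n
                pos += n
        else:
            _, offset, length = tok
            if offset <= 0 or offset > pos:
                raise ValueError("copy offset points outside the history")
            remaining = length
            while remaining:
                src = pos - offset
                # Stay inside one source line and one destination line, and
                # never read bytes that this same command would produce.
                n = min(remaining, offset,
                        LINE_BYTES - src % LINE_BYTES,
                        LINE_BYTES - pos % LINE_BYTES)
                cmds.append(("copy", bank_of(src), src, pos, n))
                pos += n
                remaining -= n
    return cmds, pos


def execute(cmds, size):
    """Stage 2 (sketch): run commands; recycle copies whose source is invalid."""
    history = bytearray(size)
    valid = [False] * size
    queue = deque(cmds)
    recycled = 0   # total number of recycle events
    blocked = 0    # consecutive recycles without progress
    while queue:
        cmd = queue.popleft()
        if cmd[0] == "write":
            _, _bank, dst, data = cmd
        else:
            _, _bank, src, dst, n = cmd
            if not all(valid[src:src + n]):
                queue.append(cmd)          # recycle instead of stalling
                recycled += 1
                blocked += 1
                if blocked > len(queue):
                    raise ValueError("unresolvable dependency")
                continue
            data = bytes(history[src:src + n])
        history[dst:dst + len(data)] = data
        for k in range(dst, dst + len(data)):
            valid[k] = True
        blocked = 0
    return bytes(history), recycled


if __name__ == "__main__":
    tokens = [("literal", b"abcdefghij"), ("copy", 10, 6), ("copy", 1, 5)]
    cmds, size = refine(tokens)
    # Reversing the command order mimics banks running far out of order.
    out, recycled = execute(list(reversed(cmds)), size)
    print(out, recycled)
```

In this model the per-byte `valid` flags stand in for whatever validity tracking the hardware uses, and a single queue replaces the independent per-BRAM pipelines, so recycle counts here say nothing about the real design's performance. The point is only that the output is the same regardless of command order, because a copy that arrives too early is retried rather than blocking the commands behind it.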

No files available

Metadata only record. There are no files for this record.