H.P. Hofstee | TU Delft Repository

GPU-Accelerated String Compression for Big Data Analytics

Master thesis (2025) - T.P.D. Anema (author) , H. Hofstee (mentor) , Zaid Al-Ars (mentor) , Matthias Moller (graduation committee member) , Joost Hoozemans (graduation committee member)

This thesis presents a GPU-accelerated string compression algorithm based on FSST (Fast Static Symbol Table).
The proposed compressor leverages several advanced CUDA techniques to optimize performance, including a voting mechanism that maximizes memory bandwidth and an efficient gathering pipeline utilizing stream compaction.
Additionally, the algorithm uses GPU compute capacity to support a memory-efficient encoding table through a space-time tradeoff.

The compression task is parallelized by tiling input data and adapting the data layout.
We introduce multiple compression pipelines, each with distinct tradeoffs.
To maximize encoding kernel throughput, the design introduces sliding windows and output packing to optimize register use and maximize effective memory bandwidth.
Pipeline-level throughput is further enhanced by introducing pipelined transposition stages and stream compaction to remove intermediate padding efficiently.
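The padding-removal step mentioned above can be illustrated with a minimal, sequential sketch of stream compaction: flag the elements to keep, take an exclusive prefix sum of the flags to obtain output positions, then scatter. This is not the thesis's CUDA implementation; the `PAD` marker, the tile layout, and the function name `compact` are assumptions made only for illustration.

```python
from itertools import accumulate

PAD = -1  # hypothetical padding marker, not taken from the thesis


def compact(tiles):
    """Remove padding from fixed-size tiles using flag, exclusive scan, scatter."""
    flat = [x for tile in tiles for x in tile]
    # 1 for every element that must be kept, 0 for padding
    flags = [0 if x == PAD else 1 for x in flat]
    # the exclusive prefix sum gives each kept element its output index
    offsets = [0] + list(accumulate(flags))[:-1]
    total = (offsets[-1] + flags[-1]) if flat else 0
    out = [0] * total
    # scatter: on a GPU every element would be handled by its own thread
    for x, keep, dst in zip(flat, flags, offsets):
        if keep:
            out[dst] = x
    return out


tiles = [[1, 2, PAD, PAD], [3, PAD, PAD, PAD], [4, 5, 6, PAD]]
result = compact(tiles)
```

Because every output index depends only on the scan result, the scatter step has no write conflicts, which is what makes this pattern suitable for parallel hardware.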

We evaluate these pipelines across several benchmark datasets and compare the best-performing version against state-of-the-art GPU compression algorithms, including nvCOMP, GPULZ, and compressors generated using the LC framework.
The proposed compressor achieves a throughput of 74 GB/s on an RTX 4090 while maintaining compression ratios comparable to FSST.
In terms of compression ratio, it consistently outperforms ANS, Bitcomp, Cascaded, and GPULZ across all datasets.
Its overall throughput exceeds that of GPULZ and all nvCOMP compressors except ANS, Bitcomp, Cascaded, and those produced by the LC framework.
Our compressor lies on the Pareto frontier for all evaluated datasets, advancing the state-of-the-art toward ideal compression.
It achieves near-identical compression ratios to FSST (except for machine-readable datasets), while achieving a speedup of 42.06x.
Compared to multithreaded CPU compression, it achieves a 6.45x speedup.

To assess end-to-end performance, we integrate the compressor with the GSST decompressor. The resulting (de)compression pipeline achieves a combined throughput of 55 GB/s, outperforming uncompressed data transfer on links with a bandwidth up to 37.5 GB/s.
It also outperforms all state-of-the-art (de)compressors when the link bandwidth ranges between 3 GB/s and 20 GB/s.
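One way to reason about when compression pays off on a link is a simple serial cost model, sketched below. The model itself (codec time and transfer time add up per byte), the function names, and the compression ratio of 2.0 are assumptions for illustration only; the abstract does not state which model or ratio underlies its break-even figures.

```python
def effective_throughput(link_gb_s, codec_gb_s, ratio):
    """Uncompressed bytes delivered per second under an assumed serial model.

    Each uncompressed byte costs 1/codec_gb_s of (de)compression time plus
    1/(ratio * link_gb_s) of transfer time for its compressed form.
    """
    return 1.0 / (1.0 / codec_gb_s + 1.0 / (ratio * link_gb_s))


def break_even_link(codec_gb_s, ratio):
    """Link bandwidth below which compressed transfer beats raw transfer."""
    return codec_gb_s * (1.0 - 1.0 / ratio)


# Hypothetical inputs: 55 GB/s codec throughput and a ratio of 2.0 (assumed).
limit = break_even_link(55.0, 2.0)
slow_link = effective_throughput(10.0, 55.0, 2.0)
```

Under this model, compression helps on any link slower than `limit` and hurts on faster links, because the codec time then outweighs the bytes saved.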

While further research is needed to enhance robustness and integrate the compressor into analytical engines, this work demonstrates a viable and Pareto-optimal alternative to existing string compression methods.

The source code of all our compression pipelines is publicly available on GitHub.
This work also serves as the foundation for a scientific paper that has been accepted for presentation at ADMS 2025.

Exploration of the AMD Ryzen NPU for Real-time Signal Processing

Real-time Imaging of LOFAR Station Data

Master thesis (2025) - J.A. Fortanet Capetillo (author) , H.P. Hofstee (mentor) , Alle Jan van der Veen (graduation committee member) , Steven van der Vlugt (graduation committee member) , Mario Ruiz Noguera (graduation committee member) , Zaid Al-Ars (graduation committee member)

The growing prevalence of Artificial Intelligence (AI) applications has led to the development of specialized hardware accelerators optimized for performance and energy efficiency. One such accelerator is the Ryzen Neural Processing Unit (NPU), integrated into AMD’s Ryzen AI proc ...

Shockwaves and Tydi-Clash

Raising the abstraction level of the Haskell HDL Clash through typed waveforms and complex streaming interfaces

Master thesis (2025) - M.H.P. Adriaanse (author) , H. Hofstee (mentor) , Z. Al-Ars (mentor) , C. P. R. Baaij (mentor) , Chris Verhoeven (graduation committee member)

This work contains two systems created to raise abstraction for the Haskell-based HDL Clash.

A common tool in hardware design is the waveform viewer. Although Clash could already generate waveform files, these only contained binary representations of the values. Without ...

Programming Quantum Computers

Doctoral thesis (2025) - A.M. Krol (author) , H. Peter Hofstee (promotor) , Zaid Al-Ars (promotor)

Because of recent stagnating single-thread performance and limited potential for further miniaturization of transistors, the computing industry is looking towards new technologies as the basis for the next generation of computing. One of these new technologies is quantum computin ...

Tywaves

A Typed Waveform Viewer for Chisel HDL with typed circuit components and Tydi streams

Master thesis (2024) - R. Meloni (author) , Zaid Al-Ars (mentor) , H. Peter Hofstee (mentor)

Modern hardware design languages introduce high-level constructs to considerably improve design capabilities. The adoption of software language features and strong type systems contribute to expressing complex designs with cleaner and more robust code, facilitating the translatio ...

Optimization Methods for Efficient Nanopore DNA Basecalling

Master thesis (2024) - M.W.G. Frensel (author) , Z. Al-Ars (mentor) , H. Peter Hofstee (mentor) , Erik van den Akker (graduation committee member) , Ramin Shirali Hossein Zade (graduation committee member)

Genomics, the study of an organism's complete set of DNA, including all of its genes, has revolutionized our understanding of biological processes and disease mechanisms. The field's rapid advancements have paved the way for personalized medicine, offering targeted therapies and improved healthcare outcomes. These advancements are a result of significant improvements in sequencing technology, bioinformatics, and computational power. Next-generation or long-read sequencing has reduced the cost and time required to sequence entire genomes, and Oxford Nanopore Technologies (ONT) sequencers provide 100–1000× longer contiguous reads, simplifying genome assembly. However, bioinformatics-driven advances in accuracy have come at the cost of high computational requirements because of the dependency on large deep neural networks (DNNs), and the basecalling step now takes 43% of the time in the nanopore sequencing pipeline.

This thesis addresses the large computational demands of high-accuracy basecalling of nanopore reads. Bonito, ONT's research basecaller, and other basecallers use DNNs at their core. The five Long Short-Term Memory (LSTM) layers used by the basecaller are the primary bottleneck to more efficient basecalling, taking almost 90% of the whole model's execution time when basecalling a single read. To alleviate this bottleneck, three approaches are investigated: pruning, model architecture, and quantization. Preliminary results show that pruning is the most impactful approach and has not been used successfully in previous work.

We propose learning structured sparsity using a delayed masking penalty scheduler. By adapting and improving on previous work, each LSTM layer is able to learn its optimal size during training, simultaneously with learning to basecall accurately. The method is optimized for the basecalling application and can be generalized to other tasks. We find that the required number of computations in the LSTM layers can be significantly reduced by up to 21 times with a reduction in match rate of just 1.3% compared to the high accuracy Bonito model. Furthermore, the newly introduced penalty parameter can be tuned to find the optimal trade-off between compute and accuracy for users' requirements.

The results indicate that state-of-the-art basecalling models are overparameterized and that their size can be reduced drastically without significantly affecting accuracy. Future work is suggested to investigate the benefits of pruning the whole model, and to assess the feasibility of combining pruning with advanced quantization methods. This work helps increase the accessibility of nanopore DNA sequencing, broadening the reach and impact of this technology.
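To see why structured pruning of LSTM layers reduces computation so strongly, it helps to count multiply-accumulate operations (MACs) per time step. The sketch below uses the standard LSTM cost of four gates, each a matrix-vector product over the concatenated input and hidden state, and ignores biases and element-wise operations. The input size of 384 and the pruned layer sizes are hypothetical values chosen for illustration; they are not results from the thesis.

```python
def lstm_macs_per_step(input_size, hidden_size):
    """MACs for one LSTM time step: 4 gates over [input, hidden] vectors."""
    return 4 * hidden_size * (input_size + hidden_size)


def stack_macs_per_step(input_size, hidden_sizes):
    """MACs for a stack of LSTM layers; each layer feeds the next one."""
    total = 0
    prev = input_size
    for hidden in hidden_sizes:
        total += lstm_macs_per_step(prev, hidden)
        prev = hidden
    return total


# Assumed dense model: five layers of 384 units with 384 input features.
dense = stack_macs_per_step(384, [384] * 5)
# Hypothetical pruned model: every layer shrunk to 96 units.
pruned = stack_macs_per_step(384, [96] * 5)
reduction = dense / pruned
```

Since removing a hidden unit shrinks both the rows of a layer's weight matrices and the columns of the next layer's, the cost falls roughly quadratically with layer width, which is why modest-looking size reductions translate into large compute savings.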

High Performance ASIC Processor Design for DNA Basecallers

Master thesis (2024) - W. Trinh (author) , Z. Al-Ars (mentor) , H. Peter Hofstee (mentor) , R. T. Rajan (graduation committee member)

Genomics has revolutionized medicine and biological research by providing deeper insights into the genetic makeup of organisms, advancing our understanding of diseases, and enabling personalized medicine. These breakthroughs are driven by advancements in genome sequencing technologies, bioinformatics, and the aid of neural networks. The advent of third-generation sequencing technology has further accelerated progress by allowing for long-read sequencing, which enhances the accuracy and efficiency of genome assembly. Oxford Nanopore Technologies (ONT) offers advanced sequencers that use nanopore-based technology to read DNA sequences in real-time. However, the raw sequenced data contains noise and requires a basecalling stage to read DNA sequences with the required accuracy. Basecalling relies on deep neural networks (DNNs) to achieve high-accuracy reads, but the significant computational power required makes basecalling a costly process, especially in real-time applications. Current hardware accelerators used for these compute-intensive basecallers are state-of-the-art GPUs that cost over $10,000 per unit.

This thesis explores the design of a custom hardware accelerator for ONT’s basecalling program, Bonito, which aims to provide a cost-effective alternative to existing accelerators such as Nvidia GPUs and Groq Tensor Streaming Processors (TSPs). Bonito’s DNN is dominated by five Long Short-Term Memory (LSTM) layers, accounting for 90% of its execution time. The custom accelerator targets these compute-intensive LSTMs to reduce execution time. This work provides a comprehensive analysis of LSTM performance and behavior on GPUs and Groq TSPs. It emphasizes the architectural benefits and limitations in the context of basecalling. Furthermore, it also evaluates an Application-Specific Integrated Circuit (ASIC) implementation of an existing FPGA-based LSTM accelerator design.

The analysis shows that Bonito’s High-Accuracy model (HAC) LSTM layers contain many sequential matrix multiplications that are compute-intensive and require high memory bandwidth to accommodate the data transfers. Furthermore, Bonito’s small problem size, with 384 features per vector input, causes GPUs to not fully utilize available compute cores. Combined with slow per-core performance, GPUs executing LSTMs achieve only 13.5% of the maximum TFLOP/s with FP16 precision. Groq uses a heterogeneous architecture with fast separate MXM units executing matrix multiplications, and VXM units executing point-wise operations. Data travels between these different compute units through streaming channels and MEM units. The LSTM analysis on Groq showed three main issues. First, Groq uses a 320-element wide data channel to transfer data across the chip, whereas Bonito has 384 hidden features as input. This leads to the 384-element input being sliced in two 192-element partial inputs as Groq supports physical tensors up to 320-element long, effectively doubling the cycle cost by using two slices instead of one. Second, performing matrix multiplications requires these slices to be transferred from MEM to MXM, by executing reload operations to the MXM weight buffer. This data transfer consumes the bandwidth on the streaming channel, which introduces stalls in the pipeline and clock cycles are spent on MEM operations, instead of compute operations in MXM or VXM. Lastly, the analysis shows that the VXM forms a bottleneck in the LSTM execution, accounting for 50% of the clock cycles, whereas the MXM accounts for the other 50%. This shows that the special functions and additions inside an LSTM cell are slowing down Groq’s overall performance.

After the existing architecture analysis, the ASIC evaluation showed a synthesis result, where one block of 384 LSTM Processing Engines (PEs) achieves a clock speed of 434 MHz and costs 8.07 mm² on a 40nm process node. Putting these PEs on a Groq-based chip layout of 725 mm² in area size, the 40nm-based PEs can achieve 79.3 TFLOP/s at FP16 precision. By correcting the 40nm process node to Groq’s 14nm process node, the 14nm-based PEs achieved 448 TFLOP/s at FP16. These results suggest that a custom LSTM accelerator could compete in performance with state-of-the-art solutions while being more cost-effective. Future work is suggested to investigate a cycle reduction in the multiply-accumulate stage and to evaluate the ASIC design using a modern process node technology, as the current ASIC design uses a 2008-based 40nm process node. This work helps future development in further optimizing a custom LSTM accelerator specified for Bonito’s DNN requirements, paving the road toward more affordable genome sequencing.
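The slicing effect described for the 320-element channel can be reproduced with a few lines of arithmetic. The sketch below assumes that a vector wider than the channel is split into the smallest number of equally sized slices; the function name `slice_vector` is hypothetical, and the equal-split policy is inferred from the 384 to 2 x 192 example in the abstract rather than from Groq documentation.

```python
import math


def slice_vector(length, channel_width):
    """Return (number of slices, elements per slice) for an equal split."""
    slices = math.ceil(length / channel_width)
    per_slice = math.ceil(length / slices)
    return slices, per_slice


# Bonito's 384 features on a 320-element channel, as in the abstract.
slices, per_slice = slice_vector(384, 320)
# Fraction of the channel width actually used by each slice.
utilization = per_slice / 320
```

With these inputs each slice fills only 60% of the channel, which illustrates why a vector just above the channel width is costly: the cycle count doubles while much of the channel carries no data.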

Zero-serialization, Zero-copy memory pooling in compute clusters

Disaggregated memory made accessible

Master thesis (2023) - P.M.Q. Groet (author) , H. Peter Hofstee (mentor) , Zaid Al-Ars (mentor)

With the rise of the new interconnect standards CXL and previously OpenCAPI, has come a great deal of possibilities to step away from the classical approach where CPUs are in charge of moving data between external devices and local memory. Specifically, OpenCAPI allows for attach ...

Accelerating DNA basecalling of Nanopore reads on FPGAs

Master thesis (2023) - J. Haenen (author) , H. Peter Hofstee (mentor) , Z. Al-Ars (graduation committee member) , Joana P. P. Gonçalves (coach)

Genomics has revolutionized our understanding of evolution, hereditary diseases, and more. The advent of long-read DNA sequencers i.e. Oxford Nanopore Technologies' innovations, has opened many new research potentials in genomics. These sequencers produce significantly longer DNA ...

Tydi-Chisel

Collaborative and Interface-Driven Data-Streaming Accelerator Design

Master thesis (2023) - C. Cromjongh (author) , H. Hofstee (mentor) , Z. Al-Ars (graduation committee member) , C.B. Bach (coach)

In spite of progress on hardware design languages, the design of high-performance hardware accelerators forces many design decisions specializing the interfaces of these accelerators in ways that complicate the understanding of the design and hinder modularity and collaboration. In response to this challenge, Tydi has been presented as an open specification for streaming dataflow designs in digital circuits, allowing designers to express how composite and variable-length data structures are transferred over streams using clear, data-centric types. Earlier efforts in providing an implementation framework for Tydi managed to generate VHDL boilerplate code for Tydi interfaces, but offered limited design value over custom solutions due to VHDL's low abstraction level. In contrast, Chisel, with its high level of abstraction and customizability offers a suitable platform to implement Tydi-based components.

In this thesis, the Tydi-Chisel library is presented along with an A-to-Z design-process description for data-streaming accelerators. A stream-interface solution is presented that offers both compatibility with Tydi in traditional HDLs and maximum utility within Chisel through two intercompatible representations. In addition, design complexity is reduced through novel utilities like stream-complexity conversion, developed to alleviate interface specification mismatches between components. Using the presented toolchain and library, the amount of code required to specify Tydi interfaces for representative use-cases can be reduced several times compared to a Verilog description, while offering increased utility.

Tydi-Chisel aims to simplify the design of data-streaming accelerators through the integration of the Tydi interface standard in Chisel, along with helper components, syntax sugar, and verification tools. In combination Chisel and Tydi help bridge the hardware-software divide, making solo-design and collaboration between designers easier.

A Toolchain for Streaming Dataflow Accelerator Designs for Big Data Analytics

Defining an IR for Composable Typed Streaming Dataflow Designs

Master thesis (2022) - M.A. Reukers (author) , H. Peter Hofstee (mentor) , Zaid Al-Ars (graduation committee member) , Johan Peltenburg (graduation committee member) , R. Van Leuken (graduation committee member)

Tydi is an open specification for streaming dataflow designs in digital circuits, allowing designers to express how composite and variable-length data structures are transferred over streams using clear, data-centric types. This provides a higher-level method for defining interfa ...

Tydi-lang: a language for typed streaming hardware

A manual for future Tydi-lang compiler developers

Master thesis (2022) - Y. TIAN (author) , Zaid Al-Ars (mentor) , H. Peter Hofstee (mentor)

Transferring composite data structures with variable-length fields often requires designing non-trivial protocols that are not compatible between hardware designs. When each project designs its own data format and protocols the ability to collaborate between hardware developers i ...

High-Performance Cluster-Scalable Computational Methods for Genomics Applications

Doctoral thesis (2022) - T. Ahmad (author) , Z Al-Ars (promotor) , H. Peter Hofstee (promotor)

The ever-increasing pace of advancements in sequencing technologies has enabled rapid DNA/genome sequencing to become much more accessible. In particular, next (second) and third generation sequencing technologies offer high-throughput, massively parallel and cost-effective sequencing solutions. Individual sample sequencing data volumes as well as the number of assembled genomes are also growing quickly. These advances in high-throughput sequencing technologies and the demand for fast computational processing and downstream analysis of sequencing data in clinical settings are widening the gap between the time spent in sample collection and sequencing versus computational analysis.

To improve the scalability and performance of genome variant calling analysis workflows on modern computing systems, this dissertation explores four research directions. First, to exploit modern processor hardware features such as multi-core and vector units in the GATK best practices variant calling pipelines, we introduce ArrowSAM, a columnar in-memory data format that places and processes genomics data in memory, thus removing the need for repeated file storage accesses in intermediate variant calling pipeline applications. Our second contribution focuses on the integration of the Apache Arrow based columnar in-memory data format in the PySpark API to enable exploiting the benefits of vectorized operations in the Python language using user-defined functions on Spark dataframes. For our third research contribution, we tested and benchmarked both the scalability and performance of Arrow Flight for client-server as well as cluster-scaled communication. For our final research contribution reported in this dissertation, we implemented an orthogonal approach that is even more scalable than Apache Spark and Arrow Flight based solutions and offers the flexibility to use many different variant callers.

FPGA accelerated trading data compression

Master thesis (2020) - J. Chen (author) , M. Zaid (mentor) , Fabio Sebastiano (graduation committee member) , H.P. Hofstee (coach) , Maurice Daverveldt (coach)

Enabling High Performance Posit Arithmetic Applications Using Hardware Acceleration

Master thesis (2018) - L. van Dam (author) , H. Peter Hofstee (mentor) , Zaid Al-Ars (mentor)

The demand for higher precision arithmetic is increasing due to the rapid development of new computing paradigms. The novel posit number representation system, as introduced by John L. Gustafson, claims to be able to provide more accurate answers to mathematical problems with equal or less number of bits compared to the well-established IEEE 754 floating point standard. In this work, the performance of the posit number format in terms of decimal accuracy is analyzed and compared to alternative number representations. A framework for performing high-precision posit arithmetic in reconfigurable logic is presented. The supported arithmetic operations can be performed without rounding off intermediate results, minimizing the loss of decimal accuracy. The proposed posit arithmetic units achieve approximately 250 MPOPS for addition, 160 MPOPS for multiplication and 180 MPOPS for accumulation operations. A hardware accelerator for performing Level 1 BLAS operations on (sparse) posit column vectors is presented. For the calculation of the vector dot product for an input vector length of 10^6 elements, a speedup of approximately 15000x compared to software is achieved. The decimal accuracy is improved by one decimal of accuracy on average compared to posit emulation in software, and two additional decimals of accuracy are achieved compared to calculation using the IEEE 754 floating point format. A study of the application of posit arithmetic in the field of bioinformatics is performed. The effect on decimal accuracy of the pair-HMM forward algorithm by replacing traditional floating point arithmetic with posit arithmetic is analyzed. It is shown that the maximum achievable decimal accuracy using posit arithmetic is higher compared to the IEEE floating point format for the same number of required bits. 
The design of a hardware accelerator for the pair-HMM forward algorithm using posit arithmetic is proposed for two different interfaces: a streaming-based accelerator and an accelerator interfacing with Apache Arrow columnar data, both connected by the CAPI (SNAP) platform. Overall, the posit number format beats the IEEE floating point number format in terms of decimal accuracy, ranging from an improvement of 0.5 to 1 additional decimal of accuracy for the performed test cases. A throughput of 1.6 and 1 giga cell updates per second is measured for both accelerator implementations, respectively.
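For readers unfamiliar with the posit format discussed above, the following sketch decodes a posit bit pattern into a Python float, following the published format definition (sign, regime, exponent, fraction). It is a software illustration only, not the hardware design from the thesis; the function name `decode_posit` and the default parameters `n=8`, `es=0` are assumptions chosen to keep the example small.

```python
def decode_posit(bits, n=8, es=0):
    """Decode an n-bit posit with es exponent bits into a float."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")  # NaR (not a real) has no float equivalent
    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & mask  # negative posits are stored in two's complement
    remaining = n - 1
    body = bits & ((1 << remaining) - 1)
    # regime: run of identical bits right after the sign bit
    first = (body >> (remaining - 1)) & 1
    run = 0
    while run < remaining and ((body >> (remaining - 1 - run)) & 1) == first:
        run += 1
    k = run - 1 if first else -run
    # bits left after the regime and its terminating bit
    left = max(remaining - run - 1, 0)
    tail = body & ((1 << left) - 1)
    exp_bits = min(es, left)
    # truncated exponent bits count as zeros
    exponent = ((tail >> (left - exp_bits)) << (es - exp_bits)) if exp_bits else 0
    frac_bits = left - exp_bits
    fraction = tail & ((1 << frac_bits) - 1)
    scale = k * (1 << es) + exponent
    value = (1.0 + fraction / (1 << frac_bits)) * 2.0 ** scale
    return -value if sign else value


one = decode_posit(0x40)
max_pos = decode_posit(0x7F)
min_pos = decode_posit(0x01)
```

The variable-length regime is what gives posits their tapered accuracy: values near 1.0 keep many fraction bits, while very large or very small values trade fraction bits for range.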

An FPGA-based Snappy Decompressor-Filter

Master thesis (2018) - Y. Qiao (author) , H. Peter Hofstee (mentor) , J. Fang (graduation committee member)

New interfaces to interconnect CPUs and accelerators at memory-class bandwidth pose new opportunities and challenges for the design of accelerators. This thesis studies one such accelerator, a decompressor for Parquet files compressed with the Snappy library. Our design targets r ...

Multi-way Hash Join Based on FPGAs

Master thesis (2018) - K. Huang (author) , H. Peter Hofstee (mentor) , J. Fang (coach)

The multi-way hash join is one of the commonly used and time-consuming database operations. Many algorithms have been developed to accelerate this operation, some of which use accelerators such as field programmable gate arrays (FPGAs). However, most of the previous work was focu ...

FPGA-Based High Throughput Merge Sorter

Master thesis (2018) - X. Zeng (author) , H. Peter Hofstee (mentor) , J. Fang (graduation committee member)

As database systems have shifted from disk-based to in-memory, and the scale of the database in big data analysis increases significantly, the workloads analyzing huge datasets are growing. Adopting FPGAs as hardware accelerators improves the flexibility, parallelism and power co ...

Feeding High-Bandwidth Streaming-Based FPGA Accelerators

Master thesis (2018) - Y.T.B. Mulder (author) , H. Peter Hofstee (mentor)

A new class of accelerator interfaces has significant implications on system architecture. An order of magnitude more bandwidth forces us to reconsider FPGA design. OpenCAPI is a new interconnect standard that enables attaching FPGAs coherently to a high-bandwidth, low-latency interface. Keeping up with this bandwidth poses new challenges for the design of accelerators, and the logic feeding them.

This thesis is conducted as part of a group project, where three other master students investigate database operator accelerators. This thesis focuses on the logic to feed the accelerators, by designing a reconfigurable multi-stream buffer architecture. By generalizing across multiple common streaming-like accelerator access patterns, an interface consisting of multiple read ports with a smaller than cache line granularity is desired. At the same time, multiple read ports are allowed to request any stream, including reading across a cache line boundary.

The proposed architecture exploits different memory primitives available on the latest generation of Xilinx FPGAs. By combining a traditional multi-read port approach for data duplication with a second level of buffering, a hierarchy typically found in caches, an architecture is proposed which can supply data from 64 streams to eight read ports without any access pattern restrictions.

A correct-by-construction design methodology was used to simplify the validation of the design and to speedup the implementation phase. At the same time, the design methodology is documented and examples are provided for ease of adoption. With the design methodology, the proposed architecture has been implemented and is accompanied by a validation framework.

Various configurations of the multi-stream buffer have been tested. Configurations up to 64 streams with four read ports meet timing with an AFU request-to-response latency of five cycles. The largest configuration with 64 streams and eight read ports fails timing. Limiting factors are the inherent architecture of FPGAs, where memories are physically located in specific columns. This makes extracting data complex, especially at the target frequencies of 200 MHz and 400 MHz. Wires are scattered across the FPGA and wire delay becomes dominant.

FPGA design at increasing bandwidths requires new design approaches. Synthesis results are no guarantee for the implemented design, and depending on the design size, could indicate a very optimistic operating frequency. Therefore, designing accelerators to keep up with an order of magnitude more bandwidth compared to the current state-of-the-art is complex, and requires carefully thought out accelerator cores, combined with an interface capable of feeding it.

A Resiliency-First Approach to Distributed DAG Computations

Master thesis (2017) - T.C. Leliveld (author) , H. Peter Hofstee (mentor)

A framework is introduced for computations with transformations on immutable data. Inspiration is taken from Apache Spark, however the model of computation is generalized from an emphasis on narrow and wide dependencies, to an arbitrary set of transformations that form a directed ...