Z. Al-Ars | TU Delft Repository

GPU-Accelerated String Compression for Big Data Analytics

Master thesis (2025) - T.P.D. Anema (author) , H. Hofstee (mentor) , Zaid Al-Ars (mentor) , Matthias Moller (graduation committee member) , Joost Hoozemans (graduation committee member)

This thesis presents a GPU-accelerated string compression algorithm based on FSST (Fast Static Symbol Table).
The proposed compressor leverages several advanced CUDA techniques to optimize performance, including a voting mechanism that maximizes memory bandwidth and an efficient gathering pipeline utilizing stream compaction.
Additionally, the algorithm uses GPU compute capacity to support a memory-efficient encoding table through a space-time tradeoff.
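As a rough illustration of the idea behind FSST-style compression, the sketch below encodes a string with a static symbol table: frequent substrings of up to eight bytes are replaced by one-byte codes, and bytes not covered by the table are written behind an escape code. The table contents, the helper names `encode` and `decode`, and the greedy longest-match strategy are assumptions made for illustration. This is sequential Python, not the GPU pipeline described in the thesis.

```python
ESCAPE = 255  # code reserved for "the next byte is a literal"


def encode(data: bytes, symbols: list) -> bytes:
    """Greedy longest-match encoding against a static symbol table.

    `symbols` is a hypothetical table of at most 255 byte strings
    (1 to 8 bytes each); the index of a symbol is its one-byte code.
    """
    code_of = {sym: code for code, sym in enumerate(symbols)}
    max_len = max((len(sym) for sym in symbols), default=0)
    out = bytearray()
    pos = 0
    while pos < len(data):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(data) - pos), 0, -1):
            code = code_of.get(data[pos:pos + length])
            if code is not None:
                out.append(code)
                pos += length
                break
        else:
            # No symbol matches: emit the escape code and the raw byte.
            out += bytes([ESCAPE, data[pos]])
            pos += 1
    return bytes(out)


def decode(encoded: bytes, symbols: list) -> bytes:
    """Inverse of `encode`: expand codes, copy escaped literals."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        code = encoded[i]
        if code == ESCAPE:
            out.append(encoded[i + 1])
            i += 2
        else:
            out += symbols[code]
            i += 1
    return bytes(out)


# Illustrative table and input; a real table is trained on the data.
table = [b"http://", b"www.", b".com", b"e"]
text = b"http://www.example.com"
packed = encode(text, table)
assert decode(packed, table) == text
print(len(text), "->", len(packed), "bytes")
```

Because the table is static and decoding only needs table lookups, individual strings can be decoded independently, which is the property that makes this family of schemes attractive for parallel hardware.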

The compression task is parallelized by tiling input data and adapting the data layout.
We introduce multiple compression pipelines, each with distinct tradeoffs.
To maximize encoding kernel throughput, the design introduces sliding windows and output packing to optimize register use and maximize effective memory bandwidth.
Pipeline-level throughput is further enhanced by introducing pipelined transposition stages and stream compaction to remove intermediate padding efficiently.
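The padding-removal step mentioned above can be pictured with a small sequential sketch of stream compaction. It assumes that each tile writes its compressed bytes into a fixed-size, padded buffer and reports how many bytes are valid; an exclusive prefix sum over those lengths then gives every tile its offset in the dense output. The function name `compact` and the tile layout are hypothetical, and on a GPU both the scan and the copy would run in parallel.

```python
from itertools import accumulate


def compact(tiles, lengths):
    """Gather the valid prefix of every padded tile into one dense buffer.

    tiles   : list of equally sized byte buffers (valid bytes first, then padding)
    lengths : number of valid bytes in each tile
    Returns the dense output and the per-tile output offsets.
    """
    # Exclusive prefix sum: offsets[i] = sum(lengths[:i]).
    offsets = [0] + list(accumulate(lengths))[:-1]
    out = bytearray(sum(lengths))
    for tile, n, off in zip(tiles, lengths, offsets):
        # Each copy touches a disjoint output range, so copies are independent.
        out[off:off + n] = tile[:n]
    return bytes(out), offsets


padded = [b"ab\x00\x00", b"cde\x00", b"\x00\x00\x00\x00", b"f\x00\x00\x00"]
valid = [2, 3, 0, 1]
dense, offsets = compact(padded, valid)
print(dense, offsets)
```

The offsets are the only cross-tile dependency; once they are known, every tile can copy its bytes without coordinating with the others.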

We evaluate these pipelines across several benchmark datasets and compare the best-performing version against state-of-the-art GPU compression algorithms, including nvCOMP, GPULZ, and compressors generated using the LC framework.
The proposed compressor achieves a throughput of 74GB/s on an RTX4090 while maintaining compression ratios comparable to FSST.
In terms of compression ratio, it consistently outperforms ANS, Bitcomp, Cascaded, and GPULZ across all datasets.
Its overall throughput exceeds that of GPULZ and of all nvCOMP compressors other than ANS, Bitcomp, and Cascaded; it does not exceed the compressors produced by the LC framework.
Our compressor lies on the Pareto frontier for all evaluated datasets, advancing the state-of-the-art toward ideal compression.
It achieves near-identical compression ratios to FSST (except for machine-readable datasets), while achieving a speedup of 42.06x.
Compared to multithreaded CPU compression, it achieves a 6.45x speedup.

To assess end-to-end performance, we integrate the compressor with the GSST decompressor. The resulting (de)compression pipeline achieves a combined throughput of 55GB/s, outperforming uncompressed data transfer on links with a bandwidth up to 37.5 GB/s.
It also outperforms all state-of-the-art (de)compressors when the link bandwidth ranges between 3GB/s and 20GB/s.

While further research is needed to enhance robustness and integrate the compressor into analytical engines, this work demonstrates a viable and Pareto-optimal alternative to existing string compression methods.

The source code of all our compression pipelines is publicly available on GitHub.
This work also serves as the foundation for a scientific paper that has been accepted for presentation at ADMS 2025.

Shockwaves and Tydi-Clash

Raising the abstraction level of the Haskell HDL Clash through typed waveforms and complex streaming interfaces

Master thesis (2025) - M.H.P. Adriaanse (author) , H. Hofstee (mentor) , Z. Al-Ars (mentor) , C. P. R. Baaij (mentor) , Chris Verhoeven (graduation committee member)

This work contains two systems created to raise abstraction for the Haskell-based HDL Clash.

A common tool in hardware design is the waveform viewer. Although Clash could already generate waveform files, these only contained binary representations of the values. Without ...

Synthetic X-Ray Image Generation Using FPGA-Based Hardware Acceleration

Master thesis (2025) - P.J. Aanhane (author) , Rob de Jong (mentor) , Zaid Al-Ars (mentor)

Synthetic image generation involves the creation of artificially generated images that are indistinguishable from real ones. This field is an answer to challenges in the world of data acquisition, where the need for data is outpacing the availability. In cooperation with Philips Medical Systems, the generation of synthetic X-ray images is studied. Using datasets derived from such images, equipment testing and physician training can be improved. Additionally, training data can be generated for machine learning purposes.

The generation of synthetic X-ray images has been an area of research since at least 1994. The images have traditionally been generated using ray-tracing techniques on CPUs or GPUs. While effective, these methods are computationally expensive and demand high memory bandwidths. More recently, machine learning techniques have been explored for X-ray image generation. These approaches are promising. However, they require large labelled datasets which are often unavailable and the quality of the results is difficult to predict.

The aim of this thesis is to investigate whether hardware acceleration using a field programmable gate array (FPGA) can solve the challenges other methods face. Specifically, it discusses an architecture that can handle the large amount of computations in parallel. The memory architecture required to handle the high bandwidth demands is also explained. The performance of the proposed architecture is studied to see whether it is a viable solution.

By simulating the traversal of rays through a voxelized model, an attenuation map is computed, which can be used to determine X-ray intensities on a detector. The design separates computational tasks between a host machine and an FPGA, with an optimized High Bandwidth Memory architecture to maximize data throughput. The results demonstrate that the simulation produces realistic images with minimal error (2.26% to 3.00% deviation from CPU results). Performance depends on the detector resolution, with frame rates between 123 and 378 frames per second, well above the goal of 60 frames per second. If more performance is required, upsampling can be used to speed up image generation by 33% at an increased error of 0.6% for an upsampling factor of two. These findings highlight the advantages of FPGA acceleration for deterministic, high-speed synthetic image generation without the need for the large labelled datasets required by machine learning algorithms.
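To make the ray-traversal step concrete, the sketch below accumulates attenuation along one ray through a voxel grid and converts the result to a detector intensity with the Beer-Lambert law, I = I0 · exp(−Σ μ·Δs). The fixed-step sampling, the nearest-voxel lookup, and the names `line_integral` and `detector_intensity` are assumptions chosen for brevity; the thesis may use a different traversal scheme, and the FPGA design processes many rays in parallel.

```python
import math


def line_integral(volume, origin, direction, step, n_steps, voxel_size=1.0):
    """Approximate the integral of the attenuation coefficient along a ray.

    volume    : nested lists indexed as volume[z][y][x], attenuation per unit length
    origin    : (x, y, z) start point of the ray
    direction : (dx, dy, dz), normalised inside the function
    step      : distance between samples along the ray
    """
    norm = math.sqrt(sum(d * d for d in direction))
    ux, uy, uz = (d / norm for d in direction)
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    total = 0.0
    for k in range(n_steps):
        t = (k + 0.5) * step  # sample at the midpoint of each step
        ix = math.floor((origin[0] + t * ux) / voxel_size)
        iy = math.floor((origin[1] + t * uy) / voxel_size)
        iz = math.floor((origin[2] + t * uz) / voxel_size)
        if 0 <= ix < nx and 0 <= iy < ny and 0 <= iz < nz:
            total += volume[iz][iy][ix] * step  # samples outside contribute nothing
    return total


def detector_intensity(i0, integral):
    """Beer-Lambert law: intensity remaining after attenuation."""
    return i0 * math.exp(-integral)


# A 4x4x4 block with uniform attenuation 0.5 per unit length (illustrative values).
block = [[[0.5] * 4 for _ in range(4)] for _ in range(4)]
mu_path = line_integral(block, (0.0, 1.5, 1.5), (1.0, 0.0, 0.0), step=0.25, n_steps=32)
print(mu_path, detector_intensity(1.0, mu_path))
```

Evaluating one such integral per detector pixel yields the attenuation map; the per-ray work is independent, which is what makes the problem a good fit for wide parallel hardware.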

Programming Quantum Computers

Doctoral thesis (2025) - A.M. Krol (author) , H. Peter Hofstee (promotor) , Zaid Al-Ars (promotor)

Because of recent stagnating single-thread performance and limited potential for further miniaturization of transistors, the computing industry is looking towards new technologies as the basis for the next generation of computing. One of these new technologies is quantum computin ...

GSST: High Throughput Parallel String Decompression on GPU

Master thesis (2024) - R. Vonk (author) , Z Al-Ars (mentor) , R. Hai (graduation committee member) , Joost Hoozemans (graduation committee member) , J.W. Peltenburg (graduation committee member)

This thesis describes how the throughput of data ingestion on GPUs can be increased by using data compression. This is done through two main contributions. First, a high-level model is presented to assess the impact of compression on ingestion throughput. Second, a novel decompression algorithm called GSST (GPU Static Symbol Table) is developed, optimized for GPU parallelism. GSST achieves state-of-the-art performance, striking an effective balance between compression ratio and decompression throughput.
The work done in this thesis has contributed to the submission of two scientific papers:
1. GSST: Parallel string decompression at 150 GB/s on GPU [1] (to be updated to 191 GB/s upon submission).
2. Benchmarking GPU Direct Storage for High-Performance Filesystems: Impact & Future Trends [2]
The ingestion throughput model is introduced to quantify the impact of data compression on data ingestion on a GPU. This model offers insight into how the compression ratio and decompression throughput influence overall data ingestion performance. This model shows that, as storage devices become faster, the decompression algorithm must also increase its throughput to keep up, while the compression ratio becomes less influential on the ingestion throughput.
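The trend described above can be reproduced with a deliberately simple bottleneck model, sketched below. It assumes that compressed data arrives over a link, that the decompressor's throughput is measured in uncompressed output bytes, and that the slower of the two stages limits ingestion. This is an illustrative simplification, not necessarily the model used in the thesis, and the function name `ingestion_throughput` is hypothetical.

```python
def ingestion_throughput(link_gb_s, compression_ratio, decompress_gb_s):
    """Effective uncompressed GB/s delivered to the GPU (assumed bottleneck model).

    The link delivers link_gb_s of compressed data, which expands to
    link_gb_s * compression_ratio of uncompressed data; the decompressor
    cannot emit more than decompress_gb_s.
    """
    return min(link_gb_s * compression_ratio, decompress_gb_s)


# Ratio and decompression throughput taken from the GSST figures in the abstract.
for link in (1.0, 10.0, 50.0, 100.0):
    with_gsst = ingestion_throughput(link, 2.74, 191.0)
    uncompressed = ingestion_throughput(link, 1.0, float("inf"))
    print(f"link {link:6.1f} GB/s: compressed path {with_gsst:6.1f}, raw path {uncompressed:6.1f}")
```

On slow links the compression ratio dominates the result; once the link is fast enough that the product exceeds the decompressor's throughput, only the decompression speed matters, which matches the qualitative conclusion stated above.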
GSST is the solution proposed to increase data ingestion throughput on GPUs. GSST adapts the FSST (Fast Static Symbol Table) algorithm to the parallel architecture of GPUs. Its performance is driven by six optimizations. Three format optimizations (block parallelism, split parallelism, and coalesced memory access) increase parallelism and throughput by changing the way data is stored. Additionally, three memory management techniques are implemented to effectively utilize the memory throughput of a GPU: shared memory, aligned memory accesses, and asynchronous data transfers.
Using the ingestion throughput model, GSST is evaluated against the state-of-the-art GPU compression algorithms from nvCOMP. The results reveal that GSST achieves a decompression throughput of 191 GB/s with a compression ratio of 2.74 on an A100. While nvCOMP’s ANS and Bitcomp outperform GSST in decompression throughput, they offer lower compression ratios. Conversely, Zstd achieves a higher compression ratio at a significantly lower decompression throughput, positioning GSST as a good balance between decompression throughput and compression ratio.
The data ingestion model demonstrates that GSST offers the highest ingestion throughput among the tested compression algorithms when ingesting data over a connection with a throughput between 0.8 GB/s and 87 GB/s. This means GSST is ideally suited for use with top-of-the-line networking equipment and even provides headroom for future improvements in connection throughput. Additionally, GSST is extremely memory-efficient, using significantly less GPU memory than all of nvCOMP’s compression algorithms. In some cases, GSST uses 3,500 times less memory, and in the best scenarios, over 67 million times less.
By leveraging these format and memory management optimizations, GSST provides a powerful, efficient solution for industries using large-scale data systems such as high-performance computing and data analytics.
The GSST source code will be made available on GitHub [3].

High-Performance Optimization of DNA Long Read De Novo Assembler

Master thesis (2024) - K. Zhao (author) , Zaid Al-Ars (mentor) , Tanveer Ahmad (mentor) , J.A. Baaijens (graduation committee member)

This thesis focuses on accelerating the polishing stage of the Flye genome assembler. Flye is a de novo assembler designed for long reads produced by modern sequencing technologies, excelling in handling large genomes with high accuracy and efficiency. A crucial component of the ...

Tywaves

A Typed Waveform Viewer for Chisel HDL with typed circuit components and Tydi streams

Master thesis (2024) - R. Meloni (author) , Zaid Al-Ars (mentor) , H.Peter Hofstee (mentor)

Modern hardware design languages introduce high-level constructs to considerably improve design capabilities. The adoption of software language features and strong type systems contribute to expressing complex designs with cleaner and more robust code, facilitating the translatio ...

Optimization Methods for Efficient Nanopore DNA Basecalling

Master thesis (2024) - M.W.G. Frensel (author) , Z. Al-Ars (mentor) , H. Peter Hofstee (mentor) , Erik van den Akker (graduation committee member) , Ramin Shirali Hossein Zade (graduation committee member)

Genomics, the study of an organism's complete set of DNA, including all of its genes, has revolutionized our understanding of biological processes and disease mechanisms. The field's rapid advancements have paved the way for personalized medicine, offering targeted therapies and improved healthcare outcomes. These advancements are a result of significant improvements in sequencing technology, bioinformatics, and computational power. Next-generation or long-read sequencing has reduced the cost and time required to sequence entire genomes, and Oxford Nanopore Technologies (ONT) sequencers provide 100–1000× longer contiguous reads, simplifying genome assembly. However, bioinformatics-driven advances in accuracy have come at the cost of high computational requirements because of the dependency on large deep neural networks (DNNs), and the basecalling step now takes 43% of the time in the nanopore sequencing pipeline.

This thesis addresses the large computational demands of high-accuracy basecalling of nanopore reads. Bonito, ONT's research basecaller, and other basecallers use DNNs at their core. The five Long Short-Term Memory (LSTM) layers used by the basecaller are the primary bottleneck to more efficient basecalling, taking almost 90% of the whole model's execution time when basecalling a single read. To alleviate this bottleneck, three approaches are investigated: pruning, model architecture, and quantization. Preliminary results show that pruning is the most impactful approach and that it has not been used successfully in previous work.

We propose learning structured sparsity using a delayed masking penalty scheduler. By adapting and improving on previous work, each LSTM layer is able to learn its optimal size during training, simultaneously with learning to basecall accurately. The method is optimized for the basecalling application and can be generalized to other tasks. We find that the required number of computations in the LSTM layers can be significantly reduced by up to 21 times with a reduction in match rate of just 1.3% compared to the high accuracy Bonito model. Furthermore, the newly introduced penalty parameter can be tuned to find the optimal trade-off between compute and accuracy for users' requirements.
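To give a feel for why structured pruning of LSTM layers reduces compute so strongly, the sketch below counts multiply-accumulate operations (MACs) per time step for a stack of LSTM layers. It assumes standard unidirectional LSTM layers with four gates and ignores biases and point-wise operations. The layer width of 384 and the pruned widths are illustrative values, not results from the thesis, and the function names are hypothetical.

```python
def lstm_macs(input_size, hidden_size):
    """MACs per time step of one LSTM layer.

    Each of the four gates multiplies an (input_size + hidden_size)-long
    vector by a weight matrix with hidden_size rows.
    """
    return 4 * hidden_size * (input_size + hidden_size)


def stack_macs(input_size, hidden_sizes):
    """MACs per time step of stacked LSTM layers."""
    total = 0
    for hidden in hidden_sizes:
        total += lstm_macs(input_size, hidden)
        input_size = hidden  # the next layer consumes this layer's output
    return total


dense = stack_macs(384, [384] * 5)  # assumed unpruned layer widths
pruned = stack_macs(384, [96] * 5)  # hypothetical widths after pruning
print(dense, pruned, round(dense / pruned, 1))
```

Because removing a hidden unit shrinks both the rows of its own layer and the input columns of the next layer, the cost falls roughly quadratically with the layer width, which is why learning each layer's size during training can pay off so well.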

The results indicate that state-of-the-art basecalling models are overparameterized and that their size can be reduced drastically without significantly affecting accuracy. Future work is suggested to investigate the benefits of pruning the whole model, and to assess the feasibility of combining pruning with advanced quantization methods. This work helps increase the accessibility of nanopore DNA sequencing, broadening the reach and impact of this technology.

High Performance ASIC Processor Design for DNA Basecallers

Master thesis (2024) - W. Trinh (author) , Z. Al-Ars (mentor) , H.Peter Hofstee (mentor) , R. T. Rajan (graduation committee member)

Genomics has revolutionized medicine and biological research by providing deeper insights into the genetic makeup of organisms, advancing our understanding of diseases, and enabling personalized medicine. These breakthroughs are driven by advancements in genome sequencing technologies, bioinformatics, and the aid of neural networks. The advent of third-generation sequencing technology has further accelerated progress by allowing for long-read sequencing, which enhances the accuracy and efficiency of genome assembly. Oxford Nanopore Technologies (ONT) offers advanced sequencers that use nanopore-based technology to read DNA sequences in real time. However, the raw sequenced data contains noise and requires a basecalling stage to read DNA sequences with the required accuracy. Basecalling relies on deep neural networks (DNNs) to achieve high-accuracy reads, but the significant computational power required makes basecalling a costly process, especially in real-time applications. Current hardware accelerators used for these compute-intensive basecallers are state-of-the-art GPUs that cost over $10,000 per unit.

This thesis explores the design of a custom hardware accelerator for ONT’s basecalling program, Bonito, which aims to provide a cost-effective alternative to existing accelerators such as Nvidia GPUs and Groq Tensor Streaming Processors (TSPs). Bonito’s DNN is dominated by five Long Short-Term Memory (LSTM) layers, accounting for 90% of its execution time. The custom accelerator targets these compute-intensive LSTMs to reduce execution time. This work provides a comprehensive analysis of LSTM performance and behavior on GPUs and Groq TSPs. It emphasizes the architectural benefits and limitations in the context of basecalling. Furthermore, it evaluates an Application-Specific Integrated Circuit (ASIC) implementation of an existing FPGA-based LSTM accelerator design.

The analysis shows that the LSTM layers of Bonito’s High-Accuracy (HAC) model contain many sequential matrix multiplications that are compute-intensive and require high memory bandwidth to accommodate the data transfers. Furthermore, Bonito’s small problem size, with 384 features per vector input, causes GPUs to not fully utilize the available compute cores. Combined with slow per-core performance, GPUs executing LSTMs achieve only 13.5% of the maximum TFLOP/s at FP16 precision.

Groq uses a heterogeneous architecture with fast, separate MXM units executing matrix multiplications and VXM units executing point-wise operations. Data travels between these compute units through streaming channels and MEM units. The LSTM analysis on Groq showed three main issues. First, Groq uses a 320-element-wide data channel to transfer data across the chip, whereas Bonito has 384 hidden features as input. Because Groq supports physical tensors of at most 320 elements, the 384-element input is sliced into two 192-element partial inputs, effectively doubling the cycle cost by using two slices instead of one. Second, performing matrix multiplications requires these slices to be transferred from MEM to MXM by executing reload operations to the MXM weight buffer. This data transfer consumes bandwidth on the streaming channel, which introduces stalls in the pipeline, so that clock cycles are spent on MEM operations instead of compute operations in MXM or VXM. Lastly, the analysis shows that the VXM forms a bottleneck in the LSTM execution, accounting for 50% of the clock cycles, whereas the MXM accounts for the other 50%. This shows that the special functions and additions inside an LSTM cell slow down Groq’s overall performance.

Following the analysis of the existing architectures, the ASIC evaluation produced a synthesis result in which one block of 384 LSTM Processing Engines (PEs) achieves a clock speed of 434 MHz and occupies 8.07 mm² on a 40 nm process node. Placing these PEs on a Groq-sized chip layout of 725 mm², the 40 nm PEs can achieve 79.3 TFLOP/s at FP16 precision. After scaling from the 40 nm process node to Groq’s 14 nm process node, the PEs achieve 448 TFLOP/s at FP16. These results suggest that a custom LSTM accelerator could compete in performance with state-of-the-art solutions while being more cost-effective.

Future work is suggested to investigate a cycle reduction in the multiply-accumulate stage and to evaluate the ASIC design using a modern process node technology, as the current ASIC design uses a 40 nm process node dating from 2008. This work supports the future development of a custom LSTM accelerator tailored to the requirements of Bonito’s DNN, paving the road toward more affordable genome sequencing.
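The split between matrix multiplications and point-wise operations discussed in this abstract follows directly from the structure of an LSTM cell. The sketch below implements one time step of a standard LSTM cell in plain Python, with comments marking which part is matrix work and which part is point-wise work. The dictionary layout of the weights, the gate names, and the function names are assumptions for illustration and do not reflect Bonito's or Groq's actual implementation.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Dense matrix-vector product (the matrix-multiplication part of the cell)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W[g], U[g], b[g] hold the input weights, recurrent weights and bias of
    gate g, for g in "i" (input), "f" (forget), "g" (candidate), "o" (output).
    """
    # Matrix work: eight matrix-vector products, two per gate.
    pre = {}
    for gate in "ifgo":
        wx = matvec(W[gate], x)
        uh = matvec(U[gate], h_prev)
        pre[gate] = [a + r + bias for a, r, bias in zip(wx, uh, b[gate])]

    # Point-wise work: activations, element-wise products and additions.
    i = [sigmoid(v) for v in pre["i"]]
    f = [sigmoid(v) for v in pre["f"]]
    g = [math.tanh(v) for v in pre["g"]]
    o = [sigmoid(v) for v in pre["o"]]
    c = [fv * cp + iv * gv for fv, cp, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c


# Tiny example with all-zero parameters (3 inputs, 2 hidden units).
zeros_w = {gate: [[0.0] * 3 for _ in range(2)] for gate in "ifgo"}
zeros_u = {gate: [[0.0] * 2 for _ in range(2)] for gate in "ifgo"}
zeros_b = {gate: [0.0] * 2 for gate in "ifgo"}
h, c = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], zeros_w, zeros_u, zeros_b)
print(h, c)
```

Each time step depends on the hidden and cell state of the previous one, so the matrix products of consecutive steps cannot be batched together; this is the sequential dependency that keeps the problem size per step small.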

Acceleration of hybrid CPU-GPU query execution engine in Arrow Format

Master thesis (2023) - K. Su (author) , Z Al-Ars (mentor) , Y. Tian (coach) , Asterios Katsifodimos (coach)

General-purpose GPUs, renowned for their exceptional parallel processing capabilities and throughput, hold great promise for enhancing the efficiency of data analytics tasks. At the same time, recent developments in query execution engines have integrated the support of OLAP oper ...

Zero-serialization, Zero-copy memory pooling in compute clusters

Disaggregated memory made accessible

Master thesis (2023) - P.M.Q. Groet (author) , H. Peter Hofstee (mentor) , Zaid Al-Ars (mentor)

With the rise of the new interconnect standards CXL and previously OpenCAPI, has come a great deal of possibilities to step away from the classical approach where CPUs are in charge of moving data between external devices and local memory. Specifically, OpenCAPI allows for attach ...

Adding fault tolerance to OpenCL

Through redundant heterogeneous computing

Master thesis (2023) - R.A. Bijl (author) , Z Al-Ars (mentor) , Christoph Lofi (graduation committee member) , Pekka Jääskeläinen (coach)

The ever-increasing demand for computing has led to the need for specialized heterogeneous hardware, and the frameworks required to utilize them. Besides the traditional central processing units, more and more programs will make use of specialized hardware to accelerate computati ...

Neural network partitioning for resource-limited environments

Master thesis (2023) - P. Geel (author) , Zaid Al-Ars (mentor) , N.P. van der Meijs (coach) , J. Petri-König (graduation committee member) , Kevin McElligott (coach)

The demand for implementing neural networks on edge devices has rapidly increased as they allow designers to move away from expensive server-grade hardware. However, due to the limited resources available on edge devices, it is challenging to implement complex neural networks. Th ...

Efficient mapping of large scale SNN and rate-based DNN on SENeCA

Master thesis (2023) - P.V. Nembhani (author) , Zaid Al-Ars (mentor) , P. Pawełczak (graduation committee member) , Amirreza Yousefzadeh (graduation committee member)

Artificial intelligence, machine learning, and deep learning have been the buzzwords in almost every industry (medical, automotive, defense, security, finance, etc.) for the last decade. As the market moves towards AI-based solutions, the computational needs of these solutions increase and change over time. With the rise of smart cities and cyberphysical systems, the need for edge devices and efficient computation on the edge increases. While most newly developed deep learning models are quite large and wasteful in terms of energy, recent methods help improve their performance on the edge. However, due to the size, variety, and irregularity of these models, their computing and power requirements are often too large for deployment on edge devices. This prohibits the use of such models in a rich field of applications that require high-throughput and real-time execution.

SENeCA (Scalable Energy Efficient Neuromorphic Computing Architecture) is a next-generation RISC-V-based neuromorphic computing architecture that was designed primarily for ultralow-edge applications where adaptivity is required. SENSIM (Scalable Energy Efficient Simulator), an open-source simulator developed by the Interuniversity Microelectronics Centre, provides an accurate software model of SENeCA, which helps in the early development and realization of spiking neural networks and deep neural networks. This thesis develops SENMap (Scalable Energy-Efficient Neuromorphic Computing Architecture Mapper), a tool built on top of SENSIM that maps spiking neural networks efficiently. Having a faster, scalable software solution that can cater to large-scale neural networks can speed up the development procedure.

SENMap is developed in such a way that it supports flexible SNN/DNN application replacement, multiple single- and multi-objective optimization algorithms, a choice between different optimization strategies, and varying architectural parameters at the time of experimentation. Results show that the mapping and the neural processing elements (NPEs) depend primarily on the rate at which the sensor processes the data. On the basis of this rate, SENMap enables an early realization of SNN- and DNN-based edge AI chips. Depending on the actual parameters used, the maximum achieved improvement in energy consumption was around 40%.

Hardware acceleration of artificial X-ray image generation

Master thesis (2023) - H.J.M.T. Knops (author) , Z Al-Ars (mentor) , RF Remis (graduation committee member) , Rob de Jong (coach)

X-ray imaging systems play an important role in the diagnostic process of various medical conditions. Generating an accurate artificial X-ray image has multiple advantages. It allows for flexible configurations during generation. The resulting images can reduce testing time and c ...

DNA comparisons in genomics

A reference-based perspective

Doctoral thesis (2023) - T.O. Mokveld (author) , Marcel JT Reinders (promotor) , Zaid Al-Ars (promotor)

Genomics is a field devoted to understanding the differences in genetics between populations, individuals, and even within individuals. By constantly comparing and contrasting data from diverse sources, genomics can refine our understanding of life and identify new ways to improve our lives. However, this often presents technical and biological challenges that require careful consideration of what is compared, in what context, and what might be present. In this thesis I contribute to resolving these challenges in three different domains:

In genomic data analysis, analysts often compare and contrast new genomic data to an established reference to reduce costs. However, this approach biases comparisons in favor of population-specific genetics since such references encode only a fraction of the genetics of a given population. To address this bias, I propose a method that accounts for population variability in a way that integrates it directly into the comparison process. This integration ensures that the contrast between sample and reference becomes smaller and closer to personalized, so they are treated the same way regardless of the underlying population. The method improves genome characterization and simplifies downstream analyses that rely on these comparisons. As a result, a more accurate portrayal of the genetics of a given population as a whole is obtained.

In non-invasive sequencing-based prenatal testing, we rely on circulating cell-free DNA from maternal plasma to detect pathogenic variants that may affect the fetus. A healthy baseline, which describes the normative state, is generally required to determine the presence of such variants. However, because this DNA is a mixture of maternal and much lower fetal proportions, it remains difficult to disentangle the two, primarily because of biological and technical biases. While this bias can partially be mitigated by changing the baseline and thus contrasting within the individual DNA mixture rather than to a divergent population of mixtures, further improvements are still needed. I present a generalized framework in which the signal-to-noise ratio can be further improved by fully exploiting the information in sequencing data, allowing for more robust predictions at even earlier stages of pregnancy.

The composition of the gut ecosystem can have short- and long-term effects on our health. It is therefore important to understand how it is formed and how a healthy balance can be maintained for as long as possible to preserve our health. To do this, ecosystems must be stratified and compared based on health indices. I show in extremely contrasting Dutch subpopulations that we can obtain valuable characteristics of divergent health states by comparing the gut ecosystems of centenarians with those of Alzheimer's patients. However, significant efforts are required to enable these comparisons due to the many organisms present and the technological limitations in measuring them, introducing bias at all levels.

Deep learning based motion prediction algorithms for autonomous driving

Master thesis (2022) - C. Ma (author) , Zaid Al-Ars (mentor)

In order to ensure that autonomous driving vehicles can make appropriate driving decisions based on the surrounding situation, motion prediction algorithms are used to generate the driving decision output, which will then be used for guiding the trajectory of the vehicle. In gene ...

Investigating scaling techniques and the cost-efficiency of distributed to single FPGA compositions for Full Waveform Inversion

Master thesis (2022) - L.A. Dierick (author) , Zaid Al-Ars (mentor) , J. Petri-König (graduation committee member) , Marco A. Zuñiga Zamalloa (graduation committee member) , Gokhan Gunay (graduation committee member)

In recent years, the big data era has produced an increasing volume and complexity of data that requires processing. To analyze and process these large amounts of data, applications are being scaled on large clusters using distributed data processing frameworks. A more recent trend utilizes hardware accelerators to offload computationally intensive tasks and reduce compute time and energy consumption. As a result, a rapid growth in the deployment of data centers containing heterogeneous compute infrastructures is observed. As an alternative to the more commonly used general-purpose GPUs (GPGPUs), the field programmable gate array (FPGA) is becoming an increasingly popular choice of accelerator. Its effectiveness in accelerating highly parallel applications, combined with the flexibility of its reconfigurable nature, makes it well suited for a wide range of applications. As a spatial compute resource, the problem size a single FPGA can process is bounded by the available programmable logic and memory. However, applications that do not require the full resources of an FPGA can be vertically scaled by instantiating multiple instances of the hardware design on a single node. A barrier to the adoption of FPGAs is the complexity of hardware design, which requires in-depth hardware-specific expertise. Additionally, integrating FPGAs in distributed data processing frameworks is a challenge in itself.

These challenges are being addressed in two directions. High level synthesis (HLS) tools and compilers are being developed to decrease the complexity of hardware design by allowing users to develop FPGA designs in high level languages. Additionally, there is an increased availability of ready-to-use FPGA designs for common applications in hardware libraries such as Vitis libraries.

To aid the adoption of FPGAs and improve their accessibility, this work presents OctoRay: a Python framework with a focus on ease of use that allows users to flexibly and transparently scale applications both vertically and horizontally on FPGA clusters. Scaling a binarized convolutional neural network (CNN) with OctoRay resulted in performance improvements linear in the number of nodes or copied instances used. The framework was also used to analyze the cost-efficiency of a cluster of low-end PYNQ-Z1 FPGAs compared to a data-center-class Alveo U280 FPGA. A partially hardware-accelerated implementation of Full Waveform Inversion (FWI), a seismic imaging algorithm, was developed and used to conduct the investigation. It was concluded that 32 PYNQ-Z1s are required to match the performance of a single Alveo U280 FPGA. An important bottleneck in the performance of the PYNQ-Z1s was the low-performance host processor on which a significant portion of FWI was executed. The small number of resources available on a PYNQ-Z1 limited the attainable accuracy of FWI to a bare minimum. The FWI hardware design with the same specifications made for the high-end FPGA utilized only a fraction of its resources, far from harnessing its full potential. It was concluded that, unlike FWI, applications that do not require the abundance of resources a high-end FPGA offers, but do benefit from rapid development cycles and low energy consumption, are suited for a distributed low-end FPGA composition.

QPack: A cross-platform quantum benchmark-suite

Quantitative performance metrics for application-oriented quantum computer benchmarking

Master thesis (2022) - H.J. Donkers (author) , Zaid Al-Ars (mentor) , Matthias Moller (graduation committee member) , K.J. Mesman (coach) , A. Sarkar (coach)

As the technology of quantum computers improves, evaluating their performance becomes an important tool for indexing and comparing quantum computers. Current benchmarking proposals either focus on gate-level evaluation, are centered around a single performance metric, or only evaluate in-house quantum computers. This gives rise to the need for a holistic, application-oriented, and hardware-agnostic benchmarking tool that can provide fair and varied insight into quantum computer performance. This thesis continues the development of the QPack benchmark, which collects quantum computer data by running noisy intermediate-scale quantum (NISQ)-era applications and transforms this data into an overall performance score, which is decomposed into four subscores.

These scores are quantitative metrics of quantum performance that allow for easy and quick comparisons between different quantum computers. The QPack benchmark is an application-oriented cross-platform benchmarking suite for quantum computers and simulators, which makes use of scalable Quantum Approximate Optimization Algorithm and Variational Quantum Eigensolver applications. Using a varied set of benchmark applications, QPack gives quantitative insight into how well a quantum computer or its simulator performs on a general NISQ-era application. QPack is built on top of the cross-platform library |Lib⟩ (pronounced: libket), which allows a quantum circuit to be expressed once and executed on multiple quantum computers.

Using the QPack benchmarking scores, a comparison is made between various quantum computer simulators, running both locally and on vendors’ remote cloud services. Tested local simulators include Qiskit Aer, Cirq, Rigetti QVM, and QuEST. For remote simulators, the IBMQ, IonQ, and Rigetti simulators have been benchmarked. The QPack benchmark is also executed on the Rigetti Aspen-M-1 and a selection of available quantum hardware from the IBMQ aviary, namely the Nairobi, Jakarta, Perth, Lagos, Quito, and Manila processors. For each backend, an analysis is made of its individual performance in the QPack benchmark, as well as an evaluation of how these simulators and hardware implementations compare to each other. Based on the results of the QPack benchmark, the local QuEST simulator, the remote IBMQ QASM simulator, and the IBMQ Nairobi and Quito quantum computers achieve the best performance among the tested backends.

This work shows that the QPack benchmark is capable of providing a holistic view of quantum computer performance, be it for physical implementations or their simulator counterparts. The latest version of the QPack benchmark and all the results collected can be found in the repository: https://gitlab.com/libket/qpack/-/tree/stable.

Tydi-lang: a language for typed streaming hardware

A manual for future Tydi-lang compiler developers

Master thesis (2022) - Y. TIAN (author) , Zaid Al-Ars (mentor) , H. Peter Hofstee (mentor)

Transferring composite data structures with variable-length fields often requires designing non-trivial protocols that are not compatible between hardware designs. When each project designs its own data format and protocols the ability to collaborate between hardware developers i ...