Z. Al-Ars

90 records found

Master thesis (2026) - A.A.F. Verdiesen, H.P. Hofstee, Wim Bos, M. Weinmann, Z. Al-Ars
Monitoring the lower airspace for small drones, and distinguishing them from birds, helicopters, and airplanes, is a growing security need that radar, radio-frequency, and acoustic sensors meet only at considerable cost. This thesis asks whether a ground-based network of synchronized, overlapping RGB cameras can instead reconstruct and classify flying objects directly in 3D, recovering range through multi-view geometry rather than a long-range sensor. The central hypothesis is that the temporal evolution of a 3D Gaussian Splatting representation carries motion cues more discriminative than per-frame 2D or static 3D appearance.

Four contributions support this investigation, which, to our knowledge, is the first to classify flying objects using temporal 4D Gaussian features. AeroSplat-4D is a synthetic multi-camera dataset and NVIDIA Isaac Sim pipeline emitting synchronized RGB, instance masks, depth, 3D trajectories, and exact calibration across the four classes, with class-balanced, identity-disjoint splits. DepthSplat-OC adapts feed-forward Gaussian splatting to thin, distant targets against a texture-less sky via a mask-gated photometric loss. MambaSplat-4D, the main contribution, classifies the temporal Gaussian sequences by pairing a rotation-equivariant Vector-Neuron Transformer with a linear-time Mamba temporal encoder, enforcing SO(3) invariance architecturally rather than through augmentation.

In an augmentation-free ablation, aggregating a 24-frame clip rather than classifying a single frame raises accuracy from 59.1 % to 78.8 %, confirming that motion, not single-frame appearance, drives discrimination. Because SO(3) invariance is enforced architecturally, the full-attribute model attains the same 70.2 % four-class accuracy on clean and arbitrarily rotated data, about eight percentage points above a position-only baseline; it trails the strongest temporal baseline by roughly five points on clean data but is uniquely robust under rotation, with zero classification changes across 9600 rotated forward passes. DepthSplat-OC surpasses the closest-protocol baseline (24.65 versus 21.44 PSNR) despite roughly two orders of magnitude less training compute, and the compact 1.9 M-parameter classifier runs in under a millisecond per frame. On the out-of-distribution probe the pipeline does not yet surpass the 2D baselines, a gap that likely reflects their ImageNet-pretrained (∼1.2M-image) backbones rather than a limit of the 3D representation; real-camera transfer remains open, and the core of the pipeline is released as open-source software at github.com/lumiad-bv/MambaSplat-4D.

This work thereby points toward multi-view 3D reconstruction and temporal reasoning as an effective alternative to the per-frame 2D detection that currently dominates aerial object classification. ...
Master thesis (2026) - U. Verma, H.P. Hofstee, K. J. Engel, M. Möller, Z. Al-Ars
“There is no threshold dose below which radiation can be considered completely harmless.”
— Hermann J. Muller

Although originally formulated for radiation in general, this statement applies to X-ray imaging as well, given its ionising nature. Even low-dose diagnostic procedures can damage DNA and carry cumulative cancer risk, which has led radiologists to minimise patient exposure whenever possible. These safety constraints limit the acquisition of large, diverse imaging datasets needed for developing, validating, and benchmarking modern medical imaging systems. They also restrict the number of projections that can be acquired in modalities such as computed tomography (CT), requiring accurate volumetric reconstruction from fewer scans and thereby increasing the technical demands on reconstruction algorithms.

Because acquiring realistic X-ray images is limited by patient safety, modern approaches first use sparse CT scans to reconstruct an accurate volumetric model of the object of interest. From this model, synthetic images can be generated through novel view synthesis, enabling large-scale offline datasets for machine learning, system testing, and model validation without additional radiation exposure. Beyond dataset creation, these synthetic projections can be produced in real time for interactive applications such as digital twins, used in virtual physician training and integration testing. However, generating high-fidelity synthetic images in real time remains challenging, given the substantial computational requirements of the algorithm.

This thesis investigates ways to accelerate X-ray simulation using graphics processing unit (GPU) implementations. Two techniques were developed: one based on voxelised models and another using Gaussian mixture models (GMMs). The approaches were evaluated in terms of visual fidelity and rendering performance, achieving ≈300 frames per second for voxel-based simulation and ≈40 frames per second for GMM-based simulation. Both techniques significantly reduce computation time compared to baseline CPU implementations, while maintaining realistic image quality suitable for virtual testing, physician training, and AI data generation.

These results demonstrate that GPU acceleration can enable real-time synthetic X-ray simulation, supporting scalable dataset creation and interactive applications while maintaining strict adherence to radiation safety principles. ...
Master thesis (2025) - T.P.D. Anema, H.P. Hofstee, Z. Al-Ars, M. Möller, Joost Hoozemans
This thesis presents a GPU-accelerated string compression algorithm based on FSST (Fast Static Symbol Table).
The proposed compressor leverages several advanced CUDA techniques to optimize performance, including a voting mechanism that maximizes memory bandwidth and an efficient gathering pipeline utilizing stream compaction.
Additionally, the algorithm uses GPU compute capacity to support a memory-efficient encoding table through a space-time tradeoff.

The compression task is parallelized by tiling input data and adapting the data layout.
We introduce multiple compression pipelines, each with distinct tradeoffs.
To maximize encoding kernel throughput, the design introduces sliding windows and output packing to optimize register use and maximize effective memory bandwidth.
Pipeline-level throughput is further enhanced by introducing pipelined transposition stages and stream compaction to remove intermediate padding efficiently.
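
Stream compaction, used above to remove intermediate padding, is commonly formulated as three steps: flag the elements to keep, take an exclusive prefix sum of the flags to obtain output positions, and scatter. The following is a minimal sequential sketch of that formulation in Python, not the thesis's CUDA implementation; the padding marker `PAD` and the function names are assumptions made for illustration.

```python
from itertools import accumulate

PAD = None  # hypothetical padding marker; the thesis does not specify one


def exclusive_scan(flags):
    """Exclusive prefix sum: out[i] == sum(flags[:i])."""
    if not flags:
        return []
    return [0] + list(accumulate(flags))[:-1]


def compact(items):
    """Remove PAD entries while preserving the order of the remaining items."""
    flags = [0 if x is PAD else 1 for x in items]  # 1 marks an element to keep
    offsets = exclusive_scan(flags)                # output slot for each kept element
    out = [PAD] * sum(flags)
    for x, keep, dst in zip(items, flags, offsets):
        if keep:
            out[dst] = x                           # scatter step
    return out
```

On a GPU the flagging and scatter steps are independent per element, and the prefix sum is the only step that needs cooperation between threads, which is why the technique parallelizes well.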

We evaluate these pipelines across several benchmark datasets and compare the best-performing version against state-of-the-art GPU compression algorithms, including nvCOMP, GPULZ, and compressors generated using the LC framework.
The proposed compressor achieves a throughput of 74 GB/s on an RTX 4090 while maintaining compression ratios comparable to FSST.
In terms of compression ratio, it consistently outperforms ANS, Bitcomp, Cascaded, and GPULZ across all datasets.
Its overall throughput exceeds that of GPULZ and all nvCOMP compressors except ANS, Bitcomp, Cascaded, and those produced by the LC framework.
Our compressor lies on the Pareto frontier for all evaluated datasets, advancing the state-of-the-art toward ideal compression.
It achieves near-identical compression ratios to FSST (except for machine-readable datasets), while achieving a speedup of 42.06x.
Compared to multithreaded CPU compression, it achieves a 6.45x speedup.

To assess end-to-end performance, we integrate the compressor with the GSST decompressor. The resulting (de)compression pipeline achieves a combined throughput of 55 GB/s, outperforming uncompressed data transfer on links with a bandwidth up to 37.5 GB/s.
It also outperforms all state-of-the-art (de)compressors when the link bandwidth ranges between 3 GB/s and 20 GB/s.

While further research is needed to enhance robustness and integrate the compressor into analytical engines, this work demonstrates a viable and Pareto-optimal alternative to existing string compression methods.

The source code of all our compression pipelines is publicly available on GitHub.
This work also serves as the foundation for a scientific paper that has been accepted for presentation at ADMS 2025. ...

Raising the abstraction level of the Haskell HDL Clash through typed waveforms and complex streaming interfaces

Master thesis (2025) - M.H.P. Adriaanse, H.P. Hofstee, Z. Al-Ars, C. P. R. Baaij, C.J.M. Verhoeven
This work contains two systems created to raise abstraction for the Haskell-based HDL Clash.

A common tool in hardware design is the waveform viewer. Although Clash could already generate waveform files, these only contained binary representations of the values. Without translating these to Haskell values, they are difficult to interpret. Shockwaves was created to perform this translation. Unlike other typed waveform solutions, Shockwaves performs the translation fully in the Haskell runtime, and stores the results in lookup tables. This gives the programmer full control over the waveform representation of data. There are two methods of generating VCD files from Clash, and Shockwaves was designed to work with both. The system is fully functional for signals traced during direct simulation. The alternative approach of simulating a design after compiling it to a different HDL depends on the Clash compiler adding type annotations. This requires an overhaul of the Clash compiler beyond the scope of the project.

The second system, Tydi-Clash, is a library for the Tydi streaming specification in Clash. Tydi was designed around transferring complex data structures, and allows for multiple related streams carrying typed, multi-dimensional data. The Tydi-Clash library supports Tydi data types, physical streams, and logical stream constructs. To encourage correct usage of the streams, the internal signals are encapsulated in algebraic and abstract data types that prevent defining or accessing undefined values. Additionally, tests are supplied for behavioral restrictions. An example implementation revealed that designs using Tydi-Clash are still somewhat cumbersome, but this is believed to be solvable by adding a library of utility modules for common situations. ...
Master thesis (2025) - P.J. Aanhane, Rob de Jong, Z. Al-Ars
Synthetic image generation involves the creation of artificially generated images that are indistinguishable from real ones. This field is an answer to challenges in the world of data acquisition, where the need for data is outpacing the availability. In cooperation with Philips Medical Systems, the generation of synthetic X-ray images is studied. Using datasets derived from such images, equipment testing and physician training can be improved. Additionally, training data can be generated for machine learning purposes.

The generation of synthetic X-ray images has been an area of research since at least 1994. The images have traditionally been generated using ray-tracing techniques on CPUs or GPUs. While effective, these methods are computationally expensive and demand high memory bandwidths. More recently, machine learning techniques have been explored for X-ray image generation. These approaches are promising. However, they require large labelled datasets which are often unavailable and the quality of the results is difficult to predict.

The aim of this thesis is to investigate whether hardware acceleration using a field programmable gate array (FPGA) can solve the challenges other methods face. Specifically, it discusses an architecture that can handle the large amount of computations in parallel. The memory architecture required to handle the high bandwidth demands is also explained. The performance of the proposed architecture is studied to see whether it is a viable solution.

By simulating the traversal of rays through a voxelized model, an attenuation map was computed which can be used to determine X-ray intensities on a detector. The design separates computational tasks between a host machine and an FPGA, with an optimized High Bandwidth Memory architecture to maximize data throughput. Results demonstrated that the simulation produced realistic images with minimal error (2.26% - 3.00% deviation from CPU results), and performance is dependent on the detector resolution, achieving frame rates between 123 and 378 frames per second, well above the goal of 60 frames per second. If more performance is required, upsampling can be used to speed up image generation by 33% at an increased error of 0.6% for an upsampling factor of two. These findings highlight the advantages of FPGA acceleration for deterministic, high-speed synthetic image generation without the need for large labelled datasets as required by machine learning algorithms. ...
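
The relation between an attenuation map and detector intensity is conventionally given by the Beer-Lambert law, I = I0 · exp(−Σ μ_i Δl_i), where μ_i is the linear attenuation coefficient of the i-th voxel crossed and Δl_i the path length inside it. The sketch below illustrates this for a single ray in plain Python. It assumes unit-sized voxels and fixed-step sampling with nearest-voxel lookup; the abstract does not describe the traversal scheme actually used, so the function name and all parameters are illustrative only.

```python
import math


def ray_intensity(mu, origin, direction, step, n_steps, i0=1.0):
    """Beer-Lambert attenuation along one ray through a voxel grid.

    mu is a nested list mu[x][y][z] of linear attenuation coefficients
    (per unit length) for unit-sized voxels. Samples that fall outside
    the grid contribute nothing to the line integral.
    """
    shape = (len(mu), len(mu[0]), len(mu[0][0]))
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    line_integral = 0.0
    for k in range(n_steps):
        t = (k + 0.5) * step                       # midpoint of the k-th segment
        point = [o + t * u for o, u in zip(origin, unit)]
        idx = [math.floor(c) for c in point]       # voxel containing the sample
        if all(0 <= i < n for i, n in zip(idx, shape)):
            line_integral += mu[idx[0]][idx[1]][idx[2]] * step
    return i0 * math.exp(-line_integral)
```

Each detector pixel corresponds to one such ray, and the rays are independent of each other, which is what makes the computation amenable to parallel hardware.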
Doctoral thesis (2025) - A.M. Krol, Peter Hofstee, Zaid Al-Ars
Because of recent stagnating single-thread performance and limited potential for further miniaturization of transistors, the computing industry is looking towards new technologies as the basis for the next generation of computing. One of these new technologies is quantum computing.
For utility-scale quantum computing, we will likely need millions of qubits. To program these qubits, the complete quantum computing stack will need to be improved, since programming large numbers of qubits is not feasible with current quantum programming languages.

In this dissertation, we present our new unitary decomposition algorithm, which is used to decompose arbitrary unitary matrices into a sequence of quantum gates that can be executed on a quantum computer. Our method results in 5% fewer CNOT gates than the previous state-of-the-art and can be used to decompose an arbitrary 3-qubit gate into at most 19 CNOT gates.

Unitary decomposition is an essential part of some quantum algorithms, and can be used as an optimization method for (parts of) quantum circuits. Efficient implementation of unitary decomposition allows for the translation of bigger input matrices into elementary quantum operations, which is key to executing these algorithms on existing quantum computers.
With the implementation of unitary decomposition in quantum programming framework OpenQL, we show how the structure of the input or intermediate matrices can be used to minimize the number of output gates and to minimize the runtime of the decomposition. Our implementation is 10 to 500 times as fast as the decomposition methods of the UniversalQCompiler and Qubiter.

With hybrid classical-quantum algorithms, even near-term quantum devices may be able to outperform classical computers. Hybrid algorithms, such as variational quantum eigensolvers, are iterative processes, and use a classical optimizer to update a parameterized quantum circuit. Each iteration, the circuit is executed on a physical quantum processor or a simulator, and the average of the measurement results is passed back to the classical optimizer. When many iterations are performed, the quantum program is recompiled many times.

We have implemented explicit parameters that prevent recompilation of hybrid programs in OpenQL, called OpenQLpc. These parameters reduce the compile time, and therefore improve the total runtime for hybrid algorithms. We have compared the execution of the MAXCUT benchmark in OpenQL with the execution of the same benchmark in PyQuil and Qiskit, which shows that the efficient handling of parameterized circuits in OpenQLpc results in up to 70% reduction in total compilation time and a reduced total execution time. With OpenQLpc, compilation of hybrid algorithms is also faster than either PyQuil or Qiskit.
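
The effect of explicit parameters on a hybrid loop can be illustrated with a toy model. The sketch below is plain Python and does not use the OpenQL or OpenQLpc API: `ToyCompiler`, `bind`, and `hybrid_loop` are hypothetical stand-ins, and the parameter update is a placeholder for a real classical optimizer. It only counts how often the compile step runs when the circuit is recompiled every iteration versus compiled once and re-bound.

```python
class ToyCompiler:
    """Stand-in for an expensive compilation pass (hypothetical)."""

    def __init__(self):
        self.calls = 0

    def compile(self, gates):
        self.calls += 1
        return tuple(gates)  # "compiled" program with parameters left symbolic


def bind(compiled, values):
    """Substitute numeric values for symbolic parameters without recompiling."""
    return [(name, values.get(arg, arg)) for name, arg in compiled]


def hybrid_loop(compiler, iterations, recompile):
    """Run a toy hybrid loop and return the last bound program."""
    gates = [("rx", "theta"), ("cnot", None), ("rz", "phi")]
    compiled = compiler.compile(gates)
    theta, phi = 0.0, 0.0
    program = None
    for _ in range(iterations):
        if recompile:
            compiled = compiler.compile(gates)  # naive: recompile every iteration
        program = bind(compiled, {"theta": theta, "phi": phi})
        # A real loop would execute `program` and let the classical optimizer
        # update the parameters from the measurement results.
        theta, phi = theta + 0.1, phi - 0.1
    return program
```

With ten iterations, the recompiling variant invokes the compiler eleven times and the parameterized variant once; the saving grows linearly with the number of optimizer iterations.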

In a collaboration with BMW and Entropica, we have developed a quantum algorithm for industrial shift scheduling (QISS), which uses Grover's adaptive search to tackle a common and important class of valuable, real-world combinatorial optimization problems.

We show how QISS can be used to find the optimal schedule for n days out of a solution space of size N = 4^(2n). The optimal solution is reached in 99% of cases within sqrt(N) = 4^n applications of Grover's oracle, which requires a total of 11n + 9 + log2(19n) qubits for scheduling n days. We show the explicit construction of Grover's oracle, incorporating the multiple constraints, and detail the corresponding logical-level resource requirements. Further, we simulate the application of QISS for small-scale problem instances to corroborate the performance of the algorithm.
Our work shows how complex real-world industrial optimization problems can be formulated in the context of Grover's algorithm.
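
The scaling figures quoted above can be evaluated directly for a given number of days. The helper below is an illustrative sketch: the function name is hypothetical, and rounding the log2(19n) term up to a whole number of qubits is an assumption, since the abstract does not state how that term is rounded.

```python
import math


def qiss_resources(n_days):
    """Evaluate the scaling formulas from the abstract for n_days days."""
    search_space = 4 ** (2 * n_days)   # N = 4^(2n) candidate schedules
    oracle_calls = 4 ** n_days         # sqrt(N) = 4^n oracle applications
    # Assumption: the log2(19n) term is rounded up to whole qubits.
    qubits = 11 * n_days + 9 + math.ceil(math.log2(19 * n_days))
    return search_space, oracle_calls, qubits
```

For n = 5 this gives N = 1,048,576 candidate schedules, 1,024 oracle applications, and 71 qubits under the rounding assumption.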

Using QISS, we then used open-source tools to estimate the quantum resources required for execution of this algorithm. We used qubit models based on current technology, as well as theoretical high-fidelity scenarios for superconducting qubit platforms. We find that the overall computational runtime is more strongly influenced by the execution time of gate and measurement operations than by system error rates. We find that achieving quantum utility would not only require low system error rates (10^(-6) or better), but also measurement operations with an execution time below 10 ns. This rules out the possibility of near-term quantum utility for this use-case, and suggests that significant technological or algorithmic progress will be needed before quantum utility can be achieved.

The research in this dissertation allows us to answer our main research question:
How can we make the quantum computing stack ready for utility-scale quantum computing?
For the quantum stack to be ready for utility-scale quantum computing, several major improvements will need to be made to prepare for programming and compiling circuits with millions of qubits.
* We will need high-level abstractions that will speed up programming of quantum computers, allow for (easier) debugging and will allow for programming millions of qubits.
* The classical component of the compilation and compute of (hybrid) quantum algorithms will need to be improved.
* More algorithms for real-world use-cases will need to be developed, which will provide a basis for improvements across the quantum stack that will lead to quantum utility.
* We need to do quantum resource estimation for real use-cases, in order to have insights into what utility-scale quantum computing will look like. ...
Master thesis (2024) - R. Vonk, Z. Al-Ars, R. Hai, Joost Hoozemans, J.W. Peltenburg
This thesis describes how the throughput of data ingestion on GPUs can be increased by using data compression. This is done through two main contributions. First, a high-level model is presented to assess the impact of compression on ingestion throughput. Second, a novel decompression algorithm called GSST (GPU Static Symbol Table) is developed, optimized for GPU parallelism. GSST achieves state-of-the-art performance, striking an effective balance between compression ratio and decompression throughput.
The work done in this thesis has contributed to the submission of two scientific papers:
1. GSST: Parallel string decompression at 150 GB/s on GPU [1] (to be updated to 191 GB/s upon submission).
2. Benchmarking GPU Direct Storage for High-Performance Filesystems: Impact & Future Trends [2]
The ingestion throughput model is introduced to quantify the impact of data compression on data ingestion on a GPU. This model offers insight into how the compression ratio and decompression throughput influence overall data ingestion performance. This model shows that, as storage devices become faster, the decompression algorithm must also increase its throughput to keep up, while the compression ratio becomes less influential on the ingestion throughput.
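A simple way to see this behaviour is a bottleneck model in which compressed data arrives over the link and is then expanded by the decompressor, so the delivered rate is limited by the slower of the two stages. The sketch below uses that form purely for illustration; it is an assumption and not necessarily the model defined in the thesis, and the function names are hypothetical.

```python
def ingestion_throughput(link_gb_s, ratio, decomp_gb_s):
    """Uncompressed GB/s delivered to the GPU under a simple bottleneck model.

    link_gb_s   : bandwidth of the storage or network link (compressed bytes)
    ratio       : compression ratio (uncompressed size / compressed size)
    decomp_gb_s : decompression throughput, measured in uncompressed output
    """
    return min(link_gb_s * ratio, decomp_gb_s)


def break_even_link(ratio, decomp_gb_s):
    """Link bandwidth above which the decompressor becomes the bottleneck."""
    return decomp_gb_s / ratio
```

Below the break-even bandwidth the result scales with the compression ratio; above it, only the decompression throughput matters, which matches the trend described above for faster storage devices.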
GSST is the solution proposed to increase data ingestion throughput on GPUs. GSST adapts the FSST (Fast Static Symbol Table) algorithm to the parallel architecture of GPUs. GSST’s performance is driven by six performance optimizations. Three format optimizations, block parallelism, split parallelism, and coalesced memory access, increase parallelism and throughput by changing the way data is stored. Additionally, three memory management techniques are implemented to effectively utilize the memory throughput of a GPU. These are the use of shared memory, using aligned memory accesses, and utilizing asynchronous data transfers.
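As background, FSST encodes a string as a sequence of one-byte codes that index a static table of symbols of 1 to 8 bytes each, with code 255 reserved as an escape that marks the following byte as a literal. The sketch below is a sequential decoder for that scheme with a made-up symbol table; it does not reproduce GSST's block and split formats or its memory optimizations.

```python
ESCAPE = 255  # FSST reserves this code: the next byte is copied verbatim


def fsst_decode(codes, symbols):
    """Decode a bytes object of FSST-style codes using a static symbol table."""
    out = bytearray()
    i = 0
    while i < len(codes):
        code = codes[i]
        if code == ESCAPE:
            out.append(codes[i + 1])   # literal byte follows the escape code
            i += 2
        else:
            out += symbols[code]       # table lookup, 1 to 8 bytes per code
            i += 1
    return bytes(out)
```

Because the output length of each code depends on the table, a decoder cannot know where the n-th code writes without processing the codes before it, which is why a parallel variant needs a data layout that splits the input into independently decodable pieces.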
Using the ingestion throughput model, GSST is evaluated against the state-of-the-art GPU compression algorithms from nvCOMP. The results reveal that GSST achieves a decompression throughput of 191 GB/s with a compression ratio of 2.74 on an A100. While nvCOMP’s ANS and Bitcomp outperform GSST in decompression throughput, they offer lower compression ratios. Similarly, Zstd achieves a higher compression with significantly lower decompression throughput, positioning GSST as a good balance of decompression throughput and compression ratio.
The data ingestion model demonstrates that GSST offers the highest ingestion throughput among the tested compression algorithms when ingesting data over a connection with a throughput between 0.8 GB/s and 87 GB/s. This means GSST is ideally suited for use with top-of-the-line networking equipment and even provides headroom for future improvements in connection throughput. Additionally, GSST is extremely memory-efficient, using significantly less GPU memory than all of nvCOMP’s compression algorithms. In some cases, GSST uses 3,500 times less memory, and in the best scenarios, over 67 million times less.
By leveraging these format and memory management optimizations, GSST provides a powerful, efficient solution for industries using large-scale data systems such as high-performance computing and data analytics.
The GSST source code will be made available on GitHub [3]. ...
Master thesis (2024) - K. Zhao, Z. Al-Ars, Tanveer Ahmad, J.A. Baaijens
This thesis focuses on accelerating the polishing stage of the Flye genome assembler. Flye is a de novo assembler designed for long reads produced by modern sequencing technologies, excelling in handling large genomes with high accuracy and efficiency. A crucial component of the assembly process is the polishing stage, which refines the draft assembly to correct errors and improve overall accuracy. However, this stage is computationally intensive and time-consuming, presenting a significant bottleneck in genome assembly workflows.

To address this, a novel multi-threading architecture is introduced, significantly reducing mutex contention by minimizing the use and acquisition times of mutexes within the bubble processor. Additionally, advanced vectorization techniques using AVX (Advanced Vector Extensions) instructions are incorporated to process multiple reads simultaneously. These optimizations effectively parallelize the polishing process and exploit modern CPU capabilities for enhanced performance.

Benchmarking the enhanced polishing stage on both bacteria and human genome datasets demonstrates a substantial improvement in processing time. For the bacteria dataset, the error correction process achieves speedups of 3.0x and 4.3x using AVX2 and AVX-512 instructions running on one core, respectively. The process realizes speedups of 2.6x and 2.7x with AVX2 and AVX-512 running on eight cores. For the human genome dataset, the process demonstrates a speedup of 4.0x when handling 1 million bubbles running on one core, while 32 cores yield a speedup of 2.3x for the same dataset. Applying AVX2 to the complete dataset on 64 cores results in a speedup of 1.4x. This acceleration not only reduces computational costs but also expedites the overall genome assembly process, making it more feasible for large-scale and time-sensitive genomic studies. The implementation is available on GitHub. ...

A Typed Waveform Viewer for Chisel HDL with typed circuit components and Tydi streams

Master thesis (2024) - R. Meloni, Z. Al-Ars, H.P. Hofstee
Modern hardware design languages introduce high-level constructs to considerably improve design capabilities. The adoption of software language features and strong type systems contribute to expressing complex designs with cleaner and more robust code, facilitating the translation of software algorithms for hardware accelerators. Despite these advantages, their mainstream adoption is often discouraged by the lack of debugging tools that support the same level of abstraction. The usage of standard tools implies inspecting automatically generated RTL code, dissimilar from the source, which leads to a convoluted debugging experience.

This thesis presents Tywaves, a new kind of type-centered waveform viewer for the Chisel hardware language with typed circuit components and Tydi streams. Contributions to both the Chisel library and CIRCT MLIR compiler are described. Type information for debugging is extracted from the source language and linked with the target Verilog. A frontend waveform viewer is updated with the functionality to interpret and associate type information with values dumped from an RTL simulator and reconstruct the source language view. Finally, a Chisel API has been implemented to enable Tywaves from a high-level testbench.

The Tywaves project aims to enhance the debugging experience of modern hardware languages by reducing the gap between the source code and waveforms. It provides a new type-centered debugging format that helps to bring the same level of abstraction of new languages into waveform viewers. ...
Genomics, the study of an organism's complete set of DNA, including all of its genes, has revolutionized our understanding of biological processes and disease mechanisms. The field's rapid advancements have paved the way for personalized medicine, offering targeted therapies and improved healthcare outcomes. These advancements are a result of significant improvements in sequencing technology, bioinformatics, and computational power. Next-generation or long-read sequencing has reduced the cost and time required to sequence entire genomes, and Oxford Nanopore Technologies (ONT) sequencers provide 100–1000× longer contiguous reads, simplifying genome assembly. However, bioinformatics-driven advances in accuracy have come at the cost of high computational requirements because of the dependency on large deep neural networks (DNNs), and the basecalling step now takes 43% of the time in the nanopore sequencing pipeline.

This thesis addresses the large computational demands of high-accuracy basecalling of nanopore reads. Bonito, ONT's research basecaller, and other basecallers use DNNs at their core. The five Long Short-Term Memory (LSTM) layers used by the basecaller are the primary bottleneck to more efficient basecalling, taking almost 90% of the whole model's execution time when basecalling a single read. To alleviate this bottleneck, three approaches are investigated: pruning, model architecture, and quantization. Preliminary results show that pruning is the most impactful approach and has not successfully been used in previous work.

We propose learning structured sparsity using a delayed masking penalty scheduler. By adapting and improving on previous work, each LSTM layer is able to learn its optimal size during training, simultaneously with learning to basecall accurately. The method is optimized for the basecalling application and can be generalized to other tasks. We find that the required number of computations in the LSTM layers can be significantly reduced by up to 21 times with a reduction in match rate of just 1.3% compared to the high accuracy Bonito model. Furthermore, the newly introduced penalty parameter can be tuned to find the optimal trade-off between compute and accuracy for users' requirements.

The results indicate that state-of-the-art basecalling models are overparameterized and that their size can be reduced drastically without significantly affecting accuracy. Future work is suggested to investigate the benefits of pruning the whole model, and to assess the feasibility of combining pruning with advanced quantization methods. This work helps increase the accessibility of nanopore DNA sequencing, broadening the reach and impact of this technology. ...
Master thesis (2024) - W. Trinh, Z. Al-Ars, H.P. Hofstee, R.T. Rajan
Genomics has revolutionized medicine and biological research by providing deeper insights into the genetic makeup of organisms, advancing our understanding of diseases, and enabling personalized medicine. These breakthroughs are driven by advancements in genome sequencing technologies, bioinformatics, and the aid of neural networks. The advent of third-generation sequencing technol ogy has further accelerated progress by allowing for long-read sequencing, which enhances the ac curacy and efficiency of genome assembly. Oxford Nanopore Technologies (ONT) offers advanced sequencers that use nanopore-based technology to read DNA sequences in real-time. However, the raw sequenced data contains noise and requires a basecalling stage to read DNA sequences with the required accuracy. Basecalling relies on deep neural networks (DNNs) to achieve high-accuracy reads, but the significant computational power required makes basecalling a costly process, especially in real-time applications. Current hardware accelerators used for these compute-intensive basecallers are state-of-the-art GPUs that cost over $10,000 per unit. This thesis explores the design of a custom hardware accelerator for ONT’s basecalling program, Bonito, which aims to provide a cost-effective alternative to existing accelerators such as Nvidia GPUs and Groq Tensor Streaming Processors (TSPs). Bonito’s DNN is dominated by five Long Short-Term Memory(LSTM)layers, accounting for 90% of its execution time. The custom accelerator targets these compute-intensive LSTMs to reduce execution time. This work provides a comprehensive analysis of LSTM performance and behavior on GPUs and Groq TSPs. It emphasizes the architectural benefits and limitations in the context of basecalling. Furthermore, it also evaluates an Application-Specific Integrated Circuit (ASIC) implementation of an existing FPGA-based LSTM accelerator design. 
TheanalysisshowsthatBonito’sHigh-Accuracymodel(HAC)LSTMlayerscontainmanysequential matrix multiplications that are compute-intensive and require high memory bandwidth to accommodate the data transfers. Furthermore, Bonito’s small problem size, with 384 features per vector input, causes GPUs to not fully utilize available compute cores. Combined with slow per-core performance, GPUs executing LSTMs achieve only 13.5% of the maximum TFLOP/s with FP16 precision. Groq uses a heterogeneous architecture with fast separate MXM units executing matrix multiplica tions, and VXM units executing point-wise operations. Data travels between these different compute units through streaming channels and MEM units. The LSTM analysis on Groq showed three main issues. First, Groq uses a 320-element wide data channel to transfer data across the chip, whereas Bonito has 384 hidden features as input. This leads to the 384-element input being sliced in two 192 element partial inputs as Groq supports physical tensors up to 320-element long, effectively doubling the cycle cost by using two slices instead of one. Second, performing matrix multiplications requires these slices to be transferred from MEM to MXM, by executing reload operations to the MXM weight buffer. This data transfer consumes the bandwidth on the streaming channel, which introduces stalls in the pipeline and clock cycles are spent on MEM operations, instead of compute operations in MXM or VXM. Lastly, the analysis shows that the VXM forms a bottleneck in the LSTM execution, accounting for 50% of the clock cycles, whereas the MXM accounts for the other 50%. This shows that the special functions and additions inside an LSTM cell are slowing down Groq’s overall performance. After the existing architecture analysis, the ASIC evaluation showed a synthesis result, where one block of 384 LSTM Processing Engines (PEs) achieves a clock speed of 434MHz and costs 8.07𝑚𝑚2 on a 40nmprocess node. 
Placed on a Groq-based chip layout with an area of 725 mm², the 40 nm-based PEs can achieve 79.3 TFLOP/s at FP16 precision. After correcting from the 40 nm process node to Groq’s 14 nm process node, the 14 nm-based PEs achieve 448 TFLOP/s at FP16. These results suggest that a custom LSTM accelerator could compete in performance with state-of-the-art solutions while being more cost-effective. Future work is suggested to investigate a cycle reduction in the multiply-accumulate stage and to evaluate the ASIC design on a modern process node, as the current ASIC design uses a 40 nm process node dating from 2008. This work supports the further optimization of a custom LSTM accelerator tailored to Bonito’s DNN requirements, paving the road toward more affordable genome sequencing ...
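The slicing penalty mentioned in this abstract (a 384-feature input on a 320-element channel) can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `slice_input` and the assumption that the input is split into equal-width slices are ours, chosen to match the two 192-element partial inputs reported in the abstract.

```python
import math


def slice_input(features: int, channel_width: int) -> tuple[int, int, float]:
    """Return (number of slices, elements per slice, channel utilization).

    Assumes the input is split evenly across the minimum number of slices
    that fit the channel; this mirrors the 2 x 192 split in the abstract.
    """
    slices = math.ceil(features / channel_width)
    per_slice = math.ceil(features / slices)
    # Fraction of the channel lanes that carry useful data over all slices.
    utilization = features / (slices * channel_width)
    return slices, per_slice, utilization


slices, per_slice, utilization = slice_input(features=384, channel_width=320)
print(slices, per_slice, f"{utilization:.0%}")
```

Under these assumptions the 384-element input needs two passes over the channel, and each pass leaves 128 of the 320 lanes idle.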
Master thesis (2023) - K. Su, Z. Al-Ars, Y. Tian, A. Katsifodimos
General-purpose GPUs, renowned for their exceptional parallel processing capabilities and throughput, hold great promise for enhancing the efficiency of data analytics tasks. At the same time, recent developments in query execution engines have integrated the support of OLAP operations in a way that benefits from the zero serialization overhead provided by the Apache Arrow memory format.
In this project, our objective is to evaluate the GPU acceleration potential of Arrow-based query execution engines, specifically with libcudf, a C++ GPU DataFrame library built on the Arrow format.
To this end, we design and implement four micro-benchmarks for different operators to understand the characteristics of workloads that result in high acceleration, as well as their possible bottlenecks and limitations. When data transfer durations are excluded, inherently parallelizable workloads exhibit high potential for GPU acceleration. However, this advantage diminishes considerably once data transfer overheads are taken into account. Building on these micro-benchmark outcomes, we design an on-the-fly, operator-level scheduler that dynamically accelerates query execution engines in a hybrid CPU/GPU system. Using a statistics-based cost model, the scheduler decides whether to dispatch an operator to the CPU or the GPU based on the input data location, data volume, data-related parameters, and the operator type.
With the scheduler, we achieve a maximum speedup of 4.88x for the Filter operator, 2.52x for the Sort operator, and 1.52x for the Copy operator when handling an array of 1e8 elements. ...
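To make the idea of an operator-level CPU/GPU scheduler concrete, here is a minimal sketch of a cost-based placement decision. It is not the cost model from the thesis: the class `OperatorProfile`, the functions `estimate_cost_ns` and `choose_device`, and every numeric constant are hypothetical, chosen only to show how data location, data volume, and operator type can feed one decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorProfile:
    """Hypothetical per-operator constants, e.g. fitted from micro-benchmarks."""
    cpu_ns_per_row: float
    gpu_ns_per_row: float
    gpu_launch_ns: float


# Assumed host<->device link cost (roughly 12.5 GB/s); not a figure from the thesis.
TRANSFER_NS_PER_BYTE = 0.08


def estimate_cost_ns(profile, device, rows, bytes_per_row, data_location):
    """Estimated time to run one operator on `device` ("cpu" or "gpu")."""
    if device == "cpu":
        compute = rows * profile.cpu_ns_per_row
    else:
        compute = profile.gpu_launch_ns + rows * profile.gpu_ns_per_row
    # Data that is not already on the target device has to be moved first.
    transfer = 0.0 if data_location == device else rows * bytes_per_row * TRANSFER_NS_PER_BYTE
    return compute + transfer


def choose_device(profile, rows, bytes_per_row, data_location):
    """Pick the device with the lowest estimated cost (ties go to the CPU)."""
    costs = {
        device: estimate_cost_ns(profile, device, rows, bytes_per_row, data_location)
        for device in ("cpu", "gpu")
    }
    return min(costs, key=costs.get)


# Illustrative profile for a filter-like operator; the values are made up.
filter_profile = OperatorProfile(cpu_ns_per_row=1.0, gpu_ns_per_row=0.05, gpu_launch_ns=20_000.0)
print(choose_device(filter_profile, rows=100_000_000, bytes_per_row=8, data_location="cpu"))
print(choose_device(filter_profile, rows=1_000, bytes_per_row=8, data_location="cpu"))
```

With these made-up constants, a large input amortizes the transfer and launch overheads and is sent to the GPU, while a small input stays on the CPU, which is the qualitative behavior the abstract describes.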
Master thesis (2023) - P.M.Q. Groet, H.P. Hofstee, Z. Al-Ars
The rise of new interconnect standards, first OpenCAPI and more recently CXL, has opened up many possibilities to step away from the classical approach in which CPUs are in charge of moving data between external devices and local memory. Specifically, OpenCAPI allows attached devices to interface directly with the host memory bus in a near-cache-coherent way. IBM has developed the ThymesisFlow system, which allows servers to access each other's random access memory through this OpenCAPI link. However, ThymesisFlow is not fully coherent in some cases.
ThymesisFlow is designed for the situation where a borrower accesses a lender's memory while the lender does not access that borrowed memory. Coherency problems arise when both the lender and the borrower write to the lender's memory.
This thesis proposes using the Apache Arrow in-memory data format to access memory not just in a near-coherent fashion but in a fully coherent one. This will allow compute clusters to use memory resources more efficiently, allow applications to dynamically hotplug memory, and allow data sharing without copying over an Ethernet connection.

The protocols devised in this thesis are able to create disaggregated Arrow objects, which are readable by all nodes in a cluster in a coherent fashion. Creating these coherent disaggregated objects is the only performance penalty incurred in making them coherent; after initialization, all nodes use their local CPU caches to cache remote objects.

A working proof-of-concept has been created that is able to share Apache Arrow objects stored in the memory of a single node. It is also possible to create Arrow objects that span the memory of multiple nodes, allowing for objects bigger than the memory of a single node. The proof-of-concept was run on a setup provided by the Hasso Plattner Institute. ...

Through redundant heterogeneous computing

Master thesis (2023) - R.A. Bijl, Z. Al-Ars, C. Lofi, Pekka Jääskeläinen
The ever-increasing demand for computing has led to the need for specialized heterogeneous hardware and the frameworks required to utilize it. Besides the traditional central processing units, more and more programs will make use of specialized hardware to accelerate computations. However, the increase in computing also leads to a shorter mean time between failures. In this thesis, we apply fault tolerance to Portable Computing Language (PoCL), an open-source implementation of the OpenCL standard. We show that our solution is easy to apply to existing programs making use of PoCL/OpenCL and is able to greatly reduce the total number of errors visible to the end user. Our solution can be used on any device supported by PoCL and provides a low overhead, given that the hardware requirements are met. ...
Master thesis (2023) - P. Geel, Z. Al-Ars, N.P. van der Meijs, J. Petri-König, Kevin McElligott
The demand for implementing neural networks on edge devices has rapidly increased as they allow designers to move away from expensive server-grade hardware. However, due to the limited resources available on edge devices, it is challenging to implement complex neural networks. This study selected the Kria SoM KV260 hardware platform due to its affordability and sufficient hardware capabilities for creating a resource-constrained environment. By leveraging the hardware acceleration capabilities of the FPGA for specific nodes of the MobileNetv1 model and offloading other nodes to the onboard quad-core Arm Cortex-A53 CPU, it was feasible to implement a neural network on a hybrid combination of CPU and FPGA. Results showed that when executing the MobileNetv1 model in a hybrid configuration, a total runtime improvement of 2.8x over a pure CPU implementation can be achieved. The study concludes that node-wise partitioning of the MobileNetv1 model is a practical solution. This approach offers a cost-effective solution for users who seek an accessible way to run neural networks without the need for expensive server-grade hardware. ...
Master thesis (2023) - P.V. Nembhani, Zaid Al-Ars, P. Pawelczak, Amirreza Yousefzadeh
Artificial intelligence, machine learning, and deep learning have been buzzwords in almost every industry (medical, automotive, defense, security, finance, etc.) for the last decade. As the market moves towards AI-based solutions, the computational needs of these solutions grow and change over time. With the rise of smart cities and cyber-physical systems, the need for edge devices and efficient computation on the edge increases. Most newly developed deep learning models are large and wasteful in terms of energy, although recent methods help improve performance on the edge. Still, due to the size, variety, and irregularity of these models, their computing and power requirements are often too large for deployment on edge devices. This prohibits their use in a rich field of applications that require high-throughput, real-time execution.

SENeCA (Scalable Energy Efficient Neuromorphic Computing Architecture) is a next-generation RISC-V-based neuromorphic computing architecture designed primarily for ultralow-edge applications where adaptivity is required. SENSIM (Scalable Energy Efficient Simulator), an open-source simulator developed by the Interuniversity Microelectronics Centre, provides an accurate software model of SENeCA, which helps in the early development and realization of spiking neural networks and deep neural networks. This thesis develops SENMap (Scalable Energy-Efficient Neuromorphic Computing Architecture Mapper), a mapping tool built on top of SENSIM that maps spiking neural networks efficiently. A faster, scalable software solution that can cater to large-scale neural networks can speed up the development procedure.

SENMap supports flexible replacement of SNN/DNN applications, multiple single- and multi-objective optimization algorithms, a choice of optimization strategies, and varying architectural parameters at experimentation time. Results show that mapping and neural processing elements (NPEs) depend primarily on the rate at which the sensor processes the data. On the basis of this rate, SENMap enables an early realization of SNN- and DNN-based edge AI chips. Depending on the actual parameters used, the maximum improvement achieved in energy consumption was around 40%. ...
Master thesis (2023) - H.J.M.T. Knops, Z. Al-Ars, R.F. Remis, Rob de Jong
X-ray imaging systems play an important role in the diagnostic process of various medical conditions. Generating an accurate artificial X-ray image has multiple advantages. It allows for flexible configurations during generation. The resulting images can reduce testing time and cost, help the training of surgeons, and increase the amount of data for artificial intelligence model training. The generation of an X-ray image involves the simulation of a raytracing algorithm through a data model. In this research, a naive approach to this problem is examined. It was found that this approach can be improved by implementing model parallelization, data caching, and data compression. The resulting algorithm is simulated and validated in a software environment. This is then implemented for both an Ultrascale+ and a Versal FPGA. The results show that the algorithm can achieve real-time X-ray image generation, matching the performance of currently used detectors, provided that the required memory performance is achieved. ...

A reference-based perspective

Doctoral thesis (2023) - T.O. Mokveld, M.J.T. Reinders, Z. Al-Ars
Genomics is a field devoted to understanding the differences in genetics between populations, individuals, and even within individuals. By constantly comparing and contrasting data from diverse sources, genomics can refine our understanding of life and identify new ways to improve our lives. However, this often presents technical and biological challenges that require careful consideration of what is compared, in what context, and what might be present. In this thesis I contribute to resolving these challenges in three different domains:

In genomic data analysis, analysts often compare and contrast new genomic data to an established reference to reduce costs. However, this approach biases comparisons in favor of population-specific genetics since such references encode only a fraction of the genetics of a given population. To address this bias, I propose a method that accounts for population variability in a way that integrates it directly into the comparison process. This integration ensures that the contrast between sample and reference becomes smaller and closer to personalized, so they are treated the same way regardless of the underlying population. The method improves genome characterization and simplifies downstream analyses that rely on these comparisons. As a result, a more accurate portrayal of the genetics of a given population as a whole is obtained.

In non-invasive sequencing-based prenatal testing, we rely on circulating cell-free DNA from maternal plasma to detect pathogenic variants that may affect the fetus. A healthy baseline, which describes the normative state, is generally required to determine the presence of such variants. However, because this DNA is a mixture of maternal and much lower fetal proportions, it remains difficult to disentangle the two, primarily because of biological and technical biases. While this bias can partially be mitigated by changing the baseline and thus contrasting within the individual DNA mixture rather than to a divergent population of mixtures, further improvements are still needed. I present a generalized framework in which the signal-to-noise ratio can be further improved by fully exploiting the information in sequencing data, allowing for more robust predictions at even earlier stages of pregnancy.

The composition of the gut ecosystem can have short- and long-term effects on our health. It is therefore important to understand how it is formed and how a healthy balance can be maintained for as long as possible to preserve our health. To do this, ecosystems must be stratified and compared based on health indices. I show in extremely contrasting Dutch subpopulations that we can obtain valuable characteristics of divergent health states by comparing the gut ecosystems of centenarians with those of Alzheimer's patients. However, significant efforts are required to enable these comparisons due to the many organisms present and the technological limitations in measuring them, introducing bias at all levels. ...
Master thesis (2022) - C. Ma, Z. Al-Ars
To ensure that autonomous vehicles can make appropriate driving decisions based on the surrounding situation, motion prediction algorithms are used to generate the driving decision output, which then guides the trajectory of the vehicle. In general, the output of a motion prediction algorithm is a series containing the predicted information for the future movement of the vehicle. A traditional approach is to use a physics-based model to generate the acceleration prediction series. However, such an approach requires extensive mathematical computation and is effective only in specific driving scenarios.

To address this, we propose a data-driven approach that runs four different kinds of machine learning models to generate the prediction output series. The results show that the auto-regressive (AR) model has the best prediction performance, outperforming traditional physics-based models by 14.32% on average for the ADE (average displacement error) evaluation metric and by 5.93% on average for the FDE (final displacement error) evaluation metric. ...
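For readers unfamiliar with the two metrics named above, the sketch below computes ADE and FDE using their common definitions: the mean Euclidean distance over all predicted time steps, and the Euclidean distance at the final time step. The function names and the way relative improvement is computed are assumptions for illustration; the abstract does not specify the exact evaluation procedure.

```python
import math


def displacement_errors(predicted, actual):
    """Return (ADE, FDE) for two equally long sequences of (x, y) points."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("trajectories must be non-empty and of equal length")
    # Per-step Euclidean distance between prediction and ground truth.
    distances = [math.dist(p, a) for p, a in zip(predicted, actual)]
    ade = sum(distances) / len(distances)  # average over all time steps
    fde = distances[-1]                    # error at the final time step
    return ade, fde


def relative_improvement(baseline_error, model_error):
    """Percentage reduction in error relative to a baseline (assumed definition)."""
    return 100.0 * (baseline_error - model_error) / baseline_error


predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
actual = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(predicted, actual)
print(ade, fde)
```

Because FDE looks only at the last step, a model can improve ADE substantially while improving FDE far less, which is consistent with the two improvement figures differing in the abstract.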
Master thesis (2022) - L.A. Dierick, Z. Al-Ars, J. Petri-König, M.A. Zuñiga Zamalloa, Gokhan Gunay
In recent years, the big data era has produced an increasing volume and complexity of data that requires processing. To analyze and process these large amounts of data, applications are being scaled on large clusters using distributed data processing frameworks. A more recent trend utilizes hardware accelerators to offload computationally intensive tasks and reduce compute time and energy consumption. As a result, rapid growth in the deployment of data centers containing heterogeneous compute infrastructures is observed. As an alternative to the more commonly used general-purpose GPUs (GPGPUs), the field-programmable gate array (FPGA) is becoming an increasingly popular choice of accelerator. Its effectiveness in accelerating highly parallel applications, combined with the flexibility of its reconfigurable nature, makes it well suited for a wide range of applications. As a spatial compute resource, the problem size a single FPGA can process is bounded by the available programmable logic and memory. However, applications that do not require the full resources of an FPGA can be vertically scaled by instantiating multiple instances of the hardware design on a single node. A barrier to the adoption of FPGAs is the complexity of hardware design, which requires in-depth hardware-specific expertise. Additionally, integrating FPGAs in distributed data processing frameworks is a challenge in itself.

These challenges are being addressed in two directions. High level synthesis (HLS) tools and compilers are being developed to decrease the complexity of hardware design by allowing users to develop FPGA designs in high level languages. Additionally, there is an increased availability of ready-to-use FPGA designs for common applications in hardware libraries such as Vitis libraries.

To aid the adoption of FPGAs and improve their accessibility, this work presents OctoRay: a Python framework with a focus on ease of use that allows users to flexibly and transparently scale applications both vertically and horizontally on FPGA clusters. Scaling a binarized convolutional neural network (CNN) with OctoRay resulted in performance improvements that scale linearly with the number of nodes or replicated instances used. The framework was also used to analyze the cost-efficiency of a cluster of low-end PYNQ-Z1 FPGAs compared to a data-center-class Alveo U280 FPGA. A partially hardware-accelerated implementation of Full Waveform Inversion (FWI), a seismic imaging algorithm, was developed and used to conduct the investigation. It was concluded that 32 PYNQ-Z1s are required to match the performance of a single Alveo U280 FPGA. An important bottleneck in the performance of the PYNQ-Z1s was the low-performance host processor on which a significant portion of FWI was executed. The small number of resources available on a PYNQ-Z1 limited the attainable accuracy of FWI to a bare minimum. The FWI hardware design with the same specifications made for the high-end FPGA utilized only a fraction of its resources, far from harnessing its full potential. It was concluded that, unlike FWI, applications that do not require the abundance of resources a high-end FPGA offers, but do benefit from rapid development cycles and low energy consumption, are suited for a distributed low-end FPGA composition. ...