HH

H.P. Hofstee

info

Please Note

26 records found

Aiding financial crime detection in large datasets

Master thesis (2026) - K. Khalili, H.P. Hofstee, Kubilay Atasu, Z. Al-Ars
With the rise of memory costs and the persistent under-utilization of memory in clusters, researchers have begun exploring alternative approaches to improve memory efficiency and reduce operational costs. Resource disaggregation is becoming increasingly common and sought after, driven by the emergence of new interconnect standards such as CXL and, previously, OpenCAPI. While the industry is primarily moving toward memory pooling, where memory is dynamically provisioned among applications or virtual machines, this work investigates distributed memory disaggregation and sharing. IBM's Power10 processors include hardware support that enables multiple systems to directly share memory. However, few applications have been developed to take advantage of disaggregated shared memory.

Since Memory Inception, Power10 processors' memory disaggregation hardware, is not yet fully operational, a ThymesisFlow prototype, upgraded to support a shared disaggregated memory system with the help of Apache Arrow, is used to implement a practical application. The selected application is a graph processor capable of detecting money laundering patterns in financial transaction graphs in real-time. These patterns yield transaction features that machine learning algorithms can use to identify fraudulent financial transactions.

Our proof-of-concept implementation enables the creation of a distributed graph, represented as Apache Arrow tables, that can process large datasets in real-time. The graph resides in a shared disaggregated memory region and can be accessed by multiple systems without data copying, incurring lower latency penalties than network-based data retrieval. The distributed graph processor was developed and tested using the ThymesisFlow prototype provided by the Hasso Plattner Institute. ...
Master thesis (2026) - A.A.F. Verdiesen, H.P. Hofstee, Wim Bos, M. Weinmann, Z. Al-Ars
Monitoring the lower airspace for small drones and distinguishing them from birds, helicopters and airplanes, is a growing security need that radar, radio-frequency, and acoustic sensors meet only at considerable cost. This thesis asks whether a ground-based network of synchronized, overlapping RGB cameras can instead reconstruct and classify flying objects directly in 3D, recovering range through multi-view geometry rather than a long-range sensor. The central hypothesis is that the temporal evolution of a 3D Gaussian Splatting representation carries motion cues more discriminative than per-frame 2D or static 3D appearance.

Four contributions support this investigation, which, to our knowledge, is the first to classify flying objects
using temporal 4D Gaussian features. AeroSplat-4D is a synthetic multi-camera dataset and NVIDIA Isaac Sim pipeline emitting synchronized RGB, instance masks, depth, 3D trajectories, and exact calibration across the four classes, with class-balanced, identity-disjoint splits. DepthSplat-OC adapts feed-forward Gaussian splatting to thin, distant targets against a texture-less sky via a mask-gated photometric loss. MambaSplat-4D,
the main contribution, classifies the temporal Gaussian sequences by pairing a rotation-equivariant Vector-Neuron Transformer with a linear-time Mamba temporal encoder, enforcing SO(3) invariance architecturally rather than through augmentation.

In an augmentation-free ablation, aggregating a 24-frame clip rather than classifying a single frame raises accuracy from 59.1 % to 78.8 %, confirming that motion, not single-frame appearance, drives discrimination. Because SO(3) invariance is enforced architecturally, the full-attribute model attains the same 70.2 % four-class accuracy on clean and arbitrarily rotated data, about eight percentage points above a position-only baseline; it trails the strongest temporal baseline by roughly five points on clean data but is uniquely robust under rotation, with zero classification changes across 9600 rotated forward passes. DepthSplat-OC surpasses the closest-protocol baseline (24.65 versus 21.44 PSNR) despite roughly two orders of magnitude less training compute, and the compact 1.9 M-parameter classifier runs in under a millisecond per frame. On the out-of-distribution probe the pipeline does not yet surpass the 2D baselines, a gap that likely reflects their ImageNet-pretrained (∼1.2M-image) backbones rather than a limit of the 3D representation; real-camera transfer remains open, and and the core of the pipeline is released as open-source software at github.com/lumiad-bv/MambaSplat-4D.

This work thereby points toward multi-view 3D reconstruction and temporal reasoning as an effective alternative to the per-frame 2D detection that currently dominates aerial object classification. ...
Master thesis (2026) - U. Verma, H.P. Hofstee, K. J. Engel, M. Möller, Z. Al-Ars
“There is no threshold dose below which radiation can be considered completely harmless.”
— Hermann J. Muller

Although originally formulated for radiation in general, this statement applies to X-ray imaging as well, given its ionising nature. Even low-dose diagnostic procedures can damage DNA and carry cumulative cancer risk, which has led radiologists to minimise patient exposure whenever possible. These safety constraints limit the acquisition of large, diverse imaging datasets needed for developing, validating, and benchmarking modern medical imaging systems. They also restrict the number of projections that can be acquired in modalities such as computed tomography (CT), requiring accurate volumetric reconstruction from fewer scans and thereby increasing the technical demands on reconstruction algorithms.

Because acquiring realistic X-ray images is limited by patient safety, modern approaches first use sparse CT scans to reconstruct an accurate volumetric model of the object of interest. From this model, synthetic images can be generated through novel view synthesis, enabling large-scale offline datasets for machine learning, system testing, and model validation without additional radiation exposure. Beyond dataset creation, these synthetic projections can be produced in real time for interactive applications such as digital twins, used in virtual physician training and integration testing. However, generating high-fidelity synthetic images in real time remains challenging, given the substantial computational requirments of the algorithm.

This thesis investigates ways to accelerate X-ray simulation using graphics processing unit (GPU) implementations. Two techniques were developed: one based on voxelised models and another using Gaussian mixture models (GMMs). The approaches were evaluated in terms of visual fidelity and rendering performance, achieving ≈300 frames per second for voxel-based simulation and ≈40 frames per second for GMM-based simulation. Both techniques significantly reduce computation time compared to baseline CPU implementations, while maintaining realistic image quality suitable for virtual testing, physician training, and AI data generation.

These results demonstrate that GPU acceleration can enable real-time synthetic X-ray simulation, supporting scalable dataset creation and interactive applications while maintaining strict adherence to radiation safety principles. ...
Master thesis (2026) - S.L. Streef, H.P. Hofstee, Zaid Al-Ars, Kubilay Atasu, J.W. Peltenburg
This thesis explores the application of ahead-of-time parser generation to improve the throughput of big data ingestion. To investigate parser generation, this work produced several libraries related to the conversion of CSV to Apache Arrow.

First, an ahead-of-time parser generator was developed in the form of a Rust derive macro and a supporting library. Using a derive macro, a schema can be defined by a Rust struct definition for which a CSV to Arrow reader is derived. With the knowledge of the schema at compile-time, extra optimizations were possible. For instance, when the types in a schema are known to be of fixed size, estimates for the number of records in an input buffer could be made. With this principle, input bound checks could be reduced.

Experiments showed that ahead-of-time generated parsers outperformed state-of-the-art frameworks such as Apache Arrow, Polars, and DuckDB. Benchmarks for single types revealed that for integers, unbuffered and buffered generated parsers achieved a throughput of at least 1.5x compared to Apache Arrow, with the reduction of size bound checks sometimes even reaching a throughput of 3x. For floating point numbers, the buffered parser performed slightly better than Arrow. However, the unbuffered, and buffered with reduced size bound checking, parser achieved a throughput of at least 1.5x. For strings, the buffered parser performed similar to Apache Arrow since it uses the same efficient buffered string parsing. The unbuffered parser only achieved half the performance. Furthermore, benchmarks for parsing the TPC-H and TPC-DS datasets showed that the buffered parser generated ahead-of-time performed better for real-world datasets compared to the frameworks mentioned above. The unbuffered parser was only able to achieve a higher throughput for datasets larger than 100 MB. For the datasets, the throughput varied based on the distribution of types.

Additionally, this work explored, but did not integrate, the use of multi-threaded CSV parsing. However, experiments revealed that the performance of parallelization depends on how fast CSV can be scanned for record positions whilst correctly checking character escaping. An experiment with a custom multi-threaded parser implementation showed that when scaling the number of parse threads, the throughput is limited by scanning rather than parsing. This characteristic was also found for Polars and DuckDB, which support multi-threading. The scanning was shown to be possibly improved by using SIMD, which allowed scanning record delimiters at 2.8 GB/s using AVX2. This is approximately 1.5x more than the maximum throughput that Polars or DuckDB achieved.

https://github.com/sstreef/csv-to-arrow
...

The development of a formal foundation for the Tydi hardware stream specification

Master thesis (2026) - J.A. Struik, H.P. Hofstee, S.S. Chakraborty, Zaid Al-Ars
The Tydi (Typed Dataflow Interface) specification was introduced to bridge a gap in the High Level Synthesis (HLS) domain, enabling engineers to easily define component interfaces, mapping com- plex, dynamically sized data structures onto hardware streams. While Tydi successfully defines the static bit-level layout, it lacks a formal description for the dynamic procedure of mapping elements in a stream onto hardware lanes. This ambiguity has hindered the development of Tydi related tools and its integration in HLS frameworks. This work introduces a layered formalism for Tydi, centred around a modular operational semantics. It defines a two-tiered type system that separates a flexible user- facing syntax, from a canonical, hardware-oriented representation, strengthening the link between a type’s composition and runtime behaviour. The dynamic mapping is implemented by a configurable set of small-step semantics rules. These are derived from a user-provided set of formal properties, replacing the specifications' monolithic complexity levels with fine-grained control over interface logic. We define key structural determinism and progress metrics for the formalism, enabling the verification of interface implementations for streams of arbitrary type. This foundation enables developers to de- fine verifiably correct, typed hardware stream interfaces as easily as software, while retaining control over critical engineering trade-offs. A simulator of the semantics has been implemented, and provides empirical verification of the formalism and an insightful overview of the parsing logic. ...

Typed Behavioural Debugging in Modern Typed HDLs Through Signal Dependency Tracing

Master thesis (2025) - J.Y.K. Brand, H.P. Hofstee, C. Frenkel, Zaid Al-Ars
Debugging modern HDLs such as Chisel (Constructing Hardware In a Scala Embedded Language) remains challenging due to the lack of debugging tools operating on the source-language level. Furthermore, due to a lack of tooling, engineers often resort to manual waveform debugging, undermining productivity gains promised by such a language.

This thesis presents ChiselTrace, an open-source tool for Chisel that is capable of automatic signal dependency tracing at the Chisel source level, allowing faults to be more easily traced back to their root cause. The contributions of this work include the following. Modifications are made to the Chisel library to extract program dependence graphs and control flow graphs, and add instrumentation probes to the circuit, enabling post-simulation analysis. Furthermore, a library capable of dynamic program slicing and program dependence graph generation is introduced that is based on reconstruction from intermediate representation-level analysis. Finally, a front-end dependency graph viewer is presented, along with a method to automatically start a ChiselTrace session from failed ChiselSim unit tests.

The debugging capabilities of ChiselTrace are presented using a variety of test cases, including a real-world example, where an injected fault in the ChiselWatt processor is traced back to the source. ...
Master thesis (2025) - T.P.D. Anema, H.P. Hofstee, Z. Al-Ars, M. Möller, Joost Hoozemans
This thesis presents a GPU-accelerated string compression algorithm based on FSST (Fast Static Symbol Table).
The proposed compressor leverages several advanced CUDA techniques to optimize performance, including a voting mechanism that maximizes memory bandwidth and an efficient gathering pipeline utilizing stream compaction.
Additionally, the algorithm uses GPU compute capacity to support a memory-efficient encoding table through a space-time tradeoff.

The compression task is parallelized by tiling input data and adapting the data layout.
We introduce multiple compression pipelines, each with distinct tradeoffs.
To maximize encoding kernel throughput, the design introduces sliding windows and output packing to optimize register use and maximize effective memory bandwidth.
Pipeline-level throughput is further enhanced by introducing pipelined transposition stages and stream compaction to remove intermediate padding efficiently.

We evaluate these pipelines across several benchmark datasets and compare the best-performing version against state-of-the-art GPU compression algorithms, including nvCOMP, GPULZ, and compressors generated using the LC framework.
The proposed compressor achieves a throughput of 74GB/s on an RTX4090 while maintaining compression ratios comparable to FSST.
In terms of compression ratio, it consistently outperforms ANS, Bitcomp, Cascaded, and GPULZ across all datasets.
Its overall throughput exceeds that of GPULZ and all nvCOMP compressors except ANS, Bitcomp, Cascaded, and those produced by the LC framework.
Our compressor lies on the Pareto frontier for all evaluated datasets, advancing the state-of-the-art toward ideal compression.
It achieves near-identical compression ratios to FSST (except for machine-readable datasets), while achieving a speedup of 42.06x.
Compared to multithreaded CPU compression, it achieves a 6.45x speedup.

To assess end-to-end performance, we integrate the compressor with the GSST decompressor. The resulting (de)compression pipeline achieves a combined throughput of 55GB/s, outperforming uncompressed data transfer on links with a bandwidth up to 37.5 GB/s.
It also outperforms all state-of-the-art (de)compressors when the link bandwidth ranges between 3GB/s and 20GB/s.

While further research is needed to enhance robustness and integrate the compressor into analytical engines, this work demonstrates a viable and Pareto-optimal alternative to existing string compression methods.

The source code of all our compression pipelines is publicly available on GitHub.
This work also serves as the foundation for a scientific paper that has been accepted for presentation at ADMS 2025. ...
Master thesis (2025) - J.A. Fortanet Capetillo, H.P. Hofstee, A.J. van der Veen, Steven van der Vlugt, Mario Ruiz Noguera, Zaid Al-Ars
The growing prevalence of Artificial Intelligence (AI) applications has led to the development of specialized hardware accelerators optimized for performance and energy efficiency. One such accelerator is the Ryzen Neural Processing Unit (NPU), integrated into AMD’s Ryzen AI processors. While primarily designed for AI workloads, this thesis investigates the potential of repurposing the Ryzen NPU for Digital Signal Processing (DSP) applications, with a focus on radio astronomy. Using the All-Sky Imaging Algorithm from the LOFAR telescope system as a case study, the research evaluates whether the NPU can meet the real-time data processing demands imposed by LOFAR's 10 Hz data generation rate.

Four implementations of the algorithm were developed: three using the MLIR-AIE toolchain and one using the TINA framework. These implementations explored various parallelization and pipelining strategies to optimize performance while ensuring correctness and minimal power consumption. Experimental evaluations revealed up to a 77.4× speedup over a CPU baseline and a 2.84× speedup over a GPU implementation. Notably, three of the four implementations met the 10 Hz real-time requirement. All implementations yielded accurate results, with only minor variations due to differences in data types.

Although power consumption data for the NPU implementations was unavailable, the performance gains underscore the Ryzen NPU's potential for non-AI workloads. This thesis provides a proof of concept for DSP acceleration on the Ryzen NPU, contributes a new layer to the TINA toolchain, and offers insights for future application development. ...

Raising the abstraction level of the Haskell HDL Clash through typed waveforms and complex streaming interfaces

Master thesis (2025) - M.H.P. Adriaanse, H.P. Hofstee, Z. Al-Ars, C. P. R. Baaij, C.J.M. Verhoeven
This work contains two systems created to raise abstraction for the Haskell-based HDL Clash.

A common tool in hardware design is the waveform viewer. Although Clash could already generate waveform files, these only contained binary representations of the values. Without translating these to Haskell values, they are difficult to interpret. Shockwaves was created to perform this translation. Unlike other typed waveform solutions, Shockwaves performs the translation fully in the Haskell runtime, and stores the results in lookup tables. This gives the programmer full control over the waveform representation of data. There are two methods of generating VCD files from Clash, and Shockwaves was designed to work with both. The system is fully functional for signals traced during direct simulation. The alternative approach of simulating a design after compiling it to a different HDL depends on the Clash compiler adding type annotations. This requires an overhaul of the Clash compiler beyond the scope of the project.

The second system, Tydi-Clash, is a library for the Tydi streaming specification in Clash. Tydi was designed around transferring complex data structures, and allows for multiple related streams carrying typed, multi-dimensional data. The Tydi-Clash library supports Tydi data types, physical streams, and logical stream constructs. To encourage correct usage of the streams, the internal signals are encapsulated in algebraic and abstract data types that prevent defining or accessing undefined values. Additionally, tests are supplied for behavioral restrictions. An example implementation revealed implementations using Tydi-Clash are unfortunately still a bit cumbersome, but this is believed to be solvable by adding a library of utility modules for common situations. ...
Doctoral thesis (2025) - A.M. Krol, Peter Hofstee, Zaid Al-Ars
Because of recent stagnating single-thread performance and limited potential for further miniaturization of transistors, the computing industry is looking towards new technologies as the basis for the next generation of computing. One of these new technologies is quantum computing.
For utility-scale quantum computing, we will likely need millions of qubits. To program these qubits, the complete quantum computing stack will need to be improved, since programming large numbers of qubits is not feasible with current quantum programming languages.

In this dissertation, we present our new unitary decomposition algorithm, which is used to decompose arbitrary unitary matrices into a sequence of quantum gates that can be executed on a quantum computer. Our method results in 5% less CNOT gates than the previous state-of-the-art and can be used to decompose an arbitrary 3-qubit gate into at most 19 CNOT gates.

Unitary decomposition is an essential part of some quantum algorithms, and can be used as an optimization method for (parts) of quantum circuits. Efficient implementation of unitary decomposition allows for the translation of bigger input matrices into elementary quantum operations, which is key to executing these algorithms on existing quantum computers.
With the implementation of unitary decomposition in quantum programming framework OpenQL, we show how the structure of the input or intermediate matrices can be used to minimize the number of output gates and to minimize the runtime of the decomposition. Our implementation is 10 to 500 times as fast as the decomposition methods of the UniversalQCompiler and Qubiter.

With hybrid classical-quantum algorithms, even near-term quantum devices may be able to outperform classical computers. Hybrid algorithms, such as variational quantum eigensolvers, are iterative processes, and use a classical optimizer to update a parameterized quantum circuit. Each iteration, the circuit is executed on a physical quantum processor or a simulator, and the average of the measurement results is passed back to the classical optimizer. When many iterations are performed, the quantum program is recompiled many times.

We have implemented explicit parameters that prevent recompilation of hybrid programs in OpenQL, called OpenQLpc. These parameters reduce the compile time, and therefore improve the total runtime for hybrid algorithms. We have compared the execution of the MAXCUT benchmark in OpenQL with the execution of the same benchmark in PyQuil and Qiskit, which shows that the efficient handling of parameterized circuits in OpenQLpc results in up to 70% reduction in total compilation time and a reduced total execution time. With OpenQLpc, compilation of hybrid algorithms is also faster than either PyQuil or Qiskit.

In a collaboration with BMW and Entropica, we have developed a quantum algorithm for industrial shift scheduling (QISS), which uses Grover's adaptive search to tackle a common and important class of valuable, real-world combinatorial optimization problems.

We show how QISS can be used to find the optimal schedule for n days out of a solution space of size N = 4^(2n). The optimal solution is reached in 99% of cases within sqrt(N) = 4^n applications of Grover's oracle, which requires a total of 11n +9 + log2(19n) qubits for scheduling n days. We show the explicit construction of the Grover's oracle, incorporating the multiple constraints and detail the corresponding logical-level resource requirements. Further, we simulate the application of QISS for small-scale problem instances to corroborate the performance of the algorithm.
Our work shows how complex real-world industrial optimization problems can be formulated in the context of Grover's algorithm.

Using QISS, we then used open-source tools to estimate the quantum resources required for execution of this algorithm. We used qubit models based on current technology, as well as theoretical high-fidelity scenarios for superconducting qubit platforms. We find that the overall computational runtime is more strongly influenced by the execution time of gate and measurement operations than by system error rates. We find that achieving quantum utility would not only require low system error rates (10^(-6) or better), but also measurement operations with an execution time below 10 ns. This rules out the possibility of near-term quantum utility for this use-case, and suggests that significant technological or algorithmic progress will be needed before quantum utility can be achieved.

The research in this dissertation allows us to answer our main research question:
How can we make the quantum computing stack ready for utility-scale quantum computing?
For the quantum stack to be ready for utility-scale quantum computing, several major improvements will need to be made to prepare for programming and compiling circuits with millions of qubits.
* We will need high-level abstractions that will speed up programming of quantum computers, allow for (easier) debugging and will allow for programming millions of qubits.
* The classical component of the compilation and compute of (hybrid) quantum algorithms will need to be improved.
* More algorithms for real-world use-cases will need to be developed, which will provide a basis for improvements across the quantum stack that will lead to quantum utility.
* We need to do quantum resource estimation for real use-cases, in order to have insights into what utility-scale quantum computing will look like. ...

A Typed Waveform Viewer for Chisel HDL with typed circuit components and Tydi streams

Master thesis (2024) - R. Meloni, Z. Al-Ars, H.P. Hofstee
Modern hardware design languages introduce high-level constructs to considerably improve design capabilities. The adoption of software language features and strong type systems contribute to expressing complex designs with cleaner and more robust code, facilitating the translation of software algorithms for hardware accelerators. Despite these advantages, their mainstream adoption is often discouraged by the lack of debugging tools that support the same level of abstraction. The usage of standard tools implies inspecting automatically generated RTL code, dissimilar from the source, which leads to a convoluted debugging experience.

This thesis presents Tywaves, a new kind of type-centered waveform viewer for the Chisel hardware language with typed circuit components and Tydi streams. Contributions to both the Chisel library and CIRCT MLIR compiler are described. Type information for debugging is extracted from the source language and linked with the target Verilog. A frontend waveform viewer is updated with the functionality to interpret and associate type information with values dumped from an RTL simulator and reconstruct the source language view. Finally, a Chisel API has been implemented to enable Tywaves from a high-level testbench.

The Tywaves project aims to enhance the debugging experience of modern hardware languages by reducing the gap between the source code and waveforms. It provides a new type-centered debugging format that helps to bring the same level of abstraction of new languages into waveform viewers. ...
Genomics, the study of an organism's complete set of DNA, including all of its genes, has revolutionized our understanding of biological processes and disease mechanisms. The field's rapid advancements have paved the way for personalized medicine, offering targeted therapies and improved healthcare outcomes. These advancements are a result of significant improvements in sequencing technology, bioinformatics, and computational power. Next-generation or long-read sequencing has reduced the cost and time required to sequence entire genomes, and Oxford Nanopore Technologies (ONT) sequencers provide 100–1000× longer contiguous reads, simplifying genome assembly. However, bioinformatics-driven advances in accuracy have come at the cost of high computational requirements because of the dependency on large deep neural networks (DNNs), and the basecalling step now takes 43% of the time in the nanopore sequencing pipeline.

This thesis addresses the large computational demands for high accuracy nanopore basecalling of nanopore reads. Bonito, ONT's research basecaller, and other basecallers use DNNs at their core. The five Long Short-Term Memory (LSTM) layers used by the basecaller are the primary bottleneck to more efficient basecalling, taking almost 90% of the whole model's execution time when basecalling a single read. To alleviate this bottleneck, three approaches are investigated: pruning, model architecture, and quantization. Preliminary results show that pruning is the most impactful approach and has not successfully been used in previous work.

We propose learning structured sparsity using a delayed masking penalty scheduler. By adapting and improving on previous work, each LSTM layer is able to learn its optimal size during training, simultaneously with learning to basecall accurately. The method is optimized for the basecalling application and can be generalized to other tasks. We find that the required number of computations in the LSTM layers can be significantly reduced by up to 21 times with a reduction in match rate of just 1.3% compared to the high accuracy Bonito model. Furthermore, the newly introduced penalty parameter can be tuned to find the optimal trade-off between compute and accuracy for users' requirements.

The results indicate that state-of-the-art basecalling models are overparameterized and that their size can be reduced drastically without significantly affecting accuracy. Future work is suggested to investigate the benefits of pruning the whole model, and to assess the feasibility of combining pruning with advanced quantization methods. This work helps increase the accessibility of nanopore DNA sequencing, broadening the reach and impact of this technology. ...
Master thesis (2024) - W. Trinh, Z. Al-Ars, H.P. Hofstee, R.T. Rajan
Genomics has revolutionized medicine and biological research by providing deeper insights into the genetic makeup of organisms, advancing our understanding of diseases, and enabling personalized medicine. These breakthroughs are driven by advancements in genome sequencing technologies, bioinformatics, and the aid of neural networks. The advent of third-generation sequencing technol ogy has further accelerated progress by allowing for long-read sequencing, which enhances the ac curacy and efficiency of genome assembly. Oxford Nanopore Technologies (ONT) offers advanced sequencers that use nanopore-based technology to read DNA sequences in real-time. However, the raw sequenced data contains noise and requires a basecalling stage to read DNA sequences with the required accuracy. Basecalling relies on deep neural networks (DNNs) to achieve high-accuracy reads, but the significant computational power required makes basecalling a costly process, especially in real-time applications. Current hardware accelerators used for these compute-intensive basecallers are state-of-the-art GPUs that cost over $10,000 per unit. This thesis explores the design of a custom hardware accelerator for ONT’s basecalling program, Bonito, which aims to provide a cost-effective alternative to existing accelerators such as Nvidia GPUs and Groq Tensor Streaming Processors (TSPs). Bonito’s DNN is dominated by five Long Short-Term Memory(LSTM)layers, accounting for 90% of its execution time. The custom accelerator targets these compute-intensive LSTMs to reduce execution time. This work provides a comprehensive analysis of LSTM performance and behavior on GPUs and Groq TSPs. It emphasizes the architectural benefits and limitations in the context of basecalling. Furthermore, it also evaluates an Application-Specific Integrated Circuit (ASIC) implementation of an existing FPGA-based LSTM accelerator design. TheanalysisshowsthatBonito’sHigh-Accuracymodel(HAC)LSTMlayerscontainmanysequential matrix multiplications that are compute-intensive and require high memory bandwidth to accommodate the data transfers. Furthermore, Bonito’s small problem size, with 384 features per vector input, causes GPUs to not fully utilize available compute cores. Combined with slow per-core performance, GPUs executing LSTMs achieve only 13.5% of the maximum TFLOP/s with FP16 precision. Groq uses a heterogeneous architecture with fast separate MXM units executing matrix multiplica tions, and VXM units executing point-wise operations. Data travels between these different compute units through streaming channels and MEM units. The LSTM analysis on Groq showed three main issues. First, Groq uses a 320-element wide data channel to transfer data across the chip, whereas Bonito has 384 hidden features as input. This leads to the 384-element input being sliced in two 192 element partial inputs as Groq supports physical tensors up to 320-element long, effectively doubling the cycle cost by using two slices instead of one. Second, performing matrix multiplications requires these slices to be transferred from MEM to MXM, by executing reload operations to the MXM weight buffer. This data transfer consumes the bandwidth on the streaming channel, which introduces stalls in the pipeline and clock cycles are spent on MEM operations, instead of compute operations in MXM or VXM. Lastly, the analysis shows that the VXM forms a bottleneck in the LSTM execution, accounting for 50% of the clock cycles, whereas the MXM accounts for the other 50%. This shows that the special functions and additions inside an LSTM cell are slowing down Groq’s overall performance. After the existing architecture analysis, the ASIC evaluation showed a synthesis result, where one block of 384 LSTM Processing Engines (PEs) achieves a clock speed of 434MHz and costs 8.07𝑚𝑚2 on a 40nmprocess node. Putting these PEs on a Groq-based chip layout of 725𝑚𝑚2 in area size, the 40nm-based PEs can achieve 79.3 TFLOP/s at FP16 precision. By correcting the 40nm process node to Groq’s 14nm process node, the 14nm-based PEs achieved 448 TFLOP/s at FP16. These results suggest that a custom LSTM accelerator could compete in performance with state-of-the-art solutions while being more cost-effective. Future work is suggested to investigate a cycle reduction in the multiply-accumulate stage and to evaluate the ASIC design using a modern process node technology, as the current ASIC design uses a 2008-based 40nm process node. This work helps future development in further optimizing a custom LSTM accelerator specified for Bonito’s DNN requirements, paving the road toward more affordable genome sequencing ...
Master thesis (2023) - P.M.Q. Groet, H.P. Hofstee, Z. Al-Ars
With the rise of the new interconnect standards CXL and previously OpenCAPI, has come a great deal of possibilities to step away from the classical approach where CPUs are in charge of moving data between external devices and local memory. Specifically, OpenCAPI allows for attached devices to directly interface with the host memory bus in a near cache coherent way. IBM has developed the ThymesisFlow system which allows for other servers to access each others Random Access Memory through this OpenCAPI link. ThymesisFlow however is not fully coherent in some cases.
ThymesisFlow is designed for the situation where a borrower is able access a lender's memory, and the lender not accessing that borrowed memory. Coherency problems arise in the case where both a lender of memory, as well as a borrower of memory write to the lender's memory.
This thesis proposes the use of the Apache Arrow in-memory data format to not only access memory in a near coherent fashion, but in a fully coherent fashion. This will allow compute clusters to more efficiently use memory resources, allow for applications to dynamically hotplug memory, and allow for data sharing without copying over ethernet connection.

The protocols devised in this thesis are able to create disaggregated Arrow objects, which are readable by all nodes in a cluster in a coherent fashion. The creation of these coherent disaggregated objects is the only performance penalty in making them coherent, after initialization all nodes use their local CPU caches to cache remote objects.

A working proof-of-concept has been created which is able to share Apache Arrow objects stored in the memory of a single node. It is also possible to create Arrow objects which span the memory of multiple nodes, allowing for objects bigger than the memory of a single node. The proof-of-concept was able to be run thanks to the setup provided by the Hasso Plattner Institute. ...
Genomics has revolutionized our understanding of evolution, hereditary diseases, and more. The advent of long-read DNA sequencers i.e. Oxford Nanopore Technologies' innovations, has opened many new research potentials in genomics. These sequencers produce significantly longer DNA reads, facilitating novel applications. However, this technological leap brings challenges, particularly in accurate basecalling which is the process of converting raw sequenced measurements into digital base pair sequences. While advances in basecalling accuracy have been steadily improving over the years, the computational intensity remains a bottleneck in genomic analysis workflows, demanding costly high-end GPUs for probabilistic neural network models.

The main problem this thesis addresses is the implementation of an accelerated hardware solution for the compute-intensive process of basecalling long-read sequences. The thesis presents an FPGA-based implementation of the computationally demanding Long Short-Term Memory (LSTM) layers within the basecalling network known as Bonito. However, due to the lack of floating-point arithmetic units available on the FPGA, the FPGA implementation could not achieve competitive performance compared to GPUs.

While the FPGA implementation falls short of GPU performance, it serves as a possible stepping stone toward developing an ASIC solution for implementing the Bonito LSTM layers or potentially implementing the entire Bonito model. An ASIC implementation has the potential for superior performance up to 9 times faster than a GPU implementation while additionally being cost-effective. This suggests that ASICs hold promise as a future direction for accelerating long-read sequence basecalling, allowing for faster and more affordable genomics research. ...

Collaborative and Interface-Driven Data-Streaming Accelerator Design

Master thesis (2023) - C. Cromjongh, H.P. Hofstee, Z. Al-Ars, C.B. Poulsen
In spite of progress on hardware design languages, the design of high-performance hardware accelerators forces many design decisions specializing the interfaces of these accelerators in ways that complicate the understanding of the design and hinder modularity and collaboration. In response to this challenge, Tydi has been presented as an open specification for streaming dataflow designs in digital circuits, allowing designers to express how composite and variable-length data structures are transferred over streams using clear, data-centric types. Earlier efforts in providing an implementation framework for Tydi managed to generate VHDL boilerplate code for Tydi interfaces, but offered limited design value over custom solutions due to VHDL's low abstraction level. In contrast, Chisel, with its high level of abstraction and customizability offers a suitable platform to implement Tydi-based components.

In this thesis, the Tydi-Chisel library is presented along with an A-to-Z design-process description for data-streaming accelerators. A stream-interface solution is presented that offers both compatibility with Tydi in traditional HDLs and maximum utility within Chisel through two intercompatible representations. In addition, design complexity is reduced through novel utilities like stream-complexity conversion, developed to alleviate interface specification mismatches between components. Using the presented toolchain and library, the amount of code required to specify Tydi interfaces for representative use-cases can be reduced several times compared to a Verilog description, while offering increased utility.

Tydi-Chisel aims to simplify the design of data-streaming accelerators through the integration of the Tydi interface standard in Chisel, along with helper components, syntax sugar, and verification tools. In combination Chisel and Tydi help bridge the hardware-software divide, making solo-design and collaboration between designers easier. ...

Defining an IR for Composable Typed Streaming Dataflow Designs

Master thesis (2022) - M.A. Reukers, H.P. Hofstee, Z. Al-Ars, J.W. Peltenburg, T.G.R.M. van Leuken
Tydi is an open specification for streaming dataflow designs in digital circuits, allowing designers to express how composite and variable-length data structures are transferred over streams using clear, data-centric types. This provides a higher-level method for defining interfaces between components as opposed to existing bit- and byte-based interface specifications.

In this thesis, an open-source intermediate representation (IR) is introduced which allows for the declaration of Tydi's types. The IR enables creating and connecting components with Tydi Streams as interfaces, called Streamlets. It also lets backends for synthesis and simulation retain high-level information, such as documentation. Types and Streamlets can be easily reused between multiple projects, and Tydi’s streams and type hierarchy can be used to define interface contracts, which aid collaboration when designing a larger system.

The IR codifies the rules and properties established in the Tydi specification and serves to complement computation-oriented hardware design tools with a data-centric view on interfaces. To support different backends and targets, the IR is focused on expressing interfaces, and complements behavior described by hardware description languages and other IRs. Additionally, a testing syntax for the verification of inputs and outputs against abstract streams of data, and for substituting interdependent components, is presented which allows for the specification of behavior.

To demonstrate this IR, a grammar, parser, and query system have been created, and paired with a backend targeting VHDL. ...

A manual for future Tydi-lang compiler developers

Master thesis (2022) - Y. TIAN, Z. Al-Ars, H.P. Hofstee
Transferring composite data structures with variable-length fields often requires designing non-trivial protocols that are not compatible between hardware designs. When each project designs its own data format and protocols the ability to collaborate between hardware developers is diminished, which is an issue especially in the open-source community. Because the high-level meaning of a protocol is often lost in translation to low-level languages when a custom protocol needs to be designed, extra documentation is required, the interpretation of which introduces new opportunities for errors.

The Tydi specification (Tydi-spec) was proposed to address the above issues by codifying the composite and variable-length data structures in a type and providing a standard protocol to transfer typed data among hardware components. The Tydi intermediate representation (Tydi-IR) extends the Tydi-spec by defining typed interfaces, typed components, and connections among typed components.

In this paper, we propose Tydi-lang, a high-level hardware description language (HDL) for streaming designs. The language incorporates Tydi-spec to describe typed streams and provides templates to describe abstract reusable components. We also implement an open-source compiler from Tydi-lang to Tydi-IR. We leverage a Tydi-IR to VHDL compiler, and also present a simulator blueprint to identify streaming bottlenecks. We show several Tydi-lang examples to translate high-level SQL to VHDL to demonstrate that Tydi-lang can efficiently raise the level of abstraction and reduce design effort. ...
Doctoral thesis (2022) - T. Ahmad, Z. Al-Ars, H.P. Hofstee
The ever increasing pace of advancements in sequencing technologies has enabled rapid DNA/genome sequencing to become much more accessible. In particular, next (second) and third generation sequencing technologies offer high throughput, massively parallel and cost effective sequencing solutions. Individual sample sequencing data volumes as well as the number of assembled genomes are also growing quickly. These advances in high throughput sequencing technologies and demand for fast computational processing and downstream analysis of sequencing data in clinical settings is widening the gap between the time spent in sample collection and sequencing versus computational analysis.

To improve the scalability and performance optimizations of genome variant calling analysis workflows on modern computing systems, in this dissertation four potential research directions have been selected for further exploration. First, to exploit the performance of modern processors hardware features like multi-core and vector units on the GATK best practices variant calling pipelines, we introduce ArrowSAM, a columnar inmemory data format to place and process genomics data in-memory thus removing the need for repeated file storage accesses in intermediate variant calling pipeline applications. Our second contribution focuses on integration of the Apache Arrow based columnar in-memory data format in the PySpark API to enable exploiting the benefits of vectorized operations in the Python language using user-defined functions on Spark dataframes. For our third research contribution, we tested and benchmarked both the scalability and performance of Arrow Flight for client-server as well as cluster scaled communication.For our final research contribution reported in this dissertation, we implemented an orthogonal approach that is even more scalable than Apache Spark and Arrow Flight based solutions and offers flexibility to use many different variant callers. ...
Master thesis (2020) - J. Chen, Al-Ars Zaid, Fabio Sebastiano, Harm Peter Hofstee, Maurice Daverveldt