H.P. Hofstee
Please Note
26 records found
1
Graph processing on systems with disaggregated memory
Aiding financial crime detection in large datasets
Since Memory Inception, Power10 processors' memory disaggregation hardware, is not yet fully operational, a ThymesisFlow prototype, upgraded to support a shared disaggregated memory system with the help of Apache Arrow, is used to implement a practical application. The selected application is a graph processor capable of detecting money laundering patterns in financial transaction graphs in real-time. These patterns yield transaction features that machine learning algorithms can use to identify fraudulent financial transactions.
Our proof-of-concept implementation enables the creation of a distributed graph, represented as Apache Arrow tables, that can process large datasets in real-time. The graph resides in a shared disaggregated memory region and can be accessed by multiple systems without data copying, incurring lower latency penalties than network-based data retrieval. The distributed graph processor was developed and tested using the ThymesisFlow prototype provided by the Hasso Plattner Institute. ...
Since Memory Inception, Power10 processors' memory disaggregation hardware, is not yet fully operational, a ThymesisFlow prototype, upgraded to support a shared disaggregated memory system with the help of Apache Arrow, is used to implement a practical application. The selected application is a graph processor capable of detecting money laundering patterns in financial transaction graphs in real-time. These patterns yield transaction features that machine learning algorithms can use to identify fraudulent financial transactions.
Our proof-of-concept implementation enables the creation of a distributed graph, represented as Apache Arrow tables, that can process large datasets in real-time. The graph resides in a shared disaggregated memory region and can be accessed by multiple systems without data copying, incurring lower latency penalties than network-based data retrieval. The distributed graph processor was developed and tested using the ThymesisFlow prototype provided by the Hasso Plattner Institute.
Four contributions support this investigation, which, to our knowledge, is the first to classify flying objects
using temporal 4D Gaussian features. AeroSplat-4D is a synthetic multi-camera dataset and NVIDIA Isaac Sim pipeline emitting synchronized RGB, instance masks, depth, 3D trajectories, and exact calibration across the four classes, with class-balanced, identity-disjoint splits. DepthSplat-OC adapts feed-forward Gaussian splatting to thin, distant targets against a texture-less sky via a mask-gated photometric loss. MambaSplat-4D,
the main contribution, classifies the temporal Gaussian sequences by pairing a rotation-equivariant Vector-Neuron Transformer with a linear-time Mamba temporal encoder, enforcing SO(3) invariance architecturally rather than through augmentation.
In an augmentation-free ablation, aggregating a 24-frame clip rather than classifying a single frame raises accuracy from 59.1 % to 78.8 %, confirming that motion, not single-frame appearance, drives discrimination. Because SO(3) invariance is enforced architecturally, the full-attribute model attains the same 70.2 % four-class accuracy on clean and arbitrarily rotated data, about eight percentage points above a position-only baseline; it trails the strongest temporal baseline by roughly five points on clean data but is uniquely robust under rotation, with zero classification changes across 9600 rotated forward passes. DepthSplat-OC surpasses the closest-protocol baseline (24.65 versus 21.44 PSNR) despite roughly two orders of magnitude less training compute, and the compact 1.9 M-parameter classifier runs in under a millisecond per frame. On the out-of-distribution probe the pipeline does not yet surpass the 2D baselines, a gap that likely reflects their ImageNet-pretrained (∼1.2M-image) backbones rather than a limit of the 3D representation; real-camera transfer remains open, and and the core of the pipeline is released as open-source software at github.com/lumiad-bv/MambaSplat-4D.
This work thereby points toward multi-view 3D reconstruction and temporal reasoning as an effective alternative to the per-frame 2D detection that currently dominates aerial object classification. ...
Four contributions support this investigation, which, to our knowledge, is the first to classify flying objects
using temporal 4D Gaussian features. AeroSplat-4D is a synthetic multi-camera dataset and NVIDIA Isaac Sim pipeline emitting synchronized RGB, instance masks, depth, 3D trajectories, and exact calibration across the four classes, with class-balanced, identity-disjoint splits. DepthSplat-OC adapts feed-forward Gaussian splatting to thin, distant targets against a texture-less sky via a mask-gated photometric loss. MambaSplat-4D,
the main contribution, classifies the temporal Gaussian sequences by pairing a rotation-equivariant Vector-Neuron Transformer with a linear-time Mamba temporal encoder, enforcing SO(3) invariance architecturally rather than through augmentation.
In an augmentation-free ablation, aggregating a 24-frame clip rather than classifying a single frame raises accuracy from 59.1 % to 78.8 %, confirming that motion, not single-frame appearance, drives discrimination. Because SO(3) invariance is enforced architecturally, the full-attribute model attains the same 70.2 % four-class accuracy on clean and arbitrarily rotated data, about eight percentage points above a position-only baseline; it trails the strongest temporal baseline by roughly five points on clean data but is uniquely robust under rotation, with zero classification changes across 9600 rotated forward passes. DepthSplat-OC surpasses the closest-protocol baseline (24.65 versus 21.44 PSNR) despite roughly two orders of magnitude less training compute, and the compact 1.9 M-parameter classifier runs in under a millisecond per frame. On the out-of-distribution probe the pipeline does not yet surpass the 2D baselines, a gap that likely reflects their ImageNet-pretrained (∼1.2M-image) backbones rather than a limit of the 3D representation; real-camera transfer remains open, and and the core of the pipeline is released as open-source software at github.com/lumiad-bv/MambaSplat-4D.
This work thereby points toward multi-view 3D reconstruction and temporal reasoning as an effective alternative to the per-frame 2D detection that currently dominates aerial object classification.
— Hermann J. Muller
Although originally formulated for radiation in general, this statement applies to X-ray imaging as well, given its ionising nature. Even low-dose diagnostic procedures can damage DNA and carry cumulative cancer risk, which has led radiologists to minimise patient exposure whenever possible. These safety constraints limit the acquisition of large, diverse imaging datasets needed for developing, validating, and benchmarking modern medical imaging systems. They also restrict the number of projections that can be acquired in modalities such as computed tomography (CT), requiring accurate volumetric reconstruction from fewer scans and thereby increasing the technical demands on reconstruction algorithms.
Because acquiring realistic X-ray images is limited by patient safety, modern approaches first use sparse CT scans to reconstruct an accurate volumetric model of the object of interest. From this model, synthetic images can be generated through novel view synthesis, enabling large-scale offline datasets for machine learning, system testing, and model validation without additional radiation exposure. Beyond dataset creation, these synthetic projections can be produced in real time for interactive applications such as digital twins, used in virtual physician training and integration testing. However, generating high-fidelity synthetic images in real time remains challenging, given the substantial computational requirments of the algorithm.
This thesis investigates ways to accelerate X-ray simulation using graphics processing unit (GPU) implementations. Two techniques were developed: one based on voxelised models and another using Gaussian mixture models (GMMs). The approaches were evaluated in terms of visual fidelity and rendering performance, achieving ≈300 frames per second for voxel-based simulation and ≈40 frames per second for GMM-based simulation. Both techniques significantly reduce computation time compared to baseline CPU implementations, while maintaining realistic image quality suitable for virtual testing, physician training, and AI data generation.
These results demonstrate that GPU acceleration can enable real-time synthetic X-ray simulation, supporting scalable dataset creation and interactive applications while maintaining strict adherence to radiation safety principles. ...
— Hermann J. Muller
Although originally formulated for radiation in general, this statement applies to X-ray imaging as well, given its ionising nature. Even low-dose diagnostic procedures can damage DNA and carry cumulative cancer risk, which has led radiologists to minimise patient exposure whenever possible. These safety constraints limit the acquisition of large, diverse imaging datasets needed for developing, validating, and benchmarking modern medical imaging systems. They also restrict the number of projections that can be acquired in modalities such as computed tomography (CT), requiring accurate volumetric reconstruction from fewer scans and thereby increasing the technical demands on reconstruction algorithms.
Because acquiring realistic X-ray images is limited by patient safety, modern approaches first use sparse CT scans to reconstruct an accurate volumetric model of the object of interest. From this model, synthetic images can be generated through novel view synthesis, enabling large-scale offline datasets for machine learning, system testing, and model validation without additional radiation exposure. Beyond dataset creation, these synthetic projections can be produced in real time for interactive applications such as digital twins, used in virtual physician training and integration testing. However, generating high-fidelity synthetic images in real time remains challenging, given the substantial computational requirments of the algorithm.
This thesis investigates ways to accelerate X-ray simulation using graphics processing unit (GPU) implementations. Two techniques were developed: one based on voxelised models and another using Gaussian mixture models (GMMs). The approaches were evaluated in terms of visual fidelity and rendering performance, achieving ≈300 frames per second for voxel-based simulation and ≈40 frames per second for GMM-based simulation. Both techniques significantly reduce computation time compared to baseline CPU implementations, while maintaining realistic image quality suitable for virtual testing, physician training, and AI data generation.
These results demonstrate that GPU acceleration can enable real-time synthetic X-ray simulation, supporting scalable dataset creation and interactive applications while maintaining strict adherence to radiation safety principles.
Semantics for Tydi
The development of a formal foundation for the Tydi hardware stream specification
First, an ahead-of-time parser generator was developed in the form of a Rust derive macro and a supporting library. Using a derive macro, a schema can be defined by a Rust struct definition for which a CSV to Arrow reader is derived. With the knowledge of the schema at compile-time, extra optimizations were possible. For instance, when the types in a schema are known to be of fixed size, estimates for the number of records in an input buffer could be made. With this principle, input bound checks could be reduced.
Experiments showed that ahead-of-time generated parsers outperformed state-of-the-art frameworks such as Apache Arrow, Polars, and DuckDB. Benchmarks for single types revealed that for integers, unbuffered and buffered generated parsers achieved a throughput of at least 1.5x compared to Apache Arrow, with the reduction of size bound checks sometimes even reaching a throughput of 3x. For floating point numbers, the buffered parser performed slightly better than Arrow. However, the unbuffered, and buffered with reduced size bound checking, parser achieved a throughput of at least 1.5x. For strings, the buffered parser performed similar to Apache Arrow since it uses the same efficient buffered string parsing. The unbuffered parser only achieved half the performance. Furthermore, benchmarks for parsing the TPC-H and TPC-DS datasets showed that the buffered parser generated ahead-of-time performed better for real-world datasets compared to the frameworks mentioned above. The unbuffered parser was only able to achieve a higher throughput for datasets larger than 100 MB. For the datasets, the throughput varied based on the distribution of types.
Additionally, this work explored, but did not integrate, the use of multi-threaded CSV parsing. However, experiments revealed that the performance of parallelization depends on how fast CSV can be scanned for record positions whilst correctly checking character escaping. An experiment with a custom multi-threaded parser implementation showed that when scaling the number of parse threads, the throughput is limited by scanning rather than parsing. This characteristic was also found for Polars and DuckDB, which support multi-threading. The scanning was shown to be possibly improved by using SIMD, which allowed scanning record delimiters at 2.8 GB/s using AVX2. This is approximately 1.5x more than the maximum throughput that Polars or DuckDB achieved.
https://github.com/sstreef/csv-to-arrow
...
First, an ahead-of-time parser generator was developed in the form of a Rust derive macro and a supporting library. Using a derive macro, a schema can be defined by a Rust struct definition for which a CSV to Arrow reader is derived. With the knowledge of the schema at compile-time, extra optimizations were possible. For instance, when the types in a schema are known to be of fixed size, estimates for the number of records in an input buffer could be made. With this principle, input bound checks could be reduced.
Experiments showed that ahead-of-time generated parsers outperformed state-of-the-art frameworks such as Apache Arrow, Polars, and DuckDB. Benchmarks for single types revealed that for integers, unbuffered and buffered generated parsers achieved a throughput of at least 1.5x compared to Apache Arrow, with the reduction of size bound checks sometimes even reaching a throughput of 3x. For floating point numbers, the buffered parser performed slightly better than Arrow. However, the unbuffered, and buffered with reduced size bound checking, parser achieved a throughput of at least 1.5x. For strings, the buffered parser performed similar to Apache Arrow since it uses the same efficient buffered string parsing. The unbuffered parser only achieved half the performance. Furthermore, benchmarks for parsing the TPC-H and TPC-DS datasets showed that the buffered parser generated ahead-of-time performed better for real-world datasets compared to the frameworks mentioned above. The unbuffered parser was only able to achieve a higher throughput for datasets larger than 100 MB. For the datasets, the throughput varied based on the distribution of types.
Additionally, this work explored, but did not integrate, the use of multi-threaded CSV parsing. However, experiments revealed that the performance of parallelization depends on how fast CSV can be scanned for record positions whilst correctly checking character escaping. An experiment with a custom multi-threaded parser implementation showed that when scaling the number of parse threads, the throughput is limited by scanning rather than parsing. This characteristic was also found for Polars and DuckDB, which support multi-threading. The scanning was shown to be possibly improved by using SIMD, which allowed scanning record delimiters at 2.8 GB/s using AVX2. This is approximately 1.5x more than the maximum throughput that Polars or DuckDB achieved.
https://github.com/sstreef/csv-to-arrow
ChiselTrace
Typed Behavioural Debugging in Modern Typed HDLs Through Signal Dependency Tracing
This thesis presents ChiselTrace, an open-source tool for Chisel that is capable of automatic signal dependency tracing at the Chisel source level, allowing faults to be more easily traced back to their root cause. The contributions of this work include the following. Modifications are made to the Chisel library to extract program dependence graphs and control flow graphs, and add instrumentation probes to the circuit, enabling post-simulation analysis. Furthermore, a library capable of dynamic program slicing and program dependence graph generation is introduced that is based on reconstruction from intermediate representation-level analysis. Finally, a front-end dependency graph viewer is presented, along with a method to automatically start a ChiselTrace session from failed ChiselSim unit tests.
The debugging capabilities of ChiselTrace are presented using a variety of test cases, including a real-world example, where an injected fault in the ChiselWatt processor is traced back to the source. ...
This thesis presents ChiselTrace, an open-source tool for Chisel that is capable of automatic signal dependency tracing at the Chisel source level, allowing faults to be more easily traced back to their root cause. The contributions of this work include the following. Modifications are made to the Chisel library to extract program dependence graphs and control flow graphs, and add instrumentation probes to the circuit, enabling post-simulation analysis. Furthermore, a library capable of dynamic program slicing and program dependence graph generation is introduced that is based on reconstruction from intermediate representation-level analysis. Finally, a front-end dependency graph viewer is presented, along with a method to automatically start a ChiselTrace session from failed ChiselSim unit tests.
The debugging capabilities of ChiselTrace are presented using a variety of test cases, including a real-world example, where an injected fault in the ChiselWatt processor is traced back to the source.
The proposed compressor leverages several advanced CUDA techniques to optimize performance, including a voting mechanism that maximizes memory bandwidth and an efficient gathering pipeline utilizing stream compaction.
Additionally, the algorithm uses GPU compute capacity to support a memory-efficient encoding table through a space-time tradeoff.
The compression task is parallelized by tiling input data and adapting the data layout.
We introduce multiple compression pipelines, each with distinct tradeoffs.
To maximize encoding kernel throughput, the design introduces sliding windows and output packing to optimize register use and maximize effective memory bandwidth.
Pipeline-level throughput is further enhanced by introducing pipelined transposition stages and stream compaction to remove intermediate padding efficiently.
We evaluate these pipelines across several benchmark datasets and compare the best-performing version against state-of-the-art GPU compression algorithms, including nvCOMP, GPULZ, and compressors generated using the LC framework.
The proposed compressor achieves a throughput of 74GB/s on an RTX4090 while maintaining compression ratios comparable to FSST.
In terms of compression ratio, it consistently outperforms ANS, Bitcomp, Cascaded, and GPULZ across all datasets.
Its overall throughput exceeds that of GPULZ and all nvCOMP compressors except ANS, Bitcomp, Cascaded, and those produced by the LC framework.
Our compressor lies on the Pareto frontier for all evaluated datasets, advancing the state-of-the-art toward ideal compression.
It achieves near-identical compression ratios to FSST (except for machine-readable datasets), while achieving a speedup of 42.06x.
Compared to multithreaded CPU compression, it achieves a 6.45x speedup.
To assess end-to-end performance, we integrate the compressor with the GSST decompressor. The resulting (de)compression pipeline achieves a combined throughput of 55GB/s, outperforming uncompressed data transfer on links with a bandwidth up to 37.5 GB/s.
It also outperforms all state-of-the-art (de)compressors when the link bandwidth ranges between 3GB/s and 20GB/s.
While further research is needed to enhance robustness and integrate the compressor into analytical engines, this work demonstrates a viable and Pareto-optimal alternative to existing string compression methods.
The source code of all our compression pipelines is publicly available on GitHub.
This work also serves as the foundation for a scientific paper that has been accepted for presentation at ADMS 2025. ...
The proposed compressor leverages several advanced CUDA techniques to optimize performance, including a voting mechanism that maximizes memory bandwidth and an efficient gathering pipeline utilizing stream compaction.
Additionally, the algorithm uses GPU compute capacity to support a memory-efficient encoding table through a space-time tradeoff.
The compression task is parallelized by tiling input data and adapting the data layout.
We introduce multiple compression pipelines, each with distinct tradeoffs.
To maximize encoding kernel throughput, the design introduces sliding windows and output packing to optimize register use and maximize effective memory bandwidth.
Pipeline-level throughput is further enhanced by introducing pipelined transposition stages and stream compaction to remove intermediate padding efficiently.
We evaluate these pipelines across several benchmark datasets and compare the best-performing version against state-of-the-art GPU compression algorithms, including nvCOMP, GPULZ, and compressors generated using the LC framework.
The proposed compressor achieves a throughput of 74GB/s on an RTX4090 while maintaining compression ratios comparable to FSST.
In terms of compression ratio, it consistently outperforms ANS, Bitcomp, Cascaded, and GPULZ across all datasets.
Its overall throughput exceeds that of GPULZ and all nvCOMP compressors except ANS, Bitcomp, Cascaded, and those produced by the LC framework.
Our compressor lies on the Pareto frontier for all evaluated datasets, advancing the state-of-the-art toward ideal compression.
It achieves near-identical compression ratios to FSST (except for machine-readable datasets), while achieving a speedup of 42.06x.
Compared to multithreaded CPU compression, it achieves a 6.45x speedup.
To assess end-to-end performance, we integrate the compressor with the GSST decompressor. The resulting (de)compression pipeline achieves a combined throughput of 55GB/s, outperforming uncompressed data transfer on links with a bandwidth up to 37.5 GB/s.
It also outperforms all state-of-the-art (de)compressors when the link bandwidth ranges between 3GB/s and 20GB/s.
While further research is needed to enhance robustness and integrate the compressor into analytical engines, this work demonstrates a viable and Pareto-optimal alternative to existing string compression methods.
The source code of all our compression pipelines is publicly available on GitHub.
This work also serves as the foundation for a scientific paper that has been accepted for presentation at ADMS 2025.
Exploration of the AMD Ryzen NPU for Real-time Signal Processing
Real-time Imaging of LOFAR Station Data
Four implementations of the algorithm were developed: three using the MLIR-AIE toolchain and one using the TINA framework. These implementations explored various parallelization and pipelining strategies to optimize performance while ensuring correctness and minimal power consumption. Experimental evaluations revealed up to a 77.4× speedup over a CPU baseline and a 2.84× speedup over a GPU implementation. Notably, three of the four implementations met the 10 Hz real-time requirement. All implementations yielded accurate results, with only minor variations due to differences in data types.
Although power consumption data for the NPU implementations was unavailable, the performance gains underscore the Ryzen NPU's potential for non-AI workloads. This thesis provides a proof of concept for DSP acceleration on the Ryzen NPU, contributes a new layer to the TINA toolchain, and offers insights for future application development. ...
Four implementations of the algorithm were developed: three using the MLIR-AIE toolchain and one using the TINA framework. These implementations explored various parallelization and pipelining strategies to optimize performance while ensuring correctness and minimal power consumption. Experimental evaluations revealed up to a 77.4× speedup over a CPU baseline and a 2.84× speedup over a GPU implementation. Notably, three of the four implementations met the 10 Hz real-time requirement. All implementations yielded accurate results, with only minor variations due to differences in data types.
Although power consumption data for the NPU implementations was unavailable, the performance gains underscore the Ryzen NPU's potential for non-AI workloads. This thesis provides a proof of concept for DSP acceleration on the Ryzen NPU, contributes a new layer to the TINA toolchain, and offers insights for future application development.
Shockwaves and Tydi-Clash
Raising the abstraction level of the Haskell HDL Clash through typed waveforms and complex streaming interfaces
A common tool in hardware design is the waveform viewer. Although Clash could already generate waveform files, these only contained binary representations of the values. Without translating these to Haskell values, they are difficult to interpret. Shockwaves was created to perform this translation. Unlike other typed waveform solutions, Shockwaves performs the translation fully in the Haskell runtime, and stores the results in lookup tables. This gives the programmer full control over the waveform representation of data. There are two methods of generating VCD files from Clash, and Shockwaves was designed to work with both. The system is fully functional for signals traced during direct simulation. The alternative approach of simulating a design after compiling it to a different HDL depends on the Clash compiler adding type annotations. This requires an overhaul of the Clash compiler beyond the scope of the project.
The second system, Tydi-Clash, is a library for the Tydi streaming specification in Clash. Tydi was designed around transferring complex data structures, and allows for multiple related streams carrying typed, multi-dimensional data. The Tydi-Clash library supports Tydi data types, physical streams, and logical stream constructs. To encourage correct usage of the streams, the internal signals are encapsulated in algebraic and abstract data types that prevent defining or accessing undefined values. Additionally, tests are supplied for behavioral restrictions. An example implementation revealed implementations using Tydi-Clash are unfortunately still a bit cumbersome, but this is believed to be solvable by adding a library of utility modules for common situations. ...
A common tool in hardware design is the waveform viewer. Although Clash could already generate waveform files, these only contained binary representations of the values. Without translating these to Haskell values, they are difficult to interpret. Shockwaves was created to perform this translation. Unlike other typed waveform solutions, Shockwaves performs the translation fully in the Haskell runtime, and stores the results in lookup tables. This gives the programmer full control over the waveform representation of data. There are two methods of generating VCD files from Clash, and Shockwaves was designed to work with both. The system is fully functional for signals traced during direct simulation. The alternative approach of simulating a design after compiling it to a different HDL depends on the Clash compiler adding type annotations. This requires an overhaul of the Clash compiler beyond the scope of the project.
The second system, Tydi-Clash, is a library for the Tydi streaming specification in Clash. Tydi was designed around transferring complex data structures, and allows for multiple related streams carrying typed, multi-dimensional data. The Tydi-Clash library supports Tydi data types, physical streams, and logical stream constructs. To encourage correct usage of the streams, the internal signals are encapsulated in algebraic and abstract data types that prevent defining or accessing undefined values. Additionally, tests are supplied for behavioral restrictions. An example implementation revealed implementations using Tydi-Clash are unfortunately still a bit cumbersome, but this is believed to be solvable by adding a library of utility modules for common situations.
For utility-scale quantum computing, we will likely need millions of qubits. To program these qubits, the complete quantum computing stack will need to be improved, since programming large numbers of qubits is not feasible with current quantum programming languages.
In this dissertation, we present our new unitary decomposition algorithm, which is used to decompose arbitrary unitary matrices into a sequence of quantum gates that can be executed on a quantum computer. Our method results in 5% less CNOT gates than the previous state-of-the-art and can be used to decompose an arbitrary 3-qubit gate into at most 19 CNOT gates.
Unitary decomposition is an essential part of some quantum algorithms, and can be used as an optimization method for (parts) of quantum circuits. Efficient implementation of unitary decomposition allows for the translation of bigger input matrices into elementary quantum operations, which is key to executing these algorithms on existing quantum computers.
With the implementation of unitary decomposition in quantum programming framework OpenQL, we show how the structure of the input or intermediate matrices can be used to minimize the number of output gates and to minimize the runtime of the decomposition. Our implementation is 10 to 500 times as fast as the decomposition methods of the UniversalQCompiler and Qubiter.
With hybrid classical-quantum algorithms, even near-term quantum devices may be able to outperform classical computers. Hybrid algorithms, such as variational quantum eigensolvers, are iterative processes, and use a classical optimizer to update a parameterized quantum circuit. Each iteration, the circuit is executed on a physical quantum processor or a simulator, and the average of the measurement results is passed back to the classical optimizer. When many iterations are performed, the quantum program is recompiled many times.
We have implemented explicit parameters that prevent recompilation of hybrid programs in OpenQL, called OpenQLpc. These parameters reduce the compile time, and therefore improve the total runtime for hybrid algorithms. We have compared the execution of the MAXCUT benchmark in OpenQL with the execution of the same benchmark in PyQuil and Qiskit, which shows that the efficient handling of parameterized circuits in OpenQLpc results in up to 70% reduction in total compilation time and a reduced total execution time. With OpenQLpc, compilation of hybrid algorithms is also faster than either PyQuil or Qiskit.
In a collaboration with BMW and Entropica, we have developed a quantum algorithm for industrial shift scheduling (QISS), which uses Grover's adaptive search to tackle a common and important class of valuable, real-world combinatorial optimization problems.
We show how QISS can be used to find the optimal schedule for n days out of a solution space of size N = 4^(2n). The optimal solution is reached in 99% of cases within sqrt(N) = 4^n applications of Grover's oracle, which requires a total of 11n +9 + log2(19n) qubits for scheduling n days. We show the explicit construction of the Grover's oracle, incorporating the multiple constraints and detail the corresponding logical-level resource requirements. Further, we simulate the application of QISS for small-scale problem instances to corroborate the performance of the algorithm.
Our work shows how complex real-world industrial optimization problems can be formulated in the context of Grover's algorithm.
Using QISS, we then used open-source tools to estimate the quantum resources required for execution of this algorithm. We used qubit models based on current technology, as well as theoretical high-fidelity scenarios for superconducting qubit platforms. We find that the overall computational runtime is more strongly influenced by the execution time of gate and measurement operations than by system error rates. We find that achieving quantum utility would not only require low system error rates (10^(-6) or better), but also measurement operations with an execution time below 10 ns. This rules out the possibility of near-term quantum utility for this use-case, and suggests that significant technological or algorithmic progress will be needed before quantum utility can be achieved.
The research in this dissertation allows us to answer our main research question:
How can we make the quantum computing stack ready for utility-scale quantum computing?
For the quantum stack to be ready for utility-scale quantum computing, several major improvements will need to be made to prepare for programming and compiling circuits with millions of qubits.
* We will need high-level abstractions that will speed up programming of quantum computers, allow for (easier) debugging and will allow for programming millions of qubits.
* The classical component of the compilation and compute of (hybrid) quantum algorithms will need to be improved.
* More algorithms for real-world use-cases will need to be developed, which will provide a basis for improvements across the quantum stack that will lead to quantum utility.
* We need to do quantum resource estimation for real use-cases, in order to have insights into what utility-scale quantum computing will look like. ...
For utility-scale quantum computing, we will likely need millions of qubits. To program these qubits, the complete quantum computing stack will need to be improved, since programming large numbers of qubits is not feasible with current quantum programming languages.
In this dissertation, we present our new unitary decomposition algorithm, which is used to decompose arbitrary unitary matrices into a sequence of quantum gates that can be executed on a quantum computer. Our method results in 5% less CNOT gates than the previous state-of-the-art and can be used to decompose an arbitrary 3-qubit gate into at most 19 CNOT gates.
Unitary decomposition is an essential part of some quantum algorithms, and can be used as an optimization method for (parts) of quantum circuits. Efficient implementation of unitary decomposition allows for the translation of bigger input matrices into elementary quantum operations, which is key to executing these algorithms on existing quantum computers.
With the implementation of unitary decomposition in quantum programming framework OpenQL, we show how the structure of the input or intermediate matrices can be used to minimize the number of output gates and to minimize the runtime of the decomposition. Our implementation is 10 to 500 times as fast as the decomposition methods of the UniversalQCompiler and Qubiter.
With hybrid classical-quantum algorithms, even near-term quantum devices may be able to outperform classical computers. Hybrid algorithms, such as variational quantum eigensolvers, are iterative processes, and use a classical optimizer to update a parameterized quantum circuit. Each iteration, the circuit is executed on a physical quantum processor or a simulator, and the average of the measurement results is passed back to the classical optimizer. When many iterations are performed, the quantum program is recompiled many times.
We have implemented explicit parameters that prevent recompilation of hybrid programs in OpenQL, called OpenQLpc. These parameters reduce the compile time, and therefore improve the total runtime for hybrid algorithms. We have compared the execution of the MAXCUT benchmark in OpenQL with the execution of the same benchmark in PyQuil and Qiskit, which shows that the efficient handling of parameterized circuits in OpenQLpc results in up to 70% reduction in total compilation time and a reduced total execution time. With OpenQLpc, compilation of hybrid algorithms is also faster than either PyQuil or Qiskit.
In a collaboration with BMW and Entropica, we have developed a quantum algorithm for industrial shift scheduling (QISS), which uses Grover's adaptive search to tackle a common and important class of valuable, real-world combinatorial optimization problems.
We show how QISS can be used to find the optimal schedule for n days out of a solution space of size N = 4^(2n). The optimal solution is reached in 99% of cases within sqrt(N) = 4^n applications of Grover's oracle, which requires a total of 11n +9 + log2(19n) qubits for scheduling n days. We show the explicit construction of the Grover's oracle, incorporating the multiple constraints and detail the corresponding logical-level resource requirements. Further, we simulate the application of QISS for small-scale problem instances to corroborate the performance of the algorithm.
Our work shows how complex real-world industrial optimization problems can be formulated in the context of Grover's algorithm.
Using QISS, we then used open-source tools to estimate the quantum resources required for execution of this algorithm. We used qubit models based on current technology, as well as theoretical high-fidelity scenarios for superconducting qubit platforms. We find that the overall computational runtime is more strongly influenced by the execution time of gate and measurement operations than by system error rates. We find that achieving quantum utility would not only require low system error rates (10^(-6) or better), but also measurement operations with an execution time below 10 ns. This rules out the possibility of near-term quantum utility for this use-case, and suggests that significant technological or algorithmic progress will be needed before quantum utility can be achieved.
The research in this dissertation allows us to answer our main research question:
How can we make the quantum computing stack ready for utility-scale quantum computing?
For the quantum stack to be ready for utility-scale quantum computing, several major improvements will need to be made to prepare for programming and compiling circuits with millions of qubits.
* We will need high-level abstractions that will speed up programming of quantum computers, allow for (easier) debugging and will allow for programming millions of qubits.
* The classical component of the compilation and compute of (hybrid) quantum algorithms will need to be improved.
* More algorithms for real-world use-cases will need to be developed, which will provide a basis for improvements across the quantum stack that will lead to quantum utility.
* We need to do quantum resource estimation for real use-cases, in order to have insights into what utility-scale quantum computing will look like.
Tywaves
A Typed Waveform Viewer for Chisel HDL with typed circuit components and Tydi streams
This thesis presents Tywaves, a new kind of type-centered waveform viewer for the Chisel hardware language with typed circuit components and Tydi streams. Contributions to both the Chisel library and CIRCT MLIR compiler are described. Type information for debugging is extracted from the source language and linked with the target Verilog. A frontend waveform viewer is updated with the functionality to interpret and associate type information with values dumped from an RTL simulator and reconstruct the source language view. Finally, a Chisel API has been implemented to enable Tywaves from a high-level testbench.
The Tywaves project aims to enhance the debugging experience of modern hardware languages by reducing the gap between the source code and waveforms. It provides a new type-centered debugging format that helps to bring the same level of abstraction of new languages into waveform viewers. ...
This thesis presents Tywaves, a new kind of type-centered waveform viewer for the Chisel hardware language with typed circuit components and Tydi streams. Contributions to both the Chisel library and CIRCT MLIR compiler are described. Type information for debugging is extracted from the source language and linked with the target Verilog. A frontend waveform viewer is updated with the functionality to interpret and associate type information with values dumped from an RTL simulator and reconstruct the source language view. Finally, a Chisel API has been implemented to enable Tywaves from a high-level testbench.
The Tywaves project aims to enhance the debugging experience of modern hardware languages by reducing the gap between the source code and waveforms. It provides a new type-centered debugging format that helps to bring the same level of abstraction of new languages into waveform viewers.
This thesis addresses the large computational demands for high accuracy nanopore basecalling of nanopore reads. Bonito, ONT's research basecaller, and other basecallers use DNNs at their core. The five Long Short-Term Memory (LSTM) layers used by the basecaller are the primary bottleneck to more efficient basecalling, taking almost 90% of the whole model's execution time when basecalling a single read. To alleviate this bottleneck, three approaches are investigated: pruning, model architecture, and quantization. Preliminary results show that pruning is the most impactful approach and has not successfully been used in previous work.
We propose learning structured sparsity using a delayed masking penalty scheduler. By adapting and improving on previous work, each LSTM layer is able to learn its optimal size during training, simultaneously with learning to basecall accurately. The method is optimized for the basecalling application and can be generalized to other tasks. We find that the required number of computations in the LSTM layers can be significantly reduced by up to 21 times with a reduction in match rate of just 1.3% compared to the high accuracy Bonito model. Furthermore, the newly introduced penalty parameter can be tuned to find the optimal trade-off between compute and accuracy for users' requirements.
The results indicate that state-of-the-art basecalling models are overparameterized and that their size can be reduced drastically without significantly affecting accuracy. Future work is suggested to investigate the benefits of pruning the whole model, and to assess the feasibility of combining pruning with advanced quantization methods. This work helps increase the accessibility of nanopore DNA sequencing, broadening the reach and impact of this technology. ...
This thesis addresses the large computational demands for high accuracy nanopore basecalling of nanopore reads. Bonito, ONT's research basecaller, and other basecallers use DNNs at their core. The five Long Short-Term Memory (LSTM) layers used by the basecaller are the primary bottleneck to more efficient basecalling, taking almost 90% of the whole model's execution time when basecalling a single read. To alleviate this bottleneck, three approaches are investigated: pruning, model architecture, and quantization. Preliminary results show that pruning is the most impactful approach and has not successfully been used in previous work.
We propose learning structured sparsity using a delayed masking penalty scheduler. By adapting and improving on previous work, each LSTM layer is able to learn its optimal size during training, simultaneously with learning to basecall accurately. The method is optimized for the basecalling application and can be generalized to other tasks. We find that the required number of computations in the LSTM layers can be significantly reduced by up to 21 times with a reduction in match rate of just 1.3% compared to the high accuracy Bonito model. Furthermore, the newly introduced penalty parameter can be tuned to find the optimal trade-off between compute and accuracy for users' requirements.
The results indicate that state-of-the-art basecalling models are overparameterized and that their size can be reduced drastically without significantly affecting accuracy. Future work is suggested to investigate the benefits of pruning the whole model, and to assess the feasibility of combining pruning with advanced quantization methods. This work helps increase the accessibility of nanopore DNA sequencing, broadening the reach and impact of this technology.
Zero-serialization, Zero-copy memory pooling in compute clusters
Disaggregated memory made accessible
ThymesisFlow is designed for the situation where a borrower is able access a lender's memory, and the lender not accessing that borrowed memory. Coherency problems arise in the case where both a lender of memory, as well as a borrower of memory write to the lender's memory.
This thesis proposes the use of the Apache Arrow in-memory data format to not only access memory in a near coherent fashion, but in a fully coherent fashion. This will allow compute clusters to more efficiently use memory resources, allow for applications to dynamically hotplug memory, and allow for data sharing without copying over ethernet connection.
The protocols devised in this thesis are able to create disaggregated Arrow objects, which are readable by all nodes in a cluster in a coherent fashion. The creation of these coherent disaggregated objects is the only performance penalty in making them coherent, after initialization all nodes use their local CPU caches to cache remote objects.
A working proof-of-concept has been created which is able to share Apache Arrow objects stored in the memory of a single node. It is also possible to create Arrow objects which span the memory of multiple nodes, allowing for objects bigger than the memory of a single node. The proof-of-concept was able to be run thanks to the setup provided by the Hasso Plattner Institute. ...
ThymesisFlow is designed for the situation where a borrower is able access a lender's memory, and the lender not accessing that borrowed memory. Coherency problems arise in the case where both a lender of memory, as well as a borrower of memory write to the lender's memory.
This thesis proposes the use of the Apache Arrow in-memory data format to not only access memory in a near coherent fashion, but in a fully coherent fashion. This will allow compute clusters to more efficiently use memory resources, allow for applications to dynamically hotplug memory, and allow for data sharing without copying over ethernet connection.
The protocols devised in this thesis are able to create disaggregated Arrow objects, which are readable by all nodes in a cluster in a coherent fashion. The creation of these coherent disaggregated objects is the only performance penalty in making them coherent, after initialization all nodes use their local CPU caches to cache remote objects.
A working proof-of-concept has been created which is able to share Apache Arrow objects stored in the memory of a single node. It is also possible to create Arrow objects which span the memory of multiple nodes, allowing for objects bigger than the memory of a single node. The proof-of-concept was able to be run thanks to the setup provided by the Hasso Plattner Institute.
The main problem this thesis addresses is the implementation of an accelerated hardware solution for the compute-intensive process of basecalling long-read sequences. The thesis presents an FPGA-based implementation of the computationally demanding Long Short-Term Memory (LSTM) layers within the basecalling network known as Bonito. However, due to the lack of floating-point arithmetic units available on the FPGA, the FPGA implementation could not achieve competitive performance compared to GPUs.
While the FPGA implementation falls short of GPU performance, it serves as a possible stepping stone toward developing an ASIC solution for implementing the Bonito LSTM layers or potentially implementing the entire Bonito model. An ASIC implementation has the potential for superior performance up to 9 times faster than a GPU implementation while additionally being cost-effective. This suggests that ASICs hold promise as a future direction for accelerating long-read sequence basecalling, allowing for faster and more affordable genomics research. ...
The main problem this thesis addresses is the implementation of an accelerated hardware solution for the compute-intensive process of basecalling long-read sequences. The thesis presents an FPGA-based implementation of the computationally demanding Long Short-Term Memory (LSTM) layers within the basecalling network known as Bonito. However, due to the lack of floating-point arithmetic units available on the FPGA, the FPGA implementation could not achieve competitive performance compared to GPUs.
While the FPGA implementation falls short of GPU performance, it serves as a possible stepping stone toward developing an ASIC solution for implementing the Bonito LSTM layers or potentially implementing the entire Bonito model. An ASIC implementation has the potential for superior performance up to 9 times faster than a GPU implementation while additionally being cost-effective. This suggests that ASICs hold promise as a future direction for accelerating long-read sequence basecalling, allowing for faster and more affordable genomics research.
Tydi-Chisel
Collaborative and Interface-Driven Data-Streaming Accelerator Design
In this thesis, the Tydi-Chisel library is presented along with an A-to-Z design-process description for data-streaming accelerators. A stream-interface solution is presented that offers both compatibility with Tydi in traditional HDLs and maximum utility within Chisel through two intercompatible representations. In addition, design complexity is reduced through novel utilities like stream-complexity conversion, developed to alleviate interface specification mismatches between components. Using the presented toolchain and library, the amount of code required to specify Tydi interfaces for representative use-cases can be reduced several times compared to a Verilog description, while offering increased utility.
Tydi-Chisel aims to simplify the design of data-streaming accelerators through the integration of the Tydi interface standard in Chisel, along with helper components, syntax sugar, and verification tools. In combination Chisel and Tydi help bridge the hardware-software divide, making solo-design and collaboration between designers easier. ...
In this thesis, the Tydi-Chisel library is presented along with an A-to-Z design-process description for data-streaming accelerators. A stream-interface solution is presented that offers both compatibility with Tydi in traditional HDLs and maximum utility within Chisel through two intercompatible representations. In addition, design complexity is reduced through novel utilities like stream-complexity conversion, developed to alleviate interface specification mismatches between components. Using the presented toolchain and library, the amount of code required to specify Tydi interfaces for representative use-cases can be reduced several times compared to a Verilog description, while offering increased utility.
Tydi-Chisel aims to simplify the design of data-streaming accelerators through the integration of the Tydi interface standard in Chisel, along with helper components, syntax sugar, and verification tools. In combination Chisel and Tydi help bridge the hardware-software divide, making solo-design and collaboration between designers easier.
A Toolchain for Streaming Dataflow Accelerator Designs for Big Data Analytics
Defining an IR for Composable Typed Streaming Dataflow Designs
In this thesis, an open-source intermediate representation (IR) is introduced which allows for the declaration of Tydi's types. The IR enables creating and connecting components with Tydi Streams as interfaces, called Streamlets. It also lets backends for synthesis and simulation retain high-level information, such as documentation. Types and Streamlets can be easily reused between multiple projects, and Tydi’s streams and type hierarchy can be used to define interface contracts, which aid collaboration when designing a larger system.
The IR codifies the rules and properties established in the Tydi specification and serves to complement computation-oriented hardware design tools with a data-centric view on interfaces. To support different backends and targets, the IR is focused on expressing interfaces, and complements behavior described by hardware description languages and other IRs. Additionally, a testing syntax for the verification of inputs and outputs against abstract streams of data, and for substituting interdependent components, is presented which allows for the specification of behavior.
To demonstrate this IR, a grammar, parser, and query system have been created, and paired with a backend targeting VHDL. ...
In this thesis, an open-source intermediate representation (IR) is introduced which allows for the declaration of Tydi's types. The IR enables creating and connecting components with Tydi Streams as interfaces, called Streamlets. It also lets backends for synthesis and simulation retain high-level information, such as documentation. Types and Streamlets can be easily reused between multiple projects, and Tydi’s streams and type hierarchy can be used to define interface contracts, which aid collaboration when designing a larger system.
The IR codifies the rules and properties established in the Tydi specification and serves to complement computation-oriented hardware design tools with a data-centric view on interfaces. To support different backends and targets, the IR is focused on expressing interfaces, and complements behavior described by hardware description languages and other IRs. Additionally, a testing syntax for the verification of inputs and outputs against abstract streams of data, and for substituting interdependent components, is presented which allows for the specification of behavior.
To demonstrate this IR, a grammar, parser, and query system have been created, and paired with a backend targeting VHDL.
Tydi-lang: a language for typed streaming hardware
A manual for future Tydi-lang compiler developers
The Tydi specification (Tydi-spec) was proposed to address the above issues by codifying the composite and variable-length data structures in a type and providing a standard protocol to transfer typed data among hardware components. The Tydi intermediate representation (Tydi-IR) extends the Tydi-spec by defining typed interfaces, typed components, and connections among typed components.
In this paper, we propose Tydi-lang, a high-level hardware description language (HDL) for streaming designs. The language incorporates Tydi-spec to describe typed streams and provides templates to describe abstract reusable components. We also implement an open-source compiler from Tydi-lang to Tydi-IR. We leverage a Tydi-IR to VHDL compiler, and also present a simulator blueprint to identify streaming bottlenecks. We show several Tydi-lang examples to translate high-level SQL to VHDL to demonstrate that Tydi-lang can efficiently raise the level of abstraction and reduce design effort. ...
The Tydi specification (Tydi-spec) was proposed to address the above issues by codifying the composite and variable-length data structures in a type and providing a standard protocol to transfer typed data among hardware components. The Tydi intermediate representation (Tydi-IR) extends the Tydi-spec by defining typed interfaces, typed components, and connections among typed components.
In this paper, we propose Tydi-lang, a high-level hardware description language (HDL) for streaming designs. The language incorporates Tydi-spec to describe typed streams and provides templates to describe abstract reusable components. We also implement an open-source compiler from Tydi-lang to Tydi-IR. We leverage a Tydi-IR to VHDL compiler, and also present a simulator blueprint to identify streaming bottlenecks. We show several Tydi-lang examples to translate high-level SQL to VHDL to demonstrate that Tydi-lang can efficiently raise the level of abstraction and reduce design effort.
To improve the scalability and performance optimizations of genome variant calling analysis workflows on modern computing systems, in this dissertation four potential research directions have been selected for further exploration. First, to exploit the performance of modern processors hardware features like multi-core and vector units on the GATK best practices variant calling pipelines, we introduce ArrowSAM, a columnar inmemory data format to place and process genomics data in-memory thus removing the need for repeated file storage accesses in intermediate variant calling pipeline applications. Our second contribution focuses on integration of the Apache Arrow based columnar in-memory data format in the PySpark API to enable exploiting the benefits of vectorized operations in the Python language using user-defined functions on Spark dataframes. For our third research contribution, we tested and benchmarked both the scalability and performance of Arrow Flight for client-server as well as cluster scaled communication.For our final research contribution reported in this dissertation, we implemented an orthogonal approach that is even more scalable than Apache Spark and Arrow Flight based solutions and offers flexibility to use many different variant callers. ...
To improve the scalability and performance optimizations of genome variant calling analysis workflows on modern computing systems, in this dissertation four potential research directions have been selected for further exploration. First, to exploit the performance of modern processors hardware features like multi-core and vector units on the GATK best practices variant calling pipelines, we introduce ArrowSAM, a columnar inmemory data format to place and process genomics data in-memory thus removing the need for repeated file storage accesses in intermediate variant calling pipeline applications. Our second contribution focuses on integration of the Apache Arrow based columnar in-memory data format in the PySpark API to enable exploiting the benefits of vectorized operations in the Python language using user-defined functions on Spark dataframes. For our third research contribution, we tested and benchmarked both the scalability and performance of Arrow Flight for client-server as well as cluster scaled communication.For our final research contribution reported in this dissertation, we implemented an orthogonal approach that is even more scalable than Apache Spark and Arrow Flight based solutions and offers flexibility to use many different variant callers.