Z. Al-Ars

118 records found

Journal article (2025) - T. Ahmad, J. Schuchart, Z. Al Ars, C. Niethammer, J. Gracia, H. P. Hofstee
Rapid technological advancements in sequencing technologies allow producing cost-effective, high-volume sequencing data. Processing this data for real-time clinical diagnosis is potentially time-consuming if done on a single computing node. This work presents a complete variant calling workflow, implemented using the Message Passing Interface (MPI) to leverage the benefits of high bandwidth interconnects. This solution (GenMPI) is portable and flexible, meaning it can be deployed to any private or public cluster/cloud infrastructure. Any alignment or variant calling application can be used with minimal adaptation. To achieve high performance, compressed input data can be streamed in parallel to alignment applications while uncompressed data can use internal file seek functionality to eliminate the bottleneck of streaming input data from a single node. Alignment output can be directly stored in multiple chromosome-specific SAM files or a single SAM file. After alignment, a distributed queue using MPI RMA (Remote Memory Access) atomic operations is created for sorting, indexing, marking of duplicates (if necessary) and variant calling applications. We ensure the accuracy of variants as compared to the original single node methods. We also show that for 300x coverage data, alignment scales almost linearly up to 64 nodes (8192 CPU cores). Overall, this work outperforms existing Big Data based workflows by a factor of two and is almost 20% faster than other MPI-based implementations for alignment without any extra memory overheads. Sorting, indexing, duplicate removal and variant calling are also scalable up to an 8-node cluster. For paired-end short-read (Illumina) data, we integrated the BWA-MEM aligner and three variant callers (GATK HaplotypeCaller, DeepVariant and Octopus), while for long-read data, we integrated the Minimap2 aligner and three different variant callers (DeepVariant, DeepVariant with WhatsHap for phasing (PacBio) and Clair3 (ONT)). ...
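The post-alignment work queue mentioned in this abstract can be pictured as workers claiming task indices from a shared counter with an atomic fetch-and-add. The sketch below is only an illustration of that pattern, not GenMPI's code: it swaps MPI for Python threads, with a lock-protected counter standing in for an RMA atomic such as `MPI_Fetch_and_op`. The class `FetchAndAddCounter`, the function `run_workers`, and the file names are hypothetical.

```python
import threading


class FetchAndAddCounter:
    """Thread-based stand-in for a counter updated by an atomic fetch-and-add.

    Assumption: in an MPI setting this counter would live in an RMA window;
    here a lock provides the atomicity.
    """

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_and_add(self, amount=1):
        # Return the old value and increment, as one indivisible step.
        with self._lock:
            old = self._value
            self._value += amount
            return old


def run_workers(tasks, num_workers=4):
    """Let each worker repeatedly claim the next unclaimed task index."""
    counter = FetchAndAddCounter()
    claimed_by = [[] for _ in range(num_workers)]

    def worker(rank):
        while True:
            idx = counter.fetch_and_add()
            if idx >= len(tasks):
                break  # queue drained
            # Placeholder for sorting, indexing, or variant calling.
            claimed_by[rank].append(tasks[idx])

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed_by


# Hypothetical chromosome-specific SAM files as the task list.
tasks = [f"chr{i}.sam" for i in range(1, 23)]
assignment = run_workers(tasks)
claimed = sorted(t for per_worker in assignment for t in per_worker)
print(len(claimed), "tasks claimed, each exactly once:", claimed == sorted(tasks))
```

Because every index is handed out once, no task is processed twice and faster workers naturally pick up more tasks, which is the load-balancing property such a queue is meant to provide.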
Conference paper (2025) - Sam Aanhane, P. Knops, R. de Jong, C. Cromjongh, Z. Al-Ars
Synthetic image generation involves the creation of artificially generated images that are indistinguishable from real ones. Conventional simulation-based image synthesis approaches suffer from intensive computational and memory throughput demands associated with physically accurate ray tracing through volumetric datasets. In this work, we propose an FPGA-based accelerator architecture capable of handling the computations required to simulate physically accurate X-ray images in real time. In addition, an algorithm is developed that can calculate the path of an X-ray through a phantom representing a physical model. To ensure real-time performance, a parallel accelerator architecture is proposed using a chain of accelerator kernels combined with High Bandwidth Memory architecture, which can simulate many rays concurrently, addressing the computational and memory throughput demands associated with simulation-based X-ray image generation. Performance evaluation of the simulation on an AMD Alveo U50 Data Accelerator card shows that an average speed-up of 12x over CPU-based implementations is possible, and allows for real-time image synthesis at a frame rate of 60 images/s. These findings highlight the advantages of FPGA acceleration for deterministic, high-speed synthetic image generation. ...

Parallel string decompression at 191 GB/s on GPU

Conference paper (2025) - Robin Vonk, Joost Hoozemans, Zaid Al-Ars
Most of the commonly used compression standards make use of some form of the LZ algorithm. Decompressing this type of data is not a good match for the Single-Instruction, Multiple Thread (SIMT) model of computation used by GPUs, resulting in low throughput and poor utilization of the GPU parallel compute capabilities. In this paper, we introduce GSST, a GPU-optimized version of the FSST compression algorithm, which targets string compression. The optimizations proposed in this paper make the algorithm particularly suitable for GPUs, which allows it to achieve a significantly better tradeoff for decompression throughput vs compression ratio as compared to the state of the art. Our results show that the new algorithm pushes the Pareto curve closer towards the ideal region, completely dominating LZ-based compressors in the nvCOMP library (LZ4, Snappy, GDeflate). GSST provides a compression ratio of 2.74x and achieves a throughput of 191 GB/s on an A100 GPU. ...
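To make the contrast with LZ-style decoding concrete, here is a minimal sketch of FSST-style decompression, the scheme GSST builds on: each code byte indexes a static table of short symbols, and a reserved escape code marks a literal byte. This is not GSST's GPU implementation, whose data layout is not described in the abstract; the function `decode` and the tiny symbol table are hypothetical.

```python
ESCAPE = 255  # FSST reserves one code to mean "the next byte is a literal"


def decode(codes: bytes, symbols: list[bytes]) -> bytes:
    """Expand a stream of one-byte codes using a static symbol table."""
    out = bytearray()
    i = 0
    while i < len(codes):
        code = codes[i]
        if code == ESCAPE:
            # Escaped literal: copy the following byte unchanged.
            out.append(codes[i + 1])
            i += 2
        else:
            # Table lookup: one code byte expands to a multi-byte symbol.
            out += symbols[code]
            i += 1
    return bytes(out)


# Hypothetical symbol table; a real one is learned from the data.
symbols = [b"http://", b"www.", b".com"]
compressed = bytes([0, 1, ESCAPE, ord("x"), 2])
print(decode(compressed, symbols))  # b'http://www.x.com'
```

Decoding a code never refers back to earlier output, unlike LZ back-references, so independent chunks of the code stream can be expanded without waiting on each other.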
Conference paper (2025) - J. Brand, C. Cromjongh, H. P. Hofstee, Z. Al-Ars
While modern HDLs such as Chisel (Constructing Hardware In a Scala Embedded Language) significantly improve the process of design entry, debugging these designs is often problematic, because the tools that aid debugging operate on translated code rather than the original HDL. Furthermore, engineers often resort to manual waveform debugging, undermining productivity gains promised by such a language. We present ChiselTrace, an open-source tool for Chisel that is capable of (dynamic) program slicing and automatic signal dependency tracing, allowing faults to be more easily traced back to their root cause. Where prior work focuses on data-flow analysis at the (compiled) Verilog level, ChiselTrace functions at the Chisel source level. Contributions include: modifications to the Chisel library to enable post-simulation analysis; a library capable of dynamic program slicing and dependence graph generation; and a front-end dependency graph viewer. We demonstrate debugging capabilities by tracing an injected fault in the ChiselWatt processor back to the source. We observe that using ChiselTrace's dynamic program dependence graph, the number of lines of code relevant to the fault path is reduced significantly. Project repository: https://github.com/jarlb/chiseltrace ...
Conference paper (2025) - P. V. Nembhani, O. Rhodes, M. Sifalakis, Z. Al-Ars, A. Yousefzadeh, G. Tang, A. F. Dobrita, Y. Xu, K. Vadivel, K. Shidqi, P. Detterer, M. Konijnenburg, G. -J. van Schaik
This paper introduces SENMap, a mapping and synthesis tool for scalable, energy-efficient neuromorphic computing architecture frameworks. SENECA is a flexible architectural design optimized for executing edge AI SNN/ANN inference applications efficiently. To speed up the silicon tapeout and chip design for SENECA, an accurate emulator, SENSIM, was designed. While SENSIM supports direct mapping of SNNs on neuromorphic architectures, as SNNs/ANNs grow in size, achieving optimal mapping for objectives like energy, throughput, area, and accuracy becomes challenging. This paper introduces SENMap, flexible mapping software for efficiently mapping large SNN/ANN applications onto adaptable architectures. SENMap considers architectural, pretrained SNN/ANN realistic examples, and event rate-based parameters and is open-sourced along with SENSIM to aid flexible neuromorphic chip design before fabrication. Experimental results show SENMap enables 40 percent energy improvements for a baseline SENSIM operating in the timestep asynchronous mode of operation. SENMap is designed in such a way that it facilitates mapping large spiking neural networks for future modifications as well. ...

Parallel string decompression at 191 GB/s on GPU

Journal article (2025) - Robin Vonk, Joost Hoozemans, Zaid Al-Ars
Most of the commonly used compression standards make use of some form of the LZ algorithm. Decompressing this type of data is not a good match for the Single-Instruction, Multiple Thread (SIMT) model of computation used by GPUs, resulting in low throughput and poor utilization of the GPU parallel compute capabilities. In this paper, we introduce GSST, a GPU-optimized version of the FSST compression algorithm, which targets string compression. The optimizations proposed in this paper make the algorithm particularly suitable for GPUs, which allows it to achieve a significantly better tradeoff for decompression throughput vs compression ratio as compared to the state of the art. Our results show that the new algorithm pushes the Pareto curve closer towards the ideal region, completely dominating LZ-based compressors in the nvCOMP library (LZ4, Snappy, GDeflate). GSST provides a compression ratio of 2.74x and achieves a throughput of 191 GB/s on an A100 GPU. ...

Dataflow Component Interfaces with Tydi-Chisel

As dedicated hardware is becoming more prevalent in accelerating complex applications, methods are needed to enable easy integration of multiple hardware components into a single accelerator system. However, this vision of composable hardware is hindered by the lack of standards for interfaces that allow such components to communicate. To address this challenge, the Tydi standard was proposed to facilitate the representation of streaming data in digital circuits, notably providing interface specifications of composite and variable-length data structures. At the same time, constructing hardware in a Scala embedded language (Chisel) provides a suitable environment for deploying Tydi-centric components due to its abstraction level and customizability. This article introduces Tydi-Chisel, a library that integrates the Tydi standard within Chisel, along with a toolchain and methodology for designing data-streaming accelerators. This toolchain reduces the effort needed to design streaming hardware accelerators by raising the abstraction level for streams and module interfaces, thereby avoiding boilerplate code, and allows for easy integration of accelerator components from different designers. This is demonstrated through an example project incorporating various scenarios where the interface-related declaration is reduced by 6-14 times. The Tydi-Chisel project repository is available at https://github.com/abs-tudelft/Tydi-Chisel. ...
Conference paper (2024) - Christiaan Boerkamp, Steven van der Vlugt, Zaid Al-Ars
This paper introduces TINA, a novel framework for implementing non-Neural Network (NN) signal processing algorithms on NN accelerators such as GPUs, TPUs or FPGAs. The key to this approach is the concept of mapping mathematical and logic functions as a series of convolutional and fully connected layers. By mapping functions into such a small sub stack of NN layers, it becomes possible to execute non-NN algorithms on NN hardware (HW) accelerators efficiently, as well as to ensure the portability of TINA implementations to any platform that supports such NN accelerators. Results show that TINA is highly competitive vs alternative frameworks, specifically for complex functions with iterations. For a Polyphase Filter Bank use case TINA shows GPU speedups of up to 80x vs a CPU baseline with NumPy compared to 8x speedup achieved by alternative frameworks. The framework is open source and publicly available at https://github.com/ChristiaanBoe/TINA. ...

Circuit construction for n-qubit gates based on block-ZXZ decomposition

Journal article (2024) - Anna M. Krol, Zaid Al-Ars
This paper proposes an optimized quantum block-ZXZ decomposition method that results in more optimal quantum circuits than the quantum Shannon decomposition, which was presented in 2005 by M. Möttönen and J. J. Vartiainen [in Trends in quantum computing research, edited by S. Shannon (Nova Science Publishers, 2006) Chap. 7, p. 149, arXiv:quant-ph/0504100]. The decomposition is applied recursively to generic quantum gates, and can take advantage of existing and future small-circuit optimizations. Because our method uses only single-qubit gates and uniformly controlled rotation-Z gates, it can easily be adapted to use other types of multi-qubit gates. With the proposed decomposition, a general three-qubit gate can be decomposed using 19 CNOT gates (rather than 20). For general n-qubit gates, the proposed decomposition generates circuits that have $\frac{22}{48}4^n - \frac{3}{2}2^n + \frac{5}{3}$ CNOT gates, which is less than the best-known exact decomposition algorithm by $(4^{n-2}-1)/3$ CNOT gates. ...
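As a quick check, the gate-count expression in this abstract, read as (22/48)·4^n − (3/2)·2^n + 5/3 with a saving of (4^(n−2) − 1)/3, reproduces the quoted three-qubit figures of 19 CNOTs instead of 20. The short script below evaluates both expressions exactly; the function names are illustrative, and the comparison column is simply the sum of the two expressions, not a separately sourced formula.

```python
from fractions import Fraction


def block_zxz_cnots(n: int) -> Fraction:
    """CNOT count of the block-ZXZ decomposition for a general n-qubit gate."""
    return Fraction(22, 48) * 4**n - Fraction(3, 2) * 2**n + Fraction(5, 3)


def savings(n: int) -> Fraction:
    """CNOTs saved relative to the best-known exact decomposition (n >= 2)."""
    return Fraction(4 ** (n - 2) - 1, 3)


for n in range(2, 7):
    ours = block_zxz_cnots(n)
    saved = savings(n)
    # "previous" is derived as ours + saved, an inference from the abstract.
    print(f"n={n}: block-ZXZ={ours}, saved={saved}, previous={ours + saved}")
```

For n = 3 this gives 19 CNOTs with a saving of 1, matching the abstract. The saving grows as 4^(n−2)/3, while both counts keep the same leading 4^n scaling.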
Journal article (2024) - Mengfei Ji, Zaid Al-Ars, Yuchun Chang, Baolin Zhang
In this paper, we present a fully pipelined and semi-parallel channel convolutional neural network hardware accelerator structure. This structure can trade off the compute time and the hardware utilization, allowing the accelerator to be layer pipelined without the need for fully parallelizing the input and output channels. A parallel strategy is applied to reduce the time gap in transferring the output results between different layers. The parallelism can be decided based on the hardware resources on the target FPGA. We use this structure to implement a binary ResNet18 based on the neural architecture search strategy, which can increase the accuracy of manually designed binary convolutional neural networks. Our optimized binary ResNet18 can achieve a Top-1 accuracy of 60.5% on the ImageNet dataset. We deploy this ResNet18 hardware implementation on an Alphadata 9H7 FPGA, connected with an OpenCAPI interface, to demonstrate the hardware capabilities. Depending on the amount of parallelism used, the latency can range from 1.12 to 6.33 ms, with a corresponding throughput of 4.56 to 0.71 TOPS for different hardware utilization, with a 200 MHz clock frequency. Our best latency is 8× lower and our best throughput is 1.9× higher compared to the best previous works. The code for our implementation is open-source and publicly available on GitHub at https://github.com/MFJI/NASBRESNET. ...

An Event-driven Parallel Simulator for Multi-core Neuromorphic Systems

Conference paper (2024) - Prithvish Nembhani, Kanishkan Vadivel, Guangzhi Tang, Mohammad Tahghighi, Gert Jan Van Schaik, Manolis Sifalakis, Zaid Al-Ars, Amirreza Yousefzadeh
In this paper, we present SENSIM, which is an open-source simulator designed specifically for the SENECA neuromorphic processor. This simulator is unique in that it combines features from both hardware-specific and hardware-agnostic spiking neural network simulators, resulting in a hybrid event-driven and time-step-driven simulation approach. This allows for flexibility between accuracy and speed during different stages of simulation. Our work highlights the open-source SENSIM platform, which enables the mapping of large-scale SNN/DNN models to the SENECA cores, as well as the benchmarking of crucial KPIs such as power and latency estimations. ...
Conference paper (2024) - Mees Frensel, Zaid Al-Ars, H. Peter Hofstee
High accuracy nanopore basecalling uses large deep neural networks, requiring powerful GPUs, which is undesirable for sequencing experiments outside the lab. Research has shown that this can be circumvented by using smaller models to increase efficiency as well as basecalling speed. However, this comes at the cost of reduced accuracy, going against the trend of increasingly more complex models to extract the highest possible accuracy out of the source data. We propose learning structured sparsity during model training to find an improved trade-off between accuracy and model size, and thus basecalling speed. Our work introduces an improved pruning method with a delayed masking scheduler and removes redundant masks, saving compute, and is optimized for the basecaller training process. We find that the model size can be reduced by up to 21× with a reduction in match rate of 0.1% to 1.3% compared to Bonito-HAC, using a standardized benchmarking method. Our results indicate that the size of basecalling models can be reduced drastically without affecting accuracy, as long as researchers use appropriate training methods. Furthermore, our work helps democratize nanopore DNA sequencing, broadening the reach and impact of this technology. The code with the masking mechanism to reproduce our results is available at https://github.com/meesfrensel/efficient-basecallers. ...
Journal article (2023) - T.O. Mokveld, Z. Al-Ars, Erik A. Sistermans, M.J.T. Reinders
Background

Non-Invasive Prenatal Testing is often performed by utilizing read coverage-based profiles obtained from shallow whole genome sequencing to detect fetal copy number variations. Such screening typically operates on a discretized binned representation of the genome, where (ab)normality of bins of a set size is judged relative to a reference panel of healthy samples. In practice such approaches are too costly given that for each tested sample they require the resequencing of the reference panel to avoid technical bias. Within-sample testing methods utilize the observation that bins on one chromosome can be judged relative to the behavior of similarly behaving bins on other chromosomes, allowing the bins of a sample to be compared among themselves, avoiding technical bias.
Results

We present a comprehensive performance analysis of the within-sample testing method Wisecondor and its variants, using both experimental and simulated data. We introduced alterations to Wisecondor to explicitly address and exploit paired-end sequencing data. Wisecondor was found to yield the most stable results across different bin size scales while producing more robust calls by assigning higher Z-scores at all fetal fraction ranges.
Conclusions

Our findings show that the most recent available version of Wisecondor performs best.
...
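The within-sample idea from the Background above can be sketched in a few lines: a target bin is scored against reference bins on other chromosomes of the same sample, so no separately sequenced reference panel enters the calculation. This is a simplified illustration rather than Wisecondor's actual procedure; how reference bins are selected is not covered here, and the bin names, coverage values, and function `within_sample_z` are hypothetical.

```python
from statistics import mean, stdev


def within_sample_z(coverage: dict[str, float], target: str, reference_bins: list[str]) -> float:
    """Z-score of one bin relative to its reference bins in the same sample."""
    ref = [coverage[b] for b in reference_bins]
    return (coverage[target] - mean(ref)) / stdev(ref)


# Hypothetical normalized read coverage per bin for a single sample.
coverage = {
    "chr21:bin7": 1.10,   # bin under test
    "chr3:bin12": 1.00,   # reference bins, assumed to behave similarly
    "chr8:bin40": 1.02,
    "chr11:bin5": 0.98,
}

z = within_sample_z(coverage, "chr21:bin7", ["chr3:bin12", "chr8:bin40", "chr11:bin5"])
print(f"z-score: {z:.2f}")
```

A large positive score suggests a gain in the target bin and a large negative score a loss. Because target and reference bins come from the same sequencing run, run-specific technical bias affects both sides of the comparison.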
Conference paper (2023) - Zaid Al-Ars, Obinna Agba, Zhuoran Guo, Christiaan Boerkamp, Ziyaad Jaber, Tareq Jaber
This paper offers a systematic method for creating medical knowledge-grounded patient records for use in activities involving differential diagnosis. Additionally, an assessment of machine learning models that can differentiate between various conditions based on given symptoms is provided. We use a public disease-symptom data source called SymCat in combination with Synthea to construct the patient records. In order to increase the expressive nature of the synthetic data, we use a medically-standardized symptom modeling method called NLICE to augment the synthetic data with additional contextual information for each condition. In addition, Naive Bayes and Random Forest models are evaluated and compared on the synthetic data. The paper shows how to successfully construct SymCat-based and NLICE-based datasets. We also show results for the effectiveness of using the datasets to train predictive disease models. The SymCat-based dataset is able to train a Naive Bayes and Random Forest model yielding a 58.8% and 57.1% Top-1 accuracy score, respectively. In contrast, the NLICE-based dataset improves the results, with a Top-1 accuracy of 82.0% and Top-5 accuracy values of more than 90% for both models. Our proposed data generation approach solves a major barrier to the application of artificial intelligence methods in the healthcare domain. Our novel NLICE symptom modeling approach addresses the incomplete and insufficient information problem in the current binary symptom representation approach. ...

Quantum Knowledge Seeking Agent

Conference paper (2023) - Aritra Sarkar, Zaid Al-Ars, Koen Bertels
In this research, we extend the universal reinforcement learning agent models of artificial general intelligence to quantum environments. The utility function of a classical exploratory stochastic Knowledge Seeking Agent, KL-KSA, is generalized to distance measures from quantum information theory on density matrices. Quantum process tomography (QPT) algorithms form a tractable subset of programs for modeling environmental dynamics. The optimal QPT policy is selected based on a mutable cost function based on algorithmic complexity as well as computational resource complexity. The entire agent design is encapsulated in a self-replicating quine which mutates the cost function based on the predictive value of the optimal policy choosing scheme. Thus, multiple agents with pareto-optimal QPT policies evolve using genetic programming, mimicking the development of physical theories each with different resource trade-offs. This formal framework, termed Quantum Knowledge Seeking Agent (QKSA), is a resource-bounded participatory observer modification to the recently proposed algorithmic information-based reconstruction of quantum mechanics. A proof-of-concept is implemented and available as open-sourced software. ...
Conference paper (2023) - Vinicius Trentin, Chenxu Ma, Jorge Villagra, Zaid Al-Ars
Motion prediction is a key factor towards the full deployment of autonomous vehicles. It is fundamental in order to assure safety while navigating through highly interactive complex scenarios. In this work, the framework IAMP (Interaction-Aware Motion Prediction), producing multi-modal probabilistic outputs from the integration of a Dynamic Bayesian Network and Markov Chains, is extended with a learning-based approach. The integration of a machine learning model tackles the limitations of the rule-based mechanism since it can better adapt to different driving styles and driving situations. The method introduced here generates context-dependent acceleration distributions used in a Markov-chain-based motion prediction. This hybrid approach results in better evaluation metrics when compared with the baseline in the four highly-interactive scenarios obtained from publicly available datasets. ...
Journal article (2023) - M. Ji, Z. Al-Ars, H.P. Hofstee, Yuchun Chang, Baolin Zhang
Convolutional neural networks (CNNs) are known to be effective in many application domains, especially in the computer vision area. In order to achieve lower latency CNN processing, and reduce power consumption, developers are experimenting with using FPGAs to accelerate CNN processing in several applications. Current FPGA CNN accelerators usually use the same acceleration approaches as GPUs, where operations from different network layers are mapped to the same hardware units working in a multiplexed manner. This will result in high flexibility in implementing different types of CNNs; however, this will degrade the latency that accelerators can achieve. Alternatively, we can reduce the latency of the accelerator by pipelining the processing of consecutive layers, at the expense of more FPGA resources. The continued increase in hardware resources available in FPGAs makes such implementations feasible for latency-critical application domains. In this paper, we present FPQNet, a fully pipelined and quantized CNN FPGA implementation that is channel-parallel, layer-pipelined, and network-parallel, to decrease latency and increase throughput, combined with quantization methods to optimize hardware utilization. In addition, we optimize this hardware architecture for the HDMI timing standard to avoid extra hardware utilization. This makes it possible for the accelerator to handle video datasets. We present prototypes of the FPQNet CNN network implementations on an Alpha Data 9H7 FPGA, connected with an OpenCAPI interface, to demonstrate architecture capabilities. Results show that with a 250 MHz clock frequency, an optimized LeNet-5 design is able to achieve latencies as low as 9.32 µs with an accuracy of 98.8% on the MNIST dataset, making it feasible for utilization in high frame rate video processing applications. With 10 hardware kernels working concurrently, the throughput is as high as 1108 GOPs. The methods in this paper are suitable for many other CNNs. Our analysis shows that the latency of AlexNet, ZFNet, OverFeat-Fast, and OverFeat-Accurate can be as low as 69.27, 66.95, 182.98, and 132.6 µs, respectively, using the architecture introduced in this paper. ...
In spite of progress on hardware design languages, the design of high-performance hardware accelerators forces many design decisions specializing the interfaces of these accelerators in ways that complicate the understanding of the design and hinder modularity and collaboration. In response to this challenge, Tydi is presented as an open specification for streaming dataflow designs in digital circuits, allowing designers to express how composite and variable-length data structures are transferred over streams using clear, data-centric types. In contrast, Chisel, with its high level of abstraction and customizability, offers a suitable platform to implement Tydi-based components. In this paper, Tydi-Chisel is presented along with an A-to-Z design-process description. Tydi-Chisel aims to simplify the design of data-streaming accelerators through the integration of the Tydi interface standard in Chisel, along with helper components and syntax sugar. In combination, Chisel and Tydi help bridge the hardware-software divide, making solo design and collaboration between designers easier. Project repository: https://github.com/ccromjongh/Tydi-Chisel ...
Tydi is an open specification for streaming dataflow designs in digital circuits, allowing designers to express how composite and variable-length data structures are transferred over streams using clear, data-centric types. These data types are extensively used in many application domains, such as big data and SQL applications. This way, Tydi provides a higher-level method for defining interfaces between components as opposed to existing bit- and byte-based interface specifications. In this paper, we introduce an open-source intermediate representation (IR) which allows for the declaration of Tydi's types. The IR enables creating and connecting components with Tydi Streams as interfaces, called Streamlets. It also lets backends for synthesis and simulation retain high-level information, such as documentation. Types and Streamlets can be easily reused between multiple projects, and Tydi's streams and type hierarchy can be used to define interface contracts, which aid collaboration when designing a larger system. The IR codifies the rules and properties established in the Tydi specification and serves to complement computation-oriented hardware design tools with a data-centric view on interfaces. To support different backends and targets, the IR is focused on expressing interfaces, and complements behavior described by hardware description languages and other IRs. Additionally, a testing syntax for the verification of inputs and outputs against abstract streams of data, and for substituting interdependent components, is presented which allows for the specification of behavior. To demonstrate this IR, we have created a grammar, parser, and query system, and paired these with a backend targeting VHDL. ...