Z. Al-Ars | TU Delft Repository

GSST

Parallel string decompression at 191 GB/s on GPU

Conference paper (2025) - Robin Vonk (author) , Joost Hoozemans (author) , Zaid Al-Ars (author)

Most of the commonly used compression standards make use of some form of the LZ algorithm. Decompressing this type of data is not a good match for the Single-Instruction, Multiple Thread (SIMT) model of computation used by GPUs, resulting in low throughput and poor utilization of ...

Beyond quantum Shannon decomposition

Circuit construction for n-qubit gates based on block-ZXZ decomposition

Journal article (2024) - A.M. Krol (author) , Zaid Al-Ars (author)

This paper proposes an optimized quantum block-ZXZ decomposition method that results in more optimal quantum circuits than the quantum Shannon decomposition, which was presented in 2005 by M. Möttönen and J. J. Vartiainen [in Trends in quantum computing research, edited by S. Sh ...

Tina: Acceleration of Non-NN Signal Processing Algorithms Using NN Accelerators

Conference paper (2024) - C. Boerkamp (author) , Steven Vlugt (author) , Z. Al-Ars (author)

This paper introduces TINA, a novel framework for implementing non-neural-network (non-NN) signal processing algorithms on NN accelerators such as GPUs, TPUs or FPGAs. The key to this approach is the concept of mapping mathematical and logic functions as a series of convolutional and ...

Fully Pipelined FPGA Acceleration of Binary Convolutional Neural Networks with Neural Architecture Search

Journal article (2024) - M. Ji (author) , Zaid Al-Ars (author) , Yuchun Chang (author) , Baolin Zhang (author)

In this paper, we present a fully pipelined and semi-parallel channel convolutional neural network hardware accelerator structure. This structure can trade off the compute time and the hardware utilization, allowing the accelerator to be layer pipelined without the need for fully ...

Learning Structured Sparsity for Efficient Nanopore DNA Basecalling Using Delayed Masking

Conference paper (2024) - Mees Frensel (author) , Zaid Al-Ars (author) , H. Peter Hofstee (author)

High accuracy nanopore basecalling uses large deep neural networks, requiring powerful GPUs, which is undesirable for sequencing experiments outside the lab. Research has shown that this can be circumvented by using smaller models to increase efficiency as well as basecalling spe ...

Hardware-Accelerator Design by Composition

Dataflow Component Interfaces with Tydi-Chisel

Journal article (2024) - Casper Cromjongh (author) , Y. Tian (author) , H. Peter Hofstee (author) , Z. Al-Ars (author)

As dedicated hardware is becoming more prevalent in accelerating complex applications, methods are needed to enable easy integration of multiple hardware components into a single accelerator system. However, this vision of composable hardware is hindered by the lack of standards ...

SENSIM

An Event-driven Parallel Simulator for Multi-core Neuromorphic Systems

Conference paper (2024) - Prithvish Nembhani (author) , Kanishkan Vadivel (author) , Guangzhi Tang (author) , Mohammad Tahghighi (author) , Gert Jan Van Schaik (author) , Manolis Sifalakis (author) , Zaid Al-Ars (author) , Amirreza Yousefzadeh (author)

In this paper, we present SENSIM, which is an open-source simulator designed specifically for the SENECA neuromorphic processor. This simulator is unique in that it combines features from both hardware-specific and hardware-agnostic spiking neural network simulators, resulting in ...

Large Scale Calibration of Agent-Based Models in Social Systems with Sensitive Data

Poster (2023) - A.S. Hesam (author) , Frank Pijpers (author) , Fons Rademakers (author) , Zaid Al-Ars (author)

Tydi-Chisel: Collaborative and Interface-Driven Data-Streaming Accelerators

Conference paper (2023) - Casper Cromjongh (author) , Y. Tian (author) , H. Peter Hofstee (author) , Z. Al-Ars (author)

In spite of progress on hardware design languages, the design of high-performance hardware accelerators forces many design decisions specializing the interfaces of these accelerators in ways that complicate the understanding of the design and hinder modularity and collaboration. ...

Learning-enabled multi-modal motion prediction in urban environments

Conference paper (2023) - Vinicius Trentin (author) , Chenxu Ma (author) , Jorge Villagra (author) , Zaid Al-Ars (author)

Motion prediction is a key factor towards the full deployment of autonomous vehicles. It is fundamental in order to assure safety while navigating through highly interactive complex scenarios. In this work, the framework IAMP (Interaction-Aware Motion Prediction), producing multi ...

NLICE: Synthetic Medical Record Generation for Effective Primary Healthcare Differential Diagnosis

Conference paper (2023) - Zaid Al-Ars (author) , Obinna Agba (author) , Zhuoran Guo (author) , C. Boerkamp (author) , Ziyaad Jaber (author) , Tareq Jaber (author)

This paper offers a systematic method for creating medical knowledge-grounded patient records for use in activities involving differential diagnosis. Additionally, an assessment of machine learning models that can differentiate between various conditions based on given symptoms i ...

An Intermediate Representation for Composable Typed Streaming Dataflow Designs

Journal article (2023) - Matthijs A. Reukers (author) , Yongding Tian (author) , Z. Al-Ars (author) , H. Peter Hofstee (author) , Matthijs Brobbel (author) , J.W. Peltenburg (author) , J. Van Straten (author)

Tydi is an open specification for streaming dataflow designs in digital circuits, allowing designers to express how composite and variable-length data structures are transferred over streams using clear, data-centric types. These data types are extensively used in many applicat ...

QKSA

Quantum Knowledge Seeking Agent

Conference paper (2023) - Aritra Sarkar (author) , Z. Al-Ars (author) , Koen Bertels (author)

In this research, we extend the universal reinforcement learning agent models of artificial general intelligence to quantum environments. The utility function of a classical exploratory stochastic Knowledge Seeking Agent, KL-KSA, is generalized to distance measures from quantum i ...

DFL: High-Performance Blockchain-Based Federated Learning

Journal article (2023) - Yongding Tian (author) , Zhuoran Guo (author) , Jiaxuan Zhang (author) , Z. Al-Ars (author)

Many researchers have proposed replacing the aggregation server in federated learning with a blockchain system to improve privacy, robustness, and scalability. In this approach, clients would upload their updated models to the blockchain ledger and use a smart contract to perform ...

FPQNet: Fully Pipelined and Quantized CNN for Ultra-Low Latency Image Classification on FPGAs Using OpenCAPI

Journal article (2023) - Mengfei Ji (author) , Z. Al-Ars (author) , Peter Hofstee (author) , Yuchun Chang (author) , Baolin Zhang (author)

Convolutional neural networks (CNNs) are effective in many application domains, especially in computer vision. To achieve lower-latency CNN processing and reduce power consumption, developers are experimenting with using FPGAs to accelerate CNN processing in several applications. Current FPGA CNN accelerators usually use the same acceleration approaches as GPUs, where operations from different network layers are mapped to the same hardware units working in a multiplexed manner. This gives high flexibility in implementing different types of CNNs, but it degrades the latency that accelerators can achieve. Alternatively, the latency of the accelerator can be reduced by pipelining the processing of consecutive layers, at the expense of more FPGA resources. The continued increase in hardware resources available in FPGAs makes such implementations feasible for latency-critical application domains. In this paper, we present FPQNet, a fully pipelined and quantized CNN FPGA implementation that is channel-parallel, layer-pipelined, and network-parallel, to decrease latency and increase throughput, combined with quantization methods to optimize hardware utilization. In addition, we optimize this hardware architecture for the HDMI timing standard to avoid extra hardware utilization. This makes it possible for the accelerator to handle video datasets. We present prototypes of the FPQNet CNN network implementations on an Alpha Data 9H7 FPGA, connected with an OpenCAPI interface, to demonstrate architecture capabilities. Results show that with a 250 MHz clock frequency, an optimized LeNet-5 design is able to achieve latencies as low as 9.32 µs with an accuracy of 98.8% on the MNIST dataset, making it feasible for use in high frame rate video processing applications. With 10 hardware kernels working concurrently, the throughput is as high as 1108 GOPs. The methods in this paper are suitable for many other CNNs. Our analysis shows that, using the architecture introduced in this paper, the latency of AlexNet, ZFNet, OverFeat-Fast, and OverFeat-Accurate can be as low as 69.27, 66.95, 182.98, and 132.6 µs, respectively.

A comprehensive performance analysis of sequence-based within-sample testing NIPT methods

Journal article (2023) - Tom Mokveld (author) , Zaid Al-Ars (author) , Erik A. Sistermans (author) , Marcel J.T. Reinders (author)

Background

Non-Invasive Prenatal Testing is often performed by utilizing read coverage-based profiles obtained from shallow whole genome sequencing to detect fetal copy number variations. Such screening typically operates on a discretized binned representation of the gen ...

Communication-Efficient Cluster Scalable Genomics Data Processing Using Apache Arrow Flight

Conference paper (2022) - Tanveer Ahmad (author) , Chengxin Ma (author) , Zaid Al-Ars (author) , H. Hofstee (author)

Current cluster scaled genomics data processing solutions rely on big data frameworks like Apache Spark, Hadoop and HDFS for data scheduling, processing and storage. These frameworks come with additional computation and memory overheads by default. It has been observed that scali ...

Efficient Decomposition of Unitary Matrices in Quantum Circuit Compilers

Journal article (2022) - A.M. Krol (author) , A. Sarkar (author) , Imran Ashraf (author) , Zaid Al-Ars (author) , K. Bertels (author)

Unitary decomposition is a widely used method to map quantum algorithms to an arbitrary set of quantum gates. Efficient implementation of this decomposition allows for the translation of bigger unitary gates into elementary quantum operations, which is key to executing these algo ...

SALoBa

Maximizing Data Locality and Workload Balance for Fast Sequence Alignment on GPUs

Conference paper (2022) - Seongyeon Park (author) , Hajin Kim (author) , T. Ahmad (author) , N. Ahmed (author) , Z. Al-Ars (author) , H. Peter Hofstee (author) , Youngsok Kim (author) , Jinho Lee (author)

Sequence alignment forms an important backbone in many sequencing applications. A commonly used strategy for sequence alignment is an approximate string matching with a two-dimensional dynamic programming approach. Although some prior work has been conducted on GPU acceleration o ...

WisecondorFF

Improved Fetal Aneuploidy Detection from Shallow WGS through Fragment Length Analysis

Journal article (2022) - T.O. Mokveld (author) , Zaid Al-Ars (author) , EA Sistermans (author) , M.J.T. Reinders (author)

In prenatal diagnostics, NIPT screening utilizing read coverage-based profiles obtained from shallow WGS data is routinely used to detect fetal CNVs. From this same data, fragment size distributions of fetal and maternal DNA fragments can be derived, which are known to be differe ...