Z. Al-Ars
118 records found
Rapid advancements in sequencing technologies allow cost-effective, high-volume sequencing data to be produced. Processing this data for real-time clinical diagnosis is potentially time-consuming if done on a single computing node. This work presents a complete variant calling workflow, implemented using the Message Passing Interface (MPI) to leverage the benefits of high-bandwidth interconnects. This solution (GenMPI) is portable and flexible, meaning it can be deployed to any private or public cluster/cloud infrastructure. Any alignment or variant calling application can be used with minimal adaptation. To achieve high performance, compressed input data can be streamed in parallel to alignment applications, while uncompressed data can use internal file seek functionality to eliminate the bottleneck of streaming input data from a single node. Alignment output can be stored directly in multiple chromosome-specific SAM files or in a single SAM file. After alignment, a distributed queue using MPI RMA (Remote Memory Access) atomic operations is created for sorting, indexing, marking of duplicates (if necessary) and variant calling applications. We verify that the accuracy of the variants matches that of the original single-node methods. We also show that for 300x coverage data, alignment scales almost linearly up to 64 nodes (8192 CPU cores). Overall, this work outperforms existing Big Data based workflows by a factor of two and is almost 20% faster than other MPI-based implementations for alignment, without any extra memory overhead. Sorting, indexing, duplicate removal and variant calling are also scalable up to an 8-node cluster. For paired-end short-read (Illumina) data, we integrated the BWA-MEM aligner and three variant callers (GATK HaplotypeCaller, DeepVariant and Octopus), while for long-read data, we integrated the Minimap2 aligner and three different variant callers (DeepVariant, DeepVariant with WhatsHap for phasing (PacBio) and Clair3 (ONT)).
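The distributed queue mentioned in this abstract can be pictured as a shared counter that every worker advances with an atomic fetch-and-add, so that each post-alignment task is claimed exactly once. The sketch below is only an illustration under stated assumptions: Python threads and a lock stand in for MPI ranks and an RMA window, and the names `FetchAddCounter` and `run_queue` are hypothetical and not taken from GenMPI.

```python
import threading


class FetchAddCounter:
    """Stand-in for a single integer exposed through an MPI RMA window."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_and_add(self, increment=1):
        # Atomically return the old value and advance the counter.
        with self._lock:
            old = self._value
            self._value += increment
            return old


def run_queue(tasks, n_workers):
    """Let each worker repeatedly claim the next unprocessed task."""
    counter = FetchAddCounter()
    claimed = [[] for _ in range(n_workers)]

    def worker(rank):
        while True:
            idx = counter.fetch_and_add()
            if idx >= len(tasks):
                break  # queue is drained
            # A real worker would sort, index or call variants here.
            claimed[rank].append(tasks[idx])

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed


# Hypothetical task list: one task per chromosome-specific SAM file.
chromosomes = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]
assignment = run_queue(chromosomes, n_workers=4)
```

Because workers pull tasks on demand rather than receiving a fixed share, a worker that finishes a small chromosome early simply claims the next one, which is the usual motivation for this kind of dynamic queue.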
Synthetic image generation involves the creation of artificially generated images that are indistinguishable from real ones. Conventional simulation-based image synthesis approaches suffer from intensive computational and memory throughput demands associated with physically accurate ray tracing through volumetric datasets. In this work, we propose an FPGA-based accelerator architecture capable of handling the computations required to simulate physically accurate X-ray images in real time. In addition, an algorithm is developed that can calculate the path of an X-ray through a phantom representing a physical model. To ensure real-time performance, a parallel accelerator architecture is proposed using a chain of accelerator kernels combined with a High Bandwidth Memory architecture, which can simulate many rays concurrently, addressing the computational and memory throughput demands associated with simulation-based X-ray image generation. Performance evaluation of the simulation on an AMD Alveo U50 Data Accelerator card shows that an average speed-up of 12× over CPU-based implementations is possible, allowing real-time image synthesis at a frame rate of 60 images/s. These findings highlight the advantages of FPGA acceleration for deterministic, high-speed synthetic image generation.
GSST
Parallel string decompression at 191 GB/s on GPU
This paper introduces SENMap, a mapping and synthesis tool for scalable, energy-efficient neuromorphic computing architecture frameworks. SENECA is a flexible architectural design optimized for executing edge AI SNN/ANN inference applications efficiently. To speed up the silicon tapeout and chip design for SENECA, an accurate emulator, SENSIM, was designed. While SENSIM supports direct mapping of SNNs on neuromorphic architectures, as SNNs/ANNs grow in size, achieving an optimal mapping for objectives like energy, throughput, area, and accuracy becomes challenging. SENMap is flexible mapping software for efficiently mapping large SNN/ANN applications onto adaptable architectures. SENMap considers architectural parameters, realistic pretrained SNN/ANN examples, and event-rate-based parameters, and is open-sourced along with SENSIM to aid flexible neuromorphic chip design before fabrication. Experimental results show that SENMap enables 40 percent energy improvements over a baseline SENSIM operating in timestep-asynchronous mode. SENMap is also designed to accommodate future modifications for mapping large spiking neural networks.
Hardware-Accelerator Design by Composition
Dataflow Component Interfaces with Tydi-Chisel
As dedicated hardware is becoming more prevalent in accelerating complex applications, methods are needed to enable easy integration of multiple hardware components into a single accelerator system. However, this vision of composable hardware is hindered by the lack of standards for interfaces that allow such components to communicate. To address this challenge, the Tydi standard was proposed to facilitate the representation of streaming data in digital circuits, notably providing interface specifications of composite and variable-length data structures. At the same time, constructing hardware in a Scala embedded language (Chisel) provides a suitable environment for deploying Tydi-centric components due to its abstraction level and customizability. This article introduces Tydi-Chisel, a library that integrates the Tydi standard within Chisel, along with a toolchain and methodology for designing data-streaming accelerators. This toolchain reduces the effort needed to design streaming hardware accelerators by raising the abstraction level for streams and module interfaces, thereby avoiding boilerplate code, and allows for easy integration of accelerator components from different designers. This is demonstrated through an example project incorporating various scenarios in which the amount of interface-related declaration code is reduced by a factor of 6 to 14. The Tydi-Chisel project repository is available at https://github.com/abs-tudelft/Tydi-Chisel.
Beyond quantum Shannon decomposition
Circuit construction for n-qubit gates based on block-ZXZ decomposition
This paper proposes an optimized quantum block-ZXZ decomposition method that results in more optimal quantum circuits than the quantum Shannon decomposition, which was presented in 2005 by M. Möttönen and J. J. Vartiainen [in Trends in quantum computing research, edited by S. Shannon (Nova Science Publishers, 2006) Chap. 7, p. 149, arXiv:quant-ph/0504100]. The decomposition is applied recursively to generic quantum gates, and can take advantage of existing and future small-circuit optimizations. Because our method uses only single-qubit gates and uniformly controlled rotation-Z gates, it can easily be adapted to use other types of multi-qubit gates. With the proposed decomposition, a general three-qubit gate can be decomposed using 19 CNOT gates (rather than 20). For general $n$-qubit gates, the proposed decomposition generates circuits that have $\frac{22}{48}4^n - \frac{3}{2}2^n + \frac{5}{3}$ CNOT gates, which is fewer than the best-known exact decomposition algorithm by $(4^{n-2}-1)/3$ CNOT gates.
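The gate counts in this abstract can be checked with a few lines of exact arithmetic. The sketch below assumes the CNOT count is read as (22/48)·4^n − (3/2)·2^n + 5/3 and the saving as (4^(n−2) − 1)/3, a reading that reproduces the stated 19 versus 20 CNOT gates for three qubits. The function names are illustrative, and the baseline count is derived only by adding the stated saving back to the block-ZXZ count.

```python
from fractions import Fraction


def block_zxz_cnots(n):
    """CNOT count of the block-ZXZ decomposition for an n-qubit gate (n >= 2)."""
    return Fraction(22, 48) * 4**n - Fraction(3, 2) * 2**n + Fraction(5, 3)


def saving(n):
    """Stated reduction relative to the best-known exact decomposition (n >= 2)."""
    return Fraction(4 ** (n - 2) - 1, 3)


def baseline_cnots(n):
    """Baseline count implied by the abstract: block-ZXZ count plus the saving."""
    return block_zxz_cnots(n) + saving(n)


# Tabulate both counts for a few gate sizes.
for n in range(3, 7):
    print(n, block_zxz_cnots(n), baseline_cnots(n))
```

For n = 3 the two functions return 19 and 20, matching the three-qubit figures quoted in the abstract.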
In this paper, we present a fully pipelined and semi-parallel channel convolutional neural network hardware accelerator structure. This structure can trade off compute time against hardware utilization, allowing the accelerator to be layer-pipelined without the need to fully parallelize the input and output channels. A parallel strategy is applied to reduce the time gap in transferring the output results between different layers. The degree of parallelism can be chosen based on the hardware resources available on the target FPGA. We use this structure to implement a binary ResNet18 based on the neural architecture search strategy, which can increase the accuracy of manually designed binary convolutional neural networks. Our optimized binary ResNet18 can achieve a Top-1 accuracy of 60.5% on the ImageNet dataset. We deploy this ResNet18 hardware implementation on an Alphadata 9H7 FPGA, connected with an OpenCAPI interface, to demonstrate the hardware capabilities. Depending on the amount of parallelism used, the latency can range from 1.12 to 6.33 ms, with a corresponding throughput of 4.56 to 0.71 TOPS for different hardware utilization, at a 200 MHz clock frequency. Our best latency is 8× lower and our best throughput is 1.9× higher than those of the best previous works. The code for our implementation is open-source and publicly available on GitHub at https://github.com/MFJI/NASBRESNET.
SENSIM
An Event-driven Parallel Simulator for Multi-core Neuromorphic Systems
High accuracy nanopore basecalling uses large deep neural networks, requiring powerful GPUs, which is undesirable for sequencing experiments outside the lab. Research has shown that this can be circumvented by using smaller models to increase efficiency as well as basecalling speed. However, this comes at the cost of reduced accuracy, going against the trend of increasingly more complex models to extract the highest possible accuracy out of the source data. We propose learning structured sparsity during model training to find an improved trade-off between accuracy and model size, and thus basecalling speed. Our work introduces an improved pruning method with a delayed masking scheduler and removes redundant masks, saving compute, and is optimized for the basecaller training process. We find that the model size can be reduced by up to 21× with a reduction in match rate of 0.1% to 1.3% compared to Bonito-HAC, using a standardized benchmarking method. Our results indicate that the size of basecalling models can be reduced drastically without affecting accuracy, as long as researchers use appropriate training methods. Furthermore, our work helps democratize nanopore DNA sequencing, broadening the reach and impact of this technology. The code with the masking mechanism to reproduce our results is available at https://github.com/meesfrensel/efficient-basecallers.
Non-Invasive Prenatal Testing is often performed by utilizing read coverage-based profiles obtained from shallow whole genome sequencing to detect fetal copy number variations. Such screening typically operates on a discretized binned representation of the genome, where (ab)normality of bins of a set size is judged relative to a reference panel of healthy samples. In practice such approaches are too costly given that for each tested sample they require the resequencing of the reference panel to avoid technical bias. Within-sample testing methods utilize the observation that bins on one chromosome can be judged relative to the behavior of similarly behaving bins on other chromosomes, allowing the bins of a sample to be compared among themselves, avoiding technical bias.
Results
We present a comprehensive performance analysis of the within-sample testing method Wisecondor and its variants, using both experimental and simulated data. We introduced alterations to Wisecondor to explicitly address and exploit paired-end sequencing data. Wisecondor was found to yield the most stable results across different bin size scales while producing more robust calls by assigning higher Z-scores at all fetal fraction ranges.
Conclusions
Our findings show that the most recent available version of Wisecondor performs best.
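The within-sample idea described above can be sketched in a few lines: a target bin is scored against reference bins taken from other chromosomes of the same sample. This is a simplified illustration, not Wisecondor's implementation. The coverage values, the bin keys and the function name `within_sample_z` are hypothetical, and the reference bins are simply assumed to be given, whereas a real method would select them by how similarly they behave.

```python
from statistics import mean, stdev


def within_sample_z(coverage, target, reference_bins):
    """Z-score of a target bin relative to reference bins of the same sample."""
    ref = [coverage[b] for b in reference_bins]
    return (coverage[target] - mean(ref)) / stdev(ref)


# Hypothetical normalized coverage per (chromosome, bin index).
coverage = {
    ("chr21", 0): 1.12,
    ("chr1", 5): 1.00,
    ("chr2", 9): 0.98,
    ("chr3", 4): 1.02,
    ("chr5", 7): 1.00,
}

# Use every bin that is not on the target chromosome as a reference.
refs = [b for b in coverage if b[0] != "chr21"]
z = within_sample_z(coverage, ("chr21", 0), refs)
print(f"z = {z:.2f}")
```

Since the target and its references come from the same sequencing run, run-specific technical bias affects both sides of the comparison, which is the property the abstract attributes to within-sample testing.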
QKSA
Quantum Knowledge Seeking Agent
In this research, we extend the universal reinforcement learning agent models of artificial general intelligence to quantum environments. The utility function of a classical exploratory stochastic Knowledge Seeking Agent, KL-KSA, is generalized to distance measures from quantum information theory on density matrices. Quantum process tomography (QPT) algorithms form a tractable subset of programs for modeling environmental dynamics. The optimal QPT policy is selected using a mutable cost function based on algorithmic complexity as well as computational resource complexity. The entire agent design is encapsulated in a self-replicating quine which mutates the cost function based on the predictive value of the optimal policy choosing scheme. Thus, multiple agents with Pareto-optimal QPT policies evolve using genetic programming, mimicking the development of physical theories, each with different resource trade-offs. This formal framework, termed Quantum Knowledge Seeking Agent (QKSA), is a resource-bounded participatory observer modification to the recently proposed algorithmic information-based reconstruction of quantum mechanics. A proof-of-concept is implemented and available as open-sourced software.
Motion prediction is a key factor towards the full deployment of autonomous vehicles. It is fundamental for ensuring safety while navigating through highly interactive, complex scenarios. In this work, the framework IAMP (Interaction-Aware Motion Prediction), which produces multi-modal probabilistic outputs from the integration of a Dynamic Bayesian Network and Markov chains, is extended with a learning-based approach. The integration of a machine learning model addresses the limitations of the rule-based mechanism, since it can better adapt to different driving styles and driving situations. The method introduced here generates context-dependent acceleration distributions that are used in a Markov-chain-based motion prediction. This hybrid approach results in better evaluation metrics than the baseline in the four highly interactive scenarios obtained from publicly available datasets.
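A minimal sketch of how an acceleration distribution can drive a Markov-chain prediction is given below. It is not the IAMP implementation: the discretization to integer speeds, the one-second time step and the fixed `accel_dist` values are assumptions for illustration, and the fixed distribution stands in for the output of the learned, context-dependent model.

```python
def step(velocity_dist, accel_dist, dt=1):
    """Propagate a discrete speed distribution by one Markov-chain step."""
    nxt = {}
    for v, p_v in velocity_dist.items():
        for a, p_a in accel_dist.items():
            v_next = max(0, v + a * dt)  # speeds cannot become negative
            nxt[v_next] = nxt.get(v_next, 0.0) + p_v * p_a
    return nxt


# Hypothetical acceleration distribution (m/s^2 -> probability); in a hybrid
# approach this would be produced per context by the learned model.
accel_dist = {-1: 0.2, 0: 0.5, 1: 0.3}

# The vehicle is known to travel at 10 m/s at prediction time.
one_step = step({10: 1.0}, accel_dist)

# Predict three steps ahead by repeated application.
dist = {10: 1.0}
for _ in range(3):
    dist = step(dist, accel_dist)
```

After one step the probability mass sits on 9, 10 and 11 m/s with the same weights as the acceleration distribution, and each further step spreads it over a wider range of speeds, which is what makes the output multi-modal and probabilistic rather than a single trajectory.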
Tydi is an open specification for streaming dataflow designs in digital circuits, allowing designers to express how composite and variable-length data structures are transferred over streams using clear, data-centric types. These data types are extensively used in many application domains, such as big data and SQL applications. This way, Tydi provides a higher-level method for defining interfaces between components as opposed to existing bit- and byte-based interface specifications. In this paper, we introduce an open-source intermediate representation (IR) which allows for the declaration of Tydi's types. The IR enables creating and connecting components with Tydi Streams as interfaces, called Streamlets. It also lets backends for synthesis and simulation retain high-level information, such as documentation. Types and Streamlets can be easily reused between multiple projects, and Tydi's streams and type hierarchy can be used to define interface contracts, which aid collaboration when designing a larger system. The IR codifies the rules and properties established in the Tydi specification and serves to complement computation-oriented hardware design tools with a data-centric view on interfaces. To support different backends and targets, the IR is focused on expressing interfaces, and complements behavior described by hardware description languages and other IRs. Additionally, we present a testing syntax for verifying inputs and outputs against abstract streams of data and for substituting interdependent components, which allows behavior to be specified. To demonstrate this IR, we have created a grammar, parser, and query system, and paired these with a backend targeting VHDL.