J. van Straten
Please Note
17 records found
1
Tydi is an open specification for streaming dataflow designs in digital circuits, allowing designers to express how composite and variable-length data structures are transferred over streams using clear, data-centric types. These data types are extensively used in a many application domains, such as big data and SQL applications. This way, Tydi provides a higher-level method for defining interfaces between components as opposed to existing bit- and byte-based interface specifications. In this paper, we introduce an open-source intermediate representation (IR) which allows for the declaration of Tydi's types. The IR enables creating and connecting components with Tydi Streams as interfaces, called Streamlets. It also lets backends for synthesis and simulation retain high-level information, such as documentation. Types and Streamlets can be easily reused between multiple projects, and Tydi's streams and type hierarchy can be used to define interface contracts, which aid collaboration when designing a larger system. The IR codifies the rules and properties established in the Tydi specification and serves to complement computation-oriented hardware design tools with a data-centric view on interfaces. To support different backends and targets, the IR is focused on expressing interfaces, and complements behavior described by hardware description languages and other IRs. Additionally, a testing syntax for the verification of inputs and outputs against abstract streams of data, and for substituting interdependent components, is presented which allows for the specification of behavior. To demonstrate this IR, we have created a grammar, parser, and query system, and paired these with a backend targeting VHDL.
Artificial neural networks are becoming an integral part of digital solutions to complex problems. However, employing neural networks on quantum processors faces challenges related to the implementation of non-linear functions using quantum circuits. In this paper, we use repeat-until-success circuits enabled by real-time control-flow feedback to realize quantum neurons with non-linear activation functions. These neurons constitute elementary building blocks that can be arranged in a variety of layouts to carry out deep learning tasks quantum coherently. As an example, we construct a minimal feedforward quantum neural network capable of learning all 2-to-1-bit Boolean functions by optimization of network activation parameters within the supervised-learning paradigm. This model is shown to perform non-linear classification and effectively learns from multiple copies of a single training state consisting of the maximal superposition of all inputs.
Tydi
An open specification for complex data structures over hardware streams
Streaming dataflow designs describe hardware by connecting components through streams that transport data structures. We introduce a stream-oriented specification and type system that provides a clear and intuitive way to map complex, dynamically-sized data structures onto hardware streams. This helps designers to lift the abstraction of streaming dataflow designs, reducing the design effort. The type system allows complex data structures to be as easy to use in streaming dataflow designs as in modern software languages today.
Fletcher
A framework to efficiently integrate FPGA accelerators with apache arrow
Modern big data systems are highly heterogeneous. The components found in their many layers of abstraction are often implemented in a wide variety of programming languages and frameworks. Due to language implementation differences, interfaces between these components, including hardware accelerated components, are often burdened by serialization overhead. Serialization bandwidth of many high-level language frameworks is an order of magnitude lower than contemporary FPGA accelerator interface bandwidth, especially when objects are small but numerous. Therefore, serialization bounds the effective end-to-end performance of FPGA-accelerated solutions integrated with applications written in high-level languages. The Apache Arrow project defines a language agnostic columnar in-memory format optimized for big data applications, preventing the need to serialize or even make copies during communication between components. To enable FPGA accelerators to benefit from the approach of Arrow, we first investigate the properties of its format in relation to hardware interfaces and establish that the format is usable. Second, we present the Fletcher framework, that automatically generates highly efficient hardware interfaces to access data of potentially complex, nested Arrow data types. Our approach allows 11 of the languages supported by Apache Arrow libraries to efficiently communicate large data sets with FPGA accelerators at system bandwidth. Furthermore, on the hardware side, the generated interfaces deliver any data type that Arrow can represent as groups of streams, providing a better starting point for data-flow-oriented kernel development, compared to manually creating custom interfaces to address issues related to pointer arithmetic, bus word misalignment and latency. For example applications, as measured on an AWS EC2 F1 and CAPI2-enabled POWER9 system, accelerated end-to-end application performance improves by 1.3x-49x compared to a hardware accelerated solution that still requires serialization.
Supporting Columnar In-memory Formats on FPGA
The Hardware Design of Fletcher for Apache Arrow
ALMARVI Execution Platform
Heterogeneous Video Processing SoC Platform on FPGA
eQASM
An executable quantum instruction set architecture
A widely-used quantum programming paradigm comprises of both the data flow and control flow. Existing quantum hardware cannot well support the control flow, significantly limiting the range of quantum software executable on the hardware. By analyzing the constraints in the control microarchitecture, we found that existing quantum assembly languages are either too high-level or too restricted to support comprehensive flow control on the hardware. Also, as observed with the quantum microinstruction set QuMIS [1], the quantum instruction set architecture (QISA) design may suffer from limited scalability and flexibility because of microarchitectural constraints. It is an open challenge to design a scalable and flexible QISA which provides a comprehensive abstraction of the quantum hardware. In this paper, we propose an executable QISA, called eQASM, that can be translated from quantum assembly language (QASM), supports comprehensive quantum program flow control, and is executed on a quantum control microarchitecture. With efficient timing specification, single-operationmultiple-qubit execution, and a very-long-instruction-word architecture, eQASM presents better scalability than QuMIS. The definition of eQASM focuses on the assembly level to be expressive. Quantum operations are configured at compile time instead of being defined at QISA design time. We instantiate eQASM into a 32-bit instruction set targeting a seven-qubit superconducting quantum processor. We validate our design by performing several experiments on a two-qubit quantum processor.
This paper presents and evaluates an approach to deploy image and video processing pipelines that are developed frame-oriented on a hardware platform that is stream-oriented, such as an FPGA. First, this calls for a specialized streaming memory hierarchy and accompanying software framework that transparently moves image segments between stages in the image processing pipeline. Second, we use softcore VLIW processors, that are targetable by a C compiler and have hardware debugging capabilities, to evaluate and debug the software before moving to a High-Level Synthesis flow. The algorithm development phase, including debugging and optimizing on the target platform, is often a very time consuming step in the development of a new product. Our proposed platform allows both software developers and hardware designers to test iterations in a matter of seconds (compilation time) instead of hours (synthesis or circuit simulation time).
In this paper, we present two novel concepts to aid these low-penalty reconfigurable processors - one requiring an ISA extension and one without. Our experimental results show that our approaches enable a dynamic processor to reduce the energy-delay product by up to 25% and on average 10% to 18% compared to the best performing static setups. ...
In this paper, we present two novel concepts to aid these low-penalty reconfigurable processors - one requiring an ISA extension and one without. Our experimental results show that our approaches enable a dynamic processor to reduce the energy-delay product by up to 25% and on average 10% to 18% compared to the best performing static setups.
Mixed-criticality systems need to provide strict guarantees to hard real-time tasks and simultaneously, deliver high throughput for non-critical tasks. However, techniques to enhance performance more often than not affect the analyzability, e.g., caches, branch prediction, out-of-order (OoO) execution superscalar processing, and simultaneous multithreading (SMT). In this paper, we propose the use of a polymorphic VLIW processor to increase performance for non-critical tasks while maintaining analyzability. The processor achieves these goals by dynamically distributing computing resources (in the form of datapaths) to one or multiple threads. A static schedule guarantees the minimum amount of cycles to meet the deadlines for critical tasks. Datapaths that are not used by critical tasks can be assigned to non-critical tasks in a highly flexible way, thereby increasing resource utilization resulting in higher throughput. Our experiments show that our approach can exploit its dynamic properties to improve schedulability and assign up to 50% and on average 25% more resources to lower-priority threads during the execution of a static real-time schedule.
memory hierarchy that is designed to stream data between intermediate stages of an image processing pipeline. These pipelines are commonplace in medical applications such as X-ray imagers. By using a streaming memory hierarchy, performance is increased by a factor that depends on the number of stages (7.5× when using 4 consecutive filters). Using a Xilinx VC707 board, we are able to place up to 75 cores. A platform of 64 cores can be routed at 193 MHz, achieving real-time performance, while keeping 20% resources available for off-board interfacing. Our VHDL implementation and associated tools (compiler, simulator, etc.) are available for download for the academic community. ...
memory hierarchy that is designed to stream data between intermediate stages of an image processing pipeline. These pipelines are commonplace in medical applications such as X-ray imagers. By using a streaming memory hierarchy, performance is increased by a factor that depends on the number of stages (7.5× when using 4 consecutive filters). Using a Xilinx VC707 board, we are able to place up to 75 cores. A platform of 64 cores can be routed at 193 MHz, achieving real-time performance, while keeping 20% resources available for off-board interfacing. Our VHDL implementation and associated tools (compiler, simulator, etc.) are available for download for the academic community.
well-suited to handle such workloads. However, they suffer from the fact that any mismatch between the application’s inherent instruction-level parallelism (ILP) and the core’s parallelism leads to unused resources or loss in performance. An accepted solution is to include several types of cores and match them dynamically depending on the performance needs of the application. This approach becomes less efficient when the number of cores does not match the number of parallel threads. Furthermore,
the heterogeneity of (fixed) cores cannot be increased indefinitely as it would result in even higher degrees of mismatching and increased movement of instruction and data streams. In this paper, we are proposing a polymorphic processor, based on VLIW architectures, that can adapt its issue-width during runtime. By design, the processor can be perceived as a single wide core (8-issue VLIW) or two medium-wide cores (4-issue) or four small cores (2-issue) that can run high-ILP/low DLP, medium-ILP/medium DLP, and low-ILP/high-DLP applications, respectively. Furthermore, we are executing one single generic binary while performing these reconfigurations. In order to show the effectiveness of our approach, we synthesized different versions of the core to represent fixed heterogeneous cores and compared them to the dynamic implementation of the core. Our experiments show that the dynamically adaptive solution performs on average 7% faster and uses 5% less area than a platform which consists of fixed cores with 1.5× as many datapaths. ...
well-suited to handle such workloads. However, they suffer from the fact that any mismatch between the application’s inherent instruction-level parallelism (ILP) and the core’s parallelism leads to unused resources or loss in performance. An accepted solution is to include several types of cores and match them dynamically depending on the performance needs of the application. This approach becomes less efficient when the number of cores does not match the number of parallel threads. Furthermore,
the heterogeneity of (fixed) cores cannot be increased indefinitely as it would result in even higher degrees of mismatching and increased movement of instruction and data streams. In this paper, we are proposing a polymorphic processor, based on VLIW architectures, that can adapt its issue-width during runtime. By design, the processor can be perceived as a single wide core (8-issue VLIW) or two medium-wide cores (4-issue) or four small cores (2-issue) that can run high-ILP/low DLP, medium-ILP/medium DLP, and low-ILP/high-DLP applications, respectively. Furthermore, we are executing one single generic binary while performing these reconfigurations. In order to show the effectiveness of our approach, we synthesized different versions of the core to represent fixed heterogeneous cores and compared them to the dynamic implementation of the core. Our experiments show that the dynamically adaptive solution performs on average 7% faster and uses 5% less area than a platform which consists of fixed cores with 1.5× as many datapaths.
The register file is an expensive component in the design of any processor, especially, when considering the additional ports that are needed to support multiple datapaths within a wide-issue VLIW processor. In a recent work, these additional resources were used to dynamically reconfigure the register file to support a dynamically reconfigurable VLIW core. The design can be perceived as a single 8-issue, two 4-issue, or four 2-issue VLIW cores. Consequently, the multi-ported design can operate in different modes, namely as one, two, or four register files, respectively, corresponding to the active number of cores. The implementation of the register file design on FPGAs using Block RAMs still results in unused resources due to the coarseness of the Block RAMs. In this paper, we propose to re-purpose these unused BRAM resources to additionally support multiple contexts next to earliermentioned modes. In this manner, the 8-issue, 4-issue, and 2- issue cores have access to 4, 2, and 1 contexts, respectively. Consequently, we can avoid saving and restoring of the task states in a multi-task environment, turning context switching from a traditionally time-consuming event to an almost instantaneous event. The advantage of this is the reduction of interrupt latency and task switching latency, which are important in real-time and embedded systems. Our results show that our technique can improve interrupt latency by a factor of 17.4x compared to using a software register spill routine, depending on the behavior of the memory system. Likewise, the task switching time can be improved by 6.7x.
Very Long Instruction Word (VLIW) processors are commonplace in embedded systems due to their inherent lowpower consumption as the instruction scheduling is performed by the compiler instead by sophisticated and power-hungry hardware instruction schedulers used in their RISC counterparts. This is achieved by maximizing resource utilization by only targeting a certain application domain. However, when the inherent application ILP (instruction-level parallelism) is low, resources are under-utilized/wasted and the encoding of NOPs results in large code sizes and consequently additional pressure on the memory subsystem to store these NOPs. To address the resource-utilization issue, we proposed a dynamic VLIW processor design that can merge unused resources to form additional cores to execute more threads. Therefore, the formation of cores can result in issue widths of 2, 4, and 8. Without sacrificing the possibility of code interruptability and resumption, we proposed a generic binary scheme that allows a single binary to be executed on these different issue-width cores. However, the code size issue remains as the generic binary scheme even slightly further increases the number NOPS. Therefore, in this paper, we propose to apply a well-known stop-bit code compression technique to the generic binaries that, most importantly, maintains its code compatibility characteristic allowing it to be executed on different cores. In addition, we present the hardware designs to support this technique in our dynamic core. For prototyping purposes, we implemented our design on a Xilinx Virtex-6 FPGA device and executed 14 embedded benchmarks. For comparison, we selected a nondynamic/ static VLIW core that incorporates a similar stop-bit technique for its code compression. We demonstrate, while maintaining code compatibility on top of a flexible dynamic VLIW processor, that the code size can be significantly reduced (up to 80%) resulting in energy savings, and that the performance can be increased (up to a factor of three). Finally, our experimental results show that we can use smaller caches (2 to 4 times as small), which will further help in decreasing energy consumption.