J.J. Hoozemans | TU Delft Repository

AEx

Automated High-Level Synthesis of Compiler Programmable Co-Processors

Journal article (2023) - Alex Hirvonen (author) , Topi Leppänen (author) , Kari Hepola (author) , Joonas Multanen (author) , J.J. Hoozemans (author) , Pekka Jääskeläinen (author)

Modern High Level Synthesis (HLS) tools succeed well in their engineering productivity goal, but still require toolset and target technology specific modifications to the source code to guide the process towards an efficient implementation. Furthermore, their end result is a fixe ...

FPGA Acceleration for Big Data Analytics

Challenges and Opportunities

Journal article (2021) - J.J. Hoozemans (author) , J.W. Peltenburg (author) , Fabian Nonnenmacher (author) , A. Hadnagy (author) , Z. Al-Ars (author) , H.P. Hofstee (author)

The big data revolution has ushered in an era with ever-increasing volumes and complexity of data requiring ever faster computational analysis. During this very same era, CPU performance growth has been stagnating, pushing the industry to either scale their computation horizontally using multiple nodes in datacenters, or to scale vertically using heterogeneous components to reduce compute time. However, networking and storage continue to provide both higher throughput and lower latency, which allows for leveraging heterogeneous components, deployed in data centers around the world. Still, the integration of big data analytics frameworks with heterogeneous hardware components such as GPGPUs and FPGAs is challenging, because there is an increasing gap in the level of abstraction between analytics solutions developed with big data analytics frameworks, and accelerated kernels developed with heterogeneous components. In this article, we focus on FPGA accelerators that have seen wide-scale deployment in large cloud infrastructures. FPGAs allow the implementation of highly optimized hardware architectures, tailored exactly to an application, and unburdened by the overhead associated with traditional general-purpose computer architectures. FPGAs implementing dataflow-oriented architectures with high levels of (pipeline) parallelism can provide high application throughput, often providing high energy efficiency. Latency-sensitive applications can leverage FPGA accelerators by directly connecting to the physical layer of a network, and perform data transformations without going through the software stacks of the host system. While these advantages of FPGA accelerators hold promise, difficulties associated with programming and integration limit their use. This article explores the existing practices in big data analytics frameworks, discusses the aforementioned gap in development abstractions, and provides some perspectives on how to address these challenges in the future.

Battling the CPU Bottleneck in Apache Parquet to Arrow Conversion Using FPGA

Conference paper (2021) - J.W. Peltenburg (author) , Lars Van Leeuwen (author) , J.J. Hoozemans (author) , J. Fang (author) , Z. Al-Ars (author) , H.P. Hofstee (author)

In the domain of big data analytics, the bottleneck of converting storage-focused file formats to in-memory data structures has shifted from the bandwidth of storage to the performance of decoding and decompression software. Two widely used formats for big data storage and in-mem ...

Energy Efficient Multistandard Decompressor ASIP

Conference paper (2021) - J.J. Hoozemans (author) , Kati Tervo (author) , Pekka Jääskeläinen (author) , Z. Al-Ars (author)

Many applications make extensive use of various forms of compression techniques for storing and communicating data. As decompression is highly regular and repetitive, it is a suitable candidate for acceleration. Examples are offloading (de)compression to a dedicated circuit on a ...

Towards Real Time Radiotherapy Simulation

Journal article (2020) - Nils Voss (author) , Peter Ziegenhein (author) , Lukas Vermond (author) , J.J. Hoozemans (author) , Oskar Mencer (author) , Uwe Oelfke (author) , W. Luk (author) , G. Gaydadjiev (author)

We propose a novel reconfigurable hardware architecture to implement Monte Carlo based simulation of physical dose accumulation for intensity-modulated adaptive radiotherapy. The long term goal of our effort is to provide accurate dose calculation in real-time during patient trea ...

Towards real time radiotherapy simulation

Conference paper (2019) - Nils Voss (author) , Peter Ziegenhein (author) , Lukas Vermond (author) , J.J. Hoozemans (author) , Oskar Mencer (author) , Uwe Oelfke (author) , W. Luk (author) , G. Gaydadjiev (author)

We propose a novel reconfigurable hardware architecture to implement Monte Carlo based simulation of physical dose accumulation for intensity-modulated adaptive radiotherapy. The long term goal of our effort is to provide accurate online dose calculation in real-time during patie ...

Frame-based Programming, Stream-Based Processing for Medical Image Processing Applications

Journal article (2019) - J.J. Hoozemans (author) , Rob de Jong (author) , Steven van der Vlugt (author) , Jeroen van Straten (author) , Uttam Kumar Elango (author) , Z. Al-Ars (author)

This paper presents and evaluates an approach to deploy image and video processing pipelines that are developed frame-oriented on a hardware platform that is stream-oriented, such as an FPGA. First, this calls for a specialized streaming memory hierarchy and accompanying software ...

ALMARVI Execution Platform

Heterogeneous Video Processing SoC Platform on FPGA

Journal article (2019) - J.J. Hoozemans (author) , J. van Straten (author) , Timo Viitanen (author) , Aleksi Tervo (author) , Jiri Kadlec (author) , Z. Al-Ars (author)

The proliferation of processing hardware alternatives allows developers to use various customized computing platforms to run their applications in an optimal way. However, porting application code to custom hardware requires a lot of development effort. This paper des ...

Targeting static and dynamic workloads with a reconfigurable VLIW processor

Doctoral thesis (2018) - J.J. Hoozemans (author)

Embedded systems range from very simple devices, such as a digital watch, to highly complex systems such as smartphones. In these complex devices, an increasing number of applications need to be executed on a computing platform. Moreover, the number of applications (or programs) usually exceeds the number of processors found on such platforms. This creates the need for scheduling. Furthermore, each program exhibits different characteristics and their interaction with the (real-life) environment leads to real-time requirements. Consequently, the set of programs, called the workload, exhibits highly dynamic behavior. Workloads can be dynamic in intensity (i.e., the number of concurrent tasks), characteristics (amount and type of parallelism), and requirements (real-time constraints, power budgets, performance). We argue that dynamic workloads require a dynamic computing platform and propose to use one that comprises the ρ-VEX reconfigurable VLIW processor. It can dynamically adapt to the workload while it is running. Adaptations can be triggered by a user, programmer, compiler, or an operating system. The latter two methods can operate fully automatically, and exploring these is one of the goals of this work. Besides dynamic workloads, a number of new classes of embedded devices are running application programs that are very static, but require very high throughput. Examples are the latest generations of mobile telecommunications hardware and vision-based applications (automation, surveillance, automated driving). In this case, adapting to the workload at run-time is not advantageous because there are no changes to adapt to. Optimizing for these applications is possible, but must be done before the hardware platform is manufactured (during the design phase) or by making use of Field-Programmable Gate Arrays (FPGAs). This thesis explores the use of the proposed reconfigurable processor to target the full spectrum of embedded workloads.
First, design-time reconfigurability is employed to optimize a hardware platform for a static, streaming image processing workload. Second, we explore the run-time reconfigurable processor for dynamic workloads. This is achieved by adapting to a single program to optimize energy efficiency, followed by adapting to a generated set of programs optimizing for throughput. Third, the real-time characteristics of the processor are evaluated and it is shown to have better schedulability compared to static processors. The VLIW architecture results in good timing-predictability, which allows finding tight bounds on the worst-case execution time. Last, we show that the processor is able to assign more parallel execution resources to a static program that is added into the workload, while still guaranteeing time-safety for critical tasks.

Increasing resource utilization in mixed-criticality systems using a polymorphic VLIW processor

Journal article (2018) - J.J. Hoozemans (author) , Jeroen Van Straten (author) , J.S.S.M. Wong (author)

Mixed-criticality systems need to provide strict guarantees to hard real-time tasks and simultaneously deliver high throughput for non-critical tasks. However, techniques to enhance performance more often than not affect the analyzability, e.g., caches, branch prediction, out-of ...

Evaluating Auto-adaptation Methods for Fine-grained Adaptable Processors

Conference paper (2018) - J.J. Hoozemans (author) , J. Van Straten (author) , Z. Al-Ars (author) , J.S.S.M. Wong (author)

To achieve energy savings while maintaining adequate performance, system designers and programmers wish to create the best possible match between program behavior and the underlying hardware. Well-known current approaches include DVFS and task migrations in heterogeneous platform ...

Dynamic Trade-off among Fault Tolerance, Energy Consumption, and Performance on a Multiple-issue VLIW Processor

Journal article (2018) - A.L. Sartor (author) , Pedro H. Exenberger Becker (author) , J.J. Hoozemans (author) , J.S.S.M. Wong (author) , A.C. Schneider Beck (author)

In the design of modern-day processors, energy consumption and fault tolerance have gained significant importance next to performance. This is caused by battery constraints, thermal design limits, and higher susceptibility to errors as transistor feature sizes are decreasing. How ...

Using a Polymorphic VLIW Processor to Improve Schedulability and Performance for Mixed-criticality Systems

Conference paper (2017) - J.J. Hoozemans (author) , Jeroen Van Straten (author) , J.S.S.M. Wong (author)

As embedded systems are faced with ever more demanding workloads and more tasks are being consolidated onto a smaller number of microcontrollers, system designers are faced with opposing requirements of increasing performance while retaining real-time analyzability. For example, ...

Improved Dynamic Cache Sharing for Communicating Threads on a Runtime-Adaptable Processor

Abstract (2017) - J.J. Hoozemans (author) , A.F. Lorenzon (author) , A.C. Schneider Beck Filho (author) , J.S.S.M. Wong (author)

Multi-threaded applications execute their threads on different cores with their own local caches and need to share data among the threads. Shared caches are used to avoid lengthy and costly main memory accesses. The degree of cache sharing is a balance between reducing m ...

VLIW-Based FPGA Computation Fabric with Streaming Memory Hierarchy for Medical Imaging Applications

Conference paper (2017) - J.J. Hoozemans (author) , R. Heij (author) , J. Van Straten (author) , Z. Al-Ars (author)

In this paper, we present and evaluate an FPGA acceleration fabric that uses VLIW softcores as processing elements, combined with a memory hierarchy that is designed to stream data between intermediate stages of an image processing pipeline. These pipelines are commonplace in ...

Exploring ILP and TLP on a Polymorphic VLIW Processor

Conference paper (2017) - A.C.C. Brandon (author) , J.J. Hoozemans (author) , J. van Straten (author) , J.S.S.M. Wong (author)

In today’s computing environments, the concurrent execution of multiple applications/threads is common and multi-cores are very well-suited to handle such workloads. However, they suffer from the fact that any mismatch between the application’s inherent instruction-level para ...

A Sparse VLIW Instruction Encoding Scheme Compatible with Generic Binaries

Abstract (2016) - Anthony Brandon (author) , J.J. Hoozemans (author) , Jeroen van Straten (author) , A.F. Lorenzon (author) , A.L. Sartor (author) , A.C. Schneider Beck (author) , J.S.S.M. Wong (author)

A sparse VLIW instruction encoding scheme compatible with generic binaries

Conference paper (2015) - AAC Brandon (author) , J.J. Hoozemans (author) , J. van Straten (author) , A.F. Lorenzon (author) , A.L. Sartor (author) , Antonio Carlos Schneider Beck (author) , J.S.S.M. Wong (author)

Very Long Instruction Word (VLIW) processors are commonplace in embedded systems due to their inherent low power consumption, as the instruction scheduling is performed by the compiler instead of by sophisticated and power-hungry hardware instruction schedulers used in their RISC counterparts. This is achieved by maximizing resource utilization by only targeting a certain application domain. However, when the inherent application ILP (instruction-level parallelism) is low, resources are under-utilized/wasted and the encoding of NOPs results in large code sizes and consequently additional pressure on the memory subsystem to store these NOPs. To address the resource-utilization issue, we proposed a dynamic VLIW processor design that can merge unused resources to form additional cores to execute more threads. Therefore, the formation of cores can result in issue widths of 2, 4, and 8. Without sacrificing the possibility of code interruptability and resumption, we proposed a generic binary scheme that allows a single binary to be executed on these different issue-width cores. However, the code size issue remains, as the generic binary scheme slightly increases the number of NOPs even further. Therefore, in this paper, we propose to apply a well-known stop-bit code compression technique to the generic binaries that, most importantly, maintains its code compatibility characteristic allowing it to be executed on different cores. In addition, we present the hardware designs to support this technique in our dynamic core. For prototyping purposes, we implemented our design on a Xilinx Virtex-6 FPGA device and executed 14 embedded benchmarks. For comparison, we selected a non-dynamic/static VLIW core that incorporates a similar stop-bit technique for its code compression.
We demonstrate, while maintaining code compatibility on top of a flexible dynamic VLIW processor, that the code size can be significantly reduced (up to 80%) resulting in energy savings, and that the performance can be increased (up to a factor of three). Finally, our experimental results show that we can use smaller caches (2 to 4 times as small), which will further help in decreasing energy consumption.
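
The general idea behind stop-bit compression mentioned in this abstract can be illustrated with a small sketch. This is not the encoding or the hardware from the paper: the operation names, the `NOP` marker, and the `compress`/`decompress` helpers are hypothetical. The sketch also simplifies by packing the surviving operations into the leading slots of a bundle, so it does not preserve the original lane of each operation.

```python
NOP = "nop"  # hypothetical marker for an empty issue slot


def compress(bundles):
    """Drop NOP slots and tag every kept operation with a stop bit.

    A stop bit of 1 marks the last operation of a bundle (one cycle).
    """
    stream = []
    for bundle in bundles:
        ops = [op for op in bundle if op != NOP]
        if not ops:
            # An all-NOP bundle still occupies a cycle, so keep one NOP.
            ops = [NOP]
        for i, op in enumerate(ops):
            stream.append((op, 1 if i == len(ops) - 1 else 0))
    return stream


def decompress(stream, issue_width):
    """Rebuild fixed-width bundles, padding unused slots with NOPs."""
    bundles, current = [], []
    for op, stop in stream:
        current.append(op)
        if stop:
            if len(current) > issue_width:
                raise ValueError("bundle wider than the issue width")
            bundles.append(current + [NOP] * (issue_width - len(current)))
            current = []
    if current:
        raise ValueError("stream ended without a stop bit")
    return bundles


# Three 4-issue bundles with low ILP: 12 slots, only 4 useful operations.
program = [
    ["add", NOP, NOP, "ld"],
    [NOP, NOP, NOP, NOP],
    ["mul", "sub", NOP, NOP],
]
stream = compress(program)
restored = decompress(stream, issue_width=4)
print(len(stream), "stored operations instead of",
      sum(len(b) for b in program), "slots")
```

In this toy program the compressed stream stores one entry per useful operation plus one NOP for the empty cycle, which shows why low-ILP code benefits most from the technique.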

Multiple contexts in a multi-ported VLIW register file implementation

Conference paper (2015) - J.J. Hoozemans (author) , Jens Johansen (author) , Jeroen van Straten (author) , Anthony Brandon (author) , J.S.S.M. Wong (author)

The register file is an expensive component in the design of any processor, especially, when considering the additional ports that are needed to support multiple datapaths within a wide-issue VLIW processor. In a recent work, these additional resources were used to dynamically re ...