A.C.C. Brandon | TU Delft Repository

Exploring ILP and TLP on a Polymorphic VLIW Processor

Conference paper (2017) - Anthony Brandon, Joost Hoozemans, Jeroen van Straten, Stephan Wong

In today’s computing environments, the concurrent execution of multiple applications/threads is common and multi-cores are very
well-suited to handle such workloads. However, they suffer from the fact that any mismatch between the application’s inherent instruction-level parallelism (ILP) and the core’s parallelism leads to unused resources or loss in performance. An accepted solution is to include several types of cores and match them dynamically depending on the performance needs of the application. This approach becomes less efficient when the number of cores does not match the number of parallel threads. Furthermore,
the heterogeneity of (fixed) cores cannot be increased indefinitely as it would result in even higher degrees of mismatching and increased movement of instruction and data streams. In this paper, we are proposing a polymorphic processor, based on VLIW architectures, that can adapt its issue-width during runtime. By design, the processor can be perceived as a single wide core (8-issue VLIW) or two medium-wide cores (4-issue) or four small cores (2-issue) that can run high-ILP/low DLP, medium-ILP/medium DLP, and low-ILP/high-DLP applications, respectively. Furthermore, we are executing one single generic binary while performing these reconfigurations. In order to show the effectiveness of our approach, we synthesized different versions of the core to represent fixed heterogeneous cores and compared them to the dynamic implementation of the core. Our experiments show that the dynamically adaptive solution performs on average 7% faster and uses 5% less area than a platform which consists of fixed cores with 1.5× as many datapaths. ...

A Sparse VLIW Instruction Encoding Scheme Compatible with Generic Binaries

Abstract (2016) - Anthony Brandon, Joost Hoozemans, Jeroen van Straten, Arthur Lorenzon, Anderson Sartor, Antonio Carlos Schneider Beck Filho, Stephan Wong

Run-time Phase Prediction for a Reconfigurable VLIW Processor

Conference paper (2016) - Qi Guo, Anderson Sartor, Anthony Brandon, Antonio C.S. Beck, Xuehai Zhou, Stephan Wong

It is well-known that different applications exhibit varying amounts of ILP. Execution of these applications on the same fixed-width VLIW processor will result (1) in wasted energy due to underutilized resources if the issue-width of the processor is larger than the inherent ILP; or alternatively, (2) in lower performance if the issue-width is smaller than the inherent ILP. Moreover, even within a single application distinct phases can be observed with varying ILP and therefore changing resource requirements.With this in mind, we designed the ρ-VEX processor, which is a VLIW processor that can change its issuewidth
at run-time. In this paper, we propose a novel scheme to dynamically (i.e., at run-time) optimize the resource utilization by predicting and matching the number of active data-paths for each application phase. The purpose is to achieve low energy consumption for applications with low ILP, and high performance for applications with high ILP, on a single VLIW processor design. We prototyped the ρ-VEX processor on an FPGA and obtained the dynamic traces of applications running on top of a Linux port. Our results show that it is possible in some cases to achieve the performance of an 8-issue core with 10% lower energy consumption, while in others we achieve the energy consumption
of a 2-issue core with close to 20% lower execution time. ...

Multiple contexts in a multi-ported VLIW register file implementation

Conference paper (2015) - Joost Hoozemans, Jens Johansen, Jeroen Van Straten, Anthony Brandon, Stephan Wong

The register file is an expensive component in the design of any processor, especially, when considering the additional ports that are needed to support multiple datapaths within a wide-issue VLIW processor. In a recent work, these additional resources were used to dynamically reconfigure the register file to support a dynamically reconfigurable VLIW core. The design can be perceived as a single 8-issue, two 4-issue, or four 2-issue VLIW cores. Consequently, the multi-ported design can operate in different modes, namely as one, two, or four register files, respectively, corresponding to the active number of cores. The implementation of the register file design on FPGAs using Block RAMs still results in unused resources due to the coarseness of the Block RAMs. In this paper, we propose to re-purpose these unused BRAM resources to additionally support multiple contexts next to earliermentioned modes. In this manner, the 8-issue, 4-issue, and 2- issue cores have access to 4, 2, and 1 contexts, respectively. Consequently, we can avoid saving and restoring of the task states in a multi-task environment, turning context switching from a traditionally time-consuming event to an almost instantaneous event. The advantage of this is the reduction of interrupt latency and task switching latency, which are important in real-time and embedded systems. Our results show that our technique can improve interrupt latency by a factor of 17.4x compared to using a software register spill routine, depending on the behavior of the memory system. Likewise, the task switching time can be improved by 6.7x. ...

A sparse VLIW instruction encoding scheme compatible with generic binaries

Conference paper (2015) - Anthony Brandon, Joost Hoozemans, Jeroen Van Straten, Arthur Lorenzon, Anderson Sartor, Antonio Carlos Schneider Beck, Stephan Wong

Very Long Instruction Word (VLIW) processors are commonplace in embedded systems due to their inherent lowpower consumption as the instruction scheduling is performed by the compiler instead by sophisticated and power-hungry hardware instruction schedulers used in their RISC counterparts. This is achieved by maximizing resource utilization by only targeting a certain application domain. However, when the inherent application ILP (instruction-level parallelism) is low, resources are under-utilized/wasted and the encoding of NOPs results in large code sizes and consequently additional pressure on the memory subsystem to store these NOPs. To address the resource-utilization issue, we proposed a dynamic VLIW processor design that can merge unused resources to form additional cores to execute more threads. Therefore, the formation of cores can result in issue widths of 2, 4, and 8. Without sacrificing the possibility of code interruptability and resumption, we proposed a generic binary scheme that allows a single binary to be executed on these different issue-width cores. However, the code size issue remains as the generic binary scheme even slightly further increases the number NOPS. Therefore, in this paper, we propose to apply a well-known stop-bit code compression technique to the generic binaries that, most importantly, maintains its code compatibility characteristic allowing it to be executed on different cores. In addition, we present the hardware designs to support this technique in our dynamic core. For prototyping purposes, we implemented our design on a Xilinx Virtex-6 FPGA device and executed 14 embedded benchmarks. For comparison, we selected a nondynamic/ static VLIW core that incorporates a similar stop-bit technique for its code compression. We demonstrate, while maintaining code compatibility on top of a flexible dynamic VLIW processor, that the code size can be significantly reduced (up to 80%) resulting in energy savings, and that the performance can be increased (up to a factor of three). Finally, our experimental results show that we can use smaller caches (2 to 4 times as small), which will further help in decreasing energy consumption. ...

Very Long Instruction Word (VLIW) processors are commonplace in embedded systems due to their inherent lowpower consumption as the instruction scheduling is performed by the compiler instead by sophisticated and power-hungry hardware instruction schedulers used in their RISC counterparts. This is achieved by maximizing resource utilization by only targeting a certain application domain. However, when the inherent application ILP (instruction-level parallelism) is low, resources are under-utilized/wasted and the encoding of NOPs results in large code sizes and consequently additional pressure on the memory subsystem to store these NOPs. To address the resource-utilization issue, we proposed a dynamic VLIW processor design that can merge unused resources to form additional cores to execute more threads. Therefore, the formation of cores can result in issue widths of 2, 4, and 8. Without sacrificing the possibility of code interruptability and resumption, we proposed a generic binary scheme that allows a single binary to be executed on these different issue-width cores. However, the code size issue remains as the generic binary scheme even slightly further increases the number NOPS. Therefore, in this paper, we propose to apply a well-known stop-bit code compression technique to the generic binaries that, most importantly, maintains its code compatibility characteristic allowing it to be executed on different cores. In addition, we present the hardware designs to support this technique in our dynamic core. For prototyping purposes, we implemented our design on a Xilinx Virtex-6 FPGA device and executed 14 embedded benchmarks. For comparison, we selected a nondynamic/ static VLIW core that incorporates a similar stop-bit technique for its code compression. We demonstrate, while maintaining code compatibility on top of a flexible dynamic VLIW processor, that the code size can be significantly reduced (up to 80%) resulting in energy savings, and that the performance can be increased (up to a factor of three). Finally, our experimental results show that we can use smaller caches (2 to 4 times as small), which will further help in decreasing energy consumption.

Reconfigurable acceleration and dynamic partial self-reconfiguration in general purpose computing

Conference paper (2011) - Ioannis Sourdis, Abhijit Nandy, Venkatasubramanian Viswanathan, Anthony Brandon, Dimitris Theodoropoulos, Georgi N. Gaydadjiev

In this paper, we describe a generic approach for integrating a dynamically reconfigurable device into a general purpose system interconnected with a high-speed link. The system can dynamically install and execute hardware instances of functions to accelerate parts of a given software code. The hardware descriptions of the functions (bitstreams) are inserted into the executable binary running on the system. Our compiler further inserts system-calls to the software code to control the reconfigurable device. Thereby, the general purpose host-processor of the system manages the hardware reconfiguration and execution through a Linux device driver. The device has direct access to the main memory (DMA) operating in the virtual address space; it further supports memory mapped IO for data and control, and is able to raise and handle interrupts for synchronization. The above system is implemented on a general purpose machine providing a HyperTransport bus to connect a Xilinx Virtex4-100 FPGA, an AMD Opteron-244, and 1 GB of DDR main memory. We evaluate our proposal using a secure audio processing application. We accelerate in hardware the Audio processing kernel as well as the subsequent AES encryption function via dynamic partial self-reconfiguration. The proposed system achieves a 12 speedup over a software for the application at hand. ...