C.B. Ciobanu | TU Delft Repository

Accelerating a geometrical approximated PCA algorithm using AVX2 and CUDA

Journal article (2020) - Alina L. Machidon, Octavian M. Machidon, Cătălin B. Ciobanu, Petre L. Ogrutan

Remote sensing data has known an explosive growth in the past decade. This has led to the need for efficient dimensionality reduction techniques, mathematical procedures that transform the high-dimensional data into a meaningful, reduced representation. Projection Pursuit (PP) based algorithms were shown to be efficient solutions for performing dimensionality reduction on large datasets by searching low-dimensional projections of the data where meaningful structures are exposed. However, PP faces computational difficulties in dealing with very large datasets-which are common in hyperspectral imaging, thus raising the challenge for implementing such algorithms using the latest High Performance Computing approaches. In this paper, a PP-based geometrical approximated Principal Component Analysis algorithm (gaPCA) for hyperspectral image analysis is implemented and assessed on multi-core Central Processing Units (CPUs), Graphics Processing Units (GPUs) and multi-core CPUs using Single Instruction, Multiple Data (SIMD) AVX2 (Advanced Vector eXtensions) intrinsics, which provide significant improvements in performance and energy usage over the single-core implementation. Thus, this paper presents a cross-platform and cross-language perspective, having several implementations of the gaPCA algorithm in Matlab, Python, C++ and GPU implementations based on NVIDIA Compute Unified Device Architecture (CUDA). The evaluation of the proposed solutions is performed with respect to the execution time and energy consumption. The experimental evaluation has shown not only the advantage of using CUDA programming in implementing the gaPCA algorithm on a GPU in terms of performance and energy consumption, but also significant benefits in implementing it on the multi-core CPU using AVX2 intrinsics. ...

On Parallelizing Geometrical PCA Approximation

Conference paper (2019) - Alina Lumini Ia Machidon, Catalin Bogdan Ciobanu, Octavian Mihai Machidon, Petre Lucian Ogrutan

Remote sensing data has known an explosive growth in the past decade. This has led to the need for efficient dimensionality reduction techniques, mathematical procedures that transform the high-dimensional data into a meaningful, reduced representation. Principal Component Analysis (PCA) is a well-known dimensionality reduction technique used in the field of hyperspectral satellite images. However, PCA suffers from high computational costs and increased complexity, an issue that led to elaborating PCA adaptations capable of running on multi-core computing architectures. This paper proposes a parallel implementation of the geometrical PCA approximation (gaPCA) algorithm. Three parallel implementations are studied: two on multi-core CPUs and a NVIDIA Graphics Processing Units (GPU) CUDA accelerated implementation. Our results show significant speedups of the parallel implementations when applied on hyperspectral image datasets. Our results show that on the Intel Core i5 CPU, Python multi-core implementation is up to 2.01\times faster than its Matlab equivalent. Our GPU PyCUDA implementation is considerably faster than both our Python multi-core CPU implementations: up to 1.76\times faster than Intel Core i5-6200U and up to 5.72\times faster than the NVIDIA Jetson Nano quad-core ARM A53 CPU. We performed data analysis on the output data for the three methods and the maxim relative error was less than 0.001%. ...

Building High-Performance, Easy-to-Use Polymorphic Parallel Memories with HLS

Conference paper (2019) - L. Stornaiuolo, M. Rabozzi, M. D. Santambrogio, D. Sciuto, C. B. Ciobanu, G. Stramondo, A. L. Varbanescu

With the increased interest in energy efficiency, a lot of application domains experiment with Field Programmable Gate Arrays (FPGAs), which promise customized hardware accelerators with high-performance and low power consumption. These experiments possible due to the development of High-Level Languages (HLLs) for FPGAs, which permit non-experts in hardware design languages (HDLs) to program reconfigurable hardware for general purpose computing. However, some of the expert knowledge remains difficult to integrate in HLLs, eventually leading to performance loss for HLL-based applications. One example of such a missing feature is the efficient exploitation of the local memories on FPGAs. A solution to address this challenge is PolyMem, an easy-to-use polymorphic parallel memory that uses BRAMs. In this work, we present HLS-PolyMem, the first complete implementation and in-depth evaluation of PolyMem optimized for the Xilinx Design Suite. Our evaluation demonstrates that HLS-PolyMem is a viable alternative to HLS memory partitioning, the current approach for memory parallelism in Vivado HLS. Specifically, we show that PolyMem offers the same performance as HLS partitioning for simple access patterns, and outperforms partitioning as much as 13x when combining multiple access patterns for the same data structure. We further demonstrate the use of PolyMem for two different case studies, highlighting the superior capabilities of HLS-PolyMem in terms of performance, resource utilization, flexibility, and usability. Based on all the evidence provided in this work, we conclude that HLS-PolyMem enables the efficient use of BRAMs as parallel memories, without compromising the HLS level or the achievable performance. ...

The Case for Polymorphic Registers in Dataflow Computing

Journal article (2018) - Catalin Bogdan Ciobanu, Georgi Gaydadjiev, Christian Pilato, Donatella Sciuto

Heterogeneous systems are becoming increasingly popular, delivering high performance through hardware specialization. However, sequential data accesses may have a negative impact on performance. Data parallel solutions such as Polymorphic Register Files (PRFs) can potentially accelerate applications by facilitating high-speed, parallel access to performance-critical data. This article shows how PRFs can be integrated into dataflow computational platforms. Our semi-automatic, compiler-based methodology generates customized PRFs and modifies the computational kernels to efficiently exploit them. We use a separable 2D convolution case study to evaluate the impact of memory latency and bandwidth on performance compared to a state-of-the-art NVIDIA Tesla C2050 GPU. We improve the throughput up to 56.17X and show that the PRF-augmented system outperforms the GPU for 9×9 or larger mask sizes, even in bandwidth-constrained systems. ...

EXTRA

Towards the exploitation of eXascale technology for reconfigurable architectures

Conference paper (2016) - Dirk Stroobandt, Ana Lucia Varbanescu, Elias Vansteenkiste, Wayne Luk, Marco D. Santambrogio, Donatella Sciuto, Michael Huebner, Tobias Becker, Georgi Gaydadjiev, Antonis Nikitakis, Alex J.W. Thom, Catalin Bogdan Ciobanu, Muhammed Al Kadi, Andreas Brokalakis, George Charitopoulos, Tim Todman, Xinyu Niu, Dionisios Pnevmatikatos, Amit Kulkarni

To handle the stringent performance requirements of future exascale-class applications, High Performance Computing (HPC) systems need ultra-efficient heterogeneous compute nodes. To reduce power and increase performance, such compute nodes will require hardware accelerators with a high degree of specialization. Ideally, dynamic reconfiguration will be an intrinsic feature, so that specific HPC application features can be optimally accelerated, even if they regularly change over time. In the EXTRA project, we create a new and flexible exploration platform for developing reconfigurable architectures, design tools and HPC applications with run-time reconfiguration built-in as a core fundamental feature instead of an add-on. EXTRA covers the entire stack from architecture up to the application, focusing on the fundamental building blocks for run-time reconfigurable exascale HPC systems: new chip architectures with very low reconfiguration overhead, new tools that truly take reconfiguration as a central design concept, and applications that are tuned to maximally benefit from the proposed run-time reconfiguration techniques. Ultimately, this open platform will improve Europe's competitive advantage and leadership in the field. ...

Real-time olivary neuron simulations on dataflow computing machines

Conference paper (2014) - Georgios Smaragdos, Craig Davies, Christos Strydis, Ioannis Sourdis, Cǎtǎlin Ciobanu, Oskar Mencer, Chris I. De Zeeuw

The Inferior-Olivary nucleus (ION) is a well-charted brain region, heavily associated with the sensorimotor control of the body. It comprises neural cells with unique properties which facilitate sensory processing and motor-learning skills. Simulations of such neurons become rapidly intractable when biophysically plausible models and meaningful network sizes (at least in the order of some hundreds of cells) are modeled. To overcome this problem, we accelerate a highly detailed ION network model using a Maxeler Dataflow Computing Machine. The design simulates a 330-cell network at real-time speed and achieves maximum throughputs of 24.7 GFLOPS. The Maxeler machine, integrating a Virtex-6 FPGA, yields speedups of ×92-102, and ×2-8 compared to a reference-C implementation, running on a Intel Xeon 2.66GHz, and a pure Virtex-7 FPGA implementation, respectively. ...

Dataflow computing with polymorphic registers

Conference paper (2013) - Cǎtǎlin Ciobanu, Georgi Gaydadjiev, Christian Pilato, Donatella Sciuto

Heterogeneous systems are becoming increasingly popular for data processing. They improve performance of simple kernels applied to large amounts of data. However, sequential data loads may have negative impact. Data parallel solutions such as Polymorphic Register Files (PRFs) can potentially accelerate applications by facilitating high speed, parallel access to performance-critical data. Furthermore, by PRF customization, specific data path features are exposed to the programmer in a very convenient way. PRFs allow additional control over the registers dimensions, and the number of elements which can be simultaneously accessed by computational units. This paper shows how PRFs can be integrated in dataflow computational platforms. In particular, starting from an annotated source code, we present a compiler-based methodology that automatically generates the customized PRFs and the enhanced computational kernels that efficiently exploit them. ...

Separable 2D convolution with polymorphic register files

Conference paper (2013) - CB Ciobanu, GN Gaydadjiev

FASTER run-time reconfiguration management

Conference paper (2013) - Cǎtǎlin Bogdan Ciobanu, Dionisios N. Pnevmatikatos, Kyprianos D. Papadimitriou, Georgi N. Gaydadjiev

The FASTER project Run-Time System Manager offloads programmers from low-level operations by performing task placement, scheduling, and dynamic FPGA reconfiguration. It also manages device fragmentation, configuration caching, pre-fetching and reuse, bitstream compression, and optimizes the system thermal and power footprints. We propose a micro-reconfiguration aware, configuration content agnostic ISA interface and a technology independent Task Configuration Microcode format targeting Maxeler Data Flow computers and Xilinx XUPV5 platforms. We achieve improved resource utilization with negligible performance overhead. Up to 4Gbps for DMA transfers, and up to 3Gbps for FPGA reconfiguration on Xilinx Virtex-5/6 devices is achieved. ...

Novel design methods and a tool flow for unleashing dynamic reconfiguration

Conference paper (2012) - K. Papadimitriou, C. Pilato, W. Luk, D. Stroobandt, D. Pnevmatikatos, M. D. Santambrogio, C. Ciobanu, T. Todman, T. Becker, T. Davidson, X. Niu, G. Gaydadjiev

During the last few years, there is an increasing interest in mixing software and hardware to serve efficiently different applications. This is due to the heterogeneity characterizing the tasks of an application which require the presence of resources from both worlds, software and hardware. Controlling effectively these resources through an integrated tool flow is a challenging problem and towards this direction only a few efforts exist. In fact, a framework that seamlessly exploits both resources of a platform for executing efficiently an application has not yet come into existence. Moreover, reconfigurable computing often incorporated in such platforms due to its high flexibility and customization, has not yet taken off due to the lack of exploiting its full capabilities. Thus, the capability of reconfigurable devices such as Field Programmable Gate Arrays (FPGAs) to be dynamically reconfigured, i.e. reprogramming part of the chip while other parts of the same chip remain functional, has not yet taken off even in small-scale basis. The inherent difficulty in using the tools to control this technology has kept it back from being adopted by academia and industry alike. The FASTER (Facilitating Analysis and Synthesis Technologies for Effective Reconfiguration) project aims at introducing a design methodology and a tool flow that will enable designers to implement effectively and easily a system specification on a platform combining software and reconfigurable resources. The FASTER framework accepts as input a high-level description of the application and the architectural details of the target platform, and through certain steps it can enable the full use of the capabilities of the platform, while at the same time it should be flexible enough so as to balance efficiently performance, power and area. One of the main novelties is the incorporation of partial reconfiguration as an explicit design concept at an early stage of the design flow. We target different applications from the embedded, desktop and high-performance computing domains. In all cases we will demonstrate the effectiveness of the proposed framework in exploiting the inherent parallelism of applications and enabling the runtime adaptation of the platforms to the changing needs of the applications. ...

During the last few years, there is an increasing interest in mixing software and hardware to serve efficiently different applications. This is due to the heterogeneity characterizing the tasks of an application which require the presence of resources from both worlds, software and hardware. Controlling effectively these resources through an integrated tool flow is a challenging problem and towards this direction only a few efforts exist. In fact, a framework that seamlessly exploits both resources of a platform for executing efficiently an application has not yet come into existence. Moreover, reconfigurable computing often incorporated in such platforms due to its high flexibility and customization, has not yet taken off due to the lack of exploiting its full capabilities. Thus, the capability of reconfigurable devices such as Field Programmable Gate Arrays (FPGAs) to be dynamically reconfigured, i.e. reprogramming part of the chip while other parts of the same chip remain functional, has not yet taken off even in small-scale basis. The inherent difficulty in using the tools to control this technology has kept it back from being adopted by academia and industry alike. The FASTER (Facilitating Analysis and Synthesis Technologies for Effective Reconfiguration) project aims at introducing a design methodology and a tool flow that will enable designers to implement effectively and easily a system specification on a platform combining software and reconfigurable resources. The FASTER framework accepts as input a high-level description of the application and the architectural details of the target platform, and through certain steps it can enable the full use of the capabilities of the platform, while at the same time it should be flexible enough so as to balance efficiently performance, power and area. One of the main novelties is the incorporation of partial reconfiguration as an explicit design concept at an early stage of the design flow. We target different applications from the embedded, desktop and high-performance computing domains. In all cases we will demonstrate the effectiveness of the proposed framework in exploiting the inherent parallelism of applications and enabling the runtime adaptation of the platforms to the changing needs of the applications.

Scalability Evaluation of a Polymorphic Register File: a CG Case Study

Conference paper (2011) - CB Ciobanu, X Martorell, GK Kuzmanov, A Ramirez, GN Gaydadjiev