W. Luk | TU Delft Repository

Application of Computational Modelling to Particle Physics

Journal article (2025) - Marco Barbone (author) , Alexander Howard (author) , Mihaly Novak (author) , W. Luk (author) , G. Gaydadjiev (author) , Alex Tapper (author)

This study introduces a methodology for forecasting accelerator performance in Particle Physics algorithms. Accelerating applications can require significant engineering effort, prototyping and measuring the speedup that might finally result in disappointing accelerator performan ...

Embedded Continual Learning for High-Energy Physics

Journal article (2024) - Marco Barbone (author) , Christopher Brown (author) , G. Gaydadjiev (author) , Thomas Maguire (author) , Mikael Mieskolainen (author) , Benjamin Radburn-Smith (author) , Wayne Luk (author) , Alexander Tapper (author)

Neural Networks (NN) are often trained offline on large datasets and deployed on specialised hardware for inference, with a strict separation between training and inference. However, in many realistic applications the training environment differs from the real world, or data arrives in a streaming fashion and is continuously changing. In these scenarios, the ability to continuously train and update NN models is desirable. Continual learning (CL) algorithms allow training of models on a stream of data. CL algorithms are often designed to work in constrained settings, such as limited memory and computational power, or limitations on the ability to store past data (e.g., due to privacy concerns or memory requirements). High-energy physics experiments are developing intelligent detectors, with algorithms running on computer systems located close to the detector to meet the challenges of increased data rates and occupancies. The use of NN algorithms in this context is limited by changing detector conditions, such as degradation over time or failure of an input signal, which might cause the NNs to lose accuracy, leading in the worst case to the loss of interesting events. CL has the potential to solve this issue, using large amounts of continuously streaming data to allow the network to recognise changes, and to learn and adapt to detector conditions. It has the potential to outperform traditional NN training techniques, as not all possible scenarios can be predicted and modelled in static training data samples. However, NN training is computationally expensive and, when combined with the strict timing requirements of embedded processors deployed close to the detector, current state-of-the-art offline approaches cannot be directly applied to real-time systems.
Alternatives to typical backpropagation-based training that can be deployed on FPGAs for real-time data processing are presented, and their computational and accuracy characteristics are discussed in the context of High-Luminosity LHC.
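
As a rough illustration of what training on a stream without storing past data means, the sketch below updates a one-weight logistic model one sample at a time and then discards the sample. It uses plain online gradient descent, which is not one of the backpropagation alternatives presented in the paper, and the names `online_update` and `run_stream`, the learning rate and the synthetic "changed conditions" are all assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def online_update(w, x, y, lr=0.5):
    """One gradient step on a single sample; the sample is not kept."""
    p = sigmoid(w * x)
    return w + lr * (y - p) * x


def run_stream(w, n, label_fn, rng):
    """Consume n streamed samples, holding only the current weight in memory."""
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        w = online_update(w, x, label_fn(x))
    return w


rng = random.Random(0)
# Phase 1 (hypothetical): positive inputs are labelled as signal.
w_before = run_stream(0.0, 2000, lambda x: 1.0 if x > 0 else 0.0, rng)
# Phase 2 (hypothetical): conditions change and the relationship flips.
w_after = run_stream(w_before, 2000, lambda x: 1.0 if x < 0 else 0.0, rng)
print(w_before > 0, w_after < 0)
```

The weight changes sign after the simulated change in conditions, showing the model adapting from the stream alone; a statically trained model would keep the phase 1 weight.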

Fast, high-quality pseudo random number generators for heterogeneous computing

Journal article (2024) - Marco Barbone (author) , G. Gaydadjiev (author) , Alexander Howard (author) , Wayne Luk (author) , George Savvidy (author) , Konstantin Savvidy (author) , Andrew Rose (author) , Alex Tapper (author)

Random number generation is key to many applications in a wide variety of disciplines. Depending on the application, the quality of the random numbers from a particular generator can directly impact both computational performance and, critically, the outcome of the calculation. High-energy physics applications use Monte Carlo simulations and machine learning widely, both of which require high-quality random numbers. In recent years, to meet increasing performance requirements, many high-energy physics workloads have come to leverage GPU acceleration. While a wide variety of generators with different performance and quality characteristics exists on CPUs, the same cannot be said for GPU and FPGA accelerators. On GPUs, the most common implementation is provided by cuRAND, an NVIDIA library that is not open source or peer reviewed by the scientific community. The highest-quality generator implemented in cuRAND is a version of the Mersenne Twister. Given the availability of better and faster random number generators, high-energy physics moved away from Mersenne Twister several years ago, and nowadays MIXMAX is the standard generator in Geant4 via CLHEP. The original MIXMAX design supports parallel streams with a seeding algorithm that makes it especially suited for GPUs and FPGAs, where extreme parallelism is a key factor. In this study we implement the MIXMAX generator on both architectures and analyze its suitability and applicability for accelerator implementations. We evaluated the results against “Mersenne Twister for a Graphic Processor” (MTGP32) on GPUs, which resulted in 5, 13 and 14 times higher throughput when vector spaces of size 240, 17 and 8 were used, respectively. The MIXMAX generator, coded in VHDL and implemented on Xilinx UltraScale+ FPGAs, requires 50% fewer total Look Up Tables (LUTs) compared to a 32-bit Mersenne Twister (MT19937), or 75% fewer LUTs per output bit.
In summary, the state-of-the-art MIXMAX pseudo random number generator has been implemented on GPU and FPGA platforms and its performance benchmarked.
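
The two LUT figures quoted for the FPGA implementation are mutually consistent if the MIXMAX core emits twice as many bits per output word as the 32-bit Mersenne Twister. As an illustrative check, assume a 64-bit MIXMAX output word (this width is an assumption, not stated in the abstract) and write L_MT and L_MX for the total LUT counts of the two designs (symbols introduced only for this sketch):

```latex
\frac{L_{\mathrm{MX}} / 64}{L_{\mathrm{MT}} / 32}
  = \frac{0.5\, L_{\mathrm{MT}}}{64} \cdot \frac{32}{L_{\mathrm{MT}}}
  = 0.25
```

Under that assumption, halving the total LUT count while doubling the output width gives a per-bit cost of 25% of the Mersenne Twister's, i.e. 75% fewer LUTs per output bit.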

Accelerating Large-Scale Graph Processing with FPGAs

Lesson Learned and Future Directions

Conference paper (2024) - Marco Procaccini (author) , Amin Sahebi (author) , Marco Barbone (author) , W. Luk (author) , G. Gaydadjiev (author) , Roberto Giorgi (author)

Processing graphs on a large scale presents a range of difficulties, including irregular memory access patterns, device memory limitations, and the need for effective partitioning in distributed systems, all of which can lead to performance problems on traditional architectures s ...

Accelerating 4D image reconstruction for magnetic resonance-guided radiotherapy

Journal article (2023) - Bastien Lecoeur (author) , Marco Barbone (author) , Jessica Gough (author) , Uwe Oelfke (author) , Wayne Luk (author) , G. Gaydadjiev (author) , Andreas Wetscherek (author)

Background and purpose: Physiological motion impacts the dose delivered to tumours and vital organs in external beam radiotherapy and particularly in particle therapy. The excellent soft-tissue demarcation of 4D magnetic resonance imaging (4D-MRI) could inform on intra-fractional ...

Distributed large-scale graph processing on FPGAs

Journal article (2023) - Amin Sahebi (author) , Marco Barbone (author) , Marco Procaccini (author) , Wayne Luk (author) , G. Gaydadjiev (author) , Roberto Giorgi (author)

Processing large-scale graphs is challenging due to the nature of the computation, which causes irregular memory access patterns. Managing such irregular accesses may cause significant performance degradation on both CPUs and GPUs. Thus, recent research trends propose graph processing acceleration with Field-Programmable Gate Arrays (FPGAs). FPGAs are programmable hardware devices that can be fully customised to perform specific tasks in a highly parallel and efficient manner. However, FPGAs have a limited amount of on-chip memory that cannot fit the entire graph. Due to the limited device memory size, data needs to be repeatedly transferred to and from the FPGA on-chip memory, which makes data transfer time dominate over the computation time. A possible way to overcome the FPGA accelerators’ resource limitation is to engage a multi-FPGA distributed architecture and use an efficient partitioning scheme. Such a scheme aims to increase data locality and minimise communication between different partitions. This work proposes an FPGA processing engine that overlaps, hides and customises all data transfers so that the FPGA accelerator is fully utilised. This engine is integrated into a framework for using FPGA clusters and is able to use an offline partitioning method to facilitate the distribution of large-scale graphs. The proposed framework uses Hadoop at a higher level to map a graph to the underlying hardware platform. The higher layer of computation is responsible for gathering the blocks of data that have been pre-processed and stored on the host’s file system and distributing them to a lower layer of computation made of FPGAs. We show how graph partitioning combined with an FPGA architecture leads to high performance, even when the graph has millions of vertices and billions of edges.
In the case of the PageRank algorithm, widely used for ranking the importance of nodes in a graph, our implementation is the fastest among state-of-the-art CPU and GPU solutions, achieving a speedup of 13x compared to 8x and 3x, respectively. Moreover, in the case of large-scale graphs, the GPU solution fails due to memory limitations, while the CPU solution achieves a speedup of 12x compared to the 26x achieved by our FPGA solution. Other state-of-the-art FPGA solutions are 28 times slower than our proposed solution. When the size of a graph limits the performance of a single FPGA device, our performance model shows that using multiple FPGAs in a distributed system can further improve the performance by about 12x. This highlights the efficiency of our implementation for large datasets that do not fit in the on-chip memory of a hardware device.
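
For readers unfamiliar with PageRank, the following is a minimal software sketch of the standard power-iteration formulation over an edge list. It is not the FPGA engine or the partitioning scheme described in the abstract; the function name `pagerank`, the damping factor 0.85 and the tiny example graph are assumptions chosen for illustration.

```python
def pagerank(edges, num_nodes, damping=0.85, tol=1e-10, max_iter=100):
    """Power-iteration PageRank over a directed edge list of (src, dst) pairs."""
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in range(num_nodes) if out_degree[n] == 0)
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_rank = [base] * num_nodes
        # One pass over the edges: each source pushes a share of its rank.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical 4-node graph; node 2 has the most incoming links.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
ranks = pagerank(edges, num_nodes=4)
print(ranks)
```

The inner loop reads `rank[src]` and writes `new_rank[dst]` for arbitrary node pairs, which is the kind of irregular memory access the abstract identifies as the bottleneck on CPUs and GPUs.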

On Predictable Reconfigurable System Design

Journal article (2021) - Nils Voss (author) , Bas W. Kwaadgras (author) , Oskar Mencer (author) , W. Luk (author) , G. Gaydadjiev (author)

We propose a design methodology to facilitate rigorous development of complex applications targeting reconfigurable hardware. Our methodology relies on analytical estimation of system performance and area utilisation for a given specific application and a particular system instan ...

Efficient Table-Based Polynomial on FPGA

Conference paper (2021) - Marco Barbone (author) , Bas W. Kwaadgras (author) , Uwe Oelfke (author) , Wayne Luk (author) , G. Gaydadjiev (author)

Field Programmable Gate Arrays (FPGAs) are gaining popularity in the context of scientific computing due to the recent advances of High-Level Synthesis (HLS) toolchains for customised hardware implementations combined with the increase in computing capabilities of modern FPGAs. A ...

Efficient Online 4D Magnetic Resonance Imaging

Conference paper (2021) - Marco Barbone (author) , Andreas Wetscherek (author) , Thomas Yung (author) , Uwe Oelfke (author) , W. Luk (author) , G. Gaydadjiev (author)

Magnetic Resonance (MR)-guided online Adaptive RadioTherapy (MRgoART) utilises the excellent soft-tissue contrast of MR images taken just before the patient's treatment to quickly update and personalise radiotherapy treatment plans. Four-dimensional (4D) MR Imaging (MRI) can reso ...

Towards Real Time Radiotherapy Simulation

Journal article (2020) - Nils Voss (author) , Peter Ziegenhein (author) , Lukas Vermond (author) , J.J. Hoozemans (author) , Oskar Mencer (author) , Uwe Oelfke (author) , W. Luk (author) , G. Gaydadjiev (author)

We propose a novel reconfigurable hardware architecture to implement Monte Carlo based simulation of physical dose accumulation for intensity-modulated adaptive radiotherapy. The long term goal of our effort is to provide accurate dose calculation in real-time during patient trea ...

Towards real time radiotherapy simulation

Conference paper (2019) - Nils Voss (author) , Peter Ziegenhein (author) , Lukas Vermond (author) , J.J. Hoozemans (author) , Oskar Mencer (author) , Uwe Oelfke (author) , W. Luk (author) , G. Gaydadjiev (author)

We propose a novel reconfigurable hardware architecture to implement Monte Carlo based simulation of physical dose accumulation for intensity-modulated adaptive radiotherapy. The long term goal of our effort is to provide accurate online dose calculation in real-time during patie ...

Memory mapping for multi-die FPGAs

Conference paper (2019) - Nils Voss (author) , Pablo Quintana (author) , Oskar Mencer (author) , W. Luk (author) , G. Gaydadjiev (author)

This paper proposes an algorithm for mapping logical to physical memory resources on FPGAs. Our greedy strategy based algorithm is specifically designed to facilitate timing closure on modern multi-die FPGAs for static-dataflow accelerators utilising most of the on-chip resources ...

Low Area Overhead Custom Buffering for FFT

Conference paper (2019) - Nils Voss (author) , Stephen Girdlestone (author) , Tobias Becker (author) , Oskar Mencer (author) , Wayne Luk (author) , G. Gaydadjiev (author)

In this paper we propose a technique to minimise the area overhead of a double buffered implementation of Radix-4 Fast Fourier Transformation (FFT). Our proposal circumvents the need for double buffering by exploiting opportunities in the specific data reordering of the buffers t ...

Convolutional neural networks on dataflow engines

Conference paper (2017) - Nils Voss (author) , Marco Bacis (author) , Oskar Mencer (author) , G. Gaydadjiev (author) , W. Luk (author)

In this paper we discuss a high performance implementation for Convolutional Neural Networks (CNNs) inference on the latest generation of Dataflow Engines (DFEs). We discuss the architectural choices made during the design phase taking into account the DFE chip properties. We the ...

Quantum Chemistry in Dataflow

Density-Fitting MP2

Journal article (2017) - Bridgette Cooper (author) , Stephen Girdlestone (author) , Pavel Burovskiy (author) , G. Gaydadjiev (author) , Vitali Averbukh (author) , Peter J. Knowles (author) , W. Luk (author)

We demonstrate the use of dataflow technology in the computation of the correlation energy in molecules at the Møller-Plesset perturbation theory (MP2) level. Specifically, we benchmark density fitting (DF)-MP2 for as many as 168 atoms (in valinomycin) and show that speed-ups bet ...

EXTRA

Towards the exploitation of eXascale technology for reconfigurable architectures

Conference paper (2016) - Dirk Stroobandt (author) , Ana Varbanescu (author) , Elias Vansteenkiste (author) , Wayne Luk (author) , M. D. Santambrogio (author) , D. Sciuto (author) , Michael Huebner (author) , Tobias Becker (author) , G. Gaydadjiev (author) , Antonis Nikitakis (author) , Alex J.W. Thom (author) , Cătălin Ciobanu (author) , Muhammed Al Kadi (author) , A. Brokalakis (author) , George Charitopoulos (author) , Tim Todman (author) , Xinyu Niu (author) , Dionisios Pnevmatikatos (author) , Amit Kulkarni (author)

To handle the stringent performance requirements of future exascale-class applications, High Performance Computing (HPC) systems need ultra-efficient heterogeneous compute nodes. To reduce power and increase performance, such compute nodes will require hardware accelerators with ...

EXTRA

Towards an efficient open platform for reconfigurable High Performance Computing

Conference paper (2015) - Cătălin Bogdan Ciobanu (author) , Ana Lucia Varbanescu (author) , Tobias Becker (author) , G. Gaydadjiev (author) , A. Brokalakis (author) , Antonis Nikitakis (author) , Alex J.W. Thom (author) , Elias Vansteenkiste (author) , D. Stroobandt (author) , Dionisios Pnevmatikatos (author) , George Charitopoulos (author) , Xinyu Niu (author) , W. Luk (author) , M. D. Santambrogio (author) , Donatella Sciuto (author) , Muhammed Al Kadi (author) , Michael Huebner (author)

To handle the stringent performance requirements of future exascale-class applications, High Performance Computing (HPC) systems need ultra-efficient heterogeneous compute nodes. To reduce power and increase performance, such compute nodes will require hardware accelerators with ...

FPGA-based design using the FASTER toolchain

The case of STM spear development board

Conference paper (2014) - F. Spada (author) , A. Scolari (author) , W. Luk (author) , D. Stroobandt (author) , P du Pau (author) , G. C. Durelli (author) , R Cattaneo (author) , M. D. Santambrogio (author) , D. Sciuto (author) , D. N. Pnevmatikatos (author) , G. Gaydadjiev (author) , Oliver Pell (author) , Andreas Brokalakis (author)

Even though FPGAs are becoming more and more popular as they are used in many different scenarios like communications and HPC, the steep learning curve needed to work with this technology is still the major limiting factor to their full success. Many works proposed to mitigate th ...

Effective reconfigurable design

The FASTER approach

Conference paper (2014) - D. N. Pnevmatikatos (author) , T. Becker (author) , Marco D. Santambrogio (author) , D. Sciuto (author) , Dirk Stroobandt (author) , A. Brokalakis (author) , G. Gaydadjiev (author) , Wayne Luk (author) , K. Papadimitriou (author) , I. Papaefstathiou (author) , P Pau (author) , Oliver Pell (author) , Christian Pilato (author)

While fine-grain, reconfigurable devices have been available for years, they are mostly used in a fixed functionality, "ASIC-replacement" manner. To exploit opportunities for flexible and adaptable run-time exploitation of fine grain reconfigurable resources (as implemented curre ...

FASTER

Facilitating analysis and synthesis technologies for effective reconfiguration

Conference paper (2012) - D. Pnevmatikatos (author) , T. Becker (author) , M. Robart (author) , M. D. Santambrogio (author) , D. Sciuto (author) , D. Stroobandt (author) , Tim Todman (author) , Andreas Brokalakis (author) , K. Bruneel (author) , G. Gaydadjiev (author) , Wayne Luk (author) , K. Papadimitriou (author) , I Papaefstathiou (author) , O. Pell (author) , C. Pilato (author)

The FASTER project aims to ease the definition, implementation and use of dynamically changing hardware systems. Our motivation stems from the promise reconfigurable systems hold for achieving better performance and extending product functionality and lifetime via the addition of ...