J.W. Peltenburg
Please Note
<p>This page displays the records of the person named above and is not linked to a unique person identifier. This record may need to be merged to a profile.</p>
2 records found
1
Master thesis
(2017)
-
Nikolas Bampetas, Zaid Al-Ars, Stephan Wong, Rene van Leuken, J.W. Peltenburg
DNA carries all the information needed to define the genetic structure of an individual. The large amount of information stored in DNA makes it costly to store and process. In this work, we design a fast and efficient Field Programmable Gate Array (FPGA) based streaming multicore architecture for accelerating variant calling algorithms. We focus on the HaplotypeCaller, the variant calling component of the Genome Analysis Toolkit (GATK), which is one of the most widely used DNA analysis tools in the field. The most time-consuming part of the HaplotypeCaller is the PairHMM algorithm, a probabilistic algorithm that performs pairwise alignment of two sequences. Starting from an existing single-core PairHMM accelerator design, which was implemented using the POWER8 Coherent Accelerator Processor Interface (CAPI), we designed three extra cores of the PairHMM algorithm that can work independently and increase the performance of the overall system. The new accelerator achieves a 2.2x speedup in comparison with the single core. The accelerator is integrated with the HaplotypeCaller and uses CAPI to access a shared processor-accelerator memory for direct communication. A JNI call is used to send the memory addresses to the accelerator, which reduces the communication overhead between the HaplotypeCaller and the accelerator. Results show that the application is not able to saturate the accelerator with data, resulting in accelerator idle time and under-utilization. The accelerator showed idle time for 26% of the data sets that were used. It would be beneficial to implement a faster version of the HaplotypeCaller that can load new data sets to the accelerator while the current data sets are being processed.
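To make the PairHMM step concrete, the following is a minimal software sketch of a pair-HMM forward pass that scores a read against a candidate haplotype. It is not the GATK implementation or the accelerator design from the thesis: the function name `pair_hmm_likelihood` and the constant parameters `base_error`, `gap_open` and `gap_ext` are assumptions chosen for illustration, whereas a real caller would derive these probabilities from per-base quality scores.

```python
def pair_hmm_likelihood(read, hap, base_error=0.01, gap_open=0.001, gap_ext=0.1):
    """Simplified forward-algorithm likelihood of `read` given haplotype `hap`.

    All three probability parameters are hypothetical constants.
    """
    if not read or not hap:
        raise ValueError("read and haplotype must be non-empty")
    n, m = len(read), len(hap)

    # Transition probabilities between the Match, Insertion and Deletion states.
    t_match_to_match = 1.0 - 2.0 * gap_open
    t_gap_to_match = 1.0 - gap_ext

    # One (n+1) x (m+1) table per state, indexed by [read position][hap position].
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    I = [[0.0] * (m + 1) for _ in range(n + 1)]
    D = [[0.0] * (m + 1) for _ in range(n + 1)]

    # The read may start at any haplotype position with equal weight.
    for j in range(m + 1):
        D[0][j] = 1.0 / m

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Emission: a matching base is likely, a mismatch is a sequencing error.
            if read[i - 1] == hap[j - 1]:
                emit = 1.0 - base_error
            else:
                emit = base_error / 3.0
            M[i][j] = emit * (
                t_match_to_match * M[i - 1][j - 1]
                + t_gap_to_match * (I[i - 1][j - 1] + D[i - 1][j - 1])
            )
            # Insertion consumes a read base, deletion consumes a haplotype base.
            I[i][j] = gap_open * M[i - 1][j] + gap_ext * I[i - 1][j]
            D[i][j] = gap_open * M[i][j - 1] + gap_ext * D[i][j - 1]

    # The read may end at any haplotype position.
    return sum(M[n][j] + I[n][j] for j in range(1, m + 1))


if __name__ == "__main__":
    print(pair_hmm_likelihood("ACGT", "ACGT"))
    print(pair_hmm_likelihood("ACGT", "TTTT"))
```

Each cell depends only on its upper, left and upper-left neighbours, so cells on the same anti-diagonal can be computed in parallel, which is one reason this kind of recurrence maps well onto streaming hardware.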
In recent years, due to the slowdown of Moore's Law and Dennard scaling, alternative architectures are starting to be used instead of plain CPU implementations. These architectures, such as FPGAs and GPUs, offer a higher performance-to-power ratio than a CPU-only implementation, but they sacrifice programmability in favor of performance gains. While GPUs are somewhat easily programmable and provide high performance, this comes at the cost of high power consumption. FPGA programming, on the other hand, is a tedious and time-consuming task that requires specialized personnel with a background in designing with hardware description languages (HDLs). Furthermore, an implementation is specific to a certain algorithm and cannot be used for any other algorithm, even if it is only slightly different. If a new algorithm for a particular task is found, part of the design process has to be redone. Designing for FPGAs is also computationally intensive, as after simulation the whole design has to be synthesized and then placed and routed (P&R) for a particular FPGA every time the design changes slightly. This mapping process can take hours or even days for large designs. In recent years, developments in High Level Synthesis (HLS) and OpenCL have made designing for FPGAs easier. This solution is not without problems either, as the algorithm still has to be implemented for a specific FPGA device. A solution to the FPGA synthesis and P&R problem has recently been proposed under the name of FPGA Overlay Architectures. The core concept is to abstract the FPGA by creating a virtual FPGA on top of the underlying physical one in order to reduce configuration and compile time. In this thesis, we investigate available alternative overlay architectures and select the most appropriate architecture for our analysis.
We extended the selected architecture to be deployed on alternative FPGA hardware and to work in a shared CPU/FPGA system. Then, we implemented a number of benchmarks to evaluate various aspects of system performance. Our results show that our architecture can be reconfigured in only 11.9 µs, as compared to seconds for full FPGA reconfiguration. However, the overlay architecture uses 10.5x more LUTs and causes a drop in frequency of about 30% for the chosen architecture. For future work, there is room to improve these results by optimizing the interconnect network of the device.
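The trade-off between fast reconfiguration and lower clock frequency can be illustrated with a rough back-of-the-envelope model. The sketch below uses the 11.9 µs reconfiguration time and the roughly 30% frequency drop from the abstract. Everything else is an assumption made for illustration: the function name `total_time`, the default of 1 second for a full FPGA reconfiguration, and the premise that kernel run time scales inversely with clock frequency.

```python
OVERLAY_RECONFIG_S = 11.9e-6  # overlay reconfiguration time reported in the abstract
FREQ_FACTOR = 0.70            # "about 30%" frequency drop, taken as exactly 30% here


def total_time(compute_s, switches, full_reconfig_s=1.0):
    """Return (native_s, overlay_s) for a workload that switches kernels.

    compute_s:        compute time on the native FPGA design, in seconds
    switches:         number of times the workload changes kernel
    full_reconfig_s:  assumed cost of one full FPGA reconfiguration (hypothetical)
    """
    native = compute_s + switches * full_reconfig_s
    # Assumption: run time grows in proportion to the drop in clock frequency.
    overlay = compute_s / FREQ_FACTOR + switches * OVERLAY_RECONFIG_S
    return native, overlay


if __name__ == "__main__":
    for switches in (0, 1, 10):
        native, overlay = total_time(compute_s=1.0, switches=switches)
        print(f"switches={switches}: native={native:.3f}s overlay={overlay:.3f}s")
```

Under these assumptions the overlay only pays off for workloads that change kernels often relative to their compute time. The model ignores the 10.5x LUT overhead, which limits how much logic fits on the device.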