Scalable GPU Acceleration for Complex Brain Simulations

Abstract

Complex mathematical models are used in computational neuroscience to simulate brain activity and understand the underlying biological processes. Simulating such models is computationally costly, and high-performance computing systems are therefore a natural means to increase performance.
This thesis implements a new, versatile multi-GPU eHH simulator (mgpuHH), explores its performance, and makes general observations on performance scalability across different modeling and cluster-configuration properties. The work offers a multi-node, multi-GPU solution that achieves excellent scalability thanks to how the simulator is constructed, using OpenMPI and CUDA. The simulator is configured through JSON files containing the neural descriptions and simulator-specific settings, enabling a user-friendly environment for neuroscientists without the need to recompile or understand the source code. The gap-junction calculations are identified as the critical function bottlenecking the simulator's performance; therefore, an algorithm tailored to GPU hardware is implemented to decrease the wall-clock time of these specific calculations. For internode
communication, OpenMPI can be configured in two ways: either share all compartment potentials with every node in the network, or share compartment potentials only with the nodes that need them. These methods rely internally on MPI_Allgather and MPI_Alltoallv, respectively. When available, GPUDirect, NVLink, and RDMA are supported. The implementation hides communication overhead, when possible, behind concurrently executing compute kernels. A neuron model from the inferior olivary nucleus is selected for benchmarking. Reported results go up to 32 nodes with a total of 64 GPU cards. The design shows linear weak and strong scaling within the experimental setups, both intranode and internode. With this simulator, networks of over 10 million cells can be modeled on large-scale GPU clusters, setting a new standard for eHH simulations. Comparisons against related work on CPUs and FPGAs have been conducted: a 100x speedup is achieved over a single-threaded CPU solution, a 2x speedup over an FPGA solution (flexHH), and a 10x speedup over a multi-threaded CPU solution (GenEHH, with 128 threads); both of the latter speedups are for a fully connected network of 7,000 IO cells.
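The abstract mentions that the simulator is driven by JSON configuration files holding the neural description and simulator settings. A sketch of what such a file might look like is shown below; the field names and values here are hypothetical, chosen only to illustrate the idea of separating model and simulator parameters from the source code:

```json
{
  "simulation": { "dt_ms": 0.025, "duration_ms": 1000, "seed": 42 },
  "network": {
    "cell_model": "io_ehh",
    "num_cells": 7000,
    "gap_junctions": { "topology": "all_to_all", "conductance_mS": 0.05 }
  },
  "output": { "record": ["soma_potential"], "file": "results.bin" }
}
```

A neuroscientist would only edit such a file to change the network size, connectivity, or recorded quantities, with no recompilation required.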
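The gap-junction calculation named above as the critical bottleneck can be sketched as follows. This is a plain-Python illustration, not the thesis's CUDA kernel; the voltage-dependent coupling factor f(dV) = 0.8·exp(−dV²/100) + 0.2 is the formulation commonly used for IO gap junctions and is an assumption here, as are the function and parameter names:

```python
import math

def gap_junction_currents(v, neighbors, weights):
    """Accumulate the gap-junction current flowing into each cell.

    v: membrane potentials (mV); neighbors[i] lists the cells coupled
    to cell i; weights[i] holds the matching conductances. The factor
    f(dV) = 0.8*exp(-dV^2/100) + 0.2 (assumed here) makes the coupling
    strength depend on the voltage difference between the two cells.
    """
    currents = []
    for i, vi in enumerate(v):
        acc = 0.0
        for j, w in zip(neighbors[i], weights[i]):
            dv = v[j] - vi
            acc += w * dv * (0.8 * math.exp(-dv * dv / 100.0) + 0.2)
        currents.append(acc)
    return currents
```

For a fully connected network of N cells this inner loop runs over N−1 neighbors per cell, i.e. O(N²) work per time step, which is why this function dominates wall-clock time and is the natural target for a GPU-tailored algorithm.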
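The trade-off between the two internode exchange strategies (MPI_Allgather versus MPI_Alltoallv) comes down to how much data crosses the network per time step. The small model below counts the potentials sent cluster-wide under each strategy; the input format and function name are hypothetical, but the counting reflects the collectives' semantics:

```python
def exchange_volumes(n_nodes, cells_per_node, conn):
    """Compare per-step exchange sizes for the two MPI strategies.

    conn[(src, dst)] = number of compartment potentials that node dst
    actually needs from node src (hypothetical node-level connectivity
    map). Returns (allgather_total, alltoallv_total), counted in
    potentials sent across the whole cluster per time step.
    """
    # MPI_Allgather: every node sends all of its compartments to
    # every other node, regardless of connectivity.
    allgather = n_nodes * (n_nodes - 1) * cells_per_node
    # MPI_Alltoallv: each node sends only what each peer needs.
    alltoallv = sum(v for (s, d), v in conn.items() if s != d)
    return allgather, alltoallv
```

For sparsely connected networks the Alltoallv-style exchange moves orders of magnitude less data, whereas for a fully connected network the two converge, which is consistent with offering both modes as configuration options.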