High-Performance and Scalable Agent-Based Simulation with BioDynaMo

Agent-based modeling plays an essential role in gaining insights into biology, sociology, economics, and other fields. However, many existing agent-based simulation platforms are not suitable for large-scale studies due to the low performance of the underlying simulation engines. To overcome this limitation, we present a novel high-performance simulation engine. We identify three key challenges for which we present the following solutions. First, to maximize parallelization, we present an optimized grid to search for neighbors and parallelize the merging of thread-local results. Second, we reduce the memory access latency with a NUMA-aware agent iterator, agent sorting with a space-filling curve, and a custom heap memory allocator. Third, we present a mechanism to omit the collision force calculation under certain conditions. Our evaluation shows an order of magnitude improvement over Biocellion, three orders of magnitude speedup over Cortex3D and NetLogo, and the ability to simulate 1.72 billion agents on a single server. Supplementary Materials, including instructions to reproduce the results, are available at: https://doi.org/10.5281/zenodo.6463816


Introduction
Agent-based modeling (ABM) makes it possible to simulate complex dynamics in a wide range of research fields. ABM has been used to answer research questions in biology [30,43,73], sociology [18], economics [65], technology [50], business [56], and other fields [39]. Agents are individual entities that can, for example, represent subcellular structures to simulate the growth of a neuron, a cell to investigate cancer development, or a person to simulate the spread of infectious diseases [10]. The actions of an agent are defined through instances of the class Behavior. To stay with the previous examples, possible behaviors are neurite bifurcation, uncontrolled cell division, or infection.
Agent-based models are developed iteratively: an initial model is refined until it matches observed data [58,67]. Model parameters that cannot be derived from the literature are determined through optimization. An optimization algorithm generates a parameter set, executes the model, and evaluates the error with respect to observed data until the error converges to a local or global minimum. This loop might also contain an uncertainty analysis to evaluate the robustness of a solution [40]. Consequently, the model must be simulated many times.
The simulation engine's performance limits the scale of the model and determines how often the model can be simulated. Thus, performance is a key issue for simulating models at extreme scales that might one day simulate all 86 billion neurons in the brain [7]. It is also crucial for smaller-scale simulations: to explore vast parameter spaces, analyze parameter uncertainty, repeat simulations often enough to reach statistical significance, and develop models rapidly.
To achieve these goals, we present a novel simulation engine called BioDynaMo, which is optimized for high performance and scalability. During its development, we identified the following three main performance challenges for agent-based simulations.
Challenge 1: To fully utilize systems with high processor core counts, the parallel part of the simulation engine has to be maximized (see Amdahl's law [3]). Although it is easy to parallelize the loop over all agents (Algorithm 1), our benchmarks revealed two operations whose degree of parallelization has a significant performance impact. The first is building the environment index, which is used to determine the neighbors of an agent; the literature describes various radial-neighbor search algorithms with different design trade-offs between build and search performance. The second is combining thread-local results at the end of each iteration. In general, attention must also be paid to seemingly minor operations, such as resizing a large vector, which by default is performed by a single thread.
Challenge 2: ABMs are predominantly memory-bound for two reasons. First, the behaviors of agents often have low arithmetic intensity. Second, ABMs can be very dynamic: during a simulation, agents move through space, change their behavior, and are created and destroyed. Consequently, the neighborhood of an agent changes continuously, leading to irregular memory access patterns and poor cache utilization. This results in large data movement between main memory and the processor cores.
Challenge 3: Under certain conditions, the expensive calculation of mechanical forces between agents is redundant (Section 5). These forces are used, for example, in tissue models to determine the displacement of agents. The challenge is to identify those agents for which the pairwise force calculation can be safely omitted.
BioDynaMo addresses these challenges with the following new optimizations. To maximize parallelization (Challenge 1), we develop an optimized uniform grid to search for agent neighbors and fully parallelize the addition and removal of agents. We address the data-movement bottleneck (Challenge 2) in software by (i) optimizing the iteration over all agents on systems with a non-uniform memory access (NUMA) architecture, (ii) sorting agents and their neighbors to improve the cache hit rate and minimize accesses to remote DRAM, and (iii) introducing a pool memory allocator. To avoid redundant mechanical force calculations (Challenge 3), we add a mechanism that detects agents for which we can guarantee that the resulting force will not move the agent.
These mechanisms make BioDynaMo nearly an order of magnitude more efficient than Biocellion and three orders of magnitude faster than Cortex3D and NetLogo. The performance improvements account for a median speedup of 159× compared to BioDynaMo's standard implementation with all optimizations turned off. As a result, BioDynaMo is able to simulate 1.72 billion agents on one server. The main contributions of this paper are as follows.
• We present a novel high-performance agent-based simulation engine. The engine can be used in many domains due to its modular software design and features a specialization for neuroscience, capable of simulating the development of neurons.
• We present six optimizations to maximize performance (Sections 3-5). These insights are transferable and can be used to improve the performance of other agent-based simulators.
• We present an in-depth evaluation of BioDynaMo's performance using five different simulations (Section 6). This comprehensive analysis provides insights for users of BioDynaMo into which parameters yield the best performance, and hints for developers of future agent-based simulation tools.

BioDynaMo's Simulation Engine
This section gives an overview of BioDynaMo and its components. BioDynaMo is written in C++, uses OpenMP [53] for shared-memory parallelism, and is available under the Apache 2.0 open-source license. Breitwieser et al. [10] describe the user-facing features of the BioDynaMo platform and detail its modular software design and ease of use by means of three use cases in the domains of neuroscience, epidemiology, and oncology.
BioDynaMo is a hybrid framework able to utilize multicore CPUs and GPUs. This paper focuses on the CPU version, which has two major advantages. First, the CPU version can simulate many more agents than a GPU version, because GPUs typically have significantly less memory than CPUs. For example, our benchmark hardware has 12× more memory than the current flagship GPU from NVidia, the A100 [52]. Second, the CPU version improves the usability and flexibility for our broad user community, who often only have a Matlab [31] or R [64] coding background. In BioDynaMo, users create simulations by writing C++ code. A GPU-only version would require users to write CUDA code to define new agents, behaviors, and other user-defined components. Therefore, BioDynaMo only offloads computations to the GPU, transparently to the user [27].
The main objects in agent-based simulations are agents, behaviors, and operations. Agents (e.g., a cancer cell) have attributes that are updated through behaviors and operations. Behaviors (e.g., uncontrolled cell division) are functions that can be assigned to and removed from an agent and give users fine-grained control over the actions of an agent. In contrast, agent operations are executed for each agent, for example, to calculate the mechanical forces between agents or to execute all individual behaviors of an agent. The second type of operation, called a standalone operation, is executed once per iteration to perform a specific task (e.g., visualization). A characteristic property of agent-based simulation is local interaction. BioDynaMo provides a common interface for different neighbor search algorithms called environment. Besides the uniform grid detailed in Section 3.1, BioDynaMo features a kd-tree based on nanoflann [9] and an octree based on the publication of Behley et al. [8].
The agent-based simulation algorithm (Algorithm 1) comprises two steps. First, users define the starting condition of the model (L1), in which agents, behaviors, operations, and any other resources are created. Second, the simulation engine executes this model for a number of iterations (L2-19). The engine executes all agent operations for each agent (L7-12) and all standalone operations (L12-14). Standalone operations can be further separated into operations that must be executed at the beginning of the iteration (e.g., to update the environment index [L3-5]) or at the end (e.g., visualization [L16-18]). Two barriers synchronize the threads (L6 and L15).

Optimized Uniform Grid

Determining the neighbors of an agent is a precondition for all agent interactions. For example, the infection behavior in an epidemiological model requires information about whether any of the immediate neighbors is infected. In this context, it is essential to find neighbors quickly and efficiently and to minimize the build time of the required index; building an index in every iteration has a high cost, as shown in the evaluation section. We exploit the fact that the interaction radius is known at the beginning of the iteration. For this fixed-radius search problem, a grid-based solution is a good choice because the box of an agent can be determined in constant time from the agent's position [8]. This is confirmed by our evaluation in Section 6.9.

The build stage, in which all agents are assigned to a box, can be easily parallelized. In the search stage, the grid determines all neighbors by iterating over all agents in the same box and the surrounding boxes. In 3D space, we consider the 3×3×3 cube of boxes surrounding and including the query box. All agents inside a box are stored in an array-based linked list; the box only needs to store the start index and the number of elements it contains. To avoid zeroing all boxes at the beginning of the build stage, we add a timestamp attribute to each box, updated whenever an agent is added. Consequently, we can determine that a box is empty if the simulation and box timestamps differ. Therefore, we can build the grid in O(#agents) time instead of O(#agents + #boxes), which is relevant for large simulation spaces that are not fully populated.
The array-based linked list uses the same agent indices as the ResourceManager. The ResourceManager, an essential class in the simulation engine, stores raw agent pointers and offers functions to add, remove, get, and iterate over agents. Thus, the linked list also benefits from the memory-layout optimization presented in Section 4.2, which reduces the distance in memory between agents that are close in space. Consequently, linked-list elements will be closer to each other, improving the cache hit rate when traversing the linked list during the search stage of the grid. The described grid implementation can be found in the class UniformGridEnvironment.

Adding and Removing Agents in Parallel
To maximize the theoretically achievable speedup described in Amdahl's law [3], we maximize the parallel part of the simulation by parallelizing the addition and removal of agents. By default, BioDynaMo stores thread-local copies of additions and removals and commits them to the ResourceManager at the end of each iteration.
Additions are trivial; the engine determines the total number of additions, grows the data structures in the ResourceManager, and adds the agent pointers in parallel. In contrast, the parallelization of removals is a more elaborate process because we disallow empty vector elements in the ResourceManager. If the simulation engine has to remove an agent stored in the middle of the vector, it must swap it with the last element before shrinking the vector. The following algorithm aims at performing the necessary swaps and updates in dependent data structures in parallel. Figure 1 illustrates the parallelized algorithm, simplified to a single NUMA domain. This example assumes a simulation with seven agents, represented by the identifiers 1-7, and two threads, which remove three agents from the simulation. These agents are highlighted with a grey background. The other colors serve as a visual aid to track the agents that must be swapped. The algorithm comprises five main steps. First, the algorithm determines the total number of removed agents, calculates the new size of the vector (new_size = old_size − #removed), and initializes two auxiliary arrays. The size of the auxiliary arrays equals the number of removed agents. The new vector size is indicated by the vertical line between indexes three and four in Figure 1.
Second, each thread iterates over its vector of removed agents and fills the auxiliary arrays. If the removed agent is stored to the left of the new-size index, it must be swapped to the right; therefore, the algorithm inserts the element's index into the array to_the_right. If the removed agent is stored to the right of the new size, we insert a one into the array not_to_the_left at the index (element index − new_size). The maximum index used to access elements in the auxiliary arrays is smaller than the number of removed agents and is independent of the number of remaining agents.
Third, we partition the two auxiliary arrays into blocks corresponding to the total number of threads. A thread iterates over its block, moves entries to the beginning of the block if they indicate a swap, and stores a counter in the #swaps array. For the to_the_right array, the algorithm skips or overwrites elements that still hold the initial marker value. In this step, the semantics of the not_to_the_left array change to to_the_left: the algorithm looks for all zeros in the array, replaces them with the corresponding vector index (new_size + array index), and moves them to the beginning of the block.
In the fourth step, we can finally perform the swaps. The algorithm calculates the prefix sum of the two #swaps arrays and partitions the swaps among all threads. Each thread can determine its source and destination indices based on the prefix sums.
In the last step, the algorithm completes the removal by shrinking the vector to new_size.

Optimize Memory Layout and Data Access Pattern

NUMA-Aware Iteration

BioDynaMo supports systems with multiple NUMA domains. Because OpenMP does not provide this functionality, we add a mechanism that matches threads with agents from the same NUMA domain to minimize traffic to remote DRAM. Figure 2 illustrates the mechanism. The ResourceManager stores agent pointers in separate vectors, one per NUMA domain (1), and divides these vectors into blocks of agent pointers of the same size (2). These blocks are partitioned among the threads of the matching NUMA domain (3), and threads process their assigned blocks in parallel. Figure 2 shows processed blocks with the background color of the corresponding thread.
We implement a two-level work-stealing mechanism to avoid imbalanced execution times across threads. First, a thread can steal a block from another thread in the same NUMA domain (e.g., 4). Second, if the thread's NUMA domain has already finished all its work, the thread can steal work from a different NUMA domain (e.g., 5).

Agent Sorting and Balancing
To accelerate these memory-bound simulations, we must increase the cache hit ratio and load-balance the agents among NUMA domains to minimize remote DRAM accesses. Section 4.1 and Figure 2 assume this is already the case. This section presents an efficient algorithm that achieves this goal by sorting the agents' memory locations while preserving the neighborhood relations in 3D. Preserving neighborhood relations while reducing dimensionality is the main characteristic of space-filling curves (e.g., the Morton order [46] or the Hilbert curve [28]).
We compared the performance of the Morton order with the Hilbert curve using an oncological simulation [10] and observed a negligible performance improvement of 0.54% from using the Hilbert curve: the higher cost of decoding the Hilbert curve offsets its small gains for the agent operations. Therefore, we use the Morton order, which results in simpler code.
Figure 3 and the following description present the algorithm in 2D space, but the same principles apply in 3D. In BioDynaMo, the neighborhood information is stored in the implementation of the environment interface. Since the uniform grid environment performs best, as shown in the evaluation section (Section 6.9), we utilize its characteristics to achieve fast sorting and balancing. We assume the following scenario. Agents are stored in a 3×3 uniform grid (A). The simulation runs on a system with two NUMA domains and two threads per domain, i.e., four threads in total. The grid boxes are stored in a flattened array; Figure 3B shows the box indices for the x and y coordinates. We apply a Morton-order space-filling curve [46] to the uniform grid (C). Our goal is to sort the agents in the nine boxes in increasing Morton order. The Morton order is only contiguous for square simulation spaces whose side length is a power of two; therefore, C shows a 4×4 grid. For the 3×3 simulation space, there are gaps between Morton codes four and six, six and eight, and nine and twelve.
The algorithm comprises three main steps. First, the algorithm determines the sequence of boxes in Morton order (D, E). Second, the algorithm partitions the boxes into segments that balance agents among NUMA domains and threads (F). Third, the algorithm stores the agents in their new positions in the resource manager (G). In Figure 3, we selected four boxes and colored them red, green, blue, and yellow to make it easy to find the corresponding entries for contained agents, box index, and Morton code. The boxes outside the simulation space have a grey background.
In the first main step (D, E), we determine all gaps between grid boxes in the simulation space (D) to avoid a costly sorting operation or an iteration over all p × p boxes, where p is the next-higher power of two of max(x, y). We exploit the fact that the Morton order corresponds to a depth-first traversal of a quadtree in which each box is a leaf. Leaves whose boxes are outside the simulation space are considered empty. Similarly, an inner node is empty if all of its corresponding leaves are outside the simulation space. If all the corresponding boxes of an inner node are inside the simulation space (i.e., the inner node has a perfect subtree), we say the inner node is complete. The quadtree is only an abstraction and does not need to be constructed; it is only necessary to store the current traversal path, which requires O(log(#boxes)) space.
The mechanism in D uses three auxiliary variables: box counter, offsets, and found_gap, which are initialized to zero, zero, and true. The matrix in Figure 3D shows the three variables before the update of the current traversal step. The algorithm traverses the tree depth-first and continues to the next deeper tree level only if the current node is neither complete nor empty; in this case, the variables are not changed. If a complete inner node or a leaf inside the simulation space is found and the found_gap variable is true, the algorithm adds an entry with the current box counter and offset values to the offsets array and clears found_gap. Afterward, the box counter variable is incremented by the number of leaves in the node's subtree, or by one if it is a leaf, irrespective of the former value of found_gap. Empty nodes and leaves are handled similarly: the offset variable is incremented by the number of empty leaves in the subtree of an empty node, or by one if it is an empty leaf, and the found_gap variable is set to true. The algorithm keeps track of the x and y intervals to calculate (in constant time) the number of leaves in a subtree and to determine whether an inner node is entirely, partially, or not at all inside the simulation space.
With the already sorted offsets array, the Morton order can be determined in linear time by iterating over all indices and adding the corresponding offset (E).
In the second main step (F), the algorithm iterates over all boxes in Morton order and fills an auxiliary array with the number of agents in each box. Afterward, the algorithm calculates the prefix sum of the auxiliary array in a parallel, work-efficient manner [36] and partitions the total number of agents in the simulation such that each NUMA domain receives a share proportional to its number of threads. Inside a NUMA domain, the agents are further partitioned such that each thread receives an equal share.
In the third main step (G), the threads copy the agents and store the pointers at their new positions in the resource manager. The simulation engine can free obsolete agents' memory immediately or delete all old copies after the step has finished. The latter requires more memory but might improve performance due to a better memory layout in conjunction with the BioDynaMo memory allocator (Section 4.3).
The presented algorithm runs in O(#agents + #boxes) time and space and parallelizes steps E-G.

BioDynaMo Memory Allocator
To improve the performance of the simulation engine, we present a custom dynamic memory allocator that improves the memory layout of the most frequently allocated objects in a simulation: agents and behaviors. Our solution builds upon pool allocators due to their constant-time allocation. Pool allocators divide a memory block into equal-sized elements and store pointers to free elements in a linked list.
We create multiple instances of these allocators because each can only return memory elements of one size. As a result, agents and behaviors with distinct sizes are separated and stored in a columnar way. We separate the pool allocator into multiple NUMA domains (class NumaPoolAllocator) to fully control where memory is allocated. The NumaPoolAllocator has a central free list and thread-private ones to minimize synchronization between threads. List nodes, which correspond to free memory locations, can migrate between the thread-private lists and the central list, which is essential to avoid memory leaks. Migrations are triggered if a thread-private list exceeds a specific memory threshold. The lists minimize these migrations, and thus thread synchronization, by maintaining additional skip lists that support adding and removing a large number of elements in constant time.
Memory is allocated in large blocks with exponentially increasing sizes, controlled by the parameter mem_mgr_growth_rate. The initialization of these memory blocks, which includes list-node generation, is performed on demand in smaller segments to minimize the required worst-case allocation time.
Every allocated memory block is divided into N-page-aligned segments (Figure 4A), where N can be set with the parameter mem_mgr_aligned_pages_shift. The beginning of each segment stores metadata identifying the owning allocator, and the linked-list nodes live inside the free elements themselves (Figure 4B). This design enables fast allocation and deallocation but wastes memory in three ways. First, memory blocks are allocated using numa_alloc_onnode (libnuma [5]); this function does not return N-page-aligned pointers, causing unusable regions at the beginning and the end of a block that sum up to N * page_size bytes. Second, there might not be enough space to place a whole element at the end of an N-page-aligned segment, because elements must not cross N-page-aligned borders; doing so would overwrite the metadata. Third, the metadata requires the size of a pointer, which is eight bytes on 64-bit hardware. The amount of wasted memory per block is therefore bounded by N * page_size + #segments * element_size + #segments * pointer_size. Despite this memory overhead, our performance evaluation shows that the BioDynaMo pool allocator uses, on average, less memory than ptmalloc2 and jemalloc [19]. Another side effect of this design choice is that the allocation size is limited to N * page_size − pointer_size.

Omit Collision Force Calculation
The most time-consuming operation in the tissue models presented in Section 6.1 is the calculation of the displacement of agents based on all mechanical forces. For this purpose, the simulation engine has to calculate pairwise collision forces between agents and their neighbors, implemented in the class InteractionForce. By default, BioDynaMo uses the force calculation method detailed in the Cortex3D paper [72]. We observe that simulations can contain many regions in which agents do not move. Neural development simulations (see Section 6.1), for example, might only have an active growth front, while the remaining part of the neuron is unchanged. Therefore, we present a mechanism to detect agents for which it is safe to skip the expensive force calculation. We call these agents static.
The following four conditions must have been fulfilled in the last iteration: (i) neither the agent nor any of its neighbors moved, (ii) the attributes of the agent and its neighbors did not change in a way that could increase the pairwise force (e.g., a larger diameter) or the resulting displacement, (iii) no new agents were added within the interaction radius of the agent, and (iv) at most one neighbor force is non-zero.
The detection mechanism is closely tied to the InteractionForce implementation (see [10]), as condition (ii) implies, and might have to be adjusted if a different force implementation is used.
Condition (iv) is needed because we want to allow agents to shrink and to be removed from the simulation without marking the agents in that region as not static. Consequently, we have to ensure that two or more neighbor forces did not cancel each other out in the previous iteration.
The simulation engine monitors whether any of the conditions are violated for each agent and marks the affected agents as not static. In this process, a distinction has to be made between changed attributes that affect only the current agent and those that also affect its neighbors. If, for example, a static agent moves, the agent and all its neighbors are affected, while a change to the agent's force threshold, which must be exceeded to move the agent, affects only the agent itself.

Benchmark Simulations
We use five simulations to evaluate the performance of the simulation engine: cell proliferation, cell clustering, and use cases in the domains of epidemiology, neuroscience, and oncology. These simulations use double-precision floating-point variables and are described in detail in [10]. Table 1 shows that these simulations cover a broad spectrum of performance-related characteristics and lists the number of agents, diffusion volumes, and iterations executed. We set the number of agents between two and 12.6 million to keep the total execution time of all benchmarks manageable, which is necessary due to the slow execution of the various baselines. In addition, Section 6.4 shows a benchmark in which each simulation is executed with one billion agents, and the comparison with Biocellion in Section 6.5 contains a benchmark with 1.72 billion cells.

Experimental Setup and Reproducibility
All tests were executed in a Docker container with an Ubuntu 20.04-based image. Table 2 gives an overview of the main parameters of the three servers we used to evaluate the performance of BioDynaMo. Unless explicitly mentioned otherwise, System A was used to execute a benchmark.
We provide all code, the self-contained docker image, more detailed information on the hardware and software setup, and instructions to execute the benchmarks in the supplementary materials (https://doi.org/10.5281/zenodo.6463816).

General Performance Metrics
We characterize the agent-based simulation workload by breaking down the operations' execution times and exploring microarchitectural inefficiencies.
The following benchmarks were performed with all optimizations enabled. Figure 5 shows a breakdown of all operations in our benchmark simulations. The majority of the runtime is spent in agent operations (median: 76.3%), which subsume, among others, the execution of behaviors, the calculation of mechanical forces, discretization, and the detection of static regions. Rebuilding the uniform grid environment at every time step is the second-biggest runtime contributor with 4.09-36.5% (median: 18.0%). The epidemiology use case considers a wider environment, which manifests itself in an increased update time. The average cost of agent sorting in its optimal setting (see Figure 12) is 0.180%-6.33%. Since adding and removing agents is parallelized, an iteration's setup and teardown consume only 2.66% (maximum) of the execution time.
In the microarchitecture analysis, we observe that the benchmark simulations are primarily memory-bound: between 31.8% and 47.2% of processor pipeline slots are lost because operands are not available.

Runtime and Space Complexity
We analyze the runtime and memory consumption of BioDynaMo on System B by increasing the number of agents from 10^3 to 10^9 for each simulation (Figure 6). With one thousand agents, the execution time for one iteration is on average 1.21 ms and increases only slightly up to 10^5 agents (2.80 ms). From there on, the runtime increases linearly up to one billion agents, where one iteration takes between 6.41 and 38.1 seconds. A similar trend can be observed for the memory consumption of BioDynaMo (using double-precision floating-point values), which remains below 1.60 GB up to 10^6 agents and then increases linearly to a maximum between 245 and 564 GB.
The number of agents that BioDynaMo can simulate is not fundamentally limited to one billion.The maximum depends only on the available memory of the underlying hardware and the tolerable execution time.

Comparison with Biocellion
We compare BioDynaMo with Biocellion [33], an agent-based framework for tissue models optimized for performance. We implement the cell-sorting simulation presented in the Biocellion paper (Section 3.1 in [33]) in BioDynaMo and use identical model parameters. The visualization of the BioDynaMo simulation with 50k cells (Figure 7a) demonstrates good agreement with the Biocellion results in Figure 3a in [33].
Since Biocellion is proprietary software and we do not have access to its code, we compare BioDynaMo to the performance results provided in [33]. First, we replicate the benchmark with 26.8 million agents using 16 CPU cores. For Biocellion, Kang et al. [33] used a system with two Intel Xeon E5-2670 CPUs at 2.6 GHz. We execute BioDynaMo on System C with a comparable CPU and limit the number of CPU cores to 16 to ensure a fair comparison. We observe that BioDynaMo is 4.14× faster than Biocellion: BioDynaMo executes one iteration in 1.80 s (averaged over 500 iterations), while Biocellion requires 7.48 s.
Second, we consider the Biocellion benchmark in which Kang et al. executed 1.72 billion cells on a cluster with 4096 CPU cores (128 nodes with two AMD Opteron 6271 Interlagos 2.1 GHz CPUs per node). We execute the BioDynaMo simulation with the same number of cells on a single node (System B). Although BioDynaMo requires 26.3 s per iteration, which is 5.90× slower than Biocellion, BioDynaMo uses 56.9× fewer CPU cores. Therefore, we conclude that the per-core performance of BioDynaMo is 9.64× higher than Biocellion's. We repeat the experiment with 281.4 million cells to verify this observation. Biocellion requires 4.37 s per iteration (extracted from Figure 3b in [33]) using 21 nodes with a total of 672 CPU cores. The BioDynaMo simulation on System B with 72 CPU cores runs in an almost identical 4.24 s per iteration. This result confirms our observation that BioDynaMo is an order of magnitude more efficient than Biocellion.
We evaluate the impact of our optimizations to provide insight into why BioDynaMo processes 4.14× more agents per CPU core in the first benchmark and 9.64× more in the second. To this end, we execute the simulation with 26.8 million cells and the relevant optimizations on System C, limited to 16 CPU cores, and on System B with 72 CPU cores. Figure 7b shows that the difference can largely be explained by the memory optimizations, which have a more significant impact on machines with a higher CPU core count.

Comparison with Cortex3D and NetLogo
We also compare BioDynaMo with capable single-threaded tools to evaluate the overhead of BioDynaMo's parallel implementation [42]. We choose Cortex3D [72] due to its similarity with the neuroscience features of BioDynaMo and select NetLogo [70] as a representative easy-to-use general-purpose tool. We extend the experiments from [10] by analyzing the impact of the presented performance improvements and by comparing the memory consumption. This benchmark uses different numbers of agents, diffusion volumes, and iterations than shown in Table 1. The first four benchmarks in Figure 8 are small-scale benchmarks using between 2k and 30k agents and 0-128k diffusion volumes. These benchmarks run for 100-1000 iterations and use only one thread because Cortex3D and NetLogo are not parallelized. The "epidemiology (medium-scale)" benchmark contains 100k agents and uses 144 threads. NetLogo only benefits from parallel garbage collection in this scenario. In the "BioDynaMo standard implementation", all optimizations are turned off and the kd-tree environment is used.
We make the following observations. For the small-scale simulations using one thread, BioDynaMo achieves a speedup of up to 78.8× while using 2.49× less memory. We observe a three orders of magnitude speedup and a two orders of magnitude reduction in memory consumption for the medium-scale benchmark, in which all threads were used.
The median speedup of the BioDynaMo standard implementation is 15.5×. The optimized uniform grid of BioDynaMo boosts performance in all benchmarks (median: 2.18×) but has the most significant impact if parallelization is used (45.5×). Memory layout optimizations improve the runtime of medium-scale simulations by 26.2%, but not that of small-scale ones. The memory layout optimizations comprise the NUMA-aware iteration (Section 4.1), agent sorting and balancing (Section 4.2), and the memory allocator (Section 4.3). Due to the interdependencies between these individual optimizations, we subsume them into one category. Similarly, using extra memory during the agent sorting and balancing stage (Section 4.2) has only a slight performance impact (median speedup: 4.82%). However, the static region optimization dramatically improves the performance of the neuroscience use case (speedup: 9.22×). Although the mechanism's overhead reduces the speedup for simulations without static regions, this is not problematic: the modeler usually knows this characteristic a priori and only enables the mechanism if static regions are expected (see parameter detect_static_agents).

Optimization Overview
We assess the performance of the presented optimizations using larger-scale simulations (Table 1) by enabling the optimizations step by step (Figure 9). The baseline in this comparison is the BioDynaMo standard implementation introduced in Section 6.6.
We make the following observations. The BioDynaMo optimizations improve overall performance by between 33.1× and 524× (median: 159×). These benchmarks confirm the speedup of BioDynaMo's optimized uniform grid that we observed in the comparison with Cortex3D and NetLogo. For these larger-scale simulations, the magnitude of the speedup increases up to 184×, with a median of 27.4×. A similar observation can be made for the static region detection mechanism, albeit with reduced magnitude (speedup: 3.22×). The main difference between the comparison with Cortex3D and NetLogo and this benchmark is the impact of the memory layout optimizations of agents and behaviors and of the use of extra memory during agent sorting. The maximum speedup is up to 5.30× (median: 2.96×) and up to 2.07× (median: 1.09×), respectively. Only the Biocellion benchmark in Figure 7b shows a bigger impact. The simulation time of the oncology use case, the only benchmark that removes agents from the simulation, is reduced by 31.7% using the "parallel removal" optimization described in Section 3.2. The optimizations increase the median memory consumption by a mere 1.77%, which rises to 55.6% when the use of extra memory during agent sorting is enabled.

Scalability
We evaluate the scalability of BioDynaMo using the complete simulations, which last between 288 and 1000 iterations, and perform a strong scaling analysis with different optimizations enabled. The strong scaling analysis is performed with ten iterations.
Figure 10a illustrates the excellent scalability of BioDynaMo for complete simulations (i.e., executing all iterations). The speedup using 72 physical cores with hyperthreading enabled is between 60.7× and 74.0× (median: 64.7×) compared to serial execution. Section 6.6 shows that BioDynaMo with one CPU core is more than 23× faster than Cortex3D. If we combine this result with the scalability analysis, which shows that BioDynaMo with 72 CPU cores is more than 60× faster than with one CPU core, we can conclude that BioDynaMo is up to three orders of magnitude faster than Cortex3D.
Figures 10c-10g show the strong scaling analysis for each benchmark simulation with ten iterations after progressively switching on the presented optimizations. The left column shows the speedup with respect to a single-thread execution, and the right column presents the average runtime in milliseconds to highlight the absolute differences between the various optimizations and the reduction in runtime with an increasing number of threads. We make the following observations. The BioDynaMo standard implementation scales poorly due to the serial build of the kd-tree environment, which is improved considerably by using BioDynaMo's optimized uniform grid (Section 3.1). The presented memory optimizations (Section 4) fully achieve their desired effect and allow BioDynaMo to scale across NUMA domains and high CPU core counts.
Neighbor Search Algorithm Comparison
Figure 11 compares three different neighbor search algorithms: BioDynaMo's uniform grid, UniBN's octree [8], and the kd-tree from nanoflann [9]. To ensure a fair comparison, we turned off agent sorting for all algorithms because it is currently only implemented for the uniform grid. We validate our choice of the octree bucket size and the nanoflann depth parameter and observe that the used parameters are within 4.20% of the optimal runtime. The left column in Figure 11 shows the results for four NUMA domains and 144 threads, while the right column shows results for one NUMA domain and 18 threads. We analyze four properties of these radial neighbor search methods: runtime impact on the whole simulation, build and search time of the index, and memory consumption (Figure 11). We measure the search time indirectly by comparing the agent operation runtimes. This operation contains the initial neighbor searches and thus provides information on how fast searches are executed.
The BioDynaMo uniform grid implementation shows its benefits not only in the pure build time comparison but also in the full simulation analysis. Although a significant build time difference compared to the kd-tree and octree is expected (because their build processes are serial), a magnitude between 255× and 983× on four NUMA domains is surprising.
The uniform grid also outperforms the other algorithms during the search stage for all simulations. Simulations using BioDynaMo's uniform grid implementation are up to 191× faster than with the kd-tree implementation while consuming only 11% more memory in the worst case.

NUMA-Aware Iteration
We evaluate the individual performance impact of NUMA-aware iteration (Section 4.1). In the other benchmarks, this optimization was included in the "memory layout optimization" group. We compare the simulation runtime with all optimizations enabled to executions in which "NUMA-aware iteration" is turned off. This benchmark shows that this mechanism reduces the runtime by a factor between 1.07× and 1.38× (median: 1.30×).

Agent Sorting and Balancing
This section evaluates the impact of agent sorting and balancing (Section 4.2) on the simulation runtime for one and four NUMA domains. To this end, we perform a parameter study with varying agent sorting frequencies for each simulation. Figure 12 shows the speedup for four NUMA domains (left) and one NUMA domain (right). The baselines in both cases are simulations without agent sorting. An agent sorting frequency of one means that the operation is executed in every iteration; similarly, a frequency of ten means that the operation is executed every ten iterations.
Load balancing of agents among NUMA domains greatly impacts performance even on systems without a NUMA architecture. This stems from the fact that the agent sorting operation also places agents that are close in space close together in memory.
The oncology and cell clustering simulations benefit most from this performance improvement (peak speedups of 5.77× and 4.56× for four NUMA domains). Both simulations are initialized with a random distribution of agents. Although the epidemiology simulation is also initialized randomly, its agents additionally move randomly over large distances between iterations. This behavior reduces the alignment improvements significantly (peak speedup: 1.14× for four NUMA domains). The cell proliferation simulation is initialized with a 3D grid of cells, which improves the alignment compared to the worst-case random initialization. Therefore, the maximum obtained speedup is reduced to 1.82× (four NUMA domains). If we change the initialization of the cell proliferation simulation to random, the maximum speedup increases to 4.68×. This optimization performs below average for the neuroscience simulation. This simulation only has an active growth front, while the remaining part stays static. The static agent detection mechanism exploits this fact and avoids calculating mechanical forces for the static regions. Therefore, the number of neighbor accesses is significantly reduced, and with it the benefit of aligned agents. If static region detection is disabled, agent sorting and balancing improve the runtime by 3.80× at a frequency of 20.

BioDynaMo Memory Allocator
To evaluate the performance of the BioDynaMo memory allocator, we compare it with glibc's version of ptmalloc2 [22] and with jemalloc [19] using our five benchmark simulations. A comparison with tcmalloc [21] was not possible due to deadlock issues that we discovered during benchmarking. Only the epidemiology use case uses additional memory during agent sorting and balancing. Since the BioDynaMo memory allocator only covers agents and behaviors, we need to use another allocator for the remaining objects.
This requirement results in four tested configurations per simulation, as illustrated in Figure 13. The BioDynaMo memory allocator improves the overall simulation runtime by up to 1.52× over ptmalloc2 (median: 1.19×) and by up to 1.40× over jemalloc (median: 1.15×). The allocator consumes 1.41% less memory than ptmalloc2 and 2.43% less memory than jemalloc on average.

Related Work
Agent-based simulation tools. To our knowledge, this is the first paper to present an agent-based simulation engine capable of simulating neuroscientific models with billions of agents. Biocellion [33] and Timothy [14] can also simulate one billion agents, but neither of them supports simulations in the demanding computational neuroscience domain. Our performance evaluation shows that BioDynaMo is 9.64× more efficient than Biocellion [33] for a simulation with 1.72 billion agents. Furthermore, BioDynaMo is three orders of magnitude faster than the popular neuroscientific simulator Cortex3D [72] and the general-purpose tool NetLogo [70] on our benchmark hardware. Other tools [20,44,57] also support large-scale models, but only show simulations of up to 10^6 agents.
Memory layout optimizations. Data movement between main memory and processor cores is a fundamental bottleneck in today's computing systems [48]. Recent research in computer architecture explores new approaches to address this bottleneck, such as processing-in-memory, i.e., placing compute capability closer to the data [2,25]. Agent-based simulation tools are also negatively impacted by the data movement bottleneck. We address this problem in software with a better memory layout, resulting in more efficient bandwidth utilization and data reuse in caches. Space-filling curves [28,46,54] can improve the memory layout by aligning objects that are close in 3D space. Therefore, these curves are frequently used to optimize geometric data structures [6,8] and molecular dynamics simulations [4,23,49]. To our knowledge, none of the other agent-based simulation frameworks (e.g., [11,13,15,17,20,33,34,37,38,41,44,51,57,61,63,68,70,72]) use space-filling curves to improve the cache hit rate and minimize the number of remote DRAM accesses. We apply this proven technique to the agent-based workload and present a mechanism to determine the Morton order of a non-cubic grid in linear time.
Neighbor search. Agent-based simulation platforms use various neighbor search algorithms: Delaunay triangulation [72], octrees [26], and grid-based approaches [33,57,70]. Grids are also commonly used on the GPU [1,8,16,24,29]. Depending on the dataset and the specific search query ([fixed-]radius neighbor search or k-nearest neighbors), the literature recommends different algorithms [8,69]. Our contribution lies in the efficient implementation and integration of the uniform grid into the simulation engine and in providing insights into the performance differences for the agent-based workload.
Performance evaluation. To our knowledge, this paper presents the most comprehensive performance analysis of an agent-based simulation platform. Existing platforms report only limited performance results, including simulation execution times and occasionally scalability analyses [12,20,33,38,44,72]. Performance data can also be found in model papers [47,62] and in works that focus on hardware accelerators [71]. We improve upon these works by providing an in-depth analysis of each performance-relevant component. Efforts towards a standard agent-based benchmark have been made by Moreno et al. [45] and Rousset et al. [59]. However, these synthetic benchmarks fall short of representing a realistic range of agent-based simulations by over-simplifying memory access patterns and assuming that agents always move randomly. Compared to these, our benchmark simulations cover a broader spectrum of performance-relevant simulation metrics (see Table 1).
Comparison outside the ABM field. Other particle-based applications, such as molecular dynamics (MD) [35,55,66], astrophysics (AP) [60], and computational fluid dynamics (CFD) [32] simulations, often face similar computational challenges in improving the performance of large-scale simulations. LAMMPS [66], for example, also uses a grid-based structure to determine neighbors. While LAMMPS stores neighbor lists for each atom, which according to Thompson et al. [66] "consumes the most memory of any data structure in LAMMPS", BioDynaMo does not need these lists and therefore saves memory. BioDynaMo improves over LAMMPS and VASP [35] by sorting agents using a space-filling curve (Section 4.2) and by using a custom memory allocator (Section 4.3) to reduce the memory access latency. A NUMA-aware thread allocation mechanism like the one used in BioDynaMo (Section 4.1) is not needed in LAMMPS or VASP because both tools support distributed parallelism with MPI. In this work, we identify several computational challenges in ABM, which we tackle using methods inspired by MD, AP, and CFD. The main difference between ABM and other particle-based applications is that the computations can vary significantly from each other in terms of arithmetic intensity, number of considered neighbors, data access patterns, and more, thus posing diverse computational challenges.

Conclusion and Future Work
This paper presents a novel agent-based simulation engine optimized for high performance and scalability. BioDynaMo not only enables larger-scale simulations, but also helps researchers conducting small-scale studies through accelerated parameter space exploration and faster iterative development.
We identify general agent-based performance challenges and provide six solutions to maximize parallelization, reduce memory access latency and data transfers, and avoid unnecessary work.These solutions are transferable and can be used to accelerate other agent-based simulation tools.
We present a comprehensive performance analysis to provide insights into the agent-based workload and to give our users a better understanding of BioDynaMo's capabilities. We find that, on our system, the presented optimizations improve performance by up to 524× (median: 159×) and allow BioDynaMo to scale to 72 physical processor cores with a parallel efficiency of 91.7%. A comparison with state-of-the-art tools shows that BioDynaMo is up to three orders of magnitude faster. These performance characteristics enable simulations with billions of agents, as demonstrated in our analysis.
Our performance optimizations, which are effective on machines with one or more NUMA domains, are an important stepping stone towards a distributed simulation engine with a hybrid MPI/OpenMP design.Ongoing work focuses on realizing this distributed simulation engine capable of dividing the computation among multiple nodes to push the boundaries of agent-based simulation even further.

Figure 2
shows a server with two NUMA domains (ND0, ND1) corresponding to two CPUs with two threads each (T0 & T1, T2 & T3). The CPUs have a local DRAM with shorter memory access latency than the remote DRAM. The agents in the simulation are balanced between the two NUMA domains (see Section 4.2). The ResourceManager maintains a vector of agent pointers for each NUMA domain. To iterate over agents in a NUMA-aware way, BioDynaMo first partitions

Figure 3. Agent sorting and balancing mechanism.

Figure 7. (a) Final simulation state after executing the Biocellion cell sorting model on BioDynaMo. (b) Performance evaluation of the BioDynaMo optimizations with a model with 26.8 million cells on System B (left) and System C limited to 16 physical CPU cores (right). The Biocellion paper [33] provides only a performance measurement for the latter benchmark.

Figure 8. Performance comparison with Cortex3D and NetLogo after the optimizations are progressively switched on.

Figure 9. Speedup (top) and memory consumption (bottom) compared with the BioDynaMo standard implementation after the optimizations are progressively switched on. The legend is shared between the plots.

Figure 10. (a) Simulation scalability using the whole simulation. (b-g) Detailed strong scaling analysis using only ten time steps. The left column shows the speedup with respect to a single-thread execution, while the right column presents the total runtime.

Figure 11. Neighbor search algorithm comparison (left column: four NUMA domains and 144 threads; right column: one NUMA domain and 18 threads). The legend is shared between the plots.

Figure 12. Agent sorting and balancing speedup for different execution frequencies (left: four NUMA domains and 144 threads; right: one NUMA domain and 18 threads).

Figure 13. Memory allocator comparison (left: speedup; right: memory consumption). The legend is shared between the plots.

A Layout
[Figure: memory layout of a memory block — an N-page aligned segment holding the NumaPoolAllocator* at its start, followed by allocated elements, free memory elements (each containing a list node pointing to the next free element), and unused memory, in order of increasing memory addresses.]
The free list nodes of BioDynaMo's memory allocator are stored inside free memory elements and do not require extra space. At the beginning of each N-page aligned segment, we write the pointer to the corresponding NumaPoolAllocator instance. Therefore, allocated memory elements can obtain this pointer in constant time, based on their memory address. This solution enables constant-time deallocations (see Figure