Full-system After-cache Memory Tracing for Multi-core Systems using a Distributed Cache Simulator

More Info
expand_more

Abstract

The gap between CPU and memory performance becomes increasingly larger. Together with a growing memory pressure caused by higher CPU core counts combined with multi-tenant systems, this causes the need for new memory technologies. Recently, various technologies are becoming available for commercial use. Examples of these technologies are memory types like non-volatile RAM. These technologies generally have different characteristics than traditional DRAM. To be able to fully utilise the potential of these new memory types, a better understanding of memory usage in modern systems is required. A way to gain a better understanding is through memory traces.

Solutions that are currently available either do not support multi-core architectures or cause a severe slowdown. Therefore, this thesis presents a novel approach to gather full-system after-cache memory access traces. The proposed system is a hybrid framework which consists of the QEMU emulator combined with a custom distributed cache and page table simulator. A modified version of QEMU, called QMEMU, is devised to improve tracing performance and allow tracing instruction fetches.
By leveraging the existing tracing functionality of QEMU only a small amount of modifications have to be made to QEMU. The traces produced by QMEMU contain virtual addresses. However, for accurate cache simulation, the physical addresses have to be used. Tracing the physical address instead of the virtual address for each memory access is shown to cause a 70% slowdown when using QMEMU.

To find these physical addresses for the traced accesses, a novel approach is employed. This approach simulates the guest page tables outside the critical path for memory tracing and therefore does not decrease performance. Using QMEMU traces can be gathered with a speedup of up to 42.6 times over the gem5 simulator for benchmarks of the PARSEC suite.

In the second part of the framework, which performs memory, cache, and page table simulation, cache simulation is found the most computationally intensive task. Therefore in the proposed framework cache simulation is performed in a parallel and distributed manner. Most modern systems use set-associative caches, simulation can be parallelised without reducing accuracy by dividing the memory access traces based on these cache sets. Using this approach 10 Million accesses can be processed per second by the simulator when simulating a single modern cache hierarchy. When simulating 7 different cache hierarchies concurrently a throughput of 6 Million accesses per second is reached. The simulated guest page tables provide additional information like the number of accesses or virtual memory size for each process of the guest workload. This information can be used to decrease the size of the semantic gap between memory traces and their meaning. The proposed framework is evaluated by comparing it to CMP$im and gem5 using the PARSEC benchmark suite.