Improved Dynamic Cache Sharing for Communicating Threads on a Runtime-Adaptable Processor
J.J. Hoozemans (TU Delft - Computer Engineering)
A.F. Lorenzon (Universidade Federal do Rio Grande do Sul)
A.C. Schneider Beck (Universidade Federal do Rio Grande do Sul)
Stephan Wong (TU Delft - Computer Engineering)
Abstract
Multi-threaded applications execute their threads on different cores, each with its own local cache, and need to share data among the threads. Shared caches are used to avoid lengthy and costly main memory accesses. The degree of cache sharing is a trade-off between reducing misses and increasing hit latency. Dynamic caches have been proposed to adapt this trade-off to the workload type. Similarly, dynamic processors aim to execute workloads as efficiently as possible by balancing between exploiting instruction-level parallelism (ILP) and thread-level parallelism (TLP). To support this, they consist of multiple processing components and caches with adaptable interconnects between them. Depending on the workload characteristics, these interconnects can join the components into a single large core that exploits ILP, or split them into multiple smaller cores that run multiple threads (exploiting TLP). In this paper, we propose a cache system that further exploits this additional connectivity of a dynamic VLIW processor by forwarding cache accesses to multiple cache blocks while the processor is running in multi-threaded ('split') mode. Additionally, only requests to global data are broadcast, while accesses to local data are kept private. This improves hit rates similarly to existing cache-sharing schemes while reducing the penalty of stalling the other subcores. Local accesses are recognized by identifying memory accesses made relative to the stack frame pointer. Results show that our cache achieves miss-rate reductions similar to those of shared caches (up to 90%, 26% on average), and reduces the number of broadcast accesses by 21%.
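To illustrate the local/global classification, the C sketch below approximates the idea in software. The paper's hardware recognizes local accesses by checking whether an access is computed relative to the stack frame pointer; this sketch instead uses an address-range check against the running thread's stack region, which is a simplifying assumption. All names here (subcore_ctx_t, is_local_access, should_broadcast) and the struct layout are illustrative, not the paper's actual interface.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical per-subcore state: the bounds of the running
     * thread's stack region (stacks grow downward here). */
    typedef struct {
        uint32_t stack_base; /* highest address of the thread's stack */
        uint32_t stack_ptr;  /* current stack/frame pointer           */
    } subcore_ctx_t;

    /* An access is treated as local (thread-private) when its address
     * lies inside the thread's own stack region, i.e. it would have
     * been computed relative to the stack frame pointer. */
    static bool is_local_access(const subcore_ctx_t *ctx, uint32_t addr)
    {
        return addr >= ctx->stack_ptr && addr < ctx->stack_base;
    }

    /* Only accesses classified as global are broadcast to the cache
     * blocks of the other subcores; local accesses stay in the private
     * cache block, so the peer subcores are not stalled by them. */
    static bool should_broadcast(const subcore_ctx_t *ctx, uint32_t addr)
    {
        return !is_local_access(ctx, addr);
    }

    int main(void)
    {
        subcore_ctx_t ctx = { .stack_base = 0x8000000u,
                              .stack_ptr  = 0x7FFF000u };
        printf("%d\n", should_broadcast(&ctx, 0x7FFF800u)); /* 0: stack-local */
        printf("%d\n", should_broadcast(&ctx, 0x1000000u)); /* 1: global data */
        return 0;
    }

Filtering stack-relative accesses out of the broadcast path is what reduces the stalling penalty on the other subcores: only accesses that could plausibly hit shared data ever leave the private cache block.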