Towards a Systematic Exploration of the Optimization Space for Many-Core Processors

The architectural diversity of many-core processors, with their different types of cores and memory hierarchies, makes the old model of reprogramming every application for every platform infeasible. Inter-platform portability has therefore become a desirable feature of programming models. While functional portability is ensured by standards and compilers (e.g., OpenCL), achieving high performance across platforms remains a much more challenging task.

In this thesis, we investigate techniques for enabling and disabling platform-specific optimizations within a unified programming model. We select OpenCL as our research vehicle and show that each platform has a specific optimization space for a given kernel. Using two concrete examples, we propose solutions for (semi-)automatically handling platform-specific optimizations under a unified programming model, and we use a case study in computer vision to illustrate how the effectiveness of an optimization depends on the platform.

To deal with differences in processing cores, we propose two approaches to vectorize scalar kernels (i.e., to rewrite them with explicit vector data types), and we identify the vectorization needs of explicitly parallel programs. To deal with differences in the memory hierarchy, we first present a method to quantify the performance impact of using local memory, starting from the memory access patterns of a kernel. This work produces a performance database, which serves as an indicator of whether using local memory is beneficial. Once this indication is given, we propose a portable solution to simplify programming with local memory. Specifically, we present an easy-to-use API (ELMO) to enable local memory usage, and a compiler pass (Grover) to automatically disable local memory usage in applications where local memory is used natively. Much like vectorization and local memory usage, other architectural features require performance-portable approaches.
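To make the first technique concrete: vectorizing a scalar kernel means rewriting it so that each work-item processes a short vector (e.g., an OpenCL float4) instead of a single element. The sketch below emulates this in plain C with a hypothetical float4 struct of our own; it is an illustration of the idea, not the thesis's actual kernels or tools.

```c
#include <stdio.h>

/* Illustrative only: an OpenCL-style float4 emulated as a plain C
 * struct so the example runs as ordinary host code. */
typedef struct { float s[4]; } float4;

/* Scalar kernel body: one element per (virtual) work-item. */
static void add_scalar(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Vectorized kernel body: four elements per work-item, mirroring the
 * rewrite from float to float4 in OpenCL C (assumes n is a multiple
 * of 4; n4 = n / 4). */
static void add_vec4(const float4 *a, const float4 *b, float4 *c, int n4) {
    for (int i = 0; i < n4; i++)
        for (int k = 0; k < 4; k++)   /* one float4 addition */
            c[i].s[k] = a[i].s[k] + b[i].s[k];
}
```

On a platform with SIMD units, the float4 form maps directly onto vector instructions; on platforms that prefer scalar code, the same rewrite can be disabled again, which is exactly the kind of platform-specific decision the thesis automates.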
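The local memory technique can likewise be sketched with a simple stencil: an OpenCL work-group first stages a tile of global data (plus a halo) into fast __local storage, then computes from that copy, turning repeated global reads into local ones. The C sketch below mimics this pattern with names of our own choosing; it does not reproduce the ELMO API or the Grover pass.

```c
#include <stdio.h>

#define N 16
#define GROUP 4   /* illustrative work-group size */

/* Naive 3-point stencil: every output reads global memory directly. */
static void stencil_global(const float *in, float *out) {
    for (int i = 1; i < N - 1; i++)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

/* Local-memory version: each "work-group" (one outer iteration here)
 * stages a tile plus a one-element halo into a small buffer, then
 * computes from the fast copy. */
static void stencil_local(const float *in, float *out) {
    float local[GROUP + 2];                  /* tile + halo */
    for (int g = 0; g < N; g += GROUP) {
        for (int l = 0; l < GROUP + 2; l++) {
            int idx = g + l - 1;             /* global index being staged */
            local[l] = (idx >= 0 && idx < N) ? in[idx] : 0.0f;
        }
        /* (an OpenCL kernel would barrier(CLK_LOCAL_MEM_FENCE) here) */
        for (int l = 0; l < GROUP; l++) {
            int i = g + l;
            if (i >= 1 && i < N - 1)
                out[i] = (local[l] + local[l + 1] + local[l + 2]) / 3.0f;
        }
    }
}
```

Whether the staging step pays off depends on the platform's memory hierarchy and the kernel's access pattern, which is why a performance database indicating when local memory helps, and a pass that can strip it out again, are both needed.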
Therefore, we present our vision for a portable programming framework, called SESAME, which extends to architectural features beyond SIMD units and local memory. This thesis gives evidence that the performance portability problem can be addressed successfully. We conclude that tools such as SESAME help improve the state of the art of existing programming models (like OpenCL, in our case) and ease the task of programmers dealing with different many-core architectures. This work serves as an essential step towards portable performance by systematically exploring the optimization space.