Efficient Execution of Video Applications on Heterogeneous Multi- and Many-Core Processors
Abstract
In this dissertation we present methodologies and evaluations aimed at increasing the efficiency of video coding applications on heterogeneous many-core processors composed of SIMD-only, scratchpad-memory-based cores. Our contributions span three fronts: thread-level parallelism strategies for many-cores, identification of bottlenecks in SIMD-only cores, and a software cache for scratchpad-memory-based cores.

First, we present the 3D-Wave parallelization strategy for video decoding, which scales to many-core processors. It is based on the observation that dependencies between frames arise from the motion compensation kernel and that motion vectors usually fall within a small range. The 3D-Wave strategy combines macroblock-level parallelism with frame- and slice-level parallelism by overlapping the decoding of frames while dynamically managing macroblock dependencies. The 3D-Wave was implemented and evaluated on a simulated many-core embedded processor consisting of 64 cores. Policies for reducing memory footprint and latency are presented, and the effects of memory latency, cache size, and synchronization latency are studied.

Our second contribution is an assessment of SIMD-only cores in light of the increasing complexity of current multimedia kernels. We evaluate the suitability of SIMD-only cores for the increasingly divergent branching in video processing algorithms, using the H.264 deblocking filter as a test case. In addition, the overhead imposed by the lack of a scalar processing unit in SIMD-only cores is measured using two methodologies, and low-area-overhead solutions are proposed to add scalar support to SIMD-only cores.

Finally, we focus on the memory hierarchy and propose a new software cache organization to increase the efficiency and efficacy of scratchpad memories for unpredictable and indirect memory accesses.
The proposed Multidimensional Software Cache (MDSC) reduces software cache overhead by letting the programmer exploit known access behavior to reduce the number of accesses to the software cache, and by grouping memory requests. An instruction to accelerate MDSC lookup is also presented and analyzed.
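
To make the idea behind the 3D-Wave strategy summarized above more concrete, the following is a minimal sketch and not the dissertation's implementation. It assumes unit-time macroblocks, unlimited cores, one reference frame (the immediately preceding one), and a motion-vector reach given in whole macroblocks. The function name `wave_schedule` and the parameter `mv_range` are hypothetical and chosen only for illustration. The sketch computes the earliest step at which each macroblock could finish, given intra-frame neighbour dependencies and the inter-frame dependency on the referenced area.

```python
from itertools import product


def wave_schedule(frames, width, height, mv_range):
    """Earliest finish step per macroblock, as {(frame, x, y): step}.

    Illustrative assumptions (not taken from the dissertation):
    every macroblock takes one step, cores are unlimited, each frame
    references only the previous frame, and motion vectors reach at most
    `mv_range` macroblocks in any direction.
    """
    finish = {}
    # Iterating frame -> row -> column guarantees that every dependency
    # of a macroblock has already been scheduled when we reach it.
    for f, y, x in product(range(frames), range(height), range(width)):
        deps = []
        # Intra-frame dependencies: left, up-left, up, up-right neighbours.
        for dx, dy in ((-1, 0), (-1, -1), (0, -1), (1, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                deps.append((f, nx, ny))
        # Inter-frame dependency: the area of the previous frame that
        # motion compensation may read, bounded by the motion-vector range.
        if f > 0:
            for ry in range(max(0, y - mv_range), min(height, y + mv_range + 1)):
                for rx in range(max(0, x - mv_range), min(width, x + mv_range + 1)):
                    deps.append((f - 1, rx, ry))
        start = max((finish[d] for d in deps), default=0)
        finish[(f, x, y)] = start + 1
    return finish


sched = wave_schedule(frames=2, width=4, height=3, mv_range=1)
frame0_done = sched[(0, 3, 2)]             # step at which frame 0 is complete
frame1_first_start = sched[(1, 0, 0)] - 1  # step at which frame 1 can begin
both_done = sched[(1, 3, 2)]               # step at which both frames are complete
```

Under these assumptions, the first macroblock of the second frame can start before the first frame is complete, because only a small neighbourhood of the reference frame has to be available. This overlap is the source of the additional parallelism the strategy exploits; a real decoder would track dependencies dynamically from the decoded motion vectors rather than from a fixed bound.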