Performance analysis of SCISM organization applied to the IA-32 architecture

Abstract

There is a huge variety of processor microarchitectural techniques for decreasing program execution time, such as pipelining, branch prediction, and various methods for exploiting Instruction Level Parallelism (ILP). Superscalar and VLIW machines are designed to exploit the ILP available in applications; these architectures improve performance by executing multiple independent instructions in parallel. However, this model faces serious challenges, such as data hazards and the limited number of independent instructions that can be executed in parallel. The Scalable Compound Instruction Set Machine (SCISM), also referred to as a Superscalar Instruction Set Machine, proposes solutions for many of these challenges. The SCISM approach can be applied to both Complex Instruction Set Computer (CISC) and Reduced Instruction Set Computer (RISC) machines. SCISM performs a run-time analysis of decoded instructions to determine on the fly whether they can be executed in parallel. This analysis is based on a predefined set of instruction compounding rules, which categorize instructions according to their hardware utilization. Furthermore, SCISM features interlock-collapsing hardware, which can eliminate certain instruction interlocks and often allows instructions with true data hazards to execute in parallel. Another important property of SCISM is that it is compatible with existing instruction sets, and therefore requires no binary code recompilation or translation. In this work we analyze the IA-32 Instruction Set Architecture (ISA) with respect to compounding, applying the instruction categorization and compounding rules to this ISA. The experiments are performed with the Bochs x86 emulator, modeling in-order execution. A very simple two-way compounding scheme is used, in which at most two instructions can execute in parallel. Experimental results with SPEC CPU2006 demonstrate that even in this simple scenario the fraction of instructions executing in parallel ranges from 13% to 24% for integer benchmarks and from 1% to 26% for floating-point benchmarks.
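To make the compounding idea concrete, the following is a minimal sketch of a greedy two-way compounding pass over a decoded instruction trace. The instruction categories, the rule table, and the names (ALU, MEM, can_compound, count_pairs) are hypothetical illustrations of the hardware-utilization-based rules the abstract describes, not the paper's actual rule set or implementation.

from dataclasses import dataclass

# Hypothetical categories, grouping instructions by the execution
# hardware they occupy (ALU, load/store unit, branch unit, FPU).
ALU, MEM, BRANCH, FPU = "alu", "mem", "branch", "fpu"

# Hypothetical compounding rules: which adjacent category pairs may
# issue together. A real SCISM design derives these from datapath
# conflicts; this table is only a plausible stand-in.
COMPOUNDABLE = {
    (ALU, MEM), (MEM, ALU),
    (ALU, BRANCH), (MEM, BRANCH),
    (ALU, ALU),  # pairable when interlock-collapsing ALU hardware exists
}

@dataclass
class Insn:
    category: str
    reads: set    # source registers / locations
    writes: set   # destination registers / locations

def can_compound(a: Insn, b: Insn, collapse_interlocks: bool) -> bool:
    """Decide whether two adjacent decoded instructions may pair."""
    if (a.category, b.category) not in COMPOUNDABLE:
        return False
    # A true (read-after-write) dependence from a to b normally blocks
    # pairing; interlock-collapsing hardware can absorb an ALU-ALU one.
    if a.writes & b.reads:
        return collapse_interlocks and a.category == ALU and b.category == ALU
    return True

def count_pairs(trace, collapse_interlocks=True):
    """Greedy two-way compounding: at most two instructions per issue."""
    pairs = i = 0
    while i < len(trace) - 1:
        if can_compound(trace[i], trace[i + 1], collapse_interlocks):
            pairs += 1
            i += 2  # both instructions of the pair issue together
        else:
            i += 1
    return pairs

# Example: mov eax,[mem]; add ebx,eax -- a MEM/ALU pair with a RAW
# dependence through eax, so it does not compound in this sketch.
trace = [Insn(MEM, reads={"mem"}, writes={"eax"}),
         Insn(ALU, reads={"eax", "ebx"}, writes={"ebx"})]
print(count_pairs(trace))  # 0: the eax hazard blocks the pair

In a two-way scheme like the one evaluated in this work, the fraction of instructions executing in parallel corresponds to how often such adjacent pairs satisfy the rules; the sketch counts those pairs over a trace.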
