# Computer Engineering

Mekelweg 4, 2628 CD Delft, The Netherlands

http://ce.et.tudelft.nl/

# MSc THESIS

### MePoEfAr: Memory and Power Efficient Architecture for Embedded Microcontrollers

#### **Imran Ashraf**

#### Abstract

CE-MS-2011-17

Microcontroller-based embedded systems have witnessed enormous growth in recent decades. Microcontrollers are among the most versatile products, found in most market segments and in several product families spanning from 4-bit to 64-bit processors. The application domain is such that some applications require only a little functionality, for instance when the microcontroller is used as a controller for a simple user interface. In other applications the functionality demands are high, such as the demand for floating-point calculations and signal processing. Microcontrollers have to meet these demands while being small in size and power efficient. Since memory occupies a large share of the area in a microcontroller and contributes the most towards power consumption, the architecture has to be memory efficient. Particularly for applications using a matrix of processors (as in multi-core architectures), each with its own program memory, program memory and power efficiency are major design goals. The memory efficiency of the instruction set, which also implies power efficiency, is an important factor that needs to be taken into account in the design of microcontroller architectures.

In this thesis, we propose a Memory and Power Efficient Architecture (MePoEfAr) for embedded microcontrollers. MePoEfAr is intended as an improvement on the class of architectures represented by the Atmel AVR, Texas Instruments MSP430 and ARM Cortex-M3 microcontrollers. These architectures were designed to be used as embedded controllers. They often have on-board SRAM for data storage and ROM/Flash for program storage. This property demands a memory-efficient architecture, because a small saving in on-chip program memory area quickly offsets the gates required for extra processor functionality. In addition, due to power aspects, especially for hand-held devices, the clock frequencies used are not very high, so the instruction decoding time is less critical.

A source-level profiler has been developed to gather statistics on various C language constructs in representative programs used in embedded applications. These statistics were used in making various trade-offs to tune the architecture. An assembler and an interpretive simulator were developed to perform assembler-level benchmarking for performance evaluation and comparison with three embedded architectures. Results show that MePoEfAr improves performance by 70% and 17% compared to the TI MSP430 and ARM Cortex-M3 microcontrollers, respectively. Furthermore, MePoEfAr outperforms the Atmel AVR by a factor of 2.32. The efficiency of MePoEfAr comes from its more orthogonal architecture, its memory-efficient and rich instruction set, efficient support for immediate values and displacements, and efficient instruction encoding with variable-length instructions of 1 to 4 bytes. Moreover, the availability of a large number of registers, and the possibility of a large number of operations on these registers, add to the efficiency of the architecture.
# MePoEfAr: Memory and Power Efficient Architecture for Embedded Microcontrollers

#### THESIS

submitted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

in

COMPUTER ENGINEERING

by

Imran Ashraf, born in Mansehra, Pakistan

Computer Engineering, Department of Electrical Engineering, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology

**Laboratory**: Computer Engineering

**Codenumber**: CE-MS-2011-17

**Committee Members**:

- Advisor: Dr. Said Hamdioui, CE, TU Delft
- Advisor: Ad J. van de Goor, CE, TU Delft
- Chairperson: Dr. ir. Koen L. M. Bertels, CE, TU Delft
- Member: Dr. Ir. G. Kuzmanov, CE, TU Delft
- Member: Dr. Alexandru Iosup, PDS, TU Delft
- Member: Ir. A.C.
de Graaf, CE, TU Delft

To the hug of my son Usman & To the smile of my niece Simra Khan

# Contents

- List of Figures
- List of Tables
- List of Source Codes
- Acknowledgements
- 1 Introduction
  - 1.1 Introduction
  - 1.2 Motivation
  - 1.3 Main Thesis Contributions
  - 1.4 Outline of Thesis
- 2 Overview of Microcontroller Architectures
  - 2.1 Classification of Microcontroller Architectures
    - 2.1.1 Classification Based on Architectural Style
    - 2.1.2 Classification Based on Memory Interfaces
    - 2.1.3 Classification Based on Word Size
    - 2.1.4 Classification Based on Operand Specification
  - 2.2 Example Architectures
    - 2.2.1 Atmel AVR AT90S851
    - 2.2.2 TI MSP430G2231
    - 2.2.3 ARM LPC1342 Cortex-M3
  - 2.3 Ideal Properties of a Microcontroller Architecture
    - 2.3.1 Program Memory Size
    - 2.3.2 Power Consumption
    - 2.3.3 Speed
    - 2.3.4 Modularity
  - 2.4 Summary
- 3 Statistics of C Language
  - 3.1 List of Language Constructs
  - 3.2 Profiling
    - 3.2.1 Profiler
    - 3.2.2 Profiler Benchmark Applications
  - 3.3 Frequency Distribution of C Language Constructs
    - 3.3.1 Frequency Distribution of Statements
    - 3.3.2 Operations
    - 3.3.3 Operands
    - 3.3.4 Miscellaneous
  - 3.4 Conclusions
- 4 MePoEfAr Architecture
- 5 MePoEfAr Assembler
  - 5.1 Introduction to Assemblers
  - 5.2 MePoEfAr Assembler
    - 5.2.1 Scanner
    - 5.2.2 Parser
    - 5.2.3 Analyzer
    - 5.2.4 Code Generator
  - 5.3 Instruction Bit-assignment
  - 5.4 Summary
- 6 MePoEfAr Interpreter
  - 6.1 Overview of Simulators
  - 6.2 MePoEfAr Interpreter
  - 6.3 Supervisor Program (main())
    - 6.3.1 Memory Address to Source Line Number Mapping
  - 6.4 MePoEfAr Microcontroller Model
    - 6.4.1 Program Status Word
    - 6.4.2 Program Counter
    - 6.4.3 Registers
    - 6.4.4 Program Memory
    - 6.4.5 Data Memory
    - 6.4.6 Stack and Stack Pointer
    - 6.4.7 Decoder
    - 6.4.8 Arithmetic and Logic Unit
  - 6.5 Summary
- 7 Assembler Level Benchmarking
  - 7.1 Evaluation Criteria
  - 7.2 Candidate Architectures for Comparison
    - 7.2.1 Atmel AVR AT90S851
    - 7.2.2 TI MSP430G2231
    - 7.2.3 ARM LPC1342
  - 7.3 Selected Benchmark Programs
    - 7.3.1 Benchmark Application 1: Recursive Factorial Program
    - 7.3.2 Benchmark Application 2: String Copy Program
    - 7.3.3 Benchmark Application 3: Bubble Sort Program
    - 7.3.4 Benchmark Application 4: Sensor Structure Program
    - 7.3.5 Benchmark Application 5: Matrix Multiplication Program
    - 7.3.6 Benchmark Application 6: FIR Program
  - 7.4 Result Evaluation and Comparison
    - 7.4.1 Static Results
    - 7.4.2 Dynamic Results
  - 7.5 Summary
- 8 Conclusion and Future Work
  - 8.1 Summary
  - 8.2 Conclusions
  - 8.3 Future Work
- Bibliography
- A Lexical Analyzer Generator Code
- B Parser Generator Code
- C Assembly Codes for the Selected Benchmarks
  - C.1 MePoEfAr Assembly Codes
  - C.2 Atmel AVR AT90S851 Assembly Codes
  - C.3 TI MSP430 Assembly Codes
  - C.4 ARM Cortex-M3 Assembly Codes
- D Calculations Details
  - D.1 MePoEfAr Calculations Details
  - D.2 Atmel AVR AT90S851 Calculations Details
  - D.3 TI MSP430G2231 Calculations Details
  - D.4 ARM Cortex-M3 LPC1342 Calculations Details

# List of Figures

- 1.1 Various Microcontroller Applications
- 1.2 Microcontrollers in Consumer Applications [17]
- 1.3 Annual Cellular Handset Sales [17]
- 2.1 A Classification of Microcontroller Architectures
- 5.1 Block Diagram of MePoEfAr Assembler Showing Various Steps Performed in the Assembly Process
- 5.2 Block Diagram of Scanner, which Reads the Input Assembly Instructions and Produces the Tokens
- 5.3 Tokens Generated by Scanner for the Example Program in Listing 5.1
- 5.4 Block Diagram of Parser. Tokens are Taken as Input from the Scanner and Parser Performs Syntactic Analysis and Constructs the Abstract Syntax Tree as an Output
- 5.5 Visual Representation of the Complete Abstract Syntax Tree for the Example Program Given in Listing 5.1
- 5.6 Block Diagram of Code Generator which Generates the Machine Code at the Output for the Abstract Syntax Tree of a Single Instruction at the Input
- 5.7 Summary of MePoEfAr Assembler Showing Various Steps Performed in the Assembly Process
- 6.1 Block Diagram of MePoEfAr Interpreter Showing its Position in Relation to the Host Machine
- 6.2 Block Diagram of the MePoEfAr Interpreter
- 7.1 Classification of Evaluation Criteria
- 7.2 Number of Instructions Required for Benchmark Programs
- 7.3 Program Memory Size (Bytes) for Selected Benchmarks
- 7.4 Total Number of Instructions Executed
- 7.5 Total Number of Instructions Executed inside Loop
- 7.6 Total Number of Execution Cycles
- 7.7 Instruction Memory Traffic (Cycles)

# List of Tables

- 2.1 Classification of Three Microcontroller Architectures Based on the Categories Described in This Chapter
- 3.1 Application Programs Used for Profiling
- 3.2 Frequency Distribution of Statements
- 3.3 Frequency Distribution of Assignment Statements Based on LHS
- 3.4 Distribution of Assignments Based on Complexity of RHS Expression
- 3.5 Frequency Distribution of Operations
- 3.6 Frequency Distribution of Integer Operations
- 3.7 Frequency Distribution of Floating Point Operations
- 3.8 Frequency Distribution of 8-bit Integer Operations
- 3.9 Frequency Distribution of 16-bit Integer Operations
- 3.10 Frequency Distribution of 32-bit Integer Operations
- 3.11 Frequency Distribution of Operands
- 3.12 Frequency Distribution of Constants
- 3.13 Frequency Distribution of Operand Accesses Based on Size
- 3.14 Average (per Function) of Variables Based on Locality
- 3.15 Frequency Distribution of Parameters Based on Data Types
- 3.16 Frequency Distribution of Locals Based on Data Types
- 5.1 Visual Representation of the Symbol Table for the Example Program in Listing 5.1
- 5.2 A Possible Bit Assignment for Various MePoEfAr Instruction Formats
- 7.1 Number of Instructions Required for Benchmark Programs
- 7.2 Program Memory Size (Bytes) for Selected Benchmarks
- 7.3 Total Number of Instructions Executed
- 7.4 Total Number of Instructions Executed inside Loop
- 7.5 Number of Cycles for Arithmetic Operations for Supported Data Types
- 7.6 Total Number of Execution Cycles
- 7.7 Instruction Memory Traffic (Cycles)
- 7.8 Data Memory Traffic (Cycles)
- 7.9 Performance Comparison Summary
- D.1 MePoEfAr Calculations
- D.2 Atmel AVR Calculations
- D.3 TI MSP430 Calculations
- D.4 ARM Cortex M3 Calculations

# Listings

- 5.1 MePoEfAr Example Assembly Program Used for Illustration of Various Assembler Stages in this Chapter
- 5.2 MePoEfAr Example Code Used for the Illustration of Branch Instruction Size and Update of Location Counter
- 6.1 MePoEfAr main() Interpreter C Code. It Prompts the User for Input Hex File, Calls loadPM() to Load it into Memory. runProgram() Executes the Loaded Program
- 6.2 Code Used to Store the Mapping of Program Memory Address and Line Numbers in MePoEfAr Interpreter
- 6.3 runProgram() Function in which Instructions are Fetched, Decoded and Executed
- 6.4 Code for Instruction Decoding
- 7.1 Benchmark Application 1: Recursive Factorial Program
- 7.2 Benchmark Application 2: String Copy Program
- 7.3 Benchmark Application 3: Bubble Sort Program
- 7.4 Benchmark Application 4: Sensor Structure Program
- 7.5 Benchmark Application 5: Matrix Multiplication Program
- 7.6 Benchmark Application 6: FIR Program
- A.1 Flex Code for the Lexical Analyzer Generator for MePoEfAr Assembler
- B.1 Bison Code for the Parser Generator for MePoEfAr Assembler
- C.1 MePoEfAr Assembly Code for Benchmark 1: Recursive Factorial
- C.2 MePoEfAr Assembly Code for Benchmark 2: String Copy
- C.3 MePoEfAr Assembly Code for Benchmark 3: Bubble Sort
- C.4 MePoEfAr Assembly Code for Benchmark 4: Sensor Structure
- C.5 MePoEfAr Assembly Code for Benchmark 5: Matrix Multiplication
- C.6 MePoEfAr Assembly Code for Benchmark 6: FIR
- C.7 Atmel AVR AT90S851 Assembly Code for Benchmark 1: Recursive Factorial
- C.8 Atmel AVR AT90S851 Assembly Code for Benchmark 2: String Copy
- C.9 Atmel AVR AT90S851 Assembly Code for Benchmark 3: Bubble Sort
- C.10 Atmel AVR AT90S851 Assembly Code for Benchmark 4: Sensor Structure
- C.11 Atmel AVR AT90S851 Assembly Code for Benchmark 5: Matrix Multiplication
- C.12 Atmel AVR AT90S851 Assembly Code for Benchmark 6: FIR
- C.13 TI MSP430 Assembly Code for Benchmark 1: Recursive Factorial
- C.14 TI MSP430 Assembly Code for Benchmark 2: String Copy
- C.15 TI MSP430 Assembly Code for Benchmark 3: Bubble Sort
- C.16 TI MSP430 Assembly Code for Benchmark 4: Sensor Structure
- C.17 TI MSP430 Assembly Code for Benchmark 5: Matrix Multiplication
- C.18 TI MSP430 Assembly Code for Benchmark 6: FIR
- C.19 ARM Cortex-M3 Assembly Code for Benchmark 1: Recursive Factorial
- C.20 ARM Cortex-M3 Assembly Code for Benchmark 2: String Copy
- C.21 ARM Cortex-M3 Assembly Code for Benchmark 3: Bubble Sort
- C.22 ARM Cortex-M3 Assembly Code for Benchmark 4: Sensor Structure
- C.23 ARM Cortex-M3 Assembly Code for Benchmark 5: Matrix Multiplication
- C.24 ARM Cortex-M3 Assembly Code for Benchmark 6: FIR

# Acknowledgements

First of all, I would like to express my gratitude to my supervisors, Ad van de Goor and Said Hamdioui, for giving me a chance to work under their kind supervision.
Special thanks to Ad van de Goor for his valuable guidance and precious time throughout this work. He has always come down to my level and helped me to understand the architecture-related concepts. It is really an honor for me to work with a member of the PDP-11 architecture team.

I would also like to thank a number of people in the CE group for their help and support. Thanks to Georgi Kuzmanov, Nadeem and Fakhar for their useful discussions; to Roel for allowing me to use his QUIPU profiler and giving me a quick start on its modification; and to Anca Molnos for providing me with the EEMBC benchmarks. Thanks to Max Ferger (from ACE BV) for giving me a chance to attend the CoSy training, and to Laiq, Faisal and Mottaqiallah for proofreading parts of my thesis and for their friendly support throughout my MSc studies.

Among my friends at TU Delft, I would like to thank Di and Wu for their wonderful company throughout my stay here at Delft. I would also like to thank Husnul Amin, Mehfooz, Hamayun and Seyab for their help in finding a wonderful accommodation for me and for their help in setting it up.

Last but not least, I would like to thank my family, especially my parents, for their love and support throughout my good and bad times, and for making me who I am. Sincere thanks to my wife for her care, patience and encouragement throughout my MSc studies and especially during my thesis work. She helped me a lot by taking good care of our home and child, leaving me free to concentrate on my studies.

Imran Ashraf

Delft, The Netherlands

September 8, 2011

# Introduction

This chapter provides an introduction to the work presented in this thesis. Section 1.1 highlights some of the applications of microcontrollers, with statistics from an industry research report for the year 2010. Section 1.2 presents the motivation behind this thesis work. Section 1.3 lists the main contributions of this thesis. Section 1.4 outlines the remaining content of this thesis.
#### 1.1 Introduction

One of the important aspects of modern electronic technology is embedded systems based on microcontrollers. According to the Microchip ISA Vision Summit 2011 [17], 10 billion microcontroller units are produced per year for embedded applications, compared to 400 million units per year for general-purpose microprocessor-based applications. This growth in the microcontroller industry is driven by the huge range of application domains in which microcontrollers can be used. Figure 1.1 provides a brief list of applications which use microcontrollers. Among other applications, consumer applications alone used about 3.39 billion microcontrollers in the year 2010, as can be seen from Figure 1.2.

| Category | Applications |
|----------|--------------|
| Consumer | High definition TV, Stereo receiver, DVD player, Universal remotes, Cable TV converter, Video game systems, Cameras, Garage opener, Carbon Monoxide detector, Microwave oven, Smoke detector, Water filters, Cordless tools, Vacuum cleaner, Electric blanket |
| Automotive | CDI, Body Control, Infotainment, Keyless entry, Radar detector, Cruise control, Anti-lock braking, Speedometer, Climate control, Security System, Active suspension, Fuel pump control, Fuel injection, Air bag sensor, Power seats, Compass |
| Office Automation | Computer mouse, Laptop trackball, Computer keyboard, Handheld scanner, Laser printer interface board, Wireless LANs, Printer cartridges, Hi-res scanner, Bar code reader, Disk drive, Tape back-up unit, US bus hubs, Facsimile machine, CD/DVD writer |
| Telecom | Cellular telephone, Cordless telephone, Feature phone, Answering machine, Pay phone, Pager, Modem, Caller ID, Line cards, Hands-free kits, Long distance service router, Power Amp |
| Industrial | Power Inverter, Motor control, Compressor, Thermostat, Postage meter, Utility meter, Robotics, Process control, Gas pump, Smoke detector, Credit card reader, Access verification and control, Lighting sensors and control, Ballast control |

Figure 1.1: Various Microcontroller Applications

The vast diversity of microcontroller applications demands a variety of microcontroller architectures satisfying the needs of these application domains. Most of these devices are aimed at small size and low power consumption, for instance hand-held devices such as cell phones, digital watches and pagers. Figure 1.3 provides the statistics of annual cellular handset sales. It can be seen from these statistics that about 1.5 billion cellular phone units were sold in the year 2010.

Figure 1.2: Microcontrollers in Consumer Applications [17]

Figure 1.3: Annual Cellular Handset Sales [17]

Memory and power efficiency can be achieved in several ways at different design levels. This thesis discusses the details of an embedded microcontroller in which memory and power efficiency is achieved at the architecture level.

#### 1.2 Motivation

The key points which motivated the design of this memory and power efficient microcontroller architecture are:

- Embedded microcontrollers are found inside another system, where their small size is important. They often have on-board SRAM for data storage and ROM/Flash for program storage. Memory occupies a substantial area on the chip. This property demands a memory-efficient architecture, because small savings in the on-chip program memory area quickly offset the gates required for extra processor functionality, reducing the size and cost of the microcontroller.

- Power consumption is an important criterion in the design of microcontrollers, particularly for hand-held devices running on batteries. In some cases, replacing the batteries is very costly, for instance in the case of underground water meters and heart pacemakers.
So these devices have to be power efficient.

- Because of power aspects, especially for hand-held devices, the clock frequencies used are not very high, so the instruction decoding time is less critical. This means that for these devices, design choices can be made in favor of power efficiency rather than clock speed.

- In applications using a matrix of processors (as in multi-core architectures), each with its own program memory, program memory and power efficiency are major design goals.

- The memory efficiency of the instruction set also implies power efficiency, which was the key motivation behind this architecture.

#### 1.3 Main Thesis Contributions

This thesis makes the following main contributions:

1. Provides the instruction set architecture of a memory and power efficient embedded microcontroller.
2. Provides the static profiling statistics used to fine-tune the architecture for memory and power efficiency.
3. Provides the details of the software tool chain, including:
   - an assembler to translate assembly programs into machine code;
   - an interpreter that models the architecture to simulate the execution of machine code.
4. Provides the details of the performance evaluation of this architecture.
5. Provides the static and dynamic results of a performance comparison with three well-known embedded microcontrollers.

#### 1.4 Outline of Thesis

An outline of this thesis is presented here to give an overview of the whole thesis.

Chapter 2 presents an overview of microcontroller architectures. A classification of microcontroller architectures based on several criteria is presented. Three well-known microcontroller architectures, which we have used for performance comparison, are discussed in detail.

Chapter 3 discusses the static profiling. The statistics of high-level language constructs such as statements, operations and constants are provided to show their frequency distributions in four C language benchmark programs.
Chapter 4 provides the details of the MePoEfAr architecture. It starts with the overall architecture properties. Issues such as the type of architecture, bit and byte numbering, data types, instruction classification and register sets are discussed. Global architecture issues, such as the layout of the program status word and the memory map, are provided. The various instruction formats of the MePoEfAr architecture are presented with examples. Furthermore, the operation sets supported by these instruction formats are discussed, with a description of how these operations affect the condition codes. A brief description of exceptional conditions, such as traps and interrupt vectors, is provided, followed by a discussion of the extension of program and data memory. A summary of the encoding cost and feasibility of the MePoEfAr architecture is provided to show the room for future extension of the architecture.

Chapter 5 gives the implementation details of the MePoEfAr assembler. It covers the intermediate steps involved in translating an assembly program to machine code. Instruction bit assignments are provided to show the bit patterns used to represent assembly instructions.

Chapter 6 discusses the MePoEfAr interpreter, which simulates the MePoEfAr microcontroller. It discusses the two main parts of the interpreter. The first part loads the machine code into memory and performs some bookkeeping for debugging information. The second part is the microcontroller model, which fetches instructions from memory, decodes them and executes them.

Chapter 7 covers the details of the assembler-level benchmarking which we performed to evaluate the performance of the MePoEfAr architecture. Furthermore, it provides the results of the static and dynamic performance comparison with three well-known microcontrollers.

Chapter 8 provides the conclusions and recommendations for future work. This chapter is followed by the bibliography and appendices.
The scanner and parser generator codes for the MePoEfAr assembler are provided in Appendix A and Appendix B, respectively. Appendix C provides the assembly codes of the benchmark programs we have used for performance comparison. Details of the calculations underlying this comparison are provided in Appendix D.

# Overview of Microcontroller Architectures

In this chapter an overview of microcontroller architectures is presented. Microcontroller architectures can be classified based on a number of factors, such as the architectural style, memory interfaces, word size and operand specification. Section 2.1 provides the classification of microcontroller architectures based on these criteria. A brief description of three example architectures is given in Section 2.2. Properties of an ideal microcontroller are discussed in Section 2.3. Finally, Section 2.4 summarizes this chapter.

#### 2.1 Classification of Microcontroller Architectures

A large number of microcontrollers have been designed to fulfill the requirements of their diverse application areas [19]. These microcontrollers can be classified based on various criteria. Figure 2.1 provides an overview of a classification of these microcontrollers based on architectural aspects. Each category in this classification is described in the following subsections.

Figure 2.1: A Classification of Microcontroller Architectures

#### 2.1.1 Classification Based on Architectural Style

Based on the architectural style, microcontrollers can be classified as having either simple, fixed-size instructions or complex, variable-length instructions, as described below.

**Reduced Instruction Set Computer (RISC)** style architectures have simple instructions [31]. Most of the instructions in these architectures execute in a single cycle, as they involve register-to-register operations. Data is transferred between memory and registers only with Load and Store instructions, which use simple addressing modes. This is the reason they are also known as Load-Store architectures.
From the performance point of view, the trade-offs in the design of RISC architectures are made in favor of a lower Cycles Per Instruction (CPI) count, at the expense of increased code size. The reason for the increased code size is that the complexity of the system is shifted from hardware to software, as most of the high-level language (HLL) support is provided in software [30]. More assembly instructions are therefore required to perform a given HLL operation, resulting in increased code size. Examples of microcontrollers based on the RISC architecture are:

- ARM Cortex-M3 series microcontrollers
- Atmel AVR AT90S851
- PIC microcontrollers by Microchip, e.g. PIC16F84
- MSP430 family by Texas Instruments

**Complex Instruction Set Computer (CISC)** style architectures are characterized by a large number of instructions, most of which require a number of cycles to execute. The instructions have variable length. CISC architectures support register-to-register, register-to-memory and memory-to-memory operand specification in instructions. Normally a variety of addressing modes is available in these architectures. The advantage of the CISC architecture is that most of the instructions are powerful, allowing the programmer or compiler to use one instruction in place of many simpler instructions, resulting in reduced code size. Examples of microcontrollers based on the CISC architecture are:

- Intel 8051, 8052 and 8096 families
- Motorola 68000 family (designed and marketed by Freescale Semiconductor)
- M16C/60 and H8SX cores by Renesas Electronics
- TLCS 870 C1, TLCS 900 L1 and TLCS 900 H1 core families by Toshiba

#### 2.1.2 Classification Based on Memory Interfaces

Microcontroller architectures can have either a single memory for instructions and data or physically separate memories to hold program and data. Based on these memory interfaces, architectures are classified as follows.

**Von Neumann** architectures store both program and data in a common main memory [33].
This means that, at any given time, either an instruction is fetched from this memory or data is read from or written to it. The main advantage of the Von Neumann architecture is the simpler microcontroller design that results from having a single memory interface. The disadvantage is that, because the bus is shared, an instruction fetch and a data access cannot occur at the same time. This is known as the Von Neumann bottleneck, as pointed out by Backus [18]. Examples of microcontroller architectures based on the Von Neumann style are:

- Texas Instruments MSP430
- Motorola 68HC11

Harvard architectures are characterized by two physically separate memories and pathways for program and data. Instructions can be stored in read-only memory and data in read-write memory. This means that the attributes of the instruction and data memories can differ; for instance, they may have different word widths, access timings, implementation technologies or address structures. Harvard architectures have distinct instruction and data spaces. Because instruction fetches and data accesses do not contend for a single memory pathway, a Harvard architecture microcontroller can be faster for a given circuit complexity. Examples of Harvard microcontroller architectures are:

- Renesas RX600 series microcontrollers
- Microchip PIC microcontrollers

Modified Harvard architectures have a unified instruction and data space. They have separate pathways for instructions and data, which are implemented by instruction and data caches. Examples of modified Harvard microcontroller architectures are:

- Atmel AVR AT90S8515 microcontroller
- ARM Cortex-M3 series

#### 2.1.3 Classification Based on Word Size

Although 4-bit (COP400 by National Semiconductor) and 24-bit (PIC24 by Microchip) microcontroller architectures also exist, the most common word sizes are 8, 16 and 32 bits.

**8-bit Architecture** performs arithmetic and logical operations on 8-bit data.
Examples of 8-bit microcontrollers are:

- Intel 8051 family
- Motorola MC68HC11 family
- Atmel AVR AT90S8515

**16-bit Architecture** performs arithmetic and logical operations on 16-bit data. Examples of 16-bit microcontrollers are:

- MSP430 family by Texas Instruments
- S12 and S12X families by Freescale
- Motorola MC68HC12 and MC68332 families

**32-bit Architecture** performs arithmetic and logical operations on 32-bit data. Examples of 32-bit microcontrollers are:

- ARM Cortex-M based family
- Atmel AVR32
- Microchip PIC32, based on the MIPS M4K architecture

#### 2.1.4 Classification Based on Operand Specification

The number of operands in a single instruction varies from one to several. The work presented in [24] provides a taxonomy of architectures based on operands. The most common <sup>1</sup> architectures, classified by number of operands, are described below.

**1-Operand Architectures** specify one operand explicitly in the instruction; the other operand is the implicit accumulator. The accumulator is a special register that holds the temporary results of a computation. To perform an operation, additional instructions are required to move an operand into the accumulator and to move the result back to where it is needed. The Intel 8051 architecture is an example of a 1-operand architecture. In these architectures, A = B + C is implemented as:

    load B
    add C
    store A

**2-Operand Architectures** have two operands explicitly specified in the instruction. One of the operands serves as both source and destination. The statement A = B + C is implemented in these architectures as:

    load r1, B
    load r2, C
    add r1, r2
    store r1, A

In these examples, ri are general purpose registers. Examples of 2-operand microcontroller architectures are:

- Atmel AVR AT90S8515
- Texas Instruments MSP430
- Microcontrollers based on the ARM Thumb <sup>2</sup> architecture

**3-Operand Architectures** explicitly specify one destination and two source operands in the instruction.
So A = B + C is implemented as:

    load r1, B
    load r2, C
    add r3, r1, r2
    store r3, A

Specifying three operands in an instruction requires a relatively large encoding space. Most 3-operand architectures are therefore 32-bit or wider architectures. Examples of 3-operand architectures are:

- Atmel AVR32 architecture
- Microcontrollers based on ARM architectures

<sup>1</sup> 0-operand architectures, also known as stack-based architectures, have their operands implicitly on the stack. The Java Virtual Machine is an example of a stack-based architecture. These architectures are not common for microcontrollers.

<sup>2</sup> Thumb instructions are 16-bit instructions that accommodate the specification of only two operands.

#### 2.2 Example Architectures

This section describes three example architectures in terms of the classification given in this chapter. These example architectures are later used for the performance comparison in assembler-level benchmarking (Chapter 7). The three microcontroller architectures are:

1. Atmel AVR AT90S8515
2. TI MSP430G2231
3. ARM Cortex-M3 LPC1342

Table 2.1 provides an overview of this classification. For brevity, TI, ARM and AVR in this table refer to the TI MSP430G2231, ARM Cortex-M3 LPC1342 and AVR AT90S8515 microcontrollers, respectively.

Table 2.1: Classification of Three Microcontroller Architectures Based on the Categories Described in This Chapter

| Name | Architectural Style | Word Size (bits) | Memory Interface | Operand Specification |
|------|---------------------|------------------|------------------|-----------------------|
| AVR | RISC | 8 | Modified Harvard | 2-operand |
| TI | RISC | 16 | Von Neumann | 2-operand |
| ARM | RISC | 32 | Modified Harvard | 3-operand |

#### 2.2.1 Atmel AVR AT90S8515

The AT90S8515 is a low-power CMOS 8-bit microcontroller based on the AVR RISC architecture [15] developed by Atmel [14].
It uses the modified Harvard architecture concept. Although it is an 8-bit microcontroller, each instruction takes one or two 16-bit words. It has 32 single-byte general purpose registers with single-clock-cycle access time. It supports five addressing modes.

#### 2.2.2 TI MSP430G2231

The second candidate is the MSP430G2231 [3], a 16-bit RISC architecture developed by Texas Instruments [2]. It has been designed for low-cost, low-power embedded applications. It uses the Von Neumann architecture with a single memory space for instructions and data. Instructions generally take one cycle per word fetched or stored. It has 27 core instructions and 24 emulated instructions. It supports seven addressing modes for source operands and four addressing modes for destination operands. It has the following 16-bit registers:

**R0:** Program counter

**R1:** Stack pointer

**R2:** Status register (only in register addressing with word data type)

**R2 and R3:** Used as constant generators for the most frequent constants (0, 1, 2, 4, 8)

**R4-R15:** General purpose registers

The user guide [4] provides further details of the MSP430 microcontroller architecture.

#### 2.2.3 ARM LPC1342 Cortex-M3

The third candidate is the LPC1342 [13] developed by NXP (founded by Philips) [12], a Cortex-M3 based low-power 32-bit RISC microcontroller. ARM is a fabless company that designs these architectures as Intellectual Property (IP) modules and sells licenses to other companies, which manufacture the chips. In the case of the LPC1342, the manufacturing company is NXP.
ARM provides various architectures targeting different application areas, such as:

- The ARM Cortex-A series, which targets general purpose processor cores
- The ARM Cortex-R series, a family of processors for real-time systems
- The ARM Cortex-M series, designed for low-power, memory-efficient embedded applications

Among the M-series processors, the Cortex-M3 is designed specifically for embedded microcontrollers. It is based on the modified Harvard architecture concept. It supports the Thumb-2 instruction set, which reduces instruction memory requirements by including support for 16-bit instructions. It has the following general purpose and special purpose registers:

**R0-R12:** General purpose registers

**R13:** Stack pointer

**R14:** Link register, used by subroutines for the return address

**R15:** Program counter

**xPSR:** Program Status Register

Registers R0-R7 are accessible by all instructions, whereas registers R8-R12 are accessible by all 32-bit instructions but only by some 16-bit instructions. Further details can be found in the technical reference manual of the ARM Cortex-M3 architecture (and of other ARM architectures) [5].

#### 2.3 Ideal Properties of a Microcontroller Architecture

The ideal properties of a microcontroller architecture are properties that cannot all be realized in a single architecture. They are interrelated, so that a design trade-off made to improve one property may adversely affect one or more of the others. For instance, choosing simple fixed-width instructions favors a higher clock speed, but the downside of this choice is an increased program footprint. The ideal properties of microcontrollers are briefly discussed below.

#### 2.3.1 Program Memory Size

Microcontrollers are normally embedded inside other systems, so the size of the microcontroller matters: it has to fit in the system. Program memory occupies a major share of the chip area.
So, ideally, the program memory size of a microcontroller should be negligible. In other words, the architecture should be so memory efficient that the program size for a given application is negligible.

#### 2.3.2 Power Consumption

Power consumption is an important criterion in the design of microcontrollers, particularly for hand-held devices running on batteries. In some cases, replacing the batteries is very costly, for instance in underground water meters and heart pacemakers. An ideal microcontroller should consume a negligible amount of power.

#### 2.3.3 Speed

Because microcontrollers are used in diverse application areas, the demands on processing speed are also diverse. Some applications, such as streaming applications, require high processing speed. Ideally, the processing speed should be very high.

#### 2.3.4 Modularity

Ideally, a microcontroller architecture should be highly modular, so that a modification of one aspect does not require changes to the rest of the architecture. Modularity helps in the development and testing of the individual subsystems, which reduces time to market. During the life of the architecture, modularity also assists its evolution, resulting in variants that satisfy particular application requirements. This modularity can be further classified as follows.

Modularity w.r.t. instruction and data address range: The architecture should be modular such that, at any stage in the life of the microcontroller, the instruction address space can be extended without affecting the data memory address space.

Modularity w.r.t. data types and number of registers per data type: The microcontroller architecture should be able to support a variety of data types without modifying the architecture.
Furthermore, it should be possible to change the number of registers of a particular data type depending on the nature of the application.

#### 2.4 Summary

The demand for microcontroller-based embedded systems is increasing every year, as a large number of applications use microcontrollers as embedded processing units. The diversity of these applications has resulted in a large variety of microcontroller architectures. This chapter provided an overview of them. Microcontroller architectures follow the RISC or CISC philosophy, depending on whether high processing speed or small code size is the priority. Word sizes range from 4 to 64 bits, with 8, 16 and 32 bits being the most common today. Some architectures have a single storage and a single address space for program and data, following the Von Neumann style. Harvard architectures, which have distinct program and data memories, and modified Harvard architectures, which have a single address space but separate buses for instructions and data, are also commonly used in microcontrollers. Very few architectures are single-operand (accumulator-based) architectures. 2-operand instruction formats are common among 16-bit architectures. Because of their high encoding space requirement, 3-operand architectures are mostly 32-bit architectures. This classification was also summarized for the three microcontroller architectures that we use in the benchmarks for performance comparison. Ideally, a microcontroller architecture should have negligible program memory size and very high processing speed at negligible power consumption. An ideal microcontroller architecture should also be modular, so that variants can easily be produced and one part of the architecture can evolve without modifying the rest.
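As a concrete reminder of the operand-specification styles summarized above, the following Python sketch replays the A = B + C sequences of Section 2.1.4 on a toy machine. It is a minimal model for counting instructions only: the `run` function, the tuple-based instruction format and the memory values are assumptions made for this example and do not correspond to any real instruction set.

```python
# Toy interpreter for the A = B + C sequences of Section 2.1.4.
# The instruction format and the run() helper are hypothetical.

def run(program, memory):
    """Execute a list of instruction tuples and return the updated memory."""
    mem = dict(memory)
    regs = {}   # general purpose registers, created on first use
    acc = 0     # implicit accumulator of the 1-operand style
    for op, *args in program:
        if op == "load" and len(args) == 1:      # acc <- mem[x]
            acc = mem[args[0]]
        elif op == "add" and len(args) == 1:     # acc <- acc + mem[x]
            acc += mem[args[0]]
        elif op == "store" and len(args) == 1:   # mem[x] <- acc
            mem[args[0]] = acc
        elif op == "load":                       # rd <- mem[x]
            regs[args[0]] = mem[args[1]]
        elif op == "add" and len(args) == 2:     # rd <- rd + rs
            regs[args[0]] += regs[args[1]]
        elif op == "add":                        # rd <- rs1 + rs2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "store":                      # mem[x] <- rs
            mem[args[1]] = regs[args[0]]
        else:
            raise ValueError(f"unknown instruction: {op} {args}")
    return mem

one_operand = [("load", "B"), ("add", "C"), ("store", "A")]
two_operand = [("load", "r1", "B"), ("load", "r2", "C"),
               ("add", "r1", "r2"), ("store", "r1", "A")]
three_operand = [("load", "r1", "B"), ("load", "r2", "C"),
                 ("add", "r3", "r1", "r2"), ("store", "r3", "A")]

memory = {"A": 0, "B": 2, "C": 3}   # arbitrary example values
for name, program in [("1-operand", one_operand),
                      ("2-operand", two_operand),
                      ("3-operand", three_operand)]:
    result = run(program, memory)["A"]
    print(f"{name}: {len(program)} instructions, A = {result}")
```

All three sequences leave the same value in A. In this toy model the accumulator sequence needs three instructions and the register sequences four; the sketch says nothing about encoding size or cycle counts, which depend on the actual instruction set.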
Before turning to the details of the MePoEfAr microcontroller architecture, statistics of high-level language constructs are presented in the next chapter.

# Statistics of C Language

The previous chapter presented an overview of microcontroller architectures. Before the details of our architecture are described in the next chapter, this chapter presents the frequency distributions of various C language constructs. An important rule in the design of a microcontroller architecture is to implement the most frequent cases efficiently. To determine the frequency of different constructs in the language, four applications have been profiled: Coremark and AutoBench (both EEMBC benchmarks) and the assembler and interpreter of our architecture. The results for the different types of statements, operations and operands are tabulated. These results are then used in the design of the architecture presented in the next chapter.

This chapter opens with a list of language constructs, which gives an overview of what is analyzed. Section 3.2 briefly discusses profiling, the profiler that was developed and the application programs used for profiling. Section 3.3 provides the profiling results with analysis. Section 3.4 concludes the chapter.

#### 3.1 List of Language Constructs

This section lists the C language constructs for which frequency distributions are presented. The results are divided into four groups: statements, operations, operands and miscellaneous measurements. The detailed list of these constructs is given below:

- 1. Statements
  - (a) Assignments
    - i. Assignment types based on LHS
      - A. Variable
      - B. Array element
      - C. Structure/union field
    - ii. Assignment types based on complexity of RHS expression
      - A. A = Constant
      - B. A = A op Constant
      - C. A = B
      - D. A = B op Constant
      - E. A = A op B
      - F. A = B op C
      - G. Others (with complex RHS)
  - (b) if statements
    - i. If-only statements
    - ii. If-else statements
  - (c) switch statements
  - (d) break statements
  - (e) continue statements
  - (f) goto statements
  - (g) Loops
  - (h) Function calls
  - (i) return statements
- 2. Operations
  - (a) Arithmetic operations
    - i. +
    - ii. -
    - iii. \*
    - iv. /
    - v. %
  - (b) Address arithmetic operations
    - i. +
    - ii. -
  - (c) Relational operations
    - i. ==
    - ii. !=
    - iii. <
    - iv. >
    - v. <=
    - vi. >=
  - (d) Bitwise operations
    - i. and
    - ii. or
    - iii. xor
    - iv. not
  - (e) Shift operations
    - i. Shift left
    - ii. Shift right
    - iii. Arithmetic shift right
  - (f) Complement operations
  - (g) Absolute operations
  - (h) Type conversions
    - i. 8 to 16
    - ii. 8 to 32
    - iii. 8 to 64
    - iv. 16 to 8
    - v. 16 to 32
    - vi. 16 to 64
    - vii. 32 to 8
    - viii. 32 to 16
    - ix. 32 to 64
    - x. 64 to 8
    - xi. 64 to 16
    - xii. 64 to 32
    - xiii. integer to address
    - xiv. address to integer
    - xv. integer to real
    - xvi. real to integer
    - xvii. others
- 3. Operands
  - (a) Constants
    - i. -1, 0, 1, 2, ..., 14, 15
    - ii. 16-31
    - iii. 32-63
    - iv. 64-127
    - v. 128-255
    - vi. 256-65535
    - vii. others
  - (b) Variable accesses
    - i. 8-bit variable access
    - ii. 16-bit variable access
    - iii. 32-bit variable access
    - iv. 64-bit variable access
  - (c) Array accesses
  - (d) Structure/union field accesses
  - (e) Function calls
- 4. Miscellaneous
  - (a) Average number of function parameters
  - (b) Average number of function locals
  - (c) Average number of globals used in a function
  - (d) Frequency distribution of parameters based on data types
  - (e) Frequency distribution of locals based on data types

#### 3.2 Profiling

Profiling is program analysis carried out for a number of purposes, for instance to measure various metrics. Operation frequencies, operand frequencies and function call counts are a few examples of such metrics. The analysis can be static or dynamic. Static analysis is performed on the application without running it. Dynamic analysis, on the other hand, examines the program during execution.
From the program memory point of view, the results of static profiling are the relevant ones, and these are presented in this chapter.

#### 3.2.1 Profiler

Profilers are software tools that automate profiling; in other words, they create the profile of an application program. We have modified the Quipu [20] static profiler to obtain all the results listed in Section 3.1. Quipu is part of the $Q^2$ profiling framework, which was developed in the context of the **D**elft **W**ork**B**ench (DWB) [21]. This tool is implemented as an engine in the CoSy compiler system [6] developed by ACE Associated Compiler Experts.

#### 3.2.2 Profiler Benchmark Applications

An important step in the statistical analysis of language constructs is the selection of the applications to be profiled. We have profiled the following four applications:

1. Coremark
2. AutoBench
3. Assembler
4. Interpreter

Table 3.1 gives the number of lines of code and the number of functions in the four selected applications. Blank lines and comments are included in the line counts.

| S.No. | Application | Lines of Code | No. of Functions |
|-------|-------------|---------------|------------------|
| 1 | Coremark | 892 | 27 |
| 2 | AutoBench | 1986 | 26 |
| 3 | Assembler | 8194 | 104 |
| 4 | Interpreter | 5597 | 214 |

Table 3.1: Application Programs Used for Profiling

A brief description of these applications is given below.

**Coremark:** Coremark [7] is an Embedded Microprocessor Benchmark Consortium (EEMBC) benchmark [10]. Unlike synthetic benchmarks, EEMBC benchmarks are real application programs. Coremark is freely available from the EEMBC website and is used for a quick comparison of embedded processor and microcontroller core functionality. The Coremark suite contains the three applications listed below:

1. *core\_matrix* performs common matrix operations such as additions and multiplications on integer and floating point data.
2. *core\_state* determines whether an input stream contains valid numbers.
3. *core\_list* performs list processing, such as searching and sorting a linked list.

**AutoBench:** AutoBench [9] is another EEMBC benchmark suite. It is a suite of benchmarks that allows users to predict the performance of microprocessors and microcontrollers in automotive, industrial and general-purpose applications. It is not a free benchmark, but the Computer Engineering Lab has a license to use it. It involves matrix operations, bit manipulation, arithmetic operations, table lookup and signal processing such as filtering.

**Assembler:** This application is the assembler developed for our architecture. It contains the lexical analysis code generated by Flex (a general purpose lexical analyzer generator) [1], the parser code generated by Bison (a parser generator) [11], and code for tree traversal, analysis and symbol table generation. Finally, it generates machine code for our architecture, which involves bitwise operations. Further details are provided in Chapter 5.

**Interpreter:** This application is the interpreter developed for our architecture. It reads in the machine code, decodes the instructions and executes them to produce results according to the semantics of each instruction. Further details of this interpreter are given in Chapter 6.

#### 3.3 Frequency Distribution of C Language Constructs

This section presents the frequency distributions of the C language constructs listed in Section 3.1, as obtained by our profiler for the selected applications.

#### 3.3.1 Frequency Distribution of Statements

Table 3.2 gives the frequency distribution of the various C statements for the four selected application programs. It can be seen from this table that assignment statements constitute the bulk of the statements, with an average frequency of 58.96%. The second most frequent statement is the if statement, with an average of 19.73%. Similarly, statistics for the other statements are also tabulated.
The frequency of the break statement is about 9%, which comes mainly from the case branches of switch statements, especially in the assembler and the interpreter.

| Statement | Coremark (%) | AutoBench (%) | Assembler (%) | Interpreter (%) | Average (%) |
|----------------|-------|-------|-------|-------|-------|
| Assignments | 62.69 | 67.96 | 53.99 | 51.18 | 58.96 |
| if else | 18.98 | 16.87 | 27.13 | 15.94 | 19.73 |
| switch | 0.64 | 0 | 1.4 | 3.66 | 1.43 |
| goto | 0 | 0 | 1.21 | 0 | 0.3 |
| Loops | 8.1 | 9.29 | 2.18 | 2.37 | 5.49 |
| Function Calls | 1.92 | 1.7 | 1.04 | 1.07 | 1.43 |
| return | 3.84 | 3.56 | 2.58 | 4.15 | 3.53 |
| break | 3.84 | 0.62 | 10.37 | 21.62 | 9.11 |
| continue | 0 | 0 | 0.1 | 0 | 0.03 |
| Total | 100 | 100 | 100 | 100 | 100 |

Table 3.2: Frequency Distribution of Statements

Because assignment statements have the highest frequency of occurrence among the statements, they are examined in more detail. An assignment statement can have a simple variable, an array element or a structure/union field on its **L**eft **H**and **S**ide (**LHS**). Table 3.3 gives the frequency distribution of assignment statements based on the LHS expression. The results show that assignments with a variable on the left hand side are the most frequent, with an average frequency of about 73%.
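The Average column of Table 3.2 appears to be the unweighted mean of the four application columns, so a small application such as Coremark (892 lines) counts as much as the much larger assembler (8194 lines). The following Python sketch recomputes the column under that assumption; the names `table_3_2` and `unweighted_mean` are introduced here for illustration only.

```python
# Percentages copied from Table 3.2, in the order
# Coremark, AutoBench, Assembler, Interpreter.
table_3_2 = {
    "Assignments":    (62.69, 67.96, 53.99, 51.18),
    "if else":        (18.98, 16.87, 27.13, 15.94),
    "switch":         (0.64, 0.0, 1.4, 3.66),
    "goto":           (0.0, 0.0, 1.21, 0.0),
    "Loops":          (8.1, 9.29, 2.18, 2.37),
    "Function Calls": (1.92, 1.7, 1.04, 1.07),
    "return":         (3.84, 3.56, 2.58, 4.15),
    "break":          (3.84, 0.62, 10.37, 21.62),
    "continue":       (0.0, 0.0, 0.1, 0.0),
}

def unweighted_mean(values):
    """Arithmetic mean in which every application has the same weight."""
    return sum(values) / len(values)

averages = {name: unweighted_mean(v) for name, v in table_3_2.items()}
for name, avg in averages.items():
    print(f"{name:<15}{avg:6.2f}")
```

If the averages were instead weighted by program size, the assembler and interpreter would dominate the result; the equal weighting keeps each application's profile visible in the average.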
Table 3.3: Frequency Distribution of Assignment Statements Based on LHS

| Assignment Type | Coremark (%) | AutoBench (%) | Assembler (%) | Interpreter (%) | Average (%) |
|--------------------------|-------|-------|-------|-------|-------|
| variable assignments | 83.46 | 56.35 | 56.62 | 96.25 | 73.17 |
| array assignments | 1.5 | 17.55 | 1.73 | 2.71 | 5.87 |
| struct/union assignments | 15.04 | 26.1 | 41.65 | 1.05 | 20.96 |
| Total | 100 | 100 | 100 | 100 | 100 |

Assignment statements can also be classified by the complexity of the expression on the Right Hand Side (RHS). Table 3.4 gives the frequency distribution of C assignment statements based on the complexity of the RHS expression. The results show that most assignment statements have a simple RHS expression, that is, a constant or a simple variable. These operations correspond to moves. On average, 33% of the assignments have a constant on the RHS. Assignments with a single variable on the RHS make up about 22%.

| Assignment Type | Coremark (%) | AutoBench (%) | Assembler (%) | Interpreter (%) | Average (%) |
|-------------------------------|-------|-------|-------|-------|-------|
| A = Const | 28.46 | 28.61 | 45.62 | 29.42 | 33.03 |
| A = B | 31.09 | 20.57 | 27.01 | 8 | 21.67 |
| A = A op Const | 13.11 | 11.35 | 8.33 | 14.93 | 11.93 |
| A = B op Const | 0.37 | 0.24 | 1.23 | 1.33 | 0.79 |
| A = A op B | 0.37 | 0 | 0.22 | 2.04 | 0.66 |
| A = B op C | 0.37 | 0 | 0.58 | 3.47 | 1.11 |
| others (different complexity) | 26.22 | 39.24 | 17.02 | 40.8 | 30.82 |
| Total | 100 | 100 | 100 | 100 | 100 |

Table 3.4: Distribution of Assignments Based on Complexity of RHS Expression

The six simple cases listed in the table account for about 69% of the assignments on average. The *other* expressions, of varying complexity, make up the remaining 31%.
The RHS expressions in these cases have more than two operands. These operands can be constants, variables, array accesses, structure or union fields, or return values of functions, combined in various operations.

#### 3.3.2 Operations

To assess the importance of the different operations, Table 3.5 gives their frequency distribution in the selected programs. This table summarizes the frequency distribution of all operations on integer and floating point numbers. Statistics from this table show that arithmetic operations are the most frequent operations, wherein addition and multiplication have a frequency of 24% and 14%, respectively. Address arithmetic refers to arithmetic operations carried out to compute the addresses of data, which corresponds to C pointer arithmetic. These operations have a total frequency of about 10%, and most of them are additions. Relational operations make up, on average, about 22% of the whole operation space. Among relational operations, the equality (==), inequality (!=) and less than (<) operations are the most frequent. Equality and inequality operations are frequent because they are used as test conditions in selection statements. The less than (<) comparison is mostly used in loop statements, where a loop counter is initialized and then incremented for as long as it remains less than a certain bound. Among bitwise operations, the and (&) operation, which is used for bit masking, has the highest frequency (about 3.25%). Data type conversion takes place when an operation involves operands of different data types. For instance, type conversion takes place in an operation involving integer and floating point data. In C, this type conversion can be explicit (type casting) or implicit (operations involving different data types). Conversion operations have an average frequency of 14.8%.
Detailed frequency distributions for the different conversion operations are also provided in Table 3.5.

| Operation Type | Operation | Coremark (%) | AutoBench (%) | Assembler (%) | Interpreter (%) | Average (%) |
|----------------|-----------|-------|-------|-------|-------|-------|
| Arithmetic | + | 15.82 | 16.37 | 5.75 | 10.35 | 12.07 |
| | - | 1.88 | 3.74 | 2.82 | 3.72 | 3.04 |
| | \* | 12.62 | 18.31 | 8.98 | 3.54 | 10.86 |
| | / | 0.38 | 1.53 | 0.42 | 2.54 | 1.22 |
| | % | 0.19 | 0.14 | 0.1 | 1.63 | 0.52 |
| | Subtotal | 30.89 | 40.09 | 18.07 | 21.78 | 27.71 |
| Address Arithmetic | + | 13.56 | 9.85 | 14.76 | 0.09 | 9.57 |
| | - | 0 | 0 | 0.24 | 0 | 0.06 |
| | Subtotal | 13.56 | 9.85 | 15 | 0.09 | 9.63 |
| Relational | == | 4.14 | 4.85 | 30.61 | 20.62 | 15.06 |
| | != | 8.29 | 4.99 | 11.11 | 13.26 | 9.41 |
| | < | 10.55 | 13.18 | 5.29 | 13.17 | 10.55 |
| | > | 2.07 | 3.19 | 1.32 | 3.45 | 2.51 |
| | <= | 0.94 | 0.14 | 1.88 | 0.73 | 0.92 |
| | >= | 1.32 | 0.83 | 2.92 | 1.82 | 1.72 |
| | Subtotal | 27.31 | 27.18 | 53.13 | 53.05 | 40.17 |
| Shift | << | 0.75 | 0 | 0.03 | 2.27 | 0.76 |
| | >> | 0 | 0 | 0 | 0 | 0 |
| | Arith >> | 2.26 | 4.72 | 0.07 | 2.09 | 2.29 |
| | Subtotal | 3.01 | 4.72 | 0.1 | 4.36 | 3.05 |
| Bitwise | and | 5.46 | 0.14 | 1.04 | 6.36 | 3.25 |
| | or | 1.69 | 0 | 0 | 2.18 | 0.97 |
| | not | 0 | 0 | 0 | 0.18 | 0.05 |
| | xor | 0.75 | 0 | 0 | 0.36 | 0.28 |
| | Subtotal | 7.9 | 0.14 | 1.04 | 9.08 | 4.54 |
| Complement | | 0.19 | 0 | 0.24 | 0 | 0.11 |
| Absolute | | 0 | 0 | 0 | 0 | 0 |
| Type Conversion | 8 to 16 | 0.38 | 0.55 | 0.28 | 0.54 | 0.44 |
| | 8 to 32 | 0.19 | 1.39 | 3.31 | 1.73 | 1.66 |
| | 8 to 64 | 0 | 0 | 0 | 0 | 0 |
| | 16 to 8 | 0.19 | 0 | 0 | 0.82 | 0.25 |
| | 16 to 32 | 2.82 | 3.61 | 0.52 | 2.18 | 2.28 |
| | 16 to 64 | 0 | 0 | 0 | 0 | 0 |
| | 32 to 8 | 0 | 0 | 0.24 | 5.45 | 1.42 |
| | 32 to 16 | 13.18 | 10.54 | 7.24 | 0.82 | 7.95 |
| | 32 to 64 | 0 | 0 | 0 | 0 | 0 |
| | 64 to 8 | 0 | 0 | 0 | 0 | 0 |
| | 64 to 16 | 0 | 0 | 0 | 0 | 0 |
| | 64 to 32 | 0 | 0 | 0 | 0 | 0 |
| | int to addr | 0.38 | 1.94 | 0.8 | 0.09 | 0.8 |
| | addr to int | 0 | 0 | 0 | 0 | 0 |
| | int to float | 0 | 0 | 0 | 0 | 0 |
| | float to int | 0 | 0 | 0 | 0 | 0 |
| | Subtotal | 17.14 | 18.03 | 12.39 | 11.63 | 14.8 |
| Total | | 100 | 100 | 100 | 100 | 100 |

Table 3.5: Frequency Distribution of Operations

Integer-to-address conversion takes place when an integer is combined with a pointer (pointing to some data) in an operation. The statistics show that integer-to-address conversion occurs in all four applications, whereas address-to-integer conversion never occurs. This is because computed addresses are saved in pointers and not transferred to integer variables. Among the integer data type conversions, 32- to 16-bit and 16- to 32-bit conversions are the most frequent. An operand is promoted to a larger size when it is combined with an operand of larger size; for instance, a 16-bit variable is converted to a 32-bit representation when it is added to 32-bit data.
Conversion from a larger to a smaller size takes place when it is explicit in the program or when a statement assigns data that is larger than its destination. As an example, the addition of two 32-bit variables produces a 32-bit result, but when this result is assigned to a 16-bit variable, a 32- to 16-bit conversion takes place. Although 8-bit variables are accessed, they are not frequently used in operations involving 16- and 32-bit operands, so promotion from 8 bits does not take place very often. 16- to 32-bit conversion is frequent because these two data types are frequently used together in operations. 32- to 16-bit conversion takes place because 16-bit operands are frequent (as can be seen from Table 3.13), so assignments to 16-bit variables cause these conversions.

Operations operate on data, and the data can be of integer or floating point type. Table 3.6 provides the statistics of integer operations. Overall, 58% of the operations are, on average, integer operations. The frequency of floating point operations, on the other hand, is about 1%.

| Operation Type | Operation | Coremark (%) | AutoBench (%) | Assembler (%) | Interpreter (%) | Average (%) |
|----------------|-----------|-------|-------|-------|-------|-------|
| Arithmetic | + | 14.69 | 16.37 | 5.75 | 10.26 | 11.77 |
| | - | 1.88 | 3.74 | 2.82 | 3.54 | 3 |
| | \* | 11.86 | 17.48 | 8.95 | 3.36 | 10.41 |
| | / | 0.38 | 1.39 | 0.42 | 2.36 | 1.14 |
| | % | 0.19 | 0.14 | 0.1 | 1.63 | 0.52 |
| | Subtotal | 29 | 39.12 | 18.04 | 21.15 | 26.83 |
| Relational | == | 3.2 | 2.64 | 17.06 | 12.08 | 8.75 |
| | != | 5.65 | 2.64 | 5.71 | 10.54 | 6.14 |
| | < | 5.46 | 6.8 | 2.65 | 7.72 | 5.66 |
| | > | 1.13 | 1.66 | 0.66 | 2.36 | 1.45 |
| | <= | 0.56 | 0.14 | 1.04 | 0.45 | 0.55 |
| | >= | 0.94 | 0.55 | 1.46 | 0.91 | 0.97 |
| | Subtotal | 16.94 | 14.43 | 28.58 | 34.06 | 23.5 |
| Shift | << | 0.75 | 0 | 0.03 | 2.27 | 0.76 |
| | >> | 0 | 0 | 0 | 0 | 0 |
| | Arith >> | 2.26 | 4.72 | 0.07 | 2.09 | 2.29 |
| | Subtotal | 3.01 | 4.72 | 0.1 | 4.36 | 3.05 |
| Bitwise | and | 5.46 | 0.14 | 1.04 | 6.36 | 3.25 |
| | or | 1.69 | 0 | 0 | 2.18 | 0.97 |
| | not | 0 | 0 | 0 | 0.18 | 0.05 |
| | xor | 0.75 | 0 | 0 | 0.36 | 0.28 |
| | Subtotal | 7.9 | 0.14 | 1.04 | 9.08 | 4.54 |
| Complement | | 0 | 0 | 0.24 | 0 | 0.06 |
| Absolute | | 0 | 0 | 0 | 0 | 0 |
| Total | | 56.85 | 58.41 | 48 | 68.7 | 58 |

Table 3.6: Frequency Distribution of Integer Operations

Table 3.7 provides the statistics of floating point operations. It can be seen that most of the floating point operations are arithmetic operations. Relational operations never involved floating point data, whereas complement operations did occur. Statistics from the previous tables show that most of the operations (58%) involve integer operands. Integer operations are applied to integers of different sizes. To decide how to support integers of different sizes, and to make the corresponding trade-offs in the design of the architecture, it is important to know the frequency distribution of integer operations by operand size. The frequency distribution of operations applied to the 8-bit data type is given in Table 3.8.
It | | | | - v | | | | | | | | | |-------------|------|------|------|------|-------|-------|--------|-------|--------|------|------| | Operation 7 | 'vpe | | | | | Perce | entage | | | | | | operation i | JPC | Core | mark | Auto | Bench | Asse | mbler | Inter | preter | Ave | rage | | | + | 1.13 | | 0 | | 0 | | 0.09 | | 0.31 | | | | - | 0 | | 0 | | 0 | | 0.18 | | 0.05 | | | Arithmetic | * | 0.75 | 1.88 | 0.83 | 0.97 | 0.03 | 0.03 | 0.18 | 0.63 | 0.45 | 0.88 | | | / | 0 | | 0.14 | | 0 | | 0.18 | | 0.08 | | | | % | 0 | | 0 | | 0 | | 0 | | 0 | | | | == | 0 | | 0 | | 0 | | 0 | | 0 | | | | ! = | 0 | | 0 | | 0 | | 0 | | 0 | | | Relational | < | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | Relational | > | 0 | | 0 | | 0 | 0 | 0 | | 0 | U | | | <= | 0 | | 0 | | 0 | | 0 | | 0 | | | | >= | 0 | | 0 | | 0 | | 0 | | 0 | | | Complement | | 0.19 | 0.19 | 0 | 0 | 0 | 0 | 0 | 0 | 0.05 | 0.05 | | Absolute | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | Total | | 2.07 | 2.07 | 0.97 | 0.97 | 0.03 | 0.03 | 0.63 | 0.63 | 0.94 | 0.93 | Table 3.7: Frequency Distribution of Floating Point Operations can be seen that, about 10% of the operations are the operations on 8-bit data. Most of the 8-bit operations are relational operations making up 5% on average. This is because, 8-bit data is the *char* data type in C, which is used for byte level processing. For instance, in EEMBC *core\_state* program, there are comparisons, if the character is a decimal point (.), an e or E for floating point exponantial representation etc. Furthermore, this is the reason that most of the relational operations are equality and inequality comparisons. 
| - Table 5.6: Frequency Distribution of 6-bit integer Ober | Table 3.8: ` | tribution of 8-bit Integer O | uency Di | Operations | |-----------------------------------------------------------|--------------|------------------------------|----------|------------| |-----------------------------------------------------------|--------------|------------------------------|----------|------------| | Operation | . Т | | | | | Perce | entage | | | | | |-------------|----------|------|------|-------|-------|-------|--------|-------|--------|------|------| | Operation | птуре | Core | mark | Auto | Bench | Asse | mbler | Inter | preter | Ave | rage | | | + | 0 | | 3.61 | | 0.07 | | 0.54 | | 1.06 | | | | - | 0 | 1 | 0.55 | | 0.1 | | 0.18 | | 0.21 | | | Arithmetic | * | 0 | 0 | 3.74 | 7.9 | 0 | 0.17 | 0.73 | 1.99 | 1.12 | 2.52 | | | / | 0 | 1 | 0 | | 0 | | 0.54 | | 0.14 | | | | % | 0 | 1 | 0 | | 0 | | 0 | | 0 | | | | == | 1.88 | | 0 | | 3.48 | | 2.18 | | 1.89 | | | | ! = | 2.45 | 1 | 0.28 | | 0.17 | | 7.08 | | 2.5 | | | Relational | < | 0 | 4.71 | 0 | 0.42 | 0 | 3.86 | 1.09 | 10.98 | 0.27 | 4.99 | | reciational | > | 0 | 1.71 | 0.14 | 0.42 | 0 | 0.00 | 0.54 | 10.50 | 0.17 | 4.00 | | | <= | 0.19 | 1 | 0 | | 0.21 | | 0.09 | | 0.12 | | | | >= | 0.19 | 1 | 0 | | 0 | | 0 | | 0.05 | | | | << | 0 | | 0 | | 0 | | 0.09 | | 0.02 | | | Shift | >> | 0 | 0 | 0 | 4.16 | 0 | 0 | 0 | 0.36 | 0 | 1.13 | | | Arith >> | 0 | 1 | 4.16 | | 0 | | 0.27 | | 1.11 | | | | and | 0.19 | | 0.14 | | 0 | | 0.82 | | 0.29 | | | Bitwise | or | 0 | 0.19 | 0 | 0.14 | 0 | | 0 | 1 | 0 | 0.33 | | Bitwise | not | 0 | 0.19 | 0 | 0.14 | 0 | 0 | 0.18 | 1 | 0.05 | 0.33 | | | xor | 0 | 1 | 0 | | 0 | | 0 | | 0 | | | Complement | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | Absolute | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | Total | | 4.9 | 4.9 | 12.62 | 12.62 | 4.03 | 4.03 | 14.33 | 14.33 | 9 | 8.97 | Table 3.9 provides the statistics of 16-bit integer operations. 
16-bit operations have an overall frequency of 8%, where arithmetic operations have the contribution (3.27%). Frequency distribution of operations applied to 32-bit integers is given in Table 3.10. It | 0 | | 1 | J | | | Percer | ntage | <u> </u> | 1 | | | |-------------|----------|-------|-------|------|-------|--------|-------|----------|--------|------|------| | Operation | n Type | Core | mark | Auto | Bench | Asse | mbler | Inter | preter | Ave | rage | | | + | 1.69 | | 1.8 | | 0.24 | | 0.64 | | 1.09 | | | | - | 0.56 | | 0.97 | | 0.24 | | 0.18 | | 0.49 | | | Arithmetic | * | 0.56 | 2.81 | 4.72 | 7.49 | 0 | 0.48 | 0.73 | 2.28 | 1.5 | 3.27 | | | / | 0 | | 0 | | 0 | | 0.73 | | 0.18 | | | | % | 0 | | 0 | | 0 | | 0 | | 0 | | | | == | 0.38 | | 0.42 | | 0.03 | | 1.36 | | 0.55 | | | | ! = | 0.56 | | 0 | | 0.14 | | 0.73 | | 0.36 | | | Relational | < | 0.38 | 1.89 | 0.42 | 1.26 | 0 | 0.17 | 1.18 | 4.09 | 0.5 | 1.85 | | reciational | > | 0.19 | 1.00 | 0 | 1.20 | 0 | 0.11 | 0.73 | 4.03 | 0.23 | 1.00 | | | <= | 0 | | 0.14 | | 0 | | 0.09 | | 0.06 | | | | >= | 0.38 | | 0.28 | | 0 | | 0 | | 0.17 | | | | << | 0.56 | | 0 | | 0 | | 0.45 | | 0.25 | | | Shift | >> | 0 | 2.44 | 0 | 0.28 | 0 | 0 | 0 | 1.09 | 0 | 0.95 | | | Arith >> | 1.88 | | 0.28 | | 0 | | 0.64 | | 0.7 | | | | and | 3.77 | | 0 | | 0 | | 1 | | 1.19 | | | Bitwise | or | 1.13 | 5.28 | 0 | 0 | 0 | 0 | 0 | 1.18 | 0.28 | 1.62 | | Bitwise | not | 0 | 3.20 | 0 | | 0 | | 0 | 1.10 | 0 | 1.02 | | | xor | 0.38 | | 0 | | 0 | | 0.18 | | 0.14 | | | Complement | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | Absolute | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | Total | | 12.42 | 12.42 | 9.03 | 9.03 | 0.65 | 0.65 | 8.64 | 8.64 | 7.69 | 7.69 | Table 3.9: Frequency Distribution of 16-bit Integer Operations can be seen that, 41.62% of the operations involve 32-bit data. Arithmetic and relational operations are the most frequent 32-bit operations with an average frequency of 21.04% and 16.67%, respectively. 
| Table 3.10: Frequency Distribution of 32-bit Integer C | Operations | |--------------------------------------------------------|------------| |--------------------------------------------------------|------------| | | abic <b>5.1</b> 0. | 1 | J | | | | entage | -0- | 1 | | | |------------|--------------------|-------|-------|-------|-------|-------|--------|-------|--------|-------|-------| | Operatio | n Type | Core | mark | Autol | Bench | | mbler | Inter | preter | Ave | rage | | | + | 12.99 | | 10.96 | | 5.43 | | 9.08 | | 9.62 | | | | - | 1.32 | | 2.22 | | 2.47 | | 3.18 | | 2.3 | | | Arithmetic | * | 11.3 | 26.18 | 9.02 | 23.73 | 8.95 | 17.37 | 1.91 | 16.89 | 7.8 | 21.04 | | | / | 0.38 | | 1.39 | | 0.42 | | 1.09 | | 0.82 | | | | % | 0.19 | | 0.14 | | 0.1 | | 1.63 | | 0.52 | | | | == | 0.94 | | 2.22 | | 13.54 | | 8.54 | | 6.31 | | | | ! = | 2.64 | | 2.36 | | 5.4 | | 2.72 | | 3.28 | | | Relational | < | 5.08 | 10.36 | 6.38 | 12.77 | 2.65 | 24.55 | 5.45 | 18.98 | 4.89 | 16.67 | | Relational | > | 0.94 | 10.30 | 1.53 | 12.77 | 0.66 | 24.55 | 1.09 | 10.90 | 1.06 | 10.07 | | | <= | 0.38 | | 0 | | 0.84 | | 0.27 | | 0.37 | | | | >= | 0.38 | | 0.28 | | 1.46 | | 0.91 | | 0.76 | | | | << | 0.19 | | 0 | | 0.03 | | 1.73 | | 0.49 | | | Shift | >> | 0 | 0.57 | 0 | 0.28 | 0 | 0.1 | 0 | 2.91 | 0 | 0.97 | | | Arith >> | 0.38 | | 0.28 | | 0.07 | | 1.18 | | 0.48 | | | | and | 1.51 | | 0 | | 1.04 | | 4.54 | | 1.77 | | | Bitwise | or | 0.56 | 2.45 | 0 | 0 | 0 | 1.04 | 2.18 | 6.9 | 0.69 | 2.6 | | Bitwise | not | 0 | 2.40 | 0 | | 0 | 1.04 | 0 | 0.9 | 0 | 2.0 | | | xor | 0.38 | | 0 | | 0 | | 0.18 | | 0.14 | | | Complement | | 0 | 0 | 0 | 1.32 | 0.24 | 0 | 0 | 0 | 0.06 | 0.33 | | Absolute | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | Total | | 39.56 | 39.56 | 36.78 | 38.1 | 43.3 | 43.06 | 45.68 | 45.68 | 41.36 | 41.61 | ### 3.3.3 Operands Operations operate on operands and operands in C language can be of various types. 
Frequency distribution of various operands in selected programs is given in the Table 3.11. It can be seen from these statistics that constants and simple variables occur most frequently with an average frequency of about 32% and 44% respectively. | Table 9.11. Troquency Elistination of Operands | | | | | | | | | | | | |------------------------------------------------|------------|-----------|-----------|-------------|---------|--|--|--|--|--|--| | 0 | Percentage | | | | | | | | | | | | Operand | Coremark | AutoBench | Assembler | Interpreter | Average | | | | | | | | Constants | 25.52 | 33.57 | 31.36 | 37.21 | 31.92 | | | | | | | | Simple Variables | 55.6 | 38.87 | 38.71 | 44.22 | 44.35 | | | | | | | | Array Access | 1.01 | 8.77 | 2.24 | 1 | 3.26 | | | | | | | | Struct/union Field Access | 8.89 | 10.92 | 18.3 | 4.38 | 10.62 | | | | | | | | Function Calls | 3.25 | 5.44 | 7.07 | 12.72 | 7.12 | | | | | | | | Pointers | 5.71 | 2.43 | 2.31 | 0.47 | 2.73 | | | | | | | | Total | 100 | 100 | 100 | 100 | 100 | | | | | | | Table 3.11: Frequency Distribution of Operands Because of the high frequency of constants, their further analysis is performed. Frequency distribution of different constants is given in the Table 3.12. It can be seen that small constants are the most frequent ones. Among the 4-bit constants, 0, 1, 2, 4, 8 are the most frequent. Constant 0 is frequent as it is used in initialization and comparison operations. 1 is also used frequently in increment/decrement operations like i++,--i in loops. Overall, 4-bit constants have an accumulative frequency of about 87%. AutoBench Average Coremark Assembler Interpreter Constant % Cum. % Cum. % % Cum. % Cum. % % Cum. 
% -1 0.11 0.11 0 0 0.95 0.95 1.41 1.41 0.62 0.62 30.06 18.44 18.55 16.63 16.63 18.61 19.56 31.47 20.94 21.56 33.69 1 21.75 40.3 50.32 18.58 38.14 19.71 51.18 23.43 44.99 2 7.26 2.13 42.43 4.855.12 8.18 46.32 58.44 5.59 50.58 3 3.78 46.21 3.94 59.06 7.62 53.94 10.66 69.1 6.5 57.08 4 13.24 59.45 8.96 68.02 13.29 67.23 4.97 74.07 10.12 67.2 5 2.13 61.58 1.39 69.41 68.23 3.32 77.39 1.96 69.16 6 1.18 62.76 0.64 70.05 1.3 69.53 4.97 82.36 2.02 71.18 0.96 1.43 4.32 2.27 73.45 2.36 65.12 71.01 70.96 21.28 11.19 1.59 72.55 2.33 82.55 89.01 9.1 0.24 1.4 73.95 0.42 83.17 0.43 82.63 89.43 0.62 0.24 0.75 1.7 75.65 0.27 89.7 0.74 11 0 86.88 0.43 76.87 0.38 90.08 0.51 84.42 12 0.95 87.83 0.43 84.24 0.65 77.52 0.23 90.31 0.57 13 0.43 0.76 0.15 0.34 85.33 0 87.83 84.67 78.28 90.46 14 0.43 0.76 0.15 0.34 0 87.83 85.1 79.04 90.61 85.67 0.64 0.34 15 0.71 88.54 85.74 1.32 80.36 90.95 0.75 86.42 90.17 16-31 3.07 1.92 87.66 8.61 1.38 92.33 3.75 91.6188.97 32-63 3.55 95.16 5.65 93.31 3.4 92.37 0.92 93.25 3.38 93.55 0.75 2.02 64-127 1.18 96.34 94.06 4.62 96.99 95.27 2.14 95.69 0.75 1.57 96.75 128-256 1.42 97.76 94.81 0.4997.4896.84 1.06 256-65535 1.65 93.26 3.3 90.96 1.62 90.59 1.57 93.9 2.04 92.21 others93.73 91.51 93.44 Table 3.12: Frequency Distribution of Constants In order to see the frequency of size of operands, frequency distribution of 8-, 16-, 32- and 64-bit operands appearing in different operations is given in the Table 3.13. 32- and 16-bit are the most frequent operand sizes with an average frequency of about 60% and 34%, respectively. 

| Size (Bits) | Coremark (%) | AutoBench (%) | Assembler (%) | Interpreter (%) | Average (%) |
|-------------|--------------|---------------|---------------|-----------------|-------------|
| 8 | 3.11 | 7.63 | 3.7 | 11.81 | 6.56 |
| 16 | 41.01 | 40 | 33.1 | 20.91 | 33.76 |
| 32 | 55.87 | 52.37 | 63.2 | 67.28 | 59.68 |
| 64 | 0 | 0 | 0 | 0 | 0 |
| Total | 100 | 100 | 100 | 100 | 100 |

Table 3.13: Frequency Distribution of Operand Accesses Based on Size

### 3.3.4 Miscellaneous

Table 3.14 gives the average number of variables per function, based on locality. These variables can have global scope, be passed to the function as arguments, or be local variables of the function. It can be seen that, on average, a function uses 7 locals. Furthermore, on average 2 arguments are passed to a function. Operands with global scope used inside a function are about 2.33 on average.

Table 3.14: Average (per Function) of Variables Based on Locality

| Locality | Coremark | AutoBench | Assembler | Interpreter | Average |
|------------|----------|-----------|-----------|-------------|---------|
| parameters | 3.04 | 1.42 | 1.13 | 1.53 | 1.78 |
| locals | 5 | 10.23 | 9.5 | 4.02 | 7.19 |
| globals | 0.16 | 1.81 | 5.31 | 2.02 | 2.33 |

The local variables and the arguments of a function can be simple variables, arrays, struct/union fields or pointers. Table 3.15 provides the frequency distribution of the arguments of a function based on data types. It can be seen that most of the parameters passed to functions are either simple variables or pointers. Among simple variables, 32-bit integer variables are the most frequent data type passed as an argument, with a percentage distribution of 39% on average. Arrays are never passed as arguments to functions. This is because, most of the time, the base address is passed as a pointer to these data structures.
Arguments containing a struct/union are not frequently used either, as they are also frequently passed by reference. In short, about 50% of the function parameters are pointers.

Table 3.15: Frequency Distribution of Parameters Based on Data Types

| Operand Type | | Coremark (%) | AutoBench (%) | Assembler (%) | Interpreter (%) | Average (%) |
|-----------------|----------------|-------|-------|-------|-------|-------|
| Simple Variable | Integer 8-bit | 1.14 | 2.44 | 0 | 3.89 | 1.87 |
| | Integer 16-bit | 10.23 | 2.44 | 0 | 4.38 | 4.26 |
| | Integer 32-bit | 27.27 | 19.51 | 42.95 | 65.69 | 38.86 |
| | Integer 64-bit | 0 | 0 | 0 | 0 | 0 |
| | Floating Point | 4.55 | 0 | 1.34 | 8.27 | 3.54 |
| Array | | 0 | 0 | 0 | 0 | 0 |
| Struct/union | | 0 | 0 | 1.34 | 10.46 | 2.95 |
| Pointer | | 56.82 | 75.61 | 54.36 | 7.3 | 48.52 |
| Total | | 100 | 100 | 100 | 100 | 100 |

Locals of a function can also be of various types, as given in Table 3.16 with their frequency distributions. Statistics show that, on average, about 88% of the locals are simple variables. Among simple variables, 32-bit and 16-bit integer variables are the most frequent data types, with an average frequency of 61% and 13% respectively. About 13% of the locals are pointers.

| Operand Type | | Coremark (%) | AutoBench (%) | Assembler (%) | Interpreter (%) | Average (%) |
|-----------------|----------------|------|-------|-------|-------|-------|
| Simple Variable | Integer 8-bit | 7.2 | 4.89 | 0.61 | 9.05 | 5.44 |
| | Integer 16-bit | 18.4 | 23.68 | 0 | 10.04 | 13.03 |
| | Integer 32-bit | 42.4 | 51.13 | 79.86 | 70.3 | 60.92 |
| | Integer 64-bit | 0 | 0 | 0 | 0 | 0 |
| | Floating Point | 5.6 | 2.26 | 2.53 | 7.78 | 4.54 |
| Array | | 1.6 | 4.14 | 0.4 | 0.14 | 1.57 |
| Struct/union | | 1.6 | 0 | 3.44 | 2.69 | 1.93 |
| Pointer | | 23.2 | 13.91 | 13.16 | 0 | 12.57 |
| Total | | 100 | 100 | 100 | 100 | 100 |

Table 3.16: Frequency Distribution of Locals Based on Data Types

## 3.4 Conclusions

In order to see the characteristics of C language programs, this chapter discussed the static frequency distribution of various constructs in the C language for embedded applications. Four C applications, namely EEMBC Coremark, EEMBC AutoBench, and the assembler and interpreter of our architecture, were profiled.

From the statistics, it can be concluded that assignment statements are the most frequent statements. Most of these are simple assignments with a variable on the left-hand side. Similarly, based on the complexity of the expression on the right-hand side of assignments, constants and simple variables make up the most frequent cases. These assignments are translated to move and move immediate operations, which should be efficiently implemented. For the efficiency of memory accesses, there should be support for efficient addressing modes. An interesting conclusion is that most of the simple assignments with an operation on the right-hand side use the left-hand-side destination as one of the operands.
This shows the importance of 2-operand instructions, where one operand, while being a part of the operation, also serves as the destination to hold the result.

Arithmetic operations have the highest frequency among all the operations, with addition and multiplication making the major contributions. The bulk of the operations involve integers of 16-bit and 32-bit sizes. Operations involving the 8-bit size also have a reasonable frequency, whereas 64-bit operations almost never occur. This shows that the architecture should support 8-, 16- and 32-bit sizes, especially for a memory efficient architecture.

Relational operations are the second most frequent operations, as these are used to make decisions for branches in selection and repetition statements. This highlights the importance of conditions, which should be efficiently implemented for conditional control transfer instructions.

Type conversions are also frequent operations, following arithmetic and relational operations. Most of the conversions are between 16- and 32-bit integer data types. It can be concluded that support for type conversion within different operations will result in an efficient architecture.

Most of the operands in these operations are simple variables and constants. Based on the size of the operands, 16-bit and 32-bit operands are the most frequent ones. In a memory efficient architecture, there should be special support for constants, especially 4-bit constants. Statistics showed that 4-bit constants make up about 87% of the total constants used in operations, 0 and 1 being the most frequent constants.

Statistics presented in this chapter showed that C language constructs (statements, operations, operands etc.) are not uniformly distributed over the complete range. These results are utilized in making trade-offs in the design of our microcontroller architecture discussed in the next chapter.
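
The conclusion about small constants can be made concrete with a short C sketch. It checks whether a constant fits in a short immediate field and how many bits an unsigned encoding would need. The function names and the assumption of an unsigned field (0 to 15 for 4 bits) are illustrative only; how MePoEfAr actually encodes small constants is not described in this public version.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when `value` fits in an unsigned immediate field that is
 * `bits` wide. Only widths from 1 to 31 are accepted in this sketch. */
bool fits_unsigned_field(int32_t value, unsigned bits)
{
    if (bits == 0 || bits > 31)
        return false;
    return value >= 0 && (uint32_t)value < ((uint32_t)1 << bits);
}

/* Hypothetical helper: the smallest unsigned field width (in bits)
 * that can hold `value`. Zero still needs one bit. */
unsigned min_unsigned_bits(uint32_t value)
{
    unsigned bits = 1;
    while (bits < 32 && (value >> bits) != 0)
        bits++;
    return bits;
}
```

Under these assumptions the constants 0 to 15 pass the 4-bit check, while 16 and -1 would need a longer or signed encoding.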
# MePoEfAr Architecture

This chapter contains the architectural details of MePoEfAr, which are confidential. Hence, it is not included in this public version of the thesis.

# MePoEfAr Assembler

In the previous chapter the MePoEfAr architecture was discussed. To evaluate the efficiency of the MePoEfAr architecture and compare its performance with existing microcontrollers, benchmark programs need to be run on our architecture. In order to automate this task, the MePoEfAr assembler and simulator were developed. The assembler is the focus of this chapter, while the interpreter will be discussed in the next chapter.

This chapter starts with a brief introduction to assemblers. Section 5.2 discusses the MePoEfAr assembler with the details of the intermediate steps involved in translating the assembly program to machine code. Section 5.3 discusses instruction bit assignments. Finally, Section 5.4 summarizes the chapter.

## 5.1 Introduction to Assemblers

An assembler is a utility program which translates machine instructions written in the form of English mnemonics (assembly instructions) into binary patterns which the machine can understand (machine instructions). This translation process is a one-to-one mapping of mnemonics to a stream of bits representing the machine instructions and data. An important task of an assembler is to resolve the symbolic names used in assembly programs to represent variables and memory locations. In order to resolve these references, the assembler has to pass over the assembly program once or twice, depending upon the complexity of the assembly language. In this context, assemblers are generally classified as follows:

**One-Pass Assemblers** read the source code once and perform the translation. The assumption is that all references are defined before their use. If they are not, an error is generated. In short, a One-Pass assembler cannot handle forward references.

**Two-Pass Assemblers** make two passes over the assembly code.
In the first pass it creates a symbol table. The values of the references are used in the second pass for the machine code generation. The MePoEfAr assembler is a Two-Pass assembler.

In short, an assembler has to perform a number of tasks. It has to perform lexical analysis, syntactic analysis and semantic analysis, maintain a symbol table to resolve references, and emit the machine code at the end.

## 5.2 MePoEfAr Assembler

The MePoEfAr assembler is a Two-Pass assembler. It is written in the C language and has 8194 lines of code, out of which 1944 lines of C code are generated by Flex and Bison from the descriptions of the lexical syntax and the grammar, as discussed in Section 5.2.1 and Section 5.2.2 respectively.

Figure 5.1: Block Diagram of MePoEfAr Assembler Showing Various Steps Performed in the Assembly Process

Based on the tasks performed by the MePoEfAr assembler, it has been divided into the following stages:

1. Scanner
2. Parser
3. Analyzer
4. Code Generator

Figure 5.1 shows the overview of the assembler. These stages are described one by one in the following sections. Listing 5.1 provides an example MePoEfAr assembly program which will be used in the description in the following sections.

```
1 ; test.asm
2 ; Simple MePoEfAr Assembly Program
3
4 MAIN: MOVw #2, W3      ; W3 = 2
5       ADDw #5, W3      ; W3 = W3 + 5
6       SUBw W3, 4(X5)   ; M[(X5)+4] = M[(X5)+4] - W3
7 END:  RTS              ; return to caller
```

Listing 5.1: MePoEfAr Example Assembly Program used for Illustration of Various Assembler Stages in this Chapter

### 5.2.1 Scanner

The scanner is the first stage of the assembler and performs the lexical analysis. In this analysis, the assembly program in the file is scanned and broken down into tokens, leaving out white space and comments. Lexical analyzers can be written by hand, but efficient tools are available to generate them. One such tool is **Flex** (**F**ast **Lex**) [1], which is freely available and which we have used to generate the lexical analyzer of MePoEfAr.
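
To illustrate what lexical analysis does, the following hand-written C sketch turns one source line into tokens. It is not the Flex specification used for MePoEfAr (that is in Appendix A). The token names follow those produced by the MePoEfAr scanner, while `T_ERROR`, `scan_line`, `is_register` and the rule that a register is a `W` or `X` followed by digits are assumptions made for this example.

```c
#include <ctype.h>
#include <stddef.h>

/* Token kinds; T_ERROR is an addition made for this sketch. */
typedef enum {
    T_LABEL, T_SYMBOL, T_HASH, T_DNUMBER, T_COMMA,
    T_WREGISTER, T_XREGISTER, T_LBRACK, T_RBRACK, T_ERROR
} Token;

/* True if the word is `prefix` followed only by digits, e.g. W3 or X5
 * (an assumed register naming rule). */
static int is_register(const char *word, size_t len, char prefix)
{
    if (len < 2 || word[0] != prefix)
        return 0;
    for (size_t i = 1; i < len; i++)
        if (!isdigit((unsigned char)word[i]))
            return 0;
    return 1;
}

/* Scans one source line, writes at most `max` tokens to `out` and
 * returns the number written. White space and ';' comments are skipped. */
size_t scan_line(const char *s, Token *out, size_t max)
{
    size_t n = 0;
    while (*s != '\0' && *s != ';' && n < max) {
        unsigned char c = (unsigned char)*s;
        if (isspace(c)) {
            s++;
        } else if (c == '#') {
            out[n++] = T_HASH; s++;
        } else if (c == ',') {
            out[n++] = T_COMMA; s++;
        } else if (c == '(') {
            out[n++] = T_LBRACK; s++;
        } else if (c == ')') {
            out[n++] = T_RBRACK; s++;
        } else if (isdigit(c)) {
            while (isdigit((unsigned char)*s))
                s++;
            out[n++] = T_DNUMBER;
        } else if (isalpha(c) || c == '_') {
            const char *start = s;
            while (isalnum((unsigned char)*s) || *s == '_')
                s++;
            size_t len = (size_t)(s - start);
            if (*s == ':') {            /* a word followed by ':' is a label */
                out[n++] = T_LABEL; s++;
            } else if (is_register(start, len, 'W')) {
                out[n++] = T_WREGISTER;
            } else if (is_register(start, len, 'X')) {
                out[n++] = T_XREGISTER;
            } else {
                out[n++] = T_SYMBOL;    /* mnemonic or symbol reference */
            }
        } else {
            out[n++] = T_ERROR; s++;
        }
    }
    return n;
}
```

For an instruction of the form `SUBw W3, 4(X5)` this sketch yields SYMBOL, WREGISTER, COMMA, DNUMBER, LBRACK, XREGISTER, RBRACK. Negative numbers, other register classes and NEWLINE tracking are deliberately left out.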
The Flex code for the scanner is given in Appendix A. Figure 5.2 shows the block diagram of the scanner. It reads the assembly program and generates the tokens as shown. The Flex code (.l extension) is compiled by Flex to generate the C code (.yy.c extension) for the lexical analyzer, based on the lexical description in the Flex code. The tokens generated by this C program are given as input to the parser.

Figure 5.2: Block Diagram of Scanner, which Reads the Input Assembly Instructions and Produces the Tokens

For instance, the tokens generated by our scanner for the example program given in Listing 5.1 are shown in Figure 5.3. It can be seen that comments and white space are ignored. The NEWLINE token is used to keep track of line numbers in the source code for generating error messages.

LABEL SYMBOL HASH DNUMBER COMMA WREGISTER SYMBOL HASH DNUMBER COMMA WREGISTER SYMBOL WREGISTER COMMA WREGISTER NEWLINE SYMBOL WREGISTER COMMA DNUMBER LBRACK XREGISTER RBRACK LABEL SYMBOL

Figure 5.3: Tokens generated by Scanner for the Example Program in Listing 5.1

It can be seen from this figure that the first token is the LABEL corresponding to the label MAIN. Next is the SYMBOL token corresponding to the MOVw instruction mnemonic in the first instruction at Line 4. The next token is HASH, corresponding to the # symbol for an immediate value. Next a COMMA is seen, and following the COMMA is the WREGISTER token corresponding to W3 in the assembly program. Along the same lines, the other instructions are also tokenized as shown in Figure 5.3.

### 5.2.2 Parser

The parser, or syntax analyzer, is the part of the assembler which determines the syntax or structure of a program based on the specified rules. These rules are called the grammar of the language. We have used Bison [11], a free parser generator, to generate the parser for MePoEfAr. Appendix B provides the grammar which we have used to generate the parser for the MePoEfAr assembler.
So, the tokens provided by the scanner are combined into sentences according to this grammar. If the assembly program does not satisfy this grammar, a syntax error is generated.

Figure 5.4: Block Diagram of Parser. Tokens are taken as Input from the Scanner and Parser Performs Syntactic Analysis and Constructs the Abstract Syntax Tree as an Output

Figure 5.5: Visual Representation of the Complete Abstract Syntax Tree for the Example Program given in Listing 5.1

The output of the parser is an abstract representation of the assembly program known as the Abstract Syntax Tree (AST). Figure 5.4 shows the block diagram of the parser, which takes the tokens as input and generates the AST. The visual representation of the complete AST for the example program provided in Listing 5.1 is shown in Figure 5.5. In this AST, each right arrow is a pointer to the next instruction, so the nodes in the AST are linked together by next pointers as a linked list. Similarly, downward arrows indicate pointers to children. The first box represents the first node $NT\_LABEL$, which corresponds to the label MAIN. It does not have any children, so there are no downward arrows. The next pointer points to the next node $NT\_INSTRUCTION$, representing the instruction MOVw. This node has two children corresponding to the immediate field $(NT\_IMM)$ and the destination register field $(NT\_WREGISTER)$. A similar explanation holds for the rest of the nodes in the figure. This AST is used in later phases for semantic analysis and code generation.

### 5.2.3 Analyzer

At this stage, the AST generated by the parser is traversed to perform semantic analysis. In the first phase, instruction groups are identified and symbols are added to the symbol table. In order to know the location of the various symbols in the assembly program, a *location* variable is updated according to the length of the instructions in the tree. A crucial task in this analysis is maintaining the symbol table, which requires knowing the size of the instructions.
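
The first phase described above can be sketched in C as a walk over the linked AST nodes that advances a location counter and records every label in a symbol table. The types `Node`, `Symbol` and `SymbolTable`, the functions `first_pass` and `lookup`, and the fixed table size are hypothetical. The sketch also assumes that each instruction node already carries its length in bytes; it does not reproduce the data structures of the actual assembler.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SYMBOLS 64

typedef struct {
    const char *name;   /* symbol name, e.g. a label */
    unsigned value;     /* address assigned to the symbol */
    int line;           /* source line, kept for error messages */
} Symbol;

typedef struct {
    Symbol entries[MAX_SYMBOLS];
    size_t count;
} SymbolTable;

/* Simplified AST node: either a label or an instruction. */
typedef struct Node {
    const char *label;  /* non-NULL for label nodes */
    unsigned length;    /* instruction length in bytes (0 for labels) */
    int line;
    struct Node *next;  /* next node in the linked list */
} Node;

/* Walks the list once, assigning the current location to each label.
 * Returns 0 on success and -1 if the table is full. */
int first_pass(const Node *head, SymbolTable *table)
{
    unsigned location = 0;
    table->count = 0;
    for (const Node *n = head; n != NULL; n = n->next) {
        if (n->label != NULL) {
            if (table->count == MAX_SYMBOLS)
                return -1;
            table->entries[table->count++] =
                (Symbol){ n->label, location, n->line };
        } else {
            location += n->length;  /* advance past this instruction */
        }
    }
    return 0;
}

/* Linear search by name; returns NULL if the symbol is not defined. */
const Symbol *lookup(const SymbolTable *table, const char *name)
{
    for (size_t i = 0; i < table->count; i++)
        if (strcmp(table->entries[i].name, name) == 0)
            return &table->entries[i];
    return NULL;
}
```

A second pass can then call `lookup` for every symbol reference, which is what allows forward references to be resolved.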
In MePoEfAr, instructions are of variable length, so information about the length of each instruction is needed to update the location counter. The interesting part is that the length of a branch instruction depends upon the branch displacement, and to know the branch displacement we need to know the length of the instructions. For instance, consider the code segment given in Listing 5.2. The instruction BRlt in Line 6 has an 8-bit field for the branch target address (shown as D8 in Table ??). If the branch target address is greater than or equal to -128 and less than or equal to +127, then it can be accommodated in the first instruction word and the size of the instruction will be 2 bytes. Otherwise, the branch target address will be provided in the next instruction word, making it a 4-byte instruction. So, the size of this instruction depends upon the location of the label NoSwap, which is a forward reference and has not been resolved yet (in the first pass). Furthermore, the location of the label NoSwap depends upon the size of all the instructions preceding it, including the BRlt instruction. This circular dependency is resolved by assuming the worst-case offset for branch instructions, and hence the maximum size of the instruction (4 bytes), in the first pass. These are finalized in the symbol table based on the actual value in the second pass.

```
1 L1:     MOVw  W0, W1
2 L2:     MOVd  (X4)+, D2
3
4
5         CPAd  D2, D3     ; compare D2 with D3
6         BRlt  NoSwap     ; if (D3 < D2) then no swapping required
7                          ; otherwise swap here
8
9
10 NoSwap: S1BR  W1, L2    ; loop if j > 0
```

Listing 5.2: MePoEfAr Example Code Used for the Illustration of Branch Instruction Size and Update of Location Counter

Table 5.1 shows the visual representation of the symbol table for the example assembly program given in Listing 5.1. This table has two entries for the two symbols found in this example program. The names of these symbols are provided in the first column. The values of the symbols are given in the second column.
The line number of use is also stored for generating error and warning messages, as shown in the third column of the table. For instance, the first symbol is MAIN, which has the value 0 as it is the address of the first instruction. The Line Number column shows that it has been accessed at Line 4 in the source code (see Listing 5.1). Similarly, the symbol END has the value 9 and is available at Line 7 in the source code.

Table 5.1: Visual Representation of the Symbol Table for the Example Program in Listing 5.1

| Symbol Name | Symbol Value | Line Number |
|-------------|--------------|-------------|
| MAIN | 0 | 4 |
| END | 9 | 7 |

Type analysis is also performed in this stage. Data types are explicit in MePoEfAr assembly mnemonics, so it is checked whether this data type matches the type specified by the operand(s). For instance, the instruction ADDb #13, B3 expects the second operand to be a byte register. If the types do not match, an error is generated with the line number of the instruction which caused it. Similarly, an error message is also generated if an operation is not defined in that instruction sub group. For instance, the instruction MULb #13, B3 will cause an error, as multiplication is not defined for the integer Immediate to Register (IR) instruction format (see Table ??).

### 5.2.4 Code Generator

In this phase of the assembler, the AST is traversed and the binary patterns corresponding to the assembly mnemonics are emitted. All the information required to generate the machine code is present in the AST nodes, having been collected by the earlier stages.

Figure 5.6: Block Diagram of Code Generator which Generates the Machine Code at the Output for the Abstract Syntax Tree of a Single Instruction at the Input

Consider the example of code generation for the instruction ADDb #13, B3. The machine code generated for this instruction will be 11011011101101 in binary format or DBED in hex format as shown in Figure 5.6.
This is because this instruction belongs to the Sub Group Immediate to Register ( $SG\ IR$ ) (see Table ??). So the binary code to represent $SG\ IR$ for the byte data type is 110110, as shown by Entry 16 in Table 5.2. The OiIR field will be 10 for the ADD operation (see Table ??). The Rd field will get the value 11, representing the register B3. The immediate field I will get the value 1101, representing the immediate value 13. The generated machine code for the given assembly program is written to a file in hex format, which will be given as input to the MePoEfAr interpreter.

## 5.3 Instruction Bit-assignment

The last phase in the MePoEfAr assembler is the code generator stage. Binary patterns corresponding to the data, addresses and instructions of the assembly program are emitted. An important task in this stage is the assignment of binary patterns to mnemonics. This task is not trivial in MePoEfAr, as a variable number of bits is used for the representation of mnemonics. We have utilized the concept of variable length coding to represent instruction sub groups.

The bit-assignment is based on the implementation assumption that, after instruction fetch, the instruction decode cycle will take place. During this cycle three register prefetches will take place, regardless of the details of the instruction. The three fields to be pre-fetched are:

**Rs:** the source register of a possible RR or MR instruction

**Rd:** the destination register of a possible MR, IR or R instruction

**AX:** the addressing mode and index register combination which may be used in a memory referencing instruction

The above logic requires that the fields Rs, Rd and AX are always in the same position in the layout of the 16-bit instruction word, regardless of the operation to be performed. In other words, the fields Rs, Rd and AX are always assigned to the same bit positions in the instruction.
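
The variable length coding of sub groups can be illustrated with a small C sketch that classifies a 16-bit instruction word by matching its leading bits against a list of prefixes. Only three prefixes are included, taken from Table 5.2 (0000 for RR(b), 0011 for MR(b) and 110110 for IR(b)). The names `SubGroup`, `PrefixEntry`, `classify` and `low_immediate` are hypothetical, and placing the 4-bit immediate of the IR format in bits 3 to 0 is an assumption that merely agrees with the DBED example of Section 5.2.4.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { SG_RR_B, SG_MR_B, SG_IR_B, SG_UNKNOWN } SubGroup;

typedef struct {
    uint16_t prefix;  /* leading bits identifying the sub group */
    unsigned width;   /* number of leading bits in the prefix */
    SubGroup sg;
} PrefixEntry;

/* A small subset of sub-group prefixes, for illustration only. */
static const PrefixEntry PREFIXES[] = {
    { 0x00, 4, SG_RR_B },  /* 0000   */
    { 0x03, 4, SG_MR_B },  /* 0011   */
    { 0x36, 6, SG_IR_B },  /* 110110 */
};

/* Compares the top `width` bits of the word with each prefix. Since no
 * prefix in the list is the start of another, at most one can match. */
SubGroup classify(uint16_t word)
{
    for (size_t i = 0; i < sizeof PREFIXES / sizeof PREFIXES[0]; i++) {
        unsigned shift = 16u - PREFIXES[i].width;
        if ((uint16_t)(word >> shift) == PREFIXES[i].prefix)
            return PREFIXES[i].sg;
    }
    return SG_UNKNOWN;
}

/* Assumed position of the 4-bit immediate in the IR format (bits 3..0). */
unsigned low_immediate(uint16_t word)
{
    return word & 0xFu;
}
```

With these assumptions the word DBED is classified as IR(b) and its immediate field reads 13. Because fields such as Rs, Rd and AX sit at fixed positions, a real decoder could extract them in parallel with this prefix match, which is the point of the layout rule above.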
Table 5.2: A Possible Bit Assignment for Various MePoEfAr Instruction Formats

| # | SG | Opcode bits (from bit 15) | Remaining fields (towards bit 0) |
|----|-------|------------|------------------------|
| 1 | RR(b) | 0000 | OiRR, Rs, OiRR, Rd |
| 2 | RR(w) | 0001 | OiRR, Rs, OiRR, Rd |
| 3 | RR(d) | 0010 | OiRR, Rs, OiRR, Rd |
| 4 | MR(b) | 0011 | OiMR, Rd, AX |
| 5 | MR(w) | 0100 | OiMR, Rd, AX |
| 6 | MR(d) | 0101 | OiMR, Rd, AX |
| 7 | RM(b) | 0110 | OiRM, Rs, AX |
| 8 | RM(w) | 0111 | OiRM, Rs, AX |
| 9 | RM(d) | 1000 | OiRM, Rs, AX |
| 10 | MF | 1001 | OfMF, Fd, AX |
| 11 | FM | 1010 | OfFM, Fs, AX |
| 12 | CB | 1011 | CC, D8 |
| 13 | MX | 11000 | OxMX, Xd, AX |
| 14 | FF | 11001 | OfFF, Fs, OfFF, Fd |
| 15 | S1 | 11010 | RG, R, D7 |
| 16 | IR(b) | 110110 | OiIR, Rd, OiIR, I |
| 17 | IR(w) | 110111 | OiIR, Rd, OiIR, I |
| 18 | IR(d) | 111000 | OiIR, Rd, OiIR, I |
| 19 | XX | 1110010 | OxXX, Xs, Xd |
| 20 | IX | 1110011 | OxIX, I, Xd |
| 21 | SAV | 11101 | S/R, #RegPairs, DT |
| 22 | SAVx | 11101 | S/R, Mask |
| 23 | R(b) | 11101 | Rd, 0, OiR |
| 24 | R(w) | 11101 | Rd, 1, OiR |
| 25 | R(d) | 11101 | Rd, 0, OiR |
| 26 | F | 11110 | 1 |
| 27 | M(b) | 110000 | OiM, AX |
| 28 | M(w) | 110001 | OiM, AX |
| 29 | M(d) | 110010 | OiM, AX |
| 30 | M | 110011 | OfM, AX |
| 31 | M | 110100 | OxM, AX |
| 32 | NO | 110101 | NOOP only takes 8 bits |
| 33 | InXS | 110101 | Mask |
| 34 | InM | 111110110 | DT, AX |
| 35 | InMS | 111101 | DT, AX |
| 36 | X | 100000 | OxX, Xd |
| 37 | Ju | 100000 | OC2, Xd |
| 38 | InR | 100001 | DT, @ |
| 39 | InRS | | DT |
| 40 | BitOP | 1111100011 / 1111100100 | OBit |
| 41 | InX | 1111100100 | OBit, Bit# |
| 42 | InS | 1111100110 | |
| 43 | InSS | 1111100111 | |
| 44 | RTS | 1111101000 | |

Table 5.2 shows a possible bit-assignment for the MePoEfAr instructions. For each entry, the fields are listed from bit 15 down to bit 0 of the first instruction word. The column SG lists the Sub Group which is implemented by the corresponding table entry. The SGs are taken from Table ?? (see column SG).
For example, Entry No 4 in Table 5.2 is the bit assignment for the SG MR, which has the instruction code 0011 (i.e., two zeros and two ones in binary, and not eleven) specified by bit positions 12-15. This pattern is for memory to register operations on the byte data type. OiMR stands for the Operations on integers in Memory to Register format, as specified in Table ??. Rd is the destination register, and AX represents the addressing mode and index register combination which specifies the source operand in memory.

From Table 5.2 it is clear that the instructions are systematic, such that simple and fast encoding is possible. In addition, a fair amount of unused opcode space is available for future requirements.

One idea which may be mentioned at this point is that it may be better to have two sets of indeX registers: one set which is used in User mode and one set which is used in Supervisor mode, with the selection done by the Mode bit of the Status Register. Context switching can then be very fast, because interrupt handlers have their private register set and use the supervisor indeX registers.

## 5.4 Summary

An assembler is a program which translates assembly instructions to machine code. In this chapter we have discussed the MePoEfAr two-pass assembler. Figure 5.7 summarizes the steps taken by the MePoEfAr assembler to translate an assembly program to machine code. It can be seen from this figure that the scanner is the first stage of the MePoEfAr assembler. Whitespace and comments are discarded by the scanner, and tokens are passed to the parser. The parser performs the syntactic analysis and constructs the Abstract Syntax Tree (AST) based on the defined grammar. An important task in this translation process is maintaining the symbol table, which is done by traversing the AST. This process was involved for the MePoEfAr assembly language for two reasons. Firstly, instructions in MePoEfAr are variable length instructions.
Secondly, the size of a branch instruction depends upon the offset used for its branch displacement. We resolved this issue by assuming the worst-case size of branch instructions in the first pass and updating the proper instruction lengths, and hence the location counter, in the second pass. The last stage in this translation process was generating the binary patterns for the instructions. For this, the variable length instruction sub groups were assigned bit patterns based on the concept of variable length coding. Fast encoding of instructions was taken into account during this bit assignment process. The machine code generated by the assembler is fed to the MePoEfAr Interpreter, which is discussed in the next chapter.

# MePoEfAr Interpreter

Machine code gets executed either on a real architecture or on its abstract model. Architecture models are utilized in simulators in the early design phase for a number of reasons. Firstly, simulators are used to run benchmark programs and obtain performance results for an architecture in the early design phase. Secondly, a microcontroller architecture is a collection of sub-systems, and simulators are developed to verify the conformance of these sub-systems to the functionality described in the architecture document. Thirdly, simulators can be utilized to debug and test development and application programs targeted at the new architecture. This implies that the software tool chain (compiler, assembler, linker) can be developed and tested in parallel with the development of the hardware platform.

In the previous chapter, we discussed the MePoEfAr assembler, which we developed to generate machine code from MePoEfAr assembly programs. In order to debug and test the functionality of MePoEfAr assembly programs, we developed the MePoEfAr simulator, which is discussed in this chapter. This chapter starts with a brief overview of simulators in Section 6.1. Section 6.2 provides a high level description of the MePoEfAr Interpreter.
The part of the interpreter working as supervisor program is discussed in Section 6.3. This section also discusses the way the source line numbers of the instructions in the assembly programs are mapped to the memory addresses of the instructions. The MePoEfAr microcontroller model, which actually executes the programs, is described in Section 6.4. Finally, Section 6.5 summarizes the whole chapter.

### 6.1 Overview of Simulators

Architecture models are developed in the early design phase of an architecture for a number of reasons [8]. These models are known as simulators or cross simulators, as they simulate the functionality of the target architecture on a host machine. When these models are used to test the instruction set of an architecture, they are referred to as Instruction Set Simulators (ISS). The ISS of an architecture can be designed in two ways [26]:

1. Interpretive Simulators [27], [22], [29], in which the machine code of the program is loaded into the memory of the architecture. Instructions are fetched, decoded and executed one by one, much like on the real architecture. Interpretive simulators have the advantage that the simulator does not need to be re-generated when the application program is modified (as is required for the compiled simulators discussed below). The disadvantage is the low speed of interpretive simulators. [27] discusses an ARM interpretive simulator. We have developed an interpretive style simulator for our MePoEfAr architecture.
2. Compiled Simulators [28], [32], which generate an executable simulation file per application program. They have the advantage of speed, because the instruction decoding overhead moves to simulator generation time. The disadvantage is that a recompilation of the whole simulator is required in order to simulate a different file.

Figure 6.1: Block Diagram of MePoEfAr Interpreter Showing its Position in Relation to the Host Machine
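The interpretive style can be illustrated with a minimal C sketch for a made-up two-register toy machine. The opcodes, the `Toy` structure and the function `toy_run` are hypothetical and unrelated to the MePoEfAr instruction set; the sketch only shows the fetch, decode and execute loop which an interpretive simulator repeats for every instruction at simulation time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy opcodes (not MePoEfAr instructions). */
enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2 };

typedef struct {
    uint8_t reg[2];   /* two 8-bit registers                        */
    size_t  pc;       /* program counter, index into program memory */
} Toy;

/* Runs the program and returns the number of instructions executed. */
static unsigned toy_run(Toy *m, const uint8_t *pm, size_t len)
{
    unsigned executed = 0;

    assert(m != NULL && pm != NULL);
    m->pc = 0;
    while (m->pc < len) {
        uint8_t opcode = pm[m->pc++];            /* fetch  */
        executed++;
        switch (opcode) {                        /* decode */
        case OP_LOADI: {                         /* LOADI rd, imm */
            assert(m->pc + 2 <= len);
            uint8_t rd = pm[m->pc] & 1u;
            m->reg[rd] = pm[m->pc + 1];          /* execute */
            m->pc += 2;
            break;
        }
        case OP_ADD: {                           /* ADD rd, rs */
            assert(m->pc + 2 <= len);
            uint8_t rd = pm[m->pc] & 1u;
            uint8_t rs = pm[m->pc + 1] & 1u;
            m->reg[rd] = (uint8_t)(m->reg[rd] + m->reg[rs]);
            m->pc += 2;
            break;
        }
        case OP_HALT:
        default:
            return executed;                     /* stop on HALT or unknown opcode */
        }
    }
    return executed;
}
```

The decoding work inside the `switch` is repeated every time an instruction is executed, which is the source of the low speed mentioned above. A compiled simulator would perform this decoding once, when the simulator is generated for a given program.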
# 6.2 MePoEfAr Interpreter

The MePoEfAr simulator has been developed as an interpretive simulator, so that it closely resembles the instruction fetch, decode and execute stages of the architecture. The MePoEfAr interpreter is a program of 5597 lines of code written in the C language. The advantage of developing it in a high level language like C is that it can be ported to other platforms by recompiling the interpreter. The interpreter reads the machine code generated by the assembler and executes these instructions on the host PC.

Figure 6.1 shows the relation of the MePoEfAr interpreter to the host machine. It can be seen from this figure that the MePoEfAr Interpreter reads the machine code and communicates with the operating system layer for its execution. The operating system, in turn, sends instructions to the host machine to execute the program.

Figure 6.2 shows the block diagram of the MePoEfAr Interpreter. It can be seen from this figure that the interpreter consists mainly of two blocks, as listed below:

- Supervisor (main()) Program, which loads the program into memory and instructs the microcontroller to RUN the program
- MePoEfAr Microcontroller Model, which executes the program

These two parts are discussed one by one in detail in the next two sections.

Figure 6.2: Block Diagram of the MePoEfAr Interpreter

# 6.3 Supervisor Program ( main() )

The supervisor or main() program is the part of the MePoEfAr interpreter which supervises the interpretation process. It reads in the machine code and loads it into the data structure which represents the program memory of the MePoEfAr architecture. Next, it instructs the microcontroller to execute the loaded program. These tasks of the main() program are depicted in Listing 6.1.
```
#include <stdio.h>
#include <string.h>

/* inputFileName and the helper functions are defined elsewhere in the interpreter */

int main(int argc, char *argv[])
{
    if (argc > 1)   // if there is a command line argument for the name of file
        strcat(inputFileName, argv[1]);     // use this name
    else
        strcat(inputFileName, "test.hex");  // otherwise default test.hex will be used

    printf("\n MePoEfAr Interpreter \n");

    initFiles();    // utility function to initialize files
                    // for different purposes

    // call this function to load the machine code into the
    // program memory. The actual bytes representing the
    // machine code will be loaded and the information about
    // the line numbers of source program will be used for
    // the mapping of program memory and line numbers
    loadPM();

    // start the show
    // execute the program
    runProgram();

    printf("\n Finished Program Execution ... \n");

    closeFiles();   // close the files opened for internal use

    return 0;
}
```

Listing 6.1: MePoEfAr main() Interpreter C Code. It Takes the Name of the Input Hex File, Calls loadPM() to Load it into Memory, and Calls runProgram() to Execute the Loaded Program

An important task performed by the program load function (loadPM() in Listing 6.1) is the mapping of the source line numbers of the instructions to their memory addresses. This is important because the feedback provided by the interpreter in the form of errors and warnings is much more helpful if it points to the source line number. The next subsection describes in detail how we have achieved this in the MePoEfAr interpreter.

### 6.3.1 Memory Address to Source Line Number Mapping

The MePoEfAr Interpreter is developed such that the user is able to see the contents of the internals of the MePoEfAr architecture. An important feature of the MePoEfAr Interpreter is its ability to report the running instruction together with its source line number in the original assembly program. This is achieved by putting the line numbers of the source assembly instructions inside the generated machine code.
On the interpreter side, the main program, which acts as the supervisor, calls the function loadPM() to load the program. This function serves two purposes. Firstly, it loads the actual machine code into the program memory. Secondly, it stores the mapping between memory addresses and source line numbers in a list. The list holding these mapping entries is implemented as a hash table. Listing 6.2 shows the code for this mapping. Two important functions in this regard are:

1. insertMapping(), which inserts a mapping entry in the list.
2. searchMapping(), which searches for the entry that contains the source line number of the requested memory address.

```
#include <stdlib.h>

/*
  Structure to represent a node in the mapping list.
  PMAddress is mapped to lineNo, which is represented
  by the entry of this node in the list.
*/
typedef struct node
{
    int lineNo;          // line number in source program
    int PMAddress;       // program memory address
    struct node *next;   // pointer to next node
} *mapList;              // pointer to a list of such nodes

/* the hash table (SIZE and hash() are defined elsewhere) */
static mapList hashTable[SIZE];

/*
  Function insertMapping
  input is the pmAddr and lno which is to be mapped
  returns 0 if the mapping entry is inserted successfully
  returns -1 if mapping is already there, or it cannot be
  inserted because of memory allocation problem
*/
int insertMapping(int pmAddr, int lno)
{
    int h = hash(pmAddr);        // get the key from the hash function
    mapList l = hashTable[h];

    // loop till mapping found or till the end of list
    while ((l != NULL) && (pmAddr != l->PMAddress))
        l = l->next;

    if (l != NULL)               // mapping already in list
        return -1;
    else                         // mapping not in list
    {
        l = (mapList) malloc(sizeof(struct node));  // allocate memory
        if (l != NULL)
        {
            l->PMAddress = pmAddr;    // save memory address for this entry
            l->lineNo = lno;          // save the corresponding line no
            l->next = hashTable[h];   // new node points to the old head of this bucket
            hashTable[h] = l;         // new node becomes the head of this bucket
            return 0;                 // successful return
        }
    }
    return -1;                   // unsuccessful return
                                 // memory allocation problem
}

/*
  Function searchMapping
  searches the map entry of pmAddr and corresponding lno.
  If found, this lno is written through the pointer which
  is the argument to the function
  returns 0 if mapping found
  returns -1 if mapping not found
*/
int searchMapping(int pmAddr, int *lno)
{
    mapList l = hashTable[hash(pmAddr)];   // hash table entry

    // loop till mapping found or till the end of list
    while ((l != NULL) && (pmAddr != l->PMAddress))
        l = l->next;

    if (l == NULL)       // not found till the end
        return -1;       // signal failure

    *lno = l->lineNo;    // found, write the lno
    return 0;            // signal success
}
```

Listing 6.2: Code Used to Store the Mapping of Program Memory Address and Line Numbers in MePoEfAr Interpreter

### 6.4 MePoEfAr Microcontroller Model

The main part of the MePoEfAr Interpreter is the *microcontroller*. After the program is loaded into program memory, the runProgram() function is called to execute the loaded program. The program is executed in a loop, as shown in Listing 6.3. The body of this loop consists of four main functions, as discussed below:

- **fetchInstruction()** fetches the first word (2 bytes) of the instruction from the location pointed to by the program counter and copies it into a temporary data structure (instrTemp) for later processing. instrTemp is passed to it by reference, as can be seen in Listing 6.3.
- **decodeInstruction()** decodes the instruction, which is passed to it by reference, as can be seen in Listing 6.3. This is the function which identifies the **S**ub **G**roup (**SG**) of the instruction. Details of the decoding process are provided later in a separate section.
- **executeInstruction()** executes the instruction passed to it as argument.
In this function, a *switch* statement selects the function corresponding to the SG of the instruction. The instruction is executed and, if needed, changes the state of the microcontroller based on its operation.
- **interact()** interacts with the user if step mode is enabled, as can be seen in Listing 6.3. After each instruction is executed, the interpreter asks the user whether he wants to inspect the internals of the architecture. If step mode is disabled, the complete program is executed and the user can interact only at the end of the program.

```
/*
  Function runProgram(), which executes the program:
  instructions are fetched, decoded and executed one by one.
  If step by step mode is defined, the user is prompted
  after each instruction.
*/
void runProgram()
{
    int lno;                 // temp to hold lno of current instruction
    Instruction instrTemp;   // temp to hold current instruction

    while (PC < noOfBytes)   // loop till complete program
    {
        // first of all read the source line no from the linked list
        if (searchMapping(PC, &lno) == 0)   // search for mapping
            printf("\n Executing instruction from line %d \n", lno);
        else
            printf("\n Could not find the Mapping for PM Address : %d\n", PC);

        fetchInstruction(&instrTemp);    // fetch the first instruction word

        decodeInstruction(&instrTemp);   // identify the sub group

        executeInstruction(instrTemp);   // execute instruction
                                         // update PC accordingly in case if more bytes fetched

        // if step by step mode is active then ask user to continue or
        // if he wants to have a look at some registers or memory or..
#ifdef STEP_MODE
        interact();
#endif
    }
}
```

Listing 6.3: runProgram() Function in which Instructions are Fetched, Decoded and Executed

In order to achieve this instruction fetch, decode and execute, the internal components of the MePoEfAr architecture were modeled as data structures. These components are listed below:

1. Program Counter
2. Program Status Word
3. Registers
4. Program Memory
5. Data Memory
6. Stack and Stack Pointer
7. Decoder
8. Arithmetic and Logic Unit

The following is a brief description of the implementation of each of these components.

### 6.4.1 Program Status Word

Four condition code bits of the **P**rogram **S**tatus **W**ord (**PSW**), namely the zero flag, sign flag, carry flag and overflow flag, are modeled as global integers, which are updated after the execution of an instruction which affects these flags <sup>1</sup>.

<sup>1</sup>These flags are always visible at the terminal, showing the updated status of the condition codes based on the most recently executed instruction.

#### 6.4.2 Program Counter

The **P**rogram **C**ounter (**PC**) is modeled as a global counter pointing to the address of the next instruction to be executed. It is updated after each instruction fetch (or after fetching the remaining bytes of an instruction larger than two bytes), or after the execution of instructions operating on the PC (Branch and Jump instructions).

## 6.4.3 Registers

Registers are modeled as arrays of the corresponding data type. Functions are provided to read from and write to these registers.

## 6.4.4 Program Memory

Program Memory is modeled as an array of the *int8_t* <sup>2</sup> data type. A variable indicates the number of bytes of program loaded into program memory; it is updated during the program load. Functions are provided to fetch instruction bytes from program memory.

#### 6.4.5 Data Memory

Data Memory is modeled as an array of the *int8_t* data type. Basic functions are provided to read and write the data memory. These functions are then utilized to define functions to read and write data as integer and floating point values.

#### 6.4.6 Stack and Stack Pointer

The stack area is a part of data memory; it starts from the highest memory address and grows towards lower memory addresses. A pointer to the current position on the stack, known as the **S**tack **P**ointer (**SP**), is implemented and used in stack related operations (subroutine call and return).
SP is initialized to the highest data memory address. Whenever something is pushed on the stack, SP is decremented, and whenever something is popped, SP is incremented.

<sup>2</sup>*int8_t* is always an 8-bit data type, which is defined in stdint.h

#### 6.4.7 Decoder

The instruction decoder is implemented as nested *switch* statements, as can be seen from Listing 6.4. The outer *switch* statement selects the case based on the number of bits to be considered. The starting value is 4, as this is the minimum number of bits needed to identify an SG. The inner *switch* statement selects the proper SG among the available options by matching the value of these bits against the code of each SG. These *switch* statements execute inside a *while* loop which iterates until the SG of the instruction is identified or the number of bits to be considered exceeds 16 (the number of bits in one instruction word). On each iteration of the loop, the number of bits to be considered for decoding is incremented, as can be seen at the end of the loop body.

```
/*
  Function decodeInstruction() decodes the instruction.
  Input is a pointer to the instruction and based upon
  the decoding logic described in the instruction bit
  assignment, Sub Group of instruction will be updated.
*/
void decodeInstruction(Instruction *instrTemp)
{
    int bitsToConsider = 4;   // minimum number of bits needed to identify an SG
    int bitsValue;

    // maximum bits in instruction is 16
    while (bitsToConsider <= 16)
    {
        // slice the bits which we want to consider to compare its value
        bitsValue = sliceBits(instrTemp->shortInstr, bitsToConsider);

        switch (bitsToConsider)
        {
        case 4:   // instructions with 4 bit SG field
            switch (bitsValue)
            {
            case RRbCode: instrTemp->SG = SG_RRb; return;
            case RRwCode: instrTemp->SG = SG_RRw; return;
            case RRdCode: instrTemp->SG = SG_RRd; return;
            case MRbCode: instrTemp->SG = SG_MRb; return;
            case MRwCode: instrTemp->SG = SG_MRw; return;
            case MRdCode: instrTemp->SG = SG_MRd; return;
            /* and so on, other sub groups are decoded */
            }
            break;
        case 5:   // instructions with 5 bit SG field
            switch (bitsValue)
            {
            case MXCode: instrTemp->SG = SG_MX; return;
            case FFCode: instrTemp->SG = SG_FF; return;
            case S1Code: instrTemp->SG = SG_S1; return;
            }
            break;
        /* and so on, other sub groups are also decoded */
        }
        bitsToConsider++;   // increment bits to consider if not found
    }
    instrTemp->SG = SG_NA;  // no sub group matched
}
```

Listing 6.4: Code for Instruction Decoding

The end result of this decoding is that either the instruction is identified correctly and its SG field is updated with the proper sub group, or the SG field is updated with *SG_NA*, indicating a Not Applicable SG for the execute stage.

#### 6.4.8 Arithmetic and Logic Unit

Arithmetic and logic operations constitute the bulk of the operations on the various data types defined in the MePoEfAr architecture. The Arithmetic and Logic Unit (ALU) is modeled as a number of functions which execute these operations for all the data types. During the execution phase, a *switch* statement selects the function corresponding to the data type and operation; this function performs the operation and updates the condition codes as defined in the architecture.
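How an ALU function of this kind can update the four condition codes is sketched below for an 8-bit addition. This is a minimal illustration, assuming the conventional two's complement definitions of the zero, sign, carry and overflow flags; the MePoEfAr architecture document may define them differently. The `Flags` structure and the function name `alu_add8` are hypothetical (the interpreter itself models the flags as global integers).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the four condition codes. */
typedef struct {
    int zero;       /* result is zero                     */
    int sign;       /* most significant bit of the result */
    int carry;      /* unsigned carry out of bit 7        */
    int overflow;   /* signed (two's complement) overflow */
} Flags;

/* Adds two bytes and updates the condition codes. */
static uint8_t alu_add8(uint8_t a, uint8_t b, Flags *f)
{
    unsigned wide = (unsigned)a + (unsigned)b;   /* keep the ninth bit */
    uint8_t result = (uint8_t)wide;

    assert(f != NULL);
    f->zero  = (result == 0);
    f->sign  = (result & 0x80u) != 0;
    f->carry = wide > 0xFFu;
    /* overflow: both operands have the same sign and the result's sign differs */
    f->overflow = ((~(a ^ b) & (a ^ result)) & 0x80u) != 0;
    return result;
}
```

For instance, adding 0x7F and 0x01 gives 0x80 with the sign and overflow flags set, while adding 0xFF and 0x01 gives 0x00 with the zero and carry flags set.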
# 6.5 Summary

In order to test and debug the programs written for a specific architecture, these programs are translated into a form understandable by the machine. This machine can be the real machine or a model of the machine implemented in a high level language. Simulators are developed in the early architecture design phase to model these architectures. The MePoEfAr interpreter discussed in this chapter is an interpretive simulator.

The interpreter has two main parts. The first part is the *main* program, which loads the machine code into the program memory and maps the memory addresses of the instructions to their source line numbers (in the original assembly program). This mapping is important for testing and debugging assembly programs, as the feedback given by the interpreter in the form of error and warning messages is more useful if it includes the source line numbers.

The second part of the interpreter is the model of the MePoEfAr microcontroller. Various components of the MePoEfAr architecture are modeled inside this microcontroller. These components are utilized in the loop which executes the instructions one by one. In this loop, instructions are fetched, decoded and executed. If step mode is active, the interpreter interacts with the user, who may want to examine the state of the microcontroller.

The interpreter described in this chapter is used to debug and test the functionality of the benchmark programs used to evaluate and compare the performance of the MePoEfAr architecture. Benchmarking details are provided in the next chapter.

# Assembler Level Benchmarking

In this chapter, the performance of the MePoEfAr architecture is analyzed and compared to three other well known microcontroller architectures. The performance of an architecture for a given benchmark also depends upon the quality of the code generated by the compiler.
In order to have a comparison based solely on architectural capabilities, we have performed our first round of benchmarking at the assembler level. The assembler and interpreter of MePoEfAr have been discussed in the previous two chapters. Six benchmark programs were selected to test different aspects of the architecture. For a fair comparison, these application programs were hand assembled and optimized for the MePoEfAr architecture as well as for the other three candidate architectures. Appendix C contains the hand assembled, optimized programs for all four architectures considered for comparison.

This chapter starts with the description of the evaluation criteria in Section 7.1. The candidate architectures are briefly described in Section 7.2. The benchmark application programs are described in Section 7.3. Performance results with comparison and evaluation are discussed in Section 7.4. Section 7.5 summarizes the chapter with a table of combined results to give an overall impression of the performance comparison.

## 7.1 Evaluation Criteria

Performance has been evaluated based on the efficiency of the architecture in terms of number of instructions, program memory size (bytes) and execution time (cycles). To estimate the power consumption, the number of instructions executed and the program/data memory traffic (cycles) have also been calculated. These results can be classified in two main categories. Figure 7.1 shows the classification of the results which are calculated for evaluation and comparison.

Figure 7.1: Classification of Evaluation Criteria

# 7.2 Candidate Architectures for Comparison

The performance of the MePoEfAr architecture is compared with three well-known architectures which are widely used as embedded microcontrollers. These architectures are:

1. Atmel AVR AT90S8515 (8 Bit)
2. TI MSP430G2231 (16 Bit)
3. ARM LPC1342 (32 Bit)

#### 7.2.1 Atmel AVR AT90S8515

The AT90S8515 is a low power, CMOS, 8-bit microcontroller based on the AVR RISC architecture [15] developed by Atmel [14].
It utilizes the modified Harvard architecture concept. Although it is an 8-bit microcontroller, each instruction takes one or two 16-bit words. It has 32 single-byte general purpose registers with single clock cycle access time. It supports five addressing modes.

#### 7.2.2 TI MSP430G2231

The second candidate is the MSP430G2231 [3], a 16-bit RISC architecture developed by Texas Instruments [2]. It has been designed for low cost and low power embedded applications. It uses a von Neumann architecture with a single memory space for instructions and data. Instructions generally take one cycle per word fetched or stored. It has 27 core instructions and 24 emulated instructions. It supports seven addressing modes for source operands and four addressing modes for the specification of destination operands in instructions. It has the following 16-bit registers:

**R0:** Program counter

**R1:** Stack pointer

**R2:** Status register (only in register addressing with word data type)

**R2 and R3:** are used as constant generators for the most frequent constants (-1, 0, 1, 2, 4, 8)

**R4-R15:** General purpose registers

The user guide [4] provides further details of the MSP430 microcontroller architecture.

#### 7.2.3 ARM LPC1342

The third candidate is the LPC1342 [13] developed by NXP (founded by Philips) [12], a Cortex-M3 based low power 32-bit RISC microcontroller. ARM is a fab-less company which designs these architectures as Intellectual Property (IP) modules and sells licenses to other companies which actually manufacture the chips. In the case of the LPC1342, the manufacturing company is NXP.
There are various architectures provided by ARM targeting various application areas, such as:

- ARM Cortex-A series targets the general purpose processor cores
- ARM Cortex-R series is a family of processors for real time systems
- ARM Cortex-M series processors are designed for low-power, memory efficient embedded applications

Among the M-series processors, the Cortex-M3 processors are especially designed for embedded microcontrollers. The Cortex-M3 is based on the modified Harvard architecture concept. It supports the Thumb-2 instruction set, which reduces the instruction memory requirements by including support for 16-bit instructions. It has the following general purpose and special purpose registers:

**R0-R12:** General purpose registers

**R13:** Stack pointer

**R14:** Link register used by subroutines for the return address

**R15:** Program counter

**xPSR:** Program Status Register

Registers R0-R7 are accessible by all instructions, whereas registers R8-R12 are accessible by all 32-bit instructions but not by all 16-bit instructions. The technical reference manual of the ARM Cortex-M3 architecture (as well as of other ARM architectures) can be found in [5] for further details.

For the sake of brevity, in the rest of the chapter MePoEfAr, TI, ARM and AVR refer to our architecture and to the TI MSP430G2231, ARM Cortex-M3 LPC1342 and AVR AT90S8515 microcontrollers, respectively.

# 7.3 Selected Benchmark Programs

In this section a brief description of the benchmark programs is presented. The central idea of each algorithm is summarized, and the features of the microcontroller architecture which are tested by each application are also mentioned. Three types of microprocessor/microcontroller/DSP benchmarks are known in general [25], [34]:

1. Synthetic benchmarks (e.g. Whetstone Benchmark [23], Dhrystone Benchmark [16]) developed to measure system specific parameters (CPU, compiler, and so on)
2. Application based benchmarks (real world benchmarks) developed to compare different system architectures in the same real fields of application, for instance EEMBC benchmarks [10] such as AutoBench [9] and Coremark [7]
3. Algorithm based benchmarks (a compromise between the first and the second type) developed to compare different system architectures in special (synthetic) fields of application.

The benchmark code used to test processor architectures and compilers can be separated into eight different modules:

1. Fixed-point math algorithms
2. Floating-point math algorithms
3. Logic calculations
4. Digital control
5. Fast Fourier Transform
6. Field processing
7. Loops and conditional jumps
8. Recursion and stack tests

At the assembler level, writing hand assembled code for full-fledged benchmarks for these different architectures is a time consuming process. We therefore picked some parts of these benchmarks (those which perform the actual computations) and used them for our assembler level performance evaluation and comparison. The following programs have been used for our assembler level benchmarking:

1. Recursive Factorial Algorithm
2. String Copy Function
3. Bubble Sort Algorithm
4. Sensor Structure Program
5. Matrix Multiplication
6. FIR Algorithm

These applications cover most of the features mentioned in the eight modules above. A brief description of these benchmark programs is given below.

#### 7.3.1 Benchmark Application 1: Recursive Factorial Program

This program calculates the factorial of a number recursively. It is based on the concept that the factorial of a number n is n times the factorial of the previous number (n-1). This implies that the factorial of n can be calculated if we know the factorial of n-1. This approach is continued until the number is reduced to 1. The factorial of 0 and of 1 is 1, which is the base case of the recursion. A number is passed to the factorial() function from main().
This function calculates the factorial and returns the result to the main function. Listing 7.1 provides the commented C code of this program.

```
/*
 FactRec Benchmark Program
 C Program implementing recursive factorial function.
 A number is passed as an argument to this function and
 factorial of the number is returned after calculations.

 Factorial of a positive integer n, denoted by n!, is the
 product of all positive integers less than or equal to n.
 For example, 5! = 5 X 4 X 3 X 2 X 1.
 0! is defined to be 1.
*/

//prototype of the factorial function
long factorial(int);

void main(void)
{
    //call the factorial function
    factorial(5);
}

long factorial(int n)
{
    if (n <= 1)     //i.e. if the number is less than or equal to 1
        return 1;   //then return 1
    //otherwise factorial will be n times factorial of n-1
    return n * factorial(n-1);
}
```

Listing 7.1: Benchmark Application 1: Recursive Factorial Program

#### 7.3.2 Benchmark Application 2: String Copy Program

This benchmark application performs a simple string copy operation. The StrCpy() function is called from main(). The source and destination string addresses are passed as arguments to this function. StrCpy() performs the copy operation and returns to main. This program tests the conditional branching and data memory access capability. Listing 7.2 is the C code of this benchmark.

```
/*
 StringCopy Benchmark Program
 C Program implementing string copy.
 For testing, source string is initialized to "Super Scalar".
 The addresses of source and destination strings are passed to
 StrCpy which will copy the string from source to destination.
*/

//prototype of the string copy function
void StrCpy(char *, char *);

void main(void)
{
    //initialization of source string
    char *strSrc = "Super Scalar";

    //destination string
    char strDest[25];

    //now call the copy function
    StrCpy(strSrc, strDest);
}

//string copy function
void StrCpy(char *src, char *dest)
{
    int i = 0;              //index variable

    while (src[i] != '\0')  //loop until the null character is seen
    {
        dest[i] = src[i];   //copy a character from source to destination
        i++;                //increment the index
    }
    dest[i] = src[i];       //copy the last character, which is the null character

    //return to calling method (done copying)
    return;
}
```

Listing 7.2: Benchmark Application 2: String Copy Program

#### 7.3.3 Benchmark Application 3: Bubble Sort Program

This application program is the well-known bubble sort algorithm. An array of 10 numbers is initialized in the main() function with the elements in ascending order. The base address of this array is passed to the BSort() function, which sorts the array in descending order. This program tests the performance regarding array handling, conditions and loops. The C code of this benchmark is given in Listing 7.3.

```
/*
 BubbleSort Benchmark Program
 Program to sort the array in descending order. Bubble sort is used as
 the sorting algorithm. Bubble sort, also known as sinking sort, is a
 simple sorting algorithm that works by repeatedly stepping through the
 list to be sorted, comparing each pair of adjacent items and swapping
 them if they are in the wrong order. The pass through the list is
 repeated until no swaps are needed, which indicates that the list is
 sorted. The algorithm gets its name from the way smaller elements
 "bubble" to the top of the list.
*/

//size of the array
#define Arr_Size 10

void BSort(int a[Arr_Size]);

void main(void)
{
    int Array[10];
    int i;

    //fill array with numbers
    for (i=0; i<10; i++)
        Array[i]=i;

    //call the sorting function
    BSort(Array);
}

//bubble sort function
void BSort(int a[Arr_Size])
{
    int i,j,temp;

    for (i=Arr_Size-2; i>=0; i--)   //Array size is 10, 9 passes needed to completely sort
    {
        for (j=0; j<=i; j++)
        {
            if (a[j] < a[j+1])      //if a number is less than its next number
            {
                temp=a[j];          //then swap to bring them in descending order
                a[j]=a[j+1];
                a[j+1]=temp;
            }
        }//end for j
    }//end for i
}//end function.
```

Listing 7.3: Benchmark Application 3: Bubble Sort Program

#### 7.3.4 Benchmark Application 4: Sensor Structure Program

This application implements a record (known as a structure in the C language) to store the data for sensor values, and hence tests structure handling. The structure used to store a sensor value contains 3 members:

- 1. 1 char byte Flag indicating if the sensor has been calibrated or not
- 2. 1 short int containing the offset to be adjusted
- 3. 1 long int containing the actual sensor value

An array of five sensor values is declared. The InitSensors() function initializes these values to some arbitrary numbers. The CalibrateSensors() function subtracts the offset from the value of each sensor and sets the Flag. main() calls these two functions to initialize and calibrate the sensor data. Listing 7.4 provides the C code of this benchmark.

```
/*
 SensorStruct Benchmark Program
 C Program implementing a structure for sensor values.
 Structure contains 3 elements:
   1 char byte Flag indicating if sensor has been calibrated or not.
   1 short int containing the offset to be adjusted
   1 long int containing the actual sensor value
 An array of 5 sensors is declared. InitSensors() will initialize these
 values to some numbers. CalibrateSensors() will subtract the offset
 from the value of the sensors and set the Flag. main() will call these
 two functions to initialize and calibrate sensor data.
*/

// sensor initialization function
void InitSensors();

// sensor calibration function
void CalibrateSensors();

// structure to hold sensor data
typedef struct
{
    char Flag;
    short Offset;
    long Value;
}Sensor;

// array of 5 sensor values
Sensor sensors[5];

void main()
{
    InitSensors();
    CalibrateSensors();
}

// sensor initialization function
void InitSensors()
{
    short i;
    i=0;
    while (i < 5)
    {
        sensors[i].Flag = 0;
        sensors[i].Offset = i;
        sensors[i].Value = i+3;
        i++;
    }
}

// sensor calibration function
void CalibrateSensors()
{
    short i=0;
    while (i < 5)
    {
        sensors[i].Flag = 1;
        sensors[i].Value = sensors[i].Value - sensors[i].Offset;
        i++;
    }
}
```

Listing 7.4: Benchmark Application 4: Sensor Structure Program

#### 7.3.5 Benchmark Application 5: Matrix Multiplication Program

This benchmark application performs the matrix multiplication algorithm. In the main function, two matrices of order 3 by 4 and 4 by 5, respectively, are initialized. Standard matrix multiplication is then performed to get the product matrix of order 3 by 5. This application tests the capability of the architecture to handle integer math, nested loops with conditions and address calculations for matrix elements. Listing 7.5 is the C code of this benchmark.

```
/*
 MatrixMul Benchmark Program
 Matrix Multiplication is the implementation of multiplication of a
 3X4 matrix by a 4X5 matrix to get a product 3X5 matrix. Both the
 matrices are initialized with some values. Later actual
 multiplication is performed to get the product matrix.
*/

int main(void)
{
    short m, n, p;
    long m1[3][4];  //matrix 1
    long m2[4][5];  //matrix 2
    long m3[3][5];  //product matrix

    //fill the first array with some numbers
    //(m+p values for testing)
    for(m = 0; m < 3; m++)
    {
        for(p = 0; p < 4; p++)
        {
            m1[m][p]=m+p;
        }
    }

    //fill the second array with some numbers
    //(m+p values for testing)
    for(m = 0; m < 4; m++)
    {
        for(p = 0; p < 5; p++)
        {
            m2[m][p]=m+p;
        }
    }

    //perform multiplication
    for(m = 0; m < 3; m++)
    {
        for(p = 0; p < 5; p++)
        {
            m3[m][p] = 0;
            for(n = 0; n < 4; n++)
            {
                m3[m][p] += m1[m][n] * m2[n][p];
            }
        }
    }
}
```

Listing 7.5: Benchmark Application 5: Matrix Multiplication Program

#### 7.3.6 Benchmark Application 6: FIR Program

This application is the algorithm of a 17th order FIR filter. It is used to test the math calculation capability of the architecture for this type of application. Similar applications widely implemented on microcontrollers are PID control algorithms. In both types of applications the output is a weighted sum of the current and a finite number of previous values of the input. In this example, the input to the filter is an array of 51 arbitrary 16-bit values representing a discrete input signal. Calculations are performed and the results are stored in the output array representing the discrete output signal. Performance calculations for this benchmark are based on the assumption that all the architectures have floating point hardware, because the floating point calculations involved in the program otherwise cause a huge difference in the results. Listing 7.6 shows the C code of this benchmark.

```
/*
 FIR Benchmark Program
 The output of a filter is a weighted sum of the current and a finite
 number of previous values of the input. For testing in this example,
 input values for the filter is an array of 51 16-bit values. The order
 of the filter is 17.
*/
void main(void)
{
    int i, y;           /* Loop counters */
    float COEFF[17];    //to hold the coefficients of the filter
    int INPUT[67];      //to hold the input (A/D converted values)
    float OUTPUT[36];   //to hold the (filtered) output values
    float sum;          //temporary used for sum

    //fill the coefficient array with some values
    for(i=0;i<17;i++)
    {
        COEFF[i]=1/(i+5.0);
    }

    //fill in the input values
    for (i=0; i<67; i++)
    {
        INPUT[i]=i;
    }

    //apply filtering
    for (y = 0; y < 36; y++)
    {
        sum = 0.0;
        for (i = 0; i < 8; i++)
        {
            sum=sum + COEFF[i]*(INPUT[y+16-i]+INPUT[y + i]);
        }
        OUTPUT[y] = sum + INPUT[y + 8] * COEFF[8];
    }
}
```

Listing 7.6: Benchmark Application 6: FIR Program

#### 7.4 Result Evaluation and Comparison

In this section, performance results are summarized for the above mentioned benchmark programs. Static and dynamic results are tabulated, and the ratio of each selected candidate architecture to MePoEfAr is also provided. The last row in each of these tables presents the mean value of the column. For actual values, for instance number of instructions or execution cycles, the arithmetic mean is calculated. The arithmetic mean of ratios is not consistent, because it depends upon the choice of reference architecture. So for columns of ratios, the last row gives the geometric mean of that column to show the overall comparison. The detailed internal calculations performed to get these results are given in Appendix D.

#### 7.4.1 Static Results

The total number of instructions required by an architecture to implement some functionality is a measure of the capability of the instruction set of that architecture. Table 7.1 below shows the total number of instructions required by the four microcontrollers. Figure 7.2 shows this information graphically.
| # | Benchmark | MePoEfAr | TI | TI/MePoEfAr | ARM | ARM/MePoEfAr | AVR | AVR/MePoEfAr |
|---|--------------|----------|----|-------------|-----|--------------|-----|--------------|
| 1 | FactRec | 14 | 26 | 1.86 | 14 | 1.00 | 53 | 3.79 |
| 2 | StringCopy | 8 | 9 | 1.13 | 10 | 1.25 | 11 | 1.38 |
| 3 | BubbleSort | 20 | 33 | 1.65 | 24 | 1.20 | 42 | 2.10 |
| 4 | SensorStruct | 23 | 29 | 1.26 | 29 | 1.26 | 39 | 1.70 |
| 5 | MatrixMul | 33 | 56 | 1.70 | 43 | 1.30 | 105 | 3.18 |
| 6 | FIR | 45 | 76 | 1.69 | 51 | 1.13 | 189 | 4.20 |
| | Mean | 24 | 38 | 1.52 | 29 | 1.19 | 73 | 2.51 |

Table 7.1: Number of Instructions Required for Benchmark Programs

As can be seen from these results, for the simple 8-bit string copy benchmark all the controllers require a similar number of instructions. As the complexity of the application and the width of the data types involved increase, the gap widens.

Figure 7.2: Number of Instructions Required for Benchmark Programs

AVR and TI require a large number of instructions because most of the calculations involve data wider than 8 and 16 bits, respectively. MePoEfAr, on the other hand, has several register sets (data types), and the same instructions operate on these different register sets. This choice of data types results in fewer instructions to implement the same benchmarks in MePoEfAr.

A large number of operations is possible in MePoEfAr due to efficient instruction encoding, in spite of it being a 16-bit architecture. Support for a large number of operations means that fewer instructions are required in the benchmarks to achieve some functionality, since operations would otherwise need to be emulated with more instructions. For instance, TI suffers because a multiply instruction is not part of its instruction set. Although it has a 16-bit hardware multiplier, this is available only as a memory mapped peripheral, which means multiple instructions for writing to and reading from its registers for each multiplication.
In the case of AVR, although it has a multiply instruction, more instructions are required because its registers are only 8 bits wide. So, more instructions are needed to achieve, for example, a 32-bit addition or subtraction. AVR also requires a large number of instructions because of register pressure. Data is moved into registers for a sequence of operational steps. If more data needs to be fetched before these steps are complete and no free registers remain, previously occupied registers are stored to memory (spilled) and fetched back later, resulting in extra data move instructions. This is one of the reasons for the large number of instructions required by AVR in the FIR benchmark, in which 4-byte variables need to be processed in the inner loop.

ARM also performs operations only on registers (load-store architecture). This means that instructions are required for moving data to registers before operations can be performed on it. Similarly, store instructions are explicitly needed for storing the results back to memory. This is the reason that, in spite of being a 32-bit architecture, ARM requires on average 19% more instructions than MePoEfAr.

Another reason for the reduced number of instructions required by MePoEfAr is the variety of addressing modes available in the architecture, so that fewer instructions are required for address computation than in the other architectures. Although TI and AVR also have auto-increment and auto-decrement addressing modes, in the orthogonal MePoEfAr these modes work on all the data types available in the architecture.

Figure 7.3: Program Memory Size (Bytes) for Selected Benchmarks

On average, for the given benchmarks, ARM and TI require 19% and 52% more instructions than MePoEfAr, respectively, while the instruction ratio of AVR to MePoEfAr is 2.51.
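The cost of wide arithmetic on narrow registers discussed above can be illustrated with a minimal C sketch. It models a 32-bit addition as four byte-wide additions chained through a carry, which is the kind of sequence an 8-bit register machine has to execute. The function names `add32_bytewise` and `add32_via_bytes` are hypothetical and are not taken from the benchmark sources.

```c
#include <stdint.h>

/* Add two 32-bit values stored as four 8-bit limbs, least significant
 * byte first. Each loop iteration stands for one byte-wide
 * add-with-carry step on an 8-bit register machine. */
void add32_bytewise(const uint8_t a[4], const uint8_t b[4], uint8_t sum[4])
{
    unsigned carry = 0;
    for (int i = 0; i < 4; i++) {
        unsigned t = (unsigned)a[i] + (unsigned)b[i] + carry;
        sum[i] = (uint8_t)(t & 0xFFu);  /* low 8 bits stay in this limb */
        carry = t >> 8;                 /* 0 or 1, fed into the next limb */
    }
}

/* Hypothetical convenience wrapper: split, add bytewise, reassemble. */
uint32_t add32_via_bytes(uint32_t x, uint32_t y)
{
    uint8_t a[4], b[4], s[4];
    for (int i = 0; i < 4; i++) {
        a[i] = (uint8_t)(x >> (8 * i));
        b[i] = (uint8_t)(y >> (8 * i));
    }
    add32_bytewise(a, b, s);

    uint32_t r = 0;
    for (int i = 0; i < 4; i++)
        r |= (uint32_t)s[i] << (8 * i);
    return r;
}
```

On a machine with 32-bit registers the same result is a single add, which is one source of the differences in instruction count reported in Table 7.1.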
The memory efficiency of an instruction set architecture is compared by calculating the total number of bytes of program memory required for each benchmark application. Table 7.2 summarizes the total number of bytes required for all six benchmark applications by the four microcontroller architectures. These results are plotted in Figure 7.3.

| # | Benchmark | MePoEfAr | TI | TI/MePoEfAr | ARM | ARM/MePoEfAr | AVR | AVR/MePoEfAr |
|---|--------------|----------|-----|-------------|-----|--------------|-----|--------------|
| 1 | FactRec | 28 | 74 | 2.64 | 34 | 1.21 | 100 | 3.57 |
| 2 | StringCopy | 20 | 26 | 1.30 | 28 | 1.40 | 22 | 1.10 |
| 3 | BubbleSort | 45 | 102 | 2.27 | 60 | 1.33 | 84 | 1.87 |
| 4 | SensorStruct | 50 | 90 | 1.80 | 80 | 1.60 | 78 | 1.56 |
| 5 | MatrixMul | 81 | 174 | 2.15 | 112 | 1.38 | 210 | 2.59 |
| 6 | FIR | 113 | 209 | 1.85 | 152 | 1.35 | 378 | 3.35 |
| | Mean | 56 | 113 | 1.95 | 78 | 1.37 | 145 | 2.15 |

Table 7.2: Program Memory Size (Bytes) for Selected Benchmarks

It can be seen from the results that ARM has better memory efficiency than TI and AVR, but MePoEfAr has an even smaller memory footprint: ARM requires 37% more program memory overall. The ARM Cortex-M3 supports the Thumb-2 instruction set, which means support for 16-bit instructions; still, 32-bit instructions are needed, resulting in a larger memory size. For the string copy program, which requires 8-bit operations, AVR is close in memory efficiency to MePoEfAr, but for the other applications, which operate on wider data types, the difference increases. AVR also requires a large number of instructions for these applications, which directly means a larger instruction memory requirement. In short, MePoEfAr outperforms AVR by a factor of 2.15.

The memory efficiency of MePoEfAr is mainly due to its variable length instructions, which are 1, 2, 3 or 4 bytes long depending upon their frequency of occurrence.
Another reason for the memory efficiency of MePoEfAr is its efficient support for small immediate values and short displacements. In Thumb mode, ARM supports 3-bit immediate values if two registers are specified and 8-bit immediate values if a single register operand is specified. TI has reserved two registers, namely R2 and R3, as constant generators to generate the six most frequent constants -1, 0, 1, 2, 4 and 8. In the case of MePoEfAr, 4-bit immediate values and 8-bit offsets are accommodated directly inside the first instruction word, with both operands specified and without reserving any registers.

#### 7.4.2 Dynamic Results

Dynamic results describe the run-time behaviour of the architecture, that is, the way the benchmark programs are actually executed on it. For instance, the total number of instructions executed for a given program gives an idea of the total power requirements and of the traffic between instruction memory and CPU. Table 7.3 summarizes the total number of instructions executed by each architecture for the six selected benchmark applications.

| # | Benchmark | MePoEfAr | TI | TI/MePoEfAr | ARM | ARM/MePoEfAr | AVR | AVR/MePoEfAr |
|---|--------------|----------|------|-------------|------|--------------|-------|--------------|
| 1 | FactRec | 42 | 87 | 2.07 | 42 | 1.00 | 81 | 1.93 |
| 2 | StringCopy | 44 | 57 | 1.30 | 58 | 1.32 | 59 | 1.34 |
| 3 | BubbleSort | 371 | 779 | 2.10 | 523 | 1.41 | 908 | 2.45 |
| 4 | SensorStruct | 66 | 101 | 1.53 | 97 | 1.47 | 131 | 1.98 |
| 5 | MatrixMul | 599 | 1442 | 2.41 | 852 | 1.42 | 1662 | 2.77 |
| 6 | FIR | 5229 | 9331 | 1.78 | 6282 | 1.20 | 24349 | 4.66 |
| | Mean | 1059 | 1966 | 1.83 | 1309 | 1.29 | 4532 | 2.33 |

Table 7.3: Total Number of Instructions Executed

Initializing instructions are executed only once in the application. They may require a large part of program memory.
From the execution point of view, the total number of instructions executed inside the loop is important, as it makes the major contribution to the execution time and total power consumption. Table 7.4 shows the total number of instructions executed inside the loop.

| # | Benchmark | MePoEfAr | TI | TI/MePoEfAr | ARM | ARM/MePoEfAr | AVR | AVR/MePoEfAr |
|---|--------------|----------|------|-------------|------|--------------|-------|--------------|
| 1 | FactRec | 39 | 84 | 2.15 | 39 | 1.00 | 76 | 1.95 |
| 2 | StringCopy | 39 | 52 | 1.33 | 52 | 1.33 | 52 | 1.33 |
| 3 | BubbleSort | 363 | 771 | 2.12 | 517 | 1.42 | 897 | 2.47 |
| 4 | SensorStruct | 55 | 90 | 1.64 | 85 | 1.55 | 115 | 2.09 |
| 5 | MatrixMul | 582 | 1420 | 2.44 | 833 | 1.43 | 1632 | 2.80 |
| 6 | FIR | 5218 | 9324 | 1.79 | 6278 | 1.20 | 24342 | 4.67 |
| | Mean | 1049 | 1957 | 1.87 | 1301 | 1.31 | 4519 | 2.37 |

Table 7.4: Total Number of Instructions Executed inside Loop

Figure 7.4 and Figure 7.5 show the total number of instructions executed, overall and inside the loop, for these benchmark programs by the four microcontrollers.

Figure 7.4: Total Number of Instructions Executed

Figure 7.5: Total Number of Instructions Executed inside Loop

MePoEfAr requires fewer instructions for an application than the other architectures, which means that fewer instructions are executed for that application. This is evident from Table 7.4 and the graph in Figure 7.5. Because of the larger instruction set of MePoEfAr, fewer instructions are required for an application. For example, loop control is a single instruction in MePoEfAr, but the other architectures require two or three instructions. On average for the given benchmarks, TI executes 87% more instructions inside the loop than MePoEfAr. For operations wider than 16 bits, TI has to perform multiple operations where MePoEfAr and ARM can, for instance, use a single instruction.
In the case of ARM, 31% more instructions are executed than on MePoEfAr. In the case of AVR, an 8-bit architecture, the requirement is even higher, and the ratio of instructions executed on AVR to MePoEfAr is about 2.37.

Although speed is not the main design consideration, we have also performed a comparison of execution time. In order to compare the architectures based on execution time, we need the number of cycles required for the execution of the benchmark programs. For our architecture we have assumed that an instruction which does not require extra operands to be fetched from memory is executed in one clock cycle. Otherwise, one extra cycle is added for each extra operand fetched from memory. For arithmetic operations, Table 7.5 gives the number of cycles assumed for integer and floating point operations for the different data types supported by MePoEfAr. 16-bit hardware is assumed for the numbers in this table. For 24-bit operations, a little more than 1.5 is assumed for the overhead. For floating point operations, extra cycles have been assumed for the pre- and post-processing involved in these calculations, such as scaling, normalization and alignment.

Table 7.5: Number of Cycles for Arithmetic Operations for Supported Data Types

| Data Type | ADD/SUB 16 | ADD/SUB 24 | ADD/SUB 32 | MUL 16 | MUL 24 | MUL 32 | DIV 16 | DIV 24 | DIV 32 |
|----------------|----|-----|----|----|-----|----|----|-----|----|
| Byte | 1 | 1.6 | 2 | 1 | 1.6 | 2 | 1 | 1.6 | 2 |
| Word/Index | 1 | 1.6 | 2 | 1 | 1.6 | 2 | 1 | 1.6 | 2 |
| Double Word | 1 | 1.6 | 2 | 2 | 2.6 | 3 | 2 | 2.6 | 3 |
| Floating Point | 4 | 4.6 | 5 | 6 | 6.6 | 7 | 8 | 8.6 | 9 |

Figure 7.6: Total Number of Execution Cycles

Table 7.6 summarizes the total number of execution cycles consumed by the six benchmarks. Figure 7.6 graphically shows the number of execution cycles required by the four architectures for the selected benchmarks.
To keep the comparison visible for all applications, the vertical axis in this graph uses a log scale, because of the large number of cycles required by AVR, especially for the FIR application.

Table 7.6: Total Number of Execution Cycles

| # | Benchmark | MePoEfAr | TI | TI/MePoEfAr | ARM | ARM/MePoEfAr | AVR | AVR/MePoEfAr |
|---|--------------|----------|-------|-------------|------|--------------|-------|--------------|
| 1 | FactRec | 70 | 241 | 3.44 | 73 | 1.04 | 285 | 4.07 |
| 2 | StringCopy | 73 | 145 | 1.99 | 103 | 1.41 | 93 | 1.27 |
| 3 | BubbleSort | 982 | 2299 | 2.34 | 728 | 0.74 | 936 | 0.95 |
| 4 | SensorStruct | 145 | 218 | 1.50 | 142 | 0.98 | 214 | 1.48 |
| 5 | MatrixMul | 1679 | 4061 | 2.42 | 1087 | 0.65 | 5733 | 3.41 |
| 6 | FIR | 8683 | 15182 | 1.75 | 9502 | 1.09 | 37138 | 4.28 |
| | Mean | 1939 | 3691 | 2.16 | 1939 | 0.95 | 7400 | 2.18 |

In order to make a fair comparison, it is assumed that all the architectures have a floating point hardware unit. AVR and ARM execute most instructions in a single cycle, but TI and MePoEfAr require multiple cycles for different instructions. It can be seen from the results in Table 7.6 and the graphs in Figure 7.6 that the number of cycles required by TI and by AVR is more than twice that of MePoEfAr. This is primarily because more instructions are required for these benchmarks, which results in more instructions being executed. In other words, for MePoEfAr, fewer instructions have to be fetched and processed. Furthermore, TI has a von Neumann architecture, so it cannot perform instruction and data memory accesses in parallel.

Figure 7.7: Instruction Memory Traffic (Cycles)

In the case of ARM, most of the instructions are executed in a single cycle, as it is specifically designed for speed. ARM has a modified Harvard architecture, so it has a single address space but two physical memories, which facilitates parallel memory accesses.
However, ARM is a load-store architecture: it needs instructions to load data from memory before operations can be performed on this data. Similarly, when the results need to be written to memory, store instructions are explicitly required. This means more instruction fetches and decodes. On average over all the benchmarks, ARM requires 5% fewer execution cycles than MePoEfAr.

Memory access consumes a considerable amount of power. Table 7.7 below summarizes the instruction memory traffic in cycles, and Figure 7.7 shows the same information graphically. The vertical axis in this graph uses a log scale.

| # | Benchmark | MePoEfAr | TI | TI/MePoEfAr | ARM | ARM/MePoEfAr | AVR | AVR/MePoEfAr |
|---|--------------|----------|-------|-------------|------|--------------|-------|--------------|
| 1 | FactRec | 42 | 125 | 2.98 | 42 | 1.00 | 177 | 4.21 |
| 2 | StringCopy | 46 | 73 | 1.59 | 58 | 1.26 | 59 | 1.28 |
| 3 | BubbleSort | 418 | 1308 | 3.13 | 523 | 1.25 | 908 | 2.17 |
| 4 | SensorStruct | 78 | 161 | 2.06 | 97 | 1.24 | 131 | 1.68 |
| 5 | MatrixMul | 685 | 2499 | 3.65 | 852 | 1.24 | 3762 | 5.49 |
| 6 | FIR | 7473 | 12436 | 1.66 | 6282 | 0.84 | 24349 | 3.26 |
| | Mean | 1457 | 2767 | 2.39 | 1309 | 1.13 | 4898 | 2.66 |

Table 7.7: Instruction Memory Traffic (Cycles)

ARM has 16- or 32-bit instructions, which implies a single instruction memory cycle to fetch an instruction. However, ARM requires more instructions on average, so the instruction memory cycles consumed by ARM are on average 13% more than for MePoEfAr, as can be seen from Table 7.7. AVR also executes most instructions in a single cycle, but it requires a higher number of instructions than MePoEfAr, which directly translates into instruction memory traffic that is higher by a factor of 2.66. In the case of TI, the number of instructions fetched from memory is large and, on top of that, most of the instructions require multiple cycles. This results in instruction memory traffic that is higher by a factor of 2.39 compared to MePoEfAr.
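The mean ratios in Table 7.7 are geometric means, following the convention stated at the start of Section 7.4. The following C sketch reproduces them from the raw cycle counts of Table 7.7 and shows why the arithmetic mean of ratios was avoided. It is a minimal sketch: the function and array names are hypothetical, and the n-th root is found by bisection purely to keep the example free of any math library dependency.

```c
#include <assert.h>
#include <stddef.h>

/* Instruction memory cycles per benchmark, copied from Table 7.7. */
const double mepoefar_cycles[6] = {42, 46, 418, 78, 685, 7473};
const double ti_cycles[6]       = {125, 73, 1308, 161, 2499, 12436};
const double arm_cycles[6]      = {42, 58, 523, 97, 852, 6282};
const double avr_cycles[6]      = {177, 59, 908, 131, 3762, 24349};

/* n-th root of x > 0 by bisection on [0, max(x, 1)]. */
static double nth_root(double x, size_t n)
{
    double lo = 0.0;
    double hi = (x > 1.0) ? x : 1.0;
    for (int iter = 0; iter < 200; iter++) {
        double mid = 0.5 * (lo + hi);
        double p = 1.0;
        for (size_t k = 0; k < n; k++)
            p *= mid;               /* p = mid^n */
        if (p < x)
            lo = mid;
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}

/* Geometric mean of the element-wise ratios num[i] / den[i]. */
double geometric_mean_ratio(const double *num, const double *den, size_t n)
{
    assert(n > 0);
    double product = 1.0;
    for (size_t i = 0; i < n; i++) {
        assert(num[i] > 0.0 && den[i] > 0.0);
        product *= num[i] / den[i];
    }
    return nth_root(product, n);
}

/* Arithmetic mean of the same ratios, for comparison only. */
double arithmetic_mean_ratio(const double *num, const double *den, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += num[i] / den[i];
    return sum / (double)n;
}
```

With the geometric mean, swapping the reference architecture simply inverts the result, so `geometric_mean_ratio(ti_cycles, mepoefar_cycles, 6)` multiplied by `geometric_mean_ratio(mepoefar_cycles, ti_cycles, 6)` is 1 up to rounding. The corresponding product of arithmetic means is noticeably larger than 1, which is the inconsistency that motivates the choice of mean.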
In order to compare data memory traffic, data memory cycles are also computed. For a fair comparison, MePoEfAr is assumed to have the same data bus width as the architecture it is being compared with. This means that a 16-bit path from data memory is considered when comparing with TI and AVR, whereas a 32-bit bus width is assumed for the comparison with ARM. Table 7.8 summarizes the data memory traffic in cycles. Although the same input data is processed, there is still a variation in data memory cycles for some benchmarks.

| # | Benchmark | MePoEfAr (16-bit) | MePoEfAr (32-bit) | TI | TI/MePoEfAr | ARM | ARM/MePoEfAr | AVR | AVR/MePoEfAr |
|---|--------------|------|------|------|------|------|------|-------|------|
| 1 | FactRec | 10 | 10 | 10 | 1.00 | 10 | 1.00 | 20 | 2.00 |
| 2 | StringCopy | 26 | 26 | 26 | 1.00 | 26 | 1.00 | 26 | 1.00 |
| 3 | BubbleSort | 380 | 190 | 380 | 1.00 | 190 | 1.00 | 380 | 1.00 |
| 4 | SensorStruct | 50 | 35 | 50 | 1.00 | 35 | 1.00 | 90 | 1.80 |
| 5 | MatrixMul | 334 | 167 | 424 | 1.27 | 167 | 1.00 | 908 | 2.72 |
| 6 | FIR | 1536 | 1106 | 1536 | 1.00 | 1106 | 1.00 | 10600 | 6.90 |
| | Mean | 389 | 256 | 404 | 1.04 | 256 | 1.00 | 2004 | 2.02 |
| | Mean | 388 | 255 | 403 | 1.04 | 256 | 1.03 | 2004 | 2.09 |

Table 7.8: Data Memory Traffic (Cycles)

An interesting point worth mentioning is that, although both MePoEfAr and TI can access 16 bits of data in a single cycle, in the case of *MatrixMul* TI requires 27% more data memory cycles than MePoEfAr. This is because one element of the matrix is needed twice, as the multiplier is 16 bits wide. Furthermore, the inner loop needs registers to calculate the addresses of matrix elements as well as for the actual multiplication of elements. Programs are optimized considering instruction memory as first goal.
So if we place this data in registers once and reuse it for later operations, the data memory cycles become the same, but the program memory size and the number of instructions executed are adversely affected. In the case of MePoEfAr, however, the availability of a large number of registers of different sizes facilitates the storage of intermediate results in registers, and operations are possible on these registers. This results in reduced data memory access even for complex applications.

For AVR, in the case of the *StringCopy* benchmark, the data memory cycles are the same as required by the other architectures, but for the other benchmarks it requires far more data memory cycles. Its registers are 8 bits wide, so multiple registers are required for each operation, which limits the amount of data that can be kept in registers. Especially in the case of the *FIR* application, data needs to be stored back to memory because no registers are available, and fetched back later (spills), which causes considerable data memory traffic.

#### 7.5 Summary

In order to give an overall impression of the architectures under discussion, all the results discussed above are summarized in Table 7.9. These numbers are ratios, and the mean of all the ratios is also given at the bottom of the table to show the overall comparison.

Table 7.9: Performance Comparison Summary

| # | Benchmark | TI/MePoEfAr | ARM/MePoEfAr | AVR/MePoEfAr |
|---|-------------------------------|-------------|--------------|--------------|
| 1 | No of Instructions | 1.52 | 1.19 | 2.51 |
| 2 | Program Size | 1.95 | 1.37 | 2.15 |
| 3 | Instructions Executed | 1.83 | 1.29 | 2.33 |
| 4 | Instructions Executed in Loop | 1.87 | 1.31 | 2.37 |
| 5 | Execution Cycles | 2.16 | 0.95 | 2.18 |
| 6 | Instruction Memory Traffic | 2.39 | 1.13 | 2.66 |
| 7 | Data Memory Traffic | 1.04 | 1.03 | 2.09 |
| | Mean | 1.77 | 1.17 | 2.32 |

In summary, it can be concluded from the above table that the MePoEfAr architecture has better performance in all respects as compared to TI architecture.
Overall, the MePoEfAr architecture performs 77% better than the TI microcontroller. MePoEfAr is better than ARM in most cases, while being about the same for data memory cycles. ARM wins on execution cycles. This gain comes from its instructions that calculate an array address in a single cycle, using the shifter, within a single instruction. This can be seen from the execution cycle results of the bubble sort and matrix multiplication benchmarks. As a 32-bit architecture, ARM has the encoding space to represent this type of instruction. Overall, MePoEfAr outperforms ARM by 17%. There is a considerable difference between the performance results of AVR and those of the other architectures in all respects. On average for the given benchmarks, MePoEfAr performs better than AVR by a factor of 2.32.

# Conclusion and Future Work

This chapter starts with a brief summary of the whole thesis in Section 8.1. We highlight the conclusions of this work in Section 8.2. Finally, Section 8.3 provides some recommendations for future work.

#### 8.1 Summary

This section gives a brief summary of the work presented in this thesis. We provide a short description of each chapter as follows:

Chapter 1 provided an introduction to the work presented in this thesis. It discussed the key motivation behind the thesis and listed the main contributions of this work.

Chapter 2 presented an overview of microcontroller architectures and their classification based on several criteria. The three well-known embedded microcontroller architectures used for the performance comparison were discussed in detail.

Chapter 3 discussed the static profiling. The statistics of high level language constructs obtained from the developed profiler were provided. These statistics show the frequency distributions of the C language constructs in four benchmark programs.

Chapter 4 provided the details of the MePoEfAr architecture.
It started with the overall architecture properties, type of architecture, bit and byte numbering, data types, instruction classification and register sets. Global architecture issues such as the layout of the program status word and the memory map were provided. The various instruction formats in the MePoEfAr architecture were detailed with examples. Furthermore, the operation sets supported by these instruction formats were tabulated, with a description of how these operations affect the condition codes. A brief description of exceptional conditions such as traps and interrupt vectors was provided, followed by a discussion of the extension of program and data memory. The encoding cost and feasibility of the MePoEfAr architecture were summarized in order to show the availability of encoding space in the architecture for future extensions.

Chapter 5 gave the implementation details of the MePoEfAr assembler. It covered the intermediate steps involved in translating an assembly program to machine code. The instruction bit assignments used to represent assembly instructions as bit patterns were provided.

Chapter 6 discussed the MePoEfAr interpreter, which has been used for the simulation of the MePoEfAr microcontroller. It discussed the two main parts of the interpreter. The first part loads the machine code into memory and performs some bookkeeping for debugging information. The second part is the microcontroller model, which fetches the instructions from memory, decodes and executes them.

Chapter 7 covered the details of the assembler level benchmarking which we performed to evaluate the performance of the MePoEfAr architecture. Furthermore, it provided the results of the static and dynamic comparison of performance with three well-known embedded microcontrollers.

This chapter, Chapter 8, summarizes the thesis. Conclusions drawn from our work are provided, followed by some recommendations for future work.
#### 8.2 Conclusions

The conclusions drawn from the work presented in this thesis are enumerated below. For the sake of brevity, in the rest of the chapter, TI, ARM and AVR refer to the Texas Instruments MSP430G2231, ARM Cortex-M3 LPC1342 and Atmel AVR AT90S851 microcontrollers, respectively.

- The statistics presented in this thesis show that the frequency distribution of C language constructs (statements, operations, operands, etc.) is not uniform over the complete range. Furthermore, a single architecture cannot satisfy the demands of all applications, so intelligent trade-offs must be made in favor of the most frequent constructs. The conclusions drawn from the static analysis of the benchmark programs are:
- 1. Assignments are the most frequent statements. About 60% of the statements are assignments.
- 2. 73% of the assignments have a simple variable on the left-hand side.
- 3. Most of the assignments have a simple expression on the right-hand side. About 55% of the assignments have either a constant or a simple variable on the right-hand side.
- 4. Arithmetic operations are the most frequent operations. Among arithmetic operations, addition and multiplication are the most frequent.
- 5. After relational operations, type conversion operations are also frequent. Most of the conversions are between 16-bit and 32-bit integer data types.
- 6. Based on data type, 32-bit operations are the most frequent operations.
- 7. Small constants are the most frequent ones. 4-bit constants have a cumulative frequency of about 87%. The most frequent constants are 0, 1, 2, 4 and 8.
- 8. Among local variables, 32-bit integers, 16-bit integers, pointers and 8-bit integers have frequencies of about 61%, 13%, 12% and 5%, respectively.
- The results of the assembler level benchmarking show that the MePoEfAr architecture is 77% and 17% better than TI and ARM, respectively. Furthermore, MePoEfAr outperforms AVR by a factor of 2.31.
The following conclusions can be drawn from the detailed analysis of these results:
- 1. The number of instructions an architecture requires to implement some functionality is a measure of the capability of its instruction set. On average, for the given benchmarks, ARM and TI require 19% and 52% more instructions than MePoEfAr, respectively. Furthermore, the instruction ratio of AVR to MePoEfAr is 2.51. The efficiency of MePoEfAr compared to the other architectures has the following reasons:
- (a) Operations normally involve 16-bit and 32-bit data types. AVR and MSP430 need multiple instructions for these operations, whereas MePoEfAr has 8-bit, 16-bit and 32-bit data types.
- (b) MePoEfAr offers a larger number of operations than the other architectures, so these operations do not have to be emulated by extra instructions.
- (c) ARM is a load-store architecture, which requires instructions to load data into registers, perform the operations and later store the results back to memory.
- (d) AVR and TI have auto-increment and auto-decrement addressing modes, which reduce the number of instructions needed for address computations. In MePoEfAr, however, these modes work on all the data types available in the architecture.
- 2. The memory efficiency of an instruction set architecture is compared by calculating the total number of bytes of program memory required for each benchmark application. MePoEfAr is 37%, 95% and 115% more memory efficient than ARM, TI and AVR, respectively. This memory efficiency is achieved as follows:
- (a) ARM supports the Thumb-2 instruction set, which provides 16-bit instructions in addition to the 32-bit instructions. Despite these 16-bit instructions, 32-bit instructions are still needed in these benchmarks, which increases the program memory size.
- (b) The variable-length instructions of the MePoEfAr architecture have proven to be more memory efficient.
Frequently occurring instructions are short 2-byte instructions, whereas 3-byte and 4-byte instructions are not very frequent.
- (c) MePoEfAr provides efficient support for small immediate values and short displacements. In Thumb mode, ARM supports 3-bit immediate values if two registers are specified and 8-bit immediate values if a single register operand is specified. TI has reserved two registers, R2 and R3, as constant generators to generate the frequent constants (0, 1, 2, 4 and 8). In MePoEfAr, 4-bit immediate values and 8-bit offsets are accommodated directly inside the first instruction word, with both operands specified and without reserving any registers.
- 3. The number of instructions executed for a given program gives an idea of the total power requirements and of the traffic between instruction memory and CPU. TI and ARM execute 87% and 31% more instructions than MePoEfAr, respectively. Furthermore, the ratio of instructions executed on AVR to those executed on MePoEfAr is about 2.37. This efficiency of MePoEfAr has the following reasons:
- (a) The larger instruction set of MePoEfAr results in fewer instructions for an application. For example, loop control is a single instruction in MePoEfAr, whereas the other architectures require two or three instructions.
- (b) For operations on data wider than 16 bits, TI and AVR need multiple instructions where MePoEfAr needs a single instruction.
- (c) TI, AVR and ARM require a larger number of instructions for the same program, which results in a larger number of instructions executed on these architectures.
- 4. TI and AVR require more than twice as many execution cycles as MePoEfAr. This is because these benchmarks require more instructions on TI and AVR, which results in more instructions being executed. In other words, MePoEfAr has to fetch and process fewer instructions. Furthermore, TI has a von Neumann architecture, so it cannot perform instruction and data memory accesses in parallel.
- 5. Instruction memory accesses consume power. ARM required 13% more instruction memory cycles than MePoEfAr. TI and AVR required more instruction memory cycles by a factor of 2.39 and 2.66, respectively. The larger number of instructions required by these architectures results in higher instruction memory traffic.
- 6. For data memory traffic, TI, ARM and MePoEfAr require almost the same number of cycles. In AVR, registers are 8 bits wide, so multiple registers are required for an operation and only a limited amount of data can be kept in registers. When no registers are available, data must be spilled to memory and fetched back later, which increases the data memory traffic.

#### 8.3 Future Work

Some recommendations for future work are listed below:

- 1. In this work, we have performed static profiling to obtain the frequency distributions of various C language constructs. Static results are important for the design of a memory efficient architecture. In contrast to static analysis, dynamic profiling is performed during program execution. The results of dynamic profiling are also important, as they point out the most frequently executed constructs in the benchmarks. Hence, dynamic profiling is needed; its results can be given second priority in making design decisions and used to fine-tune the architecture.
- 2. The encoding cost of the 8-bit instructions is 216 (in units of 1024), which is 21% of the total encoding space available. The results of the static analysis show that the 8-bit data type is not very frequent. Hence, further analysis is required to decide whether support for this data type can be removed, so that this encoding space can be used to make the architecture more efficient.
- 3. The variable-length instructions used in the MePoEfAr architecture proved to be more memory efficient. This efficiency comes at the cost of the complex decoding logic that these instructions require.
Further work is required to synthesize the decoding logic and obtain figures for the area overhead it introduces.
- 4. The interpretive simulator that we have developed does not incorporate information about the number of cycles consumed by individual instructions or the overall execution cycles of the complete benchmark. Hence, there is a need to add execution cycle information to make it a cycle-accurate simulator, or to perform an RTL simulation (VHDL simulation). This will help in running larger benchmarks and will save the time spent on performing the calculations for the comparison manually.
- 5. Another important property of an instruction set architecture is its support for compilers. Hence, a high-level language compiler is required to further ease the benchmarking process. Furthermore, the results of the compiler writing process can provide further important feedback for the architecture.

## Bibliography

- [1] http://flex.sourceforge.net/.
- [2] http://focus.ti.com/.
- [3] http://focus.ti.com/docs/prod/folders/print/msp430g2231.html.
- [4] http://focus.ti.com/general/docs/lit/getliterature.tsp?literatureNumber=slau144h&fileType=pdf.
- [5] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0337i/index.html.
- [6] http://www.ace.nl/compiler/cosy.html.
- [7] http://www.coremark.org/home.php.
- [8] http://www.design-reuse.com/articles/21745/interpretive-instruction-set-simulator.html.
- [9] http://www.eembc.org/benchmark/automotive\_sl.php.
- [10] http://www.eembc.org/home.php.
- [11] http://www.gnu.org/software/bison/.
- [12] http://www.nxp.com/.
- [13] http://www.nxp.com/#/pip/pip=[pip=LPC1311\_13\_42\_43,pfp=71567]/pp=[t=pip,i=LPC1311\_13\_42\_43].
- [14] www.atmel.com.
- [15] www.atmel.com/dyn/resources/prod\_documents/DOC0841.pdf.
- [16] Dhrystone benchmark: Rationale for version 2 and measurement rules, SIGPLAN Notices 23 (1988), 49–62.
- [17] Microchip industry research, 2011.
- [18] John Backus, Can programming be liberated from the von Neumann style?: a functional style and its algebra of programs, Commun. ACM 21 (1978), 613–641.
- [19] R. Bannatyne and G. Viot, Introduction to microcontrollers. I, Wescon/98, Sep. 1998, pp. 350–360.
- [20] K.L.M. Bertels, S. A. Ostadzadeh, and R. J. Meeuws, Advanced profiling of applications for heterogeneous multi-core platforms, July 2011, p. 13.
- [21] Koen Bertels, Stamatis Vassiliadis, Elena Moscu Panainte, Yana Yankova, Carlo Galuzzi, Ricardo Chaves, and Georgi Kuzmanov, Developing applications for polymorphic processors: The Delft workbench, 2006.
- [22] Robert F. Cmelik and David Keppel, Shade: A fast instruction set simulator for execution profiling, Tech. report, Mountain View, CA, USA, 1993.
- [23] H. J. Curnow and B. A. Wichmann, A synthetic benchmark, Computer Journal 19 (1976), 1.
- [24] John L. Hennessy and David A. Patterson, Computer architecture: A quantitative approach, 3rd edition, Morgan Kaufmann, May 2002.
- [25] K.-D. Kramer, T. Stolze, and T. Banse, Benchmarks to find the optimal microcontroller-architecture, Computer Science and Information Engineering, 2009 WRI World Congress on, vol. 2, Mar. 31–Apr. 2, 2009, pp. 102–105.
- [26] R. Leupers, J. Elste, and B. Landwehr, Generation of interpretive and compiled instruction set simulators, Design Automation Conference, 1999. Proceedings of the ASP-DAC '99. Asia and South Pacific, Jan. 1999, pp. 339–342, vol. 1.
- [27] Mingsong Lv, Qingxu Deng, Nan Guan, Yaming Xie, and Ge Yu, ARMISS: An instruction set simulator for the ARM architecture, Embedded Software and Systems, 2008. ICESS '08. International Conference on, July 2008, pp. 548–555.
- [28] Christopher Mills, Stanley C. Ahalt, and Jim Fowler, Compiled instruction set simulation, 1991.
- [29] Achim Nohl, Gunnar Braun, Oliver Schliebusch, Rainer Leupers, Heinrich Meyr, and Andreas Hoffmann, A universal technique for fast and flexible instruction-set architecture simulation, Proceedings of the 39th annual Design Automation Conference (New York, NY, USA), DAC '02, ACM, 2002, pp. 22–27.
- [30] David A. Patterson and David R. Ditzel, The case for the reduced instruction set computer, SIGARCH Comput. Archit. News 8 (1980), 25–33.
- [31] David A. Patterson and Carlo H. Sequin, RISC I: A reduced instruction set VLSI computer, 25 years of the international symposia on Computer architecture (selected papers) (New York, NY, USA), ISCA '98, ACM, 1998, pp. 216–230.
- [32] V. Zivojnovic, S. Tjiang, and H. Meyr, Compiled simulation of programmable DSP architectures, IEEE Workshop on VLSI Signal Processing (1995).
- [33] John von Neumann, First draft of a report on the EDVAC, Charles Babbage Institute Reprint Series for the History of Computing, MIT Press, vol. 12, 1987.
- [34] Reinhold P. Weicker, An overview of common benchmarks, Computer 23 (1990), 65–75.

# Lexical Analyzer Generator Code

```
%option nounput

%{
#include "globals.h"
#include "grammar.tab.h" /* import token definitions from Yacc */
#include <stdio.h>
#include <string.h>

void cstyle_comment(); /* Multiline C-Style comment */
int lineno = 1;
%}
digit       [0-9]
hdigit      [a-fA-F0-9]
odigit      [0-7]
letter      [a-zA-Z]
newline     \n
whitespace  [ \t]+
hash        "#"
atrate      "@"
valid_char  {letter}|"_"|"."|{digit}
symbol      ({letter}|"_"|"."){valid_char}*
label       {symbol}":"
comment     ";".*\n
hnumber     0[xX]{hdigit}+
bnumber     0[bB][01]+
onumber     0[oO]{odigit}+
dnumber     {digit}+
fnumber     {digit}+\.{digit}+
breg        B{digit}+
wreg        W{digit}+
dreg        D{digit}+
freg        F{digit}+
xreg        X{digit}+
norm_char   '(\\.|\\\n|[^\\])[']?
character   {norm_char}
string      \"(\\.|\\\n|[^\n\\"])*\"
%%
"/*"          {cstyle_comment();}
{comment}     {lineno++; return NEWLINE;}
{breg}        {return BREGISTER;}
{wreg}        {return WREGISTER;}
{dreg}        {return DREGISTER;}
{freg}        {return FREGISTER;}
{xreg}        {return XREGISTER;}
{symbol}      {return SYMBOL;}
{label}       {return LABEL;}
{bnumber}     {return BNUMBER;}
{onumber}     {return ONUMBER;}
{dnumber}     {return DNUMBER;}
{hnumber}     {return HNUMBER;}
{fnumber}     {return FNUMBER;}
{character}   {return CHARACTER;}
{string}      {return STRING;}
{hash}        {return HASH;}
{atrate}      {return ATRATE;}
","           {return COMMA;}
";"           {return SEMI_COLON;}
":"           {return COLON;}
"+"           {return PLUS;}
"-"           {return MINUS;}
"*"           {return MULTIPLY;}
"/"           {return DIVIDE;}
"("           {return LBRACK;}
")"           {return RBRACK;}
{newline}     {lineno++; return NEWLINE;}
{whitespace}  {/* ignore whitespaces */}
%%
void cstyle_comment()
{
    char c;
    int done = FALSE;
    char * text = yytext + 2;
    int i = 0;

    while (!done)
    {
        /* consume everything up to the next '*' */
        while ((c = input()) != '*')
        {
            if (c == EOF)
                return;
            if (c == '\n')
                lineno++;
            text[i++] = c;
        }
        /* consume a run of '*' characters */
        while ((c = input()) == '*')
        {
            if (c == EOF)
                return;
            text[i++] = c;
        }
        text[i++] = c;
        if (c == '\n')
            lineno++;
        if (c == '/')
        {
            /* a '*' followed by '/' ends the comment */
            done = TRUE;
            text[i] = '\0';
        }
    }
}
```

Listing A.1: Flex Code for the Lexical Analyzer Generator for MePoEfAr Assembler

### Parser Generator Code

```
%{
#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#include "ast.h"
#include "globals.h"

#define YYSTYPE TreeNode *

TreeNode * AST;
extern int lineno;

TreeNode *tmpNodes[NUM_OF_CHILDREN] = {NULL}; // Temporary nodes
int tni = 0; // Temporary nodes index
char tempStr[20] = "-";
// Token string from Scanner
extern char * yytext;
extern int yylex();
void yyerror( const char * msg );

#define ADD_TO_LIST(ss, s1, s2)         \
{   YYSTYPE t = s1;                     \
    if (t == NULL)                      \
        ss = s2;                        \
    else                                \
    {                                   \
        while ( t->next != NULL )       \
            t = t->next;                \
        t->next = s2;                   \
        ss = s1;                        \
    }                                   \
}
%}

%token BREGISTER WREGISTER DREGISTER FREGISTER XREGISTER
%token HASH ATRATE SYMBOL LABEL BNUMBER ONUMBER DNUMBER HNUMBER FNUMBER CHARACTER STRING
%token COMMA SEMI_COLON PLUS MINUS MULTIPLY DIVIDE
%token LBRACK RBRACK NEWLINE COLON

%left PLUS MINUS
%left MULTIPLY DIVIDE

%%
program     : stmt_seq { AST = $1; }
            ;

stmt_seq    : stmt_seq stmt { ADD_TO_LIST( $$, $1, $2 ) }
            | stmt { $$ = $1; }
            ;

stmt        : statement NEWLINE { $$ = $1; }
            | statement SEMI_COLON { $$ = $1; }
            | NEWLINE
            | SEMI_COLON
            ;

statement   : labels operation
              {
                  YYSTYPE t = $1;
                  while ( t->child[0] != NULL )
                      t = t->child[0];
                  $1->next = $2;
                  $$ = $1;
              }
            | operation { $$ = $1; }
            | labels { $$ = $1; }
            ;

labels      : labels label
              {
                  YYSTYPE t = $1;
                  if (t == NULL)
                      $$ = $2;
                  else
                  {
                      while (t->child[0] != NULL)
                          t = t->child[0];
                      t->child[0] = $2;
                      $$ = $1;
                  }
              }
            | label { $$ = $1; }
            ;

operation   : instruction { $$ = $1; }
            | label num_const { ($$ = $1)->nodeType = NT_DIRECTIVE; ($$->child[0]) = ($2); }
            ;

instruction : symbol operands
              {
                  int i;

                  ($$ = $1)->nodeType = NT_INSTRUCTION;
                  ($$)->SG = SG_NA;        // SG not applicable or not yet specified
                  ($$)->op = OP_INVALID;   // not valid or not yet specified
                  ($$)->instrSize = -1;    // not yet specified so represented by -1
                  for ( i = 0; i < tni && i < NUM_OF_CHILDREN; i++ )
                      $$->child[i] = tmpNodes[i];
                  tni = 0;
              }
            | symbol
              {
                  ($$ = $1)->nodeType = NT_INSTRUCTION;
                  ($$)->SG = SG_NA;        // SG not applicable or not yet specified
                  ($$)->op = OP_INVALID;
                  ($$)->instrSize = -1;
              }
            ;

operands    : operands COMMA operand
              {
                  if ( tni < NUM_OF_CHILDREN )
                      tmpNodes[tni++] = $3;
              }
            | operand { tmpNodes[tni++] = $1; }
            ;

operand     : reg { $$ = $1; }
            | imm { $$ = $1; }
            | mem_ref { $$ = $1; }
            | symbol { $$ = $1; }
            ;

mem_ref     : DX_addr { $$ = $1; }
            | Ptr_addr { $$ = $1; }
            | SMptr_addr { $$ = $1; }
            | Abs_addr { $$ = $1; }
            ;

DX_addr     : num_const LBRACK reg RBRACK   /* D(X) Addressing */
              {
                  $$ = newTreeNode( NT_DX_ADDR, NULL );
                  $$->child[0] = $1;
                  $$->child[1] = $3;
              }
            | symbol LBRACK reg RBRACK      /* D(X) Addressing */
              {
                  $$ = newTreeNode( NT_DX_ADDR, NULL );
                  $$->child[0] = $1;
                  $$->child[1] = $3;
              }
            ;

Ptr_addr    : ATRATE reg                    /* Pointer Addressing */
              {
                  ($$ = $2)->nodeType = NT_PTR_ADDR;
              }
            ;

SMptr_addr  : LBRACK reg RBRACK PLUS        /* Self Modifying ptr Addressing */
                                            /* Post increment (X)+ */
              {
                  ($$ = $2)->nodeType = NT_SMPTR_POST_INC_ADDR;
              }
            | MINUS LBRACK reg RBRACK       /* Pre decrement -(X) */
              {
                  ($$ = $3)->nodeType = NT_SMPTR_PRE_DEC_ADDR;
              }
            ;

Abs_addr    : ATRATE num_const { $$ = $2; ($$->nodeType) = NT_ABSADDRIMM; }
            | ATRATE symbol { $$ = $2; ($$->nodeType) = NT_ABSADDRID; }
            ;

imm         : HASH num_const { $$ = $2; ($$->nodeType) = NT_IMM; }
            | HASH symbol { $$ = $2; ($$->nodeType) = NT_ID; }
            ;

symbol      : SYMBOL { $$ = newTreeNode( NT_ID, yytext ); }
            ;

label       : LABEL { $$ = newTreeNode( NT_LABEL, yytext ); }
            ;

reg         : BREGISTER { $$ = newTreeNode( NT_BREGISTER, yytext ); /* Changed yytext+1 to yytext */ }
            | WREGISTER { $$ = newTreeNode( NT_WREGISTER, yytext ); /* Changed yytext+1 to yytext */ }
            | DREGISTER { $$ = newTreeNode( NT_DREGISTER, yytext ); /* Changed yytext+1 to yytext */ }
            | FREGISTER { $$ = newTreeNode( NT_FREGISTER, yytext ); /* Changed yytext+1 to yytext */ }
            | XREGISTER { $$ = newTreeNode( NT_XREGISTER, yytext ); /* Changed yytext+1 to yytext */ }
            ;

num_const   : BNUMBER { $$ = newTreeNode( NT_CONSTANT, yytext ); ($$->value) = other2dec($$->name, 2); }
            | MINUS BNUMBER
              {
                  $$ = newTreeNode( NT_CONSTANT, yytext );
                  ($$->value) = -1*other2dec($$->name, 2);
                  strcat(tempStr, $$->name);
                  tempStr[strlen(tempStr)] = '\0';
                  strcpy($$->name, tempStr);
                  strcpy(tempStr, "-");
              }
            | ONUMBER { $$ = newTreeNode( NT_CONSTANT, yytext ); ($$->value) = other2dec($$->name, 8); }
            | MINUS ONUMBER
              {
                  $$ = newTreeNode( NT_CONSTANT, yytext );
                  ($$->value) = -1*other2dec($$->name, 8);
                  strcat(tempStr, $$->name);
                  tempStr[strlen(tempStr)] = '\0';
                  strcpy($$->name, tempStr);
                  strcpy(tempStr, "-");
              }
            | DNUMBER { $$ = newTreeNode( NT_CONSTANT, yytext ); ($$->value) = atoi($$->name); }
            | MINUS DNUMBER
              {
                  $$ = newTreeNode( NT_CONSTANT, yytext );
                  ($$->value) = -1*atoi($$->name);
                  strcat(tempStr, $$->name);
                  tempStr[strlen(tempStr)] = '\0';
                  strcpy($$->name, tempStr);
                  strcpy(tempStr, "-");
              }
            | HNUMBER { $$ = newTreeNode( NT_CONSTANT, yytext ); ($$->value) = other2dec($$->name, 16); }
            | MINUS HNUMBER
              {
                  $$ = newTreeNode( NT_CONSTANT, yytext );
                  ($$->value) = -1*other2dec($$->name, 16);
                  strcat(tempStr, $$->name);
                  tempStr[strlen(tempStr)] = '\0';
                  strcpy($$->name, tempStr);
                  strcpy(tempStr, "-");
              }
            | FNUMBER { $$ = newTreeNode( NT_CONSTANT, yytext ); ($$->value) = strtofp($$->name); }
            | MINUS FNUMBER
              {
                  $$ = newTreeNode( NT_CONSTANT, yytext );
                  ($$->value) = -1*strtofp($$->name);
                  strcat(tempStr, $$->name);
                  tempStr[strlen(tempStr)] = '\0';
                  strcpy($$->name, tempStr);
                  strcpy(tempStr, "-");
              }

            ;
%%

void yyerror( const char * msg )
{
    printf("%s on line %d\n", msg, lineno );
    exit(1);
    // | CHARACTER { $$ = newTreeNode( NT_CONSTANT, yytext ); }
    /*
    directive : num_const { $$ = newTreeNode( NT_DIRECTIVE, yytext); }


    */

}
```

Listing B.1: Bison Code for the Parser Generator for MePoEfAr Assembler

# Assembly Codes for the Selected Benchmarks

#### C.1 MePoEfAr Assembly Codes

Listing C.1: MePoEfAr Assembly Code for Benchmark 1: Recursive Factorial

Listing C.2: MePoEfAr Assembly Code for Benchmark 2: String Copy

```
; MePoEfAr Bubble Sort Benchmark Program
; In this program, an array of 10 elements is initialized
; in the main subroutine. Base address of this array is passed to
; BSort subroutine to sort the numbers in descending order.
;
; -> Base Address of Array -> Index for number Array -> i.e.
loop counter -> Data with which array will be initialized -> is used to index Array START X4 13 15 D2 17 ; j = # elements ; X4 = base address of array ; data with wihich array will be 19 Main: MOVd #10, D1 #START, X4 20 21 MOVd #0,D2 23 24 25 L1: MOVd D2 , (X4)+ ; Array[i] = D2 ; D2+1 26 27 A D D d #1,D2 ; decrement element counter and ; branch to next element if not done DECBRn D1 , L1 28 29 BRS BSort ; call the BSort routine 31 32 End: HALT ; Halt at the end 33 Sort Subroutine Actual subroutine used to implement sorting Algorithm 36 37 START -> Base Address of Array X4 -> used to index the Array W0 -> i i.e. loop counter W1 -> j i.e. loop counter 38 39 \frac{40}{41} 42 43 44 46 \frac{48}{49} L2: (X4)+,D2 ; D2 = arr[j] MOVd \frac{50}{51} MOVd @X4 , D3 ; D3 = arr[j+1] 52 53 54 55 56 D2 , D3 ; compare D2 with D3 ; if (D3 < D2) then no swaping required CPAd BRlt NoSwap : otherw ise swap here MOVd 58 MOVd 60 NoSwap: S1BR W1, L2 ;loop if j>0 62 SIBR WO.L1 :loop till i>0 ; return to caller (sorting done) 64 RTS ``` Listing C.3: MePoEfAr Assembly Code for Benchmark 3: Bubble Sort ``` MePoEfAr Assembly Program implementing a structure for sensor values. Structure contains 3 elements: 1 char byte Flag indicating if sensor has been calibrated or not. 1 short int containing the offset to be adjusted 1 long int containing the actual sensor value 3 5 An array of 5 sensors is declared. InitSensors() will initialize these values to some numbers. CalibrateSensors() will subtract the offset from the value of the sensors and set the Flag. 
main() will call these two functions to initialize and calibrate 10 12 13 ain: BRS Init ; call to Init subroutine BRS Calib ; call to Calib subroutine Main: 20 21 22 End: RTS 24 26 base address of first struct member index struct array loop counter % \left( 1\right) =\left( 1\right) \left( X4 B0 28 30 W0 -> Data with which Value will be initialized 32 33 34 Init: 36 MOVx \#START, X4 ; X4 = Starting address of struct 38 MOVw : i = 0 MOVb ;loop counter 40 \begin{array}{l} \#0\,,\ (\,\text{X4}\,)+\\ \text{WO}\,,\ (\,\text{X4}\,)+\\ \text{WO}\,,\ \text{DO}\\ \text{DO}\,,(\,\text{X4}\,)+ \end{array} ; sensors[i].Flag = 0; sensors[i].Offset = i \frac{41}{42} L0: MOVb MOVw ADDdws 43 ; sensors [i]. Value = i+3 44 MOVd 45 46 ADDb #1, WO ; i++ 47 48 SIBR BO , LO ;loop back 5 times 49 50 RTS return to caller; 53 54 55 START Starting address of struct array -> -> X4 D0 index struct array sensors[i].Value sensors[i].Offset 56 57 59 B0 -> loop counter #START , X4 ; X4 = Starting address of struct #5, B0 ; loop counter \frac{61}{62} Calib: MOVx ; sensors[i].Flag = 1 ;W0 = sensors[i].Offset ;D0 = sensors[i].Offset ; sensors[i].Value = sensors[i].Value - sensors[i].Offset 63 \begin{array}{c} \#1,\ (\,\mathrm{X4}\,)+\\ (\,\mathrm{X4}\,)+,\ \mathrm{W0}\\ \mathrm{WO}\,,\ \mathrm{DO}\\ \mathrm{DO}\,,(\,\mathrm{X4}\,)+ \end{array} L1: 65 MOVw 66 67 MOVwds SUBd ;loop back 5 times ;return to caller 69 SIBR BO , LO ``` Listing C.4: MePoEfAr Assembly Code for Benchmark 4: Sensor Structure ``` Base Address of matrix m1 -> M1 Base Address of matrix m2 -> M2 12 13 14 Base Address of matrix m3 \rightarrow M3 X4 is used to index the elements of matrix m1 X1 is used to index the elements of matrix m2 X5 is used to index the elements of matrix m3 16 17 B6 -> no of rows B7 -> no of columns B1,B4,B5 -> loop 18 B1\,,B4\,,B5 -> loop counters Note: Arrays are stored in memory in Row Major Order 20 Main: MOVd #nRows1, B6; B6 = no of rows of ml MOVd #nCols1, B7; B7 = no of cols of ml MOVZ #M1, X4; base 
address of ml BRS INIT; call initialize subroutine 22 24 26 27 28 ; for m1 #nRows2, B6 #nCols2, B7 #M2, X4 INIT 29 MOVd ; B6 = no of rows of m2 ; B7 = no of cols of m2 ; base address of m2 ; call initialize subroutine 30 MOVd MOVx 32 BRS 33 ; for m2 34 35 36 ; now perform multiplication ; Initialize base address of m1, m2, m3 {\tt INx} {\tt X4} , {\tt \#3} , {\tt \#M1} , {\tt \#M2} , {\tt \#M3} 37 38 INx 39 MOVd #5, B5 ; nCols2 40 ; nRows1 ; nCols1 (or nRows2 is same) ; D3 = 0 (accumulator for one element) ; D2 = m1[m][n] ; D2 = m1[m][n] * m2[n][p] ; D3 = D3 + m1[m][n] * m2[n][p] ; X1 += nCols * size ; it will point to first element of next row ; repeat 4 times 41 42 L3: L2: MOVd MOVd #3, B4 #4, B1 43 MOVd #0, D3 L1: MOVd (X4)+, D2 @X1, D2 D2, D3 45 46 MULd ADDd 47 48 ADDx #20, X1 49 SIBB B1 , L1 50 \begin{array}{c} {\tt D3}\;,\;\;(\,{\tt X5}\,)+\\ \#56\;,\;\;{\tt X1} \end{array} \frac{51}{52} MOVd ; m3[m][p] = D3 XI now points to first element of next column ; repeat this 5 times SUBx 53 54 55 56 S1BR B4 , L2 #M2 , X1 B5 , L3 ; X1 now points to base address of m2 ; repeat this 3 times MOVx SIBR 57 58 59 ; multiplication done, stop 61 D2 -> row number D3 -> value to be assigned X4 -> array index 63 65 #0, D3 D3, (X4)+ #1, D3 67 L1: MOVd Matrix[r][c] = D3 69 ADDd :D3++ 70 71 ; repeat this for nCols 72 ADDd #1, D2 73 74 75 ; D3 = D2 = row number D2 , D3 MOVd SIBR B6 , L1 ; repeat this for nRows ; return to caller ``` Listing C.5: MePoEfAr Assembly Code for Benchmark 5: Matrix Multiplication ``` 15 16 COEFF -> base address of COEFF array INPUT -> base address of INPUT array OUTPUT -> base address of OUTPUT array X4 -> index of COEFF array X2 -> index of INPUT array X6 -> index of OUTPUT array 17 18 19 20 21 F1 -> sum F0, F2, F3, F5 -> floating point temporary results D0, D1, D2, D3 are used for integer temporary calculations B1, B2 -> loop counters 23 25 29 ; ini MOVx Main: 31 MOVb 33 MOVf L1: ; F2 = 1/(i+5) 35 DIVf F0, F2 36 37 MOVf \texttt{F2} \;,\;\; (\; 
\texttt{X4}\;) + \qquad ; \texttt{COEFF} [\; \mathrm{i} \;] \; = \; \mathrm{F2} 38 39 ADDf #1. F0 :update value of i+5 in F0 40 41 SIBR BO , L1 ; loop back for all coeffecients 42 ; initialize INPUT array MOVx #INPUT, X4; base address of INPUT MOVb #68, B0; no of INPUT samples MOVd #2, D2; value to be stored 43 \begin{array}{c} 44 \\ 45 \end{array} MOVx MOVb 46 48 L2: MOVw W2, (X4)+; INPUT[i] = 2 50 51 SIBR BO , L2 ; loop back for all INPUT samples 52 53 54 ; Perform FIR Calculations #COEFF, X4 #INPUT, X2 #OUTPUT, X6 MOVx MOVx MOVx 56 57 #36, B1 MOVb ; y = 36 58 59 L4: MOVb #8, B2 ; sum = 0 60 MOVf #0, F1 ; X3 = 16 62 L3: MOVx #16, X3 \begin{array}{l} ; \text{X3} = \text{16} \\ ; \text{X3} = \text{16} - \text{i} \\ ; \text{X3} = \text{y+16} - \text{i} \\ ; \text{X3} = (\text{y+16} - \text{i}) * 2 \\ ; \text{X3} = \text{start} + (\text{y+16} - \text{i}) * 2 \\ ; \text{W3} = \text{INPUT}[\text{y+16} - \text{i}] \end{array} B2 , X3 B1 , X3 #2, X3 SUBbxs 64 ADDbxs MULx X2, X3 @X3, W3 66 ADDx MOVw 68 69 70 MOVbxs B1 , X3 ; X3 = y ;X3 = y+i ;X3 = (y+i) * 2 ;X3 = start + (y+i) * 2 ;W3 = INPUT[y+16-i]+ INPUT[y+i] B2, X3 #2, X3 X2, X3 QX3, W3 ADDbxs MULx ADDx 73 74 ADDd 75 76 77 78 MOVwfs W3, F3 ; convert to float \texttt{MULdfs} \quad \texttt{(X4)+, F3} \qquad ; \texttt{F3} = \texttt{COEFF[i]} \ * \ (\texttt{INPUT[y+16-i]+ INPUT[y+i])} 79 ADDf F3 , F1 ; F1 = sum + F3 80 81 82 SIBR B2 , L3 ; loop back 8 times 83 84 MOVf 32(X0),F5 ; F5 = COEFF[8] #8, X3 85 MOVx : X3 = 8 ; X3 = 0; ; X3 = y+8; ; X3 = (y+8) * size; ; X3 = start + (y+8) * size; ; W3 = INPUT[y+8] 86 ADDbxs B1, X3 #4, X3 X2, X3 @X3, W3 87 MULx ADDx 89 90 MOVw ; F5 = INPUT[y+8] * COEFF[8] ; F5 = sum + INPUT[y+8] * COEFF[8] \frac{91}{92} MULwfs 93 MOVf F5 , ( X6 )+ ;OUTPUT[y] = F5 95 S1BR B1 , L4 ; loop back 36 times 97 ; otherwise we are done ; Halt the program 99 End: HALT ``` Listing C.6: MePoEfAr Assembly Code for Benchmark 6: FIR #### C.2 Atmel AVR AT90S851 Assembly Codes ``` ; Atmel AVR Recursive Factorial 
; Benchmark Program
; This program recursively calculates the factorial of a number (n).
; A number is passed to this subroutine by main for factorial calculation.
;
; Total Number of Instructions: 53

; R18, R19 contain n
        LDI
        LDI     R19,0x00
        RCALL   Fact            ; call factorial
        RET                     ; end of main

; R22-R25 will hold the calculated factorial
Fact:   PUSH                    ; Push register on stack
        PUSH                    ; Push register on stack
        ; if (n<=1) i.e. if the number is 0 or 1
        CPI     R18,0x02        ; Compare with 2
        CPC     R19,R1          ; Compare with carry
        BRGE    L0              ; Branch if (n>=2)
        ; then return 1
        LDI     R22,0x01        ; Load immediate
        LDI     R23,0x00        ; Load immediate
        LDI     R24,0x00        ; Load immediate
        LDI     R25,0x00        ; Load immediate
        RJMP    L1              ; go down to L1 to restore registers and return
        ; otherwise return n * factorial(n-1)
L0:     MOV     R20,R18         ; Copy register
        MOV     R21,R19         ; Copy register
        ; R20-R21 contain n
        SBIW    R18,0x01        ; Subtract immediate from word
        ; R18 now contains n-1
        ; call factorial(n-1)
        RCALL   Fact            ; Relative call subroutine
        ; result is in R22-R25
        ; copy n back to R18-R21 for multiplication
        MOV     R18,R20         ; Copy register
        MOV     R19,R21         ; Copy register
        CLR     R20             ; Clear Register
        CLR     R21             ; Clear Register
        RCALL   Mult32
        ; result of multiplication is in R22-R25
        ; restore registers
L1:     POP     R19             ; Pop register from stack
        POP                     ; Pop register from stack
        RET                     ; Subroutine return

; R18-R21 first operand
; R22-R25 second operand
Mult32: CLR     R31             ; Clear Register
        CLR     R30             ; Clear Register
        CLR     R27             ; Clear Register
        CLR     R26             ; Clear Register
M0:     SBRS    R22,0           ; Skip if bit in register set
        RJMP    M1              ; Relative jump
        ADD     R26,R18         ; Add without carry
        ADC     R27,R19         ; Add with carry
        ADC     R30,R20         ; Add with carry
        ADC     R31,R21         ; Add with carry
M1:     LSL     R18             ; Logical Shift Left
        ROL     R19             ; Rotate Left Through Carry
        ROL     R20             ; Rotate Left Through Carry
        ROL     R21             ; Rotate Left Through Carry
        LSR     R25
        ROR     R24
        ROR     R23
        ROR     R22
        BRNE    M0              ; Branch if not equal
        SBIW    R24,0x00
        CPC     R23,R22         ; Compare with carry
        BRNE    M0              ; Branch if not equal
        MOV     R25,R31         ; Copy register
        MOV     R24,R30         ; Copy register
        MOV     R23,R27         ; Copy register
        MOV     R22,R26         ; Copy register
        RET                     ; Subroutine return
```

Listing C.7: Atmel AVR AT90S851 Assembly Code for Benchmark 1: Recursive Factorial

```
; Atmel AVR String Copy Assembly Program
; In this program, main subroutine passes the addresses of source and
; destination strings to the StrCpy subroutine to copy the characters
; from source to the destination string.
;
; Total Number of Instructions: 11

; R30,R31 contain address of strSrc
; R28,R29 contain address of strDest
Main:   LDI
        LDI     R31,0x00
        LDI     R28,0x70        ; address of strDest
        LDI     R29,0x00
        RCALL   strCopy         ; call string copy subroutine
        RET                     ; end of main

; R24 is used as temp to hold current character
strCopy:
        BRNE    strCopy         ; if not then loop back for next
        RET                     ; done copying, return to caller
```

Listing C.8: Atmel AVR AT90S851 Assembly Code for Benchmark 2: String Copy

```
; In this program, an array of 10 elements is initialized
; in the main subroutine. Base address of this array is passed
; to BSort subroutine to sort the numbers in descending order.
;
; Total Number of Instructions: 42

; R30,R31 contain address of array
; R24,R25 for i
Main:   LDI     R30,0x00        ; base address of array
        LDI     R31,0x00        ; base address of array
        LDI     R24,0x00
        LDI     R25,0x00
        ; Array[i] = i
L0:     ST      Z+,R24          ; Store indirect and postincrement
        ST      Z+,R25          ; Store indirect and postincrement
        ADIW    R24,0x01
        CPI     R24,0x0A        ; Compare with 10
        CPC     R25,R1          ; Compare with carry
        BRNE    L0              ; loop back if (i < 10)
        RCALL   BSort           ; call BSort subroutine
        RET                     ; Subroutine return

; R30,R31 contain address of array
; R18,R19 for i
; R20,R21 for j
BSort:  MOV     R30,R24         ; base address of a[]
        MOV     R31,R25         ; base address of a[]
        LDI     R18,0x08
        LDI     R19,0x00        ; i = 8
L2:     LDI     R20,0x00        ; j = 0
        LDI     R21,0x00        ; j = 0
        ; a[j]
L1:     LDD     R22,Z+0         ; Load indirect with displacement
        LDD     R23,Z+1         ; Load indirect with displacement
        ; a[j+1]
        LDD     R26,Z+2         ; Load indirect with displacement
        LDD     R27,Z+3         ; Load indirect with displacement
        ; if a number is greater than its next number
        CP      R26,R22         ; if (a[j]>a[j+1])
        CPC     R27,R23         ; if (a[j]>a[j+1])
        BRGE    NoSwap          ; then no swap required
        ; otherwise we need to swap
        ; a[j]=a[j+1]
        STD     Z+1,R27         ; Store indirect with displacement
        STD     Z+0,R26         ; Store indirect with displacement
        ; a[j+1]=a[j]
        STD     Z+3,R23         ; Store indirect with displacement
        STD     Z+2,R22         ; Store indirect with displacement
        ; loop condition for j
NoSwap: SUBI    R20,0xFF        ; Subtract immediate
        SBCI    R21,0xFF        ; Subtract immediate with carry
        ADIW    R30,0x02        ; Add immediate to word
        CP      R18,R20         ; Compare
        CPC     R19,R21         ; Compare with carry
        BRGE    L1              ; Branch if greater or equal, signed
        ; loop condition for i
        SUBI    R18,0x01        ; Subtract immediate
        SBCI    R19,0x00        ; Subtract immediate with carry
        SER     R20             ; Set Register
        CPI     R18,0xFF        ; Compare with immediate
        CPC     R19,R20         ; Compare with carry
        BRNE    L2              ; loop back if (i>=0)
        ; done with sorting
        RET                     ; Subroutine return
```

Listing C.9: Atmel AVR AT90S851 Assembly Code for Benchmark 3: Bubble Sort

```
; Structure contains 3 elements:
;   1 char byte Flag indicating if sensor has been calibrated or not.
;   1 short int containing the offset to be adjusted
;   1 long int containing the actual sensor value
;
; An array of 5 sensors is declared. InitSensors() will initialize
; these values to some numbers. CalibrateSensors() will subtract the
; offset from the value of the sensors and set the Flag. main() will call
; these two functions to initialize and calibrate sensor data.
;
; Total Number of Instructions: 39

; Main subroutine
Main:   RCALL   Init            ; call to Init subroutine
        RCALL   Calib           ; call to Calib subroutine
End:    RET                     ; end

; STHi, STLo -> starting address of struct array
; R30, R31   -> pointer to current element
; R15, R16   -> loop counter, i
; R20-R23    -> data with which Values will be initialized
Init:   MOV     R30, #STLo      ; R30,R31 contain starting address of struct
        MOV     R31, #STHi
        MOV     R20, #3         ; R20 = 3
        ; sensor value will be initialized with D0
        ; 3 is added to every value of i, so
        ; initialized D0 with 3
        CLR     R21
        CLR     R22
        CLR     R23
        MOV     R15, #0         ; i = 0
        MOV     R16, #0
L0:     STD     Z+, #0          ; Flag = 0
        STD     Z+, R15         ; Offset = i
        STD     Z+, R16
        INC     R20             ; R20 = i + 3
        ADC     R21, R16
        ADC     R22, #0
        ADC     R23, #0
        STD     Z+, R20         ; Value = i+3
        STD     Z+, R21
        STD     Z+, R22
        STD     Z+, R23
        INC     R15             ; i++
        CPI     R15, #5         ; compare with 5
        BRNE    L0              ; loop back 5 times
        RET                     ; return to caller

; Calib subroutine
; STHi, STLo -> starting address of struct array
; R30, R31   -> pointer to current element
; R15        -> loop counter, i
Calib:
        MOV     R15, #5
L1:     STD     Z+0, #1         ; Flag = 1
        MOV     R18, Z+         ; R18, R19 = offset
        MOV     R19, Z+
        ; value = value - offset
        SUB     Z+, R18
        SBC     Z+, R19
        SBCI    Z+, #0
        SBCI    Z+, #0
        DEC     R15             ; decrement loop counter
        BRNE    L1              ; loop back for 5 sensors
        RET                     ; return to caller
```

Listing C.10: Atmel AVR AT90S851 Assembly Code for Benchmark 4: Sensor Structure

```
; Atmel AVR Matrix Multiplication Benchmark Program
; This program multiplies two matrices of order 3X4 and 4X5
; to give a product matrix of order 3X5. Both the matrices are
; initialized with some numbers and then multiplication
; is performed to get product.
;
; Total Number of Instructions: 105

; Main Subroutine
; Base Address of matrix m1 -> M1Lo, M1Hi
; Base Address of matrix m2 -> M2Lo, M2Hi
; Base Address of matrix m3 -> M3Lo, M3Hi
; R26,R27 -> pointer to m1
; R28,R29 -> pointer to m2
; R30,R31 -> pointer to m3
; R18-R21 -> hold current element of m1
; R22-R25 -> hold current element of m2
; R15, R16, R17 -> temporaries for passing values and loop counters

        ; initialize m1
Main:   MOV     R15, #nRows1    ; rows of m1
        MOV     R16, #nCols1    ; cols of m1
        MOV     R30, #M1Lo      ; base address m1 low
        MOV     R31, #M1Hi      ; base address m1 high
        RCALL   INIT            ; call initialization subroutine
        ; initialize m2
        MOV     R15, #nRows2    ; rows of m2
        MOV     R16, #nCols2    ; cols of m2
        MOV     R30, #M2Lo      ; base address m2 low
        MOV     R31, #M2Hi      ; base address m2 high
        RCALL   INIT            ; call initialization subroutine
        ; perform multiplication
        MOV     R26, #M1Lo      ; base address m1 low
        MOV     R27, #M1Hi      ; base address m1 high
        MOV     R28, #M2Lo      ; base address m2 low
        MOV     R29, #M2Hi      ; base address m2 high
        MOV     R30, #M3Lo      ; base address m3 low
        MOV     R31, #M3Hi      ; base address m3 high
        MOV     R15, #5         ; nCols2
L3:     MOV     R16, #3
L2:     MOV     R17, #4         ; nCols1
        LDI     Z+0,#0          ; m3[m][p] = 0
        LDI     Z+1,#0
        LDI     Z+2,#0
        LDI     Z+3,#0
L1:     MOV     R18, X+         ; R18-R21 = m1[m][n]
        MOV     R19, X+
        MOV     R20, X+
        MOV     R21, X+
        MOV                     ; R22-R25 = m2[m][n]
        MOV
        MOV
        MOV
        ; perform m1[m][n]*m2[n][p]
        RCALL   Mult32          ; Relative call subroutine
        ADD     Z+0,R22         ; m3[m][p] += m1 * m2
        ADD     Z+1,R23
        ADD     Z+2,R24
        ADD     Z+3,R25
        ADIW
        DEC     R17             ; decrement
        BRNE    L1              ; loop back 4 times
        ADIW
        SBIW    R29:R28,#56     ; pointer for m2
                                ; now points to first element of next column
        DEC     R16             ; decrement
        BRNE    L2              ; loop back 3 times
        MOV     R28, #M2Lo      ; base address m2 low
        MOV     R29, #M2Hi      ; base address m2 high
        ; now points to base address of m2
        DEC     R15             ; decrement
        BRNE    L3              ; loop back 5 times
        RET                     ; Subroutine return

; R23-R26 -> Data to be assigned
; R15     -> has the nRows
; R16     -> has the nCols
; R30,R31 -> element pointer
; R17     -> Row number
;
; Instructions = 44
; Bytes
INIT:   CLR     R23
        CLR     R24
        CLR     R25
        CLR     R26
        CLR     R17             ; row counter
        ; mat[m][p] = data
L1:     STD     Z+0,R23         ; Store indirect with displacement
        STD     Z+1,R24         ; Store indirect with displacement
        STD     Z+2,R25         ; Store indirect with displacement
        STD     Z+3,R26         ; Store indirect with displacement
        ; increment data = data + 1
        ADD     R23,#1          ; Add without carry
        ADC     R24,#0          ; Add with carry
        ADC     R25,#0          ; Add with carry
        ADC     R26,#0          ; Add with carry
        ; R30,R31 = address of m1[m][p]
        ; now point to next element in the matrix
        ADI     R30,#4
        ADC     R31,#0
        DEC     R16
        ; repeat it for all columns
        ; row ++
        INC     R17
        MOV     R17, R23        ; R23 = row number (data for new row)
        DEC     R15
        BRGT    L1              ; repeat it for all rows
        RET

; 32-bit Multiplication Subroutine
; R18-R21 -> First Number
; R22-R25 -> Second Number
; R22-R25 -> Product
Mult32: PUSH    R26
        PUSH    R27
        PUSH    R30
        PUSH    R31
        EOR     R31, R31        ; clear registers
        EOR     R30, R30        ; for results
        EOR     R27, R27
        EOR     R26, R26
M1:     SBRS    R22, 0
        RJMP    M2
        ADD     R26, R18
        ADC     R27, R19
        ADC     R30, R20
        ADC     R31, R21
M2:     ADD     R18, R18
        ADC     R19, R19
        ADC     R20, R20
        ADC     R21, R21
        LSR     R25
        ROR     R24
        ROR     R23
        ROR     R22
        BRNE    M1
        SBIW    R24, 0x00
        CPC     R23, R22
        BRNE    M1
        MOV     R25, R31
        MOV     R24, R30
        MOV     R23, R27
        MOV     R22, R26
        POP     R31             ; restore registers
        POP     R30
        POP     R27
        POP     R26
        RET
```

Listing C.11: Atmel AVR AT90S851 Assembly Code for Benchmark 5: Matrix Multiplication

```
; Atmel AVR FIR Filter Benchmark Program
; This program is an implementation of a 17 order FIR filter.
; COEFF and INPUT arrays are initialized with some data and then
; FIR calculations are performed to get the OUTPUT array. These calculations
; are basically integer and floating point calculations
; performed on these arrays to get floating
; result samples in OUTPUT array.

        ; COEFF initialization
Main:   STD     Y+8,R1          ; i = 0
        STD     Y+7,R1
L0:     LDD     R16,Y+7         ; R16,R17 = i
        LDD     R17,Y+8
        LDD     R22,Y+7         ; R22-R25 = i
        LDD     R23,Y+8
        CLR     R24
        CLR     R25
        RCALL   INT2FLOAT       ; convert i to floating point
                                ; so now R22-R25 = float(i)
        LDI     R18,0x00        ; R18-R21 = 5.0
        LDI     R19,0x00
        LDI     R20,0xA0
        LDI     R21,0x40
        RCALL   FADD            ; floating point add
                                ; so now R22-R25 = (i+5.0)
        LDI     R18,0x00        ; R18-R21 = 1.0
        LDI     R19,0x00
        LDI     R20,0x80
        LDI     R21,0x3F
        RCALL   FDIV            ; floating point divide
                                ; so now R22-R25 = 1.0 / (i+5.0)
        ; now compute the address of COEFF[i]
        MOV     R30,R16         ; R30,R31 = i
        MOV     R31,R17
        LSL     R30             ; R30,R31 = i * 4
        ROL     R31
        LSL     R30
        ROL     R31
        LDI     R20,0x00        ; R20,R21 = base address of COEFF
        LDI     R21,0x00
        ADD     R30,R20         ; R30,R31 = base + i * 4
        ADC     R31,R21
        STD     Z+0,R24         ; COEFF[i] = 1 / (i+5.0)
        STD     Z+1,R25
        STD     Z+2,R26
        STD     Z+3,R27
        LDD     R24,Y+7         ; R24,R25 = i
        LDD     R25,Y+8
        ADIW    R24,0x01        ; i++
        STD     Y+8,R25         ; Store back i
        STD     Y+7,R24
        CPI     R24,0x11        ; Compare with 17
        CPC     R25,R1
        BRLT    L0              ; loop back if (i < 17)

        ; INPUT array initialization
        STD     Y+8,R1          ; i = 0
        STD     Y+7,R1
L1:     LDD     R30,Y+7         ; R30,R31 = i
        LDD     R31,Y+8
        SUBI    R18,0x00        ; base address of INPUT i.e. 0x0100
        SBCI    R19,0x01
        LSL     R30             ; R30,R31 = i * size
        ROL     R31
        ADD     R30,R18         ; Add base address
        ADC     R31,R19         ; i.e. R30,R31 = base + i * size
        LDI     R18,0x02        ; R18,R19 = 2
        LDI     R19,0x00
        STD     Z+1,R19         ; INPUT[i] = 2
        STD     Z+0,R18
        LDD     R24,Y+7         ; R24,R25 = i
        LDD     R25,Y+8
        ADIW    R24,0x01        ; i++
        STD     Y+8,R25         ; Store i back
        STD     Y+7,R24
        CPI     R24,0x43        ; Compare i with 67
        CPC     R25,R1
        BRLT    L1              ; loop back if (i < 67)

        ; perform filtering
        STD     Y+6,R1          ; y = 0
        STD     Y+5,R1
L2:     LDI     R24,0x00        ; R24-R27 = 0
        LDI     R25,0x00
        LDI     R26,0x00
        LDI     R27,0x00
        STD     Y+1,R24         ; sum = 0
        STD     Y+2,R25
        STD     Y+3,R26
        STD     Y+4,R27
        STD     Y+8,R1          ; i = 0
        STD     Y+7,R1
        ; inner loop which will iteratively compute
        ; sum = sum + COEFF[i] * ( INPUT[y + 16 - i] + INPUT[y + i] )
L3:     LDD     R30,Y+7         ; R30,R31 = i
        LDD     R31,Y+8
        LSL     R30             ; R30,R31 = i * size
        ROL     R31
        LSL     R30
        ROL     R31
        ADDI    R30,0x00        ; R30,R31 = base + i * size
        ADCI    R31,0x00
        LDD     R14,Z+0         ; R14-R17 = COEFF[i]
        LDD     R15,Z+1
        LDD     R16,Z+2
        LDD     R17,Z+3
        LDD     R24,Y+5         ; R24,R25 = y
        LDD     R25,Y+6
        LDD     R24,Y+7         ; R24,R25 = i
        LDD     R25,Y+8
        SUB     R30,R24         ; y-i
        SBC     R31,R25
        ADDI    R30,0x00        ; R30,R31 = y - i + 16
        ADCI    R31,0x10
        LSL     R30             ; R30,R31 = (y - i + 16) * size
        ROL     R31
        ; add base address of INPUT
        ADDI    R30,0x00        ; i.e. R30,R31 = base + (y - i + 16) * size
        ADCI    R31,0x01
        LDD     R18,Z+0         ; R18,R19 = INPUT[y+16-i]
        LDD     R19,Z+1
        LDD     R20,Y+5         ; R20,R21 = y
        LDD     R21,Y+6
        LDD     R30,Y+7         ; R30,R31 = i
        LDD     R31,Y+8
        ADD     R30,R20         ; y+i
        ADC     R31,R21
        LSL     R30             ; R30,R31 = (y+i) * size
        ROL     R31
        ; add base address of INPUT
        ADDI    R30,0x00        ; i.e. R30,R31 = base + (y+i) * size
        ADCI    R31,0x01
        LDD     R24,Z+0         ; R24,R25 = INPUT[y+i]
        LDD     R25,Z+1
        ADD     R18,R24         ; R18,R19 = INPUT[y+16-i] + INPUT[y+i]
        ADC     R19,R25
        MOV     R22,R18         ; R22-R25 = INPUT[y+16-i] + INPUT[y+i]
        MOV     R23,R19
        CLR     R24
        SBRC    R23,7           ; Skip if bit in register cleared
        LAT     R24             ; Load and Toggle
        MOV     R25,R24
        ; call to int2float subroutine
        ; so R22-R25 will be converted to float
        ; i.e. R22-R25 = float(INPUT[y+16-i] + INPUT[y+i])
        RCALL   INT2FLOAT
        MOV     R18,R14         ; R18-R21 = COEFF[i]
        MOV     R19,R15
        MOV     R20,R16
        MOV     R21,R17
        ; call to floating point multiplication routine
        ; R22-R25 = COEFF * ( INPUT[y+16-i] + INPUT[y+i] )
        RCALL   FMUL
        LDD     R18,Y+1         ; R18-R21 = sum
        LDD     R19,Y+2
        LDD     R20,Y+3
        LDD     R21,Y+4
        RCALL   FADD            ; call to floating point addition routine
        STD     Y+1,R22         ; store back the value of sum
        STD     Y+2,R23
        STD     Y+3,R24
        STD     Y+4,R25
        LDD     R24,Y+7         ; R24,R25 = i
        LDD     R25,Y+8
        ADIW    R24,0x01
        STD     Y+8,R25         ; store i back
        STD     Y+7,R24
        CPI     R24,0x08        ; compare i with 8
        CPC     R25,R1          ; Compare with carry
        BRGE    L3              ; loop back if (i < 8)

        ; outer loop which will make output samples
        ; OUTPUT[y] = sum + INPUT[y + 8] * COEFF[8];
        LDD     R30,Y+5         ; R30,R31 = y
        LDD     R31,Y+6
        ADIW    R30,0x08        ; R30,R31 = y+8
        LSL     R30             ; R30,R31 = (y+8) * size
        ROL     R31
        ADDI    R30,0x00        ; add base address
        ADCI    R31,0x01        ; R30,R31 = base + (y+8) * size
        LDD     R22,Z+0         ; R22-R25 = INPUT[y+8]
        LDD     R23,Z+1
        CLR     R24             ; Clear Register
        SBRC    R23,7           ; Skip if bit in register cleared
        LAT     R24             ; Load and Toggle
        MOV     R25,R24
        ; call int2float subroutine
        ; i.e. R22-R25 = float(INPUT[y+8])
        RCALL   INT2FLOAT
        ; R18-R21 = COEFF[8]
        LDD     R18,Y+41        ; subscript is constant
        LDD     R19,Y+42        ; so calculated at assemble time
        LDD     R20,Y+43
        LDD     R21,Y+44
        ; call floating point multiplication
        ; so R22-R25 = INPUT[y + 8] * COEFF[8]
        RCALL   FMUL
        LDD     R18,Y+1         ; R18-R21 = sum
        LDD     R19,Y+2
        LDD     R20,Y+3
        LDD     R21,Y+4
        ; call floating point addition routine
        ; so R22-R25 = sum + INPUT[y + 8] * COEFF[8]
        RCALL   FADD
        LDD     R30,Y+5         ; R30,R31 = y
        LDD     R31,Y+6
        LSL     R30             ; R30,R31 = y * size
        ROL     R31
        LSL     R30
        ROL     R31
        ; add base address
        ; base address of OUTPUT is 0x0200
        ADDI    R30,0x00
        ADCI    R31,0x02
        ; as R22-R25 = sum + INPUT[y + 8] * COEFF[8]
        ; assign to OUTPUT[y]
        STD     Z+0,R22         ; OUTPUT[y] = sum + INPUT[y + 8] * COEFF[8]
        STD     Z+1,R23
        STD     Z+2,R24
        STD     Z+3,R25
        LDD     R24,Y+5         ; R24,R25 = y
        LDD     R25,Y+6
        ADIW    R24,0x01        ; y++
        STD     Y+6,R25         ; Store back y
        STD     Y+5,R24
        CPI     R24,0x24        ; Compare y with 36
        CPC     R25,R1
        BRLT    L2              ; loop back if (y < 36)
        ; done with filtering
        RET                     ; Subroutine return
```

Listing C.12: Atmel AVR AT90S851 Assembly Code for Benchmark 6: FIR

### C.3 TI MSP430 Assembly Codes

```
; This program recursively calculates the factorial of a number (n).
; A number is passed to this subroutine by main for factorial calculation.
;
; Total No of Instructions = 26

; main subroutine
; r12     -> n for which factorial is to be calculated
; r14,r15 -> calculated factorial
; r4,r5   -> temporaries
Main:   MOV.W   #5,r12          ; r12 = 5
        CALL    #Fact           ; call Fact subroutine
End:    RET                     ; end of main

; Fact subroutine which will calculate factorial of 5
; r10 assigned to n
; r12 and r13 will hold resulting factorial
; r14 and r15 are temporaries for multiplication
Fact:   PUSH    r12             ; r12 will be modified so save it
        MOV.W   r12,r10         ; r10 = n
        SUB.W   #1,r12          ; r12 = n-1
        JGE     L1              ; jump to L1 if (n>=2)
        ; base case is the case for 0 and 1
        ; factorial of which is 1
        MOV.W   #1,r14          ; r14 = 1
        MOV.W   #0,r15          ; r15 = 0
        POP     r12             ; restore r12
        RET                     ; return to caller
        ; if not base case then we need to find factorial(n-1)
        ; and multiply with n to get result
L1:     CALL    #Fact           ; call factorial for n-1
        MOV.W   r14,r4          ; put lower 16 bits of result in r4
        MOV.W   r15,r5          ; put upper 16 bits of result in r5
        ; As r10 = n
        ; Now to calculate n * factorial(n-1), perform multiplication
        CLR     r14             ; prod low = 0
        CLR     r15             ; prod hi = 0
        ; LSBs * LSBs
        MOV     r4,&0130h       ; copy to multiplier registers (OP1LO)
        MOV     r10,&0138h      ; OP2LO
        MOV     &SumLo,r14      ; Add product to result (Sum0)
        MOV     &SumHi,r15      ; Sum1
        MOV     r10,&0130h      ; copy to multiplier registers (OP2LO)
        MOV     r5,&0138h       ; OP1HI
        ADD     &SumLo,r14      ; Add product to result (Sum0)
        ADDC    &SumHi,r15      ; Sum1
        ; so now r14 and r15 contain the result
        POP     r12             ; restore r12
        RET                     ; return to caller
```

Listing C.13: TI MSP430 Assembly Code for Benchmark 1: Recursive Factorial

```
        MOV.W                   ; address of source string
        MOV.W   #strDest,r15    ; address of destination string
        CALL    #strCopy        ; call the copy subroutine
        RET                     ; end of main

strCopy:
        ADD.B   #1,r15
        JNE     strCopy
        RET                     ; return to caller
```

Listing C.14: TI MSP430 Assembly Code for Benchmark 2: String Copy

```
; TI MSP430 Bubble Sort Assembly Program
; In this program, an array of 10 elements is initialized
; in the main subroutine. Base address of this array is passed
; to BSort subroutine to sort the numbers in descending order.
;
; Total No of Instructions = 33

; START -> starting address of array
; r15   -> pointer to current element in the array
; r13   -> loop counter i
Main:   MOV.W   #0,r14
        MOV.W   #START,r15      ; address of array element
L1:     MOV.W   r13,0(r15)      ; arr[i] = i
        MOV.W   r14,2(r15)
        ADD.W   #4,r15          ; point to next element of array
                                ; size of each element is 4
        ADD.W   #1,r13          ; i++
        CMP.W   #10,r13         ; compare with 10
        JL      L1              ; loop back if (i<10)
        ; otherwise we are done with all the elements in the array
        CALL    #BSort          ; call the BSort routine
End:    RET                     ; end of main

BSort:
L4:
        ; compare higher 16 bits
        CMP.W   6(r15),r11      ; if (arr[j] < arr[j+1])
        JL      L5              ; then swapping is required
        ; otherwise no swapping required
        ; if higher 16 bits are same then
        ; we need to compare lower 16 bits
        CMP.W   4(r15),r10      ; now compare lower 16 bits
        JHS                     ; if they are same then
        ; swap arr[j] and arr[j+1]
        ; r10, r11 contain arr[j]
        ; arr[j] = arr[j+1]
L5:     MOV.W   4(r15),0(r15)   ; move low 16 bits
        MOV.W   6(r15),2(r15)   ; move high 16 bits
        ; arr[j+1] = arr[j]
        MOV.W   r10,4(r15)      ; copy low 16 bits
        MOV.W   r11,6(r15)      ; copy high 16 bits
L6:     ADD.W   #4,r15          ; point to next element of array
                                ; size of each element is 4
        ADD.W   #1,r9
        CMP.W   r9,r13          ; compare with i
        JGE     L4              ; loop back if (i>=j)
L7:     SUB.W   #1,r13
        TST.W   r13             ; compare with 0
        ; loop back if (i>=0)
L8:     RET                     ; return to caller
```

Listing C.15: TI MSP430 Assembly Code for Benchmark 3: Bubble Sort

```
; TI MSP430 Assembly Program implementing a structure for sensor values.
; Structure contains 3 elements:
;   1 char byte Flag indicating if sensor has been calibrated or not.
;   1 short int containing the offset to be adjusted
;   1 long int containing the actual sensor value
;
; An array of 5 sensors is declared. InitSensors() will initialize
; these values to some numbers. CalibrateSensors() will subtract the
; offset from the value of the sensors and set the Flag. main() will call
; these two functions to initialize and calibrate sensor data.
;
; Total Number of Instructions: 29

; Main subroutine
Main:   CALL    #Init           ; call to Init subroutine
        CALL    #Calib          ; call to Calib subroutine
End:    RET                     ; end

; START -> base address of first struct member
; r15   -> index struct array
; r4    -> loop counter, i
Init:   MOV.W   #START,r15      ; r15 = Starting address of struct
        MOV.W   #0,r4           ; i = 0
L0:     MOV.B   #0,0(r15)       ; sensors[i].Flag = 0
        INC.B   r15             ; increment the index
        MOV.W   r4,0(r15)       ; sensors[i].Offset = i
        MOV.W   r4,r8           ; r8, r9 = i+3
        ADC.W   #3,r9
        MOV.W   r8,2(r15)       ; sensors[i].Value = i+3
        MOV.W   r8,4(r15)
        ADD.W   #6,r15          ; increment the index to point to next struct element
        ADD.W   #1,r4
        CMP.W   #5,r4
        JL      L0              ; loop back 5 times
        RET                     ; return to caller

; START -> Starting address of struct array
; r15   -> index struct array
; r4    -> loop counter, i
Calib:  MOV.W   #START,r15      ; r15 = Starting address of struct
        MOV.W   #5,r4           ; loop counter
L1:     MOV.B   #1,0(r15)       ; sensors[i].Flag = 1
        INC.B   r15             ; increment the index
        SUB.W   0(r15),2(r15)   ; sensors[i].Value -= sensors[i].Offset
        SUBC.W  #0,4(r15)
        ADD.W   #6,r15          ; increment the index to point to next struct element
        DEC.W   r4
        JG      L1              ; loop back 5 times
        RET                     ; return to caller
```

Listing C.16: TI MSP430 Assembly Code for Benchmark 4: Sensor Structure

```
; TI MSP430 Matrix Multiplication Assembly Program
; This program multiplies two matrices of order 3X4 and 4X5
; to give a product matrix of order 3X5. Both the matrices are
; initialized with some numbers and then multiplication
; is performed to get product.
;
; Total No of Instructions = 56

; r4 assigned to n
; r5 assigned to m
; r6 assigned to p
;
; M1  -> Base Address of matrix m1
; M2  -> Base Address of matrix m2
; M3  -> Base Address of matrix m3
; r13 -> index for m1
; r14 -> index for m2
; r15 -> index for m3
; r9,r10,r11,r12 -> temporaries

        ; initialize m1
Main:   MOV.W   #nRows1,r6      ; r6 = no of rows
        MOV.W   #nCols1,r7      ; r7 = no of cols
        MOV.W   #M1,r12         ; base address of m1
        CALL    #INIT           ; call init routine
        ; initialize m2
        MOV.W   #nRows2,r6      ; r6 = no of rows
        MOV.W   #nCols2,r7      ; r7 = no of cols
        MOV.W   #M2,r12         ; base address of m2
        CALL    #INIT           ; call init routine
        ; perform multiplication
        MOV.W   #M1,r13         ; base address of m1
        MOV.W   #M2,r14         ; base address of m2
        MOV.W   #M3,r15         ; base address of m3
        MOV.W   #5,r4           ; nCols2
L3:     MOV.W   #3,r5
L2:     MOV.W   #4,r6           ; nCols1
        MOV.W   #0,r9           ; accumulator
        MOV.W   #0,r10
        ; multiplication of m1[m][n] * m2[n][p]
L1:     CLR     r11             ; temporary to hold product
        CLR     r12             ; hold m1[m][n] * m2[n][p]
        ; LSBs * LSBs
        MOV     0(R13),&0130h   ; copy to multiplier registers
        MOV     0(R14),&0138h
        ADD     &SumLo,R11      ; Add product to result
        ADDC    &SumHi,R12
        ; LSBs * MSBs
        MOV     0(R13),&0130h   ; copy to multiplier registers
        MOV     2(R14),&0138h
        MOV     0(R14),&0134h   ; multiplication with accumulation
        MOV     2(R13),&0138h   ; copy to multiplier registers
        ADD     &SumLo,R12      ; Add accumulated products
        ; R11 and R12 contain product i.e. m1 * m2
        ADD.W   r11,r9          ; accumulate products
        ADDC.W  r12,r10
        ADD.W   #20,r14         ; for the next element it should point
                                ; to first element of next row
                                ; nCols * size
        DEC.W   r6              ; done with 1 row
        JG      L1
        MOV.W   r9,0(r15)       ; m3[m][p] = r9, r10
        MOV.W   r10,2(r15)
        SUB.W   #56,r14         ; decrement pointer for m1 to point
                                ; to first element of next row
        DEC.W   r5              ; repeat this for all the columns
        JG      L2
        MOV.W   #M2,r14         ; base address of m2
        DEC.W   r4              ; repeat for all rows
        JG      L3
        RET                     ; return to caller

; r6      -> nRows
; r7      -> nCols
; r4, r5  -> m+p value, the data to be assigned
; r12     -> current element address pointer
; r9      -> row counter
INIT:   MOV.W   #0,r4           ; r4,r5 represent m+p
        MOV.W   #0,r5
        MOV.W   #0,r9           ; row counter required for m+p
L9:     MOV.W   r4,0(r12)       ; mat[m][p] = m+p
        MOV.W   r5,2(r12)
        ADD.W   #4,r12          ; point to next array element
        ADD.W   #1,r4           ; increment m+p
        ADDC.W  #0,r5
        DEC.W   r6              ; decrement row
        JG      L9              ; loop if > 0
        ADD.W   #1,r9           ; increment row counter
        MOV.W   r9,r4           ; assign it to r4 (m+p)
        DEC.W   r7              ; decrement column
        JG      L9              ; loop back if > 0
        RET
```

Listing C.17: TI MSP430 Assembly Code for Benchmark 5: Matrix Multiplication

```
; r7 assigned to i
; r8,r9 assigned to sum
; r10 assigned to y

        ; initialize COEFF array
main:   MOV.W   #COEFF,r15
        MOV.W   #0,r7           ; i = 0
L1:     MOV.W   r7,r12          ; r12 = i
        CALL    #__fs_itof      ; convert i to float
        MOV.W   #0,r14          ; r14 and r15 will store 5.0 in float
        MOV.W   #16544,r15
        CALL    #__fs_add       ; i + 5.0
        MOV.W   r12,r14
        MOV.W   r13,r15         ; r14, r15 contain (i+5.0)
        MOV.W   #0,r12          ; r12 and r13 get 1.0
        MOV.W   #16256,r13
        CALL    #__fs_div       ; perform 1/(i+5.0)
                                ; r12 and r13 contain result
        ; COEFF[i] = 1/(i+5.0)
        MOV.W   r12,0(r15)      ; lower 16 bits
        MOV.W   r13,2(r15)      ; upper 16 bits
        ADD.W   #4,r15          ; r15 now points to next element
        ADD.W   #1,r7           ; i++
        CMP.W   #17,r7          ; compare with 17
        JL      L1              ; loop back if (i<17)

        ; initialize INPUT array
        MOV.W   #0,r7           ; i = 0
        MOV.W   #2,r8           ; r8 = 2
L3:     MOV.W   r7,r15          ; r15 = i
        RLA.W   r15             ; r15 = i*size
        MOV.W   r8,68(r15)      ; INPUT[i] = 2
        ADD.W   #1,r7           ; i++
        CMP.W   #67,r7          ; compare with 67
        JL      L3              ; loop back if (i < 67)

        ; perform filtering
        MOV.W   #0,r10          ; y = 0
L5:     MOV.W   #0,r7           ; i = 0
        MOV.W   #0,r8           ; sum = 0
        MOV.W   #0,r9
L6:
        MOV.W   r10,r15
        MOV.W   r7,r14          ; r14 = i
        ADD.W   r15,r14         ; r14 = y + i
        RLA.W   r14             ; r14 = (y + i) * size
        ; r14 now contains address of INPUT[y+i]
        MOV.W   68(r14),r12     ; r12 = INPUT[y+i]
        ADD.W   68(r13),r12     ; r12 = INPUT[y+16-i] + INPUT[y+i]
        CALL    #__fs_itof      ; r12, r13 now contain float representation
                                ; of INPUT[y+16-i] + INPUT[y+i]
        MOV.W   r7,r15          ; r15 = i
        RLA.W   r15             ; r15 = i*2
        RLA.W   r15             ; r15 = i*size
        MOV.W   0(r15),r14      ; r14 = lower 16 bits of COEFF[i]
        MOV.W   2(r15),r15      ; r15 = upper 16 bits of COEFF[i]
        ; r14, r15 now contain float representation
        ; of COEFF[i]
        CALL    #__fs_mpy       ; COEFF[i] * (INPUT[y+16-i] + INPUT[y+i])
                                ; r12 and r13 contain result
        MOV.W   r8,r14          ; r14 = lower 16 bits of sum
        MOV.W   r9,r15          ; r15 = upper 16 bits of sum
        ; sum is being accumulated so store back the sum for next calculation
        MOV.W   r12,r8          ; r8 = lower 16 bits of sum
        MOV.W   r13,r9          ; r9 = upper 16 bits of sum
        ADD.W   #1,r7           ; i++
        CMP.W   #8,r7           ; compare with 8
        JL      L6              ; loop back if (i < 8)

        MOV.W   r10,r15         ; r15 = y
        ADD.W   #8,r15          ; r15 = y+8
        RLA.W   r15             ; r15 = (y+8) * size
        MOV.W   68(r15),r12     ; r12 = INPUT[y+8]
        CALL    #__fs_itof      ; convert r12 to float
        ; r3 is 0
        ; COEFF is at address 0
        ; so COEFF[8] will be at address 32
        MOV.W   32(R3),r14      ; r14 = lower 16 bits of COEFF[8]
        MOV.W   34(R3),r15      ; r15 = upper 16 bits of COEFF[8]
        CALL    #__fs_mpy
        MOV.W   r8,r14          ; r14 = lower 16 bits of sum
        MOV.W   r9,r15          ; r15 = upper 16 bits of sum
        CALL    #__fs_add       ; sum + INPUT[y + 8] * COEFF[8]
                                ; r12 and r13 contain result
        MOV.W   r10,r15         ; r15 = y
        RLA.W   r15             ; r15 = y*2
        RLA.W   r15             ; r15 = y*size
        ; store sum + INPUT[y + 8] * COEFF[8] back to OUTPUT[y]
        MOV.W   r12,202(r15)    ; lower 16 bits
        MOV.W   r13,204(r15)    ; upper 16 bits
        CMP.W   #36,r10         ; compare with 36
        JL      L5              ; loop back if (y<36)
        ; done with filtering
        RET                     ; return to caller
```

Listing C.18: TI MSP430 Assembly Code for Benchmark 6: FIR

### C.4 ARM Cortex-M3 Assembly Codes

```
; Factorial subroutine
        PUSH                    ; store register
        MOV     R4,R0           ; R4 = n
        SUB     R0,R0,#1        ; R0 = n-1
        BGT     L0              ; if greater then jump
        ; otherwise come here to base case calculations
        MOV     R4,#1           ; fact = 1
        POP     R0              ; restore register
        BX      R14             ; return to caller
L0:     BL                      ; call factorial recursively
        MUL     R4,R4,R5        ; n*fact(n-1)
        POP     R0              ; restore register
        BX      R14             ; return to caller
```

Listing C.19: ARM Cortex-M3 Assembly Code for Benchmark 1: Recursive Factorial

```
; ARM Cortex-M3 String Copy Benchmark Program
; In this program, main subroutine passes the addresses of source and
; destination strings to the StrCpy subroutine to copy the characters
; from source to the destination string.
;
; R1 -> Source String address
; R2 -> Destination String address
Main:   MOV     R1,#Src         ; R1 = Address of source string
        MOV     R2,#Dest        ; R2 = Address of destination string
        BL      strCpy          ; call string copy subroutine
End:    BX      R14             ; return to caller

strCpy: MOV     R0,#0           ; i=0
        LDRB    R3,[R2,R0]      ; R3 = SrcStr[i]
        STRB    R3,[R1,R0]      ; DestStr[i] = R3
        ADD     R0,R0,#1
        CBNZ                    ; loop till not null
        BX      R14             ; return to caller
```

Listing C.20: ARM Cortex-M3 Assembly Code for Benchmark 2: String Copy

```
        ADD     R4,R4,#1
        CMP     R4,#10          ; compare with 10
        BLT     L0              ; loop back if (j < 10)
        BL      BSort           ; call sorting routine
End:    BX      R14             ; return to caller

; BSort Subroutine
; Actual subroutine used to implement sorting Algorithm
; Array is at Address 0x0000
BSort:
        MOV     R0,#0               ; j = 0
L2:     LDR     R12,[R1,R0,LSL #2]  ; R12 = Array[j]
        ADD     R4,R0,#1            ; R4 = j+1
        LDR     R5,[R1,R4,LSL #2]   ; R5 = Array[j+1]
        CMP     R12,R5              ; compare Array[j] with Array[j+1]
        ; if less than or equal
        ; then no swap required
        STR                         ; Array[j] = Array[j+1]
        STR                         ; Array[j+1] = Array[j]
L1:     ADD     R0,R0,#1
        CMP     R0,R2               ; compare with i
        BLE     L2                  ; loop back if (j<=i)
        SUB     R2,R2,#1
        CBZ     R2,L1               ; compare to 0 and loop if (i >= 0)
        ; done sorting
        BX      R14                 ; return to caller
```

Listing C.21: ARM Cortex-M3 Assembly Code for Benchmark 3: Bubble Sort

```
; Structure contains 3 elements:
;   1 char byte Flag indicating if sensor has been calibrated or not.
;   1 short int containing the offset to be adjusted
;   1 long int containing the actual sensor value
; An array of 5 sensors is declared.
; InitSensors() will initialize these values to some numbers.
; CalibrateSensors() will subtract the offset from the value of the
; sensors and set the Flag.
; main() will call these two functions to initialize and calibrate
; sensor data.

Main:   BL      Init            ; call to Init subroutine
        BL      Calib           ; call to Calib subroutine
End:            R14             ; return to caller

; START -> base address of first struct member
; R1    -> index struct array
; R2    -> Data with which Value will be initialized
; R3    -> Flag will be initialized by R3
; R4    -> loop counter

Init:   MOV
        MOV
L0:
                R2,R4,#3        ;R2 = i + 3
        STRB    R3,[R1,#0x00]   ;sensors[i].Flag = 0
                R4,[R1,#0x02]   ;sensors[i].Offset = i
        STR     R2,[R1,#0x04]   ;sensors[i].Value = i+3
        ADD     R1,R1,#6        ; point to next element
        ADD     R4,R4,#1
        CMP     R4,#10          ; compare with 5
        BLT     L0              ; loop back if (i < 5)
        BX      R14             ; return to caller

; START -> Starting address of struct array
; R1    -> index struct array
; R2    -> sensors[i].Offset
; R3    -> sensors[i].Value
; R4    -> loop counter

Calib:  MOV     R1,#START       ;R1 = base address of Array
        MOV     R4,#0           ; i = 0
        MOV     R3,#1           ;R3 = 1 for flag
L1:     STRB    R3,[R1,#0x00]   ;sensors[i].Flag = 1
        LDRH    R2,[R1,#0x02]   ;R2 = sensors[i].Offset
        LDR     R3,[R1,#0x04]   ;R3 = sensors[i].Value
        SUB     R2,R3,R2        ;R2 = sensors[i].Value - sensors[i].Offset
        STR     R2,[R1,#0x04]   ;sensors[i].Value = R2
        ADD     R1,R1,#6        ; point to next element
        ADD     R4,R4,#1
        CMP     R4,#10          ; compare with 5
        BLT     L1              ;loop back if (i < 5)
        BX      R14             ; return to caller
```

Listing C.22: ARM Cortex-M3 Assembly Code for Benchmark 4: Sensor Structure

```
; ARM Cortex-M3 Matrix Multiplication Benchmark Program
; This program multiplies two matrices of order 3X4 and 4X5 to give a
; product matrix of order 3X5. Both the matrices are initialized with
; some numbers and then multiplication is performed to get product.

; Main Subroutine
; Base Address of matrix m1 -> M1
; Base Address of matrix m2 -> M2
; Base Address of matrix m3 -> M3
; R10 is used to index the elements of matrix m1
; R11 is used to index the elements of matrix m2
; R12 is used to index the elements of matrix m3
; R1,R3,R4,R5 -> loop counters and temporaries

Main:
        ; fill second matrix
        MOV     R12,#M2         ;R12 = base address of m2
        MOV     R6,#nRows2      ;R6 = no of rows of m2
        MOV     R7,#nCols2      ;R7 = no of cols of m2
        BL      INIT            ;call to INIT subroutine

        ; perform multiplication
        MOV     R10,#M1         ;R10 =base address of m1
        MOV     R11,#M2         ;R11 =base address of m2
        MOV     R12,#M2         ;R12 =base address of m3

        MOV     R5,#0           ; nCols2
L6:     MOV     R4,#0           ; nRows1
L5:     MOV     R1,#0           ;nCols1 (or nRows2 is same)
                R3,#0           ;R3 = 0 (accumulator for one element)

                R7,[R10]        ;R7 = m1[m][n]
        LDR     R8,[R11]        ;R8 = m2[n][p]
                R3,R7,R8        ;R3 += m1[m][n] * m2[n][p]
        ADD     R11,R11,#20     ; it will point to first element of next row
        ADD     R10,R10,#4      ; points to next m1 element
        SUB     R1,R1,#1
        BLT     L4              ; repeat 4 times

        STR     R3,[R11]        ; m3[m][p] = R3
        SUB     R11,R11,#56     ; now points to first element of next column
        ADD     R12,R12,#4      ; points to next m3 element
        SUB     R1,R4,#1
        BLT     L5              ; repeat 5 times

        MOV     R11,#M2         ;R11 =base address of m2
                R5,R5,#1
        BLT     L6              ; repeat 3 times

        BX      R14             ; return to caller

; R0  -> value to be assigned
; R1  -> row number
; R12 -> array index

INIT:   M       R1,#0
L1:             R0,[R12]
        ADD     R0,#1           ;R0++
        SUB     R7,#1           ; decrement col
        ADD     R12,#4          ; point to next element
        BLT     L1              ; repeat this for nCols
        ADD     R1,#1           ; increment row
        MOV     R0,R1           ;R0 = row number, for next row
        SHR     R6,#1           ; decrement rows
        BLT     L1              ; repeat this for nRows
                                ; return to caller
```

Listing C.23: ARM Cortex-M3 Assembly Code for Benchmark 5: Matrix Multiplication

```
; ARM Cortex-M3 FIR Filter Benchmark Program
; This program is an implementation of a 17 order FIR filter. COEFF and
; INPUT arrays are initialized with some data and then FIR calculations
; are performed to get the OUTPUT array. These calculations are basically
; integer and floating point calculations performed on these arrays to
; get floating result samples in OUTPUT array.

; R10   -> to hold sum for accumulation
; R8,R9 -> loop counters i and y respectively

        ; initialize COEFF array
Main:   MOV     R8,#0           ; i = 0
L0:     ADD     R0,R8,5         ; R0 = i+5
        BL      int2float       ;R0 = float(i+5)
                R4,#0x3f80000   ;R4 = 1.0
        BL      fdiv            ; R0 = 1/(i+5.0)
        MOV     R1,#COEFF       ;R1 = base address of COEFF
        STR     R0,[R1,R8,LSL #2]       ;COEFF[i] = 1/(i+5.0)
        ADD     R8,R8,#1
        CMP     R8,#17          ; compare with 17
                L0              ; loop back if (i <17)

        ; initialize INPUT array
        MOV     R8,#0           ; i = 0
L1:     MOV     R0,#2           ; R0 = 0
        MOV     R1,#INPUT       ;R1 = base address of INPUT
        STR     R0,[R1,R8,LSL #2]       ;INPUT[i] = 2
        ADD     R8,R8,#1        ; i++
        CMP     R8,#67          ; compare with 67
        BLT     L1              ; loop back if (i < 67)

        ; Perform FIR Calculations
        MOV     R9,#0           ;y = 0
L2:     MOV     R10,#0          ;sum = 0
        MOV     R8,#0           ;i = 0
L3:     ADD     R1,R9,#16       ;R1 = y+16
                R1,R1,R8        ;R1 = y+16-i
        MOV     R2,#INPUT       ;R2 = base address of INPUT
        LDR     R1,[R2,R1,LSL #2]       ;R1 = INPUT[y+16-i]
                R2,R9,R8        ;R2 = y+i
        LDR     R2,[R3,R2,LSL #2]       ;R2 = INPUT[y+i]
        ADD     R0,R1,R2        ;R0 = INPUT[y+16-i] + INPUT[y+i]
        BL      int2float       ;R0 = float(R0)
        MOV     R6,#COEFF       ;R6 = base of COEFF
                R1,[R6,R8,LSL #2]       ;R1 = COEFF[i]
        BL                      ;R0 = COEFF[i] * INPUT[y+16-i] + INPUT[y+i]
        MOV     R1,R10          ;R1 = sum
        BL      fadd()          ;R0 = sum + COEFF[i] * INPUT[y+16-i] + INPUT[y+i]
        MOV     R10,R0          ;R10 = sum
                                ;sum accumulation
        ADD     R8,R8,#1
        CMP     R8,#8           ; compare with 8
        BLT     L3              ; loop back if (i < 8)

        MOV     R1,#INPUT       ;R1 = base address of INPUT
        ADD     R2,R9,#8        ; R2 = y+8
        LDR     R0,[R1,R2,LSL #2]       ; R0 = INPUT[y+8]
        BL      int2float       ;R0 = float(INPUT[y+8])
        LDR     R1,[#Addr(COEFF[8])]    ; R1 = COEFF[8]
        BL                      ;R0 =
INPUT[y+8] * COEFF[8] MOV R1 , R10 ;R1 = sum ВL fadd ;R0 = sum + INPUT[y+8] * COEFF[8] MOV R1,#OUTPUT ; R1 = base address of OUTPUT 100 101 = sum + INPUT[y+8] * COEFF[8] R0,[R1,R9,LSL #2] STR 102 103 ADD R9, R9,#1 R9,#36 ; compare with 36; loop back if (y < 36) 104 CMP 105 BLT 107 End: вх ; return to caller R14 ``` Listing C.24: ARM Cortex-M3 Assembly Code for Benchmark 6: FIR # Calculations Details ## D.1 MePoEfAr Calculations Details Table D.1: MePoEfAr Calculations | | | | | St | atic Res | ults | Dynamic Results | | | | | | |----|---------|---------|-----------------|----------|----------|----------|-----------------|----------|-------|---------|---------------|--| | # | | Instruc | tion | Instr | Instr | DBytes | # of | Exec. | Men | ory Tra | ffic (Cycles) | | | | | | | Bytes | Cycles | Moved | Exec. | Cycles | IM | DM16 | DMem32 | | | | | IV. | IePoEfAr Calcul | ations f | or Bench | nmark 1: | Recurs | ive Fact | orial | | | | | 1 | Main: | MOVd | #5, D0 | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | 2 | | BRS | Fact | 2 | 2 | | 1 | 2 | 1 | 0 | 0 | | | 3 | End: | RTS | | 2 | 2 | | 1 | 2 | 1 | 0 | 0 | | | 4 | Fact: | MOVd | D0,-(SP) | 2 | 2 | 2 | 5 | 10 | 5 | 5 | 5 | | | 5 | | MOVd | D0,D2 | 2 | 1.6 | | 5 | 8 | 5 | 0 | 0 | | | 6 | | SUBd | #1, D0 | 2 | 1 | | 5 | 5 | 5 | 0 | 0 | | | 7 | | BRgt | L0 | 2 | 1 | | 5 | 5 | 5 | 0 | 0 | | | 8 | | MOVd | #1,D1 | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | 9 | | MOVd | (SP)+,D0 | 2 | 2 | 2 | 1 | 2 | 1 | 1 | 1 | | | 10 | | RTS | | 2 | 2 | | 1 | 2 | 1 | 0 | 0 | | | 11 | L0: | BRS | Fact | 2 | 2 | | 4 | 8 | 4 | 0 | 0 | | | 12 | | MULd | D2,D1 | 2 | 2 | | 4 | 8 | 4 | 0 | 0 | | | 13 | | MOVd | (SP)+,D0 | 2 | 2 | 2 | 4 | 8 | 4 | 4 | 4 | | | 14 | | RTS | | 2 | 2 | | 4 | 8 | 4 | 0 | 0 | | | | | Total | 1 | 28 | 23.6 | 6 | 42 | 70 | 42 | 10 | 10 | | | | | | MePoEfAr Ca | lculatio | ns for B | enchmark | 2: Str | ing Cop | у | | | | | 1 | Main: | MOVx | #Src,X4 | 4 | 2 | | 1 | 2 | 2 | 0 | 0 | | | 2 | | MOVx | #Dst,X5 | 4 | 2 | | 1 | 2 | 2 | 0 | 0 | 
| | 3 | | BRS | StrCpy | 2 | 2 | | 1 | 2 | 1 | 0 | 0 | | | 4 | End: | HALT | | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | 5 | StrCpy: | MOVb | (X4)+, B0 | 2 | 2 | 2 | 13 | 26 | 13 | 13 | 13 | | | 6 | | MOVb | B0, (X5)+ | 2 | 2 | 2 | 13 | 26 | 13 | 13 | 13 | | | 7 | | BRne | StrCpy | 2 | 1 | | 13 | 13 | 13 | 0 | 0 | | | 8 | | RTS | | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | | | Total | • | 20 | 13 | 4 | 44 | 73 | 46 | 26 | 26 | | | | | | MePoEfAr Ca | lculatio | ns for B | enchmark | 3: Bul | ble Sor | t | | | | | 1 | Main: | MOVd | #10, D1 | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | 2 | | MOVx | #START,X4 | 4 | 2 | | 1 | 2 | 2 | 0 | 0 | | | 3 | | MOVd | #0,D2 | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | 4 | L1: | MOVd | D2, (X4)+ | 2 | 4 | 4 | 10 | 40 | 10 | 20 | 10 | | | 5 | | ADDd | #1,D2 | 2 | 2 | | 10 | 20 | 10 | 0 | 0 | | | 6 | | DECBRn | D1, L1 | 2 | 3 | | 10 | 30 | 10 | 0 | 0 | | | 7 | | BRS | BSort | 2 | 2 | | 1 | 2 | 1 | 0 | 0 | | | 8 | End: | HALT | | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | 9 | BSort: | MOVx | #START,X4 | 4 | 1 | | 1 | 1 | 2 | 0 | 0 | | | 10 | | MOVw | #9,W0 | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | | | | | St | atic Res | ults | | Dv | namic | Results | | | |-----|---------|------------|-----------------|---------|----------|---------|------------------------------------|----------|-------|---------|--------|--| | # | | Instruct | tion | Instr | Instr | DBytes | # of Exec. Memory Traffic (Cycles) | | | | | | | 77- | | 111001 000 | | Bytes | Cycles | Moved | Exec. 
| Cycles | IM | DM16 | DMem32 | | | 11 | L1: | MOVw | #W0,W1 | 2 | 1 | | 9 | 9 | 9 | 0 | 0 | | | 12 | L2: | MOVd | (X4)+,D2 | 2 | 4 | 4 | 45 | 180 | 45 | 90 | 45 | | | 13 | | MOVd | @X4,D3 | 2 | 3 | 4 | 45 | 135 | 45 | 90 | 45 | | | 14 | | CPAd | D2, D3 | 2 | 1 | | 45 | 45 | 45 | 0 | 0 | | | 15 | | BRlt | NoSwap | 2 | 1 | | 45 | 45 | 45 | 0 | 0 | | | 16 | | MOVd | D2,@X4 | 2 | 3 | 4 | 45 | 135 | 45 | 90 | 45 | | | 17 | | MOVd | D3,-4(X4) | 3 | 5 | 4 | 45 | 225 | 90 | 90 | 45 | | | 18 | NoSwap: | DECBRn | W1, L2 | 2 | 2 | | 45 | 90 | 45 | 0 | 0 | | | 19 | | DECBRn | W0,L1 | 2 | 2 | | 9 | 18 | 9 | 0 | 0 | | | 20 | | RTS | | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | | | Total | | 45 | 41 | 20 | 371 | 982 | 418 | 380 | 190 | | | | | N | MePoEfAr Calcu | lations | for Bend | hmark 4 | : Senso | r Struct | ure | | | | | 1 | Main: | BRS | Init | 2 | 2 | | 1 | 2 | 1 | 0 | 0 | | | 2 | | BRS | Calib | 2 | 2 | 1 | 1 | 2 | 1 | 0 | 0 | | | 3 | End: | RTS | | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | 4 | Init: | MOVd | #3, D0 | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | 5 | | MOVx | #START, X4 | 4 | 2 | | 1 | 2 | 2 | 0 | 0 | | | 6 | | MOVw | #0, W0 | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | 7 | | MOVb | #5, B0 | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | 8 | L0: | MOVb | #0, (X4)+ | 2 | 2 | 2 | 5 | 10 | 5 | 5 | 5 | | | 9 | | MOVw | W0, (X4)+ | 2 | 2 | 2 | 5 | 10 | 5 | 5 | 5 | | | 10 | | ADDwds | W0, D0 | 3 | 4 | | 5 | 20 | 10 | 0 | 0 | | | 11 | | MOVd | D0,(X4)+ | 2 | 4 | 4 | 5 | 20 | 5 | 10 | 5 | | | 12 | | ADDb | #1, W0 | 2 | 1 | | 5 | 5 | 5 | 0 | 0 | | | 13 | | DECBRn | B0, L0 | 2 | 2 | | 5 | 10 | 5 | 0 | 0 | | | 14 | | RTS | | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | 15 | Calib: | MOVx | #START, X4 | 4 | 2 | | 1 | 2 | 2 | 0 | 0 | | | 16 | | MOVb | #5, B0 | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | 17 | L1: | MOVb | #1, (X4)+ | 2 | 2 | 2 | 5 | 10 | 5 | 5 | 5 | | | 18 | | MOVw | (X4)+, W0 | 2 | 2 | 2 | 5 | 10 | 5 | 5 | 5 | | | 19 | | MOVwds | W0, D0 | 3 | 2 | | 5 | 10 | 10 | 0 | 0 | | | 20 | | SUBd | D0, (X4)+ | 2 | 3 | 4 | 5 | 15 
| 5 | 10 | 5 | | | 21 | | DECBRn | B0, L0 | 2 | 2 | | 5 | 10 | 5 | 0 | 0 | | | 22 | | RTS | -, - | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | | | Total | | 50 | 41 | 16 | 66 | 145 | 78 | 40 | 30 | | | | | | PoEfAr Calculat | | | | | | | | | | | 1 | Main: | MOVb | #nRows1, B6 | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | 2 | | MOVb | #nCols1, B7 | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | 3 | | MOVx | #M1, X4 | 4 | 2 | | 1 | 2 | 2 | 0 | 0 | | | 4 | | BRS | INIT | 2 | 2 | | 1 | 2 | 1 | 0 | 0 | | | 5 | | MOVb | #nRows2, B6 | 2 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | | | 6 | | MOVb | #nCols2, B7 | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | 7 | | MOVx | #M2, X4 | 4 | 2 | | 1 | 2 | 2 | 0 | 0 | | | 8 | | BRS | INIT | 2 | 2 | 1 | 1 | 2 | 1 | 0 | 0 | | | 9 | | Inx X4,#3 | #M1,#M2,#M3 | 9 | 5 | | 1 | 5 | 5 | 0 | 0 | | | 10 | | MOVb | #5, B5 | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | | L3: | MOVb | #3, B4 | 2 | 1 | | 5 | 5 | 5 | 0 | 0 | | | | L2: | MOVb | #4, B1 | 2 | 1 | | 15 | 15 | 15 | 0 | 0 | | | 13 | | MOVd | #0, D3 | 2 | 2 | | 15 | 30 | 15 | 0 | 0 | | | | L1: | MOVd | (X4)+, D2 | 2 | 4 | 4 | 60 | 240 | 60 | 120 | 60 | | | | | Next Page | | | | | | | | | | | | | | | | St | atic Res | ults | Dynamic Results | | | | | | |----|-------|---------|------------|--------|-----------|----------|-----------------|--------|-----|------|---------------|--| | # | | Instruc | tion | Instr | Instr | DBytes | # of | Exec. | | | ffic (Cycles) | | | " | | | | Bytes | Cycles | Moved | Exec. 
| Cycles | IM | DM16 | DMem32 | | | 15 | | MULd | @X3, D2 | 2 | 8 | 4 | 60 | 480 | 60 | 120 | 60 | | | 16 | | ADDd | D2, D3 | 2 | 2 | | 60 | 120 | 60 | 0 | 0 | | | 17 | | ADDx | #20, X3 | 3 | 2 | | 60 | 120 | 120 | 0 | 0 | | | 18 | | DECBRn | B1, L1 | 2 | 2 | | 60 | 120 | 60 | 0 | 0 | | | 19 | | MOVd | D3, (X5)+ | 2 | 4 | 4 | 15 | 60 | 15 | 30 | 15 | | | 20 | | SUBx | #56, X3 | 3 | 2 | | 15 | 30 | 30 | 0 | 0 | | | 21 | | DECBRn | B4, L2 | 2 | 2 | | 15 | 30 | 15 | 0 | 0 | | | 22 | | MOVx | #M2, X3 | 4 | 2 | | 5 | 10 | 10 | 0 | 0 | | | 23 | | DECBRn | B5, L3 | 2 | 2 | | 5 | 10 | 5 | 0 | 0 | | | 24 | | RTS | | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | 25 | INIT: | MOVd | #0, D2 | 2 | 1 | | 2 | 2 | 2 | 0 | 0 | | | 26 | | MOVd | #0, D3 | 2 | 1 | | 2 | 2 | 2 | 0 | 0 | | | 27 | L1: | MOVd | D3, (X4)+ | 2 | 4 | 4 | 32 | 128 | 32 | 64 | 32 | | | 28 | | ADDd | #1, D3 | 2 | 1 | | 32 | 32 | 32 | 0 | 0 | | | 29 | | DECBRn | B7, L1 | 2 | 2 | | 32 | 64 | 32 | 0 | 0 | | | 30 | | ADDd | #1, D2 | 2 | 2 | | 32 | 64 | 32 | 0 | 0 | | | 31 | | MOVd | D2, D3 | 2 | 1 | | 32 | 32 | 32 | 0 | 0 | | | 32 | | DECBRn | B6, L1 | 2 | 2 | | 32 | 64 | 32 | 0 | 0 | | | 33 | | RTS | | 2 | 1 | | 2 | 2 | 2 | 0 | 0 | | | | | Total | ' | 81 | 68 | 16 | 599 | 1679 | 685 | 334 | 167 | | | | | | MePoEfAr | Calcul | ations fo | r Benchr | nark 6: | FIR | • | ! 
| | | | 1 | Main: | MOVx | #COEFF,X4 | 4 | 2 | | 1 | 2 | 2 | 0 | 0 | | | 2 | | MOVb | #18,B0 | 3 | 2 | | 1 | 2 | 2 | 0 | 0 | | | 3 | | MOVf | #5,F0 | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | 4 | L1: | MOVf | #1,F2 | 2 | 1 | | 18 | 18 | 18 | 0 | 0 | | | 5 | | DIVf | F0,F2 | 2 | 1 | | 18 | 18 | 18 | 0 | 0 | | | 6 | | MOVf | F2,(X4)+ | 2 | 2 | 4 | 18 | 36 | 18 | 36 | 18 | | | 7 | | ADDf | #1,F0 | 2 | 1 | | 18 | 18 | 18 | 0 | 0 | | | 8 | | DECBRn | B0,L1 | 2 | 1 | | 18 | 18 | 18 | 0 | 0 | | | 9 | | MOVx | #INPUT,X4 | 4 | 2 | | 1 | 2 | 2 | 0 | 0 | | | 10 | | MOVb | #68,B0 | 3 | 2 | | 1 | 2 | 2 | 0 | 0 | | | 11 | | MOVd | #2,D2 | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | 12 | L2: | MOVw | W2,(X4)+ | 2 | 3 | 2 | 68 | 204 | 68 | 68 | 68 | | | 13 | | DECBRn | B0,L2 | 2 | 1 | | 68 | 68 | 68 | 0 | 0 | | | 14 | | MOVx | #COEFF,X4 | 4 | 2 | | 1 | 2 | 2 | 0 | 0 | | | 15 | | MOVx | #INPUT,X2 | 4 | 2 | | 1 | 2 | 2 | 0 | 0 | | | 16 | | MOVx | #OUTPUT,X6 | 4 | 2 | | 1 | 2 | 2 | 0 | 0 | | | 17 | | MOVb | #36,B1 | 3 | 2 | | 1 | 2 | 2 | 0 | 0 | | | 18 | L4: | MOVb | #8,B2 | 2 | 1 | | 36 | 36 | 36 | 0 | 0 | | | 19 | | MOVf | #0,F1 | 2 | 1 | | 36 | 36 | 36 | 0 | 0 | | | 20 | L3: | MOVx | #16,X3 | 3 | 2 | | 304 | 608 | 608 | 0 | 0 | | | 21 | | SUBbx | B2,X3 | 3 | 2 | | 304 | 608 | 608 | 0 | 0 | | | 22 | | ADDbx | B1,X3 | 3 | 2 | | 304 | 608 | 608 | 0 | 0 | | | 23 | | MULx | #2,X3 | 2 | 1 | | 304 | 304 | 304 | 0 | 0 | | | 24 | | ADDx | X2,X3 | 2 | 1 | | 304 | 304 | 304 | 0 | 0 | | | 25 | | MOVw | @X3,W3 | 2 | 2 | 2 | 304 | 608 | 304 | 304 | 304 | | | 26 | | MOVbxs | B1,X3 | 3 | 2 | | 304 | 608 | 608 | 0 | 0 | | | 27 | | ADDbxs | B2,X3 | 3 | 2 | | 304 | 608 | 608 | 0 | 0 | | | 28 | | MULx | #2,X3 | 2 | 1 | | 304 | 304 | 304 | 0 | 0 | | | 29 | | ADDx | X2,X3 | 2 | 1 | | 304 | 304 | 304 | 0 | 0 | | | | | | | St | atic Res | ults | Dynamic Results | | | | | | |----|------|---------|-----------|-------|----------|--------|-----------------|--------|------|---------|---------------|--| | # | | Instruc | tion | 
Instr | Instr | DBytes | # of | Exec. | Mem | ory Tra | ffic (Cycles) | | | | | | | Bytes | Cycles | Moved | Exec. | Cycles | IM | DM16 | DMem32 | | | 30 | | ADDd | @X3,W3 | 2 | 2 | 2 | 304 | 608 | 304 | 304 | 304 | | | 31 | | MOVwfs | W3,F3 | 3 | 2 | | 304 | 608 | 608 | 0 | 0 | | | 32 | | MULdfs | (X4)+,F3 | 3 | 2 | 4 | 304 | 608 | 608 | 608 | 304 | | | 33 | | ADDf | F3,F1 | 2 | 1 | | 304 | 304 | 304 | 0 | 0 | | | 34 | | DECBRn | B2,L3 | 2 | 2 | | 304 | 608 | 304 | 0 | 0 | | | 35 | | MOVf | 32(X0),F5 | 3 | 4 | 4 | 36 | 144 | 72 | 72 | 36 | | | 36 | | MOVx | #8,X3 | 2 | 1 | | 36 | 36 | 36 | 0 | 0 | | | 37 | | ADDbxs | B1,X3 | 3 | 2 | | 36 | 72 | 72 | 0 | 0 | | | 38 | | MULx | #4,X3 | 2 | 1 | | 36 | 36 | 36 | 0 | 0 | | | 39 | | ADDx | X2,X3 | 2 | 1 | | 36 | 36 | 36 | 0 | 0 | | | 40 | | MOVw | @X3,W3 | 2 | 2 | 4 | 36 | 72 | 36 | 72 | 36 | | | 41 | | MULwfs | W3,F5 | 3 | 1 | | 36 | 36 | 72 | 0 | 0 | | | 42 | | ADDf | F1,F5 | 2 | 1 | | 36 | 36 | 36 | 0 | 0 | | | 43 | | MOVf | F5,(X6)+ | 2 | 2 | 4 | 36 | 72 | 36 | 72 | 36 | | | 44 | | DECBRn | B1,L4 | 2 | 2 | | 36 | 72 | 36 | 0 | 0 | | | 45 | End: | RTS | | 2 | 1 | | 1 | 1 | 1 | 0 | 0 | | | | | Total | | 113 | 73 | 26 | 5229 | 8683 | 7473 | 1536 | 1106 | | ### D.2 Atmel AVR AT90S851 Calculations Details here will be the code for benchmark 2 Table D.2: Atmel AVR Calculations | | | | | St | atic Res | ults | Dynamic Results | | | | | | |----|-------|---------|--------------|----------|-----------|---------|-----------------|-----------|------------|---------------|--|--| | # | | Instruc | tion | Instr. | Instr. | DBytes | No. of | Exec. | Memory Tra | ffic (Cycles) | | | | | | | | Bytes | Cycles | Moved | Exec. | Cycles | Instr. 
Mem | Data Mem | | | | | | A | tmel AVR Cal | culation | ns for Be | nchmark | 1: Recu | rsive Fac | torial | | | | | 1 | Main: | LDI | R18,0x05 | 2 | 1 | | 1 | 1 | 1 | 0 | | | | 2 | | LDI | R19,0x00 | 2 | 1 | | 1 | 1 | 1 | 0 | | | | 3 | | RCALL | Fact | 2 | 3 | | 1 | 3 | 1 | 0 | | | | 4 | | RET | | 2 | 4 | | 1 | 4 | 1 | 0 | | | | 5 | Fact: | PUSH | R18 | 2 | 2 | 1 | 5 | 10 | 5 | 5 | | | | 6 | | PUSH | R19 | 2 | 2 | 1 | 5 | 10 | 5 | 5 | | | | 7 | | CPI | R18,0x02 | 2 | 1 | | 5 | 5 | 5 | 0 | | | | 8 | | CPC | R19,R1 | 2 | 1 | | 5 | 5 | 5 | 0 | | | | 9 | | BRGE | L0 | 2 | 1 | | 5 | 5 | 5 | 0 | | | | 10 | | LDI | R22,0x01 | 2 | 1 | | 1 | 1 | 1 | 0 | | | | 11 | | LDI | R23,0x00 | 2 | 1 | | 1 | 1 | 1 | 0 | | | | 12 | | LDI | R24,0x00 | 2 | 1 | | 1 | 1 | 1 | 0 | | | | 13 | | LDI | R25,0x00 | 2 | 1 | | 1 | 1 | 1 | 0 | | | | 14 | | RJMP | L1 | 2 | 1 | | 1 | 1 | 1 | 0 | | | | 15 | L0: | MOV | R20,R18 | 2 | 1 | | 4 | 4 | 4 | 0 | | | | 16 | | MOV | R21,R19 | 2 | 1 | | 4 | 4 | 4 | 0 | | | | 17 | | SBIW | R18,0x01 | 2 | 1 | | 4 | 4 | 4 | 0 | | | | 18 | | RCALL | Fact | 2 | 3 | | 4 | 12 | 4 | 0 | | | | 19 | | MOV | R18,R20 | 2 | 1 | | 4 | 4 | 4 | 0 | | | | | | | | St | atic Res | ults | Dynamic Results | | | | | |----------|----------|---------|---------------------|---------|-----------|----------|-----------------|---------|------------|---------------|--| | # | | Instruc | tion | Instr. | Instr. | DBytes | No. of | Exec. | Memory Tra | ffic (Cycles) | | | | | | | Bytes | Cycles | Moved | Exec. | Cycles | Instr. 
Mem | Data Mem | | | 20 | | MOV | R19,R21 | 2 | 1 | | 4 | 4 | 4 | 0 | | | 21 | | CLR | R20 | 2 | 1 | | 4 | 4 | 4 | 0 | | | 22 | | CLR | R21 | 2 | 1 | | 4 | 4 | 4 | 0 | | | 23 | (27,50) | RCALL | Mult32 | 50 | 43 | | 4 | 172 | 100 | 0 | | | 24 | L1: | POP | R19 | 2 | 2 | 1 | 5 | 10 | 5 | 5 | | | 25 | | POP | R18 | 2 | 2 | 1 | 5 | 10 | 5 | 5 | | | 53 | | RET | | 2 | 4 | | 1 | 4 | 1 | 0 | | | | | Total | I | 100 | 82 | 4 | 81 | 285 | 177 | 20 | | | | | | Atmel AVR | Calcula | tions for | r Benchm | ark 2: St | ring Co | ру | | | | 1 | Main: | LDI | R30,0x60 | 2 | 1 | | 1 | 1 | 1 | 0 | | | 2 | | LDI | R31,0x00 | 2 | 1 | | 1 | 1 | 1 | 0 | | | 3 | | LDI | R28,0x70 | 2 | 1 | | 1 | 1 | 1 | 0 | | | 4 | | LDI | R29,0x00 | 2 | 1 | | 1 | 1 | 1 | 0 | | | 5 | | RCALL | strCopy | 2 | 3 | | 1 | 3 | 1 | 0 | | | 6 | | RET | | 2 | 4 | | 1 | 4 | 1 | 0 | | | 7 | strCopy: | LD | R24,Z+ | 2 | 2 | 1 | 13 | 26 | 13 | 13 | | | 8 | 113. | ST | Y+,R24 | 2 | 2 | 1 | 13 | 26 | 13 | 13 | | | 9 | | TST | R24 | 2 | 1 | | 13 | 13 | 13 | 0 | | | 10 | | BRNE | strCopy | 2 | 1 | | 13 | 13 | 13 | 0 | | | 11 | | RET | | 2 | 4 | | 1 | 4 | 1 | 0 | | | | | Total | | 22 | 21 | 2 | 59 | 93 | 59 | 26 | | | _ | | | Atmel AVR | | | | | | | | | | 1 | Main: | LDI | R30,0x00 | 2 | 1 | Benein | 1 | 1 | 1 | 0 | | | 2 | | LDI | R31,0x00 | 2 | 1 | | 1 | 1 | 1 | 0 | | | 3 | | LDI | R24,0x00 | 2 | 1 | | 1 | 1 | 1 | 0 | | | 4 | | LDI | R25,0x00 | 2 | 1 | | 1 | 1 | 1 | 0 | | | 5 | L0: | ST | Z+,R24 | 2 | 2 | 1 | 10 | 20 | 10 | 10 | | | 6 | 10. | ST | Z+,R25 | 2 | 2 | 1 | 10 | 20 | 10 | 10 | | | 7 | | ADIW | R24,0x01 | 2 | 1 | 1 | 10 | 10 | 10 | 0 | | | 8 | | CPI | R24,0x0A | 2 | 1 | | 10 | 10 | 10 | 0 | | | 9 | | CPC | R25,R1 | 2 | 1 | | 10 | 10 | 10 | 0 | | | 10 | | BRNE | L0 | 2 | 1 | | 10 | 10 | 10 | 0 | | | 11 | | RCALL | BSort | 2 | 3 | | 10 | 3 | 10 | 0 | | | 12 | | RET | BSoft | 2 | 4 | | 1 | 4 | 1 | 0 | | | 13 | BSort: | MOV | D20 D24 | 2 | 1 | | 1 | 1 | 1 | 0 | | | 14 | D5011. 
| MOV | R30,R24<br>R31,R25 | 2 | 1 | | 1 | 1 | 1 | 0 | | | 15 | | LDI | R18,0x08 | 2 | 1 | | 1 | 1 | 1 | 0 | | | 16 | | LDI | R19,0x00 | 2 | 1 | | 1 | 1 | 1 | 0 | | | 17 | L2: | LDI | R20,0x00 | 2 | 1 | | 9 | 9 | 9 | 0 | | | 18 | 112. | LDI | R20,0x00 | 2 | 1 | | 9 | 9 | 9 | 0 | | | 19 | L1: | LDD | R21,0x00<br>R22,Z+0 | 2 | 1 | 1 | 45 | 45 | 45 | 45 | | | 20 | L1. | LDD | R23,Z+1 | 2 | 1 | 1 | 45 | 45 | 45 | 45 | | | 20 | | LDD | R26,Z+2 | 2 | 1 | 1 | 45 | 45 | 45 | 45 | | | 22 | | LDD | R26,Z+2<br>R27,Z+3 | 2 | 1 | 1 | 45 | 45 | 45 | 45 | | | | | CP | | | | 1 | | | | | | | 23 | | CPC | R26,R22<br>R27,R23 | 2 | 1 | | 45 | 45 | 45 | 0 | | | 24<br>25 | | | | 2 | 1 | | 45 | 45 | 45 | 0 | | | | | BRGE | NoSwap | 2 | 1 | 1 | 45 | 45 | 45 | | | | 26 | | STD | Z+1,R27 | 2 | 1 | 1 | 45 | 45 | 45 | 45 | | | 27 | | STD | Z+0,R26 | 2 | 1 | 1 | 45 | 45 | 45 | 45 | | | 28 | | STD | Z+3,R23 | 2 | 1 | 1 | 45 | 45 | 45 | 45 | | | | | | | St | atic Res | sults | Dynamic Results | | | | | | |----|---------|-------------------|---------------|-----------|-----------|---------------|-----------------|-----------|-------------------------|----------|--|--| | # | | Instruc | tion | Instr. | Instr. | DBytes | No. of | Exec. | Memory Traffic (Cycles) | | | | | | | | | Bytes | Cycles | Moved | Exec. | Cycles | Instr. 
Mem | Data Mem | | | | 29 | | STD | Z+2,R22 | 2 | 1 | 1 | 45 | 45 | 45 | 45 | | | | 30 | NoSwap: | SUBI | R20,0xFF | 2 | 1 | | 45 | 45 | 45 | 0 | | | | 31 | | SBCI | R21,0xFF | 2 | 1 | | 45 | 45 | 45 | 0 | | | | 32 | | ADIW | R30,0x02 | 2 | 1 | | 45 | 45 | 45 | 0 | | | | 33 | | CP | R18,R20 | 2 | 1 | | 45 | 45 | 45 | 0 | | | | 34 | | CPC | R19,R21 | 2 | 1 | | 45 | 45 | 45 | 0 | | | | 35 | | BRGE | L1 | 2 | 1 | | 45 | 45 | 45 | 0 | | | | 36 | | SUBI | R18,0x01 | 2 | 1 | | 9 | 9 | 9 | 0 | | | | 37 | | SBCI | R19,0x00 | 2 | 1 | | 9 | 9 | 9 | 0 | | | | 38 | | SER | R20 | 2 | 1 | | 9 | 9 | 9 | 0 | | | | 39 | | CPI | R18,0xFF | 2 | 1 | | 9 | 9 | 9 | 0 | | | | 40 | | CPC | R19,R20 | 2 | 1 | | 9 | 9 | 9 | 0 | | | | 41 | | BRNE | L2 | 2 | 1 | | 9 | 9 | 9 | 0 | | | | 42 | | RET | | 2 | 4 | | 1 | 4 | 1 | 0 | | | | | | Total | I. | 84 | 52 | 10 | 908 | 936 | 908 | 380 | | | | | | | Atmel AVR C | alculatio | ons for E | l<br>Benchmar | k 4: Sens | sor Struc | ture | | | | | 1 | Main: | RCALL | Init | 2 | 3 | | 1 | 3 | 1 | 0 | | | | 2 | | RCALL | Calib | 2 | 3 | | 1 | 3 | 1 | 0 | | | | 3 | End: | RET | | 2 | 4 | | 1 | 4 | 1 | 0 | | | | 4 | Init: | MOV | R30,#STLo | 2 | 1 | | 1 | 1 | 1 | 0 | | | | 5 | | MOV | R31,#STHi | 2 | 1 | | 1 | 1 | 1 | 0 | | | | 6 | | MOV | R20,#3 | 2 | 1 | | 1 | 1 | 1 | 0 | | | | 7 | | CLR | R21 | 2 | 1 | | 1 | 1 | 1 | 0 | | | | 8 | | CLR | R22 | 2 | 1 | | 1 | 1 | 1 | 0 | | | | 9 | | CLR | R23 | 2 | 1 | | 1 | 1 | 1 | 0 | | | | 10 | | MOV | R15,#0 | 2 | 1 | | 1 | 1 | 1 | 0 | | | | 11 | | MOV | R16,#0 | 2 | 1 | | 1 | 1 | 1 | 0 | | | | 12 | L0: | STD | Z+,#0 | 2 | 2 | 1 | 5 | 10 | 5 | 5 | | | | 13 | LU. 
| STD | | 2 | 2 | 1 | 5 | 10 | 5 | 5 | | | | 14 | | STD | Z+,R15 | 2 | 2 | 1 | 5 | 10 | | 5 | | | | | | INC | Z+,R16<br>R20 | 2 | | 1 | | | 5 | 0 | | | | 15 | | | | | 1 | | 5 | 5 | 5 | | | | | 16 | | ADC | R21,R16 | 2 | 1 | | 5 | 5 | 5 | 0 | | | | 17 | | ADC | R22,#0 | 2 | 1 | | 5 | 5 | 5 | 0 | | | | 18 | | ADC | R23,#0 | 2 | 1 | | 5 | 5 | 5 | 0 | | | | 19 | | STD | z+,R20 | 2 | 2 | 1 | 5 | 10 | 5 | 5 | | | | 20 | | STD | z+,R21 | 2 | 2 | 1 | 5 | 10 | 5 | 5 | | | | 21 | | STD | z+,R22 | 2 | 2 | 1 | 5 | 10 | 5 | 5 | | | | 22 | | STD | z+,R23 | 2 | 2 | 1 | 5 | 10 | 5 | 5 | | | | 23 | | INC | R15 | 2 | 1 | | 5 | 5 | 5 | 0 | | | | 24 | | CPI | R15,#5 | 2 | 1 | | 5 | 5 | 5 | 0 | | | | 25 | | BRNE | L0 | 2 | 1 | | 5 | 5 | 5 | 0 | | | | 26 | | RET | | 2 | 4 | | 1 | 4 | 1 | 0 | | | | 27 | Calib: | MOV | R30,#STLo | 2 | 1 | | 1 | 1 | 1 | 0 | | | | 28 | | MOV | R31,#STHi | 2 | 1 | | 1 | 1 | 1 | 0 | | | | 29 | | MOV | R15,#5 | 2 | 1 | | 1 | 1 | 1 | 0 | | | | 30 | L1: | STD | Z+,#1 | 2 | 2 | 1 | 5 | 10 | 5 | 5 | | | | 31 | | MOV | R18,z+ | 2 | 2 | 1 | 5 | 10 | 5 | 5 | | | | 32 | | MOV | R19,z+ | 2 | 2 | 1 | 5 | 10 | 5 | 5 | | | | 33 | | SUB | z+,R18 | 2 | 2 | 2 | 5 | 10 | 5 | 10 | | | | 34 | | $_{\mathrm{SBC}}$ | z+,R19 | 2 | 2 | 2 | 5 | 10 | 5 | 10 | | | | | | | | St | atic Res | ults | Dynamic Results | | | | | |----|------------------|---------|--------------|----------|----------|----------|-----------------|--------|------------|---------------|--| | # | | Instruc | tion | Instr. | Instr. | DBytes | No. of | Exec. | | ffic (Cycles) | | | " | | | | Bytes | Cycles | Moved | Exec. | Cycles | Instr. Mem | Data Mem | | | 35 | | SBCI | z+,#0 | 2 | 2 | 2 | 5 | 10 | 5 | 10 | | | 36 | | SBCI | z+,#0 | 2 | 2 | 2 | 5 | 10 | 5 | 10 | | | 37 | | DEC | R15 | 2 | 1 | | 5 | 5 | 5 | 0 | | | 38 | | BRNE | L1 | 2 | 1 | | 5 | 5 | 5 | 0 | | | 39 | | RET | | 2 | 4 | | 1 | 4 | 1 | 0 | | | | | Total | I. 
| 78 | 66 | 18 | 131 | 214 | 131 | 90 | | | | | Atı | mel AVR Calc | ulations | for Ber | chmark 5 | 6: Matrix | Multip | lication | | | | 1 | Main: | MOV | R15,#nRows1 | 2 | 1 | | 1 | 1 | 1 | 0 | | | 2 | | MOV | R16,#nCols1 | 2 | 1 | | 1 | 1 | 1 | 0 | | | 3 | | MOV | R30,#M1Lo | 2 | 1 | | 1 | 1 | 1 | 0 | | | 4 | | MOV | R31,#M1Hi | 2 | 1 | | 1 | 1 | 1 | 0 | | | 5 | | RCALL | INIT | 2 | 3 | | 1 | 3 | 1 | 0 | | | 6 | | MOV | R15,#nRows2 | 2 | 1 | | 1 | 1 | 1 | 0 | | | 7 | | MOV | R16,#nCols2 | 2 | 1 | | 1 | 1 | 1 | 0 | | | 8 | | MOV | R30,#M2Lo | 2 | 1 | | 1 | 1 | 1 | 0 | | | 9 | | MOV | R31,#M2Hi | 2 | 1 | | 1 | 1 | 1 | 0 | | | 10 | | RCALL | INIT | 2 | 3 | | 1 | 3 | 1 | 0 | | | 11 | | MOV | R26,#M1Lo | 2 | 1 | | 1 | 1 | 1 | 0 | | | 12 | | MOV | R27,#M1Hi | 2 | 1 | | 1 | 1 | 1 | 0 | | | 13 | | MOV | R28,#M2Lo | 2 | 1 | | 1 | 1 | 1 | 0 | | | 14 | | MOV | R29,#M2Hi | 2 | 1 | | 1 | 1 | 1 | 0 | | | 15 | | MOV | R30,#M3Lo | 2 | 1 | | 1 | 1 | 1 | 0 | | | 16 | | MOV | R31,#M3Hi | 2 | 1 | | 1 | 1 | 1 | 0 | | | 17 | | MOV | R15,#5 | 2 | 1 | | 1 | 1 | 1 | 0 | | | 18 | L3: | MOV | R16,#3 | 2 | 1 | | 5 | 5 | 5 | 0 | | | 19 | L2: | MOV | R17,#4 | 2 | 1 | | 15 | 15 | 15 | 0 | | | 20 | | LDI | Z+0,#0 | 2 | 1 | 1 | 15 | 15 | 15 | 15 | | | 21 | | LDI | Z+1,#0 | 2 | 1 | 1 | 15 | 15 | 15 | 15 | | | 22 | | LDI | Z+2,#0 | 2 | 1 | 1 | 15 | 15 | 15 | 15 | | | 23 | | LDI | Z+3,#0 | 2 | 1 | 1 | 15 | 15 | 15 | 15 | | | 24 | L1: | MOV | R18,X+ | 2 | 2 | 1 | 60 | 120 | 60 | 60 | | | 25 | | MOV | R19,X+ | 2 | 2 | 1 | 60 | 120 | 60 | 60 | | | 26 | | MOV | R20,X+ | 2 | 2 | 1 | 60 | 120 | 60 | 60 | | | 27 | | MOV | R21,X+ | 2 | 2 | 1 | 60 | 120 | 60 | 60 | | | 28 | | MOV | R22,Y+ | 2 | 2 | 1 | 60 | 120 | 60 | 60 | | | 29 | | MOV | R23,Y+ | 2 | 2 | 1 | 60 | 120 | 60 | 60 | | | 30 | | MOV | R24,Y+ | 2 | 2 | 1 | 60 | 120 | 60 | 60 | | | 31 | | MOV | R25,Y+ | 2 | 2 | 1 | 60 | 120 | 60 | 60 | | | 32 | (35,50) | RCALL | Mult32 | 72 | 53 | | 60 | 3180 | 2160 | 0 | | | 33 | | ADD | Z+0,R22 | 2 | 2 | 1 | 
60 | 120 | 60 | 60 | | | 34 | | ADD | Z+1,R23 | 2 | 2 | 1 | 60 | 120 | 60 | 60 | | | 35 | | ADD | Z+2,R24 | 2 | 2 | 1 | 60 | 120 | 60 | 60 | | | 36 | | ADD | Z+3,R25 | 2 | 2 | 1 | 60 | 120 | 60 | 60 | | | 37 | | ADIW | R29:R28,#20 | 2 | 2 | | 60 | 120 | 60 | 0 | | | 38 | | DEC | R17 | 2 | 1 | | 60 | 60 | 60 | 0 | | | 39 | | BRNE | L1 | 2 | 1 | | 60 | 60 | 60 | 0 | | | 40 | | ADIW | R31:R30,#4 | 2 | 2 | | 15 | 30 | 15 | 0 | | | 41 | | SBIW | R29:R28,#56 | 2 | 2 | | 15 | 30 | 15 | 0 | | | 42 | | DEC | R16 | 2 | 1 | | 15 | 15 | 15 | 0 | | | 43 | | BRNE | L2 | 2 | 1 | | 15 | 15 | 15 | 0 | | | | l<br>tinued on l | | | 1 | L | | | I | I . | L | | | | | | | St | atic Res | sults | Dynamic Results | | | | | |-----|-----------|---------|-----------|--------|-----------|-----------|-----------------|--------|------------|----------------|--| | # | | Instruc | tion | Instr. | Instr. | DBytes | No. of | Exec. | Memory Tra | iffic (Cycles) | | | | | | | Bytes | Cycles | Moved | Exec. | Cycles | Instr. 
Mem | Data Mem | | | 44 | | MOV | R28,#M2Lo | 2 | 1 | | 5 | 5 | 5 | 0 | | | 45 | | MOV | R29,#M2Hi | 2 | 1 | | 5 | 5 | 5 | 0 | | | 46 | | DEC | R15 | 2 | 1 | | 5 | 5 | 5 | 0 | | | 47 | | BRNE | L3 | 2 | 1 | | 5 | 5 | 5 | 0 | | | 48 | | RET | | 2 | 4 | | 1 | 4 | 1 | 0 | | | 49 | INIT: | CLR | R23 | 2 | 1 | | 2 | 2 | 2 | 0 | | | 50 | | CLR | R24 | 2 | 1 | | 2 | 2 | 2 | 0 | | | 51 | | CLR | R25 | 2 | 1 | | 2 | 2 | 2 | 0 | | | 52 | | CLR | R26 | 2 | 1 | | 2 | 2 | 2 | 0 | | | 53 | | CLR | R17 | 2 | 1 | | 2 | 2 | 2 | 0 | | | 54 | L1: | STD | Z+0,R23 | 2 | 2 | 1 | 32 | 64 | 32 | 32 | | | 55 | | STD | Z+1,R24 | 2 | 2 | 1 | 32 | 64 | 32 | 32 | | | 56 | | STD | Z+2,R25 | 2 | 2 | 1 | 32 | 64 | 32 | 32 | | | 57 | | STD | Z+3,R26 | 2 | 2 | 1 | 32 | 64 | 32 | 32 | | | 58 | | ADD | R23,#1 | 2 | 1 | | 32 | 32 | 32 | 0 | | | 59 | | ADC | R24,#0 | 2 | 1 | | 32 | 32 | 32 | 0 | | | 60 | | ADC | R25,#0 | 2 | 1 | | 32 | 32 | 32 | 0 | | | 61 | | ADC | R26,#0 | 2 | 1 | | 32 | 32 | 32 | 0 | | | 62 | | ADI | R30,#4 | 2 | 1 | | 32 | 32 | 32 | 0 | | | 63 | | ADC | R31,#0 | 2 | 1 | | 32 | 32 | 32 | 0 | | | 64 | | DEC | R16 | 2 | 1 | | 32 | 32 | 32 | 0 | | | 65 | | BRGT | L1 | 2 | 1 | | 32 | 32 | 32 | 0 | | | 66 | | INC | R17 | 2 | 1 | | 32 | 32 | 32 | 0 | | | 67 | | MOV | R17,R23 | 2 | 1 | | 32 | 32 | 32 | 0 | | | 68 | | DEC | R15 | 2 | 1 | | 32 | 32 | 32 | 0 | | | 69 | | BRGT | L1 | 2 | 1 | | 32 | 32 | 32 | 0 | | | 105 | | RET | | 2 | 4 | | 2 | 8 | 2 | 0 | | | | | Total | I | 210 | 151 | 20 | 1662 | 5733 | 3762 | 908 | | | | | | Atmel A | VR Ca | lculation | s for Ber | chmark ( | 6: FIR | I | | | | 1 | Main: | STD | Y+8,R1 | 2 | 2 | 1 | 1 | 2 | 1 | 1 | | | 2 | | STD | Y+7,R1 | 2 | 2 | 1 | 1 | 2 | 1 | 1 | | | 3 | L0: | LDD | R16,Y+7 | 2 | 2 | 1 | 18 | 36 | 18 | 18 | | | 4 | | LDD | R17,Y+8 | 2 | 2 | 1 | 18 | 36 | 18 | 18 | | | 5 | | LDD | R22,Y+7 | 2 | 2 | 1 | 18 | 36 | 18 | 18 | | | 6 | | LDD | R23,Y+8 | 2 | 2 | 1 | 18 | 36 | 18 | 18 | | | 7 | | CLR | R24 | 2 | 1 | | 18 | 18 | 18 | 0 | | | 8 
| | CLR | R25 | 2 | 1 | | 18 | 18 | 18 | 0 | |
| 9 | | RCALL | INT2FLOAT | 2 | 3 | | 18 | 54 | 18 | 0 | |
| 10 | | LDI | R18,0x00 | 2 | 1 | | 18 | 18 | 18 | 0 | |
| 11 | | LDI | R19,0x00 | 2 | 1 | | 18 | 18 | 18 | 0 | |
| 12 | | LDI | R20,0xA0 | 2 | 1 | | 18 | 18 | 18 | 0 | |
| 13 | | LDI | R21,0x40 | 2 | 1 | | 18 | 18 | 18 | 0 | |
| 14 | | RCALL | FADD | 2 | 3 | | 18 | 54 | 18 | 0 | |
| 15 | | LDI | R18,0x00 | 2 | 1 | | 18 | 18 | 18 | 0 | |
| 16 | | LDI | R19,0x00 | 2 | 1 | | 18 | 18 | 18 | 0 | |
| 17 | | LDI | R20,0x80 | 2 | 1 | | 18 | 18 | 18 | 0 | |
| 18 | | LDI | R22,0x3F | 2 | 1 | | 18 | 18 | 18 | 0 | |
| 19 | | RCALL | FDIV | 2 | 3 | | 18 | 54 | 18 | 0 | |
| 20 | | MOV | R30,R16 | 2 | 1 | | 18 | 18 | 18 | 0 | |
| 21 | | MOV | R31,R17 | 2 | 1 | | 18 | 18 | 18 | 0 | |

| | | | | | Static Results | | Dynamic Results | | | | | |
|----------|-----|---------|----------|--------|----------|--------|-----------------|--------|------------|---------------|--|--|
| # | | Instruction | | Instr. | Instr. | DBytes | No. of | Exec. | Memory Traffic (Cycles) | | | |
| | | | | Bytes | Cycles | Moved | Exec. | Cycles | Instr. 
Mem | Data Mem | | | | 22 | | LSL | R30 | 2 | 1 | | 18 | 18 | 18 | 0 | | | | 23 | | ROL | R31 | 2 | 1 | | 18 | 18 | 18 | 0 | | | | 24 | | LSL | R30 | 2 | 1 | | 18 | 18 | 18 | 0 | | | | 25 | | ROL | R31 | 2 | 1 | | 18 | 18 | 18 | 0 | | | | 26 | | LDI | R20,0x00 | 2 | 1 | | 18 | 18 | 18 | 0 | | | | 27 | | LDI | R21,0x00 | 2 | 1 | | 18 | 18 | 18 | 0 | | | | 28 | | ADD | R30,R20 | 2 | 1 | | 18 | 18 | 18 | 0 | | | | 29 | | ADC | R31,R21 | 2 | 1 | | 18 | 18 | 18 | 0 | | | | 30 | | STD | Z+0,R24 | 2 | 2 | 1 | 18 | 36 | 18 | 18 | | | | 31 | | STD | Z+1,R25 | 2 | 2 | 1 | 18 | 36 | 18 | 18 | | | | 32 | | STD | Z+2,R26 | 2 | 2 | 1 | 18 | 36 | 18 | 18 | | | | 33 | | STD | Z+3,R27 | 2 | 2 | 1 | 18 | 36 | 18 | 18 | | | | 34 | | LDD | R24,Y+7 | 2 | 2 | | 18 | 36 | 18 | 0 | | | | 35 | | LDD | R25,Y+8 | 2 | 2 | | 18 | 36 | 18 | 0 | | | | 36 | | ADIW | R24,0x01 | 2 | 1 | | 18 | 18 | 18 | 0 | | | | 37 | | STD | Y+8,R25 | 2 | 2 | 1 | 18 | 36 | 18 | 18 | | | | 38 | | STD | Y+7,R24 | 2 | 2 | 1 | 18 | 36 | 18 | 18 | | | | 39 | | CPI | R24,0x11 | 2 | 1 | | 18 | 18 | 18 | 0 | | | | 40 | | CPC | R25,R1 | 2 | 1 | | 18 | 18 | 18 | 0 | | | | 41 | | BRLT | LO | 2 | 1 | | 18 | 18 | 18 | 0 | | | | 42 | | STD | Y+8,R1 | 2 | 2 | 1 | 1 | 2 | 1 | 1 | | | | 43 | | STD | Y+7,R1 | 2 | 2 | 1 | 1 | 2 | 1 | 1 | | | | 44 | L1: | LDD | R30,Y+7 | 2 | 2 | 1 | 36 | 72 | 36 | 36 | | | | 45 | | LDD | R31,Y+8 | 2 | 2 | 1 | 36 | 72 | 36 | 36 | | | | 46 | | SUBI | R18,0x00 | 2 | 1 | | 36 | 36 | 36 | 0 | | | | 47 | | SBCI | R19,0x01 | 2 | 1 | | 36 | 36 | 36 | 0 | | | | 48 | | LSL | R30 | 2 | 1 | | 36 | 36 | 36 | 0 | | | | 49 | | ROL | R31 | 2 | 1 | | 36 | 36 | 36 | 0 | | | | 50 | | ADD | R30,R18 | 2 | 1 | | 36 | 36 | 36 | 0 | | | | 51 | | ADC | R31,R19 | 2 | 1 | | 36 | 36 | 36 | 0 | | | | 52 | | LDI | R18,0x02 | 2 | 1 | | 36 | 36 | 36 | 0 | | | | 53 | | LDI | R19,0x00 | 2 | 1 | | 36 | 36 | 36 | 0 | | | | 54 | | STD | Z+1,R19 | 2 | 2 | 1 | 36 | 72 | 36 | 36 | | | | 55 | | STD | Z+0,R18 | 2 | 2 | 1 | 36 | 72 | 36 | 36 
| | |
| 56 | | LDD | R24,Y+7 | 2 | 2 | 1 | 36 | 72 | 36 | 36 | | |
| 57 | | LDD | R25,Y+8 | 2 | 2 | 1 | 36 | 72 | 36 | 36 | | |
| 58 | | ADIW | R24,0x01 | 2 | 1 | | 36 | 36 | 36 | 0 | | |
| 59 | | STD | Y+8,R25 | 2 | 2 | 1 | 36 | 72 | 36 | 36 | | |
| 60 | | STD | Y+7,R24 | 2 | 2 | 1 | 36 | 72 | 36 | 36 | | |
| 61 | | CPI | R24,0x43 | 2 | 1 | | 36 | 36 | 36 | 0 | | |
| 62 | | CPC | R25,R1 | 2 | 1 | | 36 | 36 | 36 | 0 | | |
| 63 | | BRLT | L1 | 2 | 1 | | 36 | 36 | 36 | 0 | | |
| 64 | | STD | Y+6,R1 | 2 | 2 | | 1 | 2 | 1 | 0 | | |
| 65 | | STD | Y+5,R1 | 2 | 2 | | 1 | 2 | 1 | 0 | | |
| 66 | L2: | LDI | R24,0x00 | 2 | 1 | | 36 | 36 | 36 | 0 | | |
| 67 | | LDI | R25,0x00 | 2 | 1 | | 36 | 36 | 36 | 0 | | |
| 68 | | LDI | R26,0x00 | 2 | 1 | | 36 | 36 | 36 | 0 | | |
| 69 | | LDI | R27,0x00 | 2 | 1 | | 36 | 36 | 36 | 0 | | |
| 70 | | STD | Y+1,R24 | 2 | 2 | 1 | 36 | 72 | 36 | 36 | | |

| | | | | | Static Results | | Dynamic Results | | | | |
|-----|-----|---------|----------|--------|----------|--------|-------------------------------------|--------|------------|----------|--|
| # | | Instruction | | Instr. | Instr. | DBytes | No. of | Exec. | Memory Traffic (Cycles) | | |
| | | | | Bytes | Cycles | Moved | Exec. | Cycles | Instr. 
Mem | Data Mem | | | 72 | | STD | Y+3,R26 | 2 | 2 | 1 | 36 | 72 | 36 | 36 | | | 73 | | STD | Y+4,R27 | 2 | 2 | 1 | 36 | 72 | 36 | 36 | | | 74 | | STD | Y+8,R1 | 2 | 2 | 1 | 36 | 72 | 36 | 36 | | | 75 | | STD | Y+7,R1 | 2 | 2 | 1 | 36 | 72 | 36 | 36 | | | 76 | L3: | LDD | R30,Y+7 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | | 77 | | LDD | R31,Y+8 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | | 78 | | LSL | R30 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 79 | | ROL | R31 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 80 | | LSL | R30 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 81 | | ROL | R31 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 82 | | ADDI | R30,0x00 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 83 | | ADIC | R31,0x00 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 84 | | LDD | R14,Z+0 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | | 85 | | LDD | R15,Z+1 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | | 86 | | LDD | R16,Z+2 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | | 87 | | LDD | R17,Z+3 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | | 88 | | LDD | R24,Y+5 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | | 89 | | LDD | R25,Y+6 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | | 90 | | LDD | R24,Y+7 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | | 91 | | LDD | R25,Y+8 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | | 92 | | SUB | R30,R24 | 2 | 1 | _ | 304 | 304 | 304 | 0 | | | 93 | | SBC | R31,R25 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 94 | | ADDI | R30,0x00 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 95 | | ADCI | R31,0x10 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 96 | | LSL | R30 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 97 | | ROL | R31 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 98 | | ADDI | R30,0x00 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 99 | | ADCI | R31,0x01 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 100 | | LDD | R18,Z+0 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | | 101 | | LDD | R19,Z+1 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | | 102 | | LDD | R20,Y+5 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | | 103 | | LDD | R21,Y+6 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | | 104 | | LDD | R30,Y+7 | 2 | 2 | 1 | 
304 | 608 | 304 | 304 | | | 105 | | LDD | R31,Y+8 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | | 106 | | ADD | R30,R20 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 107 | | ADC | R31,R21 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 108 | | LSL | R30 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 109 | | ROL | R31 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 110 | | ADDI | R30,0x00 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 111 | | ADCI | R31,0x01 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 112 | | LDD | R24,Z+0 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | | 113 | | LDD | R25,Z+1 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | | 114 | | ADD | R18,R24 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 115 | | ADC | R19,R25 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 116 | | MOV | R22,R18 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 117 | | MOV | R23,R19 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 118 | | CLR | R24 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 119 | | SBRC | R23,7 | 2 | 1 | | 304 | 304 | 304 | 0 | | | 120 | | LAT | R24 | 2 | 1 | | 304 | 304 | 304 | 0 | | | | | MOV | R25,R24 | 2 | 1 | | 304 | 304 | 304 | 0 | | | | | | St | atic Res | ults | | Dyr | namic Results | | |-----|-----------------------|-----------|--------|----------|--------|--------|--------|---------------|----------------| | # | Instruc | tion | Instr. | Instr. | DBytes | No. of | Exec. | Memory Tra | affic (Cycles) | | | | | Bytes | Cycles | Moved | Exec. | Cycles | Instr. 
Mem | Data Mem | | 122 | RCALL | INT2FLOAT | 2 | 3 | | 304 | 912 | 304 | 0 | | 123 | MOV | R18,R14 | 2 | 1 | | 304 | 304 | 304 | 0 | | 124 | MOV | R19,R15 | 2 | 1 | | 304 | 304 | 304 | 0 | | 125 | MOV | R20,R16 | 2 | 1 | | 304 | 304 | 304 | 0 | | 126 | MOV | R21,R17 | 2 | 1 | | 304 | 304 | 304 | 0 | | 127 | RCALL | FMUL | 2 | 3 | | 304 | 912 | 304 | 0 | | 128 | LDD | R18,Y+1 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | 129 | LDD | R19,Y+2 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | 130 | LDD | R20,Y+3 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | 131 | LDD | R21,Y+4 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | 132 | RCALL | FADD | 2 | 3 | | 304 | 912 | 304 | 0 | | 133 | STD | Y+1,R22 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | 134 | STD | Y+2,R23 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | 135 | STD | Y+3,R24 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | 136 | STD | Y+4,R25 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | 137 | LDD | R24,Y+7 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | 138 | LDD | R25,Y+8 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | 139 | ADIW | R24,0x01 | 2 | 1 | | 304 | 304 | 304 | 0 | | 140 | STD | Y+8,R25 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | 141 | STD | Y+7,R24 | 2 | 2 | 1 | 304 | 608 | 304 | 304 | | 142 | CPI | R24,0x08 | 2 | 1 | _ | 304 | 304 | 304 | 0 | | 143 | CPC | R25,R1 | 2 | 1 | | 304 | 304 | 304 | 0 | | 144 | BRGE | L3 | 2 | 1 | | 304 | 304 | 304 | 0 | | 145 | LDD | R30,Y+5 | 2 | 2 | 1 | 36 | 72 | 36 | 36 | | 146 | LDD | R31,Y+6 | 2 | 2 | 1 | 36 | 72 | 36 | 36 | | 147 | ADIW | R30,0x08 | 2 | 1 | - | 36 | 36 | 36 | 0 | | 148 | LSL | R30 | 2 | 1 | | 36 | 36 | 36 | 0 | | 149 | ROL | R31 | 2 | 1 | | 36 | 36 | 36 | 0 | | 150 | ADDI | R30,0x00 | 2 | 1 | | 36 | 36 | 36 | 0 | | 151 | ADCI | R31,0x01 | 2 | 1 | | 36 | 36 | 36 | 0 | | 152 | LDD | R22,Z+0 | 2 | 2 | 1 | 36 | 72 | 36 | 36 | | 153 | LDD | R23,Z+1 | 2 | 2 | 1 | 36 | 72 | 36 | 36 | | 154 | CLR | R24 | 2 | 1 | 1 | 36 | 36 | 36 | 0 | | 155 | SBRC | R23,7 | 2 | 1 | | 36 | 36 | 36 | 0 | | 156 | LAT | R24 | 2 | 1 | | 36 | 36 | 36 | 0 | | 
157 | MOV | R25,R24 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 158 | RCALL | INT2FLOAT | 2 | 3 | | 36 | 108 | 36 | 0 |
| 159 | LDD | R18,Y+41 | 2 | 2 | 1 | 36 | 72 | 36 | 36 |
| 160 | LDD | R19,Y+42 | 2 | 2 | 1 | 36 | 72 | 36 | 36 |
| 161 | LDD | R20,Y+43 | 2 | 2 | 1 | 36 | 72 | 36 | 36 |
| 162 | LDD | R21,Y+44 | 2 | 2 | 1 | 36 | 72 | 36 | 36 |
| 163 | RCALL | FMUL | 2 | 3 | 1 | 36 | 108 | 36 | 0 |
| 164 | LDD | R18,Y+1 | 2 | 2 | 1 | 36 | 72 | 36 | 36 |
| 165 | LDD | R19,Y+2 | 2 | 2 | 1 | 36 | 72 | 36 | 36 |
| 166 | LDD | R20,Y+3 | 2 | 2 | 1 | 36 | 72 | 36 | 36 |
| 167 | LDD | R21,Y+4 | 2 | 2 | 1 | 36 | 72 | 36 | 36 |
| 168 | RCALL | FADD | 2 | 3 | | 36 | 108 | 36 | 0 |
| 169 | LDD | R30,Y+5 | 2 | 2 | 1 | 36 | 72 | 36 | 36 |
| 170 | LDD | R31,Y+6 | 2 | 2 | 1 | 36 | 72 | 36 | 36 |
| 171 | LSL | R30 | 2 | 1 | | 36 | 36 | 36 | 0 |

| | | | | Static Results | | | Dynamic Results | | |
|-----|---------|----------|--------|----------|--------|--------|--------|--------------|---------------|
| # | Instruction | | Instr. | Instr. | DBytes | No. of | Exec. | Memory Traffic (Cycles) | |
| | | | Bytes | Cycles | Moved | Exec. | Cycles | Instr. 
Mem | Data Mem |
| 172 | ROL | R31 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 173 | LSL | R30 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 174 | ROL | R31 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 175 | ADDI | | 2 | 1 | | 36 | 36 | 36 | 0 |
| 176 | ADCI | | 2 | 1 | | 36 | 36 | 36 | 0 |
| 177 | STD | Z+0,R22 | 2 | 2 | 1 | 36 | 72 | 36 | 36 |
| 178 | STD | Z+1,R23 | 2 | 2 | 1 | 36 | 72 | 36 | 36 |
| 179 | STD | Z+2,R24 | 2 | 2 | 1 | 36 | 72 | 36 | 36 |
| 180 | STD | Z+3,R25 | 2 | 2 | 1 | 36 | 72 | 36 | 36 |
| 181 | LDD | R24,Y+5 | 2 | 2 | 1 | 36 | 72 | 36 | 36 |
| 182 | LDD | R25,Y+6 | 2 | 2 | 1 | 36 | 72 | 36 | 36 |
| 183 | ADIW | R24,0x01 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 184 | STD | Y+6,R25 | 2 | 2 | 1 | 36 | 72 | 36 | 36 |
| 185 | STD | Y+5,R24 | 2 | 2 | 1 | 36 | 72 | 36 | 36 |
| 186 | CPI | R24,0x24 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 187 | CPC | R25,R1 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 188 | BRLT | L2 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 189 | RET | | 2 | 4 | | 1 | 4 | 1 | 0 |
| | Total | | 378 | 294 | 80 | 24349 | 37138 | 24349 | 10600 |

## D.3 TI MSP430G2231 Calculations Details

Table D.3: TI MSP430 Calculations

| | | | | | Static Results | | | Dynamic Results | | |
|----|-------|---------|--------------|----------|----------|---------|----------|-----------|--------------|---------------|
| # | | Instruction | | Instr. | Instr. | DBytes | No. of | Exec. | Memory Traffic (Cycles) | |
| | | | | Bytes | Cycles | Moved | Exec. | Cycles | Instr. 
Mem | Data Mem |
| | | | TI MSP430 Calculations for Benchmark 1: Recursive Factorial | | | | | | | |
| 1 | Main: | MOV.W | #5,r12 | 4 | 2 | | 1 | 2 | 2 | 0 |
| 2 | | CALL | #Fact | 4 | 5 | | 1 | 5 | 2 | 0 |
| 3 | End: | RET | | 2 | 3 | | 1 | 3 | 1 | 0 |
| 4 | Fact: | PUSH.W | r12 | 2 | 3 | 2 | 5 | 15 | 5 | 5 |
| 5 | | MOV.W | r12,r10 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 6 | | SUB.W | #1,r12 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 7 | | JGE | L1 | 2 | 2 | | 5 | 10 | 5 | 0 |
| 8 | | MOV.W | #1,r14 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 9 | | MOV.W | #0,r15 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 10 | | POP | r12 | 2 | 3 | 2 | 1 | 3 | 1 | 1 |
| 11 | | RET | | 2 | 3 | | 1 | 3 | 1 | 0 |
| 12 | L1: | CALL | #Fact | 4 | 5 | | 4 | 20 | 8 | 0 |
| 13 | | MOV.W | r14,r4 | 2 | 1 | | 4 | 4 | 4 | 0 |
| 14 | | MOV.W | r15,r5 | 2 | 1 | | 4 | 4 | 4 | 0 |
| 15 | | CLR | r14 | 2 | 1 | | 4 | 4 | 4 | 0 |
| 16 | | CLR | r15 | 2 | 1 | | 4 | 4 | 4 | 0 |
| 17 | | MOV | r4,&0130h | 4 | 4 | | 4 | 16 | 8 | 0 |
| 18 | | MOV | r10,&0138h | 4 | 4 | | 4 | 16 | 8 | 0 |
| 19 | | MOV | &SumLo,r14 | 4 | 4 | | 4 | 16 | 8 | 0 |

| | | | | | Static Results | | | Dynamic Results | | |
|----|----------|-------------|---------------|----------|----------|---------|-----------|----------|---------------|----------|
| # | | Instruction | | Instr. | Instr. | DBytes | No. of | Exec. | Memory Traffic (Cycles) | |
| | | | | Bytes | Cycles | Moved | Exec. | Cycles | Instr. 
Mem | Data Mem | | 20 | | MOV | &SumHi,r15 | 4 | 4 | 1110104 | 4 | 16 | 8 | 0 | | 21 | | MOV | r10,&0130h | 4 | 4 | | 4 | 16 | 8 | 0 | | 22 | | MOV | r5,&0138h | 4 | 4 | | 4 | 16 | 8 | 0 | | 23 | | ADD | &SumLo,r14 | 4 | 4 | | 4 | 16 | 8 | 0 | | 24 | | ADDC | &SumHi,r15 | 4 | 4 | | 4 | 16 | 8 | 0 | | 25 | | POP | r12 | 2 | 3 | 2 | 4 | 12 | 4 | 4 | | 26 | | RET | | 2 | 3 | | 4 | 12 | 4 | 0 | | | | Total | 1 | 74 | 72 | 6 | 87 | 241 | 125 | 10 | | | | | TI MSP430 | Calculat | ions for | Benchma | ark 2: St | ring Cop | by | | | 1 | Main: | MOV.W | #strSrc,r12 | 4 | 2 | | 1 | 2 | 2 | 0 | | 2 | | MOV.W | #strDest,r15 | 4 | 2 | | 1 | 2 | 2 | 0 | | 3 | | CALL | #strCopy | 4 | 5 | | 1 | 5 | 2 | 0 | | 4 | End: | RET | | 2 | 3 | | 1 | 3 | 1 | 0 | | 5 | strCopy: | MOV.B | @r12+,0(r15) | 4 | 5 | 2 | 13 | 65 | 26 | 13 | | 6 | | ADD.B | #1,r15 | 2 | 1 | | 13 | 13 | 13 | 0 | | 7 | | TST.B | 0(r15) | 2 | 2 | 2 | 13 | 26 | 13 | 13 | | 8 | | JNE | strCopy | 2 | 2 | | 13 | 26 | 13 | 0 | | 9 | | RET | | 2 | 3 | | 1 | 3 | 1 | 0 | | | | Total | 1 | 26 | 25 | 4 | 57 | 145 | 73 | 26 | | | | | TI MSP430 | Calculat | ions for | Benchma | ark 3: Bu | bble So | rt | | | 1 | Main: | MOV.W | #0,r13 | 2 | 1 | | 1 | 1 | 1 | 0 | | 2 | | MOV.W | #0,r14 | 2 | 1 | | 1 | 1 | 1 | 0 | | 3 | | MOV.W | #START,r15 | 4 | 2 | | 1 | 2 | 2 | 0 | | 4 | L1: | MOV.W | r13,0(r15) | 4 | 2 | 2 | 10 | 20 | 20 | 10 | | 5 | | MOV.W | r14,2(r15) | 4 | 4 | 2 | 10 | 40 | 20 | 10 | | 6 | | ADD.W | #4,r15 | 2 | 1 | | 10 | 10 | 10 | 0 | | 7 | | ADD.W | #1,r13 | 2 | 1 | | 10 | 10 | 10 | 0 | | 8 | | CMP.W | #10,r13 | 4 | 2 | | 10 | 20 | 20 | 0 | | 9 | | JL | L1 | 2 | 2 | | 10 | 20 | 10 | 0 | | 10 | | CALL | #BSort | 4 | 5 | | 1 | 5 | 2 | 0 | | 11 | End: | RET | | 2 | 3 | | 1 | 3 | 1 | 0 | | 12 | BSort: | MOV.W | #8,r13 | 4 | 2 | | 1 | 2 | 2 | 0 | | 13 | | MOV.W | #START,r15 | 4 | 2 | | 1 | 2 | 2 | 0 | | 14 | L3: | MOV.W | #0,r9 | 2 | 1 | | 9 | 9 | 9 | 0 | | 15 | L4: | MOV.W | @r15,r10 | 6 | 6 | 2 | 45 | 270 | 135 | 45 | | 16 
| | MOV.W | 2(r15),r11 | 2 | 2 | 2 | 45 | 90 | 45 | 45 | | 17 | | CMP.W | 6(r15),r11 | 2 | 2 | 2 | 45 | 90 | 45 | 45 | | 18 | | JL | L5 | 6 | 6 | | 45 | 270 | 135 | 0 | | 19 | | JNE | L6 | 2 | 2 | | 45 | 90 | 45 | 0 | | 20 | | CMP.W | 4(r15),r10 | 2 | 2 | 2 | 45 | 90 | 45 | 45 | | 21 | | JHS | L6 | 4 | 3 | | 45 | 135 | 90 | 0 | | 22 | L5: | MOV.W | 4(r15),0(r15) | 6 | 6 | 2 | 45 | 270 | 135 | 45 | | 23 | | MOV.W | 6(r15),2(r15) | 6 | 6 | 2 | 45 | 270 | 135 | 45 | | 24 | | MOV.W | r10,4(r15) | 4 | 4 | 2 | 45 | 180 | 90 | 45 | | 25 | | MOV.W | r11,6(r15) | 4 | 3 | 2 | 45 | 135 | 90 | 45 | | 26 | L6: | ADD.W | #4,r15 | 2 | 1 | | 45 | 45 | 45 | 0 | | 27 | | ADD.W | #1,r9 | 2 | 1 | | 45 | 45 | 45 | 0 | | 28 | | CMP.W | r9,r13 | 2 | 1 | | 45 | 45 | 45 | 0 | | 29 | | JGE | L4 | 2 | 2 | | 45 | 90 | 45 | 0 | | 30 | L7: | SUB.W | #1,r13 | 2 | 1 | | 9 | 9 | 9 | 0 | | | | | | St | atic Res | ults | | Dyr | namic Results | | |----|--------|----------------------|----------------------|----------|----------|---------|-----------|----------|---------------|----------------| | # | | Instruc | tion | Instr. | Instr. | DBytes | No. of | Exec. | Memory Tra | affic (Cycles) | | | | | | Bytes | Cycles | Moved | Exec. | Cycles | Instr. 
Mem | Data Mem |
| 31 | | TST.W | r13 | 2 | 1 | | 9 | 9 | 9 | 0 |
| 32 | | JGE | L3 | 2 | 2 | | 9 | 18 | 9 | 0 |
| 33 | L8: | RET | | 2 | 3 | | 1 | 3 | 1 | 0 |
| | | Total | | 102 | 83 | 20 | 779 | 2299 | 1308 | 380 |
| | | | TI MSP430 Calculations for Benchmark 4: Sensor Struct | | | | | | | |
| 1 | Main: | CALL | Init | 4 | 5 | | 1 | 5 | 2 | 0 |
| 2 | | CALL | Calib | 4 | 5 | | 1 | 5 | 2 | 0 |
| 3 | End: | RET | | 2 | 3 | | 1 | 3 | 1 | 0 |
| 4 | Init: | MOV.W | #0, r8 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 5 | | MOV.W | #0, r9 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 6 | | MOV.W | #START, r15 | 4 | 2 | | 1 | 2 | 2 | 0 |
| 7 | | MOV.W | #0, r4 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 8 | L0: | MOV.B | #0, 0(r15) | 4 | 2 | 2 | 5 | 10 | 10 | 5 |
| 9 | | INC.B | r15 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 10 | | MOV.W | r4, 0(r15) | 4 | 2 | 2 | 5 | 10 | 10 | 5 |
| 11 | | MOV.W | r4, r8 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 12 | | ADC.W | #3, r9 | 4 | 2 | | 5 | 10 | 10 | 0 |
| 13 | | MOV.W | r8,2(r15) | 4 | 4 | 2 | 5 | 20 | 10 | 5 |
| 14 | | MOV.W | r8,4(r15) | 4 | 4 | 2 | 5 | 20 | 10 | 5 |
| 15 | | ADD.W | #6,r15 | 4 | 2 | | 5 | 10 | 10 | 0 |
| 16 | | ADD.W | #1, r4 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 17 | | CMP.W | #5, r4 | 4 | 2 | | 5 | 10 | 10 | 0 |
| 18 | | JL | L0 | 2 | 2 | | 5 | 10 | 5 | 0 |
| 19 | | RET | | 2 | 3 | | 1 | 3 | 1 | 0 |
| 20 | Calib: | MOV.W | #START, r15 | 4 | 2 | | 1 | 2 | 2 | 0 |
| 21 | | MOV.W | #5, r4 | 4 | 2 | | 1 | 2 | 2 | 0 |
| 22 | L1: | MOV.B | #0, 0(r15) | 2 | 2 | 2 | 5 | 10 | 5 | 5 |
| 23 | | INC.B | r15 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 24 | | SUB.W | 0(r15),2(r15) | 6 | 6 | 4 | 5 | 30 | 15 | 10 |
| 25 | | SUBC.W | #0,4(r15) | 4 | 2 | 2 | 5 | 10 | 10 | 5 |
| 26 | | ADD.W | #6,r15 | 4 | 1 | | 5 | 5 | 10 | 0 |
| 27 | | DEC.W | r4 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 28 | | JG | L0 | 2 | 2 | | 5 | 10 | 5 | 0 |
| 29 | | RET | | 2 | 3 | | 1 | 3 | 1 | 0 |
| | | Total | | 90 | 66 | 16 | 101 | 218 | 161 | 40 |
| | | | TI MSP430 Calculations for Benchmark 5: Matrix Multiplication | | | | | | | |
| 1 | Main: | MOV.W | #nRows1, r6 | 2 | 
1 | | 1 | 1 | 1 | 0 | | 2 | | MOV.W | #nCols1, r7 | 2 | 1 | | 1 | 1 | 1 | 0 | | 3 | | MOV.W | #M1, r12 | 4 | 2 | | 1 | 2 | 2 | 0 | | 4 | | CALL | #INIT | 4 | 5 | | 1 | 5 | 2 | 0 | | 5 | | MOV.W | #nRows2, r6 | 2 | 1 | | 1 | 1 | 1 | 0 | | 6 | | MOV.W | #nCols2, r7 | 2 | 1 | | 1 | 1 | 1 | 0 | | 7 | | MOV.W | #M2, r12 | 4 | 2 | | 1 | 2 | 2 | 0 | | 8 | | CALL | #INIT | 4 | 5 | | 1 | 5 | 2 | 0 | | 9 | | MOV.W | #M1, r13 | 4 | 2 | | 1 | 2 | 2 | 0 | | 10 | | MOV.W | #M1, r13 | 4 | 2 | | 1 | 2 | 2 | 0 | | 11 | | MOV.W | #M2, r14<br>#M3, r15 | 4 | 2 | | 1 | 2 | 2 | 0 | | | | | | - | | | | | | | | 12 | т э. | MOV.W | #5, r4 | 4 | 2 | | 1 5 | 2 | 2 | 0 | | | L3: | MOV.W | #3, r5 | 4 | 2 | | 5 | 10 | 10 | 0 | | | L2: | MOV.W<br>n Next Page | #4, r6 | 2 | 1 | | 15 | 15 | 15 | 0 | | | | Instruction | | | atic Res | ults | | Dvr | amic Results | | |----------|-------|-------------|---------------|--------|----------|----------|--------|--------|--------------|----------| | # | | Instruct | tion | Instr. | Instr. | DBytes | No. of | Exec. | Memory Tra | | | "- | | THE U | | Bytes | Cycles | Moved | Exec. | Cycles | Instr. 
Mem | Data Mem | | 15 | | MOV.W | #0, r9 | 2 | 1 | | 15 | 15 | 15 | 0 | | 16 | | MOV.W | #0, r10 | 2 | 1 | | 15 | 15 | 15 | 0 | | 17 | L1: | CLR | r11 | 2 | 1 | | 60 | 60 | 60 | 0 | | 18 | | CLR | r12 | 2 | 1 | | 60 | 60 | 60 | 0 | | 19 | | MOV | 0(R13),&0130h | 6 | 6 | 2 | 60 | 360 | 180 | 60 | | 20 | | MOV | 0(R14),&0138h | 6 | 6 | 2 | 60 | 360 | 180 | 60 | | 21 | | ADD | &SumLo,R11 | 4 | 4 | | 60 | 240 | 120 | 0 | | 22 | | ADDC | &SumHi,R12 | 4 | 4 | | 60 | 240 | 120 | 0 | | 23 | | MOV | 0(R13),&0130h | 6 | 6 | 2 | 60 | 360 | 180 | 60 | | 24 | | MOV | 2(R14),&0138h | 6 | 6 | 2 | 60 | 360 | 180 | 60 | | 25 | | MOV | 0(R14),&0134h | 6 | 6 | 2 | 60 | 360 | 180 | 60 | | 26 | | MOV | 2(R13),&0138h | 6 | 6 | 2 | 60 | 360 | 180 | 60 | | 27 | | ADD | &SumLo,R12 | 4 | 3 | | 60 | 180 | 120 | 0 | | 28 | | ADD.W | r11, r9 | 2 | 1 | | 60 | 60 | 60 | 0 | | 29 | | ADDC.W | r12, r10 | 2 | 1 | | 60 | 60 | 60 | 0 | | 30 | | ADD.W | #20, r14 | 4 | 2 | | 60 | 120 | 120 | 0 | | 31 | | DEC.W | r6 | 2 | 1 | | 60 | 60 | 60 | 0 | | 32 | | JG | L1 | 2 | 2 | | 60 | 120 | 60 | 0 | | 33 | | MOV.W | r9,0(r15) | 4 | 2 | 2 | 15 | 30 | 30 | 15 | | 34 | | MOV.W | r10,2(r15) | 4 | 4 | 2 | 15 | 60 | 30 | 15 | | 35 | | SUB.W | #56, r14 | 4 | 2 | | 15 | 30 | 30 | 0 | | 36 | | DEC.W | r5 | 2 | 1 | | 15 | 15 | 15 | 0 | | 37 | | JG | L2 | 2 | 2 | | 15 | 30 | 15 | 0 | | 38 | | MOV.W | #M2, r14 | 4 | 2 | | 5 | 10 | 10 | 0 | | 39 | | DEC.W | r4 | 2 | 1 | | 5 | 5 | 5 | 0 | | 40 | | JG | L3 | 2 | 2 | | 5 | 10 | 5 | 0 | | 41 | | RET' | | 2 | 1 | | 1 | 1 | 1 | 0 | | 42 | INIT: | MOV.W | #0, r4 | 2 | 1 | | 2 | 2 | 2 | 0 | | 43 | | MOV.W | #0, r5 | 2 | 1 | | 2 | 2 | 2 | 0 | | 44 | | MOV.W | #0, r9 | 2 | 1 | | 2 | 2 | 2 | 0 | | 45 | L9: | MOV.W | r4, 0(r12) | 4 | 2 | 2 | 2 | 4 | 4 | 2 | | 46 | | MOV.W | r5, 2(r12) | 4 | 2 | 2 | 32 | 64 | 64 | 32 | | 47 | | ADD.W | #4, r12 | 2 | 1 | | 32 | 32 | 32 | 0 | | 48 | | ADD.W | #1, r4 | 2 | 1 | | 32 | 32 | 32 | 0 | | 49 | | ADDC.W | #0, r5 | 2 | 1 | | 32 | 32 | 32 | 0 | 
| 50 | | DEC.W | r6 | 2 | 1 | | 32 | 32 | 32 | 0 |
| 51 | | JG | L9 | 2 | 2 | | 32 | 64 | 32 | 0 |
| 52 | | ADD.W | #1, r9 | 2 | 1 | | 32 | 32 | 32 | 0 |
| 53 | | MOV.W | r9, r4 | 2 | 1 | | 32 | 32 | 32 | 0 |
| 54 | | DEC.W | r7 | 2 | 1 | | 32 | 32 | 32 | 0 |
| 55 | | JG | L9 | 2 | 2 | | 32 | 64 | 32 | 0 |
| 56 | | RET | | 2 | 3 | | 1 | 3 | 1 | 0 |
| | | Total | | 174 | 125 | 20 | 1442 | 4061 | 2499 | 424 |
| | | | TI MSP430 Calculations for Benchmark 6: FIR | | | | | | | |
| 1 | main: | MOV.W | #COEFF,r15 | 4 | 2 | | 1 | 2 | 2 | 0 |
| 2 | | MOV.W | #0,r7 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 3 | L1: | MOV.W | r7,r12 | 2 | 1 | | 18 | 18 | 18 | 0 |
| 4 | | CALL | #itof | 4 | 5 | | 18 | 90 | 36 | 0 |
| 5 | | MOV.W | #0,r14 | 2 | 1 | | 18 | 18 | 18 | 0 |
| 6 | | MOV.W | #16544,r15 | 4 | 2 | | 18 | 36 | 36 | 0 |

| | | | | | Static Results | | | Dynamic Results | | |
|----|-------------|---------|-------------|--------|----------|--------|--------|--------|--------------|----------------|
| # | | Instruction | | Instr. | Instr. | DBytes | No. of | Exec. | Memory Traffic (Cycles) | |
| | | | | Bytes | Cycles | Moved | Exec. | Cycles | Instr. 
Mem | Data Mem | | 7 | | CALL | #fadd | 4 | 5 | | 18 | 90 | 36 | 0 | | 8 | | MOV.W | r12,r14 | 2 | 1 | | 18 | 18 | 18 | 0 | | 9 | | MOV.W | r13,r15 | 2 | 1 | | 18 | 18 | 18 | 0 | | 10 | | MOV.W | #0,r12 | 2 | 1 | | 18 | 18 | 18 | 0 | | 11 | | MOV.W | #16256,r13 | 4 | 2 | | 18 | 36 | 36 | 0 | | 12 | | CALL | #fdiv | 4 | 5 | | 18 | 90 | 36 | 0 | | 13 | | MOV.W | r12,0(r15) | 4 | 2 | 2 | 18 | 36 | 36 | 18 | | 14 | | MOV.W | r13,2(r15) | 4 | 2 | 2 | 18 | 36 | 36 | 18 | | 15 | | ADD.W | #4,r15 | 2 | 1 | | 18 | 18 | 18 | 0 | | 16 | | ADD.W | #1,r7 | 2 | 1 | | 18 | 18 | 18 | 0 | | 17 | | CMP.W | #17,r7 | 4 | 2 | | 18 | 36 | 36 | 0 | | 18 | | JL | L1 | 2 | 2 | | 18 | 36 | 18 | 0 | | 19 | | MOV.W | #0,r7 | 2 | 1 | | 1 | 1 | 1 | 0 | | 20 | | MOV.W | #2,r8 | 2 | 1 | | 1 | 1 | 1 | 0 | | 21 | | MOV.W | r7,r15 | 2 | 1 | | 1 | 1 | 1 | 0 | | 22 | L3: | RLA.W | r15 | 2 | 1 | | 68 | 68 | 68 | 0 | | 23 | ьэ. | MOV.W | r8,68(r15) | 4 | 2 | 2 | 68 | 136 | 136 | 68 | | 24 | | ADD.W | | 3 | 1 | 2 | 68 | 68 | 136 | 0 | | | | - | #1,r7 | 4 | 2 | | | | | 0 | | 25 | | CMP.W | #67,r7 | | | | 68 | 136 | 136 | | | 26 | | JL | L3 | 2 | 1 | | 68 | 68 | 68 | 0 | | 27 | T = | MOV.W | #0,r10 | _ | 1 | | 1 | 1 | 1 | 0 | | 28 | L5: | MOV.W | #0,r7 | 2 | 1 | | 36 | 36 | 36 | 0 | | 29 | | MOV.W | #0,r8 | 2 | 1 | | 36 | 36 | 36 | 0 | | 30 | | MOV.W | #0,r9 | 2 | 1 | | 36 | 36 | 36 | 0 | | 31 | L6: | MOV.W | r7,r15 | 2 | 1 | | 304 | 304 | 304 | 0 | | 32 | | MOV.W | r10,r13 | 2 | 1 | | 304 | 304 | 304 | 0 | | 33 | | SUB.W | r15,r13 | 2 | 1 | | 304 | 304 | 304 | 0 | | 34 | | ADD.W | #16,r13 | 4 | 2 | | 304 | 608 | 608 | 0 | | 35 | | RLA.W | r13 | 2 | 1 | | 304 | 304 | 304 | 0 | | 36 | | MOV.W | r10,r15 | 2 | 1 | | 304 | 304 | 304 | 0 | | 37 | | MOV.W | r7,r14 | 2 | 1 | | 304 | 304 | 304 | 0 | | 38 | | ADD.W | r15,r14 | 2 | 1 | | 304 | 304 | 304 | 0 | | 39 | | RLA.W | r14 | 2 | 1 | | 304 | 304 | 304 | 0 | | 40 | | MOV.W | 68(r14),r12 | 4 | 2 | 2 | 304 | 608 | 608 | 304 | | 41 | | ADD.W | 68(r13),r12 | 4 | 2 
| 2 | 304 | 608 | 608 | 304 | | 42 | | CALL | #itof | 4 | 2 | | 304 | 608 | 608 | 0 | | 43 | | MOV.W | r7,r15 | 2 | 1 | | 304 | 304 | 304 | 0 | | 44 | | RLA.W | r15 | 2 | 1 | | 304 | 304 | 304 | 0 | | 45 | | RLA.W | r15 | 2 | 1 | | 304 | 304 | 304 | 0 | | 46 | | MOV.W | 0(r15),r14 | 4 | 2 | 2 | 304 | 608 | 608 | 304 | | 47 | | MOV.W | 2(r15),r15 | 4 | 2 | 2 | 304 | 608 | 608 | 304 | | 48 | | CALL | #fmpy | 4 | 5 | | 304 | 1520 | 608 | 0 | | 49 | | MOV.W | r8,r14 | 2 | 1 | | 304 | 304 | 304 | 0 | | 50 | | MOV.W | r9,r15 | 2 | 1 | | 304 | 304 | 304 | 0 | | 51 | | CALL | #fadd | 4 | 5 | | 304 | 1520 | 608 | 0 | | 52 | | MOV.W | r12,r8 | 2 | 1 | | 304 | 304 | 304 | 0 | | 53 | | MOV.W | r13,r9 | 2 | 1 | | 304 | 304 | 304 | 0 | | 54 | | ADD.W | #1,r7 | 2 | 1 | | 304 | 304 | 304 | 0 | | 55 | | CMP.W | #8,r7 | 2 | 1 | | 304 | 304 | 304 | 0 | | 56 | | JL | L6 | 2 | 2 | | 304 | 608 | 304 | 0 | | | | | St | atic Res | ults | | Dyn | amic Results | | |----|---------|--------------|--------|----------|--------|--------|--------|--------------|---------------| | # | Instruc | tion | Instr. | Instr. | DBytes | No. of | Exec. | Memory Tra | ffic (Cycles) | | | | | Bytes | Cycles | Moved | Exec. | Cycles | Instr. 
Mem | Data Mem |
| 57 | MOV.W | r10,r15 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 58 | ADD.W | #8, r15 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 59 | RLA.W | r15 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 60 | MOV.W | 68(r15),r12 | 4 | 2 | 4 | 36 | 72 | 72 | 72 |
| 61 | CALL | #fitof | 4 | 5 | | 36 | 180 | 72 | 0 |
| 62 | MOV.W | 32(R3),r14 | 4 | 2 | 2 | 36 | 72 | 72 | 36 |
| 63 | MOV.W | 34(R3),r15 | 4 | 2 | 2 | 36 | 72 | 72 | 36 |
| 64 | CALL | #fmpy | 4 | 5 | | 36 | 180 | 72 | 0 |
| 65 | MOV.W | r8,r14 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 66 | MOV.W | r9,r15 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 67 | CALL | #fadd | 4 | 5 | | 36 | 180 | 72 | 0 |
| 68 | MOV.W | r10,r15 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 69 | RLA.W | r15 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 70 | RLA.W | r15 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 71 | MOV.W | r12,202(r15) | 4 | 4 | 2 | 36 | 144 | 72 | 36 |
| 72 | MOV.W | r13,204(r15) | 4 | 4 | 2 | 36 | 144 | 72 | 36 |
| 73 | ADD.W | #1,r10 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 74 | CMP.W | #36,r10 | 4 | 2 | | 36 | 72 | 72 | 0 |
| 75 | JL | | 2 | 2 | | 36 | 72 | 36 | 0 |
| 76 | RET | | | 3 | | 1 | 3 | 1 | 0 |
| | Total | | | 137 | 26 | 9331 | 15182 | 12436 | 1536 |

## D.4 ARM Cortex-M3 LPC1342 Calculations Details

Table D.4: ARM Cortex M3 Calculations

| | | | | | Static Results | | | Dynamic Results | | |
|----|-------|------|-------------------|----------|-----------|---------|-----------|-----------|--------------|---------------|
| # | | Instruction | | Instr. | Instr. | DBytes | No. of | Exec. | Memory Traffic (Cycles) | |
| | | | | Bytes | Cycles | Moved | Exec. | Cycles | Instr. 
Mem | Data Mem |
| | | | ARM Cortex M3 Calculations for Benchmark 1: Recursive Factorial | | | | | | | |
| 1 | Main: | MOV | R0,#5 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 2 | | BL | fact | 4 | 3 | | 1 | 3 | 1 | 0 |
| 3 | End: | BX | R14 | 2 | 2 | | 1 | 2 | 1 | 0 |
| 4 | Fact: | PUSH | R0 | 2 | 2 | 4 | 5 | 10 | 5 | 5 |
| 5 | | MOV | R5,R0 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 6 | | SUB | R0,R0,#1 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 7 | | BGT | L0 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 8 | | MOV | R4,#1 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 9 | | POP | R0 | 2 | 2 | 4 | 1 | 2 | 1 | 1 |
| 10 | | BX | R14 | 2 | 3 | | 1 | 3 | 1 | 0 |
| 11 | L0: | BL | Fact | 4 | 3 | | 4 | 12 | 4 | 0 |
| 12 | | MUL | R4,R4,R5 | 4 | 1 | | 4 | 4 | 4 | 0 |
| 13 | | POP | R0 | 2 | 2 | 4 | 4 | 8 | 4 | 4 |
| 14 | | BX | R14 | 2 | 3 | | 4 | 12 | 4 | 0 |
| | | Total | | 34 | 26 | 12 | 42 | 73 | 42 | 10 |
| | | | ARM Cortex M3 Calculations for Benchmark 2: String Copy | | | | | | | |
| 1 | Main: | MOV | R1,#Src | 4 | 1 | | 1 | 1 | 1 | 0 |

| | | | | | Static Results | | | Dynamic Results | | |
|----------|---------|------|--------------------|---------|-----------|--------------|----------|----------|--------------|---------------|
| # | | Instruction | | Instr. | Instr. | DBytes | No. of | Exec. | Memory Traffic (Cycles) | |
| | | | | Bytes | Cycles | Moved | Exec. | Cycles | Instr. 
Mem | Data Mem |
| 2 | | MOV | R2,#Dest | 4 | 1 | | 1 | 1 | 1 | 0 |
| 3 | | BL | strCpy | 4 | 3 | | 1 | 3 | 1 | 0 |
| 4 | End: | BX | R14 | 2 | 3 | | 1 | 3 | 1 | 0 |
| 5 | StrCpy: | MOV | R0,#0 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 6 | L0: | LDRB | R3,[R2,R0] | 2 | 2 | 4 | 13 | 26 | 13 | 13 |
| 7 | | STRB | R3,[R1,R0] | 4 | 2 | 4 | 13 | 26 | 13 | 13 |
| 8 | | ADD | R0,R0,#1 | 2 | 1 | | 13 | 13 | 13 | 0 |
| 9 | | CBNZ | R3,L0 | 2 | 2 | | 13 | 26 | 13 | 0 |
| 10 | | BX | R14 | 2 | 3 | | 1 | 3 | 1 | 0 |
| | | | Total | 28 | 19 | 8 | 58 | 103 | 58 | 26 |

**ARM Cortex M3 Calculations for Benchmark 3: Bubble Sort**

| # | Label | Instruction | Operands | Static: Instr. Bytes | Static: Instr. Cycles | Static: DBytes Moved | Dynamic: No. of Exec. | Dynamic: Exec. Cycles | Dynamic: Instr. Mem Traffic (Cycles) | Dynamic: Data Mem Traffic (Cycles) |
|----|--------|------|--------------------|----|----|----|-----|-----|-----|-----|
| 1 | Main: | MOV | R4,#0 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 2 | | MOV | R1,#START | 4 | 1 | | 1 | 1 | 1 | 0 |
| 3 | L0: | STR | R4,[R1,R4,LSL #2] | 4 | 2 | 4 | 10 | 20 | 10 | 10 |
| 4 | | ADD | R4,R4,#1 | 2 | 1 | | 10 | 10 | 10 | 0 |
| 5 | | CMP | R4,#10 | 2 | 1 | | 10 | 10 | 10 | 0 |
| 6 | | BLT | L0 | 2 | 1 | | 10 | 10 | 10 | 0 |
| 7 | | BL | BSort | 4 | 3 | | 1 | 3 | 1 | 0 |
| 8 | End: | BX | R14 | 2 | 3 | | 1 | 3 | 1 | 0 |
| 9 | BSort: | MOV | R2,#8 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 10 | L1: | MOV | R0,#0 | 2 | 1 | | 9 | 9 | 9 | 0 |
| 11 | L2: | LDR | R12,[R1,R0,LSL #2] | 4 | 2 | 4 | 45 | 90 | 45 | 45 |
| 12 | | ADD | R4,R0,#1 | 2 | 1 | | 45 | 45 | 45 | 0 |
| 13 | | LDR | R5,[R1,R4,LSL #2] | 4 | 2 | 4 | 45 | 90 | 45 | 45 |
| 14 | | CMP | R12,R5 | 2 | 1 | | 45 | 45 | 45 | 0 |
| 15 | | BLE | L3 | 2 | 1 | | 45 | 45 | 45 | 0 |
| 16 | | STR | R5,[R1,R0,LSL #2] | 4 | 2 | 4 | 45 | 90 | 45 | 45 |
| 17 | | STR | R12,[R1,R4,LSL #2] | 4 | 2 | 4 | 45 | 90 | 45 | 45 |
| 18 | L3: | ADD | R0,R0,#1 | 2 | 1 | | 45 | 45 | 45 | 0 |
| 19 | | CMP | R0,R2 | 2 | 1 | | 45 | 45 | 45 | 0 |
| 20 | | BLE | | 2 | 1 | | 45 | 45 | 45 | 0 |
| 21 | | SUB | R2,R2,#1 | 2 | 1 | | 9 | 9 | 9 | 0 |
| 22 | | CBZ | R2,L1 | 2 | 2 | | 9 | 18 | 9 | 0 |
| 23 | | BX | R14 | 2 | 3 | | 1 | 3 | 1 | 0 |
| | | | Total | 60 | 35 | 20 | 523 | 728 | 523 | 190 |

**ARM Cortex M3 Calculations for Benchmark 4**

| # | Label | Instruction | Operands | Static: Instr. Bytes | Static: Instr. Cycles | Static: DBytes Moved | Dynamic: No. of Exec. | Dynamic: Exec. Cycles | Dynamic: Instr. Mem Traffic (Cycles) | Dynamic: Data Mem Traffic (Cycles) |
|----|--------|------|--------------------|----|----|----|-----|-----|-----|-----|
| 1 | Main: | BL | Init | 4 | 3 | | 1 | 3 | 1 | 0 |
| 2 | | BL | Calib | 4 | 3 | | 1 | 3 | 1 | 0 |
| 3 | End: | BX | R14 | 2 | 3 | | 1 | 3 | 1 | 0 |
| 4 | Init: | MOV | R1,#START | 4 | 1 | | 1 | 1 | 1 | 0 |
| 5 | | MOV | R4,#0 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 6 | | MOV | R2,#0 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 7 | | MOV | R3,#0 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 8 | L0: | ADD | R2,R4,#3 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 9 | | STRB | R3,[R1,#0x00] | 4 | 2 | 4 | 5 | 10 | 5 | 5 |
| 10 | | STRH | R4,[R1,#0x02] | 4 | 2 | 4 | 5 | 10 | 5 | 5 |
| 11 | | STR | R2,[R1,#0x04] | 4 | 2 | 4 | 5 | 10 | 5 | 5 |
| 12 | | ADD | R1,R1,#6 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 13 | | ADD | R4,R4,#1 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 14 | | CMP | R4,#10 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 15 | | BLT | L0 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 16 | | BX | R14 | 2 | 3 | | 1 | 3 | 1 | 0 |
| 17 | Calib: | MOV | R1,#START | 4 | 1 | | 1 | 1 | 1 | 0 |
| 18 | | MOV | R4,#0 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 19 | | MOV | R3,#1 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 20 | L1: | STRB | R3,[R1,#0x00] | 4 | 2 | 4 | 5 | 10 | 5 | 5 |
| 21 | | LDRH | R2,[R1,#0x02] | 4 | 2 | 4 | 5 | 10 | 5 | 5 |
| 22 | | LDR | R3,[R1,#0x04] | 2 | 2 | 4 | 5 | 10 | 5 | 5 |
| 23 | | SUB | R2,R3,R2 | 4 | 1 | | 5 | 5 | 5 | 0 |
| 24 | | STR | R2,[R1,#0x04] | 4 | 2 | 4 | 5 | 10 | 5 | 5 |
| 25 | | ADD | R1,R1,#6 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 26 | | ADD | R4,R4,#1 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 27 | | CMP | R4,#10 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 28 | | BLT | L1 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 29 | | BX | R14 | 2 | 3 | | 1 | 3 | 1 | 0 |
| | | | Total | 80 | 46 | 28 | 97 | 142 | 97 | 35 |

**ARM Cortex M3 Calculations for Benchmark 5: Matrix Multiplication**

| # | Label | Instruction | Operands | Static: Instr. Bytes | Static: Instr. Cycles | Static: DBytes Moved | Dynamic: No. of Exec. | Dynamic: Exec. Cycles | Dynamic: Instr. Mem Traffic (Cycles) | Dynamic: Data Mem Traffic (Cycles) |
|----|--------|------|--------------------|----|----|----|-----|-----|-----|-----|
| 1 | Main: | MOV | R12,#M1 | 4 | 1 | | 1 | 1 | 1 | 0 |
| 2 | | MOV | R6,#nRows1 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 3 | | MOV | R7,#nCols1 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 4 | | BL | INIT | 4 | 3 | | 1 | 3 | 1 | 0 |
| 5 | | MOV | R12,#M2 | 4 | 1 | | 1 | 1 | 1 | 0 |
| 6 | | MOV | R6,#nRows2 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 7 | | MOV | R7,#nCols2 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 8 | | BL | INIT | 4 | 3 | | 1 | 3 | 1 | 0 |
| 9 | | MOV | R10,#M1 | 4 | 1 | | 1 | 1 | 1 | 0 |
| 10 | | MOV | R11,#M2 | 4 | 1 | | 1 | 1 | 1 | 0 |
| 11 | | MOV | R12,#M2 | 4 | 1 | | 1 | 1 | 1 | 0 |
| 12 | | MOV | R5,#0 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 13 | L6: | MOV | R4,#0 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 14 | L5: | MOV | R1,#0 | 2 | 1 | | 15 | 15 | 15 | 0 |
| 15 | | MOV | R3,#0 | 2 | 1 | | 15 | 15 | 15 | 0 |
| 16 | L4: | LDR | R7,[R10] | 2 | 2 | 4 | 60 | 120 | 60 | 60 |
| 17 | | LDR | R8,[R11] | 2 | 2 | 4 | 60 | 120 | 60 | 60 |
| 18 | | MLA | R3,R7,R8 | 4 | 2 | | 60 | 120 | 60 | 0 |
| 19 | | ADD | R11,R11,#20 | 4 | 1 | | 60 | 60 | 60 | 0 |
| 20 | | ADD | R10,R10,#4 | 2 | 1 | | 60 | 60 | 60 | 0 |
| 21 | | SUB | R1,R1,#1 | 2 | 1 | | 60 | 60 | 60 | 0 |
| 22 | | BLT | L4 | 2 | 1 | | 60 | 60 | 60 | 0 |
| 23 | | STR | R3,[R11] | 4 | 2 | 4 | 15 | 30 | 15 | 15 |
| 24 | | SUB | R11,R11,#56 | 4 | 1 | | 15 | 15 | 15 | 0 |
| 25 | | ADD | R12,R12,#4 | 2 | 1 | | 15 | 15 | 15 | 0 |
| 26 | | SUB | R1,R4,#1 | 2 | 1 | | 15 | 15 | 15 | 0 |
| 27 | | BLT | L5 | 2 | 1 | | 15 | 15 | 15 | 0 |
| 28 | | MOV | R11,#M2 | 4 | 1 | | 5 | 5 | 5 | 0 |
| 29 | | SUB | R5,R5,#1 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 30 | | BLT | L6 | 2 | 1 | | 5 | 5 | 5 | 0 |
| 31 | | BX | R14 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 32 | INIT: | MOV | R0,#0 | 2 | 1 | | 2 | 2 | 2 | 0 |
| 33 | | MOV | R1,#0 | 2 | 1 | | 2 | 2 | 2 | 0 |
| 34 | L1: | STR | R0,[R12] | 4 | 2 | 4 | 32 | 64 | 32 | 32 |
| 35 | | ADD | R0,#1 | 2 | 1 | | 32 | 32 | 32 | 0 |
| 36 | | SUB | R7,#1 | 2 | 1 | | 32 | 32 | 32 | 0 |
| 37 | | ADD | R12,#4 | 2 | 1 | | 32 | 32 | 32 | 0 |
| 38 | | BLT | L1 | 2 | 1 | | 32 | 32 | 32 | 0 |
| 39 | | ADD | R1,#1 | 2 | 1 | | 32 | 32 | 32 | 0 |
| 40 | | MOV | R0,R1 | 2 | 1 | | 32 | 32 | 32 | 0 |
| 41 | | SUB | R6,#1 | 2 | 1 | | 32 | 32 | 32 | 0 |
| 42 | | BLT | L1 | 2 | 1 | | 32 | 32 | 32 | 0 |
| 43 | | BX | R14 | 2 | 3 | | 2 | 6 | 2 | 0 |
| | | | Total | 112 | 54 | 16 | 852 | 1087 | 852 | 167 |

**ARM Cortex M3 Calculations for Benchmark 6**

| # | Label | Instruction | Operands | Static: Instr. Bytes | Static: Instr. Cycles | Static: DBytes Moved | Dynamic: No. of Exec. | Dynamic: Exec. Cycles | Dynamic: Instr. Mem Traffic (Cycles) | Dynamic: Data Mem Traffic (Cycles) |
|----|--------|------|----------------------|----|----|----|------|------|------|------|
| 1 | Main: | MOV | R8,#0 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 2 | L0: | ADD | R0,R8,5 | 2 | 1 | | 18 | 18 | 18 | 0 |
| 3 | | BL | int2float | 4 | 3 | | 18 | 54 | 18 | 0 |
| 4 | | MOV | R4,#0x3f800000 | 4 | 1 | | 18 | 18 | 18 | 0 |
| 5 | | BL | fdiv | 4 | 3 | | 18 | 54 | 18 | 0 |
| 6 | | MOV | R1,#COEFF | 4 | 1 | | 18 | 18 | 18 | 0 |
| 7 | | STR | R0,[R1,R8,LSL #2] | 4 | 2 | 4 | 18 | 36 | 18 | 18 |
| 8 | | ADD | R8,R8,#1 | 2 | 1 | | 18 | 18 | 18 | 0 |
| 9 | | CMP | R8,#17 | 2 | 1 | | 18 | 18 | 18 | 0 |
| 10 | | BLT | L0 | 2 | 1 | | 18 | 18 | 18 | 0 |
| 11 | | MOV | R8,#0 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 12 | L1: | MOV | R0,#2 | 2 | 1 | | 68 | 68 | 68 | 0 |
| 13 | | MOV | R1,#INPUT | 4 | 1 | | 68 | 68 | 68 | 0 |
| 14 | | STR | R0,[R1,R8,LSL #2] | 4 | 2 | 4 | 68 | 136 | 68 | 68 |
| 15 | | ADD | R8,R8,#1 | 2 | 1 | | 68 | 68 | 68 | 0 |
| 16 | | CMP | R8,#67 | 2 | 1 | | 68 | 68 | 68 | 0 |
| 17 | | BLT | L1 | 2 | 1 | | 68 | 68 | 68 | 0 |
| 18 | | MOV | R9,#0 | 2 | 1 | | 1 | 1 | 1 | 0 |
| 19 | L2: | MOV | R10,#0 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 20 | | MOV | R8,#0 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 21 | L3: | ADD | R1,R9,#16 | 2 | 1 | | 304 | 304 | 304 | 0 |
| 22 | | SUB | R1,R1,R8 | 4 | 1 | | 304 | 304 | 304 | 0 |
| 23 | | MOV | R2,#INPUT | 4 | 1 | | 304 | 304 | 304 | 0 |
| 24 | | LDR | R1,[R2,R1,LSL #2] | 4 | 2 | 4 | 304 | 608 | 304 | 304 |
| 25 | | ADD | R2,R9,R8 | 4 | 1 | | 304 | 304 | 304 | 0 |
| 26 | | LDR | R2,[R3,R2,LSL #2] | 4 | 2 | 4 | 304 | 608 | 304 | 304 |
| 27 | | ADD | R0,R1,R2 | 2 | 1 | | 304 | 304 | 304 | 0 |
| 28 | | BL | int2float | 4 | 3 | | 304 | 912 | 304 | 0 |
| 29 | | MOV | R6,#COEFF | 2 | 1 | | 304 | 304 | 304 | 0 |
| 30 | | LDR | R1,[R6,R8,LSL #2] | 4 | 2 | 4 | 304 | 608 | 304 | 304 |
| 31 | | BL | fmul | 4 | 3 | | 304 | 912 | 304 | 0 |
| 32 | | MOV | R1,R10 | 2 | 1 | | 304 | 304 | 304 | 0 |
| 33 | | BL | fadd | 4 | 3 | | 304 | 912 | 304 | 0 |
| 34 | | MOV | R10,R0 | 2 | 1 | | 304 | 304 | 304 | 0 |
| 35 | | ADD | R8,R8,#1 | 2 | 1 | | 304 | 304 | 304 | 0 |
| 36 | | CMP | R8,#8 | 2 | 1 | | 304 | 304 | 304 | 0 |
| 37 | | BLT | L3 | 2 | 1 | | 304 | 304 | 304 | 0 |
| 38 | | MOV | R1,#INPUT | 4 | 1 | | 36 | 36 | 36 | 0 |
| 39 | | ADD | R2,R9,#8 | 4 | 1 | | 36 | 36 | 36 | 0 |
| 40 | | LDR | R0,[R1,R2,LSL #2] | 4 | 2 | 4 | 36 | 72 | 36 | 36 |
| 41 | | BL | int2float | 4 | 3 | | 36 | 108 | 36 | 0 |
| 42 | | LDR | R1,[#Addr(COEFF[8])] | 4 | 2 | 4 | 36 | 72 | 36 | 36 |
| 43 | | BL | fmul | 4 | 3 | | 36 | 108 | 36 | 0 |
| 44 | | MOV | R1,R10 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 45 | | BL | fadd | 4 | 3 | | 36 | 108 | 36 | 0 |
| 46 | | MOV | R1,#OUTPUT | 4 | 1 | | 36 | 36 | 36 | 0 |
| 47 | | STR | R0,[R1,R9,LSL#2] | 4 | 2 | 4 | 36 | 72 | 36 | 36 |
| 48 | | ADD | R9,R9,#1 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 49 | | CMP | R9,#36 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 50 | | BLT | L2 | 2 | 1 | | 36 | 36 | 36 | 0 |
| 51 | End: | BX | R14 | 2 | 3 | | 1 | 3 | 1 | 0 |
| | | | Total | | 77 | 32 | 6282 | 9502 | 6282 | 1106 |

## Curriculum Vitae

Imran Ashraf