A solution to misaligned data access in a vectorizing compiler framework

Master thesis (2009)

Authors

S. De Smalen

Contributors

B.H.H. Juurlink (mentor)

Programme

Embedded Systems (TU Delft)

Keywords

Vectorization, Alignment, Compiler, SIMD

To reference this document use:

http://resolver.tudelft.nl/uuid:ee641d0f-b8c2-4194-bc46-a3e5326bea5f

Published Date

11-11-2009

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Vectorizing code for the short vector architectures employed by today’s multimedia extensions raises a number of issues. Responsibility for these issues is moved to the compiler in order to keep the hardware simple. One of those issues is memory alignment, which requires the compiler to guarantee that vectors are loaded and stored at aligned addresses. Previous work on this issue proposed a mechanism to reorder vectors at runtime to ensure proper alignment, while other work has focussed on finding a minimal number of reorderings. We combined these subjects into an in-depth study and implemented the optimization for the retargetable CoSy(R) compiler framework. Instead of focussing solely on the minimal number of reorderings, we also considered dynamic (runtime) properties which may enable latency hiding of reordering operations. Furthermore, we compared the presented reordering techniques and investigated the impact of other compiler optimizations on the proposed transformation. Finally, we put our results into perspective against the unaligned load/store operations supplied by our target architecture. With our implementation, we were able to vectorize a number of applications with alignment issues for the SSE and SSE2 vector extensions. For randomly generated loops we achieved between 50% and 80% of the speedup obtained with unaligned memory instructions. (Our target architecture is less strict about memory alignment and supplies instructions that handle misalignment in hardware.) As for the benchmarks, we achieved speedup factors of about 2.25x for a block-matching algorithm (combined with loop versioning to avoid runtime alignment), 1.6x for the SPEC95 Swim benchmark and 4x for a Sobel FIR filter.
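
The runtime reordering of vectors described in the abstract can be pictured with a small scalar model. The sketch below is not taken from the thesis and does not use the CoSy framework or real SSE instructions. It assumes a 4-lane float vector, assumes the array base is aligned, and uses hypothetical names (vec4, load_aligned, realign, load_misaligned) to show how a misaligned vector can be assembled from two adjacent aligned loads plus one reordering step.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 4 /* assumed lanes per vector: four floats in 128 bits */

/* Hypothetical 4-lane vector type used only for this illustration. */
typedef struct { float lane[VLEN]; } vec4;

/* Models an aligned vector load. Alignment is expressed as an element
 * index that must be a multiple of VLEN (base is assumed aligned). */
static vec4 load_aligned(const float *base, size_t idx) {
    assert(idx % VLEN == 0);
    vec4 v;
    for (size_t i = 0; i < VLEN; i++)
        v.lane[i] = base[idx + i];
    return v;
}

/* Reordering step: builds the vector that starts `shift` lanes into `lo`
 * by taking the tail of `lo` followed by the head of `hi`. */
static vec4 realign(vec4 lo, vec4 hi, size_t shift) {
    assert(shift < VLEN);
    vec4 r;
    for (size_t i = 0; i < VLEN; i++) {
        size_t j = i + shift;
        r.lane[i] = (j < VLEN) ? lo.lane[j] : hi.lane[j - VLEN];
    }
    return r;
}

/* Loads VLEN floats starting at an arbitrary element index using only
 * aligned loads. The shift is computed at runtime from the index. */
static vec4 load_misaligned(const float *base, size_t idx) {
    size_t shift = idx % VLEN;  /* misalignment in lanes */
    size_t start = idx - shift; /* nearest aligned index at or below idx */
    vec4 lo = load_aligned(base, start);
    if (shift == 0)
        return lo;              /* already aligned, no reordering needed */
    vec4 hi = load_aligned(base, start + VLEN);
    return realign(lo, hi, shift);
}
```

In this model every misaligned access costs two aligned loads and one reordering operation, which is the overhead that a minimal-reordering strategy or latency hiding would aim to reduce. The caller must ensure that the second aligned vector lies within the array.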

Files

Thesis.pdf

(pdf | 3.28 Mb)