Beyond Traditional Lexing
Exploiting SIMD Instructions for Tokenizing C
A. Bolfă (TU Delft - Electrical Engineering, Mathematics and Computer Science)
D.G. Sprokholt – Mentor (TU Delft - Programming Languages)
S.S. Chakraborty – Mentor (TU Delft - Programming Languages)
Burcu Özkan – Graduation committee member (TU Delft - Software Engineering)
More Info
expand_more
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Over the past decades, Single Instruction, Multiple Data (SIMD) instructions have become common- place in conventional hardware. Lexical analysis, the first stage of compilation, can take advantage of this by splitting its workload across sub lexers that identify groups of tokens with similar structures. Each sub lexer can leverage SIMD to search for these structures across multiple characters in paral- lel and extract token positions efficiently. This pa- per presents a lexer with this architecture for the C programming language, along with implementation details for x86/x64 processors. Benchmark results show a 12x speed up in throughput compared to the lexer found in GCC, a state-of-the-art compiler.