Beyond Traditional Lexing

Exploiting SIMD Instructions for Tokenizing C

Bachelor Thesis (2024)
Author(s)

A. Bolfă (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

D.G. Sprokholt – Mentor (TU Delft - Programming Languages)

S.S. Chakraborty – Mentor (TU Delft - Programming Languages)

Burcu Özkan – Graduation committee member (TU Delft - Software Engineering)

Faculty
Electrical Engineering, Mathematics and Computer Science
More Info
expand_more
Publication Year
2024
Language
English
Graduation Date
23-06-2024
Awarding Institution
Delft University of Technology
Project
['CSE3000 Research Project']
Programme
['Computer Science and Engineering']
Faculty
Electrical Engineering, Mathematics and Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Over the past decades, Single Instruction, Multiple Data (SIMD) instructions have become common- place in conventional hardware. Lexical analysis, the first stage of compilation, can take advantage of this by splitting its workload across sub lexers that identify groups of tokens with similar structures. Each sub lexer can leverage SIMD to search for these structures across multiple characters in paral- lel and extract token positions efficiently. This pa- per presents a lexer with this architecture for the C programming language, along with implementation details for x86/x64 processors. Benchmark results show a 12x speed up in throughput compared to the lexer found in GCC, a state-of-the-art compiler.

Files

License info not available