Design and Implementation of Parallelized AWK
I. Kravcevs (TU Delft - Electrical Engineering, Mathematics and Computer Science)
D. Spinellis – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
This project presents the design and implementation of a system for automatic parallelization of AWK programs. AWK remains a widely used language for text processing and data transformation, and it is included as a standard utility on most Unix-like systems. Its execution model is traditionally sequential, which limits scalability on multi-core hardware. The goal of this work is to investigate whether static program analysis can identify AWK scripts that can be executed in parallel, and to integrate this capability into an AWK interpreter.
The proposed solution introduces a static analyzer that evaluates AWK programs based on variable dependencies, control flow, and other behaviors that impact data dependencies. The analyzer identifies reduction patterns for global variables and determines whether program semantics can be preserved under parallel execution. These results are then integrated into the interpreter, which enables deterministic multi-threaded execution.
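As a rough illustration of what identifying reduction patterns can mean, the sketch below classifies updates to global variables by their assignment operator. It is not the project's analyzer: the `Update` record, the `MERGEABLE` table, and the rule that a self-reading or plainly assigned variable blocks parallel execution are all assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical, simplified record of one update to a global variable in an
# AWK main action, e.g. `sum += $2` -> Update("sum", "+=", reads_self=False).
@dataclass(frozen=True)
class Update:
    var: str
    op: str            # assignment operator as written in the AWK source
    reads_self: bool   # True if the right-hand side also reads `var`

# Assumed set of operators whose per-thread partial results can be merged
# afterwards (associative and commutative updates).
MERGEABLE = {"+=": "sum", "++": "sum", "*=": "product"}

def classify(updates):
    """Map each variable to a merge strategy, or None if it blocks parallelism."""
    result = {}
    for u in updates:
        kind = MERGEABLE.get(u.op)
        if u.reads_self:
            # e.g. `x += x * $1` depends on the running value of x
            kind = None
        if u.var in result and result[u.var] != kind:
            # conflicting update styles for the same variable
            kind = None
        result[u.var] = kind
    return result

def parallelizable(updates):
    """True only if every updated global has a known merge strategy."""
    return all(kind is not None for kind in classify(updates).values())
```

Under these assumptions, `{ sum += $2; n++ }` would be classified as two sum reductions, while `{ prev = $1 }` would be rejected because the value of `prev` depends on record order.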
The project adopts the MapReduce programming model to enable parallel execution of AWK. The main processing phase of a script is treated as the map stage, where independent partitions of the input are processed concurrently by multiple workers. Intermediate thread-local results are then combined in a reduce stage using aggregation strategies derived from static analysis. This model provides a structured way to preserve AWK’s sequential semantics in the parallelized environment.
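The map and reduce stages described above can be sketched in a few lines. The example assumes an AWK program of the form `{ total += $2; count[$1]++ }` and mimics it in Python; the function names, the partitioning by equal-sized chunks, and the use of a thread pool are illustrative choices, not the project's implementation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_chunk(lines):
    """Map stage: run the main action on one partition with thread-local state."""
    total = 0.0
    count = Counter()
    for line in lines:
        fields = line.split()
        total += float(fields[1])   # total += $2
        count[fields[0]] += 1       # count[$1]++
    return total, count

def reduce_states(states):
    """Reduce stage: merge thread-local results with a per-variable strategy."""
    total = 0.0
    count = Counter()
    for part_total, part_count in states:
        total += part_total         # scalars merged by addition
        count.update(part_count)    # arrays merged key by key
    return total, count

def run_parallel(lines, workers=4):
    """Split the input into contiguous chunks, map concurrently, then reduce."""
    size = max(1, -(-len(lines) // workers))  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in chunk order, so merging is deterministic
        states = list(pool.map(map_chunk, chunks))
    return reduce_states(states)
```

Calling `run_parallel(lines, workers=2)` should give the same totals as `map_chunk(lines)` on the whole input, which is the property a parallel AWK run has to preserve. In CPython, threads will not speed up this CPU-bound loop; the sketch only shows the structure of the model.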
The implementation was evaluated on a dataset of real-world AWK scripts and through performance benchmarks on large text-processing workloads. The results show that a significant subset of AWK programs can be parallelized automatically, yielding execution speedups and state-of-the-art performance among AWK implementations.
The project provides a practical path for improving efficiency in text-processing workflows. This work also demonstrates that scripting languages can often benefit from modern parallel execution techniques, extending their practical relevance and performance in data-processing tasks.