Efficiently Optimizing Hyperparameters for the Gumtree Hybrid Code Differencing Algorithm within HyperAST

Bachelor Thesis (2025)
Author(s)

A.F. Nitters (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Quentin Le Dilavrec – Mentor (TU Delft - Software Engineering)

C.E. Brandt – Mentor (TU Delft - Software Engineering)

Jesper Cockx – Graduation committee member (TU Delft - Programming Languages)

Faculty
Electrical Engineering, Mathematics and Computer Science
More Info
expand_more
Publication Year
2025
Language
English
Graduation Date
25-06-2025
Awarding Institution
Delft University of Technology
Project
['CSE3000 Research Project']
Programme
['Computer Science and Engineering']
Faculty
Electrical Engineering, Mathematics and Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Code differencing allows understanding changes between different versions of software, especially when using Abstract Syntax Trees (ASTs) to structurally represent code. The Gumtree algorithm is the current state-of-the-art algorithm for AST differencing, however its main drawback is long runtimes, especially for large ASTs. The Simple and Hybrid variants aims to solve this issue with a heuristic-based recovery phase, but their evaluation has been limited to small ASTs. Benchmarks show that the Simple variant results in significant improvements in output quality and performance with very large ASTs, while the Hybrid variant has mixed results. Furthermore, increasing the max_size hyperparameter with the Hybrid variant has an unpredictable effect on output quality and a negative effect on performance, but min_priority can generally be set to 1. The Diff-Auto-Tuning optimization approach is deemed often unusable for large ASTs, though it can be used to find cases where the Hybrid variant outperforms the Simple variant on a single data point.

Files

License info not available