Analyzing Iterative Improvements to Structural Code Diffs

None, None

Analyzing Iterative Improvements to Structural Code Diffs

Bachelor Thesis (2025)

Author(s)

M. Mejer (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Q.T. Le Dilavrec – Mentor (TU Delft - Software Engineering)

Carolin Brandt – Mentor (TU Delft - Software Engineering)

Jesper Cockx – Graduation committee member (TU Delft - Programming Languages)

Faculty

Electrical Engineering, Mathematics and Computer Science

Version Control System Abstract Syntax Tree HyperDiff GumTree Tree Edit Distance Code Differencing AST Differencing Structural Code Diff

To reference this document use:

https://resolver.tudelft.nl/uuid:a9ac5f85-5e63-4541-9172-7fa5e6d72ccf

More Info

expand_more

Publication Year

2025

Language

English

Graduation Date

25-06-2025

Awarding Institution

Delft University of Technology

Project

['CSE3000 Research Project']

Programme

['Computer Science and Engineering']

Faculty

Electrical Engineering, Mathematics and Computer Science

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Version control systems rely on code differencing algorithms to track changes and support key development tasks such as merging, code search, and code reviews. Traditional differencing techniques operate on plain-text representations of source code, which sometimes fail to convey the original intent behind code modifications. To address this, modern algorithms operate on abstract syntax trees (ASTs), enabling more accurate and structurally meaningful edit scripts. However, computing differences between ASTs poses new challenges, especially in balancing edit script quality with runtime performance.

This paper investigates the evolution of AST differencing algorithms by analyzing a sequence of key refinements built on top of Xy - a foundational algorithm originally designed for XML. We evaluate three influential enhancements: GumTree’s optimal recovery strategy, the simplified recovery heuristic introduced in GumTree Simple, and HyperDiff’s use of a compressed AST representation. We evaluate each refinement independently using a shared benchmarking framework and a dataset of real-world Java code changes. Our results show how each refinement incrementally improves scalability, runtime stability, and script quality. These findings offer a deeper understanding of the design trade-offs in AST differencing and provide guidance for developing efficient and interpretable structural diff tools at scale.

Files

Analyzing_Iterative_Improvemen... (pdf)

(pdf | 4.69 Mb)

License info not available