Mining Reproducible Dependency Updates Across Ecosystems
What changes are made to dependency update pull requests before they are accepted?
P. Khan (TU Delft - Electrical Engineering, Mathematics and Computer Science)
C.R. Paulsen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
S. Proksch – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
J.A. Pouwelse – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Dependency management is a critical, difficult-to-automate task in software engineering. Researching automated dependency management requires reliable, reproducible datasets of dependency updates across ecosystems, but existing datasets fall short: they cover only specific update types (e.g. breaking updates) and log outcomes rather than the causal factors behind pull request (PR) acceptance and build outcomes. Closing this gap would let researchers determine not just whether a dependency update PR succeeded, but why, which is the information needed to build dependency management tools that developers can trust. As a first step toward this, we construct a categorisation model that describes the code changes within accepted dependency update PRs on a commit level, developed using an established taxonomy-development methodology, and build a regex-based tool that automates this categorisation with 86% accuracy on hand-labelled data. Our results show that simple, deterministic techniques can reliably support transparent, automated change categorisation, a building block for future causal datasets of dependency updates.
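To illustrate what a regex-based, commit-level categoriser might look like, the following is a minimal sketch only. The category names, the patterns, and the helper functions (`categorise_path`, `categorise_commit`, `accuracy`) are hypothetical and chosen for illustration; the abstract does not specify the actual taxonomy or rules, so none of this should be read as the tool described above.

```python
import re
from collections import Counter

# Hypothetical categories and patterns, for illustration only. The actual
# taxonomy and rules of the tool are not given in the abstract.
# Rules are checked in order; the first matching pattern wins.
RULES = [
    ("manifest", re.compile(r"(^|/)(package\.json|pom\.xml|requirements\.txt|Cargo\.toml|build\.gradle)$")),
    ("lockfile", re.compile(r"(^|/)(package-lock\.json|yarn\.lock|Cargo\.lock|poetry\.lock)$")),
    ("ci_config", re.compile(r"^\.github/workflows/.+\.ya?ml$")),
    ("test", re.compile(r"(^|/)tests?/|(^|/)test_[^/]+\.py$|\.test\.[jt]s$")),
    ("docs", re.compile(r"\.(md|rst)$")),
]


def categorise_path(path):
    """Return the label of the first rule whose pattern matches the path."""
    for label, pattern in RULES:
        if pattern.search(path):
            return label
    return "source"  # fallback for files that match no rule


def categorise_commit(paths):
    """Count how many changed files in one commit fall into each category."""
    return Counter(categorise_path(p) for p in paths)


def accuracy(predicted, gold):
    """Fraction of predictions that agree with hand-assigned labels."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

Because the rules are an ordered list of plain patterns, every label can be traced back to the single rule that produced it, which is the kind of transparency a deterministic approach offers. Comparing the output of `categorise_path` against hand-assigned labels with `accuracy` mirrors, in spirit, an evaluation on hand-labelled data.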