Evaluating Autonomous Coding Agents for Code Refactoring and Maintainability
A Large-Scale Study of Open-Source Software
I. Joshi (TU Delft - Electrical Engineering, Mathematics and Computer Science)
M. Izadi – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
R.M. Popescu – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
B. Özkan – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
M.A. Migut – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
The rapid adoption of autonomous coding agents raises a practical question for developers: is agent-authored code maintainable after merge? We present a large-scale empirical study of agent- and human-authored pull requests in open-source GitHub repositories, focusing on refactoring and maintainability. We construct a novel dataset of 4,392,818 agent-authored and 517,880 human-authored pull requests from 863,819 repositories, spanning 10 agents and 4 programming languages: C++, Java, JavaScript, and Python. Using a subset of 321,986 pull requests, we compare refactoring behavior, code smells, and maintainability metrics between agent- and human-authored contributions. We further examine how these outcomes vary across languages, repository popularity, and domains, and track post-merge evolution from 3 days to 2 months after merge to assess whether maintainability-related effects persist over time.
Our results show that agent-authored pull requests refactor less frequently and less diversely than human-authored pull requests, but their refactorings tend to affect larger code regions, especially in less popular repositories. Maintainability outcomes are mixed: agent-modified code is more likely to contain code smells after merge, while median metric changes remain context-dependent and broadly comparable to those of human-authored code. Longitudinally, agent-modified code shows maintainability trends similar to those of human-modified code after the early post-merge period, although agent-modified regions are revisited more frequently.
Files
File under embargo until 01-01-2027