Title: Data-Driven Extract Method Recommendations: An Initial Study at ING
Author: van der Leij, David (TU Delft Electrical Engineering, Mathematics and Computer Science)
Contributors: Aniche, Maurício (mentor); Visser, Eelco (graduation committee); Luo, Yaping (graduation committee)
Degree granting institution: Delft University of Technology
Programme: Computer Science
Date: 2021-04-30

Abstract: Refactoring is the process of improving the structure of code without changing its functionality. The process is beneficial for software quality, but challenges remain in identifying refactoring opportunities. This work employs machine learning to predict the application of the Extract Method refactoring in an industry setting, using code quality metrics. We detect 919 examples of Extract Method in industry code and 986 examples where Extract Method was not applied, and compare these to open-source code. We find that feature distributions between industry and open-source code differ, especially in class-level metrics. We train models to predict Extract Method in industry code and find that Random Forests perform best, with class-level metrics contributing most to their performance. We then investigate whether models trained on an open-source dataset generalize to an industry setting, and find that, although less performant than a custom-fit model, a Logistic Regression model performs admirably. Afterward, we examine whether these models perform well on unseen industry projects by validating on projects excluded from the training set; average performance is decent but lower than when training on the whole industry dataset or an open-source dataset. Lastly, we conduct a blind user study in which we ask experts to judge predictions made by our best model. We find that experts generally agree with the model's predictions. When experts agree with the model's prediction to apply Extract Method, they do so because of high code complexity; when they agree with the prediction not to refactor, the most frequent reason is that the respective methods are already sufficiently understandable.

Subject: Refactoring; Machine Learning; Industry; Data-driven

To reference this document use: http://resolver.tudelft.nl/uuid:8f5de978-7445-4ea8-ad99-b39245fcda34
Part of collection: Student theses
Document type: master thesis
Rights: © 2021 David van der Leij
Files: PDF, David_van_der_Leij_Data_D ... ersion.pdf (1.34 MB)
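To make the setup in the abstract concrete, the sketch below shows one way a Random Forest could be trained on method- and class-level code quality metrics to predict whether Extract Method was applied. It is a minimal illustration, not code from the thesis: the CSV file, column names, and metric choices are hypothetical placeholders, and only scikit-learn defaults are used.

# Minimal sketch (not the thesis implementation): predicting Extract Method
# from code quality metrics with a Random Forest classifier.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical dataset: one row per method, with method- and class-level
# metrics and a binary label (1 = Extract Method applied, 0 = not applied).
data = pd.read_csv("extract_method_examples.csv")
feature_columns = [
    "method_loc", "method_cyclomatic_complexity", "method_parameters",
    "class_loc", "class_wmc", "class_number_of_methods",
]
X = data[feature_columns]
y = data["extract_method_applied"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Held-out performance and per-feature importances; the latter indicate
# which metrics (e.g. class-level ones) drive the model's predictions.
print(classification_report(y_test, model.predict(X_test)))
for name, importance in zip(feature_columns, model.feature_importances_):
    print(f"{name}: {importance:.3f}")

The same pipeline could swap in sklearn.linear_model.LogisticRegression to mirror the cross-dataset comparison described in the abstract, keeping the feature columns and evaluation unchanged.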