Mining File Histories

Should we consider branches?

Conference Paper (2018)
Author(s)

Vladimir Kovalenko (TU Delft - Software Engineering)

Fabio Palomba (Universität Zürich)

A. Bacchelli (Universität Zürich)

Research Group
Software Engineering
Copyright
© 2018 V.V. Kovalenko, F. Palomba, A. Bacchelli
DOI related publication
https://doi.org/10.1145/3238147.3238169
Publication Year
2018
Language
English
Pages (from-to)
202-213
ISBN (print)
978-1-4503-5937-5
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Modern distributed version control systems, such as Git, support branching: the ability to develop parts of the software outside the master trunk. Accounting for repository structure in Mining Software Repositories (MSR) studies requires a thorough approach to mining, but there is no well-documented, widespread methodology for handling merge commits and branches. Moreover, little is known about the extent to which considering branches in MSR studies affects their results. In this study, we set out to evaluate the importance of proper handling of branches when calculating file modification histories. We analyze over 1,400 Git repositories from four open source ecosystems and compute modification histories for over two million files, using two different algorithms. One algorithm follows only the first parent of each commit when traversing the repository; the other returns the full modification history of a file across all branches. We show that the two algorithms consistently deliver different results, but that the scale of the difference varies across projects and ecosystems. Further, we evaluate the importance of accurate mining of file histories by comparing the performance of common techniques that rely on file modification history (reviewer recommendation, change recommendation, and defect prediction) under the two file history retrieval algorithms. We find that considering full file histories leads to only a modest increase in the techniques' performance.
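To make the contrast between the two traversal strategies concrete, the following is a minimal sketch on a toy commit graph. It is not the authors' implementation: the `Commit` class, the function names, and the four-commit example repository are hypothetical, and the model assumes each commit simply records the set of paths it modified (renames and merge-conflict resolutions are ignored). The first function is similar in spirit to walking history with Git's `--first-parent` option.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Toy commit (hypothetical model): parent ids and the paths it modified."""
    parents: tuple = ()
    files: frozenset = frozenset()


def first_parent_history(commits, head, path):
    """Follow only the first parent of each commit, starting at head."""
    history = []
    current = head
    while current is not None:
        commit = commits[current]
        if path in commit.files:
            history.append(current)
        # Ignore every parent except the first one.
        current = commit.parents[0] if commit.parents else None
    return history


def full_history(commits, head, path):
    """Visit every ancestor of head, across all merged branches."""
    seen, stack, history = set(), [head], []
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        commit = commits[current]
        if path in commit.files:
            history.append(current)
        stack.extend(commit.parents)
    # Sorted by id only to make the toy output deterministic.
    return sorted(history)


# Toy repository: A is the root, B continues the trunk, C lives on a branch,
# and M merges C into the trunk (first parent B, second parent C).
commits = {
    "A": Commit((), frozenset({"f.py"})),
    "B": Commit(("A",), frozenset({"f.py"})),
    "C": Commit(("A",), frozenset({"f.py"})),
    "M": Commit(("B", "C"), frozenset()),
}

print(first_parent_history(commits, "M", "f.py"))  # ['B', 'A']
print(full_history(commits, "M", "f.py"))          # ['A', 'B', 'C']
```

In this toy graph the first-parent walk never reaches commit C, so the change made on the branch is missing from the file's history, while the full traversal reports it.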

Files

Git2neo.pdf
(pdf | 0.894 Mb)
License info not available