Discovering Digital Siblings

Quantifying Inter-Repository Similarity Through GitHub Dependency Structures

Abstract

Open-source developers typically use Git repositories to transparently store the source code of their projects and to contribute to the code of others. Millions of repositories are actively hosted on platforms such as GitHub. This presents an opportunity for sharing knowledge between related projects – so-called digital siblings. Finding repositories similar to one's own can enable better developer collaboration and knowledge transfer. However, because of the large volume of projects, manually locating the digital siblings of a project can be difficult. This paper therefore proposes a novel approach, based on the dependency structures of GitHub repositories, for calculating inter-repository similarity and subsequently querying for similar projects. We aim to answer the research question: How can the dependency structures of GitHub repositories be leveraged to find their digital siblings? The research includes an empirical evaluation of various similarity metrics and clustering techniques for GitHub repositories. Our results show that dependency structures are a reliable characteristic for measuring similarity between projects. We also identify specific metrics and clustering techniques that are particularly efficient. Lastly, we propose and evaluate a composable similarity metric that allows our findings to be combined with the research of the other Research Project group members.