Discovering Digital Siblings

Quantifying Inter-Repository Similarity Through GitHub Dependency Structures

Bachelor Thesis (2024)
Author(s)

Mateusz Rębacz (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

S. Proksch – Mentor (TU Delft - Software Engineering)

S. Huang – Mentor (TU Delft - Software Technology)

Julia Olkhovskaya – Graduation committee member (TU Delft - Interactive Intelligence)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2024 Mateusz Rębacz
More Info
Publication Year
2024
Language
English
Graduation Date
01-02-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Open-source developers typically use Git repositories to store the source code of their projects transparently and to contribute to the code of others. Millions of repositories are actively hosted on platforms such as GitHub. This presents an opportunity for sharing knowledge between related projects, the so-called digital siblings. Finding repositories similar to one's own can enable better developer collaboration and knowledge transfer. However, because of the large volume of projects, manually locating the digital siblings of a project can be difficult. Hence, this paper proposes a novel approach, based on the dependency structures of GitHub repositories, for calculating inter-repository similarity and subsequently querying for similar projects. We aim to answer the research question: How can the dependency structures of GitHub repositories be leveraged to find their digital siblings? This research includes an empirical evaluation of various similarity metrics and clustering techniques for GitHub repositories. Our results show that dependency structures are a reliable characteristic for measuring similarity between projects. We also identify specific metrics and clustering techniques that are particularly efficient. Lastly, we propose and evaluate a composable similarity metric that allows our findings to be combined with the research of the other Research Project group members.
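
The idea of scoring repositories by their dependencies and then querying for the closest matches can be illustrated with a minimal sketch. It assumes that each repository is represented by the set of package names it depends on and that similarity is the Jaccard index of two such sets. The abstract does not name the metrics that were evaluated, so the metric, the function names, and the repository and package names below are illustrative assumptions, not the method of the thesis.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two dependency sets (assumed metric, for illustration)."""
    if not a and not b:
        # Two repositories without dependencies share no evidence of similarity.
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(
    target: str, repos: Dict[str, Set[str]], k: int = 3
) -> List[Tuple[str, float]]:
    """Return up to k repositories ranked by similarity to `target`."""
    scores = [
        (name, jaccard(repos[target], deps))
        for name, deps in repos.items()
        if name != target
    ]
    # Highest score first; ties are broken alphabetically for determinism.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


# Hypothetical repositories and their direct dependencies.
repos = {
    "web-api": {"flask", "requests", "sqlalchemy"},
    "web-admin": {"flask", "sqlalchemy", "jinja2"},
    "ml-pipeline": {"numpy", "pandas", "requests"},
}

if __name__ == "__main__":
    for name, score in most_similar("web-api", repos, k=2):
        print(f"{name}: {score:.2f}")
```

In this toy data set, "web-admin" ranks above "ml-pipeline" as a sibling of "web-api" because it shares two of four distinct dependencies rather than one of five. A set-based score like this ignores transitive dependencies and version constraints, which a fuller treatment of dependency structures might take into account.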

Files

CSE3000_Final_Paper_7_.pdf
(pdf | 0.232 Mb)
License info not available