Finding your digital sibling: which other GitHub projects are similar to yours?

Finding similar repositories based on the available documentation

Bachelor Thesis (2024)
Author(s)

A.C. Turcu (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Sebastian Proksch – Mentor (TU Delft - Software Engineering)

S. Huang – Mentor (TU Delft - Software Technology)

Julia Olkhovskaya – Graduation committee member (TU Delft - Interactive Intelligence)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2024 Alexandru Turcu
Publication Year
2024
Language
English
Graduation Date
01-02-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This paper studies how much the documentation of GitHub repositories contributes when assessing the similarity between two or more applications. We analyze three dimensions: Readme files, Wiki files, and Comments from the source files. We propose a pipeline that first extracts text fragments from these dimensions and then applies Natural Language Processing techniques to prepare the data for evaluation. To compute a similarity score, we vectorize the processed data with TF-IDF and then compare the resulting vectors using cosine distance. We consider every combination of the three dimensions, from a single dimension to all three together. In addition, we extract further information from the plain text, such as referenced URLs and License usage, and calculate its similarity using Jaccard distance. We performed two experiments: the first observes the behavioral tendencies of our methodology on a small dataset, while the second validates our results. The evaluation of both experiments supports our conclusion: documentation is a valuable asset for gathering a pool of similar applications.
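The scoring step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the regex tokenizer, and the smoothed IDF formula are assumptions chosen for illustration, and the NLP preprocessing used in the thesis is not reproduced here. The sketch reports similarities; the corresponding cosine or Jaccard distance is one minus the value returned.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric tokens stand in for the
    # thesis's NLP preprocessing, which is not specified here.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumption: a common smoothed IDF variant, ln((1+n)/(1+df)) + 1.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical Readme snippets, invented for this example.
    readmes = [
        "A fast JSON parser for Python with streaming support",
        "Streaming JSON parser written in Python",
        "Image gallery generator for static websites",
    ]
    vecs = tfidf_vectors(readmes)
    print("readme 0 vs 1:", cosine_similarity(vecs[0], vecs[1]))
    print("readme 0 vs 2:", cosine_similarity(vecs[0], vecs[2]))
    # Hypothetical sets of referenced URLs extracted from two repositories.
    urls_a = {"https://example.org/docs", "https://example.org/api"}
    urls_b = {"https://example.org/api", "https://example.org/blog"}
    print("URL overlap:", jaccard_similarity(urls_a, urls_b))
```

In this sketch each documentation dimension (Readme, Wiki, Comments) would be passed as its own list of texts, or concatenated per repository when several dimensions are combined; how the thesis weights or merges the dimensions is not stated in the abstract.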

Files

Final_paper_acturcu.pdf
(pdf | 0.512 MB)
License info not available