Finding your digital sibling: which other GitHub projects are similar to yours?

Finding similar repositories based on the available documentation

Bachelor Thesis (2024)

Author(s)

A.C. Turcu (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Sebastian Proksch – Mentor (TU Delft - Software Engineering)

S. Huang – Mentor (TU Delft - Software Technology)

Julia Olkhovskaya – Graduation committee member (TU Delft - Interactive Intelligence)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

GitHub, Text Mining, Documentation

To reference this document use:

https://resolver.tudelft.nl/uuid:6db01376-931b-4eb2-a494-4eee0c55cde7

Publication Year

2024

Language

English

Graduation Date

01-02-2024

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This paper studies the importance of the documentation side of GitHub repositories when assessing the similarity between two or more applications. The dimensions we analyze are Readme files, Wiki files, and comments from the source files. We propose a pipeline that first extracts text fragments from these dimensions and then applies Natural Language Processing techniques to prepare the data for evaluation. To compute a similarity score, we vectorize the processed data with TF-IDF and then apply cosine distance. Throughout the study we consider combinations of the three dimensions, ranging from a single dimension to all of them. In addition, we extract further information from the plain text, such as referenced URLs and license usage, and calculate its similarity using Jaccard distance. We performed two experiments: the first observes the behavioral tendencies of our methodology on a small dataset, while the second aims to validate our results. The evaluation of both experiments provides sufficient evidence to support our conclusion: documentation is a valuable asset for gathering a pool of similar applications.
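The scoring step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the tokenizer, the smoothed IDF formula, the function names, and the choice to report similarity (one minus distance) are all assumptions made for illustration, and the preprocessing is far simpler than a full NLP pipeline.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Turn a list of strings into sparse TF-IDF vectors (dicts of term -> weight).

    Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1, chosen for illustration.
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: count * idf[term] for term, count in tf.items()})
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors; cosine distance is 1 minus this value."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Jaccard similarity of two collections treated as sets; distance is 1 minus this value.

    Assumption: two empty sets score 0.0, since they give no evidence of similarity.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical Readme snippets; real input would be the extracted documentation text.
    readmes = [
        "A fast JSON parser for Python with streaming support",
        "Streaming JSON parser library written in Python",
        "Tetris clone built with a small game engine",
    ]
    vectors = tfidf_vectors(readmes)
    print("repo 0 vs repo 1:", cosine_similarity(vectors[0], vectors[1]))
    print("repo 0 vs repo 2:", cosine_similarity(vectors[0], vectors[2]))
    # Hypothetical license sets extracted from two repositories.
    print("licenses:", jaccard_similarity({"MIT"}, {"MIT", "Apache-2.0"}))
```

In this sketch, each documentation dimension (Readme, Wiki, comments) would be vectorized and scored separately or concatenated before vectorizing, depending on which combination is being evaluated; how the thesis combines the per-dimension scores is not stated in the abstract.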

Files

Final_paper_acturcu.pdf

(pdf | 0.512 Mb)

License info not available