Contribution of source code identifiers to GitHub project similarity

Which other GitHub projects are similar to yours?

Bachelor Thesis (2024)
Author(s)

J.G.M. Crienen (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

S. Proksch – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

S. Huang – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

I.M. Olkhovskaia – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
31-01-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Collections
thesis
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

GitHub is an online platform that hosts millions of projects, many of which cover the same topic or share the same goal. Finding similar projects to use as role models, inspiration, or examples can help developers meet their requirements faster and more efficiently. Previous studies have succeeded in finding similar GitHub projects, but they do not report how well their proposed metrics indicate similarity.

Our research measures the contribution of source code identifiers to overall project similarity. We first define project similarity and each type of identifier we evaluate. We then extract these identifier types from a list of projects and use twenty of those projects as queries for our analysis. We analyze all identifiers using techniques such as TF-IDF and LSA. We find that combining all types of identifiers gives the highest chance that the most similar project found has the same topic as the query. We also find that splitting each identifier on its casing and combining all split identifiers gives the highest chance that the most similar project found is in fact similar. We therefore conclude that source code identifiers contribute reasonably to overall project similarity.
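
The pipeline described above (split identifiers on casing, weight the resulting terms with TF-IDF, rank projects by similarity to a query) can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names, the project names, and the identifier lists are all hypothetical, the smoothed IDF formula is one common variant chosen here as an assumption, and the LSA step is omitted (it would typically apply a truncated SVD to the TF-IDF matrix before comparing projects).

```python
import math
import re
from collections import Counter


def split_identifier(identifier):
    """Split an identifier on underscores and casing boundaries; return lowercase parts."""
    parts = []
    for chunk in re.split(r"[_\W]+", identifier):
        # Alternatives: an acronym run (e.g. HTTP), a capitalized or lowercase word, a digit run.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p.lower() for p in parts]


def tfidf_vectors(docs):
    """Map each project name to a sparse {term: weight} TF-IDF vector.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_similar(query, projects):
    """Rank all other projects by similarity to the query project, most similar first."""
    docs = {
        name: [part for ident in identifiers for part in split_identifier(ident)]
        for name, identifiers in projects.items()
    }
    vectors = tfidf_vectors(docs)
    scores = [
        (name, cosine(vectors[query], vec))
        for name, vec in vectors.items()
        if name != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical projects and identifiers, for illustration only.
projects = {
    "web-server-a": ["HttpRequest", "parseHeader", "sendResponse", "route_table"],
    "web-server-b": ["handleRequest", "HTTPResponse", "header_map", "routePath"],
    "image-tool": ["loadImage", "resizeBitmap", "pixel_buffer", "applyFilter"],
}
ranked = rank_similar("web-server-a", projects)
```

In this toy data the two web server projects share several split terms (http, request, header, response, route), so the second web server ranks above the image tool, which shares none. Without the casing split, identifiers such as HttpRequest and handleRequest would be treated as unrelated terms.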

Files

License info not available