Contribution of source code identifiers to GitHub project similarity

Which other GitHub projects are similar to yours?

Abstract

GitHub is an online platform that hosts millions of projects. Many of these projects have the same topic or share the same goal. Finding similar projects that can serve as role models, inspiration or examples can help developers meet their requirements faster and more efficiently. Previous studies have been successful in finding similar GitHub projects, but they do not report how well their proposed metrics indicate similarity.

Our research seeks to determine how much source code identifiers contribute to overall project similarity. We define project similarity and each type of identifier we evaluate, then extract those identifier types from a list of projects. From this list, we use twenty projects as queries for our analysis. We then analyze all identifiers using techniques such as TF-IDF and LSA. We find that combining all types of identifiers gives the highest chance that the most similar project has the same topic as the query. We also find that splitting each identifier on its casing and combining all split identifiers gives the highest chance that the most similar project found is indeed similar. We therefore conclude that source code identifiers contribute reasonably to overall project similarity.
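
To make the pipeline concrete, the following is a minimal sketch of splitting identifiers on their casing and ranking projects by TF-IDF cosine similarity. It is an illustration under stated assumptions, not the implementation used in this work: the project names, the identifiers, the function names and the smoothed IDF formula are all hypothetical, and the LSA step is left out to keep the example dependency-free.

```python
import math
import re
from collections import Counter


def split_identifier(identifier):
    """Split an identifier on underscores and case boundaries, lowercased."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


def tfidf_vectors(projects):
    """Map each project name to a sparse {term: weight} TF-IDF vector.

    Assumption: raw term counts with a smoothed IDF,
    log((1 + N) / (1 + df)) + 1. Other weightings are equally plausible.
    """
    tokens = {
        name: [tok for ident in idents for tok in split_identifier(ident)]
        for name, idents in projects.items()
    }
    n_projects = len(tokens)
    # Document frequency: in how many projects does each term occur?
    df = Counter(term for toks in tokens.values() for term in set(toks))
    vectors = {}
    for name, toks in tokens.items():
        tf = Counter(toks)
        vectors[name] = {
            term: count * (math.log((1 + n_projects) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, vectors):
    """Return the name of the project closest to the query project."""
    candidates = (name for name in vectors if name != query)
    return max(candidates, key=lambda name: cosine(vectors[query], vectors[name]))


# Hypothetical identifiers extracted from three made-up projects.
projects = {
    "json-parser-a": ["parseJsonObject", "JsonToken", "readString"],
    "json-parser-b": ["JsonReader", "parse_json_array", "tokenBuffer"],
    "http-server": ["HttpRequest", "handleConnection", "socketListener"],
}
vectors = tfidf_vectors(projects)
print(most_similar("json-parser-a", vectors))
```

Without the casing split, `parseJsonObject` and `parse_json_array` would share no terms at all; after splitting, both contribute `parse` and `json`, which is what lets the two parser projects match each other.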