Discovering the topics of Continuous Integration Projects on GitHub

Abstract

Continuous Integration (CI) is a software development technique that enhances software quality and development efficiency, but its implementation usually depends on the project's context. This creates an opportunity to study real-world CI projects on GitHub, focusing on their CI metrics and best practices. In this paper, we explore various methods for extracting topics from CI software projects on GitHub. These topics can then be used to group projects and support in-depth analysis within specific contexts and application domains, such as CI build success rates in machine learning or React Native projects. We first examine how a software topic is defined, since its granularity varies considerably across related studies. We then examine existing tools and other potential topic modeling approaches, compare the types of textual data from GitHub that could serve as inputs for these tools, and report insights from initial trials with the developed tool. Our research led us to use GitHub topic labels as topic definitions because they are relevant and have been the focus of prior research. We also evaluated three topic extraction tools (LASCAD, a Multi-label Linear Regression classifier, and ChatGPT) and incorporated the last two into our CI project mining tool. Additionally, we included two tool-independent approaches: GitHub's search function, which can filter repositories by topic, and the topic labels already assigned to projects. Lastly, to test the practicality of the tool, we mined 4899 public repositories and briefly investigated the workflow metrics of projects grouped into six arbitrarily chosen topics.