Discovering the topics of Continuous Integration Projects on GitHub

Bachelor Thesis (2023)
Author(s)

L. Ostrovskis (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

S. Huang – Mentor (TU Delft - Software Technology)

S. Proksch – Graduation committee member (TU Delft - Software Engineering)

Fenia Aivaloglou – Graduation committee member (TU Delft - Web Information Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2023 Lukas Ostrovskis
Publication Year
2023
Language
English
Graduation Date
28-06-2023
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Continuous Integration (CI) is a software development practice that improves software quality and development efficiency, but how it is implemented usually depends on the project's context. This creates an opportunity to study real-world CI projects on GitHub, focusing on their CI metrics and best practices. In this paper, we explore methods for extracting topics from CI software projects on GitHub. The extracted topics can then be used to group projects and enable in-depth analysis within specific contexts and application domains, such as CI build success rates in machine learning or React Native projects. We first examine how a software topic should be defined, since related studies differ significantly in the granularity they use. We then review existing tools and other potential topic modeling approaches, compare the types of textual data from GitHub that could serve as input for these tools, and report insights from initial trials with the tool we developed. Our research led us to use GitHub topic labels as topic definitions because of their relevance and the attention they have received in prior research. We evaluated three topic extraction tools (LASCAD, a Multi-label Linear Regression classifier, and ChatGPT) and incorporated the last two into our CI project mining tool. We also included two tool-independent approaches: GitHub's search function, which can filter repositories by topic, and the topic labels already assigned to projects. Lastly, to test the practicality of the tool, we mined 4899 public repositories and briefly investigated the workflow metrics of projects grouped into six arbitrary topics.
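
The topic-based grouping described in the abstract can be illustrated with a minimal sketch. It is not the thesis's mining tool. It relies only on the `topic:` qualifier of GitHub's repository search endpoint; the function names, the dictionary keys (`topics`, `runs`), and the sample data are hypothetical and chosen for illustration. No network request is made.

```python
from collections import defaultdict
from urllib.parse import urlencode


def topic_search_url(topic, per_page=30):
    """Build a GitHub REST search URL for repositories carrying a topic label."""
    query = urlencode({"q": f"topic:{topic}", "per_page": per_page})
    return f"https://api.github.com/search/repositories?{query}"


def success_rate_by_topic(repos):
    """Group repositories by topic label and compute the share of successful runs.

    `repos` is a list of dicts with hypothetical keys:
    "topics" (list of str) and "runs" (list of workflow run conclusions).
    """
    totals = defaultdict(lambda: [0, 0])  # topic -> [successful runs, all runs]
    for repo in repos:
        successes = sum(1 for conclusion in repo["runs"] if conclusion == "success")
        for topic in repo["topics"]:
            totals[topic][0] += successes
            totals[topic][1] += len(repo["runs"])
    # Skip topics without any recorded runs to avoid dividing by zero.
    return {t: ok / n for t, (ok, n) in totals.items() if n > 0}


# Made-up sample data, for illustration only.
sample = [
    {"topics": ["machine-learning"], "runs": ["success", "failure"]},
    {"topics": ["react-native", "machine-learning"], "runs": ["success", "success"]},
]
rates = success_rate_by_topic(sample)
```

A repository with several topic labels contributes its runs to every one of its topics, so the per-topic groups in this sketch may overlap.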
