S. Huang

16 records found

Bachelor thesis (2025) - K. Lee, S. Proksch, S. Huang, M.A. Zuñiga Zamalloa
Continuous Integration (CI) has become a fundamental practice in modern software development. Organizations increasingly adopt complex pipeline configurations to automate their build, test, and deployment processes. Well-optimized CI/CD pipelines offer significant benefits in deployment reliability, team productivity, and code quality. As these pipelines become more complex, defined by the number of jobs and steps in this paper, there is limited empirical evidence on how this evolution affects key performance metrics. This study addresses this gap by investigating the relationship between pipeline complexity and performance outcomes. By analyzing data from over 194 open-source GitHub repositories, we reveal that while increased complexity generally correlates with longer build durations, the impact on success rates is less direct. More importantly, we found that strategic modifications to pipeline configuration files (i.e., line changes) were frequently associated with significant performance improvements, including shorter build durations and fix times. This research provides guidance for practitioners. Rather than asking whether to increase or decrease pipeline complexity, our findings show that the focus should be on the method of change. We recommend prioritizing small, iterative maintenance activities, which consistently improve performance, over large-scale tool migrations, which have unpredictable outcomes. This approach enables teams to evolve their pipelines while mitigating the risks associated with growing complexity. ...
Bachelor thesis (2025) - D. Rachev, S. Huang, S. Proksch, M.A. Zuñiga Zamalloa
Continuous Integration (CI) has become a standard practice for speeding up software development. However, the effect of comparatively slower artifacts, like documentation, on its performance is still unclear. Although documentation is often regarded as important, there is little data that connects documentation practices to key DevOps metrics. This study examines this relationship by looking at 670 open-source projects that use CI. We measured how documentation completeness, update frequency, and release notes affect delivery frequency, defect counts, and mean time to recovery. Our results show a "tipping point" where high documentation completeness greatly increases delivery frequency. We also found a "sweet spot" for update ratios between 20% and 55%, which relates to the lowest defect counts. On the other hand, we found no proof that long release notes improve recovery time. We conclude that the effectiveness of documentation depends more on quality and rhythm than on volume. This provides developers with practical, data-driven strategies to improve project performance. ...
Continuous Integration (CI) practices have become central to open-source software (OSS) development, yet the relationship between branching strategies, merge habits, and CI performance remains underexplored. Understanding their role is crucial for explaining the variation in CI outcomes and for refining development practices. We empirically examine how branching models (feature-based vs. trunk-based) and merge characteristics (size and frequency) affect key performance indicators (KPIs).

Using a dataset of 565 GitHub repositories, we analyze both short-term trends and long-term evolution of development strategies. We find that while feature branching is strongly associated with higher delivery frequency and lower defect counts, trunk-based workflows (though rare) sometimes outperform in lead time and recovery. Similarly, frequent merges correlate with faster delivery and shorter lead times, regardless of size. A longitudinal subset reveals that projects shift toward feature-based development over time, but do not consistently adopt smaller or more frequent merges.

We also highlight methodological limitations in mining GitHub. Future research should incorporate longitudinal repository tracking and developer surveys to capture workflows that are invisible to snapshot-based analysis. This study contributes to a nuanced understanding of how code management practices shape CI outcomes in collaborative OSS projects. ...
Continuous Integration (CI) is used extensively in modern software engineering for both proprietary and open-source projects. Many studies have examined its benefits and drawbacks, finding that it increases development productivity and stability. However, CI is a set of practices ranging from static linting to calculating the code coverage of the underlying test suites. To decide whether to adopt such practices, or to evaluate the overall performance of a project’s development, practitioners rely on certain measurements, DevOps metrics being among the most significant. We aim to analyse the effects of testing strategies within CI on a set of DevOps metrics. This is done by collecting over 5,778,608 executions of GitHub CI workflows that involve a test-running step from 476 open-source projects. We see that 69.48% of runs happen after a pull request or a push. In the end, we found that frequent CI test execution did not improve the project’s DevOps metrics, indicating that developers should try to limit unnecessary test execution to save resources. Furthermore, in the smaller set of selected projects, the Pearson correlation between coverage in CI and the metrics was not statistically significant. ...
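The Pearson correlation mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the thesis's analysis code: the function name `pearson_r` and the sample values are hypothetical, and the sketch computes only the coefficient, not the significance test.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-project values: test coverage (%) and a DevOps metric.
coverage = [55.0, 62.5, 71.0, 80.0, 90.5]
metric = [3.1, 2.9, 3.4, 3.0, 3.3]
r = pearson_r(coverage, metric)
```

A coefficient near zero, or one that fails a significance test on a small sample, would match the kind of result the abstract reports for coverage.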
Bachelor thesis (2025) - A. Buntov, S. Huang, S. Proksch, M.A. Zuñiga Zamalloa
The growing complexity of modern software systems, driven by larger codebases and evolving technologies, has amplified the need for effective collaboration in developer teams. Specifically, in open-source software (OSS) projects, where contributors often vary in background and engagement in the process, this complexity may introduce further collaborative challenges. As projects scale, coordination becomes increasingly difficult to maintain, highlighting the importance of understanding the socio-technical dynamics of effective development. While prior research emphasizes the role of collaboration, its influence on software delivery remains underexplored. In order to address the gap, this study examines how team characteristics, such as size and expertise, and communication practices, like interactions on issues and pull requests, relate to delivery efficiency in OSS projects. Based on an empirical analysis of 887 GitHub repositories, we found that team size and project expertise exhibit the strongest relationships with delivery size and frequency. Core contributor activity also shows positive but diminishing effects over time, while communication practices demonstrate no noticeable associations with release efficiency. These findings suggest that optimizing delivery in OSS projects may benefit from considering adaptive team structures and aligning CI/CD practices with the project stage and the evolving dynamics of the developer team. ...

Assessing the Impact of External Repositories on Packages in Maven Central

This paper presents a comprehensive experimental study on the use and impact of external repositories in the Maven ecosystem. We examined the prevalence, naming patterns, and potential risks associated with external repositories. We analyzed 199,188 packages and found that 3.29% of projects employ external repositories. Our findings indicate a decline in the usage of external repositories over time, with one (1.85%) or two (0.72%) external repositories being the most common configurations. The usage of external repositories has no significant (p < 0.05) effect on the error rate. However, 19.85% of the errors of packages that use an external repository are caused by one of their external repositories. Moreover, we found that 69.58% of the repository URLs were unreachable, and 19.31% of the unique ids have two or more different repository URLs associated with them. Based on our findings, developers are urged to thoroughly evaluate their usage of external repositories and to consider checking their settings.xml and pom.xml files to ensure that no URL or id collisions are present or causing unintended behaviour. ...
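The id collision described above (one repository id associated with two or more different URLs) can be detected with a simple grouping step. The sketch below is an illustration under assumptions, not the paper's tooling: the function name, the input shape (pairs of id and URL already extracted from the configuration files), and the example URLs are all hypothetical.

```python
from collections import defaultdict

def find_id_collisions(declarations):
    """Return repository ids that map to two or more distinct URLs.

    `declarations` is an iterable of (repository_id, url) pairs.
    """
    urls_by_id = defaultdict(set)
    for repo_id, url in declarations:
        # Ignore a trailing slash so the same URL is not counted twice.
        urls_by_id[repo_id].add(url.rstrip("/"))
    return {
        repo_id: sorted(urls)
        for repo_id, urls in urls_by_id.items()
        if len(urls) > 1
    }

# Hypothetical declarations gathered from several POM files.
declarations = [
    ("internal", "https://repo.example.org/releases"),
    ("internal", "https://mirror.example.org/releases"),
    ("snapshots", "https://repo.example.org/snapshots/"),
    ("snapshots", "https://repo.example.org/snapshots"),
]
collisions = find_id_collisions(declarations)
```

Here only `internal` is reported, because the two `snapshots` entries differ only by a trailing slash.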

Quantifying Inter-Repository Similarity Through GitHub Dependency Structures

Open Source developers typically use Git repositories to transparently store the source code of projects and contribute to the code of others. There are millions of repositories actively hosted on platforms such as GitHub. This presents an opportunity for sharing knowledge between related projects – the so-called digital siblings. Finding repositories similar to one's own can allow for better developer collaboration and knowledge transfer. However, due to the large volume of projects, manually locating digital siblings of a project can be difficult. Hence, this paper proposes a novel approach, based on the dependency structures of GitHub repositories, that allows for calculating inter-repository similarity and subsequently querying for similar projects. We aim to answer the research question: How can the dependency structures of GitHub repositories be leveraged to find their digital siblings? This research includes an empirical evaluation of various similarity metrics and clustering techniques for GitHub repositories. Our results show that dependency structures are a reliable characteristic for measuring similarity between projects. We also identify the specific metrics and clustering techniques as particularly efficient. Lastly, we propose and evaluate a composable similarity metric to allow our findings to be combined with the research of the other Research Project group members. ...

Grouping GitHub projects that share certain attributes based on interactions and activities

Bachelor thesis (2024) - R.W. de Bruin, S. Proksch, S. Huang, I.M. Olkhovskaia
This study explores the feasibility of categorizing GitHub projects based on their interactions and activities, aiming to assist both researchers and practitioners in navigating the vast landscape of open-source software. Through experiments and analysis, key attributes contributing to project categorization are identified, paving the way for effective grouping of projects in terms of interactions and activities. Findings indicate distinct clusters among GitHub projects, highlighting the influence of interactions and activities on project categorization. The study underscores the importance of refining grouping algorithms and improving project categorization methods for future research. Future work could involve developing user-friendly tools to facilitate project discovery and exploring correlations between interaction related metrics and project development dynamics. Overall, this study contributes to advancing our understanding of project categorization on GitHub, facilitating more efficient knowledge sharing and collaboration within professional fields. ...

Finding similar repositories based on the available documentation

Bachelor thesis (2024) - A.C. Turcu, S. Proksch, S. Huang, I.M. Olkhovskaia
This paper aims to study the importance of considering the documentation side of GitHub repositories when assessing the similarity between two or more applications. Readme and Wiki files, along with Comments from the source files, are the dimensions proposed to be analyzed through our methodology and experiments. We propose a pipeline that first extracts text fragments from these dimensions and then applies Natural Language Processing techniques to further prepare our data for evaluation. To gather a similarity score, we first vectorize our processed data with TF-IDF and then use cosine distance to obtain the score. Combinations of the three dimensions, ranging from using only one dimension to using all of them, are considered throughout our study. Moreover, additional information has been extracted from the plain text, such as referenced URLs and License usage, the similarity of which was calculated using Jaccard distance. Two experiments were performed. The first one aims to observe the behavioral tendencies of our methodology applied to a small dataset, while the second one aims to validate our results. By evaluating them, we found sufficient data that supported our presented conclusion: documentation represents a valuable asset in gathering a pool of similar applications. ...
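The scoring step described in this abstract (TF-IDF vectors compared by cosine, and sets of URLs or licenses compared by Jaccard) can be sketched in plain Python. This is a minimal illustration under assumptions: the function names, the smoothed IDF variant, and the sample texts are hypothetical and not taken from the thesis. The functions return similarities; the corresponding distance is one minus the similarity.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption) keeps every weight positive.
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical, already preprocessed README texts.
readmes = [
    "http client library for java".split(),
    "java http client with async support".split(),
    "image processing toolkit in rust".split(),
]
vectors = tfidf_vectors(readmes)
text_score = cosine_similarity(vectors[0], vectors[1])
license_score = jaccard_similarity({"MIT", "Apache-2.0"}, {"Apache-2.0", "GPL-3.0"})
```

In this toy input the first two READMEs share several terms and score above zero, while the first and third share none and score zero.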

Which other GitHub projects are similar to yours?

Bachelor thesis (2024) - J.G.M. Crienen, S. Proksch, S. Huang, I.M. Olkhovskaia
GitHub is an online platform that hosts millions of projects. Many of these projects have the same topic or share the same goal. Finding similar projects which can be used as role models, inspiration or examples can help developers meet their requirements faster and more efficiently. Previous studies have been successful in finding similar GitHub projects, but they do not share how well their proposed metrics indicate similarity.

Our research and analysis seek to determine the contribution of source code identifiers to overall project similarity. We define project similarity and each type of identifier we evaluate. We then extract the defined types of identifiers from a list of projects, twenty of which serve as queries for our analysis. We analyze all identifiers using techniques such as TF-IDF and LSA. We find that combining all types of identifiers results in the highest chance that the most similar project has the same topic. We also find that splitting each identifier on its casing and combining all split identifiers results in the highest chance that the most similar project found is indeed similar. We therefore conclude that source code identifiers contribute reasonably to overall project similarity. ...
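Splitting identifiers on their casing, as mentioned above, might look like the following sketch. The function name and the exact splitting rules (camelCase, acronyms, digits, and non-alphanumeric separators) are assumptions made for illustration; the thesis may split differently.

```python
import re

# An acronym run not followed by lowercase, a capitalized or lowercase word,
# or a run of digits.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")

def split_identifier(identifier):
    """Split a source code identifier into lowercase word tokens."""
    tokens = []
    # First cut on separators such as underscores, then on case changes.
    for part in re.split(r"[^A-Za-z0-9]+", identifier):
        tokens.extend(_WORD.findall(part))
    return [token.lower() for token in tokens]

# Hypothetical identifiers pooled into one token list per project.
identifiers = ["parseHTTPResponse", "max_retry_count", "XMLParser2"]
project_tokens = [t for name in identifiers for t in split_identifier(name)]
```

The resulting token list can then be fed to a term-weighting step such as TF-IDF in place of the raw identifiers.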
Bachelor thesis (2024) - C.M. Manoli, S. Proksch, S. Huang, I.M. Olkhovskaia
GitHub is the home of hundreds of millions of Open Source Software (OSS) repositories where users collaborate on projects and find inspiration for new ideas. Some of these projects have build configurations set up to make building, testing, and deploying the software more time-efficient and less error-prone. However, setting up the correct configurations usually requires a lot of time and a high level of knowledge. This paper aims to analyze the current practices for setting up build configurations such as Maven files and GitHub Actions workflows, while clustering some of these practices based on the scope of the project. Thus, we provide useful information for discovering similar projects based on their build configurations and discuss the feasibility of build configuration analysis. In summary, we provide a comprehensive analysis of project similarity based on Maven build configurations and workflow files, shedding light on the importance of build configurations for identifying similar projects, and laying the groundwork for future exploration in the realm of build configuration analysis. ...

A Study of GitHub Actions in Continuous Integration Projects

The Continuous Integration (CI) practice has been growing and developing rapidly ever since its introduction. It provides developers with benefits such as early bug detection and fast feedback to development teams. In this study, we aim to identify the descriptive metrics that best illustrate the performance of the CI build stage, regarded as the heart of the development process.

We conduct a small case study on repositories utilizing GitHub Actions, a CI tool that is relatively unexplored. Within this context, we classify projects using two performance indicators: build breakages and build durations. We examine two distinct sets of metrics in our analysis. The first set consists of build-level metrics, which are closely linked to the build stage. The second set consists of project-level metrics.

Our findings suggest that patterns traditionally associated with low breakages and durations are applicable to repositories employing GitHub Actions. However, understanding the relationship between project level metrics demands a more comprehensive approach, necessitating a thorough analysis of the project context for a holistic understanding of build performance. ...
Bachelor thesis (2023) - L. Ostrovskis, S. Huang, S. Proksch, E.A. Aivaloglou
Continuous Integration (CI) is a software development technique that enhances software quality and development efficiency, but its implementation usually depends on the project's context. This creates an opportunity for studying real-world CI projects on GitHub, focusing on their CI metrics and best practices. In this paper, we explore various methods to extract the topics from CI software projects on GitHub. This data can then be used to group projects and facilitate an in-depth analysis within specific contexts and application domains, such as CI build success rates in machine learning or React Native projects. We explore the definition of a software topic, as it shows significant granularity variations in related studies. We examine existing tools and other potential topic modeling approaches, compare varying types of textual data from GitHub that could be used as inputs for these tools, and report on interesting insights from initial trials with the developed tool. Our research led us to use GitHub topic labels as topic definitions due to their relevance and prior research focus. We also evaluated three topic extraction tools - LASCAD, a Multi-label Linear Regression classifier, and ChatGPT - incorporating the last two into our CI project mining tool. Additionally, we included two tool-independent approaches: GitHub's search function with the ability to filter repositories by topics and existing project topic labels. Lastly, to test the practicality of the tool, we mined 4899 public repositories and briefly investigated workflow metrics of projects grouped into six arbitrary topics. ...

Discover the Descriptive Metrics of the Context in Continuous Integration (CI) Project

Bachelor thesis (2023) - P.J. Hibbs, S. Huang, S. Proksch
Continuous Integration (CI) systems automate building, testing, and potentially other tasks. However, it is still unclear how CI should be implemented in different contexts. Therefore, this paper tries to answer the question "What metrics can be used to describe project activity?", as part of a bigger study. We mined information from 500 repositories and then applied several analysis techniques to find out whether a metric can be used to describe activity or not. Among the results, we show that activity increases around a release date, and that Java is considerably more active than other languages, with the highest number of commits, closed pull requests, contributors, issues, and releases. ...

The Implementation of Continuous Integration Pipelines

Bachelor thesis (2023) - A.C. de Vries, S. Proksch, S. Huang, E.A. Aivaloglou
While continuous integration has already been proven to positively affect software development, little is known about how it should be implemented based on project context. This paper investigates how CI pipelines are configured by analysing data mined from software projects on GitHub. This research has shown the continued rise of the CI platform GitHub Actions, which enables developers to broaden CI pipelines’ functionality due to great integration into GitHub. Moreover, key differences between how jobs within pipelines are structured in Travis CI and GitHub Actions are outlined. These results can be used in future research, which will be aimed at connecting project context to CI setup with the goal of informing developers on maturing their CI configuration. ...

An analysis of key indicators of maturity

Bachelor thesis (2023) - K. Sartori, S. Proksch, S. Huang, E.A. Aivaloglou
Continuous integration (CI) is a software engineering practice that promotes frequent code integration into a shared repository, improving the productivity within development teams as well as the quality of the software being developed. While CI adoption has gained traction, studies have examined its effective implementation and associated challenges. The idea that multiple contextual factors influence the adoption of CI prompts an exploration of suitable descriptive metrics for describing the CI practices employed. This paper aims to explore the metrics that best depict the level of maturity of a project, addressing the question: "What metrics can be used to describe the maturity level of a project?". With a lack of a comprehensive maturity framework, we leverage GitHub's API in an attempt to analyze various metrics to be used to create a framework for filtering projects.
Our findings indicate that project maturity cannot be captured by a single metric, but rather a combination of metrics reflecting different aspects throughout the project's lifecycle. Activity levels, including commits and pull requests, popularity indicators like stargazers, forks, and contributors, as well as repository size and age, emerge as primary indicators of maturity. By combining these metrics, a unified framework for categorizing mature projects can be established and further developed. ...