Circular Image

G. Gousios

info

Please Note

57 records found

Fast and lightweight call graph generation for software builds

Call Graphs are a rich data source and form the foundation for advanced static analyses that can, for example, detect security vulnerabilities or dead code. This information is invaluable when it is immediately available, such as in the output of a build system. Call Graph generation is a whole-program analysis: not just the application, but also all its dependencies are processed together. Recent work has shown that even advanced static analyses can use summarization techniques to substantially improve runtime; however, existing analyses focus on soundness, and as such remain very expensive. When executed in the build system, which typically has limited resources, even powerful servers suffer from slow build times, rendering these analyses impractical in today’s fast-paced development. In this paper, we aim to strike a balance between improving static analyses while remaining practical for use cases that require quick results in low-resource environments. We propose a summarization-based implementation of a Class-Hierarchy Analysis algorithm for call graph generation of Java programs. Our approach leverages the fact that dependency sets often do not change between builds: we can generate call graphs for these dependencies, cache their generation for subsequent builds, and using a novel stitching algorithm, Frankenstein, merge all partial results into a complete call graph for the whole program. Our evaluation results show that this lightweight approach can substantially outperform existing frameworks. In terms of speed improvements, Frankenstein surpasses the baselines by up to 38%, requiring an average of just 388 Megabytes of memory. This makes the proposed approach practical for build systems with limited memory resources. Despite these optimizations, our generated call graphs maintain a near-identical set of edges when compared to the baselines, achieving an F 1 score of up to 0.98. This summarization-based approach for call graph generation paves the way for using extended static analyses in build processes. ...
Conference paper (2024) - Elvan Kula, Arie Van Deursen, Georgios Gousios
Sprint planning is essential for the successful execution of agile software projects. While various prioritization criteria influence the selection of user stories for sprint planning, their relative importance remains largely unexplored, especially across different project contexts. In this paper, we investigate how prioritization criteria vary across project settings and propose a model for generating sprint plans that are tailored to the context of individual teams. Through a survey conducted at ING, we identify urgency, sprint goal alignment, and business value as the top prioritization criteria, influenced by project factors such as resource availability and client type. These results highlight the need for contextual support in sprint planning. To address this need, we develop an optimization model that generates sprint plans aligned with the specific goals and performance of a team. By integrating teams' planning objectives and sprint history, the model adapts to unique team contexts, estimating prioritization criteria and identifying patterns in planning behavior. We apply our approach to real-world data from 4,841 sprints at ING, demonstrating significant improvements in team alignment and sprint plan effectiveness. Our model boosts team performance by generating plans that deliver more business value, align more closely with sprint goals, and better mitigate delay risks. Overall, our results show that the efficiency and outcomes of sprint planning practices can be significantly improved through the use of context-aware optimization methods. ...
Conference paper (2023) - E. Kula, Eric Greuter, A. van Deursen, G. Gousios
Modern agile software projects are subject to constant change, making it essential to re-asses overall delay risk throughout the project life cycle. Existing effort estimation models are static and not able to incorporate changes occurring during project execution. In this paper, we propose a dynamic model for continuously predicting overall delay using delay patterns and Bayesian modeling. The model incorporates the context of the project phase and learns from changes in team performance over time. We apply the approach to real-world data from 4,040 epics and 270 teams at ING. An empirical evaluation of our approach and comparison to the state-of-the-art demonstrate significant improvements in predictive accuracy. The dynamic model consistently outperforms static approaches and the state-of-the-art, even during early project phases. ...
Journal article (2023) - Xunhui Zhang, Yue Yu, Gousios Georgios, Ayushi Rastogi
Context: The pull-based development model is widely used in open source projects, leading to the emergence of trends in distributed software development. One aspect that has garnered significant attention concerning pull request decisions is the identification of explanatory factors. Objective: This study builds on a decade of research on pull request decisions and provides further insights. We empirically investigate how factors influence pull request decisions and the scenarios that change the influence of such factors. Method: We identify factors influencing pull request decisions on GitHub through a systematic literature review and infer them by mining archival data. We collect a total of 3,347,937 pull requests with 95 features from 11,230 diverse projects on GitHub. Using these data, we explore the relations among the factors and build mixed effects logistic regression models to empirically explain pull request decisions. Results: Our study shows that a small number of factors explain pull request decisions, with that concerning whether the integrator is the same as or different from the submitter being the most important factor. We also note that the influence of factors on pull request decisions change with a change in context; e.g., the area hotness of pull request is important only in the early stage of project development, however it becomes unimportant for pull request decisions as projects become mature. ...
Journal article (2022) - Elvan Kula, Eric Greuter, Arie Van Deursen, Gousios Georgios
Late delivery of software projects and cost overruns have been common problems in the software industry for decades. Both problems are manifestations of deficiencies in effort estimation during project planning. With software projects being complex socio-technical systems, a large pool of factors can affect effort estimation and on-time delivery. To identify the most relevant factors and their interactions affecting schedule deviations in large-scale agile software development, we conducted a mixed-methods case study at ING: two rounds of surveys revealed a multitude of organizational, people, process, project and technical factors which were then quantified and statistically modeled using software repository data from 185 teams. We find that factors such as requirements refinement, task dependencies, organizational alignment and organizational politics are perceived to have the greatest impact on on-time delivery, whereas proxy measures such as project size, number of dependencies, historical delivery performance and team familiarity can help explain a large degree of schedule deviations. We also discover hierarchical interactions among factors: organizational factors are perceived to interact with people factors, which in turn impact technical factors. We compose our findings in the form of a conceptual framework representing influential factors and their relationships to on-time delivery. Our results can help practitioners identify and manage delay risks in agile settings, can inform the design of automated tools to predict schedule overruns and can contribute towards the development of a relational theory of software project management. ...

Accelerating Overdue Pull Requests toward Completion

Journal article (2022) - C.S. Maddila, Sai Surya Upadrasta Upadrasta, Chetan Bansal , Nachiappan Nagappan, G. Gousios, A. van Deursen
Pull requests are a key part of the collaborative software development and code review process today. However, pull requests can also slow down the software development process when the reviewer(s) or the author do not actively engage with the pull request. In this work, we design an end-to-end service, Nudge, for accelerating overdue pull requests toward completion by reminding the author or the reviewer(s) to engage with their overdue pull requests. First, we use models based on effort estimation and machine learning to predict the completion time for a given pull request. Second, we use activity detection to filter out pull requests that may be overdue but for which sufficient action is taking place nonetheless. Last, we use actor identification to understand who the blocker of the pull request is and nudge the appropriate actor (author or reviewer(s)). The key novelty of Nudge is that it succeeds in reducing pull request resolution time, while ensuring that developers perceive the notifications sent as useful, at the scale of thousands of repositories. In a randomized trial on 147 repositories in use at Microsoft, Nudge was able to reduce pull request resolution time by 60% for 8,500 pull requests, when compared to overdue pull requests for which Nudge did not send a notification. Furthermore, developers receiving Nudge notifications resolved 73% of these notifications as positive. We observed similar results when scaling up the deployment of Nudge to 8,000 repositories at Microsoft, for which Nudge sent 210,000 notifications during a full year. This demonstrates Nudge's ability to scale to thousands of repositories. Last, our qualitative analysis of a selection of Nudge notifications indicates areas for future research, such as taking dependencies among pull requests and developer availability into account. ...

Multi-token Code Completion by Jointly learning from Structure and Naming Sequences

Conference paper (2022) - Maliheh Izadi, Roberta Gismondi, Georgios Gousios
Code completion is an essential feature of IDEs, yet current auto-completers are restricted to either grammar-based or NLP-based single token completions. Both approaches have significant draw-backs: grammar-based autocompletion is restricted in dynamically-typed language environments, whereas NLP-based autocompleters struggle to understand the semantics of the programming language and the developer's code context. In this work, we present CodeFill, a language model for autocompletion that combines learned structure and naming information. Using a parallel Transformer architecture and multi-task learning, CodeFill consumes sequences of source code token names and their equivalent AST token types. Uniquely, CodeFill is trained both for single-token and multi-token (statement) prediction, which enables it to learn long-range dependencies among grammatical and naming elements. We train CodeFill on two datasets, consisting of 29M and 425M lines of code, respectively. To make the evaluation more realistic, we develop a method to automatically infer points in the source code at which completion matters. We compare CodeFill against four baselines and two state-of-the-art models, GPT-C and TravTrans+. CodeFill surpasses all baselines in single token prediction (MRR: 70.9% vs. 66.2% and 67.8%) and outperforms the state of the art for multi-token prediction (ROUGE-L: 63.7% vs. 52.4% and 59.2%, for n=4 tokens). We publicly release our source code and datasets. ...
Journal article (2022) - Zoe Kotti, Georgios Gousios, Diomidis Spinellis
Existing work on the practical impact of software engineering (SE) research examines industrial relevance rather than adoption of study results, hence the question of how results have been practically applied remains open. To answer this and investigate the outcomes of impactful research, we performed a quantitative and qualitative analysis of 4,354 SE patents citing 1,690 SE papers published in four leading SE venues between 1975–2017. Moreover, we conducted a survey on 475 authors of 593 top-cited and awarded publications, achieving 26% response rate. Overall, researchers have equipped practitioners with various tools, processes, and methods, and improved many existing products. SE practice values knowledge-seeking research and is impacted by diverse cross-disciplinary SE areas. Practitioner-oriented publication venues appear more impactful than researcher-oriented ones, while industry-related tracks in conferences could enhance their impact. Some research works did not reach a wide footprint due to limited funding resources or unfavorable cost-benefit trade-off of the proposed solutions. The need for higher SE research funding could be corroborated through a dedicated empirical study. In general, the assessment of impact is subject to its definition. Therefore, academia and industry could jointly agree on a formal description to set a common ground for subsequent research on the topic. ...

From package-based to call-based dependency networks

Modern programming languages such as Java, JavaScript, and Rust encourage software reuse by hosting diverse and fast-growing repositories of highly interdependent packages (i.e., reusable libraries) for their users. The standard way to study the interdependence between software packages is to infer a package dependency network by parsing manifest data. Such networks help answer questions such as “How many packages have dependencies to packages with known security issues?” or “What are the most used packages?”. However, an overlooked aspect in existing studies is that manifest-inferred relationships do not necessarily examine the actual usage of these dependencies in source code. To better model dependencies between packages, we developed Präzi, an approach combining manifests and call graphs of packages. Präzi constructs a dependency network at the more fine-grained function-level, instead of at the manifest level. This paper discusses a prototypical Präzi implementation for the popular system programming language Rust. We use Präzi to characterize Rust’s package repository, Crates.io, at the function level and perform a comparative study with metadata-based networks. Our results show that metadata-based networks generalize how packages use their dependencies. Using Präzi, we find packages call only 40% of their resolved dependencies, and that manual analysis of 34 cases reveals that not all packages use a dependency the same way. We argue that researchers and practitioners interested in understanding how developers or programs use dependencies should account for its context—not the sum of all resolved dependencies. ...
Journal article (2022) - Chandra Maddila, Nachiappan Nagappan, Christian Bird, Georgios Gousios, Arie van Deursen
Modern, complex software systems are being continuously extended and adjusted. The developers responsible for this may come from different teams or organizations, and may be distributed over the world. This may make it difficult to keep track of what other developers are doing, which may result in multiple developers concurrently editing the same code areas. This, in turn, may lead to hard-to-merge changes or even merge conflicts, logical bugs that are difficult to detect, duplication of work, and wasted developer productivity. To address this, we explore the extent of this problem in the pull-request-based software development model. We study half a year of changes made to six large repositories in Microsoft in which at least 1,000 pull requests are created each month. We find that files concurrently edited in different pull requests are more likely to introduce bugs. Motivated by these findings, we design, implement, and deploy a service named Concurrent Edit Detector (ConE) that proactively detects pull requests containing concurrent edits, to help mitigate the problems caused by them. ConE has been designed to scale, and to minimize false alarms while still flagging relevant concurrently edited files. Key concepts of ConE include the detection of the Extent of Overlap between pull requests, and the identification of Rarely Concurrently Edited Files. To evaluate ConE, we report on its operational deployment on 234 repositories inside Microsoft. ConE assessed 26,000 pull requests and made 775 recommendations about conflicting changes, which were rated as useful in over 70% (554) of the cases. From interviews with 48 users, we learned that they believed ConE would save time in conflict resolution and avoiding duplicate work, and that over 90% intend to keep using the service on a daily basis. ...
Journal article (2022) - Joseph Hejderup, Georgios Gousios
Developers are increasingly using services such as Dependabot to automate dependency updates. However, recent research has shown that developers perceive such services as unreliable, as they heavily rely on test coverage to detect conflicts in updates. To understand the prevalence of tests exercising dependencies, we calculate the test coverage of direct and indirect uses of dependencies in 521 well-tested Java projects. We find that tests only cover 58% of direct and 21% of transitive dependency calls. By creating 1,122,420 artificial updates with simple faults covering all dependency usages in 262 projects, we measure the effectiveness of test suites in detecting semantic faults in dependencies; we find that tests can only detect 47% of direct and 35% of indirect artificial faults on average. To increase reliability, we investigate the use of change impact analysis as a means of reducing false negatives; on average, our tool can uncover 74% of injected faults in direct dependencies and 64% for transitive dependencies, nearly two times more than test suites. We then apply our tool in 22 real-world dependency updates, where it identifies three semantically conflicting cases and three cases of unused dependencies that tests were unable to detect. Our findings indicate that the combination of static and dynamic analysis should be a requirement for future dependency updating systems. ...

Practical Deep Similarity Learning-Based Type Inference for Python

Dynamic languages, such as Python and Javascript, trade static typing for developer flexibility and productivity. Lack of static typing can cause run-time exceptions and is a major factor for weak IDE support. To alleviate these issues, PEP 484 introduced optional type annotations for Python. As retrofitting types to existing code-bases is error-prone and laborious, machine learning (ML)-based approaches have been proposed to enable automatic type infer-ence based on existing, partially annotated codebases. However, previous ML-based approaches are trained and evaluated on human-provided type annotations, which might not always be sound, and hence this may limit the practicality for real-world usage. In this paper, we present TYPE4Py, a deep similarity learning-based hier-archical neural network model. It learns to discriminate between similar and dissimilar types in a high-dimensional space, which results in clusters of types. Likely types for arguments, variables, and return values can then be inferred through the nearest neigh-bor search. Unlike previous work, we trained and evaluated our model on a type-checked dataset and used mean reciprocal rank (MRR) to reflect the performance perceived by users. The obtained results show that TYPE4Py achieves an MRR of 77.1 %, which is a substantial improvement of 8.1% and 16.7% over the state-of-the-art approaches Typilus and Typewriter, respectively. Finally, to aid developers with retrofitting types, we released a Visual Stu-dio Code extension, which uses TYPE4Py to provide ML-based type auto-completion for Python. ...
Conference paper (2021) - Hendrig Sellik, Onno van Paridon, Georgios Gousios, Maurício Aniche
Mistakes in binary conditions are a source of error in many software systems. They happen when developers use, e.g., < or > instead of <= or >=. These boundary mistakes are hard to find and impose manual, labor-intensive work for software developers. While previous research has been proposing solutions to identify errors in boundary conditions, the problem remains open. In this paper, we explore the effectiveness of deep learning models in learning and predicting mistakes in boundary conditions. We train different models on approximately 1.6M examples with faults in different boundary conditions. We achieve a precision of 85% and a recall of 84% on a balanced dataset, but lower numbers in an imbalanced dataset. We also perform tests on 41 real-world boundary condition bugs found from GitHub, where the model shows only a modest performance. Finally, we test the model on a large-scale Java code base from Adyen, our industrial partner. The model reported 36 buggy methods, but none of them were confirmed by developers. ...
Journal article (2021) - Paolo Boldi, Georgios Gousios
Modern software development is increasingly dependent on components, libraries, and frameworks coming from third-party vendors or open-source suppliers and made available through a number of platforms (or forges). This way of writing software puts an emphasis on reuse and on composition, commoditizing the services that modern applications require. On the other hand, bugs and vulnerabilities in a single library living in one such ecosystem can affect, directly or by transitivity, a huge number of other libraries and applications. Currently, only product-level information on library dependencies is used to contain this kind of danger, but this knowledge often reveals itself too imprecise to lead to effective (and possibly automated) handling policies. We will discuss how fine-grained function-level dependencies can greatly improve reliability and reduce the impact of vulnerabilities on the whole software ecosystem. ...
Conference paper (2021) - Elvan Kula, Arie van Deursen, Georgios Gousios
In agile software development, proper team structures and effort estimates are crucial to ensure the on-time delivery of software projects. Delivery performance can vary due to the influence of changes in teams, resulting in team dynamics that remain largely unexplored. In this paper, we explore the effects of various aspects of teamwork on delays in software deliveries. We conducted a case study at ING and analyzed historical log data from 765,200 user stories and 571 teams to identify team factors characterizing delayed user stories. Based on these factors, we built models to predict the likelihood and duration of delays in user stories. The evaluation results show that the use of team-related features leads to a significant improvement in the predictions of delay, achieving on average 74%-82% precision, 78%-86% recall and 76%-84% F-measure. Moreover, our results show that team-related features can help improve the prediction of delay likelihood, while delay duration can be explained exclusively using them. Finally, training on recent user stories using a sliding window setting improves the predictive performance; our predictive models perform significantly better for teams that have been stable. Overall, our results indicate that planning in agile development settings can be significantly improved by incorporating team-related information and incremental learning methods into analysis/predictive models. ...
Journal article (2021) - Maliheh Izadi, Abbas Heydarnoori, Georgios Gousios
Many platforms exploit collaborative tagging to provide their users with faster and more accurate results while searching or navigating. Tags can communicate different concepts such as the main features, technologies, functionality, and the goal of a software repository. Recently, GitHub has enabled users to annotate repositories with topic tags. It has also provided a set of featured topics, and their possible aliases, carefully curated with the help of the community. This creates the opportunity to use this initial seed of topics to automatically annotate all remaining repositories, by training models that recommend high-quality topic tags to developers. In this work, we study the application of multi-label classification techniques to predict software repositories’ topics. First, we map the large-space of user-defined topics to those featured by GitHub. The core idea is to derive more information from projects’ available documentation. Our data contains about 152K GitHub repositories and 228 featured topics. Then, we apply supervised models on repositories’ textual information such as descriptions, README files, wiki pages, and file names. We assess the performance of our approach both quantitatively and qualitatively. Our proposed model achieves Recall@5 and LRAP scores of 0.890 and 0.805, respectively. Moreover, based on users’ assessment, our approach is highly capable of recommending correct and complete set of topics. Finally, we use our models to develop an online tool named Repository Catalogue, that automatically predicts topics for GitHub repositories and is publicly available1. ...

A benchmark python dataset for machine learning-based type inference

Conference paper (2021) - Amir M. Mir, Evaldas Latoskinas, Georgios Gousios
In this paper, we present ManyTypes4Py, a large Python dataset for machine learning (ML)-based type inference. The dataset contains a total of 5, 382 Python projects with more than 869K type annotations. Duplicate source code files were removed to eliminate the negative effect of the duplication bias. To facilitate training and evaluation of ML models, the dataset was split into training, validation and test sets by files. To extract type information from abstract syntax trees (ASTs), a light-weight static analyzer pipeline is developed and accompanied with the dataset. Using this pipeline, the collected Python projects were analyzed and the results of the AST analysis were stored in JSON-formatted files. The ManyTypes4Py dataset is shared on zenodo and its tools are publicly available on GitHub. ...
Conference paper (2020) - Enrique Larios Vargas, Maurício Aniche, Christoph Treude, Magiel Bruntink, Georgios Gousios
The selection of third-party libraries is an essential element of virtually any software development project. However, deciding which libraries to choose is a challenging practical problem. Selecting the wrong library can severely impact a software project in terms of cost, time, and development effort, with the severity of the impact depending on the role of the library in the software architecture, among others. Despite the importance of following a careful library selection process, in practice, the selection of third-party libraries is still conducted in an ad-hoc manner, where dozens of factors play an influential role in the decision.
In this paper, we study the factors that influence the selection process of libraries, as perceived by industry developers. To that aim, we perform a cross-sectional interview study with 16 developers from 11 different businesses and survey 115 developers that are involved in the selection of libraries. We systematically devised a comprehensive set of 26 technical, human, and economic factors that developers take into consideration when selecting a software library. Eight of these factors are new to the literature. We explain each of these factors and how they play a role in the decision. Finally, we discuss the implications of our work to library maintainers, potential library users, package manager developers, and empirical software engineering researchers. ...
Conference paper (2020) - Pietro Abate, Roberto Di Cosmo, Georgios Gousios, Stefano Zacchiroli
Dependency solving is a hard (NP-complete) problem in all non-trivial component models due to either mutually incompatible versions of the same packages or explicitly declared package conflicts. As such, software upgrade planning needs to rely on highly specialized dependency solvers, lest falling into pitfalls such as incompleteness - a combination of package versions that satisfy dependency constraints does exist, but the package manager is unable to find it. In this paper we look back at proposals from dependency solving research dating back a few years. Specifically, we review the idea of treating dependency solving as a separate concern in package manager implementations, relying on generic dependency solvers based on tried and tested techniques such as SAT solving, PBO, MILP, etc. By conducting a census of dependency solving capabilities in state-of-the-art package managers we conclude that some proposals are starting to take off (e.g., SAT-based dependency solving) while - with few exceptions - others have not (e.g., outsourcing dependency solving to reusable components). We reflect on why that has been the case and look at novel challenges for dependency solving that have emerged since. ...

Learning to Identify Mistakes in Boundary Conditions

Mistakes in boundary conditions are the cause of many bugs in software. These mistakes happen when, e.g., developers make use of '<' or '>' in cases where they should have used '<=' or '>='. Mistakes in boundary conditions are often hard to find and manually detecting them might be very time-consuming for developers. While researchers have been proposing techniques to cope with mistakes in the boundaries for a long time, the automated detection of such bugs still remains a challenge. We conjecture that, for a tool to be able to precisely identify mistakes in boundary conditions, it should be able to capture the overall context of the source code under analysis. In this work, we propose a deep learning model that learn mistakes in boundary conditions and, later, is able to identify them in unseen code snippets. We train and test a model on over 1.5 million code snippets, with and without mistakes in different boundary conditions. Our model shows an accuracy from 55% up to 87%. The model is also able to detect 24 out of 41 real-world bugs; however, with a high false positive rate. The existing state-of-the-practice linter tools are not able to detect any of the bugs. We hope this paper can pave the road towards deep learning models that will be able to support developers in detecting mistakes in boundary conditions. ...