M.M. Beller

20 records found

From package-based to call-based dependency networks

Modern programming languages such as Java, JavaScript, and Rust encourage software reuse by hosting diverse and fast-growing repositories of highly interdependent packages (i.e., reusable libraries) for their users. The standard way to study the interdependence between software packages is to infer a package dependency network by parsing manifest data. Such networks help answer questions such as “How many packages have dependencies on packages with known security issues?” or “What are the most used packages?”. However, an overlooked aspect in existing studies is that manifest-inferred relationships do not necessarily reflect the actual usage of these dependencies in source code. To better model dependencies between packages, we developed Präzi, an approach combining manifests and call graphs of packages. Präzi constructs a dependency network at the more fine-grained function level, instead of at the manifest level. This paper discusses a prototypical Präzi implementation for the popular system programming language Rust. We use Präzi to characterize Rust’s package repository, Crates.io, at the function level and perform a comparative study with metadata-based networks. Our results show that metadata-based networks generalize how packages use their dependencies. Using Präzi, we find that packages call only 40% of their resolved dependencies, and a manual analysis of 34 cases reveals that not all packages use a dependency the same way. We argue that researchers and practitioners interested in understanding how developers or programs use dependencies should account for their context, not the sum of all resolved dependencies. ...
Build logs are textual by-products that a software build process creates, often as part of its Continuous Integration (CI) pipeline. Build logs are a paramount source of information for developers when debugging and understanding a build failure. Recently, attempts to partly automate this time-consuming, purely manual activity have come up, such as rule- or information-retrieval-based techniques. We believe that having a common data set to compare different build log analysis techniques will advance the research area. It will ultimately increase our understanding of CI build failures. In this paper, we present logchunks, a collection of 797 annotated Travis CI build logs from 80 GitHub repositories in 29 programming languages. For each build log, logchunks contains a manually labeled log part (chunk) describing why the build failed. We externally validated the data set with the developers who caused the original build failure. The width and depth of the logchunks data set are intended to make it the default benchmark for automated build log analysis techniques. ...
Conference paper (2019) - Moritz Beller, Joseph Hejderup
Blockchain technology has found a great number of applications, from banking to the Internet of Things (IoT). However, it has not yet been envisioned whether and which problems in Software Engineering (SE) Blockchain technology could solve. In this paper, we coin this field 'Blockchain-based Software Engineering' and exemplify how Blockchain technology could solve two core SE problems: Continuous Integration (CI) Services such as Travis CI and Package Managers such as apt-get. We believe that Blockchain technology could help (1) democratize and professionalize Software Engineering infrastructure that currently relies on free work done by few volunteers, (2) improve the quality of artifacts and services, and (3) increase trust in ubiquitously used systems like GitHub or Travis CI. ...

Patterns, Beliefs, And Behavior

Journal article (2019) - Moritz Beller, Georgios Gousios, Annibale Panichella, Sebastian Proksch, Sven Amann, Andy Zaidman
Software testing is one of the key activities to achieve software quality in practice. Despite its importance, however, we have a remarkable lack of knowledge on how developers test in real-world projects. In this paper, we report on a large-scale field study with 2,443 software engineers whose development activities we closely monitored over 2.5 years in four integrated development environments (IDEs). Our findings, which largely generalized across the studied IDEs and programming languages Java and C#, question several commonly shared assumptions and beliefs about developer testing: half of the developers in our study do not test; developers rarely run their tests in the IDE; most programming sessions end without any test execution; only once they start testing, do they do it extensively; a quarter of test cases is responsible for three quarters of all test failures; 12 percent of tests show flaky behavior; Test-Driven Development (TDD) is not widely practiced; and software developers only spend a quarter of their time engineering tests, whereas they think they test half of their time. We summarize these practices of loosely guiding one's development efforts with the help of testing in an initial summary on Test-Guided Development (TGD), a behavior we argue to be closer to the development reality of most developers than TDD. ...
Conference paper (2018) - Moritz Beller, Niels Spruit, Andy Zaidman, Diomidis Spinellis
Debugging is an inevitable activity in most software projects, often difficult and more time-consuming than expected, giving it the nickname the “dirty little secret of computer science.” Surprisingly, we have little knowledge on how software engineers debug software problems in the real world, whether they use dedicated debugging tools, and how knowledgeable they are about debugging. This study aims to shed light on these aspects by following a mixed-methods research approach. We conduct an online survey capturing how 176 developers reflect on debugging. We augment this subjective survey data with objective observations on how 458 developers use the debugger included in their integrated development environments (IDEs) by instrumenting the popular ECLIPSE and INTELLIJ IDEs with the purpose-built plugin WATCHDOG 2.0. To clarify the insights and discrepancies observed in the previous steps, we followed up by conducting interviews with debugging experts and regular debugging users. Our results indicate that IDE-provided debuggers are not used as often as expected, because “printf debugging” remains a feasible choice for many programmers. Furthermore, both knowledge and use of advanced debugging features are low. Our results call for strengthening hands-on debugging experience in computer science curricula and have already refined the implementation of modern IDE debuggers. ...
Doctoral thesis (2018) - Moritz Beller
Software developers today crave feedback, be it from their peers in the form of code review, static analysis tools like their compiler, or the local or remote execution of their tests in the Continuous Integration (CI) environment. With the advent of social coding sites such as GitHub and tight integration of CI services such as Travis CI, software development practices have fundamentally changed. Despite a highly altered software engineering landscape, however, we still lack a suitable holistic description of contemporary software development practices. Existing descriptions such as the V-model are either too coarse-grained to describe an individual contributor’s workflow, or only regard a subpart of the development process, like Test-Driven Development (TDD). In addition, most existing models are pre- rather than de-scriptive. By contrast, in this thesis, we perform a series of empirical studies to characterize the individual constituents of Feedback-Driven Development (FDD): we study the prevalence and evolution of Automatic Static Analysis Tools (ASATs), we explain the “Last Line Effect,” a phenomenon at the boundary between ASATs and code review, we observe local testing patterns in the Integrated Development Environment (IDE) of developers, compare them to remote testing on the CI server, and, finally, should these quality assurance techniques have failed, we examine how developers debug faults. We then compile this empirical evidence into a model of how today’s software developers work. Our results show that developers employ the different techniques in FDD to best achieve their current task in the most efficient way, often knowingly taking shortcuts to get the job done.
While this is efficient in the short term, it also bears risks, namely that prevention and introspection activities fall short: developers might not configure or combine ASATs to their full benefit, they might have wrong perceptions about the amount of time spent on quality-control, quality-related activities such as testing could become an after-thought, and learning about debugging techniques falls short. A relatively rigid, tool-enforced FDD process could help developers in not committing some of these mistakes. Our thesis culminates in the finding that feedback loops are the characterizing criterion of contemporary software development. Our model is flexible enough to accommodate a broad band of modern workflows, despite large variances in how projects use and configure parts of FDD. ...
Conference paper (2018) - Moritz Beller
Software developers today crave feedback, be it from their peers or even bots in the form of code review, static analysis tools like their compiler, or the local or remote execution of their tests in the Continuous Integration (CI) environment. With the advent of social coding sites like GitHub and tight integration of CI services like Travis CI, software development practices have fundamentally changed. Despite a highly changed software engineering landscape, however, we still lack a suitable description of an individual's contemporary software development practices, that is, how an individual code contribution comes to be. Existing descriptions like the V-model are either too coarse-grained to describe an individual contributor's workflow, or only regard a sub-part of the development process like Test-Driven Development. In addition, most existing models are pre- rather than de-scriptive. By contrast, in our thesis, we perform a series of empirical studies to describe the individual constituents of Feedback-Driven Development (FDD) and then compile the evidence into an initial framework on how modern software development works. Our thesis culminates in the finding that feedback loops are the characterizing criterion of contemporary software development. Our model is flexible enough to accommodate a broad bandwidth of contemporary workflows, despite large variances in how projects use and configure parts of FDD. ...
Journal article (2017) - Moritz Beller, Andy Zaidman, Andrey Karpov, Rolf A. Zwaan
Micro-clones are tiny duplicated pieces of code; they typically comprise only a few statements or lines. In this paper, we study the “Last Line Effect,” the phenomenon that the last line or statement in a micro-clone is much more likely to contain an error than the previous lines or statements. We do this by analyzing 219 open source projects and reporting on 263 faulty micro-clones and interviewing six authors of real-world faulty micro-clones. In an interdisciplinary collaboration, we examine the underlying psychological mechanisms for the presence of these relatively trivial errors. Based on the interviews and further technical analyses, we suggest that so-called “action slips” play a pivotal role for the existence of the last line effect: Developers’ attention shifts away at the end of a micro-clone creation task due to noise and the routine nature of the task. Moreover, all micro-clones whose origin we could determine were introduced in unusually large commits. Practitioners benefit from this knowledge twofold: 1) They can spot situations in which they are likely to introduce a faulty micro-clone and 2) they can use PVS-Studio, our automated micro-clone detector, to help find erroneous micro-clones. ...
Conference paper (2017) - Alberto Bacchelli, Moritz Beller
The peer review process is central to the scientific method, the advancement and spread of research, as well as crucial for individual careers. However, the single-blind review mode currently used in most Software Engineering (SE) venues is susceptible to apparent and hidden biases, since reviewers know the identity of authors. We perform a study on the benefits and costs that are associated with introducing double-blind review (DBR) in SE venues. We surveyed the SE community’s opinion and interviewed experts on double-blind reviewing. Our results indicate that the costs, mostly logistic challenges and side effects, outnumber its benefits and mostly regard difficulty for authors in blinding papers, for reviewers in understanding the increment with respect to previous work from the same authors, and for organizers to manage a complex transition. While the surveyed community largely agrees on the costs of DBR, less than one-third disagree with a switch to DBR for SE journals, all SE conferences, and, in particular, ICSE; the analysis of a survey with authors of submitted papers at ICSE 2016 run by the program chairs of that edition corroborates our result. ...

Warnings From Multiple Automated Static Analysis Tools At A Glance

Conference paper (2017) - Tim Buckers, Clinton Cao, Michiel Doesburg, Boning Gong, Sunwei Wang, Moritz Beller, Andy Zaidman
Automated Static Analysis Tools (ASATs) are an integral part of today’s software quality assurance practices. At present, a plethora of ASATs exist, each with different strengths. However, there is little guidance for developers on which of these ASATs to choose and combine for a project. As a result, many projects still only employ one ASAT with practically no customization. With UAV, the Unified ASAT Visualizer, we created an intuitive visualization that enables developers, researchers, and tool creators to compare the complementary strengths and overlaps of different Java ASATs. UAV’s enriched treemap and source code views provide its users with a seamless exploration of the warning distribution from a high-level overview down to the source code. We have evaluated our UAV prototype in a user study with ten second-year Computer Science (CS) students and a visualization expert, and tested it on large Java repositories with several thousands of PMD, FindBugs, and Checkstyle warnings.
Project Website: https://clintoncao.github.io/uav/ ...

Synthesizing Travis CI and GitHub for Full-Stack Research on Continuous Integration

Conference paper (2017) - Moritz Beller, Georgios Gousios, Andy Zaidman
Continuous Integration (CI) has become a best practice of modern software development. Thanks in part to its tight integration with GitHub, Travis CI has emerged as arguably the most widely used CI platform for Open-Source Software (OSS) development. However, despite its prominent role in Software Engineering in practice, the benefits, costs, and implications of doing CI are far from clear from an academic standpoint. Little research has been done, and even less of it has been quantitative. In order to lay the groundwork for data-driven research on CI, we built TravisTorrent, travistorrent.testroots.org, a freely available data set based on Travis CI and GitHub that provides easy access to hundreds of thousands of analyzed builds from more than 1,000 projects. Unique to TravisTorrent is that each of its 2,640,825 Travis builds is synthesized with metadata from Travis CI's API, the results of analyzing its textual build log, a link to the GitHub commit which triggered the build, and dynamically aggregated project data from the time of commit extracted through GHTorrent. ...

An Explorative Analysis of Travis CI with GitHub

Conference paper (2017) - Moritz Beller, Georgios Gousios, Andy Zaidman
Continuous Integration (CI) has become a best practice of modern software development. Yet, at present, we have a shortfall of insight into the testing practices that are common in CI-based software development. In particular, we seek quantifiable evidence on how central testing is to the CI process, how strongly the project language influences testing, whether different integration environments are valuable, and whether testing on the CI can serve as a surrogate for local testing in the IDE. In an analysis of 2,640,825 Java and Ruby builds on TRAVIS CI, we find that testing is the single most important reason why builds fail. Moreover, the programming language has a strong influence on the number of executed tests, their run time, and their proneness to fail. The use of multiple integration environments leads to 10% more failures being caught at build time. However, testing on TRAVIS CI does not seem an adequate surrogate for running tests locally in the IDE. To further research on TRAVIS CI with GITHUB, we introduce TRAVISTORRENT. ...

WatchDog, a family of IDE plug-ins to assess testing

Conference paper (2016) - Moritz Beller, Igor Levaja, Annibale Panichella, Georgios Gousios, Andy Zaidman
As software engineering researchers, we are also zealous tool smiths. Building a research prototype is often a daunting task, let alone building an industry-grade family of tools supporting multiple platforms to ensure the generalizability of results. In this paper, we give advice to academic and industrial tool smiths on how to design and build an easy-to-maintain architecture capable of supporting multiple integrated development environments (IDEs). Our experiences stem from WatchDog, a multi-IDE infrastructure that assesses developer testing activities in vivo and that over 2,000 registered developers use. To these software engineering practitioners, WatchDog provides real-time and aggregated feedback in the form of individual testing reports. ...
Conference paper (2016) - Sebastiano Panichella, Annibale Panichella, Moritz Beller, Andy Zaidman, Harald C. Gall
Automated test generation tools have been widely investigated with the goal of reducing the cost of testing activities. However, generated tests have been shown not to help developers in detecting and finding more bugs even though they reach higher structural coverage compared to manual testing. The main reason is that generated tests are difficult to understand and maintain. Our paper proposes an approach, coined TestDescriber, which automatically generates test case summaries of the portion of code exercised by each individual test, thereby improving understandability. We argue that this approach can complement the current techniques around automated unit test generation or search-based techniques designed to generate a possibly minimal set of test cases. In evaluating our approach we found that (1) developers find twice as many bugs, and (2) test case summaries significantly improve the comprehensibility of test cases, which is considered particularly useful by developers. ...
Conference paper (2016) - Carmine Vassallo, Fiorella Zampetti, Daniele Romano, Moritz Beller, Annibale Panichella, Massimiliano Di Penta, Andy Zaidman
Continuous Delivery (CD) is an agile software development practice in which developers frequently integrate changes into the main development line and produce releases of their software. An automated Continuous Integration infrastructure builds and tests these changes. Claimed advantages of CD include early discovery of (integration) errors, reduced cycle time, and better adoption of coding standards and guidelines. This paper reports on a study in which we surveyed 152 developers of a large financial organization (ING Netherlands), and investigated how they adopt a Continuous Integration and delivery pipeline during their development activities. In our study, we focus on topics related to managing technical debt, as well as test automation practices. The survey results shed light on the adoption of some agile methods in practice, and sometimes confirm, while in other cases confute, common wisdom and results obtained in other studies. For example, we found that refactoring tends to be performed together with other development activities, technical debt is almost always “self-admitted”, developers document source code in a timely manner, and assure the quality of their product through extensive automated testing, with a third of respondents dedicating more than 50% of their time to testing activities. ...

A Large-Scale Evaluation in Open Source Software

Conference paper (2016) - Moritz Beller, Radjino Bholanath, Shane McIntosh, Andy Zaidman
The use of automatic static analysis has been a software engineering best practice for decades. However, we still do not know a lot about its use in real-world software projects: How prevalent is the use of Automated Static Analysis Tools (ASATs) such as FindBugs and JSHint? How do developers use these tools, and how does their use evolve over time? We research these questions in two studies on nine different ASATs for Java, JavaScript, Ruby, and Python with a population of 122 and 168,214 open-source projects. To compare warnings across the ASATs, we introduce the General Defect Classification (GDC) and provide a grounded-theory-derived mapping of 1,825 ASAT-specific warnings to 16 top-level GDC classes. Our results show that ASAT use is widespread, but not ubiquitous, and that projects typically do not enforce a strict policy on ASAT use. Most ASAT configurations deviate slightly from the default, but hardly any introduce new custom analyses. Only a very small set of default ASAT analyses is widely changed. Finally, most ASAT configurations, once introduced, never change. If they do, the changes are small and have a tendency to occur within one day of the configuration's initial introduction.
...
The research community in Software Engineering and Software Testing in particular builds many of its contributions on a set of mutually shared expectations. Despite the fact that they form the basis of many publications as well as open-source and commercial testing applications, these common expectations and beliefs are rarely ever questioned. For example, Frederick Brooks' statement that testing takes half of the development time seems to have manifested itself within the community since he first made it in "The Mythical Man-Month" in 1975. With this paper, we report on the surprising results of a large-scale field study with 416 software engineers whose development activity we closely monitored over the course of five months, resulting in over 13 years of recorded work time in their integrated development environments (IDEs). Our findings question several commonly shared assumptions and beliefs about testing and might be contributing factors to the observed bug proneness of software in practice: The majority of developers in our study do not test; developers rarely run their tests in the IDE; Test-Driven Development (TDD) is not widely practiced; and, last but not least, software developers only spend a quarter of their work time engineering tests, whereas they think they test half of their time. ...
Conference paper (2015) - Moritz Beller, Georgios Gousios, Andy Zaidman
What do we know about software testing in the real world? It seems we know from Fred Brooks' seminal work 'The Mythical Man-Month' that 50% of project effort is spent on testing. However, due to the enormous advances in software engineering in the past 40 years, the question stands: Is this observation still true? In fact, was it ever true? The vision for our research is to settle the discussion about Brooks' estimation once and for all: How much do developers test? Does developers' estimation on how much they test match reality? How frequently do they execute their tests, and is there a relationship between test runtime and execution frequency? What are the typical reactions to failing tests? Do developers solve actual defects in the production code, or do they merely relax their test assertions? Emerging results from 40 software engineering students show that students overestimate their testing time threefold, and 50% of them test as little as 4% of their time, or less. Having proven the scalability of our infrastructure, we are now extending our case study with professional software engineers from open-source and industrial organizations. ...
Conference paper (2015) - Moritz Beller, Andy Zaidman, Andrey Karpov
Micro-clones are tiny duplicated pieces of code; they typically comprise only a few statements or lines. In this paper, we expose the "last line effect," the phenomenon that the last line or statement in a micro-clone is much more likely to contain an error than the previous lines or statements. We do this by analyzing 208 open source projects and reporting on 202 faulty micro-clones. ...
Conference paper (2014) - Moritz Beller, Alberto Bacchelli, Andy Zaidman, Elmar Juergens
Code review is the manual assessment of source code by humans, mainly intended to identify defects and quality problems. Modern Code Review (MCR), a lightweight variant of the code inspections investigated since the 1970s, prevails today both in industry and open-source software (OSS) systems. The objective of this paper is to increase our understanding of the practical benefits that the MCR process produces on reviewed source code. To that end, we empirically explore the problems fixed through MCR in OSS systems. We manually classified over 1,400 changes taking place in reviewed code from two OSS projects into a validated categorization scheme. Surprisingly, results show that the types of changes due to the MCR process in OSS are strikingly similar to those in the industry and academic systems from literature, featuring a similar 75:25 ratio of maintainability-related to functional problems. We also reveal that 7-35% of review comments are discarded and that 10-22% of the changes are not triggered by an explicit review comment. Patterns emerged in the review data; we investigated them, revealing the technical factors that influence the number of changes due to the MCR process. We found that bug-fixing tasks lead to fewer changes and tasks with more altered files and a higher code churn have more changes. Contrary to intuition, the person of the reviewer had no impact on the number of changes. ...