Locating software components that are responsible for observed failures is a time-intensive and expensive phase in the software development cycle. Automatic fault localization techniques aid developers/testers in pinpointing the root cause of software failures, as such reducing t
...
Locating software components that are responsible for observed failures is a time-intensive and expensive phase in the software development cycle. Automatic fault localization techniques aid developers/testers in pinpointing the root cause of software failures, as such reducing the debugging effort. Automatic fault localization has been an active area of research in the past years. Current approaches to automatic software fault localization can be classified as either (1) statistics-based approaches, or (2) reasoning approaches. This distinction is based on the required amount of knowledge about the program’s internal component structure and behavior. Statistics-based fault localization techniques such as Spectrum-based Fault Localization (SFL) use abstraction of program traces (also known as program spectra) to find a statistical relationship between source code locations and observed failures. Although SFL’s modeling costs and computational complexity are minimal, its diagnostic accuracy is inherently limited since no reasoning is used. In contrast to SFL, model-based reasoning approaches use prior knowledge of the program, such as component interconnection and statement semantics, to build a model of the correct behavior of the system. On the one hand, model-based reasoning approaches deliver higher diagnostic accuracy, but on the other hand, they suffer from high computation complexity. In this thesis, we thoroughly studied the fundamental limitations of SFL. In particular, we studied its diagnostic accuracy as a function of similarity coefficient, quantity of observations, and quality of the error detectors. As a result of this study, we discovered a new similarity coefficient (Ochiai), known from the molecular biology community. Ochiai consistently outperforms all coefficients investigated, including those used by related approaches. Furthermore, we present a novel, low-cost, Bayesian reasoning approach to spectrum-based multiple fault localization, dubbed Barinel. A central feature of our approach is the use of a generic, intermittent component failure model. The novelty of this model lies in the computation of the component intermittency rate as part of the posterior candidate probability computation using a maximum likelihood estimation procedure, rather than using previous approaches’ approximations. This procedure optimally exploits all information contained in the program spectra. Our synthetic and real software experiments show that Barinel outperforms previous approaches to fault localization. Furthermore, this thesis reports on the following additional studies. First, we studied the capabilities of simple, generic program invariants to replace test oracles, so as to achieve total automation of the fault localization process. We verified that, despite the simplicity of the program invariants (and therefore considerable rates of false positives and/or negatives), the diagnostic performance of SFL is similar to the combination of SFL and test oracles. Second, to scale to large systems, reasoning approaches such as Barinel depends on low-cost algorithms to compute the set of diagnosis candidates. We investigated the possibility of using an SFL-based heuristic to focus the computation of valid diagnosis candidates. We show that the SFL-based heuristic is suitable to derive the set of candidates as the search is focused by visiting candidates in best-first order (aiming to capture the most relevant probability mass in the shortest amount of time). Therefore, our algorithm, Staccato, is order of magnitude faster than, e.g., brute-force approaches, rendering our reasoning approach amenable to large programs. Finally, we studied whether SFL can be integrated with existing model-based software debugging approaches (MBSD) to reduce their high time complexity, while improving their diagnostic quality. We have shown that the combination of SFL with MBSD focus the debugging process to relevant parts of the program. Specially compared to MBSD, we have shown that our algorithm has lower complexity, making it scale to large programs.