How to Kill Them All: An Exploratory Study on the Impact of Code Observability on Mutation Testing

Mutation testing is well known for its efficacy in assessing test quality, and it is starting to be applied in industry. However, what should a developer do when confronted with a low mutation score? Should the test suite simply be reinforced to increase the mutation score, or should the production code be improved as well, to make the creation of better tests possible? In this paper, we aim to provide a new perspective that enables developers to understand and reason about the mutation score in the light of testability and observability. First, we investigate whether testability and observability metrics are correlated with the mutation score on six open-source Java projects. We observe a correlation between observability metrics and the mutation score; for example, test directness, which measures the extent to which the production code is tested directly, seems to be an essential factor. Based on the insights from our correlation study, we propose a number of ''mutation score anti-patterns'', enabling software engineers to refactor their existing code or add tests to improve the mutation score. In doing so, we observe that relatively simple refactoring operations already enable an increase in the mutation score.

© 2021 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Mutation testing has been a very active research field since the 1970s as a technique to evaluate test suite quality in terms of fault-revealing capability (Jia and Harman, 2011). Recent advances have made it possible for mutation testing to be used in industry (Petrovic et al., 2018). For example, PIT/PiTest (Coles, 2019a) has been adopted by several companies, such as The Ladders and British Sky Broadcasting (Coles, 2019e). Furthermore, Google (Petrovic and Ivankovic, 2018) has integrated mutation testing with the code review process for around 6000 software engineers.
As mutation testing gains traction in industry, a better understanding of the mutation score (one outcome of mutation testing) becomes essential. Existing works have mainly linked the mutation score with test quality (Inozemtseva and Holmes, 2014; Li et al., 2009) (i.e., how good is the test suite at detecting faults in the software?) and mutant utility (Yao et al., 2014; Just et al., 2017) (i.e., how useful is the mutant?). However, in our previous study, we observed that certain mutants could be killed only after refactoring the production code to increase the observability of state changes. In such cases, test deficiency is not the only reason for the survival of mutants; issues in the production code, such as poor code observability, also make mutants difficult to kill. Unlike previous works (e.g., Inozemtseva and Holmes, 2014; Li et al., 2009; Yao et al., 2014; Just et al., 2017), our goal is to bring a new perspective to developers that enables them to understand and reason about the mutation score in the light of testability and observability. Thereby, developers can make an informed choice when confronting a low mutation score: (1) adding new tests, (2) refactoring the production code to be able to write better tests, or (3) ignoring the surviving mutants.
To this aim, our study consists of two parts: first, we investigate the relationship between testability/observability and mutation testing in order to find the most correlated metrics; second, based on what we observe from the correlations, we define anti-patterns or indicators that software engineers can apply to their code to kill the surviving mutants. We start by investigating the relationship between testability/observability metrics and the mutation score, inspired by the work of Bruntink and van Deursen (2006). Testability is defined as the ''attributes of software that bear on the effort needed to validate the software product'' (ISO, 1991; Bruntink and van Deursen, 2006). Given our context, an important part of testability is observability, which is a measure of how well internal states of a system can be inferred, usually through the values of its external outputs (Staats et al., 2011). Whalen et al. (2013) formally defined observability as follows: an expression in a program is observable in a test case if changing the value of the expression, while leaving the rest of the program intact, changes the output of the system correspondingly. If there is no such value, the expression is not observable for that test. Compared to testability, which covers various aspects of a project (e.g., inheritance and cohesion), observability specifically addresses the extent to which a change in the value of an expression is observable in a test case.
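Whalen et al.'s definition can be illustrated with a small sketch (our own illustrative example, not code from any of the subject systems): the variable scratch below is not observable through the function's output, because changing the expression assigned to it leaves the return value intact, whereas total is observable.

```python
def sum_positive(values):
    """Sum the positive numbers in `values`."""
    total = 0        # observable: changes to it propagate to the return value
    scratch = 0      # not observable: its value never reaches the output
    for v in values:
        scratch = v * 2   # mutating this expression cannot change the result
        if v > 0:
            total += v
    return total
```

Mutating `scratch = v * 2` (say, into `v * 3`) survives any test that only checks the return value, while mutating `total += v` is immediately visible to such a test.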
Our first three research questions steer our investigation in the first part of our study:

RQ1 What is the relation between testability metrics and the mutation score?

RQ2 What is the relation between observability metrics and the mutation score?

RQ3 What is the relation between the combination of testability and observability metrics and the mutation score?
After investigating the relationship between testability/observability and mutation testing, we still lack insight into how these relationships can help developers take action when facing surviving mutants. That is why, based on the observations from RQ1-RQ3, we define anti-patterns or indicators that software engineers can apply to their code/tests to ensure that mutants can be killed. This leads us to the next research question:

RQ4
To what extent does the removal of anti-patterns based on testability and observability help in improving the mutation score?
In terms of the methodology that we follow in our study, for RQ1-RQ3 we use statistical analysis on open-source Java projects to investigate the relationship between testability, observability, and the mutation score. For RQ4, we perform a case study with 16 code fragments to investigate whether the removal of anti-patterns increases the mutation score.

Background
In this section, we briefly introduce the basic concepts of and related work on mutation testing, testability metrics, and our proposed metrics for quantifying code observability.

Mutation testing
Mutation testing is defined by Jia and Harman (2011) as a fault-based testing technique that provides a testing criterion called the mutation adequacy score. This score can be used to measure the effectiveness of a test suite in terms of its ability to detect faults (Jia and Harman, 2011). The principle of mutation testing is to introduce syntactic changes into the original program to generate faulty versions (called mutants) according to well-defined rules (mutation operators) (Offutt, 2011). The benefits of mutation testing have been extensively investigated and can be summarised (Zhu et al., 2018b) as (1) having better fault-exposing capability compared to other test coverage criteria (Mathur and Wong, 1994; Frankl et al., 1997; Li et al., 2009), and (2) being a valid substitute for real faults and providing a good indication of the fault detection ability of a test suite (Andrews et al., 2005; Just et al., 2014).
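As a minimal illustration of the idea (a hand-written sketch, not actual PIT output), consider an original function, one mutant produced by a boundary-style mutation operator, and a test input that kills it:

```python
def is_adult(age):
    return age >= 18          # original program

def is_adult_mutant(age):
    return age > 18           # mutant: the relational boundary is changed

# A test kills the mutant if it passes on the original program
# but fails on the mutated program.
def boundary_test(impl):
    return impl(18) is True   # the boundary value distinguishes the two
```

Here `boundary_test(is_adult)` passes while `boundary_test(is_adult_mutant)` fails, so the mutant is killed; a test suite that never exercises age == 18 would leave it alive.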
Researchers have actively investigated mutation testing for decades (as evidenced by the extensive surveys of Offutt, 2011; Jia and Harman, 2011; Madeyski et al., 2014; Zhu et al., 2018b). Recently, it has started to attract attention from industry (Petrovic et al., 2018). In part, this is due to the growing awareness of the importance of testing in software development (Ammann and Offutt, 2017). Code coverage, the most common metric to measure test suite effectiveness, has seen its limitations reported in numerous studies (e.g., Mathur and Wong, 1994; Frankl et al., 1997; Li et al., 2009; Inozemtseva and Holmes, 2014). Using structural coverage metrics alone might be misleading because, in many cases, statements might be covered while their consequences are not asserted (Inozemtseva and Holmes, 2014). Another factor is that well-developed open-source mutation testing tools (e.g., PIT/PiTest (Coles, 2019a) and Mull (GitHub, 2019)) have contributed to mutation testing being applied in industrial environments (Petrovic et al., 2018; Petrovic and Ivankovic, 2018; Coles, 2019e).
However, questions still exist about mutation testing, especially regarding the usefulness of a mutant (Just et al., 2017). The majority of the mutants generated by existing mutation operators are equivalent, trivial, or redundant (Kurtz et al., 2014; Just et al., 2017; Brown et al., 2017; Papadakis et al., 2018; Jimenez et al., 2018), which reduces the efficacy of the mutation score. If a class has a high mutation score while most mutants generated are trivial and redundant, the high mutation score does not guarantee high test effectiveness. A better understanding of the mutation score and of mutants is thus important.
To address this knowledge gap, numerous studies have investigated how useful mutants are. Example studies include mutant subsumption (Kurtz et al., 2014), stubborn mutants (Yao et al., 2014), and real-fault coupling (Just et al., 2014; Papadakis et al., 2018). These studies paid attention to the context and types of mutants as well as the impact of the test suite, while the impact of production code quality has rarely been investigated. We have seen how code quality can influence how hard code is to test (Bruntink and van Deursen, 2006), a property termed software testability (Freedman, 1991). Since mutation testing can generally be considered as ''testing the tests'', production code quality could also impact mutation testing, just as production code quality has been shown to be correlated with the presence of test smells (Spadini et al., 2018). Given the lack of insights into how code quality affects the effort needed for mutation testing, especially into how to engineer tests that kill all the mutants, we conduct this exploratory study. Our study can help researchers and practitioners deepen their understanding of the mutation score, which is generally related to test suite quality and mutant usefulness.

Existing object-oriented metrics for testability
The notion of software testability dates back to 1991, when Freedman (1991) formally defined observability and controllability in the software domain. Voas (1992) proposed a dynamic technique coined propagation, infection, and execution (PIE) analysis for statistically estimating a program's fault sensitivity. More recently, researchers have aimed to increase our understanding of testability by using statistical methods to predict testability based on various code metrics. Influential works include that of Bruntink and van Deursen (2006), in which they explored the relationship between nine object-oriented metrics and testability. To explore the relation between testability and the mutation score (RQ1), we first need to collect existing object-oriented metrics that have been proposed in the literature. In total, we collect 64 of the most widely used code quality metrics, including both class-level and method-level metrics. We select those 64 metrics because they measure various aspects of a project, including basic characteristics (e.g., NLOC and NOMT), inheritance (e.g., DIT), coupling (e.g., CBO and FIN), and cohesion (LCOM). A large number of those metrics, such as LCOM and HLTH, have been widely used to explore software testability (Bruntink and van Deursen, 2006; Gao and Shih, 2005) and fault prediction (Arisholm and Briand, 2006; Hall et al., 2011).
We present a brief summary of the 64 metrics in Table 1 (method-level) and Tables 2-3 (class-level). We computed these metrics using the static code analysis tool JHawk (JHawk, 2019).

Code observability
To explore the relation between observability and the mutation score (RQ2), we first need a set of metrics to quantify code observability. Following Whalen et al. (2013)'s definition of observability (as mentioned in Section 1), we consider that code observability comprises two perspectives: that of the production code and that of the test case. To better explain these two perspectives, let us consider the example in Listing 1 from project jfreechart-1.5.0, showing the method setSectionPaint of class org.jfree.chart.plot.PiePlot and its corresponding test. This method sets the section paint associated with the specified key for the PiePlot object and sends a PlotChangeEvent to all registered listeners. There is one mutant in Line 3 that removes the call to org/jfree/chart/plot/PiePlot::fireChangeEvent. This mutant is not killed by testEquals. Looking at the observability of this mutant from the production code perspective, we can see that the setSectionPaint method is void; thus, this mutant is hard to detect because there is no return value for the test case to assert. From the test case perspective, although testEquals invokes the method setSectionPaint in Lines 14 and 17, no proper assertion statements are used to examine the changes made by fireChangeEvent() (which is used to send an event to listeners).
Starting from these two perspectives on code observability, we devise a set of code observability metrics. Since our study is a starting point in designing metrics to measure code observability, we start with simple and practical metrics that are easy for practitioners to understand and apply.
First of all, we consider the return type of the method. As discussed for Listing 1, it is hard to observe the changing states inside a void method because there is no return value for test cases to assert. Accordingly, we design two metrics, is_void and non_void_percent (shown in the 1st and 2nd rows of Table 5). The metric is_void examines whether the return type of the method is void or not. The metric non_void_percent addresses the return type at class level, measuring the percentage of non-void methods in the class. Besides these two, a void method might change the field(s) of the class it belongs to. A workaround to test a void method is to invoke getters, so getter_percentage (shown in the 3rd row of Table 5) is proposed to complement is_void.
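The three return-type metrics can be sketched as follows. This is a simplified model over hypothetical method descriptors, not the Mutation Observer implementation (which works on Java source and bytecode); the getter pattern follows the three rules of Zhang and Mesbah (2015) listed in the notes to Table 5.

```python
def is_void(method):
    """1 if the method's return type is void, else 0."""
    return 1 if method["return_type"] == "void" else 0

def non_void_percent(class_methods):
    """Fraction of non-void methods in the class."""
    non_void = sum(1 for m in class_methods if m["return_type"] != "void")
    return non_void / len(class_methods)

def is_getter(method):
    # Getter pattern (Zhang and Mesbah, 2015): public, no arguments,
    # non-void return type, name "get" followed by an uppercase letter.
    name = method["name"]
    return (method["visibility"] == "public"
            and not method["params"]
            and method["return_type"] != "void"
            and name.startswith("get") and len(name) > 3 and name[3].isupper())

def getter_percentage(class_methods):
    """Fraction of getters among the methods of the class."""
    return sum(1 for m in class_methods if is_getter(m)) / len(class_methods)

# Hypothetical descriptors mimicking the PiePlot example of Listing 1.
methods = [
    {"name": "setSectionPaint", "return_type": "void",
     "visibility": "public", "params": ["key", "paint"]},
    {"name": "getSectionPaint", "return_type": "Paint",
     "visibility": "public", "params": []},
]
```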
Secondly, we consider access control modifiers. Let us look at the example in Listing 2 from project commons-lang-LANG_3_7. The method getMantissa in class NumberUtils returns the mantissa of the given number. This method has only one mutant: the return value is replaced with ''return if (getMantissa(str, str.length()) != null) null else throw new RuntimeException''. This mutant should be easy to detect given an input of either a legal String object (the return value is not null) or a null string (throwing an exception). Yet this ''trivial'' mutant is not detected because the method getMantissa is private.

The access control modifier private makes it impossible to test the method getMantissa directly, because this method is only visible to methods of class NumberUtils. To test this method, the test case must first invoke a method that in turn calls getMantissa.
From this case, we observe that access control modifiers influence the visibility of the method and therefore play a significant role in code observability. Thus, we take access control modifiers into account to quantify code observability, designing is_public and is_static (shown in the 4th and 5th rows of Table 5). The third point we raise concerns fault masking. We have observed that mutants generated in certain locations are more likely to be masked (Gopinath et al., 2017), i.e., the state change cannot propagate to the output of the method. The first observation concerns mutants that reside in a nested class. The reasoning is similar to that for mutants residing in nested sections of code, namely that a change in intermediate results does not propagate to a point where a test can pick it up. Thus, we introduce is_nested (6th row of Table 5). Another group of mutants is generated inside nested conditions and loops. These can be problematic because the results of the mutations cannot propagate to the output, and the tests have no way of checking the intermediate results within the method. Accordingly, we define nested_depth (7th row of Table 5) and a set of metrics to quantify the conditions and loops (8th through 13th rows of Table 5). The last observation relates to mutants inside a long method (the reason is similar to that for mutants inside nested conditions and loops); thus, we design method_length (14th row of Table 5).
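For illustration, nested_depth and method_length can be approximated on a method body as follows. This is a rough textual approximation of our own; the study's tool derives these metrics from a proper ANTLR parse of the Java source.

```python
def nested_depth(method_body):
    """Maximum brace-nesting depth inside the method body."""
    depth = max_depth = 0
    for ch in method_body:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1
    return max_depth

def method_length(method_body):
    """Number of source lines of the method body."""
    return len(method_body.strip().splitlines())
```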
The next aspect we consider is test directness. Before we dig into test directness, let us take Listing 3 as an example. Listing 3 shows the class Triple from project commons-lang-LANG_3_7, an abstract implementation defining the basic functions of an object that consists of three elements, referred to as ''left'', ''middle'' and ''right''. The method hashCode returns the hash code of the object. Six mutants are generated for the method hashCode in class Triple. Table 4 summarises all the mutants from Listing 3. Of those six mutants, only Mutant 1 is killed, and the other mutants are not equivalent. Through further investigation of the method hashCode and its test class, we found that although this method has 100% coverage by the test suite, there is no direct test for it. A direct test means that the test method directly invokes the method under test (Athanasiou et al., 2014). Direct tests are useful because they allow us to control the input data directly and to assert the output of a method directly. This example shows that test directness can influence the outcome of mutation testing; it denotes the test case angle of code observability. Previous works such as Huo and Clause (2016) also addressed the significance of test directness in mutation testing. Therefore, we design two metrics, direct_test_no. and test_distance (shown in the 15th and 16th rows of Table 5), to quantify test directness. These two metrics represent the test case perspective of code observability.
Last but not least, we take assertions into consideration. As discussed for Listing 1, we have observed that mutants without appropriate assertions in place (throwing exceptions is also taken into consideration) cannot be killed, as a prerequisite to killing a mutant is that the tests fail on the mutated program. Schuler and Zeller (2013) and Zhang and Mesbah (2015) drew similar conclusions. Accordingly, we come up with three metrics to quantify assertions in the method: assertion_no., assertion-McCabe_Ratio and assertion_density (shown in the 17th-19th rows of Table 5). The assertion-McCabe_Ratio metric (Athanasiou et al., 2014) was originally proposed to measure test completeness by indicating the ratio between the number of actual points of testing in the test code and the number of decision points in the production code (i.e., how many decision points are tested). For example, if a method has a McCabe complexity of 4, then in the ideal case we would expect 4 different assertions to test those linearly independent paths (in this case the ratio would be 1); if the ratio is lower than 1, it could be an indication that either not all paths are tested, or that not all paths are tested in a direct way. The assertion_density metric (Kudrjavets et al., 2006) aims at measuring the ability of the test code to detect defects in the parts of the production code that it covers. We include those two metrics here as a way to measure the quality of assertions. These three metrics are based on the test case perspective of code observability.
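Both assertion metrics reduce to simple ratios; the sketch below is a simplified reading of the definitions in Table 5 (in the study, the denominators are the McCabe complexity of the method under test and the lines of code of its direct tests, respectively).

```python
def assertion_mccabe_ratio(n_assertions, mccabe_complexity):
    """Assertions per decision point: 1.0 means every linearly
    independent path could, in principle, be asserted."""
    return n_assertions / mccabe_complexity

def assertion_density(n_assertions, test_loc):
    """Assertions per line of test code (Kudrjavets et al., 2006)."""
    return n_assertions / test_loc
```

For example, a method with McCabe complexity 4 covered by only 2 assertions yields a ratio of 0.5, hinting that not all paths are asserted directly.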
To sum up, Table 5 presents all the code observability metrics we propose, listing the name, definition, and category of each metric.

Experimental setup
To examine our conjectures, we conduct an experiment using six open-source projects, guided by the research questions RQ1-RQ4 proposed in Section 1.

Mutation testing
We adopt PIT (version 1.4.0) (Coles, 2019a) to apply mutation testing in our experiments. The mutation operators we adopt are the default mutation operators provided by PIT (Coles, 2019c): Conditionals Boundary Mutator, Increments Mutator, Invert Negatives Mutator, Math Mutator, Negate Conditionals Mutator, Return Values Mutator, and Void Method Calls Mutator. We did not adopt the extended set of mutation operators provided by PIT, as the operators in the default set are largely designed to be stable (i.e., not too easy to detect) and to minimise the number of equivalent mutants that they generate (Coles, 2019c).

Subject systems
We use six systems publicly available on GitHub in this experiment. Table 6 summarises the main characteristics of the selected projects: the lines of code (LOC), the number of tests (#Test), the total number of methods (#Total Methods), the number of selected methods used in our experiment (#Selected), the total number of mutants (#Total Mutants), and the number of killed mutants (#Killed). In our experiment, we remove the methods for which PIT generates no mutants, thus resulting in the number of selected methods (#Selected). These systems are selected because they have been widely used in the research domain (e.g., Schuler and Zeller, 2013; Zhang and Mesbah, 2015; Huo and Clause, 2016; Zhu et al., 2018a; Zhang et al., 2018). All systems are written in Java and tested by means of JUnit. The granularity of our analysis is at the method level. The killable-mutant results for all of the subjects are shown in Columns 7-8 of Table 6.

Notes to Table 5: (a) a getter method must follow three patterns (Zhang and Mesbah, 2015): (1) it must be public; (2) it has no arguments and its return type must be something other than void; (3) it follows the naming convention that the name of a getter method begins with ''get'' followed by an uppercase letter. (b) If the method is not directly tested, its direct_test_no. is 0. (c) If the method is directly tested, its test_distance is 0; the maximum test_distance is set to Integer.MAX_VALUE in Java, which means that no method-call sequence can reach the method from the test methods.

Fig. 1a shows the distribution of mutation scores among the selected methods. The majority of the mutation scores are either 0 or 1. Together with Fig. 1b, we can see that the large number of 0s and 1s is due to the low number of mutants per method. Most methods have fewer than 10 mutants, mainly because most methods are short (NOS < 2, as shown in Fig. 2). Writing short methods is a preferred strategy in practice, as a long method is a well-known code smell (Beck et al., 1999). Besides, PIT adopts several optimisation mechanisms (Coles, 2019d) to reduce the number of mutants; thus, the number of mutants (#Total Mutants) shown in Table 6 is lower than the actual number of generated mutants. The large number of methods with few mutants is an unavoidable source of bias in our experiment.

Tool implementation
To evaluate the code observability metrics that we have proposed, we implemented a prototype tool (coined Mutation Observer) to capture all the necessary information from both the program under test and the mutation testing process. This tool is openly available on GitHub (Zhu, 2019). Our tool extracts information from three parts of the system under test (in Java): source code, bytecode, and tests. Firstly, Antlr (2019) parses the source code to obtain the basic code features, e.g., is_public, is_static, and (cond). Secondly, we adopt Apache Commons BCEL (Apache, 2019) to parse the bytecode. Then, java-callgraph (java-callgraph, 2019) generates the pairs of method calls between the source code and the tests, which we later use to calculate direct_test_no. and other test-call-related metrics. The last part is related to the mutation testing process, for which we adopt PIT (version 1.4.0) (Coles, 2019a) to obtain the killable-mutant results. An overview of the architecture of Mutation Observer can be seen in Fig. 3.

RQ1-RQ3
Our investigation of the relationships between testability/observability metrics and the mutation score (RQ1-RQ3) is twofold: in the first part, we adopt Spearman's rank-order correlation to statistically measure the pairwise correlations between each metric (both testability and observability metrics) and the mutation score; in the second part, we turn the correlation problem into a binary classification problem (adopting Random Forest as the classification algorithm) to investigate how those metrics interact with one another.

Pairwise correlations.
To answer RQ1, RQ2, and RQ3, we first adopt Spearman's rank-order correlation to statistically measure the correlation between each metric (both testability and observability metrics) and the mutation score of the corresponding methods or classes. Spearman's correlation test checks whether there exists a monotonic relationship (linear or not) between two data samples. It is a non-parametric test and, therefore, does not make any assumption about the distribution of the data being tested. The resulting coefficient ρ takes values in the interval [−1, +1]; the higher the correlation in either direction (positive or negative), the stronger the monotonic relationship between the two data samples under analysis. The strength of the correlation can be classified as ''negligible'' (|ρ| < 0.1), ''small'' (0.1 ≤ |ρ| < 0.3), ''medium'' (0.3 ≤ |ρ| < 0.5), or ''large'' (|ρ| ≥ 0.5) (Hinkle et al., 1988). Positive ρ values indicate that one distribution increases when the other increases; negative ρ values indicate that one distribution decreases when the other increases. To measure the statistical significance of Spearman's correlation test, we look at p-values, which measure the probability of an observed (or more extreme) result assuming that the null hypothesis is true. Any test size larger than the p-value leads to rejection of the null hypothesis, whereas a test size smaller than the p-value fails to reject it (Hung et al., 1997).
Here we consider the test size of 5% as the cutoff for statistical significance.
The mutation score of a method A is computed as:

mutation score(A) = (# killed mutants in method A) / (# total mutants in method A)    (1)

We adopt Matlab (MATLAB, 2019) to calculate Spearman's rank-order correlation coefficient between each metric and the mutation score. In particular, we use the corr function with the option ''Spearman'' in Matlab's default package (https://www.mathworks.com/help/stats/corr.html).
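Eq. (1) and the correlation can also be reproduced in a few lines of Python. The following is a self-contained re-implementation of Spearman's ρ (ranks with tie averaging, then Pearson correlation on the ranks); in the study itself we used Matlab's corr function.

```python
def mutation_score(killed, total):
    """Eq. (1): fraction of killed mutants in a method."""
    return killed / total

def ranks(xs):
    """Ranks of xs (1 = smallest), with average ranks for ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rank-order correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Note that a strictly monotonic relationship yields |ρ| = 1 even when the relationship is not linear, which is why Spearman's test suits metrics with skewed distributions.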
Interactions. Beyond the pairwise correlations between the metrics and the mutation score, we are also interested in how those metrics interact with one another. First, we tried regression models to predict mutation scores based on the metrics. However, all the regression models incurred extremely high cross-validation errors, i.e., Root Relative Squared Errors (RRSEs) above 70% (e.g., the RRSE of linear regression is 76.62%). Therefore, we turn the correlation problem into a classification problem for better performance. For simplicity, we use 0.5 as the cutoff between HIGH and LOW mutation scores, because 0.5 is widely used as a cutoff in classification problems whose dependent variable ranges in [0, 1] (e.g., defect prediction (Zhang et al., 2016; Tosun and Bener, 2009)). We consider all the metrics to predict whether a method belongs to the class with a HIGH or a LOW mutation score. One thing to note here is that building a perfect prediction model is not our primary goal. Our interest is to see which metrics and/or combinations of metrics contribute to a LOW mutation score by building the prediction models. Therefore, deciding on different threshold values is outside the scope of this paper.
For prediction, we adopt Random Forest (Breiman, 2001) as the classification algorithm, using WEKA (Frank et al., 2016) to build the prediction model. Random Forest is an ensemble method based on a collection of decision tree classifiers, where the individual decision trees are generated using a random selection of attributes at each node to determine the split (Han et al., 2011). Moreover, Random Forest is typically more accurate than a single decision tree and is less affected by overfitting (Han et al., 2011).
As our investigation includes testability and observability metrics, for each project we compare three types of classification models: (1) a model based merely on existing testability metrics, (2) a model based merely on code observability metrics, and (3) a model based on the combination of existing and our observability metrics (overlapping metrics, e.g., method_length versus NLOC, are only considered once). In particular, we include the model based on the combination of the two aspects for further comparison: to see whether the combination can work better than each aspect by itself. To examine the effectiveness of Random Forest on our dataset, we also consider ZeroR, which classifies all instances into the majority class and ignores all predictors, as the baseline. Our data may be unbalanced: one project has over 90% of its methods with a HIGH mutation score. This entails that a classification model achieving 90% accuracy is not necessarily effective, as ZeroR could also achieve over 90% accuracy in that scenario. Our Random Forest model must thus perform better than ZeroR; otherwise, the Random Forest model is not suitable for our dataset.
In total, we consider four classification models: (1) ZeroR (i.e., the constant classifier), (2) Random Forest based on existing metrics, (3) Random Forest based on code observability metrics, and (4) Random Forest based on the combination of existing metrics and code observability metrics. To build the Random Forest, WEKA (Frank et al., 2016) adopts bagging in tandem with random attribute selection. We use WEKA's default parameters to train the Random Forest model, i.e., ''-P 100 -I 100 -num-slots 1 -K 0 -M 1.0 -V 0.001 -S 1''. To evaluate the performance of the classification model (e.g., precision and recall), we use K-fold cross-validation with K = 10 (Kohavi et al., 1995).
In terms of feature importance, we apply scikit-learn (Pedregosa et al., 2011) to conduct the analysis. To determine feature importance, scikit-learn implements ''Gini Importance'' or ''Mean Decrease Impurity'' (Breiman, 2017). The importance of each feature is computed from the impurity decrease at each node that splits on it, weighted by the probability of reaching that node (approximated by the proportion of samples reaching it) and averaged over all trees of the ensemble (Breiman, 2017). We use the feature_importances_ attribute of the sklearn.ensemble.RandomForestRegressor (scikit-learn, 2019) package to analyse the feature importance.
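The ''Mean Decrease Impurity'' idea behind feature_importances_ can be illustrated in a few lines. This is an explanatory toy of our own, not scikit-learn's implementation, which additionally weights each node by the proportion of samples reaching it and averages over all trees.

```python
def gini(labels):
    """Gini impurity of a set of class labels (e.g., 'HIGH'/'LOW')."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def impurity_decrease(parent, left, right):
    """Gini decrease achieved by one split: the building block that,
    aggregated over nodes and trees, yields a feature's importance."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))
```

A split that separates HIGH-score from LOW-score methods perfectly contributes the maximal decrease at its node; features that never achieve informative splits end up with near-zero importance.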

RQ4
To answer RQ4, we first need to establish the anti-patterns (or smells) based on these metrics. An example of an anti-pattern rule generated from the metrics is method_length > 20 and test_distance > 2; in this case, it is highly likely that the method has a low mutation score. To obtain the anti-pattern rules, we adopt J48 to build a decision tree (Quinlan, 1993; Frank et al., 2016). We choose J48 because of its advantage in interpretability over Random Forest. After building the decision tree, we rank all leaves (or paths) according to the number of instances falling into each leaf and their accuracy. We select the leaves with the most instances and an accuracy ≥ 0.8 for further manual analysis, to understand to what extent refactoring of the anti-patterns can help improve the mutation score.
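Each mined rule is, in effect, a conjunction of threshold tests on the metrics, so the hypothetical example rule from the text can be expressed directly (the thresholds 20 and 2 are the illustrative values above, not thresholds actually learned by J48 in the study):

```python
def matches_anti_pattern(method_length, test_distance):
    """True if the method matches the example rule
    method_length > 20 and test_distance > 2, i.e., it is
    likely to have a LOW mutation score."""
    return method_length > 20 and test_distance > 2
```

A method flagged by such a rule is a refactoring candidate: shortening it or adding a direct test removes the anti-pattern.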

Evaluation metrics
For RQ1, RQ2, and RQ3, to ease the comparison of the four classification models, we consider four metrics widely used in classification problems: precision, recall, AUC, and the mean absolute error.
In our case, we cannot decide which class is the positive one; in other words, we cannot say that a HIGH mutation score is the outcome we expect. We use the prediction models to investigate how those metrics interact with each other. We therefore adopt weighted precision and recall, which also take the number of instances in each class into consideration.
Weighted precision. The precision is the fraction of true positive instances among the instances that are predicted to be positive: TP/(TP+FP). The higher the precision, the fewer false positives. The weighted precision is computed as follows, where p_c1 and p_c2 are the precisions for class 1 and class 2, and |c1| and |c2| are the number of instances in class 1 and class 2, respectively:

weighted precision = (p_c1 · |c1| + p_c2 · |c2|) / (|c1| + |c2|)

Weighted recall. The recall is the fraction of true positive instances among the instances that are actual positives: TP/(TP+FN). The higher the recall, the fewer false negatives. The weighted recall is computed analogously, where r_c1 and r_c2 are the recalls for class 1 and class 2:

weighted recall = (r_c1 · |c1| + r_c2 · |c2|) / (|c1| + |c2|)

AUC. The area under the ROC curve, which measures the overall discrimination ability of a classifier. An area of 1 represents a perfect test; an area of 0.5 represents a worthless test.
Mean absolute error. The mean of the absolute differences between the predicted values and the actual values.
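The weighted averaging above is a plain class-size-weighted mean of the per-class scores. A small worked example, with made-up per-class precisions and counts:

```python
# Class-size-weighted average of per-class scores, as defined above.
# The per-class precisions (0.8, 0.6) and counts (120, 80) are invented.
def weighted(score_c1, score_c2, n_c1, n_c2):
    """Weighted mean of two per-class scores by class size."""
    return (score_c1 * n_c1 + score_c2 * n_c2) / (n_c1 + n_c2)

# e.g. class LOW: precision 0.8 over 120 methods; class HIGH: 0.6 over 80
wp = weighted(0.8, 0.6, 120, 80)
print(wp)  # 0.72
```

The same helper applies unchanged to the per-class recalls, since weighted recall uses the identical formula.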

RQ1-RQ3 testability versus observability versus combination
We opt to discuss the three research questions, RQ1, RQ2, and RQ3, together, because it gives us the opportunity to compare testability, observability, and their combination in detail.

Testability
Findings. Table 7 presents the overall results of Spearman's rank-order correlation analysis for the existing code metrics. The ''rho'' columns give the pairwise correlation coefficient between each code metric and the mutation score. The p-value columns denote the strength of evidence for testing the hypothesis of no correlation against the alternative of a non-zero correlation; we use 0.05 as the cutoff for significance. From Table 7, we can see that except for NOS, NLOC, MOD, EXCR, INST(class), NSUB(class), COH(class) and S-R(class) (which, for convenience, we highlight by underlining the value), the correlation results for the metrics are all statistically significant.
Overall, the pairwise correlation between each source code metric and the mutation score is not strong (|rho| < 0.27). We speculate that the reason behind the weak correlations is the collinearity of these code metrics. More specifically, Spearman's rank-order correlation only evaluates the relation between an individual code metric and the mutation score, while some code metrics interact with one another. For example, a long method does not necessarily have a low mutation score; however, if a long method contains more than four loops, it is very likely to have a low mutation score. This is also an example of collinearity, i.e., the number of loops and the method length are highly correlated.
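The per-metric correlation test described here can be sketched with scipy. The data below is synthetic (a metric that monotonically depresses the mutation score, plus noise), standing in for the real per-method values:

```python
# Pairwise Spearman rank-order correlation between one code metric and
# the mutation score, as in the analysis above. Synthetic data only.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
metric = rng.integers(1, 50, size=100)                  # e.g. a size-like metric
score = 1.0 / (1.0 + metric) + rng.normal(scale=0.01, size=100)  # mutation score

rho, p = spearmanr(metric, score)
# Report the coefficient and whether it clears the 0.05 significance cutoff.
print(round(rho, 3), p < 0.05)
```

Note that, exactly as the paragraph cautions, such a pairwise test says nothing about interactions between metrics; that is what the ensemble models and feature importance analysis are for.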
From Table 7, we can see that the strongest correlation (rho = −0.2634) is achieved by both NSUP(class), the Number of Superclasses, and DIT(class), the Depth of Inheritance Tree, followed by R-R(class) (Reuse Ratio) and HIER(class) (Hierarchy method calls). At first glance, the top four metrics are all class-level metrics. However, we cannot infer that class-level metrics are more impactful on the mutation score than method-level ones; in particular, this may be related to the fact that we considered more class-level metrics than method-level ones in the experiment.
Additionally, we expected that the metrics related to McCabe's Cyclomatic Complexity, i.e., COMP, TCC, AVCC and MAXCC, would show a stronger correlation with the mutation score. McCabe's Cyclomatic Complexity has been widely considered a powerful measure of the complexity of a software program, and it is used to provide a lower bound on the number of tests that should be written (Woodward et al., 1979; Gill and Kemerer, 1991; Fenton and Ohlsson, 2000). Based on our results, and without further investigation, we can only speculate that McCabe's Cyclomatic Complexity might not directly influence the mutation score.
Summary. We found the pairwise correlations between the 64 existing source code metrics and the mutation score to be weak (|rho| < 0.27). The top four metrics with the strongest correlation coefficients are NSUP(class), DIT(class), R-R(class) and HIER(class).

Observability
Findings. It has been shown (Schuler and Zeller, 2013; Zhang and Mesbah, 2015) that the quality of assertions can influence the outcome of mutation testing.

Summary.
The correlations between the code observability metrics and the mutation score are not very strong (|rho| < 0.5); however, they are notably stronger than the correlations for the existing code metrics.
Test directness (test_distance and direct_test_no) overtakes NSUP(class) in |rho| among all metrics (including the existing ones in Section 2.2), followed by the assertion-based metrics (assertion-density, assertion-McCabe and assertion_no).

Table 9 shows the results of the comparison of the four models. To clarify which model performs better than the others, we highlight the values of the best-performing model in bold and those of the second best in underline. For precision, recall, and AUC, the best model is the one with the highest value, while for the mean absolute error the best model exhibits the lowest value. Because the ZeroR model classifies all instances into the majority class, the precision of the minority class is not defined (0/0); thus, in Table 9, we mark these precisions with ''−''.
From Table 9, we can see that the Random Forest models outperform the baseline ZeroR, which only relies on the majority class; this is the prerequisite for further comparison. Combined achieves the best performance (in 5 out of 6 projects) compared to the existing code metrics and the code observability metrics in terms of AUC. This observation is as expected, since combined considers both the existing and our metrics during training, which provides the classification model with more information. The only exception is java-apns-apns-0.2.3 (pid = 4); we conjecture that the number of instances (selected methods) in this project might be too small (only 150 methods) to develop a sound prediction model. In second place comes the model based on code observability metrics, edging out the model based on existing metrics. For the overall dataset (the 7th row, marked ''all'', in Table 9), combined takes first place on all evaluation metrics; in second place comes code observability, slightly better than existing. Another interesting angle to investigate further is test directness. If we only consider the methods that are directly tested (the second-to-last row in Table 9), combined again comes first, followed by the existing code metrics model. The same observation holds for the methods that are not directly tested (the last row in Table 9). This is understandable: when the dataset only contains methods that are directly tested (or only methods that are not), the test directness features in our model become irrelevant. However, the difference between the existing metrics and ours is quite small (<3.4%).

Feature importance analysis.
Tables 10 and 11 show the top 15 features per project (and overall) in descending order. We can see that for five out of the six projects (including the overall dataset), test_distance ranks first. This again supports our previous finding that test directness plays a significant role in mutation testing. The remaining 14 features in the top 15 vary per project; this is not surprising, as the task and context of these projects vary greatly. For example, Apache Commons Lang (Column ''2'' in Table 10) is a utility library that provides a host of helper methods for the java.lang API. Most methods in Apache Commons Lang are public and static; thus, is_public and is_static are not among its top 15 features. A totally different context is provided by the JFreeChart project (Column ''5'' in Table 10): JFreeChart is a Java chart library whose class encapsulation and inheritance hierarchy are well-designed, so is_public appears among the top 15 features. Looking at the overall dataset (Column ''all'' in Table 11), eight of our proposed code observability metrics appear among the top 15 features. The importance of test_distance is much higher than that of the other features (>4.83X). In second place comes PACK(class), the number of packages imported. This observation is easy to understand, since PACK(class) denotes the complexity of dependencies, and dependencies can influence the difficulty of testing, especially when mocking objects are used; thereby, dependencies affect the mutation score. Clearly, more investigation is required to draw further conclusions. Third place in the feature importance analysis is taken by NOCL, the Number of Comments. This observation is quite interesting, since NOCL is related to how hard it is to understand the code (code readability); this implies that code readability might have an impact on mutation testing.
As for the methods with direct tests (Column ''dir.'' in Table 11), is_void takes the first position, which indicates that it is more difficult to achieve a high mutation score for void methods.
Another observation stems from comparing the performance of the assertion-related metrics in the feature importance analysis with the Spearman rank-order correlation results (in Section 4.1). For Spearman's rank-order correlation, the assertion-related metrics are the second most significant category, right after test directness (Table 8 in Section 4.1), while in the feature importance analysis the assertion-related metrics mostly rank after the top 5 (shown in Tables 10 and 11). To further investigate the reason behind this dramatic change of ranks, we analyse the correlations between test directness (i.e., direct_test_no and test_distance) and the assertion-related metrics (i.e., assertion_no, assertion-McCabe and assertion_distance). Looking at the correlation results in Table 12, the major reason is that test directness and the assertion-related metrics are almost collinear in the prediction model (|rho| > 0.87). Put simply, there are almost no tests without assertions in the six subjects: if a method has a direct test, the corresponding assertion_no is always greater than 1. Therefore, the ranks of the assertion-related metrics are not as high as we had initially expected in the feature importance analysis.
Moreover, we would like to put our observations into perspective by comparing our results with the work of Zhang et al. (2018), who constructed a similar Random Forest model to predict whether a mutant is killable, based on a series of features related to mutants and tests. The metrics that are common to their model and ours are Cyclomatic Complexity (COMP), Depth of Inheritance Tree (DIT), nested_depth, Number of Subclasses (NSUB), and method_length. Only two metrics from their study, i.e., method_length (in 6th place) and nested_depth (in 10th place), appear in our top 15 (Column ''all'' in Table 11).
Notably, COMP, which ranks ninth in their results, is not in our top 15. There are multiple reasons for the difference in results: (i) we consider a much larger range of metrics, which provides better explanatory power (statistically speaking) than theirs; (ii) our goal is to determine patterns in production and test code that may prevent killing certain mutants, while Zhang et al. (2018) predict whether a mutant is killable (i.e., a different prediction target and a different granularity level); moreover, as we show in the next section, we can use our model to determine common anti-patterns with proper statistical methods; (iii) the subjects used in our experiment differ from theirs. For example, in project java-apns-apns-0.2.3 (Column ''4'' in Table 11), COMP does appear among the top 15.

Summary.
Overall, the Random Forest model based on the combination of existing code metrics and code observability metrics performs best, followed by the model based on code observability metrics alone. The feature importance analysis shows that test directness ranks highest, remarkably higher than the other metrics.

RQ4 code refactoring
Our goal is to investigate whether we can refactor away the observability issues that we expect to hinder tests from killing mutants and thus to affect the mutation score. In an in-depth case study, we manually analysed 16 code fragments to better understand the interaction between the testability/observability metrics that we have been investigating and the possibilities for refactoring.
Our analysis starts from the combined model, which, as Table 9 shows, takes the leading position among the models. We then apply Principal Component Analysis (PCA) (Wold et al., 1987) to perform feature selection, which, as Table 13 shows, leaves us with 36 features (or metrics). Then, as discussed in Section 3, we build a decision tree based on those 36 metrics using J48 (shown in Fig. 4), and select the top 6 leaves (also called end nodes) in the decision tree for further manual analysis as potential refactoring guidelines. We present the top six anti-patterns in Table 14.
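The PCA step can be sketched as follows. This is a simplified scikit-learn illustration on synthetic data (ten columns, five of which are near-duplicates of the others), not the paper's actual 36-feature selection; it shows how PCA collapses collinear metrics, which is the motivation for applying it here.

```python
# PCA-based dimensionality reduction: redundant (collinear) metric columns
# collapse into few components. Synthetic data for illustration only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
base = rng.normal(size=(150, 5))
# Ten columns total: five originals plus five near-identical copies,
# mimicking the collinearity observed among the code metrics.
X = np.hstack([base, base + rng.normal(scale=0.01, size=(150, 5))])

pca = PCA(n_components=0.95).fit(X)   # keep 95% of the variance
print(pca.n_components_)              # the five redundant copies are absorbed
```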
Here, we use a partial decision tree to demonstrate how we generate rules (shown in Fig. 5). In Fig. 5, there are three attributes (marked as ellipses) and four end nodes, or leaves (marked as rectangles), in the decision tree. Since we would like to investigate how code refactoring increases the mutation score (RQ4), we only consider the end nodes labelled ''LOW'', denoting a mutation score < 0.5. By combining the conditions along the paths of the decision tree, we obtain the two rules for the ''LOW'' end nodes (shown in the first column of the table in Fig. 5). Every end node has two values attached to its class: the first is the number of instances that correctly fall into the node, the other is the number of instances that incorrectly fall into it. The accuracy in the table is computed as the number of correct instances divided by the total number of instances. As mentioned earlier, we select the top 6 end nodes from the decision tree, where the end nodes are ranked by the number of correct instances under the condition accuracy ≥ 0.8.
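The leaf-ranking procedure can be sketched in code. The paper uses WEKA's J48; the sketch below uses scikit-learn's CART on synthetic data whose ground truth mirrors the example rule method_length > 20 and test_distance > 2, so the learned thresholds are illustrative, not the paper's.

```python
# Train a shallow decision tree, then compute per-leaf instance counts and
# accuracy (correct / total), as used to rank the "LOW" anti-pattern leaves.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
method_length = rng.integers(1, 60, size=400)
test_distance = rng.integers(0, 5, size=400)
# Ground truth: LOW (1) mutation score for long, only indirectly tested methods.
low = ((method_length > 20) & (test_distance > 2)).astype(int)

X = np.column_stack([method_length, test_distance])
tree = DecisionTreeClassifier(max_depth=2, random_state=4).fit(X, low)

# Each leaf corresponds to a conjunction of threshold tests along its path;
# rank leaves by instance count, keeping those with accuracy >= 0.8.
leaf_ids = tree.apply(X)
for leaf in np.unique(leaf_ids):
    mask = leaf_ids == leaf
    majority_acc = max(low[mask].mean(), 1 - low[mask].mean())
    print(f"leaf {leaf}: {mask.sum()} instances, accuracy {majority_acc:.2f}")
```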
After selecting the rules, the first author of this paper conducted the main task of the manual analysis. Whenever questions arose during the manual analysis, the attempted refactorings or added tests were discussed among all the authors to reach an agreement. In our case study, we manually analysed 16 cases in total. Due to space limitations, we only highlight six cases in this paper (all details are available on GitHub (Zhu, 2019)). We discuss our findings on code refactoring case by case.

Certain object-oriented (OO) design features can increase complexity and hence hinder testing and testability (Suri and Singhani, 2015).
Existing literature (Mouchawrab et al., 2005; Singh and Saha, 2010; Zhou et al., 2012; Nazir et al., 2010) has already addressed this dilemma. Mouchawrab et al. (2005) pointed out that increasing the size of the inheritance hierarchy can increase the cost of testing due to dynamic dependencies. Singh and Saha (2010) showed that inheritance and polymorphism increase testing effort and lower software testability. All the works above indicate that there is a trade-off between OO design features and software testability. Currently, it is up to practitioners to balance the two perspectives themselves, depending on the requirements of the software and their preferences.
In the context of mutation testing, a similar trade-off between OO design features and the ease of killing mutants exists. In this study, we relate the ease of killing mutants to testability and observability. In Section 5.7, we found that a simple strategy to kill all the mutants is to write additional direct tests and/or assertions. However, some OO design features related to encapsulation, such as the private access modifier (see Listing 8), make it harder to add a direct test. Also, the void return type prevents killing mutants generated from intermediate states that cannot propagate to the output (see Listing 4). An important note, therefore, is that our refactoring recommendations in Section 5.7 are centred around the anti-patterns based on testability and observability; they do not take OO design principles into consideration. The recommendations attempt to help developers understand the causes of a low mutation score in terms of testability and observability, but not all surviving mutants are due to test quality.
Take Listing 8, for instance. Suppose a developer finds that the mutation score of this method is low, and our tool shows that the low score is mainly due to the private access modifier. The developer can then decide to ignore the surviving mutants if they cannot break encapsulation given the requirements; or, if this method is critical and must be well-tested according to the documentation, they may change the access modifier from private to protected/public to kill the mutants. Whether developers make use of these testability and observability recommendations depends on their choice between (1) adding test cases (Beller et al., 2015a,b, 2019), (2) refactoring the production code to kill the mutants, or (3) ignoring the surviving mutants.

Threats to validity
External validity. Our results are based on mutants generated by the operators implemented in PIT. While PIT is a frequently used mutation testing tool, our results might differ when using other mutation tools (Kurtz et al., 2016). Concerning subject system selection, we chose six open-source projects from GitHub; the selected projects differ in size, number of test cases, and application domain. Moreover, as mentioned in Section 3, the large number of methods with a low number of mutants is an unavoidable bias in our experiment; this is partly due to PIT's optimisation mechanism (Coles, 2019d) and partly due to the large number of short methods in these projects. Nevertheless, we acknowledge that a broad replication of our study would mitigate generalisability concerns even further.
Internal validity. The main threat to internal validity is the implementation of the Mutation Observer tool used in the experiment. To reduce internal threats, we rely on existing tools that have been widely used, e.g., WEKA, MATLAB, and PIT. Moreover, we carefully reviewed and tested all code for our study to eliminate potential faults in our implementation. Another threat to internal validity is that we disregard equivalent mutants in our experiment. This threat is, however, unavoidable and shared by other studies on mutation testing, whether or not they attempt to detect equivalent mutants (Grün et al., 2009; Mirshokraie et al., 2013). Moreover, we consider equivalent mutants a potential weakness in the software (as reported by Coles (2019b), slides 44-52); thereby, we did not manually detect equivalent mutants in this paper.
Construct validity.The main threat to construct validity is the measurement we used to evaluate our methods.We minimise this risk by adopting evaluation metrics that are widely used in research (such as recall, precision, and AUC), as well as a sound statistical analysis to assess the significance (Spearman's rank-order correlation).

Related work
The notion of software testability dates back to 1991, when Freedman (1991) formally defined observability and controllability in the domain of software. Voas (1992) proposed a dynamic technique coined propagation, infection, and execution (PIE) analysis for statistically estimating a program's fault sensitivity. More recently, researchers have aimed to increase our collective understanding of testability by using statistical methods to predict testability based on various code metrics. A prime example is the work of Bruntink and van Deursen (2006), who explored the relationship between nine class-level object-oriented metrics and testability. To the best of our knowledge, no study uses statistical or machine learning methods to investigate the relationship between testability/observability metrics and the mutation score.
Mutation testing was initially introduced as a fault-based testing method that was regarded as significantly better at detecting errors than the covering measure approach (Budd et al., 1979). Since then, mutation testing has been actively investigated and studied, resulting in remarkable advances in its concepts, theory, technology, and empirical evidence. For more literature on mutation testing, we refer to the existing surveys of DeMillo (1989), Offutt and Untch (2001), Jia and Harman (2011), Offutt (2011) and Zhu et al. (2018b). Here we mainly address the studies that concern mutant utility (Just et al., 2017), i.e., the efficacy of mutation testing. Yao et al. (2014) reported on the causes and prevalence of equivalent mutants and their relationship to stubborn mutants, based on a manual analysis of 1230 mutants. Visser (2016) conducted an exhaustive analysis of all possible test inputs to determine how hard it is to kill a mutant for three common mutation operators (i.e., relational, integer constants and arithmetic operators). His results show that mutant reachability, mutation operators, and oracle sensitivity are the key contributors to how hard it is to kill a mutant. Just et al. (2017) showed a strong correlation between mutant utility and context information from the program in which the mutant is embedded. Brown et al. (2017) developed a method for creating potential faults that are more closely coupled with changes made by actual programmers, which they named ''wild-caught mutants''. Chekam et al. (2018) investigated the problem of selecting fault-revealing mutants; they put forward a machine learning approach (decision trees) that learns to select fault-revealing mutants from a set of static program features. Jimenez et al. (2018) investigated the use of natural language modelling techniques in mutation testing. All the studies above have enriched the understanding of mutation testing, especially its efficacy. However, the aim of our work differs from those studies, as we would like to gain insights into how code quality, in terms of testability and observability, affects the effort needed for mutation testing, and especially how to engineer tests to kill more mutants.
Similar to our study, a few recent studies have investigated the relationships between assertions and test directness and mutation testing. Schuler and Zeller (2013) introduced checked coverage, the ratio of statements that contribute to the computation of values that are later checked by the test suite, as an indicator of oracle quality. In their experiment, they compared checked coverage with the mutation score and found that checked coverage is more sensitive than mutation testing in evaluating oracle quality. Huo and Clause (2016) proposed direct coverage and indirect coverage by combining the concept of test directness with conventional statement coverage. They used mutants as an indicator of test suite effectiveness and found that faults in indirectly covered code are significantly less likely to be detected than those in directly covered code. Zhang and Mesbah (2015) evaluated the relationship between test suite effectiveness (in terms of the mutation score) and (1) the number of assertions, (2) assertion coverage, and (3) different types of assertions. They found that test assertion quantity and assertion coverage are strongly correlated with the mutation score, and that assertion types can also influence test suite effectiveness. Compared to our study, those works each address only one or two aspects of code observability. We provide a complete view of the relationships between code observability and mutation testing.
The study most closely related to ours is Zhang et al. (2018)'s predictive mutation testing, in which they constructed a classification model to predict whether a mutant is killable, based on a series of features related to mutants and tests. In their discussion, they compared source code related features and test code related features in the prediction model and found that test code features are more important than source code ones. However, from their results we cannot draw clear conclusions on the impact of production code on mutation testing, as their goal is to predict exact killable-mutant results. Another interesting work close to ours is Vera-Pérez et al. (2017)'s pseudo-tested methods. Pseudo-tested methods are methods that are covered by the test suite but for which no test case fails even if the entire method body is stripped. They rely on the idea of ''extreme mutation'', which completely removes the body of a method. The difference between Vera-Pérez et al. (2017)'s study and ours is that we focus on conventional mutation operators rather than ''extreme mutation''.

Conclusion & future work
This paper aims to bring a new perspective to software developers, helping them to understand and reason about the mutation score in the light of testability and observability. This should enable developers to make decisions on the possible actions to take when confronted with low mutation scores. To achieve this goal, we first investigate the relationship between testability and observability metrics and the mutation score. More specifically, we have collected 64 existing source code quality metrics for testability and have proposed a set of metrics that specifically target observability. The results of our empirical study involving six open-source projects show that the 64 existing code quality metrics are not strongly correlated with the mutation score (|rho| < 0.27). In contrast, the 19 newly proposed code observability metrics, which are defined in terms of both production code and test cases, do show a stronger correlation with the mutation score (|rho| < 0.5). In particular, the test directness metrics test_distance and direct_test_no stand out.
To better understand the causality behind our insights, we continued our investigation with a manual analysis of 16 methods that scored particularly badly in terms of mutation score, i.e., a number of their mutants were not killed by the existing tests. In particular, we refactored these methods and/or added tests according to the anti-patterns that we established in terms of the code observability metrics. Our aim here is to establish whether removing the observability anti-patterns leads to an increase in the mutation score. We found that these anti-patterns can indeed provide insights that help kill the mutants, by indicating whether the production code or the test suite needs improvement. For instance, we found that private methods (expressed as is_public = 0 in our schema) are prime candidates for refactoring to increase their observability, e.g., by making them public or protected for testing purposes.
However, some refactoring recommendations could violate OO design principles. For example, by changing private to protected/public we increase observability, but we also break encapsulation. Therefore, we suggest that developers choose between (1) adding test cases, (2) refactoring the production code to kill the mutants, or (3) ignoring the surviving mutants, by considering the trade-off between OO design features and testability/observability.
To sum up, our paper makes the following contributions:
1. 19 newly proposed code observability metrics;
2. A detailed investigation of the relationship between testability/observability metrics and the mutation score (RQ1-RQ3);
3. A case study with 16 code fragments to investigate whether removal of the anti-patterns increases the mutation score (RQ4);
4. A guideline for developers to make choices when confronted with low mutation scores;
5. A prototype tool coined Mutation Observer (openly available on GitHub (Zhu, 2019)) that automatically calculates the code observability metrics.

• RQ1: What is the relation between testability metrics and the mutation score?
• RQ2: What is the relation between observability metrics and the mutation score?
• RQ3: What is the relation between the combination of testability and observability metrics and the mutation score?
• RQ4: To what extent does removal of anti-patterns based on testability and observability help in improving the mutation score?

Fig. 1. Distribution of mutation score and mutant no.

Future work. With our tool, and since the results are encouraging, we envision the following future work: (1) conduct additional empirical studies on more subject systems; (2) evaluate the usability of our code observability metrics by involving practitioners; (3) investigate the relations between more code metrics (e.g., code readability) and the mutation score.

CRediT authorship contribution statement. Qianqian Zhu: Methodology, Software, Data curation, Visualization, Investigation, Validation, Writing - original draft, Resources. Andy Zaidman: Conceptualization, Methodology, Supervision, Project administration, Funding acquisition, Writing - review & editing. Annibale Panichella: Methodology, Supervision, Writing - review & editing.

Table 1
Summary of method-level code quality metrics.

Table 2
Summary of class-level code quality metrics (1).

Table 3
Summary of class-level code quality metrics (2).

Table 4
Summary of mutants from Listing 3.

Table 6
Subject systems.

Table 7
Spearman results of existing code metrics for testability.

Table 16
Summary of mutants from Listing 8 (Case 2).