Smells Like Teen Spirit: Improving Bug Prediction Performance Using The Intensity of Code Smells
Abstract—Code smells are symptoms of poor design and implementation choices. Previous studies empirically assessed the impact of smells on code quality and clearly indicate their negative impact on maintainability, including a higher bug-proneness of components affected by code smells. In this paper, we capture previous findings on bug-proneness to build a specialized bug prediction model for smelly classes. Specifically, we evaluate the contribution of a measure of the severity of code smells (i.e., code smell intensity) by adding it to existing bug prediction models and comparing the results of the new model against the baseline model. Results indicate that the accuracy of a bug prediction model increases by adding the code smell intensity as predictor. We also evaluate the actual gain provided by the intensity index with respect to the other metrics in the model, including the ones used to compute the code smell intensity. We observe that the intensity index is much more important than the other metrics used for predicting the bugginess of smelly classes.

I. INTRODUCTION

In the last decade, the research community has spent a lot of effort investigating bad code smells (shortly "code smells" or simply "smells"), i.e., symptoms of poor design and implementation choices applied by programmers during the development of a software project [1]. Besides approaches for the automatic identification of code smells in source code [2]–[7], empirical studies have been conducted to understand when and why code smells appear [8], the relevance they have for developers [9], [10], their evolution and longevity in software projects [11]–[14], as well as the negative effects of code smells on software understandability [15] and maintainability [16]–[19]. Recently, Khomh et al. [20] have also empirically demonstrated that classes affected by design problems ("antipatterns") are more prone to contain bugs in the future. Although this study showed the potential importance of code smells in the context of bug prediction, these observations have not been captured in bug prediction models yet. Indeed, while previous work has proposed the use of predictors based on product metrics (e.g., see [21]–[23]), as well as the analysis of change-proneness [24]–[26], the entropy of changes [27], or human-related factors [28]–[30] to build accurate bug prediction models, none of them takes into account a measure able to quantify the presence and the severity of design problems affecting code components.

In this paper, we aim at making a further step ahead by studying the role played by bad code smells in bug prediction. Our hypothesis is that taking into account the severity of a design problem affecting a source code element in a bug prediction model can contribute to the correct classification of the bugginess of such a component. To verify this conjecture, we use the intensity index (i.e., a metric able to estimate the severity of a code smell) defined by Arcelli Fontana et al. [31] to build a bug prediction model that takes into account the presence and the severity of design problems affecting a code component. Specifically, we evaluate the predictive power of the intensity index by adding it to a bug prediction model based on structural quality metrics [32], and comparing its accuracy against the one achieved by the baseline model on six large Java open source systems. We also quantified the gain provided by the addition of the intensity index with respect to the other structural metrics in the model, including the ones used to compute the intensity. Finally, we report further analyses aimed at understanding (i) the accuracy of a model where a simple truth value reporting the presence/absence of code smells, rather than the intensity index, is added to the baseline model, (ii) the impact of false positive smell instances identified by the code smell detector, and (iii) the contribution of the intensity index in bug prediction models based on process metrics.

The results of our study indicate that:

• The addition of the intensity index as a predictor of buggy components positively impacts the accuracy of a bug prediction model based on structural quality metrics. We observed an improvement in classification accuracy of up to 25% as compared to the accuracy achieved by the baseline model.
• The intensity index is more important than other quality metrics for the prediction of the bug-proneness of smelly classes.
• The presence of a limited number of false positive smell instances identified by the code smell detector does not impact the accuracy and the practical applicability of the proposed specialized bug prediction model.
• The intensity index positively impacts the performance of bug prediction models based on process metrics, increasing the accuracy of the classification up to 47%.

Structure of the paper. Section II discusses the related literature on bug prediction models, while Section III presents the specialized bug prediction model for smelly classes. Section IV describes the design and the results of the case study aimed at evaluating the accuracy of the proposed model. Section V discusses the results of the additional analyses we conducted.
TABLE I
CODE SMELL DETECTION STRATEGIES (THE COMPLETE NAMES OF THE METRICS ARE GIVEN IN TABLE II)

Code Smell → Detection Strategy; LABEL(n) means that LABEL has the threshold value n for that smell.
God Class LOCNAMM ≥ HIGH(176) ∧ WMCNAMM ≥ MEAN(22) ∧ NOMNAMM ≥ HIGH(18) ∧ TCC ≤ LOW(0.33) ∧ ATFD ≥ MEAN(6)
Data Class WMCNAMM ≤ LOW(14) ∧ WOC ≤ LOW(0.33) ∧ NOAM ≥ MEAN(4) ∧ NOPA ≥ MEAN(3)
Brain Method (LOC ≥ HIGH(33) ∧ CYCLO ≥ HIGH(7) ∧ MAXNESTING ≥ HIGH(6)) ∨ (NOLV ≥ MEAN(6) ∧ ATLD ≥ MEAN(5))
Shotgun Surgery CC ≥ HIGH(5) ∧ CM ≥ HIGH(6) ∧ FANOUT ≥ LOW(3)
Dispersed Coupling CINT ≥ HIGH(8) ∧ CDISP ≥ HIGH(0.66)
Message Chains MaMCL ≥ MEAN(3) ∨ (NMCS ≥ MEAN(3) ∧ MeMCL ≥ LOW(2))
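To make the shape of these detection strategies concrete, the following minimal Java sketch shows how the God Class rule of Table I can be evaluated as a boolean predicate over metric values computed elsewhere; the class and parameter names are illustrative, and not part of the actual detector implementation.

```java
/**
 * Illustrative re-implementation of the God Class strategy of Table I.
 * The metric values are assumed to be produced by a separate metrics
 * extraction step; names and structure here are ours, not JCodeOdor's.
 */
public class GodClassRule {

    static boolean isGodClass(double locnamm, double wmcnamm,
                              double nomnamm, double tcc, double atfd) {
        return locnamm >= 176      // LOCNAMM >= HIGH(176)
            && wmcnamm >= 22       // WMCNAMM >= MEAN(22)
            && nomnamm >= 18       // NOMNAMM >= HIGH(18)
            && tcc     <= 0.33     // TCC <= LOW(0.33)
            && atfd    >= 6;       // ATFD >= MEAN(6)
    }

    public static void main(String[] args) {
        // A large, complex, poorly cohesive class that accesses foreign data:
        System.out.println(isGodClass(200, 25, 20, 0.20, 8)); // prints: true
    }
}
```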
TABLE III
DEFAULT THRESHOLDS FOR ALL SMELLS

Metric      VERY-LOW  LOW   MEAN  HIGH  VERY-HIGH
God Class:
LOCNAMM     26        38    78    176   393
Shotgun Surgery:
CM          2         3     4     6     13
FANOUT      2         3     4     5     6
Dispersed Coupling:
CINT        3         4     5     8     12
CDISP       0.25      0.33  0.5   0.66  0.75
Message Chains:
MaMCL       2         3     3     4     7
MeMCL       2         2     3     4     5
NMCS        1         2     3     4     5

predictor of a structural metrics-based bug prediction model. Indeed, we cannot use the intensity index as a single predictor, since in this case we could not predict the bug-proneness of classes not affected by any design problem (the intensity index for non-smelly classes is equal to 0). Thus, to build the proposed bug prediction model we first split the training set by considering smelly (as identified by the code smell detector) and non-smelly classes. We then assign to smelly classes an intensity index according to the evaluation performed by JCodeOdor, while we set the intensity of non-smelly classes to 0. Finally, we add the information about the intensity to a set of structural metrics in order to apply the predictions.
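A minimal sketch of this construction is shown below, assuming hypothetical types: a CodeClass record holding the 20 metrics of [32], a map from class names to detector-computed intensities, and the set of manually validated smelly classes. It illustrates the procedure rather than the implementation used in the paper.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;

public class FeatureBuilder {

    // Hypothetical stand-in for an analyzed class: its name plus the values
    // of the 20 structural metrics of Jureczko et al. [32].
    record CodeClass(String name, double[] structuralMetrics) {}

    /** Feature vector = the 20 structural metrics plus the intensity index. */
    static double[] buildFeatures(CodeClass cls,
                                  Map<String, Double> intensityByClass,
                                  Set<String> validatedSmellyClasses) {
        double[] m = cls.structuralMetrics();
        double[] features = Arrays.copyOf(m, m.length + 1);
        // Smelly classes keep the intensity computed by the detector;
        // non-smelly classes (and discarded false positives) get 0.
        features[m.length] = validatedSmellyClasses.contains(cls.name())
                ? intensityByClass.getOrDefault(cls.name(), 0.0)
                : 0.0;
        return features;
    }
}
```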
IV. EVALUATION OF THE PROPOSED MODEL

The goal of the empirical study is to evaluate the contribution of the intensity index in a prediction model aimed at discovering bug-prone code components, with the purpose of improving the allocation of resources in the verification & validation activities focusing on components having a higher bug-proneness. The quality focus is on the prediction accuracy and completeness as compared to state-of-the-art approaches, while the perspective is that of researchers, who want to evaluate the effectiveness of using information about code smells when identifying bug-prone components.

The context of the study consists of six software systems having different size and scope, namely Apache Xerces (http://xerces.apache.org), Apache Xalan (http://xalan.apache.org), Apache Velocity (http://velocity.apache.org), Apache Tomcat (http://tomcat.apache.org), Apache Lucene (http://lucene.apache.org), and Apache Log4j (http://logging.apache.org/log4j/2.x/). Table IV reports the characteristics of the analyzed software systems in terms of (i) system's size, considering the number of classes and KLOC, and (ii) the percentage of buggy and smelly classes (more details later). All the data used in the study are publicly available in our online appendix [50].

TABLE IV
SOFTWARE PROJECTS IN OUR DATASET

System  Classes  KLOCs  % Buggy Cl.  % Smelly Cl.

A. Empirical Study Definition and Design

In the context of this empirical investigation, we formulated the following research questions:

RQ1: To what extent does the intensity index contribute to the prediction of bug-prone code components?

RQ2: What is the gain provided by the intensity index to the bug prediction model when compared to the other predictors?

To answer RQ1, we first need an oracle reporting the presence of bugs in the source code of the analyzed software projects. Fortunately, all the systems are hosted in the PROMISE repository [51], which collects a large dataset of bugs and provides oracles for all the projects in this study. Secondly, we need to instantiate the prediction model presented in Section III to define (i) the basic predictors, (ii) the code smell detection process, and (iii) the machine learning technique to use for classifying buggy instances. As for the software metrics to use as basic predictors in the model, the related literature proposes several alternatives, with a main distinction between product metrics (e.g., lines of code, code complexity) and process metrics (e.g., past changes and bug fixes performed on a code component). To better understand the predictive power of the intensity index, we decided to test its contribution in a bug prediction model composed of structural predictors, in particular the 20 quality metrics exploited by Jureczko et al. [32]. Our choice is guided by the will to investigate whether the use of a single additional structural metric representing the intensity of code smells is able to add useful information to a prediction model already characterized by structural predictors, as well as by the set of code metrics used for the computation of the intensity index. Thus, to measure the extent to which the contribution of the intensity index is useful for predicting bugs, we experimented with the following bug prediction models:
• Basic Model: the model based on the 20 software metrics defined by Jureczko et al. [32];
• Basic Model + Intensity: the model above based on the 20 software metrics, plus the intensity index. It is worth remembering that, for non-smelly classes, the intensity value is set to 0.

Applying this procedure, we were able to control the effective contribution of the index during the prediction of bugs. Regarding the code smell detection process, our study is focused on the analysis of the code smells for which an intensity index has been defined (see Section III). To this aim, we rely on the detection performed by JCodeOdor [31], because on the one hand it has been empirically validated, demonstrating good performance in detecting code smells, and on the other hand it detects all the code smells considered in the empirical study. Finally, it computes the value of the intensity index on the detected code smells. To build a bug prediction model that discriminates actual smelly and non-smelly classes, we decided to discard the false positive instances from the set of candidate code smells given by the detection tool (in other words, we set the intensity of false positives to 0). To this aim, we manually discarded such instances by comparing the results of the tool against an annotated set of code smell instances publicly available [52]. It is worth observing that the best solution would be that of considering all the actual smell instances in a software project (i.e., the golden set). However, the smell instances which are not detected by JCodeOdor do not exceed the structural metric thresholds that allow the tool to detect them and assign them an intensity value. As a consequence, the intensity index assigned to these instances would be equal to 0, and they would still have no effect on the prediction model.
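As an illustration of this filtering step, the following sketch retains only the detector candidates confirmed by the annotated oracle; the Smell record and the class-plus-type encoding of the oracle are hypothetical simplifications, not the actual format of the dataset in [52].

```java
import java.util.List;
import java.util.Set;

public class SmellValidation {

    // Hypothetical detector output: affected class, smell type, intensity.
    record Smell(String className, String type, double intensity) {}

    /**
     * Keeps only the candidates confirmed by the manually annotated oracle;
     * everything else is treated as a false positive and dropped, which is
     * equivalent to forcing its intensity to 0 in the prediction model.
     */
    static List<Smell> validate(List<Smell> candidates, Set<String> oracle) {
        return candidates.stream()
                .filter(s -> oracle.contains(s.className() + "#" + s.type()))
                .toList();
    }
}
```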
The final step is the definition of the machine learning classifier to use. We experimented with several classifiers, namely Multilayer Perceptron [53], ADTree [54], Naive Bayes [55], Logistic Regression [56], Decision Table Majority [57], and Simple Logistic [58]. We empirically compared the results achieved by the prediction model on the software systems used in our study (more details on the adopted procedure later in this section). A complete comparison among the experimented classifiers can be found in our online appendix [50]. Over all the systems, the best results on the baseline model were obtained using Simple Logistic, confirming previous findings in the field [42], [59]. Thus, in this paper we report the results of the models built with this classifier. This classifier uses a statistical technique based on a probability model. Indeed, instead of a simple classification, the probability model gives the probability of an instance belonging to each individual class (i.e., buggy or not), describing the relationship between a categorical outcome (i.e., buggy or not) and one or more predictors [58].
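All the classifiers named above are implemented in the Weka toolkit; assuming that environment, the comparison can be sketched as follows, using the 10-fold cross-validation procedure described next. The ARFF file name is a placeholder for a per-system dataset, and class index 0 is assumed to denote the buggy label.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.functions.SimpleLogistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifierComparison {
    public static void main(String[] args) throws Exception {
        // Per-system dataset: structural metrics (+ intensity), buggy label last.
        Instances data = DataSource.read("xalan-metrics.arff"); // placeholder name
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] candidates = { new SimpleLogistic(), new NaiveBayes(),
                                    new Logistic(), new MultilayerPerceptron() };
        for (Classifier c : candidates) {
            Evaluation eval = new Evaluation(data);
            // 10-fold cross-validation: every instance is tested exactly once.
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-22s acc=%5.1f%%  prec=%.2f  rec=%.2f  AUC=%.2f%n",
                    c.getClass().getSimpleName(), eval.pctCorrect(),
                    eval.precision(0), eval.recall(0), eval.areaUnderROC(0));
        }
    }
}
```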
Once the model has been instantiated, to assess its performance we adopted the 10-fold cross-validation strategy [60]. This strategy randomly partitions the original set of data into 10 equal-sized subsets. Of the 10 subsets, one is retained as the test set, while the remaining 9 are used as the training set. The cross-validation is then repeated 10 times, allowing each of the 10 subsets to be the test set exactly once [60]. We used this test strategy since it allows all observations to be used for both training and testing purposes, but also because it has been widely used in the context of bug prediction (e.g., see [28], [61]–[63]). Finally, we answer RQ1 by reporting three widely adopted metrics, namely accuracy, precision, and recall [64]. In addition, we also report the Area Under the Curve (AUC) obtained by the prediction model. The AUC quantifies the overall ability of a prediction model to discriminate between buggy and non-buggy classes. The closer the AUC to 1, the higher the ability of the classifier to discriminate classes affected and not affected by a bug. On the other hand, the closer the AUC to 0.5, the lower the accuracy of the classifier. Besides the analysis of the performance of the specialized bug prediction model and its comparison with the baseline model, we also investigate the behavior of the experimented models in the classification of smelly and non-smelly instances. Specifically, we compute the percentage of smelly and non-smelly classes correctly classified by each of the prediction models, to evaluate whether the intensity-including model is actually able to give a contribution in the classification of classes affected by a code smell, or whether the addition of the intensity index also affects the classification of smell-free classes.

As for RQ2, we conduct a fine-grained investigation aimed at measuring how important the intensity index is with respect to the other features (i.e., metrics) composing the model. In particular, we use an information gain algorithm [65] to quantify the gain provided by adding the intensity index to the prediction model. Formally, let M be a bug prediction model and let P = {p_1, ..., p_n} be the set of predictors composing M. An information gain algorithm [65] applies the following formula to compute a measure which defines the difference in entropy from before to after the set M is split on an attribute p_i:

InfoGain(M, p_i) = H(M) − H(M | p_i)    (1)

where the function H(M) indicates the entropy of the model before the split, while the function H(M | p_i) measures the conditional entropy of the model after it is split on p_i. Entropy is computed as follows:

H(M) = − Σ_{i=1}^{n} prob(p_i) log_2 prob(p_i)    (2)

In other words, the algorithm quantifies how much uncertainty in M was reduced after splitting M on attribute p_i. In the context of our work, we apply the Gain Ratio Feature Evaluation algorithm [65], which ranks p_1, ..., p_n in descending order based on the contribution provided by each p_i to the decisions made by M. In particular, the output of the algorithm is a ranked list in which the predictors having the highest expected reduction in entropy are placed at the top. Using this procedure, we evaluate the relevance of the predictors in the prediction model, possibly understanding whether the addition of the intensity index gives a higher contribution with respect to the structural metrics from which it is derived (i.e., the metrics used for the detection of the smells) or with respect to the other structural metrics contained in the model.
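Assuming again the Weka toolkit, the ranking just described corresponds to combining its GainRatioAttributeEval evaluator with a Ranker search; a minimal sketch (dataset file name hypothetical):

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PredictorRanking {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("xalan-metrics.arff"); // placeholder name
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new GainRatioAttributeEval()); // gain ratio per predictor
        selector.setSearch(new Ranker());                    // sort in descending order
        selector.SelectAttributes(data);

        // Each row holds [attribute index, gain ratio]; the predictors with the
        // highest expected reduction in entropy are placed at the top.
        for (double[] ranked : selector.rankedAttributes()) {
            System.out.printf("%-12s %.4f%n",
                    data.attribute((int) ranked[0]).name(), ranked[1]);
        }
    }
}
```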
TABLE V
ACCURACY, PRECISION, RECALL, F-MEASURE, AUC-ROC, AND PERCENTAGE OF BUGGY CLASSES AFFECTED (AND NOT) BY A SMELL CORRECTLY CLASSIFIED BY THE EXPERIMENTED PREDICTION MODELS

Project          Model         Accuracy  Precision  Recall  F-Measure  AUC-ROC  % Cor. Class. S-Cl.  % Cor. Class. NS-Cl.
Apache Xerces    Basic         94        93         94      93         95       72                   92
Apache Xerces    Basic + Int.  95        96         95      95         95       97                   92
Apache Xalan     Basic         99        99         100     99         99       94                   100
Apache Xalan     Basic + Int.  100       100        100     100        100      99                   100
Apache Velocity  Basic         67        18         28      22         89       56                   32
Apache Velocity  Basic + Int.  92        31         46      37         90       100                  35
Apache Tomcat    Basic         56        9          16      12         81       34                   17
Apache Tomcat    Basic + Int.  79        20         31      24         94       83                   20
Apache Lucene    Basic         78        75         76      75         81       50                   78
Apache Lucene    Basic + Int.  80        80         80      80         83       82                   77
Apache Log4j     Basic         94        99         96      97         89       55                   93
Apache Log4j     Basic + Int.  95        100        98      99         93       74                   93

B. Analysis of the Results

In the following we discuss the achieved results, aiming at providing an answer to our research questions.

To what extent does the intensity index contribute to the prediction of bug-prone code components? Table V reports, for each considered software project, the results achieved when considering (i) the baseline prediction model built using the structural metrics in [32] and (ii) the model built by adding to the baseline model the intensity index of smelly classes. In addition, the column % Cor. Class. S-Cl. of Table V reports the percentage of smelly classes correctly classified (with respect to bugginess) by each of the analyzed models, while % Cor. Class. NS-Cl. reports the percentage of correctly classified non-smelly instances.

Looking at Table V, the first thing that leaps to the eye is that, overall, the basic prediction model is able to achieve high accuracy. For instance, on Apache Xalan the basic model obtains 100% recall and 99% precision and accuracy. From a practical point of view, this result means that the model misclassifies only a small subset of instances. On the other hand, taking into account the intensity index of the smelly classes results in 100% for all the considered metrics, correctly classifying the instances missed by the basic model. It is worth noting that obtaining an increment of the performance in situations where the Basic model works well is quite hard. Still, in these situations the intensity index is able to contribute by refining the predictions of the Basic model, increasing the model's accuracy. Analyzing the percentage of smelly and non-smelly classes correctly classified by the specialized bug prediction model, we can understand that the increment of the performance is due to a better classification of the instances composing the set of classes having design flaws (+5%), while the non-smelly classes are treated generally in the same way by both models. An interesting example is represented by the class ElemTextLiteral contained in the org.apache.xalan.templates package. This class contains a Brain Method code smell having an intensity index of 5.5. The basic model classifies this class as buggy, since its structural metrics are considered by the Basic model as indicators of the presence of a bug. Conversely, the low level of intensity allows the intensity-including model to correctly mark this class as non-buggy. On the other hand, an example of a code component correctly classified as buggy thanks to the use of the intensity index computed for the Message Chains smell is the XSLTProcessorApplet class from the org.apache.xalan.client package. In this case, the Basic model misclassifies this class as non-buggy, while the specialized model correctly classifies it as buggy. It is important to note that the ElemTextLiteral and XSLTProcessorApplet classes have similar metrics (as shown in Table VI), and the only predictor able to distinguish them is the intensity index.

Looking at the other software systems analyzed, we can observe for Apache Xerces and Apache Log4j a behavior of the intensity index similar to the one achieved on Apache Xalan. In these cases, the performances of the Basic model are always slightly improved by the addition of the intensity index. Also, the intensity index enables the correct classification of smelly instances, while the predictions made on classes not affected by design problems remain exactly the same.

A different analysis can be done for the other systems. Indeed, when the performances of the Basic model are quite low, the intensity index is able to give a strong contribution to the classification of buggy components. For instance, the accuracy of the intensity-including model on Apache Velocity is 25% higher than the one achieved by the basic one. Here we can note that the proposed bug prediction model is not only able to correctly classify all the instances affected by smells, but also gives a slight contribution (+3%) to the classification of the non-smelly instances. This result clearly highlights the importance of considering the design quality characteristics of a code component when predicting bugs.

Summary for RQ1. The addition of the intensity index as predictor of buggy components generally increases the performance of the baseline bug prediction model over all the analyzed projects. We observed cases in which the prediction accuracy increases up to 25% with respect to the performance achieved without considering the intensity metric.

What is the gain provided by the intensity index to the bug prediction model when compared to the other predictors? Table VII shows the results achieved when applying the Gain Ratio Feature Evaluation algorithm [65] on the set of predictors composing the intensity-including bug prediction model. Specifically, for each software system, we report the ranking of the predictors based on their importance for the model, together with a value representing the expected reduction in entropy caused by partitioning the prediction model according to a given predictor (i.e., column Gain). The results show that the Coupling Between Objects (CBO) metric is highly
TABLE VI
COMPARISON BETWEEN ElemTextLiteral AND XSLTProcessorApplet (APACHE XALAN) IN TERMS OF STRUCTURAL METRICS

org.apache.xalan.templates.ElemTextLiteral:
WMC 11, DIT 3, NOC 0, CBO 9, RFC 22, LCOM 29, CA 6, CE 4, NPM 11, LCOM3 0.88, LOC 127, DAM 0.8, MOA 0, MFA 0.95, CAM 0.34, IC 2, CBM 5, AMC 10, MAX(CC) 2, AVG(CC) 1, Intensity 5.5

org.apache.xalan.client.XSLTProcessorApplet:
WMC 12, DIT 5, NOC 0, CBO 7, RFC 23, LCOM 31, CA 5, CE 2, NPM 11, LCOM3 0.89, LOC 125, DAM 0.9, MOA 1, MFA 0.94, CAM 0.28, IC 3, CBM 5, AMC 11, MAX(CC) 2, AVG(CC) 1, Intensity 9.1
analyzed in this paper. Table VIII reports precision, recall, and F-measure values obtained by considering the instances of all the projects as a single dataset (i.e., overall). A detailed analysis of the performance of the detector for each project can be found in our online appendix [50]. We can observe that the performance of the detector ranges between 62% and 94% in terms of F-measure. Despite the quite high precision (i.e., 81% overall), the tool still identifies 49 false positives, which we discarded to make the set of code smells as close as possible to the golden set. As removing false positives is not always feasible, in Section V we evaluate the effect of including false positives in the construction of the bug prediction model. On the other hand, the tool achieves an overall recall of 87%. In the empirical study, we were not able to include false negative smell instances, because the tool assigns an intensity index equal to zero to such instances. Even if this could have influenced our results, it is worth noting that only 32 of such instances (out of the total 238) are missing from the analysis. Future work will be devoted to improving the detection performance of the tool by including rules used by other smell detectors (e.g., DECOR [4]). Another threat to construct validity regards the annotated set of bugs and code smells used in the empirical study. As for bugs, we rely on the publicly available oracles in the PROMISE repository [51], which have been widely used in previous research [32], [66]–[69]. For code smells, we rely on the oracles publicly available in [52], previously used in [5], [6], [8], [9], [70]. However, we cannot exclude that the oracles we used miss some bugs or smells, or else include some false positives.

Threats to conclusion validity concern the relation between the treatment and the outcome. The metrics used to evaluate the bug prediction models, i.e., accuracy, precision, recall, F-measure, and AUC-ROC, are widely used in the evaluation of the performances of bug prediction techniques [42]. Moreover, we analyzed to what extent the intensity index is important with respect to the other metrics by analyzing the gain provided by the addition of the severity measure in the model.

Finally, threats to external validity concern the generalization of results. We analyzed six different software projects from different application domains and with different characteristics (size, number of classes, etc.). However, our future agenda includes the analysis of other systems aimed at corroborating our findings. Another threat in this category regards the choice of the baseline model. We selected the model by Jureczko et al. [32] since it makes it more interesting and challenging to evaluate the predictive power of the intensity index when it is added to a model characterized by other structural metrics, including the ones used for the computation of the intensity index. Moreover, the selected model contains a comprehensive set of quality metrics, which allowed a more detailed analysis of the gain provided by the intensity index in the context of the structural-based bug prediction model. However, as pointed out by Moser et al. [24], predictors based on process metrics can achieve better performances in predicting bugs. To deal with this threat, in Section V we discuss the results achieved when considering the intensity index as an additional predictor of models including process metrics.

V. DISCUSSION AND FURTHER ANALYSIS

The results of the empirical study reveal the usefulness of considering the intensity of code smells as an additional predictor in order to classify instances affected by design problems. From a practical perspective, results indicate that smells having low severity are less prone to be affected by a bug with respect to smells with high severity. In this sense, the use of an indicator of intensity is beneficial to correctly discriminate the bug-proneness of smelly classes. Moreover, the results also reveal that the structural metrics (including the ones used for detecting smells) are not effective when applied to predict the bugginess of classes affected by design flaws. Even if this can appear as a quite surprising result, we observe that when evaluating the bug-proneness of smelly classes, the prediction model is not able to correctly deal with the whole set of software metrics. In other words, several quality indicators considered in isolation work worse than a single aggregative metric reporting the degree of severity of the design flaw affecting a class.

In the context of our work, we exploited the intensity index, rather than using a simple truth value providing information about the presence of a design problem. Indeed, the latter solution could also provide complementary information with respect to the structural metrics, leading to improvements similar to the ones achieved by considering the intensity index. This issue is analyzed in Section V-A. Another discussion point is related to the threats to validity pointed out in Section IV-C. On the one hand, in the empirical study we discarded the false positive instances given by the smell detection tool to consider in the prediction model a set of code smells as close as possible to the golden set. However, removing such instances may not be practically applicable for several reasons (e.g., the effort needed to validate smells). In order to evaluate the impact that false positives have on the performance of the prediction model, in Section V-B we evaluate the performance of the model obtained without removing the false positive code smell instances detected by the tool. Finally, a threat to the generalizability of the results regards the choice of the baseline model. To evaluate the contribution of the intensity index in different contexts, Section V-C reports the results achieved when adding the intensity of code smells to a prediction model based on process metrics, as well as the analysis of the contribution of the intensity index in a model composed of both structural and process metrics. Note that for all the additional analyses we follow the same experimental design described in Section IV.

A. Comparing the presence/absence of smells rather than the intensity index

We have added to the baseline bug prediction model defined in [32] the boolean information about the presence of code smells in a class (note that we used the golden set of code smells in this case, to avoid bias deriving from the use of a particular tool). The results are reported in Table IX. However,
TABLE IX
ACCURACY METRICS FOR THE MODEL BUILT BY ADDING TO THE BASELINE MODEL A TRUTH VALUE INDICATING THE PRESENCE OF A SMELL

Project          Model          Accuracy  Precision  Recall  F-Measure  AUC-ROC  % Cor. Class. S-Cl.  % Cor. Class. NS-Cl.
Apache Xerces    Basic + Truth  94        93         94      93         95       72                   92
Apache Xalan     Basic + Truth  99        99         100     99         99       94                   100
Apache Velocity  Basic + Truth  67        18         28      22         89       56                   32
Apache Tomcat    Basic + Truth  56        9          16      12         81       34                   17
Apache Lucene    Basic + Truth  78        75         76      75         81       50                   78
Apache Log4j     Basic + Truth  94        99         96      97         89       55                   93

TABLE X
ACCURACY METRICS FOR THE MODEL WHERE FALSE POSITIVE SMELLS ARE NOT FILTERED

Project          Model         Accuracy  Precision  Recall  F-Measure  AUC-ROC  % Cor. Class. S-Cl.  % Cor. Class. NS-Cl.
Apache Xerces    Basic + Int.  94        94         94      91         95       96                   92
Apache Xalan     Basic + Int.  100       100        100     100        100      99                   100
Apache Velocity  Basic + Int.  90        28         43      35         89       97                   35
Apache Tomcat    Basic + Int.  77        18         29      24         93       80                   20
Apache Lucene    Basic + Int.  79        77         78      77         83       80                   77
Apache Log4j     Basic + Int.  95        99         97      95         93       71                   92
TABLE XI
ACCURACY METRICS FOR THE MODELS BASED ON PROCESS METRICS AND A COMBINATION OF STRUCTURAL AND PROCESS METRICS

discriminating between buggy and non-buggy classes.

Finally, we also evaluated the contribution of the intensity index in a bug prediction model composed of both structural and process metrics. Looking at the results reported in Table XI (i.e., see the Combined model), we can observe that the addition of the process metric does not have the same impact as the addition of the intensity index to the baseline structural model. Indeed, the Combined model never outperforms the performance achieved by the structural model which considers the intensity as an additional feature. Thus, we can conclude that the addition of the intensity index is actually needed also in this case to achieve higher performances. Moreover, when adding the intensity index to the Combined model, we observe that the contribution of the intensity index is still valuable. For example, let us consider the cases of Apache Velocity and Apache Tomcat. In the first project, the performance of the Combined model is not better than that of the prediction model purely based on structural code metrics. However, when adding the intensity to the mixed set of metrics characterizing the Combined model, the performance is not only better than the baseline structural model (+27% of accuracy), but also better than all the other structural and process based prediction models that include the intensity index (i.e., the Combined + Int. model has higher performance with respect to the Basic + Int. structural models). In the second case, the Combined model is 7% more accurate than the baseline structural model. This indicates that the entropy of changes actually complements the structural metrics in the prediction of buggy components. However, also in this case the addition of the intensity index allows the prediction model to obtain a considerably higher prediction accuracy (+19%). This results in higher values for all the other evaluation metrics: indeed, the precision increases by 12%, the recall by 17%, and the AUC-ROC by 3%. A similar discussion can be made for the other software systems analyzed, where the intensity index actually contributes to the improvement of the performance of the Combined bug prediction model. Finally, as expected, we can observe that the buggy and smelly classes are mainly correctly classified by the model including the intensity index.

VI. CONCLUSION AND FUTURE WORK

In this paper, we evaluated to what extent the addition of the intensity index (i.e., a metric that quantifies the severity of code smells) to an existing structural metrics-based bug prediction model is useful in order to increase the performance of the baseline model. We also quantified the actual gain provided by the intensity index with respect to the other metrics composing the model, including the ones used to compute the code smell intensity. Moreover, we report additional analyses aimed at showing (i) the accuracy of a model where a simple truth value reporting the presence of code smells, rather than the intensity index, is added to the baseline model, (ii) the impact of false positive smell instances identified by a code smell detector, and (iii) the contribution of the intensity index in bug prediction models based on process metrics.

According to our experiments, the intensity always positively contributes to state-of-the-art prediction models, even when they already have high performances. In particular, the intensity index helps discriminate bug-prone code elements affected by code smells in bug prediction models based on product metrics, process metrics, and a combination of the two. Our initial results suggest that the intensity of code smells is helpful in all of these cases, and cannot be substituted by a simple indicator of the presence or absence of a code smell. More importantly, the presence of a limited number of false positive smell instances identified by the code smell detector does not impact the accuracy and the practical applicability of the proposed specialized bug prediction model. The achieved findings highlight, on the one hand, the value of code smell detection in the context of bug prediction and, on the other hand, the importance of considering the intensity of such design problems as an additional indicator in bug prediction models.

As future work, we plan to extend the number of systems analyzed with this method in order to corroborate the results achieved in this paper, and to evaluate the contribution of the intensity index in other existing bug prediction models. Moreover, we plan to compare our proposed model with the one proposed by Taba et al. [44].
REFERENCES

[1] M. Fowler, Refactoring: Improving the Design of Existing Code. Addison-Wesley, 1999.
[2] F. A. Fontana, M. Zanoni, A. Marino, and M. V. Mantyla, "Code smell detection: Towards a machine learning-based approach," in Software Maintenance (ICSM), 2013 29th IEEE International Conference on, Sept 2013, pp. 396–399.
[3] G. Bavota, R. Oliveto, M. Gethers, D. Poshyvanyk, and A. De Lucia, "Methodbook: Recommending move method refactorings via relational topic models," IEEE Transactions on Software Engineering, vol. 40, no. 7, pp. 671–694, July 2014.
[4] N. Moha, Y.-G. Guéhéneuc, L. Duchien, and A.-F. Le Meur, "Decor: A method for the specification and detection of code and design smells," IEEE Transactions on Software Engineering, vol. 36, no. 1, pp. 20–36, 2010.
[5] F. Palomba, A. Panichella, A. Zaidman, R. Oliveto, and A. De Lucia, "A textual-based technique for smell detection," in Proceedings of the 24th International Conference on Program Comprehension (ICPC 2016). Austin, USA: IEEE, 2016, to appear.
[6] F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, D. Poshyvanyk, and A. De Lucia, "Mining version histories for detecting code smells," IEEE Transactions on Software Engineering, vol. 41, no. 5, pp. 462–489, May 2015.
[7] N. Tsantalis and A. Chatzigeorgiou, "Identification of move method refactoring opportunities," IEEE Transactions on Software Engineering, vol. 35, no. 3, pp. 347–367, 2009.
[8] M. Tufano, F. Palomba, G. Bavota, R. Oliveto, M. Di Penta, A. De Lucia, and D. Poshyvanyk, "When and why your code starts to smell bad," in Proceedings of the International Conference on Software Engineering (ICSE) - Volume 1. IEEE, 2015, pp. 403–414.
[9] F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, and A. De Lucia, "Do they really smell bad? A study on developers' perception of bad code smells," in Proceedings of the International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2014, pp. 101–110.
[10] A. F. Yamashita and L. Moonen, "Do developers care about code smells? An exploratory survey," in Proceedings of the Working Conference on Reverse Engineering (WCRE). IEEE, 2013, pp. 242–251.
[11] R. Arcoverde, A. Garcia, and E. Figueiredo, "Understanding the longevity of code smells: preliminary results of an explanatory survey," in Proceedings of the International Workshop on Refactoring Tools. ACM, 2011, pp. 33–36.
[12] A. Chatzigeorgiou and A. Manakos, "Investigating the evolution of bad smells in object-oriented code," in Proceedings of the International Conference on the Quality of Information and Communications Technology (QUATIC). IEEE, 2010, pp. 106–115.
[13] A. Lozano, M. Wermelinger, and B. Nuseibeh, "Assessing the impact of bad smells using historical information," in Proceedings of the International Workshop on Principles of Software Evolution (IWPSE). ACM, 2007, pp. 31–34.
[14] D. Ratiu, S. Ducasse, T. Gîrba, and R. Marinescu, "Using history information to improve design flaws detection," in Proceedings of the European Conference on Software Maintenance and Reengineering (CSMR). IEEE, 2004, pp. 223–232.
[15] M. Abbes, F. Khomh, Y.-G. Guéhéneuc, and G. Antoniol, "An empirical study of the impact of two antipatterns, Blob and Spaghetti Code, on program comprehension," in 15th European Conference on Software Maintenance and Reengineering, CSMR 2011, 1-4 March 2011, Oldenburg, Germany. IEEE Computer Society, 2011, pp. 181–190.
[16] D. I. K. Sjøberg, A. F. Yamashita, B. C. D. Anda, A. Mockus, and T. Dybå, "Quantifying the effect of code smells on maintenance effort," IEEE Transactions on Software Engineering, vol. 39, no. 8, pp. 1144–1156, 2013.
[17] A. F. Yamashita and L. Moonen, "Do code smells reflect important maintainability aspects?" in Proceedings of the International Conference on Software Maintenance (ICSM). IEEE, 2012, pp. 306–315.
[18] A. Yamashita and L. Moonen, "Exploring the impact of inter-smell relations on software maintainability: An empirical study," in Proceedings of the International Conference on Software Engineering (ICSE). IEEE, 2013, pp. 682–691.
[19] F. Khomh, M. Di Penta, and Y.-G. Guéhéneuc, "An exploratory study of the impact of code smells on software change-proneness," in Proceedings of the Working Conference on Reverse Engineering (WCRE). IEEE, 2009, pp. 75–84.
[20] F. Khomh, M. Di Penta, Y.-G. Guéhéneuc, and G. Antoniol, "An exploratory study of the impact of antipatterns on class change- and fault-proneness," Empirical Software Engineering, vol. 17, no. 3, pp. 243–275, 2012.
[21] V. Basili, L. Briand, and W. Melo, "A validation of object-oriented design metrics as quality indicators," IEEE Transactions on Software Engineering, vol. 22, no. 10, pp. 751–761, Oct 1996.
[22] R. Subramanyam and M. S. Krishnan, "Empirical analysis of CK metrics for object-oriented design complexity: Implications for software defects," IEEE Transactions on Software Engineering, vol. 29, no. 4, pp. 297–310, 2003.
[23] T. Gyimóthy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[24] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in International Conference on Software Engineering (ICSE), ser. ICSE '08, 2008, pp. 181–190.
[25] R. Moser, W. Pedrycz, and G. Succi, "Analysis of the reliability of a subset of change metrics for defect prediction," in Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ser. ESEM '08. New York, NY, USA: ACM, 2008, pp. 309–311. Available: http://doi.acm.org/10.1145/1414004.1414063
[26] R. M. Bell, T. J. Ostrand, and E. J. Weyuker, "Does measuring code change improve fault prediction?" in Proceedings of the 7th International Conference on Predictive Models in Software Engineering, ser. Promise '11. New York, NY, USA: ACM, 2011, pp. 2:1–2:8. Available: http://doi.acm.org/10.1145/2020390.2020392
[27] A. E. Hassan, "Predicting faults using the complexity of code changes," in Software Engineering, 2009. ICSE 2009. IEEE 31st International Conference on, May 2009, pp. 78–88.
[28] R. Bell, T. Ostrand, and E. Weyuker, "The limited impact of individual developer data on software defect prediction," Empirical Software Engineering, vol. 18, no. 3, pp. 478–505, 2013. Available: http://dx.doi.org/10.1007/s10664-011-9178-4
[29] T. J. Ostrand, E. J. Weyuker, and R. M. Bell, "Programmer-based fault prediction," in Proceedings of the 6th International Conference on Predictive Models in Software Engineering, ser. PROMISE '10. New York, NY, USA: ACM, 2010, pp. 19:1–19:10. Available: http://doi.acm.org/10.1145/1868328.1868357
[30] D. Di Nucci, F. Palomba, S. Siravo, G. Bavota, R. Oliveto, and A. De Lucia, "On the role of developer's scattered changes in bug prediction," in Software Maintenance and Evolution (ICSME), 2015 IEEE International Conference on, Sept 2015, pp. 241–250.
[31] F. Arcelli Fontana, V. Ferme, M. Zanoni, and R. Roveda, "Towards a prioritization of code debt: A code smell intensity index," in Proceedings of the Seventh International Workshop on Managing Technical Debt (MTD 2015). Bremen, Germany: IEEE, Oct. 2015, pp. 16–24, in conjunction with ICSME 2015.
[32] M. Jureczko and L. Madeyski, "Towards identifying software project clusters with regard to defect prediction," in Proceedings of the 6th International Conference on Predictive Models in Software Engineering, ser. PROMISE '10. New York, NY, USA: ACM, 2010, pp. 9:1–9:10. Available: http://doi.acm.org/10.1145/1868328.1868342
[33] S. Chidamber and C. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476–493, Jun 1994.
[34] K. El Emam, W. Melo, and J. C. Machado, "The prediction of faulty classes using object-oriented design metrics," Journal of Systems and Software, vol. 56, no. 1, pp. 63–75, 2001.
[35] N. Ohlsson and H. Alberg, "Predicting fault-prone software modules in telephone switches," IEEE Transactions on Software Engineering, vol. 22, no. 12, pp. 886–894, 1996.
[36] N. Nagappan and T. Ball, "Static analysis tools as early indicators of pre-release defect density," in Proceedings of the 27th International Conference on Software Engineering, ser. ICSE '05. New York, NY, USA: ACM, 2005, pp. 580–586. Available: http://doi.acm.org/10.1145/1062455.1062558
[37] N. Nagappan, T. Ball, and A. Zeller, "Mining metrics to predict component failures," in Proceedings of the 28th International Conference on Software Engineering, ser. ICSE '06. New York, NY, USA: ACM, 2006, pp. 452–461. Available: http://doi.acm.org/10.1145/1134285.1134349
[38] T. Zimmermann, R. Premraj, and A. Zeller, "Predicting defects for eclipse," in Proceedings of the Third International Workshop on Predictor Models in Software Engineering, ser. PROMISE '07. Washington, DC, USA: IEEE Computer Society, 2007. Available: http://dx.doi.org/10.1109/PROMISE.2007.10
[39] A. P. Nikora and J. C. Munson, "Developing fault predictors for evolving software systems," in Proceedings of the 9th IEEE International Symposium on Software Metrics. IEEE CS Press, 2003, pp. 338–349.
[40] T. M. Khoshgoftaar, E. B. Allen, N. Goel, A. Nandi, and J. McMullan, "Detection of software modules with high debug code churn in a very large legacy system," in Software Reliability Engineering. IEEE, 1996, pp. 364–371.
[41] T. L. Graves, A. F. Karr, J. S. Marron, and H. P. Siy, "Predicting fault incidence using software change history," IEEE Transactions on Software Engineering, vol. 26, no. 7, pp. 653–661, 2000.
[42] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4, pp. 531–577, 2012.
[43] A. E. Hassan and R. C. Holt, "Studying the chaos of code development," in Proceedings of the 10th Working Conference on Reverse Engineering, 2003.
[44] S. E. S. Taba, F. Khomh, Y. Zou, A. E. Hassan, and M. Nagappan, "Predicting bugs using antipatterns," in Proceedings of the 2013 IEEE International Conference on Software Maintenance, ser. ICSM '13. Washington, DC, USA: IEEE Computer Society, 2013, pp. 270–279. Available: http://dx.doi.org/10.1109/ICSM.2013.38
[45] F. Arcelli Fontana, V. Ferme, and M. Zanoni, "Poster: Filtering code smells detection results," in Proceedings of the 37th International Conference on Software Engineering (ICSE 2015), vol. 2. Florence, Italy: IEEE, May 2015, pp. 803–804.
[46] M. Lanza and R. Marinescu, Object-Oriented Metrics in Practice: Using Software Metrics to Characterize, Evaluate, and Improve the Design of Object-Oriented Systems. Springer, 2006.
[47] F. Palomba, A. De Lucia, G. Bavota, and R. Oliveto, "Anti-pattern detection: Methods, challenges, and open issues," Advances in Computers, vol. 95, pp. 201–238, 2015.
[48] F. Arcelli Fontana, V. Ferme, M. Zanoni, and A. Yamashita, "Automatic metric thresholds derivation for code smell detection," in Proceedings of the 6th International Workshop on Emerging Trends in Software Metrics (WETSoM 2015). Florence, Italy: IEEE, May 2015, pp. 44–53, co-located with ICSE 2015.
[49] E. Tempero, C. Anslow, J. Dietrich, T. Han, J. Li, M. Lumpe, H. Melton, and J. Noble, "The qualitas corpus: A curated collection of java code for empirical studies," in Proc. 17th Asia Pacific Software Eng. Conf. Sydney, Australia: IEEE, December 2010, pp. 336–345.
[50] F. Palomba, M. Zanoni, F. A. Fontana, A. De Lucia, and R. Oliveto, "Smells like Teen Spirit: Improving Bug Prediction Performance Using the Intensity of Code Smells," Tech. Rep., April 2016. Available: http://tinyurl.com/hgorj4z
[51] T. Menzies, B. Caglayan, Z. He, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. (2012, June) The PROMISE repository of empirical software engineering data. Available: http://promisedata.googlecode.com
[52] F. Palomba, D. Di Nucci, M. Tufano, G. Bavota, R. Oliveto, D. Poshyvanyk, and A. De Lucia, "Landfill: An open dataset of code smells with public evaluation," in Proceedings of the Working Conference on Mining Software Repositories (MSR). IEEE, 2015, pp. 482–485.
[53] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, 1961.
[54] Y. Freund and L. Mason, "The alternating decision tree learning algorithm," in Proceedings of the Sixteenth International Conference on Machine Learning, 1999, pp. 124–133.
[55] G. H. John and P. Langley, "Estimating continuous distributions in bayesian classifiers," in Eleventh Conference on Uncertainty in Artificial Intelligence. San Mateo: Morgan Kaufmann, 1995, pp. 338–345.
[56] S. le Cessie and J. van Houwelingen, "Ridge estimators in logistic regression," Applied Statistics, vol. 41, no. 1, pp. 191–201, 1992.
[57] R. Kohavi, "The power of decision tables," in 8th European Conference on Machine Learning. Springer, 1995, pp. 174–189.
[58] C.-Y. J. Peng, K. L. Lee, and G. M. Ingersoll, "An introduction to logistic regression analysis and reporting," The Journal of Educational Research, vol. 96, no. 1, pp. 3–14, 2002.
[59] B. Ghotra, S. McIntosh, and A. E. Hassan, "Revisiting the impact of classification techniques on the performance of defect prediction models," in Software Engineering (ICSE), 2015 IEEE/ACM 37th IEEE International Conference on, vol. 1, May 2015, pp. 789–800.
[60] P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach, 1982.
[61] D. Engler, D. Y. Chen, S. Hallem, A. Chou, and B. Chelf, "Bugs as deviant behavior: A general approach to inferring errors in systems code," SIGOPS Oper. Syst. Rev., vol. 35, no. 5, pp. 57–72, Oct. 2001. Available: http://doi.acm.org/10.1145/502059.502041
[62] G. Antoniol, K. Ayari, M. Di Penta, F. Khomh, and Y.-G. Guéhéneuc, "Is it a bug or an enhancement?: a text-based approach to classify change requests," in Proceedings of the 2008 conference of the Centre for Advanced Studies on Collaborative Research, October 27-30, 2008, Richmond Hill, Ontario, Canada. IBM, 2008, p. 23.
[63] S. Kim, E. J. Whitehead Jr., and Y. Zhang, "Classifying software changes: Clean or buggy?" IEEE Transactions on Software Engineering, vol. 34, no. 2, pp. 181–196, 2008.
[64] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison-Wesley, 1999.
[65] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, Mar. 1986. Available: http://dx.doi.org/10.1023/A:1022643204877
[66] H. Lu, E. Kocaguneli, and B. Cukic, "Defect prediction between software versions with active learning and dimensionality reduction," in Software Reliability Engineering (ISSRE), 2014 IEEE 25th International Symposium on, Nov 2014, pp. 312–322.
[67] S. Kim, H. Zhang, R. Wu, and L. Gong, "Dealing with noise in defect prediction," in Software Engineering (ICSE), 2011 33rd International Conference on, May 2011, pp. 481–490.
[68] T. Menzies and J. Di Stefano, "How good is your blind spot sampling policy," in High Assurance Systems Engineering, 2004. Proceedings. Eighth IEEE International Symposium on, March 2004, pp. 129–138.
[69] M. Shepperd, Q. Song, Z. Sun, and C. Mair, "Data quality: Some comments on the NASA software defect datasets," IEEE Transactions on Software Engineering, vol. 39, no. 9, pp. 1208–1215, Sept 2013.
[70] F. Palomba, "Textual analysis for code smell detection," in Proceedings of the International Conference on Software Engineering (ICSE) - Volume 2. IEEE, 2015, pp. 769–771.