
2016 IEEE International Conference on Software Maintenance and Evolution

Smells like Teen Spirit: Improving Bug Prediction Performance using the Intensity of Code Smells

Fabio Palomba∗, Marco Zanoni†, Francesca Arcelli Fontana†, Andrea De Lucia∗, Rocco Oliveto‡
∗University of Salerno, Italy, †University of Milano-Bicocca, Italy, ‡University of Molise, Italy
fpalomba@unisa.it, marco.zanoni@disco.unimib.it, arcelli@disco.unimib.it, adelucia@unisa.it, rocco.oliveto@unimol.it

Abstract—Code smells are symptoms of poor design and implementation choices. Previous studies empirically assessed the impact of smells on code quality and clearly indicate their negative impact on maintainability, including a higher bug-proneness of components affected by code smells. In this paper we capture previous findings on bug-proneness to build a specialized bug prediction model for smelly classes. Specifically, we evaluate the contribution of a measure of the severity of code smells (i.e., code smell intensity) by adding it to existing bug prediction models and comparing the results of the new model against the baseline model. Results indicate that the accuracy of a bug prediction model increases by adding the code smell intensity as a predictor. We also evaluate the actual gain provided by the intensity index with respect to the other metrics in the model, including the ones used to compute the code smell intensity. We observe that the intensity index is much more important than the other metrics used for predicting the bugginess of smelly classes.

I. INTRODUCTION

In the last decade, the research community has spent a lot of effort investigating bad code smells (shortly "code smells" or simply "smells"), i.e., symptoms of poor design and implementation choices applied by programmers during the development of a software project [1]. Besides approaches for the automatic identification of code smells in source code [2]–[7], empirical studies have been conducted to understand when and why code smells appear [8], the relevance they have for developers [9], [10], their evolution and longevity in software projects [11]–[14], as well as the negative effects of code smells on software understandability [15] and maintainability [16]–[19]. Recently, Khomh et al. [20] have also empirically demonstrated that classes affected by design problems ("antipatterns") are more prone to contain bugs in the future. Although this study showed the potential importance of code smells in the context of bug prediction, these observations have not been captured in bug prediction models yet. Indeed, while previous work has proposed the use of predictors based on product metrics (e.g., see [21]–[23]), as well as the analysis of change-proneness [24]–[26], the entropy of changes [27], or human-related factors [28]–[30] to build accurate bug prediction models, none of them takes into account a measure able to quantify the presence and the severity of design problems affecting code components.

In this paper, we aim at making a further step ahead by studying the role played by bad code smells in bug prediction. Our hypothesis is that taking into account the severity of a design problem affecting a source code element in a bug prediction model can contribute to the correct classification of the bugginess of such a component. To verify this conjecture, we use the intensity index (i.e., a metric able to estimate the severity of a code smell) defined by Arcelli Fontana et al. [31] to build a bug prediction model that takes into account the presence and the severity of design problems affecting a code component. Specifically, we evaluate the predictive power of the intensity index by adding it to a bug prediction model based on structural quality metrics [32], and comparing its accuracy against the one achieved by the baseline model on six large Java open source systems. We also quantified the gain provided by the addition of the intensity index with respect to the other structural metrics in the model, including the ones used to compute the intensity. Finally, we report further analyses aimed at understanding (i) the accuracy of a model where a simple truth value reporting the presence/absence of code smells, rather than the intensity index, is added to the baseline model, (ii) the impact of false positive smell instances identified by the code smell detector, and (iii) the contribution of the intensity index in bug prediction models based on process metrics.

The results of our study indicate that:
• The addition of the intensity index as a predictor of buggy components positively impacts the accuracy of a bug prediction model based on structural quality metrics. We observed an improvement of the classification accuracy of up to 25% as compared to the accuracy achieved by the baseline model.
• The intensity index is more important than other quality metrics for the prediction of the bug-proneness of smelly classes.
• The presence of a limited number of false positive smell instances identified by the code smell detector does not impact the accuracy and the practical applicability of the proposed specialized bug prediction model.
• The intensity index positively impacts the performance of bug prediction models based on process metrics, increasing the classification accuracy by up to 47%.

Structure of the paper. Section II discusses the related literature on bug prediction models, while Section III presents the specialized bug prediction model for smelly classes. Section IV describes the design and the results of the case study aimed at evaluating the accuracy of the proposed model. Section V discusses the results of the additional analyses we conducted.

Finally, Section VI concludes the paper and outlines directions for future work.

II. RELATED WORK

The research community has spent a lot of effort on the definition of techniques aimed at predicting bug-prone code components, mainly proposing the use of product metrics and process metrics as indicators of the bug-proneness of a code component.

A. Bug Prediction using Structural-based Predictors

Basili et al. [21] proposed the use of the Object-Oriented metric suite (i.e., the CK metrics) [33] as indicators of the presence of buggy components. They demonstrated that 5 of them are actually useful in the context of bug prediction. El Emam et al. [34] and Subramanyam et al. [22] corroborated the results previously observed in [21]. On the same line, Gyimothy et al. [23] reported a more detailed analysis of the relationships between code metrics and the bug-proneness of code components. Their findings highlight that the Coupling Between Objects metric [33] is the best metric among the CK ones in predicting defects. Ohlsson et al. [35] conducted an empirical study aimed at evaluating to what extent code metrics are able to identify bug-prone modules. Their model was evaluated on a system developed at Ericsson, and the results indicate the ability of code metrics in detecting buggy modules. Nagappan and Ball [36] exploited the use of static code analysis tools to predict the bug density of Windows Server, showing that it is possible to perform a coarse-grained classification between high and low quality components with an accuracy of 83%. Nagappan et al. [37] also investigated the use of metrics in the prediction of buggy components across 5 Microsoft projects. Their main finding highlights that while it is possible to successfully exploit complexity metrics in bug prediction, there is no single metric that could act as a universally best bug predictor (i.e., the best predictor is project-dependent). Complexity metrics in the context of bug prediction are also the focus of the work by Zimmermann et al. [38], which reports a positive correlation between code complexity and bugs. Finally, Nikora et al. [39] showed that measurements of a system's structural evolution (e.g., number of executable statements) can serve as predictors of the number of bugs inserted into a system during its development.

B. Bug Prediction using Process-based Predictors

Khoshgoftaar et al. [40] assessed the role played by debug churn (i.e., the number of lines of code changed to fix bugs) in the identification of bug-prone modules, while Graves et al. [41] experimented with both product and process metrics for bug prediction. Their findings contradict in part what was observed by other authors, showing that product metrics are poor predictors of bugs. D'Ambros et al. [42] performed an extensive comparison of bug prediction approaches relying on process and product metrics, showing that there is no technique that works better in all contexts. Hassan and Holt [43] introduced the concept of entropy of changes as a measure of the complexity of the development process. Moser et al. [24] performed a comparative study of the predictive power of product and process metrics. Their study, performed on Eclipse, highlights the superiority of process metrics in predicting buggy code components. Moser et al. [25] also performed a deeper study on the bug prediction accuracy of process metrics, reporting that the past number of bug fixes performed on a file (i.e., bug-proneness) and the number of changes involving a file in a given period (i.e., change-proneness) are the best predictors of buggy components. Bell et al. [26] confirm that change-proneness is the best predictor. Hassan [27] exploited the entropy of changes to build two bug prediction models which mainly differ in the choice of the temporal interval in which the bug-proneness of components is studied. The results of a reported case study indicate that the proposed techniques have higher prediction accuracy than models purely based on the changes of code components. All of the predictors above consider neither how many developers apply changes to a component, nor how many components they changed at the same time. Ostrand et al. [28], [29] propose the use of the number of developers who modified a code component in a given time period as a bug-proneness predictor, showing that models based on product and process metrics are only marginally (though positively) improved by also considering developers' information. Di Nucci et al. [30] exploited the structural and semantic scattering of the changes performed by developers for bug prediction. Their findings demonstrate, on the one hand, the superiority of the bug prediction model built using scattering metrics with respect to other state-of-the-art models. Moreover, they also show that the proposed metrics are orthogonal with respect to other predictors. Finally, the paper by Taba et al. [44] reports the use of historical metrics computed on classes affected by design flaws (called antipattern metrics) as an additional source of information for predicting bugs. They found that such metrics can increase the performance of cross-project bug prediction models by 12.5%. This is clearly the closest work to the one presented in this paper. However, we propose the use of the intensity of code smells rather than historical metrics. Moreover, we work in the context of within-project bug prediction. An in-depth comparison between the two types of approaches able to include code smell information in bug prediction models is part of our future agenda.
III. A SPECIALIZED BUG PREDICTION MODEL FOR SMELLY CLASSES

Previous work has proposed the use of structural quality metrics to predict the bug-proneness of code components. The underlying idea behind these prediction models is that the presence of bugs can be predicted by analyzing the quality of the source code. However, none of these models takes into account the presence and the severity of well-known indicators of design flaws, i.e., code smells, affecting the source code. In this paper, we explicitly consider this information. Indeed, we believe that a clearer description and characterization of the severity of the design problems affecting a source code instance can help a machine learner in distinguishing the components having a higher probability of being subject to bugs in the future.

To this aim, once the set of code components affected by code smells has been detected, we build a prediction model that, besides relying on structural metrics, also includes information about the severity of the design problems, computed using the intensity index defined by Arcelli Fontana et al. [31]. Specifically, the index is computed by JCodeOdor, a code smell detector which relies on detection strategies applied to metrics. The tool is able to detect, filter [45], and prioritize [31] instances of six kinds of code smells [1], [46]:

• God Class: a large class implementing different responsibilities;
• Data Class: a class whose only purpose is holding data;
• Brain Method: a large method that implements more than one function;
• Shotgun Surgery: a class where every change triggers many little changes to several other classes;
• Dispersed Coupling: a class having too many relationships with other classes;
• Message Chains: a method containing a long chain of method calls.

The intensity index is an estimation of the severity of a code smell, and its value is defined in the range [1,10]. In particular, given a code smell instance, its intensity is computed by relying on different kinds of information, i.e., (i) the code smell detection strategy, (ii) the metric thresholds used in the detection strategy, (iii) the statistical distribution of the metric values computed on a large dataset, represented as a quantile function, and (iv) the actual values of the metrics used in the detection strategies.

In particular, the detection strategies used are the ones proposed in JCodeOdor, reported in Table I. Detection strategies have often been used in the literature [46], [47] for code smell detection. They rely on the evaluation of a set of metric values against defined thresholds, composed in a logical proposition. A code component is detected as smelly if one of the logical propositions shown in Table I is true, namely if the actual metrics of the code component exceed the threshold values composing the detection strategy. The list of the metrics applied in the detection rules is reported in Table II (see [31] for reference).

TABLE I
CODE SMELL DETECTION STRATEGIES (THE COMPLETE NAMES OF THE METRICS ARE GIVEN IN TABLE II)

Detection strategies, where LABEL(n) means that LABEL has value n for that smell:
God Class: LOCNAMM ≥ HIGH(176) ∧ WMCNAMM ≥ MEAN(22) ∧ NOMNAMM ≥ HIGH(18) ∧ TCC ≤ LOW(0.33) ∧ ATFD ≥ MEAN(6)
Data Class: WMCNAMM ≤ LOW(14) ∧ WOC ≤ LOW(0.33) ∧ NOAM ≥ MEAN(4) ∧ NOPA ≥ MEAN(3)
Brain Method: (LOC ≥ HIGH(33) ∧ CYCLO ≥ HIGH(7) ∧ MAXNESTING ≥ HIGH(6)) ∨ (NOLV ≥ MEAN(6) ∧ ATLD ≥ MEAN(5))
Shotgun Surgery: CC ≥ HIGH(5) ∧ CM ≥ HIGH(6) ∧ FANOUT ≥ LOW(3)
Dispersed Coupling: CINT ≥ HIGH(8) ∧ CDISP ≥ HIGH(0.66)
Message Chains: MaMCL ≥ MEAN(3) ∨ (NMCS ≥ MEAN(3) ∧ MeMCL ≥ LOW(2))

TABLE II
METRICS USED FOR CODE SMELL DETECTION

Short Name: Long Name
ATFD: Access To Foreign Data
*ATLD: Access To Local Data
CC: Changing Classes
CDISP: Coupling Dispersion
CINT: Coupling Intensity
CM: Changing Methods
CYCLO: McCabe Cyclomatic Complexity
FANOUT: Number of Called Classes
LOC: Lines Of Code
*LOCNAMM: Lines of Code Without Accessor or Mutator Methods
*MaMCL: Maximum Message Chain Length
MAXNESTING: Maximum Nesting Level
*MeMCL: Mean Message Chain Length
*NMCS: Number of Message Chain Statements
NOAM: Number Of Accessor Methods
NOLV: Number Of Local Variables
*NOMNAMM: Number of Not Accessor or Mutator Methods
NOPA: Number Of Public Attributes
TCC: Tight Class Cohesion
*WMCNAMM: Weighted Methods Count of Not Accessor or Mutator Methods
WOC: Weight Of Class

The thresholds are represented as labels associated with an actual value, and they are derived from the statistical distribution [48] of the metrics in 74 systems of the Qualitas Corpus [49]. Table III reports all the threshold values associated with each of the detected code smells. Specifically, for each metric used in a detection strategy, JCodeOdor extracts five meaningful values to be used as thresholds: VERY-LOW, LOW, MEAN, HIGH, VERY-HIGH. For metrics representing ratios defined in the range [0,1] (e.g., the Tight Class Cohesion), these values are fixed to 0.25, 0.33, 0.5, 0.66, and 0.75, respectively. For all other metrics, they are associated with percentile values of the metric distribution [48].

TABLE III
DEFAULT THRESHOLDS FOR ALL SMELLS

Smell / Metric: VERY-LOW, LOW, MEAN, HIGH, VERY-HIGH
God Class:
  LOCNAMM: 26, 38, 78, 176, 393
  WMCNAMM: 11, 14, 22, 41, 81
  NOMNAMM: 7, 9, 13, 21, 30
  TCC: 0.25, 0.33, 0.5, 0.66, 0.75
  ATFD: 3, 4, 6, 11, 21
Data Class:
  WMCNAMM: 11, 14, 21, 40, 81
  WOC: 0.25, 0.33, 0.5, 0.66, 0.75
  NOPA: 1, 2, 3, 5, 12
  NOAM: 2, 3, 4, 7, 13
Brain Method:
  LOC: 11, 13, 19, 33, 59
  CYCLO: 3, 4, 5, 7, 13
  MAXNESTING: 3, 4, 5, 6, 7
  NOLV: 4, 5, 6, 8, 12
  ATLD: 3, 4, 5, 6, 11
Shotgun Surgery:
  CC: 2, 3, 4, 5, 10
  CM: 2, 3, 4, 6, 13
  FANOUT: 2, 3, 4, 5, 6
Dispersed Coupling:
  CINT: 3, 4, 5, 8, 12
  CDISP: 0.25, 0.33, 0.5, 0.66, 0.75
Message Chains:
  MaMCL: 2, 3, 3, 4, 7
  MeMCL: 2, 2, 3, 4, 5
  NMCS: 1, 2, 3, 4, 5

If a code component is detected as a code smell, the actual value of a given metric used for the detection will exceed the threshold value, and it will correspond to a percentile value of the metric distribution placed between the threshold and the maximum observed value of the metric in the system under analysis. The placement of the actual metric value in that range represents the "exceeding amount" of a metric with respect to the defined threshold; this value is then normalized to the range [1,10]. The intensity index of the code smell is given by the mean of the exceeding amounts of the metrics used for the detection. The higher the intensity index, the higher the severity of the code smell under analysis. More details on the computation of the intensity index can be found in [31].
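To make the mechanics above concrete, the following minimal sketch (in Python, chosen here only for illustration) evaluates the God Class strategy of Table I and derives a simplified intensity value. It is not JCodeOdor's implementation: the actual index in [31] interpolates on the quantile function of each metric's distribution, whereas this sketch approximates the exceeding amount linearly between the threshold and the most extreme value observed in the analyzed system.

```python
# Minimal sketch (not JCodeOdor's implementation) of a detection strategy
# and a simplified intensity computation. The real index in [31] maps each
# metric onto the quantile function of its distribution; here the
# "exceeding amount" is a linear interpolation between the threshold and
# the most extreme value observed in the analyzed system.

GOD_CLASS_RULE = {  # metric -> (threshold, direction), from Table I
    "LOCNAMM": (176, ">="), "WMCNAMM": (22, ">="),
    "NOMNAMM": (18, ">="), "TCC": (0.33, "<="), "ATFD": (6, ">="),
}

def is_god_class(metrics):
    """A component is smelly if every clause of the strategy holds."""
    return all(
        metrics[m] >= t if d == ">=" else metrics[m] <= t
        for m, (t, d) in GOD_CLASS_RULE.items()
    )

def intensity(metrics, extremes):
    """Mean exceeding amount of the strategy's metrics, scaled to [1, 10].

    `extremes` maps each metric to the most extreme value observed in the
    system (maximum for ">=" clauses, minimum for "<=" clauses).
    """
    if not is_god_class(metrics):
        return 0.0  # non-smelly classes get intensity 0 (see Section IV)
    amounts = []
    for m, (t, d) in GOD_CLASS_RULE.items():
        span = (extremes[m] - t) if d == ">=" else (t - extremes[m])
        exceed = (metrics[m] - t) if d == ">=" else (t - metrics[m])
        amounts.append(exceed / span if span > 0 else 1.0)
    return 1 + 9 * sum(amounts) / len(amounts)
```

A class whose metrics sit exactly on the thresholds thus gets intensity 1, while a class matching the most extreme values in the system approaches 10, mirroring the severity scale described above.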

When considered as a bug predictor, the intensity index has two relevant properties: (i) its value is derived from a set of other metric values, and (ii) since it relies on the statistical distribution of the metrics, it can be seen as a non-linear combination of their values. We include the intensity index as an additional predictor in a structural metrics-based bug prediction model. Indeed, we cannot use the intensity index as a single predictor, since in that case we could not predict the bug-proneness of classes not affected by any design problem (the intensity index of non-smelly classes is equal to 0). Thus, to build the proposed bug prediction model we firstly split the training set into smelly (as identified by the code smell detector) and non-smelly classes. We then assign to smelly classes an intensity index according to the evaluation performed by JCodeOdor, while we set the intensity of non-smelly classes to 0. Finally, we add the information about the intensity to a set of structural metrics in order to perform the predictions.

IV. EVALUATION OF THE PROPOSED MODEL

The goal of the empirical study is to evaluate the contribution of the intensity index in a prediction model aimed at discovering bug-prone code components, with the purpose of improving the allocation of resources in verification & validation activities by focusing on the components having a higher bug-proneness. The quality focus is on the prediction accuracy and completeness as compared to state-of-the-art approaches, while the perspective is that of researchers, who want to evaluate the effectiveness of using information about code smells when identifying bug-prone components.

The context of the study consists of six software systems having different size and scope, namely Apache Xerces (http://xerces.apache.org), Apache Xalan (http://xalan.apache.org), Apache Velocity (http://velocity.apache.org), Apache Tomcat (http://tomcat.apache.org), Apache Lucene (http://lucene.apache.org), and Apache Log4j (http://logging.apache.org/log4j/2.x/). Table IV reports the characteristics of the analyzed software systems in terms of (i) system size, considering the number of classes and KLOC, (ii) the percentage of buggy files (identified as explained later), and (iii) the percentage of classes affected by design problems (detected as explained later). All the data used in the study are publicly available in our online appendix [50].

TABLE IV
SOFTWARE PROJECTS IN OUR DATASET

System: Classes, KLOCs, % Buggy Cl., % Smelly Cl.
Apache Xerces 1.4.4: 588, 141, 74, 5
Apache Xalan 2.7: 909, 428, 86, 12
Apache Velocity 1.6.1: 229, 57, 15, 7
Apache Tomcat 6.0: 858, 301, 6, 4
Apache Lucene 2.4: 338, 103, 59, 10
Apache Log4j 1.2: 205, 38, 87, 15

A. Empirical Study Definition and Design

In the context of this empirical investigation, we formulated the following research questions:

RQ1: To what extent does the intensity index contribute to the prediction of bug-prone code components?

RQ2: What is the gain provided by the intensity index to the bug prediction model when compared to the other predictors?

To answer RQ1, we firstly need an oracle reporting the presence of bugs in the source code of the analyzed software projects. Fortunately, all the systems are hosted in the PROMISE repository [51], which collects a large dataset of bugs and provides oracles for all the projects in this study. Secondly, we need to instantiate the prediction model presented in Section III by defining (i) the basic predictors, (ii) the code smell detection process, and (iii) the machine learning technique to use for classifying buggy instances. As for the software metrics to use as basic predictors in the model, the related literature proposes several alternatives, with a main distinction between product metrics (e.g., lines of code, code complexity, etc.) and process metrics (e.g., past changes and bug fixes performed on a code component). To better understand the predictive power of the intensity index, we decided to test its contribution in a bug prediction model composed of structural predictors, and in particular the 20 quality metrics exploited by Jureczko et al. [32]. Our choice is guided by the will to investigate whether a single additional structural metric representing the intensity of code smells is able to add useful information to a prediction model already characterized by structural predictors, including the set of code metrics used for the computation of the intensity index. Thus, to measure the extent to which the contribution of the intensity index is useful for predicting bugs, we experimented with the following bug prediction models:

• Basic Model: the model based on the 20 software metrics defined by Jureczko et al. [32];
• Basic Model + Intensity: the model above based on the 20 software metrics plus the intensity index. It is worth remembering that, for non-smelly classes, the intensity value is set to 0 (a minimal sketch of this setup follows the list).
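As an illustration of the setup just described, the sketch below assembles the two feature sets from a PROMISE-style class-level dataset. The CSV layout, the column names, and the precomputed "intensity" column are assumptions made for this example; they are not the authors' released artifacts.

```python
# Hypothetical sketch: build the Basic and Basic + Intensity feature sets
# from a PROMISE-style CSV; all column names are assumed for illustration.
import pandas as pd

JURECZKO_METRICS = [
    "wmc", "dit", "noc", "cbo", "rfc", "lcom", "ca", "ce", "npm", "lcom3",
    "loc", "dam", "moa", "mfa", "cam", "ic", "cbm", "amc", "max_cc", "avg_cc",
]

def build_feature_sets(csv_path):
    """Return (X_basic, X_intensity, y) for one of the studied systems."""
    data = pd.read_csv(csv_path)
    y = (data["bug"] > 0).astype(int)        # binary bugginess label
    x_basic = data[JURECZKO_METRICS]         # Basic Model predictors
    x_intensity = data[JURECZKO_METRICS + ["intensity"]]  # plus severity
    return x_basic, x_intensity, y
```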
Applying this procedure, we were able to control the effective contribution of the index during the prediction of bugs. Regarding the code smell detection process, our study focuses on the analysis of the code smells for which an intensity index has been defined (see Section III). To this aim, we rely on the detection performed by JCodeOdor [31], because on the one hand it has been empirically validated, demonstrating good performance in detecting code smells, and on the other hand it detects all the code smells considered in the empirical study. Finally, it computes the value of the intensity index on the detected code smells. To build a bug prediction model that discriminates actual smelly and non-smelly classes, we decided to discard the false positive instances from the set of candidate code smells given by the detection tool (in other words, we set the intensity of false positives to 0). To this aim, we manually discarded such instances by comparing the results of the tool against an annotated set of code smell instances publicly available [52]. It is worth observing that the best solution would be to consider all the actual smell instances in a software project (i.e., the golden set). However, the smell instances which are not detected by JCodeOdor do not exceed the structural metric thresholds that allow the tool to detect them and assign them an intensity value. As a consequence, the intensity index assigned to these instances would be equal to 0, and would still have no effect on the prediction model.

The final step is the definition of the machine learning classifier to use. We experimented with several classifiers, namely Multilayer Perceptron [53], ADTree [54], Naive Bayes [55], Logistic Regression [56], Decision Table Majority [57], and Simple Logistic [58]. We empirically compared the results achieved by the prediction model on the software systems used in our study (more details on the adopted procedure later in this section). A complete comparison among the experimented classifiers can be found in our online appendix [50]. Over all the systems, the best results on the baseline model were obtained using Simple Logistic, confirming previous findings in the field [42], [59]. Thus, in this paper we report the results of the models built with this classifier. This classifier uses a statistical technique based on a probability model. Indeed, instead of a simple classification, the probability model gives the probability of an instance belonging to each individual class (i.e., buggy or not), describing the relationship between a categorical outcome (i.e., buggy or not) and one or more predictors [58].

Once the model has been instantiated, to assess its performance we adopted the 10-fold cross-validation strategy [60]. This strategy randomly partitions the original set of data into 10 equally sized subsets. Of the 10 subsets, one is retained as the test set, while the remaining 9 are used as the training set. The cross-validation is then repeated 10 times, allowing each of the 10 subsets to be the test set exactly once [60]. We used this test strategy since it allows all observations to be used for both training and testing purposes, but also because it has been widely used in the context of bug prediction (e.g., see [28], [61]–[63]). Finally, we answer RQ1 by reporting three widely adopted metrics, namely accuracy, precision, and recall [64]. In addition, we also report the Area Under the Curve (AUC) obtained by the prediction model. The AUC quantifies the overall ability of a prediction model to discriminate between buggy and non-buggy classes. The closer the AUC is to 1, the higher the ability of the classifier to discriminate classes affected and not affected by a bug. On the other hand, the closer the AUC is to 0.5, the lower the accuracy of the classifier. Besides analyzing the performance of the specialized bug prediction model and comparing it with the baseline model, we also investigate the behavior of the experimented models in the classification of smelly and non-smelly instances. Specifically, we compute the percentage of smelly and non-smelly classes correctly classified by each of the prediction models, to evaluate whether the intensity-including model actually gives a contribution in the classification of classes affected by a code smell, or whether the addition of the intensity index also affects the classification of smell-free classes.

As for RQ2, we conduct a fine-grained investigation aimed at measuring how important the intensity index is with respect to the other features (i.e., metrics) composing the model. In particular, we use an information gain algorithm [65] to quantify the gain provided by adding the intensity index to the prediction model. Formally, let M be a bug prediction model and let P = {p1, ..., pn} be the set of predictors composing M; an information gain algorithm [65] applies the following formula to compute a measure defining the difference in entropy from before to after M is split on an attribute pi:

InfoGain(M, pi) = H(M) − H(M | pi)    (1)

where the function H(M) indicates the entropy of the model that includes the predictor pi, while the function H(M | pi) measures the entropy of the model that does not include pi. The entropy is computed as follows:

H(M) = − Σ_{i=1}^{n} prob(pi) · log2 prob(pi)    (2)

In other words, the algorithm quantifies how much uncertainty in M is reduced after splitting M on attribute pi. In the context of our work, we apply the Gain Ratio Feature Evaluation algorithm [65], which ranks p1, ..., pn in descending order based on the contribution provided by each pi to the decisions made by M. In particular, the output of the algorithm is a ranked list in which the predictors having the higher expected reduction in entropy are placed at the top. Using this procedure, we evaluate the relevance of the predictors in the prediction model, possibly understanding whether the addition of the intensity index gives a higher contribution with respect to the structural metrics from which it is derived (i.e., the metrics used for the detection of the smells), or with respect to the other structural metrics contained in the model.
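The following sketch mirrors this evaluation protocol under two stated substitutions: the paper uses Weka's Simple Logistic classifier and the Gain Ratio Feature Evaluation algorithm, whereas here scikit-learn's LogisticRegression and mutual_info_classif (an entropy-reduction estimate related to, but not identical with, gain ratio) stand in for them.

```python
# Sketch of the evaluation protocol with stand-in components: a logistic
# probabilistic classifier under 10-fold cross-validation, and an
# entropy-reduction estimate to rank the predictors.
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

def evaluate(features, labels):
    """10-fold cross-validated accuracy, precision, recall, F1, AUC-ROC."""
    scores = cross_validate(
        LogisticRegression(max_iter=1000), features, labels, cv=10,
        scoring=("accuracy", "precision", "recall", "f1", "roc_auc"),
    )
    return {name: values.mean() for name, values in scores.items()}

def rank_predictors(features, labels):
    """Predictors sorted by decreasing entropy-reduction estimate."""
    gains = mutual_info_classif(features, labels, random_state=0)
    return sorted(zip(features.columns, gains), key=lambda pair: -pair[1])
```

Comparing evaluate(x_basic, y) against evaluate(x_intensity, y) reproduces the Basic vs. Basic + Intensity comparison underlying Table V, while rank_predictors(x_intensity, y) yields a ranking analogous in spirit to Table VII.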
B. Analysis of the Results

In the following we discuss the achieved results, aiming at providing an answer to our research questions.

To what extent does the intensity index contribute to the prediction of bug-prone code components? Table V reports, for each considered software project, the results achieved when considering (i) the baseline prediction model built using the structural metrics in [32] and (ii) the model built by adding the intensity index of smelly classes to the baseline model. In addition, the column % Cor. Class. S-Cl. of Table V reports the percentage of smelly classes correctly classified (with respect to bugginess) by each of the analyzed models, while % Cor. Class. NS-Cl. reports the percentage of correctly classified non-smelly instances.

TABLE V
ACCURACY, PRECISION, RECALL, F-MEASURE, AUC-ROC, AND PERCENTAGE OF BUGGY CLASSES AFFECTED (AND NOT) BY A SMELL CORRECTLY CLASSIFIED BY THE EXPERIMENTED PREDICTION MODELS

Project / Model: Accuracy, Precision, Recall, F-Measure, AUC-ROC, % Cor. Class. S-Cl., % Cor. Class. NS-Cl.
Apache Xerces / Basic: 94, 93, 94, 93, 95, 72, 92
Apache Xerces / Basic + Int.: 95, 96, 95, 95, 95, 97, 92
Apache Xalan / Basic: 99, 99, 100, 99, 99, 94, 100
Apache Xalan / Basic + Int.: 100, 100, 100, 100, 100, 99, 100
Apache Velocity / Basic: 67, 18, 28, 22, 89, 56, 32
Apache Velocity / Basic + Int.: 92, 31, 46, 37, 90, 100, 35
Apache Tomcat / Basic: 56, 9, 16, 12, 81, 34, 17
Apache Tomcat / Basic + Int.: 79, 20, 31, 24, 94, 83, 20
Apache Lucene / Basic: 78, 75, 76, 75, 81, 50, 78
Apache Lucene / Basic + Int.: 80, 80, 80, 80, 83, 82, 77
Apache Log4j / Basic: 94, 99, 96, 97, 89, 55, 93
Apache Log4j / Basic + Int.: 95, 100, 98, 99, 93, 74, 93

Looking at Table V, the first thing that leaps to the eye is that, overall, the basic prediction model is able to achieve high accuracy. For instance, on Apache Xalan the basic model obtains 100% recall and 99% precision and accuracy. From a practical point of view, this result means that the model misclassifies only a small subset of instances. On the other hand, taking into account the intensity index of the smelly classes results in 100% for all the considered metrics, correctly classifying the instances missed by the basic model. It is worth noting that obtaining an increment of the performance in situations where the Basic model works well is quite hard. Still, in these situations the intensity index is able to contribute by refining the predictions of the Basic model and increasing the model's accuracy. Analyzing the percentage of smelly and non-smelly classes correctly classified by the specialized bug prediction model, we can understand that the increment of the performance is due to a better classification of the instances composing the set of classes having design flaws (+5%), while the non-smelly classes are treated generally in the same way by both models. An interesting example is represented by the class ElemTextLiteral contained in the org.apache.xalan.templates package. This class contains a Brain Method code smell having an intensity index of 5.5. The basic model classifies this class as buggy, since its structural metrics are considered by the Basic model as indicators of the presence of a bug. Conversely, the low level of intensity allows the intensity-including model to correctly mark this class as non-buggy. On the other hand, an example of a code component correctly classified as buggy thanks to the use of the intensity index computed for the Message Chains smell is the XSLTProcessorApplet class from the org.apache.xalan.client package. In this case, the Basic model misclassifies this class as non-buggy, while the specialized model correctly classifies it as buggy. It is important to note that the ElemTextLiteral and XSLTProcessorApplet classes have similar metrics (as shown in Table VI), and the only predictor able to distinguish them is the intensity index.

Looking at the other software systems analyzed, we can observe for Apache Xerces and Apache Log4j a behavior of the intensity index similar to the one achieved on Apache Xalan. In these cases, the performance of the Basic model is always slightly improved by the addition of the intensity index. Also, the intensity index enables the correct classification of smelly instances, while the predictions made on classes not affected by design problems remain exactly the same.

A different analysis can be made for the other systems. Indeed, when the performance of the Basic model is quite low, the intensity index is able to give a strong contribution to the classification of buggy components. For instance, the accuracy of the intensity-including model on Apache Velocity is 25% higher than the one achieved by the basic model. Here we can note that the proposed bug prediction model is not only able to correctly classify all the instances affected by smells, but also gives a slight contribution (+3%) to the classification of the non-smelly instances. This result clearly highlights the importance of considering the design quality characteristics of a code component when predicting bugs.

Summary for RQ1. The addition of the intensity index as a predictor of buggy components generally increases the performance of the baseline bug prediction model over all the analyzed projects. We observed cases in which the prediction accuracy increases by up to 25% with respect to the performance achieved without considering the intensity metric.

What is the gain provided by the intensity index to the bug prediction model when compared to the other predictors? Table VII shows the results achieved when applying the Gain Ratio Feature Evaluation algorithm [65] to the set of predictors composing the intensity-including bug prediction model. Specifically, for each software system, we report the ranking of the predictors based on their importance for the model, together with a value representing the expected reduction in entropy caused by partitioning the prediction model according to a given predictor (i.e., column Gain). The results show that the Coupling Between Objects (CBO) metric is highly relevant for the predictions made on 3 of the analyzed projects (i.e., Apache Xerces, Apache Tomcat, and Apache Log4j), confirming the findings by Gyimóthy et al. [23] on the predictive power of the metric.
TABLE VI
COMPARISON BETWEEN ELEMTEXTLITERAL AND XSLTPROCESSORAPPLET (APACHE XALAN) IN TERMS OF STRUCTURAL METRICS

Metrics, in the order WMC, DIT, NOC, CBO, RFC, LCOM, CA, CE, NPM, LCOM3, LOC, DAM, MOA, MFA, CAM, IC, CBM, AMC, MAX(CC), AVG(CC), Intensity:
org.apache.xalan.templates.ElemTextLiteral: 11, 3, 0, 9, 22, 29, 6, 4, 11, 0.88, 127, 0.8, 0, 0.95, 0.34, 2, 5, 10, 2, 1, 5.5
org.apache.xalan.client.XSLTProcessorApplet: 12, 5, 0, 7, 23, 31, 5, 2, 11, 0.89, 125, 0.9, 1, 0.94, 0.28, 3, 5, 11, 2, 1, 9.1

TABLE VII
GAIN PROVIDED BY EACH METRIC TO THE PREDICTION MODEL

Predictors ranked by decreasing gain (Metric: Gain), per system:
Apache Xerces: CBO 0.47, CE 0.39, CA 0.22, AMC 0.17, AVG(CC) 0.17, MFA 0.15, LOC 0.12, DIT 0.11, CBM 0.11, IC 0.11, Intensity 0.10, LCOM3 0.07, MAX(CC) 0.07, CAM 0.06, DAM 0.05, RFC 0.05, MOA 0.04, WMC 0.02, NOC 0.01, LCOM 0.01, NPM 0.01
Apache Xalan: RFC 0.38, NPM 0.37, LCOM 0.22, WMC 0.22, Intensity 0.21, CE 0.20, CBO 0.19, LOC 0.14, CAM 0.13, DIT 0.11, IC 0.09, CBM 0.09, MFA 0.08, MAX(CC) 0.08, NOC 0.07, AMC 0.06, CA 0.03, MOA 0.03, DAM 0.01, AVG(CC) 0.01, LCOM3 0.01
Apache Velocity: CAM 0.66, NPM 0.58, RFC 0.56, Intensity 0.53, WMC 0.53, CE 0.48, LCOM 0.46, DIT 0.35, LCOM3 0.35, CA 0.35, CBO 0.29, NOC 0.28, AVG(CC) 0.28, CBM 0.25, MAX(CC) 0.23, AMC 0.21, IC 0.19, DAM 0.08, MFA 0.06, MOA 0.03, LOC 0.02
Apache Tomcat: CBO 0.41, RFC 0.39, MOA 0.38, LOC 0.37, Intensity 0.32, MAX(CC) 0.30, WMC 0.22, AMC 0.22, CAM 0.21, DAM 0.19, AVG(CC) 0.19, LCOM 0.17, NPM 0.16, NOC 0.11, DIT 0.08, IC 0.05, CA 0.05, MFA 0.05, CBM 0.03, LCOM3 0.02, CE 0.02
Apache Lucene: Intensity 0.47, RFC 0.45, CBO 0.45, NPM 0.43, CE 0.42, WMC 0.41, LCOM 0.29, DAM 0.26, CAM 0.24, LOC 0.22, AMC 0.16, IC 0.11, CA 0.11, MAX(CC) 0.10, MFA 0.09, CBM 0.07, NOC 0.07, DIT 0.07, AVG(CC) 0.06, LCOM3 0.05, MOA 0.02
Apache Log4j: CBO 0.49, Intensity 0.49, CA 0.47, NPM 0.38, LCOM 0.33, CE 0.31, LCOM3 0.31, NOC 0.29, DIT 0.29, RFC 0.29, LOC 0.26, AVG(CC) 0.25, AMC 0.19, MAX(CC) 0.15, DAM 0.14, CBM 0.08, IC 0.06, CAM 0.04, MFA 0.04, MOA 0.02, WMC 0.01

On the other systems, different complexity metrics (e.g., RFC and CAM) appear at the top of the ranked list. As for the intensity index, we observe that the contribution given by the metric is valuable on all the studied projects (minimum gain = 0.10, maximum gain = 0.53), since it is generally included among the first places of the ranked list. This is a quite surprising result, since the goal of the addition of the intensity index is not to provide the most relevant predictor, but to complement the information used by a prediction model with a metric able to quantify in a single value the severity of the design problems affecting a class. For instance, it is interesting to discuss the result achieved on the Apache Lucene project, where the intensity metric is evaluated as the most important one by the Gain Ratio Feature Evaluation algorithm, which quantifies as 0.47 the gain of the metric in reducing the entropy of the prediction model. Looking at the ranking, we can see that the single quality metrics from which the intensity index is computed (i.e., the metrics used for the smell detection) are placed by the algorithm at the bottom of the ranked list (e.g., LCOM3 is only partially relevant and provides a small gain of 0.05). In other words, the single metrics do not reduce the entropy to the same degree as when such metrics are condensed in a single value representing the intensity of a code smell. As an example, the intensity index contributes to reducing the entropy of the prediction model 25% more than the LOC metric, 18% more than the LCOM metric, and 6% more than the WMC metric. It is worth noting that, as a consequence, the ability of the specialized bug prediction model to correctly classify smelly instances on Apache Lucene increases by 22% (see Table V). Another interesting observation can be made by looking at the results of Apache Velocity. Also in this case, the metrics used for the detection of smells are only partially relevant for the prediction model when considered individually (e.g., CBO = 0.29), while the intensity measure is instead considered a very useful predictor (gain = 0.53). Here the performance provided by the intensity-including bug prediction model is 25% better than the baseline model, and this is due to the fact that the specialized model is able to correctly classify all the smelly instances in the system. In the worst case, the intensity index is ranked as the eleventh most useful predictor on Apache Xerces, with a gain equal to 0.10. However, it is important to highlight, as shown in Table V, that on this project the performance of the baseline model is high, and the intensity index still contributes an increment of 1% in terms of accuracy.

Summary for RQ2. The intensity index has a higher predictive power with respect to the individual metrics from which it is derived. On all the projects of the study, we found that the intensity metric is one of the most important predictors of the model. As a consequence, the gain provided by the intensity index to the baseline prediction model is highly relevant.

C. Threats to Validity

Threats to construct validity are related to the relationship between theory and observation. Above all, we relied on JCodeOdor [31] for detecting code smells. We validated the performance of the code smell detector on the software projects analyzed in this paper.
Table VIII reports the precision, recall, and F-measure values obtained by considering the instances of all the projects as a single dataset (i.e., overall). A detailed analysis of the performance of the detector for each project can be found in our online appendix [50].

TABLE VIII
PERFORMANCE OF JCODEODOR ON THE SOFTWARE PROJECTS OBJECT OF THE EMPIRICAL STUDY

Code Smell: Precision, Recall, F-Measure, # TP, # FP, # FN
Blob: 78%, 81%, 79%, 25, 7, 6
Data Class: 89%, 100%, 94%, 8, 1, 8
Brain Method: 87%, 92%, 89%, 124, 19, 11
Shotgun Surgery: 60%, 64%, 62%, 9, 6, 5
Dispersed Coupling: 76%, 81%, 78%, 25, 8, 6
Message Chains: 65%, 79%, 71%, 15, 8, 4
Overall: 81%, 87%, 84%, 206, 49, 32

We can observe that the performance of the detector ranges between 62% and 94% in terms of F-measure. Despite the quite high precision (i.e., 81% overall), the tool still identifies 49 false positives, which we discarded to make the set of code smells as close as possible to the golden set. As removing false positives is not always feasible, in Section V we evaluate the effect of including false positives in the construction of the bug prediction model. On the other hand, the tool achieves an overall recall of 87%. In the empirical study, we were not able to include false negative smell instances, because the tool assigns an intensity index equal to zero to such instances. Even if this could have influenced our results, it is worth noting that only 32 of such instances (out of the total of 238) are missing from the analysis. Future work will be devoted to improving the detection performance of the tool by including rules used by other smell detectors (e.g., DECOR [4]). Another threat to construct validity regards the annotated sets of bugs and code smells used in the empirical study. As for bugs, we rely on the publicly available oracles in the PROMISE repository [51], which have been widely used in previous research [32], [66]–[69]. For code smells, we rely on the oracles publicly available in [52], previously used in [5], [6], [8], [9], [70]. However, we cannot exclude that the oracles we used miss some bugs or smells, or else include some false positives.

Threats to conclusion validity concern the relation between the treatment and the outcome. The metrics used to evaluate the bug prediction models, i.e., accuracy, precision, recall, F-measure, and AUC-ROC, are widely used in the evaluation of the performance of bug prediction techniques [42]. Moreover, we analyzed to what extent the intensity index is important with respect to the other metrics by analyzing the gain provided by the addition of the severity measure to the model.

Finally, threats to external validity concern the generalization of the results. We analyzed six different software projects from different application domains and with different characteristics (size, number of classes, etc.). However, our future agenda includes the analysis of other systems aimed at corroborating our findings. Another threat in this category regards the choice of the baseline model. We selected the model by Jureczko et al. [32] since it is more interesting and challenging to evaluate the predictive power of the intensity index when it is added to a model characterized by other structural metrics, including the ones used for the computation of the intensity index. Moreover, the selected model contains a comprehensive set of quality metrics, which allowed a more detailed analysis of the gain provided by the intensity index in the context of the structural-based bug prediction model. However, as pointed out by Moser et al. [24], predictors based on process metrics can achieve better performance in predicting bugs. To deal with this threat, in Section V we discuss the results achieved when considering the intensity index as an additional predictor in models including process metrics.

V. DISCUSSION AND FURTHER ANALYSIS

The results of the empirical study reveal the usefulness of considering the intensity of code smells as an additional predictor in order to classify instances affected by design problems. From a practical perspective, the results indicate that smells having low severity are less prone to be affected by a bug than smells with high severity. In this sense, the use of an indicator of intensity is beneficial to correctly discriminate the bug-proneness of smelly classes. Moreover, the results also reveal that the structural metrics (including the ones used for detecting smells) are not effective when applied to predict the bugginess of classes affected by design flaws. Even if this can appear a quite surprising result, we observe that when evaluating the bug-proneness of smelly classes, the prediction model is not able to correctly deal with the whole set of software metrics. In other words, several quality indicators considered in isolation work worse than a single aggregate metric reporting the degree of severity of the design flaw affecting a class.

In the context of our work, we exploited the intensity index rather than a simple truth value providing information about the presence of a design problem. Indeed, the latter solution could also provide complementary information with respect to the structural metrics, leading to improvements similar to the ones achieved by considering the intensity index. This issue is analyzed in Section V-A. Another discussion point is related to the threats to validity pointed out in Section IV-C. On the one hand, in the empirical study we discarded the false positive instances given by the smell detection tool to consider in the prediction model a set of code smells as close as possible to the golden set. However, removing such instances may not be practically feasible for several reasons (e.g., the effort needed to validate smells). In order to evaluate the impact that false positives have on the performance of the prediction model, in Section V-B we evaluate the performance of the model obtained without removing the false positive code smell instances detected by the tool. Finally, a threat to the generalizability of the results regards the choice of the baseline model. To evaluate the contribution of the intensity index in different contexts, Section V-C reports the results achieved when adding the intensity of code smells to a prediction model based on process metrics, as well as the analysis of the contribution of the intensity index in a model composed of both structural and process metrics. Note that for all the additional analyses we follow the same experimental design described in Section IV.

A. Comparing the presence/absence of smells rather than the intensity index

We added to the baseline bug prediction model defined in [32] the boolean information about the presence of code smells in a class (note that we used the golden set of code smells in this case, to avoid bias deriving from the use of a particular tool). The results are reported in Table IX.
TABLE IX
ACCURACY METRICS FOR THE MODEL BUILT BY ADDING TO THE BASELINE MODEL A TRUTH VALUE INDICATING THE PRESENCE OF A SMELL

Project / Model: Accuracy, Precision, Recall, F-Measure, AUC-ROC, % Cor. Class. S-Cl., % Cor. Class. NS-Cl.
Apache Xerces / Basic + Truth: 94, 93, 94, 93, 95, 72, 92
Apache Xalan / Basic + Truth: 99, 99, 100, 99, 99, 94, 100
Apache Velocity / Basic + Truth: 67, 18, 28, 22, 89, 56, 32
Apache Tomcat / Basic + Truth: 56, 9, 16, 12, 81, 34, 17
Apache Lucene / Basic + Truth: 78, 75, 76, 75, 81, 50, 78
Apache Log4j / Basic + Truth: 94, 99, 96, 97, 89, 55, 93

However, comparing Tables V and IX, this choice does not lead to improvements in the performance of the baseline model (the performance of the two models is exactly the same). Indeed, the addition of such information does not provide any complementary information that the model can use to predict the bug-proneness of smelly classes. As a consequence, the presence of the additional predictor is totally irrelevant. This is because the simple truth value does not quantify the extent to which the design problem is actually harmful. For instance, let us recall the example shown in Table VI. In this case, both instances are smelly, but they have different intensity. A prediction model based on the simple truth value would not distinguish their bug-proneness, and it would not be able to correctly classify the bugginess of the XSLTProcessorApplet class.

B. Evaluating the Impact of False Positive Smells in the Bug Prediction Model

Table X reports the results achieved when building the proposed bug prediction model without filtering the false positive instances from the set of candidate smells identified by JCodeOdor.

TABLE X
ACCURACY METRICS FOR THE MODEL WHERE FALSE POSITIVE SMELLS ARE NOT FILTERED

Project / Model: Accuracy, Precision, Recall, F-Measure, AUC-ROC, % Cor. Class. S-Cl., % Cor. Class. NS-Cl.
Apache Xerces / Basic + Int.: 94, 94, 94, 91, 95, 96, 92
Apache Xalan / Basic + Int.: 100, 100, 100, 100, 100, 99, 100
Apache Velocity / Basic + Int.: 90, 28, 43, 35, 89, 97, 35
Apache Tomcat / Basic + Int.: 77, 18, 29, 24, 93, 80, 20
Apache Lucene / Basic + Int.: 79, 77, 78, 77, 83, 80, 77
Apache Log4j / Basic + Int.: 95, 99, 97, 95, 93, 71, 92

Comparing these results with the performance of the models shown in Table V, we can make two main observations. First of all, without filtering false positives, the bug prediction model obtains accuracy values always higher than the baseline model. This means that, even in the presence of false positive instances, the use of the proposed model in a practical setting guarantees higher performance with respect to the baseline prediction model. Indeed, it is important to observe that the percentage of smelly instances correctly classified by the model ranges between 71% and 99%, clearly indicating its ability to distinguish the bug-proneness of classes affected by design problems. At the same time, the performance of the model in the classification of non-smelly classes is in line with that of the baseline. On the other hand, discarding false positive code smell instances does not result in significantly better performance with respect to including the false positives detected by the tool (compare Tables V and X). Indeed, in this case the performance of the non-filtered false positives model is only slightly lower, indicating that false positive instances do not have a significant impact on the results and do not necessarily need to be validated and filtered out. Summarizing, we can claim that a fully automatic code smell detection still improves the performance of the baseline bug prediction model.

C. Evaluating the Contribution of the Intensity Index in a Process Metrics-based Bug Prediction Model

In order to evaluate the contribution of the intensity index in a process metrics-based bug prediction model, we exploit the model defined by Hassan [27], which is built by considering the entropy of changes as a predictor of buggy components. The choice of this process-based model is not random, but guided by the will to select a bug prediction model having good performance [27] and quite representative of the state of the art [30]. The analysis of the contribution of the intensity index in other process metrics-based bug prediction models (e.g., the ones proposed in [28] and [30]) is part of our future agenda. As we can see from Table XI, the process model relying only on the entropy of changes does not obtain higher performance with respect to the models considering structural properties. At the same time, we can observe that the use of the intensity index as an additional feature in the model can increase the number of correctly classified instances, resulting in a higher accuracy. This is a quite expected result, since the addition of the intensity index adds an orthogonal source of information with respect to the process metric. It is worth noting that in the cases where the prediction accuracy of the baseline process-based model is low, the intensity can increase the quality of the predictions by up to 47%. This is the case of the Apache Velocity project, where the baseline model reaches 33% of accuracy in the predictions. By adding the intensity index, the prediction model increases its performance to 80% (+47%), demonstrating that a better characterization of the classes having design problems can help in obtaining more accurate predictions. It is also interesting to analyze the results on the percentage of smelly classes correctly classified. On the Apache Velocity project, the baseline model correctly classifies half of the smelly classes, while the model considering the intensity is able to capture 100% of the buggy and smelly classes. As for the other software projects, we can outline a trend similar to the one observed in the case of the structural-based prediction model. Indeed, the intensity index is able to refine the predictions of the baseline model, ensuring slightly higher performance in cases where the performance of the baseline is already high (e.g., see the results achieved on Apache Xerces and Apache Log4j). In the other cases, we can always observe an improvement of both precision and recall (and, consequently, of the F-measure), but also an improvement of the AUC-ROC metric, which indicates the higher overall ability of the model considering the intensity in discriminating between buggy and non-buggy classes.
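For readers unfamiliar with the process-metric baseline, the sketch below approximates Hassan's entropy of changes: the Shannon entropy of how the changes of a period are spread across files, normalized to [0,1]. The models in [27] additionally vary the temporal window and apply decay factors, so this is an approximation of the idea rather than a faithful reimplementation.

```python
# Rough sketch of the entropy-of-changes predictor [27]: the Shannon
# entropy of the distribution of changes across files in a period,
# normalized by the maximum possible entropy, log2(#files touched).
import math
from collections import Counter

def change_entropy(changed_files):
    """`changed_files`: one file name per change event in the period."""
    counts = Counter(changed_files)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy / math.log2(len(counts)) if len(counts) > 1 else 0.0
```

Changes concentrated in one file yield an entropy near 0, while changes scattered evenly across many files push it toward 1, signaling a more complex and, empirically, more bug-prone development period.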
TABLE XI
ACCURACY M ETRICS FOR THE M ODELS BASED ON P ROCESS M ETRICS AND A C OMBINATION OF S TRUCTURAL AND P ROCESS M ETRICS .

Process Metrics Combined Metrics


Project Model
AUC- % Cor. Class. AUC- % Cor. Class.
Accuracy Precision Recall F-Measure Accuracy Precision Recall F-Measure
ROC S-Cl. NS-Cl. ROC S-Cl. NS-Cl.
Basic 91 91 91 91 71 41 89 94 93 94 93 94 68 91
Apache Xerces Basic + Int. 94 94 94 94 95 86 89 95 96 95 95 95 97 91
Basic 99 99 99 99 98 89 99 100 100 100 100 100 92 100
Apache Xalan Basic + Int. 99 99 99 99 97 92 99 100 100 100 100 100 94 100
Basic 33 33 33 33 76 50 45 67 18 28 22 79 57 32
Apache Velocity Basic + Int. 80 80 80 80 78 100 46 94 32 47 38 80 100 33
Basic 29 29 29 29 56 50 31 63 9 16 12 82 35 18
Apache Tomcat Basic + Int. 67 67 67 67 68 92 30 82 21 33 26 85 83 26
Basic 60 60 60 60 55 41 63 79 76 77 76 82 47 60
Apache Lucene Basic + Int. 71 71 71 71 62 79 63 81 79 80 79 83 85 60
Basic 92 92 93 92 52 63 88 99 93 96 94 90 52 89
Apache Log4j Basic + Int. 93 93 94 93 60 71 87 97 99 98 98 93 76 90

Finally, we also evaluated the contribution of the intensity index in a bug prediction model composed of both structural and process metrics. Looking at the results reported in Table XI (i.e., see the Combined model), we can observe that the addition of the process metric does not have the same impact as the addition of the intensity index in the baseline structural model. Indeed, the Combined model never outperforms the structural model which considers the intensity as an additional feature. Thus, we can conclude that the addition of the intensity index is actually needed also in this case to achieve higher performances. Moreover, when adding the intensity index to the Combined model, we observe that the contribution of the intensity index is still valuable. For example, let us consider the cases of Apache Velocity and Apache Tomcat. In the first project, the performances of the Combined model are not better than those of the prediction model purely based on structural code metrics. However, when adding the intensity to the mixed set of metrics characterizing the Combined model, the performances are not only better than those of the baseline structural model (+27% accuracy), but they are also better than those of all the other structural and process-based prediction models that include the intensity index (i.e., the Combined + Int. model has higher performances than the Basic + Int. structural model). In the second case, the Combined model is 7% more accurate than the baseline structural model. This indicates that the entropy of changes actually complements the structural metrics in the prediction of buggy components. However, also in this case the addition of the intensity index allows the prediction model to obtain a considerably higher prediction accuracy (+19%). This results in higher values for all the other evaluation metrics: indeed, the precision increases by 12%, the recall by 17%, and the AUC-ROC by 3%. A similar discussion applies to the other software systems analyzed, where the intensity index actually contributes to the improvement of the performances of the Combined bug prediction model. Finally, as expected, we can observe that the buggy and smelly classes are mainly correctly classified by the model including the intensity index.
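To make the feature combination concrete, the following sketch shows how the intensity index can be appended to a mixed set of structural and process metrics before training (the classifier choice, the 10-fold setup, the file name, and all column names are illustrative assumptions rather than the exact configuration of our study):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    # Hypothetical per-class dataset: structural (CK) metrics, a process
    # metric (entropy of changes), the smell intensity, and the bug label.
    data = pd.read_csv("class_metrics.csv")  # illustrative file name
    structural = ["loc", "wmc", "cbo", "rfc", "lcom"]
    process = ["change_entropy"]

    def cross_validated_predictions(feature_cols):
        X, y = data[feature_cols], data["buggy"]
        classifier = LogisticRegression(max_iter=1000)
        # cross-validated predictions so both models are compared fairly
        return cross_val_predict(classifier, X, y, cv=10)

    # "Combined" model vs. "Combined + Int." model
    pred_combined = cross_validated_predictions(structural + process)
    pred_combined_int = cross_validated_predictions(
        structural + process + ["intensity"])

The two prediction vectors can then be scored with the evaluation_report function sketched earlier to reproduce a comparison in the style of Table XI.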
VI. CONCLUSION AND FUTURE WORK

In this paper, we evaluate to what extent the addition of the intensity index (i.e., a metric that quantifies the severity of code smells) to an existing structural metrics-based bug prediction model is useful to increase the performances of the baseline model. We also quantify the actual gain provided by the intensity index with respect to the other metrics composing the model, including the ones used to compute the code smell intensity. Moreover, we report additional analyses aimed at showing (i) the accuracy of a model where a simple truth value reporting the presence of code smells, rather than the intensity index, is added to the baseline model, (ii) the impact of false positive smell instances identified by a code smell detector, and (iii) the contribution of the intensity index in bug prediction models based on process metrics.

According to our experiments, the intensity index always positively contributes to state-of-the-art prediction models, even when they already have high performances. In particular, the intensity index helps discriminate bug-prone code elements affected by code smells in bug prediction models based on product metrics, process metrics, and a combination of the two. Our initial results suggest that the intensity of code smells is helpful in all of these cases, and cannot be substituted by a simple indicator of the presence or absence of a code smell. More importantly, the presence of a limited number of false positive smell instances identified by the code smell detector does not impact the accuracy and the practical applicability of the proposed specialized bug prediction model. The achieved findings highlight, on the one hand, the value of code smell detection in the context of bug prediction and, on the other hand, the importance of considering the intensity of such design problems as an additional indicator in bug prediction models.

As future work, we plan to extend the number of systems analyzed with this method in order to corroborate the results achieved in this paper, and to evaluate the contribution of the intensity index in other existing bug prediction models. Moreover, we plan to compare our proposed model with the one proposed by Taba et al. [44].
REFERENCES

[1] M. Fowler, Refactoring: Improving the Design of Existing Code. Addison-Wesley, 1999.
[2] F. A. Fontana, M. Zanoni, A. Marino, and M. V. Mantyla, "Code smell detection: Towards a machine learning-based approach," in Software Maintenance (ICSM), 2013 29th IEEE International Conference on, Sept 2013, pp. 396–399.
[3] G. Bavota, R. Oliveto, M. Gethers, D. Poshyvanyk, and A. De Lucia, "Methodbook: Recommending move method refactorings via relational topic models," IEEE Transactions on Software Engineering, vol. 40, no. 7, pp. 671–694, July 2014.
[4] N. Moha, Y.-G. Guéhéneuc, L. Duchien, and A.-F. L. Meur, "Decor: A method for the specification and detection of code and design smells," IEEE Transactions on Software Engineering, vol. 36, no. 1, pp. 20–36, 2010.
[5] F. Palomba, A. Panichella, A. Zaidman, R. Oliveto, and A. De Lucia, "A textual-based technique for smell detection," in Proceedings of the 24th International Conference on Program Comprehension (ICPC 2016). Austin, USA: IEEE, 2016, p. to appear.
[6] F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, D. Poshyvanyk, and A. De Lucia, "Mining version histories for detecting code smells," IEEE Transactions on Software Engineering, vol. 41, no. 5, pp. 462–489, May 2015.
[7] N. Tsantalis and A. Chatzigeorgiou, "Identification of move method refactoring opportunities," IEEE Transactions on Software Engineering, vol. 35, no. 3, pp. 347–367, 2009.
[8] M. Tufano, F. Palomba, G. Bavota, R. Oliveto, M. Di Penta, A. De Lucia, and D. Poshyvanyk, "When and why your code starts to smell bad," in Proceedings of the International Conference on Software Engineering (ICSE) - Volume 1. IEEE, 2015, pp. 403–414.
[9] F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, and A. De Lucia, "Do they really smell bad? A study on developers' perception of bad code smells," in Proceedings of the International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2014, pp. 101–110.
[10] A. F. Yamashita and L. Moonen, "Do developers care about code smells? An exploratory survey," in Proceedings of the Working Conference on Reverse Engineering (WCRE). IEEE, 2013, pp. 242–251.
[11] R. Arcoverde, A. Garcia, and E. Figueiredo, "Understanding the longevity of code smells: Preliminary results of an explanatory survey," in Proceedings of the International Workshop on Refactoring Tools. ACM, 2011, pp. 33–36.
[12] A. Chatzigeorgiou and A. Manakos, "Investigating the evolution of bad smells in object-oriented code," in Proceedings of the International Conference on the Quality of Information and Communications Technology (QUATIC). IEEE, 2010, pp. 106–115.
[13] A. Lozano, M. Wermelinger, and B. Nuseibeh, "Assessing the impact of bad smells using historical information," in Proceedings of the International Workshop on Principles of Software Evolution (IWPSE). ACM, 2007, pp. 31–34.
[14] D. Ratiu, S. Ducasse, T. Gîrba, and R. Marinescu, "Using history information to improve design flaws detection," in Proceedings of the European Conference on Software Maintenance and Reengineering (CSMR). IEEE, 2004, pp. 223–232.
[15] M. Abbes, F. Khomh, Y.-G. Guéhéneuc, and G. Antoniol, "An empirical study of the impact of two antipatterns, Blob and Spaghetti Code, on program comprehension," in 15th European Conference on Software Maintenance and Reengineering, CSMR 2011, 1-4 March 2011, Oldenburg, Germany. IEEE Computer Society, 2011, pp. 181–190.
[16] D. I. K. Sjøberg, A. F. Yamashita, B. C. D. Anda, A. Mockus, and T. Dybå, "Quantifying the effect of code smells on maintenance effort," IEEE Transactions on Software Engineering, vol. 39, no. 8, pp. 1144–1156, 2013.
[17] A. F. Yamashita and L. Moonen, "Do code smells reflect important maintainability aspects?" in Proceedings of the International Conference on Software Maintenance (ICSM). IEEE, 2012, pp. 306–315.
[18] A. Yamashita and L. Moonen, "Exploring the impact of inter-smell relations on software maintainability: An empirical study," in Proceedings of the International Conference on Software Engineering (ICSE). IEEE, 2013, pp. 682–691.
[19] F. Khomh, M. Di Penta, and Y.-G. Gueheneuc, "An exploratory study of the impact of code smells on software change-proneness," in Proceedings of the Working Conference on Reverse Engineering (WCRE). IEEE, 2009, pp. 75–84.
[20] F. Khomh, M. Di Penta, Y.-G. Guéhéneuc, and G. Antoniol, "An exploratory study of the impact of antipatterns on class change- and fault-proneness," Empirical Software Engineering, vol. 17, no. 3, pp. 243–275, 2012.
[21] V. Basili, L. Briand, and W. Melo, "A validation of object-oriented design metrics as quality indicators," IEEE Transactions on Software Engineering, vol. 22, no. 10, pp. 751–761, Oct 1996.
[22] R. Subramanyam and M. S. Krishnan, "Empirical analysis of CK metrics for object-oriented design complexity: Implications for software defects," IEEE Transactions on Software Engineering, vol. 29, no. 4, pp. 297–310, 2003.
[23] T. Gyimóthy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[24] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in International Conference on Software Engineering (ICSE), ser. ICSE '08, 2008, pp. 181–190.
[25] R. Moser, W. Pedrycz, and G. Succi, "Analysis of the reliability of a subset of change metrics for defect prediction," in Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ser. ESEM '08. New York, NY, USA: ACM, 2008, pp. 309–311. [Online]. Available: http://doi.acm.org/10.1145/1414004.1414063
[26] R. M. Bell, T. J. Ostrand, and E. J. Weyuker, "Does measuring code change improve fault prediction?" in Proceedings of the 7th International Conference on Predictive Models in Software Engineering, ser. Promise '11. New York, NY, USA: ACM, 2011, pp. 2:1–2:8. [Online]. Available: http://doi.acm.org/10.1145/2020390.2020392
[27] A. E. Hassan, "Predicting faults using the complexity of code changes," in Software Engineering, 2009. ICSE 2009. IEEE 31st International Conference on, May 2009, pp. 78–88.
[28] R. Bell, T. Ostrand, and E. Weyuker, "The limited impact of individual developer data on software defect prediction," Empirical Software Engineering, vol. 18, no. 3, pp. 478–505, 2013. [Online]. Available: http://dx.doi.org/10.1007/s10664-011-9178-4
[29] T. J. Ostrand, E. J. Weyuker, and R. M. Bell, "Programmer-based fault prediction," in Proceedings of the 6th International Conference on Predictive Models in Software Engineering, ser. PROMISE '10. New York, NY, USA: ACM, 2010, pp. 19:1–19:10. [Online]. Available: http://doi.acm.org/10.1145/1868328.1868357
[30] D. Di Nucci, F. Palomba, S. Siravo, G. Bavota, R. Oliveto, and A. De Lucia, "On the role of developer's scattered changes in bug prediction," in Software Maintenance and Evolution (ICSME), 2015 IEEE International Conference on, Sept 2015, pp. 241–250.
[31] F. Arcelli Fontana, V. Ferme, M. Zanoni, and R. Roveda, "Towards a prioritization of code debt: A code smell intensity index," in Proceedings of the Seventh International Workshop on Managing Technical Debt (MTD 2015). Bremen, Germany: IEEE, Oct. 2015, pp. 16–24, in conjunction with ICSME 2015.
[32] M. Jureczko and L. Madeyski, "Towards identifying software project clusters with regard to defect prediction," in Proceedings of the 6th International Conference on Predictive Models in Software Engineering, ser. PROMISE '10. New York, NY, USA: ACM, 2010, pp. 9:1–9:10. [Online]. Available: http://doi.acm.org/10.1145/1868328.1868342
[33] S. Chidamber and C. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476–493, Jun 1994.
[34] K. El Emam, W. Melo, and J. C. Machado, "The prediction of faulty classes using object-oriented design metrics," Journal of Systems and Software, vol. 56, no. 1, pp. 63–75, 2001.
[35] N. Ohlsson and H. Alberg, "Predicting fault-prone software modules in telephone switches," IEEE Transactions on Software Engineering, vol. 22, no. 12, pp. 886–894, 1996.
[36] N. Nagappan and T. Ball, "Static analysis tools as early indicators of pre-release defect density," in Proceedings of the 27th International Conference on Software Engineering, ser. ICSE '05. New York, NY, USA: ACM, 2005, pp. 580–586. [Online]. Available: http://doi.acm.org/10.1145/1062455.1062558
[37] N. Nagappan, T. Ball, and A. Zeller, "Mining metrics to predict component failures," in Proceedings of the 28th International Conference on Software Engineering, ser. ICSE '06. New York, NY, USA: ACM, 2006, pp. 452–461. [Online]. Available: http://doi.acm.org/10.1145/1134285.1134349
[38] T. Zimmermann, R. Premraj, and A. Zeller, "Predicting defects for Eclipse," in Proceedings of the Third International Workshop on Predictor Models in Software Engineering, ser. PROMISE '07. Washington, DC, USA: IEEE Computer Society, 2007, pp. 9–. [Online]. Available: http://dx.doi.org/10.1109/PROMISE.2007.10
[39] A. P. Nikora and J. C. Munson, "Developing fault predictors for evolving software systems," in Proceedings of the 9th IEEE International Symposium on Software Metrics. IEEE CS Press, 2003, pp. 338–349.
[40] T. M. Khoshgoftaar, N. Goel, A. Nandi, and J. McMullan, "Detection of software modules with high debug code churn in a very large legacy system," in Software Reliability Engineering. IEEE, 1996, pp. 364–371.
[41] T. L. Graves, A. F. Karr, J. S. Marron, and H. P. Siy, "Predicting fault incidence using software change history," IEEE Transactions on Software Engineering, vol. 26, no. 7, pp. 653–661, 2000.
[42] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: A benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4, pp. 531–577, 2012.
[43] A. E. Hassan and R. C. Holt, "Studying the chaos of code development," in Proceedings of the 10th Working Conference on Reverse Engineering, 2003.
[44] S. E. S. Taba, F. Khomh, Y. Zou, A. E. Hassan, and M. Nagappan, "Predicting bugs using antipatterns," in Proceedings of the 2013 IEEE International Conference on Software Maintenance, ser. ICSM '13. Washington, DC, USA: IEEE Computer Society, 2013, pp. 270–279. [Online]. Available: http://dx.doi.org/10.1109/ICSM.2013.38
[45] F. Arcelli Fontana, V. Ferme, and M. Zanoni, "Poster: Filtering code smells detection results," in Proceedings of the 37th International Conference on Software Engineering (ICSE 2015), vol. 2. Florence, Italy: IEEE, May 2015, pp. 803–804.
[46] M. Lanza and R. Marinescu, Object-Oriented Metrics in Practice: Using Software Metrics to Characterize, Evaluate, and Improve the Design of Object-Oriented Systems. Springer, 2006.
[47] F. Palomba, A. De Lucia, G. Bavota, and R. Oliveto, "Anti-pattern detection: Methods, challenges, and open issues," Advances in Computers, vol. 95, pp. 201–238, 2015.
[48] F. Arcelli Fontana, V. Ferme, M. Zanoni, and A. Yamashita, "Automatic metric thresholds derivation for code smell detection," in Proceedings of the 6th International Workshop on Emerging Trends in Software Metrics (WETSoM 2015). Florence, Italy: IEEE, May 2015, pp. 44–53, co-located with ICSE 2015.
[49] E. Tempero, C. Anslow, J. Dietrich, T. Han, J. Li, M. Lumpe, H. Melton, and J. Noble, "The Qualitas corpus: A curated collection of Java code for empirical studies," in Proc. 17th Asia Pacific Software Eng. Conf. Sydney, Australia: IEEE, December 2010, pp. 336–345.
[50] F. Palomba, M. Zanoni, F. A. Fontana, A. De Lucia, and R. Oliveto, "Smells like Teen Spirit: Improving Bug Prediction Performance Using the Intensity of Code Smells," Tech. Rep., Apr. 2016. [Online]. Available: http://tinyurl.com/hgorj4z
[51] T. Menzies, B. Caglayan, Z. He, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. (2012, June) The PROMISE repository of empirical software engineering data. [Online]. Available: http://promisedata.googlecode.com
[52] F. Palomba, D. Di Nucci, M. Tufano, G. Bavota, R. Oliveto, D. Poshyvanyk, and A. De Lucia, "Landfill: An open dataset of code smells with public evaluation," in Proceedings of the Working Conference on Mining Software Repositories (MSR). IEEE, 2015, pp. 482–485.
[53] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, 1961.
[54] Y. Freund and L. Mason, "The alternating decision tree learning algorithm," in Proceedings of the Sixteenth International Conference on Machine Learning, 1999, pp. 124–133.
[55] G. H. John and P. Langley, "Estimating continuous distributions in Bayesian classifiers," in Eleventh Conference on Uncertainty in Artificial Intelligence. San Mateo: Morgan Kaufmann, 1995, pp. 338–345.
[56] S. le Cessie and J. van Houwelingen, "Ridge estimators in logistic regression," Applied Statistics, vol. 41, no. 1, pp. 191–201, 1992.
[57] R. Kohavi, "The power of decision tables," in 8th European Conference on Machine Learning. Springer, 1995, pp. 174–189.
[58] C.-Y. J. Peng, K. L. Lee, and G. M. Ingersoll, "An introduction to logistic regression analysis and reporting," The Journal of Educational Research, vol. 96, no. 1, pp. 3–14, 2002.
[59] B. Ghotra, S. McIntosh, and A. E. Hassan, "Revisiting the impact of classification techniques on the performance of defect prediction models," in Software Engineering (ICSE), 2015 IEEE/ACM 37th IEEE International Conference on, vol. 1, May 2015, pp. 789–800.
[60] P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach, 1982.
[61] D. Engler, D. Y. Chen, S. Hallem, A. Chou, and B. Chelf, "Bugs as deviant behavior: A general approach to inferring errors in systems code," SIGOPS Operating Systems Review, vol. 35, no. 5, pp. 57–72, Oct. 2001. [Online]. Available: http://doi.acm.org/10.1145/502059.502041
[62] G. Antoniol, K. Ayari, M. Di Penta, F. Khomh, and Y.-G. Guéhéneuc, "Is it a bug or an enhancement?: A text-based approach to classify change requests," in Proceedings of the 2008 Conference of the Centre for Advanced Studies on Collaborative Research, October 27-30, 2008, Richmond Hill, Ontario, Canada. IBM, 2008, p. 23.
[63] S. Kim, E. J. Whitehead Jr., and Y. Zhang, "Classifying software changes: Clean or buggy?" IEEE Transactions on Software Engineering, vol. 34, no. 2, pp. 181–196, 2008.
[64] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison-Wesley, 1999.
[65] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, Mar. 1986. [Online]. Available: http://dx.doi.org/10.1023/A:1022643204877
[66] H. Lu, E. Kocaguneli, and B. Cukic, "Defect prediction between software versions with active learning and dimensionality reduction," in Software Reliability Engineering (ISSRE), 2014 IEEE 25th International Symposium on, Nov 2014, pp. 312–322.
[67] S. Kim, H. Zhang, R. Wu, and L. Gong, "Dealing with noise in defect prediction," in Software Engineering (ICSE), 2011 33rd International Conference on, May 2011, pp. 481–490.
[68] T. Menzies and J. Di Stefano, "How good is your blind spot sampling policy," in High Assurance Systems Engineering, 2004. Proceedings. Eighth IEEE International Symposium on, March 2004, pp. 129–138.
[69] M. Shepperd, Q. Song, Z. Sun, and C. Mair, "Data quality: Some comments on the NASA software defect datasets," IEEE Transactions on Software Engineering, vol. 39, no. 9, pp. 1208–1215, Sept 2013.
[70] F. Palomba, "Textual analysis for code smell detection," in Proceedings of the International Conference on Software Engineering (ICSE) - Volume 2. IEEE, 2015, pp. 769–771.