Data Mining Techniques For Software Defect Prediction
1 author:
Dr. kavita
Jagannath University
All content following this page was uploaded by Dr. kavita on 19 June 2019.
Abstract- Quality and reliability are major challenges in a secure software development process. Major cost
overruns occur when a software product with bugs in its components is deployed at the client's side. The software
repository is commonly used as a record-keeping store that is consulted mostly when adding new features or fixing
bugs. Software bugs lead to inaccurate and inconsistent results. As a result, software projects run late, are
cancelled, or become unreliable after deployment. This paper presents a detailed literature review of software
failures in the recent past and briefly discusses their severe repercussions. To identify bugs in time, researchers
have made extensive use of data mining techniques alongside bug tracking systems to locate bugs precisely. The bug
tracking system plays a vital role in a software project, as poorly designed bug tracking systems are partly to
blame for delays in resolving problems. Many researchers have suggested ways to improve bug tracking systems. Bug
repositories are a major source of data, recording the history of successes and failures. This paper also discusses
approaches for converting software repositories into active repositories and how systematic mining uncovers the
modules most prone to defects and failures. Diverse social and technical issues are associated with software
failure, and software defects are the major cause of degraded product quality. Software defect prediction is among
the most active research areas in software engineering. Bug fix-time prediction models, pre-release and
post-release defects, and different metrics for predicting failures are discussed in this paper.
Keywords: Software repository, bug tracking system, software defect prediction model, software metrics.
Software engineering data contains a massive amount of information about the development and progress of a software project. Hassan and Xie [12] observed that stored software evolution data supports various aspects of software development within the industrial software development process. To produce high-quality software systems, researchers use data mining techniques to explore this data, manage projects better, and deliver within time and budget. Data mining techniques such as clustering, classification, and association rules, together with various statistical techniques, are used to extract actionable information from data sets. Data mining is applicable not only to marketing data, drug design, weather forecasting, etc., but has also been used by the software development industry to manage its development processes, as shown in Figure 1. Data mining techniques have been applied to software development repositories to gain a better understanding of software defects, to analyze defect patterns, and to predict defects in future development and testing processes. A number of data mining techniques have been developed for bug detection, prediction, and prevention. To accomplish the data mining job, various software tools are available to analyze large quantities of data and apply different data mining techniques [27].
Figure 1: Mining SE Data [2]
The mining of software repositories is among the upcoming research areas and has many research challenges to be addressed. Business intelligence techniques used in decision support systems can be applied, under the name of software intelligence, to mining software repositories in order to extract predictive information about software bugs in the development phase so
that they will not disrupt business activities. Hassan [28] explored further research problems in mining software repositories and software intelligence, which give future directions: (i) finding techniques to automate the extraction of information from repositories, and (ii) mining important information from these repositories. Hassan and Xie [12] discussed the challenges associated with mining software engineering data, highlighted its success stories, and outlined future research directions. Zimmermann et al. [6] showed that systematic mining uncovers which modules are most prone to defects and failures. Hassan [17] discussed software repositories such as historical repositories, run-time repositories, and code repositories, which can guide decision processes in modern software projects and at the same time uncover useful and important patterns and information. Working on open source projects, Chen et al. [13] and Hassan and Holt [23] demonstrated that historical information can assist developers in understanding large systems.
Software defects are the major cause of degraded product quality. Bug repositories are a major source of data that record the history of successes and failures, and the bug database is a rich source of information about software failures. Capers Jones [7] discussed how both technical and social issues are associated with software project failures. Social issues include the rejection of accurate estimates and projects being forced to adhere to essentially impossible schedules. Technical issues include the lack of modern estimating approaches and the failure to plan for requirements growth during development. Banks, insurance companies, and low-technology service companies tend to estimate using informal methods and also lack estimating tools for software project managers. The most common reason for schedule slippages, cost overruns, and outright cancellation of major systems is the presence of too many bugs or defects to operate successfully. Identifying defect-prone modules is a crucial task for management. Hassan and Holt [23] also discussed approaches for converting software repositories from static record-keeping repositories into active repositories that can be used to predict, plan, and understand various aspects of a project. Software reuse can increase productivity by avoiding redevelopment, which leads to fewer faults and less fault correction and eventually improves software quality. Thomas [11] proposed the use of statistical topic models such as Latent Dirichlet Allocation (LDA) to automatically discover structure in software repositories, since these repositories contain unstructured and unlabeled text that is difficult to analyze with traditional techniques. That work addressed the challenges of applying topic models to software repositories.
Software defects not only reduce software quality but also increase cost and delay the development schedule. Defects are of two types: pre-release defects and post-release defects. Pre-release defects are observed during development and testing of a program, while post-release defects are observed after the program has been deployed to its users. Schröter et al. [36] explain that post-release failures can be predicted from import relationships in design data. Nagappan and Ball [10] also explained in detail that post-release failures can be predicted by the number of dependencies within and across a component. Generally, software quality is judged by post-release bugs, with no accounting for bugs that remain dormant for years. Dormant bugs are bugs that were introduced in earlier versions of a system but not found until much later. Chen et al. [26] compared dormant bugs with non-dormant bugs. They found that dormant bugs are fixed faster than non-dormant bugs. Dormant bugs are mostly caused by corner cases and wrong control flows, so experienced developers are required to fix them.
New software bug prediction models need to be designed, and effective software defect metrics need to be synthesized and given as inputs to various data mining techniques for extracting classified information to predict software defects in new software versions. More developed methods are also needed to reduce software cost overruns [28]. Predicting the time and effort needed for a software problem has been a very difficult task.
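As an illustration of feeding defect metrics into a classification technique to flag defect-prone modules in a new version, the following minimal Python sketch trains a Gaussian Naive Bayes classifier on module-level metrics from an earlier release. It is not taken from any of the surveyed papers: the metric values, module names, labels, and the helper functions fit_gaussian_nb and predict are all hypothetical, and Gaussian Naive Bayes is chosen here only because it is small enough to write without external libraries.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a class prior and per-metric mean/variance for each label."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        stats = []
        for column in zip(*members):
            mean = sum(column) / n
            # A small floor keeps the variance strictly positive.
            var = sum((x - mean) ** 2 for x in column) / n + 1e-6
            stats.append((mean, var))
        model[label] = (n / len(rows), stats)
    return model


def predict(model, row):
    """Return the label with the highest log-posterior for one module."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical (CBO, WMC, RFC) values per module from an earlier release.
train_metrics = [
    (2, 5, 10), (3, 7, 12), (1, 4, 8), (2, 6, 11),          # no defects reported
    (9, 30, 45), (11, 28, 50), (8, 25, 40), (10, 35, 55),   # defects reported
]
train_labels = ["clean"] * 4 + ["defective"] * 4

model = fit_gaussian_nb(train_metrics, train_labels)

# Hypothetical modules of the new version whose defect status is unknown.
new_release = {"ModuleA": (2, 6, 9), "ModuleB": (10, 29, 47)}
predictions = {name: predict(model, metrics) for name, metrics in new_release.items()}
print(predictions)
```

On this toy data the sketch labels ModuleA as clean and ModuleB as defective. A real study would need far more data and an evaluation on a held-out release before drawing any conclusion.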
116 Ms. Tripti Lamba, Dr. Kavita, Dr A.K.Mishra
International Journal of Innovations & Advancement in Computer Science
IJIACS
ISSN 2347 – 8616
Volume 5, Issue 2
February 2016
Weiß et al. [8] specified an approach that estimates the time it takes to fix an issue, allowing for early effort estimation and helping in assigning issues and scheduling stable releases. They evaluated the approach on data from the JBoss project, which uses the Jira issue tracking system, and found that the estimates were within ±7 hours of the actual effort for an issue report. Fukushima et al. [10] reported that Just-In-Time (JIT) defect prediction is much more practical than traditional defect prediction techniques.
Software defect prediction is currently one of the most active research areas in software engineering. Defect prediction leads to reduced development time and cost, reduced rework effort, increased customer satisfaction, and more reliable software. Therefore, defect prediction practices are important for achieving software quality and learning from past mistakes [14]. Zimmermann et al. [1] researched how past defects can be used to predict software properties in the future. They studied why some modules are more defect-prone than others and what makes a module defect-prone. The bug database plays a major role, as it is a reliable source of failure information. However, bug databases do not contain all the information, such as how, where, and by whom the problem in question was fixed. Many techniques have been developed to fix bug reports [30, 31, 32].
Zimmermann et al. [2] investigated the quality of bug reports, considering aspects such as length of description, formatting, and attachments, by conducting a survey among developers and users of APACHE, ECLIPSE, and MOZILLA, and found a large difference between the developers and the users. Most of the surveyed developers had encountered duplicate bug reports and pointed out that they were a good source of information for resolving bugs more quickly; only a few considered them a serious problem [24].
Zimmermann et al. [25] categorized the reasons why bugs are reopened and then focused on how these bugs are handled. They constructed statistical models that describe the impact of various metrics on reopened bugs, ranging from the opener's reputation to how the bug was found. The bug tracking system plays a vital role in a software project. It supports cooperation between the developers and the users of the software. It is a medium through which developers and users interact, and poorly designed bug tracking systems are partly to blame for delays in resolving problems. Bettenburg et al. [24] also pointed out that many duplicates are submitted because of shortcomings of bug tracking systems. Zimmermann et al. [21] concluded that current bug tracking systems need to be improved so that bugs are resolved on time by giving current information effectively to developers. They conducted a study in which they simulated an interactive bug tracking system that asks the user context-sensitive questions to extract relevant information about the bug to be fixed, which speeds up bug resolution. The role of users is not simply to report bugs; their active, ongoing participation is required. Breu et al. [20] suggested ways to improve the bug tracking system. Users can also be involved in the bug-fixing process, i.e. users can give valuable input on how to fix a bug, which may result in fixing bugs faster and more efficiently. Ostrand et al. [16] proposed a new tool to automate the prediction process. Their main goal is to develop an automated tool that can find project defects and can be used by testers without requiring any particular statistical expertise or subjective judgments. New bug tracking tools have been used to improve the quality of bug reports by adding new information, assisting collection, and reporting crucial information to developers. Just and Zimmermann [5] presented seven recommendations for the design of new bug tracking systems.
A software engineer fixes a bug whenever a system does not work as intended. There has been no empirical research on how bug fixes are designed. Murphy-Hill et al. [19] sought to understand the ways a bug can be fixed. They described a study that combined opportunistic interviews, firehouse interviews, meeting observation, and a survey. Their results describe a multi-dimensional design space for bug fixes. AbdelMoez [18] used a bug fix-time prediction model based on the Naïve Bayes algorithm, which requires a large dataset, to predict the fix time of newly reported bugs; this helps developers prioritize which bugs to fix first using data mining techniques. Some bugs take less than a few minutes while other bugs take years to get
fixed. The paper proposed a filtering step that removes outliers to improve the quality of the prediction models. The authors performed a data mining experiment and found that the filtering step allowed relatively more accurate prediction models. Guo et al. [22] found that reassignment is not always harmful and is, in fact, useful for determining the best person to fix a bug. They set out five primary reasons for reassignment, including corroboration of the root cause, proper fixes that are hard to determine, and improper workload balancing. They noted that there is little tool support in current bug tracking systems for directing reassignments efficiently.
To improve software quality, fault prediction models are used, and software metrics have been proposed for software fault prediction. Taba et al. [15] studied the use of antipatterns for bug prediction. They found that the accuracy of bug prediction models can be improved by metrics based on antipatterns. Fukushima et al. [10] reported that Just-In-Time (JIT) defect prediction is a more practical approach than traditional defect prediction techniques. JIT models require a large amount of historical training data to perform well. They found that JIT models learned from other projects (cross-project models) are a viable solution for projects with little historical data. However, such models perform best when the data used to learn them is carefully selected. In their paper, they applied the random forest algorithm to cross-project JIT prediction and found that random forest produces robust, highly accurate, stable models that are especially resilient to noisy data.
Researchers have discussed the performance of different types of metrics: object-oriented (OO) metrics, complexity metrics, process metrics, and traditional metrics. Zimmermann et al. [29] discussed the use of complexity metrics or historical data to predict failures. Hassan [9] also argued that complexity metrics are better predictors of fault potential than other well-known historical predictors. He concluded that complexity models can predict future faults much better for large software systems than prior modifications or prior faults. OO metrics are useful for predicting defect density, as discussed by Basili et al. [33]. Subramanyam and Krishnan [34] showed in their survey that OO metrics are significantly associated with defects. Nagappan and Ball [11] showed that software defect density can be predicted using relative code churn. Zimmermann et al. [29] found that defects can be predicted by complexity metrics, which raises a follow-up question: are there better indicators for defect prediction than complexity metrics? Rawat and Dubey [14] suggested many software defect models that can save time, money, and resources. Danijel et al. reported that OO metrics performed better than complexity metrics, and that models using process metrics evaluated on post-release software performed the worst. It was found that object-oriented metrics were used twice as often as traditional metrics or process metrics. The CK metrics are the most successful and recently used metrics among OO metrics. Not all CK metrics perform equally well; the best metrics from the CK suite are CBO (Coupling Between Objects), WMC (Weighted Methods per Class), and RFC (Response For a Class). LCOM (Lack of Cohesion in Methods) from the CK suite has been used in small pre-release settings but was not successful in finding faults.
Prevention is better than cure, and this is true of the software process as well. Kumaresh and Baskaran [3] argued that defect prevention is a kind of investment, as it leads to a quality project. They observed that many defects emerge during software development, so to improve software process quality, defects are identified from a given set of projects and then classified and analyzed for patterns. They also noted that earlier defect prevention focused on defect prediction and on the team size required to complete the project on time, and that much effort and time was required for debugging to eliminate errors. Chenbin [4] introduced a tool called Bug Tracing System (BTS) for defect tracking, which is popular and inexpensive and gives optimal accuracy for tracking the identified defects. The defect classification approach developed by IBM, called Orthogonal Defect Classification (ODC), is described as the best technique for finding defects. ODC classifies defects in two sections: the opener section, covering the time when the defect was first detected, and the closer section, covering the time when the defect was fixed. The main purpose of defect prevention is to identify the causes of defects and prevent them from recurring, which can reduce development time and cost, hence
increasing customer satisfaction and reducing rework effort, thereby decreasing cost and gradually improving product quality.
Conclusion
The bug database is a rich source of information about software failures. New software bug prediction models need to be designed, and effective software defect metrics need to be synthesized and given as inputs to various data mining techniques for extracting classified information to predict software defects in new software versions. More developed methods are also needed to reduce software cost overruns. Just-In-Time (JIT) defect prediction was found to be a more practical approach than traditional defect prediction techniques. It was observed that antipatterns can help predict bugs and that files participating in antipatterns have higher bug density than other files. Bug reports are crucial information for developers. This paper has suggested a simple and easy technique for searching bug reports.
References
[1] T. Zimmermann, N. Nagappan, and A. Zeller, “Predicting Bugs from History,” 2008.
[2] T. Zimmermann, R. Premraj, N. Bettenburg, C. Weiss, S. Just, and A. Schröter, “What Makes a Good Bug Report?,” IEEE Trans. Softw. Eng., vol. 36, no. 5, pp. 618–643, 2010.
[3] S. Kumaresh and R. Baskaran, “Defect Analysis and Prevention for Software Process Quality Improvement,” Int. J. Comput. Appl., vol. 8, no. 7, pp. 42–47, 2010.
[4] T. Pann, L. Zheng, and C. Fang, “Defect Tracing System Based on Orthogonal Defect Classification,” Computer Engineering and Applications, vol. 43, pp. 9–10, 2008.
[5] S. Just and T. Zimmermann, “Towards the Next Generation of Bug Tracking Systems,” Proc. 2008 IEEE Symp. Vis. Lang. Human-Centric Comput. (VL/HCC 2008), pp. 82–85, 2008.
[6] T. Zimmermann, N. Nagappan, and A. Zeller, “Predicting Bugs from History,” Software Evolution, Springer Berlin Heidelberg, pp. 69–88, 2008.
[7] C. Jones, “Social and Technical Reasons for Software Project Failures,” The Journal of Defense Software Engineering, vol. 19, no. 6, 2006.
[8] C. Weiß, T. Zimmermann, and A. Zeller, “Predicting Effort to Fix Software Bugs,” Comput. Inf. Sci., pp. 2–3, 2006.
[9] A. E. Hassan, “Predicting Faults Using the Complexity of Code Changes,” Proc. 31st Int. Conf. Softw. Eng., pp. 78–88, 2009.
[10] T. Fukushima, Y. Kamei, S. McIntosh, K. Yamashita, and N. Ubayashi, “An Empirical Study of Just-in-Time Defect Prediction using Cross-Project Models,” pp. 172–181, 2014.
[11] S. W. Thomas, “Mining Software Repositories Using Topic Models,” Mach. Learn., 2011.
[12] A. E. Hassan and T. Xie, “Mining Software Engineering Data,” Proc. 32nd ACM/IEEE Int. Conf. Softw. Eng. (ICSE '10), vol. 2, ACM, New York, NY, USA, pp. 503–504, 2010. DOI: 10.1145/1810295.1810451.
[13] A. Chen, E. Chou, J. Wong, A. Y. Yao, Q. Zhang, S. Zhang, and A. Michail, “CVSSearch: Searching through Source Code using CVS Comments,” Proc. 17th Int. Conf. Softw. Maintenance, Florence, Italy, pp. 364–374, 2001.
[14] M. S. Rawat and S. K. Dubey, “Software Defect Prediction Models for Quality Improvement: A Literature Study,” IJCSI Int. J. Comput. Sci. Issues, vol. 9, no. 5, 2012, ISSN 1694-0814.
[15] S. E. S. Taba, F. Khomh, Y. Zou, A. E. Hassan, and M. Nagappan, “Predicting Bugs Using Antipatterns,” IEEE Int. Conf. Softw. Maintenance (ICSM), pp. 270–279, 2013.
[16] T. J. Ostrand and E. J. Weyuker, “A Tool for Mining Defect-Tracking Systems to Predict Fault-Prone Files.”
[17] A. E. Hassan, “The Road Ahead for Mining Software Repositories,” Front. Softw. Maint., pp. 48–57, 2008.
[18] W. Abdelmoez, M. Kholief, and F. M. Elsalmy, “Improving Bug Fix-Time Prediction Model by Filtering out Outliers,” Int. Conf. Technol. Adv. Electr. Electron. Comput. Eng., pp. 359–364, 2013.
[19] E. Murphy-Hill, T. Zimmermann, and C. Bird, “The Design of Bug Fixes,” 2013.
[20] S. Breu, R. Premraj, J. Sillito, and T. Zimmermann, “Information Needs in Bug Reports: Improving Cooperation Between Developers and Users,” Proc. 2010 Comput. Support. Coop. Work Conf., pp. 301–310, 2010.
[21] T. Zimmermann and S. Breu, “Improving Bug Tracking Systems,” 31st Int. Conf. Softw. Eng. - Companion Vol., pp. 247–250, 2009.
[22] P. J. Guo, “‘Not My Bug!’ and Other Reasons for Software Bug Report Reassignments,” Comput. Inf. Sci., 2011.
[23] A. E. Hassan, A. Mockus, R. C. Holt, and P. M. Johnson, “Guest Editors' Introduction: Special Issue on Mining Software Repositories,” IEEE Trans. Softw. Eng., vol. 31, no. 6, pp. 426–428, 2005.
[24] N. Bettenburg, T. Zimmermann, and S. Kim, “Duplicate Bug Reports Considered Harmful ... Really?,” IEEE Int. Conf. Softw. Maintenance (ICSM), pp. 337–345, 2008.
[25] T. Zimmermann and P. J. Guo, “Characterizing and Predicting Which Bugs Get Reopened,” Proc. 34th Int. Conf. Softw. Eng., pp. 1074–1083, 2012.
[26] T. Chen, M. Nagappan, E. Shihab, and A. E. Hassan, “An Empirical Study of Dormant Bugs,” ACM, vol. 14, no. 05, 2014.
[27] http://en.wikipedia.org/wiki/Data_mining#Data_mining
[28] A. E. Hassan, “Software Intelligence: The Future of Mining Software Engineering Data,” Proc. FSE/SDP Work. Futur. Softw. Eng. Res., pp. 161–165, 2010.
[29] T. Zimmermann, R. Premraj, and A. Zeller, “Predicting Defects for Eclipse,” Third Int. Work. Predict. Model. Softw. Eng. (PROMISE'07, ICSE Work. 2007), pp. 9–9, 2007.
[30] D. Cubranic and G. C. Murphy, “Hipikat: Recommending Pertinent Software Development Artifacts,” 25th Int. Conf. Softw. Eng. (ICSE), Portland, Oregon, pp. 408–418, 2003.
[31] M. Fischer, M. Pinzger, and H. Gall, “Populating a Release History Database from Version Control and Bug Tracking Systems,” Proc. Int. Conf. Softw. Maintenance (ICSM 2003), Amsterdam, Netherlands, 2003.
[32] J. Śliwerski, T. Zimmermann, and A. Zeller, “When Do Changes Induce Fixes? On Fridays,” Proc. Int. Workshop on Mining Software Repositories (MSR), St. Louis, Missouri, U.S., 2005.
[33] V. R. Basili, L. C. Briand, and W. L. Melo, “A Validation of Object-Oriented Design Metrics as Quality Indicators,” IEEE Trans. Softw. Eng., vol. 22, pp. 751–761, 1996.
[34] R. Subramanyam and M. S. Krishnan, “Empirical Analysis of CK Metrics for Object-Oriented Design Complexity: Implications for Software Defects,” IEEE Trans. Softw. Eng., vol. 29, pp. 297–310, 2003.
[35] A. B. Binkley and S. R. Schach, “Validation of the Coupling Dependency Metric as a Predictor of Run-Time Failures and Maintenance Measures,” Proc. Int. Conf. Softw. Eng., pp. 452–455, 1998.
[36] A. Schröter, T. Zimmermann, and A. Zeller, “Predicting Failure-Prone Components at Design Time,” Proc. 5th Int. Symp. Empirical Softw. Eng. (ISESE 2006), Rio de Janeiro, Brazil, 2006.