An Evaluation of Source Code Mining Techniques
GUOHUI LI
School of Computer and Applied Technology
Huazhong University of Science & Technology (HUST)
Wuhan, PR China
Abstract: This paper reviews the tools and techniques that rely only on data mining methods to extract patterns from source code, such as programming rules, copy-pasted code segments, and API usage. The work compares and evaluates the current state of the art in source code mining techniques. Furthermore, it identifies the essential strengths and weaknesses of individual tools and techniques to make an evaluation indicative of future potential.

Previous related work focuses on only one specific kind of mined pattern, such as a particular kind of bug. Consequently, multiple tools are needed to test software and extract potentially useful information from it, which increases the cost and time of development. Hence there is a strong need for a tool that helps in developing quality software by automatically detecting different kinds of bugs in one pass and that also supports code reuse by developers.

Keywords: Source code mining; literature review; programming rule; copy-paste code; API usage

I. INTRODUCTION

The primary goal of software development is to deliver high-quality software in the least amount of time. To achieve this goal, software engineers are increasingly applying data mining algorithms to various software engineering tasks [1] to improve software productivity and quality.

To deliver high-quality software, automatic bug detection remains one of the most active areas in software engineering research. Practitioners desire tools that automatically detect bugs and flag their locations in the current code base so that they can be fixed. In this direction, much work has been done to develop tools and techniques that analyze large amounts of data about a software application, such as its source code, to uncover the dominant behavior or patterns and to flag variations from that behavior as possible bugs. One line of research in this direction is rule mining, which induces a set of rules from existing projects that can be used to improve subsequent development or the development of new projects.

Another prominent application of source code mining is clone detection. Developers often reuse code fragments by copying and pasting them (clone code), with or without minor adaptation, to reduce programming effort and shorten development time. Reuse also increases productivity, since the code has already been tested and is less likely to have defects. However, cloned code may cause maintainability problems: when a cloned code fragment needs to be changed, for example because of a changed requirement or an additional feature, all fragments similar to it should be checked for the same change. Moreover, handling duplicated code can be very problematic because an error in one component is reproduced in every copy. This problem has drawn researchers' attention to the development of clone detection tools, which allow developers to automatically find the locations in the code that must be changed when a related code segment changes.

Another line of related research is how to write API code. A software system interacts with third-party libraries through various APIs. Using these library APIs often requires following certain usage patterns. These patterns aid developers in addressing commonly faced programming problems, such as what checks should precede or follow API calls, how to use a given set of APIs for a given task, or what API method sequence should be used to obtain one object from another.

In this paper, we provide a comprehensive comparison and evaluation of the currently available source code mining techniques and tools in the context of mining rules, detecting copy-pasted code, and API usage. This work not only contributes to source code mining research but also exposes how challenging it is to compare different tools, owing to the diverse nature of the techniques and target languages. To date, all previous evaluation studies consider only one aspect of mining techniques, such as clone detection or rule extraction, and no comparative evaluation is available of techniques that detect various kinds of patterns in source code in one pass. We aim to identify the essential strengths and weaknesses of individual tools and techniques to make an evaluation indicative of future potential, e.g., when one aims to develop a new integrated or hybrid technique that addresses multiple challenges in one tool rather than presenting yet another new tool.

The rest of this paper is organized as follows. After introducing some background on software mining in Section I, we provide a comprehensive literature review in Section II. Section III presents an overall evaluation of source code mining tools and techniques in terms of a taxonomy. Section IV compares the source code mining approaches, and Section V concludes the paper.
usage history to identify method calls in the form of frequent subsequences. The code search engine receives a query that describes a method, class, or package for an API and then searches open source repositories for source files that are relevant to the query. The code analyzer analyzes the relevant source files and produces a set of method call sequences. The sequence preprocessor inlines some call sequences into others based on caller-callee relationships and removes call sequences that are irrelevant to the given query. The frequent-sequence miner discovers frequent sequences from the preprocessed sequences. The frequent-sequence postprocessor then reduces the set of frequent sequences.

Sahavechaphan and Claypool [15] developed XSnippet, a context-sensitive code assistant tool that allows developers to query a sample code repository for code snippets relevant to the programming task at hand. A range of instantiation queries can be invoked from the Java editor, from the generic query TQG, which returns all possible code snippets for the instantiation of a type, to the specialized type-based query TQT and parent-based query TQP, which return type-relevant and parent-relevant results, respectively. The user supplies the type of query, the code context in which the query is invoked, and a specific code model instance to the graph-based XSnippet system. The mining algorithm BFSMINE, a breadth-first mining algorithm, traverses the code model instance and produces as output the final code snippets that meet the requirements of the specified query.

PARSEWeb, developed by S. Thummalapenta and T. Xie [16], uses Google Code Search to collect relevant code snippets and mines the returned snippets to find solution jungloids. The desired code is described by a query of the form “Source → Destination”. PARSEWeb searches for code samples relevant to the source and destination object types and downloads them to form a local source code repository, which is analyzed to construct a directed acyclic graph. PARSEWeb identifies nodes that contain the given Source and Destination object types and extracts Method-Invocation Sequences (MISs). It then clusters similar MISs using a sequence postprocessor. The final MISs are sorted using several ranking heuristics and serve as solutions for the given query. PARSEWeb also uses an additional heuristic called query splitting, which addresses the problem where code samples for the given query are split among different source files.

III. TAXONOMY OF SOURCE CODE MINING TECHNIQUES

This section analyzes the previously mentioned research contributions based on criteria that capture the main features of each technique.

Each approach develops a supporting tool as a plug-in for the programming environment. Source code is provided as input to the tool, which applies a data mining technique to detect frequently co-occurring patterns. Source code comprises different elements such as functions, classes, variables, and data types. The criterion Input shows which elements of the source code are used as input by the data mining tool. The criterion Output indicates which type of mined information is extracted by the tool developed by each approach, e.g., programming rules, copy-paste code, or API usage. The criterion Technique names the algorithm used by the tool; source code mining research draws on different algorithms from the data mining domain. Finally, the criterion Open Issues indicates the research challenges not addressed by a specific tool or technique. Table I shows the overall analysis of techniques and tools.

IV. COMPARISON OF SOURCE CODE MINING APPROACHES

Both the work of Engler et al. and PR-Miner discover patterns involving pairs or sets of method calls, functions, variables, and data types that frequently appear in the same methods. These patterns do not contain control structures or conditions, and the order of method calls is not considered. However, compared with the work of Engler et al., which extracts only function-pair-based rules, PR-Miner extracts substantially more rules by also extracting rules about variable correlations. Moreover, PR-Miner requires its full parser to be replaced in order to work with other programming languages. CHRONICLER [4] differs fundamentally from PR-Miner in that it ensures path sensitivity and hence generates fewer false negatives than PR-Miner. It differs from the approach of Engler et al. in that it computes the precedence relationship based on the program's control flow structure, whereas the approach of Engler et al. detects relations between pairs of functions by exploiting all possible paths. MUVI [6] mines variable correlations and generates variable-pairing rules. Engler et al. [2] also detect variable inconsistencies, through logical reasoning, whereas MUVI [6] detects inconsistencies using pattern analysis on multi-variable access correlations.

Dup [8] uses an order-sensitive indexing scheme to normalize code for the detection of consistently renamed, syntactically identical clones, whereas CCFinder [7] applies additional transformations of the source code that actually change the structure of the code, so that minor variations of the same syntactic form are treated as similar. However, token-by-token matching is more expensive than line-by-line matching in terms of computational complexity, since a single line is usually composed of several tokens. Dup, CCFinder, and CloneDetection identify cloned code, which can help software maintenance by identifying sections of code that should be replaced by a procedure, but they do not detect copy-paste-related bugs. CP-Miner [11], on the other hand, detects copy-paste-related bugs. Compared to CCFinder, CP-Miner is able to find 17.52% more copy-pasted segments because it can tolerate statement insertions and modifications. Graph-based analysis [10], in turn, can capture more complicated changes, such as statement reordering, insertion, and control replacement, than the common token-based approaches, because it captures the software's inherent logic relationships through the program dependence graph (PDG).

Different mining techniques have been proposed in the literature to provide sample code; they differ in the means a developer uses to retrieve relevant examples from the repository. For example, Strathcona [18] forms a query from the structural context, which is extracted automatically from the code a developer is writing. XSnippet [15] uses class structure information, such as the parents, fields, and methods of a class, to define the code context used to query a sample repository for code snippets relevant to the object instantiation task at hand. Prospector [13], PARSEWeb [16], and MAPO [17] define a query that describes the desired code.
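The difference between order-sensitive identifier indexing and plain token replacement, as contrasted above for Dup and CCFinder, can be illustrated with a small sketch. This is not the implementation of either tool: the tokenizer, the keyword list, and the function names `positional_normalize` and `blind_normalize` are assumptions made for illustration only. Indexing each identifier by the position of its first occurrence makes consistently renamed fragments compare equal, while replacing every identifier with the same placeholder also matches fragments that were renamed inconsistently.

```python
import re

# Hypothetical, deliberately tiny keyword list and tokenizer (illustration only).
KEYWORDS = {"if", "else", "for", "while", "return", "int", "float", "void"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Split a code fragment into identifiers, numbers, and single symbols."""
    return TOKEN_RE.findall(code)


def is_identifier(token):
    return re.match(r"[A-Za-z_]", token) is not None and token not in KEYWORDS


def positional_normalize(code):
    """Replace each identifier with the index of its first occurrence."""
    first_seen = {}
    normalized = []
    for token in tokenize(code):
        if is_identifier(token):
            first_seen.setdefault(token, len(first_seen))
            normalized.append(f"${first_seen[token]}")
        else:
            normalized.append(token)
    return normalized


def blind_normalize(code):
    """Replace every identifier with the same placeholder."""
    return ["$" if is_identifier(t) else t for t in tokenize(code)]


original = "total = total + price * count;"
consistent = "sum = sum + cost * n;"        # every identifier renamed consistently
inconsistent = "sum = total + cost * n;"    # one occurrence left unrenamed

same_when_consistent = positional_normalize(original) == positional_normalize(consistent)
same_when_inconsistent = positional_normalize(original) == positional_normalize(inconsistent)
blind_match = blind_normalize(original) == blind_normalize(inconsistent)
```

In this sketch the positional form separates the consistently renamed copy from the inconsistently renamed one, whereas the blind form treats all three fragments as the same clone. A fragment pair that matches only under the blind form is a natural candidate for manual inspection as a possible copy-paste error.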
TABLE I. TAXONOMY OF SOURCE CODE MINING TOOLS AND TECHNIQUES

| Category | Approach | Tool | Input | Output | Technique | Open Issues | Ref. |
|---|---|---|---|---|---|---|---|
| Rule mining | Need rules to check against program code by inferring code beliefs and cross-checking for contradiction | Static Analyzer | Functions | Pair-wise programming rules | Statistical analysis | Fixed rule templates; only identifies pair-wise programming rules | [2] |
| Rule mining | Frequent itemset mining for pair-wise, multi-function, and variable correlation rules | PR-Miner | Functions, variables, and data types | Pair-wise, complex, and variable correlation rules | Itemset mining | Does not consider inter-procedural analysis, data flow, and control relationships | [3] |
| Rule mining | Frequent subsequence mining to infer function precedence protocols | CHRONICLER | Functions | Function call ordering rules | Frequent subsequence mining | Does not take account of data flow or data dependence | [4] |
| Rule mining | Graph-based mining to search for conditional rules | Framework | Program Dependence Graphs | Graph minors as conditional rules | Frequent itemset and sub-graph mining algorithm | Requires manual inspection for valid rules, which may miss some instances of rules during inspection | [5] |
| Rule mining | Frequent itemset mining to extract variable correlations | MUVI | Functions; global, class, and structural variables | Variable-pairing rules | Frequent itemset mining | Only handles variable accesses made directly by caller functions | [6] |
| Detecting copy-paste code | Suffix trees for tokens per line | Dup | Sequence of lines | Line-by-line clones | Suffix-tree-based matching | Does not detect cloned code portions having different syntax but similar meaning | [8] |
| Detecting copy-paste code | Token normalization, then suffix-tree-based search | CCFinder | Sequence of tokens | Clone pairs | Token comparison | Does not detect changes such as statement reordering, insertion | [7] |
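The rule-mining entries of Table I rest on the idea that elements which frequently occur together form an implicit rule, and that code breaking a strong rule is a bug candidate. The following minimal sketch shows that idea restricted to pairs of function calls. It is a simplification for illustration, not the algorithm of PR-Miner or of any other listed tool; the input format, the thresholds `min_support` and `min_confidence`, and all function names are assumptions.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(functions, min_support=3, min_confidence=0.75):
    """Mine rules "lhs implies rhs" from the call sets of function bodies.

    functions: dict mapping a function name to the set of functions it calls
    (a hypothetical, already-parsed representation of the source code).
    """
    item_count = Counter()
    pair_count = Counter()
    for calls in functions.values():
        item_count.update(calls)
        # Sort so that each unordered pair has one canonical key.
        pair_count.update(combinations(sorted(calls), 2))

    rules = []
    for (a, b), support in pair_count.items():
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = support / item_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


def find_violations(functions, rules):
    """Report functions that contain a rule's left side but not its right side."""
    return sorted(
        (name, lhs, rhs)
        for name, calls in functions.items()
        for lhs, rhs, _, _ in rules
        if lhs in calls and rhs not in calls
    )


# Toy input: "flush" acquires the lock but never releases it.
functions = {
    "read_cfg": {"lock", "unlock", "read"},
    "write_cfg": {"lock", "unlock", "write"},
    "reset": {"lock", "unlock"},
    "update": {"lock", "unlock", "read", "write"},
    "flush": {"lock", "write"},
}

rules = mine_pair_rules(functions)
violations = find_violations(functions, rules)
print(violations)
```

On this toy input the only reported violation is `flush`, which calls `lock` without `unlock`. Because the sketch looks only at which calls co-occur inside one function body, it shares the open issues listed for itemset-based tools: call order, control flow, and inter-procedural behavior are ignored.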
V. CONCLUSION

In this paper we have provided a concise but comprehensive survey of three types of source code mining tools and techniques: mining rules, copy-paste code, and API usage. So far, this is the first survey that covers a combination of different techniques. The comparison of techniques and tools shows that no single tool is superior to all others in all aspects, because every tool has strengths and weaknesses and is intended for different tasks and contexts. However, considering these three kinds of source code mining techniques together helps one understand how to design a hybrid or integrated technique that is robust across all types of software patterns and that can help with bug detection as well as help developers write relevant API code. The comparison also shows how to employ a set of different tools to achieve better results.

In the future, we are going to develop an integrated framework that can automatically find all the patterns in source code in one pass, suggesting potential bug locations to developers for quality software development and relevant code for rapid software development.

REFERENCES

[1] A. Hassan and T. Xie, “Mining software engineering data,” in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 2, 2010, pp. 503-504.
[2] D. Engler, D. Chen, S. Hallem et al., “Bugs as deviant behavior: A general approach to inferring errors in systems code,” ACM SIGOPS Operating Systems Review, vol. 35, no. 5, pp. 57-72, 2001.
[3] Z. Li and Y. Zhou, “PR-Miner: Automatically extracting implicit programming rules and detecting violations in large software code,” in Proceedings of the 10th European software engineering conference held jointly with 13th ACM SIGSOFT international symposium on Foundations of software engineering, 2005, pp. 306-315.
[4] M. Ramanathan, A. Grama, and S. Jagannathan, “Path-sensitive inference of function precedence protocols,” in 29th International Conference on Software Engineering (ICSE 2007), 2007, pp. 240-250.
[5] R. Chang, A. Podgurski, and J. Yang, “Finding what's not there: a new approach to revealing neglected conditions in software,” in Proceedings of the 2007 international symposium on Software testing and analysis, 2007, pp. 163-173.
[6] S. Lu, S. Park, C. Hu et al., “MUVI: automatically inferring multi-variable access correlations and detecting related semantic and concurrency bugs,” ACM SIGOPS Operating Systems Review, vol. 41, no. 6, pp. 103-116, 2007.
[7] T. Kamiya, S. Kusumoto, and K. Inoue, “CCFinder: a multilinguistic token-based code clone detection system for large scale source code,” IEEE Transactions on Software Engineering, pp. 654-670, 2002.
[8] B. Baker, “On finding duplication and near-duplication in large software systems,” in Proc. Second IEEE Working Conf. Reverse Eng., 1995, pp. 86-95.
[9] V. Wahler, D. Seipel, J. Wolff et al., “Clone detection in source code by frequent itemset techniques,” in Fourth IEEE International Workshop on Source Code Analysis and Manipulation, 2004, pp. 128-135.
[10] W. Qu, Y. Jia, and M. Jiang, “Pattern mining of cloned codes in software systems,” Information Sciences, 2010.
[11] Z. Li, S. Lu, S. Myagmar et al., “CP-Miner: A tool for finding copy-paste and related bugs in operating system code,” in Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation-Volume 6, 2004, p. 20.
[12] M. Acharya, T. Xie, J. Pei et al., “Mining API patterns as partial orders from source code: from usage scenarios to specifications,” in Proceedings of the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, 2007, pp. 25-34.
[13] D. Mandelin, L. Xu, R. Bodík et al., “Jungloid mining: helping to navigate the API jungle,” ACM SIGPLAN Notices, vol. 40, no. 6, pp. 48-61, 2005.
[14] A. Michail, “Data mining library reuse patterns using generalized association rules,” in Proceedings of 22nd International Conference on Software Engineering (ICSE'00), Limerick, Ireland, 2000, pp. 167-176.
[15] N. Sahavechaphan and K. Claypool, “XSnippet: mining for sample code,” ACM SIGPLAN Notices, vol. 41, no. 10, pp. 413-430, 2006.
[16] S. Thummalapenta and T. Xie, “Parseweb: a programmer assistant for reusing open source code on the web,” in Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering, 2007, pp. 204-213.
[17] T. Xie and J. Pei, “MAPO: Mining API usages from open source repositories,” in Proceedings of the 2006 international workshop on Mining software repositories, 2006, pp. 54-57.
[18] R. Holmes and G. C. Murphy, “Using structural context to recommend source code examples,” in Proceedings of the 27th international conference on Software engineering, 2005, pp. 117-125.