2011 Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD)

An Evaluation of Source Code Mining Techniques


Shaheen Khatoon, Azhar Mahmood, and Guohui Li
School of Computer and Applied Technology
Huazhong University of Science & Technology (HUST)
Wuhan, PR China

Abstract: This paper reviews tools and techniques that rely solely on data mining methods to extract patterns from source code, such as programming rules, copy-pasted code segments, and API usage. It compares and evaluates the current state of the art in source code mining and identifies the essential strengths and weaknesses of individual tools and techniques, so that the evaluation is indicative of future potential.

Previous related work focuses on a single kind of pattern, such as one particular class of bug. Consequently, multiple tools are needed to test software and extract useful information from it, which increases development cost and time. There is therefore a strong need for a tool that helps build quality software by automatically detecting different kinds of bugs in one pass and that also supports code reuse by developers.

Keywords: Source code mining; literature review; programming rule; copy-paste code; API usage

978-1-61284-181-6/11/$26.00 ©2011 IEEE

I. INTRODUCTION

The primary goal of software development is to deliver high-quality software in the least amount of time. To achieve this goal, software engineers increasingly apply data mining algorithms to various software engineering tasks [1] to improve software productivity and quality.

Automatic bug detection remains one of the most active areas in software engineering research. Practitioners want tools that automatically detect bugs and flag their locations in the current code base so that they can be fixed. Much work in this direction has produced tools and techniques that analyze large amounts of data about a software application, such as its source code, to uncover the dominant behavior or patterns and to flag deviations from that behavior as possible bugs. One line of research is rule mining, which induces a set of rules from existing projects that can be used to improve subsequent or new development.

Another major application of source code mining is clone detection. Developers often reuse code fragments by copying and pasting them (cloned code), with or without minor adaptation, to reduce programming effort and shorten development time. Cloning can also increase productivity, since the code has already been tested and is less likely to contain defects. However, cloned code can cause maintainability problems: when a cloned fragment must change, for example because of a changed requirement or an additional feature, all fragments similar to it have to be checked for the same change. Moreover, an error in one component is reproduced in every copy. These problems have drawn researchers toward clone detection tools, which let developers automatically find the locations that must be changed when a related code segment changes.

A further line of related research concerns how to write API client code. A software system interacts with third-party libraries through various APIs, and using these APIs often requires following certain usage patterns. Such patterns help developers with commonly faced programming problems, such as which checks should precede or follow API calls, how to use a given set of APIs for a given task, or which API method sequence should be used to obtain one object from another.

In this paper, we provide a comprehensive comparison and evaluation of currently available source code mining techniques and tools for mining rules, detecting copy-pasted code, and mining API usage. This work not only contributes to source code mining research but also exposes how challenging it is to compare different tools, owing to the diversity of techniques and target languages. To date, previous evaluation studies have considered only one aspect of mining, such as clone detection or rule extraction, and no comparative evaluation is available of techniques that detect various kinds of patterns from source code in one pass. We aim to identify the essential strengths and weaknesses of individual tools and techniques so that the evaluation is indicative of future potential, for example when one aims to develop a new integrated or hybrid technique that addresses multiple challenges in one tool rather than presenting yet another single-purpose tool.

The rest of this paper is organized as follows. After the background on software mining in Section I, Section II provides a comprehensive literature review. Section III presents an overall evaluation of source code mining tools and techniques in the form of a taxonomy. Section IV compares the existing techniques. Finally, Section V concludes the paper and suggests directions for future work.

II. RELATED WORK

A. Mining rules from source code

Rule mining techniques induce a set of rules from existing projects that can be used to improve subsequent or new development. Several methods have been proposed to detect rule-violating defects. Most studies use static source code analysis to find programming rules and then report rule violations as bugs. For example, the approach of Engler et al. [2] and PR-Miner [3] mine function-pairing rules, CHRONICLER [4] mines function precedence protocols, Chang et al. [5] mine conditional rules, and MUVI [6] mines variable-pairing rules.

The approach of Engler et al. [2] mines function-pairing rules by using compiler extensions called checkers to match rule templates. The tool extracts programming beliefs from acts at different locations in the source code, exploring all possible paths between function calls, and cross-checks for violated beliefs. Because the approach relies on developers to supply rule templates, such as "function A must be paired with function B", and covers only explicit rules known in advance, it may miss many violations of implicit rules.

PR-Miner, developed by Li and Zhou [3], finds implicit programming rules and rule violations using frequent itemset mining and does not require rule templates. It can detect simple function-pair rules, complex rules, and variable correlation rules. It computes associations among program elements simply by counting how often any two elements occur together, without considering data flow or control flow, which increases the number of false negatives for violations along a control path.

CHRONICLER, developed by Ramanathan et al. [4], applies inter-procedural, path-sensitive static analysis to automatically infer accurate function precedence protocols, which specify the ordering among function calls. It differs fundamentally from PR-Miner in that it ensures path sensitivity and therefore produces fewer false negatives.

Chang et al. [5] proposed an approach that mines implicit conditional rules and detects neglected conditions by applying frequent subgraph mining. The approach requires the user to indicate minimal constraints on the context of the rules to be sought, rather than specific rule templates. However, the frequent subgraph mining algorithm does not handle directed graphs and multigraphs; the graphs must therefore be modified, which loses information and sacrifices precision in rule discovery.

MUVI, developed by Lu et al. [6], mines variable-pairing rules. It applies frequent itemset mining to automatically detect multi-variable inconsistent-update bugs and multi-variable concurrency bugs, which may result from inconsistent updates of correlated variables. The work of Engler et al. [2] also detects variable inconsistency, but through logical reasoning, whereas MUVI detects inconsistencies through pattern analysis of multi-variable access correlations.

B. Detecting copy paste code

Several automated techniques for detecting code clones have been proposed; they differ in the unit of comparison, which ranges from single source lines to entire AST/PDG subtrees or subgraphs. Here we focus on techniques that use data mining and on a few other leading clone detection techniques, such as CCFinder [7] and Dup [8], which tokenize the source code. Dup detects two types of matching code: code that is exactly the same, and code in which the names of parameters such as variables and constants are substituted. CCFinder detects cloned code portions that have different syntax but similar meaning; it applies rule-based transformations such as regularization of identifiers, identification of structures, context information, and parameter replacement of the sequence. Abstract syntax tree (AST) based approaches [9] and program dependence graph (PDG) based tools [10] look for subtrees and isomorphic graphs to find clones.

Among these and many other techniques, we found only two approaches that use data mining to detect clones: CP-Miner [11] and CloneDetection [9]. CP-Miner mines frequent token sequences and flags bugs by recognizing deviations in the mined patterns of variable renaming when code is copied and pasted. It transforms a basic block into a number by tokenizing its components. Once all the components of a statement are tokenized, a hash value digest is computed using the "hashpjw" hash function. The CloSpan algorithm is applied to the resulting sequence database to find basic copy-pasted segments. By identifying abnormal mappings of identifiers among copy-pasted segments, CP-Miner detects copy-paste related bugs, especially those caused by the programmer forgetting to modify identifiers consistently after copy-pasting.

The approach of Wahler et al. [9], in contrast, finds exact and parameterized clones at a more abstract level by converting the AST to XML and applying frequent itemset mining. The tool first uses a parser to convert the source code into an abstract syntax tree, which contains complete information about the source code. The frequent itemset mining algorithm then takes the XML file as input and finds frequent consecutive statements. The technique finds only exact and parameterized clones.

C. API Usage pattern

Much research has been conducted on extracting API usage rules or patterns from source code, producing tools and approaches that help developers reuse existing frameworks and libraries more easily [12-17]. In this direction, Michail [14] described how data mining can be used to discover library reuse patterns in existing applications and developed the tool CodeWeb, which is based on itemset and association-rule mining.

Prospector, developed by Mandelin et al. [13], automatically synthesizes a list of candidate jungloid code fragments from a simple query that describes the required code in terms of input and output. The jungloid graph is created using both API method signatures and a corpus of sample client programs, and consists of chains of objects connected via method calls. Prospector mines signature graphs generated from API specifications and jungloid graphs. Retrieval is accomplished by traversing a set of paths (API method call sequences) from Tin to Tout. The code snippets returned by this traversal are ranked by path length, with the shortest path from Tin to Tout ranked first.

MAPO, developed by Xie and Pei [17], mines frequent usage patterns of an API through class inheritance. It uses the API's usage history to identify method calls in the form of frequent subsequences. The code search engine receives a query that describes a method, class, or package of an API and searches open source repositories for source files relevant to the query. The code analyzer analyzes the relevant source files and produces a set of method call sequences. The sequence preprocessor inlines some call sequences into others based on caller-callee relationships and removes call sequences that are irrelevant to the given query. The frequent-sequence miner discovers frequent sequences from the preprocessed sequences. The frequent-sequence postprocessor then reduces the set of frequent sequences.

Sahavechaphan and Claypool [15] developed XSnippet, a context-sensitive code assistant that lets developers query a sample code repository for code fragments relevant to the programming task at hand. A range of instantiation queries can be invoked from the Java editor, from the generic query TQG, which returns all possible code snippets for the instantiation of a type, to the specialized type-based query TQT and parent-based query TQP, which return type-relevant or parent-relevant results, respectively. The user supplies the type of query, the code context in which the query is invoked, and a specific code model instance to the graph-based XSnippet system. The mining algorithm BFSMINE, a breadth-first mining algorithm, traverses the code model instance and produces as output the code snippets that meet the requirements of the specified query.

PARSEWeb, developed by Thummalapenta and Xie [16], uses Google code search to collect relevant code snippets and mines the returned snippets to find solution jungloids. The desired code is described by a query of the form "Source → Destination". The tool searches for code samples relevant to the source and destination objects and downloads them to form a local source code repository, which is analyzed to construct a directed acyclic graph. PARSEWeb identifies nodes that contain the given Source and Destination object types and extracts Method-Invocation Sequences (MISs). It clusters similar MISs using a sequence postprocessor. The final MISs are sorted using several ranking heuristics and serve as a solution for the given query. PARSEWeb also uses an additional heuristic called query splitting, which addresses the case where code samples for the given query are split among different source files.

III. TAXONOMY OF SOURCE CODE MINING TECHNIQUES

This section analyzes the research contributions reviewed above according to criteria that capture the main features of each technique.

Each approach provides a supporting tool as a plug-in for the programming environment. Source code is given to the tool as input, and the tool applies a data mining technique to detect frequently co-occurring patterns. Source code comprises different elements such as functions, classes, variables, and data types. The criterion Input shows which source code elements the tool takes as input. The criterion Output indicates which type of information the tool extracts, e.g., programming rules, copy-pasted code, or API usage. The criterion Technique names the algorithm used by the tool; source code mining research draws on different algorithms from the data mining domain. Finally, the criterion Open Issues indicates research challenges not addressed by the specific tool or technique. Table I shows the overall analysis of techniques and tools.

IV. COMPARISON OF SOURCE CODE MINING APPROACHES

Both the work of Engler et al. and PR-Miner discover patterns involving pairs or sets of method calls, functions, variables, and data types that frequently appear in the same method; the patterns contain no control structures or conditions, and the order of method calls is not considered. However, compared with the work of Engler et al., which extracts only function-pair rules, PR-Miner extracts substantially more rules because it also extracts rules about variable correlations. On the other hand, PR-Miner's full parser must be replaced for it to work with other programming languages. CHRONICLER [4] differs fundamentally from PR-Miner in that it ensures path sensitivity and therefore produces fewer false negatives than PR-Miner. It differs from the approach of Engler et al. in that it computes precedence relationships based on the program's control flow structure, whereas Engler et al. detect relations between pairs of functions by exploring all possible paths. MUVI [6] mines variable correlations and generates variable-pairing rules. Engler et al. [2] also detect variable inconsistency, through logical reasoning, whereas MUVI [6] detects inconsistencies through pattern analysis of multi-variable access correlations.

Dup [8] uses an order-sensitive indexing scheme to normalize identifiers so that consistently renamed, syntactically identical clones can be detected, whereas CCFinder [7] applies additional transformations that actually change the structure of the code, so that minor variations of the same syntactic form are treated as similar. However, token-by-token matching is computationally more expensive than line-by-line matching, since a single line usually consists of several tokens. Dup, CCFinder, and CloneDetection identify cloned code, which helps software maintenance by identifying sections of code that should be replaced by a procedure, but they do not detect copy-paste related bugs. CP-Miner [11], in contrast, does detect copy-paste related bugs. Compared with CCFinder, CP-Miner finds 17.52% more copy-pasted segments because it tolerates statement insertions and modifications. Graph-based analysis [10] can capture more complicated changes, such as statement reordering, insertion, and control replacement, than the common token-based approaches, because it captures the software's inherent logical relationships through the PDG.

Different mining techniques have been proposed in the literature to provide sample code; they differ in the means by which a developer retrieves relevant examples from the repository. For example, Strathcona [18] forms its query from structural context that is extracted automatically from the code the developer is writing. XSnippet [15] uses class structure information such as the parents, fields, and methods of a class to define the code context used to query a sample repository for code snippets relevant to the object instantiation task at hand. Prospector [13], PARSEWeb [16], and MAPO [17] rely on a query that describes the desired code.
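To make the idea of an "abnormal mapping of identifiers" between a copied segment and its original more concrete, the following Python sketch shows one simplified way such a check could work. It is an illustration only, not CP-Miner's actual algorithm: the function name `inconsistent_renames`, the assumption that both segments are already aligned identifier by identifier, and the sample identifier streams are all hypothetical.

```python
from collections import Counter, defaultdict


def inconsistent_renames(original_ids, pasted_ids):
    """Return identifiers of the original segment that map to more than one
    name in the pasted segment (a hint that a rename was forgotten)."""
    if len(original_ids) != len(pasted_ids):
        raise ValueError("segments must be aligned identifier by identifier")

    # For every original identifier, count the names seen at the same positions.
    targets = defaultdict(Counter)
    for src, dst in zip(original_ids, pasted_ids):
        targets[src][dst] += 1

    # A consistently renamed (or unchanged) identifier has exactly one target.
    return {src: dict(seen) for src, seen in targets.items() if len(seen) > 1}


# Hypothetical identifier streams of an original loop and its pasted copy.
original = ["i", "n", "src", "i", "i", "src", "i"]
pasted = ["j", "m", "dst", "j", "j", "dst", "i"]  # the last "i" was not renamed

suspects = inconsistent_renames(original, pasted)
print(suspects)
```

In this toy input only `i` is reported, because it is renamed to `j` in three places and left unchanged in one. A real tool would need a threshold to separate forgotten renames from intentional partial renames; that threshold is left out here.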

TABLE I. TAXONOMY OF SOURCE CODE MINING TOOLS AND TECHNIQUES

| Category | Description | Tool Name | Input | Output | Technique | Open Issues | Ref. |
|---|---|---|---|---|---|---|---|
| Rule mining | Needs rules to check against program code by inferring code beliefs and cross-checking for contradiction | Static Analyzer | Functions | Pair-wise programming rules | Statistical analysis | Fixed rule templates; only identifies pair-wise programming rules | [2] |
| Rule mining | Frequent itemset mining for pair-wise, multi-function and variable correlation rules | PR-Miner | Functions, variables and data types | Pair-wise, complex and variable correlation rules | Item-set mining | Does not consider inter-procedural analysis, data flow and control relationships | [3] |
| Rule mining | Frequent subsequence mining to infer function precedence protocols | CHRONICLER | Functions | Function call ordering rules | Frequent subsequence mining | Does not take account of data flow or data dependence | [4] |
| Rule mining | Graph-based mining to search for conditional rules | Framework | Program Dependence Graphs | Graph minors as conditional rules | Frequent item-set and sub-graph mining algorithm | Requires manual inspection for valid rules, which may miss some instances of rules | [5] |
| Rule mining | Frequent itemset mining to extract variable correlations | MUVI | Functions; global, class and structural variables | Variable pairing rules | Frequent item-set mining | Only handles variables accessed directly by caller functions | [6] |
| Detecting copy paste code | Suffix trees for tokens per line | Dup | Sequence of lines | Line-by-line clones | Suffix tree based matching | Does not detect cloned code portions having different syntax but similar meaning | [8] |
| Detecting copy paste code | Token normalization, then suffix-tree based search | CCFinder | Sequence of tokens | Clone pairs | Token comparison; suffix tree based matching | Does not detect changes such as statement reordering, insertion and control replacement | [7] |
| Detecting copy paste code | Data mining for frequent token sequences | CP-Miner | Statement sequence | Copy-paste code | Frequent subsequence mining and tokenization | Segments with the same syntax but different semantics are detected as copy-paste segments | [11] |
| Detecting copy paste code | XML representation of ASTs with frequent itemset techniques of data mining | CloneDetection | XML representation of ASTs | Clone pairs | Frequent item-set mining | Does not detect complicated changes, i.e. statement reordering, insertion and control replacement | [9] |
| Detecting copy paste code | Searching similar sub-graphs in PDGs | Framework | PDG | Matching sub-graphs | Spatial search and graph pattern matching | Limitations in search speed and accuracy | [10] |
| API usage | Discover library reuse patterns using association rule mining | CodeWeb | Components, classes and functions | Library reuse patterns through class inheritance | Item-set and association-rule mining | To use CodeWeb, the developer must find similar applications of interest in advance | [14] |
| API usage | Context-based matching of related source code from an example repository | Strathcona | Structural context of code | List of code relevant to the code under development | Heuristic matching | Each heuristic is generic, not specific to a particular task such as object instantiation | [18] |
| API usage | Mining past repositories to search for a call chain that has previously been used | Prospector | API method signature / class type | API jungloids | Signature graph matching | Returns many irrelevant examples or, in some cases, too few qualified examples | [13] |
| API usage | Context-sensitive code assistant that mines a sample code repository for relevant code | XSnippet | Inheritance hierarchy, fields and methods | API code snippets | Graph mining | Limited to queries of a specific set of frameworks or libraries | [15] |
| API usage | Mines API usage history to identify call patterns | MAPO | Method, class or package | Sequencing information among method calls | Frequent sequence mining | Does not synthesize, from the mined frequent sequences, code fragments that can be directly inserted into developers' code | [17] |
| API usage | Searches the web for related code and mines the returned code to find a solution | PARSEWeb | Objects | Method Invocation Sequence (MIS) | Clustering | Only suggests the frequent MISs and code samples; cannot directly generate compilable code | [16] |
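As a rough illustration of the "item-set mining" technique listed for the rule mining tools in Table I, the Python sketch below counts how often pairs of called functions occur together and flags functions that contain one half of a frequent pair without the other. It is a deliberately simplified stand-in, not PR-Miner's implementation: it mines pairs only, ignores variables and data types, and the names `mine_pair_rules`, `find_violations`, `CORPUS`, and both thresholds are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(functions, min_support=3, min_confidence=0.75):
    """Return rules (lhs, rhs, confidence): functions that call lhs usually call rhs."""
    item_count = Counter()
    pair_count = Counter()
    for calls in functions.values():
        unique = set(calls)
        item_count.update(unique)
        pair_count.update(combinations(sorted(unique), 2))

    rules = []
    for (a, b), together in pair_count.items():
        if together < min_support:
            continue  # the pair is not frequent enough to be a rule candidate
        for lhs, rhs in ((a, b), (b, a)):
            confidence = together / item_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, confidence))
    return rules


def find_violations(functions, rules):
    """Flag functions that contain the left side of a rule but not its right side."""
    violations = []
    for name, calls in functions.items():
        present = set(calls)
        for lhs, rhs, _ in rules:
            if lhs in present and rhs not in present:
                violations.append((name, lhs, rhs))
    return violations


# Hypothetical toy corpus: each function reduced to the names it calls.
CORPUS = {
    "f1": ["lock", "read", "unlock"],
    "f2": ["lock", "write", "unlock"],
    "f3": ["lock", "read", "unlock"],
    "f4": ["lock", "write"],  # lock without unlock
    "f5": ["log"],
}

rules = mine_pair_rules(CORPUS)
violations = find_violations(CORPUS, rules)
print(rules)
print(violations)
```

Because the sketch only counts co-occurrence inside a function, it shares the limitation the table notes for this family of tools: it knows nothing about control flow, data flow, or calls made in other procedures.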

V. CONCLUSION

In this paper we have provided a concise but comprehensive survey of three types of source code mining tools and techniques: rule mining, copy-paste code detection, and API usage mining. To our knowledge, this is the first survey that covers this combination of techniques. The comparison shows that no single tool is superior to all others in every aspect, because every tool has strengths and weaknesses and is intended for a different task and context. However, considering these three kinds of source code mining techniques together helps one understand how to design a hybrid or integrated technique that is robust across all types of software patterns and that supports bug detection as well as helping developers write relevant API code. The comparison also shows how a set of different tools can be employed together to achieve better results.

In future work we plan to develop an integrated framework that automatically finds all of these patterns in source code in one pass, suggests potential bug locations to the developer for quality software development, and suggests relevant code for rapid software development.

REFERENCES

[1] A. Hassan and T. Xie, "Mining software engineering data," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2, 2010, pp. 503-504.
[2] D. Engler, D. Chen, S. Hallem et al., "Bugs as deviant behavior: A general approach to inferring errors in systems code," ACM SIGOPS Operating Systems Review, vol. 35, no. 5, pp. 57-72, 2001.
[3] Z. Li and Y. Zhou, "PR-Miner: Automatically extracting implicit programming rules and detecting violations in large software code," in Proceedings of the 10th European software engineering conference held jointly with 13th ACM SIGSOFT international symposium on Foundations of software engineering, 2005, pp. 306-315.
[4] M. Ramanathan, A. Grama, and S. Jagannathan, "Path-sensitive inference of function precedence protocols," in 29th International Conference on Software Engineering (ICSE 2007), 2007, pp. 240-250.
[5] R. Chang, A. Podgurski, and J. Yang, "Finding what's not there: a new approach to revealing neglected conditions in software," in Proceedings of the 2007 international symposium on Software testing and analysis, 2007, pp. 163-173.
[6] S. Lu, S. Park, C. Hu et al., "MUVI: automatically inferring multi-variable access correlations and detecting related semantic and concurrency bugs," ACM SIGOPS Operating Systems Review, vol. 41, no. 6, pp. 103-116, 2007.
[7] T. Kamiya, S. Kusumoto, and K. Inoue, "CCFinder: a multilinguistic token-based code clone detection system for large scale source code," IEEE Transactions on Software Engineering, pp. 654-670, 2002.
[8] B. Baker, "On finding duplication and near-duplication in large software systems," in Proc. Second IEEE Working Conf. Reverse Eng., 1995, pp. 86-95.
[9] V. Wahler, D. Seipel, J. Wolff et al., "Clone detection in source code by frequent itemset techniques," in Fourth IEEE International Workshop on Source Code Analysis and Manipulation, 2004, pp. 128-135.
[10] W. Qu, Y. Jia, and M. Jiang, "Pattern mining of cloned codes in software systems," Information Sciences, 2010.
[11] Z. Li, S. Lu, S. Myagmar et al., "CP-Miner: A tool for finding copy-paste and related bugs in operating system code," in Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6, 2004, pp. 20.
[12] M. Acharya, T. Xie, J. Pei et al., "Mining API patterns as partial orders from source code: from usage scenarios to specifications," in Proceedings of the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, 2007, pp. 25-34.
[13] D. Mandelin, L. Xu, R. Bodík et al., "Jungloid mining: helping to navigate the API jungle," ACM SIGPLAN Notices, vol. 40, no. 6, pp. 48-61, 2005.
[14] A. Michail, "Data mining library reuse patterns using generalized association rules," in Proceedings of 22nd International Conference on Software Engineering (ICSE'00), Limerick, Ireland, 2000, pp. 167-176.
[15] N. Sahavechaphan and K. Claypool, "XSnippet: mining for sample code," ACM SIGPLAN Notices, vol. 41, no. 10, pp. 413-430, 2006.
[16] S. Thummalapenta and T. Xie, "Parseweb: a programmer assistant for reusing open source code on the web," in Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering, 2007, pp. 204-213.
[17] T. Xie and J. Pei, "MAPO: Mining API usages from open source repositories," in Proceedings of the 2006 international workshop on Mining software repositories, 2006, pp. 54-57.
[18] R. Holmes and G. C. Murphy, "Using structural context to recommend source code examples," in Proceedings of the 27th international conference on Software engineering, 2005, pp. 117-125.
