2018 25th Asia-Pacific Software Engineering Conference (APSEC)

Improving Bug Localization with Character-level Convolutional Neural Network and Recurrent Neural Network
Yan Xiao, Jacky Keung
Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
Email: yanxiao6-c@my.cityu.edu.hk, Jacky.Keung@cityu.edu.hk

Abstract—Background: Automated bug localization in large amounts of source files for bug reports is a crucial task in software engineering. However, the different representations of bug reports and source files limit the accuracy of existing bug localization techniques. Aims: We propose a novel deep learning-based model that improves the accuracy of bug localization by expressing bug reports and source files at the character level and analyzing them with a language model. Method: The proposed model is composed of two main parts: a character-level convolutional neural network (CNN) and a recurrent neural network (RNN) language model. Both bug reports and source files are expressed at the character level and then input into a CNN, whose output is given to an RNN encoder-decoder architecture. Results: The results of preliminary experiments show that the proposed model achieves comparable or even higher accuracy than the existing machine translation-based bug localization technique. Conclusion: The proposed model is capable of automatically localizing buggy files for bug reports and achieves better accuracy by analyzing them at the character level, where both bug reports and source code can be expressed.

Index Terms—bug localization, convolutional neural network, recurrent neural network, deep learning

I. INTRODUCTION AND MOTIVATION

Automatically localizing buggy files for bug reports remains a significant task in software project teams, especially those involving hundreds of thousands of source files. It is painstaking for developers to search all source files for bug-fixing. Automated bug localization techniques have thus been proposed to rank the source files and recommend the most relevant files to developers. However, bug reports are written in natural languages while source files are written in code tokens. The different expressions between them have been empirically demonstrated to be responsible for the low accuracy of the existing bug localization techniques [1], [2], [3], [4].

Ye et al. [2] tried to bridge the lexical gap by adding the semantic similarity between bug reports and source files to their previously proposed learning-to-rank model [1]. Xiao et al. [3] transformed bug reports and source files into word vectors using word embedding techniques to preserve the semantics, and extracted features from the word vectors using an enhanced CNN. To distinguish bug reports and source files, Xiao et al. [4] proposed BugTranslator, a machine translation-based bug localization technique. However, all the existing techniques regard both bug reports and source files as natural languages.

The code tokens in source files are similar to English words in natural languages, but there is a clear difference. Some code tokens, especially class or method names, are not actual words that are commonly used in natural languages. Although they are very important in bug localization, most existing studies regarded them as unknown words [2], [3], [4]. However, both bug reports and source files are composed of characters, so they share the same representation at the character level.

Therefore, this paper proposes a bug localization technique based on a character-level language model. The proposed model first obtains the character embeddings of the preprocessed bug reports and source files. Two CNNs with multiple filters are then applied to extract features from the vectors of bug reports and source files, respectively, and their outputs are fed into the subsequent RNN encoder-decoder architecture. Experiments on three open-source Java projects show the feasibility and effectiveness of the proposed model.

II. THE PROPOSED MODEL

This section describes the proposed model, whose overview is illustrated in Figure 1.

A. Data Preprocessing

We first combine the summary and description of each bug report into a new document. Revised term frequency-user focused inverse document frequency (TF-IDuF) is then applied to filter some common words in the new bug reports for the purpose of redundancy reduction [3]. We also extract two types of Abstract Syntax Tree nodes from each source file, as in [4].

B. Character-level CNN

After preprocessing bug reports and source code, convolutional operations are applied to each word in them, respectively. Each character in a word is transformed into a k-dimensional character embedding, and the embeddings are then convolved by multiple filters with different sizes (2 × k and 3 × k in Figure 1). The subsequent max-pooling layer summarizes the features, and its outputs are given to the encoder and decoder, respectively.

C. RNN Encoder-Decoder

1) Encoder: The output features extracted from a word in bug reports by the character-level CNN are given to one long short-term memory (LSTM) cell. For example, the feature
vectors of the fourth word in a bug report are fed into the fourth LSTM in the encoder, as shown in Figure 1. The context vector c is the final state of the encoder; it summarizes the features of the bug report and forms one part of the input to the decoder.

2) Decoder: Similar to the encoder, the features extracted from each word in source code by the character-level CNN are one part of the input to each LSTM cell. In addition, the context vector is concatenated with the output features of each word before being fed into each LSTM.

[Figure 1: A bug report ("Bug 360872 Remove GtkCombo and friends.") and source code ("Gtk combo box remove text") each pass through k-dimensional character embeddings, a convolutional layer with multiple filters, and max pooling. The bug report features feed the LSTM encoder (initial state he0), whose context vector c is passed to the LSTM decoder (initial state hd0), which outputs a score.]

Fig. 1. The overview of the proposed model.

III. THE PRELIMINARY RESULTS

To validate the feasibility and effectiveness of the proposed model, we conduct several preliminary experiments on the before-fixed versions of three open-source Java projects¹, similar to [4]. We use 3656, 2632, and 2817 bug reports for Eclipse UI, JDT, and SWT, respectively. Mean average precision (MAP) and mean reciprocal rank (MRR) are used to evaluate the performance of the proposed model and the competitor BugTranslator [4].

¹ https://github.com/yanxiao6/BugLocalization-dataset

TABLE I
RESULTS OF TWO MODELS.

Project     Metric  BugTranslator  Proposed model
Eclipse UI  MAP     0.36           0.35
Eclipse UI  MRR     0.42           0.40
JDT         MAP     0.34           0.35
JDT         MRR     0.41           0.42
SWT         MAP     0.34           0.37
SWT         MRR     0.40           0.42

The preliminary results are shown in Table I. The MAP and MRR values of the proposed model are better than those of BugTranslator on JDT and SWT. The performance of BugTranslator is limited because it analyzes bug reports and source files at the word level, where many out-of-vocabulary words exist. However, BugTranslator uses an attention mechanism to place more emphasis on the words in buggy files that are related to those in bug reports. This matters more when there are large numbers of source files. Eclipse UI has 6228 source files, about five times as many as SWT. BugTranslator therefore performs better than our proposed model on Eclipse UI.

IV. CONCLUSIONS AND FUTURE WORK

In this paper, bug reports and source files are analyzed at the character level instead of the word level to suppress the effect of different expressions on the accuracy of bug localization. The number of unknown words is also reduced. The proposed model applies a character-level CNN to extract features from bug reports and source files, and its output is fed into the subsequent RNN encoder-decoder. The preliminary results indicate the feasibility and effectiveness of the proposed model.

We intend to enhance the proposed model with attention mechanisms and fine-tune it. In the future, we will conduct experiments on more projects to obtain the general performance of the proposed model.

V. ACKNOWLEDGEMENT

This work is supported in part by the General Research Fund of the Research Grants Council of Hong Kong (No. 11208017) and the research funds of City University of Hong Kong (No. 9678149 and 7005028), and the Research Support Fund by Intel.

REFERENCES

[1] X. Ye, R. Bunescu, and C. Liu, “Learning to rank relevant files for bug reports using domain knowledge,” in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 2014, pp. 689–699.
[2] X. Ye, H. Shen, X. Ma, R. Bunescu, and C. Liu, “From word embeddings to document similarities for improved information retrieval in software engineering,” in Proceedings of the 38th International Conference on Software Engineering. ACM, 2016, pp. 404–415.
[3] Y. Xiao, J. Keung, Q. Mi, and K. E. Bennin, “Improving bug localization with an enhanced convolutional neural network,” in Asia-Pacific Software Engineering Conference (APSEC), 2017 24th. IEEE, 2017, pp. 338–347.
[4] Y. Xiao, J. Keung, K. E. Bennin, and Q. Mi, “Machine translation-based bug localization technique for bridging lexical gap,” Information and Software Technology, 2018.

978-1-7281-1970-0/18/$31.00 ©2018 IEEE
DOI 10.1109/APSEC.2018.00097
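
As a rough illustration of the character-level CNN described in Section II-B, the following pure-Python sketch embeds each character of a word, slides filters of width 2 and 3 over the embeddings, and max-pools each filter's activations into one feature. It is not the authors' implementation: the names `make_model` and `char_cnn_features`, the random weights, the zero-padding of short words, the absence of a nonlinearity, and the use of a single CNN for both inputs (the paper uses two) are all assumptions made for brevity.

```python
import random


def make_model(alphabet, k, widths, filters_per_width, seed=0):
    """Build random k-dimensional character embeddings and flat filter weights.

    Hypothetical helper: a trained model would learn these values.
    """
    rng = random.Random(seed)
    embed = {ch: [rng.uniform(-1.0, 1.0) for _ in range(k)] for ch in alphabet}
    filters = [
        (width, [rng.uniform(-1.0, 1.0) for _ in range(width * k)])
        for width in widths
        for _ in range(filters_per_width)
    ]
    return embed, filters


def char_cnn_features(word, embed, filters):
    """Return one max-pooled feature per filter for a single word."""
    k = len(next(iter(embed.values())))
    zero = [0.0] * k
    # Character embeddings; unknown characters map to the zero vector (assumption).
    chars = [embed.get(ch, zero) for ch in word]
    features = []
    for width, weights in filters:
        # Pad words shorter than the filter so at least one window exists.
        padded = chars + [zero] * max(0, width - len(chars))
        best = float("-inf")
        for start in range(len(padded) - width + 1):
            # Flatten `width` consecutive embeddings into one (width * k) window.
            window = [x for vec in padded[start:start + width] for x in vec]
            best = max(best, sum(w * x for w, x in zip(weights, window)))
        features.append(best)  # max pooling over all window positions
    return features


embed, filters = make_model("abcdefghijklmnopqrstuvwxyz", k=4,
                            widths=(2, 3), filters_per_width=2)
report_word = char_cnn_features("gtkcombo", embed, filters)  # word from a bug report
source_word = char_cnn_features("combo", embed, filters)     # token from source code
print(len(report_word), len(source_word))
```

Because the features are computed from characters, a class or method name that never appears in a word vocabulary still yields a feature vector, which is the property the paper relies on to reduce unknown words. In the proposed model, one such vector per word would then be fed to the corresponding LSTM cell of the encoder or decoder.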

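
The two metrics reported in Table I, MAP and MRR, can be computed as in the sketch below. The paper does not spell out its formulas, so this assumes the standard information-retrieval definitions: average precision is averaged over all buggy files of a report, and reciprocal rank uses the first buggy file in the ranking. The function names and the file names in the example are hypothetical.

```python
def average_precision(ranked_files, buggy_files):
    """Average of precision@rank taken at each rank where a buggy file appears."""
    hits, total = 0, 0.0
    for rank, name in enumerate(ranked_files, start=1):
        if name in buggy_files:
            hits += 1
            total += hits / rank
    return total / len(buggy_files) if buggy_files else 0.0


def reciprocal_rank(ranked_files, buggy_files):
    """Inverse of the rank of the first buggy file, or 0.0 if none is ranked."""
    for rank, name in enumerate(ranked_files, start=1):
        if name in buggy_files:
            return 1.0 / rank
    return 0.0


def evaluate(results):
    """Return (MAP, MRR) over a list of (ranked_files, buggy_files) pairs."""
    ap = [average_precision(ranked, buggy) for ranked, buggy in results]
    rr = [reciprocal_rank(ranked, buggy) for ranked, buggy in results]
    return sum(ap) / len(ap), sum(rr) / len(rr)


# Two made-up bug reports, each with a ranked list and its set of buggy files.
results = [
    (["A.java", "B.java", "C.java"], {"B.java"}),
    (["D.java", "E.java", "F.java"], {"D.java", "F.java"}),
]
map_score, mrr_score = evaluate(results)
print(round(map_score, 4), mrr_score)  # prints: 0.6667 0.75
```

In the example, the first report's only buggy file is ranked second (AP 0.5, RR 0.5), and the second report's buggy files are ranked first and third (AP (1 + 2/3) / 2, RR 1.0), which gives the printed averages.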
