data and, besides this, it also includes implementations of regression, data pre-processing, clustering, classification and visualization through various algorithms. More than sixty algorithms are available in WEKA. The following is an overview of a few of the Decision Tree based algorithms.
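As a rough illustration of how the WEKA classifiers discussed below are typically driven from Java (a sketch based on the standard WEKA 3 API, not code from the paper; the file name iris.arff is only a placeholder), a dataset is loaded, its class attribute is set, a classifier is built and then cross-validated:

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaSketch {
    public static void main(String[] args) throws Exception {
        // Load an ARFF file (placeholder path) and mark the last attribute as the class.
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Any WEKA classifier (J48, REPTree, RandomTree, RandomForest, ...) is used the same way.
        Classifier cls = new J48();
        cls.buildClassifier(data);

        // 10-fold cross-validation gives an estimate of generalization performance.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(cls, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```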
2.1.1 REPTree

In REPTree, a decision/regression tree is constructed with information gain as the splitting criterion, and reduced-error pruning is used to prune it. It sorts values for numeric attributes only once. Missing values are handled with the method of fractional instances, as in C4.5. REPTree is a fast Decision Tree learner.
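A minimal sketch of building WEKA's REPTree on a loaded dataset (continuing from the loading snippet above, with default settings, so reduced-error pruning stays enabled):

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.REPTree;

// Assumes 'data' is an Instances object with its class index set, as above.
REPTree rep = new REPTree();     // fast tree learner with reduced-error pruning
rep.buildClassifier(data);       // builds and then prunes the tree
System.out.println(rep);         // textual form of the pruned tree

Evaluation eval = new Evaluation(data);
eval.crossValidateModel(rep, data, 10, new java.util.Random(1));
System.out.printf("REPTree accuracy: %.2f%%%n", eval.pctCorrect());
```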
2.1.2 Random Tree

A random tree is a tree drawn at random from a set of possible trees, with K random features considered at each node. "At random" in this context means that each tree in the set has an equal chance of being sampled; in other words, the distribution over trees is "uniform". Random trees can be generated efficiently, and combining large sets of random trees generally leads to accurate models. There has been extensive research on random trees in the field of machine learning in recent years.
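A corresponding sketch for WEKA's RandomTree, which considers K randomly chosen attributes at each node (again continuing from the loading snippet; to the best of our reading of the WEKA documentation, setKValue(0) falls back to the tool's default number of random attributes per node):

```java
import weka.classifiers.trees.RandomTree;

// Assumes 'data' is loaded as in the first snippet.
RandomTree rt = new RandomTree();
rt.setKValue(0);             // 0 = use WEKA's default number of random attributes per node
rt.buildClassifier(data);
System.out.println(rt);
```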
2.1.3 J48

Ross Quinlan [21] developed the C4.5 algorithm, which is used to generate a Decision Tree. Decision Trees are produced by J48, the open-source Java implementation of C4.5 released in the WEKA data mining tool [22]. This is a standard Decision Tree algorithm. Decision Tree induction is one of the classification algorithms in data mining. The classification algorithm [23] is inductively learned to construct a model from the pre-classified data set. Each data item is defined by the values of its characteristics or features. Classification may be viewed as a mapping from a set of features to a particular class.
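A short sketch of training J48 and classifying a single instance, continuing from the loading snippet; the two parameter values shown are simply WEKA's usual defaults, not settings reported in the paper:

```java
import weka.classifiers.trees.J48;

// Assumes 'data' is loaded as in the first snippet.
J48 j48 = new J48();
j48.setConfidenceFactor(0.25f);   // C4.5-style pruning confidence (WEKA default)
j48.setMinNumObj(2);              // minimum number of instances per leaf (WEKA default)
j48.buildClassifier(data);
System.out.println(j48);          // textual form of the induced decision tree

// Classification as a mapping from a feature vector to a class label.
double predicted = j48.classifyInstance(data.instance(0));
System.out.println("Predicted class: " + data.classAttribute().value((int) predicted));
```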
2.2 Random Forests

Random Forest, developed by Leo Breiman [4], is a group of un-pruned classification or regression trees built from random samples of the training data. Random features are selected during the induction process. A prediction is made by aggregating the predictions of the ensemble (majority vote for classification, averaging for regression). Each tree is grown as described in [24]:
- If the number of cases in the training set is N, N cases are sampled at random, but with replacement, from the original data. This sample is used as the training set for growing the tree.
- If there are M input variables, a number m << M is specified such that, at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.
- Each tree is grown to the largest possible extent. No pruning is used.

Random Forest generally exhibits a significant performance improvement compared to a single tree classifier such as C4.5. The generalization error rate that it yields compares favorably to AdaBoost; however, it is more robust to noise.
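The bootstrap sampling and the random selection of m out of M variables described above are handled internally by WEKA's RandomForest. The sketch below (continuing from the loading snippet) only sets the ensemble size through the generic option string, since the exact accessor names for the number of trees and of random features differ between WEKA versions:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Utils;

// Assumes 'data' is loaded as in the first snippet.
RandomForest rf = new RandomForest();
rf.setOptions(Utils.splitOptions("-I 100"));   // -I = number of unpruned random trees to grow
rf.buildClassifier(data);                      // each tree: bootstrap sample + m-of-M random splits

Evaluation eval = new Evaluation(data);
eval.crossValidateModel(rf, data, 10, new java.util.Random(1));
System.out.printf("Random Forest accuracy: %.2f%%%n", eval.pctCorrect());
```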
3. Experimental Analysis

In this section, we concentrate on the classification performance of the Decision Tree (J48) and the Random Forest on large and small datasets. The objective of this comparison is to create a baseline that will be useful for classification scenarios and will also help in the selection of an appropriate model.

3.1 Data Sets

For the classification problems, we took the datasets from the UCI Machine Learning repository [1]. In the breast cancer data, some attributes are linear and a few are nominal. The detailed description, attributes and source of each dataset can be found in the UCI repository. Table 1 shows the names of the datasets and the number of instances and attributes for the twenty datasets we used for our analysis and comparison. As visual information, Figures 2, 3 and 4 show the distribution of the data variables in three sampled data sets. Figure 2 shows the Lymphography dataset; it has 148 instances, 19 attributes and four classes. Figure 3 shows the Sonar dataset, with 208 instances, 61 attributes and two classes. Figure 4 shows the Heart-h dataset; it has 294 instances, 14 attributes and a binary class.
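Table 2 below reports, for each dataset, the percentage of correctly and incorrectly classified instances for the two classifiers. A sketch of how such a head-to-head comparison can be generated with the WEKA API (the ARFF file names and the use of 10-fold cross-validation are assumptions for illustration, not the paper's exact protocol, which follows the parameter settings of Figures 5 and 6):

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareTrees {
    public static void main(String[] args) throws Exception {
        // Placeholder file names; one ARFF per UCI dataset used in the comparison.
        String[] files = {"lymph.arff", "sonar.arff", "heart-h.arff"};
        for (String f : files) {
            Instances data = new DataSource(f).getDataSet();
            data.setClassIndex(data.numAttributes() - 1);
            for (Classifier cls : new Classifier[]{new RandomForest(), new J48()}) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(cls, data, 10, new Random(1));
                System.out.printf("%s on %s: %.2f%% correct, %.2f%% incorrect%n",
                        cls.getClass().getSimpleName(), f,
                        eval.pctCorrect(), eval.pctIncorrect());
            }
        }
    }
}
```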
Figure 5: Parameter settings for the J48.
Figure 6: Parameter settings for the Random Forest.
Table 2: Comparison of the Random Forest and the J48 classification results for the 20 datasets.
Serial No | Data Set | No. of instances | No. of attributes | Random Forest: correctly classified | Random Forest: incorrectly classified | J48: correctly classified | J48: incorrectly classified
1 | Lymph | 148 | 19 | 81.08% | 18.91% | 77.02% | 22.97%
2 | Autos | 205 | 26 | 83.41% | 16.58% | 80.95% | 18.04%
3 | Sonar | 208 | 61 | 80.77% | 19.23% | 71.15% | 28.84%
4 | Heart-h | 270 | 14 | 77.89% | 22.10% | 80.95% | 19.04%
5 | Breast cancer | 286 | 10 | 69.23% | 30.76% | 75.52% | 24.47%
6 | Heart-c | 303 | 14 | 81.51% | 18.48% | 77.56% | 22.44%
7 | Ionosphere | 351 | 35 | 92.88% | 7.12% | 91.45% | 8.54%
8 | Colic | 368 | 23 | 86.14% | 13.85% | 85.32% | 14.67%
9 | Colic.org | 368 | 28 | 68.47% | 31.52% | 66.30% | 33.69%
10 | Primary tumor | 399 | 18 | 42.48% | 57.52% | 39.82% | 60.17%
11 | Balance Scale | 625 | 25 | 80.48% | 19.52% | 76.64% | 23.36%
12 | Soybean | 683 | 36 | 91.65% | 8.34% | 91.50% | 8.49%
13 | Credit-a | 690 | 16 | 85.07% | 14.92% | 86.09% | 13.91%
14 | Breast-w | 699 | 10 | 96.13% | 3.68% | 94.56% | 5.43%
15 | Vehicle | 846 | 19 | 77.06% | 22.93% | 72.45% | 27.54%
16 | Vowel | 990 | 14 | 96.06% | 3.03% | 81.51% | 18.48%
17 | Credit-g | 1000 | 21 | 72.50% | 27.50% | 70.50% | 29.50%
18 | Segment | 2310 | 20 | 97.66% | 2.33% | 96.92% | 3.07%
19 | Waveform | 5000 | 41 | 81.94% | 18.06% | 75.30% | 24.70%
20 | Letter | 20,000 | 17 | 94.71% | 5.29% | 87.98% | 12.02%
Table 3: Comparison of Random Forest and the J48 in terms of Precision, Recall and F-measure
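The Precision, Recall and F-measure values compared in Table 3 can be read from the same WEKA Evaluation object used in the comparison sketch above; the weighted per-class averages shown here are an assumption about how the reported values are aggregated:

```java
// Continuing from the evaluation loop in the comparison sketch above.
System.out.printf("Precision: %.3f  Recall: %.3f  F-measure: %.3f%n",
        eval.weightedPrecision(), eval.weightedRecall(), eval.weightedFMeasure());
System.out.println(eval.toClassDetailsString());   // per-class precision/recall/F-measure
```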
The Random Forest yields results that are accurate and precise in the case of a large number of instances. These scenarios also cover the missing-values problem in the datasets, and thus, besides accuracy, it also overcomes the over-fitting problem generated by missing values in the datasets. Therefore, for classification problems, if one has to choose a classifier from the set of tree-based classifiers, we recommend using the Random Forest with confidence for a variety of classification problems.
References
[1] A. Asuncion and D. Newman, "UCI Machine Learning Repository," 2007. [Online]. Available: http://archive.ics.uci.edu/ml/
[2] T. M. Mitchell, Machine Learning. McGraw-Hill, 1997.
[3] Y. Ben-Haim and E. Tom-Tov, "A Streaming Parallel Decision Tree Algorithm," 2010.
[4] L. Breiman, "Random Forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[5] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[6] T. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832-844, 1998.
[7] Y. Amit and D. Geman, "Shape quantization and recognition with randomized trees," Neural Computation, vol. 9, no. 7, pp. 1545-1588, 1997.
[8] L. Breiman, "Random Forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[9] V. Lepetit and P. Fua, "Keypoint recognition using randomized trees," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1465-1479, 2006.
[10] M. Ozuysal, P. Fua, and V. Lepetit, "Fast keypoint recognition in ten lines of code," in IEEE CVPR, 2007.
[11] J. Winn and A. Criminisi, "Object class recognition at a glance," in IEEE CVPR, video track, 2006.
[12] J. Shotton, M. Johnson, and R. Cipolla, "Semantic texton forests for image categorization and segmentation," in IEEE CVPR, Anchorage, 2008.
[13] P. Yin, A. Criminisi, J. M. Winn, and I. A. Essa, "Tree-based classifiers for bilayer video segmentation," in CVPR, 2007.
[14] A. Bosch, A. Zisserman, and X. Munoz, "Image classification using Random Forests and ferns," in IEEE ICCV, 2007.
[15] N. Apostolof and A. Zisserman, "Who are you? - real-time person identification," in BMVC, 2007.
[16] N. Horning, "Introduction to Decision Trees and Random Forests," American Museum of Natural History.
[17] L. Breiman, "Random Forests," Machine Learning, vol. 45, pp. 5-32, 2001. DOI 10.1023/A:1010933404324
[18] Y. Qi, "Random Forest for Bioinformatics," www.cs.cmu.edu/~qyj/papersA08/11-rfbook.pdf
[19] P. Yang, Y. Hwa Yang, B. Zhou, Y. Zomaya, et al., "A review of ensemble methods in bioinformatics," Current Bioinformatics, vol. 5, no. 4, pp. 296-308, 2010.
[20] Y. Zhao and Y. Zhang, "Comparison of Decision Tree methods for finding active objects," National Astronomical Observatories, CAS, 20A Datun Road, Chaoyang District, Beijing 100012, China.
[21] J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
[22] http://en.wikipedia.org/wiki/C4.5_algorithm
[23] Report from Pike Research, http://www.pikeresearch.com/research/smartgrid-data-analytics
[24] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#prox

Jehad Ali is pursuing his M.Sc. in Computer Systems Engineering at the University of Engineering and Technology, Peshawar, Pakistan. He did his B.Sc. in Computer Systems Engineering at the same university. He is working as a Computer Engineer at the Ghulam Ishaq Khan Institute (GIKI) of Engineering Sciences and Technology, Topi, Pakistan. His research interest areas are image processing, computer vision, machine learning, computer networks and pattern recognition.

Rehanullah Khan graduated from the University of Engineering and Technology Peshawar with a B.Sc. degree (Computer Engineering) in 2004 and an M.Sc. (Information Systems) in 2006. He obtained his PhD degree (Computer Engineering) in 2011 from the Vienna University of Technology, Austria. He is currently an Associate Professor at the Sarhad University of Science and Technology, Peshawar. His research interests include color interpretation, segmentation and object recognition.

Nasir Ahmad graduated from the University of Engineering and Technology Peshawar with a B.Sc. Electrical Engineering degree. He obtained his PhD degree from the UK in 2011. He is a faculty member of the Department of Computer Systems Engineering, University of Engineering and Technology Peshawar, Pakistan. His research areas include pattern recognition, computer vision and digital signal processing.

Imran Maqsood graduated from the University of Engineering and Technology Peshawar with a B.Sc. degree (Computer Engineering) in 2004 and an M.Sc. in 2006. He is pursuing his PhD degree. He is currently an Assistant Professor at the Department of Computer Software Engineering, UET Mardan Campus, Peshawar, Pakistan.