Hadoop Based Feature Selection and Decision Making Models On Big Data
Indian Journal of Science and Technology, Vol 9(10), DOI: 10.17485/ijst/2016/v9i10/88905, March 2016 ISSN (Online) : 0974-5645
Thulasi Bikku, N. Sambasiva Rao2 and Ananda Rao Akepogu3
2SRITW, Warangal – 506371, Telangana, India; snandam@gmail.com
3Director of Academics and Planning, JNTUCEA, Anantapur – 515002, Andhra Pradesh, India; akepog@gmail.com
Abstract
Objectives: A large amount of informative data is being captured and processed by today's organizations and continues to grow exponentially. Analyzing such big data for decision-making systems becomes computationally impractical. Methods/Analysis: Hadoop, a working model based on the Map-Reduce framework, offers efficient computation and processing of big data. Findings: Most traditional classification algorithms suffer from issues such as class imbalance and dimension reduction on big data. Moreover, a large part of the data produced today is incomplete and inaccurate, and although large organizations prefer relational databases to store their information, user query processing on them is very slow. Existing solutions require prior knowledge of classification accuracy for various types of data characteristics, which is impossible to obtain in practice. Applications/Improvement: In this paper, we compare a proposed model with different big data feature selection and classification models, along with their advantages and limitations.
mining technologies. Current researchers on decision tree algorithms mainly focus on optimizing efficiency, but not data processing capability. As the development of networking and real-time applications increases, the volumes of data also increase exponentially. In order to solve these issues, a parallel distributed decision tree framework built on the Hadoop framework is used to handle massive data.
The remainder of the paper is as follows. Section 2 introduces a general overview of attribute selection measures and data mining approaches using the Hadoop framework. Section 4 discusses the traditional experimental results. Finally, Section 5 gives a conclusion.
2. Parallel Rough-Set Feature Selection

In paper1, a parallel computation of the equivalence classes and the attribute selection are implemented for hierarchical attribute reduction using MapReduce. The conditional information of a decision D given an attribute Att is computed as

$$\mathrm{Info}(D/Att) = -\sum_{r=1}^{m} \frac{n_r}{n} \sum_{i=0}^{k} \frac{n_{ri}}{n_r} \log\frac{n_{ri}}{n_r}$$

where the attribute induces m equivalence classes, and $n_r$, $n_{ri}$ and $n_i$ denote the number of objects in the r-th equivalence class, the number of objects equal to i on D in attribute r, and the number of objects equal to i on D, respectively1.
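As a quick illustration of this measure (a minimal Python sketch; the function and variable names are ours, not from1), the conditional information can be computed from an attribute column and a decision column:

    import math
    from collections import Counter, defaultdict

    def info_d_given_att(att_values, decisions):
        # Info(D/Att): entropy of the decision D, weighted over the
        # equivalence classes (partitions) induced by the attribute.
        n = len(att_values)
        partitions = defaultdict(list)
        for a, d in zip(att_values, decisions):
            partitions[a].append(d)
        total = 0.0
        for objs in partitions.values():
            nr = len(objs)
            h = -sum((nri / nr) * math.log2(nri / nr)
                     for nri in Counter(objs).values())
            total += (nr / n) * h
        return total

    # Example: a two-valued attribute over four objects
    print(info_d_given_att(["a", "a", "b", "b"], [0, 1, 0, 0]))  # 0.5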
2.1 Probabilistic Based Measures

Similarly, Carson Leung and Fan Jiang2 proposed a ranking interval, which is different from the ranking values explained in paper1. Attributes are ranked based on a relevance measure of uncertain data, which is used to estimate two interval values. Let $A_1 = [i_1^-, j_1^+]$ and $A_2 = [i_2^-, j_2^+]$ be two interval values. The degree of ranking measure from A1 to A2 is defined as

$$P_{\deg}[A_1 \ge A_2] = \min\left\{1, \max\left\{\frac{j_1^+ - i_2^-}{(j_1^+ - i_1^-) + (j_2^+ - i_2^-)}, 0\right\}\right\}$$

The similarity of the two intervals is given as

$$S = 1 - \left|P_{\deg}[A_1 \ge A_2] - P_{\deg}[A_2 \ge A_1]\right|$$

Assume that the two short discretization ranges are represented as A1 = [2, 5] and A2 = [3, 7]. Then

$$P_{\deg}[A_1 \ge A_2] = \min\{1, \max\{(5-3)/((5-2)+(7-3)), 0\}\} = \min\{1, \max\{0.29, 0\}\} = 0.29$$

$$P_{\deg}[A_2 \ge A_1] = \min\{1, \max\{(7-2)/((7-3)+(5-2)), 0\}\} = \min\{1, \max\{0.71, 0\}\} = 0.71$$

So, the degree of relevance between the two objects, computed for the attribute selection process, is

$$S = 1 - |P_{\deg}[A_1 \ge A_2] - P_{\deg}[A_2 \ge A_1]| = 1 - |0.29 - 0.71| \approx 1 - 0.43 = 0.57$$

This similarity index indicates the degree of association between two related attributes for constructing a decision tree.
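The following minimal Python sketch (our naming, not from2) reproduces the ranking degree and similarity computation above:

    def ranking_degree(a1, a2):
        # Degree that interval a1 = (i1, j1) ranks at or above a2 = (i2, j2).
        i1, j1 = a1
        i2, j2 = a2
        width_sum = (j1 - i1) + (j2 - i2)
        return min(1.0, max((j1 - i2) / width_sum, 0.0))

    def interval_similarity(a1, a2):
        # S = 1 - |P(a1 >= a2) - P(a2 >= a1)|
        return 1.0 - abs(ranking_degree(a1, a2) - ranking_degree(a2, a1))

    a1, a2 = (2, 5), (3, 7)
    print(ranking_degree(a1, a2))       # ~0.29
    print(ranking_degree(a2, a1))       # ~0.71
    print(interval_similarity(a1, a2))  # ~0.57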
An optimal feature subset extracted by a high dimensional reduction technique always depends on the chosen feature selection measure. In general, different measures may lead to different optimized attribute subsets. One of the major issues in real-time distributed data is uncertainty and missing values. This problem arises when more than one attribute has the same data distribution, that is, different attributes get a uniform data distribution. The main issues in traditional attribute based classification models are data cleaning, filtering and reduction. Filtering analysis removes all the redundant attributes by attribute subset selection.

2.2.1 Proposed Algorithm for Roughset Based Random Forest

Map Function(Dataset)
Input: Training instances with attribute set A
Output: Decision Tree Rules
Procedure:
    Initialize 'k' parameters as 'k' clusters in the cloud environment
    Initialize the dataset using the bagging algorithm
    Build a tree per bootstrap by randomly selecting attributes
    While attribute_set != null do
        For each candidate attribute do
            Compute the maximum information gain (IG) as α: α(A) = argmax IG
            Split on the selected attribute
        End
    End

Reduce Function
Input: Set of Map Decision Trees
Output: Classification Result
Procedure:
    Produce multiple Decision Trees
    Check and compare the nodes in each decision tree
    Find the majority vote of the trees for the classification
    Return set of Decision Tree Rules
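To make the map and reduce roles above concrete, here is a minimal single-process Python sketch (our illustration, not the authors' implementation): depth-one stumps stand in for full decision trees, and in a real Hadoop job map_task and reduce_task would be the bodies of the Mapper and Reducer:

    import math
    import random
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        # IG(D, attr) = Info(D) - Info(D/attr), mirroring the measure in Section 2
        base = entropy(labels)
        part = {}
        for row, y in zip(rows, labels):
            part.setdefault(row[attr], []).append(y)
        cond = sum((len(ys) / len(labels)) * entropy(ys) for ys in part.values())
        return base - cond

    def map_task(rows, labels, n_attrs, seed):
        # Map phase: bootstrap the data (bagging), draw a random attribute
        # subset, and split on the attribute with maximum information gain.
        rnd = random.Random(seed)
        idx = [rnd.randrange(len(rows)) for _ in rows]
        boot_rows = [rows[i] for i in idx]
        boot_labels = [labels[i] for i in idx]
        candidates = rnd.sample(range(len(rows[0])), n_attrs)
        alpha = max(candidates, key=lambda a: info_gain(boot_rows, boot_labels, a))
        branches = {}
        for row, y in zip(boot_rows, boot_labels):
            branches.setdefault(row[alpha], []).append(y)
        rules = {v: Counter(ys).most_common(1)[0][0] for v, ys in branches.items()}
        default = Counter(boot_labels).most_common(1)[0][0]
        return (alpha, rules, default)

    def reduce_task(trees, row):
        # Reduce phase: majority vote over the trees built by the mappers.
        votes = [rules.get(row[alpha], default) for alpha, rules, default in trees]
        return Counter(votes).most_common(1)[0][0]

    rows = [(0, 1), (0, 0), (1, 1), (1, 0)]
    labels = ["attack", "normal", "attack", "normal"]
    forest = [map_task(rows, labels, n_attrs=1, seed=s) for s in range(5)]
    print(reduce_task(forest, (0, 1)))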
A classification rule discovery model, namely RuleMR, using the MapReduce model is explained in4. This model is able to construct a set of rules from large sets of nominal valued attributes. It depends on user defined parameters that influence the computation time required for training. In this model, each clustered node computes the normalized entropy and coverage factor as follows:

$$\frac{\mathrm{Entropy}(NC) - \mathrm{Entropy}(cond)}{\mathrm{Entropy}(NC)} \cdot \frac{P}{E}$$

where NC indicates the number of different classes, cond is the class distinct values, P is the probability of existence of each class and E is the estimation parameter.
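A literal Python transcription of this factor (a sketch only, under the assumption that Entropy(NC) and Entropy(cond) are the Shannon entropies of the full class distribution and of the classes covered by a rule's condition; the names are illustrative, not from4):

    import math
    from collections import Counter

    def shannon_entropy(values):
        n = len(values)
        return -sum((k / n) * math.log2(k / n) for k in Counter(values).values())

    def normalized_entropy_coverage(all_classes, covered_classes, p, e):
        # ((Entropy(NC) - Entropy(cond)) / Entropy(NC)) * (P / E)
        h_nc = shannon_entropy(all_classes)
        h_cond = shannon_entropy(covered_classes)
        return ((h_nc - h_cond) / h_nc) * (p / e)

    print(normalized_entropy_coverage(["a", "b", "a", "b"], ["a", "a"], 0.5, 1.0))  # 0.5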
Gain Ratio, which is a successor of the information gain, overcomes the challenge of the information gain by using a normalization process, the split information value. But the main drawback of the gain ratio is that it generates unbalanced splits on unbalanced data.
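For clarity, a short sketch of this normalization (the standard C4.5-style formulation; the helper names are ours):

    import math

    def split_info(partition_sizes):
        # SplitInfo(Att): entropy of the partition an attribute induces on the data.
        n = sum(partition_sizes)
        return -sum((s / n) * math.log2(s / n) for s in partition_sizes if s)

    def gain_ratio(information_gain, partition_sizes):
        # GainRatio = IG / SplitInfo; the normalization penalizes many-valued splits.
        return information_gain / split_info(partition_sizes)

    # Example: an attribute splitting 100 rows into chunks of 50/30/20
    print(gain_ratio(0.4, [50, 30, 20]))  # ~0.27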
In the ELM-Tree decision tree5, the leaf nodes are linear regression functions, i.e. a new instance is classified by traversing the tree from the root to a leaf and determining the prediction value with the linear regression function in the leaf node, as shown in Figure 2. The proposed ELM tree is not applicable to incremental tree construction on large datasets. Also, this approach does not handle mixed types of attributes using the MapReduce framework.
Sara del Río, Victoria López and José Manuel Benítez proposed a model to overcome the class imbalance problem, where each class has a different data distribution and attribute types. The oversampling and undersampling issue in large datasets is handled using the Random Forest algorithm for classification6. The drawbacks are that performance depends on the number of intermediate mappers used in the MapReduce framework, that a static threshold for the minority class must be determined with respect to the different intermediate mappers to get better performance, and that the static classification and static parameters need to be improved when a MapReduce framework is used.

The imbalanced dataset issue in classification models may occur when the number of tuples or instances that represent one class is much larger than for the other classes. There are many fields in which imbalance occurs between the different classes, such as risk management, satellite images, medical data, sensor data, and time series data.

Although hyper network graph theory has been used in solving various classification issues, it usually results in poor performance when dealing with class imbalance issues. Like most of the conventional methods, hyper network models assume that the class and data distributions of data sets are balanced7. Both single labelled classification and multi labelled classification are important research areas in supervised learning. However, neither of them can overcome the imbalance issue, which has a negative effect on classifier accuracy. Addressing imbalance issues in multi labelled classification is more complex, and it is even difficult to define the label distribution. Isaac Triguero, Daniel Peralta and Salvador García use $M = \mathbb{R}^d$ to denote the d-dimensional vector space and $N = \{n_1, n_2, \ldots, n_q\}$, $n_i \in \{0, 1\}$, to denote the binary label vector space8. The multi labeled data set can be represented as

$$D = \{(p_i, q_i) \mid 1 \le i \le c,\ p_i \in \mathbb{R}^d,\ q_i \subseteq N\}$$

where d, q and c represent the number of attributes, the number of class labels and the number of instances.
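As a toy illustration of this representation (hypothetical values, not drawn from the datasets used later):

    # Each element of D pairs a d-dimensional feature vector p_i with a
    # label subset q_i drawn from the binary label space N.
    N = {"dos", "probe", "r2l", "u2r"}         # label space, q = |N|
    D = [
        ((0.2, 1.5, 3.1), {"dos", "probe"}),   # p_1 in R^3, q_1 ⊆ N
        ((1.0, 0.4, 2.2), {"r2l"}),
    ]
    d = len(D[0][0])   # number of attributes
    c = len(D)         # number of instances
    print(d, c)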
MRPR8 implemented a decision tree framework to generate decision rules at each level in a multidimensional model. The main aim of this model is to optimize the decision rules with a minimal rule-set. Mining rules at each level is based on the multidimensional data model, as shown in Figure 3. This framework has two phases: one is the data reduction phase and the other is the decision rule generation phase. In the first phase, attribute reduction and attribute filtering procedures are performed on the input data set. In the second phase, decision rules are generated using a hierarchical rough-set, interpreting the relationship between decision rules mined from different levels of the hierarchy. The problems observed are that attribute reduction at multiple levels needs to be improved, with dynamic threshold parameters to approximate the decision rules8.

Figure 3. Multidimensional data model: distributed data sources (DB1, DB2, …, DB-n) feeding data integration and the Rough-set/ELM-Tree model.

Table 1 shows the traditional classification algorithms, which cannot handle huge amounts of data. Table 1 describes the different classification models and their capable features, such as attribute selection measure, scalability, parallel support, decision tree structure and Hadoop framework support. In our research work, we have implemented these models in the Hadoop framework for performance analysis, which gives better results.

Table 1. Decision Tree algorithms for Big data supporting parameters
Algorithm   Selection Technique   Scalability   Parallelism   Tree Structure   Bigdata Handle
ID3         Information Gain      Poor          Poor          Multi-tree       No
C4.5        Gain Ratio            Poor          Poor          Multi-tree       No
CART        Gini Coefficient      Poor          Poor          Binary           No
SLIQ        Gini Coefficient      Good          Good          Binary           No
SPRINT      Gini Coefficient      Good          Good          Binary           No
PUBLIC      Gini Coefficient      Poor          Poor          Binary           No
4. Experimental Results of Traditional Approaches

In this section, we have evaluated the performance of the rough-set attribute selection measure with different Hadoop based classification models. All the experiments are carried out on the Hadoop framework, an open source framework that supports distributed applications. The Hadoop framework and the NetBeans IDE version 8 are used to execute the MapReduce jobs. For data storage, an Amazon AWS cloud server with large instances is used to execute multiple large data sets on the different cluster nodes. Large data sets are downloaded from the UCI repository; they consist of 41 attributes with 10 decision classes and a large number of instances. Table 2 describes the total number of instances used in our experiment and its statistical information.

Table 2. Hadoop based Rough-set feature selection model
Dataset   Size     Total Features   Roughset-reduced Features   Attack Classes
Kdd1      548825   41               25                          8
Kdd2      453653   41               27                          8
Kdd3      625977   41               23                          8
Kdd4      587734   41               23                          8
The proposed Hadoop based feature selection model was tested with different classification models to measure the classification accuracy for large datasets. The Kddcup'99 intrusion dataset was used with different classification models, such as the Genetic Algorithm Feature Selection Algorithm (GA_FSA), Neural Networks, ELM-Tree and Random Forest. By using the rough-set feature reduction model, the attributes relevant for attack detection are identified.

Table 3 and Figure 4 describe the performance analysis of the Mapper and Reducer Hadoop interface classes using the feature reduction process. Random Forest and rough-set are integrated in the Hadoop environment to find the time complexity of the Mapper and Reducer interfaces. As shown in the table, we have observed a time complexity reduction in the Mapper and Reducer phases of the Random Forest with Rough-set model compared to the other classification models without rough-set.

Table 3. Hadoop based classification model using Rough-set feature selection
Algorithm                  MapperTime (mins)   ReducerTime (mins)
GA_FSA                     15.35               4.57
Neural Network             21.53               5.75
ELM-Tree                   14.65               5.36
Random Forest (Roughset)   12.43               4.567

Figure 4. Algorithms with MapReduce Statistics.
Table 4 and Figure 5 describe the performance analysis of the classification models. Random Forest with rough-set gets a high true positive classification rate for intrusion detection compared to the other classification models in the Hadoop framework. The true positive rate indicates the number of positive instances reflecting attacks, compared to negative samples. Error (%) indicates the number of misclassified instances. Also, the outliers (%), which indicate the number of instances that are not relevant to the existing attack behaviour, are computed.

Table 4. Traditional classification Algorithms with True Positive and Error Rates
Algorithm                  True Positive (%)   Error (%)   Outlier (%)
GA_FSA                     81.45               25.67       12.54
Neural Network             83.157              21.56       15.37
ELM-Tree                   84.26               27.89       16.24
Random Forest (Roughset)   89.45               19.04       11.15
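How these three measures can be read (a generic Python sketch, not the authors' evaluation code; the outlier definition here is our interpretation of the text above):

    def detection_metrics(y_true, y_pred, known_attacks):
        # True positive (%): attacks correctly recognized among actual attacks.
        # Error (%): misclassified instances. Outlier (%): predicted attack
        # labels that fall outside the known attack behaviours.
        n = len(y_true)
        positives = sum(1 for t in y_true if t != "normal")
        tp = sum(1 for t, p in zip(y_true, y_pred) if t != "normal" and p == t)
        errors = sum(1 for t, p in zip(y_true, y_pred) if t != p)
        outliers = sum(1 for p in y_pred if p != "normal" and p not in known_attacks)
        return 100 * tp / positives, 100 * errors / n, 100 * outliers / n

    print(detection_metrics(["dos", "normal", "probe", "dos"],
                            ["dos", "normal", "u2r", "normal"],
                            known_attacks={"dos", "probe", "r2l"}))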
Figure 5. Performance analysis of True Positive and Error Rates.
5. Conclusion

In this paper, we have studied the rough-set based feature selection model with the random forest in the Hadoop framework. The proposed model is compared with the traditional classification models, along with their limitations and performance measures. Experimental results show that the proposed model performs well against the traditional models in the Hadoop framework, unlike existing solutions, which require prior knowledge of classification accuracy for various types of data characteristics, which is impossible to obtain in practice. Traditional models need to improve the execution process and classification accuracy on complex distributed databases. The main limitations of these models over big data are the outlier issue, scaling up to high dimensional data classification, mining sparse data, and constrained optimization. Hence, one of our future works is to present a novel attribute selection based parallel classifier to process mixed attributes on large datasets.

6. References

1. Qian J, Lv P, Yue X, Liu C, Jing Z. Hierarchical attribute reduction algorithms for big data using MapReduce. Knowledge-Based Systems. 2015; 73:18–31.
2. Leung CK, Jiang F. A data science solution for mining interesting patterns from uncertain big data. In: 2014 IEEE 4th International Conference on Big Data and Cloud Computing, Sydney, NSW; 2014. p. 235–42.
3. Fayed HA, Atiya AF. A novel template reduction approach for the K-nearest neighbor method. IEEE Transactions on Neural Networks. 2009; 20(5):890–96.
4. Kolias V, Kolias C, Anagnostopoulos I, Kayafas E. RuleMR: Classification rule discovery with MapReduce. In: 2014 IEEE International Conference on Big Data (Big Data), Washington, DC; 2014. p. 20–8.
5. Wang R, He YL, Chow CY, Ou FF, Zhang J. Learning ELM-tree from big data based on uncertainty reduction. Fuzzy Sets and Systems. 2015; 258:79–100.
6. Del Río S, López V, Benítez JM, Herrera F. On the use of MapReduce for imbalanced big data using Random Forest. Information Sciences. 2014 Nov; 285:112–37.
7. Feng Q, Miao D, Cheng Y. Hierarchical decision rules mining. Expert Systems with Applications. 2010 Mar; 37(3):2081–91.
8. Triguero I, Peralta D, Bacardit J, García S, Herrera F. MRPR: A MapReduce solution for prototype reduction in big data classification. Neurocomputing. 2015 Feb; 150:331–45.
9. Noh KS, Lee DS. Bigdata platform design and implementation model. Indian Journal of Science and Technology. 2015; 8(18):1–8.