
Indian Journal of Science and Technology, Vol 9(10), DOI: 10.17485/ijst/2016/v9i10/88905, March 2016
ISSN (Print): 0974-6846 | ISSN (Online): 0974-5645

Hadoop based Feature Selection and Decision Making Models on Big Data

Thulasi Bikku¹, N. Sambasiva Rao² and Ananda Rao Akepogu³

¹Department of CSE, JNTUA, Anantapur – 515002, Andhra Pradesh, India; thulasi.bikku@gmail.com
²SRITW, Warangal – 506371, Telangana, India; snandam@gmail.com
³Director of Academics and Planning, JNTUCEA, Anantapur – 515002, Andhra Pradesh, India; akepog@gmail.com

Abstract
Objectives: Today's organizations capture and process large amounts of informative data, and the volume continues to grow exponentially. Analyzing such big data for decision-making systems becomes computationally impractical with conventional tools. Methods/Analysis: Hadoop, a working implementation of the Map-Reduce framework, provides efficient computation and processing of Big Data. Findings: Most traditional classification algorithms suffer from class imbalance and dimension reduction issues on Big Data. Moreover, a large part of the data produced today is incomplete and inaccurate; large organizations therefore prefer relational databases to store their information, but user query processing over them is very slow at scale. Existing solutions also require prior knowledge of classification accuracy for various types of data characteristics, which is impossible to obtain in practice. Applications/Improvement: In this paper, we compare a proposed model against different big data feature selection and classification models, discussing their advantages and limitations.

Keywords: Big Data, Decision Tree, Feature Selection, Hadoop, MapReduce

1.  Introduction

The term Big Data refers to computational data or information that cannot be analyzed or handled using traditional machine learning tools and techniques. By the general definition, Big Data is computational data that arrives too fast, is too massive, or is too hard to process. Massive data can be generated from client-server applications such as sensors, retail, e-commerce, financial sectors and medical repositories. The term also covers the architectures designed to capture, store, process and run large volumes of data in less computational time, in real time. Hadoop is an open-source cloud computing environment of the Apache foundation that provides distributed programming on large datasets based on MapReduce. Its remarkable features include simplicity, fault tolerance and scalability. There are a number of MapReduce implementations, such as Phoenix, Dryad, Sphere, Mars and Hadoop. Hadoop addresses the three main challenges created by Big Data: Volume, Velocity and Variety.

In the traditional approach, an organization uses a single server to process the available data. This places an upper bound on the amount of computational data, because a single server is neither capable nor scalable as the data grows with great variety and velocity.

The MapReduce framework supports high-dimensional datasets, partitioning them into smaller sets and distributing them to different cloud clusters for computation. MapReduce views data as <Key, Value> pairs, as shown in Figure 1.

Figure 1. General Workflow of MapReduce Framework. [Diagram: high-dimensional data is split by a data-partition step into P1, P2, P3; Map(P1), Map(P2) and Map(P3) run in parallel; the partial solutions are then combined.]

Hadoop exposes fully abstract classes called Map and Reduce, with which developers formulate their problems in Map-Reduce form; a minimal sketch of this formulation is given below. Data-intensive processing currently attracts considerable interest in the MapReduce framework for large-scale data analytics. In real time, it is a fault-tolerant and scalable data processing tool that can compute and process voluminous datasets in parallel on many low-end computing cluster nodes. However, MapReduce has inherent challenges to its efficiency and performance; therefore, many research works have endeavoured to overcome the challenges of the MapReduce framework.
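To make the <Key, Value> formulation concrete, here is a minimal Hadoop Streaming style word count in Python. This is our illustrative sketch, not code from the paper; the word-count task and the in-process shuffle are assumptions made for brevity.

```python
# Minimal Hadoop Streaming style word count, illustrating the
# <Key, Value> formulation of a MapReduce job.
import sys
from itertools import groupby


def mapper(lines):
    """Map phase: emit a <word, 1> pair for every token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reducer(pairs):
    """Reduce phase: sum the values within each key group."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    # In a real Hadoop Streaming job the shuffle between map and
    # reduce is performed by the framework; here it is simulated
    # in-process by sorting the emitted pairs.
    for key, value in reducer(mapper(sys.stdin)):
        print(f"{key}\t{value}")
```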
The main objective of data mining techniques is to find knowledge in large datasets. The discovered knowledge supports decision-making systems. An increase in data intensity and dataset size affects computational efficiency and lengthens processing time. However, most optimization models are not designed for parallel computational environments, and parallelizing the traditional algorithms is difficult and non-trivial. The decision tree is one of the key areas in data mining technologies. Current research on decision tree algorithms mainly focuses on optimizing efficiency, not on data processing capability. As networking and real-time applications develop, the volumes of data increase exponentially. To address these issues, a parallel distributed decision tree built on the Hadoop framework is used to handle massive data.

The remainder of the paper is organized as follows. Sections 2 and 3 give a general overview of attribute selection measures and data mining approaches using the Hadoop framework. Section 4 discusses the experimental results of the traditional approaches. Finally, Section 5 concludes.
2. Parallel Rough-Set Feature Selection

In the paper1, parallel computation of the equivalence classes and of the attribute selection is implemented for attribute reduction. The authors use a parallel hierarchical attribute selection algorithm to find the decision rules at each level. The main limitations of this attribute selection method are: 1. As the number of classes in the single- and multi-level hierarchical process increases, the computational time and space also increase. 2. The model is based on an entropy measure and a uniform data distribution.

Rough sets can be used to find the most relevant attribute selection from a given data set with discretized attribute values. The lower and upper approximations of a decision class Dc with respect to a partition by attribute set Att are defined as:

Lower bound approximation: Aprox_l Att(Dc) = {x ∈ U | [x]_Att ⊆ Dc};
Upper bound approximation: Aprox_u Att(Dc) = {x ∈ U | [x]_Att ∩ Dc ≠ Ø}.

Here U is the nonempty finite set of objects (the universe). The accuracy of approximation Aprox_Acur, where RS = (U, R), S ⊆ R, C ⊆ U and |C| denotes the cardinality of C, is

Aprox_Acur S(C) = |Aprox_l S(C)| / |Aprox_u S(C)|

If Aprox_Acur S(C) = 1, then C is crisp with respect to S. If Aprox_Acur S(C) < 1, then C is rough with respect to S. A rough-set computation in this style is sketched below.

An entropy measure used to minimize the attribute set size is given as

Info(D) = − Σ_{i=0}^{N} (n_i / n) · log(n_i / n)

Info(D / Att) = − Σ_{r=1}^{p} (n_r / n) Σ_{i=0}^{N} (n_{ri} / n_r) · log(n_{ri} / n_r)

where n_r, n_{ri} and n_i denote the number of objects, the number of objects equal to i on D in attribute r, and the number of objects equal to i on D, respectively1.
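To make the lower and upper approximation definitions concrete, the following is a small self-contained sketch (ours, not the authors' code); the toy universe and the parity attribute are illustrative assumptions.

```python
from collections import defaultdict


def approximations(universe, attribute, decision_class):
    """Lower and upper approximations of a decision class Dc under the
    indiscernibility relation induced by a single attribute function."""
    # Group objects into equivalence classes [x]_Att.
    blocks = defaultdict(set)
    for x in universe:
        blocks[attribute(x)].add(x)
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= decision_class:   # [x]_Att is a subset of Dc
            lower |= block
        if block & decision_class:    # [x]_Att intersects Dc
            upper |= block
    return lower, upper


# Toy example: objects 0..5, attribute = parity, Dc = {0, 2, 3}.
U = set(range(6))
Dc = {0, 2, 3}
lower, upper = approximations(U, lambda x: x % 2, Dc)
accuracy = len(lower) / len(upper)  # Aprox_Acur = |lower| / |upper|
print(lower, upper, accuracy)       # lower is empty, so Dc is rough here
```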
2.1  Probabilistic Based Measures

Similarly, Leung and Jiang2 proposed ranking intervals, which differ from the ranking values explained in the paper1. Attributes are ranked based on a relevance measure of uncertain data that is used to estimate two interval values. Let A1 = [i1−, j1+] and A2 = [i2−, j2+] be two interval values. The degree of ranking measure from A1 to A2 is defined as

P(deg)[A1 >= A2] = min{1, max{(j1+ − i2−) / ((j1+ − i1−) + (j2+ − i2−)), 0}}

The similarity of the two intervals is given as

S = 1 − |P(deg)[A1 >= A2] − P(deg)[A2 <= A1]|

Assume that two short discretization ranges are represented as A1 = [2, 5] and A2 = [3, 7]. Then

P(deg)[A1 >= A2] = min{1, max{(5 − 3) / ((5 − 2) + (7 − 3)), 0}} = min{1, max{0.28, 0}} = min{1, 0.28} = 0.28

P(deg)[A2 <= A1] = min{1, max{(7 − 2) / ((7 − 3) + (5 − 2)), 0}} = min{1, max{0.42, 0}} = min{1, 0.42} = 0.42

So the degree of relevance between the two objects, computed for the attribute selection process, is

S = 1 − |P(deg)[A1 >= A2] − P(deg)[A2 <= A1]| = 1 − |0.28 − 0.42| = 1 − 0.14 = 0.86

This similarity index indicates the association between two highly related attributes for constructing a decision tree. An optimal feature subset extracted by a high-dimensional reduction technique always depends on certain feature selection measures; in general, different measures may lead to different optimized attribute subsets. One of the major issues in real-time distributed data is uncertainty and missing values. This problem arises when more than one attribute has the same data distribution, that is, when different attributes get a uniform data distribution. The main issues in traditional attribute-based classification models are data cleaning, filtering and reduction. Filtering analysis removes all the redundant attributes by attribute subset selection.
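As a concrete reference for the interval formulas, here is a small sketch (ours, not from the paper). Note one caveat: a literal evaluation of the printed expression for P(deg)[A2 <= A1] yields 5/7 ≈ 0.71 rather than the 0.42 carried through the worked example above, so this sketch should be read as one literal interpretation of the formula, not as a reproduction of the paper's arithmetic.

```python
def ranking_degree(a1, a2):
    """P(deg)[A1 >= A2] for closed intervals A1 = [i1, j1], A2 = [i2, j2]."""
    (i1, j1), (i2, j2) = a1, a2
    return min(1.0, max((j1 - i2) / ((j1 - i1) + (j2 - i2)), 0.0))


def similarity(a1, a2):
    """S = 1 - |P(deg)[A1 >= A2] - P(deg)[A2 >= A1]|."""
    return 1.0 - abs(ranking_degree(a1, a2) - ranking_degree(a2, a1))


A1, A2 = (2, 5), (3, 7)
print(round(ranking_degree(A1, A2), 2))  # 0.29, i.e. 2/7 (the paper truncates to 0.28)
print(round(ranking_degree(A2, A1), 2))  # 0.71, i.e. 5/7 (the paper reports 0.42)
print(round(similarity(A1, A2), 2))      # 0.57 under this literal reading
```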

2.2  Roughset based Random Forest

Our proposed system combines the random forest algorithm with rough-set theory to obtain better results. Rough-set theory is useful for analysing objects represented by attributes (features). The basic assumptions in rough-set theory are that objects are represented by attribute values, and that objects with the same information are indiscernible. Using data mining techniques we can construct a single tree; the Random Forest algorithm instead produces multiple classification trees. Random Forest is an efficient method that deals effectively with data in which a huge number of values are missing. The algorithm below uses MapReduce, which contains two functions, map and reduce. The map function processes the data and turns the key-value data into decision trees. The reduce function removes the irrelevant and redundant data and returns the classification trees having the maximum votes.

2.2.1  Proposed algorithm for Roughset based Random Forest

Map Function(Dataset)
  Input:
    Training instances with attribute set A
  Output:
    Decision Tree Rules
  Procedure:
    Initialize 'k' parameters as 'k' clusters in the cloud environment
    Initialize the dataset using the bagging algorithm
    Build a tree per bootstrap by randomly selecting attributes
    While attribute_set != null do
      For each candidate attribute do
        Compute the maximum information gain (IG) as α: α(A) = argmax IG
        Split on the information attribute
      End
    End

Reduce Function
  Input:
    Set of Map Decision Trees
  Output:
    Classification Result
  Procedure:
    Produce multiple Decision Trees
    Check and compare the nodes in each decision tree
    Find the majority vote over the trees for the classification
    Return the set of Decision Tree Rules
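The pseudocode above leaves the tree induction and voting machinery abstract. As a rough in-process illustration of the map/reduce split (ours, not the authors' implementation), the sketch below substitutes a toy decision stump for full tree induction; the data, the partitioning and the parameter choices are all illustrative assumptions.

```python
import random
from collections import Counter


def train_stump(sample):
    """Toy learner standing in for per-bootstrap tree induction: picks the
    single feature/threshold whose split best explains the labels."""
    best = None
    n_features = len(sample[0][0])
    for f in random.sample(range(n_features), k=max(1, n_features // 2)):
        for t in sorted({x[f] for x, _ in sample}):
            left = [y for x, y in sample if x[f] <= t]
            right = [y for x, y in sample if x[f] > t]
            if not left or not right:
                continue
            # Score = points explained by the majority label on each side.
            score = max(Counter(left).values()) + max(Counter(right).values())
            if best is None or score > best[0]:
                l_lab = Counter(left).most_common(1)[0][0]
                r_lab = Counter(right).most_common(1)[0][0]
                best = (score, f, t, l_lab, r_lab)
    if best is None:  # degenerate bootstrap: fall back to the majority label
        lab = Counter(y for _, y in sample).most_common(1)[0][0]
        return lambda x: lab
    _, f, t, l_lab, r_lab = best
    return lambda x: l_lab if x[f] <= t else r_lab


def map_phase(partition, n_trees=5):
    """Map: build one model per bootstrap sample of the local partition."""
    models = []
    for _ in range(n_trees):
        bootstrap = [random.choice(partition) for _ in partition]
        models.append(train_stump(bootstrap))
    return models


def reduce_phase(all_models, x):
    """Reduce: majority vote over the trees emitted by every mapper."""
    votes = Counter(m(x) for models in all_models for m in models)
    return votes.most_common(1)[0][0]


# Toy usage: two partitions of labelled points (x, y), then classify.
random.seed(7)  # reproducible toy run
data = [((0.0, 1.0), 0), ((0.2, 0.9), 0), ((1.0, 0.1), 1), ((0.9, 0.2), 1)]
parts = [data[:3], data[1:]]
forests = [map_phase(p) for p in parts]
print(reduce_phase(forests, (0.1, 0.8)))  # majority vote for a point near class 0
```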

3. Data Preprocessing and Random Forest Model

Data discretization is the process of converting continuous attribute values into a finite set of intervals, and it has received growing research attention. The main reason is that traditional methods focus on learning either only nominal attributes or only continuous attributes, but not mixed attributes. Also, rules generated through induction or decision rules on discretized data are usually shorter, more compact and more accurate than rules generated from continuous values. A distributed partitioning method for data reduction using a nearest neighbour classification approach has been proposed3. This model reduces the number of instances in the original training data set; its main drawbacks are sensitivity to noise and storage overhead. A rule discovery model named RuleMR, built on the MapReduce model, is explained in4. This model constructs a set of rules from large sets of nominal-valued attributes, but it depends on user-defined parameters that influence the computation time for the training data. In this model, each clustered node computes the normalized entropy and coverage factor as follows:

((Entropy(NC) − Entropy(cond)) / Entropy(NC)) * (P / E)

where NC indicates the number of different classes, cond is the class distinct values, P is the probability of existence of each class and E is the estimation parameter.
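The paper leaves the operands of this factor underspecified. Under one reading, in which Entropy(NC) is the class entropy over the whole node and Entropy(cond) is the class entropy over the instances covered by a rule's condition, the factor can be sketched as follows; the operand semantics and the P/E inputs are our assumptions, not RuleMR's definition.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def normalized_factor(all_labels, covered_labels, p, e):
    """((Entropy(NC) - Entropy(cond)) / Entropy(NC)) * (P / E), assuming
    Entropy(NC) is the whole-node class entropy and Entropy(cond) the class
    entropy of the instances covered by the rule's condition (our reading)."""
    h_all = entropy(all_labels)
    return ((h_all - entropy(covered_labels)) / h_all) * (p / e)


# Toy usage with made-up probability P and estimation parameter E.
print(normalized_factor([0, 0, 1, 1, 1], [1, 1, 1], p=0.6, e=1.0))
```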
Gain Ratio, a successor of the information gain, overcomes a weakness of the information gain by normalizing it with a split information value; the computation is sketched below. The main drawback of the gain ratio is that it generates unbalanced splits on unbalanced data.
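As a concrete reference for the gain-ratio computation just described, here is a minimal sketch (ours, not the authors' code); the toy attribute/label data are assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy of a value sequence: -sum p_i * log2 p_i."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(values, labels):
    """Information gain of a nominal attribute, normalized by split information."""
    n = len(labels)
    # Partition the labels by attribute value.
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    # Expected entropy after the split, and the split information.
    cond_entropy = sum(len(part) / n * entropy(part) for part in partitions.values())
    split_info = entropy(values)  # entropy of the attribute's own distribution
    gain = entropy(labels) - cond_entropy
    return gain / split_info if split_info > 0 else 0.0


# Tiny example: one nominal attribute against binary class labels.
print(gain_ratio(["a", "a", "b", "b"], [1, 1, 0, 1]))
```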
In the ELM-Tree decision tree5 the leaf nodes are linear regression functions, i.e. a new instance is classified by traversing the tree from the root to a leaf, where the prediction value is determined by the linear regression function in that leaf node, as shown in Figure 2.

Figure 2. Traditional C4.5 and ELM Tree Data Classification.

The proposed ELM tree is not applicable to incremental tree construction on large datasets. Also, this approach does not handle mixed types of attributes in the MapReduce framework.
Del Río, López, Benítez and Herrera proposed a model to overcome the class imbalance problem, where each class has a different data distribution and attribute types; oversampling and undersampling issues in large datasets are handled using the Random Forest algorithm for classification6. The drawbacks are that performance depends on the number of intermediate mappers used in the MapReduce framework: a static threshold for the minority class has to be determined with respect to the number of intermediate mappers to obtain better performance, and the static classification and static parameters need improvement when a MapReduce framework is used.

The imbalanced dataset issue in classification models may occur when the number of tuples or instances representing one class is much larger than that of the other classes. Imbalance between classes occurs in many fields, such as risk management, satellite images, medical data, sensor data and time series data.

Although hyper network graph theory has been used to solve various classification issues, it usually performs poorly when dealing with class imbalance. Like most conventional methods, hyper network models assume that the class and data distributions of the data sets are balanced7. Both single-labelled and multi-labelled classification are important research areas in supervised learning, but neither can overcome the imbalance issue, which has a negative effect on classifier accuracy. Addressing imbalance in multi-labelled classification is more complex, and it is even difficult to define the label distribution. Triguero, Peralta and colleagues let M = R^d denote the d-dimensional vector space and N = {n1, n2, …, nq}, ni ∈ {0, 1}, denote the binary label vector space8. The multi-labelled data set can then be represented as

D = {(p_i, q_i) | 1 <= i <= c, p_i ∈ R^d, q_i ⊆ N}

where d, q and c represent the number of attributes, the number of class labels and the number of instances, respectively. A small sketch of this representation is given below.
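To ground the notation, here is a tiny sketch (ours, not from the paper) of the multi-label representation, together with a per-label frequency check that makes class imbalance visible; the toy data set is an assumption.

```python
from collections import Counter
from typing import List, Set, Tuple

# A multi-labelled data set D = {(p_i, q_i) | 1 <= i <= c, p_i in R^d, q_i ⊆ N}:
# each instance is a feature vector plus a subset of the label space N.
Instance = Tuple[List[float], Set[int]]


def label_frequencies(dataset: List[Instance]):
    """Per-label positive frequency, a simple way to expose label imbalance."""
    counts = Counter(label for _, labels in dataset for label in labels)
    return {label: counts[label] / len(dataset) for label in counts}


# Toy data set: d = 2 features, label space N = {0, 1, 2}, c = 4 instances.
D = [([0.1, 0.9], {0}), ([0.4, 0.5], {0, 2}), ([0.8, 0.2], {0}), ([0.9, 0.1], {1})]
print(label_frequencies(D))  # label 0 dominates: {0: 0.75, 2: 0.25, 1: 0.25}
```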
MRPR implements a decision tree framework to generate decision rules at each level in a multidimensional model8. The main aim of this model is to optimize the decision rules with a minimal rule set. Mining rules at each level is based on the multidimensional data model shown in Figure 3. The framework has two phases, a data reduction phase and a decision rule generation phase9. In the first phase, attribute reduction and attribute filtering procedures are performed on the input data set. In the second phase, decision rules are generated using hierarchical rough sets, and the relationships between decision rules mined from different levels of the hierarchy are interpreted. The problems observed are that attribute reduction at multiple levels needs improvement, with dynamic threshold parameters to approximate the decision rules8.

Figure 3. Rough-set model with Random Forest using MapReduce. [Diagram: distributed data sources DB1, DB2, …, DB-n feed a data integration step that combines the ELM model with rough sets.]
Table 1 lists traditional classification algorithms, which cannot handle huge amounts of data. It describes the different classification models and their capabilities: attribute selection measure, scalability, parallelism support, decision tree structure and Big Data/Hadoop support. In our research work, we have implemented these models in the Hadoop framework for performance analysis, which gives better results.

Table 1. Decision Tree algorithms for Big data supporting parameters

Algorithm   Selection Technique   Scalability   Parallelism   Tree Structure   Bigdata Handle
ID3         Information Gain      Poor          Poor          Multi-tree       No
C4.5        Gain Ratio            Poor          Poor          Multi-tree       No
CART        Gini Coefficient      Poor          Poor          Binary           No
SLIQ        Gini Coefficient      Good          Good          Binary           No
SPRINT      Gini Coefficient      Good          Good          Binary           No
PUBLIC      Gini Coefficient      Poor          Poor          Binary           No
4. Experimental Results of Traditional Approaches

In this section, we evaluate the performance of the rough-set attribute selection measure with different Hadoop-based classification models. All the experiments are carried out on the Hadoop framework, an open-source framework that supports distributed applications. The Hadoop framework and NetBeans IDE version 8 are used to execute the MapReduce jobs. For data storage, an Amazon AWS cloud server with large instances is used to execute multiple large data sets on the different cluster nodes. Large data sets are downloaded from the UCI repository; each consists of 41 attributes with 10 decision classes and a large number of instances.
Table 2 describes the total number of instances used in our experiment and their statistical information. The proposed Hadoop-based feature selection model was tested with different classification models to measure the classification accuracy on large datasets. The KDD Cup'99 intrusion dataset was used with different classification models: the Genetic Algorithm Feature Selection Algorithm (GA_FSA), Neural Networks, ELM-Tree and Random Forest. Using the rough-set feature reduction model, the attributes relevant for attack detection are identified.

Table 2. Hadoop based Rough-set based feature selection model

Dataset   Size     Total Features   Roughset-based Features Reduced   Attack Classes
Kdd1      548825   41               25                                8
Kdd2      453653   41               27                                8
Kdd3      625977   41               23                                8
Kdd4      587734   41               23                                8

Table 3 and Figure 4 describe the performance analysis of the mapper and reducer Hadoop interface classes using the feature reduction process. Random Forest trees and rough sets are integrated in the Hadoop environment to measure the time complexity of the Mapper and Reducer interfaces. As shown in the table, we observed a reduced time complexity in the Mapper and Reducer phases for the Random Forest with Roughset model compared to the other classification models without rough sets.

Table 3. Hadoop based Classification model using Rough-set feature selection

Algorithm                  MapperTime (mins)   ReducerTime (mins)
GA_FSA                     15.35               4.57
Neural Network             21.53               5.75
ELM-Tree                   14.65               5.36
Random Forest (Roughset)   12.43               4.567

Figure 4. Algorithms with MapReduce Statistics.
Table 4 and Figure 5 describe the performance analysis of the classification models. The Random Forest tree with rough-set achieves a higher true positive classification rate for intrusion detection than the other classification models in the Hadoop framework. The true positive rate indicates the number of positive instances reflecting attacks relative to negative samples. Error (%) indicates the number of misclassified instances. The outliers (%), indicating the number of instances that are not relevant to the existing attack behaviour, are also computed.

Table 4. Traditional classification Algorithms with True Positive and Error Rates

Algorithm                  True Positive (%)   Error (%)   Outlier (%)
GA_FSA                     81.45               25.67       12.54
Neural Network             83.157              21.56       15.37
ELM-Tree                   84.26               27.89       16.24
Random Forest (Roughset)   89.45               19.04       11.15

Figure 5. Performance analysis of True Positive and Error Rates.
5. Conclusion

In this paper, we have studied the rough-set-based feature selection model with random forest in the Hadoop framework. The proposed model is compared with the traditional classification models along with their limitations and performance measures. Experimental results show that the proposed model performs well against the traditional models in the Hadoop framework. Unlike the existing solutions, it does not require prior knowledge of the classification accuracy for various types of data characteristics, which is impossible to obtain in practice. Traditional models need to improve the execution process and the classification accuracy on complex distributed databases. The main limitations of these models on big data are the outlier issue, scaling up to high-dimensional data classification, mining sparse data, and constrained optimization. Hence, one of our future works is to present a novel attribute-selection-based parallel classifier to process mixed attributes on large datasets.

6.  References

1. Qian J, Lv P, Yue X, Liu C, Jing Z. Hierarchical attribute reduction algorithms for big data using MapReduce. Knowledge-Based Systems. 2015; 73:18–31.
2. Leung CK, Jiang F. A data science solution for mining interesting patterns from uncertain big data. 2014 IEEE 4th International Conference on Big Data and Cloud Computing, Sydney, NSW. 2014. p. 235–42.
3. Fayed HA, Atiya AF. A novel template reduction approach for the k-nearest neighbor method. IEEE Transactions on Neural Networks. 2009; 20(5):890–96.
4. Kolias V, Kolias C, Anagnostopoulos I, Kayafas E. RuleMR: Classification rule discovery with MapReduce. 2014 IEEE International Conference on Big Data (Big Data), Washington, DC. 2014. p. 20–8.
5. Wang R, He YL, Chow CY, Ou FF, Zhang J. Learning ELM-tree from big data based on uncertainty reduction. Fuzzy Sets and Systems. 2015; 258:79–100.
6. Del Río S, López V, Benítez JM, Herrera F. On the use of MapReduce for imbalanced big data using Random Forest. Information Sciences. 2014 Nov; 285:112–37.
7. Feng Q, Miao D, Cheng Y. Hierarchical decision rules mining. Expert Systems with Applications. 2010 Mar; 37(3):2081–91.
8. Triguero I, Peralta D, Bacardit J, García S, Herrera F. MRPR: A MapReduce solution for prototype reduction in big data classification. Neurocomputing. 2015 Feb; 150:331–45.
9. Noh KS, Lee DS. Bigdata platform design and implementation model. Indian Journal of Science and Technology. 2015; 8(18):1–8.

