Applied Sciences: Attributes Reduction in Big Data
Article
Attributes Reduction in Big Data
Waleed Albattah 1,*, Rehan Ullah Khan 1 and Khalil Khan 2
1 Department of Information Technology, College of Computer, Qassim University, Buraydah, Saudi Arabia;
re.khan@qu.edu.sa
2 Department of Electrical Engineering, University of Azad Jammu and Kashmir,
Muzaffarabad 13100, Pakistan; khalil.khan@ajku.edu.pk
* Correspondence: w.albattah@qu.edu.sa
Received: 4 June 2020; Accepted: 10 July 2020; Published: 17 July 2020
Abstract: Processing big data requires substantial computing resources, which makes big data
processing a challenge not only for algorithms but also for infrastructure. This article analyzes
a large amount of data from different points of view. One perspective is the processing of reduced
collections of big data with fewer computing resources. To this end, the study analyzed 40 GB of data
to test various strategies for reducing data processing. The goal is to reduce the data without
compromising detection and model learning in machine learning. Several alternatives were analyzed,
and it was found that in many cases and types of settings, the data can be reduced to some extent
without compromising detection efficiency. Tests with 200 attributes showed that, with a performance
loss of only 4%, more than 80% of the data could be ignored. The results of the study thus provide
useful insights into big data analytics.
Keywords: attributes sampling; content-based filtering; Support Vector Machines; machine learning
1. Introduction
The recent sharp increase in technological development, and the associated growth in the volume of
data produced and disseminated, has turned regular data into big data. As the name implies, big data
refers to data of large size and varied form that requires high-speed servers for timely
ingestion [1]. Big data is characterized by the variety, volume, and velocity of the data that must
be handled. Such data is normally stored on large servers and accessed only when necessary [2].
Big data is used to support regular organizational processes, such as decision making and
validation [3]. However, to improve efficiency, a trade-off between efficiency and application size
is critical [4]. Typical examples are GPS and facial-recognition cameras. The effectiveness of such
applications can be improved by providing large datasets for model training. However, this is not
always feasible, because large datasets need comparatively large storage space and become difficult
to manipulate. Therefore, a mechanism is needed that allows subsets of big data to carry knowledge
and information similar to that found in the source data [5].
Big data also poses serious risks that need to be addressed to ensure end-user protection.
Accordingly, some parameters are usually specified to ensure the quality of big data and the quality
of information [6]. Examples of these parameters are syntactic validity, appropriate attribute
association, precision, theoretical relevance, controls, and auditability [6]. In addition, further
problems arise from the management of servers, privileges, sorting, and security [7]. By 2002,
digital media accounted for more than 92% of stored data, amounting to five exabytes [8], and this
volume has kept increasing, gradually aggravating the problem. Today, the big data business is worth
about $46.4 billion [8], which means that, despite the problems with data processing, user interest
has grown over the years. With respect to data mining, data processing becomes very difficult with
hundreds of classes separated by small differences, increased workload, and long compilation time [8].
In addition to its endless applications, big data has become a complex concept for data mining,
information fusion, social networks, the semantic web, etc. [9]. Accordingly, much attention has been
paid to data processing, pattern mining, data storage, analysis of user behavior, data visualization,
and data tracking [10].
This difficulty is intensified in the search for solutions to the problems of large collections,
because technologies such as machine learning, computational intelligence, and social networks use
libraries to process data. These libraries, in turn, grow in size as their scope expands.
As a result, solutions are constantly being sought to ease the processing and scanning of big data.
These solutions include data sampling, data condensation, density-based approaches, incremental
learning, divide and conquer, distributed computing, and others [8].
From a data-handling perspective, sampling big data raises considerable concerns about complexity,
computational burden, and inefficiency in executing the task appropriately [11]. The effort of
sampling lies in the number of data items that can be included in each sample. In general,
the richness of the sampled data is considered poor only if the sample is biased in estimation [12].
In this regard, selection bias can be calculated and successfully corrected using the inverse
sampling procedure, where information from external resources is used, or using data integration
techniques, where big data is combined with independent probabilistic sampling [13]. The sample size
is very critical and plays a significant role in the accuracy of a system [14]. Thus, as solutions
to the challenges of sampling big data, several algorithms have been introduced, such as the
Zig-Zag process [15], nonprobabilistic sampling [13], and inverse and cluster sampling [16].
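Of these strategies, the simplest, uniform random subsampling without replacement, can be sketched as follows. This is a minimal illustration using NumPy; the array sizes and the three-class labels are placeholders, not this study's actual dataset:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def subsample(X, y, fraction):
    """Keep a uniformly random fraction of the instances, without replacement."""
    n = X.shape[0]
    idx = rng.choice(n, size=int(n * fraction), replace=False)
    return X[idx], y[idx]

# Placeholder stand-in for a large dataset: 10,000 instances, 32 attributes.
X = rng.normal(size=(10_000, 32))
y = rng.integers(0, 3, size=10_000)  # three classes, as in the study's data

X_small, y_small = subsample(X, y, fraction=0.2)
print(X_small.shape)  # (2000, 32)
```

More elaborate schemes (inverse or cluster sampling) replace the uniform draw with a design that corrects for selection bias.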
Machine learning is a form of data analysis that learns from the available data for prediction and
decision making [16]. Machine learning techniques extract and quantify trends in the data.
These systems are designed to understand information and to extract meaning from it. Training is
carried out through comparison, communication, strategies, discovery, problem solving, and search
with changing parameters. The ability of any system to learn depends on the amount of information
with which it can work and the amount that can be processed [4]. Machine learning improves as the
amount of input data increases; however, the algorithms used are usually traditional and designed
for simple datasets, which makes the job even harder. Challenges with big data include, but are not
limited to, memory and processing problems, unstructured data forms, rapidly changing data,
and unlabeled data [17].
Deep learning in the form of Convolutional Neural Networks (CNNs) is used to accurately model
classification data [18], particularly image and text data. Interest in CNN-based recognition and
detection has further increased in recent years, owing to their improved classification and
detection performance in the imaging domain. On the other hand, CNNs require a huge amount of
processing power for large datasets. In its simplest form, a CNN requires several convolution
layers, pooling layers, and fully connected layers, and therefore considerable resources and time
to train and learn the distribution of the features. In large networks, for example GoogLeNet and
VGG16, the time complexity and resource requirements increase sharply. Therefore, there is a need
to analyze whether, for a dataset with strongly correlated images, one should process all the
features and images of that dataset. As such, we believe that research on optimally reducing the
dataset is of immense interest not only for traditional classifiers but also for deep learning
models, and this forms one of the motivations for the work in this article.
Moreover, attention has shifted from manual to automatic feature extraction [19,20]. However,
the accompanying increase in features and data instances has not been examined enough.
Dimensionality clearly affects the performance of the final model; similarly, a steady increase in
the amount of data forces machine learning models to be retrained and reassessed. This creates
a heavy dependence on powerful computing equipment and resources, facilities that remain
inaccessible to the masses and to many research institutes.
Appl. Sci. 2020, 10, 4901 3 of 12
The work in this paper discusses the reduction of data for classification purposes and some machine
learning scenarios. The study analyzes the role of data attributes and whether the findings can be
generalized to big data. Attributes play a significant role in classification and model learning.
While more attributes are considered an advantage for building better models, they can at the same
time complicate matters considerably if they do not cover the data and classes fairly. This problem
becomes even worse if the data is very big, which makes dimensionality a serious issue because of
the large number of data attributes. Given these impacts of data attributes on classifier
performance, this study investigates how they affect classification performance. The experiments in
the study use the video dataset available at [21], a massive dataset consisting of more than 40 GB
of video data. The data are divided into three categories for analysis purposes: Unacceptable,
Flagged, and Acceptable. The study is based on the general concept of data sampling; however,
it uses data from image filtering. This can be justified by the following three key points: first,
the data is well organized into three categories, which represents an appropriate case for machine
learning algorithms. Second, although the data are images, they can be converted to numerical values
in feature form, which makes them equivalent to other datasets and similar machine learning
problems. The last point is the huge size of the data, which exceeds 40 GB, meaning the data used in
the analysis qualifies as big data. As a result, the findings can be generalized to studies with
data of a similar nature.
2. Related Work
Looking at the literature, there are works such as [22–25] that propose and model such scenarios.
The work in [22] combines AlexNet [20] and GoogLeNet [26] to increase productivity. The work in [24]
uses color transformations. Supporting evidence is given in [27]. In [28], an adaptive sampling
method is used for filtering. Article [29] explains the analysis of a website filter, and [30]
combines the analysis of key frames. The works in [31,32] use visual features for multimedia access
and filtering. Articles [33–36] are based on content retrieval.
Another method, known as neighborhood rough sets [37], is widely used as a tool to reduce attributes
in big data. Most of the existing methods cannot describe the neighborhoods of samples well;
according to the authors of [37], their proposed work is the best choice in such scenarios. The work
proposed in [38] introduces a hierarchical framework for attribute reduction in big data.
The authors propose a supervised classifier model. To reduce noise, Gabor filters are used,
which improve the overall efficiency of the proposed framework. For feature selection, Elephant Herd
Optimization is utilized, and the system architecture is implemented with a Support Vector Machine.
To remove irrelevant attributes and keep only important ones, Raddy et al. [39] propose a method
that uses two important feature selection methods: Linear Discriminant Analysis and Principal
Component Analysis. Four machine learning algorithms, namely Naive Bayes, Random Forest, Decision
Tree Induction, and Support Vector Machine classifiers, are used by the authors. In [40], the authors
investigate attribute reduction in parallel through dominance-based neighborhood rough sets (DNRS),
which consider the partial orders among numerical and categorical attributes. The authors present
some properties of attribute reduction in DNRS and investigate some principles of parallel attribute
reduction. Parallelization of the various components of attribute reduction is also explored in
detail. A multicriterion-based attribute reduction method is proposed in [41], considering both the
neighborhood decision error and the neighborhood decision consistency. The neighborhood decision
consistency measures how the classification varies when attribute values vary. Following the new
attribute reduction criterion, a heuristic method is also proposed to derive a reduct that targets
a comparatively lower error rate. The experimental results confirm that the multicriterion-based
reduction improves decision consistency and also gives more stable results.
3. Classification Models
In this section, we discuss the classifiers used in the experimental evaluation.
Classifiers learn the inherent structure in the data. The classification and learning ability
strongly depends on the data types, the correlation among the attributes, and the amount of clean
data processed for a particular problem. We selected SVM, Random Forest, and AdaBoost for our
sampling analytics due to their good overall state-of-the-art performance on most correlated
data problems.
Recently, tree-based classifiers have become widely popular. This acceptance stems from their
intuitive nature and their generally easy training. However, decision trees suffer from a trade-off
between classification accuracy and generalization: it is difficult to increase both simultaneously.
Leo Breiman [42] introduced the Random Forest for this reason. A Random Forest has the advantage of
combining several trees built from the same dataset. Random Forest (RF) creates a forest of trees,
where each tree is built on a random bootstrap sample of the data. In the classification step,
the input is applied to each tree, and each tree decides on a class for the input vector.
The decisions are collected for the final classification; the decision of a tree is called its vote
in the forest. For example, if for a specific problem three out of five trees vote “yes” and two
vote “no”, the Random Forest classifies the input as “yes”, because a Random Forest works by
majority vote.
In a Random Forest, the classification trees are grown as follows. Let the number of cases in the
training set be N; then N data items are sampled at random with replacement from the dataset.
This sample constitutes the training set for growing a tree. If there are K variables, a small
number k, with k << K, is specified such that k variables are selected at random from the full set
of K. The best split on these k variables is used to split the node, and the value of k is held
constant while the forest grows. Each tree is allowed to grow to the largest extent possible on its
part of the data; there is no pruning. As the tree count increases, the generalization error
converges to a limit.
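The growing-and-voting procedure above can be sketched as follows. This is a simplified illustration on synthetic data, not the paper's pipeline: scikit-learn decision trees are grown on bootstrap samples, and for brevity a random subset of k features is drawn once per tree rather than at every split, as in Breiman's original method:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(seed=1)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=1)

n_trees, k = 15, 4  # k << K = 20 variables
trees, subsets = [], []
for _ in range(n_trees):
    boot = rng.integers(0, len(X), size=len(X))    # N items, with replacement
    feats = rng.choice(X.shape[1], size=k, replace=False)
    tree = DecisionTreeClassifier(random_state=0)  # fully grown, no pruning
    tree.fit(X[boot][:, feats], y[boot])
    trees.append(tree)
    subsets.append(feats)

# Each tree casts a vote; the forest's decision is the majority vote.
votes = np.array([t.predict(X[:, f]) for t, f in zip(trees, subsets)])
majority = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (majority == y).mean()
print(round(accuracy, 3))
```

In practice, `sklearn.ensemble.RandomForestClassifier` implements the per-split feature sampling and aggregation directly.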
Support Vector Machines (SVMs) [43] are supervised learning methods used for classification and
regression in computer vision and other fields. Given a training dataset consisting of two classes,
the SVM builds a training model that assigns newly sampled data to one or the other category,
which makes it a nonprobabilistic binary classifier. In an SVM, the data can be visualized as points
in space, separated by the hyperplane whose margin (gap) is as large as possible. SVMs can also
perform nonlinear classification using a kernel that maps the inputs into a high-dimensional feature
space where separation becomes easy. SVMs have shown their potential in a number of classification
tasks, ranging from simple text classification to the imaging, audio, and deep learning domains.
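A minimal usage sketch on synthetic two-class data, assuming scikit-learn's SVC (an RBF kernel stands in for the kernel mapping described above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# The RBF kernel implicitly maps the inputs to a high-dimensional space,
# where a maximum-margin hyperplane separates the two classes.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
print(round(accuracy, 3))
```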
Adaptive boosting (AdaBoost) [44] is yet another approach for increasing the accuracy of
a classification task. The idea of AdaBoost is to apply a weak classification method to repeatedly
modified versions of the data [44]. This, in turn, produces a sequence of comparatively weak
classifiers, whose predictions are then combined through a weighted majority vote to produce the
final prediction.
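This reweight-and-combine scheme can be sketched with scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision tree (the data here is synthetic, not the study's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=2)

# Each boosting round refits the weak learner on reweighted data, so that
# previously misclassified samples receive more attention; the weighted
# votes of the weak learners form the final prediction.
ada = AdaBoostClassifier(n_estimators=50, random_state=2)
ada.fit(X, y)
train_accuracy = ada.score(X, y)
print(round(train_accuracy, 3))
```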
4.2. SVM
For the analysis of the attributes with the SVM classifier, we perform several experiments to
analyze the role of reducing the attributes and to see whether doing so decreases or increases
performance. For this, we use several well-established approaches from the state-of-the-art.
These are:
• Subset Evaluation [45]: Evaluates the importance of a reduced set of attributes, taking into
account the individual predictive capability of every attribute and measuring the redundancy
between them.
• Correlation Evaluation [46]: Considers the importance of a feature by analyzing the correlation
between the feature and the class variable.
• Gain Ratio Evaluation [47]: Considers the importance of a feature by analyzing its gain ratio with
respect to the class.
• Info Gain Evaluation [47]: Considers the importance of an attribute by measuring its information
gain with respect to the corresponding class.
• OneR Evaluation [48]: Uses the OneR classifier to assess the attribute’s role in model building.
• Principal Components [49]: Performs a principal components analysis and transformation of the data.
• Relief Evaluation [50]: Finds the importance of an attribute by repeated sampling, considering the
value of the attribute for the nearest instances of the same and different classes.
• Symmetrical Uncertainty Evaluation [51]: Considers the attribute importance by measuring the
symmetrical uncertainty with respect to the class variable.
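As a hedged illustration of the ranking-based selectors above, the following sketch keeps the 200 highest-ranked of 1024 attributes, using scikit-learn's SelectKBest with mutual information as an information-gain-style score (the data is synthetic, not the paper's feature vectors):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Placeholder data with 1024 attributes, mirroring the study's feature width.
X, y = make_classification(n_samples=300, n_features=1024, n_informative=20,
                           random_state=0)

# Rank every attribute by its mutual information with the class and
# retain the top 200, discarding the rest.
selector = SelectKBest(score_func=mutual_info_classif, k=200)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (300, 200)
```

The other ranking evaluators differ only in the scoring function used for the ranking.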
Together, these approaches provide a rich and representative set of feature selection methods for
similar tasks. Table 1 shows the different approaches and the number of attributes selected by each.
Actual attributes means the attributes returned by the feature extraction methods, before any
feature selection. The other eight rows, starting from the Subset Evaluation, are the selection
methods, which select an appropriate number of attributes depending on the algorithm. The Subset
Evaluation selects 84 important features and discards the others. The Correlation Evaluation,
Gain Ratio Evaluation, Info Gain Evaluation, OneR Evaluation, Relief Evaluation, and Symmetrical
Uncertainty Evaluation rank features according to importance; the 200 most important features
returned by each of these algorithms are selected. Two hundred attributes are sufficient, given that
the Subset Evaluation selects only 84 of the 1024 attributes. The Principal Components returns
258 important components from the data.
Table 1. Total number of attributes selected by the different algorithms. The 200 most important
features ranked according to importance are retrieved.

Approach                              Attributes
Actual Attributes Before Selection    1024
Subset Evaluation                     84
Correlation Evaluation                200
Gain Ratio Evaluation                 200
Info Gain Evaluation                  200
OneR Evaluation                       200
Principal Components                  258
Relief Evaluation                     200
Symmetrical Uncert. Evaluation        200
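The reduction each selector achieves relative to the 1024 actual attributes follows directly from these counts; a quick check:

```python
total = 1024
selected = {
    "Subset Evaluation": 84,
    "Relief Evaluation": 200,
    "Principal Components": 258,
}

for name, count in selected.items():
    kept = 100 * count / total
    print(f"{name}: keeps {kept:.1f}% of the attributes, "
          f"discards {100 - kept:.1f}%")
# The 200-attribute selectors keep about 19.5% of the attributes, i.e.,
# they discard roughly 80% of the data, consistent with the text.
```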
The results of the attributes selected by these algorithms are interesting. Figures 2 and 3 show the
corresponding feature selection algorithm and the F-measure using the SVM classifier and the Random
Forest, respectively. In Figure 2, SVM is used for model learning and validation. In Figure 3,
the Random Forest is used for model generation and the corresponding evaluation.
Figure 2. SVM performance based on the F-measure for the Actual 1024 attributes and the F-measure
based on the attributes selected by the corresponding algorithm.
Even though the other F-measure values are less than that of the actual data, the considerable
reduction in the dataset achieved by the attribute selection methods should not be neglected.
The F-measure of the actual data attributes, which number 1024, is 0.784. The F-measure for Subset
Evaluation is 0.762, Correlation Evaluation is 0.786, the Gain Ratio Evaluation is 0.783, the Info
Gain Evaluation is 0.77, OneR Evaluation is 0.778, Principal Components is 0.718, Relief Evaluation
is 0.79, and the Symmetrical Uncertainty Evaluation is 0.767. These results are interesting and shed
valuable light on attribute selection. Despite the great reduction in the actual data attributes,
the Correlation Evaluation and Relief Evaluation F-measures are slightly higher than the F-measure
of the actual data attributes, which leads to an interesting conclusion: the attributes selected by
these two algorithms can successfully represent the actual dataset. The lowest F-measure of 0.718 is
obtained for Principal Components. The Subset Evaluation has a slightly lower F-measure (0.762) than
the actual attributes (0.784). From these F-measure values, it is noted that the loss in F-measure
is small in comparison to the advantage of reducing the amount of processed data. Furthermore,
it is worth mentioning that the F-measure even increased in two cases. As the point of this article
is to analyze the impact of reducing data on performance, the results show an interesting trend.
Figure 3. Random Forest performance based on the F-measure for the Actual 1024 attributes and the
F-measure based on the attributes selected by the corresponding algorithm.
As 100% of the data is represented by the 1024 attributes, the attribute selection methods provide
a reduced set of data. For the Subset Evaluation, 84 attributes are selected with an F-measure of
0.762. This means that while using only 8% of the actual data, we see just a 2% decrease in
performance. From a physical-data perspective, we can deduce that only 8 GB out of 100 GB of data
would give approximately the same performance as the full 100 GB. This is even more interesting for
the Correlation Evaluation and the Relief Evaluation, where the F-measure is slightly higher than
that of the actual data attributes; that means the smaller set can represent the larger dataset.
One of the beauties of the attribute selection methods is that they provide a reduced set of data:
instead of processing the actual set of data attributes, a reasonable performance can be achieved
with a great reduction in data attributes, and consequently a great improvement in data processing
and time. Let us analyze one attribute selection method more closely; for this, we select the Relief
Evaluation from Figure 3. The Relief Evaluation uses 200 attributes and achieves an F-measure of
0.838. This means that while using only 19% of the actual data of 1024 attributes and over 40 GB,
we see just a 1% decrease in classification and recognition performance. From a data perspective,
only 19 GB out of 100 GB of data would give approximately the same performance as the 100 GB
dataset. The OneR Evaluation also shows similar
results, with only a 1% decrease in performance. Without a doubt, the slight decrease in performance
can still be viewed positively, given the great reduction in data processing. The lowest F-measure
is that of the Gain Ratio Evaluation, which shows almost a 6% decrease in performance. Even with
that 6%, we get an 81% decrease in data processing and time. As the point of this article is to
analyze the impact of reducing data on performance, these results show an interesting trend towards
processing big data with acceptable results. One other interesting insight noted in Figures 2 and 3
is that Principal Components has comparatively reduced performance. One reason is that it may not
have been investigated as thoroughly as the other algorithms in this experimentation setup. Finally,
the work in this paper presents a continuation of the sampling strategies of the previous work
in [46,47] and thus augments the related domain with new experiments and results.
4.4. Adaptive Boosting
Figure 4 shows the analysis done using the AdaBoost approach. The AdaBoost approach is analyzed due
to its inherent similarity to the Random Forest classification approach. For this analysis,
the AdaBoost uses J48 trees as the base classifier. Figure 4 shows a trend of the F-measure similar
to that of the Random Forest approach. In Figure 4, for the AdaBoost, the nonreduced F-measure is
0.786, which is almost the same as that of the SVM. The F-measure for Subset Evaluation is 0.772,
Correlation Evaluation is 0.772, the Gain Ratio Evaluation is 0.735, the Info Gain Evaluation is
0.782, OneR Evaluation is 0.79, Principal Components is 0.758, Relief Evaluation is 0.772, and the
Symmetrical Uncertainty Evaluation F-measure is 0.761. From a visual perspective, the trend of the
F-measure in Figure 4 is almost the same as that of the Random Forest in Figure 3. For AdaBoost and
the Random Forest, though there is a difference in the actual F-measure, the trend is almost the
same, which shows an interesting similarity of behavior even on large datasets, thus reinforcing the
overall results and analysis of the proposed work.
Figure 4. AdaBoost performance based on the F-measure for the Actual 1024 attributes and the
F-measure based on the attributes selected by the corresponding algorithm.
6. Conclusions
Big data analysis requires substantial computing resources. This creates challenges not only for
data processing but also for researchers who do not have access to powerful workstations. In this
paper, we analyzed a large amount of data from different points of view. One perspective is the
processing of reduced collections of big data with fewer computing resources. Therefore, the study
analyzed 40 GB of data to test various strategies to reduce data processing without defeating the
purpose of detection and model learning in machine learning. Several alternatives were analyzed,
and it was found that in many cases and types of datasets, the data can be reduced without
compromising detection performance. Tests with 200 attributes showed that, with a performance loss
of only 4%, more than 80% of the data could be deleted. The experimental setup in this work is
extensive; however, although this work produced valuable results, additional work is still required
to analyze a larger number of big datasets. In the future, we aim to analyze several large datasets
in order to obtain further important analytical results.
Author Contributions: Conceptualization, R.U.K.; methodology, R.U.K.; software, K.K.; validation, K.K.;
formal analysis, R.U.K.; investigation, R.U.K. and K.K.; resources, R.U.K. and W.A.; data curation, R.U.K.;
writing—original draft preparation, R.U.K. and W.A.; writing—review and editing, R.U.K. and W.A.; visualization,
R.U.K.; supervision, W.A.; project administration, W.A.; funding acquisition, W.A. All authors have read and
agreed to the published version of the manuscript.
Funding: This research was funded by the Scientific Research Deanship (SRD), grant number: coc-2018-1-14-S-3603
at Qassim University, Saudi Arabia.
Acknowledgments: This research was funded by the Scientific Research Deanship (SRD), grant number:
coc-2018-1-14-S-3603 at Qassim University, Saudi Arabia.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Albattah, W. The Role of Sampling in Big Data Analysis. In Proceedings of the International Conference on
Big Data and Advanced Wireless Technologies—BDAW ’16, Blagoevgrad, Bulgaria, 10–11 November 2016;
pp. 1–5.
2. Hilbert, M. Big Data for Development: A Review of Promises and Challenges. Dev. Policy Rev. 2016, 34,
135–174. [CrossRef]
3. Reed, D.A.; Dongarra, J. Exascale computing and big data. Commun. ACM 2015, 58, 56–68. [CrossRef]
4. L’Heureux, A.; Grolinger, K.; Elyamany, H.F.; Capretz, M.A.M. Machine Learning With Big Data: Challenges
and Approaches. IEEE Access 2017, 5, 7776–7797. [CrossRef]
5. Singh, K.; Guntuku, S.C.; Thakur, A.; Hota, C. Big Data Analytics framework for Peer-to-Peer Botnet detection
using Random Forests. Inf. Sci. N. Y. 2014, 278, 488–497. [CrossRef]
6. Clarke, R. Big data, big risks. Inf. Syst. J. 2016, 26, 77–90. [CrossRef]
7. Sullivan, D. Introduction to Big Data Security Analytics in the Enterprise. Available online:
https://searchsecurity.techtarget.com/feature/Introduction-to-big-data-security-analytics-in-the-enterprise
(accessed on 31 July 2018).
8. Tsai, C.-W.; Lai, C.-F.; Chao, H.-C.; Vasilakos, A.V. Big data analytics: A survey. J. Big Data 2015, 2, 21.
[CrossRef]
9. Bello-Orgaz, G.; Jung, J.J.; Camacho, D. Social big data: Recent achievements and new challenges. Inf. Fusion
2016, 28, 45–59. [CrossRef]
10. Zakir, J.; Seymour, T.; Berg, K. Big Data Analytics. Issues Inf. Syst. 2015, 16, 81–90.
11. Sivarajah, U.; Kamal, M.M.; Irani, Z.; Weerakkody, V. Critical analysis of Big Data challenges and analytical
methods. J. Bus. Res. 2017, 70, 263–286. [CrossRef]
12. Engemann, K.; Enquist, B.J.; Sandel, B.; Boyle, B.; Jørgensen, P.M.; Morueta-Holme, N.; Peet, R.K.; Violle, C.;
Svenning, J.-C. Limited sampling hampers ‘big data’ estimation of species richness in a tropical biodiversity
hotspot. Ecol. Evol. 2015, 5, 807–820. [CrossRef]
13. Kim, J.K.; Wang, Z. Sampling techniques for big data analysis. arXiv, 2018, arXiv:1801.09728v1. [CrossRef]
14. Liu, S.; She, R.; Fan, P. How Many Samples Required in Big Data Collection: A Differential Message
Importance Measure. arXiv, 2018, arXiv:1801.04063.
15. Bierkens, J.; Fearnhead, P.; Roberts, G. The Zig-Zag Process and Super-Efficient Sampling for Bayesian
Analysis of Big Data. arXiv, 2016, arXiv:1607.03188. [CrossRef]
16. Zhao, J.; Sun, J.; Zhai, Y.; Ding, Y.; Wu, C.; Hu, M. A Novel Clustering-Based Sampling Approach for
Minimum Sample Set in Big Data Environment. Int. J. Pattern Recognit. Artif. Intell. 2018, 32, 1–10. [CrossRef]
17. Zhou, L.; Pan, S.; Wang, J.; Vasilakos, A.V. Machine learning on big data: Opportunities and challenges.
Neurocomputing 2017, 237, 350–361. [CrossRef]
18. Kotzias, D.; Denil, M.; de Freitas, N.; Smyth, P. From Group to Individual Labels Using Deep Features.
In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, Sydney, Australia, 10–13 August 2015; pp. 597–606.
19. Farabet, C.; Couprie, C.; Najman, L.; LeCun, Y. Learning Hierarchical Features for Scene Labeling. IEEE Trans.
Pattern Anal. Mach. Intell. 2013, 35, 1915–1929. [CrossRef]
20. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks.
In Proceedings of the 25th International Conference on Neural Information Processing Systems; Curran Associates
Inc.: New York, NY, USA, 2012; Volume 1, pp. 1097–1105.
21. Avila, S.; Thome, N.; Cord, M.; Valle, E.; Araújo, A.D.A. Pooling in Image Representation: The Visual
Codeword Point of View. Comput. Vis. Image Underst. 2013, 117, 453–465. [CrossRef]
22. Moustafa, M. Applying deep learning to classify pornographic images and videos. arXiv, 2015,
arXiv:1511.08899.
23. Lopes, A.P.B.; de Avila, S.E.F.; Peixoto, A.N.A.; Oliveira, R.S.; Coelho, M.D.M.; Araújo, A.D.A. Nude Detection
in Video Using Bag-of-Visual-Features. In Proceedings of the XXII Brazilian Symposium on Computer
Graphics and Image Processing, Rio de Janeiro, Brazil, 11–15 October 2009; pp. 224–231.
24. Abadpour, A.; Kasaei, S. Pixel-Based Skin Detection for Pornography Filtering. Iran. J. Electr. Electron. Eng.
2005, 1, 21–41.
25. Ullah, R.; Alkhalifah, A. Media Content Access: Image-based Filtering. Int. J. Adv. Comput. Sci. Appl. 2018,
9, 415–419. [CrossRef]
26. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A.
Going Deeper with Convolutions. arXiv, 2014, arXiv:1409.4842.
27. Valle, E.; Avila, S.; Souza, F.; Coelho, M.; Araujo, A.D.A. Content-Based Filtering for Video Sharing Social
Networks. In Proceedings of the XII Simpósio Brasileiro em Segurança da Informação e de Sistemas
Computacionais—SBSeg, Curitiba, Brazil, 19–22 November 2012; p. 28.
28. Eleuterio, P.M.S.; Polastro, M.C. An adaptive sampling strategy for automatic detection of
child pornographic videos. In Proceedings of the Seventh International Conference on Forensic Computer
Science, Brasilia, Brazil, 26–28 September 2012.
29. Agarwal, N.; Liu, H.; Zhang, J. Blocking objectionable web content by leveraging multiple information
sources. ACM SIGKDD Explor. Newsl. 2006, 8, 17–26. [CrossRef]
30. Jansohn, C.; Ulges, A.; Breuel, T.M. Detecting pornographic video content by combining image features with
motion information. In Proceedings of the 17th ACM International Conference on Multimedia, Beijing,
China, 19–24 October 2009; pp. 601–604.
31. Wang, J.-H.; Chang, H.-C.; Lee, M.-J.; Shaw, Y.-M. Classifying Peer-to-Peer File Transfers for Objectionable
Content Filtering Using a Web-based Approach. IEEE Intell. Syst. 2002, 17, 48–57.
32. Lee, H.; Lee, S.; Nam, T. Implementation of high performance objectionable video classification system.
In Proceedings of the 8th International Conference Advanced Communication Technology, Phoenix Park,
Korea, 20–22 February 2006; pp. 4–962.
33. Liu, D.; Hua, X.-S.; Wang, M.; Zhang, H. Boost search relevance for tag-based social image retrieval.
In Proceedings of the IEEE International Conference on Multimedia and Expo, Cancun, Mexico, 28 June–3
July 2009; pp. 1636–1639.
34. da Silva Júnior, J.A.; Marçal, R.E.; Batista, M.A. Image Retrieval: Importance and Applications. In Proceedings
of the Workshop de Visão Computacional—WVC, Minas Gerais, Brazil, 6–8 October 2014; pp. 311–315.
35. Badghaiya, S.; Bharve, A. Image Classification using Tag and Segmentation based Retrieval. Int. J.
Comput. Appl. 2014, 103, 20–23. [CrossRef]
36. Bhute, A.N.; Meshram, B.B. Text Based Approach for Indexing and Retrieval of Image and Video: A Review.
Adv. Vis. Comput. Int. J. 2014, 1, 27–38. [CrossRef]
37. Wang, C.; Shi, Y.; Fan, X.; Shao, M. Attribute reduction based on k-nearest neighborhood rough sets.
Int. J. Approx. Reason. 2019, 106, 18–31.
38. Lakshmanaprabu, S.K.; Shankar, K.; Khanna, A.; Gupta, D.; Rodrigues, J.J.P.C.; Pinheiro, P.R.;
De Albuquerque, V.H.C. Effective features to classify big data using social internet of things. IEEE
Access 2018, 6, 24196–24204. [CrossRef]
39. Reddy, G.T.; Reddy, M.P.K.; Lakshmanna, K.; Kaluri, R.; Rajput, D.S.; Srivastava, G.; Baker, T. Analysis
of dimensionality reduction techniques on big data. IEEE Access 2020, 8, 54776–54788. [CrossRef]
40. Chen, H.; Li, T.; Cai, Y.; Luo, C.; Fujita, H. Parallel attribute reduction in dominance-based neighborhood
rough set. Inf. Sci. 2016, 373, 351–368. [CrossRef]
41. Li, J.; Yang, X.; Song, X.; Li, J.; Wang, P.; Yu, D.-Y. Neighborhood attribute reduction: A multi-criterion
approach. Int. J. Mach. Learn. Cybern. 2019, 10, 731–742. [CrossRef]
42. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
43. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [CrossRef]
44. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning, Data Mining, Inference, and Prediction,
2nd ed.; Springer: New York, NY, USA, 2009.
45. Hall, M.A.; Smith, L.A. Practical feature subset selection for machine learning. Comput. Sci. 1998, 98, 181–191.
46. Hall, M. Correlation-based Feature Selection for Machine Learning. Methodology 1999, 195, 1–5.
47. Gowda Karegowda, A.; Manjunath, A.S.; Jayaram, M.A. Comparative study of attribute selection using gain
ratio and correlation based feature selection. Int. J. Inf. Technol. Knowl. Manag. 2010, 2, 271–277.
48. Holte, R.C. Very Simple Classification Rules Perform Well on Most Commonly Used Datasets. Mach. Learn.
1993, 11, 63–90. [CrossRef]
49. Jolliffe, I.T. Choosing a Subset of Principal Components or Variables. In Principal Component Analysis; Springer:
New York, NY, USA, 2002; pp. 111–149.
50. Kira, K.; Rendell, L.A. A Practical Approach to Feature Selection. In Machine Learning Proceedings; Morgan
Kaufmann: Burlington, MA, USA, 1992; pp. 249–256.
51. Kononenko, I. Estimating attributes: Analysis and extensions of RELIEF. In Proceedings of the European
Conference on Machine Learning, Catania, Italy, 6–8 April 1994; pp. 171–182.
52. Albattah, W.; Khan, R.U. Processing Sampled Big Data. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 350–356. [CrossRef]
53. Albattah, W.; Albahli, S. Content-based prediction: Big data sampling perspective. Int. J. Eng. Technol. 2019,
8, 627–635.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).