There are many Big Data platforms a company can choose, like Hadoop and Apache Spark, to analyze large sets of data. Moreover, many data mining techniques like Classification, Clustering Analysis, Correlation Analysis, Decision Tree Induction, and Regression Analysis can be used to identify patterns for knowledge discovery. In this paper, there is an extensive review and summary of Big Data Mining techniques with the most common data mining algorithms suitable for handling large datasets. The review depicts the general pros and cons of these algorithms and the corresponding fields in which they apply, and in general acts as a guideline for data mining researchers to have an outlook on what algorithms to choose based on their needs and on the given datasets.

Keywords
Big Data, Big Data Analytics, Data Mining Algorithms, Data Clustering

1. INTRODUCTION
When we refer to Big Data, we mean the combination of structured, semi-structured, and unstructured data collected by organizations and used in various projects in combination with predictive modeling tools and advanced Big Data analytics applications. The classifications of data referred to above are very important to understand, due to the rapid increase of semi-structured and unstructured data nowadays on the one hand, and the advanced development of tools that make managing and analyzing these classes of data feasible on the other hand.

Structured data. – Structured data can be created by machines and humans and has a pre-defined (fixed) data model, format and structure, where a database designer can arrange entities so that they can be grouped together to form relations. This makes structured data easy to store, analyze and search. A relational database is a representative example of structured data, where tables are linked together using unique IDs and a query language is used to interact with the data. Today the estimated amount of structured data accounts for less than 20 percent of all data, whereas a much bigger percentage of all the data in our world is unstructured.

Unstructured data. – Unstructured data has no inherent structure, cannot be contained in a row-column database, and does not have an associated data model.

Semi-structured data. – Semi-structured data is basically a mix between structured and unstructured data; it has some defining or consistent characteristics with some structure, but it does not conform to a data model. Semi-structured data lacks a fixed or rigid schema and cannot be stored in the form of rows and columns in databases, but it contains tags and elements in the form of metadata, which are used to group data and describe how the data is stored. Examples of semi-structured data sources are e-mails, XML and other markup languages, binary executables, TCP/IP packets, zipped files, and web pages.

2. BIG DATA DIMENSIONS
The concept of Big Data gained momentum in the early 2000s, when Gartner analyst Doug Laney articulated the definition of Big Data by analyzing the Volume, Velocity and Variety dimensions, the so-called three (Vs) [2]. According to that, there are three significant dimensions of Big Data (Figure 1).

Figure 1: The 3 (Vs) of Big Data

Nowadays, we all know that Big Data has penetrated every industry, and it is accepted as a prevailing driving force for every organization to succeed across the globe. "Big Data" as a terminology refers to huge and complex data that is difficult
various measures of significance and interest are used, so as to select the suitable rules among the set of all possible rules. An association rule always has two parts, an antecedent (if) and a consequent (then), where the antecedent is something that is found in the data, and the consequent is an item that is found in combination with the antecedent. The two primary measures that association rules use are support and confidence, which are user-defined measures of interestingness [10].

Support. – Support is the measure of how frequently an itemset appears in the dataset, where for a given rule the itemset is the list of all the items in the antecedent and the consequent.

$$\mathrm{Support}(X \rightarrow Y) = \frac{\text{Transactions containing both } X \text{ and } Y}{\text{Total number of transactions}} = \frac{frq(X,Y)}{N} \quad (1)$$

In other words, support denotes the frequency of the rule within the transactions. A high value means that the rule involves a large part of the database.

Confidence. – Confidence is the measure of the likelihood of occurrence of the consequent in a cart, given that the cart already has the antecedents.

$$\mathrm{Confidence}(X \rightarrow Y) = \frac{\text{Transactions containing both } X \text{ and } Y}{\text{Transactions containing } X} = \frac{frq(X,Y)}{frq(X)} \quad (2)$$
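As an illustration of equations (1) and (2), the following minimal Python sketch computes the support and confidence of a hypothetical rule {bread} → {butter} over a small list of invented transactions; the items and transactions are assumptions made purely for illustration:

```python
# Minimal sketch: support and confidence of a rule X -> Y (equations (1) and (2)).
# The transactions and items below are hypothetical examples.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

X, Y = {"bread"}, {"butter"}

n_total = len(transactions)
n_x = sum(1 for t in transactions if X <= t)          # transactions containing X
n_xy = sum(1 for t in transactions if (X | Y) <= t)   # transactions containing both X and Y

support = n_xy / n_total      # frq(X, Y) / N
confidence = n_xy / n_x       # frq(X, Y) / frq(X)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Here the rule holds in 3 of the 5 transactions (support 0.6) and in 3 of the 4 transactions that contain bread (confidence 0.75).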
regression is a predictive analysis. Logistic regression is used to describe data, explaining the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables. The heart of the matter of the logistic regression analysis is the task of estimating the log odds of an event.

A statistical model of this kind is typically used to model a binary dependent variable with the help of the logistic function (also known as the sigmoid function), given by equation (3):

$$F(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{1 + e^{x}} \quad (3)$$

This function helps logistic regression to squeeze values from $(-k, k)$ to $(0, 1)$. From a mathematical point of view, logistic regression starts from a linear equation constituted of log-odds, which is then passed through the sigmoid function, which squeezes the output of the linear equation to a probability between $[0, 1]$. As a result, we can choose a decision boundary and use this probability to conduct a classification task. In logistic regression, the odds of an event occurring are given by the formula:

$$\mathrm{logit} = \log(\mathrm{odds}), \quad \text{where } \mathrm{odds} = \frac{P(\mathrm{event})}{1 - P(\mathrm{event})} = e^{w_0 + w_1 x_1 + \dots + w_n x_n} \quad (4)$$
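To make equations (3) and (4) concrete, the sketch below evaluates the sigmoid of a linear combination of features and applies a 0.5 decision boundary; the weights and the sample point are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function of equation (3)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted weights w0 (intercept) and w1..wn, and one feature vector x.
w0 = -1.5
w = np.array([0.8, 2.0])
x = np.array([1.0, 0.6])

log_odds = w0 + w @ x              # the linear part of equation (4)
p = sigmoid(log_odds)              # probability of the event
odds = p / (1 - p)                 # equals exp(log_odds)

print(f"log-odds={log_odds:.3f}, odds={odds:.3f}, P(event)={p:.3f}")
print("predicted class:", int(p >= 0.5))   # decision boundary at 0.5
```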
P(B): is the prior probability of the predictor, meaning how likely (B) is on its own.

In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Naive Bayes algorithms are used in real-time prediction, where it is easy and fast to predict the class of a test data set, and in multi-class prediction, where they perform well. Moreover, Naive Bayes algorithms are used in text classification, in spam filtering and sentiment analysis, for instance in social media analysis to identify positive and negative customer sentiments, and lastly to build recommendation systems that use machine learning and data mining techniques to filter unseen information.
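A minimal text-classification sketch with scikit-learn's multinomial Naive Bayes is shown below; the tiny spam/ham corpus is invented purely for illustration, and in practice a much larger labeled dataset would be required:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = spam, 0 = not spam.
texts = [
    "win a free prize now",
    "limited offer win money",
    "meeting agenda for monday",
    "project status report attached",
]
labels = [1, 1, 0, 0]

# Bag-of-words features followed by a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free money offer", "monday project meeting"]))  # expected: [1 0]
```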
Fisher's Linear Discriminant. – Linear Discriminant Analysis (LDA), sometimes also called Fisher's Linear Discriminant, is a linear classifier that projects a p-dimensional feature vector onto a hyperplane that divides the space into two half-spaces, where each half-space represents a class [+1 or −1]. This methodology relies on projecting points onto a line, and the outputs of this methodology are precisely the decision surfaces or the decision regions for a given set of classes [13]. The decision boundary (7) is characterized by the hyperplane's normal vector (w) and the threshold (w0).

$$(w_1, \dots, w_p)(x_1, \dots, x_p)^{T} + w_0 = w^{T}x + w_0 = 0 \quad (7)$$

Given a new input vector ($x \in \mathbb{R}^{p}$), classification is achieved by computing (8) and assigning the resulting class label ($y = -1$ or $y = +1$) to the input $x$.

$$y = \mathrm{sign}(w^{T}x + w_0) \quad (8)$$

To compute $w$, LDA assumes that the class-conditional distributions $P(x|c=1)$ and $P(x|c=2)$ are normal distributions with mean ($\mu_c$) and covariance ($\Sigma_c$) for $c \in \{1, 2\}$ [14]. LDA is an extremely popular dimensionality reduction technique which has become critical in machine learning, and it is commonly used in the pre-processing step of machine learning and pattern classification projects.
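The sketch below, using scikit-learn's LinearDiscriminantAnalysis on a synthetic two-class dataset, illustrates both uses mentioned above: predicting a class label as in equation (8) and projecting the data onto a lower-dimensional space; the dataset parameters are arbitrary illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class problem with 4 features (arbitrary illustrative parameters).
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=2, random_state=0)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

print("training accuracy:", lda.score(X, y))
print("predicted label for one new point:", lda.predict(X[:1]))

# LDA as dimensionality reduction: project onto at most (n_classes - 1) axes.
X_proj = lda.transform(X)
print("projected shape:", X_proj.shape)   # (200, 1) for a two-class problem
```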
3.2.2 Support Vector Machines
Support Vector Machines (SVMs) are a set of supervised learning methods used for classification, regression, and outlier detection. This technique is very effective in high dimensional spaces and remains effective in cases where the number of dimensions is greater than the number of samples. Moreover, SVMs work pretty well when there is a clear margin of separation between classes. On the other hand, the SVM algorithm is not suitable for large data sets, does not provide probability estimates, and does not perform well when the target classes are overlapping. In addition, in cases where the number of features for each data point exceeds the number of training data samples, the SVM technique will underperform [15], [16].

The essence of the SVM simply involves finding a boundary that separates the different classes from each other, where in 2-dimensional space the boundary is named a line, in 3-dimensional space the boundary is named a plane, and finally in dimensions greater than 3 the boundary is called a hyperplane. The math behind the SVM is depicted below:

$$\underset{a_0, \dots, a_m}{\mathrm{MINIMIZE}} \quad \sum_{j=1}^{n} \mathrm{MAX}\left\{0,\; 1 - \left(\sum_{i=1}^{m} a_i x_{ij} + a_0\right) y_j\right\} + \lambda \sum_{i=1}^{m} (a_i)^2 \quad (9)$$

where the first part of the formula (9),

$$\sum_{j=1}^{n} \mathrm{MAX}\left\{0,\; 1 - \left(\sum_{i=1}^{m} a_i x_{ij} + a_0\right) y_j\right\},$$

focuses on minimizing the error, i.e. the number of falsely classified points that the SVM makes, and the second part of the formula (9),

$$\lambda \sum_{i=1}^{m} (a_i)^2,$$

focuses on maximizing the margin between the two classes.

SVMs are powerful and flexible supervised machine learning algorithms which are used both for classification and regression, with real-life applications such as:

Inverse Geo-sounding Problems, where SVMs help to determine the layered structure of the planet.

Seismic Liquefaction Potential, with great result accuracy. In this category, the Standard Penetration Test (SPT) and the Cone Penetration Test (CPT) are used to check the occurrence and non-occurrence of liquefaction.

Protein Fold and Remote Homology Detection, where different methods are used to solve the kernel functions; the kernel functions help to find the similarity between different protein sequences.

Facial Expression Classification, where SVMs have great use in various life-care systems, for instance in classifying normal, happy or sad looks.
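As a small illustration of the soft-margin objective in (9), the sketch below fits scikit-learn's SVC with a linear kernel on a synthetic dataset; the regularization parameter C plays the role of the trade-off controlled by λ, and the dataset parameters are illustrative assumptions:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two well-separated blobs (illustrative data with a clear margin of separation).
X, y = make_blobs(n_samples=300, centers=2, cluster_std=1.2, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

# C is inversely related to the regularization strength lambda in formula (9):
# a large C tolerates fewer misclassified points, a small C widens the margin.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print("number of support vectors per class:", clf.n_support_)
```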
3.2.3 Quadratic classifiers
Quadratic Discriminant Analysis (QDA) is closely related to Linear Discriminant Analysis (LDA), with the assumption that the measurements from each class are normally distributed. QDA is a variant of LDA that allows for non-linear separation of data. QDA is particularly useful if there is prior knowledge that individual classes exhibit distinct covariances. On the other hand, a disadvantage of QDA is that it cannot be used as a dimensionality reduction technique.

A quadratic discriminant function is a mapping $g: X \rightarrow \mathbb{R}$ with

$$g(x) = \frac{1}{2}x^{T}Wx + w^{T}x + w_0, \quad (10)$$

for some matrix $W \in \mathbb{R}^{d \times d}$, some vector $w \in \mathbb{R}^{d}$, and some scalar $w_0 \in \mathbb{R}$. In the quadratic discriminant function, the model parameter is $\theta = (W, w, w_0)$ and, depending on ($W$), the geometry of ($g$) can be convex, concave or neither.

3.2.4 Kernel Estimation
A kernel distribution is a non-parametric representation of the Probability Density Function (PDF) of a random variable.
The kernel distribution can be used when a parametric distribution cannot properly describe the data, or when it is desirable to avoid making assumptions about the distribution of the data. Since the kernel density estimator is the estimated Probability Density Function (PDF), for any real value of $x$ the kernel estimator's formula is given below:

$$\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) \quad (11)$$

where $x_1, x_2, \dots, x_n$ are random samples from an unknown distribution, $n$ is the sample size, $K(\cdot)$ is the kernel smoothing function, and $h$ is the bandwidth.
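A short sketch of equation (11) using SciPy's Gaussian kernel density estimator is given below; the sample is drawn from an arbitrary bimodal mixture just to show how the estimated PDF is evaluated at new points:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Illustrative sample from a bimodal distribution (mixture of two normals).
sample = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(1.5, 1.0, 300)])

# gaussian_kde implements equation (11) with a Gaussian kernel K and an
# automatically chosen bandwidth h (Scott's rule by default).
kde = gaussian_kde(sample)

grid = np.linspace(-5, 5, 5)
print("estimated density at", grid, ":", kde(grid))
```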
K-Nearest Neighbor. – The K-Nearest Neighbor (KNN) algorithm is very simply used to solve classification problems, where (K) is the number of neighbors in KNN. The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point and predict the label from these. Broadly speaking, the distance can be any metric measure, such as the Hamming distance, the Manhattan distance or the Minkowski distance, where the standard Euclidean distance is the most common choice. KNN is very easy to implement, since only the (K) value and the distance function (e.g. Euclidean) are needed; it requires no training before making predictions, and new data can be added seamlessly without impacting the accuracy of the algorithm. Moreover, there is no training period for it, since it stores the training dataset and learns from it only at the time of making real-time predictions. As a result, the KNN algorithm is much faster than other algorithms that require training. On the other side, KNN does not work well with large datasets, where performance degradation appears, and does not work well with high dimensional data, where it becomes difficult for the algorithm to calculate the distance in each dimension. Moreover, KNN needs standardization and normalization before the algorithm is applied to any dataset, and missing values need to be imputed and outliers removed manually, since KNN is sensitive to noise in the dataset.

Assuming that we have a dataset where $X$ is a matrix of features from an observation and $Y$ is a class label, the formula that estimates the conditional distribution of $Y$ given $X$, classifying an observation to the class with the highest probability, is depicted below:

$$Pr(Y = j \mid X = x_0) = \frac{1}{k}\sum_{i \in N_0} I(y_i = j) \quad (12)$$

Given a positive integer $k$, the $k$-nearest neighbors method looks at the $k$ observations closest to a test observation $x_0$; formula (12) estimates the conditional probability that it belongs to class $j$.

The distance between the input data point and other points in the training data can be calculated as such:

$$\text{Euclidean distance: } d(x, y) = \sqrt{\sum_{i=1}^{p}(x_i - y_i)^2} \quad (13)$$

These distance measures apply to continuous variables. In the instance of categorical variables, the Hamming distance must be used.

$$\text{Hamming distance: } d(x, y) = \sum_{i=1}^{p}|x_i - y_i| \quad (16)$$

$$x_i = y_i \Rightarrow d(x, y) = 0, \qquad x_i \neq y_i \Rightarrow d(x, y) = 1$$

In case there is a mixture of numerical and categorical variables in the dataset, it is necessary to standardize the training set.
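The following sketch shows a scikit-learn KNeighborsClassifier with k = 5 and the Euclidean distance of equation (13); it includes the feature scaling step that the text above recommends before applying KNN, and the dataset is an arbitrary synthetic example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=1)

# Standardize features, then classify by majority vote of the 5 nearest
# neighbors under the Euclidean metric (equation (13)).
knn = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=5, metric="euclidean"))
knn.fit(X_train, y_train)

print("test accuracy:", knn.score(X_test, y_test))
```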
3.2.5 Decision Trees Induction
Decision trees are the most popular representation of logic-based classifiers and are well presented in the literature [17], [18]. There are three well-known implementations of decision trees: the Classification and Regression Trees (CART) [17], and Quinlan's univariate tree growing algorithm, which is known as the Iterative Dichotomiser 3 algorithm (ID3) [19]. The third one is the C4.5 algorithm [20], which extends the ID3 algorithm by allowing the classification algorithm to deal with numbers and not just categorical values as ID3 does.
Random Forests. – Random forest is a supervised learning algorithm; it is very flexible, easy to use, and one of the most used machine learning algorithms, producing great results most of the time. Random forest is based on the bagging algorithm and uses the Ensemble Learning technique, where it creates as many trees as possible on subsets of the data and combines the output of all the trees. As a result, we achieve a reduction of the overfitting problem in the decision trees and also variance reduction, which eventually improves the accuracy. Random forest is used in both classification and regression problems and works well with categorical and continuous variables. It uses a rule-based approach instead of distance calculation, and as a result no feature scaling (standardization and normalization) is needed. Nonlinear parameters do not affect the performance of a Random Forest, unlike curve-based algorithms, and it is very stable and comparatively less impacted by noise.

On the other hand, the random forest creates a lot of trees (for instance, it creates one hundred trees by default in the Python sklearn library) and as a result requires much more computational power and resources, in contrast with the decision tree, which is simple and does not require so many computational resources. It also requires much time for training, as it combines a lot of decision trees to determine the class, and it suffers in interpretability and fails to determine the significance of each variable due to the ensemble of decision trees.

In case of regression problems, when using the random forest algorithm, the Mean Squared Error (MSE) is used to determine how the data branches from each node [21].

$$MSE = \frac{1}{N}\sum_{i=1}^{N}(f_i - y_i)^2 \quad (17)$$

In case we perform random forests based on classification data, the Gini Index is often used, i.e. the formula used to decide how nodes on a decision tree branch.

$$Gini = 1 - \sum_{i=1}^{C}(p_i)^2 \quad (18)$$

Formula (18) uses the class and probability to determine the Gini of each branch on a node, determining which of the branches is more likely to occur. ($p_i$) represents the relative frequency of the class that is being observed in the dataset, and ($C$) represents the number of classes.
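To tie the tree-based methods together, the sketch below trains a single CART-style decision tree and a random forest on the same synthetic classification data with scikit-learn, whose trees split on the Gini criterion of equation (18) by default; the dataset and the choice of 100 trees mirror the defaults mentioned above and are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# A single decision tree (CART-style, Gini impurity as the splitting criterion).
tree = DecisionTreeClassifier(criterion="gini", random_state=42).fit(X_train, y_train)

# A bagged ensemble of 100 such trees (the sklearn default for n_estimators).
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("single tree accuracy  :", tree.score(X_test, y_test))
print("random forest accuracy:", forest.score(X_test, y_test))
```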
Random Forests are a great choice in the banking sector for problems such as estimating the loan default chance of a customer or detecting fraudulent transactions. Moreover, in healthcare sectors random forest can be used to identify the potential of a certain medicine or the composition of chemicals required for medicines. In addition, it can be used in hospitals to identify the diseases suffered by a patient, the risk of cancer in a patient, and many other diseases where early analysis and research play a crucial role.
3.2.6 Neural Networks
Neural networks are a set of algorithms that try to find relationships in a dataset to recognize patterns by simulating the way the human brain works. Neural Networks in fact cluster and classify: they group unlabeled data according to similarities and classify data when they have a labeled dataset to train on. In other words, Neural Networks are software routines that can learn from existing data and solve complex real-world problems in an efficient way. Neural Network algorithms are designed to cluster raw input, recognize patterns, and interpret sensory data and, despite their multiple advantages, significant computational resources are required.
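As a minimal supervised example of a neural network classifier, the sketch below fits a small multi-layer perceptron with scikit-learn on synthetic labeled data; the layer sizes and the dataset are arbitrary illustrative choices, not a recommendation from the paper:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Two interleaved half-moons: a simple non-linear classification problem.
X, y = make_moons(n_samples=500, noise=0.2, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

# A small feed-forward network with two hidden layers of 16 neurons each.
mlp = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=3)
mlp.fit(X_train, y_train)

print("test accuracy:", mlp.score(X_test, y_test))
```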
There are several methods to teach a Neural Network, focusing on the three main learning paradigms examined below:

Supervised Learning. – Supervised Learning (SL) is the machine learning process which is done under the seen labels of the observation variables, contrary to Unsupervised Learning where the response variables are not available. In (SL), models are trained with the training set to infer a Machine Learning algorithm, which is then used to label new observations from the testing set. Supervised learning can be separated into two types of problems when it comes to data mining:

Classification, where an algorithm is used to accurately assign test data into specific categories, meaning that it recognizes specific entities within the dataset and attempts to draw some conclusions on how those entities should be labeled or defined.

Regression, which is used to understand the relationship between dependent and independent variables and make projections, for instance in businesses in terms of sales revenues. Very popular regression algorithms are Linear Regression, Logistic Regression, and Polynomial Regression.

Supervised Machine Learning models can be used to build and advance a number of business applications, such as image and object recognition, where the location, isolation, and categorization of objects out of videos or images make them useful when applied to computer vision techniques and imagery analysis. Other applications that use Supervised Machine Learning (SML) models are in predictive analytics, helping business leaders justify decisions or pivot for the benefit of the organization, and in customer sentiment analysis, gaining a better understanding of customer interactions, which can be used to improve brand engagement efforts. Broadly speaking, the challenge of Supervised Machine Learning models is that they require certain levels of expertise to structure accurately, the training is very time intensive, and the datasets can have a higher likelihood of human error, resulting in algorithms learning incorrectly. Unlike unsupervised learning models, supervised learning cannot cluster or classify data on its own.

Unsupervised Learning. – Unsupervised Learning (UL) is a machine learning technique where there is no need for users to supervise the model; instead, the model works on its own to discover patterns and information that were previously undetected. The goal of unsupervised machine learning is to model the underlying structure or distribution in the data in order to learn more about the data, by using machine learning algorithms. Clustering and Association are the two main types of Unsupervised learning. Unsupervised learning is much like how a human learns to think from their own experiences, which makes it closer to real Artificial Intelligence (AI).

Unsupervised learning can be separated into two types of problems when it comes to data mining:

Clustering, which is a method of grouping objects into clusters such that objects with the most similarities remain in a group and have few or no similarities with the objects of another group.

Association, where an association rule is an unsupervised learning method which is used for finding the relationships between variables in a large database.

The most popular unsupervised learning algorithms are the following: K-means clustering, K-nearest neighbors (KNN), Hierarchical clustering, Anomaly detection, Neural Networks, Principal Component Analysis, Independent Component Analysis, the Apriori algorithm and Singular Value Decomposition.

Unsupervised learning is used for more complex tasks compared to supervised learning, since in unsupervised learning there are no labeled input data. In addition, unsupervised learning is preferable when it is easier to obtain unlabeled data in comparison to labeled data. On the other side, unsupervised learning is intrinsically more difficult and more challenging than supervised learning, as it does not have a corresponding output; moreover, the result of the unsupervised learning algorithm might be less accurate, as the input data is not labeled and the algorithm does not know the exact output in advance.

Reinforcement Learning. – Reinforcement Learning (RL) is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences. In reinforcement machine learning, the machine learns by itself after making many mistakes and correcting them in turn. RL is one of the hottest research topics currently, is very common in robotics, and its popularity is growing day by day. The Reinforcement Learning method works by interacting with the environment, whereas the supervised learning method works on given sample data or
(trees) into a strong classifier and tends to improve accuracy with some small risk of less coverage [25], [26]. Each tree attempts to minimize the errors of the previous tree. Trees in boosting are weak learners, but by adding many trees in series, meaning combining a learning algorithm in series, a strong learner is achieved from many sequentially connected weak learners, making boosting a highly efficient and accurate model. Since trees are added sequentially, boosting algorithms learn slowly. In statistical learning, models that learn slowly perform better. However, the number of trees, for instance in gradient boosting decision trees, is very critical in terms of overfitting: adding too many trees will cause overfitting, so it is very important to stop adding trees at some point.
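A brief gradient boosting sketch with scikit-learn is shown below; it illustrates the point about the number of sequentially added trees by comparing small and large ensembles on an invented dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=12, n_informative=6,
                           random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)

# Trees are added one after another, each fitting the errors of the previous ones.
for n_trees in (10, 100, 1000):
    gbm = GradientBoostingClassifier(n_estimators=n_trees, learning_rate=0.1,
                                     max_depth=3, random_state=5)
    gbm.fit(X_train, y_train)
    print(f"{n_trees:4d} trees -> test accuracy {gbm.score(X_test, y_test):.3f}")
```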
The theory of a decision tree has the following components: a root node, which is the first node and the starting point of the tree, and branches (arrows), which connect one node to another, showing the flow from question to answer. Nodes that have child nodes are called interior nodes. Leaf or terminal nodes are nodes that do not have child nodes and represent a possible value of the target variable, given the variables represented by the path from the root. The branching factor (b) represents the number of children at each node.

The advantages of Decision Trees (DT) can be summarized as being simple to understand and easy to interpret and visualize, while all kinds of data can be handled, making them widely used. DTs are considered to be non-parametric, meaning that they make no assumptions about the data points' space or the classifier's structure. DTs are robust, since they require less effort from users for pre-processing data; they are not influenced by outliers and missing values either. On the other hand, overly complex trees can be developed due to overfitting. Moreover, Decision Trees can be unstable, because small variations in the data might result in a completely different tree being generated. In addition, Decision Tree learners create biased trees if some classes are more likely to be predicted or have a higher number of samples to support them. Optimality is one more disadvantage: the problem of learning an optimal decision tree is known to be NP-complete (nondeterministic polynomial-time complete), since the number of samples or a slight variation in the splitting attribute can change results drastically.
3.3 Regression Analysis
Regression analysis is a well-known statistical learning technique used to estimate the relationship between a dependent variable and one or more independent variables, where an independent variable is used as an assumed input element that is changed in order to see the impact on the dependent variable. In other words, Regression Analysis is a data mining process that helps to understand the correlation and independence of the variables, to determine which factors matter most and which factors can be ignored and, eventually, how these factors influence each other.

There are many types of regression analysis techniques, depending on a number of factors such as the type of target variable, the shape of the regression line, and the number of independent variables. Regression Analysis has a wide range of real-life applications, such as financial forecasting and sales and promotions forecasting. The different types of regression are briefly explained below:

Linear Regression. – The linear regression model comprises a predictor variable and a dependent variable related to each other in a linear fashion. The general linear regression model can be stated by equation (19):

$$y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \varepsilon_i \quad (19)$$

where ($\beta_0$) is the intercept, the $\beta_i$'s are the slopes between ($Y$) and the ($X_i$), and ($\varepsilon$), pronounced epsilon, is the error term that captures errors in the measurement of ($Y$) and the effect on ($Y$) of any variables missing from the equation that would contribute to explaining variations in ($Y$). Linear regression should not be used to analyze big size data.
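The following sketch fits the model of equation (19) with scikit-learn's LinearRegression on synthetic data with two predictors; the "true" coefficients used to generate the data are arbitrary assumptions that only serve to check that the fit recovers them approximately:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))

# Hypothetical ground truth: y = 3 + 2*x1 - 1*x2 + noise (the epsilon term in (19)).
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

model = LinearRegression().fit(X, y)
print("intercept (beta_0):", round(model.intercept_, 3))
print("slopes (beta_1, beta_2):", model.coef_.round(3))
```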
Logistic Regression. – Logistic regression is one of the types of regression analysis techniques, which gets used when the dependent variable is discrete, for instance (0 or 1) or (true or false). This means the target variable can have only two values, and a sigmoid curve denotes the relation between the target variable and the independent variable.

The logistic regression model is based on the logistic function and can be stated by equation (20):

$$F(x) = \frac{L}{1 + e^{-k(x - x_0)}} \quad (20)$$

where ($x_0$) is the ($x$) value of the sigmoid's midpoint, ($L$) is the curve's maximum value, and ($k$) is the logistic growth rate or steepness of the curve.

Logistic regression works best with large data sets that have an almost equal occurrence of values in the target variables. The dataset should not contain a high correlation between independent variables (a phenomenon known as multicollinearity), as this will create a problem when ranking the variables. Logistic regression can also suffer from complete separation: if there is a feature that would perfectly separate the two classes, the logistic regression model can no longer be trained, because the weight for that feature would not converge, due to the fact that the optimal weight would be infinite.

Ridge Regression. – Ridge Regression is a model tuning method that is used to analyze any data that suffers from multicollinearity. Ridge Regression performs (L2) regularization and is usually used when there is a high correlation between the independent variables. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large, so they may be far from the true value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.

The Ridge Regression formula can be stated below:

$$\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 \quad (21)$$

where the $\lambda\sum_{j=1}^{p}\beta_j^2$ term represents the L2 regularization element. If lambda is zero, then we get Ordinary Least Squares (OLS). However, a high value of lambda will add too much weight, which will result in model under-fitting, so it is important how we choose the parameter lambda for our model. Overfitting problems may lead to inaccurate and unstable model building, so a technique that helps minimize the overfitting problem in Machine Learning (ML) models is known as regularization. Ridge regression uses L2 regularization, compared to Lasso regression which uses L1 regularization.
Lasso Regression. – Lasso Regression is like linear regression, but it uses a shrinkage technique where the coefficients of determination are shrunk towards zero. While linear regression gives you the regression coefficients as observed in the dataset, Lasso Regression allows you to shrink or regularize these coefficients to avoid overfitting and make them work better on different datasets. Lasso regression penalizes the less important features of your dataset and makes their respective coefficients zero, thereby eliminating them. Hence, it provides the benefit of feature selection and simple model creation. The Lasso Regression formula can be stated below:

$$\sum_{i=1}^{n}\left(y_i - \sum_{j} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j| \quad (22)$$

where ($\lambda$) denotes the amount of shrinkage. ($\lambda = 0$) implies all features are considered, and it is equivalent to linear regression, where only the residual sum of squares is considered to build a predictive model. ($\lambda = \infty$) implies no feature is considered. The bias increases with an increase in ($\lambda$), and the variance increases with a decrease in ($\lambda$).

Table 3. Differences Between Lasso and Ridge Regression
- Ridge Regression makes use of the L2 regularization technique, whereas Lasso Regression makes use of the L1 regularization technique.
- Ridge performs the feature weight updates as the loss function has an additional squared term, whereas Lasso performs the feature weight updates as the loss function has an additional term containing the L1 norm of the weights vector.
- Both drive down the overall size of the weight values during optimization and reduce overfitting.
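To contrast equations (21) and (22) and the differences in Table 3 in code, the sketch below fits Ridge and Lasso on the same synthetic data, in which only two of eight features are truly informative; the data generation and the alpha values (scikit-learn's name for λ) are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n, p = 200, 8
X = rng.normal(size=(n, p))

# Hypothetical ground truth: only the first two features matter.
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can set coefficients exactly to zero

print("ridge coefficients:", ridge.coef_.round(3))
print("lasso coefficients:", lasso.coef_.round(3))
print("features eliminated by lasso:", int(np.sum(lasso.coef_ == 0.0)))
```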
Polynomial Regression. – Polynomial regression is a model which transforms data points into polynomial features of a given degree and models them using a linear model. It works in a similar way to multiple linear regression with a little modification, but uses a non-linear curve, and it is used when data points are present in a non-linear fashion. Polynomial regression is one of several methods of curve fitting, where curve fitting is the process of constructing the best-fit line that passes through all the data points, which is not a straight line but a curved line. With polynomial regression, the data is approximated using a polynomial function that takes the form (23):

$$f(x) = c_0 + c_1 x + c_2 x^2 + \dots + c_n x^n + \text{residual error} \quad (23)$$

where ($n$) is the degree of the polynomial and ($c$) is a set of coefficients.

Polynomial Regression provides the best approximation of the relationship between the dependent and independent variable and fits a wide range of curvature. On the other hand, it is very sensitive to outliers, where the presence of one or two outliers in the data can seriously affect the results of the nonlinear analysis. Moreover, there are fewer model validation tools for the detection of outliers in nonlinear regression than there are for linear regression.

3.4 Outlier Detection
An outlier is an observation that diverges from the overall pattern of a sample and mainly indicates variability in a measurement, experimental errors, or a novelty. Outliers fall into two categories: univariate, when looking, for instance, at a distribution of values in a single feature space, and multivariate, in n-dimensional space. In n-dimensional space, there is a need to train a model. Moreover, outliers can come out depending on the different types of data: point outliers, which are single data points that appear far from the rest of the distribution; contextual outliers, which could be noise in data, e.g. a background noise signal when doing speech recognition; or collective outliers, such as a signal that may indicate the discovery of new phenomena.

The most common causes of outliers in a data set can be data entry errors due to human mistakes, measurement errors due to instrument accuracy, experimental errors, data processing errors and sampling errors. In machine learning and in any quantitative discipline, the quality of the data is as important as the quality of a prediction or classification model; that is why detecting outliers is of major importance, for example in Physics, Economy, Finance, Machine Learning and Cybersecurity.

Some of the most popular methods for outlier detection are the Z-score or Extreme Value Analysis (parametric), Probabilistic and Statistical Modeling (parametric), Linear Regression Models such as Principal Component Analysis (PCA) and Least Median of Squares (LMS) [27], the Proximity-Based Models (non-parametric), Information Theory Models and, last, the High Dimensional Outlier Detection Methods.
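As a sketch of the simplest of the listed methods, the Z-score (Extreme Value Analysis), the code below flags points lying more than three standard deviations from the mean of a one-dimensional sample; the sample, the injected outliers and the threshold of 3 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
# Mostly well-behaved data with a few injected extreme values (hypothetical).
data = np.concatenate([rng.normal(50.0, 5.0, 500), np.array([120.0, -30.0, 95.0])])

z_scores = (data - data.mean()) / data.std()

threshold = 3.0                      # common rule of thumb for the Z-score method
outliers = data[np.abs(z_scores) > threshold]
print("detected outliers:", np.sort(outliers))
```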
3.5 Predictive Modeling Techniques
When we refer to Predictive Analytics, we mean the use of statistical and machine learning techniques to identify the likelihood of future outcomes based on historical data, with the final purpose of streamlining decision making by producing new insights. Predictive analytics is used to predict behavior and trends, to understand customers, and to improve strategic decision making and business performance. Some of the common uses of predictive analytics include the domains of fraud detection and security, marketing, operations and risk identification. The most used Predictive Analytics models include the Classification Model, which is best to answer yes-or-no questions, and the Clustering Model, which sorts data into separate nested groups based on similar attributes. Using the clustering model, customers can quickly be separated into similar groups based on common characteristics, and strategies can be devised for each group at a larger scale. The Forecast Model is another predictive technique, which can be applied wherever historical numerical data is available, such as in a call center to predict how many support calls will be received per hour. Moreover, Outlier and Time Series models are used as predictive techniques, where anomalous data entries within a dataset are identified, or sequences of datapoints are identified using time as an input parameter, respectively.

Broadly speaking, the common predictive algorithms can be separated into two groups: Machine Learning and Deep Learning. Machine learning involves structured data and comprises both linear and nonlinear varieties; linear models train more quickly, while nonlinear ones are better optimized for the problems they are likely to face, which are more often nonlinear. Deep Learning is a subset of machine learning that is more popular for dealing with audio, video, text, and images. With machine learning predictive modeling, there are several different algorithms that can be applied, where the most common are the Random Forest, the Generalized Linear Model (GLM) for two values, the Gradient Boosted Model (GBM), K-Means, and the Prophet algorithm.

3.6 Sequential Patterns
Similar to association rules mining, by using sequential pattern mining it is possible to discover statistically interesting and useful patterns and rules in a large-scale table that contains sequences of transactions [28]. A sequential pattern is a frequent subsequence existing in a single sequence or a set of
drawbacks such as "result repeatability", where the K-Means algorithm results will differ due to random centroid initialization. Apart from the fact that the K-Means Algorithm needs manual intervention for some parameters (e.g. n_clusters needs to be optimized, adjusted, and reassessed a few times, or max_iter and init), the K-Means algorithm creates spherical clusters that cover the whole dataset, without it being possible to exclude outliers or certain sample groups.

In summation, the K-Means advantages and disadvantages are depicted in Table 5 below:

Table 5. K-Means (pros and cons)

Advantages:
- It is very simple and flexible and identifies unknown groups of data from complex data sets.
- If the variables are huge, then K-Means is most of the time computationally faster than hierarchical clustering, if we keep k small.
- It is optimal for certain criteria and suitable for a large dataset.
- It is efficient at segmenting a large data set depending on the shape of the clusters; K-Means works well with hyper-spherical clusters.
- Compared to hierarchical algorithms, K-Means produces tighter clusters, especially with globular clusters.
- K-Means segmentation is linear in the number of data objects, thus increasing execution time, and generalizes to clusters of different shapes and sizes, for instance elliptical clusters.

Disadvantages:
- It is difficult to predict the K-value, and it does not work well with clusters of different size and density.
- It needs an initial K (objects) and has a long computational time; when dealing with a large dataset, conducting a dendrogram technique will crash the computer due to a lot of computational load and RAM limits.
- K-Means does not allow development of an optimal set of clusters, and for effective results you should decide on the clusters beforehand.
- It lacks consistency: a random choice of cluster patterns yields different clustering results, and the K-Means algorithm can be performed on numerical data only.
- It produces clusters with uniform size even when the input data has different sizes, and it is very sensitive to scale, where rescaling the dataset via normalization or standardization will change the final results.
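A short K-Means sketch with scikit-learn is given below; n_clusters, init and max_iter are the parameters mentioned above that typically need manual tuning, and the blob data is an arbitrary illustrative sample:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: three roughly spherical groups of points.
X, _ = make_blobs(n_samples=450, centers=3, cluster_std=1.0, random_state=6)

# n_clusters, init and max_iter usually need to be adjusted and reassessed.
kmeans = KMeans(n_clusters=3, init="k-means++", max_iter=300, n_init=10,
                random_state=6)
labels = kmeans.fit_predict(X)

print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
print("inertia (within-cluster sum of squares):", round(kmeans.inertia_, 2))
print("centroids:\n", kmeans.cluster_centers_.round(2))
```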
PAM (K-Medoids) Algorithm. – The Partitioning Around Medoids (PAM) Algorithm was introduced by Kaufman and Rousseeuw and is based on (k) representative objects, named medoids, among the objects of the dataset [34]. In k-medoids clustering, each cluster is represented by one of the data points in the cluster. These points are named cluster medoids. The term medoid refers to an object within a cluster for which the average dissimilarity between it and all the other members of the cluster is minimal. It corresponds to the most centrally located point in the cluster. Objects are tentatively defined as medoids and are placed into a set (S) of selected objects. If (O) is the set of objects, then the set U = O − S is the set of unselected objects. The aim of the algorithm is to minimize the average dissimilarity of objects to their closest selected object, hence to find the most centrally located objects within the clusters. K-Medoids can be considered a robust alternative to k-means clustering, meaning that the algorithm is less sensitive to noise and outliers compared to k-means. This is due to the fact that it uses medoids as cluster centers instead of the means used in the k-means method. The k-medoids algorithm requires the user to specify (k), the number of clusters to be generated, where the silhouette method is a nice approach to determine the optimal number of clusters. The complexity of k-Medoids is O(N²KT), where (N) is the number of samples, (T) is the number of iterations and (K) is the number of clusters, and this makes it more suitable for smaller datasets compared to k-means, which is O(NKT).

The advantages and disadvantages of the K-Medoids method are presented below in Table 6.

Table 6. K-Medoids (pros and cons)

Advantages:
- K-Medoids can be more robust than k-means in the presence of noise and outliers.
- K-Medoids is efficient for small datasets, while it does not scale well for large datasets.
- K-Medoids is more flexible, as it can use any similarity measure.

Disadvantages:
- K-Medoids is not suitable for clustering non-spherical (arbitrarily shaped) groups of objects.
- K-Medoids may obtain different results for different runs on the same dataset, because the first k medoids are chosen randomly.
- In k-Medoids, there is a need to specify the value of (k) (the number of clusters) in advance.

CLARA Algorithm. – The Clustering Large Applications (CLARA) Algorithm is an extension of the k-Medoids (PAM) methods for dealing with data comprising a large number of objects, in order to reduce computing time and RAM storage problems, using a sampling approach.

In the CLARA concept, instead of finding medoids for the entire data set, the algorithm considers a small sample of the data with fixed size and applies the PAM algorithm to generate an optimal set of medoids for the sample. The algorithm repeats the sampling and clustering processes a pre-specified number of times in order to minimize the sampling bias. The outcome of this iteration corresponds to the set of medoids with the minimal cost.
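To make the medoid idea concrete without relying on any extra clustering package, the sketch below computes the medoid of a small set of points as the point with the minimal total Euclidean dissimilarity to all other members, which is exactly the property described above; the points are invented for illustration:

```python
import numpy as np

# Hypothetical cluster members (2-D points); the last one is an outlier.
cluster = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [8.0, 8.0]])

# Pairwise Euclidean distances between all members of the cluster.
diffs = cluster[:, None, :] - cluster[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=2))

# The medoid minimizes the total dissimilarity to the other members,
# so it stays on an actual data point and is barely pulled by the outlier.
medoid = cluster[dist.sum(axis=1).argmin()]
mean = cluster.mean(axis=0)

print("medoid (k-medoids center):", medoid)
print("mean   (k-means center)  :", mean.round(2))
```

Comparing the two printed centers shows why medoids are considered more robust: the mean is dragged towards the outlier, while the medoid remains one of the original, centrally located points.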
4.1.2 Hierarchical Clustering Method
Hierarchical Clustering algorithms can be agglomerative (bottom-up approach) or divisive (also called the top-down approach), and they group the clusters based on distance metrics.

In agglomerative clustering, each data point initially acts as a cluster, and then pairs of clusters are successively merged one by one until all clusters have been merged into one big cluster containing all objects. The result is a tree-based representation
Comparison of DBSCAN and OPTICS (continued):
- Works well in the presence of noise in the case of OPTICS, but not well in the case of DBSCAN.
- Sensitive to density parameters, which should be selected carefully.

Some typical examples of Density-Based Clustering algorithms are the following:

DBSCAN Algorithm. – The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) Algorithm is particularly suited to dealing with large datasets with noise, and it is able to identify clusters with different sizes and shapes. DBSCAN is the most well-known density-based clustering algorithm, first introduced in 1996 by Ester et al. [36]. Unlike k-means, DBSCAN does not require the number of clusters as a parameter; it infers the number of clusters based on the data, and it can discover clusters of arbitrary shape. The DBSCAN algorithm is the fastest of the clustering methods, provided that there is a very clear Search Distance to use. The advantages can be summarized as such: DBSCAN does not require a-priori specification of the number of clusters, and it is able to identify noise data while clustering and to find arbitrarily sized and arbitrarily shaped clusters. The disadvantages can be summarized as such: DBSCAN fails in the case of varying density clusters and in the case of neck-type datasets and, moreover, does not work well in the case of high dimensional data.

OPTICS Algorithm. – The Ordering Points To Identify the Clustering Structure (OPTICS) Algorithm works as an extension of DBSCAN. The only difference is that it does not assign cluster memberships but stores the order in which the points are processed, meaning that for each object it stores the Core Distance and the Reachability Distance. The main idea of the OPTICS algorithm is similar to DBSCAN, but it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters in data of varying density. In order to do that, the points of the database are ordered in such a way that spatially closest points become neighbors in the ordering. Moreover, for each point a special distance is stored which represents the density that must be accepted for a cluster so that both points belong to the same cluster.

Like DBSCAN, OPTICS requires two parameters: (ε), which describes the maximum distance to consider, and (MinPts), which describes the number of points needed to form a cluster. The key parameter of DBSCAN and OPTICS is the (MinPts) parameter, which roughly controls the minimum size of a cluster. If this parameter is set too low, everything will become clusters, whereas if it is set too high, at some point there won't be any clusters anymore, but only noise. The OPTICS clustering method requires more memory to determine the next data point which is closest to the point currently being processed in terms of Reachability Distance. As a result, it requires more computational power, because the nearest neighbor queries are more complicated compared to radius queries in DBSCAN. Moreover, the OPTICS clustering technique does not need to maintain the (ε) parameter and is relatively insensitive to parameter settings.
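A short sketch with scikit-learn's DBSCAN and OPTICS implementations is given below; it shows that the number of clusters is inferred from the data (noise points get the label -1) and that OPTICS additionally exposes the reachability distances discussed above; the dataset, eps and min_samples values are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS
from sklearn.datasets import make_moons

# Arbitrarily shaped clusters (two interleaved half-moons) that k-means struggles with.
X, _ = make_moons(n_samples=400, noise=0.06, random_state=8)

# DBSCAN: eps is the neighborhood radius, min_samples the density threshold.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("DBSCAN clusters:", n_clusters, "| noise points:", int(np.sum(db.labels_ == -1)))

# OPTICS: stores the processing order and a reachability distance for every point.
op = OPTICS(min_samples=10).fit(X)
print("OPTICS clusters:", len(set(op.labels_)) - (1 if -1 in op.labels_ else 0))
print("first reachability distances:", op.reachability_[op.ordering_][:5].round(3))
```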
4.1.4 Model Based Clustering Method
Model-based clustering is a statistical approach to data clustering which assumes that data points are generated according to a certain probability distribution model, and the clustering process is to adapt all data points to some predefined mathematical models. As a result, the algorithms that fall in this category can automatically identify the number of clusters and the outliers in the data points according to the selected mathematical model. However, the noise and outliers are considered while calculating the standard statistics, in order to have robust clustering. In order to form clusters, these clustering methods are classified into two categories: Statistical and Neural Network approach methods. In the statistical approach, the model-based algorithms follow probability measures to determine clusters, and in the Neural Network approach, input and output are associated with units carrying weights.

Representative algorithms that fall into this category are as follows:

GMM Algorithm. – The Gaussian Mixture Model (GMM) algorithm is based on a probability model where the data is decomposed into several models based on the Gaussian probability density function. The GMM algorithm results are expressed in terms of probabilities, which are more visual and can be used to predict in a certain area of interest based on these probabilities. On the other hand, it is necessary to use complete sample information for prediction, and it loses effectiveness in high-dimensional space, which is considered a disadvantage.
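A brief Gaussian mixture sketch with scikit-learn is given below; it shows the probabilistic output that the text refers to (each point gets a probability of belonging to each component), on a synthetic two-component sample with invented parameters:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(10)
# Illustrative 1-D data drawn from two Gaussian components.
X = np.concatenate([rng.normal(-3.0, 0.7, 300),
                    rng.normal(2.0, 1.2, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=10).fit(X)

print("estimated means   :", gmm.means_.ravel().round(2))
print("estimated weights :", gmm.weights_.round(2))
# Soft assignment: probability of each component for two new points.
print("P(component | x)  :\n", gmm.predict_proba([[-2.5], [1.0]]).round(3))
```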
SOM Algorithm. – The Self-Organizing Maps (SOM) algorithm is based on a neural network model: the input layer receives the input signals, and the output layer is arranged by neurons into a two-dimensional node matrix in a certain way. The SOM algorithm has the advantage of mapping to a two-dimensional plane to achieve visualization and obtain higher-quality clustering results. On the other hand, a disadvantage is that the calculation complexity is high, and the result depends to a certain extent on the choice of experience.

4.1.5 Grid Based Clustering Method
In grid-based clustering, the dataset is represented by a grid structure which comprises grids (also named cells). Grid-based methods divide the data space into a grid based on the data characteristics and work on the grid cells instead of on the individual data objects. After partitioning the dataset into cells, the density of the cells is computed, which helps in identifying the clusters. One of the greatest advantages of these algorithms is the reduction in computational complexity. They are more concerned with the value space surrounding the data points rather than with the data points themselves. The grid-based clustering method has a faster processing time than other approaches, and it depends on the number of cells in the space of each quantized dimension. Moreover, it applies to any attribute type and provides flexibility related to the level of granularity.

Representative algorithms that fall into this category are as follows:

STING Algorithm. – In the Statistical Information Grid Approach (STING) Algorithm, the dataset is divided recursively in a hierarchical manner, where each cell is further sub-divided into a different number of cells, capturing in turn the statistical measures of the cells. The STING Algorithm has high efficiency and low time complexity. On the other hand, the fact that the clustering quality is affected by the granularity of the bottom layer of the grid structure can be considered a disadvantage.

WaveCluster Algorithm. – In this algorithm, the data space is represented in the form of wavelets, where it contains an n-dimensional signal helping to identify the clusters. The parts
[10] Trupti A. Kumbhare and Santosh V. Chobe, (2014). An Overview of Association Rule Mining Algorithms. International Journal of Computer Science and Information Technologies, Vol. 5(1), pp. 927-930.
[11] Sudhir M. Gorade, Ankit Deo and Pritesh Purohit, (2017). A Study of Some Data Mining Classification Techniques. International Research Journal of Engineering and Technology, Vol. 4, Issue 4, pp. 3112-3115.
[12] J. Han, M. Kamber and J. Pei, (2010). Data Mining: Concepts and Techniques (3rd ed.). University of Illinois. Chapter 8, pp. 99-117.
[13] Duda R. O., Hart P. E., and Stork D. G., (2000). Pattern Classification, 2nd ed. New York: John Wiley & Sons.
[14] Rao, R. P. N., & Scherer, R. (2010). Statistical Pattern Recognition and Machine Learning in Brain-Computer Interfaces. In Statistical Signal Processing for Neuroscience and Neurotechnology (1st ed., pp. 335-368). Elsevier B.V.
[15] Auria, Laura and Moro, R. A., (August 1, 2008). Support Vector Machines (SVM) as a Technique for Solvency Analysis. DIW Berlin Discussion Paper No. 811. Available at SSRN: https://ssrn.com/abstract=1424949.
[16] S. Karamizadeh, S. M. Abdullah, M. Halimi, J. Shayan and M. J. Rajabi, (2014). "Advantage and drawback of support vector machine functionality," 2014 International Conference on Computer, Communications, and Control Technology (I4CT), pp. 63-65, doi: 10.1109/I4CT.2014.6914146.
[17] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, (1984). Classification and Regression Trees. Chapman & Hall, New York, NY.
[18] S. K. Murthy, S. Kasif, and S. Salzberg, (1994). A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2(1):1-32.
[19] J. Quinlan, (1986). Induction of decision trees. Machine Learning, 1(1):81-106.
[20] J. Quinlan, (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
[21] Mean Squared Error (MSE). [Online]. Available: https://www.probabilitycourse.com/chapter9/9_1_5_mean_squared_error_MSE.php
[22] Nova, D., Estévez, P. A. (2014). A review of learning vector quantization classifiers. Neural Computing and Applications 25, 511-524. https://doi.org/10.1007/s00521-013-1535-3
[23] D. Nova and P. Estévez, (2013). "A Review of Learning Vector Quantization Classifiers," Neural Computing and Applications, vol. 25, pp. 511-524.
[24] A. Priyono, M. Ridwan, A. J. Alias, R. A. O. Rahmat, A. Hassan, and M. A. M. Ali, (2012). "Application of LVQ neural network in real-time adaptive traffic signal control," Jurnal Teknologi, vol. 42, no. 1, pp. 29-44.
[25] Y. Freund, (1995). "Boosting a weak learning algorithm by majority," Information and Computation, 121(2):256-285.
[26] Y. Freund and R. E. Schapire, (1999). "A short introduction to boosting," Journal of the Japanese Society for Artificial Intelligence, 14(5):771-780.
[27] Huh, Myung-Hoe, & Lee, Yonggoo, (2006). "LMS and LTS-type Alternatives to Classical Principal Component Analysis," Communications for Statistical Applications and Methods, 13(2), 233-241. https://doi.org/10.5351/CKSS.2006.13.2.233
[28] R. Agrawal and R. Srikant, (March 1995). "Mining Sequential Patterns," In Proc. of the 11th Int'l Conference on Data Engineering, Taipei, Taiwan.
[29] Fournier-Viger, P., Lin, J. C. W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017). "A survey of sequential pattern mining," Data Sci. Pattern Recogn., 1(1), 54-77.
[30] Thabet Slimani and Amor Lazzez, (2013). "Sequential Mining: Patterns and Algorithms Analysis," International Journal of Computer and Electronics Research, Volume 2, Issue 5, pp. 639-647.
[31] Mooney, C. H. & Roddick, J. F., (Feb 2013). "Sequential Pattern Mining — Approaches and Algorithms," ACM Computing Surveys, vol. 45, no. 2, pp. 1-39, doi: 10.1145/2431211.2431218.
[32] Kum, H.-C., Chang, J. H., & Wang, W. (2006). "Sequential Pattern Mining in MultiDatabases via Multiple Alignment," Data Mining and Knowledge Discovery, 12(2-3), 151-180.
[33] S. Anitha Elavaras, (Jan 2011). "A Survey on Partitional Clustering Algorithm," International Journal of Enterprise Computing and Business Systems, Vol. 1, Issue 1.
[34] Kaufman, L., & Rousseeuw, P. J., (1990). Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley.
[35] T. Soni Madhulatha, (April 2012). "An overview on Clustering Methods," IOSR Journal of Engineering, Vol. 2(4), pp. 719-725.
[36] Ester, M., Kriegel, H. P., Sander, J., Xu, X. (1996). "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," In Proc. KDD.
[37] Dimitrios Papakyriakou, Dimitra Kottou and Ioannis Kostouros, (April 2018). "Benchmarking Raspberry Pi 2 Beowulf Cluster," International Journal of Computer Applications, 179(32):21-27.
[38] Dimitrios Papakyriakou, (August 2019). "Benchmarking Raspberry Pi 2 Hadoop Cluster," International Journal of Computer Applications, 178(42):37-47.