


International Journal of Computer Applications (0975 – 8887)
Volume 183 – No. 48, January 2022

Data Mining Methods: A Review


Dimitrios Papakyriakou
PhD Candidate
Department of Electronic Engineering
Hellenic Mediterranean University
Crete, Greece

Ioannis S. Barbounakis
Assistant Professor
Department of Electronic Engineering
Hellenic Mediterranean University
Crete, Greece

ABSTRACT
The Big Data revolution is taking place due to the evolution of technology, which enables firms to gather extremely large amounts of data and to disseminate knowledge to their customers, partners, and competitors in the marketplace [1]. The deeper we dive into technology, the more we blend the physical with the virtual world, having in mind for instance the IoT (Internet of Things) as a network of physical devices connected together and able to exchange data.

There are many Big Data platforms a company can choose, such as Hadoop and Apache Spark, to analyze large sets of data. Moreover, many data mining techniques such as Classification, Clustering Analysis, Correlation Analysis, Decision Tree Induction, and Regression Analysis can be used to identify patterns for knowledge discovery. This paper provides an extensive review and summary of Big Data mining techniques, together with the most common data mining algorithms suitable for handling large datasets. The review depicts the general pros and cons of these algorithms and the corresponding fields where they apply, and in general acts as a guideline for data mining researchers, giving an outlook on which algorithms to choose based on their needs and on the given datasets.

Keywords
Big Data, Big Data Analytics, Data Mining Algorithms, Data Clustering

1. INTRODUCTION
When we refer to Big Data, we mean the combination of structured, semi-structured, and unstructured data collected by organizations and used in various projects in combination with predictive modeling tools and advanced Big Data analytics applications. The classifications of data referred to above are important to understand because of the rapid increase of semi-structured and unstructured data nowadays on the one hand, and the advanced development of tools that make managing and analyzing these classes of data feasible on the other hand.

Structured data. – Structured data can be created by machines and humans and has a pre-defined (fixed) data model, format, and structure, which a database designer can set up so that entities can be grouped together to form relations. This makes structured data easy to store, analyze, and search. A relational database is a representative example of structured data, where tables are linked together using unique IDs and a query language is used to interact with the data. Today the estimated amount of structured data accounts for less than 20 percent of all data, whereas a much bigger percentage of all the data in our world is unstructured.

Unstructured data. – Unstructured data has no inherent structure, cannot be contained in a row-column database, and does not have an associated data model. Unstructured data is usually stored as different types of files, for instance text documents, PDFs, photos, videos, audio files, social media content, satellite imagery, websites, and call center transcripts/recordings. Compared to structured data stored in spreadsheets or relational databases, unstructured data is usually stored in NoSQL databases, applications, and data warehouses. A plethora of information in unstructured data can nowadays be processed automatically with artificial intelligence algorithms.

Semi-structured data. – Semi-structured data is basically a mix between structured and unstructured data; it has some defining or consistent characteristics and some structure, but does not conform to a data model. Semi-structured data lacks a fixed or rigid schema and cannot be stored in the form of rows and columns in databases, but it contains tags and elements in the form of metadata which are used to group data and describe how the data is stored. Examples of semi-structured data sources are e-mails, XML and other markup languages, binary executables, TCP/IP packets, zipped files, and web pages.

2. BIG DATA DIMENSIONS
The concept of Big Data gained momentum in the early 2000s when Gartner analyst Doug Laney articulated the definition of Big Data by analyzing the Volume, Velocity, and Variety dimensions, the so-called three Vs [2]. According to that, there are three significant dimensions of Big Data, "Figure 1".

Figure 1: The 3 (Vs) of Big Data

Nowadays, we all know that Big Data has penetrated every industry, and it is accepted as a prevailing driving force for every organization seeking to succeed across the globe. "Big Data" as a terminology refers to huge and complex data that is difficult to process using traditional methods, compared to old-fashioned data.
It is fine as long as a business can deal with its data using Excel sheets and databases; however, when the data cannot be fitted into such tools, then we think about Big Data and analytics.

Volume. – When we refer to Volume, we mean the size of the huge data sets, lying between terabytes and zettabytes, coming from a variety of sources such as business transactions, smart Internet of Things (IoT) sensor devices, social media, and other e-commerce platforms that generate real-time, structured, and unstructured data. It is estimated that 2.5 quintillion bytes of data are created each day. According to McAfee and Brynjolfsson, more data crosses the internet every second than the total amount of data stored online 20 years ago [3]. The volume of data created, captured, copied, and consumed worldwide is forecast to increase rapidly, reaching 59 zettabytes in 2020 and 149 zettabytes in 2024 [4].

Velocity. – Broadly speaking, Velocity refers to the speed of generating, processing, and analyzing the data. Nowadays, it is crucial for an organization to have the information quickly, as close to real time as possible, in the sense of attaching much more importance to velocity than to volume, which gives organizations a bigger comparative advantage [5], [6], [7]. The appropriate business decisions are strongly dependent on the data being available at the right time, since after a couple of hours it may be useless under certain circumstances. For instance, in a machine learning service running on a social media platform with billions of users who post and upload messages, photos, and videos, there is a continuous stream of transactions of petabytes of data being transferred from millions of devices. As we can understand, the rate at which this volume of data flows in per second is very high, and this defines the velocity of the data. Representative examples of data generated with such high velocity are Twitter messages and Facebook posts. Another example of velocity is sensor data in the Internet of Things (IoT) evolution, where connected sensors are taking off at a dramatic rate, with data being transmitted at a near constant rate. Yet another example of velocity is packet analysis for cybersecurity, where unfortunately threatening payloads can be hidden in a data flow passing through the firewall. Those data must be investigated and analyzed for patterns of suspect behavior, and the situation is getting harder as more data is protected using encryption and the malware payloads sit inside the encrypted packets.

Variety. – Variety refers to the different types and formats of data, namely the diversity of data types and data sources, from structured numeric data stored in traditional databases to unstructured data types such as text documents, PDFs, photos, videos, audio files, social media content, XML, and so on. These kinds of heterogeneous data sets pose a big challenge for big data analytics and require distinct processing capabilities and specialist algorithms [5], [8]. A typical example of high-variety data sets would be the closed-circuit television (CCTV) audio and video generated in a surveillance area of a city. More than 80% of the data in the world today is unstructured and at first look does not show any clue of relationships. Moreover, when it comes to Big Data, two additional dimensions are under consideration, Veracity and Value, "Figure 2".

Figure 2: The 5 (Vs) of Big Data

Veracity. – Veracity refers to the quality, the accuracy, and the reliability of the collected data, since data comes from so many different sources. The first side of veracity in Big Data is not just the quality itself but how trustworthy the data type and the data source are, considering abnormalities, inconsistencies, and duplication as well [5], [6], [9]. The second side of data veracity involves the processing method of the data and whether the output is adequate for the objectives set by the business needs.

Value. – Value refers to an organization's ability to transform those huge amounts of data into real business, since accurate data acts as a steppingstone that enables businesses to get closer to their customers' needs and expectations. Namely, Value denotes the added value for companies when huge amounts of data (Volume) from highly diverse sources (Variety) and of different quality (Veracity) are used to quickly make vital business decisions and gain a comparative advantage [7], [9].

3. DATA MINING METHODS
When we refer to data mining, we mean the process of finding potentially useful patterns in huge data sets. During this process, Machine Learning, Statistics, and Artificial Intelligence (AI) are used to extract information about the probability of future events. The diversified aspects of data mining comprise data classification, data integration, data transformation, data discretization, pattern evaluation, and more. Data mining techniques are used to discover hidden and unsuspected relationships amongst the data and are applied in marketing, sales, fraud detection, scientific discoveries, product development, healthcare, and education. Moreover, data mining techniques are used by organizations to solve business problems such as increasing revenues, acquiring new customers, improving cross-selling and up-selling, and increasing the Return on Investment (ROI) of marketing campaigns. As a result, organizations deliver consistent results that keep businesses ahead of the competition.

3.1 Association Rule Learning
In data science, the association rules technique is used to discover correlations between seemingly independent relational and transactional databases and datasets, and to observe frequently occurring patterns.
Constraints on various measures of significance and interest are used in order to select the suitable rules among the set of all possible rules. An association rule always has two parts, an antecedent (if) and a consequent (then), where the antecedent is something that is found in the data, and the consequent is an item that is found in combination with the antecedent. The two primary measures that association rules use are support and confidence, which are user-defined measures of interestingness [10].

Support. – Support is the measure of how frequently an itemset appears in the dataset, where, for a given rule, the itemset is the list of all the items in the antecedent and the consequent.

Support(X → Y) = (Transactions containing both X and Y) / (Total number of transactions) = frq(X, Y) / N    (1)

In other words, support denotes the frequency of the rule within the transactions. A high value means that the rule involves a large part of the database.

Confidence. – Confidence is the measure of the likelihood of the consequent appearing in the cart given that the cart already contains the antecedent.

Confidence(X → Y) = (Transactions containing both X and Y) / (Transactions containing X) = frq(X, Y) / frq(X)    (2)

The most common algorithms used for association rules are the "Apriori Algorithm", the "Eclat Algorithm" and the "Frequent Pattern (FP) Growth Algorithm". Association rule mining is suitable for non-numeric, categorical data and is applied in Market Basket Analysis, medical diagnosis, protein sequences, census data analysis, and Customer Relationship Management (CRM) of the credit card business.
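As an illustration of equations (1) and (2), the following minimal Python sketch computes the support and confidence of a candidate rule X → Y over a small, made-up list of transactions; the basket contents and the chosen rule are hypothetical examples, not data from the paper.

```python
# Minimal sketch: support and confidence of a rule X -> Y, following equations (1)-(2).
# The transactions below are invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"milk", "butter"},
    {"bread", "butter"},
]

def support(antecedent, consequent, transactions):
    """Fraction of transactions containing both X and Y: frq(X, Y) / N."""
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    return both / len(transactions)

def confidence(antecedent, consequent, transactions):
    """frq(X, Y) / frq(X): how often Y appears when X is already in the basket."""
    containing_x = sum(1 for t in transactions if antecedent <= t)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    return both / containing_x if containing_x else 0.0

X, Y = {"bread"}, {"butter"}
print("support   :", support(X, Y, transactions))     # 0.4  (2 of 5 baskets)
print("confidence:", confidence(X, Y, transactions))  # 0.5  (2 of 4 baskets with bread)
```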
3.2 Classification
Classification is the problem of predicting a discrete random variable Y from another random variable X, and it is sometimes called discrimination, pattern classification, or pattern recognition. Classification is a method which categorizes data into a definite number of classes, and in turn labels are assigned to each class. The main idea of classification algorithms is to predict the target class by analyzing the training dataset, namely to categorize the data into a given number of classes. We use the training set of data to obtain better boundary conditions and to assign new data to preset categories or classes [11], [12]. Classification techniques are used to predict the group membership or class (hence the name classification techniques) of individuals (data) for predefined group memberships, and also to describe which characteristics of individuals can predict their group membership.

The types of classification algorithms can be broadly grouped as follows:

3.2.1 Linear Classifiers
Linear classifiers perform classification based on a linear function of the inputs; that is to say, linear models for classification separate input vectors into classes using linear decision boundaries.

Logistic Regression. – Logistic regression is a classification method that models the probability that an observation belongs to one of two classes and, like all regression analyses, logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables. The heart of the matter of logistic regression analysis is the task of estimating the log odds of an event.

The statistical model typically used to model a binary dependent variable relies on the logistic function, also called the sigmoid function, given by equation (3):

F(x) = 1 / (1 + e^{-x}) = e^{x} / (1 + e^{x})    (3)

This function allows logistic regression to squeeze values from the whole real line into (0, 1). From a mathematical point of view, logistic regression starts from a linear equation, which constitutes the log-odds and is then passed through the sigmoid function, which squeezes the output of the linear equation into a probability in (0, 1). As a result, we can choose a decision boundary and use this probability to carry out a classification task. In logistic regression, the odds of an event occurring are given by the formula:

logit = log(odds), where odds = P(event) / (1 − P(event)) = e^{w_0 + w_1 x_1 + ⋯ + w_n x_n}    (4)

The log odds of an event are obtained by taking the logarithm of equation (4):

logit(p) = log(p(x) / (1 − p(x))) = log(e^{w_0 + w_1 x_1 + ⋯ + w_n x_n}) = w_0 + w_1 x_1 + ⋯ + w_n x_n    (5)

The odds ratio is log-transformed to remove the restricted range, as probabilities lie in the range p(x) ∈ (0, 1), x ∈ R. The log transformation changes this to values from negative infinity to positive infinity, and moreover the log values are easier to interpret.
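A short numerical sketch of equations (3)–(5) is shown below; the weight vector and input values are purely illustrative choices rather than estimates fitted to any real dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function of equation (3): maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative, hand-picked weights (intercept w0 plus w1, w2); not fitted to real data.
w = np.array([-1.0, 0.8, 2.0])
x = np.array([1.0, 0.5, 1.5])   # leading 1.0 multiplies the intercept

log_odds = w @ x                # linear part w0 + w1*x1 + w2*x2, as in equation (5)
p = sigmoid(log_odds)           # probability of the positive class
label = int(p >= 0.5)           # decision boundary at 0.5

print(f"log-odds = {log_odds:.3f}, p = {p:.3f}, predicted class = {label}")
```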
Where:
The types of Classification Algorithms can be broadly
classified as following: 𝑃 𝑐 𝐵 : Posterior Probability of class (c, target) given
predictor (x, attributes), meaning how often (c) happens given
3.2.1 Linear Classifiers: that (B) happens.
Linear classifiers use classification on a linear function of
inputs, that is to say, linear models for classification separate 𝑃 𝑐 : is the prior probability of class.
input vectors into classes using linear decision boundaries.
𝑃 𝐵 𝑐 : is the likelihood, which is the probability of
Logistic Regression. –Logistic regression is a classification predictor given class, meaning how often (B) happens given
method that models the probability of an observation to one of that (A) happens.
two classes and like all regression analysis, the logistic
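As a minimal sketch of how such a classifier is commonly applied in practice, the snippet below fits scikit-learn's Gaussian Naive Bayes implementation to synthetic data; the dataset, split size, and random seed are arbitrary assumptions for illustration.

```python
# Minimal sketch of a Naive Bayes classifier on synthetic data with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB()            # assumes features are independent given the class
model.fit(X_train, y_train)     # estimates P(c) and per-feature likelihoods P(x_i | c)

y_pred = model.predict(X_test)  # picks the class with the highest posterior, as in equation (6)
print("accuracy:", accuracy_score(y_test, y_pred))
```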

Fisher's Linear Discriminant. – Linear Discriminant Analysis (LDA), sometimes also called Fisher's Linear Discriminant, is a linear classifier that projects a p-dimensional feature vector onto a hyperplane that divides the space into two half-spaces, where each half-space represents a class (+1 or −1). This methodology relies on projecting points onto a line, and its outputs are precisely the decision surfaces or decision regions for a given set of classes [13]. The decision boundary (7) is characterized by the hyperplane's normal vector w and the threshold w_0:

(w_1, …, w_p)^T (x_1, …, x_p) + w_0 = w^T x + w_0 = 0    (7)

Given a new input vector x ∈ R^p, classification is achieved by computing (8) and assigning the resulting class label (y = −1 or y = +1) to the input x:

y = sign(w^T x + w_0)    (8)

To compute w, LDA assumes that the class-conditional distributions P(x|c = 1) and P(x|c = 2) are normal distributions with mean μ_c and covariance Σ_c for c ∈ {1, 2} [14]. LDA is an extremely popular dimensionality reduction technique which has become critical in machine learning, and it is commonly used in the pre-processing step of machine learning and pattern classification projects.

3.2.2 Support Vector Machines
Support Vector Machines (SVMs) are a set of supervised learning methods used for classification, regression, and outlier detection. The technique is very effective in high-dimensional spaces and remains effective in cases where the number of dimensions is greater than the number of samples. Moreover, SVMs work well when there is a clear margin of separation between classes. On the other hand, the SVM algorithm is not suitable for large data sets, does not provide probability estimates, and does not perform well when the target classes overlap. In addition, in cases where the number of features for each data point exceeds the number of training data samples, the SVM technique will underperform [15], [16].

The essence of the SVM simply involves finding a boundary that separates the different classes from each other: in 2-dimensional space the boundary is called a line, in 3-dimensional space it is called a plane, and finally in more than 3 dimensions the boundary is called a hyperplane. The math behind the SVM is depicted below:

MINIMIZE over a_0, …, a_m:  Σ_{j=1}^{n} max{0, 1 − (Σ_{i=1}^{m} a_i x_{ij} + a_0) y_j} + λ Σ_{i=1}^{m} (a_i)^2    (9)

where the first part of formula (9),

Σ_{j=1}^{n} max{0, 1 − (Σ_{i=1}^{m} a_i x_{ij} + a_0) y_j},

focuses on minimizing the error, i.e., the number of falsely classified points that the SVM makes, and the second part of formula (9),

λ Σ_{i=1}^{m} (a_i)^2,

focuses on maximizing the margin between the two classes.

SVMs are powerful and flexible supervised machine learning algorithms which are used both for classification and regression, with astonishing real-life applications such as:

Inverse geo-sounding problems, where SVMs help to determine the layered structure of the planet.

Seismic liquefaction potential, with great result accuracy; in this category the Standard Penetration Test (SPT) and the Cone Penetration Test (CPT) are used to check the occurrence and non-occurrence of liquefaction.

Protein fold and remote homology detection, where different methods are used to solve the kernel functions; the kernel functions help to find the similarity between different protein sequences.

Facial expression classification, where SVMs have great use in various life-care systems, for instance in classifying a normal, happy, or sad look.
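Below is a minimal sketch of training a linear SVM with scikit-learn on assumed synthetic data; the library internally optimizes a hinge-loss-plus-regularization objective in the spirit of formula (9), and the pipeline and parameter values shown are illustrative choices only.

```python
# Minimal sketch of a linear SVM classifier on synthetic, well-separated data.
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)   # two separable classes

# Feature scaling first, since SVMs are margin/distance based.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.named_steps["svc"].n_support_)
```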
3.2.3 Quadratic Classifiers
Quadratic Discriminant Analysis (QDA) is closely related to Linear Discriminant Analysis (LDA), with the assumption that the measurements from each class are normally distributed. QDA is a variant of LDA that allows for non-linear separation of data. QDA is particularly useful if there is prior knowledge that the individual classes exhibit distinct covariances. On the other hand, a disadvantage of QDA is that it cannot be used as a dimensionality reduction technique.

A quadratic discriminant function is a mapping g : X → R with

g(x) = (1/2) x^T W x + w^T x + w_0    (10)

for some matrix W ∈ R^{d×d}, some vector w ∈ R^d, and some scalar w_0 ∈ R. In the quadratic discriminant function the model parameter is θ = (W, w, w_0), and depending on W the geometry of g can be convex, concave, or neither.

3.2.4 Kernel Estimation
A kernel distribution is a non-parametric representation of the Probability Density Function (PDF) of a random variable. The kernel distribution can be used when a parametric distribution cannot properly describe the data, or when we want to avoid making assumptions about the distribution of the data.
Since the kernel density estimator is the estimated Probability Density Function (PDF), for any real value of x the kernel estimator's formula is given below:

f_h(x) = (1 / (n h)) Σ_{i=1}^{n} K((x − x_i) / h)    (11)

where x_1, x_2, …, x_n are random samples from an unknown distribution, n is the sample size, K(·) is the kernel smoothing function, and h is the bandwidth.

K-Nearest Neighbor. – The K-Nearest Neighbor (KNN) algorithm is very simply used to solve classification problems, where K is the number of neighbors in KNN. The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point and to predict the label from these. Broadly speaking, the distance can be any metric measure, such as the Hamming distance, the Manhattan distance, or the Minkowski distance, with the standard Euclidean distance being the most common choice. KNN is very easy to implement: only the K value and the distance function (e.g., Euclidean) are needed, it requires no training before making predictions, and new data can be added seamlessly without impacting the accuracy of the algorithm. Moreover, there is no training period, as it stores the training dataset and learns from it only at the time of making real-time predictions. As a result, the KNN algorithm is much faster than other algorithms that require training. On the other side, KNN does not work well with large datasets, where performance degradation appears, and does not work well with high-dimensional data, where it becomes difficult for the algorithm to calculate the distance in each dimension. Moreover, KNN requires standardization and normalization before the algorithm is applied to any dataset, and missing values need to be imputed and outliers removed manually, since KNN is sensitive to noise in the dataset.

Assuming that we have a dataset where X is a matrix of features from an observation and Y is a class label, the formula that estimates the conditional distribution of Y given X, classifying an observation to the class with the highest probability, is depicted below:

Pr(Y = j | X = x_0) = (1/k) Σ_{i ∈ N_0} I(y_i = j)    (12)

Given a positive integer k, the k-nearest-neighbors method looks at the k observations closest to a test observation x_0, and formula (12) estimates the conditional probability that it belongs to class j.

The distance between the input data point and the other points in the training data can be calculated as follows:

Euclidean distance: d(x, y) = (Σ_{i=1}^{p} (x_i − y_i)^2)^{1/2}    (13)

Manhattan distance: d(x, y) = Σ_{i=1}^{p} |x_i − y_i|    (14)

Minkowski distance: d(x, y) = (Σ_{i=1}^{p} |x_i − y_i|^q)^{1/q}    (15)

where x is a point with coordinates (x_1, x_2, …, x_p) and y is a point with coordinates (y_1, y_2, …, y_p). It should also be noted that all three distance measures are only valid for continuous variables. In the case of categorical variables, the Hamming distance must be used:

Hamming distance: d(x, y) = Σ_{i=1}^{p} |x_i − y_i|, where x_i = y_i contributes 0 and x_i ≠ y_i contributes 1    (16)

In case there is a mixture of numerical and categorical variables in the dataset, it is necessary to standardize the training set.
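To make the prediction rule of formula (12) and the distances of equations (13)–(14) concrete, here is a small self-contained sketch of a majority-vote KNN classifier; the toy training points and query points are invented for illustration.

```python
import numpy as np
from collections import Counter

def euclidean(x, y):
    """Equation (13)."""
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    """Equation (14)."""
    return np.sum(np.abs(x - y))

def knn_predict(X_train, y_train, x0, k=3, distance=euclidean):
    """Majority vote among the k closest training points, as in formula (12)."""
    dists = [distance(x, x0) for x in X_train]
    nearest = np.argsort(dists)[:k]                # indices of the k nearest neighbours
    votes = Counter(y_train[i] for i in nearest)   # empirical class frequencies
    return votes.most_common(1)[0][0]

# Tiny, made-up 2-D training set with two classes.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
print(knn_predict(X_train, y_train, np.array([5.1, 5.0]), k=3))  # -> 1
```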
3.2.5 Decision Trees Induction
Decision trees are the most popular representation of logic-based classifiers and are well presented in the literature [17], [18]. There are three well-known implementations of decision trees: the Classification and Regression Trees (CART) [17]; Quinlan's univariate tree-growing algorithm, known as the Iterative Dichotomiser 3 algorithm (ID3) [19]; and the C4.5 algorithm [20], which extends the ID3 algorithm by allowing the classification algorithm to deal with numbers and not just categorical values as ID3 does.

Random Forests. – Random forest is a supervised learning algorithm, very flexible, easy to use, and one of the most used machine learning algorithms, which produces great results most of the time. Random forest is based on the bagging algorithm and uses the ensemble learning technique, where it creates as many trees as possible on subsets of the data and combines the output of all the trees. As a result, we achieve a reduction of the overfitting problem in decision trees and also a variance reduction, which eventually improves the accuracy. Random forest is used in both classification and regression problems and works well with categorical and continuous variables. It uses a rule-based approach rather than distance calculations, and as a result no feature scaling (standardization and normalization) is needed. Nonlinear parameters do not affect the performance of a random forest, unlike curve-based algorithms, and it is very stable and comparatively less impacted by noise.

On the other hand, the random forest creates a lot of trees (for instance, it creates one hundred trees by default in the Python sklearn library) and as a result requires much more computational power and resources, in contrast to the decision tree, which is simple and does not require so many computational resources. It also requires much time for training, as it combines a lot of decision trees to determine the class, and it suffers in interpretability and fails to determine the significance of each variable due to the ensemble of decision trees.

In the case of regression problems, when using the random forest algorithm, the Mean Squared Error (MSE) is used to determine how the data branches from each node [21]:

MSE = (1/N) Σ_{i=1}^{N} (f_i − y_i)^2    (17)

where N is the number of data points, f_i is the value returned by the decision tree, and y_i is the value of the data point that is tested at a certain node. Formula (17) calculates the distance of each node from the predicted actual value, helping to decide which branch is the better decision for the forest.
In case we build random forests for classification data, the Gini Index is often used; this is the formula used to decide how nodes on a decision tree branch:

Gini = 1 − Σ_{i=1}^{C} (p_i)^2    (18)

Formula (18) uses the class and probability to determine the Gini of each branch on a node, determining which of the branches is more likely to occur. Here p_i represents the relative frequency of the class that is being observed in the dataset, and C represents the number of classes.

Random forests are a great choice in the banking sector for problems such as estimating the loan-default chance of a customer or detecting fraudulent transactions. Moreover, in the healthcare sector, random forests can be used to identify the potential of a certain medicine or the composition of chemicals required for medicines. In addition, they can be used in hospitals to identify the diseases suffered by a patient, the risk of cancer in a patient, and many other diseases where early analysis and research play a crucial role.
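A minimal sketch of a random forest classifier with scikit-learn is given below, assuming synthetic data; n_estimators=100 mirrors the default number of trees mentioned above, and the Gini criterion corresponds to equation (18).

```python
# Minimal sketch of a random forest classifier on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,   # number of trees in the bagged ensemble (the sklearn default)
    criterion="gini",   # split quality measured with the Gini index of equation (18)
    random_state=42,
)

scores = cross_val_score(forest, X, y, cv=5)   # 5-fold cross-validated accuracy
print("mean accuracy:", scores.mean())
```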
3.2.6 Neural Networks
Neural networks are a set of algorithms that try to find relationships in a dataset in order to recognize patterns by simulating the way the human brain works. Neural networks in fact both cluster and classify: they group unlabeled data according to similarities and classify data when a labeled dataset is available to train on. In other words, neural networks are software routines that can learn from existing data and solve complex real-world problems in an efficient way. Neural network algorithms are designed to cluster raw input, recognize patterns, and interpret sensory data, and despite their multiple advantages, significant computational resources are required. There are several methods to teach a neural network, focusing on the three main learning paradigms examined below:

Supervised Learning. – Supervised Learning (SL) is the machine learning process which is done under the seen labels of the observation variables, contrary to unsupervised learning, where the response variables are not available. In SL, models are fitted on a training set to infer a machine learning algorithm, which is then used to label new observations from the testing set. Supervised learning can be separated into two types of problems when it comes to data mining:

Classification, where an algorithm is used to accurately assign test data into specific categories, meaning that it recognizes specific entities within the dataset and attempts to draw conclusions on how those entities should be labeled or defined.

Regression, which is used to understand the relationship between dependent and independent variables and to make projections, for instance for the sales revenues of a business. Very popular regression algorithms are linear regression, logistic regression, and polynomial regression.

Supervised machine learning models can be used to build and advance a number of business applications, such as image and object recognition, where the location, isolation, and categorization of objects out of videos or images make them useful when applied to computer vision techniques and imagery analysis. Other applications that use Supervised Machine Learning (SML) models are predictive analytics, helping business leaders justify decisions or pivot for the benefit of the organization, and customer sentiment analysis, gaining a better understanding of customer interactions, which can be used to improve brand engagement efforts. Broadly speaking, the challenge with supervised machine learning models is that they require certain levels of expertise to structure accurately, the training is very time-intensive, and the datasets can have a higher likelihood of human error, resulting in algorithms that learn incorrectly. Unlike unsupervised learning models, supervised learning cannot cluster or classify data on its own.

Unsupervised Learning. – Unsupervised Learning (UL) is a machine learning technique where there is no need for users to supervise the model; instead, the model works on its own to discover patterns and information that were previously undetected. The goal of unsupervised machine learning is to model the underlying structure or distribution of the data in order to learn more about the data by using machine learning algorithms. Clustering and association are the two main types of unsupervised learning. Unsupervised learning is much like the way a human learns to think through their own experiences, which makes it closer to real Artificial Intelligence (AI).

Unsupervised learning can be separated into two types of problems when it comes to data mining:

Clustering, which is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group.

Association, where an association rule is an unsupervised learning method used for finding the relationships between variables in a large database.

The most popular unsupervised learning algorithms are the following: K-means clustering, K-nearest neighbors (KNN), hierarchical clustering, anomaly detection, neural networks, Principal Component Analysis, Independent Component Analysis, the Apriori algorithm, and Singular Value Decomposition.

Unsupervised learning is used for more complex tasks compared to supervised learning, because in unsupervised learning there are no labeled input data. In addition, unsupervised learning is preferable when it is easier to obtain unlabeled data than labeled data. At the same time, unsupervised learning is more challenging than other strategies due to the absence of labels: it is intrinsically more difficult than supervised learning as there is no corresponding output, and moreover the result of an unsupervised learning algorithm might be less accurate, since the input data are not labeled and the algorithm does not know the exact output in advance.

Reinforcement Learning. – Reinforcement Learning (RL) is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences. In reinforcement machine learning, the machine learns by itself after making many mistakes and correcting them in turn. RL is one of the hottest research topics currently, is very common in robotics, and its popularity is growing day by day. The reinforcement learning method works by interacting with the environment, whereas the supervised learning method works on given sample data or examples.
There are two RL methods. The first is the "Positive" method, defined as an event that occurs because of specific behavior, increasing the strength and the frequency of the behavior and impacting positively the action taken by the agent. The second method is called "Negative", and it is defined as the strengthening of behavior that occurs because a negative condition was stopped or avoided. The "positive" method helps to maximize performance and sustain change for a more extended period, whereas the "negative" method helps to define the minimum standard of performance. In addition, there are two widely used Reinforcement Learning models, the "Markov Decision Process (MDP)" and "Q-Learning". A Markov Decision Process (MDP) is a mathematical framework for describing an environment in reinforcement learning, by which the learner, often called the agent, learns to behave in this interactive environment using its own actions and the rewards for its actions. The agent discovers which actions give the maximum reward by exploiting and exploring them. Q-Learning is an off-policy reinforcement learning algorithm that seeks to find the best action to take given the current state. It is considered off-policy because the Q-learning function learns from actions that are outside the current policy, like taking random actions, and therefore a policy is not needed. "Table 1" and "Table 2" depict the differences, upon some criteria, between Reinforcement Learning and Supervised Learning, and between Reinforcement Learning and Unsupervised Learning, respectively.

Table 1. Reinforcement Learning vs Supervised Learning

Criteria | Reinforcement Learning | Supervised Learning
Definition | Learns by interacting with the environment | Learns by using labelled data
Data type | No predefined data | Labelled data
Decision mode | Helps to take decisions sequentially | A decision considers the input given at the beginning
Dependency on decision | Labels are given to all dependent decisions | Labels are given for every decision
Problem types | Exploitation or exploration | Regression and classification
Algorithms | Q-Learning, State-Action-Reward-State-Action (SARSA) | Linear and logistic regression, SVM, KNN, etc.
Applications | Robotics, machine learning, aircraft control, AI | Risk evaluation, sales forecasting, object recognition

Table 2. Reinforcement Learning vs Unsupervised Learning

Criteria | Reinforcement Learning | Unsupervised Learning
Definition | Learns by interacting with the environment | Learns by using unlabeled data without any guidance
Data type | No predefined data | Unlabeled data
Decision mode | Helps to take decisions sequentially | The model works on its own to discover patterns and information
Dependency on decision | Labels are given to all dependent decisions, since RL is dependent | The unsupervised model is provided with unlabeled data
Problem types | Exploitation or exploration | Association and clustering
Algorithms | Q-Learning, State-Action-Reward-State-Action (SARSA) | K-Means, C-Means, Apriori
Applications | Robotics, machine learning, aircraft control, AI | Clustering, anomaly detection, visualization, pattern recognition, finding association rules

3.2.7 Learning Vector Quantization
Learning Vector Quantization (LVQ) is a type of artificial neural network algorithm that lets you choose how many training instances to hang onto and learns exactly what those instances should look like, supporting both binary (two-class) and multi-class classification problems. LVQ is based on a prototype-based, supervised learning version of vector quantization, which is used when we have labelled input data. This learning technique uses the class information to reposition the Voronoi vectors slightly, so as to improve the quality of the classifier decision regions, and is very useful for pattern classification problems [22].

Learning Vector Quantization is a neural net that combines competitive learning with supervision and is used for pattern classification. For LVQ we suppose training data V ⊆ R^n, where each v ∈ V has a class label c(v) ∈ C = {1, …, C} indicating to which class v belongs. Further, we assume M prototypes W = {w_k ∈ R^n, k = 1, …, M} with labels c(w_k) ∈ C, such that at least one prototype is assigned to each class [23]. An indicative innovative application that uses the LVQ algorithm is the one proposed for real-time adaptive traffic signal control [24].

3.2.8 Boosted Decision Trees Method
When a decision tree is a weak learner, the resulting algorithm is named gradient boosted trees, where "boosted" means that each tree is dependent on the prior trees. As a result, boosting in a decision tree is a method of combining many weak learners (trees) into a strong classifier, and it tends to improve accuracy with some small risk of less coverage [25], [26].
International Journal of Computer Applications (0975 – 8887)
Volume 183 – No. 48, January 2022

Each tree attempts to minimize the errors of the previous tree. Trees in boosting are weak learners, but by adding many trees in series, meaning combining a learning algorithm in series, a strong learner is achieved from many sequentially connected weak learners, which makes boosting a highly efficient and accurate model. Since trees are added sequentially, boosting algorithms learn slowly, and in statistical learning, models that learn slowly perform better. However, the number of trees, for instance in gradient boosted decision trees, is very critical in terms of overfitting: adding too many trees will cause overfitting, so it is very important to stop adding trees at some point.

The theory of a decision tree has the following components: a root node, which is the first node and the starting point of the tree; and branches, arrows which connect one node to another, showing the flow from question to answer. Nodes that have child nodes are called interior nodes. Leaf or terminal nodes are nodes that do not have child nodes and represent a possible value of the target variable given the variables represented by the path from the root. The branching factor (b) represents the number of children at each node.

The advantages of Decision Trees (DT) can be summarized as follows: they are simple to understand and easy to interpret and visualize, and all kinds of data can be handled, making them widely used. DTs are considered to be non-parametric, meaning that they make no assumptions about the data points' space or the classifier's structure. DTs are robust, since they require less effort from users for pre-processing data; they are not influenced by outliers and missing values either. On the other hand, overly complex trees can be developed due to overfitting. Moreover, decision trees can be unstable, because small variations in the data might result in a completely different tree being generated. In addition, decision tree learners create biased trees if some classes are more likely to be predicted or have a higher number of samples to support them. Optimality is one more disadvantage, since the problem of learning an optimal decision tree is known to be NP-complete (nondeterministic polynomial-time complete), and the number of samples or a slight variation in the splitting attribute can change the results drastically.
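The following is a hedged sketch of gradient boosted decision trees with scikit-learn on synthetic data; the parameter values (number of trees, learning rate, tree depth, early stopping) are illustrative choices that reflect the overfitting caveat discussed above.

```python
# Minimal sketch of gradient boosted decision trees: each shallow tree is a weak
# learner fitted to the residual errors of the previous ones.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=200,       # upper bound on the number of trees; too many can overfit
    learning_rate=0.05,     # slow learning: small contribution per tree
    max_depth=3,            # shallow (weak) trees
    n_iter_no_change=10,    # stop adding trees when the validation score stops improving
    random_state=0,
)
gbm.fit(X_train, y_train)

print("trees actually fitted:", gbm.n_estimators_)
print("test accuracy:", gbm.score(X_test, y_test))
```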
3.3 Regression Analysis
Regression analysis is a well-known statistical learning technique used to estimate the relationship between a dependent variable and one or more independent variables, where an independent variable is used as an assumed input element that is changed in order to see the impact on the dependent variable. In other words, regression analysis is a data mining process that helps to understand the correlation and independence of the variables, to determine which factors matter most and which factors can be ignored, and eventually how these factors influence each other.

There are many types of regression analysis techniques, depending on a number of factors such as the type of target variable, the shape of the regression line, and the number of independent variables. Regression analysis has a wide range of real-life applications, such as financial forecasting and sales and promotions forecasting. The different types of regression are briefly explained below:

Linear Regression. – A linear regression model comprises a predictor variable and a dependent variable related to each other in a linear fashion. The general linear regression model can be stated by equation (19):

y_i = β_0 + β_1 X_{1i} + β_2 X_{2i} + ⋯ + β_k X_{ki} + ε_i    (19)

where β_0 is the intercept, the β_j's are the slopes between Y and the corresponding X_j, and ε (pronounced epsilon) is the error term that captures errors in the measurement of Y and the effect on Y of any variables missing from the equation that would contribute to explaining variations in Y. Linear regression should not be used to analyze very large data.

Logistic Regression. – Logistic regression is one of the types of regression analysis techniques, which is used when the dependent variable is discrete, for instance (0 or 1) or (true or false). This means that the target variable can have only two values, and a sigmoid curve denotes the relation between the target variable and the independent variable.

The logistic regression model is based on the logistic function and can be stated by equation (20):

F(x) = L / (1 + e^{−k(x − x_0)})    (20)

where x_0 is the x value of the sigmoid's midpoint, L is the curve's maximum value, and k is the logistic growth rate or steepness of the curve.

Logistic regression works best with large data sets that have an almost equal occurrence of values in the target variable. The dataset should not contain a high correlation between independent variables (a phenomenon known as multicollinearity), as this will create a problem when ranking the variables. Logistic regression can also suffer from complete separation: if there is a feature that would perfectly separate the two classes, the logistic regression model can no longer be trained, because the weight for that feature would not converge, due to the fact that the optimal weight would be infinite.

Ridge Regression. – Ridge regression is a model tuning method that is used to analyze any data that suffers from multicollinearity. Ridge regression performs L2 regularization and is usually used when there is a high correlation between the independent variables. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large, so they may be far from the true value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. The ridge regression objective can be stated as below:

Σ_{i=1}^{n} (y_i − Σ_{j=1}^{p} x_{ij} β_j)^2 + λ Σ_{j=1}^{p} β_j^2    (21)

where λ Σ_{j=1}^{p} β_j^2 represents the L2 regularization term. If lambda is zero, then we get Ordinary Least Squares (OLS). However, a high value of lambda will add too much weight, which will result in model under-fitting, so it is important how we choose the parameter lambda for our model. Overfitting problems may lead to inaccurate and unstable model building, so the technique that helps minimize the overfitting problem in Machine Learning (ML) models is known as regularization. Ridge regression uses L2 regularization, whereas Lasso regression uses L1 regularization.

Lasso Regression. – Lasso regression is like linear regression, but it uses a shrinkage technique where the regression coefficients are shrunk towards zero. Linear regression gives you regression coefficients as observed in the dataset; Lasso regression allows you to shrink or regularize these coefficients to avoid overfitting and make them work better on different datasets.
Lasso regression penalizes the less important features of your dataset and makes their respective coefficients zero, thereby eliminating them. Hence, it provides the benefit of feature selection and of simple model creation. The Lasso regression objective can be stated as below:

Σ_{i=1}^{n} (y_i − Σ_{j} x_{ij} β_j)^2 + λ Σ_{j=1}^{p} |β_j|    (22)

where λ denotes the amount of shrinkage. λ = 0 implies that all features are considered, and it is equivalent to linear regression, where only the residual sum of squares is considered to build a predictive model; λ = ∞ implies that no feature is considered. The bias increases with an increase in λ, and the variance increases with a decrease in λ.

Table 3. Differences between Lasso and Ridge regression

Ridge Regression | Lasso Regression
It makes use of the L2 regularization technique. | It makes use of the L1 regularization technique.
It performs feature weight updates as the loss function has an additional squared term. | It performs feature weight updates as the loss function has an additional term containing the L1 norm of the weights vector.
It drives down the overall size of the weight values during optimization and reduces overfitting. | It drives down the overall size of the weight values during optimization and reduces overfitting.
which transforms data points into polynomial features of a common uses of predictive analytics includes the domain of
given degree, and models them using a linear model. It works fraud detection and security, Marketing, Operation and Risk
in a similar way to multiple linear regression with a little Identification. The most used Predictive Analytics models
modification but uses a non-linear curve and it is used when includes the Classification Model, which are best to answer
data points are present in a non-linear fashion.Polynomial Yes or No questions, the Clustering Model which sorts data
regression is one of several methods of curve fitting, where into separate nested smart based on similar attributes. Using
curve fitting is a process of constructing the best fit line that the clustering model, it can be quickly separate customers into
passes through all the data points, is not a straight line but a similar groups based on common characteristics and devise
curve line.With polynomial regression, the data is strategies for each group at a larger scale.Forecast Model is
approximated using a polynomial function that takes the form another predictive technique which can be applied wherever
(23). historical numerical data is available such as a call center to
predict how many supports calls, they will receive per hour.
𝑓 𝑥 = 𝑐0 + 𝑐1 𝑥 + 𝑐2 𝑥 2 … 𝑐𝑛 𝑥 𝑛 + 𝑟𝑒𝑠𝑖𝑑𝑢𝑎𝑙 𝑒𝑟𝑟𝑜𝑟 (23) Moreover, Outliers and Time Series models are used as
predictive techniques, where anomalous data entries within a
Where (𝑛) is the degree of the polynomial and (𝑐) is a set of dataset are identifiedor identify sequence of datapoints using
coefficients. time as an input parameter, respectively.
Polynomial Regression provides the best approximation of the Broadly speaking, the common predictive algorithms can by
relationship between the dependent and independent variable separated into two groups: Machine Learning and Deep
and fits a wide range of curvature. On the other hand, it is Learning. Machine learning involves structural data, comprise
very sensitive to the outliers where the presence of one or two both linear and nonlinear varieties, train more quickly, while
outliers in the data can seriously affect the results of the nonlinear are better optimized for the problems they are likely
nonlinear analysis. Moreover, there are fewer model to face which is more often nonlinear. Deep Learning is a
validation tools for the detection of outliers in nonlinear subset of machine learning that is more popular to deal with
regression than there are for linear regression. audio, video, text, and images.With machine learning
predictive modeling, there are several different algorithms that
3.4 Outlier Detection can be applied, where the most common are the Random
An outlier is an observation that diverges from the overall
Forest, the Generalized Linear Model (GLM) for two Values,
pattern on a sample and mainly indicate a variability in a
the Gradient Boosted Model (GBM), the K-Means, and the
measurement, experimentalerrors, or a novelty. The outliers
Prophet algorithm.
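As a simple illustration of the Z-score (Extreme Value Analysis) method listed above, the sketch below flags values that lie more than three standard deviations from the mean of a single feature; the sample values and the threshold of 3 are illustrative assumptions.

```python
# Minimal sketch of Z-score outlier detection for a single feature.
import numpy as np

values = np.array([9.9, 10.1, 10.0, 9.8, 10.2, 10.0, 9.95, 10.05,
                   10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 9.9, 25.0])  # 25.0 is an injected outlier

z_scores = (values - values.mean()) / values.std()   # distance from the mean in std-dev units
outliers = values[np.abs(z_scores) > 3]               # flag points more than 3 std devs away

print("z-scores:", np.round(z_scores, 2))
print("detected outliers:", outliers)                 # -> [25.]
```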
can be in two categories, the univariatewhen looking for
instance, at a distribution of values in a single feature space, 3.6 Sequential Patterns
and the multivariance in n-dimensional space. In n- Similar to association rules mining, by using sequential
dimensional space, there is a need to train a model.Moreover, patterns mining, it can discoverstatistically interesting and
the outliers can be come out depending on the different type useful patterns and rules in a large-scale table that contains
of data: such as point outliers which are single data points that sequences of transactions [28]. A sequential pattern is a
appears far from the rest of the distribution, contextual frequent subsequence existing in a single sequence or a set of
outliers, that could be noise in data e.g.,background noise

A sequential pattern is a frequent subsequence existing in a single sequence or in a set of sequences. A sequence α = ⟨α_1 α_2 … α_n⟩ is a subsequence of another sequence β = ⟨b_1 b_2 … b_m⟩ if there exist integers 1 ≤ j_1 < j_2 < ⋯ < j_n ≤ m such that α_1 ⊆ b_{j_1}, α_2 ⊆ b_{j_2}, …, α_n ⊆ b_{j_n} [29].

The classes of algorithms suitable for and used in sequential pattern mining are the following: Apriori-like algorithms, BFS (Breadth First Search)-based algorithms, DFS (Depth First Search)-based algorithms, closed sequential pattern-based algorithms, and incremental-based algorithms [30], [32].

Sequential data mining techniques are suitable in healthcare, where patterns are observed in the symptoms of a particular disease and in daily activity and health data; in education and web usage mining; and in text mining, to discover trends, for text categorization, for document classification, and for authorship identification. Moreover, sequential mining techniques are used in the bioinformatics domain for predicting rules for the organization of certain elements in genes, for protein function prediction, for gene expression analysis, for protein fold recognition, and for motif discovery in DNA sequences. Pattern mining can also be used in the field of telecommunications for mining group patterns from mobile user movement data, for customer behavior prediction, for predicting the future location of a mobile user for location-based services, and for mining patterns useful for mobile commerce [32].
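The subsequence definition above can be sketched directly in a few lines of Python, representing each sequence as a list of itemsets (sets); the example sequences are made up for illustration.

```python
# Sketch of the subsequence test defined above: alpha is a subsequence of beta if its
# itemsets can be embedded, in order, into itemsets of beta.
def is_subsequence(alpha, beta):
    """alpha, beta: lists of sets (itemsets). True if alpha is a subsequence of beta."""
    j = 0
    for a in alpha:                                    # each itemset of alpha must be contained ...
        while j < len(beta) and not a <= beta[j]:
            j += 1                                     # ... in some later itemset of beta
        if j == len(beta):
            return False
        j += 1                                         # indices must be strictly increasing
    return True

alpha = [{"a"}, {"b", "c"}]
beta = [{"a", "d"}, {"e"}, {"b", "c", "f"}]
print(is_subsequence(alpha, beta))             # True
print(is_subsequence([{"c"}, {"a"}], beta))    # False (order matters)
```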
with squared distances.
4. CLUSTERING METHOD
A cluster is a group of objects that belong to the same class, meaning that similar objects are grouped in one cluster and dissimilar objects are grouped in other clusters, based on their similarities.

Cluster analysis is a statistical method for grouping data into subsets with related characteristics in order to understand the internal structure of the data. Clustering is considered one of the most important unsupervised learning methods because no information is provided about the correct answer for any of the objects; in fact, it can reveal previously undetected correlations in a complex data set. Clusters are regions where the density of similar data points is high. Clusters are most often seen in a spherical shape, but they can be of any shape; the type of algorithm used determines how the clusters will be created.
4.1 Clustering Categories
4.1.1 Partitioning Clustering Method
This method is one of the most popular choices for analysts when creating clusters: the clusters are partitioned based upon the characteristics and the similarity of the data points. Partitioning-based clustering algorithms minimize a given clustering criterion by iteratively relocating data points between clusters until a (locally) optimal partition is attained. Since the number of data points in any data set is finite, and the number of distinct partitions is also finite, the problem of local minima could in principle be avoided by using exhaustive search methods. The number of different partitions of (n) observations into (K) groups is a Stirling number of the second kind, given by the following form (23):

S_n^{(K)} = \frac{1}{K!} \sum_{i=0}^{K} (-1)^{K-i} \binom{K}{i} i^{n}    (23)

From the above it can be seen that enumerating all possible partitions is impossible even for relatively small problems, and the problem is even more demanding when, additionally, the number of clusters is unknown. In that case the number of different combinations is the sum of the Stirling numbers of the second kind (24):

\sum_{i=1}^{K_{max}} S_n^{(i)}    (24)

where (K_max) is the maximum number of clusters and, obviously, K_max ≤ n. Therefore, a more practical approach than exhaustive search is iterative optimization.
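As a quick illustration of how fast these counts grow, the following short Python sketch (an illustrative aside, not taken from the paper) evaluates formula (23) directly and sums it up to K_max as in formula (24):

from math import comb, factorial

def stirling2(n, k):
    # Stirling number of the second kind: ways to partition n observations
    # into exactly k non-empty groups, as in formula (23)
    return sum((-1) ** (k - i) * comb(k, i) * i ** n for i in range(k + 1)) // factorial(k)

def total_partitions(n, k_max):
    # Sum of the Stirling numbers up to k_max, as in formula (24)
    return sum(stirling2(n, k) for k in range(1, k_max + 1))

print(stirling2(10, 3))          # 9330 ways to split 10 observations into 3 groups
print(total_partitions(10, 10))  # 115975 possible clusterings of only 10 observations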

The advantages and disadvantages of the Partitioning Clustering Method are presented below in "Table 4".

Table 4. Partitioning Clustering Method (pros and cons)

Advantages:
- Relatively scalable and simple.
- Suitable for datasets with compact, well-separated spherical clusters.
- Optimal for certain criteria.

Disadvantages:
- Poor cluster descriptors; often requires long computation time.
- High sensitivity to the initialization phase, noise and outliers, since it works with squared distances.
- Needs an initial K (number of clusters) and has long computational time.

The algorithms that fall into this category are as follows:

K-Means Clustering Algorithm. – K-Means clustering is one of the most widely used algorithms; the value (k) is defined by the user. Basically, K-Means is an iterative process that divides a given data set into (K) disjoint groups based upon the distance metric used for the clustering. In other words, the algorithm adjusts the assignment of objects to the closest current cluster mean until no new assignments of objects to clusters can be made in a new iteration [33]. K-Means is perhaps the most widely used clustering principle and the best known of the partitioning-based clustering methods that use prototypes for cluster representation. Even though its simplicity is a clear advantage, it has some major drawbacks: it is very hard to specify the number of clusters in advance, and because it works with squared distances it is also sensitive to outliers. The K-Means algorithm has linear time complexity, and it can be used with large datasets conveniently. As an unsupervised clustering algorithm, K-Means provides many benefits with unlabeled big data: even if the data has no labels (class values or targets) or even column headers, K-Means will still successfully cluster the data. K-Means is also very easy to use with the default parameters of the Scikit-Learn implementation, such as the number of clusters (8 by default), the maximum number of iterations (300 by default) and the number of centroid initializations (10 by default); all these default parameters can easily be adjusted later to suit the task goals. Moreover, K-Means returns clusters which can be easily interpreted and even visualized. A few example use cases are "customer segmentation", "logistic optimization", "user suggestions", "patient management", "trial management" and "fraud detection".
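A minimal usage sketch of the Scikit-Learn implementation mentioned above (the synthetic dataset and the parameter values are illustrative assumptions, not taken from the paper):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled data with three compact spherical groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Defaults are n_clusters=8, max_iter=300, n_init=10; here n_clusters is adjusted
kmeans = KMeans(n_clusters=3, max_iter=300, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # one prototype (mean) per cluster
print(kmeans.inertia_)           # sum of squared distances to the closest center

A common way to reassess the chosen n_clusters afterwards is to compare sklearn.metrics.silhouette_score(X, labels) across candidate values of k.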
On the other hand, K-Means introduces drawbacks such as poor result repeatability: because of the random centroid initialization, the results of the algorithm can differ from run to run. Apart from the fact that the K-Means algorithm needs manual intervention on some parameters (e.g. n_clusters needs to be optimized, adjusted and reassessed a few times, as may max_iter and init), the algorithm creates spherical clusters that cover the whole dataset, without the possibility of excluding outliers or certain sample groups.

In summary, the K-Means advantages and disadvantages are depicted in "Table 5" below:

Table 5. K-Means (pros and cons)

Advantages:
- Very simple and flexible; identifies unknown groups of data in complex data sets.
- If the number of variables is huge, K-Means is usually computationally faster than hierarchical clustering, provided k is kept small.
- Optimal for certain criteria and suitable for large datasets.
- Efficient at segmenting large data sets depending on the shape of the clusters; works well with hyper-spherical clusters.
- Compared to hierarchical algorithms, K-Means produces tighter clusters, especially globular ones.
- The cost of K-Means segmentation is linear in the number of data objects.

Disadvantages:
- Difficult to predict the K value, and it does not work well with clusters of different size and density.
- Needs an initial K (number of clusters) and has long computational time; when dealing with a large dataset, a dendrogram-based technique can overload the computer because of the computational and RAM load.
- Does not allow the development of an optimal set of clusters; for effective results, the clusters have to be decided on beforehand.
- Lacks consistency: a random choice of initial cluster patterns yields different clustering results, and the algorithm can be applied to numerical data only.
- Produces clusters of uniform size even when the input data has groups of different sizes, and it is very sensitive to scale: rescaling the dataset via normalization or standardization will change the final results.
- Does not generalize to clusters of different shapes and sizes, for instance elliptical clusters.
PAM (K-Medoids) Algorithm. – The Partitioning Around Medoids (PAM) algorithm was introduced by Kaufman and Rousseeuw and is based on (k) representative objects, named medoids, selected among the objects of the dataset [34]. In k-medoids clustering, each cluster is represented by one of the data points in the cluster; these points are named cluster medoids. The term medoid refers to the object within a cluster whose average dissimilarity to all the other members of the cluster is minimal; it corresponds to the most centrally located point in the cluster. Objects are tentatively defined as medoids and are placed into a set (S) of selected objects. If (O) is the set of all objects, then U = O − S is the set of unselected objects. The aim of the algorithm is to minimize the average dissimilarity of objects to their closest selected object, in other words to find the most centrally located objects within the clusters. K-Medoids can be considered a robust alternative to k-means clustering, meaning that the algorithm is less sensitive to noise and outliers compared to k-means, because it uses medoids as cluster centers instead of the means used by the k-means method. The k-medoids algorithm requires the user to specify (k), the number of clusters to be generated, and the silhouette method is a convenient approach for determining the optimal number of clusters. The complexity of k-medoids is O(N²KT), where (N) is the number of samples, (T) is the number of iterations and (K) is the number of clusters; this makes it more suitable for smaller datasets compared to k-means, whose complexity is O(NKT).
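A compact NumPy sketch of the medoid idea described above (a simplified PAM-style alternation of assignment and medoid update using Manhattan dissimilarities; the data and parameters are illustrative assumptions, not the reference implementation):

import numpy as np

def k_medoids(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Pairwise Manhattan dissimilarities between all objects
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    medoids = rng.choice(len(X), size=k, replace=False)     # initial medoids
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)            # assign to closest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            # new medoid = member with minimal total dissimilarity to its cluster
            new_medoids[c] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

X = np.vstack([np.random.default_rng(1).normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])
medoids, labels = k_medoids(X, k=3)
print(X[medoids])   # the three most centrally located objects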
The advantages and disadvantages of the K-Medoids method are presented below in "Table 6".

Table 6. K-Medoids (pros and cons)

Advantages:
- K-Medoids can be more robust than k-means in the presence of noise and outliers.
- K-Medoids is efficient for small datasets, although it does not scale well to large datasets.
- K-Medoids is more flexible, as it can use any similarity measure.

Disadvantages:
- K-Medoids is not suitable for clustering non-spherical (arbitrarily shaped) groups of objects.
- K-Medoids may produce different results for different runs on the same dataset, because the first k medoids are chosen randomly.
- In k-medoids, the value (k) (the number of clusters) has to be specified in advance.

CLARA Algorithm. – The Clustering Large Applications (CLARA) algorithm is an extension of the k-medoids (PAM) method for data comprising a large number of objects; it reduces computing time and RAM requirements by using a sampling approach.

In the CLARA concept, instead of finding medoids for the entire data set, the algorithm considers a small sample of the data with a fixed size and applies the PAM algorithm to it to generate an optimal set of medoids for the sample. The algorithm repeats the sampling and clustering processes a pre-specified number of times in order to minimize the sampling bias. The outcome of this iteration is the set of medoids with the minimal cost.

4.1.2 Hierarchical Clustering Method
Hierarchical clustering algorithms can be agglomerative (bottom-up approach) or divisive (also called top-down approach), and they group the clusters based on distance metrics.

In agglomerative clustering, each data point initially acts as a cluster, and pairs of clusters are then successively merged one by one until all clusters have been merged into one big cluster containing all objects. The result is a tree-based representation of the objects, named a dendrogram.

Divisive clustering is the opposite of agglomerative clustering: it starts with all points in one cluster and divides them to create more clusters.

A key step in hierarchical clustering is selecting a distance measure, such as the Manhattan distance, which is equal to the sum of the absolute differences for each variable. A more common measure is the Euclidean distance, computed by squaring the difference for each variable, summing the squares, and taking the square root of that sum [35].
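A brief SciPy sketch of agglomerative clustering that builds the merge tree from Euclidean distances and then cuts it into a chosen number of clusters (the dataset and parameter values are illustrative assumptions):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.5, (30, 2)),
               rng.normal((4, 4), 0.5, (30, 2))])

# Pairwise Euclidean distances (use metric="cityblock" for Manhattan)
distances = pdist(X, metric="euclidean")

# Bottom-up merging; "average" linkage repeatedly merges the two closest clusters
Z = linkage(distances, method="average")

labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(np.bincount(labels))                        # cluster sizes
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree (requires matplotlib)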
The advantages and disadvantages of the Hierarchical Clustering Method are presented below in "Table 7".

Table 7. Hierarchical Clustering Method (pros and cons)

Advantages:
- Fast computation, and there is no need to pre-define the number of clusters (k).
- Embedded flexibility regarding the level of granularity.
- Very well suited to problems involving point linkages.
- Accepts any valid measure of distance.
- Good for data visualization, providing the hierarchical relation between clusters.

Disadvantages:
- Hard to define the levels of the clusters; sensitive to noise and outliers.
- Rigid: it cannot correct erroneous decisions made earlier.
- No ability to make corrections once a splitting/merging decision has been taken.
- Lack of interpretability in terms of cluster descriptors.
- It cannot perform well on very large databases.

Some typical examples of Hierarchical Clustering algorithms are the following:

CURE Algorithm. – Clustering Using REpresentatives (CURE) is an efficient agglomerative hierarchical clustering algorithm suitable for large datasets that adopts a balance between the centroid-based and all-point extremes. It starts with single-point clusters and keeps merging clusters until the desired number of clusters is formed. Instead of using a single centroid, as most data mining algorithms do, the CURE algorithm uses a set of well-defined representative points, so that it can handle the clusters efficiently and eliminate the outliers. Compared with K-Means clustering, it is more robust to outliers and capable of identifying clusters with non-spherical shapes and size variances, and it can handle large datasets by combining random sampling and partitioning. Contrary to K-Means, which deals with data points in spherical datasets, CURE supports non-spherically shaped clusters and uses random sampling and partitioning to reliably find clusters of arbitrary shape and size, with the disadvantage that it cannot handle differing densities.

ROCK Algorithm. – The RObust Clustering using linKs (ROCK) algorithm is a robust agglomerative hierarchical clustering algorithm based on the notion of links. It is suitable for handling large datasets and is most suitable for clustering data with Boolean and categorical attributes. In this algorithm, cluster similarity is based on the number of points from different clusters that have neighbors in common; the concept of links is used to measure the similarity/proximity between a pair of data points, and the ROCK algorithm employs links rather than distances when merging clusters. ROCK performs well on real and synthetic categorical datasets, and respectably on time-series data compared to traditional algorithms.

CHAMELEON Algorithm. – CHAMELEON is an agglomerative hierarchical clustering algorithm that uses dynamic modeling, in which the similarity of two clusters is measured on the basis of a dynamic model. It adapts to the characteristics of the data set to find the natural clusters; the main properties used are the relative closeness and relative inter-connectivity of the clusters. Two clusters are merged only if the inter-connectivity and closeness (proximity) between the two clusters are high relative to the internal inter-connectivity of the clusters and the closeness of items within the clusters. CHAMELEON works as a two-phase algorithm: the first phase uses a graph-partitioning algorithm to cluster the data into a large number of small sub-clusters, and the second phase uses an agglomerative hierarchical clustering algorithm to find the actual clusters by iteratively merging sub-clusters based on their similarity. The key advantage of the CHAMELEON algorithm is that it determines the pair of most similar sub-clusters by considering both the inter-connectivity and the closeness of the clusters. One of its areas of application is spatial datasets.
4.1.3 Density Based Clustering Method
An interesting property of density-based clustering is that these algorithms do not assume clusters to have a particular shape. In this method the clusters are created based on the density of the data points represented in the data space; that is, a density-based clustering algorithm considers a cluster to be a dense area separated by sparse areas in the data space. The data points in the sparse regions are considered noise or outliers, and the method creates clusters of arbitrary shape. Partition-based and hierarchical clustering techniques are highly efficient with normally shaped clusters, but when it comes to arbitrarily shaped clusters or detecting outliers, density-based techniques are more efficient; in particular, they are very efficient at finding high-density regions and outliers.

The advantages and disadvantages of the Density Based Clustering Method are presented below in "Table 8".

Table 8. Density Based Clustering Method (pros and cons)

Advantages:
- There is no need to specify the number of clusters a-priori.
- Ability to identify noise data while clustering and to find arbitrarily shaped clusters.
- Works well in the presence of noise in the case of OPTICS, but less well in the case of DBSCAN.

Disadvantages:
- Cannot perform well and is not suitable when there are large differences in densities.
- Not suitable for high-dimensional data, for both DBSCAN and OPTICS.
- Sensitive to the density parameters, which should be selected carefully.

Some typical examples of Density Based Clustering algorithms are the following:

DBSCAN Algorithm. – The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is particularly suited to dealing with large, noisy datasets and is able to identify clusters with different sizes and shapes. DBSCAN is the most well-known density-based clustering algorithm, first introduced in 1996 by Ester et al. [36]. Unlike k-means, DBSCAN does not require the number of clusters as a parameter; it infers the number of clusters from the data, and it can discover clusters of arbitrary shape. The DBSCAN algorithm is among the fastest clustering methods, provided that there is a very clear search distance to use. Its advantages can be summarized as follows: DBSCAN does not require a-priori specification of the number of clusters, is able to identify noise data while clustering, and can find clusters of arbitrary size and shape. Its disadvantages can be summarized as follows: DBSCAN fails in the case of clusters of varying density and in the case of "neck"-type datasets and, moreover, does not work well with high-dimensional data.
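A minimal Scikit-Learn sketch of DBSCAN on non-spherical data (the two-moons dataset and the eps/min_samples values are illustrative assumptions):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-circles: arbitrary-shaped clusters plus some noise
X, _ = make_moons(n_samples=400, noise=0.07, random_state=0)

# eps is the neighborhood radius (search distance), min_samples the density threshold
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise = int(np.sum(db.labels_ == -1))
print(n_clusters, "clusters found,", n_noise, "points labelled as noise (-1)")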
OPTICS Algorithm. – The Ordering Points To Identify the Clustering Structure (OPTICS) algorithm works as an extension of DBSCAN. The main difference is that it does not assign cluster memberships but stores the order in which the points are processed, meaning that for each object it stores the core distance and the reachability distance. The main idea of OPTICS is similar to DBSCAN, but it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters in data of varying density. To do this, the points of the database are ordered so that spatially closest points become neighbors in the ordering. Moreover, for each point a special distance is stored which represents the density that must be accepted for a cluster so that both points belong to the same cluster.

Like DBSCAN, OPTICS requires two parameters: (ε), which describes the maximum distance to consider, and (MinPts), which describes the number of points needed to form a cluster. The key parameter for both DBSCAN and OPTICS is (MinPts), which roughly controls the minimum size of a cluster: if it is set too low, everything becomes a cluster, whereas if it is set too high, at some point there are no clusters anymore, only noise. The OPTICS clustering method requires more memory in order to determine the next data point which is closest to the point currently being processed in terms of reachability distance, and it requires more computational power, because the nearest-neighbor queries are more complicated than the radius queries of DBSCAN. On the other hand, the OPTICS technique does not strictly need to maintain the (ε) parameter and is relatively insensitive to parameter settings.
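A short Scikit-Learn sketch of OPTICS on data with clusters of different densities (the dataset and parameter values are illustrative assumptions); note that only min_samples needs to be set, and the reachability values can be inspected to extract clusters:

import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
# One dense blob and one sparse blob: the varying-density case DBSCAN struggles with
X = np.vstack([rng.normal((0, 0), 0.3, (200, 2)),
               rng.normal((6, 6), 1.5, (100, 2))])

optics = OPTICS(min_samples=10).fit(X)   # no eps has to be fixed in advance

print(set(optics.labels_))                            # cluster ids, -1 marks noise
print(optics.reachability_[optics.ordering_][:10])    # start of the reachability plot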
4.1.4 Model Based Clustering Method
Model-based clustering is a statistical approach to data clustering which assumes that data points are generated according to a certain probability distribution model; the clustering process then adapts all data points to the predefined mathematical models. As a result, the algorithms that fall into this category can automatically identify the number of clusters and the outliers in the data points according to the selected mathematical model. Moreover, the noise and outliers are taken into account while the standard statistics are calculated, yielding robust clustering. In order to form clusters, these clustering methods are classified into two categories: statistical and neural network approaches. In the statistical approach, the model-based algorithms follow probability measures to determine clusters, whereas in the neural network approach the input and output are associated with units carrying weights.

Representative algorithms that fall into this category are as follows:

GMM Algorithm. – The Gaussian Mixture Model (GMM) algorithm is based on a probability model in which the data is decomposed into several components based on the Gaussian probability density function. The GMM results are expressed in terms of probabilities, which are more intuitive and can be used to make predictions in a certain area of interest based on these probabilities. On the other hand, it is necessary to use complete sample information for prediction, and the method loses effectiveness in high-dimensional space, which is considered a disadvantage.
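A minimal Scikit-Learn sketch of a Gaussian mixture model; the probabilistic (soft) assignments are what distinguishes it from K-Means (the dataset and parameter values are illustrative assumptions):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=[0.5, 1.0, 1.5], random_state=7)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

hard_labels = gmm.predict(X)          # most probable component per point
soft_labels = gmm.predict_proba(X)    # membership probabilities per component
print(gmm.means_)                     # fitted component means
print(soft_labels[0])                 # e.g. probabilities close to [0.98, 0.01, 0.01]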
SOM Algorithm. – The Self-Organizing Map (SOM) algorithm is based on a neural network model: the input layer receives the input signals, and the output layer is arranged as a two-dimensional matrix of neuron nodes. The SOM algorithm has the advantage of mapping the data onto a two-dimensional plane, which enables visualization and yields higher-quality clustering results. On the other hand, its disadvantages are that the computational complexity is high and that the result depends to a certain extent on experience-based choices.

4.1.5 Grid Based Clustering Method
In grid-based clustering, the data space is quantized into a grid structure composed of a finite number of cells, divided based on the data characteristics. After partitioning the dataset into cells, the method computes the density of the cells, which helps in identifying the clusters. One of the greatest advantages of these algorithms is the reduction in computational complexity: they are more concerned with the value space surrounding the data points than with the data points themselves. The grid-based clustering method has a faster processing time than the other approaches, depending on the number of cells into which each dimension of the space is quantized. Moreover, it applies to any attribute type and provides flexibility regarding the level of granularity.
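A toy sketch of the grid-based idea (not any specific algorithm from this section): the 2-D value space is quantized into cells, dense cells are kept, and adjacent dense cells are joined into clusters; the data and thresholds are illustrative assumptions:

import numpy as np
from scipy.ndimage import label

rng = np.random.default_rng(3)
X = np.vstack([rng.normal((0, 0), 0.4, (200, 2)),
               rng.normal((4, 4), 0.4, (200, 2)),
               rng.uniform(-2, 6, (40, 2))])          # background noise

# Quantize the value space into a 20x20 grid and count points per cell
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=20)

dense = counts >= 5                     # keep only sufficiently dense cells
clusters, n_clusters = label(dense)     # join adjacent dense cells into clusters
print(n_clusters, "grid clusters found")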
Representative algorithms that fall into this category are as follows:

STING Algorithm. – In the Statistical Information Grid (STING) algorithm, the dataset is divided recursively in a hierarchical manner: each cell is further sub-divided into a number of smaller cells, which capture in turn the statistical measures of the cells. The STING algorithm has high efficiency and low time complexity. On the other hand, the fact that the clustering quality is affected by the granularity of the bottom layer of the grid structure can be considered a disadvantage.

WaveCluster Algorithm. – In this algorithm the data space is represented in the form of wavelets, as an n-dimensional signal that helps to identify the clusters. The parts of the signal with lower frequency and high amplitude indicate that the data points are concentrated there and represent the clusters identified by the algorithm. WaveCluster is a fast multi-resolution algorithm: the high resolution obtains detailed information, and the low resolution obtains contour information. When the processed clusters have no obvious edges the clustering effect is poor, and this can be considered a disadvantage.

CLIQUE Algorithm. – The Clustering in Quest (CLIQUE) algorithm is a density-based and grid-based subspace clustering algorithm. It is grid-based because it discretizes the data space through a grid and estimates the density by counting the number of points in a grid cell; it is density-based because a cluster is a maximal set of connected dense units in a subspace. The CLIQUE algorithm discovers minimal descriptions of the clusters and automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space, using the Apriori principle. CLIQUE is good at handling high-dimensional data and large datasets, but has the disadvantage of low clustering accuracy. Another weakness, as in all grid-based clustering approaches, is that the quality of the results crucially depends on the appropriate choice of the number and width of the partitions and grid cells.

5. CONCLUSION
Structured data accounts for less than 20 percent of all data, whereas a much bigger percentage of all the data in our world is unstructured. From this paper it is clear that not all algorithms are suited to all kinds of datasets. There are different tools, data mining algorithms and methods which are used to analyze datasets and, as a result, the choice of the best algorithm for a particular analytical task is a big challenge for data mining researchers.

Supervised learning algorithms are those for which the class attribute values of the dataset are known before running the algorithm; such datasets are called labelled data or training data. Classification, for example, is a popular data mining technique referred to as supervised learning because an example dataset is used to learn the structure of the groups. Examples of supervised learning algorithms commonly used in data mining are those in the Classification category (Decision Tree Learning, Naive Bayes Classifiers, K-Nearest Neighbor, Support Vector Machine, etc.) and the Regression category (Linear and Logistic Regression, etc.).

In unsupervised learning algorithms, there is no need for users to supervise the model; instead, the model works on its own to discover patterns and information that were previously undetected. Association Rule Learning, for instance, is one of the unsupervised data mining techniques, in which an itemset is defined as a collection of one or more items and is used to discover relationships between variables in datasets. Normally, when discussing unsupervised learning, most researchers focus on clustering. In clustering, the data is often unlabeled: the label of each instance is not known to the clustering algorithm, and this is the main difference between supervised and unsupervised learning. Examples of unsupervised learning algorithms commonly used in data mining are those in the Clustering category (K-Means, density-based, Apriori, etc.).

In the bottom line, data mining techniques such as classification, clustering, prediction and association help to find patterns, produce forecasts and discover knowledge in different business domains, in order to decide upon the future trends that allow businesses to grow.

6. FUTURE WORK
The current review acts as a guideline for data mining researchers, giving an outlook on which algorithms to choose based on their needs and on the given datasets. The next step is to design and deploy a High-Performance Computer (HPC) based on Raspberry Pi 4, to benchmark the efficiency of Beowulf, Hadoop and Spark cluster architectures suited to dealing with Big Data Analytics needs [37], [38]. Following the successful results referred to above, the final goal is to proceed with a comparative study of various clustering algorithms, bringing HPC to Big Data algorithms. For instance, a comparative study between parallel K-Means and K-Medoids using the Message Passing Interface (MPI) and MapReduce in a Hadoop architecture would be a very interesting topic, in order to see the results in terms of computing performance as the compute nodes are increased gradually [38]. Moreover, a survey of parallel clustering algorithms based on a Spark architecture with Raspberry Pi 4 would be another very interesting topic, again observing the computing performance as the compute nodes are increased gradually.

7. ACKNOWLEDGMENTS
My sincere gratitude to Assistant Professor Ioannis S. Barbounakis for the precious suggestions and the knowledge contributed towards the successful completion of this paper.

8. REFERENCES
[1] X. Zhu, B. Song, Y. Ni, Y. Ren, R. Li, (2016). Business Trends in the Digital Era: Evolution of Theories and Applications. Springer.
[2] Laney, D. (2001). 3D Data Management: Controlling Data Volume, Velocity and Variety. META Group Research Note, 6.
[3] McAfee, A. and Brynjolfsson, E. (2012). Big Data: The Management Revolution. Harvard Business Review, 90(10), pp. 60–69.
[4] Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2024. Statista 2020. [Online]. Available: https://www.statista.com/statistics/871513/worldwide-data-created/.
[5] Brands, K. (2014). Big Data and Business Intelligence for Management Accountants. Strategic Finance, 96(6), pp. 64–65.
[6] Gandomi, A. and Haider, M. (2015). Beyond the hype: Big Data concepts, methods, and analysis. International Journal of Information Management, 35(2), pp. 137–144.
[7] Hashem, I.A.T., Yaqoob, I., Anuar, N.B., Mokhtar, S., Gani, A. and Khan, S.U. (2015). The rise of "Big Data" on cloud computing: Review and open research issues. Information Systems, 47(1), pp. 98–115.
[8] Bendler, J., Wagner, S., Brandt, T. and Neumann, D. (2014). Taming uncertainty in Big Data: Evidence from social media in Urban Areas. Business & Information Systems Engineering, 6(5), pp. 279–288.
[9] Ishwarappa, K. and Anuradha, J. (2015). A Brief Introduction on Big Data 5Vs Characteristics and Hadoop Technology. Procedia Computer Science, 48(1), pp. 319–324.
[10] Trupti, A. Kumbhare and Santosh, V. Chobe, (2014). An Overview of Association Rule Mining Algorithms. International Journal of Computer Science and Information Technologies, Vol. 5(1), pp. 927–930.
[11] Sudhir, M. Gorade, Ankit Deo and Pritesh Purohit, (2017). A Study of Some Data Mining Classification Techniques. International Research Journal of Engineering and Technology, Vol. 4, Issue 4, pp. 3112–3115.
[12] J. Han, M. Kamber and J. Pei, (2010). Data Mining Concepts and Techniques (3rd ed.). University of Illinois, Chapter 8, pp. 99–117.
[13] Duda, R.O., Hart, P.E. and Stork, D.G., (2000). Pattern Classification, 2nd ed. New York: John Wiley & Sons.
[14] Rao, R. P. N. and Scherer, R. (2010). Statistical Pattern Recognition and Machine Learning in Brain-Computer Interfaces. In Statistical Signal Processing for Neuroscience and Neurotechnology (1st ed., pp. 335–368). Elsevier B.V.
[15] Auria, Laura and Moro, R. A. (August 1, 2008). Support Vector Machines (SVM) as a Technique for Solvency Analysis. DIW Berlin Discussion Paper No. 811. Available at SSRN: https://ssrn.com/abstract=1424949.
[16] S. Karamizadeh, S. M. Abdullah, M. Halimi, J. Shayan and M. J. Rajabi, (2014). "Advantage and drawback of support vector machine functionality," 2014 International Conference on Computer, Communications, and Control Technology (I4CT), pp. 63–65, doi: 10.1109/I4CT.2014.6914146.
[17] L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone (1984). Classification and Regression Trees. Chapman & Hall, New York, NY.
[18] S. K. Murthy, S. Kasif and S. Salzberg, (1994). A system for induction of oblique decision trees. J. Artif. Int. Res., 2(1):1–32.
[19] J. Quinlan, (1986). Induction of decision trees. Machine Learning, 1(1):81–106.
[20] J. Quinlan, (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
[21] Mean Squared Error (MSE). [Online]. Available: https://www.probabilitycourse.com/chapter9/9_1_5_mean_squared_error_MSE.php
[22] Nova, D. and Estévez, P.A. (2014). A review of learning vector quantization classifiers. Neural Computing and Applications 25, 511–524. https://doi.org/10.1007/s00521-013-1535-3
[23] D. Nova and P. Estevez, (2013). "A Review of Learning Vector Quantization Classifiers," Neural Computing and Applications, vol. 25, pp. 511–524.
[24] A. Priyono, M. Ridwan, A. J. Alias, R. A. O. Rahmat, A. Hassan and M. A. M. Ali, (2012). "Application of LVQ neural network in real-time adaptive traffic signal control," Jurnal Teknologi, vol. 42, no. 1, pp. 29–44.
[25] Y. Freund, (1995). "Boosting a weak learning algorithm by majority," Information and Computation, 121(2):256–285.
[26] Y. Freund and R.E. Schapire, (1999). "A short introduction to boosting," Journal of Japanese Society for Artificial Intelligence, 14(5):771–780.
[27] Huh, Myung-Hoe and Lee, Yonggoo. (2006). "LMS and LTS-type Alternatives to Classical Principal Component Analysis," Communications for Statistical Applications and Methods, 13(2), 233–241. https://doi.org/10.5351/CKSS.2006.13.2.233
[28] R. Agrawal and R. Srikant, (March 1995). "Mining Sequential Patterns," In Proc. of the 11th Int'l Conference on Data Engineering, Taipei, Taiwan.
[29] Fournier-Viger, P., Lin, J.C.W., Kiran, R.U., Koh, Y.S. and Thomas, R. (2017). "A survey of sequential pattern mining," Data Sci. Pattern Recogn., 1(1), 54–77.
[30] Thabet Slimani and Amor Lazzez. (2013). "Sequential Mining: Patterns and Algorithms Analysis," International Journal of Computer and Electronics Research, Volume 2, Issue 5, pp. 639–647.
[31] Mooney, C. H. and Roddick, J. F., (Feb 2013). "Sequential Pattern Mining — Approaches and Algorithms," ACM Computing Surveys, vol. 45, no. 2, pp. 1–39. DOI: 10.1145/2431211.2431218.
[32] Kum, H.-C., Chang, J. H. and Wang, W. (2006). "Sequential Pattern Mining in Multi-Databases via Multiple Alignment," Data Mining and Knowledge Discovery, 12(2–3), 151–180.
[33] S. Anitha Elavaras, (Jan 2011). "A Survey on Partitional Clustering Algorithm," International Journal of Enterprise Computing and Business Systems, Vol. 1, Issue 1.
[34] Kaufman, L. and Rousseeuw, P. J., (1990). Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley.
[35] T. Soni Madhulatha. (April 2012). "An overview on Clustering Methods," IOSR Journal of Engineering, Vol. 2(4), pp. 719–725.
[36] Ester, M., Kriegel, H.P., Sander, J. and Xu, X. (1996). "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," In Proc. KDD.
[37] Dimitrios Papakyriakou, Dimitra Kottou and Ioannis Kostouros. (April 2018). "Benchmarking Raspberry Pi 2 Beowulf Cluster," International Journal of Computer Applications 179(32):21–27.
[38] Dimitrios Papakyriakou. (August 2019). "Benchmarking Raspberry Pi 2 Hadoop Cluster," International Journal of Computer Applications 178(42):37–47.
