
Unit - 2

Association Rule Learning


Association rule learning is an unsupervised learning technique that checks how one data item depends on another and maps those dependencies so they can be exploited, for example to make retail decisions more profitable. It tries to find interesting relations or associations among the variables of a dataset, using rule-based measures to discover how the variables in a database relate to one another.

Association rule learning is one of the important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, and more. Market basket analysis is a technique used by large retailers to discover associations between items. We can understand it with the example of a supermarket, where products that are frequently purchased together are placed close to each other.

For example, if a customer buys bread, they are likely to also buy butter, eggs, or milk, so these products are stored on the same shelf or nearby.

Association rule learning can be divided into three types of algorithms:

1. Apriori

2. Eclat

3. F-P Growth Algorithm

We will look at these algorithms in more detail below.

How does Association Rule Learning work?


Association rule learning works on the concept of if-then rules, such as "if A, then B".

Here the "if" element is called the antecedent, and the "then" element is called the consequent. A relationship that associates exactly two items is known as a rule of single cardinality; as the number of items in a rule grows, the cardinality grows accordingly. To measure the strength of associations among thousands of data items, several metrics are used. These metrics are given below:

o Support

o Confidence

o Lift

Let's understand each of them:

Support

Support is the frequency of an itemset, i.e., how frequently it appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X:

Support(X) = freq(X) / |T|

where freq(X) is the number of transactions containing X and |T| is the total number of transactions.

Confidence

Confidence indicates how often the rule has been found to be true, i.e., how often the items X and Y occur together in the dataset given that X has already occurred. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:

Confidence(X → Y) = Support(X ∪ Y) / Support(X)

Lift

Lift is the strength of a rule. It is the ratio of the observed support to the support that would be expected if X and Y were independent of each other:

Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

Lift has three possible ranges of values:

o If Lift = 1: the occurrences of the antecedent and the consequent are independent of each other.

o If Lift > 1: the two itemsets are positively dependent on each other; the greater the lift, the stronger the association.

o If Lift < 1: one item is a substitute for the other, which means one item has a negative effect on the occurrence of the other.

A small worked example that computes these three metrics on a toy set of transactions is given below.
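The sketch below is a minimal Python example with made-up transactions and item names; it computes support, confidence, and lift for a rule such as {bread} → {butter}.

```python
# Minimal sketch (hypothetical transactions) of the three rule metrics.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of (X and Y) divided by the support of X."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Observed co-occurrence relative to what independence would predict."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

print(support({"bread"}, transactions))                 # 0.8
print(confidence({"bread"}, {"butter"}, transactions))  # 0.75
print(lift({"bread"}, {"butter"}, transactions))        # 1.25 -> positive association
```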

Types of Association Rule Learning

Association rule learning can be divided into three algorithms:

Apriori Algorithm

This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases that contain transactions. The algorithm uses a breadth-first search and a hash tree to count candidate itemsets efficiently.

It is mainly used for market basket analysis and helps to understand which products are likely to be bought together. It can also be used in the healthcare field, for example to find adverse drug reactions for patients.
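As an illustration, here is a simplified Apriori sketch in plain Python on a toy, in-memory transaction list (hypothetical items); a real implementation would add hash-tree counting and candidate pruning for efficiency.

```python
from itertools import combinations

def apriori(transactions, min_support=0.4):
    """Level-wise (breadth-first) search for frequent itemsets."""
    n = len(transactions)
    frequent = {}                                   # frozenset -> support
    current = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    k = 1
    while current:
        # Count the support of each candidate k-itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(survivors)
        # Join step: build (k+1)-item candidates from the surviving k-itemsets.
        current = list({a | b for a, b in combinations(survivors, 2) if len(a | b) == k + 1})
        k += 1
    return frequent

transactions = [{"bread", "butter"}, {"bread", "milk"},
                {"bread", "butter", "milk"}, {"eggs"}]
for itemset, sup in apriori(transactions).items():
    print(set(itemset), round(sup, 2))
```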

Eclat Algorithm

The Eclat algorithm stands for Equivalence Class Transformation. This algorithm uses a depth-first search technique to find frequent itemsets in a transaction database, and it generally executes faster than the Apriori algorithm.

F-P Growth Algorithm

The FP-Growth algorithm stands for Frequent Pattern Growth, and it is an improved version of the Apriori algorithm. It represents the database in the form of a tree structure known as a frequent pattern tree (FP-tree). The purpose of this tree is to extract the most frequent patterns without repeatedly scanning the database.
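If a library implementation is preferred, the sketch below assumes the mlxtend package is installed; its fpgrowth and association_rules helpers operate on a one-hot encoded DataFrame (the call pattern follows mlxtend's documented usage and may differ slightly between versions).

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [["bread", "butter"], ["bread", "milk"],
                ["bread", "butter", "milk"], ["eggs", "milk"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine frequent itemsets with FP-Growth, then derive rules filtered by confidence.
itemsets = fpgrowth(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```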

Applications of Association Rule Learning

It has various applications in machine learning and data mining. Below are some popular applications of
association rule learning:

o Market Basket Analysis: It is one of the popular examples and applications of association rule
mining. This technique is commonly used by big retailers to determine the association between
items.

o Medical Diagnosis: With the help of association rules, patients can be diagnosed and treated more easily, as the rules help in identifying the probability of illness for a particular disease.

o Protein Sequence: The association rules help in determining the synthesis of artificial Proteins.

o It is also used for catalog design, loss-leader analysis, and many other applications.

Multilevel Association Rule in data mining

In this section, we will discuss the concepts of Multilevel Association Rule mining and its algorithms, applications, and challenges.

Data mining is the process of extracting hidden patterns from large data sets. One of the fundamental techniques in data mining is association rule mining, which is used to identify relationships between items in a dataset. These relationships can then be used to make predictions about future occurrences of those items.

Multilevel Association Rule mining is an extension of Association Rule mining. Multilevel Association
Rule mining is a powerful tool that can be used to discover patterns and trends.

Association Rule in data mining

Association rule mining is used to discover relationships between items in a dataset. An association rule
is a statement of the form "If A, then B," where A and B are sets of items. The strength of an association
rule is measured using two measures: support and confidence. Support measures the frequency of the
occurrence of the items in the rule, and confidence measures the reliability of the rule.

Apriori algorithm is a popular algorithm for mining association rules. It is an iterative algorithm that
works by generating candidate itemsets and pruning those that do not meet the support and confidence
thresholds.

Multilevel Association Rule in data mining

Multilevel Association Rule mining is a technique that extends Association Rule mining to discover
relationships between items at different levels of granularity. Multilevel Association Rule mining can be
classified into two types: multi-dimensional Association Rule and multi-level Association Rule.

Multi-dimensional Association Rule mining

This is used to find relationships between items in different dimensions of a dataset. For example, in a
sales dataset, multi-dimensional Association Rule mining can be used to find relationships between
products, regions, and time.

Multi-level Association Rule mining


This is used to find relationships between items at different levels of granularity. For example, in a retail
dataset, multi-level Association Rule mining can be used to find relationships between individual items
and categories of items.

Need for Multidimensional Rules

Multidimensional rule mining is important because data at lower levels may not exhibit any meaningful patterns on its own, yet it can contain valuable insights. The goal is to find such hidden information within and across levels of abstraction.

Algorithms for Multilevel Association Rule Mining

There are several algorithms for Multilevel Association Rule mining, including partition-based,
agglomerative, and hybrid approaches.

Partition-based algorithms divide the data into partitions based on some criteria, such as the level of
granularity, and then mine Association Rules within each partition. Agglomerative algorithms start with
the smallest itemsets and then gradually merge them into larger itemsets, until a set of rules is obtained.
Hybrid algorithms combine the strengths of partition-based and agglomerative approaches.

Approaches to Multilevel Association rule mining

Multilevel Association Rule mining has different approaches to finding relationships between items at
different levels of granularity. There are three approaches: Uniform Support, Reduced Support, and
Group-based Support. These are explained briefly below.

Uniform Support (using uniform minimum support for all levels)

In this approach, only one minimum support threshold is used for all levels. It is simple, but it may miss meaningful associations at lower levels of abstraction.

Reduced Support (using reduced minimum support at lower levels)

In this approach, the minimum support threshold is lowered at lower levels to avoid missing important associations. It can use different search strategies, such as level-by-level independence and level-cross filtering by a single item or by a k-itemset.

Group-based Support (using item or group based support)

In this approach, the user or a domain expert sets the support and confidence thresholds for a specific group or product category.

For example, if an expert wants to study the purchase patterns of laptops and of clothes in the non-electronics category, a low support threshold can be set for that group so that those items' purchase patterns receive enough attention.
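As a rough illustration of level-dependent thresholds, the sketch below uses a hypothetical two-level item hierarchy and made-up transactions: single-item support is counted at the category level and at the item level, with a lower minimum support applied at the lower level.

```python
# Hypothetical two-level hierarchy: item -> category.
hierarchy = {"laptop": "electronics", "phone": "electronics",
             "shirt": "clothing", "jeans": "clothing"}

transactions = [
    {"laptop", "shirt"}, {"phone", "shirt"}, {"laptop", "jeans"},
    {"phone"}, {"shirt", "jeans"},
]

def level_support(transactions, level):
    """Single-item support counted either per item or per category."""
    counts = {}
    for t in transactions:
        labels = {hierarchy[i] for i in t} if level == "category" else set(t)
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
    return {label: c / len(transactions) for label, c in counts.items()}

# Reduced support: a stricter threshold at the higher (category) level,
# a looser one at the lower (item) level.
thresholds = {"category": 0.6, "item": 0.3}
for level in ("category", "item"):
    frequent = {k: v for k, v in level_support(transactions, level).items()
                if v >= thresholds[level]}
    print(level, frequent)
```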

Applications of Multilevel Association Rule in data mining

Some applications are as follows:

Retail Sales Analysis

Multilevel Association Rule mining helps retailers gain insights into customer buying behavior and
preferences, optimize product placement and pricing, and improve supply chain management.

Healthcare Management
Multilevel Association Rule mining helps healthcare providers identify patterns in patient behavior,
diagnose diseases, identify high-risk patients, and optimize treatment plans.

Fraud Detection

Multilevel Association Rule mining helps companies identify fraudulent patterns, detect anomalies, and
prevent fraud in various industries such as finance, insurance, and telecommunications.

Web Usage Mining

Multilevel Association Rule mining helps web-based companies gain insights into user preferences,
optimize website design and layout, and personalize content for individual users by analyzing data at
different levels of abstraction.

Social Network Analysis

Multilevel Association Rule mining helps social network providers identify influential users, detect
communities, and optimize network structure and design by analyzing social network data at different
levels of abstraction.

Challenges in Multilevel Association Rule Mining

Multilevel Association Rule mining poses several challenges, including high dimensionality, large data set
size, and scalability issues.

High dimensionality

It is the problem of dealing with data sets that have a large number of attributes.

Large data set size

It is the problem of dealing with data sets that have a large number of records.

Scalability

It is the problem of dealing with data sets that are too large to fit into memory.

Radial Basis Function

RBF
In a mathematical context, an RBF is a real-valued function whose value depends only on the distance between an input point and a fixed reference point.
In a network context, an RBF network is an artificial neural network in which the radial basis function is used as the activation function of the neurons.

2.1. Definition
An RBF is a mathematical function, say φ(x) = f(‖x − c‖), that measures the distance between an input point (or vector) x and a given fixed point (or vector) of interest, the center or reference point c.
Here, ‖·‖ can be any distance function, such as the Euclidean distance. Further, the choice of f depends on the specific application and the desired set of properties. For the case of vectors, we call this function a radial basis kernel (RBK).
We use RBF in mathematics, signal processing, computer vision, and machine
learning. In these, we use radial functions to approximate those functions that
either lack a closed form or are too complex to solve. In most cases, this
approximation function is a generic neural network.

2.2. Types
An RBF measures the similarity between a given data point and an agreed reference point. We can then use this similarity score to take specific actions, such as activating a neuron in an RBF network. Here, the similarity decreases as the distance between the data point and the reference point grows.
Depending upon the function definition, we can have different RBFs. One of the most commonly used is the Gaussian RBF:

φ(x) = exp(−γ ‖x − c‖²)

Here, the parameter γ controls the variance (spread) of the Gaussian curve. A smaller value of γ results in a broader curve, while a larger value of γ leads to a narrower curve.
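A minimal NumPy sketch of the Gaussian RBF defined above, with γ (gamma) passed in as the spread parameter:

```python
import numpy as np

def gaussian_rbf(x, center, gamma=1.0):
    """Gaussian RBF: exp(-gamma * ||x - center||^2). Smaller gamma -> broader curve."""
    x, center = np.asarray(x, dtype=float), np.asarray(center, dtype=float)
    return np.exp(-gamma * np.sum((x - center) ** 2))

print(gaussian_rbf([1.0, 2.0], [1.0, 2.0]))        # 1.0 at the center
print(gaussian_rbf([2.0, 2.0], [1.0, 2.0]))        # ~0.368 one unit away
print(gaussian_rbf([2.0, 2.0], [1.0, 2.0], 0.1))   # ~0.905: broader curve for small gamma
```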
Other types of RBFs include the Multiquadric, Inverse Multiquadric, and
Thin Plate Splines. Each RBF has its characteristics and can be suitable for
specific applications or tasks. In the realm of neural networks, we often use
RBFs as activation functions in the network’s hidden layer.
To summarize this section, RBF represents a basis function that measures
the similarity between the input data and a reference point, influencing the
network’s output.

RBF Neural Networks Architecture


In this section, we explore RBF neural networks.

3.1. Intuitive Understanding


Radial basis function (RBF) networks are an artificial neural network (ANN)
type that uses the radial basis function as its activation function. We
commonly use RBF networks for function approximation, classification,
time series prediction, and clustering tasks.

3.2. Structure
Now, we move ahead and describe the typical RBF network structure.
The RBF network consists of the following three layers:
1. the input layer (usually one)
2. the hidden layer (strictly one)
3. the output layer (usually one)

Training
The RBF network training process involves two main steps:

1. Initialize the network

2. Apply gradient descent to learn the model parameters (weights)

The initialization phase involves applying k-means clustering on a subset of the training data to determine the centroids of the hidden neurons. Then, in the learning phase, we determine the weights connecting the hidden layer to the output layer using a gradient descent algorithm. We use the mean squared error (MSE) loss to measure the error between the model output and the ground-truth output.
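The following is a minimal sketch of this two-step procedure on a toy 1-D regression problem; the hyperparameters (10 centers, γ = 1, learning rate 0.1) are assumed for illustration, and the output weights could equally be obtained with a least-squares solve.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])                               # target function to approximate

n_centers, gamma, lr, epochs = 10, 1.0, 0.1, 500

# Step 1 (initialization): choose the hidden-neuron centers with k-means.
centers = KMeans(n_clusters=n_centers, n_init=10, random_state=0).fit(X).cluster_centers_

def hidden_activations(X):
    # Gaussian RBF activation of every hidden neuron for every sample.
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

# Step 2 (learning): gradient descent on the output weights with MSE loss.
H = hidden_activations(X)                         # shape (n_samples, n_centers)
w = np.zeros(n_centers)
for _ in range(epochs):
    residual = H @ w - y
    w -= lr * (2 * H.T @ residual / len(y))       # gradient of the MSE w.r.t. w

print("final MSE:", np.mean((H @ w - y) ** 2))
```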

Splines
In the previous lecture, we discussed linear regression, which fits a straight line relating the independent and dependent variables. A straight line, however, cannot always capture the relationship, so polynomial regression is used to model nonlinear functions. We also saw that the more polynomial terms we add, the more prone the model becomes to overfitting.

To fit the complex shapes found in real data, we need a way to design flexible functions without overfitting. For that, we use a nonlinear regression method that combines simple linear and nonlinear pieces to fit the data points, which is termed regression splines.


What are Splines?
Regression splines were introduced to overcome the disadvantages of linear and polynomial regression. In linear regression the dataset is treated as a single whole, but in spline regression we split the dataset into several parts, called bins. The points at which we divide the data are called knots, and we fit a different function in each bin. These separate functions used in the different bins are called piecewise functions.

Splines are a way to replace a single high-degree polynomial with smaller piecewise polynomial functions. For each piece, we fit a separate model and then connect all the pieces together.

Why Splines?
We have already discussed that linear regression fits only a straight line, which is why polynomial regression was introduced, but polynomial regression can cause the model to overfit. The need for a model that combines the good properties of both linear and polynomial regression motivated spline regression. While this sounds complicated, by fitting smaller polynomials on each section we decrease the risk of overfitting.

How to break up a polynomial


Because a spline breaks a polynomial up into smaller pieces, we need to determine where to break it up. The point where this division occurs is called a knot.

If a curve is split at points P_1, …, P_k, each P_x represents a knot. The knots at the ends of the curve are known as boundary knots, while the knots within the curve are known as internal knots.

Selecting number and location of knots


While we can visually inspect where to place these knots, we need systematic methods to select their number and locations.

Some strategies include:

• Placing knots in highly variable regions
• Specifying the degrees of freedom and placing knots uniformly throughout the data
• Cross-validation

Types of Splines
The mathematics behind splines can seem complicated without some calculus and the properties of piecewise functions, so we will discuss the intuition behind these methods.

If you are interested in the specific mathematics underpinning splines, we refer you to The Elements of Statistical Learning, 2nd Edition by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. This intermediate-to-advanced textbook is an essential read for aspiring data scientists.

Cubic Splines

Cubic splines fit a cubic polynomial in each bin and require that the different polynomial pieces connect smoothly: the first and second derivatives of the piecewise functions must be continuous at the knots.
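As a small sketch on toy data, SciPy's CubicSpline can be used to fit a cubic spline and evaluate its first and second derivatives, which remain continuous across the knots:

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])          # knot locations (toy data)
y = np.sin(x)                                     # values at the knots

spline = CubicSpline(x, y)

knot = 2.0                                        # an internal knot
print(spline(knot), spline(knot, 1), spline(knot, 2))  # value, 1st and 2nd derivative
print(np.allclose(spline(x), y))                  # True: the spline passes through the knots
```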

Natural Splines
Polynomial functions and other kinds of splines tend to have bad fits near the ends of the functions. This variability can have huge consequences, particularly in forecasting. Natural splines resolve this issue by forcing the function to be linear after the boundary knots.

Smoothing Splines

Finally, we can consider the regularized version of a spline: the smoothing spline. Its cost function is penalized when the fitted curve becomes too rough, trading off closeness to the data against smoothness. Smoothing splines are needed when noisy data would otherwise make an interpolating fit too wiggly to give an adequate model.
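A short sketch of a smoothing spline on noisy toy data, using SciPy's UnivariateSpline; the smoothing factor s penalizes roughness, and s = 0 reproduces an interpolating spline:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)    # noisy observations

interpolating = UnivariateSpline(x, y, s=0)            # s = 0: passes through every point
smoothing = UnivariateSpline(x, y, s=len(x) * 0.3**2)  # larger s: smoother, regularized fit

print(interpolating.get_residual())   # ~0 by construction
print(smoothing.get_residual())       # positive: the fit trades accuracy for smoothness
```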


The curse of dimensionality

The curse of dimensionality refers to the counterintuitive phenomena that arise when we try to analyze data in high-dimensional spaces. Let us understand this peculiarity with an example: suppose we are building several machine learning models to analyze the performance of a Formula One (F1) driver. Consider the following cases:


i) Model_1 consists of only two features say the circuit name and the country
name.

ii) Model_2 consists of 4 features say weather and max speed of the car including
the above two.

iii) Model_3 consists of 8 features say driver’s experience, number of wins, car
condition, and driver’s physical fitness including all the above features.

iv) Model_4 consists of 16 features say driver's age, latitude, longitude, driver's height, hair color, car color, the car company, and driver's marital status including all the above features.

v) Model_5 consists of 32 features.

vi) Model_6 consists of 64 features.

vii) Model_7 consists of 128 features.

viii) Model_8 consists of 256 features.

ix) Model_9 consists of 512 features.

x) Model_10 consists of 1024 features.

Assuming the training data remains constant, it is observed that as the number of features increases, the accuracy tends to increase up to a certain threshold and then starts to decrease. From the above example, the accuracy of Model_1 < accuracy of Model_2 < accuracy of Model_3, but this trend does not hold for the models having more than 8 features. You might wonder: if we are providing extra information for the model to learn from, why does the performance start to degrade?

If we think about it logically, some of the features provided to Model_4 do not actually contribute anything towards analyzing the performance of the F1 driver. For example, the driver's height, hair color, car color, car company, and marital status give the model useless information; the model gets confused by all this extra information, and the accuracy starts to go down.

The term "curse of dimensionality" was first coined by Richard E. Bellman when considering problems in dynamic programming.

Curse of dimensionality in various domains

There are several domains where we can see the effect of this phenomenon. Machine Learning is one such domain; others include numerical analysis, sampling, combinatorics, data mining, and databases. As the title suggests, we will focus only on its effect in Machine Learning.

What problems does curse of dimensionality cause?

The curse of dimensionality refers to the difficulties that arise when analyzing
or modeling data with many dimensions. These problems can be summarized in
the following points:

• Data Sparsity: Data points become increasingly spread out, making it hard
to find patterns or relationships.
• Computational Complexity: The computational burden of algorithms
increases exponentially.
• Overfitting: Models become more likely to memorize the training data
without generalizing well.
• Distortion of Distance Metrics: Traditional distance metrics become less
reliable in measuring proximity.
• Visualization Challenges: Projecting high-dimensional data onto lower
dimensions leads to loss of information.
• Data Preprocessing: Identifying relevant features and reducing
dimensionality is crucial for effective analysis.
• Algorithmic Efficiency: Algorithms need to be scalable and efficient to
handle the complexity of high-dimensional spaces.
• Domain-Specific Challenges: Each domain faces unique challenges in
high-dimensional spaces, requiring tailored approaches.
• Interpretability Issues: Understanding the decision-making process of
high-dimensional models becomes increasingly difficult.
• Data Storage Requirements: Efficient data storage and retrieval
strategies are essential for managing large volumes of high-dimensional
data

How to overcome its effect

This was a general overview of the curse of dimensionality. Now we will get slightly more technical to understand it completely. In ML, it can be described as follows: as the number of features or dimensions d grows, the amount of data we require to generalize accurately grows exponentially. As the dimensions increase, the data becomes sparse, and as the data becomes sparse it becomes hard to generalize the model. To generalize the model better, more training data is required; alternatively, the number of dimensions can be reduced through feature selection or dimensionality-reduction techniques.
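The sparsity effect can be seen directly in a small experiment: as the dimension grows, the nearest and farthest neighbours of a random query point become almost equally far away (distance concentration), which is why distance-based reasoning degrades. A quick NumPy sketch with assumed sample sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.uniform(size=(500, d))          # 500 random points in the unit hypercube
    query = rng.uniform(size=d)
    dist = np.linalg.norm(points - query, axis=1)
    gap = (dist.max() - dist.min()) / dist.min() # relative spread of distances shrinks with d
    print(f"d={d:5d}  relative gap (max-min)/min = {gap:.2f}")
```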
Interpolation methods
Interpolation is a method of fitting data points to represent the value of a function. It has many applications in engineering and science, where it is used to construct new data points within the range of a discrete set of known data points, or to determine a formula for a function that passes through a given set of points (x, y).

Interpolation Meaning
Interpolation is a method of deriving a simple function from a given discrete data set such that the function passes through the provided data points. This helps to determine values in between the given data points. The method is needed whenever we must compute the value of a function for an intermediate value of the independent variable. In short, interpolation is the process of determining unknown values that lie in between known data points. It is often used to predict unknown values for geographically related data such as noise level, rainfall, elevation, and so on.

Interpolation Methods

Common interpolation methods include:

Nearest-neighbor interpolation
Linear interpolation

A small sketch of both methods is given below.
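The sketch applies both methods to toy data, using NumPy's np.interp for linear interpolation and SciPy's interp1d with kind="nearest" for nearest-neighbour interpolation:

```python
import numpy as np
from scipy.interpolate import interp1d

# Known data points (toy example).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 2.0, 1.0, 3.0])

nearest = interp1d(x, y, kind="nearest")   # nearest-neighbour interpolation
linear_value = np.interp(1.5, x, y)        # linear interpolation at x = 1.5

print(nearest(1.4))    # 2.0 -> value of the nearest known point (x = 1.0)
print(linear_value)    # 1.5 -> halfway between y(1) = 2.0 and y(2) = 1.0
```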
Support Vector Machine (SVM)

Support Vector Machine (SVM) is a powerful and versatile supervised machine learning algorithm used
for classification and regression tasks. It's particularly effective in cases where the data is separable into
distinct classes or has a clear margin of separation. SVM aims to find the optimal hyperplane that best
separates the data points of different classes while maximizing the margin between them.

Here's a detailed discussion about Support Vector Machine:

1. Basic Concept:

• At its core, SVM works by finding the hyperplane that maximizes the margin between
the classes. A hyperplane in an n-dimensional space is a flat affine subspace of
dimension n-1.

• In simple terms, for a 2-dimensional dataset, the hyperplane is a line, and for a 3-
dimensional dataset, it's a plane. In higher dimensions, it's a hyperplane.

• SVM operates by mapping input data into a higher-dimensional feature space where the
data points can be separated by a hyperplane. This is done using a kernel function that
computes the inner products of the feature vectors in the higher-dimensional space.
2. Margin and Support Vectors:

• The margin is the distance between the hyperplane and the nearest data point from
each class, also known as support vectors.

• Support vectors are the data points closest to the hyperplane and play a crucial role in
determining its orientation.

• SVM aims to find the hyperplane with the maximum margin, which is the one that is
farthest from the support vectors.

3. Kernel Trick:

• In many cases, the data may not be linearly separable in the original feature space. The
kernel trick allows SVM to implicitly map the input data into a higher-dimensional space
without explicitly computing the transformation.

• Popular kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid kernels. These kernels enable SVM to handle complex decision boundaries and nonlinear relationships in the data. A small sketch of an RBF kernel computation follows this list.
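The sketch below computes a pairwise RBF kernel matrix directly from the input points (γ is an assumed parameter); the corresponding high-dimensional feature mapping is never constructed explicitly.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Pairwise RBF kernel matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
print(rbf_kernel(X, X))   # 3x3 similarity matrix with 1.0 on the diagonal
```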

4. Optimization Objective:

• The optimization objective of SVM is to minimize the classification error while maximizing the margin.

• This is typically formulated as a constrained optimization problem, where the objective function minimizes the norm of the weight vector subject to the constraint that all data points are correctly classified and lie on the correct side of the decision boundary. A compact statement of this problem is given below.
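For reference, this hard-margin primal problem can be written as follows (the soft-margin variant discussed next adds slack variables ξ_i and a penalty term C Σ ξ_i to the objective):

```latex
\begin{aligned}
\min_{\mathbf{w},\, b} \quad & \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2} \\
\text{subject to} \quad & y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1, \qquad i = 1, \dots, n
\end{aligned}
```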

5. Soft Margin SVM:

• In cases where the data is not linearly separable or is noisy, SVM can be extended to
use a soft margin, allowing for some misclassifications. This is known as Soft Margin
SVM.

• The soft margin formulation introduces a penalty parameter (C) that controls the trade-off between maximizing the margin and minimizing the classification error. A smaller value of C allows a wider margin but may lead to more misclassifications, while a larger value of C narrows the margin to achieve higher training accuracy. The effect of varying C is sketched in the example below.
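A brief sketch on synthetic data (scikit-learn, with assumed parameter values) showing how C affects a soft-margin RBF SVM:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C controls the margin/misclassification trade-off; `kernel` selects the kernel function.
for C in (0.1, 1.0, 10.0):
    clf = SVC(kernel="rbf", C=C, gamma="scale").fit(X_train, y_train)
    print(f"C={C:>4}: test accuracy = {clf.score(X_test, y_test):.3f}")
```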

6. Multiclass Classification:

• SVM inherently supports binary classification, but several strategies can be used to
extend it to multiclass classification problems. One common approach is the one-vs-all
(OvA) or one-vs-rest (OvR) strategy, where separate binary classifiers are trained for
each class, and the class with the highest confidence score is chosen as the predicted
class.

7. Applications:

• SVM has a wide range of applications in various domains, including text classification,
image recognition, bioinformatics, finance, and more.
• In text classification, SVM can be used for sentiment analysis, spam detection, and
document categorization.

• In image recognition, SVM can classify images into different categories based on their
features extracted from pixels.

• In bioinformatics, SVM is used for protein classification, gene expression analysis, and
disease prediction.

8. Advantages:

• SVM is effective in high-dimensional spaces and is memory efficient, making it suitable for datasets with a large number of features.

• It works well with both linearly separable and nonlinearly separable data using
appropriate kernel functions.

• SVM provides global optimality, meaning the solution is guaranteed to be the best
possible solution given the data and the chosen parameters.

9. Disadvantages:

• SVM can be computationally expensive to train, especially when dealing with large datasets.

• The choice of kernel function and its parameters can significantly impact the
performance of the SVM model, and selecting the appropriate kernel requires domain
knowledge and experimentation.

• SVM models are not very interpretable compared to some other machine learning
algorithms like decision trees.

In summary, Support Vector Machine is a versatile and powerful algorithm for classification and
regression tasks, capable of handling both linear and nonlinear relationships in the data. Its
effectiveness, especially in high-dimensional spaces, makes it a popular choice for various machine
learning applications. However, it's essential to carefully select the appropriate kernel function and tune
the model parameters to achieve optimal performance.
