
Discriminant Analysis and Association Rules

Introduction to Discriminant Analysis


Dimensionality reduction is the transformation of data from a high-dimensional space
into a low-dimensional space, such that the low-dimensional representation retains
nearly all of the information (ideally all of it) while reducing the width of the data.
Working with a high-dimensional space can be undesirable for many reasons: raw data
is often sparse, and the computational cost is high. Dimensionality reduction is
common in fields that deal with large numbers of instances and columns.

Methods of dimensionality reduction are divided into linear and non-linear
approaches. Dimensionality reduction can also be used for noise reduction, data
visualization, cluster analysis, and as an intermediate step while building predictive
models.

Linear discriminant analysis (LDA) is most often used as a dimensionality reduction
technique for supervised problems. It projects features from a higher-dimensional
space onto a lower-dimensional space, and many engineers and scientists use it as a
preprocessing step before finalizing a model. With LDA we try to determine which set
of variables best describes group membership for a class, and what the best
classification rule is for separating those groups.

For example, suppose we have two classes that we need to separate efficiently. Classes
can have multiple features. Using only a single feature to classify them may result in
some overlap, as shown in the figure below. So, we keep increasing the number of
features used for classification.

Suppose we have two sets of data points belonging to two different classes that we
want to classify. As shown in the given 2D graph, when the data points are plotted on
the 2D plane, there's no straight line that can separate the two classes of data points
completely. Hence, in this case, LDA (Linear Discriminant Analysis) is used, which
reduces the 2D graph to a 1D graph in order to maximize the separability between the
two classes.

Here, Linear Discriminant Analysis uses both axes (X and Y) to create a new axis and
projects the data onto it so as to maximize the separation of the two categories,
thereby reducing the 2D graph to a 1D graph.

Two criteria are used by LDA to create a new axis:

 Maximize the distance between means of the two classes.


 Minimize the variation within each class.

In the above graph, it can be seen that a new axis (in red) is generated and plotted in
the 2D graph such that it maximizes the distance between the means of the two classes
and minimizes the variation within each class. In simple terms, this newly generated
axis increases the separation between the data points of the two classes. After
generating this new axis using the above-mentioned criteria, all the data points of the
classes are plotted on this new axis and are shown in the figure given below.

LDA works by finding a linear combination of features that characterizes or separates
two or more classes or outcomes; the resulting combination can be used as a linear
classifier or for dimensionality reduction.

Assumption of LDA

 Each feature/column in the dataset follows a Gaussian distribution; in simple words,
the data points are normally distributed, with bell-shaped curves.
 Independent variables are normal for each level of the grouping variable.
 Predictive power can decrease as the correlation between variables increases.
 All instances are assumed to be randomly sampled, and scores on one variable are
assumed to be independent of scores on the other variables.

LDA vs PCA

LDA is a very similar approach to principal component analysis (PCA): both are linear
transformation techniques for dimensionality reduction. They differ, however, in that
PCA is unsupervised and finds the directions that maximize the variance of the data,
whereas LDA is supervised and finds the directions that maximize the separability
between the known classes.

Consider another simple example of dimensionality reduction and feature extraction:
you want to check the quality of a soap based on the information available about it,
including various features such as the weight and volume of the soap, people's
preference scores, odor, color, contrast, etc.

A small scenario to understand the problem more clearly:

 Object to be tested - soap;
 To check the quality of the product - class category as 'good' or 'bad' (dependent
variable, categorical variable, measured on a nominal scale);
 Features to describe the product - various parameters that describe the soap
(independent variables, measured on nominal, ordinal, or interval scales);

Once the target (dependent) variable is decided, the related information can be drawn
from the existing dataset to check the effect of each feature on the target variable. The
data dimension is thus reduced, and only the important, related features remain in the
new dataset.

How LDA works

LDA projects features from a higher-dimensional space onto a lower-dimensional
space. Let's look at how LDA achieves this:

Step#1: Compute the mean vectors of each class of the dependent variable.

Step#2: Compute the within-class and between-class scatter matrices.

Step#3: Compute the eigenvalues and eigenvectors for SW (the within-class scatter
matrix) and SB (the between-class scatter matrix).

Step#4: Sort the eigenvalues in descending order and select the top k.

Step#5: Create a new matrix containing the eigenvectors that map to the k eigenvalues.

Step#6: Obtain the new features (i.e. linear discriminants) by taking the dot product of
the data and this matrix.

How to prepare the data for LDA

1. Outlier Treatment: Outliers should be removed from the data; outliers introduce
skewness, which in turn influences the computation of the mean and variance and
finally has an impact on the LDA computations.

2. Equal Variance: Standardize the input data so that it has a mean of 0 and a
standard deviation of 1.

3. Gaussian distribution: Perform univariate analysis of each input feature, and if a
feature does not exhibit a Gaussian distribution, transform it to look more Gaussian
(e.g. log and root transforms for exponential distributions).
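A minimal sketch of these three preparation steps in Python, using the Iris features (loaded later in these notes) as stand-in data; the 1.5 * IQR outlier rule and the log transform are illustrative choices, not prescribed here.

# importing the required modules
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

# loading the iris data as a DataFrame
iris = datasets.load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)

# 1. outlier treatment: drop rows lying outside 1.5 * IQR on any feature
q1, q3 = data.quantile(0.25), data.quantile(0.75)
iqr = q3 - q1
mask = ((data >= q1 - 1.5 * iqr) & (data <= q3 + 1.5 * iqr)).all(axis=1)
data = data[mask]

# 2. equal variance: standardize to mean 0 and standard deviation 1
scaled = pd.DataFrame(StandardScaler().fit_transform(data), columns=data.columns)

# 3. gaussian distribution: e.g. log-transform a right-skewed feature
log_feature = np.log1p(data[data.columns[0]])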

LDA can be performed in 5 steps:

1. Compute the mean vectors for the different classes from the dataset.
2. Compute the scatter matrices (between-class and within-class scatter matrices).
3. Compute the eigenvectors and corresponding eigenvalues for the scatter matrices.
4. Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with
the largest eigenvalues.
5. Use this eigenvector matrix to transform the samples onto the new subspace.

Computing the mean vectors

First, calculate the mean vectors for all classes inside the dataset.

Computing the scatter matrices

After calculating the mean vectors, the within-class and between-class scatter matrices
can be calculated.

Select linear discriminants for the new feature subspace

After calculating the eigenvectors and eigenvalues, we sort the eigenvectors from
highest to lowest depending on their corresponding eigenvalue and then choose the
top k eigenvectors, where k is the number of dimensions we want to keep.

Transform data onto the new subspace

After selecting the k eigenvectors, we can use the resulting d x k-dimensional
eigenvector matrix W to transform the data onto the new subspace via the following
equation:

Y = X * W

where X is the n x d-dimensional data matrix and Y is the n x k-dimensional matrix of
transformed samples.
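As a complement to the step-by-step description above, the following is a minimal from-scratch sketch of the five steps in NumPy on the Iris data; the variable names (Sw, Sb, W) are illustrative, and the built-in scikit-learn implementation shown in the next section is what you would normally use.

# importing the required modules
import numpy as np
from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)
n_features = X.shape[1]
classes = np.unique(y)
overall_mean = X.mean(axis=0)

# step 1: mean vector of each class
mean_vectors = {c: X[y == c].mean(axis=0) for c in classes}

# step 2: within-class (Sw) and between-class (Sb) scatter matrices
Sw = np.zeros((n_features, n_features))
Sb = np.zeros((n_features, n_features))
for c in classes:
    Xc = X[y == c]
    diff = Xc - mean_vectors[c]
    Sw += diff.T @ diff
    mean_diff = (mean_vectors[c] - overall_mean).reshape(-1, 1)
    Sb += Xc.shape[0] * (mean_diff @ mean_diff.T)

# step 3: eigenvalues and eigenvectors of inv(Sw) @ Sb
eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)

# step 4: sort eigenvectors by decreasing eigenvalue and keep the top k
k = 2
order = np.argsort(eig_vals.real)[::-1][:k]
W = eig_vecs[:, order].real   # the d x k eigenvector matrix

# step 5: transform the samples onto the new subspace: Y = X * W
Y = X @ W
print(Y.shape)   # (150, 2)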

Linear Discriminant Analysis Implementation

Let's implement LDA on the Iris dataset. This dataset contains information about the
size of the petals and sepals of three different species of flowers. Before
implementing LDA on the given dataset, ensure you have installed the following
modules on your system.

pandas
NumPy
matplotlib
sklearn
seaborn

pip install pandas
pip install numpy
pip install matplotlib
pip install scikit-learn
pip install seaborn

Importing and exploring the dataset

# importing the module


from sklearn import datasets
# loading the iris data
dataset = datasets.load_iris()
Let's print the keys of the dataset and see what kind of information we have there:
# dataset key values
dataset.keys()

Output:

You can explore each of these on your own, but here we will just go through DESCR
because it contains the details about the dataset.
# information about dataset
print(dataset['DESCR'])

Next, you can find the dataset's statistics by using the DataFrame describe() function.
# importing the module
import pandas as pd
# converting the dataset into a pandas dataframe
data = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# descriptive statistics
data.describe()
Output:

The DataFrame's statistics include each column's count, mean, standard deviation,
minimum, maximum, and quartile values.

Using LDA for dimensionality reduction

There are 4 input variables in our dataset, so it is impossible to visualize them in a
single graph. Let's apply LDA with 2 components so that the same data can be
visualized using a 2D plot.

# input and output variables
X = dataset.data
y = dataset.target
target_names = dataset.target_names
# importing the required module
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# initializing the model with 2 components
lda = LinearDiscriminantAnalysis(n_components=2)
# fitting the dataset
X_r2 = lda.fit(X, y).transform(X)

Now our data is two-dimensional, and we can easily visualize it.


# importing the required module
import matplotlib.pyplot as plt
# plot size
plt.figure(figsize=(15, 8))
# plotting the graph
plt.scatter(X_r2[:,0],X_r2[:,1], c=dataset.target)
plt.show()

This graph shows that there are three types of output classes. The LDA has helped us
to visualize these three clusters in a 2D plot.

Prior Probabilities

For a categorical target variable, each modelling node can estimate posterior
probabilities for each class, which are defined as the conditional probabilities of the
classes given the input variables. By default, the posterior probabilities are based on
implicit prior probabilities that are proportional to the frequencies of the classes in the
training set. Prior probabilities should be specified when the sample proportions of the
classes in the training set differ substantially from the proportions in the operational
data to be scored, either through sampling variation or deliberate bias. For example,
when the purpose of the analysis is to detect a rare class, it is a common practice to
use a training set in which the rare class is over represented. If no prior probabilities
are used, the estimated posterior probabilities for the rare class will be too high. If you
specify correct priors, the posterior probabilities will be correctly adjusted no matter
what the proportions in the training set are.

Increasing the prior probability of a class increases the posterior probability of the
class, moving the classification boundary for that class so that more cases are
classified into the class. Changing the prior will have a more noticeable effect if the
original posterior is near 0.5 than if it is near zero or one. For linear logistic regression
and linear normal-theory discriminant analysis, classification boundaries are
hyperplanes; increasing the prior for a class moves the hyperplanes for that class
farther from the class mean, but decreasing the prior moves the hyperplanes closer to
the class mean. But changing the priors does not change the angles of the hyperplanes.
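As an illustration of the effect of priors, scikit-learn's LinearDiscriminantAnalysis accepts an explicit priors argument; the imbalanced toy data and the 0.99/0.01 operational priors below are assumed for illustration only.

# importing the required modules
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# toy imbalanced training set: 90 negative cases, 10 positive cases (assumed data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(2, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

# default: implicit priors proportional to the training frequencies (0.9, 0.1)
lda_default = LinearDiscriminantAnalysis().fit(X, y)

# explicit priors reflecting the assumed operational proportions (0.99, 0.01)
lda_priors = LinearDiscriminantAnalysis(priors=[0.99, 0.01]).fit(X, y)

# the specified priors shift the posterior and can change the classification
x_new = np.array([[1.0, 1.0]])
print(lda_default.predict_proba(x_new))
print(lda_priors.predict_proba(x_new))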

 Prior probabilities are assumed to be estimates of the true proportions of the
classes in the operational data to be scored.
 Prior probabilities are not used by default for parameter estimation. This
enables you to manipulate the class proportions in the training set by
nonproportional sampling or by a frequency variable in any manner that you
want.
 If you specify prior probabilities, the posterior probabilities computed by the
modeling nodes are always adjusted for the priors.

 If you specify prior probabilities, the profit and loss summary statistics are
always adjusted for priors and therefore provide valid model comparisons,
assuming that you specify valid decision consequences.

Posterior probabilities are adjusted for priors as follows: each posterior is re-weighted
by the ratio of the new prior to the old prior and then renormalized over the classes,

Post(i,t) = [ OldPost(i,t) * Prior(t) / OldPrior(t) ] / sum over all classes s of [ OldPost(i,s) * Prior(s) / OldPrior(s) ]

Where,
t - an index for target values (classes)
i - an index for cases
OldPrior(t) - the old prior probability or implicit prior probability for target t
OldPost(i,t) - the posterior probability based on OldPrior(t)
Prior(t) - the new prior probability desired for target t
Post(i,t) - the posterior probability based on Prior(t)

For classification, each case i is assigned to the class with the greatest posterior
probability, that is, the class t for which Post(i,t) is maximized.
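The adjustment rule above can be written as a short function; the function name and the example numbers are illustrative only, assuming implicit training priors of 0.9/0.1 and specified priors of 0.99/0.01.

# re-weight posteriors by Prior(t) / OldPrior(t) and renormalize per case
import numpy as np

def adjust_posteriors(old_post, old_prior, new_prior):
    # old_post has shape (n_cases, n_classes); old_prior and new_prior have shape (n_classes,)
    weighted = np.asarray(old_post) * (np.asarray(new_prior) / np.asarray(old_prior))
    return weighted / weighted.sum(axis=1, keepdims=True)

old_post = np.array([[0.6, 0.4]])   # OldPost(i, t) for a single case
post = adjust_posteriors(old_post, [0.9, 0.1], [0.99, 0.01])
print(post)                         # roughly [[0.943, 0.057]]; the rare-class posterior drops
print(post.argmax(axis=1))          # assign the case to the class with the largest posterior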

Prior probabilities have no effect on estimating parameters in the Regression node, on
learning weights in the Neural Network node, or, by default, on growing trees in the
Tree node. Prior probabilities do affect classification and decision processing for each
case. Hence, if you specify the appropriate options for each node, prior probabilities
can affect the choice of models in the Regression node, early stopping in the Neural
Network node, and pruning in the Tree node.

Prior probabilities do not affect:

 Estimating parameters in the Regression node.
 Learning weights in the Neural Network node.
 Growing (as opposed to pruning) trees in the Decision Tree node, unless you
configure the Use Prior Probability in Split Search property.
 Residuals, which are based on posteriors before adjustment for priors, except in the
Decision Tree node if you choose to use prior probabilities in the split search.
 Error functions such as deviance or likelihood, except in the Decision Tree node if
you choose to use prior probabilities in the split search.
 Fit statistics such as MSE based on residuals or error functions, except in the
Decision Tree node if you choose to use prior probabilities in the split search.

Prior probabilities do affect:

 Posterior probabilities
 Classification
 Decisions
 Misclassification rate
 Expected profit or loss
 Profit and loss summary statistics, including the relative contribution of each
class.

Unequal Classification Costs

 Imbalanced classification problems often value false-positive classification errors
differently from false negatives.
 Cost-sensitive learning is a subfield of machine learning that involves explicitly
defining and using costs when training machine learning algorithms.
 Cost-sensitive techniques may be divided into three groups, including data
resampling, algorithm modifications, and ensemble methods.

Classification is a predictive modeling problem that involves predicting the class label
for an observation. There may be many class labels, so-called multi-class
classification problems, although the simplest and perhaps most common type of
classification problem has two classes and is referred to as binary classification. Most
data mining or machine learning algorithms designed for classification assume that
there is an equal number of examples for each observed class. This is not always the
case in practice, and datasets that have a skewed class distribution are referred to as
imbalanced classification problems.

In addition to assuming that the class distribution is balanced, most algorithms also
assume that all prediction errors made by a classifier, the so-called misclassifications,
cost the same. This is typically not the case for binary classification problems,
especially those that have an imbalanced class distribution.

For imbalanced classification problems, the examples from the majority class are
referred to as the negative class and assigned the class label 0. Those examples from
the minority class are referred to as the positive class and are assigned the class label
1.

The reason for this negative vs. positive naming convention is that examples
from the majority class typically represent a normal or no-event case, whereas
examples from the minority class represent the exceptional or event case.

 Majority Class: Negative or no-event assigned the class label 0.


 Minority Class: Positive or event assigned the class label 1.

Real-world imbalanced binary classification problems typically have a different
interpretation for each of the classification errors that can be made. For example,
classifying a negative case as a positive case is typically far less of a problem than
classifying a positive case as a negative case.

This makes sense when we consider that the goal of a classifier on imbalanced binary
classification problems is to detect the positive cases correctly, and that the positive
cases represent the exceptional event that we are most interested in.

Examples:
Bank Loan Problem: Consider a problem where a bank wants to determine whether to
give a loan to a customer or not. Denying a loan to a good customer is not as bad as
giving a loan to a bad customer that may never repay it.

Cancer Diagnosis Problem: Consider a problem where a doctor wants to determine
whether a patient has cancer or not. It is better to diagnose a healthy patient with
cancer and follow up with more medical tests than it is to discharge a patient that has
cancer.

Fraud Detection Problem: Consider the problem of an insurance company that wants
to determine whether a claim is fraudulent. Identifying good claims as fraudulent and
following up with the customer is better than honoring fraudulent insurance claims.

We can see from these examples that misclassification errors are not desirable in
general, but that one type of misclassification is much worse than the other:
predicting positive cases as negative is more harmful, more expensive, or worse in
whatever way we choose to measure it in the context of the target domain.

A confusion matrix is a summary of the predictions made by a model on classification
tasks. It is a table that summarizes the number of predictions made for each class,
separated by the actual class to which each example belongs.

It is best understood using a binary classification problem with negative and positive
classes, typically assigned 0 and 1 class labels respectively. The columns of the table
represent the actual class to which examples belong, and the rows represent the
predicted class (although the meaning of rows and columns can and often are
interchanged with no loss of meaning). A cell in the table is the count of the number
of examples that meet the conditions of the row and column, and each cell has a
specific common name.

An example of a confusion matrix for a binary classification task is listed below,
showing the common names for the values in each of the four cells of the table.

                   | Actual Negative | Actual Positive
Predicted Negative | True Negative   | False Negative
Predicted Positive | False Positive  | True Positive

Now, we can consider the same table with the same rows and columns and assign a
cost to each of the cells. This is called a cost matrix.

Cost Matrix: A matrix that assigns a cost to each cell in the confusion matrix.

The example below is a cost matrix where we use the notation C() to indicate the cost;
the first value represents the predicted class and the second value represents the
actual class. The names of each cell from the confusion matrix are also listed as
acronyms, e.g. False Positive is FP.

                   | Actual Negative | Actual Positive
Predicted Negative | C(0,0), TN      | C(0,1), FN
Predicted Positive | C(1,0), FP      | C(1,1), TP

An intuition from this matrix is that the cost of misclassification should always be
higher than the cost of correct classification; otherwise, total cost could be minimized
by always predicting a single class. For example, we might assign no cost to correct
predictions in each class, a cost of 5 to False Positives and a cost of 88 to False
Negatives.

                   | Actual Negative | Actual Positive
Predicted Negative | 0               | 88
Predicted Positive | 5               | 0

We can define the total cost of a classifier using this framework as the cost-weighted
sum of the False Negatives and False Positives.

Total Cost = C(0,1) * False Negatives + C(1,0) * False Positives
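A small sketch of this total cost computation, using scikit-learn's confusion_matrix and the illustrative costs above (0 for correct predictions, 5 for a false positive, 88 for a false negative); the labels and predictions are assumed for illustration.

# counting the cells of the confusion matrix and weighting the errors by their costs
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 0, 1, 0, 0, 1, 0]   # assumed actual labels
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]   # assumed predicted labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

C_FN, C_FP = 88, 5                  # C(0,1) and C(1,0) from the cost matrix above
total_cost = C_FN * fn + C_FP * fp
print(total_cost)                   # 88 * 1 + 5 * 1 = 93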

In some problem domains, defining the cost matrix might be obvious. In an insurance
claim example, the cost of a false positive might be the monetary cost to the company
of following up with the customer, and the cost of a false negative might be the cost of
the insurance claim.

In other domains, defining the cost matrix might be challenging. In a cancer
diagnostic test, for example, the cost of a false positive might be the monetary cost of
performing subsequent tests, but what is the equivalent dollar cost of letting a sick
patient go home and get sicker? In such cases a cost matrix may have to be defined by
a domain expert or economist, and sometimes it cannot be defined at all.

Introduction to Association Rules

The goal of association rule mining is to find rules that predict the occurrence of an
item (Item Y) based on the occurrence of other items (Item X) in a transaction. For
example: predict the chance of a user buying a phone cover (Item Y) if he has already
bought a phone (Item X), and if the chance is high enough, recommend the phone
cover to anyone who is buying the phone. There is a good chance of discovering
strong rules in big data, but keep in mind that the implication means co-occurrence,
not necessarily causality! We cannot assert that buying one item causes the purchase
of the other when the items are merely frequently bought together.

Here is our data, which consists of 5 transactions made by our customers. Each
transaction shows the products bought together in that transaction.

Given a set of transactions, the goal of association rule mining is to find the rules that
allow us to predict the occurrence of a specific item based on the occurrences of the
other items in the transaction.

An association rule consists of two parts, an antecedent (if) and a consequent (then).
An antecedent is something found in the data, and a consequent is something found in
conjunction with the antecedent. For a quick understanding, consider the following
association rule: "If a customer buys bread, he is 70% likely to buy milk."
Bread is the antecedent in this rule, and milk is the consequent.
Terminologies:

Item-set: A collection of one or more items. For example: {Bread, Milk} is an item-set.

k-itemset: An itemset that contains k items. For example: {Bread, Milk} is a 2-itemset.

Support Count: An indication of how frequently the item-set appears in the database,
i.e. the frequency of occurrence of an item-set. For example: {Bread, Milk} occurs 3
times in our data set.

Support: The fraction of transactions that contain the item-set. Support = Frequency of
Itemset / Total Number of Transactions. For example: Support for {Bread, Milk} =
3/5 = 60%, which means that 60% of the transactions contain the itemset {Bread, Milk}.

Confidence: For a rule X => Y, confidence shows the percentage of transactions
containing X in which Y is also bought. So confidence is the number of transactions
with both X and Y divided by the total number of transactions containing X.

For example: Confidence for Bread => Milk = 3/4 = 75%, which means that 75% of
the transactions that contain X (Bread) also contain Y (Milk).

Confidence(X => Y) = P(X∩Y)/P(X) = Frequency(X,Y)/Frequency(X). Suppose the
confidence of the association rule X => Y is 80%; it means that 80% of the
transactions that contain X also contain Y.

Form of an Association Rule: X => Y [Support, Confidence], where X and Y are sets
of items in the transaction data.

For example: Bread => Milk [Support = 60%, Confidence = 75%], where the support
shows that in 60% of transactions bread and milk are purchased together, and the
confidence shows that 75% of customers who purchase bread also purchase milk.

Frequent item-set: An itemset whose support is greater than or equal to a minimum
support (min_sup) threshold.

Strong rules: If a rule X => Y [Support, Confidence] satisfies min_sup and
min_confidence, then it is a strong rule.

The Goal of Association Rule Mining: The goal of association rule mining is to find
all association rules having support ≥ the minimum support threshold and confidence
≥ the minimum confidence threshold.

Lift: Lift gives the correlation between X and Y in the rule X => Y. The correlation
shows how the item-set X affects the item-set Y. Lift(X => Y) = Confidence(X => Y) /
Support(Y).

Lift for the rule {Bread} => {Milk}: Confidence of the rule (75%) / Support(Milk)
(4/5 = 80%) = 75% / 80% = 0.9375, i.e. 93.75%.

Evaluate the rule using the value of the lift:
 If the rule has a lift of 1, then X and Y are independent and no rule can be
derived from them.
 If the lift is < 1, then the presence of X has a negative effect on Y.
 If the lift is > 1, then X and Y are dependent on each other, and the degree of
dependence is given by the lift value.

Why use support and confidence?

Support and Confidence measure how interesting the rule is. Support is also used for
efficient discovery of association rules. Confidence, on the other hand, measures the
reliability of the inference made by a rule. For a given rule X->Y, the higher the
confidence, the more likely it is for Y to be present in transactions that contain X.
Confidence also provides an estimate of the conditional probability of Y given X.
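These measures are easy to compute directly; the sketch below uses an assumed 5-transaction basket dataset chosen so that the Bread => Milk figures match the numbers quoted above.

# an assumed transaction list: {Bread, Milk} in 3 of 5, Bread in 4 of 5, Milk in 4 of 5
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Milk", "Diaper"},
    {"Bread", "Milk", "Beer"},
    {"Bread", "Diaper"},
    {"Milk", "Diaper"},
]
n = len(transactions)

# support: fraction of transactions containing every item of the itemset
def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# confidence of X => Y: support of X union Y divided by support of X
def confidence(x, y):
    return support(x | y) / support(x)

# lift of X => Y: confidence of the rule divided by the support of Y
def lift(x, y):
    return confidence(x, y) / support(y)

print(support({"Bread", "Milk"}))        # 0.6    -> 60%
print(confidence({"Bread"}, {"Milk"}))   # 0.75   -> 75%
print(lift({"Bread"}, {"Milk"}))         # 0.9375 -> 93.75%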

Algorithms of Association Rule Mining:


 Apriori Algorithm
 Eclat Algorithm
 FP-Growth Algorithm

1) Apriori Algorithm - It works by identifying the most frequent individual items in
the database and extending them to larger and larger item sets, as long as those item
sets appear sufficiently often in the database. The frequent itemsets determined by
Apriori are then used to derive association rules that highlight trends in the data. It
counts the support of item sets using a breadth-first search strategy and a candidate
generation function that takes advantage of the downward closure property of support.

2) Eclat Algorithm - Eclat stands for Equivalence Class Transformation. It is a
depth-first search algorithm based on set intersection of the transaction lists of
itemsets. It is suitable for both sequential and parallel execution, with
locality-enhancing properties, and is an algorithm for frequent pattern mining based
on a depth-first traversal of the itemset lattice.

In practice it traverses a prefix tree rather than the full lattice.

Pruning techniques are used to stop exploring unpromising branches.

3) FP-Growth Algorithm - This algorithm is also known as the frequent pattern growth
algorithm. FP-growth finds frequent item sets in transaction data without candidate
generation. It is designed to compress the database into a structure that records the
frequent items, and then to divide this compressed database into conditional database
sets. Each conditional database is associated with one frequent item and is then mined
separately.

The data source is compressed using the FP-tree data structure. The algorithm
operates in two stages. These are as follows:

 FP-tree construction
 Extract frequently used itemsets

Types of Association Rules:

There are various types of association rules in data mining:-

 Multi-relational association rules


 Generalized association rules
 Quantitative association rules
 Interval information association rules

1. Multi-relational association rules: Multi-Relation Association Rules (MRAR) are a
class of association rules which differ from the original, simple ones (and even from
the usual multi-relational association rules extracted from multi-relational databases)
in that each rule element consists of one entity but several relationships. These
relationships represent indirect relationships between the entities.

2. Generalized association rules: Generalized association rule extraction is a
powerful tool for getting a rough idea of interesting patterns hidden in data. However,
since patterns are extracted at each level of abstraction, the mined rule sets may be too
large to be used effectively for decision-making. Therefore, in order to discover
valuable and interesting knowledge, post-processing steps are often required.
Generalized association rules should have categorical (nominal or discrete) properties
on both the left and right sides of the rule.

3. Quantitative association rules: Quantitative association rules are a special type of
association rule. Unlike general association rules, where both the left and right sides
of the rule should be categorical (nominal or discrete) attributes, at least one attribute
(left or right) of a quantitative association rule must contain numeric attributes.

4. Interval Information Association Rules: The data is first pre-processed by data
smoothing and mapping. Next, interval association rules are generated, which involves
partitioning the data via clustering before the rules are generated using an Apriori
algorithm. Finally, these rules are used to identify data values that fall outside the
expected intervals.

Discovering Association Rules in Transaction Databases: Apriori

Imagine you have a database of the items customers purchase from a store. The
Apriori algorithm helps to uncover interesting relationships and patterns in this data.
It does that by finding the sets of items that frequently occur together.

The following are the main steps of the algorithm:

1. Set the minimum support threshold - the minimum frequency required for an
itemset to be "frequent".
2. Identify frequent individual items - count the occurrence of each individual item.
3. Generate candidate itemsets of size 2 - create pairs of the frequent items discovered.
4. Prune infrequent itemsets - eliminate itemsets that do not meet the threshold.
5. Generate candidate itemsets of larger sizes - combine frequent itemsets to form
itemsets of size 3, 4, and so on.
6. Repeat the pruning process - keep eliminating the itemsets that do not meet the
threshold.
7. Iterate until no more frequent itemsets can be generated.
8. Generate association rules that express the relationships between the items -
calculate measures to evaluate the strength and significance of these rules.
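These steps can also be run end to end with the third-party mlxtend package (not covered in these notes); the six transactions below are assumed for illustration and are not the table used in the worked example that follows.

# one-hot encode the transactions, then mine frequent itemsets and rules
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["I1", "I2", "I3"],
    ["I2", "I3", "I4"],
    ["I4", "I5"],
    ["I1", "I2", "I4"],
    ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3", "I4"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# frequent itemsets with support >= 50%, then rules with confidence >= 60%
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])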

Example of Apriori: Support threshold=50%, Confidence= 60%

Support threshold = 50% => 0.5 * 6 = 3 => min_sup = 3 (there are 6 transactions in this example).

1. Count of Each Item

2. Prune Step: The following shows that item I5 does not meet min_sup=3, so it is
deleted; only I1, I2, I3 and I4 meet the min_sup count.

3. Join Step: Form 2-itemsets. From the first table, find out the occurrences of each
2-itemset.

4. Prune Step: The next table shows that the itemsets {I1, I4} and {I3, I4} do not meet
min_sup, so they are deleted.

5. Join and Prune Step: Form 3-itemsets. From the first table, find out the occurrences
of each 3-itemset. From the previous table, find the 2-itemset subsets that meet
min_sup.
We can see that for itemset {I1, I2, I3}, the subsets {I1, I2}, {I1, I3} and {I2, I3} all
occur, as shown in step 4, thus {I1, I2, I3} is frequent. For itemset {I1, I2, I4}, the
subsets are {I1, I2}, {I1, I4} and {I2, I4}; {I1, I4} is not frequent, as it does not occur
in step 4, thus {I1, I2, I4} is not frequent and is deleted.

