Week 8 Notes_DM
For example, suppose we have two classes that we need to separate efficiently. Each class can have multiple features. Using only a single feature to classify them may result in some overlap, as shown in the figure below. So we keep increasing the number of features until the classes can be separated properly.
Suppose we have two sets of data points belonging to two different classes that we
want to classify. As shown in the given 2D graph, when the data points are plotted on
the 2D plane, there's no straight line that can separate the two classes of the data
points completely. Hence, in this case, LDA (Linear Discriminant Analysis) is used
which reduces the 2D graph into a 1D graph in order to maximize the separability
between the two classes.
Here, Linear Discriminant Analysis uses both axes (X and Y) to create a new axis and projects the data onto this new axis in a way that maximizes the separation of the two categories, hence reducing the 2D graph into a 1D graph.
In the above graph, it can be seen that a new axis (in red) is generated and plotted in
the 2D graph such that it maximizes the distance between the means of the two classes
and minimizes the variation within each class. In simple terms, this newly generated
axis increases the separation between the data points of the two classes. After
generating this new axis using the above-mentioned criteria, all the data points of the
classes are plotted on this new axis and are shown in the figure given below.
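For a two-class problem this criterion can be written compactly as the Fisher criterion. The notation below (a projection direction w, projected class means m1 and m2, projected within-class scatters s1^2 and s2^2, and the between-class and within-class scatter matrices SB and SW that appear later in these notes) is standard notation added here for illustration, not taken from the original figure:

\[
J(w) \;=\; \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}
      \;=\; \frac{w^{\top} S_B\, w}{w^{\top} S_W\, w},
\qquad
w^{*} \;=\; \arg\max_{w} J(w).
\]

LDA chooses the axis w* that maximizes this ratio, which is exactly the criterion described above: a large distance between the class means and a small variation within each class.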
Assumption of LDA
The key assumptions behind LDA (each input feature is roughly Gaussian, the classes have similar variance, and there are no extreme outliers) are reflected in the data-preparation steps listed later in these notes.
LDA vs PCA
LDA is a very similar approach to principal component analysis (PCA): both are linear transformation techniques for dimensionality reduction, but there are some differences between them:
PCA is unsupervised and ignores the class labels, whereas LDA is supervised and uses the class labels.
PCA finds the directions (principal components) that maximize the variance of the data, whereas LDA finds the directions (linear discriminants) that maximize the separation between the classes.
PCA can return up to as many components as there are features, whereas LDA can return at most (number of classes - 1) linear discriminants.
For example, suppose we have a dataset of soap that includes various features such as the weight and volume of the soap, people's preference scores, odor, color, contrast, etc.
Once the target (dependent) variable is decided, the other related information can be drawn from the existing dataset to check how effective each feature is for that target variable. In this way the dimensionality of the data is reduced, and only the important, related features remain in the new dataset.
LDA projects the features from a higher-dimensional space onto a lower-dimensional space. Let's look at how LDA achieves this:
Step#1 Computes the mean vectors for each class
Step#2 Computes the within-class scatter matrix SW and the between-class scatter matrix SB
Step#3 Computes the eigenvalues and eigenvectors for SW (scatter matrix within class) and SB (scatter matrix between class)
Step#4 Sorts the eigenvalues in descending order and selects the top k
Step#5 Creates a new matrix containing the eigenvectors that correspond to the k largest eigenvalues
Step#6 Obtains the new features (i.e. the linear discriminants) by taking the dot product of the data and this matrix.
1. Outlier Treatment: Outliers should be removed from the data. Outliers introduce skewness, which in turn influences the computation of the mean and variance and finally has an impact on the LDA computations.
2. Equal Variance: Standardize the input data so that it has a mean of 0 and a standard deviation of 1.
3. Gaussian distribution: Perform a univariate analysis of each input feature, and if a feature does not exhibit a Gaussian distribution, transform it so that it looks more Gaussian (for example, log and root transforms for exponential-like distributions).
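These preparation steps can be sketched in a few lines of Python. This is a minimal illustration only; the 3-standard-deviation outlier rule, the log transform, and the random data are assumptions made here, not part of the original notes:

import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up skewed data standing in for raw input features
rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(200, 3))

# 1. Outlier treatment: drop rows lying more than 3 standard deviations from the mean
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
X = X[(z < 3).all(axis=1)]

# 3. Gaussian distribution: a log transform makes exponential-like features look more Gaussian
X = np.log1p(X)

# 2. Equal variance: standardize so every feature has mean 0 and standard deviation 1
X = StandardScaler().fit_transform(X)

print(X.mean(axis=0).round(3), X.std(axis=0).round(3))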
1. Compute the mean vectors for the different classes from the dataset.
2. Compute the scatter matrices (in-between-class and within-class scatter matrices).
3. Compute the eigenvectors and corresponding eigenvalues for the scatter matrices.
4. Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with
the largest eigenvalues.
5. Use this eigenvector matrix to transform the samples onto the new subspace.
First, calculate the mean vectors for all classes inside the dataset.
After calculating the mean vectors, the within-class and between-class scatter matrices
can be calculated.
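A small NumPy sketch of these two calculations, using a tiny made-up two-class dataset (the numbers are illustrative only):

import numpy as np

# made-up data: 10 samples, 2 features, two classes labelled 0 and 1
X = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0],
              [9.0, 10.0], [6.0, 8.0], [9.0, 5.0], [8.0, 7.0], [10.0, 8.0]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

n_features = X.shape[1]
overall_mean = X.mean(axis=0)

# mean vector of each class
mean_vectors = {c: X[y == c].mean(axis=0) for c in np.unique(y)}

# within-class scatter SW and between-class scatter SB
S_W = np.zeros((n_features, n_features))
S_B = np.zeros((n_features, n_features))
for c, m in mean_vectors.items():
    X_c = X[y == c]
    S_W += (X_c - m).T @ (X_c - m)            # scatter of each class around its own mean
    d = (m - overall_mean).reshape(-1, 1)
    S_B += len(X_c) * (d @ d.T)               # class-mean scatter, weighted by class size

print("S_W:\n", S_W)
print("S_B:\n", S_B)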
Select linear discriminants for the new feature subspace
After calculating the eigenvectors and eigenvalues, we sort the eigenvectors from
highest to lowest depending on their corresponding eigenvalue and then choose the
top k eigenvectors, where k is the number of dimensions we want to keep.
Finally, the samples are transformed onto the new subspace: Y = X * W, where X is the data matrix and W is the matrix of the selected eigenvectors.
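Continuing with the same toy data as above, the remaining steps (eigen-decomposition of SW^-1 SB, sorting, selecting the top k eigenvectors, and the projection Y = X * W) can be sketched as follows; the data and the choice k = 1 are illustrative assumptions:

import numpy as np

# same made-up two-class data as in the scatter-matrix sketch above
X = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0],
              [9.0, 10.0], [6.0, 8.0], [9.0, 5.0], [8.0, 7.0], [10.0, 8.0]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

n_features = X.shape[1]
overall_mean = X.mean(axis=0)
S_W = np.zeros((n_features, n_features))
S_B = np.zeros((n_features, n_features))
for c in np.unique(y):
    X_c = X[y == c]
    m = X_c.mean(axis=0)
    S_W += (X_c - m).T @ (X_c - m)
    d = (m - overall_mean).reshape(-1, 1)
    S_B += len(X_c) * (d @ d.T)

# eigenvalues and eigenvectors of inv(S_W) @ S_B
eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)

# sort eigenvectors by eigenvalue in descending order and keep the top k
k = 1                                    # two classes allow at most one useful discriminant
order = np.argsort(eig_vals.real)[::-1]
W = eig_vecs[:, order[:k]].real          # matrix of the selected eigenvectors

# project the samples onto the new subspace: Y = X * W
Y = X @ W
print(Y.ravel().round(3))                # 10 samples reduced to 1 linear discriminant each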
Let's implement the LDA on the Iris dataset. This dataset contains information about
the size of the petals and sepals of three different species of flowers. Before
implementing the LDA on the given dataset, ensure you have installed the following
modules on your system.
pandas
NumPy
matplotlib
sklearn
seaborn
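The loading code itself does not appear in these notes; a minimal sketch, assuming the dataset is loaded with scikit-learn's built-in load_iris function (an assumption, though it matches the dataset attributes used below):

# importing the module
from sklearn.datasets import load_iris

# loading the Iris dataset as a Bunch (dictionary-like) object
dataset = load_iris()

# listing the keys available in the dataset object
print(dataset.keys())
# typically includes 'data', 'target', 'target_names', 'DESCR' and 'feature_names'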
Output:
You can explore each of these on your own, but here we will just go through DESCR
because it contains the details about the dataset.
# information about dataset
print(dataset['DESCR'])
Next, you can find the dataset's statistics by using the DataFrame describe() function.
# importing the module
import pandas as pd
# converting the dataset into a pandas dataframe
data = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# descriptive statistics
data.describe()
Output:
The DataFrame's statistics include, for each column, the count, mean, standard deviation, minimum and maximum values, and quartiles.
There are 4 input variables in our dataset, so it is impossible to visualize them in one
graph. Let's apply LDA with 2 components so that the same data can be visualized
using the 2D plot.
# input and output variables
X = dataset.data
y = dataset.target
target_names = dataset.target_names
# importing the required module
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# initializing the model with 2 components
lda = LinearDiscriminantAnalysis(n_components=2)
# fitting the dataset
X_r2 = lda.fit(X, y).transform(X)
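The plotting code is not shown in the notes; a minimal matplotlib sketch (reusing X_r2, y and target_names from the snippets above) that would produce a comparable 2D scatter plot:

# importing the plotting module
import matplotlib.pyplot as plt

# scatter plot of the two linear discriminants, one colour per class
plt.figure()
for i, target_name in enumerate(target_names):
    plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], label=target_name)
plt.xlabel('Linear Discriminant 1')
plt.ylabel('Linear Discriminant 2')
plt.legend()
plt.title('LDA of the Iris dataset (2 components)')
plt.show()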
This graph shows that there are three types of output classes. The LDA has helped us
to visualize these three clusters in a 2D plot.
Prior Probabilities
For a categorical target variable, each modelling node can estimate posterior
probabilities for each class, which are defined as the conditional probabilities of the
classes given the input variables. By default, the posterior probabilities are based on
implicit prior probabilities that are proportional to the frequencies of the classes in the
training set. Prior probabilities should be specified when the sample proportions of the
classes in the training set differ substantially from the proportions in the operational
data to be scored, either through sampling variation or deliberate bias. For example,
when the purpose of the analysis is to detect a rare class, it is a common practice to
use a training set in which the rare class is over-represented. If no prior probabilities
are used, the estimated posterior probabilities for the rare class will be too high. If you
specify correct priors, the posterior probabilities will be correctly adjusted no matter
what the proportions in the training set are.
Increasing the prior probability of a class increases the posterior probability of the
class, moving the classification boundary for that class so that more cases are
classified into the class. Changing the prior will have a more noticeable effect if the
original posterior is near 0.5 than if it is near zero or one. For linear logistic regression
and linear normal-theory discriminant analysis, classification boundaries are
hyperplanes; increasing the prior for a class moves the hyperplanes for that class
farther from the class mean, but decreasing the prior moves the hyperplanes closer to
the class mean. But changing the priors does not change the angles of the hyperplanes.
If you specify prior probabilities, the training set can be obtained by nonproportional sampling or weighted by a frequency variable in any manner that you want.
If you specify prior probabilities, the posterior probabilities computed by the
modeling nodes are always adjusted for the priors.
If you specify prior probabilities, the profit and loss summary statistics are
always adjusted for priors and therefore provide valid model comparisons,
assuming that you specify valid decision consequences.
The posterior probabilities are adjusted by re-weighting them with the ratio of the new priors to the old priors and then renormalising over the classes:
Post(i,t) = [ OldPost(i,t) * Prior(t) / OldPrior(t) ] / sum over all classes u of [ OldPost(i,u) * Prior(u) / OldPrior(u) ]
where
t is an index for target values (classes)
i is an index for cases
OldPrior(t) is the old prior probability or implicit prior probability for target t
OldPost(i,t) is the posterior probability based on OldPrior(t)
Prior(t) is the new prior probability desired for target t
Post(i,t) is the posterior probability based on Prior(t)
For classification, each case i is assigned to the class with the greatest posterior
probability, that is, the class t for which Post(i,t) is maximized.
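A short Python sketch of this adjustment and the resulting classification; the posterior and prior values are made up for illustration, and the formula is the re-weighting given above:

import numpy as np

# made-up posteriors for 3 cases over 2 classes, estimated under the implicit priors
old_post = np.array([[0.70, 0.30],
                     [0.40, 0.60],
                     [0.90, 0.10]])
old_prior = np.array([0.50, 0.50])   # implicit priors from an oversampled training set
new_prior = np.array([0.95, 0.05])   # priors believed to hold in the operational data

# Post(i,t) = OldPost(i,t) * Prior(t) / OldPrior(t), renormalised so posteriors sum to 1
weighted = old_post * (new_prior / old_prior)
post = weighted / weighted.sum(axis=1, keepdims=True)

# each case i is assigned to the class t with the greatest adjusted posterior
print(post.round(3))
print("predicted class per case:", post.argmax(axis=1))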
Specifying prior probabilities affects the following outputs:
Posterior probabilities
Classification
Decisions
Misclassification rate
Expected profit or loss
Profit and loss summary statistics, including the relative contribution of each class.
Classification is a predictive modeling problem that involves predicting the class label
for an observation. There may be many class labels, so-called multi-class
classification problems, although the simplest and perhaps most common type of
classification problem has two classes and is referred to as binary classification. Most
data mining or machine learning algorithms designed for classification assume that
there is an equal number of examples for each observed class. This is not always the
case in practice, and datasets that have a skewed class distribution are referred to as
imbalanced classification problems.
In addition to assuming that the class distribution is balanced, most algorithms also assume that all prediction errors made by a classifier, so-called misclassifications, cost the same. This is typically not the case for binary classification problems, especially those that have an imbalanced class distribution.
For imbalanced classification problems, the examples from the majority class are
referred to as the negative class and assigned the class label 0. Those examples from
the minority class are referred to as the positive class and are assigned the class label
1.
The reason for this negative vs. positive naming convention is because the examples
from the majority class typically represent a normal or no-event case, whereas
examples from the minority class represent the exceptional or event case.
Examples:
Bank Loan Problem: Consider a problem where a bank wants to determine whether to
give a loan to a customer or not. Denying a loan to a good customer is not as bad as
giving a loan to a bad customer that may never repay it.
We can see from these examples that misclassification errors are undesirable in general, but one type of misclassification is much worse than the other. Specifically, predicting positive cases as negative is more harmful, more expensive, or worse in whatever way we want to measure the context of the target domain.
The confusion matrix is best understood using a binary classification problem with negative and positive classes, typically assigned the 0 and 1 class labels respectively. The columns of the table represent the actual class to which examples belong, and the rows represent the predicted class (although the meaning of rows and columns can be, and often is, interchanged with no loss of meaning). A cell in the table is the count of the number
of examples that meet the conditions of the row and column, and each cell has a
specific common name.
Now, we can consider the same table with the same rows and columns and assign a
cost to each of the cells. This is called a cost matrix.
Cost Matrix: A matrix that assigns a cost to each cell in the confusion matrix.
The example below is a cost matrix where we use the notation C() to indicate the cost; the first value represents the predicted class and the second value represents the
actual class. The names of each cell from the confusion matrix are also listed as
acronyms, e.g. False Positive is FP.
An intuition from this matrix is that the cost of misclassification must always be higher than the cost of correct classification; otherwise, the total cost could be minimized by always predicting one class. For
example, we might assign no cost to correct predictions in each class, a cost of 5 for
False Positives and a cost of 88 for False Negatives.
We can define the total cost of a classifier using this framework as the cost-weighted
sum of the False Negatives and False Positives.
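A short sketch of this cost-weighted sum in Python, using scikit-learn's confusion_matrix and the illustrative costs mentioned above (0 for correct predictions, 5 for a False Positive, 88 for a False Negative); the labels below are made up:

from sklearn.metrics import confusion_matrix

# made-up true labels and predictions for a binary problem (0 = negative, 1 = positive)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

# confusion_matrix with labels=[0, 1] returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# total cost = C(False Positive) * FP + C(False Negative) * FN
cost_fp, cost_fn = 5, 88
total_cost = cost_fp * fp + cost_fn * fn
print("FP:", fp, "FN:", fn, "total cost:", total_cost)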
In some problem domains, defining the cost matrix might be obvious. In an insurance
claim example, the costs for a false positive might be the monetary cost of follow-up
with the customer to the company and the cost of a false negative might be the cost of
the insurance claim.
In other domains, defining the cost matrix might be challenging. For example, in a
cancer diagnostic test example, the cost of a false positive might be the monetary cost
of performing subsequent tests, whereas what is the equivalent dollar cost for letting a
sick patient go home and get sicker? In such cases, a cost matrix may be definable by a domain expert or an economist, or it may not be possible to define one at all.
The goal of association rule mining is to find rules that will predict the occurrence of an item (Item Y) based on the occurrence of other items (Item X) in the transaction. For example: predict the chance of a user buying a phone cover (Item Y) if they have already bought the phone (Item X), and if the chance is high enough, recommend the phone cover to someone who is buying the phone. There is a good chance of discovering strong rules in big data, but keep in mind that the implication means co-occurrence, which does not necessarily mean causality! We cannot assert that buying one item is the cause of buying the other when the items are just frequently bought together.
Here is our data, which consists of 5 transactions made by our customers. Each transaction shows the products bought together in that transaction.
Given a set of transactions, the goal of association rule mining is to find the rules that
allow us to predict the occurrence of a specific item based on the occurrences of the
other items in the transaction.
An association rule consists of two parts, an antecedent (if) and a consequent (then).
An antecedent is something found in data, and a consequent is something located in
conjunction with the antecedent. For a quick understanding, consider the following
association rule: "If a customer buys bread, he's 70% likely to also buy milk."
Bread is the antecedent in the given association rule, and milk is the consequent.
Terminologies:
Item-set: A collection of one or more items. For example: {Bread, Milk} is an item-set.
k-item-set: An item-set that contains k items. For example: {Bread, Milk} is a 2-item-set.
Support Count: An indication of how frequently the item-set appears in the database, i.e. the frequency of occurrence of the item-set. For example: {Bread, Milk} occurs 3 times in our data set.
Support: The fraction of transactions that contain the item-set. For example: Support({Bread, Milk}) = 3/5 = 60%.
Confidence: For a rule X => Y, confidence shows the percentage of transactions containing X in which Y is also bought. So confidence is the number of transactions with both X and Y divided by the total number of transactions having X.
For example: Confidence for Bread => Milk = 3 / 4 = 75%, which means that 75% of the transactions that contain X (Bread) also contain Y (Milk).
Form of Association Rule: X=> Y [Support, Confidence], where X and Y are sets of
items in the transaction data.
For example: Bread => Milk [Support=60%, Confidence= 75%], where support shows
that in 60% of transactions bread and milk are purchased together, confidence shows
that 75% of customers who purchase bread also purchase milk
The Goal of Association Rule Mining: The goal of association rule mining is to find
all association rules having support ≥ the minimum_support threshold and confidence ≥ the minimum_confidence threshold.
Lift: Lift gives the correlation between X and Y in the rule X => Y. It shows how the item-set X affects the item-set Y. Lift(X=>Y) = Confidence of the rule (X=>Y) / Support(Y).
Lift for the rule {Bread} => {Milk}: Confidence of the rule (75%) / Support(Milk) (4/5 = 80%) = 0.75 / 0.80 = 0.9375 (93.75%).
Evaluate the rule using the value of the lift:
If the rule has a lift of 1, then X and Y are independent and no useful rule can be derived from them.
If the lift is < 1, then the presence of X has a negative effect on the presence of Y.
If the lift is > 1, then X and Y are dependent on each other, and the degree of dependence is given by the lift value.
Support and Confidence measure how interesting the rule is. Support is also used for
efficient discovery of association rules. Confidence, on the other hand, measures the
reliability of the inference made by a rule. For a given rule X->Y, the higher the
confidence, the more likely it is for Y to be present in transactions that contain X.
Confidence also provides an estimate of the conditional probability of Y given X.
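A small Python sketch that computes support, confidence and lift directly from a list of transactions. The five transactions below are made up only so that the bread/milk numbers match the text (support 60%, confidence 75%, lift about 0.94); they are not the original data table:

# made-up transactions chosen to reproduce the bread/milk example in the text
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Milk", "Diaper"},
    {"Bread", "Milk", "Beer"},
    {"Bread", "Diaper"},
    {"Milk", "Diaper"},
]

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    # support of X and Y together divided by the support of X
    return support(X | Y) / support(X)

def lift(X, Y):
    # confidence of X => Y divided by the support of Y
    return confidence(X, Y) / support(Y)

X, Y = {"Bread"}, {"Milk"}
print("support:", support(X | Y))        # 0.6  -> 60%
print("confidence:", confidence(X, Y))   # 0.75 -> 75%
print("lift:", round(lift(X, Y), 4))     # 0.9375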
FP-Growth is a frequent pattern mining algorithm based on a depth-first search across the item-set lattice.
The data source is compressed using the FP-tree data structure. This algorithm
operates in two stages. These are as follows:
FP-tree construction
Extraction of frequent itemsets
In multi-relational association rules, each rule element consists of one entity but several relationships; these relationships represent indirect relationships between the entities.
Imagine you have a database of the items customers purchase from a store. The Apriori algorithm helps to uncover interesting relationships and patterns in this data. It does this by finding the sets of items that frequently occur together.
The following are the main steps of the algorithm:
1. Set the minimum support threshold - min frequency required for an itemset to be
"frequent".
2. Identify frequent individual items - count the occurrence of each individual item.
3. Generate candidate itemsets of size 2 - create pairs of frequent items discovered.
4. Prune infrequent itemsets - eliminate itemsets that do not meet the threshold levels.
5. Generate itemsets of larger sizes - combine the frequent itemsets to form itemsets of size 3, 4, and so on.
6. Repeat the pruning process - keep eliminating the itemsets that do not meet the
threshold levels.
7. Iterate till no more frequent itemsets can be generated.
8. Generate association rules that express the relationship between them - calculate
measures to evaluate the strength & significance of these rules.
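A compact, self-contained sketch of these steps in plain Python (no external library). The transactions, min_sup and min_conf below are made up for illustration and are not the data table used in the worked example that follows; the code favours readability over efficiency:

from itertools import combinations

# made-up transactions and a minimum support count
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_sup = 2

def support_count(itemset):
    # number of transactions containing every item of the candidate itemset
    return sum(itemset <= t for t in transactions)

# steps 1-2: frequent individual items (1-itemsets)
items = sorted({i for t in transactions for i in t})
frequent = [{frozenset([i]): support_count({i}) for i in items
             if support_count({i}) >= min_sup}]

# steps 3-7: generate larger candidates from the previous level, prune, repeat
k = 2
while frequent[-1]:
    prev = list(frequent[-1])
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    level = {c: support_count(c) for c in candidates
             if support_count(c) >= min_sup
             and all(frozenset(s) in frequent[-1] for s in combinations(c, k - 1))}
    frequent.append(level)
    k += 1

# step 8: association rules whose confidence meets a threshold
min_conf = 0.7
for level in frequent[1:]:
    for itemset, count in level.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = count / support_count(lhs)
                if conf >= min_conf:
                    print(set(lhs), "=>", set(itemset - lhs), f"conf={conf:.2f}")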
With 6 transactions and a support threshold of 50%: 0.5 * 6 = 3, so min_sup = 3.
2. Prune Step: The table below shows that item I5 does not meet min_sup = 3, so it is deleted; only I1, I2, I3, and I4 meet the min_sup count.
3. Join Step: Form the 2-itemsets. From the first table, find out the occurrences of each 2-itemset.
4. Prune Step: The next table shows that the item sets {I1, I4} and {I3, I4} do not meet min_sup, so they are deleted.
5. Join and Prune Step: Form the 3-itemsets. From the first table, find out the occurrences of each 3-itemset. From the previous table, find out the 2-itemset subsets that meet min_sup.
For itemset {I1, I2, I3}, the subsets {I1, I2}, {I1, I3}, and {I2, I3} all occur, as shown in step 4, so {I1, I2, I3} is frequent. For itemset {I1, I2, I4}, among the subsets {I1, I2}, {I1, I4}, and {I2, I4}, the subset {I1, I4} is not frequent, as it does not occur in step 4; thus {I1, I2, I4} is not frequent and is deleted.