Supervised Learning (Part I)
Outline
Introduction to supervised learning
Decision tree classification
Naive Bayesian classification
Evaluation of classifiers

Introduction to supervised learning
An example application
Another application
A credit card company receives thousands of applications for new cards. Each application contains information about the applicant:
age
marital status
annual salary
outstanding debts
credit rating
etc.
Problem: to decide whether an application should be approved, or, equivalently, to classify applications into two categories, approved and not approved.
Machine learning
Like human learning from past experiences.
A computer does not have "experiences".
A computer system learns from data, which represent some "past experiences" of an application domain.
Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not-approved, and high-risk or low-risk.
The task is commonly called: supervised learning, classification, or inductive learning.
The data and the goal
Data: a set of data records (also called examples, instances, or cases) described by
k attributes: A1, A2, …, Ak
a class: each example is labelled with a pre-defined class.
Goal: to learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.
An example: the learning task
Learn a classification model from the data
Use the model to classify future loan applications into
Yes (approved) and
No (not approved)
What is the class for the following case/instance?
Supervised learning process: two steps
Learning (training): learn a model using the training data.
Testing: test the model using unseen test data to assess the model accuracy:

$$\text{Accuracy} = \frac{\text{Number of correct classifications}}{\text{Total number of test cases}}$$
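As a minimal illustration, the accuracy computation can be sketched in Python (the one-attribute model and data here are hypothetical, purely for illustration):

```python
def accuracy(model, test_cases):
    """Fraction of test cases whose predicted class matches the true class."""
    correct = sum(1 for x, true_class in test_cases if model(x) == true_class)
    return correct / len(test_cases)

# Hypothetical model: approve applicants with salary >= 30000.
model = lambda x: "approved" if x["salary"] >= 30000 else "not approved"
test = [({"salary": 45000}, "approved"), ({"salary": 12000}, "approved")]
print(accuracy(model, test))  # 0.5
```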
An example
Data: Loan application data
Task: Predict whether a loan should be approved or not.
Performance measure: accuracy.
Fundamental assumption of learning
Assumption: The distribution of training examples is identical to the
distribution of test examples (including future unseen examples).
Decision Tree classification
Presentation
Decision tree learning is one of the most widely used techniques
for classification.
Its classification accuracy is competitive with other methods, and
it is very efficient.
The classification model is a tree, called a decision tree.
C4.5 by Ross Quinlan is perhaps the best-known system. It can
be downloaded from the Web.
The loan data
[Table of loan application examples; class attribute: approved or not]
Use the decision tree
[Tracing the tree on the test case yields the class: No]
From a decision tree to a set of rules
A decision tree can be converted to a set of rules.
Each path from the root to a leaf is a rule, as the sketch below illustrates.
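A small sketch of the conversion in Python, using a nested-dict tree representation (an assumption made for illustration; the attribute names echo the loan example):

```python
# A tiny decision tree as nested dicts: an internal node stores the decision
# attribute and one child per attribute value; a leaf is just a class label.
tree = {"attribute": "Own_house",
        "children": {"true": "Yes",
                     "false": {"attribute": "Has_job",
                               "children": {"true": "Yes", "false": "No"}}}}

def tree_to_rules(node, conditions=()):
    """Each path from the root to a leaf becomes one if-then rule."""
    if not isinstance(node, dict):                     # reached a leaf
        return [(list(conditions), node)]
    rules = []
    for value, child in node["children"].items():
        rules += tree_to_rules(child, conditions + ((node["attribute"], value),))
    return rules

for conds, label in tree_to_rules(tree):
    print(" AND ".join(f"{a} = {v}" for a, v in conds), "->", label)
# Own_house = true -> Yes
# Own_house = false AND Has_job = true -> Yes
# Own_house = false AND Has_job = false -> No
```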
Top-Down Induction of Decision Trees
Main loop:
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort the training examples to the leaf nodes
5. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
A Python sketch of this loop follows.
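A minimal sketch of the main loop, reusing the nested-dict tree representation from the rule-extraction example; choose_best stands for an attribute-scoring function such as information gain (discussed next) and is an assumption, not C4.5's actual code:

```python
from collections import Counter

def build_tree(examples, attributes, choose_best):
    """Top-down induction of a decision tree (ID3-style skeleton).

    examples   : list of (attribute-dict, class-label) pairs
    attributes : attribute names still available for splitting
    choose_best: function(examples, attributes) -> best attribute
    """
    labels = [c for _, c in examples]
    if len(set(labels)) == 1 or not attributes:        # perfectly classified
        return Counter(labels).most_common(1)[0][0]    # leaf: majority class
    A = choose_best(examples, attributes)              # step 1: "best" attribute
    node = {"attribute": A, "children": {}}            # step 2: assign A to the node
    for v in {x[A] for x, _ in examples}:              # step 3: one branch per value
        subset = [(x, c) for x, c in examples if x[A] == v]   # step 4: sort examples
        rest = [a for a in attributes if a != A]
        node["children"][v] = build_tree(subset, rest, choose_best)  # step 5: iterate
    return node
```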
Which attribute is best?
Information theory
Information theory provides a mathematical basis
for measuring the information content.
To understand the notion of information, think
about it as providing the answer to a question, for
example, whether a coin will come up heads.
If one already has a good guess about the answer,
then the actual answer is less informative.
If one already knows that the coin is rigged so that it will come up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin (50-50).
Information theory
For a fair (honest) coin, you have no information, and you are willing to pay more (say, in terms of $) for advance information: the less you know, the more valuable the information.
Information theory uses this same intuition, but instead of measuring the value of information in dollars, it measures information content in bits.
One bit of information is enough to answer a yes/no question about which one has no idea, such as the flip of a fair coin.
Information theory: Entropy measure
The entropy formula:

$$entropy(D) = -\sum_{j=1}^{|C|} \Pr(c_j)\,\log_2 \Pr(c_j)$$

where the $\Pr(c_j)$ are the class probabilities in data set D, with $\sum_{j=1}^{|C|} \Pr(c_j) = 1$.
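The formula transcribes directly to Python (a sketch; examples are (attribute-dict, label) pairs as in the induction sketch above):

```python
import math

def entropy(examples):
    """entropy(D) = -sum_j Pr(c_j) * log2(Pr(c_j)) over the classes in D."""
    labels = [c for _, c in examples]
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)
```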
Information gain
Given a set of examples D, we first compute its entropy, entropy(D). If D is partitioned into subsets D1, ..., Dv using the v values of attribute Ai, the expected entropy after the partition is:

$$entropy_{A_i}(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\; entropy(D_j)$$
Information gain
Information gained by selecting attribute Ai to branch or to partition the data is:

$$gain(D, A_i) = entropy(D) - entropy_{A_i}(D)$$
An example
The class distribution in D is 9 "Yes" and 6 "No" (15 examples in total):

$$entropy(D) = -\frac{6}{15}\log_2\frac{6}{15} - \frac{9}{15}\log_2\frac{9}{15} = 0.971$$

Own_house   Yes   No   entropy(Di)
true          6    0   0
false         3    6   0.918

$$entropy_{Own\_house}(D) = \frac{6}{15}\,entropy(D_1) + \frac{9}{15}\,entropy(D_2) = \frac{6}{15}\cdot 0 + \frac{9}{15}\cdot 0.918 = 0.551$$

Age      Yes   No   entropy(Di)
young      2    3   0.971
middle     3    2   0.971
old        4    1   0.722

$$entropy_{Age}(D) = \frac{5}{15}\cdot 0.971 + \frac{5}{15}\cdot 0.971 + \frac{5}{15}\cdot 0.722 = 0.888$$

Credit_rating   Yes   No   entropy(Di)
fair              1    4   0.722
good              4    2   0.918
excellent         4    0   0
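These computations can be verified with a few lines of Python (a quick check; H takes a list of per-class counts, and all numbers come from the tables above):

```python
import math

def H(counts):
    """Entropy of a class distribution given as a list of counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

print(round(H([9, 6]), 3))                                    # entropy(D) -> 0.971
print(round(6/15 * H([6, 0]) + 9/15 * H([3, 6]), 3))          # Own_house  -> 0.551
print(round(5/15 * (H([2, 3]) + H([3, 2]) + H([4, 1])), 3))   # Age        -> 0.888
```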
An example
Own_house gives the largest information gain: gain(D, Own_house) = 0.971 − 0.551 = 0.420, versus gain(D, Age) = 0.971 − 0.888 = 0.083. We therefore place Own_house at the root and build the final tree by applying the same procedure recursively to each branch.
Overfitting
Avoid overfitting in classification
Overfitting: a tree may overfit the training data
Good accuracy on training data but poor accuracy on test data
Symptoms: tree too deep with too many branches, some of which may reflect anomalies due to noise or outliers
Two approaches to avoid overfitting
Pre-pruning: halt tree construction early, e.g., stop growing when a data split is not statistically significant
Post-pruning: grow the full tree, then post-prune, i.e., remove branches or sub-trees from the "fully grown" tree
Naive Bayesian Classification
Bayesian Rule
• Product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
• Generally (Bayes' rule): P(a | b) = P(b | a) P(a) / P(b)
• Example:
– let m (meningitis), s (stiff neck),
P(s|m) = 0.5, P(m) = 1/50000, and P(s) = 1/20.
– P(m|s) = P(s|m) P(m) / P(s) = 0.5 × 0.00002 / 0.05 = 0.0002
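The arithmetic is easy to check with a one-off Python snippet:

```python
# Bayes' rule on the meningitis example: P(m|s) = P(s|m) P(m) / P(s)
p_s_given_m, p_m, p_s = 0.5, 1 / 50000, 1 / 20
print(round(p_s_given_m * p_m / p_s, 6))  # 0.0002
```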
Bayesian Rule and Conditional independence
P(Cavity | Toothache ∧ Catch)
= α P(Toothache ∧ Catch | Cavity) P(Cavity)
= α P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
Bayesian classification
Probabilistic view: supervised learning can naturally be studied from a probabilistic point of view.
Let A1 through Ak be attributes with discrete values. The class is C.
Given a test example d with observed attribute values a1 through ak, classification basically amounts to computing the posterior probability Pr(C = cj | A1 = a1, ..., Ak = ak). The prediction is the class cj for which this probability is maximal.
Apply Bayes' Rule

$$\Pr(C=c_j \mid A_1=a_1,\ldots,A_{|A|}=a_{|A|}) = \frac{\Pr(A_1=a_1,\ldots,A_{|A|}=a_{|A|} \mid C=c_j)\;\Pr(C=c_j)}{\Pr(A_1=a_1,\ldots,A_{|A|}=a_{|A|})}$$
Computing probabilities
The denominator P(A1=a1,...,Ak=ak) is irrelevant for decision making
since it is the same for every class.
We only need P(A1=a1,...,Ak=ak | C=cj), which can be written as
Pr(A1=a1 | A2=a2,...,Ak=ak, C=cj) × Pr(A2=a2,...,Ak=ak | C=cj)
Recursively, the second factor above can be written in the same
way, and so on.
Now an assumption is needed.
Conditional independence assumption
All attributes are conditionally independent given the class C = cj.
Formally, we assume,
Pr(A1=a1 | A2=a2, ..., A|A|=a|A|, C=cj) = Pr(A1=a1 | C=cj)
and so on for A2 through A|A|. I.e.,
$$\Pr(A_1=a_1,\ldots,A_{|A|}=a_{|A|} \mid C=c_j) = \prod_{i=1}^{|A|} \Pr(A_i=a_i \mid C=c_j)$$
It then follows that

$$\Pr(C=c_j \mid A_1=a_1,\ldots,A_{|A|}=a_{|A|}) \propto \Pr(C=c_j)\prod_{i=1}^{|A|}\Pr(A_i=a_i \mid C=c_j)$$
We are done.
How do we estimate P(Ai = ai | C = cj)? It can be estimated from the training data by counting: Pr(Ai = ai | C = cj) is the fraction of class-cj examples whose attribute Ai takes the value ai, and Pr(C = cj) is the fraction of examples in class cj.
Classify a test instance
If we only need a decision on the most probable class for the test
instance, we only need the numerator as its denominator is the
same for every class.
Thus, given a test example, we compute the following to decide
the most probable class for the test instance
$$c = \arg\max_{c_j} \Pr(C=c_j) \prod_{i=1}^{|A|} \Pr(A_i=a_i \mid C=c_j)$$
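A count-based sketch of the whole procedure (illustrative names, no smoothing; examples are tuples of attribute values paired with a class label):

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """Estimate Pr(C=c) and Pr(A_i=a | C=c) by counting."""
    class_counts = Counter(c for _, c in examples)
    cond = defaultdict(Counter)        # cond[(i, c)][a] = count of A_i=a in class c
    for x, c in examples:
        for i, a in enumerate(x):
            cond[(i, c)][a] += 1
    n = len(examples)
    prior = {c: k / n for c, k in class_counts.items()}
    likelihood = {key: {a: k / class_counts[key[1]] for a, k in ctr.items()}
                  for key, ctr in cond.items()}
    return prior, likelihood

def classify_nb(x, prior, likelihood):
    """c = argmax_c Pr(c) * prod_i Pr(A_i = a_i | c)."""
    def score(c):
        p = prior[c]
        for i, a in enumerate(x):
            p *= likelihood.get((i, c), {}).get(a, 0.0)
        return p
    return max(prior, key=score)
```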
An example
An example
For C = t, we have

$$\Pr(C=t)\prod_{j=1}^{2}\Pr(A_j=a_j \mid C=t) = \frac{1}{2}\times\frac{2}{5}\times\frac{2}{5} = \frac{2}{25}$$

For class C = f, we have

$$\Pr(C=f)\prod_{j=1}^{2}\Pr(A_j=a_j \mid C=f) = \frac{1}{2}\times\frac{1}{5}\times\frac{2}{5} = \frac{1}{25}$$
C = t is more probable. t is the final class.
Advantages:
Easy to implement
Very efficient
Good results obtained in many applications
Disadvantages
Assumption of class conditional independence: accuracy is lost when the assumption is seriously violated (e.g., on data sets with highly correlated attributes)
Evaluation of classifiers
Introduction
A classifier is used to predict the outcome for test data.
Such predictions are useful in many applications:
business forecasting, cause-and-effect analysis, etc.
Estimation Strategy
[Diagram: the data set is split into training data and test data; a learning technique builds a classifier from the training data, and the test data is then fed to the classifier to estimate its accuracy.]
Estimation Strategies
Accuracy and performance measurement should follow a well-defined strategy. As the topic is important, many strategies have been advocated. The most widely used strategies are:
Holdout method
Random subsampling
Cross-validation
Bootstrap approach
Holdout Method
The given data set is partitioned into two disjoint sets: a training set used to build the classifier and a held-out test set used to estimate its accuracy.
Random Subsampling
The holdout method is repeated several times with different random splits, and the accuracy estimate is averaged over the runs.
Cross-Validation
The main drawback of random subsampling is that it has no control over the number of times each tuple is used for training and testing.
k-fold Cross-Validation
[Diagram: the data set is partitioned into k folds D1, ..., Di, ..., Dk; in round i, fold Di is held out for testing while the remaining folds are given to the learning technique; the classifier's accuracy and performance are averaged over the k rounds.]
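A sketch of the procedure in Python; train (builds a classifier from examples) and accuracy (scores it on a test set) are assumed, hypothetical callables:

```python
def k_fold_cv(examples, k, train, accuracy):
    """Each example is used exactly once for testing and k-1 times for training."""
    folds = [examples[i::k] for i in range(k)]       # k disjoint folds
    scores = []
    for i in range(k):
        test = folds[i]
        training = [e for j, f in enumerate(folds) if j != i for e in f]
        scores.append(accuracy(train(training), test))
    return sum(scores) / k                            # averaged estimate
```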
N-fold Cross-Validation
In the extreme case, k is set to N, the total number of examples, so each fold contains exactly one example; this is also known as leave-one-out cross-validation.
N-fold Cross-Validation: Issue
As far as estimating the accuracy and performance of a classifier model is concerned, N-fold cross-validation is comparable to the other strategies just discussed. Its main issue is the computational cost: the learning technique must be run N times.
Bootstrap Method
The Bootstrap method is a variation of random subsampling: it samples the given data set uniformly with replacement.
The rationale of this strategy is to let some records occur more than once in the training and test samples.
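A sketch of one bootstrap split; in this common variant (an assumption here), the records never drawn for training, the "out-of-bag" records, serve as the test sample:

```python
import random

def bootstrap_split(examples):
    """Draw n records uniformly with replacement for training;
    records never drawn form the test sample."""
    n = len(examples)
    drawn = [random.randrange(n) for _ in range(n)]   # some indices repeat
    training = [examples[i] for i in drawn]
    test = [examples[i] for i in range(n) if i not in set(drawn)]
    return training, test
```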
Accuracy Estimation
Accuracy: True and Predictive
The true accuracy of a classifier is the accuracy it would achieve on the entire population of instances; the predictive accuracy is the estimate of it obtained on a finite test set.
Predictive Accuracy
Universality of predictive accuracy
Predictive Accuracy
With the above-mentioned issue in mind, researchers have proposed two heuristic measures:
Error estimation using loss functions
Statistical estimation using confidence levels (will not be covered in this course!!)
[Figure: a data set of N labelled instances with attribute vectors x1, ..., xN and class labels y1, ..., yN, arranged as an N×(n+1) matrix of n attribute values plus the class label.]
Error Estimation using Loss Functions
A loss function L(yi, ŷi) quantifies the penalty of predicting ŷi when the true class is yi; the classifier's error is then estimated as the average loss over the N test instances.
Performance Estimation
Performance Estimation of a Classifier
Predictive accuracy works fine when the classes are balanced
That is, when every class in the data set is equally important
In fact, data sets with imbalanced class distributions are quite common in many real-life applications
For example, in a stock-market data set where "good" cases are rare, a predictive accuracy of 0.02 could mean that none of the stock markets is classified as "good"
Thus, when a classifier is applied to a test data set with an imbalanced class distribution, predictive accuracy on its own is not a reliable indicator of the classifier's effectiveness
Confusion Matrix
A confusion matrix for two classes (+, -) is shown below.

            Predicted +   Predicted -
Actual +    TP (f++)      FN (f+-)
Actual -    FP (f-+)      TN (f--)

There are four quadrants in the confusion matrix, which are symbolized as below.
True Positive (TP: f++): the number of instances that were positive (+) and correctly classified as positive (+).
False Negative (FN: f+-): the number of instances that were positive (+) and incorrectly classified as negative (-). It is also known as a Type 2 Error.
False Positive (FP: f-+): the number of instances that were negative (-) and incorrectly classified as positive (+). It is also known as a Type 1 Error.
True Negative (TN: f--): the number of instances that were negative (-) and correctly classified as negative (-).
Confusion Matrix
Note:
Np = TP (f++) + FN (f+-): the total number of positive instances.
Nn = FP (f-+) + TN (f--): the total number of negative instances.
N = Np + Nn: the total number of instances.
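These counts are straightforward to compute from the label lists (a small sketch with made-up labels):

```python
def confusion_counts(true_labels, predicted):
    """Return TP, FN, FP, TN for a two-class (+/-) problem."""
    tp = sum(1 for t, p in zip(true_labels, predicted) if t == "+" and p == "+")
    fn = sum(1 for t, p in zip(true_labels, predicted) if t == "+" and p == "-")
    fp = sum(1 for t, p in zip(true_labels, predicted) if t == "-" and p == "+")
    tn = sum(1 for t, p in zip(true_labels, predicted) if t == "-" and p == "-")
    return tp, fn, fp, tn

tp, fn, fp, tn = confusion_counts(list("++-+--"), list("+--+-+"))
print(tp + fn, fp + tn, tp + fn + fp + tn)  # Np, Nn, N -> 3 3 6
```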
Confusion Matrix
Example: [confusion matrix figure omitted]. Predictive accuracy?
Confusion Matrix for Multiclass Classifier
(rows: actual class; columns: predicted class)

Class   C1   C2   C3   C4   C5   C6
C1      52   10    7    0    0    1
C2      15   50    6    2    1    2
C3       5    6    6    0    0    0
C4       0    2    0   10    0    1
C5       0    1    0    0    7    1
C6       1    3    0    1    0   24
Predictive accuracy?
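Correct predictions lie on the main diagonal, so the predictive accuracy follows directly from the matrix (a quick check in Python):

```python
M = [[52, 10, 7, 0, 0, 1],
     [15, 50, 6, 2, 1, 2],
     [ 5,  6, 6, 0, 0, 0],
     [ 0,  2, 0, 10, 0, 1],
     [ 0,  1, 0, 0, 7, 1],
     [ 1,  3, 0, 1, 0, 24]]
correct = sum(M[i][i] for i in range(len(M)))     # diagonal entries
total = sum(sum(row) for row in M)
print(correct, total, round(correct / total, 3))  # 149 214 0.696
```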
Performance Evaluation Metrics
Standard metrics derived from the confusion-matrix counts include:

$$\text{Precision} = \frac{TP}{TP+FP},\qquad \text{Recall} = \frac{TP}{TP+FN},\qquad F_1 = \frac{2\cdot \text{Precision}\cdot \text{Recall}}{\text{Precision}+\text{Recall}}$$
How to deduce the Predictive Accuracy (ε)
From the confusion matrix, the predictive accuracy is the fraction of correctly classified instances:

$$\varepsilon = \frac{TP + TN}{N}$$
Analysis with Performance Measurement Metrics
A perfect classifier: all P positive and all N negative instances are classified correctly.

            Predicted +   Predicted -
Actual +    P             0
Actual -    0             N

The worst classifier: every instance is misclassified.

            Predicted +   Predicted -
Actual +    0             P
Actual -    N             0