
Supervised Learning (Part I)

Outline

- Introduction to supervised learning
- Decision tree classification
- Naïve Bayesian classification
- Evaluation of classifiers
  - Estimation strategy
  - Accuracy estimation
  - Performance estimation

1
Introduction to supervised
learning

An example application

- An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc.) of newly admitted patients.
- A decision is needed: whether to put a new patient in an intensive-care unit.
- Due to the high cost of ICU, those patients who may survive less than a month are given higher priority.
- Problem: to predict high-risk patients and discriminate them from low-risk patients.

2
Another application
- A credit card company receives thousands of applications for new cards. Each application contains information about an applicant:
  - age
  - marital status
  - annual salary
  - outstanding debts
  - credit rating
  - etc.
- Problem: to decide whether an application should be approved, or to classify applications into two categories, approved and not approved.

Machine learning
- Like human learning from past experiences.
- A computer does not have "experiences".
- A computer system learns from data, which represent some "past experiences" of an application domain.
- Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not approved, and high-risk or low-risk.
- The task is commonly called: supervised learning, classification, or inductive learning.

3
The data and the goal
- Data: a set of data records (also called examples, instances or cases) described by
  - k attributes: A1, A2, ..., Ak
  - a class: each example is labelled with a pre-defined class.
- Goal: to learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.

An example: data (loan application)

[Table: loan application records with attributes (Age, Has_job, Own_house, Credit_rating) and the class label "Approved or not".]
An example: the learning task
- Learn a classification model from the data.
- Use the model to classify future loan applications into
  - Yes (approved) and
  - No (not approved)
- What is the class for the following case/instance?

Supervised vs. unsupervised learning
- Supervised learning: classification is seen as supervised learning from examples.
  - Supervision: the data (observations, measurements, etc.) are labeled with pre-defined classes.
  - Test data are classified into these classes too.
- Unsupervised learning (clustering)
  - Class labels of the data are unknown.
  - Given a set of data, the task is to establish the existence of classes or clusters in the data.

Supervised learning process: two steps
- Learning (training): learn a model using the training data.
- Testing: test the model using unseen test data to assess the model accuracy:

  Accuracy = Number of correct classifications / Total number of test cases

What do we mean by learning?
- Given
  - a data set D,
  - a task T, and
  - a performance measure M,
  a computer system is said to learn from D to perform the task T if, after learning, the system's performance on T improves as measured by M.
- In other words, the learned model helps the system to perform T better compared to no learning.

An example
- Data: loan application data
- Task: predict whether a loan should be approved or not.
- Performance measure: accuracy.
- No learning: classify all future applications (test data) to the majority class (i.e., Yes):
  Accuracy = 9/15 = 60%.
- We can do better than 60% with learning.

Fundamental assumption of learning
Assumption: the distribution of training examples is identical to the distribution of test examples (including future unseen examples).

- In practice, this assumption is often violated to a certain degree.
- Strong violations will clearly result in poor classification accuracy.
- To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data.

Decision Tree classification


Overview
- Decision tree learning is one of the most widely used techniques for classification.
  - Its classification accuracy is competitive with other methods, and
  - it is very efficient.
- The classification model is a tree, called a decision tree.
- C4.5 by Ross Quinlan is perhaps the best known system. It can be downloaded from the Web.

The loan data

[Table: the loan application data, with the class label "Approved or not".]

A decision tree from the loan data

[Figure: a decision tree learned from the loan data.]
- Decision nodes and leaf nodes (classes)
Use the decision tree

[Figure: a test instance is passed down the tree; the predicted class is No.]

Is the decision tree unique?
- No. Here is a simpler tree.
- We want a tree that is both small and accurate.
  - Easier to understand, and it tends to perform better.
- Finding the best tree is NP-hard.
- All current tree-building algorithms are heuristic algorithms.
From a decision tree to a set of rules
- A decision tree can be converted to a set of rules.
- Each path from the root to a leaf is a rule.
Algorithm for decision tree learning
- Basic algorithm (a greedy divide-and-conquer algorithm); a minimal sketch follows this list.
  - Assume attributes are categorical for now (continuous attributes can be handled too).
  - The tree is constructed in a top-down recursive manner.
  - At the start, all the training examples are at the root.
  - Examples are partitioned recursively based on selected attributes.
  - Attributes are selected on the basis of an impurity function (e.g., information gain).
- Conditions for stopping partitioning:
  - All examples for a given node belong to the same class.
  - There are no remaining attributes for further partitioning; the majority class becomes the leaf.
  - There are no examples left.
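A minimal sketch of this greedy procedure, assuming examples stored as Python dictionaries with a "class" key; choose_best_attribute is a hypothetical helper that will be defined by the impurity measure introduced below:

import random
from collections import Counter

def majority_class(examples):
    # Most frequent class label among the examples.
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def build_tree(examples, attributes, choose_best_attribute):
    # Greedy top-down induction, mirroring the stopping conditions above.
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                      # all examples in the same class
        return classes.pop()
    if not attributes:                         # no attributes left: majority-class leaf
        return majority_class(examples)
    best = choose_best_attribute(examples, attributes)
    tree = {best: {}}
    for value in {e[best] for e in examples}:  # partition on the chosen attribute
        subset = [e for e in examples if e[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, choose_best_attribute)
    return tree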

Top-down induction of decision trees
- Main loop:
  1. A ← the "best" decision attribute for the next node
  2. Assign A as the decision attribute for the node
  3. For each value of A, create a new descendant of the node
  4. Sort training examples to the leaf nodes
  5. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
- Which attribute is best?

Choose an attribute to partition data
- The key to building a decision tree is which attribute to choose in order to branch.
- The objective is to reduce the impurity or uncertainty in the data as much as possible.
  - A subset of data is pure if all instances belong to the same class.
- The heuristic in C4.5 is to choose the attribute with the maximum Information Gain or Gain Ratio based on information theory.

The loan data (again)

[Table: the loan application data, with the class label "Approved or not".]

Two possible roots: which is better?

[Figure: two candidate root attributes, (A) and (B), each splitting the data.]
- Fig. (B) seems to be better.
Information theory
- Information theory provides a mathematical basis for measuring information content.
- To understand the notion of information, think of it as providing the answer to a question, for example, whether a coin will come up heads.
  - If one already has a good guess about the answer, then the actual answer is less informative.
  - If one already knows that the coin is rigged so that it will come up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin (50-50).

Information theory
- For a fair (honest) coin, you have no information, and you would be willing to pay more (say, in dollars) for advance information: the less you know, the more valuable the information.
- Information theory uses this same intuition, but instead of measuring the value of information in dollars, it measures information content in bits.
- One bit of information is enough to answer a yes/no question about which one has no idea, such as the flip of a fair coin.
Information theory: entropy measure
- The entropy formula:

  entropy(D) = - \sum_{j=1}^{|C|} Pr(c_j) \log_2 Pr(c_j),   with   \sum_{j=1}^{|C|} Pr(c_j) = 1

- Pr(c_j) is the probability of class c_j in the data set D.
- We use entropy as a measure of the impurity or disorder of data set D (or, a measure of information in a tree).
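A small sketch of this formula in Python (class labels in a plain list; the names are illustrative, not from the slides):

import math
from collections import Counter

def entropy(labels):
    # entropy(D) = -sum_j Pr(c_j) * log2 Pr(c_j) over the classes present in D
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Example: the loan data has 6 "Yes" and 9 "No" labels
print(round(entropy(["Yes"] * 6 + ["No"] * 9), 3))   # 0.971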

Entropy measure: significance
- As the data become purer and purer, the entropy value becomes smaller and smaller. For a two-class set, entropy is 0 when all instances belong to one class and 1 bit when the classes are split 50-50.
Information gain
- Given a set of examples D, we first compute its entropy: entropy(D).
- If we make attribute A_i, with v values, the root of the current tree, this will partition D into v subsets D_1, D_2, ..., D_v. The expected entropy if A_i is used as the current root is:

  entropy_{A_i}(D) = \sum_{j=1}^{v} (|D_j| / |D|) * entropy(D_j)

Information gain
- The information gained by selecting attribute A_i to branch or to partition the data is

  gain(D, A_i) = entropy(D) - entropy_{A_i}(D)

- gain(D, A_i) is the expected reduction in entropy due to splitting on A_i.
- We choose the attribute with the highest gain to branch/split the current tree.
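Continuing the earlier sketch, a hedged information-gain helper that could serve as choose_best_attribute (it reuses the entropy function defined above; the data layout is assumed, not from the slides):

def expected_entropy(examples, attribute):
    # entropy_A(D) = sum_j |D_j|/|D| * entropy(D_j) over the values of the attribute
    total = len(examples)
    result = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e["class"] for e in examples if e[attribute] == value]
        result += len(subset) / total * entropy(subset)
    return result

def information_gain(examples, attribute):
    return entropy([e["class"] for e in examples]) - expected_entropy(examples, attribute)

def choose_best_attribute(examples, attributes):
    # Pick the attribute with the highest information gain.
    return max(attributes, key=lambda a: information_gain(examples, a))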

An example

  entropy(D) = -(6/15) \log_2(6/15) - (9/15) \log_2(9/15) = 0.971

  entropy_{Own_house}(D) = (6/15) * entropy(D_1) + (9/15) * entropy(D_2)
                         = (6/15) * 0 + (9/15) * 0.918
                         = 0.551

  entropy_{Age}(D) = (5/15) * entropy(D_1) + (5/15) * entropy(D_2) + (5/15) * entropy(D_3)
                   = (5/15) * 0.971 + (5/15) * 0.971 + (5/15) * 0.722
                   = 0.888

  Class counts and entropies per attribute value:

  Own_house       Yes  No   Entropy(D_i)
  true            6    0    0
  false           3    6    0.918

  Age             Yes  No   Entropy(D_i)
  young           2    3    0.971
  middle          3    2    0.971
  old             4    1    0.722

  Credit_rating   Yes  No   Entropy(D_i)
  fair            1    4    0.722
  good            4    2    0.918
  excellent       4    0    0

  Has_job         Yes  No   Entropy(D_i)
  true            5    0    0
  false           4    6    0.971

- Own_house is the best choice for the root.
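A quick check of these numbers with the entropy function sketched earlier (counts taken from the table above; this is only a verification, not part of the slides):

D  = ["Yes"] * 6 + ["No"] * 9
D1 = ["Yes"] * 6                # Own_house = true
D2 = ["Yes"] * 3 + ["No"] * 6   # Own_house = false

print(round(entropy(D), 3))                                            # 0.971
print(round(6/15 * entropy(D1) + 9/15 * entropy(D2), 3))               # 0.551
print(round(entropy(D) - (6/15*entropy(D1) + 9/15*entropy(D2)), 3))    # gain for Own_house: about 0.42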

An example (continued)
- We build the final tree.

[Figure: the final decision tree, with Own_house at the root.]
- We can use the information gain ratio to evaluate the impurity as well.
Overfitting
- Consider the error of hypothesis h over:
  - the training data: error_train(h)
  - the entire distribution D of data: error_D(h)
- Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that
  error_train(h) < error_train(h')
  and
  error_D(h) > error_D(h')
Overfitting in decision tree learning

[Figure: training-set accuracy vs. test-set accuracy as the tree grows.]
Avoid overfitting in classification
- Overfitting: a tree may overfit the training data.
  - Good accuracy on training data but poor accuracy on test data.
  - Symptoms: the tree is too deep and has too many branches, some of which may reflect anomalies due to noise or outliers.
- Two approaches to avoid overfitting (see the sketch after this list):
  - Pre-pruning: halt tree construction early, e.g., stop growing when a data split is not statistically significant.
  - Post-pruning: grow the full tree, then remove branches or sub-trees from the "fully grown" tree.
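As a hedged illustration (not from the slides), scikit-learn's decision tree exposes both styles of control: pre-pruning via limits such as max_depth or min_samples_leaf, and post-pruning via cost-complexity pruning (ccp_alpha). The data set here is just a convenient built-in example:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Pre-pruning: stop growing the tree early
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5).fit(X_train, y_train)

# Post-pruning: grow fully, then prune with a cost-complexity penalty
post = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_train, y_train)

print(pre.score(X_test, y_test), post.score(X_test, y_test))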

Other issues in decision tree learning
- From tree to rules, and rule pruning
- Handling of missing values
- Handling skewed distributions
- Handling attributes and classes with different costs
- Etc.

Naive Bayesian Classification


Bayes' rule
- Product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
  - Bayes' rule: P(a | b) = P(b | a) P(a) / P(b)
- Generally,
  - P(Y | X) = P(X | Y) P(Y) / P(X) = α P(X | Y) P(Y)
- Useful for calculating/querying a diagnostic probability from a causal probability:
  P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
- Example (see the sketch after this list):
  - let m = meningitis, s = stiff neck, with P(s | m) = 0.5, P(m) = 1/50000 and P(s) = 1/20.
  - P(m | s) = P(s | m) P(m) / P(s) = 0.5 × 0.00002 / 0.05 = 0.0002
- Diagnostic rule: observed effects → hidden causes
- Causal rule: hidden causes → observed effects
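A two-line check of the meningitis example in Python (numbers taken from the slide):

p_s_given_m, p_m, p_s = 0.5, 1 / 50000, 1 / 20
p_m_given_s = p_s_given_m * p_m / p_s    # Bayes' rule
print(round(p_m_given_s, 6))             # 0.0002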

Bayes' rule and conditional independence

  P(Cavity | Toothache ∧ Catch)
    = α P(Toothache ∧ Catch | Cavity) P(Cavity)
    = α P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)

- A simple Bayesian classifier:

  P(Cause, Effect_1, ..., Effect_n) = P(Cause) \prod_i P(Effect_i | Cause)
Bayesian classification
- Probabilistic view: supervised learning can naturally be studied from a probabilistic point of view.
- Let A_1 through A_k be attributes with discrete values. The class is C.
- Given a test example d with observed attribute values a_1 through a_k.
- Classification is basically to compute the following posterior probability. The prediction is the class c_j such that

  Pr(C = c_j | A_1 = a_1, ..., A_k = a_k)

  is maximal.

Apply Bayes' rule

  Pr(C = c_j | A_1 = a_1, ..., A_{|A|} = a_{|A|})
    = Pr(A_1 = a_1, ..., A_{|A|} = a_{|A|} | C = c_j) Pr(C = c_j) / Pr(A_1 = a_1, ..., A_{|A|} = a_{|A|})

- Pr(C = c_j) is the class prior probability: easy to estimate from the training data.

Computing probabilities
- The denominator Pr(A_1 = a_1, ..., A_k = a_k) is irrelevant for decision making since it is the same for every class.
- We only need Pr(A_1 = a_1, ..., A_k = a_k | C = c_j), which can be written as
  Pr(A_1 = a_1 | A_2 = a_2, ..., A_k = a_k, C = c_j) * Pr(A_2 = a_2, ..., A_k = a_k | C = c_j)
- Recursively, the second factor above can be written in the same way, and so on.
- Now an assumption is needed.

Conditional independence assumption
- All attributes are conditionally independent given the class C = c_j.
- Formally, we assume
  Pr(A_1 = a_1 | A_2 = a_2, ..., A_{|A|} = a_{|A|}, C = c_j) = Pr(A_1 = a_1 | C = c_j)
  and so on for A_2 through A_{|A|}. I.e.,

  Pr(A_1 = a_1, ..., A_{|A|} = a_{|A|} | C = c_j) = \prod_{i=1}^{|A|} Pr(A_i = a_i | C = c_j)

Final naïve Bayesian classifier

  Pr(C = c_j | A_1 = a_1, ..., A_{|A|} = a_{|A|}) ∝ Pr(C = c_j) \prod_{i=1}^{|A|} Pr(A_i = a_i | C = c_j)

- We are done.
- How do we estimate Pr(A_i = a_i | C = c_j)?

Classify a test instance
- If we only need a decision on the most probable class for the test instance, we only need the numerator, as the denominator is the same for every class.
- Thus, given a test example, we compute the following to decide the most probable class for the test instance:

  c = argmax_{c_j} Pr(c_j) \prod_{i=1}^{|A|} Pr(A_i = a_i | C = c_j)

An example
- Compute all the probabilities required for classification.

[Table: training data with two attributes A_1, A_2 and class C ∈ {t, f}, with the estimated probabilities.]
An example (continued)
- For C = t, we have

  Pr(C = t) \prod_{j=1}^{2} Pr(A_j = a_j | C = t) = 1/2 × 2/5 × 2/5 = 2/25

- For class C = f, we have

  Pr(C = f) \prod_{j=1}^{2} Pr(A_j = a_j | C = f) = 1/2 × 1/5 × 2/5 = 1/25

- C = t is more probable; t is the final class.

On the naïve Bayesian classifier
- Advantages:
  - Easy to implement
  - Very efficient
  - Good results obtained in many applications
- Disadvantages:
  - The class conditional independence assumption causes a loss of accuracy when it is seriously violated (e.g., on data sets with highly correlated attributes).

Evaluation of classifiers


Introduction
- A classifier is used to predict an outcome for test data.
- Such predictions are useful in many applications: business forecasting, cause-and-effect analysis, etc.
- A number of classifiers have been developed to support these activities, each with its own merits and demerits.
- There is a need to estimate the accuracy and performance of a classifier with respect to a few controlling parameters and to the sensitivity of the data.
- As a task of sensitivity analysis, we focus on:
  - Estimation strategy
  - Metrics for measuring accuracy
  - Metrics for measuring performance

Estimation Strategy


Planning for estimation
- Using some "training data", building a classifier based on a certain principle is called "learning a classifier".
- After building a classifier, and before using it to classify unseen instances, we have to validate it using some "test data".
- Usually the training data and the test data are drawn from a large pool of data already available.

[Diagram: the data set is split into training data and test data; the learning technique uses the training data to build the CLASSIFIER, which is then estimated on the test data.]
Estimation strategies
- Accuracy and performance measurement should follow a strategy. As the topic is important, many strategies have been advocated. The most widely used strategies are:
  - Holdout method
  - Random subsampling
  - Cross-validation
  - Bootstrap approach

Holdout method
- This is the basic approach to estimating a prediction.
- Given a data set, it is partitioned into two disjoint sets called the training set and the test set.
- The classifier is learned from the training set and evaluated on the test set.
- The proportion of training to test data is at the discretion of the analyst, and there is a trade-off between the sizes of the two sets.
  - If the training set is too large, the model may be good, but the estimate may be less reliable because the test set is small, and vice versa.
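A minimal holdout split in Python (a hand-rolled sketch; the 70/30 proportion is assumed purely for illustration):

import random

def holdout_split(records, train_fraction=0.7, seed=0):
    # Shuffle, then cut the data into disjoint training and test sets.
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train_set, test_set = holdout_split(list(range(15)))
print(len(train_set), len(test_set))   # 10 5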

Random subsampling
- A variation of the holdout method that addresses the drawback of over-representing a class in one set and thus under-representing it in the other, and vice versa.
- In this method, the holdout method is repeated k times, and each time two disjoint sets are chosen at random with predefined sizes.
- The overall estimate is taken as the average of the estimates obtained from each iteration.
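A sketch of repeated holdout built on the split above; evaluate is a hypothetical function that returns an accuracy for one train/test pair:

def random_subsampling(records, evaluate, k=10):
    # Repeat the holdout split k times and average the resulting accuracies.
    scores = []
    for i in range(k):
        train_set, test_set = holdout_split(records, seed=i)
        scores.append(evaluate(train_set, test_set))
    return sum(scores) / k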

Cross-validation
- The main drawback of random subsampling is that it has no control over the number of times each tuple is used for training and testing.
- Cross-validation is proposed to overcome this problem.
- There are two variations of the cross-validation method:
  - k-fold cross-validation
  - N-fold cross-validation

k-fold cross-validation

[Diagram: the data set is partitioned into k disjoint folds D_1, ..., D_k. In iteration i, fold D_i is held out for testing while the remaining folds are given to the learning technique to build the CLASSIFIER; accuracy and performance are estimated over the k runs.]
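A minimal k-fold loop without any library (fold assignment by position after shuffling; evaluate is the same hypothetical function as above):

import random

def k_fold_cross_validation(records, evaluate, k=10, seed=0):
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]           # k disjoint folds
    scores = []
    for i in range(k):
        test_set = folds[i]                               # fold i is held out
        train_set = [r for j, f in enumerate(folds) for r in f if j != i]
        scores.append(evaluate(train_set, test_set))
    return sum(scores) / k                                # average over the k runs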

N-fold cross-validation
- N-fold cross-validation is the extreme case of k-fold cross-validation in which k equals N, the number of records: each record is held out once as the test set while the classifier is trained on the remaining N − 1 records (leave-one-out).
N-fold cross-validation: issues
- As far as estimating the accuracy and performance of a classifier model is concerned, N-fold cross-validation is comparable to the other strategies just discussed.
- Its drawback is that it is computationally expensive, as the run must be repeated N times; this is particularly true when the data set is large.
- In practice, the method is mainly beneficial for very small data sets, where as much data as possible needs to be used to train the classifier.

Bootstrap method
- The bootstrap method is a variation of the repeated random sampling method.
- The method samples the training records with replacement.
  - Each time a record is selected for the training set, it is put back into the original pool of records, so that it is equally likely to be redrawn in the next draw.
- In other words, the bootstrap method samples the given data set uniformly with replacement.
- The rationale of this strategy is to let some records occur more than once in the training and test samples.
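A sketch of one bootstrap round: draw N records with replacement for training, and (one common choice, assumed here) use the records never drawn as the test set:

import random

def bootstrap_sample(records, seed=0):
    # Draw N indices with replacement for the training set;
    # records never drawn form the test set ("out-of-bag" records).
    rng = random.Random(seed)
    n = len(records)
    drawn = [rng.randrange(n) for _ in range(n)]
    drawn_set = set(drawn)
    train_set = [records[i] for i in drawn]
    test_set = [records[i] for i in range(n) if i not in drawn_set]
    return train_set, test_set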

Accuracy Estimation

Accuracy: true and predictive
- The true accuracy of a classifier is its accuracy over the whole distribution of instances; since that distribution is not available, we estimate it by the predictive accuracy measured on a test set.

Predictive accuracy
Universality of predictive accuracy
- Consider a classifier model M_D developed from a training set D using an algorithm M.
- Two predictive accuracies obtained when M_D is evaluated with two different test sets T1 and T2 are
  (M_D)_{T1} = 95%
  (M_D)_{T2} = 70%
- Further, assume the sizes of T1 and T2 are
  |T1| = 100 records
  |T2| = 5000 records
- Based on the above estimates, neither estimate is acceptable beyond doubt.

Predictive accuracy
- With the above issue in mind, researchers have proposed two heuristic measures:
  - Error estimation using loss functions
  - Statistical estimation using confidence levels (not covered in this course)

Error estimation using loss functions

[The data set is a collection of N labelled instances (X_1, y_1), (X_2, y_2), ..., (X_N, y_N), arranged as an N × (n + 1) matrix: n attribute values plus one class label per instance.]
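A common choice in this setting, stated here only as an assumption (the specific loss functions used on the original slides may differ), is the 0-1 loss, whose average over the N test instances gives the error estimate:

L(y_i, \hat{y}_i) =
  \begin{cases}
    0 & \text{if } \hat{y}_i = y_i \\
    1 & \text{otherwise}
  \end{cases}
\qquad
\text{error}(M_D) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{y}_i)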


Performance Estimation

Performance estimation of a classifier
- Predictive accuracy works fine when the classes are balanced, that is, when every class in the data set is equally important.
- In fact, data sets with imbalanced class distributions are quite common in many real-life applications.
- When a classifier is applied to a test set with an imbalanced class distribution, predictive accuracy on its own is not a reliable indicator of the classifier's effectiveness.

Example: effectiveness of predictive accuracy
- Given a data set of stock markets, we are to classify them as "good" or "worst". Suppose that, out of 100 entries in the data set, 98 belong to the "good" class and only 2 are in the "worst" class.
  - With this data set, the classifier's predictive accuracy can be 0.98, a very high value, while there is a high chance that the 2 "worst" stock markets are incorrectly classified as "good".
  - On the other hand, if the predictive accuracy is 0.02, then none of the stock markets may be classified as "good".

Performance estimation of a classifier
- Thus, when a classifier is applied to a test set with an imbalanced class distribution, predictive accuracy on its own is not a reliable indicator of the classifier's effectiveness.
- This necessitates alternative metrics to judge the classifier.
- Before exploring them, we introduce the concept of the confusion matrix.

Confusion matrix
- A confusion matrix for two classes (+, -) is shown below.

                 Predicted +     Predicted -
  Actual +       TP (f++)        FN (f+-)
  Actual -       FP (f-+)        TN (f--)

- The four quadrants of the confusion matrix are:
  - True Positive (TP: f++): the number of instances that were positive (+) and correctly classified as positive (+).
  - False Negative (FN: f+-): the number of instances that were positive (+) and incorrectly classified as negative (-). Also known as a Type 2 error.
  - False Positive (FP: f-+): the number of instances that were negative (-) and incorrectly classified as positive (+). Also known as a Type 1 error.
  - True Negative (TN: f--): the number of instances that were negative (-) and correctly classified as negative (-).

Confusion matrix
Note:
- Np = TP (f++) + FN (f+-) is the total number of positive instances.
- Nn = FP (f-+) + TN (f--) is the total number of negative instances.
- N = Np + Nn is the total number of instances.
- (TP + TN) denotes the number of correct classifications.
- (FP + FN) denotes the number of classification errors.
- For a perfect classifier FP = FN = 0, that is, there would be no Type 1 or Type 2 errors.
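A small helper putting these counts together (a sketch; the metric derived here is just the predictive accuracy defined earlier):

def confusion_counts(actual, predicted, positive="+"):
    # Count the four quadrants of the two-class confusion matrix.
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fn, fp, tn

def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)   # correct classifications / N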

Confusion matrix
Example: confusion matrix
- A classifier is built on a data set regarding the Good and Worst classes of stock markets. The model is then tested with a test set of 10000 unseen instances. The result is shown in the form of a confusion matrix (rows: actual class, columns: predicted class).

  Class    Good    Worst    Total    Rate (%)
  Good     6954    46       7000     99.34
  Worst    412     2588     3000     86.27
  Total    7366    2634     10000    95.42

- Predictive accuracy?
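Using the accuracy helper above with the numbers from this table (treating Good as the positive class):

tp, fn, fp, tn = 6954, 46, 412, 2588
print(accuracy(tp, fn, fp, tn))   # 0.9542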

Confusion matrix for a multiclass classifier
- With m classes, the confusion matrix is a table of size m × m, where the element at (i, j) indicates the number of instances of class i that were classified as class j.
- For a classifier to have good accuracy, ideally most diagonal entries should have large values, with the rest of the entries being close to zero.
- The confusion matrix may have additional rows or columns to provide totals or recognition rates per class.

Confusion matrix for a multiclass classifier
Example: confusion matrix with multiple classes
- The following table shows the confusion matrix of a classification problem with six classes labeled C1, C2, C3, C4, C5 and C6.

  Class   C1    C2    C3    C4    C5    C6
  C1      52    10    7     0     0     1
  C2      15    50    6     2     1     2
  C3      5     6     6     0     0     0
  C4      0     2     0     10    0     1
  C5      0     1     0     0     7     1
  C6      1     3     0     1     0     24

- Predictive accuracy?
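For the multiclass case, the predictive accuracy is the sum of the diagonal entries divided by the total count; a quick numpy sketch applied to this table:

import numpy as np

cm = np.array([[52, 10, 7, 0, 0, 1],
               [15, 50, 6, 2, 1, 2],
               [ 5,  6, 6, 0, 0, 0],
               [ 0,  2, 0,10, 0, 1],
               [ 0,  1, 0, 0, 7, 1],
               [ 1,  3, 0, 1, 0,24]])

print(np.trace(cm) / cm.sum())   # about 0.696 (149 correct out of 214 instances)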

Performance evaluation metrics
How to deduce the predictive accuracy (ε)

How to deduce the error rate
Analysis with performance measurement metrics
- A perfect classifier: all P positive instances and all N negative instances are classified correctly.

                      Predicted class
                      +       -
  Actual class   +    P       0
                 -    0       N

Analysis with performance measurement metrics
- The opposite extreme: every instance is misclassified; all P positive instances are predicted as negative and all N negative instances are predicted as positive.

                      Predicted class
                      +       -
  Actual class   +    0       P
                 -    N       0
