
Supervised Learning (Part I)

Outline

- Introduction to supervised learning
- Decision tree classification
- Naïve Bayesian classification
- Evaluation of classifiers
  - Estimation strategy
  - Accuracy estimation
  - Performance estimation

1
Introduction to supervised
learning

An example application

- An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc.) of newly admitted patients.
- A decision is needed: whether to put a new patient in an intensive-care unit.
- Due to the high cost of ICU, those patients who may survive less than a month are given higher priority.
- Problem: to predict high-risk patients and discriminate them from low-risk patients.

2
Another application
- A credit card company receives thousands of applications for new cards. Each application contains information about an applicant:
  - age
  - marital status
  - annual salary
  - outstanding debts
  - credit rating
  - etc.
- Problem: to decide whether an application should be approved, or to classify applications into two categories, approved and not approved.

Machine learning
- Like human learning from past experiences.
- A computer does not have "experiences".
- A computer system learns from data, which represent some "past experiences" of an application domain.
- Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not approved, and high-risk or low-risk.
- The task is commonly called: supervised learning, classification, or inductive learning.

3
The data and the goal
- Data: a set of data records (also called examples, instances or cases) described by
  - k attributes: A1, A2, ..., Ak
  - a class: each example is labelled with a pre-defined class.
- Goal: to learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.

An example: data (loan application)

[Table: loan application records with attributes (Age, Has_job, Own_house, Credit_rating) and the class label "Approved or not".]
An example: the learning task
- Learn a classification model from the data.
- Use the model to classify future loan applications into
  - Yes (approved) and
  - No (not approved)
- What is the class for the following case/instance?

Supervised vs. unsupervised learning
- Supervised learning: classification is seen as supervised learning from examples.
  - Supervision: the data (observations, measurements, etc.) are labeled with pre-defined classes.
  - Test data are classified into these classes too.
- Unsupervised learning (clustering)
  - Class labels of the data are unknown.
  - Given a set of data, the task is to establish the existence of classes or clusters in the data.

Supervised learning process: two steps
- Learning (training): learn a model using the training data.
- Testing: test the model using unseen test data to assess the model accuracy:

  Accuracy = Number of correct classifications / Total number of test cases

What do we mean by learning?
- Given
  - a data set D,
  - a task T, and
  - a performance measure M,
  a computer system is said to learn from D to perform the task T if, after learning, the system's performance on T improves as measured by M.
- In other words, the learned model helps the system to perform T better compared to no learning.

An example
- Data: loan application data
- Task: predict whether a loan should be approved or not.
- Performance measure: accuracy.
- No learning: classify all future applications (test data) to the majority class (i.e., Yes):
  Accuracy = 9/15 = 60%.
- We can do better than 60% with learning.

Fundamental assumption of learning
Assumption: the distribution of training examples is identical to the distribution of test examples (including future unseen examples).

- In practice, this assumption is often violated to a certain degree.
- Strong violations will clearly result in poor classification accuracy.
- To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data.

Decision Tree classification


Overview
- Decision tree learning is one of the most widely used techniques for classification.
  - Its classification accuracy is competitive with other methods, and
  - it is very efficient.
- The classification model is a tree, called a decision tree.
- C4.5 by Ross Quinlan is perhaps the best known system. It can be downloaded from the Web.

The loan data

[Table: the loan application data, with the class label "Approved or not".]

A decision tree from the loan data

[Figure: a decision tree learned from the loan data.]
- Decision nodes and leaf nodes (classes)
Use the decision tree

[Figure: a test instance is passed down the tree; the predicted class is No.]

Is the decision tree unique?
- No. Here is a simpler tree.
- We want a tree that is both small and accurate.
  - Easier to understand, and it tends to perform better.
- Finding the best tree is NP-hard.
- All current tree-building algorithms are heuristic algorithms.
From a decision tree to a set of rules
- A decision tree can be converted to a set of rules.
- Each path from the root to a leaf is a rule.
Algorithm for decision tree learning
- Basic algorithm (a greedy divide-and-conquer algorithm); a minimal sketch follows this list.
  - Assume attributes are categorical for now (continuous attributes can be handled too).
  - The tree is constructed in a top-down recursive manner.
  - At the start, all the training examples are at the root.
  - Examples are partitioned recursively based on selected attributes.
  - Attributes are selected on the basis of an impurity function (e.g., information gain).
- Conditions for stopping partitioning:
  - All examples for a given node belong to the same class.
  - There are no remaining attributes for further partitioning; the majority class becomes the leaf.
  - There are no examples left.
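A minimal sketch of this greedy procedure, assuming examples stored as Python dictionaries with a "class" key; choose_best_attribute is a hypothetical helper that will be defined by the impurity measure introduced below:

import random
from collections import Counter

def majority_class(examples):
    # Most frequent class label among the examples.
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def build_tree(examples, attributes, choose_best_attribute):
    # Greedy top-down induction, mirroring the stopping conditions above.
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                      # all examples in the same class
        return classes.pop()
    if not attributes:                         # no attributes left: majority-class leaf
        return majority_class(examples)
    best = choose_best_attribute(examples, attributes)
    tree = {best: {}}
    for value in {e[best] for e in examples}:  # partition on the chosen attribute
        subset = [e for e in examples if e[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, choose_best_attribute)
    return tree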

Top-down induction of decision trees
- Main loop:
  1. A ← the "best" decision attribute for the next node
  2. Assign A as the decision attribute for the node
  3. For each value of A, create a new descendant of the node
  4. Sort training examples to the leaf nodes
  5. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
- Which attribute is best?

Choose an attribute to partition data
- The key to building a decision tree is which attribute to choose in order to branch.
- The objective is to reduce the impurity or uncertainty in the data as much as possible.
  - A subset of data is pure if all instances belong to the same class.
- The heuristic in C4.5 is to choose the attribute with the maximum Information Gain or Gain Ratio based on information theory.

The loan data (again)

[Table: the loan application data, with the class label "Approved or not".]

Two possible roots: which is better?

[Figure: two candidate root attributes, (A) and (B), each splitting the data.]
- Fig. (B) seems to be better.
Information theory
- Information theory provides a mathematical basis for measuring information content.
- To understand the notion of information, think of it as providing the answer to a question, for example, whether a coin will come up heads.
  - If one already has a good guess about the answer, then the actual answer is less informative.
  - If one already knows that the coin is rigged so that it will come up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin (50-50).

Information theory
- For a fair (honest) coin, you have no information, and you would be willing to pay more (say, in dollars) for advance information: the less you know, the more valuable the information.
- Information theory uses this same intuition, but instead of measuring the value of information in dollars, it measures information content in bits.
- One bit of information is enough to answer a yes/no question about which one has no idea, such as the flip of a fair coin.
Information theory: entropy measure
- The entropy formula:

  entropy(D) = - \sum_{j=1}^{|C|} Pr(c_j) \log_2 Pr(c_j),   with   \sum_{j=1}^{|C|} Pr(c_j) = 1

- Pr(c_j) is the probability of class c_j in the data set D.
- We use entropy as a measure of the impurity or disorder of data set D (or, a measure of information in a tree).
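A small sketch of this formula in Python (class labels in a plain list; the names are illustrative, not from the slides):

import math
from collections import Counter

def entropy(labels):
    # entropy(D) = -sum_j Pr(c_j) * log2 Pr(c_j) over the classes present in D
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Example: the loan data has 6 "Yes" and 9 "No" labels
print(round(entropy(["Yes"] * 6 + ["No"] * 9), 3))   # 0.971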

Entropy measure: significance
- As the data become purer and purer, the entropy value becomes smaller and smaller. For a two-class set, entropy is 0 when all instances belong to one class and 1 bit when the classes are split 50-50.
Information gain
- Given a set of examples D, we first compute its entropy: entropy(D).
- If we make attribute A_i, with v values, the root of the current tree, this will partition D into v subsets D_1, D_2, ..., D_v. The expected entropy if A_i is used as the current root is:

  entropy_{A_i}(D) = \sum_{j=1}^{v} (|D_j| / |D|) * entropy(D_j)

Information gain
- The information gained by selecting attribute A_i to branch or to partition the data is

  gain(D, A_i) = entropy(D) - entropy_{A_i}(D)

- gain(D, A_i) is the expected reduction in entropy due to splitting on A_i.
- We choose the attribute with the highest gain to branch/split the current tree.
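Continuing the earlier sketch, a hedged information-gain helper that could serve as choose_best_attribute (it reuses the entropy function defined above; the data layout is assumed, not from the slides):

def expected_entropy(examples, attribute):
    # entropy_A(D) = sum_j |D_j|/|D| * entropy(D_j) over the values of the attribute
    total = len(examples)
    result = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e["class"] for e in examples if e[attribute] == value]
        result += len(subset) / total * entropy(subset)
    return result

def information_gain(examples, attribute):
    return entropy([e["class"] for e in examples]) - expected_entropy(examples, attribute)

def choose_best_attribute(examples, attributes):
    # Pick the attribute with the highest information gain.
    return max(attributes, key=lambda a: information_gain(examples, a))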

An example

  entropy(D) = -(6/15) \log_2(6/15) - (9/15) \log_2(9/15) = 0.971

  entropy_{Own_house}(D) = (6/15) * entropy(D_1) + (9/15) * entropy(D_2)
                         = (6/15) * 0 + (9/15) * 0.918
                         = 0.551

  entropy_{Age}(D) = (5/15) * entropy(D_1) + (5/15) * entropy(D_2) + (5/15) * entropy(D_3)
                   = (5/15) * 0.971 + (5/15) * 0.971 + (5/15) * 0.722
                   = 0.888

  Class counts and entropies per attribute value:

  Own_house       Yes  No   Entropy(D_i)
  true            6    0    0
  false           3    6    0.918

  Age             Yes  No   Entropy(D_i)
  young           2    3    0.971
  middle          3    2    0.971
  old             4    1    0.722

  Credit_rating   Yes  No   Entropy(D_i)
  fair            1    4    0.722
  good            4    2    0.918
  excellent       4    0    0

  Has_job         Yes  No   Entropy(D_i)
  true            5    0    0
  false           4    6    0.971

- Own_house is the best choice for the root.
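A quick check of these numbers with the entropy function sketched earlier (counts taken from the table above; this is only a verification, not part of the slides):

D  = ["Yes"] * 6 + ["No"] * 9
D1 = ["Yes"] * 6                # Own_house = true
D2 = ["Yes"] * 3 + ["No"] * 6   # Own_house = false

print(round(entropy(D), 3))                                            # 0.971
print(round(6/15 * entropy(D1) + 9/15 * entropy(D2), 3))               # 0.551
print(round(entropy(D) - (6/15*entropy(D1) + 9/15*entropy(D2)), 3))    # gain for Own_house: about 0.42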

An example (continued)
- We build the final tree.

[Figure: the final decision tree, with Own_house at the root.]
- We can use the information gain ratio to evaluate the impurity as well.
Overfitting
- Consider the error of hypothesis h over:
  - the training data: error_train(h)
  - the entire distribution D of data: error_D(h)
- Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that
  error_train(h) < error_train(h')
  and
  error_D(h) > error_D(h')
Overfitting in decision tree learning

[Figure: training-set accuracy vs. test-set accuracy as the tree grows.]
Avoid overfitting in classification
- Overfitting: a tree may overfit the training data.
  - Good accuracy on training data but poor accuracy on test data.
  - Symptoms: the tree is too deep and has too many branches, some of which may reflect anomalies due to noise or outliers.
- Two approaches to avoid overfitting (see the sketch after this list):
  - Pre-pruning: halt tree construction early, e.g., stop growing when a data split is not statistically significant.
  - Post-pruning: grow the full tree, then remove branches or sub-trees from the "fully grown" tree.
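As a hedged illustration (not from the slides), scikit-learn's decision tree exposes both styles of control: pre-pruning via limits such as max_depth or min_samples_leaf, and post-pruning via cost-complexity pruning (ccp_alpha). The data set here is just a convenient built-in example:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Pre-pruning: stop growing the tree early
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5).fit(X_train, y_train)

# Post-pruning: grow fully, then prune with a cost-complexity penalty
post = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_train, y_train)

print(pre.score(X_test, y_test), post.score(X_test, y_test))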

Other issues in decision tree learning
- From tree to rules, and rule pruning
- Handling of missing values
- Handling skewed distributions
- Handling attributes and classes with different costs
- Etc.

Naive Bayesian Classification


Bayes' rule
- Product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
  - Bayes' rule: P(a | b) = P(b | a) P(a) / P(b)
- Generally,
  - P(Y | X) = P(X | Y) P(Y) / P(X) = α P(X | Y) P(Y)
- Useful for calculating/querying a diagnostic probability from a causal probability:
  P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
- Example (see the sketch after this list):
  - let m = meningitis, s = stiff neck, with P(s | m) = 0.5, P(m) = 1/50000 and P(s) = 1/20.
  - P(m | s) = P(s | m) P(m) / P(s) = 0.5 × 0.00002 / 0.05 = 0.0002
- Diagnostic rule: observed effects → hidden causes
- Causal rule: hidden causes → observed effects
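A two-line check of the meningitis example in Python (numbers taken from the slide):

p_s_given_m, p_m, p_s = 0.5, 1 / 50000, 1 / 20
p_m_given_s = p_s_given_m * p_m / p_s    # Bayes' rule
print(round(p_m_given_s, 6))             # 0.0002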

Bayes' rule and conditional independence

  P(Cavity | Toothache ∧ Catch)
    = α P(Toothache ∧ Catch | Cavity) P(Cavity)
    = α P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)

- A simple Bayesian classifier:

  P(Cause, Effect_1, ..., Effect_n) = P(Cause) \prod_i P(Effect_i | Cause)
Bayesian classification
- Probabilistic view: supervised learning can naturally be studied from a probabilistic point of view.
- Let A_1 through A_k be attributes with discrete values. The class is C.
- Given a test example d with observed attribute values a_1 through a_k.
- Classification is basically to compute the following posterior probability. The prediction is the class c_j such that

  Pr(C = c_j | A_1 = a_1, ..., A_k = a_k)

  is maximal.

Apply Bayes' rule

  Pr(C = c_j | A_1 = a_1, ..., A_{|A|} = a_{|A|})
    = Pr(A_1 = a_1, ..., A_{|A|} = a_{|A|} | C = c_j) Pr(C = c_j) / Pr(A_1 = a_1, ..., A_{|A|} = a_{|A|})

- Pr(C = c_j) is the class prior probability: easy to estimate from the training data.

Computing probabilities
- The denominator Pr(A_1 = a_1, ..., A_k = a_k) is irrelevant for decision making since it is the same for every class.
- We only need Pr(A_1 = a_1, ..., A_k = a_k | C = c_j), which can be written as
  Pr(A_1 = a_1 | A_2 = a_2, ..., A_k = a_k, C = c_j) * Pr(A_2 = a_2, ..., A_k = a_k | C = c_j)
- Recursively, the second factor above can be written in the same way, and so on.
- Now an assumption is needed.

Conditional independence assumption
- All attributes are conditionally independent given the class C = c_j.
- Formally, we assume
  Pr(A_1 = a_1 | A_2 = a_2, ..., A_{|A|} = a_{|A|}, C = c_j) = Pr(A_1 = a_1 | C = c_j)
  and so on for A_2 through A_{|A|}. I.e.,

  Pr(A_1 = a_1, ..., A_{|A|} = a_{|A|} | C = c_j) = \prod_{i=1}^{|A|} Pr(A_i = a_i | C = c_j)

Final naïve Bayesian classifier

  Pr(C = c_j | A_1 = a_1, ..., A_{|A|} = a_{|A|}) ∝ Pr(C = c_j) \prod_{i=1}^{|A|} Pr(A_i = a_i | C = c_j)

- We are done.
- How do we estimate Pr(A_i = a_i | C = c_j)?

Classify a test instance
- If we only need a decision on the most probable class for the test instance, we only need the numerator, as the denominator is the same for every class.
- Thus, given a test example, we compute the following to decide the most probable class for the test instance:

  c = argmax_{c_j} Pr(c_j) \prod_{i=1}^{|A|} Pr(A_i = a_i | C = c_j)

An example
- Compute all the probabilities required for classification.

[Table: training data with two attributes A_1, A_2 and class C ∈ {t, f}, with the estimated probabilities.]
An example (continued)
- For C = t, we have

  Pr(C = t) \prod_{j=1}^{2} Pr(A_j = a_j | C = t) = 1/2 × 2/5 × 2/5 = 2/25

- For class C = f, we have

  Pr(C = f) \prod_{j=1}^{2} Pr(A_j = a_j | C = f) = 1/2 × 1/5 × 2/5 = 1/25

- C = t is more probable; t is the final class.

On the naïve Bayesian classifier
- Advantages:
  - Easy to implement
  - Very efficient
  - Good results obtained in many applications
- Disadvantages:
  - The class conditional independence assumption causes a loss of accuracy when it is seriously violated (e.g., on data sets with highly correlated attributes).

Evaluation of classifiers


Introduction
- A classifier is used to predict an outcome for test data.
- Such predictions are useful in many applications: business forecasting, cause-and-effect analysis, etc.
- A number of classifiers have been developed to support these activities, each with its own merits and demerits.
- There is a need to estimate the accuracy and performance of a classifier with respect to a few controlling parameters and to the sensitivity of the data.
- As a task of sensitivity analysis, we focus on:
  - Estimation strategy
  - Metrics for measuring accuracy
  - Metrics for measuring performance

Estimation Strategy


Planning for estimation
- Using some "training data", building a classifier based on a certain principle is called "learning a classifier".
- After building a classifier, and before using it to classify unseen instances, we have to validate it using some "test data".
- Usually the training data and the test data are drawn from a large pool of data already available.

[Diagram: the data set is split into training data and test data; the learning technique uses the training data to build the CLASSIFIER, which is then estimated on the test data.]
Estimation strategies
- Accuracy and performance measurement should follow a strategy. As the topic is important, many strategies have been advocated. The most widely used strategies are:
  - Holdout method
  - Random subsampling
  - Cross-validation
  - Bootstrap approach

Holdout method
- This is the basic approach to estimating a prediction.
- Given a data set, it is partitioned into two disjoint sets called the training set and the test set.
- The classifier is learned from the training set and evaluated on the test set.
- The proportion of training to test data is at the discretion of the analyst, and there is a trade-off between the sizes of the two sets.
  - If the training set is too large, the model may be good, but the estimate may be less reliable because the test set is small, and vice versa.
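A minimal holdout split in Python (a hand-rolled sketch; the 70/30 proportion is assumed purely for illustration):

import random

def holdout_split(records, train_fraction=0.7, seed=0):
    # Shuffle, then cut the data into disjoint training and test sets.
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train_set, test_set = holdout_split(list(range(15)))
print(len(train_set), len(test_set))   # 10 5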

Random subsampling
- A variation of the holdout method that addresses the drawback of over-representing a class in one set and thus under-representing it in the other, and vice versa.
- In this method, the holdout method is repeated k times, and each time two disjoint sets are chosen at random with predefined sizes.
- The overall estimate is taken as the average of the estimates obtained from each iteration.
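A sketch of repeated holdout built on the split above; evaluate is a hypothetical function that returns an accuracy for one train/test pair:

def random_subsampling(records, evaluate, k=10):
    # Repeat the holdout split k times and average the resulting accuracies.
    scores = []
    for i in range(k):
        train_set, test_set = holdout_split(records, seed=i)
        scores.append(evaluate(train_set, test_set))
    return sum(scores) / k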

Cross-validation
- The main drawback of random subsampling is that it has no control over the number of times each tuple is used for training and testing.
- Cross-validation is proposed to overcome this problem.
- There are two variations of the cross-validation method:
  - k-fold cross-validation
  - N-fold cross-validation

k-fold cross-validation

[Diagram: the data set is partitioned into k disjoint folds D_1, ..., D_k. In iteration i, fold D_i is held out for testing while the remaining folds are given to the learning technique to build the CLASSIFIER; accuracy and performance are estimated over the k runs.]
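A minimal k-fold loop without any library (fold assignment by position after shuffling; evaluate is the same hypothetical function as above):

import random

def k_fold_cross_validation(records, evaluate, k=10, seed=0):
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]           # k disjoint folds
    scores = []
    for i in range(k):
        test_set = folds[i]                               # fold i is held out
        train_set = [r for j, f in enumerate(folds) for r in f if j != i]
        scores.append(evaluate(train_set, test_set))
    return sum(scores) / k                                # average over the k runs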

N-fold cross-validation
- N-fold cross-validation is the extreme case of k-fold cross-validation in which k equals N, the number of records: each record is held out once as the test set while the classifier is trained on the remaining N − 1 records (leave-one-out).
N-fold cross-validation: issues
- As far as estimating the accuracy and performance of a classifier model is concerned, N-fold cross-validation is comparable to the other strategies just discussed.
- Its drawback is that it is computationally expensive, as the run must be repeated N times; this is particularly true when the data set is large.
- In practice, the method is mainly beneficial for very small data sets, where as much data as possible needs to be used to train the classifier.

Bootstrap method
- The bootstrap method is a variation of the repeated random sampling method.
- The method samples the training records with replacement.
  - Each time a record is selected for the training set, it is put back into the original pool of records, so that it is equally likely to be redrawn in the next draw.
- In other words, the bootstrap method samples the given data set uniformly with replacement.
- The rationale of this strategy is to let some records occur more than once in the training and test samples.
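A sketch of one bootstrap round: draw N records with replacement for training, and (one common choice, assumed here) use the records never drawn as the test set:

import random

def bootstrap_sample(records, seed=0):
    # Draw N indices with replacement for the training set;
    # records never drawn form the test set ("out-of-bag" records).
    rng = random.Random(seed)
    n = len(records)
    drawn = [rng.randrange(n) for _ in range(n)]
    drawn_set = set(drawn)
    train_set = [records[i] for i in drawn]
    test_set = [records[i] for i in range(n) if i not in drawn_set]
    return train_set, test_set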

Accuracy Estimation

Accuracy: true and predictive
- The true accuracy of a classifier is its accuracy over the whole distribution of instances; since that distribution is not available, we estimate it by the predictive accuracy measured on a test set.

Predictive accuracy
Universality of predictive accuracy
- Consider a classifier model M_D developed from a training set D using an algorithm M.
- Two predictive accuracies obtained when M_D is evaluated with two different test sets T1 and T2 are
  (M_D)_{T1} = 95%
  (M_D)_{T2} = 70%
- Further, assume the sizes of T1 and T2 are
  |T1| = 100 records
  |T2| = 5000 records
- Based on the above estimates, neither estimate is acceptable beyond doubt.

Predictive accuracy
- With the above issue in mind, researchers have proposed two heuristic measures:
  - Error estimation using loss functions
  - Statistical estimation using confidence levels (not covered in this course)

Error estimation using loss functions

[The data set is a collection of N labelled instances (X_1, y_1), (X_2, y_2), ..., (X_N, y_N), arranged as an N × (n + 1) matrix: n attribute values plus one class label per instance.]
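A common choice in this setting, stated here only as an assumption (the specific loss functions used on the original slides may differ), is the 0-1 loss, whose average over the N test instances gives the error estimate:

L(y_i, \hat{y}_i) =
  \begin{cases}
    0 & \text{if } \hat{y}_i = y_i \\
    1 & \text{otherwise}
  \end{cases}
\qquad
\text{error}(M_D) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{y}_i)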


Performance Estimation

Performance estimation of a classifier
- Predictive accuracy works fine when the classes are balanced, that is, when every class in the data set is equally important.
- In fact, data sets with imbalanced class distributions are quite common in many real-life applications.
- When a classifier is applied to a test set with an imbalanced class distribution, predictive accuracy on its own is not a reliable indicator of the classifier's effectiveness.

Example: effectiveness of predictive accuracy
- Given a data set of stock markets, we are to classify them as "good" or "worst". Suppose that, out of 100 entries in the data set, 98 belong to the "good" class and only 2 are in the "worst" class.
  - With this data set, the classifier's predictive accuracy can be 0.98, a very high value, while there is a high chance that the 2 "worst" stock markets are incorrectly classified as "good".
  - On the other hand, if the predictive accuracy is 0.02, then none of the stock markets may be classified as "good".

Performance estimation of a classifier
- Thus, when a classifier is applied to a test set with an imbalanced class distribution, predictive accuracy on its own is not a reliable indicator of the classifier's effectiveness.
- This necessitates alternative metrics to judge the classifier.
- Before exploring them, we introduce the concept of the confusion matrix.

Confusion matrix
- A confusion matrix for two classes (+, -) is shown below.

                 Predicted +     Predicted -
  Actual +       TP (f++)        FN (f+-)
  Actual -       FP (f-+)        TN (f--)

- The four quadrants of the confusion matrix are:
  - True Positive (TP: f++): the number of instances that were positive (+) and correctly classified as positive (+).
  - False Negative (FN: f+-): the number of instances that were positive (+) and incorrectly classified as negative (-). Also known as a Type 2 error.
  - False Positive (FP: f-+): the number of instances that were negative (-) and incorrectly classified as positive (+). Also known as a Type 1 error.
  - True Negative (TN: f--): the number of instances that were negative (-) and correctly classified as negative (-).

Confusion matrix
Note:
- Np = TP (f++) + FN (f+-) is the total number of positive instances.
- Nn = FP (f-+) + TN (f--) is the total number of negative instances.
- N = Np + Nn is the total number of instances.
- (TP + TN) denotes the number of correct classifications.
- (FP + FN) denotes the number of classification errors.
- For a perfect classifier FP = FN = 0, that is, there would be no Type 1 or Type 2 errors.
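A small helper putting these counts together (a sketch; the metric derived here is just the predictive accuracy defined earlier):

def confusion_counts(actual, predicted, positive="+"):
    # Count the four quadrants of the two-class confusion matrix.
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fn, fp, tn

def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)   # correct classifications / N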

Confusion matrix
Example: confusion matrix
- A classifier is built on a data set regarding the Good and Worst classes of stock markets. The model is then tested with a test set of 10000 unseen instances. The result is shown in the form of a confusion matrix (rows: actual class, columns: predicted class).

  Class    Good    Worst    Total    Rate (%)
  Good     6954    46       7000     99.34
  Worst    412     2588     3000     86.27
  Total    7366    2634     10000    95.42

- Predictive accuracy?
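Using the accuracy helper above with the numbers from this table (treating Good as the positive class):

tp, fn, fp, tn = 6954, 46, 412, 2588
print(accuracy(tp, fn, fp, tn))   # 0.9542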

Confusion matrix for a multiclass classifier
- With m classes, the confusion matrix is a table of size m × m, where the element at (i, j) indicates the number of instances of class i that were classified as class j.
- For a classifier to have good accuracy, ideally most diagonal entries should have large values, with the rest of the entries being close to zero.
- The confusion matrix may have additional rows or columns to provide totals or recognition rates per class.

Confusion matrix for a multiclass classifier
Example: confusion matrix with multiple classes
- The following table shows the confusion matrix of a classification problem with six classes labeled C1, C2, C3, C4, C5 and C6.

  Class   C1    C2    C3    C4    C5    C6
  C1      52    10    7     0     0     1
  C2      15    50    6     2     1     2
  C3      5     6     6     0     0     0
  C4      0     2     0     10    0     1
  C5      0     1     0     0     7     1
  C6      1     3     0     1     0     24

- Predictive accuracy?
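For the multiclass case, the predictive accuracy is the sum of the diagonal entries divided by the total count; a quick numpy sketch applied to this table:

import numpy as np

cm = np.array([[52, 10, 7, 0, 0, 1],
               [15, 50, 6, 2, 1, 2],
               [ 5,  6, 6, 0, 0, 0],
               [ 0,  2, 0,10, 0, 1],
               [ 0,  1, 0, 0, 7, 1],
               [ 1,  3, 0, 1, 0,24]])

print(np.trace(cm) / cm.sum())   # about 0.696 (149 correct out of 214 instances)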

Performance evaluation metrics
How to deduce the predictive accuracy (ε)

How to deduce the error rate
Analysis with performance measurement metrics
- A perfect classifier: all P positive instances and all N negative instances are classified correctly.

                      Predicted class
                      +       -
  Actual class   +    P       0
                 -    0       N

Analysis with performance measurement metrics
- The opposite extreme: every instance is misclassified; all P positive instances are predicted as negative and all N negative instances are predicted as positive.

                      Predicted class
                      +       -
  Actual class   +    0       P
                 -    N       0
