
Data Mining (DM)

2101CS521

Unit-4
Classification

Prof. Jayesh D. Vagadiya


Computer Engineering
Department
Darshan Institute of Engineering & Technology, Rajkot
jayesh.vagadiya@darshan.ac.in
9537133260
Topics to be covered
• Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
What Kinds of Patterns Can Be Mined?
 Data mining functionalities can be classified into two categories:
1. Descriptive
2. Predictive (we are going to cover this category in this chapter)

 Descriptive
• This task presents the general properties of data stored in a database.
• Descriptive tasks are used to find patterns in data.
• E.g., frequent patterns, associations, correlations.

 Predictive
• These tasks predict the value of one attribute on the basis of the values of other attributes.
• E.g., predicting festival customer/product sales at a store.


Classification
 One method of data analysis uses a model derived from previously collected data to predict a class label for fresh data.
 Such a model is known as a classifier; it predicts categorical (discrete, unordered) class labels.
 Example:
 We can build a classification model to categorize bank loan applications as either safe or risky.
 The goal of classification is to learn a model that can automatically determine the appropriate class for new, unseen data, based on patterns it has learned from a labeled training data set.


Model

 Classifier
• This is known as classification analysis.
• It is used for prediction of categorical class labels, such as "safe" or "risky" for the loan application data; "yes" or "no" for the marketing data; or "treatment A," "treatment B," or "treatment C" for the medical data.

 Predictor
• This is known as regression analysis.
• It is used for prediction of numeric values (continuous-valued functions). For example, suppose the marketing manager wants to predict how much a given customer will spend during a sale at AllElectronics.


Steps in Classification
 Classification is a two-step process.
 First step:
 We build a classification model based on previous data.
 The classification algorithm builds the classifier by analyzing, or "learning from," a training set made up of database tuples and their associated class labels.
 The first step is also known as the learning step (or training phase).
 Second step:
 We use the model to classify new data.
 The second step is also known as the classification step.
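As a minimal sketch of the two steps in Python (assuming scikit-learn is installed; the tiny loan data set and its integer encoding below are hypothetical):

    from sklearn.tree import DecisionTreeClassifier

    # Step 1 (learning): build a classifier from labeled training tuples.
    # Hypothetical encoding: age 0=youth, 1=middle_aged, 2=senior; income 0=low, 1=medium, 2=high.
    X_train = [[0, 0], [0, 0], [1, 2], [1, 0], [2, 0]]
    y_train = ["risky", "risky", "safe", "risky", "safe"]
    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Step 2 (classification): use the learned model to classify new, unseen data.
    print(model.predict([[0, 0]]))  # a youth with low income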


Example

Training Data Set:

   Name           Age          income   loan_decision (Class Label)
   Sandy Jones    youth        low      risky
   Bill Lee       youth        low      risky
   Caroline Fox   middle_aged  high     safe
   Rick Field     middle_aged  low      risky
   Susan Lake     senior       low      safe
   Claire Phips   senior       medium   safe
   Joe Smith      middle_aged  high     safe

Testing Data Set:

   Juan Bello     senior       low      safe
   Sylvia Crest   middle_aged  low      risky
   Anne Yee       middle_aged  high     safe

The class label is provided, so this learning is known as supervised machine learning.


Example: Step 1

Training Data:

   Name           Age          income   loan_decision
   Sandy Jones    youth        low      risky
   Bill Lee       youth        low      risky
   Caroline Fox   middle_aged  high     safe
   Rick Field     middle_aged  low      risky
   Susan Lake     senior       low      safe
   Claire Phips   senior       medium   safe
   Joe Smith      middle_aged  high     safe

Classification algorithm -> Classifier (Model):

   IF age = youth THEN loan_decision = risky;
   IF income = high THEN loan_decision = safe;
   IF age = middle_aged AND income = low THEN loan_decision = risky;


Example: Step 2

Testing Data:

   Name           Age          income   loan_decision
   Juan Bello     senior       low      safe
   Sylvia Crest   middle_aged  low      risky
   Anne Yee       middle_aged  high     safe

Unseen Data: (XYZ, youth, low) -> Classifier (Model) -> loan_decision? -> risky
Decision Tree Induction
 Decision tree induction is the learning of decision trees from class-labeled training tuples.
 A decision tree is a flowchart-like tree structure:
 An internal node represents a test on an attribute.
 An edge (branch) represents an outcome of the test.
 A leaf represents one of the classes.

                        age?
           /             |             \
        youth       middle_aged       senior
          |              |               |
       student?         Yes       credit_rating?
        /    \                      /        \
       no    yes                  fair    excellent
       |      |                    |          |
       No    Yes                  No         Yes

A decision tree for the concept buys_computer, indicating whether an AllElectronics customer is likely to purchase a computer.


History of Decision Tree
 During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in
machine learning, developed a decision tree algorithm known as ID3
(Iterative Dichotomiser).
 This work expanded on earlier work on concept learning systems,
described by E. B. Hunt, J. Marin, and P. T. Stone. Quinlan later presented
C4.5 (a successor of ID3).
 In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C.
Stone) published the book Classification and Regression Trees (CART),
which described the generation of binary decision trees.
 ID3, C4.5, and CART adopt a greedy (i.e., non-backtracking) approach in
which decision trees are constructed in a top-down recursive divide-and-
conquer manner.



Decision Tree Induction Algorithm
 Basic algorithm (a greedy algorithm):
 Tree is constructed in a top-down recursive divide-and-conquer manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are discretized in
advance)
 Examples are partitioned recursively based on selected attributes
 Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
 Conditions for stopping partitioning:
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning – majority voting
is employed for classifying the leaf
 There are no samples left
Decision Tree Induction Algorithm
INPUT:
• Data partition D, which is a set of training tuples and their associated class labels;
• Attribute list, the set of candidate attributes;
• Attribute selection method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly, either a split-point or a splitting subset.
OUTPUT:
A decision tree.


Decision Tree Induction Algorithm
create a node N;
if tuples in D are all of the same class, C, then
    return N as a leaf node labeled with the class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D;  // majority voting
apply Attribute_selection_method(D, attribute_list) to find the attribute A with the
    highest attribute selection measure, and label node N with A;
if A is discrete-valued and multiway splits are allowed then  // not restricted to binary trees
    attribute_list <- attribute_list - A;
for each value v of A:
    // partition the tuples and grow subtrees for each partition
    let Dv be the set of data tuples in D satisfying A = v;
    if Dv is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dv, attribute_list) to node N;
end for
return N;
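The recursive procedure above can be sketched in Python roughly as follows (a toy implementation, not the textbook's code: tuples are dicts, information gain is the selection measure, and all function and variable names are illustrative):

    from collections import Counter
    from math import log2

    def entropy(rows, target):
        counts = Counter(r[target] for r in rows)
        return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

    def generate_decision_tree(rows, attributes, target="class"):
        classes = [r[target] for r in rows]
        if len(set(classes)) == 1:          # all tuples in the same class -> leaf
            return classes[0]
        if not attributes:                  # no attributes left -> majority voting
            return Counter(classes).most_common(1)[0][0]
        # Attribute_selection_method: minimizing Info_A(D) maximizes Gain(A).
        def info_a(a):
            parts = Counter(r[a] for r in rows)
            return sum(n / len(rows) *
                       entropy([r for r in rows if r[a] == v], target)
                       for v, n in parts.items())
        A = min(attributes, key=info_a)
        node = {}                           # maps (A, value) -> subtree
        rest = [a for a in attributes if a != A]
        for v in set(r[A] for r in rows):   # grow a subtree for each partition
            node[(A, v)] = generate_decision_tree(
                [r for r in rows if r[A] == v], rest, target)
        return node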
Attribute Selection Measures
 Attribute selection measures are known as splitting rules because they
determine how the tuples at a given node are to be split.
 An attribute selection measure is a heuristic for selecting the splitting
criterion that “best” separates a given data partition, D, of class-labeled
training tuples into individual classes.
 The tree node created for partition D is labeled with the splitting criterion,
branches are grown for each outcome of the criterion, and the tuples are
partitioned accordingly.
 Three popular attribute selection measures
1. Information gain
2. Gain ratio
3. Gini index



1. Information Gain
 ID3 uses information gain as its attribute selection measure.
 Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N.
 This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or "impurity" in these partitions.
 The expected information needed to classify a tuple in D is given by

    Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)

 where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
 Info(D) is just the average amount of information needed to identify the class label of a tuple in D.
 Info(D) is also known as the Entropy of D.


1. Information Gain Cont..
 How much more information would we still need (after the partitioning) to arrive at an exact classification? This amount is measured by

    Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

 where |D_j| / |D| acts as the weight of the jth partition.
 Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A.
 The smaller the expected information (still) required, the greater the purity of the partitions.


1. Information Gain Cont..
 Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A):

    Gain(A) = Info(D) - Info_A(D)

 The attribute A with the highest information gain is chosen as the splitting attribute at node N.


Information Gain - Example

RID  age          income  student  credit_rating  Class: buys_computer
1 youth high no fair no
2 youth high no excellent no
3 middle_aged high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle_aged low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle_aged medium no excellent yes
13 middle_aged high yes fair yes
14 senior medium no excellent no
Information Gain - Example
 The class label attribute, buys_computer, has two distinct values, namely {yes, no}; therefore, there are two distinct classes, i.e., m = 2.
 Let class C1 correspond to yes and class C2 correspond to no.
 C1 has 9 tuples and C2 has 5 tuples.

   Distinct Values   Count
   Yes               9
   No                5
   Total             14

 Info(D) is computed as

    Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits


Information Gain - Example
 Computing Info_age(D) for the age attribute:
 For the age category "youth" - 2 yes tuples and 3 no tuples.
 For the category "middle_aged" - 4 yes tuples and 0 no tuples.
 For the category "senior" - 3 yes tuples and 2 no tuples.

   Distinct Values   Yes   No   Total
   youth             2     3    5
   middle_aged       4     0    4
   senior            3     2    5

    Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694 bits

 where, e.g., I(2,3) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971 bits.


Information Gain - Example
 The gain in information for the age attribute is

    Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246 bits

 Similarly,
 Gain(income) = 0.029 bits
 Gain(student) = 0.151 bits
 Gain(credit_rating) = 0.048 bits
 The age attribute has the highest information gain among all attributes.
 Therefore node N is labeled with age, and branches are grown for each of the attribute's values.
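These values can be verified with a few lines of Python (a sketch; the class counts are taken from the tables above):

    from math import log2

    def info(counts):  # expected information (entropy) for a list of class counts
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c)

    info_D   = info([9, 5])                                                     # 0.940 bits
    info_age = 5/14 * info([2, 3]) + 4/14 * info([4, 0]) + 5/14 * info([3, 2])  # 0.694 bits
    print(round(info_D - info_age, 3))                                          # Gain(age) = 0.246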


Information Gain - Example
Splitting attribute age at the root node:

age = youth:
   income   student   credit_rating   class
   high     no        fair            no
   high     no        excellent       no
   medium   no        fair            no
   low      yes       fair            yes
   medium   yes       excellent       yes

age = middle_aged:
   income   student   credit_rating   class
   high     no        fair            yes
   low      yes       excellent       yes
   medium   no        excellent       yes
   high     yes       fair            yes

age = senior:
   income   student   credit_rating   class
   medium   no        fair            yes
   low      yes       fair            yes
   low      yes       excellent       no
   medium   yes       fair            yes
   medium   no        excellent       no
2. Gain Ratio
 The information gain measure is biased toward tests with many outcomes.
 For example, consider an attribute that acts as a unique identifier, such as product_ID.
 A split on product_ID would result in a large number of partitions, each one containing just one tuple.
 Because each partition is pure, the information gain for product_ID is maximal. Clearly, such a partitioning is useless for classification.


2. Gain Ratio Cont..
 C4.5 uses an extension to information gain known as gain ratio.
 It applies a kind of normalization to information gain using a "split information" value, defined as

    SplitInfo_A(D) = - Σ_{j=1}^{v} (|D_j| / |D|) log2(|D_j| / |D|)

 This value represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A.
 Gain ratio is defined as

    GainRatio(A) = Gain(A) / SplitInfo_A(D)

 The attribute with the maximum gain ratio is selected as the splitting attribute.


2. Gain Ratio - Example
 We use the same AllElectronics training data as in the information gain example (RID 1-14, with attributes age, income, student, credit_rating, and class buys_computer).
Gain Ratio for the attribute income - Example
 To compute the gain ratio of income:

   Distinct Values   Total
   high              4
   medium            6
   low               4

    SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557

 From the earlier example, Gain(income) = 0.029 bits; therefore GainRatio(income) = 0.029 / 1.557 = 0.019.
 Similarly, GainRatio(age), GainRatio(student), and GainRatio(credit_rating) are to be computed.
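A quick sketch of the same computation (Gain(income) = 0.029 bits is taken from the information gain example):

    from math import log2

    split_info = -sum(n / 14 * log2(n / 14) for n in [4, 6, 4])  # SplitInfo_income(D) = 1.557
    print(round(0.029 / split_info, 3))                          # GainRatio(income) = 0.019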


3. Gini Index
 CART uses the Gini index.
 The Gini index measures the impurity of D, a data partition or set of training tuples, as

    Gini(D) = 1 - Σ_{i=1}^{m} p_i^2

 where p_i is the probability that a tuple in D belongs to class C_i and is estimated by |C_{i,D}| / |D|.
 The Gini index considers a binary split for each attribute.


3. Gini Index Cont..
 The Gini index considers a binary split for each attribute.
 Consider the case where A is a discrete-valued attribute having v distinct values, {a1, a2, ..., av}.
 Examine all the possible subsets that can be formed using known values of A.
 Each subset S_A can be considered as a binary test for attribute A of the form "A ∈ S_A".
 For example, if income has three possible values, namely {low, medium, high}, then the possible subsets are {low, medium, high}, {low, medium}, {low, high}, {medium, high}, {low}, {medium}, {high}, and {}.
 Excluding the full set and the empty set, there are 2^v - 2 possible ways to form two partitions of the data, D, based on a binary split on A.
3. Gini Index Cont..
 Compute a weighted sum of the impurity of each resulting partition. For example, if a binary split on A partitions D into D1 and D2, the Gini index of D given that partitioning is

    Gini_A(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)

 For a discrete-valued attribute, the subset that gives the minimum Gini index for that attribute is selected as its splitting subset.


3. Gini Index - Example
 Again we use the AllElectronics training data from the information gain example (RID 1-14).
3. Gini Index - Example Cont..
 Considering the data of AllElectronics:
 buys_computer = yes : 9 tuples
 buys_computer = no : 5 tuples
 The Gini index computed for the impurity of D is

    Gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459


3. Gini Index - Example Cont..
 Consider each of the possible splitting subsets for the income attribute.
 Consider the subset {low, medium}:
 10 tuples in partition D1 satisfy the condition "income ∈ {low, medium}".
 The remaining 4 tuples of D are assigned to partition D2.

   Distinct Values   Yes   No   Total
   {low, medium}     7     3    10
   {high}            2     2    4

 The Gini index value computed based on this partitioning is

    Gini_income ∈ {low, medium}(D)
       = (10/14) Gini(D1) + (4/14) Gini(D2)
       = (10/14)(1 - (7/10)^2 - (3/10)^2) + (4/14)(1 - (2/4)^2 - (2/4)^2)
       = 0.443
       = Gini_income ∈ {high}(D)
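The Gini values above can be reproduced with a short sketch:

    def gini(counts):  # Gini impurity for a list of class counts
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts)

    print(round(gini([9, 5]), 3))                                # Gini(D) = 0.459
    # Binary split on income: D1 = {low, medium} (7 yes, 3 no), D2 = {high} (2 yes, 2 no).
    print(round(10/14 * gini([7, 3]) + 4/14 * gini([2, 2]), 3))  # 0.443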


3. Gini Index - Example Cont..

   Income          Gini_income ∈ {low, medium}(D) = 0.443 (the best binary split on income)
   Age             Gini_age ∈ {youth, senior}(D) = Gini_age ∈ {middle_aged}(D) = 0.357
   Student         Gini_student ∈ {yes, no}(D) = 0.367
   Credit Rating   Gini_credit_rating ∈ {fair, excellent}(D) = 0.429

 The age attribute gives the minimum Gini index overall, so it is selected as the splitting attribute.


Tree Pruning
 When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.
 We can remove the least-reliable branches from the tree.
 There are two common approaches to tree pruning: prepruning and postpruning.
Prepruning:
 A tree is "pruned" by halting its construction early (e.g., by deciding not to further split or partition the subset of training tuples at a given node).
 If the goodness measure of partitioning the tuples at a node falls below a prespecified threshold, then further partitioning of the given subset is halted.
 There are difficulties, however, in choosing an appropriate threshold. High thresholds could result in oversimplified trees, whereas low thresholds could result in very little simplification.
Tree Pruning
Postpruning:
 This approach removes subtrees from a "fully grown" tree.
 A subtree at a given node is pruned by removing its branches and replacing it with a leaf.
 The leaf is labeled with the most frequent class among the subtree being replaced.

(Figure: an unpruned decision tree with internal decision nodes A1-A5 and yes/no leaves, shown next to the pruned version of the tree in which a subtree has been replaced by a leaf.)


Tree Pruning
 Cost complexity pruning algorithm:
 It is used in CART.
 This method considers two factors: the number of leaves in the tree (cost complexity) and the error rate (misclassification percentage).
 For each internal node N:
 Calculate the cost complexity of the subtree rooted at N.
 Calculate the cost complexity if the subtree at N were replaced by a leaf node.
 Compare these two cost complexity values.
 If pruning N's subtree leads to lower cost complexity, prune the subtree (replace it with a leaf node); otherwise, keep it intact. A short sketch follows this list.
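A hedged sketch of cost complexity pruning as exposed by scikit-learn (synthetic data; the chosen alpha is illustrative, not a recommendation):

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=200, random_state=0)
    # cost_complexity_pruning_path returns the effective alphas at which subtrees get pruned.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
    # A larger ccp_alpha prunes more aggressively, yielding a smaller tree.
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
    print(pruned.tree_.node_count)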


Bayesian Classification
 The naive Bayes classifier works on the principle of conditional probability, as given by Bayes' theorem.
 Bayesian classifiers can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.


Bayes' Theorem
 Bayesian classification is used to find conditional probabilities.
 Bayes' theorem:

    P(H|X) = P(X|H) P(H) / P(X)

 Consider a data set D with a tuple X; in Bayes' theorem, X works as evidence.
 Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
 Example: X is a customer with age 35 and income Rs. 40,000.
 P(H|X):
 The probability of customer X purchasing a computer, considering the information about the customer's age and income.
 Here H is conditioned on X.
Bayes' Theorem
P(X|H):
 The probability that customer X is 35 years old and earns Rs. 40,000, given that we know he/she intends to purchase the computer.
 Here X is conditioned on H.
P(H):
 The probability that the customer will buy the computer.
P(X):
 The probability that a customer X from the set of customers is 35 years old and earns Rs. 40,000.

 Here P(H|X) and P(X|H) are called posterior probabilities, and P(X) and P(H) are called prior probabilities.


Example
 To find the probability of rain given that it is cloudy, P(Rain|Cloudy):

   Day   Weather   Rain
   1     Sunny     No
   2     Rainy     Yes
   3     Cloudy    Yes
   4     Sunny     No
   5     Rainy     Yes
   6     Sunny     No
   7     Cloudy    Yes

 (Weather plays the role of the evidence X; Rain plays the role of the hypothesis H.)

    P(Rain|Cloudy) = P(Cloudy|Rain) P(Rain) / P(Cloudy)
                   = ((2/4) × (4/7)) / (2/7)
                   = 1

 So the probability of rain given that it is cloudy is 100%.


Naive Bayesian Classification : Steps
 The naive Bayesian classifier, or simple Bayesian classifier, works as follows:
 Step 1:
 Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-D attribute vector X = (x1, x2, ..., xn).
 Step 2:
 Suppose there are m classes C1, C2, ..., Cm.
 Given a tuple X, the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X.
 That is, we find the class that maximizes P(Ci|X).
 This can be derived from Bayes' theorem:

    P(Ci|X) = P(X|Ci) P(Ci) / P(X)


Naive Bayesian Classification : Steps
 Step 3:
 Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
 Step 4:
 A simplified assumption: attributes are conditionally independent (i.e., there is no dependence relation between attributes):

    P(X|Ci) = Π_{k=1}^{n} P(x_k|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)

 Step 5:
 To predict the class label of X, P(X|Ci) P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if P(X|Ci) P(Ci) is the maximum.


Naive Bayesian : Example

Given a new instance, x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
P(Play=Yes) = 9/14
P(Play=No) = 5/14

   Day   Outlook    Temperature   Humidity   Wind     PlayTennis
   1     Sunny      Hot           High       Weak     No
   2     Sunny      Hot           High       Strong   No
   3     Overcast   Hot           High       Weak     Yes
   4     Rain       Mild          High       Weak     Yes
   5     Rain       Cool          Normal     Weak     Yes
   6     Rain       Cool          Normal     Strong   No
   7     Overcast   Cool          Normal     Strong   Yes
   8     Sunny      Mild          High       Weak     No
   9     Sunny      Cool          Normal     Weak     Yes
   10    Rain       Mild          Normal     Weak     Yes
   11    Sunny      Mild          Normal     Strong   Yes
   12    Overcast   Mild          High       Strong   Yes
   13    Overcast   Hot           Normal     Weak     Yes
   14    Rain       Mild          High       Strong   No

(Rows 10-14 were truncated in the source; they are restored from the standard PlayTennis data set, consistent with the conditional probability tables on the next slide.)
Naive Bayesian : Example

   Outlook    Yes   No      Temp.   Yes   No      Humidity   Yes   No      Wind     Yes   No
   Sunny      2/9   3/5     Hot     2/9   2/5     High       3/9   4/5     Strong   3/9   3/5
   Overcast   4/9   0/5     Mild    4/9   2/5     Normal     6/9   1/5     Weak     6/9   2/5
   Rain       3/9   2/5     Cool    3/9   1/5

 x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong) = ?
 P(Ci|X) ∝ P(X|Ci) P(Ci)

 P(Yes|x') = [P(Sunny|Yes) × P(Cool|Yes) × P(High|Yes) × P(Strong|Yes)] × P(Play=Yes)
           = [2/9 × 3/9 × 3/9 × 3/9] × 9/14 = 0.0053
 P(No|x')  = [P(Sunny|No) × P(Cool|No) × P(High|No) × P(Strong|No)] × P(Play=No)
           = [3/5 × 1/5 × 4/5 × 3/5] × 5/14 = 0.0206

 Since 0.0206 > 0.0053, we label x' as "No".
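The same computation as a sketch, with the conditional probabilities hard-coded from the tables above:

    # P(x'|class) * P(class), under the conditional independence assumption.
    p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)  # = 0.0053
    p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)  # = 0.0206
    print("Yes" if p_yes > p_no else "No")          # -> No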


Rule-Based Classification
 Rules are a good way of representing information or bits of knowledge.
 A rule-based classifier uses a set of IF-THEN rules for classification.
 An IF-THEN rule is an expression of the form

    IF condition THEN conclusion

 For example:

    IF age = youth AND student = yes THEN buys_computer = yes

 The "IF" part (or left side) of a rule is known as the rule antecedent or precondition. The "THEN" part (or right side) is the rule consequent.


Rule-Based Classification
 In the rule antecedent, the condition consists of one or more attribute
tests (e.g., age = youth and student = yes) that are logically ANDed.
 The rule's consequent contains a class prediction.
 If the condition (i.e., all the attribute tests) in a rule antecedent holds true
for a given tuple, we say that the rule antecedent is satisfied and that the
rule covers the tuple.



Coverage and Accuracy
 Coverage:
 The fraction of records that satisfy the antecedent of a rule:

    coverage(R) = n_covers / |D|

 Accuracy:
 The fraction of records that satisfy both the antecedent and the consequent of a rule (over those that satisfy the antecedent):

    accuracy(R) = n_correct / n_covers

 where n_covers = number of records that satisfy the antecedent of rule R, |D| = number of records in D, and n_correct = number of records correctly classified by R.

   Day   Outlook    Wind     PlayTennis
   1     Sunny      Weak     Yes
   2     Sunny      Strong   No
   3     Overcast   Weak     Yes
   4     Rain       Weak     Yes
   5     Rain       Weak     Yes
   6     Rain       Strong   No
   7     Overcast   Strong   Yes
   8     Sunny      Weak     No
   9     Sunny      Weak     Yes
   10    ...        ...      ...

 R = (Outlook=Sunny) => PlayTennis=Yes
 Coverage = 4/10 = 40%
 Accuracy = 2/4 = 50%
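A small sketch of these two measures on the nine fully visible records (the tenth record is truncated in the source, so the slide's coverage of 4/10 uses one more record than this sketch):

    # Records as (Outlook, Wind, PlayTennis); rule R: IF Outlook = Sunny THEN PlayTennis = Yes.
    D = [("Sunny", "Weak", "Yes"), ("Sunny", "Strong", "No"), ("Overcast", "Weak", "Yes"),
         ("Rain", "Weak", "Yes"), ("Rain", "Weak", "Yes"), ("Rain", "Strong", "No"),
         ("Overcast", "Strong", "Yes"), ("Sunny", "Weak", "No"), ("Sunny", "Weak", "Yes")]
    covers  = [r for r in D if r[0] == "Sunny"]     # antecedent satisfied
    correct = [r for r in covers if r[2] == "Yes"]  # consequent also satisfied
    print(len(covers) / len(D))                     # coverage
    print(len(correct) / len(covers))               # accuracy = 2/4 = 0.5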
Resolution Strategy
 X = (age = youth, income = medium, student = yes, credit_rating = fair)?
 We would like to classify X according to buys_computer. X satisfies R1, which triggers the rule.
 A tuple X may satisfy more than one rule simultaneously.
 This can lead to a challenge when faced with conflicting class predictions from different triggered rules, or when no rule is satisfied for X.
 Two strategies are available, namely size ordering and rule ordering.
Size Ordering:
 In the size ordering strategy, rules are assigned a priority based on their complexity.
 The rule with the most conditions or the most specific conditions is given higher priority.
 When conflicts arise, the rule with higher complexity or specificity takes precedence and determines the class prediction.
Resolution Strategy
Rule ordering:
 Class-based ordering:
 The classes are sorted in order of decreasing "importance."
 All the rules for the most prevalent (or most frequent) class come first, the rules for the next prevalent class come next, and so on.
 Within each class, the rules are not ordered; they don't have to be, because they all predict the same class.

 Rule-based ordering:
 The rules are organized into one long priority list, according to some measure of rule quality, such as accuracy, coverage, or size (number of attribute tests in the rule antecedent).
 The rule that appears earliest in the list has the highest priority, and so it gets to fire its class prediction. Any other rule that satisfies X is ignored.


Resolution Strategy
 Consider a scenario where no rule is satisfied by X.
 How do we determine the class label of X?
 In this case, a fallback or default rule can be set up to specify a default class, based on the training set.
 The default rule is evaluated at the end, if and only if no other rule covers X.
 The condition in the default rule is empty. In this way, the rule fires when no other rule is satisfied.


Rule Extraction from a Decision Tree
 This converts the tree into a set of rules.
 One rule is created for each leaf.
 The antecedent contains a condition for every node on the path from the root to the leaf.
 This is straightforward, but the rule set might be overly complex.
 Each splitting criterion along a given path is logically ANDed to form the rule antecedent ("IF" part).
 The leaf node holds the class prediction, forming the rule consequent ("THEN" part).


Rule Extraction from a Decision Tree

                        age?
           /             |             \
        youth       middle_aged       senior
          |              |               |
       student?         Yes       credit_rating?
        /    \                      /        \
       no    yes                  fair    excellent
       |      |                    |          |
       No    Yes                  No         Yes

R1: IF age = youth AND student = no THEN buys_computer = no
R2: IF age = youth AND student = yes THEN buys_computer = yes
R3: IF age = middle_aged THEN buys_computer = yes
R4: IF age = senior AND credit_rating = excellent THEN buys_computer = yes
R5: IF age = senior AND credit_rating = fair THEN buys_computer = no
Rule Extraction from a Decision Tree
 Because the rules are extracted directly from the tree, they are mutually exclusive and exhaustive.
Mutually exclusive rules:
 No two rules are triggered by the same record.
 This ensures that every record is covered by at most one rule.

Exhaustive rules:
 There exists a rule for each combination of attribute values.
 This ensures that every record is covered by at least one rule.

 Together, these properties ensure that every record is covered by exactly one rule.


Rule Induction Using a Sequential Covering Algorithm
 IF-THEN rules can be extracted directly from the training data (i.e., without having to generate a decision tree first) using a sequential covering algorithm.
 The name comes from the notion that the rules are learned sequentially (one at a time).
 There are many sequential covering algorithms. Popular variations include AQ, CN2, and the more recent RIPPER.
 The general strategy is as follows:
 Rules are learned one at a time.
 Each time a rule is learned, the tuples covered by the rule are removed, and
 the process repeats on the remaining tuples.


Algorithm: Sequential covering
INPUT:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.

OUTPUT:
A set of IF-THEN rules.

Rule_set = {};  // initial set of rules learned is empty
for each class c do
    repeat
        Rule = Learn_One_Rule(D, Att_vals, c);
        remove tuples covered by Rule from D;
        Rule_set = Rule_set + Rule;  // add new rule to rule set
    until terminating condition;
endfor
return Rule_set;
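A toy sketch of this loop in Python; here Learn_One_Rule greedily picks the single attribute = value test with the best accuracy for the class, which is far simpler than real AQ/CN2/RIPPER rule growing (all names are illustrative):

    def learn_one_rule(D, att_vals, c):
        # Greedily pick the (attribute, value) test with the highest accuracy for class c.
        best, best_acc = None, -1.0
        for a, values in att_vals.items():
            for v in values:
                covered = [t for t in D if t[a] == v]
                if covered:
                    acc = sum(t["class"] == c for t in covered) / len(covered)
                    if acc > best_acc:
                        best, best_acc = (a, v), acc
        return best

    def sequential_covering(D, att_vals, classes):
        rule_set = []
        for c in classes:
            while any(t["class"] == c for t in D):  # terminating condition: class exhausted
                rule = learn_one_rule(D, att_vals, c)
                if rule is None:
                    break
                a, v = rule
                D = [t for t in D if t[a] != v]     # remove tuples covered by the rule
                rule_set.append((rule, c))          # IF a = v THEN class = c
        return rule_set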


Algorithm: Sequential covering
 The process continues until the terminating condition is met, such as when there are no more training tuples or the quality of a rule returned is below a user-specified threshold.
 The Learn_One_Rule procedure finds the "best" rule for the current class, given the current set of training tuples.
 How are rules learned?
 Typically, rules are grown in a general-to-specific manner. We can think of this as a beam search,
 where we start off with an empty rule and then gradually keep appending attribute tests to it.
 In the loan example, the classifying attribute is loan_decision, which indicates whether a loan is accepted (considered safe) or rejected (considered risky).
 To learn a rule for the class "accept," we start off with the most general rule possible, that is, the condition of the rule antecedent is empty. The rule is:

    IF _ THEN loan_decision = accept


Algorithm: Sequential covering

    IF _ THEN loan_decision = accept
       -> IF loan_term = short THEN loan_decision = accept
       -> IF loan_term = long THEN loan_decision = accept
       -> IF income = high THEN loan_decision = accept
             -> IF income = high AND credit_rating = excellent THEN loan_decision = accept
       -> IF income = medium THEN loan_decision = accept
       -> ...

(Figure flattened into text: the general-to-specific search tree of candidate rules, where each step appends one more attribute test to the current rule.)
Algorithm: Sequential covering
 Learn_One_Rule adopts a greedy depth-first strategy.
 Each time it is faced with adding a new attribute test (conjunct) to the current rule,
 it picks the one that most improves the rule quality, based on the training samples.
 Greedy search does not allow for backtracking.
 At each step, we heuristically add what appears to be the best choice at the moment.


Model Evaluation and Selection
 Now that you may have built a classification model, there may be many questions going through your mind.
 For example, suppose you used data from previous sales to build a classifier to predict customer purchasing behavior.
 You would like an estimate of how accurately the classifier can predict the purchasing behavior of future customers, that is, future customer data on which the classifier has not been trained.
 You may even have tried different methods to build more than one classifier and now wish to compare their accuracy.

    Data -> Training Set -> derive model -> estimate accuracy on the Test Set
Metrics for Performance Evaluation: Confusion Matrix

   Actual \ Predicted   Yes   No
   Yes                  TP    FN
   No                   FP    TN

 TP : True Positive; FN : False Negative; FP : False Positive; TN : True Negative
 True positives: these refer to the positive tuples that were correctly labeled by the classifier.
 True negatives: these are the negative tuples that were correctly labeled by the classifier.
 False positives: these are the negative tuples that were incorrectly labeled as positive.
 False negatives: these are the positive tuples that were mislabeled as negative.


Metrics for Performance Evaluation: Confusion Matrix

Training Data Set (used to build the model):

   Patient   BP    BMI    Diabetes
   1         120   25.5   No
   2         130   30.1   Yes
   3         110   22.8   No
   4         140   28.7   Yes
   5         125   24.0   No
   6         118   26.3   No
   7         135   31.2   Yes

Testing Data Set, with the class predicted by the model:

   Patient   BP    BMI    Diabetes (actual)   Predicted by Model
   8         128   27.5   Yes                 Yes
   9         130   23.8   No                  Yes
   10        122   29.6   Yes                 Yes
   11        115   24.9   No                  No
   12        142   32.0   Yes                 No
   13        127   26.7   No                  No
   14        138   30.5   Yes                 No

Resulting confusion matrix:

   Actual \ Predicted   Yes   No
   Yes                  2     2
   No                   1     2
Accuracy (Recognition rate)
 The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier:

    Accuracy = (TP + TN) / (P + N)

 where P = TP + FN (total actual positives) and N = FP + TN (total actual negatives).

 For the diabetes example:

   Actual \ Predicted   Yes   No   Total
   Yes                  2     2    4
   No                   1     2    3

    Accuracy = (2 + 2) / 7 = 0.57


Error rate (Misclassification rate)
 The error rate or misclassification rate of a classifier M is simply 1 - accuracy(M), where accuracy(M) is the accuracy of M. This can also be computed as

    Error rate = (FP + FN) / (P + N)

 For the diabetes example:

   Actual \ Predicted   Yes   No   Total
   Yes                  2     2    4
   No                   1     2    3

    Error rate = (1 + 2) / 7 = 0.43


Class imbalance problem
 This arises where the main class of interest is rare.
 In medical data, there may be a rare class, such as "cancer."
 Suppose that you have trained a classifier to classify medical data tuples, where the class label attribute is "cancer" and the possible class values are "yes" and "no."
 An accuracy rate of, say, 97% may make the classifier seem quite accurate, but what if only, say, 3% of the training tuples are actually cancer? Clearly, an accuracy rate of 97% may not be acceptable: a classifier that labels every tuple "no" would achieve it while recognizing no cancer case at all.

    Sensitivity (the proportion of positive tuples that are correctly identified) = TP / P
    Specificity (the proportion of negative tuples that are correctly identified) = TN / N
    Accuracy = Sensitivity × (P / (P + N)) + Specificity × (N / (P + N))


Precision
 Exactness: what % of tuples that the classifier labeled as positive are actually positive?
 It calculates the ratio of correctly predicted positive instances to the total number of instances predicted as positive:

    Precision = TP / (TP + FP)

 For the diabetes example:

   Actual \ Predicted   Yes   No   Total
   Yes                  2     2    4
   No                   1     2    3

    Precision = 2 / (2 + 1) = 0.67
Recall
 Completeness: what % of positive tuples did the classifier label as positive?
 It calculates the ratio of true positives to the total number of actual positive instances:

    Recall = TP / (TP + FN) = TP / P

 For the diabetes example:

   Actual \ Predicted   Yes   No   Total
   Yes                  2     2    4
   No                   1     2    3

    Recall = 2 / (2 + 2) = 0.50


F1 and F Measure
 The F measure, also known as the F1 score or F-score, is the harmonic mean of precision and recall.
 It provides a balanced assessment of a model's performance by taking both false positives and false negatives into account.
 The F1 score ranges between 0 (worst) and 1 (best):

    F1 = 2 × Precision × Recall / (Precision + Recall)

 The Fβ measure is a weighted measure of precision and recall, where β is a non-negative real number:

    Fβ = (1 + β²) × Precision × Recall / (β² × Precision + Recall)

 It allows you to control the emphasis on precision or recall using the parameter β. When β > 1, it emphasizes recall more than precision, and when 0 < β < 1, it emphasizes precision more than recall.
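All of these metrics are available off the shelf; a sketch using the seven test predictions from the diabetes example (assuming scikit-learn):

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_true = ["Yes", "No", "Yes", "No", "Yes", "No", "Yes"]  # patients 8-14, actual
    y_pred = ["Yes", "Yes", "Yes", "No", "No", "No", "No"]   # predicted by the model
    print(accuracy_score(y_true, y_pred))                    # 0.571
    print(precision_score(y_true, y_pred, pos_label="Yes"))  # 0.667
    print(recall_score(y_true, y_pred, pos_label="Yes"))     # 0.500
    print(f1_score(y_true, y_pred, pos_label="Yes"))         # 0.571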


Evaluating Classifier Accuracy: Holdout
 Holdout method:
 The given data is randomly partitioned into two independent sets:
 a training set (e.g., 2/3) for model construction, and
 a test set (e.g., 1/3) for accuracy estimation.
 Random subsampling:
 A variation of holdout.
 Repeat holdout k times; accuracy = the average of the accuracies obtained.

    Data -> Training Set -> derive model -> estimate accuracy on the Test Set


Evaluating Classifier Accuracy: Cross-Validation Methods
 Cross-validation:
 Also known as k-fold cross-validation, where k = 10 is commonly recommended.
 Randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of approximately equal size.
 Training and testing are performed k times.
 At the i-th iteration, Di is used as the test set and the remaining subsets as the training set.
 That is, in the first iteration, subsets D2, ..., Dk collectively serve as the training set to obtain a first model, which is tested on D1.


Evaluating Classifier Accuracy: Cross-Validation Methods

Data is split into k = 10 folds: 1 2 3 4 5 6 7 8 9 10
Training and testing are performed k times:

   Iteration 1:  fold 1 = Test,  folds 2-10 = Train
   Iteration 2:  fold 2 = Test,  folds 1 and 3-10 = Train
   ...
   Iteration 10: fold 10 = Test, folds 1-9 = Train
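A minimal k-fold cross-validation sketch (synthetic data, assuming scikit-learn):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=200, random_state=0)
    # Each of the 10 folds serves as the test set exactly once.
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
    print(scores.mean())  # average accuracy over the 10 iterations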


Evaluating Classifier Accuracy: Bootstrap
 Bootstrap:
 The bootstrap method samples the given training tuples uniformly with replacement.
 That is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
 There are several bootstrap methods. A commonly used one is the .632 bootstrap, which works as follows:
 A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples.
 The data tuples that did not make it into the training set end up forming the test set.
 About 63.2% of the original data ends up in the bootstrap sample, and the remaining 36.8% forms the test set (since (1 - 1/d)^d ≈ e^(-1) = 0.368 for large d).
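The 63.2% figure is easy to check by simulation (a sketch using NumPy):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 10_000
    sample = rng.integers(0, d, size=d)  # sample d tuples with replacement
    print(np.unique(sample).size / d)    # fraction in the bootstrap sample, ~0.632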


Comparing Classifiers Based on ROC Curves
 Receiver operating characteristic (ROC) curves are a useful visual tool for comparing two classification models.
 ROC curves come from signal detection theory, developed during World War II for the analysis of radar images.
 An ROC curve for a given model shows the trade-off between the true positive rate (TPR) and the false positive rate (FPR).
 The area under the ROC curve is a measure of the accuracy of the model.
 Any increase in TPR occurs at the cost of an increase in FPR.

    TPR (sensitivity, the proportion of positive tuples correctly identified) = TP / P
    FPR (the proportion of negative tuples incorrectly labeled as positive) = FP / N = 1 - specificity


Comparing Classifiers Based on ROC Curves

   Tuple   Class   Prob.   TP   FP   TPR   FPR
   1       P       0.90    1    0    0.2   0.0
   2       P       0.80    2    0    0.4   0.0
   3       N       0.70    2    1    0.4   0.2
   4       P       0.60    3    1    0.6   0.2
   5       P       0.55    4    1    0.8   0.2
   6       N       0.54    4    2    0.8   0.4
   7       N       0.53    4    3    0.8   0.6
   8       N       0.51    4    4    0.8   0.8
   9       P       0.50    5    4    1.0   0.8
   10      N       0.40    5    5    1.0   1.0

 The probabilities are returned by a probabilistic classifier for each of the 10 tuples in a test set, sorted by decreasing probability order. Here P = 5 and N = 5, so TPR = TP/5 and FPR = FP/5.

(Figure: the ROC curve traced by these (FPR, TPR) points. The vertical axis represents the true positive rate and the horizontal axis the false positive rate; a model with perfect accuracy has an area of 1.0.)
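The same curve and its area can be obtained from a library; a sketch using the table's labels and probabilities (assuming scikit-learn):

    from sklearn.metrics import roc_auc_score, roc_curve

    y_true = [1, 1, 0, 1, 1, 0, 0, 0, 1, 0]  # P = 1, N = 0, in table order
    scores = [0.90, 0.80, 0.70, 0.60, 0.55, 0.54, 0.53, 0.51, 0.50, 0.40]
    fpr, tpr, thresholds = roc_curve(y_true, scores)  # the (FPR, TPR) points of the curve
    print(roc_auc_score(y_true, scores))              # area under the ROC curve (0.76 here)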
Techniques to Improve Classification Accuracy
 We focus on ensemble methods for the improvement of classification accuracy.
 These involve combining the predictions from multiple individual models (classifiers) to improve the overall performance.
 The individual classifiers vote, and a class label prediction is returned by the ensemble based on the collection of votes.
 Combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M*.

    Data -> sample 1 -> C1 \
    Data -> sample 2 -> C2  -> combine votes -> class prediction for new data
    Data -> ...      -> Ck /
Bagging
 Analogy: diagnosis based on multiple doctors' majority vote.
 Training:
 Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample).
 A classifier model Mi is learned for each training set Di.
 Classification: to classify an unknown sample X,
 each classifier Mi returns its class prediction, and
 the bagged classifier M* counts the votes and assigns the class with the most votes to X.
 Bagging can also be applied to the prediction of continuous values by taking the average of the predictions for a given test tuple.
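A bagging sketch (synthetic data, assuming scikit-learn; 10 trees, each trained on a bootstrap sample):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=200, random_state=0)
    # Each of the 10 estimators Mi is trained on a bootstrap sample Di of the training set.
    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0).fit(X, y)
    print(bag.predict(X[:1]))  # majority vote of the individual classifiers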


Boosting
 Analogy: consult several doctors, and base the diagnosis on a combination of weighted diagnoses, where each weight is assigned based on the previous diagnosis accuracy.
 How boosting works:
 Weights are assigned to each training tuple.
 A series of k classifiers is iteratively learned.
 After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi.
 The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy.
 The boosting algorithm can be extended for numeric prediction.
 Compared with bagging: boosting tends to have greater accuracy, but it also risks overfitting the model to misclassified data.
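The slide does not name a specific algorithm; AdaBoost is one well-known boosting method, sketched here (synthetic data, assuming scikit-learn):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier

    X, y = make_classification(n_samples=200, random_state=0)
    # Each successive learner concentrates on the tuples misclassified by its predecessors.
    boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
    print(boost.predict(X[:1]))  # accuracy-weighted vote of the individual classifiers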


Random Forest
 Random forest:
 Each classifier in the ensemble is a decision tree classifier, generated using a random selection of attributes at each node to determine the split.
 During classification, each tree votes and the most popular class is returned.
 Two methods to construct a random forest:
 Forest-RI (random input selection): randomly select, at each node, F attributes as candidates for the split at the node. The CART methodology is used to grow the trees to maximum size.
 Forest-RC (random linear combinations): creates new attributes (or features) that are a linear combination of the existing attributes (this reduces the correlation between individual classifiers).
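A Forest-RI-style sketch (synthetic data, assuming scikit-learn; max_features plays the role of F, the number of attributes considered at each node):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=200, random_state=0)
    rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0).fit(X, y)
    print(rf.predict(X[:1]))  # the most popular class across the trees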
