Classification Part 1
CLASSIFICATION
Task of assigning objects to one of several predefined categories.
Examples
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Detecting spam email messages based on message header and content
Categorizing news stories as finance, weather, entertainment, sports, etc.
Classifying galaxies based on their shape
CLASSIFICATION
[Figure: general classification framework — a learning algorithm induces a model from the labelled training set, and the model is then applied to a test set of records whose class labels are unknown.]
CONFUSION MATRIX
[Figure: confusion matrix — a table tabulating the counts of correct and incorrect predictions for each class.]
DECISION TREE
The series of questions and their possible answers can be organized in the form of a decision tree.
It is a hierarchical structure that consists of nodes and directed edges.
A tree has three types of nodes:
• Root node: exactly one, with no incoming edges and zero or more outgoing edges.
• Internal nodes: exactly one incoming edge and two or more outgoing edges.
• Leaf (terminal) nodes: exactly one incoming edge and no outgoing edges.
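The three node types can be represented with a minimal sketch (the Node class and its field names are illustrative, not from the slides):

```python
# Illustrative Node class: a node with a class label is a leaf; a node
# with a test and outgoing edges (children) is internal; the node with
# no incoming edge is the root.
class Node:
    def __init__(self, test=None, label=None):
        self.test = test        # attribute tested at an internal/root node
        self.children = {}      # outcome -> child Node (directed edges)
        self.label = label      # class label, set only at leaf nodes

    def is_leaf(self):
        return self.label is not None

# Root with two outgoing edges, each leading to a leaf.
root = Node(test="Refund")
root.children["Yes"] = Node(label="No")
root.children["No"] = Node(label="Yes")
print(root.is_leaf(), root.children["Yes"].is_leaf())  # False True
```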
HOW DOES A DECISION TREE WORK?
[Figure: example training data and a decision tree induced from it.]
ANOTHER EXAMPLE OF DECISION TREE
[Figure: an alternative decision tree induced from the same training set and applied to the test set — more than one tree can fit the same data.]
APPLY MODEL TO TEST DATA
Start from the root of the tree.
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
[Figure: decision tree — Refund (Yes → leaf NO; No → MarSt); MarSt (Married → leaf NO; Single, Divorced → TaxInc); TaxInc (< 80K → leaf NO; > 80K → leaf YES).]
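The traversal can be sketched directly in code; a hedged sketch, with attribute and label names following the slide's tree (Refund, MarSt, TaxInc):

```python
# Walk the slide's tree for one test record: each test routes the
# record down a branch until a leaf's class label is returned.
def predict(record):
    if record["Refund"] == "Yes":
        return "No"                              # Refund = Yes -> leaf NO
    if record["Marital Status"] == "Married":
        return "No"                              # MarSt = Married -> leaf NO
    return "Yes" if record["Taxable Income"] > 80 else "No"   # TaxInc test

test_record = {"Refund": "No", "Marital Status": "Married", "Taxable Income": 80}
print(predict(test_record))  # "No" -- the record reaches the Married leaf
```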
APPLY MODEL TO TEST DATA
Following Refund = No to the MarSt node, and Marital Status = Married to the leaf labelled NO, the traversal ends: assign Cheat to "No".
DECISION TREE CLASSIFICATION TASK
[Figure: decision tree classification task — a tree-induction algorithm learns a decision tree from the training set, and the tree is then applied as the model to the test set records with unknown class labels.]
DECISION TREE INDUCTION
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
GENERAL STRUCTURE OF HUNT’S ALGORITHM
Let Dt be the set of training records that reach a node t.
If all records in Dt belong to the same class, then t is a leaf node labelled with that class.
If Dt contains records that belong to more than one class, use an attribute test condition to split the data into smaller subsets. Recursively apply the procedure to each subset.
[Figure: node t with records Dt and an unresolved class label "?", alongside the training data table (Tid, Refund, Marital Status, Taxable Income, Cheat).]
HUNT’S ALGORITHM
[Figure: Hunt's algorithm growing the tree on the loan data. Step 1: a single leaf, Don't Cheat. Step 2: split on Refund (Yes → Don't Cheat; No → Don't Cheat). Step 3: under Refund = No, split on Marital Status (Married → Don't Cheat; Single, Divorced → Cheat). Step 4: under Single, Divorced, split on Taxable Income (< 80K → Don't Cheat; >= 80K → Cheat). The training table (Tid 1–6 shown) lists Refund, Marital Status, Taxable Income, Cheat.]
HUNT'S ALGORITHM
Predicting whether a loan applicant will repay her loan obligations or become delinquent and subsequently default on the loan.
• The initial tree contains a single node with class label Defaulted = "No", which means most borrowers successfully repaid their loans.
• However, the tree needs refinement.
HUNT’S ALGORITHM
Hunt’s algorithm will work if every combination of attribute values is present in the
training data and each combination has a unique class label.
These assumptions are too stringent for use in most practical scenarios.
Additional conditions are needed to handle the following cases:
It is possible for some of the child nodes created to be empty, i.e., there are no records associated with these nodes. This can happen if none of the training records has the combination of attribute values associated with such nodes.
In this case, the node is declared a leaf node with the same class label as the majority class of training records associated with its parent node.
HUNT’S ALGORITHM
If all the records associated with Dt have identical attribute values (except for the
class label), then it is not possible to split the records any further.
In this case, the node is declared a leaf node with the same class label as the
majority class of training records associated with this node.
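Hunt's recursive procedure, together with the two extra conditions above, can be sketched as follows (a hedged sketch: the attribute choice is a placeholder, since attribute selection is covered later):

```python
# Sketch of Hunt's algorithm: an empty child becomes a leaf with the
# parent's majority class; records with identical attribute values (or
# no attributes left) become a leaf with the node's majority class.
from collections import Counter

def majority(records, target):
    return Counter(r[target] for r in records).most_common(1)[0][0]

def hunt(records, attributes, target, parent_majority=None):
    if not records:                                  # empty node
        return parent_majority
    classes = {r[target] for r in records}
    if len(classes) == 1:                            # pure node: leaf
        return classes.pop()
    if not attributes or all(
        r[a] == records[0][a] for r in records for a in attributes
    ):                                               # cannot split further
        return majority(records, target)
    attr = attributes[0]   # placeholder choice; a real implementation
                           # picks the attribute with the best split
    node = {}
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        node[(attr, value)] = hunt(subset, attributes[1:], target,
                                   majority(records, target))
    return node

data = [{"Refund": "Yes", "Cheat": "No"},
        {"Refund": "No",  "Cheat": "Yes"},
        {"Refund": "No",  "Cheat": "Yes"}]
tree = hunt(data, ["Refund"], "Cheat")
print(tree)  # {('Refund', 'Yes'): 'No', ('Refund', 'No'): 'Yes'} (key order may vary)
```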
TREE INDUCTION
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
Continue expanding a node until either all records belong to the same class or all
records have identical attribute values
METHODS FOR EXPRESSING ATTRIBUTE TEST CONDITIONS
Binary split: divides the values into two subsets; the optimal partitioning must be found.
[Figure: two binary splits on CarType — {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports}.]
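Finding the optimal partitioning means considering every two-way grouping; for a k-valued nominal attribute there are 2^(k-1) − 1 distinct binary splits. A small sketch (the function name is illustrative):

```python
# Enumerate the distinct two-way partitions of a nominal attribute's
# values; for k values there are 2**(k-1) - 1 of them.
from itertools import combinations

def binary_splits(values):
    values = sorted(values)
    splits = []
    for size in range(1, len(values)):
        for left in combinations(values, size):
            right = tuple(v for v in values if v not in left)
            if left < right:       # keep each unordered pair only once
                splits.append((left, right))
    return splits

for left, right in binary_splits({"Family", "Sports", "Luxury"}):
    print(set(left), "vs", set(right))   # three splits in total
```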
SPLITTING BASED ON ORDINAL ATTRIBUTES
[Figure: splits on the ordinal attribute Size (Small, Medium, Large) — a multiway split into Small / Medium / Large; binary splits {Small, Medium} vs {Large} or {Small} vs {Medium, Large}, which preserve the order; and {Small, Large} vs {Medium}, which violates it.]
SPLITTING BASED ON CONTINUOUS ATTRIBUTES
[Figure: splits on the continuous attribute Taxable Income — a binary split (Taxable Income > 80K? Yes / No) or a multiway split into ordered ranges such as < 10K, ..., > 80K.]
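For the binary case, candidate thresholds are commonly taken as midpoints between consecutive distinct sorted values; a sketch with illustrative data:

```python
# Candidate split thresholds for a continuous attribute: midpoints
# between consecutive distinct values after sorting.
def candidate_thresholds(values):
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]   # illustrative data
print(candidate_thresholds(incomes)[:3])  # [65.0, 72.5, 80.0]
```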
The measures developed for selecting the best split are often based on the degree of impurity of the child nodes.
The smaller the degree of impurity, the more skewed the class distribution.
A node with class distribution (0,1) has zero impurity and a node with uniform class
distribution (0.5,0.5) has the highest impurity.
MEASURES OF NODE IMPURITY
Gini Index
Entropy
Misclassification error
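The three measures can be sketched over a node's class probability distribution (a minimal sketch; the function names are illustrative):

```python
# Sketch of the three impurity measures, each taking a node's class
# probability distribution p (entries summing to 1).
import math

def gini(p):
    return 1 - sum(pi ** 2 for pi in p)

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def misclassification_error(p):
    return 1 - max(p)

# (0, 1) is pure; (0.5, 0.5) is maximally impure under all three measures.
print(gini([0.0, 1.0]), gini([0.5, 0.5]))       # 0.0 0.5
print(entropy([0.5, 0.5]))                      # 1.0 (the maximum for 2 classes)
print(misclassification_error([0.5, 0.5]))      # 0.5
```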
ALGORITHM
The best attribute is selected and used as the test at the root node of the tree.
A descendant of the root node is then created for each possible value of this
attribute, and the training examples are sorted to the appropriate descendant
node (i.e., down the branch corresponding to the example's value for this
attribute).
The entire process is then repeated using the training examples associated with
each descendant node to select the best attribute to test at that point in the
tree.
This forms a greedy search for an acceptable decision tree, in which the
algorithm never backtracks to reconsider earlier choices.
ID3 ALGORITHM
Examples: Training Examples
Target-attribute: Attribute to be predicted
Attributes: List of predictors
We would like to select the attribute that is most useful for classifying examples.
We will define a statistical property, called information gain, that
measures how well a given attribute separates the training examples
according to their target classification
ENTROPY
A measure used from Information Theory in the ID3 algorithm and popularly used in
decision tree construction is that of Entropy.
Entropy of a dataset measures the impurity of the dataset.
Informally, Entropy can be considered to find out how disordered the dataset is.
ENTROPY
It has been shown that there is a relationship between entropy and information.
That is, higher the uncertainty or entropy of some data, implies more information is
required to completely describe that data.
In building a decision tree, the aim is to decrease the entropy of the dataset until we
reach leaf nodes at which point the subset that we are left with is pure, or has zero
entropy and represents instances all of one class.
ENTROPY
Let p(j | t) denote the fraction of records belonging to class j at node t. The entropy of node t is defined as
Entropy(t) = − Σj p(j | t) log2 p(j | t)
Decision tree induction algorithms choose a test condition that maximizes the gain
(∆).
Since I(parent) is the same for all test conditions, maximizing the gain is equivalent to minimizing the weighted average impurity of the child nodes.
When entropy is used as the impurity measure, the difference in entropy is known as the information gain, ∆gain.
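The gain computation can be sketched over class-count vectors (a hedged sketch; the parent counts 9/5 match the play-tennis example later in these slides):

```python
# Information gain = parent entropy minus the weighted average entropy
# of the children, written over (class count) vectors.
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    n = sum(parent)
    weighted = sum(sum(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Parent with counts (9 yes, 5 no) split three ways.
print(round(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # 0.247
```

(The slides report 0.246 for this split because intermediate terms are rounded before subtracting.)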
HOW TO FIND THE BEST SPLIT
Before splitting, the node has class counts C0: N00 and C1: N01, with impurity M0.
[Figure: two candidate splits — attribute A (Yes/No children with impurities M1 and M2, weighted average M12) and attribute B (Yes/No children with impurities M3 and M4, weighted average M34).]
Gain = M0 − M12 vs. M0 − M34: choose the attribute with the larger gain.
EXAMPLE
WHICH ATTRIBUTE TO SELECT?
[Figure: the 14-record play-tennis dataset with attributes Outlook, Temperature, Humidity, Wind and the target class.]
EXAMPLE
There are two classes of the target, play tennis: "yes" and "no". Therefore the entropy can be calculated as:
Entropy(S) = -pyes log2(pyes) - pno log2(pno)
Now, the next step is to select the most significant of the four input variables (Outlook, Temperature, Humidity, Wind), i.e., the one that splits the data most purely.
For this, we calculate the information gain that would result over the entire dataset after splitting on each attribute (Outlook, Temperature, Humidity, Wind).
EXAMPLE
Infogain(S | Outlook) = Entropy(S) - 5/14 Entropy(S | Outlook = Sunny) - 4/14 Entropy(S | Outlook = Overcast) - 5/14 Entropy(S | Outlook = Rain)
= 0.94 - (5/14)(-pyes log2 pyes - pno log2 pno) - (4/14)(-pyes log2 pyes - pno log2 pno) - (5/14)(-pyes log2 pyes - pno log2 pno)
= 0.94 - (5/14)(-2/5 log2 2/5 - 3/5 log2 3/5) - (4/14)(-4/4 log2 4/4) - (5/14)(-3/5 log2 3/5 - 2/5 log2 2/5)
= 0.94 - 0.347 - 0 - 0.347
= 0.246 bits
EXAMPLE
Infogain(S | Temperature) = Entropy(S) - 4/14 Entropy(S | Temperature = Hot) - 6/14 Entropy(S | Temperature = Mild) - 4/14 Entropy(S | Temperature = Cold)
= 0.94 - (4/14)(-pyes log2 pyes - pno log2 pno) - (6/14)(-pyes log2 pyes - pno log2 pno) - (4/14)(-pyes log2 pyes - pno log2 pno)
= 0.94 - (4/14)(-2/4 log2 2/4 - 2/4 log2 2/4) - (6/14)(-4/6 log2 4/6 - 2/6 log2 2/6) - (4/14)(-3/4 log2 3/4 - 1/4 log2 1/4)
= 0.94 - 0.286 - 0.392 - 0.233
= 0.029 bits
EXAMPLE
Infogain(S | Humidity) = Entropy(S) - 7/14 Entropy(S | Humidity = High) - 7/14 Entropy(S | Humidity = Normal)
= 0.94 - (7/14)(-pyes log2 pyes - pno log2 pno) - (7/14)(-pyes log2 pyes - pno log2 pno)
= 0.94 - (7/14)(-3/7 log2 3/7 - 4/7 log2 4/7) - (7/14)(-6/7 log2 6/7 - 1/7 log2 1/7)
= 0.94 - 0.493 - 0.296 = 0.151 bits
EXAMPLE
Infogain(S | Wind) = Entropy(S) - 6/14 Entropy(S | Wind = Strong) - 8/14 Entropy(S | Wind = Weak)
= 0.94 - (6/14)(-pyes log2 pyes - pno log2 pno) - (8/14)(-pyes log2 pyes - pno log2 pno)
= 0.94 - (6/14)(-1/2 log2 1/2 - 1/2 log2 1/2) - (8/14)(-6/8 log2 6/8 - 2/8 log2 2/8)
= 0.94 - 0.428 - 0.465 = 0.047 bits
EXAMPLE
Now, we select the attribute for the root node which will result in the highest
reduction in entropy.
Infogain(S | Outlook) = 0.246 bits
Infogain(S | Temperature) = 0.029 bits
Infogain(S | Humidity) = 0.151 bits
Infogain(S | Wind) = 0.047 bits
We can clearly see that the attribute Outlook results in the highest reduction in
entropy or the highest information gain.
We would therefore select Outlook at the root node, splitting the data up into subsets
corresponding to all the different values for the Outlook attribute.
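The four gains above can be checked numerically from the class counts implied by the fractions in the slides (9 "yes" and 5 "no" records overall); small differences in the third decimal come from rounding intermediate terms:

```python
# Recompute the four information gains from the per-value class counts
# (yes, no) implied by the fractions used above.
import math

def H(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gain(children):                      # parent: 9 yes, 5 no, 14 records
    return H([9, 5]) - sum(sum(c) / 14 * H(c) for c in children)

gains = {
    "Outlook":     gain([[2, 3], [4, 0], [3, 2]]),   # Sunny, Overcast, Rain
    "Temperature": gain([[2, 2], [4, 2], [3, 1]]),   # Hot, Mild, Cold
    "Humidity":    gain([[3, 4], [6, 1]]),           # High, Normal
    "Wind":        gain([[3, 3], [6, 2]]),           # Strong, Weak
}
print({k: round(v, 3) for k, v in gains.items()})
```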
HOLDOUT METHOD
The original data with labelled examples is partitioned into two disjoint sets, called
the training set and test set.
A classification model is then induced from the training set and its performance is
evaluated on the test set.
The proportion of data reserved for training and for testing is typically at the discretion of the analyst (e.g., 50-50, or two-thirds for training and one-third for testing).
The accuracy of the classifier can be estimated based on the accuracy of the induced
model on the test set.
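The holdout partition can be sketched in a few lines (a minimal sketch; the function name and seed are illustrative):

```python
# Holdout method: shuffle the labelled data once, then reserve
# two-thirds for training and one-third for testing.
import random

def holdout_split(records, train_fraction=2 / 3, seed=0):
    records = list(records)
    random.Random(seed).shuffle(records)   # fixed seed for repeatability
    cut = round(len(records) * train_fraction)
    return records[:cut], records[cut:]

train, test = holdout_split(range(30))
print(len(train), len(test))  # 20 10
```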
HOLDOUT METHOD
Limitations
Fewer labelled examples are available for training because some of the records are
withheld for testing. The induced model may not be as good as when all the labelled
examples are used for training.
The model may be highly dependent on the composition of the training and test sets.
The smaller the training set, the larger the variance of the model.
On the other hand, if the training set is too large, then the estimated accuracy computed
from the smaller test set is less reliable. Such an estimate is said to have a wide
confidence interval.
HOLDOUT METHOD
Limitations
The training and test sets are no longer independent of each other. Because the training and test sets are subsets of the original data, a class that is overrepresented in one subset will be underrepresented in the other, and vice versa.
RANDOM SUBSAMPLING
Limitations
It does not utilize as much data as possible for training.
It has no control over the number of times each record is used for testing and
training. Consequently, some records might be used for training more often than
others.
CROSS-VALIDATION
In this approach, each record is used the same
number of times for training and exactly once for
testing.
Example:
Partition the data into three equal-sized subsets A, B, and C.
Use two of the subsets (A, B) for training and the remaining one (C) for testing.
Then use the combination (A, C) for training and B for testing.
Finally, use the remaining combination (B, C) for training and A for testing.
This approach is called three-fold cross-validation.
CROSS-VALIDATION
The k-fold cross validation method generalizes this approach by segmenting the data
into k-equal sized partitions.
During each run, one of the partitions is used for testing, while the rest are used for training.
This procedure is repeated k times so that each partition is used for testing exactly
once.
The total error is found by summing up the errors for all the k runs.
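The procedure can be sketched as follows (a minimal sketch; the interleaved fold assignment is one simple way to get k roughly equal partitions):

```python
# k-fold cross-validation: each run tests on one partition and trains
# on the remaining k - 1, so every record is tested exactly once.
def k_fold_splits(records, k):
    records = list(records)
    folds = [records[i::k] for i in range(k)]       # k interleaved partitions
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

for train, test in k_fold_splits(range(9), k=3):
    print(len(train), len(test))  # 6 3, printed for each of the three runs
```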
CROSS-VALIDATION
A special case of k-fold cross-validation method sets k=N, the size of the dataset.
Called leave-one-out approach, each test set contains only one record.
This approach has the advantage of utilizing as much data as possible for training.
In addition, the test sets are mutually exclusive and they effectively cover the
entire data set.
The drawback of this approach is that it is computationally expensive to repeat the
procedure N times.
Furthermore, since each test set contains only one record, the variance of the estimated performance metric tends to be high.
BOOTSTRAP
The methods so far assume that the training records are sampled without
replacement. As a result, there are no duplicate records in the training and test sets.
Bootstrap Approach: Training records sampled with replacement.
If the original data has N records, it can be shown that, on average, a bootstrap sample of size N contains about 63.2% of the distinct records of the original data: the probability that a record is chosen at least once is 1 − (1 − 1/N)^N, which approaches 1 − 1/e ≈ 0.632 for large N.
Records that are not included in the bootstrap sample become part of the test set.
The model induced from the training set is then applied to the test set to obtain an estimate of the accuracy of the bootstrap sample, εi.
The sampling procedure is then repeated b times to generate b bootstrap samples.
BOOTSTRAP
.632 Bootstrap
It computes the overall accuracy by combining the accuracy of each bootstrap sample (εi) with the accuracy computed from the training set that contains all labelled examples in the original data (acc_s), i.e., no repeated labelled examples:
acc_boot = (1/b) Σi=1..b (0.632 · εi + 0.368 · acc_s)
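Both the 63.2% figure and the accuracy combination can be sketched numerically (the accuracy values εi and acc_s below are illustrative, not from the slides):

```python
# Sketch: sampling N records with replacement leaves roughly
# 1 - 1/e ~ 63.2% distinct records in the bootstrap sample, and the
# .632 bootstrap combines each sample accuracy eps_i with acc_s.
import random

rng = random.Random(0)
N = 10_000
sample = [rng.randrange(N) for _ in range(N)]   # one bootstrap sample
print(len(set(sample)) / N)                     # close to 0.632

def acc_boot(eps, acc_s):
    # average of 0.632 * eps_i + 0.368 * acc_s over the b samples
    return sum(0.632 * e + 0.368 * acc_s for e in eps) / len(eps)

print(round(acc_boot([0.70, 0.75, 0.72], acc_s=0.90), 3))  # 0.788
```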