Class Adv Classification V
Original Data:      1    2    3    4    5    6    7    8    9   10
Bagging (Round 1):  7    8   10    8    2    5   10   10    5    9
Bagging (Round 2):  1    4    9    1    2    3    2    7    3    2
Bagging (Round 3):  1    8    5   10    5    5    9    6    3    7
Example taken from Tan et al., “Introduction to Data Mining”
Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
– Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
– A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
– Each classifier Mi returns its class prediction
– The bagged classifier M* counts the votes and assigns the class with the
most votes to X
• Prediction: can be applied to the prediction of continuous values by taking
the average of the individual predictions for a given test tuple (see the sketch below)
• Accuracy
– Often significantly better than a single classifier derived from D
– For noisy data: not considerably worse, and more robust
– Proven to improve prediction accuracy
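A minimal Python sketch of this train-and-vote procedure (the helper names bagging_fit and bagging_predict are illustrative, not from the book; any classifier object with fit/predict methods can serve as the base model):

```python
import numpy as np
from collections import Counter

def bagging_fit(X, y, make_base_classifier, k=10, seed=0):
    """Learn k models M_1..M_k, each from a bootstrap sample D_i of size d
    drawn with replacement from D (some tuples repeat, others are left out)."""
    rng = np.random.default_rng(seed)
    d = len(y)
    models = []
    for _ in range(k):
        idx = rng.integers(0, d, size=d)          # sample d tuple indices with replacement
        models.append(make_base_classifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X_new):
    """M* collects each M_i's prediction and returns the majority class per tuple.
    (For continuous targets, replace the vote with the mean of the predictions.)"""
    votes = np.array([m.predict(X_new) for m in models])   # shape: (k, n_tuples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```

For instance, make_base_classifier could be lambda: DecisionTreeClassifier(max_depth=1) to bag the decision stumps used in the example that follows.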
Bagging
• Accuracy of bagging:
x:  0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
y:    1     1     1    -1    -1    -1    -1     1     1     1    (actual class labels)
Example taken from Tan et al., “Introduction to Data Mining”
Bagging
• Decision stump: a single-level binary decision tree
• The entropy criterion selects a split at x <= 0.35 or at x <= 0.75
• Accuracy is at most 70% on this data
Bagging
Example taken from Tan et al., “Introduction to Data Mining”
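To make the example concrete, here is a sketch (scikit-learn is an assumption here; the slides do not prescribe a library) that fits a single entropy-based stump and ten bagged stumps to the data above:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

X = np.linspace(0.1, 1.0, 10).reshape(-1, 1)           # x = 0.1, 0.2, ..., 1.0
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])       # actual class labels

# A single entropy-based stump: one split, e.g. x <= 0.35 or x <= 0.75
stump = DecisionTreeClassifier(max_depth=1, criterion="entropy").fit(X, y)
print("single stump accuracy:", stump.score(X, y))      # at most 0.7 on this data

# Ten stumps, each trained on a bootstrap sample, combined by majority vote
bagged = BaggingClassifier(DecisionTreeClassifier(max_depth=1, criterion="entropy"),
                           n_estimators=10, random_state=0).fit(X, y)
print("bagged stumps accuracy:", bagged.score(X, y))    # the votes combine different thresholds
```

No single threshold can reproduce the pattern (+1 for small and large x, -1 in between), so one stump is capped at 70%; stumps trained on different bootstrap samples pick different thresholds, and their majority vote can capture the pattern.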
Boosting
• Equal weights are assigned to each training tuple (1/d
for round 1)
• After a classifier Mi is learned, the weights are
adjusted to allow the subsequent classifier Mi+1 to
“pay more attention” to tuples that were
misclassified by Mi.
• Final boosted classifier M* combines the votes of
each individual classifier
• Weight of each classifier’s vote is a function of its
accuracy
• AdaBoost – a popular boosting algorithm
AdaBoost
• Input:
– Training set D containing d tuples
– k rounds
– A classification learning scheme
• Output:
– A composite model
AdaBoost
• Data set D containing d class-labeled tuples (X1, y1), (X2, y2), (X3, y3), ..., (Xd, yd)
• Initially assign equal weight 1/d to each tuple
• To generate k base classifiers, we need k rounds or
iterations
• In round i, tuples from D are sampled with
replacement to form Di (of size d)
• Each tuple’s chance of being selected depends on its
weight
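A sketch of this weighted sampling step (the helper name is illustrative; np.random.choice is just one possible mechanism):

```python
import numpy as np

def weighted_bootstrap(X, y, weights, seed=None):
    """Draw a training set D_i of size d from D, sampling with replacement
    so that tuple j is selected with probability proportional to w_j."""
    rng = np.random.default_rng(seed)
    d = len(y)
    idx = rng.choice(d, size=d, replace=True, p=weights / weights.sum())
    return X[idx], y[idx]
```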
AdaBoost
• Base classifier Mi is derived from the training tuples of Di
• Error of Mi is tested using Di
• Weights of training tuples are adjusted depending on
how they were classified
– Correctly classified: Decrease weight
– Incorrectly classified: Increase weight
• A tuple's weight indicates how hard it is to classify: the harder to classify, the higher the weight
AdaBoost
• Some classifiers may be better at classifying some
“hard” tuples than others
• We finally have a series of classifiers that complement
each other!
• Error rate of model Mi (where err(Xj) is 1 if Xj is misclassified and 0 otherwise):
$$\mathrm{error}(M_i) = \sum_{j=1}^{d} w_j \cdot \mathrm{err}(X_j)$$
• Error rate:
$$\epsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, I\big(C_i(x_j) \neq y_j\big)$$
• Importance of a classifier:
$$\alpha_i = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_i}{\epsilon_i}\right)$$
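As a worked illustration (the values are chosen here, not taken from the slides): a classifier with weighted error $\epsilon_i = 0.3$ receives vote weight

$$\alpha_i = \frac{1}{2}\ln\frac{1 - 0.3}{0.3} \approx 0.42,$$

while $\epsilon_i = 0.5$ (no better than random guessing between two classes) gives $\alpha_i = 0$, and $\epsilon_i = 0.05$ gives $\alpha_i \approx 1.47$: more accurate classifiers get a larger say in the final vote.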
Example: AdaBoost
• Weight update:
$$w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} \exp(-\alpha_j) & \text{if } C_j(x_i) = y_i \\ \exp(\alpha_j) & \text{if } C_j(x_i) \neq y_i \end{cases}$$
where $Z_j$ is the normalization factor
• Final combined classifier:
$$C^{*}(x) = \arg\max_{y} \sum_{j=1}^{T} \alpha_j \, I\big(C_j(x) = y\big)$$
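A compact Python sketch of these rules (an illustration only: decision stumps are assumed as base classifiers, the weights are kept normalized so they sum to 1, and degenerate rounds are simply skipped rather than resampled as in the textbook versions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=10, seed=0):
    """k boosting rounds; assumes labels y are in {-1, +1}."""
    rng = np.random.default_rng(seed)
    d = len(y)
    w = np.full(d, 1.0 / d)                            # round 1: equal weights 1/d
    models, alphas = [], []
    for _ in range(k):
        idx = rng.choice(d, size=d, replace=True, p=w)   # sample D_i by weight
        M = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        miss = M.predict(X) != y                       # tuples M_i misclassifies
        eps = np.sum(w * miss)                         # weighted error of M_i
        if eps == 0 or eps >= 0.5:                     # degenerate round: skip it
            continue                                   # (simplified; the book resamples/resets)
        alpha = 0.5 * np.log((1 - eps) / eps)          # importance of the classifier
        w = w * np.exp(np.where(miss, alpha, -alpha))  # raise weights of misclassified tuples
        w = w / w.sum()                                # Z_j: normalization factor
        models.append(M)
        alphas.append(alpha)
    return models, np.array(alphas)

def adaboost_predict(models, alphas, X_new):
    """C*(x): sign of the alpha-weighted vote (for {-1, +1} labels)."""
    scores = sum(a * M.predict(X_new) for a, M in zip(alphas, models))
    return np.sign(scores)
```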
Illustrating AdaBoost
[Figure from Tan et al.: three boosting rounds (B1, B2, B3) on the example data]
Round 1 (B1): + + + - - - - - - -   = 1.9459   (weights shown: 0.0094, 0.0094, 0.4623)
Round 2 (B2): - - - - - - - - + +   = 2.9323   (weights shown: 0.3037, 0.0009, 0.0422)
Round 3 (B3): + + + + + + + + + +   = 3.8744   (weights shown: 0.0276, 0.1819, 0.0038)
Overall:      + + + - - - - - + +
• Example:
x:  0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
y:    1     1     1    -1    -1    -1    -1     1     1     1    (actual class labels)
Example taken from Tan et al., “Introduction to Data Mining”
Random Forests
• Ensemble method specifically designed for decision tree classifiers
• A random forest grows many classification trees (hence the name!)
• Combines predictions made by many decision trees.
• Each tree is generated based on a bootstrap sample and the
values of an independent set of random vectors.
• The random vectors are generated from a fixed probability
distribution.
• Ensemble of unpruned decision trees
• Each base classifier classifies a “new” vector
• Forest chooses the classification having the most votes (over all
the trees in the forest)
Random Forests
• Introduce two sources of randomness: “Bagging” and
“Random input vectors”
– Each tree is grown using a bootstrap sample of training data
– At each node, best split is chosen from random sample of
mtry variables instead of all variables
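These two sources of randomness map directly onto scikit-learn's RandomForestClassifier (shown as a usage sketch; the synthetic data and parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real training set
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # grow many trees, each on its own bootstrap sample
    max_features="sqrt",   # mtry: random subset of variables considered at each split
    bootstrap=True,        # bagging of the training data
    random_state=0,
).fit(X, y)

print(forest.predict(X[:5]))   # each tree votes; the forest returns the majority class
```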
Randomizing Random Forests
• Each decision tree uses a random vector that is
generated from some fixed probability distribution.
This randomness can be incorporated in many ways:
1. Randomly select F input features to split at each
node (Forest-RI).
2. Create linear combinations of the input features to
split at each node (Forest-RC).
3. Randomly select one of the F best splits at each
node.
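A sketch of option 1 (Forest-RI) at a single node, using Gini impurity for concreteness (the helper names and the impurity choice are assumptions; the slides only require choosing F features at random):

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_forest_ri(X_node, y_node, F, rng):
    """Forest-RI sketch: at this node, consider only F randomly chosen input
    features and return the (gain, feature, threshold) of the best split."""
    features = rng.choice(X_node.shape[1], size=F, replace=False)
    parent = gini(y_node)
    best = None
    for f in features:
        for t in np.unique(X_node[:, f]):
            left = X_node[:, f] <= t
            if left.all() or not left.any():
                continue                                  # split separates nothing
            gain = parent - (left.mean() * gini(y_node[left])
                             + (~left).mean() * gini(y_node[~left]))
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best
```

With rng = np.random.default_rng(0), best_split_forest_ri(X, y, F=3, rng) returns the best of the splits restricted to three randomly chosen features; a commonly cited default for Forest-RI is F on the order of log2(number of features) + 1.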
Why use Random Vectors?