Ensemble Learning, Decision Trees
• The No Free Lunch Theorem states that there is no single learning algorithm that in any domain always
induces the most accurate learner.
• The usual approach is to try many and choose the one that performs the best on a separate validation set.
• We may combine multiple learning algorithms, or the same algorithm with different hyperparameters, to build an ensemble of classifiers.
Bagging
• Bootstrap Aggregating, also known as bagging, is a machine learning ensemble meta-algorithm designed
to improve the stability and accuracy of machine learning algorithms used in statistical classification and
regression.
• It decreases the variance and helps to avoid overfitting.
• It is usually applied to decision tree methods.
• Bagging is a special case of the model averaging approach.
Description of the Technique
• Suppose a set D of d tuples. At each iteration i, a training set Di of d tuples is selected from D via row sampling with replacement (i.e., the same tuple can appear more than once in Di); this sampling is the bootstrap.
• Then a classifier model Mi is learned for each training set Di.
• Each classifier Mi returns its class prediction.
• The bagged classifier M* counts the votes and assigns the class with the most votes to X (unknown sample).
Implementation Steps of Bagging
•Step 1: Multiple subsets are created from the original data set with equal tuples, selecting observations with
replacement.
•Step 2: A base model is created on each of these subsets.
•Step 3: Each model is learned in parallel with each training set and independent of each other.
•Step 4: The final predictions are determined by combining the predictions from all the models.
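A minimal sketch of these four steps, assuming NumPy arrays, scikit-learn decision trees as the base models, and integer-encoded class labels (illustrative, not the only way to implement bagging):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=10, random_state=0):
    """Steps 1-3: draw a bootstrap sample of rows and train one base model per sample."""
    rng = np.random.default_rng(random_state)
    models, n = [], len(X)
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)                  # row sampling with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Step 4: combine the models' predictions by majority vote (assumes integer labels)."""
    preds = np.array([m.predict(X) for m in models])      # shape (n_models, n_samples)
    return np.array([np.bincount(col).argmax() for col in preds.T])
```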
How does Bagging Classifier Work?
The basic steps of how a bagging classifier works are as follows:
•Bootstrap Sampling: In bootstrap sampling, 'n' subsets of the original training data are sampled randomly with replacement. This step ensures that the base models are trained on diverse subsets of the data, as some samples may appear multiple times in a given subset while others may be omitted. It reduces the risk of overfitting and improves the accuracy of the model.
•Base Model Training: In bagging, multiple base models are used. After the Bootstrap Sampling, each base model
is independently trained using a specific learning algorithm, such as decision trees, support vector machines, or neural
networks on a different bootstrapped subset of data. These models are typically called “Weak learners” because they may
not be highly accurate on their own. Since each base model is trained independently on a different subset of the data, the base models can be trained in parallel, which makes the method computationally efficient and less time-consuming.
•Aggregation: Once all the base models are trained, they are used to make predictions on unseen data, i.e., data on which a given base model was not trained. In the bagging classifier, the predicted class label for a given instance is chosen by majority voting: the class that receives the most votes is the model's prediction.
•Out-of-Bag (OOB) Evaluation: Some samples are excluded from the training subset of particular base models during
the bootstrapping method. These “out-of-bag” samples can be used to estimate the model’s performance without the need
for cross-validation.
•Final Prediction: After aggregating the predictions from all the base models, Bagging produces a final prediction for
each instance.
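For comparison, scikit-learn bundles this whole workflow (bootstrap sampling, independent and parallel base-model training, majority-vote aggregation, and OOB evaluation) in BaggingClassifier. The synthetic dataset and hyperparameter values below are placeholders, and on scikit-learn versions before 1.2 the `estimator` argument is named `base_estimator`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),   # base (weak) learner
    n_estimators=50,                      # number of bootstrapped base models
    oob_score=True,                       # evaluate on the out-of-bag samples
    n_jobs=-1,                            # train the base models in parallel
    random_state=0,
).fit(X, y)

print("OOB accuracy estimate:", bag.oob_score_)
print("Majority-vote predictions:", bag.predict(X[:5]))
```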
Example of Bagging
The Random Forest model uses bagging with decision trees, which are high-variance base models. It additionally selects a random subset of features when growing each tree. Many such random trees together make a Random Forest.
Advantages of Bagging
•The biggest advantage of bagging is that multiple weak learners can work better than a single strong learner.
•It provides stability and increases the accuracy of machine learning algorithms used in statistical classification and regression.
•It helps in reducing variance, i.e., it avoids overfitting.
Disadvantages of Bagging
•It may result in high bias if it is not modeled properly and thus may result in underfitting.
•Since we must use multiple models, it becomes computationally expensive and may not be suitable in
various use cases.
Boosting
• Boosting is an ensemble modeling technique designed to create a strong classifier by combining
multiple weak classifiers.
• The process involves building models sequentially, where each new model aims to correct the
errors made by the previous ones.
• Initially, a model is built using the training data.
• Subsequent models are then trained to address the mistakes of their predecessors.
• Boosting assigns weights to the data points in the original dataset.
• Higher weights: Instances that were misclassified by the previous model receive higher weights.
• Lower weights: Instances that were correctly classified receive lower weights.
• Training on weighted data: The subsequent model learns from the weighted dataset, focusing its
attention on harder-to-learn examples (those with higher weights).
• This iterative process continues until:
• The entire training dataset is accurately predicted, or
• A predefined maximum number of models is reached.
Boosting Algorithms
• There are several boosting algorithms.
• The original ones, proposed by Robert Schapire and Yoav Freund were not adaptive and could not take
full advantage of the weak learners.
• Schapire and Freund then developed AdaBoost, an adaptive boosting algorithm that won the prestigious
Gödel Prize.
• AdaBoost was the first really successful boosting algorithm developed for the purpose of binary
classification.
• AdaBoost is short for Adaptive Boosting and is a very popular boosting technique that combines multiple
“weak classifiers” into a single “strong classifier”.
• In boosting, we take records from the dataset and pass them to base learners sequentially; here, base learners can be any
model. Suppose we have m number of records in the dataset.
• Then, we pass a few records to base learner BL1 and train it.
• Once the BL1 gets trained, then we pass all the records from the dataset and see how the Base learner works.
• Only the records classified incorrectly by BL1 are taken and passed on to the next base learner, say BL2; in the same way, the records misclassified by BL2 are passed on to train BL3.
• This continues until we reach the specified number of base learner models.
• Finally, we combine the output from these base learners and create a strong learner; thus, the model’s prediction power
improves.
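A minimal AdaBoost sketch with scikit-learn, using decision stumps as the weak classifiers; the dataset and hyperparameters are illustrative, and older scikit-learn versions call the `estimator` argument `base_estimator`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new stump is fitted to a reweighted training set that emphasises the
# examples the previous stumps misclassified.
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),   # weak learner (decision stump)
    n_estimators=100,
    learning_rate=0.5,
    random_state=0,
).fit(X_train, y_train)

print("Test accuracy of the combined strong classifier:", ada.score(X_test, y_test))
```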
Boosting works with the following steps:
•We sample m-number of subsets from an initial training dataset.
•Using the first subset, we train the first weak learner.
•We test the trained weak learner using the training data. As a result of the testing, some data points
will be incorrectly predicted.
•Each data point with the wrong prediction is sent into the second subset of data, and this subset is
updated.
•Using this updated subset, we train and test the second weak learner.
•We continue with the next subset until reaching the total number of subsets.
•We now have the total prediction, since the overall prediction has been aggregated at each step.
Reducing Bias by Boosting
• We use boosting to combine weak learners with high bias.
• Boosting aims to produce a model with a lower bias than the individual models.
• Boosting involves sequentially training weak learners.
• Here, each subsequent learner improves the errors of previous learners in the sequence.
• A sample of data is first taken from the initial dataset.
• This sample is used to train the first model, and the model makes its prediction.
• The samples can either be correctly or incorrectly predicted.
• The samples that are wrongly predicted are reused for training the next model.
• In this way, subsequent models can improve on the errors of previous models.
• Boosting aggregates the results at each step.
• Weighted averaging involves giving all models different weights depending on their predictive power.
• In other words, it gives more weight to the model with the highest predictive power.
• This is because the learner with the highest predictive power is considered the most important.
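The weighted-averaging idea can be sketched as follows: each weak learner votes with a weight that reflects its predictive power, and the sign of the weighted sum gives the final class. This is a simplified, AdaBoost-style illustration with made-up weights, not a full boosting implementation:

```python
import numpy as np

def weighted_vote(predictions, weights):
    """predictions: (n_models, n_samples) array of labels in {-1, +1};
    weights: one weight per model, larger for more accurate learners."""
    scores = np.einsum("m,ms->s", np.asarray(weights, float), np.asarray(predictions, float))
    return np.sign(scores)          # final class = sign of the weighted sum

# Three weak learners voting on four samples; the second learner is trusted most.
preds = np.array([[ 1, -1,  1,  1],
                  [-1, -1,  1, -1],
                  [ 1,  1, -1, -1]])
print(weighted_vote(preds, [0.4, 1.2, 0.3]))    # -> [-1. -1.  1. -1.]
```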
Advantages of Boosting
•It is one of the most successful techniques for solving two-class classification problems.
•It is good at handling the missing data.
Disadvantages of Boosting
•Boosting is hard to implement in real time due to the increased complexity of the algorithm.
•The high flexibility of these techniques results in multiple numbers of parameters that directly affect the behavior of
the model.
Similarities and Differences between Bagging and Boosting
Decision Trees
• Decision Trees are a type of Supervised Machine Learning where the data is continuously split
according to a certain parameter. The tree can be explained by two entities, namely decision
nodes and leaves. The leaves are the decisions or the final outcomes.
• And the decision nodes are where the data is split.
• An example of a decision tree can be explained using the binary tree above. Let's say you want to predict whether a person is fit, given information such as their age, eating habits, and physical activity.
• The decision nodes here are questions like 'What is the age?', 'Does the person exercise?', 'Does the person eat a lot of pizza?', and the leaves are outcomes such as 'fit' or 'unfit'.
• In this case, this was a binary classification problem (a yes/no type problem).
• A tree structure resembling Flow Charts
• Leaf Node: Holds the Class Labels (Terminal Nodes)
• Internal Node: Non–Leaf Nodes → denotes a test on an attribute
• Branch: Outcome of the test
• Root Node: The topmost node from where all the branches originate
• Given a test tuple, x, the feature values are tested against the decision tree.
• A path is traced from the root to a leaf node
• Leaf node holds the classification results
• Types of Decision Trees: Iterative Dichotomiser 3 (ID3), C4.5, C5.0 (latest)
Advantages
• No domain knowledge is required
• Easy to interpret
• Can handle multidimensional data
• Simple steps for learning and classification
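A small scikit-learn sketch of learning and inspecting such a tree; the "fit/unfit" data below is invented purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: [age, exercises (0/1), pizzas per week] -> "fit" / "unfit"
X = [[25, 1, 1], [40, 0, 6], [33, 1, 2], [52, 0, 4], [19, 0, 7], [61, 1, 1]]
y = ["fit", "unfit", "fit", "unfit", "unfit", "fit"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)

# Internal nodes are tests on features, branches are test outcomes, leaves are classes.
print(export_text(tree, feature_names=["age", "exercises", "pizzas_per_week"]))
print(tree.predict([[30, 1, 2]]))   # a test tuple is traced from the root to a leaf
```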
Splitting Criterion
• Tells the best feature to test at node N
• Gives the best way to partition the tuples in D into individual classes
• “Goodness” is determined by measuring the purity of the partition at each branch
• A partition is pure if all the tuples in it belong to the same class
• Splitting criterion also tells which branches to grow from node N with respect to the
outcomes of the chosen test
Splitting
• The node N is labeled with the splitting criterion, which serves as a test at
the node
• A branch is grown from N for each of the outcomes, and the tuples in D are
partitioned accordingly
• Let A be the feature having v distinct values, {a1, a2, . . . , av}
• Three conditions can arise:
— A is discrete–valued and binary tree is desired
— A is discrete–valued and non–binary tree is desired
— A is continuous–valued
Splitting Scenarios
Splitting Scenario: Discrete A & Non–Binary Tree is desired
• Outcomes of the test at node N correspond to the known values of A.
• A branch is created for each known value, aj, of A.
• Partition Dj is the subset of labelled tuples in D having value aj of A.
• Because all the tuples in a given partition have the same value for A, A need not be considered in any future partitioning of the tuples.
• Therefore, it is removed from the feature list.
Splitting Scenario: A is Continuous
• Test at node N has two possible outcomes, corresponding to the conditions A
≤ Split Point and A > Split Point
• Split Point is returned by Attribute selection method as part of the splitting criterion.
• Split Point may not be a value of A from the training data.
• Two branches are grown from N and labeled according to the previous outcomes
• Partitioning: D1 holds the subset of tuples in D for which A ≤ Split Point, while D2 holds the rest.
Splitting Scenario: A is Discrete and Binary Tree is desired
• Test at node N is of the form A ∈ SA?
• SA is the splitting subset for A, returned by the Attribute selection method.
• SA is a subset of the known values of A.
• If a given tuple has value aj of A and aj ∈ SA, then the test at node N is satisfied.
• Two branches are grown from N.
• Partitioning: D1 holds the subset of tuples in D for which the test is satisfied, while D2 holds the rest.
The algorithm uses the same process recursively to form a decision tree for the tuples at each resulting partition, Dj, of D. Recursive partitioning stops only when one of the following Terminating Conditions is met:
Terminating Conditions
• All the tuples at node N belong to the same class, C
— Node N is converted to a leaf node labeled with the class C
• The partition Dj is empty
— A leaf is created with the majority class in D
• No remaining features on which the tuples may be further partitioned
— Convert node N into a leaf node and label it with the most common class in D
— Called Majority Voting
Decision Tree Algorithm
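The lecture's own pseudocode is not reproduced here; the following stripped-down recursive sketch (discrete features only, with the attribute-selection measure left as a plug-in function) shows where the three terminating conditions appear:

```python
from collections import Counter

def majority_class(rows, target):
    """Most common class label among the given tuples (used for majority voting)."""
    return Counter(r[target] for r in rows).most_common(1)[0][0]

def build_tree(rows, features, known_values, target, select_feature):
    """rows: list of dicts; known_values: feature -> all its known values;
    select_feature: an attribute selection measure (e.g. information gain)."""
    classes = {r[target] for r in rows}
    if len(classes) == 1:                      # (1) all tuples belong to the same class
        return classes.pop()
    if not features:                           # (3) no remaining features -> majority voting
        return majority_class(rows, target)

    best = select_feature(rows, features, target)       # splitting criterion for this node
    remaining = [f for f in features if f != best]
    node = {best: {}}
    for value in known_values[best]:           # one branch per known value of the feature
        subset = [r for r in rows if r[best] == value]
        if not subset:                         # (2) partition Dj is empty
            node[best][value] = majority_class(rows, target)
        else:
            node[best][value] = build_tree(subset, remaining, known_values,
                                           target, select_feature)
    return node
```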
Attribute Selection Measures (ASM)
• A heuristic to find the Splitting Criterion (SC) that best separates a data partition
D.
• The best splitting criterion partitions D into pure (close to pure) partitions
• Also called Splitting Rules (SR), since they decide how the tuples at a node are to be
split
• ASM ranks each feature describing the data
• The feature with the best score is used to split the given tuple
• Depending on the measure, the best score can be minimum or maximum
• If the feature is continuous valued, a splitting point is determined
• If feature is discrete & binary tree is desired, a splitting subset is created
• The node for partition D is labeled with the splitting criterion
• Branches are grown for each outcome of the criterion, and the tuples are partitioned accordingly.
Notation
• m: Number of classes
• Ci,D: Set of tuples in D belonging to class Ci
• |D|: No. of tuples in D
• |Ci,D|: No. of tuples in Ci,D
Information Gain
• Based on the information content of the feature
• The feature with the highest information gain is chosen as the splitting attribute
for the node
• This minimizes the information required to split the resulting partitions further
• i.e., the resulting partitions have minimum impurity
• Thus, the number of tests (nodes) needed to classify a tuple is minimized
• This can make the tree simple
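For reference, the standard definitions behind this measure, in the notation above, are:

Info(D) = −Σi pi log2(pi), where pi = |Ci,D| / |D| (the entropy of D)
InfoA(D) = Σj (|Dj| / |D|) × Info(Dj) (the expected information still needed after splitting D on feature A into partitions Dj)
Gain(A) = Info(D) − InfoA(D)

The feature A with the largest Gain(A) is chosen as the splitting attribute for the node.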
There are two main types of Decision Trees:
1. Classification trees (Yes/No types) (can explain ID3 algorithm) What we’ve seen above is an example
of classification tree, where the outcome was a variable like ‘fit’ or ‘unfit’. Here the decision variable is
Categorical. Tree models where the target variable can take a discrete set of values are called
classification trees. In these tree structures, leaves represent class labels and branches represent
conjunctions of features that lead to those class labels.
2. Regression trees (Continuous data types) – (can explain CART algorithm) Decision trees where the
target variable can take continuous values (real numbers) like the price of a house, are called regression
trees.
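A short scikit-learn contrast of the two tree types; the tiny datasets are made up for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: the target is categorical ("fit" / "unfit")
clf = DecisionTreeClassifier().fit([[25], [40], [33], [52]],
                                   ["fit", "unfit", "fit", "unfit"])
print(clf.predict([[30]]))      # predicts a class label

# Regression tree: the target is continuous (e.g. a house price)
reg = DecisionTreeRegressor().fit([[50], [80], [120], [200]],
                                  [150.0, 210.0, 310.0, 480.0])
print(reg.predict([[100]]))     # predicts a real number
```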
Classification trees
• Consider the data given in Table 8.1 which specify the features of certain vertebrates and the class to which they belong.
• For each species, four features have been identified: “gives birth”, “aquatic animal”, “aerial animal” and “has legs”.
• There are five class labels, namely, “amphibian”, “bird”, “fish”, “mammal” and “reptile”.
• The problem is how to use this data to identify the class of a newly discovered vertebrate.
Construction of the tree
Step 1
• We split the set of examples given in Table 8.1 into disjoint subsets according to the values of
the feature “gives birth”.
• Since there are only two possible values for this feature, we have only two subsets: One
subset consisting of those examples for which the value of “gives birth” is “yes” and one
subset for which the value is “no”.
Step 2
• We now consider the examples in Table 8.2. We split these examples based on the values of
the feature “aquatic animal”. There are three possible values for this feature. However, only
two of these appear in Table 8.2. Accordingly, we need consider only two subsets. These are
shown in Tables 8.4 and 8.5.
• Table 8.4 contains only one example and hence no further splitting
is required.
• It leads to the assignment of the class label “fish”.
• The examples in Table 8.5 need to be split into subsets based on
the values of “aerial animal”.
• It can be seen that these subsets immediately lead to
unambiguous assignment of class labels: The value of “no” leads
Step 3
• Next we consider the examples in Table 8.3 and split them into disjoint subsets based on the
values of “aquatic animal”. We get the examples in Table 8.6 for “yes”, the examples in Table
8.8 for “no” and the examples in Table 8.7 for “semi”. We now split the resulting subsets based
on the values of “has legs”, etc.
• Putting all these together, we get the tree diagram (Figure 8.5) for the data in Table 8.1.
Elements of a classification tree
• The various elements in a classification tree are identified as follows.
• In the example discussed above, initially we chose the feature “gives birth” to split the data set into disjoint
subsets and then the feature “aquatic animal”, and so on.
• There was no theoretical justification for this choice.
• The classification tree depends on the order in which the features are selected for partitioning the data.
• Stopping criteria (when can we stop the growth of a Decision Tree)
— All (or nearly all) of the examples at the node have the same class.
— There are no remaining features to distinguish among the examples.
— The tree has grown to a predefined size limit.
Explain how we decide which feature is to be selected at each level in a decision tree.
• Feature selection measures (information gain, gain ratio, Gini index)
• If a dataset consists of n attributes, then deciding which attribute is to be placed at the root or at different levels of the tree as internal nodes is a complicated problem.
• It is not enough that we just randomly select any node to be the root.
• If we do this, it may give us bad results with low accuracy.
• The most important problem in implementing the decision tree algorithm is deciding which features are to be
considered as the root node and at each level.
• Several methods have been developed to assign numerical values to the various features such that the values
reflect the relative importance of the various features. These are called the feature selection measures.
• Two of the popular feature selection measures are information gain and Gini index.
Entropy
• The degree to which a subset of examples contains only a single class is known as purity, and any subset composed
of only a single class is called a pure class.
• Informally, entropy is a measure of “impurity” in a dataset.
• Sets with high entropy are very diverse and provide little information about other items that may also belong in the
set, as there is no apparent commonality.
• Entropy is measured in bits. If there are only two possible classes, entropy values can range from 0 to 1. For n
classes, entropy ranges from 0 to log2(n).
• In each case, the minimum value indicates that the sample is completely homogeneous, while the maximum value
indicates that the data are as diverse as possible.
• Consider a segment S of a dataset having c class labels. Let pi be the proportion of examples in S having the i-th class label. The entropy of S is defined as
Entropy(S) = −p1 log2(p1) − p2 log2(p2) − · · · − pc log2(pc) = −Σi pi log2(pi)
Special case
• Let the data segment S has only two class labels, say, “yes” and “no”.
• If p is the proportion of examples having the label “yes” then the proportion of examples having
label “no” will be 1 − p.
• In this case, the entropy of S is given by
Entropy (S) = −p log2(p) − (1 − p) log2(1 − p).
Examples of Entropy Calculation
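As a small illustrative calculation (the class splits below are assumed toy examples, not taken from the lecture's tables):

```python
import math

def entropy(proportions):
    """Entropy in bits of a class distribution given as proportions that sum to 1."""
    # max() only guards against a -0.0 result for a perfectly pure segment
    return max(0.0, -sum(p * math.log2(p) for p in proportions if p > 0))

print(entropy([9/14, 5/14]))   # mixed segment (9 'yes', 5 'no') -> about 0.940 bits
print(entropy([7/14, 7/14]))   # 50/50 split -> 1.0 bit, the maximum for two classes
print(entropy([1.0]))          # pure segment -> 0.0 bits, completely homogeneous
```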
Prepruning
• Tree growth is halted early by deciding not to split the tuples at a node any further; the node becomes a leaf holding the most common class of its tuples
• Else, the leaves can be programmed to show the probability distribution of the tuples
• Method: Thresholding on the Attribute Selection Measure is done to stop the splitting of nodes
• Ex: Only if Gain(A) > T, split the node; else don't split
• Too large T → too simplified a tree
• Too small T → too complex a tree
• Setting the threshold is the key
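In scikit-learn's trees, this kind of prepruning threshold is exposed (in a slightly different form) through parameters such as min_impurity_decrease, max_depth, and min_samples_leaf; a sketch with placeholder values and placeholder training data:

```python
from sklearn.tree import DecisionTreeClassifier

# A split is made only if it decreases the impurity by at least the given threshold;
# too large a threshold -> an over-simplified tree, too small -> a complex tree.
prepruned = DecisionTreeClassifier(
    criterion="entropy",
    min_impurity_decrease=0.01,   # threshold T on the gain of a candidate split
    max_depth=6,                  # an additional, cruder prepruning limit
    min_samples_leaf=5,           # do not create leaves with fewer than 5 tuples
)
# prepruned.fit(X_train, y_train)   # X_train, y_train: any labelled training data
```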
Postpruning
• Most common method
• Removes a subtree and replaces it with a leaf
• Leaf label is found out by majority voting
Cost Complexity Pruning Algorithm
• Used in CART method
• Cost Complexity: Function of the no. of leaves and the percentage of
misclassified tuples
• For each node, this is compared for pruned and unpruned cases
• If pruned one has lower complexity, pruning is performed
• This process is performed from bottom to top
• A different set is used for computing cost complexity → the Pruning Set
• Pruning set is independent of training set
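scikit-learn's CART-style trees offer a closely related mechanism, minimal cost-complexity pruning, via the ccp_alpha parameter; choosing the pruning level on a held-out set, as sketched below, mirrors the idea of an independent pruning set (dataset and settings are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_prune, y_train, y_prune = train_test_split(X, y, random_state=0)

# Each candidate alpha corresponds to pruning away one more subtree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
         for a in path.ccp_alphas]
best = max(trees, key=lambda t: t.score(X_prune, y_prune))   # evaluate on the pruning set
print("chosen tree has", best.get_n_leaves(), "leaves")
```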
Pessimistic Pruning Algorithm
• Used in C4.5
• Uses Error Rate Estimates to prune/keep the subtrees
• No separate pruning set needed; uses the training set itself
• After the tree is built, start removing each node from the first child node (top to
bottom)
• Compute the error rate before and after pruning.
• Prune/keep the node based on the error rate
• Prepruning is less computationally complex
• Postpruning is computationally complex, but gives best results
• Pre and Post Pruning can also be combined if needed
• Even after pruning, Repetition and Replication can exist in decision trees
Random Forest
Imagine asking a group of friends for advice on where to go for vacation. Each friend gives their recommendation based on their unique perspective and preferences (decision trees trained on different subsets of data). You then make your final decision by considering the majority opinion or averaging their suggestions (ensemble prediction).
The process starts with a dataset of rows (examples) and their corresponding class labels (columns).
Then, multiple Decision Trees are created from the training data.
Each tree is trained on a random subset of the data (with replacement) and a random subset of
features. This process is known as bagging or bootstrap aggregating.
Each Decision Tree in the ensemble learns to make predictions independently.
When presented with a new, unseen instance, each Decision Tree in the ensemble makes a
prediction.
The final prediction is made by combining the predictions of all the Decision Trees. This is
typically done through a majority vote (for classification) or averaging (for regression).
The Random Forest algorithm works in several steps:
Random Forest builds multiple decision trees using random samples of the
data. Each tree is trained on a different subset of the data which makes each
tree unique.
When creating each tree the algorithm randomly selects a subset of features or
variables to split the data rather than using all available features at a time.
This adds diversity to the trees.
Each decision tree in the forest makes a prediction based on the data it was
trained on. When making the final prediction, Random Forest combines the results from all the trees.
For classification tasks the final prediction is decided by a majority vote. This
means that the category predicted by most trees is the final prediction.
For regression tasks the final prediction is the average of the predictions from all
the trees.
The randomness in the data samples and feature selection helps to prevent the model from overfitting.
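These steps map directly onto scikit-learn's RandomForestClassifier (RandomForestRegressor for regression); the synthetic data and hyperparameter values below are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,       # number of trees, each grown on a bootstrap sample
    max_features="sqrt",    # random subset of features considered at each split
    oob_score=True,         # out-of-bag estimate of generalisation accuracy
    n_jobs=-1,
    random_state=0,
).fit(X_train, y_train)

print("OOB estimate:", forest.oob_score_)
print("Test accuracy (majority vote over the trees):", forest.score(X_test, y_test))
```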
Assumptions of Random Forest
Each tree makes its own decisions: Every tree in the forest makes its own predictions
without relying on others.
Random parts of the data are used: Each tree is built using random samples and
features to reduce mistakes.
Enough data is needed: Sufficient data ensures the trees are different and learn unique
patterns and variety.
Different predictions improve accuracy: Combining the predictions from different trees
leads to a more accurate final result.