Ensemble Learning, Decision Trees

The document discusses ensemble learning, emphasizing the importance of combining multiple learners to improve accuracy and reduce errors, as no single learning algorithm is universally optimal. It outlines various methods for achieving diversity among base learners, including using different algorithms, hyperparameters, and training sets, as well as techniques like bagging and boosting. Additionally, it explains decision trees as a supervised learning method, detailing their structure and advantages.


Combining Multiple Learners – Ensemble Learning

• The No Free Lunch Theorem states that there is no single learning algorithm that in any domain always
induces the most accurate learner.
• The usual approach is to try many and choose the one that performs the best on a separate validation set.
• We may combine multiple learning algorithms, or the same algorithm with different hyperparameters, as classifiers.

➢ Why do we prefer to combine many learners together?


There are several reasons why a single learner may not produce accurate results.
• Each learning algorithm carries with it a set of assumptions. This leads to error if the assumptions do not
hold. We cannot be fully sure whether the assumptions are true in a particular situation.
• Learning is an ill-posed problem. With finite data, each algorithm may converge to a different solution and
may fail in certain circumstances.
• The performance of a learner may be fine-tuned to get the highest possible accuracy on a validation set.
But this fine-tuning is a complex task and still there are instances on which even the best learner is not
accurate enough.
• It has been proved that there is no single learning algorithm that always produces the most accurate output.
 What are the different ways to achieve diversity in learning, or how can we combine/select diverse learners together? What is a base learner, and how are base learners chosen?
• When many learning algorithms are combined, the individual algorithms in the collection are called
the base learners of the collection.
• When we generate multiple base-learners, we want them to be reasonably accurate but do not require
them to be very accurate individually.
• The base-learners are not chosen for their accuracy, but for their simplicity.
• What we care about is the final accuracy when the base-learners are combined, rather than the accuracies of the base-learners we started from.
There are several different ways for selecting the base learners.

1. Use different learning algorithms


• There may be several learning algorithms for performing a given task. For
example, for classification, one may choose the naive Bayes’ algorithm, or the
decision tree algorithm or even the SVM algorithm.
• Different algorithms make different assumptions about the data and lead to
different results.
• When we decide on a single algorithm, we give emphasis to a single method and
ignore all others.
• By combining multiple learners based on multiple algorithms, we get better results.
2. Use the same algorithm with different hyperparameters
• A hyperparameter is a parameter whose value is set before the learning process begins.
• By contrast, the values of other parameters are derived via training.
• The number of layers, the number of nodes in each layer and the initial weights are all hyper-
parameters in an artificial neural network.
• When we train multiple base-learners with different hyperparameter values, we average over their outputs and reduce variance, and therefore error (a minimal sketch follows).
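As a rough illustration of this idea (not from the slides), the sketch below trains the same algorithm with several hyperparameter settings on synthetic data and averages the predicted probabilities; scikit-learn and the particular hidden-layer sizes are assumptions made only for the example.

```python
# Hypothetical sketch: same algorithm (an MLP), different hyperparameter values,
# combined by averaging predicted probabilities to reduce variance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each base-learner uses a different number of layers/nodes (hyperparameters).
hidden_sizes = [(5,), (10,), (20,), (10, 10)]
models = [MLPClassifier(hidden_layer_sizes=h, max_iter=2000, random_state=0)
          .fit(X_train, y_train) for h in hidden_sizes]

# Average over the base-learners' outputs and pick the most probable class.
avg_proba = np.mean([m.predict_proba(X_test) for m in models], axis=0)
print("Ensemble accuracy:", (avg_proba.argmax(axis=1) == y_test).mean())
```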

3. Use different representations of the input object


• For example, in speech recognition, to recognize the uttered words, the words may be represented by the acoustic (sound) input.
• Words can also be represented by video images of the speaker’s lips as the words are spoken.
• Different representations make different characteristics explicit allowing better identification.

4. Use different training sets to train different base-learners


• This can be done by drawing random training sets from the given sample; this is called bagging.
• The learners can be trained serially so that instances on which the preceding base- learners are
not accurate are given more emphasis in training later base learners; examples are boosting and
cascading.
• The partitioning of the training sample can also be done based on locality in the input space so
that each base-learner is trained on instances in a certain local part of the input space.
5. Model Combination Schemes
a) Multi-expert combination methods
• In this scheme, the base learners work in parallel.
• All of them are trained and then given an instance, they all give their decisions, and a
separate combiner computes the final decision using their predictions.
• Examples include voting and its variants.
b) Multistage combination methods
• These methods use a serial approach where the next base-learner is trained with or
tested on only the instances where the previous base-learners are not accurate
enough.
Model combination schemes
● Voting
 The simplest way to combine multiple classifiers is by voting, which corresponds to taking a linear combination of the learners, also known as an ensemble.
 Each learner outputs its result, and there are different mechanisms to choose the combined result. (Also explain why we combine multiple learners – based on marks.)
 The base-learners are dj and their outputs are combined using a function f(·); for a single output, y = f(d1, d2, . . . , dL | Φ), where Φ denotes the parameters of the combiner.
 In the case of classification, each base learner has K outputs that are separately used to calculate yi, and then we choose the class with the maximum yi. Note that here, all learners observe the same input; it may be the case that different learners observe different representations of the same input object or event.
Simple Voting
• All learners are given equal weight and we have simple voting that corresponds to taking an average.
Weighted Voting
• Each learner result is assigned different weights and we consider a weighted combination of the result
to obtain the final result.
Binary classification problem using Voting
• Consider a binary classification problem with class labels −1 and +1.
• Let there be L base learners and let x be a test instance.
• Each of the base learners will assign a class label to x. If the class label assigned is +1, we say that the
learner votes for +1 and that the label +1 gets a vote.
• The number of votes obtained by the class labels when the different base learners are applied is
counted.
• In the voting scheme for combining the learners, the label which gets the majority votes is assigned to
x.
Multi-class classification problem using Voting
• Let there be n class labels C1,C2,...,Cn.
• Let x be a test instance and let there be L base learners.
• Here also, each of the base learners will assign a class label to x, and whenever a class label is assigned, that label gets a vote.
• In the voting scheme, the class label which gets the maximum number of votes is assigned to x (a minimal sketch of both voting schemes follows).
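A minimal sketch of simple and weighted voting, assuming hypothetical label outputs from five base learners (the values below are illustrative, not from the slides):

```python
# Simple (unweighted) and weighted voting over the labels output by L base learners.
import numpy as np

def simple_vote(labels):
    """Each learner gets one vote; return the label with the most votes."""
    values, counts = np.unique(labels, return_counts=True)
    return values[counts.argmax()]

def weighted_vote(labels, weights):
    """Sum the weights of the learners voting for each label; return the heaviest label."""
    totals = {}
    for label, w in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)

# Five base learners classify the same test instance x (binary problem, labels -1/+1).
votes = [+1, -1, +1, +1, -1]
print(simple_vote(votes))                                # +1 (3 votes against 2)
print(weighted_vote(votes, [0.1, 0.4, 0.1, 0.1, 0.4]))   # -1 (weight 0.8 vs. 0.3)
```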
 Bagging and Boosting are two types of Ensemble Learning.
 These two decrease the variance of a single estimate as they combine several estimates from different models.
 So the result may be a model with higher stability.
1. Bagging: A homogeneous weak learners’ model in which the learners are trained independently of each other in parallel, and their outputs are combined to determine the model average.
2. Boosting: Also a homogeneous weak learners’ model, but it works differently from bagging. In this model, learners are trained sequentially and adaptively to improve the model predictions of a learning algorithm.

 Bagging
• Bootstrap Aggregating, also known as bagging, is a machine learning ensemble meta-algorithm designed
to improve the stability and accuracy of machine learning algorithms used in statistical classification and
regression.
• It decreases the variance and helps to avoid overfitting.
• It is usually applied to decision tree methods.
• Bagging is a special case of the model averaging approach.
Description of the Technique
• Suppose we have a set D of d tuples. At each iteration i, a training set Di of d tuples is selected from D via row sampling with replacement (i.e., the same tuple may appear more than once); this is the bootstrap sample.
• Then a classifier model Mi is learned for each training set Di.
• Each classifier Mi returns its class prediction.
• The bagged classifier M* counts the votes and assigns the class with the most votes to X (unknown sample).
Implementation Steps of Bagging
•Step 1: Multiple subsets are created from the original data set with equal tuples, selecting observations with
replacement.
•Step 2: A base model is created on each of these subsets.
•Step 3: Each model is learned in parallel with each training set and independent of each other.
•Step 4: The final predictions are determined by combining the predictions from all the models.
How does Bagging Classifier Work?
The basic steps of how a bagging classifier works are as follows:
•Bootstrap Sampling: In Bootstrap Sampling randomly ‘n’ subsets of original training data are sampled with
replacement. This step ensures that the base models are trained on diverse subsets of the data, as some samples may
appear multiple times in the new subset, while others may be omitted. It reduces the risks of overfitting and improves the
accuracy of the model.
•Base Model Training: In bagging, multiple base models are used. After the Bootstrap Sampling, each base model
is independently trained using a specific learning algorithm, such as decision trees, support vector machines, or neural
networks on a different bootstrapped subset of data. These models are typically called “Weak learners” because they may
not be highly accurate on their own. Since each base model is trained independently on a different subset of the data, the base models can be trained in parallel, which makes the ensemble computationally efficient and less time-consuming.
•Aggregation: Once all the base models are trained, they are used to make predictions on unseen data, i.e., data on which a given base model was not trained. In the bagging classifier, the predicted class label for a given instance is chosen by majority voting: the class that receives the majority of the votes is the prediction of the model.
•Out-of-Bag (OOB) Evaluation: Some samples are excluded from the training subset of particular base models during
the bootstrapping method. These “out-of-bag” samples can be used to estimate the model’s performance without the need
for cross-validation.
•Final Prediction: After aggregating the predictions from all the base models, Bagging produces a final prediction for
each instance.
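A minimal sketch of this workflow, assuming scikit-learn and a synthetic dataset; the default base learner of BaggingClassifier is a decision tree:

```python
# Bagging sketch: bootstrap sampling, independently trained base models (in
# parallel), majority-vote aggregation, and out-of-bag (OOB) evaluation.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

bag = BaggingClassifier(
    n_estimators=50,   # number of bootstrapped subsets / base models (decision trees by default)
    bootstrap=True,    # row sampling with replacement from the original data
    oob_score=True,    # estimate performance on the out-of-bag samples
    n_jobs=-1,         # train the base models in parallel
    random_state=42,
)
bag.fit(X, y)
print("OOB accuracy estimate:", bag.oob_score_)
print("Majority-vote prediction for the first instance:", bag.predict(X[:1]))
```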
Example of Bagging
The Random Forest model uses Bagging, where decision tree models with higher variance are
present. It makes random feature selection to grow trees. Several random trees make a
Random Forest.

Advantages of Bagging
•The biggest advantage of bagging is that multiple weak learners can work better than a single strong learner.
•It provides stability and increases the machine learning algorithm’s accuracy, which is used in statistical
classification and regression.
•It helps in reducing variance, i.e., it avoids overfitting.

Disadvantages of Bagging
•It may result in high bias if it is not modeled properly and thus may result in underfitting.
•Since we must use multiple models, it becomes computationally expensive and may not be suitable in
various use cases.
Boosting
• Boosting is an ensemble modeling technique designed to create a strong classifier by combining
multiple weak classifiers.
• The process involves building models sequentially, where each new model aims to correct the
errors made by the previous ones.
• Initially, a model is built using the training data.
• Subsequent models are then trained to address the mistakes of their predecessors.
• Boosting assigns weights to the data points in the original dataset.
• Higher weights: Instances that were misclassified by the previous model receive higher weights.
• Lower weights: Instances that were correctly classified receive lower weights.
• Training on weighted data: The subsequent model learns from the weighted dataset, focusing its
attention on harder-to-learn examples (those with higher weights).
• This iterative process continues until:
• The entire training dataset is accurately predicted, or
• A predefined maximum number of models is reached.
Boosting Algorithms
• There are several boosting algorithms.
• The original ones, proposed by Robert Schapire and Yoav Freund were not adaptive and could not take
full advantage of the weak learners.
• Schapire and Freund then developed AdaBoost, an adaptive boosting algorithm that won the prestigious
Gödel Prize.
• AdaBoost was the first really successful boosting algorithm developed for the purpose of binary
classification.
• AdaBoost is short for Adaptive Boosting and is a very popular boosting technique that combines multiple
“weak classifiers” into a single “strong classifier”.
• In boosting, we take records from the dataset and pass them to base learners sequentially; here, base learners can be any
model. Suppose we have m number of records in the dataset.
• Then, we pass a few records to base learner BL1 and train it.
• Once BL1 is trained, we pass all the records from the dataset through it and see how the base learner performs.
• We take only the records classified incorrectly by BL1 and pass them to the next base learner, say BL2; similarly, the records misclassified by BL2 are passed on to train BL3.
• This continues until we reach the specified number of base-learner models.
• Finally, we combine the output from these base learners and create a strong learner; thus, the model’s prediction power
improves.
Boosting works with the following steps:
•We sample m-number of subsets from an initial training dataset.
•Using the first subset, we train the first weak learner.
•We test the trained weak learner using the training data. As a result of the testing, some data points
will be incorrectly predicted.
•Each data point with the wrong prediction is sent into the second subset of data, and this subset is
updated.
•Using this updated subset, we train and test the second weak learner.
•We continue with the next subset until reaching the total number of subsets.
•We now have the total prediction: the overall prediction has already been aggregated at each step (a minimal sketch follows).
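As a rough sketch of this sequential scheme (assuming scikit-learn and synthetic data), AdaBoost re-weights misclassified instances between rounds rather than literally copying them into new subsets:

```python
# AdaBoost sketch: weak learners (decision stumps by default) trained sequentially,
# with misclassified instances given higher weight in later rounds.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, random_state=0)  # 100 weak classifiers
ada.fit(X_train, y_train)
print("Strong classifier test accuracy:", ada.score(X_test, y_test))
```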
Reducing Bias by Boosting
• We use boosting to combine weak learners with high bias.
• Boosting aims to produce a model with a lower bias than the individual models.
• Boosting involves sequentially training weak learners.
• Here, each subsequent learner improves the errors of previous learners in the sequence.
• A sample of data is first taken from the initial dataset.
• This sample is used to train the first model, and the model makes its predictions.
• The samples can either be correctly or incorrectly predicted.
• The samples that are wrongly predicted are reused for training the next model.
• In this way, subsequent models can improve on the errors of previous models.
• Boosting aggregates the results at each step.
• Weighted averaging involves giving all models different weights depending on their predictive power.
• In other words, it gives more weight to the model with the highest predictive power.
• This is because the learner with the highest predictive power is considered the most important.

Advantages of Boosting
•It is one of the most successful techniques in solving the two-class classification problems.
•It is good at handling the missing data.
Disadvantages of Boosting
•Boosting is hard to implement in real time due to the increased complexity of the algorithm.
•The high flexibility of these techniques results in multiple numbers of parameters that directly affect the behavior of
the model.
Similarities between Bagging and Boosting
Differences between Bagging and Boosting
Decision Trees

• Decision Trees are a type of Supervised Machine Learning where the data is continuously split
according to a certain parameter. The tree can be explained by two entities, namely decision
nodes and leaves. The leaves are the decisions or the final outcomes.
• And the decision nodes are where the data is split.
• An example of a decision tree can be explained using the binary tree above. Let’s say you want to predict whether a person is fit given information like their age, eating habits, physical activity, etc.
• The decision nodes here are questions like ‘What’s the age?’, ‘Does he exercise?’, ‘Does he eat a lot of pizza?’, and the leaves are outcomes like ‘fit’ or ‘unfit’.
• In this case this was a binary classification problem (a yes/no type problem).
• A tree structure resembling Flow Charts
• Leaf Node: Holds the Class Labels (Terminal Nodes)
• Internal Node: Non–Leaf Nodes → denotes a test on an attribute
• Branch: Outcome of the test
• Root Node: The topmost node from where all the branches originate

• Given a test tuple, x, the feature values are tested against the decision tree.
• A path is traced from the root to a leaf node
• Leaf node holds the classification results
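A minimal sketch of such a tree, assuming scikit-learn and a made-up “fitness” table (the feature values and labels below are hypothetical, not from the slides):

```python
# Hypothetical 'is the person fit?' example: train a decision tree and trace a
# test tuple from the root to a leaf to read off its class label.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: age, exercises (1/0), eats a lot of pizza (1/0)
X = np.array([
    [25, 1, 0],
    [45, 0, 1],
    [30, 1, 1],
    [52, 0, 0],
    [23, 0, 1],
    [40, 1, 0],
])
y = ["fit", "unfit", "fit", "unfit", "unfit", "fit"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "exercises", "eats_pizza"]))

# Classify a new person: feature values are tested along a root-to-leaf path.
print(tree.predict([[35, 1, 0]]))
```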
Advantages
• No domain knowledge is required
• Easy to interpret
• Can handle multidimensional data
• Simple steps for learning and classification
• Types of Decision Trees: Iterative Dichotomiser 3 (ID3), C4.5, C5.0 (latest)
Splitting Criterion
• Tells the best feature to test at node N
• Gives the best way to partition the tuples in D into individual classes
• “Goodness” is determined by measuring the purity of the partition at each branch
• A partition is pure if all the tuples in it belong to the same class
• Splitting criterion also tells which branches to grow from node N with respect to the
outcomes of the chosen test

Splitting
• The node N is labeled with the splitting criterion, which serves as a test at
the node
• A branch is grown from N for each of the outcomes, and the tuples in D are
partitioned accordingly
• Let A be the feature having v distinct values, {a1, a2, . . . , av}
• Three conditions can arise:
— A is discrete–valued and binary tree is desired
— A is discrete–valued and non–binary tree is desired
— A is continuous–valued
Splitting Scenarios
Splitting Scenario: Discrete A & Non–Binary Tree is desired
• Outcomes of the test at node N correspond to the known values of A.
• A branch is created for each known value, aj, of A.
• Partition Dj is the subset of labelled tuples in D having value aj of A.
• Because all the tuples in a given partition have the same value for A, A need not be considered in any future partitioning of the tuples.
• Therefore, it is removed from the feature list.
Splitting Scenario: A is Continuous
• Test at node N has two possible outcomes, corresponding to the conditions A ≤ Split Point and A > Split Point.
• Split Point is returned by the attribute selection method as part of the splitting criterion.
• Split Point may not be a value of A from the training data.
• Two branches are grown from N and labeled according to the above outcomes.
• Partitioning: D1 holds the subset of tuples in D for which A ≤ Split Point, while D2 holds the rest.
Splitting Scenario: A is Discrete and Binary Tree is desired
• Test at node N is of the form A ∈ SA?
• SA is the splitting subset for A, returned by the attribute selection method.
• SA is a subset of the known values of A.
• If a given tuple has value aj of A and if aj ∈ SA, then the test at node N is satisfied.
• Two branches are grown from N.
• Partitioning: D1 holds the subset of tuples in D for which the test is satisfied, while D2 holds the rest.
 The algorithm uses the same process recursively to form a decision tree for the tuples at each resulting partition, Dj, of D. Recursive partitioning stops only when any one of the terminating conditions is met:
Terminating Conditions
• All the tuples at node N belong to the same class, C
— Node N is converted to a leaf node labeled with the class C
• The partition Dj is empty
— A leaf is created with the majority class in D
• No remaining features on which the tuples may be further partitioned
— Convert node N into a leaf node and label it with the most common class in D
— Called Majority Voting
Decision Tree Algorithm (Decision Trees)
Attribute Selection Measures (ASM)
• A heuristic to find the Splitting Criterion (SC) that best separates a data partition
D.
• The best splitting criterion partitions D into pure (close to pure) partitions
• Also called Splitting Rules (SR), since they decide how the tuples at a node are to be
split
• ASM ranks each feature describing the data
• The feature with the best score is used to split the given tuple
• Depending on the measure, the best score can be minimum or maximum
• If the feature is continuous valued, a splitting point is determined
• If feature is discrete & binary tree is desired, a splitting subset is created
• The node for partition D is labeled with the splitting criterion
• Branches are grown for each outcome of the criterion and the tuples are partitioned accordingly.
Notation
• m: the number of classes
• Ci,D: the set of tuples in D belonging to class Ci
• |D|: the number of tuples in D
• |Ci,D|: the number of tuples in Ci,D
Information Gain
• Based on the information content of the feature
• The feature with the highest information gain is chosen as the splitting attribute
for the node
• This minimizes the information required to split the resulting partitions further
• i.e., the resulting partitions have minimum impurity
• Thus, the number of tests (nodes) needed to classify a tuple is minimized
• This can make the tree simple
There are two main types of Decision Trees:
1. Classification trees (Yes/No types) (can explain ID3 algorithm) What we’ve seen above is an example
of classification tree, where the outcome was a variable like ‘fit’ or ‘unfit’. Here the decision variable is
Categorical. Tree models where the target variable can take a discrete set of values are called
classification trees. In these tree structures, leaves represent class labels and branches represent
conjunctions of features that lead to those class labels.
2. Regression trees (Continuous data types) – (can explain CART algorithm) Decision trees where the
target variable can take continuous values (real numbers) like the price of a house, are called regression
trees.
Classification trees

• Consider the data given in Table 8.1 which specify the features of certain vertebrates and the class to which they belong.
• For each species, four features have been identified: “gives birth”, ”aquatic animal”, “aerial animal” and “has legs”.
• There are five class labels, namely, “amphibian”, “bird”, “fish”, “mammal” and “reptile”.
• The problem is how to use this data to identify the class of a newly discovered vertebrate.
Construction of the tree
Step 1
• We split the set of examples given in Table 8.1 into disjoint subsets according to the values of
the feature “gives birth”.
• Since there are only two possible values for this feature, we have only two subsets: One
subset consisting of those examples for which the value of “gives birth” is “yes” and one
subset for which the value is “no”.
Step 2
• We now consider the examples in Table 8.2. We split these examples based on the values of
the feature “aquatic animal”. There are three possible values for this feature. However, only
two of these appear in Table 8.2. Accordingly, we need consider only two subsets. These are
shown in Tables 8.4 and 8.5.

• Table 8.4 contains only one example and hence no further splitting
is required.
• It leads to the assignment of the class label “fish”.
• The examples in Table 8.5 need to be split into subsets based on
the values of “aerial animal”.
• It can be seen that these subsets immediately lead to
unambiguous assignment of class labels: The value of “no” leads
Step 3
• Next we consider the examples in Table 8.3 and split them into disjoint subsets based on the
values of “aquatic animal”. We get the examples in Table 8.6 for “yes”, the examples in Table
8.8 for “no” and the examples in Table 8.7 for “semi”. We now split the resulting subsets based
on the values of “has legs”, etc.
• Putting all these together, we get the tree diagram (Figure 8.5) for the data in Table 8.1.
 Elements of a classification tree
• The various elements in a classification tree are identified as follows.
• In the example discussed above, initially we chose the feature “gives birth” to split the data set into disjoint
subsets and then the feature “aquatic animal”, and so on.
• There was no theoretical justification for this choice.
• The classification tree depends on the order in which the features are selected for partitioning the data.
• Stopping criteria (when can we stop the growth of a Decision Tree)
 All (or nearly all) of the examples at the node have the same class.
 There are no remaining features to distinguish among the examples.
 The tree has grown to a predefined size limit.
 Explain how we decide which feature to be selected at each level in a decision tree.
• Feature selection measures (information gain, Gain ratio, Gini index )
• If a dataset consists of n attributes, then deciding which attribute is to be placed at the root or at different levels of the tree as internal nodes is a complicated problem.
• It is not enough that we just randomly select any node to be the root.
• If we do this, it may give us bad results with low accuracy.
• The most important problem in implementing the decision tree algorithm is deciding which features are to be
considered as the root node and at each level.
• Several methods have been developed to assign numerical values to the various features such that the values
reflect the relative importance of the various features. These are called the feature selection measures.
• Two of the popular feature selection measures are information gain and Gini index.
Entropy
• The degree to which a subset of examples contains only a single class is known as purity, and any subset composed
of only a single class is called a pure class.
• Informally, entropy is a measure of “impurity” in a dataset.
• Sets with high entropy are very diverse and provide little information about other items that may also belong in the
set, as there is no apparent commonality.
• Entropy is measured in bits. If there are only two possible classes, entropy values can range from 0 to 1. For n
classes, entropy ranges from 0 to log2(n).
• In each case, the minimum value indicates that the sample is completely homogeneous, while the maximum value
indicates that the data are as diverse as possible
• Consider a segment S of a dataset having c class labels. Let pi be the proportion of examples in S having the i-th class label. The entropy of S is defined as

Entropy(S) = −p1 log2(p1) − p2 log2(p2) − ⋯ − pc log2(pc)

Special case
• Let the data segment S has only two class labels, say, “yes” and “no”.
• If p is the proportion of examples having the label “yes” then the proportion of examples having
label “no” will be 1 − p.
• In this case, the entropy of S is given by
Entropy (S) = −p log2(p) − (1 − p) log2(1 − p).
Examples of Entropy Calculation

• Let “i” be some class label. We denote by pi the proportion of examples with class label “i”.
• Let S be the data in Table 8.1.
• The class labels are “amphi”, “bird”, “fish”, “mammal” and “reptile”.
• In S we have the following numbers:
  Number of examples with class label “amphi” = 3
  Number of examples with class label “bird” = 2
  Number of examples with class label “fish” = 2
  Number of examples with class label “mammal” = 2
  Number of examples with class label “reptile” = 1
  Total number of examples = 10
• Therefore, we have
  Entropy(S) = −(3/10) log2(3/10) − (2/10) log2(2/10) − (2/10) log2(2/10) − (2/10) log2(2/10) − (1/10) log2(1/10) ≈ 2.2464
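The same calculation can be checked with a few lines of Python (a small sketch, using the class counts listed above):

```python
# Entropy of the class distribution in Table 8.1: 3 "amphi", 2 "bird", 2 "fish",
# 2 "mammal" and 1 "reptile" out of 10 examples.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([3, 2, 2, 2, 1]), 4))  # 2.2464
```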
Information Gain
• Let S be a set of examples, A be a feature (or attribute), Sv be the subset of S with A = v, and Values(A) be the set of all possible values of A.
• Then the information gain of an attribute A relative to the set S, denoted by Gain(S, A), is defined as

Gain(S, A) = Entropy(S) − Σv ∈ Values(A) (|Sv| / |S|) Entropy(Sv)

• The attribute A with the highest information gain is selected.


Example
Consider the data S given in Table 8.1. We have already seen that |S| = 10 and Entropy(S) = 2.2464.
We denote the information gain corresponding to the feature “i” by Gain(S, i).
Similarly, we can compute Gain(S, aerial animal) and Gain(S, has legs).
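A generic sketch of Gain(S, A) following the definition above; the tiny “gives birth” sample below is hypothetical and only exercises the function (it is not the full Table 8.1):

```python
# Gain(S, A) = Entropy(S) - sum over values v of A of |Sv|/|S| * Entropy(Sv).
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain(feature_values, labels):
    total = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [lab for fv, lab in zip(feature_values, labels) if fv == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical mini-sample: the feature "gives birth" against the class label.
gives_birth = ["yes", "yes", "no", "no", "no", "yes"]
labels      = ["mammal", "mammal", "bird", "bird", "fish", "mammal"]
print(round(gain(gives_birth, labels), 4))  # 1.0 for this made-up sample
```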
Gini indices
• The Gini split index of a data set is another feature selection measure in the construction
of classification trees. This measure is used in the CART algorithm
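For reference, the Gini impurity of a partition D is Gini(D) = 1 − Σi pi², and CART chooses the split with the lowest size-weighted Gini impurity. A small generic sketch (not tied to Table 8.1):

```python
# Gini impurity of a partition and the Gini index of a binary split.
from collections import Counter

def gini(labels):
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(["fit", "fit", "unfit", "unfit"]))          # 0.5: maximally impure for two classes
print(gini_split(["fit", "fit"], ["unfit", "unfit"]))  # 0.0: both partitions are pure
```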
Tree Pruning
• Due to noise/anomalies in the data, trees may overfit.
• Pruning (stopping the growth early, or removing unreliable branches) is used to avoid this.
• Advantages:
— Trees are shorter, Less complex, More interpretable
— Faster and Better classification of unseen tuples
• Two methods:
— Prepruning: Halt the tree construction early
— Postpruning: Remove subtrees from a fully grown tree
Prepruning
• Decides not to further split or partition data at the given node
• After halting, the node becomes a leaf.
• Majority voting gives the class label at the leaf

• Else, the leaves can be programmed to show the probability distribution of the tuples

• Method: Thresholding on the Attribute Selection Measures is done to stop the splitting of nodes
• Ex: Only if Gain(A) > T, split the node; else don’t split
• Too large T → Too simplified tree
• Too small T → Complex tree
• Setting the threshold is the key

Postpruning
• Most common method
• Removes a subtree and replaces it with a leaf
• Leaf label is found out by majority voting
Cost Complexity Pruning Algorithm
• Used in CART method
• Cost Complexity: Function of the no. of leaves and the percentage of
misclassified tuples
• For each node, this is compared for pruned and unpruned cases
• If pruned one has lower complexity, pruning is performed
• This process is performed from bottom to top
• A different set used for computing cost complexity → Pruning Set
• Pruning set is independent of training set
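A minimal sketch with scikit-learn; note that scikit-learn exposes cost-complexity pruning through a ccp_alpha penalty tuned on held-out data, which here plays the role of the pruning set:

```python
# Cost-complexity (post)pruning sketch: try the candidate alphas returned by
# cost_complexity_pruning_path and keep the one that scores best on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_prune, y_train, y_prune = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_prune, y_prune)  # held-out data acts as the "pruning set"
    if score > best_score:
        best_alpha, best_score = alpha, score

print("Best alpha:", best_alpha, "pruning-set accuracy:", best_score)
```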
Pessimistic Pruning Algorithm
• Used in C4.5
• Uses Error Rate Estimates to prune/keep the subtrees
• No separate pruning set needed; uses the training set itself
• After the tree is built, start removing each node from the first child node (top to
bottom)
• Compute the error rate before and after pruning.
• Prune/keep the node based on the error rate
• Prepruning is less computationally complex
• Postpruning is computationally complex, but gives best results
• Pre and Post Pruning can also be combined if needed
• Even after pruning, Repetition and Replication can exist in decision trees

• Repetition: An attribute is repeatedly tested along a given branch of the tree
• Replication: Duplicate subtrees exist within the tree

• Both can be removed by means of splits using a combination of attributes
Random Forest
 A Random Forest is a collection of decision trees that work together to make predictions.
 The Random Forest algorithm is a powerful tree-based learning technique in Machine Learning: many decision trees make predictions, and we then vote over all the trees to make the final prediction.
 Random forests are widely used for classification and regression tasks.
 It is a type of classifier that uses many decision trees to make predictions.
 It takes different random parts of the dataset to train each tree and then it combines the results by
averaging them.
 This approach helps improve the accuracy of predictions. Random Forest is based on
ensemble learning.

 Imagine asking a group of friends for advice on where to go for vacation. Each friend gives their
recommendation based on their unique perspective and preferences (decision trees trained on
different subsets of data). You then make your final decision by considering the majority opinion
or averaging their suggestions (ensemble prediction).
 Process starts with a dataset with rows and their corresponding class labels (columns).
 Then - Multiple Decision Trees are created from the training data.
 Each tree is trained on a random subset of the data (with replacement) and a random subset of
features. This process is known as bagging or bootstrap aggregating.
 Each Decision Tree in the ensemble learns to make predictions independently.
 When presented with a new, unseen instance, each Decision Tree in the ensemble makes a
prediction.
 The final prediction is made by combining the predictions of all the Decision Trees. This is
typically done through a majority vote (for classification) or averaging (for regression).
The Random Forest algorithm works in several steps:
 Random Forest builds multiple decision trees using random samples of the
data. Each tree is trained on a different subset of the data which makes each
tree unique.
 When creating each tree the algorithm randomly selects a subset of features or
variables to split the data rather than using all available features at a time.
This adds diversity to the trees.
 Each decision tree in the forest makes a prediction based on the data it was
trained on. When making final prediction random forest combines the results
from all the trees.
 For classification tasks the final prediction is decided by a majority vote. This
means that the category predicted by most trees is the final prediction.
 For regression tasks the final prediction is the average of the predictions from all
the trees.
 The randomness in data samples and feature selection helps to prevent the model from overfitting.
Assumptions of Random Forest
 Each tree makes its own decisions: Every tree in the forest makes its own predictions
without relying on others.
 Random parts of the data are used: Each tree is built using random samples and
features to reduce mistakes.
 Enough data is needed: Sufficient data ensures the trees are different and learn unique
patterns and variety.
 Different predictions improve accuracy: Combining the predictions from different trees leads to a more accurate final result.

Key Features of Random Forest


 Handles Missing Data: Automatically handles missing values during training,
eliminating the need for manual imputation.
 Algorithm ranks features based on their importance in making predictions offering
valuable insights for feature selection and interpretability.
 Scales Well with Large and Complex Data without significant performance
degradation.
 Algorithm is versatile and can be applied to both classification tasks (e.g., predicting
categories) and regression tasks (e.g., predicting continuous values).
• Random Forest learns from the training data like a real estate expert.
• After training, it predicts house prices on the test set.
• We evaluate the model's performance using the Mean Squared Error and the R-squared score, which show how accurate the predictions are, and we use a random sample to check the model's predictions (a minimal sketch follows).
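A minimal sketch of that workflow, assuming scikit-learn; the California housing data is used here only as a stand-in house-price dataset (it is downloaded on first use):

```python
# Random Forest regression sketch: train on house-price data, predict on the test
# set, and report Mean Squared Error and the R-squared score.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0, n_jobs=-1)
rf.fit(X_train, y_train)            # each tree sees a bootstrap sample and random features
y_pred = rf.predict(X_test)         # predictions are averaged over all the trees

print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
print("Sample prediction:", rf.predict(X_test[:1]), "actual:", y_test[:1])
```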

Advantages of Random Forest


 Random Forest provides very accurate predictions even with large datasets.
 Random Forest can handle missing data well without compromising with accuracy.
 It doesn’t require normalization or standardization of the dataset.
 When we combine multiple decision trees it reduces the risk of overfitting of the model.

Limitations of Random Forest


 It can be computationally expensive especially with a large number of trees.
 It’s harder to interpret the model compared to simpler models like decision trees.
