ML Unit 3 Notes-1


UNIT – III

Dr. Snehlata
Assistant Professor
Introduction to Decision Tree
• A decision tree is used to create a learning model that can predict the class or value of the target variable for a test example.
• The decision tree uses prior training data to predict the class of a new example.
A DECISION TREE
• “It is a flowchart-like structure in which each internal node represents a “test” on an attribute and each branch represents the outcome of the test.”
• The end node (called a leaf node) represents a class label.
• A decision tree is a supervised learning method.
• It is used for both classification and regression tasks in machine learning.
DECISION TREE LEARNING
• It is a method for approximating discrete-valued target functions (concepts), in which the learned function is represented by a decision tree.
TERMINOLOGIES IN DECISION
TREE
• Root Node: It represents the entire population (or sample), which gets further divided into two or more sets.
• Splitting: The process of dividing a node into two or more sub-nodes in order to grow the tree.
• Decision Nodes: When a sub-node splits into further sub-nodes, it is called a decision node.
• Leaf/Terminal Node: The end nodes that do not split are called leaf or terminal nodes.
TERMINOLOGIES IN DECISION
TREE
• Pruning: The removal of sub-nodes in order to reduce the size of the tree is called pruning.
• Branch (sub-tree): A subsection of the entire tree is called a branch or sub-tree.
• Parent Node: A node that is divided into sub-nodes is called a parent node.
• Child Nodes: The sub-nodes of a parent node are called child nodes.
How does the Decision Tree
algorithm Work?
• In a decision tree, to predict the class of a given example, the algorithm starts from the root node of the tree.
• The algorithm compares the value of the root attribute with the corresponding attribute of the record (real dataset) and, based on the comparison, follows the branch and jumps to the next node.
• At the next node, the algorithm again compares the attribute value with the sub-nodes and moves further down the tree.
• It continues the process until it reaches a leaf node of the tree.
• The complete process can be better understood using the algorithm below:
How does the Decision Tree
algorithm Work?
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets, one for each possible value of the best attribute.
• Step-4: Generate the decision tree node, which contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; such final nodes are called leaf nodes (a short code sketch follows below).
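As a concrete illustration of these steps, here is a minimal sketch using scikit-learn's DecisionTreeClassifier. The use of this library is an assumption of these notes, and the salary/distance/cab-facility feature values and labels are made-up placeholders for the job-offer scenario, not data from the slides.

```python
# A sketch of the five steps using scikit-learn (assumed library).
# Each row encodes [salary_in_lakhs, distance_km, cab_facility]; the values
# and labels are made-up placeholders for the job-offer scenario.
from sklearn.tree import DecisionTreeClassifier

X = [[12, 5, 1], [6, 25, 0], [15, 30, 1], [7, 4, 0]]
y = ["Accept", "Decline", "Accept", "Decline"]

# criterion="entropy" uses information gain as the ASM; "gini" uses the Gini index.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Predicting walks the learned tree from the root node down to a leaf node.
print(tree.predict([[10, 10, 1]]))
```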
Example
• Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (Distance from the office) and one leaf node, based on the corresponding labels. The next decision node further splits into one decision node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accept offer and Decline offer). Consider the diagram below.
Attribute Selection Measures
• While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes.
• To solve such problems there is a technique called the Attribute Selection Measure (ASM).
• With this measure, we can easily select the best attribute for the nodes of the tree.
• There are two popular techniques for ASM, which are:
• Information Gain
• Gini Index
Advantages of the Decision
Tree
• It is simple to understand, as it follows the same process which a human follows while making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a
problem.
• There is less requirement of data cleaning compared to
other algorithms.
Disadvantages of the Decision
Tree
• The decision tree contains lots of layers, which makes it
complex.
• It may have an overfitting issue, which can be resolved
using the Random Forest algorithm.
• For more class labels, the computational complexity of the
decision tree may increase.
UNIT III
Lecture 14
Dr. Snehlata
Assistant Professor
UCER, Prayagraj
ENTROPY E(x)
• The average amount of information contained in a random variable (X) is called entropy.
• It is a “measure of the randomness of information” of a variable.
• Denoted by E or H.
• For a variable whose outcomes have probabilities pi, E(X) = -Σ pi log2(pi).
EXAMPLE 1

Let’s have a dataset made up of three colors: red, purple, and yellow. If we have one red, three purple, and four yellow observations in our set, our equation becomes:
E = -(Pr log2(Pr) + Pp log2(Pp) + Py log2(Py))
• We have Pr = 1/8 (red in the dataset)
• Pp = 3/8 (purple in the dataset)
• Py = 4/8 (yellow in the dataset)

• E = -(1/8 log2(1/8) + 3/8 log2(3/8) + 4/8 log2(4/8)) = 1.41
• What happens when all observations belong to the same class? In such a case, the entropy will always be zero.
• Such a dataset has no impurity.
• This implies that such a dataset would not be useful for learning.
• If we have a dataset with, say, two classes, half of it yellow and the other half purple, the entropy will be one.
Example 2
• Calculate the entropy E for the single-attribute “Play Golf” problem when the following data is given.

Play Golf: Yes = 9, No = 5
Solution
E(S) = -Σ pi log2(pi)
where S = current state and pi = probability of event i of state S.
Entropy(Play Golf) = E(9, 5)
Probability of Play Golf = Yes: 9/14 = 0.64
Probability of Play Golf = No: 5/14 = 0.36
Entropy = -(0.36 log2(0.36) + 0.64 log2(0.64))
E(S) = 0.94
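The same calculation can be reproduced with a short Python sketch; the entropy() helper below is a direct implementation of the formula above, and the (9, 5) class counts come from the Play Golf table.

```python
from math import log2

def entropy(counts):
    """E(S) = -sum(p_i * log2(p_i)) over the class counts of S."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 2))  # 0.94 (9 Yes, 5 No)
```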
Example 3
• Calculate the entropy of the “Play Golf” target for the Outlook attribute (i.e., the weighted entropy after splitting on Outlook) with the given dataset.

Outlook vs Play Golf (Yes, No):
Sunny: 3 Yes, 2 No
Overcast: 4 Yes, 0 No
Rainy: 2 Yes, 3 No
Solution
E(Play Golf, Outlook) = P(Sunny) × E(3,2) + P(Overcast) × E(4,0) + P(Rainy) × E(2,3)
P(Sunny) × E(3,2) = 5/14 × [-(3/5 log2(3/5) + 2/5 log2(2/5))] = 5/14 × 0.971 = 0.347
P(Overcast) × E(4,0) = 4/14 × 0 = 0
P(Rainy) × E(2,3) = 5/14 × [-(2/5 log2(2/5) + 3/5 log2(3/5))] = 5/14 × 0.971 = 0.347
E(Play Golf, Outlook) = 0.347 + 0 + 0.347 ≈ 0.69
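A short sketch reproducing this weighted-entropy calculation; the class counts mirror the Outlook table above, and the code itself is only illustrative.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Class counts of Play Golf within each Outlook value: [Yes, No]
splits = {"Sunny": [3, 2], "Overcast": [4, 0], "Rainy": [2, 3]}
total = sum(sum(c) for c in splits.values())  # 14 examples overall

# Weighted average of the subset entropies
e_outlook = sum(sum(c) / total * entropy(c) for c in splits.values())
print(round(e_outlook, 2))  # 0.69
```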
INFORMATION GAIN
• In machine Learning and decision trees, the information gain(IG) is
defined as the reduction (decrease) in entropy.
• Information gain is the measurement of changes in entropy after
the segmentation of a dataset based on an attribute.
• It calculates how much information a feature provides us about a
class.
• Information gain helps to determine the order of attributes in the
nodes of a decision tree.
• According to the value of information gain, we split the node and
build the decision tree.
INFORMATION GAIN
• The main node is referred to as the parent node, whereas sub-nodes are known as child nodes.
• We can use information gain to determine how good the splitting of nodes in a decision tree is.
Information Gain = E(parent) - [weighted average] × E(children)
• E(parent) is the entropy of the parent node and E(children) is the weighted average entropy of the child nodes.
Example
Suppose we have a dataset with two classes.
This dataset has 5 purple and 5 yellow examples.
Since the dataset is balanced, we expect its entropy to be 1.
Say we split the dataset into two branches. One branch ends up having four values while the other has six.
The left branch has four purples, while the right one has five yellows and one purple.
Example
• We mentioned that when all the observations belong to the same class, the entropy is zero since the dataset is pure. As such, the entropy of the left branch is E(left) = 0.
• On the other hand, the right branch has five yellows and one purple.
• Thus: E(right) = -(5/6 log2(5/6) + 1/6 log2(1/6)) ≈ 0.65
Example
• A perfect split would have five examples on each branch.
• This is clearly not a perfect split, but we can determine
how good the split is.
• We know the entropy of each of the two branches.
• We weight the entropy of each branch by the number of
elements each contains.
• This helps us calculate the quality of the split.
Example
• The branch on the left has 4 elements, while the other has 6, out of a total of 10. Therefore, the weighting goes as shown below:
E(split) = (4/10) × 0 + (6/10) × 0.65 = 0.39
• The entropy before the split, which we referred to as the initial entropy, is E(initial) = 1.
• After splitting, the current value is E(split) = 0.39.
• We can now get our information gain, which is the entropy we “lost” after splitting.
Example
Information Gain = E(initial) - E(split) = 1 - 0.39 = 0.61
The more the entropy removed, the greater the information gain.
The higher the information gain, the better the split.
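The purple/yellow example can be checked with the sketch below; the entropy() helper implements the entropy formula from earlier, and the counts are those of the example.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

parent = [5, 5]               # 5 purple, 5 yellow -> entropy 1
left, right = [4, 0], [1, 5]  # left: 4 purple; right: 1 purple, 5 yellow
n = sum(parent)

e_children = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
info_gain = entropy(parent) - e_children
print(round(e_children, 2), round(info_gain, 2))  # 0.39 0.61
```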
Decision Tree
Algorithms
Dr. Snehlata
Assistant Professor
Types of Decision Tree
Algorithms
• 1. Iterative Dichotomizer 3 (ID3) Algorithm
• 2. C4.5 Algorithm
• 3. Classification and Regression Tree (CART) Algorithm
General Decision Tree
Algorithm Steps
1. Calculate the entropy E of every attribute (A) of the dataset (S).
2. Split the dataset (S) into subsets using the attribute for which the resulting entropy after splitting is minimized (equivalently, for which the information gain is maximized).
Iterative Dichotomizer 3
Algorithm
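The detailed ID3 pseudocode is not reproduced in these notes, so the following is only a simplified, illustrative sketch of the recursive procedure for categorical attributes. It assumes each training example is a dictionary of attribute values and greedily splits on the attribute that minimizes the weighted entropy (i.e., maximizes information gain), exactly as in the general steps above.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split leaves the lowest weighted entropy."""
    def split_entropy(a):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[a], []).append(y)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return min(attributes, key=split_entropy)

def id3(rows, labels, attributes):
    if len(set(labels)) == 1:            # pure node -> leaf with that class
        return labels[0]
    if not attributes:                   # nothing left to split on -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    tree = {a: {}}
    for value in set(row[a] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[a] == value]
        remaining = [x for x in attributes if x != a]
        tree[a][value] = id3([rows[i] for i in idx],
                             [labels[i] for i in idx],
                             remaining)
    return tree
```

On the standard Play Golf data, such a sketch would be expected to place Outlook at the root, since splitting on it yields the highest information gain.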
Inductive Bias
Dr. Snehlata
Assistant Professor
Inductive Bias in Decision Tree
Learning
• The Inductive Bias of machine learning is the set of
assumptions that the learner uses to predict outputs of
given inputs that it has not encountered.
• An approximation of inductive bias of ID3 decision tree
algorithm: “Shorter trees are preferred over longer trees.
Trees that place high information gain attributes close to
the root are preferred over those that do not.”
• Inductive bias is a “policy” by which the decision tree
algorithm generalizes from observed training examples to
classify unseen instances.
Inductive Bias in Decision Tree
Learning
• Inductive bias is an essential requirement in machine learning. With inductive bias, a learning algorithm can easily generalize to new, unseen examples.
• Bias: The assumptions made by a model to make a function easier to learn are called bias.
• Variance: If you get very low error on the training data but very high error on the testing data, this difference is called “variance” in machine learning.
ISSUES IN DECISION TREE
LEARNING
1. Avoiding Over Fitting of Data
a) Reduced Error Pruning
b) Rule Post Pruning
2. Incorporating continuous-valued attributes.
3. Alternative measures for selecting attributes.
4. Handling training examples with missing attribute values.
5. Handling attributes with differing costs.
OVERFITTING OF DATA IN
DECISION TREES
• “Overfitting” of data is a condition in which the model fits the training data completely but fails to generalize to the testing data.
• “Given a hypothesis space H, a hypothesis h1 ∊ H is said to “overfit” the training data if there exists some alternative hypothesis h2 ∊ H, such that h1 has smaller error than h2 over the training examples, but h2 has smaller error than h1 over the entire distribution of instances (i.e., training + testing data set).”
Techniques to reduce Overfitting
• 1. Reduce model complexity.
• 2. Early stopping: stop the training process before final classification.
• 3. Post-pruning: prune the decision tree nodes after the overfit has occurred.
• 4. Ridge regularization.
• 5. Lasso regularization.
• 6. Use an ANN.
METHODS /APPROACHES TO AVOID
OVER FITTING IN DECISION TREES
1. Pre-pruning (avoiding overfitting): Stop growing the decision tree before it finally classifies the training data.
2. Post-pruning (after overfitting): Allow the decision tree to overfit the data and then post-prune the tree's leaf nodes.
ALGORITHM STEPS FOR DECISION
TREE FOR BOOLEAN FUNCTIONS
• 1) Every variable in the Boolean function (e.g., A, B, C) has two possibilities: True (1) or False (0).
• If the Boolean function is true, we write Yes (Y) in the leaf node.
• If the Boolean function is false, we write No (N) in the leaf node.
• Boolean functions are evaluated from left to right, i.e., the first variable on the LHS is always the root node.
Example
• Make decision trees for the following Boolean functions (expressions):
a) A ⋀ ¬B
b) A ⋁ [B ⋀ C]
c) A XOR B
d) [A ⋀ B] ⋁ [C ⋀ D]

Truth table for a) A ⋀ ¬B:
A B A⋀B ¬B A⋀¬B Y or N
F F F T F NO
F T F F F NO
T F F T T YES
T T T F F NO
Step 1: If A = True and B = True, then the final decision in the truth table is “NO”.
Step 2: If A = True and B = False, then the final decision in the truth table is “YES”.
Step 3: If A = False and B = any value, then the final decision in the truth table is “NO”.
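As a quick cross-check of the table above, the truth table for expression (a) A ⋀ ¬B can be enumerated with a few lines of Python; this is an illustration only, not part of the original slides.

```python
# Enumerate the truth table for A AND (NOT B); the rows match the table above.
from itertools import product

for A, B in product([False, True], repeat=2):
    value = A and not B
    print(A, B, "YES" if value else "NO")
```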
A ⋁ [B ⋀ C]
Instance Based
Learning
Dr. Snehlata
Assistant Professor
Instance Based Learning
• It is a supervised learning technique used for classification and regression tasks.
• It performs its operations by comparing the current (new) instances with previously stored instances.
• It is also called “Lazy Learning” or “Memory-Based Learning”.
Types of Instance Based
Learning
• K-Nearest Neighbour (KNN) Algorithm
• Locally Weighted Regression Algorithm
• Radial Basis Function Network
• Case-Based Learning
• Learning Vector Quantization (LVQ)
• Self-Organizing Map (SOM)
KNN Example
Dr. Snehlata
Assistant Professor
Example
• To solve the numerical example on the K-nearest neighbor
i.e. KNN classification algorithm, we will use the following
dataset.
• In the Given dataset, we have fifteen data points with
three class labels. Now, suppose that we have to find the
class label of the point P= (5, 7).
Point Coordinates Class Label
A1 (2,10) C2
A2 (2, 6) C1
A3 (11,11) C3
A4 (6, 9) C2
A5 (6, 5) C1
A6 (1, 2) C1
A7 (5, 10) C2
A8 (4, 9) C2
A9 (10, 12) C3
A10 (7, 5) C1
A11 (9, 11) C3
A12 (4, 6) C1
A13 (3, 10) C2
A14 (3, 8) C2
A15 (6, 11) C2
• For this, we will first specify the number of nearest
neighbors i.e. k. Let us take k to be 3.
• Now, we will find the distance of P to each data point in the
dataset.
• For this KNN classification numerical example, we will use the Euclidean distance metric.
• The following table shows the Euclidean distance of P to each data point in the dataset.
Point Coordinates Distance from P (5, 7)
A1 (2, 10) 4.24
A2 (2, 6) 3.16
A3 (11, 11) 7.21
A4 (6, 9) 2.23
A5 (6, 5) 2.23
A6 (1, 2) 6.40
A7 (5, 10) 3.0
A8 (4, 9) 2.23
A9 (10, 12) 7.07
A10 (7, 5) 2.82
A11 (9, 11) 5.65
A12 (4, 6) 1.41
A13 (3, 10) 3.60
A14 (3, 8) 2.23
A15 (6, 11) 4.12
• After finding the distance of each point in the dataset to P,
we will sort the above points according to their distance
from P (5, 7).
• After sorting, we get the following table.
Point Coordinates Distance from P (5, 7)
A12 (4, 6) 1.41
A4 (6, 9) 2.23
A5 (6, 5) 2.23
A8 (4, 9) 2.23
A14 (3, 8) 2.23
A10 (7, 5) 2.82
A7 (5, 10) 3
A2 (2, 6) 3.16
A13 (3, 10) 3.6
A15 (6, 11) 4.12
A1 (2, 10) 4.24
A11 (9, 11) 5.65
A6 (1, 2) 6.4
A9 (10, 12) 7.07
A3 (11, 11) 7.21
• As we have taken k = 3, we will now consider the class labels of the three points in the dataset nearest to point P in order to classify P. In the above table, A12, A4, and A5 are the 3 closest neighbors of point P.
• Hence, we will use the class labels of points A12, A4, and A5 to decide the class label for P.

Point Coordinates Distance from P (5, 7)
A12 (4, 6) 1.41
A4 (6, 9) 2.23
A5 (6, 5) 2.23
• Now, points A12, A4, and A5 have the class labels C1, C2, and C1 respectively. Among these points, the majority class label is C1.
• Therefore, we will specify the class label of point P = (5, 7)
as C1.
• Hence, we have successfully used KNN classification to
classify point P according to the given dataset.
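The whole worked example can be reproduced with the short sketch below; the coordinates and labels are copied from the table above (with the duplicated point label corrected to A14), and ties at equal distance are broken by insertion order, which matches the choice of A12, A4, and A5 in the slides.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

data = {
    "A1": ((2, 10), "C2"), "A2": ((2, 6), "C1"),  "A3": ((11, 11), "C3"),
    "A4": ((6, 9), "C2"),  "A5": ((6, 5), "C1"),  "A6": ((1, 2), "C1"),
    "A7": ((5, 10), "C2"), "A8": ((4, 9), "C2"),  "A9": ((10, 12), "C3"),
    "A10": ((7, 5), "C1"), "A11": ((9, 11), "C3"), "A12": ((4, 6), "C1"),
    "A13": ((3, 10), "C2"), "A14": ((3, 8), "C2"), "A15": ((6, 11), "C2"),
}

P, k = (5, 7), 3

# Sort all points by Euclidean distance to P and keep the k nearest neighbours.
nearest = sorted(data.items(), key=lambda item: dist(item[1][0], P))[:k]
labels = [label for _, (_, label) in nearest]

print([name for name, _ in nearest])         # ['A12', 'A4', 'A5']
print(Counter(labels).most_common(1)[0][0])  # majority class label -> 'C1'
```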
Question
Locally Weighted
Regression
Dr. Snehlata
Assistant Professor
Linear regression is a supervised learning algorithm used for computing linear relationships between the input (X) and the output (Y).

The steps involved in ordinary linear regression are: fit the parameters θ that minimize the squared-error cost J(θ) = Σi (θᵀ xi - yi)² over the training data, then predict ŷ = θᵀ x for a new input x.

• Such a global linear model cannot be used for making predictions when there exists a non-linear relationship between X and Y. In such cases, locally weighted linear regression is used.
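A minimal sketch of locally weighted regression with a Gaussian weighting kernel is given below; the sine-curve training data and the bandwidth parameter tau are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def lwr_predict(x_query, x_train, y_train, tau=0.5):
    """Predict y at x_query with a locally weighted linear fit."""
    X = np.column_stack([np.ones_like(x_train), x_train])    # add bias column
    w = np.exp(-((x_train - x_query) ** 2) / (2 * tau ** 2))  # Gaussian weights
    W = np.diag(w)
    # Weighted least squares: theta = (X^T W X)^+ X^T W y
    theta = np.linalg.pinv(X.T @ W @ X) @ (X.T @ W @ y_train)
    return theta[0] + theta[1] * x_query

x_train = np.linspace(0, 6, 50)
y_train = np.sin(x_train)                  # a non-linear relationship
print(lwr_predict(3.0, x_train, y_train))  # close to sin(3.0) ~ 0.14
```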
RADIAL BASIS FUNCTION(RBF) –

• It is a mathematical function whose value depends only on the distance from the origin (or from a centre point).
• An RBF works by measuring the distance of its input from the centre or origin point.
• This is done by using absolute values (the norm of the difference).
• It is denoted by φ(x) = φ(|x|).
• The RBF is used for approximation of a multivariate target function.
RADIAL BASIS FUNCTION(RBF) –

• The target function approximation is given as
f(x) = w0 + Σu wu Ku(d(xu, x))
• where f(x) = approximation of the multivariate target function
w0 = initial (bias) weight
wu = weight of unit u
Ku(d(xu, x)) = kernel function of unit u
d(xu, x) = distance between xu and x
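A small sketch of this approximation using Gaussian kernels for Ku; the centres xu and the weights are made-up values chosen only for illustration.

```python
import numpy as np

def gaussian_kernel(distance, width=1.0):
    """K_u(d) = exp(-d^2 / (2 * width^2))."""
    return np.exp(-(distance ** 2) / (2 * width ** 2))

def rbf_approx(x, centres, weights, w0=0.0):
    """f(x) = w0 + sum_u w_u * K_u(d(x_u, x))."""
    distances = np.abs(centres - x)      # d(x_u, x) for every centre x_u
    return w0 + np.sum(weights * gaussian_kernel(distances))

centres = np.array([0.0, 1.0, 2.0])      # the x_u
weights = np.array([0.5, -0.2, 0.8])     # the w_u
print(rbf_approx(1.5, centres, weights, w0=0.1))
```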
RADIAL BASIS FUNCTION
NETWORKS -
• They are used in artificial neural networks (ANN).
• They are used for classification tasks in ANN.
• They are also commonly used in ANN for function approximation.
• RBF networks are different from simple ANNs due to their universal approximation property and faster training speed.
RADIAL BASIS FUNCTION
NETWORKS -
• An RBF network is a feed-forward neural network.
• It consists of three layers: an input layer, a middle (hidden) layer and an output layer.
CASE-BASED REASONING (CBR)
• Also called Case-based Learning.
• Used for classification and regression.
• It is the process of solving new problems based on the
solutions of similar past problems.
• It is an advanced instance-based learning method that is
used to solve more complex problems.
• It does not use the Euclidean distance metric.
STEPS IN CBR -
• Retrieve – Gather Data from memory. Check any previous
solution similar to the current problem.
• Reuse – Suggest a solution based on the experience.
Adapt it to meet the demands of the new situation.
• Revise – Evaluate the use of the solution in a new context.
• Retain – Store this new problem-solving method in the
memory system.
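A toy sketch of the four steps for a helpdesk-style case base is shown below; the cases, the attribute-matching similarity measure, and the (skipped) revision step are illustrative assumptions rather than a standard CBR implementation.

```python
# Case base: each case is a problem description plus the solution that worked.
case_base = [
    {"problem": {"printer": "offline", "cable": "loose"}, "solution": "Reseat the cable"},
    {"problem": {"printer": "offline", "cable": "ok"},    "solution": "Restart the print spooler"},
]

def similarity(a, b):
    """Count how many attribute values the two problem descriptions share."""
    return sum(1 for key in a if b.get(key) == a[key])

def solve(new_problem):
    # Retrieve: find the most similar past case.
    case = max(case_base, key=lambda c: similarity(new_problem, c["problem"]))
    # Reuse: propose that case's solution (adaptation is skipped in this toy sketch).
    solution = case["solution"]
    # Revise: in a real system the solution would be tested and adapted here.
    # Retain: store the new problem with the solution that was applied.
    case_base.append({"problem": new_problem, "solution": solution})
    return solution

print(solve({"printer": "offline", "cable": "loose", "tray": "empty"}))
```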
Applications of CBR -
• Customer service helpdesk for diagnosis of problems.
• Engineering and law for technical design and legal rules.
• Medical science for patient case histories and treatment.

• For any query: write mail to snehlata@united.ac.in
