
Decision tree

Decision tree learning is one of the most widely adopted algorithms for classification.
As the name indicates, it builds a model in the form of a tree structure. Its classification
accuracy is comparable with other methods, and it is very efficient.
A decision tree is used for multi-dimensional analysis with multiple classes. It is
characterized by fast execution time and ease in the interpretation of the rules. The goal of
decision tree learning is to create a model (based on the past data called past vector) that
predicts the value of the output variable based on the input variables in the feature vector.
Each node (or decision node) of a decision tree corresponds to one of the features in the
feature vector. From every node, there are edges to children, wherein there is an edge for
each of the possible values (or range of values) of the feature associated with the
node. The tree terminates at different leaf nodes (or terminal nodes), where each
leaf node represents a possible value for the output variable. The output variable is
determined by following a path that starts at the root and is guided by the values
of the input variables.
A decision tree is usually represented in the format depicted in Figure 7.8.

[Figure: a small two-level tree with root node 'A', branch node 'B', and T/F branches leading to leaf nodes]

FIG. 7.8
Decision tree structure

Each internal node (represented by boxes) tests an attribute (represented as 'A' and 'B'
within the boxes). Each branch corresponds to an attribute value (T/F in the above
case). Each leaf node assigns a classification. The first node is called the Root Node;
branches from the root node lead to Branch Nodes and, finally, to Leaf Nodes. Here, 'A'
is the Root Node (first node), 'B' is a Branch Node, and 'T' and 'F' are Leaf Nodes.
Thus, a decision tree consists of three types of nodes:
• Root Node
• Branch Node
• Leaf Node

Figure 7.9 shows an example decision tree for car driving - the decision to be
taken is whether to 'Keep Going' or to 'Stop', which depends on various situations as
depicted in the figure. If the signal is RED in colour, then the car should be stopped.
If there is not enough gas (petrol) in the car, the car should be stopped at the next
available gas station.
[Figure: driving decision tree - the questions 'Is there a stoplight ahead?', 'Is the stoplight red?' and 'Is there enough gas in the car?' lead to the outcomes 'Keep going', 'Stop' and 'Find a petrol station and buy gas']
FIG. 7.9
Decision tree example

7.5.2.1 Building a decision tree


Decision trees are built corresponding to the training data following an approach
called recursive partitioning. The approach splits the data into multiple subsets on
the basis of the feature values. It starts from the root node, which is nothing but
the entire data set. It first selects the feature which predicts the target class in the
strongest way. The decision tree splits the data set into multiple partitions, with the
data in each partition having a distinct value for the feature based on which the
partitioning has happened. This is the first set of branches. Likewise, the algorithm
continues splitting the nodes on the basis of the feature which helps in the best
partition. This continues till a stopping criterion is reached. The usual stopping
criteria are -
1. All or most of the examples at a particular node have the same class
2. All features have been used up in the partitioning
3. The tree has grown to a pre-defined threshold limit
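The sketch below (not from the text) shows what this recursive partitioning loop looks like in Python. Here choose_best_feature is only a placeholder for whichever splitting measure is used (information gain is discussed later in this section), and the three stopping criteria above appear as the base cases of the recursion.

from collections import Counter

def choose_best_feature(rows, features, target):
    # Placeholder: a real implementation ranks features by a purity measure
    # such as information gain (Section 7.5.2.4); here we simply take the first.
    return features[0]

def build_tree(rows, features, target, depth=0, max_depth=5):
    labels = [row[target] for row in rows]
    # Stopping criterion 1: all examples at this node have the same class
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping criteria 2 and 3: no features left, or depth threshold reached
    if not features or depth >= max_depth:
        return Counter(labels).most_common(1)[0][0]
    best = choose_best_feature(rows, features, target)
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        remaining = [f for f in features if f != best]
        tree[best][value] = build_tree(subset, remaining, target, depth + 1, max_depth)
    return tree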
Let us try to understand this in the context of an example. Global Technology
Solutions (GTS), a leading provider of IT solutions, is coming to the College of
Engineering and Management (CEM) for hiring B.Tech. students. Last year, during
campus recruitment, they had shortlisted 18 students for the final interview. Being a
company of international repute, they follow a stringent interview process to select
only the best of the students. The information related to the interview evaluation
results of the shortlisted students (hiding the names) on the basis of different evaluation
parameters is available for reference in Figure 7.10. Chandra, a student of CEM,
wants to find out if he may be offered a job in GTS. His CGPA is quite high. His self-
evaluation on the other parameters is as follows:
Communication - Bad; Aptitude - High; Programming skills - Bad
CGPA      Communication   Aptitude   Programming Skill   Job offered?
High      Good            High       Good                Yes
Medium    Good            High       Good                Yes
Low       Bad             Low        Good                No
Low       Good            Low        Bad                 No
High      Good            High       Bad                 Yes
High      Good            High       Good                Yes
Medium    Bad             Low        Bad                 No
Medium    Bad             Low        Good                No
High      Bad             High       Good                Yes
Medium    Good            High       Good                Yes
Low       Bad             High       Bad                 No
Low       Bad             High       Bad                 No
Medium    Good            High       Bad                 Yes
Low       Good            Low        Good                No
High      Bad             Low        Bad                 No
Medium    Bad             High       Good                No
High      Bad             Low        Bad                 No
Medium    Good            High       Bad                 Yes

FIG. 7.10
Training data for GTS recruitment
Let us try to solve this problem, i.e. predicting whether Chandra will get a job offer,
by using the decision tree model. First, we need to draw the decision tree corresponding
to the training data given in Figure 7.10. According to the table, the job offer condition
(i.e. the outcome) is FALSE for all the cases where Aptitude = Low, irrespective of the
other conditions. So, the feature Aptitude can be taken up as the first node of the
decision tree.
For Aptitude = High, the job offer condition is TRUE for all the cases where
Communication = Good. For cases where Communication = Bad, the job offer condition
is TRUE for all the cases where CGPA = High.
Figure 7.11 depicts the complete decision tree diagram for the table given in
Figure 7.10.
[Figure: decision tree for the GTS data - START leads to Aptitude? (Low: Job not offered; High: Communication?); Communication? (Good: Job offered; Bad: CGPA?); CGPA? (High: Job offered; Medium/Low: Job not offered)]

FIG. 7.11
Decision tree based on the training data

7.5.2.2 Searching a decision tree


By using the above decision tree depicted in Figure 7.11, we need to predict whether
Chandra might get a job offer for the given parameter values: CGPA = High,
Communication = Bad, Aptitude = High, Programming skills = Bad. There are multiple
ways to search through the trained decision tree for a solution to the given
prediction problem.

Exhaustive search
1. Place the item in the first group (class). Recursively examine solutions with the
item in the first group (class).
2. Place the item in the second group (class). Recursively examine solutions with
the item in the second group (class).
3. Repeat the above steps until the solution is reached.
Exhaustive search travels through the decision tree exhaustively, but it will take
much time when the decision tree is big with multiple leaves and multiple attribute
values.
Branch and bound search
Branch and bound uses an existing best solution to sidestep searching of the entire
decision tree in full. When the algorithm starts, the best solution is well defined to
have the worst possible value; thus, any solution it finds out is an improvement. This
makes the algorithm initially run down to the left-most branch of the tree, even
though that is unlikely to produce a realistic result. In the partitioning problem, that
solution corresponds to putting every item in one group, and it is an unacceptable
solution. A programme can speed up the process by using a fast heuristic to find an
initial solution, which can then be used as an input for branch and bound. If the heuristic is
right, the savings can be substantial.

[Figure: the same decision tree as Figure 7.11, with the path Aptitude = High → Communication = Bad → CGPA = High highlighted, ending at 'Job offered']
FIG. 7.12
Decision tree based on the training data (depicting a sample path)
Figure 7.12 depicts a sample path (thick line) for the conditions CGPA = High,
Communication = Bad, Aptitude = High and Programming skills = Bad. According to
the above decision tree, the prediction can be made that Chandra will get the job offer.
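As an aside (not part of the text), the tree of Figure 7.11 can be written down as a nested dictionary and the prediction for Chandra obtained by walking one path from the root, exactly as the thick line in Figure 7.12 does:

tree = {
    "Aptitude": {
        "Low": "Job not offered",
        "High": {
            "Communication": {
                "Good": "Job offered",
                "Bad": {
                    "CGPA": {
                        "High": "Job offered",
                        "Medium": "Job not offered",
                        "Low": "Job not offered",
                    }
                },
            }
        },
    }
}

def classify(node, example):
    # Internal nodes are {feature: {value: subtree}}; leaves are plain strings.
    while isinstance(node, dict):
        feature, branches = next(iter(node.items()))
        node = branches[example[feature]]
    return node

chandra = {"CGPA": "High", "Communication": "Bad",
           "Aptitude": "High", "Programming Skill": "Bad"}
print(classify(tree, chandra))   # -> Job offered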
There are many implementations of decision tree, the most prominent ones being
C5.0, CART (Classification and Regression Tree), CHAID (Chi-square Automatic
Interaction Detector) and ID3 (Iterative Dichotomiser 3) algorithms. The biggest
challenge of a decision tree algorithm is to find out which feature to split upon. The
main driver for identifying the feature is that the data should be split in such a way
that the partitions created by the split contain examples belonging to a single
class. If that happens, the partitions are considered to be pure. Entropy is a measure
of impurity of an attribute or feature adopted by many algorithms such as ID3 and
C5.0. The information gain is calculated on the basis of the decrease in entropy (S)
after a data set is split according to a particular attribute (A). Constructing a decision
tree is all about finding an attribute that returns the highest information gain (i.e. the
most homogeneous branches).

Note:
Like information gain, there are other measures like Gini index or chi-square for
individual nodes to decide the feature on the basis of which the split has to be applied.
The CART algorithm uses Gini index, while the CHAID algorithm uses chi-square
for deciding the feature for applying split.

7.5.2.3 Entropy of a decision tree


Let us say S is the sample set of training examples. Then, Entropy(S), measuring the
impurity of S, is defined as

Entropy(S) = -Σ (i = 1 to c) p_i log2(p_i)

where c is the number of different class labels and p_i refers to the proportion of
values falling into the i-th class label.
For example, with respect to the training data in Figure 7.10, we have two values
for the target class 'Job Offered?' - Yes and No. The value of p_i for class value 'Yes'
is 0.44 (i.e. 8/18) and that for class value 'No' is 0.56 (i.e. 10/18). So, we can calculate
the entropy as
Entropy(S) = -0.44 log2(0.44) - 0.56 log2(0.56) = 0.99
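This figure is easy to reproduce with a few lines of Python (an illustrative helper, not the book's code): for the 'Job Offered?' column with 8 Yes and 10 No values,

import math

def entropy(counts):
    # Entropy of a class distribution given as a list of class counts.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([8, 10]), 2))   # -> 0.99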
7.5.2.4 Information gain of a decision tree

The information gain is calculated on the basis of the decrease in entropy (S) after a
data set is split according to a particular attribute (A). Constructing a decision tree is
all about finding an attribute that returns the highest information gain (i.e. the most
homogeneous branches). If the information gain is 0, it means that there is no reduction
in entropy due to the split of the data set according to that particular feature. On
the other hand, the maximum amount of information gain which may happen is the
entropy of the data set before the split.
Information gain for a particular feature A is calculated as the difference between the
entropy before the split (S_bs) and the entropy after the split (S_as):

Information Gain(S, A) = Entropy(S_bs) - Entropy(S_as)

For calculating the entropy after the split, the entropy for all partitions needs to be
considered. Then, the weighted summation of the entropy for each partition can be taken
as the total entropy after the split. For performing the weighted summation, the proportion
of examples falling into each partition is used as the weight:

Entropy(S_as) = Σ (i = 1 to n) w_i × Entropy(p_i)
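A short sketch of these two formulas (names are illustrative, not from the text): the weighted post-split entropy and the resulting information gain, checked here on the 'Aptitude' split of Figure 7.10 (High gives 8 Yes / 3 No, Low gives 0 Yes / 7 No).

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, partition_counts):
    # parent_counts: class counts before the split, e.g. [8, 10]
    # partition_counts: one list of class counts per partition after the split
    total = sum(parent_counts)
    entropy_after = sum(sum(p) / total * entropy(p) for p in partition_counts)
    return entropy(parent_counts) - entropy_after

print(round(information_gain([8, 10], [[8, 3], [0, 7]]), 2))   # -> 0.47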
Let us examine the value of information gain for the training data set shown in
Figure 7.10. We will find the value of entropy at the beginning before any split happens
and then again after the split happens. We will compare the values for all the cases -
1. when the feature 'CGPA' is used for the split
2. when the feature 'Communication' is used for the split
3. when the feature 'Aptitude' is used for the split
4. when the feature 'Programming Skill' is used for the split
Figure 7.13a gives the entropy values for the first level split for each of the cases
mentioned above.
As calculated, the entropy of the data set before the split (i.e. Entropy(S_bs)) is 0.99, and
the entropy of the data set after the split (i.e. Entropy(S_as)) is
• 0.69 when the feature 'CGPA' is used for the split
• 0.63 when the feature 'Communication' is used for the split
• 0.52 when the feature 'Aptitude' is used for the split
• 0.95 when the feature 'Programming Skill' is used for the split
(a) Original data set:

               Yes     No      Total
Count          8       10      18
pi             0.44    0.56
-pi*log2(pi)   0.52    0.47    0.99

Total Entropy = 0.99

(b) Split data set (based on the feature 'CGPA'):

CGPA = High:
               Yes     No      Total
Count          4       2       6
pi             0.67    0.33
-pi*log2(pi)   0.39    0.53    0.92

CGPA = Medium:
               Yes     No      Total
Count          4       3       7
pi             0.57    0.43
-pi*log2(pi)   0.46    0.52    0.99

CGPA = Low:
               Yes     No      Total
Count          0       5       5
pi             0.00    1.00
-pi*log2(pi)   0.00    0.00    0.00

Total Entropy = 0.69    Information Gain = 0.30

(c) Split data set (based on the feature 'Communication'):

Communication = Good:
               Yes     No      Total
Count          7       2       9
pi             0.78    0.22
-pi*log2(pi)   0.28    0.48    0.76

Communication = Bad:
               Yes     No      Total
Count          1       8       9
pi             0.11    0.89
-pi*log2(pi)   0.35    0.15    0.50

Total Entropy = 0.63    Information Gain = 0.36

(d) Split data set (based on the feature 'Aptitude'):

Aptitude = High:
               Yes     No      Total
Count          8       3       11
pi             0.73    0.27
-pi*log2(pi)   0.33    0.51    0.85

Aptitude = Low:
               Yes     No      Total
Count          0       7       7
pi             0.00    1.00
-pi*log2(pi)   0.00    0.00    0.00

Total Entropy = 0.52    Information Gain = 0.47

(e) Split data set (based on the feature 'Programming Skill'):

Programming Skill = Good:
               Yes     No      Total
Count          5       4       9
pi             0.56    0.44
-pi*log2(pi)   0.47    0.52    0.99

Programming Skill = Bad:
               Yes     No      Total
Count          3       6       9
pi             0.33    0.67
-pi*log2(pi)   0.53    0.39    0.92

Total Entropy = 0.95    Information Gain = 0.04

FIG. 7.13A
Entropy and information gain calculation (Level 1)
Therefore, the information gain from the feature 'CGPA' = 0.99 - 0.69 = 0.30,
whereas the information gain from the feature 'Communication' = 0.99 - 0.63 = 0.36.
Likewise, the information gain for 'Aptitude' and 'Programming Skill' is 0.47 and
0.04, respectively.
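These four values can be verified with the same helpers (redefined here so the snippet stands alone), feeding in the (Yes, No) counts of each partition from Figure 7.13A:

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, partitions):
    total = sum(parent)
    after = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent) - after

parent = [8, 10]                                     # 8 Yes, 10 No overall
splits = {
    "CGPA":              [[4, 2], [4, 3], [0, 5]],   # High, Medium, Low
    "Communication":     [[7, 2], [1, 8]],           # Good, Bad
    "Aptitude":          [[8, 3], [0, 7]],           # High, Low
    "Programming Skill": [[5, 4], [3, 6]],           # Good, Bad
}
for feature, parts in splits.items():
    print(feature, round(information_gain(parent, parts), 2))
# CGPA 0.3, Communication 0.36, Aptitude 0.47, Programming Skill 0.04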
Hence, it is quite evident that among all the features, 'Aptitude' results in the best
information gain when adopted for the split. So, at the first level, a split will be applied
according to the value of 'Aptitude'; in other words, 'Aptitude' will be the first node
of the decision tree formed. One important point to be noted here is that for Aptitude
= Low, the entropy is 0, which indicates that the result will always be the same irrespective
of the values of the other features. Hence, the branch towards Aptitude = Low
will not continue any further.
As a part of level 2, we will thus have only one branch to navigate in this case - the
one for Aptitude = High. Figure 7.13b presents the calculations for level 2. As can be seen
from the figure, the entropy value is as follows:
• 0.85 before the split
• 0.33 when the feature 'CGPA' is used for the split
• 0.30 when the feature 'Communication' is used for the split
• 0.80 when the feature 'Programming Skill' is used for the split
Hence, the information gain after the split with the features CGPA, Communication
and Programming Skill is 0.52, 0.55 and 0.05, respectively. Hence, the feature
Communication should be used for this split as it results in the highest information
gain. So, at the second level, a split will be applied on the basis of the value of
Communication. Again, the point to be noted here is that for Communication =
Good, the entropy is 0, which indicates that the result will always be the same irrespective
of the values of the other features. Hence, the branch towards Communication =
Good will not continue any further.

Aptitude = High (subset of the training data):

CGPA      Communication   Programming Skill   Job offered?
High      Good            Good                Yes
Medium    Good            Good                Yes
High      Good            Bad                 Yes
High      Good            Good                Yes
High      Bad             Good                Yes
Medium    Good            Good                Yes
Low       Bad             Bad                 No
Low       Bad             Bad                 No
Medium    Good            Bad                 Yes
Medium    Bad             Good                No
Medium    Good            Bad                 Yes

(a) Level 2 starting set:

               Yes     No      Total
Count          8       3       11
pi             0.73    0.27
-pi*log2(pi)   0.33    0.51    0.85

Total Entropy = 0.85

(b) Split data set (based on the feature 'CGPA'):

CGPA = High:
               Yes     No      Total
Count          4       0       4
pi             1.00    0.00
-pi*log2(pi)   0.00    0.00    0.00

CGPA = Medium:
               Yes     No      Total
Count          4       1       5
pi             0.80    0.20
-pi*log2(pi)   0.26    0.46    0.72

CGPA = Low:
               Yes     No      Total
Count          0       2       2
pi             0.00    1.00
-pi*log2(pi)   0.00    0.00    0.00

Total Entropy = 0.33    Information Gain = 0.52

(c) Split data set (based on the feature 'Communication'):

Communication = Good:
               Yes     No      Total
Count          7       0       7
pi             1.00    0.00
-pi*log2(pi)   0.00    0.00    0.00

Communication = Bad:
               Yes     No      Total
Count          1       3       4
pi             0.25    0.75
-pi*log2(pi)   0.50    0.31    0.81

Total Entropy = 0.30    Information Gain = 0.55

(d) Split data set (based on the feature 'Programming Skill'):

Programming Skill = Good:
               Yes     No      Total
Count          5       1       6
pi             0.83    0.17
-pi*log2(pi)   0.22    0.43    0.65

Programming Skill = Bad:
               Yes     No      Total
Count          3       2       5
pi             0.60    0.40
-pi*log2(pi)   0.44    0.53    0.97

Total Entropy = 0.80    Information Gain = 0.05

FIG. 7.13B
Entropy and information gain calculation (Level 2)
As a part of level 3, we will thus have only one branch to navigate in this case - the
one for Communication = Bad. Figure 7.13c presents the calculations for level 3. As can
be seen from the figure, the entropy value is as follows:
• 0.81 before the split
• 0 when the feature 'CGPA' is used for the split
• 0.50 when the feature 'Programming Skill' is used for the split

Aptitude = High & Communication = Bad (subset of the training data):

CGPA      Programming Skill   Job offered?
High      Good                Yes
Low       Bad                 No
Low       Bad                 No
Medium    Good                No

(a) Level 3 starting set:

               Yes     No      Total
Count          1       3       4
pi             0.25    0.75
-pi*log2(pi)   0.50    0.31    0.81

Total Entropy = 0.81

(b) Split data set (based on the feature 'CGPA'):

CGPA = High:
               Yes     No      Total
Count          1       0       1
pi             1.00    0.00
-pi*log2(pi)   0.00    0.00    0.00

CGPA = Medium:
               Yes     No      Total
Count          0       1       1
pi             0.00    1.00
-pi*log2(pi)   0.00    0.00    0.00

CGPA = Low:
               Yes     No      Total
Count          0       2       2
pi             0.00    1.00
-pi*log2(pi)   0.00    0.00    0.00

Total Entropy = 0.00    Information Gain = 0.81

(c) Split data set (based on the feature 'Programming Skill'):

Programming Skill = Good:
               Yes     No      Total
Count          1       1       2
pi             0.50    0.50
-pi*log2(pi)   0.50    0.50    1.00

Programming Skill = Bad:
               Yes     No      Total
Count          0       2       2
pi             0.00    1.00
-pi*log2(pi)   0.00    0.00    0.00

Total Entropy = 0.50    Information Gain = 0.31

FIG. 7.13C
Entropy and information gain calculation (Level 3)
Hence, the information gain after the split with the feature CGPA is 0.81, which
is the maximum possible information gain (as the entropy before the split was 0.81).
Hence, obviously, a split will be applied on the basis of the value of CGPA.
Because the maximum information gain is already achieved, the tree will not continue
any further.

7.5.2.5 Algorithm for decision tree


Input: Training data set, test data set (or data points)
Steps:
  Do for all attributes
    Calculate the entropy E_i of the attribute F_i
    if E_i < E_min
      then E_min = E_i and F_min = F_i
    end if
  End do
  Split the data set into subsets using the attribute F_min
  Draw a decision tree node containing the attribute F_min and split the data
  set into subsets
  Repeat the above steps until the full tree is drawn covering all the attributes
  of the original table
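A self-contained Python rendering of this greedy procedure is sketched below (an ID3-style learner written for this example, not the book's code). It selects at each node the attribute with the highest information gain, so on the 18 rows of Figure 7.10 it should reproduce the tree of Figure 7.11 (Aptitude, then Communication, then CGPA).

import math
from collections import Counter

FEATURES = ["CGPA", "Communication", "Aptitude", "Programming Skill"]
ROWS = [
    ("High","Good","High","Good","Yes"),  ("Medium","Good","High","Good","Yes"),
    ("Low","Bad","Low","Good","No"),      ("Low","Good","Low","Bad","No"),
    ("High","Good","High","Bad","Yes"),   ("High","Good","High","Good","Yes"),
    ("Medium","Bad","Low","Bad","No"),    ("Medium","Bad","Low","Good","No"),
    ("High","Bad","High","Good","Yes"),   ("Medium","Good","High","Good","Yes"),
    ("Low","Bad","High","Bad","No"),      ("Low","Bad","High","Bad","No"),
    ("Medium","Good","High","Bad","Yes"), ("Low","Good","Low","Good","No"),
    ("High","Bad","Low","Bad","No"),      ("Medium","Bad","High","Good","No"),
    ("High","Bad","Low","Bad","No"),      ("Medium","Good","High","Bad","Yes"),
]
DATA = [dict(zip(FEATURES + ["Job offered?"], row)) for row in ROWS]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, feature, target):
    labels = [r[target] for r in rows]
    after = 0.0
    for value in set(r[feature] for r in rows):
        part = [r[target] for r in rows if r[feature] == value]
        after += len(part) / len(rows) * entropy(part)
    return entropy(labels) - after

def build(rows, features, target="Job offered?"):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                 # pure partition -> leaf
        return labels[0]
    if not features:                          # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, f, target))
    node = {best: {}}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        node[best][value] = build(subset, [f for f in features if f != best], target)
    return node

print(build(DATA, FEATURES))   # root should be 'Aptitude', as derived above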

7.5.2.6 Avoiding overfitting in decision trees - pruning

The decision tree algorithm, unless a stopping criterion is applied, may keep growing
indefinitely - splitting for every feature and dividing into smaller partitions till the
point that the data is perfectly classified. This, as is quite evident, results in the overfitting
problem. To prevent a decision tree from getting overfitted to the training data, pruning
of the decision tree is essential. Pruning a decision tree reduces the size of the tree
such that the model is more generalized and can classify unknown and unlabelled
data in a better way.
There are two approaches to pruning:
• Pre-pruning: Stop growing the tree before it reaches perfection.
• Post-pruning: Allow the tree to grow entirely and then post-prune some of the
branches from it.
In the case of pre-pruning, the tree is stopped from growing further once it reaches
a certain number of decision nodes or decisions. Hence, in this strategy, the algorithm
avoids overfitting as well as optimizes the computational cost. However, it also stands a
chance of ignoring important information contributed by a feature which was skipped,
thereby resulting in missing out certain patterns in the data.
On the other hand, in the case of post-pruning, the tree is allowed to grow to the
full extent. Then, by using certain pruning criteria, e.g. error rates at the nodes, the
size of the tree is reduced. This is a more effective approach in terms of classification
accuracy, as it considers all minute information available from the training data.
However, the computational cost is obviously more than that of pre-pruning.
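In practice (not covered in the text), both strategies are available in library implementations; for instance, scikit-learn's DecisionTreeClassifier exposes pre-pruning through parameters such as max_depth and min_samples_leaf, and post-pruning through cost-complexity pruning (ccp_alpha). The sketch below uses the iris data set purely as a stand-in, since the GTS table would first need one-hot encoding.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # stand-in data for illustration

# Pre-pruning: stop growing once depth / leaf-size thresholds are reached.
pre_pruned = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                                    min_samples_leaf=2).fit(X, y)

# Post-pruning: compute the cost-complexity pruning path, pick an alpha,
# and refit; larger ccp_alpha values prune the fully grown tree more heavily.
path = DecisionTreeClassifier(criterion="entropy").cost_complexity_pruning_path(X, y)
post_pruned = DecisionTreeClassifier(criterion="entropy",
                                     ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
print(pre_pruned.get_depth(), post_pruned.get_depth())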
7.5.2.7 Strengths of decision tree

• It produces very simple, understandable rules. For smaller trees, not much mathematical
and computational knowledge is required to understand this model.
• It works well for most of the problems.
• It can handle both numerical and categorical variables.
• It can work well both with small and large training data sets.
• Decision trees provide a definite clue of which features are more useful for
classification.

7.5.2.8 Weaknesses of decision tree


• Decision tree models are often biased towards features having a larger number
of possible values, i.e. levels.
• This model gets overfitted or underfitted quite easily.
• Decision trees are prone to errors in classification problems with many classes
and a relatively small number of training examples.
• A decision tree can be computationally expensive to train.
• Large trees are complex to understand.
7.5.2.9 Application of decision tree
there is a finite list of attributes
Decision tree can be applied in a dataset in which
attribute (e.g. High for the attri
and each data instance stores a value for that
values (e.g. 'High.
bute CGPA). When each attribute has a small number of distinct
an
'Medium, Low'), it is easier/quicker for thedecision tree to suggest (or choose)(e.g.
effective solution.This algorithm can beextended to handle real-value attributes
a floating point temperature).
The most straightforward case exists when there are only two possible values tor
an attribute (Boolean classification). Example: Communication has only two values
as Good' or 'Bad. It is also easy to extend the decision tree tocreate a target fune
tionwith more than two possible output values. Example: CGPA can take one of the
values from 'High, 'Medium} and Low: Irrespective of whether it is a binary value
multiple values, it is discrete in nature. For example, Aptitude can take the value of
either 'High' or Low. It is not possible to assign the value of both High' and 'Low 0
theattribute Aptitude to draw adecision tree.
There should be no infinite loops in taking a decision. As we move from the root
node to the next level node, it should move step by step towards the decision node.
Otherwise, the algorithm may not give the final result for a given data. If a set of code
goes into a loop, it would repeat itself forever, unless the system crashes.
A decision tree can be used even for instances with missing attributes and for
instances with errors in the classification of examples or in the attribute values
describing those examples; such instances are handled well by decision trees, thereby
making them a robust learning method.
