Data Mining

A decision tree uses attributes to split a node into subsets, with each leaf representing a class. It works by testing attribute values at each node to determine the path to a leaf class. The goal is to find the attribute that best splits the data to reduce impurity and diversity between child nodes.


Decision Tree

A decision tree works much like the game of twenty questions: each internal node of the tree denotes a question, or a test on the value of an independent attribute; each branch represents an outcome of the test; and each leaf represents a class.

Assume that each object has a number of independent attributes and one dependent attribute (the class).
Example (From Han and Kamber)
age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no
A Decision Tree (From Han and Kamber)

age?
  <=30   -> student?
              no  -> buys_computer = no
              yes -> buys_computer = yes
  31..40 -> buys_computer = yes
  >40    -> credit_rating?
              excellent -> buys_computer = no
              fair      -> buys_computer = yes
Decision Tree

To classify an object, the appropriate attribute value is tested at each node, starting from the root, to determine the branch taken. The path defined by these tests leads to a leaf node, which is the class the model believes the object belongs to.

Decision trees are an attractive technique since the results are easy to understand. The rules can often be expressed in natural language, e.g. if the student has GPA > 3.0 and class attendance > 90% then the student is likely to get a D.
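As an illustration (not from the slides), a tree like the Han and Kamber example above can be stored as nested dictionaries and an object classified by following one branch per test; the representation below is an assumption, with attribute names taken from that example.

```python
# A minimal sketch (not from the slides): the Han and Kamber tree above as
# nested dictionaries, and a classifier that walks from the root to a leaf
# by testing one attribute value at each internal node.

tree = {
    "attribute": "age",
    "branches": {
        "<=30": {
            "attribute": "student",
            "branches": {"no": "no", "yes": "yes"},
        },
        "31..40": "yes",
        ">40": {
            "attribute": "credit_rating",
            "branches": {"excellent": "no", "fair": "yes"},
        },
    },
}

def classify(node, obj):
    """Follow the branch chosen by each attribute test until a leaf (class label) is reached."""
    while isinstance(node, dict):
        node = node["branches"][obj[node["attribute"]]]
    return node

# Example object: a young student with fair credit.
print(classify(tree, {"age": "<=30", "student": "yes", "credit_rating": "fair"}))  # -> "yes"
```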
Basic Algorithm
1. The training data is S. Discretise all continuous-valued attributes. Let the root node contain S.
2. If all the objects in the root node belong to the same class, then stop.
3. Split the next leaf node by selecting the attribute A, from amongst the independent attributes, that best divides or splits the objects in the node into subsets, and create a decision tree node for it.
4. Split the node according to the values of A.
5. Stop if either of the following conditions is met, otherwise continue with step 3:
   -- the data in each subset belongs to a single class;
   -- there are no remaining attributes on which the sample may be further divided.
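The following Python sketch (an illustration, not code from the slides) mirrors the recursion described above. The splitting measure is left as a parameter, since the choice of criterion (information or the Gini index) is only discussed later in the deck.

```python
# A minimal sketch (not from the slides) of the recursive procedure above.
# `score(groups, target)` is any impurity-style measure of a candidate split,
# where `groups` maps each attribute value to the objects in that subset;
# a lower score means a better (purer) split.

from collections import Counter

def build_tree(objects, attributes, target, score):
    """Recursively build a decision tree from a list of dict-like objects."""
    classes = [obj[target] for obj in objects]
    # Stop: all objects in this node belong to a single class.
    if len(set(classes)) == 1:
        return classes[0]
    # Stop: no remaining attributes to split on; label with the majority class.
    if not attributes:
        return Counter(classes).most_common(1)[0][0]

    def partition(attr):
        groups = {}
        for obj in objects:
            groups.setdefault(obj[attr], []).append(obj)
        return groups

    # Choose the attribute whose split scores best (lowest impurity/diversity).
    best = min(attributes, key=lambda a: score(partition(a), target))
    remaining = [a for a in attributes if a != best]
    return {
        "attribute": best,
        "branches": {
            value: build_tree(subset, remaining, target, score)
            for value, subset in partition(best).items()
        },
    }
```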
Building A Decision Tree

The aim is to build a decision tree consisting of a root node, a number of internal nodes, and a number of leaf nodes. Building the tree starts with the root node, then splits the data into two or more children nodes, and continues splitting those at lower-level nodes, and so on until the process is complete.

The method uses induction based on the training data. We illustrate it using a simple example.
An Example (from the text)
Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
Yes          Yes    Yes             Yes         Yes       Strep throat
No           No     No              Yes         Yes       Allergy
Yes          Yes    No              Yes         No        Cold
Yes          No     Yes             No          No        Strep throat
No           Yes    No              Yes         No        Cold
No           No     No              Yes         No        Allergy
No           No     Yes             No          No        Strep throat
Yes          No     No              Yes         Yes       Allergy
No           Yes    No              Yes         Yes       Cold
Yes          Yes    No              Yes         Yes       Cold

• The first five attributes are symptoms and the last attribute is the diagnosis. All attributes are categorical.
• We wish to predict the diagnosis class.
An Example (from the text)
Consider each of the attributes in turn to see which would be a “good” one to start with.
Sore Throat   Diagnosis
No Allergy
No Cold
No Allergy
No Strep throat
No Cold
Yes Strep throat
Yes Cold
Yes Strep throat
Yes Allergy
Yes Cold

• Sore throat does not predict diagnosis.

An Example (from the text)
Is the symptom fever any better?

Fever Diagnosis
No Allergy
No Strep throat
No Allergy
No Strep throat
No Allergy
Yes Strep throat
Yes Cold
Yes Cold
Yes Cold
Yes Cold

Fever is better but not perfect.

An Example (from the text)
Try swollen glands

Swollen Glands   Diagnosis
No Allergy
No Cold
No Cold
No Allergy
No Allergy
No Cold
No Cold
Yes Strep throat
Yes Strep throat
Yes Strep throat

Good. Swollen glands = yes means Strep Throat.


An Example (from the text)
Try congestion

Congestion Diagnosis
No Strep throat
No Strep throat
Yes Allergy
Yes Cold
Yes Cold
Yes Allergy
Yes Allergy
Yes Cold
Yes Cold
Yes Strep throat

Not helpful.
An Example (from the text)

Try the symptom headache

Headache Diagnosis
No Cold
No Cold
No Allergy
No Strep throat
No Strep throat
Yes Allergy
Yes Allergy
Yes Cold
Yes Cold
Yes Strep throat

Not helpful.
An Example
This manual approach does not work if there are many attributes and a large training set. We need an algorithm that selects, as the split attribute, the attribute that best discriminates among the target classes.

How do we find the attribute that is most influential in determining the dependent attribute?

The tree continues to grow until it is no longer possible to find better ways to split the objects.

Finding the Split

One approach involves measuring the data’s diversity (or uncertainty) and choosing the split attribute that minimises diversity amongst the children nodes, i.e. that maximises:

diversity(before split) – diversity(left child) – diversity(right child)

We discuss two approaches. One is based on information theory and the other on the work of Gini, who devised a measure of the level of income inequality in a country.
Finding the Split

Since our aim is to find nodes whose objects all belong to the same class (called pure nodes), the term impurity is sometimes used to measure how far a node is from being pure.

The aim of the split then is to reduce impurity, i.e. to maximise:

impurity(before split) – impurity(left child) – impurity(right child)

Impurity is just a different term for the same idea. Information theory or the Gini index may be used to find the split attribute that reduces impurity by the largest amount.
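The worked example that follows uses the information measure; the Gini measure is only mentioned, not defined, in these slides. As a hedged aside, here is a minimal sketch assuming the commonly used Gini impurity (1 minus the sum of squared class proportions):

```python
# A minimal sketch (assumption: the standard Gini impurity 1 - sum(p_i^2),
# which the slides mention but do not define).

from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 0 for a pure node."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def gini_split(groups):
    """Weighted Gini impurity of the children produced by a candidate split.
    `groups` maps each attribute value to the list of class labels in that child."""
    total = sum(len(labels) for labels in groups.values())
    return sum(len(labels) / total * gini(labels) for labels in groups.values())

# Swollen glands from the example: yes -> 3 strep throat; no -> 4 cold, 3 allergy.
print(gini_split({
    "yes": ["strep"] * 3,
    "no": ["cold"] * 4 + ["allergy"] * 3,
}))  # about 0.34
```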
Information Theory
Suppose there is a variable s that can take either value x or value y.
• If s is always going to be x then there is no information and there is no uncertainty.
• What about p(x) = 0.9 and p(y) = 0.1?
• What about p(x) = 0.5 and p(y) = 0.5?

The measure of information is

I = – sum (pi log(pi))

Information
Information is defined as –pi*log(pi), where pi is the probability of some event.

pi is at most 1, so log(pi) is never positive and –pi*log(pi) is never negative.

Note that the log of 1 is zero, the log of any number greater than 1 is positive, and the log of any positive number smaller than 1 is negative.

Information Theory
For an event with two equally likely outcomes, such as a fair coin toss:

I = 2*(– 0.5 log(0.5))

This comes out to 1.0 and is the maximum information for an event with two possible values. It is also called entropy: a measure of the minimum number of bits required to encode the information.

Consider a die with six equally likely outcomes. The information is:

I = 6 * (– (1/6) log(1/6)) = 2.585

Therefore three bits are required to represent the outcome of rolling the die.
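A short Python sketch (not from the slides) of this information/entropy measure, checked against the coin and die values quoted above:

```python
# A minimal sketch (not from the slides) of the information / entropy measure
# I = -sum(p_i * log2(p_i)) discussed above.

import math

def information(probabilities):
    """Entropy in bits of a distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(information([0.5, 0.5]))    # fair coin: 1.0 bit
print(information([1 / 6] * 6))   # fair die: about 2.585 bits
print(information([0.9, 0.1]))    # biased coin: about 0.469 bits (less uncertainty)
```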

Information Theory
Why is the information lower if a toss is more likely to give a head than a tail?

If a loaded die had a much greater chance of showing a 6, say 50% or even 75%, does the roll of the die have less or more information?

The information is:

50%: I = 5 * (– (0.1) log(0.1)) – 0.5*log(0.5)
75%: I = 5 * (– (0.05) log(0.05)) – 0.75*log(0.75)

How many bits are required to represent the outcome of rolling the loaded die?
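Evaluating the two expressions numerically (with the entropy sketch from the previous slide repeated here so this snippet runs on its own) gives, approximately:

```python
import math

def information(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Loaded die with a 50% chance of a 6 (other faces 0.1 each) and
# a 75% chance of a 6 (other faces 0.05 each).
print(information([0.5] + [0.1] * 5))    # about 2.16 bits (vs 2.585 for a fair die)
print(information([0.75] + [0.05] * 5))  # about 1.39 bits
```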
Information Gain
• Select the attribute with the highest information gain
• Assume the training data S has two classes, P and N
– Let S contain a total of s objects, p of class P and n of
class N (so p + n = s)
– The amount of information in S, given the two classes P and N, is

  I(p, n) = –(p/s) log2(p/s) – (n/s) log2(n/s)

Information Gain
• Assume that using an attribute A the set S is
partitioned into {S1, S2 , …, Sv}
– If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all the subtrees Si, is

  E(A) = sum over i = 1..v of ((pi + ni) / (p + n)) * I(pi, ni)

– The encoding information that would be gained by branching on A is

  Gain(A) = I(p, n) – E(A)
Back to the Example
There are 10 (s = 10) samples and three classes.
Strep throat = t = 3
Cold = c = 4
Allergy = a = 3

Information = –(3/10) log(3/10) – (4/10) log(4/10) – (3/10) log(3/10) = 1.57

Let us now consider using each of the symptoms to split the sample.

Example
Sore Throat
Yes has t = 2, c = 2, a =1, total 5
No has t = 1, c = 2, a = 2, total 5
I(y) = 2*(–2/5 log(2/5)) – (1/5 log(1/5)) = 1.52
I(n) = 2*(–2/5 log(2/5)) – (1/5 log(1/5)) = 1.52
Information = 0.5*1.52 + 0.5*1.52 = 1.52

Fever
Yes has t = 1, c = 4, a =0, total 5
No has t = 2, c = 0, a = 3, total 5
I(y) = –(1/5) log(1/5) – (4/5) log(4/5) = 0.72
I(n) = –(2/5) log(2/5) – (3/5) log(3/5) = 0.97
Information = 0.5*I(y) + 0.5*I(n) = 0.846

Example
Swollen Glands
Yes has t = 3, c = 0, a =0, total 3
No has t = 0, c = 4, a = 3, total 7
I(y) = 0
I(n) = –(4/7) log(4/7) – (3/7) log(3/7) = 0.985
Information = 0.3*I(y) + 0.7*I(n) = 0.69

Congestion
Yes has t = 1, c = 4, a = 3, total 8
No has t = 2, c = 0, a = 0, total 2
I(y) = –(1/8) log(1/8) – (4/8) log(4/8) – (3/8) log(3/8) = 1.41
I(n) = 0
Information = 0.8*I(y) + 0.2*I(n) = 1.12

Example
Headache
Yes has t = 1, c = 2, a = 2, total 5
No has t = 2, c = 2, a = 1, total 5
I(y) = 2*(–2/5 log(2/5)) – (1/5 log(1/5)) = 1.52
I(n) = 2*(–2/5 log(2/5)) – (1/5 log(1/5)) = 1.52
Information = 0.5*1.52 + 0.5*1.52 = 1.52

So the values for information are:

Sore Throat     1.52
Fever           0.85
Swollen Glands  0.69
Congestion      1.12
Headache        1.52

Swollen Glands gives the smallest value, and hence the largest information gain, so it is chosen as the split attribute at the root. A sketch that recomputes these values follows.
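As a check (an illustration, not part of the slides), the following Python snippet recomputes the weighted information of each candidate split from the ten training samples; attribute and class names are taken from the example table.

```python
# A minimal sketch (not from the slides) that recomputes the weighted
# information (expected entropy) of each candidate split attribute for the
# ten training samples in the example.

import math
from collections import Counter

# (sore_throat, fever, swollen_glands, congestion, headache, diagnosis)
samples = [
    ("yes", "yes", "yes", "yes", "yes", "strep throat"),
    ("no",  "no",  "no",  "yes", "yes", "allergy"),
    ("yes", "yes", "no",  "yes", "no",  "cold"),
    ("yes", "no",  "yes", "no",  "no",  "strep throat"),
    ("no",  "yes", "no",  "yes", "no",  "cold"),
    ("no",  "no",  "no",  "yes", "no",  "allergy"),
    ("no",  "no",  "yes", "no",  "no",  "strep throat"),
    ("yes", "no",  "no",  "yes", "yes", "allergy"),
    ("no",  "yes", "no",  "yes", "yes", "cold"),
    ("yes", "yes", "no",  "yes", "yes", "cold"),
]
attributes = ["sore throat", "fever", "swollen glands", "congestion", "headache"]

def entropy(labels):
    """I = -sum(p_i log2 p_i) over the class proportions in `labels`."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def split_information(index):
    """Weighted entropy of the children produced by splitting on attribute `index`."""
    groups = {}
    for row in samples:
        groups.setdefault(row[index], []).append(row[-1])
    total = len(samples)
    return sum(len(labels) / total * entropy(labels) for labels in groups.values())

root = entropy([row[-1] for row in samples])
print(f"information before split: {root:.2f}")  # about 1.57
for i, name in enumerate(attributes):
    info = split_information(i)
    print(f"{name:15s} info = {info:.2f}  gain = {root - info:.2f}")
# Expected: sore throat 1.52, fever 0.85, swollen glands 0.69,
# congestion 1.12, headache 1.52 (so swollen glands has the largest gain).
```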

Decision Tree
Continuing the process one more step finds Fever as the next split attribute, giving the final result:

Swollen Glands?
  Yes -> Diagnosis = Strep Throat
  No  -> Fever?
           Yes -> Diagnosis = Cold
           No  -> Diagnosis = Allergy
Information Gain
Assume that using an attribute A the set S is
partitioned into {S1, S2 , …, Sv}
– If Si contains pi samples of class P and ni of class N, the entropy, or the expected information needed to classify objects in all the subtrees Si, is

  E(A) = sum over i = 1..v of ((pi + ni) / (p + n)) * I(pi, ni)

– The information gain by branching on A is

  Gain(A) = I(p, n) – E(A)
