Data Mining
15/10/2019 ©GKGupta 1
Example (From Han and Kamber)
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
A Decision Tree (From Han and Kamber)
[Figure: decision tree with root test age? and branches <=30, 31…40 and >40; the leaves are labelled no / yes.]
Decision Tree
An Example (from the text)
Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
Yes          Yes    Yes             Yes         Yes       Strep throat
No           No     No              Yes         Yes       Allergy
Yes          Yes    No              Yes         No        Cold
Yes          No     Yes             No          No        Strep throat
No           Yes    No              Yes         No        Cold
No           No     No              Yes         No        Allergy
No           No     Yes             No          No        Strep throat
Yes          No     No              Yes         Yes       Allergy
No           Yes    No              Yes         Yes       Cold
Yes          Yes    No              Yes         Yes       Cold
An Example (from the text)
Is the fever symptom any better?
Fever Diagnosis
No Allergy
No Strep throat
No Allergy
No Strep throat
No Allergy
Yes Strep throat
Yes Cold
Yes Cold
Yes Cold
Yes Cold
An Example (from the text)
Try swollen glands
Swollen Glands Diagnosis
No Allergy
No Cold
No Cold
No Allergy
No Allergy
No Cold
No Cold
Yes Strep throat
Yes Strep throat
Yes Strep throat
Congestion Diagnosis
No Strep throat
No Strep throat
Yes Allergy
Yes Cold
Yes Cold
Yes Allergy
Yes Allergy
Yes Cold
Yes Cold
Yes Strep throat
Not helpful.
An Example (from the text)
Headache Diagnosis
No Cold
No Cold
No Allergy
No Strep throat
No Strep throat
Yes Allergy
Yes Allergy
Yes Cold
Yes Cold
Yes Strep throat
Not helpful.
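The manual inspection on these slides (sort by one symptom, then look at which diagnoses fall under each value) can be sketched in Python. This is only an illustration: the names COLUMNS, ROWS and diagnoses_by_value are assumptions introduced here, and the data is the training table from the example.

```python
from collections import Counter, defaultdict

# Training set from the example. Column order is an assumption of this sketch:
# sore throat, fever, swollen glands, congestion, headache, diagnosis.
COLUMNS = ("sore_throat", "fever", "swollen_glands", "congestion", "headache")
ROWS = [
    ("Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    ("No", "No", "No", "Yes", "Yes", "Allergy"),
    ("Yes", "Yes", "No", "Yes", "No", "Cold"),
    ("Yes", "No", "Yes", "No", "No", "Strep throat"),
    ("No", "Yes", "No", "Yes", "No", "Cold"),
    ("No", "No", "No", "Yes", "No", "Allergy"),
    ("No", "No", "Yes", "No", "No", "Strep throat"),
    ("Yes", "No", "No", "Yes", "Yes", "Allergy"),
    ("No", "Yes", "No", "Yes", "Yes", "Cold"),
    ("Yes", "Yes", "No", "Yes", "Yes", "Cold"),
]


def diagnoses_by_value(attribute):
    """Count the diagnoses seen for each value (Yes / No) of one symptom."""
    col = COLUMNS.index(attribute)
    groups = defaultdict(Counter)
    for row in ROWS:
        groups[row[col]][row[-1]] += 1
    return {value: dict(counts) for value, counts in groups.items()}


# Print one summary per symptom, mirroring the sorted tables on the slides.
for name in COLUMNS:
    print(name, diagnoses_by_value(name))
```

A symptom is a useful split when each of its values maps to few diagnoses. Swollen glands = Yes maps only to strep throat, while headache mixes all three diagnoses under both values.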
An Example
Inspecting the attributes one at a time by eye does not
work if there are many attributes and a large training set.
We need an algorithm that selects, as the split attribute,
the attribute that best discriminates among the target classes.
Finding the Split
Information
The information contributed by an event with probability pi
is –pi*log(pi), with log taken to base 2. The total
information is the sum of this term over all possible events.
Information Theory
For an event with two equally likely outcomes:
I = 2*(– 0.5 log (0.5))
This comes out to 1.0 and is the maximum information for an
event with two possible values. It is also called entropy: a
measure of the minimum number of bits required to
encode the information.
Consider a die with six possible outcomes of equal
probability. The information is:
I = 6 * (– (1/6) log (1/6)) = 2.585
Therefore three bits are required to represent the
outcome of rolling a die.
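The two calculations above can be checked with a few lines of Python. This is a minimal sketch: the function name information is an assumption of the sketch, and logs are taken to base 2 so that the result is in bits.

```python
import math


def information(probabilities):
    """Sum of -p * log2(p) over all outcomes; outcomes with p = 0 add nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


coin = information([0.5, 0.5])   # two equally likely outcomes
die = information([1 / 6] * 6)   # six equally likely outcomes

print(round(coin, 3))   # 1.0
print(round(die, 3))    # 2.585
print(math.ceil(die))   # 3 whole bits are enough for one roll
```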
Information Theory
Why is information lower if a toss is more likely to get
a head than a tail?
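One way to explore the question is to compute the information of a single toss for several values of P(head). The sketch below assumes base-2 logs, and the helper name coin_information is illustrative only.

```python
import math


def coin_information(p_head):
    """Information, in bits, of one toss of a coin with P(head) = p_head."""
    total = 0.0
    for p in (p_head, 1.0 - p_head):
        if p > 0:  # an impossible outcome contributes nothing
            total -= p * math.log2(p)
    return total


# The more lopsided the coin, the more predictable the toss.
for p_head in (0.5, 0.7, 0.9, 1.0):
    print(p_head, round(coin_information(p_head), 3))
```

The values fall from 1.0 for a fair coin to roughly 0.881, 0.469 and finally 0 for a coin that always lands heads. The more predictable the outcome, the less is learned from observing it, so fewer bits are needed on average to encode it.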
Information Gain
• Assume that using an attribute A the set S is
partitioned into {S1, S2 , …, Sv}
– If Si contains pi examples of P and ni examples of N,
the entropy, or the expected information needed to
classify objects in all subtrees Si is
$$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n} \, I(p_i, n_i)$$
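A minimal sketch of E(A) for the two-class case follows. It assumes that p and n are the total numbers of P and N examples in S and that I(p, n) is the two-class entropy in bits. The function names info and expected_info are illustrative. The sample counts are read off the Han and Kamber table for the attribute age (P = buys_computer yes, N = no).

```python
import math


def info(p, n):
    """I(p, n): information of a set with p examples of P and n examples of N."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:  # an empty class contributes nothing
            frac = count / total
            result -= frac * math.log2(frac)
    return result


def expected_info(partitions):
    """E(A) for partitions S1..Sv given as a list of (p_i, n_i) pairs."""
    p = sum(p_i for p_i, _ in partitions)
    n = sum(n_i for _, n_i in partitions)
    return sum((p_i + n_i) / (p + n) * info(p_i, n_i) for p_i, n_i in partitions)


# age: <=30 has 2 yes / 3 no, 31…40 has 4 yes / 0 no, >40 has 3 yes / 2 no.
age_partitions = [(2, 3), (4, 0), (3, 2)]
e_age = expected_info(age_partitions)
print(round(e_age, 3))                 # 0.694
print(round(info(9, 5) - e_age, 3))    # 0.247
```

The last line uses the usual definition Gain(A) = I(p, n) – E(A), which the slide title refers to but the slide text does not spell out.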
Example
Sore Throat (t = strep throat, c = cold, a = allergy)
Yes has t = 2, c = 2, a = 1, total 5
No has t = 1, c = 2, a = 2, total 5
I(y) = 2*(–2/5 log(2/5)) – (1/5 log(1/5)) = 1.52
I(n) = 2*(–2/5 log(2/5)) – (1/5 log(1/5)) = 1.52
Information = 0.5*1.52 + 0.5*1.52 = 1.52
Fever
Yes has t = 1, c = 4, a =0, total 5
No has t = 2, c = 0, a = 3, total 5
I(y) = –1/5 log(1/5) – 4/5 log(4/5) = 0.722
I(n) = –2/5 log(2/5) – 3/5 log(3/5) = 0.971
Information = 0.5 I(y) + 0.5 I(n) = 0.846
Example
Swollen Glands
Yes has t = 3, c = 0, a =0, total 3
No has t = 0, c = 4, a = 3, total 7
I(y) = 0
I(n) = –4/7 log(4/7) – 3/7 log(3/7) = 0.985
Information = 0.3*I(y) + 0.7*I(n) = 0.69
Congestion
Yes has t = 1, c = 4, a =3, total 8
No has t = 2, c = 0, a = 0, total 2
I(y) = –1/8 log(1/8) – 4/8 log(4/8) – 3/8 log(3/8) = 1.406
I(n) = 0
Information = 0.8 I(y) + 0.2 I(n) = 1.12
Example
Headache
Yes has t = 1, c = 2, a = 2, total 5
No has t = 2, c = 2, a = 1, total 5
I(y) = 2*(–2/5 log(2/5)) – (1/5 log(1/5)) = 1.52
I(n) = 2*(–2/5 log(2/5)) – (1/5 log(1/5)) = 1.52
Information = 0.5*1.52 + 0.5*1.52 = 1.52
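The five hand calculations can be reproduced with a short script. This is a sketch under stated assumptions: the names COLUMNS, ROWS, class_information and split_information are introduced here for illustration, and logs are taken to base 2.

```python
import math
from collections import Counter

# Training set from the example. Column order is an assumption of this sketch:
# sore throat, fever, swollen glands, congestion, headache, diagnosis.
COLUMNS = ("sore_throat", "fever", "swollen_glands", "congestion", "headache")
ROWS = [
    ("Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    ("No", "No", "No", "Yes", "Yes", "Allergy"),
    ("Yes", "Yes", "No", "Yes", "No", "Cold"),
    ("Yes", "No", "Yes", "No", "No", "Strep throat"),
    ("No", "Yes", "No", "Yes", "No", "Cold"),
    ("No", "No", "No", "Yes", "No", "Allergy"),
    ("No", "No", "Yes", "No", "No", "Strep throat"),
    ("Yes", "No", "No", "Yes", "Yes", "Allergy"),
    ("No", "Yes", "No", "Yes", "Yes", "Cold"),
    ("Yes", "Yes", "No", "Yes", "Yes", "Cold"),
]


def class_information(labels):
    """Information (in bits) of the class distribution in a list of labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def split_information(attribute):
    """Weighted information of the subsets produced by splitting on one attribute."""
    col = COLUMNS.index(attribute)
    groups = {}
    for row in ROWS:
        groups.setdefault(row[col], []).append(row[-1])
    return sum(len(g) / len(ROWS) * class_information(g) for g in groups.values())


scores = {name: split_information(name) for name in COLUMNS}
for name, value in scores.items():
    print(name, round(value, 2))

# The attribute with the lowest remaining information is the best split.
best = min(scores, key=scores.get)
print(best)  # swollen_glands
```

The lowest value belongs to swollen glands (0.69), so it is chosen as the first split attribute. Sore throat and headache both leave 1.52 bits and are the least useful.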
Decision Tree
Continuing the process one more step will find Fever as
the next split attribute and the final result as shown.
[Figure: decision tree with root Swollen Glands (branches No / Yes) and a second-level node Fever (branches No / Yes).]
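Continuing the process can be sketched as a small recursive procedure: pick the attribute with the lowest remaining information, split, and repeat on each subset until it contains a single diagnosis. This is a simplified illustration in the style of ID3, not necessarily the exact algorithm of the text. All names (COLUMNS, ROWS, partition, build_tree) are assumptions of the sketch.

```python
import math
from collections import Counter

# Training set from the example. Column order is an assumption of this sketch:
# sore throat, fever, swollen glands, congestion, headache, diagnosis.
COLUMNS = ("sore_throat", "fever", "swollen_glands", "congestion", "headache")
ROWS = [
    ("Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    ("No", "No", "No", "Yes", "Yes", "Allergy"),
    ("Yes", "Yes", "No", "Yes", "No", "Cold"),
    ("Yes", "No", "Yes", "No", "No", "Strep throat"),
    ("No", "Yes", "No", "Yes", "No", "Cold"),
    ("No", "No", "No", "Yes", "No", "Allergy"),
    ("No", "No", "Yes", "No", "No", "Strep throat"),
    ("Yes", "No", "No", "Yes", "Yes", "Allergy"),
    ("No", "Yes", "No", "Yes", "Yes", "Cold"),
    ("Yes", "Yes", "No", "Yes", "Yes", "Cold"),
]


def class_information(labels):
    """Information (in bits) of the class distribution in a list of labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def partition(rows, attribute):
    """Group rows by their value for one attribute."""
    col = COLUMNS.index(attribute)
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row)
    return groups


def split_information(rows, attribute):
    """Weighted information left after splitting rows on one attribute."""
    groups = partition(rows, attribute).values()
    return sum(len(g) / len(rows) * class_information([r[-1] for r in g]) for g in groups)


def build_tree(rows, attributes):
    """Return a diagnosis (leaf) or a nested dict {attribute: {value: subtree}}."""
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1:      # pure subset: stop with a leaf
        return labels[0]
    if not attributes:             # nothing left to split on: majority class
        return Counter(labels).most_common(1)[0][0]
    best = min(attributes, key=lambda a: split_information(rows, a))
    remaining = [a for a in attributes if a != best]
    return {best: {value: build_tree(subset, remaining)
                   for value, subset in partition(rows, best).items()}}


tree = build_tree(ROWS, list(COLUMNS))
print(tree)
```

On this training set the sketch splits on swollen glands first. The Yes branch is already pure (strep throat), and the No branch is split on fever, giving cold for Yes and allergy for No.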