Data Mining
Proximity Measures for Binary Attributes: Contingency Table for Binary Attributes

q = number of attributes where i = 1 and j = 1
r = number of attributes where i = 1 and j = 0
s = number of attributes where i = 0 and j = 1
t = number of attributes where i = 0 and j = 0

1. Symmetric:  d(i,j) = (r + s) / (q + r + s + t)
2. Asymmetric: d(i,j) = (r + s) / (q + r + s)
Example:

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Jim    M        Y       Y       N        N        N        N
Mary   F        Y       N       P        N        P        N

Solution: Y and P are coded as 1, N as 0; Gender is a symmetric attribute and is left out of the asymmetric measure.
d(Jack, Jim):  q = 1 (Fever),          r = 1 (Test-1),  s = 1 (Cough)
d(Jack, Jim)  = (r + s) / (q + r + s) = (1 + 1) / (1 + 1 + 1) = 0.67

d(Jack, Mary): q = 2 (Fever, Test-1),  r = 0,           s = 1 (Test-3)
d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33

d(Jim, Mary):  q = 1 (Fever),          r = 1 (Cough),   s = 2 (Test-1, Test-3)
d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
These measurements suggest that Jim and Mary are unlikely to have a similar
disease because they have the highest dissimilarity value among the three
pairs. Of the three patients, Jack and Mary are the most likely to have a similar
disease.
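The calculation above can be sketched in Python; the function name and the 0/1 encoding of the patient records (Y and P as 1, N as 0, Gender dropped) follow the worked example, and the helper is illustrative rather than any library API.

```python
# Binary dissimilarity: d(i,j) = (r+s)/(q+r+s) asymmetric,
# (r+s)/(q+r+s+t) symmetric, as defined above.

def binary_dissimilarity(i, j, symmetric=False):
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # i only
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # j only
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)  # both 0
    denom = q + r + s + t if symmetric else q + r + s
    return (r + s) / denom

# Fever, Cough, Test-1..Test-4 with Y/P -> 1 and N -> 0.
jack = [1, 0, 1, 0, 0, 0]
jim  = [1, 1, 0, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]

print(round(binary_dissimilarity(jack, jim), 2))   # 0.67
print(round(binary_dissimilarity(jack, mary), 2))  # 0.33
print(round(binary_dissimilarity(jim, mary), 2))   # 0.75
```

Asymmetric is the right choice here because a shared negative test result (t) says little about disease similarity.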
Ordinal:
z_if = (r_if - 1) / (M_f - 1),  where r_if = the rank of the value and M_f = the total number of ranks

Numeric:
d_if = |x_if - x_jf| / (max - min)
Example:

Object Identifier   Test-1 (nominal)   Test-2 (ordinal)   Test-3 (numeric)
1                   Code A             Excellent          45
2                   Code B             Fair               22
3                   Code C             Good               64
4                   Code A             Excellent          28
For nominal:
d(i,j) = (p - m) / p,  where p = total number of attributes, m = number of matches

d(2,1) = (1 - 0)/1 = 1
d(3,1) = (1 - 0)/1 = 1
d(3,2) = (1 - 0)/1 = 1
d(4,1) = (1 - 1)/1 = 0
d(4,2) = (1 - 0)/1 = 1
d(4,3) = (1 - 0)/1 = 1

Dissimilarity Matrix:
     1    2    3    4
1    0
2    1    0
3    1    1    0
4    0    1    1    0
[ Note: if there are two or more attributes, there will still be one dissimilarity matrix
that represents the dissimilarities between objects based on all attributes. ]
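A minimal Python sketch of the nominal measure, reproducing the Test-1 column above (a single attribute, so p = 1); the function name is illustrative.

```python
# Nominal dissimilarity: d(i,j) = (p - m) / p,
# p = number of attributes, m = number of matching values.

def nominal_dissimilarity(i, j):
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

test1 = ["A", "B", "C", "A"]  # Test-1 codes for objects 1..4
for x in range(len(test1)):
    for y in range(x):
        d = nominal_dissimilarity([test1[x]], [test1[y]])
        print(f"d({x+1},{y+1}) = {d}")
```

Only d(4,1) is 0, because objects 1 and 4 share Code A.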
For Ordinal:
z_if = (r_if - 1) / (M_f - 1)

Ranks: Excellent (1), Fair (2), Good (3);  M = 3

z_Excellent = (1 - 1)/(3 - 1) = 0
z_Fair      = (2 - 1)/(3 - 1) = 0.5
z_Good      = (3 - 1)/(3 - 1) = 1

Using Manhattan distance on the z values:
d(2,1) = |0.5 - 0| = 0.5
d(3,1) = |1 - 0|   = 1
d(3,2) = |1 - 0.5| = 0.5
d(4,1) = |0 - 0|   = 0
d(4,2) = |0 - 0.5| = 0.5
d(4,3) = |0 - 1|   = 1

Dissimilarity Matrix:
     1     2     3     4
1    0
2    0.5   0
3    1     0.5   0
4    0     0.5   1     0
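The rank-to-z mapping and Manhattan step can be sketched as below, using the same Excellent(1)/Fair(2)/Good(3) ranks as the notes; the dictionary and function names are illustrative.

```python
# Ordinal normalization: z = (r - 1) / (M - 1), then Manhattan distance.

ranks = {"Excellent": 1, "Fair": 2, "Good": 3}
M = len(ranks)

def z(value):
    return (ranks[value] - 1) / (M - 1)

test2 = ["Excellent", "Fair", "Good", "Excellent"]  # objects 1..4
zs = [z(v) for v in test2]
print(zs)                  # [0.0, 0.5, 1.0, 0.0]
print(abs(zs[1] - zs[0]))  # d(2,1) = 0.5
print(abs(zs[3] - zs[2]))  # d(4,3) = 1.0
```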
For Numerical:
d_if = |x_if - x_jf| / (max - min),  with max - min = 64 - 22 = 42

d(2,1) = |22 - 45| / 42 = 0.54
d(3,1) = |64 - 45| / 42 = 0.45
d(3,2) = |64 - 22| / 42 = 1
d(4,1) = |28 - 45| / 42 = 0.40
d(4,2) = |28 - 22| / 42 = 0.14
d(4,3) = |28 - 64| / 42 = 0.86

Dissimilarity Matrix:
     1     2     3     4
1    0
2    0.54  0
3    0.45  1     0
4    0.40  0.14  0.86  0
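A short sketch of the min-max normalized distance on the Test-3 column; note the notes truncate values (23/42 appears as 0.54), while Python's rounding gives 0.55.

```python
# Numeric dissimilarity: d = |x_i - x_j| / (max - min).

test3 = [45, 22, 64, 28]       # Test-3 values for objects 1..4
rng = max(test3) - min(test3)  # 64 - 22 = 42

def numeric_dissimilarity(xi, xj):
    return abs(xi - xj) / rng

print(round(numeric_dissimilarity(22, 45), 2))  # 0.55 (truncated to 0.54 in the notes)
print(round(numeric_dissimilarity(28, 22), 2))  # 0.14
print(numeric_dissimilarity(64, 22))            # 1.0
```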
Combined dissimilarity (average of the nominal, ordinal, and numeric dissimilarities, all weights = 1):

d(2,1) = ((1*1) + (1*0.5) + (1*0.54)) / (1 + 1 + 1) = 0.68
d(3,1) = ((1*1) + (1*1)   + (1*0.45)) / 3 = 0.81
d(3,2) = ((1*1) + (1*0.5) + (1*1))    / 3 = 0.83
d(4,1) = ((1*0) + (1*0)   + (1*0.40)) / 3 = 0.13
d(4,2) = ((1*1) + (1*0.5) + (1*0.14)) / 3 = 0.54
d(4,3) = ((1*1) + (1*1)   + (1*0.86)) / 3 = 0.95

Dissimilarity Matrix:
     1     2     3     4
1    0
2    0.68  0
3    0.81  0.83  0
4    0.13  0.54  0.95  0
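The combination step is just an unweighted mean of the three per-attribute dissimilarities; a minimal sketch, with the inputs taken from the three matrices above:

```python
# Mixed-attribute dissimilarity: average of the nominal, ordinal,
# and numeric dissimilarities (all attribute weights = 1).

def combined(d_nom, d_ord, d_num):
    return (d_nom + d_ord + d_num) / 3

print(round(combined(1, 0.5, 0.54), 2))  # d(2,1) = 0.68
print(round(combined(1, 1, 0.86), 2))    # d(4,3) = 0.95
print(round(combined(0, 0, 0.40), 2))    # d(4,1) = 0.13
```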
Cosine Similarity:
cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||),  where ||d|| = sqrt(sum of squared components)
Which pair has the most similarity and the most dissimilarity? (The term-frequency vectors, read off from the dot products below, are d1 = (5,0,3,0,2,0,0,2,0,0), d2 = (3,0,2,0,1,1,0,1,0,1), d3 = (0,7,0,2,1,0,0,3,0,0), d4 = (0,1,0,0,1,2,2,0,3,0).)
d1.d2= (5*3)+(0*0)+(3*2)+(0*0)+(2*1)+(0*1)+(0*0)+(2*1)+(0*0)+(0*1)
= 15 + 6 + 2 + 2
= 25
d1.d3= (5*0)+(0*7)+(3*0)+(0*2)+(2*1)+(0*0)+(0*0)+(2*3)+(0*0)+(0*0)
=2+6
=8
d1.d4= (5*0)+(0*1)+(3*0)+(0*0)+(2*1)+(0*2)+(0*2)+(2*0)+(0*3)+(0*0)
=2
d2.d3= (3*0)+(0*7)+(2*0)+(0*2)+(1*1)+(1*0)+(0*0)+(1*3)+(0*0)+(1*0)
=1+3
=4
d2.d4= (3*0)+(0*1)+(2*0)+(0*0)+(1*1)+(1*2)+(0*2)+(1*0)+(0*3)+(1*0)
=1+2
=3
d3.d4= (0*0)+(7*1)+(0*0)+(2*0)+(1*1)+(0*2)+(0*2)+(3*0)+(0*3)+(0*0)
=7+1
=8
||d1|| = sqrt(5^2 + 3^2 + 2^2 + 2^2)             = sqrt(42) = 6.48
||d2|| = sqrt(3^2 + 2^2 + 1^2 + 1^2 + 1^2 + 1^2) = sqrt(17) = 4.12
||d3|| = sqrt(7^2 + 2^2 + 1^2 + 3^2)             = sqrt(63) = 7.93
||d4|| = sqrt(1^2 + 1^2 + 2^2 + 2^2 + 3^2)       = sqrt(19) = 4.35
cos(d1,d2) = 25 / (6.48 * 4.12) = 0.94
cos(d1,d3) =  8 / (6.48 * 7.93) = 0.15
cos(d1,d4) =  2 / (6.48 * 4.35) = 0.07
cos(d2,d3) =  4 / (4.12 * 7.93) = 0.12
cos(d2,d4) =  3 / (4.12 * 4.35) = 0.16
cos(d3,d4) =  8 / (7.93 * 4.35) = 0.23

(d1, d2) is the most similar pair, with similarity 94%.
(d1, d4) is the most dissimilar pair, with similarity only 7%.
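The whole cosine computation can be sketched as below, using the term-frequency vectors implied by the dot products above:

```python
# Cosine similarity: cos(a, b) = (a . b) / (||a|| * ||b||).
import math

d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
d3 = [0, 7, 0, 2, 1, 0, 0, 3, 0, 0]
d4 = [0, 1, 0, 0, 1, 2, 2, 0, 3, 0]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine(d1, d2), 2))  # 0.94 -> most similar pair
print(round(cosine(d1, d4), 2))  # 0.07 -> most dissimilar pair
```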
Chapter-6
Frequent patterns:

Support:    S(x -> y) = σ(x ∪ y) / N
Confidence: C(x -> y) = σ(x ∪ y) / σ(x)
TID   Items
10    Beef, Nuts, Durian
20    Beef, Coffee, Durian
30    Beef, Durian, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Durian, Eggs, Milk

S(Beef -> Durian)  = σ(Beef ∪ Durian) / N        = 3/5 = 0.6
C(Beef -> Durian)  = σ(Beef ∪ Durian) / σ(Beef)  = 3/3 = 1
S(Durian -> Beef)  = σ(Durian ∪ Beef) / N        = 3/5 = 0.6
C(Durian -> Beef)  = σ(Durian ∪ Beef) / σ(Durian) = 3/4 = 0.75
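The two formulas over the five transactions can be sketched as follows; `sigma` counts transactions containing an itemset, and the function names are illustrative.

```python
# Support S(x -> y) = sigma(x ∪ y) / N
# Confidence C(x -> y) = sigma(x ∪ y) / sigma(x)

transactions = [
    {"Beef", "Nuts", "Durian"},
    {"Beef", "Coffee", "Durian"},
    {"Beef", "Durian", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Durian", "Eggs", "Milk"},
]
N = len(transactions)

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

def support(x, y):
    return sigma(x | y) / N

def confidence(x, y):
    return sigma(x | y) / sigma(x)

print(support({"Beef"}, {"Durian"}))     # 0.6
print(confidence({"Beef"}, {"Durian"}))  # 1.0
print(confidence({"Durian"}, {"Beef"}))  # 0.75
```

Note that support is symmetric in x and y, while confidence is not: C(Beef -> Durian) = 1 but C(Durian -> Beef) = 0.75.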
Apriori Algorithm:
min_sup = 2

[ First worked example: the dataset and candidate/frequent itemset tables (C1, L1, C2, L2, C3, L3; fragments such as {E}: 3, {A,E}: 1, {B,C}: 2, {A,B,E}: 1) were figure content and did not survive extraction. ]
Dataset (min_sup = 2):

TID    Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

[ Note: rows T300-T500 were lost in extraction; they are restored here so that the C1 counts below hold. ]

C1:  {I1}: 6   {I2}: 7   {I3}: 6   {I4}: 2   {I5}: 2
L1 = C1 (every count >= min_sup = 2)

C2 (2nd scan):
  {I1,I2}: 4   {I1,I3}: 4   {I1,I4}: 1   {I1,I5}: 2   {I2,I3}: 4
  {I2,I4}: 2   {I2,I5}: 2   {I3,I4}: 0   {I3,I5}: 1   {I4,I5}: 0

L2:  {I1,I2}: 4   {I1,I3}: 4   {I1,I5}: 2   {I2,I3}: 4   {I2,I4}: 2   {I2,I5}: 2

C3 (3rd scan):
  {I1,I2,I3}: 2   {I1,I2,I4}: 1   {I1,I2,I5}: 2   {I2,I3,I4}: 0   {I2,I3,I5}: 1

L3:  {I1,I2,I3}: 2   {I1,I2,I5}: 2

C4: no candidate reaches min_sup = 2 (e.g. {I1,I2,I3,I5} appears only in T800), so the algorithm stops.

Frequent Patterns: {I1, I2, I3} ∪ {I1, I2, I5}
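A minimal Apriori sketch over the I1..I5 transactions with min_sup = 2. Candidate generation here is the simple join of frequent (k-1)-itemsets without the separate subset-pruning step, so infrequent candidates are removed by counting alone; variable names are illustrative.

```python
# Minimal Apriori: grow frequent k-itemsets level by level.
from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP = 2

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

items = sorted({i for t in transactions for i in t})
frequent = {}
k = 1
Lk = [frozenset(c) for c in combinations(items, 1) if sigma(frozenset(c)) >= MIN_SUP]
while Lk:
    frequent[k] = {s: sigma(s) for s in Lk}
    k += 1
    # join step: unions of frequent (k-1)-itemsets that have size k
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    Lk = [c for c in candidates if sigma(c) >= MIN_SUP]

print(sorted(tuple(sorted(s)) for s in frequent[3]))
# [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```

The loop terminates at k = 4 because the only candidate, {I1, I2, I3, I5}, occurs in just one transaction.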
Association Rule Generation: if Confidence < 50%, the rule is invalid; else it is valid.

For a rule S -> (I - S):  Confidence = Support(I) / Support(S)

I = {I1, I2, I3}
S ∈ {{I1}, {I2}, {I3}, {I1,I2}, {I1,I3}, {I2,I3}}
Rule-1: {I1} -> {I2, I3}
  Support(I) = 2/9, Support(S) = 6/9;  Confidence = 2/6 = 0.33  -> Invalid / Weak Association

Rule-2: {I2} -> {I1, I3}
  Support(I) = 2/9, Support(S) = 7/9;  Confidence = 2/7 = 0.28  -> Invalid / Weak Association

Rule-3: {I3} -> {I1, I2}
  Support(I) = 2/9, Support(S) = 6/9;  Confidence = 2/6 = 0.33  -> Invalid / Weak Association

Rule-4: {I1, I2} -> {I3}
  Support(I) = 2/9, Support(S) = 4/9;  Confidence = 2/4 = 0.5   -> Valid / Strong Association

Rule-5: {I1, I3} -> {I2}
  Support(I) = 2/9, Support(S) = 4/9;  Confidence = 2/4 = 0.5   -> Valid / Strong Association

Rule-6: {I2, I3} -> {I1}
  Support(I) = 2/9, Support(S) = 4/9;  Confidence = 2/4 = 0.5   -> Valid / Strong Association
I = {I1, I2, I5}
S ∈ {{I1}, {I2}, {I5}, {I1,I2}, {I1,I5}, {I2,I5}}

Rule-1: {I1} -> {I2, I5}
  Support(I) = 2/9, Support(S) = 6/9;  Confidence = 2/6 = 0.33  -> Invalid / Weak Association

Rule-2: {I2} -> {I1, I5}
  Support(I) = 2/9, Support(S) = 7/9;  Confidence = 2/7 = 0.28  -> Invalid / Weak Association

Rule-3: {I5} -> {I1, I2}
  Support(I) = 2/9, Support(S) = 2/9;  Confidence = 2/2 = 1     -> Valid / Strong Association

Rule-4: {I1, I2} -> {I5}
  Support(I) = 2/9, Support(S) = 4/9;  Confidence = 2/4 = 0.5   -> Valid / Strong Association

Rule-5: {I1, I5} -> {I2}
  Support(I) = 2/9, Support(S) = 2/9;  Confidence = 2/2 = 1     -> Valid / Strong Association

Rule-6: {I2, I5} -> {I1}
  Support(I) = 2/9, Support(S) = 2/9;  Confidence = 2/2 = 1     -> Valid / Strong Association
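The rule check above can be sketched in Python: for each non-empty proper subset S of a frequent itemset I, keep S -> (I - S) when confidence is at least 50%. The `support_count` table holds the counts from the worked example (N = 9); names are illustrative.

```python
# Association rule generation from one frequent itemset.
from itertools import combinations

support_count = {  # counts from the Apriori run above (N = 9)
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}

I = frozenset({"I1", "I2", "I5"})
for r in range(1, len(I)):
    for S in map(frozenset, combinations(sorted(I), r)):
        conf = support_count[I] / support_count[S]
        verdict = "valid/strong" if conf >= 0.5 else "invalid/weak"
        print(f"{sorted(S)} -> {sorted(I - S)}: conf = {conf:.2f} ({verdict})")
```

Because σ(I5) = σ({I1,I5}) = σ({I2,I5}) = σ(I), every rule whose antecedent contains I5 has confidence 1.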
Second counting example (transactions T1-T5; rows T1 and T2 were lost in extraction):
T3: { A, E, K, M }
T4: { C, K, M, U, Y }
T5: { C, E, I, K, O, O }

Given min_sup = 3. Counting the support of item O:
T1 has "O" -> count 1; T2 has "O" -> count 1; T3 has no "O" -> count 0; T4 has no "O" -> count 0; T5 has "O" -> count 1.
σ(O) = 3, so O meets min_sup.
Entropy(S) = - Σ_{i=1..m} p_i log2(p_i)

Entropy(S) = - (9/14) log2(9/14) - (5/14) log2(5/14) = 0.94
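The entropy formula can be sketched directly; the function takes class counts and reproduces Entropy(S) for the 9-yes / 5-no target, with the 0·log2(0) = 0 convention handled by skipping zero counts.

```python
# Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 2))  # 0.94
```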
Outlook:
Sunny:    total = 5, Yes = 2, No = 3
Overcast: total = 4, Yes = 4, No = 0
Rain:     total = 5, Yes = 3, No = 2

Entropy(Sunny)    = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
Entropy(Overcast) = -(4/4) log2(4/4) - 0               = 0
Entropy(Rain)     = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.971

Gain(Outlook) = Entropy(S) - (5/14)*Entropy(Sunny) - (4/14)*Entropy(Overcast) - (5/14)*Entropy(Rain)
              = 0.94 - (5/14)(0.971) - (4/14)(0) - (5/14)(0.971)
              = 0.2464
Temperature:
Hot:  total = 4, Yes = 2, No = 2
Mild: total = 6, Yes = 4, No = 2
Cool: total = 4, Yes = 3, No = 1

Entropy(Hot)  = -(2/4) log2(2/4) - (2/4) log2(2/4) = 1
Entropy(Mild) = -(4/6) log2(4/6) - (2/6) log2(2/6) = 0.918
Entropy(Cool) = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.811

Gain(Temp) = Entropy(S) - (4/14)*Entropy(Hot) - (6/14)*Entropy(Mild) - (4/14)*Entropy(Cool)
           = 0.94 - (4/14)(1) - (6/14)(0.918) - (4/14)(0.811)
           = 0.0289
Humidity:
High:   total = 7, Yes = 3, No = 4
Normal: total = 7, Yes = 6, No = 1

Entropy(High)   = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.9852
Entropy(Normal) = -(6/7) log2(6/7) - (1/7) log2(1/7) = 0.5916

Gain(Humidity) = Entropy(S) - (7/14)*Entropy(High) - (7/14)*Entropy(Normal)
               = 0.94 - (7/14)(0.9852) - (7/14)(0.5916)
               = 0.1516
Wind:
Weak:   total = 8, Yes = 6, No = 2
Strong: total = 6, Yes = 3, No = 3

Entropy(Weak)   = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.8113
Entropy(Strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1

Gain(Wind) = Entropy(S) - (8/14)*Entropy(Weak) - (6/14)*Entropy(Strong)
           = 0.94 - (8/14)(0.8113) - (6/14)(1)
           = 0.0478
Gain(Outlook)  = 0.2464
Gain(Temp)     = 0.0289
Gain(Humidity) = 0.1516
Gain(Wind)     = 0.0478

Gain(Outlook) is the largest -> Root node = Outlook.
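The gain computation above can be sketched as below; the (yes, no) counts per attribute value are taken from the worked example, and the function names are illustrative. The small differences from the notes (0.247 vs 0.2464) come from the notes rounding Entropy(S) to 0.94 before subtracting.

```python
# Information gain: Gain(A) = Entropy(S) - sum(|S_v|/|S| * Entropy(S_v)).
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, partitions):
    n = sum(parent_counts)
    weighted = sum(sum(p) / n * entropy(p) for p in partitions)
    return entropy(parent_counts) - weighted

parent = [9, 5]  # 9 Yes / 5 No
outlook  = [(2, 3), (4, 0), (3, 2)]  # Sunny, Overcast, Rain
humidity = [(3, 4), (6, 1)]          # High, Normal

print(round(gain(parent, outlook), 3))   # 0.247
print(round(gain(parent, humidity), 3))  # 0.152
```

Picking the attribute with the largest gain (Outlook here) is exactly the ID3 split criterion.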
For Sunny:
Entropy(S_sunny) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.97

Temperature (within Sunny):
Entropy(Hot)  = -(0/2) log2(0/2) - (2/2) log2(2/2) = 0
Entropy(Mild) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
Entropy(Cool) = -(1/1) log2(1/1) - (0/1) log2(0/1) = 0

Gain(Temp) = Entropy(S_sunny) - (2/5)*Entropy(Hot) - (2/5)*Entropy(Mild) - (1/5)*Entropy(Cool)
           = 0.97 - (2/5)(0) - (2/5)(1) - (1/5)(0)
           = 0.57

Humidity (within Sunny):
Entropy(High)   = -(0/3) log2(0/3) - (3/3) log2(3/3) = 0
Entropy(Normal) = -(2/2) log2(2/2) - (0/2) log2(0/2) = 0

Gain(Humidity) = Entropy(S_sunny) - (3/5)*Entropy(High) - (2/5)*Entropy(Normal)
               = 0.97 - (3/5)(0) - (2/5)(0)
               = 0.97

Wind (within Sunny):
Entropy(Weak)   = -(1/3) log2(1/3) - (2/3) log2(2/3) = 0.9183
Entropy(Strong) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1

Gain(Wind) = Entropy(S_sunny) - (3/5)*Entropy(Weak) - (2/5)*Entropy(Strong)
           = 0.97 - (3/5)(0.9183) - (2/5)(1)
           = 0.0192

Gain(Temp)     = 0.57
Gain(Humidity) = 0.97   <- highest gain
Gain(Wind)     = 0.0192
Humidity is the splitting attribute for the Sunny branch.
For Rain (rows include D6: Cool, Strong, No):
Entropy(S_rain) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.97