Data Mining
Proximity Measures for Binary Attributes: Contingency Table for Binary Attributes

q = number of attributes where i = 1 and j = 1
r = number of attributes where i = 1 and j = 0
s = number of attributes where i = 0 and j = 1
t = number of attributes where i = 0 and j = 0

1. Symmetric:  d(i,j) = (r + s) / (q + r + s + t)
2. Asymmetric: d(i,j) = (r + s) / (q + r + s)
Example:

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Jim    M        Y       Y       N        N        N        N
Mary   F        Y       N       P        N        P        N

Solution: Y and P are coded as 1, N as 0; Gender is a symmetric attribute and is left out of the asymmetric measure.
d(Jack, Jim):  q = 1 (Fever),          r = 1 (Test-1),  s = 1 (Cough)
d(Jack, Jim)  = (r + s) / (q + r + s) = (1 + 1) / (1 + 1 + 1) = 0.67

d(Jack, Mary): q = 2 (Fever, Test-1),  r = 0,           s = 1 (Test-3)
d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33

d(Jim, Mary):  q = 1 (Fever),          r = 1 (Cough),   s = 2 (Test-1, Test-3)
d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
These measurements suggest that Jim and Mary are unlikely to have a similar
disease because they have the highest dissimilarity value among the three
pairs. Of the three patients, Jack and Mary are the most likely to have a similar
disease.
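The calculation above can be sketched in Python; the function name and the 0/1 encoding of the patient records (Y and P as 1, N as 0, Gender dropped) follow the worked example, and the helper is illustrative rather than any library API.

```python
# Binary dissimilarity: d(i,j) = (r+s)/(q+r+s) asymmetric,
# (r+s)/(q+r+s+t) symmetric, as defined above.

def binary_dissimilarity(i, j, symmetric=False):
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # i only
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # j only
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)  # both 0
    denom = q + r + s + t if symmetric else q + r + s
    return (r + s) / denom

# Fever, Cough, Test-1..Test-4 with Y/P -> 1 and N -> 0.
jack = [1, 0, 1, 0, 0, 0]
jim  = [1, 1, 0, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]

print(round(binary_dissimilarity(jack, jim), 2))   # 0.67
print(round(binary_dissimilarity(jack, mary), 2))  # 0.33
print(round(binary_dissimilarity(jim, mary), 2))   # 0.75
```

Asymmetric is the right choice here because a shared negative test result (t) says little about disease similarity.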
Ordinal:
z_if = (r_if - 1) / (M_f - 1),  where r_if = the rank of the value and M_f = the total number of ranks

Numeric:
d_if = |x_if - x_jf| / (max - min)
Example:

Object Identifier   Test-1 (nominal)   Test-2 (ordinal)   Test-3 (numeric)
1                   Code A             Excellent          45
2                   Code B             Fair               22
3                   Code C             Good               64
4                   Code A             Excellent          28
For nominal:
d(i,j) = (p - m) / p,  where p = total number of attributes, m = number of matches

d(2,1) = (1 - 0)/1 = 1
d(3,1) = (1 - 0)/1 = 1
d(3,2) = (1 - 0)/1 = 1
d(4,1) = (1 - 1)/1 = 0
d(4,2) = (1 - 0)/1 = 1
d(4,3) = (1 - 0)/1 = 1

Dissimilarity Matrix:
     1    2    3    4
1    0
2    1    0
3    1    1    0
4    0    1    1    0
[ Note: if there are two or more attributes, there will still be one dissimilarity matrix
that represents the dissimilarities between objects based on all attributes. ]
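A minimal Python sketch of the nominal measure, reproducing the Test-1 column above (a single attribute, so p = 1); the function name is illustrative.

```python
# Nominal dissimilarity: d(i,j) = (p - m) / p,
# p = number of attributes, m = number of matching values.

def nominal_dissimilarity(i, j):
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

test1 = ["A", "B", "C", "A"]  # Test-1 codes for objects 1..4
for x in range(len(test1)):
    for y in range(x):
        d = nominal_dissimilarity([test1[x]], [test1[y]])
        print(f"d({x+1},{y+1}) = {d}")
```

Only d(4,1) is 0, because objects 1 and 4 share Code A.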
For Ordinal:
z_if = (r_if - 1) / (M_f - 1)

Ranks: Excellent (1), Fair (2), Good (3);  M = 3

z_Excellent = (1 - 1)/(3 - 1) = 0
z_Fair      = (2 - 1)/(3 - 1) = 0.5
z_Good      = (3 - 1)/(3 - 1) = 1

Using Manhattan distance on the z values:
d(2,1) = |0.5 - 0| = 0.5
d(3,1) = |1 - 0|   = 1
d(3,2) = |1 - 0.5| = 0.5
d(4,1) = |0 - 0|   = 0
d(4,2) = |0 - 0.5| = 0.5
d(4,3) = |0 - 1|   = 1

Dissimilarity Matrix:
     1     2     3     4
1    0
2    0.5   0
3    1     0.5   0
4    0     0.5   1     0
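The rank-to-z mapping and Manhattan step can be sketched as below, using the same Excellent(1)/Fair(2)/Good(3) ranks as the notes; the dictionary and function names are illustrative.

```python
# Ordinal normalization: z = (r - 1) / (M - 1), then Manhattan distance.

ranks = {"Excellent": 1, "Fair": 2, "Good": 3}
M = len(ranks)

def z(value):
    return (ranks[value] - 1) / (M - 1)

test2 = ["Excellent", "Fair", "Good", "Excellent"]  # objects 1..4
zs = [z(v) for v in test2]
print(zs)                  # [0.0, 0.5, 1.0, 0.0]
print(abs(zs[1] - zs[0]))  # d(2,1) = 0.5
print(abs(zs[3] - zs[2]))  # d(4,3) = 1.0
```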
For Numerical:
d_if = |x_if - x_jf| / (max - min),  with max - min = 64 - 22 = 42

d(2,1) = |22 - 45| / 42 = 0.54
d(3,1) = |64 - 45| / 42 = 0.45
d(3,2) = |64 - 22| / 42 = 1
d(4,1) = |28 - 45| / 42 = 0.40
d(4,2) = |28 - 22| / 42 = 0.14
d(4,3) = |28 - 64| / 42 = 0.86

Dissimilarity Matrix:
     1     2     3     4
1    0
2    0.54  0
3    0.45  1     0
4    0.40  0.14  0.86  0
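A short sketch of the min-max normalized distance on the Test-3 column; note the notes truncate values (23/42 appears as 0.54), while Python's rounding gives 0.55.

```python
# Numeric dissimilarity: d = |x_i - x_j| / (max - min).

test3 = [45, 22, 64, 28]       # Test-3 values for objects 1..4
rng = max(test3) - min(test3)  # 64 - 22 = 42

def numeric_dissimilarity(xi, xj):
    return abs(xi - xj) / rng

print(round(numeric_dissimilarity(22, 45), 2))  # 0.55 (truncated to 0.54 in the notes)
print(round(numeric_dissimilarity(28, 22), 2))  # 0.14
print(numeric_dissimilarity(64, 22))            # 1.0
```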
Combined dissimilarity (average of the nominal, ordinal, and numeric dissimilarities, all weights = 1):

d(2,1) = ((1*1) + (1*0.5) + (1*0.54)) / (1 + 1 + 1) = 0.68
d(3,1) = ((1*1) + (1*1)   + (1*0.45)) / 3 = 0.81
d(3,2) = ((1*1) + (1*0.5) + (1*1))    / 3 = 0.83
d(4,1) = ((1*0) + (1*0)   + (1*0.40)) / 3 = 0.13
d(4,2) = ((1*1) + (1*0.5) + (1*0.14)) / 3 = 0.54
d(4,3) = ((1*1) + (1*1)   + (1*0.86)) / 3 = 0.95

Dissimilarity Matrix:
     1     2     3     4
1    0
2    0.68  0
3    0.81  0.83  0
4    0.13  0.54  0.95  0
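The combination step is just an unweighted mean of the three per-attribute dissimilarities; a minimal sketch, with the inputs taken from the three matrices above:

```python
# Mixed-attribute dissimilarity: average of the nominal, ordinal,
# and numeric dissimilarities (all attribute weights = 1).

def combined(d_nom, d_ord, d_num):
    return (d_nom + d_ord + d_num) / 3

print(round(combined(1, 0.5, 0.54), 2))  # d(2,1) = 0.68
print(round(combined(1, 1, 0.86), 2))    # d(4,3) = 0.95
print(round(combined(0, 0, 0.40), 2))    # d(4,1) = 0.13
```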
Cosine Similarity:
cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||),  where ||d|| = sqrt(sum of squared components)
Which pair has the most similarity and the most dissimilarity? (The term-frequency vectors, read off from the dot products below, are d1 = (5,0,3,0,2,0,0,2,0,0), d2 = (3,0,2,0,1,1,0,1,0,1), d3 = (0,7,0,2,1,0,0,3,0,0), d4 = (0,1,0,0,1,2,2,0,3,0).)
d1.d2= (5*3)+(0*0)+(3*2)+(0*0)+(2*1)+(0*1)+(0*0)+(2*1)+(0*0)+(0*1)
= 15 + 6 + 2 + 2
= 25
d1.d3= (5*0)+(0*7)+(3*0)+(0*2)+(2*1)+(0*0)+(0*0)+(2*3)+(0*0)+(0*0)
=2+6
=8
d1.d4= (5*0)+(0*1)+(3*0)+(0*0)+(2*1)+(0*2)+(0*2)+(2*0)+(0*3)+(0*0)
=2
d2.d3= (3*0)+(0*7)+(2*0)+(0*2)+(1*1)+(1*0)+(0*0)+(1*3)+(0*0)+(1*0)
=1+3
=4
d2.d4= (3*0)+(0*1)+(2*0)+(0*0)+(1*1)+(1*2)+(0*2)+(1*0)+(0*3)+(1*0)
=1+2
=3
d3.d4= (0*0)+(7*1)+(0*0)+(2*0)+(1*1)+(0*2)+(0*2)+(3*0)+(0*3)+(0*0)
=7+1
=8
||d1|| = sqrt(5^2 + 3^2 + 2^2 + 2^2)             = sqrt(42) = 6.48
||d2|| = sqrt(3^2 + 2^2 + 1^2 + 1^2 + 1^2 + 1^2) = sqrt(17) = 4.12
||d3|| = sqrt(7^2 + 2^2 + 1^2 + 3^2)             = sqrt(63) = 7.93
||d4|| = sqrt(1^2 + 1^2 + 2^2 + 2^2 + 3^2)       = sqrt(19) = 4.35
cos(d1,d2) = 25 / (6.48 * 4.12) = 0.94
cos(d1,d3) =  8 / (6.48 * 7.93) = 0.15
cos(d1,d4) =  2 / (6.48 * 4.35) = 0.07
cos(d2,d3) =  4 / (4.12 * 7.93) = 0.12
cos(d2,d4) =  3 / (4.12 * 4.35) = 0.16
cos(d3,d4) =  8 / (7.93 * 4.35) = 0.23

(d1, d2) is the most similar pair, with similarity 94%.
(d1, d4) is the most dissimilar pair, with similarity only 7%.
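The whole cosine computation can be sketched as below, using the term-frequency vectors implied by the dot products above:

```python
# Cosine similarity: cos(a, b) = (a . b) / (||a|| * ||b||).
import math

d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
d3 = [0, 7, 0, 2, 1, 0, 0, 3, 0, 0]
d4 = [0, 1, 0, 0, 1, 2, 2, 0, 3, 0]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine(d1, d2), 2))  # 0.94 -> most similar pair
print(round(cosine(d1, d4), 2))  # 0.07 -> most dissimilar pair
```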
Chapter-6
Frequent patterns:

Support:    S(x -> y) = σ(x ∪ y) / N
Confidence: C(x -> y) = σ(x ∪ y) / σ(x)
TID   Items
10    Beef, Nuts, Durian
20    Beef, Coffee, Durian
30    Beef, Durian, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Durian, Eggs, Milk

S(Beef -> Durian)  = σ(Beef ∪ Durian) / N        = 3/5 = 0.6
C(Beef -> Durian)  = σ(Beef ∪ Durian) / σ(Beef)  = 3/3 = 1
S(Durian -> Beef)  = σ(Durian ∪ Beef) / N        = 3/5 = 0.6
C(Durian -> Beef)  = σ(Durian ∪ Beef) / σ(Durian) = 3/4 = 0.75
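The two formulas over the five transactions can be sketched as follows; `sigma` counts transactions containing an itemset, and the function names are illustrative.

```python
# Support S(x -> y) = sigma(x ∪ y) / N
# Confidence C(x -> y) = sigma(x ∪ y) / sigma(x)

transactions = [
    {"Beef", "Nuts", "Durian"},
    {"Beef", "Coffee", "Durian"},
    {"Beef", "Durian", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Durian", "Eggs", "Milk"},
]
N = len(transactions)

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

def support(x, y):
    return sigma(x | y) / N

def confidence(x, y):
    return sigma(x | y) / sigma(x)

print(support({"Beef"}, {"Durian"}))     # 0.6
print(confidence({"Beef"}, {"Durian"}))  # 1.0
print(confidence({"Durian"}, {"Beef"}))  # 0.75
```

Note that support is symmetric in x and y, while confidence is not: C(Beef -> Durian) = 1 but C(Durian -> Beef) = 0.75.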
Apriori Algorithm:
min_sup = 2

[ First worked example: the dataset and candidate/frequent itemset tables (C1, L1, C2, L2, C3, L3; fragments such as {E}: 3, {A,E}: 1, {B,C}: 2, {A,B,E}: 1) were figure content and did not survive extraction. ]
Dataset (min_sup = 2):

TID    Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

[ Note: rows T300-T500 were lost in extraction; they are restored here so that the C1 counts below hold. ]

C1:  {I1}: 6   {I2}: 7   {I3}: 6   {I4}: 2   {I5}: 2
L1 = C1 (every count >= min_sup = 2)

C2 (2nd scan):
  {I1,I2}: 4   {I1,I3}: 4   {I1,I4}: 1   {I1,I5}: 2   {I2,I3}: 4
  {I2,I4}: 2   {I2,I5}: 2   {I3,I4}: 0   {I3,I5}: 1   {I4,I5}: 0

L2:  {I1,I2}: 4   {I1,I3}: 4   {I1,I5}: 2   {I2,I3}: 4   {I2,I4}: 2   {I2,I5}: 2

C3 (3rd scan):
  {I1,I2,I3}: 2   {I1,I2,I4}: 1   {I1,I2,I5}: 2   {I2,I3,I4}: 0   {I2,I3,I5}: 1

L3:  {I1,I2,I3}: 2   {I1,I2,I5}: 2

C4: no candidate reaches min_sup = 2 (e.g. {I1,I2,I3,I5} appears only in T800), so the algorithm stops.

Frequent Patterns: {I1, I2, I3} ∪ {I1, I2, I5}
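A minimal Apriori sketch over the I1..I5 transactions with min_sup = 2. Candidate generation here is the simple join of frequent (k-1)-itemsets without the separate subset-pruning step, so infrequent candidates are removed by counting alone; variable names are illustrative.

```python
# Minimal Apriori: grow frequent k-itemsets level by level.
from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP = 2

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

items = sorted({i for t in transactions for i in t})
frequent = {}
k = 1
Lk = [frozenset(c) for c in combinations(items, 1) if sigma(frozenset(c)) >= MIN_SUP]
while Lk:
    frequent[k] = {s: sigma(s) for s in Lk}
    k += 1
    # join step: unions of frequent (k-1)-itemsets that have size k
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    Lk = [c for c in candidates if sigma(c) >= MIN_SUP]

print(sorted(tuple(sorted(s)) for s in frequent[3]))
# [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```

The loop terminates at k = 4 because the only candidate, {I1, I2, I3, I5}, occurs in just one transaction.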
Association Rule Generation: if Confidence < 50%, the rule is invalid; else it is valid.

For a rule S -> (I - S):  Confidence = Support(I) / Support(S)

I = {I1, I2, I3}
S ∈ {{I1}, {I2}, {I3}, {I1,I2}, {I1,I3}, {I2,I3}}
Rule-1: {I1} -> {I2, I3}
  Support(I) = 2/9, Support(S) = 6/9;  Confidence = 2/6 = 0.33  -> Invalid / Weak Association

Rule-2: {I2} -> {I1, I3}
  Support(I) = 2/9, Support(S) = 7/9;  Confidence = 2/7 = 0.28  -> Invalid / Weak Association

Rule-3: {I3} -> {I1, I2}
  Support(I) = 2/9, Support(S) = 6/9;  Confidence = 2/6 = 0.33  -> Invalid / Weak Association

Rule-4: {I1, I2} -> {I3}
  Support(I) = 2/9, Support(S) = 4/9;  Confidence = 2/4 = 0.5   -> Valid / Strong Association

Rule-5: {I1, I3} -> {I2}
  Support(I) = 2/9, Support(S) = 4/9;  Confidence = 2/4 = 0.5   -> Valid / Strong Association

Rule-6: {I2, I3} -> {I1}
  Support(I) = 2/9, Support(S) = 4/9;  Confidence = 2/4 = 0.5   -> Valid / Strong Association
I = {I1, I2, I5}
S ∈ {{I1}, {I2}, {I5}, {I1,I2}, {I1,I5}, {I2,I5}}

Rule-1: {I1} -> {I2, I5}
  Support(I) = 2/9, Support(S) = 6/9;  Confidence = 2/6 = 0.33  -> Invalid / Weak Association

Rule-2: {I2} -> {I1, I5}
  Support(I) = 2/9, Support(S) = 7/9;  Confidence = 2/7 = 0.28  -> Invalid / Weak Association

Rule-3: {I5} -> {I1, I2}
  Support(I) = 2/9, Support(S) = 2/9;  Confidence = 2/2 = 1     -> Valid / Strong Association

Rule-4: {I1, I2} -> {I5}
  Support(I) = 2/9, Support(S) = 4/9;  Confidence = 2/4 = 0.5   -> Valid / Strong Association

Rule-5: {I1, I5} -> {I2}
  Support(I) = 2/9, Support(S) = 2/9;  Confidence = 2/2 = 1     -> Valid / Strong Association

Rule-6: {I2, I5} -> {I1}
  Support(I) = 2/9, Support(S) = 2/9;  Confidence = 2/2 = 1     -> Valid / Strong Association
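The rule check above can be sketched in Python: for each non-empty proper subset S of a frequent itemset I, keep S -> (I - S) when confidence is at least 50%. The `support_count` table holds the counts from the worked example (N = 9); names are illustrative.

```python
# Association rule generation from one frequent itemset.
from itertools import combinations

support_count = {  # counts from the Apriori run above (N = 9)
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}

I = frozenset({"I1", "I2", "I5"})
for r in range(1, len(I)):
    for S in map(frozenset, combinations(sorted(I), r)):
        conf = support_count[I] / support_count[S]
        verdict = "valid/strong" if conf >= 0.5 else "invalid/weak"
        print(f"{sorted(S)} -> {sorted(I - S)}: conf = {conf:.2f} ({verdict})")
```

Because σ(I5) = σ({I1,I5}) = σ({I2,I5}) = σ(I), every rule whose antecedent contains I5 has confidence 1.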
Second counting example (transactions T1-T5; rows T1 and T2 were lost in extraction):
T3: { A, E, K, M }
T4: { C, K, M, U, Y }
T5: { C, E, I, K, O, O }

Given min_sup = 3. Counting the support of item O:
T1 has "O" -> count 1; T2 has "O" -> count 1; T3 has no "O" -> count 0; T4 has no "O" -> count 0; T5 has "O" -> count 1.
σ(O) = 3, so O meets min_sup.
Entropy(S) = - Σ_{i=1..m} p_i log2(p_i)

Entropy(S) = - (9/14) log2(9/14) - (5/14) log2(5/14) = 0.94
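The entropy formula can be sketched directly; the function takes class counts and reproduces Entropy(S) for the 9-yes / 5-no target, with the 0·log2(0) = 0 convention handled by skipping zero counts.

```python
# Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 2))  # 0.94
```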
Outlook:
Sunny:    total = 5, Yes = 2, No = 3
Overcast: total = 4, Yes = 4, No = 0
Rain:     total = 5, Yes = 3, No = 2

Entropy(Sunny)    = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
Entropy(Overcast) = -(4/4) log2(4/4) - 0               = 0
Entropy(Rain)     = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.971

Gain(Outlook) = Entropy(S) - (5/14)*Entropy(Sunny) - (4/14)*Entropy(Overcast) - (5/14)*Entropy(Rain)
              = 0.94 - (5/14)(0.971) - (4/14)(0) - (5/14)(0.971)
              = 0.2464
Temperature:
Hot:  total = 4, Yes = 2, No = 2
Mild: total = 6, Yes = 4, No = 2
Cool: total = 4, Yes = 3, No = 1

Entropy(Hot)  = -(2/4) log2(2/4) - (2/4) log2(2/4) = 1
Entropy(Mild) = -(4/6) log2(4/6) - (2/6) log2(2/6) = 0.918
Entropy(Cool) = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.811

Gain(Temp) = Entropy(S) - (4/14)*Entropy(Hot) - (6/14)*Entropy(Mild) - (4/14)*Entropy(Cool)
           = 0.94 - (4/14)(1) - (6/14)(0.918) - (4/14)(0.811)
           = 0.0289
Humidity:
High:   total = 7, Yes = 3, No = 4
Normal: total = 7, Yes = 6, No = 1

Entropy(High)   = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.9852
Entropy(Normal) = -(6/7) log2(6/7) - (1/7) log2(1/7) = 0.5916

Gain(Humidity) = Entropy(S) - (7/14)*Entropy(High) - (7/14)*Entropy(Normal)
               = 0.94 - (7/14)(0.9852) - (7/14)(0.5916)
               = 0.1516
Wind:
Weak:   total = 8, Yes = 6, No = 2
Strong: total = 6, Yes = 3, No = 3

Entropy(Weak)   = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.8113
Entropy(Strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1

Gain(Wind) = Entropy(S) - (8/14)*Entropy(Weak) - (6/14)*Entropy(Strong)
           = 0.94 - (8/14)(0.8113) - (6/14)(1)
           = 0.0478
Gain(Outlook)  = 0.2464
Gain(Temp)     = 0.0289
Gain(Humidity) = 0.1516
Gain(Wind)     = 0.0478

Gain(Outlook) is the largest -> Root node = Outlook.
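The gain computation above can be sketched as below; the (yes, no) counts per attribute value are taken from the worked example, and the function names are illustrative. The small differences from the notes (0.247 vs 0.2464) come from the notes rounding Entropy(S) to 0.94 before subtracting.

```python
# Information gain: Gain(A) = Entropy(S) - sum(|S_v|/|S| * Entropy(S_v)).
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, partitions):
    n = sum(parent_counts)
    weighted = sum(sum(p) / n * entropy(p) for p in partitions)
    return entropy(parent_counts) - weighted

parent = [9, 5]  # 9 Yes / 5 No
outlook  = [(2, 3), (4, 0), (3, 2)]  # Sunny, Overcast, Rain
humidity = [(3, 4), (6, 1)]          # High, Normal

print(round(gain(parent, outlook), 3))   # 0.247
print(round(gain(parent, humidity), 3))  # 0.152
```

Picking the attribute with the largest gain (Outlook here) is exactly the ID3 split criterion.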
For Sunny:
Entropy(S_sunny) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.97

Temperature (within Sunny):
Entropy(Hot)  = -(0/2) log2(0/2) - (2/2) log2(2/2) = 0
Entropy(Mild) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
Entropy(Cool) = -(1/1) log2(1/1) - (0/1) log2(0/1) = 0

Gain(Temp) = Entropy(S_sunny) - (2/5)*Entropy(Hot) - (2/5)*Entropy(Mild) - (1/5)*Entropy(Cool)
           = 0.97 - (2/5)(0) - (2/5)(1) - (1/5)(0)
           = 0.57

Humidity (within Sunny):
Entropy(High)   = -(0/3) log2(0/3) - (3/3) log2(3/3) = 0
Entropy(Normal) = -(2/2) log2(2/2) - (0/2) log2(0/2) = 0

Gain(Humidity) = Entropy(S_sunny) - (3/5)*Entropy(High) - (2/5)*Entropy(Normal)
               = 0.97 - (3/5)(0) - (2/5)(0)
               = 0.97

Wind (within Sunny):
Entropy(Weak)   = -(1/3) log2(1/3) - (2/3) log2(2/3) = 0.9183
Entropy(Strong) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1

Gain(Wind) = Entropy(S_sunny) - (3/5)*Entropy(Weak) - (2/5)*Entropy(Strong)
           = 0.97 - (3/5)(0.9183) - (2/5)(1)
           = 0.0192

Gain(Temp)     = 0.57
Gain(Humidity) = 0.97   <- highest gain
Gain(Wind)     = 0.0192
Humidity is the splitting attribute for the Sunny branch.
For Rain (rows include D6: Cool, Strong, No):
Entropy(S_rain) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.97