15 Link 2
• Homeworks:
• HW2 (due: 11/08)
• HW3 (will be posted on 11/06)
Link Analysis 2
EE412: Foundation of Big Data Analytics
Fall 2024
Recap
1. Web Search as a Graph
2. PageRank
3. PageRank: Implementation
Outline
1. Topic-Specific PageRank
2. Clustering and Partitioning
3. Finding Overlapping Communities
[Figure: example graph with nodes A to F, annotated with PageRank scores]
Google Matrix:
$A = \beta M + (1 - \beta) \left[ \frac{1}{N} \right]_{N \times N}$
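As a reminder of how the Google matrix is used, here is a minimal power-iteration sketch. It never builds $A$ explicitly: each page spreads $\beta$ times its rank over its out-links, and every page receives $(1 - \beta)/N$ from teleportation. The function name `pagerank`, the dictionary input format, and the toy graph are assumptions made for illustration, and the sketch assumes there are no dead ends.

```python
def pagerank(out_links, beta=0.85, tol=1e-10, max_iter=200):
    """Power iteration for r = A r with A = beta*M + (1-beta)*[1/N].

    out_links maps each page to the list of pages it links to.
    Assumption: every page has at least one out-link (no dead ends).
    """
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Teleportation part: (1 - beta) / N for every page.
        new_rank = {v: (1.0 - beta) / n for v in nodes}
        # Link-following part: beta * M * r.
        for src, dests in out_links.items():
            share = beta * rank[src] / len(dests)
            for dst in dests:
                new_rank[dst] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical three-page graph, not the one in the figure.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
```

Because the toy graph has no dead ends, the scores keep summing to 1 at every iteration.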
Jaemin Yoo
Google vs. Spammers: Round 2
• Spammers began to work out ways to fool Google’s PageRank.
• Link spam:
  • They create a link structure that boosts PageRank of certain pages.
• Link farm:
  • Collection of pages for link spam.
Link Farms
• Three kinds of web pages from a spammer’s point of view.
• Owned pages: Completely controlled by spammer.
  • May span multiple domain names.
• Accessible pages: Spammer can post links to his pages.
  • E.g., comments on blogs or newspapers, Wikipedia, etc.
• Inaccessible pages: Majority of the web.
  • Spammers cannot do anything about them.
[Figure: link farm structure with accessible pages, owned pages, and the page t]
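• Then, $z = \frac{\beta y}{M} + \frac{1 - \beta}{N}$, and $y = x + \beta M \left[ \frac{\beta y}{M} + \frac{1 - \beta}{N} \right]$.
  • Here $y$ is the PageRank of the target page, $z$ is the PageRank of each of the $M$ farm pages, $x$ is the PageRank contributed by accessible pages, and $N$ is the total number of pages.
• If we solve it for $y$, we get $y = \frac{x}{1 - \beta^2} + \frac{\beta}{1 + \beta} \cdot \frac{M}{N}$.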
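A quick numeric check of the link farm analysis: the sketch below iterates the two equations for $z$ and $y$ until they settle, then compares the result with the closed form. The function names and the example values ($x = 0.001$, $M = 1000$, $N = 10^6$) are made up for illustration.

```python
def target_rank_closed_form(x, beta, farm_pages, total_pages):
    """y = x / (1 - beta^2) + beta / (1 + beta) * M / N."""
    return x / (1 - beta**2) + (beta / (1 + beta)) * farm_pages / total_pages


def target_rank_fixed_point(x, beta, farm_pages, total_pages, iters=500):
    """Iterate z = beta*y/M + (1-beta)/N and y = x + beta*M*z."""
    y = 0.0
    for _ in range(iters):
        z = beta * y / farm_pages + (1 - beta) / total_pages
        y = x + beta * farm_pages * z
    return y


# Hypothetical values: x = 0.001, beta = 0.85, M = 1000, N = 1,000,000.
y_closed = target_rank_closed_form(0.001, 0.85, 1000, 1_000_000)
y_iter = target_rank_fixed_point(0.001, 0.85, 1000, 1_000_000)
amplification = 1 / (1 - 0.85**2)  # factor applied to x, about 3.6
```

With $\beta = 0.85$, the external contribution $x$ is amplified by about 3.6, and the second term grows linearly with the number of farm pages $M$.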
• TrustRank: Let’s bias the random walk to trustworthy pages.
  • When the walker teleports, pick a page from the teleport set $S$.
• Two approaches for developing a trustworthy teleport set:
  • Let humans examine a set of pages with highest PageRanks.
  • Pick a domain (e.g., .edu, .gov, .ac.kr, etc.) where membership is controlled.
[Figure: 0% spam, 0.1% spam, 1% spam, 10% spam, with 1 link hop between consecutive groups]
$A_{ij} = \begin{cases} \beta M_{ij} + (1 - \beta) / |S| & \text{if } i \in S \\ \beta M_{ij} & \text{otherwise} \end{cases}$
• Note that $A$ is still a stochastic matrix.
• We can also assign different weights to the pages in $S$.
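A minimal sketch of PageRank with a teleport set: the only change from ordinary power iteration is that the $(1 - \beta)$ teleport mass is split evenly over the pages in $S$ instead of over all pages. The function name, the uniform weights, and the four-page graph are assumptions for illustration, and the sketch assumes there are no dead ends.

```python
def topic_specific_pagerank(out_links, teleport_set, beta=0.85, iters=100):
    """Power iteration where teleports land only on pages in teleport_set.

    Assumption: every page has at least one out-link (no dead ends).
    """
    nodes = sorted(out_links)
    size = len(teleport_set)
    # Start the walker on the teleport set.
    rank = {v: (1.0 / size if v in teleport_set else 0.0) for v in nodes}
    for _ in range(iters):
        new_rank = {v: 0.0 for v in nodes}
        # Link-following part: beta * M * r.
        for src, dests in out_links.items():
            share = beta * rank[src] / len(dests)
            for dst in dests:
                new_rank[dst] += share
        # Teleport part: (1 - beta) / |S|, only for pages in S.
        for v in teleport_set:
            new_rank[v] += (1.0 - beta) / size
        rank = new_rank
    return rank


# Hypothetical graph: two spam pages link only to each other.
web = {
    "trusted": ["good"],
    "good": ["trusted"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
trust = topic_specific_pagerank(web, teleport_set={"trusted"})
```

In this toy graph no trusted page links into the spam pages, so they receive no score at all, whereas ordinary PageRank would give them a nonzero rank.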
[Figure: example graphs. Source: PGL]
[Figure: graph with nodes A to H, showing the smallest cut and the best cut]
• Normalized cut: $\frac{\text{cut}(S, T)}{\text{vol}(S)} + \frac{\text{cut}(S, T)}{\text{vol}(T)}$
• $\text{cut}(S, T)$ = The number of edges that connect $S$ and $T$.
• $\text{vol}(S)$ = The number of edges with at least one end in $S$.
• Note that computing (or finding) the optimal cut is NP-hard.
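To make the cut and volume definitions concrete, here is a small sketch that scores a partition by cut(S, T)/vol(S) + cut(S, T)/vol(T). The function name and the example graph (two triangles joined by a bridge, plus one pendant node) are hypothetical and are not the graph from the figure.

```python
def normalized_cut(edges, side_s):
    """Return cut(S,T)/vol(S) + cut(S,T)/vol(T) for an undirected graph.

    edges is a list of (u, v) pairs; side_s is the node set S, and T is
    everything else. Assumes both sides touch at least one edge.
    """
    side_s = set(side_s)
    cut = vol_s = vol_t = 0
    for u, v in edges:
        u_in_s, v_in_s = u in side_s, v in side_s
        if u_in_s != v_in_s:          # edge connects S and T
            cut += 1
        if u_in_s or v_in_s:          # at least one end in S
            vol_s += 1
        if not (u_in_s and v_in_s):   # at least one end in T
            vol_t += 1
    return cut / vol_s + cut / vol_t


# Hypothetical graph: triangles A-B-C and D-E-F, bridge C-D, pendant H on A.
edges = [
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("D", "E"), ("D", "F"), ("E", "F"),
    ("C", "D"), ("A", "H"),
]
nc_pendant = normalized_cut(edges, {"H"})                  # cuts off one node
nc_balanced = normalized_cut(edges, {"A", "B", "C", "H"})  # cuts the bridge
```

Both partitions cut a single edge, but the pendant split scores 1/1 + 1/8 while the balanced split scores 1/5 + 1/4, so the normalized cut prefers the balanced one.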
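[Figure: example graph instances (nodes B, C, D, E) with their probabilities]
• Pr(instance) = $0.1^3$
• Pr(instance) = $3 \times 0.1^2 \times 0.9$
• Pr(instance) = $3 \times 0.1 \times 0.9^2$
• Pr(instance) = $0.9^3$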
Affiliation Graph Model
• Affiliation graph model is a mechanism to generate social graphs.
  • Users (= nodes) belong to different communities.
  • Users (= nodes) are connected only in each community.
  • Graph has edges if users are connected in any community.
• Fixed (and given): The numbers of nodes and communities.
• Parameter 1: Community assignment of nodes.
  • Each community can have any set of individuals as members.
• Parameter 2: Edge probability $p_C$ for each community $C$.
[Figure: 2 communities A and B with probabilities $p_A$ and $p_B$, memberships, and individuals]
$p(G) = \prod_{(u,v) \in E} p_{uv} \prod_{(u,v) \notin E} (1 - p_{uv})$
[Figure: example nodes u, v, w]
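A minimal sketch of the graph likelihood under the affiliation graph model. The slides here do not show how $p_{uv}$ is obtained from the community probabilities, so the sketch assumes the usual choice $p_{uv} = 1 - \prod_{C \ni u,v} (1 - p_C)$, which gives $p_{uv} = 0$ when $u$ and $v$ share no community. The function names, memberships, and probabilities are all hypothetical.

```python
from itertools import combinations
from math import prod  # available in Python 3.8+


def edge_probability(u, v, members, p):
    """Assumed form: p_uv = 1 - product over shared communities of (1 - p_C)."""
    shared = [c for c, nodes in members.items() if u in nodes and v in nodes]
    # prod of an empty sequence is 1, so p_uv = 0 with no shared community.
    return 1.0 - prod(1.0 - p[c] for c in shared)


def graph_likelihood(nodes, edges, members, p):
    """p(G): product of p_uv over edges and (1 - p_uv) over non-edges."""
    edge_set = {frozenset(e) for e in edges}
    likelihood = 1.0
    for u, v in combinations(nodes, 2):
        p_uv = edge_probability(u, v, members, p)
        if frozenset((u, v)) in edge_set:
            likelihood *= p_uv
        else:
            likelihood *= 1.0 - p_uv
    return likelihood


# Hypothetical setup: community A = {u, v}, community B = {v, w}.
members = {"A": {"u", "v"}, "B": {"v", "w"}}
p = {"A": 0.8, "B": 0.5}
lik = graph_likelihood(["u", "v", "w"], [("u", "v")], members, p)
```

Here the observed graph has the single edge (u, v), so the likelihood is 0.8 for that edge, 1.0 for the non-edge (u, w), and 0.5 for the non-edge (v, w). For larger graphs one would typically sum logarithms instead of multiplying probabilities to avoid underflow.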