15 Link 2

The document outlines a lecture on Link Analysis in the context of Big Data Analytics, covering topics such as PageRank, TrustRank, and community detection in graphs. It discusses the challenges posed by spammers and link farms, as well as methods for clustering and partitioning graph data. Additionally, it introduces concepts like Maximum Likelihood Estimation and the Affiliation Graph Model for generating social graphs.


Link Analysis 2
EE412: Foundation of Big Data Analytics
Fall 2024
Jaemin Yoo

Announcements
• Homeworks:
  • HW2 (due: 11/08)
  • HW3 (will be posted on 11/06)

Recap
1. Web Search as a Graph
2. PageRank
3. PageRank: Implementation

[Figure: example web graph with PageRank scores A = 3.3, B = 38.4, C = 34.3, D = 3.9, E = 8.1, F = 3.9, and 1.6 for each of five unlabeled pages.]

Google Matrix: $A = \beta M + (1 - \beta) \left[ \frac{1}{N} \right]_{N \times N}$

Outline
1. Topic-Specific PageRank
2. Clustering and Partitioning
3. Finding Overlapping Communities
Google vs. Spammers: Round 2
• Spammers began to work out ways to fool Google’s PageRank.
• Link spam:
  • They create a link structure that boosts PageRank of certain pages.
• Link farm:
  • Collection of pages for link spam.

Link Farms
• Three kinds of web pages from a spammer’s point of view.
• Owned pages: Completely controlled by spammer.
  • May span multiple domain names.
• Accessible pages: Spammer can post links to his pages.
  • E.g., comments on blogs or newspapers, Wikipedia, etc.
• Inaccessible pages: Majority of the web.
  • Spammers cannot do anything about them.

Link Farms
[Figure: the web split into inaccessible, accessible, and owned pages. The spammer can insert links on accessible pages that point to the target page $t$, and $t$ exchanges links with farm pages 1, 2, ..., $M$ (millions of farm pages).]

Link Farms: Analysis
• Goal: Maximize the PageRank score of target page $t$.
• Symbols:
  • $N \gg 0$: The number of pages in the web.
  • $x$: PageRank contributed by accessible pages.
  • $y$: PageRank of the target page $t$.
  • $z$: PageRank contributed by owned pages.
• Then, $y = x + z + \frac{1 - \beta}{N} \approx x + z$.
  • $\frac{1 - \beta}{N}$: Small constant; can be ignored.
Link Farms: Analysis
• Symbols:
  • $N \gg 0$: The number of pages in the web.
  • $M > 0$: The number of pages owned by the spammer.
  • $x$: PageRank contributed by accessible pages.
  • $y$: PageRank of the target page $t$.
• PageRank of each “owned” page is given as $\frac{\beta y}{M} + \frac{1 - \beta}{N}$.
• Then, $z = \beta M \left( \frac{\beta y}{M} + \frac{1 - \beta}{N} \right)$, and $y = x + \beta M \left( \frac{\beta y}{M} + \frac{1 - \beta}{N} \right)$.
• If we solve it for $y$, we get $y = \frac{x}{1 - \beta^2} + \frac{\beta}{1 + \beta} \cdot \frac{M}{N}$.

Link Farms: Analysis
• How can we interpret $y = \frac{x}{1 - \beta^2} + \frac{\beta}{1 + \beta} \cdot \frac{M}{N}$?
  • If $\beta = 0.85$, it becomes $y = 3.6x + 0.46 \frac{M}{N}$.
• We can make $y$ as large as we want by making $M$ large.
  • Bots create millions of farm pages.
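The closed form for $y$ can be checked numerically. The sketch below evaluates the two coefficients at $\beta = 0.85$ and compares the closed form against a fixed-point iteration of $y = x + \beta M (\beta y / M + (1 - \beta)/N)$. The function names and the sample values of `x`, `M`, and `N` are illustrative assumptions, not values from the lecture.

```python
def farm_target_rank(x, M, N, beta=0.85):
    """Closed form: y = x / (1 - beta^2) + beta / (1 + beta) * M / N."""
    return x / (1 - beta**2) + (beta / (1 + beta)) * (M / N)


def farm_target_rank_iterative(x, M, N, beta=0.85, iters=200):
    """Iterate y = x + beta * M * (beta * y / M + (1 - beta) / N) to a fixed point."""
    y = 0.0
    for _ in range(iters):
        owned = beta * y / M + (1 - beta) / N  # PageRank of each owned page
        y = x + beta * M * owned               # target collects beta * owned from M pages
    return y


beta = 0.85
coef_x = 1 / (1 - beta**2)   # multiplier on x, about 3.6
coef_m = beta / (1 + beta)   # multiplier on M / N, about 0.46

# Hypothetical sizes: one billion pages on the web, one million farm pages.
x, M, N = 0.001, 1_000_000, 1_000_000_000
y_closed = farm_target_rank(x, M, N)
y_iter = farm_target_rank_iterative(x, M, N)
print(round(coef_x, 2), round(coef_m, 2), y_closed, y_iter)
```

Doubling `M` in this sketch raises the second term linearly, which is the sense in which the spammer can inflate $y$ by adding farm pages.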

TrustRank
• Assumption: Trustworthy page is unlikely to link to a spam page.
• TrustRank: Let’s bias the random walk to trustworthy pages.
  • When the walker teleports, pick a page from the teleport set $S$.
• Two approaches for developing a trustworthy teleport set:
  • Let humans examine a set of pages with highest PageRanks.
  • Pick a domain (e.g., .edu, .gov, .ac.kr, etc.) where membership is controlled.

[Figure: spam fraction grows with link distance from a trusted page: kaist.ac.kr (0% spam), icml.cc (0.1% spam), my.blogger.com (1% spam), abc123.biz (10% spam), each one link hop apart.]

Matrix Formulation
• Update the teleportation part of the PageRank formulation:

  $A_{ij} = \begin{cases} \beta M_{ij} + (1 - \beta) / |S| & \text{if } i \in S \\ \beta M_{ij} & \text{otherwise} \end{cases}$

• Note that $A$ is still a stochastic matrix.
• We can also assign different weights to the pages in $S$.
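The matrix formulation above can be sketched as a power iteration in which teleportation only lands on pages in $S$. The example uses the lecture's four-node setting ($S = \{1\}$, $\beta = 0.8$); the edge list `out_links` is an assumption reconstructed from the figure's edge labels, and the function name is illustrative. The sketch also assumes the graph has no dead ends.

```python
def topic_specific_pagerank(out_links, teleport_set, beta=0.8, iters=100):
    """Power iteration with teleportation restricted to teleport_set.

    Assumes every node has at least one out-link (no dead ends).
    Returns the final scores and the score vector after every iteration.
    """
    nodes = sorted(out_links)
    r = {v: 1.0 / len(nodes) for v in nodes}
    history = [dict(r)]
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            for v in out_links[u]:
                # Follow a link with probability beta.
                nxt[v] += beta * r[u] / len(out_links[u])
        for s in teleport_set:
            # Teleport with probability 1 - beta, uniformly over S.
            nxt[s] += (1 - beta) / len(teleport_set)
        r = nxt
        history.append(dict(r))
    return r, history


# Assumed edges: 1 -> 2, 1 -> 3, 2 -> 1, 3 -> 4, 4 -> 3.
out_links = {1: [2, 3], 2: [1], 3: [4], 4: [3]}
scores, history = topic_specific_pagerank(out_links, teleport_set={1})
for node in sorted(scores):
    print(node, round(history[1][node], 2), round(history[2][node], 2),
          round(scores[node], 3))
```

Assigning different weights to the pages in $S$ would amount to replacing the uniform share `(1 - beta) / len(teleport_set)` with a per-page weight that sums to `1 - beta`.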


Example: Topic-Specific PageRank
• TrustRank is an example of a topic-specific PageRank.
  • Teleport set has high PageRank scores even with few in-links.

Suppose $S = \{1\}$, $\beta = 0.8$.

Node   Iter 0   Iter 1   Iter 2   …   Stable
1      0.25     0.4      0.28     …   0.294
2      0.25     0.1      0.16     …   0.118
3      0.25     0.3      0.32     …   0.327
4      0.25     0.2      0.24     …   0.261

[Figure: four-node graph. Transition probabilities 0.5, 0.5, 1, 1, 1 become 0.4, 0.4, 0.8, 0.8, 0.8 after scaling by $\beta$, plus a teleport link of weight 0.2 into node 1.]

Application: Proximity on Graphs
• Proximity: How close are nodes $A$ and $B$ in this graph?

[Figure: example graph containing nodes A, B, D, E, F, and G.]

Good Proximity Measure?
• Shortest path between nodes is not good enough:
  • (Left) Degree-1 nodes $E$, $F$, and $G$ have no effects.
  • (Right) Multi-faceted relationships are not considered.

[Figure: two example graphs (left and right) containing the nodes A and B.]

Random Walk with Restarts
• Solution: Random walk with restarts.
  • Also known as Personalized PageRank.
  • Teleport always to the query node.
  • E.g., $S = \{A\}$ in this case.
• The score $r_j$ of each node $j$ is its proximity to the query node $i$.
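A small sketch of random walk with restarts may help show why it captures more than shortest-path distance. The toy graph is hypothetical (it is not the graph in the slides): nodes `B` and `C` are both two hops from the query node `A`, but `B` is reachable through two intermediate nodes and `C` through only one. The function names are illustrative.

```python
def build_undirected(edges):
    """Adjacency lists for an undirected graph given as a list of node pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def random_walk_with_restarts(adj, query, beta=0.8, iters=200):
    """Personalized PageRank whose teleport set is the single query node."""
    r = {v: (1.0 if v == query else 0.0) for v in adj}
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for u in adj:
            share = beta * r[u] / len(adj[u])  # walk to a random neighbor
            for v in adj[u]:
                nxt[v] += share
        nxt[query] += 1 - beta                 # restart at the query node
        r = nxt
    return r


# Hypothetical graph: A-P-B and A-Q-B (two paths to B), A-R-C (one path to C).
edges = [("A", "P"), ("A", "Q"), ("A", "R"), ("P", "B"), ("Q", "B"), ("R", "C")]
scores = random_walk_with_restarts(build_undirected(edges), query="A")
print(scores["B"], scores["C"])
```

Shortest-path distance treats `B` and `C` identically here, while the RWR score of `B` comes out higher because more walks from `A` pass through it.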


Application: Recommendation
• RWR can be naturally used for recommendation.
  1. Create a bipartite graph from a utility matrix.
  2. Run RWR starting from the query user.
  3. Proximity scores represent the probabilities to buy.
• Finding similar items can also be done with RWR.
• Limitation: Need to rerun RWR for each query.

Source: PGL

Outline
1. Topic-Specific PageRank
2. Clustering and Partitioning
3. Finding Overlapping Communities

Clustering of Nodes
• What else can we do in graph-structured data?
• Clustering: Way to find communities of nodes.
  • Useful for various graph data, not just the web.
  • E.g., finding user groups in social media.
• We need an approach specialized for graph data.
  • Graph is a special non-Euclidean space.
  • Existing clustering algorithms won’t be enough.

Source: Wikipedia

Betweenness
• Idea: Let’s find inter-cluster edges and remove them.
• Assumptions:
  • There will be many edges inside each cluster.
  • There won’t be many edges between clusters.
• We find such edges by computing their betweenness.
  • Betweenness = score for being an inter-cluster edge.


Betweenness
• Betweenness of edge $(a, b)$ is defined as
  • The number of pairs of nodes $x$ and $y$ such that $(a, b)$ lies on the shortest path between $x$ and $y$.
  • If there are several shortest paths, credit with a fraction of them.

Example: Betweenness
• Betweenness of $(B, D)$ is 12.
  • It is on every shortest path between any of $\{A, B, C\}$ to any of $\{D, E, F, G\}$.
• Betweenness of $(E, F)$ is 1.5.
  • $1 \times (E, F) + 0.5 \times (E, G)$. Why? $(E, G)$ has another shortest path $(E, D, G)$.

[Figure: graph on nodes A to G with edge betweenness values: (A, B) = 5, (A, C) = 1, (B, C) = 5, (B, D) = 12, (D, E) = 4.5, (D, F) = 4, (D, G) = 4.5, (E, F) = 1.5, (F, G) = 1.5.]
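The definition above can be computed with one breadth-first search per source node, splitting credit among shortest paths. The sketch below uses a Brandes-style accumulation, which is one standard way to do this and not necessarily the procedure used in the lecture. The edge list is an assumption reconstructed from the example figure's labels.

```python
from collections import defaultdict, deque


def edge_betweenness(adj):
    """Edge betweenness for an undirected, unweighted graph."""
    bet = defaultdict(float)
    for s in adj:
        # Phase 1: BFS from s, counting shortest paths (sigma) to each node.
        dist = {s: 0}
        sigma = defaultdict(float)
        sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Phase 2: push credit from the farthest nodes back toward s.
        credit = defaultdict(float)
        for v in reversed(order):
            for u in preds[v]:
                c = (sigma[u] / sigma[v]) * (1.0 + credit[v])
                bet[frozenset((u, v))] += c
                credit[u] += c
    # Every unordered pair (x, y) was counted from both endpoints.
    return {edge: value / 2.0 for edge, value in bet.items()}


# Assumed edges of the example graph.
edges = ["AB", "AC", "BC", "BD", "DE", "DF", "DG", "EF", "FG"]
adj = defaultdict(list)
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

bet = edge_betweenness(adj)
for edge in edges:
    print(edge, bet[frozenset(edge)])
```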

Betweenness for Clustering
• Betweenness can be directly used for clustering:
  • High betweenness suggests $(B, D)$ connects different communities.
  • Repeatedly remove edges with highest betweenness to get clusters.
  • See the Girvan-Newman algorithm (Chapter 10.2.4).

[Figure: the example graph after removing edge (B, D).]

Graph Partitioning
• Partitioning: Similar to clustering, but more focus on cut.
• Given a graph, divide nodes into two sets so that
  • The size of the cut, the set of edges between different sets, is minimized.

[Figure: graph on nodes A to H with the “smallest cut” and the “best cut” marked.]


Normalized Cuts
• It is a better partitioning if the two node sets are similar in size.
• Normalized cut for $S$ and $T$ is:

  $\frac{\text{cut}(S, T)}{\text{vol}(S)} + \frac{\text{cut}(S, T)}{\text{vol}(T)}$

  • $\text{cut}(S, T)$ = The number of edges that connect $S$ and $T$.
  • $\text{vol}(S)$ = The number of edges with at least one end in $S$.

Example: Normalized Cuts
• The smallest cut has a normalized cut $1/1 + 1/11 = 1.09$.
• The best cut has a normalized cut $2/6 + 2/7 = 0.62$.
• Note that computing (or finding) the optimal cut is NP-hard.

[Figure: graph on nodes A to H with the “smallest cut” and the “best cut” marked.]
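The two normalized-cut values can be reproduced directly from the definitions of cut and vol. The edge list below is an assumption: it is one 11-edge graph on nodes A to H that is consistent with the figure's node layout and with the numbers $1/1 + 1/11$ and $2/6 + 2/7$, with $H$ attached only to $C$ and the best cut crossing $(B, D)$ and $(C, G)$.

```python
def normalized_cut(edges, S):
    """cut(S, T) / vol(S) + cut(S, T) / vol(T), where T is the complement of S."""
    S = set(S)
    cut = sum(1 for u, v in edges if (u in S) != (v in S))
    vol_s = sum(1 for u, v in edges if u in S or v in S)          # an end in S
    vol_t = sum(1 for u, v in edges if u not in S or v not in S)  # an end in T
    return cut / vol_s + cut / vol_t


# Assumed edge list for the eight-node example.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "H"),
         ("B", "D"), ("C", "G"),
         ("D", "E"), ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]

smallest = normalized_cut(edges, {"H"})             # cuts only (C, H)
best = normalized_cut(edges, {"A", "B", "C", "H"})  # cuts (B, D) and (C, G)
print(round(smallest, 2), round(best, 2))
```

The single-edge cut has the smaller cut size but the larger normalized cut, because one side has volume 1.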

Pop Quiz
• Find the betweenness values for the edges below.

[Figure: graph with nodes A, B, C, D, and E.]

Outline
1. Topic-Specific PageRank
2. Clustering and Partitioning
3. Finding Overlapping Communities


Overlapping Communities
• Clustering and partitioning:
  • Two different ways to detect non-overlapping communities.
• We often want to find overlapping communities.
  • Can give us better communities by relaxing the constraint.
  • Limitation: Much harder to find an optimal solution.

[Figure: example graph with nodes A, B, D, and E.]

Maximum Likelihood Estimation
• Idea: Let’s learn the cluster assignment of nodes from data.
• Maximum likelihood estimation (MLE):
  • Model the generative process of a graph as a function $f$.
  • $f$ has a set of parameters $\theta$ that determine the likelihood $\mathcal{L}$.
  • The values of $\theta$ are “optimal” with the highest likelihood.
• We apply gradient descent to find optimal $\theta$: $\theta \leftarrow \theta + \eta \, \partial \mathcal{L} / \partial \theta$ with a step size $\eta > 0$. Because $\mathcal{L}$ is maximized, the update adds the gradient; this is the same as gradient descent on $-\mathcal{L}$.

Example: MLE
• Let’s assume the generative process (or model) as follows:
  • All edges are independent of each other.
  • Each edge is present with probability $p = 0.1$.
• Then, a random graph $X$ of 3 nodes follows one of four cases.
  • The probability $P(X = G)$ for each $G$ is determined by $p$.

[Figure: the four cases for a 3-node graph. Three edges: Pr(instance) = $0.1^3$. Two edges: Pr(instance) = $3 \times 0.1^2 \times 0.9$. One edge: Pr(instance) = $3 \times 0.1 \times 0.9^2$. No edges: Pr(instance) = $0.9^3$.]

Example: MLE
• Given a graph $G$, the “correct” value of $p$ is the one that gives the highest $\Pr(G)$.
  • $\Pr(G)$: The probability of generating an observed graph.
• Suppose we observe a graph $G$ having 15 nodes and 23 edges.
  • The number of pairs of nodes is $15 \times 14 / 2 = 105$.
  • Easy to check that $\Pr(G)$ is maximized when $p = 23/105$.
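The claim that $p = 23/105$ maximizes $\Pr(G)$ follows from $\Pr(G) = p^{23} (1 - p)^{82}$: setting the derivative of $23 \log p + 82 \log(1 - p)$ to zero gives $23/p = 82/(1 - p)$, so $p = 23/105$. A minimal numeric check, with an illustrative grid search standing in for the calculus, could look like this:

```python
import math


def log_likelihood(p, n_edges, n_pairs):
    """log Pr(G) when each of n_pairs possible edges appears independently with prob. p."""
    return n_edges * math.log(p) + (n_pairs - n_edges) * math.log(1 - p)


n_nodes, n_edges = 15, 23
n_pairs = n_nodes * (n_nodes - 1) // 2   # 105 node pairs
p_mle = n_edges / n_pairs                # closed-form maximizer, 23 / 105

# Grid search over p = 0.001, 0.002, ..., 0.999 as a sanity check.
grid = [k / 1000 for k in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, n_edges, n_pairs))

# The four 3-node cases with p = 0.1 form a probability distribution.
three_node_cases = [0.1**3, 3 * 0.1**2 * 0.9, 3 * 0.1 * 0.9**2, 0.9**3]
print(n_pairs, p_mle, p_grid, sum(three_node_cases))
```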
Affiliation Graph Model
• Affiliation graph model is a mechanism to generate social graphs.
  • Users (= nodes) belong to different communities.
  • Users (= nodes) are connected only in each community.
  • Graph has edges if users are connected in any community.

[Figure: a model (two communities A and B with probabilities $p_A$ and $p_B$, memberships, individuals) generating a social graph.]

Affiliation Graph Model
• Fixed (and given): The numbers of nodes and communities.
• Parameter 1: Community assignment of nodes.
  • Each community can have any set of individuals as members.
• Parameter 2: Probability $p_C$ for each community $C$ such that
  • Two members of $C$ create a connection with probability $p_C$.

[Figure: two communities A and B with probabilities $p_A$ and $p_B$, memberships, and individuals.]

Example: Affiliation Graph Model
• Q: Likelihood of a graph (top) given a model (bottom)?
  • Model: Cluster assignments of nodes, and probabilities $p_A$, $p_B$, $p_C$.
• $\Pr(G) = p_{uv} \, p_{vw} \, (1 - p_{uw})$
  • $p_{uv} = p_A$
  • $p_{vw} = 1 - (1 - p_B)(1 - p_C)$
  • $p_{uw} = \epsilon$ (a small number)

[Figure: observed graph on nodes u, v, w (top) and a model with communities A, B, C and probabilities $p_A$, $p_B$, $p_C$ (bottom).]

Likelihood of AGM
• Probability of an edge $(u, v)$ if they are in communities $M$:

  $p_{uv} = 1 - \prod_{C \in M} (1 - p_C)$

• $p_{uv} = \epsilon$ if $u$ and $v$ are not in any communities together.
  • Since they can still be friends although unlikely.
• Likelihood of an observed graph $G$ with edges $E$ is given as:

  $p(G) = \prod_{(u,v) \in E} p_{uv} \prod_{(u,v) \notin E} (1 - p_{uv})$
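The likelihood formula can be evaluated mechanically for the three-node example. In the sketch below, the memberships are inferred from the example's edge probabilities ($u$ and $v$ share only $A$; $v$ and $w$ share $B$ and $C$; $u$ and $w$ share nothing). The numeric values of `p` and `eps` are hypothetical, as are the function names.

```python
from itertools import combinations


def edge_prob(u, v, members, p, eps=1e-4):
    """p_uv = 1 - prod over shared communities C of (1 - p_C); eps if none are shared."""
    shared = [c for c, nodes in members.items() if u in nodes and v in nodes]
    if not shared:
        return eps
    no_edge = 1.0
    for c in shared:
        no_edge *= 1.0 - p[c]
    return 1.0 - no_edge


def agm_likelihood(nodes, edges, members, p, eps=1e-4):
    """Product of p_uv over observed edges and (1 - p_uv) over absent pairs."""
    edge_set = {frozenset(e) for e in edges}
    likelihood = 1.0
    for u, v in combinations(nodes, 2):
        p_uv = edge_prob(u, v, members, p, eps)
        likelihood *= p_uv if frozenset((u, v)) in edge_set else 1.0 - p_uv
    return likelihood


members = {"A": {"u", "v"}, "B": {"v", "w"}, "C": {"v", "w"}}  # inferred memberships
p = {"A": 0.8, "B": 0.5, "C": 0.4}                             # hypothetical values
likelihood = agm_likelihood(["u", "v", "w"], [("u", "v"), ("v", "w")], members, p)
print(likelihood)  # equals p_A * (1 - (1 - p_B) * (1 - p_C)) * (1 - eps)
```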


Optimization of Community Assignments
• Community assignment of nodes is a discrete parameter.
  • MLE solution is the assignment that has the highest likelihood.
  • However, we cannot use gradient descent.
• Once we fix on an assignment, we can find the probabilities $p_C$.

[Figure: nodes u, v, w with candidate assignments to communities A and B.]

Continuous Community Assignment
• Assume a “strength of membership” for each node and community.
  • Common trick to allow gradient descent.
• For each community $C$,
  • There is a strength of membership $F_{xC} \geq 0$ for each node $x$.
  • Probability for edge $(u, v)$ is $p_C(u, v) = 1 - \exp(-F_{uC} F_{vC})$.

[Figure: nodes u, v, w with membership strengths $F_{uA}$, $F_{vA}$, $F_{wB}$ to communities A and B.]

Continuous Community Assignment
• Recall that the likelihood of the graph $G$ with edges $E$ is:

  $p(G) = \prod_{(u,v) \in E} p_{uv} \prod_{(u,v) \notin E} (1 - p_{uv})$

• Now, the probability of an edge between nodes $u$ and $v$ is:

  $p_{uv} = 1 - \prod_{C \in M} \left( 1 - p_C(u, v) \right) = 1 - \exp\left( -\sum_C F_{uC} F_{vC} \right)$

• Now we can use gradient descent to maximize the likelihood.

Log Likelihood
• Finally, we use the negative log likelihood as an objective function.
  • Update all parameters to minimize $l(\theta) = -\log \Pr(G)$.
• In machine learning, we usually compute the log likelihood.
  • Products become sums, which often simplifies expressions.
  • Summing many numbers is less prone to numerical rounding errors.
  • Compared to taking the product of many tiny numbers.

  $\log(10^{-10} \times 10^{-10}) = \log 10^{-10} + \log 10^{-10} = -20$
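A minimal sketch of the resulting optimization, assuming projected gradient descent on the negative log likelihood with $F \geq 0$ enforced by clipping. The three-node graph matches the running example (edges $u$-$v$ and $v$-$w$); the initial strengths, the learning rate, and the function names are hypothetical choices. The last lines illustrate why sums of logs are preferred over products of tiny numbers.

```python
import math
from itertools import combinations


def dot(F, u, v):
    """sum over communities C of F[u][C] * F[v][C]."""
    return sum(a * b for a, b in zip(F[u], F[v]))


def neg_log_likelihood(F, edge_set, tiny=1e-12):
    """l = -log Pr(G) with p_uv = 1 - exp(-dot(F, u, v))."""
    nll = 0.0
    for u, v in combinations(range(len(F)), 2):
        d = dot(F, u, v)
        if frozenset((u, v)) in edge_set:
            nll -= math.log(max(1.0 - math.exp(-d), tiny))
        else:
            nll += d  # -log(1 - p_uv) = d for an absent edge
    return nll


def gradient_step(F, edge_set, lr=0.05, tiny=1e-12):
    """One projected gradient-descent step on the negative log likelihood."""
    n, k = len(F), len(F[0])
    grad = [[0.0] * k for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if u == v:
                continue
            d = dot(F, u, v)
            if frozenset((u, v)) in edge_set:
                w = -math.exp(-d) / max(1.0 - math.exp(-d), tiny)
            else:
                w = 1.0
            for c in range(k):
                grad[u][c] += w * F[v][c]
    # Clip at zero so membership strengths stay non-negative.
    return [[max(0.0, F[u][c] - lr * grad[u][c]) for c in range(k)] for u in range(n)]


edge_set = {frozenset((0, 1)), frozenset((1, 2))}  # u = 0, v = 1, w = 2
F = [[0.5, 0.1], [0.5, 0.5], [0.1, 0.5]]           # hypothetical initial strengths
nll_start = neg_log_likelihood(F, edge_set)
for _ in range(200):
    F = gradient_step(F, edge_set)
nll_end = neg_log_likelihood(F, edge_set)
print(nll_start, nll_end)

# Products of tiny probabilities underflow; sums of logs do not.
tiny_probs = [1e-10] * 40
product = math.prod(tiny_probs)
log_sum = sum(math.log10(q) for q in tiny_probs)
print(product, log_sum)
```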


Pop Quiz
• What is the likelihood of the observed graph using the new model?

[Figure: model with communities A, B, C over nodes u, v, w, and the observed graph on u, v, w.]

Summary
1. Topic-Specific PageRank
  • TrustRank
  • Random walk with restarts
2. Clustering and Partitioning
  • Betweenness
  • Normalized cuts
3. Finding Overlapping Communities
  • Maximum likelihood estimation
  • Log likelihood
