CSF-469-L11-13 (Link Analysis Page Rank)
Web as a Graph
◼ Web as a directed graph:
▪ Nodes: Webpages
▪ Edges: Hyperlinks
[Figure: example web graph; pages such as “I teach a class on IR” (a faculty page), “CS F469: Classes are in the LTC building” (a course page), “CSIS DEPT BITS HYD”, and “BITS PILANI HYD CAMPUS” link to one another via hyperlinks]
Web as a Directed Graph
Broad Question
◼ How to organize the Web?
◼ First try: human-curated Web directories
▪ Yahoo, DMOZ, LookSmart
◼ Second try: Web search
▪ Information retrieval investigates: find relevant docs in a small and trusted set
▪ Newspaper articles, patents, etc.
▪ But: the Web is huge, full of untrusted documents, random things, web spam, etc.
Web Search: 2 Challenges
2 challenges of web search:
◼ (1) The Web contains many sources of information. Whom to “trust”?
▪ Trick: trustworthy pages may point to each other!
◼ (2) What is the “best” answer to the query “newspaper”?
▪ No single right answer
▪ Trick: pages that actually know about newspapers might all be pointing to many newspapers
Ranking Nodes on the Graph
◼ Not all web pages are equally “important”
http://universe.bits-pilani.ac.in/hyderabad/arunamalapati/Profile
vs.
http://www.bits-pilani.ac.in/Hyderabad/index.aspx
[Figure: example graph of nodes A–F plus several minor nodes, annotated with PageRank scores: B = 38.4, C = 34.3, E = 8.1, D = 3.9, F = 3.9, A = 3.3, and the minor nodes at 1.6 each]
Simple Recursive Formulation
◼ Each link’s vote is proportional to the importance of its source page
◼ If page j with importance rj has dj out-links, each link gets rj/dj votes
◼ Page j’s own importance rj is the sum of the votes on its in-links:
rj = ri/3 + rk/4
[Figure: node i has 3 out-links and node k has 4; each passes ri/3 and rk/4 to j, and j passes rj/3 along each of its own 3 out-links]
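To make the vote rule concrete, here is a minimal Python sketch of one round of vote passing (the toy graph and starting ranks are illustrative, not from the slides):

```python
# One round of vote passing: each page i sends rank[i] / d_i along
# every out-link (graph and ranks are made up for illustration).
out_links = {"i": ["j", "x", "y"],       # node i has 3 out-links
             "k": ["j", "x", "y", "z"]}  # node k has 4 out-links
rank = {"i": 1.0, "k": 1.0, "j": 0.0, "x": 0.0, "y": 0.0, "z": 0.0}

votes = {page: 0.0 for page in rank}
for i, targets in out_links.items():
    for j in targets:
        votes[j] += rank[i] / len(targets)

print(votes["j"])  # ri/3 + rk/4 = 1/3 + 1/4 ≈ 0.583
```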
PageRank: The “Flow” Model
◼ A “vote” from an important page is worth more
◼ Flow equation: rj = Σi→j ri/di (di … out-degree of node i)
◼ In matrix form: define M with Mji = 1/di if i → j (0 otherwise); M is column stochastic, and the flow equations become M · r = r
[Figure: “The web in 1839” toy graph illustrating the flow equations]
Eigenvector Formulation
◼ The flow equations M · r = r say that the rank vector r is an eigenvector of the stochastic matrix M, with eigenvalue 1
◼ This lets us solve for r with standard eigenvector methods, e.g., power iteration (next)
Example: Flow Equations & M
[Figure: graph on pages y, a, m: y links to itself and a; a links to y and m; m links to a]

M (column = source page, row = destination):
      y    a    m
y     ½    ½    0
a     ½    0    1
m     0    ½    0

Flow equations, r = M · r:
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2
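Since r is the eigenvector of M with eigenvalue 1, this small example can also be solved directly; a minimal sketch using NumPy (assuming it is available):

```python
import numpy as np

# Column-stochastic matrix M for the y/a/m example above:
# column i holds 1/d_i for each out-link of page i.
M = np.array([[0.5, 0.5, 0.0],   # y <- y/2 + a/2
              [0.5, 0.0, 1.0],   # a <- y/2 + m
              [0.0, 0.5, 0.0]])  # m <- a/2

eigvals, eigvecs = np.linalg.eig(M)
r = np.real(eigvecs[:, np.argmax(np.real(eigvals))])  # eigenvalue-1 vector
r = r / r.sum()                  # normalize so the ranks sum to 1
print(r)                         # ~ [0.4, 0.4, 0.2] for (y, a, m)
```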
Power Iteration Method
◼ Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks
◼ Power iteration: a simple iterative scheme
▪ Initialize: r(0) = [1/N, …, 1/N]T
▪ Iterate: r(t+1) = M · r(t), where Mji = 1/di if i → j (di … out-degree of node i)
▪ Stop when |r(t+1) – r(t)|1 < ε
▪ |x|1 = Σ1≤i≤N |xi| is the L1 norm; any other vector norm (e.g., Euclidean) also works
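A minimal sketch of this scheme, assuming M is given as a dense column-stochastic NumPy array (for the real Web one would use a sparse representation):

```python
import numpy as np

def power_iteration(M, eps=1e-8, max_iter=100):
    """Iterate r(t+1) = M·r(t) until the L1 change drops below eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)                 # r(0) = [1/N, ..., 1/N]^T
    for _ in range(max_iter):
        r_next = M @ r                      # r(t+1) = M · r(t)
        if np.abs(r_next - r).sum() < eps:  # L1-norm stopping rule
            return r_next
        r = r_next
    return r
```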
PageRank: How to solve?
[Figure: same y/a/m graph as in the earlier example]

      y    a    m
y     ½    ½    0
a     ½    0    1
m     0    ½    0

ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2

Power iteration from r(0) = (⅓, ⅓, ⅓):
ry: 1/3  1/3  5/12   9/24  …  6/15
ra: 1/3  1/2  1/3   11/24  …  6/15
rm: 1/3  1/6  1/4    1/6   …  3/15
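A quick numerical check reproduces the iterates in the table above:

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
r = np.full(3, 1.0 / 3)          # iteration 0: (1/3, 1/3, 1/3)
for t in range(3):
    r = M @ r
    print(t + 1, r)              # (1/3, 1/2, 1/6), (5/12, 1/3, 1/4), ...
```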
Random Walk Interpretation
◼ Imagine a random web surfer: at any time t the surfer is on some page i, and at time t+1 follows one of page i’s out-links uniformly at random
◼ Let p(t) be the probability distribution over pages at time t; then p(t+1) = M · p(t)
[Figure: surfer at node j with in-neighbors i1, i2, i3]
The Stationary Distribution
◼ Suppose the walk reaches a state where p(t+1) = M · p(t) = p(t); then p(t) is a stationary distribution of the random walk
◼ Our rank vector r satisfies r = M · r, so r is a stationary distribution of the random-surfer walk
Existence and Uniqueness
◼ A central result from the theory of random walks (a.k.a. Markov processes):
▪ For graphs that satisfy certain conditions, the stationary distribution is unique and is eventually reached no matter what the initial distribution is
◼ Example (two nodes with a → b and b → a): the walk oscillates and never converges
ra: 1  0  1  0  …
rb: 0  1  0  1  …
Iteration 0, 1, 2, …
Does it converge to what we want?
◼ Example (a → b, where b has no out-links): all the score leaks out
ra: 1  0  0  0  …
rb: 0  1  0  0  …
Iteration 0, 1, 2, …
PageRank: Problems
2 problems:
◼ (1) Some pages are dead ends (have no out-links)
▪ The random walk has “nowhere” to go
▪ Such pages cause importance to “leak out”
◼ (2) Some pages are spider traps (all of their out-links stay within the group)
▪ The random walk gets “stuck” in the trap, which eventually absorbs all importance
[Figure: small graphs illustrating a dead end and a spider trap]

◼ Example: m is a spider trap (m links only to itself)
ry = ry/2 + ra/2
ra = ry/2
rm = ra/2 + rm
Iteration 0, 1, 2, … : all the PageRank score gets “trapped” in node m
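A quick numerical check of the trap behavior (same y/a/m graph, with m linking only to itself):

```python
import numpy as np

M_trap = np.array([[0.5, 0.5, 0.0],   # y <- y/2 + a/2
                   [0.5, 0.0, 0.0],   # a <- y/2
                   [0.0, 0.5, 1.0]])  # m <- a/2 + m (spider trap)
r = np.full(3, 1.0 / 3)
for _ in range(100):
    r = M_trap @ r
print(r)  # ~ [0, 0, 1]: node m has absorbed all the score
```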
Solution: Teleports!
◼ The Google solution for spider traps: At each
time step, the random surfer has two options
▪ With prob. β, follow a link at random
▪ With prob. 1-β, jump to some random page
▪ Common values for β are in the range 0.8 to 0.9
◼ Surfer will teleport out of spider trap
within a few time steps
[Figure: the y/a/m spider-trap graph, before and after adding teleport links from m]
Problem: Dead Ends
[Figure: y/a/m graph where m is a dead end]

      y    a    m
y     ½    ½    0
a     ½    0    0
m     0    ½    0

ry = ry/2 + ra/2
ra = ry/2
rm = ra/2
Iteration 0, 1, 2, … : here the PageRank “leaks out”, since column m is all zeros and M is not column stochastic
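The leak is easy to see numerically: with the dead-end column of zeros, the total rank mass shrinks at every step (a small sketch):

```python
import numpy as np

M_dead = np.array([[0.5, 0.5, 0.0],
                   [0.5, 0.0, 0.0],
                   [0.0, 0.5, 0.0]])  # column m is all zeros (dead end)
r = np.full(3, 1.0 / 3)
for _ in range(100):
    r = M_dead @ r
print(r, r.sum())  # total mass has leaked toward 0 instead of staying 1
```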
Solution: Always Teleport!
◼ Teleports: from a dead end, follow a random teleport link with probability 1.0
▪ Adjust the matrix accordingly
[Figure: y/a/m graph with dead end m, before and after adding teleports from m]

M before (m is a dead end):
      y    a    m
y     ½    ½    0
a     ½    0    0
m     0    ½    0

M after (teleport with probability 1 from m):
      y    a    m
y     ½    ½    ⅓
a     ½    0    ⅓
m     0    ½    ⅓
Why Teleports Solve the Problem?
◼ Theory of Markov chains: a random walk has a unique positive stationary distribution, reached from any starting distribution, if its transition matrix is stochastic, aperiodic, and irreducible
◼ Teleports make M stochastic (dead ends fixed), aperiodic (no fixed-period cycles), and irreducible (every page reachable from every other)
Solution: Random Teleports
◼ At each step, the random surfer follows an out-link with probability β and teleports to a random page with probability 1-β, which gives the PageRank equation:
rj = Σi→j β · ri/di + (1-β) · 1/N
(di … out-degree of node i; this formulation assumes M has no dead ends)
The Google Matrix
◼ The same model in matrix form, the Google Matrix A:
A = β · M + (1-β) · [1/N]N×N
([1/N]N×N … N-by-N matrix where all entries are 1/N)
◼ We then have r = A · r, and power iteration still works since A is stochastic, aperiodic, and irreducible
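A sketch that builds A for the spider-trap graph used in the β = 0.8 example below and power-iterates; materializing the dense [1/N] matrix is fine for a toy graph, though not for the real Web:

```python
import numpy as np

beta = 0.8
M = np.array([[0.5, 0.5, 0.0],   # y <- y/2 + a/2
              [0.5, 0.0, 0.0],   # a <- y/2
              [0.0, 0.5, 1.0]])  # m <- a/2 + m (spider trap)
N = M.shape[0]
A = beta * M + (1 - beta) * np.full((N, N), 1.0 / N)  # Google matrix

r = np.full(N, 1.0 / N)
for _ in range(100):
    r = A @ r
print(r)  # ~ (7/33, 5/33, 21/33) ≈ (0.212, 0.152, 0.636)
```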
Random Teleports (β = 0.8)
A = 0.8 · M + 0.2 · [1/N]N×N, with N = 3:

M (m is a spider trap):
      y    a    m
y     ½    ½    0
a     ½    0    0
m     0    ½    1

A = 0.8 · M + 0.2 · [⅓]3×3:
      y      a      m
y    7/15   7/15   1/15
a    7/15   1/15   1/15
m    1/15   7/15  13/15

[Figure: the resulting fully connected y/a/m graph with edge weights 7/15, 13/15, and 1/15]

Power iteration on A converges to r ≈ (7/33, 5/33, 21/33)
If the graph has no dead ends, the amount of leaked PageRank is exactly 1-β. But since we have dead ends, the leaked amount may be larger, and we have to account for it explicitly by computing S = Σj r′j after each iteration and re-inserting the missing mass 1-S uniformly.
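A minimal sparse-friendly sketch of this re-insertion trick (names are illustrative): propagate β·ri/di along real links only, measure S, and spread the missing mass 1-S evenly:

```python
def pagerank(out_links, beta=0.8, eps=1e-8):
    """out_links: dict page -> list of pages it links to (may be empty)."""
    pages = list(out_links)
    N = len(pages)
    r = {p: 1.0 / N for p in pages}
    while True:
        r_new = {p: 0.0 for p in pages}
        for i, targets in out_links.items():
            for j in targets:                    # real links only
                r_new[j] += beta * r[i] / len(targets)
        S = sum(r_new.values())                  # leaked mass is 1 - S
        for p in pages:
            r_new[p] += (1.0 - S) / N            # re-insert it uniformly
        if sum(abs(r_new[p] - r[p]) for p in pages) < eps:
            return r_new
        r = r_new

# Dead-end example: m has no out-links, yet the ranks still sum to 1.
print(pagerank({"y": ["y", "a"], "a": ["y", "m"], "m": []}))
```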
Some Problems with Page Rank
◼ Measures generic popularity of a page
▪ Biased against topic-specific authorities
▪ Solution: Topic-Specific PageRank (next)
◼ Uses a single measure of importance
▪ Other models of importance
▪ Solution: Hubs-and-Authorities
◼ Susceptible to link spam
▪ Artificial link topologies created in order to boost PageRank
▪ Solution: TrustRank