CSF-469-L11-13 (Link Analysis Page Rank)
Web as a Graph
◼ Web as a directed graph:
▪ Nodes: Webpages
▪ Edges: Hyperlinks
[Figure: example web graph; pages such as “I teach a class on IR” (a faculty page), “CS F469: Classes are in the LTC building” (a course page), “CSIS DEPT BITS HYD”, and “BITS PILANI HYD CAMPUS” link to one another via hyperlinks]
Web as a Directed Graph
Broad Question
◼ How to organize the Web?
◼ First try: human-curated Web directories
▪ Yahoo, DMOZ, LookSmart
◼ Second try: Web search
▪ Information retrieval investigates: find relevant docs in a small and trusted set
▪ Newspaper articles, patents, etc.
▪ But: the Web is huge, full of untrusted documents, random things, web spam, etc.
Web Search: 2 Challenges
2 challenges of web search:
◼ (1) The Web contains many sources of information. Whom to “trust”?
▪ Trick: trustworthy pages may point to each other!
◼ (2) What is the “best” answer to the query “newspaper”?
▪ No single right answer
▪ Trick: pages that actually know about newspapers might all be pointing to many newspapers
Ranking Nodes on the Graph
◼ Not all web pages are equally “important”
http://universe.bits-pilani.ac.in/hyderabad/arunamalapati/Profile
vs.
http://www.bits-pilani.ac.in/Hyderabad/index.aspx
[Figure: example graph of nodes A–F plus several minor nodes, annotated with PageRank scores: B = 38.4, C = 34.3, E = 8.1, D = 3.9, F = 3.9, A = 3.3, and the minor nodes at 1.6 each]
Simple Recursive Formulation
◼ Each link’s vote is proportional to the importance of its source page
◼ If page j with importance rj has dj out-links, each link gets rj/dj votes
◼ Page j’s own importance rj is the sum of the votes on its in-links:
rj = ri/3 + rk/4
[Figure: node i has 3 out-links and node k has 4; each passes ri/3 and rk/4 to j, and j passes rj/3 along each of its own 3 out-links]
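To make the vote rule concrete, here is a minimal Python sketch of one round of vote passing (the toy graph and starting ranks are illustrative, not from the slides):

```python
# One round of vote passing: each page i sends rank[i] / d_i along
# every out-link (graph and ranks are made up for illustration).
out_links = {"i": ["j", "x", "y"],       # node i has 3 out-links
             "k": ["j", "x", "y", "z"]}  # node k has 4 out-links
rank = {"i": 1.0, "k": 1.0, "j": 0.0, "x": 0.0, "y": 0.0, "z": 0.0}

votes = {page: 0.0 for page in rank}
for i, targets in out_links.items():
    for j in targets:
        votes[j] += rank[i] / len(targets)

print(votes["j"])  # ri/3 + rk/4 = 1/3 + 1/4 ≈ 0.583
```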
PageRank: The “Flow” Model
◼ A “vote” from an important page is worth more
◼ Flow equation: rj = Σi→j ri/di (di … out-degree of node i)
◼ In matrix form: define M with Mji = 1/di if i → j (0 otherwise); M is column stochastic, and the flow equations become M · r = r
[Figure: “The web in 1839” toy graph illustrating the flow equations]
Eigenvector Formulation
◼ The flow equations M · r = r say that the rank vector r is an eigenvector of the stochastic matrix M, with eigenvalue 1
◼ This lets us solve for r with standard eigenvector methods, e.g., power iteration (next)
Example: Flow Equations & M
[Figure: graph on pages y, a, m: y links to itself and a; a links to y and m; m links to a]

M (column = source page, row = destination):
      y    a    m
y     ½    ½    0
a     ½    0    1
m     0    ½    0

Flow equations, r = M · r:
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2
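Since r is the eigenvector of M with eigenvalue 1, this small example can also be solved directly; a minimal sketch using NumPy (assuming it is available):

```python
import numpy as np

# Column-stochastic matrix M for the y/a/m example above:
# column i holds 1/d_i for each out-link of page i.
M = np.array([[0.5, 0.5, 0.0],   # y <- y/2 + a/2
              [0.5, 0.0, 1.0],   # a <- y/2 + m
              [0.0, 0.5, 0.0]])  # m <- a/2

eigvals, eigvecs = np.linalg.eig(M)
r = np.real(eigvecs[:, np.argmax(np.real(eigvals))])  # eigenvalue-1 vector
r = r / r.sum()                  # normalize so the ranks sum to 1
print(r)                         # ~ [0.4, 0.4, 0.2] for (y, a, m)
```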
Power Iteration Method
◼ Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks
◼ Power iteration: a simple iterative scheme
▪ Initialize: r(0) = [1/N, …, 1/N]T
▪ Iterate: r(t+1) = M · r(t), where Mji = 1/di if i → j (di … out-degree of node i)
▪ Stop when |r(t+1) – r(t)|1 < ε
▪ |x|1 = Σ1≤i≤N |xi| is the L1 norm; any other vector norm (e.g., Euclidean) also works
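A minimal sketch of this scheme, assuming M is given as a dense column-stochastic NumPy array (for the real Web one would use a sparse representation):

```python
import numpy as np

def power_iteration(M, eps=1e-8, max_iter=100):
    """Iterate r(t+1) = M·r(t) until the L1 change drops below eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)                 # r(0) = [1/N, ..., 1/N]^T
    for _ in range(max_iter):
        r_next = M @ r                      # r(t+1) = M · r(t)
        if np.abs(r_next - r).sum() < eps:  # L1-norm stopping rule
            return r_next
        r = r_next
    return r
```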
PageRank: How to solve?
[Figure: same y/a/m graph as in the earlier example]

      y    a    m
y     ½    ½    0
a     ½    0    1
m     0    ½    0

ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2

Power iteration from r(0) = (⅓, ⅓, ⅓):
ry: 1/3  1/3  5/12   9/24  …  6/15
ra: 1/3  1/2  1/3   11/24  …  6/15
rm: 1/3  1/6  1/4    1/6   …  3/15
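A quick numerical check reproduces the iterates in the table above:

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
r = np.full(3, 1.0 / 3)          # iteration 0: (1/3, 1/3, 1/3)
for t in range(3):
    r = M @ r
    print(t + 1, r)              # (1/3, 1/2, 1/6), (5/12, 1/3, 1/4), ...
```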
Random Walk Interpretation
◼ Imagine a random web surfer: at any time t the surfer is on some page i, and at time t+1 follows one of page i’s out-links uniformly at random
◼ Let p(t) be the probability distribution over pages at time t; then p(t+1) = M · p(t)
[Figure: surfer at node j with in-neighbors i1, i2, i3]
The Stationary Distribution
◼ Suppose the walk reaches a state where p(t+1) = M · p(t) = p(t); then p(t) is a stationary distribution of the random walk
◼ Our rank vector r satisfies r = M · r, so r is a stationary distribution of the random-surfer walk
Existence and Uniqueness
◼ A central result from the theory of random walks (a.k.a. Markov processes):
▪ For graphs that satisfy certain conditions, the stationary distribution is unique and is eventually reached no matter what the initial distribution is
◼ Example (two nodes with a → b and b → a): the walk oscillates and never converges
ra: 1  0  1  0  …
rb: 0  1  0  1  …
Iteration 0, 1, 2, …
Does it converge to what we want?
◼ Example (a → b, where b has no out-links): all the score leaks out
ra: 1  0  0  0  …
rb: 0  1  0  0  …
Iteration 0, 1, 2, …
PageRank: Problems
2 problems:
◼ (1) Some pages are dead ends (have no out-links)
▪ The random walk has “nowhere” to go
▪ Such pages cause importance to “leak out”
◼ (2) Some pages are spider traps (all of their out-links stay within the group)
▪ The random walk gets “stuck” in the trap, which eventually absorbs all importance
[Figure: small graphs illustrating a dead end and a spider trap]

◼ Example: m is a spider trap (m links only to itself)
ry = ry/2 + ra/2
ra = ry/2
rm = ra/2 + rm
Iteration 0, 1, 2, … : all the PageRank score gets “trapped” in node m
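A quick numerical check of the trap behavior (same y/a/m graph, with m linking only to itself):

```python
import numpy as np

M_trap = np.array([[0.5, 0.5, 0.0],   # y <- y/2 + a/2
                   [0.5, 0.0, 0.0],   # a <- y/2
                   [0.0, 0.5, 1.0]])  # m <- a/2 + m (spider trap)
r = np.full(3, 1.0 / 3)
for _ in range(100):
    r = M_trap @ r
print(r)  # ~ [0, 0, 1]: node m has absorbed all the score
```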
Solution: Teleports!
◼ The Google solution for spider traps: At each
time step, the random surfer has two options
▪ With prob. β, follow a link at random
▪ With prob. 1-β, jump to some random page
▪ Common values for β are in the range 0.8 to 0.9
◼ Surfer will teleport out of spider trap
within a few time steps
[Figure: the y/a/m spider-trap graph, before and after adding teleport links from m]
Problem: Dead Ends
[Figure: y/a/m graph where m is a dead end]

      y    a    m
y     ½    ½    0
a     ½    0    0
m     0    ½    0

ry = ry/2 + ra/2
ra = ry/2
rm = ra/2
Iteration 0, 1, 2, … : here the PageRank “leaks out”, since column m is all zeros and M is not column stochastic
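The leak is easy to see numerically: with the dead-end column of zeros, the total rank mass shrinks at every step (a small sketch):

```python
import numpy as np

M_dead = np.array([[0.5, 0.5, 0.0],
                   [0.5, 0.0, 0.0],
                   [0.0, 0.5, 0.0]])  # column m is all zeros (dead end)
r = np.full(3, 1.0 / 3)
for _ in range(100):
    r = M_dead @ r
print(r, r.sum())  # total mass has leaked toward 0 instead of staying 1
```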
Solution: Always Teleport!
◼ Teleports: from a dead end, follow a random teleport link with probability 1.0
▪ Adjust the matrix accordingly
[Figure: y/a/m graph with dead end m, before and after adding teleports from m]

M before (m is a dead end):
      y    a    m
y     ½    ½    0
a     ½    0    0
m     0    ½    0

M after (teleport with probability 1 from m):
      y    a    m
y     ½    ½    ⅓
a     ½    0    ⅓
m     0    ½    ⅓
Why Teleports Solve the Problem?
◼ Theory of Markov chains: a random walk has a unique positive stationary distribution, reached from any starting distribution, if its transition matrix is stochastic, aperiodic, and irreducible
◼ Teleports make M stochastic (dead ends fixed), aperiodic (no fixed-period cycles), and irreducible (every page reachable from every other)
Solution: Random Teleports
◼ At each step, the random surfer follows an out-link with probability β and teleports to a random page with probability 1-β, which gives the PageRank equation:
rj = Σi→j β · ri/di + (1-β) · 1/N
(di … out-degree of node i; this formulation assumes M has no dead ends)
The Google Matrix
◼ The same model in matrix form, the Google Matrix A:
A = β · M + (1-β) · [1/N]N×N
([1/N]N×N … N-by-N matrix where all entries are 1/N)
◼ We then have r = A · r, and power iteration still works since A is stochastic, aperiodic, and irreducible
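A sketch that builds A for the spider-trap graph used in the β = 0.8 example below and power-iterates; materializing the dense [1/N] matrix is fine for a toy graph, though not for the real Web:

```python
import numpy as np

beta = 0.8
M = np.array([[0.5, 0.5, 0.0],   # y <- y/2 + a/2
              [0.5, 0.0, 0.0],   # a <- y/2
              [0.0, 0.5, 1.0]])  # m <- a/2 + m (spider trap)
N = M.shape[0]
A = beta * M + (1 - beta) * np.full((N, N), 1.0 / N)  # Google matrix

r = np.full(N, 1.0 / N)
for _ in range(100):
    r = A @ r
print(r)  # ~ (7/33, 5/33, 21/33) ≈ (0.212, 0.152, 0.636)
```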
Random Teleports (β = 0.8)
A = 0.8 · M + 0.2 · [1/N]N×N, with N = 3:

M (m is a spider trap):
      y    a    m
y     ½    ½    0
a     ½    0    0
m     0    ½    1

A = 0.8 · M + 0.2 · [⅓]3×3:
      y      a      m
y    7/15   7/15   1/15
a    7/15   1/15   1/15
m    1/15   7/15  13/15

[Figure: the resulting fully connected y/a/m graph with edge weights 7/15, 13/15, and 1/15]

Power iteration on A converges to r ≈ (7/33, 5/33, 21/33)
If the graph has no dead ends, the amount of leaked PageRank is exactly 1-β. But since we have dead ends, the leaked amount may be larger, and we have to account for it explicitly by computing S = Σj r′j after each iteration and re-inserting the missing mass 1-S uniformly.
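A minimal sparse-friendly sketch of this re-insertion trick (names are illustrative): propagate β·ri/di along real links only, measure S, and spread the missing mass 1-S evenly:

```python
def pagerank(out_links, beta=0.8, eps=1e-8):
    """out_links: dict page -> list of pages it links to (may be empty)."""
    pages = list(out_links)
    N = len(pages)
    r = {p: 1.0 / N for p in pages}
    while True:
        r_new = {p: 0.0 for p in pages}
        for i, targets in out_links.items():
            for j in targets:                    # real links only
                r_new[j] += beta * r[i] / len(targets)
        S = sum(r_new.values())                  # leaked mass is 1 - S
        for p in pages:
            r_new[p] += (1.0 - S) / N            # re-insert it uniformly
        if sum(abs(r_new[p] - r[p]) for p in pages) < eps:
            return r_new
        r = r_new

# Dead-end example: m has no out-links, yet the ranks still sum to 1.
print(pagerank({"y": ["y", "a"], "a": ["y", "m"], "m": []}))
```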
Some Problems with Page Rank
◼ Measures generic popularity of a page
▪ Biased against topic-specific authorities
▪ Solution: Topic-Specific PageRank (next)
◼ Uses a single measure of importance
▪ Other models of importance
▪ Solution: Hubs-and-Authorities
◼ Susceptible to link spam
▪ Artificial link topologies created in order to boost PageRank
▪ Solution: TrustRank