03 GNN1
CS224W is a course on Machine Learning with Graphs, focusing on mapping nodes to low-dimensional embeddings to capture similarities. It discusses the limitations of shallow embedding methods and introduces deep learning techniques using Graph Neural Networks (GNNs) for tasks such as node classification and link prediction. The course emphasizes the importance of permutation invariance in graph representation and the challenges posed by the complex structure of graph data.


CS224W: Machine Learning with Graphs

Jure Leskovec, Stanford University


http://cs224w.stanford.edu
ANNOUNCEMENTS
• Next Thursday (10/12): Colab 1 due

CS224W: Machine Learning with Graphs


Jure Leskovec, Stanford University
http://cs224w.stanford.edu
¡ Intuition: Map nodes to 𝑑-dimensional
embeddings such that similar nodes in the
graph are embedded close together

[Figure: f maps the input graph to 2D node embeddings]

How do we learn the mapping function 𝒇?


10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 3

Goal: $\text{similarity}(u, v) \approx \mathbf{z}_v^{\top} \mathbf{z}_u$ (the similarity function still needs to be defined!)

[Figure: the input network is mapped into a d-dimensional embedding space]
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 4
¡ Encoder: Maps each node to a low-dimensional vector:
    $\text{ENC}(v) = \mathbf{z}_v$, the d-dimensional embedding of node $v$ in the input graph
¡ Similarity function: Specifies how the relationships in vector space map to the relationships in the original network:
    $\text{similarity}(u, v) \approx \mathbf{z}_v^{\top} \mathbf{z}_u$
    Similarity of $u$ and $v$ in the original network ≈ dot product between node embeddings (the decoder)
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 5
Simplest encoding approach: Encoder is just an
embedding-lookup
$\mathbf{Z} \in \mathbb{R}^{d \times |V|}$: the embedding matrix. Each column is the embedding vector for a specific node, and the number of rows is the dimension/size of the embeddings.
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 6
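As a small illustration (not from the slides) of such a shallow encoder, the lookup table is literally a trainable matrix with one d-dimensional vector per node; the sketch below assumes PyTorch, and the sizes and node IDs are made up:

```python
import torch
import torch.nn as nn

num_nodes, d = 100, 16                 # hypothetical |V| and embedding dimension

# Shallow encoder: one trainable d-dimensional vector per node (O(|V|·d) parameters)
Z = nn.Embedding(num_nodes, d)

v = torch.tensor([3, 7])               # node IDs to look up
z_v = Z(v)                             # ENC(v): embedding lookup, shape (2, d)

# Decoder: dot-product similarity between two node embeddings
sim = (Z(torch.tensor(3)) * Z(torch.tensor(7))).sum()
```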
¡ Limitations of shallow embedding methods:
§ 𝑶(|𝑽|𝒅) parameters are needed:
§ No sharing of parameters between nodes
§ Every node has its own unique embedding
§ Inherently “transductive”:
§ Cannot generate embeddings for nodes that are not seen
during training
§ Do not incorporate node features:
§ Nodes in many graphs have features that we can and
should leverage

10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 7
¡ Today: We will discuss deep learning methods based on graph neural networks (GNNs):

    ENC$(v)$ = multiple layers of non-linear transformations based on the graph structure

¡ Note: All of these deep encoders can be combined with the node similarity functions defined in Lecture 3.
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 8

Output: Node embeddings. We can also embed subgraphs and entire graphs.
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 9
Tasks we will be able to solve:
¡ Node classification
§ Predict the type of a given node
¡ Link prediction
§ Predict whether two nodes are linked
¡ Community detection
§ Identify densely linked clusters of nodes
¡ Network similarity
§ How similar are two (sub)networks?

10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 10
[Figure: examples of images and text/speech data]

The modern deep learning toolbox is designed for simple sequences & grids.
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 11
But networks are far more complex!
§ Arbitrary size and complex topological structure (i.e.,
no spatial locality like grids)

[Figure: networks vs. text and images]

§ No fixed node ordering or reference point


§ Often dynamic and have multimodal features
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 12
1. Basics of deep learning

2. Deep learning for graphs

3. Graph Convolutional Networks

4. GNNs subsume CNNs

10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 13
¡ Loss function:
    $\min_{\Theta} \mathcal{L}(\boldsymbol{y}, f_{\Theta}(\boldsymbol{x}))$
¡ $f$ can be a simple linear layer, an MLP, or another neural network (e.g., a GNN later)
¡ Sample a minibatch of input $\boldsymbol{x}$
¡ Forward propagation: Compute $\mathcal{L}$ given $\boldsymbol{x}$
¡ Back-propagation: Obtain the gradient $\nabla_{\Theta} \mathcal{L}$ using the chain rule
¡ Use stochastic gradient descent (SGD) to optimize $\mathcal{L}$ over $\Theta$ for many iterations
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 14
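As an illustration of this recipe (my sketch, not from the slides), assuming PyTorch, a toy MLP standing in for $f_{\Theta}$, and invented data:

```python
import torch
import torch.nn as nn

# Toy data standing in for a real dataset (x: 10-dim inputs, y: binary labels)
X = torch.randn(256, 10)
y = torch.randint(0, 2, (256,))
loader = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(X, y), batch_size=32)

f = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))   # f_Theta: a simple MLP
loss_fn = nn.CrossEntropyLoss()                                      # L(y, f_Theta(x))
optimizer = torch.optim.SGD(f.parameters(), lr=0.01)

for xb, yb in loader:              # sample a minibatch of input x
    optimizer.zero_grad()
    loss = loss_fn(f(xb), yb)      # forward propagation: compute L given x
    loss.backward()                # back-propagation: gradient w.r.t. Theta via the chain rule
    optimizer.step()               # SGD update of Theta
```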
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
¡ Local network neighborhoods:
§ Describe aggregation strategies
§ Define computation graphs

¡ Stacking multiple layers:


§ Describe the model, parameters, training
§ How to fit the model?
§ Simple example for unsupervised and
supervised training

10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 16
¡ Assume we have a graph 𝑮:
§ 𝑉 is the vertex set
§ 𝑨 is the adjacency matrix (assume binary)
§ $\boldsymbol{X} \in \mathbb{R}^{|V| \times m}$ is a matrix of node features
§ 𝑣: a node in 𝑉; 𝑁 𝑣 : the set of neighbors of 𝑣.
§ Node features:
§ Social networks: User profile, User image
§ Biological networks: Gene expression profiles, gene
functional information
§ When there is no node feature in the graph dataset:
§ Indicator vectors (one-hot encoding of a node)
§ Vector of constant 1: [1, 1, …, 1]
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 17
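A small NumPy sketch of this setup (the toy graph and feature values are invented for illustration):

```python
import numpy as np

# Toy undirected graph on V = {0, 1, 2, 3} with edges (0,1), (0,2), (1,2), (2,3)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # binary adjacency matrix

m = 2
X = np.random.rand(A.shape[0], m)           # node feature matrix, X in R^{|V| x m}

# If the dataset has no node features, common fallbacks are:
X_onehot = np.eye(A.shape[0])               # indicator (one-hot) vectors
X_const = np.ones((A.shape[0], 1))          # constant-1 vectors

N_v = np.flatnonzero(A[2])                  # N(v): neighbors of node v = 2 -> [0, 1, 3]
```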
A naïve approach:
• Take the adjacency matrix and the feature matrix
• Concatenate them: [𝑨, 𝑿]
• Feed them into a deep (fully connected) neural net
• Done?

[Figure: a 5-node graph with its adjacency matrix and node features concatenated as input to an MLP]

¡ Issues with this idea:
§ 𝑂(|𝑉|) parameters (a huge number of parameters)
§ Not applicable to graphs of different sizes (no inductive learning possible)
§ Sensitive to node ordering
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 18
CNN on an image:

Goal is to generalize convolutions beyond simple lattices


Leverage node features/attributes (e.g., text, images)
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 19
But what if our data looks like this (a graph), or this, or this?

[Figure: graph-structured data as input to a network with hidden layers and ReLU activations]

§ There is no fixed notion of locality or a sliding window on the graph
§ The graph is permutation invariant
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 20
¡ Graph does not have a canonical order of the nodes!
¡ We can have many different order plans.

10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 21
¡ Graph does not have a canonical order of the nodes!
[Figure: Order plan 1 assigns labels A–F to the nodes, producing node features $X_1$ and adjacency matrix $A_1$]
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 22
¡ Graph does not have a canonical order of the nodes!
[Figure: Order plan 1 produces node features $X_1$ and adjacency matrix $A_1$; Order plan 2 labels the same nodes differently and produces node features $X_2$ and adjacency matrix $A_2$]
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 23
¡ Graph does not have a canonical order of the nodes!
[Figure: the same graph under Order plan 1 ($A_1$, $X_1$) and Order plan 2 ($A_2$, $X_2$)]

Graph and node representations should be the same for Order plan 1 and Order plan 2.
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 24
What does it mean for the "graph representation to be the same for two order plans"?

¡ Suppose we learn a function $f$ that maps a graph $G = (\boldsymbol{A}, \boldsymbol{X})$ to a vector in $\mathbb{R}^d$ ($\boldsymbol{A}$ is the adjacency matrix, $\boldsymbol{X}$ is the node feature matrix). In other words, $f$ maps a graph to a $d$-dimensional embedding. Then we require
    $f(\boldsymbol{A}_1, \boldsymbol{X}_1) = f(\boldsymbol{A}_2, \boldsymbol{X}_2)$

[Figure: for Order plan 1 ($\boldsymbol{A}_1$, $\boldsymbol{X}_1$) and Order plan 2 ($\boldsymbol{A}_2$, $\boldsymbol{X}_2$), the output of $f$ should be the same!]
12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 25
What does it mean for the "graph representation to be the same for two order plans"?
¡ Suppose we learn a function $f$ that maps a graph $G = (\boldsymbol{A}, \boldsymbol{X})$ to a vector in $\mathbb{R}^d$ ($\boldsymbol{A}$ is the adjacency matrix, $\boldsymbol{X}$ is the node feature matrix; each node has an $m$-dimensional feature vector).
¡ Then, if $f(\boldsymbol{A}_i, \boldsymbol{X}_i) = f(\boldsymbol{A}_j, \boldsymbol{X}_j)$ for any order plans $i$ and $j$, we formally say $f$ is a permutation-invariant function. Note that for a graph with $|V|$ nodes, there are $|V|!$ different order plans.
¡ Definition: For any graph function $f: \mathbb{R}^{|V| \times m} \times \mathbb{R}^{|V| \times |V|} \rightarrow \mathbb{R}^d$, $f$ is permutation-invariant if $f(A, X) = f(PAP^{\top}, PX)$ for any permutation matrix $P$. Here $d$ is the output embedding dimensionality of the graph $G = (A, X)$, and a permutation $P$ is a shuffle of the node order, e.g., (A, B, C) → (B, C, A).
12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 26
For node representations: We learn a function $f$ that maps the nodes of $G$ to a matrix in $\mathbb{R}^{|V| \times d}$. In other words, each node in $V$ is mapped to a $d$-dimensional embedding.

[Figure: under Order plan 1 ($\boldsymbol{A}_1$, $\boldsymbol{X}_1$) and Order plan 2 ($\boldsymbol{A}_2$, $\boldsymbol{X}_2$), $f(\boldsymbol{A}_1, \boldsymbol{X}_1)$ and $f(\boldsymbol{A}_2, \boldsymbol{X}_2)$ are matrices with one row per node A–F]
12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 27
For node representations: We learn a function $f$ that maps the nodes of $G$ to a matrix in $\mathbb{R}^{|V| \times d}$.

[Figure: the representation vector of the brown node is row A of $f(\boldsymbol{A}_1, \boldsymbol{X}_1)$ under Order plan 1 and row E of $f(\boldsymbol{A}_2, \boldsymbol{X}_2)$ under Order plan 2]

For two order plans, the vector of the node at the same position in the graph is the same!
12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 28
For node representations: We learn a function $f$ that maps the nodes of $G$ to a matrix in $\mathbb{R}^{|V| \times d}$.

[Figure: the representation vector of the green node is row C of $f(\boldsymbol{A}_1, \boldsymbol{X}_1)$ under Order plan 1 and row D of $f(\boldsymbol{A}_2, \boldsymbol{X}_2)$ under Order plan 2]

For two order plans, the vector of the node at the same position in the graph is the same!
12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 29
For node representations:
¡ Suppose we learn a function $f$ that maps a graph $G = (\boldsymbol{A}, \boldsymbol{X})$ to a matrix in $\mathbb{R}^{|V| \times d}$ (each node has an $m$-dimensional feature vector, and $f$ maps each node in $V$ to a $d$-dimensional embedding).
¡ If the output vector of a node at the same position in the graph remains unchanged for any order plan, we say $f$ is permutation equivariant.
¡ Definition: For any node function $f: \mathbb{R}^{|V| \times m} \times \mathbb{R}^{|V| \times |V|} \rightarrow \mathbb{R}^{|V| \times d}$, $f$ is permutation-equivariant if $P f(A, X) = f(PAP^{\top}, PX)$ for any permutation matrix $P$.
12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 30
¡ Permutation-invariant: permute the input, and the output stays the same (maps a graph to a vector):
    $f(A, X) = f(PAP^{\top}, PX)$
¡ Permutation-equivariant: permute the input, and the output also permutes accordingly (maps a graph to a matrix):
    $P f(A, X) = f(PAP^{\top}, PX)$
¡ Examples:
§ $f(A, X) = \mathbf{1}^{\top} X$: permutation-invariant
    Reason: $f(PAP^{\top}, PX) = \mathbf{1}^{\top} PX = \mathbf{1}^{\top} X = f(A, X)$
§ $f(A, X) = X$: permutation-equivariant
    Reason: $f(PAP^{\top}, PX) = PX = P f(A, X)$
§ $f(A, X) = AX$: permutation-equivariant
    Reason: $f(PAP^{\top}, PX) = PAP^{\top} PX = PAX = P f(A, X)$
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 31
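These identities are easy to sanity-check numerically; here is a small NumPy sketch with a random graph, random features, and a random permutation matrix (all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3
A = rng.integers(0, 2, (n, n)); A = np.triu(A, 1); A = A + A.T   # random symmetric adjacency
X = rng.random((n, m))                                            # random node features
P = np.eye(n)[rng.permutation(n)]                                 # random permutation matrix

# f(A, X) = 1^T X is permutation-invariant: summing rows ignores their order
assert np.allclose(np.ones(n) @ X, np.ones(n) @ (P @ X))

# f(A, X) = X is permutation-equivariant: f(PAP^T, PX) = PX = P f(A, X), trivially
assert np.allclose(P @ X, P @ X)

# f(A, X) = AX is permutation-equivariant: (PAP^T)(PX) = PAX = P(AX)
assert np.allclose(P @ (A @ X), (P @ A @ P.T) @ (P @ X))
```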
[Bronstein, ICLR 2021 keynote]

¡ Graph neural networks consist of multiple


permutation equivariant / invariant functions.

12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 32
Are other neural network architectures, e.g.,
MLPs, permutation invariant / equivariant?
¡ No.
Switching the order of the
input leads to different
outputs!

10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 33
Are other neural network architectures, e.g., MLPs, permutation invariant / equivariant?
¡ No. Switching the order of the input leads to different outputs!

[Figure: the naïve approach again: take the adjacency and feature matrices, concatenate them as [𝑨, 𝑿], and feed them into a deep (fully connected) neural net]

Problems: a huge number of parameters and no inductive learning possible. This explains why the naïve MLP approach fails for graphs!
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 34
¡ Are any neural network architectures, e.g., MLPs, permutation invariant / equivariant? No. This explains why the naïve MLP approach (concatenate [𝑨, 𝑿] and feed it into a fully connected neural net) is bad: it has a huge number of parameters and no inductive learning is possible.

Next: Design graph neural networks that are permutation invariant / equivariant by passing and aggregating information from neighbors!
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 35
1. Basics of deep learning

2. Deep learning for graphs

3. Graph Convolutional Networks

4. GNNs subsume CNNs

10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 36
[Kipf and Welling, ICLR 2017]

Idea: A node's neighborhood defines a computation graph.

[Figure: for a node 𝑖, (1) determine the node's computation graph, (2) propagate and transform information along it]

Learn how to propagate information across the graph to compute node features.
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 37
¡ Key idea: Generate node embeddings based
on local network neighborhoods

[Figure: the target node A in the input graph and the computation graph built from its local neighborhood]
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 38
¡ Intuition: Nodes aggregate information from
their neighbors using neural networks
[Figure: the computation graph for target node A; the boxes are neural networks that aggregate information from the neighbors]
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 39
¡ Intuition: Network neighborhood defines a
computation graph
Every node defines a computation
graph based on its neighborhood!

10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 40
¡ Model can be of arbitrary depth:
§ Nodes have embeddings at each layer
§ Layer-0 embedding of node 𝑣 is its input feature, 𝑥𝑣
§ Layer-𝑘 embedding gets information from nodes that
are 𝑘 hops away
[Figure: a two-layer computation graph for target node A; Layer-0 holds the input features $x_v$ of A's 2-hop neighborhood, Layer-1 aggregates them, and Layer-2 produces the embedding of A]
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 41
¡ Neighborhood aggregation: Key distinctions
are in how different approaches aggregate
information across the layers
[Figure: the computation graph for target node A with question marks inside the aggregation boxes; what is in the box?]
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 42
¡ Basic approach: Average information from
neighbors and apply a neural network

[Figure: in the computation graph for target node A, each box (1) averages the messages from the neighbors and then (2) applies a neural network]
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 43
¡ Basic approach: Average neighbor messages and apply a neural network.

Initial layer-0 embeddings are equal to the node features:
    $h_v^{(0)} = x_v$

For $k \in \{0, \dots, K-1\}$, the embedding of $v$ at layer $k+1$ is
    $h_v^{(k+1)} = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{(k)}}{|N(v)|} + B_k\, h_v^{(k)} \right)$
i.e., a non-linearity (e.g., ReLU) applied to the average of the neighbors' previous-layer embeddings plus a transformation of $v$'s own previous-layer embedding.

The embedding after $K$ layers of neighborhood aggregation ($K$ is the total number of layers) is
    $z_v = h_v^{(K)}$

Notice that the summation is a permutation-invariant pooling/aggregation.
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 44
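A minimal NumPy sketch of one such layer, written node by node exactly as the equation reads (the toy graph and the random weight matrices are placeholders, and σ is taken to be ReLU):

```python
import numpy as np

def gcn_layer(A, H, W, B):
    """One layer: h_v <- ReLU(W * mean of neighbor embeddings + B * h_v)."""
    H_next = np.zeros((A.shape[0], W.shape[0]))
    for v in range(A.shape[0]):
        neigh = np.flatnonzero(A[v])                        # N(v)
        agg = H[neigh].mean(axis=0) if len(neigh) else np.zeros(H.shape[1])
        H_next[v] = np.maximum(0.0, W @ agg + B @ H[v])     # sigma = ReLU
    return H_next

# Usage: K = 2 layers of aggregation starting from H^(0) = X
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)   # toy triangle graph
X = np.random.rand(3, 4)                                        # node features (m = 4)
d = 8
W0, B0 = np.random.rand(d, 4), np.random.rand(d, 4)             # layer-0 weights (random placeholders)
W1, B1 = np.random.rand(d, d), np.random.rand(d, d)             # layer-1 weights
Z = gcn_layer(A, gcn_layer(A, X, W0, B0), W1, B1)               # z_v = h_v^(K)
```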
What are the invariance and equivariance
properties for a GCN?
¡ Given a node, the GCN that computes its
embedding is permutation invariant
[Figure: the computation graph of the target node; the NN weights are shared, and the average of the neighbors' previous-layer embeddings is permutation invariant]
12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 45
¡ Considering all nodes in a graph, GCN computation
is permutation equivariant
[Figure: Order plan 1 yields embeddings $H_1$ and Order plan 2 yields embeddings $H_2$; permute the input and the output also permutes accordingly: permutation equivariant]
12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 46
¡ Considering all nodes in a graph, GCN computation is permutation equivariant.

Detailed reasoning:
1. The rows of the input node feature matrix and of the output embedding matrix are aligned.
2. We know that computing the embedding of a given node with a GCN is permutation invariant.
3. So, after a permutation, the location of a given node in the input node feature matrix changes, but the output embedding of that node stays the same (in the figure, the colors of the node feature and its embedding are matched).
This is permutation equivariance.

[Figure: permute the input, and the output embeddings ($H_1$ vs. $H_2$) permute accordingly]
12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 47
How do we train the GCN to generate embeddings $z_v$?

We need to define a loss function on the embeddings.
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 48
$h_v^{(0)} = x_v$

$h_v^{(k+1)} = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{(k)}}{|N(v)|} + B_k\, h_v^{(k)} \right), \forall k \in \{0, \dots, K-1\}$

$z_v = h_v^{(K)}$: the final node embedding

$W_k$ and $B_k$ are the trainable weight matrices (i.e., what we learn):
¡ $h_v^{(k)}$: the hidden representation of node $v$ at layer $k$
¡ $W_k$: weight matrix for neighborhood aggregation
¡ $B_k$: weight matrix for transforming the hidden vector of the node itself

We can feed these embeddings into any loss function and run SGD to train the weight parameters.
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 49
¡ Many aggregations can be performed efficiently by (sparse) matrix operations.
¡ Let $H^{(k)} = [h_1^{(k)} \dots h_{|V|}^{(k)}]^{\top}$ be the matrix of hidden embeddings.
¡ Then: $\sum_{u \in N(v)} h_u^{(k)} = A_{v,:} H^{(k)}$
¡ Let $D$ be the diagonal matrix where $D_{v,v} = \text{Deg}(v) = |N(v)|$.
§ The inverse of $D$, $D^{-1}$, is also diagonal: $D^{-1}_{v,v} = 1/|N(v)|$
¡ Therefore, the averaging step $\sum_{u \in N(v)} \frac{h_u^{(k)}}{|N(v)|}$ becomes, in matrix form,
    $H^{(k+1)} = D^{-1} A H^{(k)}$
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 50
¡ Re-writing the update function in matrix form:
    $H^{(k+1)} = \sigma\big(\tilde{A} H^{(k)} W_k^{\top} + H^{(k)} B_k^{\top}\big)$, where $\tilde{A} = D^{-1} A$ and $H^{(k)} = [h_1^{(k)} \dots h_{|V|}^{(k)}]^{\top}$
§ The first term ($\tilde{A} H^{(k)} W_k^{\top}$) is the neighborhood aggregation; the second term ($H^{(k)} B_k^{\top}$) is the self transformation.
¡ In practice, this implies that efficient sparse matrix multiplication can be used ($\tilde{A}$ is sparse).
¡ Note: not all GNNs can be expressed in such a simple matrix form when the aggregation function is complex.
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 51
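A minimal sketch of this matrix-form update with SciPy sparse matrices (the toy graph and random weights are placeholders, and σ is taken to be ReLU):

```python
import numpy as np
import scipy.sparse as sp

def gcn_matrix_layer(A_tilde, H, W, B):
    # H^(k+1) = sigma(A_tilde H W^T + H B^T)
    return np.maximum(0.0, A_tilde @ H @ W.T + H @ B.T)

A = sp.csr_matrix(np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float))  # sparse adjacency
deg = np.asarray(A.sum(axis=1)).ravel()
D_inv = sp.diags(1.0 / deg)                                                  # D^{-1}
A_tilde = D_inv @ A                                                          # A_tilde = D^{-1} A (sparse)

H = np.random.rand(3, 4)                                                     # H^(0) = X
W, B = np.random.rand(8, 4), np.random.rand(8, 4)                            # trainable weights (placeholders)
H1 = gcn_matrix_layer(A_tilde, H, W, B)                                      # H^(1), shape (3, 8)
```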
¡ The node embedding $z_v$ is a function of the input graph.
¡ Supervised setting: we want to minimize the loss $\mathcal{L}$:
    $\min_{\Theta} \mathcal{L}(\boldsymbol{y}, f_{\Theta}(z_v))$
§ $\boldsymbol{y}$: node label
§ $\mathcal{L}$ could be the L2 loss if $\boldsymbol{y}$ is a real number, or cross entropy if $\boldsymbol{y}$ is categorical
¡ Unsupervised setting:
§ No node labels are available
§ Use the graph structure as the supervision!

10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 52
¡ One possible idea: "similar" nodes should have similar embeddings:
    $\min_{\Theta} \mathcal{L} = \sum_{z_u, z_v} \text{CE}\big(y_{u,v}, \text{DEC}(z_u, z_v)\big)$
§ where $y_{u,v} = 1$ when nodes $u$ and $v$ are similar
§ $z_u = f_{\Theta}(u)$ and $\text{DEC}(\cdot, \cdot)$ is the dot product
§ CE is the cross-entropy loss:
    $\text{CE}(\boldsymbol{y}, f(\boldsymbol{x})) = -\sum_{i} y_i \log f_{\Theta}(x)_i$
    where $y_i$ and $f_{\Theta}(x)_i$ are the actual and predicted values of the $i$-th class.
§ Intuition: the lower the loss, the closer the prediction is to the one-hot label.
¡ Node similarity can be anything from Lecture 2, e.g., a loss based on:
§ Random walks (node2vec, DeepWalk, struc2vec)
§ Matrix factorization
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 53
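A hedged PyTorch sketch of this unsupervised objective with a dot-product decoder and a binary cross-entropy over sampled node pairs (the embeddings, node pairs, and labels are placeholders standing in for the GNN output and a random-walk-based similarity):

```python
import torch
import torch.nn.functional as F

Z = torch.randn(100, 16, requires_grad=True)       # node embeddings z_u = f_Theta(u) (placeholder)
u = torch.randint(0, 100, (256,))                   # sampled node pairs (u, v)
v = torch.randint(0, 100, (256,))
y = torch.randint(0, 2, (256,)).float()             # y_{u,v} = 1 if u and v are "similar" (placeholder labels)

dec = (Z[u] * Z[v]).sum(dim=1)                      # DEC(z_u, z_v): dot product
loss = F.binary_cross_entropy_with_logits(dec, y)   # cross-entropy between y_{u,v} and the decoded score
loss.backward()                                     # gradients flow back into the embeddings / GNN parameters
```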
Directly train the model for a supervised task (e.g., node classification).

Example: is a given drug safe or toxic? (e.g., in a drug-drug interaction network)
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 54
Directly train the model for a supervised task (e.g., node classification).
¡ Use the cross-entropy loss (Slide 53):
    $\mathcal{L} = -\sum_{v \in V} \Big[ y_v \log\big(\sigma(z_v^{\top} \theta)\big) + (1 - y_v) \log\big(1 - \sigma(z_v^{\top} \theta)\big) \Big]$
where $z_v$ is the encoder output (node embedding), $\theta$ are the classification weights, and $y_v$ is the node class label (e.g., safe or toxic drug).
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 55
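A minimal PyTorch sketch of this supervised node-classification loss (the embeddings `z`, classification weights `theta`, and labels `y` are placeholders standing in for the GNN output and ground truth):

```python
import torch
import torch.nn.functional as F

z = torch.randn(50, 16)                              # node embeddings z_v from the GNN encoder (placeholder)
theta = torch.randn(16, requires_grad=True)          # classification weights
y = torch.randint(0, 2, (50,)).float()               # binary node labels y_v (e.g., safe = 0 / toxic = 1)

logits = z @ theta                                   # z_v^T theta
loss = F.binary_cross_entropy_with_logits(logits, y) # the cross-entropy loss above
loss.backward()                                      # train theta (and the GNN parameters) with SGD
```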
(1) Define a neighborhood aggregation function.

(2) Define a loss function on the embeddings $z_v$.
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 56
(3) Train on a set of nodes, i.e.,
a batch of compute graphs

10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 57
(4) Generate embeddings
for nodes as needed
Even for nodes we never
trained on!

10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 58
¡ The same aggregation parameters are shared
for all nodes:
§ The number of model parameters is sublinear in
|𝑉| and we can generalize to unseen nodes!

[Figure: the compute graphs for node A and node B share the same parameters $W_k$ and $B_k$ at each layer]

10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 59
[Figure: train on one graph, then generalize to a new graph]

Inductive node embedding: generalize to entirely unseen graphs. E.g., train on a protein interaction graph from model organism A and generate embeddings on newly collected data about organism B.
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 60
[Figure: train with a snapshot of the graph; when a new node arrives, generate the embedding for the new node]

¡ Many application settings constantly encounter previously unseen nodes:
§ E.g., Reddit, YouTube, Google Scholar
¡ We need to generate new embeddings "on the fly"
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 61
1. Basics of deep learning

2. Deep learning for graphs

3. Graph Convolutional Networks

4. GNNs subsume CNNs

10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 62
¡ How do GNNs compare to prominent
architectures such as Convolutional Neural
Nets?

10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 63
Convolutional neural networks (on grids)

[Figure: a single CNN layer with a 3x3 filter applied to an image (animation by Vincent Dumoulin)]

CNN formulation: $h_v^{(l+1)} = \sigma\Big(\sum_{u \in N(v) \cup \{v\}} W_l^{u}\, h_u^{(l)}\Big), \forall l \in \{0, \dots, L-1\}$

where $N(v)$ represents the 8 neighboring pixels of $v$.
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 64
Convolutional neural networks (on grids)

[Figure: a single CNN layer with a 3x3 filter on an image vs. a GNN layer on a graph]

• GNN formulation: $h_v^{(l+1)} = \sigma\Big(\mathbf{W}_l \sum_{u \in N(v)} \frac{h_u^{(l)}}{|N(v)|} + B_l\, h_v^{(l)}\Big), \forall l \in \{0, \dots, L-1\}$
• CNN formulation (previous slide): $h_v^{(l+1)} = \sigma\Big(\sum_{u \in N(v) \cup \{v\}} W_l^{u}\, h_u^{(l)}\Big), \forall l \in \{0, \dots, L-1\}$, which we can rewrite as $h_v^{(l+1)} = \sigma\Big(\sum_{u \in N(v)} W_l^{u}\, h_u^{(l)} + B_l\, h_v^{(l)}\Big), \forall l \in \{0, \dots, L-1\}$
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 65
Convolutional neural networks (on grids)

[Figure: a single CNN layer with a 3x3 filter on an image vs. a GNN layer on a graph]

GNN formulation: $h_v^{(l+1)} = \sigma\Big(\mathbf{W}_l \sum_{u \in N(v)} \frac{h_u^{(l)}}{|N(v)|} + B_l\, h_v^{(l)}\Big), \forall l \in \{0, \dots, L-1\}$

CNN formulation: $h_v^{(l+1)} = \sigma\Big(\sum_{u \in N(v)} \mathbf{W}_l^{u}\, h_u^{(l)} + B_l\, h_v^{(l)}\Big), \forall l \in \{0, \dots, L-1\}$

Key difference: In the CNN we can learn a different $W_l^{u}$ for each "neighbor" $u$ of pixel $v$ on the image, because we can pick an order for the 9 neighbors using their relative position to the center pixel: {(-1,-1), (-1,0), (-1,1), …, (1,1)}.
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 66
Convolutional neural networks (on grids)

[Figure: a single CNN layer with a 3x3 filter on an image vs. a GNN layer on a graph]

• CNN can be seen as a special GNN with a fixed neighbor size and ordering:
    • The size of the filter is pre-defined for a CNN.
    • The advantage of a GNN is that it processes arbitrary graphs, where each node can have a different degree.
• CNN is not permutation invariant/equivariant:
    • Switching the order of the pixels leads to different outputs.
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 68
[Attention is all you need. Vaswani et al., NeurIPS 2017]

The Transformer is one of the most popular architectures, achieving great performance on many sequence modeling tasks.
Key component: self-attention.
¡ Every token/word attends to all the other tokens/words via matrix calculation.

[Figure: the tokens of the sentence "I am a Stanford student" all attending to each other]

12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 69
A general definition of attention:
Given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.

Each token/word has a value vector and a query vector. The value vector can be seen as the representation of the token/word. We use the query vector to calculate the attention scores (the weights in the weighted sum).

12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 70
A nice blog post on this: https://towardsdatascience.com/transformers-are-graph-neural-networks-bca9f75412aa

Key component: self-attention. Every token/word attends to all the other tokens/words via matrix calculation.

A Transformer layer can be seen as a special GNN that runs on a fully-connected "word" graph! Since each word attends to all the other words, the computation graph of a Transformer layer is identical to that of a GNN on the fully-connected "word" graph.

[Figure: the sentence "I am a Stanford student" as text and as a complete graph over its words]
10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 71
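To make the connection concrete (my sketch, not from the slides): a single self-attention head is a weighted aggregation over all tokens, i.e., message passing on the complete "word" graph with attention scores acting as soft edge weights. The shapes and random weights below are invented for illustration:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """One attention head: every token aggregates the values of all tokens,
    weighted by a softmax over query-key scores, i.e., aggregation on a complete graph."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])                          # attention scores = soft "edge weights"
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ V                                              # weighted aggregation over all "neighbors"

X = np.random.rand(5, 16)                                           # 5 tokens ("I am a Stanford student"), 16-dim features
Wq, Wk, Wv = (np.random.rand(16, 8) for _ in range(3))              # random projection weights (placeholders)
H = self_attention(X, Wq, Wk, Wv)                                   # shape (5, 8): one "GNN layer" on the complete word graph
```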
¡ In this lecture, we introduced
§ Idea for Deep Learning for Graphs
§ Multiple layers of embedding transformation
§ At every layer, use the embedding at previous layer as
the input
§ Aggregation of neighbors and self-embeddings
§ Graph Convolutional Network
§ Mean aggregation; can be expressed in matrix form
§ GNN is a general architecture
§ CNN can be viewed as a special GNN

10/7/21 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 72
