03 GNN1
[Figure: an encoder f(·) maps the input graph to 2D node embeddings; Z is the matrix of embeddings. We still need to define f and the dimension/size of the embeddings.]
¡ Limitations of shallow embedding methods:
§ 𝑶(|𝑽|𝒅) parameters are needed:
§ No sharing of parameters between nodes
§ Every node has its own unique embedding
§ Inherently “transductive”:
§ Cannot generate embeddings for nodes that are not seen
during training
§ Do not incorporate node features:
§ Nodes in many graphs have features that we can and
should leverage
¡ Today: We will now discuss deep learning methods based on graph neural networks (GNNs):
ENC(v) = multiple layers of non-linear transformations based on graph structure
[Figure: Networks vs. Images and Text/Speech.]
¡ Loss function: min_Θ ℒ(𝒚, f_Θ(𝒙))
¡ f can be a simple linear layer, an MLP, or another neural network (e.g., a GNN later)
¡ Sample a minibatch of input 𝒙
¡ Forward propagation: compute ℒ given 𝒙
¡ Back-propagation: obtain the gradient ∇_Θ ℒ using the chain rule
¡ Use stochastic gradient descent (SGD) to optimize ℒ over Θ for many iterations (a minimal sketch follows below)
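The loop below is a minimal PyTorch sketch of this recipe, not the lecture's code: the toy dataset, the small MLP standing in for f_Θ, and the mean-squared-error loss are placeholder assumptions; any differentiable model and loss fit the same pattern.

```python
import torch
import torch.nn as nn

# Placeholder data and model: a toy regression set and a small MLP standing in for f_Theta.
X = torch.randn(256, 16)                       # 256 examples, 16 features
y = torch.randn(256, 1)                        # targets
f_theta = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()                         # stands in for L(y, f_Theta(x))
opt = torch.optim.SGD(f_theta.parameters(), lr=1e-2)

for step in range(100):
    idx = torch.randint(0, X.shape[0], (32,))  # sample a minibatch of input x
    loss = loss_fn(f_theta(X[idx]), y[idx])    # forward propagation: compute L given x
    opt.zero_grad()
    loss.backward()                            # back-propagation: gradient of L w.r.t. Theta
    opt.step()                                 # SGD step: update Theta
```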
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
¡ Local network neighborhoods:
§ Describe aggregation strategies
§ Define computation graphs
¡ Assume we have a graph 𝑮:
§ 𝑉 is the vertex set
§ 𝑨 is the adjacency matrix (assume binary)
§ 𝑿 ∈ ℝ^{|V|×m} is a matrix of node features
§ 𝑣: a node in 𝑉; 𝑁(𝑣): the set of neighbors of 𝑣
§ Node features:
§ Social networks: user profile, user image
§ Biological networks: gene expression profiles, gene functional information
§ When there is no node feature in the graph dataset:
§ Indicator vectors (one-hot encoding of a node)
§ Vector of constant 1: [1, 1, …, 1]
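As a concrete illustration (not from the slides), a minimal NumPy sketch that builds A, X, and N(v) for a small hypothetical graph, including the two fallbacks above for the no-feature case:

```python
import numpy as np

num_nodes = 5
edges = [(0, 1), (0, 2), (0, 3), (1, 4), (3, 4)]   # hypothetical undirected edge list

A = np.zeros((num_nodes, num_nodes))               # binary adjacency matrix
for u, v in edges:
    A[u, v] = A[v, u] = 1

# If the dataset provides node features, X is a |V| x m matrix. Otherwise, two common fallbacks:
X_onehot = np.eye(num_nodes)                       # indicator vectors (one-hot per node)
X_const = np.ones((num_nodes, 1))                  # constant-1 feature vector [1, 1, ..., 1]

def neighbors(v):
    """N(v): indices of the neighbors of node v."""
    return np.flatnonzero(A[v])
```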
A naïve approach
• Take adjacency matrix and feature matrix
• Done?

Adjacency matrix joined with the node feature columns for the example graph on nodes A–E:

      A  B  C  D  E | Feat
   A  0  1  1  1  0 | 1  0
   B  1  0  0  1  1 | 0  0
   C  1  0  0  1  0 | 0  1
   D  1  1  1  0  1 | 1  1
   E  0  1  0  1  0 | 1  0

¡ Issues with this idea:
[Figure: a few candidate fully connected networks with ReLU layers for processing the joined matrix.]
§ There is no fixed notion of locality or a sliding window on the graph
§ The graph is permutation invariant: there is no canonical ordering of the nodes
¡ Graph does not have a canonical order of the nodes!
¡ We can have many different order plans.
¡ Graph does not have a canonical order of the nodes!
[Figure: the same six-node graph (nodes A–F) under Order plan 1 and Order plan 2; each order plan yields its own node feature matrix (X₁ vs. X₂) and adjacency matrix (A₁ vs. A₂).]
What does it mean for the graph representation to be "the same for two order plans"?
¡ Suppose we learn a function f that maps a graph G = (A, X) to a vector in ℝ^d, where A is the adjacency matrix and X is the node feature matrix.
[Figure: the same graph under the two order plans; for a graph-level representation we require f(A₁, X₁) = f(A₂, X₂), i.e., permutation invariance.]
For node representation: we learn a function f that maps the nodes of G to a matrix in ℝ^{|V|×d}.
[Figure: Order plan 1 (A₁, X₁) and Order plan 2 (A₂, X₂) of the same graph. f(A₁, X₁) and f(A₂, X₂) each contain one representation vector per node. For the two order plans, the representation vector of a given node is the same, just located at a different row: e.g., the brown node is labeled A under one plan and E under the other, and the green node C under one and D under the other. This is permutation equivariance.]
¡ Examples (a quick numerical check follows below):
§ f(A, X) = 1ᵀX : permutation-invariant
§ Reason: f(PAPᵀ, PX) = 1ᵀPX = 1ᵀX = f(A, X)
§ f(A, X) = X : permutation-equivariant
§ Reason: f(PAPᵀ, PX) = PX = P·f(A, X)
§ f(A, X) = AX : permutation-equivariant
§ Reason: f(PAPᵀ, PX) = PAPᵀPX = PAX = P·f(A, X)
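A small NumPy check of these three identities on a random graph and a random permutation matrix (an illustration, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1) + np.triu(A, 1).T            # symmetric binary adjacency, zero diagonal
X = rng.normal(size=(n, m))                    # node feature matrix
P = np.eye(n)[rng.permutation(n)]              # random permutation matrix

f_inv = lambda A, X: np.ones(n) @ X            # f(A, X) = 1^T X
f_eq1 = lambda A, X: X                         # f(A, X) = X
f_eq2 = lambda A, X: A @ X                     # f(A, X) = A X

A_p, X_p = P @ A @ P.T, P @ X                  # the same graph under another order plan

assert np.allclose(f_inv(A_p, X_p), f_inv(A, X))      # invariant: output unchanged
assert np.allclose(f_eq1(A_p, X_p), P @ f_eq1(A, X))  # equivariant: output permutes with P
assert np.allclose(f_eq2(A_p, X_p), P @ f_eq2(A, X))  # equivariant: output permutes with P
```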
[Bronstein, ICLR 2021 keynote]
Are other neural network architectures, e.g.,
MLPs, permutation invariant / equivariant?
¡ No.
Switching the order of the
input leads to different
outputs!
A naïve approach, revisited:
• Take adjacency matrix and feature matrix
• Concatenate them: [A, X]
• Feed them into a deep (fully connected) neural net
• Done?
Are other neural network architectures, e.g., MLPs, permutation invariant / equivariant?
¡ No.
Problems: the fully connected network applied to [A, X] is therefore sensitive to the arbitrary node ordering; it is neither permutation invariant nor equivariant.
Next: Design graph neural networks that are permutation invariant / equivariant by passing and aggregating information.
[Kipf and Welling, ICLR 2017]
[Figure: TARGET NODE A in the INPUT GRAPH; its neighbors B, C, D and their neighbors are unrolled into a tree-shaped computation graph rooted at A.]
¡ Intuition: Nodes aggregate information from their neighbors using neural networks
[Figure: the computation graph for target node A; each box that combines neighbor information is a neural network.]
¡ Intuition: Network neighborhood defines a
computation graph
Every node defines a computation
graph based on its neighborhood!
¡ Model can be of arbitrary depth:
§ Nodes have embeddings at each layer
§ Layer-0 embedding of node 𝑣 is its input feature, 𝑥𝑣
§ Layer-𝑘 embedding gets information from nodes that
are 𝑘 hops away
[Figure: the computation graph for target node A has Layer-2 (node A itself), Layer-1 (its neighbors B, C, D), and Layer-0 (the input features x_A, x_B, x_C, x_E, x_F at the leaves).]
¡ Neighborhood aggregation: Key distinctions
are in how different approaches aggregate
information across the layers
[Figure: the '?' boxes in the computation graph for target node A mark the aggregation operators that each approach must define.]
¡ Basic approach: Average information from
neighbors and apply a neural network
[Figure: computation graph for target node A: (1) average messages from neighbors, then (2) apply a neural network.]
¡ Basic approach: Average neighbor messages and apply a neural network
Initial 0-th layer embeddings are equal to the node features:
    h_v^{(0)} = x_v
Update rule, where h_v^{(k)} is the embedding of v at layer k and K is the total number of layers:
    h_v^{(k+1)} = σ( W_k · Σ_{u∈N(v)} h_u^{(k)} / |N(v)| + B_k · h_v^{(k)} ),  ∀k ∈ {0, …, K−1}
Embedding after K layers of neighborhood aggregation:
    z_v = h_v^{(K)}
Here the sum over N(v) is the average of the neighbors' previous-layer embeddings, and σ is a non-linearity (e.g., ReLU). Notice that the summation is a permutation-invariant pooling/aggregation.
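A minimal NumPy sketch of this layer (an assumption-laden illustration rather than the lecture's reference code): it averages the neighbors' embeddings, adds the self-transformation, and applies a ReLU, following the update rule above; the toy graph and random weights are placeholders.

```python
import numpy as np

def gcn_layer(A, H, W, B):
    """One mean-aggregation layer:
    h_v^{(k+1)} = ReLU( W @ mean_{u in N(v)} h_u^{(k)} + B @ h_v^{(k)} ).
    A: (n, n) binary adjacency; H: (n, d_in); W, B: (d_out, d_in)."""
    deg = A.sum(axis=1, keepdims=True)                 # |N(v)| for each node
    neigh_mean = (A @ H) / np.maximum(deg, 1)          # average of neighbors' embeddings
    return np.maximum(0, neigh_mean @ W.T + H @ B.T)   # sigma = ReLU

# Usage sketch with random parameters: K = 2 layers, then z_v = h_v^{(K)}.
rng = np.random.default_rng(0)
n, d = 5, 4
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)           # a small toy undirected graph
H = rng.normal(size=(n, d))                            # h^{(0)} = x (node features)
for k in range(2):
    W_k, B_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    H = gcn_layer(A, H, W_k, B_k)
Z = H                                                  # final embeddings z_v
```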
What are the invariance and equivariance properties of a GCN?
¡ Given a node, the GCN that computes its embedding is permutation invariant.
[Figure: the computation graph of the target node; the NN weights are shared, and the average of the neighbors' previous-layer embeddings is permutation invariant.]
¡ Considering all nodes in a graph, GCN computation is permutation equivariant.
[Figure: Order plan 1 gives embeddings H₁ and Order plan 2 gives embeddings H₂; permuting the input permutes the output accordingly, i.e., permutation equivariance.]
¡ Considering all nodes in a graph, GCN computation is permutation equivariant. Detailed reasoning:
1. The rows of the input node feature matrix and of the output embedding matrix are aligned.
2. We know that computing the embedding of a given node with a GCN is permutation invariant.
3. So, after a permutation, the location of a given node in the input node feature matrix changes, but the output embedding of that node stays the same (in the figure, the colors of node features and embeddings are matched).
This is permutation equivariance.
How do we train the GCN to generate embeddings z_v?
¡ Re-writing the update function in matrix form:
    H^{(k+1)} = σ( Ã H^{(k)} W_k^T + H^{(k)} B_k^T ),  where Ã = D^{−1}A (D is the diagonal degree matrix) and H^{(k)} = [h_1^{(k)} … h_{|V|}^{(k)}]^T
§ First term (red in the slide): neighborhood aggregation
§ Second term (blue in the slide): self transformation
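A minimal NumPy sketch of this matrix form (an illustration under the notation above, not the lecture's code): row v of ÃH^{(k)} is exactly the mean of v's neighbors' embeddings, so one matrix product performs the aggregation for all nodes at once.

```python
import numpy as np

def gcn_layer_matrix(A, H, W, B):
    """Matrix form of the update: H^{(k+1)} = sigma(A_tilde H^{(k)} W^T + H^{(k)} B^T),
    with A_tilde = D^{-1} A and D the diagonal degree matrix."""
    D_inv = np.diag(1.0 / np.maximum(A.sum(axis=1), 1))   # D^{-1}, guarding isolated nodes
    A_tilde = D_inv @ A                                   # row v holds 1/|N(v)| on v's neighbors
    return np.maximum(0, A_tilde @ H @ W.T + H @ B.T)     # sigma = ReLU
```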
¡ One possible idea: "similar" nodes have similar embeddings:
    min_Θ ℒ = Σ_{z_u, z_v} CE( y_{u,v}, DEC(z_u, z_v) )
§ where y_{u,v} = 1 when nodes u and v are similar
§ z_u = f_Θ(u) and DEC(·,·) is the dot product
§ CE is the cross-entropy loss (a small sketch follows after this list):
§ CE(y, f(x)) = − Σ_{i=1}^{C} y_i · log f_Θ(x)_i
§ y_i and f_Θ(x)_i are the actual and predicted values for the i-th class
§ Intuition: the lower the loss, the closer the prediction is to the one-hot target
¡ Node similarity can be anything from Lecture 2, e.g., a loss based on:
§ Random walks (node2vec, DeepWalk, struc2vec)
§ Matrix factorization
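A minimal PyTorch sketch of this unsupervised objective (the helper name and pair tensors are hypothetical; the positive pairs would come from a similarity notion such as the random walks above): a dot-product decoder scored with a binary cross-entropy loss.

```python
import torch
import torch.nn.functional as F

def similarity_loss(z, pos_pairs, neg_pairs):
    """z: (num_nodes, d) embeddings from the GNN.
    pos_pairs / neg_pairs: (num_pairs, 2) long tensors of node indices with y_{u,v} = 1 / 0."""
    pairs = torch.cat([pos_pairs, neg_pairs], dim=0)
    labels = torch.cat([torch.ones(len(pos_pairs)), torch.zeros(len(neg_pairs))])
    scores = (z[pairs[:, 0]] * z[pairs[:, 1]]).sum(dim=-1)     # DEC(z_u, z_v) = dot product
    return F.binary_cross_entropy_with_logits(scores, labels)  # CE(y_{u,v}, DEC(z_u, z_v))

# Usage sketch: z = gnn(A, X); loss = similarity_loss(z, pos_pairs, neg_pairs); loss.backward()
```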
Directly train the model for a supervised task (e.g., node classification).
[Figure: e.g., in a drug-drug interaction network, classify whether each drug node is safe or toxic.]
Directly train the model for a supervised task (e.g., node classification):
¡ Use the cross-entropy loss defined above.
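For concreteness, a minimal sketch of the supervised node-classification loss on top of the final embeddings; all tensors here are hypothetical placeholders, and in practice z would come from the GCN above.

```python
import torch
import torch.nn.functional as F

num_nodes, d, num_classes = 100, 16, 2             # e.g., 2 classes: safe vs. toxic
z = torch.randn(num_nodes, d, requires_grad=True)  # placeholder for the GNN output embeddings
labels = torch.randint(0, num_classes, (num_nodes,))
train_mask = torch.rand(num_nodes) < 0.5           # which nodes carry training labels

classifier = torch.nn.Linear(d, num_classes)       # prediction head on top of the embeddings
logits = classifier(z)
loss = F.cross_entropy(logits[train_mask], labels[train_mask])  # CE over the labeled nodes
loss.backward()
```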
(1) Define a neighborhood aggregation function.
(4) Generate embeddings
for nodes as needed
Even for nodes we never
trained on!
¡ The same aggregation parameters are shared
for all nodes:
§ The number of model parameters is sublinear in
|𝑉| and we can generalize to unseen nodes!
[Figure: the compute graph for node A and the compute graph for node B use the same shared parameters W_k and B_k.]
[Figure: inductive capability: train on a graph snapshot; when a new node arrives, generate its embedding on the fly.]
¡ How do GNNs compare to prominent
architectures such as Convolutional Neural
Nets?
Convolutional neural networks (on grids)
Single CNN layer with a 3×3 filter:
[Figure: a CNN layer slides a 3×3 filter of weights over the image to produce the output. (Animation by Vincent Dumoulin)]
CNN formulation: h_v^{(l+1)} = σ( Σ_{u ∈ N(v) ∪ {v}} W_l^u h_u^{(l)} ),  ∀l ∈ {0, …, L−1}
Convolutional neural networks (on grids)
Single CNN layer with a 3×3 filter:
[Figure: Image vs. Graph. (Animation by Vincent Dumoulin; slide adapted from Thomas Kipf, "End-to-end learning on graphs with GCNs".)]
• GNN formulation: h_v^{(l+1)} = σ( W_l · Σ_{u ∈ N(v)} h_u^{(l)} / |N(v)| + B_l h_v^{(l)} ),  ∀l ∈ {0, …, L−1}
• CNN formulation (previous slide): h_v^{(l+1)} = σ( Σ_{u ∈ N(v) ∪ {v}} W_l^u h_u^{(l)} ),  ∀l ∈ {0, …, L−1},
  which we can rewrite as: h_v^{(l+1)} = σ( Σ_{u ∈ N(v)} W_l^u h_u^{(l)} + B_l h_v^{(l)} ),  ∀l ∈ {0, …, L−1}
Key difference: the CNN learns a different weight matrix W_l^u for each neighbor position u, which is possible because the 3×3 filter fixes the number and ordering of neighbors on the grid; the GNN shares a single W_l across all neighbors and therefore handles arbitrary graphs. In this sense a CNN can be viewed as a special GNN, but a CNN itself is not permutation invariant / equivariant: switching the order of the pixels leads to different outputs.
A general definition of attention:
Given a set of vector values, and a vector query, attention is a technique to
compute a weighted sum of the values, dependent on the query.
Each token/word has a value vector and a query vector. The value
vector can be seen as the representation of the token/word. We use
the query vector to calculate the attention score (weights in the
weighted sum).
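A minimal NumPy sketch of this definition (an illustration, not the slide's notation): each token contributes a value vector, and the query determines the softmax-normalized weights of the weighted sum. Full Transformer attention additionally uses separate key vectors, omitted here to keep the sketch close to the definition above.

```python
import numpy as np

def attention(query, values):
    """Weighted sum of the value vectors, with weights that depend on the query
    (dot-product scores, softmax-normalized)."""
    scores = values @ query                  # attention score per token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax: the weights sum to 1
    return weights @ values                  # weighted sum of the values

# Usage: 5 tokens (e.g., "I am a Stanford student"), each with an 8-dim value vector.
rng = np.random.default_rng(0)
values = rng.normal(size=(5, 8))             # value vector = representation of each token
query = rng.normal(size=8)                   # query vector of the attending token
out = attention(query, values)
```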
A nice blog post on this: https://towardsdatascience.com/transformers-are-graph-neural-networks-bca9f75412aa
Key component: self-attention. Every token/word attends to all the other tokens/words via matrix calculation.
A Transformer layer can be seen as a special GNN that runs on a fully-connected "word" graph!
Since each word attends to all the other words, the computation graph of a Transformer layer is identical to that of a GNN on the fully-connected "word" graph.
[Figure: the text "I am a Stanford student" vs. the complete graph over its words.]
¡ In this lecture, we introduced
§ Idea for Deep Learning for Graphs
§ Multiple layers of embedding transformation
§ At every layer, use the embedding from the previous layer as the input
§ Aggregation of neighbors and self-embeddings
§ Graph Convolutional Network
§ Mean aggregation; can be expressed in matrix form
§ GNN is a general architecture
§ CNN can be viewed as a special GNN