07-theory
[Figure: the general GNN pipeline. Input graph → Graph Neural Network (GNN Layer 1, GNN Layer 2; each layer performs (1) Message and (2) Aggregation, connected by (3) Layer connectivity) → Node embeddings → Prediction head → Predictions vs. Labels → Loss function → Evaluation metrics]
Implementation resources:
▪ PyG provides core modules for this pipeline
▪ GraphGym further implements the full pipeline to facilitate GNN design
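As a concrete illustration of this pipeline, here is a minimal PyG (PyTorch Geometric) sketch for node classification. The dataset (Planetoid/Cora), the two GCNConv layers, the linear prediction head, and the hyperparameters are illustrative assumptions, not part of the slides.

# Minimal sketch of the GNN pipeline above, assuming PyTorch Geometric is installed.
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

dataset = Planetoid(root="/tmp/Cora", name="Cora")  # assumed example dataset
data = dataset[0]

class GNN(torch.nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid_dim)          # GNN layer 1
        self.conv2 = GCNConv(hid_dim, hid_dim)         # GNN layer 2
        self.head = torch.nn.Linear(hid_dim, out_dim)  # prediction head

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        h = self.conv2(h, edge_index)                  # node embeddings
        return self.head(h)                            # predictions

model = GNN(dataset.num_features, 64, dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(100):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    # loss function: compare predictions against labels on the training split
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

# evaluation metric: accuracy on the test split
pred = model(data.x, data.edge_index).argmax(dim=-1)
acc = (pred[data.test_mask] == data.y[data.test_mask]).float().mean()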
How powerful are GNNs?
Many GNN models have been proposed (e.g., GCN, GAT, GraphSAGE, the GNN design space).
What is the expressive power (the ability to distinguish different graph structures) of these GNN models?
How do we design a maximally expressive GNN model?
We focus on message passing GNNs:
▪ (1) Message: each node computes a message
$\mathbf{m}_u^{(l)} = \text{MSG}^{(l)}\left(\mathbf{h}_u^{(l-1)}\right),\; u \in \{N(v) \cup v\}$
▪ (2) Aggregation: aggregate messages from neighbors
$\mathbf{h}_v^{(l)} = \text{AGG}^{(l)}\left(\left\{\mathbf{m}_u^{(l)},\, u \in N(v)\right\},\, \mathbf{m}_v^{(l)}\right)$
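A minimal sketch of one such layer in plain PyTorch. The concrete choices here (a linear map as MSG, a sum as AGG) are illustrative assumptions; concrete GNNs such as GCN, GraphSAGE, GAT, and GIN differ exactly in these two choices.

# One message-passing layer following the two steps above (sketch, not from the slides).
import torch

def message_passing_layer(h, neighbors, W_msg):
    """h: (num_nodes, d) embeddings from layer l-1
    neighbors: dict mapping each node v to the list of its neighbors N(v)
    W_msg: (d, d) weight matrix used as the MSG function (assumed choice)."""
    m = h @ W_msg                          # (1) Message: m_u = MSG(h_u) for every node u
    h_new = torch.zeros_like(h)
    for v, nbrs in neighbors.items():
        agg = m[nbrs].sum(dim=0)           # (2) Aggregation over messages from N(v)
        h_new[v] = torch.relu(agg + m[v])  # also include node v's own message
    return h_new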
Many GNN models have been proposed:
▪ GCN, GraphSAGE, GAT, the GNN design space, etc.
We use the same/different node colors to represent nodes with the same/different features.
▪ For example, the graph below assumes all the nodes
share the same feature.
[Figure: example graph with nodes 1–5, all drawn in the same color]
We specifically consider local neighborhood
structures around each node in a graph.
▪ Example: Nodes 1 and 4 both have the same node degree of 2. However, their local neighborhood structures still differ, because their neighbors have different node degrees.
In each layer, a GNN aggregates neighboring node
embeddings.
A GNN generates node embeddings through a
computational graph defined by the neighborhood.
▪ Ex: Node 1’s computational graph (2-layer GNN)
[Figure: node 1's 2-layer computational graph. Root: node 1; level 1: its neighbors 2 and 5; level 2 (leaves): nodes 1, 5 and 1, 2, 4]
Ex: Nodes 1 and 2’s computational graphs.
[Figure: the 2-layer computational graphs rooted at nodes 1 and 2, unfolded from the input graph]
Ex: Nodes 1 and 2’s computational graphs.
But a GNN only sees node features (not node IDs):
[Figure: the same two computational graphs with node IDs replaced by identical node colors]
A GNN will generate the same embedding for nodes 1 and 2 because:
▪ Their computational graphs are the same.
▪ Their node features (colors) are identical.
Note: a GNN does not care about node IDs; it just aggregates the feature vectors of different nodes.
[Figure: the input graph and the 2-layer computational graphs rooted at each of nodes 1–5]
Computational graphs are identical to rooted
subtree structures around each node.
Rooted subtree structures are defined by recursively unfolding neighboring nodes from the root node.
[Figure: the input graph and the rooted subtrees (computational graphs) for nodes 1–5]
A GNN's node embeddings capture rooted subtree structures.
The most expressive GNN maps different rooted subtrees to different node embeddings (represented by different colors).
[Figure: the rooted subtrees for nodes 1–5, each mapped to an embedding]
A function $f: X \rightarrow Y$ is injective if it maps different elements to different outputs.
Intuition: $f$ retains all the information about the input.
[Figure: an injective map $f$ from $X = \{1, 2, 3\}$ into $Y = \{A, B, C, D\}$: distinct inputs map to distinct outputs]
The most expressive GNN should map subtrees to node embeddings injectively.
[Figure: the rooted subtrees for nodes 1–5 mapped injectively into the embedding space $\mathbb{R}^d$]
Key observation: Subtrees of the same depth
can be recursively characterized from the leaf
nodes to the root nodes.
[Figure: the depth-2 subtrees rooted at nodes 1 and 4, characterized from the leaves to the root. Input features at the leaves are uniform; one level up, node 1's children have 2 and 3 neighbors, while node 4's children have 1 and 3 neighbors.]
If each step of GNN’s aggregation can fully
retain the neighboring information, the
generated node embeddings can distinguish
different rooted subtrees.
[Figure: the same two subtrees. If each aggregation step fully retains the neighboring information (the neighbor counts at every level), the resulting embeddings distinguish the two subtrees.]
In other words, the most expressive GNN would use an injective neighbor aggregation function at each step.
▪ It maps different neighbors to different embeddings.
[Figure: the same two subtrees, distinguished by applying injective neighbor aggregation at every level]
Summary so far
▪ To generate a node embedding, GNNs use a
computational graph corresponding to a subtree
rooted around each node.
▪ Using injective neighbor aggregation, the GNN can distinguish different subtrees.
[Figure: the input graph, node 1's computational graph, and the equivalent rooted subtree]
Observation: Neighbor aggregation can be
abstracted as a function over a multi-set (a
set with repeating elements).
[Figure: examples of equivalent multi-sets of node colors]
GCN (mean-pool) [Kipf & Welling ICLR 2017]
▪ Take element-wise mean, followed by linear
function and ReLU activation, i.e., max(0, 𝑥).
▪ Theorem [Xu et al. ICLR 2019]
▪ GCN’s aggregation function cannot distinguish different
multi-sets with the same color proportion.
[Figure: failure case in which two different multi-sets have the same proportion of colors]
For simplicity, we assume node features (colors) are represented by one-hot encodings.
▪ Example: if there are two distinct colors, they are encoded as (1, 0) and (0, 1).
GCN (mean-pool) [Kipf & Welling ICLR 2017]
▪ Failure case illustration: element-wise mean-pooling yields the same output for both multi-sets.
[Figure: the two multi-sets from the failure case average to identical vectors]
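A tiny numeric check of this failure case; the two one-hot color vectors below are assumed for illustration.

# Two different multi-sets with the same color proportions give the same mean.
import numpy as np

yellow, blue = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # assumed one-hot colors

multiset_a = np.stack([yellow, blue])                        # {yellow, blue}
multiset_b = np.stack([yellow, yellow, blue, blue])          # {yellow, yellow, blue, blue}

print(multiset_a.mean(axis=0))  # [0.5 0.5]
print(multiset_b.mean(axis=0))  # [0.5 0.5] -> mean-pool cannot tell them apart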
GraphSAGE (max-pool) [Hamilton et al. NeurIPS 2017]
▪ Apply an MLP, then take element-wise max.
▪ Theorem [Xu et al. ICLR 2019]
▪ GraphSAGE’s aggregation function cannot distinguish
different multi-sets with the same set of distinct colors.
[Figure: failure case in which two different multi-sets contain the same set of distinct colors]
GraphSAGE (max-pool) [Hamilton et al. NeurIPS 2017]
▪ Failure case illustration
[Figure: failure case illustration. For simplicity, assume one-hot encodings after the MLP; element-wise max-pooling then yields the same output for both multi-sets.]
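A tiny numeric check of this failure case as well; the one-hot vectors (standing in for features after the MLP) are assumed for illustration.

# Two different multi-sets with the same set of distinct colors give the same max.
import numpy as np

yellow, blue = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # assumed one-hot features after the MLP

multiset_a = np.stack([yellow, blue])                        # {yellow, blue}
multiset_b = np.stack([yellow, yellow, blue])                # {yellow, yellow, blue}

print(multiset_a.max(axis=0))  # [1. 1.]
print(multiset_b.max(axis=0))  # [1. 1.] -> max-pool cannot tell them apart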
We analyzed the expressive power of GNNs.
Main takeaways:
▪ The expressive power of GNNs can be characterized by that of their neighbor aggregation function.
▪ Neighbor aggregation is a function over multi-sets (sets with repeating elements).
▪ GCN and GraphSAGE's aggregation functions fail to distinguish some basic multi-sets; hence they are not injective.
▪ Therefore, GCN and GraphSAGE are not maximally
powerful GNNs.
Our goal: Design maximally powerful GNNs
in the class of message-passing GNNs.
This can be achieved by designing an injective neighbor aggregation function over multi-sets.
Theorem [Xu et al. ICLR 2019]
Any injective multi-set function can be expressed as
$\Phi\left(\sum_{x \in S} f(x)\right)$,
where $\Phi$ and $f$ are non-linear functions and $S$ is a multi-set.
Proof intuition [Xu et al. ICLR 2019]:
$f$ produces one-hot encodings of colors. Summation of the one-hot encodings retains all the information about the input multi-set.
▪ Example: summing the one-hot encodings of the elements gives the count of each color, which uniquely determines the multi-set.
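A quick numeric check of this intuition; the two one-hot colors are assumed for illustration.

# Summing one-hot encodings yields the count of each color.
import numpy as np

yellow, blue = np.array([1, 0]), np.array([0, 1])   # assumed one-hot colors

print(yellow + yellow + blue)   # [2 1] -> 2 yellow nodes, 1 blue node
print(yellow + blue + blue)     # [1 2] -> a different multi-set gives a different sum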
How do we model $\Phi$ and $f$ in $\Phi\left(\sum_{x \in S} f(x)\right)$?
We use a Multi-Layer Perceptron (MLP).
Theorem: Universal Approximation Theorem [Hornik et al., 1989]
A one-hidden-layer MLP, $\text{MLP}(x) = W_2\,\sigma(W_1 x)$, with a sufficiently large hidden dimensionality and a suitable non-linearity $\sigma(\cdot)$ can approximate any continuous function to arbitrary accuracy.
We have arrived at a neural network that can model any injective multi-set function:
$\text{MLP}_{\Phi}\left(\sum_{x \in S} \text{MLP}_{f}(x)\right)$
Graph Isomorphism Network (GIN) [Xu et al. ICLR 2019]
▪ Apply an MLP to each element, take the element-wise sum, then apply another MLP.
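A minimal PyTorch sketch of this aggregation over a multi-set of neighbor features: an inner MLP applied per element, an element-wise sum, then an outer MLP. The hidden sizes are illustrative assumptions.

# GIN-style injective multi-set aggregation: Phi(sum_x f(x)) with MLPs for f and Phi.
import torch
import torch.nn as nn

class GINAggregation(nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.mlp_f = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                   nn.Linear(hid_dim, hid_dim))
        self.mlp_phi = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU(),
                                     nn.Linear(hid_dim, out_dim))

    def forward(self, neighbor_feats):
        # neighbor_feats: (num_neighbors, in_dim), i.e., the multi-set S
        summed = self.mlp_f(neighbor_feats).sum(dim=0)  # injective sum over S
        return self.mlp_phi(summed)                     # Phi(sum_x f(x))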
Recall: the color refinement algorithm in the WL kernel.
Given: a graph $G$ with a set of nodes $V$.
▪ Assign an initial color $c^{(0)}(v)$ to each node $v$.
▪ Iteratively refine node colors by
$c^{(k+1)}(v) = \text{HASH}\left(c^{(k)}(v), \left\{c^{(k)}(u)\right\}_{u \in N(v)}\right)$,
where HASH maps different inputs to different colors.
▪ After $K$ steps of color refinement, $c^{(K)}(v)$ summarizes the structure of the $K$-hop neighborhood.
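A minimal Python sketch of color refinement, assuming graphs are given as adjacency lists and using Python's tuple hash as a stand-in for the HASH function (both are illustrative assumptions, and hash collisions are ignored here).

# WL color refinement as described above (sketch).
def wl_color_refinement(adj, num_iters):
    """adj: dict mapping each node v to the list of its neighbors N(v)."""
    colors = {v: 1 for v in adj}                        # initial color c^(0)(v)
    for _ in range(num_iters):
        new_colors = {}
        for v in adj:
            neighbor_colors = sorted(colors[u] for u in adj[v])
            # HASH(c^(k)(v), {c^(k)(u)}): different inputs map to different colors
            new_colors[v] = hash((colors[v], tuple(neighbor_colors)))
        colors = new_colors
    return colors

# The WL test deems two graphs indistinguishable if their final color multi-sets match.
g1 = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}       # triangle with a pendant node
g2 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}             # path on 4 nodes
print(sorted(wl_color_refinement(g1, 3).values()) ==
      sorted(wl_color_refinement(g2, 3).values()))      # False: different structures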
3/7/2023 Jure Les kovec, Stanford CS224W: Ma chine Learning with Graphs, http://cs224w.stanford.edu 45
Example of color refinement given two graphs
▪ Assign initial colors
[Figure: two example graphs in which every node is assigned the initial color 1]
Example of color refinement given two graphs
▪ Aggregated colors:
[Figure: after one round, each node's aggregated color is the pair (own color, neighbor colors), e.g., (1, 11) or (1, 111)]
Example of color refinement given two graphs
The process continues until a stable coloring is reached.
Two graphs are considered isomorphic by the WL test if they end up with the same multi-set of colors.
[Figure: the final stable colorings of the two graphs, with refined colors relabeled as new integers]
GIN uses a neural network to model the
injective HASH function.
$c^{(k+1)}(v) = \text{HASH}\left(c^{(k)}(v), \left\{c^{(k)}(u)\right\}_{u \in N(v)}\right)$
can be modeled as
$\text{MLP}_{\Phi}\left((1+\epsilon)\cdot \text{MLP}_{f}\left(c^{(k)}(v)\right) + \sum_{u \in N(v)} \text{MLP}_{f}\left(c^{(k)}(u)\right)\right)$,
where $\epsilon$ is a learnable scalar.
If the input feature $c^{(0)}(v)$ is represented as a one-hot encoding, direct summation is injective (the sum simply counts each color).
We then only need $\Phi$ to ensure the injectivity; this outer MLP can also provide a "one-hot"-like input feature for the next layer.
[Figure: the GIN update applied to the root node's features and the neighboring nodes' features]
GIN’s node embedding updates
Given: a graph $G$ with a set of nodes $V$.
▪ Assign an initial vector $c^{(0)}(v)$ to each node $v$.
▪ Iteratively update node vectors by
$c^{(k+1)}(v) = \text{GINConv}\left(c^{(k)}(v), \left\{c^{(k)}(u)\right\}_{u \in N(v)}\right)$,
where GINConv is a differentiable color HASH function that maps different inputs to different embeddings.
▪ After $K$ steps of GIN iterations, $c^{(K)}(v)$ summarizes the structure of the $K$-hop neighborhood.
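PyG provides this update as torch_geometric.nn.GINConv, which wraps a user-supplied MLP. A minimal usage sketch follows; the MLP sizes and the toy graph are illustrative assumptions.

# One GINConv update on a small random graph (sketch, assumes PyTorch Geometric).
import torch
import torch.nn as nn
from torch_geometric.nn import GINConv

mlp = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32))
conv = GINConv(mlp, train_eps=True)    # eps weights the root node's own feature

x = torch.randn(5, 16)                               # 5 nodes, 16-dim features
edge_index = torch.tensor([[0, 1, 1, 2, 2, 4],       # source nodes
                           [1, 0, 2, 1, 4, 2]])      # target nodes
h = conv(x, edge_index)                              # (5, 32) updated embeddings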
GIN can be understood as a differentiable neural version of the WL graph kernel:

                   Update target                        Update function
WL Graph Kernel    Node colors (one-hot)                HASH
GIN                Node embeddings (low-dim vectors)    GINConv
GNN frameworks: DGL and GraphNets implement a variety of GNN architectures, built on top of auto-differentiation frameworks.
Tutorials and overviews:
▪ Relational inductive biases and graph networks (Battaglia et al., 2018)
▪ Representation learning on graphs: Methods and applications (Hamilton et al., 2017)
Attention-based neighborhood aggregation:
▪ Graph attention networks (Hoshen, 2017; Velickovic et al., 2018; Liu et al., 2018)
Embedding entire graphs:
▪ Graph neural nets with edge embeddings (Battaglia et al., 2016; Gilmer et al., 2017)
▪ Embedding entire graphs (Duvenaud et al., 2015; Dai et al., 2016; Li et al., 2018) and graph pooling
(Ying et al., 2018, Zhang et al., 2018)
▪ Graph generation and relational inference (You et al., 2018; Kipf et al., 2018)
▪ How powerful are graph neural networks? (Xu et al., 2019)
Embedding nodes:
▪ Varying neighborhood: Jumping knowledge networks (Xu et al., 2018), GeniePath (Liu et al., 2018)
▪ Position-aware GNN (You et al. 2019)