07 Hetero
material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://cs224w.Stanford.edu
11/14/23 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 2
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
¡ So far we have only handled graphs with one edge type
¡ How to handle graphs with multiple node or edge types (a.k.a. heterogeneous graphs)?
¡ Goal: Learning with heterogeneous graphs
§ Relational GCNs
§ Heterogeneous Graph Transformer
§ Design space for heterogeneous GNNs
2 types of nodes:
¡ Node type A: Paper nodes
¡ Node type B: Author nodes
2 types of edges:
¡ Edge type A: Cite
¡ Edge type B: Like
A graph could have multiple types of nodes and
edges! 2 types of nodes + 2 types of edges.
A relation type is a (node type, edge type, node type) triple: with 2 node types and 2 edge types, there are 2 × 2 × 2 = 8 possible relation types!
¡ Example: E-Commerce Graph
§ Node types: User, Item, Query, Location, ...
§ Edge types: Purchase, Visit, Guide, Search, …
§ Different node types can have different feature spaces!
¡ Example: Academic Graph
§ Node types: Author, Paper, Venue, Field, ...
§ Edge types: Publish, Cite, …
§ Benchmark dataset: Microsoft Academic Graph
¡ Observation: We can also treat types of nodes and edges as features
§ Example: Add a one-hot type indicator to nodes and edges
§ Append feature [1, 0] to each “author node”; append feature [0, 1] to each “paper node”
§ Similarly, we can assign type-indicator features to edges of different types
§ Then, a heterogeneous graph reduces to a standard graph
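As a toy NumPy sketch of this reduction (the 3-dim/5-dim author/paper feature sizes and the zero-padding to a common width are illustrative assumptions, not from the slides):

```python
import numpy as np

# Toy heterogeneous graph: 2 author nodes (3-dim features) and
# 2 paper nodes (5-dim features).
author_x = np.array([[1., 2., 3.], [4., 5., 6.]])
paper_x = np.array([[1., 0., 1., 0., 1.], [0., 1., 0., 1., 0.]])

# Pad author features to the common width, then append one-hot type
# indicators: [1, 0] for authors, [0, 1] for papers.
author_pad = np.pad(author_x, ((0, 0), (0, 2)))              # (2, 5)
authors = np.hstack([author_pad, np.tile([1., 0.], (2, 1))])  # (2, 7)
papers = np.hstack([paper_x, np.tile([0., 1.], (2, 1))])      # (2, 7)

# One homogeneous feature matrix: the node type is now just a feature.
x = np.vstack([authors, papers])                              # (4, 7)
```

The same trick applies to edges: append an edge-type indicator to each edge feature vector.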
¡ When do we need a heterogeneous graph?
§ Case 1: Different node/edge types have different shapes of features
§ e.g., an “author node” has a 4-dim feature, while a “paper node” has a 5-dim feature
§ Case 2: We know different relation types represent different types of interactions
§ e.g., (English, translate, French) and (English, translate, Chinese) require different models
¡ Ultimately, a heterogeneous graph is a more expressive graph representation
§ It captures different types of interactions between entities
¡ But it also comes with costs
§ More expensive computation and storage
§ More complex implementation
¡ There are many ways to convert a heterogeneous graph to a standard graph (that is, a homogeneous graph)
Kipf and Welling. Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017
¡ Recall the GCN layer:

$\mathbf{h}_v^{(l+1)} = \sigma\left(\sum_{u \in N(v)} \mathbf{W}^{(l)} \frac{\mathbf{h}_u^{(l)}}{|N(v)|}\right)$

§ (1) Message: each neighbor $u$ computes $\mathbf{W}^{(l)} \frac{\mathbf{h}_u^{(l)}}{|N(v)|}$
§ (2) Aggregation: sum the messages from all neighbors $u \in N(v)$, then apply the activation $\sigma$
¡ We will extend GCN to handle heterogeneous
graphs with multiple edge/relation types
¡ We start with a directed graph with one relation
§ How do we run GCN and update the representation of
the target node A on this graph?
[Figure: input directed graph with target node A and neighbors B–F]
¡ What if the graph has multiple relation types?
[Figure: the input graph with edges labeled by relation types r₁, r₂, r₃]
¡ Use different neural network weights for different relation types.
[Figure: the same graph, with weights 𝐖_{r₁} for r₁, 𝐖_{r₂} for r₂, and 𝐖_{r₃} for r₃]
[Figure: the relational computation graph for target node A — each neighbor's message passes through the neural network of its relation type before aggregation]
¡ Introduce a set of neural networks for each
relation type!
¡ Relational GCN (RGCN):

$\mathbf{h}_v^{(l+1)} = \sigma\left(\sum_{r \in R} \sum_{u \in N_v^r} \frac{1}{c_{v,r}} \mathbf{W}_r^{(l)} \mathbf{h}_u^{(l)} + \mathbf{W}_0^{(l)} \mathbf{h}_v^{(l)}\right)$
¡ How to write this as Message + Aggregation?
¡ Message:
§ Each neighbor of a given relation, normalized by the node degree of the relation $c_{v,r} = |N_v^r|$:
$\mathbf{m}_{u,r}^{(l)} = \frac{1}{c_{v,r}} \mathbf{W}_r^{(l)} \mathbf{h}_u^{(l)}$
§ Self-loop:
$\mathbf{m}_v^{(l)} = \mathbf{W}_0^{(l)} \mathbf{h}_v^{(l)}$
¡ Aggregation:
§ Sum over messages from neighbors and self-loop, then apply activation:
$\mathbf{h}_v^{(l+1)} = \sigma\left(\mathrm{Sum}\left(\left\{\mathbf{m}_{u,r}^{(l)}, u \in N(v)\right\} \cup \left\{\mathbf{m}_v^{(l)}\right\}\right)\right)$
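A minimal NumPy sketch of one RGCN layer, assuming ReLU stands in for the activation $\sigma$ (the relation names and dimensions are illustrative):

```python
import numpy as np

def rgcn_layer(h, edges_by_rel, W_rel, W_self):
    """One RGCN layer: relation-specific transforms + degree normalization.

    h            : (num_nodes, d_in) input embeddings
    edges_by_rel : {relation: list of directed (u, v) edges}
    W_rel        : {relation: (d_out, d_in) weight matrix W_r}
    W_self       : (d_out, d_in) self-loop weight matrix W_0
    """
    n = h.shape[0]
    out = h @ W_self.T                    # self-loop message for every node
    for r, edge_list in edges_by_rel.items():
        # c_{v,r} = |N_v^r|: in-degree of v under relation r
        deg = np.zeros(n)
        for u, v in edge_list:
            deg[v] += 1
        for u, v in edge_list:
            out[v] += (W_rel[r] @ h[u]) / deg[v]   # normalized message
    return np.maximum(out, 0)             # ReLU as the nonlinearity

# Toy graph: 3 nodes, two relation types.
rng = np.random.default_rng(0)
h = rng.normal(size=(3, 4))
edges = {"cite": [(1, 0), (2, 0)], "like": [(2, 1)]}
W_rel = {r: rng.normal(size=(4, 4)) for r in edges}
h_next = rgcn_layer(h, edges, W_rel, rng.normal(size=(4, 4)))
```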
" # $
¡ Each relation has 𝐿 matrices: 𝐖! , 𝐖! ⋯ 𝐖!
%
¡ The size of each 𝐖! is 𝑑 (%'") ×𝑑 (%) 𝑑 is the hidden (")
dimension in layer 𝑙
𝐖+ =
Limitation: only nearby
neurons/dimensions
can interact through 𝑊
[Figure: the input graph with relation types r₁, r₂, r₃]
¡ Training:
§ 1. Use RGCN to score the training supervision edge (𝑬, 𝒓𝟑, 𝑨)
§ 2. Create a negative edge by perturbing the supervision edge: corrupt the tail of (𝑬, 𝒓𝟑, 𝑨), e.g., (𝑬, 𝒓𝟑, 𝑩) or (𝑬, 𝒓𝟑, 𝑫)
(𝜎 denotes the sigmoid function)
[Figure: the input graph; (𝑬, 𝒓𝟑, 𝑨) is the training supervision edge]
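As a sketch of these two steps, assuming a DistMult-style edge scorer (the slides do not fix the scoring function, and the embeddings and relation vector below are random placeholders):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(h_head, h_tail, w_rel):
    # Hypothetical DistMult-style scorer: sum_i h_head[i] * w_rel[i] * h_tail[i]
    return float(np.sum(h_head * w_rel * h_tail))

rng = np.random.default_rng(1)
h = {n: rng.normal(size=8) for n in "ABDE"}  # final-layer RGCN embeddings
w_r3 = rng.normal(size=8)                    # relation-specific weights for r3

pos = score(h["E"], h["A"], w_r3)                 # supervision edge (E, r3, A)
negs = [score(h["E"], h[t], w_r3) for t in "BD"]  # corrupted tails

# Binary cross-entropy: push sigma(pos) toward 1, sigma(neg) toward 0.
loss = -np.log(sigmoid(pos)) - sum(np.log(1 - sigmoid(n)) for n in negs)
```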
¡ Evaluation:
§ Validation time as an example; the same applies at test time
§ Evaluate how well the model can predict the validation edges with their relation types
§ Let's predict the validation edge (𝑬, 𝒓𝟑, 𝑫)
§ Intuition: the score of (𝑬, 𝒓𝟑, 𝑫) should be higher than the score of every (𝑬, 𝒓𝟑, 𝒗) where (𝑬, 𝒓𝟑, 𝒗) is NOT in the training message edges or training supervision edges, e.g., (𝑬, 𝒓𝟑, 𝑩)
[Figure: validation edge (𝑬, 𝒓𝟑, 𝑫) shown as a dashed edge; training message & supervision edges are all existing (solid) edges]
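This ranking intuition can be sketched as follows (the scores are made up for illustration, and `filtered_rank` is a hypothetical helper name):

```python
def filtered_rank(pos_score, neg_scores):
    """Rank of the positive edge among its (filtered) negatives.

    neg_scores should exclude corrupted tails that form real training
    edges. Rank 1 means the positive edge scored highest.
    """
    return 1 + sum(s >= pos_score for s in neg_scores)

# Hypothetical scores: (E, r3, D) vs. corrupted tails (E, r3, v).
pos = 2.5
negs = [0.3, 1.1, 3.0]           # one negative outranks the positive
rank = filtered_rank(pos, negs)  # rank 2
hits_at_1 = int(rank <= 1)       # Hits@1 contribution: 0
rr = 1.0 / rank                  # reciprocal rank: 0.5
```

Averaging `hits_at_1` and `rr` over all validation edges gives the usual Hits@k and MRR metrics.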
¡ Benchmark dataset
§ ogbn-mag from Microsoft Academic Graph (MAG)
¡ Four (4) types of entities
§ Papers: 736k nodes
§ Authors: 1.1m nodes
§ Institutions: 9k nodes
§ Fields of study: 60k nodes
Wang et al. Microsoft academic graph: When experts are not enough. Quantitative Science Studies 2020.
¡ Benchmark dataset
§ ogbn-mag from Microsoft Academic Graph (MAG)
¡ Four (4) directed relations
§ An author is "affiliated with" an institution
§ An author "writes" a paper
§ A paper "cites" a paper
§ A paper "has a topic of" a field of study
¡ Prediction task
§ Each paper has a 128-dimensional word2vec feature vector
§ Given the content, references, authors, and author affiliations from ogbn-mag, predict the venue of each paper
§ 349-class classification problem, since 349 venues are considered
¡ Time-based dataset splitting
§ Training set: papers published before 2018
§ Test set: papers published after 2018
¡ Benchmark results:
[Figure: ogbn-mag benchmark accuracy — R-GCN compared against the state of the art (SOTA)]
¡ Relational GCN, a graph neural network for
heterogeneous graphs
Hu et al. Heterogeneous Graph Transformer. WWW 2020.
" Q-Linear!"#$%
'()*+[-]
Write Cite
/)+[0+] %&& 1--[01, -]
!!"#$
!! K-Linear!"#$%
Paper
'()*+[-] …
…
!" %&&
!'("#$ 1--[02, -]
K-Linear&'()*% /)+[0,]
Author
11/14/23 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 46
¡ Recall the general GNN layer design: (1) Message, (2) Aggregation, (3) Layer connectivity
[Figure: GNN Layer 1 and GNN Layer 2 unrolled into the computation graph of target node A over the input graph]
¡ (1) Heterogeneous message computation
§ Message function: $\mathbf{m}_u^{(l)} = \mathrm{MSG}_r^{(l)}\left(\mathbf{h}_u^{(l-1)}\right)$
§ Observation: A node could receive multiple types of messages; the number of message types equals the number of relation types
§ Idea: Create a different message function for each relation type $r = (u, e, v)$, i.e., the relation between node $u$ that sends the message, edge type $e$, and node $v$ that receives the message
§ Example: a linear layer $\mathbf{m}_u^{(l)} = \mathbf{W}_r^{(l)} \mathbf{h}_u^{(l-1)}$
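A minimal sketch of relation-specific linear message functions (relation names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out = 4, 3

# One weight matrix W_r per relation type.
W = {r: rng.normal(size=(d_out, d_in)) for r in ("writes", "cites")}

def message(h_u, rel):
    # m_u^(l) = W_r h_u^(l-1): the message depends on the relation type.
    return W[rel] @ h_u

h_u = rng.normal(size=d_in)
m_writes = message(h_u, "writes")
m_cites = message(h_u, "cites")
# Same sender, different relation type -> different message.
```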
¡ (2) Aggregation
§ Intuition: Each node aggregates the messages from node $v$'s neighbors
$\mathbf{h}_v^{(l)} = \mathrm{AGG}^{(l)}\left(\left\{\mathbf{m}_u^{(l)}, u \in N(v)\right\}\right)$
§ Example: $\mathrm{Sum}(\cdot)$, $\mathrm{Mean}(\cdot)$ or $\mathrm{Max}(\cdot)$ aggregator
§ $\mathbf{h}_v^{(l)} = \mathrm{Sum}\left(\left\{\mathbf{m}_u^{(l)}, u \in N(v)\right\}\right)$
[Figure: computation graph for target node A — (1) Message, (2) Aggregation]
¡ (2) Heterogeneous Aggregation
§ Observation: Each node could receive multiple types of messages from its neighbors, and multiple neighbors may belong to each message type
§ Idea: We can define a 2-stage message passing
$\mathbf{h}_v^{(l)} = \mathrm{AGG}_{all}^{(l)}\left(\mathrm{AGG}_r^{(l)}\left(\left\{\mathbf{m}_u^{(l)}, u \in N_r(v)\right\}\right)\right)$
§ Given all the messages sent to a node:
§ First, aggregate the messages that belong to each edge type with $\mathrm{AGG}_r^{(l)}$
§ Then, aggregate across the edge types with $\mathrm{AGG}_{all}^{(l)}$
§ Example: $\mathbf{h}_v^{(l)} = \mathrm{Concat}\left(\mathrm{Sum}\left(\left\{\mathbf{m}_u^{(l)}, u \in N_r(v)\right\}\right)\right)$
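The 2-stage Sum-then-Concat example can be sketched as (the relation names and message values are illustrative):

```python
import numpy as np

def hetero_aggregate(messages_by_rel, rel_order):
    """Two-stage aggregation: Sum within each relation, Concat across relations.

    messages_by_rel : {relation: list of message vectors sent to node v}
    rel_order       : fixed relation ordering, so the Concat layout is stable
    """
    per_rel = [np.sum(messages_by_rel[r], axis=0) for r in rel_order]  # AGG_r
    return np.concatenate(per_rel)                                     # AGG_all

msgs = {
    "writes": [np.array([1., 0.]), np.array([0., 1.])],  # two author neighbors
    "cites": [np.array([2., 2.])],                       # one paper neighbor
}
h_v = hetero_aggregate(msgs, ["writes", "cites"])
# Sum per relation ([1, 1] and [2, 2]), then Concat -> [1, 1, 2, 2]
```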
¡ (3) Layer connectivity
§ Add skip connections, pre/post-process layers
¡ Heterogeneous pre/post-process layers:
§ MLP layers with respect to each node type
§ Since the output of a GNN consists of node embeddings:
$\mathbf{h}_v^{(l)} = \mathrm{MLP}_{T(v)}\left(\mathbf{h}_v^{(l)}\right)$
§ $T(v)$ is the type of node $v$
¡ Other successful GNN designs are also encouraged for heterogeneous GNNs: skip connections, batch/layer normalization, …
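A sketch of per-node-type post-processing, assuming a single linear layer + ReLU stands in for each MLP (type names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4

# One post-process layer per node type T(v).
mlp = {t: rng.normal(size=(d, d)) for t in ("author", "paper")}

def postprocess(h_v, node_type):
    # h_v^(l) = MLP_{T(v)}(h_v^(l)): weights are selected by the node's type.
    return np.maximum(mlp[node_type] @ h_v, 0.0)

h = rng.normal(size=d)
out_author = postprocess(h, "author")
out_paper = postprocess(h, "paper")  # same input, type-specific transform
```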
¡ Graph Feature manipulation
§ The input graph lacks features → feature augmentation
¡ Graph Structure manipulation
§ The graph is too sparse → Add virtual nodes / edges
§ The graph is too dense → Sample neighbors when doing message passing
§ The graph is too large → Sample subgraphs to compute embeddings
§ Will cover later in lecture: Scaling up GNNs
¡ Graph Feature manipulation
§ 2 common options: compute graph statistics (e.g., node degree) within each relation type, or across the full graph (ignoring the relation types)
¡ Graph Structure manipulation
§ Neighbor and subgraph sampling are also common for heterogeneous graphs
§ 2 common options: sample within each relation type (to ensure neighbors of each type are covered), or sample across the full graph
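The per-relation-type sampling option can be sketched as (`k` and the neighbor lists are illustrative):

```python
import random

def sample_per_relation(neighbors_by_rel, k):
    """Sample up to k neighbors *within each relation type*, so every
    relation stays represented (vs. sampling k from the full neighbor set)."""
    sampled = {}
    for rel, nbrs in neighbors_by_rel.items():
        sampled[rel] = random.sample(nbrs, min(k, len(nbrs)))
    return sampled

nbrs = {"writes": ["a1", "a2", "a3"], "cites": ["p1"]}
s = sample_per_relation(nbrs, 2)
# "writes" contributes 2 neighbors; "cites" keeps its single neighbor
```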
¡ Node-level prediction:
$\widehat{\mathbf{y}}_v = \mathrm{Head}_{node}\left(\mathbf{h}_v^{(L)}\right) = \mathbf{W}^{(H)} \mathbf{h}_v^{(L)}$
¡ Edge-level prediction:
$\widehat{\mathbf{y}}_{uv} = \mathrm{Head}_{edge}\left(\mathbf{h}_u^{(L)}, \mathbf{h}_v^{(L)}\right) = \mathrm{Linear}\left(\mathrm{Concat}\left(\mathbf{h}_u^{(L)}, \mathbf{h}_v^{(L)}\right)\right)$
¡ Graph-level prediction:
$\widehat{\mathbf{y}}_G = \mathrm{Head}_{graph}\left(\left\{\mathbf{h}_v^{(L)} \in \mathbb{R}^d, \forall v \in G\right\}\right)$
¡ Node-level prediction:
$\widehat{\mathbf{y}}_v = \mathrm{Head}_{node, T(v)}\left(\mathbf{h}_v^{(L)}\right) = \mathbf{W}_{T(v)}^{(H)} \mathbf{h}_v^{(L)}$
¡ Edge-level prediction:
$\widehat{\mathbf{y}}_{uv} = \mathrm{Head}_{edge, r}\left(\mathbf{h}_u^{(L)}, \mathbf{h}_v^{(L)}\right) = \mathrm{Linear}_r\left(\mathrm{Concat}\left(\mathbf{h}_u^{(L)}, \mathbf{h}_v^{(L)}\right)\right)$
¡ Graph-level prediction:
$\widehat{\mathbf{y}}_G = \mathrm{AGG}\left(\mathrm{Head}_{graph, i}\left(\left\{\mathbf{h}_v^{(L)} \in \mathbb{R}^d, \forall T(v) = i\right\}\right)\right)$
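A sketch of type- and relation-specific prediction heads (type names, relation names, and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_classes = 4, 3

# Node-level head: one weight matrix W_{T(v)} per node type.
W_head = {t: rng.normal(size=(n_classes, d)) for t in ("author", "paper")}

def node_head(h_v, node_type):
    return W_head[node_type] @ h_v           # logits chosen by node type

# Edge-level head: one linear layer per relation type r on Concat(h_u, h_v).
W_edge = {r: rng.normal(size=(1, 2 * d)) for r in ("cites",)}

def edge_head(h_u, h_v, rel):
    return W_edge[rel] @ np.concatenate([h_u, h_v])

h_u, h_v = rng.normal(size=d), rng.normal(size=d)
y_node = node_head(h_u, "author")            # (n_classes,) logits
y_edge = edge_head(h_u, h_v, "cites")        # (1,) edge score
```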
J. You, R. Ying, J. Leskovec. Design Space of Graph Neural Networks, NeurIPS 2020