Graph Neural Networks:
PGNNs, Pretraining, and OGB
Jure Leskovec
Networks vs. Images and Text/Speech
§ No fixed node ordering or reference point
§ Often dynamic and have multimodal features
Jure Leskovec, Stanford University 6
Graph Neural Networks
[Figure: an input graph with nodes A–F; for each target node (e.g., A or B), the neural network unrolls a compute graph that aggregates information from that node's neighborhood.]
GraphSAGE: Training
§ Aggregation parameters are shared for all nodes
§ Number of model parameters is independent of |V|
§ Can use different loss functions:
§ Classification/Regression: $\mathcal{L}(h_v) = \| y_v - f(h_v) \|^2$
§ Pairwise loss: $\mathcal{L}(h_u, h_v) = \max(0,\, 1 - \cos(h_u, h_v))$
[Figure: compute graphs for nodes A and B in the input graph; the aggregation parameters $W^{(k)}, B^{(k)}$ are shared across all nodes.]
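The two loss functions above can be sketched numerically. A minimal sketch, assuming a linear prediction head f(h) = w·h and cosine similarity as the pairwise score (both illustrative choices; none of the variable names come from the slides):

```python
import numpy as np

# Illustrative node embeddings h_u, h_v (as would be produced by a
# trained GraphSAGE encoder) and a linear head f(h) = w @ h.
rng = np.random.default_rng(0)
h_u = rng.normal(size=8)
h_v = rng.normal(size=8)
y_u = 1.0                      # regression target for node u
w = rng.normal(size=8)         # parameters of the prediction head

def regression_loss(h, y, w):
    # L(h_v) = || y_v - f(h_v) ||^2
    return float((y - w @ h) ** 2)

def pairwise_loss(h1, h2):
    # L(h_u, h_v) = max(0, 1 - cos(h_u, h_v)): a hinge on cosine
    # similarity that pulls embeddings of related nodes together.
    cos = h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2))
    return max(0.0, 1.0 - float(cos))

print(regression_loss(h_u, y_u, w), pairwise_loss(h_u, h_v))
```

In practice both losses are applied over minibatches of (node, label) or (node, node) pairs and backpropagated through the shared aggregation parameters.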
Inductive Capability
§ Train with a snapshot of the graph; when a new node arrives, generate its embedding $z_u$ on the fly.
How Powerful are Graph Neural Networks? K. Xu, et al. ICLR 2019.
Key Insight: Rooted Subtrees
Assume no node features; then single nodes cannot be distinguished, but rooted trees can:
[Figure: an example graph and the rooted-subtree multisets the GNN distinguishes.]
Idea: If the GNN aggregation is injective, then different multisets are distinguished and the GNN can capture subtree structure.
Theorem: Any injective multiset function $f$ can be written as $f(X) = \phi\left( \sum_{x \in X} g(x) \right)$.
The functions $\phi$ and $g$ exist by the universal approximation theorem.
Consequence: The sum aggregator is the most expressive!
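The theorem's construction can be illustrated with a toy sum-pooling encoder. A minimal sketch assuming a finite node-feature vocabulary; the one-hot choice of $g$ below is one possible injective encoding for illustration, not the one used in the paper:

```python
import numpy as np

VOCAB = ["A", "B", "C"]   # assumed finite vocabulary of node features

def g(x):
    # one injective choice of g: a one-hot vector per feature value
    onehot = np.zeros(len(VOCAB))
    onehot[VOCAB.index(x)] = 1.0
    return onehot

def encode(multiset):
    # f(X) = phi(sum_{x in X} g(x)); here phi is the identity, and the
    # resulting count vector uniquely identifies the multiset.
    return tuple(sum(g(x) for x in multiset))

# Different multisets get different encodings (injectivity)...
assert encode(["A", "A", "B"]) != encode(["A", "B", "B"])
# ...including ones that differ only in multiplicity, which a mean
# aggregator would collapse:
assert encode(["A"]) != encode(["A", "A"])
```

Sum pooling over one-hot features is exactly a count vector, which is why it can separate any two distinct multisets over a finite vocabulary.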
Power of Aggregators
Ranking by expressive power: sum (multiset) > mean (distribution) > max (set).
[Figure 2: Ranking by expressive power for sum, mean and max-pooling aggregators over a multiset. The left panel shows the input multiset; the other panels illustrate which aspects of the multiset a given aggregator is able to capture: sum captures the full multiset, mean captures the proportion/distribution of elements of a given type, and max ignores multiplicities (reduces the multiset to a simple set).]
[Figure 3: Examples of simple graph structures on which mean and max-pooling aggregators fail: (a) mean and max both fail, (b) max fails, (c) mean and max both fail.]
Many existing GNNs instead use a 1-layer perceptron (Duvenaud et al., 2015; Kipf & Welling, 2017; Zhang et al., 2018): a linear mapping followed by a non-linear activation function such as ReLU. Such 1-layer mappings are examples of Generalized Linear Models (Nelder & Wedderburn, 1972). We are therefore interested in whether 1-layer perceptrons are enough for graph learning. Lemma 7 shows that there are indeed network neighborhoods (multisets) that models with 1-layer perceptrons can never distinguish.
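The failure cases can be reproduced in a few lines. A hedged sketch: treating each neighbor's feature as a scalar (values made up for illustration), mean- and max-pooling collapse the multisets {A} and {A, A}, while sum-pooling keeps them apart:

```python
import numpy as np

def aggregate(neighbors, how):
    # neighbors: a multiset of scalar node features
    xs = np.array(neighbors, dtype=float)
    return {"sum": xs.sum(), "mean": xs.mean(), "max": xs.max()}[how]

blue = 1.0                      # illustrative feature of a "blue" node
a = [blue, blue]                # neighborhood (a): two identical neighbors
b = [blue]                      # neighborhood (b): one such neighbor

# Mean and max cannot tell the two neighborhoods apart...
assert aggregate(a, "mean") == aggregate(b, "mean")
assert aggregate(a, "max") == aggregate(b, "max")
# ...but sum captures multiplicities and distinguishes them.
assert aggregate(a, "sum") != aggregate(b, "sum")
```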
Three Consequences of GNNs
1) The GNN does two things:
§ Learns how to “borrow” feature information from nearby nodes to enrich the target node
§ Each node can have a different computation graph, and the network is also able to capture/learn its structure
Three Consequences of GNNs
2) Computation graphs can be chosen:
§ Aggregation does not need to happen across all neighbors
§ Neighbors can be strategically chosen/sampled
§ Leads to big gains in practice!
Three Consequences of GNNs
3) We understand GNN failure cases:
§ GNNs fail to distinguish isomorphic nodes
§ Nodes with identical rooted subtrees will be classified in the same class (in the absence of differentiating node features)
[Figure: two graphs with node labels A/B whose nodes $v_1, \dots, v_5$ have identical rooted subtrees and therefore receive identical embeddings.]
Our Solution
PGNN: Position-Aware Graph Neural Networks
[Figure: randomly chosen anchor nodes/anchor sets in the graph; each node $v_i$ is described by one distance score per anchor set (e.g., rows such as 0.5, 0.3, 1, 0.5, 0.3 per node), which serve as position-aware features.]
Position-aware GNN Framework
Step (d): From messages to node embeddings
[Figure: the messages $M_v^i$ from each of the anchor-sets (three in the figure) are multiplied by a learnable weight $w$ to produce the output node embedding $z_v \in \mathbb{R}^k$, with one dimension per anchor set.]
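The anchor-set distances that underlie these position features can be sketched with plain BFS. Everything below (the toy path graph, the two anchor sets, the 1/(d+1) transform) is an illustrative sketch, not P-GNN's actual implementation:

```python
from collections import deque

def bfs_dist(adj, sources):
    # shortest-path distance from every node to the nearest source node
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Toy 5-node path graph 0-1-2-3-4 with two made-up anchor sets.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
anchor_sets = [{0}, {2, 4}]

# One position feature per anchor set: 1 / (d + 1), so nearer
# anchor sets contribute larger values.
dists = [bfs_dist(adj, s) for s in anchor_sets]
pos = {v: [1.0 / (d[v] + 1) for d in dists] for v in adj}
print(pos[3])  # node 3: distance 3 to {0} -> 0.25, distance 1 to {2, 4} -> 0.5
```

Each node's feature vector has one entry per anchor set, which is what makes otherwise-isomorphic nodes distinguishable by their positions.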
f( ) = biological activity?
2. Out-of-distribution prediction
§ Test examples tend to be very different from training examples
→ GNNs extrapolate poorly
[Figure: molecule graphs with binary toxicity labels (Toxicity B? = 0, Toxicity C? = 0, Toxicity D? = 1).]
[Chart: ROC-AUC improvement of the naïve pre-training strategy over no pre-training across downstream datasets (y-axis 0.0–14.0) — limited improvement.]
What are the Effective Strategies?
Key insight: Pre-train both node and graph embeddings
§ Node-level only: nodes are separated, but the embeddings are not composable (graphs are not separated)!
§ Graph-level only: graphs are separated, but embeddings of individual nodes are not!
§ Both: separation both in node and in graph space.
Possible Pre-training Methods
[Figure: pipeline — (1) a chemistry database of molecules with labels (Toxicity A? = 1, Toxicity B? = 0, Toxicity C? = 0, Toxicity D? = 1) is used to pre-train a GNN; (2) graph-level pre-training; (3) the pre-trained GNN is fine-tuned on downstream task N.]
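The pipeline's key property, that each stage continues training the same weights, can be caricatured with a toy parameter vector. Everything below (the stand-in "GNN" as a plain vector, the toy quadratic objectives) is a made-up illustration of "pre-train, then fine-tune the same weights", not code from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4)        # stand-in for shared GNN parameters

def sgd(theta, grad_fn, steps=100, lr=0.1):
    # plain gradient descent; stands in for a full training loop
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

# 1) node-level self-supervised pre-training (toy quadratic objective)
theta = sgd(theta, lambda t: t - 0.5)
# 2) graph-level supervised pre-training on a large labeled corpus
theta = sgd(theta, lambda t: t - 1.0)
# 3) fine-tune the *same* parameters on the small downstream task
theta = sgd(theta, lambda t: t - 1.2)

# Each stage starts from the weights the previous stage produced;
# that warm start is the whole point of the recipe.
print(theta)
```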
Results of Our Strategy
§ Avoids negative transfer.
§ Significantly improves performance.
[Chart: molecule classification performance — ROC-AUC improvement over no pre-training (y-axis 0.0–16.0) across downstream datasets for our strategy vs. the naïve strategy; our strategy avoids negative transfer!]
Comparison of GNNs

Q: What is the effect of pre-training on different GNN architectures?

Supervised Infomax      68.0 ±1.8  77.8 ±0.3  64.9 ±0.7  60.9 ±0.6  71.2 ±2.8  81.3 ±1.4  77.8 ±0.9  80.1 ±0.9  72.8
Supervised EdgePred     66.6 ±2.2  78.3 ±0.3  66.5 ±0.3  63.3 ±0.9  70.9 ±4.6  78.5 ±2.4  77.5 ±0.8  79.1 ±3.7  72.6
Supervised AttrMasking  66.5 ±2.5  77.9 ±0.4  65.1 ±0.3  63.9 ±0.9  73.7 ±2.8  81.2 ±1.9  77.1 ±1.2  80.3 ±0.9  73.2
Supervised ContextPred  68.7 ±1.3  78.1 ±0.6  65.7 ±0.6  62.7 ±0.8  72.6 ±1.5  81.3 ±2.1  79.9 ±0.7  84.5 ±0.7  74.2

Table 1: Test ROC-AUC (%) performance on molecular prediction benchmarks using different pre-training strategies with GIN. The rightmost column averages the mean of test performance across the 8 datasets. The best result for each dataset and comparable results (i.e., results within one standard deviation of the best result) are bolded. The shaded cells indicate negative transfer, i.e., the ROC-AUC of a pre-trained model is worse than that of a non-pre-trained model. Notice that node- as well as graph-level pre-training is essential for good performance.

            Chemistry                              Biology
            Non-pre-trained  Pre-trained  Gain     Non-pre-trained  Pre-trained  Gain
GIN         67.0             74.2         +7.2     64.8 ± 1.0       74.2 ± 1.5   +9.4
GCN         68.9             72.2         +3.4     63.2 ± 1.0       70.9 ± 1.7   +7.7
GraphSAGE   68.3             70.3         +2.0     65.7 ± 1.2       68.5 ± 1.5   +2.8
GAT         66.8             60.3         -6.5     68.2 ± 1.1       67.8 ± 3.6   -0.4

Table 2: Test ROC-AUC (%) performance of different GNN architectures with and without pre-training. Without pre-training, the less expressive GNNs give slightly better performance than the most expressive GIN because of their smaller model complexity in a low-data regime. With pre-training, however, the most expressive GIN is properly regularized and dominates the other architectures. For results split by chemistry datasets, see Table 4 in Appendix H. Pre-training strategy for chemistry data: Context Prediction + graph-level supervised pre-training; pre-training strategy for biology data: Attribute Masking + graph-level supervised pre-training.

A: The more expressive models (GIN) benefit the most from pre-training.
Webpage: https://ogb.stanford.edu/
Github: https://github.com/snap-stanford/ogb
Open Graph Benchmark
https://ogb.stanford.edu
ogb@cs.stanford.edu
Steering committee
Regina Barzilay, Peter Battaglia, Yoshua Bengio, Michael Bronstein, Stephan
Günnemann, Will Hamilton, Tommi Jaakkola, Stefanie Jegelka, Maximilian
Nickel, Chris Re, Le Song, Jian Tang, Max Welling, Rich Zemel
Postdoc positions in:
(1) ML on graphs; (2) NLP and knowledge graphs; (3) ML for biomedicine
Apply at http://snap.stanford.edu
Industry Partnerships · Funding

PhD Students: Jiaxuan You, Bowen Liu, Hongyu Ren, Rex Ying

Post-Doctoral Fellows: Baharan Mirzasoleiman, Marinka Zitnik, Michele Catasta, Pan Li, Shantao Li

Research Staff: Maria Brbic, Adrijan Bradaschia, Rok Sosic

Collaborators:
Dan Jurafsky, Linguistics, Stanford University
David Grusky, Sociology, Stanford University
Stephen Boyd, Electrical Engineering, Stanford University
David Gleich, Computer Science, Purdue University
VS Subrahmanian, Computer Science, University of Maryland
Marinka Zitnik, Medicine, Harvard University
Russ Altman, Medicine, Stanford University
Jochen Profit, Medicine, Stanford University
Eric Horvitz, Microsoft Research
Jon Kleinberg, Computer Science, Cornell University
Sendhil Mullainathan, Economics, Harvard University
Scott Delp, Bioengineering, Stanford University
James Zou, Medicine, Stanford University