Chapter 9
Graph Neural Networks: Graph Classification

Christopher Morris

CERC in Data Science for Real-Time Decision-Making, Polytechnique Montréal, e-mail: chris@christophermorris.info
Abstract Recently, graph neural networks (GNNs) emerged as the leading machine learning architecture for supervised learning with graph and relational input. This chapter gives an overview of GNNs for graph classification, i.e., GNNs that learn a graph-level output. Since GNNs compute node-level representations, pooling layers, i.e., layers that learn graph-level representations from node-level representations, are crucial components for successful graph classification. Hence, we give a thorough overview of pooling layers. Further, we review recent research on understanding the limitations of GNNs for graph classification and on progress in overcoming them. Finally, we survey graph classification applications of GNNs and overview benchmark datasets for empirical evaluation.
9.1 Introduction
In the following, we survey classic and modern GNN architectures for graph classification. GNN layers for graph classification date back to at least the mid-nineties in chemoinformatics. For example, Kireev (1995) derived GNN-like neural architectures to predict properties of chemical molecules. The work of Merkwirth and Lengauer (2005) had a similar aim. Gori et al (2005) and Scarselli et al (2008) proposed the original GNN architecture, introducing the general formulation that was later reintroduced and refined in (Gilmer et al, 2017), which derived the general message-passing formulation in which most modern GNN architectures can be expressed; see Section 9.2.1.
We divide our overview of modern GNN layers for graph classification into spatial approaches, i.e., ones that are purely based on the graph structure, aggregating local information around each node, and spectral approaches, i.e., ones that rely on extracting information from the graph's spectrum. Although this division is somewhat arbitrary, we stick to it for historical reasons. Due to the large number of different GNN layers, we cannot offer a complete survey but focus on representative and influential works.
9.2.1 Spatial Approaches

One of the earliest modern, spatial GNN architectures for graph classification was presented in (Duvenaud et al, 2015b), focusing on the prediction of the properties of chemical molecules. Specifically, the authors propose a differentiable variant of the well-known Extended Connectivity Fingerprint (ECFP) (Rogers and Hahn, 2010) from chemoinformatics, which works similarly to the computation of the WL feature vector. For the computation of their GNN layer, denoted Neural Graph Fingerprints, Duvenaud et al (2015b) first initialize the feature vector $f^0(v)$ of each node $v$ with features of the corresponding atom, e.g., a one-hot encoding representing the atom type. In each iteration or layer $t$, they compute a feature representation $f^t(v)$ for node $v$ as
$$f^t(v) = f^{t-1}(v) + \sum_{w \in N(v)} f^{t-1}(w),$$
followed by the application of a one-layer perceptron. Here, $N(v)$ denotes the neighborhood of node $v$, i.e., $N(v) = \{\, w \in V \mid (v,w) \in E \,\}$. Since the ECFP usually computes sparse feature vectors for small molecules, they apply a linear layer followed by a softmax function, i.e.,
$$f^t(v) = \sigma\Big( f^{t-1}(v) \cdot W_1 + \sum_{w \in N(v)} f^{t-1}(w) \cdot W_2 \Big), \qquad (9.1)$$
where $W_1$ and $W_2$ are parameter matrices in $\mathbb{R}^{d \times d}$, which are shared across layers, and $\sigma(\cdot)$ is a component-wise non-linearity. The above layer is evaluated on standard, small-scale benchmark datasets (Kersting et al, 2016), showing good performance, similar to classical kernel approaches. Lei et al (2017a) proposed a similar layer and showed a connection to kernel approaches by deriving the corresponding kernel space of the learned graph embeddings.
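To make the update of Equation (9.1) concrete, the following is a minimal PyTorch sketch of a single layer; the dense adjacency-matrix formulation, the softmax non-linearity, and all names are illustrative choices rather than the original implementation of Duvenaud et al (2015b).

```python
import torch

def neural_fingerprint_layer(F, A, W1, W2):
    """One layer in the spirit of Equation (9.1), written with a dense adjacency matrix.

    F  : node features, shape (n, d); row v holds f^{t-1}(v)
    A  : binary adjacency matrix, shape (n, n)
    W1 : weight matrix applied to the node's own feature, shape (d, d)
    W2 : weight matrix applied to the summed neighbor features, shape (d, d)
    """
    neighbor_sum = A @ F                                 # row v: sum_{w in N(v)} f^{t-1}(w)
    # sigma(.) is a component-wise non-linearity; here a softmax, as in the text
    return torch.softmax(F @ W1 + neighbor_sum @ W2, dim=1)

# toy usage on a random graph
n, d = 6, 8
A = torch.bernoulli(torch.full((n, n), 0.3))
A = ((A + A.t()) > 0).float().fill_diagonal_(0)          # symmetric, no self-loops
F0 = torch.randn(n, d)
W1, W2 = torch.randn(d, d), torch.randn(d, d)
F1 = neural_fingerprint_layer(F0, A, W1, W2)             # shape (n, d)
```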
To explicitly support edge labels, e.g., chemical bonds, Simonovsky and Komodakis (2017) introduced the Edge-Conditioned Convolution, where a feature for node $v$ is computed as
$$f^t(v) = \frac{1}{|N(v)|} \sum_{w \in N(v)} F^l\big( l(w,v), W^{(l)} \big) \cdot f^{t-1}(w) + b^l.$$
Here, $l(w,v)$ is the feature (or label) of the edge shared by the nodes $v$ and $w$. Moreover, $F^l \colon \mathbb{R}^s \to \mathbb{R}^{d_t \times d_{t-1}}$ is a function, where $s$ denotes the number of components of the edge features and $d_t$ and $d_{t-1}$ denote the number of components of the features of layers $t$ and $(t-1)$, respectively, mapping the edge feature to a matrix in $\mathbb{R}^{d_t \times d_{t-1}}$. Further, the function $F^l$ is parameterized by the matrix $W^{(l)}$, conditioned on the edge feature $l$. Finally, $b^l$ is a bias term, again conditioned on the edge feature $l$. The above layer is applied to graph classification tasks on small-scale, standard benchmark datasets (Kersting et al, 2016) and to point cloud data from computer vision.
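The edge-conditioned update can be sketched as follows; the filter-generating MLP, the plain (rather than edge-conditioned) bias, and the edge-list representation are simplifying assumptions made for illustration, not the original implementation.

```python
import torch
import torch.nn as nn

class EdgeConditionedConv(nn.Module):
    """Edge-conditioned convolution sketch: an MLP maps each edge feature to a
    weight matrix that transforms the corresponding neighbor feature; the
    resulting messages are averaged over the neighborhood."""

    def __init__(self, in_dim, out_dim, edge_dim, hidden=32):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # F^l: edge feature -> flattened (out_dim x in_dim) matrix
        self.filter_net = nn.Sequential(
            nn.Linear(edge_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim * in_dim))
        # simplification: plain learnable bias instead of an edge-conditioned one
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x, edge_index, edge_attr):
        # x: (n, in_dim); edge_index: (2, m) long tensor of (source w, target v); edge_attr: (m, edge_dim)
        src, dst = edge_index[0], edge_index[1]
        theta = self.filter_net(edge_attr).view(-1, self.out_dim, self.in_dim)
        msgs = torch.bmm(theta, x[src].unsqueeze(-1)).squeeze(-1)     # Theta(l(w, v)) * f(w)
        out = torch.zeros(x.size(0), self.out_dim, device=x.device)
        out.index_add_(0, dst, msgs)                                  # sum messages per target node
        deg = torch.zeros(x.size(0), device=x.device)
        deg.index_add_(0, dst, torch.ones(dst.size(0), device=x.device))
        return out / deg.clamp(min=1).unsqueeze(-1) + self.bias       # mean over the neighborhood

# toy usage: 4 nodes, 3 undirected edges stored as two directed copies each
x = torch.randn(4, 5)
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3], [1, 0, 2, 1, 3, 2]])
edge_attr = torch.randn(6, 2)
conv = EdgeConditionedConv(in_dim=5, out_dim=7, edge_dim=2)
out = conv(x, edge_index, edge_attr)    # shape (4, 7)
```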
Building on (Scarselli et al, 2008), Gilmer et al (2017) introduced a general message-passing framework, unifying most of the GNN architectures proposed so far. Specifically, Gilmer et al (2017) replaced the inner sum defined over the neighborhood in the above equations by a general permutation-invariant, differentiable function, e.g., a neural network, and substituted the outer sum over the previous and the neighborhood feature representation, e.g., by a column-wise vector concatenation or an LSTM-style update step. Thus, in full generality, a new feature $f^t(v)$ is computed as
$$f^t(v) = f^{W_1}_{\mathrm{merge}}\Big( f^{t-1}(v),\; f^{W_2}_{\mathrm{aggr}}\big( \{\{ f^{t-1}(w) \mid w \in N(v) \}\} \big) \Big), \qquad (9.2)$$
where $f^{W_2}_{\mathrm{aggr}}$ aggregates over the multiset of neighborhood features and $f^{W_1}_{\mathrm{merge}}$ merges the node's representation from step $(t-1)$ with the computed neighborhood features. Moreover, it is straightforward to include edge features as well, e.g., by learning a combined feature representation of the node itself, the neighboring node, and the corresponding edge feature. Gilmer et al (2017) employed the above architecture for regression tasks from quantum chemistry, showing promising performance for regression targets computed by expensive numerical simulations, namely density functional theory (DFT) (Wu et al, 2018; Ramakrishnan et al, 2014).
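The following sketch spells out one step of the abstract scheme in Equation (9.2); the concrete choices, sum aggregation and a concatenate-and-transform merge, are just one possible instantiation.

```python
import torch

def message_passing_step(x, neighbors, f_aggr, f_merge):
    """One generic message-passing step in the spirit of Equation (9.2).

    x         : node features, shape (n, d)
    neighbors : list of neighbor-index lists, neighbors[v] = N(v)
    f_aggr    : permutation-invariant function over the multiset of neighbor features
    f_merge   : combines a node's previous feature with the aggregated message
    """
    out = []
    for v in range(x.size(0)):
        nbr = x[neighbors[v]] if neighbors[v] else x.new_zeros(1, x.size(1))
        out.append(f_merge(x[v], f_aggr(nbr)))
    return torch.stack(out)

# one possible instantiation: sum aggregation, concatenate-and-transform merge
d = 8
W = torch.randn(2 * d, d)
f_aggr = lambda msgs: msgs.sum(dim=0)
f_merge = lambda h, m: torch.relu(torch.cat([h, m]) @ W)

x = torch.randn(4, d)
neighbors = [[1, 2], [0], [0, 3], [2]]
x_next = message_passing_step(x, neighbors, f_aggr, f_merge)   # shape (4, d)
```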
To go beyond a single, fixed aggregation function, Corso et al (2020) proposed the Principal Neighbourhood Aggregation (PNA) layer, which combines several aggregation functions with degree-based scalers
$$S(d, \alpha) = \left( \frac{\log(d + 1)}{\delta} \right)^{\alpha} \quad \text{with} \quad \delta = \frac{1}{|\mathrm{train}|} \sum_{i \in \mathrm{train}} \log(d_i + 1),$$
where $\alpha$ is a variable parameter. Here, the set $\mathrm{train}$ contains all nodes $i$ in the training set and $d_i$ denotes its degree, resulting in the aggregation function
$$\bigoplus = \underbrace{\begin{bmatrix} I \\ S(D, \alpha = 1) \\ S(D, \alpha = -1) \end{bmatrix}}_{\text{scalers}} \otimes \underbrace{\begin{bmatrix} \mu \\ \sigma \\ \max \\ \min \end{bmatrix}}_{\text{aggregators}},$$
where $\otimes$ denotes the tensor product. The authors report promising performance over
standard aggregation functions on a wide range of standard benchmark datasets,
improving over some standard GNN layers.
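A small sketch of such a combined aggregation is given below; the node-wise loop, the neighbor lists, and all names are illustrative, and δ is assumed to have been precomputed over the training set as described above. The resulting 12d-dimensional vector per node would then be fed into the usual update function of the GNN layer.

```python
import math
import torch

def pna_aggregate(x, neighbors, delta):
    """Combine four aggregators (mean, std, max, min) with three degree scalers
    (identity, amplification S(d, alpha=1), attenuation S(d, alpha=-1)) via a
    tensor product, yielding 12 * d values per node."""
    out = []
    for v in range(x.size(0)):
        idx = neighbors[v] if neighbors[v] else [v]            # fall back to self if isolated
        h = x[idx]                                             # neighbor features, shape (deg, d)
        aggs = torch.cat([h.mean(0), h.std(0, unbiased=False),
                          h.max(0).values, h.min(0).values])   # shape (4 * d,)
        s = math.log(len(idx) + 1) / delta                     # amplification scaler S(d, 1)
        scalers = torch.tensor([1.0, s, 1.0 / s])              # identity, amplify, attenuate
        out.append(torch.kron(scalers, aggs))                  # tensor product of scalers and aggregators
    return torch.stack(out)

# toy usage; delta is the mean log-degree over the (here: all) nodes
x = torch.randn(4, 5)
neighbors = [[1, 2], [0, 3], [0], [1]]
delta = sum(math.log(len(nb) + 1) for nb in neighbors) / len(neighbors)
z = pna_aggregate(x, neighbors, delta)   # shape (4, 60)
```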
Vignac et al (2020b) extended the expressivity of GNNs, see also Section 9.4, by using unique node identifiers, generalizing the message-passing scheme proposed by Gilmer et al (2017), see Equation (9.2), to compute and pass matrix features instead of vector features. Formally, each node $i$ maintains a matrix $U_i \in \mathbb{R}^{n \times c}$, denoted local context, where the $j$-th row contains the vectorial representation of node $j$ from the viewpoint of node $i$. At initialization, each local context $U_i$ is set to the one-hot indicator $\mathbf{1}_i \in \mathbb{R}^{n \times 1}$, where $n$ denotes the number of nodes in the given graph. Now, at each layer $l$, similarly to the above message-passing framework, the local context is updated as
$$U_i^{(l+1)} = u^{(l)}\Big( U_i^{(l)}, \tilde{U}_i^{(l)} \Big) \in \mathbb{R}^{n \times c_{l+1}} \quad \text{with} \quad \tilde{U}_i^{(l)} = \phi\Big( \big\{ m^{(l)}\big( U_i^{(l)}, U_j^{(l)}, y_{ij} \big) \big\}_{j \in N(i)} \Big),$$
where $u^{(l)}$, $m^{(l)}$, and $\phi$ are update, message, and aggregation functions, respectively, used to compute the updated local context, and $y_{ij}$ denotes the edge features shared by nodes $i$ and $j$. Moreover, the authors study the expressive power of this architecture, showing that, in principle, the above layer can distinguish any pair of non-isomorphic graphs, and propose more scalable variants of the above architecture. Finally, promising results on standard benchmark datasets are reported.
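A strongly simplified sketch of one such update on matrix-valued local contexts is shown below; the linear message and update maps, the sum aggregation, the omission of edge features, and all names are illustrative assumptions, not the architecture of Vignac et al (2020b).

```python
import torch

def smp_layer(U, neighbors, W_msg, W_upd):
    """One step on matrix-valued local contexts U (shape (n, n, c)): each node i
    aggregates the (linearly transformed) local contexts of its neighbors and
    merges them with its own context via concatenation and a linear map."""
    out = []
    for i in range(U.size(0)):
        nbr = neighbors[i] if neighbors[i] else [i]
        msgs = torch.stack([U[j] @ W_msg for j in nbr])                 # messages m(U_i, U_j)
        agg = msgs.sum(dim=0)                                           # permutation-invariant aggregation phi
        out.append(torch.relu(torch.cat([U[i], agg], dim=-1) @ W_upd))  # update u
    return torch.stack(out)

# initialization: the local context of node i is the one-hot indicator of i, in R^{n x 1}
n, c = 4, 8
U0 = torch.eye(n).unsqueeze(-1)                  # shape (n, n, 1)
neighbors = [[1, 2], [0], [0, 3], [2]]
W_msg0, W_upd0 = torch.randn(1, c), torch.randn(1 + c, c)
U1 = smp_layer(U0, neighbors, W_msg0, W_upd0)    # shape (n, n, c)
```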
9.2.2 Spectral Approaches

Spectral approaches rely on the eigendecomposition of the (normalized) graph Laplacian,
$$L = U \Lambda U^{\top},$$
where $U$ contains the eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues. The spectral convolution of a graph signal $x$ with a filter $g$ is then defined as
$$x \ast g = U \big( (U^{\top} g) \odot (U^{\top} x) \big),$$
where the operator $\odot$ denotes the elementwise product. If we set $g_{\theta} = \mathrm{diag}(U^{\top} g)$, the above can be expressed as
$$x \ast g_{\theta} = U g_{\theta} U^{\top} x.$$
Based on this, early spectral GNN layers compute the $j$-th feature column of layer $t$ as
$$H^{t}_{:,j} = \sigma\Big( \sum_{i=1}^{d_{t-1}} U \, \Theta^{t}_{i,j} \, U^{\top} H^{t-1}_{:,i} \Big)$$
for $j$ in $\{1, 2, \ldots, d_t\}$. Here, $t$ is the layer index, $H^{t-1} \in \mathbb{R}^{n \times d_{t-1}}$ is the graph signal, where $H^0 = X$, i.e., the given graph features, and $\Theta^{t}_{i,j}$ is a diagonal parameter matrix. However, the above layer suffers from a number of drawbacks: the basis of eigenvectors is not permutation-invariant, the layer cannot be applied to a graph with a different structure, and the computation of the eigendecomposition is cubic in the number of nodes. Hence, Henaff et al (2015) proposed more scalable variants of the above layer by building on a smoothness notion in the spectral domain, which reduces the number of parameters and acts as a regularizer.
To further improve scalability, Defferrard et al (2016) introduced Chebyshev Spectral CNNs, which approximate $g_{\theta}$ by a Chebyshev expansion (Hammond et al, 2011). Namely, they express
$$g_{\theta} = \sum_{i=0}^{K} \theta_i T_i(\hat{L}),$$
where $\hat{L} = 2L/\lambda_{\max} - I$, and $\lambda_{\max}$ denotes the largest eigenvalue of the normalized Laplacian $L$. The normalization ensures that the eigenvalues of the Laplacian lie in the real interval $[-1, 1]$, as required by Chebyshev polynomials. Here, $T_i$ denotes the $i$-th Chebyshev polynomial, defined by the recurrence $T_i(x) = 2x\,T_{i-1}(x) - T_{i-2}(x)$ with $T_0(x) = 1$ and $T_1(x) = x$. Alternatively, Levie et al (2019) used Cayley polynomials and showed that Chebyshev Spectral CNNs are a special case.
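The Chebyshev filter can be applied without ever computing an eigendecomposition by using the recurrence directly on the rescaled Laplacian; the sketch below assumes the common upper bound λ_max ≈ 2 for the normalized Laplacian and at least two filter coefficients.

```python
import torch

def chebyshev_filter(x, A, theta, lam_max=2.0):
    """Apply the filter g_theta = sum_i theta_i T_i(L_hat) to a graph signal x
    using the Chebyshev recurrence, without an eigendecomposition.

    x     : graph signal, shape (n,) or (n, d)
    A     : binary adjacency matrix, shape (n, n)
    theta : list of K + 1 filter coefficients theta_0, ..., theta_K (K >= 1)
    """
    n = A.size(0)
    deg = A.sum(dim=1).clamp(min=1)
    d_inv_sqrt = deg.pow(-0.5)
    L = torch.eye(n) - d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)  # normalized Laplacian
    L_hat = 2.0 * L / lam_max - torch.eye(n)            # rescale eigenvalues into [-1, 1]

    Tx_prev, Tx = x, L_hat @ x                          # T_0(L_hat) x and T_1(L_hat) x
    out = theta[0] * Tx_prev + theta[1] * Tx
    for k in range(2, len(theta)):
        Tx_prev, Tx = Tx, 2.0 * (L_hat @ Tx) - Tx_prev  # T_k = 2 L_hat T_{k-1} - T_{k-2}
        out = out + theta[k] * Tx
    return out

# toy usage with K = 2
A = torch.tensor([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
x = torch.randn(3, 4)
y = chebyshev_filter(x, A, theta=[0.5, 0.3, 0.2])
```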
Kipf and Welling (2017b) proposed to make Chebyshev Spectral CNNs more scalable by setting
$$x \ast g_{\theta} = \theta_0 x - \theta_1 D^{-\frac{1}{2}} A D^{-\frac{1}{2}} x.$$
Further, they improved the generalization ability of the resulting layer by setting $\theta = \theta_0 = -\theta_1$, resulting in
$$x \ast g_{\theta} = \theta \big( I + D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \big) x.$$
In fact, the above layer can be understood as a spatial GNN, i.e., it is equivalent to computing a feature
$$f^t(v) = \sigma\Big( \sum_{w \in N(v) \cup \{v\}} \frac{1}{\sqrt{d_v d_w}} \, f^{t-1}(w) \cdot W \Big)$$
for node $v$ in the given graph $G$, where $d_v$ and $d_w$ denote the degrees of nodes $v$ and $w$, respectively. Although the above layer was originally proposed for semi-supervised node classification, it is now one of the most widely used ones and has been applied to tasks such as matrix completion (van den Berg et al, 2018), link prediction (Schlichtkrull et al, 2018), and also as a baseline for graph classification (Ying et al, 2018c).
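Written in matrix form, the spatial variant above amounts to one (sparse) matrix product with the symmetrically normalized, self-loop-augmented adjacency matrix; the dense sketch below, with a ReLU as the non-linearity, is illustrative.

```python
import torch

def gcn_layer(F, A, W):
    """Graph convolution in the spatial form above: node v sums its own and its
    neighbors' features, each weighted by 1 / sqrt(d_v * d_w), then applies W
    and a component-wise non-linearity (here a ReLU as an illustrative choice).

    F : node features, shape (n, d_in); A : adjacency matrix, shape (n, n); W : (d_in, d_out)
    """
    deg = A.sum(dim=1).clamp(min=1)
    d_inv_sqrt = deg.pow(-0.5)
    A_hat = A + torch.eye(A.size(0))                                    # N(v) together with v itself
    A_norm = d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)  # D^{-1/2} (A + I) D^{-1/2}
    return torch.relu(A_norm @ F @ W)

# toy usage
A = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
F = torch.randn(3, 4)
W = torch.randn(4, 2)
F_next = gcn_layer(F, A, W)   # shape (3, 2)
```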
9.3 Pooling Layers

Since GNNs learn vectorial node representations, using them for graph classification requires a pooling layer, enabling going from node-level to graph-level output. Formally, a pooling layer is a (parameterized) function that maps a multiset of vectors, i.e., learned node-level representations, to a single vector, i.e., the graph-level representation. Arguably, the simplest such layers are sum, mean, and min or max pooling. That is, given a graph $G$ and the multiset of learned node features
$$M = \{\{\, f(v) \in \mathbb{R}^d \mid v \in V \,\}\},$$
sum pooling computes
$$f_{\mathrm{pool}}(G) = \sum_{f(v) \in M} f(v),$$
while mean, min, and max pooling take the (component-wise) average, minimum, and maximum over the elements in $M$, respectively. These four simple pooling layers are still used in many published GNN architectures, e.g., see (Duvenaud et al, 2015b). In fact, recent work (Mesquita et al, 2020) showed that more sophisticated layers, e.g., those relying on clustering, see below, do not offer any empirical benefits on many real-world datasets, especially those from the molecular domain.
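These simple pooling layers reduce to a single component-wise reduction over the node dimension, as in the following sketch.

```python
import torch

def simple_pool(H, mode="sum"):
    """Pool node-level representations H (shape (n, d)) into a single graph-level
    vector of shape (d,) by a component-wise reduction over the node dimension."""
    if mode == "sum":
        return H.sum(dim=0)
    if mode == "mean":
        return H.mean(dim=0)
    if mode == "max":
        return H.max(dim=0).values
    if mode == "min":
        return H.min(dim=0).values
    raise ValueError(f"unknown pooling mode: {mode}")

# toy usage: pool the node representations of a 5-node graph
H = torch.randn(5, 16)
graph_repr = simple_pool(H, mode="mean")   # shape (16,)
```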
9.3.1 Attention-based Pooling Layers

Simple attention-based pooling became popular in recent years due to its easy implementation and scalability compared to more sophisticated alternatives; see below. For example, Gilmer et al (2017), see above, used a seq2seq architecture for sets (Vinyals et al, 2016) for pooling purposes in their empirical study. Focusing on pooling for GNNs, Lee et al (2019b) introduced the SAGPool layer, short for Self-Attention Graph Pooling, using self-attention. Specifically, they compute a self-attention score by multiplying the aggregated features of an arbitrary GNN layer by a matrix $Q_{\mathrm{att}} \in \mathbb{R}^{d \times 1}$, where $d$ denotes the number of components of the node features. For example, computing the self-attention score $Z(v)$ for the simple layer of Equation (9.1) equates to
$$Z(v) = \sigma\Big( \Big( f^{t-1}(v) \cdot W_1 + \sum_{w \in N(v)} f^{t-1}(w) \cdot W_2 \Big) \cdot Q_{\mathrm{att}} \Big).$$
The self-attention score $Z(v)$ is subsequently used to select the top-$k$ nodes in the graph, similarly to Cangea et al (2018) and Gao et al (2018a), see below, omitting the other nodes and effectively pruning the graph. Similar attention-based techniques are proposed in (Huang et al, 2019).
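The selection step can be sketched as follows; the sigmoid score, the gating of the retained features by their scores, and all names are illustrative choices, not the original SAGPool implementation.

```python
import torch

def self_attention_pool(H, A, q_att, ratio=0.5):
    """Score nodes by self-attention, keep the top-k, and prune the rest.

    H     : node features from an arbitrary GNN layer, shape (n, d)
    A     : adjacency matrix, shape (n, n)
    q_att : attention vector, shape (d,) (the matrix Q_att with a single column)
    ratio : fraction of nodes to keep
    """
    Z = torch.sigmoid(H @ q_att)                  # self-attention score per node
    k = max(1, int(ratio * H.size(0)))
    idx = torch.topk(Z, k).indices                # indices of the top-k nodes
    H_new = H[idx] * Z[idx].unsqueeze(-1)         # gate kept features by their score (one common choice)
    A_new = A[idx][:, idx]                        # induced subgraph on the kept nodes
    return H_new, A_new, idx

# toy usage
H = torch.randn(6, 8)
A = torch.bernoulli(torch.full((6, 6), 0.4))
A = ((A + A.t()) > 0).float()
q_att = torch.randn(8)
H_pool, A_pool, kept = self_attention_pool(H, A, q_att, ratio=0.5)
```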
9.3.2 Cluster-based Pooling Layers

The idea of cluster-based pooling layers is to coarsen the graph, i.e., to iteratively merge similar nodes. One of the earliest uses was proposed in (Simonovsky and Komodakis, 2017), see above, where the Graclus clustering algorithm (Dhillon et al, 2007) is used. However, one has to note that the algorithm is parameter-free, i.e., it does not adapt to the learning task at hand.
The arguably most well-known cluster-based pooling layer is DiffPool (Ying et al, 2018c). The idea of DiffPool is to iteratively coarsen the graph by learning a soft clustering of the nodes, making the otherwise discrete clustering assignment differentiable. Concretely, at layer $t$, DiffPool learns a soft cluster assignment $S \in [0,1]^{n_t \times n_{t+1}}$, where $n_t$ and $n_{t+1}$ are the number of nodes at layer $t$ and $(t+1)$, respectively. Each entry $S_{i,j}$ represents the probability of node $i$ of layer $t$ being assigned to cluster $j$ of layer $(t+1)$. The assignment matrix is computed as
$$S = \mathrm{softmax}\big( \mathrm{GNN}(A_t, F_t) \big),$$
where $A_t$ and $F_t$ are the adjacency matrix and the feature matrix of the clustered graph at layer $t$, and the function GNN is an arbitrary GNN layer. Finally, in each layer, the adjacency matrix and the feature matrix are updated as
$$A_{t+1} = S^{\top} A_t S \quad \text{and} \quad F_{t+1} = S^{\top} F_t,$$
respectively.
Empirically, the authors show that the DiffPool layer boosts the performance of standard GNN layers, e.g., GraphSage (Hamilton et al, 2017b), on standard, small-scale benchmark datasets (Morris et al, 2020a). The downside of the above layer is the added computational cost. The adjacency matrix becomes dense and real-valued after the first pooling layer, leading to a quadratic cost in the number of nodes for each GNN layer's computation. Moreover, the number of clusters has to be chosen in advance, leading to an increase in hyperparameters.
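One DiffPool coarsening step can be sketched as follows; the assignment GNN is reduced to a single matrix product and the separate embedding GNN of the original method is omitted, so the code is an illustrative simplification.

```python
import torch

def diffpool_step(A, F, W_assign):
    """One DiffPool-style coarsening step.

    A        : adjacency matrix of the current (clustered) graph, shape (n_t, n_t)
    F        : node features, shape (n_t, d)
    W_assign : weights of a minimal, single-matrix assignment 'GNN', shape (d, n_{t+1})
    """
    S = torch.softmax(A @ F @ W_assign, dim=1)   # soft cluster assignment S in [0, 1]^{n_t x n_{t+1}}
    F_next = S.t() @ F                           # features of the coarsened graph
    A_next = S.t() @ A @ S                       # dense, real-valued coarsened adjacency
    return A_next, F_next, S

# toy usage: coarsen a 6-node graph to 3 clusters
A = torch.bernoulli(torch.full((6, 6), 0.4))
A = ((A + A.t()) > 0).float()
F = torch.randn(6, 8)
A2, F2, S = diffpool_step(A, F, W_assign=torch.randn(8, 3))
```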
A further approach is to pool by sorting: the node feature matrix is sorted and truncated to a fixed number of $k$ nodes, where the function sort sorts the feature matrix $F_t$ row-wise in a descending fashion, and the function truncate returns the first $k$ rows of the input matrix. Ties are broken using the features from the previous layers, $1$ to $(t-1)$. The resulting matrix $F_{\mathrm{trunc}}$ of shape $k \times \sum_{i=1}^{h} d_i$, where $d_i$ denotes the number of features of the $i$-th layer and $h$ the total number of layers, is reshaped row-wise into a tensor of size $k \big( \sum_{i=1}^{h} d_i \big) \times 1$, followed by a standard 1-D convolution with filter and step size $\sum_{i=1}^{h} d_i$. Finally, a sequence of max-pooling and 1-D convolution operations is applied to identify local patterns in the sequence.
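The sort-truncate-convolve pipeline can be sketched as follows; sorting by the last feature column (ignoring the layer-wise tie-breaking described above), the zero-padding of small graphs, and the untrained convolution are illustrative simplifications.

```python
import torch
import torch.nn as nn

def sort_pool(F_cat, k, conv_out=16):
    """Sort nodes by their features, keep the first k rows, flatten row-wise, and
    apply a 1-D convolution whose filter and stride equal the feature dimension.

    F_cat : concatenated features of all layers, shape (n, sum_i d_i)
    """
    total_d = F_cat.size(1)
    order = torch.argsort(F_cat[:, -1], descending=True)   # sort by the last feature column
    F_sorted = F_cat[order]
    if F_sorted.size(0) < k:                                # pad small graphs with zero rows
        pad = F_sorted.new_zeros(k - F_sorted.size(0), total_d)
        F_sorted = torch.cat([F_sorted, pad], dim=0)
    F_trunc = F_sorted[:k]                                  # shape (k, sum_i d_i)
    seq = F_trunc.reshape(1, 1, k * total_d)                # row-wise flattening: (batch, channel, length)
    # in practice this convolution is a trained module; it is created here only for the sketch
    conv = nn.Conv1d(1, conv_out, kernel_size=total_d, stride=total_d)
    return conv(seq)                                        # shape (1, conv_out, k)

# toy usage: 7 nodes with concatenated layer features of total width 7, truncated to k = 5
F_cat = torch.randn(7, 7)
out = sort_pool(F_cat, k=5)
```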
Similarly, to combat the high computational cost of some pooling layers, e.g., DiffPool, Cangea et al (2018) introduced a pooling layer that drops $n - \lceil kn \rceil$ nodes of a graph with $n$ nodes in each layer, for $k \in (0, 1]$. The nodes to be dropped are chosen according to a projection score against a learnable vector $p$. Concretely, they compute the score vector and index set
$$y = \frac{F_t \cdot p}{\lVert p \rVert} \quad \text{and} \quad I = \text{top-}k(y, k),$$
where top-$k$ returns the indices of the top-$k$ entries of a given vector according to $y$. Finally, the adjacency matrix $A_{t+1}$ is updated by removing the rows and columns that are not in $I$, while the updated feature matrix is obtained by gating the retained rows with their projection scores, i.e., $F_{t+1} = \big( F_t \odot \tanh(y) \big)_{I}$.
The authors report slightly lower classification accuracies than the DiffPool layer
on most employed datasets while being much faster in computation time. A similar
approach was presented in (Gao and Ji, 2019).
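The projection-based selection reads as follows in a minimal sketch; all names are illustrative, and the tanh gating of the retained features follows the description above.

```python
import math
import torch

def topk_projection_pool(F, A, p, ratio=0.5):
    """Keep the ceil(ratio * n) nodes with the largest projection score onto the
    learnable vector p, gate their features with tanh of the score, and restrict
    the adjacency matrix to the retained nodes."""
    y = (F @ p) / p.norm()                              # projection scores y = F p / ||p||
    k = max(1, math.ceil(ratio * F.size(0)))
    idx = torch.topk(y, k).indices                      # I = top-k(y, k)
    F_new = F[idx] * torch.tanh(y[idx]).unsqueeze(-1)   # gated, retained features
    A_new = A[idx][:, idx]                              # drop rows/columns not in I
    return F_new, A_new, idx

# toy usage
F = torch.randn(6, 8)
A = torch.bernoulli(torch.full((6, 6), 0.4))
A = ((A + A.t()) > 0).float()
p = torch.randn(8)
F_pool, A_pool, kept = topk_projection_pool(F, A, p, ratio=0.5)
```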
To derive more expressive graph representations, Murphy et al (2019c,b) propose relational pooling. To increase the expressive power of GNN layers, they average over all permutations of a given graph. Formally, let $G$ be a graph; then a representation
$$f(G) = \frac{1}{|V|!} \sum_{\pi \in \Pi} g\big( A_{\pi,\pi},\, [F_{\pi}, I_{|V|}] \big) \qquad (9.4)$$
is learned, where $\Pi$ denotes all possible permutations of the rows and columns of the adjacency matrix of $G$, $g$ is a permutation-invariant function, and $[\cdot, \cdot]$ denotes column-wise matrix concatenation. Moreover, $A_{\pi,\pi}$ permutes the rows and columns of the adjacency matrix $A$ according to the permutation $\pi \in \Pi$; similarly, $F_{\pi}$ permutes the rows of the feature matrix $F$. The authors showed that the above architecture is more expressive in terms of distinguishing non-isomorphic graphs than the WL algorithm, and proposed sampling-based techniques to speed up the computation.
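A direct, exponential-cost sketch of Equation (9.4) is given below; using a small MLP on the flattened matrices as the function g is an illustrative choice (unlike a GNN, it is not permutation-invariant on its own), and the approach is only feasible for very small graphs, which is why the authors propose sampling permutations.

```python
import itertools
import torch
import torch.nn as nn

def relational_pool(A, F, g):
    """Average g over all permutations of the graph (Eq. 9.4); exponential in |V|,
    so only usable for tiny graphs or with sampled permutations."""
    n = A.size(0)
    eye = torch.eye(n)                                  # unique node identifiers I_{|V|}
    reps = []
    for perm in itertools.permutations(range(n)):
        idx = torch.tensor(perm)
        A_p = A[idx][:, idx]                            # permute rows and columns of A
        F_p = torch.cat([F[idx], eye], dim=1)           # [F_pi, I_{|V|}]
        reps.append(g(A_p, F_p))
    return torch.stack(reps).mean(dim=0)

# toy usage: g flattens both inputs and applies an MLP
n, d, out = 4, 3, 8
mlp = nn.Sequential(nn.Linear(n * n + n * (d + n), 32), nn.ReLU(), nn.Linear(32, out))
g = lambda A_p, F_p: mlp(torch.cat([A_p.reshape(-1), F_p.reshape(-1)]))
A = torch.bernoulli(torch.full((n, n), 0.5))
A = ((A + A.t()) > 0).float()
F = torch.randn(n, d)
graph_repr = relational_pool(A, F, g)                   # shape (8,)
```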
Bianchi et al (2020) introduced a pooling layer based on spectral clustering (von Luxburg, 2007). To that end, they train a GNN together with an MLP, followed by a softmax function, against an approximation of a relaxed version of the $k$-way normalized min-cut problem (Shi and Malik, 2000). The resulting cluster assignment matrix $S$ is used in the same way as in Section 9.3.2. The authors evaluated their approach on standard, small-scale benchmark datasets, showing promising performance, especially compared to the DiffPool layer. For another pooling layer based on spectral clustering, see (Ma et al, 2019d).
9.4 Limitations of Graph Neural Networks for Graph Classification

In the following, we briefly survey the limitations of GNNs and how their expressive power is upper-bounded by the Weisfeiler-Leman method (Weisfeiler and Leman, 1968; Weisfeiler, 1976; Grohe, 2017). Concretely, a recent line of work by Morris et al (2020b); Xu et al (2019d); Maron et al (2019a) connects the power or expressivity of GNNs to that of the WL algorithm. The results show that GNN architectures generally do not have more power to distinguish between non-isomorphic graphs than the WL algorithm. That is, for any graph structure that the WL algorithm cannot distinguish, any possible GNN with any possible choice of parameters will also not be able to distinguish it. On the positive side, the second result states that there is a sequence of parameter initializations such that GNNs have the same power in distinguishing non-isomorphic (sub-)graphs as the WL algorithm, see also Equation (9.3).
However, the WL algorithm has many shortcomings, see (Arvind et al, 2015; Kiefer et al, 2015): e.g., it cannot distinguish between cycles of different lengths, an important property of chemical molecules, and it is not able to distinguish between graphs with different triangle counts, an important property of social networks.
To address this, many recent works tried to build provably more expressive GNNs for graph classification. For example, in (Morris et al, 2020b; Maron et al, 2019b, 2018) the authors proposed higher-order GNN architectures that have the same expressive power as the $k$-dimensional Weisfeiler-Leman algorithm ($k$-WL), which is, as $k$ grows, a more expressive generalization of the WL algorithm. In the following, we give an overview of such works.
The first GNN architecture that overcame the limitations of the WL algorithm was proposed in (Morris et al, 2020b). Specifically, they introduced so-called $k$-GNNs, which work by learning features over the set of subgraphs on $k$ nodes instead of individual vertices, defining a notion of neighborhood between these subgraphs. Formally, for a given $k$, they consider all $k$-element subsets $[V]^k$ over $V$. Let $s = \{s_1, \ldots, s_k\}$ be a $k$-set in $[V]^k$; then they define the neighborhood of $s$ as
$$N(s) = \{\, t \in [V]^k \mid |s \cap t| = k - 1 \,\}.$$
The local neighborhood $N_L(s)$ consists of all $t$ in $N(s)$ such that $(v, w) \in E$ for the unique $v \in s \setminus t$ and the unique $w \in t \setminus s$. The global neighborhood $N_G(s)$ is then defined as $N(s) \setminus N_L(s)$.
Based on this neighborhood definition, one can generalize most GNN layers for vertex embeddings to more expressive subgraph embeddings. Given a graph $G$, a feature for a subgraph $s$ can be computed as
$$f^t_k(s) = \sigma\Big( f^{t-1}_k(s) \cdot W^t_1 + \sum_{u \in N_L(s) \cup N_G(s)} f^{t-1}_k(u) \cdot W^t_2 \Big). \qquad (9.5)$$
In their experiments, the authors resort to summing only over the local neighborhood for better scalability and generalization, showing a significant boost over standard GNNs on a quantum chemistry benchmark dataset (Wu et al, 2018; Ramakrishnan et al, 2014).
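The neighborhood construction over k-element subsets and the update of Equation (9.5) can be sketched as follows for small graphs; the explicit enumeration of all subsets and all names are illustrative and do not reflect the more scalable variants discussed below.

```python
import itertools
import torch

def k_set_neighborhoods(edges, n, k):
    """Enumerate all k-element node subsets and their local/global neighborhoods.

    Two k-sets s, t are neighbors if |s intersect t| = k - 1; the neighbor is
    'local' if the two differing nodes are adjacent in the graph, else 'global'."""
    edge_set = {frozenset(e) for e in edges}
    subsets = [frozenset(c) for c in itertools.combinations(range(n), k)]
    index = {s: i for i, s in enumerate(subsets)}
    local = {i: [] for i in index.values()}
    global_ = {i: [] for i in index.values()}
    for s, t in itertools.combinations(subsets, 2):
        if len(s & t) == k - 1:
            v, w = next(iter(s - t)), next(iter(t - s))   # the unique differing nodes
            target = local if frozenset((v, w)) in edge_set else global_
            target[index[s]].append(index[t])
            target[index[t]].append(index[s])
    return subsets, local, global_

def k_gnn_layer(F, local, global_, W1, W2):
    """Feature update of Equation (9.5) over the k-sets, summing over both neighborhoods."""
    out = torch.zeros(F.size(0), W2.size(1))
    for i in range(F.size(0)):
        nbrs = local[i] + global_[i]
        agg = F[nbrs].sum(dim=0) if nbrs else torch.zeros(F.size(1))
        out[i] = torch.sigmoid(F[i] @ W1 + agg @ W2)
    return out

# toy usage: a 4-node path graph, k = 2
edges = [(0, 1), (1, 2), (2, 3)]
subsets, local, global_ = k_set_neighborhoods(edges, n=4, k=2)
F = torch.randn(len(subsets), 6)                    # initial k-set features
F1 = k_gnn_layer(F, local, global_, torch.randn(6, 6), torch.randn(6, 6))
```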
The latter approach was refined in (Maron et al, 2019a) and (Morris et al, 2019). Specifically, based on (Maron et al, 2018), Maron et al (2019a) derived an architecture based on standard matrix multiplication that has at least the same power as the 3-WL. Morris et al (2019) proposed a variant of the $k$-WL that, unlike the original algorithm, takes the sparsity of the underlying graph into account. Moreover, they showed that the derived sparse variant is slightly more powerful than the $k$-WL in distinguishing non-isomorphic graphs and proposed a neural architecture with the same power as the sparse $k$-WL variant.
An important direction in studying graph representations' expressive power was taken by (Chen et al, 2019f). The authors prove that a graph representation can approximate a function $f$ if and only if it can distinguish all pairs of non-isomorphic graphs $G$ and $H$ with $f(G) \neq f(H)$. With that in mind, they established an equivalence between the set of pairs of graphs a representation can distinguish and the space of functions it can approximate, further introducing a variation of the 2-WL.
Bouritsas et al (2020) enhanced the expressivity of GNNs by annotating node features with subgraph information. Specifically, by fixing a set of predefined, small subgraphs, they annotated each node with its role, formally its automorphism type, in these subgraphs, showing promising performance gains on standard benchmark datasets for graph classification.
Beaini et al (2020) studied how to incorporate directional information into GNNs. Finally, You et al (2021) enhanced GNNs by uniquely coloring central vertices and used two types of message functions to surpass the expressive power of the 1-WL, while Sato et al (2021) and Abboud et al (2020) used random features to achieve the same goal and additionally studied the universality properties of their derived architectures.
9.5 Applications of Graph Neural Networks for Graph Classification

In the following, we highlight some application areas of GNNs for graph classification, focusing on the molecular domain. One of the most promising applications of GNNs for graph classification is pharmaceutical drug research; see (Gaudelet et al, 2020) for an overview. In this direction, a promising approach was proposed by (Stokes et al, 2020). They used a form of directed message-passing neural network operating on molecular graphs to identify repurposing candidates for antibiotic development. Moreover, they validated their predictions in vivo, proposing suitable repurposing candidates different from known ones.
Schweidtmann et al (2020) used 2-GNNs, see Equation (9.5), to derive GNN models for predicting three fuel ignition quality indicators, namely the derived cetane number, the research octane number, and the motor octane number, of oxygenated and non-oxygenated hydrocarbons, indicating that the higher-order layers of Equation (9.5) provide significant gains over standard GNNs in the molecular learning domain.
A general, principled GNN for the molecular domain, denoted DimeNet, was introduced by (Klicpera et al, 2020). By using an edge-based architecture, they compute a message coefficient between atoms based on their relative positioning in 3D space.
9.6 Benchmark Datasets

Since most developments in GNNs are driven empirically, i.e., based on evaluations on standard benchmark datasets, meaningful benchmark datasets are crucial for the development of GNNs in the context of graph classification. Hence, the research community has established several widely used repositories of benchmark datasets for graph classification. Two such repositories are worth highlighting here. First, the TUDataset collection (Morris et al, 2020a), provided at www.graphlearning.io, contains over 130 datasets of various sizes from areas such as chemistry, biology, and social networks. Moreover, it provides Python-based data loaders and baseline implementations of standard graph kernels and GNNs. Further, the datasets can be easily accessed from well-known GNN implementation frameworks such as the Deep Graph Library (Wang et al, 2019f), PyTorch Geometric (Fey and Lenssen, 2019), or Spektral (Grattarola and Alippi, 2020). Secondly, the OGB collection (Hu et al, 2020) contains many large-scale graph classification benchmark datasets, e.g., from chemistry and code analysis, together with data loaders, prespecified splits, and evaluation protocols. Finally, Wu et al (2018) also provide many large-scale datasets from chemo- and bioinformatics.
9.7 Summary