Chapter 4
Graph Neural Networks for Node Classification

Jian Tang
Mila-Quebec AI Institute, HEC Montreal, e-mail: jian.tang@hec.ca
Renjie Liao
University of Toronto, e-mail: rjliao@cs.toronto.edu
Abstract Graph neural networks are neural architectures specifically designed for graph-structured data. They have been receiving increasing attention and have been applied to many domains and applications. In this chapter, we focus on a fundamental task on graphs: node classification. We give a detailed definition of node classification and also introduce some classical approaches such as label propagation. Afterwards, we introduce a few representative graph neural network architectures for node classification. We further point out the main difficulty in training deep graph neural networks, namely the over-smoothing problem, and present some of the latest advances along this direction, such as continuous graph neural networks.
Graph-structured data (e.g., social networks, the World Wide Web, and protein-protein interaction networks) are ubiquitous in the real world and cover a wide variety of applications. A fundamental task on graphs is node classification, which tries to classify the nodes into a few predefined categories. For example, in social networks, we may want to predict the political bias of each user; in protein-protein interaction networks, we may be interested in predicting the functional role of each protein; in the World Wide Web, we may want to classify web pages into different semantic categories. The key to making accurate predictions is to learn effective node representations, which largely determine the performance of node classification.
Graph neural networks are neural network architectures specifically designed for learning representations of graph-structured data, including learning node representations of big graphs (e.g., social networks and the World Wide Web) and learning
representations of entire graphs (e.g., molecular graphs). In this chapter, we will
focus on learning node representations for large-scale graphs; learning whole-graph representations will be introduced in other chapters. A variety of graph neu-
ral networks have been proposed (Kipf and Welling, 2017b; Veličković et al, 2018;
Gilmer et al, 2017; Xhonneux et al, 2020; Liao et al, 2019b; Kipf and Welling,
2016; Veličković et al, 2019). In this chapter, we will comprehensively revisit exist-
ing graph neural networks for node classification including supervised approaches
(Sec. 4.2), unsupervised approaches (Sec. 4.3), and a common problem of graph
neural networks for node classification—over-smoothing (Sec. 4.4).
Problem Definition. Let us first formally define the problem of learning node representations for node classification with graph neural networks. Let G = (V, E) denote a graph, where V is the set of nodes and E is the set of edges. A ∈ ℝ^{N×N} is the adjacency matrix, where N is the total number of nodes, and X ∈ ℝ^{N×C} is the node attribute matrix, where C is the number of features for each node. The goal of graph neural networks is to learn effective node representations (denoted as H ∈ ℝ^{N×F}, where F is the dimension of the node representations) by combining the graph structure information and the node attributes, which are then used for node classification.
Concept                                      Notation
Graph                                        G = (V, E)
Adjacency matrix                             A ∈ ℝ^{N×N}
Node attributes                              X ∈ ℝ^{N×C}
Total number of GNN layers                   K
Node representations at the k-th layer       H^k ∈ ℝ^{N×F}, k ∈ {1, 2, ..., K}
The essential idea of graph neural networks is to iteratively update the node representations by combining the representations of their neighbors with their own representations. In this section, we introduce the general framework of graph neural networks described in (Xu et al, 2019d). Starting from the initial node representations H^0 = X, in each layer we have two important functions:
• AGGREGATE, which tries to aggregate the information from the neighbors of
each node;
• COMBINE, which tries to update the node representations by combining the
aggregated information from neighbors with the current node representations.
Mathematically, we can define the general framework of graph neural networks as follows:

Initialization: H^0 = X.

For k = 1, 2, ..., K,

    a_v^k = AGGREGATE^k({H_u^{k-1} : u ∈ N(v)}),
    H_v^k = COMBINE^k(H_v^{k-1}, a_v^k),

where N(v) is the set of neighbors of the v-th node. The node representations H^K in the last layer can be treated as the final node representations.
Once we have the node representations, they can be used for downstream tasks. Taking node classification as an example, the label of node v (denoted as ŷ_v) can be predicted by applying a Softmax function to its final representation, i.e., ŷ_v = Softmax(W H_v^K), where W is a weight matrix that maps the node representation into the label space.
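To make the framework concrete, here is a minimal sketch (NumPy; the mean aggregator, the additive combine step, and all variable names are illustrative choices of ours, not the formulation of any particular paper) of the AGGREGATE/COMBINE loop followed by a softmax readout:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gnn_forward(A, X, weights):
    """Generic message passing: AGGREGATE over neighbors, then COMBINE.

    A: (N, N) adjacency matrix, X: (N, C) node features,
    weights: list of (F_in, F_out) matrices, one per layer.
    """
    H = X
    for W in weights:
        # AGGREGATE: mean of the neighbor representations (one simple choice).
        deg = A.sum(axis=1, keepdims=True).clip(min=1.0)
        agg = (A @ H) / deg
        # COMBINE: sum the aggregated message with the node's own representation,
        # then apply a shared linear map and a nonlinearity.
        H = relu((H + agg) @ W)
    return H

# Toy usage: predict labels from the final representations with a softmax classifier.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)    # a small 4-node graph
X = rng.normal(size=(4, 5))                   # C = 5 input features
weights = [rng.normal(size=(5, 8)), rng.normal(size=(8, 8))]
H_K = gnn_forward(A, X, weights)
W_out = rng.normal(size=(8, 3))               # 3 classes
y_hat = softmax(H_K @ W_out)                  # each row sums to one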
We start from the graph convolutional network (GCN) (Kipf and Welling, 2017b), which is now the most popular graph neural network architecture due to its simplicity and effectiveness in a variety of tasks and applications. Specifically, the node representations in each layer are updated according to the following propagation rule:

    H^{k+1} = σ(D̃^{-1/2} Ã D̃^{-1/2} H^k W^k).    (4.5)

Here Ã = A + I is the adjacency matrix of the given undirected graph G with self-connections added, which allows each node to incorporate its own features when updating its representation. I ∈ ℝ^{N×N} is the identity matrix. D̃ is a diagonal matrix with D̃_ii = Σ_j Ã_ij. σ(·) is an activation function such as ReLU or Tanh; the ReLU activation function, defined as ReLU(x) = max(0, x), is the most widely used. W^k ∈ ℝ^{F×F'} (where F and F' are the dimensions of the node representations in the k-th and (k+1)-th layers, respectively) is a layer-wise linear transformation matrix, which is trained during optimization.
We can further dissect Eqn. (4.5) to understand the AGGREGATE and COMBINE functions defined in GCN. For a node i, the node updating equation can be reformulated as below:

    H_i^k = σ( Σ_{j∈N(i)∪{i}} (Ã_ij / √(D̃_ii D̃_jj)) H_j^{k-1} W^k )    (4.6)

    H_i^k = σ( Σ_{j∈N(i)} (A_ij / √(D̃_ii D̃_jj)) H_j^{k-1} W^k + (1/D̃_ii) H_i^{k-1} W^k ).    (4.7)

From Eqn. (4.7), we can see that the AGGREGATE function is defined as a weighted average of the neighbor node representations, where the weight for neighbor j is determined by the weight of the edge between i and j (i.e., A_ij normalized by the degrees of the two nodes). The COMBINE function is defined as the summation of the aggregated messages and the node representation itself, in which the node representation is normalized by its own degree.
GCN can also be motivated from spectral graph convolutions. The spectral convolution of a signal x ∈ ℝ^N with a filter g_θ = diag(θ), parameterized by θ ∈ ℝ^N, is defined as

    g_θ ⋆ x = U g_θ U^⊤ x,    (4.8)

where U is the matrix of eigenvectors of the normalized graph Laplacian matrix L = I_N − D^{-1/2} A D^{-1/2} = U Λ U^⊤, Λ is a diagonal matrix of eigenvalues, and U^⊤ x is the graph Fourier transform of the input signal x. In practice, g_θ can be understood as a function of the eigenvalues of the normalized graph Laplacian, i.e., g_θ(Λ). Directly calculating Eqn. (4.8) is, however, computationally very expensive, since its cost is quadratic in the number of nodes N. According to (Hammond et al, 2011), this problem can be circumvented by approximating the function g_θ(Λ) with a truncated expansion of Chebyshev polynomials T_k(x) up to the K-th order:
    g_{θ'}(Λ) ≈ Σ_{k=0}^K θ'_k T_k(Λ̃),    (4.9)

where Λ̃ = (2/λ_max) Λ − I, and λ_max is the largest eigenvalue of L. θ' ∈ ℝ^K is the vector of Chebyshev coefficients. The Chebyshev polynomials T_k(x) are recursively defined as T_k(x) = 2x T_{k−1}(x) − T_{k−2}(x), with T_0(x) = 1 and T_1(x) = x. By combining Eqn. (4.9) and Eqn. (4.8), the convolution of a signal x with a filter g_{θ'} can be reformulated as below:

    g_{θ'} ⋆ x ≈ Σ_{k=0}^K θ'_k T_k(L̃) x,    (4.10)

where L̃ = (2/λ_max) L − I_N. From this equation, we can see that each node only depends on the information within its K-th order neighborhood. The overall complexity of evaluating Eqn. (4.10) is O(|E|) (i.e., linear in the number of edges of the original graph G), which is very efficient.
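The Chebyshev-filtered convolution of Eqn. (4.10) can be sketched as follows (NumPy; dense matrices for clarity, and λ_max is computed exactly here, whereas in practice it is often approximated):

import numpy as np

def chebyshev_conv(A, x, theta):
    """Approximate spectral convolution g_theta' * x via Eqn. (4.10).

    A: (N, N) adjacency matrix, x: (N,) signal, theta: (K+1,) Chebyshev coefficients.
    """
    N = A.shape[0]
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt           # normalized graph Laplacian
    lam_max = np.linalg.eigvalsh(L).max()                 # exact here; often approximated
    L_tilde = (2.0 / lam_max) * L - np.eye(N)             # spectrum rescaled to [-1, 1]

    # Chebyshev recursion applied to the signal:
    # T_0(L~) x = x, T_1(L~) x = L~ x, T_k = 2 L~ T_{k-1} - T_{k-2}.
    T_prev, T_curr = x, L_tilde @ x
    out = theta[0] * T_prev
    if len(theta) > 1:
        out = out + theta[1] * T_curr
    for k in range(2, len(theta)):
        T_next = 2.0 * (L_tilde @ T_curr) - T_prev
        out = out + theta[k] * T_next
        T_prev, T_curr = T_curr, T_next
    return out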
To define a neural network based on graph convolutions, one can stack multiple convolutional layers defined according to Eqn. (4.10), each followed by a nonlinear transformation. At each layer, instead of being limited to the explicit parametrization given by the Chebyshev polynomials in Eqn. (4.10), the authors of GCN proposed to limit the number of convolutions to K = 1. By doing so, each layer defines only a linear function over the graph Laplacian matrix L; however, by stacking multiple such layers, we can still cover a rich class of convolutional filter functions on graphs. Intuitively, such a model alleviates the problem of overfitting to local neighborhood structures for graphs whose node degree distribution has a high variance, such as social networks, the World Wide Web, and citation networks.
At each layer, we can further approximate λ_max ≈ 2, which can be accommodated by the neural network parameters during training. Based on all these simplifications, we have

    g_{θ'} ⋆ x ≈ θ'_0 x + θ'_1 (L − I_N) x = θ'_0 x − θ'_1 D^{-1/2} A D^{-1/2} x,    (4.11)

where θ'_0 and θ'_1 are two free parameters, which can be shared over the entire graph. In practice, we can further reduce the number of parameters, which helps to reduce overfitting and also minimizes the number of operations per layer. As a result, the following expression is obtained:

    g_θ ⋆ x ≈ θ (I_N + D^{-1/2} A D^{-1/2}) x,    (4.12)
where θ = θ'_0 = −θ'_1. One potential issue is that the matrix I_N + D^{-1/2} A D^{-1/2} has eigenvalues in the interval [0, 2]. In a deep graph convolutional network, repeated application of this operator will likely lead to exploding or vanishing gradients, yielding numerical instabilities. As a result, we can further renormalize this matrix by converting I_N + D^{-1/2} A D^{-1/2} into D̃^{-1/2} Ã D̃^{-1/2}, where Ã = A + I and D̃_ii = Σ_j Ã_ij.
In the above, we only considered the case of a single feature channel and a single filter. This can be easily generalized to an input signal with C channels, X ∈ ℝ^{N×C}, and F filters (or hidden units) as follows:

    H = D̃^{-1/2} Ã D̃^{-1/2} X W,    (4.13)

where W ∈ ℝ^{C×F} is a matrix of filter parameters and H is the convolved signal matrix.
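As an illustration, a minimal sketch of the renormalized propagation rule of Eqn. (4.13), stacked into a two-layer GCN as in (Kipf and Welling, 2017b), might look as follows (NumPy; the helper names are ours, not part of any library):

import numpy as np

def normalized_adjacency(A):
    """Compute D_tilde^{-1/2} A_tilde D_tilde^{-1/2} with A_tilde = A + I."""
    A_tilde = A + np.eye(A.shape[0])
    d_tilde = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_layer(A_hat, H, W, activation=True):
    """One GCN layer: H' = sigma(A_hat H W), cf. Eqns. (4.5) and (4.13)."""
    out = A_hat @ H @ W
    return np.maximum(out, 0.0) if activation else out

# A two-layer GCN for node classification (softmax over the logits omitted here).
rng = np.random.default_rng(0)
N, C, F, num_classes = 4, 5, 8, 3
A = (rng.random((N, N)) < 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T                  # undirected graph, no self-loops yet
X = rng.normal(size=(N, C))
A_hat = normalized_adjacency(A)
H1 = gcn_layer(A_hat, X, rng.normal(size=(C, F)))
logits = gcn_layer(A_hat, H1, rng.normal(size=(F, num_classes)), activation=False)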
Graph Attention Layer. The graph attention layer defines how to transform the hidden node representations at layer k−1 (denoted as H^{k−1} ∈ ℝ^{N×F}) into the new node representations H^k ∈ ℝ^{N×F'}. In order to guarantee sufficient expressive power to transform the lower-level node representations into higher-level ones, a shared linear transformation W ∈ ℝ^{F'×F} is applied to every node. Afterwards, self-attention is defined on the nodes, which measures the attention coefficient for any pair of nodes through a shared attentional mechanism a : ℝ^{F'} × ℝ^{F'} → ℝ:

    e_ij = a(W H_i^{k−1}, W H_j^{k−1}).    (4.14)

e_ij indicates the strength of the relationship between nodes i and j. Note that in this subsection we use H_i^{k−1} to represent a column-wise vector instead of a row-wise vector.
For each node, we could in principle allow it to attend to every other node in the graph, but this would ignore the graph structural information. A more reasonable solution is to attend only to the neighbors of each node; in practice, only the first-order neighbors are used (including the node itself). To make the coefficients comparable across different nodes, the attention coefficients are usually normalized with the softmax function:

    α_ij = Softmax_j({e_ij}) = exp(e_ij) / Σ_{l∈N(i)} exp(e_il).    (4.15)

We can see that, for a node i, α_ij essentially defines a multinomial distribution over its neighbors, which can also be interpreted as the transition probability from node i to each of its neighbors.
In the work by Veličković et al (2018), the attention mechanism a is defined as a single-layer feedforward neural network, consisting of a linear transformation with weight vector W_2 ∈ ℝ^{1×2F'} and a LeakyReLU nonlinear activation function (with negative-input slope α = 0.2). More specifically, the attention coefficients are calculated as

    α_ij = exp(LeakyReLU(W_2 [W H_i^{k−1} || W H_j^{k−1}])) / Σ_{l∈N(i)} exp(LeakyReLU(W_2 [W H_i^{k−1} || W H_l^{k−1}])),    (4.16)
where || represents the concatenation of two vectors. The new node representation is a linear combination of the neighboring node representations, with the weights determined by the attention coefficients (followed by a potential nonlinear transformation), i.e.,

    H_i^k = σ( Σ_{j∈N(i)} α_ij W H_j^{k−1} ).    (4.17)
Multi-head Attention. In practice, instead of using a single attention mechanism, multi-head attention can be used, where each head determines a different similarity function over the nodes. For each attention head, we independently obtain a new node representation according to Eqn. (4.17). The final node representation is then a concatenation of the node representations learned by the different attention heads. Mathematically, we have

    H_i^k = ||_{t=1}^T σ( Σ_{j∈N(i)} α_ij^t W^t H_j^{k−1} ),    (4.18)

where T is the total number of attention heads, α_ij^t is the attention coefficient calculated by the t-th attention head, and W^t is the linear transformation matrix of the t-th attention head.
As mentioned in the paper by Veličković et al (2018), in the final layer, when combining the node representations from different attention heads, other pooling techniques can be used instead of concatenation, e.g., simply taking the average of the node representations from the different attention heads:

    H_i^k = σ( (1/T) Σ_{t=1}^T Σ_{j∈N(i)} α_ij^t W^t H_j^{k−1} ).    (4.19)
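The following sketch (NumPy, dense attention over a single graph; the helper names are ours) illustrates how the attention coefficients of Eqn. (4.16) and the aggregation of Eqn. (4.17) can be computed; multi-head attention as in Eqn. (4.18) simply repeats this with separate parameters and concatenates the results:

import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(A, H, W, a):
    """Single-head graph attention layer (cf. Eqns. (4.14)-(4.17)).

    A: (N, N) adjacency, H: (N, F) inputs, W: (F, F') shared transform,
    a: (2F',) attention weight vector (W_2 in the text).
    """
    N = A.shape[0]
    Wh = H @ W                                    # shared linear transform, shape (N, F')
    Fp = Wh.shape[1]
    # e_ij = LeakyReLU(a^T [W h_i || W h_j]), computed for all pairs at once.
    e = leaky_relu((Wh @ a[:Fp])[:, None] + (Wh @ a[Fp:])[None, :])
    # Attend only to first-order neighbors, including the node itself.
    mask = (A + np.eye(N)) > 0
    e = np.where(mask, e, -np.inf)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = np.where(mask, alpha, 0.0)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)   # row-wise softmax, Eqn. (4.16)
    return np.maximum(alpha @ Wh, 0.0)                  # Eqn. (4.17) with ReLU as sigma

# Multi-head attention (Eqn. (4.18)): run several heads and concatenate.
# H_new = np.concatenate([gat_layer(A, H, W_t, a_t) for (W_t, a_t) in heads], axis=1)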
Another very popular graph neural network architecture is the Message Passing Neural Network (MPNN) (Gilmer et al, 2017), which was originally proposed for learning molecular graph representations. However, MPNN is actually very general: it provides a general framework of graph neural networks and can be used for node classification as well. The essential idea of MPNN is to formulate existing graph neural networks as a general framework of neural message passing among nodes. In MPNNs, there are two important functions, the message function and the updating function:

    m_i^k = Σ_{j∈N(i)} M_k(H_i^{k−1}, H_j^{k−1}, e_ij),    (4.20)

    H_i^k = U_k(H_i^{k−1}, m_i^k).    (4.21)

M_k(·,·,·) defines the message between nodes i and j in the k-th layer, which depends on the two node representations and the information of the edge between them. U_k is the node updating function in the k-th layer, which combines the aggregated messages from the neighbors with the node representation itself. We can see that the MPNN framework is very similar to the general framework introduced in Section 4.2.1: the AGGREGATE function defined here is simply a summation of all the messages from the neighbors, and the COMBINE function is the same as the node updating function.
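A minimal sketch of one message-passing step (Eqns. (4.20)-(4.21)) is shown below; the concrete message and update functions (a linear map of the concatenated inputs and a tanh update) are illustrative choices of ours, not those of Gilmer et al (2017):

import numpy as np

def mpnn_step(A, H, E, W_msg, W_upd):
    """One MPNN step, cf. Eqns. (4.20)-(4.21).

    A: (N, N) adjacency, H: (N, F) node states, E: (N, N, Fe) edge features,
    W_msg: (2F + Fe, F) message weights, W_upd: (2F, F) update weights.
    """
    N, F = H.shape
    m = np.zeros((N, F))
    for i in range(N):
        for j in range(N):
            if A[i, j] > 0:
                # Message M_k(H_i, H_j, e_ij): linear map of the concatenated inputs.
                inp = np.concatenate([H[i], H[j], E[i, j]])
                m[i] += inp @ W_msg
    # Update U_k(H_i, m_i): linear map of [H_i || m_i] followed by a nonlinearity.
    return np.tanh(np.concatenate([H, m], axis=1) @ W_upd)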
The above graph neural networks iteratively update the node representations with different kinds of graph convolutional layers. Essentially, these approaches model the discrete dynamics of node representations. The continuous graph neural network (CGNN) (Xhonneux et al, 2020) instead characterizes the continuous dynamics of node representations. Its first variant, which treats the feature channels independently (cf. Model 2 below), builds on the following discrete propagation scheme (here A denotes a regularized adjacency matrix; see the discussion of the diffusion hyperparameter α below):

    H^{k+1} = A H^k + H^0,    (4.23)

where H^0 = X or the output of an encoder applied to the input features X. Intuitively, at each step, the new node representation is a linear combination of its neighboring node representations as well as the initial node features. Such a mechanism allows modeling information propagation on the graph without forgetting the initial node features. We can unroll Eqn. (4.23) and explicitly derive the node representations at the k-th step:

    H^k = Σ_{i=0}^k A^i H^0 = (A − I)^{-1} (A^{k+1} − I) H^0.    (4.24)
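The closed form in Eqn. (4.24) is simply the matrix geometric series Σ_{i=0}^k A^i = (A − I)^{-1}(A^{k+1} − I), which holds whenever A − I is invertible; a quick numerical sanity check on a random toy matrix:

import numpy as np

rng = np.random.default_rng(0)
N, k = 5, 6
A = 0.3 * rng.random((N, N))          # a contractive "propagation" matrix for the check
H0 = rng.normal(size=(N, 3))

unrolled = sum(np.linalg.matrix_power(A, i) @ H0 for i in range(k + 1))
closed = np.linalg.solve(A - np.eye(N), (np.linalg.matrix_power(A, k + 1) - np.eye(N)) @ H0)
assert np.allclose(unrolled, closed)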
Since Eqn. (4.23) effectively models the discrete dynamics of the node representations, the CGNN model further extends it to the continuous setting, which is characterized by the following ordinary differential equation (ODE):

    dH_t/dt = log A · H_t + X,    (4.25)

with the initial value H_0 = (log A)^{-1} (A − I) X, where X is the initial node feature matrix or the output of an encoder applied to it. We do not provide the proof here; more details can be found in the original paper (Xhonneux et al, 2020). In Eqn. (4.25), log A is intractable to compute in practice, so it is approximated with the first-order Taylor expansion, i.e., log A ≈ A − I. Putting all of this together, we obtain the following ODE:

    dH_t/dt = (A − I) H_t + X,    (4.26)

with the initial value H_0 = X, which is the first variant of the CGNN model.
The CGNN model is actually very intuitive and has a nice connection with traditional epidemic models, which study the dynamics of infection in a population. Epidemic models usually assume that the infection status of people is affected by three factors: infection from neighbors, natural recovery, and the natural characteristics of people. If we treat H_t as the number of people infected at time t, then these three factors are naturally modeled by the three terms in Eqn. (4.26): A H_t for the infection from neighbors, −H_t for the natural recovery, and the last term X for the natural characteristics of people.
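To make Eqn. (4.26) concrete, the sketch below integrates the ODE with simple forward-Euler steps; the Euler solver and the particular normalization of A are illustrative choices of ours (Xhonneux et al (2020) use more sophisticated ODE solvers and a regularized adjacency matrix):

import numpy as np

def cgnn_euler(A, X, t_end=1.0, step=0.01):
    """Integrate dH/dt = (A - I) H + X with H(0) = X (Eqn. (4.26)) by forward Euler."""
    N = A.shape[0]
    H = X.copy()
    for _ in range(int(t_end / step)):
        dH = (A - np.eye(N)) @ H + X   # infection from neighbors, recovery, and source term
        H = H + step * dH
    return H

# Example with a symmetrically normalized adjacency as a stand-in for the regularized A.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
d = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(d, d))
X = rng.normal(size=(3, 4))
H_t = cgnn_euler(A_norm, X, t_end=1.0)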
Model 2: Modeling the Interaction of Feature Channels. The above model assumes that the different node feature channels are independent of each other, which is a very strong assumption and limits the capacity of the model. Inspired by the success of a linear variant of graph neural networks (i.e., Simple GCN (Wu et al, 2019a)), a more powerful discrete node dynamics model is proposed, which allows different feature channels to interact with each other:

    H^{k+1} = A H^k W + H^0,    (4.27)

where W ∈ ℝ^{F×F} is a weight matrix used to model the interactions between different feature channels. Similarly, we can extend the above discrete dynamics to the continuous case, yielding the following equation:

    dH_t/dt = (A − I) H_t + H_t (W − I) + X,    (4.28)

with the initial value H_0 = X. This is the second variant of CGNN, with trainable weights. ODEs of the form defined in Eqn. (4.28) have been studied in the control theory literature and are known as Sylvester differential equations (Locatelli and Sieniutycz, 2002). The two matrices A − I and W − I characterize the natural solution of the system, while X is the information provided to the system to drive it into the desired state.
Discussion. The proposed continuous graph neural network (CGNN) has multiple nice properties: (1) Recent work has shown that if we increase the number of layers K in discrete graph neural networks, the learned node representations tend to suffer from over-smoothing (which we will introduce in detail later) and hence lose their expressiveness. On the contrary, continuous graph neural networks make it possible to train very deep graph neural networks and are experimentally robust to arbitrarily chosen integration times. (2) For some tasks on graphs, it is critical to model the long-range dependencies between nodes, which requires training deep GNNs. Existing discrete GNNs fail to train very deep GNNs due to the over-smoothing problem, whereas CGNNs are able to effectively model long-range dependencies between nodes thanks to their stability w.r.t. time. (3) The hyperparameter α, which controls the rate of diffusion, is very important. Specifically, it controls the rate at which high-order powers of the regularized matrix A vanish. In the work by Xhonneux et al (2020), the authors proposed to learn a different value of α for each node, which allows choosing the best diffusion rate for each node.
Recall the one-layer graph convolution operator used in GCNs (Kipf and Welling, 2017b), H = LHW, where L = D̃^{-1/2} Ã D̃^{-1/2}. Here we drop the superscript of the layer index to avoid a clash with the notation of the matrix power. There are two main issues with this simple graph convolution formulation. First, one such graph convolutional layer only propagates information from any node to its nearest neighbors, i.e., neighboring nodes that are one hop away. If one would like to propagate information to M-hop-away neighbors, one has to either stack M graph convolutional layers or compute the graph convolution with the M-th power of the graph Laplacian, i.e., H = σ(L^M H W). When M is large, the solution of stacking layers makes the whole GCN model very deep, thus causing learning problems such as vanishing gradients; this is similar to what people experienced in training very deep feedforward neural networks. For the matrix power solution, naively computing the M-th power of the graph Laplacian is also very costly (e.g., the time complexity is O(N^3(M−1)) for graphs with N nodes). Second, there are no learnable parameters in GCNs associated with the graph Laplacian L (corresponding to the connectivities/structures); the only learnable parameter W is a linear transform applied to every node simultaneously, which is not aware of the structure. Note that we typically associate learnable weights with edges when applying convolution to regular graphs like grids (e.g., applying 2D convolution to images), which greatly improves the expressiveness of the model. However, it is not clear how one can add learnable parameters to the graph Laplacian L, since its size varies from graph to graph.
Fig. 4.1: The inference procedure of Lanczos Networks. The approximated top eigenvalues {r_k} and eigenvectors {v_k} are computed by the Lanczos algorithm. Note that this step is only needed once per graph. The long-range/scale (top blocks) graph convolutions are efficiently computed via the low-rank approximation of the graph Laplacian. One can control the ranges (i.e., the exponents of the eigenvalues) as hyperparameters. Learnable spectral filters are applied to the approximated top eigenvalues {r_k}. The short-range/scale (bottom blocks) graph convolution is the same as in GCNs. Adapted from Figure 1 of (Liao et al, 2019b).
To address these issues, the Lanczos network (Liao et al, 2019b) uses the M-step Lanczos algorithm (Lanczos, 1950) (listed in Alg. 1) to compute an orthogonal matrix Q and a symmetric tridiagonal matrix T such that Q^⊤ L Q = T. We denote Q = [q_1, ..., q_M], where the column vector q_i is the i-th Lanczos vector. Note that M can be much smaller than the number of nodes N. T has the tridiagonal form

    T = \begin{bmatrix} γ_1 & β_1 & & \\ β_1 & γ_2 & \ddots & \\ & \ddots & \ddots & β_{M-1} \\ & & β_{M-1} & γ_M \end{bmatrix}.    (4.29)
After obtaining the tridiagonal matrix T, we can compute the Ritz values and Ritz vectors, which approximate the top eigenvalues and eigenvectors of L, by diagonalizing T as T = B R B^⊤, where the M × M diagonal matrix R contains the Ritz values and B ∈ ℝ^{M×M} is an orthogonal matrix. Here, "top" means ranking the eigenvalues by their magnitudes in descending order. This can be implemented via a general eigendecomposition or via fast decomposition methods specialized for tridiagonal matrices. We now have a low-rank approximation of the graph Laplacian matrix, L ≈ V R V^⊤, where V = QB. Denoting the column vectors of V as {v_1, ..., v_M}, we can compute the multi-scale graph convolution as

    H = L̂ H W,
    L̂ = Σ_{m=1}^M f_θ(r_m^{I_1}, r_m^{I_2}, ..., r_m^{I_u}) v_m v_m^⊤,    (4.30)
where {I_1, ..., I_u} is the set of scale/range parameters, which determine how many hops (i.e., how far) one would like to propagate information over the graph. For example, one could set {I_1 = 50, I_2 = 100} (u = 2 in this case) to consider propagating 50 and 100 steps, respectively. Note that one only needs to compute scalar powers rather than the original matrix power. The overall complexity of the Lanczos algorithm in this context is O(MN²), which makes the whole algorithm much more efficient than naively computing the matrix power. Moreover, f_θ is a learnable spectral filter parameterized by θ and can be applied to graphs of varying sizes, since we decouple the graph size from the input size of f_θ. f_θ acts directly on the eigenvalues of the graph Laplacian and greatly improves the expressiveness of the model.
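A simplified sketch of this low-rank, multi-scale filtering is given below. For brevity, the top eigenpairs are obtained with scipy.sparse.linalg.eigsh (which internally relies on Lanczos-type iterations) instead of implementing the M-step Lanczos algorithm and the Ritz decomposition explicitly, and the learnable filter f_θ is replaced by a fixed toy function; both are simplifications of the actual Lanczos network.

import numpy as np
from scipy.sparse.linalg import eigsh

def lowrank_multiscale_conv(L, H, W, scales, M=20):
    """Multi-scale graph convolution through a rank-M approximation of L, cf. Eqn. (4.30).

    L: (N, N) symmetric matrix (e.g., a normalized graph Laplacian or affinity),
    H: (N, F) node features, W: (F, F') weights, scales: e.g. [20, 50]; requires M < N.
    """
    # Top-M eigenpairs by magnitude; the full method uses Ritz values/vectors from Alg. 1.
    r, V = eigsh(L, k=M, which='LM')
    # Scalar powers r_m^{I_1}, ..., r_m^{I_u}; only scalars are exponentiated, never matrices.
    powers = np.stack([r ** s for s in scales], axis=1)    # shape (M, u)
    # Toy spectral filter: average the scalar powers; LanczosNet learns f_theta instead.
    filtered = powers.mean(axis=1)                         # shape (M,)
    L_hat = (V * filtered) @ V.T                           # sum_m f(...) v_m v_m^T
    return L_hat @ H @ W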
Although the Lanczos algorithm provides an efficient way to approximately compute arbitrary powers of the graph Laplacian, it is still a low-rank approximation, which may lose certain information (e.g., the high-frequency part). To alleviate this problem, one can additionally perform a vanilla graph convolution with small scale parameters, i.e., H = L^S H W, where S is a small integer like 2 or 3. The resulting representation can be concatenated with the one obtained from the longer-scale/range graph convolution in Eq. (4.30). Relying on the above design, one can add nonlinearities and stack multiple such layers to build a deep graph convolutional network (namely, Lanczos Networks), just like GCNs. The overall inference procedure of Lanczos Networks is shown in Fig. 4.1. This method demonstrates strong empirical performance.
4.3.1.2 Model
Similar to VAEs, the VGAE model consists of an encoder q_φ(Z|A, X), a decoder p_θ(A|Z), and a prior p(Z).
Encoder. The goal of the encoder is to learn a distribution over the latent variables associated with each node, conditioned on the node features X and the adjacency matrix A. We can instantiate q_φ(Z|A, X) as a graph neural network with learnable parameters φ. In particular, VGAE assumes a node-independent encoder as below:

    q_φ(Z|X, A) = ∏_{i=1}^N q_φ(z_i|X, A),    (4.31)

    q_φ(z_i|X, A) = N(z_i | µ_i, diag(σ_i²)),    (4.32)

    µ, σ = GCN_φ(X, A),    (4.33)

where z_i, µ_i, and σ_i are the i-th rows of the matrices Z, µ, and σ, respectively. Basically, we assume a multivariate Normal distribution with diagonal covariance as the variational approximation to the distribution of the latent vector of each node (i.e., z_i). The mean and diagonal covariance are predicted by the encoder network, i.e., a GCN as described in Section 4.2.2. For example, the original paper uses a two-layer GCN as follows:

    µ = Ã H W_µ,    (4.34)

    σ = Ã H W_σ,    (4.35)

    H = ReLU(Ã X W_0),    (4.36)

where Ã = D^{-1/2} A D^{-1/2} is the symmetrically normalized adjacency matrix and D is the degree matrix. The learnable parameters are thus φ = [W_µ, W_σ, W_0].
Decoder. Given the sampled latent variables, the decoder aims at predicting the connectivities among the nodes. The original paper adopts a simple dot-product-based predictor:

    p(A|Z) = ∏_{i=1}^N ∏_{j=1}^N p(A_ij | z_i, z_j),    (4.37)

    p(A_ij = 1 | z_i, z_j) = σ(z_i^⊤ z_j),    (4.38)

where A_ij denotes the (i, j)-th element of A and σ(·) is the logistic sigmoid function. This decoder again assumes conditional independence among all possible edges, for tractability. Note that there are no learnable parameters associated with this decoder; the only way to improve the performance of the decoder is to learn good latent representations.
Prior. The prior distribution over the latent variables is simply set to independent zero-mean Gaussians with unit variance:

    p(Z) = ∏_{i=1}^N N(z_i | 0, I).    (4.39)

This prior is fixed throughout learning, as in typical VAEs.
Objective & Learning. To learn the encoder and the decoder, one typically maximizes the evidence lower bound (ELBO) as in VAEs:

    L_ELBO = E_{q_φ(Z|X,A)}[log p(A|Z)] − KL(q_φ(Z|X, A) || p(Z)),    (4.40)

where KL(·||·) denotes the Kullback-Leibler divergence.
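The following sketch (NumPy, forward pass only, no training loop) traces the encoder of Eqns. (4.34)-(4.36), the reparameterized sampling, the dot-product decoder of Eqns. (4.37)-(4.38), and an ELBO-style objective as in Eqn. (4.40); the log-σ parameterization and the unweighted reconstruction term are simplifications of ours:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sym_normalize(A):
    """Symmetrically normalized adjacency D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return D_inv_sqrt @ A @ D_inv_sqrt

def vgae_forward(A, X, W0, W_mu, W_sigma, rng):
    """Encoder (two-layer GCN), reparameterized sampling, and dot-product decoder."""
    A_norm = sym_normalize(A)
    H = np.maximum(A_norm @ X @ W0, 0.0)         # Eqn. (4.36)
    mu = A_norm @ H @ W_mu                        # Eqn. (4.34)
    log_sigma = A_norm @ H @ W_sigma              # Eqn. (4.35), parameterized here as log sigma
    Z = mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)   # reparameterization trick
    A_prob = sigmoid(Z @ Z.T)                     # p(A_ij = 1 | z_i, z_j), Eqns. (4.37)-(4.38)
    return mu, log_sigma, Z, A_prob

def elbo(A, mu, log_sigma, A_prob, eps=1e-9):
    """Reconstruction term minus KL(q(Z|X, A) || N(0, I)), cf. Eqn. (4.40)."""
    recon = np.sum(A * np.log(A_prob + eps) + (1.0 - A) * np.log(1.0 - A_prob + eps))
    kl = 0.5 * np.sum(np.exp(2.0 * log_sigma) + mu ** 2 - 1.0 - 2.0 * log_sigma)
    return recon - kl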
4.3.1.3 Discussion
The VGAE model is popular in the literature mainly due to its simplicity and good empirical performance. For example, since there are no learnable parameters in the prior and the decoder, the model is quite lightweight and learning is fast. Moreover, the VGAE model is versatile in that, once we have learned a good encoder, i.e., good latent representations, we can use them for predicting edges (i.e., link prediction), node attributes, and so on. On the other hand, the VGAE model is still limited in the following ways. First, it cannot serve as a good generative model for graphs, as VAEs do for images, since the decoder is not learnable. One could simply design a learnable decoder; however, it is not clear that the goal of learning good latent representations and the goal of generating graphs of good quality are always well aligned, and more exploration along this direction would be fruitful. Second, the independence assumption is exploited in both the encoder and the decoder, which might be very limiting; more structured dependencies (e.g., auto-regressive ones) would be desirable to improve the model capacity. Third, as discussed in the original paper, the prior may potentially be a poor choice. Lastly, for link prediction in practice, one may need to add a weighting of edges vs. non-edges in the decoder term and carefully tune it, since graphs may be very sparse.
Following Mutual Information Neural Estimation (MINE) (Belghazi et al, 2018) and
Deep Infomax (Hjelm et al, 2018), Deep Graph Infomax (Veličković et al, 2019) is
an unsupervised learning framework that learns graph representations via the prin-
ciple of mutual information maximization.
Following the original paper, we will explain the model under the single-graph
setup, i.e., the node feature matrix X and the graph adjacency matrix A of a single
graph are provided as input. Extensions to other problem setups like transductive
and inductive learning settings will be discussed in Section 4.3.2.3. The goal is to
learn the node representations in an unsupervised way. After the node representations are learned, one can apply a simple linear (logistic regression) classifier on top of the representations to perform supervised tasks like node classification.
4.3.2.2 Model
Fig. 4.2: The overall process of Deep Graph Infomax. The top path shows how a positive sample is processed, whereas the bottom path shows the processing of a negative sample. Note that the graph representation is shared for both positive and negative samples. Subgraphs of positive and negative samples do not necessarily need to be different. Adapted from Figure 1 of (Veličković et al, 2019).
The main idea of the model is to maximize the local mutual information between
a node representation (capturing local graph information) and the graph represen-
tation (capturing global graph information). By doing so, the learned node repre-
sentation should capture the global graph information as much as possible. Let us
denote the graph encoder as e which could be any GNN discussed before, e.g., a
two-layer GCN. We can obtain all the node representations as H = e(X, A), where the representation h_i of any node i should contain local information around node i. Specifically, a k-layer GCN is able to leverage node information up to k hops away.
away. To get the global graph information, one could use a readout layer/function
to process all node representations, i.e., s = R(H), where the readout function R
could be some learnable pooling function or simply an average operator.
Objective. Given the local node representation h_i and the global graph representation s, the natural next step is to compute their mutual information. Recall that the definition of mutual information is as follows:

    MI(h, s) = ∫∫ p(h, s) log( p(h, s) / (p(h) p(s)) ) dh ds.    (4.41)
However, maximizing the local mutual information alone is not enough to learn useful representations, as shown in (Hjelm et al, 2018). To develop a more practical objective, the authors of (Veličković et al, 2019) instead use a noise-contrastive objective following Deep Infomax (Hjelm et al, 2018):

    L = (1 / (N + M)) ( Σ_{i=1}^N E_{(X,A)}[log D(h_i, s)] + Σ_{j=1}^M E_{(X̃,Ã)}[log(1 − D(h̃_j, s))] ),    (4.42)
where D is a binary classifier which takes both the node representation hi and the
graph representation s as input and predicts whether the pair (hi , s) comes from the
joint distribution p(h, s) (positive class) or the product of marginals p(hi )p(s) (neg-
ative class). We denote h̃ j as the j-th node representation from the negative sample.
The numbers of positive and negative samples are N and M respectively. We will
explain how to draw positive and negative samples shortly. The overall objective is
thus the negative binary cross-entropy for training a probabilistic classifier. Note that
this objective is the same type of distance as used in generative adversarial networks
(GANs) (Goodfellow et al, 2014b) which is shown to be proportional to the Jensen-
Shannon divergence (Goodfellow et al, 2014b; Nowozin et al, 2016). As verified by
(Hjelm et al, 2018), maximizing the Jensen-Shannon divergence based mutual in-
formation estimator behaves similarly (i.e., they have an approximately monotonic
relationship) to directly maximizing the mutual information. Therefore, maximizing
the objective in Eq. (4.42) is expected to maximize the mutual information. More-
over, the freedom of choosing negative samples makes the method more likely to
learn useful representations than maximizing the vanilla mutual information.
Negative Sampling. To generate the positive samples, one can directly sample a few nodes from the graph to construct the pairs (h_i, s). For the negative samples, one can generate them by corrupting the original graph data, denoted as (X̃, Ã) = C(X, A). In practice, one can choose various forms of this corruption function C. For example, the authors of (Veličković et al, 2019) suggest keeping the adjacency matrix the same and corrupting the node features X by row-wise shuffling. Other possibilities for the corruption function include randomly sampling subgraphs and applying Dropout (Srivastava et al, 2014) to the node features.

Once the positive and negative samples are collected, one can learn the representations by maximizing the objective in Eq. (4.42). We summarize the training process of Deep Graph Infomax as follows (a short code sketch is given after these steps):
1. Sample negative examples via the corruption function (X̃, Ã) ⇠ C (X, A).
2. Compute node representations of positive samples H = {h1 , · · · , hN } = e(X, A).
3. Compute node representations of negative samples H̃ = {h̃1 , · · · , h̃M } = e(X̃, Ã).
4. Compute graph representation via the readout function s = R(H).
5. Update parameters of e, D, and R via gradient ascent to maximize Eq. (4.42).
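A compact sketch of these steps is given below (NumPy; the one-layer GCN encoder, average readout, and bilinear discriminator are simple illustrative choices, and only the objective of Eq. (4.42) is computed, omitting the gradient-ascent update):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dgi_objective(A, X, W_enc, W_disc, rng):
    """Deep Graph Infomax objective of Eq. (4.42) for a single graph (here M = N)."""
    N = X.shape[0]
    A_hat = A + np.eye(N)
    A_hat = A_hat / A_hat.sum(axis=1, keepdims=True)     # simple row normalization

    def encoder(feats):
        # One-layer GCN encoder; any GNN from Section 4.2 could be used instead.
        return np.maximum(A_hat @ feats @ W_enc, 0.0)

    # Positive samples: representations of the original graph, plus the readout.
    H = encoder(X)
    s = H.mean(axis=0)                                    # readout R(H): average pooling
    # Negative samples: corrupt X by row-wise shuffling, keep the adjacency fixed.
    H_tilde = encoder(X[rng.permutation(N)])

    # Bilinear discriminator D(h, s) = sigmoid(h^T W_disc s).
    pos = sigmoid(H @ W_disc @ s)
    neg = sigmoid(H_tilde @ W_disc @ s)
    # Eq. (4.42): average log-likelihood of separating positives from negatives.
    return (np.sum(np.log(pos + 1e-9)) + np.sum(np.log(1.0 - neg + 1e-9))) / (2 * N)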
4.3.2.3 Discussion
4.4 Over-smoothing Problem

Training deep graph neural networks by stacking multiple layers usually yields inferior results, which is a common problem observed across many different graph neural network architectures. This is mainly due to the problem of over-smoothing, which was first explicitly studied in (Li et al, 2018b). Li et al (2018b) showed that the graph convolution of GCNs (Kipf and Welling, 2017b) is a special case of Laplacian smoothing. Repeatedly applying such smoothing makes the representations of connected nodes increasingly similar, so that nodes eventually become indistinguishable and downstream tasks suffer as well. This phenomenon has later been pointed out by a few other works as well, such as (Zhao and Akoglu, 2019; Li et al, 2018b; Xu et al, 2018a; Li et al, 2019c; Rong et al, 2020b).
PairNorm (Zhao and Akoglu, 2019). Next, we present a method called PairNorm for alleviating the problem of over-smoothing when GNNs go deep. The essential idea of PairNorm is to keep the total pairwise squared distance (TPSD) of the node representations unchanged, i.e., the same as that of the original node features X. Let H̃ be the node representations output by the graph convolution, which is the input to PairNorm, and let Ĥ be the output of PairNorm. The goal of PairNorm is to normalize H̃ such that, after normalization, TPSD(Ĥ) = TPSD(X). Note that the TPSD can be rewritten as

    TPSD(H̃) = Σ_{i,j∈[N]} ||H̃_i − H̃_j||²_2 = 2N² ( (1/N) Σ_{i=1}^N ||H̃_i||²_2 − || (1/N) Σ_{i=1}^N H̃_i ||²_2 ).    (4.45)
We can further simplify the above expression by subtracting the row-wise mean from each H̃_i, i.e., H̃_i^c = H̃_i − (1/N) Σ_{i=1}^N H̃_i, which denotes the centered representation. A nice property of centering the node representations is that it does not change the TPSD, while it pushes the second term || (1/N) Σ_{i=1}^N H̃_i ||²_2 to zero. As a result, we have

    H̃_i^c = H̃_i − (1/N) Σ_{i=1}^N H̃_i    (Center),    (4.47)

    Ĥ_i = s · H̃_i^c / √( (1/N) Σ_{i=1}^N ||H̃_i^c||²_2 ) = s √N · H̃_i^c / ||H̃^c||_F    (Scale),    (4.48)
so that, after centering and scaling, the TPSD of the output becomes

    TPSD(Ĥ) = 2N ||Ĥ||²_F = 2N Σ_i || s · H̃_i^c / √( (1/N) Σ_{i=1}^N ||H̃_i^c||²_2 ) ||²_2 = 2N² s²,    (4.49)

where s is a hyperparameter that determines the scale of the normalized representations.
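A short sketch of the centering and rescaling steps of Eqns. (4.47)-(4.48), together with a numerical check of Eqn. (4.49), might look as follows (s is the scale hyperparameter):

import numpy as np

def pairnorm(H_tilde, s=1.0, eps=1e-9):
    """PairNorm: center the node representations, then rescale so the TPSD stays constant."""
    N = H_tilde.shape[0]
    H_c = H_tilde - H_tilde.mean(axis=0, keepdims=True)     # Eqn. (4.47), centering
    scale = s * np.sqrt(N) / (np.linalg.norm(H_c) + eps)     # Eqn. (4.48), Frobenius norm
    return scale * H_c

# After PairNorm the total pairwise squared distance equals 2 * N^2 * s^2 (Eqn. (4.49)).
rng = np.random.default_rng(0)
H_hat = pairnorm(rng.normal(size=(6, 4)), s=0.5)
tpsd = sum(np.sum((H_hat[i] - H_hat[j]) ** 2) for i in range(6) for j in range(6))
assert np.isclose(tpsd, 2 * 6 ** 2 * 0.5 ** 2)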
4.5 Summary
Editor's Notes: The node classification task is one of the most important tasks addressed by graph neural networks. The node representation learning techniques introduced in this chapter are the cornerstone for all the other tasks in the rest of the book, including graph classification (Chapter 9), link prediction (Chapter 10), graph generation (Chapter 11), and so on. Familiarity with the learning methodologies and design principles of node representation learning is key to deeply understanding other fundamental research directions such as Theoretical Analysis (Chapter 5), Scalability (Chapter 6), Explainability (Chapter 7), and Adversarial Robustness (Chapter 8).