
Chapter 4

Graph Neural Networks for Node Classification

Jian Tang (Mila-Quebec AI Institute, HEC Montreal, e-mail: jian.tang@hec.ca) and Renjie Liao (University of Toronto, e-mail: rjliao@cs.toronto.edu)

Abstract Graph neural networks are neural architectures specifically designed for graph-structured data, which have recently received increasing attention and have been applied to many different domains and applications. In this chapter, we focus on a fundamental task on graphs: node classification. We give a detailed definition of node classification and introduce some classical approaches such as label propagation. Afterwards, we introduce a few representative architectures of graph neural networks for node classification. We further point out the main difficulty in training deep graph neural networks, the over-smoothing problem, and present some of the latest advances in this direction, such as continuous graph neural networks.

4.1 Background and Problem Definition

Graph-structured data (e.g., social networks, the World Wide Web, and protein-protein interaction networks) are ubiquitous in the real world and cover a variety of applications. A fundamental task on graphs is node classification, which tries to classify the nodes into a few predefined categories. For example, in social networks we may want to predict the political bias of each user; in protein-protein interaction networks we are interested in predicting the functional role of each protein; on the World Wide Web we may want to classify web pages into different semantic categories. Making accurate predictions critically depends on learning effective node representations, which largely determine the performance of node classification.
Graph neural networks are neural network architectures specifically designed for learning representations of graph-structured data, including learning node representations of big graphs (e.g., social networks and the World Wide Web) and learning
representations of entire graphs (e.g., molecular graphs). In this chapter, we will focus on learning node representations for large-scale graphs; learning whole-graph representations is introduced in other chapters. A variety of graph neural networks have been proposed (Kipf and Welling, 2017b; Veličković et al, 2018; Gilmer et al, 2017; Xhonneux et al, 2020; Liao et al, 2019b; Kipf and Welling, 2016; Veličković et al, 2019). In this chapter, we comprehensively revisit existing graph neural networks for node classification, covering supervised approaches (Sec. 4.2), unsupervised approaches (Sec. 4.3), and a common problem of graph neural networks for node classification, over-smoothing (Sec. 4.4).

Problem Definition. Let us first formally define the problem of learning node representations for node classification with graph neural networks. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote a graph, where $\mathcal{V}$ is the set of nodes and $\mathcal{E}$ is the set of edges. $A \in \mathbb{R}^{N \times N}$ represents the adjacency matrix, where $N$ is the total number of nodes, and $X \in \mathbb{R}^{N \times C}$ represents the node attribute matrix, where $C$ is the number of features per node. The goal of graph neural networks is to learn effective node representations (denoted as $H \in \mathbb{R}^{N \times F}$, where $F$ is the dimension of the node representations) by combining the graph structure information and the node attributes, which are then used for node classification.

Table 4.1: Notations used throughout this chapter.

Concept                                      Notation
Graph                                        $\mathcal{G} = (\mathcal{V}, \mathcal{E})$
Adjacency matrix                             $A \in \mathbb{R}^{N \times N}$
Node attributes                              $X \in \mathbb{R}^{N \times C}$
Total number of GNN layers                   $K$
Node representations at the $k$-th layer     $H^k \in \mathbb{R}^{N \times F}$, $k \in \{1, 2, \cdots, K\}$

4.2 Supervised Graph Neural Networks

In this section, we revisit several representative graph neural network methods for node classification. We focus on the supervised methods and introduce the unsupervised methods in the next section. We start by introducing a general framework of graph neural networks and then introduce different variants under this framework.

4.2.1 General Framework of Graph Neural Networks

The essential idea of graph neural networks is to iteratively update the node representations by combining the representations of their neighbors and their own representations. In this section, we introduce the general framework of graph neural networks described in (Xu et al, 2019d). Starting from the initial node representation $H^0 = X$, in each layer we have two important functions:
• AGGREGATE, which tries to aggregate the information from the neighbors of
each node;
• COMBINE, which tries to update the node representations by combining the
aggregated information from neighbors with the current node representations.
Mathematically, we can define the general framework of graph neural networks
as follows:
Initialization: $H^0 = X$.
For $k = 1, 2, \cdots, K$:

$a_v^k = \mathrm{AGGREGATE}^k\big(\{H_u^{k-1} : u \in N(v)\}\big)$   (4.1)

$H_v^k = \mathrm{COMBINE}^k\big(H_v^{k-1}, a_v^k\big)$,   (4.2)

where $N(v)$ is the set of neighbors of the $v$-th node. The node representations $H^K$
in the last layer can be treated as the final node representations.
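To make the framework concrete, below is a minimal PyTorch-style sketch of one AGGREGATE/COMBINE layer. It is our own illustration rather than code from any particular paper: mean aggregation over neighbors and a single linear COMBINE are just example choices, and all names are made up.

import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    # One AGGREGATE/COMBINE step, Eqs. (4.1)-(4.2), using a dense adjacency matrix.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.combine = nn.Linear(2 * in_dim, out_dim)

    def forward(self, H, A):
        # H: (N, in_dim) node representations, A: (N, N) adjacency matrix
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)
        a = (A @ H) / deg                                            # AGGREGATE: mean over neighbors
        return torch.relu(self.combine(torch.cat([H, a], dim=-1)))   # COMBINE

Stacking $K$ such layers with $H^0 = X$, and feeding $H^{k-1}$ into layer $k$, reproduces the iteration above.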
Once we have the node representations, they can be used for downstream tasks.
Take node classification as an example: the label of node $v$ (denoted as $\hat{y}_v$) can be predicted through a Softmax function, i.e.,

$\hat{y}_v = \mathrm{Softmax}(W H_v^\top)$,   (4.3)

where $W \in \mathbb{R}^{|\mathcal{L}| \times F}$ and $|\mathcal{L}|$ is the number of labels in the output space.


Given a set of labeled nodes, the whole model can be trained by minimizing the
following loss function:
$\mathcal{O} = \frac{1}{n_l} \sum_{i=1}^{n_l} \mathrm{loss}(\hat{y}_i, y_i)$,   (4.4)

where $y_i$ is the ground-truth label of node $i$, $n_l$ is the number of labeled nodes, and $\mathrm{loss}(\cdot, \cdot)$ is a loss function such as the cross-entropy loss. The whole neural network can be optimized by minimizing the objective function $\mathcal{O}$ with backpropagation.
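A minimal sketch of Eqs. (4.3)-(4.4) in PyTorch, assuming the final representations, the classifier matrix, and the indices of the labeled nodes are given; the function and variable names are ours.

import torch
import torch.nn.functional as F

def node_classification_loss(H_K, W, labels, labeled_idx):
    # Softmax classifier on the final representations (Eq. 4.3) and averaged
    # cross-entropy over the labeled nodes only (Eq. 4.4).
    logits = H_K @ W.t()                        # (N, |L|)
    return F.cross_entropy(logits[labeled_idx], labels[labeled_idx])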
Above, we presented a general framework of graph neural networks. Next, we will introduce a few of the most representative instantiations and variants of graph neural networks in the literature.

4.2.2 Graph Convolutional Networks

We start from the graph convolutional network (GCN) (Kipf and Welling, 2017b), which is now the most popular graph neural network architecture due to its simplicity and effectiveness in a variety of tasks and applications. Specifically, the node representations in each layer are updated according to the following propagation rule:

$H^{k+1} = \sigma\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^k W^k\big)$.   (4.5)

$\tilde{A} = A + I$ is the adjacency matrix of the given undirected graph $\mathcal{G}$ with self-connections, which allows the model to incorporate a node's own features when updating its representation. $I \in \mathbb{R}^{N \times N}$ is the identity matrix. $\tilde{D}$ is a diagonal matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. $\sigma(\cdot)$ is an activation function such as ReLU or Tanh; ReLU, defined as $\mathrm{ReLU}(x) = \max(0, x)$, is the most widely used. $W^k \in \mathbb{R}^{F \times F'}$ ($F$ and $F'$ are the dimensions of node representations in the $k$-th and $(k+1)$-th layers, respectively) is a layer-wise linear transformation matrix that is trained during optimization.
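The propagation rule in Eqn. (4.5) takes only a few lines of code. The sketch below is our own, uses dense matrices for clarity (practical implementations rely on sparse operations), and assumes an undirected adjacency matrix.

import torch

def gcn_layer(H, A, W, activation=torch.relu):
    # One GCN step, Eq. (4.5): sigma(D~^{-1/2} (A + I) D~^{-1/2} H W).
    N = A.size(0)
    A_tilde = A + torch.eye(N)                  # add self-connections
    d = A_tilde.sum(dim=1)
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    return activation(D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W)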
We can further dissect Eqn. (4.5) to understand the AGGREGATE and COMBINE functions defined in GCN. For a node $i$, the node updating equation can be reformulated as below:

$H_i^k = \sigma\Big( \sum_{j \in \{N(i) \cup i\}} \frac{\tilde{A}_{ij}}{\sqrt{\tilde{D}_{ii} \tilde{D}_{jj}}} H_j^{k-1} W^k \Big)$   (4.6)

$H_i^k = \sigma\Big( \sum_{j \in N(i)} \frac{A_{ij}}{\sqrt{\tilde{D}_{ii} \tilde{D}_{jj}}} H_j^{k-1} W^k + \frac{1}{\tilde{D}_{ii}} H_i^{k-1} W^k \Big)$   (4.7)

From Eqn. (4.7), we can see that the AGGREGATE function is defined as a weighted average of the neighbor node representations, where the weight of neighbor $j$ is the weight of the edge between $i$ and $j$ (i.e., $A_{ij}$ normalized by the degrees of the two nodes). The COMBINE function is defined as the summation of the aggregated messages and the node's own representation, normalized by its own degree.

Connections with Spectral Graph Convolutions. Next, we discuss the connections between GCNs and traditional spectral filters defined on graphs (Defferrard et al, 2016). A spectral convolution on a graph can be defined as the multiplication of a node-wise signal $x \in \mathbb{R}^N$ with a convolutional filter $g_\theta = \mathrm{diag}(\theta)$ ($\theta \in \mathbb{R}^N$ is the parameter of the filter) in the Fourier domain. Mathematically,

$g_\theta \star x = U g_\theta U^\top x$.   (4.8)

$U$ is the matrix of the eigenvectors of the normalized graph Laplacian matrix $L = I_N - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$. We have $L = U \Lambda U^\top$, where $\Lambda$ is a diagonal matrix of eigenvalues, and $U^\top x$ is the graph Fourier transform of the input signal $x$. In practice, $g_\theta$ can be understood as a function of the eigenvalues of the normalized graph Laplacian matrix $L$, i.e., $g_\theta(\Lambda)$. Directly calculating Eqn. (4.8) is computationally expensive, quadratic in the number of nodes $N$. According to (Hammond et al, 2011), this problem can be circumvented by approximating the function $g_\theta(\Lambda)$ with a truncated expansion of Chebyshev polynomials $T_k(x)$ up to the $K$-th order:

$g_{\theta'}(\Lambda) = \sum_{k=0}^{K} \theta'_k T_k(\tilde{\Lambda})$,   (4.9)

where $\tilde{\Lambda} = \frac{2}{\lambda_{\max}} \Lambda - I$, and $\lambda_{\max}$ is the largest eigenvalue of $L$. $\theta' \in \mathbb{R}^K$ is the vector of Chebyshev coefficients. The Chebyshev polynomials are recursively defined as $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$, with $T_0(x) = 1$ and $T_1(x) = x$. By combining Eqn. (4.9) and Eqn. (4.8), the convolution of a signal $x$ with a filter $g_{\theta'}$ can be reformulated as

$g_{\theta'} \star x = \sum_{k=0}^{K} \theta'_k T_k(\tilde{L}) x$,   (4.10)

where $\tilde{L} = \frac{2}{\lambda_{\max}} L - I$. From this equation, we can see that each node only depends on the information within its $K$-th-order neighborhood. The overall complexity of evaluating Eqn. (4.10) is $O(|\mathcal{E}|)$, i.e., linear in the number of edges of the original graph $\mathcal{G}$, which is very efficient.
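The Chebyshev recurrence makes Eqn. (4.10) straightforward to implement. The sketch below is our own illustration with dense tensors; `theta` holds the coefficients $\theta'_k$ and `L` is the normalized graph Laplacian.

import torch

def chebyshev_filter(x, L, theta, lmax=2.0):
    # Eq. (4.10): sum_k theta_k T_k(L_tilde) x, using the recurrence
    # T_k(x) = 2 x T_{k-1}(x) - T_{k-2}(x), T_0 = 1, T_1 = x.
    N = L.size(0)
    L_tilde = (2.0 / lmax) * L - torch.eye(N)
    Tx_prev, Tx = x, L_tilde @ x                # T_0(L~) x and T_1(L~) x
    out = theta[0] * Tx_prev
    if len(theta) > 1:
        out = out + theta[1] * Tx
    for k in range(2, len(theta)):
        Tx_next = 2 * (L_tilde @ Tx) - Tx_prev
        out = out + theta[k] * Tx_next
        Tx_prev, Tx = Tx, Tx_next
    return out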
To define a neural network based on graph convolutions, one can stack multiple convolution layers defined according to Eqn. (4.10), each followed by a nonlinear transformation. At each layer, instead of being limited to the explicit parametrization by the Chebyshev polynomials in Eqn. (4.10), the authors of GCNs proposed to limit the number of convolutions to $K = 1$ per layer. Each layer then only defines a linear function over the graph Laplacian matrix $L$; however, by stacking multiple such layers, we are still capable of covering a rich class of convolutional filter functions on graphs. Intuitively, such a model alleviates the problem of overfitting local neighborhood structures for graphs whose node degree distribution has high variance, such as social networks, the World Wide Web, and citation networks.
At each layer, we can further approximate $\lambda_{\max} \approx 2$, which can be accommodated by the neural network parameters during training. Based on all these simplifications, we have

$g_{\theta'} \star x \approx \theta'_0 x + \theta'_1 (L - I_N) x = \theta'_0 x - \theta'_1 D^{-\frac{1}{2}} A D^{-\frac{1}{2}} x$,   (4.11)

where $\theta'_0$ and $\theta'_1$ are two free parameters, which can be shared over the entire graph. In practice, we can further reduce the number of parameters, which helps reduce overfitting and meanwhile minimizes the number of operations per layer. As a result, the following expression can be obtained:

$g_\theta \star x \approx \theta \big(I + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\big) x$,   (4.12)

where $\theta = \theta'_0 = -\theta'_1$. One potential issue is that the matrix $I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ has eigenvalues in the interval $[0, 2]$. In a deep graph convolutional neural network, repeated application of this operator will likely lead to exploding or vanishing gradients, yielding numerical instabilities. As a result, we can further renormalize this matrix by converting $I + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ to $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$, where $\tilde{A} = A + I$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.

Above, we only consider the case of a single feature channel and a single filter. This can be easily generalized to an input signal with $C$ channels, $X \in \mathbb{R}^{N \times C}$, and $F$ filters (or hidden units) as follows:

$H = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X W$,   (4.13)

where $W \in \mathbb{R}^{C \times F}$ is a matrix of filter parameters and $H$ is the convolved signal matrix.

4.2.3 Graph Attention Networks

In GCNs, for a target node $i$, the importance of a neighbor $j$ is determined by the weight of their edge $A_{ij}$ (normalized by their node degrees). However, in practice, the input graph may be noisy, and the edge weights may not reflect the true strength of the relation between two nodes. A more principled approach is therefore to automatically learn the importance of each neighbor. Graph Attention Networks (GAT) (Veličković et al, 2018) are built on this idea and try to learn the importance of each neighbor based on the attention mechanism (Bahdanau et al, 2015; Vaswani et al, 2017). The attention mechanism has been widely used in a variety of tasks in natural language understanding (e.g., machine translation and question answering) and computer vision (e.g., visual question answering and image captioning). Next, we introduce how attention is used in graph neural networks.

Graph Attention Layer. The graph attention layer defines how to transform the hidden node representations at layer $k-1$ (denoted as $H^{k-1} \in \mathbb{R}^{N \times F}$) into the new node representations $H^k \in \mathbb{R}^{N \times F'}$. In order to guarantee sufficient expressive power to transform the lower-level node representations into higher-level ones, a shared linear transformation, denoted as $W \in \mathbb{R}^{F' \times F}$, is applied to every node. Afterwards, self-attention is defined on the nodes, which measures the attention coefficient for any pair of nodes through a shared attentional mechanism $a: \mathbb{R}^{F'} \times \mathbb{R}^{F'} \rightarrow \mathbb{R}$:

$e_{ij} = a\big(W H_i^{k-1}, W H_j^{k-1}\big)$.   (4.14)

$e_{ij}$ indicates the strength of the relationship between nodes $i$ and $j$. Note that in this subsection we use $H_i^{k-1}$ to represent a column-wise vector instead of a row-wise vector. In theory, we could allow every node to attend to every other node in the graph, but this would ignore the graph structural information. A more reasonable solution is to let each node attend only to its neighbors; in practice, only the first-order neighbors are used (including the node itself). To make the coefficients comparable across different nodes, the attention coefficients are usually normalized with the softmax function:

$\alpha_{ij} = \mathrm{Softmax}_j(\{e_{ij}\}) = \frac{\exp(e_{ij})}{\sum_{l \in N(i)} \exp(e_{il})}$.   (4.15)

We can see that for a node $i$, $\alpha_{ij}$ essentially defines a multinomial distribution over its neighbors, which can also be interpreted as the transition probability from node $i$ to each of its neighbors.
In the work by Veličković et al (2018), the attention mechanism $a$ is defined as a single-layer feedforward neural network consisting of a linear transformation with weight vector $W_2 \in \mathbb{R}^{1 \times 2F'}$ and a LeakyReLU nonlinearity (with negative-input slope $\alpha = 0.2$). More specifically, the attention coefficients are calculated as

$\alpha_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}(W_2 [W H_i^{k-1} \,\Vert\, W H_j^{k-1}])\big)}{\sum_{l \in N(i)} \exp\big(\mathrm{LeakyReLU}(W_2 [W H_i^{k-1} \,\Vert\, W H_l^{k-1}])\big)}$,   (4.16)

where $\Vert$ denotes the concatenation of two vectors. The new node representation is a linear combination of the neighboring node representations, with the weights determined by the attention coefficients (followed by a potential nonlinear transformation), i.e.,

$H_i^k = \sigma\Big( \sum_{j \in N(i)} \alpha_{ij} W H_j^{k-1} \Big)$.   (4.17)
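A minimal single-head graph attention layer following Eqs. (4.14)-(4.17). This dense sketch is our own: it assumes the adjacency matrix already contains self-loops and is meant to expose the logic rather than match the efficient sparse implementation of the original paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    # Single-head graph attention over a dense adjacency matrix.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # shared attention vector W_2

    def forward(self, H, A):
        # H: (N, in_dim), A: (N, N) adjacency with self-loops
        Wh = self.W(H)                                    # (N, F')
        N = Wh.size(0)
        pairs = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                           Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1), negative_slope=0.2)  # Eqs. (4.14)/(4.16)
        e = e.masked_fill(A == 0, float('-inf'))          # attend only to neighbors
        alpha = torch.softmax(e, dim=1)                   # Eq. (4.15)
        return torch.relu(alpha @ Wh)                     # Eq. (4.17), sigma chosen as ReLU here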

Multi-head Attention. In practice, instead of using a single attention mechanism, multi-head attention can be used, where each head determines a different similarity function over the nodes. For each attention head, we can independently obtain a new node representation according to Eqn. (4.17). The final node representation is then a concatenation of the node representations learned by the different attention heads. Mathematically, we have

$H_i^k = \big\Vert_{t=1}^{T} \, \sigma\Big( \sum_{j \in N(i)} \alpha_{ij}^t W^t H_j^{k-1} \Big)$,   (4.18)

where $T$ is the total number of attention heads, $\alpha_{ij}^t$ is the attention coefficient calculated by the $t$-th attention head, and $W^t$ is the linear transformation matrix of the $t$-th attention head.

One thing mentioned in the paper by Veličković et al (2018) is that in the final layer, when combining the node representations from different attention heads, other pooling techniques can be used instead of concatenation, e.g., simply averaging the node representations from the different attention heads:

$H_i^k = \sigma\Big( \frac{1}{T} \sum_{t=1}^{T} \sum_{j \in N(i)} \alpha_{ij}^t W^t H_j^{k-1} \Big)$.   (4.19)

4.2.4 Neural Message Passing Networks

Another very popular graph neural network architecture is the Neural Message Passing Network (MPNN) (Gilmer et al, 2017), which was originally proposed for learning molecular graph representations. MPNN is actually very general, provides a general framework of graph neural networks, and can be used for the task of node classification as well. The essential idea of MPNN is to formulate existing graph neural networks as a general framework of neural message passing among nodes. In MPNNs, there are two important functions, the message function and the updating function:

$m_i^k = \sum_{j \in N(i)} M_k\big(H_i^{k-1}, H_j^{k-1}, e_{ij}\big)$,   (4.20)

$H_i^k = U_k\big(H_i^{k-1}, m_i^k\big)$.   (4.21)

$M_k(\cdot, \cdot, \cdot)$ defines the message between nodes $i$ and $j$ in the $k$-th layer, which depends on the two node representations and the information of their edge. $U_k$ is the node updating function in the $k$-th layer, which combines the aggregated messages from the neighbors with the node representation itself. We can see that the MPNN framework is very similar to the general framework introduced in Section 4.2.1: the AGGREGATE function here is simply a summation of all the messages from the neighbors, and the COMBINE function is the same as the node updating function.
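A sketch of one MPNN layer following Eqs. (4.20)-(4.21). The concrete choices here, an MLP message function and a GRU-style update, are only examples of $M_k$ and $U_k$; they are not the specific functions prescribed by (Gilmer et al, 2017).

import torch
import torch.nn as nn

class MPNNLayer(nn.Module):
    def __init__(self, dim, edge_dim=0):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * dim + edge_dim, dim), nn.ReLU())
        self.upd = nn.GRUCell(dim, dim)

    def forward(self, H, edge_index, edge_attr=None):
        # edge_index: (2, E) tensor with rows (source j, target i)
        src, dst = edge_index
        inputs = [H[dst], H[src]]
        if edge_attr is not None:
            inputs.append(edge_attr)
        m = self.msg(torch.cat(inputs, dim=-1))            # M_k(H_i, H_j, e_ij)
        agg = torch.zeros_like(H).index_add_(0, dst, m)    # sum messages per target node
        return self.upd(agg, H)                            # U_k(H_i, m_i)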

4.2.5 Continuous Graph Neural Networks

The graph neural networks above iteratively update the node representations with different kinds of graph convolutional layers; essentially, they model the discrete dynamics of node representations. Xhonneux et al (2020) proposed continuous graph neural networks (CGNNs), which generalize existing graph neural networks with discrete dynamics to the continuous setting, i.e., they try to model the continuous dynamics of node representations. The key question is how to characterize the continuous dynamics of node representations, i.e., the derivatives of the node representations w.r.t. time. The CGNN model is inspired by diffusion-based models on graphs such as PageRank and epidemic models on social networks. The derivatives of the node representations are defined as a combination of the node representation itself, the representations of its neighbors, and the initial state of the nodes. Specifically, two variants of node dynamics are introduced. The first model assumes that different dimensions of the node representations (a.k.a. feature channels) are independent; the second model is more flexible and allows different feature channels to interact with each other. Next, we give a detailed introduction to each of the two models.
Note that in this part, instead of the original adjacency matrix $A$, we use the following regularized matrix to characterize the graph structure:

$A := \frac{\alpha}{2} \Big( I + D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \Big)$,   (4.22)

where $\alpha \in (0, 1)$ is a hyperparameter and $D$ is the degree matrix of the original adjacency matrix $A$. With this regularized adjacency matrix, the eigenvalues of $A$ lie in the interval $[0, \alpha]$, which makes $A^k$ converge to $0$ as the power $k$ increases.

Model 1: Independent Feature Channels. As different nodes in a graph are interconnected, a natural way to model the dynamics of each feature channel is to take the graph structure into consideration, which allows information to propagate across different nodes. This is motivated by existing diffusion-based methods on graphs such as PageRank (Page et al, 1999) and label propagation (Zhou et al, 2004), which define the discrete propagation of node representations (or signals on nodes) with the following step-wise propagation equation:

$H^{k+1} = A H^k + H^0$,   (4.23)

where $H^0 = X$ or the output of an encoder applied to the input features $X$. Intuitively, at each step the new node representation is a linear combination of the neighboring node representations and the initial node features. Such a mechanism allows modeling information propagation on the graph without forgetting the initial node features. We can unroll Eqn. (4.23) and explicitly derive the node representations at the $k$-th step:

$H^k = \Big( \sum_{i=0}^{k} A^i \Big) H^0 = (A - I)^{-1} \big(A^{k+1} - I\big) H^0$.   (4.24)

As the above equation effectively models the discrete dynamics of node representations, the CGNN model further extends it to the continuous setting, replacing the discrete time step $k$ with a continuous variable $t \in \mathbb{R}_0^+$. Specifically, it has been shown that Eqn. (4.24) is a discretization of the following ordinary differential equation (ODE):

$\frac{dH^t}{dt} = \log A \, H^t + X$,   (4.25)

with the initial value $H^0 = (\log A)^{-1} (A - I) X$, where $X$ is the initial node feature matrix or the output of an encoder applied to it. We do not provide the proof here; more details can be found in the original paper (Xhonneux et al, 2020). Since $\log A$ in Eqn. (4.25) is intractable to compute in practice, it is approximated with the first-order Taylor expansion, i.e., $\log A \approx A - I$. Putting all of this together, we obtain the following ODE:

$\frac{dH^t}{dt} = (A - I) H^t + X$,   (4.26)

with the initial value $H^0 = X$, which is the first variant of the CGNN model.
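One simple way to see Eqn. (4.26) in action is to integrate it numerically. The sketch below is purely illustrative: it uses a fixed-step forward-Euler scheme, whereas CGNN relies on standard ODE solvers rather than this naive integrator.

import torch

def cgnn_independent(X, A_reg, t_end=1.0, step=0.05):
    # Forward-Euler integration of dH/dt = (A - I) H + X with H(0) = X,
    # where A_reg is the regularized adjacency of Eq. (4.22).
    N = A_reg.size(0)
    I = torch.eye(N)
    H = X.clone()
    for _ in range(int(t_end / step)):
        H = H + step * ((A_reg - I) @ H + X)
    return H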
The CGNN model is actually very intuitive and has a nice connection with the traditional epidemic model, which studies the dynamics of infection in a population. The epidemic model usually assumes that the infection of people is affected by three factors: infection from neighbors, natural recovery, and the natural characteristics of people. If we treat $H^t$ as the number of people infected at time $t$, these three factors are naturally modeled by the three terms in Eqn. (4.26): $A H^t$ for the infection from neighbors, $-H^t$ for natural recovery, and $X$ for the natural characteristics of people.

Model 2: Modeling the Interaction of Feature Channels. The above model assumes that different node feature channels are independent of each other, which is a very strong assumption and limits the capacity of the model. Inspired by the success of a linear variant of graph neural networks (i.e., Simple GCN (Wu et al, 2019a)), a more powerful discrete node dynamics model is proposed, which allows different feature channels to interact with each other:

$H^{k+1} = A H^k W + H^0$,   (4.27)

where $W \in \mathbb{R}^{F \times F}$ is a weight matrix modeling the interactions between different feature channels. Similarly, we can extend the above discrete dynamics to the continuous case, yielding the following equation:

$\frac{dH^t}{dt} = (A - I) H^t + H^t (W - I) + X$,   (4.28)

with the initial value $H^0 = X$. This is the second variant of CGNN, with trainable weights. ODEs of the form of Eqn. (4.28) have been studied in the control theory literature, where they are known as Sylvester differential equations (Locatelli and Sieniutycz, 2002). The two matrices $A - I$ and $W - I$ characterize the natural solution of the system, while $X$ is the information provided to the system to drive it to the desired state.

Discussion. The proposed continuous graph neural networks (CGNNs) have several nice properties. (1) Recent work has shown that if we increase the number of layers $K$ in discrete graph neural networks, the learned node representations tend to suffer from over-smoothing (introduced in detail later) and hence lose expressive power. On the contrary, continuous graph neural networks can be trained very deep and are experimentally robust to the arbitrarily chosen integration time. (2) For some tasks on graphs, it is critical to model long-range dependencies between nodes, which requires training deep GNNs. Existing discrete GNNs fail to train very deep models due to the over-smoothing problem, whereas CGNNs are able to effectively model long-range dependencies between nodes thanks to their stability w.r.t. time. (3) The hyperparameter $\alpha$ is important: it controls the rate of diffusion, specifically the rate at which high-order powers of the regularized matrix $A$ vanish. In the work of Xhonneux et al (2020), the authors proposed to learn a different value of $\alpha$ for each node, which allows choosing the best diffusion rate for each node.

4.2.6 Multi-Scale Spectral Graph Convolutional Networks

Recall the one-layer graph convolution operator used in GCNs (Kipf and Welling, 2017b), $H = L H W$, where $L = D^{-\frac{1}{2}} \tilde{A} D^{-\frac{1}{2}}$. Here we drop the superscript of the layer index to avoid a clash with the notation for matrix powers. There are two main issues with this simple graph convolution formulation. First, one such graph convolutional layer only propagates information from a node to its nearest neighbors, i.e., neighboring nodes that are one hop away. If one would like to propagate information to neighbors $M$ hops away, one has to either stack $M$ graph convolutional layers or compute the graph convolution with the $M$-th power of the graph Laplacian, i.e., $H = \sigma(L^M H W)$. When $M$ is large, stacking layers makes the whole GCN model very deep, causing learning problems like vanishing gradients, similar to what people experience in training very deep feedforward neural networks. For the matrix-power solution, naively computing the $M$-th power of the graph Laplacian is also very costly (e.g., the time complexity is $O(N^{3(M-1)})$ for graphs with $N$ nodes). Second, there are no learnable parameters in GCNs associated with the graph Laplacian $L$ (corresponding to the connectivities/structures). The only learnable parameter $W$ is a linear transform applied to every node simultaneously, which is not aware of the structure. Note that we typically associate learnable weights with edges when applying convolution to regular graphs like grids (e.g., applying 2D convolution to images), which greatly improves the expressiveness of the model. However, it is not clear how one can add learnable parameters to the graph Laplacian $L$, since its size varies from graph to graph.

Algorithm 1: Lanczos Algorithm

1: Input: $S$, $x$, $M$, $\epsilon$
2: Initialization: $\beta_0 = 0$, $q_0 = 0$, and $q_1 = x / \lVert x \rVert$
3: For $j = 1, 2, \dots, M$:
4:   $z = S q_j$
5:   $\gamma_j = q_j^\top z$
6:   $z = z - \gamma_j q_j - \beta_{j-1} q_{j-1}$
7:   $\beta_j = \lVert z \rVert_2$
8:   If $\beta_j < \epsilon$, quit
9:   $q_{j+1} = z / \beta_j$
10: End for
11: $Q = [q_1, q_2, \cdots, q_M]$
12: Construct $T$ following Eq. (4.29)
13: Eigendecomposition $T = B R B^\top$
14: Return $V = QB$ and $R$.
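Algorithm 1 translates almost line by line into code. The following is our own PyTorch sketch; it uses a general symmetric eigendecomposition for the tridiagonal matrix $T$ rather than a specialized solver.

import torch

def lanczos(S, x, M, eps=1e-6):
    # M-step Lanczos iteration (Alg. 1): returns Ritz vectors V = QB and Ritz
    # values R approximating the top eigenpairs of the symmetric matrix S.
    q = x / x.norm()
    q_prev = torch.zeros_like(q)
    beta_prev = 0.0
    Q_cols, gammas, betas = [], [], []
    for _ in range(M):
        Q_cols.append(q)
        z = S @ q
        gamma = torch.dot(q, z)
        gammas.append(gamma)
        z = z - gamma * q - beta_prev * q_prev
        beta = z.norm()
        if beta < eps:                              # invariant subspace found, stop early
            break
        betas.append(beta)
        q_prev, q, beta_prev = q, z / beta, beta
    Q = torch.stack(Q_cols, dim=1)                  # (N, m) with m <= M
    # Tridiagonal T of Eq. (4.29) and its eigendecomposition T = B R B^T
    T = torch.diag(torch.stack(gammas))
    for j, b in enumerate(betas[:len(gammas) - 1]):
        T[j, j + 1] = T[j + 1, j] = b
    R, B = torch.linalg.eigh(T)
    order = torch.argsort(R.abs(), descending=True) # "top" = largest magnitude
    return Q @ B[:, order], R[order]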

{"# , %# }' = )*+,-./()) Long Range Spectral Filtering Long Range Spectral Filtering
e.g., I = {20, 50, , … } e.g., I = {20, 50, , … }
' '
O O O O O O
)> < = K N< ("#P , "#Q , … , "# |R| )%# %#S )> < = K N< ("#P , "#Q , … , "# |R| )%# %#S
#LM #LM

7< = = )> < ?@< ∀B ∈ [|F|] 7< = = )> < ?@< ∀B ∈ [|F|]

concat 7 concat 7 ... 2 = 345645(7)

Short Range Spectral Filtering Short Range Spectral Filtering


e.g., S = 1, 2, … e.g., S = 1, 2, …

7< = = )TU ?@< ∀B ∈ [|V|] 7< = = )TU ?@< ∀B ∈ [|V|]

Layer 1 Layer 2

Fig. 4.1: The inference procedure of Lanczos Networks. The approximated top eigenvalues $\{r_k\}$ and eigenvectors $\{v_k\}$ are computed by the Lanczos algorithm. Note that this step is only needed once per graph. The long-range/scale graph convolutions are efficiently computed by the low-rank approximation of the graph Laplacian. One can control the ranges (i.e., the exponents of the eigenvalues) as hyperparameters. Learnable spectral filters are applied to the approximated top eigenvalues $\{r_k\}$. The short-range/scale graph convolution is the same as in GCNs. Adapted from Figure 1 of (Liao et al, 2019b).

To overcome these two problems, the authors propose Lanczos Networks (Liao et al, 2019b). Given the graph Laplacian matrix $L$ (here we assume a symmetric graph Laplacian; if it is non-symmetric, e.g., for directed graphs, one can resort to the Arnoldi algorithm) and the node features $X$, one first uses the $M$-step Lanczos algorithm (Lanczos, 1950) (listed in Alg. 1) to compute an orthogonal matrix $Q$ and a symmetric tridiagonal matrix $T$ such that $Q^\top L Q = T$. We denote $Q = [q_1, \cdots, q_M]$, where the column vector $q_i$ is the $i$-th Lanczos vector. Note that $M$ can be much smaller than the number of nodes $N$. $T$ is illustrated below:
$T = \begin{bmatrix} \gamma_1 & \beta_1 & & \\ \beta_1 & \ddots & \ddots & \\ & \ddots & \ddots & \beta_{M-1} \\ & & \beta_{M-1} & \gamma_M \end{bmatrix}$.   (4.29)

After obtaining the tridiagonal matrix $T$, we can compute the Ritz values and Ritz vectors, which approximate the top eigenvalues and eigenvectors of $L$, by diagonalizing $T$ as $T = B R B^\top$, where the $M \times M$ diagonal matrix $R$ contains the Ritz values and $B \in \mathbb{R}^{M \times M}$ is an orthogonal matrix. Here "top" means ranking the eigenvalues by their magnitudes in descending order. This can be implemented via a general eigendecomposition or fast decomposition methods specialized for tridiagonal matrices. We now have a low-rank approximation of the graph Laplacian matrix, $L \approx V R V^\top$, where $V = QB$. Denoting the column vectors of $V$ as $\{v_1, \cdots, v_M\}$, we can compute the multi-scale graph convolution as

$H = \hat{L} H W$,
$\hat{L} = \sum_{m=1}^{M} f_\theta\big(r_m^{I_1}, r_m^{I_2}, \cdots, r_m^{I_u}\big) v_m v_m^\top$,   (4.30)

where $\{I_1, \cdots, I_u\}$ is the set of scale/range parameters which determine how many hops (i.e., how far) one would like to propagate information over the graph. For example, one could set $\{I_1 = 50, I_2 = 100\}$ ($u = 2$ in this case) to consider propagating 50 and 100 steps, respectively. Note that one only needs to compute scalar powers rather than the original matrix power. The overall complexity of the Lanczos algorithm in this context is $O(MN^2)$, which makes the whole approach much more efficient than naively computing the matrix power. Moreover, $f_\theta$ is a learnable spectral filter parameterized by $\theta$ and can be applied to graphs of varying sizes, since we decouple the graph size from the input size of $f_\theta$. $f_\theta$ directly acts on the graph Laplacian and greatly improves the expressiveness of the model.
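Given $V$ and $R$ from the Lanczos step, the long-range convolution of Eq. (4.30) can be sketched as follows. The particular parameterization of $f_\theta$ (a small MLP over the powered Ritz values) and the default scale set are our own illustrative choices, not the exact design of the original model.

import torch
import torch.nn as nn

class LongRangeSpectralConv(nn.Module):
    def __init__(self, in_dim, out_dim, scales=(10, 50, 100)):
        super().__init__()
        self.scales = scales
        # f_theta: maps the u powered Ritz values of each eigenpair to one filtered weight
        self.f_theta = nn.Sequential(nn.Linear(len(scales), 16), nn.ReLU(),
                                     nn.Linear(16, 1))
        self.W = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, V, R):
        # V: (N, M) Ritz vectors, R: (M,) Ritz values
        powers = torch.stack([R ** s for s in self.scales], dim=-1)   # (M, u)
        r_hat = self.f_theta(powers).squeeze(-1)                      # (M,)
        L_hat = V @ torch.diag(r_hat) @ V.t()                         # low-rank filtered Laplacian, Eq. (4.30)
        return self.W(L_hat @ H)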
Although the Lanczos algorithm provides an efficient way to approximately compute arbitrary powers of the graph Laplacian, it is still a low-rank approximation which may lose certain information (e.g., the high-frequency part). To alleviate this problem, one can additionally perform vanilla graph convolution with small scale parameters, i.e., $H = L^S H W$, where $S$ can be a small integer like 2 or 3. The resulting representation can be concatenated with the one obtained from the longer-scale/range graph convolution in Eq. (4.30). Relying on the above design, one can add nonlinearities and stack multiple such layers to build a deep graph convolutional network (namely, Lanczos Networks), just like GCNs. The overall inference procedure of Lanczos Networks is shown in Fig. 4.1. This method demonstrates strong empirical performance on a wide variety of tasks and benchmarks, including molecular property prediction in quantum chemistry and document classification in citation networks, and it requires only slight modifications to the implementation of the original GCNs. Nevertheless, if the input graph is extremely large (e.g., a large social network), the Lanczos algorithm itself becomes a computational bottleneck; how to improve the model in such a setting remains an open question.
Here we only introduce a few representative architectures of graph neural networks for node classification. There are also many other well-known architectures, including gated graph neural networks (Li et al, 2016b), which are mainly designed for outputting sequences, and GraphSAGE (Hamilton et al, 2017b), which is mainly designed for the inductive setting of node classification.

4.3 Unsupervised Graph Neural Networks

In this section, we review a few representative GNN-based methods for unsupervised learning on graph-structured data, including variational graph auto-encoders (Kipf and Welling, 2016) and Deep Graph Infomax (Veličković et al, 2019).

4.3.1 Variational Graph Auto-Encoders

Following variational auto-encoders (VAEs) (Kingma and Welling, 2014; Rezende et al, 2014), variational graph auto-encoders (VGAEs) (Kipf and Welling, 2016) provide a framework for unsupervised learning on graph-structured data. In the following, we first review the model and then discuss its advantages and disadvantages.

4.3.1.1 Problem Setup

Suppose we are given an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $N$ nodes. Each node is associated with a node feature/attribute vector; we compactly denote all node features as a matrix $X \in \mathbb{R}^{N \times C}$. The adjacency matrix of the graph is $A$. We assume self-loops are added to the original graph $\mathcal{G}$ so that the diagonal entries of $A$ are 1. This is a convention in graph convolutional networks (GCNs) (Kipf and Welling, 2017b) and makes the model consider a node's old representation while updating its new representation. We also assume each node is associated with a latent variable (the collection of all latent variables is again compactly denoted as a matrix $Z \in \mathbb{R}^{N \times F}$). We are interested in inferring the latent variables of the nodes in the graph and decoding the edges.

4.3.1.2 Model

Similar to VAEs, the VGAE model consists of an encoder $q_\phi(Z|A, X)$, a decoder $p_\theta(A|Z)$, and a prior $p(Z)$.

Encoder. The goal of the encoder is to learn a distribution over the latent variables associated with each node, conditioned on the node features $X$ and the adjacency matrix $A$. We can instantiate $q_\phi(Z|A, X)$ as a graph neural network with learnable parameters $\phi$. In particular, VGAE assumes a node-independent encoder as below,

$q_\phi(Z|X, A) = \prod_{i=1}^{N} q_\phi(z_i|X, A)$   (4.31)

$q_\phi(z_i|X, A) = \mathcal{N}\big(z_i \,|\, \mu_i, \mathrm{diag}(\sigma_i^2)\big)$   (4.32)

$\mu, \sigma = \mathrm{GCN}_\phi(X, A)$   (4.33)

where $z_i$, $\mu_i$, and $\sigma_i$ are the $i$-th rows of the matrices $Z$, $\mu$, and $\sigma$, respectively. Basically, we assume a multivariate normal distribution with diagonal covariance as the variational approximation of the distribution of the latent vector of each node (i.e., $z_i$). The mean and diagonal covariance are predicted by the encoder network, i.e., a GCN as described in Section 4.2.2. For example, the original paper uses a two-layer GCN as follows,

$\mu = \tilde{A} H W_\mu$   (4.34)

$\sigma = \tilde{A} H W_\sigma$   (4.35)

$H = \mathrm{ReLU}(\tilde{A} X W_0)$,   (4.36)

where $\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is the symmetrically normalized adjacency matrix and $D$ is the degree matrix. The learnable parameters are thus $\phi = [W_\mu, W_\sigma, W_0]$.
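A sketch of the two-layer GCN encoder of Eqs. (4.34)-(4.36). Predicting the log-variance instead of $\sigma$ is our own choice for numerical convenience; class and variable names are illustrative.

import torch
import torch.nn as nn

class VGAEEncoder(nn.Module):
    def __init__(self, in_dim, hid_dim, latent_dim):
        super().__init__()
        self.W0 = nn.Linear(in_dim, hid_dim, bias=False)
        self.W_mu = nn.Linear(hid_dim, latent_dim, bias=False)
        self.W_logvar = nn.Linear(hid_dim, latent_dim, bias=False)

    def forward(self, X, A_norm):
        # A_norm: symmetrically normalized adjacency D^{-1/2} A D^{-1/2}
        H = torch.relu(A_norm @ self.W0(X))        # Eq. (4.36)
        mu = A_norm @ self.W_mu(H)                 # Eq. (4.34)
        logvar = A_norm @ self.W_logvar(H)         # log sigma^2, cf. Eq. (4.35)
        return mu, logvar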
Decoder. Given sampled latent variables, the decoder aims at predicting the connectivity among nodes. The original paper adopts a simple dot-product-based predictor:

$p(A|Z) = \prod_{i=1}^{N} \prod_{j=1}^{N} p(A_{ij} | z_i, z_j)$   (4.37)

$p(A_{ij} | z_i, z_j) = \sigma(z_i^\top z_j)$,   (4.38)

where $A_{ij}$ denotes the $(i, j)$-th element of $A$ and $\sigma(\cdot)$ is the logistic sigmoid function. This decoder again assumes conditional independence among all possible edges for tractability. Note that there are no learnable parameters associated with this decoder; the only way to improve its performance is to learn good latent representations.
Prior. The prior distribution over the latent variables is simply set to independent zero-mean Gaussians with unit variance,

$p(Z) = \prod_{i=1}^{N} \mathcal{N}(z_i | 0, I)$.   (4.39)

This prior is fixed throughout learning, as in typical VAEs.
Objective & Learning To learn the encoder and the decoder, one typically max-
imize the evidence lower bound (ELBO) as in VAEs,

LELBO = Eqf (Z|X,A) [log p(A|Z)] KL(qf (Z|X, A)kp(Z)), (4.40)

where KL(qkp) is the Kullback-Leibler divergence between distributions q and p.


Note that we can not directly maximize the log likelihood since the introduction
of latent variables Z induces a high-dimensional integral which is intractable. We
instead maximize the ELBO in Eq. (4.40) which is a lower bound of the log like-
lihood. However, the first expectation term is again intractable. One often resorts
to the Monte Carlo estimation by sampling a few Z from the encoder qf (Z|X, A)
and evaluating the term using the samples. To maximize the objective, one can per-
form stochastic gradient descent along with the reparameterization trick (Kingma
and Welling, 2014). Note that the reparameterization trick is necessary since we
need to back-propagate through the sampling in the aforementioned Monte Carlo
estimation term to compute the gradient w.r.t. the parameters of the encoder.
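A one-sample Monte Carlo estimate of the ELBO in Eq. (4.40), with the reparameterization trick and the dot-product decoder of Eq. (4.38), can be sketched as follows; the edge/non-edge reweighting mentioned later in the discussion is omitted, and the function name is ours.

import torch
import torch.nn.functional as F

def vgae_elbo(mu, logvar, A):
    # A: dense {0, 1} adjacency (with self-loops) as a float tensor.
    std = torch.exp(0.5 * logvar)
    Z = mu + std * torch.randn_like(std)         # reparameterization trick
    logits = Z @ Z.t()                           # sigmoid applied inside the BCE below
    recon = -F.binary_cross_entropy_with_logits(logits, A, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon - kl                            # maximize this quantity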

4.3.1.3 Discussion

The VGAE model is popular in the literature mainly due to its simplicity and good empirical performance. For example, since there are no learnable parameters in the prior or the decoder, the model is quite lightweight and learning is fast. Moreover, the VGAE model is versatile: once we have learned a good encoder, i.e., good latent representations, we can use them for predicting edges (i.e., link prediction), node attributes, and so on. On the other hand, the VGAE model is still limited in the following ways. First, it cannot serve as a good generative model for graphs, as VAEs do for images, since the decoder is not learnable. One could simply design some learnable decoder; however, it is not clear that the goals of learning good latent representations and generating high-quality graphs are always well aligned, and more exploration along this direction would be fruitful. Second, the independence assumption is exploited in both the encoder and the decoder, which might be too limiting; more structured dependencies (e.g., auto-regressive ones) would be desirable to improve the model capacity. Third, as discussed in the original paper, the prior may potentially be a poor choice. Lastly, for link prediction in practice, one may need to weight edges vs. non-edges in the decoder term and carefully tune this weighting, since graphs may be very sparse.

4.3.2 Deep Graph Infomax

Following Mutual Information Neural Estimation (MINE) (Belghazi et al, 2018) and Deep Infomax (Hjelm et al, 2018), Deep Graph Infomax (Veličković et al, 2019) is an unsupervised learning framework that learns graph representations via the principle of mutual information maximization.

4.3.2.1 Problem Setup

Following the original paper, we will explain the model under the single-graph
setup, i.e., the node feature matrix X and the graph adjacency matrix A of a single
graph are provided as input. Extensions to other problem setups like transductive
and inductive learning settings will be discussed in Section 4.3.2.3. The goal is to
learn the node representations in an unsupervised way. After node representations
are learned, one can apply some simple linear (logistic regression) classifier on top
of the representations to perform supervised tasks like node classification.

4.3.2.2 Model


Fig. 4.2: The overall process of Deep Graph Infomax. The top path shows how the positive sample is processed, whereas the bottom path shows the process for the negative sample. Note that the graph representation is shared between positive and negative samples. Subgraphs of positive and negative samples do not necessarily need to be different. Adapted from Figure 1 of (Veličković et al, 2019).

The main idea of the model is to maximize the local mutual information between a node representation (capturing local graph information) and the graph representation (capturing global graph information). By doing so, the learned node representations should capture as much of the global graph information as possible. Let us denote the graph encoder as $e$, which could be any GNN discussed before, e.g., a two-layer GCN. We can obtain all node representations as $H = e(X, A)$, where the representation $h_i$ of any node $i$ should contain some local information near node $i$; specifically, a $k$-layer GCN can leverage node information that is $k$ hops away. To get the global graph information, one can use a readout layer/function to process all node representations, i.e., $s = R(H)$, where the readout function $R$ could be some learnable pooling function or simply an average operator.
Objective. Given the local node representation $h_i$ and the global graph representation $s$, the natural next step is to compute their mutual information. Recall that the definition of mutual information is

$\mathrm{MI}(h, s) = \int \int p(h, s) \log \Big( \frac{p(h, s)}{p(h)\,p(s)} \Big) \, dh \, ds$.   (4.41)

However, maximizing the local mutual information alone is not enough to learn useful representations, as shown in (Hjelm et al, 2018). To develop a more practical objective, the authors of (Veličković et al, 2019) instead use a noise-contrastive objective following Deep Infomax (Hjelm et al, 2018),

$\mathcal{L} = \frac{1}{N + M} \Big( \sum_{i=1}^{N} \mathbb{E}_{(X, A)}\big[\log D(h_i, s)\big] + \sum_{j=1}^{M} \mathbb{E}_{(\tilde{X}, \tilde{A})}\big[\log \big(1 - D(\tilde{h}_j, s)\big)\big] \Big)$,   (4.42)

where $D$ is a binary classifier which takes both the node representation $h_i$ and the graph representation $s$ as input and predicts whether the pair $(h_i, s)$ comes from the joint distribution $p(h, s)$ (positive class) or from the product of marginals $p(h_i)p(s)$ (negative class). We denote by $\tilde{h}_j$ the $j$-th node representation from the negative sample. The numbers of positive and negative samples are $N$ and $M$, respectively; we will explain how to draw positive and negative samples shortly. The overall objective is thus the negative binary cross-entropy for training a probabilistic classifier. Note that this objective is the same type of distance as used in generative adversarial networks (GANs) (Goodfellow et al, 2014b), which is shown to be proportional to the Jensen-Shannon divergence (Goodfellow et al, 2014b; Nowozin et al, 2016). As verified by (Hjelm et al, 2018), maximizing the Jensen-Shannon-divergence-based mutual information estimator behaves similarly (i.e., they have an approximately monotonic relationship) to directly maximizing the mutual information. Therefore, maximizing the objective in Eq. (4.42) is expected to maximize the mutual information. Moreover, the freedom in choosing negative samples makes the method more likely to learn useful representations than maximizing the vanilla mutual information.
Negative Sampling. To generate positive samples, one can directly sample a few nodes from the graph to construct the pairs $(h_i, s)$. Negative samples can be generated by corrupting the original graph data, denoted as $(\tilde{X}, \tilde{A}) = \mathcal{C}(X, A)$. In practice, one can choose various forms of this corruption function $\mathcal{C}$. For example, the authors of (Veličković et al, 2019) suggest keeping the adjacency matrix unchanged and corrupting the node features $X$ by row-wise shuffling. Other possibilities for the corruption function include randomly sampling subgraphs and applying Dropout (Srivastava et al, 2014) to the node features.

Once the positive and negative samples are collected, one can learn the representations by maximizing the objective in Eq. (4.42). We summarize the training process of Deep Graph Infomax as follows (a code sketch of one training step is given after the list):
1. Sample negative examples via the corruption function $(\tilde{X}, \tilde{A}) \sim \mathcal{C}(X, A)$.
2. Compute the node representations of the positive samples, $H = \{h_1, \cdots, h_N\} = e(X, A)$.
3. Compute the node representations of the negative samples, $\tilde{H} = \{\tilde{h}_1, \cdots, \tilde{h}_M\} = e(\tilde{X}, \tilde{A})$.
4. Compute the graph representation via the readout function, $s = R(H)$.
5. Update the parameters of $e$, $D$, and $R$ via gradient ascent to maximize Eq. (4.42).
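The sketch below is our own minimal rendering of one such step, assuming `encoder` is any module that maps $(X, A)$ to node representations (e.g., a two-layer GCN). The bilinear discriminator, sigmoid mean readout, and row-shuffling corruption follow common practice but are illustrative choices rather than the exact original implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DGI(nn.Module):
    def __init__(self, encoder, dim):
        super().__init__()
        self.encoder = encoder
        self.disc = nn.Bilinear(dim, dim, 1)                 # discriminator D(h, s)

    def loss(self, X, A):
        X_neg = X[torch.randperm(X.size(0))]                 # corruption: row-wise shuffle
        H_pos = self.encoder(X, A)
        H_neg = self.encoder(X_neg, A)
        s = torch.sigmoid(H_pos.mean(dim=0, keepdim=True))   # readout R(H)
        pos = self.disc(H_pos, s.expand_as(H_pos))
        neg = self.disc(H_neg, s.expand_as(H_neg))
        logits = torch.cat([pos, neg], dim=0).squeeze(-1)
        labels = torch.cat([torch.ones(H_pos.size(0)), torch.zeros(H_neg.size(0))])
        # minimizing this BCE is equivalent to maximizing the objective in Eq. (4.42)
        return F.binary_cross_entropy_with_logits(logits, labels)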

4.3.2.3 Discussion

Deep Graph Infomax is an efficient unsupervised representation learning method for graph-structured data. The implementations of the encoder, the readout, and the binary cross-entropy type of loss are all straightforward. Mini-batch training does not necessarily need to store the whole graph, since the readout can be applied to a set of subgraphs as well; the method is therefore memory-efficient. Also, the processing of positive and negative samples can be done in parallel. Moreover, the authors prove that minimizing the cross-entropy type of classification error can be used to maximize the mutual information under certain conditions, e.g., when the readout function is injective and the input features come from a finite set. However, the choice of the corruption function seems to be crucial for ensuring satisfactory empirical performance, and there seems to be no universally good corruption function; one needs to use trial and error to obtain a proper one depending on the task/dataset.

4.4 Over-smoothing Problem

Training deep graph neural networks by stacking multiple layers of graph neural networks usually yields inferior results, which is a common problem observed in many different graph neural network architectures. This is mainly due to the problem of over-smoothing, which was first explicitly studied in (Li et al, 2018b). That work showed that the graph convolutional network (Kipf and Welling, 2017b) is a special case of Laplacian smoothing:

$Y = (1 - \gamma) X + \gamma \tilde{A}_{rw} X$,   (4.43)

where $\tilde{A}_{rw} = \tilde{D}^{-1} \tilde{A}$, which defines the transition probabilities between nodes on the graph. GCN corresponds to a special case of Laplacian smoothing with $\gamma = 1$ and the symmetric matrix $\tilde{A}_{sym} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$. Laplacian smoothing pushes nodes belonging to the same cluster to take similar representations, which is beneficial for downstream tasks such as node classification. However, when GCNs go deep, the node representations suffer from over-smoothing, i.e., all the nodes end up with similar representations, and the performance on downstream tasks suffers as well. This phenomenon has also been pointed out by later work, such as (Zhao and Akoglu, 2019; Li et al, 2018b; Xu et al, 2018a; Li et al, 2019c; Rong et al, 2020b).

PairNorm (Zhao and Akoglu, 2019). Next, we present a method called PairNorm for alleviating the over-smoothing problem when GNNs go deep. The essential idea of PairNorm is to keep the total pairwise squared distance (TPSD) of the node representations unchanged, i.e., the same as that of the original node features $X$. Let $\tilde{H}$ be the node representations output by a graph convolution, which will be the input of PairNorm, and let $\hat{H}$ be the output of PairNorm. The goal of PairNorm is to normalize $\tilde{H}$ such that after normalization $\mathrm{TPSD}(\hat{H}) = \mathrm{TPSD}(X)$. In other words,

$\sum_{(i,j) \in \mathcal{E}} \lVert \hat{H}_i - \hat{H}_j \rVert^2 + \sum_{(i,j) \notin \mathcal{E}} \lVert \hat{H}_i - \hat{H}_j \rVert^2 = \sum_{(i,j) \in \mathcal{E}} \lVert X_i - X_j \rVert^2 + \sum_{(i,j) \notin \mathcal{E}} \lVert X_i - X_j \rVert^2$.   (4.44)
In practice, instead of measuring the TPSD of the original node features $X$, (Zhao and Akoglu, 2019) propose to maintain a constant TPSD value $C$ across different graph convolutional layers. The value $C$ is a hyperparameter of the PairNorm layer and can be tuned for each data set. To normalize $\tilde{H}$ into $\hat{H}$ with a constant TPSD, we must first calculate $\mathrm{TPSD}(\tilde{H})$. However, doing this directly is computationally expensive, quadratic in the number of nodes $N$. Notice that the TPSD can be reformulated as

$\mathrm{TPSD}(\tilde{H}) = \sum_{(i,j) \in [N]} \lVert \tilde{H}_i - \tilde{H}_j \rVert^2 = 2N^2 \Big( \frac{1}{N} \sum_{i=1}^{N} \lVert \tilde{H}_i \rVert_2^2 - \Big\lVert \frac{1}{N} \sum_{i=1}^{N} \tilde{H}_i \Big\rVert_2^2 \Big)$.   (4.45)

We can further simplify the above equation by subtracting the row-wise mean from each $\tilde{H}_i$, i.e., $\tilde{H}_i^c = \tilde{H}_i - \frac{1}{N} \sum_{i=1}^{N} \tilde{H}_i$, which denotes the centered representation. A nice property of centering the node representations is that it does not change the TPSD and meanwhile pushes the second term $\lVert \frac{1}{N} \sum_{i=1}^{N} \tilde{H}_i \rVert_2^2$ to zero. As a result, we have

$\mathrm{TPSD}(\tilde{H}) = \mathrm{TPSD}(\tilde{H}^c) = 2N \lVert \tilde{H}^c \rVert_F^2$.   (4.46)


To summarize, the proposed PairNorm can be divided into two steps, center and scale:

$\tilde{H}_i^c = \tilde{H}_i - \frac{1}{N} \sum_{i=1}^{N} \tilde{H}_i$   (Center)   (4.47)

$\hat{H}_i = s \cdot \frac{\tilde{H}_i^c}{\sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert \tilde{H}_i^c \rVert_2^2}} = s \sqrt{N} \cdot \frac{\tilde{H}_i^c}{\lVert \tilde{H}^c \rVert_F}$   (Scale)   (4.48)

where $s$ is a hyperparameter determining $C$. In the end, we have

$\mathrm{TPSD}(\hat{H}) = 2N \lVert \hat{H} \rVert_F^2 = 2N \sum_i \Big\lVert s \cdot \frac{\tilde{H}_i^c}{\sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert \tilde{H}_i^c \rVert_2^2}} \Big\rVert_2^2 = 2N^2 s^2$,   (4.49)

which is a constant across different graph convolutional layers.
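The two PairNorm steps translate directly into a few lines of code; the following is a minimal sketch of Eqs. (4.47)-(4.48), applied to the output of a graph convolution before the nonlinearity or the next layer.

import torch

def pair_norm(H_tilde, s=1.0):
    H_c = H_tilde - H_tilde.mean(dim=0, keepdim=True)   # Center, Eq. (4.47)
    scale = H_c.pow(2).sum(dim=1).mean().sqrt()         # sqrt((1/N) sum_i ||H_i^c||^2)
    return s * H_c / scale                              # Scale, Eq. (4.48)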

4.5 Summary

In this chapter, we gave a comprehensive introduction to different architectures of graph neural networks for node classification. These neural networks can generally be classified into two categories: supervised and unsupervised approaches. For the supervised approaches, the main differences among architectures lie in how messages are propagated between nodes, how the messages from neighbors are aggregated, and how the aggregated messages are combined with the node's own representation. For the unsupervised approaches, the main difference comes from the design of the objective function. We also discussed a common problem in training deep graph neural networks, over-smoothing, and introduced a method to tackle it. In the future, promising directions for graph neural networks include theoretical analysis for understanding their behavior, and applying them to a variety of fields and domains such as recommender systems, knowledge graphs, drug and material discovery, computer vision, and natural language understanding.

Editor's Notes: Node classification is one of the most important tasks for graph neural networks. The node representation learning techniques introduced in this chapter are the cornerstone for all other tasks in the rest of the book, including graph classification (Chapter 9), link prediction (Chapter 10), graph generation (Chapter 11), and so on. Familiarity with the learning methodologies and design principles of node representation learning is key to deeply understanding other fundamental research directions such as theoretical analysis (Chapter 5), scalability (Chapter 6), explainability (Chapter 7), and adversarial robustness (Chapter 8).
