A Practical Tutorial on Graph Neural Networks
What are the fundamental motivations and mechanics that drive Graph Neural Networks, what are the
different variants, and what are their applications?
ISAAC RONALD WARD1 , ISOLABS, Australia and the University of Southern California, USA
JACK JOYNER1 , ISOLABS, Australia
CASEY LICKFOLD1 , ISOLABS, Australia
YULAN GUO, Sun Yat-sen University, China
MOHAMMED BENNAMOUN, The University of Western Australia, Australia
Graph neural networks (GNNs) have recently grown in popularity in the field of artificial intelligence (AI) due to their
unique ability to ingest relatively unstructured data types as input data. Although some elements of the GNN architecture are
conceptually similar in operation to traditional neural networks (and neural network variants), other elements represent a
departure from traditional deep learning techniques. This tutorial exposes the power and novelty of GNNs to AI practitioners
by collating and presenting details regarding the motivations, concepts, mathematics, and applications of the most common
and performant variants of GNNs. Importantly, we present this tutorial concisely, alongside practical examples, thus providing
a practical and accessible tutorial on the topic of GNNs.
CCS Concepts: • Theory of computation → Machine learning theory; • Mathematics of computing; • Computing
methodologies → Artificial intelligence; Machine learning approaches; Machine learning algorithms;
Additional Key Words and Phrases: graph neural network, tutorial, artificial intelligence, recurrent, convolutional, autoencoder, decoder, machine learning, deep learning, papers with code, theory, applications
1 Equal contribution.
Authors’ addresses: Isaac Ronald Ward1 , ISOLABS, Australia and the University of Southern California, USA, isaacronaldward@gmail.com;
Jack Joyner1 , ISOLABS, Australia; Casey Lickfold1 , ISOLABS, Australia; Yulan Guo, Sun Yat-sen University, China; Mohammed Bennamoun,
The University of Western Australia, Australia.
© 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM.
https://doi.org/10.1145/3503043
A = \begin{bmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0
\end{bmatrix}
(c) A diagram of an alcohol molecule (left), its associated graph representation with vertex indices labelled (middle), and its adjacency matrix (right).
(d) A vector representation and a Reed-Kellogg diagram (rendered according to modern tree conventions) of the same sentence. The graph structure encodes dependencies and constituencies.
(e) A game-playing tree can be represented as a graph. Vertices are states of the game and directed edges represent actions which take us from one state to another.
Fig. 1. The graph data structure is highly abstract, and can be used to represent images (matrices), molecules, sentence structures, game playing trees, etc.
Graph neural networks (GNNs) provide a unified view of these input data types: the images used as inputs in computer vision, and the sentences used as inputs in NLP, can both be interpreted as special cases of a single, general data structure: the graph (see Figure 1 for examples).
Formally, a graph is a set of distinct vertices (representing items or entities) that are joined optionally to each
other by edges (representing relationships). Uniquely, the graphs fed into a GNN (during training and evaluation)
do not have strict structural requirements per se; the number of vertices and edges between input graphs
can change. In this way, GNNs can handle unstructured, non-Euclidean data [7], a property which makes them
valuable in problem domains where graph data is abundant. Conversely, NN-based algorithms are typically
required to operate on structured inputs with strictly defined dimensions. For example, a CNN built to classify
over the MNIST dataset must have an input layer of 28 × 28 neurons, and all subsequent input images must be
28 × 28 pixels in size to conform to this strict dimensionality requirement [50].
The expressiveness of graphs as a method for encoding data, and the flexibility of GNNs with respect to unstructured inputs, have motivated their research and development. They represent a new approach for exploring relatively general DL methods, and they facilitate the application of DL approaches to sets of data which, until recently, were not exposed to AI.
1.1 Contributions
The key contributions of this tutorial paper are as follows:
(1) An easy to understand, introductory tutorial, which assumes no prior knowledge of GNNs1.
(2) Step-wise explanations of the mechanisms that underpin specific classes of GNNs, as enumerated in Table 1. These explanations progressively build a holistic understanding of GNNs.
(3) Descriptions of the advantages and disadvantages of GNNs, and key areas of application.
(4) Full examples of how specific GNN variants can be applied to real world problems.
1.2 Taxonomy
The structure and taxonomy of this paper is outlined in Table 1.
1 We envisage that this work will serve as the ‘first port of call’ for those looking to understand GNNs, rather than as a comprehensive survey
of methods and applications. For those seeking a more comprehensive treatment, we highly recommend the following works [30, 98, 108, 110]
(see Table 2 for more detail).
Table 1. A variety of algorithms are discussed in this tutorial paper. This table illustrates potential use cases for each algorithm, and the section where they are discussed. Should the reader prefer to read this tutorial paper from an applications / downstream task-based perspective, then we invite them to review Table 5, Table 6, and Table 8, which link each algorithm to its applications.
Graph Neural Networks: A Review of Methods and Applications [110]. Covers: the GNN design framework, GNN modules, GNN variants, theoretical and empirical analyses, and applications. A review paper which proposes a general design framework for GNN models, and systematically elucidates, compares, and discusses the varying GNN modules which can exist within the components of said framework.
Deep Learning on Graphs: A Survey [108]. Covers: recurrent GNNs, convolutional GNNs, graph autoencoders, graph RL, and graph adversarial methods. A survey paper which outlines the development history and general operations of each major category of GNN. A complete survey of the GNN variants within said categories is provided (including links to implementations and discussions on computational complexity).
Computing graph neural networks: A survey from algorithms to accelerators [1]. Covers: GNN fundamentals, modeling, applications, complexity, algorithms, accelerators, and data flows. A review of the field of GNNs is presented from a computing perspective. A brief tutorial is included on GNN fundamentals, alongside an in-depth analysis of acceleration schemes, culminating in a communication-centric vision of GNN accelerators.
Table 2. A comparison of our tutorial and related works. While other works provide comprehensive overviews of the field, our
work focuses on explaining and illustrating key GNN techniques to the AI practitioner. Our goal is to act as a ‘first port of
call’ for readers, providing them with a basic understanding that they can build upon when reading more advanced material.
2 PRELIMINARIES
Notation : Meaning
V : A set of vertices.
|V| : The number of vertices in a set of vertices V.
v_i : The i-th vertex in a set of vertices V.
v_i^F : The feature vector of vertex v_i.
E : A set of edges.
|E| : The number of edges in a set of edges E.
e_ij : The edge between the i-th vertex and the j-th vertex, in a set of edges E.
e_ij^F : The feature vector of edge e_ij.
G = G(V, E) : A graph defined by the set of vertices V and the set of edges E.
N_{v_i} : The set of vertex indices for the vertices that are direct neighbors of v_i.
h_i^k : The k-th hidden layer's representation of the i-th vertex. Since each layer typically aggregates information from neighbors 1 hop away, this representation includes information from neighbors k hops away.
o_i : The i-th output of a GNN (indexing is dependent on output structure).
I_N : An N × N identity matrix; all zeros except for ones along the diagonal.
A : The adjacency matrix; each element A_ij represents whether the i-th vertex is connected to the j-th vertex by an edge.
D : The degree matrix; a diagonal matrix of vertex degrees or valencies (the number of edges incident to a vertex). Formally defined as D_ii = Σ_j A_ij.
A^W : The weight matrix; each element A^W_ij represents the ‘weight’ of the edge between the i-th vertex and the j-th vertex. The ‘weight’ typically represents some real concept or property. For example, the weight between two given vertices could be inversely proportional to their distance from one another (i.e., close vertices have a higher weight between them). Graphs with a weight matrix are referred to as weighted graphs, but not all graphs are weighted graphs; in unweighted graphs A^W = A.
M : The incidence matrix; a |V| × |E| matrix where, for each edge e_ij, the element of M at (i, e_ij) is +1, and at (j, e_ij) is −1. All other elements are set to zero. M describes the incidence of all edges to all vertices in a graph.
L : The non-normalized combinatorial graph Laplacian; defined as L = D − A^W.
L^sym : The symmetric normalized graph Laplacian; defined as L^sym = I_N − D^{−1/2} A D^{−1/2}.
Table 3. Notation used in this work. We suggest that the reader familiarise themselves with this notation before proceeding.
Here we discuss some basic elements of graph theory, as well as the key concepts required to understand how GNNs are formulated and operate. We present the notation which will be used consistently in this work (see Table 3).
of vertices is only partially labelled. The training set consists of the labelled vertices, and the testing set consists
of both a small set of labelled vertices (for benchmarking) and the remaining unlabelled vertices. In this case, our
learning methods should be exposed to the entire graph during training (including the test vertices), because the
additional information (e.g. structural patterns) will be useful to learn from. Transductive learning methods are
useful in such cases where it is challenging to separate the training and testing data without introducing biases.
Inductive learning methods reserve separate training and testing datasets. The learning process ingests the
training data, and then the learned model is tested using the testing data, which it has not observed before in any
capacity.
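As a concrete illustration of the transductive setting, the sketch below keeps the whole graph visible to the model and uses boolean masks to decide which vertices contribute to the training loss and which are held out for evaluation. The graph, labels, masks, and probabilities are illustrative, not taken from any dataset discussed in this paper.

```python
import numpy as np

# A toy graph with 6 vertices. In the transductive setting the full adjacency
# matrix and all vertex features are visible to the model during training;
# only the labelled (masked) vertices contribute to the supervised loss.
A = np.array([[0, 1, 0, 0, 1, 0],
              [1, 0, 1, 0, 0, 0],
              [0, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [1, 0, 0, 1, 0, 0],
              [0, 0, 0, 1, 0, 0]], dtype=float)
labels = np.array([0, 0, 1, 1, 0, 1])

train_mask = np.array([True, True, False, True, False, False])   # labelled vertices
test_mask = ~train_mask                                           # held-out vertices

def masked_cross_entropy(probs, labels, mask):
    """Cross-entropy restricted to the masked vertices."""
    idx = np.flatnonzero(mask)
    return -np.mean(np.log(probs[idx, labels[idx]] + 1e-12))

# A stand-in for a GNN's per-vertex class probabilities over the *whole* graph.
probs = np.full((6, 2), 0.5)
print(masked_cross_entropy(probs, labels, train_mask))   # loss on training vertices
print(masked_cross_entropy(probs, labels, test_mask))    # evaluation on test vertices
```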
Fig. 2. An RGNN forward pass for a simple input graph G(V, E) with |V| = 4, |E| = 4. G goes through k layers of processing. In each layer, each vertex's features v_i^F (green), the features of its neighborhood N_{v_i} (yellow), and the previous hidden layer (purple) are processed by the state transition function f and aggregated, thereby producing successive embeddings of G. Note that the neighborhood features must be aggregated into a fixed embedding size, otherwise f would need to handle variable input sizes. This is repeated until the embeddings converge (i.e., the change between consecutive embeddings fails to exceed some stopping threshold). At that stage, the embeddings are fed to an output function g, which performs some downstream task; in this case, the task is a vertex-level classification problem. Note that f and g can be implemented as NNs and trained via backpropagation of supervised error signals through the unrolled computation graph [60, 74]. Note that each vertex's embedding includes information from at most k ‘hops’ away after the k-th layer of processing. Image best viewed in colour.
In order to recurrently apply this learned transition function to compute successive embeddings, f must have a fixed number of input and output variables. How then can it be dependent on the immediate neighborhood, which might vary in size depending on where we are in the graph? There are two simple solutions, the first of which is to set a ‘maximum neighborhood size’ and use null vectors when dealing with vertices that have non-existing neighbors [74]. The second approach is to aggregate all neighborhood features in some permutation invariant manner [25], thus ensuring that any neighborhood in the graph is represented by a fixed size feature vector. While both approaches are viable, the first approach does not scale well to ‘scale-free graphs’, which have degree distributions that follow a power law. Since many real world graphs (e.g. social networks) are scale-free [72], we’ll use the second solution here. Mathematically, this can be formulated as in Equation 1 [74].
h_i^k = \sum_{j \in N_{v_i}} f\left(v_i^F, e_{ij}^F, v_j^F, h_j^{k-1}\right), \quad \text{where all } h_i^0 \text{ are defined on initialisation.} \qquad (1)
We can see that under this formulation (Equation 1), f is well defined. It accepts four feature vectors which all have a defined length, regardless of which vertex in the graph is being considered, and regardless of the iteration. This means that the transition function can be applied iteratively, until a stable embedding is reached for all vertices in the input graph. This expression can be interpreted as passing ‘messages’, or features, throughout the graph; in every iteration, the embedding h_i^k is dependent on the features and embeddings of its neighbors. This means that with enough recurrent iterations, information will propagate throughout the whole graph: after the first iteration, any vertex’s embedding encodes the features of the neighborhood within a range of a single edge. In the second iteration, any vertex’s embedding is an encoding of the features of the neighborhood within a range of two edges away, and so on. The iterative passing of ‘messages’ to generate an encoding of the graph is what gives this message passing framework its name2.
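A minimal NumPy sketch of Equation 1 follows, with the transition function f implemented as a single learned layer and the neighborhood aggregated by a permutation-invariant sum. All dimensions, weights, iteration count, and the stopping threshold are illustrative assumptions, not values prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_v, d_e, d_h = 4, 3, 2, 5            # vertices, vertex/edge feature dims, hidden dim
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=int)    # adjacency matrix
V = rng.normal(size=(n, d_v))              # vertex features v_i^F
E = rng.normal(size=(n, n, d_e))           # edge features e_ij^F
H = np.zeros((n, d_h))                     # h_i^0, defined on initialisation

W = rng.normal(size=(d_v + d_e + d_v + d_h, d_h))   # a single-layer transition function f
def f(x):
    return np.tanh(x @ W)

def transition(H):
    """One application of Equation 1: h_i^k = sum_{j in N(v_i)} f(v_i^F, e_ij^F, v_j^F, h_j^{k-1})."""
    H_next = np.zeros_like(H)
    for i in range(n):
        for j in np.flatnonzero(A[i]):                   # direct neighbors of v_i
            x = np.concatenate([V[i], E[i, j], V[j], H[j]])
            H_next[i] += f(x)                            # permutation-invariant sum
    return H_next

# Iterate until the embeddings stabilise (or a maximum number of iterations is reached).
for _ in range(10):
    H_new = transition(H)
    if np.linalg.norm(H_new - H) < 1e-3:
        break
    H = H_new
```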
Note that it is typical to explicitly add the identity matrix I_N to the adjacency matrix A, thus ensuring that all vertices become trivially connected to themselves, i.e., v_i ∈ N_{v_i} for every vertex in V. Moreover, this allows us to directly access the neighborhood by iterating through a single row of the adjacency matrix. This modified adjacency matrix is usually normalised to prevent unwanted scaling of embeddings.
When discussing recurrence thus far, we have referred mainly to computing techniques that are iteratively applied to neighborhoods in a graph to produce embeddings that are dependent on information propagated throughout the graph. However, recurrent techniques may also refer to computing processes over sequential data, e.g., time series data. In the graph domain, sequential data refers to instances which can be interpreted as graphs with features that change over time. These include spatiotemporal graphs [98]. For example, Figure 1 (b) illustrates how a graph can represent a skeletal structure in a single image of a hand. If we were to create such a graph for every frame of a contiguous video of a moving hand, we would have a data structure that could be interpreted as a sequence of individual graphs, or a single graph with sequential features, and such data could be used for classifying hand actions in video.
As is the case with traditional sequential data, when processing each state of the sequence we want to consider not only the current state but also information from the previous states, as outlined in Figure 6 (a). A simple solution to this challenge might be to concatenate the graph embeddings of previous states to the features of the current state (as in Figure 6 (b)), but such approaches do not capture long term dependencies in the data. In this section, we outline how existing solutions from traditional DL, such as Long Short-Term Memory networks (LSTMs) and Gated Recurrent Units (GRUs) (outlined in Figure 6), can be extended to the graph domain.
Graph LSTMs (GLSTMs) make use of LSTM cells that have been adapted to operate on graph-based data. Whereas the aforementioned recurrent modules (Figure 6 (b)) employ a simple concatenation strategy, GLSTMs ensure that long-term dependencies can be encoded in the LSTM’s ‘cell state’ (Figure 6 (c)). This alleviates the vanishing gradient problem, where long-term dependency signals are exponentially reduced when backpropagated throughout the network [35, 36].
The GLSTM cell achieves this through four key processing elements, which learn to calculate useful quantities based on the previous state’s embedding and the input from the current state (as illustrated in Figure 6 (c)).
(1) The forget gate, which uses W_f to extract values in the range [0, 1], representing whether elements in the previous cell’s state should be ‘forgotten’ (0) or retained (1).
(2) The input gate, which uses W_i to extract values in the range [0, 1], indicating the amount of the modulated input which will be added to this cell’s cell state.
(3) The input modulation gate, which uses W_g to extract values in the range [-1, 1], representing learned information from this cell’s input.
(4) The output gate, which uses W_o to calculate values in the range [0, 1], indicating which parts of the cell state should be output as this cell’s hidden state.
To use GLSTMs, we need to define all the operators in Figure 6 (e). Since a graph G(V, E) can be thought of as a variably sized set of vertices and edges, we can define graph concatenation as the separate concatenation of vertex features and edge features, where some null padding is used to ensure that the resultant tensor is of a fixed size. This can be achieved by defining some ‘max number’ of vertices for the input graphs. If the input signal for the GLSTM cell has a fixed size, then all other operators can be interpreted as traditional tensor operations, and the entire process is differentiable when it comes to backpropagation.
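The sketch below shows one step of a GLSTM cell under these assumptions: graph concatenation has already produced a fixed-size input vector x_t, and the four gates follow the standard LSTM equations, with weight names (W_f, W_i, W_g, W_o) chosen to mirror the list above. All shapes and parameters are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glstm_cell(x_t, h_prev, c_prev, params):
    """One GLSTM step. x_t is the fixed-size graph concatenation of the current
    state's vertex/edge features; h_prev and c_prev are the previous hidden and
    cell states. `params` holds the four weight matrices and their biases."""
    z = np.concatenate([x_t, h_prev])
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])   # forget gate, in [0, 1]
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])   # input gate, in [0, 1]
    g_t = np.tanh(params["W_g"] @ z + params["b_g"])   # input modulation gate, in [-1, 1]
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])   # output gate, in [0, 1]
    c_t = f_t * c_prev + i_t * g_t                     # new cell state (long-term memory)
    h_t = o_t * np.tanh(c_t)                           # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
d_x, d_h = 16, 8
params = {k: rng.normal(size=(d_h, d_x + d_h)) if k.startswith("W") else np.zeros(d_h)
          for k in ["W_f", "b_f", "W_i", "b_i", "W_g", "b_g", "W_o", "b_o"]}
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):                  # a sequence of 5 graph snapshots
    h, c = glstm_cell(x_t, h, c, params)
```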
Fig. 5. TSNE renderings of final hidden graph representations for the x1, x2, x4, x8 hidden layer networks. Note that with more applications of the transition function (equivalent to more layers in a NN) the final hidden representations of the input graphs become more linearly separable into their classes (hence why they are able to be better classified using only a linear classifier).
Here, our transition function f was a ‘feedforward NN’ with just one layer, so more advanced NN (or other) implementations of f might result in more performant RGNNs. As more rounds of the transition function were applied to our hidden states, the performance, and the required computation, increased. Ensuring a consistent number of transition function applications is key in developing simplified GNN architectures, and in reducing the amount of computation required in the transition stage. We will explore how this improved concept is realised through CGNNs in Section 4.2.
(a) The general processing approach for sequential graph data. The input data is a sequence of graphs (blue), and each processing cell considers not only the current input state, but also information from the states preceding it, thus yielding per-state embeddings (purple) which are dependent on the sequence thus far.
(b) A simple recurrent cell, which learns to extract useful features from the current input and the previous hidden state. After numerous cells of this type, the signal from the early input states is exponentially reduced.
(c) A GLSTM, which employs graph concatenation to enable LSTM-like processing of inputs. Four learned gates are employed to learn specific tasks within the cell.
(d) A graph GRU, which employs graph concatenation to enable GRU-like processing of inputs. GRUs have relatively fewer learnable parameters than LSTMs, and are generally less prone to overfitting.
(e) Legend for the diagrams (a)-(d); all operators are traditional tensor operations apart from graph concatenation.
Fig. 6. The processing approaches for graph-based sequential data, including the overarching approach (a), simple RNN cells (b), GLSTMs (c), and graph GRUs (d).
Gated Recurrent Units (GRUs) provide a less computationally expensive alternative to GLSTMs by removing the need to calculate a cell state in each cell. As such, GRUs have three learnable weight matrices (as illustrated in Figure 6 (d)) which serve similar functions to the four learnable weight matrices in GLSTMs. Again, GRUs require some definition of graph concatenation.
(1) The reset gate, which uses W_r, determines how much information from the previous hidden state to ‘forget’ or ‘retain’ when calculating the new information to add to the hidden state from the current state.
(2) The update gate, which uses W_z, determines what information to ‘forget’ or ‘retain’ from the previous hidden state.
(3) The candidate gate, which uses W_h, determines what information from the reset input will contribute to the next hidden state.
GRUs are well suited to sequential data when repeating patterns are less frequent, whereas LSTM cells perform well in cases where more frequent pattern information needs to be captured. LSTMs also have a tendency to overfit when compared to GRUs, and as such GRUs outperform LSTM cells when the sample size is low [27].
Approach: Applications
RNNs (early work) [61]: Quantitative structure-activity relationship analysis.
RNNs (early work) [4]: Various, including localisation of objects in images.
RGNNs (early work) [74]: Various, including subgraph matching, the mutagenesis problem, and web page ranking.
RGNNs (Neural Networks for Graphs) [60]: Quantitative structure-activity relationship analysis of alkanes, and classification of general cyclic / acyclic graphs.
RGNNs & RNNs (a comparison) [18]: 4-class image classification.
Geometric Deep Learning algorithms (incl. RGNNs) [7]: Graphs, grids, groups, geodesics, gauges, point clouds, meshes, and manifolds. Specific investigations include computer graphics, chemistry (e.g. drug design), protein biology, recommender systems, social networks, traffic forecasting, etc.
RGNN pretraining [38]: Molecular property prediction, protein function prediction, binary graph classification, etc.
RGNNs benchmarking [57]: Cycle detection, and exploring what RGNNs can and cannot learn.
Natural Graph Networks (NGNs) [15]: Graph classification (bioinformatics and social networks).
GLSTMs [105]: Airport delay prediction (with |V| = 325).
GLSTMs (using differential entropy) [100]: Emotion classification from electroencephalogram (EEG) analysis (graphs calculated from K-nearest neighbor algorithms).
GLSTMs [58]: Speed prediction of road traffic throughout a directed road network (vertices are road segments, and edges are direct links between them).
GLSTMs (with spatiotemporal graph convolution) [63]: Real-time distracted driver behaviour classification (i.e., based on the human pose graph [23] from a sequence of video frames, is the driver drinking, texting, performing normal driving, etc.). Other techniques for this problem include [53, 79].
LSTM-Q (i.e. fusion of RL with a bidirectional LSTM) for graphs [13]: Connected autonomous vehicle network analysis for controlling agent movement (in a multi-lane road corridor).
Graph GRUs [54]: Computer program verification.
Graph GRUs [32]: Explainable predictive business process monitoring.
Graph GRUs [3]: NLP as a graph-to-sequence problem (leveraging structural information in language).
Graph GRUs [68, 69]: Gating for vertices and edges. Key applications include earthquake epicentre placement and synthetic regression problems.
Symmetric Graph GRUs [59]: Improved long term dependency performance on synthetic tasks.
Table 5. A selection of works which use recurrent GNN techniques such as those discussed in this section.
(1) Locality: learned feature extraction weights should be localised. They should only consider the information
within a given neighborhood, and they should be applicable throughout the input graph.
(2) Scalability: the learning of these feature extractors should be scalable, i.e., the number of learnable
parameters should be independent of |V|. Preferably the operator should be ‘stackable’, so that models can
be built from successive independent layers, rather than requiring repeated iteration until convergence as
with RGNNs in Section 3. Computation complexities should be bounded where possible.
(3) Interpretability: the convolutional operation should (preferably) be grounded in some mathematical or
physical interpretation, and its mechanics should be intuitive to understand.
(a) A convolutional operation on 2D matrices. This process is used throughout computer vision and in CNNs. The convolutional operation here has a stride of 2 pixels. The given filter is applied in the red, green, blue, and then purple positions. At each position each element of the filter is multiplied with the corresponding element in the input (i.e., the Hadamard product) and the results are summed, producing a single element in the output. For clarity, this multiplication and summing process is illustrated for the purple position. In the case of this image the filter is a standard sharpening filter used in image analysis.
(b) Three neighborhoods in a given graph (designated by dotted boxes), with each one defined by a central vertex (designated by a correspondingly coloured circle). Each neighborhood in the graph is aggregated into a feature vector (i.e., an embedding) centered at each vertex, thus allowing the process to repeat for multiple layers.
Fig. 7. A comparison of image-based and graph-based spatial convolution techniques. Both techniques create embeddings
centered around pixels / vertices, and the output of both techniques describes how the input is modified by the filter. Images
best viewed in colour.
Note that at no point during our general definition of convolution was the structure of the given inputs alluded to. In fact, convolutional operations can be applied to continuous functions (e.g., audio recordings and other signals), N-dimensional discrete tensors (e.g., semantic vectors in 1D, and images in 2D), and so on. During convolution, one input is typically interpreted as a filter (or kernel) being applied to the other input, and we will adopt this language throughout this section. Specific filters can be utilised to perform specific tasks: in the case of audio recordings, high pass filters can be used to filter out low frequency signals, and in the case of images, certain filters can be used to increase contrast, sharpen, or blur images. In our previous example of CNNs, filters are learned rather than designed.
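To make the image-based half of Figure 7 concrete, the sketch below applies the standard 3×3 sharpening filter with a stride of 2, using exactly the Hadamard-product-and-sum operation described in the caption. The input matrix is arbitrary and purely illustrative.

```python
import numpy as np

def conv2d(image, kernel, stride=2):
    """Valid 2D convolution as described in Figure 7 (a): at each position,
    multiply the filter with the underlying patch elementwise (Hadamard
    product) and sum the result into a single output element."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            patch = image[r * stride:r * stride + kh, c * stride:c * stride + kw]
            out[r, c] = np.sum(patch * kernel)
    return out

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])                      # a standard sharpening filter
image = np.arange(49, dtype=float).reshape(7, 7)        # arbitrary 7x7 input
print(conv2d(image, sharpen, stride=2))                 # 3x3 output
```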
h_i^k = \sigma\left( \sum_{j \in N_{v_i}} \frac{W h_j^{k-1}}{\sqrt{|N_{v_i}|\,|N_{v_j}|}} \right) \qquad (4)
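In matrix form, Equation 4 (with self-loops added as described earlier) can be computed for all vertices at once as σ(D̂^{−1/2} (A + I_N) D̂^{−1/2} H W). The NumPy sketch below implements one such layer; the dimensions, the choice of ReLU as σ, and the random weights are illustrative.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = relu(D^{-1/2} (A + I) D^{-1/2} H W), the matrix
    form of Equation 4 with self-loops added."""
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))          # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt        # symmetric normalisation
    return np.maximum(0.0, A_norm @ H @ W)          # aggregate, transform, activate

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H0 = rng.normal(size=(3, 4))                        # initial vertex features
W1 = rng.normal(size=(4, 8))                        # shared, learnable weights
H1 = gcn_layer(A, H0, W1)                           # embeddings after one layer
```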
Graph Attention Networks (GATs) extend GCNs: instead of using the size of the neighborhoods to weight the importance of v_i to v_j, they implicitly calculate this weighting from the normalised output of an attention mechanism [87]. In this case, the attention mechanism is dependent on the embeddings of the two vertices and the edge between them. Vertices are constrained to only be able to attend to neighboring vertices, thus localising the filters. GATs are stabilised during training using multi-head attention and regularisation, and are considered less general than MPNNs [88]. Although GATs limit the attention mechanism to the direct neighborhood, scalability to large graphs is not guaranteed, as attention mechanisms have compute complexities that grow quadratically with the number of vertices being considered.
h_i^k = \sigma\left( \sum_{j \in N_{v_i}} \alpha_{ij}\, W h_j^{k-1} \right), \quad \text{where} \quad \alpha_{ij} = \frac{\mathrm{att}(h_i^{k-1}, h_j^{k-1}, e_{ij})}{\sum_{u \in N_{v_i}} \mathrm{att}(h_i^{k-1}, h_u^{k-1}, e_{iu})} \qquad (5)
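Equation 5 leaves the attention function att unspecified. The sketch below assumes the common single-head choice from [88] (a LeakyReLU applied to a learned linear scoring of the two transformed embeddings, normalised with a softmax, which plays the role of the normalising sum in Equation 5) and omits edge features for brevity; all dimensions and parameters are illustrative.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(A, H, W, a):
    """One single-head GAT layer: alpha_ij are normalised attention
    coefficients (Equation 5) restricted to direct neighbors (plus a
    self-loop), and the output is tanh(sum_j alpha_ij W h_j)."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                       # let each vertex attend to itself
    Z = H @ W                                   # transformed embeddings W h_j
    H_out = np.zeros_like(Z)
    for i in range(n):
        nbrs = np.flatnonzero(A_hat[i])
        scores = np.array([leaky_relu(a @ np.concatenate([Z[i], Z[j]])) for j in nbrs])
        alpha = np.exp(scores) / np.exp(scores).sum()     # softmax-normalised attention
        H_out[i] = np.tanh((alpha[:, None] * Z[nbrs]).sum(axis=0))
    return H_out

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
H = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 8))
a = rng.normal(size=16)                          # attention scoring vector
print(gat_layer(A, H, W, a).shape)               # (3, 8)
```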
Interestingly, all of these approaches consider information from the direct neighborhood and the previous
embeddings, aggregate this information in some symmetric fashion, apply learned weights to calculate more
complex features, and ‘activate’ these results in some way to produce an embedding that captures non-linear
relationships.
The Laplacian is a second order differential operator that is calculated as the divergence of the gradient of a function in Euclidean space. The Laplacian occurs naturally in equations that model physical interactions, including but not limited to electromagnetism, heat diffusion, celestial mechanics, and pixel interactions in computer vision applications. Similarly, it arises naturally in the graph domain, where we are interested in the ‘diffusion of information’ throughout a graph structure.
More formally, if we define flux as the quantity passing outward through a surface, then the Laplacian represents the density of the flux of the gradient flow of a given function. A step by step visualisation of the Laplacian’s calculation is provided in Figure 9. Note that the definition of the Laplacian is dependent on three things: functions, the gradient of a function, and the divergence of the gradient. Since we’re seeking to define the Laplacian in the graph domain, we need to define how these constructs operate in the graph domain.
Functions in the graph domain (referred to as graph signals in GSP) are a mapping from every vertex in a graph to a scalar value: f(G(V, E)) : V → R. Multiple graph functions can be defined over a given graph, and we can interpret a single graph function as a single feature vector defined over the vertices in a graph. See Figure 10 for an example of a graph with two graph functions.
(a) An input function f(x, y) (rendered as a purple surface plot) and its gradient ∇f(x, y) (rendered as an orange vector field above the surface plot). The gradient, at every point, denotes the direction which increases f(x, y) the most. In other words, the orange arrows always point in the direction of a maxima in the purple surface plot.
(b) The vector field from (a), ∇f(x, y) (rendered as an orange vector field above the surface plot), and its divergence ∇·∇f(x, y) (rendered as a green surface plot). The divergence denotes how much every infinitesimal region of the vector field behaves like a source. In other words, it is a measure of the ‘outgoing flow’ of the infinitesimal volume at each point.
Fig. 9. An input function f(x, y) : R² → R (a), its gradient ∇f(x, y) : R² → R² ((a) and (b)), and the divergence of its gradient ∇·∇f(x, y) : R² → R (b). The divergence of a function’s gradient is known as the Laplacian, and it can be interpreted as measuring ‘how much’ of a minimum each point is in the original function f(x, y). The plots in (a) and (b) are an example of the entire calculation of the Laplacian; from scalar field to vector field (gradient), and then from vector field back to scalar field (divergence). The Laplacian is an analog of the second derivative, and is often denoted by ∇·∇, ∇², or Δ.
The gradient of a function in the graph domain describes the direction and the rate of fastest increase of graph signals. In a graph structure, when we refer to ‘direction’ we are referring to the edges of the graph; the avenues by which a graph function can change. For example, in Figure 10, the graph functions are 8-dimensional vectors (defined over the vertices), but the gradients of the functions for this graph are 12-dimensional vectors (defined over the edges), and are calculated as in Equation 7. Refer to Table 3 for a formal definition of the incidence matrix M.
M^T = \begin{bmatrix}
+1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
+1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
+1 & 0 & 0 & 0 & 0 & 0 & -1 & 0 \\
0 & +1 & 0 & 0 & 0 & 0 & -1 & 0 \\
0 & +1 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & +1 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & +1 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & +1 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & +1 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & +1 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & +1 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & +1 & -1
\end{bmatrix}, \quad
f_{\mathrm{cases}} = \begin{bmatrix} 1048 \\ 191 \\ 851 \\ 1763 \\ 7492 \\ 124 \\ 20879 \\ 13 \end{bmatrix}, \quad
\nabla f_{\mathrm{cases}} = M^T f_{\mathrm{cases}} = \begin{bmatrix} +857 \\ +197 \\ -19831 \\ -20688 \\ -1572 \\ -20028 \\ -5729 \\ -19116 \\ +7368 \\ -13387 \\ -20755 \\ +20866 \end{bmatrix} \qquad (7)
In Equation 7, the gradient vector describes the difference in graph function value across the vertices / along the edges. Specifically, note that the largest magnitude value is 20866, and corresponds to e_12, the edge between Hobart and Melbourne in Figure 10. In other words, the greatest magnitude edge is between the city with the least cases and the city with the most cases. Similarly, the lowest magnitude edge is e_2; the edge between Perth and Adelaide, which has the least difference in cases.
Fig. 10. A graph representing Australia (|V| = 8, |E| = 12). Its vertices represent Australia’s capital cities, and the edges between them represent common flight paths. Each vertex has two features, one representing the population, and another representing the total (statewide) cases of an infectious disease at those locations. These two vertex feature vectors can be interpreted as the graph functions (also known as graph signals) f_cases and f_pop., which are rendered at the bottom of the figure. As an example, it may be of interest to investigate the propagation / diffusion of these graph signal quantities throughout the graph structure.
The divergence of a gradient function in the graph domain describes the outward flux of the gradient function at every vertex. To continue with our example, we could interpret the divergence of the gradient function ∇f_cases as the outgoing ‘flow’ of infectious disease cases from each capital city. Whereas the gradient function was defined over the graph’s edges, the divergence of a gradient function is defined over the graph’s vertices, and is calculated as in Equation 8.
\nabla \cdot (\nabla f_{\mathrm{cases}}) = M(\nabla f_{\mathrm{cases}}) = M(M^T f_{\mathrm{cases}}) = M M^T f_{\mathrm{cases}} = L f_{\mathrm{cases}} = \begin{bmatrix} -18777 \\ -23117 \\ -20225 \\ -23273 \\ -290 \\ -28123 \\ +134671 \\ -20866 \end{bmatrix} \qquad (8)
The maximum value in the divergence vector for the infectious disease graph signal is 134671, corresponding to Melbourne (the 7th vertex). Again, this can be interpreted as the magnitude of the ‘source’ of infectious disease cases from Melbourne. Conversely, the minimum value is −28123, corresponding to Canberra, the largest ‘sink’ of infectious disease.
Note as well that the dimensionality of the original graph function is 8 (corresponding to the vertex space), its gradient’s dimensionality is 12 (corresponding to the edge space), and the Laplacian’s dimensionality is again 8 (corresponding to the vertex space). This mimics the calculation of the Laplacian in Figure 9, where the original scalar field (representing the magnitude at each point) is converted to a vector field (representing direction), and then back to a scalar field (representing how each point acts as a source).
The graph Laplacian appears naturally in these calculations as a |V| × |V| matrix operator in the form L = MM^T (see Equation 8). This corresponds to the formulation provided in Table 3, as shown in Equation 9; this formulation is referred to as the combinatorial definition L = D − A^W (the normalised definition is given by L^sym [17]). The graph Laplacian is pervasive in the field of GSP [14].
L = M M^T = \begin{bmatrix}
3 & -1 & -1 & 0 & 0 & 0 & -1 & 0 \\
-1 & 3 & 0 & -1 & 0 & 0 & -1 & 0 \\
-1 & 0 & 2 & 0 & 0 & 0 & -1 & 0 \\
0 & -1 & 0 & 3 & -1 & 0 & -1 & 0 \\
0 & 0 & 0 & -1 & 3 & -1 & -1 & 0 \\
0 & 0 & 0 & 0 & -1 & 2 & -1 & 0 \\
-1 & -1 & -1 & -1 & -1 & -1 & 7 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & -1 & 1
\end{bmatrix} = D - A^W \qquad (9)
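Because the graph in Figure 10 is small, Equations 7-9 can be checked directly in a few lines of NumPy. The edge ordering below follows the columns of M as reconstructed above, and the vertex ordering matches the equations (with Melbourne as the 7th vertex, as stated in the text).

```python
import numpy as np

# Edges of the Figure 10 graph (1-indexed vertex pairs, one per column of M).
edges = [(1, 2), (1, 3), (1, 7), (2, 7), (2, 4), (3, 7),
         (4, 5), (4, 7), (5, 6), (5, 7), (6, 7), (7, 8)]
f_cases = np.array([1048, 191, 851, 1763, 7492, 124, 20879, 13], dtype=float)

# Incidence matrix M (|V| x |E|): +1 at the first endpoint, -1 at the second.
M = np.zeros((8, len(edges)))
for k, (i, j) in enumerate(edges):
    M[i - 1, k], M[j - 1, k] = +1, -1

grad = M.T @ f_cases          # Equation 7: gradient over the 12 edges
div = M @ grad                # Equation 8: divergence over the 8 vertices
L = M @ M.T                   # graph Laplacian, L = M M^T = D - A (Equation 9)

print(grad)                   # e.g. grad[1] = 197 (Perth-Adelaide), grad[11] = 20866 (Hobart-Melbourne)
print(div)                    # e.g. div[6] = 134671 (Melbourne, the largest 'source')
assert np.allclose(div, L @ f_cases)
```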
Since L = D − A^W, the graph Laplacian must be a real (L_ij ∈ R, ∀ 0 ≤ i, j < |V|) and symmetric (L = L^T) matrix. As such, it will have an eigensystem comprised of a set of |V| orthonormal eigenvectors, each associated with a single real eigenvalue [78]. We denote the i-th eigenvector by u_i, and the associated eigenvalue by λ_i, each satisfying L u_i = λ_i u_i, where the eigenvectors u_i are the |V|-dimensional columns in the matrix (Fourier basis) U. The Laplacian can be factored as three matrices such that L = U Λ U^T through a process known as eigendecomposition. A variety of algorithms exist for solving this kind of eigendecomposition problem (e.g. the QR algorithm and Singular Value Decomposition).
These eigenvectors form a basis in R^|V|, and as such we can express any discrete graph function as a linear combination of these eigenvectors. We define the graph Fourier transform of any graph function / signal f as f̂ = U^T f ∈ R^|V|, and its inverse as f = U f̂ ∈ R^|V|. To complete our goal of performing convolution in the spectral domain, we now complete the following steps.
(1) Convert the i-th graph function into the frequency space (i.e., generate its graph Fourier transform). We do this through matrix multiplication with the transpose of the Fourier basis: U^T f_i. Note that multiplication with the eigenvector matrix is O(|V|²).
(2) Apply the corresponding i-th learned filter in frequency space. If we define Θ_i as our i-th learned filter (and a function of the eigenvalues of L), then this appears like so: Θ_i U^T f_i.
(3) Convert the result back to vertex space by multiplying the result with the Fourier basis matrix. This completes the formulation defined in Equation 10. By the convolution theorem, multiplication applied in the frequency space corresponds to convolution in vertex space, so the filter has been convolved against the graph function [52].
U \Theta_i U^T f_i \qquad (10)
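The three steps above can be sketched directly with a dense eigendecomposition. In the example below, the filter is an arbitrary low-pass function of the eigenvalues standing in for a learned Θ, and the graph is a small toy example; nothing here is specific to the methods surveyed in this section.

```python
import numpy as np

def spectral_filter(L, f, theta_fn):
    """Filter a graph signal f in the frequency domain (Equation 10):
    (1) transform with U^T, (2) scale each frequency by theta(lambda),
    (3) transform back with U."""
    lam, U = np.linalg.eigh(L)                # eigendecomposition L = U diag(lam) U^T
    f_hat = U.T @ f                           # graph Fourier transform
    f_hat_filtered = theta_fn(lam) * f_hat    # apply the filter in frequency space
    return U @ f_hat_filtered                 # inverse graph Fourier transform

# An arbitrary low-pass filter that attenuates high graph frequencies.
low_pass = lambda lam: 1.0 / (1.0 + 5.0 * lam)

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A                # combinatorial Laplacian L = D - A
f = np.array([3.0, -1.0, 2.0, 0.5])           # a graph signal over the 4 vertices
print(spectral_filter(L, f, low_pass))
```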
This simple formulation has a number of downsides. Firstly, the approach is not localised (it has global support), meaning that the filter is applied to all vertices (i.e., the entirety of the graph function). This means that useful filters aren’t shared, and that the locality of graph structures is not being exploited. Secondly, it is not scalable; the number of learnable parameters grows with the size of the graph (not the scale of the filter) [17], the O(|V|²) cost of matrix multiplication scales poorly to large graphs, and the O(|V|³) time complexity of QR-based eigendecomposition [81] is prohibitive on large graphs. Moreover, directly computing this transform requires the diagonalisation of the Laplacian, and is infeasible for large graphs (where |V| exceeds a few thousand vertices) [31]. Finally, since the structure of the graph dictates the values of the Laplacian, graphs with dynamic topologies can’t use this method of convolution.
\Theta = \sum_{k=1}^{K} \theta_k\, T_k(\tilde{\Lambda}) \qquad (11)
To alleviate the locality issue, [21] noted that the smoothing of filters in the frequency space would result in localisation in the vertex space. Instead of learning the filter directly, they formed the filter as a combination of smooth polynomial functions, and instead learned the coefficients of these polynomials. Since the Laplacian is a local operator affecting only direct neighbors of any given vertex, a polynomial of degree K affects vertices K hops away. By approximating the spectral filter in this way (instead of directly learning it), spatial localisation is thus guaranteed [77]. Furthermore, this improved scalability; learning K coefficients of the predefined smooth polynomial functions meant that the number of learnable parameters was no longer dependent on the size of the input graph. Additionally, the learned model could be applied to other graphs too, as opposed to spectral filter coefficients, which are basis dependent. Since then, multiple potential polynomials have been used for specialised effects (e.g. Chebyshev polynomials, Cayley polynomials [51]).
Equation 11 outlines this approach. The learnable parameters are θ_k (the vectors of Chebyshev polynomial coefficients), and T_k(Λ̃) is the Chebyshev polynomial of order k (dependent on the normalised diagonal matrix of scaled eigenvalues Λ̃). Chebyshev polynomials can be computed recursively with a stable recurrence relation, and form an orthogonal basis [83]. We recommend [65] for a full treatment of Chebyshev polynomials.
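The recurrence T_0(x) = 1, T_1(x) = x, T_k(x) = 2x T_{k-1}(x) − T_{k-2}(x) allows the filtered signal to be computed directly on a rescaled Laplacian, since T_k(L̃)f = U T_k(Λ̃) U^T f, avoiding an explicit eigendecomposition. The sketch below uses illustrative (normally learned) scalar coefficients and the usual rescaling of L by its largest eigenvalue, both of which are assumptions rather than choices prescribed above.

```python
import numpy as np

def chebyshev_filter(L, f, theta):
    """Apply an order-K Chebyshev filter (Equation 11) to the graph signal f.
    theta[k] are the (learnable) coefficients; L is rescaled so its spectrum
    lies in [-1, 1], the domain on which Chebyshev polynomials are defined."""
    lam_max = np.linalg.eigvalsh(L).max()
    L_tilde = (2.0 / lam_max) * L - np.eye(L.shape[0])   # rescaled Laplacian
    T_prev, T_curr = f, L_tilde @ f                      # T_0(L~) f and T_1(L~) f
    out = theta[0] * T_prev + theta[1] * T_curr
    for k in range(2, len(theta)):
        T_next = 2.0 * L_tilde @ T_curr - T_prev         # Chebyshev recurrence
        out += theta[k] * T_next
        T_prev, T_curr = T_curr, T_next
    return out

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
f = np.array([1.0, 0.0, 0.0, -1.0])
theta = np.array([0.5, 0.3, 0.2])        # illustrative (normally learned) coefficients
print(chebyshev_filter(L, f, theta))
```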
Interestingly, these approximate approaches demonstrate an equivalence between spatial and spectral techniques. Both are spatially localised and allow for a single filter to extract repeating patterns of interest throughout a graph, both have a number of learnable parameters which is independent of the input graph size, and each has meaningful and intuitive interpretations from a spatial (Figure 7) and spectral (Figure 10) perspective. In fact, GCNs can be viewed as a first order approximation of Chebyshev polynomials [56]. For an in-depth treatment of the topic of GSP, we recommend [82] and [78].
Approach: Applications
GSP general (spectral) [82]: Multi-sensor temperature sensing (as a signal processing problem).
ChebNet (spectral) [83]: Various, but particularly in contexts where the functions to be approximated are high dimensional and smooth.
CayleyNets (spectral) [51]: Community detection, MNIST, citation networks, recommender systems, and other domains where specific frequency bands are of particular interest.
MPNNs (spatial) [25]: Quantum chemistry, specifically molecular property prediction.
GraphSAGE (spatial) [29]: Classifying academic papers, classifying Reddit posts, classifying protein functions, etc.
GCNs (spatial) [45]: Semi-supervised vertex classification on citation networks and knowledge graphs.
Residual Gated Graph ConvNets (spatial) [6]: Subgraph matching and graph clustering.
Graph Isomorphism Networks (GINs) [99]: Various, including bioinformatics and social network datasets.
CGNN benchmarking [20]: Extensive, including ZINC [39], MNIST [49], CIFAR10 [48], etc.
GATs [88]: Citation networks, protein-protein interaction.
GATs [73]: Robust pointwise correspondence of local image features.
Gated Attention Modules [106]: Traffic speed forecasting.
Edge GATs [11]: Citation networks, but generally any domain sensitive to relations / edge features.
Graph Attention Tracking [28]: Visual tracking (i.e., similarity matching between a template image and a search region).
Hyperbolic GATs [107]: Hyperbolic domains, e.g. protein surfaces, biomolecular interactions, drug discovery, or statistical mechanics.
Heterogeneous Attention Networks (HANs) [94]: Citation networks, IMDB (movie database networks), or any domain where vertices / edges are heterogeneous.
GATs [93]: Knowledge graphs and explainable recommender systems.
Graphormers [101]: Various, including quantum chemistry prediction. Particularly well suited to smaller scale graphs due to the quadratic computation complexity of attention mechanisms.
Graph Transformers (with spectral attention) [47]: Various, including molecular graph analysis (i.e. [39] and similar). Particularly well suited to smaller scale graphs as above.
Table 6. A selection of often-cited works which use convolutional GNN techniques (such as those discussed in this section). Many of these algorithms are applicable to graphs generally, and as such, the applications column outlines the applications directly discussed in the cited paper.
5 GRAPH AUTOENCODERS
GAEs represent the application of GNNs (often CGNNs) to autoencoding. The goal of an AE can be summarised as follows: to project the input features into a new space (known as the latent space) where the projection has more desirable properties than the input representation. These properties may include:
(1) The data being more separable (i.e. classifiable) in the latent space.
(2) The dimensionality of the dataset being smaller in the latent space than in the input space.
(3) The data being obfuscated for security or privacy concerns in the latent space.
A benefit of AEs in general is that they can often achieve this in an unsupervised manner, i.e., they can create useful embeddings without any labelled training data. In their short history, GAEs have led the way in unsupervised learning on graph-structured data and enabled greater performance on supervised tasks such as vertex classification on citation networks [46].
(a) The architecture for a standard traditional AE. AEs take a tensor input X, alter the dimensionality via a learnable encoder NN, and thus convert said input into a latent embedding Z. From there, the AE attempts to reconstruct the original input, thus creating the reconstructed input X̂ (this process forms the decoder). By minimising the reconstruction loss L = ∥X − X̂∥², efficient latent space representations can be learned. This diagram shows an AE with a latent space representation that is smaller than the input size. In practice, encoder NNs can use custom layers and connections to improve performance.
(b) The architecture for a GAE. The input graph is described by the adjacency matrix A and the vertex feature matrix X in this case (though edge and global graph features can be accepted as input also). Since the input is an unstructured graph and not a tensor, a GNN architecture, such as those described throughout this tutorial, is used to generate a matrix of latent vertex embeddings Z. To reconstruct the input, the similarity between all pairs of latent vertex embeddings is calculated, yielding a proxy for the ‘connectedness’ amongst the vertices in the graph. This creates the estimated adjacency matrix Â, which can be compared with the original A to create a loss term. In this example, the red edges denote edges which were incorrectly reconstructed.
Fig. 11. The architecture for a traditional tensor-based AE, compared to a GAE.
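The reconstruction step in Figure 11 (b) is commonly implemented with the inner-product decoder of [46], where the pairwise similarity is the dot product of the two latent embeddings squashed through a sigmoid. The sketch below assumes the encoder has already produced an embedding matrix Z; the graph and values are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inner_product_decoder(Z):
    """Reconstruct an estimated adjacency matrix A_hat from the latent vertex
    embeddings Z: A_hat[i, j] = sigmoid(z_i . z_j), a proxy for 'connectedness'."""
    return sigmoid(Z @ Z.T)

def reconstruction_loss(A, Z, eps=1e-12):
    """Binary cross-entropy between the true adjacency A and its reconstruction."""
    A_hat = inner_product_decoder(Z)
    return -np.mean(A * np.log(A_hat + eps) + (1 - A) * np.log(1 - A_hat + eps))

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
Z = rng.normal(size=(3, 4))              # latent embeddings produced by the GNN encoder
print(reconstruction_loss(A, Z))
```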
specifically, this term regularises the latent space distributions by ensuring that they do not diverge significantly from some prior distribution with desirable properties. In our example, we use the normal distribution (denoted as N(0, 1)). This divergence is quantified in our case using the Kullback-Leibler divergence (denoted as ‘KL’), though other similarity metrics (e.g. Wasserstein distance or ranking loss) can be used successfully. Without this loss penalty, the VGAE encoder might generate distributions with small variances, or high magnitude means, both of which would make it harder to sample from the distribution effectively.
Fig. 13. An example of a VGAE. Graph inputs are encoded via a GNN into multivariate Gaussian parameters (i.e. mean and variance). These represent ranges of possible values in the latent space, which enforces a continuous latent space representation. Samples are selected from these distributions and fed to the decoder, as in Figure 11 (b). In practice, researchers have observed that this method ensures that all regions of latent space map to meaningful outputs, and that latent vectors which are close to one another map to reconstructions that are ‘close to’ one another in the input space. To ensure that the encoded distributions are well behaved, a penalty term is added to the loss function to force the distributions to match some known prior distribution (i.e., normal distributions). The total loss function for a VGAE is thus defined as L = ∥A − Â∥² + KL(N(0, 1), q(Z)), where q(Z) denotes the distribution produced by the encoder.
The field of GAdvTs is broad, with multiple different kinds of attacks, training regimes, and use cases. In this tutorial, we’ll look at how GAdvTs can be used to extend VGAEs to create robust encoding networks and well regularised generative mechanisms. Figure 14 describes a typical architecture for adversarially training a VGAE.
Fig. 14. A typical approach to adversarial training with VGAEs. The top row describes a VGAE as illustrated in Figure 13. Importantly, for each real sample, a ‘fake’ sample is generated from some prior distribution p(Z) (e.g., a multivariate Gaussian, or some other distribution which is believed to model the properties of the latent space attributes). During training, these fake and real samples are input into a discriminator network, which predicts whether said inputs are real or fake. If the discriminator correctly classifies the sample, the generator is penalised, thus optimising the encoder to generate distributions whose samples are more likely to ‘fool’ the discriminator. In other words, this causes the encoder to create samples which have similar properties to the samples pulled from the prior distribution p(Z), thus acting as a form of regularisation.
To ensure that the sampling operation is differentiable, VGAEs leverage a ‘reparameterisation trick’, where a random Gaussian sample is generated from N(0, 1) outside the forward pass of the network. The sample is then transformed by the parameterisation of the generated distribution q(Z), rather than having the sample be generated directly from q(Z) [19]. Since this approach is entirely differentiable, it allows for end-to-end training via backpropagation of an unsupervised loss signal.
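A minimal sketch of the reparameterisation trick and the KL penalty against a standard normal prior follows; mu and log_var stand in for the per-vertex outputs of a VGAE encoder, and all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterise(mu, log_var):
    """Sample z ~ N(mu, sigma^2) by transforming a fixed N(0, 1) sample,
    keeping the sampling step differentiable with respect to mu and log_var."""
    eps = rng.standard_normal(mu.shape)        # random sample drawn outside the forward pass
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions and averaged
    over vertices; penalises distributions that drift from the prior."""
    return np.mean(-0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1))

mu = rng.normal(size=(3, 4))                   # per-vertex latent means from the encoder
log_var = rng.normal(size=(3, 4))              # per-vertex latent log-variances
z = reparameterise(mu, log_var)                # latent samples fed to the decoder
print(kl_to_standard_normal(mu, log_var))
```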
vertex’s one-step local neighbourhood, showing competitive performance on numerous benchmarks. Despite this, typical GAEs use more complex encoders (primarily CGNNs) in order to capture nonlinear relationships in the input data and larger local neighbourhoods [5, 8, 46, 64, 84, 85, 91, 102].
Approach: Applications
Deep Neural Graph Representation [8]: Various, including clustering, calculating useful vertex embeddings, and visualisation.
Structural Deep Network Embeddings [91]: Various, including language networks, citation networks, and social networks.
Denoising Attribute AEs [34]: Various, including social networks and citation networks.
Link prediction-based GAEs (and VGAEs) [71]: Various, including link prediction and bidirectional prediction on citation networks.
VGAEs [46]: Various, including citation networks.
Deep Gaussian Embedding of Graphs (G2G) [5]: Various, including citation networks.
Semi-implicit VGAEs [33]: Various graph analytic tasks, including citation networks.
Adversarially Regularised AEs (NetRA) [102]: Directed communication networks, software class dependency networks, undirected social networks, citation networks, directed word networks with inferred ‘Part-of-Speech’ tags, and protein-protein interactions.
Adversarially Regularised Graph Autoencoder (ARGA, and its variants) [64]: Various, including vertex clustering and visualisation of citation networks.
Graph Convolutional Generative Adversarial Networks [90]: Traffic prediction in optical networks (particularly in domains with ‘burst events’).
FeederGAN (adversarial) [55]: Generation of distributed feeder circuits.
Labelled Graph GANs [22]: Generating graph-structured data with vertex labels. Demonstrated for citation networks and protein graphs.
Graph GANs for Sparse Data [43]: Generating sparse graph datasets. Demonstrated for MNIST and high energy physics proton-proton jet particle data.
Graph Convolutional Adversarial Networks [37]: Predicting missing infant diffusion MRI data for longitudinal studies.
Table 8. A selection of works using GAE / VGAE / GAdvT techniques as discussed in this section.
6 FUTURE RESEARCH
The ield of GNNs is rapidly developing, and there are numerous directions for meaningful future research. In
this section, we outline a few speciic directions which have been identiied as important research areas to focus
on [98, 103, 108, 110].
6.1 Explainability
Recent advancements in deep learning have allowed deeper NNs to be developed and applied throughout the field of AI. As the mechanics that drive predictions (and thus decisions) become more complex, the path by which those decisions are reached becomes more obfuscated. Explainable AI (XAI) promises to address this issue.
Explainability in the graph domain promises many of the same benefits as it does across AI, including more interpretable outputs, clearer relationships between inputs and outputs, more interpretable models, and in general, more trust between AI and human operators across problem domains (e.g. digital pathology [40], knowledge graphs [95], etc.).
While the suite of available XAI algorithms has been growing consistently over recent years (e.g. LIME, SHAP), graph-specific XAI algorithms are relatively few and far between [103]. A key reason for this might be the requirement for graph explanations to incorporate not just the relationships among the input features, but also the relationships surrounding the input’s structural / topological information. In particular, the exploration of instance-level explainers, including high fidelity perturbative and gradient-based methods, may provide good approximations of input importance in graph prediction tasks. Further techniques, especially those which assign quantitative importances to a graph’s structure and its features, will give a more holistic view of explainability in the graph domain.
6.2 Scalability
In traditional deep learning, a common technique for dealing with extremely large datasets and AI models is to
distribute and parallelise computations where possible. In the graph domain, the non-rigid structure of graphs
presents additional challenges. For example; how can a graph be uniformly partitioned across multiple devices?
How can message passing frameworks be eiciently implemented in a distributed system? These questions are
especially pertinent for extremely large graphs. Recent developments suggest that sampling based approaches
may provide appropriate solutions in the near future [109], though such solutions are non-trivial, especially
when graphs are stored on distributed systems [76].
Moreover, the scalability of GNN modules themselves may be improved by further directed research. For example, popular GNN variants such as MPNNs can in practice only be applied to small graphs due to the large computational overheads associated with the message passing framework. Methods such as GATs show promising results regarding scalability, but attentional mechanisms still incur a quadratic time complexity, which may be prohibitive for graphs with large neighborhoods (on average). An exciting further avenue of research regarding GATs is their equivalence to Transformer networks [41, 47, 86, 101]. Further directed research in this area may contribute not only to the development of exciting new graph-based techniques, but also to the understanding of Transformer networks as a whole. Breakthroughs in this area may address challenges specific to Transformers, such as the design of efficient positional encodings, effective warm-up strategies, and the quantification of inductive biases.
7 CONCLUSION
The development of GNNs has accelerated hugely in the recent years due to increased interest in exploring
unstructured data and developing general AI solutions. In this paper, we have illustrated key GNN variants,
described the mechanisms which underpin their operations, addressed their limitations, and worked through
examples of their application to various real world problems (with links to more advanced literature where
necessary). Going forward, we expect that GNNs will continue to emerge as an exciting and highly performant
branch of algorithms that natively model and address important real-world problems.
FUNDING
This work was partially supported by ISOLABS, the Australian Research Council (Grants DP150100294 and
DP150104251), the National Natural Science Foundation of China (No. U20A20185, 61972435), the Natural Science
Foundation of Guangdong Province (2019A1515011271), and the Shenzhen Science and Technology Program (No.
RCYX20200714114641140, JCYJ20190807152209394).
ACKNOWLEDGMENTS
We express a special appreciation to Josh Crowe at ISOLABS for his ongoing support of technical research
(including this tutorial paper) at ISOLABS. We also thank Richard Pienaar for providing early feedback which
greatly improved this work.
REFERENCES
[1] Sergi Abadal, Akshay Jain, Robert Guirado, Jorge López-Alonso, and Eduard Alarcón. 2021. Computing graph neural networks: A survey from algorithms to accelerators. ACM Computing Surveys (CSUR) 54, 9 (2021), 1–38.
[2] Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Armand Joulin, Nicolas Ballas, and Michael Rabbat. 2021. Semi-
Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples. arXiv preprint
arXiv:2104.13963 (2021).
[3] Daniel Beck, Gholamreza Haffari, and Trevor Cohn. 2018. Graph-to-sequence learning using gated graph neural networks. arXiv preprint arXiv:1806.09835 (2018).
[4] M Bianchini, M Gori, and F Scarselli. 2002. Recursive processing of cyclic graphs. In Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN'02 (Cat. No. 02CH37290), Vol. 1. IEEE, 154–159.
[5] Aleksandar Bojchevski and Stephan Günnemann. 2018. Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via
Ranking. arXiv: Machine Learning (2018).
[6] Xavier Bresson and Thomas Laurent. 2017. Residual Gated Graph ConvNets. arXiv:1711.07553 [cs.LG]
[7] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. 2017. Geometric Deep Learning: Going beyond Euclidean data. IEEE Signal Processing Magazine 34, 4 (2017), 18–42.
[8] Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2016. Deep Neural Networks for Learning Graph Representations. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (Phoenix, Arizona) (AAAI'16). AAAI Press, 1145–1152.
[9] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging
properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294 (2021).
[10] Hong Chen and Hisashi Koga. 2019. GL2vec: Graph Embedding Enriched by Line Graphs with Edge Features. In ICONIP.
[11] Jun Chen and Haopeng Chen. 2021. Edge-Featured Graph Attention Network. arXiv preprint arXiv:2101.07671 (2021).
[12] Liang Chen, Jintang Li, Jiaying Peng, Tao Xie, Zengxu Cao, Kun Xu, Xiangnan He, and Zibin Zheng. 2020. A survey of adversarial
learning on graphs. arXiv preprint arXiv:2003.05730 (2020).
[13] Sikai Chen, Jiqian Dong, Paul Ha, Yujie Li, and Samuel Labi. 2021. Graph neural network and reinforcement learning for multi-agent
cooperative control of connected autonomous vehicles. Computer-Aided Civil and Infrastructure Engineering 36, 7 (2021), 838–857.
[14] Fan R. K. Chung. 1997. Spectral graph theory. Number 92. American Mathematical Soc.
[15] Pim de Haan, Taco Cohen, and Max Welling. 2020. Natural Graph Networks. arXiv:2007.08349 [cs.LG]
[16] Nathan de Lara and Edouard Pineau. 2018. A Simple Baseline Algorithm for Graph Classiication. CoRR abs/1810.09155 (2018).
arXiv:1810.09155 http://arxiv.org/abs/1810.09155
[17] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized
spectral filtering. Advances in neural information processing systems 29 (2016), 3844–3852.
[18] Vincenzo Di Massa, Gabriele Monfardini, Lorenzo Sarti, Franco Scarselli, Marco Maggini, and Marco Gori. 2006. A comparison between
recursive neural networks and graph neural networks. In The 2006 IEEE International Joint Conference on Neural Network Proceedings.
IEEE, 778–785.
[19] Carl Doersch. 2016. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908 (2016).
[20] Vijay Prakash Dwivedi, Chaitanya K. Joshi, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. 2020. Benchmarking Graph Neural
Networks. arXiv:2003.00982 [cs.LG]
[21] Joan Bruna Estrach, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2014. Spectral networks and deep locally connected networks
on graphs. In 2nd International Conference on Learning Representations, ICLR, Vol. 2014.
[22] Shuangfei Fan and Bert Huang. 2019. Labeled graph generative adversarial networks. arXiv preprint arXiv:1906.03220 (2019).
[23] Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. 2017. RMPE: Regional Multi-person Pose Estimation. In ICCV.
[24] Matthias Fey and Jan E. Lenssen. 2019. Fast Graph Representation Learning with PyTorch Geometric. In ICLR Workshop on Representation
Learning on Graphs and Manifolds.
[25] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural message passing for quantum
chemistry. In International conference on machine learning. PMLR, 1263–1272.
[26] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila
Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent: A new approach to self-supervised
learning. arXiv preprint arXiv:2006.07733 (2020).
[27] Nicole Gruber and Alfred Jockisch. 2020. Are GRU cells more specific and LSTM cells more sensitive in motive classification of text?
Frontiers in artificial intelligence 3 (2020), 40.
[28] Dongyan Guo, Yanyan Shao, Ying Cui, Zhenhua Wang, Liyan Zhang, and Chunhua Shen. 2021. Graph attention tracking. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9543–9552.
[29] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. CoRR abs/1706.02216
(2017). arXiv:1706.02216 http://arxiv.org/abs/1706.02216
[30] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation Learning on Graphs: Methods and Applications. CoRR
abs/1709.05584 (2017). arXiv:1709.05584 http://arxiv.org/abs/1709.05584
[31] David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. 2011. Wavelets on graphs via spectral graph theory. Applied and
Computational Harmonic Analysis 30, 2 (2011), 129–150.
[32] Maximilian Harl, Sven Weinzierl, Mathias Stierle, and Martin Matzner. 2020. Explainable predictive business process monitoring using
gated graph neural networks. Journal of Decision Systems (2020), 1–16.
[33] Arman Hasanzadeh, Ehsan Hajiramezanali, Nick Duffield, Krishna R. Narayanan, Mingyuan Zhou, and Xiaoning Qian. 2019. Semi-
Implicit Graph Variational Auto-Encoders. arXiv:1908.07078 [cs.LG]
[34] Bhagya Hettige, Weiqing Wang, Yuan-Fang Li, and Wray Buntine. 2020. Robust Attribute and Structure Preserving Graph Embedding.
In Advances in Knowledge Discovery and Data Mining, Hady W. Lauw, Raymond Chi-Wing Wong, Alexandros Ntoulas, Ee-Peng Lim,
See-Kiong Ng, and Sinno Jialin Pan (Eds.). Springer International Publishing, Cham, 593–606.
[35] Sepp Hochreiter. 1991. Untersuchungen zu dynamischen neuronalen Netzen. Diploma, Technische Universität München 91, 1 (1991).
[36] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. 2001. Gradient flow in recurrent nets: the difficulty of
learning long-term dependencies.
[37] Yoonmi Hong, Jaeil Kim, Geng Chen, Weili Lin, Pew-Thian Yap, and Dinggang Shen. 2019. Longitudinal prediction of infant diffusion
MRI data via graph convolutional adversarial networks. IEEE transactions on medical imaging 38, 12 (2019), 2717–2725.
[38] Weihua Hu*, Bowen Liu*, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. 2020. Strategies for Pre-training
Graph Neural Networks. In International Conference on Learning Representations. https://openreview.net/forum?id=HJlWWJSFDH
[39] John J. Irwin, Teague Sterling, Michael M. Mysinger, Erin S. Bolstad, and Ryan G. Coleman. 2012. ZINC: A Free Tool to Discover
Chemistry for Biology. Journal of Chemical Information and Modeling 52, 7 (2012), 1757–1768. https://doi.org/10.1021/ci3001277
arXiv:https://doi.org/10.1021/ci3001277 PMID: 22587354.
[40] Guillaume Jaume, Pushpak Pati, Antonio Foncubierta-Rodriguez, Florinda Feroce, Giosue Scognamiglio, Anna Maria Anniciello,
Jean-Philippe Thiran, Orcun Goksel, and Maria Gabrani. 2020. Towards explainable graph representations in digital pathology. arXiv
preprint arXiv:2007.00311 (2020).
[41] Chaitanya Joshi. 2020. Transformers are Graph Neural Networks. https://thegradient.pub/transformers-are-graph-neural-networks/.
The Gradient (2020).
[42] Nikola Jovanović, Zhao Meng, Lukas Faber, and Roger Wattenhofer. 2021. Towards robust graph contrastive learning. arXiv preprint
arXiv:2102.13085 (2021).
[43] Raghav Kansal, Javier Duarte, Breno Orzari, Thiago Tomei, Maurizio Pierini, Mary Touranakou, Jean-Roch Vlimant, and Dimitrios
Gunopulos. 2020. Graph Generative Adversarial Networks for Sparse Data Generation in High Energy Physics. arXiv preprint
arXiv:2012.00173 (2020).
[44] M. A. Khamsi and William A. Kirk. 2001. An introduction to metric spaces and fixed point theory. Wiley.
[45] Thomas N. Kipf and Max Welling. 2016. Semi-Supervised Classification with Graph Convolutional Networks. CoRR abs/1609.02907
(2016). arXiv:1609.02907 http://arxiv.org/abs/1609.02907
[46] Thomas N. Kipf and Max Welling. 2016. Variational Graph Auto-Encoders. arXiv:1611.07308 [stat.ML]
[47] Devin Kreuzer, Dominique Beaini, William L Hamilton, Vincent Létourneau, and Prudencio Tossou. 2021. Rethinking Graph Trans-
formers with Spectral Attention. arXiv preprint arXiv:2106.03893 (2021).
[48] Alex Krizhevsky et al. 2009. Learning multiple layers of features from tiny images. (2009).
[49] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998),
2278–2324.
[50] Yann LeCun and Corinna Cortes. 2010. MNIST handwritten digit database. (2010). http://yann.lecun.com/exdb/mnist/
[51] Ron Levie, Federico Monti, Xavier Bresson, and Michael M. Bronstein. 2017. CayleyNets: Graph Convolutional Neural Networks with
Complex Rational Spectral Filters. CoRR abs/1705.07664 (2017). arXiv:1705.07664 http://arxiv.org/abs/1705.07664
[52] Bing Li and G. Jogesh Babu. 2019. Convolution Theorem and Asymptotic Efficiency. In A Graduate Course on Statistical Inference.
Springer New York, New York, NY, 295–327. https://doi.org/10.1007/978-1-4939-9761-9_10
[53] Hongsheng Li, Guangming Zhu, Liang Zhang, Juan Song, and Peiyi Shen. 2020. Graph-Temporal LSTM Networks for Skeleton-Based
Action Recognition. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 480–491.
[54] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2015. Gated graph sequence neural networks. arXiv preprint
arXiv:1511.05493 (2015).
[55] Ming Liang, Yao Meng, Jiyu Wang, David L Lubkeman, and Ning Lu. 2020. FeederGAN: Synthetic feeder generation via deep graph
adversarial nets. IEEE Transactions on Smart Grid 12, 2 (2020), 1163–1173.
[56] Siwu Liu, Ji Hwan Park, and Shinjae Yoo. 2020. Efficient and effective graph convolution networks. In Proceedings of the 2020 SIAM
International Conference on Data Mining. SIAM, 388–396.
[57] Andreas Loukas. 2019. What graph neural networks cannot learn: depth vs width. CoRR abs/1907.03199 (2019). arXiv:1907.03199
http://arxiv.org/abs/1907.03199
[58] Zhilong Lu, Weifeng Lv, Yabin Cao, Zhipu Xie, Hao Peng, and Bowen Du. 2020. LSTM variants meet graph neural networks for road
speed prediction. Neurocomputing 400 (2020), 34–45.
[59] Denis Lukovnikov, Jens Lehmann, and Asja Fischer. 2020. Improving the Long-Range Performance of Gated Graph Neural Networks.
arXiv preprint arXiv:2007.09668 (2020).
[60] Alessio Micheli. 2009. Neural network for graphs: A contextual constructive approach. IEEE Transactions on Neural Networks 20, 3
(2009), 498–511.
[61] Alessio Micheli, Alessandro Sperduti, Antonina Starita, and Anna Maria Bianucci. 2001. Analysis of the internal representations
developed by neural networks for structures applied to quantitative structure-activity relationship studies of benzodiazepines. Journal
of Chemical Information and Computer Sciences 41, 1 (2001), 202–218.
[62] Annamalai Narayanan, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal. 2017. graph2vec:
Learning Distributed Representations of Graphs. CoRR abs/1707.05005 (2017). arXiv:1707.05005 http://arxiv.org/abs/1707.05005
[63] Chaopeng Pan, Haotian Cao, Weiwei Zhang, Xiaolin Song, and Mingjun Li. 2021. Driver activity recognition using spatial-temporal
graph convolutional LSTM networks with attention mechanism. IET Intelligent Transport Systems (2021).
[64] Shirui Pan, Ruiqi Hu, Guodong Long, Jing Jiang, Lina Yao, and Chengqi Zhang. 2018. Adversarially Regularized Graph Autoencoder for
Graph Embedding. arXiv:1802.04407 [cs.LG]
[65] George M Phillips. 2003. Interpolation and approximation by polynomials. Vol. 14. Springer Science & Business Media.
[66] Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. 2020. Gcc: Graph
contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining. 1150–1160.
[67] Benedek Rozemberczki, Oliver Kiss, and Rik Sarkar. 2020. Karate Club: An API Oriented Open-source Python Framework for
Unsupervised Learning on Graphs. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management
(CIKM ’20). ACM.
[68] Luana Ruiz, Fernando Gama, and Alejandro Ribeiro. 2019. Gated graph convolutional recurrent neural networks. In 2019 27th European
Signal Processing Conference (EUSIPCO). IEEE, 1–5.
[69] Luana Ruiz, Fernando Gama, and Alejandro Ribeiro. 2020. Gated graph recurrent neural networks. IEEE Transactions on Signal
Processing 68 (2020), 6303–6318.
[70] Guillaume Salha, Romain Hennequin, and Michalis Vazirgiannis. 2019. Keep It Simple: Graph Autoencoders Without Graph Convolu-
tional Networks. arXiv:1910.00942 [cs.LG]
[71] Guillaume Salha, Stratis Limnios, Romain Hennequin, Viet-Anh Tran, and Michalis Vazirgiannis. 2019. Gravity-Inspired Graph
Autoencoders for Directed Link Prediction. CoRR abs/1905.09570 (2019). arXiv:1905.09570 http://arxiv.org/abs/1905.09570
[72] Peter Sanders and Christian Schulz. 2016. Scalable generation of scale-free graphs. Inform. Process. Lett. 116, 7 (2016), 489–491.
[73] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. 2020. Superglue: Learning feature matching with
graph neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4938–4947.
[74] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. 2009. The Graph Neural Network Model. IEEE Transactions on
Neural Networks 20, 1 (Jan 2009), 61–80. https://doi.org/10.1109/TNN.2008.2005605
[75] Franco Scarselli, Sweah Liang Yong, Marco Gori, Markus Hagenbuchner, Ah Chung Tsoi, and Marco Maggini. 2005. Graph neural
networks for ranking web pages. In The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI’05). IEEE, 666–672.
[76] Marco Serafini. 2021. Scalable Graph Neural Network Training: The Case for Sampling. ACM SIGOPS Operating Systems Review 55, 1
(2021), 68–76.
[77] David I. Shuman, Sunil K. Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. 2012. Signal Processing on Graphs:
Extending High-Dimensional Data Analysis to Networks and Other Irregular Data Domains. CoRR abs/1211.0053 (2012). arXiv:1211.0053
http://arxiv.org/abs/1211.0053
[78] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. 2013. The emerging field of signal
processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE signal processing
magazine 30, 3 (2013), 83–98.
[79] Chenyang Si, Wentao Chen, Wei Wang, Liang Wang, and Tieniu Tan. 2019. An attention enhanced graph convolutional lstm network
for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1227–1236.
[80] Tomas Simon, Hanbyul Joo, Iain A. Matthews, and Yaser Sheikh. 2017. Hand Keypoint Detection in Single Images using Multiview
Bootstrapping. CoRR abs/1704.07809 (2017). arXiv:1704.07809 http://arxiv.org/abs/1704.07809
[81] Gerard LG Sleijpen and Henk A Van der Vorst. 2000. A Jacobi–Davidson iteration method for linear eigenvalue problems. SIAM review
42, 2 (2000), 267–293.
[82] Ljubisa Stankovic, Danilo P Mandic, Milos Dakovic, Ilia Kisil, Ervin Sejdic, and Anthony G Constantinides. 2019. Understanding the
basis of graph signal processing via an intuitive example-driven approach [lecture notes]. IEEE Signal Processing Magazine 36, 6 (2019),
133–145.
[83] Shanshan Tang, Bo Li, and Haijun Yu. 2019. ChebNet: Efficient and Stable Constructions of Deep Neural Networks with Rectified
Power Units using Chebyshev Approximations. arXiv:1911.05467 [cs.LG]
[84] Ke Tu, Peng Cui, Xiao Wang, Philip S. Yu, and Wenwu Zhu. 2018. Deep Recursive Network Embedding with Regular Equivalence. In
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (London, United Kingdom) (KDD
’18). Association for Computing Machinery, New York, NY, USA, 2357–2366. https://doi.org/10.1145/3219819.3220068
[85] Rianne van den Berg, Thomas N. Kipf, and Max Welling. 2017. Graph Convolutional Matrix Completion. arXiv:1706.02263 [stat.ML]
[86] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017.
Attention Is All You Need. CoRR abs/1706.03762 (2017). arXiv:1706.03762 http://arxiv.org/abs/1706.03762
[87] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances in neural information processing systems. 5998–6008.
[88] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks.
arXiv preprint arXiv:1710.10903 (2017).
[89] Saurabh Verma and Zhi-Li Zhang. 2017. Hunt For The Unique, Stable, Sparse And Fast Feature Learning On Graphs. In Advances in
Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett
(Eds.). Curran Associates, Inc., 88–98. http://papers.nips.cc/paper/6614-hunt-for-the-unique-stable-sparse-and-fast-feature-learning-on-graphs.pdf
[90] C Vinchoff, N Chung, T Gordon, L Lyford, and M Aibin. 2020. Traffic Prediction in Optical Networks Using Graph Convolutional
Generative Adversarial Networks. In 2020 22nd International Conference on Transparent Optical Networks (ICTON). IEEE, 1–4.
[91] Daixin Wang, Peng Cui, and Wenwu Zhu. 2016. Structural Deep Network Embedding. In Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). Association for Computing
Machinery, New York, NY, USA, 1225–1234. https://doi.org/10.1145/2939672.2939753
[92] Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, Ziyue Huang,
Qipeng Guo, Hao Zhang, Haibin Lin, Junbo Zhao, Jinyang Li, Alexander J Smola, and Zheng Zhang. 2019. Deep Graph Library:
Towards Efficient and Scalable Deep Learning on Graphs. ICLR Workshop on Representation Learning on Graphs and Manifolds (2019).
https://arxiv.org/abs/1909.01315
[93] Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and Tat-Seng Chua. 2019. Kgat: Knowledge graph attention network for recommen-
dation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 950–958.
[94] Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. 2019. Heterogeneous graph attention network. In
The World Wide Web Conference. 2022–2032.
[95] Xiang Wang, Dingxian Wang, Canran Xu, Xiangnan He, Yixin Cao, and Tat-Seng Chua. 2019. Explainable reasoning over knowledge
graphs for recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5329–5336.
[96] Oliver Wieder, Stefan Kohlbacher, Mélaine Kuenemann, Arthur Garon, Pierre Ducrot, Thomas Seidel, and Thierry Langer. 2020. A
compact review of molecular property prediction with graph neural networks. Drug Discovery Today: Technologies (2020).
[97] Felix Wu, Tianyi Zhang, Amauri H. Souza Jr., Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. 2019. Simplifying Graph Convolutional
Networks. CoRR abs/1902.07153 (2019). arXiv:1902.07153 http://arxiv.org/abs/1902.07153
[98] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. 2019. A Comprehensive Survey on Graph
Neural Networks. CoRR abs/1901.00596 (2019). arXiv:1901.00596 http://arxiv.org/abs/1901.00596
[99] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How Powerful are Graph Neural Networks? CoRR abs/1810.00826
(2018). arXiv:1810.00826 http://arxiv.org/abs/1810.00826
[100] Yongqiang Yin, Xiangwei Zheng, Bin Hu, Yuang Zhang, and Xinchun Cui. 2021. EEG emotion recognition using fusion model of graph
convolutional neural networks and LSTM. Applied Soft Computing 100 (2021), 106954.
[101] Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. 2021. Do Transformers
Really Perform Bad for Graph Representation? arXiv preprint arXiv:2106.05234 (2021).
[102] Wenchao Yu, Cheng Zheng, Wei Cheng, Charu C. Aggarwal, Dongjin Song, Bo Zong, Haifeng Chen, and Wei Wang. 2018. Learning
Deep Network Representations with Adversarially Regularized Autoencoders. In Proceedings of the 24th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining (London, United Kingdom) (KDD ’18). Association for Computing Machinery, New
York, NY, USA, 2663–2671. https://doi.org/10.1145/3219819.3220000
[103] Hao Yuan, Haiyang Yu, Shurui Gui, and Shuiwang Ji. 2020. Explainability in graph neural networks: A taxonomic survey. arXiv preprint
arXiv:2012.15445 (2020).
[104] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. 2021. Barlow twins: Self-supervised learning via redundancy
reduction. arXiv preprint arXiv:2103.03230 (2021).
[105] Weili Zeng, Juan Li, Zhibin Quan, and Xiaobo Lu. 2021. A Deep Graph-Embedded LSTM Neural Network Approach for Airport Delay
Prediction. Journal of Advanced Transportation 2021 (2021).
[106] Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. 2018. Gaan: Gated attention networks for learning on
large and spatiotemporal graphs. arXiv preprint arXiv:1803.07294 (2018).
[107] Yiding Zhang, Xiao Wang, Chuan Shi, Xunqiang Jiang, and Yanfang Fanny Ye. 2021. Hyperbolic graph attention network. IEEE
Transactions on Big Data (2021).
[108] Ziwei Zhang, Peng Cui, and Wenwu Zhu. 2018. Deep Learning on Graphs: A Survey. CoRR abs/1812.04202 (2018). arXiv:1812.04202
http://arxiv.org/abs/1812.04202
[109] Da Zheng, Chao Ma, Minjie Wang, Jinjing Zhou, Qidong Su, Xiang Song, Quan Gan, Zheng Zhang, and George Karypis. 2020.
Distdgl: distributed graph neural network training for billion-scale graphs. In 2020 IEEE/ACM 10th Workshop on Irregular Applications:
Architectures and Algorithms (IA3). IEEE, 36ś44.
[110] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. 2018. Graph Neural Networks: A Review of
Methods and Applications. CoRR abs/1812.08434 (2018). arXiv:1812.08434 http://arxiv.org/abs/1812.08434
[111] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2020. Deep graph contrastive representation learning. arXiv
preprint arXiv:2006.04131 (2020).
[112] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021. Graph contrastive learning with adaptive augmentation.
In Proceedings of the Web Conference 2021. 2069–2080.