A THESIS
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY
Sudipta Mondal
June, 2024
© Sudipta Mondal 2024
ALL RIGHTS RESERVED
Acknowledgements
First, I would like to extend my deepest gratitude to my esteemed advisor, Prof. Sachin
S. Sapatnekar. Under his mentorship, I have not only gained invaluable insights into
the realm of research but also imbibed qualities of perseverance, meticulousness, and
holistic problem-solving. The unwavering support, profound guidance, and scholarly
wisdom of Prof. Sapatnekar have been instrumental in shaping me into the researcher I
am today. His dedication to fostering academic excellence has been truly inspiring, and
I am honored to have had the privilege of working under his supervision.
I would also like to express my sincere appreciation to my thesis committee members,
Prof. Kia Bazargan, Prof. Antonia Zhai, and Prof. Ulya Karpuzcu, whose insightful
feedback and constructive critiques have significantly enriched my research endeavors. I
am deeply thankful to the Semiconductor Research Corporation (SRC) and the ECE
department of UMN for their generous funding, which has provided essential support for
my research throughout my PhD. Special thanks are also due to the dedicated staff of the
ECE department, particularly Chimai Nguyen, Sarah Dohm, and Ann Rausch, for their
unwavering assistance and support. My heartfelt appreciation goes out to my esteemed
project partners, Dr. Susmita Dey Manasi, Dr. Kishore Kunal, Dr. Ramprasath,
Ziqing Zeng, and all other lab mates, including Dr. Vidya A. Chhabria, Dr. Tonmoy
Dhar, Mohammad Shohel, Subhadip Ghosh, Abhimanyu Kumar, Hangyu Zhang, and
Endalkachew Gebru, whose collaborative efforts have contributed immensely to the
success of my research endeavors.
I am also grateful to the vibrant UMN Bangladeshi community, particularly Dr.
Snigdha, Dr. Ibrahim, Nitol, Gourab, Gourango, Neon, and Darpan, for their unwavering
camaraderie and support during various social events, including get-togethers, potlucks,
and recreational activities.
To my beloved family, including my mother Promila Biswas and my father Dipak
Ranjan Mondal, I owe a debt of gratitude beyond words. Their unwavering love,
encouragement, and support have been the cornerstone of my journey, guiding me
through every challenge and triumph. And to my better half, best buddy, and also my
esteemed labmate Nibedita Karmokar, I owe a profound debt of gratitude. Throughout
this journey, she has been my unwavering pillar of strength, standing by me through
thick and thin, and cheering me on through every high and low. Her belief in my abilities
never wavered, even in the face of adversity, and her resolute support has been the
anchor that kept me grounded during the most challenging times. I am immensely
grateful for her presence in my life and for the countless ways she has enriched my
journey with her unwavering love, friendship, and support. I am also thankful to my
parents-in-law, Tanusri Chowdhury and Nanda Kumar Karmokar, for their constant
support and blessings.
Finally, I am proud to call myself a graduate of the University of Minnesota, whose
nurturing environment and steadfast support have been integral to my academic and
personal growth.
Dedication
Abstract
Graph neural networks (GNNs) are vital for analyzing real-world problems (e.g., network
analysis, drug interaction, electronic design automation, e-commerce) that use graph
models. However, efficient GNN acceleration faces multiple challenges related to the
high and variable sparsity of input feature vectors, the power-law degree distribution of
the adjacency matrix, and the need to maintain load-balanced computation with minimal
random memory accesses. This thesis addresses the problem of building fast, energy-efficient
inference and training accelerators for GNNs, covering both static and dynamic graphs.
For inference, this thesis proposes GNNIE, a versatile GNN inference accelerator
capable of handling a diverse set of GNNs, including graph attention networks (GATs),
graph convolutional networks (GCNs), GraphSAGE, GINConv, and DiffPool. It mitigates
workload imbalance by (i) splitting vertex feature operands into blocks, (ii) reordering
and redistributing computations, and (iii) using a novel “flexible MAC” architecture. To
maximize on-chip data reuse and reduce random DRAM fetches, GNNIE adopts a novel
graph-specific, degree-aware caching policy. GNNIE attains substantial speedup over
CPU (7197×), GPU (17.81×), and prior works, e.g., HyGCN (5×), AWB-GCN (1.3×)
over multiple datasets on GCN, GAT, GraphSAGE, and GINConv.
For training GNNs on large graphs, this research develops a GNNIE-based multicore
accelerator. A novel feature vector segmentation approach is proposed to scale to
large graphs using small on-chip buffers. A multicore-specific, graph-specific caching
scheme is also implemented to reduce off-chip and on-chip communication and to alleviate
random DRAM accesses. Experiments over multiple large datasets and multiple GNNs
demonstrate an average training speedup and energy efficiency improvement of 17× and
322×, respectively, over DGL on a GPU, and a speedup of 14× with 268× lower energy
over the GPU-based GNNAdvisor approach. Overall, this research tackles scalability
and versatility issues of building GNN accelerators while delivering significant speedup
and energy efficiency.
Finally, this thesis addresses the acceleration of dynamic graph neural networks
(DGNNs), which play a crucial role in applications such as social network analytics
and urban traffic prediction that require inferencing on graph-structured data, where
the connectivity and features of the underlying graph evolve over time. The proposed
platform integrates GNN and Recurrent Neural Network (RNN) components of DGNNs,
providing a unified platform for spatial and temporal information capture, respectively.
The contributions encompass optimized cache reuse strategies, a novel caching policy,
and an efficient pipelining mechanism. Evaluation across multiple graph datasets and
multiple DGNNs demonstrates average energy efficiency gains of 8393×, 183×, and
87×–10×, and inference speedups of 1796×, 77×, and 21×–2.4×, over an Intel Xeon Gold
CPU, an NVIDIA V100 GPU, and prior state-of-the-art DGNN accelerators, respectively.
Contents
Acknowledgements i
Dedication iii
Abstract iv
Contents vi
List of Tables ix
List of Figures x
1 Introduction 1
1.1 Hardware Acceleration of GNN Inference . . . . . . . . . . . . . . . . . 2
1.2 Multicore Acceleration of GNN Training on Large Graphs . . . . . . . . 3
1.3 Inference Acceleration of Dynamic Graph Neural Networks . . . . . . . 4
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Fundamentals of GNNs 6
2.1 Machine Learning on Graph-structured Data . . . . . . . . . . . . . . . 6
2.2 Types of GNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 GNNIE 11
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Accelerator Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Mapping Weighting to Computational Processing Elements . . . . . . . 18
3.3.1 Scheduling Operations in the Computational Processing Elements 18
3.3.2 The Merge Computational Processing Element . . . . . . . . . . 20
3.3.3 Load Balancing for Weighting . . . . . . . . . . . . . . . . . . . . 21
3.4 Aggregation Computations . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4.1 Reordering for Linear Computational Complexity . . . . . . . . . 23
3.4.2 Mapping Attention Vector Multiplication . . . . . . . . . . . . . 25
3.4.3 Mapping Edge-based Computations . . . . . . . . . . . . . . . . 26
3.5 Graph-Specific Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.7.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.7.2 Baseline Platform Comparisons . . . . . . . . . . . . . . . . . . . 36
3.7.3 Cross-platform Comparisons . . . . . . . . . . . . . . . . . . . . 38
3.7.4 Throughput and Energy Comparisons . . . . . . . . . . . . . . . 39
3.7.5 DRAM Access Analysis . . . . . . . . . . . . . . . . . . . . . . . 40
3.7.6 Optimization Analysis . . . . . . . . . . . . . . . . . . . . . . . . 42
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3 Proposed DGNN Accelerator . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4 Accelerating GNN Computations . . . . . . . . . . . . . . . . . . . . . . 75
5.4.1 Overlap Extraction Methodology . . . . . . . . . . . . . . . . . . 75
5.4.2 Proposed Overlap-aware Caching Policy . . . . . . . . . . . . . . 76
5.5 Accelerating RNN Computations . . . . . . . . . . . . . . . . . . . . . . 78
5.5.1 Weight Coalescing for the RNN Kernel . . . . . . . . . . . . . . . 78
5.5.2 Pipelining GNN and RNN Engines . . . . . . . . . . . . . . . . . 79
5.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6 Thesis Conclusion 85
Bibliography 87
List of Tables
List of Figures
3.20 Effectiveness of GNNIE’s optimization methods. . . . . . . . . . . . . . 46
4.1 Block diagram of the proposed multicore GNN training accelerator (core
architecture in inset) with 4 cores; our evaluation considers accelerators
with up to 36 cores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 (a) Boosting γintra to break intra-cluster stagnation on Core 2. (b) Invok-
ing full random access after most edges are processed on all cores. . . . 55
4.3 Feature vector segmentation. . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 Performance analysis of feature vector segmentation: (a) etotal (Average)
vs. Execution Cycles (b) Aggregation cycle comparison. . . . . . . . . . 59
4.5 Speedup and energy efficiency of the proposed multicore GNN training
accelerator vs. DGL+Tesla V100 and GNNAdvisor+Tesla V100: (a), (c):
Type A datasets (b), (d): Type B datasets. . . . . . . . . . . . . . . . . 63
4.6 Inference speedup and energy efficiency the proposed multicore GNN
training accelerator vs. DGL+Tesla V100 and GNNAdvisor+Tesla V100:
(a), (c): Type A datasets (b), (d): Type B datasets. . . . . . . . . . . . 66
5.1 Block diagram of the proposed DGNN accelerator. . . . . . . . . . . . . 74
5.2 Overlap extraction between consecutive groups. . . . . . . . . . . . . . . 77
5.3 Implementation of our weight coalescing scheme. . . . . . . . . . . . . . 79
5.4 Implementation of inter-engine pipelining. . . . . . . . . . . . . . . . . . 80
5.5 Speedup vs. snapshots per group. . . . . . . . . . . . . . . . . . . . . . . 81
5.6 Speedup comparison results for DGNN inference. . . . . . . . . . . . . . 82
5.7 Energy efficiency comparison results for DGNN inference. . . . . . . . . 84
Chapter 1
Introduction
The remarkable success of machine learning and artificial intelligence in the current era can
be attributed to deep learning (DL)-based approaches that have revolutionized the way
we process data and make decisions. This success lies in the ability of these approaches
to extract patterns and insights from vast amounts of information, empowering a wide
range of tasks such as object detection, disease diagnosis, and speech recognition among
numerous others.
One of the first major advancements in deep learning that kicked off the AI revolution
was the deployment of convolutional neural networks (CNNs), which are now particularly
renowned for their proficiency in image recognition and classification tasks. CNNs have
drastically improved the accuracy and efficiency of tasks such as object detection, facial
recognition, and medical imaging, showcasing their indispensable role in modern AI
applications.
However, traditional neural networks such as CNNs and recurrent neural networks
(RNNs) are limited in their ability to process non-Euclidean data structures such
as graphs. This is where graph neural networks (GNNs) come into play, offering a
breakthrough in analyzing and understanding complex relational data for myriads of
real-world problems (e.g., social network analysis, recommendation systems, epidemic
forecasting, and molecular modeling). The significance of GNNs lies in their unique
ability to capture intricate relationships and dependencies for graph structured data,
paving the way for more sophisticated and context-aware machine learning models.
With the scalability challenges posed by the ever-growing size and complexity of data,
implementations of GNNs on conventional computing platforms, e.g., central processing
units (CPUs), graphics processing units (GPUs), and field programmable gate arrays
(FPGAs) struggle to meet the computational and energy demands of DL workloads.
Hence, there is a pressing need to build specialized application-specific integrated circuit
(ASIC)-based accelerators, with high performance and low power, that enable successful
deployment (i.e., inference) for edge-based applications, as well as training, where the
model parameters of the neural network are optimized.
However, the accelerators proposed in the literature for CNN and RNN applications [3–
9] are not suited to address the computational and energy demands imposed by GNNs.
This thesis endeavors to address the burgeoning demand for ASIC-based accelerators for
GNNs. In this thesis, we propose to address the following three challenges:
• Chapter 4 describes our proposed multicore GNN training accelerator for large
graphs. This chapter discusses the novel feature vector segmentation and dynamic
caching scheme that enable our platform to achieve GPU-like scalability and
accelerator-like efficiency for large graphs. Experiments conducted on various
datasets and different GNNs show that our approach achieves an average training
speedup of 17× and an energy efficiency improvement of 322× compared to a
GPU-based baseline.
Chapter 2
Fundamentals of GNNs
The increasing complexity and scale of data in various fields necessitate advanced methods
for processing and analyzing graph-structured information. GNNs have emerged as
a key technology to address this need, providing sophisticated tools for handling the
non-Euclidean nature of graphs. Unlike traditional neural networks like CNNs and RNNs,
which are optimized for Euclidean data such as images and sequences, GNNs excel at
capturing intricate relationships in graph data. This capability is crucial for applications
in social network analysis, molecular chemistry, and recommendation systems, where
the relationships between data points are inherently irregular and interconnected. The
following two sections of this chapter will delve into the fundamental principles of
GNNs, tracing their evolution and categorizing the main types of GNN architectures.
These discussions provide a foundation for understanding the majority of the work in
this thesis.
GNNs are designed to operate directly on graph-structured data, which CNNs and RNNs struggle to handle. We can think of an
image as essentially an array of pixels with a highly structured connectivity pattern
(Fig. 2.1). Each pixel in an image has neighboring pixels to its east, west, north, and
south, forming a structured graph-like arrangement. In general, graphs do not have this
inherent structured nature, and each node can have an arbitrary number of neighboring
nodes, with no specific order or regularity. CNNs and RNNs typically rely on ordered
feature stacking, which becomes redundant and inefficient when applied to graph inputs.
In contrast, GNNs propagate information on each node individually, disregarding the
input order and producing invariant outputs, thus overcoming the inherent limitations
of ordered processing.
Despite their potential, GNNs also pose unique challenges that researchers must
address. The irregular structures of graphs, coupled with the large scale of real-world
graph data, present significant computational and scalability challenges. Additionally,
the wide variety of graphs encountered in different domains necessitates the development
of flexible and adaptable GNN architectures capable of accommodating diverse data
types and structures.
In addressing these challenges, researchers have classified GNNs into two main
categories: spectral-based and spatial-based approaches. Early approaches [23, 24] have
used spectral-based methods to define graph convolutions using filters inspired by graph
signal processing, interpreting the convolutional operation as noise removal from graph
signals. On the other hand, spatial-based approaches [2, 25, 26] focus on aggregating
information from neighboring nodes to define graph convolutions. Just as a filter is
applied to a set of pixels in an image, graph convolution involves aggregating information
from neighboring vertices within the vicinity of a node. This weighted averaging process
enables GNNs to effectively process graph-structured data.
In Fig. 2.2, a comparison is drawn between performing 2D convolution on image data,
which can be seen as a specialized form of a graph, and conducting convolution on a
general graph. In the image representation, each pixel is connected to adjacent pixels in
the east, west, north, and south directions, forming a structured graph-like arrangement.
The highlighted area in the figure represents a filter applied to the set of vertices (or
pixels in the case of an image). However, due to the irregular distribution and ordering of
the neighbors of the vertices, a general graph may not be easily embeddable into a 2D planar
format. As a result, performing convolution and pooling operations on such graphs is
not as straightforward as in CNNs. In contrast, in graph convolution, information
from neighboring vertices within the neighborhood of a node is aggregated through a
weighted average, mirroring the process of 2D convolution on images. The advent of
GNN models such as graph convolutional networks (GCNs) [25] has bridged the gap
between spectral and spatial approaches, leading to rapid advancements in spatial-based
methods.
For each vertex $i$ in a layer, over a set of neighboring vertices $j$, the GNN aggregates information from the vectors $h_j^{l-1}$ of the previous layer and processes it to create the output feature vector, $h_i^l$.
Regardless of the type, GNNs have two major computational steps in common:
(i) Weighting multiplies the vertex feature vector $h_i^{l-1}$ of each vertex by a weight matrix $W^l$ of dimension $F^{l-1} \times F^l$. (ii) Aggregation combines the weighted vertex feature vectors of the vertices neighboring vertex $i$. Table 2.1 shows the Weighting and Aggregation operations for GCNs [25], GraphSAGE [26], graph attention networks (GATs) [2], and GINConv [27]. For Aggregation, if $N(i)$ is the immediate one-hop neighborhood of vertex $i$, then for GCNs, GATs, and GINConv, $\widetilde{N}(i) = \{i\} \cup N(i)$. For GraphSAGE, $\widetilde{N}(i) = \{i\} \cup S_{N(i)}$, where $S_{N(i)}$ is a random sample of $N(i)$. At vertex $i$, the aggregation operation performed by various GNNs is summarized below:
GCNs: Each product $h_j^{l-1} W^l$, $j \in \widetilde{N}(i)$, is multiplied by $1/\sqrt{d_i d_j}$ ($d_*$ is the vertex degree), and the results are summed.
GraphSAGE: The products $h_j^{l-1} W^l$ are combined over $j \in \widetilde{N}(i)$ using an aggregator $a_k$ (typically, mean or pooling).
GATs: For each edge $(i, j)$, an inner product with a learned attention vector $a^l$ finds the normalized attention coefficient
$$\alpha_{ij} = \mathrm{softmax}\big(\mathrm{LeakyReLU}({a^l}^T \cdot [\,h_i^{l-1} W^l \,\|\, h_j^{l-1} W^l\,])\big),$$
followed by $\sum_{j \in \{i\} \cup N(i)} \alpha_{ij}\, h_j^{l-1} W^l$, a weighted aggregation.
GINConv: The vertex feature vectors of all neighbors of a vertex $i$ are summed and added to $(1 + \epsilon^l)$ times the vertex feature vector of $i$, where $\epsilon^l$ is a learned parameter, using a multilayer perceptron (MLP) with weights $W^l$ and $b^l$:
$$h_i^l = \mathrm{MLP}^l\Big((1 + \epsilon^l)\, h_i^{l-1} + \textstyle\sum_{j \in N(i)} h_j^{l-1};\ W^l, b^l\Big) \qquad (2.1)$$
$$h_G = \big\Vert_{l=1}^{L} \textstyle\sum_{i \in G} h_i^l \qquad (2.2)$$
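To make the two common computational steps concrete, the following minimal NumPy sketch evaluates one GCN layer using the Weighting and Aggregation operations defined above. The dense matrices and the variable names (adj, H, W) are illustrative assumptions chosen for readability; this is not a description of any accelerator dataflow.

```python
import numpy as np

def gcn_layer(adj, H, W):
    """One GCN layer: Weighting (H @ W) followed by degree-normalized Aggregation.

    adj: dense {0,1} adjacency matrix (|V| x |V|); H: vertex features (|V| x F_in);
    W: weight matrix (F_in x F_out). Dense matrices are used purely for clarity.
    """
    A_tilde = adj + np.eye(adj.shape[0])        # add self-loops: N~(i) = {i} U N(i)
    deg = A_tilde.sum(axis=1)                   # vertex degrees d_i
    HW = H @ W                                  # Weighting: linear transform of features
    norm = 1.0 / np.sqrt(np.outer(deg, deg))    # 1/sqrt(d_i d_j) normalization
    H_next = (A_tilde * norm) @ HW              # Aggregation over N~(i)
    return np.maximum(H_next, 0.0)              # ReLU activation

# toy usage: 4-vertex graph, 8-dimensional input features, 4-dimensional output
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], dtype=float)
print(gcn_layer(adj, rng.random((4, 8)), rng.random((8, 4))).shape)   # (4, 4)
```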
DiffPool [28] can be combined with any of these GNNs to reduce the volume of data.
It uses two GNNs, one to extract vertex embeddings for graph classification, and one
to extract embeddings for hierarchical pooling. The embedding GNN at layer $l$ is a standard GNN with Weighting and Aggregation,
$$Z^{l-1} = \mathrm{GNN}_{\mathrm{embed}}(A^{l-1}, X^{l-1}) \qquad (2.3)$$
where $A^{l-1}$ is the adjacency matrix of the coarsened graph at level $(l-1)$, and $X^{l-1}$ is the matrix of input cluster features. The pooling GNN generates the assignment matrix:
$$S^{l-1} = \mathrm{softmax}\big(\mathrm{GNN}_{\mathrm{pool}}(A^{l-1}, X^{l-1})\big) \qquad (2.4)$$
The number of clusters in layer $l$ is fixed during inference. The coarsened adjacency matrix is $A^l = S^{(l-1)T} A^{l-1} S^{l-1}$, and the new embedding matrix is $X^l = S^{(l-1)T} Z^{l-1}$.
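Once the embedding and pooling GNN outputs are available, a DiffPool level reduces to a row-wise softmax and two matrix products, as in (2.3)–(2.4). The sketch below only illustrates that algebra; gnn_embed and gnn_pool are hypothetical callables standing in for the two GNNs, whose internals are not modeled.

```python
import numpy as np

def diffpool_level(A, X, gnn_embed, gnn_pool):
    """One DiffPool coarsening level (algebra only, illustrative sketch).

    A: coarsened adjacency at level l-1; X: cluster features at level l-1.
    gnn_embed / gnn_pool: callables returning Z^{l-1} and the pre-softmax
    assignment scores, respectively.
    """
    Z = gnn_embed(A, X)                                   # embeddings, Eq. (2.3)
    scores = gnn_pool(A, X)                               # assignment scores, Eq. (2.4)
    S = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # row-wise softmax
    A_next = S.T @ A @ S                                  # A^l = S^{(l-1)T} A^{l-1} S^{l-1}
    X_next = S.T @ Z                                      # X^l = S^{(l-1)T} Z^{l-1}
    return A_next, X_next
```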
Chapter 3
GNNIE
3.1 Introduction
Deep learning accelerators have largely focused on data with Euclidean embeddings,
e.g., audio/video/images/speech. Many real-world problems (e.g., network analysis,
embedded sensing, e-commerce, drug interactions) use graphs to model relationships.
Inferencing on large, unstructured, and sparse graphs with non-Euclidean embeddings
requires specialized GNNs. Today’s GNNs [2, 25–27] are based on nearest-neighbor
operations, with improved efficiency over early methods [23–25].
Multilayer GNN inference engines perform two computation steps per layer, as
outlined in Section 2.2:
(a) Weighting performs a linear transform of vertex feature vectors through multipli-
cation by a weight matrix.
(b) Aggregation consolidates information from the neighbors of a vertex to compute
the feature vectors for the next layer.
The challenges of building efficient GNN accelerators for inference tasks are as follows:
(1) Versatility An accelerator should be able to handle a diverse set of GNN architectures
to provide appropriate computation/accuracy tradeoff points for various applications.
The achievable accuracy depends on the GNN: GATs
achieve higher accuracy than other GNNs, but with more computation (Fig. 3.1).
(2) Adjacency matrix sparsity The graph adjacency matrix encodes vertex neighborhood
information required for Aggregation. The adjacency matrix is highly sparse (> 99.8%
for all datasets in this thesis; in contrast, DNN data shows 10%–50% sparsity). Unlike
image/video data, adjacency matrix sparsity patterns typically exhibit power-law behavior,
with vertex degrees ranging from very low (for most vertices) to extremely high (for very
few vertices): in the Reddit dataset, 11% of the vertices cover 88% of all edges.
Figure 3.1: GNN accuracy comparison (data from [2], PPI dataset).
Figure 3.2: Nonzero histogram for input vertex feature vectors (Cora).
(3) Input feature vector sparsity The vertex input feature vectors are highly sparse, e.g.,
the 2708 input vertex feature vectors of the Cora dataset have 98.73% average sparsity.
In Fig. 3.2, Region A is sparser than B and requires less computation, leading to load
balancing issues during Weighting.
(4) Memory footprint and random-access patterns Real-world graphs have a large number
of vertices and a massive memory footprint (Reddit: 2.3 GB in sparse format). High
sparsity and power-law distributions can lead to random memory access patterns and
poor data access locality in Aggregation.
13
Therefore, GNN-specific accelerators must address:
(a) load balancing during Weighting, due to the sparsity variations in Fig. 3.2, and during
Aggregation, due to the imbalance of computations for high- and low-degree vertices.
(b) lightweight graph-specific caching of the adjacency matrix for high data access locality
and maximal reuse of cached data.
In this chapter we present GNNIE (pronounced “genie”), a versatile and high-
performance accelerator designed to handle inference tasks on diverse GNN architectures.
This chapter highlights significant advancements of GNNIE in inference acceleration for
static graphs (where the graph topology and vertex features do not change over time),
addressing the critical challenges and outperforming existing solutions.
Relation to other acceleration engines: The Weighting step performs matrix-
vector multiplication which resembles CNN computations, but CNN accelerators [3–9]
are inefficient at handling graph data. Aggregation operates on graph neighborhoods
and resembles graph analytics, but graph processing accelerators [29–31] are designed
to perform lightweight computations, significantly lower than the needs of a GNN.
Extensions of CNN/graph processing engines are inadequate.
An early GNN accelerator, HyGCN [10], bridges the divide by using two pipelined
engines: an Aggregation engine that operates on graph data and consolidates vertex
feature vectors from the neighborhood of each vertex, followed by a Combination engine,
which uses a multilayer perceptron to weight the aggregated features with the weight
matrix. The disparity between engines raises challenges in providing a steady stream of
data to keep the Aggregation/Combination engine pipeline busy. The Aggregation engine
does not account for power-law behavior while caching partial results, and high-degree
vertices may create stalls due to the limited size of on-chip buffers. In the Combination
engine, the aggregated feature vectors are both sparse and show high sparsity variations
(Fig. 3.2). Consequently, stalls are required, leading to inefficiency.
AWB-GCN [11] views the GNN computation as two consecutive sparse-dense matrix
multiplications (SpMMs). During Weighting, the method targets moderate sparsity
(around 75%), but input-layer vertex feature vectors are ultra-sparse (Fig. 3.2).
During Aggregation, the graph-agnostic SpMM view necessitates numerous expensive off-
chip accesses to the adjacency matrix. AWB-GCN addresses workload imbalance issues
through multiple rounds of runtime load-rebalancing, but this leads to high inter-PE
communication. Finally, SpMM-based approaches face more severe load imbalances for
implementing GNNs that involve additional complex computations before Aggregation
(e.g., softmax in GATs and DiffPool). In fact, AWB-GCN targets only GCNs and not
general GNNs. SCV-GNN [32], a matrix multiplication-based approach, proposes a
sparse compressed vector format accompanied by a processing order that aids parallelism
and reduces workload imbalance.
Novelty of this work: GNNIE uses a single engine that efficiently performs both
Weighting and Aggregation. The GNNIE framework handles high levels of sparsity in
the input vertex feature vectors and the adjacency matrix, with novel approaches for
load balancing and graph-specific caching. It covers a number of GNN topologies, from
lower accuracy/lower computation (e.g., GCN, GraphSAGE) to higher accuracy/higher
computation (e.g., GATs), as motivated in Fig. 3.1, and is more versatile than previous
methods in handling functions such as softmax over a neighborhood (e.g., as used for
attention normalization in GATs; prior work [33] on GATs skips this crucial step).
Novel methods to mitigate sparsity effects, and overcome load imbalances and
compute bottlenecks, include:
• Load balancing during Weighting based on splitting vertex features into blocks
(Section 3.3.1). Together with load balancing (Section 3.3.3), this enhances throughput
during Weighting by ensuring high PE utilization and skipping unnecessary com-
putations, by (a) Reordering computations on a flexible MAC (FM) architecture to
address imbalances due to input feature vector sparsity variations. Computations
are dynamically mapped to heterogeneous PEs, each with different numbers of MAC
units. (b) Static load redistribution to nearby PEs, offloading computations from
heavily-loaded to lightly-loaded rows, minimizing inter-PE communication.
• Load-balanced edge Aggregation (Section 3.4) through a mapping scheme that fully
utilizes the PEs. For GATs, we further propose a novel linear-complexity computation
that implements compute-bound attention vector multiplication similarly to Weighting,
and memory-bound attention coefficient computation to maximize reuse of cached
data.
From the input buffer, the data is sent through the RLC decoder to the PE array. The RLC decoder is activated
for sparse input layer vertex feature vectors, and bypassed for denser feature vectors in
later layers.
The output buffer caches intermediate results for vertex feature vectors, including
the result of multiplication by W l after Weighting, and the result after Aggregation.
The end result is written to off-chip memory. The weight buffer holds the values of the
weight matrix W l during Weighting, and, for GAT computations, the attention vector
during Aggregation.
The memory access scheduler coordinates off-chip memory requests from the in-
put/output/weight buffers.
(3) An array of processing elements (PEs): The array consists of an M × N array
of computational PEs (CPEs). Each CPE has two scratch pads (spads) and MAC units.
Within the array of CPEs, we merge multiple columns of Special Function Units
(SFUs) (e.g., exp, LeakyReLU, division) [grey blocks], and a row of merge PEs (MPEs)
[red blocks]. This interleaved placement allows low latency and low communication overhead
with CPEs. For exponentiation, we use an accurate, low-area lookup-table-based
implementation [40].
Merge PEs (MPEs) are used to aggregate partial results of vertex features sent from
the CPE array during Weighting and Aggregation. One MPE is dedicated for each CPE
column in the array (Fig. 3.3), for merging the partial results of vertices. Since these
partial results may belong to different vertices we use 16 wires, i.e., one for each CPE,
while sending the partial results to the MPEs. A tag is sent along with each partial
result to indicate the vertex that the partial result is associated with.
The partial results, along with the tags, are received from the CPEs and stored in
the update spad of the MPE. The update spad can hold 16 such partial results and
corresponding tags received from the 16 CPEs of the column. If the tags of partial
results match (i.e., if they belong to the same vertex), they are sent to one of the 16
accumulators in the accumulator bank of the MPE to be merged. The result is stored in
one of the 16 psum spads with the corresponding tag. The intermediate result stored in
the psum spad may be brought into the accumulator again if a partial result with the
same tag is found in the update spad. Following the same procedure, these values are
summed and the result is stored in the psum spad. After merging of the partial results
in the update spad is completed, the psum spads send the results and tag to the output
buffer.
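The tag-matching accumulation performed by an MPE can be summarized by a short behavioral model. The 16-entry capacity follows the description above; the dictionary-based storage and the explicit stall exception are modeling conveniences, not the hardware organization.

```python
class MergePE:
    """Behavioral model of an MPE: merge tagged partial results by vertex ID."""

    def __init__(self, num_slots=16):
        self.num_slots = num_slots       # psum spads / accumulators available
        self.psum = {}                   # tag (vertex ID) -> accumulated partial sum

    def receive(self, tagged_partials):
        """tagged_partials: (vertex_tag, value) pairs arriving from the column's CPEs."""
        for tag, value in tagged_partials:
            if tag in self.psum:
                self.psum[tag] += value              # matching tags are accumulated
            elif len(self.psum) < self.num_slots:
                self.psum[tag] = value               # allocate a free psum slot
            else:
                raise RuntimeError("psum slots full: a stall would be required")

    def drain(self, tag):
        """Send a completed result (and its tag) to the output buffer."""
        return tag, self.psum.pop(tag)
```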
(4) The Activation unit performs an activation operation on the vertex features at the
final activation stage of computation.
(5) The controller coordinates operations, including assigning vertex features to the
CPE, workload reordering among the CPEs, sending CPE results to the MPEs, sending
MPE data to the output buffer, and writing concatenated MPE data.
For a GCN, the layer-wise computation can be written as:
$$h_i^l = \sigma(\tilde{A}\, h_i^{l-1} W^l) \qquad (3.1)$$
where $\tilde{A}$ is the normalized adjacency matrix.
We now map the Weighting step, which multiplies the sparse feature row vector $h_i^{l-1}$
with the dense weight matrix $W^l$, to the architecture. The feature vectors are fetched
with the dense weight matrix W l , to the architecture. The feature vectors are fetched
from DRAM, core computations are performed in the CPEs, and the results from the
CPEs are assimilated in the MPEs before being written back to DRAM. Our novel
scheduling methodology keeps the CPEs busy during the computation, so that Weighting
is not memory-bounded. We partition data in two ways (Fig. 3.5):
(1) Across the vertex feature vector: We process a block of $k$ elements of $h_i^{l-1}$ at a time, multiplying it by the corresponding $k$ rows of $W^l$. This is mapped to a row of the CPE array. With a block size of $k = F^{l-1}/M$, the entire feature vector is covered by the $M$ rows of the CPE array.
(2) Across the columns of $W^l$: Each pass processes a set of $N$ columns of $W^l$, one column per CPE column. At the end of a pass, the next set of $N$ columns of $W^l$ is loaded. After
all passes are completed, the current set of weights is replaced by a new set, and the
process continues under the weight-stationary scheme. Within each pass, the CPEs are
loaded as follows:
• For a given set of s vertices, the ith subvectors, of size k, of all s vertex feature vectors
are broadcast to the entire CPE row i using a bus. This is indicated by h in Fig. 3.4.
Since the CPEs in a row work independently of each other and CPE rows do not talk
to each other during the Weighting phase, we do not require a complex interconnection
scheme: since all CPEs in a row are assigned the same feature vector blocks of length
k, we use a bus-based interconnection to broadcast this data to a CPE row.
To leverage input data sparsity, a zero detection buffer is used to detect whether a
k-element block that is to be broadcast contains zeros only, so that these computations
can be skipped. In case such a block is detected we refrain from broadcasting it to the
CPE row. We place zero detection circuitry at the output of the RLC decoder (Fig. 3.3),
at a stage after the k-element blocks are created. The zero-detection function uses a set
of OR gates and has minimal hardware overhead.
Benefit of using vertex feature subvector blocks: Our use of k-element blocks
instead of the entire vector allows a CPE to skip zero subvectors during pipelined
execution and immediately move on to a block from the next available subvector. The
next block will be fetched from the input buffer, and under the weight-stationary scheme,
it can start computation with the already-loaded weights in the CPE.
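A small Python sketch of the intent of k-element blocking with zero-block skipping follows. The generator-based formulation and the dense representation of the weight matrix are illustrative assumptions, and the assignment of blocks to specific CPE rows is omitted.

```python
import numpy as np

def nonzero_blocks(h, k):
    """Split feature vector h into k-element blocks, skipping all-zero blocks."""
    assert len(h) % k == 0
    for i in range(len(h) // k):
        block = h[i * k:(i + 1) * k]
        if np.any(block):              # zero detection: broadcast only nonzero blocks
            yield i, block

def weighting_partials(h, W, k):
    """Accumulate partial products h_block @ W_block, as the MPEs would per output element."""
    partials = np.zeros(W.shape[1])
    for i, block in nonzero_blocks(h, k):
        partials += block @ W[i * k:(i + 1) * k, :]   # uses the k rows of W matching block i
    return partials

# toy usage: a sparse 8-element feature vector, k = 2 blocks, 3 output features
h = np.array([0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 2.0, 1.0])
print(weighting_partials(h, np.ones((8, 3)), k=2))    # only blocks 1 and 3 are processed
```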
The proposed weight-stationary dataflow maximizes the reuse of the weights cached
in the weight buffer, which in turn reduces the size requirement of the on-chip weight
buffer. Though the feature vectors fetched in the input buffer are get reused, for all
datasets evaluated, the computation time for vertices cached in the input buffer is seen
to be larger than the memory fetch time under the HBM 2.0 off-chip bandwidth.
The MAC operation within each CPE generates a partial result for an element of the
transformed vertex features. This is sent to the MPE in its column for accumulation
over the vertex feature subvectors, along with a tag that denotes its vertex. Due to
the irregular completion times for the CPEs, the MPE may accumulate partial sums
for several vertices at a time. A bank of psum buffers holds the partially accumulated
results: when all partial sums are accumulated for a vertex feature vector, the MPE
sends the result to the output buffer, along with the vertex ID i: this is one element of
the result of multiplying the feature vector of vertex i and W l . When all F l elements
are computed, the result is written back to DRAM.
After a CPE column processes all feature blocks for all vertices, the next pass begins.
The weights in that column are replaced with the next column of weights from W l .
To overlap computations and keep the CPEs busy, we use double-buffering to fetch
the next block of weights from the DRAM to the chip while the CPEs perform their
computations.
The Weighting computation skips zeros in the vertex feature vector. Vertex feature
vectors in the input layer have different sparsity levels (e.g., in Regions A and B of
Fig. 3.2), and this is also true of the k-subvectors. Hence, some k-subvectors are processed
rapidly (“rabbits”) while others take longer (“turtles”). This causes workload imbalance
in the CPE array.
The MPEs that accumulate the results of the CPEs must keep track of psums from
a large number of vertices, but they have only limited psum slots for accumulating
information. The rabbit/turtle disparity implies that stalls may have to be introduced to
stay within the limits of available psum memory in the MPE. As results are accumulated
in the output buffer, a larger number of vertex feature vectors must be stored within the
buffer, waiting to be completed and written to the DRAM, to account for the disparity
between rabbits and turtles.
Flexible MAC (FM) Architecture: We can avoid stalls and speed up computa-
tion with more MACs per CPE. Increasing the number of MACs per CPE uniformly
throughout the array overcomes the bottleneck of “turtles,” but is overkill for “rabbits.”
Our flexible MAC architecture uses a heterogeneous number of MAC units per CPE
in different rows of the array. The CPE array is divided into g row groups, each with
an equal number of rows; the number of MACs per CPE, $|MAC|_i$, is monotonically
nondecreasing from the first row to the last, i.e., $|MAC|_1 \leq |MAC|_2 \leq \cdots \leq |MAC|_g$.
The input buffer has a scheduler that assigns vertex feature vectors to CPE rows. The
scheduler uses information about the total nonzero workload for each k-element block of
the vertex feature vector to assign the workload to CPE rows. The workloads for the
k-element blocks are first binned based on the number of nonzeros, where the number of
bins equals the number of CPE groups. Workload binning is carried out as a preprocess-
ing step in linear time on a CPU. The bin with the fewest nonzeros is sent to the first CPE
row group, which has the fewest MACs, and so on; the bin with the most nonzeros is sent
to the last CPE row group, which has the most MACs. After workload binning, each block in a bin is
assigned an ID that denotes the CPE row to which it should be broadcast. The scheduler
receives these block IDs for the k-element blocks of each feature vector from the host
CPU. The input buffer is connected to the embedded scheduler through a port that
fetches the block ID information for each feature vector as it is sent to the RLC decoder;
the k-element feature vector blocks are then broadcast to a CPE row according to their
IDs. Since the assignment of IDs to k-element blocks is computed as
a preprocessing step, the scheduler does not require any runtime information. The total
preprocessing overheads (which include the preprocessing overheads for the linear time
binning of k-element blocks) for the four datasets used in our experiment are shown in
Table 3.3. For the Cora, Citeseer, Pubmed, and Reddit datasets, the preprocessing times
required for the binning of k-element blocks are, respectively, 5.5%, 4.8%, 3.4%, and
0.7% of the total inference time. It should also be noted that this percentage overhead
is lower for the larger datasets (Reddit (233K vertices) has a lower percentage overhead
than Cora (2.7K vertices)), indicating the scalability of the solution.
An example of workload reordering among CPE rows is shown in Fig. 3.6. The CPE
array is divided into three groups, Group 1, 2, and 3, where Group i is equipped with
$|MAC|_i$ MACs per CPE, where $|MAC|_1 < |MAC|_2 < |MAC|_3$. The vertex feature
blocks are binned into three bins that will be assigned to each group. Each bin has
several vertex feature blocks: the vertex feature blocks in the left-most bin have the
most nonzeros (six), and those of the right-most bin have the fewest nonzeros (four).
We see that the bin with the lightest workload is assigned to the group with the fewest
MACs, the next to the group with the next higher number of MACs, and so on.
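A minimal sketch of this preprocessing step is shown below. It assumes the nonzero count of each k-element block is already known and that row groups are indexed in order of increasing MACs per CPE; the equal-width binning of nonzero counts is an assumed policy used only for illustration.

```python
import numpy as np

def bin_blocks_by_workload(nnz_counts, num_groups):
    """Assign each k-element block to a CPE row group based on its nonzero count.

    nnz_counts: nonzeros per block; num_groups: CPE row groups, ordered so that
    group 0 has the fewest MACs per CPE. Lighter blocks map to groups with fewer MACs.
    """
    edges = np.linspace(nnz_counts.min(), nnz_counts.max(), num_groups + 1)
    return np.clip(np.digitize(nnz_counts, edges[1:-1]), 0, num_groups - 1)

# example: 8 blocks, 3 row groups -> [0 0 0 1 2 2 2 2]
print(bin_blocks_by_workload(np.array([0, 1, 1, 2, 4, 5, 6, 6]), 3))
```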
Load Redistribution (LR): The FM approach does not completely balance the
workload. For greater uniformity, we redistribute loads among nearby CPEs. Based on
workload distribution in CPE rows, the controller selects pairs of CPE rows to perform
workload redistribution, offloading a portion of workload from heavily loaded to lightly
loaded CPE rows.
We present a new method for reordering GAT computations for efficient hardware implementation (Fig. 3.7). We define the weighted vertex attention vector for vertex $p$ as $\eta w_p^l = h_p^{l-1} W^l$. The first step in finding the attention coefficient $\alpha_{ij}$ for neighboring vertices $i$ and $j$ is to compute
$$e_{ij} = e_{i,1} + e_{j,2} \qquad (3.3)$$
where $e_{i,1} = {a_1^l}^T \cdot \eta w_i^l$ and $e_{j,2} = {a_2^l}^T \cdot \eta w_j^l$. This goes through a LeakyReLU and then a softmax over all neighbors of $i$ to find the normalized attention coefficient,
$$\alpha_{ij} = \frac{\exp\!\big(\mathrm{LeakyReLU}(e_{ij})\big)}{\sum_{k \in \{i\} \cup N(i)} \exp\!\big(\mathrm{LeakyReLU}(e_{ik})\big)} \qquad (3.4)$$
As shown in Fig. 3.7, a naïve approach would fetch $\eta w_j^l$ from each neighbor $j$ of $i$, compute $e_{ij}$ using (3.3), and perform softmax to find $\alpha_{ij}$. However, since $e_{j,2}$ is required by every vertex for which $j$ is a neighbor (not just $i$), this would needlessly recompute its value at each neighbor of $j$. To avoid redundant calculations, we propose to reorder the computation (Fig. 3.7): for each vertex $i$, we compute
(a) $e_{i,1} = {a_1^l}^T \eta w_i^l$, used to compute $\alpha_{i*}$ at vertex $i$.
(b) $e_{i,2} = {a_2^l}^T \eta w_i^l$, used by all vertices $j$ for which $i$ is a neighbor, to compute $\alpha_{j*}$ at vertex $j$.
Since $a^l = [a_1^l \; a_2^l]$ is identical for each vertex, we calculate $e_{i,2}$ just once at vertex $i$ and transmit it to vertices $j$.
For $|V|$ vertices and $|E|$ edges, the naïve computation performs $O(|E|)$ multiplications and memory accesses (to $\eta w_i^l$) per vertex, for a total cost of $O(|V||E|)$. Our reordered computation is $O(|V| + |E|)$, with $O(|E|)$ accumulations over all vertices, i.e., latency and power are linear in graph size.
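The reordering can be summarized in a few lines of Python: the per-vertex terms are computed once in O(|V|) work and only combined per edge in O(|E|) work. The edge-list graph format, dense vectors, and the LeakyReLU slope of 0.2 are illustrative assumptions.

```python
import numpy as np

def gat_attention_reordered(eta_w, a1, a2, edges):
    """Normalized GAT attention coefficients computed with the reordered schedule.

    eta_w: (|V|, F) weighted features (eta_w_i = h_i W); a1, a2: the two halves of
    the attention vector a = [a1 a2]; edges: (i, j) pairs with j a neighbor of i
    (include self-loops here if the model aggregates over them).
    """
    e1 = eta_w @ a1            # O(|V|): e_{i,1}, used at vertex i itself
    e2 = eta_w @ a2            # O(|V|): e_{i,2}, shared with every vertex that has i as a neighbor
    num, den = {}, {}
    for i, j in edges:         # O(|E|): combine, LeakyReLU, exponentiate, accumulate
        e_ij = e1[i] + e2[j]
        w = float(np.exp(e_ij if e_ij > 0 else 0.2 * e_ij))
        num[(i, j)] = w
        den[i] = den.get(i, 0.0) + w
    return {(i, j): num[(i, j)] / den[i] for (i, j) in edges}   # softmax over the neighborhood

# toy usage: 3 vertices, 2 features per vertex
eta = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
alpha = gat_attention_reordered(eta, np.array([0.3, 0.1]), np.array([0.2, 0.4]),
                                [(0, 0), (0, 1), (1, 1), (1, 0), (2, 2), (2, 1)])
print(round(alpha[(0, 1)], 3))
```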
The last step requires edge aggregation from each neighbor of a vertex. All GNNs
perform edge-based summations followed by an activation function; for GATs, the
weights for this summation are computed using methods in the above subsections.
Typical graphs are too large for the on-chip buffers. We use a dynamic scheme
(Section 3.5) to process a subgraph of the graph at a time, processing edges in parallel
in the CPE array.
Load Distribution: The Aggregation computation brings data into the input buffer.
For each vertex in the subgraph corresponding to the vertices in the buffer, it accumulates
edge data by pairwise assignment to CPE spads.
Due to power-law behavior, the vertex degrees in the subgraph may have a large
range. To distribute the load, the Aggregation summations are divided into unit pairwise
summations and assigned to CPEs. For instance, accumulation of a sum effectively
implements an adder tree in which the number of CPEs required to process Aggregation
for each vertex depends on its degree in the subgraph. Thus, the number of CPEs
assigned for Aggregation of a vertex in a subgraph is proportional to its degree. The
degree-dependent assignment of CPEs to vertices tackles imbalance in workload that
might occur due to the power-law behavior.
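A simple proportional-allocation sketch of the degree-dependent CPE assignment is given below; the rounding rule and the minimum of one CPE per vertex are assumptions made only for illustration.

```python
def assign_cpes_by_degree(subgraph_degrees, total_cpes):
    """Assign CPEs to cached-subgraph vertices in proportion to their degree.

    subgraph_degrees: {vertex_id: degree within the cached subgraph}.
    Returns {vertex_id: CPE count}; vertices with no unprocessed edges get none.
    """
    total_deg = sum(subgraph_degrees.values())
    if total_deg == 0:
        return {}
    return {v: max(1, round(total_cpes * d / total_deg))
            for v, d in subgraph_degrees.items() if d > 0}

# example: power-law-like subgraph degrees mapped onto a 16x16 CPE array
print(assign_cpes_by_degree({1: 50, 2: 20, 3: 5, 4: 1, 5: 0}, 256))
```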
GATs: The final step in computing the attention coefficient $\alpha_{ij}$ involves the edge-based
computations of Equation (3.4): each edge from a neighbor $j$ to vertex $i$ contributes a
term to the numerator of the softmax, and one to the denominator. These computations
are parallelized in the CPEs among the incoming edges of a vertex using pull-based
aggregation [42].
The computation of the numerator in the softmax step is shown in Fig. 3.8. For a target vertex $i$ connected to a neighbor $j$ by edge $(i, j)$, $\eta w_i^l$, $e_{i,1}$, and $e_{i,2}$ are loaded into one spad of a CPE, and the corresponding data for $j$ into the other spad. For vertex $i$, the result $e_{i,1} + e_{j,2}$ is sent to the SFU to perform LeakyReLU followed by exponentiation. The output returns to the CPE and is multiplied with $\eta w_j^l$. A similar operation is performed for vertex $j$ to compute $\exp(e_{ji})\, \eta w_i^l$.
Other GNNs: The Aggregation step for GCN, GraphSAGE, GAT and GINConv
involves a sum of weighted vertex feature vectors over all neighbors j (or a sample of
neighbors for GraphSAGE) of each vertex i. This computation is similar to but simpler
than that in Fig. 3.8: just addition is performed.
As before, a subgraph of the larger graph is processed at a time. In processing vertex
i, the data for all neighbors j is processed in an adder tree, placing operands in spad1
and spad2 of a CPE, and storing the result in spad1. The partial results for a vertex
(partial sum for a general GNN, or the summed numerator and softmax denominator for
a GAT) are written to the output buffer after each edge computation. For a GAT, the
values of exp(eik ) are also added over the neighborhood to create the denominator for
the softmax. Finally, the accumulation over neighbors is divided by the denominator,
in the SFU to obtain the result. Similarly, in another round of accumulation, the
partial results of the vertices are sent from the output buffer to the CPEs to compute the
final result. When all components of the sum for vertex i are accumulated, the result is
sent through the Activation unit and written to DRAM.
3.5 Graph-Specific Caching
Figure 3.9: Example illustrating the subgraph in the input buffer (left) and its evolution after cache replacement (right).
A notable feature of our proposed caching policy is a guarantee that all random-access
patterns are confined to on-chip buffers and off-chip fetches are sequential.
As stated earlier, the adjacency matrix is stored in the CSR format. Our input is a
graph represented by three arrays: (i) the coordinate array lists the incoming/outgoing
neighbors of each vertex, (ii) the offset array contains the starting offset of each vertex
in the coordinate array, and (iii) the property array with the weighted vertex feature,
ηw li (see Section 3.4.1), for each vertex i; for GATs, this is concatenated with {ei,1 , ei,2 }.
Subgraph in the Input Buffer: Edge-mapped computations involve a graph traversal
to aggregate information from neighbors. At any time, a set of vertices resides in the
input buffer: these vertices, and the edges between them, form a subgraph of the original
graph. In each iteration, we process edges in the subgraph to perform partial Aggregation
operations (Section 3.4.3) for the vertices in the subgraph. Under our proposed caching
strategy, ultimately all edges in the graph will be processed, completing Aggregation for
all vertices.
We illustrate the concept through an example in Fig. 3.9, showing a graph with
vertices V1 through V16 . The highest degree vertices are first brought into the cache,
i.e., the input buffer: vertices V1 , V2 , and V3 of degree 5, vertices V5 and V6 of degree 2,
and then two vertices of degree 1, V4 and V7 . The subgraph, Subgraph 1, consists of
these vertices and edges E1 to E6 which connect them. After edges E1 through E6 are
processed, vertices V4 through V7 have no unprocessed edges and may be replaced in
the cache by V8 through V11 in Iteration 2. This creates Subgraph 2 (the subgraph with
edges E7 through E10), which is processed next, and so on.
Cache Replacement Policy: As vertices are replaced after computation of each
subgraph, a replacement policy is necessary. Our policy prioritizes vertices with the
most unprocessed edges for retention in the input buffer. Since such vertices appear
more frequently in the list of neighbors for other vertices in the coordinate array, this
increases the likelihood of finding both the source and destination of edges in the cache.
The policy requires inexpensive preprocessing to sort vertices in order of their degrees.
In practice, it is enough to sort vertices into bins based on their degrees, differentiating
high-degree vertices from medium-/low-degree vertices to prioritize higher-degree vertices.
After preprocessing, vertices of the input graph are stored contiguously in DRAM in
descending degree order of the bins. Ties are broken in dictionary order of vertex IDs.
The key to avoiding random-access fetches from DRAM is the preprocessing step and the
replacement policy.
We track the number of unprocessed edges, $\alpha_i$, for each vertex $i$, decrementing it as
each neighbor is processed. Initially, $\alpha_i$ is the vertex degree; when $\alpha_i = 0$, $h_i^l$ is fully
computed. Tracking $\alpha_i$ requires minimal hardware overhead (a decrementer and one
word of storage per vertex), and it enables GNNIE to maximize edge processing
in each iteration.
Fig. 3.10 illustrates our policy, managed by a cache controller using a 4-way set
associative cache. Graph vertices are stored contiguously in DRAM in descending degree
order, where vertex 1 has the highest degree. If the input buffer capacity is n vertices,
initially data (i.e., feature vector, connectivity information, αi ) for vertices 1 to n are
loaded from DRAM.
The algorithm processes each such set of vertices in the input buffer in an iteration,
decrementing $\alpha_i$ as each neighbor of vertex $i$ is processed. At the end of iteration 1
(after finishing computation on the subgraph of the first $n$ vertices), any vertex with
$\alpha_i < \gamma$, where $\gamma$ is a predefined threshold, becomes a candidate for eviction from the
cache. We replace $r$ such vertices in each iteration, chosen in dictionary order. These
vertices are replaced in the input buffer by vertices $(n+1)$ to $(n+r)$ from DRAM: these
have the next highest vertex degrees. For each evicted vertex $i$, we write back its $\alpha_i$
value to DRAM. When all vertices have been processed once, we have completed a Round.
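The following behavioral sketch models one Round of the replacement policy in plain Python. The data structures and the rule that an edge is processed only when both endpoints are cached follow the description above; the iteration order over subgraph edges is left unspecified, and nothing here models the hardware itself.

```python
def run_round(sorted_vertices, neighbors, alpha, cache_size, gamma, r):
    """One Round of degree-aware caching (behavioral sketch).

    sorted_vertices: vertex IDs in descending-degree order (the DRAM layout);
    neighbors: adjacency lists; alpha: unprocessed-edge counts, initialized to degrees;
    cache_size: vertices held by the input buffer; gamma: eviction threshold;
    r: maximum number of vertices replaced per iteration.
    """
    cache = set(sorted_vertices[:cache_size])      # initial sequential DRAM fetch
    next_ptr = cache_size
    processed = set()                              # edges already aggregated
    while True:
        # process every not-yet-processed edge whose endpoints are both cached
        for u in list(cache):
            for v in neighbors[u]:
                edge = (min(u, v), max(u, v))
                if v in cache and edge not in processed:
                    processed.add(edge)            # one partial Aggregation step
                    alpha[u] -= 1
                    alpha[v] -= 1
        # evict up to r cached vertices whose alpha fell below the threshold
        evict = sorted(w for w in cache if alpha[w] < gamma)[:r]
        if not evict or next_ptr >= len(sorted_vertices):
            break
        cache -= set(evict)
        refill = sorted_vertices[next_ptr:next_ptr + len(evict)]   # sequential refill
        cache |= set(refill)
        next_ptr += len(refill)
    return alpha, processed
```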
Similarly, the partial sums for the vertex feature vectors in the output buffer are
updated as more edges in the subgraphs are processed. Any $h_i^l$ for which all accumulations
are complete is written back to DRAM. Due to limited output buffer capacity, only
a subset of the partial vertex feature vector sums can be retained in the buffer, and the
rest must be written to off-chip DRAM. To reduce the cost of off-chip access, we use a
degree-based criterion for prioritizing writes to the output buffer vs. DRAM. As partial
Aggregation results for softmax are written to DRAM, the numerator and denominator
components for a vertex are stored nearby, for locality during future fetches.
How our policy avoids random-access DRAM fetches: Our policy makes random
accesses only to the input buffer; all DRAM fetches are sequential. In the first Round,
data is fetched from consecutive DRAM locations. In the CPE array, aggregation of
each vertex fetches the vertex feature data of its neighbors in the current subgraph in
the cache. Each vertex feature vector may be thus fetched by the CPE array multiple
times according to the graph neighborhood structure, but all such random accesses are
limited to the cache, which has much better random-access bandwidth than the off-chip
memory.
Figure 3.11: Histogram of α through various Rounds (Pubmed). The inset shows a magnified view.
Vertices evicted from the cache, with αi < γ, may be fetched again in a subsequent
Round. Even in these Rounds, data blocks are brought into cache in serial order from
DRAM: there are no random accesses from DRAM. During DRAM fetches, a cache
block is skipped if all of its vertices are fully processed. The total unprocessed edges in
a cache block is tracked through inexpensive hardware, similar to tracking αi .
The effectiveness of the approach is illustrated in Fig. 3.11, which shows the his-
togram of αi distributions in the input buffer after each Round. The initial distribution
corresponds to the power-law degree distribution, and in each successive Round, the
histogram grows flatter – with both the peak frequency and maximum α becoming lower,
thus mitigating the problems of power-law distribution. In contrast, HyGCN ignores the
power-law problem, and AWB-GCN overcomes it using high inter-PE communication.
Moreover, our approach is shown to be effective even for much more intensive GAT
computations (prior accelerators do not address GATs).
Fig. 3.12 shows the impact of γ on DRAM accesses for three datasets during Aggregation
of the first layer. For this calculation, we set the weighted feature vector size at the first
layer to 128 B. As γ increases, more vertices are evicted and may have to be brought
back to the cache, resulting in more DRAM accesses. However, if γ is too low, vertices
may not be evicted from the cache, resulting in deadlock as new vertices cannot be
brought in. In our experiments, we use a static value γ = 5, but in practice, γ may have
to be changed dynamically when deadlock arises. Deadlock detection is based on the
total number of unprocessed edges in the partition, which is monitored by a counter,
and this dynamic scheme is inexpensive in hardware.
Figure 3.12: Ablation study on γ: (a) Cora (b) Citeseer (c) Pubmed.
3.7 Evaluation
As shown in Fig. 3.13(a), the average speedups of GNNIE over PyG-CPU across
the datasets used in our experiments for GCN, GAT, GraphSAGE, GINConv, and DiffPool
are 6229×, 5894×, 625×, 22878×, and 359×, respectively. According to Fig. 3.13(b),
the average speedups of GNNIE over PyG-GPU across the datasets used for GCN,
GAT, GraphSAGE, GINConv, and DiffPool are 8.25×, 24.67×, 17.53×, 17.37×, and
21×, respectively. The speedup calculations take into account the total preprocessing
times mentioned in Table 3.3.
The speedup comes from several GNNIE optimizations: (i) The segmentation of
vertex feature vectors and their assignment in our FM architecture tackles the feature
vector sparsity challenge. (ii) Our degree-aware cache replacement policy avoids random
memory accesses to DRAM. (iii) During Weighting, distributed computation across
multiple batches enables weight reuse. Note that PyG-CPU and PyG-GPU do not
allow our dynamic caching scheme to be implemented within their purely software
based frameworks. The speedup of GNNIE on GINConv is further enhanced because
PyTorch Geometric executes Aggregation before Weighting: as described in Section 3.2,
this requires more computation than the reverse order used in GNNIE.
For the GraphSAGE speedup calculations, the neighborhood sampling time on PyG-
CPU/PyG-GPU is excessive and is excluded (for RD it is 13s whereas the execution
time is 0.35s for PyG-CPU and 0.003s for PyG-GPU), but GNNIE runtimes include
neighborhood sampling times. This results in lower speedup compared to PyG-GPU
for RD. However, the GPU is much more power-hungry than GNNIE, e.g., it requires
98.5× more energy for GraphSAGE/RD than GNNIE. GNNIE's speedup over PyG-CPU scales well:
for GCN, GAT, and GINConv, the speedups generally increase with benchmark size.
GraphSAGE bucks this trend for the above reasons, but while its sampling scheme
improves scalability, it reduces accuracy [2, 55].
On PyG-GPU, the speedups do not monotonically improve with the number of nodes.
This is because larger datasets (e.g., PB) reap greater benefit from GPU parallelization:
for these datasets, GNNIE vs. PyG-GPU speedup decreases whereas GNNIE vs. PyG-
CPU speedup increases. It is important to note that the GPU comparison is not entirely
fair to GNNIE’s lightweight accelerator with low on-chip memory, targeted to edge
applications. In contrast, this GPU has a ∼20× larger on-chip memory than GNNIE
and its power-hungry nature makes it impractical for the edge. Nevertheless, GNNIE
shows speedups over even this powerful GPU.
3.7.4 Throughput and Energy Comparisons
Table 3.4 shows the throughput for various datasets for our configuration of GNNIE. The
table shows that the throughput degrades only moderately as the graph size is increased.
The power dissipation of GNNIE is 3.9W in 32nm, lower than HyGCN (6.7W in
12nm), similar to recent CNN edge inference engines (Edge TPU, Hailo-8, InferX1).
Fig. 3.15 shows the energy breakdown for GNNIE for GAT and GCN across three
datasets, including DRAM energy required to supply the output, input, and weight
buffers. The output buffer has the most transactions with DRAM due to psum storage.
On-chip weight buffer energy is negligible and not shown.
Table 3.4: Throughput for various datasets for GNNIE.
Fig. 3.16 compares GNNIE's energy efficiency with prior works. The efficiency ranges
from 2.3×10^1 to 5.2×10^5 inferences/kJ for HyGCN and from 1.5×10^2 to 4.4×10^5
inferences/kJ for AWB-GCN. GNNIE clearly outperforms both, going from 7.4×10^3
to 6.7×10^6 inferences/kJ.
3.7.5 DRAM Access Analysis
To illustrate the efficiency of the proposed graph-specific caching scheme, we compare the
number of DRAM accesses required by GNNIE with those in the widely used 2-D graph
partitioning method (employed by GridGraph [56], HyGCN [10], Marius [57]). In 2-D
graph partitioning, vertices of the graph are divided into u equal-sized disjoint partitions
and stored in DRAM. Edges are then grouped into $u^2$ blocks that can be viewed as
a grid. In this grid, each edge block (p, q) contains the edges for which source nodes
belong to the p-th vertex partition and destination nodes belong to the q-th partition. In this
scheme, except for the self-edge blocks (e.g., edge block (p, p)), vertex partitions p and q
must be in the cache (input buffer) together at least once to process the corresponding
edge block (p, q).
If the input buffer can hold $v$ vertex partitions at a time ($u \geq v$), a lower bound on
the number of DRAM block accesses for processing the graph using 2-D partitioning
is [57]:
$$\frac{\frac{u(u-1)}{2} - \frac{v(v-1)}{2}}{v-1} \qquad (3.5)$$
To compare the caching schemes of GNNIE and 2-D graph partitioning we evaluate
the DRAM accesses required for executing Aggregation of the first layer for the Pubmed
dataset. In our experiment, we use a 512 KB input buffer, and the size of each vertex feature
vector is set to 128 B. For the 2-D partitioning scheme, we vary the number of vertex
partitions in DRAM (u) from 2 to 100 in steps of 1 and compute the corresponding
lower bound on the number of DRAM accesses for 2-D partitioning using (3.5). This
lower bound is multiplied by the size of each vertex partition in the input buffer to
determine the DRAM accesses in MB; each vertex partition's size is calculated by
dividing the input buffer size by v. In Fig. 3.17, the x-axis denotes the number of vertex
partitions in DRAM (u) and the y-axis shows the corresponding 2-D partitioning lower bound on the DRAM accesses (in MB) required to process the graph. From Fig. 3.17 we can see that the lower bound on the DRAM accesses initially decreases with the number of partitions and eventually plateaus for higher values of u. For u = 100, the lower bound is 5.59 MB.
The static caching scheme proposed in 2-D graph partitioning must go through all
the vertex pair combinations to process all the edges. Due to the power-law behavior and
sparsity of real-world graphs, not all vertices in a vertex partition are used to process the
edges of its corresponding edge blocks. However, processing of an edge block requires all
vertices of the corresponding vertex partition to be cached in this scheme. Since this
approach makes no effort to distinguish the vertices of a vertex partition that are actually needed to process the edge blocks, it incurs redundant DRAM accesses and is suboptimal in reducing DRAM traffic.
On the other hand, as shown in Fig. 3.12(c) and Fig. 3.17, for γ = 5 GNNIE requires 4.62 MB of DRAM accesses to execute the first-layer Aggregation of the Pubmed dataset. In GNNIE, the number of vertex feature vectors that are replaced after each iteration varies dynamically according to the α of the cached vertices and γ. In each iteration, GNNIE tries to maximize the number of edges being processed by retaining the vertices with a higher potential of being reused in the next iteration. Thus, by dynamically tuning the retentivity of cached vertices at each iteration to maximize their reuse, the proposed graph-specific caching scheme requires fewer DRAM accesses than the calculated lower bound for 2-D partitioning.
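To make the mechanism concrete, the following Python sketch illustrates the γ-threshold eviction idea described above. The names are illustrative only, and the actual GNNIE policy also weighs the degree-related quantity α of the cached vertices when deciding what to retain.

def select_evictions(cached, unprocessed_edges, gamma):
    # Evict cached vertices whose count of still-unprocessed edges is below gamma;
    # vertices with many unprocessed edges are retained for reuse in the next iteration.
    return [v for v in cached if unprocessed_edges[v] < gamma]

def refill_cache(cache, evicted, pending_vertices, capacity):
    # Replace evicted entries with not-yet-cached vertices so that the next
    # iteration can process as many edges as possible.
    for v in evicted:
        cache.discard(v)
    for v in pending_vertices:
        if len(cache) >= capacity:
            break
        cache.add(v)
    return cache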
Figure 3.17: DRAM accesses of GNNIE compared with the 2-D graph partitioning lower bound, as a function of the number of vertex partitions in DRAM.
Optimizing Weighting Time: We evaluate the impact of applying flexible MACs (FM) on the baseline design during Weighting. For the Cora,
Citeseer, and Pubmed datasets, the workload distribution among the CPE rows for the
baseline (without load-balancing) and FM designs are shown in Figs. 3.18(a), (b), and
(c), respectively. Due to vertex feature sparsity, the CPE rows in the baseline design
suffer from workload imbalance. The FM design smooths the workload distribution among the CPE rows, resulting in 6% (Cora), 14% (Citeseer), and 24% (Pubmed) reductions in the number of cycles required to compute 16 elements of the output vertex features
during Weighting. The imbalance between the maximum and minimum is also reduced
by FM.
For all datasets, the last four CPE rows require more cycles than others (heavily
loaded CPE rows) and the first four CPE rows finish computation earlier (lightly loaded
rows) in FM. We perform load redistribution (LR) between “LR pairs” of heavily loaded
and lightly loaded CPE rows, offloading a portion of the workload from the heavily loaded
CPE row to the lightly loaded one. The figure shows that applying LR on FM further
smooths the workload distribution, reducing the imbalance between the maximum and
minimum significantly, and also further reduces the number of cycles.
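A minimal sketch of the LR pairing idea described above: the most heavily loaded CPE rows are matched with the most lightly loaded ones, and part of the surplus work is offloaded. The offload fraction and the cycle-count model are illustrative assumptions, not the exact hardware policy.

def redistribute_load(row_cycles, num_pairs=4, offload_frac=0.5):
    # Sort rows by their Weighting cycle counts and form LR pairs
    # (lightest with heaviest, second-lightest with second-heaviest, ...).
    order = sorted(range(len(row_cycles)), key=lambda r: row_cycles[r])
    light, heavy = order[:num_pairs], list(reversed(order[-num_pairs:]))
    for lo, hi in zip(light, heavy):
        surplus = (row_cycles[hi] - row_cycles[lo]) * offload_frac
        row_cycles[hi] -= surplus   # heavily loaded row sheds part of its work
        row_cycles[lo] += surplus   # lightly loaded row absorbs it
    return row_cycles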
Cost/Benefit Ratio: We introduce a metric, the cost/benefit ratio, β, relative to the baseline design.
Figure 3.18: CPE row workload in Weighting: (a) Cora (b) Citeseer (c) Pubmed.
The percentage reduction in cycles required is measured for Weighting for various choices
of MAC counts. The additional hardware overhead is measured in terms of percentage
increase in MACs compared to the baseline design. We compute β for four designs.
These design choices are as follows: (i) 5 MACs per CPE (i.e., Design B, 1280 MACs
in all), (ii) 6 MACs per CPE (i.e., Design C, 1536 MACs in all), (iii) 7 MACs per
CPE (i.e., Design D, 1792 MACs in all), (iv) flexible MAC architecture for GNNIE,
described at the end of Section 3.7.1 (i.e., Design E, 1216 MACs in all).
Fig. 3.19 plots β on the three datasets used in our experiment for the four design
choices. As MAC units are added uniformly to the baseline design, β drops and is lowest for Design D across all datasets. β drops for Designs B, C, and D because the high sparsity
and sparsity variation among vertex features yield low speedup gains as more MACs
are added. By employing MACs among CPE rows as needed, the FM approach tackles
input vertex feature sparsity, achieving high β across all datasets.
Optimizing Aggregation Time: Our baseline design has 4 MACs/row (no FM),
no load balancing (i.e., no degree-dependent load distribution in Aggregation), and no
graph-specific caching (i.e., vertices are processed in order of ID).
We first evaluate our degree-aware graph reordering and our proposed cache replace-
ment policy (CP). We measure the execution time of the baseline during Aggregation
with and without CP. Fig. 3.20(left) shows that CP reduces Aggregation time by 11%
(Cora), 35% (Citeseer), and 80% (Pubmed). This is due to reduced random off-chip
memory accesses as more edges in a subgraph are processed under degree-aware caching.
Next, we apply CP over FM to measure their combined effect. From Fig. 3.20(left),
the added MACs in CP + FM yield gains of 17% (Cora), 39% (Citeseer), and 82%
(Pubmed).
We add our approach for load-balancing (LB) during Aggregation, using the
load distribution approach in Section 3.4.3, on top of CP+FM. The combined effect
(CP+FM+LB), shown in Fig. 3.20(left), further reduces Aggregation time cumulatively across all three datasets.
3.8 Conclusion
This chapter presents GNNIE, a versatile GNN acceleration platform for a wide range of GNNs, including GATs. GNNIE efficiently handles unstructured data, input vertex feature vector sparsity, adjacency matrix sparsity, and “power-law” vertex
degree distribution. It mitigates load balancing issues, computational bottlenecks, and
irregular/random data accesses using multiple methods: splitting the computation into
blocks to leverage sparsity; optimized caching strategies; employing a flexible MAC
architecture in the CPE array. Substantial improvements over prior work are shown.
Chapter 4
4.1 Introduction
In recent years, GNNs have achieved unprecedented success on many real-life problems
(recommender systems, IC design, embedded sensing, e-commerce, etc.). In Chapter 3,
we presented GNNIE [17], which accelerates GNN inference for small- to medium-scale graph workloads. However, a well-trained model is a prerequisite for
efficient inference. This chapter focuses on the development of a multicore GNN train-
ing accelerator for large-scale static graphs, addressing the ever-growing demand for
scalability and energy efficiency.
Energy-efficient and scalable acceleration of GNN training is an open problem that
involves several major challenges:
(i) High computation and communication costs: GNN training is more compute-intensive
than inference, especially with backpropagation, and incurs high access time and energy
costs for communication between memory and on-chip buffers;
(ii) Scalability for large graph sizes: Graph sizes in real-world datasets have grown
exponentially in recent years [58], necessitating multiple accelerator engines to work
together;
(iii) Load balancing during computation: High and variable input feature vector sparsity,
high adjacency matrix sparsity, and power-law distributions of vertex degrees result in
irregular and random memory accesses during GNN computations, with low utilization
of processing elements [10, 11, 17].
(iv) Versatility: A GNN training accelerator must be able to accommodate a wide range
of GNN architectures. These challenges also persist while performing GNN inference on
large graphs, emphasizing their relevance to both training and inference acceleration.
GPU-based solutions are energy-inefficient. GNNAdvisor [19], a single-GPU solution, is limited to small-to-medium-sized graphs. Multi-GPU platforms can handle large graphs: Roc [59] uses dynamic techniques for graph partitioning and memory man-
agement; NeuGraph [60] employs 2-D graph partitioning and inter-GPU vertex-chunk
swapping (with increased communication overhead); PaGraph [61] replicates boundary
vertices to reduce communication among partitions, but faces scalability issues due to
replica synchronization.
Several FPGA- and ASIC-based accelerators with better energy efficiency have been
proposed. Among FPGA-based approaches, GCoD [62] implements algorithm-accelerator
co-design, but requires large on-chip buffers due to scatter-based aggregation and incurs
high preprocessing overhead for sparsification and polarization; GraphACT [20] proposes
a CPU+FPGA platform, with graph sampling and loss gradient calculation offloaded to
the CPU, and forward- and back-propagation handled in the FPGA. Among ASIC-based
approaches, Rubik [21] uses a hierarchical array of processing elements; GNNear [22] uses
an ASIC-based central acceleration engine for some computations and offloads others to
near-memory processing engines that reside in the buffer chips of DIMMs. TT-GNN [63]
presents an ASIC-based software and hardware co-optimization approach that employs
vertex feature matrix compression using tensor-train representation. However, this work
is limited to GCN only. As single-core structures, these methods are not scalable for
larger graphs; they largely neglect input feature vector sparsity and power-law degree
distribution problems.
Any single-core solution has limited scalability. This chapter presents a multicore
GNN training accelerator for static graphs, moving past the limitation of single cores and
using an array of processing cores for training, offering substantial speedup and energy-
efficiency improvements. We target much larger graphs than previous ASIC/FPGA
training accelerators (we show results on datasets with up to 8.6M vertices in Section 4.6).
We believe this is the first multicore GNN training accelerator to support a wide range
of GNNs; the only other multicore accelerator [64] known to us handles inference
only and not training. As a preprocessing step in our approach, we first partition the graph into multiple clusters before assigning them to the cores. Existing multicore inference accelerators cannot handle backpropagation efficiently due to: (i) massive computation/communication overhead for the calculation/propagation of error gradients; (ii) large gradient synchronization overhead; and (iii) lack of support for various special functions, e.g., log and softmax.
For the core, we choose the GNNIE inference accelerator [17] introduced in Chapter 3
over other candidates [10–15] as it can handle sparsity in input vertex feature vectors
and adjacency matrix, support a wide range of GNN topologies (e.g., GCN, GraphSAGE,
GAT, GINConv), and shows speedup and efficiency advantages over other methods.
However, simply arraying a set of GNNIE cores leads to performance bottlenecks due
to: (i) suboptimality in GNNIE’s caching scheme in a multicore scenario; (ii) lack
of multicore-specific optimizations that consider both DRAM accesses and inter-core
communication. We develop novel techniques to address these challenges and to scale training to large graphs. Degree-Quant [65] proposes integer-
based GNN training and we leverage this in our implementation. The major contributions
of our work in this chapter are:
• A novel feature vector segmentation scheme that reduces memory accesses, and a
random-forest-based machine learning (ML) model for optimal segmentation.
• Demonstrated gains in scalability, speedup, and energy efficiency over prior GPU/FP-
GA/ASIC solutions across multiple GNN topologies.
In addition to training, we evaluate the inference runtime for large graphs on our platform. To offset the preprocessing overhead of partitioning, we consider cases where inference can be performed repeatedly with minimal changes to the graph properties (detailed discussion in Section 4.7).
Figure 4.1: Block diagram of the proposed multicore GNN training accelerator (core architecture
in inset) with 4 cores; our evaluation considers accelerators with up to 36 cores.
Forward Pass Computations. The forward pass has two steps [10, 17, 22]: (a) Weighting in layer l multiplies the feature vector h_i^l (dimension F^l) of each vertex i by a weight matrix W^l (dimension F^{l−1} × F^l). (b) Aggregation for vertex i combines (sum/max/mean/pool) the weighted feature vectors in a set N_i. For GCN/GAT/GINConv, N_i is the set of neighbors N(i) of i; for GraphSAGE, N_i randomly samples N(i).
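As a concrete (dense) illustration of these two steps, the sketch below applies Weighting followed by sum-Aggregation for one layer; a real accelerator operates on sparse, blocked data, and the GraphSAGE variant would sample each neighbor list first. The function and variable names are illustrative.

import numpy as np

def gnn_layer_forward(H, W, neighbors):
    # H: (N, F_in) vertex features; W: (F_in, F_out) weight matrix;
    # neighbors[i]: iterable giving the set N_i for vertex i.
    Z = H @ W                                    # Weighting
    H_out = np.zeros((H.shape[0], W.shape[1]))
    for i, nbrs in enumerate(neighbors):         # Aggregation (sum) over N_i
        idx = list(nbrs)
        if idx:
            H_out[i] = Z[idx].sum(axis=0)
    return H_out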
Backward Pass Computations. The output node features of the forward pass are
compared against the ground truth to compute the loss function. Then, starting from
the last layer, the gradients of the loss with respect to the feature vectors and weights
are calculated, and weight updates are performed at each layer using the chain rule
until the input layer is reached. Backward pass computations consist of Weighting and
Aggregation steps similar to the forward pass, and MAC operations for loss computations
and gradient updates.
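For the Weighting step, the corresponding backward-pass computations reduce to two matrix products and a weight update, as the sketch below shows (dense math with an illustrative learning rate). The Aggregation gradients scatter the upstream gradient back to the neighbors in the same pattern used by the forward pass.

import numpy as np

def weighting_backward(H_in, W, dZ, lr=1e-3):
    # dZ: gradient of the loss w.r.t. Z = H_in @ W for this layer.
    dW = H_in.T @ dZ         # gradient w.r.t. the weights (chain rule)
    dH_in = dZ @ W.T         # gradient propagated to the previous layer
    W_updated = W - lr * dW  # weight update
    return dH_in, W_updated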
Figure 4.2: (a) Boosting γ_intra to break intra-cluster stagnation on Core 2. (b) Invoking full random access after most edges are processed on all cores.
Stagnation arises when the number of cached vertices that meet the eviction criterion based on γ_intra (γ_inter) is small, as the changes in the computational subgraph across iterations are minor. This results in low computation and low PE utilization per iteration.
We define the metric e_intra[i] (e_inter[i]) as the ratio of the number of intra-cluster (inter-cluster) edges processed up to iteration i to the total number of intra-cluster (inter-cluster) edges of the cluster associated with the core. After a detection interval of every I iterations, we detect stagnation as:
Figure 4.4: Performance analysis of feature vector segmentation: (a) e_total (average) vs. execution cycles; (b) Aggregation cycle comparison.
4.6 Evaluation
Hardware/Simulation Setup. Each core is implemented in Verilog, synthesized
with Synopsys DC in a 12nm standard VT library, placed and routed using Innovus,
and verified via RTL simulations. The area, energy, and latency of on-chip buffers
are estimated using CACTI 6.5 [68]. Post-P&R metrics for each core are: 4.97mm2 ,
0.93W, 934 MHz. The controller has 0.26 mm2 area and 0.1W power. For the NoC,
latency and throughput were analyzed using BookSim2 [69], and power and area using
Orion3.0 [70]. The NoC power overhead ranges between 2.9%–6.3% of the total chip
power. An in-house simulator computes the execution cycles for our accelerator, with
Ramulator [53] modeling off-chip HBM access (256 GB/s, 3.97pJ/bit [54]).
Configuration of the Multicore Accelerator.
Individual GNNIE cores: The configuration per core is as follows:
Buffer sizes: Output: 1MB; Weight: 128KB; Input: 512KB
CPE array with flexible MACs: 16 × 16 array; 4 MACs (rows 1–8), 5 MACs (rows 9–12),
6 MACs (rows 13–16).
NoC Buffer size: 128 KB, 4 links per router, 50GB/s BW/link.
Number of GNNIE cores: The number of cores for a dataset is based on the ratio, ϑ, of
vertices per computational subgraph (i.e., the full-length vertex features that can fit in
cache) to the vertices assigned per core. Empirically, we determined that its optimal
range is 0.03 ≤ ϑ ≤ 0.15. Using this, we find the number of cores m (see Tables 4.1
and 4.2) for the optimal ϑ that optimizes speedup gain vs. area/power overhead.
We analyze the change in speedup when the number of cores is altered from the
optimal m. For the A-06 dataset, m = 4; for 2, 16, and 36 cores, the speedup changes
by 0.43×, 3.1×, and 7.29×, respectively. In each case, the speedup change is sublinear,
indicating that m = 4 is optimal.
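A minimal sketch of this selection heuristic, assuming the smallest core count whose ϑ falls in the empirical range is chosen; the candidate core counts and the subgraph size are illustrative inputs rather than the exact design-space exploration used in our evaluation.

def choose_num_cores(total_vertices, subgraph_vertices,
                     candidates=(1, 2, 4, 16, 36), lo=0.03, hi=0.15):
    # theta = (vertices per computational subgraph) / (vertices assigned per core)
    for m in sorted(candidates):
        theta = subgraph_vertices / (total_vertices / m)
        if lo <= theta <= hi:
            return m                 # smallest m whose theta lands in the empirical range
    return max(candidates)           # fall back to the largest configuration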
Benchmark GNN Datasets and Models. We evaluate the performance of our
platform using Type A and Type B benchmark graph datasets from Table 4.1 and 4.2,
respectively. Type A datasets consist of multiple small graphs with no inter-graph edges,
while Type B datasets are large monolithic graphs with a high amount of structural
irregularity, i.e., higher adjacency matrix sparsity and power-law behavior. Table 4.1
and 4.2 also provide the input feature length (FL), number of cores (m), and feature
vector segments (j) used for each dataset.
We evaluate the accelerator for training four GNN models: GCN, GINConv, GAT,
and GraphSAGE. All GNNs have one hidden layer, except GINConv which has five;
for GCN, GINConv, and GraphSAGE each hidden layer has 16, 64, and 256 channels,
respectively. The GAT hidden layer uses eight 16-dimensional attention heads. All
speedup and energy numbers include preprocessing times, i.e., the runtime for graph partitioning, degree-based vertex reordering, workload reordering, and neighborhood sampling (performed on an Intel Xeon Gold@2.60GHz CPU) for GraphSAGE. The
preprocessing overhead over 500 epochs for amazon0601 is 18%.
Performance comparison with DGL. We compare all GNNs against Deep Graph
Library (DGL) [71] on a V100 Tesla GPU with V100S-PCIe@1.25GHz, 32GB HBM2
(“DGL+Tesla V100”).

Table 4.1: Type A datasets (DD: D&D, TW: TWITTER-Partial, YT: Yeast, SW: SW-620H, OV: OVCAR-8H) for GNN training
Dataset  Vertices  Edges  (FL, m, j)
DD       335K      1.7M   (89, 2, 4)
TW       581K      1.4M   (1323, 4, 2)
YT       1.7M      3.6M   (74, 16, 2)
SW       1.9M      3.9M   (66, 16, 2)
OV       1.9M      3.9M   (66, 16, 2)

Table 4.2: Type B datasets (SB: soc-BlogCatalog, CA: com-amazon, A-05: amazon0505, A-06: amazon0601, EN: enwiki, A-8M: amazon8M) for GNN training
Dataset  Vertices  Edges   (FL, m, j)
SB       89K       2.1M    (128, 1, 2)
CA       335K      1.9M    (96, 2, 4)
A-05     410K      4.9M    (96, 4, 4)
A-06     403K      3.4M    (96, 4, 4)
EN       3.6M      276.1M  (300, 16, 16)
A-8M     8.6M      231.6M  (96, 36, 16)

The training latencies used for speedup comparison are averaged over
500 epochs. As shown in Fig. 4.5(a) and (b), the average speedup of our approach against
DGL+Tesla V100 for GCN, GINConv, GAT, GraphSAGE ranges from 8.9×–46.6×
across Type A datasets and 3.3×–15.5× for Type B.
The speedup comes from several of our optimizations: (i) Feature vector segmentation
improves scalability for large GNNs. (ii) Dynamic cache replacement mitigates irregular
random memory accesses and on-chip communication overhead. (iii) Distributed compu-
tation across multiple batches ensures weight reuse. The speedup is particularly high for
GINConv: unlike DGL, we use dimension-aware stage reordering (DASR) [11, 17], which
requires fewer computations. To determine their impact, we removed these optimizations
successively on A-06. Without segmentation, the computation did not complete (as in
Fig. 4.4). With optimal segmentation, removing dynamic cache replacement increases
runtime by 34%; also removing weight reuse raises the penalty to 43%.
GraphSAGE shows lower speedup than other models due to: (i) inclusion of preprocessing time for neighborhood sampling on our platform, but not on DGL+Tesla V100; and (ii) mitigation of power-law behavior in real-world graphs by sampling. Type A datasets
have higher speedups than Type B datasets due to the lack of on-chip communication
overheads. Larger datasets (e.g., OV, A-06) show higher speedups than smaller datasets
(e.g., DD, SB) for both Type A and B, indicating scalability.
Comparison with GPU-based accelerators. Speedup: GNNAdvisor implements
only GCN and GINConv. For the same configurations for these GNNs, Fig. 4.5(a) and
(b) show that, relative to GNNAdvisor, we achieve 15.5×–27.9× speedup for Type A
and 4.2×–9.2× for Type B datasets.
NeuGraph uses 2-D graph partitioning to process large graphs using one NVIDIA
Tesla P100 GPU. We achieve 12.2× and 16.9× speedup for GCN on EN and A-8M,
respectively, over NeuGraph. The corresponding speedups over GNNAdvisor are 3.1×
and 6.8×, respectively.
Figure 4.5: Speedup and energy efficiency of the proposed multicore GNN training accelerator vs. DGL+Tesla V100 and GNNAdvisor+Tesla V100: (a), (c): Type A datasets; (b), (d): Type B datasets.
Energy: Fig. 4.5(c) and (d) illustrate the energy efficiency comparison with Tesla V100, reporting E_gain, the ratio of the energy required by the GPU to the energy of our approach. Compared to DGL+Tesla V100, our average E_gain ranges from 149×–711× over Type A datasets and 75×–628× over Type B. Against GNNAdvisor+Tesla V100, E_gain ranges from 168×–415× and 118×–372×, respectively.
Comparison with FPGA-/ASIC-based accelerators. Our approach achieves an
average speedup of 11× and 24× over Rubik and GraphACT, respectively; neither
reports absolute power numbers. Our speedup over Rubik is due to its inefficient reuse of cache data, which incurs high on-chip and off-chip communication costs; our speedup over GraphACT arises because it does not consider the power-law behavior of real-world graphs and makes no explicit effort to address random off-chip memory accesses. In comparison
with GNNear, we achieve 17× average speedup over DGL+Tesla V100, but the speedup
of GNNear is only 2.5×. Unlike our approach, the graph partitioner of GNNear is
oblivious to community structure in real-world graphs, resulting in high communication
costs due to the high number of cut edges between the partitions. GCoD handles only
small graphs (up to 233K vertices, as against 8.6M vertices for our approach), and uses a
whopping 180W of power even for these graphs, which can be handled by our approach
on a single core using < 1W.
Figure 4.6: Inference speedup and energy efficiency of the proposed multicore GNN training accelerator vs. DGL+Tesla V100 and GNNAdvisor+Tesla V100: (a), (c): Type A datasets; (b), (d): Type B datasets.
4.8 Conclusion
This chapter proposes a multicore GNN training accelerator with GPU-like scalability and accelerator-like efficiency for large GNNs. It leverages novel feature vector segmentation and dynamic caching schemes to improve scalability and mitigate communication costs. Our evaluation demonstrates substantial speedup and energy-efficiency improvements over prior approaches.
Chapter 5
5.1 Introduction
Chapters 3 and 4 focused on the acceleration of inference and training of GNNs on static
graphs, where the vertex features and graph topology remain unchanged over time.
However, real-world scenarios, such as financial transactions, social media interactions,
and molecular biology processes, often exhibit dynamic graph structures in which the graph topology and node features evolve over time; the above GNNs are not suited for these applications.
Dynamic graphs can be modeled as a series of snapshots to describe the change
of vertex features and graph topology at regular intervals; such models are referred
to as discrete-time dynamic graphs (DTDG) [72]. The DTDG model has numerous
applications across a large variety of domains. Its discrete representation of temporal
changes provides a versatile framework for understanding and analyzing dynamic systems,
e.g., pandemic forecasting, social network analytics, and traffic prediction. A dynamic
graph neural network (DGNN) is a special kind of neural network that operates on
dynamic graphs and involves two computational kernels: (a) the GNN, which captures
structural information, and (b) the recurrent neural network (RNN), which captures
temporal information.
Due to the growing significance of edge applications, where processing occurs closer to
the data source, it is imperative to implement techniques that efficiently perform inference
on dynamic graphs. This motivates the need to build dedicated hardware accelerators
for DGNNs, which is the subject of this chapter. The unique challenges of DGNNs,
characterized by dynamic irregularity, inter-vertex locality, and sequential dependence of
the RNN kernel, require a novel approach for efficient performance enhancement.
The challenges inherent in developing an accelerator for DGNNs can be distilled
into several key aspects: (C1) Real-world benchmark DTDG datasets exhibit minimal
variation between consecutive snapshots. This presents a unique opportunity for inter-
snapshot computation reuse, which must be effectively leveraged [73]. Beyond the
irregular memory access overhead seen in GNN engines for static graphs, resulting
from the sparsity of graph snapshots, an additional memory inefficiency stems from
the time-dependent variation of vertex features [74]. (C2) Unlike traditional GNNs, DGNNs include time-dependent RNN kernels, which introduce a bottleneck arising from the sequential dependence between two consecutive snapshots. (C3) The
batch size of the RNN kernel can be orders of magnitude higher than that of conventional
RNN inference tasks for text and speech [75]; this offers an opportunity to minimize excessive memory accesses by reusing weight parameters, which must be exploited for computational efficiency. Addressing all of these challenges is essential for an efficient
DGNN accelerator.
While GPU-based solutions have been proposed for training on DGNNs, they are
energy-inefficient. ESDG [76], a multi-GPU platform, proposes a graph difference method
to reduce data transfer overhead between snapshots for DGNN training. This method
overlooks the overlap between contiguous snapshots in dynamic graphs (C1), resulting
in the recomputation of all graph data for each snapshot and subsequent performance
degradation. ESDG does not implement any explicit mechanism to tackle random
memory access issues for the GNN kernel (C1). PiPAD [74] proposes overlap-aware data organization and data transfer for DGNN training on a single GPU, and addresses random memory accesses only for the overlapping part of consecutive snapshots (C1). It does not
address the bottleneck due to inter-snapshot sequential dependence of the RNN kernel
(C2). Neither approach accounts for the weight reuse opportunity for optimizing RNN
computations (C3).
To address energy efficiency concerns of GPU-based solutions, various ASIC- and
FPGA-based inference accelerators have been proposed. Cambricon [77], an ASIC-
based accelerator, introduces a cuboid-based processing architecture that supports the
addition/deletion of edges and fine-grained data transfers to avoid unnecessary snapshot
updates. However, it primarily focuses on changes in graph topology between snapshots
and does not consider the changes in the features of the vertices (C1). DGNN-Booster [78],
a CPU+FPGA DGNN inference accelerator, uses a message-passing mechanism for the
GNN kernel, but neglects overlaps between snapshots (C1). ReaDy [79], a ReRAM-
based DGNN inference accelerator, implements redundancy-free data scheduling and
inter-kernel pipelining (C2) to enhance efficiency, but it overlooks the overlap between
snapshots (C1). ReFlip [80] also proposes a ReRAM-based DGNN accelerator, but it does not address the sequential bottleneck imposed by the RNN kernel (C2) and remains oblivious to the data reuse opportunity offered by the overlap between snapshots (C1). Neither
Cambricon [77] nor DGNN-Booster [78] capitalizes on weight reuse in the RNN kernel
(C3).
In this chapter we propose an integrated platform for DGNN inference acceleration,
handling both GNN and RNN computations on the same hardware and overcoming the
limitations of prior works. We address Challenges C1–C3 through the following key
contributions:
• Challenge C2: We develop an efficient pipelining mechanism for the GNN and RNN
engines to ensure seamless computation without bottlenecks or stalls.
• Challenge C3: We employ weight coalescing to maximize weight reuse and reduce
off-chip communication for the RNN kernel.
Finally, our platform is versatile in handling a wide variety of dynamic GNNs, including
those employing self-attention mechanisms for temporal layer processing. This flexibility makes our platform applicable across diverse dynamic graph scenarios; prior
frameworks [74, 76–79] are limited to specific temporal layers.
5.2 Background
In the discrete-time representation, dynamic graphs are modeled as a set of graphs, DG = {G^1, G^2, ..., G^T}, where T denotes the total number of snapshots, and the graph G^k = {V^k, E^k} represents the snapshot with vertices V^k and edges E^k at timestamp k. This representation enables the utilization of traditional static GNNs for spatial information encoding and RNNs for temporal information.
During the kth iteration of the DGNN, the GNN kernel computes the updated feature vectors, i.e., Y^k = {y_1^k, y_2^k, ..., y_v^k, ..., y_{N^k}^k}, for all N^k vertices in the snapshot G^k.
The LSTM computation requires the following matrix-vector multiplications: (i) between the updated vertex feature vector y_v^k and the four input weight matrices W_x, x ∈ {i, f, o, c}; (ii) between the hidden state vector h_v^{k−1} and the four hidden weight matrices U_x, x ∈ {i, f, o, c}. The input gate, forget gate, output gate, and cell state feature of vertex v at timestamp k are represented by i_v^k, f_v^k, o_v^k, and c_v^k, respectively. There are no intra-snapshot dependencies among the eight matrix multiplications, but inter-snapshot dependence stems from the sequential dependence on the hidden state vector at timestamp k−1. In addition, the LSTM involves element-wise additions (+) and products (⊙), and
activation functions, e.g., sigmoid and tanh. Some DGNN models use GRU for capturing
temporal information, which is similar to LSTM and uses a gated mechanism. GRU is
less compute-intensive than LSTM due to its simpler architecture and fewer parameters,
but may struggle to capture complex long-term dependencies since it lacks the LSTM's forget gate.
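For concreteness, the per-vertex LSTM update can be sketched as below, with the four W_x MVMs grouped separately from the four U_x MVMs (mirroring the WY and UH units described later in this chapter). The bias terms are standard LSTM parameters assumed here; Equation (5.2) gives the exact formulation used in our design.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(y_k, h_prev, c_prev, W, U, b):
    # W, U, b are dicts keyed by gate name in {'i', 'f', 'o', 'c'}.
    wy = {x: W[x] @ y_k for x in 'ifoc'}      # MVMs against the GNN output (no intra-snapshot dependence)
    uh = {x: U[x] @ h_prev for x in 'ifoc'}   # MVMs against the previous hidden state
    i = sigmoid(wy['i'] + uh['i'] + b['i'])   # input gate i_v^k
    f = sigmoid(wy['f'] + uh['f'] + b['f'])   # forget gate f_v^k
    o = sigmoid(wy['o'] + uh['o'] + b['o'])   # output gate o_v^k
    c = f * c_prev + i * np.tanh(wy['c'] + uh['c'] + b['c'])  # cell state c_v^k
    h = o * np.tanh(c)                        # hidden state h_v^k, reused at timestamp k+1
    return h, c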
ASTGCN [84] implements a spatial-temporal attention mechanism for capturing
dynamic spatial and temporal correlations of DTDGs. This requires computation of two
kinds of attention, i.e., spatial attention and temporal attention. The spatial attention computation is described by the following set of equations:
S = V_s · σ((χ_h^{r−1} W_1) W_2 (W_3 χ_h^{r−1})^T + b_s);   S′ = softmax(S)    (5.3)
where χ_h^{r−1} is the input feature tensor that is fed to the rth spatial-temporal block; V_s, b_s, W_1, W_2, and W_3 are learnable parameters; and S′ is the normalized version of the
spatial attention matrix S. While performing graph convolutions, S′ is applied together with the adjacency matrix to dynamically adjust the weights of the edges.
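A minimal NumPy sketch of (5.3), assuming the usual ASTGCN shape conventions (χ_h^{r−1} of shape N × C × T, W_1 ∈ R^T, W_2 ∈ R^{C×T}, W_3 ∈ R^C, and V_s, b_s ∈ R^{N×N}); these shapes, and the interpretation of the product with V_s as a matrix product, are assumptions, since the chapter does not spell them out.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax_rows(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def spatial_attention(X, Vs, W1, W2, W3, bs):
    # X: (N, C, T) input tensor chi_h^{r-1}.
    lhs = np.einsum('nct,t->nc', X, W1) @ W2   # (N, T)
    rhs = np.einsum('c,nct->nt', W3, X)        # (N, T)
    S = Vs @ sigmoid(lhs @ rhs.T + bs)         # spatial attention matrix S, eq. (5.3)
    return softmax_rows(S)                     # row-normalized attention S'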
Similar to spatial attention, the computation of temporal attention is described by
the following equations:
Here, ⊛ and ⊗ denote the standard and graph convolution operations, respectively; ϕ is the parameter of the temporal convolution kernel; and g_θ represents the graph convolution filter.
GNN Engine. We use GNNIE (Chapter 3) as the GNN engine because: (i) it handles input feature vector and adjacency matrix sparsity through efficient load balancing and graph-specific caching; (ii) its versatile platform accommodates a broad range of GNN topologies (GCN,
GraphSAGE, GAT, GINConv) and supports functions not present in alternative engines,
such as the softmax over neighborhood, a requirement for GATs; (iii) GNNIE shows
notable speedup, lower power consumption, and reduced area over competing methods.
Pipeline Buffer. This shared buffer between the GNN and RNN engine is used for
inter-engine pipelining. It caches the GNN engine outputs before relaying them to the
RNN engine.
RNN Engine. As shown in Equation (5.2), RNN operations are predominantly MVMs
that primarily involve interactions among the hidden feature vector emanating from
the RNN layer of the preceding timestamp, the output feature vector of the GNN
layer at the current timestamp, and the corresponding weight matrices. The RNN
engine is composed of two separate units: (i) the WY unit performs the matrix-vector
multiplication between the updated vertex feature vector yvk and four input weight
matrices Wx , x ∈ {i, f, o, c}; (ii) The UH unit is responsible for the matrix-vector
multiplication between the hidden state vectors hk−1
v and four hidden weight matrices
Ux , x ∈ {i, f, o, c}. The final operations in the RNN kernel require the use of nonlinear
functions (e.g., softmax, tanh), critical for capturing complex dependencies in sequential
data. In our architecture, these nonlinear functions are integrated into the processing
elements (PE) array of the UH unit through the special function units (SFU), which use
look-up tables to realize these functions. Collectively, the weight-stationary dataflow and
the integration of SFU units lead to high computational efficiency. Each computation PE
(CPE) in the PE arrays has two scratchpads (spads) and multiply-accumulate (MAC)
units; merge PEs (MPEs) aggregate partial results sent from each CPE column. The
RNN Weight Buffers 1 and 2 hold the weights for the WY and UH units, respectively.
HBM DRAM. The high-bandwidth memory (HBM) DRAM stores information about
the graph. To store the dynamic graph we use the temporal compressed sparse row
(T-CSR) [85] format. Unlike the traditional CSR format used to store static graphs,
T-CSR uses an additional time array that indicates the timestamp of incoming/outgoing
edges for each vertex. Moreover, RLC encoding is used for sparse input feature vectors.
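A minimal sketch of a T-CSR-style layout following the description above: standard CSR row pointers and column indices extended with a parallel time array. The exact field layout of [85] may differ, so this is illustrative only.

class TCSR:
    def __init__(self, row_ptr, col_idx, edge_time):
        self.row_ptr = row_ptr      # length N+1: offset of each vertex's edge list
        self.col_idx = col_idx      # length E: neighbor vertex IDs
        self.edge_time = edge_time  # length E: timestamp of each edge, parallel to col_idx

    def neighbors_at(self, v, k):
        # Neighbors of vertex v whose incident edges carry timestamp k.
        s, e = self.row_ptr[v], self.row_ptr[v + 1]
        return [self.col_idx[j] for j in range(s, e) if self.edge_time[j] == k]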
Output Buffer: This buffer stores the output of WY and UH units and the updated
state vectors before sending them to the DRAM. The memory access scheduler manages
off-chip memory requests from the input/pipeline/output/GNN weight/RNN weight
buffers.
Since most new data goes through the input buffer of the GNN engine, we refer to it as
the cache. Processing of each graph snapshot begins with the retrieval of vertex features
and their adjacency information from memory to the cache. While caching the data into the input buffer, we must account for the random memory access overheads that stem from
the irregularity of the adjacency matrix of real-world graphs. On the other hand, the
output of the GNN engine is cached into the pipeline buffer before being consumed by the RNN engine, which performs compute-intensive operations with regular data access
patterns for a snapshot. However, while caching data into the pipeline buffer we must
account for the inter-snapshot dependence introduced by the RNN kernel.
In our proposed approach, we process two consecutive groups at a time. For instance,
we process Groups j and j+1 together, then process Groups j+2 and j+3, and
so on. Due to the inter-snapshot sequential dependence of the RNN kernel, we process
the snapshot at timestamp k before that at timestamp k+1, processing pairs of groups
at a time. In the example above, as we process Groups j and j+1 (snapshots k+1
through k+7) we process snapshot k before snapshot k+1, and so on. While processing
each snapshot we fetch the corresponding data (vertex features and their adjacency
information) from DRAM to the input buffer, i.e., the cache.
As discussed in Section 2.2 the computation of the GNN kernel involves two major
steps: (i) Weighting multiplies the feature vector of each vertex with a weight matrix. This step is compute-intensive. (ii) Aggregation involves consolidation of information over the neighborhood of each vertex. This requires extensive interaction with the graph adjacency matrix. However, the ultra-high sparsity of the adjacency matrix of real-world graphs leads to a large number of irregular and random memory accesses [17], and hence performance degradation. In addition, the limited capacity of the input and pipeline
buffers further exacerbates this issue due to frequent off-chip communication. Hence it
is essential to employ caching techniques that can minimize the communication between
DRAM and on-chip buffers.
To minimize DRAM communication overhead and maximize the reuse of cached
data, we first cache the vertices in the overlapped part for a pair of groups. We also
concatenate the features of vertices in the overlapped part across the timestamps in
which they appear as we fetch them from memory. This is advantageous because with a
single fetch of the overlapped vertices we can reuse them across multiple timestamps as
we process each pair of groups.
For instance, as shown in Fig. 5.2, the edge between vertex A and I appears across
timestamps k, k+1, k+4, k+6, and k+7, i.e., vertex A and I belong to the overlapped
part. Thus, we concatenate the features of vertices A and I across timestamps k, k+1, k+4, k+6, and k+7 as we fetch them to the cache. After caching the vertices in the overlapped part, we cache the vertices in the exclusive parts. During this process, vertices at timestamp k are cached before those at timestamp k+1. For instance, vertex C at timestamp k is cached before vertex B at timestamp k+1 and vertex D at timestamp k+2.
If the cache is unable to hold all the vertex data for a timestamp, cache replacement
is required after processing all the edges that connect the cached vertices. Similar to the proposed caching mechanism of GNNIE in Section 3.5, during cache replacement, we
replace a vertex if the number of unprocessed edges of the vertex for that timestamp is
below a user-defined threshold, γ. In such a scenario, we may need multiple iterations
to process all edges of a timestamp, and an evicted vertex may be fetched into the cache
in subsequent iterations. The proposed cache replacement scheme aims at retaining the cached vertices with higher reuse potential than others, which leads to reduced DRAM communication overhead. To address the limited capacity of the pipeline buffer,
we prioritize the updated vertex features at timestamp k over timestamp k+1 to be
written to the pipeline buffer vs. DRAM.
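The caching order described above can be sketched as follows, assuming the overlapped part consists of the vertices that appear in more than one snapshot of the group pair; the names and data structures are illustrative only.

def build_fetch_order(timestamps, vertices_at, features):
    # Count in how many snapshots of the group pair each vertex appears.
    counts = {}
    for k in timestamps:
        for v in vertices_at[k]:
            counts[v] = counts.get(v, 0) + 1
    overlapped = {v for v, c in counts.items() if c > 1}

    fetch = []
    # Overlapped vertices first: one fetch with features concatenated across the
    # timestamps in which they appear, enabling reuse across multiple snapshots.
    for v in sorted(overlapped):
        concat = [features[k][v] for k in timestamps if v in vertices_at[k]]
        fetch.append((v, concat))
    # Exclusive vertices next, earliest timestamp first.
    for k in timestamps:
        for v in vertices_at[k]:
            if v not in overlapped:
                fetch.append((v, [features[k][v]]))
    return fetch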
RNN operations entail significant inter-snapshot dependencies (Section 5.2), but intra-
snapshot computations of the RNN kernel are not subject to such restrictions. This opens up ample intra-snapshot data reuse opportunities that can be leveraged: we propose a weight-stationary dataflow with weight coalescing, reducing unnecessary data movement.
For a given snapshot, the RNN engine takes the output of the GNN engine as its input
(Fig. 5.2). Employing inter-engine pipelining is crucial for maximizing parallelism and
minimizing data movement. Nevertheless, during Aggregation, the irregular structure of
real-world graphs leads to varying workloads for different vertices in the GCN kernel.
In contrast, the RNN kernel exhibits a uniform execution pattern for all vertices. This
workload disparity introduces potential pipeline stalls between the two kernels. To
address this, we use a pipeline buffer to cache vertex features received at different times
from the GNN kernel before relaying them to the RNN kernel. This decoupling of GNN
and RNN execution enables the design of an inter-engine pipeline that enhances overall
efficiency.
The LSTM-based RNN kernel operations listed in Equation (5.2) involve two types
of independent MVM operations: (i) between the updated vertex features Y^k and the four input weight matrices W_x, x ∈ {i, f, o, c}; and (ii) between the hidden state vectors H^{k−1} and the four hidden weight matrices U_x, x ∈ {i, f, o, c}. However, the inter-snapshot sequential dependence introduced by the RNN kernel (Section 5.2) limits parallelism and imposes a performance bottleneck. To address this issue, we schedule the first type of MVM
operations to the WY unit and the latter to the UH unit to aid parallelism (Fig. 5.4)
and ensure streamlined processing, while keeping both units busy.
Fig. 5.4 illustrates our dataflow for the WY and UH units. At timestamp k, the updated feature vector y_v^k from the GNN is sent to the WY unit from the pipeline buffer to compute the MVMs W_x y_v^k, x ∈ {i, f, o, c}. The hidden vector from the previous timestamp, h_v^{k−1}, which was forwarded to the pipeline buffer, is used to compute U_x h_v^{k−1}, x ∈ {i, f, o, c}, in parallel in the UH unit. Finally, the computation of h_v^k is performed in the UH unit using the results of the two sets of MVMs as well as nonlinear computations (softmax, tanh) and element-wise multiplications (Equation (5.2)). This result is forwarded to the pipeline buffer for the UH computation in the next timestamp (this forwarding path is shown in Fig. 5.1). Thus, our approach keeps the WY and UH units busy.
5.6 Evaluation
Hardware/Simulation Setup. The accelerator is implemented in Verilog, synthesized
with Synopsys DC in a 12nm standard VT library, placed and routed using Innovus,
and verified via RTL simulations. The area, energy, and latency of on-chip buffers are
estimated using CACTI 6.5 [68]. The post-P&R area, power, and frequency are
10.1 mm2 , 1.92 W, and 934 MHz, respectively. Our in-house simulator calculates the
execution cycles for our accelerator, utilizing Ramulator [53] to simulate off-chip HBM
access. This access is characterized by a data transfer rate of 256 GB/s and energy
consumption of 3.97 pJ per bit [54].
Configuration of our Proposed Accelerator. The GNN input buffer is 512 KB; the
pipeline buffer between the GNN/RNN engines and the output buffer are each 1 MB; we
use 16×16 PE arrays for the GNN engine and the WY, UH units of the RNN engine.
Number of timestamps per group: Empirically, we select the number of snapshots per
group m = 2 for the datasets used in our experiment. We analyze the change in speedup
when the number of snapshots in a group is altered from the optimal m. As shown in
Fig. 5.5, the normalized speedup decreases for the HepTh (HT), Epinions (EP), and Flickr (FK) datasets as we deviate from the optimal m. This is because a smaller group
size results in underutilization of cache and a smaller number of overlapped vertices. On
the other hand, a larger group size results in a higher number of overlapped vertices
(Section 5.4.2), but this leads to higher off-chip communication overhead due to frequent
cache replacements.
Table 5.1: Dataset information for DGNN inference
5.7 Conclusion
In this chapter we propose a unified engine for the efficient acceleration of discrete-
time dynamic graph neural networks. Key contributions include a holistic approach
for handling both GNN and RNN components, optimized cache reuse strategies, a
novel caching policy, and an efficient pipelining mechanism. The proposed platform
demonstrates exceptional versatility, capable of accommodating diverse dynamic GNNs.
Evaluation on benchmark datasets and models reveals substantial speedup and energy improvements, positioning the platform as a promising solution for edge applications.
Chapter 6
Thesis Conclusion
This thesis addresses the critical need for hardware accelerators tailored to the unique
requirements of GNNs. The presentation navigates through the challenges posed by
the traditional approaches and proposes innovative solutions to accelerate inference and
training on graph-structured data. First, we propose GNNIE, a versatile GNN inference
accelerator, which tackles the challenges of handling highly sparse input feature vectors
and adjacency matrices, load balancing during computation, and irregular memory
access patterns. By employing novel techniques such as feature vector segmentation,
load-balanced Aggregation, and lightweight graph-specific caching, GNNIE achieves
remarkable speedups and energy efficiency improvements over existing methods. The next
work extends the focus to GNN training acceleration, recognizing the escalating demands
for scalability and efficiency in handling large-scale graph datasets. By leveraging
multicore architectures and novel caching strategies, this work overcomes challenges
related to high computation and communication costs, load balancing, and versatility
in supporting various GNN architectures. The proposed platform demonstrates GPU-
like scalability and energy efficiency, positioning it as a promising solution for large
GNN training tasks. Finally, we address the emerging need for efficient inference on
dynamic graph structures. By proposing an integrated platform for DGNN inference
acceleration, it tackles challenges related to overlap-aware caching, efficient pipelining of
GNN and RNN components, and weight coalescing for maximizing reuse and reducing
off-chip communication. The exceptional versatility and performance enhancements of
the proposed platform make it suitable for a wide variety of DGNN scenarios, offering
substantial speedups and energy improvements for edge applications. In summary, the
works presented in this thesis collectively contribute to advancing the state-of-the-art
in hardware acceleration for both static and dynamic GNNs, offering scalable, efficient,
and versatile solutions that pave the way for broader adoption of graph-based machine
learning in real-world applications.
Bibliography
[1] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher,
and Tina Eliassi-Rad. Collective Classification in Network Data. AI magazine,
29(3):93–93, 2008.
[2] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio,
and Yoshua Bengio. Graph Attention Networks. In Proceedings of the International
Conference on Learning Representations, 2018.
[3] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz,
and William J Dally. EIE: Efficient Inference Engine on Compressed Deep Neural
Network. In Proceedings of the ACM/IEEE International Symposium on Computer
Architecture, pages 243–254, 2016.
[4] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An Energy-
Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE
Journal of Solid-State Circuits, 52(1):127–138, 2017.
[6] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal,
Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle,
Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt
Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati,
William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu,
Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander
Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve
Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle
Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran
Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie,
Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross,
Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham,
Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian,
Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric
Wilcox, and Doe Hyun Yoon. In-datacenter Performance Analysis of a Tensor
Processing Unit. In Proceedings of the ACM/IEEE International Symposium on
Computer Architecture, pages 1–12, 2017.
[7] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Ranghara-
jan Venkatesan, Brucek Khailany, Joel Emer, Stephen W Keckler, and William J
Dally. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Net-
works. In Proceedings of the ACM/IEEE International Symposium on Computer
Architecture, pages 27–40, 2017.
[8] Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau,
Joon Kyung Kim, Vikas Chandra, and Hadi Esmaeilzadeh. Bit Fusion: Bit-Level
Dynamically Composable Architecture for Accelerating Deep Neural Network. In
Proceedings of the ACM/IEEE International Symposium on Computer Architecture,
pages 764–775, 2018.
[9] Norman P Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James
Laudon, Cliff Young, and David Patterson. A Domain-Specific Supercomputer for
Training Deep Neural Networks. Communications of the ACM, 63(7):67–78, 2020.
[10] Mingyu Yan, Lei Deng, Xing Hu, Ling Liang, Yujing Feng, Xiaochun Ye, Zhimin
Zhang, Dongrui Fan, and Yuan Xie. HyGCN: A GCN Accelerator with Hybrid Archi-
tecture. In Proceedings of the IEEE International Symposium on High Performance
Computer Architecture, pages 15–29, 2020.
[11] Tong Geng, Ang Li, Runbin Shi, Chunshu Wu, Tianqi Wang, Yanfei Li, Pouya
Haghi, Antonino Tumeo, Shuai Che, and Steve Reinhardt. AWB-GCN: A Graph
Convolutional Network Accelerator with Runtime Workload Rebalancing. In Pro-
ceedings of the IEEE/ACM International Symposium on Microarchitecture, pages
922–936, 2020.
[12] Jacob R Stevens, Dipankar Das, Sasikanth Avancha, Bharat Kaul, and Anand
Raghunathan. GNNerator: A Hardware/Software Framework for Accelerating
Graph Neural Networks. In Proceedings of the ACM/IEEE Design Automation
Conference, pages 955–960, 2021.
[13] Zhe Zhou, Bizhao Shi, Zhe Zhang, Yijin Guan, Guangyu Sun, and Guojie Luo.
BlockGNN: Towards Efficient GNN Acceleration Using Block-Circulant Weight
Matrices. In Proceedings of the ACM/IEEE Design Automation Conference, pages
1009–1014, 2021.
[14] Cen Chen, Kenli Li, Xiaofeng Zou, and Yangfan Li. DyGNN: Algorithm and
Architecture Support of Dynamic Pruning for Graph Neural Networks. In Proceedings
of the ACM/IEEE Design Automation Conference, pages 1201–1206, 2021.
[15] Bingyi Zhang, Rajgopal Kannan, and Viktor Prasanna. BoostGCN: A Framework
for Optimizing GCN Inference on FPGA. In Proceedings of the IEEE International
Symposium on Field-Programmable Custom Computing Machines, pages 29–39,
2021.
[16] Sudipta Mondal, Susmita Dey Manasi, Kishor Kunal, Ramprasath S, and Sachin S
Sapatnekar. GNNIE: GNN Inference Engine with Load-Balancing and Graph-
Specific Caching. In Proceedings of the ACM/IEEE Design Automation Conference,
pages 565–570, 2022.
[17] Sudipta Mondal, Susmita Dey Manasi, Kishor Kunal, S Ramprasath, Ziqing Zeng,
and Sachin S Sapatnekar. A Unified Engine for Accelerating GNN Weighting/Ag-
gregation Operations, with Efficient Load Balancing and Graph-Specific Caching.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
42(12):4844–4857, 2022.
[18] Sudipta Mondal, S Ramprasath, Ziqing Zeng, Kishor Kunal, and Sachin S Sapat-
nekar. A Multicore GNN Training Accelerator. In Proceedings of the IEEE/ACM
International Symposium on Low Power Electronics and Design, pages 1–6, 2023.
[19] Yuke Wang, Boyuan Feng, Gushu Li, Shuangchen Li, Lei Deng, Yuan Xie, and
Yufei Ding. GNNAdvisor: An Adaptive and Efficient Runtime System for GNN
Acceleration on GPUs. In Proceedings of the USENIX Symposium on Operating
Systems Design and Implementation, pages 515–531, 2021.
[20] Hanqing Zeng and Viktor Prasanna. GraphACT: Accelerating GCN Training
on CPU-FPGA Heterogeneous Platforms. In Proceedings of the ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, pages 255–265, 2020.
[21] Xiaobing Chen, Yuke Wang, Xinfeng Xie, Xing Hu, Abanti Basak, Ling Liang,
Mingyu Yan, Lei Deng, Yufei Ding, Zidong Du, and Yuan Xie. Rubik: A Hierarchical
Architecture for Efficient Graph Neural Network Training. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 41(4):936–949, 2022.
[22] Zhe Zhou, Cong Li, Xuechao Wei, Xiaoyang Wang, and Guangyu Sun. GNNear:
Accelerating Full-Batch Training of Graph Neural Networks with Near-Memory
Processing. In Proceedings of the International Conference on Parallel Architectures
and Compilation Techniques, pages 54–68, 2022.
[24] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral Networks
and Locally Connected Networks on Graphs. In Proceedings of the International
Conference on Learning Representations, 2013.
[25] Thomas N Kipf and Max Welling. Semi-Supervised Classification with Graph
Convolutional Networks. In Proceedings of the International Conference on Learning
Representations, 2017.
[26] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive Representation Learning
on Large Graphs. In Advances in Neural Information Processing Systems, pages
1025–1035, 2017.
[27] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How Powerful are
Graph Neural Networks? In Proceedings of the International Conference on Learning
Representations, 2019.
[28] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure
Leskovec. Hierarchical Graph Representation Learning with Differentiable Pooling.
In Advances in Neural Information Processing Systems, pages 4805–4815, 2018.
[29] Tae Jun Ham, Lisa Wu, Narayanan Sundaram, Nadathur Satish, and Margaret
Martonosi. Graphicionado: A High-Performance and Energy-Efficient Accelerator
for Graph Analytics. In Proceedings of the IEEE/ACM International Symposium
on Microarchitecture, pages 1–13, 2016.
[30] Guohao Dai, Yuze Chi, Yu Wang, and Huazhong Yang. FPGP: Graph Processing Framework on FPGA – A Case Study of Breadth-First Search. In Proceedings of the
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages
105–110, 2016.
[31] Sang-Woo Jun, Andy Wright, Sizhuo Zhang, and Shuotao Xu. GraFBoost: Using
Accelerated Flash Storage for External Graph Analytics. In Proceedings of the
ACM/IEEE International Symposium on Computer Architecture, pages 411–424,
2018.
[32] Nanda K Unnikrishnan, Joe Gould, and Keshab K Parhi. SCV-GNN: Sparse Compressed Vector-Based Graph Neural Network Aggregation. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 42(12):4803–4816, 2023.
[33] Adam Auten, Matthew Tomei, and Rakesh Kumar. Hardware Acceleration of
Graph Neural Networks. In Proceedings of the ACM/IEEE Design Automation
Conference, pages 1–6, 2020.
[34] Jeremy Fowers, Kalin Ovtcharov, Karin Strauss, Eric S Chung, and Greg Stitt. A
High Memory Bandwidth FPGA Accelerator for Sparse Matrix-Vector Multiplica-
tion. In Proceedings of the IEEE International Symposium on Field-Programmable
Custom Computing Machines, pages 36–43, 2014.
[35] Nitish Srivastava, Hanchen Jin, Jie Liu, David Albonesi, and Zhiru Zhang. MatRap-
tor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product.
In Proceedings of the IEEE/ACM International Symposium on Microarchitecture,
pages 766–780, 2020.
[36] Nitish Srivastava, Hanchen Jin, Shaden Smith, Hongbo Rong, David Albonesi,
and Zhiru Zhang. Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense
Tensor Computations. In Proceedings of the IEEE International Symposium on
High Performance Computer Architecture, pages 689–702, 2020.
[37] David Salomon. Data Compression: The Complete Reference. Springer Science &
Business Media, London, UK, 4th edition, 2007.
[40] Peter Nilsson, Ateeq Ur Rahman Shaik, Rakesh Gangarajaiah, and Erik Hertz.
Hardware Implementation of the Exponential Function using Taylor Series. In
Proceedings of the IEEE Nordic Circuits and Systems Conference, pages 1–4, 2014.
[41] Shengwen Liang, Ying Wang, Cheng Liu, Lei He, LI Huawei, Dawen Xu, and
Xiaowei Li. EnGN: A High-Throughput and Energy-Efficient Accelerator for Large
Graph Neural Networks. IEEE Transactions on Computers, 70(9):1511–1525, 2021.
[42] Jasmina Malicevic, Baptiste Lepers, and Willy Zwaenepoel. Everything You Always
Wanted to Know about Multicore Graph Processing but Were Afraid to Ask. In
Proceedings of the USENIX Annual Technical Conference, pages 631–643, 2017.
[43] Lifeng Nai, Ramyad Hadidi, Jaewoong Sim, Hyojong Kim, Pranith Kumar, and
Hyesoon Kim. GraphPIM: Enabling Instruction-Level PIM Offloading in Graph
Computing Frameworks. In Proceedings of the IEEE International Symposium on
High Performance Computer Architecture, pages 457–468, 2017.
[44] Jiajun Li, Ahmed Louri, Avinash Karanth, and Razvan Bunescu. GCNAX: A
Flexible and Energy-efficient Accelerator for Graph Convolutional Neural Networks.
In Proceedings of the IEEE International Symposium on High Performance Computer
Architecture, pages 775–788, 2021.
[45] Jiajun Li, Hao Zheng, Ke Wang, and Ahmed Louri. SGCNAX: A Scalable Graph
Convolutional Neural Network Accelerator With Workload Balancing. IEEE Trans-
actions on Parallel and Distributed Systems, 33(11):2834–2845, 2022.
[46] Soroush Ghodrati, Byung Hoon Ahn, Joon Kyung Kim, Sean Kinzer, Brahmen-
dra Reddy Yatham, Navateja Alla, Hardik Sharma, Mohammad Alian, Eiman
Ebrahimi, Nam Sung Kim, Cliff Young, and Hadi Esmaeilzadeh. Planaria: Dynamic
Architecture Fission for Spatial Multi-Tenant Acceleration of Deep Neural Networks.
In Proceedings of the IEEE/ACM International Symposium on Microarchitecture,
pages 681–697, 2020.
[47] Udit Gupta, Samuel Hsia, Jeff Zhang, Mark Wilkening, Javin Pombra, Hsien-
Hsin Sean Lee, Gu-Yeon Wei, Carole-Jean Wu, and David Brooks. RecPipe:
Co-designing Models and Hardware to Jointly Optimize Recommendation Quality
and Performance. In Proceedings of the IEEE/ACM International Symposium on
Microarchitecture, pages 870–884, 2021.
[48] Kai Zhong, Shulin Zeng, Wentao Hou, Guohao Dai, Zhenhua Zhu, Xuecang Zhang,
Shihai Xiao, Huazhong Yang, and Yu Wang. CoGNN: An Algorithm-Hardware
Co-Design Approach to Accelerate GNN Inference With Minibatch Sampling.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
42(12):4883–4896, 2023.
[49] Zeyu Zhu, Fanrong Li, Gang Li, Zejian Liu, Zitao Mo, Qinghao Hu, Xiaoyao Liang,
and Jian Cheng. MEGA: A Memory-Efficient GNN Accelerator Exploiting Degree-
Aware Mixed-Precision Quantization. In Proceedings of the IEEE International
Symposium on High Performance Computer Architecture, pages 124–138, 2024.
[50] Yunming Zhang, Vladimir Kiriansky, Charith Mendis, Saman Amarasinghe, and
Matei Zaharia. Making Caches Work for Graph Analytics. In Proceedings of the
IEEE International Conference on Big Data, pages 293–302, 2017.
[51] Junya Arai, Hiroaki Shiokawa, Takeshi Yamamuro, Makoto Onizuka, and Sotetsu
Iwamura. Rabbit Order: Just-in-Time Parallel Reordering for Fast Graph Analy-
sis. In Proceedings of the IEEE International Parallel and Distributed Processing
Symposium, pages 22–31, 2016.
[52] Priyank Faldu, Jeff Diamond, and Boris Grot. Domain-Specialized Cache Manage-
ment for Graph Analytics. In Proceedings of the IEEE International Symposium on
High Performance Computer Architecture, pages 234–248, 2020.
[53] Yoongu Kim, Weikun Yang, and Onur Mutlu. Ramulator: A Fast and Extensible
DRAM Simulator. IEEE Computer Architecture Letters, 15(1):45–49, 2015.
[54] Mike O’Connor, Niladrish Chatterjee, Donghyuk Lee, John Wilson, Aditya Agrawal,
Stephen W Keckler, and William J Dally. Fine-Grained DRAM: Energy-Efficient
DRAM for Extreme Bandwidth Systems. In Proceedings of the IEEE/ACM Inter-
national Symposium on Microarchitecture, pages 41–54, 2017.
[55] Vijay Prakash Dwivedi, Chaitanya K Joshi, Anh Tuan Luu, Thomas Laurent, Yoshua
Bengio, and Xavier Bresson. Benchmarking Graph Neural Networks. Journal of
Machine Learning Research, 24(43):1–48, 2023.
[56] Xiaowei Zhu, Wentao Han, and Wenguang Chen. GridGraph: Large-Scale Graph
Processing on a Single Machine Using 2-Level Hierarchical Partitioning. In Pro-
ceedings of the USENIX Annual Technical Conference, pages 375–386, 2015.
[57] Jason Mohoney, Roger Waleffe, Henry Xu, Theodoros Rekatsinas, and Shivaram
Venkataraman. Marius: Learning Massive Graph Embeddings on a Single Ma-
chine. In Proceedings of the USENIX Symposium on Operating Systems Design and
Implementation, pages 533–549, 2021.
[58] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu,
Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for Machine
Learning on Graphs. In Advances in Neural Information Processing Systems, pages
22118–22133, 2020.
[59] Zhihao Jia, Sina Lin, Mingyu Gao, Matei Zaharia, and Alex Aiken. Improving the
Accuracy, Scalability, and Performance of Graph Neural Networks with Roc. In
Proceedings of Machine Learning and Systems, pages 187–198, 2020.
[60] Lingxiao Ma, Zhi Yang, Youshan Miao, Jilong Xue, Ming Wu, Lidong Zhou, and
Yafei Dai. NeuGraph: Parallel Deep Neural Network Computation on Large Graphs.
In Proceedings of the USENIX Annual Technical Conference, pages 443–458, 2019.
[61] Zhiqi Lin, Cheng Li, Youshan Miao, Yunxin Liu, and Yinlong Xu. PaGraph: Scaling
GNN Training on Large Graphs via Computation-Aware Caching. In Proceedings
of the ACM Symposium on Cloud Computing, pages 401–415, 2020.
[62] Haoran You, Tong Geng, Yongan Zhang, Ang Li, and Yingyan Lin. GCoD: Graph
Convolutional Network Acceleration via Dedicated Algorithm and Accelerator Co-
Design. In Proceedings of the IEEE International Symposium on High Performance
Computer Architecture, pages 460–474, 2022.
[63] Zheng Qu, Dimin Niu, Shuangchen Li, Hongzhong Zheng, and Yuan Xie. TT-GNN:
Efficient On-Chip Graph Neural Network Training via Embedding Reformation and
Hardware Optimization. In Proceedings of the IEEE/ACM International Symposium
on Microarchitecture, pages 452–464, 2023.
[64] Gongjian Sun, Mingyu Yan, Duo Wang, Han Li, Wenming Li, Xiaochun Ye, Don-
grui Fan, and Yuan Xie. Multi-Node Acceleration for Large-Scale GCNs. IEEE
Transactions on Computers, 71(12):3140–3152, 2022.
[65] Shyam A Tailor, Javier Fernandez-Marques, and Nicholas D Lane. Degree-Quant:
Quantization-Aware Training for Graph Neural Networks. In Proceedings of the
International Conference on Learning Representations, 2021.
[66] George Karypis and Vipin Kumar. A Fast and High Quality Multilevel Scheme
for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing,
20(1):359–392, 1998.
[67] Swapnil Gandhi and Anand Padmanabha Iyer. P3: Distributed Deep Graph
Learning at Scale. In Proceedings of the USENIX Symposium on Operating Systems
Design and Implementation, 2021.
[69] Nan Jiang, Daniel U Becker, George Michelogiannakis, James Balfour, Brian Towles,
David E Shaw, John Kim, and William J Dally. A Detailed and Flexible Cycle-
Accurate Network-on-Chip Simulator. In Proceedings of the IEEE International Symposium
on Performance Analysis of Systems and Software, pages 86–96, 2013.
[70] Andrew B Kahng, Bill Lin, and Siddhartha Nath. ORION3.0: A Comprehensive
NoC Router Estimation Tool. IEEE Embedded Systems Letters, pages 41–45, 2015.
[71] Minjie Wang. Deep Graph Library: Towards Efficient and Scalable Deep
Learning on Graphs. In Proceedings of the International Conference on Learning
Representations, 2019. https://github.com/dmlc/dgl/.
[72] Emanuele Rossi, Ben Chamberlain, Fabrizio Frasca, Davide Eynard, Federico Monti,
and Michael Bronstein. Temporal Graph Networks for Deep Learning on Dynamic
Graphs. In Proceedings of the International Conference on Machine Learning, 2020.
[73] Seyed Mehran Kazemi, Rishab Goel, Kshitij Jain, Ivan Kobyzev, Akshay Sethi,
Peter Forsyth, and Pascal Poupart. Representation Learning for Dynamic Graphs:
A Survey. Journal of Machine Learning Research, 21(70):1–73, 2020.
[74] Chunyang Wang, Desen Sun, and Yuebin Bai. PiPAD: Pipelined and Parallel
Dynamic GNN Training on GPUs. In Proceedings of the ACM SIGPLAN Annual
Symposium on Principles and Practice of Parallel Programming, pages 405–418,
2023.
[75] Minjia Zhang, Samyam Rajbhandari, Wenhan Wang, and Yuxiong He. DeepCPU:
Serving RNN-based Deep Learning Models 10x Faster. In Proceedings of the USENIX
Annual Technical Conference, pages 951–965, 2018.
[77] Xinkai Song, Tian Zhi, Zhe Fan, Zhenxing Zhang, Xi Zeng, Wei Li, Xing Hu,
Zidong Du, Qi Guo, and Yunji Chen. Cambricon-G: A Polyvalent Energy-Efficient
Accelerator for Dynamic Graph Neural Networks. IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, 41(1):116–128, 2021.
[78] Hanqiu Chen and Cong Hao. DGNN-Booster: A Generic FPGA Accelerator
Framework for Dynamic Graph Neural Network Inference. In Proceedings of
the IEEE International Symposium on Field-Programmable Custom Computing
Machines, pages 195–201, 2023.
[79] Yu Huang, Long Zheng, Pengcheng Yao, Qinggang Wang, Haifeng Liu, Xiaofei
Liao, Hai Jin, and Jingling Xue. ReaDy: A ReRAM-Based Processing-in-Memory
Accelerator for Dynamic Graph Convolutional Networks. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 41(11):3567–3578, 2022.
[80] Yu Huang, Long Zheng, Pengcheng Yao, Qinggang Wang, Xiaofei Liao, Hai Jin, and
Jingling Xue. Accelerating Graph Convolutional Networks Using Crossbar-based
Processing-In-Memory Architectures. In Proceedings of the IEEE International
Symposium on High Performance Computer Architecture, pages 1029–1042, 2022.
[83] Ling Zhao, Yujiao Song, Chao Zhang, Yu Liu, Pu Wang, Tao Lin, Min Deng, and
Haifeng Li. T-GCN: A Temporal Graph Convolutional Network for Traffic Prediction.
IEEE Transactions on Intelligent Transportation Systems, 21(9):3848–3858, 2019.
[84] Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. Attention
Based Spatial-Temporal Graph Convolutional Networks for Traffic Flow Forecasting.
In Proceedings of the AAAI Conference on Artificial Intelligence, pages 922–929,
2019.
[85] Hongkuan Zhou, Da Zheng, Israt Nisa, Vasileios Ioannidis, Xiang Song, and George
Karypis. TGL: A General Framework for Temporal GNN Training on Billion-Scale
Graphs. Proceedings of the VLDB Endowment, 15(8):1572–1580, 2022.
[86] Benedek Rozemberczki, Paul Scherer, Yixuan He, George Panagopoulos, Alexander
Riedel, Maria Astefanoaei, Oliver Kiss, Ferenc Beres, Guzman Lopez, Nicolas
Collignon, and Rik Sarkar. PyTorch Geometric Temporal: Spatiotemporal Signal
Processing with Neural Machine Learning Models. In Proceedings of the ACM
International Conference on Information & Knowledge Management, pages 4564–
4573, 2021.