
Scalable and Versatile Hardware Acceleration

of Graph Neural Networks

A THESIS
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY

Sudipta Mondal

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS


FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

Sachin S. Sapatnekar, Advisor

June, 2024
© Sudipta Mondal 2024
ALL RIGHTS RESERVED
Acknowledgements

First, I would like to extend my deepest gratitude to my esteemed advisor, Prof. Sachin
S. Sapatnekar. Under his mentorship, I have not only gained invaluable insights into
the realm of research but also imbibed qualities of perseverance, meticulousness, and
holistic problem-solving. The unwavering support, profound guidance, and scholarly
wisdom of Prof. Sapatnekar have been instrumental in shaping me into the researcher I
am today. His dedication to fostering academic excellence has been truly inspiring, and
I am honored to have had the privilege of working under his supervision.
I would also like to express my sincere appreciation to my thesis committee members,
Prof. Kia Bazargan, Prof. Antonia Zhai, and Prof. Ulya Karpuzcu, whose insightful
feedback and constructive critiques have significantly enriched my research endeavors. I
am deeply thankful to the Semiconductor Research Corporation (SRC) and the ECE
department of UMN for their generous funding, which has provided essential support for
my research throughout my PhD. Special thanks are also due to the dedicated staff of the
ECE department, particularly Chimai Nguyen, Sarah Dohm, and Ann Rausch, for their
unwavering assistance and support. My heartfelt appreciation goes out to my esteemed
project partners, Dr. Susmita Dey Manasi, Dr. Kishore Kunal, Dr. Ramprasath,
Ziqing Zeng, and all other lab mates, including Dr. Vidya A. Chhabria, Dr. Tonmoy
Dhar, Mohammad Shohel, Subhadip Ghosh, Abhimanyu Kumar, Hangyu Zhang, and
Endalkachew Gebru, whose collaborative efforts have contributed immensely to the
success of my research endeavors.
I am also grateful to the vibrant UMN Bangladeshi community, particularly Dr.
Snigdha, Dr. Ibrahim, Nitol, Gourab, Gourango, Neon, and Darpan, for their unwavering
camaraderie and support during various social events, including get-togethers, potlucks,
and recreational activities.

To my beloved family, including my mother Promila Biswas and my father Dipak
Ranjan Mondal, I owe a debt of gratitude beyond words. Their unwavering love,
encouragement, and support have been the cornerstone of my journey, guiding me
through every challenge and triumph. And to my better half, best buddy, and also my
esteemed labmate Nibedita Karmokar, I owe a profound debt of gratitude. Throughout
this journey, she has been my unwavering pillar of strength, standing by me through
thick and thin, and cheering me on through every high and low. Her belief in my abilities
never wavered, even in the face of adversity, and her resolute support has been the
anchor that kept me grounded during the most challenging times. I am immensely
grateful for her presence in my life and for the countless ways she has enriched my
journey with her unwavering love, friendship, and support. I am also thankful to my
parents-in-law, Tanusri Chowdhury and Nanda Kumar Karmokar, for their constant
support and blessings.
Finally, I am proud to call myself a graduate of the University of Minnesota, whose
nurturing environment and steadfast support have been integral to my academic and
personal growth.

Dedication

To my parents and dear wife.

Abstract

Graph neural networks (GNNs) are vital for analyzing real-world problems (e.g., network
analysis, drug interaction, electronic design automation, e-commerce) that use graph
models. However, efficient GNN acceleration faces multiple challenges related to the
high and variable sparsity of input feature vectors, the power-law degree distribution of
the adjacency matrix, and maintaining load-balanced computation with minimal random
memory accesses. This thesis addresses the problem of building fast, energy-efficient
inference and training accelerators for GNNs, covering both static and dynamic graphs.
For inference, this thesis proposes GNNIE, a versatile GNN inference accelerator
capable of handling a diverse set of GNNs, including graph attention networks (GATs),
graph convolutional networks (GCNs), GraphSAGE, GINConv, and DiffPool. It mitigates
workload imbalance by (i) splitting vertex feature operands into blocks, (ii) reordering
and redistributing computations, and (iii) using a novel “flexible MAC” architecture. To
maximize on-chip data reuse and reduce random DRAM fetches, GNNIE adopts a novel
graph-specific, degree-aware caching policy. GNNIE attains substantial speedup over
CPU (7197×), GPU (17.81×), and prior works, e.g., HyGCN (5×), AWB-GCN (1.3×)
over multiple datasets on GCN, GAT, GraphSAGE, and GINConv.
For training GNNs for large graphs, this research develops a GNNIE-based multicore
accelerator. A novel feature vector segmentation approach is proposed to scale to
large graphs using small on-chip buffers. A multicore-specific, graph-specific caching scheme
is also implemented to reduce off-chip and on-chip communication and to alleviate
random DRAM accesses. Experiments over multiple large datasets and multiple GNNs
demonstrate an average training speedup and energy efficiency improvement of 17× and
322×, respectively, over DGL on a GPU, and a speedup of 14× with 268× lower energy
over the GPU-based GNNAdvisor approach. Overall, this research tackles the scalability
and versatility issues of building GNN accelerators while delivering significant speedup
and energy efficiency gains.
Finally, this thesis addresses the acceleration of dynamic graph neural networks
(DGNNs), which play a crucial role in applications such as social network analytics
and urban traffic prediction that require inferencing on graph-structured data, where

the connectivity and features of the underlying graph evolve over time. The proposed
platform integrates GNN and Recurrent Neural Network (RNN) components of DGNNs,
providing a unified platform for spatial and temporal information capture, respectively.
The contributions encompass optimized cache reuse strategies, a novel caching policy,
and an efficient pipelining mechanism. Evaluation across multiple graph datasets and
multiple DGNNs demonstrates average energy efficiency gains of 8393×, 183×, and 87×–10×,
and inference speedups of 1796×, 77×, and 21×–2.4×, over an Intel Xeon Gold
CPU, an NVIDIA V100 GPU, and prior state-of-the-art DGNN accelerators, respectively.

Contents

Acknowledgements i

Dedication iii

Abstract iv

Contents vi

List of Tables ix

List of Figures x

1 Introduction 1
1.1 Hardware Acceleration of GNN Inference . . . . . . . . . . . . . . . . . 2
1.2 Multicore Acceleration of GNN Training on Large Graphs . . . . . . . . 3
1.3 Inference Acceleration of Dynamic Graph Neural Networks . . . . . . . 4
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Fundamentals of GNNs 6
2.1 Machine Learning on Graph-structured Data . . . . . . . . . . . . . . . 6
2.2 Types of GNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 GNNIE 11
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Accelerator Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Mapping Weighting to Computational Processing Elements . . . . . . . 18

3.3.1 Scheduling Operations in the Computational Processing Elements 18
3.3.2 The Merge Computational Processing Element . . . . . . . . . . 20
3.3.3 Load Balancing for Weighting . . . . . . . . . . . . . . . . . . . . 21
3.4 Aggregation Computations . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4.1 Reordering for Linear Computational Complexity . . . . . . . . . 23
3.4.2 Mapping Attention Vector Multiplication . . . . . . . . . . . . . 25
3.4.3 Mapping Edge-based Computations . . . . . . . . . . . . . . . . 26
3.5 Graph-Specific Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.7.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.7.2 Baseline Platform Comparisons . . . . . . . . . . . . . . . . . . . 36
3.7.3 Cross-platform Comparisons . . . . . . . . . . . . . . . . . . . . 38
3.7.4 Throughput and Energy Comparisons . . . . . . . . . . . . . . . 39
3.7.5 DRAM Access Analysis . . . . . . . . . . . . . . . . . . . . . . . 40
3.7.6 Optimization Analysis . . . . . . . . . . . . . . . . . . . . . . . . 42
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4 Multicore Training Acceleration 48


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 GNN Training Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Multicore Architecture and Computations . . . . . . . . . . . . . . . . . 51
4.4 Dynamic Cache Replacement Policy . . . . . . . . . . . . . . . . . . . . 53
4.5 Scaling on Large GNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5.1 Bottlenecks of Scaling on Large GNNs . . . . . . . . . . . . . . . 56
4.5.2 Feature Vector Segmentation . . . . . . . . . . . . . . . . . . . . 57
4.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.7 Applying Partitioning Methods for a Multicore GNN Inference Engine . 64
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

5 DGNN Inference Acceleration 68


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.3 Proposed DGNN Accelerator . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4 Accelerating GNN Computations . . . . . . . . . . . . . . . . . . . . . . 75
5.4.1 Overlap Extraction Methodology . . . . . . . . . . . . . . . . . . 75
5.4.2 Proposed Overlap-aware Caching Policy . . . . . . . . . . . . . . 76
5.5 Accelerating RNN Computations . . . . . . . . . . . . . . . . . . . . . . 78
5.5.1 Weight Coalescing for the RNN Kernel . . . . . . . . . . . . . . . 78
5.5.2 Pipelining GNN and RNN Engines . . . . . . . . . . . . . . . . . 79
5.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

6 Thesis Conclusion 85

Bibliography 87

List of Tables

2.1 Summary of operations in layer l of various GNNs. . . . . . . . . . . . . 9


3.1 Dataset information for GNNIE [1] . . . . . . . . . . . . . . . . . . . . . 35
3.2 Convolution layer configurations (len[h_i^l] = length of h_i^l) . . . . . . . . . 35
3.3 Absolute run time of inference for GNNIE . . . . . . . . . . . . . . . . . 37
3.4 Throughput for various datasets for GNNIE. . . . . . . . . . . . . . . . 40
4.1 Type A datasets (DD: D&D, TW: TWITTER-Partial, YT: Yeast, SW: SW-
620H, OV: OVCAR-8H) for GNN training . . . . . . . . . . . . . . . . . 62
4.2 Type B datasets (SB: soc-BlogCatalog, CA: com-amazon, A-05: ama-
zon0505, A-06: amazon0601, EN: enwiki, A-8M: amazon8M) for GNN
training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.1 Dataset information for DGNN inference . . . . . . . . . . . . . . . . . . 82

List of Figures

2.1 Graphs vs images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7


2.2 2-D convolution versus graph convolution. . . . . . . . . . . . . . . . . . 8
3.1 GNN accuracy comparison (data from [2], PPI dataset). . . . . . . . . 12
3.2 Nonzero histogram for input vertex feature vectors (Cora). . . . . . . . . 12
3.3 Block diagram of GNNIE. . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Weight-stationary linear transformation of vertex features. . . . . . . . . 19
3.5 Mapping Weighting operations to the CPE array. . . . . . . . . . . . . . 20
3.6 Workload reordering in flexible MAC (FM) approach. . . . . . . . . . . 23
3.7 Reordering of GAT computations. . . . . . . . . . . . . . . . . . . . . . 24
3.8 Data flow corresponding to computation of an edge. . . . . . . . . . . . 27
3.9 Example illustrating the subgraph in the input buffer (left) and its evolu-
tion after cache replacement (right). . . . . . . . . . . . . . . . . . . . . 28
3.10 Input buffer replacement policy during Aggregation. . . . . . . . . . . . 30
3.11 Histogram of α through various Rounds (Pubmed). The inset shows a
magnified view. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.12 Ablation study on γ: (a) Cora (b) Citeseer (c) Pubmed. . . . . . . . . . 32
3.13 GNNIE performance vs. (a) PyG-CPU (b) PyG-GPU. . . . . . . . . . . 37
3.14 GNNIE performance comparison with HyGCN and AWB-GCN. . . . . . 39
3.15 Energy breakdown for GCN and GAT. . . . . . . . . . . . . . . . . . . . 40
3.16 Energy efficiency: GNNIE vs. HyGCN, AWB-GCN. . . . . . . . . . . . 41
3.17 Comparison of DRAM access of GNNIE with the lower bound on DRAM
access vs vertex partitions in DRAM of 2-D graph partitioning. . . . . . 43
3.18 CPE row workload in Weighting: (a) Cora (b) Citeseer (c) Pubmed. . . 44
3.19 Cost/benefit ratio for adding MACs in Designs B–E. . . . . . . . . . . . 45

3.20 Effectiveness of GNNIE’s optimization methods. . . . . . . . . . . . . . 46
4.1 Block diagram of the proposed multicore GNN training accelerator (core
architecture in inset) with 4 cores; our evaluation considers accelerators
with up to 36 cores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 (a) Boosting γintra to break intra-cluster stagnation on Core 2. (b) Invok-
ing full random access after most edges are processed on all cores. . . . 55
4.3 Feature vector segmentation. . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 Performance analysis of feature vector segmentation: (a) etotal (Average)
vs. Execution Cycles (b) Aggregation cycle comparison. . . . . . . . . . 59
4.5 Speedup and energy efficiency of the proposed multicore GNN training
accelerator vs. DGL+Tesla V100 and GNNAdvisor+Tesla V100: (a), (c):
Type A datasets (b), (d): Type B datasets. . . . . . . . . . . . . . . . . 63
4.6 Inference speedup and energy efficiency the proposed multicore GNN
training accelerator vs. DGL+Tesla V100 and GNNAdvisor+Tesla V100:
(a), (c): Type A datasets (b), (d): Type B datasets. . . . . . . . . . . . 66
5.1 Block diagram of the proposed DGNN accelerator. . . . . . . . . . . . . 74
5.2 Overlap extraction between consecutive groups. . . . . . . . . . . . . . . 77
5.3 Implementation of our weight coalescing scheme. . . . . . . . . . . . . . 79
5.4 Implementation of inter-engine pipelining. . . . . . . . . . . . . . . . . . 80
5.5 Speedup vs. snapshots per group. . . . . . . . . . . . . . . . . . . . . . . 81
5.6 Speedup comparison results for DGNN inference. . . . . . . . . . . . . . 82
5.7 Energy efficiency comparison results for DGNN inference. . . . . . . . . 84

Chapter 1

Introduction

The remarkable success of machine learning and artificial intelligence in the current era can
be attributed to deep learning (DL)-based approaches that have revolutionized the way
we process data and make decisions. This success lies in the ability of these approaches
to extract patterns and insights from vast amounts of information, empowering a wide
range of tasks such as object detection, disease diagnosis, and speech recognition among
numerous others.
One of the first major advancements in deep learning that kicked off the AI revolution
was the deployment of convolutional neural networks (CNNs), which are now particularly
renowned for their proficiency in image recognition and classification tasks. CNNs have
drastically improved the accuracy and efficiency of tasks such as object detection, facial
recognition, and medical imaging, showcasing their indispensable role in modern AI
applications.
However, traditional neural networks such as CNNs and recurrent neural networks
(RNNs) are limited in their ability to process non-Euclidean data structures such
as graphs. This is where graph neural networks (GNNs) come into play, offering a
breakthrough in analyzing and understanding complex relational data for myriads of
real-world problems (e.g., social network analysis, recommendation systems, epidemic
forecasting, and molecular modeling). The significance of GNNs lies in their unique
ability to capture intricate relationships and dependencies for graph structured data,
paving the way for more sophisticated and context-aware machine learning models.
With the scalability challenges posed by the ever-growing size and complexity of data,

implementations of GNNs on conventional computing platforms, e.g., central processing
units (CPUs), graphics processing units (GPUs), and field programmable gate arrays
(FPGAs) struggle to meet the computational and energy demands of DL workloads.
Hence, there is a pressing need to build specialized application-specific integrated circuit
(ASIC)-based accelerators, with high performance and low power, that enable successful
deployment (i.e., inference) for edge-based applications and for training, where the
model parameters of the neural network are optimized.
However, the accelerators proposed in the literature for CNN and RNN applications [3–
9] are not suited to address the computational and energy demands imposed by GNNs.
This thesis endeavors to address the burgeoning demand for ASIC-based accelerators for
GNNs. In this thesis, we propose to address the following three challenges:

• Inference acceleration for GNN for edge-based applications.

• Acceleration of GNN training for large-scale static graphs in a multi-core scenario.

• Inference acceleration of dynamic GNNs where the relationship among objects in


a graph can evolve over time.

1.1 Hardware Acceleration of GNN Inference


GNNs are indispensable for addressing a multitude of real-world issues, spanning from
network analysis and drug discovery to electronic design automation and optimizing
e-commerce platforms. However, the development of an efficient accelerator for GNNs
faces numerous hurdles. These challenges stem from various factors, including the high
and variable sparsity of input feature vectors, which represent the features associated
with each vertex in the graph. Additionally, the adjacency matrix of real-world graphs
exhibits high sparsity and a power-law degree distribution, wherein a few vertices have a
large number of connections while the majority of the vertices have only a few connections.
Furthermore, achieving load-balanced computation, where all processors have similar
workloads assigned to them, while minimizing random memory accesses adds another
layer of complexity to GNN accelerator development.
Prior CNN accelerators [3–9] are not tailored towards graph-structured data and are
not suitable for handling the high sparsity (an order of magnitude higher than that
encountered while processing images in CNNs) inherent to real-world graphs. To address
this, software- and hardware-based GNN accelerators have been proposed in the literature.
The limitations of software-based frameworks for GNNs arise due to inefficiencies of
general-purpose processors (CPUs/GPUs) to address the computational challenges
of GNNs. Several notable hardware-based GNN accelerators have been proposed in
the literature, e.g., HyGCN [10], AWB-GCN [11], GNNerator [12], BlockGNN [13],
DyGNN [14], BoostGCN [15]. However, they either do not handle, or have limited ability
to handle, all the key challenges of energy-efficient GNN acceleration.
We first propose GNNIE [16, 17], a GNN inference accelerator that offers several
advantages over prior accelerators [10, 11, 13, 14]. GNNIE is designed to be a versatile
platform, i.e., it can handle a diverse set of GNNs. It efficiently accesses data from
DRAM and effectively manages the overheads associated with random access patterns.
GNNIE also addresses the challenges of load-balancing that stem from the high sparsity
in input node feature vectors, adjacency matrices, and power-law degree distributions.

1.2 Multicore Acceleration of GNN Training on Large


Graphs
Before GNNs can be effectively deployed for real-world tasks, they must undergo a rigor-
ous training phase to learn model parameters essential for their subsequent deployment.
Training GNNs on large-scale graphs presents unique challenges that must be addressed
to ensure efficient and scalable acceleration.
One major challenge in GNN training is the high computation and communication
costs associated with backpropagation, especially on large graphs. Unlike inference tasks,
which are relatively less compute-intensive, training tasks involve iterative processes
of updating model parameters based on computed gradients. This process incurs high
access time and energy costs for communication between memory and on-chip buffers,
significantly impacting training acceleration efficiency. Furthermore, scalability becomes
a concern as graph sizes in real-world datasets continue to grow exponentially. Existing
single-core solutions struggle to handle the computational demands of training on large
graphs.
In response to these challenges, we propose a multicore GNN training accelerator [18]
designed specifically for training on large-scale graphs. Our proposed platform offers
significant improvements over prior works [19–22], addressing the limitations of single-
core solutions by leveraging an array of GNNIE cores for training. This approach leads
to substantial speedup and energy-efficiency improvements, surpassing the capabilities
of existing multicore inference accelerators. Additionally, it demonstrates scalability on
datasets with up to 8.6M vertices, surpassing the capabilities of previous ASIC/FPGA-
based training accelerators.

1.3 Inference Acceleration of Dynamic Graph Neural Net-


works
In many real-world applications, the relationships among the objects evolve over time.
Examples include friendship relations on social media, user ratings in a recommendation
system, and streaming transactions in computational finance. In these scenarios, a static
graph is not sufficient to capture the underlying relationship among objects, hence the
need for dynamic graphs. Dynamic graph neural networks (DGNNs) are a special kind
of neural network that operates on dynamic graphs and involves two major computation
kernels, i.e., GNN and RNN kernels. The GNN kernel is used to capture the structural
information whereas the RNN kernel is used to capture the temporal information.
Motivated by the increasing importance of edge computing and the necessity for efficient
handling of dynamic graphs, particularly in applications such as social network analysis
and urban traffic prediction, there is a crucial demand for hardware acceleration of
DGNNs. These networks, which combine GNN and RNN components, face challenges
such as irregularity and locality in dynamic graphs, requiring novel approaches for
effective performance enhancement.
We propose a platform that addresses these challenges, offering a holistic solution to
seamlessly handle both GNN and RNN computations. Unlike existing accelerators that
treat these components separately, our approach aims to capture spatial and temporal
information simultaneously.
1.4 Thesis Organization
The thesis is organized as follows:

• Chapter 2 provides a fundamental background of GNNs and the computations


involved in GNN inference and training. This chapter also discusses several
noteworthy DGNN architectures. The material covered in this chapter forms the
foundational knowledge necessary for comprehending the work of the thesis.

• Chapter 3 presents GNNIE, our proposed energy-efficient and versatile GNN


inference accelerator. This chapter discusses the details of the GNNIE architecture
and the proposed mechanisms to address the computational bottlenecks imposed
by GNN inference. Our evaluation demonstrates that GNNIE achieves significant
performance improvements: a 7197× speedup over a CPU and a 17.81× speedup over a
GPU, and it outperforms previous GNN accelerators such as HyGCN by 5× and
AWB-GCN by 1.3×.

• Chapter 4 describes our proposed multicore GNN training accelerator for large
graphs. This chapter discusses the novel feature vector segmentation and dynamic
caching scheme that enable our platform to achieve GPU-like scalability and
accelerator-like efficiency for large graphs. Experiments conducted on various
datasets and different GNNs show that our approach achieves an average training
speedup of 17× and an energy efficiency improvement of 322× compared to a
GPU-based baseline.

• Chapter 5 outlines the proposed hardware accelerator platform for performing


inference on DGNNs. This chapter describes the details of the novel overlap-
aware caching and the pipelining of GNN and RNN engines for dynamic graphs.
Evaluation across various graph datasets and DGNNs shows average inference
speedups of 1603×, 64×, and 3× over a CPU, a GPU, and a previous state-of-the-
art DGNN accelerator, respectively.

• Chapter 6 concludes the thesis.


Chapter 2

Fundamentals of GNNs

The increasing complexity and scale of data in various fields necessitate advanced methods
for processing and analyzing graph-structured information. GNNs have emerged as
a key technology to address this need, providing sophisticated tools for handling the
non-Euclidean nature of graphs. Unlike traditional neural networks like CNNs and RNNs,
which are optimized for Euclidean data such as images and sequences, GNNs excel at
capturing intricate relationships in graph data. This capability is crucial for applications
in social network analysis, molecular chemistry, and recommendation systems, where
the relationships between data points are inherently irregular and interconnected. The
following two subsections of this chapter will delve into the fundamental principles of
GNNs, tracing their evolution and categorizing the main types of GNN architectures.
These discussions provide a foundation for understanding the majority of the work in
this thesis.

2.1 Machine Learning on Graph-structured Data


As a computational structure designed to extend deep learning techniques to the non-
Euclidean domain of graphs (Chapter 1), GNNs deal with complex data structures with
irregular connectivity, such as social networks, molecular graphs, and recommendation
systems. Understanding the significance of GNNs requires an exploration of their
capabilities, challenges, and underlying architectures.
One of the significant advantages of GNNs lies in their ability to effectively process


Figure 2.1: Graphs vs images.

graph-structured data, which CNNs and RNNs struggle to handle. We can think of an
image as essentially an array of pixels with a highly structured connectivity pattern
(Fig. 2.1). Each pixel in an image has neighboring pixels to its east, west, north, and
south, forming a structured graph-like arrangement. In general, graphs do not have this
inherent structured nature, and each node can have an arbitrary number of neighboring
nodes, with no specific order or regularity. CNNs and RNNs typically rely on ordered
feature stacking, which becomes redundant and inefficient when applied to graph inputs.
In contrast, GNNs propagate information on each node individually, disregarding the
input order and producing invariant outputs, thus overcoming the inherent limitations
of ordered processing.
Despite their potential, GNNs also pose unique challenges that researchers must
address. The irregular structures of graphs, coupled with the large scale of real-world
graph data, present significant computational and scalability challenges. Additionally,
the wide variety of graphs encountered in different domains necessitates the development
of flexible and adaptable GNN architectures capable of accommodating diverse data
types and structures.
In addressing these challenges, researchers have classified GNNs into two main
categories: spectral-based and spatial-based approaches. Early approaches [23, 24] have
used spectral-based methods to define graph convolutions using filters inspired by graph
signal processing, interpreting the convolutional operation as noise removal from graph

Figure 2.2: 2-D convolution versus graph convolution.

signals. On the other hand, spatial-based approaches [2, 25, 26] focus on aggregating
information from neighboring nodes to define graph convolutions. Just as a filter is
applied to a set of pixels in an image, graph convolution involves aggregating information
from neighboring vertices within the vicinity of a node. This weighted averaging process
enables GNNs to effectively process graph-structured data.
In Fig. 2.2, a comparison is drawn between performing 2D convolution on image data,
which can be seen as a specialized form of a graph, and conducting convolution on a
general graph. In the image representation, each pixel is connected to adjacent pixels in
the east, west, north, and south directions, forming a structured graph-like arrangement.
The highlighted area in the figure represents a filter applied to the set of vertices (or
pixels in the case of an image). However, due to the irregular distribution and ordering of
the neighbors of the vertices, a general graph may not be easily embeddable into a 2D planar
format. As a result, performing convolution and pooling operations on such graphs is
not as straightforward as in CNNs. In graph convolution, by contrast, information
from neighboring vertices within the neighborhood of a node is aggregated through a
weighted average, mirroring the process of 2D convolution on images. The advent of
GNN models such as graph convolutional networks (GCNs) [25] has bridged the gap
between spectral and spatial approaches, leading to rapid advancements in spatial-based
methods.

2.2 Types of GNNs


In layer l of a GNN, each vertex i in the graph is represented by an F^l-dimensional row
vector, h_i^l, called the vertex feature vector; h_i^0 is the input vertex feature vector. For each
vertex i in a layer, over a set of neighboring vertices j, the GNN aggregates information
from the vectors h_j^{l-1} of the previous layer, and processes it to create the output feature
vector, h_i^l.

Table 2.1: Summary of operations in layer l of various GNNs.

GCN:        h_i^l = σ( Σ_{j ∈ {i} ∪ N(i)} (1/√(d_i d_j)) h_j^{l-1} W^l )
GraphSAGE:  h_i^l = σ( a_k( h_j^{l-1} W^l  ∀ j ∈ {i} ∪ S_{N(i)} ) )
GAT:        h_i^l = σ( ( Σ_{j ∈ {i} ∪ N(i)} exp(e_{ij}) h_j^{l-1} W^l ) / ( Σ_{j ∈ {i} ∪ N(i)} exp(e_{ij}) ) ),
            where e_{ij} = LeakyReLU( a^T · [h_i^{l-1} W^l || h_j^{l-1} W^l] )
GINConv:    h_i^l = MLP^l( (1 + ϵ^l) h_i^{l-1} + Σ_{j ∈ N(i)} h_j^{l-1}, W^l, b^l )
Regardless of the type, GNNs have two major computational steps in common:
(i) Weighting multiplies the vertex feature vector, h_i^{l-1}, of each vertex by a weight
matrix, W^l, of dimension F^{l-1} × F^l. (ii) Aggregation combines the weighted vertex
feature vectors of the vertices neighboring vertex i. Table 2.1 shows the Weighting and
Aggregation operations for GCNs [25], GraphSAGE [26], graph attention networks
(GATs) [2], and GINConv [27]. For Aggregation, if N(i) is the immediate one-hop
neighborhood of vertex i, then for GCNs, GATs, and GINConv the aggregation is performed
over {i} ∪ N(i), while for GraphSAGE it is performed over {i} ∪ S_{N(i)}, where S_{N(i)} is
a random sample of N(i). At vertex i, the aggregation operation performed by the various
GNNs is summarized below:
GCNs: Each product h_j^{l-1} W^l, j ∈ {i} ∪ N(i), is multiplied by 1/√(d_i d_j) (d_* is the vertex
degree). The results are summed.
GraphSAGE: The products h_j^{l-1} W^l are combined over j ∈ {i} ∪ S_{N(i)} using an aggregator a_k
(typically, mean or pooling).
GATs: For each edge (i, j), an inner product with a learned attention vector a^l yields
the normalized attention coefficient

    α_{ij} = softmax( LeakyReLU( (a^l)^T · [h_i^{l-1} W^l || h_j^{l-1} W^l] ) )

followed by the weighted aggregation Σ_{j ∈ {i} ∪ N(i)} α_{ij} h_j^{l-1} W^l.
GINConv: The vertex feature vectors of all neighbors of a vertex i are summed and
added to (1 + ϵ^l) times the vertex feature vector of i, where ϵ^l is a learned parameter; the
result is passed through a multilayer perceptron (MLP) with weights W^l and bias b^l:

    h_i^l = MLP^l( (1 + ϵ^l) h_i^{l-1} + Σ_{j ∈ N(i)} h_j^{l-1}, W^l, b^l )    (2.1)

The activation operator σ (softmax or ReLU) is applied to the aggregated weighted
vertex feature vector, yielding the updated h_i^l. For GINConv, activation is built into the
MLP.
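To make the Weighting and Aggregation steps in Table 2.1 concrete, the following sketch (a hypothetical NumPy model on a toy four-vertex graph, not tied to any accelerator) computes one GCN layer and the GAT attention-weighted aggregation using the formulas above; the graph, dimensions, and helper names are illustrative assumptions.

```python
# Minimal NumPy sketch (not the accelerator implementation) of one GCN layer
# and one GAT layer from Table 2.1. Toy graph and sizes are assumptions.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gcn_layer(h, W, neighbors):
    """h: |V| x F^{l-1} features, W: F^{l-1} x F^l weights, neighbors[i] = N(i)."""
    hw = h @ W                                   # Weighting: h_j^{l-1} W^l
    deg = np.array([len(n) + 1 for n in neighbors], dtype=float)  # with self-loop
    out = np.zeros_like(hw)
    for i, nbrs in enumerate(neighbors):         # Aggregation over {i} ∪ N(i)
        for j in list(nbrs) + [i]:
            out[i] += hw[j] / np.sqrt(deg[i] * deg[j])
    return relu(out)

def gat_layer(h, W, a, neighbors):
    """a: attention vector of length 2*F^l; softmax taken over {i} ∪ N(i)."""
    hw = h @ W
    F = hw.shape[1]
    a1, a2 = a[:F], a[F:]
    out = np.zeros_like(hw)
    for i, nbrs in enumerate(neighbors):
        idx = list(nbrs) + [i]
        e = np.array([np.dot(a1, hw[i]) + np.dot(a2, hw[j]) for j in idx])
        e = np.where(e > 0, e, 0.2 * e)          # LeakyReLU
        ex = np.exp(e - e.max())
        alpha = ex / ex.sum()                    # normalized attention coefficients
        for a_ij, j in zip(alpha, idx):
            out[i] += a_ij * hw[j]               # weighted aggregation
    return relu(out)

neighbors = [[1, 2], [0], [0, 3], [2]]           # toy 4-vertex graph
h0 = np.random.rand(4, 8)                        # input feature vectors h_i^0
W1 = np.random.rand(8, 4)
a1 = np.random.rand(2 * 4)
print(gcn_layer(h0, W1, neighbors).shape)        # (4, 4)
print(gat_layer(h0, W1, a1, neighbors).shape)    # (4, 4)
```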
GINConv concatenates the sum of all vertex feature vectors across all layers to obtain
a representation for the graph as

    h_G = ||_{l=1}^{L} ( Σ_{i ∈ G} h_i^l )    (2.2)

DiffPool [28] can be combined with any of these GNNs to reduce the volume of data.
It uses two GNNs, one to extract vertex embeddings for graph classification, and one
to extract embeddings for hierarchical pooling. The embedding GNN at layer l is a
standard GNN with Weighting and Aggregation,

    Z^{l-1} = GNN_embed(A^{l-1}, X^{l-1})    (2.3)

where Al−1 is the adjacency matrix of the coarsened graph at level (l − 1), and X l−1 is
the matrix of input cluster features. The pooling GNN generates the assignment matrix:

    S^{l-1} = softmax( GNN_pool(A^{l-1}, X^{l-1}) )    (2.4)

The number of clusters in layer l is fixed during inference. The coarsened adjacency
matrix is A^l = S^{(l-1)T} A^{l-1} S^{l-1}, and the new embedding matrix is X^l = S^{(l-1)T} Z^{l-1}.
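A minimal sketch of one DiffPool coarsening level, following equations (2.3)–(2.4); the two single-layer transforms below are stand-ins for GNN_embed and GNN_pool, and all sizes are assumed.

```python
# Hypothetical NumPy sketch of one DiffPool coarsening level; the "GNNs" are
# placeholder single-layer transforms, not the actual embedding/pooling GNNs.
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gnn_stub(A, X, W):
    return np.maximum(A @ X @ W, 0.0)            # Aggregation then Weighting (ReLU)

n, f, f_out, n_clusters = 6, 5, 4, 3             # assumed toy sizes
A = (np.random.rand(n, n) > 0.6).astype(float)
A = np.maximum(A, A.T)                           # symmetric adjacency
X = np.random.rand(n, f)

Z = gnn_stub(A, X, np.random.rand(f, f_out))                       # Eq. (2.3)
S = softmax_rows(gnn_stub(A, X, np.random.rand(f, n_clusters)))    # Eq. (2.4)

A_coarse = S.T @ A @ S                           # A^l = S^{(l-1)T} A^{l-1} S^{l-1}
X_coarse = S.T @ Z                               # X^l = S^{(l-1)T} Z^{l-1}
print(A_coarse.shape, X_coarse.shape)            # (3, 3) (3, 4)
```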
Chapter 3

GNNIE

3.1 Introduction
Deep learning accelerators have largely focused on data with Euclidean embeddings,
e.g., audio/video/images/speech. Many real-world problems (e.g., network analysis,
embedded sensing, e-commerce, drug interactions) use graphs to model relationships.
Inferencing on large, unstructured, and sparse graphs with non-Euclidean embeddings
requires specialized GNNs. Today’s GNNs [2, 25–27] are based on nearest-neighbor
operations, with improved efficiency over early methods [23–25].
Multilayer GNN inference engines perform two computation steps per layer, as
outlined in Section 2.2:
(a) Weighting performs a linear transform of vertex feature vectors through multipli-
cation by a weight matrix.
(b) Aggregation consolidates information from the neighbors of a vertex to compute
the feature vectors for the next layer.
The challenges of building efficient GNN accelerators for inference tasks are as follows:
(1) Versatility An accelerator should be able to handle a diverse set of GNN architectures
in order to provide appropriate computation/accuracy tradeoff points for various
applications. The achievable accuracy depends on the GNN: GATs
achieve higher accuracy than other GNNs, but with more computation (Fig. 3.1).
(2) Adjacency matrix sparsity The graph adjacency matrix encodes vertex neighborhood
information required for Aggregation. The adjacency matrix is highly sparse (> 99.8%


Figure 3.1: GNN accuracy comparison (data from [2], PPI dataset).

Figure 3.2: Nonzero histogram for input vertex feature vectors (Cora).

for all datasets in this thesis; in contrast, DNN data shows 10%–50% sparsity). Unlike
image/video data, adjacency matrix sparsity patterns typically exhibit power-law behavior,
with vertex degrees ranging from very low (for most vertices) to extremely high (for very
few vertices): in the Reddit dataset, 11% of the vertices cover 88% of all edges.
(3) Input feature vector sparsity The vertex input feature vectors are highly sparse, e.g.,
the 2708 input vertex feature vectors of the Cora dataset have 98.73% average sparsity.
In Fig. 3.2, Region A is sparser than B and requires less computation, leading to load
balancing issues during Weighting.
(4) Memory footprint and random-access patterns Real-world graphs have a large number
of vertices and a massive memory footprint (Reddit: 2.3Gb in sparse format). High
sparsity and power-law distributions can lead to random memory access patterns and
poor data access locality in Aggregation.
Therefore, GNN-specific accelerators must address:
(a) load balancing during Weighting, due to the sparsity variations in Fig. 3.2, and during
Aggregation, due to the imbalance of computations for high- and low-degree vertices.
(b) lightweight graph-specific caching of the adjacency matrix for high data access locality
and maximal reuse of cached data.
In this chapter we present GNNIE (pronounced “genie”), a versatile and high-
performance accelerator designed to handle inference tasks on diverse GNN architectures.
This chapter highlights significant advancements of GNNIE in inference acceleration for
static graphs (where the graph topology and vertex features do not change over time),
addressing the critical challenges and outperforming existing solutions.
Relation to other acceleration engines: The Weighting step performs matrix-
vector multiplication which resembles CNN computations, but CNN accelerators [3–9]
are inefficient at handling graph data. Aggregation operates on graph neighborhoods
and resembles graph analytics, but graph processing accelerators [29–31] are designed
to perform lightweight computations, significantly lower than the needs of a GNN.
Extensions of CNN/graph processing engines are inadequate.
An early GNN accelerator, HyGCN [10], bridges the divide by using two pipelined
engines: an Aggregation engine that operates on graph data and consolidates vertex
feature vectors from the neighborhood of each vertex, followed by a Combination engine,
which uses a multilayer perceptron to weight the aggregated features with the weight
matrix. The disparity between engines raises challenges in providing a steady stream of
data to keep the Aggregation/Combination engine pipeline busy. The Aggregation engine
does not account for power-law behavior while caching partial results, and high-degree
vertices may create stalls due to the limited size of on-chip buffers. In the Combination
engine, the aggregated feature vectors are both sparse and show high sparsity variations
(Fig. 3.2). Consequently, stalls are required, leading to inefficiency.
AWB-GCN [11] views the GNN computation as two consecutive sparse-dense matrix
multiplications (SpMMs). During Weighting, the method is targeted to moderate
sparsity of 75% – but input layer vertex feature vectors are ultra-sparse (Fig. 3.2).
During Aggregation, the graph-agnostic SpMM view necessitates numerous expensive off-
chip accesses to the adjacency matrix. AWB-GCN addresses workload imbalance issues
through multiple rounds of runtime load-rebalancing, but this leads to high inter-PE
communication. Finally, SpMM-based approaches face more severe load imbalances for
implementing GNNs that involve additional complex computations before Aggregation
(e.g., softmax in GATs and DiffPool). In fact, AWB-GCN targets only GCNs and not
general GNNs. SCV-GNN [32], a matrix multiplication-based approach, proposes a
sparse compressed vector format accompanied by a processing order that aids parallelism
and reduces workload imbalance.
Novelty of this work: GNNIE uses a single engine that efficiently performs both
Weighting and Aggregation. The GNNIE framework handles high levels of sparsity in
the input vertex feature vectors and the adjacency matrix, with novel approaches for
load balancing and graph-specific caching. It covers a number of GNN topologies, from
lower accuracy/lower computation (e.g., GCN, GraphSAGE) to higher accuracy/higher
computation (e.g., GATs), as motivated in Fig. 3.1, and is more versatile than previous
methods in handling functions such as softmax over a neighborhood (e.g., as used for
attention normalization in GATs; prior work [33] on GATs skips this crucial step).
Novel methods to mitigate sparsity effects, and overcome load imbalances and
compute bottlenecks, include:

• Load balancing during Weighting based on splitting vertex features into blocks
(Section 3.3.1). Together with load balancing (Section 3.3.3), this enhances throughput
during Weighting by ensuring high PE utilization and skipping unnecessary com-
putations, by (a) Reordering computations on a flexible MAC (FM) architecture to
address imbalances due to input feature vector sparsity variations. Computations
are dynamically mapped to heterogeneous PEs, each with different numbers of MAC
units. (b) Static load redistribution to nearby PEs, offloading computations from
heavily-loaded to lightly-loaded rows, minimizing inter-PE communication.

• Load-balanced edge Aggregation (Section 3.4) through a mapping scheme that fully
utilizes the PEs. For GATs, we further propose a novel linear-complexity computation
that implements compute-bound attention vector multiplication similarly as Weighting,
and memory-bound attention coefficient computation to maximize reuse of cached
data.

• Lightweight graph-specific dynamic caching (Section 3.5), fetching vertices in


unprocessed degree order; aggregation operates on dynamic subgraphs formed by
cached vertices. This new lightweight scheme is effective in avoiding the random
DRAM accesses that plague graph computation.

Speedups: On five GNN datasets, results based on an RTL implementation and


simulation show that even including all of its preprocessing overheads, GNNIE delivers
average speedups of 7197× over CPUs (Intel Xeon Gold 6132 + PyTorch Geometric),
17.81× over GPUs (NVIDIA Tesla V100S-PCIe + PyTorch Geometric), and 5× over prior
work.

3.2 Accelerator Architecture


The block diagram of the proposed accelerator is illustrated in Fig. 3.3, and it consists
of the following key components:
(1) HBM DRAM: The high-bandwidth memory (HBM) DRAM stores information
about the graph. The adjacency matrix of the graph represents its connectivity infor-
mation and is stored in sparse compressed sparse row (CSR) format. Other formats
(CISR [34], C2 SR [35], CISS [36]) are not viable candidates as they ignore the underlying
graph structure: GNNIE uses adjacency matrix connectivity information to schedule
computations and is not a matrix multiplication method.
The sparse input vertex feature vectors are encoded using run-length compression
(RLC) [37]. We choose RLC because it is lossless and the decoder has low power/area
overhead: this is important because it is only used for the input layer and not thereafter.
Alternatives such as CISS have much higher implementation overhead and have been
targeted to lock-step systolic arrays, which are unsuitable for Weighting due to the
insertion of stalls to handle feature vector sparsity variations.
The DRAM is also used to store intermediate results that do not fit in on-chip memory.
High bandwidth options such as HBM or GDDR6 are viable for edge AI [38, 39].
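As an illustration of the run-length compression used for the sparse input-layer feature vectors, the following is a toy software model of RLC encoding and decoding; the accelerator's actual bit-level format is not specified here, so the (zero-run, value) pairing is an assumption.

```python
# Toy run-length compression of a sparse feature vector: each nonzero value is
# stored together with the count of zeros preceding it. This only models the
# idea of RLC; the on-chip encoding details are not described in this chapter.
def rlc_encode(vec):
    pairs, run = [], 0
    for x in vec:
        if x == 0:
            run += 1
        else:
            pairs.append((run, x))   # (zero-run length, nonzero value)
            run = 0
    return pairs, len(vec)

def rlc_decode(pairs, length):
    out, pos = [0] * length, 0
    for run, x in pairs:
        pos += run
        out[pos] = x
        pos += 1
    return out

v = [0, 0, 3, 0, 0, 0, 5, 0, 1]
enc, n = rlc_encode(v)
print(enc)                           # [(2, 3), (3, 5), (1, 1)]
assert rlc_decode(enc, n) == v
```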
(2) Memory interface: The input buffer stores vertex features for one pass of the current
layer l, i.e., h_i^{l-1} for vertices i being processed, and the edge connectivity information
of the subgraph. Double-buffering is used to reduce DRAM latency: off-chip data is
fetched while the PE array computes.
Sparse data is transmitted from off-chip DRAM to the input buffer using RLC
encoding. The input buffer keeps this data in RLC format until it is ready for use, when

Figure 3.3: Block diagram of GNNIE.

the data is sent through the RLC decoder to the PE array. The RLC decoder is activated
for sparse input layer vertex feature vectors, and bypassed for denser feature vectors in
later layers.
The output buffer caches intermediate results for vertex feature vectors, including
the result of multiplication by W l after Weighting, and the result after Aggregation.
The end result is written to off-chip memory. The weight buffer holds the values of the
weight matrix W l during Weighting, and, for GAT computations, the attention vector
during Aggregation.
The memory access scheduler coordinates off-chip memory requests from the in-
put/output/weight buffers.
(3) An array of processing elements (PEs): The array consists of an M × N array
of computational processing elements (CPEs). Each CPE has two scratch pads (spads) and
multiply-accumulate (MAC) units.
Within the array of CPEs, we merge multiple columns of Special Function Units
(SFUs) (e.g., exp, LeakyReLU, division) [grey blocks], and a row of merge PEs (MPEs)
[red blocks]. Interleaved placement allows low latency and communication overhead
with CPEs. For exponentiation, we use an accurate, low-area lookup-table-based
implementation [40].
Merge PEs (MPEs) are used to aggregate partial results of vertex features sent from
the CPE array during Weighting and Aggregation. One MPE is dedicated for each CPE
column in the array (Fig. 3.3), for merging the partial results of vertices. Since these
partial results may belong to different vertices we use 16 wires, i.e., one for each CPE,
while sending the partial results to the MPEs. A tag is sent along with each partial
result to indicate the vertex that the partial result is associated with.
The partial results, along with the tags, are received from the CPEs and stored in
the update spad of the MPE. The update spad can hold 16 such partial results and
corresponding tags received from the 16 CPEs of the column. If the tags of partial
results match (i.e., if they belong to the same vertex), they are sent to one of the 16
accumulators in the accumulator bank of the MPE to be merged. The result is stored in
one of the 16 psum spads with the corresponding tag. The intermediate result stored in
the psum spad may be brought into the accumulator again if a partial result with the
same tag is found in the update spad. Following the same procedure, these values are
summed and the result is stored in the psum spad. After merging of the partial results
in the update spad is completed, the psum spads send the results and tag to the output
buffer.
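The tag-matching accumulation performed by an MPE can be modeled behaviorally as below; the Python dictionaries stand in for the update/psum spads and accumulator bank, and the release condition (one partial result per CPE row) is an assumption made for this sketch.

```python
# Behavioral model of an MPE: partial results arrive from the 16 CPEs of a
# column tagged with their vertex ID; matching tags are accumulated, and a
# vertex's value is released to the output buffer once all of its partial
# results have been merged. Sizes and names are illustrative only.
from collections import defaultdict

NUM_CPE_ROWS = 16

def mpe_merge(partial_stream):
    """partial_stream: iterable of (vertex_tag, partial_value) tuples."""
    psum = defaultdict(float)     # psum spads, keyed by vertex tag
    count = defaultdict(int)      # partials merged so far per tag
    completed = {}                # results sent to the output buffer
    for tag, value in partial_stream:
        psum[tag] += value        # accumulator bank: merge matching tags
        count[tag] += 1
        if count[tag] == NUM_CPE_ROWS:
            completed[tag] = psum.pop(tag)
            del count[tag]
    return completed

# Example: partial results for two vertices arriving interleaved
stream = [(v, 0.5) for _ in range(NUM_CPE_ROWS) for v in (7, 3)]
print(mpe_merge(iter(stream)))    # {7: 8.0, 3: 8.0}
```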
(4) The Activation unit performs an activation operation on the vertex features at the
final activation stage of computation.
(5) The controller coordinates operations, including assigning vertex features to the
CPE, workload reordering among the CPEs, sending CPE results to the MPEs, sending
MPE data to the output buffer, and writing concatenated MPE data.
For a GCN, the layer-wise computation can be written as:

    h_i^l = σ( Ã h_i^{l-1} W^l )    (3.1)

Here, Ã = D^{-1/2}(A + I)D^{-1/2} is the normalized adjacency matrix, I is the identity matrix,
and D_ii = Σ_j A_ij. This can be computed either as (Ã × h_i^{l-1}) × W^l or Ã × (h_i^{l-1} × W^l).
The latter requires an order of magnitude fewer computations than the former [11, 41],
and we use this approach. Moreover, as Ã is highly sparse and shows power-law behavior,
we will perform edge-based Aggregation with optimized graph-specific cache replacement
policies to limit off-chip accesses.
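A back-of-the-envelope sketch of why the Ã × (h^{l-1} × W^l) ordering is preferred: it counts multiply-accumulate operations for both orderings with assumed, roughly Reddit-sized dimensions (the numbers below are illustrative, not measured results).

```python
# MAC counts for the two evaluation orders of Eq. (3.1), treating à as a
# sparse |V| x |V| matrix with nnz ≈ |E|. Sizes are assumed for illustration.
V, E = 233_000, 114_000_000    # |V|, |E| (assumed example sizes)
F_in, F_out = 602, 64          # F^{l-1}, F^l

# (Ã x h^{l-1}) x W^l : aggregate full-length features first, then weight them
macs_agg_first = E * F_in + V * F_in * F_out

# Ã x (h^{l-1} x W^l) : weight first, then aggregate shorter F^l-length vectors
macs_weight_first = V * F_in * F_out + E * F_out

print(f"aggregate-first: {macs_agg_first / 1e9:.1f} G MACs")
print(f"weight-first   : {macs_weight_first / 1e9:.1f} G MACs")
# The gap widens further once the high sparsity of the input feature vectors
# is exploited during Weighting.
```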

3.3 Mapping Weighting to Computational Processing Ele-


ments

3.3.1 Scheduling Operations in the Computational Processing Elements

We now map the Weighting step, which multiplies the sparse feature row vector h_i^{l-1}
with the dense weight matrix W^l, to the architecture. The feature vectors are fetched
from DRAM, core computations are performed in the CPEs, and the results from the
CPEs are assimilated in the MPEs before being written back to DRAM. Our novel
scheduling methodology keeps the CPEs busy during the computation, so that Weighting
is not memory-bounded. We partition data in two ways (Fig. 3.5):
(1) Across the vertex feature vector: We process a block of k elements of h_i^{l-1}
at a time, multiplying it by the corresponding k rows of W^l. This is mapped to a
row of the CPE array. With a block size of k = ⌈F^{l-1}/M⌉, the entire feature vector is
processed in the CPE array.


(2) Across vertices: We process feature vectors for a set of s vertices at a time in the
PE array, as shown in Fig. 3.5, where s is constrained by the size of the input buffer. To
process all vertices in the graph G(V, E), we process ⌈|V|/s⌉ sets as:

    h_i^{l-1} W^l = [ Σ_{i=0}^{N-1} h_{(0:k-1)}^{l-1} W_{(0:k-1),i}^l ,  Σ_{i=0}^{N-1} h_{(k:2k-1)}^{l-1} W_{(k:2k-1),i}^l ,
                      ⋯ ,  Σ_{i=0}^{N-1} h_{((M-1)k:F^{l-1})}^{l-1} W_{((M-1)k:F^{l-1}),i}^l ]    (3.2)

where the term in each sum is processed in a separate CPE.
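The blockwise decomposition of Eq. (3.2) can be sketched as follows; the array shape, block size, and toy dimensions are assumptions, each per-block product corresponds to the work of one CPE, and the final accumulation models the MPE merge.

```python
# Illustrative NumPy sketch of blockwise Weighting: the feature vector is cut
# into M blocks of k elements, each block multiplies the matching k rows of
# W^l, and the per-block partial products (one per CPE) are then accumulated.
import numpy as np

M, N = 4, 4                     # CPE array shape (rows x columns), assumed
F_in, F_out = 16, N             # one pass covers N columns of W^l
k = F_in // M                   # block size along the feature vector

h = np.random.rand(F_in)        # one vertex feature vector h_i^{l-1}
W = np.random.rand(F_in, F_out)

partial = np.zeros((M, F_out))
for m in range(M):              # CPE row m holds rows mk:(m+1)k of W^l
    blk = h[m * k:(m + 1) * k]
    if not blk.any():           # zero-block detection: skip all-zero blocks
        continue
    partial[m] = blk @ W[m * k:(m + 1) * k, :]

result = partial.sum(axis=0)    # MPE-style accumulation over the M blocks
assert np.allclose(result, h @ W)
```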


We use a weight-stationary scheme (Fig. 3.4). Each vertex goes through Weighting
set by set, placing k-element blocks of the vertex feature vectors for each set into the
input buffer.
We fetch N columns of the weight matrix, W l , from the DRAM to the weight buffer.
A pass processes all vertex feature vectors (i.e., processing all vertices in all sets). As
shown in Fig. 3.5, we multiply the vertex feature vectors in all sets with N columns of

Figure 3.4: Weight-stationary linear transformation of vertex features.

W l in the pass. At the end of a pass, the next set of N columns of W l is loaded. After
all passes are completed, the current set of weights is replaced by a new set, and the
process continues under the weight-stationary scheme. Within each pass, the CPEs are
loaded as follows:

• Each column of W l is loaded to a CPE column in chunks of k rows, i.e., W(ik:(i+1)k−1,j)


is loaded into CPE (i, j).

• For a given set of s vertices, the ith subvectors, of size k, of all s vertex feature vectors
are broadcast to the entire CPE row i using a bus. This is indicated by h in Fig. 3.4.
Since the CPEs in a row work independently of each other and CPE rows do not talk
to each other during the Weighting phase, we do not require a complex interconnection
scheme: since all CPEs in a row are assigned the same feature vector blocks of length
k, we use a bus-based interconnection to broadcast this data to a CPE row.

To leverage input data sparsity, a zero detection buffer is used to detect whether a
k-element block that is to be broadcast contains zeros only, so that these computations
can be skipped. In case such a block is detected we refrain from broadcasting it to the
CPE row. We place zero detection circuitry at the output of the RLC decoder (Fig. 3.3),

Figure 3.5: Mapping Weighting operations to the CPE array.

at a stage after the k-element blocks are created. The zero-detection function uses a set
of OR gates and has minimal hardware overhead.
Benefit of using vertex feature subvector blocks: Our use of k-element blocks
instead of the entire vector allows a CPE to skip zero subvectors during pipelined
execution and immediately move on to a block from the next available subvector. The
next block will be fetched from the input buffer, and under the weight-stationary scheme,
it can start computation with the already-loaded weights in the CPE.
The proposed weight-stationary dataflow maximizes the reuse of the weights cached
in the weight buffer, which in turn reduces the size requirement of the on-chip weight
buffer. Though the feature vectors fetched in the input buffer are get reused, for all
datasets evaluated, the computation time for vertices cached in the input buffer is seen
to be larger than the memory fetch time under the HBM 2.0 off-chip bandwidth.

3.3.2 The Merge Computational Processing Element

The MAC operation within each CPE generates a partial result for an element of the
transformed vertex features. This is sent to the MPE in its column for accumulation
over the vertex feature subvectors, along with a tag that denotes its vertex. Due to
the irregular completion times for the CPEs, the MPE may accumulate partial sums
for several vertices at a time. A bank of psum buffers holds the partially accumulated
results: when all partial sums are accumulated for a vertex feature vector, the MPE
sends the result to the output buffer, along with the vertex ID i: this is one element of
the result of multiplying the feature vector of vertex i and W l . When all F l elements
are computed, the result is written back to DRAM.
After a CPE column processes all feature blocks for all vertices, the next pass begins.
The weights in that column are replaced with the next column of weights from W l .
To overlap computations and keep the CPEs busy, we use double-buffering to fetch
the next block of weights from the DRAM to the chip while the CPEs perform their
computations.

3.3.3 Load Balancing for Weighting

The Weighting computation skips zeros in the vertex feature vector. Vertex feature
vectors in the input layer have different sparsity levels (e.g., in Regions A and B of
Fig. 3.2), and this is also true of the k-subvectors. Hence, some k-subvectors are processed
rapidly (“rabbits”) while others take longer (“turtles”). This causes workload imbalance
in the CPE array.
The MPEs that accumulate the results of the CPEs must keep track of psums from
a large number of vertices, but they have only limited psum slots for accumulating
information. The rabbit/turtle disparity implies that stalls may have to be introduced to
stay within the limits of available psum memory in the MPE. As results are accumulated
in the output buffer, a larger number of vertex feature vectors must be stored within the
buffer, waiting to be completed and written to the DRAM, to account for the disparity
between rabbits and turtles.
Flexible MAC (FM) Architecture: We can avoid stalls and speed up computa-
tion with more MACs per CPE. Increasing the number of MACs per CPE uniformly
throughout the array overcomes the bottleneck of “turtles,” but is overkill for “rabbits.”
Our flexible MAC architecture uses a heterogeneous number of MAC units per CPE
in different rows of the array. The CPE array is divided into g row groups, each with
an equal number of rows; the number of MACs per CPE, |MAC|_i, is monotonically
nondecreasing from the first row to the last, i.e., |MAC|_1 ≤ |MAC|_2 ≤ · · · ≤ |MAC|_g.
The input buffer has a scheduler that assigns vertex feature vectors to CPE rows. The
scheduler uses information about the total nonzero workload for each k-element block of
the vertex feature vector to assign the workload to CPE rows. The workloads for the
k-element blocks are first binned based on the number of nonzeros, where the number of
bins equals the number of CPE groups. Workload binning is carried out as a preprocess-
ing step in linear time on a CPU. The bin with fewest nonzeros is sent to the first CPE
group with fewest MACs, and so on; the bin with the most nonzeros is sent to the last
CPE row group with the most MACs. After workload binning, each block in a bin is
assigned an ID that denotes the CPE row to which it should be broadcast. The scheduler
receives these block IDs for the k-element blocks of each feature vector from the host
CPU. The input buffer is connected to the embedded scheduler through one port which
fetches the block ID information for each feature vector as they are sent over to RLC
decoder and eventually the k-element feature vector blocks are broadcast to a CPE row
according to their IDs. Since the assignment of IDs to k-element blocks is computed as
a preprocessing step, the scheduler does not require any runtime information. The total
preprocessing overheads (which include the preprocessing overheads for the linear time
binning of k-element blocks) for the four datasets used in our experiment are shown in
Table 3.3. For the Cora, Citeseer, Pubmed, and Reddit datasets, the preprocessing times
required for the binning of k-element blocks are, respectively, 5.5%, 4.8%, 3.4%, and
0.7% of the total inference time. It should also be noted that this percentage overhead
is lower for the larger datasets (Reddit (233K vertices) has a lower percentage overhead
than Cora (2.7K vertices)), indicating the scalability of the solution.
An example of workload reordering among CPE rows is shown in Fig. 3.6. The CPE
array is divided into three groups, Group 1, 2, and 3, where Group i is equipped with
|MAC|_i MACs per CPE and |MAC|_1 < |MAC|_2 < |MAC|_3. The vertex feature
blocks are binned into three bins that will be assigned to each group. Each bin has
several vertex feature blocks: the vertex feature blocks in the left-most bin have the
most nonzeros (six), and those of the right-most bin have the fewest nonzeros (four).
We see that the least populated bin is assigned to the group with the fewest MACs, the
next to the group with the next number of MACs, and so on.
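As a minimal sketch of the linear-time preprocessing described above, the host-side binning can be written as follows; the equal-width bin boundaries and the toy nonzero counts are illustrative assumptions (the text only specifies binning by nonzero count):

    # Linear-time binning of k-element feature blocks on the host CPU. Blocks
    # with more nonzeros are routed to CPE row groups with more MACs.
    def bin_blocks(nnz_per_block, num_groups, k):
        """Return a CPE row-group ID per block, in O(#blocks) time."""
        bin_width = k / num_groups                 # illustrative equal-width bins
        ids = []
        for nnz in nnz_per_block:
            g = min(int(nnz // bin_width), num_groups - 1)
            ids.append(g)                          # 0 = fewest MACs, last = most MACs
        return ids

    # Toy example in the spirit of Fig. 3.6: three groups, k = 8
    nnz = [6, 4, 5, 6, 4, 5]
    print(bin_blocks(nnz, num_groups=3, k=8))      # [2, 1, 1, 2, 1, 1]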
Load Redistribution (LR): The FM approach does not completely balance the
workload. For greater uniformity, we redistribute loads among nearby CPEs. Based on
workload distribution in CPE rows, the controller selects pairs of CPE rows to perform
workload redistribution, offloading a portion of workload from heavily loaded to lightly
loaded CPE rows.

Figure 3.6: Workload reordering in flexible MAC (FM) approach.

To perform computation on the offloaded workloads, the weights must be transferred


with the data. To minimize communication overhead, we first finish the computation
in FM, to the point where the current weights are no longer needed, before applying
LR. The spads for weights in these CPE rows are loaded with weights for the offloaded
workloads.
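A minimal sketch of the LR pairing is shown below; the heaviest-with-lightest pairing rule and the offload fraction are illustrative assumptions, since the controller's exact selection policy is not spelled out here:

    # Pair heavily loaded CPE rows with lightly loaded ones and offload part of
    # the remaining work (the weights for the offloaded blocks are reloaded into
    # the lightly loaded row's spads).
    def make_lr_pairs(row_loads):
        order = sorted(range(len(row_loads)), key=lambda r: row_loads[r])
        half = len(order) // 2
        return list(zip(reversed(order[half:]), order[:half]))   # (heavy, light)

    def redistribute(row_loads, fraction=0.5):
        loads = list(row_loads)
        for heavy, light in make_lr_pairs(loads):
            surplus = (loads[heavy] - loads[light]) * fraction
            loads[heavy] -= surplus
            loads[light] += surplus
        return loads

    print(redistribute([10, 12, 30, 44]))          # [27.0, 21.0, 21.0, 27.0]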

3.4 Aggregation Computations


For most GNNs in Section 2.2, Aggregation is a simple summation over the neighbors of
the vertex, but GATs require significantly more computation in determining attention
coefficients, which are used for weighted aggregation. The first two subsections focus
on GAT-specific computations. We then consider Aggregation operations that affect all
GNNs.

3.4.1 Reordering for Linear Computational Complexity

We present a new method for reordering GAT computations for efficient hardware
implementation (Fig. 3.7). We define the weighted vertex attention vector for vertex p
as ηw_p^l = h_p^{l-1} W^l. The first step in finding the attention coefficient α_ij for neighboring

Figure 3.7: Reordering of GAT computations.

vertices i and j, is to multiply the 2F^l-dimensional attention vector, a^l, by a concatenation
of two F^l-dimensional weighted vertex feature vectors, (ηw_i^l, ηw_j^l).
Rewriting a^l = [a_1^l a_2^l], where a_q^l is the subvector that multiplies ηw_q^l, we can denote
this inner product as

e_ij = (a_1^l)^T · ηw_i^l + (a_2^l)^T · ηw_j^l = e_{i,1} + e_{j,2}        (3.3)

where e_{i,1} = (a_1^l)^T · ηw_i^l and e_{j,2} = (a_2^l)^T · ηw_j^l. This goes through a LeakyReLU and then a
softmax over all neighbors of i to find the normalized attention coefficient,

α_ij = softmax(LeakyReLU(e_ij))        (3.4)

As shown in Fig. 3.7, a naïve approach would fetch ηw_j^l from each neighbor j of i,
compute e_ij using (3.3), and perform a softmax to find α_ij. However, since e_{j,2} is required
by every vertex for which j is a neighbor (not just i), this would needlessly recompute
its value at each neighbor of j. To avoid redundant calculations, we propose to reorder
the computation (Fig. 3.7): for each vertex i, we compute
(a) e_{i,1} = (a_1^l)^T ηw_i^l, used to compute α_{i*} at vertex i.
(b) e_{i,2} = (a_2^l)^T ηw_i^l, used by all vertices j for which i is a neighbor, to compute α_{j*} at
vertex j.
Since a^l = [a_1^l a_2^l] is identical for each vertex, we calculate e_{i,2} just once at vertex i, and
transmit it to vertices j.
For |V| vertices and |E| edges, the naïve computation performs O(|E|) multiplications
and memory accesses (to ηw_i^l) per vertex, for a total cost of O(|V||E|). Our
reordered computation is O(|V| + |E|), with O(|E|) accumulations over all vertices, i.e.,
latency and power are linear in graph size.
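The reordering can be checked with a few lines of NumPy; this is an illustrative model of the arithmetic only (the naive version redoes the a_2^l inner product once per edge, the reordered version does it once per vertex), not the hardware mapping:

    import numpy as np

    def naive_scores(eta_w, a1, a2, edges):
        # recomputes a2 . eta_w[j] for every edge in which j appears
        return [float(a1 @ eta_w[i] + a2 @ eta_w[j]) for (i, j) in edges]

    def reordered_scores(eta_w, a1, a2, edges):
        e1 = eta_w @ a1                # e_{i,1}: one inner product per vertex
        e2 = eta_w @ a2                # e_{i,2}: computed once, reused per edge
        return [float(e1[i] + e2[j]) for (i, j) in edges]

    rng = np.random.default_rng(0)
    eta_w = rng.random((5, 4))         # weighted vertex features, |V| x F^l
    a1, a2 = rng.random(4), rng.random(4)
    edges = [(0, 1), (0, 2), (3, 2), (4, 0)]
    assert np.allclose(naive_scores(eta_w, a1, a2, edges),
                       reordered_scores(eta_w, a1, a2, edges))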

3.4.2 Mapping Attention Vector Multiplication

As in Weighting, we use a block strategy to distribute computation in the CPE array.


The vector ηw_i is distributed across all N columns of a row, so that the size of each
block allocated to a CPE for vertex i is G = ⌈F^l/N⌉. Each CPE column processes V_a
vertices. Here, V_a depends on the number of columns N in the CPE array, and also
depends on the size of the output buffer |OB|, i.e., the size of the set of vertices that
can be cached in the output buffer: V_a = |OB|/N.
This dot product computation is very similar to the weight-stationary scheme used
in the Weighting step, i.e., the attention vectors remain stationary until a pass through
all the vertices is complete. The F^l-dimensional subvector a_1^l is divided into N blocks of size G and
distributed columnwise to one of the spads in each CPE. Vertex feature blocks for V_a
vertices at a time, divided into chunks of size G, are loaded into the other spad, and the
inner product computation proceeds. Since ηw_j^l and a^l are dense, load balancing in the
CPE array is unnecessary.
As the CPEs in a column finish computation for a vertex, the partial results are
sent to the corresponding MPE for Aggregation. We overlap the computation in a CPE
column with the Aggregation in the corresponding MPE: as the MPE aggregates
partial results for the current vertex, the blocks of the next weighted vertex features are
loaded into the CPE. Thus, all CPEs and MPEs remain busy.
After all V_a vertices in the row are processed, the spad that contains a_1^l is loaded
with a_2^l, and the second inner product computation for the V_a vertices is performed,
reusing ηw. The computed e_{i,1} and e_{i,2} are written back to the output buffer and are
appended to the feature vector of vertex i.
3.4.3 Mapping Edge-based Computations

The last step requires edge aggregation from each neighbor of a vertex. All GNNs
perform edge-based summations followed by an activation function; for GATs, the
weights for this summation are computed using methods in the above subsections.
Typical graphs are too large for the on-chip buffers. We use a dynamic scheme
(Section 3.5) to process a subgraph of the graph at a time, processing edges in parallel
in the CPE array.
Load Distribution: The Aggregation computation brings data into the input buffer.
For each vertex in the subgraph corresponding to the vertices in the buffer, it accumulates
edge data by pairwise assignment to CPE spads.
Due to power-law behavior, the vertex degrees in the subgraph may have a large
range. To distribute the load, the Aggregation summations are divided into unit pairwise
summations and assigned to CPEs. For instance, accumulation of a sum effectively
implements an adder tree in which the number of CPEs required to process Aggregation
for each vertex depends on its degree in the subgraph. Thus, the number of CPEs
assigned for Aggregation of a vertex in a subgraph is proportional to its degree. The
degree-dependent assignment of CPEs to vertices tackles imbalance in workload that
might occur due to the power-law behavior.
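A sketch of the degree-proportional assignment is given below; the rounding rule and the CPE count are illustrative assumptions, since the text only states that the number of CPEs per vertex is proportional to its subgraph degree:

    # Assign CPEs to vertices of the cached subgraph in proportion to their
    # subgraph degrees (at least one CPE per vertex).
    def assign_cpes(subgraph_degrees, total_cpes):
        total_deg = sum(subgraph_degrees.values())
        return {v: max(1, round(total_cpes * d / total_deg))
                for v, d in subgraph_degrees.items()}

    degrees = {"V1": 5, "V2": 5, "V3": 5, "V4": 1, "V5": 2, "V6": 2, "V7": 1}
    print(assign_cpes(degrees, total_cpes=256))
    # high-degree vertices (V1-V3) get ~61 CPEs each; degree-1 vertices get ~12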
GATs: The final step in computing the attention coefficient αij involves edge-based
computations (Equation (3.4)):

• the addition, e_ij = e_{i,1} + e_{j,2}

• a LeakyReLU step, LeakyReLU(e_ij)

• a softmax step, exp(e_ij) · ηw_j / Σ_{k∈{i}∪N(i)} exp(e_ik)

Each edge from a neighbor j to vertex i contributes an eij to the numerator of the
softmax, and one to the denominator. These computations are parallelized in the CPEs
among incoming edges of a vertex using pull-based aggregation [42].
The computation of the numerator in the softmax step is shown in Fig. 3.8. For a target
vertex i connected to a neighbor j by edge (i, j), ηw_i, e_{i,1}, and e_{i,2} are loaded into one
spad of a CPE, and the corresponding data for j into the other spad. For vertex i, the
result e_{i,1} + e_{j,2} is sent to the SFU to perform LeakyReLU followed by exponentiation.

Figure 3.8: Data flow corresponding to computation of an edge.

The output returns to the CPE and is multiplied with ηw_j^l. A similar operation is
performed for vertex j to compute exp(e_ji) · ηw_i^l.
Other GNNs: The Aggregation step for GCN, GraphSAGE, GAT and GINConv
involves a sum of weighted vertex feature vectors over all neighbors j (or a sample of
neighbors for GraphSAGE) of each vertex i. This computation is similar to but simpler
than that in Fig. 3.8: just addition is performed.
As before, a subgraph of the larger graph is processed at a time. In processing vertex
i, the data for all neighbors j is processed in an adder tree, placing operands in spad1
and spad2 of a CPE, and storing the result in spad1. The partial results for a vertex
(partial sum for a general GNN, or the summed numerator and softmax denominator for
a GAT) are written to the output buffer after each edge computation. For a GAT, the
values of exp(eik ) are also added over the neighborhood to create the denominator for
the softmax. Finally, the accumulation over neighbors is divided by the denominator
in the SFU to obtain the result. Similarly, in another round of accumulation, the
partial results of the vertices are sent from the output buffers to the CPEs to compute the
final result. When all components of the sum for vertex i are accumulated, the result is
sent through the Activation unit and written to DRAM.
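The per-vertex bookkeeping for a GAT can be summarized by the following NumPy sketch, which accumulates the softmax numerator and denominator edge by edge and normalizes at the end; it is a functional model only, and self-edges (the {i} term in the softmax) would simply be passed in as edges (i, i):

    import numpy as np

    def gat_aggregate(eta_w, e1, e2, in_edges, leaky_slope=0.2):
        V, F = eta_w.shape
        num = np.zeros((V, F))               # running softmax numerators
        den = np.zeros(V)                    # running softmax denominators
        for i, j in in_edges:                # edge j -> i (i pulls from j)
            e_ij = e1[i] + e2[j]
            e_ij = e_ij if e_ij > 0 else leaky_slope * e_ij    # LeakyReLU
            w = np.exp(e_ij)
            num[i] += w * eta_w[j]
            den[i] += w
        den[den == 0] = 1.0                  # vertices with no incoming edges
        return num / den[:, None]            # aggregated features before activation

    rng = np.random.default_rng(1)
    eta_w = rng.random((4, 3))
    e1, e2 = rng.random(4), rng.random(4)
    print(gat_aggregate(eta_w, e1, e2, in_edges=[(0, 1), (0, 2), (1, 3)]))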

3.5 Graph-Specific Caching


Aggregation operations intensively access the graph adjacency matrix. Computational
efficiency requires graph-specific caching techniques to transfer data to/from on-chip
input and output buffers, maximizing data reuse and minimizing off-chip random memory

Figure 3.9: Example illustrating the subgraph in the input buffer (left) and its evolution after
cache replacement (right).

accesses. A notable feature of our proposed policy is a guarantee that all random-access
patterns are confined to on-chip buffers and off-chip fetches are sequential.
As stated earlier, the adjacency matrix is stored in the CSR format. Our input is a
graph represented by three arrays: (i) the coordinate array lists the incoming/outgoing
neighbors of each vertex, (ii) the offset array contains the starting offset of each vertex
in the coordinate array, and (iii) the property array with the weighted vertex feature,
ηw_i^l (see Section 3.4.1), for each vertex i; for GATs, this is concatenated with {e_{i,1}, e_{i,2}}.
Subgraph in the Input Buffer: Edge-mapped computations involve a graph traversal
to aggregate information from neighbors. At any time, a set of vertices resides in the
input buffer: these vertices, and the edges between them, form a subgraph of the original
graph. In each iteration, we process edges in the subgraph to perform partial Aggregation
operations (Section 3.4.3) for the vertices in the subgraph. Under our proposed caching
strategy, ultimately all edges in the graph will be processed, completing Aggregation for
all vertices.
We illustrate the concept through an example in Fig. 3.9, showing a graph with
vertices V1 through V16 . The highest degree vertices are first brought into the cache,
i.e., the input buffer: vertices V1 , V2 , and V3 of degree 5, vertices V5 and V6 of degree 2,
and then two vertices of degree 1, V4 and V7 . The subgraph, Subgraph 1, consists of
these vertices and edges E1 to E6 which connect them. After edges E1 through E6 are
processed, vertices V4 through V7 have no unprocessed edges and may be replaced in
the cache by V8 through V11 in Iteration 2. This creates Subgraph 2, the subgraph with
edges E7 through E10, which is processed next, and so on.
Cache Replacement Policy: As vertices are replaced after computation of each
subgraph, a replacement policy is necessary. Our policy prioritizes vertices with the
most unprocessed edges for retention in the input buffer. Since such vertices appear
more frequently in the list of neighbors for other vertices in the coordinate array, this
increases the likelihood of finding both the source and destination of edges in the cache.
The policy requires inexpensive preprocessing to sort vertices in order of their degrees.
In practice, it is enough to sort vertices into bins based on their degrees, differentiating
high-degree vertices from medium-/low-degree vertices to prioritize higher-degree vertices.
After preprocessing, vertices of the input graph are stored contiguously in DRAM in
descending degree order of the bins. Ties are broken in dictionary order of vertex IDs.
The key to avoiding random-access fetches from DRAM is the preprocessing step and the
replacement policy.
We track the number of unprocessed edges, αi for vertex i, decrementing it as
each neighbor is processed. Initially αi is the vertex degree; when αi = 0, h_i^l is fully
computed. Tracking αi requires minimal hardware overhead (a decrementer and one
word of storage per vertex), and its tracking enables GNNIE to maximize edge processing
in each iteration.
Fig. 3.10 illustrates our policy, managed by a cache controller using a 4-way set
associative cache. Graph vertices are stored contiguously in DRAM in descending degree
order, where vertex 1 has the highest degree. If the input buffer capacity is n vertices,
initially data (i.e., feature vector, connectivity information, αi ) for vertices 1 to n are
loaded from DRAM.
The algorithm processes each such set of vertices in the input buffer in an iteration,
decrementing αi as each neighbor of vertex i is processed (as noted above, this tracking
needs only a decrementer and one word of storage per vertex).
At the end of iteration 1 (after finishing computation of
the subgraph of the first n vertices), if αi < γ for any vertex, where γ is a predefined

Figure 3.10: Input buffer replacement policy during Aggregation.

threshold, it is replaced from the cache. We replace r vertices in each iteration using
dictionary order. These vertices are replaced in the input buffer by vertices (n + 1) to
(n + 1 + r) from DRAM: these have the next highest vertex degrees. For each such vertex
i, we write back the αi value into DRAM. When all vertices are processed once, we have
completed a Round.
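The replacement policy can be captured by a short behavioral model; the sketch below simulates a single Round on a toy graph (for simplicity, all vertices meeting the eviction criterion are replaced rather than exactly r of them), and it is an illustration of the policy, not the cache controller itself:

    def simulate_cache(adj, n, gamma):
        """Behavioral model: process edges using only sequential DRAM fetches."""
        dram_order = sorted(adj, key=lambda v: (-len(adj[v]), v))   # descending degree
        alpha = {v: len(adj[v]) for v in adj}                       # unprocessed edge counts
        total_edges = sum(alpha.values()) // 2
        processed = set()
        cache, next_ptr = list(dram_order[:n]), n
        while len(processed) < total_edges:
            before = len(processed)
            in_cache = set(cache)
            for u in cache:                                         # process subgraph edges
                for w in adj[u]:
                    e = (min(u, w), max(u, w))
                    if w in in_cache and e not in processed:
                        processed.add(e)
                        alpha[u] -= 1
                        alpha[w] -= 1
            replaced = False
            for v in sorted(x for x in cache if alpha[x] < gamma):  # evict, dict. order
                if next_ptr < len(dram_order):
                    cache[cache.index(v)] = dram_order[next_ptr]    # sequential refill
                    next_ptr += 1
                    replaced = True
            if len(processed) == before and not replaced:
                break      # no progress: the deadlock case discussed for low gamma
        return len(processed)

    adj = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2], 4: [1]}
    print(simulate_cache(adj, n=3, gamma=1))    # 4: all edges processed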
Similarly, the partial sums for the vertex feature vector in the output buffer are
updated as more edges in the subgraphs are processed. Any h_i^l for which all accumulations
are complete is written back to DRAM. Due to limited output buffer capacity, only
a subset of partial vertex feature vector sums can be retained in the buffer, and the
rest must be written to off-chip DRAM. To reduce the cost of off-chip access, we use a
degree-based criterion for prioritizing writes to the output buffer vs. DRAM. As partial
Aggregation results for softmax are written to DRAM, the numerator and denominator
components for a vertex are stored nearby, for locality during future fetches.
How our policy avoids random-access DRAM fetches: Our policy makes random
accesses only to the input buffer; all DRAM fetches are sequential. In the first Round,
data is fetched from consecutive DRAM locations. In the CPE array, aggregation of
each vertex fetches the vertex feature data of its neighbors in the current subgraph in
the cache. Each vertex feature vector may be thus fetched by the CPE array multiple
times according to the graph neighborhood structure, but all such random accesses are
limited to the cache, which has much better random-access bandwidth than the off-chip

Figure 3.11: Histogram of α through various Rounds (Pubmed). The inset shows a magnified
view.

memory.
Vertices evicted from the cache, with αi < γ, may be fetched again in a subsequent
Round. Even in these Rounds, data blocks are brought into cache in serial order from
DRAM: there are no random accesses from DRAM. During DRAM fetches, a cache
block is skipped if all of its vertices are fully processed. The total unprocessed edges in
a cache block is tracked through inexpensive hardware, similar to tracking αi .
The effectiveness of the approach is illustrated in Fig. 3.11, which shows the his-
togram of αi distributions in the input buffer after each Round. The initial distribution
corresponds to the power-law degree distribution, and in each successive Round, the
histogram grows flatter – with both the peak frequency and maximum α becoming lower,
thus mitigating the problems of power-law distribution. In contrast, HyGCN ignores the
power-law problem, and AWB-GCN overcomes it using high inter-PE communication.
Moreover, our approach is shown to be effective even for much more intensive GAT
computations (prior accelerators do not address GATs).
Fig. 3.12 shows the impact of γ on DRAM accesses for three datasets during Aggregation
of the first layer. For this calculation, the weighted feature vector size at the first
layer is set to 128 B. As γ increases, more vertices are evicted and may have to be brought
back to the cache, resulting in more DRAM accesses. However, if γ is too low, vertices
may not be evicted from the cache, resulting in deadlock as new vertices cannot be


Figure 3.12: Ablation study on γ: (a) Cora (b) Citeseer (c) Pubmed.

brought in. In our experiments, we use a static value γ = 5, but in practice, γ may have
to be changed dynamically when deadlock arises. Deadlock detection is inexpensive and
is based on the number of total unprocessed edges in the partition, which is monitored
by a counter, and this dynamic scheme will be inexpensive in hardware.

3.6 Related Work


There has been much work on CNN accelerators [3–9], but these are not efficient for
processing GNNs. Graph analytics accelerators include ASIC-based (Graphicionado [29],
GraFBoost [31]), FPGA-based (FPGP [30]) and in-memory (GraphPIM [43]) platforms.
However, graph accelerators target lightweight operations, do not focus on data reuse,
and would be challenged by computation-intensive GNNs.
Software frameworks for GNNs include Deep Graph Library, AliGraph, and Ten-
sorFlow. Some hardware accelerators have been proposed, but we know of no prior
work that can handle networks that require softmax nonlinearities on graphs such as
GATs. Although some GAT computations are addressed in [33], the crucial attention
normalization step is left out. To our knowledge, no methods handle extreme input
feature vector sparsity using graph-specific methods.
HyGCN [10] uses an Aggregation engine for graph processing and a Combination
engine for neural operations. This requires separate on-chip buffers for each engine, which
are not fully utilized due to workload imbalance at different stages of computation.
HyGCN must arbitrate off-chip memory access requests coming from on-chip buffers
of two different engines, which involves complicated memory access control. Using a
single hardware platform optimized to handle both the irregular graph computation
and compute-intensive, albeit regular, DNN computation, GNNIE achieves performance
gains over HyGCN. Moreover, HyGCN uses sharding with window sliding/shrinking to
reduce random memory access during Aggregation. This has (1) limited efficacy for highly
sparse adjacency matrices, as the number of overlapping neighbors of vertices is a small
fraction of the total number of vertices in a shard; moreover, no specific effort is made to
address power-law degree distributions; and (2) limited parallelism, as the sliding window
of the current shard depends on the shrinking of the previous shard. HyGCN also does not
fully leverage data reuse opportunities of high-degree vertices during Aggregation,
performing (Ã h_i^{l-1}) W^l instead of the cheaper Ã (h_i^{l-1} W^l) [11, 41].
Input feature vector sparsity is not addressed and can result in inefficiency due to stalls.
These factors explain GNNIE's speedups over HyGCN.
BlockGNN [13] optimizes Weighting by applying an FFT and block-circulant con-
straint on the weight matrix. Similar to HyGCN [10], GNNerator [12] and DyGNN [14]
pipeline separate engines for Aggregation and Weighting, and are thus susceptible to stalls
due to (i) unbalanced loads between the engines and (ii) workload variations in each engine
due to variable input feature vector sparsity and power-law degree distributions. DyGNN also
employs pruning to reduce vertex/edge redundancy.
AWB-GCN [11], which is limited only to GCNs and not general GNNs, views the
problem as a set of matrix operations. It does not specifically try to reduce random
memory accesses due to the highly sparse graph adjacency matrix. Its dynamic scheduling
scheme in AWB-GCN for workload redistribution among PEs may incur high inter-PE
communication, degrading energy efficiency. The GCNAX/SCGNAX approaches [44, 45]
also propose a matrix-multiplication-based approach and handle only GCNs, and their
results show reducing speedups as the dataset sizes increase. EnGN [41] uses a ring-edge-
reduce (RER) dataflow for Aggregation, where each PE broadcasts its data to other
PEs in the same column. To reduce communication, EnGN reorders the edges, but
this is an energy-intensive step, undermined by high sparsity in the adjacency matrix,
that occurs frequently as the limited number of cached edges are replaced. The scheme
has large preprocessing costs. In the literature, reconfigurable PE array designs (e.g.,
Planaria [46], RecPipe [47]) have been proposed for DNN acceleration. However, the focus
of these works is orthogonal to GNNIE's approach. For instance, Planaria targets dynamic
architecture fission for spatial multi-tenant execution and RecPipe focuses on optimizing
multi-stage recommendation inference. CoGNN [48] introduces an algorithm-hardware
co-design approach aimed at accelerating GNN inference with minibatch sampling.
This work proposes a reuse-aware sampling method and parallel-aware quantization for
reducing computation and memory access overhead, respectively. MEGA [49] employs
a heterogeneous architecture with separate dataflows for Aggregation and Weighting,
along with techniques like adaptive-package format and condense-edge scheduling.
Prior accelerators have not fully explored load balancing. Methods that offload tasks
to idler PEs (ring-edge-reduce [41], multistage networks [11]) involve high communication
and control overheads. GNNIE bypasses such approaches and uses the flexible MAC
architecture for load balancing, using heterogeneous PEs, and assigning computation
according to need. The idea is simple, effective, and easily implemented. In addition, as
stated in Section 3.3.3, the load redistribution scheme of GNNIE results in low inter-PE
communication, low control overhead, and high speedup gain for the hardware overhead
(Fig. 3.19). Preprocessing is cheap and involves linear-time binning of vertex feature
blocks into groups.
Frequency-based caching techniques for graph data have been proposed in [50]
using a programming interface. However, [50] is a purely software-based framework,
agnostic to the underlying hardware, for traditional graph analytics and uses a static
approach. GNNIE uses a hardware-centric dynamic frequency-based caching scheme
that tracks the α value for each vertex with minimal hardware overhead, and ensures
serial access to DRAM. Other schemes are also static and more computationally expensive than
GNNIE's: they use hashing functions [21] or perform more computation [19, 51] in finding
static communities/partitions that do not specifically address cache size. On the other
hand, GNNIE’s computationally cheap dynamic scheme automatically adapts to the
cache size using subgraphs built from vertices in the cache. GRASP [52], another cache
management scheme for graph analytics, employs a most-recently-used (MRU) approach.
However, this scheme is based on past history, while GNNIE's use of the unprocessed
edge count measures the future potential of a vertex.
Table 3.1: Dataset information for GNNIE [1]

Dataset Vertices Edges Feature Length Labels Sparsity


Cora (CR) 2708 10556 1433 7 98.73%
Citeseer (CS) 3327 9104 3703 6 99.15%
Pubmed (PB) 19717 88648 500 3 90%
Reddit (RD) 232965 114.6M 602 41 48.4%

Table 3.2: Convolution layer configurations (len[h_i^l] = length of h_i^l)

GNN Model                   Weighting              Aggregation   Sample size

GAT                         len[h_i^l], 128        Sum           --
GCN                         len[h_i^l], 128        Sum           --
GraphSAGE                   len[h_i^l], 128        Max           25
GINConv                     len[h_i^l], 128 / 128  Sum           --
DiffPool (GCN_pool)         len[h_i^l], 128        Sum           --
DiffPool (GCN_embedding)    len[h_i^l], 128        Sum           --

3.7 Evaluation

3.7.1 Experimental Setup

Accelerator Simulator: We develop a simulator to measure the execution time in


terms of the number of cycles required. The simulator models each module of GNNIE
and is integrated with Ramulator [53] to model memory accesses to the off-chip HBM
with 256 GB/s bandwidth.
Each module was implemented and synthesized in Verilog and the synthesized design
was verified through RTL simulations. Synopsys Design Compiler was used to synthesize
the accelerator at a 32nm technology node with a standard-VT cell library. The chip area,
critical path delay, and dynamic/static power, extracted from Design Compiler, are
used for evaluating performance and energy. CACTI 6.5 is used to estimate the area,
energy consumption, and access latency of on-chip buffers. The energy of HBM 2.0 is
3.97 pJ/bit [54]. The chip area is 15.6 mm2 and its frequency is 1.3 GHz.
Benchmark GNN Datasets and Models: For evaluation of the performance of
GNNIE, we used the benchmark graph datasets listed in Table 3.1. We used five GNN
models for evaluations, i.e., GAT, GCN, GraphSAGE, GINConv, and DiffPool. The
convolution layer configurations are shown in Table 3.2. All preprocessing costs are
included in the evaluation.
Configurations for Baseline/Cross-Platform Comparison: We first compare
GNNIE against two baseline architectures, i.e., a general-purpose CPU and a GPU. The
CPU platform is equipped with an Intel Xeon Gold 6132 @ 2.60 GHz and 768 GB DDR4. The
GPU platform is equipped with an NVIDIA Tesla V100S-PCIe @ 1.25 GHz and 32 GB HBM2.
For GNNIE, the sizes of output and weight buffers are 1MB and 128KB, respectively.
The input buffer size is 256KB for the smaller datasets (CR, CS) and 512KB for the
larger datasets (PB, RD). The area and power numbers reported later correspond to the
larger input buffer size. The output buffer is larger since it must cache many partial
results before they are aggregated, particularly for high-degree vertices. For a 1-byte
weight, for the dataset with the largest feature vector (∼4K for CS), to keep 16 CPE
columns occupied, the buffer size is 4K×16×2 (for double-buffering) = 128KB.
The 16 × 16 CPE array uses four MAC units per CPE in rows 1 to 8, five MAC units per
CPE in rows 9 to 12, and six MAC units per CPE in rows 13 to 16.
The heterogeneous CPE array is blockwise regular and is friendly to back-end physical
design. The number of MACs per CPE was chosen through design space exploration,
optimizing the cost-to-benefit ratio (speedup gain : hardware overhead).

3.7.2 Baseline Platform Comparisons

Performance comparisons with CPU and GPU: To make a fair performance


comparison with the general-purpose CPU and GPU, we implement the GNN models
with the PyTorch Geometric (PyG) software framework. The PyG-based implementations
for CPU and GPU used in our experiment are denoted as PyG-CPU and PyG-GPU,
respectively. Neighborhood sampling for GraphSAGE is based on cycling through a
pregenerated set of random numbers. Table 3.3 shows the absolute run time including
the preprocessing overheads for the four datasets across the GNN models used in our
experiment. The total preprocessing time (including degree-based vertex reordering
(Aggregation: Section 3.5) and workload reordering (Weighting: Section 3.3.1) and
neighborhood sampling time for GraphSAGE) is shown in the parenthesis along with the
run time. It can be seen that the total preprocessing time is very negligible for smaller
datasets (CR and CS) and a small percentage of run time for larger datasets (RD).
Table 3.3: Absolute run time of inference for GNNIE

Dataset GCN GAT GraphSAGE GINConv


CR 45.80 µs (27.50 µs) 53.80 µs (27.50 µs) 54.70 µs (34.7 µs) 56.01 µs (27.5 µs)
CS 49.50 µs (26.40 µs) 62.60 µs (26.40 µs) 61.30 µs (37.40 µs) 66.40 µs (26.40 µs)
PB 0.25 ms (94.60 µs) 0.39 ms (94.60 µs) 0.36 ms (0.21 ms) 0.34 ms (94.60 µs)
RD 10.31 ms (0.31 ms) 12.32 ms (0.31 ms) 83.12 ms (74.9 ms) 11.31 ms (0.31 ms)


Figure 3.13: GNNIE performance vs. (a) PyG-CPU (b) PyG-GPU.

As shown in Fig. 3.13(a), the average speedup of GNNIE over the PyG-CPU across
the datasets used in our experiment for GCN, GAT, GraphSAGE, GINConv, and DiffPool
are 6229×, 5894×, 625×, 22878×, and 359×, respectively. According to Fig. 3.13(b)
the average speedup of GNNIE over the PyG-GPU across the datasets used for GCN,
GAT, GraphSAGE, GINConv, and DiffPool are 8.25×, 24.67×, 17.53×, 17.37×, and
21×, respectively. The speedup calculations take into account the total preprocessing
times mentioned in Table 3.3.
The speedup comes from several GNNIE optimizations: (i) The segmentation of
vertex feature vectors and their assignment in our FM architecture tackles the feature
vector sparsity challenge. (ii) Our degree-aware cache replacement policy avoids random
memory accesses to DRAM. (iii) During Weighting, distributed computation across
multiple batches enables weight reuse. Note that PyG-CPU and PyG-GPU do not
allow our dynamic caching scheme to be implemented within their purely software
based frameworks. The speedup of GNNIE on GINConv is further enhanced because
PyTorch Geometric executes Aggregation before Weighting: as described in Section 3.2,
this requires more computation than the reverse order of computation used in GNNIE.
For the GraphSAGE speedup calculations, the neighborhood sampling time on PyG-
CPU/PyG-GPU is excessive and is excluded (for RD it is 13s whereas the execution
time is 0.35s for PyG-CPU and 0.003s for PyG-GPU), but GNNIE runtimes include
neighborhood sampling times. This results in lower speedup compared to PyG-GPU
for RD. However, the GPU is much more power-hungry than GNNIE, e.g., it requires
98.5× more energy for GraphSAGE/RD than GNNIE. GNNIE is scalable on PyG-CPU:
for GCN, GAT, and GINConv, the speedups generally increase with benchmark size.
GraphSAGE bucks this trend for the above reasons, but while its sampling scheme
improves scalability, it reduces accuracy [2, 55].
On PyG-GPU, the speedups do not monotonically improve with the number of nodes.
This is because larger datasets (e.g., PB) reap greater benefit from GPU parallelization:
for these datasets, GNNIE vs. PyG-GPU speedup decreases whereas GNNIE vs. PyG-
CPU speedup increases. It is important to note that the GPU comparison is not entirely
fair to GNNIE’s lightweight accelerator with low on-chip memory, targeted to edge
applications. In contrast, this GPU has a ∼20× larger on-chip memory than GNNIE
and its power-hungry nature makes it impractical for the edge. Nevertheless, GNNIE
shows speedups over even this powerful GPU.

3.7.3 Cross-platform Comparisons

We conduct cross-platform performance comparisons with HyGCN and AWB-GCN.


Neither computes exponentiation for softmax, required by GATs, and AWB-GCN only
implements GCN. Thus, for GCN we perform a comparison with HyGCN and AWB-GCN.
For GraphSAGE and GINConv we also show a comparison with HyGCN. Unlike the
original implementations, HyGCN uses 128 channels for hidden layers of all the GNN
models, and therefore we have also configured the hidden layers similarly (Table 3.2). To
compare with HyGCN, AWB-GCN runs the customized GCN model with 128 channels for
hidden layers on an E5-2680v3 CPU with PyG and reports relative speedup and inference
latency. We leverage inference latency data from AWB-GCN for our comparison.
To compute speedup over HyGCN for GraphSAGE, GINConv, and DiffPool we run
the GNN models on Intel Xeon Gold 6132@2.60GHz CPU, which has similar performance

Figure 3.14: GNNIE performance comparison with HyGCN and AWB-GCN.

as the E5-2680v3@2.50GHz CPU, and determine the relative speedup of GNNIE. We


then take a ratio of the computed relative speedup with the relative speedup of HyGCN
compared to the E5-2680v3 CPU. It should be noted that the PyG framework is used to optimize
both the baseline CPUs of HyGCN and GNNIE. Fig. 3.14 shows that, compared to HyGCN,
GNNIE achieves average speedups of 5.23×, 6.81×, and 3.1× for GCN, GraphSAGE, and
GINConv, respectively. A comparison for DiffPool is not possible: HyGCN does not
report results on the widely used datasets that we evaluate. As before, these speedup
comparisons include GNNIE preprocessing costs. Even though the on-chip buffer size of
HyGCN (24 MB + 128 KB) is much larger than GNNIE (1.7 MB), GNNIE shows an
average speedup of 5.05×.
AWB-GCN's scatter-based aggregation requires 3× larger on-chip buffers than
GNNIE's gather-based aggregation. The sparse matrix-vector-multiplication-based AWB-GCN
loses the graph-adjacency view, sacrificing efficiency. GNNIE's caching scheme (Section 3.5)
specifically leverages graph adjacency to reduce expensive random DRAM accesses. For
GCNs, GNNIE (with 3.4× fewer PEs) shows a 1.3× speedup (Fig. 3.14) over AWB-GCN
and 15–51× higher inferences/kJ (Fig. 3.16). We note that AWB-GCN results are reported
on an FPGA, which is likely to be slower and more power-hungry than an ASIC.

3.7.4 Throughput and Energy Comparisons

Table 3.4 shows the throughput for various datasets for our configuration of GNNIE. The
table shows that the throughput degrades only moderately as the graph size is increased.
The power dissipation of GNNIE is 3.9W in 32nm, lower than HyGCN (6.7W in
12nm), similar to recent CNN edge inference engines (Edge TPU, Hailo-8, InferX1).
Fig. 3.15 shows the energy breakdown for GNNIE for GAT and GCN across three
Table 3.4: Throughput for various datasets for GNNIE.

Peak         Cora (CR)    Citeseer (CS)    Pubmed (PB)    Reddit (RD)
3.16 TOPS    2.88 TOPS    2.69 TOPS        2.57 TOPS      2.52 TOPS

Figure 3.15: Energy breakdown for GCN and GAT.

datasets, including DRAM energy required to supply the output, input, and weight
buffers. The output buffer has the most of transactions with DRAM due to psum storage.
On-chip weight buffer energy is negligible and not shown.
buffers. The output buffer has the most transactions with DRAM due to psum storage.
On-chip weight buffer energy is negligible and not shown.
Fig. 3.16 compares GNNIE's energy efficiency with prior works. The efficiency ranges
from 2.3×10^1 – 5.2×10^5 inferences/kJ for HyGCN and 1.5×10^2 – 4.4×10^5
inferences/kJ for AWB-GCN. GNNIE clearly outperforms the others, ranging from 7.4×10^3
– 6.7×10^6 inferences/kJ.

3.7.5 DRAM Access Analysis

To illustrate the efficiency of the proposed graph-specific caching scheme we compare the
number of DRAM accesses required by GNNIE with those in the widely used 2-D graph
partitioning method (employed by GridGraph [56], HyGCN [10], Marius [57]). In 2-D
graph partitioning, vertices of the graph are divided into u equal-sized disjoint partitions

Figure 3.16: Energy efficiency: GNNIE vs. HyGCN, AWB-GCN.

and stored in DRAM. Edges are then grouped into u2 blocks that can be viewed as
a grid. In this grid, each edge block (p, q) contains the edges for which source nodes
belong to the pth vertex partition and destination nodes belong to q th partition. In this
scheme, except for the self-edge blocks (e.g., edge block (p, p)) vertex partition p and q
must be in the cache (input buffer) together at least once to process the corresponding
edge block (p, q).
If the input buffer can hold v vertex partitions at a time (u ≥ v), a lower bound on
the number of DRAM block accesses for processing the graph using 2-D partitioning
is [57]:

⌈ ( u(u−1)/2 − v(v−1)/2 ) / (v−1) ⌉        (3.5)

To compare the caching schemes of GNNIE and 2-D graph partitioning we evaluate
the DRAM accesses required for executing Aggregation of the first layer for the Pubmed
dataset. In our experiment we use a 512 KB input buffer and the size of each vertex feature
vector is set to 128 B. For the 2-D partitioning scheme, we vary the number of vertex
partitions in DRAM (u) from 2 to 100 in steps of 1 and compute the corresponding
lower bound on the number of DRAM accesses for 2-D partitioning using (3.5). This
lower bound is multiplied by the size of each vertex partition in the input buffer to
determine the DRAM accesses in MB. To calculate each vertex partition size, the
input buffer size is divided by v. In Fig. 3.17, the x-axis denotes the number of vertex
partitions in DRAM (u) and the y-axis shows the corresponding lower bound for 2-D
partitioning on the DRAM access required (in MB) to process the graph. From Fig. 3.17
we can see that initially, the lower bound on the DRAM accesses decreases with the
number of partitions and plateaus eventually for higher values of u. For u = 100, the
lower bound is 5.59MB.
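For reference, the bound of (3.5), as reconstructed here, can be evaluated directly; the ceiling and the example values of u and v below are illustrative assumptions:

    import math

    def lb_partition_fetches(u, v):
        """Lower bound (3.5) on vertex-partition fetches for 2-D partitioning,
        with u partitions in DRAM and room for v of them in the input buffer."""
        return math.ceil((u * (u - 1) / 2 - v * (v - 1) / 2) / (v - 1))

    # e.g., 100 DRAM partitions with room for 20 of them in the input buffer
    print(lb_partition_fetches(100, 20))     # 251 partition fetches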
The static caching scheme proposed in 2-D graph partitioning must go through all
the vertex pair combinations to process all the edges. Due to the power-law behavior and
sparsity of real-world graphs, not all vertices in a vertex partition are used to process the
edges of its corresponding edge blocks. However, processing of an edge block requires all
vertices of the corresponding vertex partition to be cached in this scheme. Since this
approach makes no effort to distinguish the vertices of a partition that are actually needed
to process the edge blocks, it incurs redundant DRAM accesses and is suboptimal in
reducing DRAM traffic.
On the other hand, as shown in Fig. 3.12(c) and Fig. 3.17, for γ = 5 GNNIE requires
4.62 MB of DRAM accesses to execute the first-layer Aggregation of the Pubmed dataset.
In GNNIE the number of vertex feature vectors that get replaced after each iteration
dynamically varies according to the α of cached vertices and γ. In each iteration, GNNIE
tries to maximize the number of edges being processed by retaining the vertices with
a higher potential of being reused in the next iteration. Thus, by dynamically tuning
the retention of cached vertices at each iteration to maximize their reuse, the proposed
graph-specific caching scheme leads to fewer DRAM accesses than the calculated
lower bound for 2-D partitioning.

3.7.6 Optimization Analysis

We analyze key optimization techniques applied in GNNIE. To evaluate these techniques


we select a baseline design (Design A) which uses four MACs per CPE uniformly.
Parameters for the flexible MAC architecture and on-chip buffer sizes for all designs are
as described at the end of Section 3.7.1. The dimension of the PE array in all cases is
16 × 16.
Optimizing Weighting Time: We first analyze the performance improvement of

Figure 3.17: Comparison of DRAM access of GNNIE with the lower bound on DRAM access
vs vertex partitions in DRAM of 2-D graph partitioning.

applying flexible MACs (FM) on the baseline design during Weighting. For the Cora,
Citeseer, and Pubmed datasets, the workload distribution among the CPE rows for the
baseline (without load-balancing) and FM designs are shown in Figs. 3.18(a), (b), and
(c), respectively. Due to vertex feature sparsity, the CPE rows in the baseline design
suffer from workload imbalance. The FM design smooths the workload distribution
among the CPE rows, resulting in 6% (Cora), 14% (Citeseer), and 24% (Pubmed) reductions
in the number of cycles required to compute 16 elements of the output vertex features
during Weighting. The imbalance between the maximum and minimum is also reduced
by FM.
For all datasets, the last four CPE rows require more cycles than others (heavily
loaded CPE rows) and the first four CPE rows finish computation earlier (lightly loaded
rows) in FM. We perform load redistribution (LR) between “LR pairs” of heavily loaded
and lightly loaded CPE rows, offloading a portion of the workload from the heavily loaded
CPE row to the lightly loaded one. The figure shows that applying LR on FM further
smooths the workload distribution, reducing the imbalance between the maximum and
minimum significantly, and also further reduces the number of cycles.
Cost/Benefit Ratio: We introduce a metric, the cost/benefit ratio, β, relative to


Figure 3.18: CPE row workload in Weighting: (a) Cora (b) Citeseer (c) Pubmed.

Design A with 1024 MACs (4 MACs/CPE)

β = (% reduction in Cycles)/(% increase in MACs) (3.6)

The percentage reduction in cycles required is measured for Weighting for various choices
of MAC counts. The additional hardware overhead is measured in terms of percentage
increase in MACs compared to the baseline design. We compute β for four designs.
These design choices are as follows: (i) 5 MACs per CPE (i.e., Design B, 1280 MACs
in all), (ii) 6 MACs per CPE (i.e., Design C, 1536 MACs in all), (iii) 7 MACs per
CPE (i.e., Design D, 1792 MACs in all), (iv) flexible MAC architecture for GNNIE,
described at the end of Section 3.7.1 (i.e., Design E, 1216 MACs in all).
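The MAC totals behind these designs, and the "% increase in MACs" term of (3.6), can be verified with a few lines of arithmetic (the cycle-reduction term comes from simulation and is not reproduced here):

    # MAC counts for Designs A-E (16 CPEs per row) and hardware overhead vs. A.
    def total_macs(macs_per_row):
        return sum(m * 16 for m in macs_per_row)

    designs = {
        "A (baseline)": [4] * 16,
        "B": [5] * 16,
        "C": [6] * 16,
        "D": [7] * 16,
        "E (flexible)": [4] * 8 + [5] * 4 + [6] * 4,
    }
    base = total_macs(designs["A (baseline)"])
    for name, rows in designs.items():
        t = total_macs(rows)
        print(f"{name}: {t} MACs, +{100 * (t - base) / base:.2f}% vs. baseline")
    # Design E: 1216 MACs, +18.75% over the 1024-MAC baseline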
Fig. 3.19 plots β on the three datasets used in our experiment for the four design

Figure 3.19: Cost/benefit ratio for adding MACs in Designs B–E.

choices. As MAC units are added uniformly to the baseline design, β drops and is lowest
for Design D across all datasets. β drops for Designs B, C, and D because the high sparsity
and sparsity variation among vertex features yield low speedup gains as more MACs
are added. By employing MACs among CPE rows as needed, the FM approach tackles
input vertex feature sparsity, achieving high β across all datasets.
Optimizing Aggregation Time: Our baseline design has 4 MACs/row (no FM),
no load balancing (i.e., no degree-dependent load distribution in Aggregation), and no
graph-specific caching (i.e., vertices are processed in order of ID).
We first evaluate our degree-aware graph reordering and our proposed cache replace-
ment policy (CP). We measure the execution time of the baseline during Aggregation
with and without CP. Fig. 3.20(left) shows that CP reduces Aggregation time by 11%
(Cora), 35% (Citeseer), and 80% (Pubmed). This is due to reduced random off-chip
memory accesses as more edges in a subgraph are processed under degree-aware caching.
Next, we apply CP over FM to measure their combined effect. From Fig. 3.20(left),
the added MACs in CP + FM yield gains of 17% (Cora), 39% (Citeseer), and 82%
(Pubmed).
We add our approach for load-balancing (LB) during Aggregation, using the
load distribution approach in Section 3.4.3, on top of CP+FM. The combined effect
(CP+FM+LB) is shown in Fig. 3.20(left) to reduce Aggregation time cumulatively by

Figure 3.20: Effectiveness of GNNIE’s optimization methods.

47% (Cora), 69% (Citeseer), and 87% (Pubmed).


Optimizing Inference Time: We evaluate our techniques on GCN and GAT inference
time. We first analyze the effect of CP on inference time. Next, we incrementally add FM
and LR optimization to CP and measure their combined effect on inference time. Finally,
we add all load-balancing (LB) methods: the LR technique for Weighting as well as load
distribution during Aggregation. Figs. 3.20(middle) and (right) show the reduction
in the GCN and GAT inference time, respectively for CP, CP+FM, and CP+FM+LB.
The reduction in inference time is higher for Pubmed (19717 vertices) than Cora (2708
vertices), indicating the scalability of GNNIE.
Customizing GNNIE for specific GNNs: GNNIE is specifically designed to
support a wide variety of GNNs. The baseline architecture used for GNNs can be used
without any change for GraphSAGE; for GINConv, a larger PE array can be used to
overlap some additional computations (e.g., multiplication by 1 + ϵ), but these PEs are
not well utilized for other parts of the computation and the speedup is not worth the
hardware cost. For GAT, one could increase the number of SFUs to one per CPE
to achieve a 17.6% higher speedup, at the cost of a 1.0% area increase and 21.1% higher
power.

3.8 Conclusion
This chapter presents GNNIE, a versatile GNN acceleration platform for a wide range
of GNNs, including GATs. GNNIE efficiently handles unstructured data, input
vertex feature vector sparsity, adjacency matrix sparsity, and "power-law" vertex
degree distributions. It mitigates load balancing issues, computational bottlenecks, and
irregular/random data accesses using multiple methods: splitting the computation into
blocks to leverage sparsity, optimized caching strategies, and a flexible MAC
architecture in the CPE array. Substantial improvements over prior work are shown.
Chapter 4

Multicore Training Acceleration

4.1 Introduction
In recent years, GNNs have achieved unprecedented success on many real-life problems
(recommender systems, IC design, embedded sensing, e-commerce, etc.). In Chapter 3,
we presented GNNIE [17] that aims at accelerating the GNN inference for small- to
medium-scale graph workloads. However, a well-trained model is a prerequisite for
efficient inference. This chapter focuses on the development of a multicore GNN train-
ing accelerator for large-scale static graphs, addressing the ever-growing demand for
scalability and energy efficiency.
Energy-efficient and scalable acceleration of GNN training is an open problem that
involves several major challenges:
(i) High computation and communication costs: GNN training is more compute-intensive
than inference, especially with backpropagation, and incurs high access time and energy
costs for communication between memory and on-chip buffers;
(ii) Scalability for large graph sizes: Graph sizes in real-world datasets have grown
exponentially in recent years [58], necessitating multiple accelerator engines to work
together;
(iii) Load balancing during computation: High and variable input feature vector sparsity,
high adjacency matrix sparsity, and power-law distributions of vertex degrees, result in
irregular and random memory accesses during GNN computations, with low utilization
of processing elements [10, 11, 17].

(iv) Versatility: A GNN training accelerator must be able to accommodate a wide range
of GNN architectures. These challenges also persist while performing GNN inference on
large graphs, emphasizing their relevance to both training and inference acceleration.
GPU-based solutions are energy-inefficient. GNNAdvisor [19], a single-GPU solution,
is limited to small-to-medium-sized graphs. Multi-GPU platforms can handle large
graphs: RoC [59] uses dynamic techniques for graph partitioning and memory man-
agement; NeuGraph [60] employs 2-D graph partitioning and inter-GPU vertex-chunk
swapping (with increased communication overhead); PaGraph [61] replicates boundary
vertices to reduce communication among partitions, but faces scalability issues due to
replica synchronization.
Several FPGA- and ASIC-based accelerators with better energy efficiency have been
proposed. Among FPGA-based approaches, GCoD [62] implements algorithm-accelerator
co-design, but requires large on-chip buffers due to scatter-based aggregation and incurs
high preprocessing overhead for sparsification and polarization; GraphACT [20] proposes
a CPU+FPGA platform, with graph sampling and loss gradient calculation offloaded to
the CPU, and forward- and back-propagation handled in the FPGA. Among ASIC-based
approaches, Rubik [21] uses a hierarchical array of processing elements; GNNear [22] uses
an ASIC-based central acceleration engine for some computations and offloads others to
near-memory processing engines that reside in the buffer chips of DIMMs. TT-GNN [63]
presents an ASIC-based software and hardware co-optimization approach that employs
vertex feature matrix compression using tensor-train representation. However, this work
is limited to GCN only. As single-core structures, these methods are not scalable for
larger graphs; they largely neglect input feature vector sparsity and power-law degree
distribution problems.
Any single-core solution has limited scalability. This chapter presents a multicore
GNN training accelerator for static graphs, moving past the limitation of single cores and
using an array of processing cores for training, offering substantial speedup and energy-
efficiency improvements. We target much larger graphs than previous ASIC/FPGA
training accelerators (we show results on datasets with up to 8.6M vertices in Section 3.7).
We believe this is the first multicore GNN training accelerator to support a wide range
of GNNs; the only other multicore accelerator [64] known to us handles inference
only and not training. As a preprocessing step in our approach, we first partition the
graph into multiple clusters before assigning them to the cores. Existing multicore
inference accelerators cannot handle backpropagation efficiently due to: (i) massive
computation/communication overhead for the calculation/propagation of error gradients;
(ii) large gradient synchronization overhead; and (iii) lack of support for various special
functions, e.g., log and softmax.
For the core, we choose the GNNIE inference accelerator [17] introduced in Chapter 3
over other candidates [10–15] as it can handle sparsity in input vertex feature vectors
and adjacency matrix, support a wide range of GNN topologies (e.g., GCN, GraphSAGE,
GAT, GINConv), and shows speedup and efficiency advantages over other methods.
However, simply arraying a set of GNNIE cores leads to performance bottlenecks due
to: (i) suboptimality in GNNIE’s caching scheme in a multicore scenario; (ii) lack
of multicore-specific optimizations that consider both DRAM accesses and inter-core
communication. We develop novel techniques to address these challenges and develop
methods that are scalable for training large graphs. Degree-Quant [65] proposes integer-
based GNN training and we leverage this in our implementation. The major contributions
of our work in this chapter are:

• A novel feature vector segmentation scheme that reduces memory accesses, and a
random-forest-based machine learning (ML) model for optimal segmentation.

• Multicore-specific graph-specific caching with reduced random DRAM accesses and


limited on-chip communication.

• Demonstrated gains in scalability, speedup, and energy efficiency over prior GPU/FP-
GA/ASIC solutions across multiple GNN topologies.

In addition to training we evaluate the inference runtime for large graphs on our
platform. To offset the preprocessing overhead of partitioning we consider the cases
where the inference can be performed repeatedly with minimal changes to the graph
properties (detailed discussion in Section 4.7).

4.2 GNN Training Steps


GNN training involves a forward pass similar to inference, and a backward pass that
feeds gradients back to update weights.

Figure 4.1: Block diagram of the proposed multicore GNN training accelerator (core architecture
in inset) with 4 cores; our evaluation considers accelerators with up to 36 cores.

Forward Pass Computations. The forward pass has two steps [10, 17, 22]: (a) Weighting
in layer l multiplies the feature vector h_i^{l-1} (dimension F^{l-1}) of each vertex i by a weight
matrix, W^l (dimension F^{l-1} × F^l). (b) Aggregation for vertex i combines (sum/
max/mean/pool) the weighted feature vectors in a set N_i. For GCN/GAT/GINConv,
N_i is the set of neighbors N(i) of i; for GraphSAGE, N_i randomly samples N(i).
Backward Pass Computations. The output node features of the forward pass are
compared against the ground truth to compute the loss function. Then, starting from
the last layer, the gradients of the loss with respect to the feature vectors and weights
are calculated, and weight updates are performed at each layer using the chain rule
until the input layer is reached. Backward pass computations consist of Weighting and
Aggregation steps similar to the forward pass, and MAC operations for loss computations
and gradient updates.
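A minimal NumPy sketch of one forward-pass layer in the order used here (Weighting first, then sum-Aggregation) is shown below; loss computation and the backward pass are omitted, and the activation choice is an illustrative assumption:

    import numpy as np

    def gnn_layer_forward(H, W, neighbors):
        HW = H @ W                                  # Weighting: per-vertex, independent
        out = np.zeros((H.shape[0], W.shape[1]))
        for i, nbrs in neighbors.items():           # Aggregation: sum over N(i)
            for j in nbrs:
                out[i] += HW[j]
        return np.maximum(out, 0.0)                 # illustrative ReLU activation

    rng = np.random.default_rng(0)
    H = rng.random((4, 8))                          # 4 vertices, F^{l-1} = 8
    W = rng.random((8, 16))                         # F^{l-1} x F^l weight matrix
    neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
    print(gnn_layer_forward(H, W, neighbors).shape) # (4, 16)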

4.3 Multicore Architecture and Computations


Architecture. Our GNN training engine, shown in Fig. 4.1, has multiple cores connected
by a network-on-chip (NoC).
GNNIE core [17]. A GNNIE core (inset of Fig. 4.1) consists of an M × N array of
computational PEs (CPEs) for ALU computations; merge PEs (MPEs) within the CPE
array that aggregate partial results in their CPE column during Weighting; and special
function units (SFUs) for nonlinear functions. Three on-chip buffers cache the input,
weight, and output data. The controller for each core orchestrates operations in the PE
array (workload reordering for CPEs, sending partial results to MPEs). The memory
access scheduler from [16] is modified to handle memory requests from both DRAM and
NoC.
Partitioning. For a multicore training engine with m GNNIE cores, the input graph is
partitioned into m clusters, and each cluster is the workload for one core. Intra-cluster
edges are connections between vertices (“intra-cluster vertices”) within a cluster and can
be processed entirely within a core; inter-cluster edges connect vertices in the cluster
to vertices in another cluster (“inter-cluster vertices”). We preprocess the graph with
METIS [66] to create clusters that (a) are balanced, i.e., have roughly equal numbers of
intra-cluster vertices, (b) have a minimal number of inter-cluster edges.
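As a sketch, this preprocessing step can be expressed with pymetis, one Python binding of METIS [66] (the actual flow may call METIS directly); the toy adjacency list below is illustrative:

    import pymetis

    adjacency = [                 # undirected toy graph as an adjacency list
        [1, 2], [0, 2, 3], [0, 1], [1, 4, 5], [3, 5], [3, 4],
    ]
    num_cores = 2
    n_cuts, membership = pymetis.part_graph(num_cores, adjacency=adjacency)
    clusters = [[v for v, p in enumerate(membership) if p == c]
                for c in range(num_cores)]
    print(n_cuts, clusters)       # e.g., 1 inter-cluster edge, vertices split 3/3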
Weighting. Weighting is performed separately on the feature vectors of each vertex,
and can be carried out independently in each core, with no inter-vertex/inter-core
communication. The matrix-vector multiplication computations in this step are very
structured, but input vector sparsity variations can lead to load imbalance during this
computation. These issues are tackled using GNNIE's workload balancing strategies, i.e.,
the flexible MAC architecture and load redistribution [17].
Aggregation. Aggregation consolidates data from the neighbors of each vertex, and
may involve intra-cluster and inter-cluster edges. For most GNNs (GCN/GINConv/-
GraphSAGE), this involves summation, but GATs require nonlinear computation of
attention coefficients. Aggregation for each vertex in a cluster is performed on its own
core, with no synchronization. Operations on inter-cluster vertices, fetched via NoC
from the buffers of other cores, are read-only because there are no data dependencies
between operations in the same layer.
Ultra-high sparsity of the adjacency matrix and power-law behavior incur numerous
irregular and random memory accesses even on one core; this is exacerbated for large
graphs on multiple cores by heavy and irregular communication between cores due to
long feature vectors, limited NoC bandwidth, and small on-chip buffers. We will address
novel methods for overcoming these bottlenecks in Sections 4.4 and 4.5.
Dynamic caching. Since the data for each cluster is too large for the cache (input
buffer), GNNIE uses a dynamic caching scheme to fetch vertex data from the DRAM.
It processes a subgraph of the cluster, called the computational subgraph, which is
the subset of intra-cluster and inter-cluster vertices of a core currently in the cache, and
edges between these vertices.
For our multicore training engine, the intra-cluster vertices of a core and their
edge data are stored in CSR format in the DRAM. For intra-cluster vertices, this
data for a computational subgraph is fetched into the input buffer from DRAM (off-
chip communication), and for inter-cluster vertices, the data is fetched from the input
buffers of other cores (which are responsible for DRAM fetches) via the NoC (on-chip
communication). Within each core, the CPEs process the computational subgraph using
efficient load-balancing techniques [17].

4.4 Dynamic Cache Replacement Policy


We first review the hardware-centric graph-specific caching technique in [17] for the
GNNIE inference engine. To maximize cache data reuse, the number of unprocessed
edges, ev , for each vertex v is tracked during Aggregation. Since nodes with larger ev are
involved in a larger number of future computations, they are more likely to be reused;
hence they are prioritized for retention in the cache. Specifically, a node is replaced in
the cache if ev ≤ γ, where γ is a threshold.
This strategy is used to promote cache reuse and minimize DRAM fetches. The
graph undergoes lightweight preprocessing to store the vertices in descending order
of degree in DRAM (initially, ev = dv , the degree of vertex v). The cache is initially
populated with the first set of DRAM blocks with the highest degrees. An iteration is
completed when all edges of the computational subgraph in the cache are processed. At
this time, the set of cache blocks that meet the replacement criterion are evicted and the
next sequential set of DRAM blocks (the DRAM is stored in degree order) is brought
into cache. Multiple iterations are needed until all edges are traversed. By construction,
DRAM fetches exclusively use sequential blocks, avoiding expensive random access. All
random accesses are limited to the on-chip SRAM cache, which is far less expensive
than random DRAM access.
Multicore-specific Graph-specific Caching. The direct application of the graph-
specific caching scheme of [17] to large graphs in the multicore scenario results in
bottlenecks related to stagnation (described next), and requirements for increased
retention of inter-cluster vertices whose data must be sent over the NoC to other
cores. We alter the scheme, introducing dynamic thresholds that prevent stagnation.
In the multicore scenario, the preprocessing step stores each cluster of the graph
(instead of the entire graph) in degree order. The retention requirements for intra-
cluster and inter-cluster vertices are different. Due to the min-cut objective function
of clustering, intra-cluster vertices in a core tend to have higher connectivity within
the cluster, while inter-cluster vertices are connected to fewer vertices. Using the same
γ threshold for both types of vertices would disadvantage inter-cluster vertices, which
might then require frequent fetches across the NoC, with high latency and cost overheads.
Therefore, separate thresholds γintra and γinter are required for retaining intra-cluster
and inter-cluster vertices, respectively.
As the distribution of vertex degrees varies across clusters, the values of these γ
parameters must be cluster-specific, and using a uniform value of these γ variables for
all clusters is inefficient. For each core, we set γintra (γinter ) to a certain percentile value,
κintra (κinter ) of the degree distribution of intra-cluster (inter-cluster) vertices of the
cluster assigned to the core. Empirically, we find that setting κintra and κinter to the 50th
percentile of the intra-cluster and inter-cluster vertex degree distribution, respectively, is
a good choice. Our choice of γ parameters based on the above criterion also makes the
approach generalized (i.e., not tuned for a particular dataset).
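The per-cluster thresholds amount to a one-line percentile computation during preprocessing; the NumPy sketch below is illustrative, with the degree lists supplied as plain Python lists.

```python
import numpy as np

def cluster_thresholds(intra_degrees, inter_degrees,
                       kappa_intra=50, kappa_inter=50):
    """gamma_intra (gamma_inter) is the kappa-th percentile of the
    intra-cluster (inter-cluster) vertex degree distribution of the cluster."""
    return (np.percentile(intra_degrees, kappa_intra),
            np.percentile(inter_degrees, kappa_inter))

# Example: a cluster with heavy-tailed intra-cluster degrees.
gamma_intra, gamma_inter = cluster_thresholds([1, 2, 2, 3, 5, 8, 40],
                                              [1, 1, 1, 2, 3])
```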
We track the unprocessed intra-cluster (inter-cluster) edges of an intra-cluster (inter-
cluster) vertex through a simple decrement operation, i.e., whenever an intra-cluster
(inter-cluster) edge of an intra-cluster (inter-cluster) vertex is processed in the CPE array,
the controller decrements the intra-cluster (inter-cluster) edge count of the vertex by 1.
Then, based on γintra (γinter ), cache replacement is performed. Fetch operations for the
next set of intra-cluster and inter-cluster vertices via off-chip and on-chip communication
for the next subgraph are overlapped with the computation in the CPE array.
Dynamic Thresholds for Preventing Stagnation. Intra-cluster (inter-cluster)
stagnation occurs when the number of cached intra-cluster (inter-cluster) vertices that meet the eviction criterion based on γintra (γinter ) is small, as the changes in the computational subgraph across iterations are minor. This results in low computation and low PE utilization per iteration.

Figure 4.2: (a) Boosting γintra to break intra-cluster stagnation on Core 2. (b) Invoking full random access after most edges are processed on all cores.
We define the metric eintra [i] (einter [i]) as the ratio of the number of intra-cluster
(inter-cluster) edges processed up to iteration i, to the total number of intra-cluster
(inter-cluster) edges of the cluster associated with the core. After a detection interval of
every I iterations, we detect stagnation as:

eintra [i] ≤ (1 + δ)eintra [i − I], einter [i] ≤ (1 + δ)einter [i − I]

where δ is a user-defined threshold. If this is satisfied, we boost the relevant γ to


the κboost -th percentile of the vertex degree distribution. After one iteration with the boosted value, which evicts numerous vertices and overcomes stagnation by changing the computational subgraph, we revert to the original γ.
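A compact sketch of the detect-and-boost logic is given below; `e_hist` (the history of eintra or einter values) and the default κ values are illustrative placeholders for the quantities defined above.

```python
import numpy as np

def detect_stagnation(e_hist, i, I, delta):
    """e_hist[i] is e_intra (or e_inter) after iteration i; stagnation is
    flagged when progress over the last I iterations is within a factor
    (1 + delta) of the earlier value."""
    return i >= I and e_hist[i] <= (1 + delta) * e_hist[i - I]

def current_gamma(stagnating, degrees, kappa=50, kappa_boost=90):
    """Use the boosted percentile for one iteration when stagnation is
    detected; otherwise use the nominal kappa-th percentile."""
    return np.percentile(degrees, kappa_boost if stagnating else kappa)
```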
We tune the parameter values over a range of datasets. Varying κboost ∈ [70, 95], the
optimal value was found to be κboost = 90; varying I ∈ [1, 15], an optimum was found
at I = 5; varying δ ∈ [0.01, 0.1] yielded the best value of δ = 0.05. For the amazon0601
dataset on a 4-core system, Fig. 4.2(a) shows the change in eintra with each iteration
and regions of stagnation that are detected after the detection interval of I. At this
point, γintra is boosted, and as shown in Fig. 4.2(a) this increases the rate of progress of
eintra (the dotted line shows the slower trajectory of eintra without boosting).
As the Aggregation computation nears completion, when a large fraction of edges
has been processed, it becomes increasingly difficult to find unprocessed edges in the
computational subgraph. To detect this, we monitor etotal , the ratio of the number of
intra-cluster/inter-cluster edges processed up to iteration i, to the total number of edges
in the cluster. When etotal exceeds a threshold, we move to full random access: the
cache now has random access to the DRAM to complete Aggregation. This is shown in
Fig. 4.2(b), where the original trajectory (dotted lines), is accelerated to faster completion
(solid lines). At this stage, the number of random DRAM accesses is relatively small and
the benefit of faster convergence outweighs the cost of slower random DRAM accesses
during this final phase. We find an etotal threshold value of 0.8 to be optimal over a range of datasets.
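The switch to full random access reduces to a single ratio test; the sketch below is a minimal illustration with hypothetical names.

```python
def aggregation_mode(edges_processed, total_edges, threshold=0.8):
    """Switch to full random DRAM access once e_total crosses the threshold
    (0.8 was found empirically to work well across datasets)."""
    e_total = edges_processed / total_edges
    return "full_random_access" if e_total >= threshold else "graph_specific_caching"
```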

4.5 Scaling on Large GNNs

4.5.1 Bottlenecks of Scaling on Large GNNs

While graph-specific caching significantly improves latency and power/energy due to


increased data reuse, the benefits of this approach face bottlenecks due to fundamental
limitations in the traditional structure of GNN computations. Since node feature vectors
for each node can be long and the input buffer size is small, the computational subgraph
in each iteration constitutes a very small fraction of the total number of vertices in each
cluster. Thus, only a small fraction of edges can be processed in each iteration, leading
to high rates of cache replacement and slow convergence. Switching to full random
access mode can overcome this issue, but with significant costs due to the high energy of
random DRAM access. The problem becomes more acute for large GNNs that require
more cores: with more cores, more inter-cluster vertices are sent over NoC, leading to
higher injection rates and/or larger packet sizes. This increases NoC latency, worsening
performance.
The key to overcoming this problem is to increase the size of the computational
subgraph during Aggregation, subject to the cache size. We achieve this by proposing
feature vector segmentation, splitting a vertex feature vector into multiple segments,
processing one segment at a time. We show that the choice of segment size involves
balancing off-chip and on-chip communication latency as we seek to efficiently overlap
computation with communication for high performance.
4.5.2 Feature Vector Segmentation

During Aggregation, there is no dependency between operations in different elements


of the feature vector of a vertex. Therefore, Aggregation over the neighborhood for
each feature vector segment can be carried out independently. We develop the concept
of feature vector segmentation under a fixed cache size, illustrated in Fig. 4.3. The
conventional approach at left (“Full”) uses full feature vectors of length F . For a cache
size of C, the number of vertex feature vectors that can fit in the cache is roughly
nF = C/F , and this limits the size of the computational subgraph. We can increase
the subgraph size by using a subset of the entire feature vector. If we split the feature
vector into two segments (“2-segments,” middle), we can fit a subgraph of 2nF vertices
in the cache. For the j-segments case (right), where each segment length is q = ⌈F/j⌉,
we increase the size of the computational subgraph by a factor of j relative to the “Full”
case.

Figure 4.3: Feature vector segmentation.
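To make the sizing concrete, the sketch below (hypothetical cache size and 4-byte feature elements assumed) computes how many vertices fit in the cache for a given segment count j.

```python
import math

def subgraph_vertices(cache_bytes, F, j, bytes_per_feat=4):
    """Number of vertices whose feature segments fit in a cache of
    cache_bytes: q = ceil(F / j) elements per segment, so roughly j times
    the n_F = C / F vertices of the 'Full' (j = 1) case."""
    q = math.ceil(F / j)
    return cache_bytes // (q * bytes_per_feat)

# Hypothetical sizes: 512 KB cache, F = 96 features, 4-byte features.
full = subgraph_vertices(512 * 1024, 96, j=1)   # ~1.4K vertices
seg4 = subgraph_vertices(512 * 1024, 96, j=4)   # ~5.5K vertices (4x larger)
```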

Using j segments, Aggregation operations are performed on one feature vector


segment at a time, over all nodes, using the graph-specific caching method of Section 4.4.
For larger j, the computational graph in each iteration is larger, and more edges are
available for Aggregation, so that CPEs in each core are kept busy. However, as j
increases, more vertices fit into each core and have more neighbors in other cores. Hence,
traffic in the NoC also increases as more vertices are sent to other cores, increasing the
injection rate and slowing communication.
A few prior approaches have used concepts similar to segmentation, but have signifi-
cant limitations: our solution gains efficiency by exploiting segmentation in harmony
with other schemes that reduce cache access latencies, including graph-specific caching
and on-chip fetches from other cores using the NoC. P 3 [67] uses a superficially similar
method that segments the feature vector into ⌈F/c⌉ segments, where c is the number of
GPU cores. This accommodates the entire graph into each core, but their power-hungry
GPU-based solution requires a much larger cache than our power-efficient ASIC solution.
BoostGCN [15] and GNNerator [12] implement 2-D graph partitioning and use feature
vector segmentation to increase the number of vertices in each partition so that the
frequency of DRAM communication decreases. However, as shown in [17], the lower
bound of DRAM access for 2-D partitioning is always higher than the graph-specific
caching of GNNIE. In addition, while BoostGCN and GNNerator rely solely on DRAM
communication to fetch neighboring vertices, we fetch neighbors from DRAM (off-chip
communication) or from the caches of other cores (on-chip communication); since
DRAM accesses are more costly than on-chip communication, our approach is more
efficient.
Performance Analysis. Fig. 4.4 shows the results of combining feature vector segmen-
tation with dynamic graph-specific caching (Section 4.4), showing the number of cycles
required for Aggregation in the first GCN layer for the amazon0601 dataset (403,394
vertices/3,387,388 edges) on a 4-core system. The results are shown for j = 1 (Full), 2,
4, and 8.
Fig. 4.4(a) shows the progress in processing graph edges as the execution progresses,
showing the fraction etotal of all edges that are processed (averaged over all cores). For
the Full case, etotal rises very slowly and does not reach the threshold of 0.8 required to
transition to full random access mode. The 2-segments approach progresses faster, and
the 4-segments approach is faster still; however, increasing to 8-segments slows down the
progress of etotal . We will understand this trend based on Fig. 4.4(b), which shows the
total number of cycles for the computation, and the components of this total.
Fig. 4.4(b) shows improvement in on-core computation cycles from the Full to the
segmented cases with larger j. Compared to the Full approach, the 2-, 4-, and 8-segment
cases reduce off-chip stall cycles by 60%, 74%, and 83%, respectively: increasing j reduces
DRAM access frequency as the computational subgraph becomes larger. For the Full case, the computational subgraph, on average, contains just 8% of the vertices in a cluster; for 2, 4, and 8 segments, the fraction rises to 16%, 32%, and 64%, respectively.
The number of full random mode cycles reduces as j increases, because the computation
subgraph grows larger as j increases and fewer edges are unprocessed at the switch to


Figure 4.4: Performance analysis of feature vector segmentation: (a) etotal (Average) vs.
Execution Cycles (b) Aggregation cycle comparison.

full random mode.


Under segmentation, the NoC injection rate increases with higher j due to the
increased size of the computational subgraph on each core, which results in the transmittal
of multiple feature vector segments of inter-cluster vertices over the NoC. However, since
individual segments are smaller, the message size is reduced. This tradeoff between the
increased injection rate and the reduced message size implies there is an optimal j for
which the on-chip stall cycles are minimized. We develop an ML model to optimize
j. The impact of stalls can be further mitigated by overlapping on-chip and off-chip
communication during each iteration. In particular, no on-chip stall cycles are incurred
for the Full approach, because on-chip communication per iteration is so low (due to a
small number of cached vertices) that it can be completely overlapped.
Fig. 4.4(b) also shows that the on-chip stall cycles increase from 2- to 4- to 8-segment
cases. Hence, the total cycle count required to complete the computation has its minimum
for the 4-segments case. This explains the trend in Fig. 4.4(a).
ML Model for Optimizing j. To optimize the number of segments j, we trained a
machine learning (ML) model using a random forest (RF) regressor. This trained ML
model is then used on unseen graphs. Input parameters include graph attributes (vertex
and edge counts, a power-law metric capturing the fraction of edges incident on the top 10th percentile of high-degree vertices, and the number of cores). The total number
of samples corresponds to 144 synthetic graphs, ranging from 100K–10M vertices and
200k–100M edges, and with power-law metric from 0.27–0.95. A train/test split of 80/20
was used. The RF regressor used 100 estimators, and the model achieved 95% training
and test accuracy based on R2 score.
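A sketch of such a predictor is shown below using scikit-learn; the synthetic inputs and labels are placeholders and do not reproduce the thesis training set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 144  # placeholder: same sample count as the synthetic-graph training set
X = np.column_stack([
    rng.integers(100_000, 10_000_001, n),    # vertex count
    rng.integers(200_000, 100_000_001, n),   # edge count
    rng.uniform(0.27, 0.95, n),              # power-law metric
    rng.choice([2, 4, 16, 36], n),           # number of cores
])
y = rng.integers(2, 17, n)                   # placeholder labels: optimal j

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
j_pred = int(round(model.predict(X[:1])[0]))  # predicted j for one graph
```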
On real datasets, the model prediction was close to the results of a much more costly
enumeration of all j ∈ [2, 16]. For example, the optimal j predicted by the RF model
is 5 for the amazon0505 and amazon0601 datasets, close to the enumerated optimum of 4; the prediction for the com-amazon dataset is j = 4, which matches the optimum from enumeration.
Comparison with P 3 . Fig. 4.4(b) also shows results for P 3 [67], which switches from a segmented method (model parallelism, across feature vector segments) to the Full method
(data parallelism, across cores) after the first layer, incurring a large communication
overhead due to a burst of communication at that stage; by its very nature, this step
provides no opportunities for overlapping off-chip and on-chip communication. Hence, it
incurs high on-chip stalls: 89%, 63%, and 7% higher on-chip stall cycles compared to 2-,
4-, and 8-segments, respectively, in our implementation of their method.
PE Utilization. The segmented approach also leads to higher PE utilization compared
to full-length feature vectors. The larger computational subgraph size allows more edges
to be processed per iteration, increasing the computational intensity of data fetched from
DRAM and keeping GNNIE PEs busy. For the first layer of Aggregation of amazon0601
for GCN, the average PE utilization for Full, 2-segments, 4-segments, and 8-segments
are 67%, 86%, 100%, and 100%, respectively.

4.6 Evaluation
Hardware/Simulation Setup. Each core is implemented in Verilog, synthesized
with Synopsys DC in a 12nm standard VT library, placed and routed using Innovus,
and verified via RTL simulations. The area, energy, and latency of on-chip buffers
are estimated using CACTI 6.5 [68]. Post-P&R metrics for each core are: 4.97mm2 ,
0.93W, 934 MHz. The controller has 0.26 mm2 area and 0.1W power. For the NoC,
latency and throughput were analyzed using BookSim2 [69], and power and area using
Orion3.0 [70]. The NoC power overhead ranges between 2.9%–6.3% of the total chip
power. An in-house simulator computes the execution cycles for our accelerator, with
Ramulator [53] modeling off-chip HBM access (256 GB/s, 3.97pJ/bit [54]).
Configuration of the Multicore Accelerator.
Individual GNNIE cores. Configuration per core is as follows:
Buffer sizes: Output: 1MB; Weight: 128KB; Input: 512KB
CPE array with flexible MACs: 16 × 16 array; 4 MACs (rows 1–8), 5 MACs (rows 9–12),
6 MACs (rows 13–16).
NoC Buffer size: 128 KB, 4 links per router, 50GB/s BW/link.
Number of GNNIE cores. The number of cores for a dataset is based on the ratio, ϑ, of
vertices per computational subgraph (i.e., the full-length vertex features that can fit in
cache) to the vertices assigned per core. Empirically, we determined that its optimal
range is 0.03 ≤ ϑ ≤ 0.15. Using this, we find the number of cores m (see Tables 4.1
and 4.2) for the optimal ϑ that optimizes speedup gain vs. area/power overhead.
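One simple way to operationalize this rule is sketched below; the candidate core counts and the example sizes are hypothetical.

```python
def choose_num_cores(num_vertices, subgraph_vertices,
                     candidates=(1, 2, 4, 16, 36),
                     theta_lo=0.03, theta_hi=0.15):
    """Pick the smallest core count m whose ratio
    theta = subgraph_vertices / (num_vertices / m) lies in the empirically
    good range [0.03, 0.15]."""
    for m in candidates:
        theta = subgraph_vertices / (num_vertices / m)
        if theta_lo <= theta <= theta_hi:
            return m
    return candidates[-1]

# Hypothetical example: ~400K vertices, ~5K full-length vertices fit in cache.
m = choose_num_cores(400_000, 5_000)   # -> 4 (theta = 0.05)
```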
We analyze the change in speedup when the number of cores is altered from the
optimal m. For the A-06 dataset, m = 4; for 2, 16, and 36 cores, the speedup changes
by 0.43×, 3.1×, and 7.29×, respectively. In each case, the speedup change is sublinear,
indicating that m = 4 is optimal.
Benchmark GNN Datasets and Models. We evaluate the performance of our
platform using Type A and Type B benchmark graph datasets from Table 4.1 and 4.2,
respectively. Type A datasets consist of multiple small graphs with no inter-graph edges,
while Type B datasets are large monolithic graphs with a high amount of structural
irregularity, i.e., higher adjacency matrix sparsity and power-law behavior. Table 4.1
and 4.2 also provide the input feature length (FL), number of cores (m), and feature
vector segments (j) used for each dataset.
We evaluate the accelerator for training four GNN models: GCN, GINConv, GAT,
and GraphSAGE. All GNNs have one hidden layer, except GINConv which has five;
for GCN, GINConv, and GraphSAGE each hidden layer has 16, 64, and 256 channels,
respectively. The GAT hidden layer uses eight 16-dimensional attention heads. All
speedup and energy numbers include preprocessing times, including runtime for graph
partitioning, degree-based vertex reordering, workload reordering, and neighborhood
sampling time (performed on Intel Xeon Gold@2.60GHz CPU) for GraphSAGE. The
preprocessing overhead over 500 epochs for amazon0601 is 18%.
Performance comparison with DGL. We compare all GNNs against Deep Graph
Library (DGL) [71] on a V100 Tesla GPU with V100S-PCIe@1.25GHz, 32GB HBM2
Table 4.1: Type A datasets (DD: D&D, TW: TWITTER-Partial, YT: Yeast, SW: SW-620H, OV: OVCAR-8H) for GNN training

Dataset Vertices Edges (FL, m, j)
DD 335K 1.7M (89, 2, 4)
TW 581K 1.4M (1323, 4, 2)
YT 1.7M 3.6M (74, 16, 2)
SW 1.9M 3.9M (66, 16, 2)
OV 1.9M 3.9M (66, 16, 2)

Table 4.2: Type B datasets (SB: soc-BlogCatalog, CA: com-amazon, A-05: amazon0505, A-06: amazon0601, EN: enwiki, A-8M: amazon8M) for GNN training

Dataset Vertices Edges (FL, m, j)
SB 89K 2.1M (128, 1, 2)
CA 335K 1.9M (96, 2, 4)
A-05 410K 4.9M (96, 4, 4)
A-06 403K 3.4M (96, 4, 4)
EN 3.6M 276.1M (300, 16, 16)
A-8M 8.6M 231.6M (96, 36, 16)
(“DGL+Tesla V100”). The training latencies for the speedup comparison are averaged over
500 epochs. As shown in Fig. 4.5(a) and (b), the average speedup of our approach against
DGL+Tesla V100 for GCN, GINConv, GAT, GraphSAGE ranges from 8.9×–46.6×
across Type A datasets and 3.3×–15.5× for Type B.
The speedup comes from several of our optimizations: (i) Feature vector segmentation
improves scalability for large GNNs. (ii) Dynamic cache replacement mitigates irregular
random memory accesses and on-chip communication overhead. (iii) Distributed compu-
tation across multiple batches ensures weight reuse. The speedup is particularly high for
GINConv: unlike DGL, we use dimension-aware stage reordering (DASR) [11, 17], which
requires fewer computations. To determine their impact, we removed these optimizations
successively on A-06. Without segmentation, the computation did not complete (as in
Fig. 4.4). With optimal segmentation, removing dynamic cache replacement increases
runtime by 34%; also removing weight reuse raises the penalty to 43%.
GraphSAGE shows lower speedup than other models due to: (i) inclusion of prepro-
cessing time for neighborhood sampling on our platform, but not on DGL+Tesla V100.
(ii) mitigation of power-law behavior in real-world graphs by sampling. Type A datasets
have higher speedups than Type B datasets due to the lack of on-chip communication
overheads. Larger datasets (e.g., OV, A-06) show higher speedups than smaller datasets
(e.g., DD, SB) for both Type A and B, indicating scalability.
Comparison with GPU-based accelerators. Speedup: GNNAdvisor implements
only GCN and GINConv. For the same configurations for these GNNs, Fig. 4.5(a) and
(b) shows that relative to GNNAdvisor, we achieve 15.5×–27.9× speedup for Type A
and 4.2×–9.2× for Type B datasets.
NeuGraph uses 2-D graph partitioning to process large graphs using one NVIDIA
Tesla P100 GPU. We achieve 12.2× and 16.9× speedup for GCN on EN and A-8M,
respectively, over NeuGraph. The corresponding speedups over GNNAdvisor are 3.1×
and 6.8×, respectively.


Figure 4.5: Speedup and energy efficiency of the proposed multicore GNN training accelerator
vs. DGL+Tesla V100 and GNNAdvisor+Tesla V100: (a), (c): Type A datasets (b), (d): Type B
datasets.

Energy: Fig. 4.5(c) and (d) illustrate the energy efficiency comparison with Tesla V100,
reporting Egain , the ratio of the energy required by the GPU to the energy of our
approach. Compared to DGL+Tesla V100, our average Egain ranges from 149×–711× over
Type A datasets and 75×–628× over Type B. Against GNNAdvisor+Tesla V100, Egain
ranges from 168×–415× and 118×–372×, respectively.
Comparison with FPGA-/ASIC-based accelerators. Our approach achieves an
average speedup of 11× and 24× over Rubik and GraphACT, respectively; neither
reports absolute power numbers. Our speedup over Rubik is due to its inefficient reuse
of cache data which incurs high on-chip and off-chip communication costs, and over
GraphACT since it does not consider the power-law behavior of real-world graphs and
makes no explicit efforts to address the random off-chip memory accesses. In comparison
with GNNear, we achieve 17× average speedup over DGL+Tesla V100, but the speedup
of GNNear is only 2.5×. Unlike our approach, the graph partitioner of GNNear is
oblivious to community structure in real-world graphs, resulting in high communication
costs due to the high number of cut edges between the partitions. GCoD handles only
small graphs (up to 233K vertices, as against 8.6M vertices for our approach), and uses a
whopping 180W of power even for these graphs, which can be handled by our approach
on a single core using < 1W.

4.7 Applying Partitioning Methods for a Multicore GNN


Inference Engine
The idea of using multicore GNNIE engines can, in principle, also be applied to
accelerate inference on large graphs. However, in practice, the relatively high cost of
the partitioning step dominates the computational cost of inferencing. However, in
applications where the partitioning solution is reused across multiple inferences, it is
possible to amortize the cost of partitioning. In this section, we show the results of using
a GNNIE-based multicore inference engine for such a scenario.
In many real-world scenarios, minor changes can occur to the graph properties
over time. For instance, the A-06 dataset is a co-purchase graph dataset where each
vertex is a good available on the web store of an internet-based retailer, and the
edges between the goods indicate compatibility, i.e., whether one good can complement or substitute for the other. The price, quality, and availability of the
products in a warehouse (i.e., features of vertices) can change over a short period of
time (i.e., days/weeks). In such a scenario, the graph topology remains unchanged but
due to the change in the vertex features, we may need to run inference on the graph. In
addition, small changes in the graph topology (i.e., addition/deletion of edges/vertices)
can be handled by assigning the incremental workload to one of the cores. Running
multiple inferences on the same graph also amortizes the preprocessing overhead due to
graph partitioning.
The setup for the proposed multicore GNN inference engine is similar to that described
in Section 4.3. However, due to the absence of backpropagation during inference, we do
not need to store and fetch the gradients of the loss with respect to the feature vectors
and weights. In other words, inference requires smaller on-chip buffers compared to
training. Hence, we configure the output buffer and input buffer of each core in the
multicore inference engine as 512KB and 256KB, respectively. In addition, since the loss computation during backpropagation requires special functions such as log, the SFU units for each GNNIE core can be simpler than the ones used during training.
The inference latencies for the speedup comparison are averaged over 500
forward passes. As shown in Fig. 4.6 (a) and (b), the average speedup of our ap-
proach against DGL+Tesla V100 for GCN, GINConv, GAT, GraphSAGE ranges from
5.1×–27.5× across Type A datasets and 2.5×–10.4× for Type B datasets. Compared to
the training speedup achieved by our proposed platform (Fig. 4.5), the inference speedup
is lower. This difference can be attributed to two main factors. Firstly, training in-
volves backpropagation, which is computationally intensive and requires significant data
movement, resulting in larger communication overhead. Our proposed multicore-specific
caching and feature vector segmentation are particularly effective in addressing these
challenges, outperforming GPU-based accelerators during training. Secondly, since the
inference latency is lower than that of training, the preprocessing overhead of graph
partitioning is more efficiently amortized during training compared to inference. As a
result, speedup for the inference operation is lower than that for training, as reported in
Section 4.6.
Fig. 4.6(c) and (d) illustrate the energy efficiency comparison with Tesla V100. Compared to DGL+Tesla V100, our average Egain ranges from 115×–472× over Type A
datasets and 59×–427× over Type B. Against GNNAdvisor+Tesla V100, Egain ranges
from 114×–295× and 87×–253×, respectively.


Figure 4.6: Inference speedup and energy efficiency of the proposed multicore GNN training
accelerator vs. DGL+Tesla V100 and GNNAdvisor+Tesla V100: (a), (c): Type A datasets (b),
(d): Type B datasets.

4.8 Conclusion
This chapter proposed a multicore GNN training accelerator with GPU-like scalability and accelerator-like efficiency for large GNNs. It leverages novel feature vector
segmentation and dynamic caching schemes for scalability and to mitigate communi-
cation costs. Our evaluation demonstrates substantial speedup and energy-efficiency
improvements over prior approaches.
Chapter 5

DGNN Inference Acceleration

5.1 Introduction
Chapters 3 and 4 focused on acceleration of inference and training of GNNs on static
graphs, where the vertex features and graph topology remain unchanged over time.
However, real-world scenarios, such as financial transactions, social media interactions,
and molecular biology processes, often exhibit dynamic graph structures in which the graph topology and node features evolve over time; the above GNNs
are not suited for these applications.
Dynamic graphs can be modeled as a series of snapshots to describe the change
of vertex features and graph topology at regular intervals; such models are referred
to as discrete-time dynamic graphs (DTDG) [72]. The DTDG model has numerous
applications across a large variety of domains. Its discrete representation of temporal
changes provides a versatile framework for understanding and analyzing dynamic systems,
e.g., pandemic forecasting, social network analytics, and traffic prediction. A dynamic
graph neural network (DGNN) is a special kind of neural network that operates on
dynamic graphs and involves two computational kernels: (a) the GNN, which captures
structural information, and (b) the recurrent neural network (RNN), which captures
temporal information.
Due to the growing significance of edge applications, where processing occurs closer to
the data source, it is imperative to implement techniques that efficiently perform inference
on dynamic graphs. This motivates the need to build dedicated hardware accelerators

for DGNNs, which is the subject of this chapter. The unique challenges of DGNNs,
characterized by dynamic irregularity, inter-vertex locality, and sequential dependence of
the RNN kernel, require a novel approach for efficient performance enhancement.
The challenges inherent in developing an accelerator for DGNNs can be distilled
into several key aspects: (C1) Real-world benchmark DTDG datasets exhibit minimal
variation between consecutive snapshots. This presents a unique opportunity for inter-
snapshot computation reuse, which must be effectively leveraged [73]. Beyond the
irregular memory access overhead seen in GNN engines for static graphs, resulting
from the sparsity of graph snapshots, an additional memory inefficiency stems from
the time-dependent variation of vertex features [74]. (C2) The introduction of time-
dependent RNN kernels in DGNNs, unlike traditional GNNs, introduces a bottleneck
arising from the sequential dependence between two consecutive snapshots. (C3) The
batch size of the RNN kernel can be orders of magnitude higher than that of conventional
RNN inference tasks for text and speech [75]; this offers an opportunity to minimize
excessive memory accesses by reusing weight parameters, which needs to be exploited for
computational efficiency. Addressing all of these challenges is essential for an efficient
DGNN accelerator.
While GPU-based solutions have been proposed for training on DGNNs, they are
energy-inefficient. ESDG [76], a multi-GPU platform, proposes a graph difference method
to reduce data transfer overhead between snapshots for DGNN training. This method
overlooks the overlap between contiguous snapshots in dynamic graphs (C1), resulting
in the recomputation of all graph data for each snapshot and subsequent performance
degradation. ESDG does not implement any explicit mechanism to tackle random
memory access issues for the GNN kernel (C1). PiPAD [74] proposes overlap-aware data organization and data transfer for DGNN training on a single GPU and addresses random
memory access only for the overlapping part of consecutive snapshots (C1). It does not
address the bottleneck due to inter-snapshot sequential dependence of the RNN kernel
(C2). Neither approach accounts for the weight reuse opportunity for optimizing RNN
computations (C3).
To address energy efficiency concerns of GPU-based solutions, various ASIC- and
FPGA-based inference accelerators have been proposed. Cambricon [77], an ASIC-
based accelerator, introduces a cuboid-based processing architecture that supports the
addition/deletion of edges and fine-grained data transfers to avoid unnecessary snapshot
updates. However, it primarily focuses on changes in graph topology between snapshots
and does not consider the changes in the features of the vertices (C1). DGNN-Booster [78],
a CPU+FPGA DGNN inference accelerator, uses a message-passing mechanism for the
GNN kernel, but neglects overlaps between snapshots (C1). ReaDy [79], a ReRAM-
based DGNN inference accelerator, implements redundancy-free data scheduling and
inter-kernel pipelining (C2) to enhance efficiency, but it overlooks the overlap between
snapshots (C1). ReFlip [80] also proposes a ReRAM-based DGNN accelerator, but it
does not address the sequential bottleneck imposed by the RNN kernel (C2) and remains oblivious to the data reuse opportunity offered by the overlap between snapshots (C1). Neither
Cambricon [77] nor DGNN-Booster [78] capitalizes on weight reuse in the RNN kernel
(C3).
In this chapter we propose an integrated platform for DGNN inference acceleration,
handling both GNN and RNN computations on the same hardware and overcoming the
limitations of prior works. We address Challenges C1–C3 through the following key
contributions:

• Challenge C1: We employ overlap-aware caching to leverage small variation in graph


topology across snapshots. We concatenate features of overlapping vertices across
multiple snapshots to maximize cache reuse and minimize random DRAM accesses for
fetching dynamically varying vertex features.

• Challenge C2: We develop an efficient pipelining mechanism for the GNN and RNN
engines to ensure seamless computation without bottlenecks or stalls.

• Challenge C3: We employ weight coalescing to maximize weight reuse and reduce
off-chip communication for the RNN kernel.

Finally, our platform is versatile in handling a wide variety of dynamic GNNs, including
those employing self-attention mechanisms for temporal layer processing. This flexibility
allows our platform to be applied across diverse dynamic graph scenarios; prior
frameworks [74, 76–79] are limited to specific temporal layers.
5.2 Background
In the discrete-time representation, dynamic graphs are modeled as a set of graphs,
DG = {G1 , G2 , ..., GT }, where T denotes the total number of snapshots, and the graph
Gk = {V k , E k } represents the snapshot with vertices V k and edges E k at timestamp
k. This representation enables the utilization of traditional static GNNs for spatial
information encoding and RNNs for temporal information.
During the $k^{th}$ iteration of the DGNN, the GNN kernel computes the updated feature vectors, i.e., $Y^k = \{y_1^k, y_2^k, \ldots, y_v^k, \ldots, y_{N^k}^k\}$, for all $N^k$ vertices in the snapshot $G^k$. Subsequently, the state vector $H^{k-1} = \{h_1^{k-1}, h_2^{k-1}, \ldots, h_v^{k-1}, \ldots, h_{N^{k-1}}^{k-1}\}$ from the previous timestamp and the extracted information $Y^k$ are fed into the RNN kernel to produce the updated state vector, $H^k = \{h_1^k, h_2^k, \ldots, h_v^k, \ldots, h_{N^k}^k\}$.
Discrete-time DGNNs are classified into two categories: stacked and integrated. We
evaluate our method against three models:
MPNN-LSTM [81] stacks a two-layer GCN and long short-term memory (LSTM)
models, where LSTM operations are applied on the features generated by the GCN. The
inter-snapshot data dependence stems from updating the hidden state for the LSTM.
EvolveGCN [82], an integrated model, involves a one-layer GCN and one gated
recurrent unit (GRU) in each of its two layers. The GCN in the second layer takes
outputs from the first layer as input features. Unlike MPNN-LSTM, EvolveGCN enables
the weights to evolve by applying the RNN module over GCN weight matrices.
T-GCN [83] integrates multiple one-layer GCNs into a GRU. The propagation of hidden
states across snapshots results in inter-snapshot dependencies akin to MPNN-LSTM.
In addition, there is another variant of DGNNs proposed in the literature that uses
a self-attention mechanism to capture temporal information. Traditional RNNs, including
gated models such as LSTM and GRU, face challenges in handling temporal dependencies
due to their sequential information propagation. In contrast, the self-attention mecha-
nism overcomes these limitations, allowing models to capture temporal dependencies
more effectively. For analysis and evaluation purposes, we select ASTGCN [84] as a
representative model that implements a spatial-temporal attention mechanism.
The GNN kernel computation in the k th iteration is given by:

$y_v^k = \sigma(A^k x_v^k W_{lin}) \qquad (5.1)$


Here, for graph Gk , a one-layer GNN kernel takes the initial feature vector, xkv , of a vertex
v at timestamp k, and generates the updated representation yvk ; a similar equation can
be used to represent the update for subsequent layers of a multi-layer GNN kernel. The
adjacency and weight matrices for linear transformation of feature vectors are denoted
as Ak and Wlin , respectively. The following set of equations describes the well-known
LSTM model for RNNs:

$$
\begin{aligned}
i_v^k &= \sigma(W_i y_v^k + U_i h_v^{k-1} + b_i) \\
f_v^k &= \sigma(W_f y_v^k + U_f h_v^{k-1} + b_f) \\
o_v^k &= \sigma(W_o y_v^k + U_o h_v^{k-1} + b_o) \\
c_v^k &= f_v^k \odot c_v^{k-1} + i_v^k \odot \tanh(W_c y_v^k + U_c h_v^{k-1} + b_c) \\
h_v^k &= o_v^k \odot \tanh(c_v^k)
\end{aligned} \qquad (5.2)
$$

The LSTM computation requires the following matrix-vector multiplications: (i) between
the updated vertex feature vector yvk and four input weight matrices Wx , x ∈ {i, f, o, c};
(ii) between the hidden state vector $h_v^{k-1}$ and four hidden weight matrices $U_x$, $x \in \{i, f, o, c\}$. The input gate, forget gate, output gate, and cell state feature of vertex v
at the timestamp k are represented by ikv , fvk , okv , and ckv , respectively. There are no
intra-snapshot dependencies among the eight matrix multiplications, but inter-snapshot
dependence stems from sequential dependence on the hidden state vector at timestamp
k-1. In addition, the LSTM involves element-wise additions (+) and products (⊙), and
activation functions, e.g., sigmoid and tanh. Some DGNN models use GRU for capturing
temporal information, which is similar to LSTM and uses a gated mechanism. GRU is
less compute-intensive than LSTM due to its simpler architecture and fewer parameters,
but may struggle to capture complex long-term dependencies since it lacks the forget
gate in LSTM.
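For reference, a NumPy sketch of one LSTM step per Equation (5.2) is shown below; the dictionary-of-gates layout and shapes are illustrative, not the storage format of the RNN engine.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(y_k, h_prev, c_prev, W, U, b):
    """One LSTM step per Equation (5.2). W[x]: (H, D), U[x]: (H, H),
    b[x]: (H,) for x in {'i', 'f', 'o', 'c'}; D = len(y_v^k), H = hidden size."""
    i = sigmoid(W['i'] @ y_k + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ y_k + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ y_k + U['o'] @ h_prev + b['o'])   # output gate
    c = f * c_prev + i * np.tanh(W['c'] @ y_k + U['c'] @ h_prev + b['c'])
    h = o * np.tanh(c)
    return h, c   # h feeds the next timestamp (inter-snapshot dependence)
```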
ASTGCN [84] implements a spatial-temporal attention mechanism for capturing
dynamic spatial and temporal correlations of DTDGs. This requires computation of two
kinds of attention, i.e., spatial attention and temporal attention. The spatial attention computation is described by the following set of equations:

$S = V_s \cdot \sigma((\chi_h^{r-1} W_1) W_2 (W_3 \chi_h^{r-1})^T + b_s); \quad S' = \mathrm{softmax}(S) \qquad (5.3)$
where $\chi_h^{r-1}$ is the input feature tensor that is fed to the $r^{th}$ spatial-temporal block; Vs ,
bs , W1 , W2 , and W3 are learnable parameters; and S′ is the normalized version of the
spatial attention matrix S. While performing graph convolutions, S′ is accompanied
with the adjacency matrix to dynamically adjust the weights of the edges.
Similar to spatial attention, the computation of temporal attention is described by
the following equations:

$E = V_e \cdot \sigma((\chi_h^{r-1} U_1) U_2 (U_3 \chi_h^{r-1}) + b_e); \quad \hat{\chi}_h^{r-1} = \chi_h^{r-1}\, \mathrm{softmax}(E)$

Here, $V_e$, $b_e$, $U_1$, $U_2$, and $U_3$ are learnable parameters, and $E' = \mathrm{softmax}(E)$ is the normalized version of the temporal attention matrix $E$; $E'$ is directly applied to the input to obtain $\hat{\chi}_h^{r-1} = \chi_h^{r-1} E'$.

The final step, i.e., the spatial-temporal convolution, is given by

$\chi_h^r = \mathrm{ReLU}(\phi \circledast (\mathrm{ReLU}(g_\theta \otimes \hat{\chi}_h^{r-1}))) \qquad (5.4)$

Here, ⊛ and ⊗ denote standard and graph convolution operations, respectively; ϕ is the
parameter of the temporal convolution kernel; and gθ represents the graph convolution
filter.

5.3 Proposed DGNN Accelerator


Architecture. As mentioned in Section 5.1, DGNN acceleration requires careful
orchestration of both GNN and RNN computations. In our proposed architecture we
implement separate engines to handle these operations and pipeline them for faster
inference. Fig. 5.1 illustrates the key components of our proposed accelerator.
GNN Engine. Our core GNN engine (with an input buffer, run-length compression
(RLC) decoder, GNN PE array, pipeline buffer, GNN Weight Buffer, and an activation
unit) is similar to the GNNIE GNN inference accelerator [17] presented in Chapter 3. We
choose GNNIE over alternative GNN inference accelerators such as HyGCN [10], AWB-
GCN [11], GNNerator [12], BlockGNN [13], and DyGNN [14] due to several advantages
that it offers: (i) GNNIE effectively manages high sparsity in input vertex feature
vectors and the adjacency matrix by implementing load balancing and graph-specific

Figure 5.1: Block diagram of the proposed DGNN accelerator.

caching; (ii) its versatile platform accommodates a broad range of GNN topologies (GCN,
GraphSAGE, GAT, GINConv) and supports functions not present in alternative engines,
such as the softmax over neighborhood, a requirement for GATs; (iii) GNNIE shows
notable speedup, lower power consumption, and reduced area over competing methods.
Pipeline Buffer. This shared buffer between the GNN and RNN engine is used for
inter-engine pipelining. It caches the GNN engine outputs before relaying them to the
RNN engine.
RNN Engine. As shown in Equation (5.2), RNN operations are predominantly MVMs
that primarily involve interactions among the hidden feature vector emanating from
the RNN layer of the preceding timestamp, the output feature vector of the GNN
layer at the current timestamp, and the corresponding weight matrices. The RNN
engine is composed of two separate units: (i) the WY unit performs the matrix-vector
multiplication between the updated vertex feature vector yvk and four input weight
matrices Wx , x ∈ {i, f, o, c}; (ii) The UH unit is responsible for the matrix-vector
multiplication between the hidden state vectors hk−1
v and four hidden weight matrices
Ux , x ∈ {i, f, o, c}. The final operations in the RNN kernel require the use of nonlinear
functions (e.g., softmax, tanh), critical for capturing complex dependencies in sequential
data. In our architecture, these nonlinear functions are integrated into the processing
elements (PE) array of the UH unit through the special function units (SFU), which use
look-up tables to realize these functions. Collectively, the weight-stationary dataflow and
the integration of SFU units lead to high computational efficiency. Each computation PE
(CPE) in the PE arrays has two scratch spads (spads) and multiply-accumulate (MAC)
units; merge PEs (MPEs) aggregate partial results sent from each CPE column. The
RNN Weight Buffers 1 and 2 hold the weights for the WY and UH units, respectively.
HBM DRAM. The high-bandwidth memory (HBM) DRAM stores information about
the graph. To store the dynamic graph we use the temporal compressed sparse row
(T-CSR) [85] format. Unlike the traditional CSR format used to store static graphs,
T-CSR uses an additional time array that indicates the timestamp of incoming/outgoing
edges for each vertex. Moreover, RLC encoding is used for sparse input feature vectors.
Output Buffer: This buffer stores the output of WY and UH units and the updated
state vectors before sending them to the DRAM. The memory access scheduler manages
off-chip memory requests from the input/pipeline/output/GNN weight/RNN weight
buffers.

5.4 Accelerating GNN Computations


We now discuss our proposed overlap-aware caching policy for accelerating the GNN
kernel of DGNNs, leveraging the minimal variation between consecutive snapshots of
benchmark DTDG datasets to reduce DRAM accesses and maximize the reuse of cached
data.

5.4.1 Overlap Extraction Methodology

We describe the identification of shared topology between neighboring snapshots, and


an adjacency matrix representation used to maximize data reuse and reduce off-chip
communication overhead.
As a preprocessing step, we combine m consecutive snapshots of the DGNN into
a group. For instance, as shown in Fig. 5.2, snapshots from timestamps k to k+3
are combined to constitute Group j, and snapshots from timestamps k+4 to k+7 are
combined to make Group j+1. In general, if there are T snapshots in the graph, then
the total number of such groups is $\lceil T/m \rceil$. For each group, the adjacency list for each
vertex v consists of the neighbors within the group, i.e. from timestamp k to k+m-1,
where k is the first timestamp in the group. For instance, in Fig. 5.2 the adjacency list
of vertex A in Group j consists of its neighbors from timestamp k to k+3 in the graphs
Gk through Gk+3 . We select m so that the data volume of each group (i.e., the feature
vector and adjacency information) is roughly equal to the input buffer capacity.
To extract the overlapping part between two consecutive groups we perform a bitwise-
AND operation between the vertex IDs of the groups; a bitwise-XOR operation yields
the exclusive node IDs for each group. The process of overlap extraction and creation
of exclusive parts for two consecutive groups of snapshots is shown in Fig. 5.2. After
overlap extraction, we have three parts: (i) the overlapped part, consisting of the common
vertices and edges that appear both in the timestamps of Group j and Group j+1,
(ii) exclusive part 1 consisting of vertices and edges that belong to Group j (timestamp
k to k+3), (iii) exclusive part 2 consisting of vertices and edges belonging to Group j+1
(timestamp k+4 to k+7).
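A minimal Python sketch of the grouping and AND/XOR overlap extraction is shown below, assuming each snapshot is given as a vertex-to-neighbor-set dictionary; all names are illustrative.

```python
def group_snapshots(snapshots, m):
    """snapshots: list of {vertex: set(neighbors)} dicts, one per timestamp.
    Merges every m consecutive snapshots into one per-group adjacency list."""
    groups = []
    for start in range(0, len(snapshots), m):
        merged = {}
        for snap in snapshots[start:start + m]:
            for v, nbrs in snap.items():
                merged.setdefault(v, set()).update(nbrs)
        groups.append(merged)
    return groups

def split_overlap(group_a, group_b):
    """AND/XOR on the vertex-ID sets of two consecutive groups."""
    ids_a, ids_b = set(group_a), set(group_b)
    overlap = ids_a & ids_b                    # AND: common vertices
    return overlap, ids_a ^ overlap, ids_b ^ overlap  # XOR: exclusive parts
```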
By grouping multiple snapshots to boost overlap, we have higher data reuse opportu-
nity, i.e., better performance. For the HepTh dataset, if we increase m from 1 to 4, the
number of overlapped vertices increases by 7×. For lower values, e.g., m = 2 and m = 3,
the overlap increases by 3.5× and 6×, respectively. For higher values, e.g. m = 8, the
overlap increases by 9×, at the expense of higher off-chip communication overhead due
to frequent cache replacements.

5.4.2 Proposed Overlap-aware Caching Policy

Since most new data goes through the input buffer of the GNN engine, we refer to it as
the cache. Processing of each graph snapshot begins with the retrieval of vertex features
and their adjacency information from memory to the cache. While caching the data to
the input buffer, we must account for the random memory access overheads that stem from
the irregularity of the adjacency matrix of real-world graphs. On the other hand, the
output of the GNN engine is cached into the pipeline buffer before being consumed by
the RNN engine, which undergoes compute-intensive operations with regular data access
patterns for a snapshot. However, while caching data into the pipeline buffer we must
account for the inter-snapshot dependence introduced by the RNN kernel.

Figure 5.2: Overlap extraction between consecutive groups.

In our proposed approach, we process two consecutive groups at a time. For instance,
we process Groups j and j+1 together, then process Groups j+2 and j+3, and
so on. Due to the inter-snapshot sequential dependence of the RNN kernel, we process
the snapshot at timestamp k before that at timestamp k+1, processing pairs of groups
at a time. In the example above, as we process Groups j and j+1 (snapshots k through k+7), we process snapshot k before snapshot k+1, and so on. While processing
each snapshot we fetch the corresponding data (vertex features and their adjacency
information) from DRAM to the input buffer, i.e., the cache.
As discussed in Section 2.2, the computation of the GNN kernel involves two major steps: (i) Weighting multiplies the feature vector of each vertex with a weight matrix. This step is compute-intensive. (ii) Aggregation involves consolidation of information over the neighborhood of each vertex. This requires extensive interaction with the graph adjacency matrix. However, the ultra-high sparsity of the adjacency matrix of real-world graphs leads to a large number of irregular and random memory accesses [17], i.e.,
performance degradation. In addition, the limited capacity of the input and pipeline
buffers further exacerbates this issue due to frequent off-chip communication. Hence it
is essential to employ caching techniques that can minimize the communication between
DRAM and on-chip buffers.
To minimize DRAM communication overhead and maximize the reuse of cached
data, we first cache the vertices in the overlapped part for a pair of groups. We also
concatenate the features of vertices in the overlapped part across the timestamps in
which they appear as we fetch them from memory. This is advantageous because with a
single fetch of the overlapped vertices we can reuse them across multiple timestamps as
we process each pair of groups.
For instance, as shown in Fig. 5.2, the edge between vertex A and I appears across
timestamps k, k+1, k+4, k+6, and k+7, i.e., vertex A and I belong to the overlapped
part. Thus, we concatenate the features of vertex A and I across timestamp k, k+1,
k+4, k+6, and k+7 as we fetch them to the cache. After caching the vertices in the
overlapped part we cache the vertices in the exclusive parts. During this process vertices
at timestamp k are cached before those at timestamp k+1. For instance, vertex C at
timestamp k is cached before vertex B at timestamp k+1 and D at timestamp k+2.
If the cache is unable to hold all the vertex data for a timestamp, cache replacement
is required after processing all the edges that connect the cached vertices. Similar to
the proposed caching mechanism of GNNIE in Section 3.5, during cache replacement, we
replace a vertex if the number of unprocessed edges of the vertex for that timestamp is
below a user-defined threshold, γ. In such a scenario, we may need multiple iterations
to process all edges of a timestamp, and an evicted vertex may be fetched into the cache
in subsequent iterations. The proposed cache replacement scheme aims at retaining
the cached vertices with higher reuse potential than others and this leads to reduced
DRAM communication overhead. To address the limited capacity of the pipeline buffer, we prioritize the updated vertex features at timestamp k over those at timestamp k+1 when writing to the pipeline buffer instead of DRAM.
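The fetch ordering can be summarized as in the sketch below (illustrative names; `features` maps a (vertex, timestamp) pair to its feature vector): overlapped vertices are fetched once with their features concatenated across timestamps, followed by exclusive vertices in timestamp order.

```python
import numpy as np

def build_fetch_list(overlap, exclusive_by_ts, timestamps, features):
    """overlap: set of overlapped vertex IDs; exclusive_by_ts: dict
    timestamp -> set of exclusive vertex IDs."""
    fetch = []
    # 1) Overlapped vertices first: one fetch, features concatenated across
    #    every timestamp in which the vertex appears, reused for all of them.
    for v in sorted(overlap):
        ts_v = [t for t in timestamps if (v, t) in features]
        if not ts_v:
            continue
        fetch.append((v, ts_v, np.concatenate([features[(v, t)] for t in ts_v])))
    # 2) Exclusive vertices next, in timestamp order (k before k+1, ...).
    for t in timestamps:
        for v in sorted(exclusive_by_ts.get(t, ())):
            fetch.append((v, [t], features[(v, t)]))
    return fetch
```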

5.5 Accelerating RNN Computations

5.5.1 Weight Coalescing for the RNN Kernel

RNN operations entail significant inter-snapshot dependencies (Section 5.2), but intra-
snapshot computations of the RNN kernel are not subject to such restrictions. This
opens up ample intra-snapshot data reuse opportunities that can be leveraged: we
propose a weight-stationary dataflow with weight coalescing, reducing unnecessary data

Figure 5.3: Implementation of our weight coalescing scheme.

accesses by enhancing inter-vertex data reuse in a snapshot.


In this weight-stationary dataflow (Fig. 5.3), each column of the four input weight
matrices Wx , x ∈ {i, f, o, c} is divided into chunks of p = ⌈L/M ⌉ rows and coalesced
before loading into the weight spad of a CPE of a column in the PE array of the WY
unit. Here, L and M denote the length of yvk and number of rows in the CPE array,
respectively; this corresponds to the fact that M elements of yvk can be processed at a
time. On the other hand, for a given set of vertices in the pipeline buffer at a timestamp
k, a subvector of length p is broadcast to the entire CPE row using a bus. The UH unit
follows a similar weight-stationary computation that involves the hidden state vectors at the
preceding timestamp and four hidden weight matrices Ux , x ∈ {i, f, o, c}. The weight
matrices are shared by all nodes across all timestamps. Hence, the approach maximizes
weight reuse for all timestamps, reducing the size requirement of on-chip weight buffers,
and minimizes communication overhead.
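A software analogue of the coalescing step is sketched below, assuming each Wx is stored as an L × H matrix (L being the length of yvk); the layout is illustrative, not the exact spad format.

```python
import math
import numpy as np

def coalesce_weights(W_gates, L, M):
    """W_gates: dict gate -> (L x H) matrix. Each column is cut into M chunks
    of p = ceil(L / M) rows; the chunks of the four gate matrices for one
    (chunk, column) pair are coalesced for the weight spad of a single CPE."""
    p = math.ceil(L / M)
    H = next(iter(W_gates.values())).shape[1]
    spads = {}
    for chunk in range(M):
        rows = slice(chunk * p, min((chunk + 1) * p, L))
        for col in range(H):
            spads[(chunk, col)] = np.concatenate(
                [W_gates[g][rows, col] for g in ('i', 'f', 'o', 'c')])
    return spads
```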

5.5.2 Pipelining GNN and RNN Engines

For a given snapshot, the RNN engine takes the output of the GNN engine as its input
(Fig. 5.2). Employing inter-engine pipelining is crucial for maximizing parallelism and
minimizing data movement. Nevertheless, during Aggregation, the irregular structure of
real-world graphs leads to varying workloads for different vertices in the GCN kernel.
In contrast, the RNN kernel exhibits a uniform execution pattern for all vertices. This
workload disparity introduces potential pipeline stalls between the two kernels. To
address this, we use a pipeline buffer to cache vertex features received at different times

Figure 5.4: Implementation of inter-engine pipelining.

from the GNN kernel before relaying them to the RNN kernel. This decoupling of GNN
and RNN execution enables the design of an inter-engine pipeline that enhances overall
efficiency.
The LSTM-based RNN kernel operations listed in Equation (5.2) involve two types
of independent MVM operations: (i) between the updated vertex feature Y k and four
input weight matrices Wx , x ∈ {i, f, o, c}. (ii) between the hidden state vector H k−1 and
four hidden weight matrices Ux , x ∈ {i, f, o, c}. However, the inter-snapshot sequential
dependence introduced by the RNN kernel (Section 5.2) limits parallelism and imposes
performance bottleneck. To address this issue, we schedule the first type of MVM
operations to the WY unit and the latter to the UH unit to aid parallelism (Fig. 5.4)
and ensure streamlined processing, while keeping both units busy.
Fig. 5.4 illustrates our dataflow for the WY and UH units. At timestamp k, the
updated feature vector yvk from the GNN is sent to the WY unit from the pipeline
buffer to compute the MVM $W_x y_v^k$, $x \in \{i, f, o, c\}$. The hidden vector from the previous timestamp, $h_v^{k-1}$, which was forwarded to the pipeline buffer, is used to compute $U_x h_v^{k-1}$, $x \in \{i, f, o, c\}$, in parallel in the UH unit. Finally, the computation of $h_v^k$ is
performed in the UH unit using the results of the two MVMs as well as nonlinear
computations (softmax, tanh) and element-wise multiplications (Equation (5.2)). This
result is forwarded to the pipeline buffer for the UH computation in the next timestamp
(this forwarding path is shown in Fig. 5.1). Thus, our approach keeps the WY and UH busy at all times, resulting in high utilization of accelerator resources.

Figure 5.5: Speedup vs. snapshots per group.

5.6 Evaluation
Hardware/Simulation Setup. The accelerator is implemented in Verilog, synthesized
with Synopsys DC in a 12nm standard VT library, placed and routed using Innovus,
and verified via RTL simulations. The area, energy, and latency of on-chip buffers are
estimated using CACTI 6.5 [68]. The post-P&R area, power, and frequency are
10.1 mm2 , 1.92 W, and 934 MHz, respectively. Our in-house simulator calculates the
execution cycles for our accelerator, utilizing Ramulator [53] to simulate off-chip HBM
access. This access is characterized by a data transfer rate of 256 GB/s and energy
consumption of 3.97 pJ per bit [54].
Configuration of our Proposed Accelerator. The GNN input buffer is 512 KB; the
pipeline buffer between the GNN/RNN engines and the output buffer are each 1 MB; we
use 16×16 PE arrays for the GNN engine and the WY, UH units of the RNN engine.
Number of timestamps per group. Empirically, we select the number of snapshots per
group m = 2 for the datasets used in our experiment. We analyze the change in speedup
when the number of snapshots in a group is altered from the optimal m. As shown in
Fig. 5.5, the normalized speedup decreases for the HepTh (HT), Epinions (EP), and Flicker (FK) datasets as we deviate from the optimal m. This is because a smaller group
size results in underutilization of cache and a smaller number of overlapped vertices. On
the other hand, a larger group size results in a higher number of overlapped vertices
(Section 5.4.2), but this leads to higher off-chip communication overhead due to frequent
cache replacements.
Table 5.1: Dataset information for DGNN inference

Dataset Vertices Edges Timestamps Description


HepTh (HT) 22,908 2,673,133 219 Citation Network
Epinions (EP) 755,780 13,668,320 501 Rating Graph
Flicker (FK) 2,302,925 33,140,017 134 Rating Graph
Youtube (YT) 3,223,589 12,223,774 203 Sharing Graph
PeMSD8 (PE) 170 277 17856 Traffic Flow Graph

Figure 5.6: Speedup comparison results for DGNN inference.

Benchmark DGNN Datasets and Models. To evaluate the performance of our


platform, we use the datasets listed in Table 5.1. We perform inference on three DGNN
models: MPNN-LSTM, EvolveGCN, and T-GCN. Each of these DGNN models uses a standard hidden dimension of 32 [86]. All speedup and energy numbers include
preprocessing times, including grouping of graph snapshots and overlap extraction
(performed on Intel Xeon Gold@2.60GHz CPU). The preprocessing overhead for the
Flicker (FK) dataset is 14%.
Performance comparisons with CPU and GPU: We compare our platform with
the DGNN software framework, PyTorch Geometric Temporal (PyGT) [86], on Intel
Xeon Gold 6132@2.60GHz, 768 GB DDR4 (“PyGT-CPU”), and on NVIDIA V100 Tesla
GPU with V100S-PCIe@1.25GHz, 32GB HBM2 (“PyGT-GPU”).
The speedup comparison results are shown in Fig. 5.6. For these results, we use
PyGT-CPU as the baseline. Here the x-axis denotes the datasets used in our experiments
and the y-axis denotes the speedup against the PyGT-CPU. As shown in Fig. 5.6, the
average speedup achieved by our accelerator compared to PyGT-CPU, across the datasets
used in our experiments, for T-GCN, MPNN-LSTM, and EvolveGCN are 1688×, 2513×,
and 1962×, respectively. Compared to PyGT-GPU, our platform achieves 71×, 92×,
and 78× speedup on average across the datasets used in our experiment for T-GCN,
MPNN-LSTM, and EvolveGCN, respectively.
In addition, we perform evaluation for DGNNs using self-attention as the temporal
layer, e.g., ASTGCN. We report results for two datasets (i.e., PE and HT), since for
larger datasets (i.e., EP, FK, and YT) PyGT-CPU and PyGT-GPU run out of memory.
Compared to PyGT-CPU and PyGT-GPU, on average our approach achieves 251× and
28× speedup. We observe that the average speedup for ASTGCN is lower compared
to the other three DGNNs used for evaluation. This is because the spatial and temporal
attention computations in ASTGCN involve dense matrix-vector multiplications, and the
CPU and GPU can reap greater benefit from their large computing resources. However,
as we will soon see, the energy efficiency of CPU and GPU is an order of magnitude
lower than our approach.
Performance Comparisons with Prior Accelerators: On average, we achieve 17×
and 25× speedup over ReFlip [80], across our datasets for T-GCN and MPNN-LSTM,
respectively. Our average speedups over ReaDy [79], across our datasets, are 2× and 3×
for T-GCN and MPNN-LSTM, respectively. ReFlip and ReaDy do not report speedup
results for EvolveGCN and do not support DGNNs employing self-attention as the
temporal layer. We do not perform comparisons with ESDG [76] and PiPAD [74] as they
are training accelerators, while ours is an inference accelerator. No reasonable comparison can
be shown against Cambricon [77], which targets smaller datasets and lacks support for
vertex feature changes over time.
Energy Comparison: Fig. 5.7 illustrates the energy efficiency comparison, reporting
Egain, the ratio of PyGT-CPU energy to the energy of PyGT-GPU, ReFlip, ReaDy,
and our proposed approach. Our average energy benefit is 1.11×10³–1.18×10⁴ over
PyGT-CPU, 170×–238× over PyGT-GPU, 68×–107× over ReFlip, and 10×–11× over
ReaDy.
Figure 5.7: Energy efficiency comparison results for DGNN inference.

Ablation study. Several optimizations contribute to the speedup of our approach:
(i) Overlap-aware caching ensures maximum reuse of cached data across multiple
snapshots and avoids unnecessary communication and computation overhead.
(ii) Inter-engine pipelining ensures a seamless flow of computation between the GNN
and RNN engines and minimizes stalls. (iii) Weight coalescing and our weight-stationary
dataflow maximize reuse of cached weights and reduce DRAM communication. To assess
the impact of these optimizations, we remove them on EP for MPNN-LSTM and evaluate
the inference runtime. Without overlap-aware caching, the inference time increases by
3.19×. Removing inter-engine pipelining results in a 1.83× increase in runtime, and
eliminating weight coalescing leads to a further 1.22× runtime increase.
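The sketch below illustrates, in a deliberately simplified form, why overlap-aware caching is the largest contributor: vertices shared by the snapshots of a group are pinned on chip after their first fetch and reused in later snapshots, instead of being refetched from DRAM each time. The fetch counting and the toy snapshot group are illustrative assumptions, not a cycle-accurate model of the accelerator.

    # Simplified fetch-count model of overlap-aware caching across one snapshot
    # group; the toy snapshots and the unit-cost accounting are assumptions.
    def dram_fetches(snapshots, pin_overlap=True):
        """Count per-vertex feature fetches from DRAM for one snapshot group."""
        overlap = set.intersection(*snapshots) if pin_overlap else set()
        cached, fetches = set(), 0
        for snap in snapshots:
            for v in snap:
                if v in cached:
                    continue          # hit: overlapped vertex already pinned on chip
                fetches += 1          # miss: bring the vertex's features on chip
                if v in overlap:
                    cached.add(v)     # pin overlapped vertices for later snapshots
        return fetches

    # Four snapshots sharing a large common core of vertices.
    group = [set(range(800)) | {1000 + t} for t in range(4)]
    print("with overlap-aware caching:   ", dram_fetches(group, pin_overlap=True))   # 804
    print("without overlap-aware caching:", dram_fetches(group, pin_overlap=False))  # 3204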

5.7 Conclusion
In this chapter we propose a unified engine for the efficient acceleration of discrete-
time dynamic graph neural networks. Key contributions include a holistic approach
for handling both GNN and RNN components, optimized cache reuse strategies, a
novel caching policy, and an efficient pipelining mechanism. The proposed platform
demonstrates exceptional versatility and is capable of accommodating diverse dynamic
GNNs. Evaluation on benchmark datasets and models reveals substantial speedups and
energy improvements, positioning the platform as a promising solution for edge applications.
Chapter 6

Thesis Conclusion

This thesis addresses the critical need for hardware accelerators tailored to the unique
requirements of GNNs. The presentation navigates the challenges posed by traditional
approaches and proposes innovative solutions to accelerate inference and
training on graph-structured data. First, we propose GNNIE, a versatile GNN inference
accelerator, which tackles the challenges of handling highly sparse input feature vectors
and adjacency matrices, load balancing during computation, and irregular memory
access patterns. By employing novel techniques such as feature vector segmentation,
load-balanced Aggregation, and lightweight graph-specific caching, GNNIE achieves
remarkable speedups and energy efficiency improvements over existing methods. The next
work extends the focus to GNN training acceleration, recognizing the escalating demands
for scalability and efficiency in handling large-scale graph datasets. By leveraging
multicore architectures and novel caching strategies, this work overcomes challenges
related to high computation and communication costs, load balancing, and versatility
in supporting various GNN architectures. The proposed platform demonstrates GPU-
like scalability and energy efficiency, positioning it as a promising solution for large
GNN training tasks. Finally, we address the emerging need for efficient inference on
dynamic graph structures by proposing an integrated platform for DGNN inference
acceleration, which employs overlap-aware caching, efficient pipelining of the GNN and
RNN components, and weight coalescing to maximize reuse and reduce off-chip
communication.
the proposed platform make it suitable for a wide variety of DGNN scenarios, offering
substantial speedups and energy improvements for edge applications. In summary, the
works presented in this thesis collectively contribute to advancing the state-of-the-art
in hardware acceleration for both static and dynamic GNNs, offering scalable, efficient,
and versatile solutions that pave the way for broader adoption of graph-based machine
learning in real-world applications.
Bibliography

[1] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher,
and Tina Eliassi-Rad. Collective Classification in Network Data. AI magazine,
29(3):93–93, 2008.

[2] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio,
and Yoshua Bengio. Graph Attention Networks. In Proceedings of the International
Conference on Learning Representations, 2018.

[3] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz,
and William J Dally. EIE: Efficient Inference Engine on Compressed Deep Neural
Network. In Proceedings of the ACM/IEEE International Symposium on Computer
Architecture, pages 243–254, 2016.

[4] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An Energy-
Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE
Journal of Solid-State Circuits, 52(1):127–138, 2017.

[5] Alessandro Aimar, Hesham Mostafa, Enrico Calabrese, Antonio Rios-Navarro,
Ricardo Tapiador-Morales, Iulia-Alexandra Lungu, Moritz B Milde, Federico Corradi,
Alejandro Linares-Barranco, and Shih-Chii Liu. Nullhop: A Flexible Convolutional
Neural Network Accelerator Based on Sparse Representations of Feature Maps.
IEEE Transactions on Neural Networks and Learning Systems, 30(3):644–656, 2019.

[6] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal,
Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle,
Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt
Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati,
William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu,
Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander
Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve
Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle
Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran
Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie,
Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross,
Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham,
Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian,
Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric
Wilcox, and Doe Hyun Yoon. In-datacenter Performance Analysis of a Tensor
Processing Unit. In Proceedings of the ACM/IEEE International Symposium on
Computer Architecture, pages 1–12, 2017.

[7] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Ranghara-
jan Venkatesan, Brucek Khailany, Joel Emer, Stephen W Keckler, and William J
Dally. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Net-
works. In Proceedings of the ACM/IEEE International Symposium on Computer
Architecture, pages 27–40, 2017.

[8] Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau,
Joon Kyung Kim, Vikas Chandra, and Hadi Esmaeilzadeh. Bit Fusion: Bit-Level
Dynamically Composable Architecture for Accelerating Deep Neural Network. In
Proceedings of the ACM/IEEE International Symposium on Computer Architecture,
pages 764–775, 2018.

[9] Norman P Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James
Laudon, Cliff Young, and David Patterson. A Domain-Specific Supercomputer for
Training Deep Neural Networks. Communications of the ACM, 63(7):67–78, 2020.

[10] Mingyu Yan, Lei Deng, Xing Hu, Ling Liang, Yujing Feng, Xiaochun Ye, Zhimin
Zhang, Dongrui Fan, and Yuan Xie. HyGCN: A GCN Accelerator with Hybrid Archi-
tecture. In Proceedings of the IEEE International Symposium on High Performance
Computer Architecture, pages 15–29, 2020.

[11] Tong Geng, Ang Li, Runbin Shi, Chunshu Wu, Tianqi Wang, Yanfei Li, Pouya
Haghi, Antonino Tumeo, Shuai Che, and Steve Reinhardt. AWB-GCN: A Graph
Convolutional Network Accelerator with Runtime Workload Rebalancing. In Pro-
ceedings of the IEEE/ACM International Symposium on Microarchitecture, pages
922–936, 2020.

[12] Jacob R Stevens, Dipankar Das, Sasikanth Avancha, Bharat Kaul, and Anand
Raghunathan. GNNerator: A Hardware/Software Framework for Accelerating
Graph Neural Networks. In Proceedings of the ACM/IEEE Design Automation
Conference, pages 955–960, 2021.

[13] Zhe Zhou, Bizhao Shi, Zhe Zhang, Yijin Guan, Guangyu Sun, and Guojie Luo.
BlockGNN: Towards Efficient GNN Acceleration Using Block-Circulant Weight
Matrices. In Proceedings of the ACM/IEEE Design Automation Conference, pages
1009–1014, 2021.

[14] Cen Chen, Kenli Li, Xiaofeng Zou, and Yangfan Li. DyGNN: Algorithm and
Architecture Support of Dynamic Pruning for Graph Neural Networks. In Proceedings
of the ACM/IEEE Design Automation Conference, pages 1201–1206, 2021.

[15] Bingyi Zhang, Rajgopal Kannan, and Viktor Prasanna. BoostGCN: A Framework
for Optimizing GCN Inference on FPGA. In Proceedings of the IEEE International
Symposium on Field-Programmable Custom Computing Machines, pages 29–39,
2021.

[16] Sudipta Mondal, Susmita Dey Manasi, Kishor Kunal, Ramprasath S, and Sachin S
Sapatnekar. GNNIE: GNN Inference Engine with Load-Balancing and Graph-
Specific Caching. In Proceedings of the ACM/IEEE Design Automation Conference,
pages 565–570, 2022.

[17] Sudipta Mondal, Susmita Dey Manasi, Kishor Kunal, S Ramprasath, Ziqing Zeng,
and Sachin S Sapatnekar. A Unified Engine for Accelerating GNN Weighting/Ag-
gregation Operations, with Efficient Load Balancing and Graph-Specific Caching.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
42(12):4844–4857, 2023.

[18] Sudipta Mondal, S Ramprasath, Ziqing Zeng, Kishor Kunal, and Sachin S Sapat-
nekar. A Multicore GNN Training Accelerator. In Proceedings of the IEEE/ACM
International Symposium on Low Power Electronics and Design, pages 1–6, 2023.

[19] Yuke Wang, Boyuan Feng, Gushu Li, Shuangchen Li, Lei Deng, Yuan Xie, and
Yufei Ding. GNNAdvisor: An Adaptive and Efficient Runtime System for GNN
Acceleration on GPUs. In Proceedings of the USENIX Symposium on Operating
Systems Design and Implementation, pages 515–531, 2021.

[20] Hanqing Zeng and Viktor Prasanna. GraphACT: Accelerating GCN Training
on CPU-FPGA Heterogeneous Platforms. In Proceedings of the ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, pages 255–265, 2020.

[21] Xiaobing Chen, Yuke Wang, Xinfeng Xie, Xing Hu, Abanti Basak, Ling Liang,
Mingyu Yan, Lei Deng, Yufei Ding, Zidong Du, and Yuan Xie. Rubik: A Hierarchical
Architecture for Efficient Graph Neural Network Training. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 41(4):936–949, 2022.

[22] Zhe Zhou, Cong Li, Xuechao Wei, Xiaoyang Wang, and Guangyu Sun. GNNear:
Accelerating Full-Batch Training of Graph Neural Networks with Near-Memory
Processing. In Proceedings of the International Conference on Parallel Architectures
and Compilation Techniques, pages 54–68, 2022.

[23] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional
Neural Networks on Graphs with Fast Localized Spectral Filtering. In Advances in
Neural Information Processing Systems, pages 3844–3852, 2016.

[24] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral Networks
and Locally Connected Networks on Graphs. In Proceedings of the International
Conference on Learning Representations, 2013.

[25] Thomas N Kipf and Max Welling. Semi-Supervised Classification with Graph
Convolutional Networks. In Proceedings of the International Conference on Learning
Representations, 2017.
[26] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive Representation Learning
on Large Graphs. In Advances in Neural Information Processing Systems, pages
1025–1035, 2017.

[27] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How Powerful are
Graph Neural Networks? In Proceedings of the International Conference on Learning
Representations, 2019.

[28] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure
Leskovec. Hierarchical Graph Representation Learning with Differentiable Pooling.
In Advances in Neural Information Processing Systems, pages 4805–4815, 2018.

[29] Tae Jun Ham, Lisa Wu, Narayanan Sundaram, Nadathur Satish, and Margaret
Martonosi. Graphicionado: A High-Performance and Energy-Efficient Accelerator
for Graph Analytics. In Proceedings of the IEEE/ACM International Symposium
on Microarchitecture, pages 1–13, 2016.

[30] Guohao Dai, Yuze Chi, Yu Wang, and Huazhong Yang. FPGP: Graph Processing
Framework on FPGA A Case Study of Breadth-First Search. In Proceedings of the
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages
105–110, 2016.

[31] Sang-Woo Jun, Andy Wright, Sizhuo Zhang, and Shuotao Xu. GraFBoost: Using
Accelerated Flash Storage for External Graph Analytics. In Proceedings of the
ACM/IEEE International Symposium on Computer Architecture, pages 411–424,
2018.

[32] Nanda K Unnikrishnan, Joe Gould, and Keshab K Parhi. SCV-GNN: Sparse Com-
pressed Vector-Based Graph Neural Network Aggregation. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 42(12):4803–4816, 2023.

[33] Adam Auten, Matthew Tomei, and Rakesh Kumar. Hardware Acceleration of
Graph Neural Networks. In Proceedings of the ACM/IEEE Design Automation
Conference, pages 1–6, 2020.
[34] Jeremy Fowers, Kalin Ovtcharov, Karin Strauss, Eric S Chung, and Greg Stitt. A
High Memory Bandwidth FPGA Accelerator for Sparse Matrix-Vector Multiplica-
tion. In Proceedings of the IEEE International Symposium on Field-Programmable
Custom Computing Machines, pages 36–43, 2014.

[35] Nitish Srivastava, Hanchen Jin, Jie Liu, David Albonesi, and Zhiru Zhang. MatRap-
tor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product.
In Proceedings of the IEEE/ACM International Symposium on Microarchitecture,
pages 766–780, 2020.

[36] Nitish Srivastava, Hanchen Jin, Shaden Smith, Hongbo Rong, David Albonesi,
and Zhiru Zhang. Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense
Tensor Computations. In Proceedings of the IEEE International Symposium on
High Performance Computer Architecture, pages 689–702, 2020.

[37] David Salomon. Data Compression: The Complete Reference. Springer Science &
Business Media, London, UK, 4th edition, 2007.

[38] J. Hruska. HBM2 vs. GDDR6: New Video Compares, Contrasts Memory Types.
https://www.extremetech.com/computing/289391-hbm2-vs-gddr6-new-video-compares-contrasts-memory-types,
4/11/2019.

[39] S Ward-Foxton. Memory Technologies Confront Edge AI’s Diverse Challenges.
https://www.eetimes.com/memory-technologies-confront-edge-ais-diverse-challenges,
9/18/2020.

[40] Peter Nilsson, Ateeq Ur Rahman Shaik, Rakesh Gangarajaiah, and Erik Hertz.
Hardware Implementation of the Exponential Function using Taylor Series. In
Proceedings of the IEEE Nordic Circuits and Systems Conference, pages 1–4, 2014.

[41] Shengwen Liang, Ying Wang, Cheng Liu, Lei He, LI Huawei, Dawen Xu, and
Xiaowei Li. EnGN: A High-Throughput and Energy-Efficient Accelerator for Large
Graph Neural Networks. IEEE Transactions on Computers, 70(9):1511–1525, 2021.
[42] Jasmina Malicevic, Baptiste Lepers, and Willy Zwaenepoel. Everything You Always
Wanted to Know about Multicore Graph Processing but Were Afraid to Ask. In
Proceedings of the USENIX Annual Technical Conference, pages 631–643, 2017.

[43] Lifeng Nai, Ramyad Hadidi, Jaewoong Sim, Hyojong Kim, Pranith Kumar, and
Hyesoon Kim. GraphPIM: Enabling Instruction-Level PIM Offloading in Graph
Computing Frameworks. In Proceedings of the IEEE International Symposium on
High Performance Computer Architecture, pages 457–468, 2017.

[44] Jiajun Li, Ahmed Louri, Avinash Karanth, and Razvan Bunescu. GCNAX: A
Flexible and Energy-efficient Accelerator for Graph Convolutional Neural Networks.
In Proceedings of the IEEE International Symposium on High Performance Computer
Architecture, pages 775–788, 2021.

[45] Jiajun Li, Hao Zheng, Ke Wang, and Ahmed Louri. SGCNAX: A Scalable Graph
Convolutional Neural Network Accelerator With Workload Balancing. IEEE Trans-
actions on Parallel and Distributed Systems, 33(11):2834–2845, 2022.

[46] Soroush Ghodrati, Byung Hoon Ahn, Joon Kyung Kim, Sean Kinzer, Brahmen-
dra Reddy Yatham, Navateja Alla, Hardik Sharma, Mohammad Alian, Eiman
Ebrahimi, Nam Sung Kim, Cliff Young, and Hadi Esmaeilzadeh. Planaria: Dynamic
Architecture Fission for Spatial Multi-Tenant Acceleration of Deep Neural Networks.
In Proceedings of the IEEE/ACM International Symposium on Microarchitecture,
pages 681–697, 2020.

[47] Udit Gupta, Samuel Hsia, Jeff Zhang, Mark Wilkening, Javin Pombra, Hsien-
Hsin Sean Lee, Gu-Yeon Wei, Carole-Jean Wu, and David Brooks. RecPipe:
Co-designing Models and Hardware to Jointly Optimize Recommendation Quality
and Performance. In Proceedings of the IEEE/ACM International Symposium on
Microarchitecture, pages 870–884, 2021.

[48] Kai Zhong, Shulin Zeng, Wentao Hou, Guohao Dai, Zhenhua Zhu, Xuecang Zhang,
Shihai Xiao, Huazhong Yang, and Yu Wang. CoGNN: An Algorithm-Hardware
Co-Design Approach to Accelerate GNN Inference With Minibatch Sampling.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
42(12):4883–4896, 2023.
[49] Zeyu Zhu, Fanrong Li, Gang Li, Zejian Liu, Zitao Mo, Qinghao Hu, Xiaoyao Liang,
and Jian Cheng. MEGA: A Memory-Efficient GNN Accelerator Exploiting Degree-Aware
Mixed-Precision Quantization. In Proceedings of the IEEE International Symposium
on High Performance Computer Architecture, pages 124–138, 2024.

[50] Yunming Zhang, Vladimir Kiriansky, Charith Mendis, Saman Amarasinghe, and
Matei Zaharia. Making Caches Work for Graph Analytics. In Proceedings of the
IEEE International Conference on Big Data, pages 293–302, 2017.

[51] Junya Arai, Hiroaki Shiokawa, Takeshi Yamamuro, Makoto Onizuka, and Sotetsu
Iwamura. Rabbit Order: Just-in-Time Parallel Reordering for Fast Graph Analy-
sis. In Proceedings of the IEEE International Parallel and Distributed Processing
Symposium, pages 22–31, 2016.

[52] Priyank Faldu, Jeff Diamond, and Boris Grot. Domain-Specialized Cache Manage-
ment for Graph Analytics. In Proceedings of the IEEE International Symposium on
High Performance Computer Architecture, pages 234–248, 2020.

[53] Yoongu Kim, Weikun Yang, and Onur Mutlu. Ramulator: A Fast and Extensible
DRAM Simulator. IEEE Computer Architecture Letters, 15(1):45–49, 2015.

[54] Mike O’Connor, Niladrish Chatterjee, Donghyuk Lee, John Wilson, Aditya Agrawal,
Stephen W Keckler, and William J Dally. Fine-Grained DRAM: Energy-Efficient
DRAM for Extreme Bandwidth Systems. In Proceedings of the IEEE/ACM Inter-
national Symposium on Microarchitecture, pages 41–54, 2017.

[55] Vijay Prakash Dwivedi, Chaitanya K Joshi, Anh Tuan Luu, Thomas Laurent, Yoshua
Bengio, and Xavier Bresson. Benchmarking Graph Neural Networks. Journal of
Machine Learning Research, 24(43):1–48, 2023.

[56] Xiaowei Zhu, Wentao Han, and Wenguang Chen. GridGraph: Large-Scale Graph
Processing on a Single Machine Using 2-Level Hierarchical Partitioning. In Pro-
ceedings of the USENIX Annual Technical Conference, pages 375–386, 2015.
[57] Jason Mohoney, Roger Waleffe, Henry Xu, Theodoros Rekatsinas, and Shivaram
Venkataraman. Marius: Learning Massive Graph Embeddings on a Single Ma-
chine. In Proceedings of the USENIX Symposium on Operating Systems Design and
Implementation, pages 533–549, 2021.

[58] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu,
Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for Machine
Learning on Graphs. In Advances in Neural Information Processing Systems, pages
22118–22133, 2020.

[59] Zhihao Jia, Sina Lin, Mingyu Gao, Matei Zaharia, and Alex Aiken. Improving the
Accuracy, Scalability, and Performance of Graph Neural Networks with Roc. In
Proceedings of Machine Learning and Systems, pages 187–198, 2020.

[60] Lingxiao Ma, Zhi Yang, Youshan Miao, Jilong Xue, Ming Wu, Lidong Zhou, and
Yafei Dai. NeuGraph: Parallel Deep Neural Network Computation on Large Graphs.
In Proceedings of the USENIX Annual Technical Conference, pages 443–458, 2019.

[61] Zhiqi Lin, Cheng Li, Youshan Miao, Yunxin Liu, and Yinlong Xu. PaGraph: Scaling
GNN Training on Large Graphs via Computation-Aware Caching. In Proceedings
of the ACM Symposium on Cloud Computing, pages 401–415, 2020.

[62] Haoran You, Tong Geng, Yongan Zhang, Ang Li, and Yingyan Lin. GCoD: Graph
Convolutional Network Acceleration via Dedicated Algorithm and Accelerator Co-
Design. In Proceedings of the IEEE International Symposium on High Performance
Computer Architecture, pages 460–474, 2022.

[63] Zheng Qu, Dimin Niu, Shuangchen Li, Hongzhong Zheng, and Yuan Xie. TT-GNN:
Efficient On-Chip Graph Neural Network Training via Embedding Reformation and
Hardware Optimization. In Proceedings of the IEEE/ACM International Symposium
on Microarchitecture, pages 452–464, 2023.

[64] Gongjian Sun, Mingyu Yan, Duo Wang, Han Li, Wenming Li, Xiaochun Ye, Don-
grui Fan, and Yuan Xie. Multi-Node Acceleration for Large-Scale GCNs. IEEE
Transactions on Computers, 71(12):3140–3152, 2022.
[65] Shyam A Tailor, Javier Fernandez-Marques, and Nicholas D Lane. Degree-Quant:
Quantization-Aware Training for Graph Neural Networks. In Proceedings of the
International Conference on Learning Representations, 2021.

[66] George Karypis and Vipin Kumar. A Fast and High Quality Multilevel Scheme
for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, pages
359–392, 1998.

[67] Swapnil Gandhi and Anand Padmanabha Iyer. P3: Distributed Deep Graph
Learning at Scale. In Proceedings of the USENIX Symposium on Operating Systems
Design and Implementation, 2021.

[68] CACTI 6.5. https://github.com/Chun-Feng/CACTI-6.5.

[69] Nan Jiang, Daniel U Becker, George Michelogiannakis, James Balfour, Brian Towles,
David E Shaw, John Kim, and William J Dally. A Detailed and Flexible Cycle-
Accurate Network-on-Chip Simulator. In Proceedings of the International Symposium
on Performance Analysis of Systems and Software, pages 86–96, 2013.

[70] Andrew B Kahng, Bill Lin, and Siddhartha Nath. ORION3.0: A Comprehensive
NoC Router Estimation Tool. IEEE Embedded Systems Letters, pages 41–45, 2015.

[71] Minjie Yu Wang. Deep Graph Library: Towards Efficient and Scalable Deep
Learning on Graphs. In Proceedings of the International Conference on Learning
Representations, 2019. https://github.com/dmlc/dgl/.

[72] Emanuele Rossi, Ben Chamberlain, Fabrizio Frasca, Davide Eynard, Federico Monti,
and Michael Bronstein. Temporal Graph Networks for Deep Learning on Dynamic
Graphs. Proceedings of the International Conference on Machine Learning, 2020.

[73] Seyed Mehran Kazemi, Rishab Goel, Kshitij Jain, Ivan Kobyzev, Akshay Sethi,
Peter Forsyth, and Pascal Poupart. Representation Learning for Dynamic Graphs:
A Survey. Journal of Machine Learning Research, 21(70):1–73, 2020.

[74] Chunyang Wang, Desen Sun, and Yuebin Bai. PiPAD: Pipelined and Parallel
Dynamic GNN Training on GPUs. In Proceedings of the ACM SIGPLAN Annual
Symposium on Principles and Practice of Parallel Programming, pages 405–418,
2023.

[75] Minjia Zhang, Samyam Rajbhandari, Wenhan Wang, and Yuxiong He. DeepCPU:
Serving RNN-based Deep Learning Models 10x Faster. In Proceedings of the USENIX
Annual Technical Conference, pages 951–965, 2018.

[76] Venkatesan T Chakaravarthy, Shivmaran S Pandian, Saurabh Raje, Yogish Sab-
harwal, Toyotaro Suzumura, and Shashanka Ubaru. Efficient Scaling of Dynamic
Graph Neural Networks. In Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021.

[77] Xinkai Song, Tian Zhi, Zhe Fan, Zhenxing Zhang, Xi Zeng, Wei Li, Xing Hu,
Zidong Du, Qi Guo, and Yunji Chen. Cambricon-G: A Polyvalent Energy-Efficient
Accelerator for Dynamic Graph Neural Networks. IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, 41(1):116–128, 2021.

[78] Hanqiu Chen and Cong Hao. DGNN-Booster: A Generic FPGA Accelerator
Framework For Dynamic Graph Neural Network Inference. In Proceedings of
the IEEE International Symposium on Field-Programmable Custom Computing
Machines, pages 195–201, 2023.

[79] Yu Huang, Long Zheng, Pengcheng Yao, Qinggang Wang, Haifeng Liu, Xiaofei
Liao, Hai Jin, and Jingling Xue. ReaDy: A ReRAM-Based Processing-in-Memory
Accelerator for Dynamic Graph Convolutional Networks. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 41(11):3567–3578, 2022.

[80] Yu Huang, Long Zheng, Pengcheng Yao, Qinggang Wang, Xiaofei Liao, Hai Jin, and
Jingling Xue. Accelerating Graph Convolutional Networks Using Crossbar-based
Processing-In-Memory Architectures. In Proceedings of the IEEE International
Symposium on High Performance Computer Architecture, pages 1029–1042, 2022.

[81] George Panagopoulos, Giannis Nikolentzos, and Michalis Vazirgiannis. Transfer
Graph Neural Networks for Pandemic Forecasting. In Proceedings of the AAAI
Conference on Artificial Intelligence, pages 4838–4845, 2021.
[82] Aldo Pareja, Giacomo Domeniconi, Jie Chen, Tengfei Ma, Toyotaro Suzumura,
Hiroki Kanezashi, Tim Kaler, Tao Schardl, and Charles Leiserson. EvolveGCN:
Evolving Graph Convolutional Networks for Dynamic Graphs. In Proceedings of
the AAAI Conference on Artificial Intelligence, pages 5363–5370, 2020.

[83] Ling Zhao, Yujiao Song, Chao Zhang, Yu Liu, Pu Wang, Tao Lin, Min Deng, and
Haifeng Li. T-GCN: A Temporal Graph Convolutional Network for Traffic Prediction.
IEEE Transactions on Intelligent Transportation Systems, 21(9):3848–3858, 2019.

[84] Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. Attention
Based Spatial-Temporal Graph Convolutional Networks for Traffic Flow Forecasting.
In Proceedings of the AAAI Conference on Artificial Intelligence, pages 922–929,
2019.

[85] Hongkuan Zhou, Da Zheng, Israt Nisa, Vasileios Ioannidis, Xiang Song, and George
Karypis. TGL: A General Framework for Temporal GNN Training on Billion-Scale
Graphs. Proceedings of the VLDB Endowment, 15(8):1572–1580, 2022.

[86] Benedek Rozemberczki, Paul Scherer, Yixuan He, George Panagopoulos, Alexander
Riedel, Maria Astefanoaei, Oliver Kiss, Ferenc Beres, Guzman Lopez, Nicolas
Collignon, and Rik Sarkar. PyTorch Geometric Temporal: Spatiotemporal Signal
Processing with Neural Machine Learning Models. In Proceedings of the ACM
International Conference on Information & Knowledge Management, pages 4564–
4573, 2021.
