Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework For Traffic Forecasting
ing to shorten the length of sequences by K_t − 1 each time. Thus, the input of the temporal convolution for each node can be regarded as a length-M sequence with C_i channels, Y ∈ R^(M×C_i). The convolution kernel Γ ∈ R^(K_t×C_i×2C_o) is designed to map the input Y to a single output element [P Q] ∈ R^((M−K_t+1)×2C_o) (P, Q are split in half with the same number of channels). As a result, the temporal gated convolution can be defined as,

Γ ∗T Y = P ⊙ σ(Q) ∈ R^((M−K_t+1)×C_o),    (7)

where P, Q are the inputs of the gates in GLU, respectively; ⊙ denotes the element-wise Hadamard product. The sigmoid gate σ(Q) controls which inputs P of the current states are relevant for discovering compositional structure and dynamic variances in time series. The non-linearity gates also contribute to exploiting the full input field through stacked temporal layers. Furthermore, residual connections are implemented among stacked temporal convolutional layers. Similarly, the temporal convolution can be generalized to 3-D variables by employing the same convolution kernel Γ to every node Y_i ∈ R^(M×C_i) (e.g. sensor stations) in G equally, noted as "Γ ∗T Y" with Y ∈ R^(M×n×C_i).

3.4 Spatio-temporal Convolutional Block
In order to fuse features from both spatial and temporal domains, the spatio-temporal convolutional block (ST-Conv block) is constructed to jointly process graph-structured time series. The block itself can be stacked or extended depending on the scale and complexity of a particular case.

As illustrated in Figure 2 (mid), the spatial layer in the middle bridges two temporal layers, which achieves fast spatial-state propagation from graph convolution through temporal convolutions. The "sandwich" structure also helps the network sufficiently apply a bottleneck strategy to achieve scale compression and feature squeezing, by downscaling and upscaling of the channels C through the graph convolutional layer. Moreover, layer normalization is utilized within every ST-Conv block to prevent overfitting.

The input and output of ST-Conv blocks are all 3-D tensors. For the input v^l ∈ R^(M×n×C^l) of block l, the output v^(l+1) ∈ R^((M−2(K_t−1))×n×C^(l+1)) is computed by,

v^(l+1) = Γ^l_1 ∗T ReLU(Θ^l ∗G (Γ^l_0 ∗T v^l)),    (8)

where Γ^l_0, Γ^l_1 are the upper and lower temporal kernels within block l, Θ^l is the spectral kernel of graph convolution, and ReLU(·) denotes the rectified linear units function. We use L2 loss to measure the performance of our model; thus, the loss function of STGCN for traffic prediction can be written as,

L(v̂; W_θ) = Σ_t ||v̂(v_{t−M+1}, ..., v_t, W_θ) − v_{t+1}||²,    (9)

where W_θ are all trainable parameters in the model; v_{t+1} is the ground truth and v̂(·) denotes the model's prediction.

We now summarize the main characteristics of our model STGCN in the following,

• STGCN is a universal framework for processing structured time series. It is not only able to tackle traffic network modeling and prediction issues but can also be applied to more general spatio-temporal sequence learning tasks.

• The spatio-temporal block combines graph convolutions and gated temporal convolutions, which can extract the most useful spatial features and capture the most essential temporal features coherently.

• The model is entirely composed of convolutional structures and therefore achieves parallelization over the input with fewer parameters and faster training speed. More importantly, this economical architecture allows the model to handle large-scale networks with greater efficiency.

4 Experiments
4.1 Dataset Description
We verify our model on two real-world traffic datasets, BJER4 and PeMSD7, collected by the Beijing Municipal Traffic Commission and the California Department of Transportation, respectively. Each dataset contains key attributes of traffic observations and geographic information with corresponding timestamps, as detailed below.

BJER4 was gathered from the major areas of the east ring No. 4 routes in Beijing City by double-loop detectors. There are 12 roads selected for our experiment. The traffic data are aggregated every 5 minutes. The time period used is from 1st July to 31st August, 2014, except the weekends. We select the first month of historical speed records as the training set, and the rest serves as the validation and test sets, respectively.

PeMSD7 was collected from the Caltrans Performance Measurement System (PeMS) in real-time by over 39,000 sensor stations, deployed across the major metropolitan areas of the California state highway system [Chen et al., 2001]. The dataset is also aggregated into 5-minute intervals from 30-second data samples. We randomly select a medium and a large scale among District 7 of California containing 228 and 1,026
Table 1: Performance comparison of different approaches on the dataset BJER4 (15/ 30/ 45 min).

Model        | MAE              | MAPE (%)            | RMSE
HA           | 5.21             | 14.64               | 7.56
LSVR         | 4.24/ 5.23/ 6.12 | 10.11/ 12.70/ 14.95 | 5.91/ 7.27/ 8.81
ARIMA        | 5.99/ 6.27/ 6.70 | 15.42/ 16.36/ 17.67 | 8.19/ 8.38/ 8.72
FNN          | 4.30/ 5.33/ 6.14 | 10.68/ 13.48/ 15.82 | 5.86/ 7.31/ 8.58
FC-LSTM      | 4.24/ 4.74/ 5.22 | 10.78/ 12.17/ 13.60 | 5.71/ 6.62/ 7.44
GCGRU        | 3.84/ 4.62/ 5.32 | 9.31/ 11.41/ 13.30  | 5.22/ 6.35/ 7.58
STGCN(Cheb)  | 3.78/ 4.45/ 5.03 | 9.11/ 10.80/ 12.27  | 5.20/ 6.20/ 7.21
STGCN(1st)   | 3.83/ 4.51/ 5.10 | 9.28/ 11.19/ 12.79  | 5.29/ 6.39/ 7.39
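As a concrete illustration of the temporal gated convolution in Eq. (7), the following is a minimal NumPy sketch for a single node: a 1-D convolution produces 2C_o channels, which are split into P and Q, and the output is P gated by σ(Q). The function name and the explicit loop are illustrative only, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def temporal_gated_conv(Y, W):
    """Gated temporal convolution (GLU), Eq. (7), for one node.

    Y : (M, C_i) input sequence with C_i channels.
    W : (K_t, C_i, 2*C_o) convolution kernel Gamma.
    Returns an array of shape (M - K_t + 1, C_o).
    """
    M, C_i = Y.shape
    K_t, _, C2 = W.shape
    C_o = C2 // 2
    out = np.empty((M - K_t + 1, C2))
    for t in range(M - K_t + 1):
        window = Y[t:t + K_t]                       # (K_t, C_i) slice of the sequence
        out[t] = np.einsum('kc,kcd->d', window, W)  # map window to 2*C_o outputs
    P, Q = out[:, :C_o], out[:, C_o:]               # split channels in half
    return P * sigmoid(Q)                           # P (Hadamard) sigma(Q)
```

Note how each convolution shortens the sequence by K_t − 1, matching the output length M − K_t + 1 stated in the text.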
[Figure 4: Speed prediction in the morning peak and evening rush hours of the dataset PeMSD7. Vertical axis: Speed (km/h).]

do. For PeMSD7(L), GCGRU has to use half of the batch size since its GPU consumption exceeded the memory capacity of a single card (results marked as "*" in Table 2), while STGCN only needs to double the channels in the middle of its ST-Conv blocks. Even so, our model still consumes less than a tenth of the training time of the GCGRU model under this

[Figure: Test RMSE and Test MAE curves comparing STGCN(Cheb), STGCN(1st), and GCGRU.]
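The "sandwich" composition of Eq. (8), temporal layer, graph-convolutional layer, temporal layer, can be sketched as below. This is a simplified NumPy illustration under assumed shapes: A_hat stands for a precomputed normalized graph kernel (e.g. a first-order approximation), and layer normalization and residual connections are omitted; all names are hypothetical.

```python
import numpy as np

def glu_temporal(V, W):
    """Gated temporal convolution (Eq. 7) applied to every node equally.

    V : (M, n, C_in) graph-structured sequence.
    W : (K_t, C_in, 2*C_out) temporal kernel Gamma.
    Returns (M - K_t + 1, n, C_out).
    """
    M, n, _ = V.shape
    K, _, C2 = W.shape
    C_o = C2 // 2
    # Sliding windows of length K_t along the time axis.
    win = np.stack([V[t:t + K] for t in range(M - K + 1)])  # (M-K+1, K, n, C_in)
    z = np.einsum('tknc,kcd->tnd', win, W)                  # (M-K+1, n, 2*C_out)
    P, Q = z[..., :C_o], z[..., C_o:]
    return P * (1.0 / (1.0 + np.exp(-Q)))                   # P gated by sigmoid(Q)

def st_conv_block(v, A_hat, G0, Theta, G1):
    """One ST-Conv block, Eq. (8): v_{l+1} = G1 *T ReLU(Theta *G (G0 *T v_l)).

    v     : (M, n, C^l) input tensor of block l.
    A_hat : (n, n) normalized graph kernel (assumed precomputed).
    G0, G1: upper and lower temporal kernels; Theta: spatial weights.
    """
    h = glu_temporal(v, G0)                                 # upper temporal layer
    # Spatial layer: graph convolution per time step, then ReLU.
    h = np.maximum(np.einsum('ij,tjc,cd->tid', A_hat, h, Theta), 0.0)
    return glu_temporal(h, G1)                              # lower temporal layer
```

Each of the two temporal layers trims K_t − 1 steps, so the block output has M − 2(K_t − 1) time steps, as stated in the text; choosing Theta with fewer output than input channels realizes the bottleneck (downscale/upscale) strategy.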