Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework For Traffic Forecasting

Spatio-Temporal Graph Convolutional Networks (STGCN) is a deep learning framework proposed for traffic forecasting that models multi-scale traffic networks. STGCN uses graph convolutional structures instead of regular convolutional or recurrent units, enabling faster training with fewer parameters. Experiments show STGCN effectively captures comprehensive spatio-temporal correlations and outperforms state-of-the-art baselines on real-world traffic datasets.


Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting

Bing Yu∗1, Haoteng Yin∗2,3, Zhanxing Zhu†3,4

1 School of Mathematical Sciences, Peking University, Beijing, China
2 Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
3 Center for Data Science, Peking University, Beijing, China
4 Beijing Institute of Big Data Research (BIBDR), Beijing, China
{byu, htyin, zhanxing.zhu}@pku.edu.cn
arXiv:1709.04875v4 [cs.LG] 12 Jul 2018

∗ Equal contributions. † Corresponding author.

Abstract

Timely and accurate traffic forecasting is crucial for urban traffic control and guidance. Due to the high nonlinearity and complexity of traffic flow, traditional methods cannot satisfy the requirements of mid-and-long term prediction tasks and often neglect spatial and temporal dependencies. In this paper, we propose a novel deep learning framework, Spatio-Temporal Graph Convolutional Networks (STGCN), to tackle the time series prediction problem in the traffic domain. Instead of applying regular convolutional and recurrent units, we formulate the problem on graphs and build the model with complete convolutional structures, which enable much faster training speed with fewer parameters. Experiments show that our model STGCN effectively captures comprehensive spatio-temporal correlations through modeling multi-scale traffic networks and consistently outperforms state-of-the-art baselines on various real-world traffic datasets.

1 Introduction

Transportation plays a vital role in everybody's daily life. According to a survey in 2015, U.S. drivers spend about 48 minutes on average behind the wheel daily.¹ Under these circumstances, accurate real-time forecast of traffic conditions is of paramount importance for road users, private sectors and governments. Widely used transportation services, such as flow control, route planning, and navigation, also rely heavily on a high-quality traffic condition evaluation. In general, multi-scale traffic forecast is the premise and foundation of urban traffic control and guidance, which is also one of the main functions of the Intelligent Transportation System (ITS).

¹ https://aaafoundation.org/american-driving-survey-2014-2015/

In traffic studies, fundamental variables of traffic flow, namely speed, volume, and density, are typically chosen as indicators to monitor the current status of traffic conditions and to predict the future. Based on the length of prediction, traffic forecast is generally classified into two scales: short-term (5 ∼ 30 min), and medium and long term (over 30 min). Most prevalent statistical approaches (for example, linear regression) are able to perform well on short interval forecast. However, due to the uncertainty and complexity of traffic flow, those methods are less effective for relatively long-term predictions.

Previous studies on mid-and-long term traffic prediction can be roughly divided into two categories: dynamical modeling and data-driven methods. Dynamical modeling uses mathematical tools (e.g. differential equations) and physical knowledge to formulate traffic problems by computational simulation [Vlahogianni, 2015]. To achieve a steady state, the simulation process not only requires sophisticated systematic programming but also consumes massive computational power. Impractical assumptions and simplifications in the modeling also degrade the prediction accuracy. Therefore, with the rapid development of traffic data collection and storage techniques, a large group of researchers are shifting their attention to data-driven approaches.

Classic statistical and machine learning models are two major representatives of data-driven methods. In time-series analysis, autoregressive integrated moving average (ARIMA) and its variants are among the most consolidated approaches based on classical statistics [Ahmed and Cook, 1979; Williams and Hoel, 2003]. However, this type of model is limited by the stationarity assumption on time sequences and fails to take the spatio-temporal correlation into account. Therefore, these approaches have constrained representability of highly nonlinear traffic flow. Recently, classic statistical models have been vigorously challenged by machine learning methods on traffic prediction tasks. Higher prediction accuracy and more complex data modeling can be achieved by these models, such as k-nearest neighbors algorithm (KNN), support vector machine (SVM), and neural networks (NN).

Deep learning approaches have been widely and successfully applied to various traffic tasks nowadays. Significant progress has been made in related work, for instance, deep belief network (DBN) [Jia et al., 2016; Huang et al., 2014] and stacked autoencoder (SAE) [Lv et al., 2015; Chen et al., 2016]. However, it is difficult for these dense
networks to extract spatial and temporal features from the input jointly. Moreover, within narrow constraints or even complete absence of spatial attributes, the representative ability of these networks would be hindered seriously.

To take full advantage of spatial features, some researchers use convolutional neural network (CNN) to capture adjacent relations among the traffic network, along with employing recurrent neural network (RNN) on the time axis. By combining long short-term memory (LSTM) network [Hochreiter and Schmidhuber, 1997] and 1-D CNN, Wu and Tan [2016] presented a feature-level fused architecture CLTFP for short-term traffic forecast. Although it adopted a straightforward strategy, CLTFP still made the first attempt to align spatial and temporal regularities. Afterwards, Shi et al. [2015] proposed the convolutional LSTM, which is an extended fully-connected LSTM (FC-LSTM) with embedded convolutional layers. However, the normal convolutional operation applied restricts the model to only process grid structures (e.g. images, videos) rather than general domains. Meanwhile, recurrent networks for sequence learning require iterative training, which introduces error accumulation by steps. Additionally, RNN-based networks (including LSTM) are widely known to be difficult to train and computationally heavy.

To overcome these issues, we introduce several strategies to effectively model temporal dynamics and spatial dependencies of traffic flow. To fully utilize spatial information, we model the traffic network by a general graph instead of treating it separately (e.g. grids or segments). To handle the inherent deficiencies of recurrent networks, we employ a fully convolutional structure on the time axis. Above all, we propose a novel deep learning architecture, the spatio-temporal graph convolutional networks, for traffic forecasting tasks. This architecture comprises several spatio-temporal convolutional blocks, which are a combination of graph convolutional layers [Defferrard et al., 2016] and convolutional sequence learning layers, to model spatial and temporal dependencies. To the best of our knowledge, this is the first time that purely convolutional structures are applied to extract spatio-temporal features simultaneously from graph-structured time series in a traffic study. We evaluate our proposed model on two real-world traffic datasets. Experiments show that our framework outperforms existing baselines in prediction tasks with multiple preset prediction lengths and network scales.

2 Preliminary

2.1 Traffic Prediction on Road Graphs

Traffic forecast is a typical time-series prediction problem, i.e. predicting the most likely traffic measurements (e.g. speed or traffic flow) in the next $H$ time steps given the previous $M$ traffic observations as,

$$\hat{v}_{t+1}, \ldots, \hat{v}_{t+H} = \underset{v_{t+1}, \ldots, v_{t+H}}{\arg\max} \; \log P(v_{t+1}, \ldots, v_{t+H} \mid v_{t-M+1}, \ldots, v_t), \tag{1}$$

where $v_t \in \mathbb{R}^n$ is an observation vector of $n$ road segments at time step $t$, each element of which records the historical observation for a single road segment.

In this work, we define the traffic network on a graph and focus on structured traffic time series. The observation $v_t$ is not independent but linked by pairwise connection in the graph. Therefore, the data point $v_t$ can be regarded as a graph signal that is defined on an undirected graph (or directed one) $\mathcal{G}$ with weights $w_{ij}$ as shown in Figure 1. At the $t$-th time step, in graph $\mathcal{G}_t = (\mathcal{V}_t, \mathcal{E}, W)$, $\mathcal{V}_t$ is a finite set of vertices, corresponding to the observations from $n$ monitor stations in a traffic network; $\mathcal{E}$ is a set of edges, indicating the connectedness between stations; while $W \in \mathbb{R}^{n \times n}$ denotes the weighted adjacency matrix of $\mathcal{G}_t$.

Figure 1: Graph-structured traffic data. Each $v_t$ indicates a frame of current traffic status at time step $t$, which is recorded in a graph-structured data matrix.

2.2 Convolutions on Graphs

A standard convolution for regular grids is clearly not applicable to general graphs. There are two basic approaches currently exploring how to generalize CNNs to structured data forms. One is to expand the spatial definition of a convolution [Niepert et al., 2016], and the other is to manipulate in the spectral domain with graph Fourier transforms [Bruna et al., 2013]. The former approach rearranges the vertices into certain grid forms which can be processed by normal convolutional operations. The latter one introduces the spectral framework to apply convolutions in spectral domains, often named the spectral graph convolution. Several follow-up studies make the graph convolution more promising by reducing the computational complexity from $O(n^2)$ to linear [Defferrard et al., 2016; Kipf and Welling, 2016].

We introduce the notion of graph convolution operator "$*_{\mathcal{G}}$" based on the conception of spectral graph convolution, as the multiplication of a signal $x \in \mathbb{R}^n$ with a kernel $\Theta$,

$$\Theta *_{\mathcal{G}} x = \Theta(L)x = \Theta(U \Lambda U^T)x = U \Theta(\Lambda) U^T x, \tag{2}$$

where the graph Fourier basis $U \in \mathbb{R}^{n \times n}$ is the matrix of eigenvectors of the normalized graph Laplacian $L = I_n - D^{-1/2} W D^{-1/2} = U \Lambda U^T \in \mathbb{R}^{n \times n}$ ($I_n$ is an identity matrix, $D \in \mathbb{R}^{n \times n}$ is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$); $\Lambda \in \mathbb{R}^{n \times n}$ is the diagonal matrix of eigenvalues of $L$, and the filter $\Theta(\Lambda)$ is also a diagonal matrix. By this definition, a graph signal $x$ is filtered by a kernel $\Theta$ with multiplication between $\Theta$ and the graph Fourier transform $U^T x$ [Shuman et al., 2013].

3 Proposed Model

3.1 Network Architecture

In this section, we elaborate on the proposed architecture of spatio-temporal graph convolutional networks (STGCN). As shown in Figure 2, STGCN is composed of several spatio-temporal convolutional blocks, each of which is formed as a "sandwich" structure with two gated sequential convolution layers and one spatial graph convolution layer in between. The details of each module are described as follows.
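As an illustration of Eq. (2), the following is a minimal NumPy sketch and not the authors' implementation. It assumes a small symmetric adjacency matrix and represents the diagonal filter $\Theta(\Lambda)$ by a hypothetical callable `theta` applied to the eigenvalues. The function names, the handling of isolated nodes, and the toy path graph are all assumptions made for this example.

```python
import numpy as np

def normalized_laplacian(W):
    """L = I_n - D^{-1/2} W D^{-1/2} for a symmetric weighted adjacency W."""
    d = W.sum(axis=1)
    # Isolated nodes (degree 0) get a zero scaling factor; this is an assumption.
    d_inv_sqrt = np.zeros_like(d, dtype=float)
    d_inv_sqrt[d > 0] = 1.0 / np.sqrt(d[d > 0])
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def spectral_graph_conv(W, x, theta):
    """Eq. (2): U Theta(Lambda) U^T x, with Theta(Lambda) = diag(theta(lam))."""
    lam, U = np.linalg.eigh(normalized_laplacian(W))  # L is symmetric
    x_hat = U.T @ x                    # graph Fourier transform of the signal
    return U @ (theta(lam) * x_hat)    # filter in the spectral domain, transform back

# Toy undirected path graph 0-1-2-3 with unit weights (illustrative data only).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 2.0, 3.0, 4.0])

y_identity = spectral_graph_conv(W, x, lambda lam: np.ones_like(lam))
y_laplacian = spectral_graph_conv(W, x, lambda lam: lam)
```

With `theta` returning ones the signal passes through unchanged, and with `theta(lam) = lam` the result equals $Lx$. The full eigendecomposition and the dense products with $U$ are what make this direct form costly on large graphs.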
[Figure 2 panels. Left: the overall pipeline, in which the input $(v_{t-M+1}, \ldots, v_t)$ and $W$ pass through two ST-Conv blocks and an output layer to give $\hat{v}$. Middle: one ST-Conv block mapping $v^l$ to $v^{l+1}$ (temporal gated-conv, C=64; spatial graph-conv, C=16; temporal gated-conv, C=64). Right: one temporal gated-conv layer (1-D conv followed by GLU) mapping $(v^l_{t-M+1}, \ldots, v^l_t)$ to $(v^l_{t-M+K_t}, \ldots, v^l_t)$.]

Figure 2: Architecture of spatio-temporal graph convolutional networks. The framework STGCN consists of two spatio-temporal convolutional blocks (ST-Conv blocks) and a fully-connected output layer in the end. Each ST-Conv block contains two temporal gated convolution layers and one spatial graph convolution layer in the middle. The residual connection and bottleneck strategy are applied inside each block. The input $v_{t-M+1}, \ldots, v_t$ is uniformly processed by ST-Conv blocks to explore spatial and temporal dependencies coherently. Comprehensive features are integrated by an output layer to generate the final prediction $\hat{v}$.

3.2 Graph CNNs for Extracting Spatial Features

The traffic network is generally organized as a graph structure. It is natural and reasonable to formulate road networks as graphs mathematically. However, previous studies neglect spatial attributes of traffic networks: the connectivity and globality of the networks are overlooked, since they are split into multiple segments or grids. Even with 2-D convolutions on grids, a model can only capture the spatial locality roughly due to compromises of data modeling. Accordingly, in our model, the graph convolution is employed directly on graph-structured data to extract highly meaningful patterns and features in the space domain. Though the computation of kernel $\Theta$ in graph convolution by Eq. (2) can be expensive due to $O(n^2)$ multiplications with the graph Fourier basis, two approximation strategies are applied to overcome this issue.

Chebyshev Polynomials Approximation. To localize the filter and reduce the number of parameters, the kernel $\Theta$ can be restricted to a polynomial of $\Lambda$ as $\Theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k \Lambda^k$, where $\theta \in \mathbb{R}^K$ is a vector of polynomial coefficients. $K$ is the kernel size of graph convolution, which determines the maximum radius of the convolution from central nodes. Traditionally, the Chebyshev polynomial $T_k(x)$ is used to approximate kernels as a truncated expansion of order $K-1$ as $\Theta(\Lambda) \approx \sum_{k=0}^{K-1} \theta_k T_k(\tilde{\Lambda})$ with rescaled $\tilde{\Lambda} = 2\Lambda/\lambda_{max} - I_n$ ($\lambda_{max}$ denotes the largest eigenvalue of $L$) [Hammond et al., 2011]. The graph convolution can then be rewritten as,

$$\Theta *_{\mathcal{G}} x = \Theta(L)x \approx \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L})x, \tag{3}$$

where $T_k(\tilde{L}) \in \mathbb{R}^{n \times n}$ is the Chebyshev polynomial of order $k$ evaluated at the scaled Laplacian $\tilde{L} = 2L/\lambda_{max} - I_n$. By recursively computing $K$-localized convolutions through the polynomial approximation, the cost of Eq. (2) can be reduced to $O(K|\mathcal{E}|)$ as Eq. (3) shows [Defferrard et al., 2016].

1st-order Approximation. A layer-wise linear formulation can be defined by stacking multiple localized graph convolutional layers with the first-order approximation of the graph Laplacian [Kipf and Welling, 2016]. Consequently, a deeper architecture can be constructed to recover spatial information in depth without being limited to the explicit parameterization given by the polynomials. Due to the scaling and normalization in neural networks, we can further assume that $\lambda_{max} \approx 2$. Thus, Eq. (3) can be simplified to,

$$\Theta *_{\mathcal{G}} x \approx \theta_0 x + \theta_1 \left( \frac{2}{\lambda_{max}} L - I_n \right) x \approx \theta_0 x - \theta_1 \left( D^{-1/2} W D^{-1/2} \right) x, \tag{4}$$

where $\theta_0$, $\theta_1$ are two shared parameters of the kernel. In order to constrain parameters and stabilize numerical performance, $\theta_0$ and $\theta_1$ are replaced by a single parameter $\theta$ by letting $\theta = \theta_0 = -\theta_1$; $W$ and $D$ are renormalized by $\tilde{W} = W + I_n$ and $\tilde{D}_{ii} = \sum_j \tilde{W}_{ij}$ separately. Then, the graph convolution can be alternatively expressed as,

$$\Theta *_{\mathcal{G}} x = \theta \left( I_n + D^{-1/2} W D^{-1/2} \right) x = \theta \left( \tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2} \right) x. \tag{5}$$

Stacking graph convolutions with the 1st-order approximation vertically achieves a similar effect to $K$-localized convolutions applied horizontally, all of which exploit the information from the $(K-1)$-order neighborhood of central nodes. In this scenario, $K$ is instead the number of successive filtering operations or convolutional layers in a model. Additionally, the layer-wise linear structure is parameter-economic and highly efficient for large-scale graphs, since the order of the approximation is limited to one.

Generalization of Graph Convolutions. The graph convolution operator "$*_{\mathcal{G}}$" defined on $x \in \mathbb{R}^n$ can be extended to multi-dimensional tensors. For a signal with $C_i$ channels $X \in \mathbb{R}^{n \times C_i}$, the graph convolution can be generalized by,

$$y_j = \sum_{i=1}^{C_i} \Theta_{i,j}(L)\, x_i \in \mathbb{R}^n, \quad 1 \leq j \leq C_o \tag{6}$$

with the $C_i \times C_o$ vectors of Chebyshev coefficients $\Theta_{i,j} \in \mathbb{R}^K$ ($C_i$, $C_o$ are the sizes of the input and output feature maps, respectively). The graph convolution for 2-D variables is denoted as "$\Theta *_{\mathcal{G}} X$" with $\Theta \in \mathbb{R}^{K \times C_i \times C_o}$. Specifically, the input of traffic prediction is composed of $M$ frames of road graphs as Figure 1 shows. Each frame $v_t$ can be regarded as a matrix whose column $i$ is the $C_i$-dimensional value of $v_t$ at the $i$-th node in graph $\mathcal{G}_t$, as $X \in \mathbb{R}^{n \times C_i}$ (in this case, $C_i = 1$). For each time step $t$ of $M$, the same graph convolution operation with the same kernel $\Theta$ is imposed on $X_t \in \mathbb{R}^{n \times C_i}$ in parallel. Thus, the graph convolution can be further generalized to 3-D variables, noted as "$\Theta *_{\mathcal{G}} \mathcal{X}$" with $\mathcal{X} \in \mathbb{R}^{M \times n \times C_i}$.
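The two approximations of Section 3.2 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function names, the dense matrices, the treatment of isolated nodes, and the toy path graph are chosen for this example. In particular, the $O(K|\mathcal{E}|)$ cost mentioned above would need sparse matrix products, which are omitted here for clarity.

```python
import numpy as np

def scaled_laplacian(W):
    """Return L~ = 2 L / lambda_max - I_n, with L = I_n - D^{-1/2} W D^{-1/2}."""
    n = W.shape[0]
    d = W.sum(axis=1)
    d_inv_sqrt = np.zeros(n)
    d_inv_sqrt[d > 0] = 1.0 / np.sqrt(d[d > 0])  # isolated nodes stay at 0 (assumption)
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(L).max()
    return 2.0 * L / lam_max - np.eye(n)

def cheb_graph_conv(L_tilde, X, Theta):
    """Eqs. (3) and (6). X: (n, C_i); Theta: (K, C_i, C_o); returns (n, C_o)."""
    K = Theta.shape[0]
    T_prev = X                        # T_0(L~) X = X
    out = T_prev @ Theta[0]
    if K > 1:
        T_curr = L_tilde @ X          # T_1(L~) X = L~ X
        out = out + T_curr @ Theta[1]
        for k in range(2, K):
            # Chebyshev recurrence: T_k = 2 L~ T_{k-1} - T_{k-2}
            T_next = 2.0 * (L_tilde @ T_curr) - T_prev
            out = out + T_next @ Theta[k]
            T_prev, T_curr = T_curr, T_next
    return out

def first_order_graph_conv(W, X, Theta):
    """Eq. (5). X: (n, C_i); Theta: (C_i, C_o); returns (n, C_o)."""
    W_tilde = W + np.eye(W.shape[0])                        # add self-loops
    d_tilde = W_tilde.sum(axis=1)
    A_hat = W_tilde / np.sqrt(np.outer(d_tilde, d_tilde))   # D~^{-1/2} W~ D~^{-1/2}
    return A_hat @ X @ Theta

# Toy path graph 0-1-2-3-4 with unit weights and random kernels (illustrative only).
rng = np.random.default_rng(0)
W = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
X = rng.normal(size=(5, 1))            # n = 5 nodes, C_i = 1 channel
Theta = rng.normal(size=(3, 1, 4))     # K = 3, C_i = 1, C_o = 4
L_tilde = scaled_laplacian(W)
Y_cheb = cheb_graph_conv(L_tilde, X, Theta)           # shape (5, 4)
Y_first = first_order_graph_conv(W, X, Theta[0])      # shape (5, 4)
```

Feeding an impulse at node 0 through `cheb_graph_conv` with $K = 3$ leaves nodes 3 and 4 at exactly zero, which is the $(K-1)$-hop locality described above.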
3.3 Gated CNNs for Extracting Temporal Features

Although RNN-based models have become widespread in time-series analysis, recurrent networks for traffic prediction still suffer from time-consuming iterations, complex gate mechanisms, and slow response to dynamic changes. On the contrary, CNNs have the advantages of fast training, simple structures, and no dependency constraints on previous steps. Inspired by [Gehring et al., 2017], we employ entirely convolutional structures on the time axis to capture temporal dynamic behaviors of traffic flows. This specific design allows parallel and controllable training procedures through multi-layer convolutional structures formed as hierarchical representations.

As Figure 2 (right) shows, the temporal convolutional layer contains a 1-D causal convolution with a width-$K_t$ kernel followed by gated linear units (GLU) as a non-linearity. For each node in graph $\mathcal{G}$, the temporal convolution explores $K_t$ neighbors of input elements without padding, which shortens the length of sequences by $K_t - 1$ each time. Thus, the input of temporal convolution for each node can be regarded as a length-$M$ sequence with $C_i$ channels as $Y \in \mathbb{R}^{M \times C_i}$. The convolution kernel $\Gamma \in \mathbb{R}^{K_t \times C_i \times 2C_o}$ is designed to map the input $Y$ to a single output element $[P \; Q] \in \mathbb{R}^{(M-K_t+1) \times (2C_o)}$ ($P$, $Q$ are split in half with the same size of channels). As a result, the temporal gated convolution can be defined as,

$$\Gamma *_{\mathcal{T}} Y = P \odot \sigma(Q) \in \mathbb{R}^{(M-K_t+1) \times C_o}, \tag{7}$$

where $P$, $Q$ are the inputs of the gates in GLU respectively; $\odot$ denotes the element-wise Hadamard product. The sigmoid gate $\sigma(Q)$ controls which inputs $P$ of the current states are relevant for discovering compositional structure and dynamic variances in time series. The non-linearity gates also contribute to exploiting the full input field through stacked temporal layers. Furthermore, residual connections are implemented among stacked temporal convolutional layers. Similarly, the temporal convolution can also be generalized to 3-D variables by employing the same convolution kernel $\Gamma$ on every node $\mathcal{Y}_i \in \mathbb{R}^{M \times C_i}$ (e.g. sensor stations) in $\mathcal{G}$ equally, noted as "$\Gamma *_{\mathcal{T}} \mathcal{Y}$" with $\mathcal{Y} \in \mathbb{R}^{M \times n \times C_i}$.

3.4 Spatio-temporal Convolutional Block

In order to fuse features from both spatial and temporal domains, the spatio-temporal convolutional block (ST-Conv block) is constructed to jointly process graph-structured time series. The block itself can be stacked or extended based on the scale and complexity of particular cases.

As illustrated in Figure 2 (mid), the spatial layer in the middle bridges two temporal layers, which can achieve fast spatial-state propagation from graph convolution through temporal convolutions. The "sandwich" structure also helps the network sufficiently apply the bottleneck strategy to achieve scale compression and feature squeezing by downscaling and upscaling of channels $C$ through the graph convolutional layer. Moreover, layer normalization is utilized within every ST-Conv block to prevent overfitting.

The input and output of ST-Conv blocks are all 3-D tensors. For the input $v^l \in \mathbb{R}^{M \times n \times C^l}$ of block $l$, the output $v^{l+1} \in \mathbb{R}^{(M-2(K_t-1)) \times n \times C^{l+1}}$ is computed by,

$$v^{l+1} = \Gamma_1^l *_{\mathcal{T}} \mathrm{ReLU}\left( \Theta^l *_{\mathcal{G}} \left( \Gamma_0^l *_{\mathcal{T}} v^l \right) \right), \tag{8}$$

where $\Gamma_0^l$, $\Gamma_1^l$ are the upper and lower temporal kernels within block $l$, respectively; $\Theta^l$ is the spectral kernel of graph convolution; $\mathrm{ReLU}(\cdot)$ denotes the rectified linear units function. After stacking two ST-Conv blocks, we attach an extra temporal convolution layer with a fully-connected layer as the output layer in the end (see the left of Figure 2). The temporal convolution layer maps outputs of the last ST-Conv block to a single-step prediction. Then, we can obtain a final output $Z \in \mathbb{R}^{n \times c}$ from the model and calculate the speed prediction for $n$ nodes by applying a linear transformation across $c$ channels as $\hat{v} = Zw + b$, where $w \in \mathbb{R}^c$ is a weight vector and $b$ is a bias. We use L2 loss to measure the performance of our model. Thus, the loss function of STGCN for traffic prediction can be written as,

$$L(\hat{v}; W_\theta) = \sum_t \left\lVert \hat{v}(v_{t-M+1}, \ldots, v_t, W_\theta) - v_{t+1} \right\rVert^2, \tag{9}$$

where $W_\theta$ are all trainable parameters in the model; $v_{t+1}$ is the ground truth and $\hat{v}(\cdot)$ denotes the model's prediction.

We now summarize the main characteristics of our model STGCN in the following,

• STGCN is a universal framework to process structured time series. It is not only able to tackle traffic network modeling and prediction issues but can also be applied to more general spatio-temporal sequence learning tasks.

• The spatio-temporal block combines graph convolutions and gated temporal convolutions, which can extract the most useful spatial features and capture the most essential temporal features coherently.

• The model is entirely composed of convolutional structures and therefore achieves parallelization over input with fewer parameters and faster training speed. More importantly, this economic architecture allows the model to handle large-scale networks with more efficiency.

4 Experiments

4.1 Dataset Description

We verify our model on two real-world traffic datasets, BJER4 and PeMSD7, collected by Beijing Municipal Traffic Commission and California Department of Transportation, respectively. Each dataset contains key attributes of traffic observations and geographic information with corresponding timestamps, as detailed below.

BJER4 was gathered from the major areas of east ring No.4 routes in Beijing City by double-loop detectors. There are 12 roads selected for our experiment. The traffic data are aggregated every 5 minutes. The time period used is from 1st July to 31st August, 2014 except the weekends. We select the first month of historical speed records as training set, and the rest serves as validation and test set respectively.

PeMSD7 was collected from Caltrans Performance Measurement System (PeMS) in real-time by over 39,000 sensor stations, deployed across the major metropolitan areas of California state highway system [Chen et al., 2001]. The dataset is also aggregated into 5-minute intervals from 30-second data samples. We randomly select a medium and a large scale among the District 7 of California containing 228 and 1,026
stations, labeled as PeMSD7(M) and PeMSD7(L), respectively, as data sources (shown in the left of Figure 3). The time range of the PeMSD7 dataset is the weekdays of May and June of 2012. We split the training and test sets based on the same principles as above.

Figure 3: PeMS sensor network in District 7 of California (left), each dot denotes a sensor station; heat map of weighted adjacency matrix in PeMSD7(M) (right).

4.2 Data Preprocessing

The standard time interval in the two datasets is set to 5 minutes. Thus, every node of the road graph contains 288 data points per day. The linear interpolation method is used to fill missing values after data cleaning. In addition, data inputs are normalized by the Z-Score method.

In BJER4, the topology of the road graph in Beijing east No.4 ring route system is constructed by the deployment diagram of sensor stations. By collating affiliation, direction and origin-destination points of each road, the ring route system can be digitized as a directed graph.

In PeMSD7, the adjacency matrix of the road graph is computed based on the distances among stations in the traffic network. The weighted adjacency matrix $W$ can be formed as,

$$w_{ij} = \begin{cases} \exp\left( -\dfrac{d_{ij}^2}{\sigma^2} \right), & i \neq j \text{ and } \exp\left( -\dfrac{d_{ij}^2}{\sigma^2} \right) \geq \epsilon \\ 0, & \text{otherwise,} \end{cases} \tag{10}$$

where $w_{ij}$ is the weight of the edge, which is decided by $d_{ij}$ (the distance between stations $i$ and $j$). $\sigma^2$ and $\epsilon$ are thresholds to control the distribution and sparsity of matrix $W$, assigned to 10 and 0.5, respectively. The visualization of $W$ is presented in the right of Figure 3.

4.3 Experimental Settings

All experiments are compiled and tested on a Linux cluster (CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, GPU: NVIDIA GeForce GTX 1080). In order to eliminate atypical traffic, only workday traffic data are adopted in our experiment [Li et al., 2015]. We execute a grid search strategy to locate the best parameters on validations. All the tests use 60 minutes as the historical time window, i.e. 12 observed data points ($M = 12$) are used to forecast traffic conditions in the next 15, 30, and 45 minutes ($H = 3, 6, 9$).

Evaluation Metric & Baselines. To measure and evaluate the performance of different methods, Mean Absolute Errors (MAE), Mean Absolute Percentage Errors (MAPE), and Root Mean Squared Errors (RMSE) are adopted. We compare our framework STGCN with the following baselines: 1) Historical Average (HA); 2) Linear Support Vector Regression (LSVR); 3) Auto-Regressive Integrated Moving Average (ARIMA); 4) Feed-Forward Neural Network (FNN); 5) Full-Connected LSTM (FC-LSTM) [Sutskever et al., 2014]; 6) Graph Convolutional GRU (GCGRU) [Li et al., 2018].

STGCN Model. For BJER4 and PeMSD7(M/L), the channels of the three layers in an ST-Conv block are 64, 16, 64 respectively. Both the graph convolution kernel size $K$ and temporal convolution kernel size $K_t$ are set to 3 in the model STGCN(Cheb) with the Chebyshev polynomials approximation, while $K$ is set to 1 in the model STGCN(1st) with the 1st-order approximation. We train our models by minimizing the mean square error using RMSprop for 50 epochs with batch size 50. The initial learning rate is $10^{-3}$ with a decay rate of 0.7 after every 5 epochs.

4.4 Experiment Results

Tables 1 and 2 present the results of STGCN and baselines on the datasets BJER4 and PeMSD7(M/L). Our proposed model achieves the best performance with statistical significance (two-tailed T-test, α = 0.01, P < 0.01) in all three evaluation metrics. We can easily observe that traditional statistical and machine learning methods may perform well for short-term forecasting, but their long-term predictions are not accurate because of error accumulation, memorization issues, and absence of spatial information. The ARIMA model performs the worst due to its incapability of handling complex spatio-temporal data. Deep learning approaches generally achieved better prediction results than traditional machine learning models.

                 BJER4 (15 / 30 / 45 min)
Model            MAE                   MAPE (%)                 RMSE
HA               5.21                  14.64                    7.56
LSVR             4.24 / 5.23 / 6.12    10.11 / 12.70 / 14.95    5.91 / 7.27 / 8.81
ARIMA            5.99 / 6.27 / 6.70    15.42 / 16.36 / 17.67    8.19 / 8.38 / 8.72
FNN              4.30 / 5.33 / 6.14    10.68 / 13.48 / 15.82    5.86 / 7.31 / 8.58
FC-LSTM          4.24 / 4.74 / 5.22    10.78 / 12.17 / 13.60    5.71 / 6.62 / 7.44
GCGRU            3.84 / 4.62 / 5.32    9.31 / 11.41 / 13.30     5.22 / 6.35 / 7.58
STGCN(Cheb)      3.78 / 4.45 / 5.03    9.11 / 10.80 / 12.27     5.20 / 6.20 / 7.21
STGCN(1st)       3.83 / 4.51 / 5.10    9.28 / 11.19 / 12.79     5.29 / 6.39 / 7.39

Table 1: Performance comparison of different approaches on the dataset BJER4.

Benefits of Spatial Topology

Previous methods did not incorporate spatial topology and modeled the time series in a coarse-grained way. In contrast, through modeling spatial topology of the sensors, our model STGCN has achieved a significant improvement on short and mid-and-long term forecasting. The advantage of STGCN is more obvious on dataset PeMSD7 than BJER4, since the sensor network of PeMS is more complicated and structured (as illustrated in Figure 3), and our model can effectively utilize spatial structure to make more accurate predictions.

To compare three methods based on graph convolution: GCGRU, STGCN(Cheb) and STGCN(1st), we show their
                 PeMSD7(M) (15 / 30 / 45 min)
Model            MAE                   MAPE (%)                 RMSE
HA               4.01                  10.61                    7.20
LSVR             2.50 / 3.63 / 4.54    5.81 / 8.88 / 11.50      4.55 / 6.67 / 8.28
ARIMA            5.55 / 5.86 / 6.27    12.92 / 13.94 / 15.20    9.00 / 9.13 / 9.38
FNN              2.74 / 4.02 / 5.04    6.38 / 9.72 / 12.38      4.75 / 6.98 / 8.58
FC-LSTM          3.57 / 3.94 / 4.16    8.60 / 9.55 / 10.10      6.20 / 7.03 / 7.51
GCGRU            2.37 / 3.31 / 4.01    5.54 / 8.06 / 9.99       4.21 / 5.96 / 7.13
STGCN(Cheb)      2.25 / 3.03 / 3.57    5.26 / 7.33 / 8.69       4.04 / 5.70 / 6.77
STGCN(1st)       2.26 / 3.09 / 3.79    5.24 / 7.39 / 9.12       4.07 / 5.77 / 7.03

                 PeMSD7(L) (15 / 30 / 45 min)
Model            MAE                   MAPE (%)                 RMSE
HA               4.60                  12.50                    8.05
LSVR             2.69 / 3.85 / 4.79    6.27 / 9.48 / 12.42      4.88 / 7.10 / 8.72
ARIMA            5.50 / 5.87 / 6.30    12.30 / 13.54 / 14.85    8.63 / 8.96 / 9.39
FNN              2.74 / 3.92 / 4.78    7.11 / 10.89 / 13.56     4.87 / 7.02 / 8.46
FC-LSTM          4.38 / 4.51 / 4.66    11.10 / 11.41 / 11.69    7.68 / 7.94 / 8.20
GCGRU            2.48 / 3.43 / 4.12 ∗  5.76 / 8.45 / 10.51 ∗    4.40 / 6.25 / 7.49 ∗
STGCN(Cheb)      2.37 / 3.27 / 3.97    5.56 / 7.98 / 9.73       4.32 / 6.21 / 7.45
STGCN(1st)       2.40 / 3.31 / 4.01    5.63 / 8.21 / 10.12      4.38 / 6.43 / 7.81

Table 2: Performance comparison of different approaches on the dataset PeMSD7.
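The three metrics reported in Tables 1 and 2 can be computed as in the short NumPy sketch below. The paper names the metrics but does not spell out their formulas, so the standard definitions are assumed here. The function names and the sample speeds are illustrative only, and `mape` assumes that no ground-truth value is zero.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return float(np.mean(np.abs(y_pred - y_true)))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error in percent (assumes y_true has no zeros)."""
    return float(np.mean(np.abs((y_pred - y_true) / y_true)) * 100.0)

def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Illustrative speeds, not taken from the datasets.
y_true = np.array([50.0, 60.0, 40.0])
y_pred = np.array([45.0, 66.0, 40.0])
scores = (mae(y_true, y_pred), mape(y_true, y_pred), rmse(y_true, y_pred))
```

For these sample values the absolute errors are 5, 6 and 0, so the MAE is 11/3, the MAPE is 20/3 percent, and the RMSE is the square root of 61/3.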

                 Time Consumption (s)
Dataset          STGCN(Cheb)    STGCN(1st)    GCGRU
PeMSD7(M)        272.34         271.18        3824.54
PeMSD7(L)        1926.81        1554.37       19511.92

Table 3: Time consumptions of training on the dataset PeMSD7.

[Figure 4 plots: speed (km/h) against time of day, with curves for HA, GCGRU, STGCN(1st), STGCN(Cheb) and ground truth.]

Figure 4: Speed prediction in the morning peak and evening rush hours of the dataset PeMSD7.

[Figure 5 plots: test RMSE against training time (s) and test MAE against training epoch, with curves for STGCN(Cheb), STGCN(1st) and GCGRU.]

Figure 5: Test RMSE versus the training time (left); test MAE versus the number of training epochs (right). (PeMSD7(M))

predictions during morning peak and evening rush hours, as shown in Figure 4. It is easy to observe that our proposal STGCN captures the trend of rush hours more accurately than other methods, and it detects the ending of the rush hours earlier than others. Stemming from the efficient graph convolution and stacked temporal convolution structures, our model is capable of responding quickly to dynamic changes in the traffic network without over-reliance on the historical average, as most recurrent networks do.

Training Efficiency and Generalization

To see the benefits of the convolution along the time axis in our proposal, we summarize the comparison of training time between STGCN and GCGRU in Table 3. For a fair comparison, GCGRU consists of three layers with 64, 64, 128 units respectively in the experiment for PeMSD7(M), and STGCN uses the default settings as described in Section 4.3. Our model STGCN only consumes 272 seconds, while the RNN-type model GCGRU spends 3,824 seconds on PeMSD7(M). This 14 times acceleration of training speed mainly benefits from applying the temporal convolution instead of recurrent structures, which can achieve fully parallel training
do. For PeMSD7(L), GCGRU has to use half of the batch size since its GPU consumption exceeded the memory capacity of a single card (results marked as "*" in Table 2), while STGCN only needs to double the channels in the middle of ST-Conv blocks. Even so, our model still consumes less than a tenth of the training time of GCGRU under this circumstance. Meanwhile, the advantages of the 1st-order approximation become apparent, since it is not restricted to the parameterization of polynomials. The model STGCN(1st) speeds up around 20% on the larger dataset with a satisfactory performance compared with STGCN(Cheb).

In order to further investigate the performance of the compared deep learning models, we plot the RMSE and MAE on the test set of PeMSD7(M) during the training process, see Figure 5. Those figures also suggest that our model can achieve a much faster training procedure and easier convergence. Thanks to the special designs in ST-Conv blocks, our model has superior performance in balancing time consumption and parameter settings. Specifically, the number of parameters in STGCN ($4.54 \times 10^5$) only accounts for around two thirds of GCGRU, and saves over 95% of the parameters compared to FC-LSTM.

5 Related Works

There are several recent deep learning studies that are also motivated by the graph convolution in spatio-temporal tasks. Seo et al. [2016] introduced the graph convolutional recurrent network (GCRN) to jointly identify spatial structures and dynamic variation from structured sequences of data. The key challenge of this study is to determine the optimal combinations of recurrent networks and graph convolution under specific settings. Based on the principles above, Li et al. [2018] successfully employed the gated recurrent units (GRU) with graph convolution for long-term traffic forecasting. In contrast to these works, we build up our model completely from convolutional structures; the ST-Conv block is specially designed to uniformly process structured data with residual connection and bottleneck strategy inside; more efficient graph
rather than exclusively relying on chain structures as RNN convolution kernels are employed in our model as well.
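The speed-ups quoted in the training-efficiency discussion can be re-derived from the training times reported in Table 3. The sketch below is only an arithmetic check in Python; the dictionary layout and the helper names `speedup` and `time_saved` are illustrative assumptions and do not come from the paper.

```python
# Training times in seconds, copied from Table 3.
TRAIN_TIME_S = {
    "PeMSD7(M)": {"STGCN(Cheb)": 272.34, "STGCN(1st)": 271.18, "GCGRU": 3824.54},
    "PeMSD7(L)": {"STGCN(Cheb)": 1926.81, "STGCN(1st)": 1554.37, "GCGRU": 19511.92},
}


def speedup(dataset: str, fast: str, slow: str) -> float:
    """Return how many times longer `slow` trains than `fast` (hypothetical helper)."""
    times = TRAIN_TIME_S[dataset]
    return times[slow] / times[fast]


def time_saved(dataset: str, fast: str, slow: str) -> float:
    """Return the fraction of training time `fast` saves relative to `slow`."""
    times = TRAIN_TIME_S[dataset]
    return 1.0 - times[fast] / times[slow]


if __name__ == "__main__":
    # STGCN(Cheb) versus GCGRU on the medium dataset: the "14 times" claim.
    print(f"speed-up on PeMSD7(M): {speedup('PeMSD7(M)', 'STGCN(Cheb)', 'GCGRU'):.2f}x")
    # STGCN(Cheb) versus GCGRU on the large dataset: the "less than a tenth" claim.
    ratio = 1.0 / speedup("PeMSD7(L)", "STGCN(Cheb)", "GCGRU")
    print(f"time ratio on PeMSD7(L): {ratio:.3f}")
    # STGCN(1st) versus STGCN(Cheb) on the large dataset: the "around 20%" claim.
    print(f"time saved by STGCN(1st): {time_saved('PeMSD7(L)', 'STGCN(1st)', 'STGCN(Cheb)'):.1%}")
```

Under these numbers the ratio on PeMSD7(M) comes out at roughly 14, the STGCN-to-GCGRU time ratio on PeMSD7(L) falls just below 0.1, and STGCN(1st) saves a little over 19% of the training time of STGCN(Cheb), which is consistent with the rounded figures in the text.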
6 Conclusion and Future Work
In this paper, we propose a novel deep learning framework STGCN for traffic prediction, integrating graph convolution and gated temporal convolution through spatio-temporal convolutional blocks. Experiments show that our model outperforms other state-of-the-art methods on two real-world datasets, indicating its great potential for exploring spatio-temporal structures from the input. It also achieves faster training, easier convergence, and fewer parameters with flexibility and scalability. These features are quite promising and practical for scholarly development and large-scale industry deployment. In the future, we will further optimize the network structure and parameter settings. Moreover, our proposed framework can be applied to more general spatio-temporal structured sequence forecasting scenarios, such as the evolution of social networks and preference prediction in recommendation systems.

References
[Ahmed and Cook, 1979] Mohammed S Ahmed and Allen R Cook. Analysis of freeway traffic time-series data by using Box-Jenkins techniques. 1979.
[Bruna et al., 2013] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[Chen et al., 2001] Chao Chen, Karl Petty, Alexander Skabardonis, Pravin Varaiya, and Zhanfeng Jia. Freeway performance measurement system: mining loop detector data. Transportation Research Record: Journal of the Transportation Research Board, (1748):96–102, 2001.
[Chen et al., 2016] Quanjun Chen, Xuan Song, Harutoshi Yamada, and Ryosuke Shibasaki. Learning deep representation from big and heterogeneous data for traffic accident inference. In AAAI, pages 338–344, 2016.
[Defferrard et al., 2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, pages 3844–3852, 2016.
[Gehring et al., 2017] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.
[Hammond et al., 2011] David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[Huang et al., 2014] Wenhao Huang, Guojie Song, Haikun Hong, and Kunqing Xie. Deep architecture for traffic flow prediction: deep belief networks with multitask learning. IEEE Transactions on Intelligent Transportation Systems, 15(5):2191–2201, 2014.
[Jia et al., 2016] Yuhan Jia, Jianping Wu, and Yiman Du. Traffic speed prediction using deep learning method. In ITSC, pages 1217–1222. IEEE, 2016.
[Kipf and Welling, 2016] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[Li et al., 2015] Yexin Li, Yu Zheng, Huichu Zhang, and Lei Chen. Traffic prediction in a bike-sharing system. In SIGSPATIAL, page 33. ACM, 2015.
[Li et al., 2018] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In ICLR, 2018.
[Lv et al., 2015] Yisheng Lv, Yanjie Duan, Wenwen Kang, Zhengxi Li, and Fei-Yue Wang. Traffic flow prediction with big data: a deep learning approach. IEEE Transactions on Intelligent Transportation Systems, 16(2):865–873, 2015.
[Niepert et al., 2016] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In ICML, pages 2014–2023, 2016.
[Seo et al., 2016] Youngjoo Seo, Michaël Defferrard, Pierre Vandergheynst, and Xavier Bresson. Structured sequence modeling with graph convolutional recurrent networks. arXiv preprint arXiv:1612.07659, 2016.
[Shi et al., 2015] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, pages 802–810, 2015.
[Shuman et al., 2013] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.
[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112, 2014.
[Vlahogianni, 2015] Eleni I Vlahogianni. Computational intelligence and optimization for transportation big data: challenges and opportunities. In Engineering and Applied Sciences Optimization, pages 107–128. Springer, 2015.
[Williams and Hoel, 2003] Billy M Williams and Lester A Hoel. Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: Theoretical basis and empirical results. Journal of Transportation Engineering, 129(6):664–672, 2003.
[Wu and Tan, 2016] Yuankai Wu and Huachun Tan. Short-term traffic flow forecasting with spatial-temporal correlation in a hybrid deep learning framework. arXiv preprint arXiv:1612.01022, 2016.
