
Combinatorial Optimization by Graph Pointer Networks and Hierarchical Reinforcement Learning


Qiang Ma1, Suwen Ge, Danyang He, Darshan Thaker, Iddo Drori2

Columbia University

Abstract. In this work, we introduce Graph Pointer Networks coordinates. It then predicts a policy that describes the next possi-
(GPNs) trained using reinforcement learning (RL) for tackling the ble move so that a permutation of visited cities is sampled. An RL
traveling salesman problem (TSP). GPNs build upon Pointer Net- framework for pointer networks has been proposed [2], in which the
works by introducing a graph embedding layer on the input, which pointer network model is trained by the Actor-Critic algorithm [18]
captures relationships between nodes. Furthermore, to approximate and the negative tour length is used as a reward. The RL approach
arXiv:1911.04936v1 [cs.LG] 12 Nov 2019

solutions to constrained combinatorial optimization problems such proved to be more efficient than previous supervised learning meth-
as the TSP with time windows, we train hierarchical GPNs (HG- ods and outperformed most of the previous heuristics on TSP with up
PNs) using RL, which learns a hierarchical policy to find an op- to 100 nodes. As an extension of the pointer network, Nazari et al.
timal city permutation under constraints. Each layer of the hierar- [20] modified the architecture of the pointer network to tackle more
chy is designed with a separate reward function, resulting in stable complex combinatorial optimization problems, such as the vehicle
training. Our results demonstrate that GPNs trained on small-scale routing problem (VRP).
TSP50/100 problems generalize well to larger-scale TSP500/1000 Due to the property of routing problems, the neural network ar-
problems, with shorter tour lengths and faster computational times. chitectures used in the above works do not fully take into account
We verify that for constrained TSP problems such as the TSP with the relationship between problem entities, which is a critical prop-
time windows, the feasible solutions found via hierarchical RL train- erty of routing problems and also plays a role in several other prob-
ing outperform previous baselines. In the spirit of reproducible re- lems. As a powerful tool to process non-Euclidean data and cap-
search we make our data, models, and code publicly available. ture graph information, Graph Neural Networks (GNNs) [11, 28]
have been studied extensively in recent years. Based on GNNs, two
novel approaches [16, 10] were proposed, which leverage the infor-
1 INTRODUCTION mation of the inherent graph structure present in many combinato-
As a fundamental problem in Computer Science and Operations Re- rial optimization problems. Li et al. [16] applied a Graph Convolu-
search, combinatorial optimization problems have received wide at- tional Network (GCN) model [11] along with a guided tree search
tention in the past few decades. One of the most important and prac- algorithm to solve graph-based combinatorial optimization problems
tical problems is the traveling salesman problem (TSP). To introduce such as Maximal Independent Set and Minimum Vertex Cover prob-
the TSP, consider a salesman who is traveling on a tour across a set lems. Dai et al. [10] proposed a graph embedding network trained
of cities. The salesman must visit all cities exactly once while mini- with deep Q-learning and found that this generalized well to larger-
mizing the overall tour length. TSP is known to be an NP-complete scale problems. Recently, motivated by the Transformer architecture
problem [21], which captures the difficulty of finding efficient exact [24], Kool et al. proposed an attention model [12, 13] to solve rout-
solutions in polynomial time. To overcome this complexity barrier, ing problems such as the TSP, VRP, and Orienteering Problem. In
several approximation algorithms and heuristics have been proposed their model, the relationships between the nodes of the graph are
such as the 2-opt heuristic [1], Christofides algorithm [4], guided lo- captured by a multi-head attention mechanism, using a rollout base-
cal search [26], and the Lin-Kernighan heuristic (LKH) [8]. line in the REINFORCE algorithm, which significantly improves the
With the development of machine learning (ML) and reinforce- result for small-scale TSP. However, scale is still an issue for the at-
ment learning (RL), an increasing number of recent works concen- tention model.
trate on solving combinatorial optimization using an ML or RL ap- The previous works have achieved good approximate results on
proach [25, 2, 20, 16, 10, 12, 13, 9]. A seq2seq model, known as the various combinatorial optimization problems, but combinatorial op-
pointer network [25], has great potential in approximating solutions timization problems with constraints, e.g. TSP with time window
to several combinatorial optimization problems such as finding the (TSPTW), have not been fully considered. To deal with constrained
convex hull and the TSP. It uses LSTMs as the encoder and an at- problems, Bello et al. [2] proposed a penalty method, which added
tention mechanism [24] as the decoder to extract features from city a penalty term for infeasible solutions on the reward function. How-
ever, the penalty method can lead to unstable training, and the hy-
1 Columbia University, Department of Computer Science, email: perparameters of the penalty term are usually difficult to tune. A
ma.qiang@columbia.edu better choice for training is using hierarchical RL methods, which
2 Columbia University, Department of Computer Science, email: have been applied widely to tackle complex problems such as video
idrori@cs.columbia.edu and Cornell University, School of Operations
Research and Information Engineering, email: idrori@cornell.edu games with sparse rewards and robot maze tasks [14, 19, 6]. The
The key motivation for hierarchical RL is the splitting of complex tasks into several simple subproblems which are learned in a hierarchy. Haarnoja et al. [6] introduced latent space policies for hierarchical RL, in which the lower layers of the hierarchy provide a feasible solution space and constrain the actions of the higher layers. The higher layers then make decisions based on the information from the latent space of the lower layers. In this work, we explore the use of hierarchical RL methods to tackle combinatorial optimization problems with constraints, which are split into different subtasks. Each layer of the hierarchy learns to search the feasible solutions under constraints or learns the heuristics to optimize the objective function.

In this work, we aim to approximate solutions to larger-scale TSP problems and address constrained combinatorial optimization problems. The contributions of this work are three-fold: Firstly, we propose a graph pointer network (GPN) to tackle the vanilla TSP. The GPN extends the pointer network with graph embedding layers and achieves faster convergence. Secondly, we add a vector context to the GPN architecture and train using early stopping in order to generalize our model to larger-scale TSP instances, e.g. TSP1000, from a model trained on much smaller TSP50 instances. Thirdly, we employ a hierarchical RL framework along with the GPN architecture to efficiently solve the TSP with a time window constraint. For each task, we conduct experiments to compare our model performance with existing baselines and previous work.

This work is structured as follows. In the Preliminaries section, we formulate the TSP and its corresponding reinforcement learning framework. The Hierarchical Reinforcement Learning section introduces the hierarchical RL framework as well as the hierarchical policy gradient method. The Graph Pointer Network section describes the architecture of the proposed GPN and its hierarchical version. Then, in the Experiments section, we analyze our approach on small-scale TSP problems, its generalization capabilities to large-scale TSP problems, as well as its performance on the TSP with Time Windows problem.

2 PRELIMINARIES

2.1 Traveling Salesman Problem

In this work, we focus on solving the symmetric 2-D Euclidean traveling salesman problem (TSP) [15]. The graph of the symmetric TSP is complete and undirected. Given a list of N city coordinates {x_1, x_2, ..., x_N} ⊂ R^2, the problem is to find an optimal route such that each city is visited exactly once and the total distance covered in the route is minimized. In other words, we wish to find an optimal permutation σ over the cities that minimizes the tour length [2]:

    L(σ, X) = Σ_{i=1}^{N} ||x_{σ(i)} − x_{σ(i+1)}||_2,    (1)

where σ(1) = σ(N+1), σ(i) ∈ {1, ..., N}, σ(i) ≠ σ(j) for any i ≠ j, and X = [x_1^T, ..., x_N^T]^T ∈ R^{N×2} is a matrix consisting of all city coordinates x_i. In addition, in our work we consider the TSP with added constraints. Generally, the constrained TSP is written as the following optimization problem:

    min_σ  L(σ, X) = Σ_{i=1}^{N} ||x_{σ(i)} − x_{σ(i+1)}||_2
    s.t.   f(σ, X) = 0,                                        (2)
           g(σ, X) ≤ 0,

where σ is a permutation, and f(σ, X) and g(σ, X) represent constraint functions.

2.2 Reinforcement Learning for TSP

We begin by introducing the notation used to formulate the TSP as a reinforcement learning problem. Let S be the state space and A be the action space. Each state s_t ∈ S is defined as the set of all previously visited cities, i.e. s_t = {x_{σ(i)}}_{i=1}^{t}. The action a_t ∈ A is defined as the next selected city, that is a_t = x_{σ(t+1)}. Since σ(1) = σ(N+1), it follows that a_N = x_{σ(N+1)} = x_{σ(1)}, which means the last choice of the route is the start city.

Denote a policy as π_θ(a_t | s_t), which is a distribution over candidate cities a_t given a set of visited cities s_t. Given a set of visited cities, the policy returns a probability distribution over the next candidate cities that have not been chosen. In our case, the policy is represented by a neural network and the parameter θ represents the trainable weights of the neural network. Furthermore, the reward function is defined as the negative cost incurred from taking action a_t from state s_t, i.e. r(s_t, a_t) = −||x_{σ(t)} − x_{σ(t+1)}||_2. Then the expected reward [23] is defined as follows:

    E_{(s_t, a_t) ∼ π_θ(s_t, a_t)} [ Σ_{t=1}^{N} r(s_t, a_t) ]
        = E_{σ ∼ p_θ(Γ), X ∼ 𝒳} [ Σ_{i=1}^{N} −||x_{σ(i)} − x_{σ(i+1)}||_2 ]    (3)
        = −E_{σ ∼ p_θ(Γ), X ∼ 𝒳} [ L(σ, X) ],

where 𝒳 is the space of sets of cities, Γ is the space of all possible permutations σ over 𝒳, and p_θ(Γ) is the distribution over Γ, which is predicted by the neural network. To maximize the above reward function, the network must learn a policy to minimize the expected tour length. We employ the policy gradient algorithm [23] to learn to maximize the reward function as described next.
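To make the objective and reward concrete, the following is a minimal NumPy sketch of Equations (1) and (3): it computes the tour length L(σ, X) for a permutation σ and the per-step rewards r(s_t, a_t), which sum to −L(σ, X). The function names and the use of NumPy are our own illustration and are not taken from the authors' released code.

```python
import numpy as np

def tour_length(X, sigma):
    """L(sigma, X): total Euclidean length of the closed tour sigma over cities X (N x 2)."""
    route = X[sigma]                             # cities in visiting order
    diffs = np.roll(route, -1, axis=0) - route   # x_{sigma(i+1)} - x_{sigma(i)}, wrapping back to the start
    return np.linalg.norm(diffs, axis=1).sum()

def step_rewards(X, sigma):
    """r(s_t, a_t) = -||x_{sigma(t)} - x_{sigma(t+1)}||_2; these sum to -L(sigma, X)."""
    route = X[sigma]
    diffs = np.roll(route, -1, axis=0) - route
    return -np.linalg.norm(diffs, axis=1)

# Example: a random TSP20 instance and a random permutation.
X = np.random.rand(20, 2)
sigma = np.random.permutation(20)
assert np.isclose(step_rewards(X, sigma).sum(), -tour_length(X, sigma))
```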
3 HIERARCHICAL REINFORCEMENT LEARNING

3.1 Hierarchical RL for TSP

A key aspect of our work is tackling TSP with constraints. Augmenting traditional RL reward functions with a penalty term encourages solutions to be in the feasible set [2]; however, we find this method leads to unstable training. Instead, we propose a hierarchical RL framework to more efficiently tackle TSP with constraints.

Motivated by the work of Haarnoja et al. [6, 7], we adopt a probabilistic graphical model framework for control, as demonstrated in Figure 1. Each layer of a hierarchy defines a policy, from which we sample actions. At a given layer k ∈ {0, ..., K}, the current action a_t^(k) is sampled from the policy π_{θ_k}(a_t^(k) | s_t^(k), h_t^(k)), where h_t^(k) ∈ H^(k) is a latent variable from the previous layer in the hierarchy and H^(k) is its corresponding latent space. The lowest layer, shown in Figure 1(b), is a simple Markov Decision Process (MDP) with action a_t^(0) sampled from policy π_{θ_0}(a_t^(0) | s_t^(0)), which provides a latent vector h_t^(1) for the higher layer. The middle layer not only depends on the latent variable h_t^(k) from the (k−1)-th layer, but also provides a latent variable h_t^(k+1) for the next higher layer. For convenience of notation, on the k-th layer we extend the policy to both sample the action and provide the latent variable, i.e. a_t^(k), h_t^(k+1) ∼ π_{θ_k}(· | s_t^(k), h_t^(k)).

Each layer corresponds to a different RL task, so the reward functions are hand-designed to be different for each layer. There are two natural ways to formulate constrained TSP optimization problems in
a hierarchical fashion. First, we set lower layer reward functions to simply bias solutions to be in the feasible set of the constrained optimization problem, and set higher layer reward functions to be the original optimization objective. Conversely, we could order reward functions in increasing difficulty of optimization: the first layer attempts to solve the vanilla TSP, the second layer is given a TSP instance with one constraint, and so on. For our experiments, we use the first formulation, since we find that this yields better results.

Figure 1. Graphical models for the hierarchical RL framework. (a) Middle layer of the hierarchy: in each middle layer, the next action is conditioned both on the current state and the latent variable from the lower layer. It also provides the latent variable for the next higher layer. (b) Lowest layer of the hierarchy: a simple MDP which provides latent variables for the next layer. (c) Highest layer: does not provide latent variables and only utilizes the latent variable from the lower layer.

3.2 Hierarchical Policy Gradient

We use the policy gradient method to learn a hierarchical policy. Considering a hierarchical policy, the objective function of the k-th layer is J(θ_k) = −E_{σ ∼ p_{θ_k}(σ), X ∼ 𝒳}[L(σ, X)]. Based on the REINFORCE algorithm, the gradient of the k-th layer policy is expressed as [2, 27]:

    ∇_{θ_k} J(θ_k) = (1/B) Σ_{i=1}^{B} [ ( Σ_{t=1}^{N} r_k(s_{i,t}^(k), a_{i,t}^(k)) − b_{i,k} )
                      × ( Σ_{t=1}^{N} ∇_{θ_k} log π_{θ_k}(a_{i,t}^(k) | s_{i,t}^(k), h_{i,t}^(k)) ) ],    (4)

where B is the batch size, π_{θ_k} is the k-th layer policy, r_k(·, ·) is the reward function for the k-th layer, b_{i,k} is the k-th layer baseline, and h_t^(k) is the latent variable from the lower layer. Based on Equation 4, the parameters θ_k are optimized by gradient ascent through the update rule θ_k ← θ_k + α ∇_{θ_k} J(θ_k).

3.2.1 Central Self-Critic

We introduce the central self-critic baseline b_{i,k}, which is similar to the self-critic baseline [22] and the rollout baseline in the Attention Model [12]. The central self-critic baseline b_{i,k} is expressed as:

    b_{i,k} = Σ_{t=1}^{N} r_k(s̃_{i,t}^(k), ã_{i,t}^(k))
              + (1/B) Σ_{t=1}^{N} Σ_{j=1}^{B} [ r_k(s_{j,t}^(k), a_{j,t}^(k)) − r_k(s̃_{j,t}^(k), ã_{j,t}^(k)) ],    (5)

where the action ã_{i,t}^(k) ∼ π_{θ_k}^{Greedy} is from the greedy policy π_{θ_k}^{Greedy}, i.e. the action is sampled greedily, and s̃_{i,t}^(k) is the corresponding state. The second term of Equation 5 is the gap between the rewards of the sampling and greedy approaches, which is designed to centre the advantage term in the REINFORCE algorithm [27]. Using a central self-critic baseline accelerates the convergence rate compared to using an exponential moving average of the rewards.

Since the lowest layer of the hierarchy is a Markov Decision Process (MDP), the lowest-level policy is learned directly and provides latent variables for the higher layer. In other words, we use a bottom-up approach for learning the hierarchical policy and training the neural network.
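The following is a small PyTorch-style sketch of how Equations (4)–(5) can be turned into a training loss: the greedy rollout return plus the batch-mean gap between sampled and greedy returns forms the central self-critic baseline, and the resulting advantage multiplies the summed log-probabilities. The tensor shapes, the function name, and the reduction to a single scalar loss are our assumptions for illustration, not the authors' exact implementation.

```python
import torch

def central_self_critic_loss(log_probs, rewards, greedy_rewards):
    """
    REINFORCE loss with the central self-critic baseline of Eq. (5).
    log_probs:      (B, N) log pi_theta(a_t | s_t, h_t) of the sampled actions
    rewards:        (B, N) per-step rewards of the sampled rollouts
    greedy_rewards: (B, N) per-step rewards of greedy rollouts from the same policy
    """
    R_sample = rewards.sum(dim=1)          # total reward of each sampled tour
    R_greedy = greedy_rewards.sum(dim=1)   # total reward of each greedy tour
    # b_{i,k}: greedy return plus the batch-mean gap between sampled and greedy returns (Eq. 5)
    baseline = R_greedy + (R_sample - R_greedy).mean()
    advantage = (R_sample - baseline).detach()
    # minimizing this loss follows the gradient of Eq. (4) (ascent on J)
    return -(advantage * log_probs.sum(dim=1)).mean()
```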
3.2.2 Layer-wise Policy Optimization

Suppose we need to learn a (K+1)-layer hierarchical policy, which includes π_{θ_0}, π_{θ_1}, ..., π_{θ_K}. Each policy is represented by a GPN. In order to learn policy π_{θ_K}, we first need to train all lower layers π_{θ_k} for k = 0, ..., K−1 and fix the weights of the neural networks. Then, for layer k = 0, ..., K−1, we sample (s_t^(k), a_t^(k)) based on π_{θ_k}, and provide the latent variable h_t^(k+1) for the next higher layer. Finally, we can learn the policy π_{θ_K} from h_t^(K). Algorithm 1 provides detailed pseudo-code.

Algorithm 1 Layer-wise Policy Optimization
 1: procedure TRAIN(training set 𝒳, numbers of training steps M_0, M_1, ..., M_K, batch size B, learning rate α, number of layers K)
 2:   Initialize network parameters θ_k for k ∈ {0, ..., K}
 3:   for k = 0 to K do
 4:     for m = 1 to M_k do
 5:       X_i ∼ Sample(𝒳) for i ∈ {1, ..., B}
 6:       for j = 0 to k − 1 do
 7:         a_{i,t}^(j), h_{i,t}^(j+1) ∼ π_{θ_j}(· | s_{i,t}^(j), h_{i,t}^(j))
 8:       a_{i,t}^(k) ∼ π_{θ_k}(· | s_{i,t}^(k), h_{i,t}^(k))
 9:       ã_{i,t}^(k) ∼ π_{θ_k}^{Greedy}(· | s̃_{i,t}^(k), h_{i,t}^(k))
10:       Compute J(θ_k), ∇_{θ_k} J(θ_k)
11:       θ_k ← θ_k + α ∇_{θ_k} J(θ_k)
12:   return π_{θ_0}, π_{θ_1}, ..., π_{θ_K}

4 GRAPH POINTER NETWORK

4.1 GPN Architecture

We propose a graph pointer network (GPN) based on the pointer network [2] for approximately solving the TSP. The GPN architecture, which is shown in Figure 2, consists of an encoder and a decoder component.

Encoder  The encoder includes two parts: a point encoder and a graph encoder. For the point encoder, each city coordinate x_i is embedded into a higher dimensional vector x̃_i ∈ R^d, where d is the hidden dimension. This linear transformation shares weights across all cities x_i. The vector x̃_i for the current city x_i is then encoded by an LSTM. The hidden variable x_i^h of the LSTM is passed to both the decoder in the current step and the encoder in the next time step. For the graph encoder, we use graph embedding layers to encode all city coordinates X = [x_1^T, ..., x_N^T]^T, and pass the result to the decoder.

Figure 2. Architecture of the Graph Pointer Network. The current city coordinate x_i (we denote x_{σ(i)} as x_i for convenience) is encoded by the LSTM, while X̄ = X − X_i is encoded as the vector context by a graph neural network. The encoded vectors are passed to the attention decoder, which outputs the pointer vector u_i. The probability distribution over the next candidate city is p_i = softmax(u_i). The next visited city x_{i+1} is sampled from p_i.
Graph Embedding Layer  In TSP, the context information of a city node includes its neighbors' information. In a GPN, context information is obtained by encoding all city coordinates X via a graph neural network (GNN) [11, 28]. Each layer of the GNN is expressed as:

    x_i^l = γ x_i^{l−1} Θ + (1 − γ) φ_θ( (1/|N(i)|) {x_j^{l−1}}_{j ∈ N(i) ∪ {i}} ),    (6)

where x_i^l ∈ R^{d_l} is the l-th layer variable with l ∈ {1, ..., L}, x_i^0 = x_i, γ is a trainable parameter which regularizes the eigenvalues of the weight matrix, Θ ∈ R^{d_{l−1} × d_l} is a trainable weight matrix, N(i) is the adjacency set of node i, and φ_θ : R^{d_{l−1}} → R^{d_l} is the aggregation function [11], which is represented by a neural network in this work. Furthermore, since we only consider the symmetric TSP, the graph of the TSP is a complete graph. Therefore, the graph embedding layer is further expressed as:

    X^l = γ X^{l−1} Θ + (1 − γ) Φ_θ( X^{l−1} / |N(i)| ),    (7)

where X^l ∈ R^{N × d_l}, and Φ_θ : R^{N × d_{l−1}} → R^{N × d_l} is the aggregation function.

Vector Context  In previous work [2, 12], the context is computed based on the 2-D coordinates of all cities, i.e. X ∈ R^{N×2}. We refer to this context as a point context. In contrast, instead of using coordinate features directly, in this work we use the vectors pointing from the current city to all other cities as the context, which we refer to as a vector context. For the current city x_i, suppose X_i = [x_i^T, ..., x_i^T]^T ∈ R^{N×2} is a matrix with identical rows x_i. We define X̄_i = X − X_i as the vector context. The j-th row of X̄_i is a vector pointing from node i to node j. Then X̄_i is passed into the graph embedding layers. A graph embedding layer is rewritten as X̄_i^l = γ X̄_i^{l−1} Θ + (1 − γ) Φ_θ( X̄_i^{l−1} / |N(i)| ). In practice, the GPN using the vector context yields more transferable representations, which allows the model to perform well on larger-scale TSP.
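As an illustration of the graph embedding layer and the vector context, here is a hedged PyTorch sketch. The module and function names, the placement of the 1/|N(i)| normalization, and the single-layer ReLU aggregator (as in Equation (10) below) are our assumptions; the authors' released code may differ in detail.

```python
import torch
import torch.nn as nn

class GraphEmbedding(nn.Module):
    """One graph embedding layer in the spirit of Eq. (7):
    X^l = gamma * X^{l-1} Theta + (1 - gamma) * Phi_theta(X^{l-1} / |N(i)|).
    For the complete TSP graph, |N(i)| is the same constant for every node (taken here as N - 1)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.theta = nn.Linear(d_in, d_out, bias=False)                  # Theta
        self.agg = nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU())      # Phi_theta: one-layer network + ReLU
        self.gamma = nn.Parameter(torch.tensor(0.5))                     # trainable mixing weight

    def forward(self, X):                      # X: (B, N, d_in)
        N = X.size(1)
        return self.gamma * self.theta(X) + (1 - self.gamma) * self.agg(X / (N - 1))

def vector_context(X, i):
    """X_bar_i = X - X_i: vectors pointing from the current city i to every city; X: (B, N, 2)."""
    return X - X[:, i:i + 1, :]
```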

Decoder  The decoder is based on an attention mechanism and outputs the pointer vector u_i, which is then passed to a softmax layer to generate a distribution over the next candidate cities. Similar to pointer networks [2], the attention mechanism and pointer vector u_i are defined as:

    u_i^(j) = { v^T · tanh(W_r r_j + W_q q)   if j ≠ σ(k), ∀k < j,
              { −∞                            otherwise,             (8)

where u_i^(j) is the j-th entry of the vector u_i, W_r and W_q are trainable matrices, q is a query vector from the hidden variable of the LSTM, and r_j is a reference vector containing the context information of all cities. Precisely, we use the hidden variable x_i^h from the point encoder as the query vector q, and use the context X^L from the graph embedding layers as the reference, i.e. q = x_i^h and r_j = X_j^L. The distribution policy over all candidate cities is given by:

    π_θ(a_i | s_i) = p_i = softmax(u_i).    (9)

We predict the next visited city a_i = x_{σ(i+1)} by sampling or choosing greedily from the policy π_θ(a_i | s_i).
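The attention decoder of Equations (8)–(9) can be sketched in PyTorch as below. Masking already-visited cities with −∞ and clipping the pointer vector (here via C·tanh, one possible reading of the clipping discussed in Section 6.1) are shown explicitly; the class name and exact parameterization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionDecoder(nn.Module):
    """Pointer-style attention: u^(j) = v^T tanh(W_r r_j + W_q q), masked for visited cities (Eq. 8)."""
    def __init__(self, d_hidden, C=100.0):
        super().__init__()
        self.W_r = nn.Linear(d_hidden, d_hidden, bias=False)
        self.W_q = nn.Linear(d_hidden, d_hidden, bias=False)
        self.v = nn.Parameter(torch.randn(d_hidden))
        self.C = C   # clip range for the pointer vector (Sec. 6.1)

    def forward(self, q, ref, mask):
        # q: (B, d) query from the LSTM hidden state; ref: (B, N, d) graph-embedded context X^L
        # mask: (B, N) boolean, True for cities that were already visited
        u = torch.tanh(self.W_r(ref) + self.W_q(q).unsqueeze(1)) @ self.v   # (B, N)
        u = self.C * torch.tanh(u)                   # keep u in [-C, C]; one reading of the clipping
        u = u.masked_fill(mask, float('-inf'))       # visited cities get probability zero
        p = torch.softmax(u, dim=-1)                 # Eq. (9)
        return p, u
```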


4.2 Hierarchical GPN Architecture

In this section, we use the proposed GPN to design a hierarchical architecture. The architecture of a two-layer hierarchical GPN (HGPN) is illustrated in Figure 3. In contrast to a single-layer GPN, the coordinate x_i^(k) at the k-th layer is first passed as input to a lower-level neural network, and that network outputs a pointer vector u_i^(k−1). Then, u_i^(k−1) is added to the pointer vector u_i^(k) of the higher layer, i.e. p_i^(k) = softmax(u_i^(k) + α u_i^(k−1)), where α is a trainable parameter. This plays an important role since u_i^(k−1) contains lower layer information which provides a prior distribution over the output cities. The output x_{i+1}^(k) is then sampled from π_θ(· | s_i^(k), h_i^(k)) = p_i^(k), where h_i^(k) = u_i^(k−1) is the latent variable from the lower layer.

Figure 3. A two-layer hierarchical architecture of the GPN. The pointer vectors of the two layers are added together to predict the next candidate city. The pointer vector of the lower layer provides a prior for the higher layer.
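The combination of the two pointer vectors can be sketched as follows; `lower_gpn`, `higher_gpn`, and their `decoder` calls are hypothetical names used only to show where the addition p^(k) = softmax(u^(k) + α u^(k−1)) would sit inside one decoding step, under the assumption that the lower layer is trained first and then frozen.

```python
import torch

def hierarchical_pointer(u_higher, u_lower, alpha):
    """Combine pointer vectors of two GPN layers: p^(k) = softmax(u^(k) + alpha * u^(k-1)).
    The lower-layer pointer acts as a prior over the next city; alpha is a trainable scalar."""
    return torch.softmax(u_higher + alpha * u_lower, dim=-1)

# Sketch of one decoding step of a two-layer HGPN (hypothetical objects, shown for orientation):
# p0, u0 = lower_gpn.decoder(q0, ref0, mask)          # frozen lower layer; u0 doubles as the latent h
# p1, u1 = higher_gpn.decoder(q1, ref1, mask)
# p = hierarchical_pointer(u1, u0, higher_gpn.alpha)  # distribution used to sample the next city
```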
5 EXPERIMENTS

In our experiments, we use L = 3 graph embedding layers to encode the context in the GPN. The aggregation function used is a single-layer fully connected neural network. The graph embedding layer is expressed as:

    X^l = γ X^{l−1} Θ + (1 − γ) g(X^{l−1} W / |N(i)| + b),    (10)

where g(·) is the ReLU activation function, and W ∈ R^{d_{l−1} × d_a} and b ∈ R^{N × d_a} are trainable weights and biases with d_l = d_a = 128 for l = 1, 2, 3. We use the point context for small-scale problems such as TSP20/50 and the vector context for larger-scale problems such as TSP500. The training data is generated randomly from a [0, 1]^2 uniform distribution. In each epoch, the training data is generated on the fly. The central self-critic baseline is used during RL training. Unless otherwise specified, the following experiments use the hyperparameters shown in Table 1.

Table 1. Hyperparameters used for training.
    Parameter                   Value    Parameter            Value
    Epoch                       100      Optimizer            Adam
    Batch size                  512      Learning rate        1e-3
    Training steps (per epoch)  2500     Learning rate decay  0.96

OR-Tools Setting  We use OR-Tools [5] as one of the baselines to compare with our results. For the larger-scale TSP instances, the Savings and Christofides algorithms are selected as first solution strategies in OR-Tools. The search time limit for each TSP instance is set to 5 seconds. We choose Guided Local Search as the metaheuristic when running OR-Tools. For TSP with Time Windows (TSPTW), the Savings algorithm is picked as the first solution strategy in OR-Tools. We use the default settings for its search limits and metaheuristics.

5.1 Experiments for small-scale TSP

We train our GPN model with TSP20 and TSP50 instances. The training time of each epoch is 7 minutes for TSP20 and 30 minutes for TSP50 using one NVIDIA Tesla P100 GPU. We compare the performance of our model on small-scale TSP with previous work, such as the Attention Model [12], s2v-DQN [10], the Pointer Network [2], and other heuristics, e.g. the 2-opt heuristic, the Christofides algorithm and random insertion. The results are shown in Figure 4, which compares the approximation gap to the optimal solution. A smaller gap indicates a better result. The optimal solutions are obtained from the LKH algorithm. We observe that for small-scale TSP instances, the GPN outperforms the Pointer Network, which demonstrates the usefulness of the graph embedding, but yields worse approximations than the Attention Model.

Figure 4. Comparison of TSP20/50 results: Attention Model, s2v-DQN, Pointer Net, 2-opt, random insertion and Christofides. The y-axis is the approximation gap to the optimal solutions.

5.2 Experiments for larger-scale TSP

In real world applications, most practical TSP instances have hundreds or thousands of nodes, and the optimal solution is not efficiently computable. We find that the proposed GPN model generalizes well from small-scale TSP problems to larger-scale problems. The generalization capacity increases by an order of magnitude.

In Table 2, we train a GPN model with vector context on TSP50 data for 10 epochs, and use this model to predict the routes on TSP250/500/750/1000. Furthermore, we use the 2-opt local search algorithm [1] to improve our results after prediction. The Pointer Network (PN) [2], s2v-DQN [10] and Attention Model (AM) [12] are also trained with TSP50 data, and we check the transferability of these models to larger-scale problems as well. Results are averaged over 1000 TSP instances. Due to memory constraints, we set the batch size B = 50 during inference for all models. The results are also compared with LKH, nearest neighbor, 2-opt, farthest insertion and Google OR-Tools [5].

Table 2 shows that our GPN model outperforms PN and AM when we train with TSP50 instances and generalize to larger-scale problems. With local search added, GPN+2opt has a similar tour length to s2v-DQN, but saves ≈ 20% running time. Compared with the 2-opt heuristic, GPN+2opt uses ≈ 25% less running time, which means the GPN model can be treated as a good initialization method. GPN+2opt also outperforms OR-Tools on TSP1000. In Table 2, the GPN does not outperform state-of-the-art TSP solvers, e.g. LKH and farthest insertion. However, it still has the potential to be an effective initialization method, since the GPN shows good generalization capabilities and can solve TSP instances in parallel. Some sample tours are shown in Figure 5.

Figure 5. Sample tours for TSP250/500/750/1000: (a) TSP250, (b) TSP500, (c) TSP750, (d) TSP1000, each obtained with GPN+2opt. Approximate solutions of larger-scale TSP predicted by the GPN and refined with the 2-opt heuristic.

As aforementioned, the generalization capacity of the GPN model is roughly an order of magnitude larger than the size of the instances the model is trained on. More specifically, we train the GPN models on TSP20/50/100 and use these models to predict on TSP500/1000. The results are shown in Table 3, which demonstrates that the results improve if we increase the size of the TSP instances used for training.
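For reference, the 2-opt refinement applied on top of the GPN tours (the GPN+2opt rows in Table 2 below) can be sketched as a plain local search. This is a generic implementation written for illustration, not the authors' exact code; `tour` is assumed to be a Python list of city indices.

```python
import numpy as np

def two_opt(X, tour):
    """Simple 2-opt local search: repeatedly reverse a segment whenever doing so shortens the tour."""
    def dist(a, b):
        return np.linalg.norm(X[a] - X[b])
    improved = True
    while improved:
        improved = False
        n = len(tour)
        for i in range(1, n - 1):
            for j in range(i + 1, n):
                a, b = tour[i - 1], tour[i]
                c, d = tour[j], tour[(j + 1) % n]
                # swap edges (a,b) and (c,d) for (a,c) and (b,d) if that shortens the tour
                if dist(a, b) + dist(c, d) > dist(a, c) + dist(b, d) + 1e-12:
                    tour[i:j + 1] = tour[i:j + 1][::-1]
                    improved = True
    return tour
```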
Table 2. Comparison for larger-scale TSP. Each result is obtained by running on 1000 random TSP instances. Tour Len. refers to the average tour length. Time refers to the total running time (sec) over the 1000 instances.
                             TSP 250            TSP 500            TSP 750            TSP 1000
    Method                   Tour Len.  Time    Tour Len.  Time    Tour Len.  Time    Tour Len.  Time
    LKH                      11.893     9792s   16.542     23070s  20.129     36840s  23.130     50680s
    Concorde                 11.89      1894s   16.55      13902s  20.10      32993s  23.11      47804s
    Nearest Neighbor         14.928     25s     20.791     60s     25.219     115s    28.973     136s
    2-opt                    13.253     303s    18.600     1363s   22.668     3296s   26.111     6153s
    Farthest Insertion       13.026     33s     18.288     160s    22.342     454s    25.741     945s
    OR-Tools (Savings)       12.652     5000s   17.653     5000s   22.933     5000s   28.332     5000s
    OR-Tools (Christofides)  12.289     5000s   17.449     5000s   22.395     5000s   26.477     5000s
    s2v-DQN                  13.079     476s    18.428     1508s   22.550     3182s   26.046     5600s
    Pointer Net              14.249     29s     21.409     280s    27.382     782s    32.714     3133s
    Attention Model          14.032     2s      24.789     14s     28.281     42s     34.055     136s
    GPN (ours)               13.679     32s     19.605     111s    24.337     232s    28.471     393s
    GPN+2opt (ours)          12.942     214s    18.358     974s    22.541     2278s   26.129     4410s

Table 3. Comparison for larger-scale TSP. The GPNs are trained with different sizes of TSP instances. Each result is obtained by running on 1000 random TSP instances. Tour Len. refers to the average tour length. Time refers to the total running time (sec) over the 1000 instances.
                   TSP 500             TSP 1000
    Model          Tour Len.  Time     Tour Len.  Time
    GPN (TSP20)    22.320     107s     33.649     391s
    GPN (TSP50)    19.605     111s     28.471     393s
    GPN (TSP100)   19.527     109s     28.036     408s

5.3 Experiments for TSP with time windows

Finally, we consider a well known constrained TSP problem, the TSP with Time Windows (TSPTW). In TSPTW, each node i has its own service time interval [e_i, l_i], where e_i is the entering time and l_i is the leaving time. A city cannot be visited after its leaving time. If the node is visited earlier than the entering time, the salesman must wait until the service begins, namely until the entering time. In this experiment, we consider the following formalization of the TSP with Time Windows problem:

    min_σ  Σ_{i=1}^{N} c_i
    s.t.   c_{i+1} − c_i ≥ ||x_{σ(i+1)} − x_{σ(i)}||_2,  i ∈ {1, ..., N−1},    (11)
           e_i ≤ c_i ≤ l_i,  i ∈ {1, ..., N},

where c_i is the time cost for the i-th city. In this problem, a feasible solution does not always exist. To ensure the existence of training and test data, we first generate TSP20 instances from a [0, 1]^2 uniform distribution. Then, using 2-opt local search on the generated instances, we obtain approximate solutions c̃_i for i ∈ {1, ..., N}. We set e_i = max{c̃_i − ẽ_i, 0} and l_i = c̃_i + l̃_i, where ẽ_i ∼ Uniform(0, 2) and l̃_i ∼ Uniform(0, 2) + 1. Therefore e_i ≤ c̃_i ≤ l_i, which means that a feasible solution always exists in the training and test data. The dataset is obtained by shuffling all cities in the instances above. The exponential moving average critic baseline [12] is used during RL training.
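A hedged NumPy sketch of this data-generation procedure is given below; `two_opt_solver` is a placeholder for any routine that returns an approximate tour (for example a 2-opt local search like the sketch above) and is not a function from the authors' code.

```python
import numpy as np

def generate_tsptw20(rng, two_opt_solver):
    """Generate one TSPTW20 instance as in Sec. 5.3: coordinates from [0,1]^2, and time windows
    built around an approximate 2-opt tour so that at least one feasible solution exists."""
    X = rng.random((20, 2))
    sigma = two_opt_solver(X)              # city indices in approximate visiting order
    route = X[sigma]
    arrival = np.concatenate(([0.0], np.cumsum(np.linalg.norm(np.diff(route, axis=0), axis=1))))
    c_tilde = np.empty(20)
    c_tilde[sigma] = arrival               # c~_i: approximate arrival time at city i
    e = np.maximum(c_tilde - rng.uniform(0.0, 2.0, size=20), 0.0)   # e_i = max(c~_i - e~_i, 0)
    l = c_tilde + rng.uniform(0.0, 2.0, size=20) + 1.0              # l_i = c~_i + l~_i, l~_i ~ U(0,2) + 1
    order = rng.permutation(20)            # shuffle the cities within the instance
    return X[order], e[order], l[order]

# Usage sketch: rng = np.random.default_rng(0); X, e, l = generate_tsptw20(rng, my_two_opt)
```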
In the experiments for TSP with Time Windows (TSPTW), we construct a two-layer hierarchical GPN (HGPN). First we define

    ρ(c, l) := Σ_{i=1}^{N} max{c_i − l_i, 0}    (12)

as the penalty incurred when the arriving time exceeds the leaving time, where l_i is the leaving time and c_i is the arriving time. The reward function of the lower layer is then the penalty for violating the leaving time constraints, r_1 = β · ρ(c, l), where β is the penalty factor. The reward of the higher layer is the total time cost of TSPTW plus the penalty:

    r_2 = Σ_{i=1}^{N} c_i + β · ρ(c, l).    (13)

For the inference phase, we use ρ(c, l) to measure accuracy, i.e. the number of instances that are solved successfully. For any instance, if ρ(c, l) > 0, then there exists at least one city whose arriving time exceeds the leaving time, which indicates that the solution is infeasible.
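The penalty and the two layer-wise rewards can be computed as below. This is a minimal sketch of Equations (12)–(13); how these quantities enter the policy-gradient update (sign and scaling) is left to the training setup described in Section 3.

```python
import numpy as np

def tsptw_penalty(c, l):
    """rho(c, l) = sum_i max(c_i - l_i, 0): total violation of the leaving times (Eq. 12)."""
    return np.maximum(c - l, 0.0).sum()

def lower_layer_reward(c, l, beta):
    """r_1 = beta * rho(c, l): the lower layer only learns to respect the time windows."""
    return beta * tsptw_penalty(c, l)

def higher_layer_reward(c, l, beta):
    """r_2 = sum_i c_i + beta * rho(c, l): total time cost of the tour plus the penalty (Eq. 13)."""
    return c.sum() + beta * tsptw_penalty(c, l)
```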
The lower layer is trained with 1 epoch of TSPTW20 data, and the higher layer is trained with 19 epochs. For the TSPTW data, each node x_i is a tuple (x_i, y_i, e_i, l_i), where (x_i, y_i) is a 2-D coordinate and e_i, l_i are the entering and leaving times. We average results over 10000 problem instances to compare our results with OR-Tools and the Ant Colony Optimization (ACO) algorithm [3].

At prediction time, we use both the greedy and the sampling method. The result is improved by sampling 100 or 500 times. Table 4 demonstrates that our HGPN framework outperforms all other baselines on TSPTW, including the single-layer GPN. Even though all instances have feasible solutions based on our training setup, the algorithms sometimes fail to find a feasible solution. To capture this, we use the percentage of feasible solutions as an evaluation metric. The HGPN achieves a much higher percentage of feasible solutions compared to the baselines. Some sample tours are shown in Figure 6.

Table 4. Results for TSPTW20. Cost: objective of TSPTW. Time: the running time of the algorithms. Feasible %: the percentage of instances that are predicted to have feasible solutions by the algorithm.
    Method               Cost    Time   Feasible %
    OR-Tools (Savings)   4.045   121s   72.06%
    ACO                  4.655   204s   62.10%
    GPN-greedy           4.209   1s     99.87%
    HGPN-greedy          4.178   1s     99.88%
    HGPN-sampling-100    4.013   99s    100%
    HGPN-sampling-500    3.991   494s   100%

5.4 Real World TSP Instances

We have evaluated our model on the real world TSPLIB dataset, using instances which have fewer than 1500 nodes. We report the average gap between our result and the best solution, which is shown in Table 5.
Figure 6. Sample tours for TSPTW20. For the text on each node, the first line is the arriving time, and the second line is the time window.

Table 5. Evaluation on the real world TSPLIB dataset.
    Method           Concorde      GPN+2opt
    Optimality Gap   0.13 ± 0.6%   9.35 ± 3.45%
    Running Time     1377s         200s

6 DISCUSSION

6.1 Generalization

Vector Context  In our GPN model, we use the vector context before encoding. The vector context is helpful for obtaining pairwise information between cities. Therefore, in each step, our model knows the relative position between the current city and all others, which contributes to good generalization. In the experiments, the GPN with vector context performs better than the GPN with point context on larger-scale TSP, which is illustrated in Figure 7.

Figure 7. Validation curves of the GPN on TSP500. The GPN model is trained with TSP50 and generalizes to TSP500.

Early Stopping  In order to generalize well to larger-scale instances and avoid overfitting on small-scale problems, we use early stopping during training and solve larger-scale TSP with models trained for 10 epochs. A comparison of the performance at various levels of early stopping is shown in Table 6. Based on their performance, we still train PN, s2v-DQN and AM for 100 epochs.

Table 6. Tour length results obtained by the GPN at different epochs. We train the GPN with TSP50 instances and predict TSP500/1000. The bold result is the shortest tour length.
    Epoch      1       5       10      100
    TSP500     20.26   19.69   19.58   20.19
    TSP1000    29.23   28.52   28.48   29.28

Clip Range  In our model, we clip the range of the pointer vector u to [−C, C]. Instead of C = 10, which is used in previous work [2], we choose C = 100 to achieve a better exploration-exploitation tradeoff.

6.2 Hierarchical Architecture

On the TSPTW problem, the hierarchical GPN (HGPN) performs better than the single-layer GPN. The training curves of the HGPN and the single-layer GPN are shown in Figure 8. For the single-layer GPN, the reward function includes both the penalty and the objective of TSPTW, which leads to unstable training in the early stage, as shown by the blue curve in Figure 8. In contrast, we train the lower layer of the HGPN to minimize the penalty term, which is simple to learn and converges quickly within one epoch. Then, the lower layer provides a prior distribution over possible feasible solutions for the higher layer. Given the latent information about feasible solutions, the higher layer of the HGPN converges more quickly than the single-layer GPN and yields better solutions.

Figure 8. Validation curves of the HGPN and single-layer GPN on TSPTW20 during training. The orange curve shows only the higher layer of the HGPN and begins at the 2nd epoch. The 1st epoch is used for training the lower layer.

7 CONCLUSION

In this work, we propose a Graph Pointer Network (GPN) framework which efficiently solves larger-scale TSP by using graph embedding layers. Training a hierarchical RL model allows our approach to additionally tackle constrained combinatorial optimization problems such as the TSP with time windows. Our experimental results demonstrate that the GPN generalizes well from small-scale to larger-scale problems, outperforming previous RL methods. We make our data, models, and code publicly available [17].

REFERENCES

[1] Emile Aarts, Emile HL Aarts, and Jan Karel Lenstra, Local search in combinatorial optimization, Princeton University Press, 2003.
[2] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio, ‘Neural combinatorial optimization with reinforcement learning’, International Conference on Learning Representations Workshop, (2017).
[3] Chi-Bin Cheng and Chun-Pin Mao, ‘A modified ant colony system for solving the travelling salesman problem with time windows’, Mathematical and Computer Modelling, 46(9-10), 1225–1235, (2007).
[4] Nicos Christofides, ‘Worst-case analysis of a new heuristic for the travelling salesman problem’, Technical report, Carnegie-Mellon University, Pittsburgh Management Sciences Research Group, (1976).
[5] Google, ‘OR-Tools, Google optimization tools’, https://developers.google.com/optimization/routing, (2016).
[6] Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine, ‘Latent space policies for hierarchical reinforcement learning’, in International Conference on Machine Learning, pp. 1846–1855, (2018).
[7] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine,
‘Soft actor-critic: Off-policy maximum entropy deep reinforcement
learning with a stochastic actor’, in International Conference on Ma-
chine Learning, pp. 1856–1865, (2018).
[8] Keld Helsgaun, ‘An effective implementation of the lin–kernighan trav-
eling salesman heuristic’, European Journal of Operational Research,
126(1), 106–130, (2000).
[9] Chaitanya K Joshi, Thomas Laurent, and Xavier Bresson, ‘An effi-
cient graph convolutional network technique for the travelling salesman
problem’, arXiv preprint arXiv:1906.01227, (2019).
[10] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song,
‘Learning combinatorial optimization algorithms over graphs’, in Ad-
vances in Neural Information Processing Systems, pp. 6348–6358,
(2017).
[11] Thomas N Kipf and Max Welling, ‘Semi-supervised classification with
graph convolutional networks’, International Conference on Learning
Representations, (2016).
[12] Wouter Kool, Herke van Hoof, and Max Welling, ‘Attention, learn to
solve routing problems!’, International Conference on Learning Rep-
resentations, (2019).
[13] Wouter Kool, Herke van Hoof, and Max Welling, ‘Buy 4 reinforce sam-
ples, get a baseline for free!’, International Conference on Learning
Representations, (2019).
[14] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh
Tenenbaum, ‘Hierarchical deep reinforcement learning: Integrating
temporal abstraction and intrinsic motivation’, in Advances in Neural
Information Processing Systems, pp. 3675–3683, (2016).
[15] Eugene L Lawler, Jan Karel Lenstra, AHG Rinnooy Kan,
David Bernard Shmoys, et al., The traveling salesman problem:
a guided tour of combinatorial optimization, volume 3, Wiley New
York, 1985.
[16] Zhuwen Li, Qifeng Chen, and Vladlen Koltun, ‘Combinatorial opti-
mization with graph convolutional networks and guided tree search’,
in Advances in Neural Information Processing Systems, pp. 537–546,
(2018).
[17] Qiang Ma, Suwen Ge, Danyang He, Darshan Thaker, and Iddo
Drori, ‘GitHub repository for Combinatorial Optimization by
Graph Pointer Networks and Hierarchical Reinforcement Learning’,
https://github.com/qiang-ma/graph-pointer-network, (2019).
[18] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex
Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray
Kavukcuoglu, ‘Asynchronous methods for deep reinforcement learn-
ing’, in International Conference on Machine Learning, pp. 1928–
1937, (2016).
[19] Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine,
‘Data-efficient hierarchical reinforcement learning’, in Advances in
Neural Information Processing Systems, pp. 3303–3313, (2018).
[20] Mohammadreza Nazari, Afshin Oroojlooy, Lawrence Snyder, and Mar-
tin Takác, ‘Reinforcement learning for solving the vehicle routing prob-
lem’, in Advances in Neural Information Processing Systems, pp. 9839–
9849, (2018).
[21] Christos H Papadimitriou, ‘The euclidean travelling salesman problem
is np-complete’, Theoretical Computer Science, 4(3), 237–244, (1977).
[22] Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and
Vaibhava Goel, ‘Self-critical sequence training for image captioning’,
in IEEE Conference on Computer Vision and Pattern Recognition, pp.
7008–7024, (2017).
[23] Richard S Sutton and Andrew G Barto, Reinforcement learning: An
introduction, MIT Press, 2018.
[24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, ‘Attention
is all you need’, in Advances in neural information processing systems,
pp. 5998–6008, (2017).
[25] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly, ‘Pointer networks’,
in Advances in Neural Information Processing Systems, pp. 2692–2700,
(2015).
[26] Christos Voudouris and Edward Tsang, ‘Guided local search and its
application to the traveling salesman problem’, European Journal of
Operational Research, 113(2), 469–499, (1999).
[27] Ronald J Williams, ‘Simple statistical gradient-following algorithms
for connectionist reinforcement learning’, Machine Learning, 8(3-4),
229–256, (1992).
[28] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka, ‘How
powerful are graph neural networks?’, International Conference on
Learning Representations, (2019).
