Abstract—A significant issue in representing reinforcement learning agents in Markov decision processes is how to design efficient feature spaces in order to estimate the optimal policy. This study addresses the challenge by proposing a compact framework that employs an on-line clustering approach for constructing appropriate basis functions. It also performs a state-action trajectory analysis to gain valuable affinity information among clusters and to estimate their transition dynamics. Value function approximation is used for policy evaluation in a least-squares temporal difference framework. The proposed method is evaluated in several simulated and real environments, where we obtained promising results.

Index Terms—mixture models, on-line EM, clustering, model-based reinforcement learning

I. INTRODUCTION

Reinforcement Learning (RL) aims at controlling an autonomous agent in unknown stochastic environments [1]. Typically, the environment is modelled as a Markov Decision Process (MDP), where the agent receives a scalar reward signal that evaluates every transition. The objective is to maximize its long-term profit, which is equivalent to maximizing the expected total discounted reward. The value function is used for measuring the quality of a policy; it associates with every state the expected discounted reward obtained when starting from this state and making all subsequent decisions according to the particular policy. However, in cases with large or infinite state spaces the value function cannot be calculated explicitly. In such domains a common strategy is to employ function approximation methodologies, representing the value function as a linear combination of some set of basis functions [2].

The Temporal Difference (TD) family of algorithms [1] provides a convenient framework for policy evaluation, where least-squares temporal difference (LSTD) learning [3] is one of the most popular mechanisms for approximating the value function of a given policy. Least-squares policy iteration (LSPI) [4] is an off-policy method that extends LSTD to control problems, where the policy is refined iteratively. Recently, an online version of LSPI has been proposed in [5] that overcomes the limitations of LSPI in online problems. Kernelized RL methods [6] have also received a lot of attention in recent years, employing kernel techniques in standard RL methods [7] and Gaussian Processes as a description model for the value function [8]. In most cases the basis functions used for estimating the value function remain fixed during the learning process, as for example in the recent work presented in [9], where a number of fixed Fourier basis functions are used for value function approximation. However, there are some works where the basis functions are estimated during the learning process. In [10], for example, a fixed number of basis functions is tuned in a batch manner by building a graph over the state space and then calculating the k eigenvectors of the graph Laplacian matrix. In another work [11], a set of k RBF basis functions is adjusted directly over the Bellman equation of the value function. Finally, in [12] the probability density function and the reward model, which are assumed to be known, are used for creating basis functions from Krylov space vectors (powers of the transition matrix, as used in solving systems of linear equations).

In this study, we propose a model-based approach to value function approximation which relies on an on-line clustering scheme for partitioning the state-action input space into clusters. This is done by considering an appropriate mixture model that is trained incrementally through an on-line version of the Expectation-Maximization (EM) algorithm [13], [14]. A kernel-based mechanism for creating new clusters is also incorporated. The number and the structure of the created clusters compose a dictionary of basis functions which is subsequently used for policy evaluation. In addition, during the clustering procedure the transitions among adjacent clusters are observed and their transition probabilities are computed, as well as their average reward. The policy is then evaluated at each step by estimating the linear weights of the value function within a least-squares framework. The proposed methodology has been tested on several known simulated and real environments, where we measure its efficiency in discovering the optimal policy. Comparisons have been made with a recent online version of the LSPI algorithm [5].

In Section 2, we briefly present some preliminaries and review the basic LSTD scheme for value function approximation. Section 3 describes the online clustering scheme and the model-based approach. In Section 4 the experimental results are presented and, finally, in Section 5 we give conclusions and suggestions for future research.

II. BACKGROUND

A Markov Decision Process (MDP) is a tuple (S, A, P, R, γ), where S is a set of states; A is a set of actions; P : S × A × S → [0, 1] is a Markovian transition model that specifies the probability P(s′|s, a) of transitioning to state s′ when taking action a in state s; R is the reward function; and γ is the discount factor.
which can be further written as

R = HΦw ,    (6)

where

H = [ 1  −γ   0  · · ·  0
      0   1  −γ  · · ·  0
      · · ·
      0   0  · · ·  0   1 ] ,    (7)

Φ = [φ(s1, a1), . . . , φ(sN, aN)]ᵀ  and    (8)

R = [r1, . . . , rN]ᵀ .    (9)
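To make the structure above concrete, the following minimal Python/NumPy sketch builds the matrix H of Eq. (7) from a trajectory of length N and fits the weight vector w of Eq. (6). The use of an ordinary least-squares solve as the estimator is an assumption made here for illustration, since the derivation leading to Eq. (6) is not reproduced in this excerpt; the feature matrix Φ is taken as given.

```python
import numpy as np

def lstd_weights(Phi, rewards, gamma):
    """Fit w so that R ~ H Phi w (Eq. 6), with H as in Eq. (7).

    Phi     : (N, p) matrix whose n-th row is phi(s_n, a_n)  (Eq. 8)
    rewards : (N,) vector [r_1, ..., r_N]                    (Eq. 9)
    gamma   : discount factor
    """
    N = len(rewards)
    # H: ones on the diagonal, -gamma on the superdiagonal (Eq. 7).
    H = np.eye(N) - gamma * np.eye(N, k=1)
    # One possible estimator: least-squares solution of R = H Phi w.
    w, *_ = np.linalg.lstsq(H @ Phi, rewards, rcond=None)
    return w
```

The approximate value of a state-action pair is then the inner product φ(s, a)ᵀw.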
where θj = {μj, Σj, ρj} denotes the set of cluster parameters. Mixture modelling treats clustering as an estimation problem for the model parameters Θk = {uj, θj}, j = 1, . . . , k, by maximizing the log-likelihood function:

L(Θk) = Σ_{n=1}^{N} log { Σ_{j=1}^{k} uj N(sn; μj, Σj) Mu(an; ρj) } .    (13)

The Expectation-Maximization (EM) algorithm [13] is an efficient framework that can be used for this purpose. It iteratively performs two steps: the E-step, where the current
posterior probabilities of the samples belonging to each cluster are calculated:

znj = uj p(xn|θj) / Σ_{j′=1}^{k} uj′ p(xn|θj′) ,    (14)

and the M-step, where the maximization of the expected complete log-likelihood is performed. This leads to closed-form update rules for the model parameters [14].

In our case the samples are non-stationary and are generated sequentially. We present here an extension of the EM algorithm [15] for on-line estimation of mixture models that suits our particular needs. It consists of two phases: first, a mechanism decides whether or not a new cluster must be created, and second, the main EM procedure is performed for adjusting the structure of the clusters so as to incorporate the new sample.

Let us assume that a random sample xn = (sn, an) is observed. The method first performs the E-step and calculates the posterior probability values znj (Eq. 14) based on the current k-order mixture model. The winner cluster j* ∈ [1, k] is then found according to the maximum posterior value, i.e.

j* = arg max_{j=1,...,k} { znj } .    (15)

We assume here that the degree of belongingness of xn to the cluster j* is given by a kernel function K(xn, j*), which is written as a product of two kernel functions, one for each type of input (state and action); i.e.

K(xn, j*) = Ks(xn, j*) Ka(xn, j*) ,    (16)

where

Ks(xn, j*) = exp(−0.5 (sn − μj*)ᵀ Σj*⁻¹ (sn − μj*))

is the state kernel, and

Ka(xn, j*) = Π_{i=1}^{M} ρj*,i^{I(an,i)}

is the action kernel. If the degree of belongingness of xn to cluster j* is less than a predefined threshold value Kmin, a new cluster (k+1) must be created. This is done by initializing it properly:

μ_{k+1} = sn ,   Σ_{k+1} = (1/2) Σj* ,   ρ_{k+1,i} = ξ if i = an, and (1−ξ)/(M−1) otherwise,

where ξ is set to a large value (e.g. ξ = 0.9). The M-step is applied next and provides a step-wise update procedure for the model parameters using the following rules:

uj = (1 − λ) uj + λ znj ,    (17)

μj = μj + λ znj sn ,    (18)

Σj = Σj + λ znj (sn − μj)(sn − μj)ᵀ ,    (19)

nji = nji + I(an, i) znj   and   ρji = nji / Σ_{l=1}^{M} njl ,    (20)

where the term λ takes a small value (e.g. 0.09) and can be decreased over episodes. What is necessary to remark here is the following: In the M-step of the normal (off-line) EM algorithm, the update rules for both Gaussian parameters have the quantity Σ_{n=1}^{N} znj in the denominator (and the sample size N for the weights uj), e.g. μj = Σ_n znj sn / Σ_n znj. In the on-line version, each new sample contributes to the computation of these parameters by a factor equal to znj / Σ_{n=1}^{N} znj. Therefore, when the number of observations becomes too large, the incoming new sample will barely influence the model parameters. To avoid this situation we have selected the above update rules, where we fix the contribution of new samples to λ.
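The two-phase procedure above can be summarized in a short Python sketch. The E-step, the winner-cluster selection (Eq. 15), the kernel test (Eq. 16) and the new-cluster initialization follow the text directly; the numeric default for Kmin, the mixing weight and covariance given to a newly created cluster, and writing the mean and covariance recursions in the standard stochastic-approximation form (each parameter moves a step of size λ znj towards the new sample) are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np

class OnlineClusteringSketch:
    """Two-phase on-line EM step: create a cluster if needed, then update parameters."""

    def __init__(self, n_actions, lam=0.09, K_min=0.3, xi=0.9):
        self.M, self.lam, self.K_min, self.xi = n_actions, lam, K_min, xi
        self.u, self.mu, self.Sigma, self.rho, self.counts = [], [], [], [], []

    def _density(self, s, a, j):
        """Component density p(x | theta_j) = N(s; mu_j, Sigma_j) * Mu(a; rho_j)."""
        d = s - self.mu[j]
        inv = np.linalg.inv(self.Sigma[j])
        norm = np.sqrt(np.linalg.det(2.0 * np.pi * self.Sigma[j]))
        return np.exp(-0.5 * d @ inv @ d) / norm * self.rho[j][a]

    def observe(self, s, a):
        """Process one sample x_n = (s_n, a_n); return the index of its cluster."""
        s = np.asarray(s, dtype=float)
        if not self.u:                                    # first sample ever
            self._create_cluster(s, a, np.eye(len(s)))    # unit covariance: assumption
            return 0

        # E-step: posteriors z_nj (Eq. 14) and winner cluster j* (Eq. 15).
        num = np.array([self.u[j] * self._density(s, a, j) for j in range(len(self.u))])
        z = num / num.sum()
        j_star = int(np.argmax(z))

        # Degree of belongingness (Eq. 16): state kernel times action kernel.
        d = s - self.mu[j_star]
        K_s = np.exp(-0.5 * d @ np.linalg.inv(self.Sigma[j_star]) @ d)
        K_a = self.rho[j_star][a]
        if K_s * K_a < self.K_min:                        # create cluster k+1
            self._create_cluster(s, a, 0.5 * self.Sigma[j_star])
            return len(self.u) - 1

        # M-step: step-wise updates with fixed contribution lam (Eqs. 17-20).
        for j in range(len(self.u)):
            self.u[j] = (1.0 - self.lam) * self.u[j] + self.lam * z[j]
            self.mu[j] += self.lam * z[j] * (s - self.mu[j])
            diff = s - self.mu[j]
            self.Sigma[j] += self.lam * z[j] * (np.outer(diff, diff) - self.Sigma[j])
            self.counts[j][a] += z[j]                     # n_ji += I(a_n, i) z_nj
            self.rho[j] = self.counts[j] / self.counts[j].sum()
        return j_star

    def _create_cluster(self, s, a, Sigma):
        """Initialize cluster k+1: mu = s_n, Sigma = Sigma_{j*}/2, rho peaked at a_n."""
        self.mu.append(s.copy())
        self.Sigma.append(Sigma)
        rho = np.full(self.M, (1.0 - self.xi) / (self.M - 1))
        rho[a] = self.xi
        self.rho.append(rho)
        self.counts.append(rho.copy())                    # pseudo-counts: assumption
        self.u.append(1.0 / (len(self.u) + 1))            # new mixing weight: assumption
        total = sum(self.u)
        self.u = [w / total for w in self.u]              # renormalize the weights
```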
A. Model-based approximation

As becomes obvious from the previous discussion, the EM-based on-line clustering approach performs a partitioning of the input (state-action) space into clusters that also share the same transition dynamics. Clusters can be seen as nodes of a directed (not fully connected) graph that communicate with each other. The learning process constructs new nodes in this graph (by performing a splitting process) and can also provide useful information (frequency and distance) between adjacent nodes. From another point of view, the proposed scheme can be seen as a type of relocatable action model (RAM), which has been proposed recently [16] and provides a decomposition or factorization of the transition function.

We assume a trajectory of transitions (si, ai, ri, si+1) within the same episode. During the on-line clustering we maintain for each cluster j = 1, . . . , k the following quantities:
• t̄_{j,j′}: the mean number of time-steps between two successively observed clusters j, j′;
• R(j, j′): the mean total reward of the transition from cluster j to cluster j′;
• n_{j,j′}: the total number of times (frequency) that this transition has been observed;
• n_j: the total number of times that cluster j has been visited.

Furthermore, two other useful quantities can be calculated: the transition probabilities P(j′|j) between two (adjacent) clusters from their relative frequencies, i.e. P(j′|j) = n_{j,j′} / n_j, and the mean value of the reward function for any cluster j as R(j) = Σ_{j′} P(j′|j) R(j, j′). These can be used for the policy estimation process.
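These statistics can be maintained incrementally as clusters are visited, for instance as in the following sketch; the dictionary-based bookkeeping and the incremental-mean updates are implementation choices made here for illustration, not details prescribed by the text.

```python
from collections import defaultdict

class ClusterStatistics:
    """Incremental bookkeeping of the quantities maintained for each cluster."""

    def __init__(self):
        self.n = defaultdict(int)         # n_j      : number of visits of cluster j
        self.n_pair = defaultdict(int)    # n_{j,j'} : observed transitions j -> j'
        self.t_bar = defaultdict(float)   # mean number of time-steps of j -> j'
        self.R_pair = defaultdict(float)  # mean total reward of the transition j -> j'

    def record(self, j, j_next, steps, total_reward):
        """Update the counters after observing a transition between successive clusters."""
        self.n[j] += 1
        key = (j, j_next)
        self.n_pair[key] += 1
        c = self.n_pair[key]
        self.t_bar[key] += (steps - self.t_bar[key]) / c            # running mean
        self.R_pair[key] += (total_reward - self.R_pair[key]) / c   # running mean

    def P(self, j, j_next):
        """Relative-frequency estimate P(j'|j) = n_{j,j'} / n_j."""
        return self.n_pair[(j, j_next)] / self.n[j] if self.n[j] else 0.0

    def R(self, j):
        """Mean cluster reward R(j) = sum_{j'} P(j'|j) R(j, j')."""
        return sum(self.P(jj, jn) * self.R_pair[(jj, jn)]
                   for (jj, jn) in list(self.n_pair) if jj == j)
```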
The equation for the action value function of a cluster j then becomes

Q(j) = R(j) + Σ_{j′} P(j′|j) γ^{t̄_{j,j′}} Q(j′) ,    (21)

where the summation is made over the neighbourhood of cluster j (the adjacent clusters j′ with P(j′|j) > 0). Thus, a set of k equations is available for the k quantities R(j) (observations):

Rk = Hk Φk wk ,    (22)

where we have considered linear function approximation for the action value function. In this case the kernel design matrix Φk = [φ1, . . . , φk] is explicitly derived from the on-line clustering solution using the kernel function of Eq. 16:

[Φk]_{jj′} = exp(−0.5 (μj − μj′)ᵀ Σj⁻¹ (μj − μj′)) ρjᵀ ρj′ .    (23)
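Combining Eqs. (21)-(23), the cluster values can be computed by building the design matrix of Eq. (23), assembling a matrix Hk that encodes the discounted cluster adjacencies, and solving the linear system of Eq. (22) for the weights wk. The sketch below assumes Hk = I − Γ with Γ_{jj′} = P(j′|j) γ^{t̄_{j,j′}}, which is one natural matrix form of Eq. (21), and uses a least-squares solve; both choices are illustrative assumptions.

```python
import numpy as np

def cluster_action_values(mu, Sigma, rho, P, t_bar, R_mean, gamma):
    """Solve R_k = H_k Phi_k w_k (Eq. 22) and return the cluster values Q = Phi_k w_k.

    mu, Sigma, rho : per-cluster means, covariances and action-probability vectors
    P[j, jn]       : estimated transition probabilities P(j'|j)   (k x k array)
    t_bar[j, jn]   : mean number of time-steps of the transition j -> j'
    R_mean[j]      : mean cluster rewards R(j)
    """
    k = len(mu)
    # Kernel design matrix of Eq. (23).
    Phi = np.zeros((k, k))
    for j in range(k):
        inv = np.linalg.inv(Sigma[j])
        for jn in range(k):
            d = mu[j] - mu[jn]
            Phi[j, jn] = np.exp(-0.5 * d @ inv @ d) * (rho[j] @ rho[jn])
    # H_k = I - Gamma, with Gamma[j, jn] = P(j'|j) * gamma ** t_bar(j, j')  (from Eq. 21).
    H = np.eye(k) - P * gamma ** t_bar
    # Least-squares fit of the weights w_k, then the values Q = Phi_k w_k.
    w, *_ = np.linalg.lstsq(H @ Phi, np.asarray(R_mean, dtype=float), rcond=None)
    return Phi @ w
```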
kernels equal to the number of states with a kernel width equal to 1. Also, we have allowed our method to construct the
[Fig. 2 panels: RMSE of all V and learned-policy performance versus episodes for (a) the 13-states chain and (b) the 98-states chain; curves: Online EM, Online LSPI.]
Fig. 2. Comparative results on policy evaluation and learned policy in Boyan's chain domain with 13 and 98 states. Each curve is the average of 30 independent trials.
the case that an obstacle is hit, where the received reward is −100. The action space has been discretized into the 8 major compass winds, while the length of each step was set equal to 1 m.

The comparative results on the stage world are illustrated in Fig. 5. The first plot shows the number of episodes in which the robot succeeded in finding the target, while the second shows the mean return received by the agent over the last 100 episodes. In the case of the LSPI algorithm, an equidistant fixed 10×10 grid of Gaussian RBFs is used over the state space (800 RBFs in total). As becomes apparent, the proposed methodology manages to discover an optimal policy at a much higher rate, while at the same time reaching the target in more episodes than the online LSPI. Moreover, our method does not need as huge a number of basis functions as the online LSPI does in order to discover an optimal policy. More specifically, the proposed algorithm constructs approximately 300-400 clusters in the world, which specify the basis functions of our model. Finally, Fig. 6 shows the learned policies of both methods after 500 episodes. As we can observe, the two methodologies tend to find the same policies in most areas of the world.

V. CONCLUSION

In this study we have presented a model-based reinforcement learning scheme for learning optimally in MDPs. The proposed method is based on the online partitioning of the state-action space into clusters, while simultaneously constructing a Markov transition matrix by counting the observed transitions among clusters over time. It is our intention to further pursue and develop the method in two directions. First, we can employ regularized least-squares methods, such as the Lasso or Bayesian sparse methodologies, to eliminate the problem of overfitting. Second, during our experiments we have observed a tendency of our method to produce a large number of clusters, some of which may become inactive during the learning process. Thus, adding a mechanism for merging clusters to the body of the online EM procedure constitutes an interesting direction for future work.
[Fig. 5 panels: (a) number of successful episodes and (b) mean return over the last 100 episodes, versus episodes; curves: Online EM, Online LSPI.]
Fig. 5. Comparative results in the test world of the real environment.