2012 IEEE 24th International Conference on Tools with Artificial Intelligence

A model-based reinforcement learning approach using on-line clustering
Nikolaos Tziortziotis and Konstantinos Blekas
Department of Computer Science, University of Ioannina
P.O.Box 1186, Ioannina 45110 - Greece
Email:{ntziorzi,kblekas}@cs.uoi.gr

Abstract—A significant issue in representing reinforcement learning agents in Markov decision processes is how to design efficient feature spaces in order to estimate the optimal policy. This study addresses the challenge by proposing a compact framework that employs an on-line clustering approach for constructing appropriate basis functions. It also performs a state-action trajectory analysis to gain valuable affinity information among clusters and to estimate their transition dynamics. Value function approximation is used for policy evaluation in a least-squares temporal difference framework. The proposed method is evaluated in several simulated and real environments, with promising results.

Index Terms—mixture models, on-line EM, clustering, model-based reinforcement learning

I. INTRODUCTION

Reinforcement Learning (RL) aims at controlling an autonomous agent in unknown stochastic environments [1]. Typically, the environment is modelled as a Markov Decision Process (MDP), where the agent receives a scalar reward signal that evaluates every transition. The objective is to maximize its long-term profit, which is equivalent to maximizing the expected total discounted reward. The value function is used for measuring the quality of a policy; it associates with every state the expected discounted reward obtained when starting from this state and making all subsequent decisions according to the particular policy. However, in cases with large or infinite state spaces the value function cannot be calculated explicitly. In such domains a common strategy is to employ function approximation methodologies, representing the value function as a linear combination of some set of basis functions [2].

The Temporal Difference (TD) family of algorithms [1] provides a convenient framework for policy evaluation, where least-squares temporal difference (LSTD) [3] is one of the most popular mechanisms for approximating the value function of a given policy. Least-squares policy iteration (LSPI) [4] is an off-policy method that extends LSTD to control problems, where the policy is refined iteratively. Recently, an online version of LSPI has been proposed in [5] that overcomes the limitations of LSPI in online problems. Kernelized RL methods [6] have also received a lot of attention in recent years, applying kernel techniques to standard RL methods [7] and using Gaussian Processes as a description model for the value function [8]. In most cases the basis functions used for estimating the value function remain fixed during the learning process, as for example in a recent work presented in [9] where a number of fixed Fourier basis functions are used for value function approximation. However, there are some works where the basis functions are estimated during the learning process. In [10], for example, a fixed number of basis functions are tuned in a batch manner by building a graph over the state space and then calculating the k eigenvectors of the graph Laplacian matrix. In another work [11], a set of k RBF basis functions are adjusted directly over the Bellman equation of the value function. Finally, in [12] the probability density function and the reward model, which are assumed to be known, are used for creating basis functions from Krylov space vectors (powers of the transition matrix applied to systems of linear equations).

In this study, we propose a model-based approach for value function approximation which is based on an on-line clustering scheme for partitioning the state-action input space into clusters. This is done by considering an appropriate mixture model that is trained incrementally through an on-line version of the Expectation-Maximization (EM) algorithm [13], [14]. A kernel-based mechanism for creating new clusters is also incorporated. The number and the structure of the created clusters compose a dictionary of basis functions which is then used for policy evaluation. In addition, during the clustering procedure the transitions among adjacent clusters are observed and their transition probabilities are computed, as well as their average reward. The policy is then evaluated at each step by estimating the linear weights of the value function through the least-squares framework. The proposed methodology has been tested on several known simulated and real environments, where we measure its efficiency in discovering the optimal policy. Comparisons have been made with a recent online version of the LSPI algorithm [5].

In Section 2, we briefly present some preliminaries and review the basic LSTD scheme for value function approximation. Section 3 describes the online clustering scheme and the model-based approach. In Section 4 the experimental results are presented and, finally, in Section 5 we give conclusions and suggestions for future research.

II. BACKGROUND

A Markov Decision Process (MDP) is a tuple (S, A, P, R, γ), where S is a set of states; A is a set of actions; P : S × A × S → [0, 1] is a Markovian transition model that specifies the probability P(s'|s, a) of transition
to state s' when taking action a in state s; R : S × A → R is the reward function for a state-action pair; and γ ∈ (0, 1) is the discount factor for future rewards. A stationary policy π : S → A is a mapping from states to actions and denotes a mechanism for choosing actions. An episode is a sequence of transitions: (s1, a1, r1, s2, ...).

The notion of the value function is of central interest in reinforcement learning tasks. Given a policy π, the value V^π(s) of a state s is defined as the expected discounted sum of rewards obtained when starting from this state until the current episode terminates:

V^π(s) = E_π[ R(s_t) + γ V^π(s_{t+1}) | s_t = s ] .   (1)

This satisfies the Bellman equation, which expresses a relationship between the values of successive states in the same episode. Similarly, the state-action value function Q(s, a) denotes the expected cumulative reward received by taking action a in state s:

Q^π(s, a) = R_a(s) + γ Σ_{s'∈S} P(s'|s, a) max_{a'} Q(s', a') ,   (2)

where R_a(s) specifies the average reward for executing a in s. The objective of RL problems is to estimate an optimal policy π* by choosing actions that yield the optimal action-state value function Q*:

π*(s) = arg max_a Q*(s, a) .   (3)

A common choice for representing the value function is through a linear function approximation using a set of k basis functions φ_j(s, a):

Q(s, a) = φ(s, a)^T w = Σ_{j=1}^k φ_j(s, a) w_j ,   (4)

where w = (w1, ..., wk) is a vector of weights which are unknown and must be estimated so as to minimize the approximation error. The selection of the basis functions is very important; they must be chosen to encode properties of the state and action relevant to the proper determination of the Q values.

The LSTD approach combines the Bellman operator with the least-squares estimation procedure. If we assume an N-length trajectory of transitions (s_i, a_i, r_i, s_{i+1}) sampled from the MDP, a set of N equations is obtained:

r_i = (φ(s_i, a_i) − γ φ(s_{i+1}, a_{i+1}))^T w ,   (5)

which can be further written as

R = HΦw ,   (6)

where

H = | 1  −γ   0  ···  0 |
    | 0   1  −γ  ···  0 |
    | ⋮             ⋱  ⋮ |
    | 0   0   ···  0   1 | ,   (7)

Φ = [φ(s1, a1)^T, ..., φ(sN, aN)^T]^T ,   (8)

R = [r1, ..., rN]^T .   (9)

Then, the LSTD approach estimates the weights of the above linear system according to the least-squares solution:

ŵ = (Φ^T H Φ)^{-1} Φ^T R .   (10)
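To make Eqs. (5)-(10) concrete, the following minimal Python/NumPy sketch assembles Φ, H and R from one recorded trajectory and solves for the weights. The trajectory format, the feature map phi and the handling of terminal successors are assumptions of this sketch, not details taken from the paper.

import numpy as np

def lstd_weights(trajectory, phi, k, gamma=0.95):
    # Least-squares solution of Eq. (10) for one recorded trajectory.
    # trajectory: list of (s, a, r, s_next, a_next) tuples (assumed format);
    # phi: feature map phi(s, a) -> length-k numpy array.
    N = len(trajectory)
    Phi = np.zeros((N, k))        # Eq. (8): one feature row per sample
    Phi_next = np.zeros((N, k))   # features of the successor state-action pairs
    R = np.zeros(N)               # Eq. (9): observed rewards
    for i, (s, a, r, s_next, a_next) in enumerate(trajectory):
        Phi[i] = phi(s, a)
        # for a terminal successor, phi should return zeros (matching the last row of H)
        Phi_next[i] = phi(s_next, a_next)
        R[i] = r
    # H Phi equals Phi - gamma * Phi_next, so Eq. (10) reads:
    A = Phi.T @ (Phi - gamma * Phi_next)   # Phi^T H Phi
    b = Phi.T @ R                          # Phi^T R
    return np.linalg.solve(A, b)           # w_hat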
III. ON-LINE CLUSTERING AND VALUE FUNCTION APPROXIMATION

The proposed methodology is based on a policy evaluation scheme that incrementally separates the input space into clusters and estimates the transition probabilities among them. Thus a dictionary of features is dynamically constructed for modeling the value function. In what follows we assume that the input samples are state-action pairs, denoted as x_n = (s_n, a_n). We also consider a finite action space of size M.

Suppose we are given a data set of N samples {x1, x2, ..., xN}. The task of clustering aims at partitioning the input set into k disjoint clusters containing samples with common properties. Mixture modeling [14] provides a convenient and elegant framework for clustering, where we consider that the properties of a single cluster j are described implicitly via a probability distribution with parameters θ_j. This can be formulated as:

p(x|Θ_k) = Σ_{j=1}^k u_j p(x|θ_j) ,   (11)

where Θ_k denotes the set of mixture model parameters. The parameters 0 < u_j ≤ 1 represent the mixing weights, satisfying Σ_{j=1}^k u_j = 1. Since we are dealing with two sources of information (states and actions), in our scheme we assume that the conditional density of each cluster is written as a product of two pdfs:
• a Gaussian pdf N(s; μ_j, Σ_j) for the state s, and
• a multinomial pdf Mu(a; ρ_j) = Π_{i=1}^M ρ_{ji}^{I(a,i)} for the action a, where ρ_j is an M-length probabilistic vector and I(a, i) is a binary indicator function, i.e. I(a, i) = 1 if a = i, and 0 otherwise.

This can be written as:

p(x|θ_j) = N(s; μ_j, Σ_j) Mu(a; ρ_j) ,   (12)

where θ_j = {μ_j, Σ_j, ρ_j} denotes the set of cluster parameters.
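For illustration, a small sketch of evaluating the per-cluster conditional density of Eq. (12) is given below; working in log-space is only a numerical convenience added here, not part of the formulation.

import numpy as np

def log_cluster_density(s, a, mu, Sigma, rho):
    # log p(x | theta_j) of Eq. (12) for a sample x = (s, a):
    # Gaussian over the continuous state times multinomial over the discrete action.
    d = len(mu)
    diff = np.asarray(s, dtype=float) - mu
    _, logdet = np.linalg.slogdet(Sigma)
    log_gauss = -0.5 * (d * np.log(2.0 * np.pi) + logdet
                        + diff @ np.linalg.solve(Sigma, diff))
    log_multi = np.log(rho[a])   # only the chosen action i = a contributes, since I(a, i) = 1
    return log_gauss + log_multi

The mixture density of Eq. (11) is then obtained as the weighted sum Σ_j u_j exp(log_cluster_density(s, a, μ_j, Σ_j, ρ_j)).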
efficient framework that can be used for this purpose. It
R = [r1 , . . . , rN ]. (9) iteratively performs two steps: The E-step, where the current

713
posterior probabilities of the samples belonging to each cluster are calculated:

z_nj = u_j p(x_n|θ_j) / Σ_{j'=1}^k u_{j'} p(x_n|θ_{j'}) ,   (14)

and the M-step, where the maximization of the expected complete log-likelihood is performed. This leads to closed-form update rules for the model parameters [14].

In our case the samples are non-stationary and are generated sequentially. We present here an extension of the EM algorithm [15] for on-line estimation of mixture models that suits our particular needs. It consists of two phases: first, a mechanism decides whether or not a new cluster must be created, and secondly the main EM procedure is performed for adjusting the structure of the clusters so as to incorporate the new sample.

Let us assume that a random sample x_n = (s_n, a_n) is observed. The method first performs the E-step and calculates the posterior probability values z_nj (Eq. 14) based on the current k-order mixture model. The winner cluster j* ∈ [1, k] is then found according to the maximum posterior value, i.e.

j* = arg max_{j=1,...,k} { z_nj } .   (15)

We assume here that the degree of belongingness of x_n to the cluster j* is given by a kernel function K(x_n, j*), which is written as a product of two kernel functions, one for each type of input (state and action); i.e.

K(x_n, j*) = K_s(x_n, j*) K_a(x_n, j*) ,   (16)

where

K_s(x_n, j*) = exp(−0.5 (s_n − μ_{j*})^T Σ_{j*}^{-1} (s_n − μ_{j*}))

is the state kernel, and

K_a(x_n, j*) = Π_{i=1}^M ρ_{j*,i}^{I(a_n,i)}

is the action kernel. If the degree of belongingness of x_n to cluster j* is less than a predefined threshold value K_min, a new cluster (k+1) must be created. This is done by initializing it properly:

μ_{k+1} = s_n ,   Σ_{k+1} = (1/2) Σ_{j*} ,   ρ_{k+1,i} = ξ if i = a_n, and (1−ξ)/(M−1) otherwise,

where ξ is set to a large value (e.g. ξ = 0.9). Otherwise, the M-step is applied next, providing a step-wise update procedure for the model parameters using the following rules:

u_j = (1 − λ) u_j + λ z_nj ,   (17)
μ_j = μ_j + λ z_nj s_n ,   (18)
Σ_j = Σ_j + λ z_nj (s_n − μ_j)(s_n − μ_j)^T ,   (19)
n_ji = n_ji + I(a_n, i) z_nj   and   ρ_ji = n_ji / Σ_{l=1}^M n_jl ,   (20)

where the term λ takes a small value (e.g. 0.09) and can be decreased over episodes. A remark is necessary here: in the M-step of the normal (off-line) EM algorithm, the update rules for both Gaussian parameters have the quantity Σ_{n=1}^N z_nj in the denominator (and the size N for the weights u_j), e.g. μ_j = Σ_n z_nj s_n / Σ_n z_nj. During the on-line version, each new sample contributes to the computation of these parameters by a factor equal to z_nj / Σ_{n=1}^N z_nj. Therefore, when the number of observations becomes too large, an incoming new sample will barely influence the model parameters. To avoid this situation we have selected the above update rules, where we fix the contribution of new samples to λ.
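A minimal sketch of one on-line clustering step is given below, reusing the log_cluster_density helper sketched earlier. Eqs. (17) and (20) are implemented as printed; the mean and covariance steps of Eqs. (18)-(19) are written here in the usual stochastic-approximation (convex) form, which is one reading of the step-wise update, and the mixing weight assigned to a newly created cluster is also an assumption.

import numpy as np

def online_em_step(s, a, clusters, M, lam=0.09, K_min=0.7, xi=0.9):
    # One on-line update for a new sample x_n = (s, a).
    # clusters: list of dicts with keys 'u', 'mu', 'Sigma', 'rho', 'n_act'.
    s = np.asarray(s, dtype=float)
    # E-step (Eq. 14): responsibilities under the current k-cluster model
    dens = np.array([c['u'] * np.exp(log_cluster_density(s, a, c['mu'], c['Sigma'], c['rho']))
                     for c in clusters])
    z = dens / max(dens.sum(), 1e-300)
    j_star = int(np.argmax(z))                       # winner cluster, Eq. (15)
    win = clusters[j_star]
    # kernel test of Eq. (16): state kernel times action kernel
    diff = s - win['mu']
    K_val = np.exp(-0.5 * diff @ np.linalg.solve(win['Sigma'], diff)) * win['rho'][a]
    if K_val < K_min:
        # create cluster k+1 as described above; its mixing weight is an assumption
        rho_new = np.full(M, (1.0 - xi) / (M - 1)); rho_new[a] = xi
        clusters.append({'u': 1.0 / (len(clusters) + 1), 'mu': s.copy(),
                         'Sigma': 0.5 * win['Sigma'].copy(), 'rho': rho_new,
                         'n_act': rho_new.copy()})   # pseudo-counts so Eq. (20) starts from rho_new
        return len(clusters) - 1
    # M-step (Eqs. 17-20) with a fixed contribution lam for the new sample;
    # here every cluster is moved with weight z[j] (Algorithm 1 phrases this as
    # updating the winner's prototype)
    for zj, c in zip(z, clusters):
        c['u'] = (1.0 - lam) * c['u'] + lam * zj                            # Eq. (17)
        c['mu'] = c['mu'] + lam * zj * (s - c['mu'])                        # Eq. (18), convex form
        d = s - c['mu']
        c['Sigma'] = c['Sigma'] + lam * zj * (np.outer(d, d) - c['Sigma'])  # Eq. (19), convex form
        c['n_act'][a] += zj                                                 # Eq. (20)
        c['rho'] = c['n_act'] / c['n_act'].sum()
    return j_star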
A. Model-based approximation

As is obvious from the previous discussion, the EM-based on-line clustering approach performs a partitioning of the input (state-action) space into clusters that also share the same transition dynamics. Clusters can be seen as nodes of a directed (not fully connected) graph that communicate with each other. The learning process constructs new nodes in this graph (by performing a splitting process) and can also provide useful information (frequency and distance) between adjacent nodes. From another point of view, the proposed scheme can be seen as a type of relocatable action model (RAM), which has been proposed recently [16] and provides a decomposition or factorization of the transition function.

We assume a trajectory of transitions (s_i, a_i, r_i, s_{i+1}) in the same episode. During the on-line clustering we maintain for each cluster j = 1, ..., k the following quantities:
• t̄_{j,j'}: the mean number of time-steps between two successively observed clusters j, j';
• R(j, j'): the mean total reward of the transition from cluster j to cluster j';
• n_{j,j'}: the total number of times (frequency) that we have observed this transition;
• n_j: the total number of times that cluster j is visited.

Furthermore, two other useful quantities can be calculated: the transition probabilities P(j'|j) between two (adjacent) clusters, obtained from their relative frequencies, i.e. P(j'|j) = n_{j,j'} / n_j, and the mean value of the reward function for any cluster j, R(j) = Σ_{j'} P(j'|j) R(j, j'). These can be used for the policy estimation process, and can be maintained incrementally as sketched below.
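The following sketch shows one way to book-keep the per-cluster quantities listed above and to derive P(j'|j) and R(j) from them; the running-sum representation of the means is our own choice, not a structure prescribed by the paper.

from collections import defaultdict

class ClusterModel:
    # Per-cluster transition statistics (a sketch).

    def __init__(self):
        self.n_visits = defaultdict(int)      # n_j
        self.n_trans = defaultdict(int)       # n_{j,j'}
        self.sum_steps = defaultdict(float)   # running sums for t_bar_{j,j'}
        self.sum_reward = defaultdict(float)  # running sums for R(j,j')

    def record(self, j, j_next, steps, total_reward):
        # one observed transition from cluster j to an adjacent cluster j'
        self.n_visits[j] += 1
        self.n_trans[(j, j_next)] += 1
        self.sum_steps[(j, j_next)] += steps
        self.sum_reward[(j, j_next)] += total_reward

    def P(self, j_next, j):
        return self.n_trans[(j, j_next)] / self.n_visits[j]   # P(j'|j) = n_{j,j'} / n_j

    def t_bar(self, j, j_next):
        return self.sum_steps[(j, j_next)] / self.n_trans[(j, j_next)]

    def R_pair(self, j, j_next):
        return self.sum_reward[(j, j_next)] / self.n_trans[(j, j_next)]

    def R(self, j):
        # mean cluster reward: R(j) = sum_{j'} P(j'|j) R(j, j')
        return sum(self.P(jn, j) * self.R_pair(j, jn)
                   for (jj, jn) in list(self.n_trans) if jj == j)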
The equation for the action value function of a cluster j then becomes

Q(j) = R(j) + Σ_{j'} P(j'|j) γ^{t̄_{j,j'}} Q(j') ,   (21)

where the summation is made over the neighbourhood of cluster j (adjacent clusters j' where P(j'|j) > 0). Thus, a set of k equations is available for the k quantities R(j) (observations):

R_k = H_k Φ_k w_k ,   (22)

where we have considered a linear function approximation for the action value function. In this case the kernel design
matrix Φ_k = [φ_1 ... φ_k] is explicitly derived from the online clustering solution using the kernel function of Eq. 16:

[Φ_k]_{jj'} = exp(−0.5 (μ_j − μ_{j'})^T Σ_{j'}^{-1} (μ_j − μ_{j'})) ρ_j^T ρ_{j'} .   (23)

Also, the matrix H_k contains the coefficients of Eq. 21, i.e. in each row j we have [H_k]_{jj} = 1 and [H_k]_{j,j'} = −P(j'|j) γ^{t̄_{j,j'}} in the case where the two clusters are adjacent (P(j'|j) > 0). Finally, R_k is the vector of the calculated mean reward values per cluster, i.e. R_k = [R(1), ..., R(k)]. The least-squares solution for the linear weights w_k can then be obtained as

ŵ_k = (Φ_k^T H_k Φ_k)^{-1} Φ_k^T R_k .   (24)

Thus, for an input state-action pair x = (s, a) we can estimate the action value function according to the current policy as:

Q(s, a) = φ(s, a)^T ŵ_k ,   (25)

where φ(s, a) = [K(x, 1), ..., K(x, k)].
The above procedure is repeated until convergence, or until a given number of episodes is reached. The method starts with a single cluster (k = 1) initialized by the first sample taken by the agent. At every time step the on-line EM clustering procedure and the policy evaluation stage are performed sequentially. The overall scheme of the proposed methodology is given in Algorithm 1.

Algorithm 1 General framework of the proposed methodology
1: Start with k = 1 and use the first point x1 = (s1, a1) for initializing it. Set a random value to weight w1. t = 0.
2: while convergence or maximum number of episodes not reached do
3:   Suppose the previous input is xi = (si, ai).
4:   Observe the new state si+1.
5:   Select the action according to the current policy: ai+1 = arg max_{l=1,...,M} Q(si+1, l).
6:   Find the winning cluster j* = arg max_{j=1,...,k} {znj}.
7:   if K(xi+1, mj*) < Kmin then
8:     Create a new cluster (k = k + 1) and initialize its prototype mk with xi+1.
9:     Create a new weight wk of the linear model and initialize it randomly. wt = wt ∪ wk.
10:  else
11:    Update the prototype mj* of the winning cluster using Eqs. 17-20.
12:  end if
13:  Obtain the new k basis functions as: φj(s, a) = K((s, a), mj), ∀ j = 1, ..., k.
14:  Update the environment statistics.
15:  Update the model weights wt according to Eq. 24.
16:  t = t + 1
17: end while
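A rough Python driver loop in the spirit of Algorithm 1 is sketched below, tying together the helpers from the previous sketches. The environment interface (reset() returning a state vector, step(a) returning (s_next, r, done)), the initial covariance, and the exact moment at which cluster-to-cluster transitions are recorded are all assumptions, not details fixed by the paper.

import numpy as np

def run_agent(env, M, episodes, gamma=1.0, eps=0.1):
    # states are assumed to be 1-D numpy vectors
    clusters, model, w = [], ClusterModel(), None
    for _ in range(episodes):
        s, done = np.asarray(env.reset(), dtype=float), False
        a = np.random.randint(M)                       # first action chosen at random
        j_prev, steps, ret = None, 0, 0.0
        while not done:
            s_next, r, done = env.step(a)
            s_next = np.asarray(s_next, dtype=float)
            # next action: epsilon-greedy over Eq. (25) (see the sketch further below)
            if w is None or len(w) != len(clusters) or np.random.rand() < eps:
                a_next = np.random.randint(M)
            else:
                a_next = int(np.argmax([q_value(s_next, b, clusters, w) for b in range(M)]))
            if not clusters:                           # step 1 of Algorithm 1
                rho0 = np.full(M, 0.1 / (M - 1)); rho0[a] = 0.9
                clusters.append({'u': 1.0, 'mu': s.copy(),
                                 'Sigma': np.eye(len(s)),   # assumed initial covariance
                                 'rho': rho0, 'n_act': rho0.copy()})
            j = online_em_step(s, a, clusters, M)      # steps 6-12
            steps, ret = steps + 1, ret + r
            if j_prev is not None and j != j_prev:     # record a cluster-to-cluster transition
                model.record(j_prev, j, steps, ret)
                steps, ret = 0, 0.0
            j_prev, s, a = j, s_next, a_next
            try:
                w = solve_cluster_weights(clusters, model, gamma)   # steps 13-15, Eq. (24)
            except np.linalg.LinAlgError:
                pass                                   # keep the old weights if the system is singular
    return clusters, model, w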
IV. EXPERIMENTAL RESULTS

A series of experiments has been conducted on a variety of simulated benchmarks and real environments, in order to study the performance of the proposed model-based approach.

Fig. 1. Simulated experimental domains: (a) Boyan's Chain, (b) Puddle World.

In all domains, the discount factor γ was set equal to 1, while the threshold Kmin for creating new clusters was set to Kmin = 0.7. Finally, in order to introduce stochasticity into the transitions and to attain better exploration of the environment, the actions are chosen ε-greedily. In our method, ε is initially set equal to 0.1 and decreases steadily at each step.
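A tiny sketch of the ε-greedy rule with a decaying ε follows; the multiplicative decay schedule is an assumption, since the text only states that ε starts at 0.1 and decreases steadily.

import numpy as np

def epsilon_greedy(q_values, eps):
    # with probability eps pick a random action, otherwise the greedy one
    if np.random.rand() < eps:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

eps = 0.1
for step in range(1000):
    q = [0.0, 1.0, -0.5, 0.2]      # toy Q-values standing in for Eq. (25)
    a = epsilon_greedy(q, eps)
    eps *= 0.999                    # assumed small decrease per step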
Comparison has been made with least-squares temporal difference (LSTD) [3] in the case of the policy evaluation problem (Boyan's chain), i.e. evaluating the value function of a given policy, and with online least-squares policy iteration (LSPI) [5] in the case of the control learning problem, i.e. discovering the optimal policy. Both methods are considered state-of-the-art RL algorithms for such problems, where the action value function is represented as a linear model using an equidistant fixed grid of radial basis functions (RBFs) over the state space, following the suggestions described in [5]. It must be noted that in our experiments the state space is small, either 1D or 2D, with a small discrete set of actions. This makes the LSTD and LSPI methodologies practically applicable (the number of basis functions is not huge), and since they cover all the possible state-action pairs, they are able to find the optimal solution. The objective of our study was to examine the ability of our method to obtain the best performance with a smaller number of basis functions, constructed by the online EM clustering procedure. Moreover, our method manages to work in higher-dimensional spaces where the other methods have some limitations, and this constitutes an interesting direction for future work.

A. Experiments with Simulated environments

The first series of experiments was made using two well-known simulated benchmarks. More specifically, we first used the classical Boyan's chain problem (Fig. 1(a)) with N = 13 and N = 98 states [3]. In this application the states are connected in a chain, where an agent located in a state s > 2 can move into states s − 1 and s − 2 with the same probability, receiving a reward of −3 (r = −3). On the other hand, from states 2 and 1 there are only deterministic transitions to states 1 and 0, where the received rewards are −2 and 0, respectively. Every episode starts at state N and terminates at state 0.
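For reference, a minimal simulator of the chain dynamics just described is sketched below; representing states as plain integers is an implementation choice. Since the transitions are uncontrolled, step() takes no action, which is why the domain is suited to policy evaluation.

import random

class BoyanChain:
    # Boyan's chain as described above (N = 13 or N = 98 states).

    def __init__(self, N=13):
        self.N = N
        self.s = N

    def reset(self):
        self.s = self.N                       # every episode starts at state N
        return self.s

    def step(self):
        # one transition; returns (next_state, reward, done)
        if self.s > 2:
            self.s -= random.choice((1, 2))   # to s-1 or s-2 with equal probability
            r = -3.0
        elif self.s == 2:
            self.s, r = 1, -2.0
        else:                                 # s == 1: deterministic move to the terminal state 0
            self.s, r = 0, 0.0
        return self.s, r, self.s == 0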
In this specific domain, both the policy evaluation problem and the problem of discovering the optimal policy have been considered. It must be noted that both the LSTD and online LSPI methods had a number of RBF
kernels equal to the number of states, with a kernel width equal to 1. Also, we have allowed our method to construct the same number of clusters, in an attempt to focus our study on the effect of the clusters' transition probabilities. The results are illustrated in Fig. 2, plotting the root mean square error (RMSE) between the true and the estimated value function (first problem), as well as the mean returns of the last 100 episodes (second problem) for each method. As is obvious, the proposed method performs better exploration, estimating the optimal value function more accurately than LSTD. On the control problem, both methodologies manage to discover the optimal policies, but the proposed method converges at a higher rate than the online LSPI.

Another simulated environment used in our experiments is the Puddle World [17] (Fig. 1(b)), found in the RL-Glue Library (available at http://library.rl-community.org/wiki). The Puddle World is a continuous world with two oval puddles, and the goal is to reach the upper right corner from any random position while avoiding the two puddles. The environmental states are 2-dimensional (x and y coordinates) and the agent can choose one of four actions that correspond to the four major compass directions: up, down, left, or right. The received reward is −1 except for the puddle region, where a penalty between 0 and −40 is received, depending on the proximity to the middle of the puddle.

In this particular problem, comparison has been made with the online LSPI method, where an equidistant fixed N × N grid of RBFs is used. In this experiment, we have used three values of N (15, 20, 25), which correspond to sets of 900, 1600 and 2500 basis functions, respectively. On the other hand, our algorithm constructs approximately 450-550 basis functions. The results are illustrated in Fig. 3, which gives the mean returns received during the last 30 episodes. The experimental results agree with our belief that the number of basis functions is a very important issue and that using a large number of RBFs does not entail better performance. Finally, it becomes apparent that our method manages to discover the optimal policy.
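To illustrate what the fixed-grid baseline representation looks like, here is a sketch of an equidistant N × N RBF feature vector over a 2-D state, replicated once per action (so 15×15×4 = 900, 20×20×4 = 1600 and 25×25×4 = 2500 features, matching the counts above). The state bounds, kernel width and normalization actually used in [5] are not given here, so those are assumptions.

import numpy as np

def rbf_grid_features(s, a, n_grid=15, n_actions=4, low=0.0, high=1.0, width=None):
    # equidistant grid of Gaussian RBF centers over [low, high]^2
    centers = np.linspace(low, high, n_grid)
    cx, cy = np.meshgrid(centers, centers)
    grid = np.column_stack([cx.ravel(), cy.ravel()])
    if width is None:
        width = (high - low) / (n_grid - 1)            # default: one grid spacing
    d2 = np.sum((grid - np.asarray(s, dtype=float)) ** 2, axis=1)
    block = np.exp(-d2 / (2.0 * width ** 2))
    feats = np.zeros(n_grid * n_grid * n_actions)
    feats[a * n_grid * n_grid:(a + 1) * n_grid * n_grid] = block   # one block per discrete action
    return feats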
Fig. 3. Comparative results in the simulated environment Puddle World: mean returns of the last 30 episodes versus episodes for Online EM and Online LSPI with 15×15, 20×20 and 25×25 RBF grids.

B. Experiments with Real Environments

Experiments have also been conducted in a number of real environments by using the wheeled mobile robot platform Pioneer/PeopleBot, shown in Fig. 4(a), which is based on the robust P3-DX base. The robot is equipped with advanced tools for communication and control, such as the ARIA (Advanced Robot Interface for Applications) library, which provides a convenient framework for controlling and receiving data from MobileRobots platforms. At the same time, a plethora of sensors are included in this type of robot, such as sonar, laser, bumpers and a pan-tilt-zoom camera. For the purposes of our experiments, only the sonar and laser sensors were used, in the case of obstacle avoidance. Furthermore, an embedded motion controller provides at each time step the robot state, such as the robot position (x, y, θ), sensor range data, etc. Due to numerous physical restrictions, such as the strict battery life, the training of the specific methodologies is carried out using the MobileSim simulator (more details can be found at http://robots.mobilerobots.com/wiki/MobileSim). This simulator is built on the well-known Stage platform and manages to simulate the real environment with satisfactory precision and realism.

Fig. 4. The mobile robot and the 2D-grid map used in our experiments: (a) Pioneer/PeopleBot, (b) Stage world. This is one snapshot from the MobileSim simulator with visualization of the robot's laser and sonar range scanners.

A grid map (stage world) has been selected for our experiments, as shown in Fig. 4(b). The specific world has been designed and edited using the Mapper toolkit. The objective of the robot in this task is to find a fixed landmark (shown with a rectangular green box in the map of Fig. 4(b)) in the minimum number of steps, starting from any position in the world and performing a finite number of actions. The particular task is episodic, and a new episode starts when one of the following incidents occurs first: the maximum allowed number of steps per episode is exceeded (in our case set to 100), an obstacle is hit, or the target is reached. The state space consists of two continuous variables: the x and y coordinates, which specify the position of the robot in the world. At each time step, the robot receives an immediate reward of −1, except in the case that an obstacle is hit, where the received reward is −100. The action space has been discretized into the 8 major compass directions, while the length of each step was set equal to 1 m.
Fig. 2. Comparative results on policy evaluation and the learned policy in Boyan's chain domain with (a) 13 states and (b) 98 states. In each case, the left panel plots the RMSE of the value function over all states (Online EM vs. LSTD) and the right panel the mean returns of the last 100 episodes (Online EM vs. Online LSPI). Each curve is the average of 30 independent trials.

The comparative results for the stage world are illustrated in Fig. 5. The first plot shows the number of episodes in which the robot succeeded in finding the target, while the second gives the mean returns received by the agent in the last 100 episodes. In the case of the LSPI algorithm, an equidistant fixed 10 × 10 grid of Gaussian RBFs is used over the state space (800 RBFs in total). As becomes apparent, the proposed methodology manages to discover an optimal policy at a much higher rate, while at the same time reaching the target in more episodes than the online LSPI. Moreover, our method does not need as huge a number of basis functions as the online LSPI does in order to discover an optimal policy. More specifically, the proposed algorithm constructs approximately 300-400 clusters in the world, which specify the basis functions of our model. Finally, Fig. 6 shows the learned policies of both methods after 500 episodes. As we can observe, the two methodologies tend to find the same policies in most areas of the world.

V. CONCLUSION

In this study we have presented a model-based reinforcement learning scheme for learning optimally in MDPs. The proposed method is based on the online partitioning of the state-action space into clusters, while simultaneously constructing a Markov transition matrix by counting the observed transitions that each cluster undergoes over time steps. It is our intention to further pursue and develop the method in two directions. First, we can employ regularized least-squares methods, such as the Lasso or Bayesian sparse methodologies, to eliminate the problem of overfitting. Also, during our experiments we have observed a tendency of our method to produce a large number of clusters; in fact, it is possible for some of them to become inactive during the learning process. Thus, adding a mechanism for merging clusters to the body of the online EM procedure constitutes an interesting direction for future work.

Fig. 5. Comparative results in the test world of the real environment: (a) the number of successful episodes and (b) the mean returns of the last 100 episodes, for Online EM and Online LSPI over 500 episodes.

Fig. 6. Learned policies by both comparative methods in the case of the test world: (a) Online EM, (b) Online LSPI.

REFERENCES

[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, USA, 1998.
[2] L. Buşoniu, R. Babuška, B. De Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, 2010.
[3] J. A. Boyan, "Technical update: Least-squares temporal difference learning," Machine Learning, pp. 233-246, 2002.
[4] M. G. Lagoudakis and R. Parr, "Least-squares policy iteration," Journal of Machine Learning Research, vol. 4, pp. 1107-1149, 2003.
[5] L. Buşoniu, D. Ernst, B. De Schutter, and R. Babuška, "Online least-squares policy iteration for reinforcement learning control," in Proceedings of the 2010 American Control Conference, 2010, pp. 486-491.
[6] G. Taylor and R. Parr, "Kernelized value function approximation for reinforcement learning," in International Conference on Machine Learning, 2009, pp. 1017-1024.
[7] X. Xu, H. Hu, and B. Dai, "Adaptive sample collection using active learning for kernel-based approximate policy iteration," in Proceedings of the Adaptive Dynamic Programming and Reinforcement Learning, 2011, pp. 56-61.
[8] Y. Engel, S. Mannor, and R. Meir, "Reinforcement learning with Gaussian processes," in International Conference on Machine Learning, 2005, pp. 201-208.
[9] G. Konidaris, S. Osentoski, and P. Thomas, "Value function approximation in reinforcement learning using the Fourier basis," in AAAI Conference on Artificial Intelligence, 2011, pp. 380-385.
[10] S. Mahadevan and M. Maggioni, "Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes," Journal of Machine Learning Research, vol. 8, pp. 2169-2231, 2007.
[11] I. Menache, S. Mannor, and N. Shimkin, "Basis function adaptation in temporal difference reinforcement learning," Annals of Operations Research, vol. 134, pp. 215-238, 2005.
[12] M. Petrik, "An analysis of Laplacian methods for value function approximation in MDPs," in International Joint Conference on Artificial Intelligence, 2007, pp. 2574-2579.
[13] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1-38, 1977.
[14] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[15] R. Neal and G. Hinton, "A view of the EM algorithm that justifies incremental, sparse, and other variants," in Learning in Graphical Models, NATO Advanced Study Institute, 1998, pp. 355-368.
[16] B. R. Leffler, M. L. Littman, and T. Edmunds, "Efficient reinforcement learning with relocatable action models," in Proceedings of the Twenty-Second Conference on Artificial Intelligence, 2007, pp. 572-577.
[17] R. S. Sutton, "Generalization in reinforcement learning: Successful examples using sparse coarse coding," in Advances in Neural Information Processing Systems 8. MIT Press, 1996, pp. 1038-1044.
