0% found this document useful (0 votes)

6 views7 pages

ICTAI12

Uploaded by

RəhimTufarghanli

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

6 views7 pages

ICTAI12

Uploaded by

RəhimTufarghanli

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 7

2012 IEEE 24th International Conference on Tools with Artificial Intelligence

A model-based reinforcement learning approach

using on-line clustering
Nikolaos Tziortziotis and Konstantinos Blekas
Department of Computer Science, University of Ioannina
P.O.Box 1186, Ioannina 45110 - Greece
Email:{ntziorzi,kblekas}@cs.uoi.gr

Abstract—A significant issue in representing reinforcement in [9] where a number of fixed Fourier basis functions are
learning agents in Markov decision processes is how to design used for value function approximation. However, there are
efficient feature spaces in order to estimate optimal policy. This some works where the basis functions are estimated during
particular study addresses this challenge by proposing a compact
framework that employs an on-line clustering approach for con- the learning process. In [10] for example, a steady number of
structing appropriate basis functions. Also, it performs a state- basis functions are tuned in a batch manner by building a graph
action trajectory analysis to gain valuable affinity information over the state space and then calculating the k eigenvectors of
among clusters and estimate their transition dynamics. Value the graph Laplacian matrix. In another work [11], a set of k
function approximation is used for policy evaluation in a least- RBF basis function are adjusted directly over the Bellman’s
squares temporal difference framework. The proposed method
is evaluated in several simulated and real environments, where equation of the value function. Finally, in [12] the probability
we took promising results. density function and the reward model, which are assumed to
Index Terms—mixture models, on-line EM, clustering, model- be known, are used for creating basis function from Krylov
based reinforcement learning space vectors (powers of the transition matrix used to systems
of linear equations).
I. I NTRODUCTION In this study, we propose a model-based approach for value
Reinforcement Learning (RL) aims at controlling an au- function approximation which is based on an on-line clustering
tonomous agent in unknown stochastic environments [1]. approach for partitioning the state-action input space into
Typically, the environment is modelled as a Markov Decision clusters. This is done by considering an appropriate mixture
Process (MDP), where the agent receives a scalar reward signal model that is trained incrementally through an on-line version
that evaluates every transition. The objective is to maximize its of the Expectation-Maximization (EM) algorithm [13], [14].
long-term profit that is equivalent to maximizing the expected A kernel-based mechanism for creating new clusters is also
total discounted reward. Value function is used for measuring incorporated. The number and the structure of the created
the quality of a policy, which associates to every state the ex- clusters compose a dictionary of basis functions which used
pected discounted reward when starting from this state and all next for policy evaluation. In addition, during the clustering
decisions are made following the particular policy. However, procedure the transitions among adjacent clusters are observed
in cases with large or infinite state spaces the value function and the transition probabilities are computed, as well as their
cannot be calculated explicitly. In such domains a common average reward. Policy is then evaluated at each step by
strategy is to employ function approximation methodologies, estimating the linear weights of the value function through
by representing the value function as a linear combination of the least-squares framework. The proposed methodology has
some set of basis functions [2]. been tested to several known simulated and real environments
The Temporal Difference (TD) family of algorithms [1] where we measure its efficiency in discovering the optimal
provides a nice framework for policy evaluation, where the policy. Comparisons have been made using a recent online
least-squares temporal difference (LSTD) [3] is one of the version of LSPI algorithm [5].
most popular mechanism for approximating the value function In section 2, we briefly present some preliminaries and
of a given policy. The least square policy iteration (LSPI) [4] review the basic LSTD scheme for value function approxima-
is an off-policy method that extends the LSTD to control tion. Section 3 describes the online clustering scheme and the
problems, where the policy is refined iteratively. Recently, model-based approach. In section 4 the experimental results
an online version of the LSPI has been proposed in [5] that are presend and finally, in section 5 we give conclusions and
overcome the limitation of the LSPI in online problems. Also, suggestions for future research.
kernelized RL methods [6] have been paid a lot of attention
last years by employing kernel techniques to standard RL II. BACKGROUND
methods [7] and Gaussian Processes as a description model A Markov Decision Process (MDP) is a tuple
for the value function [8]. In most cases the basis functions (S, A, P, R, γ), where S is a set of states; A a set of
used for estimating the value function remain fixed during actions; P : S × A × S → [0, 1] is a Markovian transition
the learning process, as for example a recent work presented model that specifies the probability P (s |s, a) of transition

1082-3409/12 $26.00 © 2012 IEEE 712

DOI 10.1109/ICTAI.2012.101
to state s when taken an action a in state s; R : S → R is Then, the LSTD approach estimates the weights of the above
the reward function for a state-action pair; and γ ∈ (0, 1) is linear equation according to the least-squares solution:
the discount factor for future rewards. A stationary policy
π : S → A is a mapping from states to actions and denotes ŵ = (Φ HΦ)−1 Φ R . (10)
a mechanism for choosing actions. An episode is a sequence
of transitions: (s1 , a1 , r1 , s2 , . . .). III. O N - LINE CLUSTERING AND VALUE FUNCTION
The notion of value function is of central interest in rein- APPROXIMATION
forcement learning tasks. Given a policy π, the value V π (s) of The proposed methodology is based on a policy evalu-
a state s is defined as the expected discounted sum of rewards ation scheme that incrementally separates the input space
obtained when starting from this state until the current episode into clusters and estimates the transition probabilities among
terminates: them. Thus a dictionary of features is dynamically constructed
V π (s) = Eπ [R(st ) + γV π (st+1 )|st = s] . (1) for modeling the value functions. To what follows we will
assume that the input samples are state-action pairs, denoted
This satisfies the Bellman equations, which expresses a rela- as xn = (sn , an ). We will also consider a finite action space
tionship between the values of successive states in the same of size M .
episode. Similarly, the state-action value function Q(s, a) Suppose we are given a data set of N samples
denotes the expected cumulative reward as received by taking {x1 , x2 , . . . , xN }. The task of clustering aims at partitioning
action a in state s the input set into k disjoint clusters, containing samples

Qπ (s, a) = Ra (s) + γ P (s |s, a) max Q(s , a) , (2) with common properties. Mixture modeling [14] provides a
a
s ∈S convenient and elegant framework for clustering, where we
where Ra (s) specifies the average reward for executing a in consider that the properties of a single cluster j is described
s. The objective of RL problems is to estimate an optimal implicitly via a probability distribution with parameters θj .
policy π ∗ by choosing actions that yields the optimal action- This can be formulated as:
state value function Q∗ : k

p(x|Θk ) = uj p(x|θj ) , (11)
π ∗ (s) = arg max Q∗ (s, a). (3)
a j=1
A common choice for representing the value function is where Θk denotes the set of mixture model parameters. The
through a linear function approximation using a set of k basis parameters 0 < uj ≤ 1 represent the mixing weights satisfying
functions φj (s, a): k
j=1 uj = 1. Since we are dealing with two source of
k
information (state-actions), in our scheme we assume that the
Q(s, a) = φ(s, a) w = φj (s, a)wj , (4) conditional density for each cluster is written as a product of
j=1 two pdfs:
where w = (w1 , . . . , wk ) is a vector of weights which • a Gaussian pdf N (s; μj , Σj ) for the state s and
are unknown and must be estimated so as to minimize the M I(a,i)
• a multinomial pdf M u(a; ρj ) = i=1 ρji for the
approximation error. The selection of the basis functions is
action a, where ρj is a M -length probabilistic vector
very important and must be chosen to encode properties of
while I(a, i) is a binary indicator function, i.e.
the state and action relevant to the proper determination of
the Q values. 1, if a = i
The LSTD approach combines the Bellman operator with I(a, i) = .
0, otherwise
the least-squares estimation procedure. If we assume a N -
length trajectory of transitions (si , ai , ri , si+1 ) sampled from This can be written as:
the MDP, a set of N equations is obtained:
ri = (φ(si , ai ) − γφ(si+1 , ai+1 )) w , (5) p(x|θj ) = N (s; μj , Σj )M u(a; ρj ) , (12)

which can be further written as where θj = {μj , Σj , ρj } denotes the set of cluster parameters.
Mixture modelling treats clustering as an estimation prob-
R = HΦw , (6)
lem for the model parameters Θk = {uj , θj }kj=1 by maximiz-
where ing the log-likelihood function:
⎡ ⎤
1 −γ 0 · · · 0
N
k
⎢0 1 −γ · · · 0⎥
H=⎢ ⎣ .. .. ⎥ ⎦, (7) L(Θk ) = log{ uj N (sn ; μj , Σj )M u(an ; ρj )} . (13)
. . n=1 j=1
0 0 ··· 0 1
Φ = [φ(s1 , a1 ) , . . . , φ(sN , aN ) ] and (8) The Expectation-Maximization (EM) algorithm [13] is an
efﬁcient framework that can be used for this purpose. It
R = [r1 , . . . , rN ]. (9) iteratively performs two steps: The E-step, where the current

713
posterior probabilities of samples to belong to each cluster are is the following: In the M-step of the normal (off-line) EM
calculated: algorithm, the update rules for both Gaussian parameters have
uj p(xn |θj ) N
znj = k , (14) the quantity n=1 znj in the denominator
(and the size N
z sn
j =1 uj p(xn |θj ) for the weights uj ) e.g. μj = n nj

. During the on-
n znj

and the M-step, where the maximization of the expected line version, each new sample contributes to the computation
N
complete log-likelihood is performed. This leads to closed- of these parameters by a factor equal to znj / n=1 znj .
form update rules for the model parameters [14]. Therefore, when the number of observations becomes too
In our case the samples are non-stationary and are gener- large, the incoming new sample will barely influence the
ated sequentially. We present here an extension of the EM model parameters. To avoid this situation we have selected
algorithm [15] for online estimating mixture models that suits the above update rules where we fix the contribution of new
our particular needs. It consists of two phases: first we have samples to λ.
a mechanism for deciding whether or not a new cluster must
be created, and secondly the main EM procedure is performed A. Model-based approximation
for adjusting the structure of clusters so as to incorporate the As it is obvious from the previous discussion, the EM-based
new sample. on-line clustering approach performs a partitioning of the input
Lets assume that a random sample xn = (sn , an ) is (state-action) space into clusters that have also the same tran-
observed. The method first performs the E-step and calculates sition dynamics. Clusters can be seen as nodes of a directed
the posterior probabilities values znj (Eq. 14) based on the (not full) graph that communicate to each other. The learning
current k-order mixture model. The winner cluster j ∗ ∈ [1, k] process construct new nodes in this graph (by performing
is then found according to the maximum posterior value, i.e. a splitting process) and can also provide useful information
k (frequency and distance) between adjacent nodes. In another
j ∗ = arg max{znj }. (15)
j=1 point of view, the proposed scheme can be seen as a type
We assume here that the degree of belongingness of the xn to of relocatable action model (RAM), that has been proposed
the cluster j ∗ is given by a kernel function K(xn , j ∗ ) which recently [16] and provides a decomposition or factorization of
is written as a product of two kernel functions, one for each the transition function.
type of the input (state-action); i.e. We assume a trajectory of transitions (si , ai , ri , si+1 ) in the
same episode. During the on-line clustering we maintain for
K(xn , j ∗ ) = Ks (xn , j ∗ )Ka (xn , j ∗ ) , (16) each cluster j = 1, . . . , k the following quantities:
where • t̄j,j : the mean number of time-steps between two suc-
cessively observed clusters j, j
Ks (xn , j ∗ ) = exp(−0.5(sn − μj ∗ ) Σ−1
j ∗ (sn − μj ∗ )) • R(j, j ): the mean total reward of the transition from
is the state kernel, and cluster j to cluster j
• nj,j : the total number of times (frequency) that we have
M
I(a ,i) observed this transition
Ka (xn , j ∗ ) = ρj ∗ i n • nj : the total number of times that the cluster j is visited.
i=1
Furthermore, another two useful quantities can be calculated:
is the action kernel. If the degree of belongingness of xn to
The transition probabilities P (j |j) between two (adjacent)
cluster j ∗ is less than a predefined threshold value Kmin , a n
clusters from their relative frequencies, i.e. P (j |j) = nj,jj ,
new cluster (k+1) must be created. This is done by initializing
and the mean value of the reward function for any cluster j as
it properly:
R(j) = j P (j |j)R(j, j ). These can be used for the policy
ξ, if i = a
μk+1 = sn , Σk+1 = 12 Σj ∗ , ρk+1,i = 1−ξ , estimation process.
M −1 , otherwise The equation for the action value function for a cluster j
where ξ is set to a large value (e.g. ξ = 0.9). The M-step is then becomes
applied next that provides a step-wise update procedure for
the model parameters using the next rules: Q(j) = R(j) + P (j |j)γ t̄j,j Q(j ) , (21)
j
uj = (1 − λ)uj + λznj , (17)
where the summation is made over the neighbourhood of
μj = μj + λznj sn , (18)
cluster j (adjacent clusters j where P (j |j) > 0). Thus, a
Σj = Σj + λznj (si − μj )(si − μj ) , (19) set of k equations are available for the k quantities R(j)
nji (observations):
nji = nji + I(an , i)znj and ρji = M
, (20) Rk = Hk Φk wk , (22)
l=1 njl
where the term λ takes a small value (e.g. 0.09) and it can where we have considered linear function approximation for
be decreased over episode. That is necessary to remark here the action value function. In this case the kernel design

714
matrix Φk = [φ1 . . . φk ] is explicitly derived from the online
clustering solution using the kernel function of Eq. 16:
[Φk ]jj = exp(−0.5(μj −μj ) Σ−1
j (μj −μj ))ρj ρj . (23)

Also, the matrix Hk contains the coefﬁcients of Eq. 21, i.e. at

each line j we have [Hk ]jj = 1 and [Hk ]j,j = −P (j |j)γ t̄j,j
in case where two clusters are adjacent (P (j |j) > 0). Finally,
Rk is the vector of the calculated mean reward values per (a) Boyan’s Chain (b) Puddle World
cluster, i.e. Rk = [R(1), . . . , R(k)]. The least-square solution Fig. 1. Simulated experimental domains
for the linear weights wk can be obtained then as
ŵk = (Φ
k Hk Φk )
−1
Φk R k . (24)
In all domains, the discount factor γ was set equal to 1,
Thus, for an input state-action pair x = (s, a) we can estimate while the threshold Kmin for creating new clusters was set
the action value function according to the current policy as: as Kmin = 0.7. Finally, in order to introduce stochasticity
into the transitions and to attain better exploration of the
Q(s, a) = φ(s, a) ŵk , (25) environment, the actions are chosen -greedy. In our method
where φ(s, a) = [K(x, 1), . . . , K(x, M )]. The above proce- the is initially set equal to 0.1 and decreases steady at each
dure is repeated until convergence, or a number of episodes is step.
found. The method starts with a single cluster k = 1 initialized Comparison has been made using the least square temporal
by the first sample taken by the agent. At every time step difference (LSTD) [3] in the case of policy evaluation problem
the on-line EM clustering procedure and the policy evaluation (Boyan’s Chain), i.e to evaluate the value function of a given
stage are sequentially performed. The overall scheme of the policy, and the online least square policy iteration (LSPI) [5] in
proposed methodology is given in Algorithm 1. the case of control learning problem, i.e. discover the optimal
policy. Both methods are considered as state-of-the-art RL
Algorithm 1 General framework of the proposed methodology algorithms for policy evaluation problems, where the action
1: Start with k = 1 and use first point xi = (s1 , a1 ) for
value function is representing as a linear model by using an
initializing it. Set a random value to weight w1 . t = 0. equidistant fixed grid of radial basis functions (RBFs) over the
2: while convergence or maximum number of episodes not
state space, following the suggestions described in [5]. It must
found do be noted that in our experiments the state space is small, either
3: Suppose previous input xi = (si , ai ). 1D or 2D, with small discrete set of actions. That makes the
4: Observe new state si+1 . LSTD and LSPI methodologies to be practically applicable
5: Select action according to the current policy (not huge number of basis functions) and since they cover
ai+1 = arg maxM all the possible state-action pairs, they are able to find the
l=1 Q(si+1 , l).
6: Find the winning cluster j ∗ = arg maxkj=1 {znj }. optimal solution. The objective of our study was to discover
7: if K(xi+1 , mj ∗ ) < Kmin then the ability of our method to obtain the best performance with
8: Create a new cluster (k = k + 1) and initialize its less number of basis functions constructing by the online EM
prototype mk with xi+1 . clustering procedure. However, our method manages to work
9: Create a new weight wk of linear model and initialize in higher dimensional spaces where the other methods have
it randomly. wt = wt ∪ wk . some limitations and this constitutes an interesting direction
10: else for feature work.
11: Update the prototype mj ∗ of the winning cluster A. Experiments with Simulated environments
using Eqs. 17-20. The first series of experiments was made using two well-
12: end if known simulated benchmarks. More specifically, at first we
13: Obtain the new k basis functions as: φj (s, a) = have used the classical Boyan’s chain problem (Fig. 1(a))
K((s, a), mj ), ∀ j = 1, . . . , k. with N = 13 and N = 98 states [3]. In this application
14: Update the environment statistics. states are connected in a chain, where an agent located in
15: Update the model weights wt according to Eq. 24. a state s > 2 can move into states s − 1 and s − 2 with the
16: t=t+1 same probability receiving a reward of −3 (r = −3). On the
17: end while
other hand, from states 2 and 1, there are only deterministic
transitions to states 1 and 0 where the received rewards are
−2 and 0, respectively. Every episode starts at state N and
IV. E XPERIMENTAL RESULTS
terminates at state 0. In the specific domain, both the policy
A series of experiments have been conducted in a variety evaluation problem as well as the problem of discovering the
of simulated benchmarks and real environments, in order to optimal policy have been considered. It must be noted that
study the performance of the proposed model-based approach. both LSTD and online LSPI methods had a number of RBF

715
kernels equal to the number of states with a kernel width 0
equal to 1. Also, we have allowed our method to construct the

Mean Returns of the Last 30 Episodes

same number of clusters, in an attempt to focus our study on
the effect of the clusters’ transition probabilities. The results
are illustrated in Fig. 2 plotting the root mean square error −500

(RMSE) between the true and the estimated value function

(1st problem), as well as the mean returns of the last 100
episodes (2nd problem) per method. As it is obvious, the
−1000
proposed method performs better exploration estimating the
Online EM
optimal value function more accurately than the LSTD. On Online LSPl 15x15
the control problem, both methodologies achieve to discover Online LSPl 20x20
the optimal policies but the proposed method converges in a Online LSPl 25x25
−1500
higher rate than the online LSPI. 0 20 40 60 80 100
Episodes
Another simulated environment used in our experiments is
the Puddle World [17] (Fig. 1(b)), found on the RL-Glue Fig. 3. Comparative results in the simulated environment, Puddle World.
Library1 . The Puddle World is a continuous world with two
oval puddles and the goal is to reach the upper right corner
from any random position, avoiding the two puddles. The
environmental states are 2-dimensional (x and y coordinates)
and it can choose one of four actions that correspond to the
four major compass directions: up, right, left, or right. The
received reward is −1 except for the puddle region where
a penalty between 0 and −40 is received, depending on the
proximity to the middle of the puddle.
In the particular problem, comparison has been made with
the online LSPI method where an equidistant fixed N × N
grid of RBFs is used. In this experiment, we have use three
values of N (15, 20, 25), that corresponds to a set of 900,
1600, 2500 number of basis functions, respectively. On the
other hand, our algorithm constructs approximately 450-550 (a) Pioneer/PeopleBot (b) Stage world
basis functions. The depicted results are illustrated in Fig. 3 Fig. 4. The mobile robot and the 2D-grid map used in our experiments.
that gives the mean number of returns received during the last This is one snapshot from the MobileSim simulator with visualization of the
robot’s laser and sonar range scanners.
30 episodes. The experimental results agrees with our belief
that the number of the basis functions is a very important
issue and that using large number of RBFs does not involve
a better performance. Finally, it becomes apparent that our restrictions, such as the strict battery life, the training of the
method manages to discover the optimal policy. specific methodologies are achieved by using the MobileSim2
simulator. The specific simulator is built on the famous Stage
B. Experiments with Real Environments platform and manages to simulate the real environment with
Experiments have also been conducted in a number of satisfactory precision and realism.
real environments by using the wheeled real mobile robot A grid map (stage world) have been selected during our
platform Pioneer/PeopleBot, shown in Fig. 4(a), based on the experiments, as shown in Fig.4(b). The specific world has been
robust P3-DX base. The robot is equipped with advanced tools designed and edited by using the Mapper toolkit. The objective
for communication and control, like the ARIA (Advanced of the robot in this task is to find a steady landmark (shown
Robot Interface for Applications) library which provides a with a rectangular green box in the map of Fig.4(b)) with the
nice framework for controlling and receiving data from the minimum number of steps, starting from any position in the
MobileRobots platforms. At the same time a plethora of world and performing a finite number of actions. The particular
sensors are included into the specific type of robot, such task is episodic and a new episode starts when one of the
as sonar, laser, bumpers and a pan-tilt-zoom camera. For following incidents comes first: the maximum allowed number
the purposes of our experiments, only the sonar and laser of steps per episode is expired (in our case was set to 100), an
sensors were used in the case of the obstacle avoidance. obstacle is hit, or the target is reached. The state space consists
Furthermore, an embedded motion controller provide at each of two continuous variables: the x and y coordinates which
time step, the robot state such as the robot position (x, y, θ), specify the situation of the robot in the world. At each time
sensors range sensing data, e.t.c. Due to numerous physical step, the robot receives an immediate reward of −1, except in
1 Available at http://library.rl-community.org/wiki 2 more details can be found at http://robots.mobilerobots.com/wiki/MobileSim

716
1
10

Mean Returns of the Last 100 Episodes

Online EM −18
LSTD
−19

0
10 −20
RMSE of all V

−21

−22
−1
10
−23

−24 Online EM
−2
Online LSPl
10 −25
0 200 400 600 800 1000 0 200 400 600 800 1000
Episodes Episodes
(a)13-states chain
1
10 −140

Mean Returns in the Last 100 Episodes

Online EM
LSTD −150

−160
RMSE of all V

−170
0
10 −180

−190

−200

−210 Online EM
−1
Online LSPI
10 −220
0 200 400 600 800 1000 0 200 400 600 800 1000
Episodes Episodes
(b) 98-states chain
Fig. 2. Comparative results on policy evaluation and learned policy in Boyan’s chain domain with 13 and 98 states. Each curve is the average of 30
independent trials.

the case that an obstacle is hit where the received reward is we can observe, the two methodologies tend to find the same
−100. The action space has been discretized into the 8 major policies in most world areas.
compass winds, while the length of each step was set equal
to 1m .
V. C ONCLUSION
The comparative results on the stage world are illustrated In this study we have presented a model-based reinforce-
in Fig. 5. The first plot shows the number of episodes where ment learning scheme for learning optimally in MDPs. The
the robot succeed to find the target. On the other hand, the proposed method is based on the online partitioning of the
second represents the mean returns received by the agent state-action space into clusters and simultaneously construct-
in the last 100 episodes. In the case of LSPI algorithm, an ing a Markov transition matrix by counting the observed
equidistant fixed 10× 10 grid of Gaussian RBFs is used over transitions that each cluster conformation undergoes over time
the state spaces (800 RBFs are used). As it becomes apparent, steps. It is our intention to further pursue and develop the
the proposed methodology achieves to discover an optimal method in two directions: At first we can employ regularized
policy in a much higher rate, while at the same time end up least-squares methods, such as Lasso or Bayesian sparse
to the target in more episodes than the online LSPI. On the methodologies to eliminate the problem of overfitting. Also,
other hand, our method don’t need a so huge number of basis during our experiments we have observed a tendency of our
functions as the online LSPI do, in order to discover an optimal method to produce large number of clusters, some of them
policy. More specifically, the proposed algorithm constructs may become inactivated. In fact, it is possible some of them
approximately 300 − 400 clusters in the world, that specify to become inactivated during the learning process. Thus, a
the basis functions of our model. Finally, Fig. 6 represents mechanism for merging clusters to the body of the online EM
the learned policies of both methods after 500 episodes. As procedure constitutes an interesting direction for future work.

717
450 0

Mean Returns in the Last 100 Episodes

400 −10

350 −20
Successive Episodes

−30
300
−40
250
−50
200
−60
150
−70
100 −80
50 Online EM −90 Online EM
Online LSPl Online LSPl
0 −100
0 100 200 300 400 500 0 100 200 300 400 500
Episodes Episodes

(a) (b)
Fig. 5. Comparative results in the test world of the real environment.

(a) Online EM (b) Online LSPI

Fig. 6. Learned policies by both comparative methods in the case of test world.

R EFERENCES [10] S. Mahadevan and M. Maggioni, “Proto-value Functions: A Laplacian

Framework for Learning Representation and Control in Markov Decision
[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. MIT Processes,” Journal of Machine Learning Research, vol. 8, pp. 2169–
Press Cambridge, USA, 1998. 2231, 2007.
[2] L. Buşoniu, R. Babuška, B. De Schutter, and D. Ernst, Reinforcement [11] I. Menache, S. Mannor, and N. Shimkin, “Basis Function Adaptation
Learning and Dynamic Programming Using Function Approximators. in Temporal Difference Reinforcement Learning,” Annals of Operations
CRC Press, 2010. Research, vol. 134, pp. 215–238, 2005.
[12] M. Petrik, “An analysis of laplacian methods for value function ap-
[3] J. A. Boyan, “Technical update: Least-squares temporal difference
proximation in mdps,” in International Joint Conference on Artificial
learning,” Machine Learning, pp. 233–246, 2002.
Intelligence, 2007, pp. 2574–2579.
[4] M. G. Lagoudakis and R. Parr, “Least-squares policy iteration,” Journal
[13] A. Dempster, N. Laird, and D. Rubin, “Maximum Likelihood from
of Machine Learning Research, vol. 4, pp. 1107–1149, 2003.
incomplete data via the EM algorithm,” J. Roy. Statist. Soc. B, vol. 39,
[5] L. Buşoniu, D. Ernst, B. De Schutter, and R. Babuška, “Online least- pp. 1–38, 1977.
squares policy iteration for reinforcement learning control,” in Proceed- [14] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
ings of the 2010 American Control Conference, 2010, pp. 486–491. [15] R. Neal and G. Hinton, “A view of the em algorithm that justifies
[6] G. Taylor and R. Parr, “Kernelized value function approximation for incremental, sparse, and other variants,” in NATO Adv. Study Inst. on
reinforcement learning,” in International Conference on Machine Learn- Learn. Graph. Models, 1998, pp. 355–368.
ing, 2009, pp. 1017–1024. [16] B. R. Leffler, M. L. Littman, and T. Edmunds, “Efficient reinforcement
[7] X. Xu, H. Hu, and B. Dai, “Adaptive sample collection using active learning with relocatable action models,” in Proceedings of the Twenty-
learning for kernel-based approximate policy iteration,” in Proceedings Second Conference on Artificial Intelligence, 2007, pp. 572–577.
of the Adaptive Dynamic Programming and Reinforcement Learning, [17] R. S. Sutton, “Generalization in reinforcement learning: Successful ex-
2011, pp. 56–61. amples using sparse coarse coding,” in Advances in Neural Information
[8] Y. Engel, S. Mannor, and R. Meir, “Reinforcement learning with Processing Systems 8. MIT Press, 1996, pp. 1038–1044.
gaussian process,” in International Conference on Machine Learning,
2005, pp. 201–208.
[9] G. Konidaris, S. Osentoski, and P. Thomas, “Value function approxima-
tion in reinforcement learning using the fourier basis,” in AAAI Conf.
on Artificial Intelligence, 2011, pp. 380–385.

718

ProfEd221 - Unit 5 - Feedbacking and Communicating Assessment Results PDF
100% (4)
ProfEd221 - Unit 5 - Feedbacking and Communicating Assessment Results PDF
12 pages
DLMAIRIL01 Q4-2024 Session4
No ratings yet
DLMAIRIL01 Q4-2024 Session4
80 pages
Offline Reinforcement Learning With Instrumental Variables in Confounded Markov Decision Processes
No ratings yet
Offline Reinforcement Learning With Instrumental Variables in Confounded Markov Decision Processes
71 pages
Sample-Efficient Reinforcement Learning in The Presence of Exogenous Information
No ratings yet
Sample-Efficient Reinforcement Learning in The Presence of Exogenous Information
56 pages
Effective Reinforcement Learning Based On Structural Information Principles
No ratings yet
Effective Reinforcement Learning Based On Structural Information Principles
47 pages
Markov Decicion
No ratings yet
Markov Decicion
40 pages
Reinforcement Learning: Karan Kathpalia
No ratings yet
Reinforcement Learning: Karan Kathpalia
80 pages
Reinforcement Learning I
No ratings yet
Reinforcement Learning I
85 pages
Tut21 RL
No ratings yet
Tut21 RL
101 pages
Action Robust Reinforcement Learning and Applications in Continuous Control
No ratings yet
Action Robust Reinforcement Learning and Applications in Continuous Control
10 pages
Artificial Intelligence: Lecture 10 - Reinforcement Learning Prof. Shivanjali Khare
No ratings yet
Artificial Intelligence: Lecture 10 - Reinforcement Learning Prof. Shivanjali Khare
45 pages
L13 Reinforcement Learning
No ratings yet
L13 Reinforcement Learning
57 pages
Model-Based Policy Optimization With Unsupervised Model Adaptation
No ratings yet
Model-Based Policy Optimization With Unsupervised Model Adaptation
17 pages
Constrained Policy Opt
No ratings yet
Constrained Policy Opt
18 pages
Notes
No ratings yet
Notes
6 pages
Lecture 5 - ModelFreePrediction
No ratings yet
Lecture 5 - ModelFreePrediction
79 pages
RL Test Leif
No ratings yet
RL Test Leif
163 pages
A Crash Course On Reinforcement Learning - Felix Wagner
No ratings yet
A Crash Course On Reinforcement Learning - Felix Wagner
84 pages
Discuss About Temporal Difference in Reinforcement Learning?
No ratings yet
Discuss About Temporal Difference in Reinforcement Learning?
9 pages
Unit 5 Reinforcement Learning Notes
No ratings yet
Unit 5 Reinforcement Learning Notes
20 pages
Experiment 4
No ratings yet
Experiment 4
7 pages
Unit-5 Ai
No ratings yet
Unit-5 Ai
19 pages
Lecture 12 Slides - After
No ratings yet
Lecture 12 Slides - After
50 pages
An Application of Inverse Reinforcement Learning To Medical Records of Diabetes Treatment
No ratings yet
An Application of Inverse Reinforcement Learning To Medical Records of Diabetes Treatment
8 pages
Algorithms For Reinforcement Learning - Szepesvari
No ratings yet
Algorithms For Reinforcement Learning - Szepesvari
98 pages
An Introduction To Reinforcement Learning From Theory To Algorithms (December 19, 2024) - Joon Kwon
No ratings yet
An Introduction To Reinforcement Learning From Theory To Algorithms (December 19, 2024) - Joon Kwon
66 pages
19 - Monte Carlo and Temporal Difference For Markov Decision Processes
No ratings yet
19 - Monte Carlo and Temporal Difference For Markov Decision Processes
57 pages
Team 10 Primer
No ratings yet
Team 10 Primer
12 pages
Variational Methods For Reinforced Learning
No ratings yet
Variational Methods For Reinforced Learning
8 pages
Approximate Dynamic Programming - II: Algorithms: Warren B. Powell
No ratings yet
Approximate Dynamic Programming - II: Algorithms: Warren B. Powell
22 pages
RLAlgs in MDPs
No ratings yet
RLAlgs in MDPs
98 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
28 pages
Subtitle
No ratings yet
Subtitle
2 pages
1、Reinforcement Learning With Gaussian Processes (2005)
No ratings yet
1、Reinforcement Learning With Gaussian Processes (2005)
8 pages
ML Unit-4 - RTU
No ratings yet
ML Unit-4 - RTU
18 pages
11-DL-Deep Learning For Reinforcement Learning
No ratings yet
11-DL-Deep Learning For Reinforcement Learning
47 pages
Lecture#5 Monte Carlo Methods Part I
No ratings yet
Lecture#5 Monte Carlo Methods Part I
28 pages
Serge Levine Course Introduction To Reinforcement Learning 3: RL Introduction
No ratings yet
Serge Levine Course Introduction To Reinforcement Learning 3: RL Introduction
46 pages
Theory of Elasticity
No ratings yet
Theory of Elasticity
4 pages
Reinforcement Learning (Part 2) : Nguyen Do Van, PHD
No ratings yet
Reinforcement Learning (Part 2) : Nguyen Do Van, PHD
46 pages
Literary Voice - March 2021
No ratings yet
Literary Voice - March 2021
372 pages
ReadyIAS AW Toolkit
No ratings yet
ReadyIAS AW Toolkit
41 pages
Lecture 30 Reinforcement-Learning
No ratings yet
Lecture 30 Reinforcement-Learning
50 pages
CS229
No ratings yet
CS229
17 pages
Experiment 3
No ratings yet
Experiment 3
6 pages
Add-On DRL CS06
No ratings yet
Add-On DRL CS06
23 pages
Cs229-Notes12 Reinforcement in Control
No ratings yet
Cs229-Notes12 Reinforcement in Control
17 pages
GDD Nonlinear NIPS 2009 Convergent Temporal Difference Learning With Arbitrary Smooth Function Approximation
No ratings yet
GDD Nonlinear NIPS 2009 Convergent Temporal Difference Learning With Arbitrary Smooth Function Approximation
9 pages
Bridging The Gap Between Value and Policy Based Reinforcement Learning
No ratings yet
Bridging The Gap Between Value and Policy Based Reinforcement Learning
21 pages
Reinforcement Learning Cheatsheet
No ratings yet
Reinforcement Learning Cheatsheet
16 pages
5385 Representation Driven Reinforc
No ratings yet
5385 Representation Driven Reinforc
16 pages
Algorithms For Reinforced Learning
No ratings yet
Algorithms For Reinforced Learning
98 pages
Tri-Tue-Nhan-Tao - Nathan-Lambert - Lec13 - 6up-Reinforcement-Learning - (Cuuduongthancong - Com)
No ratings yet
Tri-Tue-Nhan-Tao - Nathan-Lambert - Lec13 - 6up-Reinforcement-Learning - (Cuuduongthancong - Com)
8 pages
RL 10 QUESTIONS FOR MID II Scheme of Evaluvation
No ratings yet
RL 10 QUESTIONS FOR MID II Scheme of Evaluvation
15 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
46 pages
Stochastic Process - Markov Property - Markov Chain - Markov Decision Process - Reinforcement Learning - RL Techniques - Example Applications
No ratings yet
Stochastic Process - Markov Property - Markov Chain - Markov Decision Process - Reinforcement Learning - RL Techniques - Example Applications
39 pages
Fundamentals of Aerodynamits: MC Graw Hill
No ratings yet
Fundamentals of Aerodynamits: MC Graw Hill
9 pages
Reinforcement Learning in A Nutshell
No ratings yet
Reinforcement Learning in A Nutshell
12 pages
Sistema de Frenos Freight m12
No ratings yet
Sistema de Frenos Freight m12
457 pages
SP14 CS188 Lecture 10 - Reinforcement Learning I
No ratings yet
SP14 CS188 Lecture 10 - Reinforcement Learning I
35 pages
Reinforcement Learning As Classification: Leveraging Modern Classifiers
No ratings yet
Reinforcement Learning As Classification: Leveraging Modern Classifiers
8 pages
Finalworm 160204043543
No ratings yet
Finalworm 160204043543
20 pages
Reinforcement Learning and Control: CS229 Lecture Notes
No ratings yet
Reinforcement Learning and Control: CS229 Lecture Notes
7 pages
Jurnal Manajemen Strategi Agribisnis Jessica Halaman 74 - 87
No ratings yet
Jurnal Manajemen Strategi Agribisnis Jessica Halaman 74 - 87
46 pages
BQ1031 Exercises
No ratings yet
BQ1031 Exercises
90 pages
Active Learning For Reward Estimation in Inverse Reinforcement Learning
No ratings yet
Active Learning For Reward Estimation in Inverse Reinforcement Learning
16 pages
Install
No ratings yet
Install
3 pages
Reinforcement Learning and Control: CS229 Lecture Notes
No ratings yet
Reinforcement Learning and Control: CS229 Lecture Notes
15 pages
A Brief Introduction To Reinforcement Learning
No ratings yet
A Brief Introduction To Reinforcement Learning
4 pages
Installation Instructions: Diesel/Alternator Tachometer 3-3/8" & 5"
No ratings yet
Installation Instructions: Diesel/Alternator Tachometer 3-3/8" & 5"
2 pages
Thesis Final Kajal Pourjalil
No ratings yet
Thesis Final Kajal Pourjalil
58 pages
Markov Decision Processes & Reinforcement Learning: Megan Smith Lehigh University, Fall 2006
No ratings yet
Markov Decision Processes & Reinforcement Learning: Megan Smith Lehigh University, Fall 2006
40 pages
AX Series Hanyoung Brochure
No ratings yet
AX Series Hanyoung Brochure
6 pages
Revision For Mid Term Test
No ratings yet
Revision For Mid Term Test
7 pages
MBA Marketing Research Project Guidelines
No ratings yet
MBA Marketing Research Project Guidelines
7 pages
Ayitenew Determinantsof Internal Audit Effectiveness Evidencefrom Gurage Zone
No ratings yet
Ayitenew Determinantsof Internal Audit Effectiveness Evidencefrom Gurage Zone
12 pages
Omer Bakis Written
No ratings yet
Omer Bakis Written
19 pages
Guiding Principle:: Title: Training Guide For Dcws On Self Help Assessment
No ratings yet
Guiding Principle:: Title: Training Guide For Dcws On Self Help Assessment
33 pages
Physical Science - q4 - Slm13-Pages-Deleted
No ratings yet
Physical Science - q4 - Slm13-Pages-Deleted
5 pages
Cambridge International AS & A Level: Thinking Skills 9694/13
No ratings yet
Cambridge International AS & A Level: Thinking Skills 9694/13
9 pages
First Public Dataset To Study 2023 Turkish General Election
No ratings yet
First Public Dataset To Study 2023 Turkish General Election
12 pages
Ece 34 - Microprocessor System Project
No ratings yet
Ece 34 - Microprocessor System Project
3 pages
RSA Projects Overview
100% (2)
RSA Projects Overview
7 pages
Solar Si ANN
No ratings yet
Solar Si ANN
18 pages
1 s2.0 S095741742401604X Main
No ratings yet
1 s2.0 S095741742401604X Main
12 pages
Unit 8
No ratings yet
Unit 8
9 pages
Case Study: Gates Corporation Blueprint
No ratings yet
Case Study: Gates Corporation Blueprint
2 pages
Premier League Data - Activity Questions: Part A: Sorting and Filtering
No ratings yet
Premier League Data - Activity Questions: Part A: Sorting and Filtering
3 pages
Leave Application For The Death in The Family
No ratings yet
Leave Application For The Death in The Family
1 page
English Formulas (Elements For Composing Written Responses)
No ratings yet
English Formulas (Elements For Composing Written Responses)
2 pages
Reflective Essay
No ratings yet
Reflective Essay
4 pages
Laporan Daftar Pengguna GoodEva SmartSafety - Batch 1
No ratings yet
Laporan Daftar Pengguna GoodEva SmartSafety - Batch 1
3 pages
Admnadvt
No ratings yet
Admnadvt
2 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

ICTAI12

Uploaded by

ICTAI12

Uploaded by

2012 IEEE 24th International Conference on Tools with Artificial Intelligence

A model-based reinforcement learning approach

1082-3409/12 $26.00 © 2012 IEEE 712

Also, the matrix Hk contains the coefﬁcients of Eq. 21, i.e. at

Mean Returns of the Last 30 Episodes

(RMSE) between the true and the estimated value function

Mean Returns of the Last 100 Episodes

Mean Returns in the Last 100 Episodes

Mean Returns in the Last 100 Episodes

(a) Online EM (b) Online LSPI

R EFERENCES [10] S. Mahadevan and M. Maggioni, “Proto-value Functions: A Laplacian

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.