0% found this document useful (0 votes)

43 views9 pages

A Pac RL Algorithm For Episodic Pomdps

Uploaded by

data science

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

43 views9 pages

A Pac RL Algorithm For Episodic Pomdps

Uploaded by

data science

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 9

A PAC RL Algorithm for Episodic POMDPs

Zhaohan Daniel Guo Shayan Doroudi Emma Brunskill

Carnegie Mellon University Carnegie Mellon University Carnegie Mellon University
5000 Forbes Ave 5000 Forbes Ave 5000 Forbes Ave
Pittsburgh PA 15213, USA Pittsburgh PA 15213, USA Pittsburgh PA 15213, USA

Abstract is said to be PAC if with high probability, it selects

a near-optimal action on all but a number of steps
Many interesting real world domains involve (the sample complexity) which is a polynomial func-
reinforcement learning (RL) in partially ob- tion of the problem parameters. There has been sub-
servable environments. Efficient learning stantial progress on PAC RL for the fully observable
in such domains is important, but existing setting [Brafman and Tennenholtz, 2003, Strehl and
sample complexity bounds for partially ob- Littman, 2005, Kakade, 2003, Strehl et al., 2012, Lat-
servable RL are at least exponential in the timore and Hutter, 2012], but to our knowledge there
episode length. We give, to our knowledge, exists no published work on PAC RL algorithms for
the first partially observable RL algorithm partially observable settings.
with a polynomial bound on the number of This lack of work on PAC partially observable RL is
episodes on which the algorithm may not perhaps because of the additional challenge introduced
achieve near-optimal performance. Our al- by the partial observability of the environment. In
gorithm is suitable for an important class of fully observable settings, the world is often assumed
episodic POMDPs. Our approach builds on to behave as a Markov decision process (MDP). An
recent advances in method of moments for elegant approach for proving that a RL algorithm for
latent variable model estimation. MDPs is PAC is to compute finite sample error bounds
on the MDP parameters. However, because the states
of a partially observable MDP (POMDP) are hidden,
1 INTRODUCTION the naive approach of directly treating the POMDP as
a history-based MDP yields a state space that grows
A key challenge in artificial intelligence is how to ef- exponentially with the horizon, rather than polyno-
fectively learn to make a sequence of good decisions mial in all POMDP parameters [Even-Dar et al., 2005].
in stochastic, unknown environments. Reinforcement
On the other hand, there has been substantial recent
learning (RL) is a subfield specifically focused on how
interest and progress on method of moments and spec-
agents can learn to make good decisions given feedback
tral approaches for modeling partially observable sys-
in the form of a reward signal. In many important ap-
tems [Anandkumar et al., 2012, 2014, Hsu et al., 2008,
plications such as robotics, education, and healthcare,
Littman et al., 2001, Boots et al., 2011]. The majority
the agent cannot directly observe the state of the envi-
of this work has focused on inference and prediction,
ronment responsible for generating the reward signal,
with little work tackling the control setting. Method of
and instead only receives incomplete or noisy observa-
moments approaches to latent variable estimation are
tions.
of particular interest because for a number of models
One important measure of an RL algorithm is its sam- they obtain global optima and provide finite sample
ple efficiency: how much data/experience is needed guarantees on the accuracy of the learned model pa-
to compute a good policy and act well. One way to rameters.
measure sample complexity is given by the Probably
Inspired by the this work, we propose a POMDP RL
Approximately Correct framework; an RL algorithm
algorithm that is, to our knowledge, the first PAC
POMDP RL algorithm for episodic domains (with no
Appearing in Proceedings of the 19th International Con-
ference on Artificial Intelligence and Statistics (AISTATS) restriction on the policy class). Our algorithm is ap-
2016, Cadiz, Spain. JMLR: W&CP volume 51. Copyright plicable to a restricted but important class of POMDP
2016 by the authors. settings, which include but are not limited to informa-
A PAC RL Algorithm for Episodic POMDPs

tion gathering POMDP RL domains such as preference sample guarantees. MLE approaches for estimating
elicitation [Boutilier, 2002], dialogue management slot- HMMs [Abe and Warmuth, 1992] also unfortunately
filling domains [Ko et al., 2010], and medical diagnosis do not provide accuracy guarantees on the estimated
before decision making [Amato and Brunskill, 2012]. HMM parameters. As POMDP planning methods typ-
Our work builds on method of moments inference tech- ically require us to have estimates of the underlying
niques, but requires several non-trivial extensions to POMDP parameters, it would be difficult to use such
tackle the control setting. In particular, there is a MLE methods for computing a POMDP policy and
subtle issue of latent state alignment: if the models for providing a finite sample guarantee1 .
each action are learned as independent hidden Markov
Aside from the MoM method in Anandkumar et al.
models (HMMs), then it is unclear how to solve the
[2012], another popular spectral method involves us-
correspondence issue across latent states, which is es-
ing Predictive State Representations (PSRs) [Littman
sential for performing planning and selecting actions.
et al., 2001, Boots et al., 2011], to directly tackle the
Our primary contribution is to provide a theoretical
control setting; however it only has asymptotic conver-
analysis of our proposed algorithm, and prove that it
gence guarantees and no finite sample analysis. There
is possible to obtain near-optimal performance on all
is also another method of moments approach to trans-
but a number of episodes that scales as a polynomial
fer across a set of bandits tasks, but the latent variable
function of the POMDP parameters. Similar to most
estimation problem is substantially simplified because
fully observable PAC RL algorithms, directly instan-
the state of the system is unchanged by the selected
tiating our bounds would yield an impractical number
actions [Azar et al., 2013].
of samples for a real application. Nevertheless, we be-
lieve understanding the sample complexity may help Fortunately, due to the polynomial finite sample
to guide the amount of data required for a task, and bounds from MoM, we can achieve a PAC (polyno-
also similar to PAC MDP RL work, may motivate new mial) sample complexity bound for POMDPs.
practical algorithms that build on these ideas.
3 PROBLEM SETTING
2 BACKGROUND AND RELATED
WORK We consider a partially observable Markov decision
process (POMDP) which is described as the tuple
The inspiration for pursuing PAC bounds for (S, A, R, T, Z, b, H) where we have a set of discrete
POMDPs came about from the success of PAC bounds states S, discrete actions A, discrete observations Z,
for MDPs [Brafman and Tennenholtz, 2003, Strehl discrete rewards R, initial belief b (more details be-
and Littman, 2005, Kakade, 2003, Strehl et al., 2012, low), and episode length H. The transition model is
Lattimore and Hutter, 2012]. While algorithms represented by a set of |A| matrices Ta (i, j) : |S| × |S|
have been developed for POMDPs with finite sample where the (i, j)-th entry is the probability of transi-
bounds [Peshkin and Mukherjee, 2001, Even-Dar et al., tioning from si to sj under action a. With a slight
2005], unfortunately these bounds are not PAC as they abuse of notation, we use Z to denote both the finite
have an exponential dependence on the horizon length. set of observations and the observation model captured
by the set of |A| observation matrices, Za where the
Alternatively, Bayesian methods [Ross et al., 2011, (i, j)-th entry represents the probability of observing
Doshi-Velez, 2012] are very popular for solving zi given the agent took action a and transitioned to
POMDPs. For MDPs, there exist Bayesian meth- state sj . We similarly do a slight abuse of notation
ods that have PAC bounds [Kolter and Ng, 2009, As- and let R denote both the finite set of rewards, and
muth et al., 2009]; however there have been no PAC the reward matrices Ra where the (i, j)-th entry in a
bounds for Bayesian methods for POMDPs. That matrix denotes the probability of obtaining reward ri
said, Bayesian methods are optimal in the Bayesian
sense of making the best decision given the posterior 1
Abe and Warmuth [1992]’s MLE approach guaran-
over all possible future observations, which does not tees that the estimated probability over H-length obser-
translate to a frequentist finite sample bound. vation sequences has a bounded KL-divergence from the
true probability of the sequence under the true parameters,
We build on method of moments (MoM) work for es- which is expressed as a function of the number of underly-
timating HMMs [Anandkumar et al., 2012] in order to ing data samples used to estimate the HMM parameters.
provide a finite sample bound for POMDPs. MoM is We think it may be possible to use such estimates in the
able to obtain a global optimum, and has finite sample control setting when modeling hidden state control systems
as PSRs, and employing a forward search approach to plan-
bounds on the accuracy of their estimates, unlike the ning; however, there remain a number of subtle issues to
popular Expectation-Maximization (EM) that is only address to ensure such an approach is viable and we leave
guaranteed to find a local optima, and offers no finite this as an interesting direction for future work.
Zhaohan Daniel Guo, Shayan Doroudi, Emma Brunskill

when taking action a in state sj . Note that in our is similar to a mixing assumption and is necessary in
setting we also treat the reward as an additional ob- order for MoM to estimate dynamics for all states.
servation2 . Assumption 3 is necessary for MoM to uniquely deter-
mine the transition, observation, and reward dynam-
The objective in POMDP planning is to compute a
ics. The second assumption may sound quite strong,
policy π that achieves a large expected sum of future
as in some POMDP settings states are only reachable
rewards, where π is a mapping from histories of prior
by a complex sequence of carefully chosen actions, such
sequences of actions, observations, and rewards, to ac-
as in robotic navigation or video games. However, as-
tions. In many cases we capture prior histories using
sumption 2 is commonly satisfied in many important
a sufficient statistic called the belief b where b(s) rep-
POMDP settings that primarily involve information
resents the probability of being in a particular state s
gathering. For example, in preference elicitation or
given the prior history of actions, observations and re-
user modeling, POMDPs are commonly used to iden-
wards. One popular method for POMDP planning in-
tify the, typically static, hidden intent or preference or
volves representing the value function by a finite set of
state of the user, before taking some action based on
α-vectors, where α(s) represents the expected sum of
the resulting information [Boutilier, 2002]. Examples
future rewards of following the policy associated with
of this include dialog systems [Ko et al., 2010], medical
the α-vector from initial state s. POMDP planning
diagnosis and decision support [Amato and Brunskill,
then proceeds by taking the first action associated with
2012], and even human-robot collaboration preference
the policy of the α-vector which yields the maximum
modeling [Nikolaidis et al., 2015]. In such settings, the
expected value for the current belief state, which can
belief commonly starts out non-zero over all possible
be computed for a particular α-vector using the dot
user states, and slowly gets narrowed down over time.
product hb, αi.
The third assumption is also significant, but is still sat-
In the reinforcement learning setting, the transition, isfied by an important class of problems that overlap
observation, and/or reward model parameters are ini- with the settings captured by assumption 2. Informa-
tially unknown. The goal is to learn a policy that tion gathering POMDPs where the state is hidden but
achieves large sum of rewards in the environment with- static automatically satisfy the full rank assumption
out advance knowledge of how the world works. on the transition model, since it is an identity matrix.
Assumption 3 on the observation and reward matrices
We make the following assumptions about the domain
imply that the cardinality of the set of observations
and problem setting:
(and rewards) is at least as large as the size of the
state space. A similar assumption has been made in
1. We consider episodic, finite horizon partially ob- many latent variable estimation settings (e.g. [Anand-
servable RL (PORL) settings kumar et al., 2012, 2014, Song et al., 2010]) including
in the control setting [Boots et al., 2011]. Indeed, when
2. It is possible to achieve a non-zero probability of
the observations consist of videos, images or audio sig-
being in any state in two steps from the initial
nals, this assumption is typically satisfied [Boots et al.,
belief.
2011], and such signals are very common in dialog
3. For each action a, the transition matrix Ta is full systems and the user intent and modeling situations
rank, and the observation matrix Za and reward covered by assumption 2. Satisfying that the reward
matrix Ra are full column rank. matrix has full rank is typically trivial as the reward
signal is often obtained by discretizing a real-valued
reward. Therefore, while we readily acknowledge that
The first assumption on the setting is satisfied by many
our setting does not cover all generic POMDP rein-
real world situations involving an agent repeatedly do-
forcement learning settings, we believe it does cover
ing a task: for example, an agent may sequentially in-
an important class of problems that are relevant to
teract with many different customers each for a finite
real applications.
amount of time. The key restrictions on the setting
are captured in assumptions 2 and 3. Assumption 2
2 4 ALGORITHM
In planning problems the reward is typically a real-
valued scalar, but in PORL we must learn the reward
model. This requires assuming some mapping between Our goal is to create an algorithm that can achieve
states and rewards. For simplicity we assume multinomial near optimal performance from the initial belief on
distribution over a discrete set of rewards. Note that we each episode. Prior work has shown that the error
can always discretized a real-valued reward into a finite set
of values with bounded error on the resulting value func- in the POMDP value function is bounded when us-
tion estimates, and our choice makes very little restrictions ing model parameter estimates that themselves have
on the underlying setting. bounded error [Ross et al., 2009, Fard et al., 2008];
A PAC RL Algorithm for Episodic POMDPs

Algorithm 1: EEPORL
input: S, A, Z, R, H, N, c, πrest
1 Let πexplore be the policy where a1 , a2 are
uniformly random, and
1
p(at+2 |at ) = 1+c|A| (I + c1|A|×|A| ) ;
2 X ←∅ ;
// Phase 1:
3 for episode i ← 1 to N do Figure 1: POMDP (left) analogous to induced HMM
4 Follow πexplore for 4 steps ; (right). Gray nodes show fully observed variables,
5 Let xt = (at , rt , zt , at+1 ) ; whereas white nodes show latent states.
6 X ← X ∪ {(x1 , x2 , x3 )} ;
7 Execute πrest for the rest of the steps ; In contrast, many PAC RL algorithms for MDPs
// Phase 2: have shown that exploration is critical in order to get
8 Get Tb, O,
b w b for the induced HM M from X enough data to estimate the model parameters. How-
through our extended MoM method ; ever in MDPs, algorithms can directly observe how
many times every action has been tried in every state,
9 Using the labeling from Algorithm 2 with O, b
and can use this information to steer exploration to-
compute estimated POMDP parameters.;
wards less explored areas. In partially observable set-
10 Call Algorithm 3 with estimated POMDP
tings it is more challenging, as the state itself is hid-
parameters to estimate a near optimal policy πb
den, and so it is not possible to directly observe the
;
number of times an action has been tried in a latent
11 Execute π b for the rest of the episodes ;
state. Fortunately, recent advances in method of mo-
ments (MoM) estimation procedures for latent variable
Algorithm 2: LabelActions estimation (see e.g. [Anandkumar et al., 2012, 2014])
have demonstrated that in certain uncontrolled set-
input: Ob
tings, including many types of hidden Markov models
1 foreach column i of O b do
(HMMs), it is still possible to achieve accuracy esti-
Find a row j such that O(i, b j) ≥ 2
2 3|R||Z| ; mates of the underlying latent variable model param-
3 Let the observation associated with row j be eters as a function of the amount of data samples used
(a, r0 , z 0 , a0 ), label column i with (a, a0 ) ; to perform the estimation. For some intuition about
this, consider starting in a belief state b which has
non-zero probability over all possible states. If one
Algorithm 3: FindPolicy can repeatedly take the same action a from same be-
input: bb(s(a0 ,a1 ) ), pb(z|a, s(a,a0 ) ), pb(r|s(a,a0 ) , a0 ), lief b, given a sufficient number of samples, we will
pb(s(a0 ,a00 ) |s(a,a0 ) , a0 ) have actually taken action a in each state many times
a ,a (even if we don’t know the specific instances on which
1 ∀a− , a ∈ A, Γ1 − = {βb1a (s(a− ,a) )} ;
action a was taken in a state s).
2 for t ← 2 to H do
0
3 ∀a, a0 ∈ A, Γa,a =∅; The control setting is more subtle than the uncon-
t
4 for a, a0 ∈ A do trolled setting which has been the focus of the ma-
a,a0 jority of recent MoM spectral learning research, be-
5 for ft (r, z) ∈ (|R| × |Z| → Γt−1 ) do
cause we wish to estimate not just the transition and
// all mappings from an
observation models of a HMM, but to estimate the
observation pair to a
POMDP model parameters. Our ultimate interest is
previous β-vector
a ,a in being able to select good actions. A naive approach
6 ∀a− ∈ A, Γt − =
a ,a is to independently learn the transition, observation,
Γt − ∪ {βta,ft (s(a− ,a) )} ; and reward parameters for each separate action, by re-
7 Return arg maxa ,a ,β (s
0 1 H (a ,a ) ))∈ΓH
b · βH ) ;
a0 ,a1 (b
stricting the POMDP to only execute a single action,
0 1
thereby turning the POMDP into an HMM. However,
this simple aproach fails because the returned parame-
however, this work takes a sensitivity analysis perspec- ters can correspond to a different labeling of the hidden
tive, and does not address how such model estimation states. For example, the first column of the transition
errors themselves could be computed or bounded.3 matrix for action a1 may actually correspond to the
state s2 , while the first column of the transition ma-
3
Fard et al. [2008] assume that labels of the hidden
states are provided, which removes the need for latent variable estimation.
Zhaohan Daniel Guo, Shayan Doroudi, Emma Brunskill

1
trix for action a2 may truly correspond to s5 . We 1+c|A| (I + c1|A|×|A| ) where c can be any positive real
require that the labeling must be consistent for all ac- number. For our proof, we pick c = O(1/|A|). Note
tions since we wish to compute what happens when that πexplore only depends on previous actions and not
different actions are executed consecutively. An un- on any observations. The definition of p(at+2 |at ) for
satisfactory way to match up the labels for different what will work for the proof only requires it to be
actions is by requiring that the initial belief state have full-rank and having some minimum probability over
probabilities that are unique and well separated per all actions. We chose a perturbed identity matrix for
state. Then we can use the estimated initial belief simplicity. Since πexplore is a fixed policy, the POMDP
from each action to match up the labels. However, process reduces to a HMM for these first four steps.
this is a very strong assumption on the starting belief During these steps we store the observed experience
state which is unlikely to be realized. as (x1 , x2 , x3 ), where xt = (at , rt , zt , at+1 ) is an obser-
To address this challenge of mismatched labels, we vation of our previously defined induced HMM. The
transform our POMDP into an induced HMM (see algorithm then follows policy πrest for the remaining
Figure 1) by fixing the policy to πexplore (for a few steps of the episode. All of these episodes will be con-
steps, during a certain number of episodes), and create sidered as potentially non-optimal, and so the choice
an alternate hidden state representation that directly of πrest does not impact the theoretical analysis. How-
solves the problem of alignment of hidden states across ever, empirically πrest could be constructed to encour-
actions. Specifically, we make the hidden state at time age near optimal behavior given the observed data col-
t of the induced HMM, denoted by ht , equal to the tu- lected up to the current episode.
ple of the action at time step t, the next state, and the
subsequent action, ht = (at , st+1 , at+1 ). We denote 4.2 Parameter Estimation
the observations of the induced HMM by x, and the
observation associated with a hidden state ht is the After Phase 1 completes, we have N samples of the
tuple xt = (at , rt , zt , at+1 ). Figure 1 shows how the tuple (x1 , x2 , x3 ). We then apply our extension to
graphical model of our original POMDP is related to the MoM algorithm for HMM parameter estimation by
the graphical model of the induced HMM. In making Anandkumar et al. [2012]. Our extension computes es-
this transformation, our resulting HMM still satisfies timates and bounds on the transition model Tb which is
the Markov assumption: the next state is only a func- not computed in the original method. To summarize,
tion of the prior state, and the observation is only a this procedure yields an estimated transition matrix
function of the current state. But, this transformation Tb, observation matrix O,b and belief vector w b for the
also has the desired property that it is now possible to induced HMM. The belief w b is over the second hidden
directly align the identity of states across selected ac- state, h2 .
tions. This is because HMM parameters now depend As mentioned before as one major challenge, label-
on both state and action, so there is a built-in corre- ing of the states h of the induced HMM is arbi-
lation between different actions. We will discuss this trary; however it is consistent between Tb, O, b w
b since
more in the theoretical analysis. this is a single HMM inference problem. Recall that
We are now ready to describe our algorithm for a hidden state in our induced HMM is defined as
episodic finite horizon reinforcement learning in ht = (at , st+1 , at+1 ). Since the actions are fully ob-
POMDPs, EEPORL (Explore then Exploit Partially servable, it is possible to label each state h = (a, s0 , a0 )
Observable RL, which is shown in Algorithm 1). Our (i.e. the columns of O, b the rows and columns of Tb, and
algorithm is model-based and proceeds in two phases. the rows of w)b with two actions (a, a0 ) that are associ-
In the first phase, it performs exploration to collect ated with that state. This is possible because the true
samples of trying different actions in different (latent) observation matrix entries for the actions of a hidden
states. After the first phase completes, we extend a state must be non-zero, and the true value of all other
MoM approach [Anandkumar et al., 2012] to compute entries (for other actions) must be zero; therefore, as
estimates of the induced HMM parameters. We use long as we have sufficiently accurate estimates of the
these estimates to obtain a near-optimal policy. observation matrix, we can use the observation matrix
parameters to augment the states h with their associ-
ated action pair. This procedure is performed by Algo-
4.1 Phase 1 rithm 2. This labeling provides a connection between
the HMM state h and the original POMDP state. For
The first phase consists of the first N episodes. Let a particular pair of actions a, a0 , there are exactly |S|
πexplore be a fixed open-loop policy for the first four HMM states that correspond to them. Thus looking at
actions of an episode. In πexplore actions a1 , a2 the columns of O b from left-to-right, and only picking
are selected uniformly at random, and p(at+2 |at ) = out the columns that are labeled with a, a0 results in a
A PAC RL Algorithm for Episodic POMDPs

specific ordering of the states (a, ·, a0 ), which is a per- belief) and then following the associated policy π
b. π
b is
mutation of the POMDP states, which we denote as then followed for the entire episode with no additional
{s(a,a0 ),1 , s(a,a0 ),2 , . . . , s(a,a0 ),|S| }. We will also use the belief updating required as the policy itself encodes
notation s(a,a0 ) to implicitly refer to a vector of states the conditional branching.
in the order of the permutation.
However, in practical circumstances, it will not be pos-
The algorithm proceeds to estimate the original sible to enumerate all possible H-step policies. In
POMDP parameters in order to perform planning and this case, one can use point-based approaches or other
compute a policy. Note that the estimated parameters methods that use α-vectors to enumerate only a sub-
use the computed s(a,a0 ) permutations of the state. Let set of possible policies. In this case there will be an
Ob a,a0 be the submatrix where the rows and columns additional error planning in the final error bound due
0 00
correspond to the actions (a, a0 ) and Tba,a ,a be the to finite set of policies considered. In our analysis we
submatrix where the rows correspond to the actions omit planning for simplicity and assume that we enu-
(a0 , a00 ) and columns correspond to the actions (a, a0 ). merate all H-step policies.
Then the estimated POMDP parameters can be com- Definition 1. A β-vector taking as input s(a,a0 ) with
puted as follows: root action a0 and t-step conditional policies ft (r, z)
bb(s(a ,a ) ) = normalize((Tb−1 Tb−1 w)(a for each observation pair (r, z) is defined as
0 1
b 0 , ·, a1 ))
X 0 0 X
a,a
pb(z|a, s(a,a0 ) ) = normalize( O
b ) β1a (s(a,a0 ) ) = p(r|s(a,a0 ) , a) · r
r r
0
b a,a0 )
X f (r,z)
pb(r|s(a,a0 ) , a0 ) = normalize( a ,ft
X
O βt+1 (s(a,a0 ) ) = (r + γβt t (s(a0 ,ft (r,z)) ))
z r,z,s(a0 ,ft (r,z))
0 00
pb(s(a0 ,a00 ) |s(a,a0 ) , a0 ) = normalize(Tba,a ,a ) · p(r|s(a,a0 ) , a)p(z|s(a0 ,ft (r,z)) , a)p(s(a0 ,ft (r,z)) |s(a,a0 ) , a)
Note that we require an additional normalize() proce- where ft (r, z) can also denote the root action of the
dure since the MoM approach we leverage is not guar- policy ft (r, z) used in terms like s(a,ft (r,z)) .
anteed to return well formed probability distributions.
The normalization procedure just divides by the sum
to make them into valid probability distributions (if 5 THEORY
there are negative values we can either set them to
zero or even just use the absolute value). 5.1 PAC Theorem Setup
Algorithm 3 then uses these estimated POMDP pa- We now state our primary result. For full details,
rameters to compute a policy. The algorithm con- please refer to our tech report4 . Before doingPso, we
structs β-vectors (see Definition 1) that represent the H
define some additional notation. Let V π (b) = i=1 rt
expected sum of rewards of following a particular starting from belief b be the total undiscounted re-
policy starting with action a0 given an input per- ward following policy π for an episode. Let σ1,a (Ta ) =
muted state s(a,a0 ) . Aside from this slight modifica- maxa σ1 (Ta ) and similarly for σ1,a (Ra ) and σ1,a (Za ).
tion, β-vectors are analogous to α-vectors in standard Let σ a (Ta ) = mina σ|S| (Ta ) and similarly for σ a (Ra )
POMDP planning. The β-vectors form an approxi- and σ a (Za ). Assume σ a (Ta ), σ a (Ra ), and σ a (Za ) are
mate value function for the underlying POMDP and all at most 1 (otherwise each term can be replaced by
can be used in a similar way to standard α-vectors. 1 in the final sample complexity bound below).

4.3 Phase 2 5.2 PAC Theorem

In phase 2, after estimating the POMDP parameters Theorem 1. For POMDPs that satisfy the stated as-
and β-vectors, we use the estimated POMDP value sumptions defined in the problem setting, executing
function to extract a policy for acting, and we will EEPORL will achieve an expected episodic reward of
shortly prove sufficient conditions for this policy to be V (b0 ) ≥ V ∗ (b0 ) − on all but a number of episodes
near-optimal for all remaining episodes. that is bounded by
The policy followed depends on the computed value  2
4 2 12 4 4 12
q
3 3

function. If computationally tractable, one can com- H Vmax |A| |R| |Z| |S| 1 + log( δ ) log δ
pute β-vectors incrementally for all possible H-step O
 2 

policies. In this case, control proceeds by finding the Cd,d,d 3δ σ a (Ta )6 σ a (Ra )8 σ a (Za )8 2
best β-vector for the estimated initial belief bb(s(a0 ,a1 ) )
4
(largest dot product of the β-vector with the initial http://www.cs.cmu.edu/~zguo/#publications
Zhaohan Daniel Guo, Shayan Doroudi, Emma Brunskill

with probability at least 1 − δ, where and O are both full rank and w = p(h2 ) has posi-
tive probability everywhere. p Furthermore the following
Cd,d,d (δ) = min(C1,2,3 (δ), C1,3,2 (δ)) terms are bounded: kT k2 ≤ |S|, kT −1 k2 ≤ 2(1+c|A|)
σ a (Ta ) ,
mini6=j ||M3 (~ei − ~ej )||2 · σk (P1,2 )2

C1,2,3 (δ) = min σmin (O) ≥ σ a (Ra )σ a (Za ), and kOk2 = σ1 (O) ≤ |S|.
||P1,2,3 ||2 · k 5 · κ(M1 )4

δ σk (P1,3 ) Next, we use Lemma 2, which is an extension of the
· ,
log(k/δ) 1 method of moments method by Anandkumar et al.
mini6=j ||M2 (~ei − ~ej )||2 · σk (P1,3 )2 [2012] that provides a bound on the accuracies of the

C1,3,2 (δ) = min estimated induced HMM parameters in terms of N ,
||P1,3,2 ||2 · k 5 · κ(M1 )4
the number of samples collected. Our extension in-
δ σk (P1,2 )
· , volves computing Tb (the original method only had O b
log(k/δ) 1
and OT ) and bounding its accuracy.
d

The quantities C1,2,3 , C1,3,2 directly arise from using Lemma 2. Given an HMM such that p(h2 ) has posi-
the previously referenced MoM method for HMM pa- tive probability everywhere, the transition matrix is full
rameter estimation [Anandkumar et al., 2012] and in- rank, and the observation matrix is full column rank,
volve singular values of the moments of the induced then by gathering N samples of (x1 , x2 , x3 ), the esti-
HMM and the induced HMM parameters (see [Anand- mates Tb, O,
b w b can be computed such that
kumar et al., 2012] for details).
We now briefly overview the proof. Detailed proofs
are available in the supplemental material. We first ||Tb − T ||2 ≤ 18|A||S|4 (σ a (Ra )σ a (Za ))−4 1
show that by executing EEPORL we obtain param- b − O||2 ≤ |A||S|0.5 1
||O
eter estimates of the induced HMM, and bounds on
||O
b − O||max ≤ 1
these estimates, as a function of the number of data
points (Lemma 2). We then prove that we can use the b − w||2 ≤ 14|A|2 |S|2.5 (σ a (Ra )σ a (Za ))−4 1
||w
induced HMM to obtain estimated parameters of the
underlying POMDP (Lemma 4). Then we show that
we can compute policies that are equivalent (in struc- where || · ||2 is the spectral norm for matrices, and
ture and value) to those from the original POMDP the euclidean norm for vectors, and w is the marginal
(Lemma 5). We then bound the error in the resulting probability of h2 , with probability 1 − δ, as long as
value function estimates of the resulting policies due
to the use of approximate (instead of exact) model p !
|A|2 |Z||R|(1 + log(1/δ))2 1
parameters (Lemma 6). This allows us to compute a N ≥O log
bound on the number of required samples (episodes) (Cd,d,d (δ))2 · 21 δ
necessary to achieve near-optimal policies, with high
probability, for use in phase 2.
Next we proceed by showing how to bound the error
We commence the proof by bounding the error in es-
in the estimates of the POMDP parameters. The fol-
timates of the induced HMM parameters. In order
lowing Lemma 3 is a prerequisite for computing the
to do that, we introduce Lemma 1, which proves that
submatrices of Ob and Tb needed for the estimates of
samples taken in phase 1 belong to an induced HMM
the POMDP parameters.
where the transition and observation matrices are full
rank. This is a requirement for being able to apply Lemma 3. Given O b with max-norm error O ≤
the MoM HMM parameter estimation procedure of 1
, then the columns which correspond to HMM
3|Z||R|
Anandkumar et al. [2012].
states of the form h = (a, s0 , a0 ) can be labeled with
Lemma 1. The induced HMM has the observation their corresponding a, a0 using Algorithm 2.
and transition matrices defined as

O(xit , hjt ) = With the correct labels, the submatrices of O b and Tb

allow us to compute estimates of the original POMDP
δ(ait , ajt )δ(ait+1 , ajt+1 )p(zt+1
i
|ajt , sjt+1 )p(rt+1
i
|sjt+1 , ajt+1 )
parameters in terms of these permutations s(a,a0 ) .
T (hit+1 , hjt ) = δ(ait+1 , ajt+1 )p(sit+2 |sjt+1 , ajt+1 )p(ait+2 |ajt ) Lemma 4 bounds the error in these resulting estimates.
where i is the index over the rows and j is the in- Lemma 4. Given Tb, O, b w
b with max-norm errors
dex over the columns, and xit = (ait , zt+1
i i
, rt+1 , ait+1 ), T , O , w respectively, then the following bounds hold
i i i i j j j j
ht+1 = (at+1 , st+2 , at+2 ), ht = (at , st+1 , at+1 ). T on the estimated POMDP model parameters with prob-
A PAC RL Algorithm for Episodic POMDPs

ability at least 1 − δ correctly identifies the best policy for the estimated
POMDP. Then let the initial beliefs b, bb have error
4|S|T kb − bbk∞ ≤ b , and the bound over α-vectors of any
p(s(a0 ,a00 ) |s(a,a0 ) , a0 ) − p(s(a0 ,a00 ) |s(a,a0 ) , a0 )| ≤
|b
2a policy π, kαπ − αbπ k∞ ≤ α be given. Then
|b
p(z|a, s(a,a0 ) ) − p(z|a, s(a,a0 ) )| ≤ 4|Z||R|O ∗

0 0
Vb πb (bb) = bb · α
bπb ≥ bb · α
bπ
|b
p(r|s(a,a0 ) , a ) − p(r|s(a,a0 ) , a )| ≤ 4|Z||R|O ∗ ∗ ∗ ∗
≥ bb · απ − |bb · απ − bb · α
bπ | ≥ bb · απ − α
|b(s(a ,a ) ) − b(s(a ,a ) )|
b
0 1 0 1 ∗ ∗ ∗
≥ b · απ − |b · απ − bb · απ | − α
≤ 4|A|4 |S|(||T −1 ||22 w + 6||T −1 ||32 T ) ∗
≥ b · απ − b Vmax − α = V ∗ (b) − b Vmax − α
where a = Θ(1/|A|)
where the first inequality is because π b is the optimal
We proceed by bounding the error in computing the policy for bb and α b, the second inequality is by the trian-
estimated β-vectors. Lemma 5 states that β-vectors gle inequality, the third inequality is because kbbk1 = 1,
are equivalent under permutation to α-vectors. the fourth inequality is by the triangle inequality, the
fifth inequality is since α is at most Vmax . Next
Lemma 5. Given the permutation of the states
s(a,a0 ),j = sφ((a,a0 ),j) , β-vectors and α-vectors over V πb (b) = b · απb ≥ bb · απb − |bb · απb − b · απb |
the same policy πt are equivalent i.e. βtπt (s(a,a0 ),j ) = ≥ bb · απb − b Vmax
αtπt (sφ((a,a0 ),j) )
bπb − |bb · α
≥ bb · α bπb − bb · απb | − b Vmax
The following lemma bounds the error in the resulting bπb − α − b Vmax
≥ bb · α
α-vectors obtained by performing POMDP planning,
and follows from prior work [Fard et al., 2008, Ross where the first inequality is by triangle inequality, the
et al., 2009]. second inequality is because α is at most Vmax , the
third inequality is triangle inequality, and the fourth
Lemma 6. Suppose we have approximate POMDP
inequality is due to kbbk1 = 1. Putting those two to-
parameters with errors |b p(s0 |s, a) − p(s0 |s, a)| ≤ T ,
0 0 gether results in
|b
p(z|a, s )−p(z|a, s )| ≤ Z , and |bp(r|s, a)−p(r|s, a0 )| ≤
R . Then for any t-step conditional policy πt V πb (b) ≥ V ∗ (b) − 2b Vmax − 2α

|αtπt (s) − α
btπt (s)| ≤ t2 Rmax (|R|R + |S|T + |Z|Z ). Letting = 2b Vmax + 2α , and setting the number of
episodes N to the value specified in the theorem will
ensure that the resulting errors b and α are small
We next prove that our EEPORL algorithm computes enough to obtain an -optimal policy as desired.
a policy that is optimal for the input parameters5 :
Lemma 7. Algorithm 3 finds the policy π b which 6 CONCLUSION
maximizes V πb (bb(s1 )) for a POMDP with parameters
bb(s1 ), pb(z|a, s0 ), pb(r|s, a), and pb(s0 |s, a). We have provided a PAC RL algorithm for an impor-
tant class of episodic POMDPs, which includes many
We now have all the key pieces to prove our result. information gathering domains. To our knowledge this
is the first RL algorithm for partially observable set-
Proof. (Proof sketch of Theorem 1). Lemma 4 shows tings that has a sample complexity that is a polyno-
that the error in the estimates of the POMDP pa- mial function of the POMDP parameters.
rameters can be bounded in terms of the error in the
induced HMM parameters, which is itself bounded in There are many areas for future work. We are inter-
terms of the number of samples (Lemma 1). Lemma 5 ested in reducing the set of currently required assump-
and Lemma 6 together bound in the error in comput- tions, thereby creating PAC PORL algorithms that
ing the estimated value function (as represented by are suitable to more generic settings. Such a direc-
β-vectors) using estimated POMDP parameters. tion may also require exploring alternatives to method
of moments approaches for performing latent variable
We then need to bound the error from executing π b estimation. We also hope that our theoretical results
that Algorithm 3 returns compared to the optimal pol- will lead to further insights on practical algorithms for
icy π ∗ . We know from Lemma 7 that Algorithm 3 partially observable RL.
5
Again, we could easily modify this to account for ap-
proximate planning error, but leave this out for simplicity, Acknowledgements
as we do not expect this to make a significant impact on
the resulting sample complexity, except in terms of minor This work was supported by NSF CAREER grant
changes to the polynomial terms. 1350984.
Zhaohan Daniel Guo, Shayan Doroudi, Emma Brunskill

References Li Ling Ko, David Hsu, Wee Sun Lee, and Sylvie CW Ong.
Structured parameter elicitation. In AAAI, 2010.
Naoki Abe and Manfred K Warmuth. On the compu-
tational complexity of approximating distributions by J Zico Kolter and Andrew Y Ng. Near-bayesian exploration
probabilistic automata. Machine Learning, 9(2-3):205– in polynomial time. In Proceedings of the 26th Annual
260, 1992. International Conference on Machine Learning, pages
513–520. ACM, 2009.
Christopher Amato and Emma Brunskill. Diagnose and de-
cide: An optimal bayesian approach. In In Proceedings Tor Lattimore and Marcus Hutter. Pac bounds for dis-
of the Workshop on Bayesian Optimization and Decision counted mdps. In Algorithmic learning theory, pages
Making at the Twenty-Sixth Annual Conference on Neu- 320–334. Springer, 2012.
ral Information Processing Systems (NIPS-12), 2012.
Michael L Littman, Richard S Sutton, and Satinder P
Animashree Anandkumar, Daniel Hsu, and Sham M Singh. Predictive representations of state. In NIPS,
Kakade. A method of moments for mixture models and volume 14, pages 1555–1561, 2001.
hidden markov models. arXiv preprint arXiv:1203.0683,
2012. Stefanos Nikolaidis, Ramya Ramakrishnan, Keren Gu, and
Julie Shah. Efficient model learning from joint-action
Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M demonstrations for human-robot collaborative tasks. In
Kakade, and Matus Telgarsky. Tensor decompositions HRI, pages 189–196, 2015.
for learning latent variable models. The Journal of Ma-
chine Learning Research, 15(1):2773–2832, 2014. Leonid Peshkin and Sayan Mukherjee. Bounds on sam-
ple size for policy evaluation in markov environments.
John Asmuth, Lihong Li, Michael L Littman, Ali Nouri, In Computational Learning Theory, pages 616–629.
and David Wingate. A bayesian sampling approach to Springer, 2001.
exploration in reinforcement learning. In Proceedings of
the Twenty-Fifth Conference on Uncertainty in Artificial Stephane Ross, Masoumeh Izadi, Mark Mercer, and David
Intelligence, pages 19–26. AUAI Press, 2009. Buckeridge. Sensitivity analysis of pomdp value func-
tions. In Machine Learning and Applications, 2009.
Mohammad Azar, Alessandro Lazaric, and Emma Brun- ICMLA’09. International Conference on, pages 317–323.
skill. Sequential transfer in multi-armed bandit with IEEE, 2009.
finite set of models. In C.J.C. Burges, L. Bottou,
M. Welling, Z. Ghahramani, and K.Q. Weinberger, ed- Stéphane Ross, Joelle Pineau, Brahim Chaib-draa, and
itors, Advances in Neural Information Processing Sys- Pierre Kreitmann. A bayesian approach for learning and
tems 26, pages 2220–2228. Curran Associates, Inc., 2013. planning in partially observable markov decision pro-
cesses. The Journal of Machine Learning Research, 12:
Byron Boots, Sajid M Siddiqi, and Geoffrey J Gordon. 1729–1770, 2011.
Closing the learning-planning loop with predictive state
representations. The International Journal of Robotics L. Song, B. Boots, S. M. Siddiqi, G. J. Gordon, and A. J.
Research, 30(7):954–966, 2011. Smola. Hilbert space embeddings of hidden Markov
models. In Proc. 27th Intl. Conf. on Machine Learning
Craig Boutilier. A pomdp formulation of preference elici- (ICML), 2010.
tation problems. In AAAI/IAAI, pages 239–246, 2002.
Alexander L Strehl and Michael L Littman. A theoretical
Ronen I Brafman and Moshe Tennenholtz. R-max-a gen- analysis of model-based interval estimation. In Proceed-
eral polynomial time algorithm for near-optimal rein- ings of the 22nd international conference on Machine
forcement learning. The Journal of Machine Learning learning, pages 856–863. ACM, 2005.
Research, 3:213–231, 2003.
Alexander L Strehl, Lihong Li, and Michael L Littman.
Finale Doshi-Velez. Bayesian nonparametric approaches Incremental model-based learners with formal learning-
for reinforcement learning in partially observable do- time guarantees. arXiv preprint arXiv:1206.6870, 2012.
mains. PhD thesis, Massachusetts Institute of Technol-
ogy, 2012.

Eyal Even-Dar, Sham M Kakade, and Yishay Mansour.

Reinforcement learning in pomdps without resets. In
IJCAI, pages 690–695, 2005.

Mahdi Milani Fard, Joelle Pineau, and Peng Sun. A vari-

ance analysis for pomdp policy evaluation. In AAAI,
pages 1056–1061, 2008.

Daniel Hsu, Sham M Kakade, and Tong Zhang. A spec-

tral algorithm for learning hidden markov models. arXiv
preprint arXiv:0811.4413, 2008.

Sham M. Kakade. On the Sample Complexity of Reinforce-

ment Learning. . PhD thesis, University College London,
2003.

Measurement and Evaluation in Human Performance (James R Morrow JR., Dale P. Mood Etc.)
50% (2)
Measurement and Evaluation in Human Performance (James R Morrow JR., Dale P. Mood Etc.)
759 pages
Stanford Machine Learning Course Notes by Andrew NG
No ratings yet
Stanford Machine Learning Course Notes by Andrew NG
16 pages
Lecture Notes: Take Note! Characteristics of A Good Experimental Design
0% (1)
Lecture Notes: Take Note! Characteristics of A Good Experimental Design
5 pages
Reinforcement Learning Exam
No ratings yet
Reinforcement Learning Exam
6 pages
Average-Reward Model-Free Reinforcement Learning - A Systematic Review and Literature Mapping
No ratings yet
Average-Reward Model-Free Reinforcement Learning - A Systematic Review and Literature Mapping
36 pages
Near-Optimal Reinforcement Learning in Polynomial Time
No ratings yet
Near-Optimal Reinforcement Learning in Polynomial Time
24 pages
Learning Symbolic Persistent Macro-Actions For POMDP Solving Over Time
No ratings yet
Learning Symbolic Persistent Macro-Actions For POMDP Solving Over Time
15 pages
Approximate Dynamic Programming and Reinforcement Learning - Algorithms, Analysis and An Application
No ratings yet
Approximate Dynamic Programming and Reinforcement Learning - Algorithms, Analysis and An Application
139 pages
Offline Reinforcement Learning With Instrumental Variables in Confounded Markov Decision Processes
No ratings yet
Offline Reinforcement Learning With Instrumental Variables in Confounded Markov Decision Processes
71 pages
What Are The Odds? Improving The Foundations of Statistical Model Checking
No ratings yet
What Are The Odds? Improving The Foundations of Statistical Model Checking
42 pages
Sample-Efficient Reinforcement Learning in The Presence of Exogenous Information
No ratings yet
Sample-Efficient Reinforcement Learning in The Presence of Exogenous Information
56 pages
MDP Pomdp
No ratings yet
MDP Pomdp
51 pages
Achieving (1) Sample Complexity For Constrained Markov Decision Process
No ratings yet
Achieving (1) Sample Complexity For Constrained Markov Decision Process
40 pages
Adprl Chapter Icis
No ratings yet
Adprl Chapter Icis
43 pages
Double R
No ratings yet
Double R
42 pages
POMDP Tutoria POMDP - Tutoriall
No ratings yet
POMDP Tutoria POMDP - Tutoriall
55 pages
Bayes-Adaptive POMDPs 2007
No ratings yet
Bayes-Adaptive POMDPs 2007
8 pages
CSE4037 Reinforcement Learning: A Partially Observable Markov Decision Process
No ratings yet
CSE4037 Reinforcement Learning: A Partially Observable Markov Decision Process
19 pages
Robust Markov Decision Processes - A Place Where AI and Formal Methods Meet
No ratings yet
Robust Markov Decision Processes - A Place Where AI and Formal Methods Meet
29 pages
Artificial Intelligence: Karina V. Delgado, Leliane N. de Barros, Daniel B. Dias, Scott Sanner
No ratings yet
Artificial Intelligence: Karina V. Delgado, Leliane N. de Barros, Daniel B. Dias, Scott Sanner
32 pages
Provably Efficient Maximum Entropy Exploration
No ratings yet
Provably Efficient Maximum Entropy Exploration
11 pages
Branch Prediction As A Reinforcement Learning Problem: Why, How and Case Studies
No ratings yet
Branch Prediction As A Reinforcement Learning Problem: Why, How and Case Studies
6 pages
Imprecise Probabilities Meet Partial Observability: Game Semantics For Robust Pomdps
No ratings yet
Imprecise Probabilities Meet Partial Observability: Game Semantics For Robust Pomdps
10 pages
Partially Observable Markov Decision Processes and Robotics
No ratings yet
Partially Observable Markov Decision Processes and Robotics
25 pages
Littomore
No ratings yet
Littomore
169 pages
1512 04455v1-RNN
No ratings yet
1512 04455v1-RNN
11 pages
1 PB
No ratings yet
1 PB
9 pages
A Framework For Sequential Planning in Multi-Agent Settings: Piotr J. Gmytrasiewicz Prashant Doshi
No ratings yet
A Framework For Sequential Planning in Multi-Agent Settings: Piotr J. Gmytrasiewicz Prashant Doshi
31 pages
Constrained Policy Opt
No ratings yet
Constrained Policy Opt
18 pages
Dynamic Programming For Partially Observable Stochastic Games
No ratings yet
Dynamic Programming For Partially Observable Stochastic Games
7 pages
Variational Methods For Reinforced Learning
No ratings yet
Variational Methods For Reinforced Learning
8 pages
Modeling Markov Decision Processes With Imprecise Probabilities Using Probabilistic Logic Programming
No ratings yet
Modeling Markov Decision Processes With Imprecise Probabilities Using Probabilistic Logic Programming
12 pages
0、Bayesian reinforcement learning in continuous POMDPs with application to robot navigation.（2008）
No ratings yet
0、Bayesian reinforcement learning in continuous POMDPs with application to robot navigation.（2008）
7 pages
Incorporating Recurrent Reinforcement Learning Into Model Predictive
No ratings yet
Incorporating Recurrent Reinforcement Learning Into Model Predictive
7 pages
Action Robust Reinforcement Learning and Applications in Continuous Control
No ratings yet
Action Robust Reinforcement Learning and Applications in Continuous Control
10 pages
An Application of Inverse Reinforcement Learning To Medical Records of Diabetes Treatment
No ratings yet
An Application of Inverse Reinforcement Learning To Medical Records of Diabetes Treatment
8 pages
6 (4 Files Merged)
0% (1)
6 (4 Files Merged)
4 pages
RL Assignment1
No ratings yet
RL Assignment1
5 pages
Follow Actions PDF
No ratings yet
Follow Actions PDF
42 pages
Discretized Approximations For POMDP With Average Cost: UAI 2004 Yu & Bertsekas 619
No ratings yet
Discretized Approximations For POMDP With Average Cost: UAI 2004 Yu & Bertsekas 619
9 pages
Model-Based Policy Optimization With Unsupervised Model Adaptation
No ratings yet
Model-Based Policy Optimization With Unsupervised Model Adaptation
17 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
43 pages
On State Variables and POMDP-s
No ratings yet
On State Variables and POMDP-s
49 pages
Pomdp Py AFramework To Build and Solve POMDP Problems
No ratings yet
Pomdp Py AFramework To Build and Solve POMDP Problems
5 pages
Nursing Research
No ratings yet
Nursing Research
48 pages
Guzman 22 A
No ratings yet
Guzman 22 A
12 pages
Reinforcement Learning I:: The Setting and Classical Stochastic Dynamic Programming Algorithms
No ratings yet
Reinforcement Learning I:: The Setting and Classical Stochastic Dynamic Programming Algorithms
42 pages
Tut21 RL
No ratings yet
Tut21 RL
101 pages
Chapter 1 Concept of Economics and Significance of Statistics in Economics
No ratings yet
Chapter 1 Concept of Economics and Significance of Statistics in Economics
6 pages
Regents Research Fund Fellows
No ratings yet
Regents Research Fund Fellows
74 pages
Hansen 2022
No ratings yet
Hansen 2022
20 pages
Partially Observable MDP AI
No ratings yet
Partially Observable MDP AI
2 pages
Using Reinforcement Leaming For Proactive Network Fault Management
No ratings yet
Using Reinforcement Leaming For Proactive Network Fault Management
7 pages
An Overview of Machine Learning
No ratings yet
An Overview of Machine Learning
42 pages
A.I Unit4
No ratings yet
A.I Unit4
54 pages
Lecture 10
No ratings yet
Lecture 10
25 pages
Conjugate Markov Decision Processes
No ratings yet
Conjugate Markov Decision Processes
8 pages
How To Write Objectives of The Study in Thesis
100% (3)
How To Write Objectives of The Study in Thesis
7 pages
Stats Quiz 2 Ans
No ratings yet
Stats Quiz 2 Ans
3 pages
MODULE6 7 A Partially Observable Markov Decision Process
No ratings yet
MODULE6 7 A Partially Observable Markov Decision Process
19 pages
databookRL Steve Brunton PDF
No ratings yet
databookRL Steve Brunton PDF
76 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
101 pages
Reinforcement Learning As Classification: Leveraging Modern Classifiers
No ratings yet
Reinforcement Learning As Classification: Leveraging Modern Classifiers
8 pages
Mathematical Model of QBD
No ratings yet
Mathematical Model of QBD
8 pages
Week Systematic Errors and Random Errors in Analysis
No ratings yet
Week Systematic Errors and Random Errors in Analysis
9 pages
An Introduction To Reinforcement Learning From Theory To Algorithms (December 19, 2024) - Joon Kwon
No ratings yet
An Introduction To Reinforcement Learning From Theory To Algorithms (December 19, 2024) - Joon Kwon
66 pages
E1 251 Linear and Nonlinear Op2miza2on: Chapter 4: Convex and Quadra2c Func2ons
No ratings yet
E1 251 Linear and Nonlinear Op2miza2on: Chapter 4: Convex and Quadra2c Func2ons
35 pages
Department of Commerce and Business Studies Jamia Millia Islamia, New Delhi
No ratings yet
Department of Commerce and Business Studies Jamia Millia Islamia, New Delhi
38 pages
Econometrics Model With Panel Data: Dinh Thi Thanh Binh Faculty of International Economics, FTU
No ratings yet
Econometrics Model With Panel Data: Dinh Thi Thanh Binh Faculty of International Economics, FTU
19 pages
Zamora Beltranena Daniel Alejandro
No ratings yet
Zamora Beltranena Daniel Alejandro
7 pages
Test 7. Statistics - Probability
No ratings yet
Test 7. Statistics - Probability
4 pages
EE290 Lecture 16
No ratings yet
EE290 Lecture 16
4 pages
(A) Regress Log of Wages On A Constant and The Female Dummy. Paste Output Here
No ratings yet
(A) Regress Log of Wages On A Constant and The Female Dummy. Paste Output Here
6 pages
3RD Quarter Exam STAT and Prob
No ratings yet
3RD Quarter Exam STAT and Prob
9 pages
Answer Spss
No ratings yet
Answer Spss
4 pages
04b - Data Management (Average, Dispersion) PDF
No ratings yet
04b - Data Management (Average, Dispersion) PDF
3 pages
Assessing The Knowledge and Impact of Ad
No ratings yet
Assessing The Knowledge and Impact of Ad
8 pages
Introductory of Statistics - Chapter 3
No ratings yet
Introductory of Statistics - Chapter 3
7 pages
Chapter 5.3
No ratings yet
Chapter 5.3
26 pages
Test A
No ratings yet
Test A
3 pages
National Resilience - EN (151-225)
No ratings yet
National Resilience - EN (151-225)
75 pages
Stochastic Process - Markov Property - Markov Chain - Markov Decision Process - Reinforcement Learning - RL Techniques - Example Applications
No ratings yet
Stochastic Process - Markov Property - Markov Chain - Markov Decision Process - Reinforcement Learning - RL Techniques - Example Applications
39 pages
Clinical Prediction Rule CASP Checklist
No ratings yet
Clinical Prediction Rule CASP Checklist
5 pages
E1 251 Linear and Nonlinear Op2miza2on: Chapter 9: The Method of Conjugate Direc6ons
No ratings yet
E1 251 Linear and Nonlinear Op2miza2on: Chapter 9: The Method of Conjugate Direc6ons
32 pages
The Evolution of Boosting Algorithms: From Machine Learning To Statistical Modelling
No ratings yet
The Evolution of Boosting Algorithms: From Machine Learning To Statistical Modelling
32 pages
E1 251 Linear and Nonlinear Op2miza2on: Chapter 3: Condi/ons of Maxima and Minima
No ratings yet
E1 251 Linear and Nonlinear Op2miza2on: Chapter 3: Condi/ons of Maxima and Minima
27 pages
E1 251 Linear and Nonlinear Op2miza2on
No ratings yet
E1 251 Linear and Nonlinear Op2miza2on
24 pages
Chapter 6 PDF
No ratings yet
Chapter 6 PDF
13 pages
PR2 G12
No ratings yet
PR2 G12
19 pages
Lno 2017 Syllabus
No ratings yet
Lno 2017 Syllabus
1 page
Anna HW3
No ratings yet
Anna HW3
4 pages
A Research Proposal On Sanima Bank Final
No ratings yet
A Research Proposal On Sanima Bank Final
9 pages
E1 251 Linear and Nonlinear Op2miza2on: Chapter 10: Quasi - Newton Method
No ratings yet
E1 251 Linear and Nonlinear Op2miza2on: Chapter 10: Quasi - Newton Method
18 pages
Experimental Psych Midterms Reviewer
No ratings yet
Experimental Psych Midterms Reviewer
15 pages
Sum of Independent Exponentials
No ratings yet
Sum of Independent Exponentials
3 pages
Indian Institute of Science,: Research Programme 2020
No ratings yet
Indian Institute of Science,: Research Programme 2020
3 pages
Pengobatan Pasien Tuberkulosis Paru Studi Kasus Rumah Sakit Paru Jember
No ratings yet
Pengobatan Pasien Tuberkulosis Paru Studi Kasus Rumah Sakit Paru Jember
12 pages
Markov Decision Processes & Reinforcement Learning: Megan Smith Lehigh University, Fall 2006
No ratings yet
Markov Decision Processes & Reinforcement Learning: Megan Smith Lehigh University, Fall 2006
40 pages
Chapter 1 To 3 Defense
No ratings yet
Chapter 1 To 3 Defense
18 pages
Assignment QM
No ratings yet
Assignment QM
2 pages
Markov Decision Process: Fundamentals and Applications
From Everand
Markov Decision Process: Fundamentals and Applications
Fouad Sabry
No ratings yet

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

A Pac RL Algorithm For Episodic Pomdps

Uploaded by

A Pac RL Algorithm For Episodic Pomdps

Uploaded by

A PAC RL Algorithm for Episodic POMDPs

Zhaohan Daniel Guo Shayan Doroudi Emma Brunskill

Abstract is said to be PAC if with high probability, it selects

4.3 Phase 2 5.2 PAC Theorem

O(xit , hjt ) = With the correct labels, the submatrices of O b and Tb

Eyal Even-Dar, Sham M Kakade, and Yishay Mansour.

Mahdi Milani Fard, Joelle Pineau, and Peng Sun. A vari-

Daniel Hsu, Sham M Kakade, and Tong Zhang. A spec-

Sham M. Kakade. On the Sample Complexity of Reinforce-

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.