A PAC RL Algorithm for Episodic POMDPs

Zhaohan Daniel Guo, Shayan Doroudi, Emma Brunskill
…information gathering POMDP RL domains such as preference elicitation [Boutilier, 2002], dialogue management slot-filling domains [Ko et al., 2010], and medical diagnosis before decision making [Amato and Brunskill, 2012].

Our work builds on method of moments inference techniques, but requires several non-trivial extensions to tackle the control setting. In particular, there is a subtle issue of latent state alignment: if the models for each action are learned as independent hidden Markov models (HMMs), then it is unclear how to solve the correspondence issue across latent states, which is essential for performing planning and selecting actions.

Our primary contribution is to provide a theoretical analysis of our proposed algorithm, and prove that it is possible to obtain near-optimal performance on all but a number of episodes that scales as a polynomial function of the POMDP parameters. Similar to most fully observable PAC RL algorithms, directly instantiating our bounds would yield an impractical number of samples for a real application. Nevertheless, we believe understanding the sample complexity may help to guide the amount of data required for a task and, similar to PAC MDP RL work, may motivate new practical algorithms that build on these ideas.
2 BACKGROUND AND RELATED WORK
The inspiration for pursuing PAC bounds for POMDPs came from the success of PAC bounds for MDPs [Brafman and Tennenholtz, 2003, Strehl and Littman, 2005, Kakade, 2003, Strehl et al., 2012, Lattimore and Hutter, 2012]. While algorithms with finite sample bounds have been developed for POMDPs [Peshkin and Mukherjee, 2001, Even-Dar et al., 2005], unfortunately these bounds are not PAC as they have an exponential dependence on the horizon length.

Alternatively, Bayesian methods [Ross et al., 2011, Doshi-Velez, 2012] are very popular for solving POMDPs. For MDPs, there exist Bayesian methods that have PAC bounds [Kolter and Ng, 2009, Asmuth et al., 2009]; however, there have been no PAC bounds for Bayesian methods for POMDPs. That said, Bayesian methods are optimal in the Bayesian sense of making the best decision given the posterior over all possible future observations, which does not translate to a frequentist finite sample bound.

We build on method of moments (MoM) work for estimating HMMs [Anandkumar et al., 2012] in order to provide a finite sample bound for POMDPs. MoM is able to obtain a global optimum and has finite sample bounds on the accuracy of its estimates, unlike the popular Expectation-Maximization (EM) approach, which is only guaranteed to find a local optimum and offers no finite sample guarantees. MLE approaches for estimating HMMs [Abe and Warmuth, 1992] also unfortunately do not provide accuracy guarantees on the estimated HMM parameters. As POMDP planning methods typically require estimates of the underlying POMDP parameters, it would be difficult to use such MLE methods for computing a POMDP policy and providing a finite sample guarantee.¹

¹ Abe and Warmuth [1992]'s MLE approach guarantees that the estimated probability over H-length observation sequences has a bounded KL-divergence from the true probability of the sequence under the true parameters, which is expressed as a function of the number of underlying data samples used to estimate the HMM parameters. We think it may be possible to use such estimates in the control setting when modeling hidden state control systems as PSRs and employing a forward search approach to planning; however, there remain a number of subtle issues to address to ensure such an approach is viable, and we leave this as an interesting direction for future work.

Aside from the MoM method in Anandkumar et al. [2012], another popular spectral method uses Predictive State Representations (PSRs) [Littman et al., 2001, Boots et al., 2011] to directly tackle the control setting; however, it only has asymptotic convergence guarantees and no finite sample analysis. There is also a method of moments approach for transfer across a set of bandit tasks, but there the latent variable estimation problem is substantially simplified because the state of the system is unchanged by the selected actions [Azar et al., 2013].

Fortunately, due to the polynomial finite sample bounds from MoM, we can achieve a PAC (polynomial) sample complexity bound for POMDPs.

3 PROBLEM SETTING

We consider a partially observable Markov decision process (POMDP) described by the tuple (S, A, R, T, Z, b, H), where we have a set of discrete states S, discrete actions A, discrete observations Z, discrete rewards R, initial belief b (more details below), and episode length H. The transition model is represented by a set of |A| matrices T_a of size |S| × |S|, where the (i, j)-th entry is the probability of transitioning from s_i to s_j under action a. With a slight abuse of notation, we use Z to denote both the finite set of observations and the observation model captured by the set of |A| observation matrices Z_a, where the (i, j)-th entry is the probability of observing z_i given that the agent took action a and transitioned to state s_j. We similarly abuse notation and let R denote both the finite set of rewards and the reward matrices R_a, where the (i, j)-th entry is the probability of obtaining reward r_i
when taking action a in state s_j. Note that in our setting we also treat the reward as an additional observation.²

² In planning problems the reward is typically a real-valued scalar, but in PORL we must learn the reward model, which requires assuming some mapping between states and rewards. For simplicity we assume a multinomial distribution over a discrete set of rewards. Note that we can always discretize a real-valued reward into a finite set of values with bounded error on the resulting value function estimates, so this choice places very few restrictions on the underlying setting.
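To make the bookkeeping concrete, the tabular model above can be stored as one stochastic matrix per action. The following Python sketch is purely illustrative; the container name, field layout, and helper are ours and are not part of the paper's formal development.

import numpy as np
from dataclasses import dataclass

@dataclass
class TabularPOMDP:
    # Illustrative container for the tuple (S, A, R, T, Z, b, H).
    T: np.ndarray   # shape (|A|, |S|, |S|);  T[a][i, j] = p(s_j | s_i, a)
    Z: np.ndarray   # shape (|A|, |Z|, |S|);  Z[a][i, j] = p(z_i | a, s_j)
    R: np.ndarray   # shape (|A|, |R|, |S|);  R[a][i, j] = p(r_i | s_j, a)
    b0: np.ndarray  # shape (|S|,); initial belief
    H: int          # episode length

    def is_valid(self, tol=1e-8):
        # rows of T[a] and columns of Z[a], R[a] must be conditional distributions
        return (np.allclose(self.T.sum(axis=2), 1.0, atol=tol)
                and np.allclose(self.Z.sum(axis=1), 1.0, atol=tol)
                and np.allclose(self.R.sum(axis=1), 1.0, atol=tol)
                and abs(self.b0.sum() - 1.0) < tol)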
The objective in POMDP planning is to compute a policy π that achieves a large expected sum of future rewards, where π is a mapping from histories of prior sequences of actions, observations, and rewards to actions. In many cases we capture prior histories using a sufficient statistic called the belief b, where b(s) represents the probability of being in a particular state s given the prior history of actions, observations, and rewards. One popular method for POMDP planning involves representing the value function by a finite set of α-vectors, where α(s) represents the expected sum of future rewards of following the policy associated with the α-vector from initial state s. POMDP planning then proceeds by taking the first action associated with the policy of the α-vector which yields the maximum expected value for the current belief state, which can be computed for a particular α-vector using the dot product ⟨b, α⟩.
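For concreteness, the two operations this paragraph relies on, the standard belief update and evaluating a belief against a set of α-vectors via ⟨b, α⟩, can be written as follows. This is a minimal sketch; the function names and the reuse of the matrices from the illustrative container above are our own choices.

import numpy as np

def belief_update(b, a, z, T, Z):
    # Standard belief update after taking action a and observing z:
    # b'(s_j) is proportional to Z[a][z, j] * sum_i b(i) * T[a][i, j]
    b_next = Z[a][z, :] * (b @ T[a])
    return b_next / b_next.sum()

def value_of_belief(b, alphas):
    # Value of belief b under a finite set of alpha-vectors: the max over <b, alpha>.
    dots = np.array([b @ alpha for alpha in alphas])
    return dots.max(), int(dots.argmax())   # value and index of the maximizing alpha-vector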
In the reinforcement learning setting, the transition, observation, and/or reward model parameters are initially unknown. The goal is to learn a policy that achieves a large sum of rewards in the environment without advance knowledge of how the world works.

We make the following assumptions about the domain and problem setting:

1. We consider episodic, finite horizon partially observable RL (PORL) settings.

2. It is possible to achieve a non-zero probability of being in any state in two steps from the initial belief.

3. For each action a, the transition matrix T_a is full rank, and the observation matrix Z_a and reward matrix R_a are full column rank.

The first assumption is satisfied by many real world situations involving an agent repeatedly doing a task: for example, an agent may sequentially interact with many different customers, each for a finite amount of time. The key restrictions on the setting are captured in assumptions 2 and 3. Assumption 2 is similar to a mixing assumption and is necessary in order for MoM to estimate dynamics for all states. Assumption 3 is necessary for MoM to uniquely determine the transition, observation, and reward dynamics.

The second assumption may sound quite strong, as in some POMDP settings states are only reachable by a complex sequence of carefully chosen actions, such as in robotic navigation or video games. However, assumption 2 is commonly satisfied in many important POMDP settings that primarily involve information gathering. For example, in preference elicitation or user modeling, POMDPs are commonly used to identify the, typically static, hidden intent or preference or state of the user, before taking some action based on the resulting information [Boutilier, 2002]. Examples of this include dialog systems [Ko et al., 2010], medical diagnosis and decision support [Amato and Brunskill, 2012], and even human-robot collaboration preference modeling [Nikolaidis et al., 2015]. In such settings, the belief commonly starts out non-zero over all possible user states and slowly gets narrowed down over time.

The third assumption is also significant, but is still satisfied by an important class of problems that overlap with the settings captured by assumption 2. Information gathering POMDPs where the state is hidden but static automatically satisfy the full rank assumption on the transition model, since it is an identity matrix. Assumption 3 on the observation and reward matrices implies that the cardinality of the set of observations (and rewards) is at least as large as the size of the state space. A similar assumption has been made in many latent variable estimation settings (e.g. [Anandkumar et al., 2012, 2014, Song et al., 2010]), including in the control setting [Boots et al., 2011]. Indeed, when the observations consist of videos, images, or audio signals, this assumption is typically satisfied [Boots et al., 2011], and such signals are very common in dialog systems and in the user intent and modeling situations covered by assumption 2. Satisfying full column rank for the reward matrix is typically trivial, as the reward signal is often obtained by discretizing a real-valued reward. Therefore, while we readily acknowledge that our setting does not cover all generic POMDP reinforcement learning settings, we believe it does cover an important class of problems that are relevant to real applications.
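Assumption 3 is easy to verify numerically for a given model. A minimal check, assuming the per-action matrices are stored as in the illustrative container above (the function name is ours):

import numpy as np

def satisfies_assumption_3(T, Z, R, tol=1e-10):
    # For each action a: T[a] full rank, Z[a] and R[a] full column rank.
    n_A, n_S = T.shape[0], T.shape[1]
    for a in range(n_A):
        if np.linalg.matrix_rank(T[a], tol=tol) < n_S:
            return False
        if np.linalg.matrix_rank(Z[a], tol=tol) < n_S:   # requires |Z| >= |S|
            return False
        if np.linalg.matrix_rank(R[a], tol=tol) < n_S:   # requires |R| >= |S|
            return False
    return True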
4 ALGORITHM

Our goal is to create an algorithm that can achieve near optimal performance from the initial belief on each episode. Prior work has shown that the error in the POMDP value function is bounded when using model parameter estimates that themselves have bounded error [Ross et al., 2009, Fard et al., 2008];
however, this work takes a sensitivity analysis perspective, and does not address how such model estimation errors themselves could be computed or bounded.³

³ Fard et al. [2008] assume that labels of the hidden states are provided, which removes the need for latent variable estimation.

In contrast, many PAC RL algorithms for MDPs have shown that exploration is critical in order to get enough data to estimate the model parameters. However, in MDPs, algorithms can directly observe how many times every action has been tried in every state, and can use this information to steer exploration towards less explored areas. In partially observable settings it is more challenging, as the state itself is hidden, and so it is not possible to directly observe the number of times an action has been tried in a latent state. Fortunately, recent advances in method of moments (MoM) estimation procedures for latent variable estimation (see e.g. [Anandkumar et al., 2012, 2014]) have demonstrated that in certain uncontrolled settings, including many types of hidden Markov models (HMMs), it is still possible to achieve accuracy estimates of the underlying latent variable model parameters as a function of the amount of data samples used to perform the estimation. For some intuition about this, consider starting in a belief state b which has non-zero probability over all possible states. If one can repeatedly take the same action a from the same belief b, then given a sufficient number of samples, we will have actually taken action a in each state many times (even if we don't know the specific instances on which action a was taken in a state s).

Algorithm 1: EEPORL
input: S, A, Z, R, H, N, c, π_rest
1  Let π_explore be the policy where a_1, a_2 are uniformly random, and p(a_{t+2} | a_t) = 1/(1 + c|A|) (I + c 1_{|A|×|A|});
2  X ← ∅;
   // Phase 1:
3  for episode i ← 1 to N do
4      Follow π_explore for 4 steps;
5      Let x_t = (a_t, r_t, z_t, a_{t+1});
6      X ← X ∪ {(x_1, x_2, x_3)};
7      Execute π_rest for the rest of the steps;
   // Phase 2:
8  Get T̂, Ô, ŵ for the induced HMM from X through our extended MoM method;
9  Using the labeling from Algorithm 2 with Ô, compute estimated POMDP parameters;
10 Call Algorithm 3 with the estimated POMDP parameters to estimate a near optimal policy π̂;
11 Execute π̂ for the rest of the episodes;
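Read as code, Algorithm 1 is a two-phase loop. The skeleton below only mirrors that structure; the environment interface and the helpers standing in for the MoM routine and for Algorithms 2 and 3 are placeholders we pass in, not APIs defined by the paper.

def eeporl(env, n_explore, n_total, pi_explore, pi_rest,
           mom_estimate_hmm, label_actions, recover_pomdp, find_policy):
    # Sketch of EEPORL: explore for N episodes, estimate the induced HMM by
    # method of moments, recover POMDP parameters, plan once, then exploit.
    X = []
    for _ in range(n_explore):                        # Phase 1
        a, r, z = env.run_steps(pi_explore, steps=4)  # first 4 actions / rewards / observations
        X.append(tuple((a[t], r[t], z[t], a[t + 1]) for t in range(3)))  # (x1, x2, x3)
        env.finish_episode(pi_rest)                   # remaining H - 4 steps
    T_hat, O_hat, w_hat = mom_estimate_hmm(X)         # extended MoM estimates (Section 4.2)
    labels = label_actions(O_hat)                     # Algorithm 2
    params = recover_pomdp(T_hat, O_hat, w_hat, labels)
    pi_hat = find_policy(params)                      # Algorithm 3
    for _ in range(n_total - n_explore):              # Phase 2: exploit
        env.run_episode(pi_hat)
    return pi_hat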
Algorithm 2: LabelActions
input: Ô
1  foreach column i of Ô do
2      Find a row j such that Ô(i, j) ≥ 2/(3|R||Z|);
3      Let the observation associated with row j be (a, r′, z′, a′); label column i with (a, a′);

Algorithm 3: FindPolicy
input: b̂(s_(a0,a1)), p̂(z | a, s_(a,a′)), p̂(r | s_(a,a′), a′), p̂(s_(a′,a″) | s_(a,a′), a′)
1  ∀ a−, a ∈ A, Γ_1^(a−,a) = {β̂_1^a(s_(a−,a))};
2  for t ← 2 to H do
3      ∀ a, a′ ∈ A, Γ_t^(a,a′) = ∅;
4      for a, a′ ∈ A do
5          for f_t(r, z) ∈ (|R| × |Z| → Γ_{t−1}^(a,a′)) do
               // all mappings from an observation pair to a previous β-vector
6              ∀ a− ∈ A, Γ_t^(a−,a) = Γ_t^(a−,a) ∪ {β_t^(a,f_t)(s_(a−,a))};
7  Return arg max_{a0, a1, β_H(s_(a0,a1)) ∈ Γ_H^(a0,a1)} (b̂ · β_H);

Figure 1: POMDP (left) analogous to induced HMM (right). Gray nodes show fully observed variables, whereas white nodes show latent states.

The control setting is more subtle than the uncontrolled setting which has been the focus of the majority of recent MoM spectral learning research, because we wish to estimate not just the transition and observation models of an HMM, but the POMDP model parameters; our ultimate interest is in being able to select good actions. A naive approach is to independently learn the transition, observation, and reward parameters for each separate action, by restricting the POMDP to only execute a single action, thereby turning the POMDP into an HMM. However, this simple approach fails because the returned parameters can correspond to a different labeling of the hidden states. For example, the first column of the transition matrix for action a_1 may actually correspond to the state s_2, while the first column of the transition
matrix for action a_2 may truly correspond to s_5. We require that the labeling be consistent across all actions, since we wish to compute what happens when different actions are executed consecutively. An unsatisfactory way to match up the labels for different actions is to require that the initial belief state have probabilities that are unique and well separated per state; then we could use the estimated initial belief from each action to match up the labels. However, this is a very strong assumption on the starting belief state which is unlikely to be realized.

To address this challenge of mismatched labels, we transform our POMDP into an induced HMM (see Figure 1) by fixing the policy to π_explore (for a few steps, during a certain number of episodes), creating an alternate hidden state representation that directly solves the problem of aligning hidden states across actions. Specifically, we make the hidden state at time t of the induced HMM, denoted by h_t, equal to the tuple of the action at time step t, the next state, and the subsequent action: h_t = (a_t, s_{t+1}, a_{t+1}). We denote the observations of the induced HMM by x, and the observation associated with a hidden state h_t is the tuple x_t = (a_t, r_t, z_t, a_{t+1}). Figure 1 shows how the graphical model of our original POMDP is related to the graphical model of the induced HMM. In making this transformation, our resulting HMM still satisfies the Markov assumption: the next state is only a function of the prior state, and the observation is only a function of the current state. But this transformation also has the desired property that it is now possible to directly align the identity of states across selected actions. This is because the HMM parameters now depend on both state and action, so there is a built-in correlation between different actions. We will discuss this more in the theoretical analysis.
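The only new bookkeeping this construction needs is an indexing of the induced hidden states h = (a, s′, a′) and induced observations x = (a, r, z, a′). A hedged sketch of just that bookkeeping follows; the induced transition and observation probabilities themselves follow from the POMDP parameters and the fixed exploration policy and are not reproduced here.

from itertools import product

def induced_hmm_indices(n_S, n_A, n_R, n_Z):
    # Flat indices for induced hidden states h = (a, s', a') and
    # induced observations x = (a, r, z, a').
    hidden_index = {h: i for i, h in enumerate(product(range(n_A), range(n_S), range(n_A)))}
    obs_index = {x: i for i, x in enumerate(product(range(n_A), range(n_R), range(n_Z), range(n_A)))}
    return hidden_index, obs_index   # sizes |A|^2 |S| and |A|^2 |R| |Z|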
We are now ready to describe our algorithm for episodic finite horizon reinforcement learning in POMDPs, EEPORL (Explore then Exploit Partially Observable RL), shown in Algorithm 1. Our algorithm is model-based and proceeds in two phases. In the first phase, it performs exploration to collect samples of trying different actions in different (latent) states. After the first phase completes, we extend a MoM approach [Anandkumar et al., 2012] to compute estimates of the induced HMM parameters. We use these estimates to obtain a near-optimal policy.

4.1 Phase 1

The first phase consists of the first N episodes. Let π_explore be a fixed open-loop policy for the first four actions of an episode. In π_explore, actions a_1, a_2 are selected uniformly at random, and p(a_{t+2} | a_t) = 1/(1 + c|A|) (I + c 1_{|A|×|A|}), where c can be any positive real number; for our proof, we pick c = O(1/|A|). Note that π_explore only depends on previous actions and not on any observations. The proof only requires p(a_{t+2} | a_t) to be full rank and to place some minimum probability on every action; we chose a perturbed identity matrix for simplicity. Since π_explore is a fixed policy, the POMDP process reduces to an HMM for these first four steps. During these steps we store the observed experience as (x_1, x_2, x_3), where x_t = (a_t, r_t, z_t, a_{t+1}) is an observation of our previously defined induced HMM. The algorithm then follows policy π_rest for the remaining steps of the episode. All of these episodes are counted as potentially non-optimal, so the choice of π_rest does not impact the theoretical analysis. However, empirically π_rest could be constructed to encourage near optimal behavior given the observed data collected up to the current episode.
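The exploration policy's action kernel is just a perturbed identity matrix; a small sketch of how it could be built and sampled (the function name is ours):

import numpy as np

def explore_action_kernel(n_A, c):
    # p(a_{t+2} | a_t) = (I + c * 1) / (1 + c|A|): row-stochastic, full rank,
    # and places probability at least c / (1 + c|A|) on every action.
    P = (np.eye(n_A) + c * np.ones((n_A, n_A))) / (1.0 + c * n_A)
    assert np.allclose(P.sum(axis=1), 1.0)
    return P

rng = np.random.default_rng(0)
P = explore_action_kernel(n_A=4, c=1.0 / 4)   # c on the order of 1/|A|, as in the analysis
a_next = rng.choice(4, p=P[2])                # sample a_{t+2} given a_t = 2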
4.2 Parameter Estimation

After Phase 1 completes, we have N samples of the tuple (x_1, x_2, x_3). We then apply our extension of the MoM algorithm for HMM parameter estimation by Anandkumar et al. [2012]; our extension computes estimates and bounds for the transition model T̂, which is not computed in the original method. To summarize, this procedure yields an estimated transition matrix T̂, observation matrix Ô, and belief vector ŵ for the induced HMM. The belief ŵ is over the second hidden state, h_2.

As mentioned before, one major challenge is that the labeling of the states h of the induced HMM is arbitrary; however, it is consistent between T̂, Ô, and ŵ, since this is a single HMM inference problem. Recall that a hidden state in our induced HMM is defined as h_t = (a_t, s_{t+1}, a_{t+1}). Since the actions are fully observable, it is possible to label each state h = (a, s′, a′) (i.e. the columns of Ô, the rows and columns of T̂, and the rows of ŵ) with the two actions (a, a′) that are associated with that state. This is possible because the true observation matrix entries for the actions of a hidden state must be non-zero, and the true values of all other entries (for other actions) must be zero; therefore, as long as we have sufficiently accurate estimates of the observation matrix, we can use the observation matrix parameters to augment the states h with their associated action pair. This procedure is performed by Algorithm 2. This labeling provides a connection between the HMM state h and the original POMDP state. For a particular pair of actions a, a′, there are exactly |S| HMM states that correspond to them. Thus, looking at the columns of Ô from left to right and picking out only the columns that are labeled with a, a′ results in a
specific ordering of the states (a, ·, a′), which is a permutation of the POMDP states, and which we denote as {s_(a,a′),1, s_(a,a′),2, …, s_(a,a′),|S|}. We will also use the notation s_(a,a′) to implicitly refer to a vector of states in the order of the permutation.
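A direct rendering of Algorithm 2's labeling step might look as follows. How the rows of Ô are decoded back into tuples (a, r, z, a′) is an implementation detail we assume here (obs_tuples), not something the paper fixes.

import numpy as np

def label_actions(O_hat, obs_tuples, n_R, n_Z):
    # Label each column (induced hidden state) of O_hat with an action pair (a, a')
    # read off from any row whose estimated entry clears the 2/(3|R||Z|) threshold.
    threshold = 2.0 / (3.0 * n_R * n_Z)
    labels = []
    for col in range(O_hat.shape[1]):
        rows = np.flatnonzero(O_hat[:, col] >= threshold)
        a, _, _, a_prime = obs_tuples[rows[0]]   # Lemma 3 guarantees such a row exists
        labels.append((a, a_prime))
    return labels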
The algorithm proceeds to estimate the original POMDP parameters in order to perform planning and compute a policy. Note that the estimated parameters use the computed s_(a,a′) permutations of the state. Let Ô^(a,a′) be the submatrix where the rows and columns correspond to the actions (a, a′), and let T̂^(a,a′,a″) be the submatrix where the rows correspond to the actions (a′, a″) and the columns correspond to the actions (a, a′). Then the estimated POMDP parameters can be computed as follows:

b̂(s_(a0,a1)) = normalize((T̂⁻¹ T̂⁻¹ ŵ)(a_0, ·, a_1))
p̂(z | a, s_(a,a′)) = normalize(Σ_r Ô^(a,a′))
p̂(r | s_(a,a′), a′) = normalize(Σ_z Ô^(a,a′))
p̂(s_(a′,a″) | s_(a,a′), a′) = normalize(T̂^(a,a′,a″))

Note that we require an additional normalize() procedure since the MoM approach we leverage is not guaranteed to return well formed probability distributions. The normalization procedure simply divides by the sum to produce valid probability distributions (if there are negative values we can either set them to zero or just use the absolute value).
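The normalize() step is the only post-processing applied to the MoM output. A minimal version matching the description above; clipping negatives is one of the two options mentioned in the text, and falling back to uniform when everything is clipped is our own guard:

import numpy as np

def normalize(v, clip_negative=True):
    # Turn a possibly ill-formed MoM estimate into a probability distribution:
    # zero out (or take absolute values of) negative entries, then divide by the sum.
    v = np.asarray(v, dtype=float)
    v = np.clip(v, 0.0, None) if clip_negative else np.abs(v)
    s = v.sum()
    return v / s if s > 0 else np.full(v.shape, 1.0 / v.size)   # uniform fallback (our choice)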
Algorithm 3 then uses these estimated POMDP parameters to compute a policy. The algorithm constructs β-vectors (see Definition 1) that represent the expected sum of rewards of following a particular policy, starting with root action a′, given an input permuted state s_(a,a′). Aside from this slight modification, β-vectors are analogous to α-vectors in standard POMDP planning. The β-vectors form an approximate value function for the underlying POMDP and can be used in a similar way to standard α-vectors: we select the β-vector that maximizes the dot product with the (initial) belief and then follow the associated policy π̂. π̂ is then followed for the entire episode with no additional belief updating required, as the policy itself encodes the conditional branching.

However, in practical circumstances it will not be possible to enumerate all possible H-step policies. In this case, one can use point-based approaches or other methods that use α-vectors to enumerate only a subset of possible policies. There will then be an additional planning error term in the final error bound due to the finite set of policies considered. In our analysis we omit planning error for simplicity and assume that we enumerate all H-step policies.

Definition 1. A β-vector taking as input s_(a,a′), with root action a′ and t-step conditional policies f_t(r, z) for each observation pair (r, z), is defined as

β_1^a(s_(a,a′)) = Σ_r p(r | s_(a,a′), a) · r

β_{t+1}^(a,f_t)(s_(a,a′)) = Σ_{r, z, s_(a′,f_t(r,z))} (r + γ β_t^(f_t(r,z))(s_(a′,f_t(r,z)))) · p(r | s_(a,a′), a) p(z | s_(a′,f_t(r,z)), a) p(s_(a′,f_t(r,z)) | s_(a,a′), a),

where f_t(r, z) can also denote the root action of the policy f_t(r, z), as used in terms like s_(a,f_t(r,z)).

5 THEORY

5.1 PAC Theorem Setup

We now state our primary result. For full details, please refer to our tech report. Before doing so, we define some additional notation. Let V^π(b) = Σ_{t=1}^H r_t, starting from belief b, be the total undiscounted reward of following policy π for an episode. Let σ_{1,a}(T_a) = max_a σ_1(T_a), and similarly for σ_{1,a}(R_a) and σ_{1,a}(Z_a). Let σ_a(T_a) = min_a σ_{|S|}(T_a), and similarly for σ_a(R_a) and σ_a(Z_a). Assume σ_a(T_a), σ_a(R_a), and σ_a(Z_a) are all at most 1 (otherwise each term can be replaced by 1 in the final sample complexity bound below).
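The σ_{1,a} and σ_a quantities are just the extreme singular values taken over the per-action matrices; a small helper (ours) that computes them for a stack of per-action matrices:

import numpy as np

def sigma_extremes(mats):
    # For per-action matrices M_a (e.g. T_a, R_a, or Z_a), return
    # max_a sigma_1(M_a) and min_a sigma_min(M_a), i.e. the sigma_{1,a} and sigma_a terms above.
    svals = [np.linalg.svd(M, compute_uv=False) for M in mats]
    return max(s[0] for s in svals), min(s[-1] for s in svals)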
Our main result (Theorem 1) shows that the number of exploration episodes N needed for EEPORL to return an ε-optimal policy with probability at least 1 − δ is a polynomial function of the POMDP parameters and of the quantities

C_{d,d,d}(δ) = min(C_{1,2,3}(δ), C_{1,3,2}(δ)),

C_{1,2,3}(δ) = min( [min_{i≠j} ‖M_3(e_i − e_j)‖_2 · σ_k(P_{1,2})²] / [‖P_{1,2,3}‖_2 · k⁵ · κ(M_1)⁴] · δ / log(k/δ) , σ_k(P_{1,3}) ),

C_{1,3,2}(δ) = min( [min_{i≠j} ‖M_2(e_i − e_j)‖_2 · σ_k(P_{1,3})²] / [‖P_{1,3,2}‖_2 · k⁵ · κ(M_1)⁴] · δ / log(k/δ) , σ_k(P_{1,2}) ).

The quantities C_{1,2,3}, C_{1,3,2} arise directly from the previously referenced MoM method for HMM parameter estimation [Anandkumar et al., 2012] and involve singular values of the moments of the induced HMM and of the induced HMM parameters (see [Anandkumar et al., 2012] for details).

We now briefly overview the proof. Detailed proofs are available in the supplemental material. We first show that by executing EEPORL we obtain parameter estimates of the induced HMM, and bounds on these estimates, as a function of the number of data points (Lemma 2). We then prove that we can use the induced HMM to obtain estimated parameters of the underlying POMDP (Lemma 4). Then we show that we can compute policies that are equivalent (in structure and value) to those from the original POMDP (Lemma 5). We then bound the error in the resulting value function estimates of the resulting policies due to the use of approximate (instead of exact) model parameters (Lemma 6). This allows us to compute a bound on the number of required samples (episodes) necessary to achieve near-optimal policies, with high probability, for use in phase 2.

We commence the proof by bounding the error in estimates of the induced HMM parameters. In order to do that, we introduce Lemma 1, which proves that samples taken in phase 1 belong to an induced HMM where the transition and observation matrices are full rank. This is a requirement for being able to apply the MoM HMM parameter estimation procedure of Anandkumar et al. [2012].

Lemma 1. The induced HMM has observation and transition matrices O and T (defined in terms of the POMDP parameters and π_explore; see the supplemental material for the explicit construction) such that T and O are both full rank and w = p(h_2) has positive probability everywhere. Furthermore, the following terms are bounded: ‖T‖_2 ≤ √|S|, ‖T⁻¹‖_2 ≤ 2(1 + c|A|)/σ_a(T_a), σ_min(O) ≥ σ_a(R_a) σ_a(Z_a), and ‖O‖_2 = σ_1(O) ≤ |S|.

Next, we use Lemma 2, which is an extension of the method of moments approach of Anandkumar et al. [2012] that provides a bound on the accuracy of the estimated induced HMM parameters in terms of N, the number of samples collected. Our extension involves computing T̂ (the original method only had Ô and OT) and bounding its accuracy.

Lemma 2. Given an HMM such that p(h_2) has positive probability everywhere, the transition matrix is full rank, and the observation matrix is full column rank, then by gathering N samples of (x_1, x_2, x_3), the estimates T̂, Ô, ŵ can be computed such that

‖T̂ − T‖_2 ≤ 18 |A| |S|⁴ (σ_a(R_a) σ_a(Z_a))⁻⁴ ε_1
‖Ô − O‖_2 ≤ |A| |S|^0.5 ε_1
‖Ô − O‖_max ≤ ε_1
‖ŵ − w‖_2 ≤ 14 |A|² |S|^2.5 (σ_a(R_a) σ_a(Z_a))⁻⁴ ε_1,

where ‖·‖_2 is the spectral norm for matrices and the Euclidean norm for vectors, and w is the marginal probability of h_2, with probability 1 − δ, as long as

N ≥ O( [|A|² |Z| |R| (1 + √(log(1/δ)))²] / [(C_{d,d,d}(δ))² · ε_1²] · log(1/δ) ).
necessary to achieve near-optimal policies, with high
probability, for use in phase 2.
Next we proceed by showing how to bound the error
We commence the proof by bounding the error in es-
in the estimates of the POMDP parameters. The fol-
timates of the induced HMM parameters. In order
lowing Lemma 3 is a prerequisite for computing the
to do that, we introduce Lemma 1, which proves that
submatrices of Ob and Tb needed for the estimates of
samples taken in phase 1 belong to an induced HMM
the POMDP parameters.
where the transition and observation matrices are full
rank. This is a requirement for being able to apply Lemma 3. Given O b with max-norm error O ≤
the MoM HMM parameter estimation procedure of 1
, then the columns which correspond to HMM
3|Z||R|
Anandkumar et al. [2012].
states of the form h = (a, s0 , a0 ) can be labeled with
Lemma 1. The induced HMM has the observation their corresponding a, a0 using Algorithm 2.
and transition matrices defined as
Lemma 4 then bounds the error in the estimated POMDP parameters in terms of the errors ε_T, ε_O, ε_w of the induced HMM estimates: with probability at least 1 − δ,

|p̂(s_(a′,a″) | s_(a,a′), a′) − p(s_(a′,a″) | s_(a,a′), a′)| ≤ 4 |S| ε_T / ε_a²
|p̂(z | a, s_(a,a′)) − p(z | a, s_(a,a′))| ≤ 4 |Z| |R| ε_O
|p̂(r | s_(a,a′), a′) − p(r | s_(a,a′), a′)| ≤ 4 |Z| |R| ε_O
|b̂(s_(a0,a1)) − b(s_(a0,a1))| ≤ 4 |A|⁴ |S| (‖T⁻¹‖_2² ε_w + 6 ‖T⁻¹‖_2³ ε_T),

where ε_a = Θ(1/|A|).

We proceed by bounding the error in computing the estimated β-vectors. Lemma 5 states that β-vectors are equivalent under permutation to α-vectors.

Lemma 5. Given the permutation of the states s_(a,a′),j = s_φ((a,a′),j), β-vectors and α-vectors over the same policy π_t are equivalent, i.e. β_t^(π_t)(s_(a,a′),j) = α_t^(π_t)(s_φ((a,a′),j)).

The following lemma bounds the error in the resulting α-vectors obtained by performing POMDP planning, and follows from prior work [Fard et al., 2008, Ross et al., 2009].

Lemma 6. Suppose we have approximate POMDP parameters with errors |p̂(s′ | s, a) − p(s′ | s, a)| ≤ ε_T, |p̂(z | a, s′) − p(z | a, s′)| ≤ ε_Z, and |p̂(r | s, a) − p(r | s, a)| ≤ ε_R. Then for any t-step conditional policy π_t,

|α_t^(π_t)(s) − α̂_t^(π_t)(s)| ≤ t² R_max (|R| ε_R + |S| ε_T + |Z| ε_Z).

We next prove that our EEPORL algorithm computes a policy that is optimal for the input parameters:⁵

Lemma 7. Algorithm 3 finds the policy π̂ which maximizes V^π̂(b̂(s_1)) for a POMDP with parameters b̂(s_1), p̂(z | a, s′), p̂(r | s, a), and p̂(s′ | s, a).

⁵ Again, we could easily modify this to account for approximate planning error, but leave this out for simplicity, as we do not expect it to make a significant impact on the resulting sample complexity, except in terms of minor changes to the polynomial terms.

We now have all the key pieces to prove our result.

Proof. (Proof sketch of Theorem 1.) Lemma 4 shows that the error in the estimates of the POMDP parameters can be bounded in terms of the error in the induced HMM parameters, which is itself bounded in terms of the number of samples (Lemma 1). Lemma 5 and Lemma 6 together bound the error in computing the estimated value function (as represented by β-vectors) using estimated POMDP parameters.

We then need to bound the error from executing the π̂ that Algorithm 3 returns compared to the optimal policy π*. We know from Lemma 7 that Algorithm 3 correctly identifies the best policy for the estimated POMDP. Then let the initial beliefs b, b̂ have error ‖b − b̂‖_∞ ≤ ε_b, and let the bound over the α-vectors of any policy π, ‖α^π − α̂^π‖_∞ ≤ ε_α, be given. Then

V̂^π̂(b̂) = b̂ · α̂^π̂ ≥ b̂ · α̂^(π*)
  ≥ b̂ · α^(π*) − |b̂ · α^(π*) − b̂ · α̂^(π*)| ≥ b̂ · α^(π*) − ε_α
  ≥ b · α^(π*) − |b · α^(π*) − b̂ · α^(π*)| − ε_α
  ≥ b · α^(π*) − ε_b V_max − ε_α = V*(b) − ε_b V_max − ε_α,

where the first inequality is because π̂ is the optimal policy for b̂ and α̂, the second inequality is by the triangle inequality, the third inequality is because ‖b̂‖_1 = 1, the fourth inequality is by the triangle inequality, and the fifth inequality holds since α is at most V_max. Next,

V^π̂(b) = b · α^π̂ ≥ b̂ · α^π̂ − |b̂ · α^π̂ − b · α^π̂|
  ≥ b̂ · α^π̂ − ε_b V_max
  ≥ b̂ · α̂^π̂ − |b̂ · α̂^π̂ − b̂ · α^π̂| − ε_b V_max
  ≥ b̂ · α̂^π̂ − ε_α − ε_b V_max,

where the first inequality is by the triangle inequality, the second inequality is because α is at most V_max, the third inequality is by the triangle inequality, and the fourth inequality is due to ‖b̂‖_1 = 1. Putting the two together results in

V^π̂(b) ≥ V*(b) − 2 ε_b V_max − 2 ε_α.

Letting ε = 2 ε_b V_max + 2 ε_α, and setting the number of episodes N to the value specified in the theorem, ensures that the resulting errors ε_b and ε_α are small enough to obtain an ε-optimal policy, as desired.

6 CONCLUSION

We have provided a PAC RL algorithm for an important class of episodic POMDPs, which includes many information gathering domains. To our knowledge this is the first RL algorithm for partially observable settings with a sample complexity that is a polynomial function of the POMDP parameters.

There are many areas for future work. We are interested in reducing the set of currently required assumptions, thereby creating PAC PORL algorithms that are suitable for more generic settings. Such a direction may also require exploring alternatives to method of moments approaches for performing latent variable estimation. We also hope that our theoretical results will lead to further insights on practical algorithms for partially observable RL.

Acknowledgements

This work was supported by NSF CAREER grant 1350984.
References

Naoki Abe and Manfred K. Warmuth. On the computational complexity of approximating distributions by probabilistic automata. Machine Learning, 9(2-3):205-260, 1992.

Christopher Amato and Emma Brunskill. Diagnose and decide: An optimal Bayesian approach. In Proceedings of the Workshop on Bayesian Optimization and Decision Making at the Twenty-Sixth Annual Conference on Neural Information Processing Systems (NIPS-12), 2012.

Animashree Anandkumar, Daniel Hsu, and Sham M. Kakade. A method of moments for mixture models and hidden Markov models. arXiv preprint arXiv:1203.0683, 2012.

Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773-2832, 2014.

John Asmuth, Lihong Li, Michael L. Littman, Ali Nouri, and David Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 19-26. AUAI Press, 2009.

Mohammad Azar, Alessandro Lazaric, and Emma Brunskill. Sequential transfer in multi-armed bandit with finite set of models. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2220-2228. Curran Associates, Inc., 2013.

Byron Boots, Sajid M. Siddiqi, and Geoffrey J. Gordon. Closing the learning-planning loop with predictive state representations. The International Journal of Robotics Research, 30(7):954-966, 2011.

Craig Boutilier. A POMDP formulation of preference elicitation problems. In AAAI/IAAI, pages 239-246, 2002.

Ronen I. Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 3:213-231, 2003.

Finale Doshi-Velez. Bayesian nonparametric approaches for reinforcement learning in partially observable domains. PhD thesis, Massachusetts Institute of Technology, 2012.

Li Ling Ko, David Hsu, Wee Sun Lee, and Sylvie C. W. Ong. Structured parameter elicitation. In AAAI, 2010.

J. Zico Kolter and Andrew Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 513-520. ACM, 2009.

Tor Lattimore and Marcus Hutter. PAC bounds for discounted MDPs. In Algorithmic Learning Theory, pages 320-334. Springer, 2012.

Michael L. Littman, Richard S. Sutton, and Satinder P. Singh. Predictive representations of state. In NIPS, volume 14, pages 1555-1561, 2001.

Stefanos Nikolaidis, Ramya Ramakrishnan, Keren Gu, and Julie Shah. Efficient model learning from joint-action demonstrations for human-robot collaborative tasks. In HRI, pages 189-196, 2015.

Leonid Peshkin and Sayan Mukherjee. Bounds on sample size for policy evaluation in Markov environments. In Computational Learning Theory, pages 616-629. Springer, 2001.

Stephane Ross, Masoumeh Izadi, Mark Mercer, and David Buckeridge. Sensitivity analysis of POMDP value functions. In Machine Learning and Applications (ICMLA '09), pages 317-323. IEEE, 2009.

Stéphane Ross, Joelle Pineau, Brahim Chaib-draa, and Pierre Kreitmann. A Bayesian approach for learning and planning in partially observable Markov decision processes. The Journal of Machine Learning Research, 12:1729-1770, 2011.

L. Song, B. Boots, S. M. Siddiqi, G. J. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

Alexander L. Strehl and Michael L. Littman. A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd International Conference on Machine Learning, pages 856-863. ACM, 2005.

Alexander L. Strehl, Lihong Li, and Michael L. Littman. Incremental model-based learners with formal learning-time guarantees. arXiv preprint arXiv:1206.6870, 2012.