Lecture 20: Markov Decision Processes (Part 1)
• Framework
• Markov chains
• MDPs
• Value iteration
• Extensions
Lecture 20 • 1
Now we’re going to think about how to do planning in uncertain domains. It’s
an extension of decision theory, but focused on making long-term plans of
action. We’ll start by laying out the basic framework, then look at Markov
chains, which are a simple case. Then we’ll explore what it means to have
an optimal plan for an MDP, and look at an algorithm, called value iteration,
for finding optimal plans. We’ll finish by looking at some of the major
weaknesses of this approach and seeing how they can be addressed.
1
MDP Framework
Lecture 20 • 2
2
MDP Framework
• S : states
Lecture 20 • 3
First, it has a set of states. These states will play the role of outcomes in the
decision theoretic approach we saw last time, as well as providing whatever
information is necessary for choosing actions. For a robot navigating
through a building, the state might be the room it’s in, or the x,y coordinates.
For a factory controller, it might be the temperature and pressure in the
boiler. In most of our discussion, we’ll assume that the set of states is finite
and not too big to enumerate in our computer.
3
MDP Framework
• S : states
• A : actions
Lecture 20 • 4
Next, we have a set of actions. These are chosen, in the simple case, from
a small finite set.
4
MDP Framework
• S : states
• A : actions
• Pr(s_t+1 | s_t, a_t) : transition probabilities
Lecture 20 • 5
The transition probabilities describe the dynamics of the world. They play
the role of the next-state function in a problem-solving search, except that
every state is thought to be a possible consequence of taking an action in a
state. So, we specify, for each state s_t and action a_t, the probability that
the next state will be s_t+1. You can think of this as being represented as a
set of matrices, one for each action. Each matrix is square, indexed in both
dimensions by states. For a fixed previous state and action, the probabilities
over all possible next states s' sum to 1, so every row of every matrix is a
probability distribution.
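As a concrete sketch of this representation (the numbers and action names here are invented purely for illustration), the transition model can be stored as one square matrix per action, with each row summing to 1:

import numpy as np

# Hypothetical 3-state, 2-action MDP: one |S| x |S| matrix per action.
# Entry T[a][s, s_next] = Pr(s_next | s, a); every row sums to 1.
T = {
    "stay": np.array([[0.9, 0.1, 0.0],
                      [0.0, 0.9, 0.1],
                      [0.1, 0.0, 0.9]]),
    "go":   np.array([[0.1, 0.8, 0.1],
                      [0.1, 0.1, 0.8],
                      [0.8, 0.1, 0.1]]),
}

for a, P in T.items():
    assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution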
5
MDP Framework
• S : states
• A : actions
• Pr(s_t+1 | s_t, a_t) : transition probabilities
  = Pr(s_t+1 | s_0 … s_t, a_0 … a_t)  (Markov property)
Lecture 20 • 6
These processes are called Markov because they have what is known as the
Markov property: given the current state and action, the next state is
independent of all the previous states and actions. The current state
captures everything that is relevant about the world for predicting what the
next state will be.
6
MDP Framework
• S : states
• A : actions
• Pr(s_t+1 | s_t, a_t) : transition probabilities
  = Pr(s_t+1 | s_0 … s_t, a_0 … a_t)  (Markov property)
• R(s) : real-valued reward
Lecture 20 • 7
7
MDP Framework
• S : states
• A : actions
• Pr(s_t+1 | s_t, a_t) : transition probabilities
  = Pr(s_t+1 | s_0 … s_t, a_0 … a_t)  (Markov property)
• R(s) : real-valued reward
Find a policy: π : S → A
Lecture 20 • 8
The result of classical planning was a plan. A plan was either an ordered list
of actions, or a partially ordered set of actions, meant to be executed without
reference to the state of the environment. When we looked at conditional
planning, we considered building plans with branches in them, that observed
something about the state of the world and acted differently depending on
the observation. In an MDP, the assumption is that you could potentially go
from any state to any other state in one step. And so, to be prepared, it is
typical to compute a whole policy, rather than a simple plan. A policy is a
mapping from states to actions. It says, no matter what state you happen to
find yourself in, here is the action that it’s best to take now. Because of the
Markov property, we’ll find that the choice of action only needs to depend on
the current state (and possibly the current time), but not on any of the
previous states.
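As a minimal illustration (the state and action names here are made up), a stationary policy can be represented as nothing more than a lookup table from states to actions:

# A policy is just a mapping from state to action; these names are hypothetical.
policy = {"room_a": "go_east", "room_b": "go_east", "room_c": "charge"}

def act(state):
    # No matter how we arrived here, the choice depends only on the current state.
    return policy[state]

print(act("room_b"))  # -> go_east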
8
MDP Framework
• S : states
• A : actions
• Pr(s_t+1 | s_t, a_t) : transition probabilities
  = Pr(s_t+1 | s_0 … s_t, a_0 … a_t)  (Markov property)
• R(s) : real-valued reward
Find a policy: π : S → A
Maximize
• Myopic: E[r_t | π, s_t] for all s
Lecture 20 • 9
So, what is our criterion for finding a good policy? In the simplest case, we’ll
try to find the policy that maximizes, for each state, the expected reward of
executing that policy in the state. This is a particularly easy problem,
because it completely decomposes into a set of decision problems: for each
state, find the single action that maximizes expected reward. This is exactly
the single-action decision theory problem that we discussed before.
9
MDP Framework
• S : states
• A : actions
• Pr(s_t+1 | s_t, a_t) : transition probabilities
  = Pr(s_t+1 | s_0 … s_t, a_0 … a_t)  (Markov property)
• R(s) : real-valued reward
Find a policy: π : S → A
Maximize
• Myopic: E[r_t | π, s_t] for all s
• Finite horizon: E[∑_{t=0}^{k} r_t | π, s_0]
Lecture 20 • 10
It’s not usually good to be so short sighted, though! Let’s think a little bit
farther ahead. We can consider policies that are “finite-horizon optimal” for
a particular horizon k. That means, that we should find a policy that, for
every initial state s0, results in the maximal expected sum of rewards from
times 0 to k.
10
MDP Framework
• S : states
• A : actions
• Pr(s_t+1 | s_t, a_t) : transition probabilities
  = Pr(s_t+1 | s_0 … s_t, a_0 … a_t)  (Markov property)
• R(s) : real-valued reward
Find a policy: π : S → A
Maximize
• Myopic: E[r_t | π, s_t] for all s
• Finite horizon: E[∑_{t=0}^{k} r_t | π, s_0]
– Non-stationary policy: depends on time
Lecture 20 • 11
So, if the horizon is 2, we’re maximizing over our reward today and
tomorrow. If it’s 300, then we’re looking a lot farther ahead. We might start
out by doing some actions that have very low rewards initially, because by
doing them we are likely to be taken to states that will ultimately result in
high reward. (Students are often familiar with the necessity of passing
through low-reward states in order to get to states with higher reward!).
Because we will, in general, want to choose actions differently at the very
end of our lives (when the horizon is short) than early in our lives, it will be
necessary to have a non-stationary policy in the finite-horizon case. That is,
we’ll need a different policy for each number of time steps remaining in our
life.
11
MDP Framework
• S : states
• A : actions
• Pr(s_t+1 | s_t, a_t) : transition probabilities
  = Pr(s_t+1 | s_0 … s_t, a_0 … a_t)  (Markov property)
• R(s) : real-valued reward
Find a policy: π : S → A
Maximize
• Myopic: E[r_t | π, s_t] for all s
• Finite horizon: E[∑_{t=0}^{k} r_t | π, s_0]
– Non-stationary policy: depends on time
• Infinite horizon: E[∑_{t=0}^{∞} r_t | π, s_0]
Lecture 20 • 12
Because in many cases it’s not clear how long the process is going to run
(consider designing a robot sentry, or a factory controller), it’s popular to
consider infinite horizon models of optimality.
12
MDP Framework
• S : states
• A : actions
• Pr(s_t+1 | s_t, a_t) : transition probabilities
  = Pr(s_t+1 | s_0 … s_t, a_0 … a_t)  (Markov property)
• R(s) : real-valued reward
Find a policy: π : S → A
Maximize
• Myopic: E[r_t | π, s_t] for all s
• Finite horizon: E[∑_{t=0}^{k} r_t | π, s_0]
– Non-stationary policy: depends on time
• Infinite horizon: E[∑_{t=0}^{∞} γ^t r_t | π, s_0]
– 0 < γ < 1 is discount factor
Lecture 20 • 13
But if we add up all the rewards out to infinity, the sums will in general be
infinite. To keep the math well behaved, and to put some pressure on the agent
to get rewards sooner rather than later, we use a discount factor. The discount
factor gamma is a number strictly between 0 and 1; usually it's somewhere near
0.9 or 0.99. So, we want to maximize our sum of rewards, but a reward that
arrives tomorrow is only worth gamma times what it would be worth today; with
gamma = 0.9, a reward received three steps from now is worth 0.9^3 ≈ 0.73 of
the same reward received immediately. You can think of this as a model of the
present value of money, as in economics. Or you can imagine that your life is
going to end with probability 1 - gamma on each step, but you don't know when.
13
MDP Framework
• S : states
• A : actions
• Pr(s_t+1 | s_t, a_t) : transition probabilities
  = Pr(s_t+1 | s_0 … s_t, a_0 … a_t)  (Markov property)
• R(s) : real-valued reward
Find a policy: π : S → A
Maximize
• Myopic: E[r_t | π, s_t] for all s
• Finite horizon: E[∑_{t=0}^{k} r_t | π, s_0]
– Non-stationary policy: depends on time
• Infinite horizon: E[∑_{t=0}^{∞} γ^t r_t | π, s_0]
– 0 < γ < 1 is discount factor
– Optimal policy is stationary
Lecture 20 • 14
This model has the very convenient property that the optimal policy is
stationary. It’s independent of how long the agent has run or will run in the
future (since nobody knows that exactly). Once you’ve survived to live
another day, in this model, the expected length of your life is the same as it
was on the previous step, and so your behavior is the same, as well.
14
Markov Chain
• Markov Chain
• states
• transitions
• rewards
• no actions
Lecture 20 • 15
To build up some intuitions about how MDPs work, let’s look at a simpler
structure called a Markov chain. A Markov chain is like an MDP with no
actions, and a fixed, probabilistic transition function from state to state.
15
Markov Chain
• Markov Chain
• states
• transitions
• rewards
• no actions
[Figure: three states, labeled 1, 2, and 3]
Lecture 20 • 16
16
Markov Chain
• Markov Chain
• states
• transitions
• rewards
• no actions
[Figure: the three-state chain with transition probabilities on the arcs: from state 1, 0.5 to itself and 0.5 to state 2; from state 2, 0.2 to state 1, 0.1 to itself, and 0.7 to state 3; from state 3, 0.9 to state 2 and 0.1 to itself]
Lecture 20 • 17
The transition probabilities are shown on the arcs between states. Note that
the probabilities on all the outgoing arcs of each state sum to 1.
17
Markov Chain
• Markov Chain
• states
• transitions
• rewards
• no actions
[Figure: the same chain, now with rewards: R(1) = 0, R(2) = 10, R(3) = 0]
Lecture 20 • 18
Markov chains don’t always have reward values associated with them, but
we’re going to add rewards to ours. We’ll make states 1 and 3 have an
immediate reward of 0, and state 2 have immediate reward of 10.
18
Markov Chain
• Markov Chain
• states
• transitions
• rewards
• no actions
[Figure: the three-state chain with its transition probabilities and rewards, as above]
Lecture 20 • 19
19
Markov Chain
• Markov Chain
• states
• transitions
• rewards
• no actions
[Figure: the three-state chain with its transition probabilities and rewards, as above]
Lecture 20 • 20
So, how much total reward do we expect to get if we start in state s? Well,
we know that we’ll immediately get a reward of R(s).
20
Markov Chain
• Markov Chain
• states
• transitions
• rewards
• no actions
[Figure: the three-state chain with its transition probabilities and rewards, as above]
Lecture 20 • 21
But then, we’ll get some reward in the future. The reward we get in the
future is not worth as much to us as reward in the present, so we multiply by
discount factor gamma.
21
Markov Chain
• Markov Chain
• states
• transitions
• rewards
• no actions
[Figure: the three-state chain with its transition probabilities and rewards, as above]
Lecture 20 • 22
Now, we consider what the future might be like. We’ll compute the
expected long-term value of the next state by summing over all possible next
states, s’, the product of the probability of making a transition from s to s’
and the infinite horizon expected discounted reward, or value of s’.
22
Markov Chain
• Markov Chain
• states
• transitions
• rewards
• no actions
[Figure: the three-state chain with its transition probabilities and rewards, as above]
Lecture 20 • 23
Since we know R and P (those are given in the specification of the Markov
chain), we’d like to compute V. If n is the number of states in the domain,
then we have a set of n equations in n unknowns (the values of each state).
Luckily, they’re easy to solve.
23
Markov Chain
• Markov Chain
• states
• transitions
• rewards
• no actions
[Figure: the three-state chain with its transition probabilities and rewards, as above]
• Assume γ = 0.9
V(1) =  0 + 0.9 (0.5 V(1) + 0.5 V(2))
V(2) = 10 + 0.9 (0.2 V(1) + 0.1 V(2) + 0.7 V(3))
V(3) =  0 + 0.9 (0.9 V(2) + 0.1 V(3))
Lecture 20 • 24
So, here are the equations for the values of the states in our example,
assuming a discount factor of 0.9.
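Because these equations are linear in the three unknowns, they can be solved directly. A minimal check in numpy (indices 0, 1, 2 stand for states 1, 2, 3):

import numpy as np

gamma = 0.9
R = np.array([0.0, 10.0, 0.0])        # rewards for states 1, 2, 3
P = np.array([[0.5, 0.5, 0.0],        # P[i, j] = Pr(next state j | current state i)
              [0.2, 0.1, 0.7],
              [0.0, 0.9, 0.1]])

# V = R + gamma * P V   =>   (I - gamma * P) V = R
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)  # approximately [40.5, 49.5, 44.1]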
24
Markov Chain
• Markov Chain
• states
• transitions
• rewards
• no actions
• Value of a state, using infinite discounted horizon:
V(s) = R(s) + γ ∑_s' P(s' | s) V(s')
• Assume γ = 0.9
V(1) =  0 + 0.9 (0.5 V(1) + 0.5 V(2))
V(2) = 10 + 0.9 (0.2 V(1) + 0.1 V(2) + 0.7 V(3))
V(3) =  0 + 0.9 (0.9 V(2) + 0.1 V(3))
[Figure: the chain annotated with the resulting state values]
Lecture 20 • 25
Now, if we solve for the values of the states, we get that V(1) is about 40.5,
V(2) is about 49.5, and V(3) is about 44.1. This seems at least intuitively
plausible. State 1 is worth the least, because it's relatively hard to get from
there to state 2, where the reward is. State 2 is worth the most; it has a
large reward, and it usually goes to state 3, which usually comes right back
again for another large reward. State 3 comes next: it gets no immediate
reward, but it usually takes only one step to get from 3 back to 2, so its
value is close to the discounted value of state 2.
25
Markov Chain
• Markov Chain
• states
• transitions
• rewards
• no actions
• Value of a state, using infinite discounted horizon:
V(s) = R(s) + γ ∑_s' P(s' | s) V(s')
• Assume γ = 0.9
V(1) =  0 + 0.9 (0.5 V(1) + 0.5 V(2))
V(2) = 10 + 0.9 (0.2 V(1) + 0.1 V(2) + 0.7 V(3))
V(3) =  0 + 0.9 (0.9 V(2) + 0.1 V(3))
[Figure: the chain annotated with the resulting state values]
Lecture 20 • 26
If we set gamma to 0, then the values of the nodes would be the same as
their rewards. If gamma were small but non-zero, then the values would be
smaller than in this case and their differences more pronounced.
26
Finding the Best Policy
• MDP + Policy = Markov Chain
• MDP = the way the world works
• Policy = the way the agent works
Lecture 20 • 27
Now, we’ll go back to thinking about Markov Decision Processes. If you take
an MDP and fix the policy, then all of the actions are chosen and what you
have left is a Markov chain. So, given a policy, it’s easy to evaluate it, in the
sense that you can compute what value the agent can expect to get from
each possible starting state, if it executes that policy.
27
Finding the Best Policy
• MDP + Policy = Markov Chain
• MDP = the way the world works
• Policy = the way the agent works
Lecture 20 • 28
We want to find the best possible policy. We’ll approach this problem by
thinking about V*, the optimal value function. V* is defined using the
following set of recursive equations. The optimal value of a state s is the
reward that we get in s, plus the maximum over all actions we could take in
s, of the discounted expected optimal value of the next state. The idea is
that in every state, we want to choose the action that maximizes the value of
the future.
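Written out in the notation used for the Markov chain above, the recursion being described is the Bellman optimality equation:

V*(s) = R(s) + max_a γ ∑_s' P(s' | s, a) V*(s')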
28
Finding the Best Policy
• MDP + Policy = Markov Chain
• MDP = the way the world works
• Policy = the way the agent works
Lecture 20 • 29
This is sort of like the set of equations we had for the Markov chain. We
have n equations in n unknowns. The problem is that now we have these
annoying “max” operators in our equations, which makes them non-linear,
and therefore non-trivial to solve. Nonetheless, there is a theorem that says,
given R and P, there is a unique function V* that satisfies these equations.
29
Finding the Best Policy
• MDP + Policy = Markov Chain
• MDP = the way the world works
• Policy = the way the agent works
Lecture 20 • 30
If we know V*, then it's easy to find the optimal policy. The optimal policy,
pi*, guarantees that we'll get the biggest possible infinite-horizon expected
discounted reward from every state. So, if we find ourselves in a state s, we
can pick the best action by considering, for each action, the average of the
V* values of the next states, weighted by how likely each next state is to
occur given the current state s and the action under consideration. Then, we
pick the action that maximizes the expected V* on the next step.
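In symbols, the action choice just described is:

π*(s) = arg max_a ∑_s' P(s' | s, a) V*(s')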
30
Computing V*
• Approaches
• Value iteration
• Policy iteration
• Linear programming
Lecture 20 • 31
So, we've seen that if we know V*, then we know how to act optimally. So, can
we compute V*? It turns out that it's not too hard. There are three fairly
standard approaches; we'll cover just the first one, value iteration.
31
Value Iteration
Initialize V_0(s) = 0, for all s
Lecture 20 • 32
Here’s the value iteration algorithm. We are going to compute V*(s) for all s,
by doing an iterative procedure, in which our current estimate for V* gets
closer to the true value over time. We start by initializing V(s) to 0, for all
states. We could actually initialize to any values we wanted to, but it’s
easiest to just start at 0.
32
Value Iteration
Initialize V_0(s) = 0, for all s
Loop for a while
Lecture 20 • 33
Now, we loop for a while (we’ll come back to the question of how long).
33
Value Iteration
Initialize V_0(s) = 0, for all s
Loop for a while
  Loop for all s
    V_t+1(s) = R(s) + max_a γ ∑_s' P(s' | s, a) V_t(s')
Lecture 20 • 34
Now, for every s, we compute a new value, V_t+1 (s) using the equation that
defines V* as an assignment, based on the previous values V_t of V. So, on
each iteration, we compute a whole new value function given the previous
one.
34
Value Iteration
Initialize V_0(s) = 0, for all s
Loop for a while
  Loop for all s
    V_t+1(s) = R(s) + max_a γ ∑_s' P(s' | s, a) V_t(s')
• Converges to V*
Lecture 20 • 35
35
Value Iteration
Initialize V_0(s) = 0, for all s
Loop for a while
  Loop for all s
    V_t+1(s) = R(s) + max_a γ ∑_s' P(s' | s, a) V_t(s')
• Converges to V*
• No need to keep V_t vs V_t+1
Lecture 20 • 36
Not only does this algorithm converge to V*, we can simplify it a lot and still
retain convergence. First of all, we can get rid of the different Vt functions in
the algorithm and just use a single V, in both the left and right hand sides of
the update statement.
36
Value Iteration
Initialize V_0(s) = 0, for all s
Loop for a while
  Loop for all s
    V_t+1(s) = R(s) + max_a γ ∑_s' P(s' | s, a) V_t(s')
• Converges to V*
• No need to keep V_t vs V_t+1
• Asynchronous (can do random state updates)
Lecture 20 • 37
37
Value Iteration
Initialize V_0(s) = 0, for all s
Loop for a while
  Loop for all s
    V_t+1(s) = R(s) + max_a γ ∑_s' P(s' | s, a) V_t(s')
• Converges to V*
• No need to keep V_t vs V_t+1
• Asynchronous (can do random state updates)
• Assume we want ||V_t – V*|| = max_s |V_t(s) – V*(s)| < ε
Lecture 20 • 38
Now, let’s go back to the “loop for a while” statement, and make it a little bit
more precise. Let’s say we want to guarantee that when we terminate, the
current value function differs from V* by at most epsilon (the double bars
indicate the max norm, which is just the biggest difference between the two
functions, at any argument s).
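Putting the pieces together, here is a minimal sketch of the whole algorithm in numpy, assuming the reward vector and the per-action transition matrices are given as arrays (as in the earlier examples); this is one straightforward way to implement it, not the only one:

import numpy as np

def value_iteration(R, T, gamma=0.9, epsilon=1e-6):
    """R: length-|S| reward vector.  T: array of shape (|A|, |S|, |S|) with
    T[a, s, s2] = Pr(s2 | s, a).  Returns an estimate of V* and a greedy policy."""
    V = np.zeros(len(R))                      # initialize V_0(s) = 0
    while True:
        # Q[a, s] = R(s) + gamma * sum over s' of P(s' | s, a) * V(s')
        Q = R + gamma * (T @ V)
        V_new = Q.max(axis=0)                 # max over actions
        if np.max(np.abs(V_new - V)) < epsilon * (1 - gamma) / gamma:
            return V_new, Q.argmax(axis=0)    # value estimate and greedy policy
        V = V_new

# Example: the three-state chain from before as action 0, plus a hypothetical
# "stay put" action that leaves the state unchanged.
R = np.array([0.0, 10.0, 0.0])
T = np.array([[[0.5, 0.5, 0.0], [0.2, 0.1, 0.7], [0.0, 0.9, 0.1]],
              [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]])
V, pi = value_iteration(R, T)
print(V, pi)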
38
Value Iteration
Initialize V_0(s) = 0, for all s
Loop for a while [until ||V_t – V_t+1|| < ε(1-γ)/γ]
  Loop for all s
    V_t+1(s) = R(s) + max_a γ ∑_s' P(s' | s, a) V_t(s')
• Converges to V*
• No need to keep V_t vs V_t+1
• Asynchronous (can do random state updates)
• Assume we want ||V_t – V*|| = max_s |V_t(s) – V*(s)| < ε
Lecture 20 • 39
39
Value Iteration
Initialize V_0(s) = 0, for all s
Loop for a while [until ||V_t – V_t+1|| < ε(1-γ)/γ]
  Loop for all s
    V_t+1(s) = R(s) + max_a γ ∑_s' P(s' | s, a) V_t(s')
• Converges to V*
• No need to keep V_t vs V_t+1
• Asynchronous (can do random state updates)
• Assume we want ||V_t – V*|| = max_s |V_t(s) – V*(s)| < ε
• Gets to optimal policy in time polynomial in |A|,
|S|, 1/(1-γ)
Lecture 20 • 40
40
Value Iteration
Initialize V_0(s) = 0, for all s
Loop for a while [until ||V_t – V_t+1|| < ε(1-γ)/γ]
  Loop for all s
    V_t+1(s) = R(s) + max_a γ ∑_s' P(s' | s, a) V_t(s')
• Converges to V*
• No need to keep V_t vs V_t+1
• Asynchronous (can do random state updates)
• Assume we want ||V_t – V*|| = max_s |V_t(s) – V*(s)| < ε
• Gets to optimal policy in time polynomial in |A|,
|S|, 1/(1-γ)
Lecture 20 • 41
MDPs and value iteration are great as far as they go. But to model real
problems, we have to make a number of extensions and compromises, as
we’ll see in the next few slides.
41
Big state spaces
Lecture 20 • 42
Of course, the theorems only hold when we can store V in a table of values,
one for each state. And in large state spaces, even running in time polynomial
in the size of the state space is nowhere near efficient enough.
42
Big state spaces
• Function approximation for V
Lecture 20 • 43
The usual approach to dealing with big state spaces is to use value iteration,
but to store the value function only approximately, using some kind of
function-approximation method. Then, the value iteration uses the previous
approximation of the value function to compute a new, better, approximation,
still stored in some sort of a compact way. The convergence theorems only
apply to very limited types of function approximation.
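As a rough sketch of how this can look with a linear approximator over state features (the feature map, sampled states, and generative model here are all assumptions, and this is only one of many variants of fitted value iteration):

import numpy as np

def fitted_value_iteration(sample_states, features, reward, next_dist,
                           actions, gamma=0.9, iterations=50):
    """Approximate V(s) by features(s) . w.  next_dist(s, a) is assumed to
    return a list of (probability, next_state) pairs."""
    Phi = np.array([features(s) for s in sample_states])   # design matrix
    w = np.zeros(Phi.shape[1])
    for _ in range(iterations):
        targets = []
        for s in sample_states:
            # Backed-up value: R(s) + gamma * max over a of E[approx V(s')]
            backups = [sum(p * (features(s2) @ w) for p, s2 in next_dist(s, a))
                       for a in actions]
            targets.append(reward(s) + gamma * max(backups))
        # Refit the approximator to the backed-up targets (least squares).
        w, *_ = np.linalg.lstsq(Phi, np.array(targets), rcond=None)
    return w

With one indicator feature per state, this reduces to ordinary value iteration; with richer features, or a neural net or regression tree in place of the least-squares fit, convergence is no longer guaranteed, as noted above.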
43
Big state spaces
• Function approximation for V
Lecture 20 • 44
44
Big state spaces
• Function approximation for V
• neural nets
• regression trees
Lecture 20 • 45
Basic techniques from machine learning, such as neural nets and regression
trees, can be used to compute and store approximations to V from example
points.
45
Big state spaces
• Function approximation for V
• neural nets
• regression trees
• factored representations (represent Pr(s’|s,a)
using Bayes net)
Lecture 20 • 46
46
Partial Observability
• MDPs assume complete observability (can always
tell what state you’re in)
Lecture 20 • 47
Another big issue is partial observability. MDPs assume that any crazy thing
can happen to you, but when it does, you’ll know exactly what state you’re in.
This is an incredibly strong assumption that only very rarely holds.
Generally, you can just see a small part of the whole state of the world, and
you don’t necessarily even observe that reliably.
47
Partial Observability
• MDPs assume complete observability (can always
tell what state you’re in)
• POMDP (Partially Observable MDP)
Lecture 20 • 48
48
Partial Observability
• MDPs assume complete observability (can always
tell what state you’re in)
• POMDP (Partially Observable MDP)
• Observation: Pr(O|s,a) [O is observation]
Lecture 20 • 49
49
Partial Observability
• MDPs assume complete observability (can always
tell what state you’re in)
• POMDP (Partially Observable MDP)
• Observation: Pr(O|s,a) [O is observation]
• o, a, o, a, o, a
Lecture 20 • 50
Now, the agent’s interaction with the world is that it makes an observation o,
chooses an action a, makes another observation o, chooses another action
a, and so on.
50
Partial Observability
• MDPs assume complete observability (can always
tell what state you’re in)
• POMDP (Partially Observable MDP)
• Observation: Pr(O|s,a) [O is observation]
• o, a, o, a, o, a
Lecture 20 • 51
In general, for optimal behavior, the agent’s choice of actions will now have
to depend on the complete history of previous actions and observations.
One way of knowing more about the current state of the world is
remembering what you knew about it previously.
51
Partial Observability
• MDPs assume complete observability (can always
tell what state you’re in)
• POMDP (Partially Observable MDP)
• Observation: Pr(O|s,a) [O is observation]
• o, a, o, a, o, a
[Block diagram: observation in, action out]
Lecture 20 • 52
Here’s a block diagram for an optimal POMDP controller. It’s divided into
two parts.
52
Partial Observability
• MDPs assume complete observability (can always
tell what state you’re in)
• POMDP (Partially Observable MDP)
• Observation: Pr(O|s,a) [O is observation]
• o, a, o, a, o, a
[Block diagram: the observation, the last action, and the previous mental state feed a memory-update box, which produces a new mental state]
Lecture 20 • 53
The first box is in charge of memory updates. It takes in as input the current
observation, the last action, and the last “mental state” and generates a new
mental state.
53
Partial Observability
• MDPs assume complete observability (can always
tell what state you’re in)
• POMDP (Partially Observable MDP)
• Observation: Pr(O|s,a) [O is observation]
• o, a, o, a, o, a
[Block diagram: the memory-update box produces the mental state, which feeds a policy box that produces the action]
Lecture 20 • 54
The second box is a policy, as in an MDP. Except in this case, the policy
takes the mental state as input, rather than the true state of the world (or
simply the current observation). The policy still has the job of generating
actions.
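One common concrete choice of mental state is a belief state: a probability distribution over the true underlying state. Here is a minimal sketch of the two boxes under that choice, assuming the transition, observation, and (hypothetical) Q arrays are given; a real POMDP policy over beliefs is more involved than the simple linear scoring used here:

import numpy as np

def memory_update(belief, action, obs, T, O):
    """Bayes-filter update.  T[a, s, s2] = Pr(s2 | s, a); O[a, s2, o] = Pr(o | s2, a).
    Returns the new belief (a distribution over states)."""
    predicted = belief @ T[action]               # predicted distribution over next states
    new_belief = predicted * O[action][:, obs]   # reweight by observation likelihood
    return new_belief / new_belief.sum()

def policy(belief, Q):
    """Simplified policy box: score each action by Q @ belief, with Q of shape (|A|, |S|)."""
    return int(np.argmax(Q @ belief))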
54
Partial Observability
• MDPs assume complete observability (can always
tell what state you’re in)
• POMDP (Partially Observable MDP)
• Observation: Pr(O|s,a) [O is observation]
• o, a, o, a, o, a
[Block diagram: the memory-update box produces the mental state, which feeds a policy box that produces the action]
Lecture 20 • 55
55
Partial Observability
• MDPs assume complete observability (can always
tell what state you’re in)
• POMDP (Partially Observable MDP)
• Observation: Pr(O|s,a) [O is observation]
• o, a, o, a, o, a
Lecture 20 • 56
56
Partial Observability
• MDPs assume complete observability (can always
tell what state you’re in)
• POMDP (Partially Observable MDP)
• Observation: Pr(O|s,a) [O is observation]
• o, a, o, a, o, a
Lecture 20 • 57
57
Worrying too much
• Assumption that every possible eventuality should
be taken into account
Lecture 20 • 58
Finally, the MDP view that every possible state has a non-zero chance of
happening as a result of taking an action in a particular state is really too
paranoid in most domains.
58
Worrying too much
• Assumption that every possible eventuality should
be taken into account
Lecture 20 • 59
59
Worrying too much
• Assumption that every possible eventuality should
be taken into account
Lecture 20 • 60
When the horizon, or the effective horizon 1/(1-gamma), is large relative to
the size of the state space, then it makes sense to compute a whole policy,
since the search through the space for a good course of action is likely to
keep hitting the same states over and over (this is, essentially, the
principle of dynamic programming).
60
Worrying too much
• Assumption that every possible eventuality should
be taken into account
Lecture 20 • 61
In many domains, though, the state space is truly enormous, but the agent is
trying to achieve a goal that is not too far removed from the current state. In
such situations, it makes sense to do something that looks a lot more like
regular planning, searching forward from the current state.
61
Worrying too much
• Assumption that every possible eventuality should
be taken into account
Lecture 20 • 62
If the transition function is truly sparse, then each state will have only a small
number of successor states under each action, and it’s possible to build up a
huge decision tree, and then evaluate it from leaf to root to decide which
action to take next.
62
Worrying too much
• Assumption that every possible eventuality should
be taken into account
• sample-based planning: with short horizon in large
state space, planning should be independent of
state-space size
Lecture 20 • 63
63
Worrying too much
• Assumption that every possible eventuality should
be taken into account
• sample-based planning: with short horizon in large
state space, planning should be independent of
state-space size
Lecture 20 • 64
Once the search out to a finite horizon is done, the agent chooses the first
action, executes it, and looks to see what state it’s in. Then, it searches
again. Thus, most of the computational work is done on-line rather than
offline, but it’s much more sensitive to the actual states, and doesn’t worry
about every possible thing that could happen. Thus, these methods are
usually exponential in the horizon, but completely independent of the size of
the state space.
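A rough sketch of this style of sample-based lookahead, assuming only a generative model sample_next(s, a) that returns a sampled (next_state, reward) pair (the function names and the sampling width are assumptions):

def search_value(state, depth, actions, sample_next, gamma=0.9, width=5):
    """Estimate the value of `state` by expanding a sampled lookahead tree.
    Cost grows exponentially with depth, but not with the number of states."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(width):                   # sample a few successors per action
            s2, r = sample_next(state, a)
            total += r + gamma * search_value(s2, depth - 1, actions,
                                              sample_next, gamma, width)
        best = max(best, total / width)
    return best

def choose_action(state, depth, actions, sample_next, gamma=0.9, width=5):
    """Plan from the current state, execute the first action, then re-plan."""
    def q(a):
        total = 0.0
        for _ in range(width):
            s2, r = sample_next(state, a)
            total += r + gamma * search_value(s2, depth - 1, actions,
                                              sample_next, gamma, width)
        return total / width
    return max(actions, key=q)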
64
Leading to Learning
MDPs and value iteration are an important
foundation of reinforcement learning, or learning to
behave
Lecture 20 • 65
65