
Markov Decision Processes

Examples
Chapter 8
(adapted from https://inst.eecs.berkeley.edu/~cs188/fa18)
MDPs: Micro-Blackjack

In micro-blackjack, you repeatedly draw a card (with replacement) that is equally likely to be a 2, 3, or 4. You can either Draw or Stop if the total score of the cards you have drawn is less than 6. If your total score is 6 or higher, the game ends, and you receive a utility of 0. When you Stop, your utility is equal to your total score (up to 5), and the game ends. When you Draw, you receive no utility. There is no discount (γ = 1). Let's formulate this problem as an MDP with the following states: 0, 2, 3, 4, 5 and a Done state, for when the game ends.

2
MDPs: Micro-Blackjack

a) What is the transition function and the reward function for this MDP?
Note:
• States s: 0, 2, 3, 4, 5, Done
• Actions a: Draw, Stop
• Transition function T(s, a, s')?
(the probability that action a from s leads to s', i.e., P(s' | s, a))
• Reward function R(s, a, s')? (utility values)
• Discount: no discount (γ = 1)

3
MDPs: Micro-Blackjack

a) What is the transition function and the reward function for this MDP?
• T(s, Stop, Done) = 1 for every non-terminal state s
• T(0, Draw, s') = 1/3 for s' in {2, 3, 4}
• T(2, Draw, s') = 1/3 for s' in {4, 5, Done}  (totals 2+2, 2+3, 2+4 = 4, 5, 6; a total of 6 ends the game)
• T(3, Draw, 5) = 1/3  (total 3+2 = 5)
• T(3, Draw, Done) = 2/3  (totals 3+3 and 3+4 are 6 or higher)
• T(4, Draw, Done) = 1
• T(5, Draw, Done) = 1
• Reward function R(s, a, s'):
o R(s, Stop, Done) = s, s ⩽ 5
o R(s, a, s') = 0 otherwise
(A Python sketch of these tables appears below.)

4
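For concreteness, the transition and reward tables above can be written down in code. Here is a minimal Python sketch; the dictionary layout, the string action names "Draw"/"Stop", and the "Done" label are just one possible encoding, not something specified by the problem:

# Transition model: T[(state, action)] is a list of (next_state, probability) pairs.
# States are the current totals 0, 2, 3, 4, 5 plus the terminal state "Done".
THIRD = 1 / 3
T = {
    (0, "Draw"): [(2, THIRD), (3, THIRD), (4, THIRD)],
    (2, "Draw"): [(4, THIRD), (5, THIRD), ("Done", THIRD)],   # 2+4 = 6 ends the game
    (3, "Draw"): [(5, THIRD), ("Done", 2 * THIRD)],           # 3+3 and 3+4 both reach 6 or more
    (4, "Draw"): [("Done", 1.0)],
    (5, "Draw"): [("Done", 1.0)],
}
for s in (0, 2, 3, 4, 5):
    T[(s, "Stop")] = [("Done", 1.0)]   # Stop always ends the game

def R(state, action, next_state):
    """R(s, Stop, Done) = s (with s <= 5); every other transition gives 0 reward."""
    return state if action == "Stop" else 0

# Sanity check: every (state, action) row is a probability distribution.
for (s, a), outcomes in T.items():
    assert abs(sum(p for _, p in outcomes) - 1.0) < 1e-9, (s, a)

The later snippets in this chapter reuse this same encoding.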
MDPs: Micro-Blackjack

b) Fill in the following table of value iteration values for the first 4 iterations.
States 0 2 3 4 5
V0
V1
V2
V3
V4

5
MDPs: Micro-Blackjack

b) Fill in the following table of value iteration values for the first 4 iterations.
• V0 = 0 (for all states)
• Value iteration:
• V1, V2, V3, V4: compute the value of each action (Draw, Stop) and take the max
o Vk(Done) = 0, because Done is a terminal state
• The Stop values are the same in every iteration, since T(s, Stop, Done) = 1 and R(s, Stop, Done) = s (s ⩽ 5), R(s, a, s') = 0 otherwise:
o Vi(s=0, Stop) = 1 * [0 + 0] = 0
o Vi(s=2, Stop) = 1 * [2 + 0] = 2
o Vi(s=3, Stop) = 1 * [3 + 0] = 3
o Vi(s=4, Stop) = 1 * [4 + 0] = 4
o Vi(s=5, Stop) = 1 * [5 + 0] = 5

6
MDPs: Micro-Blackjack

b) Fill in the following table of value iteration values for the first 4 iterations.
• Reward function R(s, a, s’)
o R(s, Stop, Done) = s, s ⩽ 5
o R(s, a, s’) = 0 otherwise

• V0 = 0 (for all states)

• Value iteration:
• V1, V2, V3, V4: compute the value of each action (Draw, Stop)
o V1(s, Draw) = Σ_s' T(s, Draw, s') [R(s, Draw, s') + 1 * V0(s')]
o V1(s, Draw) = Σ_s' T(s, Draw, s') [0 + 1 * 0] = 0
(A code sketch of this backup appears below.)

7
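The update above is the standard Bellman backup V_{k+1}(s) = max_a Σ_s' T(s, a, s') [R(s, a, s') + γ V_k(s')]. Below is a small Python sketch of one such backup, assuming the same dictionary encoding as the earlier snippet (γ = 1):

GAMMA = 1.0
STATES = (0, 2, 3, 4, 5)

T = {  # (state, action) -> [(next_state, probability), ...]
    (0, "Draw"): [(2, 1/3), (3, 1/3), (4, 1/3)],
    (2, "Draw"): [(4, 1/3), (5, 1/3), ("Done", 1/3)],
    (3, "Draw"): [(5, 1/3), ("Done", 2/3)],
    (4, "Draw"): [("Done", 1.0)],
    (5, "Draw"): [("Done", 1.0)],
}
for s in STATES:
    T[(s, "Stop")] = [("Done", 1.0)]

def R(s, a, s2):
    return s if a == "Stop" else 0

def q_value(s, a, V):
    """Q_{k+1}(s, a) = sum over s' of T(s, a, s') * [R(s, a, s') + gamma * V_k(s')]."""
    return sum(p * (R(s, a, s2) + GAMMA * V.get(s2, 0.0)) for s2, p in T[(s, a)])

def backup(V):
    """One round of value iteration: V_{k+1}(s) = max over a of Q_{k+1}(s, a)."""
    return {s: max(q_value(s, "Draw", V), q_value(s, "Stop", V)) for s in STATES}

V0 = {s: 0.0 for s in STATES}   # "Done" is not stored; V.get(...) treats it as 0
V1 = backup(V0)
print(V1)   # {0: 0.0, 2: 2.0, 3: 3.0, 4: 4.0, 5: 5.0} -- the V1 row of the table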
MDPs: Micro-Blackjack

b) Fill in the following table of value iteration values for the first 4 iterations.
States            0         2         3         4         5
V0                0         0         0         0         0
V1: Draw | Stop   0 | 0     0 | 2     0 | 3     0 | 4     0 | 5
V1 (= max)        0         2         3         4         5
V2
V3
V4

8
MDPs: Micro-Blackjack

b) Fill in the following table of value iteration values for the first 4 iterations.
• Value Iteration:
• V1,2,3,4: compute value for each action (Draw, Stop)
o V2(s, Draw) = Σ_s' T(s, Draw, s') [R(s, Draw, s') + 1 * V1(s')]
o V2(s, Draw) = Σ_s' T(s, Draw, s') V1(s')
o V2(s=0, Draw) = 1/3*2 + 1/3*3 + 1/3*4 = 3
o V2(s=2, Draw) = 1/3*4 + 1/3*5 = 3
o V2(s=3, Draw) = 1/3*5 = 5/3
o V2(s=4, Draw) = 0
o V2(s=5, Draw) = 0
• Reward function R(s, a, s')
o R(s, Stop, Done) = s, s ⩽ 5
o R(s, a, s') = 0 otherwise

9
MDPs: Micro-Blackjack

b) Fill in the following table of value iteration values for the first 4 iterations.
States            0         2         3         4         5
V0                0         0         0         0         0
V1: Draw | Stop   0 | 0     0 | 2     0 | 3     0 | 4     0 | 5
V1 (= max)        0         2         3         4         5
V2: Draw | Stop   3 | 0     3 | 2     5/3 | 3   0 | 4     0 | 5
V2 (= max)        3         3         3         4         5
V3
V4

10
MDPs: Micro-Blackjack

b) Fill in the following table of value iteration values for the first 4 iterations.
• Value Iteration:
• V1,2,3,4: compute value for each action (Draw, Stop)
o V3(s, Draw) = Σ_s' T(s, Draw, s') [R(s, Draw, s') + 1 * V2(s')]
o V3(s, Draw) = Σ_s' T(s, Draw, s') V2(s')
o V3(s=0, Draw) = 1/3*3 + 1/3*3 + 1/3*4 = 10/3
o V3(s=2, Draw) = 1/3*4 + 1/3*5 = 3
o V3(s=3, Draw) = 1/3*5 = 5/3
o V3(s=4, Draw) = 0
o V3(s=5, Draw) = 0
• Reward function R(s, a, s')
o R(s, Stop, Done) = s, s ⩽ 5
o R(s, a, s') = 0 otherwise

11
MDPs: Micro-Blackjack

b) Fill in the following table of value iteration values for the first 4 iterations.
States            0         2         3         4         5
V0                0         0         0         0         0
V1: Draw | Stop   0 | 0     0 | 2     0 | 3     0 | 4     0 | 5
V1 (= max)        0         2         3         4         5
V2: Draw | Stop   3 | 0     3 | 2     5/3 | 3   0 | 4     0 | 5
V2 (= max)        3         3         3         4         5
V3: Draw | Stop   10/3 | 0  3 | 2     5/3 | 3   0 | 4     0 | 5
V3 (= max)        10/3      3         3         4         5
V4

12
MDPs: Micro-Blackjack

b) Fill in the following table of value iteration values for the first 4 iterations.
• Value Iteration:
• V1,2,3,4: compute value for each action (Draw, Stop)
o V4(s, Draw) = Σ_s' T(s, Draw, s') [R(s, Draw, s') + 1 * V3(s')]
o V4(s, Draw) = Σ_s' T(s, Draw, s') V3(s')
o V4(s=0, Draw) = 1/3*3 + 1/3*3 + 1/3*4 = 10/3
o V4(s=2, Draw) = 1/3*4 + 1/3*5 = 3
o V4(s=3, Draw) = 1/3*5 = 5/3
o V4(s=4, Draw) = 0
o V4(s=5, Draw) = 0
• Reward function R(s, a, s')
o R(s, Stop, Done) = s, s ⩽ 5
o R(s, a, s') = 0 otherwise

13
MDPs: Micro-Blackjack

b) Fill in the following table of value iteration values for the first 4 iterations.
States            0         2         3         4         5
V0                0         0         0         0         0
V1: Draw | Stop   0 | 0     0 | 2     0 | 3     0 | 4     0 | 5
V1 (= max)        0         2         3         4         5
V2: Draw | Stop   3 | 0     3 | 2     5/3 | 3   0 | 4     0 | 5
V2 (= max)        3         3         3         4         5
V3: Draw | Stop   10/3 | 0  3 | 2     5/3 | 3   0 | 4     0 | 5
V3 (= max)        10/3      3         3         4         5
V4: Draw | Stop   10/3 | 0  3 | 2     5/3 | 3   0 | 4     0 | 5
V4 (= max)        10/3      3         3         4         5

14
MDPs: Micro-Blackjack

b) Fill in the following table of value iteration values for the first 4 iterations.

States    0       2       3       4       5
V0        0       0       0       0       0
V1        0       2       3       4       5
V2        3       3       3       4       5
V3        10/3    3       3       4       5
V4        10/3    3       3       4       5

15
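The whole table can be reproduced by iterating the Bellman backup four times. The following is a sketch under the same encoding assumptions as the earlier snippets; fractions.Fraction is used only so that 10/3 stays exact:

from fractions import Fraction

STATES = (0, 2, 3, 4, 5)
GAMMA = 1
THIRD = Fraction(1, 3)

T = {  # (state, action) -> [(next_state, probability), ...]
    (0, "Draw"): [(2, THIRD), (3, THIRD), (4, THIRD)],
    (2, "Draw"): [(4, THIRD), (5, THIRD), ("Done", THIRD)],
    (3, "Draw"): [(5, THIRD), ("Done", 2 * THIRD)],
    (4, "Draw"): [("Done", Fraction(1))],
    (5, "Draw"): [("Done", Fraction(1))],
}
for s in STATES:
    T[(s, "Stop")] = [("Done", Fraction(1))]

def R(s, a, s2):
    return s if a == "Stop" else 0

V = {s: Fraction(0) for s in STATES}
print("V0:", [str(V[s]) for s in STATES])
for k in range(1, 5):
    # V_k(s) = max over a of sum_s' T(s, a, s') * [R(s, a, s') + gamma * V_{k-1}(s')]
    V = {
        s: max(
            sum(p * (R(s, a, s2) + GAMMA * V.get(s2, Fraction(0))) for s2, p in T[(s, a)])
            for a in ("Draw", "Stop")
        )
        for s in STATES
    }
    print(f"V{k}:", [str(V[s]) for s in STATES])

# Expected output (one row per iteration):
# V0: ['0', '0', '0', '0', '0']
# V1: ['0', '2', '3', '4', '5']
# V2: ['3', '3', '3', '4', '5']
# V3: ['10/3', '3', '3', '4', '5']
# V4: ['10/3', '3', '3', '4', '5']   <- V4 == V3, so value iteration has converged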
MDPs: Micro-Blackjack

c) You should have noticed that value iteration converged above. What
is the optimal policy for the MDP?

States    0       2       3       4       5
π*

16
MDPs: Micro-Blackjack

c) You should have noticed that value iteration converged above. What
is the optimal policy for the MDP?
• Optimal policy = the optimal action from each state s
• Q*(s, a) = expected utility of taking action a from state s and (thereafter) acting optimally → the optimal action is the one with the highest Q*(s, a)

States    0          2          3          4          5
          Draw Stop  Draw Stop  Draw Stop  Draw Stop  Draw Stop
V1        0    0     0    2     0    3     0    4     0    5
V2        3    0     3    2     5/3  3     0    4     0    5
V3        10/3 0     3    2     5/3  3     0    4     0    5
V4        10/3 0     3    2     5/3  3     0    4     0    5

17
MDPs: Micro-Blackjack

c) You should have noticed that value iteration converged above. What
is the optimal policy for the MDP?

States    0       2       3       4       5
π*        Draw    Draw    Stop    Stop    Stop

18
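Extracting π* is a one-step lookahead: for each state, compare Q*(s, Draw) and Q*(s, Stop) under the converged values and keep the better action. A sketch, with V* taken from the table above and the same encoding assumptions as before:

STATES = (0, 2, 3, 4, 5)

T = {  # (state, action) -> [(next_state, probability), ...]
    (0, "Draw"): [(2, 1/3), (3, 1/3), (4, 1/3)],
    (2, "Draw"): [(4, 1/3), (5, 1/3), ("Done", 1/3)],
    (3, "Draw"): [(5, 1/3), ("Done", 2/3)],
    (4, "Draw"): [("Done", 1.0)],
    (5, "Draw"): [("Done", 1.0)],
}
for s in STATES:
    T[(s, "Stop")] = [("Done", 1.0)]

def R(s, a, s2):
    return s if a == "Stop" else 0

V_star = {0: 10/3, 2: 3, 3: 3, 4: 4, 5: 5}   # converged values from the table

def q_star(s, a):
    """Q*(s, a) = sum_s' T(s, a, s') * [R(s, a, s') + V*(s')]   (gamma = 1)."""
    return sum(p * (R(s, a, s2) + V_star.get(s2, 0.0)) for s2, p in T[(s, a)])

pi_star = {s: max(("Draw", "Stop"), key=lambda a: q_star(s, a)) for s in STATES}
print(pi_star)   # {0: 'Draw', 2: 'Draw', 3: 'Stop', 4: 'Stop', 5: 'Stop'}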
MDPs: Micro-Blackjack

d) Perform one iteration of policy iteration for one step of this MDP,
starting from the fixed policy below:

States    0       2       3       4       5
πi        Draw    Stop    Draw    Stop    Draw
Vπi
πi+1

19
MDPs: Micro-Blackjack

d) Perform one iteration of policy iteration for one step of this MDP,
starting from the fixed policy below:
• Utilities for a Fixed Policy:

• Vπi(5) = 1 * [0 + 1.0 * 0] = 0   (Draw from 5)
• Vπi(4) = 1 * [4 + 1.0 * 0] = 4   (Stop from 4)
• Vπi(3) = 1/3 * [0 + 1.0 * Vπi(5)] + 2/3 * [0 + 1.0 * 0] = 0   (Draw from 3)
• Vπi(2) = 1 * [2 + 1.0 * 0] = 2   (Stop from 2)
• Vπi(0) = 1/3 * [0 + 1.0 * Vπi(2)] + 1/3 * [0 + 1.0 * Vπi(3)] + 1/3 * [0 + 1.0 * Vπi(4)]
         = 1/3 * (2 + 0 + 4) = 2   (Draw from 0)

20
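These fixed-policy utilities can be checked by iterating the policy-evaluation update Vπ(s) = Σ_s' T(s, π(s), s') [R(s, π(s), s') + γ Vπ(s')] until it stops changing. A sketch under the same encoding assumptions as the earlier snippets (the values settle after two sweeps; the 100-sweep loop is just a crude bound):

STATES = (0, 2, 3, 4, 5)

T = {  # (state, action) -> [(next_state, probability), ...]
    (0, "Draw"): [(2, 1/3), (3, 1/3), (4, 1/3)],
    (2, "Draw"): [(4, 1/3), (5, 1/3), ("Done", 1/3)],
    (3, "Draw"): [(5, 1/3), ("Done", 2/3)],
    (4, "Draw"): [("Done", 1.0)],
    (5, "Draw"): [("Done", 1.0)],
}
for s in STATES:
    T[(s, "Stop")] = [("Done", 1.0)]

def R(s, a, s2):
    return s if a == "Stop" else 0

pi = {0: "Draw", 2: "Stop", 3: "Draw", 4: "Stop", 5: "Draw"}   # the fixed policy pi_i

V = {s: 0.0 for s in STATES}
for _ in range(100):   # repeated fixed-policy Bellman updates (no max over actions)
    V = {
        s: sum(p * (R(s, pi[s], s2) + V.get(s2, 0.0)) for s2, p in T[(s, pi[s])])
        for s in STATES
    }
print(V)   # approximately {0: 2, 2: 2, 3: 0, 4: 4, 5: 0}, matching the Vπi row above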
MDPs: Micro-Blackjack

d) Perform one iteration of policy iteration for one step of this MDP,
starting from the fixed policy below:
• Utilities for a Fixed Policy:

States    0       2       3       4       5
πi        Draw    Stop    Draw    Stop    Draw
Vπi       2       2       0       4       0
πi+1

21
MDPs: Micro-Blackjack

d) Perform one iteration of policy iteration for one step of this MDP,
starting from the fixed policy below:
• Computing Actions from Q-Values:
• s = 5: Q(5, D) = 0; Q(5, S) = 5 → Qmax(s=5) = 5 → a = Stop
• s = 4: Q(4, D) = 0; Q(4, S) = 4 → Qmax(s=4) = 4 → a = Stop
• s = 3: Q(3, D) = T(3, D, 5) * Vπi(5) + T(3, D, Done) * Vπi(Done) = 0 + 0 = 0
  Q(3, S) = 3 → Qmax(s=3) = 3 → a = Stop
• s = 2: Q(2, D) = T(2, D, 4) * Vπi(4) + T(2, D, 5) * Vπi(5) + T(2, D, Done) * Vπi(Done)
  Q(2, D) = 1/3*4 + 1/3*0 + 1/3*0 = 4/3
  Q(2, S) = 2 → Qmax(s=2) = 2 → a = Stop
• s = 0: Q(0, D) = T(0, D, 2) * Vπi(2) + T(0, D, 3) * Vπi(3) + T(0, D, 4) * Vπi(4)
  Q(0, D) = 1/3*2 + 1/3*0 + 1/3*4 = 2
  Q(0, S) = 0 → Qmax(s=0) = 2 → a = Draw
(A code sketch of this improvement step appears below.)
22
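The improvement step is a greedy one-step lookahead with respect to Vπi. A sketch, with Vπi taken from the evaluation above and the same encoding assumptions as before:

STATES = (0, 2, 3, 4, 5)

T = {  # (state, action) -> [(next_state, probability), ...]
    (0, "Draw"): [(2, 1/3), (3, 1/3), (4, 1/3)],
    (2, "Draw"): [(4, 1/3), (5, 1/3), ("Done", 1/3)],
    (3, "Draw"): [(5, 1/3), ("Done", 2/3)],
    (4, "Draw"): [("Done", 1.0)],
    (5, "Draw"): [("Done", 1.0)],
}
for s in STATES:
    T[(s, "Stop")] = [("Done", 1.0)]

def R(s, a, s2):
    return s if a == "Stop" else 0

V_pi = {0: 2, 2: 2, 3: 0, 4: 4, 5: 0}   # utilities of the fixed policy pi_i (from above)

def q(s, a):
    """One-step lookahead Q(s, a) = sum_s' T(s, a, s') * [R(s, a, s') + Vπi(s')]."""
    return sum(p * (R(s, a, s2) + V_pi.get(s2, 0)) for s2, p in T[(s, a)])

pi_next = {s: max(("Draw", "Stop"), key=lambda a: q(s, a)) for s in STATES}
print(pi_next)   # {0: 'Draw', 2: 'Stop', 3: 'Stop', 4: 'Stop', 5: 'Stop'} -- this is pi_{i+1}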
MDPs: Micro-Blackjack

d) Perform one iteration of policy iteration for one step of this MDP,
starting from the fixed policy below:
• Computing Actions from Q-Values:

States    0       2       3       4       5
πi        Draw    Stop    Draw    Stop    Draw
Vπi       2       2       0       4       0
πi+1      Draw    Stop    Stop    Stop    Stop

23
