Chapter 08 - Markov Decision Processes: Examples
(adapted from https://inst.eecs.berkeley.edu/~cs188/fa18)
MDPs: Micro-Blackjack
In Micro-Blackjack you repeatedly draw cards; each draw is worth 2, 3, or 4 with probability 1/3. While your total is at most 5 you may either Draw again or Stop. Stopping ends the game and pays your current total as utility; if a draw pushes the total above 5, the game ends (Done) with utility 0.
MDPs: Micro-Blackjack
a) What is the transition function and the reward function for this MDP?
Note:
• States s: 0, 2, 3, 4, 5, Done
• Actions a: Draw, Stop
• Transition function T(s, a, s') = P(s' | s, a): the probability that action a taken in state s leads to s'
• Reward function R(s, a, s'): the utility received on that transition
• Discount: none (γ = 1)
MDPs: Micro-Blackjack
a) What is the transition function and the reward function for this MDP?
• T(s, Stop, Done) = 1 for every state s ≤ 5
• T(0, Draw, s') = 1/3 for s' ∈ {2, 3, 4}
• T(2, Draw, s') = 1/3 for s' ∈ {4, 5, Done} (totals 2+2, 2+3, 2+4; a total of 6 busts)
• T(3, Draw, 5) = 1/3 (total 3+2)
• T(3, Draw, Done) = 2/3 (totals 3+3 and 3+4 both bust)
• T(4, Draw, Done) = 1
• T(5, Draw, Done) = 1
• Reward function R(s, a, s'):
o R(s, Stop, Done) = s for s ≤ 5
o R(s, a, s') = 0 otherwise
(A minimal encoding of this model is sketched below.)
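The following Python sketch is not part of the original exercise; all names (STATES, transitions, reward, GAMMA) are illustrative assumptions. It encodes the states, transition function, and reward function above as plain helpers, assuming Done is the only terminal state.

```python
# Hypothetical encoding of the Micro-Blackjack MDP from part (a).
STATES = [0, 2, 3, 4, 5, "Done"]
ACTIONS = ["Draw", "Stop"]
GAMMA = 1.0  # no discount

def transitions(s, a):
    """Return a list of (next_state, probability) pairs, i.e. T(s, a, s')."""
    if s == "Done":
        return []                       # no transitions out of the terminal state
    if a == "Stop":
        return [("Done", 1.0)]          # stopping always ends the game
    # Drawing adds 2, 3, or 4 with probability 1/3 each; totals above 5 bust.
    return [((s + card) if s + card <= 5 else "Done", 1.0 / 3.0)
            for card in (2, 3, 4)]

def reward(s, a, s_next):
    """R(s, Stop, Done) = s for s <= 5; zero otherwise."""
    return float(s) if (a == "Stop" and s != "Done") else 0.0
```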
MDPs: Micro-Blackjack
b) Fill in the following table of value iteration values for the first 4 iterations.
States 0 2 3 4 5
V0
V1
V2
V3
V4
MDPs: Micro-Blackjack
b) Fill in the following table of value iteration values for the first 4 iterations.
• V0(s) = 0 for all states s
• Value iteration: for V1, ..., V4, compute the value of each action (Draw, Stop) and take the max
o Vk(Done) = 0 for every k, since Done is a terminal state
o V1(s, Draw) = Σs' T(s, Draw, s') [0 + V0(s')] = 0 for every s, because V0 = 0
o V1(s=0, Stop) = 1 * [0 + 0] = 0
o V1(s=2, Stop) = 1 * [2 + 0] = 2
o V1(s=3, Stop) = 1 * [3 + 0] = 3
o V1(s=4, Stop) = 1 * [4 + 0] = 4
o V1(s=5, Stop) = 1 * [5 + 0] = 5
(recall: T(s, Stop, Done) = 1 and R(s, Stop, Done) = s for s ≤ 5; R(s, a, s') = 0 otherwise)
A small Q-value helper illustrating this backup is sketched below.
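As a sketch only (reusing the hypothetical transitions, reward, and GAMMA helpers from the earlier block), the one-step backup used on this slide can be written as a small Q-value function:

```python
def q_value(s, a, V):
    """Q(s, a) = sum over s' of T(s, a, s') * [R(s, a, s') + GAMMA * V(s')]."""
    return sum(p * (reward(s, a, s_next) + GAMMA * V[s_next])
               for s_next, p in transitions(s, a))

# V1 from V0 = 0 everywhere: Stop yields s, Draw yields 0.
V0 = {s: 0.0 for s in STATES}
print(q_value(2, "Stop", V0))   # 2.0
print(q_value(2, "Draw", V0))   # 0.0
```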
MDPs: Micro-Blackjack
b) Fill in the following table of value iteration values for the first 4 iterations.
States        0       2       3       4       5
V0            0       0       0       0       0
  Draw, Stop  0, 0    0, 2    0, 3    0, 4    0, 5
V1            0       2       3       4       5
V2
V3
V4
(Each "Draw, Stop" row lists the two action values computed from the V row above it; the V row below is their maximum.)
MDPs: Micro-Blackjack
b) Fill in the following table of value iteration values for the first 4 iterations.
• Value iteration:
• V2: compute the value of each action (Draw, Stop) from V1 and take the max
o V2(s, Draw) = Σs' T(s, Draw, s') [R(s, Draw, s') + 1 * V1(s')]
o V2(s, Draw) = Σs' T(s, Draw, s') V1(s')   (since R(s, Draw, s') = 0)
o V2(s=0, Draw) = 1/3*2 + 1/3*3 + 1/3*4 = 3
o V2(s=2, Draw) = 1/3*4 + 1/3*5 + 1/3*0 = 3
o V2(s=3, Draw) = 1/3*5 + 2/3*0 = 5/3
o V2(s=4, Draw) = 0
o V2(s=5, Draw) = 0
(The Stop values are unchanged: V2(s, Stop) = s. A value-iteration loop reproducing the full table is sketched below.)
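Continuing the same sketch (with the hypothetical STATES, ACTIONS, and q_value helpers from above), a short value-iteration loop reproduces the V1 through V4 rows of the table:

```python
def value_iteration_step(V):
    """One Bellman backup: V_{k+1}(s) = max_a Q(s, a); the terminal state stays 0."""
    return {s: (0.0 if s == "Done" else max(q_value(s, a, V) for a in ACTIONS))
            for s in STATES}

V = {s: 0.0 for s in STATES}            # V0
for k in range(1, 5):                    # V1 .. V4
    V = value_iteration_step(V)
    print(k, [round(V[s], 3) for s in [0, 2, 3, 4, 5]])
# The last line should match V4 = [10/3, 3, 3, 4, 5].
```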
MDPs: Micro-Blackjack
b) Fill in the following table of value iteration values for the first 4 iterations.
States        0       2       3        4       5
V0            0       0       0        0       0
  Draw, Stop  0, 0    0, 2    0, 3     0, 4    0, 5
V1            0       2       3        4       5
  Draw, Stop  3, 0    3, 2    5/3, 3   0, 4    0, 5
V2            3       3       3        4       5
V3
V4
MDPs: Micro-Blackjack
b) Fill in the following table of value iteration values for the first 4 iterations.
• Value iteration:
• V3: compute the value of each action (Draw, Stop) from V2 and take the max
o V3(s, Draw) = Σs' T(s, Draw, s') [R(s, Draw, s') + 1 * V2(s')]
o V3(s, Draw) = Σs' T(s, Draw, s') V2(s')   (since R(s, Draw, s') = 0)
o V3(s=0, Draw) = 1/3*3 + 1/3*3 + 1/3*4 = 10/3
o V3(s=2, Draw) = 1/3*4 + 1/3*5 + 1/3*0 = 3
o V3(s=3, Draw) = 1/3*5 + 2/3*0 = 5/3
o V3(s=4, Draw) = 0
o V3(s=5, Draw) = 0
MDPs: Micro-Blackjack
b) Fill in the following table of value iteration values for the first 4 iterations.
States        0        2       3        4       5
V0            0        0       0        0       0
  Draw, Stop  0, 0     0, 2    0, 3     0, 4    0, 5
V1            0        2       3        4       5
  Draw, Stop  3, 0     3, 2    5/3, 3   0, 4    0, 5
V2            3        3       3        4       5
  Draw, Stop  10/3, 0  3, 2    5/3, 3   0, 4    0, 5
V3            10/3     3       3        4       5
V4
MDPs: Micro-Blackjack
b) Fill in the following table of value iteration values for the first 4 iterations.
• Value iteration:
• V4: compute the value of each action (Draw, Stop) from V3 and take the max
o V4(s, Draw) = Σs' T(s, Draw, s') [R(s, Draw, s') + 1 * V3(s')]
o V4(s, Draw) = Σs' T(s, Draw, s') V3(s')   (since R(s, Draw, s') = 0)
o V4(s=0, Draw) = 1/3*3 + 1/3*3 + 1/3*4 = 10/3
o V4(s=2, Draw) = 1/3*4 + 1/3*5 + 1/3*0 = 3
o V4(s=3, Draw) = 1/3*5 + 2/3*0 = 5/3
o V4(s=4, Draw) = 0
o V4(s=5, Draw) = 0
MDPs: Micro-Blackjack
b) Fill in the following table of value iteration values for the first 4 iterations.
States        0        2       3        4       5
V0            0        0       0        0       0
  Draw, Stop  0, 0     0, 2    0, 3     0, 4    0, 5
V1            0        2       3        4       5
  Draw, Stop  3, 0     3, 2    5/3, 3   0, 4    0, 5
V2            3        3       3        4       5
  Draw, Stop  10/3, 0  3, 2    5/3, 3   0, 4    0, 5
V3            10/3     3       3        4       5
  Draw, Stop  10/3, 0  3, 2    5/3, 3   0, 4    0, 5
V4            10/3     3       3        4       5
MDPs: Micro-Blackjack
b) Fill in the following table of value iteration values for the first 4 iterations.
States 0 2 3 4 5
V0 0 0 0 0 0
V1 0 2 3 4 5
V2 3 3 3 4 5
V3 10/3 3 3 4 5
V4 10/3 3 3 4 5
MDPs: Micro-Blackjack
c) You should have noticed that value iteration converged above. What
is the optimal policy for the MDP?
States   0   2   3   4   5
π*
MDPs: Micro-Blackjack
c) You should have noticed that value iteration converged above. What
is the optimal policy for the MDP?
• The optimal policy π*(s) selects the optimal action from each state s
• Q*(s, a) = expected utility of taking action a from state s and acting optimally thereafter → the optimal action is the one with the highest Q(s, a)
States   0           2          3           4          5
         Draw  Stop  Draw Stop  Draw  Stop  Draw Stop  Draw Stop
V1       0     0     0    2     0     3     0    4     0    5
V2       3     0     3    2     5/3   3     0    4     0    5
V3       10/3  0     3    2     5/3   3     0    4     0    5
V4       10/3  0     3    2     5/3   3     0    4     0    5
At convergence (V4), Draw has the higher Q-value at states 0 and 2, and Stop wins at states 3, 4, and 5 (see also the extraction sketch below).
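As a small illustration (again reusing the hypothetical q_value helper and the dictionary V produced by the value-iteration sketch above), the optimal policy can be read off by taking the argmax of the Q-values in each state:

```python
# Extract pi*(s) = argmax_a Q(s, a) from the converged values V.
pi_star = {s: max(ACTIONS, key=lambda a: q_value(s, a, V))
           for s in [0, 2, 3, 4, 5]}
print(pi_star)   # expected: {0: 'Draw', 2: 'Draw', 3: 'Stop', 4: 'Stop', 5: 'Stop'}
```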
MDPs: Micro-Blackjack
c) You should have noticed that value iteration converged above. What
is the optimal policy for the MDP?
States   0      2      3      4      5
π*       Draw   Draw   Stop   Stop   Stop
MDPs: Micro-Blackjack
d) Perform one iteration of policy iteration for one step of this MDP,
starting from the fixed policy below:
States   0      2      3      4      5
πi       Draw   Stop   Draw   Stop   Draw
Vπi
πi+1
MDPs: Micro-Blackjack
d) Perform one iteration of policy iteration for one step of this MDP,
starting from the fixed policy below:
• Utilities for a fixed policy: evaluate Vπ by following π instead of maximizing over actions:
o Vπ(s) = Σs' T(s, π(s), s') [R(s, π(s), s') + 1 * Vπ(s')]
MDPs: Micro-Blackjack
d) Perform one iteration of policy iteration for one step of this MDP,
starting from the fixed policy below:
• Utilities for the fixed policy πi:
o Vπ(5) = 0 (Draw from 5 always busts)
o Vπ(4) = 4 (Stop)
o Vπ(3) = 1/3 * Vπ(5) = 0 (Draw)
o Vπ(2) = 2 (Stop)
o Vπ(0) = 1/3*Vπ(2) + 1/3*Vπ(3) + 1/3*Vπ(4) = 1/3*2 + 0 + 1/3*4 = 2 (Draw)
States   0      2      3      4      5
πi       Draw   Stop   Draw   Stop   Draw
Vπi      2      2      0      4      0
πi+1
(A policy-evaluation sketch follows below.)
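A possible policy-evaluation sketch for the fixed policy πi, reusing the hypothetical helpers from the earlier blocks; since every policy reaches Done within a few steps, repeatedly applying the fixed-policy backup converges quickly:

```python
# Evaluate the fixed policy pi_i from part (d): V_pi(s) = Q(s, pi_i(s)) under V_pi.
pi_i = {0: "Draw", 2: "Stop", 3: "Draw", 4: "Stop", 5: "Draw"}

V_pi = {s: 0.0 for s in STATES}
for _ in range(10):                      # a few sweeps suffice for this small MDP
    V_pi = {s: (0.0 if s == "Done" else q_value(s, pi_i[s], V_pi)) for s in STATES}
print([round(V_pi[s], 6) for s in [0, 2, 3, 4, 5]])   # expected: [2.0, 2.0, 0.0, 4.0, 0.0]
```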
MDPs: Micro-Blackjack
d) Perform one iteration of policy iteration for one step of this MDP,
starting from the fixed policy below:
• Computing actions from Q-values (policy improvement), using Vπ from the previous slide (D = Draw, S = Stop):
• s = 5: Q(5, D) = 0; Q(5, S) = 5 → Qmax(s=5) = 5 → a = Stop
• s = 4: Q(4, D) = 0; Q(4, S) = 4 → Qmax(s=4) = 4 → a = Stop
• s = 3: Q(3, D) = T(3, D, 5)*Vπ(5) + T(3, D, Done)*Vπ(Done) = 0 + 0 = 0
  Q(3, S) = 3 → Qmax(s=3) = 3 → a = Stop
• s = 2: Q(2, D) = T(2, D, 4)*Vπ(4) + T(2, D, 5)*Vπ(5) + T(2, D, Done)*Vπ(Done)
  Q(2, D) = 1/3*4 + 1/3*0 + 1/3*0 = 4/3
  Q(2, S) = 2 → Qmax(s=2) = 2 → a = Stop
• s = 0: Q(0, D) = T(0, D, 2)*Vπ(2) + T(0, D, 3)*Vπ(3) + T(0, D, 4)*Vπ(4)
  Q(0, D) = 1/3*2 + 1/3*0 + 1/3*4 = 2
  Q(0, S) = 0 → Qmax(s=0) = 2 → a = Draw
(A one-step policy-improvement sketch follows below.)
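And a one-step policy-improvement sketch (hypothetical names, reusing q_value and the V_pi computed above) that produces πi+1:

```python
# Policy improvement: act greedily with respect to V_pi.
pi_next = {s: max(ACTIONS, key=lambda a: q_value(s, a, V_pi))
           for s in [0, 2, 3, 4, 5]}
print(pi_next)   # expected: {0: 'Draw', 2: 'Stop', 3: 'Stop', 4: 'Stop', 5: 'Stop'}
```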
MDPs: Micro-Blackjack
d) Perform one iteration of policy iteration for one step of this MDP,
starting from the fixed policy below:
• Computing Actions from Q-Values:
States   0      2      3      4      5
πi       Draw   Stop   Draw   Stop   Draw
Vπi      2      2      0      4      0
πi+1     Draw   Stop   Stop   Stop   Stop