CS 748, Spring 2021: Quizzes till Quiz 4
February 5, 2021
Note. Provide justifications/calculations/steps along with each answer to illustrate how you arrived
at the answer. State your arguments clearly, in logical sequence. You will not receive credit
for giving an answer without sufficient explanation.
Submission. You can either write down your answer by hand or type it in. Be sure to mention
your roll number. Upload to Moodle as a pdf file.
Week 4
Question. This week you are given a short programming assignment related to Dyna-Q, and also
a regular question related to the helicopter control application of Ng et al. (2003) (see the week’s
lecture slides for reference).
a. Recall the “Windy Gridworld” task (Example 6.5, Sutton and Barto, 2018) that you implemented
as a part of a programming assignment in CS 747. In particular, consider Exercise 6.10 from the
textbook, which asks you to implement a “stochastic wind”, with the agent allowed to make
“King’s moves”. In your CS 747 assignment, you have implemented a learning agent that uses
Sarsa; you can reuse your code. (1) First, change the learning algorithm to Q-learning (using the
same values for α and ε). Generate a plot similar to the one in the textbook under Example 6.5.
(2) Implement Dyna-Q. The textbook provides pseudocode for deterministic environments, but
the class lecture (Slide 5) generalises it to stochastic environments, which is what you will have
to use in this case (a sketch of one such generalisation appears at the end of this week’s question).
Observe that the parameter N specifies the number of model-based updates. Your earlier plot
can be viewed as an instance of Dyna-Q with N = 0. Add to it curves for N = 2, N = 5, and
N = 20. Let each plot be an average of at least 10 independent runs with different random
seeds. You can decide whether to place all the plots in a single graph or to have separate
graphs. [4 marks]
b. Ng et al. (2003) make minor changes to the policy class $\Pi_H$ used for the hovering task to
obtain the policy class $\Pi_T$ for trajectory-following. What changes are these, and why are
they brought in? $\Pi_T$ appears to be a superset of $\Pi_H$; why do you think the authors did not
use $\Pi_T$ itself for hovering, too? [2 marks]
In a single pdf file, put your plot(s) for (a) along with a description of your experiment (hyperparameters,
design choices, observations, etc.); also include your answer to (b). Submit a compressed
directory containing the pdf file and your code for (a). We will examine your code, but not run it
ourselves. Feel free to use a programming language/environment of your choice.
Week 2
Question. This question takes you through the steps to construct a proof that applying MCTS at
every encountered state with a non-optimal rollout policy πR will lead to higher aggregate reward
than that obtained by following πR itself.
This nonstationary policy π is constructed in the agent’s mind, but it is eventually $\pi^0$ that
is applied in the (real) environment. By our convention, $\pi^d, \pi^{d+1}, \pi^{d+2}, \dots$ all refer to $\pi_R$.
Show that $V^{(\pi^{d-1}, \pi^d, \pi^{d+1}, \dots)} \succ V^{\pi_R}$, and show for $i \in \{0, 1, \dots, d-2\}$ that $V^{(\pi^{i}, \pi^{i+1}, \pi^{i+2}, \dots)} \succeq V^{(\pi^{i+1}, \pi^{i+2}, \pi^{i+3}, \dots)}$. [4 marks]
c. Put together the results from a and b to show that $V^{\pi^0} \succ V^{\pi_R}$. [1 mark]
Solution.
a. Observe that $V^{\pi} = B^{\mathrm{head}(\pi)}(V^{\mathrm{tail}(\pi)})$. Thus, the antecedent is $B^{\mathrm{head}(\pi)}(V^{\mathrm{tail}(\pi)}) \succeq V^{\mathrm{tail}(\pi)}$.
Since the Bellman operator preserves $\succeq$, a repeated application gives
b. For $s \in S$,
$$\begin{aligned}
V^{(\pi^{d-1}, \pi^d, \pi^{d+1}, \dots)}(s) &= \max_{a \in A} \sum_{s' \in S} T(s, a, s')\{R(s, a, s') + \gamma V^{\pi_R}(s')\} \\
&\ge \sum_{s' \in S} T(s, \pi_R(s), s')\{R(s, \pi_R(s), s') + \gamma V^{\pi_R}(s')\} \\
&= V^{\pi_R}(s),
\end{aligned}$$
and moreover, since $\pi_R$ is not optimal, there is some state $\bar{s} \in S$ such that $V^{(\pi^{d-1}, \pi^d, \pi^{d+1}, \dots)}(\bar{s}) >
V^{\pi_R}(\bar{s})$. In short, $V^{(\pi^{d-1}, \pi^d, \pi^{d+1}, \dots)} \succ V^{\pi_R}$. If we assume that $V^{(\pi^{i+1}, \pi^{i+2}, \pi^{i+3}, \dots)} \succeq V^{(\pi^{i+2}, \pi^{i+3}, \pi^{i+4}, \dots)}$,
we observe that for $s \in S$,
$$\begin{aligned}
V^{(\pi^{i}, \pi^{i+1}, \pi^{i+2}, \dots)}(s) &= \max_{a \in A} \sum_{s' \in S} T(s, a, s')\{R(s, a, s') + \gamma V^{(\pi^{i+1}, \pi^{i+2}, \pi^{i+3}, \dots)}(s')\} \\
&\ge \sum_{s' \in S} T(s, \pi^{i+1}(s), s')\{R(s, \pi^{i+1}(s), s') + \gamma V^{(\pi^{i+1}, \pi^{i+2}, \pi^{i+3}, \dots)}(s')\} \\
&\ge \sum_{s' \in S} T(s, \pi^{i+1}(s), s')\{R(s, \pi^{i+1}(s), s') + \gamma V^{(\pi^{i+2}, \pi^{i+3}, \pi^{i+4}, \dots)}(s')\} \\
&= V^{(\pi^{i+1}, \pi^{i+2}, \pi^{i+3}, \dots)}(s),
\end{aligned}$$
that is, $V^{(\pi^{i}, \pi^{i+1}, \pi^{i+2}, \dots)} \succeq V^{(\pi^{i+1}, \pi^{i+2}, \pi^{i+3}, \dots)}$.
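As a quick numerical illustration of the base-case inequality above (not part of the quiz’s required answer), one can construct any small MDP, fix an arbitrary deterministic rollout policy $\pi_R$, solve for $V^{\pi_R}$, and verify that the one-step greedy lookahead value dominates it at every state. The Python sketch below does this for a randomly generated MDP; the sizes, rewards, and seed are arbitrary choices, not taken from the quiz.

    import numpy as np

    nS, nA, gamma = 4, 3, 0.9
    rng = np.random.default_rng(0)
    T = rng.random((nS, nA, nS)); T /= T.sum(axis=2, keepdims=True)   # T[s, a, s']
    R = rng.random((nS, nA, nS))                                      # R[s, a, s']
    pi_R = rng.integers(nA, size=nS)                                  # a deterministic rollout policy

    # V^{pi_R} solves (I - gamma * P) V = r, with P and r induced by pi_R.
    P = T[np.arange(nS), pi_R]                        # P[s, s'] = T(s, pi_R(s), s')
    r = (T * R)[np.arange(nS), pi_R].sum(axis=1)      # expected one-step reward under pi_R
    V_piR = np.linalg.solve(np.eye(nS) - gamma * P, r)

    # One-step greedy lookahead: max_a sum_s' T(s,a,s'){R(s,a,s') + gamma * V^{pi_R}(s')}.
    Q = (T * (R + gamma * V_piR)).sum(axis=2)         # Q[s, a]
    V_lookahead = Q.max(axis=1)

    assert np.all(V_lookahead >= V_piR - 1e-10)       # the dominance claimed in the base case
    print(V_piR, V_lookahead)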
Week 1
Question.
a. Consider the value iteration algorithm provided on Slide 12 in the week’s lecture. Assume
that the algorithm is applied to a two-player zero-sum Markov game (S, A, O, R, T, γ). Using
Banach’s fixed point theorem, show that the algorithm converges to the game’s minimax
value $V^\star$ (which you can assume to be unique, even though we did not present a proof in the
lecture). Clearly state any results that your derivation uses. (An illustrative code sketch of the
backup that such an algorithm repeats appears after part b below.) [3 marks]
b. Consider G(2, 2), the class of two-player general-sum matrix games in which each of the
players, A and O, has exactly two actions. In a “pure-strategy Nash equilibrium” $(\pi^A, \pi^O)$,
the individual strategies $\pi^A$ and $\pi^O$ are both pure (deterministic). Do there exist games in
G(2, 2) that have no pure-strategy Nash equilibria, or are all games in G(2, 2) guaranteed to
have at least one pure-strategy Nash equilibrium? Justify your answer. [2 marks]
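For concreteness, here is a Python sketch of one sweep of minimax value iteration for a zero-sum Markov game, the kind of backup part (a) refers to. It is an illustration under assumed array layouts, not a reproduction of Slide 12: the inner $\max_{\pi \in PD(A)} \min_{o \in O}$ problem at each state is solved as a linear program with scipy.optimize.linprog.

    import numpy as np
    from scipy.optimize import linprog

    def minimax_backup(R, T, V, gamma):
        # R[s, a, o]: reward to the maximising agent; T[s, a, o, s']: transition probabilities;
        # V[s]: current value estimate. Returns the updated values after one sweep.
        nS, nA, nO = R.shape
        V_new = np.zeros(nS)
        for s in range(nS):
            # Q[a, o] = R(s, a, o) + gamma * sum_s' T(s, a, o, s') V(s')
            Q = R[s] + gamma * T[s].dot(V)
            # Solve max_{pi in PD(A)} min_o sum_a pi(a) Q[a, o] as a linear program.
            # Variables x = (pi(0), ..., pi(nA-1), v); linprog minimises, so minimise -v.
            c = np.zeros(nA + 1); c[-1] = -1.0
            A_ub = np.hstack([-Q.T, np.ones((nO, 1))])      # v <= sum_a pi(a) Q[a, o] for every o
            b_ub = np.zeros(nO)
            A_eq = np.ones((1, nA + 1)); A_eq[0, -1] = 0.0  # pi is a probability distribution
            b_eq = np.array([1.0])
            bounds = [(0.0, None)] * nA + [(None, None)]
            res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
            V_new[s] = res.x[-1]                            # the state's minimax backup value
        return V_new

Iterating V ← minimax_backup(R, T, V, gamma) from any initial V is the process whose convergence to $V^\star$ part (a) asks you to establish.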
Solution.
a. In this setting value iteration repeatedly applies operator $B^\star: (S \to \mathbb{R}) \to (S \to \mathbb{R})$ defined by
$$B^\star(X)(s) \stackrel{\text{def}}{=} \max_{\pi \in PD(A)} \min_{o \in O} \sum_{a \in A} \pi(a)\left\{R(s, a, o) + \gamma \sum_{s' \in S} T(s, a, o, s')\,X(s')\right\}$$
for $X: S \to \mathbb{R}$ and $s \in S$. First, observe that $V^\star$ is the unique fixed point of this operator. If
we start the iteration with any arbitrary $V^0: S \to \mathbb{R}$ and set $V^{t+1} \leftarrow B^\star(V^t)$ for $t \ge 0$, Banach’s
fixed point theorem assures convergence to $V^\star$ if $B^\star$ is a contraction mapping. We show the same
by considering arbitrary $X: S \to \mathbb{R}$ and $Y: S \to \mathbb{R}$. We use the rule that for functions $f: U \to \mathbb{R}$
and $g: U \to \mathbb{R}$ on the same domain $U$, $|\max_{u \in U} f(u) - \max_{u \in U} g(u)| \le \max_{u \in U} |f(u) - g(u)|$. For
convenience we use $\alpha(Z, s, \pi, o)$ as shorthand for $\sum_{a \in A} \pi(a)\{R(s, a, o) + \gamma \sum_{s' \in S} T(s, a, o, s')\,Z(s')\}$.
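From this setup, one standard way to complete the contraction bound is the following sketch (using the stated rule for max, its analogue for min, and the fact that $\pi$ and $T(s, a, o, \cdot)$ are probability distributions); it is offered as a sketch, not necessarily the exact continuation of the original solution:
$$\begin{aligned}
|B^\star(X)(s) - B^\star(Y)(s)| &= \Big|\max_{\pi \in PD(A)} \min_{o \in O} \alpha(X, s, \pi, o) - \max_{\pi \in PD(A)} \min_{o \in O} \alpha(Y, s, \pi, o)\Big| \\
&\le \max_{\pi \in PD(A)} \Big|\min_{o \in O} \alpha(X, s, \pi, o) - \min_{o \in O} \alpha(Y, s, \pi, o)\Big| \\
&\le \max_{\pi \in PD(A)} \max_{o \in O} |\alpha(X, s, \pi, o) - \alpha(Y, s, \pi, o)| \\
&= \max_{\pi \in PD(A)} \max_{o \in O} \Big|\gamma \sum_{a \in A} \pi(a) \sum_{s' \in S} T(s, a, o, s')\big(X(s') - Y(s')\big)\Big| \\
&\le \gamma \max_{s' \in S} |X(s') - Y(s')|,
\end{aligned}$$
so that $\|B^\star(X) - B^\star(Y)\|_\infty \le \gamma \|X - Y\|_\infty$; since $\gamma < 1$, $B^\star$ is a contraction mapping and Banach’s fixed point theorem applies.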