
CS 748 (Spring 2021): Weekly Quizzes

Instructor: Shivaram Kalyanakrishnan

February 5, 2021

Note. Provide justifications/calculations/steps along with each answer to illustrate how you arrived at the answer. State your arguments clearly, in logical sequence. You will not receive credit for giving an answer without sufficient explanation.

Submission. You can either write down your answer by hand or type it in. Be sure to mention
your roll number. Upload to Moodle as a pdf file.

Week 4
Question. This week you are given a short programming assignment related to Dyna-Q, and also
a regular question related to the helicopter control application of Ng et al. (2003) (see the week’s
lecture slides for reference).

a. Recall the “Windy Gridworld” task (Example 6.5, Sutton and Barto, 2018) that you implemented as a part of a programming assignment in CS 747. In particular consider Exercise 6.10 from the textbook, which asks you to implement a “stochastic wind”, with the agent allowed to make “King’s moves”. In your CS 747 assignment, you implemented a learning agent that uses Sarsa; you can reuse your code. (1) First, change the learning algorithm to Q-learning (using the same values for α and ε). Generate a plot similar to the one in the textbook under Example 6.5. (2) Implement Dyna-Q. The textbook provides pseudocode for deterministic environments, but the class lecture (Slide 5) generalises it to stochastic environments, which is what you will have to use in this case. Observe that the parameter N specifies the number of model-based updates. Your earlier plot can be viewed as an instance of Dyna-Q with N = 0. To it, add curves for N = 2, N = 5, and N = 20. Let each plot be an average of at least 10 independent runs with different random seeds. You can decide whether to place all the plots in a single graph or to have separate graphs. (An illustrative sketch of Dyna-Q for stochastic environments is given after this question.) [4 marks]

b. Ng et al. (2003) make minor changes to the policy class $\Pi_H$ used for the hovering task to obtain the policy class $\Pi_T$ for trajectory-following. What changes are these, and why are they brought in? $\Pi_T$ appears to be a superset of $\Pi_H$; why do you think the authors did not use $\Pi_T$ itself for hovering, too? [2 marks]

In a single pdf file, put your plot(s) for (a) along with a description of your experiment (hyperpa-
rameters, design choices, observations, etc.); also include your answer to (b). Submit a compressed
directory containing the pdf file and your code for (a). We will examine your code, but not run it
ourselves. Feel free to use a programming language/environment of your choice.
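
For reference, here is a minimal, illustrative Dyna-Q sketch for a stochastic environment, written in Python. It is not the expected solution and does not reproduce the lecture’s Slide 5 pseudocode: the grid layout follows Example 6.5 (a 7×10 grid, start (3, 0), goal (3, 7), column-wise wind), stochasticity is handled here simply by storing every observed transition and resampling from that list during planning, and the hyperparameter values (alpha, epsilon, N) are placeholders you would tune in your own experiments.

import random
from collections import defaultdict

# King's moves: the 8 neighbouring cells.
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1)]
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]   # mean upward wind per column (Example 6.5)
ROWS, COLS = 7, 10
START, GOAL = (3, 0), (3, 7)

def step(state, action):
    """One environment step with stochastic wind; returns (next_state, reward, done)."""
    r, c = state
    dr, dc = MOVES[action]
    wind = WIND[c] + (random.choice([-1, 0, 1]) if WIND[c] > 0 else 0)  # Exercise 6.10
    r = min(max(r + dr - wind, 0), ROWS - 1)
    c = min(max(c + dc, 0), COLS - 1)
    return (r, c), -1.0, (r, c) == GOAL

def dyna_q(episodes=200, alpha=0.5, epsilon=0.1, gamma=1.0, N=5, seed=0):
    random.seed(seed)
    Q = defaultdict(float)      # Q[(state, action)]
    model = defaultdict(list)   # model[(state, action)] -> observed (reward, next_state, done)
    steps_per_episode = []
    for _ in range(episodes):
        s, done, steps = START, False, 0
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                a = random.randrange(len(MOVES))
            else:
                a = max(range(len(MOVES)), key=lambda x: Q[(s, x)])
            s2, rew, done = step(s, a)
            # Direct (Q-learning) update from real experience.
            target = rew + (0.0 if done else gamma * max(Q[(s2, x)] for x in range(len(MOVES))))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # Store the observed transition; resampling from this list approximates
            # the empirical transition distribution for the stochastic model.
            model[(s, a)].append((rew, s2, done))
            # N model-based (planning) updates on previously visited state-action pairs.
            for _ in range(N):
                ps, pa = random.choice(list(model.keys()))
                prew, ps2, pdone = random.choice(model[(ps, pa)])
                ptarget = prew + (0.0 if pdone else gamma * max(Q[(ps2, x)] for x in range(len(MOVES))))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
            steps += 1
        steps_per_episode.append(steps)
    return steps_per_episode

if __name__ == "__main__":
    for n in [0, 2, 5, 20]:
        runs = dyna_q(N=n)
        print(n, sum(runs[-50:]) / 50.0)   # rough average episode length over the last 50 episodes

Setting N = 0 in this sketch recovers plain Q-learning, which is why the question treats the Q-learning plot as the N = 0 curve.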
Week 2
Question. This question takes you through the steps to construct a proof that applying MCTS at every encountered state with a non-optimal rollout policy $\pi_R$ will lead to higher aggregate reward than that obtained by following $\pi_R$ itself.

a. For MDP $(S, A, T, R, \gamma)$, a nonstationary policy $\pi = (\pi^0, \pi^1, \pi^2, \dots)$ is an infinite sequence of stationary policies $\pi^i: S \to A$ for $i \ge 0$ (we assume the per-time-step policies are deterministic and Markovian). $V^\pi$ and $Q^\pi$ have the usual definitions. For $s \in S$, $a \in A$, (1) $V^\pi(s)$ is the expected long-term reward obtained by acting according to $\pi$ from state $s$, and (2) $Q^\pi(s, a)$ is the expected long-term reward obtained by taking $a$ from $s$, and thereafter acting according to $\pi$ ($\pi^0$ gives the second action, $\pi^1$ gives the third, etc.). Denote by $\text{head}(\pi)$ the stationary policy $\pi^0$ that is first in the sequence, and by $\text{tail}(\pi)$ the remaining sequence $(\pi^1, \pi^2, \pi^3, \dots)$, itself a nonstationary policy. Show that $V^\pi \succeq V^{\text{tail}(\pi)} \implies V^{\text{head}(\pi)} \succeq V^{\text{tail}(\pi)}$. It will help to begin with a Bellman equation connecting $V^\pi$ and $V^{\text{tail}(\pi)}$. Thereafter the proof will follow the structure in the analysis of the policy improvement theorem. [3 marks]

b. Consider an application of MCTS in which a tree of depth $d \ge 1$ is constructed and rollout policy $\pi_R: S \to A$ is used. Assume that an infinite number of rollouts is performed: hence evaluations within the search tree, and those at the leaves using $\pi_R$, are exact. Although tree search is undertaken afresh from each “current” state, we may equivalently view tree search as the application of a nonstationary policy $\pi = (\pi^0, \pi^1, \pi^2, \dots, \pi^{d-1}, \pi_R, \pi_R, \pi_R, \dots)$ on the original MDP, starting from the current state, where for $i \in \{0, 1, \dots, d-1\}$, $s \in S$,
$$\pi^i(s) = \underset{a \in A}{\operatorname{argmax}} \sum_{s' \in S} T(s, a, s')\{R(s, a, s') + \gamma V^{(\pi^{i+1}, \pi^{i+2}, \pi^{i+3}, \dots)}(s')\}.$$
This nonstationary policy $\pi$ is constructed in the agent’s mind, but it is eventually $\pi^0$ that is applied in the (real) environment. By our convention, $\pi^d, \pi^{d+1}, \pi^{d+2}, \dots$ all refer to $\pi_R$. Show that $V^{(\pi^{d-1}, \pi^d, \pi^{d+1}, \dots)} \succ V^{\pi_R}$, and show for $i \in \{0, 1, \dots, d-2\}$ that $V^{(\pi^i, \pi^{i+1}, \pi^{i+2}, \dots)} \succeq V^{(\pi^{i+1}, \pi^{i+2}, \pi^{i+3}, \dots)}$. [4 marks]
c. Put together the results from a and b to show that $V^{\pi^0} \succ V^{\pi_R}$. [1 mark]

Solution.

a. Observe that $V^\pi = B^{\text{head}(\pi)}(V^{\text{tail}(\pi)})$. Thus, the antecedent is $B^{\text{head}(\pi)}(V^{\text{tail}(\pi)}) \succeq V^{\text{tail}(\pi)}$. Since the Bellman operator preserves $\succeq$, a repeated application gives
$$\begin{aligned}
B^{\text{head}(\pi)}(V^{\text{tail}(\pi)}) &\succeq V^{\text{tail}(\pi)}, \\
(B^{\text{head}(\pi)})^2(V^{\text{tail}(\pi)}) &\succeq B^{\text{head}(\pi)}(V^{\text{tail}(\pi)}), \\
(B^{\text{head}(\pi)})^3(V^{\text{tail}(\pi)}) &\succeq (B^{\text{head}(\pi)})^2(V^{\text{tail}(\pi)}), \\
&\;\;\vdots
\end{aligned}$$
and hence $\lim_{l \to \infty} (B^{\text{head}(\pi)})^l(V^{\text{tail}(\pi)}) = V^{\text{head}(\pi)} \succeq V^{\text{tail}(\pi)}$.
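
For completeness (this is not spelled out in the original solution), the operator and the monotonicity property invoked above can be written out under the standard definition of the single-policy Bellman operator:
$$B^{\text{head}(\pi)}(X)(s) = \sum_{s' \in S} T(s, \text{head}(\pi)(s), s')\{R(s, \text{head}(\pi)(s), s') + \gamma X(s')\}, \qquad s \in S,$$
so that $V^\pi = B^{\text{head}(\pi)}(V^{\text{tail}(\pi)})$ is exactly the Bellman equation hinted at in the question, and for $X \succeq Y$,
$$B^{\text{head}(\pi)}(X)(s) - B^{\text{head}(\pi)}(Y)(s) = \gamma \sum_{s' \in S} T(s, \text{head}(\pi)(s), s')\{X(s') - Y(s')\} \ge 0,$$
which is the sense in which the operator preserves $\succeq$.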

b. For $s \in S$,
$$\begin{aligned}
V^{(\pi^{d-1}, \pi^{d}, \pi^{d+1}, \dots)}(s) &= \max_{a \in A} \sum_{s' \in S} T(s, a, s')\{R(s, a, s') + \gamma V^{\pi_R}(s')\} \\
&\ge \sum_{s' \in S} T(s, \pi_R(s), s')\{R(s, \pi_R(s), s') + \gamma V^{\pi_R}(s')\} \\
&= V^{\pi_R}(s)
\end{aligned}$$
(the first equality holds because $\pi^{d-1}(s)$ is, by definition, a maximising action, and $\pi^d, \pi^{d+1}, \dots$ all refer to $\pi_R$), and moreover, since $\pi_R$ is not optimal, there is some state $\bar{s} \in S$ such that $V^{(\pi^{d-1}, \pi^{d}, \pi^{d+1}, \dots)}(\bar{s}) > V^{\pi_R}(\bar{s})$. In short, $V^{(\pi^{d-1}, \pi^{d}, \pi^{d+1}, \dots)} \succ V^{\pi_R}$. If we assume that $V^{(\pi^{i+1}, \pi^{i+2}, \pi^{i+3}, \dots)} \succeq V^{(\pi^{i+2}, \pi^{i+3}, \pi^{i+4}, \dots)}$, we observe that for $s \in S$,
$$\begin{aligned}
V^{(\pi^{i}, \pi^{i+1}, \pi^{i+2}, \dots)}(s) &= \max_{a \in A} \sum_{s' \in S} T(s, a, s')\{R(s, a, s') + \gamma V^{(\pi^{i+1}, \pi^{i+2}, \pi^{i+3}, \dots)}(s')\} \\
&\ge \sum_{s' \in S} T(s, \pi^{i+1}(s), s')\{R(s, \pi^{i+1}(s), s') + \gamma V^{(\pi^{i+1}, \pi^{i+2}, \pi^{i+3}, \dots)}(s')\} \\
&\ge \sum_{s' \in S} T(s, \pi^{i+1}(s), s')\{R(s, \pi^{i+1}(s), s') + \gamma V^{(\pi^{i+2}, \pi^{i+3}, \pi^{i+4}, \dots)}(s')\} \\
&= V^{(\pi^{i+1}, \pi^{i+2}, \pi^{i+3}, \dots)}(s),
\end{aligned}$$
which completes our proof.


c. Applying the result in b for $i = 0$ gives $V^\pi \succeq V^{\text{tail}(\pi)}$, which, from a, implies $V^{\pi^0} \succeq V^{\text{tail}(\pi)}$. However, from b, we also have
$$V^{\text{tail}(\pi)} = V^{(\pi^1, \pi^2, \pi^3, \dots)} \succeq V^{(\pi^2, \pi^3, \pi^4, \dots)} \succeq \cdots \succeq V^{(\pi^{d-1}, \pi^d, \pi^{d+1}, \dots)} \succ V^{\pi_R},$$
which means $V^{\pi^0} \succ V^{\pi_R}$.

Week 1
Question.
a. Consider the value iteration algorithm provided on Slide 12 in the week’s lecture. Assume that the algorithm is applied to a two-player zero-sum Markov game $(S, A, O, R, T, \gamma)$. Using Banach’s fixed point theorem, show that the algorithm converges to the game’s minimax value $V^*$ (which you can assume to be unique, even though we did not present a proof in the lecture). Clearly state any results that your derivation uses. [3 marks]
b. Consider $G(2, 2)$, the class of two-player general-sum matrix games in which each of the players, A and O, has exactly two actions. In a “pure-strategy Nash equilibrium” $(\pi^A, \pi^O)$, the individual strategies $\pi^A$ and $\pi^O$ are both pure (deterministic). Do there exist games in $G(2, 2)$ that have no pure-strategy Nash equilibria, or are all games in $G(2, 2)$ guaranteed to have at least one pure-strategy Nash equilibrium? Justify your answer. [2 marks]
Solution.
a. In this setting value iteration repeatedly applies the operator $B^*: (S \to \mathbb{R}) \to (S \to \mathbb{R})$ defined by
$$B^*(X)(s) \stackrel{\text{def}}{=} \max_{\pi \in \text{PD}(A)} \min_{o \in O} \sum_{a \in A} \pi(a) \left\{ R(s, a, o) + \gamma \sum_{s' \in S} T(s, a, o, s') X(s') \right\}$$
for $X: S \to \mathbb{R}$ and $s \in S$. First, observe that $V^*$ is the unique fixed point of this operator. If we start the iteration with any arbitrary $V^0: S \to \mathbb{R}$ and set $V^{t+1} \leftarrow B^*(V^t)$ for $t \ge 0$, Banach’s fixed point theorem assures convergence to $V^*$ if $B^*$ is a contraction mapping. We show the same by considering arbitrary $X: S \to \mathbb{R}$ and $Y: S \to \mathbb{R}$. We use the rule that for functions $f: U \to \mathbb{R}$ and $g: U \to \mathbb{R}$ on the same domain $U$, $|\max_{u \in U} f(u) - \max_{u \in U} g(u)| \le \max_{u \in U} |f(u) - g(u)|$. For convenience we use $\alpha(Z, s, \pi, o)$ as shorthand for $\sum_{a \in A} \pi(a)\{R(s, a, o) + \gamma \sum_{s' \in S} T(s, a, o, s') Z(s')\}$ for $Z \in \{X, Y\}$, $s \in S$, $\pi \in \text{PD}(A)$, $o \in O$.
$$\begin{aligned}
\|B^*(X) - B^*(Y)\|_\infty &= \max_{s \in S} |B^*(X)(s) - B^*(Y)(s)| \\
&= \max_{s \in S} \left| \max_{\pi \in \text{PD}(A)} \min_{o \in O} \alpha(X, s, \pi, o) - \max_{\pi \in \text{PD}(A)} \min_{o \in O} \alpha(Y, s, \pi, o) \right| \\
&\le \max_{s \in S} \max_{\pi \in \text{PD}(A)} \left| \min_{o \in O} \alpha(X, s, \pi, o) - \min_{o \in O} \alpha(Y, s, \pi, o) \right| \\
&= \max_{s \in S} \max_{\pi \in \text{PD}(A)} \left| \max_{o \in O} (-\alpha(X, s, \pi, o)) - \max_{o \in O} (-\alpha(Y, s, \pi, o)) \right| \\
&\le \max_{s \in S} \max_{\pi \in \text{PD}(A)} \max_{o \in O} \left| (-\alpha(X, s, \pi, o)) - (-\alpha(Y, s, \pi, o)) \right| \\
&= \gamma \max_{s \in S} \max_{\pi \in \text{PD}(A)} \max_{o \in O} \left| \sum_{a \in A} \pi(a) \sum_{s' \in S} T(s, a, o, s') (Y(s') - X(s')) \right| \\
&\le \gamma \max_{s \in S} \max_{\pi \in \text{PD}(A)} \max_{o \in O} \sum_{a \in A} \pi(a) \sum_{s' \in S} T(s, a, o, s') \|X - Y\|_\infty = \gamma \|X - Y\|_\infty.
\end{aligned}$$
Since $\gamma < 1$, $B^*$ is indeed a contraction mapping under the max norm, and the claim follows.
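
The argument above can also be checked computationally. Below is a minimal Python sketch (not part of the original solution) that applies $B^*$ repeatedly, solving the inner max-min at each state as a small linear program with scipy. The array layout ($T$ of shape $|S| \times |A| \times |O| \times |S|$, $R$ of shape $|S| \times |A| \times |O|$) and the function names are assumptions made here for illustration.

import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(M):
    """Maximin value of the zero-sum matrix game M, where the row player
    maximises and M[a, o] is the row player's payoff."""
    nA, nO = M.shape
    # Variables: [pi(0), ..., pi(nA-1), v]; maximise v, i.e. minimise -v.
    c = np.zeros(nA + 1)
    c[-1] = -1.0
    # For every opponent action o: v - sum_a pi(a) M[a, o] <= 0.
    A_ub = np.hstack([-M.T, np.ones((nO, 1))])
    b_ub = np.zeros(nO)
    # Probabilities sum to 1.
    A_eq = np.hstack([np.ones((1, nA)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * nA + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

def value_iteration(T, R, gamma, tol=1e-8):
    """Iterate V <- B*(V) until the max-norm change falls below tol."""
    nS = T.shape[0]
    V = np.zeros(nS)
    while True:
        V_new = np.empty(nS)
        for s in range(nS):
            # M[a, o] = R(s, a, o) + gamma * sum_{s'} T(s, a, o, s') V(s').
            M = R[s] + gamma * T[s].dot(V)
            V_new[s] = solve_matrix_game(M)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

Tracking $\|V^{t+1} - V^t\|_\infty$ across iterations and checking that it shrinks by a factor of at most $\gamma$ is a simple empirical counterpart of the contraction argument above.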
b. The table below corresponds to a game with no pure-strategy Nash equilibria. Cells show “A’s reward, O’s reward” when A plays the row action and O plays the column action.

        c       d
  a   1, 1    1, 2
  b   0, 1    2, 0

To verify, note that every pure profile admits a profitable unilateral deviation: at (a, c), O gains by switching to d; at (a, d), A gains by switching to b; at (b, d), O gains by switching to c; and at (b, c), A gains by switching to a.
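
As a quick sanity check (not part of the original solution), the snippet below enumerates the four pure profiles of the game above and confirms that none is a Nash equilibrium; the dictionaries and the function name are chosen here purely for illustration.

# A's and O's rewards for each pure profile (row action, column action) of the game above.
RA = {("a", "c"): 1, ("a", "d"): 1, ("b", "c"): 0, ("b", "d"): 2}
RO = {("a", "c"): 1, ("a", "d"): 2, ("b", "c"): 1, ("b", "d"): 0}

def is_pure_nash(row, col):
    # A must not gain by switching rows; O must not gain by switching columns.
    best_row = max("ab", key=lambda r: RA[(r, col)])
    best_col = max("cd", key=lambda c: RO[(row, c)])
    return RA[(row, col)] == RA[(best_row, col)] and RO[(row, col)] == RO[(row, best_col)]

print([(r, c) for r in "ab" for c in "cd" if is_pure_nash(r, c)])  # prints []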
