
CS 748 (Spring 2021): Weekly Quizzes

Instructor: Shivaram Kalyanakrishnan

February 5, 2021

Note. Provide justifications/calculations/steps along with each answer to illustrate how you arrived at the answer. State your arguments clearly, in logical sequence. You will not receive credit for giving an answer without sufficient explanation.

Submission. You can either write down your answer by hand or type it in. Be sure to mention
your roll number. Upload to Moodle as a pdf file.

Week 4
Question. This week you are given a short programming assignment related to Dyna-Q, and also
a regular question related to the helicopter control application of Ng et al. (2003) (see the week’s
lecture slides for reference).

a. Recall the “Windy Gridworld” task (Example 6.5, Sutton and Barto, 2018) that you implemented as a part of a programming assignment in CS 747. In particular consider Exercise 6.10 from the textbook, which asks you to implement a “stochastic wind”, with the agent allowed to make “King’s moves”. In your CS 747 assignment, you implemented a learning agent that uses Sarsa; you can reuse your code. (1) First, change the learning algorithm to Q-learning (using the same values for α and ε). Generate a plot similar to the one in the textbook under Example 6.5. (2) Implement Dyna-Q. The textbook provides pseudocode for deterministic environments, but the class lecture (Slide 5) generalises it to stochastic environments, which is what you will have to use in this case. Observe that the parameter N specifies the number of model-based updates. Your earlier plot can be viewed as an instance of Dyna-Q with N = 0. To it, add curves for N = 2, N = 5, and N = 20. Let each plot be an average of at least 10 independent runs with different random seeds. You can decide whether to place all the plots in a single graph or to have separate graphs. (An illustrative sketch of Dyna-Q for stochastic environments is given after this question.) [4 marks]

b. Ng et al. (2003) make minor changes to the policy class $\Pi_H$ used for the hovering task to obtain the policy class $\Pi_T$ for trajectory-following. What changes are these, and why are they brought in? $\Pi_T$ appears to be a superset of $\Pi_H$; why do you think the authors did not use $\Pi_T$ itself for hovering, too? [2 marks]

In a single pdf file, put your plot(s) for (a) along with a description of your experiment (hyperpa-
rameters, design choices, observations, etc.); also include your answer to (b). Submit a compressed
directory containing the pdf file and your code for (a). We will examine your code, but not run it
ourselves. Feel free to use a programming language/environment of your choice.
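
For reference, here is a minimal, illustrative Dyna-Q sketch for a stochastic environment, written in Python. It is not the expected solution and does not reproduce the lecture’s Slide 5 pseudocode: the grid layout follows Example 6.5 (a 7×10 grid, start (3, 0), goal (3, 7), column-wise wind), stochasticity is handled here simply by storing every observed transition and resampling from that list during planning, and the hyperparameter values (alpha, epsilon, N) are placeholders you would tune in your own experiments.

import random
from collections import defaultdict

# King's moves: the 8 neighbouring cells.
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1)]
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]   # mean upward wind per column (Example 6.5)
ROWS, COLS = 7, 10
START, GOAL = (3, 0), (3, 7)

def step(state, action):
    """One environment step with stochastic wind; returns (next_state, reward, done)."""
    r, c = state
    dr, dc = MOVES[action]
    wind = WIND[c] + (random.choice([-1, 0, 1]) if WIND[c] > 0 else 0)  # Exercise 6.10
    r = min(max(r + dr - wind, 0), ROWS - 1)
    c = min(max(c + dc, 0), COLS - 1)
    return (r, c), -1.0, (r, c) == GOAL

def dyna_q(episodes=200, alpha=0.5, epsilon=0.1, gamma=1.0, N=5, seed=0):
    random.seed(seed)
    Q = defaultdict(float)      # Q[(state, action)]
    model = defaultdict(list)   # model[(state, action)] -> observed (reward, next_state, done)
    steps_per_episode = []
    for _ in range(episodes):
        s, done, steps = START, False, 0
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                a = random.randrange(len(MOVES))
            else:
                a = max(range(len(MOVES)), key=lambda x: Q[(s, x)])
            s2, rew, done = step(s, a)
            # Direct (Q-learning) update from real experience.
            target = rew + (0.0 if done else gamma * max(Q[(s2, x)] for x in range(len(MOVES))))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # Store the observed transition; resampling from this list approximates
            # the empirical transition distribution for the stochastic model.
            model[(s, a)].append((rew, s2, done))
            # N model-based (planning) updates on previously visited state-action pairs.
            for _ in range(N):
                ps, pa = random.choice(list(model.keys()))
                prew, ps2, pdone = random.choice(model[(ps, pa)])
                ptarget = prew + (0.0 if pdone else gamma * max(Q[(ps2, x)] for x in range(len(MOVES))))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
            steps += 1
        steps_per_episode.append(steps)
    return steps_per_episode

if __name__ == "__main__":
    for n in [0, 2, 5, 20]:
        runs = dyna_q(N=n)
        print(n, sum(runs[-50:]) / 50.0)   # rough average episode length over the last 50 episodes

Setting N = 0 in this sketch recovers plain Q-learning, which is why the question treats the Q-learning plot as the N = 0 curve.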
Week 2
Question. This question takes you through the steps to construct a proof that applying MCTS at every encountered state with a non-optimal rollout policy $\pi_R$ will lead to higher aggregate reward than that obtained by following $\pi_R$ itself.

a. For MDP $(S, A, T, R, \gamma)$, a nonstationary policy $\pi = (\pi^0, \pi^1, \pi^2, \dots)$ is an infinite sequence of stationary policies $\pi^i: S \to A$ for $i \ge 0$ (we assume the per-time-step policies are deterministic and Markovian). $V^\pi$ and $Q^\pi$ have the usual definitions. For $s \in S$, $a \in A$, (1) $V^\pi(s)$ is the expected long-term reward obtained by acting according to $\pi$ from state $s$, and (2) $Q^\pi(s, a)$ is the expected long-term reward obtained by taking $a$ from $s$, and thereafter acting according to $\pi$ ($\pi^0$ gives the second action, $\pi^1$ gives the third, etc.). Denote by $\text{head}(\pi)$ the stationary policy $\pi^0$ that is first in the sequence, and by $\text{tail}(\pi)$ the remaining sequence $(\pi^1, \pi^2, \pi^3, \dots)$, itself a nonstationary policy. Show that $V^\pi \succeq V^{\text{tail}(\pi)} \implies V^{\text{head}(\pi)} \succeq V^{\text{tail}(\pi)}$. It will help to begin with a Bellman equation connecting $V^\pi$ and $V^{\text{tail}(\pi)}$. Thereafter the proof will follow the structure in the analysis of the policy improvement theorem. [3 marks]

b. Consider an application of MCTS in which a tree of depth $d \ge 1$ is constructed and rollout policy $\pi_R: S \to A$ is used. Assume that an infinite number of rollouts is performed: hence evaluations within the search tree, and those at the leaves using $\pi_R$, are exact. Although tree search is undertaken afresh from each “current” state, we may equivalently view tree search as the application of a nonstationary policy $\pi = (\pi^0, \pi^1, \pi^2, \dots, \pi^{d-1}, \pi_R, \pi_R, \pi_R, \dots)$ on the original MDP, starting from the current state, where for $i \in \{0, 1, \dots, d-1\}$, $s \in S$,
$$\pi^i(s) = \underset{a \in A}{\operatorname{argmax}} \sum_{s' \in S} T(s, a, s')\{R(s, a, s') + \gamma V^{(\pi^{i+1}, \pi^{i+2}, \pi^{i+3}, \dots)}(s')\}.$$
This nonstationary policy $\pi$ is constructed in the agent’s mind, but it is eventually $\pi^0$ that is applied in the (real) environment. By our convention, $\pi^d, \pi^{d+1}, \pi^{d+2}, \dots$ all refer to $\pi_R$. Show that $V^{(\pi^{d-1}, \pi^d, \pi^{d+1}, \dots)} \succ V^{\pi_R}$, and show for $i \in \{0, 1, \dots, d-2\}$ that $V^{(\pi^i, \pi^{i+1}, \pi^{i+2}, \dots)} \succeq V^{(\pi^{i+1}, \pi^{i+2}, \pi^{i+3}, \dots)}$. [4 marks]
c. Put together the results from a and b to show that $V^{\pi^0} \succ V^{\pi_R}$. [1 mark]

Solution.

a. Observe that $V^\pi = B^{\text{head}(\pi)}(V^{\text{tail}(\pi)})$. Thus, the antecedent is $B^{\text{head}(\pi)}(V^{\text{tail}(\pi)}) \succeq V^{\text{tail}(\pi)}$. Since the Bellman operator preserves $\succeq$, a repeated application gives
$$\begin{aligned}
B^{\text{head}(\pi)}(V^{\text{tail}(\pi)}) &\succeq V^{\text{tail}(\pi)}, \\
(B^{\text{head}(\pi)})^2(V^{\text{tail}(\pi)}) &\succeq B^{\text{head}(\pi)}(V^{\text{tail}(\pi)}), \\
(B^{\text{head}(\pi)})^3(V^{\text{tail}(\pi)}) &\succeq (B^{\text{head}(\pi)})^2(V^{\text{tail}(\pi)}), \\
&\;\;\vdots
\end{aligned}$$
and hence $\lim_{l \to \infty} (B^{\text{head}(\pi)})^l(V^{\text{tail}(\pi)}) = V^{\text{head}(\pi)} \succeq V^{\text{tail}(\pi)}$.
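
For completeness (this is not spelled out in the original solution), the operator and the monotonicity property invoked above can be written out under the standard definition of the single-policy Bellman operator:
$$B^{\text{head}(\pi)}(X)(s) = \sum_{s' \in S} T(s, \text{head}(\pi)(s), s')\{R(s, \text{head}(\pi)(s), s') + \gamma X(s')\}, \qquad s \in S,$$
so that $V^\pi = B^{\text{head}(\pi)}(V^{\text{tail}(\pi)})$ is exactly the Bellman equation hinted at in the question, and for $X \succeq Y$,
$$B^{\text{head}(\pi)}(X)(s) - B^{\text{head}(\pi)}(Y)(s) = \gamma \sum_{s' \in S} T(s, \text{head}(\pi)(s), s')\{X(s') - Y(s')\} \ge 0,$$
which is the sense in which the operator preserves $\succeq$.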

b. For $s \in S$,
$$\begin{aligned}
V^{(\pi^{d-1}, \pi^{d}, \pi^{d+1}, \dots)}(s) &= \max_{a \in A} \sum_{s' \in S} T(s, a, s')\{R(s, a, s') + \gamma V^{\pi_R}(s')\} \\
&\ge \sum_{s' \in S} T(s, \pi_R(s), s')\{R(s, \pi_R(s), s') + \gamma V^{\pi_R}(s')\} \\
&= V^{\pi_R}(s)
\end{aligned}$$
(the first equality holds because $\pi^{d-1}(s)$ is, by definition, a maximising action, and $\pi^d, \pi^{d+1}, \dots$ all refer to $\pi_R$), and moreover, since $\pi_R$ is not optimal, there is some state $\bar{s} \in S$ such that $V^{(\pi^{d-1}, \pi^{d}, \pi^{d+1}, \dots)}(\bar{s}) > V^{\pi_R}(\bar{s})$. In short, $V^{(\pi^{d-1}, \pi^{d}, \pi^{d+1}, \dots)} \succ V^{\pi_R}$. If we assume that $V^{(\pi^{i+1}, \pi^{i+2}, \pi^{i+3}, \dots)} \succeq V^{(\pi^{i+2}, \pi^{i+3}, \pi^{i+4}, \dots)}$, we observe that for $s \in S$,
$$\begin{aligned}
V^{(\pi^{i}, \pi^{i+1}, \pi^{i+2}, \dots)}(s) &= \max_{a \in A} \sum_{s' \in S} T(s, a, s')\{R(s, a, s') + \gamma V^{(\pi^{i+1}, \pi^{i+2}, \pi^{i+3}, \dots)}(s')\} \\
&\ge \sum_{s' \in S} T(s, \pi^{i+1}(s), s')\{R(s, \pi^{i+1}(s), s') + \gamma V^{(\pi^{i+1}, \pi^{i+2}, \pi^{i+3}, \dots)}(s')\} \\
&\ge \sum_{s' \in S} T(s, \pi^{i+1}(s), s')\{R(s, \pi^{i+1}(s), s') + \gamma V^{(\pi^{i+2}, \pi^{i+3}, \pi^{i+4}, \dots)}(s')\} \\
&= V^{(\pi^{i+1}, \pi^{i+2}, \pi^{i+3}, \dots)}(s),
\end{aligned}$$
which completes our proof.


c. Applying the result in b for $i = 0$ gives $V^\pi \succeq V^{\text{tail}(\pi)}$, which, from a, implies $V^{\pi^0} \succeq V^{\text{tail}(\pi)}$. However, from b, we also have
$$V^{\text{tail}(\pi)} = V^{(\pi^1, \pi^2, \pi^3, \dots)} \succeq V^{(\pi^2, \pi^3, \pi^4, \dots)} \succeq \cdots \succeq V^{(\pi^{d-1}, \pi^d, \pi^{d+1}, \dots)} \succ V^{\pi_R},$$
which means $V^{\pi^0} \succ V^{\pi_R}$.

Week 1
Question.
a. Consider the value iteration algorithm provided on Slide 12 in the week’s lecture. Assume that the algorithm is applied to a two-player zero-sum Markov game $(S, A, O, R, T, \gamma)$. Using Banach’s fixed point theorem, show that the algorithm converges to the game’s minimax value $V^*$ (which you can assume to be unique, even though we did not present a proof in the lecture). Clearly state any results that your derivation uses. [3 marks]
b. Consider $G(2, 2)$, the class of two-player general-sum matrix games in which each of the players, A and O, has exactly two actions. In a “pure-strategy Nash equilibrium” $(\pi^A, \pi^O)$, the individual strategies $\pi^A$ and $\pi^O$ are both pure (deterministic). Do there exist games in $G(2, 2)$ that have no pure-strategy Nash equilibria, or are all games in $G(2, 2)$ guaranteed to have at least one pure-strategy Nash equilibrium? Justify your answer. [2 marks]
Solution.
a. In this setting value iteration repeatedly applies the operator $B^*: (S \to \mathbb{R}) \to (S \to \mathbb{R})$ defined by
$$B^*(X)(s) \stackrel{\text{def}}{=} \max_{\pi \in \text{PD}(A)} \min_{o \in O} \sum_{a \in A} \pi(a) \left\{ R(s, a, o) + \gamma \sum_{s' \in S} T(s, a, o, s') X(s') \right\}$$
for $X: S \to \mathbb{R}$ and $s \in S$. First, observe that $V^*$ is the unique fixed point of this operator. If we start the iteration with any arbitrary $V^0: S \to \mathbb{R}$ and set $V^{t+1} \leftarrow B^*(V^t)$ for $t \ge 0$, Banach’s fixed point theorem assures convergence to $V^*$ if $B^*$ is a contraction mapping. We show the same by considering arbitrary $X: S \to \mathbb{R}$ and $Y: S \to \mathbb{R}$. We use the rule that for functions $f: U \to \mathbb{R}$ and $g: U \to \mathbb{R}$ on the same domain $U$, $|\max_{u \in U} f(u) - \max_{u \in U} g(u)| \le \max_{u \in U} |f(u) - g(u)|$. For convenience we use $\alpha(Z, s, \pi, o)$ as shorthand for $\sum_{a \in A} \pi(a)\{R(s, a, o) + \gamma \sum_{s' \in S} T(s, a, o, s') Z(s')\}$ for $Z \in \{X, Y\}$, $s \in S$, $\pi \in \text{PD}(A)$, $o \in O$.
$$\begin{aligned}
\|B^*(X) - B^*(Y)\|_\infty &= \max_{s \in S} |B^*(X)(s) - B^*(Y)(s)| \\
&= \max_{s \in S} \left| \max_{\pi \in \text{PD}(A)} \min_{o \in O} \alpha(X, s, \pi, o) - \max_{\pi \in \text{PD}(A)} \min_{o \in O} \alpha(Y, s, \pi, o) \right| \\
&\le \max_{s \in S} \max_{\pi \in \text{PD}(A)} \left| \min_{o \in O} \alpha(X, s, \pi, o) - \min_{o \in O} \alpha(Y, s, \pi, o) \right| \\
&= \max_{s \in S} \max_{\pi \in \text{PD}(A)} \left| \max_{o \in O} (-\alpha(X, s, \pi, o)) - \max_{o \in O} (-\alpha(Y, s, \pi, o)) \right| \\
&\le \max_{s \in S} \max_{\pi \in \text{PD}(A)} \max_{o \in O} \left| (-\alpha(X, s, \pi, o)) - (-\alpha(Y, s, \pi, o)) \right| \\
&= \gamma \max_{s \in S} \max_{\pi \in \text{PD}(A)} \max_{o \in O} \left| \sum_{a \in A} \pi(a) \sum_{s' \in S} T(s, a, o, s') (Y(s') - X(s')) \right| \\
&\le \gamma \max_{s \in S} \max_{\pi \in \text{PD}(A)} \max_{o \in O} \sum_{a \in A} \pi(a) \sum_{s' \in S} T(s, a, o, s') \|X - Y\|_\infty = \gamma \|X - Y\|_\infty.
\end{aligned}$$
Since $\gamma < 1$, $B^*$ is indeed a contraction mapping under the max norm, and the claim follows.
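
The argument above can also be checked computationally. Below is a minimal Python sketch (not part of the original solution) that applies $B^*$ repeatedly, solving the inner max-min at each state as a small linear program with scipy. The array layout ($T$ of shape $|S| \times |A| \times |O| \times |S|$, $R$ of shape $|S| \times |A| \times |O|$) and the function names are assumptions made here for illustration.

import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(M):
    """Maximin value of the zero-sum matrix game M, where the row player
    maximises and M[a, o] is the row player's payoff."""
    nA, nO = M.shape
    # Variables: [pi(0), ..., pi(nA-1), v]; maximise v, i.e. minimise -v.
    c = np.zeros(nA + 1)
    c[-1] = -1.0
    # For every opponent action o: v - sum_a pi(a) M[a, o] <= 0.
    A_ub = np.hstack([-M.T, np.ones((nO, 1))])
    b_ub = np.zeros(nO)
    # Probabilities sum to 1.
    A_eq = np.hstack([np.ones((1, nA)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * nA + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

def value_iteration(T, R, gamma, tol=1e-8):
    """Iterate V <- B*(V) until the max-norm change falls below tol."""
    nS = T.shape[0]
    V = np.zeros(nS)
    while True:
        V_new = np.empty(nS)
        for s in range(nS):
            # M[a, o] = R(s, a, o) + gamma * sum_{s'} T(s, a, o, s') V(s').
            M = R[s] + gamma * T[s].dot(V)
            V_new[s] = solve_matrix_game(M)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

Tracking $\|V^{t+1} - V^t\|_\infty$ across iterations and checking that it shrinks by a factor of at most $\gamma$ is a simple empirical counterpart of the contraction argument above.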
b. The table below corresponds to a game with no pure-strategy Nash equilibria. Cells show “A’s reward, O’s reward” when A plays the row action and O plays the column action.

        c       d
  a   1, 1    1, 2
  b   0, 1    2, 0

To verify, note that every pure profile admits a profitable unilateral deviation: at (a, c), O gains by switching to d; at (a, d), A gains by switching to b; at (b, d), O gains by switching to c; and at (b, c), A gains by switching to a.
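
As a quick sanity check (not part of the original solution), the snippet below enumerates the four pure profiles of the game above and confirms that none is a Nash equilibrium; the dictionaries and the function name are chosen here purely for illustration.

# A's and O's rewards for each pure profile (row action, column action) of the game above.
RA = {("a", "c"): 1, ("a", "d"): 1, ("b", "c"): 0, ("b", "d"): 2}
RO = {("a", "c"): 1, ("a", "d"): 2, ("b", "c"): 1, ("b", "d"): 0}

def is_pure_nash(row, col):
    # A must not gain by switching rows; O must not gain by switching columns.
    best_row = max("ab", key=lambda r: RA[(r, col)])
    best_col = max("cd", key=lambda c: RO[(row, c)])
    return RA[(row, col)] == RA[(best_row, col)] and RO[(row, col)] == RO[(row, best_col)]

print([(r, c) for r in "ab" for c in "cd" if is_pure_nash(r, c)])  # prints []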
