Lecture 12 Slides - After
Reinforcement Learning
What is reinforcement learning?
▪ Sutton and Barto, 1998: “Reinforcement learning is learning what to do - how to map situations to
actions - so as to maximize a numerical reward signal”.
▪ ChatGPT, 2022: “Reinforcement learning is a type of machine learning in which an agent learns to
interact with its environment in order to maximize a reward signal”.
Agent and environment
[Diagram: the agent observes the state st and takes an action at in the environment.]
▪ Example: autonomous driving
Markov property
Pr(st+1 | st, st−1, . . . , s0, at, at−1, . . . , a0) = Pr(st+1 | st, at) → stochastic dynamical system!
Example: Deterministic MDP
Time evolution
▪ Start in s0 ∼ ρ.
▪ At each time t:
• Take action at ∈ 𝒜.
• End up in state st+1 ∼ P( ⋅ | st, at).
Example: Stochastic MDP (Gridworld)
Time evolution
▪ Start in s0 ∼ ρ.
▪ At each time t:
• Take action at ∈ 𝒜.
• End up in state st+1 ∼ P( ⋅ | st, at).
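To make this time evolution concrete, here is a minimal Python sketch for a small finite MDP; the sizes, the transition tensor P, and the initial distribution ρ below are arbitrary toy choices (not the gridworld from the slide), and actions are picked uniformly since no policy has been introduced yet.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2                       # toy sizes, chosen arbitrarily
rho = np.full(n_states, 1 / n_states)            # initial distribution rho
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
# P[s, a] is the distribution of s_{t+1} given (s_t, a_t): by the Markov property,
# this is all we need in order to continue the process.

s = rng.choice(n_states, p=rho)                  # s_0 ~ rho
for t in range(5):
    a = rng.integers(n_actions)                  # take some action a_t in A (uniform here)
    s_next = rng.choice(n_states, p=P[s, a])     # s_{t+1} ~ P(. | s_t, a_t)
    print(t, s, a, s_next)
    s = s_next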
Policy
Decision rule
▪ In state s ∈ 𝒮, we take action a ∈ 𝒜 with
probability π(a | s).
▪ π( ⋅ | s) is a probability distribution over 𝒜.
Objective
Objective function
The goal is to find an optimal policy π* maximizing
J(π) := 𝔼[ ∑_{t=0}^∞ γ^t r(st, at) | s0 ∼ ρ, π ]
Example: Stochastic MDP (Gridworld)
Time evolution
▪ Start in s0 ∼ ρ.
▪ At each time t:
• Take action at ∼ π( ⋅ | st).
• Get reward r(st, at).
• End up in state st+1 ∼ P( ⋅ | st, at).
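A minimal rollout sketch in Python, assuming a toy tabular MDP with a hypothetical reward table r(s, a) and a uniform placeholder policy π; it accumulates the discounted sum ∑ γ^t r(st, at) of one trajectory, truncated at a finite horizon H, which is exactly the quantity whose expectation defines J(π).

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, H = 4, 2, 0.9, 50    # toy sizes; H truncates the infinite sum

rho = np.full(n_states, 1 / n_states)                              # initial distribution rho
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a] = P(. | s, a)
r = rng.standard_normal((n_states, n_actions))                     # hypothetical reward table r(s, a)
pi = np.full((n_states, n_actions), 1 / n_actions)                 # placeholder policy pi(a | s)

G = 0.0
s = rng.choice(n_states, p=rho)                  # s_0 ~ rho
for t in range(H):
    a = rng.choice(n_actions, p=pi[s])           # a_t ~ pi(. | s_t)
    G += gamma**t * r[s, a]                      # collect gamma^t r(s_t, a_t)
    s = rng.choice(n_states, p=P[s, a])          # s_{t+1} ~ P(. | s_t, a_t)
print("discounted return of this rollout:", G)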
Reinforcement learning vs. optimal control
Stochastic optimal control
max_π 𝔼[ ∑_{t=0}^∞ γ^t r(st, at) | s0 ∼ ρ, π ] → For known P: dynamic programming. Still very hard for large 𝒮 and 𝒜.
Reinforcement learning
▪ In RL we can only sample from the MDP (in simulation or real world), but don’t know P.
▪ We need to explore the environment.
Many different approaches
max_θ J(πθ) := 𝔼[ ∑_{t=0}^∞ γ^t r(st, at) | s0 ∼ ρ, at ∼ πθ( ⋅ | st) ]
Policy gradient method
max_θ J(πθ) := 𝔼[ ∑_{t=0}^∞ γ^t r(st, at) | s0 ∼ ρ, at ∼ πθ( ⋅ | st) ]
▪ Policy Optimization: Parameterize the policy as πθ(a | s) and then find the best policy.
▪ Direct parameterization
πθ(a | s) = θs,a, where θ ∈ ℝ^{|𝒮|×|𝒜|}, θs,a ≥ 0 and ∑_{a∈𝒜} θs,a = 1.
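A small sketch of the direct parameterization in a toy tabular setting: θ is the policy table itself, so each row has to be kept on the probability simplex (here by simple normalization).

import numpy as np

n_states, n_actions = 4, 2                       # toy sizes
theta = np.random.default_rng(0).random((n_states, n_actions))
theta /= theta.sum(axis=1, keepdims=True)        # normalize each row so it is a distribution over A

def pi_direct(a, s):
    # pi_theta(a | s) = theta[s, a] under the direct parameterization
    return theta[s, a]

assert (theta >= 0).all() and np.allclose(theta.sum(axis=1), 1.0)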
Policy parameterization
▪ Softmax parameterization
πθ(a | s) = exp(θs,a) / ∑_{a′∈𝒜} exp(θs,a′), where θ ∈ ℝ^{|𝒮|×|𝒜|}
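A corresponding sketch of the softmax parameterization (again with toy sizes and random θ): here θ is unconstrained, and πθ( ⋅ | s) is a valid distribution over 𝒜 for every θ, which makes this parameterization convenient for gradient-based optimization.

import numpy as np

n_states, n_actions = 4, 2                       # toy sizes
theta = np.random.default_rng(0).standard_normal((n_states, n_actions))   # unconstrained parameters

def pi_softmax(s):
    # returns the vector pi_theta(. | s) = softmax(theta[s, :])
    z = theta[s] - theta[s].max()                # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(pi_softmax(0), pi_softmax(0).sum())        # a probability vector over A, summing to 1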
max_θ J(πθ) := 𝔼[ ∑_{t=0}^∞ γ^t r(st, at) | s0 ∼ ρ, at ∼ πθ( ⋅ | st) ]
▪ Here, c1 and c2 are constants that depend on the MDP ℳ = (𝒮, 𝒜, P, ρ, r, γ).
How to compute the gradient ∇θ J(πθ)?
▪ For every random trajectory τ = (s0, a0, s1, a1, …), the probability of generating this trajectory is
pθ(τ) := ρ(s0) ∏_{t=0}^∞ πθ(at | st) P(st+1 | st, at)
▪ Writing R(τ) := ∑_{t=0}^∞ γ^t r(st, at), the objective becomes
J(πθ) := 𝔼[ ∑_{t=0}^∞ γ^t r(st, at) | s0 ∼ ρ, πθ ] = 𝔼τ∼pθ[R(τ)]
▪ Therefore,
∇θ J(πθ) = ∇θ 𝔼τ∼pθ[R(τ)]
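For a finite piece of a trajectory, this factorization of pθ(τ) can be evaluated term by term. The sketch below uses hypothetical ρ, P, and policy tables and a hard-coded toy trajectory; in RL the P factors are unknown, but they do not depend on θ, which is what the policy gradient theorem on the next slide exploits.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2                       # toy sizes
rho = np.full(n_states, 1 / n_states)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # hypothetical dynamics
pi = np.full((n_states, n_actions), 1 / n_actions)                 # hypothetical policy table

# a short trajectory tau = (s_0, a_0, s_1, a_1, s_2, a_2, s_3), hard-coded for illustration
states, actions = [0, 2, 1, 3], [1, 0, 1]

log_p = np.log(rho[states[0]])                   # log rho(s_0)
for t, a in enumerate(actions):
    s, s_next = states[t], states[t + 1]
    log_p += np.log(pi[s, a])                    # + log pi_theta(a_t | s_t)
    log_p += np.log(P[s, a, s_next])             # + log P(s_{t+1} | s_t, a_t)
print("log p_theta(tau) =", log_p)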
How to compute the gradient?
Policy gradient theorem:
∇θ J(πθ) = 𝔼τ∼pθ[ ( ∑_{t=0}^∞ γ^t r(st, at) ) × ( ∑_{t=0}^∞ ∇θ log πθ(at | st) ) ]
Proof: Because ∇θ J(πθ) = ∇θ 𝔼τ∼pθ[R(τ)] and ∇θ pθ(τ) = pθ(τ) ∇θ log pθ(τ), we get
∇θ J(πθ) = 𝔼τ∼pθ[ R(τ) ∇θ log pθ(τ) ], where R(τ) = ∑_{t=0}^∞ γ^t r(st, at).
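To finish the argument, take the logarithm of pθ(τ) and differentiate; the ρ and P factors do not depend on θ and drop out:
∇θ log pθ(τ) = ∇θ[ log ρ(s0) + ∑_{t=0}^∞ log πθ(at | st) + ∑_{t=0}^∞ log P(st+1 | st, at) ] = ∑_{t=0}^∞ ∇θ log πθ(at | st).
Substituting this into 𝔼τ∼pθ[R(τ) ∇θ log pθ(τ)] gives exactly the product of sums in the theorem.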
How to estimate the gradient?
Monte Carlo approximation:
▪ Consider a random variable X ∼ q.
▪ Given independent and identically distributed X1, . . . , XN ∼ q, we can estimate
𝔼[ f(X)] ≈ (1/N) ∑_{i=1}^N f(Xi).
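A tiny numerical illustration of this estimate; the distribution q and the function f below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
N = 10_000
X = rng.normal(loc=1.0, scale=2.0, size=N)       # i.i.d. samples X_1, ..., X_N from q = N(1, 4)
f = lambda x: x ** 2                             # an arbitrary test function
print(f(X).mean())                               # Monte Carlo estimate of E[f(X)]; exact value is 1 + 4 = 5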
In reinforcement learning: we don't know P, but we can sample trajectories τ1, . . . , τN ∼ pθ and approximate
∇θ J(πθ) = 𝔼τ∼pθ[ ( ∑_{t=0}^∞ γ^t r(st, at) ) × ( ∑_{t=0}^∞ ∇θ log πθ(at | st) ) ]
≈ (1/N) ∑_{i=1}^N ( ∑_{t=0}^H γ^t r(s_t^i, a_t^i) ) × ( ∑_{t=0}^H ∇θ log πθ(a_t^i | s_t^i) ),
where each sampled trajectory is truncated at a finite horizon H.
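Putting the pieces together, a minimal REINFORCE-style sketch in Python for a toy tabular MDP with a softmax policy: it rolls out N trajectories of horizon H (P is used only as a black-box sampler, never in the estimate itself) and averages R(τ) × ∑_t ∇θ log πθ(at | st).

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, N, H = 4, 2, 0.9, 100, 30         # toy sizes; H truncates the infinite horizon

rho = np.full(nS, 1 / nS)                        # initial distribution rho
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # hypothetical dynamics, only ever sampled from
r = rng.standard_normal((nS, nA))                # hypothetical reward table r(s, a)
theta = np.zeros((nS, nA))                       # softmax policy parameters

def pi(s):
    # pi_theta(. | s) = softmax(theta[s, :])
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

grad_est = np.zeros_like(theta)
for _ in range(N):                               # N i.i.d. trajectories tau_i ~ p_theta
    ret, score = 0.0, np.zeros_like(theta)
    s = rng.choice(nS, p=rho)                    # s_0 ~ rho
    for t in range(H):
        p = pi(s)
        a = rng.choice(nA, p=p)                  # a_t ~ pi_theta(. | s_t)
        ret += gamma**t * r[s, a]                # sum_t gamma^t r(s_t, a_t)
        score[s] += np.eye(nA)[a] - p            # grad_theta log pi_theta(a_t | s_t) for softmax
        s = rng.choice(nS, p=P[s, a])            # s_{t+1} ~ P(. | s_t, a_t)
    grad_est += ret * score / N                  # Monte Carlo average of R(tau) * score(tau)

print("estimated policy gradient:\n", grad_est)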