Lecture 12 Slides - After

Reinforcement learning is a type of machine learning where an agent learns to map situations to actions to maximize a reward signal. It involves concepts such as Markov decision processes, policies, and various approaches like policy gradient methods. The field has applications in robotics, autonomous driving, and power grid control, but also faces challenges like sample inefficiency and lack of safety guarantees.

Introduction to Reinforcement Learning
What is reinforcement learning?
▪ Sutton and Barto, 1998: “Reinforcement learning is learning what to do - how to map situations to
actions - so as to maximize a numerical reward signal”.

▪ ChatGPT, 2022: “Reinforcement learning is a type of machine learning in which an agent learns to
interact with its environment in order to maximize a reward signal”.

Agent-environment interaction loop: in state st, the agent takes action at, receives the reward r(st, at), and the environment moves to the next state st+1.
Example: Multi-agent game
2019: Learning to play hide and seek via multi-agent reinforcement learning [2].
Recent advances
▪ 2013, Atari: Deep Q-learning for Atari games [1].
▪ 2016, Energy saving: DeepMind AI reduces Google data centre cooling bill by 40% [3].
▪ 2017, AlphaGo/AlphaZero: AI achieving grandmaster level in chess, go, and shogi [4,5].
▪ 2018, OpenAI Five: Training five artificial intelligence agents to play Dota 2 [6].
▪ 2019, AlphaStar: AI achieving grandmaster level in the StarCraft II game [7].
▪ 2019, Rubik's Cube: Solving Rubik's Cube with a human-like robot hand [8].
▪ 2022, AlphaTensor: Discovering faster matrix multiplication algorithms [9].
▪ 2022, ChatGPT: A language model trained to generate human-like responses to text input [10].
(Potential) real world applications
▪ Robotics: teaching a robot how to walk in the wild [11].
▪ Autonomous driving: the car takes action at and observes the next state st+1 and reward r(st, at).
▪ Control of power grids [12].

An interdisciplinary field

Many facets of reinforcement learning [13].


Markov decision processes

A Markov decision process is given by a tuple ℳ = (𝒮, 𝒜, P, ρ) where…


▪ 𝒮 is the set of all possible states.
▪ 𝒜 is the set of all possible actions.
▪ P is the transition law with P(s′| s, a) = Pr(st+1 = s′| st = s, at = a).
▪ ρ is the initial state distribution with ρ(s) = Pr(s0 = s).

Markov property
Pr(st+1 | st, st−1, …, s0, at, at−1, …, a0) = Pr(st+1 | st, at) → stochastic dynamical system!
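As a concrete illustration (not from the slides), a small finite MDP can be written down directly as data. The sketch below encodes a hypothetical two-state, two-action MDP in Python and samples one transition; note that the next-state distribution depends only on the current state-action pair, which is exactly the Markov property.

```python
import random

# A hypothetical two-state, two-action MDP written as plain data.
# P maps (state, action) to a dict of next-state probabilities.
S = ["s0", "s1"]
A = ["left", "right"]
P = {
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s0": 0.7, "s1": 0.3},
    ("s1", "right"): {"s0": 0.1, "s1": 0.9},
}
rho = {"s0": 1.0, "s1": 0.0}  # initial state distribution

def sample(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

s = sample(rho)                    # s0 ~ rho
s_next = sample(P[(s, "right")])   # st+1 ~ P(. | st, at): depends only on (st, at)
print(s, "->", s_next)
```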
Example: Deterministic MDP
Time evolution
▪ Start in s0 ∼ ρ.
▪ At each time t:
• Take action at ∈ 𝒜.
• End up in state st+1 ∼ P( ⋅ | st, at).
Example: Stochastic MDP (Gridworld)
Time evolution
▪ Start in s0 ∼ ρ.
▪ At each time t:
• Take action at ∈ 𝒜.
• End up in state st+1 ∼ P( ⋅ | st, at).
Policy
Decision rule
▪ In state s ∈ 𝒮, we take action a ∈ 𝒜 with
probability π(a | s).
▪ π( ⋅ | s) is a probability distribution over 𝒜.
Objective

Reward and discount factor


▪ Reward function r : 𝒮 × 𝒜 → ℝ.
▪ Discount rate γ ∈ (0,1).

Objective function
The goal is to find an optimal policy π* maximizing

J(π) := 𝔼[ ∑_{t=0}^∞ γ^t r(st, at) | s0 ∼ ρ, π ]
Example: Stochastic MDP (Gridworld)
Time evolution
▪ Start in s0 ∼ ρ.
▪ At each time t:
• Take action at ∼ π( ⋅ | st).
• Get reward r(st, at).
• End up in state st+1 ∼ P( ⋅ | st, at).
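A minimal sketch (not part of the slides) of this time evolution for an MDP stored as plain data, as in the earlier example: roll out the policy for a finite number of steps, accumulate the discounted rewards, and average over many rollouts to estimate the objective J(π) defined above. The concrete numbers and the truncation horizon H are illustrative assumptions.

```python
import random

def sample(dist):
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def rollout(rho, P, pi, r, gamma, H):
    """Simulate one trajectory of length H and return its discounted return."""
    s = sample(rho)                      # s0 ~ rho
    ret = 0.0
    for t in range(H):
        a = sample(pi[s])                # at ~ pi(. | st)
        ret += (gamma ** t) * r[(s, a)]  # accumulate gamma^t r(st, at)
        s = sample(P[(s, a)])            # st+1 ~ P(. | st, at)
    return ret

# Tiny illustrative MDP (two states, two actions) and a uniform random policy.
P = {("s0", "a"): {"s0": 0.5, "s1": 0.5}, ("s0", "b"): {"s1": 1.0},
     ("s1", "a"): {"s0": 1.0},           ("s1", "b"): {"s1": 1.0}}
rho = {"s0": 1.0}
r = {("s0", "a"): 0.0, ("s0", "b"): 1.0, ("s1", "a"): 0.0, ("s1", "b"): 2.0}
pi = {"s0": {"a": 0.5, "b": 0.5}, "s1": {"a": 0.5, "b": 0.5}}

# Average the discounted return over many rollouts to estimate J(pi).
returns = [rollout(rho, P, pi, r, gamma=0.9, H=50) for _ in range(1000)]
print(sum(returns) / len(returns))
```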
Reinforcement learning vs. optimal control
Stochastic optimal control

max_π 𝔼[ ∑_{t=0}^∞ γ^t r(st, at) | s0 ∼ ρ, π ] → For known P: dynamic programming. Still very hard for large 𝒮 and 𝒜.

Reinforcement learning
▪ In RL we can only sample from the MDP (in simulation or real world), but don’t know P.
▪ We need to explore the environment.
Many different approaches

Taxonomy of reinforcement learning approaches [14].


Policy gradient method

max_θ J(πθ) := 𝔼[ ∑_{t=0}^∞ γ^t r(st, at) | s0 ∼ ρ, at ∼ πθ( ⋅ | st) ]

▪ Policy Optimization: Parameterize the policy as πθ(a | s) and then find the best policy.

▪ Direct parameterization
πθ(a | s) = θs,a, where θ ∈ ℝ^{|𝒮|×|𝒜|}, θs,a ≥ 0 and ∑_{a∈𝒜} θs,a = 1.
Policy parameterization

▪ Softmax parameterization
πθ(a | s) = exp(θs,a) / ∑_{a′∈𝒜} exp(θs,a′), where θ ∈ ℝ^{|𝒮|×|𝒜|}.

▪ Neural softmax parameterization
πθ(a | s) = exp(fθ(s, a)) / ∑_{a′∈𝒜} exp(fθ(s, a′)), where fθ(s, a) represents a neural network.
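As an illustration (not from the slides), here is a minimal sketch of the tabular softmax parameterization in Python with NumPy: θ is an |𝒮| × |𝒜| table, πθ( ⋅ | s) is the softmax of row s, and ∇θ log πθ(a | s) has the closed form one_hot(a) − πθ( ⋅ | s) in that row, which the gradient computations later on rely on.

```python
import numpy as np

n_states, n_actions = 4, 3
theta = np.zeros((n_states, n_actions))    # softmax parameters, one row per state

def policy(theta, s):
    """pi_theta(. | s): softmax of the row theta[s]."""
    logits = theta[s] - theta[s].max()      # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_policy(theta, s, a):
    """grad_theta log pi_theta(a | s): zero outside row s; in row s it is one_hot(a) - pi_theta(. | s)."""
    g = np.zeros_like(theta)
    g[s] = -policy(theta, s)
    g[s, a] += 1.0
    return g

rng = np.random.default_rng(0)
s = 2
a = rng.choice(n_actions, p=policy(theta, s))   # sample a ~ pi_theta(. | s)
print(policy(theta, s), grad_log_policy(theta, s, a)[s])
```

The neural softmax case only replaces the row θ[s] by the network outputs fθ(s, ⋅ ); sampling an action works the same way.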
Policy gradient method

max_θ J(πθ) := 𝔼[ ∑_{t=0}^∞ γ^t r(st, at) | s0 ∼ ρ, at ∼ πθ( ⋅ | st) ]

▪ Parameterize the policy as πθ(a | s).
▪ Use gradient ascent to find the best policy π*.
▪ Pseudo-code for the policy gradient method:
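The pseudo-code refers to plain gradient ascent on J with step size α, run for K iterations (the same α and K that appear in the convergence results below):

\[
\theta_{k+1} = \theta_k + \alpha \, \nabla_\theta J(\pi_\theta)\big|_{\theta = \theta_k}, \qquad k = 0, 1, \dots, K-1 .
\]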
Results for policy gradient method
▪ Can we converge to the optimal policy when K is big enough? Not necessarily: non-convexity may lead to a sub-optimal policy.

▪ Convergence for direct parametrization [15]:
• Let ρ(s) > 0 for all s ∈ 𝒮 and α ≤ (1 − γ)^3 / (2γ|𝒜|). Then min_{t≤K} [ J(π*) − J(πt) ] ≤ c1 / √K.

▪ Convergence for softmax parametrization [16]:
• Let ρ(s) > 0 for all s ∈ 𝒮 and α ≤ (1 − γ)^3 / 8. Then J(π*) − J(πK) ≤ c2 / K.

▪ Here, c1 and c2 are constants that depend on the MDP ℳ = (𝒮, 𝒜, P, ρ, r, γ).
How to compute the gradient ∇θ J(πθ)?
▪ For every random trajectory τ = (s0, a0, s1, a1, …), the probability of generating this trajectory is

pθ(τ) := ρ(s0) ∏_{t=0}^∞ πθ(at | st) P(st+1 | st, at)

▪ And we define the reward collected along this trajectory τ as

R(τ) := ∑_{t=0}^∞ γ^t r(st, at)

▪ Then,

J(πθ) := 𝔼[ ∑_{t=0}^∞ γ^t r(st, at) | s0 ∼ ρ, πθ ] = 𝔼_{τ∼pθ}[ R(τ) ]

▪ Therefore,

∇θ J(πθ) = ∇θ 𝔼_{τ∼pθ}[ R(τ) ]
How to compute the gradient?
Policy gradient theorem:

∇θ J(πθ) = 𝔼_{τ∼pθ}[ ( ∑_{t=0}^∞ γ^t r(st, at) ) × ( ∑_{t=0}^∞ ∇θ log πθ(at | st) ) ]

Proof: Because

∇θ J(πθ) = 𝔼_{τ∼pθ}[ R(τ) ∇θ log pθ(τ) ], where R(τ) = ∑_{t=0}^∞ γ^t r(st, at).
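This first identity is the log-derivative trick; written out (assuming the gradient and the integral can be exchanged):

\[
\nabla_\theta J(\pi_\theta)
= \nabla_\theta \int R(\tau)\, p_\theta(\tau)\, \mathrm{d}\tau
= \int R(\tau)\, \nabla_\theta p_\theta(\tau)\, \mathrm{d}\tau
= \int R(\tau)\, p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, \mathrm{d}\tau
= \mathbb{E}_{\tau \sim p_\theta}\!\big[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \big].
\]

Since pθ(τ) = ρ(s0) ∏_{t=0}^∞ πθ(at | st) P(st+1 | st, at) and the factors ρ and P do not depend on θ, we get ∇θ log pθ(τ) = ∑_{t=0}^∞ ∇θ log πθ(at | st), which gives the policy gradient theorem stated above.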
How to estimate the gradient?
Monte Carlo approximation:
▪ Consider a random variable X ∼ q.
▪ Given independent and identically distributed samples X1, …, XN ∼ q, we can estimate

𝔼[ f(X) ] ≈ (1/N) ∑_{i=1}^N f(Xi).

In reinforcement learning: we don't know P, but we can sample trajectories τ^1, …, τ^N ∼ pθ and approximate

∇θ J(πθ) = 𝔼_{τ∼pθ}[ ( ∑_{t=0}^∞ γ^t r(st, at) ) × ( ∑_{t=0}^∞ ∇θ log πθ(at | st) ) ]

≈ (1/N) ∑_{i=1}^N ( ∑_{t=0}^∞ γ^t r(s_t^i, a_t^i) ) ( ∑_{t=0}^∞ ∇θ log πθ(a_t^i | s_t^i) )

≈ (1/N) ∑_{i=1}^N ( ∑_{t=0}^H γ^t r(s_t^i, a_t^i) ) ( ∑_{t=0}^H ∇θ log πθ(a_t^i | s_t^i) ),

where the last step truncates each trajectory at a finite horizon H.
Stochastic policy gradient method
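A minimal sketch (an illustration under our own assumptions, not the lecture's code) of what such a stochastic policy gradient loop looks like for a tabular softmax policy: sample N trajectories under the current πθ, form the Monte Carlo gradient estimate above, and take a gradient ascent step. The `sample_trajectory(theta)` callable, which rolls out πθ in the environment and returns a list of (st, at, rt) triples, is an assumed interface for illustration.

```python
import numpy as np

def softmax_policy(theta, s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def grad_log_pi(theta, s, a):
    g = np.zeros_like(theta)
    g[s] = -softmax_policy(theta, s)   # row s becomes one_hot(a) - pi_theta(. | s); zero elsewhere
    g[s, a] += 1.0
    return g

def stochastic_policy_gradient(sample_trajectory, theta, alpha, gamma, N, K):
    """Run K gradient ascent steps, each using a Monte Carlo estimate of grad J from N rollouts."""
    for _ in range(K):
        grad = np.zeros_like(theta)
        for _ in range(N):
            traj = sample_trajectory(theta)   # one rollout [(s_t, a_t, r_t), ...] under pi_theta
            ret = sum(gamma ** t * r for t, (_, _, r) in enumerate(traj))   # sum_t gamma^t r(s_t, a_t)
            score = sum(grad_log_pi(theta, s, a) for s, a, _ in traj)       # sum_t grad log pi(a_t | s_t)
            grad += ret * score / N
        theta = theta + alpha * grad          # ascent step with the estimated gradient
    return theta
```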
Demonstration of policy gradient method

Using policy gradient method to play CartPole
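The demo itself cannot be reproduced from the slides; below is a rough, self-contained REINFORCE-style sketch on CartPole using the Gymnasium package and a linear softmax policy. The library choice, the linear parameterization, and the hyperparameters are illustrative assumptions; the update is the single-trajectory (N = 1) version of the estimator above.

```python
import numpy as np
import gymnasium as gym   # assumed installed: pip install gymnasium

env = gym.make("CartPole-v1")
n_obs, n_act = env.observation_space.shape[0], env.action_space.n
theta = np.zeros((n_obs, n_act))           # linear softmax policy: logits = obs @ theta
rng = np.random.default_rng(0)
alpha, gamma = 0.01, 0.99

def pi(obs):
    logits = obs @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum()

for episode in range(300):
    obs, _ = env.reset()
    grads, rewards = [], []
    done = False
    while not done:
        p = pi(obs)
        a = int(rng.choice(n_act, p=p))
        dlog = np.outer(obs, -p)           # grad_theta log pi(a | obs) for the linear softmax
        dlog[:, a] += obs
        grads.append(dlog)
        obs, r, terminated, truncated, _ = env.step(a)
        rewards.append(r)
        done = terminated or truncated
    ret = sum(gamma ** t * r for t, r in enumerate(rewards))   # discounted return of the episode
    theta += alpha * ret * sum(grads)                          # single-trajectory policy gradient step
    if episode % 50 == 0:
        print("episode", episode, "undiscounted return", sum(rewards))
```

This basic estimator is noisy, so training is slow; the point is only to show the pieces from the previous slides plugged together.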


Pros and cons of reinforcement learning
Pros
▪ General methods for complex tasks
▪ Adapt to changing environments
▪ Model-free: no need to know the dynamics model

Cons
▪ Sample inefficiency of model-free approaches
▪ Lack of safety and convergence guarantees
▪ Hard to assign meaningful rewards
Reinforcement learning projects in Sycamore lab
Bachelor projects
▪ Policy optimization for Pacman

Semester and master projects


▪ Safe reinforcement learning
▪ Inverse reinforcement learning
References
▪ [1] Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
▪ [2] Baker, Bowen, et al. "Emergent tool use from multi-agent autocurricula." arXiv preprint arXiv:1909.07528 (2019).
▪ [3] DeepMind AI Reduces Google Data Centre Cooling Bill by 40%. https://www.deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-
bill-by-40. Accessed 15 Dec. 2022.
▪ [4] Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
▪ [5] Silver, David, et al. "Mastering chess and shogi by self-play with a general reinforcement learning algorithm." arXiv preprint arXiv:1712.01815 (2017).
▪ [6] Berner, Christopher, et al. "Dota 2 with large scale deep reinforcement learning." arXiv preprint arXiv:1912.06680 (2019).
▪ [7] Arulkumaran, Kai, Antoine Cully, and Julian Togelius. "Alphastar: An evolutionary computation perspective." Proceedings of the genetic and
evolutionary computation conference companion. 2019.
▪ [8] Akkaya, Ilge, et al. "Solving rubik's cube with a robot hand." arXiv preprint arXiv:1910.07113 (2019).
▪ [9] Fawzi, Alhussein, et al. "Discovering faster matrix multiplication algorithms with reinforcement learning." Nature 610.7930 (2022): 47-53.
▪ [10] “ChatGPT: Optimizing Language Models for Dialogue.” OpenAI, 30 Nov. 2022, https://openai.com/blog/chatgpt/.
▪ [11] Miki, Takahiro, et al. "Learning robust perceptive locomotion for quadrupedal robots in the wild." Science Robotics 7.62 (2022): eabk2822.
▪ [12] Ibrahim, Muhammad Sohail, Wei Dong, and Qiang Yang. "Machine learning driven smart electric power systems: Current trends and new
perspectives." Applied Energy 272 (2020): 115237.
▪ [13] Niao He, Lecture notes on “Introduction to Reinforcement Learning”, ETH Zurich, 2021, https://odi.inf.ethz.ch/files/zinal/Lecture-1-RL-
introduction.pdf
▪ [14] Open AI, “Taxonomy of RL Algorithms”, https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html#citations-below
▪ [15] Agarwal, Alekh, et al. "Optimality and approximation with policy gradient methods in markov decision processes." Conference on Learning Theory.
PMLR, 2020.
▪ [16] Mei, Jincheng, et al. "On the global convergence rates of softmax policy gradient methods." International Conference on Machine Learning. PMLR,
2020.
