Lecture 12 Slides - After
Reinforcement Learning
What is reinforcement learning?
▪ Sutton and Barto, 1998: “Reinforcement learning is learning what to do - how to map situations to
actions - so as to maximize a numerical reward signal”.
▪ ChatGPT, 2022: “Reinforcement learning is a type of machine learning in which an agent learns to
interact with its environment in order to maximize a reward signal”.
Agent and environment
[Diagram: the agent observes the state st and takes an action at in the environment.]
▪ Example: autonomous driving
Markov property
Pr(st+1 | st, st−1, . . . , s0, at, at−1, . . . , a0) = Pr(st+1 | st, at) → stochastic dynamical system!
Example: Deterministic MDP
Time evolution
▪ Start in s0 ∼ ρ.
▪ At each time t:
• Take action at ∈ 𝒜.
• End up in state st+1 ∼ P( ⋅ | st, at).
Example: Stochastic MDP (Gridworld)
Time evolution
▪ Start in s0 ∼ ρ.
▪ At each time t:
• Take action at ∈ 𝒜.
• End up in state st+1 ∼ P( ⋅ | st, at).
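To make this time evolution concrete, here is a minimal Python sketch for a small finite MDP; the sizes, the transition tensor P, and the initial distribution ρ below are arbitrary toy choices (not the gridworld from the slide), and actions are picked uniformly since no policy has been introduced yet.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2                       # toy sizes, chosen arbitrarily
rho = np.full(n_states, 1 / n_states)            # initial distribution rho
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
# P[s, a] is the distribution of s_{t+1} given (s_t, a_t): by the Markov property,
# this is all we need in order to continue the process.

s = rng.choice(n_states, p=rho)                  # s_0 ~ rho
for t in range(5):
    a = rng.integers(n_actions)                  # take some action a_t in A (uniform here)
    s_next = rng.choice(n_states, p=P[s, a])     # s_{t+1} ~ P(. | s_t, a_t)
    print(t, s, a, s_next)
    s = s_next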
Policy
Decision rule
▪ In state s ∈ 𝒮, we take action a ∈ 𝒜 with
probability π(a | s).
▪ π( ⋅ | s) is a probability distribution over 𝒜.
Objective
Objective function
The goal is to find an optimal policy π* maximizing
J(π) := 𝔼[ ∑_{t=0}^∞ γ^t r(st, at) | s0 ∼ ρ, π ]
Example: Stochastic MDP (Gridworld)
Time evolution
▪ Start in s0 ∼ ρ.
▪ At each time t:
• Take action at ∼ π( ⋅ | st).
• Get reward r(st, at).
• End up in state st+1 ∼ P( ⋅ | st, at).
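A minimal rollout sketch in Python, assuming a toy tabular MDP with a hypothetical reward table r(s, a) and a uniform placeholder policy π; it accumulates the discounted sum ∑ γ^t r(st, at) of one trajectory, truncated at a finite horizon H, which is exactly the quantity whose expectation defines J(π).

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, H = 4, 2, 0.9, 50    # toy sizes; H truncates the infinite sum

rho = np.full(n_states, 1 / n_states)                              # initial distribution rho
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a] = P(. | s, a)
r = rng.standard_normal((n_states, n_actions))                     # hypothetical reward table r(s, a)
pi = np.full((n_states, n_actions), 1 / n_actions)                 # placeholder policy pi(a | s)

G = 0.0
s = rng.choice(n_states, p=rho)                  # s_0 ~ rho
for t in range(H):
    a = rng.choice(n_actions, p=pi[s])           # a_t ~ pi(. | s_t)
    G += gamma**t * r[s, a]                      # collect gamma^t r(s_t, a_t)
    s = rng.choice(n_states, p=P[s, a])          # s_{t+1} ~ P(. | s_t, a_t)
print("discounted return of this rollout:", G)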
Reinforcement learning vs. optimal control
Stochastic optimal control
max_π 𝔼[ ∑_{t=0}^∞ γ^t r(st, at) | s0 ∼ ρ, π ] → For known P: dynamic programming. Still very hard for large 𝒮 and 𝒜.
Reinforcement learning
▪ In RL we can only sample from the MDP (in simulation or real world), but don’t know P.
▪ We need to explore the environment.
Many different approaches
max_θ J(πθ) := 𝔼[ ∑_{t=0}^∞ γ^t r(st, at) | s0 ∼ ρ, at ∼ πθ( ⋅ | st) ]
Policy gradient method
max_θ J(πθ) := 𝔼[ ∑_{t=0}^∞ γ^t r(st, at) | s0 ∼ ρ, at ∼ πθ( ⋅ | st) ]
▪ Policy Optimization: Parameterize the policy as πθ(a | s) and then find the best policy.
▪ Direct parameterization
πθ(a | s) = θs,a, where θ ∈ ℝ^{|𝒮|×|𝒜|}, θs,a ≥ 0 and ∑_{a∈𝒜} θs,a = 1.
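A small sketch of the direct parameterization in a toy tabular setting: θ is the policy table itself, so each row has to be kept on the probability simplex (here by simple normalization).

import numpy as np

n_states, n_actions = 4, 2                       # toy sizes
theta = np.random.default_rng(0).random((n_states, n_actions))
theta /= theta.sum(axis=1, keepdims=True)        # normalize each row so it is a distribution over A

def pi_direct(a, s):
    # pi_theta(a | s) = theta[s, a] under the direct parameterization
    return theta[s, a]

assert (theta >= 0).all() and np.allclose(theta.sum(axis=1), 1.0)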
Policy parameterization
▪ Softmax parameterization
πθ(a | s) = exp(θs,a) / ∑_{a′∈𝒜} exp(θs,a′), where θ ∈ ℝ^{|𝒮|×|𝒜|}
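A corresponding sketch of the softmax parameterization (again with toy sizes and random θ): here θ is unconstrained, and πθ( ⋅ | s) is a valid distribution over 𝒜 for every θ, which makes this parameterization convenient for gradient-based optimization.

import numpy as np

n_states, n_actions = 4, 2                       # toy sizes
theta = np.random.default_rng(0).standard_normal((n_states, n_actions))   # unconstrained parameters

def pi_softmax(s):
    # returns the vector pi_theta(. | s) = softmax(theta[s, :])
    z = theta[s] - theta[s].max()                # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(pi_softmax(0), pi_softmax(0).sum())        # a probability vector over A, summing to 1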
max_θ J(πθ) := 𝔼[ ∑_{t=0}^∞ γ^t r(st, at) | s0 ∼ ρ, at ∼ πθ( ⋅ | st) ]
▪ Here, c1 and c2 are constants that depend on the MDP ℳ = (𝒮, 𝒜, P, ρ, r, γ).
How to compute the gradient ∇θ J(πθ)?
▪ For every random trajectory τ = (s0, a0, s1, a1, …), the probability of generating this trajectory is
pθ(τ) := ρ(s0) ∏_{t=0}^∞ πθ(at | st) P(st+1 | st, at)
▪ Writing R(τ) := ∑_{t=0}^∞ γ^t r(st, at), the objective becomes
J(πθ) := 𝔼[ ∑_{t=0}^∞ γ^t r(st, at) | s0 ∼ ρ, πθ ] = 𝔼τ∼pθ[R(τ)]
▪ Therefore,
∇θ J(πθ) = ∇θ 𝔼τ∼pθ[R(τ)]
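For a finite piece of a trajectory, this factorization of pθ(τ) can be evaluated term by term. The sketch below uses hypothetical ρ, P, and policy tables and a hard-coded toy trajectory; in RL the P factors are unknown, but they do not depend on θ, which is what the policy gradient theorem on the next slide exploits.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2                       # toy sizes
rho = np.full(n_states, 1 / n_states)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # hypothetical dynamics
pi = np.full((n_states, n_actions), 1 / n_actions)                 # hypothetical policy table

# a short trajectory tau = (s_0, a_0, s_1, a_1, s_2, a_2, s_3), hard-coded for illustration
states, actions = [0, 2, 1, 3], [1, 0, 1]

log_p = np.log(rho[states[0]])                   # log rho(s_0)
for t, a in enumerate(actions):
    s, s_next = states[t], states[t + 1]
    log_p += np.log(pi[s, a])                    # + log pi_theta(a_t | s_t)
    log_p += np.log(P[s, a, s_next])             # + log P(s_{t+1} | s_t, a_t)
print("log p_theta(tau) =", log_p)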
How to compute the gradient?
Policy gradient theorem:
∇θ J(πθ) = 𝔼τ∼pθ[ ( ∑_{t=0}^∞ γ^t r(st, at) ) × ( ∑_{t=0}^∞ ∇θ log πθ(at | st) ) ]
Proof: Because ∇θ J(πθ) = ∇θ 𝔼τ∼pθ[R(τ)] and ∇θ pθ(τ) = pθ(τ) ∇θ log pθ(τ), we get
∇θ J(πθ) = 𝔼τ∼pθ[ R(τ) ∇θ log pθ(τ) ], where R(τ) = ∑_{t=0}^∞ γ^t r(st, at).
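To finish the argument, take the logarithm of pθ(τ) and differentiate; the ρ and P factors do not depend on θ and drop out:
∇θ log pθ(τ) = ∇θ[ log ρ(s0) + ∑_{t=0}^∞ log πθ(at | st) + ∑_{t=0}^∞ log P(st+1 | st, at) ] = ∑_{t=0}^∞ ∇θ log πθ(at | st).
Substituting this into 𝔼τ∼pθ[R(τ) ∇θ log pθ(τ)] gives exactly the product of sums in the theorem.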
How to estimate the gradient?
Monte Carlo approximation:
▪ Consider a random variable X ∼ q.
▪ Given independent and identically distributed X1, . . . , XN ∼ q, we can estimate
𝔼[ f(X)] ≈ (1/N) ∑_{i=1}^N f(Xi).
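A tiny numerical illustration of this estimate; the distribution q and the function f below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
N = 10_000
X = rng.normal(loc=1.0, scale=2.0, size=N)       # i.i.d. samples X_1, ..., X_N from q = N(1, 4)
f = lambda x: x ** 2                             # an arbitrary test function
print(f(X).mean())                               # Monte Carlo estimate of E[f(X)]; exact value is 1 + 4 = 5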
In reinforcement learning: we don't know P, but we can sample trajectories τ1, . . . , τN ∼ pθ and approximate
∇θ J(πθ) = 𝔼τ∼pθ[ ( ∑_{t=0}^∞ γ^t r(st, at) ) × ( ∑_{t=0}^∞ ∇θ log πθ(at | st) ) ]
≈ (1/N) ∑_{i=1}^N ( ∑_{t=0}^H γ^t r(s_t^i, a_t^i) ) × ( ∑_{t=0}^H ∇θ log πθ(a_t^i | s_t^i) ),
where each sampled trajectory is truncated at a finite horizon H.
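Putting the pieces together, a minimal REINFORCE-style sketch in Python for a toy tabular MDP with a softmax policy: it rolls out N trajectories of horizon H (P is used only as a black-box sampler, never in the estimate itself) and averages R(τ) × ∑_t ∇θ log πθ(at | st).

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, N, H = 4, 2, 0.9, 100, 30         # toy sizes; H truncates the infinite horizon

rho = np.full(nS, 1 / nS)                        # initial distribution rho
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # hypothetical dynamics, only ever sampled from
r = rng.standard_normal((nS, nA))                # hypothetical reward table r(s, a)
theta = np.zeros((nS, nA))                       # softmax policy parameters

def pi(s):
    # pi_theta(. | s) = softmax(theta[s, :])
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

grad_est = np.zeros_like(theta)
for _ in range(N):                               # N i.i.d. trajectories tau_i ~ p_theta
    ret, score = 0.0, np.zeros_like(theta)
    s = rng.choice(nS, p=rho)                    # s_0 ~ rho
    for t in range(H):
        p = pi(s)
        a = rng.choice(nA, p=p)                  # a_t ~ pi_theta(. | s_t)
        ret += gamma**t * r[s, a]                # sum_t gamma^t r(s_t, a_t)
        score[s] += np.eye(nA)[a] - p            # grad_theta log pi_theta(a_t | s_t) for softmax
        s = rng.choice(nS, p=P[s, a])            # s_{t+1} ~ P(. | s_t, a_t)
    grad_est += ret * score / N                  # Monte Carlo average of R(tau) * score(tau)

print("estimated policy gradient:\n", grad_est)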