Lecture 4: Model-Free Prediction
David Silver
Outline
1 Introduction
2 Monte-Carlo Learning
3 Temporal-Difference Learning
4 TD(λ)
Introduction
Last lecture:
Planning by dynamic programming
Solve a known MDP
This lecture:
Model-free prediction
Estimate the value function of an unknown MDP
Next lecture:
Model-free control
Optimise the value function of an unknown MDP
Monte-Carlo Learning
Goal: learn vπ from episodes of experience under policy π:
S1, A1, R2, ..., Sk ∼ π
The return is the total discounted reward:
Gt = Rt+1 + γRt+2 + ... + γ^{T−1}RT
The value function is the expected return:
vπ(s) = Eπ[Gt | St = s]
Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return
First-Visit Monte-Carlo Policy Evaluation
To evaluate state s:
The first time-step t that state s is visited in an episode,
Increment counter N(s) ← N(s) + 1
Increment total return S(s) ← S(s) + Gt
Value is estimated by mean return V(s) = S(s)/N(s)
By the law of large numbers, V(s) → vπ(s) as N(s) → ∞
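A minimal Python sketch of first-visit MC evaluation under the definitions above; the episode encoding (a list of (state, reward) pairs, where the reward follows the state) and the function name are illustrative assumptions, not from the lecture.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) as the mean return following the first visit to s."""
    N = defaultdict(int)     # visit counter N(s)
    S = defaultdict(float)   # total return S(s)
    for episode in episodes:
        # Scan backwards to compute the return G_t at every time-step.
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()    # back to chronological order
        seen = set()
        for state, G in returns:
            if state not in seen:          # only the first visit counts
                seen.add(state)
                N[state] += 1              # N(s) <- N(s) + 1
                S[state] += G              # S(s) <- S(s) + G_t
    return {s: S[s] / N[s] for s in N}     # V(s) = S(s) / N(s)
```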
Every-Visit Monte-Carlo Policy Evaluation
To evaluate state s:
Every time-step t that state s is visited in an episode,
Increment counter N(s) ← N(s) + 1
Increment total return S(s) ← S(s) + Gt
Value is estimated by mean return V(s) = S(s)/N(s)
Again, V(s) → vπ(s) as N(s) → ∞
Blackjack Example
Incremental Mean
N(St) ← N(St) + 1
V(St) ← V(St) + (1/N(St)) (Gt − V(St))
In non-stationary problems it can be useful to track a running mean, i.e. forget old episodes:
V(St) ← V(St) + α (Gt − V(St))
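A small sketch of this update; with alpha=None it reproduces the exact running mean, while a fixed alpha forgets old episodes. The dict-based interface is an assumption.

```python
def incremental_mc_update(V, N, state, G, alpha=None):
    """Update V(state) toward the observed return G."""
    N[state] = N.get(state, 0) + 1
    step = alpha if alpha is not None else 1.0 / N[state]  # 1/N(St) or fixed
    v = V.get(state, 0.0)
    V[state] = v + step * (G - v)   # V(St) <- V(St) + step * (Gt - V(St))
```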
Temporal-Difference Learning
MC and TD
Goal: learn vπ online from experience under policy π
Incremental every-visit Monte-Carlo: update V(St) toward the actual return Gt
V(St) ← V(St) + α (Gt − V(St))
Simplest temporal-difference learning algorithm, TD(0): update V(St) toward the estimated return Rt+1 + γV(St+1)
V(St) ← V(St) + α (Rt+1 + γV(St+1) − V(St))
Rt+1 + γV(St+1) is called the TD target
δt = Rt+1 + γV(St+1) − V(St) is called the TD error
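A minimal sketch of the TD(0) update just defined; the function signature and `terminal` flag are illustrative assumptions.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """Apply one TD(0) update after observing the transition (s, r, s_next)."""
    v_next = 0.0 if terminal else V.get(s_next, 0.0)
    target = r + gamma * v_next            # TD target
    delta = target - V.get(s, 0.0)         # TD error delta_t
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta
```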
Driving Home Example

State                Elapsed Time (min)  Predicted Time to Go  Predicted Total Time
leaving office       0                   30                    30
reach car, raining   5                   35                    40
exit highway         20                  15                    35
behind truck         30                  10                    40
home street          40                  3                     43
arrive home          43                  0                     43
Bias/Variance Trade-Off
The return Gt = Rt+1 + γRt+2 + ... + γ^{T−1}RT is an unbiased estimate of vπ(St)
The TD target Rt+1 + γV(St+1) is a biased estimate of vπ(St)
But the TD target has much lower variance than the return:
the return depends on many random actions, transitions and rewards,
while the TD target depends on one random action, transition and reward
Batch MC and TD
MC and TD converge: V(s) → vπ(s) as experience → ∞
But what about a batch solution for finite experience, i.e. repeatedly replaying a fixed set of K episodes?
AB Example
Two states A, B; no discounting; 8 episodes of experience:
A, 0, B, 0
B, 1 (observed six times)
B, 0
What is V(A), V(B)?
MC gives V(A) = 0; TD(0) gives V(A) = 0.75 (both give V(B) = 0.75)
Certainty Equivalence
MC converges to the solution with minimum mean-squared error, the best fit to the observed returns:
∑_{k=1}^{K} ∑_{t=1}^{T_k} (G_t^k − V(s_t^k))²
TD(0) converges to the solution of the maximum-likelihood Markov model, i.e. the MDP ⟨S, A, P̂, R̂, γ⟩ that best fits the data
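To make the contrast concrete, here is a hedged sketch that runs batch MC and batch TD(0) on the AB data set above; the episode encoding, step size, and sweep count are my own choices.

```python
# 8 episodes; each entry is a list of (state, reward) steps, gamma = 1.
episodes = [[('A', 0), ('B', 0)]] + [[('B', 1)]] * 6 + [[('B', 0)]]

# Batch MC fits the observed returns directly:
# V(A) = 0 (one episode through A, return 0), V(B) = 6/8 = 0.75.

# Batch TD(0): replay the fixed batch until the values converge.
V = {'A': 0.0, 'B': 0.0}
for _ in range(10_000):
    for ep in episodes:
        for i, (s, r) in enumerate(ep):
            v_next = V[ep[i + 1][0]] if i + 1 < len(ep) else 0.0
            V[s] += 0.01 * (r + v_next - V[s])

print(V)  # approaches V(B) = 0.75 and V(A) = 0.75: the max-likelihood MDP
```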
Monte-Carlo Backup
V(St) ← V(St) + α (Gt − V(St))
[Backup diagram: the MC target is the complete sampled trajectory from St to the terminal state]
Temporal-Difference Backup
V(St) ← V(St) + α (Rt+1 + γV(St+1) − V(St))
[Backup diagram: the TD target is one sampled step, from St to St+1, bootstrapping from V(St+1)]
Unified View
For comparison, the dynamic-programming backup is V(St) ← Eπ[Rt+1 + γV(St+1)]
[Figure: unified view of RL methods, locating DP, TD, and MC by backup depth (bootstrapping) and backup width (sampling)]
n-Step Prediction
Let the TD target look n steps into the future.
n-Step Return
Define the n-step return:
G_t^(n) = Rt+1 + γRt+2 + ... + γ^{n−1}Rt+n + γ^n V(St+n)
n = 1 gives the TD(0) target; n = ∞ (the complete episode) gives the MC return.
n-step temporal-difference learning:
V(St) ← V(St) + α (G_t^(n) − V(St))
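A short sketch of the n-step return under an assumed list encoding: rewards[i] holds R_{i+1} and values[i] holds V(S_i), with values[T] = 0 at the terminal state.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """G_t^(n) = R_{t+1} + ... + gamma^(n-1) R_{t+n} + gamma^n V(S_{t+n})."""
    T = len(rewards)             # terminal time-step
    end = min(t + n, T)          # truncate if the episode ends first
    G, discount = 0.0, 1.0
    for k in range(t, end):
        G += discount * rewards[k]
        discount *= gamma
    G += discount * values[end]  # bootstrap term (zero at the terminal state)
    return G
```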
λ-return
The λ-return G_t^λ combines all n-step returns G_t^(n), weighting each by (1 − λ)λ^{n−1}:
G_t^λ = (1 − λ) ∑_{n=1}^{∞} λ^{n−1} G_t^(n)
Forward-view TD(λ):
V(St) ← V(St) + α (G_t^λ − V(St))
Forward View of TD(λ)
Update the value function toward the λ-return. The forward view looks into the future to compute G_t^λ; like MC, it can only be computed from complete episodes.
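A hedged sketch of the forward view for one complete episode, reusing the `n_step_return` sketch above; in an episode of length T the tail weight λ^{T−t−1} collapses onto the full return, which makes the infinite sum finite.

```python
def lambda_return(rewards, values, t, lam, gamma=1.0):
    """G_t^lambda = (1 - lam) * sum_{n>=1} lam^(n-1) * G_t^(n), truncated."""
    T = len(rewards)
    G, weight = 0.0, 1.0 - lam
    for n in range(1, T - t):    # n-step returns that still bootstrap
        G += weight * n_step_return(rewards, values, t, n, gamma)
        weight *= lam
    # All remaining weight, lam^(T-t-1), falls on the complete return.
    G += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return G

def forward_td_lambda(V, states, rewards, alpha, lam, gamma=1.0):
    """Forward-view TD(lambda) updates for one complete episode."""
    values = [V.get(s, 0.0) for s in states] + [0.0]  # snapshot; terminal = 0
    for t, s in enumerate(states):
        G = lambda_return(rewards, values, t, lam, gamma)
        V[s] = V.get(s, 0.0) + alpha * (G - values[t])  # errors use snapshot
```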
Eligibility Traces
Eligibility traces combine the frequency and recency heuristics for credit assignment:
E0(s) = 0
Et(s) = γλ Et−1(s) + 1(St = s)
Backward-view TD(λ): keep an eligibility trace for every state s, and update V(s) in proportion to the TD error δt and the trace Et(s):
V(s) ← V(s) + α δt Et(s)
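A minimal sketch of backward-view TD(λ) with accumulating traces; the transition format (s, r, s_next), with s_next=None at the terminal step, is an assumption.

```python
from collections import defaultdict

def backward_td_lambda(V, episode, alpha=0.1, gamma=1.0, lam=0.9):
    """Online backward-view TD(lambda) over one episode of transitions."""
    E = defaultdict(float)                       # eligibility traces E(s)
    for s, r, s_next in episode:
        v_next = 0.0 if s_next is None else V.get(s_next, 0.0)
        delta = r + gamma * v_next - V.get(s, 0.0)  # TD error delta_t
        E[s] += 1.0                  # E_t(s) = gamma*lam*E_{t-1}(s) + 1(St=s)
        for state in list(E):        # update every state in proportion to E
            V[state] = V.get(state, 0.0) + alpha * delta * E[state]
            E[state] *= gamma * lam  # decay every trace
    return V
```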
TD(λ) and MC
MC and TD(1)
Consider an episode where s is visited once, at time-step k.
The TD(1) eligibility trace discounts time since the visit: Et(s) = γ^{t−k} for t ≥ k (and 0 before the visit).
Telescoping in TD(1)
When λ = 1, the sum of discounted TD errors telescopes into the MC error (assuming V is held fixed within the episode):
Gt − V(St) = δt + γ(Gt+1 − V(St+1)) = δt + γδt+1 + ... + γ^{T−1−t} δT−1
So TD(1) is roughly equivalent to every-visit Monte-Carlo; if updates are applied offline at the end of the episode, the total update is exactly the same as MC.
Telescoping in TD(λ)
For general λ, the TD errors telescope into the λ-error, G_t^λ − V(St).
Forward and Backward Equivalence
Offline updates
Updates are accumulated within the episode,
but applied in batch at the end of the episode.
With offline updates, the total forward-view and backward-view TD(λ) updates are equivalent.
Online updates
TD(λ) updates are applied online, at each step within the episode
Forward-view and backward-view TD(λ) are then slightly different
NEW: exact online TD(λ) achieves perfect equivalence
by using a slightly different form of eligibility trace
(van Seijen and Sutton, ICML 2014)
Offline updates    λ = 0   λ ∈ (0, 1)           λ = 1
Backward view      TD(0)   TD(λ)                TD(1)
                   =       =                    =
Forward view       TD(0)   Forward TD(λ)        MC

Online updates     λ = 0   λ ∈ (0, 1)           λ = 1
Backward view      TD(0)   TD(λ)                TD(1)
                   =       ≠                    ≠
Forward view       TD(0)   Forward TD(λ)        MC
                   =       =                    =
Exact online       TD(0)   Exact Online TD(λ)   Exact Online TD(1)