Lec 22
Roger Grosse
$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$$
$$V^\pi(s) = \mathbb{E}[G_t \mid s_t = s] = \mathbb{E}\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i} \,\Big|\, s_t = s\right]$$
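The discounted return above can be sketched in a few lines of code. This is a minimal illustration for a finite reward sequence (a truncated version of the infinite sum); the function name and arguments are my own, not from the lecture.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = sum_i gamma^i * r_{t+i} for a finite list of rewards.

    A truncation of the infinite sum in the definition above; for
    gamma < 1 the tail terms shrink geometrically.
    """
    return sum(gamma**i * r for i, r in enumerate(rewards))

# Example: rewards of 1 at each step with gamma = 0.5
# give 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```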
Notice: Q-learning only learns about the states and actions it visits.
Exploration-exploitation tradeoff: the agent should sometimes pick
suboptimal actions in order to visit new states and actions.
Simple solution: ε-greedy policy
With probability 1 − ε, choose the optimal action according to Q
With probability ε, choose a random action
Believe it or not, ε-greedy is still used today!
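The ε-greedy rule above can be sketched as follows. This is a minimal illustration, assuming a tabular Q stored as a dict keyed by (state, action) pairs; the function name and Q layout are my own, not from the lecture.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick an action under an epsilon-greedy policy.

    Q: dict mapping (state, action) -> estimated value (assumed layout).
    With probability epsilon, explore by picking a uniformly random action;
    otherwise exploit by picking the action with the highest Q estimate.
    """
    if random.random() < epsilon:
        return random.choice(actions)          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit

# With epsilon = 0 this always picks the greedy action:
Q = {(0, "left"): 1.0, (0, "right"): 2.0}
print(epsilon_greedy(Q, 0, ["left", "right"], epsilon=0.0))
```

Setting ε = 0 recovers the purely greedy policy; ε = 1 is uniform random exploration. In practice ε is often annealed toward zero over training.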