
Reinforcement Learning

Chapter 21

Mausam
(some slides by Rajesh Rao)
2
MDPs
[Diagram: the agent-environment loop for MDPs. The environment is static, fully observable, and stochastic; actions are instantaneous and percepts are perfect; the agent repeatedly asks "What action next?"]
3
Reinforcement Learning

• S: a set of states
• A: a set of actions
• T(s,a,s’): transition model
• R(s,a): reward model
• γ: discount factor
• Still looking for policy π(s)

• New Twist: we don’t know T and/or R


• we don’t know which states are good or what the actions do
• must learn from data/experience

• Fundamental model for learning of human behavior


4
Learning vs Inference

 Batch setting in Bayes Nets


• Data  Model  Prediction

 Active setting in MDPs


• Action  Data  (Model?)

• Actions have two purposes


• To maximize reward
• To learn the model

5
Learning/Planning/Acting
Main Dimensions

 Model-based vs. Model-free


• Model-based: learn the model (T, R)
• Model-free: directly learn what action to do when (in which state)

 Passive vs. Active


• Passive: learn state values by evaluating a given policy
• Active: need to learn both the optimal policy and state values

 Strong vs Weak simulator


• Strong: can jump to any part of state space and simulate
• Weak: real world; can’t teleport
7
RL and Animal Foraging

 RL studied experimentally for more than 80 years


in psychology and brain science
• Rewards: food, pain, hunger, drugs, etc.
• Evidence for RL in the brain via a chemical called
dopamine

 Example: foraging
• Bees can learn near-optimal foraging policy in field of
artificial flowers with controlled nectar supplies

8
Passive Learning (Policy Evaluation)

 Given a policy π: compute V^π


• V^π: expected discounted reward while following π

 Remember
• We don’t know T
• We don’t know R
• But we can execute (and simulate)

 Key Idea
• compute expectations by average over samples
9
Aside: Expected Age

Goal: Compute expected age of COL333 students


Known P(A): E[A] = Σ_a P(a) · a

Without P(A), instead collect samples [a1, a2, … aN]

• Unknown P(A), “Model Based”: estimate P̂(a) = count(a)/N, then E[A] ≈ Σ_a P̂(a) · a
  Why does this work? Because eventually you learn the right model.
• Unknown P(A), “Model Free”: E[A] ≈ (1/N) Σ_i a_i
  Why does this work? Because samples appear with the right frequencies.
Method 1: Model-based Learning

 Learn an empirical model


 Solve for V^π using policy evaluation
• assuming that the learned model is correct

 Learning the model


• maintain estimates of T(s,a,s’)
• maintain estimates of R(s,a,s’)
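As a concrete sketch of these two bullets (an illustration, not part of the slides): count observed transitions and average observed rewards. The (s, a, r, s') transition format and the function name are assumptions of this example.

from collections import defaultdict

def estimate_model(transitions):
    # Empirical T(s,a,s') and R(s,a,s') from observed (s, a, r, s') tuples.
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = N(s, a, s')
    reward_sums = defaultdict(float)                 # reward_sums[(s, a, s')] = sum of r
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sums[(s, a, s_next)] += r
    T, R = {}, {}
    for (s, a), next_counts in counts.items():
        total = sum(next_counts.values())
        for s_next, n in next_counts.items():
            T[(s, a, s_next)] = n / total                        # relative frequency
            R[(s, a, s_next)] = reward_sums[(s, a, s_next)] / n  # average observed reward
    return T, R

# With the A1/D and B3/U transitions from the data slide that follows:
data = [("A1", "D", -1, "B1"), ("A1", "D", -1, "B1"),
        ("B3", "U", -1, "A3"), ("B3", "U", -1, "A3"), ("B3", "U", -1, "C3")]
T, R = estimate_model(data)
print(T[("A1", "D", "B1")], T[("B3", "U", "A3")])   # 1.0 and 2/3, as on the slide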

11
Example

[Grid world: rows A-C, columns 1-4; square A4 gives +100, square C4 gives -100]

 12 states, 4 actions
 Reward(action) = -1
 Discount factor = 1
 A4 and C4 are absorbing states

 When might this be the optimal policy?

12
Data on Executing π

[Grid world: rows A-C, columns 1-4; A4 gives +100, C4 gives -100]

Episode 1: (A1, D, -1) (B1, R, -1) (B2, R, -1) (B3, U, -1) (A3, R, -1)
           (A2, D, -1) (B2, R, -1) (B3, U, -1) (A3, R, -1) (A4, 100)
Episode 2: (A1, D, -1) (B1, R, -1) (B2, R, -1) (B3, U, -1) (C3, U, -1) (C4, -100)

 T(A1, D, B1) = 1
 T(B3, U, A3) = 2/3

 We may want to smooth…


Properties

 Converges to correct model with infinite data


• If no state is starved

 With correct model


• V^π is computed accurately

 How about model free learning?


• i.e., expectation is average of samples

14
Method 2: Empirical Estimation of V^π

 Given a policy π: compute V^π


• V^π: expected discounted long-term reward following π
• V^π(s) = Σ_{s'} T(s, π(s), s') [long-term reward with s → s']
• V^π(s) = (1/N) Σ_i [long-term reward_i]
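A minimal sketch of this direct (Monte Carlo) estimate: average the observed discounted returns from each visit to a state. The episode format (a list of (state, reward) pairs) and the every-visit convention are assumptions of this sketch.

from collections import defaultdict

def mc_evaluate(episodes, gamma=1.0):
    # V^pi(s) = average discounted return observed after each visit to s.
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so G is the return from each state onward.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns[state].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# With the two episodes from the data slide (gamma = 1, reward -1 per step):
ep1 = [("A1", -1), ("B1", -1), ("B2", -1), ("B3", -1), ("A3", -1),
       ("A2", -1), ("B2", -1), ("B3", -1), ("A3", -1), ("A4", 100)]
ep2 = [("A1", -1), ("B1", -1), ("B2", -1), ("B3", -1), ("C3", -1), ("C4", -100)]
V = mc_evaluate([ep1, ep2])
print(V["B1"])   # average of the two observed returns from B1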

15
Data on Executing π

[Grid world: rows A-C, columns 1-4; A4 gives +100, C4 gives -100]

Episode 1: (A1, D, -1) (B1, R, -1) (B2, R, -1) (B3, U, -1) (A3, R, -1)
           (A2, D, -1) (B2, R, -1) (B3, U, -1) (A3, R, -1) (A4, 100)
Episode 2: (A1, D, -1) (B1, R, -1) (B2, R, -1) (B3, U, -1) (C3, U, -1) (C4, -100)

 V^π(B1) =
 V^π(B2) =

16
Properties

 Converges to the correct V^π with infinite data


• If no state is starved

 Is wasteful (why?)
• Compare V^π(B1) and V^π(B2)

 Each state is computed independently


• Connections (Bellman equations) are ignored
• Learns slowly

17
Method 3: Temporal Difference Learning

 Given a policy π: compute V^π


• V^π: expected discounted long-term reward following π
• V^π(s) = Σ_{s'} T(s, π(s), s') [long-term reward with s → s']
• V^π(s) = (1/N) Σ_i [long-term reward_i]

 V^π(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]
 represents the relationship between s and s'
 TD Learning: computing this expectation as an average

18
TD Learning

 V^π(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]
 Say I know the correct values of V^π(s1) and V^π(s2)

[Diagram: from state s, action a0 reaches s1 with Pr = 0.6 and R = 5 (V^π(s1) = 5),
 and reaches s2 with Pr = 0.4 and R = 2 (V^π(s2) = 3)]

Exact expectation:  V^π(s) = 0.6 (5 + 5) + 0.4 (2 + 3) = 6 + 2 = 8
Average of samples: V^π(s) = (10 + 10 + 10 + 5 + 5) / 5 = 8
TD Learning

 V^π(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]
 Inner term is the sample value
• (s, s', r): reached s' from s by executing π(s) and got
immediate reward of r
• sample = r + γ V^π(s')
 Compute V^π(s) = (1/N) Σ_i sample_i

 Problem: we don’t know the true values of V^π(s')


• learn together using dynamic programming!
Estimating mean via online updates

 Don’t learn T or R; directly maintain V^π


 Update V^π(s) each time you take an action in s
via a moving average

• V^π_{n+1}(s) ← (1/(n+1)) (n · V^π_n(s) + sample_{n+1})
• V^π_{n+1}(s) ← (1/(n+1)) ((n+1-1) · V^π_n(s) + sample_{n+1})
• V^π_{n+1}(s) ← V^π_n(s) + (1/(n+1)) (sample_{n+1} − V^π_n(s))
  [this is the average of n+1 samples; 1/(n+1) acts as a learning rate on sample n+1]

• V^π_{n+1}(s) ← V^π_n(s) + α (sample_{n+1} − V^π_n(s))
• Nudge the old estimate towards the sample
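A tiny sketch checking the algebra above: the incremental update with learning rate 1/(n+1) on the (n+1)-th sample reproduces the batch mean exactly. The samples mirror the 10, 10, 10, 5, 5 example from the earlier slide; names are illustrative.

def running_mean(samples):
    # V <- V + (1/n) * (sample - V), applied when the n-th sample arrives.
    V, n = 0.0, 0
    for x in samples:
        n += 1
        V += (x - V) / n
    return V

samples = [10.0, 10.0, 10.0, 5.0, 5.0]
assert abs(running_mean(samples) - sum(samples) / len(samples)) < 1e-12   # both equal 8.0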
TD Learning

 (s, s', r)
 V^π(s) ← V^π(s) + α (sample − V^π(s))
 V^π(s) ← V^π(s) + α (r + γ V^π(s') − V^π(s))    [the term in parentheses is the TD-error]
 V^π(s) ← (1 − α) V^π(s) + α (r + γ V^π(s'))

 Update maintains a mean of (noisy) value samples

 If the learning rate decreases appropriately with


the number of samples (e.g. 1/n) then the value
estimates will converge to true values! (non-trivial)
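A minimal sketch of running this TD(0) update online, assuming a sampling interface step(s, a) -> (s', r) for the unknown environment and a policy given as a function. A constant α is used here for simplicity, whereas the convergence guarantee above needs a suitably decaying learning rate.

from collections import defaultdict

def td0_evaluate(step, policy, start_state, is_terminal,
                 episodes=1000, gamma=1.0, alpha=0.1):
    # TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
    V = defaultdict(float)                    # unseen states start at 0
    for _ in range(episodes):
        s = start_state
        while not is_terminal(s):
            a = policy(s)
            s_next, r = step(s, a)            # one sample from the unknown T and R
            sample = r + gamma * V[s_next]
            V[s] += alpha * (sample - V[s])   # nudge toward the sample
            s = s_next
    return dict(V)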
22
Early Results: Pavlov and his Dog

 Classical (Pavlovian)
conditioning
experiments
 Training: Bell Food
 After: Bell  Salivate
 Conditioned stimulus
(bell) predicts future
reward (food)
Predicting Delayed Rewards

 Reward is typically delivered at the end (when


you know whether you succeeded or not)
 Time: 0  t  T with stimulus a(t) and reward r(t)
at each time step t (Note: r(t) can be zero at
some time points)
 Key Idea: Make the output v(t) predict total
expected future reward starting from time t
T t
v(t )  

r (t   )
0
Predicting Delayed Reward: TD Learning

Stimulus at t = 100 and reward at t = 200

Prediction error δ for each time step
(over many trials)
Figure from Theoretical Neuroscience by Peter Dayan and Larry Abbott, MIT Press, 2001
Prediction Error in the Primate Brain?

Dopaminergic cells in Ventral Tegmental Area (VTA)

Reward Prediction error? [r(t) + v(t+1) − v(t)]

Before Training

After Training

No error: [0 + v(t+1) − v(t)] ≈ 0, because after training v(t) ≈ r(t) + v(t+1)

Figure from Theoretical Neuroscience by Peter Dayan and Larry Abbott, MIT Press, 2001
More Evidence for Prediction Error Signals

Dopaminergic cells in VTA

Negative error: r(t) = 0, v(t+1) = 0,
so [r(t) + v(t+1) − v(t)] = −v(t)
Figure from Theoretical Neuroscience by Peter Dayan and Larry Abbott, MIT Press, 2001
The Story So Far: MDPs and RL

Known MDP: Offline Solution
  Goal: Compute V*, Q*, π*      Technique: Value / policy iteration
  Goal: Evaluate a policy π     Technique: Policy evaluation

Unknown MDP: Model-Based
  Goal: Compute V*, Q*, π*      Technique: VI/PI on approx. MDP
  Goal: Evaluate a policy π     Technique: PE on approx. MDP

Unknown MDP: Model-Free
  Goal: Compute V*, Q*, π*      Technique: Q-learning
  Goal: Evaluate a policy π     Technique: TD-Learning
Model-based RL

 Learn an initial model M0


 Loop
• VI/PI on M_i to compute policy π_i
• Execute π_i to generate data
• Learn a better model M_{i+1}

 Key challenge?

29
Model-based RL Example

[Grid world: rows A-C, columns 1-4; +100 in row A, -2 in row C, several squares marked “?”]

 Say world is deterministic
• and no wind

 Let’s say the agent discovers the path to the bad
reward first

 Will the agent ever learn the optimal policy?


• it won’t have any information about some states or
state-action pairs
30
Model-based RL

 Learn an initial model M0


 Loop
• VI/PI on M_i to compute policy π_i
• Execute π_i to generate data
• Learn a better model M_{i+1}

 Key challenge
• Just executing i is not enough!
• It may miss important regions
• Needs to explore new regions
31
TD Learning  TD (V*) Learning

 Can we do TD-like updates on V*?

 V*(s) = max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]

 Hmmm… what to do?


• For sample averaging, the RHS needs to be an expectation, but here the max_a sits outside it.
• Instead of V* write all equations in Q*
Bellman Equations (V*) → Bellman Equations (Q*)

 V*(s) = max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]

 Q*(s, a) = Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]

 Q*(s, a) = Σ_{s'} T(s, a, s') [R(s, a, s') + γ max_{a'} Q*(s', a')]

 VI → Q-Value Iteration
 TD Learning → Q Learning
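For contrast with the model-free methods that follow, a short sketch of Q-value iteration on a known model, using the Q* equation above; the dictionary format for T and R is an assumption of this example.

def q_value_iteration(states, actions, T, R, gamma=0.9, iters=100):
    # Q*(s,a) = sum_s' T(s,a,s') * [R(s,a,s') + gamma * max_a' Q*(s',a')]
    # T[(s, a)] is a dict {s': probability}; R[(s, a, s')] is a reward.
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iters):
        Q_new = {}
        for s in states:
            for a in actions:
                Q_new[(s, a)] = sum(
                    p * (R[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for s2, p in T[(s, a)].items())
        Q = Q_new
    return Q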
Q Learning

 Directly learn Q*(s,a) values


 Receive a sample (s, a, s’, r)
 Your old estimate Q(s,a)
 New sample value: r + γ max_{a'} Q(s', a')

Nudge the estimates:


 Q(s, a) ← Q(s, a) + α (r + γ max_{a'} Q(s', a') − Q(s, a))
 Q(s, a) ← (1 − α) Q(s, a) + α (r + γ max_{a'} Q(s', a'))

34
Q Learning Algorithm

 Forall s, a
• Initialize Q(s, a) = 0

 Repeat Forever
Where are you? s.
Choose some action a
Execute it in real world: (s, a, r, s’)
Do update:
𝑄(s,a)  (1 − 𝛼)𝑄(s,a)+ 𝛼(r+𝛾 max 𝑄(𝑠 ′ , 𝑎′ ))
𝑎′

Is an off policy learning algorithm
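A minimal sketch of this loop; the environment interface (env_reset() and env_step(s, a) returning (s', r, done)) and the ε-greedy action choice are assumptions added for illustration, since the slide only says to choose some action.

import random
from collections import defaultdict

def q_learning(env_reset, env_step, actions, episodes=5000,
               gamma=0.9, alpha=0.1, epsilon=0.1):
    # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))
    Q = defaultdict(float)                               # Q[(s, a)], initialized to 0
    for _ in range(episodes):
        s, done = env_reset(), False
        while not done:
            if random.random() < epsilon:                # choose some action a
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env_step(s, a)             # execute it in the real world
            best_next = 0.0 if done else max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
            s = s_next
    return Q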

35
Properties

 Q Learning converges to optimal values Q*


• Irrespective of initialization,
• Irrespective of action choice policy
• Irrespective of learning rate

 as long as
• states/actions finite, all rewards bounded
• No (s,a) is starved: every pair is visited infinitely often in the limit
• Learning rate decays with visits to state-action pairs
• but not too fast a decay (Σ_i α_i(s,a) = ∞, Σ_i α_i^2(s,a) < ∞)

36
Q Learning Algorithm

 Forall s, a
• Initialize Q(s, a) = 0

 Repeat Forever
Where are you? s.
Choose some action a
Execute it in real world: (s, a, r, s’)
Do update:
𝑄(s,a)  (1 − 𝛼)𝑄(s,a)+ 𝛼(r+𝛾 max 𝑄(𝑠 ′ , 𝑎′ ))
𝑎′

How to choose?
new: exploration
greedy: exploitation
37
Exploration vs. Exploitation Tradeoff

 A fundamental tradeoff in RL

 Exploration: must take actions that may be


suboptimal but help discover new rewards and
in the long run increase utility

 Exploitation: must take actions that are known


to be good (and seem currently optimal) to
optimize the overall utility

 Slowly move from exploration → exploitation


Explore/Exploit Policies

 Simplest scheme: ϵ-greedy


• Every time step flip a coin
• With probability 1-ϵ, take the greedy action
• With probability ϵ, take a random action

 Problem
• Exploration probability is constant

 Solutions
• Lower ϵ over time
• Use an exploration function
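A sketch of the ε-greedy rule, with the "lower ε over time" fix noted in a comment; the names here are illustrative only.

import random

def epsilon_greedy(Q, state, actions, epsilon):
    # With probability epsilon explore (random action), otherwise exploit (greedy action).
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# "Lower epsilon over time": e.g. epsilon_t = 1.0 / (1 + t) explores heavily at first
# and becomes almost purely greedy later.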
Explore/Exploit Policies

 Boltzmann Exploration
• Select action a with probability
• Pr(a|s) = exp(Q(s,a)/T) / Σ_{a'∈A} exp(Q(s,a')/T)

 T: Temperature
• Similar to simulated annealing
• Large T: uniform, Small T: greedy
• Start with large T and decrease with time

 GLIE: greedy in the limit of infinite exploration
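A sketch of Boltzmann (softmax) action selection; subtracting the max Q-value before exponentiating is a standard numerical-stability trick added here, not something from the slide.

import math
import random

def boltzmann_action(Q, state, actions, temperature):
    # Pr(a|s) proportional to exp(Q(s,a)/T): large T ~ uniform, small T ~ greedy.
    qs = [Q.get((state, a), 0.0) for a in actions]
    m = max(qs)                                    # for numerical stability only
    weights = [math.exp((q - m) / temperature) for q in qs]
    return random.choices(actions, weights=weights)[0]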


Explore/Exploit Policies

 Exploration Functions
• stop exploring actions whose badness is established
• continue exploring other actions
 Let Q(s,a) = q, #visits(s,a) = n
 E.g.: f(q, n) = q + k/n
• Unexplored states have infinite f
• Highly explored bad states have low f
 Modified Q update
• Q(s, a) ← (1 − α) Q(s, a) + α (r + γ max_{a'} f(Q(s', a'), N(s', a')))
States leading to unexplored states are also preferred
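A sketch of the modified update with f(q, n) = q + k/n; treating never-tried pairs as having a large optimistic value (rather than a literal infinity) is an implementation choice of this sketch.

OPTIMISTIC_VALUE = 100.0   # assumed optimistic value for never-tried (s, a) pairs

def f(q, n, k=10.0):
    # Exploration function: unexplored pairs look very good; heavily explored
    # pairs fall back to their learned value plus a shrinking bonus k/n.
    return OPTIMISTIC_VALUE if n == 0 else q + k / n

def q_update_with_exploration(Q, N, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' f(Q(s',a'), N(s',a')))
    N[(s, a)] = N.get((s, a), 0) + 1
    target = max(f(Q.get((s_next, a2), 0.0), N.get((s_next, a2), 0)) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * target)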
Explore/Exploit Policies

 A Famous Exploration Policy: UCB


• Upper Confidence Bound

 UCT(s) = argmax_a Q(s, a) + c √( ln n(s) / n(s, a) )

Value Term: favors actions that looked good historically
Exploration Term: actions get an exploration bonus that grows with ln(n)

Optimistic in the Face of Uncertainty
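A sketch of this rule; the square root written above follows the standard UCB1/UCT form (it does not survive in the extracted formula), and trying unvisited actions first is a convention assumed by this sketch.

import math

def ucb_action(Q, n_s, n_sa, state, actions, c=1.4):
    # UCT(s) = argmax_a  Q(s,a) + c * sqrt(ln n(s) / n(s,a))
    # n_s[state] = visits to the state, n_sa[(state, a)] = visits to the pair.
    untried = [a for a in actions if n_sa.get((state, a), 0) == 0]
    if untried:
        return untried[0]                     # unvisited actions get priority
    return max(actions,
               key=lambda a: Q.get((state, a), 0.0)
                             + c * math.sqrt(math.log(n_s[state]) / n_sa[(state, a)]))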


Model based vs. Model Free RL

 Model based
• estimate O(|S|2|A|) parameters
• requires relatively more data for learning
• can make use of background knowledge easily

 Model free
• estimate O(|S||A|) parameters
• requires relatively less data for learning
Generalizing Across States

 Basic Q-Learning (or VI) keeps a table of all q-values

 In realistic situations, we cannot possibly learn about


every single state!
• Too many states to visit them all in training
• Too many states to hold the q-tables in memory

 Instead, we want to generalize:


• Learn about some small number of training states from experience
• Generalize that experience to new, similar situations
• This is a fundamental idea in machine learning

44
Feature-based Representation

 Describe a state using vector of features


 We can write a q function using a few weights:
Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + … + w_n f_n(s, a)

 Advantage: our experience is summed up in a


few powerful numbers (wi)

 Disadvantage: states may share features but


actually be very different in value!

45
Approximate Q-Learning

 Exact Q-Learning update


• Q(s, a) ← Q(s, a) + α (r + γ max_{a'} Q(s', a') − Q(s, a))
 Q-Learning with linear function approximation
• w_m ← w_m + α (r + γ max_{a'} Q(s', a') − Q(s, a)) f_m(s, a)

 Move feature weights up/down based on


difference and feature values
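A minimal sketch of this weight update with a linear Q-function over features; the feature-function signature features(s, a) -> list of numbers is an assumption made for illustration.

def q_value(w, features, s, a):
    # Linear approximation: Q(s,a) = sum_m w_m * f_m(s,a)
    return sum(w_m * f_m for w_m, f_m in zip(w, features(s, a)))

def approx_q_update(w, features, s, a, r, s_next, actions, alpha=0.01, gamma=0.9):
    # w_m <- w_m + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)) * f_m(s,a)
    best_next = max(q_value(w, features, s_next, a2) for a2 in actions)
    difference = (r + gamma * best_next) - q_value(w, features, s, a)
    return [w_m + alpha * difference * f_m for w_m, f_m in zip(w, features(s, a))]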

46
Optimization: Least Squares

[Plot: observations scattered around a fitted line (the prediction); the error is the vertical gap between each observation and the prediction]
Minimizing Error

Imagine we had only one point x, with features f(x), target value y,
and weights w:

error(w) = (1/2) (y − Σ_m w_m f_m(x))^2
w_m ← w_m + α (y − Σ_m w_m f_m(x)) f_m(x)

Approximate q update explained: the sample r + γ max_{a'} Q(s', a') plays the role of the
“target” y, and Q(s, a) plays the role of the “prediction”.
Overfitting and Limited Capacity Approximations

Low capacity generalizes better


Issue: linear approximation not powerful enough in practice
Deep Learning!
49
Summary: RL

RL is a very general AI problem


most general single agent?

Main idea: expectation under P as average of samples


where the sampling distribution is P

Agent learns as it gathers experience


Exploration-exploitation tradeoff
Function approximation is key: deep RL is the rage!
50
Applications
 Stochastic Games
 Robotics: navigation, helicopter maneuvers…
 Finance: options, investments
 Communication Networks
 Medicine: Radiation planning for cancer
 Controlling workflows
 Optimize bidding decisions in auctions
 Traffic flow optimization
 Aircraft queueing for landing; airline meal provisioning
 Optimizing software on mobiles
 Forest firefighting
 …
51
