16 RL
Chapter 21
Mausam
(some slides by Rajesh Rao)
MDPs
Agent–environment loop: percepts in, actions out. What action next?
Assumptions of the basic MDP setting:
• Static environment
• Fully observable
• Stochastic actions
• Instantaneous actions
• Perfect percepts
Reinforcement Learning
• S: a set of states
• A: a set of actions
• T(s,a,s’): transition model
• R(s,a): reward model
• γ: discount factor
• Still looking for a policy π(s)
Learning/Planning/Acting
Main Dimensions
Example: foraging
• Bees can learn a near-optimal foraging policy in a field of artificial flowers with controlled nectar supplies
Passive Learning (Policy Evaluation)
Remember
• We don’t know T
• We don’t know R
• But we can execute the policy π (and simulate it)
Key Idea
• Compute expectations by averaging over samples
Aside: Expected Age
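The slide body is not reproduced here; as a hedged illustration of "compute expectations by averaging over samples", the following Python sketch (with invented numbers) contrasts the exact expectation from a known distribution, a model-free sample average, and a model-based estimate that counts outcomes first:

import random

# Hypothetical distribution over ages, just to illustrate the idea.
age_dist = {20: 0.35, 21: 0.35, 22: 0.30}

# Known distribution: the expectation is a weighted sum.
exact = sum(p * a for a, p in age_dist.items())

# Only samples: estimate the expectation by averaging them (model-free).
samples = random.choices(list(age_dist), weights=list(age_dist.values()), k=10000)
model_free = sum(samples) / len(samples)

# Model-based alternative: estimate the distribution by counting,
# then take the weighted sum with the estimated probabilities.
counts = {a: samples.count(a) for a in age_dist}
model_based = sum((counts[a] / len(samples)) * a for a in age_dist)

print(exact, model_free, model_based)  # all three should be close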
Example
• 3×4 grid world: rows A–C, columns 1–4 (12 states, 4 actions)
• Reward(action) = -1
• Discount factor = 1
• A4 (reward +100) and C4 (reward -100) are absorbing states
Data on Executing π (two episodes in the grid above)
• Episode 1: (A1, D, -1), (B1, R, -1), (B2, R, -1), (B3, U, -1), (A3, R, -1), (A2, D, -1), (B2, R, -1), (B3, U, -1), (A3, R, -1), (A4, +100)
• Episode 2: (A1, D, -1), (B1, R, -1), (B2, R, -1), (B3, U, -1), (C3, U, -1), (C4, -100)
Estimated transition probabilities, as relative frequencies in the data (see the sketch below):
• T(A1, D, B1) = 1
• T(B3, U, A3) = 2/3
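A sketch (not from the slides) of the counting estimate behind these numbers, with the two episodes written as (state, action, reward) triples; terminal entries carry no action:

from collections import Counter, defaultdict

episodes = [
    [("A1", "D", -1), ("B1", "R", -1), ("B2", "R", -1), ("B3", "U", -1), ("A3", "R", -1),
     ("A2", "D", -1), ("B2", "R", -1), ("B3", "U", -1), ("A3", "R", -1), ("A4", None, 100)],
    [("A1", "D", -1), ("B1", "R", -1), ("B2", "R", -1), ("B3", "U", -1), ("C3", "U", -1),
     ("C4", None, -100)],
]

# Count observed transitions (s, a) -> s'.
trans_counts = defaultdict(Counter)
for ep in episodes:
    for (s, a, r), (s_next, _, _) in zip(ep, ep[1:]):
        trans_counts[(s, a)][s_next] += 1

def T_hat(s, a, s_next):
    """Estimated transition probability: relative frequency of s' after (s, a)."""
    c = trans_counts[(s, a)]
    return c[s_next] / sum(c.values()) if c else 0.0

print(T_hat("A1", "D", "B1"))  # 1.0
print(T_hat("B3", "U", "A3"))  # 0.666...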
Method 2: Empirical Estimation of V^π
Data on Executing π (same two episodes as above)
Estimate V^π(s) directly as the average of the returns observed from s (see the sketch below):
• V^π(B1) = ?
• V^π(B2) = ?
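A sketch of this direct (Monte Carlo) estimation, assuming γ = 1 and every-visit averaging; the tiny demo episode at the end is made up just to show the call, and the episodes list above can be plugged in instead:

from collections import defaultdict

def direct_estimate(episodes, gamma=1.0):
    """V^pi(s) ~ average of the observed discounted returns from each visit to s."""
    returns = defaultdict(list)
    for ep in episodes:
        rewards = [r for (*_, r) in ep]
        for t, (s, *_) in enumerate(ep):
            # Discounted return from time t to the end of the episode.
            G = sum(gamma ** k * r for k, r in enumerate(rewards[t:]))
            returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

demo = [[("B1", "R", -1), ("B2", "R", -1), ("A4", None, 100)]]
print(direct_estimate(demo))  # {'B1': 98.0, 'B2': 99.0, 'A4': 100.0}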
Properties
Direct estimation is wasteful (why?)
• Compare V^π(B1) and V^π(B2): each is estimated from its own returns, even though the data show B2 immediately follows B1 under π, so their values are closely related
Method 3: Temporal Difference Learning
V^π(s) = Σ_{s′} T(s, π(s), s′) [ R(s, π(s), s′) + γ V^π(s′) ]
The sum over successors s′ represents the relationship between s and s′.
TD learning: compute this expectation as an average over samples.
TD Learning
V^π(s) = Σ_{s′} T(s, π(s), s′) [ R(s, π(s), s′) + γ V^π(s′) ]
Say I know the correct values of V^π(s1) and V^π(s2).
Example (γ = 1): executing a0 = π(s) in s reaches s1 (reward 5, V^π(s1) = 5) with Pr = 0.6, and s2 (reward 2, V^π(s2) = 3) with Pr = 0.4.
• Exact expectation: V^π(s) = 0.6(5 + 5) + 0.4(2 + 3) = 6 + 2 = 8
• Average over samples (say 3 reach s1 and 2 reach s2): V^π(s) = (10 + 10 + 10 + 5 + 5)/5 = 8
TD Learning
V^π(s) = Σ_{s′} T(s, π(s), s′) [ R(s, π(s), s′) + γ V^π(s′) ]
The inner term is the sample value:
• (s, s′, r): reached s′ from s by executing π(s) and got immediate reward r
• sample = r + γ V^π(s′)
Compute V^π(s) = (1/N) Σ_i sample_i
Running average over the first n+1 samples, with learning rate α:
• V^π_{n+1}(s) ← V^π_n(s) + α (sample_{n+1} − V^π_n(s))
• Nudge the old estimate towards the new sample
TD Learning
Given an observed transition (s, s′, r):
• V^π(s) ← V^π(s) + α (sample − V^π(s))
• V^π(s) ← V^π(s) + α (r + γ V^π(s′) − V^π(s)), where r + γ V^π(s′) − V^π(s) is the TD error
• V^π(s) ← (1 − α) V^π(s) + α (r + γ V^π(s′))
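A sketch of tabular TD(0) policy evaluation implementing this update; step, policy, and start_states are placeholder interfaces, not from the slides:

import random
from collections import defaultdict

def td0_evaluate(step, policy, start_states, gamma=0.9, alpha=0.1, n_steps=10000):
    """Tabular TD(0): nudge V(s) towards the sample r + gamma * V(s').

    Assumes step(s, a) -> (s_next, r, done) and policy(s) -> action.
    """
    V = defaultdict(float)
    s = random.choice(start_states)
    for _ in range(n_steps):
        a = policy(s)
        s_next, r, done = step(s, a)
        sample = r if done else r + gamma * V[s_next]  # sample = r + gamma V(s')
        V[s] += alpha * (sample - V[s])                # move V(s) towards the sample
        s = random.choice(start_states) if done else s_next
    return V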
Classical (Pavlovian) conditioning experiments
• Training: Bell → Food
• After: Bell → Salivate
• Conditioned stimulus (bell) predicts future reward (food)
Predicting Delayed Rewards
• Before training vs. after training
• No error: [0 + v(t+1) − v(t)] = 0 once v(t) = r(t) + v(t+1)
Figure from Theoretical Neuroscience by Peter Dayan and Larry Abbott, MIT Press, 2001
More Evidence for Prediction Error Signals
• Negative error: r(t) = 0 and v(t+1) = 0, so [r(t) + v(t+1) − v(t)] = −v(t)
Figure from Theoretical Neuroscience by Peter Dayan and Larry Abbott, MIT Press, 2001
The Story So Far: MDPs and RL
Goal → Technique
• Compute V*, Q*, π* → value / policy iteration
• Evaluate a fixed policy π → policy evaluation
Key challenge?
Model-based RL Example
• Grid world (rows A–C, columns 1–4) with mostly unknown cells (?); A4 = +100, and a cell in row C has reward -2
• Say the world is deterministic and there is no wind
Key challenge
• Just executing the current policy π_i is not enough!
• It may miss important regions
• Needs to explore new regions
TD Learning → TD (V*) Learning
V*(s) = max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ]
Q*(s, a) = Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ]
Q*(s, a) = Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ max_{a′} Q*(s′, a′) ]
VI → Q-Value Iteration
TD Learning → Q-Learning
Q Learning
Q Learning Algorithm (see the Python sketch below)
Forall s, a
• Initialize Q(s, a) = 0
Repeat forever:
• Where are you? s
• Choose some action a
• Execute it in the real world: observe (s, a, r, s′)
• Do update: Q(s, a) ← (1 − α) Q(s, a) + α (r + γ max_{a′} Q(s′, a′))
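A sketch of this loop in Python; the environment interface (env.reset, env.step) is an assumption, and the action here is chosen uniformly at random (exploration strategies come later):

import random
from collections import defaultdict

def q_learning(env, actions, gamma=0.9, alpha=0.1, n_steps=100000):
    """Tabular Q-learning; assumes env.reset() -> s and env.step(a) -> (s_next, r, done)."""
    Q = defaultdict(float)             # Q[(s, a)], initialized to 0
    s = env.reset()
    for _ in range(n_steps):
        a = random.choice(actions)     # "choose some action a"
        s_next, r, done = env.step(a)
        best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
        s = env.reset() if done else s_next
    return Q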
Properties
Q-learning converges to the optimal Q-values as long as
• states and actions are finite, and all rewards are bounded
• no (s, a) pair is starved: each is visited infinitely often in the limit of infinite samples
• the learning rate decays with visits to each state-action pair
• but not too fast: Σ_i α_i(s, a) = ∞ and Σ_i α_i(s, a)² < ∞
Q Learning Algorithm
Forall s, a
• Initialize Q(s, a) = 0
Repeat forever:
• Where are you? s
• Choose some action a. How to choose?
  • try something new: exploration
  • pick the best-looking action: exploitation
• Execute it in the real world: observe (s, a, r, s′)
• Do update: Q(s, a) ← (1 − α) Q(s, a) + α (r + γ max_{a′} Q(s′, a′))
Exploration vs. Exploitation Tradeoff
A fundamental tradeoff in RL
Problem with ε-greedy exploration
• the exploration probability ε is constant, so random actions keep being taken even after a good policy has been learned
Solutions (see the sketch below)
• Lower ε over time
• Use an exploration function
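One way to realize "lower ε over time", as a sketch rather than the slides' own code: ε-greedy selection plus a decay schedule; the constants are arbitrary:

import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon explore (random action); otherwise exploit (greedy)."""
    if random.random() < epsilon:
        return random.choice(actions)                      # exploration
    return max(actions, key=lambda a: Q.get((s, a), 0.0))  # exploitation

def epsilon_schedule(t, eps_start=1.0, eps_min=0.05, decay=0.999):
    """Exponentially decaying exploration probability (one arbitrary choice of schedule)."""
    return max(eps_min, eps_start * decay ** t)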
Explore/Exploit Policies
Boltzmann Exploration
• Select action a with probability Pr(a|s) = exp(Q(s, a)/T) / Σ_{a′∈A} exp(Q(s, a′)/T)
T: temperature
• Similar to simulated annealing
• Large T: near-uniform; small T: near-greedy
• Start with large T and decrease it with time (see the sketch below)
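A sketch of Boltzmann (softmax) action selection as described above; subtracting the maximum before exponentiating is a numerical-stability choice of mine:

import math
import random

def boltzmann_action(Q, s, actions, temperature):
    """Pr(a|s) proportional to exp(Q(s,a)/T): near-uniform for large T, near-greedy for small T."""
    prefs = [Q.get((s, a), 0.0) / temperature for a in actions]
    m = max(prefs)                                # shift for numerical stability
    weights = [math.exp(p - m) for p in prefs]
    return random.choices(actions, weights=weights, k=1)[0]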
Exploration Functions
• Stop exploring actions whose badness is established
• Continue exploring other actions
Let Q(s, a) = q and #visits(s, a) = n
• E.g.: f(q, n) = q + k/n
• Unexplored state-action pairs have infinite f
• Heavily explored bad pairs have low f
Modified Q update (see the sketch below)
• Q(s, a) ← (1 − α) Q(s, a) + α (r + γ max_{a′} f(Q(s′, a′), N(s′, a′)))
• States leading to unexplored states are also preferred
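A sketch of the modified update with f(q, n) = q + k/n; the constant k, and the use of a large finite optimistic value instead of literal infinity for unvisited pairs, are my assumptions:

from collections import defaultdict

K = 10.0        # exploration bonus weight (arbitrary)
OPTIMISM = 1e6  # stand-in for the slide's "infinite f" on unvisited pairs

def f(q, n):
    """Exploration function f(q, n) = q + k/n; unvisited pairs look (almost) infinitely good."""
    return OPTIMISM if n == 0 else q + K / n

def optimistic_update(Q, N, s, a, r, s_next, actions, gamma=0.9, alpha=0.1):
    """Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma * max_a' f(Q(s',a'), N(s',a')))."""
    N[(s, a)] += 1
    best = max(f(Q[(s_next, a2)], N[(s_next, a2)]) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best)

# Q and N can be defaultdicts keyed by (state, action):
Q, N = defaultdict(float), defaultdict(int)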
Explore/Exploit Policies
UCB-style action selection:
• UCT(s) = argmax_a [ Q(s, a) + c √( ln n(s) / n(s, a) ) ]
Value term: favors actions that looked good historically
Exploration term: actions get an exploration bonus that grows with ln n(s) and shrinks as n(s, a) grows
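A sketch of this selection rule; the constant c and the "try unvisited actions first" handling are implementation choices, not from the slides:

import math

def ucb_action(Q, N_s, N_sa, s, actions, c=1.0):
    """Pick argmax_a Q(s,a) + c * sqrt(ln n(s) / n(s,a))."""
    def score(a):
        n_sa = N_sa.get((s, a), 0)
        if n_sa == 0:
            return float("inf")  # force every action to be tried at least once
        n_s = max(N_s.get(s, 1), 1)
        return Q.get((s, a), 0.0) + c * math.sqrt(math.log(n_s) / n_sa)
    return max(actions, key=score)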
Model-based
• estimates O(|S|²|A|) parameters
• requires relatively more data for learning
• can make use of background knowledge easily
Model-free
• estimates O(|S||A|) parameters
• requires relatively less data for learning
Generalizing Across States
Feature-based Representation
Approximate Q-Learning
Optimization: Least Squares
(figure: the error is the gap between each observation and the prediction of the fitted line)
Minimizing Error
Imagine we had only one point x, with features f(x), target value y, and weights w.
Approximate Q update explained:
• nudge each weight so that the “prediction” moves towards the “target”
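A sketch of the approximate Q update with a linear feature representation, assuming Q(s, a) = Σ_i w_i f_i(s, a) and the weight update w_i ← w_i + α (target − prediction) f_i(s, a); the feature function feat_fn is a placeholder:

def q_value(w, features):
    """Linear approximation: Q(s, a) = sum_i w_i * f_i(s, a)."""
    return sum(wi * fi for wi, fi in zip(w, features))

def approx_q_update(w, feat_fn, s, a, r, s_next, actions, gamma=0.9, alpha=0.01):
    """Return updated weights after one (s, a, r, s') sample.

    target     = r + gamma * max_a' Q(s', a')
    prediction = Q(s, a)
    w_i       <- w_i + alpha * (target - prediction) * f_i(s, a)
    Assumes feat_fn(s, a) returns the list of feature values f_i(s, a).
    """
    feats = feat_fn(s, a)
    prediction = q_value(w, feats)
    target = r + gamma * max(q_value(w, feat_fn(s_next, a2)) for a2 in actions)
    error = target - prediction
    return [wi + alpha * error * fi for wi, fi in zip(w, feats)]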
Overfitting and Limited Capacity Approximations