Lect28
• Recall that in Jason you write rules/plans that invoke actions that are
defined by the environment model.
• Imagine if these could be learnt.
https://www.youtube.com/watch?v=nUQsRPJ1dYw
How to decide what to do
• Consider being offered a bet in which you pay £2 if an odd number is rolled on a die, and win £3 if an even number appears.
• Is this a good bet?
• To analyse this, we need the expected value of the bet.
• We do this in terms of a random variable, which we will call X.
• X can take two values:
      +3 if the die rolls even
      −2 if the die rolls odd
• And we can also calculate the probability of these two values:
      Pr(X = 3) = 0.5
      Pr(X = −2) = 0.5
• The expected value is then the weighted sum of the values, where the weights are the probabilities.
• Formally, the expected value of X is defined by:
      E(X) = Σ_k k · Pr(X = k)
  where the summation is over all values of k for which Pr(X = k) ≠ 0.
• Here the expected value is:
      E(X) = 0.5 × 3 + 0.5 × (−2)
• Thus the expected value of X is £0.5, and we take this to be the value of the bet.
• As opposed to £0 if you don’t take the bet.
How to decide what to do
which is −0.33.
How an agent might decide what to do
• In other words, for the set of outcomes s_a of each action a, the agent should calculate:
      E(u(s_a)) = Σ_{s' ∈ s_a} u(s') · Pr(s_a = s')
  and pick the best (see the sketch below).
[Figure: states s1–s6 linked by actions a1 and a2.]

Sequential decision problems
• These approaches give us a battery of techniques to apply to individual decisions by agents.
• However, they aren’t really sufficient.
• Agents aren’t usually in the business of taking single decisions.
• Life is a series of decisions.
• The best overall result is not necessarily obtained by a greedy approach to a series of decisions.
• The current best option isn’t the best thing in the long run.
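A minimal Python sketch of the single-decision rule above (the utilities and probabilities here are made-up numbers, not from the lecture): for each action, weight the utility of each possible outcome state by its probability, and pick the action with the highest expected utility.

    # Minimal sketch: choosing the action with maximum expected utility.
    # Each action maps to a distribution over outcome states s'; u(s')
    # comes from a utility table. All names and numbers are illustrative.

    utilities = {"s2": 5.0, "s3": 1.0, "s4": 2.0, "s5": 8.0, "s6": 0.0}

    outcome_probs = {
        "a1": {"s2": 0.7, "s3": 0.3},
        "a2": {"s4": 0.2, "s5": 0.5, "s6": 0.3},
    }

    def expected_utility(action):
        return sum(p * utilities[s] for s, p in outcome_probs[action].items())

    best = max(outcome_probs, key=expected_utility)
    print(best, expected_utility(best))  # a2: 0.2*2 + 0.5*8 + 0.3*0 = 4.4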
[Figure: a grid world with a start cell S, a goal cell G, and shaded obstacle cells.]
• To get from the start point (S) to the goal (G), an agent needs to repeatedly make a decision about what to do.
[Figure: the grid world with start state S and terminal states labelled +1 and −1, together with the motion model: the agent moves in the intended direction with probability 0.8, and to either side with probability 0.1 each.]
[Figure: the grid world again, with start state S and terminal states +1 and −1.]
• It will get to the goal with probability 0.8^5 = 0.32768 doing what it expects/hopes to do.
• Arguably a more accurate approximation than assuming that it will always do what it is programmed to do.
Motion model
• It can also reach the goal going around the obstacle the other way, with probability 0.1^4 × 0.8.
[Figure: the grid world with start state S, terminal states +1 and −1, and the obstacle.]

Rewards
• To complete the description, we have to give a reward to every state.
• To give the agent an incentive to reach the goal quickly, we give each non-terminal state a reward of −0.04.
• Equivalent to a cost for each action.
• So if the goal is reached after 10 steps, the agent’s overall reward is 1 − (10 × 0.04) = 0.6.
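As a quick check of these numbers (an illustrative snippet, not from the lecture):

    # Arithmetic check of the figures quoted above.
    p_direct = 0.8 ** 5        # reach the goal doing exactly what was intended
    p_around = 0.1 ** 4 * 0.8  # slip sideways four times, around the obstacle
    print(p_direct)            # ~0.32768
    print(p_direct + p_around) # ~0.32776 when the two routes are combined

    reward = 1 - 10 * 0.04     # +1 at the goal, minus 0.04 for each of 10 steps
    print(reward)              # 0.6 (up to floating-point rounding)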
Policies
• We compute this using the Bellman equation:
      U(s) = R(s) + γ max_{a ∈ A(s)} Σ_{s'} Pr(s'|s,a) U(s')
• γ is a discount factor.
• Note that this is specific to the value of the reward R(s) for non-terminal states — different rewards will give different policies.
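The next slide is about value iteration, which applies this update repeatedly. As an illustration (not the lecture's own code), here is a compact Python sketch for a grid world like the one above; the step reward of −0.04, the ±1 terminals and the 0.8/0.1 motion model come from the slides, while the exact 4×3 layout and cell positions are assumptions, since the slides only show the grid as a picture, and γ is set to 1 for simplicity.

    # Value-iteration sketch for a grid world like the one in the slides.
    # Assumed: a 4x3 grid with an obstacle at (1, 1) and terminals at
    # (3, 2) [+1] and (3, 1) [-1]. Taken from the slides: step reward
    # -0.04, terminal rewards +1/-1, and the 0.8/0.1 motion model.

    GAMMA = 1.0          # discount factor; 1 treats all steps equally
    STEP_REWARD = -0.04  # reward of every non-terminal state

    COLS, ROWS = 4, 3
    OBSTACLE = (1, 1)
    TERMINALS = {(3, 2): 1.0, (3, 1): -1.0}

    STATES = [(x, y) for x in range(COLS) for y in range(ROWS) if (x, y) != OBSTACLE]
    ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    SIDEWAYS = {"up": ("left", "right"), "down": ("left", "right"),
                "left": ("up", "down"), "right": ("up", "down")}

    def move(state, action):
        """Deterministic effect of an action; hitting a wall or the obstacle stays put."""
        dx, dy = ACTIONS[action]
        nxt = (state[0] + dx, state[1] + dy)
        return nxt if nxt in STATES else state

    def transitions(state, action):
        """Pr(s'|s,a): 0.8 in the intended direction, 0.1 to each side."""
        probs = {}
        for a, p in [(action, 0.8), (SIDEWAYS[action][0], 0.1), (SIDEWAYS[action][1], 0.1)]:
            nxt = move(state, a)
            probs[nxt] = probs.get(nxt, 0.0) + p
        return probs

    def value_iteration(eps=1e-6):
        U = {s: 0.0 for s in STATES}
        while True:
            new_U, delta = {}, 0.0
            for s in STATES:
                if s in TERMINALS:
                    new_U[s] = TERMINALS[s]  # a terminal's utility is its reward
                else:
                    # Bellman update: R(s) + gamma * max_a sum_s' Pr(s'|s,a) U(s')
                    new_U[s] = STEP_REWARD + GAMMA * max(
                        sum(p * U[s2] for s2, p in transitions(s, a).items())
                        for a in ACTIONS)
                delta = max(delta, abs(new_U[s] - U[s]))
            U = new_U
            if delta < eps:
                return U

    for s, u in sorted(value_iteration().items()):
        print(s, round(u, 3))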
Value iteration

Applications
Reinforcement learning
• OK, now we have the notion of an MDP, imagine we don’t know what the model is.
• We don’t know R(s).
• We don’t know Pr(s'|s,a).
• But it is simple to learn them — the agent just moves around the environment.
http://vimeo.com/13387420
• Since the agent knows what state s' it gets to when it executes a in s, it can count how often particular transitions occur to estimate Pr(s'|s,a) as the proportion of times executing a in s takes the agent to s'.
• Similarly, the agent can see what reward it gets in s, giving it R(s).
• If the agent wanders randomly for long enough, it will learn the probability and reward values.
• (How would it know what “long enough” was?)
• With these values it can apply the Bellman equation(s) and start doing the right thing.
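A minimal sketch of this model-learning idea (illustrative code, not from the lecture): keep counts of observed transitions and observed rewards, and turn them into estimates of Pr(s'|s,a) and R(s).

    from collections import defaultdict

    # Sketch: learning the MDP model from experience tuples (s, a, r, s'),
    # meaning the agent executed a in s, saw reward r, and ended up in s'.

    transition_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
    reward_sums = defaultdict(float)                           # s -> total reward observed
    reward_counts = defaultdict(int)                           # s -> number of visits

    def record(s, a, r, s_next):
        transition_counts[(s, a)][s_next] += 1
        reward_sums[s] += r
        reward_counts[s] += 1

    def estimated_transition(s, a):
        """Pr(s'|s,a) as the proportion of times a in s led to each s'."""
        counts = transition_counts[(s, a)]
        total = sum(counts.values())
        return {s2: c / total for s2, c in counts.items()} if total else {}

    def estimated_reward(s):
        """R(s) as the average reward observed in s."""
        return reward_sums[s] / reward_counts[s] if reward_counts[s] else 0.0

    # Example with made-up observations:
    record("s1", "a1", -0.04, "s2")
    record("s1", "a1", -0.04, "s2")
    record("s1", "a1", -0.04, "s3")
    print(estimated_transition("s1", "a1"))  # roughly {'s2': 0.67, 's3': 0.33}
    print(estimated_reward("s1"))            # roughly -0.04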
Reinforcement learning

Q-learning
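As a reminder of the standard tabular Q-learning update (textbook form; the notation used in the lecture may differ), it can be sketched as:

    from collections import defaultdict

    # Standard tabular Q-learning update (textbook form):
    # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))

    Q = defaultdict(float)      # (state, action) -> estimated value
    ALPHA, GAMMA = 0.1, 0.9     # learning rate and discount factor (assumed values)

    def q_update(s, a, r, s_next, actions):
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

    # Example with made-up data:
    q_update("s1", "right", -0.04, "s2", ["up", "down", "left", "right"])
    print(Q[("s1", "right")])   # about -0.004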
Summary