
Robotics and Autonomous Systems
Lecture 28: Learning in Robots

Richard Williams
Department of Computer Science
University of Liverpool

Today

• We have seen how difficult it is to program, down to the finest details, a control policy (a controller) for robots to carry out even simple tasks like obstacle avoidance.
• Ideally, one would like to be able to simply specify what the robot is supposed to achieve:
  • Go from A to B without hitting obstacles
  without having to say precisely how that should be done.
• Machine learning techniques, in particular Reinforcement Learning, can be used for that purpose: the robot will then learn the "how" of the "what" the programmer has specified as the task.
• This is our subject today.

Little Dog

https://www.youtube.com/watch?v=nUQsRPJ1dYw

How learning might fit in

• Recall that in Jason you write rules/plans that invoke actions that are defined by the environment model.
• Imagine if these could be learnt.
How to decide what to do

• Consider being offered a bet in which you pay £2 if an odd number is rolled on a die, and win £3 if an even number appears.
• Is this a good bet?
• To analyse this, we need the expected value of the bet.

How to decide what to do

• We do this in terms of a random variable, which we will call X.
• X can take two values:
  3 if the die rolls even
  −2 if the die rolls odd
• And we can also calculate the probability of these two values:
  Pr(X = 3) = 0.5
  Pr(X = −2) = 0.5
• The expected value is then the weighted sum of the values, where the weights are the probabilities.
• Formally the expected value of X is defined by:
  E(X) = Σ_k k · Pr(X = k)
  where the summation is over all values of k for which Pr(X = k) ≠ 0.
• Here the expected value is:
  E(X) = 0.5 × 3 + 0.5 × (−2)
• Thus the expected value of X is £0.5, and we take this to be the value of the bet.
• As opposed to £0 if you don't take the bet.
How to decide what to do

• Another bet: you get £1 if a 2 or a 3 is rolled, £5 if a six is rolled, and pay £3 otherwise.
• The expected value here is:
  E(X) = 0.333 × 1 + 0.166 × 5 + 0.5 × (−3)
  which is −0.33.
• Not the value you will get.
• But a value that allows you to make a decision.
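To make the arithmetic concrete, here is a minimal sketch (Python, not part of the original slides) that computes the expected value of both bets from their (value, probability) pairs:

```python
# Expected value E(X) = sum over k of k * Pr(X = k).

def expected_value(outcomes):
    """outcomes: list of (value, probability) pairs with non-zero probability."""
    return sum(value * prob for value, prob in outcomes)

# Bet 1: win £3 on an even roll, pay £2 on an odd roll.
bet1 = [(3, 0.5), (-2, 0.5)]

# Bet 2: win £1 on a 2 or 3, win £5 on a 6, pay £3 otherwise.
bet2 = [(1, 2/6), (5, 1/6), (-3, 3/6)]

print(expected_value(bet1))  # 0.5
print(expected_value(bet2))  # approximately -0.33
```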

How an agent might decide what to do

• Consider an agent with a set of possible actions A available to it.
• Each a ∈ A has a set of possible outcomes s_a.
• Which action should the agent pick?
• The action a* which a rational agent should choose is that which maximises the agent's utility.
• In other words the agent should pick:
  a* = arg max_{a ∈ A} u(s_a)
• The problem is that in any realistic situation, we don't know which s_a will result from a given a, so we don't know the utility of a given action.
• Instead we have to calculate the expected utility of each action and make the choice on the basis of that.
How an agent might decide what to do

• In other words, for the set of outcomes s_a of each action a, the agent should calculate:
  E(u(s_a)) = Σ_{s' ∈ s_a} u(s') · Pr(s_a = s')
  and pick the best (see the code sketch below).

[Figure: a state s1 with actions a1 and a2, each leading to several possible outcome states s2–s6.]

Sequential decision problems

• These approaches give us a battery of techniques to apply to individual decisions by agents.
• However, they aren't really sufficient.
• Agents aren't usually in the business of taking single decisions.
• Life is a series of decisions.
• The best overall result is not necessarily obtained by a greedy approach to a series of decisions.
• The current best option isn't the best thing in the long run.
  • Otherwise I'd only ever eat chocolate cake.
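A minimal sketch (Python, with illustrative names not taken from the slides) of picking the action that maximises expected utility:

```python
# Choose a* = argmax over a of E(u(s_a)) = sum over s' of Pr(s_a = s') * u(s').

def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs for one action."""
    return sum(p * u for p, u in outcomes)

def best_action(actions):
    """actions: dict mapping action name -> list of (probability, utility) pairs."""
    return max(actions, key=lambda a: expected_utility(actions[a]))

# Hypothetical example: two actions with uncertain outcomes.
actions = {
    "a1": [(0.7, 10), (0.3, -5)],  # E(u) = 5.5
    "a2": [(0.5, 8), (0.5, 2)],    # E(u) = 5.0
}
print(best_action(actions))  # a1
```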

Sequential decision problems

• Need to think about sequential decision problems where the agent's utility depends on a sequence of decisions.
• We saw something like this at the start of the semester.
  • Runs of an agent.

[Grid-world figure: a grid with a start state S and a goal state G.]

• To get from the start point (S) to the goal (G), an agent needs to repeatedly make a decision about what to do.
Rewards

• Here we exchange the notion of a goal for the notion of a reward.
• It is easy to see which is the "goal" in this case:

[Grid-world figure: the same grid with terminal states labelled +1 and −1, and start state S.]

• The action model is more complex than we saw before.
• Now actions are non-deterministic.

Motion model

• If the agent chooses to move in some direction, there is a probability of 0.8 it will move that way.
• Probability of 0.2 it will move in the perpendicular direction (0.1 to each side).
• If the agent hits a wall, it doesn't move.
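As a small illustration (Python; the direction encoding is an assumption, not from the slides), sampling an outcome from this motion model might look like:

```python
import random

# Directions as (dx, dy) vectors; each maps to its two perpendicular directions.
PERPENDICULAR = {
    (0, 1): [(-1, 0), (1, 0)],   # Up    -> Left, Right
    (0, -1): [(-1, 0), (1, 0)],  # Down  -> Left, Right
    (-1, 0): [(0, 1), (0, -1)],  # Left  -> Up, Down
    (1, 0): [(0, 1), (0, -1)],   # Right -> Up, Down
}

def sample_move(intended):
    """Return the direction actually moved: 0.8 intended, 0.1 each perpendicular."""
    r = random.random()
    if r < 0.8:
        return intended
    side_a, side_b = PERPENDICULAR[intended]
    return side_a if r < 0.9 else side_b
```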

Motion model

• As you know by now, this is an approximation to how a robot moves.
• Arguably a more accurate approximation than assuming that it will always do what it is programmed to do.
• If the agent goes {Up, Up, Right, Right, Right}:

[Grid-world figure: the intended route from S to the +1 state.]

• It will get to the goal with probability 0.8^5 = 0.32768 doing what it expects/hopes to do.
Motion model

• It can also reach the goal going around the obstacle the other way, with probability 0.1^4 × 0.8.

[Grid-world figure: the alternative route around the obstacle.]

• Total probability of reaching the goal is 0.32776.

Rewards

• To complete the description, we have to give a reward to every state.
• To give the agent an incentive to reach the goal quickly, we give each non-terminal state a reward of −0.04.
  • Equivalent to a cost for each action.
• So if the goal is reached after 10 steps, the agent's overall reward is 0.6.
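A quick check of these numbers (Python, added for illustration):

```python
# Probability of the intended route, the accidental route around the obstacle, and their sum.
p_intended = 0.8 ** 5        # ~0.32768
p_around = 0.1 ** 4 * 0.8    # ~0.00008
print(round(p_intended + p_around, 5))   # 0.32776

# Overall reward when the +1 goal is reached after 10 steps of -0.04 each.
print(round(1 + 10 * -0.04, 2))          # 0.6
```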

Markov Decision Process

• This kind of problem is a Markov Decision Process (MDP).
• We have:
  • a set of states S.
  • an initial state s0.
  • a set A of actions.
  • a transition model Pr(s'|s, a) for s, s' ∈ S and a ∈ A; and
  • a reward function R(s) for s ∈ S.
• What does a solution look like?

Policies

• A plan (a sequence of actions) is not much help.
  • Isn't guaranteed to find the goal.
• Better is a policy π, which tells us which action π(s) to do in every state.
• Then the non-determinism doesn't matter.
  • However badly we do as a result of an action, we will know what to do.
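One way to hold such an MDP in code (a minimal Python sketch with made-up field names, not from the slides; later sketches below reuse it):

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: list           # S
    actions: dict          # A(s): state -> list of actions available in that state
    transitions: dict      # Pr(s'|s, a): (s, a) -> {s': probability}
    rewards: dict          # R(s): state -> reward
    initial_state: object  # s0
    gamma: float = 0.9     # discount factor, used by the Bellman updates later

# Hypothetical two-state example: "go" usually works, sometimes leaves you in place.
mdp = MDP(
    states=["s1", "s2"],
    actions={"s1": ["go"], "s2": []},          # s2 is terminal
    transitions={("s1", "go"): {"s2": 0.8, "s1": 0.2}},
    rewards={"s1": -0.04, "s2": 1.0},
    initial_state="s1",
)
```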
Policies

• Because of the non-determinism, a policy will give us different sequences of actions different times it is run.
• To tell how good a policy is, we can compute the expected value.
  • Compute the value you get when you run the policy.
  • Can compute it by running the policy (a rollout sketch follows below).
• The optimal policy π* is the one that gives the highest expected utility.
  • On average it will give the best reward.
• Given π* an agent doesn't have to think: it just does the right action for the state it is in.
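A minimal sketch (Python, illustrative only, reusing the MDP structure sketched above) of estimating a policy's expected value by running it many times and averaging the total reward:

```python
import random

def sample_next_state(mdp, s, a):
    """Sample s' from Pr(s'|s, a)."""
    dist = mdp.transitions[(s, a)]
    return random.choices(list(dist), weights=list(dist.values()))[0]

def run_policy(mdp, policy, max_steps=100):
    """Run one episode of the policy; return the total (undiscounted) reward."""
    s = mdp.initial_state
    total = mdp.rewards[s]
    for _ in range(max_steps):
        if not mdp.actions[s]:          # terminal state: nothing left to do
            break
        s = sample_next_state(mdp, s, policy[s])
        total += mdp.rewards[s]
    return total

def estimate_policy_value(mdp, policy, episodes=1000):
    """The average total reward over many runs approximates the expected value."""
    return sum(run_policy(mdp, policy) for _ in range(episodes)) / episodes

print(estimate_policy_value(mdp, policy={"s1": "go"}))  # roughly 0.95 for the example MDP above
```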

Policies

• The optimum policy is then:

[Grid-world figure: the optimal action to take in each state.]

Bellman Equation

• How do we find the best policy (for a given set of rewards)?
• Turns out that there is a neat way to do this, by first computing the utility of each state.
• We compute this using the Bellman equation:
  U(s) = R(s) + γ max_{a ∈ A(s)} Σ_{s'} Pr(s'|s, a) U(s')
• γ is a discount factor.
• Note that this is specific to the value of the reward R(s) for non-terminal states: different rewards will give different policies.
Value iteration

• In an MDP with n states, we will have n Bellman equations.
• Hard to solve these simultaneously because of the max operation.
  • Makes them non-linear.
• Instead use an iterative approach: value iteration.
• Start with arbitrary values for the utilities (say 0) and then update with:
  U_{i+1}(s) ← R(s) + γ max_{a ∈ A(s)} Σ_{s'} Pr(s'|s, a) U_i(s')
• Repeat until the values stabilise.
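A minimal value iteration sketch (Python, written against the MDP structure sketched earlier; not part of the slides):

```python
def value_iteration(mdp, tolerance=1e-6):
    """Iterate U_{i+1}(s) = R(s) + gamma * max_a sum_{s'} Pr(s'|s,a) * U_i(s')."""
    U = {s: 0.0 for s in mdp.states}              # arbitrary starting utilities
    while True:
        U_next = {}
        for s in mdp.states:
            if mdp.actions[s]:                    # non-terminal: best one-step lookahead
                best = max(
                    sum(p * U[s2] for s2, p in mdp.transitions[(s, a)].items())
                    for a in mdp.actions[s]
                )
            else:                                 # terminal: no future utility
                best = 0.0
            U_next[s] = mdp.rewards[s] + mdp.gamma * best
        if max(abs(U_next[s] - U[s]) for s in mdp.states) < tolerance:
            return U_next
        U = U_next
```

Acting greedily with respect to the resulting utilities then gives the policy for those rewards.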


Applications

• The Bellman equation(s)/update are widely used. For example:
• D. Romer, "It's Fourth Down and What Does the Bellman Equation Say? A Dynamic Programming Analysis of Football Strategy", NBER Working Paper No. 9024, June 2002.

"This paper uses play-by-play accounts of virtually all regular season National Football League games for 1998-2000 to analyze teams' choices on fourth down between trying for a first down and kicking. Dynamic programming is used to estimate the values of possessing the ball at different points on the field. These estimates are combined with data on the results of kicks and conventional plays to estimate the average payoffs to kicking and going for it under different circumstances. Examination of teams' actual decisions shows systematic, overwhelmingly statistically significant, and quantitatively large departures from the decisions the dynamic-programming analysis implies are preferable."

Partial observability

• For all their complexity, MDPs are not an accurate model of the world.
  • They assume accessibility/observability.
• To deal with partial observability we have the Partially Observable Markov Decision Process (POMDP).
  • We don't know which state we are in, but we know what probability we have of being in every state.
• That is all we will say on the subject.
Reinforcement learning

• OK, now we have the notion of an MDP, imagine we don't know what the model is.
  • We don't know R(s).
  • We don't know Pr(s'|s, a).
• But it is simple to learn them: the agent just moves around the environment.
• Since it knows what state s' it gets to when it executes a in s, it can count how often particular transitions occur to estimate:
  Pr(s'|s, a)
  as the proportion of times executing a in s takes the agent to s'.

http://vimeo.com/13387420
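A minimal sketch (Python, illustrative) of estimating the transition model from counts of observed (s, a, s') triples:

```python
from collections import Counter, defaultdict

class TransitionEstimator:
    """Estimate Pr(s'|s, a) as the proportion of times that a executed in s led to s'."""

    def __init__(self):
        self.counts = defaultdict(Counter)   # (s, a) -> Counter over next states

    def record(self, s, a, s_next):
        """Call this for every transition the agent observes as it moves around."""
        self.counts[(s, a)][s_next] += 1

    def probability(self, s_next, s, a):
        total = sum(self.counts[(s, a)].values())
        return self.counts[(s, a)][s_next] / total if total else 0.0
```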

Reinforcement learning

• Similarly the agent can see what reward it gets in s to give it R(s).
• If the agent wanders randomly for long enough, it will learn the probability and reward values.
  • (How would it know what "long enough" was?)
• With these values it can apply the Bellman equation(s) and start doing the right thing.
Reinforcement learning

• The agent can also be smarter, and use the values as it learns them.
• At each step it can solve the Bellman equation(s) to compute the best action given what it knows.
• This means it can learn quicker, but also it may lead to sub-optimal performance.

Q-learning

• Q-learning is a model-free approach to reinforcement learning.
  • It doesn't need to learn P(s'|s, a).
• Revolves around the notion of Q(s, a), which denotes the value of doing a in s:
  U(s) = max_a Q(s, a)
• We can write:
  Q(s, a) = R(s) + γ Σ_{s'} P(s'|s, a) max_{a'} Q(s', a')
  and we could do value-iteration style updates on this.
  • (Wouldn't be model-free.)

Q-learning

• However, we can write the update rule as:
  Q(s, a) ← Q(s, a) + α (R(s) + γ max_{a'} Q(s', a') − Q(s, a))
  and recalculate it every time that a is executed in s and takes the agent to s' (see the code sketch after the next slide).
• α is a learning rate.
  • Controls how quickly we update the Q-value when we have new information.

http://vimeo.com/13387420

Introducing some supervision

• RL allows the agent to learn control policies from scratch.
• However, when the state-space and the action-space are large, supervision can help bootstrap the learning process.
• With supervision an agent is exposed to instances of positive and negative behavior, which get it started in building its value function (the table of state-action pairs).
• Those instances do not need to show the robot the best solutions! The robot uses that input only as a starting point for its own learning.
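A minimal sketch of the tabular Q-learning update above (Python; the epsilon-greedy exploration and the assumption that all actions are available in every state are illustrative choices, not from the slides):

```python
import random
from collections import defaultdict

class QLearner:
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.Q = defaultdict(float)   # (s, a) -> value, initially 0
        self.actions = actions        # actions assumed available in every state
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor
        self.epsilon = epsilon        # exploration rate

    def choose_action(self, s):
        """Epsilon-greedy: mostly exploit the current Q-values, sometimes explore."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(s, a)])

    def update(self, s, a, reward, s_next):
        """Apply the update after executing a in s, getting reward, and reaching s_next."""
        best_next = max(self.Q[(s_next, a2)] for a2 in self.actions)
        self.Q[(s, a)] += self.alpha * (reward + self.gamma * best_next - self.Q[(s, a)])
```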
Summary

• This lecture has introduced machine learning.
  • One aspect: reinforcement learning.
• The idea is to have the robot figure out for itself how to do things.
  • Just give it feedback.
  • And, perhaps, some examples to help it get started.
