21 - Reinforcement Learning

Reinforcement learning is a machine learning technique where an agent learns to achieve a goal in an environment by receiving rewards or punishments for its actions. The agent learns through trial-and-error interactions with the environment without relying on external teachers. In reinforcement learning, the agent learns to map situations to actions in a way that maximizes rewards. It does this by exploring various actions and learning which ones yield the highest rewards.


Reinforcement Learning

Geoff Hulten
Reinforcement Learning
• Learning to interact with an environment
• Robots, games, process control
• With limited human training
• Where the 'right thing' isn't obvious

[Diagram: the agent-environment loop. The Agent sends an Action to the Environment; the Environment returns a new State and a Reward.]

• Supervised Learning:
• Goal: learn a function that maps each example x to its label y
• Data: labeled examples {(x_i, y_i)}

• Reinforcement Learning:
• Goal: Maximize the discounted sum of rewards, E[ Σ_t γ^t r_t ]
• Data: the agent's own experience, sequences of (state, action, reward)
TD-Gammon – Tesauro ~1995
State: Board State
Actions: Valid Moves
Reward: Win or Lose

• Net with 80 hidden units, initialized to random weights
• Select move based on the network's estimate of P(win) & a shallow search
• Learn by playing against itself
• 1.5 million games of training -> competitive with world-class players
Atari 2600 games
State: Raw Pixels
Actions: Valid Moves
Reward: Game Score

• Same model/parameters for ~50 games

https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
Robotics and Locomotion
State: Joint States/Velocities, Accelerometer/Gyroscope, Terrain
Actions: Apply Torque to Joints
Reward: Velocity – { stuff }

https://youtu.be/hx_bgoTF7bs
2017 paper: https://arxiv.org/pdf/1707.02286.pdf

AlphaGo
State: Board State
Actions: Valid Moves
Reward: Win or Lose

• Learning how to beat humans at 'hard' games (search space too big)
• Far surpasses (human) supervised learning
• The algorithm learned to outplay humans at chess in 24 hours

https://deepmind.com/documents/119/agz_unformatted_nature.pdf
How Reinforcement Learning is Different
• Delayed Reward

• Agent chooses training data

• Explore vs Exploit (lifelong learning)

• Very different terminology (can be confusing)


Setup for Reinforcement Learning

Markov Decision Process (environment)
• Discrete-time stochastic control process
• Each time step t:
  • Agent chooses action a from set A
  • Moves to new state s' with probability P(s' | s, a)
  • Receives reward R(s, a)
• Every outcome depends only on s and a
• Nothing depends on previous states/actions (the Markov property)

Policy (agent's behavior)
• π(s): the action to take in state s
• Goal, maximize the discounted reward: E[ Σ_t γ^t r_t ]
• γ: tradeoff between immediate and future reward
• R(s, a) is the reward for making that move; V(s) is the value of being in that state (a small code sketch of this setup follows)
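Below is a minimal sketch of the environment side of this setup. The class name, the dictionary layout, and the default γ = 0.5 are my own choices for illustration, not part of the slides.

import random

class MDP:
    """A discrete-time stochastic control process, as defined above."""
    def __init__(self, transitions, rewards, gamma=0.5):
        self.transitions = transitions   # {(s, a): [(s_next, probability), ...]}
        self.rewards = rewards           # {(s, a): reward for making that move}
        self.gamma = gamma               # tradeoff of immediate vs future reward

    def step(self, s, a):
        # Everything depends only on (s, a): the Markov property.
        next_states, probs = zip(*self.transitions[(s, a)])
        s_next = random.choices(next_states, weights=probs)[0]
        return s_next, self.rewards[(s, a)]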


Simple Example of Agent in an Environment

State: map locations (a 3x3 grid of cells, <0,0> through <2,2>)
Actions: move within the map; reaching the chest ends the episode
Reward: 100 at the chest, 0 for all other moves

[Grid figure: the 3x3 map with the chest, scored 100, in cell <2,0>]
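The same environment as code, a sketch assuming the chest sits in cell <2,0> as the later grids suggest; an 'action' here is simply the neighboring cell to move to.

class GridWorld:
    """The 3x3 map above; an action is the neighboring cell to move to."""
    CHEST = (2, 0)   # assumed chest location, per the later Q-learning grids

    def actions(self, s):
        x, y = s
        candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
        return [(cx, cy) for cx, cy in candidates if 0 <= cx <= 2 and 0 <= cy <= 2]

    def step(self, s, a):
        reward = 100 if a == self.CHEST else 0   # 100 at the chest, 0 for others
        done = (a == self.CHEST)                 # reaching the chest ends the episode
        return a, reward, done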
Policies

Policy π: the action to take in each state, for example:
• Move to <0,1>, Move to <1,1>, Move to <1,0>, Move to <2,0>

Evaluating policies: V^π(s) is the discounted reward of following π from state s.

[Grid figure: the policy's arrows drawn on the 3x3 map, with values V^π of 12.5, 50, and 100 along the path to the chest]

This policy could be better.
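A sketch of evaluating one such policy, assuming γ = 0.5 (the discount used in the Q-learning rounds below) and one plausible reading of the slide's arrows; the figure's values 12.5, 50, and 100 fall out of the recursion.

GAMMA = 0.5
CHEST = (2, 0)

policy = {                              # deterministic policy: state -> next cell
    (0, 2): (0, 1), (0, 1): (1, 1),
    (1, 1): (1, 0), (1, 0): (2, 0),     # the last move reaches the chest
}

def v_pi(s):
    """Discounted reward of following the policy from s."""
    if s == CHEST:
        return 0.0                      # episode over, no more reward
    s_next = policy[s]
    r = 100 if s_next == CHEST else 0
    return r + GAMMA * v_pi(s_next)

for s in policy:
    print(s, v_pi(s))                   # <1,0> -> 100, <1,1> -> 50, <0,2> -> 12.5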
Q learning

Learn a policy that optimizes Q(s, a) for all states, using:
• No prior knowledge of the state transition probabilities: P(s' | s, a)
• No prior knowledge of the reward function: R(s, a)

Approach (a runnable sketch follows below):
• Initialize the estimate of discounted reward for every state/action pair: Q̂(s, a) = 0
• Repeat (for a while):
  • Take a random action a from the actions available in the current state s
  • Receive the new state s' and reward r from the environment
  • Update: Q̂(s, a) = r + γ max_a' Q̂(s', a')
  • Random restart if in a terminal state

Exploration Policy: choose action a with probability proportional to
  1 / (1 + visits(s, a))
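A self-contained sketch of this loop on the 3x3 grid. The helper names are mine, and the grid and chest location are reconstructed from the example slides.

import random
from collections import defaultdict

GAMMA = 0.5
CHEST = (2, 0)
CELLS = [(x, y) for x in range(3) for y in range(3)]

def neighbors(s):
    x, y = s
    return [(x + dx, y + dy) for dx, dy in [(-1, 0), (1, 0), (0, -1), (0, 1)]
            if (x + dx, y + dy) in CELLS]

q = defaultdict(float)       # Q-hat(s, a), initialized to 0 for every pair
visits = defaultdict(int)    # how often each (s, a) has been tried

for episode in range(500):
    s = random.choice([c for c in CELLS if c != CHEST])   # random restart
    while s != CHEST:
        # Exploration policy: weight each action by 1 / (1 + visits(s, a)).
        acts = neighbors(s)
        weights = [1.0 / (1 + visits[(s, a)]) for a in acts]
        a = random.choices(acts, weights=weights)[0]
        visits[(s, a)] += 1
        s_next = a                                        # moves are deterministic here
        r = 100 if s_next == CHEST else 0
        # The update: Q-hat(s, a) = r + gamma * max_a' Q-hat(s', a')
        future = 0 if s_next == CHEST else max(q[(s_next, a2)] for a2 in neighbors(s_next))
        q[(s, a)] = r + GAMMA * future
        s = s_next

print(q[((1, 0), CHEST)])    # -> 100.0; values halve with each step away: 50, 25, ...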
Example of Q learning (round 1)

• Initialize every Q̂(s, a) to 0
• Start in a random initial state
• Take a random action; receive r = 0 and the new state
• Update: Q̂(s, a) = r + γ max_a' Q̂(s', a') = 0 + γ · 0 = 0
• Take another random action; the update is 0 again
• The move that finally reaches the chest receives r = 100, so its Q̂ becomes 100
• No more moves possible, start again…

[Grid figure: every Q̂ on the map is still 0 except the move into the chest at <2,0>, which now carries the 100]

Example of Q learning (round 2), 𝛾 = 0.5

• Start in a new random initial state
• Take a random action; receive r = 0
• Update: Q̂(s, a) = 0 + γ · max_a' Q̂(s', a') = 0 + 0.5 · 100 = 50, because the new state's best action (the move into the chest) already has Q̂ = 100
• Take another random action; the updates stay 0 wherever all neighboring Q̂ values are still 0
• No more moves possible, start again…

[Grid figure: Q̂ = 100 on the move into the chest, and now Q̂ = 50 on the move into the cell beside it]
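That one update, worked out:

r, gamma, best_next = 0, 0.5, 100   # best_next: the Q-hat learned in round 1
q_new = r + gamma * best_next
print(q_new)                        # 50.0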
Example of Q learning (some acceleration…), 𝛾 = 0.5

• Start in a random initial state
• Update: a move into a cell whose best action has Q̂ = 100 becomes 0.5 · 100 = 50
• Update: a move into a cell whose best action has Q̂ = 50 becomes 0.5 · 50 = 25

[Grid figure: the values propagate backwards from the chest; 50 and 25 now appear on moves one and two steps further away]
Example of Q learning (some acceleration, continued…), 𝛾 = 0.5

• Start in another random initial state
• The updates keep propagating the reward backwards along whatever paths the random exploration happens to take

[Grid figure: more moves now carry values of 100, 50, and 25 as the estimates spread across the map]
Example of Q learning (after many, many runs…)

• Q̂ has converged
• Every move is worth γ times the best move that follows it: 100 into the chest, then 50, 25, 12.5, and 6.25 moving further away
• The policy is: in every state, take the action with the highest Q̂ (the shortest path to the chest)

[Grid figure: the converged Q̂ value on every arrow of the 3x3 map]
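The converged values are just powers of the discount; a one-line check:

gamma = 0.5
for steps_after_move in range(5):
    print(100 * gamma ** steps_after_move)   # 100.0, 50.0, 25.0, 12.5, 6.25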
Challenges for Reinforcement Learning

• When there are many states and actions
• When the episode can end without reward
• When there is a 'narrow' path to reward

[Figure: a narrow-path example with 15 turns remaining. Random exploring will fall off the rope ~97% of the time; with ~50% probability of going the wrong way at each step, P(reaching goal) ~ 0.01%]
Reward Shaping

• Hand-craft intermediate objectives that yield reward
• Encourage the right type of exploration
• Requires custom human work
• Risk of learning to game the rewards
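A sketch of what shaping might look like for the narrow-path example above; the progress bonus and its weight are hypothetical, not from the slides.

def shaped_reward(env_reward, old_distance_to_goal, new_distance_to_goal):
    # Intermediate objective: pay a little for any progress toward the goal.
    progress = old_distance_to_goal - new_distance_to_goal
    return env_reward + 0.1 * progress   # risk: the agent may learn to game this bonus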
Memory

• Retrain on previous explorations
• Maintain samples of (s, a, s', r)
• Replay an exploration, replay a different exploration, do it a bunch of times
• Useful when it is cheaper to use some RAM/CPU than to run more simulations
• It is hard to get to reward, so you want to leverage it for as much as possible when it happens (a sketch follows below)

[Grid figure: the example map with its learned Q̂ values, the stored experience available for replay]
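A minimal replay-memory sketch; the class is hypothetical and assumes a Q-learner with the ObserveAction interface used in the Gym example below.

import random

class ReplayMemory:
    def __init__(self, capacity=10000):
        self.samples = []                # stored (s, a, s_next, r) tuples
        self.capacity = capacity

    def add(self, s, a, s_next, r):
        if len(self.samples) >= self.capacity:
            self.samples.pop(0)          # drop the oldest sample
        self.samples.append((s, a, s_next, r))

    def replay(self, qlearner, batch_size=32):
        # Re-apply the Q update to remembered experience instead of simulating.
        batch = random.sample(self.samples, min(batch_size, len(self.samples)))
        for s, a, s_next, r in batch:
            qlearner.ObserveAction(s, a, s_next, r)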
Gym – toolkit for reinforcement learning

CartPole: reward +1 per step the pole remains up.
MountainCar: reward 200 at the flag, -1 per step.

import gym
import random
import QLearning # Your implementation goes here...
import Assignment7Support

env = gym.make('CartPole-v0')

trainingIterations = 20000

qlearner = QLearning.QLearning(<Parameters>)

for trialNumber in range(trainingIterations):
    observation = env.reset()
    reward = 0
    for i in range(300):
        env.render() # Comment out to make much faster...

        currentState = ObservationToStateSpace(observation)
        action = qlearner.GetAction(currentState, <Parameters>)

        oldState = ObservationToStateSpace(observation)
        observation, reward, isDone, info = env.step(action)
        newState = ObservationToStateSpace(observation)

        qlearner.ObserveAction(oldState, action, newState, reward, …)

        if isDone:
            if (trialNumber % 1000) == 0:
                print(trialNumber, i, reward)
            break

# Now you have a policy in qlearner – use it...

https://gym.openai.com/docs/
Some Problems with QLearning

• State space is continuous
  • Must approximate by discretizing (one way to do this is sketched below)

    print(env.observation_space.high)
    #> array([ 2.4 , inf, 0.20943951, inf])
    print(env.observation_space.low)
    #> array([-2.4 , -inf, -0.20943951, -inf])

• Treats states as identities
  • No knowledge of how states relate
  • Requires many iterations to fill in
• Converging can be difficult with randomized transitions/rewards
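One way to implement the ObservationToStateSpace helper the Gym code assumes, for CartPole; the bin count and the clipping bounds for the infinite velocity ranges are my own choices.

def ObservationToStateSpace(observation, bins=8):
    # observation = [cart position, cart velocity, pole angle, pole angular velocity]
    lows  = [-2.4, -3.0, -0.20943951, -3.0]   # clip the infinite ranges to +/- 3.0
    highs = [ 2.4,  3.0,  0.20943951,  3.0]
    state = []
    for value, low, high in zip(observation, lows, highs):
        value = min(max(value, low), high)
        index = int((value - low) / (high - low) * bins)
        state.append(min(index, bins - 1))    # keep the top edge inside the last bin
    return tuple(state)                       # a hashable state identity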
Policy Gradients

• Q-learning -> learn a value function
  • Q̂(s, a): an estimate of the expected discounted reward of taking a from s
  • Performance time: take the action that has the highest estimated value

• Policy Gradient -> learn the policy directly
  • π(s): a probability distribution over the actions
  • Performance time: choose an action by sampling from the distribution

Example from: https://www.youtube.com/watch?v=tqrcjHuNdmQ
Policy Gradients

• Receive a frame
• Forward propagate to get P(actions)
• Select an action by sampling from P(actions)
• Find the gradient that makes that action more likely, and store it (one gradient per action)
• Play the rest of the game
• If won, take a step in the direction of the stored gradients
• If lost, take a step in the opposite direction

Sum the gradients and step in the correct direction (a toy sketch follows below).
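A minimal policy-gradient sketch on a toy two-action problem with a softmax policy. It illustrates the won/lost update above, not the Pong setup from the video; every name and number here is my own.

import numpy as np

theta = np.zeros(2)    # one logit per action; softmax gives P(actions)
lr = 0.1

def sample_action():
    p = np.exp(theta) / np.exp(theta).sum()
    a = np.random.choice(2, p=p)
    grad = -p
    grad[a] += 1.0     # gradient of log P(a) with respect to the logits
    return a, grad

for game in range(2000):
    a, grad = sample_action()
    reward = 1.0 if a == 1 else -1.0   # toy "game": action 1 wins, action 0 loses
    theta += lr * reward * grad        # won: make a more likely; lost: less likely

print(theta)   # the logit for action 1 grows, so P(action 1) -> 1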
Policy Gradients – reward shaping

[Figure: actions within an episode annotated by their effect on the result: early actions not relevant to the outcome(?), middle actions less important to the outcome, late actions more important to the outcome]
Summary

Reinforcement Learning:
• Goal: Maximize the discounted reward, E[ Σ_t γ^t r_t ]
• Data: the agent's own (state, action, reward) experience from interacting with the environment

[Diagram: the agent-environment loop again: Action out; State and Reward back]

Many (awesome) recent successes:
• Robotics
• Surpassing humans at difficult games
• Doing it with (essentially) zero human knowledge

(Simple) Approaches:
• Q-Learning -> discounted reward of taking an action
• Policy Gradients -> probability distribution over actions
• Reward Shaping
• Memory

Challenges:
• When the episode can end without reward
• When there is a 'narrow' path to reward
• When there are many states and actions
• Lots of parameter tweaking…
