21 - Reinforcement Learning
Geoff Hulten
Reinforcement Learning
• Learning to interact with an environment
• Robots, games, process control
• With limited human training
• Where the ‘right thing’ isn’t obvious
[Diagram: the Agent takes an Action; the Environment returns a new State and a Reward]
• Supervised Learning:
• Goal: learn a function that maps examples to labels
• Data: labeled examples (x, y)
• Reinforcement Learning:
• Goal: Maximize the (discounted) reward over time
• Data: states, actions, and rewards from interacting with the environment
TD-Gammon – Tesauro ~1995
State: Board State
Actions: Valid Moves
Reward: Win or Lose
Atari DQN: https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
Robotics and Locomotion
State: Joint States/Velocities, Accelerometer/Gyroscope, Terrain
Actions: Apply Torque to Joints
Reward: Velocity – { stuff }
https://youtu.be/hx_bgoTF7bs
AlphaGo Zero: https://deepmind.com/documents/119/agz_unformatted_nature.pdf
How Reinforcement Learning is Different
• Delayed Reward
[Grid-world example: a 3×3 grid of states; the reward is 100 for entering the chest and 0 for all other moves]
Policies
• A policy 𝜋 maps each state to an action, e.g. Move to <1,1>, Move to <0,1>, Move to <1,0>, Move to <2,0>
Evaluating Policies
• The value of a policy from a state is the discounted reward collected by following it: V𝜋(s) = r_0 + 𝛾·r_1 + 𝛾²·r_2 + … (worked example below)
[Grid-world figure: values 12.5, 50, and 100 under the example policy]
• Policy could be better
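A minimal worked example of evaluating a policy in Python. The reward sequence is an assumption chosen to match the grid’s numbers (the chest, worth 100, is reached on the second move; all other moves give 0):

# Value of a policy from a state: the discounted sum of the rewards
# collected while following it.
gamma = 0.5
rewards = [0, 100]                               # reward received on each step
V = sum(gamma**t * r for t, r in enumerate(rewards))
print(V)                                         # 50.0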
Q learning
Learn a policy that optimizes for all states, using:
• No prior knowledge of the state transition function: 𝛿(s, a)
• No prior knowledge of the reward function: r(s, a)
Approach (a code sketch follows below):
• Initialize the estimate of discounted reward for every state/action pair: Q̂(s, a) = 0
• Repeat (for a while):
• Take a random action a from state s
• Receive r and s′ from the environment
• Update Q̂(s, a) = r + 𝛾 · max_a′ Q̂(s′, a′)
• Random restart if in a terminal state
Exploration Policy: choose action a with probability ∝ 1 / (1 + visits(s, a))
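A minimal sketch of this loop in Python. The 3×3 grid world, chest location, and restart rule are assumptions written to match the slides’ example; they are not part of the algorithm itself:

import random
from collections import defaultdict

# Assumed grid world: entering the chest at (2, 0) gives reward 100, everything else 0.
GAMMA = 0.5
CHEST = (2, 0)
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def step(state, action):
    x = min(max(state[0] + action[0], 0), 2)
    y = min(max(state[1] + action[1], 0), 2)
    newState = (x, y)
    return newState, (100 if newState == CHEST else 0)

Q = defaultdict(float)      # Q[(state, action)], initialized to 0
visits = defaultdict(int)

def choose_action(state):
    # Exploration policy: P(a | s) proportional to 1 / (1 + visits(s, a))
    weights = [1.0 / (1 + visits[(state, a)]) for a in ACTIONS]
    return random.choices(ACTIONS, weights=weights)[0]

def random_state():
    # Random restart anywhere except the terminal chest state
    while True:
        s = (random.randrange(3), random.randrange(3))
        if s != CHEST:
            return s

state = random_state()
for _ in range(20000):
    action = choose_action(state)
    visits[(state, action)] += 1
    newState, reward = step(state, action)
    # Deterministic Q-learning update: Q(s, a) = r + gamma * max_a' Q(s', a')
    Q[(state, action)] = reward + GAMMA * max(Q[(newState, a)] for a in ACTIONS)
    # Random restart if in a terminal state
    state = random_state() if newState == CHEST else newState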
Example of Q learning (round 1), with 𝛾 = 0.5
• Initialize all Q̂(s, a) to 0
• Start in a random initial state
• Take random actions; while the reward is 0 and the next state’s values are 0, each update leaves Q̂ at 0
• When an action finally enters the chest state, its update sets Q̂ = 100
• Later, the action leading into that state is updated to 0 + 0.5 · 100 = 50
[Grid-world figures: Q̂ values per move, all 0 except 100 for the move into the chest and 50 one move back]
Example of Q learning (some acceleration…), with 𝛾 = 0.5
[Grid-world figures: over more runs, values of 50 and 25 propagate backward from the chest – each update sets Q̂ to 0.5 times the best value available in the next state]
Example of Q learning (after many, many runs…)
• Q̂ has converged
[Grid-world figure: converged Q̂ values 100, 50, 25, 12.5, 6.25 – each value is half the value one move closer to the chest]
• Policy is: from each state, take the action with the highest Q̂ (the move toward the chest)
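The converged values follow the discounting pattern directly: each extra move away from the chest halves the value (𝛾 = 0.5). A quick check in Python:

gamma = 0.5
for movesAway in range(5):
    print(movesAway, gamma**movesAway * 100)   # 100.0, 50.0, 25.0, 12.5, 6.25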
Challenges for Reinforcement Learning
• When there are many states and actions
• When there is a ‘narrow’ path to reward
[Example: a tightrope-style game with 15 turns remaining – each step of random exploring has ~50% probability of going the wrong way, so the agent falls off the rope ~97% of the time and P(reaching the goal) ≈ 0.01%]
Reward Shaping
[Grid-world figure: intermediate rewards (25, 50) placed along the path toward the chest]
• Useful when the true reward is too sparse or delayed for random exploration to find (a sketch follows below)
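One possible way to implement the idea (a sketch, not the slides’ exact scheme): add a small bonus to the environment’s reward whenever the agent moves closer to the goal. The chest location and bonus size below are assumptions:

# Hypothetical shaped reward for the grid world: the environment's reward
# plus a small bonus for moving closer to the chest.
def shaped_reward(envReward, oldState, newState, chest=(2, 0)):
    def movesToChest(s):
        return abs(s[0] - chest[0]) + abs(s[1] - chest[1])
    bonus = 1.0 if movesToChest(newState) < movesToChest(oldState) else 0.0
    return envReward + bonus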
import random
import gym

import QLearning            # Your implementation goes here...
import Assignment7Support

env = gym.make('CartPole-v0')      # also try MountainCar

trainingIterations = 20000
qlearner = QLearning.QLearning(<Parameters>)

for trialNumber in range(trainingIterations):
    observation = env.reset()
    for i in range(200):            # steps per trial
        currentState = ObservationToStateSpace(observation)
        action = qlearner.GetAction(currentState, <Parameters>)

        oldState = ObservationToStateSpace(observation)
        observation, reward, isDone, info = env.step(action)
        newState = ObservationToStateSpace(observation)

        # ...update the qlearner with (oldState, action, newState, reward) here...

        if isDone:
            if (trialNumber % 1000) == 0:
                print(trialNumber, i, reward)
            break
https://gym.openai.com/docs/
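ObservationToStateSpace comes from the assignment’s support code. A hypothetical version for CartPole might bucket the four continuous observation values into a small number of bins so a tabular Q-learner can index them; the ranges and bin count below are assumptions, not the actual Assignment7Support implementation:

def ObservationToStateSpaceSketch(observation, binsPerDimension=10):
    # CartPole observation: [cart position, cart velocity, pole angle, pole angular velocity].
    # Clip each value to an assumed range, then map it to an integer bin.
    lows  = [-2.4, -3.0, -0.21, -3.0]
    highs = [ 2.4,  3.0,  0.21,  3.0]
    state = []
    for value, low, high in zip(observation, lows, highs):
        clipped = min(max(value, low), high)
        fraction = (clipped - low) / (high - low)
        state.append(min(int(fraction * binsPerDimension), binsPerDimension - 1))
    return tuple(state)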
Some Problems with QLearning
• State space is continuous
• Must approximate by discretizing
• One alternative: learn the policy directly from raw input (policy gradients; sketch below):
• Receive a frame
• Forward propagate to get the probability of each action
• Select an action by sampling from that probability
• Find the gradient that makes the selected action more likely – store it
• Play the rest of the game
• If won, take a step in the direction of the stored gradients
• If lost, take a step in the opposite direction
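A minimal sketch of those bullets in Python, assuming a single-layer sigmoid policy over two actions and a placeholder environment; none of the names below come from the slides:

import numpy as np

rng = np.random.default_rng(0)
D = 4                      # size of the (preprocessed) observation -- placeholder
w = np.zeros(D)            # weights of a single-layer sigmoid policy
learning_rate = 0.01

def policy(x):
    # Forward propagate to get P(action = 1 | x)
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def play_episode(env_step, x0):
    # Play one game, storing the gradient that makes each sampled action more likely.
    grads, x, won = [], x0, False
    for _ in range(200):
        p = policy(x)
        action = 1 if rng.random() < p else 0
        # Gradient of log P(chosen action) for a sigmoid policy: (action - p) * x
        grads.append((action - p) * x)
        x, done, won = env_step(action)
        if done:
            break
    return grads, won

def update(grads, won):
    global w
    # If won, step in the stored gradient direction; if lost, step the other way.
    direction = 1.0 if won else -1.0
    for g in grads:
        w = w + learning_rate * direction * g

def dummy_env_step(action):
    # Placeholder environment so the sketch runs: random next observation,
    # random episode end, random win/lose outcome.
    return rng.standard_normal(D), rng.random() < 0.05, bool(rng.random() < 0.5)

for game in range(1000):
    grads, won = play_episode(dummy_env_step, rng.standard_normal(D))
    update(grads, won)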
[Diagram: the Agent takes an Action; the Environment returns a new State and a Reward]
• Goal: Maximize the (discounted) reward over time
• Data: states, actions, and rewards from interacting with the environment