21 - Reinforcement Learning

Reinforcement learning is a machine learning technique where an agent learns to achieve a goal in an environment by receiving rewards or punishments for its actions. The agent learns through trial-and-error interactions with the environment without relying on external teachers. In reinforcement learning, the agent learns to map situations to actions in a way that maximizes rewards. It does this by exploring various actions and learning which ones yield the highest rewards.


Reinforcement Learning

Geoff Hulten
Reinforcement Learning
• Learning to interact with an environment
• Robots, games, process control
• With limited human training
• Where the 'right thing' isn't obvious

[Diagram: the agent-environment loop. The Agent sends an Action to the Environment; the Environment returns a new State and a Reward.]

• Supervised Learning:
• Goal: learn a function that maps each example x to its label y
• Data: labeled examples {(x_i, y_i)}

• Reinforcement Learning:
• Goal: Maximize the discounted sum of rewards, E[ Σ_t γ^t r_t ]
• Data: the agent's own experience, sequences of (state, action, reward)
TD-Gammon – Tesauro ~1995
State: Board State
Actions: Valid Moves
Reward: Win or Lose

• Net with 80 hidden units, initialized to random weights
• Select move based on the network's estimate of P(win) & a shallow search
• Learn by playing against itself
• 1.5 million games of training -> competitive with world-class players
Atari 2600 games
State: Raw Pixels
Actions: Valid Moves
Reward: Game Score

• Same model/parameters for ~50 games

https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
Robotics and Locomotion
State: Joint States/Velocities, Accelerometer/Gyroscope, Terrain
Actions: Apply Torque to Joints
Reward: Velocity – { stuff }

https://youtu.be/hx_bgoTF7bs
2017 paper: https://arxiv.org/pdf/1707.02286.pdf

AlphaGo
State: Board State
Actions: Valid Moves
Reward: Win or Lose

• Learning how to beat humans at 'hard' games (search space too big)
• Far surpasses (human) supervised learning
• The algorithm learned to outplay humans at chess in 24 hours

https://deepmind.com/documents/119/agz_unformatted_nature.pdf
How Reinforcement Learning is Different
• Delayed Reward

• Agent chooses training data

• Explore vs Exploit (lifelong learning)

• Very different terminology (can be confusing)


Setup for Reinforcement Learning

Markov Decision Process (environment)
• Discrete-time stochastic control process
• Each time step t:
  • Agent chooses action a from set A
  • Moves to new state s' with probability P(s' | s, a)
  • Receives reward R(s, a)
• Every outcome depends only on s and a
• Nothing depends on previous states/actions (the Markov property)

Policy (agent's behavior)
• π(s): the action to take in state s
• Goal, maximize the discounted reward: E[ Σ_t γ^t r_t ]
• γ: tradeoff between immediate and future reward
• R(s, a) is the reward for making that move; V(s) is the value of being in that state (a small code sketch of this setup follows)
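Below is a minimal sketch of the environment side of this setup. The class name, the dictionary layout, and the default γ = 0.5 are my own choices for illustration, not part of the slides.

import random

class MDP:
    """A discrete-time stochastic control process, as defined above."""
    def __init__(self, transitions, rewards, gamma=0.5):
        self.transitions = transitions   # {(s, a): [(s_next, probability), ...]}
        self.rewards = rewards           # {(s, a): reward for making that move}
        self.gamma = gamma               # tradeoff of immediate vs future reward

    def step(self, s, a):
        # Everything depends only on (s, a): the Markov property.
        next_states, probs = zip(*self.transitions[(s, a)])
        s_next = random.choices(next_states, weights=probs)[0]
        return s_next, self.rewards[(s, a)]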


Simple Example of Agent in an Environment

State: map locations (a 3x3 grid of cells, <0,0> through <2,2>)
Actions: move within the map; reaching the chest ends the episode
Reward: 100 at the chest, 0 for all other moves

[Grid figure: the 3x3 map with the chest, scored 100, in cell <2,0>]
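The same environment as code, a sketch assuming the chest sits in cell <2,0> as the later grids suggest; an 'action' here is simply the neighboring cell to move to.

class GridWorld:
    """The 3x3 map above; an action is the neighboring cell to move to."""
    CHEST = (2, 0)   # assumed chest location, per the later Q-learning grids

    def actions(self, s):
        x, y = s
        candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
        return [(cx, cy) for cx, cy in candidates if 0 <= cx <= 2 and 0 <= cy <= 2]

    def step(self, s, a):
        reward = 100 if a == self.CHEST else 0   # 100 at the chest, 0 for others
        done = (a == self.CHEST)                 # reaching the chest ends the episode
        return a, reward, done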
Policies

Policy π: the action to take in each state, for example:
• Move to <0,1>, Move to <1,1>, Move to <1,0>, Move to <2,0>

Evaluating policies: V^π(s) is the discounted reward of following π from state s.

[Grid figure: the policy's arrows drawn on the 3x3 map, with values V^π of 12.5, 50, and 100 along the path to the chest]

This policy could be better.
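A sketch of evaluating one such policy, assuming γ = 0.5 (the discount used in the Q-learning rounds below) and one plausible reading of the slide's arrows; the figure's values 12.5, 50, and 100 fall out of the recursion.

GAMMA = 0.5
CHEST = (2, 0)

policy = {                              # deterministic policy: state -> next cell
    (0, 2): (0, 1), (0, 1): (1, 1),
    (1, 1): (1, 0), (1, 0): (2, 0),     # the last move reaches the chest
}

def v_pi(s):
    """Discounted reward of following the policy from s."""
    if s == CHEST:
        return 0.0                      # episode over, no more reward
    s_next = policy[s]
    r = 100 if s_next == CHEST else 0
    return r + GAMMA * v_pi(s_next)

for s in policy:
    print(s, v_pi(s))                   # <1,0> -> 100, <1,1> -> 50, <0,2> -> 12.5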
Q learning

Learn a policy that optimizes Q(s, a) for all states, using:
• No prior knowledge of the state transition probabilities: P(s' | s, a)
• No prior knowledge of the reward function: R(s, a)

Approach (a runnable sketch follows below):
• Initialize the estimate of discounted reward for every state/action pair: Q̂(s, a) = 0
• Repeat (for a while):
  • Take a random action a from the actions available in the current state s
  • Receive the new state s' and reward r from the environment
  • Update: Q̂(s, a) = r + γ max_a' Q̂(s', a')
  • Random restart if in a terminal state

Exploration Policy: choose action a with probability proportional to
  1 / (1 + visits(s, a))
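A self-contained sketch of this loop on the 3x3 grid. The helper names are mine, and the grid and chest location are reconstructed from the example slides.

import random
from collections import defaultdict

GAMMA = 0.5
CHEST = (2, 0)
CELLS = [(x, y) for x in range(3) for y in range(3)]

def neighbors(s):
    x, y = s
    return [(x + dx, y + dy) for dx, dy in [(-1, 0), (1, 0), (0, -1), (0, 1)]
            if (x + dx, y + dy) in CELLS]

q = defaultdict(float)       # Q-hat(s, a), initialized to 0 for every pair
visits = defaultdict(int)    # how often each (s, a) has been tried

for episode in range(500):
    s = random.choice([c for c in CELLS if c != CHEST])   # random restart
    while s != CHEST:
        # Exploration policy: weight each action by 1 / (1 + visits(s, a)).
        acts = neighbors(s)
        weights = [1.0 / (1 + visits[(s, a)]) for a in acts]
        a = random.choices(acts, weights=weights)[0]
        visits[(s, a)] += 1
        s_next = a                                        # moves are deterministic here
        r = 100 if s_next == CHEST else 0
        # The update: Q-hat(s, a) = r + gamma * max_a' Q-hat(s', a')
        future = 0 if s_next == CHEST else max(q[(s_next, a2)] for a2 in neighbors(s_next))
        q[(s, a)] = r + GAMMA * future
        s = s_next

print(q[((1, 0), CHEST)])    # -> 100.0; values halve with each step away: 50, 25, ...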
Example of Q learning (round 1)

• Initialize every Q̂(s, a) to 0
• Start in a random initial state
• Take a random action; receive r = 0 and the new state
• Update: Q̂(s, a) = r + γ max_a' Q̂(s', a') = 0 + γ · 0 = 0
• Take another random action; the update is 0 again
• The move that finally reaches the chest receives r = 100, so its Q̂ becomes 100
• No more moves possible, start again…

[Grid figure: every Q̂ on the map is still 0 except the move into the chest at <2,0>, which now carries the 100]

Example of Q learning (round 2), 𝛾 = 0.5

• Start in a new random initial state
• Take a random action; receive r = 0
• Update: Q̂(s, a) = 0 + γ · max_a' Q̂(s', a') = 0 + 0.5 · 100 = 50, because the new state's best action (the move into the chest) already has Q̂ = 100
• Take another random action; the updates stay 0 wherever all neighboring Q̂ values are still 0
• No more moves possible, start again…

[Grid figure: Q̂ = 100 on the move into the chest, and now Q̂ = 50 on the move into the cell beside it]
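That one update, worked out:

r, gamma, best_next = 0, 0.5, 100   # best_next: the Q-hat learned in round 1
q_new = r + gamma * best_next
print(q_new)                        # 50.0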
Example of Q learning (some acceleration…), 𝛾 = 0.5

• Start in a random initial state
• Update: a move into a cell whose best action has Q̂ = 100 becomes 0.5 · 100 = 50
• Update: a move into a cell whose best action has Q̂ = 50 becomes 0.5 · 50 = 25

[Grid figure: the values propagate backwards from the chest; 50 and 25 now appear on moves one and two steps further away]
Example of Q learning (some acceleration, continued…), 𝛾 = 0.5

• Start in another random initial state
• The updates keep propagating the reward backwards along whatever paths the random exploration happens to take

[Grid figure: more moves now carry values of 100, 50, and 25 as the estimates spread across the map]
Example of Q learning (after many, many runs…)

• Q̂ has converged
• Every move is worth γ times the best move that follows it: 100 into the chest, then 50, 25, 12.5, and 6.25 moving further away
• The policy is: in every state, take the action with the highest Q̂ (the shortest path to the chest)

[Grid figure: the converged Q̂ value on every arrow of the 3x3 map]
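The converged values are just powers of the discount; a one-line check:

gamma = 0.5
for steps_after_move in range(5):
    print(100 * gamma ** steps_after_move)   # 100.0, 50.0, 25.0, 12.5, 6.25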
Challenges for Reinforcement Learning

• When there are many states and actions
• When the episode can end without reward
• When there is a 'narrow' path to reward

[Figure: a narrow-path example with 15 turns remaining. Random exploring will fall off the rope ~97% of the time; with ~50% probability of going the wrong way at each step, P(reaching goal) ~ 0.01%]
Reward Shaping

• Hand-craft intermediate objectives that yield reward
• Encourage the right type of exploration
• Requires custom human work
• Risk of learning to game the rewards
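A sketch of what shaping might look like for the narrow-path example above; the progress bonus and its weight are hypothetical, not from the slides.

def shaped_reward(env_reward, old_distance_to_goal, new_distance_to_goal):
    # Intermediate objective: pay a little for any progress toward the goal.
    progress = old_distance_to_goal - new_distance_to_goal
    return env_reward + 0.1 * progress   # risk: the agent may learn to game this bonus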
Memory

• Retrain on previous explorations
• Maintain samples of (s, a, s', r)
• Replay an exploration, replay a different exploration, do it a bunch of times
• Useful when it is cheaper to use some RAM/CPU than to run more simulations
• It is hard to get to reward, so you want to leverage it for as much as possible when it happens (a sketch follows below)

[Grid figure: the example map with its learned Q̂ values, the stored experience available for replay]
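A minimal replay-memory sketch; the class is hypothetical and assumes a Q-learner with the ObserveAction interface used in the Gym example below.

import random

class ReplayMemory:
    def __init__(self, capacity=10000):
        self.samples = []                # stored (s, a, s_next, r) tuples
        self.capacity = capacity

    def add(self, s, a, s_next, r):
        if len(self.samples) >= self.capacity:
            self.samples.pop(0)          # drop the oldest sample
        self.samples.append((s, a, s_next, r))

    def replay(self, qlearner, batch_size=32):
        # Re-apply the Q update to remembered experience instead of simulating.
        batch = random.sample(self.samples, min(batch_size, len(self.samples)))
        for s, a, s_next, r in batch:
            qlearner.ObserveAction(s, a, s_next, r)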
Gym – toolkit for reinforcement learning

CartPole: reward +1 per step the pole remains up.
MountainCar: reward 200 at the flag, -1 per step.

import gym
import random
import QLearning # Your implementation goes here...
import Assignment7Support

env = gym.make('CartPole-v0')

trainingIterations = 20000

qlearner = QLearning.QLearning(<Parameters>)

for trialNumber in range(trainingIterations):
    observation = env.reset()
    reward = 0
    for i in range(300):
        env.render() # Comment out to make much faster...

        currentState = ObservationToStateSpace(observation)
        action = qlearner.GetAction(currentState, <Parameters>)

        oldState = ObservationToStateSpace(observation)
        observation, reward, isDone, info = env.step(action)
        newState = ObservationToStateSpace(observation)

        qlearner.ObserveAction(oldState, action, newState, reward, …)

        if isDone:
            if (trialNumber % 1000) == 0:
                print(trialNumber, i, reward)
            break

# Now you have a policy in qlearner – use it...

https://gym.openai.com/docs/
Some Problems with QLearning

• State space is continuous
  • Must approximate by discretizing (one way to do this is sketched below)

    print(env.observation_space.high)
    #> array([ 2.4 , inf, 0.20943951, inf])
    print(env.observation_space.low)
    #> array([-2.4 , -inf, -0.20943951, -inf])

• Treats states as identities
  • No knowledge of how states relate
  • Requires many iterations to fill in
• Converging can be difficult with randomized transitions/rewards
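One way to implement the ObservationToStateSpace helper the Gym code assumes, for CartPole; the bin count and the clipping bounds for the infinite velocity ranges are my own choices.

def ObservationToStateSpace(observation, bins=8):
    # observation = [cart position, cart velocity, pole angle, pole angular velocity]
    lows  = [-2.4, -3.0, -0.20943951, -3.0]   # clip the infinite ranges to +/- 3.0
    highs = [ 2.4,  3.0,  0.20943951,  3.0]
    state = []
    for value, low, high in zip(observation, lows, highs):
        value = min(max(value, low), high)
        index = int((value - low) / (high - low) * bins)
        state.append(min(index, bins - 1))    # keep the top edge inside the last bin
    return tuple(state)                       # a hashable state identity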
Policy Gradients

• Q-learning -> learn a value function
  • Q̂(s, a): an estimate of the expected discounted reward of taking a from s
  • Performance time: take the action that has the highest estimated value

• Policy Gradient -> learn the policy directly
  • π(s): a probability distribution over the actions
  • Performance time: choose an action by sampling from the distribution

Example from: https://www.youtube.com/watch?v=tqrcjHuNdmQ
Policy Gradients

• Receive a frame
• Forward propagate to get P(actions)
• Select an action by sampling from P(actions)
• Find the gradient that makes that action more likely, and store it (one gradient per action)
• Play the rest of the game
• If won, take a step in the direction of the stored gradients
• If lost, take a step in the opposite direction

Sum the gradients and step in the correct direction (a toy sketch follows below).
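A minimal policy-gradient sketch on a toy two-action problem with a softmax policy. It illustrates the won/lost update above, not the Pong setup from the video; every name and number here is my own.

import numpy as np

theta = np.zeros(2)    # one logit per action; softmax gives P(actions)
lr = 0.1

def sample_action():
    p = np.exp(theta) / np.exp(theta).sum()
    a = np.random.choice(2, p=p)
    grad = -p
    grad[a] += 1.0     # gradient of log P(a) with respect to the logits
    return a, grad

for game in range(2000):
    a, grad = sample_action()
    reward = 1.0 if a == 1 else -1.0   # toy "game": action 1 wins, action 0 loses
    theta += lr * reward * grad        # won: make a more likely; lost: less likely

print(theta)   # the logit for action 1 grows, so P(action 1) -> 1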
Policy Gradients – reward shaping

[Figure: actions within an episode annotated by their effect on the result: early actions not relevant to the outcome(?), middle actions less important to the outcome, late actions more important to the outcome]
Summary

Reinforcement Learning:
• Goal: Maximize the discounted reward, E[ Σ_t γ^t r_t ]
• Data: the agent's own (state, action, reward) experience from interacting with the environment

[Diagram: the agent-environment loop again: Action out; State and Reward back]

Many (awesome) recent successes:
• Robotics
• Surpassing humans at difficult games
• Doing it with (essentially) zero human knowledge

(Simple) Approaches:
• Q-Learning -> discounted reward of taking an action
• Policy Gradients -> probability distribution over actions
• Reward Shaping
• Memory

Challenges:
• When the episode can end without reward
• When there is a 'narrow' path to reward
• When there are many states and actions
• Lots of parameter tweaking…
