
Model-Based Reinforcement Learning

CS 285
Instructor: Sergey Levine
UC Berkeley
Today’s Lecture
1. Basics of model-based RL: learn a model, use model for control
• Why does naïve approach not work?
• The effect of distributional shift in model-based RL
2. Uncertainty in model-based RL
3. Model-based RL with complex observations
4. Next time: policy learning with model-based RL
• Goals:
• Understand how to build model-based RL algorithms
• Understand the important considerations for model-based RL
• Understand the tradeoffs between different model class choices
Why learn the model?
Does it work? Yes!

• Essentially how system identification works in classical robotics


• Some care should be taken to design a good base policy
• Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters
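
As a point of reference, the naive recipe implied here (run a base policy to collect data, fit a dynamics model by supervised regression, then plan through the model) can be sketched as below. This is a hedged illustration, not code from the lecture: env is assumed to follow a gym-style interface, and base_policy, fit_dynamics, and plan_actions are hypothetical helpers.

def naive_model_based_rl(env, base_policy, fit_dynamics, plan_actions, n_rollouts=20):
    """Naive model-based RL: collect data once, fit a model, plan through it."""
    # 1. Run the base policy and record (s, a, s') transitions.
    transitions = []
    for _ in range(n_rollouts):
        s, done = env.reset(), False
        while not done:
            a = base_policy(s)
            s_next, _, done, _ = env.step(a)
            transitions.append((s, a, s_next))
            s = s_next

    # 2. Fit the dynamics model with supervised learning on the transitions.
    model = fit_dynamics(transitions)

    # 3. Plan through the learned model and execute the plan.
    s = env.reset()
    for a in plan_actions(model, s):
        s, _, done, _ = env.step(a)
        if done:
            break
    return model

This is essentially the system-identification recipe above; the next slide shows why it breaks down when the model class is expressive.
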
Does it work? No!

go right to get higher!

• Distribution mismatch problem becomes exacerbated as we use more expressive model classes
Can we do better?
What if we make a mistake?
Can we do better?
every N steps

This will be on HW4!


How to replan?
every N steps

• The more you replan, the less perfect each individual plan needs to be
• Can use shorter horizons
• Even random sampling can often work well here!
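
A minimal sketch of the replanning loop with random sampling (random-shooting MPC). model and reward are assumed helpers: model(s, a) returns the learned model's next-state prediction and reward(s, a) returns a scalar reward; neither is code from the lecture.

import numpy as np

def random_shooting_mpc(model, reward, s0, action_dim, horizon=15, n_candidates=1000):
    """Return the first action of the best random action sequence under the model."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        # Sample a random action sequence (here uniform in [-1, 1]).
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = s0, 0.0
        for a in actions:
            total += reward(s, a)
            s = model(s, a)  # roll the candidate sequence forward in the model
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action

In the MPC loop, only this first action is executed, the real next state is observed, and the whole optimization is repeated; the model itself is refit every N steps with the newly collected data.
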
Uncertainty in Model-Based RL
A performance gap in model-based RL

pure model-based (about 10 minutes real time) vs. model-free training (about 10 days…)

Nagabandi, Kahn, Fearing, L. ICRA 2018


Why the performance gap?

need to not overfit here… …but still have high capacity over here
Why the performance gap?
every N steps

very tempting to go here…


How can uncertainty estimation help?

expected reward under high-variance prediction is very low, even though mean is the same!
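
A tiny numeric illustration of this point, using a hypothetical reward that is peaked at the predicted mean: as the prediction variance grows, the expected reward collapses even though the mean prediction never changes.

import numpy as np

reward = lambda s: np.exp(-s ** 2)        # hypothetical reward, peaked at s = 0
rng = np.random.default_rng(0)

for std in (0.1, 2.0):
    samples = rng.normal(loc=0.0, scale=std, size=100_000)
    # same mean (0), different variance: expected reward drops from ~0.99 to ~0.33
    print(std, reward(samples).mean())
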
Intuition behind uncertainty-aware RL
every N steps

only take actions for which we think we’ll get high reward in expectation (w.r.t. uncertain dynamics)

This avoids “exploiting” the model

The model will then adapt and get better


There are a few caveats…

Need to explore to get better

Expected value is not the same as pessimistic value

Expected value is not the same as optimistic value

…but expected value is often a good start


Uncertainty-Aware Neural Net Models
How can we have uncertainty-aware models?
Idea 1: use output entropy

why is this not enough?

Two types of uncertainty:


aleatoric or statistical uncertainty

epistemic or model uncertainty

“the model is certain about the data, but we are not certain about the model”
what is the variance here?
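
For concreteness, “output entropy” here can mean a network that predicts a mean and variance for the next state, as in the PyTorch sketch below (the class name and architecture are illustrative, not from the lecture). Its predicted variance captures aleatoric noise in the data, but a single network trained this way can be confidently wrong far from its training data, which is exactly the epistemic uncertainty it fails to represent.

import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    """Predicts p(s' | s, a) = N(mu, sigma^2); its entropy reflects aleatoric noise only."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * state_dim),  # outputs mean and log-variance
        )

    def forward(self, s, a):
        mu, log_var = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        return mu, log_var

    def nll(self, s, a, s_next):
        # Gaussian negative log-likelihood of the observed next state (up to a constant).
        mu, log_var = self(s, a)
        return 0.5 * (log_var + (s_next - mu) ** 2 / log_var.exp()).sum(-1).mean()
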
How can we have uncertainty-aware models?
Idea 2: estimate model uncertainty
“the model is certain about the data, but we are not certain about the model”

the entropy of this tells us the model uncertainty!
Quick overview of Bayesian neural networks

expected weight
uncertainty about the weight
For more, see:
Blundell et al., Weight Uncertainty in Neural Networks
Gal et al., Concrete Dropout

We’ll learn more about variational inference later!
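
One cheap member of this family, in the spirit of the Gal et al. reference, is Monte Carlo dropout: keep dropout active at test time and treat several stochastic forward passes as approximate posterior samples over the weights. A hedged sketch, assuming model maps (s, a) to a predicted next-state tensor and contains nn.Dropout layers:

import torch

def mc_dropout_predict(model, s, a, n_samples=20):
    """Estimate epistemic uncertainty by sampling dropout masks at test time."""
    model.train()  # keeps dropout active; no gradient update is performed
    with torch.no_grad():
        preds = torch.stack([model(s, a) for _ in range(n_samples)])
    # Spread across samples reflects uncertainty about the weights, not the data.
    return preds.mean(0), preds.var(0)
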


Bootstrap ensembles
Train multiple models and see if they agree!

How to train?
Main idea: need to generate “independent” datasets to get “independent” models
Bootstrap ensembles in deep learning
This basically works

Very crude approximation, because the number of models is usually small (< 10)

Resampling with replacement is usually unnecessary, because SGD and random initialization usually make the models sufficiently independent
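
A minimal sketch of training such an ensemble, following the simplification above: every model sees the same dataset, and only random initialization plus SGD noise decorrelates them. GaussianDynamics is the illustrative model class from the earlier sketch, and the mini-batch loader interface is assumed.

import torch

def train_ensemble(loader, state_dim, action_dim, n_models=5, epochs=50, lr=1e-3):
    """Train N dynamics models independently; their disagreement estimates epistemic uncertainty."""
    models = [GaussianDynamics(state_dim, action_dim) for _ in range(n_models)]
    opts = [torch.optim.Adam(m.parameters(), lr=lr) for m in models]

    for _ in range(epochs):
        for s, a, s_next in loader:             # mini-batches of transitions
            for model, opt in zip(models, opts):
                loss = model.nll(s, a, s_next)  # each model gets its own gradient step
                opt.zero_grad()
                loss.backward()
                opt.step()
    return models
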
Planning with Uncertainty, Examples
How to plan with uncertainty

distribution over deterministic models

Other options: moment matching, more complex posterior estimation with BNNs, etc.
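
Concretely, a candidate action sequence can be scored by rolling it out under each model in the ensemble and averaging the resulting returns, which approximates the expected return under model uncertainty. A hedged sketch that reuses the assumed reward(s, a) helper and treats each ensemble member as a deterministic function of (s, a):

import numpy as np

def plan_score_under_ensemble(models, reward, s0, actions):
    """Average the return of one action sequence over an ensemble of dynamics models."""
    returns = []
    for model in models:
        s, total = s0, 0.0
        for a in actions:        # roll out the same plan in each model separately
            total += reward(s, a)
            s = model(s, a)
        returns.append(total)
    return np.mean(returns)

Plugging this scorer into the earlier random-shooting MPC loop, in place of a single-model rollout, discourages plans that look good under only one of the models.
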
Example: model-based RL with ensembles

exceeds performance of model-free after 40k steps (about 10 minutes of real time)

before vs. after
More recent example: PDDM

Deep Dynamics Models for Learning Dexterous Manipulation. Nagabandi et al. 2019
Further readings
• Deisenroth et al. PILCO: A Model-Based and Data-Efficient Approach to Policy Search.
Recent papers:
• Nagabandi et al. Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning.
• Chua et al. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models.
• Feinberg et al. Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning.
• Buckman et al. Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion.
Model-Based RL with Images
What about complex observations?

What is hard about this?


• High dimensionality
• Redundancy
• Partial observability
high-dimensional but not dynamic vs. low-dimensional but dynamic
State space (latent space) models

observation model p(o_t | s_t)
dynamics model p(s_{t+1} | s_t, a_t)
reward model p(r_t | s_t, a_t)

How to train?
standard (fully observed) model:

latent space model:
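
In standard notation (a reconstruction of the usual form of these objectives, not a verbatim transcription of the slide), the fully observed model is fit by maximum likelihood on observed transitions, while the latent space model takes the log-likelihoods in expectation under a posterior over the latent states and adds an observation-model term:

% Fully observed model: maximum likelihood on observed (s_t, a_t, s_{t+1}) tuples
\max_{\phi} \; \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T}
    \log p_\phi\!\left(s_{t+1,i} \mid s_{t,i}, a_{t,i}\right)

% Latent space model: latent states are unobserved, so the log-likelihoods are
% taken in expectation under (an approximation to) the posterior over latents,
% and an observation-model term is added
\max_{\phi} \; \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T}
    \mathbb{E}_{(s_t, s_{t+1}) \sim p(s_t, s_{t+1} \mid o_{1:T,i}, a_{1:T,i})}
    \left[ \log p_\phi\!\left(s_{t+1,i} \mid s_{t,i}, a_{t,i}\right)
         + \log p_\phi\!\left(o_{t,i} \mid s_{t,i}\right) \right]

The expectation over the latent states is intractable to compute exactly, which is why the next slides introduce an approximate posterior (the “encoder”).
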


Model-based RL with latent space models

“encoder”

• full smoothing posterior: + most accurate, - most complicated
• single-step encoder: + simplest, - least accurate (we’ll talk about this one for now)

We will discuss variational inference in more detail next week!
Model-based RL with latent space models

deterministic encoder

Everything is differentiable, can train with backprop
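
A hedged sketch of what “everything is differentiable” looks like in practice for the deterministic-encoder variant: encode both observations, predict the next latent and the reward, and backpropagate one combined loss through encoder, dynamics, decoder, and reward model. All module names and the use of squared-error losses are illustrative choices, not the lecture's exact formulation.

import torch
import torch.nn.functional as F

def latent_model_loss(encoder, dynamics, decoder, reward_model, o, a, o_next, r):
    """Joint loss for a deterministic-encoder latent space model.

    encoder(o) -> s            deterministic latent state
    dynamics(s, a) -> s_next   latent dynamics model
    decoder(s) -> o_hat        observation (reconstruction) model
    reward_model(s, a) -> r_hat
    """
    s = encoder(o)
    s_next_pred = dynamics(s, a)
    s_next_target = encoder(o_next)

    recon_loss = F.mse_loss(decoder(s), o)              # image reconstruction term
    dyn_loss = F.mse_loss(s_next_pred, s_next_target)   # latent dynamics term
    reward_loss = F.mse_loss(reward_model(s, a), r)     # reward model term
    return recon_loss + dyn_loss + reward_loss          # backprop through everything
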


Model-based RL with latent space models

latent space dynamics, image reconstruction, reward model

Many practical methods use a stochastic encoder to model uncertainty
Model-based RL with latent space models
every N steps
Learn directly in observation space

Finn, L. Deep Visual Foresight for Planning Robot Motion. ICRA 2017.
Ebert, Finn, Lee, L. Self-Supervised Visual Planning with Temporal Skip Connections. CoRL 2017.
Use predictions to complete tasks

Designated pixel, goal pixel, task execution
