Towards Vision-Based Deep Reinforcement Learning For Robotic Motion Control
Abstract
This paper introduces a machine learning based system for controlling a robotic manipulator with visual perception only. The capability to autonomously learn robot controllers solely from raw-pixel images and without any prior knowledge of configuration is shown for the first time. We build upon the success of recent deep reinforcement learning and develop a system for learning target reaching with a three-joint robot manipulator using external visual observation. A Deep Q Network (DQN) was demonstrated to perform target reaching after training in simulation. Transferring the network to real hardware and real observation in a naive approach failed, but experiments show that the network works when replacing camera images with synthetic images.

Figure 1: Baxter's arm being controlled by a trained Deep Q Network (DQN). Synthetic images (on the right) are fed into the DQN to overcome some of the real-world issues encountered, i.e., the differences between training and testing settings.
1 Introduction

Robots are widely used to complete various manipulation tasks in industrial manufacturing factories where environments are relatively static and simple. However, these operations are still challenging for robots in highly dynamic and complex environments commonly encountered in everyday life. Nevertheless, humans are able to manipulate in such highly dynamic and complex environments. We seem to be able to learn manipulation skills by observing how others perform them (learning from observation), as well as master new skills through trial and error (learning from exploration). Inspired by this, we want robots to learn and master manipulation skills in the same way.

To give robots the ability to learn from exploration, methods are required that are able to learn autonomously and that are flexible to a range of differing manipulation tasks. A promising candidate for autonomous learning in this regard is Deep Reinforcement Learning (DRL), which combines reinforcement learning and deep learning. One topical example of DRL is the Deep Q Network (DQN), which, after learning to play Atari 2600 games over 38 days, was able to match human performance on these games [Mnih et al., 2013; Mnih et al., 2015]. Despite their promise, applying DQNs to "perfect" and relatively simple computer game worlds is a far cry from deploying them in complex robotic manipulation tasks, especially when factors such as sensor noise and image offsets are considered.

This paper takes the first steps towards enabling DQNs to be used for learning robotic manipulation. We focus on learning these skills from visual observation of the manipulator, without any prior knowledge of configuration or joint state. Towards this end, as first steps, we assess the feasibility of using DQNs to perform a simple target reaching task, an important component of general manipulation tasks such as object picking. In particular, we make the following contributions:

• We present a DQN-based learning system for a target reaching task. The system consists of three components: a 2D robotic arm simulator for target reaching, a DQN learner, and ROS-based interfaces to enable operation on a Baxter robot (a minimal interface sketch is given after this list).

• We train agents in simulation and evaluate them in both simulation and real-world target reaching experiments. The experiments in simulation are conducted with varying levels of noise, image offsets, initial arm poses and link lengths, which are common concerns in robotic motion control and manipulation.

• We identify and discuss a number of issues and opportunities for future work towards enabling vision-based deep reinforcement learning in real-world robotic manipulation.
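To make the third component above concrete, the following is a minimal, hypothetical sketch of how a discrete action selected by a trained DQN could be mapped to small joint-angle increments on a Baxter arm through the standard Baxter Python SDK (baxter_interface). The joint subset, action set, step size and control rate are illustrative assumptions, not the interfaces used in this work.

    # Hypothetical glue between a trained DQN and a Baxter arm via ROS.
    import rospy
    import baxter_interface

    STEP = 0.02                                 # radians per action step (assumption)
    JOINTS = ['left_s1', 'left_e1', 'left_w1']  # a three-joint subset (assumption)
    # One "increase" and one "decrease" action per joint.
    ACTIONS = [(j, +STEP) for j in JOINTS] + [(j, -STEP) for j in JOINTS]

    def apply_action(limb, action_index):
        """Map a discrete DQN output to a small joint-angle increment."""
        joint, delta = ACTIONS[action_index]
        angles = limb.joint_angles()        # current joint positions (dict)
        angles[joint] += delta
        limb.set_joint_positions(angles)    # non-blocking position command

    if __name__ == '__main__':
        rospy.init_node('dqn_target_reaching')
        left_arm = baxter_interface.Limb('left')
        rate = rospy.Rate(10)               # control rate in Hz (assumption)
        while not rospy.is_shutdown():
            action = 0                      # placeholder for the index chosen by the DQN
            apply_action(left_arm, action)
            rate.sleep()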
2 Related Work

2.1 Vision-based Robotic Manipulation

Vision-based robotic manipulation is the process by which robots use their manipulators (such as robotic arms) to rearrange environments [Mason, 2001], based on camera images. Early vision-based robotic manipulation was implemented using pose-based (position and orientation) closed-loop control, where vision was typically used to extract the pose of an object as an input for a manipulation controller at the beginning of a task [Kragic and Christensen, 2002].

Most current vision-based robotic manipulation methods use closed-loop control based on visual perception. A vision-based manipulation system was implemented on a Johns Hopkins “Steady Hand Robot” for cooperative manipulation at millimeter to micrometer scales, using virtual fixtures [Bettini et al., 2004]. With both monocular and binocular vision cues, various closed-loop visual strategies were applied to enable robots to manipulate both known and unknown objects [Kragic et al., 2005].

Various learning methods have also been applied to implement complex manipulation tasks in the real world. With continuous hidden Markov models (HMMs), a humanoid robot was able to learn dual-arm manipulation tasks from human demonstrations through vision [Asfour et al., 2008]. However, most of these algorithms are designed for specific tasks and need much prior knowledge, so they are not flexible enough to learn a range of different manipulation tasks.

2.2 Reinforcement Learning in Robotics

Reinforcement Learning (RL) [Sutton and Barto, 1998; Kormushev et al., 2013] has been applied in robotics, as it promises a way to learn complex actions on complex robotic systems by just informing the robot whether its actions were successful (positive reward) or not (negative reward). [Peters et al., 2003] reviewed some of the RL concepts in terms of their applicability to controlling complex humanoid robots and highlighted some of the issues with greedy policy search and gradient-based methods. How to generate the right reward is an active topic of research. Intrinsic motivation and curiosity have been shown to provide a means to explore large state spaces, such as the ones found on complex humanoids, faster and more efficiently [Frank et al., 2014].
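As a concrete example of how such a reward signal drives learning, the standard Q-learning update (the rule underlying the DQN discussed below) adjusts the estimated value of taking action a in state s after receiving reward r and reaching state s' as

    Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],

where \alpha is the learning rate and \gamma the discount factor.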
2.3 Deep Visuomotor Policies

To enable robots to learn manipulation skills with little prior knowledge, a convolutional neural network (CNN) based policy representation architecture (deep visuomotor policies) and its guided policy search method were introduced by Levine et al. [Levine et al., 2015a; Levine et al., 2015b]. The deep visuomotor policies map joint angles and camera images directly to joint torques. Robot configurations are the only necessary prior knowledge. The policy search method consists of two phases, i.e., an optimal control phase and a supervised learning phase. The training consists of three procedures, i.e., pose CNN training, trajectory pre-training, and end-to-end training.

The deep visuomotor policies did enable robots to learn manipulation skills with little prior knowledge through supervised learning, but pre-collected datasets were necessary. The human involvement in dataset collection makes this method less autonomous. Besides, the training method, which is specifically designed to speed up contact-rich manipulation learning, makes it less flexible for other manipulation tasks.

2.4 Deep Q Network

The DQN, a topical example of DRL, satisfies both the autonomy and flexibility requirements for learning from exploration. It successfully learnt to play 49 different Atari 2600 games, achieving human-level control [Mnih et al., 2015]. The DQN uses a deep convolutional neural network (CNN) [Krizhevsky et al., 2012] to approximate a Q-value function, mapping raw pixel images directly to actions. No pre-input feature extraction is needed; the algorithm only has to improve its policy by playing the games over and over again. It learnt to play the 49 different games using the same network architecture with no modification.
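Concretely, the parameters \theta of the Q-value function Q(s, a; \theta) are trained to minimise a temporal-difference loss; in the common formulation with a separate target network \theta^- [Mnih et al., 2015] this is

    L(\theta) = \mathbb{E}_{(s,a,r,s')} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right],

with the transitions (s, a, r, s') sampled from an experience replay memory.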
The DQN is defined by its inputs (raw pixels of game video frames and the received rewards) and its outputs, i.e., the number of available actions in a game [Mnih et al., 2015]. This number of actions is the only prior knowledge, which means no robot configuration information is needed by the agent when using the DQN for motion control. However, in the DQN training process the Atari 2600 game engine works as a reward function, and for robotic motion control no such engine exists. To apply the DQN to robotic motion control, a reward function is needed to assess trials. Besides, sensing noise and higher com-
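Such a reward function for target reaching could, purely as an illustration, score each step by the distance between the end effector and the target. The sketch below is hypothetical: the success radius, the reward values, and the assumption that end-effector and target positions are available to the scoring code are not taken from this work.

    import numpy as np

    def reaching_reward(end_effector_xy, target_xy, success_radius=0.05):
        """Toy reward for a 2D target reaching trial (illustrative only)."""
        distance = np.linalg.norm(np.asarray(end_effector_xy) -
                                  np.asarray(target_xy))
        if distance < success_radius:
            return 1.0    # target reached: positive reward
        return -0.01      # otherwise a small penalty, so shorter trials score higher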
Figure 2: Schematic of the DQN layers for end-to-end learning and their respective outputs. Four input images are reshaped (Rs) and then fed into the DQN as grey-scale images (converted from RGB). The DQN consists of three convolutional layers with a rectifier layer (Rf) after each, followed by a reshaping layer (Rs) and two fully connected layers (again with a rectifier layer in between). The normalized outputs of each layer are visualized. (Note: The outputs of the last four layers are shown as matrices instead of vectors.)
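For concreteness, the layer structure described in this caption could be written down as follows. This is a sketch only: the filter counts, kernel sizes, strides, fully connected width and the 84x84 four-frame input follow the Atari DQN of [Mnih et al., 2015] and are assumptions here, as the caption does not specify them.

    import torch.nn as nn

    class DQNNetwork(nn.Module):
        """Three convolutional layers, each followed by a rectifier (ReLU),
        then a reshaping (flatten) step and two fully connected layers with
        a rectifier in between; one output per available action."""

        def __init__(self, num_actions, in_channels=4):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            )
            self.head = nn.Sequential(
                nn.Flatten(),                        # the reshaping layer (Rs)
                nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
                nn.Linear(512, num_actions),         # one Q-value per action
            )

        def forward(self, x):
            # x: a stack of four grey-scale frames, shape (N, 4, 84, 84)
            return self.head(self.features(x))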
Figure 5: Screenshots highlighting the different training scenarios for the agents.
Table 1: Agents and training settings