
Towards Vision-Based Deep Reinforcement Learning for Robotic Motion Control

Fangyi Zhang, Jürgen Leitner, Michael Milford, Ben Upcroft, Peter Corke
ARC Centre of Excellence for Robotic Vision (ACRV)
Queensland University of Technology (QUT)
fangyi.zhang@hdr.qut.edu.au
arXiv:1511.03791v2 [cs.LG] 13 Nov 2015

Abstract

This paper introduces a machine learning based system for controlling a robotic manipulator with visual perception only. The capability to autonomously learn robot controllers solely from raw-pixel images and without any prior knowledge of configuration is shown for the first time. We build upon the success of recent deep reinforcement learning and develop a system for learning target reaching with a three-joint robot manipulator using external visual observation. A Deep Q Network (DQN) was demonstrated to perform target reaching after training in simulation. Transferring the network to real hardware and real observation in a naive approach failed, but experiments show that the network works when replacing camera images with synthetic images.

Figure 1: Baxter's arm being controlled by a trained Deep Q Network (DQN). Synthetic images (on the right) are fed into the DQN to overcome some of the real-world issues encountered, i.e., the differences between training and testing settings.
1 Introduction

Robots are widely used to complete various manipulation tasks in industrial manufacturing factories, where environments are relatively static and simple. However, these operations remain challenging for robots in the highly dynamic and complex environments commonly encountered in everyday life. Nevertheless, humans are able to manipulate in such highly dynamic and complex environments. We seem to learn manipulation skills by observing how others perform them (learning from observation), as well as master new skills through trial and error (learning from exploration). Inspired by this, we want robots to learn and master manipulation skills in the same way.

To give robots the ability to learn from exploration, methods are required that can learn autonomously and that are flexible across a range of differing manipulation tasks. A promising candidate for autonomous learning in this regard is Deep Reinforcement Learning (DRL), which combines reinforcement learning and deep learning. One topical example of DRL is the Deep Q Network (DQN), which, after learning to play Atari 2600 games over 38 days, was able to match human performance when playing the games [Mnih et al., 2013; Mnih et al., 2015]. Despite their promise, applying DQNs to "perfect" and relatively simple computer game worlds is a far cry from deploying them in complex robotic manipulation tasks, especially when factors such as sensor noise and image offsets are considered.

This paper takes the first steps towards enabling DQNs to be used for learning robotic manipulation. We focus on learning these skills from visual observation of the manipulator, without any prior knowledge of configuration or joint state. Towards this end, as first steps, we assess the feasibility of using DQNs to perform a simple target reaching task, an important component of general manipulation tasks such as object picking. In particular, we make the following contributions:

• We present a DQN-based learning system for a target reaching task. The system consists of three components: a 2D robotic arm simulator for target reaching, a DQN learner, and ROS-based interfaces to enable operation on a Baxter robot.

• We train agents in simulation and evaluate them in both simulation and real-world target reaching experiments. The experiments in simulation are conducted with varying levels of noise, image offsets, initial arm poses and link lengths, which are common concerns in robotic motion control and manipulation.

• We identify and discuss a number of issues and opportunities for future work towards enabling vision-based deep reinforcement learning in real-world robotic manipulation.

2 Related Work

2.1 Vision-based Robotic Manipulation

Vision-based robotic manipulation is the process by which robots use their manipulators (such as robotic arms) to rearrange environments [Mason, 2001], based on camera images. Early vision-based robotic manipulation was implemented using pose-based (position and orientation) closed-loop control, where vision was typically used to extract the pose of an object as an input for a manipulation controller at the beginning of a task [Kragic and Christensen, 2002].

Most current vision-based robotic manipulation methods are closed-loop, based on visual perception. A vision-based manipulation system was implemented on a Johns Hopkins "Steady Hand Robot" for cooperative manipulation at millimeter to micrometer scales, using virtual fixtures [Bettini et al., 2004]. With both monocular and binocular vision cues, various closed-loop visual strategies were applied to enable robots to manipulate both known and unknown objects [Kragic et al., 2005].

Various learning methods have also been applied to implement complex manipulation tasks in the real world. With continuous hidden Markov models (HMMs), a humanoid robot was able to learn dual-arm manipulation tasks from human demonstrations through vision [Asfour et al., 2008]. However, most of these algorithms are designed for specific tasks and need much prior knowledge. They are not flexible enough to learn a range of different manipulation tasks.

2.2 Reinforcement Learning in Robotics

Reinforcement Learning (RL) [Sutton and Barto, 1998; Kormushev et al., 2013] has been applied in robotics, as it promises a way to learn complex actions on complex robotic systems by only informing the robot whether its actions were successful (positive reward) or not (negative reward). [Peters et al., 2003] reviewed some of the RL concepts in terms of their applicability to controlling complex humanoid robots, highlighting some of the issues with greedy policy search and gradient-based methods. How to generate the right reward is an active topic of research. Intrinsic motivation and curiosity have been shown to provide a means to explore large state spaces, such as the ones found on complex humanoids, faster and more efficiently [Frank et al., 2014].

2.3 Deep Visuomotor Policies

To enable robots to learn manipulation skills with little prior knowledge, a convolutional neural network (CNN) based policy representation architecture (deep visuomotor policies) and its guided policy search method were introduced by Levine et al. [Levine et al., 2015a; Levine et al., 2015b]. The deep visuomotor policies map joint angles and camera images directly to joint torques. Robot configurations are the only necessary prior knowledge. The policy search method consists of two phases, i.e., an optimal control phase and a supervised learning phase. The training consists of three procedures, i.e., pose CNN training, trajectory pre-training, and end-to-end training.

The deep visuomotor policies did enable robots to learn manipulation skills with little prior knowledge through supervised learning, but pre-collected datasets were necessary. Human involvement in the dataset collection made this method less autonomous. Besides, the training method, specifically designed to speed up contact-rich manipulation learning, made it less flexible for other manipulation tasks.

2.4 Deep Q Network

The DQN, a topical example of DRL, satisfies both the autonomy and flexibility requirements for learning from exploration. It successfully learnt to play 49 different Atari 2600 games, achieving human-level control [Mnih et al., 2015]. The DQN used a deep convolutional neural network (CNN) [Krizhevsky et al., 2012] to approximate a Q-value function, mapping raw pixel images directly to actions. No feature extraction of the inputs is needed; the only requirement is to let the algorithm improve its policy by playing the games over and over again. It learnt to play 49 different games using the same network architecture with no modification.

The DQN is defined by its inputs – raw pixels of game video frames and received rewards – and its outputs, i.e., the number of available actions in a game [Mnih et al., 2015]. This number of actions is the only prior knowledge, which means no robot configuration information is needed by the agent when using the DQN for motion control. However, in the DQN training process the Atari 2600 game engine served as a reward function, and for robotic motion control no such engine exists. To apply the DQN to robotic motion control, a reward function is therefore needed to assess trials. Besides, sensing noise and higher complexity and dynamics are inevitable issues for real-world applications.
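To make the preceding point concrete, the following sketch shows the standard DQN update rule, in which the game engine's score is simply replaced by a task-specific reward signal. This is a minimal illustration in PyTorch, not the code used in the paper (which builds on the original DeepMind DQN release); q_net, target_net, the optimizer and the replay batch are assumed to be provided by a surrounding training loop.

    # Illustrative sketch (not the authors' code): the standard DQN update,
    # where the Atari game score is replaced by a task-specific reward signal.
    # Assumes a replay batch of (states, actions, rewards, next_states, dones),
    # with dones given as a float tensor (1.0 at terminal transitions).
    import torch
    import torch.nn.functional as F

    def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
        states, actions, rewards, next_states, dones = batch

        # Q(s, a) for the actions actually taken
        q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

        # Bellman target: r + gamma * max_a' Q_target(s', a'), zeroed at terminal states
        with torch.no_grad():
            next_q = target_net(next_states).max(dim=1).values
            targets = rewards + gamma * (1.0 - dones) * next_q

        loss = F.smooth_l1_loss(q_values, targets)   # Huber loss (error clipping), commonly used with DQN
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()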
Figure 2: Schematic of the DQN layers for end-to-end learning and their respective outputs. Four input images are reshaped (Rs) and then fed into the DQN network as grey-scale images (converted from RGB). The DQN consists of three convolutional layers with rectifier layers (Rf) after each, followed by a reshaping layer (Rs) and two fully connected layers (again with a rectifier layer in between). The normalized outputs of each layer are visualized. (Note: The outputs of the last four layers are shown as matrices instead of vectors.)

3 Problem Definition and System Description

A common problem in robotic manipulation is reaching for the object to be interacted with. This target reaching task is defined as controlling a robot arm such that its end-effector reaches a specific target configuration. We are interested in the case in which a robot performs target reaching with visual perception only. To learn such a task, we developed a system consisting of three parts:

• a 2D simulator for robotic target reaching, creating the visual inputs to the learner,

• a deep reinforcement learning framework based on the DQN implementation by Google DeepMind [Mnih et al., 2015], and

• a component of ROS-based interfaces to control a Baxter robot according to the DQN outputs.

3.1 DQN-based Learning System

The DQN adopted here has the same architecture as the one used for playing Atari games, which contains three convolutional layers and two fully connected layers [Mnih et al., 2015]. Its implementation is based on the Google DeepMind DQN code (https://sites.google.com/a/deepmind.com/dqn/) with minor modifications. Fig. 2 shows the architecture and exemplary outputs of each layer. The inputs of the DQN are rewards and images; its output is the index of the action to take. The DQN learns target reaching skills through its interactions with the target reaching simulator. An overview of the system framework for both learning in simulation and testing on a real robot is shown in Fig. 3 (Figure 3: System overview).

When training or testing in simulation, the target reaching simulator provides the reward value (R) and image (I). R is used for training the network. The action output (A) of the DQN is sent directly to the simulated robotic arm.

When testing on a Baxter robot using camera images, an external camera provides the input images (I). The action output (A) of the DQN is executed on the robot through the ROS-based interfaces, which control the robot by sending updated joint poses (q').
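For reference, the sketch below reproduces the layer layout described above and in Fig. 2: three convolutional layers followed by two fully connected layers, operating on a stack of four 84 x 84 grey-scale frames and returning one Q-value per action. It is a hedged PyTorch re-implementation for illustration only, not the DeepMind code used in the paper; the layer hyperparameters are assumed here to follow the Atari network of Mnih et al. (2015), which the text says was adopted with only minor modifications.

    # Illustrative PyTorch sketch of the adopted DQN architecture:
    # three convolutional layers followed by two fully connected layers, taking a
    # stack of four 84x84 grey-scale frames and returning one Q-value per action.
    import torch
    import torch.nn as nn

    class DQN(nn.Module):
        def __init__(self, n_actions=9):          # 9 actions for the 3-joint simulator
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            )
            self.head = nn.Sequential(
                nn.Flatten(),                      # reshape before the fully connected layers
                nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
                nn.Linear(512, n_actions),         # one Q-value per available action
            )

        def forward(self, frames):                 # frames: (batch, 4, 84, 84), values in [0, 1]
            return self.head(self.features(frames))

    # Action selection: the DQN output is the index of the action to take.
    # action = DQN()(frames).argmax(dim=1)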
3.2 Target Reaching Simulator

We simulate the reaching task to control a three-joint robotic arm in 2D (Fig. 4). The simulator was implemented from scratch; no simulation platform was used. As shown in Fig. 4(a), the robotic arm consists of four links and three joints whose configurations are consistent with the specifications of a Baxter arm, including joint constraints. The blue spot is the target to be reached. For better visualization, the position of the end-effector is marked with a red spot.

Figure 4: The 2D target reaching simulator, providing visual inputs to the DQN learner. It was implemented from scratch; no simulation platform was used. (a) Schematic diagram, showing the target, the completion area, the joints "S1", "E1" and "W1", and the end-effector. (b) The robot simulator during a successful reach.

The simulator can be controlled by sending specific commands to the individual joints "S1", "E1" and "W1". The simulator screen resolution is 160 x 320.

The corresponding real scenario that the simulator simulates is the following: with appropriate constant joint angles for the other joints on a Baxter arm, the arm moves in a vertical plane controlled by joints "S1", "E1" and "W1", and a controller (game player) observes the arm through an external camera placed directly beside it with a horizontal point of view. The three joints are in position control mode. The background is white.

In the system, the 2D simulator is used as a target reaching video game in connection with the DQN setup. It provides raw pixel inputs to the network and has nine options for action, i.e., three buttons for each joint: joint angle increasing, decreasing and hold. The joint angle increasing/decreasing step is constant at 0.02 rad. At the beginning of each round, joints "S1", "E1" and "W1" are set to a certain initial pose, such as [0.0, 0.0, 0.0] rad, and the target is randomly selected.

During game playing, a reward value is returned for each button press. The reward value is determined by a reward function introduced in Section 3.3. The game terminates when certain conditions are satisfied; the terminal condition is determined by the reward function as well. For a player, the goal is to obtain an accumulated reward that is as high as possible before the game terminates. For clarity, we name an entire trial from the start of the game to its terminal as one round.

3.3 Reward Function

To keep consistent with the DQN setup, the reward function has two return values: one is the reward for the current action; the other indicates whether the target reaching game is terminal. Its algorithm is shown in Algorithm 1. The reward of each action is determined according to the change in distance between the end-effector and the target: if the distance decreases, the reward function returns 1; if it increases, it returns -1; otherwise it returns 0. If the sum of the latest three rewards is smaller than -1, the game terminates. This reward function was designed as a first step; more study is necessary to obtain an optimal reward function.

Algorithm 1: Reward Function
  input : Pt, the target 2D coordinates; Pe, the end-effector 2D coordinates.
  output: R, the reward for the current state; T, whether the game is terminal.
  Dis = ComputeDistance(Pt, Pe);
  DisChange = Dis - PreviousDis;
  if DisChange > 0 then
      R = -1;
  else if DisChange < 0 then
      R = 1;
  else
      R = 0;
  end
  Racc = R_t + R_{t-1} + R_{t-2};
  if Racc < -1 then
      T = True;
  else
      T = False;
  end
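To make Algorithm 1 concrete, a direct Python transcription is given below. It is a sketch under the assumptions stated in the text (distance-change reward of +1/-1/0 and termination when the latest three rewards sum to less than -1); variable names such as prev_dis and reward_history are introduced here for illustration and are not from the paper.

    # Sketch of Algorithm 1 (reward function) as plain Python, for illustration only.
    # p_t, p_e: 2D coordinates of the target and the end-effector.
    # prev_dis: distance at the previous step; reward_history: rewards of earlier steps.
    import math

    def reward_function(p_t, p_e, prev_dis, reward_history):
        dis = math.hypot(p_t[0] - p_e[0], p_t[1] - p_e[1])   # ComputeDistance(Pt, Pe)
        dis_change = dis - prev_dis

        if dis_change > 0:        # moved away from the target
            r = -1
        elif dis_change < 0:      # moved closer to the target
            r = 1
        else:                     # no change in distance
            r = 0

        # Terminate when the accumulated reward of the latest three steps drops below -1
        r_acc = r + sum(reward_history[-2:])
        terminal = r_acc < -1
        return r, terminal, dis

The caller is expected to append r to reward_history and carry dis forward as prev_dis for the next step.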
4 Experiments and Results

To evaluate the feasibility of the DQN-based system in learning to perform target reaching, we conducted experiments in both simulation and real-world scenarios. The experiments consist of three phases: training in simulation, testing in simulation, and testing in the real world.

4.1 Training in Simulation Scenarios

To evaluate the capability of the DQN to adapt to the kinds of noise commonly encountered in robotic manipulation, we trained several agents with different simulator settings. The different settings include sensing noise, image offsets, and variations in initial arm pose and link length. The settings used for training the five agents are listed in Table 1, and their screenshots are shown in Fig. 5.

Figure 5: Screenshots highlighting the different training scenarios for the agents. (a) Setting A: simulation images; (b) Setting B: simulation images + noise; (c) Setting C: Setting B + random initial pose; (d) Setting D: Setting C + random image offset; (e) Setting E: Setting D + random link length.

Table 1: Agents and training settings

  Agent | Simulator Settings
  A     | constant initial pose
  B     | Setting A + random image noise
  C     | Setting B + random initial pose
  D     | Setting C + random image offset
  E     | Setting D + random link length

Agent A was trained in Setting A, where the 2D robotic arm was initialized to the same pose ([0.0, 0.0, 0.0] rad) at the beginning of each round and there was no image noise. To simulate camera sensing noise, random noise was added in Setting B on the basis of Setting A. The random noise followed a uniform distribution with a scale between -0.1 and 0.1 (for float pixel values).

In Setting C, in addition to random image noise, the initial arm pose was randomly selected. In the training of Agent D, random image offsets were added on the basis of Setting C; the offset ranges in the u and v directions were [-23, 7] and [-40, 20] pixels, respectively. Agent E was trained with dynamic arm link lengths; the link length variation ratio was [-4.2, 12.5]% with respect to the link lengths used in the previous four settings. The image offsets and link lengths were randomly selected at the beginning of each round and stayed unchanged for the entire round (they did not vary at each frame). All the parameters for the noisy factors were empirically selected as a first step.

All the agents were trained for more than 4 million steps within 160 hours. Due to the differences in setting complexity, the time cost for the simulator to update each game video frame varies across the five settings. Therefore, within 160 hours, the exact numbers of training steps for the five agents differ: they are 6.475, 6.275, 5.225, 4.75 and 6.35 million, respectively.

The action Q-value convergence curves, plotted against training epochs, are shown in Fig. 6. Each epoch contains 50,000 training steps. Fig. 6 shows the convergence behaviour before 80 epochs, i.e., 4 million training steps. The average maximum action Q-values are the average of the estimated maximum Q-values for all states in a validation set. The validation set was randomly selected at the beginning of each training run.

Figure 6: Action Q-value convergence curves for Agents A-E. Each epoch contains 50,000 training steps. The average maximum action Q-values are the average of the estimated maximum Q-values for all states in a validation set. The validation set has 500 frames.

From Fig. 6, we can observe that all five agents converge towards a certain Q-value, although their values differ. One thing we have to emphasize is that this convergence is only of the average maximum action Q-values. A high value may, but does not necessarily, indicate high performance of an agent in performing target reaching, since this value cannot completely reflect target reaching performance.
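The convergence metric plotted in Fig. 6 can be computed directly from the Q-network. The snippet below is an illustrative sketch, assuming the PyTorch-style DQN sketched in Section 3.1 and a pre-collected validation tensor of 500 frame stacks; it is not the evaluation code used for the paper.

    # Illustrative sketch: the metric plotted in Fig. 6, i.e., the average of the
    # estimated maximum Q-values over a fixed validation set of states.
    import torch

    @torch.no_grad()
    def average_max_q(q_net, validation_states):
        # validation_states: tensor of shape (500, 4, 84, 84), held fixed during training
        q_values = q_net(validation_states)          # (500, n_actions)
        return q_values.max(dim=1).values.mean().item()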
4.2 Testing in Simulation Scenarios

We tested the five agents in simulation scenarios with the 2D simulator. Each agent was tested in all five settings listed in Table 1. Each test took 200 rounds, i.e., it terminated 200 times. More testing rounds would bring the results closer to the ground truth, but would require too much time.

In the testing, task success rates were evaluated. In the computation of success rates, a round is regarded as a success when the end-effector gets into a completion area with a radius of 16 cm around the target, shown as the grey circle in Fig. 4(a), which is twice the size of the target circle. The radius of 16 cm is equivalent to 15 pixels on the simulator screen. However, for the DQN, this completion area is an ellipse (a = 8 pixels, b = 4 pixels), since the simulator screen is resized from 160 x 320 to 84 x 84 before being input to the learning core.

Table 2 shows the success rates of the different agents in the different settings after 3 million training steps (60 epochs). The data on the diagonal (shaded grey in the original table) show the success rate of each agent tested in the same setting in which it was trained, i.e., Agent A was tested in Setting A.

Table 2: Success rates (%) in different settings

  Agent | Setting A | Setting B | Setting C | Setting D | Setting E
  A     |   51.0    |   53.0    |   14.0    |    8.5    |    8.5
  B     |   50.5    |   49.5    |   11.0    |    8.0    |   10.0
  C     |   32.0    |   34.5    |   36.0    |   22.5    |   14.0
  D     |   13.5    |   16.5    |   22.0    |   19.5    |   15.0
  E     |   13.0    |   16.5    |   20.0    |   16.5    |   19.0

We also ran experiments with agents from different training steps. Table 3 shows the success rates of the agents after certain numbers of training steps. The success rate of each agent was tested in the same simulator setting in which it was trained, i.e., the diagonal case of Table 2. In Table 3, "f" indicates the final number of steps used for training each agent within 160 hours, as mentioned in Section 4.1.

Table 3: Success rates (%) after different training steps (in millions)

  Agent/Setting |  1   |  2   |  3   |  4   |  f
  A             | 36.0 | 43.0 | 51.0 | 36.0 | 36.5
  B             | 58.0 | 55.5 | 49.5 | 51.5 | 13.5
  C             | 30.5 | 33.0 | 36.0 | 48.0 | 13.5
  D             | 16.5 | 17.5 | 19.5 | 26.5 | 14.0
  E             | 13.0 | 18.5 | 19.0 | 23.0 | 27.0

What we discuss regarding the data in Tables 2 and 3 is based on the assumption that some outliers appeared accidentally due to the limited number of testing rounds. Although 200 testing rounds are already able to reveal the trends in success rates, they are insufficient to establish the ground truth, so minor success rate distortions happen occasionally. To make the conclusions more convincing, more study is necessary.

From Table 2, we can see that Agents A and B can both adapt to Settings A and B, but cannot adapt to the other three settings. This shows that these two agents are robust to random image noise, but not robust to dynamic initial arm pose, dynamic link length, or image offsets. The random image noise is not a key feature for these two agents.

In addition, beyond the settings in which they were trained, Agents C, D and E can also achieve relatively high success rates in the settings with fewer noisy factors than their training settings. This indicates that agents trained with more noisy factors can adapt to settings with fewer noisy factors.

In Table 3, we can see that the success rate of each agent normally goes up with more training steps. This shows that, during training, all five agents can learn to adapt to the noisy factors present in their settings. However, some rates go down after a certain number of training steps; e.g., the success rate of Agent A drops after 4 million training steps. Theoretically, with an appropriate reward function, the DQN should keep improving and the success rates should rise over time. The drop was quite possibly caused by the reward function, which can guide the agent in a wrong direction: the evaluation here is based on success rates, while the reward function is based on distance changes, and the relation between the two is only indirect. This indirect relation is what makes incorrect guidance possible, and it should be considered carefully in future work.

Table 3 also shows that the success rate of an agent trained in a more complicated setting is normally smaller than that of an agent in a simpler setting, and that it needs more training time to reach the same level of success. For example, the success rate of Agent E is smaller than that of Agent D at each training stage, but approaches that of Agent D at a later stage.

In general, whether or not the discussion assumption holds, the data in Tables 2 and 3 at least show that the DQN has the capability to adapt to these noisy factors and that learning to perform target reaching from exploration in simulation is feasible. However, more study is necessary to increase the success rates.
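The success criterion above amounts to a simple geometric check. The sketch below shows one way to express it; it is a hedged illustration, not the authors' evaluation code. In the original 160 x 320 screen the completion area is a circle of 15 pixels around the target, which becomes an ellipse with semi-axes of roughly 8 and 4 pixels once the frame is resized anisotropically to 84 x 84 (assuming u is the 160-pixel axis and v the 320-pixel axis).

    # Illustrative sketch of the success check described in Section 4.2.
    # Coordinates are (u, v) pixels; the 160x320 screen is resized to 84x84,
    # so the 15-pixel completion circle becomes an ellipse (a ~ 8, b ~ 4 pixels).

    def is_success_fullres(target_uv, effector_uv, radius_px=15):
        du, dv = target_uv[0] - effector_uv[0], target_uv[1] - effector_uv[1]
        return du * du + dv * dv <= radius_px * radius_px

    def is_success_resized(target_uv, effector_uv, a=8.0, b=4.0):
        # Ellipse test in the 84x84 input seen by the DQN: (du/a)^2 + (dv/b)^2 <= 1
        du, dv = target_uv[0] - effector_uv[0], target_uv[1] - effector_uv[1]
        return (du / a) ** 2 + (dv / b) ** 2 <= 1.0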
4.3 Real World Experiment Using Camera Images

To check the feasibility of the trained agents in the real world, we did a target reaching experiment in real scenarios using camera images, i.e., the second phase mentioned in Section 3.1. In this experiment, we used Agent B trained with 3 million steps, which has relatively high success rates in both Settings A and B in the simulation testing.

The experiment settings were arranged to match the case that the 2D simulator simulates, i.e., a Baxter arm moved on a vertical plane with a white background (a white sheet). A grey-scale camera was placed in front of the arm, observing it from a horizontal point of view. (For the DQN, a grey-scale camera is equivalent to a colour camera: even the images from Atari games and the 2D target reaching simulator are RGB colour images, but they are converted to grey-scale before being input to the network.) The testing scene and a sample input to the DQN are shown in Fig. 7(a) and 7(b), respectively.

Figure 7: Testing scene (a) and a sample input image (b) of the real world experiment using camera images. In the testing scene, a Baxter arm moved on a vertical plane with a white background. To guarantee that the images input to the DQN have an appearance as consistent as possible with those in the simulation scenarios, camera images were cropped and masked with a boundary taken from the background of a simulator screenshot.

In the experiment, to make the agent work in the real world, we tried to match the arm position (in images) in the real scenario to that in the simulation scenario. The position adjustment was made by changing the camera pose and the image cropping parameters. However, no matter how we adjusted them, the agent did not reach the target: the success rate was 0.

Besides the success rate, we also obtained a qualitative result: Agent B mapped specific input images to certain actions, but the mapping was ineffective for performing target reaching. There was some kind of mapping distortion between the real and simulation scenarios, which might be caused by the differences between real-scenario and simulation-scenario images.

4.4 Real World Experiment Using Synthetic Images

To verify the analysis regarding the reason why Agent B failed to perform target reaching, we did another real world experiment using synthetic images instead of camera images. In this experiment, the synthetic images were generated by the 2D simulator according to the real-time joint angles ("S1", "E1" and "W1") of a Baxter robot, which were provided by the ROS-based interfaces. In this case, there was no difference between real-scenario and simulation-scenario images. All other settings were the same as those in Section 4.3, as shown in Fig. 1.

In this experiment, we used the same agent as in Section 4.3, i.e., Agent B trained with 3 million steps. It achieved a success rate consistent with that in the simulation-scenario testing.

According to these results, we can conclude that the reason why Agent B failed to complete the target reaching task with camera images is the existence of input image differences. These differences might come from camera pose variations, colour and shape distortions, or other factors. More study is necessary to pinpoint exactly where the differences come from.
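The synthetic-image pipeline of Section 4.4 can be summarised as a small control loop: read the robot's joint angles, render the corresponding simulator frame, and feed it to the DQN in place of a camera image. The sketch below is an illustrative outline only; the node name, joint names, joint-state topic, simulator.render call and send_joint_command helper are hypothetical stand-ins for the paper's ROS-based interfaces and 2D simulator.

    # Illustrative outline of the synthetic-image experiment (Section 4.4), not the
    # authors' code. Joint angles read over ROS drive the 2D simulator renderer,
    # whose frames replace camera images as the DQN input.
    import rospy
    from sensor_msgs.msg import JointState

    CONTROLLED_JOINTS = ["left_s1", "left_e1", "left_w1"]   # hypothetical joint names

    latest_angles = {}

    def joint_state_callback(msg):
        # Cache the most recent angles of the three controlled joints.
        for name, position in zip(msg.name, msg.position):
            if name in CONTROLLED_JOINTS:
                latest_angles[name] = position

    def run_episode(simulator, dqn_policy, send_joint_command, rate_hz=10):
        rospy.init_node("synthetic_image_dqn")               # hypothetical node name
        rospy.Subscriber("/robot/joint_states", JointState, joint_state_callback)
        rate = rospy.Rate(rate_hz)
        while not rospy.is_shutdown():
            if len(latest_angles) < len(CONTROLLED_JOINTS):
                rate.sleep()                                  # wait for all joint angles
                continue
            angles = [latest_angles[j] for j in CONTROLLED_JOINTS]
            frame = simulator.render(angles)        # synthetic image from real joint angles
            action = dqn_policy(frame)              # DQN maps the image to an action index
            send_joint_command(action)              # executed via the ROS-based interfaces
            rate.sleep()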
5 Conclusion and Discussion

The DQN-based system is feasible for learning to perform target reaching from exploration in simulation, using only visual observation and no prior knowledge. However, the agent (Agent B) trained in simulation failed to perform target reaching in the real world experiment using camera images as inputs, whereas in the real world experiment using synthetic images as inputs the same agent achieved a success rate consistent with that in simulation. These two results show that the failure in the real world experiment with camera images was caused by the input image differences between the real and simulation scenarios. More work is required to determine the causes of these differences.

In the future, we are looking at either decreasing the image differences or making agents robust to them. Decreasing the differences is a trade-off between making the simulator more consistent with real scenarios and preprocessing input images to make them more consistent with those in simulation; increasing the fidelity of the simulator will most likely slow down the simulation and thus increase training time.

Regarding making agents robust to the differences, there are four possible methods: adding variations of the factors causing the image differences into the simulation scenarios during training; adding a fine-tuning process in real scenarios after training in simulation; training in real scenarios directly; and designing a new DRL architecture (which can still be a DQN) that is robust to the image differences.

In addition to solving the problem of image differences, more study is necessary on the design of the reward function. A good reward function is the key to obtaining effective motion control or even manipulation skills, and it also speeds up the learning process. The reward function used in this work is only a first step and is far from being a good reward function. Besides effectiveness and efficiency, a good reward function also needs to be flexible enough for a range of general-purpose motion control or even manipulation tasks.

Furthermore, the visual perception in this work comes from an external monocular camera. An on-robot stereo camera or RGB-D sensor would be a more effective and practical solution for applications in the 3D real world. The joint control mode in this work is position control; other control modes such as velocity control and torque control are more common and appropriate for dynamic motion control and manipulation in real-world applications.

Acknowledgements

This research was conducted by the Australian Research Council Centre of Excellence for Robotic Vision (project number CE140100016). Computational resources and services used in this work were partially provided by the HPC and Research Support Group, Queensland University of Technology (QUT).

References

[Asfour et al., 2008] Tamim Asfour, Pedram Azad, Florian Gyarfas, and Rüdiger Dillmann. Imitation learning of dual-arm manipulation tasks in humanoid robots. International Journal of Humanoid Robotics, 5(02):183–202, 2008.

[Bettini et al., 2004] Alessandro Bettini, Panadda Marayong, Samuel Lang, Allison M Okamura, and Gregory D Hager. Vision-assisted control for manipulation using virtual fixtures. IEEE Transactions on Robotics, 20(6):953–966, 2004.

[Frank et al., 2014] Mikhail Frank, Jürgen Leitner, Marijn Stollenga, Alexander Förster, and Jürgen Schmidhuber. Curiosity driven reinforcement learning for motion planning on humanoids. Frontiers in Neurorobotics, 7(25), 2014.

[Kormushev et al., 2013] Petar Kormushev, Sylvain Calinon, and Darwin G Caldwell. Reinforcement learning in robotics: Applications and real-world challenges. Robotics, 2(3):122–148, 2013.

[Kragic and Christensen, 2002] Danica Kragic and Henrik I Christensen. Survey on visual servoing for manipulation. Technical report, Computational Vision and Active Perception Laboratory, Royal Institute of Technology, Stockholm, Sweden, 2002.

[Kragic et al., 2005] Danica Kragic, Mårten Björkman, Henrik I Christensen, and Jan-Olof Eklundh. Vision for robotic object manipulation in domestic settings. Robotics and Autonomous Systems, 52(1):85–100, 2005.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.

[Levine et al., 2015a] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Technical report, University of California, Berkeley, CA, USA, 2015.

[Levine et al., 2015b] Sergey Levine, Nolan Wagener, and Pieter Abbeel. Learning contact-rich manipulation skills with guided policy search. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 156–163, 2015.

[Mason, 2001] Matthew T Mason. Mechanics of robotic manipulation. MIT Press, 2001.

[Mnih et al., 2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. Technical report, Google DeepMind, London, UK, 2013.

[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[Peters et al., 2003] Jan Peters, Sethu Vijayakumar, and Stefan Schaal. Reinforcement learning for humanoid robotics. In Proceedings of the IEEE-RAS International Conference on Humanoid Robots, pages 1–20, 2003.

[Sutton and Barto, 1998] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 1998.
