VDO-SLAM: A Visual Dynamic Object-aware SLAM System

Jun Zhang[co], Mina Henein[co], Robert Mahony and Viorela Ila

Abstract—Combining Simultaneous Localisation and Mapping (SLAM) estimation and dynamic scene modelling can highly benefit robot autonomy in dynamic environments. Robot path planning and obstacle avoidance tasks rely on accurate estimates of the motion of dynamic objects in the scene. This paper presents VDO-SLAM, a robust visual dynamic object-aware SLAM system that exploits semantic information to enable accurate motion estimation and tracking of dynamic rigid objects in the scene without any prior knowledge of the objects' shape or geometric models. The proposed approach identifies and tracks the dynamic objects and the static structure in the environment and integrates this information into a unified SLAM framework. This results in highly accurate estimates of the robot's trajectory and the full SE(3) motion of the objects, as well as a spatiotemporal map of the environment. The system is able to extract linear velocity estimates from the objects' SE(3) motion, providing an important functionality for navigation in complex dynamic environments. We demonstrate the performance of the proposed system on a number of real indoor and outdoor datasets, and the results show consistent and substantial improvements over state-of-the-art algorithms. An open-source version of the code is available*.

Index Terms—SLAM, dynamic scene, object motion estimation, multiple object tracking.

Jun Zhang, Mina Henein and Robert Mahony are with the Australian National University (ANU), 0020 Canberra, Australia. {jun.zhang2,mina.henein,robert.mahony}@anu.edu.au
Viorela Ila is with the University of Sydney (USyd), 2006 Sydney, Australia. viorela.ila@sydney.edu.au
[co]: The two authors contributed equally to this work.
* https://github.com/halajun/vdo_slam

Fig. 1: Results of our VDO-SLAM system. (Top) A full map including the camera trajectory in red, static background points in black, and points on moving objects colour coded by their instance. (Bottom) Detected 3D points on the static background and on the objects' bodies, and the estimated object speed. Black circles represent static points, and each object is shown with a different colour.

I. INTRODUCTION

The ability of a robot to build a model of the environment, often called a map, and to localise itself within this map is a key factor in enabling autonomous robots to operate in real-world environments. Creating these maps is achieved by fusing multiple sensor measurements into a consistent representation using estimation techniques such as Simultaneous Localisation And Mapping (SLAM). SLAM is a mature research topic and has already revolutionised a wide range of applications, from mobile robotics, inspection, entertainment and film production to exploration and monitoring of natural environments, amongst many others. However, most of the existing solutions to SLAM rely heavily on the assumption that the environment is predominantly static.

The conventional techniques to deal with dynamics in SLAM are to either treat any sensor data associated with moving objects as outliers and remove them from the estimation process ([1]–[5]), or to detect moving objects and track them separately using traditional multi-target tracking approaches ([6]–[9]). The former technique excludes information about dynamic objects in the scene and generates static-only maps. The accuracy of the latter depends on the camera pose estimation, which is more susceptible to failure in complex dynamic environments. The increased presence of autonomous systems in dynamic environments is driving the community to challenge the static-world assumption that underpins most existing open-source SLAM algorithms. In this paper, we redefine the term "mapping" in SLAM to be concerned with a spatiotemporal representation of the world, as opposed to the concept of a static map that has long been the emphasis of classical SLAM algorithms. Our approach focuses on accurately estimating the motion of all dynamic entities in the environment, including the robot and other moving objects in the scene, this information being highly relevant in the context of robot path planning and navigation in dynamic environments.
Existing scene motion estimation techniques mainly rely on optical flow estimation ([10]–[13]) and scene flow estimation ([14]–[17]). Optical flow records the scene motion by estimating the velocities associated with the movement of brightness patterns on an image plane. Scene flow, on the other hand, describes the 3D motion field of a scene observed at different instants of time. These techniques only estimate the linear translation of individual pixels or 3D points in the scene, and do not exploit the collective behaviour of points on rigid objects, failing to describe the full SE(3) motion of objects in the scene. In this paper we exploit this collective behaviour of points on individual objects to obtain accurate and robust motion estimation of the objects in the scene while simultaneously localising the robot and mapping the environment.

A typical SLAM system consists of a front-end module, which processes the raw data from the sensors, and a back-end module, which integrates the obtained information (raw and higher-level information) into a probabilistic estimation framework. Simple primitives such as 3D locations of salient features are commonly used to represent the environment. This is largely a consequence of the fact that points are easy to detect, track and integrate within the SLAM estimation problem.

Feature tracking has become more reliable and robust with the advances in deep learning, which provide algorithms that can reliably estimate the 2D optical flow associated with the apparent motion of every pixel of an image in a dense manner, a task that is particularly important for data association and one that has otherwise been challenging in dynamic environments using classical feature tracking methods.

Other primitives such as lines and planes ([18]–[21]) or even objects ([22]–[24]) have been considered in order to provide richer map representations. To incorporate such information in existing geometric SLAM algorithms, either a dataset of 3D models of every object in the scene must be available a priori ([23], [25]), or the front end must explicitly provide object pose information in addition to detection and segmentation ([26]–[28]), adding a layer of complexity to the problem. The requirement for accurate 3D models severely limits the potential domains of application, while, to the best of our knowledge, multiple object tracking and 3D pose estimation remain a challenge for learning techniques. There is a clear need for an algorithm that can exploit the powerful detection and segmentation capabilities of modern deep learning algorithms ([29], [30]) without relying on additional pose estimation or object model priors; an algorithm that operates at feature level with the awareness of an object concept.

While the problems of SLAM and object motion tracking/estimation have long been studied in isolation in the literature, recent approaches try to solve the two problems in a unified framework ([31], [32]). However, they both focus on the SLAM back-end instead of a full system, resulting in severely limited performance in real-world scenarios. In this paper, we carefully integrate our previous works ([31], [33]) and propose VDO-SLAM, a novel feature-based stereo/RGB-D dynamic SLAM system that leverages image-based semantic information to simultaneously localise the robot, map the static and dynamic structure, and track the motions of rigid objects in the scene. Different to [31], we rely on a denser object feature representation to ensure robust tracking, and propose new factors to smooth the motion of rigid objects in urban driving scenarios. Different to [33], an improved robust feature and object tracking method is proposed, with the ability to handle indirect occlusions resulting from the failure of semantic object segmentation. In summary, the contributions of this work are:
• a novel formulation to model dynamic scenes in a unified estimation framework over robot poses, static and dynamic 3D points, and object motions,
• accurate estimation of the SE(3) motion of dynamic objects that outperforms state-of-the-art algorithms, as well as a way to extract the objects' velocity in the scene,
• a robust method for tracking moving objects that exploits semantic information, with the ability to handle indirect occlusions resulting from the failure of semantic object segmentation,
• a demonstrable full system in complex and compelling real-world scenarios.

To the best of our knowledge, this is the first full dynamic SLAM system that is able to achieve motion segmentation and dynamic object tracking, to estimate the camera poses along with the static and dynamic structure and the full SE(3) pose change of every rigid object in the scene, to extract velocity information, and to be demonstrable in real-world outdoor scenarios (see Fig. 1). We demonstrate the performance of our algorithm on real datasets and show the capability of the proposed system to resolve rigid object motion estimation and to yield motion results that are comparable to the camera pose estimation in accuracy and that outperform state-of-the-art algorithms by an order of magnitude in urban driving scenarios.

The remainder of this paper is structured as follows. In Section II we discuss the related work. In Sections III and IV we describe the proposed algorithm and system. We introduce the experimental setup, followed by the results and evaluations, in Section V. We summarise and offer concluding remarks in Section VI.
II. RELATED WORK

In the past two decades, the study of SLAM for dynamic environments has become more and more popular in the community, with a considerable number of algorithms being proposed to solve the dynamic SLAM problem. Motivated by the different goals to achieve, solutions in the literature can be mainly divided into three categories.

The first category aims to explore SLAM that is robust against dynamic environments. Early methods in this category ([2], [34], [35]) normally detect and remove the information drawn from the dynamic foreground, which is seen as degrading the SLAM performance. More recent methods on this track tend to go further by not just removing the dynamic foreground, but also inpainting or reconstructing the static background that is occluded by moving targets. [5] present DynaSLAM, which combines classic geometry and deep learning-based models to detect and remove dynamic objects, and then inpaints the occluded background with multi-view information of the scene. Similarly, a Light Field SLAM front-end is proposed by [36] to reconstruct the occluded static scene via Synthetic Aperture Imaging (SAI) techniques. Different from [5], features on the reconstructed static background are also tracked and used to achieve better SLAM performance. The above state-of-the-art solutions achieve robust and accurate estimation by discarding the dynamic information. However, we argue that this information has potential benefits for SLAM if it is properly modelled. Furthermore, understanding dynamic scenes in addition to SLAM is crucial for many other robotics tasks such as planning, control and obstacle avoidance, to name a few.

Approaches of the second category perform SLAM and Moving Object Tracking (MOT) separately, as an extension to conventional SLAM for dynamic scene understanding ([9], [37]–[39]). [37] developed a theory for performing SLAM with Moving Objects Tracking (SLAMMOT). In the latest version of their SLAM with detection and tracking of moving objects, the estimation problem is decomposed into two separate estimators (moving and stationary objects) to make it feasible to update both filters in real time. [9] tackle the SLAM problem with dynamic objects by solving the problems of Structure from Motion (SfM) and tracking of moving objects in parallel, and unifying the output of the system into a 3D dynamic map containing the static structure and the trajectories of moving objects. Later, in [38], the authors propose to integrate semantic constraints to further improve the 3D reconstruction. The more recent work [39] presents a stereo-based dense mapping algorithm in a SLAM framework, with the advantage of accurately and efficiently reconstructing both the static background and moving objects in large-scale dynamic environments. The algorithms listed above have proven that combining multiple object tracking with SLAM is doable and applicable to dynamic scene exploration. Taking a step further by properly exploiting and establishing the spatial and temporal relationships between the robot, the static background, and stationary and dynamic objects, we show in this paper that the problems of SLAM and multi-object tracking are mutually beneficial.

The last and most active category is object SLAM, which usually includes both static and dynamic objects. Algorithms in this class normally require a specific modelling and representation of 3D objects, such as a 3D shape ([40]–[42]), surfel [43] or volumetric [44] model, or a geometric model such as an ellipsoid ([45], [46]) or a 3D bounding box ([24], [47]–[49]), to extract high-level primitives (e.g., object pose) and integrate them into a SLAM framework. [40] is one of the earliest works to introduce an object-oriented SLAM paradigm, which represents a cluttered scene at the object level and constructs an explicit graph between camera and object poses to achieve joint pose-graph optimisation. Later, [41] propose a novel 3D object recognition algorithm to ensure the system robustness and improve the accuracy of the estimated object pose. The high-level scene representation enables real-time 3D recognition and significant compression of map storage for SLAM. Nevertheless, a database of pre-scanned or pre-trained object models has to be created in advance. To avoid a prebuilt database, representing objects using surfel or voxel elements in a dense manner has started to gain popularity, along with RGB-D cameras becoming widely used. [43] present MaskFusion, which adopts a surfel representation to model, track and reconstruct objects in the scene, while [44] apply an octree-based volumetric model to objects and build a multi-object dynamic SLAM system. Both methods succeed in exploiting object information in a dense RGB-D SLAM framework, without prior knowledge of an object model. Their main interest, however, is the 3D object segmentation and consistent fusion of the dense map rather than the estimation of the motion of the objects.

Lately, the use of basic geometric models to represent objects has become a popular solution due to their lower complexity and easy integration into a SLAM framework. In QuadricSLAM [46], detected objects are represented as ellipsoids to compactly parametrise the size and 3D pose of an object. In this way, the quadric parameters are directly constrained via a geometric error and formulated together with the camera poses in a factor graph SLAM for joint estimation. [24] propose to combine 2D and 3D object detection with SLAM for both static and dynamic environments. Objects are represented as high-quality cuboids and optimised together with points and cameras through multi-view bundle adjustment. While both methods prove the mutual benefit between detected objects and SLAM, their main focus is on object detection and SLAM primarily for static scenarios. In this paper, we take this direction further to tackle the challenging problem of dynamic object tracking within a SLAM framework, and exploit the relationships between moving objects, the agent robot, and the static and dynamic structure for potential advantages.

Apart from the dynamic SLAM categories, the literature on 6-DoF object motion estimation is also crucial for the dynamic SLAM problem. Quite a few methods have been proposed in the literature to estimate the SE(3) motion of objects in a visual odometry or SLAM framework ([50]–[52]). [50] present a model-free method for detecting and tracking moving objects in 3D LiDAR scans. The method sequentially estimates motion models using RANSAC [53], then segments and tracks multiple objects based on the models using a proposed Bayesian approach. In [51], the authors address the problem of simultaneous estimation of ego and third-party SE(3) motions in complex dynamic scenes using cameras. They apply multi-model fitting techniques to a visual odometry pipeline and estimate all rigid motions within a scene. In later work, [52] present ClusterVO, which is able to perform online processing for multiple motion estimations. To achieve this, a multi-level probabilistic association mechanism is proposed to efficiently track features and detections, then a heterogeneous Conditional Random Field (CRF) clustering approach is applied to jointly infer cluster segmentations, with a sliding-window optimisation over the clusters in the end. While the above methods represent an important step forward for the Multi-motion Visual Odometry (MVO) task, the study of spatial and temporal relationships is not fully explored, although it is arguably important. Therefore, by carefully considering the pros and cons in the literature of SLAM+MOT, object SLAM and MVO, this paper proposes a visual dynamic object-aware SLAM system that is able to achieve robust ego and object motion tracking, as well as consistent static and dynamic mapping, in a novel SLAM formulation.
III. METHODOLOGY

Before discussing the details of the proposed system pipeline, shown in Fig. 4, this section covers the mathematical details of the core components of the system. Variables and notation are first introduced, including the novel way of modelling the motion of a rigid object in a model-free manner. Then we show how the camera pose and the object motion are estimated in the tracking component of the system. Finally, a factor graph optimisation is proposed and applied in the mapping component, to refine the camera poses and object motions, and to build a globally consistent map including the static and dynamic structure.

A. Background and Notation

1) Coordinate Frames: Let ${}^0X_k, {}^0L_k \in SE(3)$ be the robot/camera and the object 3D pose respectively, at time $k$ in a global reference frame $0$, with $k \in \mathcal{T}$ the set of time steps. Note that calligraphic capital letters are used in our notation to represent sets of indices. Fig. 2 shows these pose transformations as solid curves.

2) Points: Let ${}^0m^i_k$ be the homogeneous coordinates of the $i^{\mathrm{th}}$ 3D point at time $k$, with ${}^0m^i = [m^i_x, m^i_y, m^i_z, 1]^\top \in \mathrm{I\!E}^3$ and $i \in \mathcal{M}$ the set of points. We write a point in the robot/camera frame as ${}^{X_k}m^i_k = {}^0X_k^{-1}\,{}^0m^i_k$.

Define $\{I_k\}$ the reference frame associated with the image captured by the camera at time $k$, chosen at the top-left corner of the image, and let ${}^{I_k}p^i_k = [u_i, v_i, 1] \in \mathrm{I\!E}^2$ be the pixel location on frame $I_k$ corresponding to the homogeneous 3D point ${}^{X_k}m^i_k$, which is obtained via the projection function $\pi(\cdot)$ as follows:

  ${}^{I_k}p^i_k = \pi({}^{X_k}m^i_k) = K\,{}^{X_k}m^i_k$ ,   (1)

where $K$ is the camera intrinsics matrix.

The camera and/or object motions both produce an optical flow ${}^{I_k}\phi^i \in \mathrm{I\!R}^2$, the displacement vector indicating the motion of pixel ${}^{I_{k-1}}p^i_{k-1}$ from image frame $I_{k-1}$ to $I_k$, given by:

  ${}^{I_k}\phi^i = {}^{I_k}\tilde{p}^i_k - {}^{I_{k-1}}p^i_{k-1}$ .   (2)

Here ${}^{I_k}\tilde{p}^i_k$ is the correspondence of ${}^{I_{k-1}}p^i_{k-1}$ in $I_k$. Note that we overload the same notation to represent the 2D pixel coordinates $\in \mathrm{I\!R}^2$. In this work, we leverage optical flow to find correspondences between consecutive frames.

3) Object and 3D Point Motions: The object motion between times $k-1$ and $k$ is described by the homogeneous transformation ${}_{k-1}^{L_{k-1}}H_k \in SE(3)$ according to:

  ${}_{k-1}^{L_{k-1}}H_k = {}^0L_{k-1}^{-1}\,{}^0L_k$ .   (3)

Fig. 2 shows these motion transformations as dashed curves. We write a point in its corresponding object frame as ${}^{L_k}m^i_k = {}^0L_k^{-1}\,{}^0m^i_k$ (shown as a dashed vector from the object reference frame to the red dot in Fig. 2); substituting the object pose at time $k$ from (3), this becomes:

  ${}^0m^i_k = {}^0L_k\,{}^{L_k}m^i_k = {}^0L_{k-1}\,{}_{k-1}^{L_{k-1}}H_k\,{}^{L_k}m^i_k$ .   (4)

Note that for rigid body objects ${}^{L_k}m^i_k$ stays constant at ${}^{L}m^i$, and ${}^{L}m^i = {}^0L_k^{-1}\,{}^0m^i_k = {}^0L_{k+n}^{-1}\,{}^0m^i_{k+n}$ for any integer $n \in \mathbb{Z}$. Then, for rigid objects with $n = -1$, (4) becomes:

  ${}^0m^i_k = {}^0L_{k-1}\,{}_{k-1}^{L_{k-1}}H_k\,{}^0L_{k-1}^{-1}\,{}^0m^i_{k-1}$ .   (5)

Equation (5) is crucially important, as it relates the same 3D point on a rigid object in motion at consecutive time steps by a homogeneous transformation ${}_{k-1}^{0}H_k := {}^0L_{k-1}\,{}_{k-1}^{L_{k-1}}H_k\,{}^0L_{k-1}^{-1}$. This equation represents a frame change of a pose transformation [54], and shows how the body-fixed frame pose change ${}_{k-1}^{L_{k-1}}H_k$ relates to the global reference frame pose change ${}_{k-1}^{0}H_k$. The point motion in the global reference frame is then expressed as:

  ${}^0m^i_k = {}_{k-1}^{0}H_k\,{}^0m^i_{k-1}$ .   (6)

Equation (6) is at the core of our motion estimation approach, as it expresses the rigid object pose change in terms of the points that reside on the object, in a model-free manner, without the need to include the object 3D pose as a random variable in the estimation. Section III-B2 details how this rigid object pose change is estimated based on the above equation. Here ${}_{k-1}^{0}H_k \in SE(3)$ represents the object point motion in the global reference frame; for the remainder of this document, we refer to this quantity as the object pose change or the object motion for ease of reading.

B. Camera Pose and Object Motion Estimation

The cost function chosen to estimate the camera pose and the object motion is associated with the 3D-2D re-projection error and is defined on the image plane. Since the noise is better characterised in the image plane, this yields more accurate results for camera localisation [55]. Moreover, based on this error term, we propose a novel formulation to jointly optimise the optical flow along with the camera pose and the object motion, to ensure robust tracking of points. In the mapping module, a 3D error cost function is used in the global optimisation to ensure the best results of the 3D structure and object motion estimation, as later described in Section III-C.

1) Camera Pose Estimation: Given a set of static 3D points $\{{}^0m^i_{k-1} \mid i \in \mathcal{M}, k \in \mathcal{T}\}$ observed at time $k-1$ in the global reference frame, and the set of 2D correspondences $\{{}^{I_k}\tilde{p}^i_k \mid i \in \mathcal{M}, k \in \mathcal{T}\}$ in image $I_k$, the camera pose ${}^0X_k$ is estimated by minimising the re-projection error:

  $e_i({}^0X_k) = {}^{I_k}\tilde{p}^i_k - \pi({}^0X_k^{-1}\,{}^0m^i_{k-1})$ .   (7)

We parameterise the SE(3) camera pose by elements of the Lie algebra $x_k \in \mathfrak{se}(3)$:

  ${}^0X_k = \exp({}^0x_k)$ ,   (8)

and define ${}^0x_k^{\vee} \in \mathrm{I\!R}^6$, with the vee operator a mapping from $\mathfrak{se}(3)$ to $\mathrm{I\!R}^6$. Using the Lie-algebra parameterisation of SE(3) with the substitution of (8) into (7), the solution of the least squares cost is given by:

  ${}^0x_k^{*\vee} = \operatorname{argmin}_{{}^0x_k^{\vee}} \sum_i^{n_b} \rho_h\big( e_i^\top({}^0x_k)\,\Sigma_p^{-1}\,e_i({}^0x_k) \big)$   (9)

for all $n_b$ visible 3D-2D static background point correspondences between consecutive frames. Here $\rho_h$ is the Huber function [56], and $\Sigma_p$ is the covariance matrix associated with the re-projection error. The estimated camera pose is given by ${}^0X_k^* = \exp({}^0x_k^*)$ and is found by solving (9) with the Levenberg-Marquardt algorithm.
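To make the estimation in (7)–(9) concrete, the following is a minimal Python sketch of camera pose estimation by robust minimisation of the re-projection error. It is an illustration under simplifying assumptions of ours, not the system's implementation: the pose is parameterised as a rotation vector plus translation rather than a full se(3) twist, the intrinsics and correspondences are synthetic, SciPy's built-in Huber loss stands in for the kernel in (9), and its default trust-region solver stands in for Levenberg-Marquardt.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation


def pose_matrix(params):
    """Build a 4x4 SE(3) camera pose from [rotation vector | translation]."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(params[:3]).as_matrix()
    T[:3, 3] = params[3:]
    return T


def project(K, T_cw, pts_h):
    """pi(.): project homogeneous world points with world-to-camera transform T_cw."""
    cam = (T_cw @ pts_h.T).T[:, :3]          # points expressed in the camera frame
    uv = (K @ cam.T).T
    return uv[:, :2] / uv[:, 2:3]            # pixel coordinates


def reprojection_residuals(params, K, world_pts_h, obs_px):
    """e_i in (7): observed pixel minus re-projection of the static 3D point."""
    T_cw = np.linalg.inv(pose_matrix(params))
    return (obs_px - project(K, T_cw, world_pts_h)).ravel()


# Synthetic stand-in for tracked static background points.
rng = np.random.default_rng(0)
K = np.array([[718.0, 0.0, 607.0], [0.0, 718.0, 185.0], [0.0, 0.0, 1.0]])
pts = rng.uniform([-5, -2, 4], [5, 2, 20], size=(200, 3))
pts_h = np.hstack([pts, np.ones((200, 1))])
true_pose = pose_matrix(np.array([0.01, -0.02, 0.005, 0.3, 0.0, 0.1]))
obs = project(K, np.linalg.inv(true_pose), pts_h) + rng.normal(0, 0.5, (200, 2))

# Robust non-linear least squares; 'huber' plays the role of rho_h in (9).
sol = least_squares(reprojection_residuals, x0=np.zeros(6),
                    args=(K, pts_h, obs), loss='huber', f_scale=1.0)
print("estimated pose:\n", np.round(pose_matrix(sol.x), 3))
```

In the actual system the optimisation is initialised from the previous camera motion or a P3P/RANSAC estimate (Section IV-B2); for this synthetic example a zero initialisation suffices.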
Fig. 2: Notation and coordinate frames. Solid curves represent camera and object poses in the inertial frame, ${}^0X$ and ${}^0L$ respectively, and dashed curves their respective motions in the body-fixed frame. Solid lines represent 3D points in the inertial frame, and dashed lines represent 3D points in camera frames.

2) Object Motion Estimation: Analogous to the camera pose estimation, a cost function based on the re-projection error is constructed to solve for the object motion ${}_{k-1}^{0}H_k$. Using (6), the error term between the re-projection of an object 3D point and the corresponding 2D point in image $I_k$ is:

  $e_i({}_{k-1}^{0}H_k) := {}^{I_k}\tilde{p}^i_k - \pi({}^0X_k^{-1}\,{}_{k-1}^{0}H_k\,{}^0m^i_{k-1}) = {}^{I_k}\tilde{p}^i_k - \pi({}_{k-1}^{0}G_k\,{}^0m^i_{k-1})$ ,   (10)

where ${}_{k-1}^{0}G_k \in SE(3)$. Parameterising ${}_{k-1}^{0}G_k := \exp({}_{k-1}^{0}g_k)$ with ${}_{k-1}^{0}g_k \in \mathfrak{se}(3)$, the optimal solution is found by minimising:

  ${}_{k-1}^{0}g_k^{*\vee} = \operatorname{argmin}_{{}_{k-1}^{0}g_k^{\vee}} \sum_i^{n_d} \rho_h\big( e_i^\top({}_{k-1}^{0}g_k)\,\Sigma_p^{-1}\,e_i({}_{k-1}^{0}g_k) \big)$   (11)

given all $n_d$ visible 3D-2D dynamic point correspondences on an object between frames $k-1$ and $k$. The object motion ${}_{k-1}^{0}H_k = {}^0X_k\,{}_{k-1}^{0}G_k$ can be recovered afterwards.

3) Joint Estimation with Optical Flow: The camera pose and object motion estimation both rely on good image correspondences. Tracking of points on moving objects can be very challenging due to occlusions, large relative motions and large camera-object distances. In order to ensure robust tracking of points, we follow our earlier work [33] and refine the estimation of the optical flow jointly with the motion estimation.

For camera pose estimation, the error term in (7) is reformulated considering (2) as:

  $e_i({}^0X_k, {}^{I_k}\phi^i) = {}^{I_{k-1}}p^i_{k-1} + {}^{I_k}\phi^i - \pi({}^0X_k^{-1}\,{}^0m^i_{k-1})$ .   (12)

Applying the Lie-algebra parameterisation of the SE(3) element, the optimal solution is obtained by minimising the cost function:

  $\{{}^0x_k^{*\vee}, {}^{I_k}\Phi_k^*\} = \operatorname{argmin}_{\{{}^0x_k^{\vee}, {}^{I_k}\Phi_k\}} \sum_i^{n_b} \big\{ \rho_h\big( e_i^\top({}^{I_k}\phi^i)\,\Sigma_\phi^{-1}\,e_i({}^{I_k}\phi^i) \big) + \rho_h\big( e_i^\top({}^0x_k, {}^{I_k}\phi^i)\,\Sigma_p^{-1}\,e_i({}^0x_k, {}^{I_k}\phi^i) \big) \big\}$ ,   (13)

where $\rho_h\big( e_i^\top({}^{I_k}\phi^i)\,\Sigma_\phi^{-1}\,e_i({}^{I_k}\phi^i) \big)$ is the regularisation term with

  $e_i({}^{I_k}\phi^i) = {}^{I_k}\hat{\phi}^i - {}^{I_k}\phi^i$ .   (14)

Here ${}^{I_k}\hat{\Phi} = \{{}^{I_k}\hat{\phi}^i \mid i \in \mathcal{M}, k \in \mathcal{T}\}$ is the initial optical flow obtained through classical or learning-based methods, and $\Sigma_\phi$ is the associated covariance matrix. Analogously, the cost function for the object motion in (11), combined with the optical flow refinement, is given by

  $\{{}_{k-1}^{0}g_k^{*\vee}, {}^{I_k}\Phi_k^*\} = \operatorname{argmin}_{\{{}_{k-1}^{0}g_k^{\vee}, {}^{I_k}\Phi_k\}} \sum_i^{n_d} \big\{ \rho_h\big( e_i^\top({}^{I_k}\phi^i)\,\Sigma_\phi^{-1}\,e_i({}^{I_k}\phi^i) \big) + \rho_h\big( e_i^\top({}_{k-1}^{0}g_k, {}^{I_k}\phi^i)\,\Sigma_p^{-1}\,e_i({}_{k-1}^{0}g_k, {}^{I_k}\phi^i) \big) \big\}$ .   (15)
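The structure of the joint cost (13)–(14) can be sketched as follows, again in Python and under simplifying assumptions of ours: the camera at time k-1 is taken as the reference, the covariances are replaced by scalar weights, and a synthetic noisy flow field stands in for the learned optical-flow initialisation. The same construction applies to the object case (15), with ${}_{k-1}^{0}G_k$ in place of the camera pose and dynamic points in place of static ones.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation


def se3(params):
    """[rotation vector | translation] -> 4x4 SE(3) matrix (illustrative parameterisation)."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(params[:3]).as_matrix()
    T[:3, 3] = params[3:]
    return T


def pi_proj(K, T_cw, pts_h):
    cam = (T_cw @ pts_h.T).T[:, :3]
    uv = (K @ cam.T).T
    return uv[:, :2] / uv[:, 2:3]


def joint_residuals(z, K, m_prev_h, p_prev, flow_init, w_flow=1.0, w_rep=1.0):
    """Stack the flow-regularisation (14) and re-projection (12) residuals.
    z = [6-DoF camera pose | per-point optical flow, 2 values per point]."""
    n = m_prev_h.shape[0]
    pose, flow = z[:6], z[6:].reshape(n, 2)
    pred = pi_proj(K, np.linalg.inv(se3(pose)), m_prev_h)
    e_flow = w_flow * (flow_init - flow)            # e_i(phi) = phi_hat - phi
    e_rep = w_rep * (p_prev + flow - pred)          # e_i(x, phi)
    return np.concatenate([e_flow.ravel(), e_rep.ravel()])


# Synthetic static points, their pixels at k-1, and a noisy flow initialisation.
rng = np.random.default_rng(1)
K = np.array([[718.0, 0.0, 607.0], [0.0, 718.0, 185.0], [0.0, 0.0, 1.0]])
m_prev = rng.uniform([-5, -2, 4], [5, 2, 20], (100, 3))
m_prev_h = np.hstack([m_prev, np.ones((100, 1))])
true_pose = np.array([0.0, 0.01, 0.0, 0.2, 0.0, 0.05])
p_prev = pi_proj(K, np.eye(4), m_prev_h)                      # camera at k-1 is the reference
p_curr = pi_proj(K, np.linalg.inv(se3(true_pose)), m_prev_h)  # observations at time k
flow_init = p_curr - p_prev + rng.normal(0, 1.0, (100, 2))    # stand-in for the learned flow

z0 = np.concatenate([np.zeros(6), flow_init.ravel()])
sol = least_squares(joint_residuals, z0, args=(K, m_prev_h, p_prev, flow_init),
                    loss='huber', f_scale=1.0)
print("refined pose parameters:", np.round(sol.x[:6], 4))
```

The point of the joint formulation is visible in the parameter vector: the per-point flow is refined together with the motion, so points whose initial flow is noisy can still contribute to, and be corrected by, the motion estimate.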
C. Graph Optimisation

The proposed approach formulates dynamic SLAM as a graph optimisation problem, to refine the camera poses and object motions and to build a globally consistent map including the static and dynamic structure. We model the dynamic SLAM problem as a factor graph such as the one demonstrated in Fig. 3. The factor graph formulation is highly intuitive and has the advantage that it allows for efficient implementations of batch ([57], [58]) and incremental ([59]–[61]) solvers.

Fig. 3: Factor graph representation of an object-aware SLAM problem with a moving object. Black squares stand for the camera poses at different time steps, blue for static points, red for the same dynamic point on an object (dashed box) at different time steps, and green for the object pose change between time steps. For ease of visualisation, only one dynamic point is drawn here. A prior factor is shown as a black circle, odometry factors in orange, point measurement factors in white and point motion factors in magenta. A smooth motion factor is shown as a cyan circle.

Four types of measurements/observations are integrated into a joint optimisation problem: the 3D point measurements, the visual odometry measurements, the motions of points on dynamic objects, and the object smooth motion observations.

The 3D point measurement model error $e_{i,k}({}^0X_k, {}^0m^i_k)$ is defined as:

  $e_{i,k}({}^0X_k, {}^0m^i_k) = {}^0X_k^{-1}\,{}^0m^i_k - z^i_k$ .   (16)

Here $z = \{z^i_k \mid i \in \mathcal{M}, k \in \mathcal{T}\}$ is the set of all 3D point measurements at all time steps, with cardinality $n_z$ and $z^i_k \in \mathrm{I\!R}^3$. The 3D point measurement factors are shown as white circles in Fig. 3.

The tracking component of the system provides a high-quality ego-motion estimate via 3D-2D error minimisation, which can be used as an odometry measurement to constrain camera poses in the graph. The visual odometry model error $e_k({}^0X_{k-1}, {}^0X_k)$ is defined as:

  $e_k({}^0X_{k-1}, {}^0X_k) = ({}^0X_{k-1}^{-1}\,{}^0X_k)^{-1}\,{}_{k-1}^{X_{k-1}}T_k$ ,   (17)

where $T = \{{}_{k-1}^{X_{k-1}}T_k \mid k \in \mathcal{T}\}$ is the odometry measurement set with ${}_{k-1}^{X_{k-1}}T_k \in SE(3)$ and cardinality $n_o$. The odometry factors are shown as orange circles in Fig. 3.

The motion model error of points on dynamic objects $e_{i,l,k}({}^0m^i_k, {}_{k-1}^{0}H^l_k, {}^0m^i_{k-1})$ is defined as:

  $e_{i,l,k}({}^0m^i_k, {}_{k-1}^{0}H^l_k, {}^0m^i_{k-1}) = {}^0m^i_k - {}_{k-1}^{0}H^l_k\,{}^0m^i_{k-1}$ .   (18)

The motion of all points on a detected rigid object $l$ is characterised by the same pose transformation ${}_{k-1}^{0}H^l_k \in SE(3)$ given by (6), and the corresponding factor, shown as magenta circles in Fig. 3, is a ternary factor which we call the motion model of a point on a rigid body.

It has been shown that incorporating prior knowledge about the motion of objects in the scene is highly valuable in dynamic SLAM ([31], [37]). Motivated by the camera frame rate and the physical laws governing the motion of relatively large objects (vehicles), which prevent their motions from changing abruptly, we introduce smooth motion factors to minimise the change in consecutive object motions, with the error term defined as:

  $e_{l,k}({}_{k-2}^{0}H^l_{k-1}, {}_{k-1}^{0}H^l_k) = ({}_{k-2}^{0}H^l_{k-1})^{-1}\,{}_{k-1}^{0}H^l_k$ .   (19)

The object smooth motion factor $e_{l,k}({}_{k-2}^{0}H^l_{k-1}, {}_{k-1}^{0}H^l_k)$ is used to minimise the change between the object motions at consecutive time steps and is shown as cyan circles in Fig. 3.

Let $\theta_M = \{{}^0m^i_k \mid i \in \mathcal{M}, k \in \mathcal{T}\}$ be the set of all 3D points, and $\theta_X = \{{}^0x_k^{\vee} \mid k \in \mathcal{T}\}$ the set of all camera poses. We parameterise the SE(3) object motion ${}_{k-1}^{0}H^l_k$ by elements ${}_{k-1}^{0}h^l_k \in \mathfrak{se}(3)$, the Lie algebra of SE(3):

  ${}_{k-1}^{0}H^l_k = \exp({}_{k-1}^{0}h^l_k)$ ,   (20)

and define $\theta_H = \{{}_{k-1}^{0}h^l_k \mid k \in \mathcal{T}, l \in \mathcal{L}\}$ as the set of all object motions, with ${}_{k-1}^{0}h^{l\vee}_k \in \mathrm{I\!R}^6$ and $\mathcal{L}$ the set of all object labels. Given $\theta = \theta_X \cup \theta_M \cup \theta_H$ as all the nodes in the graph, with the Lie-algebra parameterisation of SE(3) for $X$ and $H$ (substituting (8) into (16) and (17), and substituting (20) into (18) and (19)), the solution of the least squares cost is given by:

  $\theta^* = \operatorname{argmin}_{\theta} \Big\{ \sum_{i,k}^{n_z} \rho_h\big( e_{i,k}^\top({}^0x_k,{}^0m^i_k)\,\Sigma_z^{-1}\,e_{i,k}({}^0x_k,{}^0m^i_k) \big) + \sum_{k}^{n_o} \rho_h\big( \log(e_k({}^0x_{k-1},{}^0x_k))^\top\,\Sigma_o^{-1}\,\log(e_k({}^0x_{k-1},{}^0x_k)) \big) + \sum_{i,l,k}^{n_g} \rho_h\big( e_{i,l,k}^\top({}^0m^i_k,{}_{k-1}^{0}h^l_k,{}^0m^i_{k-1})\,\Sigma_g^{-1}\,e_{i,l,k}({}^0m^i_k,{}_{k-1}^{0}h^l_k,{}^0m^i_{k-1}) \big) + \sum_{l,k}^{n_s} \rho_h\big( \log(e_{l,k}({}_{k-2}^{0}h^l_{k-1},{}_{k-1}^{0}h^l_k))^\top\,\Sigma_s^{-1}\,\log(e_{l,k}({}_{k-2}^{0}h^l_{k-1},{}_{k-1}^{0}h^l_k)) \big) \Big\}$ ,   (21)

where $\Sigma_z$ is the 3D point measurement noise covariance matrix, $\Sigma_o$ is the odometry noise covariance matrix, $\Sigma_g$ is the motion noise covariance matrix, with $n_g$ the total number of ternary object motion factors, and $\Sigma_s$ is the smooth motion covariance matrix, with $n_s$ the total number of smooth motion factors. The non-linear least squares problem in (21) is solved using the Levenberg-Marquardt method.
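A compact Python sketch of the four error terms (16)–(19) that enter the factor graph is given below. The residuals follow the definitions directly; robust kernels, covariances and the actual batch solver are omitted, and the SE(3) logarithm is computed with SciPy's general matrix logarithm purely for brevity. All function names are ours.

```python
import numpy as np
from scipy.linalg import logm


def se3_log(T):
    """Map an SE(3) matrix to its 6-vector twist [omega | v] via the matrix logarithm."""
    S = np.real(logm(T))
    omega = np.array([S[2, 1], S[0, 2], S[1, 0]])
    return np.concatenate([omega, S[:3, 3]])


def point_measurement_error(X_k, m_k_world, z_k):
    """(16): 3D point expressed in the camera frame minus its measurement z_k."""
    return (np.linalg.inv(X_k) @ m_k_world)[:3] - z_k


def odometry_error(X_prev, X_k, T_odom):
    """(17), in twist form: discrepancy between the pose change and the odometry measurement."""
    return se3_log(np.linalg.inv(np.linalg.inv(X_prev) @ X_k) @ T_odom)


def point_motion_error(m_k, H_k, m_prev):
    """(18): ternary factor tying two observations of the same point on a rigid object."""
    return (m_k - H_k @ m_prev)[:3]


def smooth_motion_error(H_prev, H_k):
    """(19), in twist form: penalise abrupt changes between consecutive object motions."""
    return se3_log(np.linalg.inv(H_prev) @ H_k)


# Toy check: a constant object motion gives a near-zero smoothness residual,
# and a point propagated by H satisfies the ternary motion factor exactly.
H1 = np.eye(4); H1[:3, 3] = [0.5, 0.0, 0.0]
H2 = np.eye(4); H2[:3, 3] = [0.5, 0.0, 0.0]
m_prev = np.array([1.0, 2.0, 5.0, 1.0])
print(np.round(smooth_motion_error(H1, H2), 6))
print(np.round(point_motion_error(H1 @ m_prev, H1, m_prev), 6))
```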
IV. SYSTEM

In this section, we propose a novel object-aware dynamic SLAM system that robustly estimates both camera and object motions, along with the static and dynamic structure of the environment. The full system overview is shown in Fig. 4. The system consists of three main components: image pre-processing, tracking and mapping.

The input to the system is stereo or RGB-D images. For stereo images, as a first step, we extract depth information by applying the stereo depth estimation method described in [62] to generate depth maps, and the resulting data is treated as RGB-D.

Although this system was initially designed to be an RGB-D system, as an attempt to fully exploit image-based semantic information we also apply single-image depth estimation to obtain depth information from a monocular camera. Our "learning-based monocular" system is monocular in the sense that only RGB images are used as input to the system; however, the estimation problem is formulated using RGB-D data, where the depth is obtained using single-image depth estimation.

A. Pre-processing

There are two challenging aspects that this module needs to fulfil: first, to robustly separate the static background from the objects, and secondly to ensure long-term tracking of dynamic objects. To achieve this, we leverage recent advances in computer vision techniques for instance-level semantic segmentation and dense optical flow estimation, in order to ensure efficient object motion segmentation and robust object tracking.

1) Object Instance Segmentation: Instance-level semantic segmentation is used to segment and identify potentially movable objects in the scene. Semantic information constitutes an important prior in the process of separating static and moving object points, e.g., buildings and roads are always static, but cars can be static or dynamic. Instance segmentation helps to further divide the semantic foreground into different instance masks, which makes it easier to track each individual object. Moreover, segmentation masks provide a "precise" boundary of the object body that ensures robust tracking of points on the object.

2) Optical Flow Estimation: Dense optical flow is used to maximise the number of tracked points on moving objects. Most moving objects occupy only a small portion of the image; therefore, using sparse feature matching does not guarantee robust or long-term feature tracking. Our approach makes use of dense optical flow to considerably increase the number of object points by sampling from all the points within the semantic mask. Dense optical flow is also used to consistently track multiple objects by propagating a unique object identifier assigned to every point on an object mask. Moreover, it allows recovering object masks if semantic segmentation fails; a task that is extremely difficult to achieve using sparse feature matching.

B. Tracking

The tracking component includes two modules: the camera ego-motion tracking, with sub-modules of feature detection and camera pose estimation, and the object motion tracking, including sub-modules of dynamic object tracking and object motion estimation.

1) Feature Detection: To achieve fast camera pose estimation, we detect a sparse set of corner features and track them with optical flow. At each frame, only inlier feature points that fit the estimated camera motion are saved into the map and used to track correspondences in the next frame. New features are detected and added if the number of inlier tracks falls below a certain level (1200 by default). These sparse features are detected on the static background, i.e., image regions excluding the segmented objects.

2) Camera Pose Estimation: The camera pose is computed using (13) for all detected 3D-2D static point correspondences. To ensure robust estimation, a motion model generation method is applied for initialisation. Specifically, the method generates two models and compares their inlier numbers based on the re-projection error. One model is generated by propagating the camera's previous motion, while the other is obtained by computing a new motion transform using the P3P [63] algorithm with RANSAC. The motion model that generates the most inliers is then selected for initialisation.

3) Dynamic Object Tracking: The process of object motion tracking consists of two steps. In the first step, segmented objects are classified into static and dynamic. Then we associate the dynamic objects across pairs of consecutive frames.

• Instance-level object segmentation allows us to separate objects from the background. Although the algorithm is capable of estimating the motions of all the segmented objects, dynamic object identification helps reduce the computational cost of the proposed system. This is done based on scene flow estimation. Specifically, after obtaining the camera pose ${}^0X_k$, the scene flow vector $f^i_k$ describing the motion of a 3D point ${}^0m^i$ between frames $k-1$ and $k$ can be calculated as in [64]:

  $f^i_k = {}^0m^i_{k-1} - {}^0m^i_k = {}^0m^i_{k-1} - {}^0X_k\,{}^{X_k}m^i_k$ .   (22)

Unlike optical flow, scene flow, which is ideally caused only by scene motion, can directly decide whether some structure is moving or not. Ideally, the magnitude of the scene flow vector should be zero for all static 3D points. However, noise or errors in depth and matching complicate the situation in real scenarios. To robustly handle this, we compute the scene flow magnitude of all the sampled points on each object. If the magnitude of the scene flow of a certain point is greater than a predefined threshold, the point is considered dynamic. This threshold was set to 0.12 in all experiments carried out in this work. An object is then recognised as dynamic if the proportion of "dynamic" points is above a certain level (30% of the total number of points), and otherwise as static (a minimal sketch of this test is given after this list). These thresholds were deliberately chosen to be conservative: the system is flexible enough to model a static object as dynamic and estimate a zero motion at every time step, whereas the opposite would degrade the system's performance.

• Instance-level object segmentation only provides single-image object labels. Objects then need to be tracked across frames and their motion models propagated over time. We propose to use optical flow to associate point labels across frames. A point label is the same as the unique object identifier of the object on which the point was sampled. We maintain a finite tracking label set $\mathcal{L} \subset \mathbb{N}$, where $l \in \mathcal{L}$ starts from $l = 1$ for the first detected moving object in the scene. The number of elements in $\mathcal{L}$ increases as more moving objects are detected. Static objects and the background are labelled with $l = 0$. Ideally, for each detected object in frame $k$, the labels of all its points should be uniquely aligned with the labels of their correspondences in frame $k-1$. However, in practice this is affected by noise, image boundaries and occlusions. To overcome this, we assign all the points the label that appears most frequently among their correspondences. For a dynamic object, if the most frequent label in the previous frame is 0, it means that the object starts to move, appears in the scene at the boundary, or reappears from occlusion. In this case, the object is assigned a new tracking label.
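The following sketch illustrates the scene-flow test of (22) with the thresholds quoted above (0.12 m per point and 30% of points per object). It assumes, for simplicity, that depth and point correspondences are already given and that every sampled point carries a per-object label; variable names are ours.

```python
import numpy as np


def classify_objects(points_prev_w, points_curr_cam, X_k, labels,
                     flow_thresh=0.12, dyn_ratio=0.3):
    """Label each segmented object as dynamic when enough of its sampled points
    have a scene-flow magnitude above flow_thresh (metres), following (22).
    points_prev_w  : (N,3) points at time k-1 in the world frame
    points_curr_cam: (N,3) the same points at time k, expressed in the camera frame
    X_k            : (4,4) estimated camera pose 0_X_k
    labels         : (N,) object id of each sampled point"""
    curr_h = np.hstack([points_curr_cam, np.ones((len(points_curr_cam), 1))])
    curr_w = (X_k @ curr_h.T).T[:, :3]            # bring current observations to the world frame
    scene_flow = points_prev_w - curr_w           # f_k^i in (22)
    moving = np.linalg.norm(scene_flow, axis=1) > flow_thresh
    return {int(obj): bool(moving[labels == obj].mean() > dyn_ratio)
            for obj in np.unique(labels)}


# Toy example: object 1 is a parked car, object 2 moves 0.8 m between frames.
rng = np.random.default_rng(2)
X_k = np.eye(4)                                    # identity pose: camera frame == world frame
parked = rng.uniform(-1, 1, (50, 3)) + [5.0, 0.0, 0.0]
moving_car = rng.uniform(-1, 1, (50, 3)) + [10.0, 0.0, 0.0]
prev = np.vstack([parked, moving_car])
curr = np.vstack([parked, moving_car + [0.0, 0.0, 0.8]])
labels = np.array([1] * 50 + [2] * 50)
print(classify_objects(prev, curr, X_k, labels))   # {1: False, 2: True}
```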
Fig. 4: Overview of our VDO-SLAM system. Input images are first pre-processed to generate instance-level object segmentation and dense optical flow. These are then used to track features on the static background structure and on dynamic objects. Camera poses and object motions estimated from feature tracks are then refined in a global batch optimisation, and a local map is maintained and updated with every new frame. The system outputs camera poses, static structure, tracks of dynamic objects, and estimates of their pose changes over time.

4) Object Motion Estimation: As mentioned above, objects normally occupy only small portions of the scene, which makes it hard to obtain sufficient sparse features to track and estimate their motions robustly. We therefore sample every third point within an object mask and track them across frames. Similar to the camera pose estimation, only inlier points are saved into the map and used for tracking in the next frame. When the number of tracked object points decreases below a certain level, new object points are sampled and added. We follow the same method as discussed in Section IV-B2 to generate an initial object motion model.

C. Mapping

In the mapping component, a global map is constructed and maintained. Meanwhile, a local map is extracted from the global map, based on the current time step and a window of previous time steps. Both maps are updated via a batch optimisation process.

1) Local Batch Optimisation: We maintain and update a local map. The goal of the local batch optimisation is to ensure that accurate camera pose estimates are provided to the global batch optimisation. The camera pose estimation has a big influence on the accuracy of the object motion estimation and on the overall performance of the algorithm. The local map is built using a fixed-size sliding window containing the information of the last $n_w$ frames, where $n_w$ is the window size and is set to 20 in this paper. Local maps share some common information; this defines the overlap between the different windows. We choose to only locally optimise the camera poses and static structure within the window, as locally optimising the dynamic structure does not bring any benefit to the optimisation unless a hard constraint (e.g. a constant object motion) is assumed within the window. However, the system is able to incorporate static and dynamic structure in the local mapping if needed. When a local map is constructed, a factor graph optimisation is performed to refine all the variables within the local map, which are then updated back into the global map.

2) Global Batch Optimisation: The output of the tracking component and of the local batch optimisation consists of the camera pose, the object motions and the inlier structure. These are saved in a global map that is constructed over all the previous time steps and is continually updated with every new frame. A factor graph is constructed based on the global map after all input frames have been processed. To effectively explore the temporal constraints, only points that have been tracked for more than 3 instances are added to the factor graph. The graph is formulated as an optimisation problem as described in Section III-C. The optimisation results serve as the output of the whole system.

3) From Mapping to Tracking: Maintaining the map provides history information to the estimation of the current state in the tracking module, as shown in Fig. 4 by the blue arrows going from the global map to multiple components in the tracking module of the system. Inlier points from the last frame are leveraged to track correspondences in the current frame and estimate the camera pose and object motions. The last camera and object motions also serve as possible prior models to initialise the current estimation, as described in Sections IV-B2 and IV-B4. Furthermore, object points help associate semantic masks across frames to ensure robust tracking of objects, by propagating their previously segmented masks in the case of "indirect occlusion" resulting from the failure of semantic object segmentation.
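A simplified sketch of the label-propagation idea behind the object association of Section IV-B3 and the mask propagation described above is given below. It warps every pixel of the previous frame with the dense optical flow and assigns each current-frame instance the tracking label that appears most often among the warped pixels falling inside it, creating a new label when the majority label is 0. Array layouts and names are assumptions for illustration, and sub-pixel flow is simply rounded to the nearest pixel.

```python
import numpy as np


def propagate_labels(prev_labels, flow, curr_mask):
    """Associate per-frame instance masks with persistent tracking labels.
    prev_labels: (H,W) tracking label of each pixel at k-1 (0 = static/background)
    flow       : (H,W,2) dense optical flow from k-1 to k (u,v displacements)
    curr_mask  : (H,W) per-frame instance ids at k (no temporal link yet)
    Returns a dict mapping each current instance id to a tracking label."""
    H, W = prev_labels.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')
    ys, xs = ys.ravel(), xs.ravel()
    xw = np.clip(np.round(xs + flow[ys, xs, 0]).astype(int), 0, W - 1)
    yw = np.clip(np.round(ys + flow[ys, xs, 1]).astype(int), 0, H - 1)
    src_lab = prev_labels[ys, xs]                  # label carried by each warped pixel
    assoc, next_label = {}, int(prev_labels.max()) + 1
    for inst in np.unique(curr_mask):
        if inst == 0:
            continue
        hits = src_lab[curr_mask[yw, xw] == inst]
        best = int(np.bincount(hits).argmax()) if hits.size else 0
        if best == 0:     # object starts moving, enters at the boundary, or reappears
            assoc[int(inst)] = next_label
            next_label += 1
        else:
            assoc[int(inst)] = best
    return assoc
```

The majority vote is what makes the association tolerant to partial segmentation failures: as long as enough previously labelled points land inside the new mask, the object keeps its tracking label even if the mask itself is noisy or missing in some frames.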
V. EXPERIMENTS

We evaluate VDO-SLAM in terms of camera motion, object motion and velocity, as well as object tracking performance. The evaluation is done on the Oxford Multimotion Dataset [65] for indoor scenarios and on the KITTI Tracking dataset [66] for outdoor scenarios, with comparisons to other state-of-the-art methods, including MVO [51], ClusterVO [52], DynaSLAM II [49] and CubeSLAM [24]. Due to the non-deterministic nature of running the proposed system, such as the RANSAC processing, we run each sequence 5 times and take the median values as the reported results. All the results are obtained by running the proposed system with the default parameter setup. Our open-source implementation includes demo YAML files and instructions to run the system on both datasets.

A. Deep Model Setup

We adopt a learning-based instance-level object segmentation method, Mask R-CNN [67], to generate object segmentation masks. The model is trained on the COCO dataset [68] and is used directly in this work without any fine-tuning. For dense optical flow, we leverage a state-of-the-art method, PWC-Net [12]. The model is trained on the FlyingChairs dataset [69], and then fine-tuned on the Sintel [70] and KITTI training datasets [71]. To generate depth maps for the "monocular" version of our proposed system, we apply a learning-based monocular depth estimation method, MonoDepth2 [72]. The model is trained on the Depth Eigen split [73], excluding the data tested in this paper. Feature detection is done using FAST [74] as implemented in [75]. All the above methods are applied using their default parameters.

B. Error Metrics

We use a pose change error metric to evaluate the estimated SE(3) motion, i.e., given a ground truth motion transform $T$ and a corresponding estimated motion $\hat{T}$, where $T \in SE(3)$ can be either a camera relative pose or an object motion, the pose change error is computed as $E = \hat{T}^{-1}T$. This is similar to the Relative Pose Error [76], but we set the time interval $\Delta = 1$ (per frame), because the trajectories of the different objects in a sequence vary from each other and are normally much shorter than the camera trajectory. The translational error $E_t$ (metres) is computed as the $L_2$ norm of the translational component of $E$. The rotational error $E_r$ (degrees) is calculated as the angle of rotation in an axis-angle representation of the rotational component of $E$. Over the camera time steps and the different objects in a sequence, we compute the root mean squared error (RMSE) for the camera poses and the object motions, respectively. The object pose change in the body-fixed frame is obtained by transforming the pose change ${}_{k-1}^{0}H_k$ in the inertial frame into the body frame using the object pose ground truth:

  ${}_{k-1}^{L_{k-1}}H_k = {}^0L_{k-1}^{-1}\,{}_{k-1}^{0}H_k\,{}^0L_{k-1}$ .   (23)

We also evaluate the object speed error. The linear velocity of a point on the object, expressed in the inertial frame, can be estimated by applying the pose change ${}_{k-1}^{0}H_k$ and taking the difference

  ${}^0v \approx {}^0m^i_k - {}^0m^i_{k-1} = ({}_{k-1}^{0}H_k - I_4)\,m^i_{k-1} = {}_{k-1}^{0}t_k - (I_3 - {}_{k-1}^{0}R_k)\,{}^0m^i_{k-1}$ .   (24)

To get a more reliable measurement, we average over all points on an object at a certain time. Define $c_{k-1} := \frac{1}{n}\sum m^i_{k-1}$ for all $n$ points on an object at time $k-1$. Then

  $v \approx \frac{1}{n}\sum_{i=1}^{n}\big( {}_{k-1}^{0}t_k - (I_3 - {}_{k-1}^{0}R_k)\,m^i_{k-1} \big) = {}_{k-1}^{0}t_k - (I_3 - {}_{k-1}^{0}R_k)\,c_{k-1}$ .   (25)

The speed error $E_s$ between the estimated velocity $\hat{v}$ and the ground truth velocity $v$ can then be calculated as $E_s = |\hat{v}| - |v|$.
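A small sketch of the velocity extraction in (25) and of the speed error $E_s$ follows. The conversion to km/h assumes a nominal camera frame rate, which is not specified here and is used only for illustration; the object motion, points and frame rate are synthetic.

```python
import numpy as np


def object_speed(H, pts_prev):
    """Linear velocity of an object from its pose change H = {k-1}^0 H_k and its
    points at time k-1, following (25): v = t - (I - R) c."""
    R, t = H[:3, :3], H[:3, 3]
    c = pts_prev.mean(axis=0)                 # centroid of the n object points
    return t - (np.eye(3) - R) @ c


def speed_error(v_est, v_gt):
    """E_s = |v_est| - |v_gt| (signed difference of speed magnitudes)."""
    return np.linalg.norm(v_est) - np.linalg.norm(v_gt)


# Toy example: an object translating 0.8 m per frame with a slight rotation.
angle = 0.02
H = np.eye(4)
H[:3, :3] = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                      [np.sin(angle),  np.cos(angle), 0.0],
                      [0.0, 0.0, 1.0]])
H[:3, 3] = [0.8, 0.0, 0.0]
pts_prev = np.random.default_rng(3).uniform(-1, 1, (30, 3)) + [15.0, 0.0, 0.0]
v = object_speed(H, pts_prev)
fps = 10.0                                    # assumed camera frame rate (illustrative)
print("estimated speed: %.1f km/h" % (np.linalg.norm(v) * fps * 3.6))
```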
C. Oxford Multimotion Dataset

The recent Oxford Multimotion Dataset [65] contains sequences from a moving stereo or RGB-D camera observing multiple swinging boxes or toy cars in an indoor scenario. Ground truth trajectories of the camera and the moving objects are obtained via a Vicon motion capture system. We only choose the swinging-boxes sequence (500 frames) for evaluation, since results for real driving scenarios are evaluated on the KITTI dataset. Note that the trained model for instance segmentation cannot be applied to this dataset directly, since the training data (COCO) does not contain a square-box class. Instead, we use Otsu's method [77], together with colour information and multi-label processing, to segment the boxes, which works very well for the simple setup of this dataset (coloured boxes that are highly distinguishable from the background). Table I shows results compared to the state-of-the-art MVO [51] and ClusterVO [52], with data provided by the respective authors. As both are visual odometry systems without global refinement, we switch off the batch optimisation module in our system and generate our results accordingly for a fair comparison. We use the error metrics described in Section V-B.

TABLE I: Comparison versus MVO [51] and ClusterVO [52] for camera pose and object motion estimation accuracy on the swinging_4_unconstrained sequence of the Oxford Multimotion dataset. Bold numbers indicate the better results.

                                       VDO-SLAM           MVO                ClusterVO
                                       Er (deg)  Et (m)   Er (deg)  Et (m)   Er (deg)  Et (m)
  Camera                               0.7709    0.0112   1.1948    0.0314   0.7665    0.0066
  Top-left swinging box                1.1889    0.0207   1.4553    0.0288   3.2537    0.0673
  Top-right swinging and rotating box  0.7631    0.0132   0.8992    0.0130   3.5308    0.0256
  Bottom-left swinging box             0.9153    0.0149   1.4949    0.0261   4.9146    0.0763
  Bottom-right rotating box            0.8469    0.0192   0.7815    0.0115   4.0675    0.0144

Compared to MVO, our proposed method achieves better accuracy in the estimation of the camera pose (35%) and of the motion of the swinging boxes, top-left (15%) and bottom-left (40%). We obtain slightly higher errors when a spinning rotational motion of the object is observed, in particular for the top-right swinging and rotating box (in translation only) and for the bottom-right rotating box. We believe that this is due to using an optical flow algorithm that is not well optimised for self-rotating objects. The consequence of this is a poor estimation of the point motion and a consequent degradation of the overall object tracking performance. Even with the associated performance loss for rotating objects, the benefit of dense optical flow motion estimation is clear in the other metrics. Our method performs slightly worse than ClusterVO in the estimate of the camera pose and in the translation of the bottom-right rotating box. Other than that, we achieve improvements of more than a factor of two over ClusterVO in the estimation of object motions.

An illustrative result of the trajectory output of our algorithm on the Oxford Multimotion Dataset is shown in Fig. 5. Tracks of dynamic features on the swinging boxes visually correspond to the actual motion of the boxes. This can be clearly seen in the swinging motion of the bottom-left box, shown in purple in Fig. 5.

Fig. 5: Qualitative results of our method on the Oxford Multimotion Dataset. (Left) The 3D trajectories of the camera (red) and of the centres of the four boxes. (Right) Detected points on the static background and object bodies. Black corresponds to static points, and features on each object are shown in a different colour.

D. KITTI Tracking Dataset

The KITTI Tracking Dataset [66] contains 21 sequences in total, with ground truth information about camera and object poses. Among these sequences, some are not included in the evaluation of our system, as they contain no moving objects (static-only scenes) or only contain pedestrians, which are non-rigid objects and therefore outside the scope of this work. Note that, as only the rotation around the Y-axis is provided in the ground truth object poses, we assign zeros to the other two axes for the convenience of full motion evaluation.

1) Camera Pose and Object Motion: Table II demonstrates the results of both camera pose and object motion estimation on nine sequences, compared to DynaSLAM II [49] and CubeSLAM [24]. The results of DynaSLAM II are obtained directly from their paper, where only the evaluation of the camera pose is available. We initially tried to evaluate CubeSLAM ourselves with the default provided parameters; however, the errors were much higher, and hence we only report results for the five sequences provided by the authors of CubeSLAM after some correspondence. As CubeSLAM is designed for a monocular camera, we also compute results for a learning-based monocular version of our proposed method (as mentioned in Section IV) for a fair comparison.

Fig. 6: Accuracy of object motion estimation of our method compared to CubeSLAM ([24]). The colour bars refer to the translation error, which corresponds to the left Y-axis in log scale. The circles refer to the rotation error, which corresponds to the right Y-axis in linear scale.

Our proposed method achieves competitive and high accuracy in comparison with DynaSLAM II for the estimation of the camera pose. In particular, our method obtains slightly lower rotational errors but higher translational errors than DynaSLAM II. We believe the difference in accuracy is due to the underlying formulations used in estimating the camera pose. When compared to CubeSLAM, our RGB-D version obtains lower errors in camera pose, while our learning-based monocular version is slightly higher. We believe the weaker performance of the monocular version is because the model does not capture the scale of depth accurately with only monocular input. Nevertheless, both versions obtain consistently lower errors in object motion estimation. In particular, as demonstrated in Fig. 6, the translation and rotation errors of CubeSLAM are all above 3 metres and 3 degrees, with errors reaching 32 metres and 5 degrees respectively in extreme cases. In contrast, our translation errors vary between 0.1-0.3 metres and our rotation errors between 0.2-1.5 degrees in the RGB-D case, and 0.1-0.3 metres and 0.4-3.1 degrees in the learning-based monocular case, which indicates that our object motion estimation achieves an order of magnitude improvement in most cases. In general, the results suggest that point-based object motion/pose estimation methods are more robust and accurate than those using high-level geometric models, probably due to the fact that geometric model extraction can lead to a loss of information and introduce additional uncertainty.

2) Object Tracking and Velocity: We also demonstrate the performance of tracking dynamic objects and show results of object speed estimation, which is important information for autonomous driving applications. Fig. 7 illustrates results of object tracking length and object speed for some selected objects (tracked for over 20 frames) in all the tested sequences. Our system is able to track most objects for more than 80% of their occurrence in the sequence. Moreover, our estimated object speed is consistently close to the ground truth.

3) Qualitative Results: Fig. 8 illustrates the output of our system for three of the KITTI sequences. The proposed system is able to output the camera poses, along with the static structure and the dynamic tracks of every detected moving object in the scene, in a spatiotemporal map representation.
TABLE II: Comparison versus DynaSLAM II [49] and CubeSLAM [24] for camera pose and object motion estimation accuracy
on nine sequences with moving objects drawn from the KITTI dataset. Bold numbers indicate the better result.
DynaSLAM II VDO-SLAM (RGB-D) VDO-SLAM (Monocular) CubeSLAM
Camera Camera Object Camera Object Camera Object
Seq Er (deg) Et (m) Er (deg) Et (m) Er (deg) Et (m) Er (deg) Et (m) Er (deg) Et (m) Er (deg) Et (m) Er (deg) Et (m)
00 0.06 0.04 0.0741 0.0674 1.0520 0.1077 0.1830 0.1847 2.0021 0.3827 - - - -
01 0.04 0.05 0.0382 0.1220 0.9051 0.1573 0.1772 0.4982 1.1833 0.3589 - - - -
02 0.02 0.04 0.0182 0.0445 1.2359 0.2801 0.0496 0.0963 1.6833 0.4121 - - - -
03 0.04 0.06 0.0311 0.0816 0.2919 0.0965 0.1065 0.1505 0.4570 0.2032 0.0498 0.0929 3.6085 4.5947
04 0.06 0.07 0.0482 0.1114 0.8288 0.1937 0.1741 0.4951 3.1156 0.5310 0.0708 0.1159 5.5803 32.5379
05 0.03 0.06 0.0219 0.0932 0.3705 0.1140 0.0506 0.1368 0.6464 0.2669 0.0342 0.0696 3.2610 6.4851
06 0.04 0.02 0.0488 0.0186 1.0803 0.1158 0.0671 0.0451 2.0977 0.2394 - - - -
18 0.02 0.05 0.0211 0.0749 0.2453 0.0825 0.1236 0.3551 0.5559 0.2774 0.0433 0.0510 3.1876 3.7948
20 0.04 0.07 0.0271 0.1662 0.3663 0.0824 0.3029 1.3821 1.1081 0.3693 0.1348 0.1888 3.4206 5.6986

GT Tracks EST. Tracks GT Speed EST. Speed of good points in terms of both quantity and quality. This was
240 60

216 54
achieved by refining the estimated optical flow jointly with
192 48
the motion estimation, as discussed in Section III-B3. The

Speed (Km/h)
Track Length

168 42

144 36 effectiveness of joint optimisation is shown by comparing a


120 30

96 24
baseline method that only optimises for the motion (Motion
72 18 Only) using (9) for camera motion or (11) for object motion,
48 12

24 6
and the improved method that optimises for both the motion
0
00-1 00-2 01-1 01-2 02-1 02-2 02-3 03-1 03-2 04-1 05-1 06-1 06-2 06-3 06-4 06-5 06-6 18-1 18-2 18-3 20-1 20-2 20-3
0 and the optical flow (Joint) using (13) or (15). Table III
Sequence-Object ID demonstrates that the joint method obtains considerably more
points that are tracked for long periods.
TABLE III: The number of points tracked for more than five frames on the nine sequences of the KITTI dataset. Bold numbers indicate the better results. Underlined bold numbers indicate an order of magnitude increase in number.

          Background                 Object
Seq   Motion Only    Joint     Motion Only    Joint
00        1798       12812         1704        7162
01         237        5075          907        4583
02        7642       10683           52        1442
03         778       12317          343        3354
04        9913       25861          339        2802
05         713       11627         2363        2977
06        7898       11048          482        5934
18        4271       22503         5614       14989
20        9838       49261         9282       13434

Using the tracked points given by the joint estimation process leads to better estimation of both the camera pose and the object motion. As demonstrated in Table IV, an improvement of about 10% (camera) and 25% (object) in both translation and rotation errors was observed over the nine sequences of the KITTI dataset shown above.

TABLE IV: Average camera pose and object motion errors over the nine sequences of the KITTI dataset. Bold numbers indicate the better results.

            Motion Only              Joint
         Er (deg)   Et (m)     Er (deg)   Et (m)
Camera    0.0412    0.0987      0.0365    0.0866
Object    1.0179    0.1853      0.7085    0.1367

2) Robustness against Non-direct Occlusion: The mask segmentation may fail in some cases, due to direct or indirect occlusions (illumination change, etc.). Thanks to the mask propagation method described in Section IV-C3, our proposed system is able to handle mask failure cases caused by indirect occlusions. Fig. 9 demonstrates an example of tracking a white van for 80 frames, where the mask segmentation fails in 33 frames. Despite the object segmentation failures, our system is still able to track the van continuously, and to estimate its speed with an average error of 2.64 km/h across the whole sequence. Speed errors in the second half of the sequence are higher due to partial direct occlusions and to the increased distance as the object moves farther away from the camera.
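As an illustration of the underlying idea (not the system's actual implementation of Section IV-C3), a previous-frame instance mask can be propagated to the current frame by pushing each labelled pixel along its dense optical-flow vector:

#include <opencv2/core.hpp>

// Forward-propagate per-pixel instance labels using dense optical flow.
// 'prev_mask' holds instance labels (0 = background, CV_32SC1) and 'flow'
// is the dense flow from the previous to the current frame (CV_32FC2).
// The result is used only when the current segmentation is missing.
cv::Mat PropagateMask(const cv::Mat& prev_mask, const cv::Mat& flow) {
  CV_Assert(prev_mask.type() == CV_32SC1 && flow.type() == CV_32FC2);
  cv::Mat curr_mask = cv::Mat::zeros(prev_mask.size(), CV_32SC1);

  for (int y = 0; y < prev_mask.rows; ++y) {
    for (int x = 0; x < prev_mask.cols; ++x) {
      const int label = prev_mask.at<int>(y, x);
      if (label == 0) continue;  // skip background pixels

      // Move each labelled pixel along its flow vector.
      const cv::Vec2f f = flow.at<cv::Vec2f>(y, x);
      const int xn = cvRound(x + f[0]);
      const int yn = cvRound(y + f[1]);
      if (xn < 0 || yn < 0 || xn >= prev_mask.cols || yn >= prev_mask.rows)
        continue;  // the pixel flowed out of the image

      curr_mask.at<int>(yn, xn) = label;
    }
  }
  return curr_mask;
}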
Fig. 8: Illustration of system output; a dynamic map with camera poses, static background structure, and tracks of dynamic objects. Sample results of VDO-SLAM on KITTI sequences. Black represents the static background, and each detected object is shown in a different colour. The top-left figure shows Seq.01 and a zoom-in on the intersection at the end of the sequence, the top-right figure shows Seq.06, and the bottom figure shows Seq.03.

3) Global Refinement on Object Motion: The initial object motion estimation (in the tracking component of the system) is independent between frames, since it is purely related to the sensor measurements. As illustrated in Fig. 10, the blue curve describes the initial object speed estimates of a wagon observed for 55 frames in sequence 03 of the KITTI tracking dataset. As seen in the figure, the speed estimation is not smooth, and large errors occur towards the second half of the sequence. This is mainly caused by the increased distance as the object moves farther away from the camera, with its structure occupying only a small portion of the scene. In this case, estimating the object motion from the sensor measurements alone becomes challenging and error-prone. Therefore, we formulate a factor graph and refine the motions together with the static and dynamic structure, as discussed in Section III-C. The green curve in Fig. 10 shows the object speed results after the global refinement, which become smoother in the first half of the sequence and are significantly improved in the second half.
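The speeds plotted in Figs. 7, 9 and 10 are derived from the estimated SE(3) object motions. A minimal sketch of one way to obtain a linear speed estimate, using the centroid of the tracked object points as the reference point (the exact derivation used in the system may differ):

#include <Eigen/Dense>

// Linear speed from an SE(3) object motion H that maps object points from
// the previous frame to the current frame. 'centroid' is a representative
// object point and 'dt' the time between frames in seconds.
double ObjectSpeedKmh(const Eigen::Matrix4d& H,
                      const Eigen::Vector3d& centroid, double dt) {
  const Eigen::Matrix3d R = H.block<3, 3>(0, 0);
  const Eigen::Vector3d t = H.block<3, 1>(0, 3);
  // Displacement of the centroid under the motion: (R c + t) - c.
  const Eigen::Vector3d disp = R * centroid + t - centroid;
  const double speed_mps = disp.norm() / dt;  // metres per second
  return speed_mps * 3.6;                     // convert to km/h
}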
Fig. 11 demonstrates the average improvement for all objects in each sequence of the KITTI dataset. With the graph optimization, the errors can be reduced by up to 39% in translation and 55% in rotation. Interestingly, the translation errors in Seq.18 and Seq.20 increase slightly. We believe this is because the vehicles keep alternating between acceleration and deceleration due to the heavy traffic jams in both sequences, which strongly violates the smooth motion constraint that is set for general cases.
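A minimal sketch of such a smooth motion penalty between an object's motions at consecutive time steps (unweighted, and not necessarily the exact factor used in the factor graph):

#include <Eigen/Dense>

struct MotionChange { double d_rot_rad; double d_trans_m; };

// Relative change between two consecutive SE(3) object motions; the residual
// is zero when the object keeps exactly the same motion between steps.
MotionChange SmoothMotionResidual(const Eigen::Matrix4d& H_prev,
                                  const Eigen::Matrix4d& H_curr) {
  const Eigen::Matrix4d dH = H_prev.inverse() * H_curr;
  const Eigen::Matrix3d dR = dH.block<3, 3>(0, 0);
  const Eigen::Vector3d dt = dH.block<3, 1>(0, 3);
  return {Eigen::AngleAxisd(dR).angle(), dt.norm()};
}

Objects that genuinely accelerate and brake, as in the congested Seq.18 and Seq.20, incur a real cost under this penalty, which is consistent with the slight increase in translation error observed there.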
4) Computational Analysis: Finally, we provide a computational analysis of our system. The experiments are carried out on an Intel Core i7 2.6 GHz laptop computer with 16 GB RAM. The object semantic segmentation and dense optical flow computation times depend on the GPU power and the CNN model complexity; many current state-of-the-art algorithms can run in real time ([30], [78]). In this paper, the semantic segmentation and optical flow results are produced off-line as input to the system. The SLAM system is implemented in C++ on the CPU, using a modified version of g2o as a back-end [79]. We show the computational time in Table V for both datasets. Overall, the tracking part of our proposed system is able to run at a frame rate of 5-8 fps depending on the number of detected moving objects, which can be improved by employing a parallel implementation. The runtime of the global batch optimisation strongly depends on the number of camera poses (number of frames) and objects (density in terms of the number of dynamic objects observed per frame) present in the scene.
TABLE V: Runtime of different system components for both datasets. The time cost of every component is averaged over all frames and sequences, except for the dynamic object tracking and the object motion estimation, which are averaged over the number of objects.

Dataset   Tasks                                    Runtime (mSec)
KITTI     Feature Detection                        16.2550
          Camera Pose Estimation                   52.6542
          Dynamic Object Tracking (avg/object)      8.2980
          Object Motion Estimation (avg/object)    22.9081
          Map and Mask Updating                    22.1830
          Local Batch Optimisation                 18.2828
OMD       Feature Detection                         7.5220
          Camera Pose Estimation                   32.0909
          Dynamic Object Tracking (avg/object)      7.0134
          Object Motion Estimation (avg/object)    19.5280
          Map and Mask Updating                    30.3153
          Local Batch Optimisation                 15.3414

Fig. 9: Robustness in tracking performance and speed estimation in case of semantic segmentation failure. An example of tracking performance and speed estimation for a white van (ground-truth average speed 20 km/h) in Seq.00. (Top) Blue bars represent a successful object segmentation, and green curves refer to the object speed error. (Bottom-left) An illustration of semantic segmentation failure on the van. (Bottom-right) Result of propagating the previously tracked features on the van by our system.

Fig. 10: Global refinement effect on object speed estimation. The initial (blue) and refined (green) estimated speeds of a wagon in Seq.03, travelling along a straight road, compared to the ground truth speed (red). Note the ground truth speed is slightly fluctuating. We believe this is due to the ground-truth object poses being approximated from lidar scans.

Fig. 11: Improvement on object motion after graph optimization. The numbers in the heatmap show the ratio of decrease in error on the nine sequences of the KITTI dataset. [Heatmap values for Seq.00-06, Seq.18 and Seq.20; Translation: 0.27, 0.27, 0.11, 0.39, 0.1, 0.16, 0.02, -0.03, -0.04; Rotation: 0.2, 0.22, 0.06, 0.54, 0.26, 0.55, 0.04, 0.34, 0.12.]

VI. CONCLUSION

In this paper, we have presented VDO-SLAM, a novel dynamic feature-based SLAM system that exploits image-based semantic information in the scene, with no additional knowledge of the object pose or geometry, to achieve simultaneous localisation, mapping and tracking of dynamic objects. The system consistently shows robust and accurate results on indoor and challenging outdoor datasets, and achieves state-of-the-art performance in object motion estimation. We believe the high accuracy achieved in object motion estimation is due to the fact that our system is a feature-based system. Feature points remain the easiest to detect, track and integrate within a SLAM system, and they require the front-end to have no additional knowledge about the object model, nor to explicitly provide any information about its pose.

An important issue to be addressed is the computational complexity of SLAM with dynamic objects. In long-term applications, different techniques can be applied to limit the growth of the graph ([80], [81]). In fact, history summarisation/deletion of map points pertaining to dynamic objects observed far in the past seems to be a natural step towards a long-term SLAM system in highly dynamic environments.

ACKNOWLEDGEMENTS

This research is supported by the Australian Research Council through the Australian Centre of Excellence for Robotic Vision (CE140100016), and the Sydney Institute for Robotics and Intelligent Systems. The authors would like to thank Mr. Ziang Cheng and Mr. Huangying Zhan for providing help in preparing the testing datasets.

REFERENCES

[1] D. Hahnel, D. Schulz, and W. Burgard, "Map Building with Mobile Robots in Populated Environments," in International Conference on Intelligent Robots and Systems (IROS), vol. 1. IEEE, 2002, pp. 496–501.
[2] D. Hahnel, R. Triebel, W. Burgard, and S. Thrun, "Map Building with Mobile Robots in Dynamic Environments," in International Conference on Robotics and Automation (ICRA), vol. 2. IEEE, 2003, pp. 1557–1563.
[3] D. F. Wolf and G. S. Sukhatme, "Mobile Robot Simultaneous Localization and Mapping in Dynamic Environments," Autonomous Robots, vol. 19, no. 1, pp. 53–65, 2005.
[4] H. Zhao, M. Chiba, R. Shibasaki, X. Shao, J. Cui, and H. Zha, "SLAM in a Dynamic Large Outdoor Environment using a Laser Scanner," in International Conference on Robotics and Automation (ICRA). IEEE, 2008, pp. 1455–1462.
[5] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes," Robotics and Automation Letters (RAL), vol. 3, no. 4, pp. 4076–4083, 2018.
[6] C.-C. Wang, C. Thorpe, and S. Thrun, "Online Simultaneous Localization and Mapping with Detection and Tracking of Moving Objects: Theory and Results from a Ground Vehicle in Crowded Urban Areas," in International Conference on Robotics and Automation (ICRA), vol. 1. IEEE, 2003, pp. 842–849.
[7] I. Miller and M. Campbell, "Rao-blackwellized Particle Filtering for Mapping Dynamic Environments," in International Conference on Robotics and Automation (ICRA). IEEE, 2007, pp. 3862–3869.
[8] J. G. Rogers, A. J. Trevor, C. Nieto-Granda, and H. I. Christensen, "SLAM with Expectation Maximization for Moveable Object Tracking," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2010, pp. 2077–2082.
[9] A. Kundu, K. M. Krishna, and C. Jawahar, "Realtime Multibody Visual SLAM with a Smoothly Moving Monocular Camera," in International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2080–2087.
[10] K. Yamaguchi, D. McAllester, and R. Urtasun, "Efficient Joint Segmentation, Occlusion Labeling, Stereo and Flow Estimation," in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 756–771.
[11] D. Sun, S. Roth, and M. J. Black, "Secrets of Optical Flow Estimation and Their Principles," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2010, pp. 2432–2439.
[12] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
[13] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 2462–2470.
[14] C. Vogel, K. Schindler, and S. Roth, "Piecewise Rigid Scene Flow," in International Conference on Computer Vision (ICCV). IEEE, 2013, pp. 1377–1384.
[15] M. Menze and A. Geiger, "Object Scene Flow for Autonomous Vehicles," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015, pp. 3061–3070.
[16] X. Liu, C. R. Qi, and L. J. Guibas, "FlowNet3D: Learning Scene Flow in 3D Point Clouds," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019, pp. 529–537.
[17] H. Jiang, D. Sun, V. Jampani, Z. Lv, E. Learned-Miller, and J. Kautz, "SENSE: A Shared Encoder Network for Scene-flow Estimation," in International Conference on Computer Vision (ICCV). IEEE, 2019, pp. 3195–3204.
[18] P. de la Puente and D. Rodríguez-Losada, "Feature Based Graph-SLAM in Structured Environments," Autonomous Robots, vol. 37, no. 3, pp. 243–260, 2014.
[19] M. Kaess, "Simultaneous Localization and Mapping with Infinite Planes," in International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 4605–4611.
[20] M. Henein, M. Abello, V. Ila, and R. Mahony, "Exploring the Effect of Meta-structural Information on the Global Consistency of SLAM," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 1616–1623.
[21] M. Hsiao, E. Westman, G. Zhang, and M. Kaess, "Keyframe-based Dense Planar SLAM," in International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 5110–5117.
[22] B. Mu, S.-Y. Liu, L. Paull, J. Leonard, and J. P. How, "SLAM with Objects using a Nonparametric Pose Graph," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2016, pp. 4602–4609.
[23] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, "SLAM++: Simultaneous Localisation and Mapping at the Level of Objects," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2013, pp. 1352–1359.
[24] S. Yang and S. Scherer, "CubeSLAM: Monocular 3-D Object SLAM," Transactions on Robotics (T-RO), vol. 35, no. 4, pp. 925–938, 2019.
[25] D. Gálvez-López, M. Salas, J. D. Tardós, and J. Montiel, "Real-time Monocular Object SLAM," Robotics and Autonomous Systems, vol. 75, pp. 435–449, 2016.
[26] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, "MOT16: A Benchmark for Multi-Object Tracking," arXiv:1603.00831 [cs], Mar. 2016. [Online]. Available: http://arxiv.org/abs/1603.00831
[27] A. Byravan and D. Fox, "SE3-Nets: Learning Rigid Body Motion using Deep Neural Networks," in International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 173–180.
[28] P. Wohlhart and V. Lepetit, "Learning Descriptors for Object Recognition and 3D Pose Estimation," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3109–3118.
[29] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2018.
[30] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT: Real-time Instance Segmentation," in International Conference on Computer Vision (ICCV). IEEE, 2019, pp. 9157–9166.
[31] M. Henein, J. Zhang, R. Mahony, and V. Ila, "Dynamic SLAM: The Need for Speed," in International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 2123–2129.
[32] J. Huang, S. Yang, Z. Zhao, Y. Lai, and S. Hu, "ClusterSLAM: A SLAM Backend for Simultaneous Rigid Body Clustering and Motion Estimation," in International Conference on Computer Vision (ICCV). IEEE, 2019, pp. 5874–5883.
[33] J. Zhang, M. Henein, R. Mahony, and V. Ila, "Robust Ego and Object 6-DoF Motion Estimation and Tracking," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 5017–5023.
[34] P. F. Alcantarilla, J. J. Yebes, J. Almazán, and L. M. Bergasa, "On Combining Visual SLAM and Dense Scene Flow to Increase the Robustness of Localization and Mapping in Dynamic Environments," in International Conference on Robotics and Automation (ICRA). IEEE, 2012, pp. 1290–1297.
[35] W. Tan, H. Liu, Z. Dong, G. Zhang, and H. Bao, "Robust Monocular SLAM in Dynamic Environments," in International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2013, pp. 209–218.
[36] P. Kaveti and H. Singh, "A Light Field Front-end for Robust SLAM in Dynamic Environments," arXiv preprint arXiv:2012.10714, 2020.
[37] C.-C. Wang, C. Thorpe, S. Thrun, M. Hebert, and H. Durrant-Whyte, "Simultaneous Localization, Mapping and Moving Object Tracking," International Journal of Robotics Research (IJRR), vol. 26, no. 9, pp. 889–916, 2007.
[38] N. D. Reddy, P. Singhal, V. Chari, and K. M. Krishna, "Dynamic Body VSLAM with Semantic Constraints," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 1897–1904.
[39] I. A. Bârsan, P. Liu, M. Pollefeys, and A. Geiger, "Robust Dense Mapping for Large-Scale Dynamic Environments," in International Conference on Robotics and Automation (ICRA). IEEE, 2018.
[40] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, "SLAM++: Simultaneous Localisation and Mapping at the Level of Objects," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2013, pp. 1352–1359.
[41] K. Tateno, F. Tombari, and N. Navab, "When 2.5D is Not Enough: Simultaneous Reconstruction, Segmentation and Recognition on Dense SLAM," in International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 2295–2302.
[42] E. Sucar, K. Wada, and A. Davison, "NodeSLAM: Neural Object Descriptors for Multi-View Shape Reconstruction," arXiv preprint arXiv:2004.04485, 2020.
[43] M. Runz, M. Buffier, and L. Agapito, "MaskFusion: Real-time Recognition, Tracking and Reconstruction of Multiple Moving Objects," in International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2018, pp. 10–20.
[44] B. Xu, W. Li, D. Tzoumanikas, M. Bloesch, A. Davison, and S. Leutenegger, "MID-Fusion: Octree-based Object-level Multi-instance Dynamic SLAM," in International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 5231–5237.
[45] M. Hosseinzadeh, K. Li, Y. Latif, and I. Reid, "Real-time Monocular Object-model Aware Sparse SLAM," in International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 7123–7129.
[46] L. Nicholson, M. Milford, and N. Sünderhauf, "QuadricSLAM: Dual Quadrics from Object Detections as Landmarks in Object-oriented SLAM," Robotics and Automation Letters (RAL), vol. 4, no. 1, pp. 1–8, 2018.
[47] P. Li, T. Qin, et al., "Stereo Vision-based Semantic 3D Object and Ego-motion Tracking for Autonomous Driving," in European Conference on Computer Vision (ECCV), 2018, pp. 646–661.
[48] P. Li, J. Shi, and S. Shen, "Joint Spatial-temporal Optimization for Stereo 3D Object Tracking," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020, pp. 6877–6886.
[49] B. Bescos, C. Campos, J. D. Tardós, and J. Neira, "DynaSLAM II: Tightly-coupled Multi-object Tracking and SLAM," Robotics and Automation Letters (RAL), vol. 6, no. 3, pp. 5191–5198, 2021.
[50] A. Dewan, T. Caselitz, G. D. Tipaldi, and W. Burgard, "Motion-based Detection and Tracking in 3D Lidar Scans," in International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 4508–4513.
[51] K. M. Judd, J. D. Gammell, and P. Newman, "Multimotion Visual Odometry (MVO): Simultaneous Estimation of Camera and Third-party Motions," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 3949–3956.
[52] J. Huang, S. Yang, T.-J. Mu, and S.-M. Hu, "ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020, pp. 2168–2177.
[53] M. A. Fischler and R. C. Bolles, "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography," Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[54] G. S. Chirikjian, R. Mahony, S. Ruan, and J. Trumpf, "Pose Changes from a Different Point of View," in The ASME International Design Engineering Technical Conferences (IDETC). ASME, 2017.
[55] D. Nistér, O. Naroditsky, and J. Bergen, "Visual Odometry," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1. IEEE, 2004, pp. I–I.
[56] P. J. Huber, "Robust Estimation of a Location Parameter," in Breakthroughs in Statistics. Springer, 1992, pp. 492–518.
[57] F. Dellaert and M. Kaess, "Square Root SAM: Simultaneous Localization and Mapping via Square Root Information Smoothing," International Journal of Robotics Research (IJRR), vol. 25, no. 12, pp. 1181–1203, 2006.
[58] S. Agarwal, K. Mierle, and Others, "Ceres Solver," http://ceres-solver.org, 2012.
[59] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert, "iSAM2: Incremental Smoothing and Mapping using the Bayes Tree," International Journal of Robotics Research (IJRR), p. 0278364911430419, 2011.
[60] L. Polok, V. Ila, M. Solony, P. Smrz, and P. Zemcik, "Incremental Block Cholesky Factorization for Nonlinear Least Squares in Robotics," in Robotics: Science and Systems (RSS), Berlin, Germany, June 2013.
[61] V. Ila, L. Polok, M. Šolony, and P. Svoboda, "SLAM++: A Highly Efficient and Temporally Scalable Incremental SLAM Framework," International Journal of Robotics Research (IJRR), vol. Online First, no. 0, pp. 1–21, 2017.
[62] K. Yamaguchi, D. McAllester, and R. Urtasun, "Efficient Joint Segmentation, Occlusion Labeling, Stereo and Flow Estimation," in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 756–771.
[63] T. Ke and S. I. Roumeliotis, "An Efficient Algebraic Solution to the Perspective-three-point Problem," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
[64] Z. Lv, K. Kim, A. Troccoli, J. Rehg, and J. Kautz, "Learning Rigidity in Dynamic Scenes with a Moving Camera for 3D Motion Field Estimation," in European Conference on Computer Vision (ECCV). Springer, 2018.
[65] K. M. Judd and J. D. Gammell, "The Oxford Multimotion Dataset: Multiple SE(3) Motions with Ground Truth," Robotics and Automation Letters (RAL), vol. 4, no. 2, pp. 800–807, 2019.
[66] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision Meets Robotics: The KITTI Dataset," International Journal of Robotics Research (IJRR), vol. 32, no. 11, pp. 1231–1237, 2013.
[67] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 2980–2988.
[68] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common Objects in Context," in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 740–755.
[69] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, "A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, pp. 4040–4048.
[70] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, "A Naturalistic Open Source Movie for Optical Flow Evaluation," in European Conference on Computer Vision (ECCV). Springer, 2012, pp. 611–625.
[71] A. Geiger, P. Lenz, and R. Urtasun, "Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012.
[72] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, "Digging into Self-supervised Monocular Depth Estimation," in International Conference on Computer Vision (ICCV). IEEE, 2019, pp. 3828–3838.
[73] D. Eigen, C. Puhrsch, and R. Fergus, "Depth Map Prediction from a Single Image using a Multi-scale Deep Network," in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2366–2374.
[74] E. Rosten and T. Drummond, "Machine Learning for High-speed Corner Detection," in European Conference on Computer Vision (ECCV). Springer, 2006, pp. 430–443.
[75] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An Efficient Alternative to SIFT or SURF," in International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2564–2571.
[76] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A Benchmark for the Evaluation of RGB-D SLAM Systems," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2012, pp. 573–580.
[77] N. Otsu, "A Threshold Selection Method from Gray-level Histograms," Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
[78] T.-W. Hui, X. Tang, and C. C. Loy, "A Lightweight Optical Flow CNN - Revisiting Data Fidelity and Regularization." IEEE, 2020. [Online]. Available: http://mmlab.ie.cuhk.edu.hk/projects/LiteFlowNet/
[79] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, "g2o: A General Framework for Graph Optimization," in International Conference on Robotics and Automation (ICRA). IEEE, 2011, pp. 3607–3613.
[80] H. Strasdat, A. J. Davison, J. M. Montiel, and K. Konolige, "Double Window Optimisation for Constant Time Visual SLAM," in International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2352–2359.
[81] V. Ila, J. M. Porta, and J. Andrade-Cetto, "Information-based Compact Pose SLAM," Transactions on Robotics (T-RO), vol. 26, no. 1, pp. 78–93, 2010.
