VDO-SLAM: A Visual Dynamic Object-aware SLAM System
Jun Zhang, Mina Henein and Robert Mahony are with the Australian National University (ANU), 0020 Canberra, Australia. {jun.zhang2,mina.henein,robert.mahony}@anu.edu.au
Viorela Ila is with the University of Sydney (USyd), 2006 Sydney, Australia. viorela.ila@sydney.edu.au
[co]: The two authors contributed equally to this work.
∗ https://github.com/halajun/vdo_slam

world environments. Creating these maps is achieved by fusing multiple sensor measurements into a consistent representation using estimation techniques such as Simultaneous Localisation And Mapping (SLAM). SLAM is a mature research topic that has already revolutionised a wide range of applications, from mobile robotics, inspection, entertainment and film production to exploration and monitoring of natural environments, amongst many others. However, most of the existing solutions to SLAM rely heavily on the assumption that the environment is predominantly static.

The conventional techniques to deal with dynamics in SLAM are to either treat any sensor data associated with moving objects as outliers and remove them from the estimation process ([1]–[5]), or to detect moving objects and track them separately using traditional multi-target tracking approaches ([6]–[9]). The former technique excludes information about dynamic objects in the scene and generates static-only maps. The accuracy of the latter depends on the camera pose estimation, which is more susceptible to failure in complex dynamic environments. The increased presence of autonomous systems in dynamic environments is driving the community to challenge the static-world assumption that underpins most existing open-source SLAM algorithms. In this paper, we redefine the term "mapping" in SLAM to be concerned with a spatiotemporal representation of the world, as opposed to the concept of a static map that has long been the emphasis of classical SLAM algorithms. Our approach focuses on accurately estimating the motion of all dynamic entities in the environment, including the robot and other moving objects in the scene, this information being highly relevant in the context of robot path planning and navigation in dynamic environments.

Existing scene motion estimation techniques mainly rely on optical flow estimation ([10]–[13]) and scene flow estimation ([14]–[17]). Optical flow records the scene motion by estimating the velocities associated with the movement
of brightness patterns on an image plane. Scene flow, on the other hand, describes the 3D motion field of a scene observed at different instants of time. These techniques only estimate the linear translation of individual pixels or 3D points in the scene, and do not exploit the collective behaviour of points on rigid objects, failing to describe the full SE(3) motion of objects in the scene. In this paper we exploit this collective behaviour of points on individual objects to obtain accurate and robust motion estimation of the objects in the scene, while simultaneously localising the robot and mapping the environment.

A typical SLAM system consists of a front-end module, which processes the raw data from the sensors, and a back-end module, which integrates the obtained information (raw and higher-level information) into a probabilistic estimation framework. Simple primitives such as 3D locations of salient features are commonly used to represent the environment. This is largely a consequence of the fact that points are easy to detect, track and integrate within the SLAM estimation problem.

Feature tracking has become more reliable and robust with the advances in deep learning, which provide algorithms that can reliably estimate the 2D optical flow associated with the apparent motion of every pixel of an image in a dense manner, a task that is particularly important for data association and that has otherwise been challenging in dynamic environments using classical feature tracking methods.

Other primitives such as lines and planes ([18]–[21]) or even objects ([22]–[24]) have been considered in order to provide richer map representations. To incorporate such information in existing geometric SLAM algorithms, either a dataset of 3D models of every object in the scene must be available a priori ([23], [25]), or the front-end must explicitly provide object pose information in addition to detection and segmentation ([26]–[28]), adding a layer of complexity to the problem. The requirement for accurate 3D models severely limits the potential domains of application, while, to the best of our knowledge, multiple object tracking and 3D pose estimation remain a challenge for learning techniques. There is a clear need for an algorithm that can exploit the powerful detection and segmentation capabilities of modern deep learning algorithms ([29], [30]) without relying on additional pose estimation or object model priors; an algorithm that operates at feature level with the awareness of an object concept.

While the problems of SLAM and object motion tracking/estimation have long been studied in isolation in the literature, recent approaches try to solve the two problems in a unified framework ([31], [32]). However, they both focus on the SLAM back-end instead of a full system, resulting in severely limited performance in real-world scenarios. In this paper, we carefully integrate our previous works ([31], [33]) and propose VDO-SLAM, a novel feature-based stereo/RGB-D dynamic SLAM system that leverages image-based semantic information to simultaneously localise the robot, map the static and dynamic structure, and track the motions of rigid objects in the scene. In contrast to [31], we rely on a denser object feature representation to ensure robust tracking, and propose new factors to smooth the motion of rigid objects in urban driving scenarios. In contrast to [33], an improved robust feature and object tracking method is proposed, with the ability to handle indirect occlusions resulting from the failure of semantic object segmentation. In summary, the contributions of this work are:
• a novel formulation to model dynamic scenes in a unified estimation framework over robot poses, static and dynamic 3D points, and object motions;
• accurate estimation of the SE(3) motion of dynamic objects that outperforms state-of-the-art algorithms, as well as a way to extract the objects' velocities in the scene;
• a robust method for tracking moving objects that exploits semantic information, with the ability to handle indirect occlusions resulting from the failure of semantic object segmentation;
• a demonstrable full system in complex and compelling real-world scenarios.

To the best of our knowledge, this is the first full dynamic SLAM system that is able to achieve motion segmentation and dynamic object tracking, estimate the camera poses along with the static and dynamic structure and the full SE(3) pose change of every rigid object in the scene, extract velocity information, and be demonstrable in real-world outdoor scenarios (see Fig. 1). We demonstrate the performance of our algorithm on real datasets and show the capability of the proposed system to resolve rigid object motion estimation and yield motion results that are comparable to the camera pose estimation in accuracy, and that outperform state-of-the-art algorithms by an order of magnitude in urban driving scenarios.

The remainder of this paper is structured as follows. In Section II we discuss the related work. In Sections III and IV we describe the proposed algorithm and system. We introduce the experimental setup, followed by the results and evaluations, in Section V. We summarise and offer concluding remarks in Section VI.

II. RELATED WORK

In the past two decades, the study of SLAM for dynamic environments has become more and more popular in the community, with a considerable number of algorithms being proposed to solve the dynamic SLAM problem. Motivated by the different goals to achieve, solutions in the literature can be mainly divided into three categories.

The first category aims at robust SLAM in dynamic environments. Early methods in this category ([2], [34], [35]) normally detect and remove the information drawn from the dynamic foreground, which is seen as degrading the SLAM performance. More recent methods on this track tend to go further by not just removing the dynamic foreground, but also inpainting or reconstructing the static background that is occluded by moving targets. [5] present DynaSLAM, which combines classic geometry and deep learning-based models to detect and remove dynamic objects, and then inpaints the occluded background with multi-view information of the scene. Similarly, a Light Field SLAM front-end is proposed by [36] to reconstruct the occluded static scene via Synthetic Aperture Imaging (SAI) techniques. Differently from [5], features on the reconstructed static background are also tracked and used
to achieve better SLAM performance. The above state-of-the-art solutions achieve robust and accurate estimation by discarding the dynamic information. However, we argue that this information has potential benefits for SLAM if it is properly modelled. Furthermore, understanding dynamic scenes in addition to SLAM is crucial for many other robotics tasks such as planning, control and obstacle avoidance, to name a few.

Approaches of the second category perform SLAM and Moving Object Tracking (MOT) separately, as an extension to conventional SLAM for dynamic scene understanding ([9], [37]–[39]). [37] developed a theory for performing SLAM with Moving Object Tracking (SLAMMOT). In the latest version of their SLAM with detection and tracking of moving objects, the estimation problem is decomposed into two separate estimators (moving and stationary objects) to make it feasible to update both filters in real time. [9] tackle the SLAM problem with dynamic objects by solving the problems of Structure from Motion (SfM) and tracking of moving objects in parallel, and by unifying the output of the system into a 3D dynamic map containing the static structure and the trajectories of moving objects. Later, in [38], the authors propose to integrate semantic constraints to further improve the 3D reconstruction. The more recent work [39] presents a stereo-based dense mapping algorithm in a SLAM framework, with the advantage of accurately and efficiently reconstructing both the static background and the moving objects in large-scale dynamic environments. The algorithms listed above have proven that combining multiple object tracking with SLAM is doable and applicable for dynamic scene exploration. Taking a step further, by properly exploiting and establishing the spatial and temporal relationships between the robot, the static background, and the stationary and dynamic objects, we show in this paper that the problems of SLAM and multi-object tracking are mutually beneficial.

The last and most active category is object SLAM, which usually includes both static and dynamic objects. Algorithms in this class normally require a specific modelling and representation of 3D objects, such as a 3D shape ([40]–[42]), surfel [43] or volumetric [44] model, or a geometric model such as an ellipsoid ([45], [46]) or a 3D bounding box ([24], [47]–[49]), in order to extract high-level primitives (e.g., object poses) and integrate them into a SLAM framework. [40] is one of the earliest works to introduce an object-oriented SLAM paradigm, which represents a cluttered scene at the object level and constructs an explicit graph between camera and object poses to achieve joint pose-graph optimisation. Later, [41] propose a novel 3D object recognition algorithm to ensure the system robustness and improve the accuracy of the estimated object pose. The high-level scene representation enables real-time 3D recognition and significant compression of map storage for SLAM. Nevertheless, a database of pre-scanned or pre-trained object models has to be created in advance. To avoid a prebuilt database, representing objects with surfel or voxel elements in a dense manner has started to gain popularity, along with RGB-D cameras becoming widely used. [43] present MaskFusion, which adopts a surfel representation to model, track and reconstruct objects in the scene, while [44] apply an octree-based volumetric model to objects and build a multi-object dynamic SLAM system. Both methods succeed in exploiting object information in a dense RGB-D SLAM framework, without prior knowledge of object models. Their main interest, however, is the 3D object segmentation and the consistent fusion of the dense map, rather than the estimation of the motion of the objects.

Lately, the use of basic geometric models to represent objects has become a popular solution due to their lower complexity and easy integration into a SLAM framework. In QuadricSLAM [46], detected objects are represented as ellipsoids to compactly parametrise the size and 3D pose of an object. In this way, the quadric parameters are directly constrained through a geometric error and formulated together with the camera poses in a factor graph SLAM for joint estimation. [24] propose to combine 2D and 3D object detection with SLAM for both static and dynamic environments. Objects are represented as high-quality cuboids and optimized together with points and cameras through multi-view bundle adjustment. While both methods prove the mutual benefit between object detection and SLAM, their main focus is on object detection and SLAM primarily for static scenarios. In this paper, we take this direction further to tackle the challenging problem of dynamic object tracking within a SLAM framework, and exploit the relationships between moving objects, the agent robot, and the static and dynamic structure for potential advantages.

Apart from the dynamic SLAM categories, the literature on 6-DoF object motion estimation is also crucial for the dynamic SLAM problem. Quite a few methods have been proposed in the literature to estimate the SE(3) motion of objects in a visual odometry or SLAM framework ([50]–[52]). [50] present a model-free method for detecting and tracking moving objects in 3D LiDAR scans. The method sequentially estimates motion models using RANSAC [53], then segments and tracks multiple objects based on the models with a proposed Bayesian approach. In [51], the authors address the problem of simultaneous estimation of ego and third-party SE(3) motions in complex dynamic scenes using cameras. They apply multi-model fitting techniques within a visual odometry pipeline and estimate all rigid motions within a scene. In later work, [52] present ClusterVO, which is able to perform online processing of multiple motion estimations. To achieve this, a multi-level probabilistic association mechanism is proposed to efficiently track features and detections, then a heterogeneous Conditional Random Field (CRF) clustering approach is applied to jointly infer cluster segmentations, with a sliding-window optimization over the clusters in the end. While the above methods represent an important step forward for the Multi-motion Visual Odometry (MVO) task, the study of spatial and temporal relationships is not fully explored, though arguably important. Therefore, by carefully considering the pros and cons in the literature on SLAM+MOT, object SLAM and MVO, this paper proposes a visual dynamic object-aware SLAM system that is able to achieve robust ego and object motion tracking, as well as consistent static and dynamic mapping, in a novel SLAM formulation.

III. METHODOLOGY

Before discussing the details of the proposed system pipeline, shown in Fig. 4, this section covers the mathematical details
of the core components in the system. Variables and notations are first introduced, including the novel way of modelling the motion of a rigid object in a model-free manner. Then we show how the camera pose and object motion are estimated in the tracking component of the system. Finally, a factor graph optimisation is proposed and applied in the mapping component to refine the camera poses and object motions, and to build a globally consistent map including static and dynamic structure.

A. Background and Notation

1) Coordinate Frames: Let ${}^0\mathbf{X}_k, {}^0\mathbf{L}_k \in SE(3)$ be the robot/camera and the object 3D pose respectively, at time $k$ in a global reference frame $0$, with $k \in \mathcal{T}$ the set of time steps. Note that calligraphic capital letters are used in our notation to represent sets of indices. Fig. 2 shows these pose transformations as solid curves.

Fig. 2: Notation and coordinate frames. Solid curves represent camera and object poses in the inertial frame, ${}^0\mathbf{X}$ and ${}^0\mathbf{L}$ respectively, and dashed curves their respective motions in the body-fixed frame. Solid lines represent 3D points in the inertial frame, and dashed lines represent 3D points in camera frames.

2) Points: Let ${}^0\mathbf{m}^i_k$ be the homogeneous coordinates of the $i$-th 3D point at time $k$, with ${}^0\mathbf{m}^i = [m^i_x,\, m^i_y,\, m^i_z,\, 1]^\top \in \mathrm{IE}^3$ and

(5) is crucially important as it relates the same 3D point on a rigid object in motion at consecutive time steps by a homogeneous transformation ${}^0_{k-1}\mathbf{H}_k := {}^0\mathbf{L}_{k-1}\,{}^{L_{k-1}}_{k-1}\mathbf{H}_k\,{}^0\mathbf{L}^{-1}_{k-1}$. This equation represents a frame change of a pose transformation [54], and shows how the body-fixed frame pose change ${}^{L_{k-1}}_{k-1}\mathbf{H}_k$ relates to the global reference frame pose change ${}^0_{k-1}\mathbf{H}_k$. The point motion in the global reference frame is then expressed as:

$${}^0\mathbf{m}^i_k = {}^0_{k-1}\mathbf{H}_k\,{}^0\mathbf{m}^i_{k-1}. \qquad (6)$$

Equation (6) is at the core of our motion estimation approach, as it expresses the rigid object pose change in terms of the points that reside on the object, in a model-free manner, without the need to include the object 3D pose as a random variable in the estimation. Section III-B2 details how this rigid object pose change is estimated based on the above equation. Here ${}^0_{k-1}\mathbf{H}_k \in SE(3)$ represents the object point motion in the global reference frame; for the remainder of this document, we refer to this quantity as the object pose change, or the object motion.
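For illustration, the minimal NumPy sketch below moves a point on a rigid object with a global-frame pose change ${}^0_{k-1}\mathbf{H}_k$ as in (6), and checks the frame-change relation between the global-frame and body-fixed pose changes. All numerical values and variable names are illustrative assumptions; this is not the system's C++ implementation.

    import numpy as np

    def se3(R, t):
        """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
        T = np.eye(4)
        T[:3, :3] = R
        T[:3, 3] = t
        return T

    # Hypothetical example values (not taken from the paper).
    theta = np.deg2rad(5.0)
    Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])

    L_km1  = se3(np.eye(3), np.array([10.0, 2.0, 0.0]))   # object pose ^0L_{k-1}
    H_body = se3(Rz,        np.array([1.0, 0.0, 0.0]))    # body-fixed pose change ^{L_{k-1}}_{k-1}H_k

    # Frame change of a pose transformation: ^0_{k-1}H_k = ^0L_{k-1} ^{L_{k-1}}_{k-1}H_k ^0L_{k-1}^{-1}
    H_world = L_km1 @ H_body @ np.linalg.inv(L_km1)

    # Equation (6): a point on the object moves with the global-frame pose change.
    m_km1 = np.array([11.0, 2.5, 0.0, 1.0])               # homogeneous point ^0m^i_{k-1}
    m_k   = H_world @ m_km1                               # ^0m^i_k

    # Consistency check: moving the point in the body frame and mapping back gives the same result.
    m_k_check = L_km1 @ H_body @ (np.linalg.inv(L_km1) @ m_km1)
    assert np.allclose(m_k, m_k_check)

The sketch highlights the main design choice of the formulation: only the pose change acting on points is needed, so no object pose variable has to be instantiated in the estimator.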
Although this system was initially designed to be an RGB-D system, as an attempt to fully exploit image-based semantic information, we apply single-image depth estimation to obtain depth information from a monocular camera. Our "learning-based monocular" system is monocular in the sense that only RGB images are used as input to the system; however, the estimation problem is formulated using RGB-D data, where the depth is obtained using single-image depth estimation.

A. Pre-processing

There are two challenging aspects that this module needs to fulfil: first, to robustly separate the static background from the objects, and secondly, to ensure long-term tracking of dynamic objects. To achieve this, we leverage recent advances in computer vision techniques for instance-level semantic segmentation and dense optical flow estimation, in order to ensure efficient object motion segmentation and robust object tracking.

1) Object Instance Segmentation: Instance-level semantic segmentation is used to segment and identify potentially movable objects in the scene. Semantic information constitutes an important prior in the process of separating static and moving object points, e.g., buildings and roads are always static, but cars can be static or dynamic. Instance segmentation helps to further divide the semantic foreground into different instance masks, which makes it easier to track each individual object. Moreover, segmentation masks provide a "precise" boundary of the object body that ensures robust tracking of points on the object.

2) Optical Flow Estimation: Dense optical flow is used to maximise the number of tracked points on moving objects. Most moving objects occupy only a small portion of the image; therefore, sparse feature matching does not guarantee robust or long-term feature tracking. Our approach makes use of dense optical flow to considerably increase the number of object points by sampling from all the points within the semantic mask. Dense optical flow is also used to consistently track multiple objects by propagating a unique object identifier assigned to every point on an object mask. Moreover, it allows us to recover object masks if semantic segmentation fails, a task that is extremely difficult to achieve using sparse feature matching.

B. Tracking

The tracking component includes two modules: the camera ego-motion tracking, with sub-modules for feature detection and camera pose estimation, and the object motion tracking, with sub-modules for dynamic object tracking and object motion estimation.

1) Feature Detection: To achieve fast camera pose estimation, we detect a sparse set of corner features and track them with optical flow. At each frame, only inlier feature points that fit the estimated camera motion are saved into the map and used to track correspondences in the next frame. New features are detected and added if the number of inlier tracks falls below a certain level (1200 by default). These sparse features are detected on the static background, i.e., image regions excluding the segmented objects.

2) Camera Pose Estimation: The camera pose is computed using (13) for all detected 3D-2D static point correspondences. To ensure robust estimation, a motion model generation method is applied for initialisation. Specifically, the method generates two models and compares their inlier numbers based on the re-projection error. One model is generated by propagating the previous camera motion, while the other is obtained by computing a new motion transform using the P3P [63] algorithm with RANSAC. The motion model that generates the most inliers is then selected for initialisation.

3) Dynamic Object Tracking: The process of object motion tracking consists of two steps. In the first step, segmented objects are classified into static and dynamic. Then we associate the dynamic objects across pairs of consecutive frames.

• Instance-level object segmentation allows us to separate objects from the background. Although the algorithm is capable of estimating the motions of all the segmented objects, dynamic object identification helps reduce the computational cost of the proposed system. This is done based on scene flow estimation. Specifically, after obtaining the camera pose ${}^0\mathbf{X}_k$, the scene flow vector $\mathbf{f}^i_k$ describing the motion of a 3D point ${}^0\mathbf{m}^i$ between frames $k-1$ and $k$ can be calculated as in [64]:

$$\mathbf{f}^i_k = {}^0\mathbf{m}^i_{k-1} - {}^0\mathbf{m}^i_k = {}^0\mathbf{m}^i_{k-1} - {}^0\mathbf{X}_k\,{}^{X_k}\mathbf{m}^i_k. \qquad (22)$$

Unlike optical flow, scene flow, which is ideally caused only by scene motion, can directly indicate whether some structure is moving or not. Ideally, the magnitude of the scene flow vector should be zero for all static 3D points. However, noise or error in depth and matching complicates the situation in real scenarios. To robustly handle this, we compute the scene flow magnitude of all the sampled points on each object. If the magnitude of the scene flow of a certain point is greater than a predefined threshold, the point is considered dynamic. This threshold was set to 0.12 in all experiments carried out in this work. An object is then recognised as dynamic if the proportion of "dynamic" points is above a certain level (30% of the total number of points), and static otherwise. The thresholds used to identify whether an object is dynamic were deliberately chosen to be conservative, as the system can afford to model a static object as dynamic and estimate a zero motion at every time step, whereas the opposite would degrade the system's performance.
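A minimal sketch of the scene-flow test just described, using NumPy and the thresholds reported above (0.12 for the per-point magnitude, 30% for the per-object proportion). The array layout and function names are illustrative assumptions, not the system's implementation.

    import numpy as np

    FLOW_THRESH = 0.12   # per-point scene flow magnitude threshold (metres), as in the paper
    DYN_RATIO   = 0.30   # proportion of "dynamic" points above which an object is declared dynamic

    def scene_flow(m_world_prev, m_cam_curr, X_k):
        """Equation (22): f = ^0m_{k-1} - ^0X_k ^{X_k}m_k for N corresponding points.

        m_world_prev : (N, 4) homogeneous points at k-1 in the global frame.
        m_cam_curr   : (N, 4) homogeneous points at k expressed in the camera frame.
        X_k          : (4, 4) camera pose ^0X_k at time k.
        """
        m_world_curr = (X_k @ m_cam_curr.T).T
        return m_world_prev[:, :3] - m_world_curr[:, :3]

    def is_dynamic(m_world_prev, m_cam_curr, X_k):
        """Classify an object as dynamic if enough of its sampled points move."""
        f = scene_flow(m_world_prev, m_cam_curr, X_k)
        moving = np.linalg.norm(f, axis=1) > FLOW_THRESH
        return moving.mean() > DYN_RATIO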
• Instance-level object segmentation only provides single-image object labels. Objects then need to be tracked across frames and their motion models propagated over time. We propose to use optical flow to associate point labels across frames. A point label is the same as the unique object identifier of the object on which the point was sampled. We maintain a finite tracking label set $\mathcal{L} \subset \mathbb{N}$, where $l \in \mathcal{L}$ starts from $l = 1$ for the first detected moving object in the scene. The number of elements in $\mathcal{L}$ increases as more moving objects are detected. Static objects and background are labelled with $l = 0$.

Ideally, for each detected object in frame $k$, the labels of all its points should be uniquely aligned with the labels of their correspondences in frame $k-1$. However, in practice this is affected by noise, image boundaries and occlusions. To overcome this, we assign all the points the label that appears most often among their correspondences. For a dynamic object, if the most frequent label in the previous frame is 0, it means that the object starts to move, appears in the scene at the boundary, or reappears from occlusion. In this case, the object is assigned a new tracking label.

4) Object Motion Estimation: As mentioned above, objects normally occupy only small portions of the scene, which makes it hard to obtain sufficient sparse features to track and estimate their motions robustly. We sample every third point within an object mask and track them across frames. Similar to the camera pose estimation, only inlier points are saved into the map and used for tracking in the next frame. When the number of tracked object points decreases below a certain level, new object points are sampled and added. We follow the same method as discussed in Section IV-B2 to generate an initial object motion model.

Fig. 4: Overview of our VDO-SLAM system. Input images are first pre-processed to generate instance-level object segmentation and dense optical flow. These are then used to track features on static background structure and dynamic objects. Camera poses and object motions estimated from feature tracks are then refined in a global batch optimisation, and a local map is maintained and updated with every new frame. The system outputs camera poses, static structure, tracks of dynamic objects, and estimates of their pose changes over time.

C. Mapping

In the mapping component, a global map is constructed and maintained. Meanwhile, a local map is extracted from the global map, based on the current time step and a window of previous time steps. Both maps are updated via a batch optimisation process.

1) Local Batch Optimisation: We maintain and update a local map. The goal of the local batch optimisation is to ensure that accurate camera pose estimates are provided to the global batch optimisation. The camera pose estimation has a big influence on the accuracy of the object motion estimation and on the overall performance of the algorithm. The local map is built using a fixed-size sliding window containing the information of the last $n_w$ frames, where $n_w$ is the window size and is set to 20 in this paper. Local maps share some common information; this defines the overlap between the different windows. We choose to only locally optimise the camera poses and static structure within the window, as locally optimising the dynamic structure does not bring any benefit to the optimisation unless a hard constraint (e.g., a constant object motion) is assumed within the window. However, the system is able to incorporate static and dynamic structure in the local mapping if needed. When a local map is constructed, a factor graph optimisation is similarly performed to refine all the variables within the local map, and the results are then updated back into the global map.

2) Global Batch Optimisation: The output of the tracking component and the local batch optimisation consists of the camera poses, the object motions and the inlier structure. These are saved in a global map that is constructed over all the previous time steps and is continually updated with every new frame. A factor graph is constructed based on the global map after all input frames have been processed. To effectively explore the temporal constraints, only points that have been tracked for more than 3 instances are added into the factor graph. The graph is formulated as an optimisation problem as described in Section III-C. The optimisation results serve as the output of the whole system.

3) From Mapping to Tracking: Maintaining the map provides history information for the estimation of the current state in the tracking module, as shown in Fig. 4 with blue arrows going from the global map to multiple components in the tracking module of the system. Inlier points from the last frame are leveraged to track correspondences in the current frame and estimate the camera pose and object motions. The last camera and object motions also serve as possible prior models to initialise the current estimation, as described in Sections IV-B2 and IV-B4. Furthermore, object points help associate semantic masks across frames to ensure robust tracking of objects, by propagating their previously segmented masks in case of "indirect occlusion" resulting from the failure of semantic object segmentation.
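As a concrete illustration of the association step described in Section IV-B3 above, the sketch below assigns a tracking label to a detected object by majority vote over the labels of its points' correspondences, and issues a new label when the majority label is 0. The interface (a list of correspondence labels per object) is an assumption for illustration only.

    import numpy as np

    def assign_tracking_label(corr_prev_labels, next_free_label):
        """Majority-vote label assignment for one detected object in frame k.

        corr_prev_labels : integer tracking labels (0 = static/background) of the frame k-1
                           correspondences of the points sampled on the object.
        next_free_label  : first unused tracking label in the label set L.
        Returns (assigned_label, updated_next_free_label).
        """
        votes = np.asarray(corr_prev_labels, dtype=int)
        majority = int(np.bincount(votes).argmax()) if votes.size else 0
        if majority == 0:
            # The object starts to move, enters at the image boundary, or reappears
            # from occlusion: give it a new tracking label.
            return next_free_label, next_free_label + 1
        return majority, next_free_label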
V. EXPERIMENTS

We evaluate VDO-SLAM in terms of camera motion, object motion and velocity, as well as object tracking performance. The evaluation is done on the Oxford Multimotion Dataset [65] for indoor, and the KITTI Tracking dataset [66] for outdoor scenarios, with comparison to other state-of-the-art methods, including MVO [51], ClusterVO [52], DynaSLAM II [49] and CubeSLAM [24]. Due to the non-deterministic nature of running the proposed system, such as the RANSAC processing, we run each sequence 5 times and take the median values as the reported results. All results are obtained by running the proposed system with the default parameter setup. Our open-source implementation includes the demo YAML files and instructions to run the system on both datasets.

A. Deep Model Setup

We adopt a learning-based instance-level object segmentation method, Mask R-CNN [67], to generate object segmentation masks. The model is trained on the COCO dataset [68] and is used directly in this work without any fine-tuning. For dense optical flow, we leverage a state-of-the-art method, PWC-Net [12]. The model is trained on the FlyingChairs dataset [69], and then fine-tuned on the Sintel [70] and KITTI training datasets [71]. To generate depth maps for a "monocular" version of our proposed system, we apply a learning-based monocular depth estimation method, MonoDepth2 [72]. The model is trained on the Depth Eigen split [73], excluding the data tested in this paper. Feature detection is done using FAST [74] as implemented in [75]. All the above methods are applied using their default parameters.

B. Error Metrics

We use a pose change error metric to evaluate the estimated SE(3) motion, i.e., given a ground truth motion transform $\mathbf{T}$ and a corresponding estimated motion $\hat{\mathbf{T}}$, where $\mathbf{T} \in SE(3)$ could be either a camera relative pose or an object motion, the pose change error is computed as $\mathbf{E} = \hat{\mathbf{T}}^{-1}\mathbf{T}$. This is similar to the Relative Pose Error [76], except that we set the time interval $\Delta = 1$ (per frame), because the trajectories of different objects in a sequence vary from each other and are normally much shorter than the camera trajectory. The translational error $E_t$ (metres) is computed as the $L_2$ norm of the translational component of $\mathbf{E}$. The rotational error $E_r$ (degrees) is calculated as the angle of rotation in an axis-angle representation of the rotational component of $\mathbf{E}$. Over the different camera time steps and the different objects in a sequence, we compute the root mean squared error (RMSE) for camera poses and object motions, respectively. The object pose change in the body-fixed frame is obtained by transforming the pose change ${}^0_{k-1}\mathbf{H}_k$ in the inertial frame into the body frame using the ground-truth object pose:

$${}^{L_{k-1}}_{k-1}\mathbf{H}_k = {}^0\mathbf{L}^{-1}_{k-1}\,{}^0_{k-1}\mathbf{H}_k\,{}^0\mathbf{L}_{k-1}. \qquad (23)$$

We also evaluate the object speed error. The linear velocity of a point on the object, expressed in the inertial frame, can be estimated by applying the pose change ${}^0_{k-1}\mathbf{H}_k$ and taking the difference

$$\mathbf{v} \approx {}^0\mathbf{m}^i_k - {}^0\mathbf{m}^i_{k-1} = \left({}^0_{k-1}\mathbf{H}_k - \mathbf{I}_4\right){}^0\mathbf{m}^i_{k-1}.$$

Then the speed error $E_s$ between the estimated velocity $\hat{\mathbf{v}}$ and the ground truth velocity $\mathbf{v}$ can be calculated as $E_s = |\hat{\mathbf{v}}| - |\mathbf{v}|$.

C. Oxford Multimotion Dataset

The recent Oxford Multimotion Dataset [65] contains sequences from a moving stereo or RGB-D camera sensor observing multiple swinging boxes or toy cars in an indoor scenario. Ground truth trajectories of the camera and moving objects are obtained via a Vicon motion capture system. We only choose the swinging boxes sequence (500 frames) for evaluation, since results on real driving scenarios are evaluated on the KITTI dataset. Note that the trained model for instance segmentation cannot be applied to this dataset directly, since the training data (COCO) does not contain the class of square box. Instead, we use Otsu's method [77], together with colour information and multi-label processing, to segment the boxes, which works very well for the simple setup of this dataset (coloured boxes that are highly distinguishable from the background). Table I shows results compared to the state-of-the-art MVO [51] and ClusterVO [52], with data provided by the respective authors. As they are both visual odometry systems without global refinement, we switch off the batch optimisation module in our system and generate our results accordingly for a fair comparison. We use the error metrics described in Section V-B.

Compared to MVO, our proposed method achieves better accuracy in the estimation of the camera pose (35%) and of the motion of the swinging boxes, top-left (15%) and bottom-left (40%). We obtain slightly higher errors when spinning rotational motion of the object is observed, in particular for the top-right swinging and rotating box (in translation only) and the bottom-right rotating box. We believe that this is due to using an optical flow algorithm that is not well optimised for self-rotating objects. The consequence of this is poor estimation of the point motion and a consequent degradation of the overall object tracking performance. Even with the associated performance loss for rotating objects, the benefits of dense optical flow motion estimation are clear in the other metrics. Our method performs slightly worse than ClusterVO in the estimate of the camera pose and in the translation of the bottom-right rotating box. Other than that, we achieve more than twofold improvements over ClusterVO in the estimation of object motions.

An illustrative result of the trajectory output of our algorithm on the Oxford Multimotion Dataset is shown in Fig. 5. Tracks of dynamic features on the swinging boxes visually correspond to the actual motion of the boxes. This can be clearly seen in the swinging motion of the bottom-left box, shown in purple in Fig. 5.
TABLE I: Comparison versus MVO [51] and ClusterVO [52] for camera pose and object motion estimation accuracy on the swinging_4_unconstrained sequence of the Oxford Multimotion dataset. Bold numbers indicate the better results.

                                        | VDO-SLAM          | MVO               | ClusterVO
                                        | Er (deg)  Et (m)  | Er (deg)  Et (m)  | Er (deg)  Et (m)
Camera                                  | 0.7709    0.0112  | 1.1948    0.0314  | 0.7665    0.0066
Top-left Swinging Box                   | 1.1889    0.0207  | 1.4553    0.0288  | 3.2537    0.0673
Top-right Swinging and Rotating Box     | 0.7631    0.0132  | 0.8992    0.0130  | 3.5308    0.0256
Bottom-left Swinging Box                | 0.9153    0.0149  | 1.4949    0.0261  | 4.9146    0.0763
Bottom-right Rotating Box               | 0.8469    0.0192  | 0.7815    0.0115  | 4.0675    0.0144
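The error metrics of Section V-B can be written compactly. The sketch below (NumPy/SciPy) is an illustrative re-implementation rather than the paper's evaluation code: it computes $\mathbf{E} = \hat{\mathbf{T}}^{-1}\mathbf{T}$, the translational error $E_t$, the rotational error $E_r$ from the axis-angle representation, the RMSE over a sequence, and the body-fixed object pose change of (23).

    import numpy as np
    from scipy.spatial.transform import Rotation

    def pose_change_error(T_est, T_gt):
        """E = T_est^{-1} T_gt; returns (E_t in metres, E_r in degrees)."""
        E = np.linalg.inv(T_est) @ T_gt
        e_t = np.linalg.norm(E[:3, 3])
        e_r = np.degrees(np.linalg.norm(Rotation.from_matrix(E[:3, :3]).as_rotvec()))
        return e_t, e_r

    def rmse(errors):
        """Root mean squared error over per-frame (or per-object) errors."""
        errors = np.asarray(errors, dtype=float)
        return np.sqrt(np.mean(errors ** 2))

    def body_fixed_motion(H_world, L_gt_prev):
        """Equation (23): map the global-frame object motion into the body frame
        using the ground-truth object pose ^0L_{k-1}."""
        return np.linalg.inv(L_gt_prev) @ H_world @ L_gt_prev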
Fig. 6: Accuracy of object motion estimation of our method compared to CubeSLAM ([24]). The colour bars refer to the translation error, which corresponds to the left Y-axis in log scale. The circles refer to the rotation error, which corresponds to the right Y-axis in linear scale.

1) Camera Pose and Object Motion: Table II demonstrates results of both camera pose and object motion estimation on nine sequences, compared to DynaSLAM II [49] and CubeSLAM [24]. Results of DynaSLAM II are obtained directly from their paper, where only the evaluation of the camera pose is available. We initially tried to evaluate CubeSLAM ourselves with the default provided parameters; however, the errors were much higher, and hence we only report results on the five sequences provided by the authors of CubeSLAM after some correspondence. As CubeSLAM is designed for a monocular camera, we

2) Object Tracking and Velocity: We also demonstrate the performance of tracking dynamic objects, and show results of object speed estimation, which is important information for autonomous driving applications. Fig. 7 illustrates results of object tracking length and object speed for some selected objects (tracked for over 20 frames) in all the tested sequences. Our system is able to track most objects for more than 80% of their occurrence in the sequence. Moreover, our estimated object speeds are consistently close to the ground truth.

3) Qualitative Results: Fig. 8 illustrates the output of our system for three of the KITTI sequences. The proposed system is able to output the camera poses, along with the static structure and the dynamic tracks of every detected moving object in the scene, in a spatiotemporal map representation.
TABLE II: Comparison versus DynaSLAM II [49] and CubeSLAM [24] for camera pose and object motion estimation accuracy on nine sequences with moving objects drawn from the KITTI dataset. Bold numbers indicate the better result.

       DynaSLAM II         VDO-SLAM (RGB-D)                        VDO-SLAM (Monocular)                    CubeSLAM
       Camera              Camera              Object              Camera              Object              Camera              Object
Seq    Er (deg)  Et (m)    Er (deg)  Et (m)    Er (deg)  Et (m)    Er (deg)  Et (m)    Er (deg)  Et (m)    Er (deg)  Et (m)    Er (deg)  Et (m)
00     0.06      0.04      0.0741    0.0674    1.0520    0.1077    0.1830    0.1847    2.0021    0.3827    -         -         -         -
01     0.04      0.05      0.0382    0.1220    0.9051    0.1573    0.1772    0.4982    1.1833    0.3589    -         -         -         -
02     0.02      0.04      0.0182    0.0445    1.2359    0.2801    0.0496    0.0963    1.6833    0.4121    -         -         -         -
03     0.04      0.06      0.0311    0.0816    0.2919    0.0965    0.1065    0.1505    0.4570    0.2032    0.0498    0.0929    3.6085    4.5947
04     0.06      0.07      0.0482    0.1114    0.8288    0.1937    0.1741    0.4951    3.1156    0.5310    0.0708    0.1159    5.5803    32.5379
05     0.03      0.06      0.0219    0.0932    0.3705    0.1140    0.0506    0.1368    0.6464    0.2669    0.0342    0.0696    3.2610    6.4851
06     0.04      0.02      0.0488    0.0186    1.0803    0.1158    0.0671    0.0451    2.0977    0.2394    -         -         -         -
18     0.02      0.05      0.0211    0.0749    0.2453    0.0825    0.1236    0.3551    0.5559    0.2774    0.0433    0.0510    3.1876    3.7948
20     0.04      0.07      0.0271    0.1662    0.3663    0.0824    0.3029    1.3821    1.1081    0.3693    0.1348    0.1888    3.4206    5.6986
Fig. 7: Tracking performance and speed estimation. Results of object tracking length and object speed for some selected objects (only objects tracked for over 20 frames are shown, due to limited space). The colour bars represent the length of object tracks, which corresponds to the left Y-axis. The circles represent object speeds, which correspond to the right Y-axis. GT refers to ground truth, and EST. refers to estimated values. Objects shown (Sequence-Object ID): 00-1, 00-2, 01-1, 01-2, 02-1, 02-2, 02-3, 03-1, 03-2, 04-1, 05-1, 06-1, 06-2, 06-3, 06-4, 06-5, 06-6, 18-1, 18-2, 18-3, 20-1, 20-2, 20-3.

E. Discussion

Apart from the extensive evaluation in Sections V-D and V-C, we also provide detailed experimental results to demonstrate the effectiveness of key modules in our proposed system. Finally, the computational cost of the proposed system is discussed.

1) Robust Tracking of Points: The graph optimisation explores the spatial and temporal information to refine the camera poses and the object motions, as well as the static and dynamic structure. This process requires robust tracking of good points in terms of both quantity and quality. This was achieved by refining the estimated optical flow jointly with the motion estimation, as discussed in Section III-B3. We compare the baseline method that only optimises for the motion (Motion Only), using (9) for the camera motion or (11) for the object motion, against the improved method that optimises for both the motion and the optical flow (Joint), using (13) or (15). Table III demonstrates that the joint method obtains considerably more points that are tracked for long periods.

TABLE III: The number of points tracked for more than five frames on the nine sequences of the KITTI dataset. Bold numbers indicate the better results. Underlined bold numbers indicate an order of magnitude increase in number.

       Background                   Object
Seq    Motion Only    Joint         Motion Only    Joint
00     1798           12812         1704           7162
01     237            5075          907            4583
02     7642           10683         52             1442
03     778            12317         343            3354
04     9913           25861         339            2802
05     713            11627         2363           2977
06     7898           11048         482            5934
18     4271           22503         5614           14989
20     9838           49261         9282           13434

TABLE IV: Average camera pose and object motion errors over the nine sequences of the KITTI dataset. Bold numbers indicate the better results.

          Motion Only             Joint
          Er (deg)   Et (m)       Er (deg)   Et (m)
Camera    0.0412     0.0987       0.0365     0.0866
Object    1.0179     0.1853       0.7085     0.1367

Using the tracked points given by the joint estimation process leads to better estimation of both the camera pose and the object motion. As demonstrated in Table IV, an improvement of about 10% (camera) and 25% (object) in both translation and rotation errors was observed over the nine sequences of the KITTI dataset shown above.

2) Robustness against Non-direct Occlusion: The mask segmentation may fail in some cases, due to direct or indirect occlusions (illumination change, etc.). Thanks to the mask propagation method described in Section IV-C3, our proposed system is able to handle mask failure cases caused by indirect occlusions. Fig. 9 demonstrates an example of tracking a white van for 80 frames, where the mask segmentation fails in 33 frames. Despite the object segmentation failure, our system is still able to continuously track the van and estimate its speed, with an average error of 2.64 km/h across the whole sequence. Speed errors in the second half of the sequence are higher due to partial direct occlusions and the increased distance as the object moves farther away from the camera.
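A minimal sketch of one way the mask propagation of Section IV-C3 could be realised when instance segmentation drops an object: warp the previously segmented mask into the current frame with the dense optical flow. The interface (a boolean mask plus a forward flow field) is an illustrative assumption, not the system's code.

    import numpy as np

    def propagate_mask(prev_mask, flow):
        """Warp a boolean instance mask from frame k-1 into frame k using forward optical flow.

        prev_mask : (H, W) boolean mask of the object in frame k-1.
        flow      : (H, W, 2) forward optical flow (x, y) from frame k-1 to frame k.
        """
        H, W = prev_mask.shape
        ys, xs = np.nonzero(prev_mask)
        xt = np.round(xs + flow[ys, xs, 0]).astype(int)
        yt = np.round(ys + flow[ys, xs, 1]).astype(int)
        valid = (xt >= 0) & (xt < W) & (yt >= 0) & (yt < H)
        curr_mask = np.zeros_like(prev_mask)
        curr_mask[yt[valid], xt[valid]] = True
        return curr_mask

In practice the warped mask would be used only as a fallback for frames where the learned segmentation fails, so that object points can keep being sampled and tracked through the indirect occlusion.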
Fig. 8: Illustration of system output; a dynamic map with camera poses, static background structure, and tracks of dynamic objects. Sample results of VDO-SLAM on KITTI sequences. Black represents the static background, and each detected object is shown in a different colour. The top-left figure represents Seq.01 with a zoom-in on the intersection at the end of the sequence, the top-right figure represents Seq.06, and the bottom figure represents Seq.03.

3) Global Refinement on Object Motion: The initial object motion estimation (in the tracking component of the system) is independent between frames, since it is purely related to the sensor measurements. As illustrated in Fig. 10, the blue curve describes the initial object speed estimate of a wagon observed for 55 frames in sequence 03 of the KITTI tracking dataset. As seen in the figure, the speed estimation is not smooth, and large errors occur towards the second half of the sequence. This is mainly caused by the increased distance as the object moves farther away from the camera, with its structure occupying only a small portion of the scene. In this case, object motion estimation from the sensor measurements alone becomes challenging and error-prone. Therefore, we formulate a factor graph and refine the motions together with the static and dynamic structure, as discussed in Section III-C. The green curve in Fig. 10 shows the object speed results after the global refinement, which become smoother in the first half of the sequence and are significantly improved in the second half.

Fig. 11 demonstrates the average improvement for all objects in each sequence of the KITTI dataset. With graph optimization, the errors can be reduced by up to 39% in translation and 55% in rotation. Interestingly, the translation errors in Seq.18 and Seq.20 increase slightly. We believe this is because the vehicles keep alternating between acceleration and deceleration due to the heavy traffic jams in both sequences, which strongly violates the smooth motion constraint that is set for general cases.

4) Computational Analysis: Finally, we provide a computational analysis of our system. The experiments are carried out on an Intel Core i7 2.6 GHz laptop computer with 16 GB RAM. The object semantic segmentation and dense optical flow computation times depend on the GPU power and the CNN model complexity. Many current state-of-the-art algorithms can run in real time ([30], [78]). In this paper, the semantic segmentation and optical flow results are produced off-line as input to the system. The SLAM system is implemented in C++ on CPU, using a modified version of g2o as a back-end [79]. We show the computational time in Table V for both datasets. Overall, the tracking part of our proposed system is able to run at a frame rate of 5-8 fps depending on the number of detected moving objects, which can be improved by employing a parallel implementation. The runtime of the global batch optimisation strongly depends on the number of camera poses (number of frames) and objects (density, in terms of the number of dynamic objects observed per frame) present in the scene.
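The global refinement discussed above relies, among other terms, on factors that discourage abrupt changes of an object's motion between consecutive frames (the smooth motion constraint mentioned in the global refinement subsection). The paper's actual factor definitions (Section III-C) are not reproduced in this excerpt; the residual below is only one plausible, generic form of such a constant-motion prior, sketched with SciPy's rotation utilities.

    import numpy as np
    from scipy.spatial.transform import Rotation

    def smooth_motion_residual(H_prev, H_curr):
        """A generic constant-motion residual between consecutive object motions.

        H_prev, H_curr : (4, 4) object pose changes ^0_{k-2}H_{k-1} and ^0_{k-1}H_k.
        Returns a 6-vector (rotation part in radians, translation part in metres)
        that is zero when the object motion is perfectly constant.
        """
        D = np.linalg.inv(H_prev) @ H_curr
        r = Rotation.from_matrix(D[:3, :3]).as_rotvec()
        t = D[:3, 3]
        return np.concatenate([r, t])

Down-weighting or removing such a prior would be one way to handle sequences like Seq.18 and Seq.20, where stop-and-go traffic violates the constant-motion assumption.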
Fig. 11: Average improvement of object motion estimation with the global refinement, for each sequence of the KITTI dataset (values from the figure). Translation: Seq.00 0.27, Seq.01 0.27, Seq.02 0.11, Seq.03 0.39, Seq.04 0.10, Seq.05 0.16, Seq.06 0.02, Seq.18 -0.03, Seq.20 -0.04. Rotation: Seq.00 0.20, Seq.01 0.22, Seq.02 0.06, Seq.03 0.54, Seq.04 0.26, Seq.05 0.55, Seq.06 0.04, Seq.18 0.34, Seq.20 0.12.
Fig. 10: Global refinement effect on object speed estimation. The initial (blue) and refined (green) estimated speeds of a wagon in Seq.03, travelling along a straight road, compared to the ground truth speed (red). Note that the ground truth speed is slightly fluctuating; we believe this is due to the ground truth object poses being approximated from lidar scans.

VI. CONCLUSION

In this paper, we have presented VDO-SLAM, a novel dynamic feature-based SLAM system that exploits image-based semantic information in the scene, with no additional knowledge of the object pose or geometry, to achieve simultaneous localisation, mapping and tracking of dynamic objects. The system consistently shows robust and accurate results on indoor and challenging outdoor datasets, and achieves state-of-the-art performance in object motion estimation. We believe the high accuracy achieved in object motion estimation is due to the fact that our system is a feature-based system. Feature points remain the easiest primitives to detect, track and integrate within a SLAM system, and they require the front-end neither to have additional knowledge about the object model nor to explicitly provide any information about its pose.

An important issue to be reduced is the computational complexity of SLAM with dynamic objects. In long-term applications, different techniques can be applied to limit the growth of the graph ([80], [81]). In fact, history summarisation/deletion of map points pertaining to dynamic objects

ACKNOWLEDGEMENTS

This research is supported by the Australian Research Council through the Australian Centre of Excellence for Robotic Vision (CE140100016), and the Sydney Institute for Robotics and Intelligent Systems. The authors would like to thank Mr. Ziang Cheng and Mr. Huangying Zhan for providing help in preparing the testing datasets.

REFERENCES

[1] D. Hahnel, D. Schulz, and W. Burgard, "Map Building with Mobile Robots in Populated Environments," in International Conference on Intelligent Robots and Systems (IROS), vol. 1. IEEE, 2002, pp. 496–501.
[2] D. Hahnel, R. Triebel, W. Burgard, and S. Thrun, "Map Building with Mobile Robots in Dynamic Environments," in International Conference on Robotics and Automation (ICRA), vol. 2. IEEE, 2003, pp. 1557–1563.
[3] D. F. Wolf and G. S. Sukhatme, "Mobile Robot Simultaneous Localization and Mapping in Dynamic Environments," Autonomous Robots, vol. 19, no. 1, pp. 53–65, 2005.
[4] H. Zhao, M. Chiba, R. Shibasaki, X. Shao, J. Cui, and H. Zha, "SLAM in a Dynamic Large Outdoor Environment using a Laser Scanner," in International Conference on Robotics and Automation (ICRA). IEEE, 2008, pp. 1455–1462.
[5] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes," Robotics and Automation Letters (RAL), vol. 3, no. 4, pp. 4076–4083, 2018.
[6] C.-C. Wang, C. Thorpe, and S. Thrun, "Online Simultaneous Localization and Mapping with Detection and Tracking of Moving Objects: Theory and Results from a Ground Vehicle in Crowded Urban Areas," in International Conference on Robotics and Automation (ICRA), vol. 1. IEEE, 2003, pp. 842–849.
[7] I. Miller and M. Campbell, "Rao-blackwellized Particle Filtering for Mapping Dynamic Environments," in International Conference on Robotics and Automation (ICRA). IEEE, 2007, pp. 3862–3869.
[8] J. G. Rogers, A. J. Trevor, C. Nieto-Granda, and H. I. Christensen, "SLAM with Expectation Maximization for Moveable Object Tracking," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2010, pp. 2077–2082.
[9] A. Kundu, K. M. Krishna, and C. Jawahar, "Realtime Multibody Visual SLAM with a Smoothly Moving Monocular Camera," in International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2080–2087.
[10] K. Yamaguchi, D. McAllester, and R. Urtasun, "Efficient Joint Segmentation, Occlusion Labeling, Stereo and Flow Estimation," in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 756–771.
[11] D. Sun, S. Roth, and M. J. Black, "Secrets of Optical Flow Estimation and Their Principles," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2010, pp. 2432–2439.
[12] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
[13] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 2462–2470.
[14] C. Vogel, K. Schindler, and S. Roth, "Piecewise Rigid Scene Flow," in International Conference on Computer Vision (ICCV). IEEE, 2013, pp. 1377–1384.
[15] M. Menze and A. Geiger, "Object Scene Flow for Autonomous Vehicles," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015, pp. 3061–3070.
[16] X. Liu, C. R. Qi, and L. J. Guibas, "FlowNet3D: Learning Scene Flow in 3D Point Clouds," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019, pp. 529–537.
[17] H. Jiang, D. Sun, V. Jampani, Z. Lv, E. Learned-Miller, and J. Kautz, "SENSE: A Shared Encoder Network for Scene-flow Estimation," in International Conference on Computer Vision (ICCV). IEEE, 2019, pp. 3195–3204.
[18] P. de la Puente and D. Rodríguez-Losada, "Feature Based Graph-SLAM in Structured Environments," Autonomous Robots, vol. 37, no. 3, pp. 243–260, 2014.
[19] M. Kaess, "Simultaneous Localization and Mapping with Infinite Planes," in International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 4605–4611.
[20] M. Henein, M. Abello, V. Ila, and R. Mahony, "Exploring the Effect of Meta-structural Information on the Global Consistency of SLAM," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 1616–1623.
[21] M. Hsiao, E. Westman, G. Zhang, and M. Kaess, "Keyframe-based Dense Planar SLAM," in International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 5110–5117.
[22] B. Mu, S.-Y. Liu, L. Paull, J. Leonard, and J. P. How, "SLAM with Objects using a Nonparametric Pose Graph," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2016, pp. 4602–4609.
[23] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, "SLAM++: Simultaneous Localisation and Mapping at the Level of Objects," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2013, pp. 1352–1359.
[24] S. Yang and S. Scherer, "CubeSLAM: Monocular 3-D Object SLAM," Transactions on Robotics (T-RO), vol. 35, no. 4, pp. 925–938, 2019.
[25] D. Gálvez-López, M. Salas, J. D. Tardós, and J. Montiel, "Real-time Monocular Object SLAM," Robotics and Autonomous Systems, vol. 75, pp. 435–449, 2016.
[26] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, "MOT16: A Benchmark for Multi-Object Tracking," arXiv:1603.00831 [cs], Mar. 2016. [Online]. Available: http://arxiv.org/abs/1603.00831
[27] A. Byravan and D. Fox, "SE3-Nets: Learning Rigid Body Motion using Deep Neural Networks," in International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 173–180.
[28] P. Wohlhart and V. Lepetit, "Learning Descriptors for Object Recognition and 3D Pose Estimation," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3109–3118.
[29] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2018.
[30] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT: Real-time Instance Segmentation," in International Conference on Computer Vision (ICCV). IEEE, 2019, pp. 9157–9166.
[31] M. Henein, J. Zhang, R. Mahony, and V. Ila, "Dynamic SLAM: The Need for Speed," in International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 2123–2129.
[32] J. Huang, S. Yang, Z. Zhao, Y. Lai, and S. Hu, "ClusterSLAM: A SLAM Backend for Simultaneous Rigid Body Clustering and Motion Estimation," in International Conference on Computer Vision (ICCV). IEEE, 2019, pp. 5874–5883.
[33] J. Zhang, M. Henein, R. Mahony, and V. Ila, "Robust Ego and Object 6-DoF Motion Estimation and Tracking," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 5017–5023.
[34] P. F. Alcantarilla, J. J. Yebes, J. Almazán, and L. M. Bergasa, "On Combining Visual SLAM and Dense Scene Flow to Increase the Robustness of Localization and Mapping in Dynamic Environments," in International Conference on Robotics and Automation (ICRA). IEEE, 2012, pp. 1290–1297.
[35] W. Tan, H. Liu, Z. Dong, G. Zhang, and H. Bao, "Robust Monocular SLAM in Dynamic Environments," in International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2013, pp. 209–218.
[36] P. Kaveti and H. Singh, "A Light Field Front-end for Robust SLAM in Dynamic Environments," arXiv preprint arXiv:2012.10714, 2020.
[37] C.-C. Wang, C. Thorpe, S. Thrun, M. Hebert, and H. Durrant-Whyte, "Simultaneous Localization, Mapping and Moving Object Tracking," International Journal of Robotics Research (IJRR), vol. 26, no. 9, pp. 889–916, 2007.
[38] N. D. Reddy, P. Singhal, V. Chari, and K. M. Krishna, "Dynamic Body VSLAM with Semantic Constraints," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 1897–1904.
[39] I. A. Bârsan, P. Liu, M. Pollefeys, and A. Geiger, "Robust Dense Mapping for Large-Scale Dynamic Environments," in International Conference on Robotics and Automation (ICRA). IEEE, 2018.
[40] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, "SLAM++: Simultaneous Localisation and Mapping at the Level of Objects," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2013, pp. 1352–1359.
[41] K. Tateno, F. Tombari, and N. Navab, "When 2.5D is Not Enough: Simultaneous Reconstruction, Segmentation and Recognition on Dense SLAM," in International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 2295–2302.
[42] E. Sucar, K. Wada, and A. Davison, "NodeSLAM: Neural Object Descriptors for Multi-View Shape Reconstruction," arXiv preprint arXiv:2004.04485, 2020.
[43] M. Runz, M. Buffier, and L. Agapito, "MaskFusion: Real-time Recognition, Tracking and Reconstruction of Multiple Moving Objects," in International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2018, pp. 10–20.
[44] B. Xu, W. Li, D. Tzoumanikas, M. Bloesch, A. Davison, and S. Leutenegger, "MID-Fusion: Octree-based Object-level Multi-instance Dynamic SLAM," in International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 5231–5237.
[45] M. Hosseinzadeh, K. Li, Y. Latif, and I. Reid, "Real-time Monocular Object-model Aware Sparse SLAM," in International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 7123–7129.
[46] L. Nicholson, M. Milford, and N. Sünderhauf, "QuadricSLAM: Dual Quadrics from Object Detections as Landmarks in Object-oriented SLAM," Robotics and Automation Letters (RAL), vol. 4, no. 1, pp. 1–8, 2018.
[47] P. Li, T. Qin, et al., "Stereo Vision-based Semantic 3D Object and Ego-motion Tracking for Autonomous Driving," in European Conference on Computer Vision (ECCV), 2018, pp. 646–661.
[48] P. Li, J. Shi, and S. Shen, "Joint Spatial-temporal Optimization for Stereo 3D Object Tracking," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020, pp. 6877–6886.
[49] B. Bescos, C. Campos, J. D. Tardós, and J. Neira, "DynaSLAM II: Tightly-coupled Multi-object Tracking and SLAM," Robotics and Automation Letters (RAL), vol. 6, no. 3, pp. 5191–5198, 2021.
[50] A. Dewan, T. Caselitz, G. D. Tipaldi, and W. Burgard, "Motion-based Detection and Tracking in 3D Lidar Scans," in International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 4508–4513.
[51] K. M. Judd, J. D. Gammell, and P. Newman, "Multimotion Visual Odometry (MVO): Simultaneous Estimation of Camera and Third-party Motions," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 3949–3956.
[52] J. Huang, S. Yang, T.-J. Mu, and S.-M. Hu, "ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020, pp. 2168–2177.
[53] M. A. Fischler and R. C. Bolles, "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography," Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[54] G. S. Chirikjian, R. Mahony, S. Ruan, and J. Trumpf, "Pose Changes from a Different Point of View," in The ASME International Design Engineering Technical Conferences (IDETC). ASME, 2017.
[55] D. Nistér, O. Naroditsky, and J. Bergen, "Visual Odometry," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1. IEEE, 2004, pp. I–I.
[56] P. J. Huber, "Robust Estimation of a Location Parameter," in Breakthroughs in Statistics. Springer, 1992, pp. 492–518.
[57] F. Dellaert and M. Kaess, "Square Root SAM: Simultaneous Localization and Mapping via Square Root Information Smoothing," International Journal of Robotics Research (IJRR), vol. 25, no. 12, pp. 1181–1203, 2006.
[58] S. Agarwal, K. Mierle, and Others, "Ceres Solver," http://ceres-solver.org, 2012.
[59] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert, "iSAM2: Incremental Smoothing and Mapping using the Bayes Tree," International Journal of Robotics Research (IJRR), p. 0278364911430419, 2011.
[60] L. Polok, V. Ila, M. Solony, P. Smrz, and P. Zemcik, "Incremental Block Cholesky Factorization for Nonlinear Least Squares in Robotics," in Robotics: Science and Systems (RSS), Berlin, Germany, June 2013.
[61] V. Ila, L. Polok, M. Šolony, and P. Svoboda, "SLAM++-A Highly Efficient and Temporally Scalable Incremental SLAM Framework," International Journal of Robotics Research (IJRR), vol. Online First, no. 0, pp. 1–21, 2017.
[62] K. Yamaguchi, D. McAllester, and R. Urtasun, "Efficient Joint Segmentation, Occlusion Labeling, Stereo and Flow Estimation," in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 756–771.
[63] T. Ke and S. I. Roumeliotis, "An Efficient Algebraic Solution to the Perspective-three-point Problem," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
[64] Z. Lv, K. Kim, A. Troccoli, J. Rehg, and J. Kautz, "Learning Rigidity in Dynamic Scenes with a Moving Camera for 3D Motion Field Estimation," in European Conference on Computer Vision (ECCV). Springer, 2018.
[65] K. M. Judd and J. D. Gammell, "The Oxford Multimotion Dataset: Multiple SE(3) Motions with Ground Truth," Robotics and Automation Letters (RAL), vol. 4, no. 2, pp. 800–807, 2019.
[66] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision Meets Robotics: The KITTI Dataset," International Journal of Robotics Research (IJRR), vol. 32, no. 11, pp. 1231–1237, 2013.
[67] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 2980–2988.
[68] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common Objects in Context," in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 740–755.
[69] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, "A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, pp. 4040–4048.
[70] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, "A Naturalistic Open Source Movie for Optical Flow Evaluation," in European Conference on Computer Vision (ECCV). Springer, 2012, pp. 611–625.
[71] A. Geiger, P. Lenz, and R. Urtasun, "Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012.
[72] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, "Digging into Self-supervised Monocular Depth Estimation," in International Conference on Computer Vision (ICCV). IEEE, 2019, pp. 3828–3838.
[73] D. Eigen, C. Puhrsch, and R. Fergus, "Depth Map Prediction from a Single Image using a Multi-scale Deep Network," in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2366–2374.
[74] E. Rosten and T. Drummond, "Machine Learning for High-speed Corner Detection," in European Conference on Computer Vision (ECCV). Springer, 2006, pp. 430–443.
[75] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An Efficient Alternative to SIFT or SURF," in International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2564–2571.
[76] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A Benchmark for the Evaluation of RGB-D SLAM Systems," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2012, pp. 573–580.
[77] N. Otsu, "A Threshold Selection Method from Gray-level Histograms," Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
[78] T.-W. Hui, X. Tang, and C. C. Loy, "A Lightweight Optical Flow CNN - Revisiting Data Fidelity and Regularization." IEEE, 2020. [Online]. Available: http://mmlab.ie.cuhk.edu.hk/projects/LiteFlowNet/
[79] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, "g2o: A General Framework for Graph Optimization," in International Conference on Robotics and Automation (ICRA). IEEE, 2011, pp. 3607–3613.
[80] H. Strasdat, A. J. Davison, J. M. Montiel, and K. Konolige, "Double Window Optimisation for Constant Time Visual SLAM," in International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2352–2359.
[81] V. Ila, J. M. Porta, and J. Andrade-Cetto, "Information-based Compact Pose SLAM," Transactions on Robotics (T-RO), vol. 26, no. 1, pp. 78–93, 2010.