PhD Thesis Raúl Acuña
Dissertation approved by the Department of Electrical Engineering and Information Technology of the Technische Universität Darmstadt in fulfillment of the requirements for the degree of Doktor-Ingenieur (Dr.-Ing.)
by
D17
Darmstadt 2021
Acuña Godoy, Raúl Eduardo: Dynamic fiducial markers for camera-based
pose estimation
Darmstadt, Technische Universität Darmstadt,
Year thesis published in TUprints: 2021
URN: urn:nbn:de:tuda-tuprints-176507
Date of the disputation: 01.02.2021
§ 9 (1) PromO
I hereby declare that this dissertation was written independently and only with the use of the cited sources.
§ 9 (2) PromO
This work has not previously been used for examination purposes.
Contents
Abbreviations and Basic Notation
Abstract
Kurzfassung
1 Introduction
1.1 Problem statement
1.2 Contributions
1.3 Dissertation outline
Bibliography
PnP Perspective-n-Point
VO Visual Odometry
MB Marker-Based
ML Markerless
LM Levenberg–Marquardt
Basic Notation
$x$ Scalar
$\mathbf{x}$ Vector
$\mathbf{M}$ Matrix
$^{a}\mathbf{T}_b$ Homogeneous transformation matrix that converts coordinates from frame $b$ to frame $a$
$O_i$ Origin of coordinate system $i$
$\tilde{x}$ Noisy estimate of the true value of $x$
$\hat{x}$ Optimal estimate of the true value of $x$
$d(\mathbf{x}, \mathbf{y})$ Geometric (Euclidean) distance between the two vectors $\mathbf{x}$ and $\mathbf{y}$
Abstract
This dissertation introduces new techniques that increase the accuracy of camera-based pose estimation using fiducial markers. Camera-based pose estimation involves finding a camera's pose relative to some coordinate system by detecting known features in the environment; when the visual appearance of these features is known beforehand, they are called fiducials.
The visual-based pose estimation process is highly complex since the es-
timated pose accuracy depends on many interconnected factors that have
to be considered simultaneously; this thesis aims to identify the most influ-
ential factors and proposes solutions that mitigate the effect of the sources
of error, hence increasing the estimated pose’s accuracy and robustness.
We base our solutions on exploiting an interaction between the camera
and what the camera is measuring; this, in essence, means that the fea-
tures change and adapt to better suit the measurement by either moving
in space to better locations or changing their shape dynamically.
Kurzfassung
This dissertation presents new techniques that increase the accuracy of camera-based pose estimation using visual fiducial markers. The problem of camera-based pose estimation consists of determining the pose of a camera relative to a coordinate system by detecting known features in the environment. When the visual appearance of these features is known in advance, they are called fiducials.
The vision-based pose estimation process is highly complex, since the accuracy of the estimated pose depends on many interconnected factors that must be considered simultaneously. This work aims to identify the most influential factors and proposes solutions that mitigate the effects of the error sources, thereby increasing the accuracy and robustness of the estimated pose.
The solution presented in this work is based on exploiting an interaction between the camera and what the camera observes. In essence, this means that the features change and adapt to better suit the measurement, either by moving in space to better locations or by changing their shape dynamically.
1 Introduction
What is “real”? How do you define “real”? If real is what you can feel, smell, taste and see, then ‘real’ is simply electrical signals interpreted by your brain.
Morpheus, The Matrix (1999)

The human brain can process the content of an image in less than 13 ms [15]; in the blink of an eye, the brain can extract an
enormous amount of high-level structural information from the world with
low energy requirements (compared to other organisms or computer-based
vision systems). Vision is also arguably the most useful perception tool for
living organisms. It is assumed that one of the reasons for the Cambrian
evolutionary explosion is that the evolution of advanced eyes started an
arms race that accelerated evolution (“Light Switch” theory of Andrew
Parker [89]); this is no surprise, since vision allows for fast locomotion and navigation, object detection and recognition, threat avoidance, and more.
The perception process is tied to an anticipation of what is being ob-
served. A cognitive model of what is being perceived is, in general, nec-
essary, either by being pre-coded in the brain structure (what we call in-
stincts) or learned by experience. Anticipation implies a prediction based
on the current measurement and the available model of the object of per-
ception. The perception is not passive; it is an active process connected to
actions in the world and how they change in time. Moreover, perception
can spur action tendencies; in pianists, for example, the mere thought of a particular note triggers the motor centers of the brain associated with producing that note with the finger [74].
To better illustrate the coupling of action and perception, let us describe
a different kind of “Light Switch”. In the scenario of Fig. 1.1, there is a
robot that can perform only two possible actions, move forward or move
backward, and only one sensor for localization, a camera. It is clear to us that if the robot moves forward, it will push the button, turning the light off, and it will not be able to locate itself again with the camera. We can infer this only because we have a high-level concept of what a light switch does and, additionally, a cognitive model of the consequences of specific actions and their influence on perception.
[Figure: the basic pose estimation pipeline: detect features, then calculate the pose.]
the whole process and identifying the potential pitfalls. What follows is
a short description of the necessary steps required for visual-based pose
estimation and the potential error sources.
1.2 Contributions
• A comprehensive comparative study of the state of the art of artificial features (fiducial markers) over the last 30 years.
• Mobile Marker Odometry: a cooperative odometry scheme for accurate pose estimation of mobile robots without using any environment features.
• An extension to Mobile Marker Odometry in which environmental features are fused with marker features, improving accuracy when the environment is rich in good features while maintaining accuracy when the environment lacks features.
• A methodology for finding optimal spatial configurations of features that improve the robustness of pose estimation algorithms against noise. This methodology allows designers of fiducial markers to test the quality of their designs and enables the creation of stable feature data sets for the comparison of pose estimation algorithms.
• A new fiducial marker design: the dynamic fiducial marker, based on the interaction between the observer and the marker. We use screens to display the shapes that define the fiducial, and these shapes can change over time to better suit the particular state of the observer. We design the shapes following the findings on optimal spatial configurations of features. The dynamic marker has higher accuracy than traditional fiducials and can be used for camera calibration and pose estimation problems.
scheme for robotics that uses only artificial features but does not require environmental intervention. In Chapter 5, we analyze the influence of the relative position of control points on the robustness of traditional pose estimation algorithms and propose a new optimization methodology to find optimal control points. Chapter 6 presents our final contribution, the dynamic marker, which consolidates the results obtained in the previous chapters in a new intelligent fiducial design that can change its shape over time to facilitate the pose estimation process, increasing accuracy and robustness. Finally, in Chapter 7, we present the conclusions of this dissertation.
2 Fundamentals of camera-based pose estimation
In this chapter, we introduce the camera-based pose estimation problem.
We first elaborate on the mathematical notation that connects the field
of robotics and computer vision and then explain the essential elements
of the camera-based pose estimation problem by studying the state of the
art of pose estimation methods.
$$ ^{j}\mathbf{p}_i = \begin{bmatrix} ^{j}p_{x_i} \\ ^{j}p_{y_i} \\ ^{j}p_{z_i} \end{bmatrix}. \qquad (2.1) $$
Each component of the vector ($^{j}p_{x_i}$, $^{j}p_{y_i}$ and $^{j}p_{z_i}$) is a Cartesian coordinate of $O_i$ in the $j$ frame. For convenience we can define a static
$$ ^{j}\mathbf{r} = {}^{j}\mathbf{p}_i + {}^{i}\mathbf{r}, \qquad (2.2) $$
The ordering of the elements in this equation allows the reader to follow the transformation from right to left. Notice how the subscripts and superscripts cancel each other, which helps the readability of the transformation chain. This notation is summarized in Fig. 2.1.
$$ ^{j}\mathbf{R}_i = \begin{bmatrix} \hat{\mathbf{x}}_i\cdot\hat{\mathbf{x}}_j & \hat{\mathbf{y}}_i\cdot\hat{\mathbf{x}}_j & \hat{\mathbf{z}}_i\cdot\hat{\mathbf{x}}_j \\ \hat{\mathbf{x}}_i\cdot\hat{\mathbf{y}}_j & \hat{\mathbf{y}}_i\cdot\hat{\mathbf{y}}_j & \hat{\mathbf{z}}_i\cdot\hat{\mathbf{y}}_j \\ \hat{\mathbf{x}}_i\cdot\hat{\mathbf{z}}_j & \hat{\mathbf{y}}_i\cdot\hat{\mathbf{z}}_j & \hat{\mathbf{z}}_i\cdot\hat{\mathbf{z}}_j \end{bmatrix}. \qquad (2.3) $$
$$ \mathbf{R}_Y(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix}, \qquad (2.5) $$
$$ \mathbf{R}_X(\theta) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix}. \qquad (2.6) $$
$$ ^{k}\mathbf{R}_i = {}^{k}\mathbf{R}_j\, {}^{j}\mathbf{R}_i. \qquad (2.7) $$
The rotation matrix contains nine elements; however, only three param-
eters are required to define a body’s orientation in space. As a result, there
is no unique representation. A typical minimal representation of rotations
can be defined by the so-called Euler angles $[\alpha, \beta, \gamma]$, where each angle is a rotation around an axis of a moving coordinate frame. When combining rotation matrices using matrix multiplication, the ordering matters; this means that there are different conventions for the Euler angle representation, depending on how the axis rotations are chained together. The most common representation is the Z-Y-X (or 3-2-1) convention, which defines a rotation around the Z-axis by an angle $\alpha$, then a rotation around the Y-axis of the rotated coordinate frame by an angle $\beta$, and finally a rotation around the X-axis of the twice-rotated coordinate frame by an angle $\gamma$. Other common representations for the rotation matrix are the axis-angle and quaternion-based representations, which are discussed in detail in [115].
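As an illustration of the Z-Y-X convention, the following minimal NumPy sketch (our own, since the thesis does not prescribe an implementation) composes elementary rotations like those in (2.5) and (2.6):

import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def rot_y(b):
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_x(g):
    c, s = np.cos(g), np.sin(g)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_zyx(alpha, beta, gamma):
    # Z-Y-X (3-2-1) convention: rotate about Z, then about the Y-axis of the
    # rotated frame, then about the X-axis of the twice-rotated frame.
    # Rotations about the axes of the moving frame chain by right
    # multiplication, hence R = Rz(alpha) Ry(beta) Rx(gamma).
    return rot_z(alpha) @ rot_y(beta) @ rot_x(gamma)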
$$ ^{j}\mathbf{r} = {}^{j}\mathbf{R}_i\, {}^{i}\mathbf{r} + {}^{j}\mathbf{p}_i, \qquad (2.8) $$
$$ ^{j}\mathbf{T}_i = \begin{bmatrix} r_{11} & r_{12} & r_{13} & p_1 \\ r_{21} & r_{22} & r_{23} & p_2 \\ r_{31} & r_{32} & r_{33} & p_3 \\ 0 & 0 & 0 & 1 \end{bmatrix}. \qquad (2.10) $$
$$ ^{j}\mathbf{r} = {}^{j}\mathbf{T}_i\, {}^{i}\mathbf{r}, \qquad (2.11) $$
Figure 2.2: Example of the image formation and definition of the world and camera coordinate systems. Note the direction of the arrow in the transform $^{c}\mathbf{T}_w$; this will be the convention for the direction of all transformation matrices.
ray then intersects a specific plane called the image plane. The intersection
point of the ray with the image plane is the projection of the 3D point.
As a side note, this model is only one of the possible ways of modeling how a camera captures images; camera lenses usually do not focus the rays at precisely one point, and some distortion is introduced in the final image. More advanced projection models can be used to minimize these problems, but they are outside the scope of this research.
We will define a coordinate frame for the camera and some conventions
on its orientation related to the world coordinate system, as in Fig. 2.2.
Note that the Z-Axis of the camera coordinate system is perpendicular to
the image plane; this axis will be called the optical axis. The intersection
of the optical axis and the image plane is the principal point or image
center. Finally, we define the transform between the world and the camera
coordinate system as c Tw .
The distance between the center of the camera coordinate frame Oc and
the principal point on the image plane is called the camera constant or focal
length f . We can then define a central projection transformation based
on a pinhole camera model which converts 3D homogeneous coordinates
on the camera coordinate system into 2D homogeneous coordinates on the
image plane:
$$ \lambda \begin{bmatrix} ^{I}r_x \\ ^{I}r_y \\ 1 \end{bmatrix} = \underbrace{\begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\mathbf{K}_c} \underbrace{\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}}_{\Pi_0} \begin{bmatrix} ^{c}r_x \\ ^{c}r_y \\ ^{c}r_z \\ 1 \end{bmatrix}. \qquad (2.12) $$
Figure 2.3: Definition of world coordinate system and the main camera related
coordinate systems.
Camera images are generally defined with the origin in the top left cor-
ner, and the points are also scaled and sheared. The following transforma-
tion accounts for this with a scale factor in each axis (sx , sy ), a principal
point offset (cx , cy ), and a shear distortion factor (sθ ):
$$ \lambda \begin{bmatrix} ^{p}r_x \\ ^{p}r_y \\ 1 \end{bmatrix} = \underbrace{\begin{bmatrix} s_x & s_\theta & c_x \\ 0 & s_y & c_y \\ 0 & 0 & 1 \end{bmatrix}}_{\mathbf{K}_s} \begin{bmatrix} ^{I}r_x \\ ^{I}r_y \\ 1 \end{bmatrix}, \qquad (2.13) $$
$$ \lambda\, {}^{p}\mathbf{r} = \underbrace{\mathbf{K}_s \mathbf{K}_c}_{\mathbf{K}}\, \Pi_0\, {}^{c}\mathbf{T}_w\, {}^{w}\mathbf{r}, \qquad (2.14) $$
with $\mathbf{K}$ as:
$$ \mathbf{K} = \begin{bmatrix} f s_x & f s_\theta & c_x \\ 0 & f s_y & c_y \\ 0 & 0 & 1 \end{bmatrix}. \qquad (2.15) $$
In practice, the focal length and the scale factors are usually combined into $f_x = f s_x$ and $f_y = f s_y$ and the shear is neglected, so that $\mathbf{K}$ is commonly written as:
$$ \mathbf{K} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}. \qquad (2.16) $$
One final coordinate frame relevant for us is the normalized image frame $\eta$, which is equivalent to a projection on the camera plane in world units (no influence of internal camera parameters). Assuming the calibration matrix $\mathbf{K}$ to be known, the homogeneous pixel coordinates $^{p}\mathbf{r} = [\,^{p}r_x, \,^{p}r_y, 1]^{\top}$ can be transformed to homogeneous normalized image coordinates $^{\eta}\mathbf{r}$ in world units (usually metric units) by
$$ ^{\eta}\mathbf{r} = \mathbf{K}^{-1}\, {}^{p}\mathbf{r}. \qquad (2.17) $$
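As a numeric illustration of equations (2.14) and (2.17), the following sketch projects a homogeneous world point to pixel coordinates and maps the result back to normalized image coordinates (the intrinsic values and the pose used here are arbitrary assumptions, not values from this work):

import numpy as np

K = np.array([[800.0, 0.0, 320.0],               # assumed fx, cx
              [0.0, 800.0, 240.0],               # assumed fy, cy
              [0.0, 0.0, 1.0]])
Pi0 = np.hstack([np.eye(3), np.zeros((3, 1))])   # standard projection [I | 0]
T_c_w = np.eye(4)                                # toy pose: camera at world origin

def project(T_c_w, x_w):
    # Eq. (2.14): homogeneous pixel coordinates up to the scale lambda.
    x_h = K @ Pi0 @ T_c_w @ x_w
    return x_h / x_h[2]                          # divide out lambda

def normalize(x_p):
    # Eq. (2.17): pixel to normalized image coordinates.
    return np.linalg.inv(K) @ x_p

x_w = np.array([0.1, -0.05, 2.0, 1.0])           # homogeneous 3D world point
x_p = project(T_c_w, x_w)
x_eta = normalize(x_p)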
[Figure: projection of a world point into the image plane, showing the world frame, the camera frame, and the transform $^{c}\mathbf{T}_w$.]
Since it has been proven that pose accuracy usually increases with the number of points [77], PnP approaches that use more points ($n > 3$) are usually preferred. General PnP methods can be broadly divided into iterative and non-iterative approaches.
also depends highly on the input data, its performance is similar to or better than that of the DLT, which makes it arguably the best algorithm for homography estimation.
For both homography estimation methods, the previous normalization of
the measurements is a crucial step to improve the quality of the estimated
homography [49]. However, the normalization has some disadvantages [93]:
First, the normalization matrices are calculated from noisy measurements
and are sensitive to outliers. Second, for a given measurement, the noise
affecting each point is independent of the others; in normalized measure-
ments, this independence is removed [17]. A method is proposed in [93] to
overcome this problem by avoiding the normalization and using a Taubin
estimator instead, obtaining similar results as the normalized one but with
increased complexity.
solution. Some PnP algorithms will fail if the model points are co-planar (e.g., the general DLT or some EPnP implementations). Harder to identify are corner cases which appear only in some configurations for some algorithms and, even worse, in only some implementations of a given algorithm. For example, a known problem is the original OpenCV implementation of Zhang's method for solving PnP with co-planar points [123], which fails when the control points are precisely the four corners of a square. Many of these special and corner cases are not correctly documented in the implementations, and there are few comparative studies that focus on the corner cases of each algorithm.
When the model points are co-planar, a known issue for PnP is pose ambiguity; it occurs when the planar model is either small or viewed far from the camera (weak perspective). In these cases, there exists a rotation ambiguity that corresponds to an unknown reflection of the plane about the camera's z-axis. Methods based on homography decomposition return only one solution, which may be the wrong one (i.e., a reflection of the correct solution); more modern algorithms like IPPE [21] return both solutions and let the user decide based on other metrics like the reprojection error. However, in some cases, the reprojection error is very similar between the two solutions, and additional external information is needed, for example: more control points, analysis of the scene brightness (orientation influences the brightness of the planes), temporal filtering, or discarding wrong solutions by applying sensor fusion techniques [54].
control points. Nevertheless, this is only half of the story. The PnP
methods are generally agnostic to how the control points are defined in
the world, how they are detected in an image, and how correspondences
between control points and image points are determined. Each one of
these elements influences the final outcome of the pose estimation process
since, in each one of these steps, errors are introduced. When one is
trying to obtain the pose of the camera, it is vital to understand the whole
process of image capture and detection and each of the algorithms involved
and not only the PnP method selected. By looking at the problem as a
whole and not as independent algorithms, it will be possible to identify
the fundamental problems and find solutions to them.
Figure 3.1: Example of the ORB feature detection algorithm used to automatically match features from a known object (a book), shown on the left side of the image, to a photo of the book in a real setting. The first match (from top to bottom) is incorrect.
Figure 3.2: Fiducial markers used in different fields. a) Circular fiducials used
for PCB alignment (marked in red). b) Self-adhesive skin markers, which provide
visibility in body imaging procedures such as Computerized Axial Tomography
(CT) and Magnetic Resonance Imaging (MRI). c) Reflective spherical markers
for full body motion capture with a multi-camera system. d) Planar fiducial
markers for pose control of a quadcopter.
Color
The presence of color in a fiducial marker is usually a compromise. Color
can be a rich form of information, especially for identification and seg-
mentation. However, the representation of colors in images is not unique.
The imaging sensors can have different sensitivities to the light spectrum;
this means that each camera produces different color representations of
the same scene, and for consistent measurements, they require color cali-
bration. For this reason, and for simplicity, most fiducials are designed using black and white shapes without color.
Orientation
The detection of a fiducial should be independent of the orientation of the
camera that captures it. The image should not favor some orientations over
others. Thus, the fiducial design should have non-symmetrical elements
that will allow the detection algorithm to orient the marker in the image
properly. Additionally, the non-symmetrical elements are a crucial part of
unambiguous full 6DOF pose estimation.
Size
The ideal size of a fiducial in world coordinates is a compromise between many factors. First, to be detected, the chosen shape should cover a minimum number of pixels in the image. This minimum number of pixels increases with the complexity of the chosen shape and the amount of information that has to be transmitted within it (e.g., the ID). Second, the size also depends on the camera properties (e.g., pixel density, resolution, lenses) and the minimum and maximum distance at which the marker has to be detected. For example, a large marker can be better for long ranges, but it may be out of the field of view when the camera is too close.
Identification
A marker should be unique. It must be easily separated from all the
other shapes present in the environment, and it should be separable from
other markers of the same family. In many applications, the users place
several markers in different environment locations, and it should be easy
for the processing algorithm to discern one marker from the others in
these situations. The solution is implementing a coding scheme inside
the marker, which defines a unique ID for each one. The code inside the
marker requires a certain amount of display area, and the amount of area
required increases with the complexity of the code.
Robustness
The robustness of a fiducial relates mainly to the identification and pose estimation processes. The identification should work under all conditions and be consistent even across different environmental conditions (e.g., lighting). Once identified, the second most important task of a fiducial is pose estimation, which is highly susceptible to errors in the position of the points. An ideal fiducial should provide almost noise-free point coordinates to avoid pose errors or, at the very least, provide the characteristics of the error as a function of the pose.
Construction
A fiducial can be any shape in space, either flat or three-dimensional, and theoretically there are no constraints on shapes and sizes. However, in practice, a fiducial should be reliable and easy to manufacture. Inaccuracies in the construction of the fiducial translate into measurement errors; hence, the user should precisely measure the position of the control points in the fiducial after manufacturing. The design also has to consider that materials that expand or contract at different temperatures, or deteriorate with time, may introduce errors in the measurements which are difficult for the user to account for.
white squares (white for 1 and black for 0), which are usually organized in a snake-like sequence starting from one of the corners. Some fiducials rely on other kinds of shapes, even textures, and some also include color. However, the snake-like coding has been the dominant design; the reason for this dominance is that the internal code can include error correction strategies, making the identification more robust.
The typical detection pipeline is first to detect the corners, then use the four corner points to define a homography, find the proper orientation based on the internal code, and finally extract the pose from the homography matrix (a minimal code sketch of this pipeline follows the list below). We now go through the fiducials of Fig. 3.3, highlighting with bullet points the central claims or contributions of each design:
• 1998 Matrix [94]. The first squared fiducial marker to our knowledge. A black square with an internal binary code for identification. No robustness to occlusion.
• 2000 ARToolkit [90]. Uses any image as an internal code; the image is detected by correlation with a template. High false detection rate and inter-marker confusion. Open source.
• 2005 ARTag [31]. Introduced binary-defined codes with error correction to replace ARToolkit's pattern recognition and identification step.
• 2007 ARToolkit Plus [114]. Introduced binary marker patterns (similar to those of ARTag). Performance improvements. Open source.
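To make the typical square-marker pipeline concrete, the following is a minimal detection-plus-pose sketch using OpenCV's ArUco module (the dictionary, intrinsics, and marker side length are illustrative assumptions; the detector class is the API introduced in OpenCV 4.7, while older versions expose cv2.aruco.detectMarkers instead):

import cv2
import numpy as np

K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])  # assumed intrinsics
dist = np.zeros(5)                                                # assumed: no distortion
side = 0.05                                                       # assumed side length (m)

# 3D control points: the four marker corners in the marker frame, in the
# order required by SOLVEPNP_IPPE_SQUARE.
obj = np.array([[-side/2,  side/2, 0], [ side/2,  side/2, 0],
                [ side/2, -side/2, 0], [-side/2, -side/2, 0]], dtype=np.float32)

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
corners, ids, _ = cv2.aruco.ArucoDetector(dictionary).detectMarkers(img)

for c in corners:
    # IPPE exploits the planar square geometry and keeps the solution of
    # the ambiguous pair with the lower reprojection error.
    ok, rvec, tvec = cv2.solvePnP(obj, c.reshape(-1, 2), K, dist,
                                  flags=cv2.SOLVEPNP_IPPE_SQUARE)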
When one compares the designs of Aruco or AprilTag, which are among the most recent square marker designs, with the first design made by
3.2.3 QR markers
QR markers are a 2D version of the barcodes that have been used in industry for a long time. They encode information in a grid of black and white squares representing binary data. Not all QR codes are designed to provide pose estimation, since their typical function is to transmit information. However, it can be said that QR codes inspired the identification methodology of squared fiducial markers. The problem with QR codes for pose estimation is that, since they do not present a bounding square, detection takes more time. Additionally, they have to be roughly parallel to the camera (since there is no way to apply an a priori projective correction with a homography), and they usually have to fill a significant portion of the image. The most relevant QR-code-based markers are shown in Fig. 3.4.
Figure 3.5: Most relevant circular-shaped fiducial markers in chronological order. a) CCC [41], b) CCC Color [18], c) Intersense [81], d) TRIP [72], e) Fourier Tag [103], f) RUNE-Tag [7], g) CCTag [13], h) Pi-Tag [8], i) Whycon [59], j) Whycode [70].
Like the squared markers, the first and the latest proposed fiducials in this category share almost the same design. If we compare the CCC marker design from 1992 and the Whycode marker design of 2017, we notice that the shape is the same, only with the addition of an internal code. Some more creative markers like Fourier Tag, RUNE-Tag, and CCTag did not gain enough traction to become widespread, and eventually the fundamental design came back to its roots. Most of the improvement has been in the detector's performance and in the usage of conics for pose estimation. Circular markers are arguably more precise than their square-based counterparts, especially those from the last decade, but they lack identification flexibility. In terms of range, markers like Whycon have a greater maximum detection distance than AprilTag or Aruco due to the design's simplicity.
Figure 3.6: Most relevant hybrid fiducial markers in chronological order. a) ReacTIVision [55], b) SiftSurfTag [104], c) Lentimarkers [110], d) Fractal Marker Field [50], e) Prasad [91], f) X-Tags [9], g) TopoTag [122], h) STag [6].
The hybrid markers are the most interesting of all the categories. Some designs focus on increasing the range (e.g., the Fractal Marker Field), others on being robust to blur (e.g., Prasad), and others on perspective invariance (e.g., SiftSurfTag). In this category, some designs implement the identification in a novel and creative way using topology (e.g., ReacTIVision, TopoTag). We believe that topology-based identification could also be applied to other marker families in the future, for example squared fiducials. The most common characteristic of this group is the use of a combination of squares and circles to increase accuracy, which may be true in theory, but there is a lack of comparisons between marker families and a significant dependency on how the software is implemented. We believe that many of this category's markers show more promise than AprilTag and Aruco markers from a theoretical point of view, but the latter (AprilTag and Aruco) have a proven and reliable codebase.
3.3 Discussion
All the markers categorized above are organized in a timeline in Fig. 3.7 and Fig. 3.8. This timeline summarizes almost 30 years of fiducial development, and it is in itself an excellent tool to visualize the relationship between different marker families and their evolution through time. We see that the shape that appeared first as a fiducial is the circle, closely followed by the square. The principal identification methodology originated from the QR code family and was later fused into the squared and circular markers.
We see some convergent designs. For example, as QR-based markers try to offer pose estimation capabilities, they start to look similar to squared fiducials. As mentioned before, squared markers and circle-based markers look almost the same after 30 years of development, but now the software is faster, more robust, and more accurate. The most revolutionary designs come from the hybrid category, where we start to see the merging of circles and squares and new ways of identifying the fiducials. Besides some isolated
[Figure 3.7 content: timeline from 1992 to 2005 with the markers Maxicode, CCC, QR, ARToolkit, Cyber-Code, Intersense, TRIP, VisualCode, and ARTag, arranged in the columns Squares, QR Based, Hybrids, and Circles.]
Figure 3.7: 30 years of fiducials part I.
[Figure 3.8: 30 years of fiducials part II. Timeline from 2007 to 2020 with the markers SiftSurfTag, RUNE-Tag, Caltag, AprilTag, Lentimarkers, CCTag, Fractal Marker Field, Whycon, Pi-Tag, ArUco, X-Tags, ChromaTag, Whycode, Fractal AprilTag, Aruco Uramaki, STag, and TopoTag, arranged in the columns Squares, QR Based, Hybrids, and Circles.]
4 Mobile Marker Odometry

4.1 Introduction
Visual pose estimation and localization is a problem of interest in many
fields, from robotics to augmented reality and autonomous cars. Possi-
ble solutions are dependent on the camera(s) configuration available to
the task (monocular, stereoscopic, or multi-camera) and the amount of
knowledge about the structure and geometry of the environment.
Visual pose estimation can be classified into two different categories: the first, Marker-Based (MB), relies on detectable visual landmarks like fiducial markers or 3D scene models with known feature coordinates [38, 77]; the second, Markerless (ML), works without any 3D scene knowledge [36, 77]. The difference between MB and ML methods is illustrated in Fig. 4.1.
MB methods estimate the relative camera pose to a marker with known
absolute coordinates in the scene. Therefore, these methods are drift-
less, need only a monocular camera system, and the accuracy of the pose
estimation depends on the accuracy of the measurement of 2D image co-
ordinates, which are projections of known 3D marker coordinates, and the
kind of algorithm used to realize spatial resection [45, 119].
ML methods estimate relative poses between camera frames based on
static scene features with unknown absolute coordinates in the scene and
apply dead reckoning to reach the absolute pose within the scene w.r.t a
known initial pose. Due to this incremental estimation, errors are intro-
duced and are accumulated by each new frame-to-frame motion estimation,
which causes unavoidable drift.
These methods can be further divided into pure visual VO [36] and more elaborate Visual Simultaneous Localization and Mapping (V-SLAM) approaches [64], including the new developments in Semi-Dense VO [29]. Basic VO approaches estimate frame-to-frame pose changes of a camera based on some 2D feature coordinates, their optical flow estimates [120], and their 3D reconstruction using epipolar geometry in conjunction with an outlier rejection scheme to verify static features [12]. Even if some additional temporal filtering like Extended Kalman Filtering (EKF) or Local Bundle Adjustment (BA) is applied, drift can be reduced but cannot be avoided [36].
V-SLAM approaches [64] accumulate not only camera poses but also 3D reconstructions of the back-projected extracted 2D features of VO in a global 3D map. Thus, drift can be reduced by using additional temporal filtering on the 3D coordinates of the features in the map, or by global BA and loop closure techniques that relocate already seen features via map matching.
[Figure 4.1: (a) Marker-Based and (b) Markerless pose estimation, with the transforms $^{c_t}\mathbf{T}_w$, $^{c_t}\tilde{\mathbf{T}}_m$, and $^{c_\tau}\tilde{\mathbf{T}}_{c_t}$ involved in each case.]
$$ ^{c_\tau}\mathbf{x} = {}^{c_\tau}\tilde{\mathbf{T}}_{c_t}\, {}^{c_t}\mathbf{x}. \qquad (4.3) $$
After including the collinearity equation, the reprojection error at time $t$ between the reprojected 3D coordinates $^{c_t}\mathbf{x}$ and the homogeneous normalized image coordinates from the previous time $^{\eta_\tau}\mathbf{x}$:
$$ \epsilon_t = d\left( {}^{\eta_\tau}\mathbf{x},\; \Pi_0\, {}^{c_\tau}\tilde{\mathbf{T}}_{c_t}\, {}^{c_t}\mathbf{x} \right)^2, \qquad (4.4) $$
leads to the optimal relative pose estimate $^{c_\tau}\hat{\mathbf{T}}_{c_t}$ (see also Fig. 4.1b).
The 3D coordinates $^{c_t}\mathbf{x}$ of the features are not known, and their estimates change over time. Thus, they have to be reconstructed as $^{c_t}\mathbf{x} = \lambda\, {}^{\eta_t}\mathbf{x}$, for example using a stereo vision system that extracts the depth $\lambda$ of each 2D coordinate $^{\eta_t}\mathbf{x}$. Also, a proper correspondence search to obtain the 2D-2D correspondences of the $\{^{\eta_\tau}\mathbf{x}, {}^{\eta_t}\mathbf{x}\}$ coordinate pairs is needed for a proper reconstruction and a good optimization result from equation (4.5). Unfortunately, a correspondence search in a ML environment is ambiguous and
fiducial markers that can be detected very robustly are used. Additionally, the error accumulation in MOMA, according to (4.6), only happens at discrete time instants $t = \tau_i$, which occur at a much lower frequency (at specific waypoints) rather than at the high frame rate of the camera as in ML-VO. Finally, since the odometry scheme comprises only the camera and the mobile marker, no environment features are required.
In conclusion, the whole MOMA process is only based on applying the
least-squares optimization along a specific caterpillar-like (see also Sec-
tion 4.3) marker-camera motion pattern. The minimal motion pattern
and concurrent optimizations are summarized graphically in Fig. 4.2.
Figure 4.2: The basic MOMA odometry cycle. At $t = 0$ the marker is static and the camera can obtain its pose $^{c}\tilde{\mathbf{T}}_w$ knowing the initial marker pose $^{m}\mathbf{T}_w$. During $0 \le t < \tau_1$ the camera moves in relation to the marker and estimates its pose $^{c_t}\tilde{\mathbf{T}}_w$ continuously by measuring the relative pose $^{c_t}\tilde{\mathbf{T}}_m$ to the marker. During the time interval $\tau_1 \le t \le \tau_2$ the camera is static and the marker moves to a new location within the camera's field of view. At time $t = \tau_2$ the marker stops moving and the marker's absolute pose $^{m_{\tau_2}}\tilde{\mathbf{T}}_w$ can be estimated via $^{c_{\tau_1}}\tilde{\mathbf{T}}_w$ and $^{m_{\tau_2}}\tilde{\mathbf{T}}_{c_{\tau_1}}$. Finally, starting from $t > \tau_2$ the marker is static again and the camera moves using the marker pose $^{m_{\tau_2}}\tilde{\mathbf{T}}_w$ as a new reference to estimate its pose $^{c_t}\tilde{\mathbf{T}}_w$, closing the cycle.
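The transform bookkeeping of this cycle reduces to chaining homogeneous transforms. The sketch below (our own illustrative helpers, writing T_a_b for $^{a}\mathbf{T}_b$) shows the two estimation steps; note how the inner indices cancel, as in the notation of Chapter 2:

import numpy as np

def inv(T):
    # Closed-form inverse of a homogeneous transform.
    Ti = np.eye(4)
    Ti[:3, :3] = T[:3, :3].T
    Ti[:3, 3] = -T[:3, :3].T @ T[:3, 3]
    return Ti

# 0 <= t < tau1: the marker is static with known pose T_m_w; each detection
# of the marker in the camera frame (T_c_m) yields the camera pose:
def camera_pose(T_c_m, T_m_w):
    return T_c_m @ T_m_w        # ^c T_w = ^c T_m  ^m T_w

# t = tau2: the camera is static at T_c_w and re-observes the relocated
# marker, which re-anchors the marker in the world frame:
def marker_pose(T_c_m, T_c_w):
    return inv(T_c_m) @ T_c_w   # ^m T_w = ^m T_c  ^c T_w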
The advantages of MOMA are: improved accuracy with respect to other relative approaches like classical VO; only a monocular camera is needed to localize several robots, since the markers already provide the scale of the environment; and, most importantly, it does not require features in the environment. Additionally, this method provides localization for the camera and the marker simultaneously, even during movement (both robots in the basic cooperative scheme with only
1. The marker has to be static while the camera moves, and the camera has to be static as long as the marker moves. If more than one marker is used, at least one of the markers must remain static; the camera and the rest of the markers are then able to move (which is not possible for CPS [62]).
Two-robot Caterpillar
In this configuration, one robot is the mobile marker and the other is the observer (the one with the camera); see Fig. 4.3. The observer follows the movement of the mobile marker using the monocular camera. We named this particular kind of movement Caterpillar-like motion, since each robot behaves like a segment of the body of a caterpillar.
The mobile marker and the observer move in turns, following the rules explained in Section 4.2. The error accumulates only during the switching of the reference and depends only on the accuracy of the fiducial marker detection, which, with a good camera and proper calibration, may be in the range of millimeters [8]. MOMA is able to track the pose of both robots during the movement and not only at the transitions.
Single-robot Caterpillar
In this minimal configuration, a single robot pulls a lightweight and rigid sled with a simple pulley mechanism; see Fig. 4.4. The robot can either actuate to pull the sled close to itself (while its wheels are locked so that it remains static) or let the sled drag behind. A monocular camera detects a fiducial marker at the front of the sled. The robot performs Caterpillar-like motion, leaving the sled behind as a static reference when it has to move; it then stops and pulls the sled, performing the MOMA odometry in the process.
Multi-robot Caterpillar
This is an extension of the basic Caterpillar case for N robots, see Fig. 4.5.
Each robot follows the one in front. In this configuration N − 1 robots
with cameras are needed for the relative transformations. If at least one
member of the group is static, the rest may move.
4.3.3 Summary
The MOMA configurations may at first appear to be a big restriction on the movement of a multi-robot system, but in practice only one of the robots has to remain static to keep the MOMA odometry working, and this robot may be designed as a simple robot, since it only needs to move, not sense. The multi-robot system can use MOMA odometry to get to a particular working space; then the simple robot remains static while the complex ones move freely using some other odometry system to perform their tasks, with the advantage that they can return to the static robot to correct drift as necessary. After finishing their tasks, they can move again as a group using MOMA odometry. This allows the robot team to maintain a drift-less odometry estimate while being able to perform more complex tasks in the environment.
Each Robotino has Aruco markers on the sides and top (Fig. 4.7). One of the robots was defined as the observer (with a FLIR Blackfly monocular camera) and the other as the mobile marker. For the Top Mobile Observer configuration, we additionally used a quadcopter.
and the behavior of the error during navigation is shown in Fig. 4.10. The positional error presents peaks followed by stabilization periods. The peaks happen during the robots' movement, since the markers may appear blurred and in non-optimal configurations for the pose estimation algorithm, but as soon as the robots are static while performing the transition, the error decreases. Additionally, in closed-loop motions like this one, the errors cancel due to the symmetric motions. To properly study the performance of MOMA, a more extended test without repetitive motions was performed.
Figure 4.10: Euclidean error of MOMA during the trajectory of the final loop
in the Two-robot Caterpillar test.
In Fig. 4.11, the results of a long trajectory are shown. The robots moved
from Room A, the one with the Optitrack ground truth system, passed
Figure 4.9: Final loop of the navigation for the Two-robot Caterpillar configuration. The blue continuous line is the MOMA estimate and the black dashed line is the ground truth from the OptiTrack system.
through a dimly lit hallway into Room B, and finally they returned once again to Room A. With this test, it was possible to measure the robots' start and final positions with the Optitrack system and compare them against MOMA. In Fig. 4.12, a magnification of the starting and final trajectories of the robot and the ground truth is shown. The error at the start and end of the trajectory is shown in Fig. 4.13. The final error was 0.38 m, or 0.68 % of the total trajectory.
Figure 4.11: Results of the MOMA odometry for a long trajectory from Room
A to Room B (red line) and the return trajectory (blue line) over a map of our
institute.
Figure 4.12: Detail of the trajectories shown in Fig. 4.11 for Room A with the
ground truth obtained from OptiTrack.
Figure 4.13: Detail of the error for the long-trajectory Two-robot Caterpillar test. Left side: error measured in Room A at the start of the trajectory. Right side: error measured in Room A at the end of the trajectory. The highlighted sections correspond to the movement of the robot followed by a static phase; during movement the accuracy decreases, but it improves when static.
Figure 4.14: Odometry results for the main robot after waypoint navigation. In
red is shown the behaviour of VISO2, where the error increases during rotations
due to the lack of good features in indoor environments. MOMA (blue) follows
the waypoints with low error.
The error in the final position when using MOMA was 0.51 cm, or 0.123 % of the total trajectory. For VISO2 we obtained an error of 33.81 cm, or 7.99 %, and this was the best case for VISO2. It has to be pointed out that VISO2 and other VO algorithms may achieve close to 1 % accuracy when good features are present in the environment. However, this test
3 https://youtu.be/0xASGfH8cDM
[Figure 4.15: the loop of transformations between the world frame, the camera $c$, the marker $m$, and the frame $s$, from which both reprojection errors are defined.]
$$ ^{s}\tilde{\mathbf{T}}_w = {}^{s}\mathbf{T}_m\, {}^{m}\tilde{\mathbf{T}}_c\, {}^{c}\mathbf{T}_w \qquad (4.10) $$
In a similar way, we can use the homogeneous 3D coordinates of marker features in marker coordinates $^{m}\mathbf{y}$ and their projection $^{\eta}\mathbf{y}$ (now on the back monocular camera) to define a reprojection error for the marker features:
$$ \epsilon_m = d\left( {}^{\eta}\mathbf{y},\; \Pi_0\, {}^{c}\tilde{\mathbf{T}}_m\, {}^{m}\mathbf{y} \right)^2. \qquad (4.11) $$
With these two reprojection errors, which are part of the same loop of transformations (see Fig. 4.15), we can find an optimal $^{s}\hat{\mathbf{T}}_w$ as follows:
$$ ^{s}\hat{\mathbf{T}}_w = \underset{^{s}\tilde{\mathbf{T}}_w}{\arg\min} \left( \frac{1}{N_e} \sum_{i=1}^{N_e} \epsilon_{i_e} + \frac{\lambda}{N_m} \sum_{j=1}^{N_m} \epsilon_{j_m} \right). \qquad (4.12) $$
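The following is a minimal sketch of how (4.12) can be minimized numerically; it is our illustration rather than the implementation used in this work. The pose is parameterized as a rotation vector plus translation, and env_res / mark_res are hypothetical callables that return the stacked reprojection residuals behind (4.4) and (4.11) for a candidate pose:

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def to_T(p):
    # 6-vector (rotation vector, translation) -> homogeneous transform.
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(p[:3]).as_matrix()
    T[:3, 3] = p[3:]
    return T

def residuals(p, env_res, mark_res, lam):
    # The 1/Ne and lambda/Nm weights of (4.12) are applied as square roots,
    # so that the sum of squared residuals matches the cost in (4.12) up to
    # the per-point versus per-coordinate counting.
    T = to_T(p)
    r_e, r_m = env_res(T), mark_res(T)
    return np.concatenate([r_e / np.sqrt(r_e.size),
                           r_m * np.sqrt(lam / r_m.size)])

# p_hat = least_squares(residuals, p0, args=(env_res, mark_res, 0.5)).x
# T_hat = to_T(p_hat)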
Some remarks about the planar fiducial marker detection are relevant.
In a Caterpillar-like configuration, the estimation of the pose for planar
fiducial markers does not provide reasonable depth estimates (Z-axis of
the camera); this may be solved by selecting other fiducial marker struc-
tures. We will discuss more on this topic in the following chapters. The
Top Mobile Observer configuration is more precise than the other con-
figurations since it is based on measurements on the camera’s XY plane.
Nonetheless, to give more freedom of movement to the UGVs, the UAV
has to fly higher (which may decrease marker detection accuracy). Since
the moment of switching is the most critical part of the method (when
the error accumulates), it is essential to find new ways of improving the
marker estimation accuracy.
5 Optimal points configurations
In the previous chapter, we focused on using accurate marker features in-
stead of environmental features for pose estimation. The reason is simple:
marker features are well defined and can be extracted reliably with higher
accuracy than those present in the environment. However, the improve-
ments obtained were still limited by the control points defined inside the
fiducial marker and the algorithm used to calculate the pose. There is a
relationship between the pose of the camera in relation to the marker and
the accuracy of the pose estimation, which is not easy to model since it
depends both on the spatial configuration of the points and the method
used to solve the pose; this makes us wonder: is there an optimal set of
artificial features? Is there an optimal marker at all?
In this chapter, we go deeper into the study of artificial features by
investigating the influence of the spatial configuration of a number of n ≥ 4
control points on the accuracy and robustness of space resection methods.
5.1 Introduction
The Perspective-n-Point (PnP) problem and the particular case of pla-
nar pose estimation via homography estimation are some of the most re-
searched topics in computer vision and photogrammetry. Even though the
research in these areas has been comprehensive, there is a surprising lack
of information regarding the effect of 3D control point configurations on
the estimation methods’ accuracy and robustness.
In Section 2.3, we reviewed the existing PnP methods, and it is clear
from the literature that control point configurations are relevant and in-
fluence the accuracy and robustness of pose estimates. However, the in-
formation available is rather general since it has been based on hands-on
experience and thus far only leads to some rules of thumb. The most obvi-
ous and widely accepted is that increasing the number of control points
increases the accuracy of the results in the presence of noise. Further on,
the accuracy decreases if the points are uniformly sampled from a given region. They
circumvent this problem by selecting the corners of the region as the po-
sitions for the control points and then refer the reader to the Chen and
Suter paper [17], where the analysis of the stability of the homography
estimation to 1st order perturbations is presented. In this analysis, it is
clear that the homography estimate’s error depends on the singular values
of the A matrix in the DLT algorithm (see also next section).
Additionally, in [19, 119] evaluations are presented characterizing pose-
dependent offsets and uncertainty on the camera pose estimations. Simu-
lations empirically prove that some camera poses are more stable for the
estimation process than others.
$$ \lambda\, {}^{\eta}\mathbf{x}_i = \Pi_0\, {}^{c}\mathbf{T}_w\, {}^{w}\mathbf{x}_i. \qquad (5.1) $$
Thus, we can solve for the reprojection error $\epsilon_i = d\left( {}^{\eta}\tilde{\mathbf{x}}_i, {}^{\eta}\mathbf{x}_i \right)^2$. Minimizing this error for all $n$ points leads to the following least-squares estimator for the optimal pose:
$$ ^{c}\hat{\mathbf{T}}_w = \underset{^{c}\mathbf{T}_w}{\arg\min} \sum_{i=1}^{n} \epsilon_i, \quad n \ge 3. \qquad (5.2) $$
Dividing the first row of equation (5.3) by the third row, and the second row by the third row, we get two linearly independent equations for each point correspondence:
$$ \mathbf{A}_i \mathbf{h} = \mathbf{0}, \qquad (5.7) $$
where
$$ \mathbf{A}_i = \begin{bmatrix} {}^{\rho}x_i & {}^{\rho}y_i & -1 & 0 & 0 & 0 & {}^{\eta}x_i\,{}^{\rho}x_i & {}^{\eta}x_i\,{}^{\rho}y_i & {}^{\eta}x_i \\ 0 & 0 & 0 & {}^{\rho}x_i & {}^{\rho}y_i & -1 & {}^{\eta}y_i\,{}^{\rho}x_i & {}^{\eta}y_i\,{}^{\rho}y_i & {}^{\eta}y_i \end{bmatrix} \qquad (5.8) $$
and
$$ \mathbf{h} = [h_1, h_2, h_3, h_4, h_5, h_6, h_7, h_8, h_9]^{\top}. \qquad (5.9) $$
Again, assuming noisy measurements of the image coordinates $^{\eta}\tilde{\mathbf{x}}_i$, we get noisy matrices
$$ \tilde{\mathbf{A}}_i = \mathbf{A}_i + \mathbf{E}_i. \qquad (5.10) $$
From $\tilde{\mathbf{A}}_i \mathbf{h} = (\mathbf{A}_i + \mathbf{E}_i)\mathbf{h}$ we can solve for the algebraic error $\|\mathbf{E}_i \mathbf{h}\|_2^2 = \|(\tilde{\mathbf{A}}_i - \mathbf{A}_i)\mathbf{h}\|_2^2 = \|\tilde{\mathbf{A}}_i \mathbf{h}\|_2^2$ of each point, because $\mathbf{A}_i \mathbf{h} = \mathbf{0}$ holds. Minimizing the squared 2-norm over all points for the optimal homography $\hat{\mathbf{h}}$ leads to the following least-squares estimator
$$ \hat{\mathbf{h}} = \underset{\mathbf{h}}{\arg\min} \sum_{i=1}^{n} \|\mathbf{E}_i \mathbf{h}\|_2^2, \quad \text{s.t. } \|\mathbf{h}\| = 1, \; n \ge 4. \qquad (5.11) $$
Since $\mathbf{h}$ contains 9 entries but is defined only up to scale, the total number of degrees of freedom is 8. Thus, the additional constraint $\|\mathbf{h}\| = 1$ is included to solve the optimization.
Now, stacking all $\{\tilde{\mathbf{A}}_i\}$ and $\{\mathbf{E}_i\}$ as $\tilde{\mathbf{A}} = [\tilde{\mathbf{A}}_1^{\top}, \ldots, \tilde{\mathbf{A}}_n^{\top}]^{\top} \in \mathbb{R}^{2n \times 9}$ and $\mathbf{E} = [\mathbf{E}_1^{\top}, \ldots, \mathbf{E}_n^{\top}]^{\top} \in \mathbb{R}^{2n \times 9}$, we arrive at solving the noisy homogeneous linear equation system
$$ \tilde{\mathbf{A}}\mathbf{h} = \mathbf{E}\mathbf{h}. \qquad (5.12) $$
The solution of (5.12) is equivalent to the solution of (5.11) and is given by the DLT algorithm applying a singular value decomposition (SVD) of $\tilde{\mathbf{A}} = \tilde{\mathbf{U}}\tilde{\mathbf{S}}\tilde{\mathbf{V}}^{\top}$, whereby $\hat{\mathbf{h}} = \tilde{\mathbf{v}}_9$, with $\tilde{\mathbf{v}}_9$ being the right singular vector of $\tilde{\mathbf{A}}$ associated with the least singular value $\tilde{s}_9$. Usually, an additional normalization step of the coordinates of the control points and their projections is performed, leading to the normalized DLT algorithm, which is the golden standard for non-iterative pose estimation because it is very easy to handle and serves as a basis for other non-iterative as well as iterative pose estimation methods.
Once a homography is found, the normal pipeline for pose estimation is to perform a homography decomposition to find an initial pose and then refine this pose with a non-linear optimization based on the reprojection error.
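For reference, the core of the DLT fits in a few lines. The sketch below follows the common sign convention, which differs from (5.8) only by a per-row sign, and omits the coordinate normalization of the normalized DLT for brevity:

import numpy as np

def dlt_homography(src, dst):
    # src, dst: (n, 2) arrays of corresponding points, n >= 4.
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Two linearly independent equations per correspondence, cf. (5.7)/(5.8).
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    h = Vt[-1]          # right singular vector of the least singular value
    return h.reshape(3, 3)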
where $\mathbf{u}_k$ and $\mathbf{v}_k$ are the left and right singular vectors of the unperturbed matrix $\mathbf{A}$, $s_k$ the corresponding singular values, and $\mathbf{E}$ the measurement errors. Equation (5.13) clearly shows that the optimal solution for the homography, which equals the right singular vector of the unperturbed matrix $\mathbf{A}$ associated with the least singular value² $s_9 = 0$, is perturbed by the second term in (5.13). The second term is a weighted sum of the first eight optimal right singular vectors $\mathbf{v}_k$, where the weights $\mathbf{u}_k^{\top}\mathbf{E}\mathbf{v}_9 / s_k$ express the influence of the measurement errors $\mathbf{E}$ on the unperturbed solution $\mathbf{v}_9$ along the $k$ dimensions of the model space. The presence of a very small $s_k$ in the denominator can produce a very large weight for the corresponding model space basis vector $\mathbf{v}_k$ and dominate the error. Hence, small singular values $s_k$ cause the estimate $\hat{\mathbf{h}}$ to be extremely sensitive to small amounts of noise in the data
2 The singular values are arranged in descending order: s1 ≥ s2 ≥ · · · ≥ s8 ≥ s9 = 0.
and correlates with the singular value spectrum³ ($s_1 - s_8$) as follows: the smaller the singular value spectrum, the less perturbed the estimation is. It is also well known that the condition number of a matrix with respect to the 2-norm is given by the ratio between the largest and, in our case, the second-smallest⁴ singular value [43]:
$$ c(\mathbf{A}) = \|\mathbf{A}\|_2 \, \|\mathbf{A}^{-1}\|_2 = \frac{s_{\max}}{s_{\min}} = \frac{s_1}{s_8}, \qquad (5.14) $$
which is minimal if the singular value spectrum is minimal. The normalization of the control points and their projections leads to the normalized DLT algorithm, which has been shown to improve the condition of the matrix $\mathbf{A}$ [48]. Thus, we simply try to minimize the condition number $c$ of the matrix $\mathbf{A}$ with respect to all $n$ control points $\{{}^{\rho}\mathbf{x}_i \mid i = 1, \ldots, n\}$ as follows:
for each iteration $t$ and step size $\alpha(t)$, which is adapted using SuperSAB [111]; in our implementation we used autograd [28] to compute the gradients. The control point dynamics can now be used to find optimal control point configurations for pose estimation from planar markers.
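A compact sketch of this descent follows; it is our illustration, assuming autograd's SVD gradient support, and project_fn stands for a hypothetical autograd-compatible projection of the plane points into the image under the true camera pose:

import autograd.numpy as np
from autograd import grad

def build_A(pts, obs):
    # Stack the 2n x 9 DLT matrix from plane points (n, 2) and their
    # projections (n, 2), following the row structure of (5.8) up to sign.
    x, y = pts[:, 0], pts[:, 1]
    u, v = obs[:, 0], obs[:, 1]
    z, o = np.zeros_like(x), np.ones_like(x)
    r1 = np.stack([x, y, o, z, z, z, -u * x, -u * y, -u], axis=1)
    r2 = np.stack([z, z, z, x, y, o, -v * x, -v * y, -v], axis=1)
    return np.concatenate([r1, r2], axis=0)

def cond_objective(flat_pts, project_fn):
    # c(A) = s1 / s8, cf. (5.14): s[7] is the second-smallest singular value
    # for n > 4 points and the smallest for n = 4.
    pts = flat_pts.reshape(-1, 2)
    _, s, _ = np.linalg.svd(build_A(pts, project_fn(pts)), full_matrices=False)
    return s[0] / s[7]

cond_grad = grad(cond_objective)

def optimize_points(pts, project_fn, alpha=1e-3, iters=40):
    # Plain gradient descent; the SuperSAB step-size adaptation is omitted.
    flat = pts.ravel()
    for _ in range(iters):
        flat = flat - alpha * cond_grad(flat, project_fn)
    return flat.reshape(-1, 2)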
To find a lower bound, we can use the error of applying a perturbed matrix $\tilde{\mathbf{A}}$ to the true homography $\mathbf{h}$, defined as $\tilde{\mathbf{A}}\mathbf{h}$, and the error of applying the same perturbed matrix to the optimal homography estimate $\hat{\mathbf{h}}$, defined as $\tilde{\mathbf{A}}\hat{\mathbf{h}}$, to build the following inequality:
able to represent the true homography H beyond the space of the control
points.
Each simulation for a given camera pose is then performed in the following way:
1) An initial random $n$-point set $\{{}^{\rho}\mathbf{x}_i(t_0)\}$ is defined inside the circular plane.
2) For each iteration step $t$, an improved set of control points $\{{}^{\rho}\mathbf{x}_i(t)\}$ is obtained by (5.15) and projected to camera pixel coordinates $\{{}^{p}\mathbf{x}_i(t)\}$ using the true camera pose ${}^{c}\mathbf{T}_\rho$ and the calibration matrix $\mathbf{K}$. We add Gaussian noise to the projected points $\{{}^{p}\tilde{\mathbf{x}}_i(t)\}$ and use the correspondences $\{{}^{p}\tilde{\mathbf{x}}_i(t), {}^{\rho}\mathbf{x}_i(t)\}$ to calculate $\mathbf{A}(t)$ and $c(\mathbf{A}(t))$.
3) For each iteration $t$, we performed 1000 runs of the homography estimation using the normalized DLT algorithm⁷, since we want a statistically meaningful measure of the robustness of the homography estimation against noise.
4) Finally, we calculate the error metric $HE(t)$ for each run and the average $\mu(HE(t))$ and standard deviation $\sigma(HE(t))$ of this error over all runs.
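Steps 2) to 4) condense into a short Monte-Carlo routine. The sketch below is our illustration; it uses a homography transfer error as a stand-in for the exact $HE$ metric:

import numpy as np
import cv2

def mc_homography_error(obj_pts, img_pts, H_true, sigma=4.0, runs=1000):
    # obj_pts, img_pts: (n, 2) float32 control points and their noise-free
    # projections; H_true: the true 3x3 homography for this camera pose.
    rng = np.random.default_rng(0)
    pts = obj_pts.reshape(1, -1, 2).astype(np.float32)
    ref = cv2.perspectiveTransform(pts, H_true)[0]
    errs = []
    for _ in range(runs):
        noisy = img_pts + rng.normal(0.0, sigma, img_pts.shape)
        H, _ = cv2.findHomography(obj_pts, noisy.astype(np.float32), 0)  # plain DLT fit
        errs.append(np.linalg.norm(cv2.perspectiveTransform(pts, H)[0] - ref,
                                   axis=1).mean())
    return np.mean(errs), np.std(errs)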
As an illustration of the gradient minimization process, we present an example case of a simulation in a fronto-parallel camera pose for a 4-point configuration. Gaussian noise with $\sigma_G = 4$ pixels is added to the image coordinates for the homography estimation runs. In Fig. 5.3, the initial object and image point configurations are shown.
The evolution of $c(\mathbf{A}(t))$, as well as of $\mu(HE(t))$ and $\sigma(HE(t))$, is presented in Fig. 5.4. The condition number decreases drastically in the first iterations of the gradient descent, and by doing so, the mean and standard deviation of $HE(t)$ are also reduced. With more iterations, both metrics slowly and smoothly converge to a stable minimum value.
7 The homography estimation method presented in [46] and the OpenCV method based on gradient descent were also tested. For low point counts their results barely differ from those of the DLT, so the DLT was chosen for the experiments.
[Figures 5.3 and 5.4: the initial object and image point configurations, and the evolution of $c(\mathbf{A}(t))$ and of the homography error statistics $\mu(HE(t))$ and $\sigma(HE(t))$ over 40 iterations.]
$$ TE\left(\hat{\mathbf{t}}(t)\right) = \frac{\|\hat{\mathbf{t}}(t) - \mathbf{t}\|_2}{\|\mathbf{t}\|_2} \times 100\,\%. \qquad (5.22) $$
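In code, the metric of (5.22), together with a standard geodesic rotation error (our choice of RE here, since the thesis' exact definition is not reproduced in this excerpt), reads:

import numpy as np

def translation_error(t_est, t_true):
    # Relative translation error TE of eq. (5.22), in percent.
    return 100.0 * np.linalg.norm(t_est - t_true) / np.linalg.norm(t_true)

def rotation_error(R_est, R_true):
    # Angle of the residual rotation R_est R_true^T, in degrees.
    cos = (np.trace(R_est @ R_true.T) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))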
[Figure: evolution of the translation error TE (%) and the rotation error RE (°) for the IPPE, EPnP, and LM methods over 40 iterations.]
The pose error decreases during the optimization of the condition number for all methods, even for the LM algorithm, which already uses an optimization based on the reprojection error. The improvement for LM arises because this algorithm needs a good initial estimate to converge to the correct pose, and optimized control points produce a better initial estimate. We chose the fronto-parallel camera configuration because it is the most challenging; this is evidenced by the high values of the RE (more than 10 degrees). If the camera is inclined, we see the same improvement
8 For the EPnP and LM methods, the OpenCV implementations were used, and for IPPE the Python implementation provided in the author's GitHub repository.
[Figure: mean $\mu$ and standard deviation $\sigma$ of TE (%) and RE (°) over the iterations for the IPPE, EPnP, and LM methods.]
during the optimization, but the final errors in rotation are less than 5
degrees.
A real experiment⁹ was also implemented in order to test whether the simulation assumptions (Gaussian image noise and perfect intrinsics) affect the results in practical applications. A computer screen was used as the planar fiducial marker to dynamically display the points during gradient descent. A set of 4 circles was displayed for each iteration of the optimization. These circles were then captured by a PointGrey Blackfly camera¹⁰ and detected using a circle detector based on the Hough transform. We performed 100 detections for each gradient descent iteration. An Optitrack system was used to measure the camera's ground truth pose relative to the marker screen. The results of running the optimization process for a set of 4 random initial points are shown in Figures 5.7, 5.8 and 5.9.
Figure 5.7: (Real). Movement of control points in image and object coordinates
during gradient descent for the experiment with a real camera.
Figure 5.8: (Real). Evolution of the condition number and the homography
reprojection error during gradient descent using a real camera.
Figure 5.9: (Real). Detailed view of the standard deviation of each method
represented by the filled, lightly colored areas.
5.5.3 Discussion
The results show that control point configurations have a substantial effect
on the accuracy of homography and planar PnP methods. There are indeed
optimized configurations that are better than random, and it is possible
to find them using our method.
For the 4-point case, our empirical results show that a square-like shape
Figure 5.10: Robustness (condition number) and accuracy (homography error) as a function of the number of points, for well- and ill-conditioned point configurations as well as an ideal 4-point square (green line).
is the most common minimum and a very stable and robust configuration for all camera poses (see Fig. 5.12a); for 5 points, a common minimum is a pentagon (Fig. 5.12b), and for 6 points a hexagon (Fig. 5.12c). The positions of the optimized point configurations do not show any strong dependency on the pose of the camera (besides scale and image limits). The real dependency is mainly related to the distribution of the points in camera image coordinates. As we can see from the 2D histograms, the points are driven to distribute themselves as much as possible over the available space (hence the high repetition at the circle boundaries), and they tend to increase their distance to each other, which favors the creation of some common regular polygons (squares, pentagons, and hexagons). For higher point numbers, the trend is still to place as many points as possible at the boundaries.
The first iterations of the optimization are when the increase in accuracy is most substantial, which means that the condition number is a good optimization objective. For example, the improvement in accuracy from a square-like configuration to a perfect square is tiny, but the increase in accuracy from random points to the square-like shapes obtained in the first iterations of the optimization is radical; this means that with fewer than five iterations it is possible to obtain very stable and accurate configurations.
The smaller the number of control points, the higher is the relative
[Figure 5.11: TE (%) and RE (°) of the pose estimation methods as a function of the number of points, for well- and ill-conditioned point configurations as well as the ideal square.]
improvement in the estimates for all of the evaluated methods. For example, the accuracy using 4 optimized points is always better than that of random point configurations with more points ($4 < n \le 9$), as can be seen in Fig. 5.10 for the homography and in Fig. 5.11 for PnP. Thus, the configuration of the control points has more effect on the accuracy than the number of control points.
The improvement in the EPnP and IPPE methods is more pronounced
than for LM, which is an interesting result since those methods take con-
siderably less computation time. For well-configured points, the methods
converge to similar error values, and both mean and variance are reduced;
this means that well-conditioned points can be used for a fair comparison of different pose estimation methods.
Figure 5.12: Final point configurations for 4 points (a), 5 points (b) and 6 points
(c). On the left side: 2D histogram of final configurations for all 400 camera
poses. Middle and right side: Examples of optimized final point configurations
in object coordinates.
These results are significant for fiducial marker design; they validate
that a square is a very stable configuration, and, additionally, some other
regular polygonal shapes are proven to be stable and robust, e.g., the
pentagon and hexagon. The hexagon is particularly interesting because it
is also the most compact way of packing circles inside some planar limits;
this would make a hexagonal grid an excellent and stable way of optimally
distributing control points on a given surface.
We also performed some extensions of this work to non-planar configurations in [112], where we applied the DLT method to directly estimate a pose from 6 or more non-planar control points. In this case, the limits of the optimization were a sphere instead of a circle. The final results showed again that the optimal points are distributed over the available space (the sphere boundary) and arranged in regular polygonal shapes. For the 6-point configuration, the most common shape was a triangular prism inscribed in the sphere. In the 3D case, however, there may exist better metrics for the optimization than the DLT, since the algebraic error of the DLT does not match the geometric error (in the planar case it does for the homography when the weights of the homogeneous coordinates are equal to 1 [49]); one possible goal for the optimization would be the condition number of the error covariance matrix [112].
Figure 6.1: We propose a dynamic fiducial marker that can adapt over time
to the requirements of the perception process. This method can be integrated
into common visual servoing approaches, e.g., a tracking system for autonomous
quadcopter landing. The fiducial marker’s size and shape can be changed dynam-
ically to better suit the detection process depending on the relative quadcopter-
to-marker pose.
We present two different dynamic marker designs. The first design, the DDYMA, is based on discrete changes of the fiducial, selecting the appropriate shape from a family of known fiducial markers. The second design, the FDYMA, is a fiducial designed from scratch to better suit the dynamism of the screen; this fiducial changes its shape smoothly with pose changes, and the shape is designed to be optimal for pose estimation based on the results obtained in previous chapters. The DDYMA was tested in a quadcopter landing scenario, and the FDYMA was evaluated against a fixed optical tracking system.
Our proposal is novel and simple: Instead of using a static
fiducial marker, we propose using a screen to change the marker
shape dynamically. A marker that changes requires a controller, which
we must couple with the perception algorithm and the camera’s movement.
This section presents the minimal hardware/software set-up for what we
call a discrete dynamic marker and introduces a control scheme that inte-
grates conveniently into visual servoing. We will demonstrate that including a dynamic marker in the action-perception cycle of robots improves the marker's detection range, the accuracy of the pose estimation, and the robot's performance compared to a static marker, which can be advantageous despite the increase in system complexity. It is worth noting that, in previous publications, screens were used to display structured-light sequences of images for camera calibration [107], [125], [44]. However, to the
best of our knowledge, none of those applications exploited the possibili-
ties of performing dynamic changes to the image based on the perception
task’s feedback. It is precisely this feedback that makes a dynamic marker
an exciting concept for control applications.
6.1 Motivation
As explained in Chapter 3, a planar visual fiducial marker is a known
shape, usually printed on a paper, located in the environment as a point
of reference and scale for a visual task. Fiducial markers are commonly
used in augmented reality, virtual reality, object tracking, and robot lo-
calization. In robotics, they are used to obtain the absolute 3D pose
of a robot in world coordinates; this usually involves distributing several
markers around the environment in known positions or fixing a camera and
detecting markers attached to the robots. A fiducial may not be convenient for some applications due to the required intervention in the environment. For unknown environments, the preference is for other types of localization.
For MOMA, not only is the range important but also the accuracy; ideally, the magnitude of the pose estimation error should be constant and not depend on the pose. Another process in which the problems of range
and accuracy in fiducial detection are relevant is camera calibration. To
calibrate a camera, the user must sample several images of a fiducial in
different relative poses. To get a good calibration, a high point density is ideal. However, the point density is limited: the static marker is usually designed for long-range detection, which requires more display area and therefore means fewer control points. The range and accuracy requirements of pose estimation and camera calibration problems cannot be satisfied simultaneously by a static fiducial marker, and this is why a marker that can change and adapt would be useful. The transition from printed markers to screens is also natural since screens are now ubiquitous in robotic and Internet of Things (IoT) applications; it only makes sense to use their capabilities.
From now on, we will refer to these three modules as the Dynamic
Marker. All these elements are commonplace nowadays. During our re-
search, we used a foldable laptop, iPads, and smartphones as dynamic markers.
Figure 6.4: Dynamic marker control diagram.
where (w, h) are the numbers of pixels in the horizontal and vertical directions, respectively. Each element Dij(t) = [r, g, b]ᵀ contains the required intensity of the RGB LEDs at row i and column j of the screen.
The dynamics of the plant depend on the individual LED switching characteristics. There is a finite amount of time required to change the intensity level of a single LED, and this time depends on the LED technology; for example, regular consumer-grade LEDs have an 80% stabilization time of around 4.8 ms (see Fig. 6.5), while organic LEDs have a much faster response time of less than 0.1 ms (see Fig. 6.6).
For the dynamic marker, the transitory state of the LED is not relevant; what matters is the final stable state. With this premise, we model our plant as an ideal screen whose LEDs have a 0 ms switching time, preceded by a time delay with the value of the real LED switching time, as shown in the Display section of Fig. 6.4. In the time-delay block, we can also include other potential delays, such as the communication between the controller and the display device or rendering delays, which are device dependent and will be analyzed in detail in further sections.
We define the center of the screen as the origin of the marker coordinate system, and since we know the pixel pitch and the (w, h) values, we also know, with sub-millimeter accuracy, the position of each pixel Dij in relation to the marker coordinate system.
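As a minimal illustration of this mapping (the pixel pitch value below is hypothetical), the metric position of pixel Dij in the marker frame follows directly from the pitch and the screen resolution:

```python
import numpy as np

def pixel_position(i, j, w, h, pitch):
    """Metric position of pixel D_ij in the marker frame (sketch).

    The marker origin is the screen center; X grows to the right and
    Y to the bottom. `pitch` is the pixel pitch in meters (the value
    used below is a hypothetical example, not a measured one).
    """
    x = (j - (w - 1) / 2.0) * pitch   # column index -> X
    y = (i - (h - 1) / 2.0) * pitch   # row index    -> Y
    return np.array([x, y, 0.0])      # the screen plane is Z = 0

print(pixel_position(0, 0, 1920, 1080, 233e-6))  # top-left pixel
```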
Each position p rij in camera pixel coordinates, plus the color measurement of that pixel, for all the screen pixels, comprises the vector of measurements m(t). In the next step, a Feature Extraction algorithm obtains the main desired parameters from the m(t) vector. In our circle example, the task of this step is to detect the circle and calculate its diameter pφ in image pixel units; however, we can extract additional features useful for pose estimation, such as the circle border, the circle center, and the circle's inner color. All the detected features are collected into a vector of general features f(t) and passed to the pose estimation step, where, depending on the available features, an estimate of the marker-to-camera pose c T̃m can be calculated and fed back to the controller. The pose estimation module also builds the final vector s(t), which is then compared to s∗(t) to produce the error and close the loop.
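To make the circle example concrete, the following is a minimal sketch of such a feature-extraction step (assuming an undistorted grayscale capture in which the marker dominates the scene; cv2.fitEllipse implements a direct least-squares ellipse fit):

```python
import cv2
import numpy as np

def extract_circle_features(gray):
    """Detect the displayed circle and measure its diameter in pixels.

    Binarize the capture, take the largest contour, and fit an ellipse
    to it. Returns the center (u, v) and the mean diameter p_phi in
    image pixels; a real detector would add further sanity checks.
    """
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    best = max(contours, key=cv2.contourArea)   # assume the circle dominates
    (u, v), (d1, d2), _ = cv2.fitEllipse(best)  # needs >= 5 contour points
    return np.array([u, v]), 0.5 * (d1 + d2)    # center and diameter p_phi
```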
At this point, there are two scenarios. The first one is when the available data is enough to estimate a pose (i.e., we have a calibrated camera and enough features); in this case, the pose estimation will output the estimated pose c T̃m and will set the s(t) vector to zero, so the input error to the controller will be directly e(t) = s∗(t); this is because, in this scenario, the controller can directly compute the output a(t) from c T̃m, the camera matrix, and the vector s∗(t) by simply performing an inverse projection of the desired features.
For example, to calculate the circle diameter in meters φ, which is one of the a(t) parameters, we can build a homogeneous vector with the desired diameter in camera pixels [pφ∗, 0, 1]ᵀ and directly calculate the diameter in meters φ to be displayed by the screen as follows:
\[
\begin{bmatrix} \phi \\ 0 \\ 0 \\ 1 \end{bmatrix}
= {}^{m}\tilde{T}_{c}\, K^{-1}
\begin{bmatrix} {}^{p}\phi^{*} \\ 0 \\ 1 \end{bmatrix} .
\tag{6.6}
\]
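A small numeric sketch of this inverse projection (the homogenization with weight 1 before applying the 4×4 transform is an assumption of this sketch, and the toy intrinsics are illustrative):

```python
import numpy as np

def display_diameter(p_phi_star, K, T_m_c):
    """Inverse projection of the desired pixel diameter (cf. Eq. 6.6).

    p_phi_star -- desired circle diameter in camera pixels
    K          -- 3x3 camera intrinsic matrix
    T_m_c      -- 4x4 homogeneous camera-to-marker transform
    Returns the metric diameter phi to display on the screen.
    """
    ray = np.linalg.inv(K) @ np.array([p_phi_star, 0.0, 1.0])
    # Homogenize with weight 1 and map into marker coordinates as in
    # Eq. (6.6); the diameter lies along the marker X-axis.
    return (T_m_c @ np.append(ray, 1.0))[0]

# Toy intrinsics with the principal point at the image origin
K = np.diag([800.0, 800.0, 1.0])
print(display_diameter(100.0, K, np.eye(4)))   # -> 0.125 m
```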
The second scenario is when it is not possible to obtain a pose estimate. In this case, the pose estimation block will set the output of the estimated pose to zero and output the vector s(t) of detected features, which is then subtracted from s∗(t) to obtain the error. In our circle example, the measured vector of features will simply be the measured circle diameter in pixel units, s = [pφ], and the error input to the controller: e(t) = [pφ∗ − pφ]. The error can then be used by a simple controller (e.g., a PID controller) to generate the a(t) vector, closing the loop.
Figure 6.7: A dynamic marker used with a visual servoing control approach.
(Timing of one control cycle: after the trigger, the screen refresh completes at tr and the marker is ready; the frame grab follows, and the image is available at ti; after detection, the pose/features and the new command a(tp) are available at tp.)
At tp, the detector has estimated the marker features, calculated the camera pose, and transmitted the information back to the controller. At this point, the cycle can be repeated.
The delays that define the loop duration are then the communication
delays, the screen refresh delay (tr − ts ), the image capture delay (ti − tr ),
and the detection delay (tp − ti ). We are going to analyze each one of
them independently.
Screen refresh delay: The refresh rate of the screen limits how fast we can change images on the screen, and it defines an absolute minimum for the screen refresh delay. The achievable refresh rate depends on the screen technology. For example, LCDs have a higher refresh rate than e-ink displays.
displays. In the year 2020, commercial gaming monitors reach refresh
rates of up to 360 Hz and E-Ink displays up to 7 Hz; these numbers may
seem fast enough compared to regular camera refresh rates. However,
the refresh rate figure of a monitor can be misleading. The actual time required to change a single pixel's brightness on a monitor is called the reaction time; for gaming monitors, this delay is around 1 ms, which is shorter than the refresh interval. However, to change a pixel, the information has to be rendered and transmitted to the monitor through a cable using some protocol (e.g., High-Definition Multimedia Interface (HDMI)); once transmitted, most monitors apply some preprocessing to the input image before displaying it line by line at the refresh rate of the monitor. The time that elapses between the moment a new frame is sent to the monitor's input and the moment it is displayed is called the input lag, and it is generally much longer than the monitor's refresh interval. For competitive gaming monitors, the input lag can be around 5 ms, and for general-purpose monitors, greater than 15 ms. Moreover, the input lag does not include the time needed by the processing unit to render the image on the graphics card and send it through the cable, and since there are many variables involved (graphics library, operating system, drivers, input lag, refresh rate), it is necessary to measure the complete latency path.
The best way to measure the lag from generation to display (from now on, "display lag") is to create a program that changes the screen on user input and then use a high-framerate camera to record the screen and the user input (e.g., mouse movement) simultaneously. In the recorded video, one can count the number of frames between the mouse movement and the associated change on the screen and then use the camera framerate to calculate the elapsed time; this assumes a mouse with a low and known input lag. Another way to measure the display lag is to do it automatically by using a high-framerate camera connected to the same computer that generates the image. Since the computer knows precisely when it generated the frame and the instant of time the camera took the images, it can perform the calculation automatically. As a reference, in our tests we used a 144 Hz gaming monitor (LG 27GL850) with an input lag of 4.7 ms and a display lag of 12.5 ms at native resolution.
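A minimal sketch of the automatic variant (the three callbacks are hypothetical placeholders, not a specific camera or display API; we assume the camera frames carry capture timestamps):

```python
import time

def measure_display_lag(show_frame, grab_frame, is_bright, trials=50):
    """Estimate the display lag by flashing the screen and timing
    when the camera first sees the change (sketch).

    show_frame(white) -- renders a full white/black frame (hypothetical)
    grab_frame()      -- returns (image, capture_timestamp) (hypothetical)
    is_bright(image)  -- True once the white frame is visible
    """
    lags = []
    for _ in range(trials):
        show_frame(False)
        time.sleep(0.2)                 # let the screen settle
        t_render = time.monotonic()
        show_frame(True)                # the change we want to time
        while True:
            image, t_capture = grab_frame()
            if is_bright(image):
                lags.append(t_capture - t_render)
                break
    return sum(lags) / len(lags)        # mean display lag in seconds
```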
Image capture delay: This delay is measured from the moment that
a signal is sent to the camera to capture a new frame (in trigger mode)
until the moment the image data is in the detector device's memory. The image capture delay comprises three parts: the time for the camera firmware to receive and process the trigger signal, the duration the shutter remains open (exposure), and finally the time required to transmit the image to the computer that performs the detection. The exposure mainly drives the image capture delay since it is the part that takes the longest. A typical value for the image capture delay in indoor conditions is around 5 to 15 ms.
Detection delay: There are mainly two tasks that the detector has to perform. The first is the processing required to detect the marker's main features in the image, which includes the position of the control points in image coordinates and a general evaluation of the image quality. The second consists of calculating the pose c Tm using a(t) and the detected control points. Both tasks combined can take from 5 to 20 ms, depending on how well optimized the image processing algorithms are and how complicated the non-linear optimization for the pose estimation is.
Including all the above delays, the synchronized control loop can last from 25 to 55 ms, which translates into a loop frequency range of 18 to 40 Hz. This may be too slow for some cameras since it would not take advantage of the full frame rate of a fast camera (e.g., 60 fps). The problem is that the control loop works sequentially, but we can parallelize some of the steps. Let's assume for simplicity that the delays are: communication 1 ms, screen refresh 20 ms, image capture 10 ms, and detection 10 ms; in this scenario, the sequential control loop will be as shown in Fig. 6.9. However, this can be optimized if we execute some of the stages in parallel. In Fig. 6.10, we present a concurrent execution for the same delays; this exploits the fact that the actual change of the pixels happens at the end of the 20 ms screen refresh delay, and this change only lasts 1 ms (reaction time). Notice that we can take two consecutive images per displayed marker, and only the first loop has a latency; afterward, we obtain the correct pose at the camera's maximum speed.
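The overlap can be sketched as three concurrent stages; the toy model below (using the example delays above; the image payload and the pose computation are placeholders, not a real camera or display API) illustrates the pipelining idea:

```python
import threading, queue, time

frames = queue.Queue()
stop = threading.Event()

def display_loop():
    # Screen stage of the toy model: a new marker takes effect only
    # after the 20 ms refresh delay; the pixel change itself is fast.
    while not stop.is_set():
        time.sleep(0.020)

def capture_loop():
    # Camera stage: grab a frame every 10 ms, overlapping the refresh.
    while not stop.is_set():
        time.sleep(0.010)
        frames.put(time.monotonic())        # placeholder for an image

def detection_loop():
    # Detector stage: 10 ms of processing per frame, running while the
    # next refresh and grab are already underway.
    while not stop.is_set():
        t = frames.get()
        time.sleep(0.010)
        print(f"pose ready, frame age {time.monotonic() - t:.3f} s")

for target in (display_loop, capture_loop, detection_loop):
    threading.Thread(target=target, daemon=True).start()
time.sleep(0.1)
stop.set()
```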
Figure 6.9: Example of a sequential control loop for the dynamic marker where
each one of the stages is executed in sequence. The control loop will take more
time than the camera refresh rate, which is not ideal although easier to imple-
ment.
Figure 6.10: Example of a parallelized control loop for the dynamic marker.
The stages are overlapped; hence, the output of the estimated pose is at the
same rate as the camera.
Figure 6.11: An ideal marker for landing should have a size that allows a
successful detection while leaving some extra room in the field of view for both
lateral and rotational movements of the camera.
The angle α is an input defined by the user, and in each loop, the
controlled variable ms will be calculated by the controller by applying eq.
6.8 using the measured mcd and α.
Now that we have defined the marker's automatic scaling, the only missing part is the control rule for the discrete change between different marker families. We define a switch distance sd at which we decide which marker family to use: when mcd > sd, the Whycon marker is selected, and when mcd ≤ sd, Aruco is selected. The value of sd is obtained experimentally by measuring the maximum mcd at which Aruco is still reliably detected. We also exploit another feature of Aruco, the so-called board of markers, which is a grid of Aruco markers on the same surface, each with a different ID but sharing a common coordinate origin. When the size of the Aruco marker is smaller than the available screen space, the rest of the screen is filled with additional Aruco markers from a predefined Aruco board; all of them form part of the same coordinate system, increasing the accuracy at close range. At the start of the system, the initial marker is Whycon to assure detection; the resulting switching rule is sketched below.
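A minimal sketch of this discrete switching rule (the scaling of Eq. (6.8) is abstracted into a callable, since only its inputs mcd and α matter here; all names and the example gain are illustrative):

```python
def select_marker(m_cd, sd):
    """Discrete family switch of the DDYMA: Whycon far away, Aruco close."""
    return "whycon" if m_cd > sd else "aruco"

def ddyma_step(m_cd, alpha, sd, scale_from_eq_6_8):
    """One controller iteration: pick the family and its on-screen size.

    m_cd              -- measured marker-to-camera distance in meters
    alpha             -- user-defined view-angle margin (input to Eq. 6.8)
    scale_from_eq_6_8 -- callable implementing Eq. (6.8), not reproduced here
    """
    family = select_marker(m_cd, sd)
    m_s = scale_from_eq_6_8(m_cd, alpha)   # marker size in meters
    return family, m_s

# Toy stand-in for Eq. (6.8): size proportional to distance and margin
family, size = ddyma_step(m_cd=2.0, alpha=0.3, sd=1.2,
                          scale_from_eq_6_8=lambda d, a: 2 * d * a)
print(family, size)   # far away -> whycon 1.2
```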
Figure 6.12: Camera frames during the landing procedure. Notice the change
from Whycon to Aruco in the third frame and the start of the yaw rotation
correction. In frames 4, 5, and 6, it is possible to see the dynamic change of the scale.
The position of the quadcopter during landing can be seen in Fig. 6.13.
Notice how the yaw angle of the quadcopter is corrected as soon as the
dynamic marker changes into Aruco at t = 14 s. The landing was per-
formed smoothly with a final position error of 3.5 cm from the center of
the marker. The subsequent frames of Fig. 6.12 show how the display
changes according to dynamic marker design, first from Whycon to Aruco
at sd and then gradually reducing Aruco scale and filling empty spaces
with the Aruco board of markers. Extensive testing was performed with
this setup, with more than 50 successful landings and an average error of 4.8 cm. In comparison, if a static marker is used, either the detection range is limited, so landing from the same height is impossible when using Aruco, or it is impossible to align the quadcopter with the landing platform when using Whycon; this demonstrates the advantages of a DDYMA for visual servoing based landing.
If, for example, the displayed marker size is reduced by half without updating a(t), then the pose estimator will calculate a "virtual" height that is twice as high as the real one, and the platform will move down to compensate; this can be used to control the platform unilaterally by only changing the marker, e.g., the heading of the quadcopter may be controlled by rotating the dynamic marker.
Shape
We use the term shape, in this context, as the total structural configuration
of the fiducial and the term features as the essential structural elements
used to build the shape. The most common fundamental structural elements used in planar fiducial markers for feature representation are circles and corners (as discussed in section 3.2). Other basic features exist, but they are either harder to detect or less invariant to perspective transformations. When designing a new marker, the main question is then: which features are better for detection, circles or corners? The answer depends on both the achievable detection accuracy of the feature and the desired flexibility of the detection process. To define the features that best suit our design, we need to understand the details of their detection process.
For corners, detection algorithms work either by detecting edge intersections or by finding the saddle point using surface fitting of the intensity around a corner point [11]. Of the two options, edge intersections are more accurate. The currently preferred pipeline is an initial pass with a fast edge-based corner detector such as the Harris corner detector [47], followed by a sub-pixel corner refinement step based on local gradients.
For ellipse detection, there are mainly two options: centroid extraction and ellipse fitting [75], where the latter is more accurate. The best algorithm for ellipse detection in terms of computation speed and accuracy is arguably the direct least-squares fitting of ellipses [33].
In terms of complexity and detection time (which becomes critical with
many features), the corner detection methods are slightly faster than el-
lipse fitting methods. However, both are highly optimized nowadays and
easily accessible in computer vision libraries (e.g., OpenCV).
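A compact OpenCV sketch of the two pipelines discussed above (assuming a grayscale capture is available on disk; the detector parameters are illustrative defaults):

```python
import cv2
import numpy as np

gray = cv2.imread("capture.png", cv2.IMREAD_GRAYSCALE)  # assumed to exist

# Corner pipeline: Harris-style detection, then sub-pixel refinement
# based on local gradients (cv2.cornerSubPix).
corners = cv2.goodFeaturesToTrack(gray, maxCorners=50, qualityLevel=0.01,
                                  minDistance=10, useHarrisDetector=True)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)
corners = cv2.cornerSubPix(gray, np.float32(corners), (5, 5), (-1, -1),
                           criteria)

# Ellipse pipeline: binarize, extract contours, and fit ellipses with
# the direct least-squares method [33] behind cv2.fitEllipse.
_, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(bw, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
ellipses = [cv2.fitEllipse(c) for c in contours if len(c) >= 5]
```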
The accuracy of corner and circle features is hard to evaluate since it depends on the final problem to be solved. For example, when working with these features in synthetic images without perspective projection, circle features are the most accurate [75]. However, in camera calibration or pose estimation problems, images are influenced by both perspective transforms and camera lens distortion. The real position represented by a feature will be affected first by the intrinsic errors of the detection method (detection bias), and then by the perspective and distortion bias. Some of these biases can be corrected or minimized; others cannot.
The saddle point methods for corner detection are unaffected by per-
spective and distortion bias (but they have a higher detection bias). Edge-
based methods do not suffer from perspective bias since lines project as
lines in captured images. However, a distorted line is a curve, and this
affects the detection of the corner, producing a distortion bias. Luckily,
the subsequent sub-pixel corner refinement step minimizes the effect of the
distortion bias since it works in a local region where the effect of distortion
is minimal.
Circles, on the other hand, are subject to both perspective and distortion bias. The perspective transform converts circles into ellipses, and the center of the ellipse is not the same as the projected circle's center. The perspective bias can be corrected by using the center of the projected conic instead [75]. Additionally, the perspective bias is negligible if the diameter of the circle is between 10 and 20 pixels in the camera image.
There is an additional bias, called the bloom effect, that can occur
on cameras when capturing images of scenes with varying illumination,
such as outdoor scenes [76]. Blooming happens when a high amount of
light causes the white areas of the image to bloom out (or bleed) into
the surrounding areas; this effect cannot be countered by exposure time
alone. Blooming is particularly relevant for screens since full white pixels
can leak into full black pixels. Corner features are significantly affected by
this problem since the bloom moves the edges inwards, which influences
the corner position estimation. The centers of circles, on the other hand, are more invariant to this effect because the circle edges are affected roughly equally by the blooming, leaving the center unperturbed.
For camera calibration, the corner edge detection methods with sub-
pixel refinement are the best option due to a combination of lower detec-
tion bias plus a minimized distortion bias. When calibrating a camera, the
user who performs the calibration also controls the environment's lighting conditions, so the bloom effect is not a determining factor in this case. As an alternative, it is possible to use small-radius circles to minimize the introduced distortion bias while keeping the detection bias low [75].
In the case of pose estimation, which is the main aim of a Dynamic
Marker design, circles are the ideal feature due to several reasons. First,
the camera intrinsic parameters and the lens distortion coefficients are
known beforehand and used to undistort the captured image, so the dis-
tortion bias in circles is no longer a problem. Even though both corners and circles of appropriate size are free of perspective bias, circles are less affected by the blooming effect, and they have a lower detection bias than corners.
Now that we have discussed the characteristics of the features and identified the optimal one, we can look more closely at the fiducial's actual shape. A fiducial marker for pose estimation is not made of a single individual feature. Several features have to be arranged in a geometric configuration that provides enough information to extract the pose. In essence, a shape is built from a group of individual features, and the designed shape should provide a structure that allows a detector to recover distance, rotation, and identification. We can analyze how state-of-the-art markers are built using the aforementioned basic features and use this as a base for our design. For example, in the case of corner features, the minimal structural shape is a square or rectangle (4 corner features) with a coding inside; examples of this design are AprilTags, Aruco, and other similar markers presented in section 3.2. For circles, full pose recovery requires at least two circles; a minimal structural shape example of this configuration is Whycode [70], which uses two concentric circles with a ring of circular coding inside. Fiducials built only from circles, such as Pi-Tag and Rune-Tag, suffer from increased detection time because many contours in an image can resemble circular blobs, and these may be present in a wide variety of scales. That is why circular fiducial designs need to introduce other geometric relationships between the circles to separate them from circular blobs that occur naturally. Squared fiducials, on the other hand, are faster to detect since 4-corner polygon shapes are not that common in nature and are easier to detect.
Based on the previous information, the proposed shape for the FDYMA
is a mixture of circles and corner features in the shape of a quadrilateral
polygon with circles inside, see Fig. 6.14. This marker can be detected
fast and efficiently by exploiting the already available pipeline for square
markers; besides, the circles inside the quadrilateral will provide extra ac-
curacy, which is desired in pose estimation. We define the distribution
of the circle centers inside the quadrilateral as a honeycomb (hexagonal packing); this is based on the results obtained in Chapter 5: the homography estimation, which is critical for planar PnP, is more accurate when the control points are distributed appropriately in space, and one of the most uniform distributions of six control points in space is the hexagonal grid. Hexagons are also a convenient configuration for packing circles inside quadrilaterals since the distance from each circle's center to its neighbors is always the same.
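Such a honeycomb of circle centers is straightforward to generate; a small sketch (the parameter names mirror the controller parameters introduced later, but absolute metric values are assumed here for simplicity):

```python
import numpy as np

def hex_centers(width, height, d, cs, ss):
    """Circle centers in a hexagonal (honeycomb) packing (sketch).

    width, height -- quadrilateral size in meters
    d             -- circle diameter in meters
    cs, ss        -- circle separation and border separation in meters
    """
    pitch = d + cs                     # horizontal center-to-center distance
    row_h = pitch * np.sqrt(3) / 2     # vertical distance between rows
    centers, y, row = [], ss + d / 2, 0
    while y <= height - ss - d / 2:
        x = ss + d / 2 + (pitch / 2 if row % 2 else 0)  # offset odd rows
        while x <= width - ss - d / 2:
            centers.append((x, y))
            x += pitch
        y += row_h
        row += 1
    return np.array(centers)

print(len(hex_centers(0.30, 0.20, d=0.03, cs=0.01, ss=0.01)))  # -> 33
```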
Color
White and black color combinations are the most common designs in pla-
nar fiducials because of the complexity of working with color. In our
case, we will maintain this principle to keep simplicity, with a minor ex-
ception for orientation and identification, as we will explain below. On
printed fiducials, the most common design is a black filled shape on a white background (e.g., a black square with internal black/white squares for the code); this is because the designers usually assume that the fiducials will be printed on white paper, so black filled shapes provide the highest contrast on the white paper surface. In our case, we have the flexibility of
a screen with its full range of brightness; however, for the color selection,
we have to consider the border of the screen (aka monitor bezel), which
Figure 6.14: The proposed FDYMA structural shape consists of a set of circles placed in a honeycomb configuration inside a quadrilateral. The blue equilateral
triangles are shown only as a reference of the internal hexagonal distribution of
the centers. The aspect ratio of the quadrilateral can be changed to suit the
needs of the detection.
in most monitors is black, but it can also be white or gray. The marker should use the maximum number of pixels on the screen; this means that we can use the monitor bezel as the marker border at the biggest marker size. Based on the bezel constraint, we configure the Dyma marker as a
white background with black circles when the monitor bezel is black and
the inverse when the bezel is white, see Fig. 6.15. When we scale the
marker down, we expand the same color of the bezel to the inside of the
screen for maximum contrast.
The lack of colors on the basic marker design may seem restrictive when
one thinks about the color gamut of modern computer screens, but the
design is targeted for detection simplicity, and colors are usually complex
to detect accurately. However, the presence of other colors can be helpful
if used sparingly, for example, in marker orientation definition and identi-
fication.
Orientation
The process of full pose estimation needs non-symmetrical elements to
estimate the orientation. Planar marker designs usually introduce these
non-symmetrical elements within the internal code that is used for identi-
fication. The Dyma design, in contrast, solves the identification and ori-
entation problem with a layered approach based on the screen dynamism;
this means that we can define asymmetry by modifying only a single feature. We propose a simple change in the circle feature closest to the bottom left corner of the quadrilateral, as shown in Fig. 6.16.
Figure 6.15: Basic black and white color schemes for the FDYMA. a) White
quadrilateral with black circles when the screen bezel has a dark color, b) the
dark bezel is extended with black pixels when the marker is scaled down, c)
black quadrilateral with white circles when the screen bezel has a lighter color,
d) the light bezel is extended with white pixels when the marker is scaled down.
This circle is denominated the key circle. There are several options to make this feature different from the others. The first option is to remove the circle completely; this has the advantage of not requiring extra processing, but it costs one control point for pose estimation. Another alternative is to replace the circle with another basic structural shape, e.g., a triangle, but this requires extra processing. The better option is to introduce a color or a shade of gray inside the circle while maintaining proper border contrast to assure contour detection. A colored circle with a border would work but, depending on the inner color, the contour detector may detect two concentric circles, which increases detection complexity. A solution is to use a gradient from the border to the desired color. We selected this last option for our design: a filled circle with a color gradient is the default option.
Figure 6.16: Options for the key circle: a) circle removed, b) circle replaced by another shape, c) colored circle with a contrasting border, d) filled circle with a color gradient (the selected design).
Size
Similarly to the DDYMA, in this design, we can change the scale of the
FDYMA fluidly, but now fine-tuning the shape to the particular camera
pose. We can select, for example, the size of the quadrilateral and the
size of the circles independently. Moreover, we can move the center of the
quadrilateral to any place inside the screen coordinates. Scaling the circles
also changes the number of circles on the marker (the bigger the circles, the fewer of them fit inside the quadrilateral). The FDYMA controller is then in charge of changing several positional/scale parameters of the marker simultaneously, depending on the camera pose: diameter of the circles, circle separation, number of circles, quadrilateral size, and quadrilateral position.
Identification
Fiducial identification is, at its essence, a block of information that has to
be transmitted to the detector. For static markers, this block of data has
to be statically coded inside the fiducial, and this requires a significant
portion of the available display area. The amount of data transmitted in
the block is directly related to the size of the marker family (the number
of markers that have to be identified uniquely). The amount of data
transmitted is also increased if error-correcting codes are included.
For marker identification, we are going to use a layered approach. Each
layer will provide extra complexity, allowing us to separate the marker
from other similar shapes and discern one marker from another. The first
and second layers are the quadrilateral border and the inner circles, re-
spectively. Every contour that is not a quadrilateral with circles inside
is discarded. The third layer is the number of circles: only the quadrilaterals with the correct number of circles inside at any given moment are selected. A fourth layer is the color of the key circle. These initial
four layers already segment most of the environment’s shapes with no ex-
tra overhead on the detection algorithm, except perhaps another dynamic
marker with the same scale. The design allows the definition of an inter-
nal code in the circles, if necessary, by alternating black and white circles; however, this is not necessary due to a vital characteristic of the dynamic marker: the feedback loop. In the rare case that another shape has the exact characteristics of a displayed marker at a given time, the controller can change the displayed marker in the next iteration by either inverting the colors of the circles and background or changing the color of the key circle, see Fig. 6.17. This means that the identification information for complex scenarios (several dynamic markers working together) can be transmitted to the detectors by any method of optical communication via visible light; this is an advantage of the FDYMA since traditional fiducial shapes require additional space in the form of extra features to transmit the identification. With a FDYMA, we exploit the additional display area to present more accurate control points. In essence, thanks to the screen, we move the data transmission complexity from the feature domain to the time domain.
Figure 6.17: Two consecutive captures (n and n + 1) of the dynamic marker; between captures, the controller inverts the circle and background colors to resolve an identification ambiguity.
Robustness
The FDYMA design optimizes the displayed shape for detection and pose
estimation, but the robustness of the detection will depend on additional
factors related to the screen. Depending on the screen technology used,
there may be problems related to glare and viewing angles. Typical commercial LCD screens are generally intended for indoor use and are designed to emit light optimally when viewed straight on; if the viewing angle increases, the viewer will perceive certain areas of the screen as dimmed, and cameras also perceive this effect. Nonetheless, if the viewing angle is a limiting factor, other screen technologies, such as OLED displays, which offer a wide viewing angle because each pixel emits its own light, can be selected.
6.4.2 Detection
The detection of the designed shape on captured images follows the same
initial steps of the Aruco marker detection implemented in OpenCV v4.2
and comprises the following steps:
1) Adaptive threshold. The first step is to binarize the input to
get a black and white image for contour extraction. For every pixel, we
apply a threshold value. If the pixel value is smaller than the threshold,
we set it to 0, otherwise, to a maximum pixel value. If the threshold is
predefined, the binarization will have different results depending on the
scene lighting. There are methods to find the threshold automatically,
such as OTSU’s algorithm, but they increase computation time if the
input image is large. One solution that keeps computation time low is
to use adaptive thresholding; this technique obtains the best threshold
for a region surrounding a pixel, and it works best in images that vary in
illumination. When using adaptive thresholding, the size of the window
to evaluate around a pixel is the main parameter to configure. In order to
be flexible to different illumination changes, we binarize the input image
three times using a different window size each time, which will produce
three different binarized images. The user can configure the size of the
three windows in runtime.
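A minimal sketch of this first step (the window sizes and the constant C are illustrative, user-configurable defaults, not the exact values used in our implementation):

```python
import cv2

def binarize_three(gray, window_sizes=(7, 15, 31)):
    """Step 1 of the detection: adaptive thresholding with three
    window sizes. Returns one binary image per window size; the
    block sizes must be odd (illustrative defaults shown).
    """
    return [
        cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                              cv2.THRESH_BINARY, blockSize=w, C=7)
        for w in window_sizes
    ]
```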
remove the quadrilateral candidate. Since we know the area ratio of the quadrilateral to the circles inside it, we can further remove candidates that have the wrong ratio, because this ratio is perspective invariant.
6.4.3 Controller
The controller of the Discrete Dynamic Marker (DDYMA) used only the
relative distance between the marker and the camera to control the scale
and select the marker family. Now, the FDYMA controller will control
several parameters of the marker simultaneously depending on the re-
quirements of the detection: background/foreground colors (bgc, f gc), key
circle color (kc), circle diameter (d), circle separation (cs), screen separa-
tion (ss), quadrilateral scale (sx, sy) and quadrilateral position in screen
coordinates (x, y). All these parameters are included in the control signal
a(t), which is sent to the Display and Detection modules. The positional
and size parameters are defined in meters. A summary of the controlled
parameters is presented in Fig. 6.18.
Each of the parameters can be manually configured by the user or
controlled automatically. To make the automatic marker changes from
pose to pose more organic, we can define some relative parameters. We
define the parameters ss and cs as a percentage of the circle diameter cd,
and the scale of the quadrilateral in X and Y direction as a fraction of
the screen size: 0 ≤ sx, sy ≤ 1. For example, when sx = 1 and sy = 1, the marker uses the whole screen. Based on these parameters, we then calculate the maximum number of circles that fit on the marker.
Figure 6.18: Main controlled parameters of a FDYMA. Left side: colors, position, and size of the circles inside the quadrilateral. Right side: scale and position of the quadrilateral on the display device. All the parameters are defined in meters. The center of the right-handed marker coordinate system is in the center of the display device, with the positive X-axis to the right, the Y-axis to the bottom, and the Z-axis going into the display plane.
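Resolving the relative parameters into absolute sizes and counting the circles that fit can then reuse the hex_centers sketch shown earlier (the screen dimensions and parameter values below are hypothetical):

```python
# Relative FDYMA parameters (cs and ss as fractions of the diameter d,
# sx and sy as fractions of the screen size) resolved to meters; the
# hex_centers function is the sketch from the shape discussion above.
screen_w, screen_h = 0.60, 0.34     # hypothetical monitor size in meters
sx, sy = 0.8, 0.8                   # quadrilateral scale
d = 0.03                            # circle diameter in meters
cs_rel, ss_rel = 0.5, 0.3           # separations relative to d

quad_w, quad_h = sx * screen_w, sy * screen_h
centers = hex_centers(quad_w, quad_h, d, cs=cs_rel * d, ss=ss_rel * d)
print(f"{len(centers)} circles fit on the marker")
```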
Now we need to control a(t) using the features detected from previously captured camera frames m(t). We control the scale of the quadrilateral in a similar way to the scale of the DDYMA, as explained in section 6.3.2, but now we also control the position of the quadrilateral's center (x, y) so that it stays as close as possible to the middle of the camera image. The colors are defined by default depending on the monitor color and only change in case of an identification ambiguity. What is left to control is the circle distribution inside the quadrilateral. Given that we defined the positional parameters of the circles relative to the circle diameter, the only parameter we need to control is the circle diameter.
To control the circle diameter, we first need to define what an optimal diameter is. The main question is: which circle size allows the most precise detection of its center? When we talk about size, we refer to the circle's size in camera image pixels. From [75] we know that small circles (less than 10 pixels in diameter) are best for avoiding the effects of perspective and distortion bias. Another argument in favor of small circles is that they occupy less space, allowing us to display more control points. However, in practice, the distortion bias is not relevant to us because we are working with undistorted images, the perspective bias can be corrected in the pose estimation step, and small circles have a higher detection bias in comparison.
\[
\begin{bmatrix} {}^{m}d \\ 0 \\ 1 \end{bmatrix}
= {}^{m}H_{p}
\begin{bmatrix} {}^{p}d \\ 0 \\ 1 \end{bmatrix} .
\tag{6.10}
\]
The second way does not require the pose; this is useful for camera calibration applications or when the camera intrinsics are not available. Recall that the process of camera calibration requires the collection of control points displayed on a plane for different poses of the camera. We found that, to control the circle diameter without a pose, it is enough to use a simple proportional controller.
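A minimal sketch of such a proportional update (our own formulation for illustration; we assume the controlled error is the difference between the desired and measured circle area in image pixels, matching the circle-area reference value used later in the experiments, and the gain is illustrative):

```python
def diameter_update(d, area_ref, area_meas, k_p=1e-6):
    """Proportional control of the displayed circle diameter (sketch).

    d         -- current metric diameter on the screen
    area_ref  -- desired circle area in image pixels (e.g., 500 px^2)
    area_meas -- measured circle area in image pixels
    k_p       -- proportional gain in meters per px^2 (illustrative)
    """
    return max(d + k_p * (area_ref - area_meas), 0.001)  # keep d positive

d = 0.02
for area in (180.0, 320.0, 450.0):   # circle shrinks as the camera recedes
    d = diameter_update(d, area_ref=500.0, area_meas=area)
print(round(d, 5))                   # the displayed diameter grew
```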
A conic section S can be expressed by the implicit equation

\[
s_1 x^2 + s_2 xy + s_3 y^2 + s_4 x + s_5 y + s_6 = 0 ,
\tag{6.12}
\]

which in homogeneous coordinates takes the matrix form

\[
\begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix}
\begin{bmatrix}
s_1 & s_2/2 & s_4/2 \\
s_2/2 & s_3 & s_5/2 \\
s_4/2 & s_5/2 & s_6
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = 0 .
\tag{6.14}
\]

The center of the conic is given by

\[
C(S) = -\begin{bmatrix} s_1 & s_2/2 \\ s_2/2 & s_3 \end{bmatrix}^{-1}
\begin{bmatrix} s_4/2 \\ s_5/2 \end{bmatrix} ,
\tag{6.16}
\]

and, under the homography H, the conic of the marker circle ᵐS transforms into the image conic

\[
{}^{\eta}S = H^{-\top}\, ({}^{m}S)\, H^{-1} .
\tag{6.17}
\]
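These two operations translate directly into a few lines of linear algebra; a sketch (building the conic matrix from the coefficients of Eq. (6.12) and verifying the center on a toy circle):

```python
import numpy as np

def conic_matrix(s):
    """Symmetric conic matrix S from the coefficients of Eq. (6.12)."""
    s1, s2, s3, s4, s5, s6 = s
    return np.array([[s1,     s2 / 2, s4 / 2],
                     [s2 / 2, s3,     s5 / 2],
                     [s4 / 2, s5 / 2, s6]])

def conic_center(S):
    """Center of the conic, Eq. (6.16)."""
    return -np.linalg.solve(S[:2, :2], S[:2, 2])

def transform_conic(S, H):
    """Conic under a homography H, Eq. (6.17)."""
    H_inv = np.linalg.inv(H)
    return H_inv.T @ S @ H_inv

# Unit circle centered at (1, 2): (x-1)^2 + (y-2)^2 - 1 = 0
S = conic_matrix([1, 0, 1, -2, -4, 4])
print(conic_center(S))               # -> [1. 2.]
```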
The center of the transformed ellipse, C(ηS), is not the same as the center of the marker circle transformed by the homography, HC(mS); see Fig. 6.19. When we detect the center of the captured ellipse using the ellipse detector, we obtain an estimate of the center C̃(ηS); hence, we cannot use HC(mS) to calculate the reprojection error.
Figure 6.19: The marker-circle conic ᵐS, the transformed conic ηS with its center C(ηS), and the homography-mapped circle center HC(mS), which does not coincide with C(ηS).
\[
e_{i\circ} = d\!\left( C\!\left(\tilde{H}^{-\top}\, ({}^{m}S_i)\, \tilde{H}^{-1}\right),\; \tilde{C}({}^{\eta}S_i) \right)^{2}
\tag{6.18}
\]

\[
e_{i\circ} = d\!\left( C({}^{\eta}\tilde{S}_i),\; \tilde{C}({}^{\eta}S_i) \right)^{2} .
\tag{6.19}
\]
With these two errors, we can now solve the following least-squares optimization to obtain an optimal estimate of the pose ᶜT̂m:

\[
{}^{c}\hat{T}_m = \operatorname*{argmin}_{{}^{c}\tilde{T}_m} \left( \sum_j e_j + \sum_i e_{i\circ} \right) .
\tag{6.21}
\]
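Such an objective is typically minimized with Levenberg–Marquardt; a compact sketch (using scipy; the residual function stands in for the stacked corner and circle-center errors above, and the pose parametrization is an assumption of this sketch):

```python
import numpy as np
from scipy.optimize import least_squares

def refine_pose(pose0, residual_fn):
    """LM refinement of the pose (sketch of Eq. 6.21).

    pose0       -- initial pose as a 6-vector (3 rotation, 3 translation),
                   e.g., from IPPE or EPnP
    residual_fn -- returns the stacked corner (e_j) and circle-center
                   (e_i, Eq. 6.19) residuals for a given pose vector
    """
    result = least_squares(residual_fn, pose0, method="lm")
    return result.x

# Toy residual: the optimum of this quadratic sits at the "true" pose
true_pose = np.array([0.1, -0.2, 0.05, 0.3, 0.1, 1.5])
pose = refine_pose(np.zeros(6), lambda p: p - true_pose)
print(np.allclose(pose, true_pose))   # -> True
```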
Figure 6.21: Shape of the controlled FDYMA for different percentages of separation between circles (cs) and between the circles and the screen border (ss): 0%, 20%, 40%, 60%, 80%, and 100%.
Detection range
One of the dynamic marker design’s main goals is to increase the range
of detection; in this regard, we will compare the FDYMA to state-of-the-art markers.
Figure 6.22: True detection rate vs distance of the FDYMA and two sizes of
Aruco markers.
At large distances, the marker in the captured image does not cover enough image pixels to discern the individual features of a complex marker. In contrast, the FDYMA, in its minimum shape, is a rectangle with a single circle inside (see Fig. 6.20); this minimum set of features (4 corners and one circle center) is detectable at longer distances than any more complex marker.
When the camera is too close to the screen, at around 3 cm, the pixel pitch of the screen starts to matter. At close range, it is possible to discern the individual screen pixels in the camera image (see Fig. 6.23); this complicates contour detection if the displayed shape is not aligned with the screen pixels. The effect is more pronounced for circles because few pixels are available to represent them, and the circles start to look like squares to the contour detector. In this scenario, squared fiducial markers have an advantage; the solution for the FDYMA is to use retina displays, which have a high pixel density.
Positional accuracy
Now we are interested in studying how accurate the newly designed fiducial is when the pose of the camera changes. It is known that fiducials exhibit considerable variation in pose estimation accuracy depending on the true pose. In the case of squared planar fiducials, fronto-parallel camera-to-marker positions give the worst results, since small errors in the detected control points translate into large errors in the estimated rotation.
Figure 6.23: Image captures of the markers at different distances. First row
the FDYMA, second row the small Aruco with 1 cm side length, and third row
the big Aruco with 25 cm side length. For the FDYMA at 35 m, the image was
zoomed in to make the marker visible for the reader.
c Tm, and to measure c Tm from the coordinates of the Optitrack spheres, we have to perform a process called hand-eye calibration. A hand-eye calibration finds the transformations from the coordinate frame of the Optitrack system to the center of the screen and to the camera coordinate frame. For our experiments, we performed the hand-eye calibration using the ROS package easy_handeye, which conveniently encapsulates the state-of-the-art hand-eye calibration methods of the Visual Servoing Platform (ViSP). The selection of angles and distances and the Optitrack ground
truth measurements will allow us to make a 2D map of accuracy for both
position and rotation estimation.
For each of the camera positions defined in Fig. 6.25, we displayed the
Aruco marker with a side length of 18 cm, and the FDYMA in automatic
direct control mode with a circle area reference value of 500 px2 , which
gives a high point density for better accuracy. For each marker and each
position, we took 100 image captures, estimated the pose from each cap-
ture, and calculated the mean µ and standard deviation σ of the transla-
tion error metric TE using (5.22) and the rotation error metric RE using
(5.21). The main results of this experiment are presented in Fig. 6.25.
Figure 6.25: Translational and rotational accuracy of the estimated pose from
Aruco and FDYMA for different distances and angles. Notice that there is
an order of magnitude of difference between the errors of Aruco and those of
FDYMA (there is a different scaling of the Y-axes).
Figure 6.26: Effect of using an inclined view angle when capturing a screen with a camera. These images were captured at an angle of 45° and a distance of 50 cm, which is outside the optimal viewing region of the screen; hence, the colors on the screen do not look uniform: the left side of the screen appears darker than the right side, which slightly affects the pose estimation accuracy.
Figure 6.27: Spatial distribution of the translation and rotation errors for the
Aruco marker and FDYMA. The errors are represented by a color-coded height
map, the mean on the left side in orange, and the standard deviation on the
right side in blue. The elevation scale is the same for all the figures, but the
color code changes for each one to represent the plotted values better.
Figure 6.28: Images of Aruco and FDYMA for high (H), medium (M), and low (L) image noise.
rotation. The results of this test are presented in Fig. 6.29. As expected, for the Aruco marker, the noise greatly affects the accuracy of the pose estimation, both for translation and rotation. For Aruco, the change from noise setting L to M is small; however, there is a big jump from setting M to setting H, which shows that predicting this marker's response to noise can be problematic. In contrast, the FDYMA has a more stable response to noise due to its more numerous and diverse control points. Its mean error and standard deviation are lower than those of Aruco for all noise settings.
Figure 6.29: Translation error TE (%) and rotation error RE (°) of Aruco and FDYMA for the high (H), medium (M), and low (L) noise settings; mean µ and standard deviation σ.
The results from this section confirm that the FDYMA design provides superior stability compared to traditional planar markers, even without dynamically changing the scale. Since a FDYMA does not require much space for the identification, the extra control points improve the pose estimation accuracy and stability. This better basic design, coupled with the dynamic scale changes, makes the FDYMA an outstanding fiducial marker for pose estimation. The added benefits come with some disadvantages: the FDYMA is intrinsically more complex to implement than traditional fiducials since it requires communication and collaboration between several modules. Moreover, the usage of screens brings its own challenges due to the influence of the screen technology on detection speed, the limited usability in outdoor environments, and the limited viewing angles. Nonetheless, we believe that the FDYMA is an excellent tool for all applications where accuracy is the most relevant factor, especially for cooperative robotics, since those systems usually have the screens and communication infrastructure already available.
7.1 Overview
In the first chapter, we introduced the reader to the topic of vision-based
localization, emphasizing the relationship between action and perception
and we gave a first peek at the potential sources of errors in pose estimation
and how we were going to address them. In Chapter 2, we presented the
basics of camera-based pose estimation while introducing the reader to
our notation, which is a mixture of the one used in robotics and computer
vision. In the next chapter, we went deeper into the concepts of control
points and features and performed a comparative study of artificial features
for pose estimation; this study allowed us to find points in common among
the different fiducial designs and also identify their problems. With this
grounds, we started with our main contributions in Chapter 4, where we
presented the MOMA, a new cooperative odometry system that uses only
artificial features, and by doing so, we avoid some of the errors introduced
by automatically detected natural features and increasing accuracy and
robustness of pose estimation in single and multi-robot scenarios. The
work on this odometry system highlighted some of the problems on the
spatial configuration of the control points, so we elaborated a methodology
to obtain optimal configurations in Chaper 5 that allows a better selection
of control points for future fiducial designs. The extensive study on fiducial
designs of Chapter 3, the experience on interaction that we gathered from
MOMA, and the optimal configuration for six control points obtained in
Chaper 5 were used as the basis for the design of a new kind of interactive
fiducial marker in Chapter 6, closing in this way the dissertation.
The results of this dissertation also paved the way for some other work,
which is out of the scope of this dissertation, but still worth mentioning,
such as contributions in the field of UAV control and landing [1, 78] and in the field of camera calibration [2]. Regarding the latter, many of the results obtained in the chapter about optimal control points and in the one about the FDYMA were applied to the development of a new single-pose camera calibration system that employs curved screens; this research has led to the creation of a startup with the name "Caliberation".
Next, we will present more detailed conclusions for the main highlights
of the dissertation and future work.
7.2 MOMA
The new cooperative mobile marker odometry demonstrated a high-accuracy cooperative visual odometry system that does not need environment features, with better accuracy than state-of-the-art markerless methods such as VO in featureless environments, and with better accuracy than VO methods even in feature-rich environments. Our system exploits the interaction between the observer and the marker: in MOMA, the marker moves and cooperates with the observer. This cooperation was mentioned in the introduction chapter as the perception-interaction cycle.
We demonstrated in different real cooperative robot configurations the
feasibility of the implementation and the advantages of the system. Our
proposed method proved easy to integrate into existing multi-robot sys-
tems since it only requires a cheap monocular camera and cheap printed
fiducial markers. We believe that MOMA is particularly interesting for
challenging environments, e.g., underwater environments or with the ab-
sence of light, with potential applications in search and rescue scenarios.
We found that the poses one selects for the transitions matter: some relative camera-to-marker poses are better than others in terms of accuracy. For example, a fronto-parallel arrangement is less accurate than an inclined one.
We think that a possible avenue of research for the future is to study
which relative poses are optimal for the switch and integrate this knowl-
edge into a planner that defines the best relative motions of the robot to
maintain accuracy. Another potential research path would be to use three-
dimensional fiducial markers that could be observable from any angle or
use the 3D model of the shape of each robot as the marker.
[A3] Raul Acuna and Volker Willert. Insights into the robustness of con-
trol point configurations for homography and planar pose estimation.
CoRR, abs/1803.0, 2018.
[A4] Raul Acuna and Volker Willert. Method for determining calibration
parameters of a camera, Patent WO 2020/109033, 2020.
[A5] Raul Acuna, Ding Zhang, and Volker Willert. Vision-based UAV
landing on a moving platform in GPS denied environments using
motion prediction. In Proc. IEEE Lat. Am. Robot. Symp., pages
515–521, João Pessoa, Brazil, 2018. IEEE.
[A6] Raul Acuna, Robin Ziegler, and Volker Willert. Single pose camera
calibration using a curved display screen. In Forum Bildverarbeitung,
pages 25–36, Karlsruhe, 2018. KIT Scientific Publishing.
[A7] Zaijuan Li, Raul Acuna, and Volker Willert. Cooperative Localiza-
tion by Fusing Pose Estimates from Static Environmental and Mobile
Fiducial Features. In Proc. IEEE Lat. Am. Robot. Symp., pages 65–
70. IEEE, nov 2018.
[A8] Dinu Mihailescu-Stoica, Raul Acuna, and Jürgen Adamy. High performance adaptive attitude control of a quadrotor. In Proc. 18th Eur. Control Conf. (ECC), pages 3462–3469, 2019.
Bibliography
[1] Raul Acuna, Ding Zhang, and Volker Willert. Vision-based UAV land-
ing on a moving platform in GPS denied environments using motion
prediction. In Proc. IEEE Lat. Am. Robot. Symp., pages 515–521, João
Pessoa, Brazil, 2018. IEEE.
[2] Raul Acuna, Robin Ziegler, and Volker Willert. Single pose camera
calibration using a curved display screen. In Forum Bildverarbeitung,
pages 25–36, Karlsruhe, 2018. KIT Scientific Publishing.
[3] B Atcheson, F Heide, and W Heidrich. CALTag: High Precision Fidu-
cial Markers for Camera Calibration. In Vision, Model. Vis. The Eu-
rographics Association, 2010.
[6] Burak Benligiray, Cihan Topal, and Cuneyt Akinlar. STag: A stable
fiducial marker system. CoRR, abs/1707.0, 2017.
[7] Filippo Bergamasco, Andrea Albarelli, Emanuele Rodolà, and Andrea Torsello. RUNE-Tag: A high accuracy fiducial marker with strong occlusion resilience. In CVPR 2011, pages 113–120, 2011.
[10] Matevž Bošnak, Drago Matko, and Sašo Blažič. Quadrocopter hov-
ering using position-estimation information from inertial sensors and a
high-delay video system. J. Intell. Robot. Syst. Theory Appl., 67(1):43–
60, 2012.
[11] Gary Bradski and Adrian Kaehler. Learning OpenCV. First edition,
2008.
[12] Martin Buczko and Volker Willert. How to distinguish inliers from
outliers in visual odometry for high-speed automotive applications. In
IEEE Intell. Veh. Symp. Proc., number IV, pages 478–483, 2016.
[13] L. Calvet, P. Gurdjos, and V. Charvillat. Camera tracking using
concentric circle markers: Paradigms and algorithms. In Int. Conf.
Image Process., pages 1361–1364, 2012.
[21] Toby Collins and Adrien Bartoli. Infinitesimal plane-based pose esti-
mation. Int. J. Comput. Vis., 2014.
[22] Gabriele Costante, Christian Forster, Jeffrey Delmerico, Paolo Valigi,
and Davide Scaramuzza. Perception-aware Path Planning. arXiv,
pages 1–16, 2016.
[23] David G. Lowe. Object recognition from local scale-invariant features. Proc. IEEE Int. Conf. Comput. Vis., 2:1150–1157, 1999.
[27] Diego Brito Dos Santos Cesar, Christopher Gaudig, Martin Fritsche,
Marco A. Dos Reis, and Frank Kirchner. An evaluation of artificial
fiducial markers in underwater environments. MTS/IEEE Ocean. 2015
- Genova Discov. Sustain. Ocean Energy a New World, 2015.
[28] Dougal Maclaurin. Modeling, Inference and Optimization with Com-
posable Differentiable Procedures. PhD thesis, Harvard University,
2016.
[29] Jakob Engel, Jurgen Sturm, and Daniel Cremers. Semi-dense visual
odometry for a monocular camera. Proc. IEEE Int. Conf. Comput.
Vis., pages 1449–1456, 2013.
[30] Davide Falanga, Philipp Foehn, Peng Lu, and Davide Scaramuzza.
PAMPC: Perception-Aware Model Predictive Control for Quadrotors.
arXiv cs.RO, 2018.
[31] Mark Fiala. ARTag, a fiducial marker system using digital tech-
niques. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog-
nit., 2:590–596, 2005.
[67] Shiqi Li, Chi Xu, and Ming Xie. A robust O(n) solution to the
perspective-n-point problem. IEEE Trans. Pattern Anal. Mach. In-
tell., 2012.
[68] Wei Li, Tianguang Zhang, and Kolja Kühnlenz. A vision-guided au-
tonomous quadrotor in an air-ground multi-robot system. In Proc.
IEEE Int. Conf. Robot. Autom., pages 2980–2985, Shanghai, 2011.
IEEE.
[69] Zaijuan Li, Raul Acuna, and Volker Willert. Cooperative Localiza-
tion by Fusing Pose Estimates from Static Environmental and Mobile
Fiducial Features. In Proc. IEEE Lat. Am. Robot. Symp., pages 65–70.
IEEE, nov 2018.
[70] Peter Lightbody, Tomáš Krajník, and Marc Hanheide. A Versatile High-performance Visual Fiducial Marker Detection System with Scalable Identity Encoding. Proc. Symp. Appl. Comput., pages 276–282, 2017.
[71] Hyon Lim and Young Sam Lee. Real-Time Single Camera SLAM
Using Fiducial Markers. In ICCAS-SICE, pages 177–182, 2009.
[72] Diego López De Ipiña, Paulo R.S. Mendonça, and Andy Hopper.
TRIP: A Low-Cost Vision-Based Location System for Ubiquitous Com-
puting. Pers. Ubiquitous Comput., 6(3):206–219, 2002.
[73] Chien Ping Lu, Gregory D. Hager, and Eric Mjolsness. Fast and
globally convergent pose estimation from video images. IEEE Trans.
Pattern Anal. Mach. Intell., 2000.
[74] Pieter Jan Maes, Marc Leman, Caroline Palmer, and Marcelo M.
Wanderley. Action-based effects on music perception. Front. Psychol.,
4(JAN):1–14, 2014.
[75] John Mallon and Paul F Whelan. Which pattern? Biasing aspects of
planar calibration patterns and detection methods. Pattern Recognit.
Lett., 28(8):921–930, 2007.
[87] C. B. Owen, Fan Xiao, and P. Middlin. What is the best fiducial? In
Proc. 1st IEEE Int. Work. Augment. Real. Toolkit, Darmstadt, 2002.
[88] Oxford. Oxford English dictionary. Oxford University Press, 1989.
[89] Andrew Parker. In The Blink Of An Eye: How Vision Sparked The
Big Bang Of Evolution. Perseus Pub, 2004.
[90] I Poupyrev, H Kato, and M Billinghurst. ARToolKit User Manual. Technical report, University of Washington, 2000.
[122] Guoxing Yu, Yongtao Hu, and Jingwen Dai. TopoTag: A Robust
and Scalable Topological Fiducial Marker System. In arXiv cs.CV,
2019.