Simultaneous Two-View Epipolar Geometry Estimation and Motion Segmentation by 4D Tensor Voting

This document summarizes a research paper that proposes a method called 4D tensor voting for simultaneously estimating multiple epipolar geometries and segmenting matches into independent motions from two views of a non-static scene containing multiple moving objects. The method considers the 4D joint image space and performs two tensor voting passes to propagate local geometric constraints and enforce global consistency. It extracts the fundamental matrices corresponding to different motions in succession. Experiments show it achieves better performance on non-static scenes than other representative algorithms. The method is efficient, requiring only two frames, and can tolerate high noise levels without assumptions beyond the pinhole camera model.

Simultaneous two-view epipolar geometry estimation and motion segmentation by 4D tensor voting
Article in IEEE Transactions on Pattern Analysis and Machine Intelligence · October 2004
DOI: 10.1109/TPAMI.2004.72 · Source: PubMed



Simultaneous Two-View Epipolar Geometry Estimation and Motion Segmentation by 4D Tensor Voting

Wai-Shun Tong†  Chi-Keung Tang†  Gérard Medioni
cstws@cs.ust.hk  cktang@cs.ust.hk  medioni@iris.usc.edu

†Hong Kong University of Science and Technology
University of Southern California

Wai-Shun Tong
Dept. of Computer Science, Hong Kong University of Science & Technology, Clear Water Bay, Hong Kong.
Email: cstws@cs.ust.hk

Chi-Keung Tang (corresponding author)


Dept. of Computer Science, Hong Kong University of Science & Technology, Clear Water Bay, Hong Kong.
Tel: +852-2358-8775 Fax: +852-2358-1477 Email: cktang@cs.ust.hk

Gérard Medioni
PHE 204, MC0273, Institute for Robotics and Intelligent Systems, University of Southern California, Los
Angeles, CA 90083-0273, USA
Tel: +1-213-740-6440 Fax: +1-213-740-7877 Email: medioni@iris.usc.edu

A preliminary version of this paper appears in the proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2001. In this submission, we use color codes in many figures. Reviewers are encouraged to review the color pdf
file we submitted to TPAMI and view all figures on-line using Acrobat Reader.

February 6, 2004 DRAFT



Simultaneous Two-View Epipolar Geometry Estimation and Motion Segmentation by 4D Tensor Voting

Wai-Shun Tong, Chi-Keung Tang, Gérard Medioni

Abstract

We address the problem of simultaneous two-view epipolar geometry estimation and motion segmentation from non-static scenes. Given a set of noisy image pairs containing matches of n objects, we propose an unconventional, efficient, and robust method, 4D tensor voting, for estimating the unknown n epipolar geometries and segmenting the static and motion matching pairs into n independent motions. By considering the 4D isotropic and orthogonal joint image space, only two tensor voting passes are needed, and a very high noise-to-signal ratio (up to five) can be tolerated. Epipolar geometries corresponding to multiple, rigid motions are extracted in succession. Only two uncalibrated frames are needed, and no simplifying assumption (such as an affine camera model or a homographic model between images) other than the pin-hole camera model is made. Our novel approach consists of propagating a local geometric smoothness constraint in the 4D joint image space, followed by global consistency enforcement for extracting the fundamental matrices corresponding to independent motions. We have performed extensive experiments comparing our method with representative algorithms, showing that better performance on non-static scenes is achieved. Results on challenging datasets are presented.

Keywords

Epipolar geometry, motion segmentation, non-static scene, robust estimation, higher dimensional
inference

1 Introduction

In the presence of moving objects, image registration becomes a more challenging problem, as the matching and registration phases become interdependent. Many researchers assume a homographic model between images (e.g. [15]) and detect motion by residuals, or use more than two frames (e.g. [5]). Three views are used to define a trifocal tensor [9]¹, which has a large
number of parameters to be estimated. Torr and Murray [21] use epipolar geometry to detect
independent motion. Wolf and Shashua [24] proposed the segmentation matrix that encapsulates
two-body motion geometry from two views. Vidal et al. [23] also considered two views, and
generalized the notion of fundamental matrix into a multibody fundamental matrix involving
more than two independently moving objects. In this paper, we propose to perform epipolar
geometry estimation for non-static scenes using 4D tensor voting, which runs in linear time in
¹In this paper, we use a geometric tensor to encode structure information in 4D. Our use of a 4D symmetric tensor is different from the trilinear or trifocal tensors found in some epipolar geometry literature.


the number of image pairs. Our geometric approach addresses the motion segmentation and
outlier rejection simultaneously for fundamental matrix estimation, via a novel voting approach.
In the presence of a large number of false matches, linear methods such as the Eight-Point Algorithm [10] are likely to fail. Thanks to previous work that either removes most of the false matches by non-linear optimization [1], [25], or performs data normalization [8] before fundamental matrix estimation, the epipolar geometry problem can be solved robustly and efficiently. However, when false matches and matches due to consistent motion are both present, these methods become unstable, since the single-fundamental-matrix assumption is violated. Depending on the size and number of the objects in motion, robust methods [25] may return a totally incorrect epipolar geometry of the scene. Also, even if the fundamental matrix corresponding to the static background can be recovered, correspondences for objects in motion are usually discarded as outliers. While outliers are still the main issue in the presence of moving objects (in static scenes as well, as Hartley suggested [8]), the interdependence of motion segmentation, outlier rejection, and parameter estimation makes the problem particularly difficult to solve. Specifically, the most dominant motion (e.g. the background) may treat other, less salient motion matches as outliers.
In this paper, we propose an unconventional and non-iterative geometric approach that establishes a strong geometric constraint in the 4D joint image space on true matching pairs (static or motion). Robust segmentation is performed in a non-parametric manner. Our method contrasts with many algebraic approaches, which optimize certain functionals to estimate a large number of parameters. The 4D tensor voting method and the non-iterative Hough transform [4] share the idea that a voting technique is employed to output the solution receiving maximal support. However, as the dimensionality grows, the Hough transform becomes extremely inefficient, and is thus impractical for higher dimensional detection problems. Technical differences between the two methods will be discussed in depth later, following a detailed description of our 4D algorithm. We show that 4D tensor voting is very effective in rejecting outlier noise resulting from wrong matches, and is capable of identifying and segmenting salient motions in succession, using a traditional RANSAC approach to recover the epipolar geometry.
The rest of this paper is organized as follows. We first give an outline of the algorithm. In section 3 we review related work and provide background knowledge on epipolar geometry. In section 4 we give a contextual review of tensor voting. Section 5 presents the details of motion segmentation and outlier rejection. Section 6 evaluates the time complexity of our system (it is a linear-time algorithm in the number of matches). In sections 7–8, we present our results and


evaluate our system, where we discuss our method from the practitioner's point of view. Finally, we conclude our work with future research directions in section 9.
A preliminary version of this paper appears in [20]. The present coverage makes the description complete by providing a quantitative comparison of our 4D approach with widely used techniques for epipolar geometry estimation, such as LMedS, M-estimators, and RANSAC. Experiments were performed to evaluate the accuracy of estimation on both static and non-static scenes, and on both real and synthetic data (for the latter, ground truth data are available). An analysis of the applicability of our approach and adequate discussion are also given.

2 Outline of the algorithm

The key idea is that inliers, even if they belong to a small and independently moving object, reinforce each other after voting, whereas outliers do not support each other. In 4D tensor voting, strong geometric constraints for motion inliers are enforced, reinforcing the saliency of good matches and suppressing that of outliers simultaneously, using a non-iterative voting scheme that gathers tensorial support. There are two stages for extracting each motion in succession:
Local smoothness constraint enforcement (outlier rejection and motion segmentation). We first apply 4D tensor voting, exploiting the fact that in the 4D joint image space, the joint image of a point match lies on a 4D cone. We also show that independently moving objects in fact give rise to different 4D cones in the joint image space. We use 4D tensor voting, in exactly two passes (cf. [18]), to enforce the local continuity constraint in 4D, so that false matches (which do not lie on any of the 4D cones) are discarded. Matching points that satisfy the continuity constraint are retained.
Global epipolar constraint enforcement (parameter estimation). Using this filtered set of matches, we show that even a slight modification of a simple and efficient algorithm, such as the eight-point linear method or random sub-sampling, can extract the different motion components in a non-static scene. The epipolar geometries corresponding to multiple motions are also estimated.
The overall approach is summarized in Fig. 1. The input to the algorithm is a set of candidate
point matches, which can be obtained manually, or by automatic cross-correlation.
Our 4D algorithm works by first normalizing the input point set (using Hartley’s approach [8]).
After data normalization, we convert the dataset into a sparse set of 4D points (section 5.1).
Next, we convert the 4D coordinates into tensors, which will be discussed in section 5.2. Our
outlier rejection requires estimating the normal to the 4D cone, if the joint image corresponds


Fig. 1. Flowchart of the 4D system. Three motions are identified.

to a point lying on it. To this end, we use 4D ball voting field (section 5.3) to vote for normals.
The local continuity constraint is then enforced by the 4D stick voting field (section 5.4) to vote
for a smooth surface.
Based on the hypersurface saliency information estimated after the above two voting passes,
we discard points that are likely to be false matches (section 5.5). Then, methods such as the
normalized Eight Point Algorithm, RANSAC, and LMedS, can be used to estimate the funda-
mental matrix from the filtered set of matches. This enforces the global constraint that the points
should lie on a cone in 4D (section 5.6).
In a non-static scene, the estimated matrix can be used to identify the subset of matching points
corresponding to a consistent (camera or object) motion. The above process is applied again on
the remaining set of unclassified matched points. Thus, multiple and independent motions can
be successively extracted.
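The normalization step mentioned above can be sketched as follows. The sketch assumes the standard form of Hartley's normalization (centroid translated to the origin, mean distance scaled to sqrt(2)); the point data are made up for illustration:

```python
import numpy as np

def normalize_points(pts):
    """Hartley-style normalization (standard form assumed here): translate the
    centroid to the origin and scale so the mean distance from the origin is
    sqrt(2). Returns the normalized points and the 3x3 similarity T such that
    [x', y', 1]^T = T [x, y, 1]^T."""
    pts = np.asarray(pts, dtype=float)
    centroid = pts.mean(axis=0)
    mean_dist = np.linalg.norm(pts - centroid, axis=1).mean()
    s = np.sqrt(2.0) / mean_dist
    T = np.array([[s, 0.0, -s * centroid[0]],
                  [0.0, s, -s * centroid[1]],
                  [0.0, 0.0, 1.0]])
    return (pts - centroid) * s, T

# Hypothetical pixel coordinates, e.g. from a 640x480 image.
pts = np.random.default_rng(7).uniform(0, 640, (50, 2))
pn, T = normalize_points(pts)
```

Normalizing both images' points (and later denormalizing the estimated matrix through the two transforms T) is what gives the linear methods their numerical stability.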

3 Related Work

A comprehensive survey and recent development on multiple view geometry can be found
in [6]. In [11], camera calibration for initially uncalibrated stereo images was proposed. It


also shows that other methods at that time were unstable when the points are close to planar. In the absence of false matches or moving objects, algebraic methods can be used for an accurate estimation of the fundamental matrix. One algebraic approach was proposed by Sturm [17]; in this classical approach, linear spanning and determinant checking using seven corresponding points are used. If more than seven points are available, the Eight-Point Algorithm [10] is often used. The algorithm estimates the essential (resp. fundamental) matrix from two calibrated (resp. uncalibrated) camera images by formulating the problem as a system of linear equations. A minimum of eight point matches is needed; if more than eight are available, a least-squares minimization is often used. To make the resulting matrix satisfy the rank-two requirement, singularity enforcement [8] is performed. Its simplicity of implementation offers a major advantage. In Hartley [8], it is noted that after outlier rejection, the Eight-Point Algorithm performs comparably with other non-linear optimization techniques.
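The linear eight-point estimate with singularity enforcement can be sketched as follows. The synthetic setup (calibrated cameras, so F coincides with the essential matrix E = [t]x R, noise-free points, and the chosen rotation and translation) is an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(3)

def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]])

# Synthetic noise-free matches from one rigid motion: X2 = R X1 + t.
ang = 0.3
R = np.array([[np.cos(ang), 0, np.sin(ang)],
              [0, 1, 0],
              [-np.sin(ang), 0, np.cos(ang)]])
t = np.array([0.5, -0.2, 0.1])
X = rng.uniform(-1, 1, (12, 3)) + [0, 0, 5]      # points in front of camera 1
x1 = np.c_[X[:, :2] / X[:, 2:], np.ones(12)]     # homogeneous image points
Xc2 = X @ R.T + t
x2 = np.c_[Xc2[:, :2] / Xc2[:, 2:], np.ones(12)]

# Eight-point: each match gives one linear equation in the 9 entries of F,
# under the convention x2^T F x1 = 0; solve for the smallest singular vector.
A = np.stack([np.outer(b, a).ravel() for a, b in zip(x1, x2)])
F = np.linalg.svd(A)[2][-1].reshape(3, 3)

U, s, Vt = np.linalg.svd(F)                      # rank-two (singularity)
F = U @ np.diag([s[0], s[1], 0.0]) @ Vt          # enforcement

E = skew(t) @ R                                  # ground truth, up to scale
```

With noise-free data, the recovered F matches E up to a global scale; with real matches, the normalization of the previous paragraph is applied first.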
False matches are unavoidable in matching programs based on cross correlation. More com-
plicated, non-linear iterative optimization methods were therefore proposed in [25]. These non-
linear robust techniques use objective functions, such as the distance between points and their
corresponding epipolar lines, or the gradient-weighted epipolar errors, to guide the optimization
process. Despite the increased robustness of these methods, non-linear optimization methods
require somewhat careful initialization for early convergence to the desired optimum. The most
successful algorithm in this class is the LMedS proposed by Zhang et al. [25]. The algorithm
uses the least median of squares, data sub-sampling, and an adapted criterion to discard outliers by solving a non-linear minimization problem; the fundamental matrix is then estimated. Torr and Murray [22] proposed RANSAC, which randomly samples a minimal subset of seven pairs of matching points for parameter estimation. The candidate subset that maximizes the number of supporting points and minimizes the residual is the solution. However, it is computationally infeasible to consider all possible subsets, whose number is exponential. Therefore, additional statistical measures are needed to derive the minimum number of sample subsets.
LMedS and RANSAC are considered among the most robust methods. It is worth noting, however, that these methods still require a majority (at least 40–50%) of the data to be correct, or else some statistical assumption is needed (e.g. the approximate percentage of correct matches needs to be known), whereas our proposed method can tolerate a higher noise-to-signal ratio. If both false matches and motion are present, these methods may fail, or become less attractive since many matching points due to motion are discarded as outliers.
Another class of robust algorithms is the M-estimator [7], in which a weighting function is


                              4D           8D
number of passes              2            2
orthogonal parameter space    yes          no
isotropic parameter space     yes          no
motion pixels                 classified   discarded
outlier noise                 discarded    discarded

TABLE I
4D and 8D tensor voting approach comparison

designed to minimize the effect of outliers. The advantage of this approach is that various weight functions can be designed for different scenarios. However, for motion scenes with a large number of false matches, the weight functions may not be trivial to design.
In [15], Pritchett and Zisserman proposed the use of local planar homographies, generated by Gaussian pyramid techniques; point matches are then generated using the homographies. However, the homography assumption does not generally apply to the entire image (e.g. curved surfaces).
In [1], Adam, Rivlin, and Shimshoni addressed the problem of outlier rejection. They discovered that, with proper rotation of the image points, the correct matching pairs create line segments pointing in approximately the same direction. However, this method requires some form of searching, and point matches due to motion are rejected.
In [18], the estimation of fundamental matrix is formulated as an 8D hyperplane inference
problem. The eight dimensions are defined by the parameter space of the epipolar constraint.
Each point match is first converted into a point in the 8D parameter space. Then, 8D tensor
voting is applied to "vote" for the most salient hyperplane from the noisy matches. The resulting point matches are assigned a hypersurface saliency value. Matches with low saliency values are labeled as outliers and discarded; matches with high saliency values are used for parameter estimation. Note that one major difference of this method is that a geometric continuity constraint is used, instead of minimizing an algebraic error. While [18] reports good results, a multipass algorithm is proposed, and in the case of multiple motions the stability of the system is reduced. The noise rejection ability is somewhat hampered by the non-orthogonality and anisotropy of the 8D parameter space. Also, in [18], motion correspondences are discarded as outliers. A comparison
between 4D and 8D tensor voting for epipolar geometry estimation is shown in Table I. We shall


Fig. 2. Epipolar constraint.

explain the comparative advantages of 4D tensor voting in this paper.


It is also worth noting that the tensor voting approach has shown success in perceptual grouping of motion layers [14], and in perceptual grouping from motion cues [13].

3.1 Review of joint image and epipolar geometry

Here, we briefly review epipolar geometry and its corresponding 4D cone in the joint image space. More details can be found in [2], [16]. Refer to Fig. 2. Given two images of a static scene taken from two camera systems C1 and C2, let (ul, vl) be a point in the first image. Its corresponding point (ur, vr) is constrained to lie on the epipolar line derived from (ul, vl). This epipolar line is the intersection of two planes: one plane is defined by three points, namely the two optical centers C1, C2 and (ul, vl); the other plane is the image plane of the second image. A symmetric relation applies to (ur, vr). This is known as the epipolar constraint. The fundamental matrix F = [Fij], 1 <= i, j <= 3, that relates any matching pair (ul, vl) and (ur, vr) is given by u1^T F u2 = 0, where u1 = (ul, vl, 1)^T and u2 = (ur, vr, 1)^T.
It was suggested by Anandan and Avidan in [2] that, in the joint image space, the above equation involving Fij can be written as

    (1/2) [q^T 1] C [q^T 1]^T = 0                     (1)

where q = (ul, vl, ur, vr)^T is the stacked vector of a corresponding point pair, or joint image coordinates, and C is a 5 x 5 matrix defined as

        [ 0    0    F11  F21  F31  ]
        [ 0    0    F12  F22  F32  ]
    C = [ F11  F12  0    0    F13  ]                  (2)
        [ F21  F22  0    0    F23  ]
        [ F31  F32  F13  F23  2F33 ]


Note that the 4D joint image space is isotropic and orthogonal. In [2], C is proved to be of rank four. It describes a quadric in the 4D joint image space (ul, vl, ur, vr). According to [16], a quadric defined by a rank-four 5 x 5 matrix represents a cone in 4D. Since the 4D cone is a smooth geometric structure except at the apex², we use tensor voting to propagate the smoothness constraint in a neighborhood.
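The structure of C can be checked numerically. The sketch below builds C from a rank-two F per Eq. (2); it assumes the transpose convention under which (1/2) x^T C x equals u2^T F u1 (the symmetric C encodes the same epipolar constraint under either convention), and the random F is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)

# A rank-two matrix, as required of a fundamental matrix F.
M = rng.standard_normal((3, 3))
U, s, Vt = np.linalg.svd(M)
F = U @ np.diag([s[0], s[1], 0.0]) @ Vt

# Build the 5x5 symmetric matrix C of Eq. (2); F[i-1, j-1] holds Fij.
C = np.zeros((5, 5))
C[0, 2], C[0, 3], C[0, 4] = F[0, 0], F[1, 0], F[2, 0]
C[1, 2], C[1, 3], C[1, 4] = F[0, 1], F[1, 1], F[2, 1]
C[2, 4], C[3, 4] = F[0, 2], F[1, 2]
C = C + C.T
C[4, 4] = 2.0 * F[2, 2]

# For any joint image coordinates, (1/2) x^T C x reproduces the epipolar form.
ul, vl, ur, vr = rng.standard_normal(4)
x = np.array([ul, vl, ur, vr, 1.0])
u1 = np.array([ul, vl, 1.0])
u2 = np.array([ur, vr, 1.0])
```

A rank-two F yields a rank-four C, so Eq. (1) is indeed a cone in the 4D joint image space.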

3.2 Epipolar geometry and motion segmentation

In this section, we review the relationship between epipolar geometry and motion segmentation. The epipolar constraint of a static scene describes the camera motion between two optical centers. Suppose the scene moves, not the camera. We obtain exactly the same stereo pair, and therefore the same fundamental matrix, if the camera motion and the scene motion are inverses of each other.
Let <R1, t1> be the geometric transformation <rotation, translation> from one camera coordinate system to the other. Without loss of generality, assume that the former system is the world coordinate system. Suppose a moving object is seen by both cameras, and let the object move from 3D position P1 to P2. Let the object motion from P1 to P2 be <R2, t2> w.r.t. the world coordinate system. Therefore, if we "compensate" the object motion with the camera motion, the epipolar constraint of the "motion-compensated" object describes a "compensated" camera motion equal to <R2 R1, R2 t1 + t2>. Hence, there are altogether two epipolar constraints describing the non-static scene: one fundamental matrix describes the static component of the scene due to the camera motion <R1, t1>, and another fundamental matrix describes the "motion-compensated" object <R2 R1, R2 t1 + t2>. Each epipolar constraint corresponds to a cone in the 4D joint image space. The above applies readily to three or more motions, leading to an approach to motion segmentation by epipolar geometry:
1. Reject wrong matches and reinforce true matches simultaneously, by exactly two passes of 4D tensor voting.
2. Extract the most salient cone in the 4D joint image space, by a parameter estimation technique (or simply the normalized Eight-Point Algorithm with singularity enforcement).
3. Remove point matches whose joint image lies on the 4D cone found above, and repeat from step 1.
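The successive-extraction loop above can be sketched end-to-end on synthetic data. The 4D voting stage is too large for a short example, so random sub-sampling of eight-point estimates stands in for steps 1–2 here; the two motions, the point data, and the inlier threshold are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_matches(R, t, n):
    """Noise-free homogeneous matches for one rigid motion (calibrated views,
    camera-2 coordinates X2 = R X1 + t)."""
    X = rng.uniform(-1, 1, (n, 3)) + [0, 0, 4.0]
    Xc2 = X @ R.T + t
    return (np.c_[X[:, :2] / X[:, 2:], np.ones(n)],
            np.c_[Xc2[:, :2] / Xc2[:, 2:], np.ones(n)])

def eight_point(x1, x2):
    """Linear F estimate (x2^T F x1 = 0) with rank-two enforcement."""
    A = np.stack([np.outer(b, a).ravel() for a, b in zip(x1, x2)])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    U, s, Vt = np.linalg.svd(F)
    return U @ np.diag([s[0], s[1], 0.0]) @ Vt

def residual(F, x1, x2):
    return np.abs(np.einsum('ij,jk,ik->i', x2, F, x1))

# Two independent rigid motions -> two cones, hence two fundamental matrices.
Rz = lambda a: np.array([[np.cos(a), -np.sin(a), 0],
                         [np.sin(a), np.cos(a), 0], [0, 0, 1]])
motions = [(Rz(0.1), np.array([1.0, 0.0, 0.1])),
           (Rz(-0.2), np.array([0.0, 1.0, 0.2]))]
pairs = [make_matches(R, t, 60) for R, t in motions]
x1 = np.vstack([p[0] for p in pairs])
x2 = np.vstack([p[1] for p in pairs])

# Successive extraction: find the most-supported F, peel off its inliers, repeat.
remaining = np.arange(len(x1))
groups = []
for _ in range(len(motions)):
    best = None
    for _ in range(1500):
        sample = rng.choice(remaining, 8, replace=False)
        F = eight_point(x1[sample], x2[sample])
        inliers = remaining[residual(F, x1[remaining], x2[remaining]) < 1e-6]
        if best is None or len(inliers) > len(best):
            best = inliers
    groups.append(best)
    remaining = np.setdiff1d(remaining, best)
```

Each pass recovers one motion's matches as the largest consistent set; removing them exposes the next, less dominant motion.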
Fig. 3 shows the results on a synthetic example to further elaborate motion detection by epipo-
lar geometry estimation. Three spheres with independent motion are captured in two frames
²The unfortunate case in which all matches cluster around the apex is addressed in the discussion section.


Fig. 3. Top: two frames of a moving camera capturing three spheres with independent motion. There are many wrong matches (the noise-to-signal ratio is 1:1). Middle: outliers are discarded, and motion matches are identified and classified. Corresponding epipolar lines of the respective moving objects are drawn. Bottom: hypothetical 4D cones are drawn to illustrate the matching criterion.

by a moving camera. Using 4D tensor voting, we extract the consistent motion components of the scene successively (first the largest sphere in the middle, then the sphere on the left, and finally the smallest sphere on the right), classify them into three distinct sets, and reject all outliers. A few corresponding epipolar lines of the respective moving spheres are also drawn. In the results section, we present more experimental results on real and synthetic data.


4 Contextual review of tensor voting

In this section, we provide a concise review of tensor voting [12] in the context of 4D inference. Detailed pseudocode for the 4D voting algorithm is also available in the appendix. In tensor voting, tensors are used for token representation, and voting is used for non-iterative token-token communication. Tensors and votes are related by voting fields. Voting fields are tensor fields which postulate the most likely directional information by encoding a geometric smoothness constraint. The 4D voting fields can be derived from the 2D stick voting field.

4.1 Voting fields for the basic case

Our goal is to encode the smoothness constraint that should effectively determine whether a 4D point lies on some smooth surface in 4D. Let us consider a question related to 2D smoothness: suppose a point P in the plane is to be connected by a smooth curve to the origin O, and suppose also that the normal N to the curve at O is given. What is the most likely normal direction at P? [12] Fig. 4a illustrates the situation.
We claim that the osculating circle connecting O and P is the most likely connection, since
it keeps the curvature constant along the hypothesized circular arc. The most likely normal is
given by the normal to the circular arc at P (thick arrow in Fig. 4a). The length of this normal,
which represents the vote strength, is inversely proportional to the arc length OP , and also to
the curvature of the underlying circular arc.
To encode proximity and smoothness (low curvature), the decay of the field takes the following form, which is a function of r, kappa, c, and sigma:

    DF(r, kappa, sigma) = e^(-(r² + c kappa²)/sigma²)         (3)

where r is the arc length OP, kappa is the curvature, and c is a constant which controls the decay with high curvature. sigma is the scale of analysis, which determines the effective neighborhood size³. Note that sigma is the only free parameter in the system⁴.
If we consider all points in the 2D space, the whole set of normals thus derived constitutes the 2D stick voting field, Fig. 4b. Each normal is called a stick vote [vx, vy]^T, defined as

    [vx, vy]^T = DF(r, kappa, sigma) [sin phi, cos phi]^T     (4)

³Since we use a Gaussian decay function for DF(·), the effective neighborhood size is about 3 sigma.
⁴Given O and N, r and kappa at P are known; refer to Fig. 4a. Using the sine rule, kappa = 2 sin(phi/2)/|OP|. The arc length is r = phi/kappa. If phi = 0, r = |OP|.



Fig. 4. (a) Design of the 2D stick voting field. (b) 2D stick voting field. (c) A casts a stick vote to B , using the 2D
stick voting field. (d) 2D ball voting field.

where phi is the angle subtended by the radii of the circle incident at O and P.
Given an input token A, how do we use this field to cast a stick vote to another token B, to infer a smooth connection between them? Let us assume the basic case in which A's normal is known, as illustrated in Fig. 4c. First, we fix the scale sigma to determine the size of the voting field. Then, we align the voting field with A's normal (by translation and rotation). If B is within A's voting field neighborhood, B receives a stick vote from the aligned field.
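A minimal 2D sketch of the stick vote of Eqs. (3)–(4) follows, written in terms of the half-angle theta = phi/2 of the subtended angle. The 45-degree cutoff and the values of sigma and c are common practical choices assumed here, not taken from this paper:

```python
import numpy as np

def stick_vote(p, sigma=1.0, c=0.2):
    """Stick vote cast by a voter at the origin O with normal (0, 1), received
    at 2D point p, via the osculating-circle construction of Fig. 4a."""
    x, y = float(p[0]), float(p[1])
    l = np.hypot(x, y)
    if l == 0.0:
        return np.zeros(2)
    theta = np.arcsin(min(abs(y) / l, 1.0))  # angle between OP and tangent at O
    if theta > np.pi / 4:                    # assumed practical cutoff: no vote
        return np.zeros(2)                   # beyond 45 degrees
    if y == 0.0:
        s, kappa = l, 0.0                    # straight continuation
        n = np.array([0.0, 1.0])
    else:
        kappa = 2.0 * abs(y) / l**2          # curvature: 2 sin(theta) / l
        s = theta * l / np.sin(theta)        # arc length along the circle
        center = np.array([0.0, l**2 / (2.0 * y)])  # centre lies on O's normal
        n = center - np.array([x, y])        # normal at p points to the centre
        n /= np.linalg.norm(n)
        if n[1] < 0.0:                       # sign is irrelevant for a second
            n = -n                           # order tensor; orient toward +y
    return np.exp(-(s**2 + c * kappa**2) / sigma**2) * n
```

A receiver straight ahead on the tangent gets a pure (0, 1) vote attenuated only by distance; receivers requiring high curvature, or lying past the cutoff, get weak or zero votes.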

4.2 Vote collection and tensor representation

How does B collect and interpret all received votes? Other input tokens cast votes to B as well. Denote a received stick vote by [vx, vy]^T. To collect the majority vote, one alternative is to accumulate the vector sum of the stick votes. However, since we only have sparse data (especially in the 4D joint image space), orientation information may be unavailable or wrong. Two consistent directions with different orientations will cancel each other out if we collect the vector sum, or first order tensor.
We therefore collect second order moments. The tensor sum of all votes collected by B is accumulated by summing the covariance matrices consisting of the votes' second order moments:

    [ sum vx²     sum vx vy ]
    [ sum vy vx   sum vy²   ]

This is a second order symmetric tensor. By decomposing this tensor into the corresponding eigensystem, we obtain the most likely normal at B, given by the eigenvector associated with the largest eigenvalue. Let us denote this unit vector by ê1.
Geometrically, a second order symmetric tensor in 2D is equivalent to an ellipse. The major
axis gives the general direction. The minor axis indicates the uncertainty: if the length of the
minor axis is zero, the tensor is a stick tensor, representing absolute certainty in one direction
given by the major axis. If the length of the minor axis is equal to that of the major axis, the
tensor is a ball tensor, indicating absolute uncertainty in all directions.
At first glance, the use of tensor fields in vote collection resembles that of radial basis functions for scattered data interpolation, where a set of functions is linearly combined to fit the data. However, the tensor fields used here are fundamentally different. First, our output is a tensor instead of a scalar: the accumulated tensor captures the statistical distribution of directions by its shape after voting, while a scalar cannot. Moreover, tensor voting is capable of performing extrapolation (inference) as well as interpolation. More importantly, geometric saliency information is encoded in the resulting tensor, which is absent in any interpolation scheme. We address feature saliency shortly in the vote interpretation section.

4.3 Voting fields for the general case

Now, consider the general case that no normal is available at A. We want to reduce this case
to the basic case, so we need to estimate the normal at A. Without any a priori assumption, all
directions are equally likely as the normal direction at A. Hence, we rotate the 2D stick voting
field at A. During the rotation, it casts a large number of stick votes to a given point B . All stick
votes received at B are converted into second order moments, and the tensor sum is accumulated.
This is exactly the same as casting stick votes as described in the previous sections.
Then, we compute the eigensystem of the resulting tensor to estimate the most likely normal
at B , given by the direction of the major axis of the resulting tensor inferred at B , or e^1 .
Alternatively, for implementation efficiency, instead of computing the tensor sum on-the-fly
at a given vote receiver B , we precompute and store tensor sums due to a rotating stick voting
field received at each quantized vote receiver within a neighborhood. We call the resulting field
a 2D ball voting field, which casts ball votes in A’s neighborhood. Fig. 4d shows the ball voting
field, which stores the eigensystem of the tensor sum at each point. Note the presence of two
eigenvectors at each site in Fig. 4d.


4.4 Vote interpretation

In 4D, we can define similar stick and ball voting fields. After collecting the second order moments of the received votes, they are summed to produce a 4D second order symmetric tensor, which can be visualized as a 4D ellipsoid represented by the corresponding eigensystem

    sum_{i=1}^{4} lambda_i ê_i ê_i^T          (5)

where lambda_1 >= lambda_2 >= lambda_3 >= lambda_4 >= 0 are the eigenvalues, and ê1, ê2, ê3, ê4 are the corresponding eigenvectors. The eigenvectors determine the orientation of the 4D ellipsoid; the eigenvalues determine its shape.
Consider any point in the 4D space. It either lies on a smooth structure, lies at a discontinuity,
or is an outlier. If it is a hypersurface point, the stick votes received in its neighborhood reinforce
each other, indicating high agreement among the tensor votes. The inferred tensor should be stick-
like, that is, λ₁ ≫ λ₂ ≈ λ₃ ≈ λ₄, indicating certainty in a single direction. An outlier, on the other
hand, receives only a few inconsistent votes, so all its eigenvalues are small. We can thus define the
surface saliency as λ₁ − λ₂, with ê₁ indicating the normal direction to the hypersurface. Moreover,
at a discontinuity or a point junction where several surfaces intersect exactly at a single point
(e.g., the apex of a 4D cone), the tensor votes disagree strongly, indicating that no
single direction is preferred. Junction saliency is indicated by a high value of λ₄ (and thus of all other
eigenvalues). Outlier noise is characterized by low vote saliency and low vote agreement.
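As a concrete illustration (our own sketch, not the paper's code), the decomposition and the two saliency measures can be computed with a standard eigen-solver; agreeing stick votes produce a stick-like tensor, while conflicting votes produce a ball-like one.

```python
import numpy as np

def interpret_tensor(T):
    """Decompose a 4x4 second-order symmetric tensor; return the
    hypersurface saliency (lambda1 - lambda2), the junction saliency
    (lambda4), and the estimated normal e1."""
    w, v = np.linalg.eigh(T)            # eigenvalues in ascending order
    w, v = w[::-1], v[:, ::-1]          # reorder so lambda1 >= ... >= lambda4
    return w[0] - w[1], w[3], v[:, 0]

rng = np.random.default_rng(0)
e1 = np.array([1.0, 0.0, 0.0, 0.0])
# Agreeing stick votes: normals tightly clustered around e1.
agree = sum(np.outer(n, n) for n in
            (e1 + 0.05 * rng.standard_normal(4) for _ in range(50)))
# Conflicting votes: normals drawn uniformly at random.
rand_n = rng.standard_normal((50, 4))
rand_n /= np.linalg.norm(rand_n, axis=1, keepdims=True)
conflict = sum(np.outer(n, n) for n in rand_n)

surf_a, junc_a, normal = interpret_tensor(agree)
surf_c, junc_c, _ = interpret_tensor(conflict)
```

The agreeing votes yield a large λ₁ − λ₂ with a normal close to ê₁; the random votes yield nearly equal eigenvalues, hence a large λ₄ and a small surface saliency.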

5 Motion segmentation and outlier rejection by 4D tensor voting

We use 4D tensor voting to perform motion segmentation and outlier rejection. The input
to our algorithm consists of a set of potential point matches. False correspondences may be
present. In a non-static scene where multiple motions exist, point matches that correspond to a
single consistent motion should cluster and lie on a 4D cone in the joint image space. To reject
false matches while retaining the matching points contributed by the background and/or salient
motion, we make use of the property that outliers do not lie on any of these cones.

5.1 From point matches to joint image coordinates

Before encoding the data points into 4D tensors (next section), we first normalize the 2D
point matches. Since matching points found by correlation methods may be shifted to
quantized positions, this data normalization step brings added stability to later steps.


For each potential matching point pair (uₗ, vₗ), (uᵣ, vᵣ) in images 1 and 2 respectively, we
normalize the points to (u′ₗ, v′ₗ), (u′ᵣ, v′ᵣ) by translation and scaling [8]: center the image data at
the origin, and scale the data so that the mean distance between the origin and the data points is
√2 pixels. The 4D joint image coordinates are then formed by stacking the normalized 2D
corresponding points together: (u′ₗ, v′ₗ, u′ᵣ, v′ᵣ).
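A minimal sketch of this normalization step in Python/NumPy (function names are ours, not the paper's):

```python
import numpy as np

def normalize_2d(points):
    """Hartley-style normalization [8]: translate the centroid to the
    origin and scale so the mean distance to the origin is sqrt(2)."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    shifted = points - centroid
    mean_dist = np.linalg.norm(shifted, axis=1).mean()
    return shifted * (np.sqrt(2.0) / mean_dist)

def joint_image_coords(left, right):
    """Stack normalized left/right matches into 4D points (u'_l, v'_l, u'_r, v'_r)."""
    return np.hstack([normalize_2d(left), normalize_2d(right)])
```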

5.2 Tensor encoding

The joint image coordinates are points in 4D space. Each is first encoded as a 4D
ball tensor to indicate no orientation preference, since initially the points carry no orientation
information. Geometrically, a 4D ball tensor can be thought of as a 4D hypersphere with no
preferred direction. Mathematically, it can be represented by the equivalent
eigensystem with four equal eigenvalues and four orthonormal unit vectors:

    Σᵢ₌₁⁴ λᵢ êᵢ êᵢᵀ = Σᵢ₌₁³ (λᵢ − λᵢ₊₁) Σⱼ₌₁ⁱ êⱼ êⱼᵀ + λ₄ Σⱼ₌₁⁴ êⱼ êⱼᵀ    (6)
                  = λ₄ Σⱼ₌₁⁴ êⱼ êⱼᵀ    (7)
                  = λ₄ B    (8)

if all λᵢ's are equal. B = Σⱼ₌₁⁴ êⱼ êⱼᵀ is the 4D ball tensor.


In our system, we simply set λ₁ = λ₂ = λ₃ = λ₄ = 1. Hence, the first term in Equation (6) is
zero, giving a ball tensor. We use the world coordinate system to initialize ê₁, ê₂, ê₃, and ê₄, so
ê₁ = [1 0 0 0]ᵀ, ê₂ = [0 1 0 0]ᵀ, etc. Any four orthonormal vectors in 4D can do the job.
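This basis-independence is easy to verify numerically. In the sketch below (our own check, not the paper's code), a random orthonormal basis obtained from a QR factorization yields the same ball tensor, the 4×4 identity, as the world basis.

```python
import numpy as np

rng = np.random.default_rng(2)
# Any orthonormal basis of R^4, e.g. the Q factor of a random matrix.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
# Eq. (7) with lambda_4 = 1: the ball tensor B = sum_j e_j e_j^T.
ball = sum(np.outer(Q[:, j], Q[:, j]) for j in range(4))
# The same sum over the world basis e_1 = [1 0 0 0]^T, etc.
world_ball = sum(np.outer(e, e) for e in np.eye(4))
```

Both sums equal the identity: the encoding depends only on the eigenvalues, not on the particular orthonormal frame chosen.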
Thus, the joint image coordinates indicate the position of the encoded 4D tensor, whose shape
will then be changed after the first voting pass, where a new tensor is produced at that location.
This new tensor is no longer an isotropic hypersphere, but possesses orientation.
After the first voting pass with the 4D ball voting field, the accumulated tensor is decomposed into
the corresponding eigensystem. The ê₁ component of the decomposed tensor points
along the normal to the surface (section 5.3).
In the second pass, the 4D stick voting field is aligned with the normal to propagate the
smoothness constraint (section 5.4), reinforcing true data points lying on a smooth (conical) sur-
face while suppressing false data points not lying on any smooth structure. Reinforcement and
suppression are expressed in terms of the hypersurface saliencies (λ₁ − λ₂) obtained after
decomposing the accumulated tensor of the second voting pass.


5.3 Normal estimation by the ball voting field

GENTENSORVOTE in Algorithm 2 (appendix) is called here, with the encoded 4D ball tensors
as voters and votees. This is the first of the two tensor voting passes. After the input has
been encoded as a set of ball tensors, the tensors communicate with each other through our voting
fields to estimate the most likely normal at each 4D point.
Communication among all input ball tensors is achieved by the 4D ball voting field V_B(P)
applied at a 4D point P. V_B(P) is obtained by rotating and integrating the vote contributions of the
4D stick voting field V_S(P) (defined shortly). The 4D ball voting field has the form
    V_B(P) = ∫₀²π ∫₀²π ∫₀²π R_{θ₁θ₂θ₃θ₄} V_S(R⁻¹_{θ₁θ₂θ₃θ₄} P) Rᵀ_{θ₁θ₂θ₃θ₄} dθ₁ dθ₂ dθ₃ |_{θ₄=0}    (9)

where θ₁, θ₂, θ₃, θ₄ are rotation angles about the w, z, y, x axes respectively, and R_{θ₁θ₂θ₃θ₄} is the
rotation matrix that aligns V_S(P).
As in the 2D case, let a 4D point P₁ communicate with another 4D point P₂ by using the ball
voting field, that is, by aligning V_B with P₁. Since V_B has non-zero volume, if P₂ is contained
in V_B's neighborhood, then P₂ is said to receive a ball tensor vote from P₁, or, equivalently, P₁
has cast a ball vote to P₂. P₂ receives ball tensor votes cast by all other P₁'s and sums them up
using tensor addition⁵. The resulting symmetric tensor is decomposed into the corresponding
eigenvalues and eigenvectors. The hypersurface saliency, which indicates the likelihood of P₂
lying on a (conical) surface, is given by λ₁ − λ₂. The normal to the surface in 4D is given by ê₁,
the preferred direction given by the majority vote.

5.4 Enforcement of smoothness constraint by the stick voting field

GENTENSORVOTE in Algorithm 2 is called again, with the 4D tensors obtained in the first pass
as voters and votees. Let us first describe V_S, the 4D stick voting field. Its design rationale is
the same as that of its 2D counterpart: to encode the proximity and smoothness (constancy of
curvature) constraints, but now in 4D. Following the same arguments, the question we pose and
its answer are the same as those of section 4, illustrated in Fig. 4, except now in 4D.
The 4D stick voting field V_S can in fact be derived from the 2D stick voting field: starting
from the 2D stick voting field, denoted S_2D, we first rotate⁶ S_2D by 90°. Denote the rotated
field by R_{π/2} S_2D. Place this field in the 4D space, at the origin and aligned with [1 0 0 0]ᵀ. Then,
⁵ Since a second order symmetric tensor in 4D is represented by a 4 × 4 symmetric matrix, tensor addition of two tensors is
simply 4 × 4 matrix addition. See ADDTENSOR in the appendix.
⁶ The rotation can be clockwise or counterclockwise, since S_2D is symmetric.


rotate the field about the x axis, producing a sweeping volume in the 4D space. The result is V_S,
the 4D stick voting field, which describes the direction along [1 0 0 0]ᵀ and postulates the
normal directions in a neighborhood. The size of the neighborhood is determined by the size of
the voting field, i.e., the scale σ in Equation (3).
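Because the 4D field is a rotational sweep of the 2D field, a stick vote in any dimension can be evaluated by reducing the computation to the plane spanned by the voter normal and the receiver direction. The sketch below is our own implementation of that reduction (not the paper's code); `sigma` and `c` again stand in for the decay parameters of Equation (3).

```python
import numpy as np

def stick_vote_nd(normal, p, sigma=5.0, c=1.0):
    """Stick vote in n dimensions, obtained from the 2D field by
    rotational symmetry about the voter normal (a sketch)."""
    p = np.asarray(p, dtype=float)
    d = len(p)
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    l = np.linalg.norm(p)
    if l == 0:
        return np.outer(n, n)
    p_perp = p - np.dot(p, n) * n          # component in the tangent hyperplane
    if np.linalg.norm(p_perp) < 1e-12:     # receiver straight along the normal
        return np.zeros((d, d))
    t = p_perp / np.linalg.norm(p_perp)    # in-plane tangent direction
    theta = np.arctan2(abs(np.dot(p, n)), np.dot(p, t))
    if theta > np.pi / 4:                  # 45-degree cutoff, as in 2D
        return np.zeros((d, d))
    s = l if theta == 0 else theta * l / np.sin(theta)       # arc length
    kappa = 0.0 if theta == 0 else 2.0 * np.sin(theta) / l   # curvature
    decay = np.exp(-(s * s + c * kappa * kappa) / (sigma * sigma))
    sign = 1.0 if np.dot(p, n) >= 0 else -1.0
    # Rotate the normal by 2*theta in the (n, t) plane, toward the receiver.
    v = np.cos(2 * theta) * n - sign * np.sin(2 * theta) * t
    return decay * np.outer(v, v)
```

A receiver on the voter's tangent hyperplane receives a full-strength vote along the voter normal; a receiver along the normal direction receives no vote.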
Recall that a 4D normal is obtained at each point after the first voting pass. The second pass
propagates the smoothness constraint, so as to reinforce true data points via smoothness
propagation and to suppress noisy points that do not lie on any smooth structure. We align
the 4D stick voting field V_S(P) with the normal (given by the ê₁ component of the inferred
tensor) estimated at point P to vote for a smooth surface. The stick voting field V_S(P) propagates
the local continuity or smoothness constraint in a neighborhood. Votes are received and summed
up using tensor addition.
Let a point receive a stick vote [n₁ n₂ n₃ n₄]ᵀ. The covariance matrix consisting of the vote's
second order moments is calculated, which is a 4 × 4 symmetric matrix. This symmetric matrix
is equivalent to a second order symmetric tensor in 4D. The given point accumulates this matrix
by the tensor sum, which simply adds up the matrices. Vote interpretation by decomposing the
accumulated matrix into the corresponding eigensystem follows.
Since false matching points do not form smooth features with the true matching points,
their directional votes do not agree in orientation and thus do not produce a consistent
normal direction in the majority vote. These points will therefore have low hypersurface
saliency. In contrast, correct matching points reinforce each other to produce high hyper-
surface saliency.

5.5 Rejection of wrong matches

Recall that after collecting and accumulating the tensor sum, the accumulated result is still a
second order symmetric tensor in 4D. We compute the corresponding eigensystem

    Σᵢ₌₁⁴ λᵢ êᵢ êᵢᵀ = (λ₁ − λ₂) ê₁ê₁ᵀ + (λ₂ − λ₃)(ê₁ê₁ᵀ + ê₂ê₂ᵀ) + (λ₃ − λ₄)(ê₁ê₁ᵀ + ê₂ê₂ᵀ + ê₃ê₃ᵀ) + λ₄(ê₁ê₁ᵀ + ê₂ê₂ᵀ + ê₃ê₃ᵀ + ê₄ê₄ᵀ).

True matching points will have a high hypersurface saliency, λ₁ − λ₂. The saliency
value indicates whether a point lies on some smooth structure such as a cone.
After this step, for points of high hypersurface saliency, we still do not know which cone
the corresponding match belongs to. A low hypersurface saliency value, on the other hand,
indicates that the point does not lie on any local smooth structure, and hence not on any
cone; such a point is discarded as a false match.
In [18], we used extrema detection to discard false matches. In this paper, we adopt a simpler


Fig. 5. 2D illustration to explain the successive motion segmentation, outlier rejection and parameter estimation.

and faster technique, which classifies a point match as correct if the associated hypersurface
saliency is above the mean hypersurface saliency value. We find that enforcing the global
cone-fitting constraint is more efficient and effective for the purpose of parameter estimation.
The thresholded set of good matches is then used for fundamental matrix estimation.
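The thresholding step itself is tiny. In this sketch (our own code), `saliencies[i]` holds λ₁ − λ₂ for match `i` after the second voting pass:

```python
import numpy as np

def reject_false_matches(matches, saliencies):
    """Keep a match only if its hypersurface saliency (lambda1 - lambda2)
    exceeds the mean saliency over all matches."""
    saliencies = np.asarray(saliencies, dtype=float)
    keep = saliencies > saliencies.mean()
    return [m for m, k in zip(matches, keep) if k]
```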

5.6 Parameter estimation and motion extraction

Now we have a set of good matches. These good matches satisfy the local continuity con-
straint (i.e., a high likelihood of lying on a smooth structure). We want to enforce a global,
point-cone constraint on this set of filtered matches. If multiple independent motions exist,
the matching points correspond to distinct epipolar geometries, each with its respective
fundamental matrix. To estimate the fundamental matrix corresponding to the most salient
motion (or the motionless background), we apply RANSAC or the normalized linear method;
either is sufficient since we now have a set of good matches.
We use a 2D illustration in Fig. 5 to explain the successive extraction of motion components.
Let Sᵢ be the set of point matches after subtracting the i-th set of motion matches (Fig. 5a),
where S₀ is the set of input point matches obtained after tensor encoding (section 5.2). After 4D
tensor voting, a filtered set of point matches Tᵢ is obtained (Fig. 5b). Note that Tᵢ may contain more


than one motion component. RANSAC or other techniques can be performed on Tᵢ to estimate
the dominant fundamental matrix (Fig. 5c). Let Rᵢ ⊆ Sᵢ be the maximal subset of matches
that produces the minimal residual according to the estimated fundamental matrix. We use the
normalized Eight-Point Algorithm (eigen-analysis) with singularity enforcement to estimate the
dominant fundamental matrix. Then, we apply the above parameter estimation to Sᵢ − Rᵢ to
successively extract the next dominant fundamental matrix. The process is repeated
until Tᵢ = Rᵢ, that is, until only one motion component is found by RANSAC in the inliers returned by
4D tensor voting.
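The successive extraction loop can be sketched as follows. Here `tensor_voting_filter` and `ransac_fit` are placeholders standing in for the components described above, not the paper's actual routines: the first returns the high-saliency subset Tᵢ of Sᵢ, the second returns the dominant fundamental matrix and its inlier set Rᵢ.

```python
def extract_motions(matches, tensor_voting_filter, ransac_fit, max_motions=10):
    """Successively extract (F_i, R_i) pairs: filter S_i by saliency to get
    T_i, fit the dominant fundamental matrix on T_i, then subtract the
    inlier set R_i and repeat until only one motion component remains."""
    S = set(matches)                     # S_0: input matches
    motions = []
    for _ in range(max_motions):
        T = tensor_voting_filter(S)      # T_i: high-saliency subset of S_i
        if len(T) < 8:                   # eight points are needed to define F
            break
        F, R = ransac_fit(T)             # dominant F and its inlier set R_i
        motions.append((F, R))
        if T == R:                       # only one motion component was left
            break
        S = S - R                        # S_{i+1} = S_i - R_i
    return motions
```

With three simulated motion groups plus a little noise, the loop peels off one group per pass, mirroring the description above.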

6 Time complexity analysis

Let N be the total number of input matches, and let k ≥ 3 be the size of the voting field used
in tensor voting; k is often chosen to be half of the image dimension. Our experiments show
that the results are not sensitive to k. Note that we do not store an entire 4D voting field, which
would be costly in storage. Instead, exploiting symmetry, we use a 1D array to store the ball
voting field and a 2D array to store the stick voting field, making use of Algorithm 2 to precompute
and store the votes.
Data conversion (section 5.1) and tensor encoding (section 5.2) each take O(N) time.
The two passes of tensor voting (sections 5.3 and 5.4) take O(kN) time each. Outlier rejection
and motion segmentation (section 5.5) take O(N) time. Parameter estimation (section 5.6)
takes O(K) time, where K is the total number of subsets used in RANSAC. Therefore, if
there are n motions, the total time complexity is O(n(kN + K)), which is essentially linear in
N. Note that for any n, N need only meet the minimum number of points necessary to
define a fundamental matrix, namely eight, which is independent of n (cf. [23]).
In practice, for epipolar geometry estimation, we set K = 1000; a complete run takes only a
few seconds. Since we already have a set of good matches by the time we run RANSAC, we do
not need to try all possible subsets, whose number is exponential. Currently, the size of each
subset is 10 to 15. Estimation stability increases with subset size, but only up to a point; if the
filtered set of good matches is small, we use smaller subsets.

7 Results

We have performed extensive experiments to evaluate our 4D tensor voting approach,
categorized as follows:


7.1 Evaluation on motion segmentation (synthetic data)

First, we evaluate our system on fourteen synthetic non-static scenes that exhaust all
combinations of camera and object transformations (rotation and translation), as tabulated in
Table II, where ground truths are available for direct verification. The objects are of different
sizes. To evaluate noise robustness, five noise-to-signal levels are tested: 1, 1.5, 2, 2.5, and 3.
A noise-to-signal ratio of 3, for instance, means there are three wrong correspondences for each
correct match. A total of 70 (= 14 × 5) experiments is run to enumerate all noise scenarios.

Each scene consists of three independent motions. The types of camera motion are also shown
in Table II. Fig. 6 shows the two views of the moving objects for each case. For clarity of display,
the noisy correspondences are not shown.
Three experimental conditions are worth noting. First, since RANSAC is used in our pa-
rameter estimation, we run each RANSAC with different numbers of random subsets (iterations)
to evaluate the effect on our results. In general, the results improve with an increasing number of
subsets. Second, results without noise rejection by 4D tensor voting (which we call RANSAC ONLY)
are compared against our method (which we call 4DRANSAC) to justify its significance. Since random
permutations are used to estimate the epipolar geometries, we run our system 20 times in each
experiment and report the mean performance on motion pixel segmentation. Third, to
provide a fair basis for comparing successive motion segmentation among all methods, the in-
put is always updated by A ← A − result(4DRANSAC), not by A ← A − result(METHOD).
Algorithm 1 summarizes the rundown of our experiment.
True (resp. false) positives, as defined in Algorithm 1, indicate the number of correctly (resp.
incorrectly) labeled motion correspondences. True (resp. false) negatives, which indicate the
performance on classifying noisy matches, can be derived similarly⁷.
In Algorithm 1, the METHODs we implemented and tested are: 4D ONLY, RANSAC ONLY,
and our 4DRANSAC, which consists of outlier rejection by 4D tensor voting followed by
RANSAC estimation of epipolar geometries.
To facilitate analysis of our extensive experiments, the results of all 70 experiments are further
categorized by object motions:
(a) Pure translation (scenes (1), (6)), Fig. 7a, Fig. 8a, and Fig. 9a.
⁷ TN ← count((input − result(METHOD)) ∩ (input − correct)); FN ← count((input − result(METHOD)) ∩ correct).


Algorithm 1 EVALUATESEGMENTATIONPERFORMANCE(input, correct, METHOD)

This function computes TP and FP (true positives and false positives, defined below), which
indicate the performance of motion segmentation.
Input:
  input = the set of correct matches plus noise
  correct = allgt if METHOD = 4D ONLY; gt[M] otherwise
  allgt = the set of all correct matches (ground truths)
  gt[M] = correct matches for motion M
  METHOD = { 4D ONLY | RANSAC ONLY | 4DRANSAC }
Output:
  result(METHOD) = inliers classified by METHOD
for each noise-to-signal ratio = 1, 1.5, 2, 2.5, 3 do
  for number of random subsets (iterations) = 1000, 5000, 10000 do
    for i = 1 to 20 do
      input₁ ← input
      for each pass M of motion extraction do
        for each METHOD do
          result(METHOD) ← METHOD(input_M)
          TP_{M,i,METHOD} ← count(result(METHOD) ∩ correct)
          FP_{M,i,METHOD} ← count(result(METHOD) ∩ (input_M − correct))
        end for
        input_{M+1} ← input_M − result(4DRANSAC)
      end for
    end for
    for each extracted motion M and METHOD do
      TP_{M,METHOD} ← Σᵢ₌₁²⁰ TP_{M,i,METHOD} / 20
      FP_{M,METHOD} ← Σᵢ₌₁²⁰ FP_{M,i,METHOD} / 20
      output TP_{M,METHOD} and FP_{M,METHOD} at the current noise-to-signal ratio and
      number of random subsets
    end for
  end for
end for


Fig. 6. Non-static synthetic scenes (1)–(14).

(b) Pure rotation (scenes (2), (4), (8), (10), (13)), Fig. 7b, Fig. 8b, and Fig. 9b.
(c) Translation and rotation (scenes (3), (5), (7), (9), (11), (12)), Fig. 7c, Fig. 8c, and Fig. 9c.
The plots for 4D ONLY are interpreted as follows. In Fig. 7a, the plots for true positives TP
(resp. false positives FP) correspond to the mean TP (resp. FP) for scenes (1) and (6) after
running 4D ONLY. In Fig. 7b, the TP (resp. FP) plots correspond to the mean TP (resp. FP)
after running 4D ONLY on scenes (2), (4), (8), (10), and (13). Fig. 7c shows the averaged result
on segmenting the three motions for scenes (3), (5), (7), (9), (11), and (12). As there are three


Number of objects Object motion Camera motion

plane pyramid sphere translation rotation translation rotation

(1) 3 0 0 yes no yes no

(2) 3 0 0 no yes yes yes

(3) 3 0 0 yes yes yes yes

(4) 0 0 3 no yes yes no

(5) 0 0 3 yes yes yes no

(6) 0 0 3 yes no yes no

(7) 3 0 0 yes yes yes yes

(8) 3 0 0 no yes yes no

(9) 3 0 0 yes yes yes no

(10) 0 0 3 no yes yes no

(11) 0 0 3 yes yes yes yes

(12) 2 1 0 yes yes yes yes

(13) 1 1 1 no yes yes yes

(14) 1 1 1 yes yes yes yes

TABLE II. Non-static synthetic scenes. Each scene consists of three independent motions.
Different noise-to-signal ratios (1, 1.5, 2, 2.5, 3) were tested.

motions (M1, M2, and M3), three measures on TP and FP are reported. Plots for RANSAC
ONLY and 4DRANSAC are interpreted similarly.
In each 3D plot, the three axes denote the noise level (in %), the number of
RANSAC iterations (random subsets) used to estimate the fundamental matrix, and the TP/FP
percentages. Note that the TP/FP percentages are defined differently for 4D ONLY and the
other two methods, since 4D ONLY (unlike 4DRANSAC) is primarily used for noise rejection:
no motion classification is done there, as classification is performed in the parameter estimation
step that follows, so allgt is used instead of gt(M). For each motion M, 1 ≤ M ≤ 3,
the TP percentage is defined as TP_M / count(input_M ∩ allgt) × 100%, and the FP percentage as
FP_M / count(input_M − allgt) × 100%. That is, the inliers identified by 4D ONLY are not classified into
the object motion to which they belong. In RANSAC ONLY and our 4DRANSAC, the TP
percentage for motion M, 1 ≤ M ≤ 3, is defined as TP_M / count(input_M ∩ gt(M)) × 100%, and the FP
percentage as FP_M / count(input_M − gt(M)) × 100%. Given the
above details, the plots in Fig. 7–9 are interpreted simply:
good performance is indicated by high values of TP and low values of FP.
We draw the following conclusions from our experiments:
1. RANSAC ONLY fails to identify true positives in all 70 experiments.
2. 4DRANSAC is very robust to incorrect matches.



Fig. 7. Performance on outlier rejection by 4D ONLY: (a) pure translation; (b) pure rotation; (c) translation and
rotation. TP denotes true positives and FP false positives. Three motions M, 1 ≤ M ≤ 3, are segmented.
The vertical axis denotes the TP/FP percentage, the left axis the number of RANSAC iterations
(random subsets) used to estimate the fundamental matrix, and the right axis the noise level (in %).

3. The segmentation results improve with an increasing number of random subsets (iterations).
In practice, fewer iterations suffice for acceptable epipolar geometry estimation, except in
the scenario addressed in the limitation section (section 8.5).
4. For scenes (1) and (6), which involve pure object translations, the performance of
4DRANSAC (e.g., the plot for TP_M3) is not satisfactory.
Readers may ask for a comparison of 4DRANSAC with other representative methods such
as LMedS and the M-Estimator; these are addressed in the next section. It is worth noting, however,
that the strength of 4DRANSAC is its noise robustness and its ability to segment motion pixels
successively. LMedS uses the median as its threshold, and with noisy correspondences the median
assumption may not hold. The M-Estimator can extract the most salient motion/background, but
given multiple motions it is difficult to adjust its thresholds to segment them.



Fig. 8. Performance on outlier rejection by RANSAC ONLY: (a) pure translation; (b) pure rotation; (c) translation
and rotation. TP denotes true positives and FP false positives. Three motions M, 1 ≤ M ≤ 3, are
segmented. The vertical axis denotes the TP/FP percentage, the left axis the number of RANSAC
iterations (random subsets) used to estimate the fundamental matrix, and the right axis the noise
level (in %).

7.2 Evaluation of epipolar geometry estimation (real/synthetic data)

Refer to the quantitative results in Table III. Fig. 3 shows the results on a synthetic image
pair of a non-static scene, THREE SPHERES, captured by a moving camera. Random wrong
matches are added to the set of input point matches. The number of wrong matches added
equals the number of true matching points, so the noise-to-signal ratio is 1. This ratio increases
up to 5.788 as we subtract the matches corresponding to salient motions in subsequent
passes. For example, after the first salient motion is extracted, we subtract 109 matches from the
input, leaving 83 correct data points for the second pass. The first two rows of figures in the
table for FAN, UMBRELLA, CAR and TOYS are interpreted similarly.
When THREE SPHERES has been processed, wrong matches (blue crosses) are classified and
discarded, and the three salient motions (inliers) are identified. In Fig. 3, the epipolar lines and
correspondences of the three extracted motions are colored red, green, and orange, respectively.
Note that, in each pass, we label all outliers correctly (all but one in the last run). All the



Fig. 9. Performance on outlier rejection by 4DRANSAC: (a) pure translation; (b) pure rotation; (c) translation and
rotation. TP denotes true positives and FP false positives. Three motions M, 1 ≤ M ≤ 3, are segmented.
The vertical axis denotes the TP/FP percentage, the left axis the number of RANSAC iterations
(random subsets) used to estimate the fundamental matrix, and the right axis the noise level (in %).

inliers are correctly classified. Three real examples are shown in Fig. 12. The FAN scene in
Fig. 12a is an indoor scene in which an electric fan rotates about an axis; we identify both the
moving fan and the static scene behind it. In Fig. 12b, the UMBRELLA scene shows a walking
man holding an umbrella; our system discards outliers and identifies the motion and the
background. Fig. 12c shows the CAR scene. In Fig. 13, the TOYS scene contains a forward
camera motion and two toy cars moving in different directions; we extract the forward camera
motion and segment the two additional motions despite the large amount of noise added. Our
4D system rejects outliers, retains motion matches, and produces the corresponding epipolar
geometries for the non-static scene. In all examples, the camera as well as some objects in the
scene are in motion. Note that the normalized Eight-Point Algorithm and the robust methods
in [25] fail on all our noisy sets of matches⁸.
⁸ In practice, correspondence establishment can be performed by automatic cross-correlation or manual picking (section 2),
which produces point matches less noisy than those used in our experiments. Here, we test our approach to the extreme, in the
presence of a large amount of noise.


T HREE S PHERES FAN U MBRELLA C AR TOYS

most 2nd 3rd most 2nd most 2nd most 2nd most 2nd 3rd

salient salient salient salient salient salient salient salient salient salient salient salient

No. of correct data points Si 192 83 33 111 30 84 21 93 34 81 47 21

No. of incorrect data points 192 192 191 151 150 101 99 101 100 150 148 147

noise/signal ratio 1.000 2.313 5.788 1.360 5.000 1.202 4.714 1.086 2.941 1.852 3.149 7.000

(a) Results on 4D tensor voting

No. of correct inliers Ti 150 72 33 80 27 61 19 71 20 140 59 20

No. of incorrect inliers 2 1 0 3 9 3 7 7 2 14 6 5

(b) Results on parameter estimation

No. of correct inliers Ri 109 50 33 81 30 63 21 59 30 79 45 20

No. of incorrect inliers 0 1 0 1 1 2 2 1 0 2 1 2

Scale σ used in 4D analysis 400 400 400 400 400

No. of random trials 1000 10000 10000 10000 50000

No. of points in each subset 10 15 15 15 15

TABLE III. Four experimental results on non-static scenes.

8 Discussion

In this section, we further compare our algorithm with representative algorithms of epipolar
geometry estimation. Then, we state issues and limitations of our method.

8.1 Comparison with representative algorithms

Our evaluation uses static and non-static scenes. For static scenes, we used 7 real data sets
with noise provided in the Matlab toolkit written by Armangué [3]⁹, and generated the
fundamental matrix using both our method and other algorithms provided in the toolkit. The
average pixel distances between the matching points and the estimated epipolar lines are plotted
in Fig. 10. Armangué's toolkit provides implementations of about 15 epipolar geometry
estimation methods. Owing to space limitations, we report the results of the five methods that
give the best performance; readers can refer to a technical report [19] for results on the other
methods. As ground truth data for the non-static scenes are difficult to obtain by automatic
means, we hand-picked good matches and added random false matches in our test cases. Then
the algorithms implemented in the toolkit and our algorithm are compared. The results are
plotted in Fig. 11.
It is clear that our method produces noticeably better results than the other algorithms on the
non-static scenes in our experiments. For static scenes, Torr's M-estimator [22] performs slightly
better than our method. This is understandable, as the false match filtering by 4D tensor voting
is tailored for non-static scenes, in which only the local continuity constraint is maintained.
⁹ http://eia.udg.es/armangue/research/


Fig. 10. Comparison for static scenes.

Fig. 11. Comparison for scenes with one independent motion.

For static scenes, we could also enforce the global constraint, but that would require changing
the scale of analysis in tensor voting. Since our aim is a system with no free parameters, we
trade this slight inaccuracy for complete automation.

8.2 Comparison with the 8D approach

Note that the advantages of 8D voting [18] are inherited in this 4D formulation: non-
iterative processing and robustness to a considerable amount of outlier noise. An additional
advantage of the dimensionality reduction is brought by the joint image space, which is isotropic
and parameterized by (uₗ, vₗ, uᵣ, vᵣ). In [18], the 8D space is parameterized by
(uₗuᵣ, vₗuᵣ, uᵣ, uₗvᵣ, vₗvᵣ, vᵣ, uₗ, vₗ), which is neither isotropic nor orthogonal, nor independent.

Hence, in the 8D case, in order to satisfy the isotropy assumption of tensor voting, we have
to scale the parameter space for voting and re-scale it for computing the fundamental matrix,
and some precision is lost in these scaling and re-scaling operations. Multiple passes (more than
two) are needed in practice to improve the accuracy, whereas in 4D only two passes are needed
in all our experiments. Note further that in 8D we need at least eight points to fix an 8D normal,
while only four are sufficient to fix a 4D normal.


8.3 Tensor voting versus generalized Hough transform

As shown in Section 6, the time and space complexities of 4D tensor voting are independent
of the dimensionality, unlike those of the Hough transform; our method is therefore more
efficient in high dimensions. Although tensor voting and the Hough transform share the idea
of a voting scheme, their computational frameworks are very different. In the Hough
transform [4], voting occurs in a quantized high-dimensional parameter space, where each cell
receives a number of scalar votes from the voters, and the maximal support is given by the cell
receiving the maximum count. In the 4D tensor voting used in this paper, voting also occurs in
a high-dimensional space, but only the points corresponding to the input joint images receive
tensor votes, cast efficiently by aligning voting fields, which are 1D or 2D arrays in the
implementation. Moreover, no quantization is necessary.
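The storage contrast can be sketched as follows (the sizes below are hypothetical, chosen only to illustrate the argument): a Hough accumulator must quantize the entire 4D parameter space, whereas tensor voting keeps one 4x4 tensor per input match.

```python
import numpy as np

bins_per_axis, dim, n_matches = 64, 4, 500

# Hough-style: a dense scalar accumulator over the quantized 4D space.
hough_cells = bins_per_axis ** dim          # grows exponentially with dimension

# Tensor-voting-style: one 4x4 symmetric tensor per input match only.
tensors = [np.zeros((4, 4)) for _ in range(n_matches)]
tensor_numbers = n_matches * 4 * 4          # grows linearly with the input size

print(hough_cells, tensor_numbers)
```

Even at a coarse 64 bins per axis, the 4D accumulator already needs millions of cells, while the sparse tensor representation stays proportional to the number of matches.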

8.4 Estimating n matrices versus a single multibody matrix

In terms of epipolar geometry estimation and motion detection, 4D tensor voting adopts a
geometric approach, propagating a strong geometric smoothness constraint in the 4D joint
image space to detect smooth 4D structures. This approach is effective: when false matches
and multiple motion matches co-exist, multiple motion segmentation is particularly challenging,
because matches belonging to a less salient motion may be mis-classified as outliers with respect
to a more salient motion. Our method is inspired by robust structure inference in 2D and surface
fitting in 3D (for a comparison of related work, see [12]). The 2D and 3D tensor voting methods
reject outliers (points not lying on any smooth object) and extract multiple objects in succession,
without any genus or topological limitations: the most salient object is segmented first, followed
by less salient ones.
The 4D version considers joint image coordinates. Matches corresponding to independent
multiple motions manifest themselves as multiple point cones in the 4D joint image space, while
a false match translates into an outlier in this space, where the smoothness constraint is violated.
Our method adopts a segmentation-and-estimation approach that extracts salient motions in
succession: first, the most salient "motion" (e.g. the background), followed by the second most
salient motion (e.g. the largest moving object in the scene), and so on.
This approach is different from [24] and [23], in which two-view two-body (resp. multibody)
motion extraction and epipolar geometry estimation are performed: a (multibody) fundamental
matrix encapsulating all the fundamental matrices of the individual motions is estimated, and the
individual matrices are extracted from this composite matrix. The saliency of each motion is


unknown when its fundamental matrix is extracted. A further step may be needed to understand
the saliency of the extracted motion.

8.5 Limitations and future work

Although our approach performs better than some representative algorithms on our sample
non-static scenes, 4D tensor voting for epipolar geometry estimation in non-static scenes has its
own limitations.
We use epipolar geometry to extract motion components successively. One limitation is due
to the inherent ambiguity that the epipolar constraint maps a point to a line: in certain cases,
wrong matches can coincidentally lie on corresponding epipolar lines. Similarly, if false matches
happen to be consistent with a parametric model returned by RANSAC, they can be mis-classified
as a consistent motion instead of outliers. Our approach enforces the local smoothness constraint
in the joint image space by 4D tensor voting; however, if false matches happen to form a smooth
surface in the joint image space, our voting system either cannot detect them, or cannot reject
them in the subsequent parameter estimation step10. Also, if two motion components have very
similar epipolar geometry (e.g. their motions are nearly parallel with respect to the viewpoint),
our approach may fail to segment them into two distinct components; a single motion will be
returned, which needs further processing to distinguish the motions.
If, unfortunately, all motion matches happen to cluster around the apex of the corresponding
point cone, we need to detect the apex instead of smooth surfaces. As shown in previous
work [12] and in Section 4, regardless of dimensionality, tensor voting is well suited to detecting
smooth manifolds as well as point junctions, which are characterized by a high disagreement of
tensor votes. This difficult case therefore does not pose any theoretical difficulty to our approach:
if a high disagreement of tensor votes is detected in the close vicinity of a 4D location
(indicated by a large λ4), the corresponding point matches should be segmented.
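The eigenvalue test behind this can be sketched in a few lines of numpy (illustrative only; the function names and the junction threshold are assumptions, not from the paper): after accumulation, a large smallest eigenvalue λ4 of the 4x4 tensor signals that the incoming votes disagree in every direction.

```python
import numpy as np

def accumulate(stick_votes):
    """Sum outer products n n' of unit normal votes into one 4x4 tensor."""
    T = np.zeros((4, 4))
    for n in stick_votes:
        n = n / np.linalg.norm(n)
        T += np.outer(n, n)
    return T

def classify(T, junction_ratio=0.5):
    """Large smallest eigenvalue relative to the largest -> junction/apex."""
    lam = np.sort(np.linalg.eigvalsh(T))[::-1]   # lam[0] >= ... >= lam[3]
    if lam[3] > junction_ratio * lam[0]:
        return "junction"    # votes disagree in all directions
    return "surface"         # votes agree on a normal space

# Agreeing votes share one normal; spread votes cover all four axes.
agree = [np.eye(4)[0]] * 6
spread = [np.eye(4)[i] for i in range(4)]
```

Agreeing votes yield an elongated tensor (small λ4, a surface point), while votes spread over all axes yield a nearly isotropic ball tensor (large λ4, an apex to be segmented).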

9 Conclusion

We propose a specialization of the tensor voting algorithm in 4D, and describe a novel,
efficient, and effective method for robust epipolar geometry estimation and motion segmentation
for non-static scenes. The key idea is the rejection of outliers, which do not support each other,
whereas inliers (correct matches), even if they are small in number, reinforce each other as they
belong to the same surface (cone). This idea is translated into a voting-based computational
framework, the main technical contribution of this paper. A geometric approach is adopted, in
which salient motions are extracted in succession. By reducing the dimensionality, we solve the
problems in [18], such as the non-orthogonality and anisotropy of the parameter space. Motion
correspondences are classified into distinct sets by extracting their underlying epipolar
geometries, and only two passes of tensor voting are needed. It is shown that the new approach
can tolerate a larger amount of outlier noise. Upon proper segmentation and outlier rejection,
parameter estimation techniques such as RANSAC, LMedS, or even the normalized Eight Point
Algorithm can be used to extract motion components and epipolar geometry for non-static
scenes. Besides the issues discussed in Section 8.5, in the future we propose to perform
reconstruction into a set of motion layers, based on the extracted epipolar geometries.

10 However, in practice, we found that the corresponding motion is not very salient if such a coincidental smooth structure occurs.
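As a concrete instance of the estimation step mentioned above, the normalized Eight Point Algorithm [8] can be sketched in numpy as follows (an illustrative re-implementation under the pinhole model, not the code used in the paper).

```python
import numpy as np

def normalize(pts):
    """Translate points to their centroid and scale the mean distance to sqrt(2)."""
    c = pts.mean(axis=0)
    d = np.sqrt(((pts - c) ** 2).sum(axis=1)).mean()
    s = np.sqrt(2.0) / d
    T = np.array([[s, 0.0, -s * c[0]], [0.0, s, -s * c[1]], [0.0, 0.0, 1.0]])
    h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ h.T).T, T

def eight_point(xl, xr):
    """Estimate F such that xr' F xl = 0 for all matches (>= 8 of them)."""
    pl, Tl = normalize(np.asarray(xl, dtype=float))
    pr, Tr = normalize(np.asarray(xr, dtype=float))
    A = np.column_stack([
        pr[:, 0] * pl[:, 0], pr[:, 0] * pl[:, 1], pr[:, 0],
        pr[:, 1] * pl[:, 0], pr[:, 1] * pl[:, 1], pr[:, 1],
        pl[:, 0], pl[:, 1], np.ones(len(pl))])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)                   # least-squares null vector of A
    U, S, Vt2 = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt2   # enforce rank 2
    return Tr.T @ F @ Tl                       # undo the normalization
```

Applied to the matches of one segmented motion component, this recovers that component's fundamental matrix; the normalization step is what makes the linear estimate numerically stable.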

Acknowledgment

We are indebted to the Associate Editor and all anonymous reviewers for their very constructive
comments and thoughtful discussion with the authors throughout the review process. This
research is supported by the Research Grant Council of Hong Kong Special Administrative Re-
gion, China: HKUST6193/02E, the National Science Foundation under grant number 9811883,
and the Integrated Media Systems Center, a National Science Foundation Engineering Research
Center, Cooperative Agreement No. EEC-9529152.

Appendix

A Algorithms for 4-D tensor voting

We detail the general tensor voting algorithm [18] in this section. In [18], a high-dimensional
tensor voting algorithm is presented; C++11 and Matlab12 source code is available. The voter
uses GenTensorVote to cast a tensor vote to a vote receiver (votee). Normal direction votes
generated by GenNormalVote are accumulated using Combine, and a 4×4 outTensor is the
output. The votee thus receives a set of outTensors from the voters within its neighborhood.
The resulting tensor matrices are summed up by AddTensor, which performs ordinary 4×4
matrix addition. The resulting matrix is equivalent to a 4D ellipsoid.
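The two accumulation routines are simple enough to render directly in numpy (an illustrative translation of Combine and AddTensor described above; function names follow the pseudocode but the code itself is not from the paper's library).

```python
import numpy as np

def combine(tensorvote, stickvote, weight=1.0):
    """Combine: add the weighted outer product of a stick vote,
    tensorvote += weight * stickvote stickvote'."""
    return tensorvote + weight * np.outer(stickvote, stickvote)

def add_tensor(out_tensor, in_tensor, weight=1.0):
    """AddTensor: ordinary weighted 4x4 matrix addition of two
    second-order symmetric tensors."""
    return out_tensor + weight * in_tensor
```

A single stick vote contributes a rank-1 term; summing many such terms is exactly what turns the accumulated matrix into the 4D ellipsoid mentioned above.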
11 http://www.cs.ust.hk/cktang/TVLib.zip
12 http://www.cs.ust.hk/cstws/research/TensorVoting3D/

References

[1] A. Adam, E. Rivlin, and I. Shimshoni. ROR: Rejection of outliers by rotations. PAMI, 23(1):78–84, January 2001.


[2] P. Anandan and S. Avidan. Integrating local affine into global projective images in the joint image space. In ECCV’2000,
2000.
[3] X. Armangué, J. Pagès, J. Salvi, and J. Batlle. Comparative survey on estimating the fundamental matrix. In IX Simposium
Nacional de Reconocimiento de Formas y Análisis de Imágenes, pages 227–232, June 2001.
[4] D. H. Ballard. Parameter networks: Towards a theory of low-level vision. Artificial Intelligence, 22:235–267, 1984.
[5] J. Davis. Mosaics of scenes with moving objects. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition 1998 (CVPR’98), pages 354–360, Santa Barbara, CA, June 1998.
[6] O.D. Faugeras, Q.T. Luong, and T. Papadopoulo. The Geometry of Multiple Images. MIT Press, 2001.
[7] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.
[8] R. I. Hartley. In defense of the 8-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence,
19(6):580–593, June 1997.
[9] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN:
0521623049, 2000.
[10] H. C. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 293:133–135,
1981.
[11] Q.T. Luong and O.D. Faugeras. The fundamental matrix: Theory, algorithms, and stability analysis. IJCV, 17(1):43–75,
January 1996.
[12] G. Medioni, M.-S. Lee, and C.-K. Tang. A Computational Framework for Segmentation and Grouping. Elsevier Science,
Amsterdam, 2000.
[13] M. Nicolescu and G. Medioni. Perceptual grouping from motion cues using tensor voting in 4-d. In Proceedings of the
European Conference on Computer Vision, pages III: 423–437, 2002.
[14] M. Nicolescu and G. Medioni. Perceptual grouping into motion layers using tensor voting in 4-d. In Proceedings of the
International Conference on Pattern Recognition, pages III:303–308, 2002.
[15] P. Pritchett and A. Zisserman. Wide baseline stereo matching. In IEEE International Conference on Computer Vision 1998
(ICCV'98), pages 754–760, Bombay, India, January 1998.
[16] J. G. Semple and G. T. Kneebone. Algebraic Projective Geometry. Clarendon Press, Oxford, England, 1952.
[17] R. Sturm. Das Problem der Projektivität und seine Anwendung auf die Flächen zweiten Grades. Math. Ann., 1:533–547,
1869.
[18] C.-K. Tang, G. Medioni, and M.-S. Lee. N-dimensional tensor voting, and application to epipolar geometry estimation.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(8):829–844, 2001.
[19] W.-S. Tong. An enhanced tensor voting formalism and applications. Master’s thesis, The Hong Kong University of Science
and Technology, June 2001.
[20] W.-S. Tong, C.-K. Tang, and G. Medioni. Epipolar geometry estimation by 4d tensor voting. In IEEE Conference on
Computer Vision and Pattern Recognition, pages I:926–933, 2001.
[21] P. H. S. Torr and D. W. Murray. Statistical detection of independent movement of a moving camera. Image and Vision
Computing, 1(4), 1993.
[22] P. H. S. Torr and D. W. Murray. The development and comparison of robust methods for estimating the fundamental
matrix. Int Journal of Computer Vision, 24(3):271–300, 1997.
[23] R. Vidal, Y. Ma, S. Hsu, and S. Sastry. Optimal motion estimation from multiview normalized epipolar constraint. In
ICCV01, pages I: 34–41, 2001.
[24] L. Wolf and A. Shashua. Two-body segmentation from two perspective views. In CVPR01, pages I:263–270, 2001.
[25] Z. Zhang. Determining the epipolar geometry and its uncertainty: A review. International Journal of Computer Vision,
27(2):161–195, 1998.


(a) FAN (b) UMBRELLA (c) CAR

Fig. 12. Epipolar geometry for non-static scenes with camera motion and one independent motion. Motion components
and their corresponding epipolar lines are shown in green; background matches and epipolar lines are in red;
discarded outliers are in blue. The first two rows show the image pairs and input noisy matches; the last two
rows show the results.


Fig. 13. TOYS: epipolar geometry for forward camera motion and two additional independent motions in a non-static
scene. Motion components and their corresponding epipolar lines are shown in green and yellow; background
matches and epipolar lines are in red; discarded outliers are in blue. The first row shows the image pair and
input noisy matches; the second row shows the results.


Algorithm 2 GenTensorVote(voter, votee)


It uses GenNormalVote to compute the most likely normal direction vote at the votee. Then,
plate and ball tensors are computed by integrating the resulting normal votes cast by the voter.
They are used to vote for curves and junctions.
for all 0 ≤ i, j < 4, outTensor[i][j] ← 0
for all 0 ≤ i < 3,
  voterSaliency[i] ← voter[λi] − voter[λi+1]
voterSaliency[3] ← voter[λ3]
if (voterSaliency[0] > 0) then
  vecVote ← GenNormalVote(voter, votee)   {compute stick component}
  Combine(outTensor, vecVote)
end if
transformVoter ← voter
for i = 1 to 3 do
  if (voterSaliency[i] > 0) then
    {count[i] is a sufficient number of samples uniformly distributed on a unit (i + 1)-D sphere}
    while (count[i] ≠ 0) do
      random[direction] ← GenRandomUniformPt()
      transformVoter[direction] ← random[direction]
      if (i ≠ 3) then
        {compute the alignment matrix, except for the isotropic ball tensor}
        transformVoter[direction] ← voter[eigenvectorMatrix] × random[direction]
      end if
      vecVote ← GenNormalVote(transformVoter, votee)
      Combine(outTensor, vecVote, voterSaliency[i])
      count[i] ← count[i] − 1
    end while
  end if
end for
return outTensor


Algorithm 3 GenNormalVote(voter, votee)


A vote (vector) on the most likely normal direction is returned.
v ← votee[position] − voter[position]
{are voter and votee connected by high curvature?}
if (angle(voter[direction], v) < π/4) then
  return ZeroVector   {smoothness constraint violated}
end if
{voter and votee lie on a straight line, or voter and votee are the same point}
if (angle(voter[direction], v) = π/2) or (voter = votee) then
  return voter[direction]
end if
Compute center and radius of the osculating hemisphere in 4D, as shown in Fig. 4a.
{assign stick vote}
stickvote[direction] ← center − votee[position]
stickvote[length] ← exp(−(s² + cκ²)/σ²)   {equation (3)}
stickvote[position] ← votee[position]
return stickvote

Algorithm 4 Combine(tensorvote, stickvote, weight)


It performs tensor addition, given a stick vote.
for all i, j such that 0 ≤ i, j < 4 do
  tensorvote[i][j] ← tensorvote[i][j] + weight × stickvote[i] × stickvote[j]
end for

Algorithm 5 AddTensor(outTensor, inTensor, weight)


Add two second-order symmetric tensors – simply matrix addition.
for all i, j such that 0 ≤ i, j < 4 do
  outTensor[i][j] ← outTensor[i][j] + weight × inTensor[i][j]
end for
