
A Practical Setup for Voxel Coloring

using off-the-shelf Components

Koen Erik Adriaan van de Sande

Bachelor Project supervised by Rein van den Boomgaard

June 2004
Abstract

We have built a practical setup for 3D reconstruction using voxel coloring [9]. Voxel coloring can reconstruct a 3D scene from multiple photographs with known camera locations; the reconstruction takes the form of a voxel volume [4].
Our setup consists of a cheap webcam, a small tripod and a computer-controlled Meccano
turntable. We perform an intrinsic calibration on the camera so we can model the distortion
of the camera in our camera model. The accuracy of the turntable is not important as we
perform an extrinsic calibration to determine camera location needed by voxel coloring. We
show how this extrinsic calibration is possible by having four planar reference points visible in
each image.
We found that background removal on the input images for voxel coloring is a necessary
step and show how one can make this step easy if the background removal is to be done by
hand. It is possible to perform automatic background removal, but we did not implement such
a system.
We show how one can best choose parameters for voxel coloring; we have found that the
adaptive consistency check [11] is suited best for our setup as the parameters do not depend
on the resolution of the reconstructed voxel volume. The adaptive consistency check also offers
better control than the original consistency check [9] when reconstructing objects that contain
both lightly and highly textured areas.
Reconstructions based on digital camera images have higher color quality than reconstructions
based on webcam images, because the digital camera has a much better CCD chip and produces
images of higher quality. An advantage of webcams is that they can always be controlled by
computer, allowing for fully automated image capture. Most digital cameras cannot yet be
fully controlled by computer, so they have to be operated by hand; in that case one can choose
to trade ease of use for reconstruction quality.
We would like to point out that at low resolutions, webcam reconstructions are competitive
with digital camera reconstructions, if one does not look at color quality. At higher resolutions
the digital camera creates better reconstructions as it has more data available, but the webcam
reconstructions are (to our surprise) still of acceptable quality.

Contents

1 Introduction

2 Related Work
2.1 Voxel Coloring
2.1.1 Occlusion handling
2.2 Voxel Coloring algorithm
2.2.1 Original consistency check
2.2.2 Adaptive consistency check
2.2.3 Histogram consistency check
2.2.4 Speed improvements

3 3D reconstruction setup
3.1 Capture
3.1.1 Cameras
3.1.2 Turntable
3.2 Calibration
3.2.1 Perspective camera model
3.2.2 Modelling real cameras
3.2.3 Separating camera frame and world frame
3.2.4 Distortion
3.2.5 Determining extrinsic parameters from four planar points
3.3 Reconstruction
3.3.1 Reconstruction parameters
3.3.2 Implemented Voxel Coloring algorithm
3.4 Viewing results

4 Evaluation
4.1 Reprojection error
4.2 Background removal
4.3 Consistency checks
4.4 Webcam versus digital camera

5 Conclusions

A Camera calibration toolbox
A.1 Intrinsic parameters
A.2 Extrinsic parameters
A.3 Undistorting images

B Phidgets

C Rose dataset
Chapter 1

Introduction

Reconstructing a 3D object from a number of images is a fascinating problem. Because humans
are able to see depth easily from two slightly differing images, people tend to assume that 3D
reconstruction is an easy and thus solved problem. However, this is far from true: numerous
algorithms exist, but most of them have problems handling occlusion or only work when the
cameras are placed close together.
Voxel coloring is an algorithm for 3D reconstruction that has no problems handling oc-
clusion or cameras placed widely apart. The basic voxel coloring algorithm [9] works quite
well, but does require knowledge of the camera location. Additionally, it places constraints on
camera placement and has some problems with highly textured surfaces; these problems have
been addressed in later extensions [2, 5, 11, 13].
Previous voxel coloring projects have used high-quality digital cameras and expensive
turntables or camera gantries, resulting in very accurate images and camera calibration. We
created a practical setup for 3D reconstruction using a cheap webcam, a tripod and an inaccu-
rate home-built Meccano turntable. Besides a webcam we will also mount a digital camera on
our tripod and compare its reconstructions to those made using our webcam. Along the way
we will address issues encountered such as camera calibration and background removal.
Our goal was to build a practical setup for 3D reconstruction using readily available and
cheap components. The future goal is to have a practical setup that is easy to use and requires
little or no user intervention.
In chapter 2 we will give an explanation of voxel coloring and an overview of related work
on voxel coloring. In chapter 3 we will describe our setup and voxel coloring implementation
in detail. In chapter 4 we will present an evaluation of our setup. Finally, chapter 5 contains
our conclusions and recommendations for future work.

Chapter 2

Related Work

In this paper we will focus on 3D reconstruction using voxel coloring in a practical setup. In
this chapter voxel coloring is introduced and a number of extensions and improvements to the
original algorithm are described.

2.1 Voxel Coloring


Voxel coloring is an algorithm to reconstruct the 3D shape from a number of input photographs
with known camera locations. It was originally introduced by Seitz and Dyer in [9]. The
reconstruction consists of a regular grid of cube-shaped voxels. Voxels are ‘volume elements’
in analogy with pixels. Like a pixel, every voxel is associated with a color. See figure 2.1 for
an example of a voxel representation. More information about these representations can be
found in [4].

Figure 2.1: Example of a voxel representation. Picture taken from [4].

The problem addressed by voxel coloring is the assignment of colors to points in a 3D
volume so as to be consistent with the input photographs. If a single point has the same color in
all images from which it is visible, then it should be given that color. If the colors do not match,
then there is probably no surface point at that location and the voxel should be removed. This
basic principle underlying voxel coloring is illustrated in figure 2.2.

Figure 2.2: Given a set of input images and a voxel space, we want to assign colors to voxels
in a way that is consistent with all images. Picture taken from [10].

When rendering the voxel reconstruction from the photo viewpoints, it should be ‘identical’
to the input photographs (because of limitations of the voxel representation used, they are not
always identical). When a reconstruction matches with the input images, it is photo-consistent.
However, a photo-consistent reconstruction is not unique, as is illustrated in figure 2.3. The
different types of ambiguity are explored in [10].

Figure 2.3: Both voxel colorings appear to be identical from these two viewpoints but have no
voxels in common. Picture taken from [10].

The photo-hull, introduced in [5], is the union of all photo-consistent shapes and is thus the
maximal photo-consistent shape. Contrary to an arbitrary photo-consistent reconstruction, the photo-hull
is uniquely defined and happens to be the reconstruction created by voxel coloring (this is
shown in [5]). If we apply voxel coloring to the case of figure 2.3 now, then the reconstruction
is uniquely defined as is shown in figure 2.4.

Figure 2.4: Voxel coloring using the photo-hull. Every voxel has the same color in every
reconstruction in which it is contained. Picture taken from [10].

The working of voxel coloring relies heavily on the ability to compare colors between different
images. These colors are only the same if we assume a Lambertian reflection model,
which says that all objects in the scene reflect light equally in all directions. If we do not
assume a Lambertian model, then the amount of light reflected depends on the viewing angle,
which means that a single point in space can appear to have multiple colors.

2.1.1 Occlusion handling


In order to handle occlusion properly, the voxel coloring algorithm traverses the voxel space in
a special order. Voxels closer to the camera are visited first. This ensures that a voxel cannot
be occluded by an unvisited voxel (since an occluding voxel must be closer to the camera, it
has already been visited).
This is known as the ordinal visibility constraint (see [9]): there is some metric such that,
for scene points P and Q, if P occludes Q in any input image, then ‖P‖ is smaller than ‖Q‖
(thus if P occludes Q in one image, then there is no image possible in which Q occludes P).
The ordinal visibility constraint is satisfied when no scene point is contained within the
convex hull of the camera centers. In case a pinhole camera is assumed, this camera center is equal
to the pinhole of the camera. The metric above can then be taken as the distance to the
convex hull around the camera centers.
In figure 2.5 two practical camera setups are shown which satisfy the ordinal visibility
constraint. Note that the constraint implies that having two cameras on exactly opposite sides
is not allowed. Because of this, not all sides of an object can be reconstructed. There are
generalizations of voxel coloring, such as Space Carving [5] and Generalized Voxel Coloring [2],
which can both handle arbitrary camera positions.

2.2 Voxel Coloring algorithm


Using the principles described in the previous section, we can now construct the basic voxel
coloring algorithm. This algorithm assumes that all voxels V are traversed in an order which
satisfies the ordinal visibility constraint (described in section 2.1.1).

Figure 2.5: Camera configurations that satisfy the ordinal visibility constraint. Picture taken
from [10].

for every voxel V do
  pixels ← ∅
  for all images I do
    pixels ← pixels ∪ selectUnmarked(projectVoxel(V, I))
  end for
  consistent ← consistencyCheck(pixels)
  if pixels ≠ ∅ and consistent then
    colorVoxel(V, mean(pixels))
    mark(pixels)
  else
    carveVoxel(V)
  end if
end for

Every voxel is projected onto the input images, and all pixels that have not yet been marked
(as done) are collected. If there are no pixels the voxel projects onto, then the voxel is carved.
If the pixel collection is consistently colored, then the voxel is accepted and given the mean
color of the collection. The pixels are then marked (to indicate they are done). If the collection
does not have a consistent color, then the voxel is carved.

A common optimization is to use a sprite projection when projecting a voxel onto an image:
all vertices of the voxel are projected and a bounding box around the projected points is used
to approximate the shape of the actual voxel projection. Constructing a bounding box around
eight projected points is much faster than scan-converting the actual projection, while being
more accurate than a simple point projection (which only projects the center of the voxel).

We will now discuss three different consistency checks found in the literature that can determine
whether a pixel collection is 'consistently colored'.
2.2.1 Original consistency check
The original consistency check was introduced in [9] together with the original voxel coloring
algorithm. Their main assumption is that, if there were no noise, all pixels a voxel
projects to should have exactly the same color for the voxel to be consistent. In order to account for noise,
they calculate the standard deviation σV of the color values of the pixels a voxel projects to,
and classify a voxel as consistent if σV falls below a global threshold.
One problem with this consistency check is that the threshold has to be determined; this
is often done through experimentation with different thresholds until reasonable results are
achieved. However, the main problem is that there is no optimal threshold: areas with little
texture are reconstructed best with a low threshold, while areas that are highly textured or
contain sharp edges need very high thresholds. Such a sharp edge is illustrated in figure 2.6:
half the pixels are gray and the other half are white. The standard deviation σV is very high and
the voxel will be carved, even though it projects to the same set of pixels in all three
images.

Figure 2.6: A voxel projects to the same set of pixels in all three images. Due to the sharp edge
in the scene, there is a high standard deviation and the voxel will be rejected by the original
consistency check.

The formula used to calculate the standard deviation σ_{V,color} is:

\sigma_{V,color} = \sqrt{\frac{1}{K}\sum_{i=1}^{K} color_i^2 - \left(\frac{1}{K}\sum_{i=1}^{K} color_i\right)^2}

with K being the number of pixels in the pixel collection and color_i being the value of one
of the RGB color channels for pixel i. A pixel collection is consistent if σ_{V,color} < threshold for all RGB
color channels.
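To make the check concrete, here is a minimal sketch in Python/NumPy (our own illustration, not the thesis implementation; pixels is assumed to be a K x 3 array of RGB values collected by the algorithm above):

import numpy as np

def original_consistency_check(pixels, threshold):
    # Original check [9]: the per-channel standard deviation of the pixel
    # collection must stay below a single global threshold.
    if len(pixels) == 0:
        return False
    pixels = np.asarray(pixels, dtype=float)
    sigma = pixels.std(axis=0)               # standard deviation per RGB channel
    return bool(np.all(sigma < threshold))   # consistent only if every channel is below the threshold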

2.2.2 Adaptive consistency check


The adaptive consistency check was introduced by Slabaugh in [11] to address the problems
the original consistency check has with sharp edges. The adaptive consistency check calculates,
for each image separately, the standard deviation over the pixels the voxel projects to in that image.
These standard deviations are averaged over all images, resulting in the average standard deviation per image σ.

The standard deviation σV over the entire pixel collection is now thresholded taking σ into
account: a voxel is color consistent if σV < T1 + σ·T2, where T1 and T2 are global thresholds.
If there are very high standard deviations in all images, as is the case in figure 2.6 (σ is 63.5
there), then it is possible to accept the voxel by choosing a suitable value for T2.
The disadvantage of the adaptive consistency check is that now two parameters T1 and T2
have to be chosen. Another disadvantage is that if the case of figure 2.6 is to be accepted,
then the case shown in figure 2.7 will also be accepted. This cannot be solved by a consistency
check which does not use geometric information, as geometric information is the only way to
distinguish between figure 2.6 and 2.7.

Figure 2.7: A voxel projects to a set of pixels with the same colors in all three images, but
with different geometric positions. The adaptive and histogram consistency checks treat this
case the same as figure 2.6 as they do not use geometric information.

The original check thresholds the standard deviation for each color channel separately.
Slabaugh has taken a slightly different approach: he combines all color channels into a single
standard deviation and thresholds that:
\sigma_V = \sqrt{\frac{1}{K}\sum_{i=1}^{K}(r_i^2 + g_i^2 + b_i^2) - \left(\frac{1}{K}\sum_{i=1}^{K} r_i\right)^2 - \left(\frac{1}{K}\sum_{i=1}^{K} g_i\right)^2 - \left(\frac{1}{K}\sum_{i=1}^{K} b_i\right)^2}
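A minimal sketch of this check (Python/NumPy, our own names; per_image_pixels holds one K_j x 3 array of RGB values for each image the voxel is visible in):

import numpy as np

def adaptive_consistency_check(per_image_pixels, t1, t2):
    # Adaptive check [11]: the threshold on sigma_V scales with the average
    # per-image standard deviation sigma_bar.
    arrays = [np.asarray(p, dtype=float) for p in per_image_pixels if len(p) > 0]
    if not arrays:
        return False

    def combined_sigma(p):
        # single standard deviation over all color channels, as in the formula above
        return np.sqrt((p ** 2).sum(axis=1).mean() - (p.mean(axis=0) ** 2).sum())

    sigma_v = combined_sigma(np.concatenate(arrays))           # over the whole pixel collection
    sigma_bar = np.mean([combined_sigma(p) for p in arrays])   # averaged over the images
    return sigma_v < t1 + sigma_bar * t2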

2.2.3 Histogram consistency check


The histogram consistency check introduced in [13] presents a solution to abrupt color changes
as in figure 2.6 without introducing extra parameters. For every image in which the voxel
is visible, a color histogram is constructed from the pixels the voxel projects to. Intersection tests between these
histograms determine whether they are consistent: if a single histogram bin is non-empty in all image
histograms, then the entire pixel collection is color consistent.
Since the intersection test only needs to know whether bins are empty or non-empty, it
only needs a single bit per bin, allowing implementation using simple boolean AND and OR
operations. A disadvantage of this binary division is that a single white pixel in an otherwise
gray area (in combination with other images being all white) can cause the voxel to become
consistent. This means that the histogram consistency check is not very robust against noisy
images.

According to the authors, the histogram consistency check does not take any parameters
except for the number of bins. They do point out that there is an overlap of about 20% between
their bins, as the bin boundaries are rather arbitrary. This overlap effectively causes a smooth-
ing of their histogram. While this overlap may have been fixed in their implementation, it is
not unthinkable that changing this overlap can have a positive effect on reconstruction quality.
The histogram consistency check has the same problem with figure 2.7 as the adaptive
consistency check, as it does not use any geometric information either.
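As an illustration, a minimal sketch of the intersection test (Python/NumPy, our own simplification: bin overlap is ignored and each histogram is reduced to one bit per bin):

import numpy as np

def histogram_consistency_check(per_image_pixels, bins=4):
    # Consistent if at least one (r, g, b) bin is non-empty in every image histogram.
    masks = []
    for pixels in per_image_pixels:
        pixels = np.asarray(pixels, dtype=int)
        if len(pixels) == 0:
            continue
        idx = np.minimum(pixels * bins // 256, bins - 1)        # bin index per channel
        bin_ids = (idx[:, 0] * bins + idx[:, 1]) * bins + idx[:, 2]
        mask = np.zeros(bins ** 3, dtype=bool)                  # one bit per bin: non-empty or not
        mask[bin_ids] = True
        masks.append(mask)
    if not masks:
        return False
    return bool(np.logical_and.reduce(masks).any())             # boolean AND over all images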

2.2.4 Speed improvements


Until now we have been discussing improvements to the voxel coloring algorithm which will
either improve reconstruction quality or relax operational constraints. A common feature of
these improvements is that they increase the runtime for reconstruction. A different kind of
improvement is to reduce runtime of the voxel coloring algorithm.
It is possible to speed up voxel coloring by using video hardware to perform the projections
of voxels onto images (see [8]). In practice, this greatly speeds up voxel coloring, but from a
theoretical perspective it is not very interesting: the base algorithm remains the same; only
the hardware it runs on is different.
A speed-up which does make a fundamental change to the algorithm is the hierarchical
method of Prock and Dyer from [7]. They begin reconstruction at a very low resolution and
gradually increase it; at a higher resolution, only voxels whose lower-resolution counterpart was
consistent, together with its nearest neighbours, are considered for consistency. This coarse-to-fine
strategy combined with their nearest-neighbour heuristic gives them the same experimental
results as the original algorithm, while being 2 to 40 times faster, although this speed-up only
applies to their datasets and is not guaranteed in general.

Chapter 3

3D reconstruction setup

We built a practical setup for 3D reconstruction using voxel coloring. There are three steps
involved in performing the reconstruction of an object. Firstly, photographs from various sides
need to be taken. Secondly, one has to perform a calibration procedure on the photographs
taken in order to retrieve the camera position, as this information is required for voxel coloring.
Finally, the actual reconstruction has to be performed using a software implementation of the
voxel coloring algorithm. In the coming sections we will discuss these steps in more detail.

3.1 Capture
In figure 3.1 our experimental setup is shown. A camera is placed on a small tripod looking
down on our turntable. Once the object is in place on the turntable the user can activate a
Matlab script to start image capture. The capture script will take a picture of the object,
move the turntable by a few degrees, take another picture, move again, etc. After a full round
of pictures has been made, the turntable moves back to its initial position.

Figure 3.1: Two pictures of our experimental setup. An object is placed onto a computer-
controlled turntable and images are captured using a webcam. The checkerboard pattern is
used to recover the camera position from its images. The circuit board is a PhidgetServo (see appendix B);
it connects the turntable servo motor to a computer USB port.

3.1.1 Cameras
We have used two cameras with our setup: the Philips Vesta 680K webcam (shown in figure
3.2) at a resolution of 640x480 and the Fuji FinePix 6800 Zoom (shown in figure 3.3) at a
resolution of 2048x1536. The Philips webcam could be controlled from Matlab. Its images
were rather noisy, so we take 10 shots for a single photo and average these shots to reduce
noise. The Fuji digital camera could not be controlled by computer: we had to press a button
every time a photo had to be taken. To prevent camera shake, the digital camera was set to
use its self-timer.

Figure 3.2: On the left, a picture of the Philips Vesta 680K webcam. On the right, its distortion
at a resolution of 640x480 as modelled by our camera toolbox. X and Y axes indicate X and Y
offset of the image, while the vectors indicate the displacement due to distortion. The iso-lines
indicate areas with pixel displacements of the same length; the number on the lines is this
length.

Figure 3.3: On the left, a picture of the Fuji FinePix 6800 Zoom digital camera. On the
right, its distortion at a resolution of 2048x1536 as modelled by our camera toolbox. X and
Y axes indicate X and Y offset of the image, while the vectors indicate the displacement due
to distortion. The iso-lines indicate areas with pixel displacements of the same length; the
number on the lines is this length.

3.1.2 Turntable
We built our turntable from Meccano, a computer-controlled servo and the back of a picture
frame. It is shown in figure 3.4. The computer-controlled servo is called a PhidgetServo and
is described in more detail in appendix B. The accuracy of table movements is not important
as the angle over which the table has moved is not used in our setup. It is important to keep
the movement speed of the table low: if the object starts sliding on the table then the
reconstruction will either be very inaccurate or fail completely. This means one does not even
need an automated turntable (though it is easier), as the only requirement is that the object
and calibration pattern move together between images!

Figure 3.4: Two close-ups of our turntable. On the left the table has been removed; on the
right it has been attached. How the table is connected is visible in figure 3.1.

3.2 Calibration
After capturing images of the scene with our setup, we need to retrieve the position and
orientation of the camera for each of those images. We use the Bouguet camera toolbox [1]
which can do this for us. There are some requirements though:

• A separate calibration step should be performed first on images of a checkerboard pattern
to retrieve camera properties such as focal distance, optical center and distortion. This
calibration of intrinsic parameters has to be done just once and remains valid as long
as the camera does not change focus: these intrinsic parameters do not change between
images. Our digital camera has to be placed in manual focus mode explicitly to prevent
it from auto-focussing and rendering the intrinsic calibration useless.

• The user has to select four planar reference points in every image by hand. These points
will be used to retrieve the extrinsic parameters, which are the position and orientation
of the camera written in the form of a rigid body transformation.

You can find more detailed information on how to use Bouguet's camera toolbox in appendix A.
We will now discuss the camera model used by the toolbox and show how one can retrieve
camera position and orientation from just four points.

3.2.1 Perspective camera model
We will start our discussion with the perspective camera model, which corresponds to an ideal
pinhole camera. In figure 3.5 we have illustrated our model (the real situation is 3D, but since
it is the same for X and Y, we show it as a 2D scene where the vertical axis can be either, as
that is more instructive). The origin O is the pinhole of
the camera. Point P (with coordinates (X, Y, Z)) lies on a real-world object and is projected
onto the image plane I as point P′. Point C is called the image center or principal point and
is the point where the optical axis intersects the image plane.

Figure 3.5: A schematic version of the perspective camera model. The origin O is the camera
pinhole; point P is an object point with its Z component along the horizontal axis (also known
as the optical axis); the vertical axis can be its X or Y component. I is the image plane which
is perpendicular to the horizontal optical axis and point P′ is the projection of P onto the
image plane I. f and its double-arrowed line illustrate that the distance between the pinhole
O and the principal point C is the focal distance f .

We will assume we know the distance between the pinhole O of the camera and the principal
point C; this is the focal distance f of the camera. If we know the real-world coordinates X,
Y and Z of point P, then it is possible to calculate the image coordinates x and y of the point
P′ projected onto the image plane I:

\vec{x} = \begin{cases} x = f\,X/Z \\ y = f\,Y/Z \end{cases}

We switch to homogenous coordinates so we can write the above equations in matrix
form. Homogenous coordinates are explained in [4]. Basically, an extra scaling coordinate w
is added to a point, and all other coordinates x, y, z need to be divided by w to get normal
3D coordinates out of a 4D homogenous vector. We use the symbol ∼ for equality between
homogenous coordinates, in the sense that two homogenous vectors are equal if all their elements
are equal after normalisation (a homogenous vector is normalised if its w coordinate is 1).

\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \sim \begin{pmatrix} x' \\ y' \\ w' \end{pmatrix} = \begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}

3.2.2 Modelling real cameras


Now we extend our model to make it more suitable for real-world usage. First of all, pixels
are never exactly square in a real camera, so we add two factors k and l (expressed in pixels/cm)
to account for this:

\vec{x} = \begin{cases} x = k f\,X/Z \\ y = l f\,Y/Z \end{cases}

Instead of writing kf and lf all the time, we will combine these factors into single terms
fx and fy .
Another feature of real cameras is that the origin of the coordinate system generally does
not lie at the principal point/image center C, but in or near one of the image corners. We
introduce u0 and v0 to translate the principal point to the right location:

\vec{x} = \begin{cases} x = f_x\,X/Z + u_0 \\ y = f_y\,Y/Z + v_0 \end{cases}

In matrix form this becomes:

\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \sim \begin{pmatrix} f_x & 0 & u_0 & 0 \\ 0 & f_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}

Another possible quirk in real-world cameras is that the angle between the x and y axes
is not equal to 90 degrees. This is called skew and occurs when the camera CCD is not
perpendicular to the optical axis (the lens and CCD are not properly aligned). Equations for
skew are derived in appendix A of [15]. It introduces an extra non-zero element αfx at position
(1, 2) in our matrix and an additional term inside fy . Since our cameras had virtually no skew,
we will not discuss it here as it does not fundamentally change the rest of our discussion. The
matrix we have derived will from now on be known as the intrinsic camera matrix Mint as it
contains the intrinsic camera parameters (excluding distortion).
3.2.3 Separating camera frame and world frame
Currently our world frame has its origin at the pinhole of the camera, as it is equal to the
camera frame. This is rather inflexible: we do not want the origin of our world coordinate
system to lie at the pinhole. We introduce an intermediate rigid body transformation Mext
which converts from the world frame to the camera frame:

\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \sim \begin{pmatrix} f_x & 0 & u_0 & 0 \\ 0 & f_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = M_{int} M_{ext} \vec{X}

When one inverts Mext (so that it converts from the camera frame to the world frame) and
multiplies it with the origin O, one gets the camera position in world coordinates, which is
equal to the translational part of the inverted matrix. The rotational part of the inverse rigid
body transformation specifies the orientation of the camera.
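To make the projection pipeline concrete, a minimal sketch (Python/NumPy, with our own names and hypothetical example values) that maps a world point to pixel coordinates given Mint and Mext:

import numpy as np

def project_point(world_point, m_int, m_ext):
    # x ~ Mint * Mext * X, followed by normalisation of the homogenous coordinate
    X = np.append(np.asarray(world_point, dtype=float), 1.0)   # homogenous 4-vector
    x = m_int @ (m_ext @ X)                                     # (x', y', w')
    return x[:2] / x[2]

# Example with made-up intrinsics: f_x = f_y = 800 pixels, principal point (320, 240),
# and a camera for which the world origin lies 1000 mm in front of it.
m_int = np.array([[800, 0, 320, 0],
                  [0, 800, 240, 0],
                  [0,   0,   1, 0]], dtype=float)
m_ext = np.eye(4)
m_ext[2, 3] = 1000.0
print(project_point([0, 0, 0], m_int, m_ext))   # prints the principal point [320. 240.]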

3.2.4 Distortion
Besides real-world camera properties discussed thus far, there is also camera distortion. Most
cameras tend to have radial distortion due to the shape of the lenses used. We will assume
we have a distortion model which can apply a distortion to coordinates x and y and that
coordinates can be undistorted again as well. If our distortion model could only go one way,
then it would have been poorly chosen.
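For illustration, a sketch of a simple radial distortion model of this kind (our own simplification with only two radial coefficients; the toolbox's actual model has more terms), including a numerical inversion that makes undistorting possible:

def distort(x, y, k1, k2):
    # apply radial distortion to normalised image coordinates (x, y)
    r2 = x * x + y * y
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    return x * factor, y * factor

def undistort(xd, yd, k1, k2, iterations=10):
    # invert the distortion by fixed-point iteration (there is no closed form)
    x, y = xd, yd
    for _ in range(iterations):
        r2 = x * x + y * y
        factor = 1.0 + k1 * r2 + k2 * r2 * r2
        x, y = xd / factor, yd / factor
    return x, y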

3.2.5 Determining extrinsic parameters from four planar points


We now move to one of the crucial parts of our camera calibration: we want to derive the matrix
Mext (which can be converted to camera position and orientation) from four planar points
with known correspondence between real-world coordinates and image coordinates. Since we
know the points are planar, we choose them to lie in the Z = 0 plane in the real world (we do
not lose generality here, as it is easy to create a rigid body transformation which moves a known
plane into the Z = 0 plane; this transformation can then be multiplied by our final result to get
Mext). We will presume our image coordinates have already been undistorted, so our equations become:

\begin{pmatrix} x_i \\ y_i \\ 1 \end{pmatrix} \sim \begin{pmatrix} f_x & 0 & u_0 & 0 \\ 0 & f_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X_i \\ Y_i \\ 0 \\ 1 \end{pmatrix}

We have the above equation for all four reference points i, i ∈ {1, 2, 3, 4}. Since the third
element of \vec{X} is 0 for all points, we can leave out this element and the third column of Mext,
since their product will always be 0 after the matrix multiplication:

\begin{pmatrix} x_i \\ y_i \\ 1 \end{pmatrix} \sim \begin{pmatrix} f_x & 0 & u_0 & 0 \\ 0 & f_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} & t_x \\ r_{21} & r_{22} & t_y \\ r_{31} & r_{32} & t_z \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X_i \\ Y_i \\ 1 \end{pmatrix}
The homogenous factor 1 of \vec{X} will become the fourth element after this vector has been
multiplied by Mext and will always remain 1, since Mext is a rigid body transformation. If we
look at Mint, we see that this 1 isn't actually used anywhere (the fourth column contains only
zeros), so we can safely leave out this term as well:

\begin{pmatrix} x_i \\ y_i \\ 1 \end{pmatrix} \sim \begin{pmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} & t_x \\ r_{21} & r_{22} & t_y \\ r_{31} & r_{32} & t_z \end{pmatrix} \begin{pmatrix} X_i \\ Y_i \\ 1 \end{pmatrix}

With the fourth column of Mint gone, this matrix has now become invertible. We will call
the coordinates with Mint removed x′ and y′. You might wonder how it is possible to invert
a perspective projection, as depth information is lost in such a projection. There is a simple
response to that: the projection isn't invertible at all; we have used a clever trick, the
homogenous coordinate. The homogenous factor is a scaling factor, effectively allowing our
point to represent the line of all 3D points that project onto our 2D point. After the projection
is removed, this scaling factor is still 1, but if you change it (and x′ and y′ as well, according
to the rules of homogenous coordinates) you get other points on that line.

\begin{pmatrix} x'_i \\ y'_i \\ 1 \end{pmatrix} = \begin{pmatrix} r_{11} & r_{12} & t_x \\ r_{21} & r_{22} & t_y \\ r_{31} & r_{32} & t_z \end{pmatrix} \begin{pmatrix} X_i \\ Y_i \\ 1 \end{pmatrix} = H \begin{pmatrix} X_i \\ Y_i \\ 1 \end{pmatrix}

We now have a correspondence between two planes (a real-world plane and the image
plane). A mapping between two planes is known as a plane-to-plane homography and can be
described by a 3x3 matrix H; in the equation above our matrix is this H. Calculating a
homography is a solved problem. We have 4 points with 2 coordinate correspondences each;
this gives us 8 equations. The homography matrix H has 9 elements, but it has only 8 degrees
of freedom, since the scale of the matrix does not matter (this is due to homogenous coordinates:
if you divide all matrix elements by a certain factor a, then all elements of the resulting vector
are also divided by a, and after normalisation this division by a is gone). If we write out our
8 equations we see that they are linear in the elements of H, so we can drop the homogenous
scaling factor. After converting these equations to matrix form and splitting off the elements
of H into a separate vector h we get:

\begin{pmatrix}
X_1 & Y_1 & 1 & 0 & 0 & 0 & -X_1 x_1 & -Y_1 x_1 & -x_1 \\
0 & 0 & 0 & X_1 & Y_1 & 1 & -X_1 y_1 & -Y_1 y_1 & -y_1 \\
X_2 & Y_2 & 1 & 0 & 0 & 0 & -X_2 x_2 & -Y_2 x_2 & -x_2 \\
0 & 0 & 0 & X_2 & Y_2 & 1 & -X_2 y_2 & -Y_2 y_2 & -y_2 \\
X_3 & Y_3 & 1 & 0 & 0 & 0 & -X_3 x_3 & -Y_3 x_3 & -x_3 \\
0 & 0 & 0 & X_3 & Y_3 & 1 & -X_3 y_3 & -Y_3 y_3 & -y_3 \\
X_4 & Y_4 & 1 & 0 & 0 & 0 & -X_4 x_4 & -Y_4 x_4 & -x_4 \\
0 & 0 & 0 & X_4 & Y_4 & 1 & -X_4 y_4 & -Y_4 y_4 & -y_4
\end{pmatrix}
\begin{pmatrix} h_{11} \\ h_{12} \\ h_{13} \\ h_{21} \\ h_{22} \\ h_{23} \\ h_{31} \\ h_{32} \\ h_{33} \end{pmatrix} = 0

Linear algebra tells us that h can be calculated from the above equation (taking the solution
where h has length 1, to prevent trivial solutions such as the null vector), minimizing the error
in reprojecting our points. Since we have 4 points, the non-trivial solution is uniquely defined.
If you have more points available, then you can add them as extra rows to the matrix above.
We can reconstruct Mext from H: columns 1 and 2 of H are the first two columns of the
rotational part of Mext and define the base vectors for the X and Y axes. The Z base vector
can be calculated from the first two by taking their cross product. The translational part of
Mext is equal to the 3rd column of H. Our matrix is completed by giving the 4th row its
default values for a rigid body transformation: 0, 0, 0 and 1.
Mext is not necessarily a rigid body transformation now, as it is possible for the base
vectors of the X and Y axes to have a length different from 1 (they are slightly scaled then).
This happens because the reference points are generally not exact: they are noisy.
Bouguet solves this by taking the Mext we have derived as an initial guess for the real
Mext. He converts the rotational part to Rodrigues coordinates (Rodrigues coordinates describe
a rotation with just 3 parameters: a unit vector (2 parameters) and a rotation angle around this
vector (1 parameter); see [3] for an introduction) and then performs a gradient descent over the
3 Rodrigues coordinates and the 3 elements of the translation vector to minimize the reprojection
error in the reference points. The Rodrigues coordinates can only describe rotations without
scaling and thus ensure that the final version of Mext is indeed a rigid body transformation.
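A minimal sketch of this derivation (Python/NumPy, our own names; the SVD yields the unit-length solution h of the system above, the base vectors are simply rescaled to unit length, and Bouguet's gradient-descent refinement is omitted):

import numpy as np

def estimate_extrinsics(world_xy, image_xy):
    # world_xy: (N, 2) planar reference points in the Z = 0 plane;
    # image_xy: (N, 2) undistorted coordinates with Mint already removed (x', y')
    rows = []
    for (X, Y), (x, y) in zip(world_xy, image_xy):
        rows.append([X, Y, 1, 0, 0, 0, -X * x, -Y * x, -x])
        rows.append([0, 0, 0, X, Y, 1, -X * y, -Y * y, -y])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)            # unit-norm h reshaped into the homography

    r1, r2, t = H[:, 0], H[:, 1], H[:, 2]
    scale = np.linalg.norm(r1)          # base vectors should have length 1
    r1, r2, t = r1 / scale, r2 / scale, t / scale
    r3 = np.cross(r1, r2)               # Z base vector from the cross product
    m_ext = np.eye(4)
    m_ext[:3, :3] = np.column_stack([r1, r2, r3])
    m_ext[:3, 3] = t
    return m_ext                        # initial guess; Bouguet refines this further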

3.3 Reconstruction
Now that we have a series of images of our scene with corresponding camera locations, we want
to perform a 3D reconstruction using voxel coloring. At the end of this section we will show
that, in practice, an intermediate step (background removal) is needed between the calibration
and the reconstruction.

3.3.1 Reconstruction parameters


One of the most important parameters is the resolution of the reconstruction. You can choose
how many voxels there should be in the initial volume in every direction separately. Besides
the resolution, one has to specify to which real-world box these voxels correspond. During
extrinsic calibration (see appendix A) one gets to choose an origin (the first point one chooses),
specify the X and Y axes and then enter the lengths of the selected axes. If our
X-axis spanned 3 squares of 30mm each, then we would have entered a length of 90mm for
the X-axis. When you want to specify the real-world bounding box corresponding to this, you
should enter 0 as the minimum value for the X-axis and 90 as the maximum value. If you
choose the X size of the voxel volume to be 180, then the real-world size of individual voxels
would be 0.5mm. It is recommended to have the same real-world sizes along all axes (so your
voxels are cube-shaped); otherwise your reconstruction will look warped.
Another ‘option’ in the reconstruction parameters is the axes traversal order: you can say
which axis should be done first and in what direction. For example, if your camera is hovering
above your scene (as is the case in our setup), then you should select the Z-axis to be traversed
first in negative direction. This will cause reconstruction to start at the top of the voxel volume
and work its way down to the bottom in a way that satisfies the ordinal visibility constraint
(see section 2.1.1): voxels which can occlude other voxels (seen through the cameras) are done
first.
Since people are bound to get this traversal order or the camera locations wrong, our
program checks that the ordinal visibility constraint is satisfied. It divides space into two
half-spaces, separated by the plane given by the first value for the first axis to be traversed (in
our example where the Z-axis is traversed in negative direction, this would be the Z = Zmax
plane). All images whose camera locations lie on the side that contains the voxel volume are
rejected; all images on the other side are accepted. This ensures that the convex hull around
all accepted camera locations does not intersect the voxel volume as they are in two different
half-spaces.
This check can be too strict as it may reject some images whose camera locations would
not cause the convex hull to intersect the voxel volume. In our setup we ensure that the camera
is placed above the highest point of the object we are trying to reconstruct, so we do not run
into this problem.
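As an illustration, a minimal sketch of this half-space test for our usual configuration (Z-axis traversed first in the negative direction; names and values are our own):

def accept_cameras(camera_positions, z_max):
    # Keep only images whose camera centers lie above the Z = z_max plane.
    # The voxel volume lies below that plane, so the convex hull of the accepted
    # camera centers cannot intersect it and the ordinal visibility constraint holds.
    return [i for i, (cx, cy, cz) in enumerate(camera_positions) if cz > z_max]

# Example with made-up positions (in mm); the voxel volume tops out at z = 100
cameras = [(0, -300, 250), (200, -200, 260), (0, 0, 50)]
print(accept_cameras(cameras, 100.0))   # -> [0, 1]; the third camera is rejected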

3.3.2 Implemented Voxel Coloring algorithm

Below we give the version of the Voxel Coloring algorithm we have implemented. This algorithm
differs from the one given in section 2.2 in two ways. Firstly, instead of marking pixels as
done immediately after a voxel has been accepted, the pixels to be marked are collected in
pixelsDoneInLayer and marked after an entire layer of voxels has been processed. Secondly,
pixel collections containing a single background pixel cause the voxel to be automatically
carved. The reasons for these changes will become apparent later in this section.
for every voxel layer L do
  pixelsDoneInLayer ← ∅
  for every voxel V in voxel layer L do
    pixels ← ∅
    for all images I do
      pixels ← pixels ∪ selectUnmarked(projectVoxel(V, I))
    end for
    consistent ← consistencyCheck(pixels)
    if pixels ≠ ∅ and consistent and not containsBackgroundPixel(pixels) then
      colorVoxel(V, mean(pixels))
      pixelsDoneInLayer ← pixelsDoneInLayer ∪ pixels
    else
      carveVoxel(V)
    end if
  end for
  mark(pixelsDoneInLayer)
end for
We will now discuss the functions mentioned in the algorithm and their implementation in
more detail. The carveVoxel and colorVoxel functions operate on the voxel volume the algorithm
outputs. The first marks a voxel as 'unused' while the latter marks a voxel as 'used' and
assigns it the color given as its second argument. We assume standard set
Our images are stored in a data structure which can store pixels and has room for a flag
for every pixel (so individual pixels can be marked using the mark function).

projectVoxel
The projectVoxel function takes a single voxel V and an image I with camera calibration as
input. The voxel is projected using a sprite projection (explained in section 2.2): the eight
vertices of the voxel are projected from real-world coordinates to image coordinates using
the camera model and the bounding box around them is used to approximate the actual
voxel shape. How to project real-world coordinates to image coordinates can be found in our
discussion of the camera model in section 3.2. Besides this camera model (which is used in the
Bouguet camera toolbox), our implementation also supports the Tsai camera model (see [16]),
but we do not actually use this model.
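A minimal sketch of this sprite projection (Python/NumPy, our own names; project_point is the helper sketched in section 3.2.3):

import numpy as np

def project_voxel(voxel_min, voxel_size, m_int, m_ext):
    # Project the eight voxel corners and return the bounding box
    # (x_min, y_min, x_max, y_max) in image coordinates.
    x0, y0, z0 = voxel_min
    corners = [(x0 + dx * voxel_size, y0 + dy * voxel_size, z0 + dz * voxel_size)
               for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)]
    pts = np.array([project_point(c, m_int, m_ext) for c in corners])
    return pts[:, 0].min(), pts[:, 1].min(), pts[:, 0].max(), pts[:, 1].max()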

selectUnmarked
The projectVoxel returns all pixels a voxel projects to, even if they have already been done
(they have already projected to a consistent voxel). The selectUnmarked function filters out
the pixels which have been marked as ‘done’.
The problem with sprite projections is that they return too many pixels (the box contains
the exact voxel shape). One can imagine that several of these pixels actually belong to a
neighbouring voxel. If the voxel is consistent and the pixels it projects to are marked right
away, then they will be filtered out by the selectUnmarked function when the neighbouring
voxel is processed. This can cause the neighbouring voxel to be carved as the pixel collection
a voxel projects to can now become empty or it can be declared inconsistent (for example, if
all ‘taken’ pixels have a single color and the left-over pixels are highly noisy: the ‘taken’ pixels
could have stabilised the pixel collection enough to be declared consistent). To counter this
effect, pixels are marked after an entire layer of voxels has been processed instead of after every
voxel. A nice side-effect of this is that parallel implementations become possible: all voxels in a
layer can be processed independently on different nodes, and global data (the pixel markings)
has to be synchronized only once per layer. Finally there is a way to put the Hyperthreading
of your Intel Pentium 4 to good use!

containsBackgroundPixel
The containsBackgroundPixel function takes a set of pixels as input and returns true if this set
contains at least one pixel which belongs to the background. In our program, a pixel belongs
to the background if it has zero values in all color channels (for RGB colors, this corresponds
to the color black).
By default there are no background pixels, since cameras generally cannot measure true
black properly; pixels have to be marked as background by the user in the input images
explicitly.
In theory, a varied background should not be a problem for voxel coloring as appropriate
voxels are automatically carved since they are all inconsistent due to the variation. In practice,
the variation in the background is often too low for this to work. In figure 3.6 (top) we have
a worst-case scenario: the background has a single color. After doing a very raw background
removal (middle), reconstruction results are much better. There are still some background
artefacts that can be removed by doing a more careful background removal (bottom) or using
a more varied background.
Figure 3.6: Illustrating different approaches to background removal: no removal (top), raw
removal using a paintbrush (middle) and 'segmentation' removal using a magic wand (bottom).
For every approach, an input image (left) and a snapshot of the resulting reconstruction (right)
is shown. Input images were created using our setup under suitable lighting conditions. Object
Patrick is copyrighted by Viacom International as part of the SpongeBob SquarePants brand.

We have chosen to introduce a manual background removal as an intermediate step between
the calibration and the reconstruction to get good results. The kind of background removal
(exact results are not required) we need can be achieved automatically by identifying the
static background (areas of the image which are the same in all images) and the calibration
pattern and then performing a segmentation. We have done this segmentation by hand using
a magic wand tool in the bottom images of figure 3.6; removing most of the background gives
an excellent reconstruction. Whether this is still good enough when automated instead of done
by hand needs further investigation. We will show some additional tricks for background removal
in the next chapter.

consistencyCheck
The consistency check to be used is selectable at runtime. It takes a set of pixels as input and
returns a boolean value indicating whether the set is consistent. We have implemented the
consistency checks described in the previous chapter:

• Original consistency check (section 2.2.1) with a parameter threshold T .

• Adaptive consistency check (section 2.2.2) with two parameters T1 and T2 .

• Histogram consistency check (section 2.2.3). This consistency check is supposed to have
no parameters besides the number of bins (which can only be changed at compile time in
our implementation). We did find a parameter though: the overlap between bins, which
is scarcely mentioned in the literature. It is generally taken to be about 20% and its
introduction is defended by saying that bin boundaries are otherwise 'a bit arbitrary'.
In our implementation the overlap parameter is given as a range in the color channels,
as we do not know how the overlap percentage is defined.

3.4 Viewing results


After reconstruction is complete, one wants to look at the reconstructed voxel shape from
different sides. Our OpenGL VoxelViewer is shown in figure 3.7 and has the following
features:

• Free rotation and zooming using the mouse

• Automated rotation of the object around X and Y axes (using toolbar buttons)

• Creating screenshots and animations of 360-degree rotations around the object

• Configurable background color

• Viewing the object from camera standpoints used in input images (non-interactive mode
only, needs calibration and reconstruction parameters)

We would like to note that VoxelViewer renders to the video buffer; this means that the
maximum resolution of its output images is limited by the current resolution of your video
card. Anyone who wants to go beyond the maximum resolution of his or her video card is
invited to rewrite our viewer to use off-screen buffers.
We would like to thank Stuart Carey for providing the base OpenGL viewer, which he developed
for Voxel Section Editor II [14]. This editor was developed for use with Command & Conquer:
Tiberian Sun, C&C: Red Alert 2 and C&C: Yuri's Revenge by Will Sutton, Plasmadroid,
Stuart Carey and myself to edit voxel files.
Figure 3.7: A screenshot of our OpenGL VoxelViewer which can be used to view reconstruction
results.

Chapter 4

Evaluation

In this chapter we will analyze datasets obtained through our practical setup. Some readers
may have noticed that a setup with a turntable and a fixed camera can cause object lighting
to vary as the object turns. This means our scene violates the lighting assumption for voxel
coloring (which says lighting of the object does not change). We countered this by ensuring we
had a well-lit environment with fluorescent lighting from all sides and assuming this is good
enough for voxel coloring to work.

4.1 Reprojection error


The reprojection error was introduced by Slabaugh in [11] as a measure of reconstruction
quality. The camera standpoint of every input image is used to look at the voxel reconstruction.
These reprojected images are subsequently compared to the input images and the difference
between them is used as a measure for the reprojection error.
If the ith input image is Ii and the corresponding image with the reprojection of the
reconstruction is Ri then we can calculate the reprojection error as follows:

RE = \frac{\sum_{i=1}^{N} \sum_{j=1}^{M_i} \left[ (I_i(j).r - R_i(j).r)^2 + (I_i(j).g - R_i(j).g)^2 + (I_i(j).b - R_i(j).b)^2 \right]}{\sum_{i=1}^{N} M_i}

with N the number of images and Mi the number of pixels in image i which belong to the object;
these pixels can be indexed in Ii and Ri as Ii(j) and Ri(j), with j the pixel index. Individual
color channels can be addressed as Ii(j).r, for example.
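A minimal sketch of this computation (Python/NumPy, our own names; object_masks marks, for every image, which pixels belong to the object):

import numpy as np

def reprojection_error(input_images, reprojected_images, object_masks):
    # Mean squared RGB difference over all object pixels of all images.
    total_error = 0.0
    total_pixels = 0
    for I, R, mask in zip(input_images, reprojected_images, object_masks):
        diff = (I.astype(float) - R.astype(float)) ** 2   # squared difference per channel
        total_error += diff[mask].sum()                   # summed over object pixels and channels
        total_pixels += int(mask.sum())                   # M_i
    return total_error / total_pixels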
One should be careful when using the reprojection error: it is only meant as a measure
of how similar the input images are to the reprojected reconstruction. It is not an error metric
one should try to minimize when choosing consistency check parameters! If one minimizes
the reprojection error, then the reconstruction will match the input images very closely, but
it will not be a very good reconstruction from different viewpoints: one is overfitting the
reconstruction to the input images. To prevent this one needs a separate set of test images
on which to evaluate the reprojection error. Making such a large change to the existing notion
of the reprojection error is beyond the scope of our project, as we are interested in a practical
setup.

4.2 Background removal
In the previous chapter we said that we need to perform background removal on the input
images if there is little variation in the background. In figure 4.1 we illustrate this: with no
background removal, the reconstruction will look good from one of the camera standpoints of
the input images, but when looking from different standpoints the background will occlude the
actual object.

Figure 4.1: On the left one of 23 input images of the rose dataset with no background removed;
in the middle the reconstruction as seen from the position of this input image; on the right a
side view of the reconstruction (not an input image view) where the background is occluding
the object.

If we know in advance that we are going to need background removal, we might just as well
make our job easier by choosing a background which is easy to remove by hand (we do this
because we do not have an automatic background removal system). We have picked the four
reference points we are going to use for calibration in advance and used white paper to cover
the rest of the calibration pattern (see figure 4.1). We can now use the magic wand segmentation
found in all image editing software, with some tolerance, to select and delete the white
background. For our 'static' background we also made sure there was little variation, so we
could remove it with a single click as well.
In figure 4.2 you can see an input image which has been processed in this way and the
resulting reconstruction. You can still see some parts of the calibration pattern, but since
we know there is only white under our object, we can remove these by slicing them off our
reconstruction volume. We would like to point out that there still is some white under our
object due to shadows cast by it. One can remove this by increasing the tolerance in the magic
wand segmentation; however if one increases tolerance too much then one risks removing parts
of the object (especially if the object were to contain white).
By choosing the background in our scene properly, we have reduced the tedious job of
background removal to a few clicks per image. With a properly chosen voxel reconstruction
volume, the results are up to par with careful manual background removal using a paintbrush!

4.3 Consistency checks


We will evaluate the different consistency checks we have implemented using the Patrick dataset
which was acquired using our webcam. The dataset contains 15 images with background
manually removed. The top-left image in figure 4.3 shows one of the input images.
Figure 4.2: Top-left: one of 23 input images of the rose dataset with background removed
through magic wand segmentation. Top-right: reconstruction as seen from the position of
this input image. Bottom-left: side view of the reconstruction (same side viewpoint as in
figure 4.1). Bottom-right: reconstruction as seen from the position of the input image, but
with clamped voxel volume to remove calibration pattern artefacts.

The original consistency check (section 2.2.1) has a single threshold parameter called T .
The top-right image in figure 4.3 shows the reconstruction as seen from the standpoint of the
input image for T = 50, bottom-left for T = 65 and bottom-right for T = 80. A threshold
of 50 is obviously too low, as there is a big hole in the reconstruction. Its reprojection error
of 0.1453 is substantially higher than the 0.1230 obtained for T = 65. The difference between
65 and 80 is hardly noticeable, both in the reprojected images
and in the reprojection error.
When looking at your reconstruction it is often obvious if you have to lower or increase the
threshold. A basic rule one can follow is that one should choose a higher T if the reconstruction
contains big holes. If your reconstruction is good, you should lower the threshold to see if this
still gives a good reconstruction. The problem with this consistency check is that one needs
very high thresholds to get highly textured areas right (around Patrick's eyes), making for a 'fat'
reconstruction in other areas. This can get prohibitive if background removal is not precise:
you will get artefacts floating around your reconstruction.

Figure 4.3: Top-left: one of 15 input images of the Patrick dataset. Top-right: reconstruction
as seen from the position of this input image using original consistency check with T = 50.
Bottom-left: reconstruction with original consistency check with T = 65. Bottom-right: recon-
struction with original consistency check with T = 80. Reprojection errors are 0.1453, 0.1230,
0.1229 respectively.

The adaptive consistency check (section 2.2.2) has two parameters T1 and T2 . T2 is scaled
by the mean standard deviation σ causing the check to accept a voxel sooner if the input image
areas all have high standard deviations. From this knowledge and experience we have found
that you should increase T2 if there are missing voxels in highly textured areas. If there are
very big holes, then both parameters need to be increased.

Figure 4.4: In all images the area of interest is Patrick's eyes. The top row contains reconstructions
using the original consistency check with thresholds of 50, 65 and 80. The first image
has a big hole, the second and third are virtually indistinguishable. The middle row uses the
adaptive consistency check with a fixed T1 of 30 and T2 values of 3, 4 and 5. The first
image has a hole, the second misses a few white voxels in the eye and the third is very good.
The bottom row uses the adaptive consistency check with a fixed T2 of 5 and T1 values of
15, 30 and 45. The first image misses a few white voxels in the eye, the second is the same as
the third image in the second row (very good), and the third looks just like the second.

In figure 4.4 we show reconstructions using both the original and adaptive consistency
check and vary a single parameter per row. From the parameter values found in the figure
caption one can derive the scale upon which these parameters should be varied when looking
for a better reconstruction: T2 should use small increments, while T1 (and T ) can use large
increments. The differences between parameter settings are not very big and we have found

that looking for parameters that give a good reconstruction is not very difficult. Looking for
an 'optimal' reconstruction that gets all the details right can be tedious at times, though
(we find T1 = 30 and T2 = 5 to be 'optimal' for this Patrick dataset).
The histogram consistency check (section 2.2.3) has no real parameters (except perhaps
overlap between bins as we mentioned before). On the left in figure 4.5 you see the reconstruc-
tion we get using this consistency check if the background has been removed. The consistency
check appears to be working, but if we try it out on a dataset with no background removed
(which should at least give reasonable results when looking from camera standpoints found
amongst the input images), as we do on the right in figure 4.5, you can see a total failure. For
our datasets, this consistency check declares a voxel to be consistent, no matter what pixels it
projects to! This can be a bug in our implementation, but we suspect the consistency check
itself fails on our datasets.

Figure 4.5: On the left a reconstruction using the histogram consistency check of the Patrick
dataset with background removed; on the right a reconstruction where the background was
not removed. We found that our implementation of the histogram check accepted all voxels to
be consistent immediately.

One might wonder why the non-working histogram consistency check still gives a reasonable-looking
reconstruction once the background has been removed. We introduce a consistency check which accepts
every pixel collection that contains no background pixels (just as our histogram consistency check
appears to be doing) and call it the visual hull consistency check. This consistency check effectively
degenerates voxel coloring into a visual hull algorithm [12]. A visual hull algorithm intersects the
backprojections of the object boundaries from all images to recover the object shape (see figure 4.6).
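A minimal sketch of this degenerate check (the function name is ours; it assumes the background
removal step has already marked which pixels are background):

    function consistent = visual_hull_consistency(is_background)
    % IS_BACKGROUND is a cell array: is_background{i} is a logical vector
    % marking which of the pixels the voxel projects to in image i were
    % removed as background. A voxel is accepted as soon as it projects
    % onto foreground in every image, regardless of colour.

    consistent = true;
    for i = 1:numel(is_background)
        if any(is_background{i})
            consistent = false;   % the voxel sticks out of this silhouette
            return;
        end
    end
    end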

Figure 4.6: On the left photographs are taken of an object. On the right, foreground areas
of these photographs are backprojected into 3D space. The visual hull is inferred from the
intersection of these backprojections. Figure taken from [12]

The quality of a visual hull reconstruction is highly dependent on the quality of background
removal. If the background has been removed very carefully and a lot of images from all
sides are available, then the inferred shape will be very close to the actual object shape (see
figure 4.7).

Figure 4.7: A round object with a hole and its visual hull (bold lines) as it would be inferred from
images taken from the displayed camera standpoints. The more images available, the better
the shape recovered. This assumes the object boundaries in the images are correct, which requires
good background removal. Figure taken from [12].

4.4 Webcam versus digital camera


To evaluate the use of a webcam in our setup we will compare reconstructions created from its
images to reconstructions created from digital camera images. In figure 4.8 we show an image taken
with our webcam and a part of an image taken with our digital camera; both images contain 640x480
pixels. That our digital camera has a better CCD chip than the webcam is visible in
the yellow of the head: many more shades of yellow are used.
For both datasets we have made reconstructions using the adaptive consistency check. A
nice feature of this check observed by Verstraeten [15] is that good parameters for one size of
the voxel volume will also work for other sizes. This means one can double the resolution of
the voxel volume without having to search for good parameters again (this would be needed
in case we used the original consistency check). Thus far all our reconstructions used a voxel
resolution which corresponded to real-world voxel sizes of 1 mm3 . For the athlete dataset the
volume size of 90x60x100 corresponds to this real-world size.
In our comparison of the webcam and the digital camera we will be varying the resolution
of the voxel volume. You can find results for the athlete datasets for varying resolutions in
table 4.1; a number of images of the reconstruction for different resolutions are shown in figure
4.9.
We have found that in terms of shape the webcam and digital camera differ very little at
low resolutions. For higher resolutions the webcam reconstructions contain lots of small holes
and the digital camera reconstructions are clearly better. This is also visible in the numbers
of voxels in the reconstructions: for low resolutions there is little difference, but for higher
resolutions the differences are big.
In terms of color the digital camera is always superior, as is to be expected from its more
expensive CCD chip. For the resolutions shown in figure 4.9 the webcam color quality is only

Figure 4.8: On the left one of 23 input images of the webcam athlete dataset (size is 640x480).
On the right a part of one of 23 input images of the digital camera athlete dataset is shown
(part is the middle 640x480 area of the actual 2048x1536 sized image). By showing an area
of the same size from both images you can see the difference in scale between the cameras.
In a color edition of this report you will be able to see that the digital camera uses a greater
range of colors than the webcam. The athlete dataset shows a figure called Tweety which is
trademarked and copyrighted by Warner Bros.

Athlete object       Webcam dataset                        Digital camera dataset

Voxel volume size    Voxels colored  Reprojection error    Voxels colored  Reprojection error
45x30x50                   2772          0.1976                  2699          0.1546
90x60x100                 15348          0.1216                 15458          0.0984
135x90x150                36925          0.1006                 38523          0.0825
180x120x200               67012          0.0913                 71586          0.0747
225x150x250              104567          0.0871                114770          0.0703
270x180x300              146854          0.0865                167719          0.0674

Table 4.1: Table with number of voxels in the reconstruction and the reprojection error for
the webcam athlete dataset and the digital camera athlete dataset for different sizes of the
reconstruction volume. The webcam dataset contains 778 438 non-background pixels in total;
the digital camera dataset contains 6 339 963 non-background pixels.

Figure 4.9: Reconstructions of the athlete dataset from webcam images (left column) and digital
camera images (right column) at voxel volume sizes 45x30x50, 90x60x100, 180x120x200 and
270x180x300 (top to bottom). Areas of interest include the tail, the back of the head, the hand
and the beak.
acceptable for 45x30x50 and 90x60x100; for higher resolutions the colors become noisier.
This, and the increasing number of small holes in the reconstruction, can be explained by looking
at the number of pixels a voxel projects to. For the higher resolutions, a voxel projects to just
one or two pixels and is then declared consistent, making noise in the input images more visible
in the reconstruction (as the color mean over those pixels hardly reduces the noise). The small
holes are caused by voxels which project to zero (!) pixels and are immediately rejected: they
are missing because there are too few pixels in the input images.
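A back-of-the-envelope estimate (our own, assuming every foreground pixel is attributed to roughly
one surface voxel) supports this: the webcam dataset has 778 438 non-background pixels spread over
23 images, while the 270x180x300 reconstruction contains 146 854 voxels, giving about
778 438 / 146 854 ≈ 5.3 pixels per voxel summed over all images together. Since a surface voxel is
typically visible in several of the 23 images, its footprint in any single image is indeed on the
order of one pixel or less. The digital camera dataset, with 6 339 963 foreground pixels, has
roughly eight times as many pixels per voxel to average over, which is why its colors stay cleaner
at high resolutions.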
One can find very similar results for another object on which we performed these experiments, and
make the same observations; results for this object can be found in appendix C.
Overall we would prefer to use a digital camera in our setup, as reconstruction quality is
clearly superior; this in combination with dropping prices on digital cameras makes it a viable
future option.
Before running these experiments we would have suspected some sort of ‘total breakdown’
for the high resolutions we have tried in combination with the webcam. We were pleasantly
surprised to see that there is a graceful degradation in quality when the resolution increases.
Resolutions higher than the ones we have tried here are no longer practical, as the size of the raw
reconstruction volume2 for our highest resolution already exceeds 50 Megabyte and reconstruc-
tion runtime becomes significantly longer with every resolution increase. Those with time to
spare can investigate at which resolution the digital camera quality starts to degrade, as we
are planning to release our code and datasets to the public domain soon3 .
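As a rough check of that figure (assuming, for the sake of the estimate, four bytes of colour data
per voxel in the raw volume): 270x180x300 = 14 580 000 voxels, which at four bytes each is about
56 Megabyte, in line with the number mentioned above; doubling the resolution in every dimension
would multiply this by eight.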
As a final note we would like to point out that there are parts of the object which cannot
be reconstructed properly from the input images, simply because those parts are not visible in
any of them. In figure 4.10 we have shown such a part of the athlete digital camera dataset:
the back of the head is not visible in any of the images, so there are holes in it.
One can solve this by adding input images that show these ‘missing’ parts, but introducing
these extra camera positions will violate the ordinal visibility constraint of voxel coloring. The
only solution to this is to implement algorithms such as Generalized Voxel Coloring [2] and
Space Carving [5] that do not have this constraint.

Figure 4.10: Parts of the object not visible from any of the input images cannot be properly
reconstructed. In the dataset used the back of the head was not visible and one could not look
under the dress of Tweety.
2 The raw reconstruction volume has room for all voxels and not just the ones colored.
3 The project will be hosted by SourceForge and find its home on http://voxelcoloring.sourceforge.net

Chapter 5

Conclusions

Image acquisition in our setup has been fully automated and is easy to use. The accuracy
of the turntable used is not important as we derive camera position and orientation from a
calibration pattern visible in the images taken; one could even use a turntable operated by
hand!
The extrinsic calibration step, to retrieve camera position and orientation, could use further
improvement, as it requires the user to select four reference points on a checkerboard pattern
in every image. In computer vision there are methods to automatically locate such points in
the pattern; however, additional information will have to be encoded into the calibration pattern
to allow the detected points to be uniquely identified1 .
Between calibration and reconstruction there is the ‘hidden’ step of background removal
that is needed in case there is little variation in the background of the images taken. In practice,
one will use a background with little variation, as this will make background removal by hand
easier; but this does make background removal a necessary step.
There are two types of background in an image: the ‘static’ background that is the same
in all images and the calibration pattern whose position and orientation are known. For both
types it is possible to identify which regions they cover in an image automatically, so automatic
background removal is an option in the future.
We compared two working consistency checks: the original consistency check as defined
by Seitz and Dyer [9] and the adaptive consistency check by Slabaugh [11]. We found the
original consistency check easy to configure but limited when it came to highly textured areas
on objects. The adaptive consistency check is harder to configure but does work properly on
highly textured areas. A very strong point of this consistency check is that good settings work
on different resolutions of the reconstruction as well.
Our reconstructions based on webcam images are surprisingly good, considering image
quality. Contrary to what one might expect, reconstructions at very high resolutions still work
and give acceptable results. There is little difference between webcam and digital camera for
the reconstruction shape at low resolutions, but there is a big difference in color quality. The
digital camera has superior colors when compared to the webcam and this difference is clearly
visible when comparing their reconstructions.
In the future one might want to use a digital camera instead of a webcam as their prices
are dropping and the reconstruction quality is clearly better. We used our digital camera in
1 I will be working on such a pattern and a detection filter in the near future during an Artificial
Intelligence Bachelor project.

manual focus mode to make an initial calibration of intrinsic camera parameters possible. For
higher image quality one could imagine using auto-focus for every image. This means one has
to get a full camera calibration out of a single image, which requires a more advanced camera
calibration pattern and toolbox than we have used. We think manual focus images are good
enough for the near future, as we were not limited by image quality for the digital camera but
by the reconstruction runtime and size: these became impractical.
Using a camera mounted on a tripod causes some parts of the scene not to be visible in
any image taken, making proper reconstruction of these parts impossible. One would have to
remove the tripod and take images manually to get images in which everything is visible. Once
the camera is no longer fixed, one no longer needs a turntable, as the camera itself is now moving.
A disadvantage of a moving camera is that one easily violates the camera constraints required for
voxel coloring; to solve this one would have to implement Space Carving [5] or Generalized
Voxel Coloring [2], as they do not have these camera constraints.

Appendix A

Camera calibration toolbox

We have used the Camera Calibration Toolbox for Matlab [1] in our 3D reconstruction setup.
With this toolbox, we can determine camera properties, position and orientation. The camera
properties, which will be referred to as intrinsic parameters, do not change between different
images. The camera position and orientation do change between images and are commonly
referred to as the extrinsic parameters.

A.1 Intrinsic parameters


The first step in calibrating the camera is to take numerous pictures of a checkerboard pattern
(see figure A.1) and use these to determine the intrinsic parameters of the camera. These
intrinsic parameters include focal distance, principal point, skew and numerous radial distortion
parameters.
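To illustrate how these parameters enter the camera model, the following Matlab sketch projects a
point given in the camera coordinate frame onto the image. The parameter names (fc, cc, alpha_c,
kc) follow the toolbox conventions, but the sketch is our own simplification and only applies the
first two radial distortion terms, not the full distortion model.

    function p = project_point(Xc, fc, cc, alpha_c, kc)
    % XC is a 3-vector [X; Y; Z] in the camera coordinate frame.
    % FC (2-vector focal distance), CC (2-vector principal point),
    % ALPHA_C (scalar skew) and KC (distortion coefficients) follow the
    % naming used by the calibration toolbox.

    x = Xc(1) / Xc(3);                  % normalised (pinhole) coordinates
    y = Xc(2) / Xc(3);
    r2 = x^2 + y^2;

    d = 1 + kc(1) * r2 + kc(2) * r2^2;  % radial distortion factor
    xd = d * x;
    yd = d * y;

    % Apply focal distance, skew and principal point to get pixel coordinates.
    p = [ fc(1) * (xd + alpha_c * yd) + cc(1);
          fc(2) * yd                  + cc(2) ];
    end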

Figure A.1: Two calibration images with the checkerboard pattern

First, the toolbox asks for the size of the squares in the calibration pattern in millimeters.
By having a correspondence between the squares and a real-world metric (millimeters), the
calibration toolbox can express the intrinsic parameters in millimeters, where applicable.
In every picture with a checkerboard pattern, the user has to select 4 corners which together
form a rectangle (see figure A.2). After selecting these points, the toolbox will automatically
count the number of squares inside the selected rectangle. Should the toolbox fail to count the
number of squares automatically, it will consult the user. With knowledge of the number of
squares, the corners inside the selected rectangle can be estimated by interpolating the 4 points
selected. An edge detector is used on the estimated corners to determine the exact locations
of the corners. This is illustrated in figure A.3.
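A rough sketch of the interpolation step (our own simplification: a bilinear blend of the four
selected corners, which ignores perspective foreshortening; this is acceptable here because the
estimates are only a starting point for the edge detector):

    function corners = interpolate_corners(p, nx, ny)
    % P is a 2x4 matrix with the four selected corners, ordered so that
    % p(:,1)..p(:,4) go around the rectangle. NX and NY are the number of
    % squares counted along the two sides. Returns a 2-by-(nx+1)*(ny+1)
    % matrix with an initial guess for every corner inside the rectangle.

    corners = zeros(2, (nx + 1) * (ny + 1));
    k = 1;
    for j = 0:ny
        for i = 0:nx
            u = i / nx;               % fractional position along one side
            v = j / ny;               % and along the other side
            corners(:, k) = (1 - u) * (1 - v) * p(:, 1) + u * (1 - v) * p(:, 2) ...
                          + u * v * p(:, 3) + (1 - u) * v * p(:, 4);
            k = k + 1;
        end
    end
    end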

Figure A.2: Three corners of the calibration rectangle have been selected and the cursor is
over the fourth corner

After all calibration images have been processed, the toolbox can perform the calibration.
It makes a first estimation of the intrinsic parameters and then performs an iterative gradient
search in order to minimize the error over all the detected corners in the calibration images.
The intrinsic parameters are now known, but the toolbox offers additional functionality
to analyze calibration results and improve them further, if needed. We will not discuss the
analysis features of the toolbox here; see the tutorials in [1].
Please note that the intrinsic camera parameters change if the camera changes focus (be-
cause the focal distance changes), so be sure to set your camera to manual focus to get correct
results.

A.2 Extrinsic parameters


For voxel coloring, we have to know the position and orientation of the camera. Once the
intrinsic calibration is complete, the toolbox offers a function to determine the extrinsic pa-
rameters for an image. These extrinsic parameters together form a rigid body transformation,
which is made up of a rotation and a translation, giving the orientation and position of the
camera.
To determine the extrinsic parameters for an image, the user has to select four reference
points which together form a planar rectangle (see figure A.4). These four points have to be
selected in the exact same order in every image, which ensures that all images have the same
coordinate system. In section 3.2.5 we have shown that it is possible to determine the extrinsic

Figure A.3: All corners inside the calibration rectangle have been extracted (red crosses). The
blue squares indicate the area where the edge detector operated.

parameters from just 4 planar points.

Figure A.4: On the left four reference points are being selected. On the right the coordinate
system extracted is shown. The red crosses indicate the points selected and the yellow circles
indicate the reprojected points using the extrinsic parameters.

After selecting the four points, the user has to enter the width and height of the selected
rectangle. If it consists of 3 by 3 squares measuring 30x30mm each, then this is 90mm by
90mm. We thus treat the entire rectangle as a single ‘square’. We do this because the corners
inside the rectangle will generally be obscured by an object (see figure A.4), and we do not want
the toolbox to use these points (as this gives strange results). After entering the real-world
sizes the toolbox can determine the extrinsic parameters.
If one wants to know the position of the camera, then one has to invert the rigid body
transformation matrix (which can be constructed from the extrinsic parameters) and extract
the translation, as that is the camera position.
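A minimal sketch of this inversion (Rc and Tc are our names for the rotation matrix and translation
vector that map world coordinates to camera coordinates):

    % A world point Xw maps to the camera frame as Xc = Rc * Xw + Tc.
    % Inverting this rigid body transformation gives Xw = Rc' * (Xc - Tc),
    % so the camera centre (the point with Xc = [0; 0; 0]) sits at:
    camera_position = -Rc' * Tc;

    % Equivalently, via the inverse of the full 4x4 homogeneous matrix:
    M = [Rc, Tc; 0 0 0 1];
    Minv = inv(M);
    camera_position_alt = Minv(1:3, 4);   % the same vector as above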

A.3 Undistorting images

When we want to evaluate the quality of a voxel coloring reconstruction, we want to compare
a reprojection of the reconstruction with the original image. While our viewer creates repro-
jections using intrinsic parameters like focal distance and principal point properly, it does not
apply any radial distortion to the reprojection. This means that if we want to compare the
reprojection correctly, we have to remove the radial distortion from the original image.
A function to undistort images is available in the calibration toolbox, once the intrinsic
calibration is complete. We made minor changes to this function to reduce its memory usage,
but these do not change its output.

Appendix B

Phidgets

According to the Phidgets website [6], Phidgets are "an easy to use set of building blocks
for low cost sensing and control from your PC. Using the Universal Serial Bus (USB) as the
basis for all Phidgets, the complexity is managed behind an easy to use and robust Application
Programming Interface (API). Applications can be developed quickly in Visual Basic, with beta
support for C, C++, Delphi and FoxPro."

We used a PhidgetServo to move our turntable by computer. The PhidgetServo ‘package’
consists of a USB cable, a controller board and a TS-53 standard servo with a universal connector
(see figure B.1).

Figure B.1: Components for a computer-controlled servo: a USB cable, the PhidgetServo
controller board and a TS-53 standard servo. Pictures taken from [6].

We used Borland Delphi to build a wrapper around the beta C library for Phidgets. We
had to make several bugfixes to this beta library, which have been submitted to the author.
Until they make it into the public version, a custom version of the library has to be used. The
Delphi wrapper allows access to the PhidgetServo from Matlab through the Matlab Generic
DLL Calling functionality found in Matlab 6.5.1. This functionality is also available as a free
update for Matlab 6.5.
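As an illustration of how this looks from the Matlab side (only loadlibrary, calllib and
unloadlibrary are standard Matlab functionality; the wrapper DLL name, header file and exported
function names below are hypothetical placeholders, not the actual interface of our wrapper):

    % Load the Delphi wrapper around the Phidgets library and move the servo.
    loadlibrary('servowrap.dll', 'servowrap.h');

    calllib('servowrap', 'servo_open', 0);                 % open the first PhidgetServo
    calllib('servowrap', 'servo_set_position', 0, 95.0);   % rotate the turntable one step
    pause(1.0);                                            % give the table time to settle

    unloadlibrary('servowrap');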

Appendix C

Rose dataset

In section 4.4 we used our athlete object to compare a webcam with a digital camera. We also
have datasets for a rose, but these would take up too much room in our main body relative to
their contribution. The rose is an interesting object due to its complex shape, but results are
similar to the athlete datasets so we will not repeat ourselves. Input images of the webcam
and digital camera datasets are shown in figure C.1. Results of reconstructions made from
both datasets at different resolutions can be found in table C.1 and figures C.2 and C.3.

Figure C.1: On the left a rose from the webcam dataset; on the right a rose from the digital
camera dataset. In a color edition of this report you will be able to see that the digital camera
uses a greater range of colors than the webcam. Both images are sized relative to the resolution
of images found in their dataset.

Rose object          Webcam dataset                        Digital camera dataset

Voxel volume size    Voxels colored  Reprojection error    Voxels colored  Reprojection error
60x35x35                   2846          0.2550                  2424          0.1386
120x70x70                 14155          0.1828                 13315          0.0731
180x105x105               33072          0.1529                 33203          0.0559
240x140x140               58859          0.1393                 61549          0.0471
300x175x175               89696          0.1315                 97878          0.0417
360x210x210              124808          0.1318                141891          0.0382

Table C.1: Table with number of voxels in the reconstruction and the reprojection error for the
webcam rose dataset and the digital camera rose dataset for different sizes of the reconstruction
volume. The webcam dataset contains 545 690 non-background pixels in total; the digital
camera dataset contains 4 251 981 non-background pixels.

Figure C.2: Reconstructions of the rose dataset from webcam images (left column) and digital
camera images (right column) at voxel volume sizes 60x35x35 and 120x70x70 (top to bottom).
Continued in figure C.3.

Figure C.3: Reconstructions of the rose dataset from webcam images (left column) and digital
camera images (right column) at voxel volume sizes 180x105x105, 240x140x140, 300x175x175 and
360x210x210 (top to bottom). Continued from figure C.2.

Bibliography

[1] Jean-Yves Bouguet. Camera calibration toolbox for Matlab.
http://www.vision.caltech.edu/bouguetj/calib_doc

[2] Bruce Culbertson, Tom Malzbender and Greg Slabaugh. Generalized Voxel Coloring. In
Proceedings of the ICCV Workshop, Vision Algorithms Theory and Practice, Springer-
Verlag Lecture Notes in Computer Science 1883, pages 100-115, 1999.

[3] Laura Downes and Alex Berg. CS184: Computing rotations in 3D.
http://www.cs.berkeley.edu/~ug/slide/pipeline/assignments/as5/rotation.html

[4] J.D. Foley, A. van Dam, S.K. Feiner and J.F. Hughes. Computer Graphics: Principles and
Practice. Addison-Wesley Publishing Co, section 12.6, 1990.

[5] Kiriakos N. Kutulakos and Steven M. Seitz. A theory of shape by space carving. In Inter-
national Journal of Computer Vision, 38(3): pages 198-218, 2000.

[6] Phidgets Inc. - Unique USB Interfaces http://www.phidgets.com

[7] Andrew C. Prock and Charles R. Dyer. Towards real-time voxel coloring. In Proceedings
of the DARPA Image Understanding Workshop, pages 315-321, 1998.

[8] Miguel Sainz, Nader Bagherzadeh and Antonio Susin. Hardware accelerated voxel carving.
In Proceedings of SIACG, 2002.

[9] Steven M. Seitz and Charles R. Dyer. Photorealistic scene reconstruction by voxel coloring.
In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1067-
1073, 1997.

[10] Steven M. Seitz and Charles R. Dyer. Photorealistic scene reconstruction by voxel coloring.
In International Journal of Computer Vision, 35(2): pages 151-173, 1999.

[11] Greg Slabaugh. Novel volumetric scene reconstruction methods for new view synthesis.
PhD Thesis, Georgia Institute of Technology, 2002.

[12] Greg Slabaugh, Bruce Culbertson, Tom Malzbender and Ron Schafer. A survey of methods
for volumetric scene reconstruction from photographs. International Workshop on Volume
Graphics, 2001.

[13] Mark R. Stevens, Bruce Culbertson and Tom Malzbender. A histogram-based color con-
sistency test for voxel coloring. In Proceedings of International Conference on Pattern
Recognition, 2002.

[14] Will Sutton, Plasmadroid, Koen van de Sande and Stuart Carey. Voxel Section Editor II.
http://www.tibed.net/voxel

[15] Thomas Verstraeten. 3D Scene Reconstruction using Voxel Coloring. Master Thesis, Uni-
versity of Amsterdam, August 2003.

[16] R.Y. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision
metrology using off-the-shelf cameras and lenses. In IEEE Journal of Robotics and Au-
tomation: pages 323-344, 1987.

[17] Shuzhen Wang. Term Project for CPSC514: Voxel Coloring. Technical Report, 2002.

