Mean-Shift Blob Tracking Through Scale Space
Robert T. Collins
Carnegie Mellon University
Abstract
The mean-shift algorithm is an efficient technique for tracking 2D blobs through an image. Although the scale of the
mean-shift kernel is a crucial parameter, there is presently
no clean mechanism for choosing or updating scale while
tracking blobs that are changing in size. We adapt Lindeberg's theory of feature scale selection based on local maxima of differential scale-space filters to the problem of selecting kernel scale for mean-shift blob tracking. We show
that a difference of Gaussian (DOG) mean-shift kernel enables efficient tracking of blobs through scale space. Using
this kernel requires generalizing the mean-shift algorithm
to handle images that contain negative sample weights.
1. Introduction
The mean-shift algorithm is a nonparametric statistical
method for seeking the nearest mode of a point sample distribution [3, 6]. The algorithm has recently been adopted as
an efficient technique for appearance-based blob tracking
[1, 4]. In the blob tracking scenario, sample points are regularly distributed along the image pixel grid, with a pixel's value w(a) being the sample weight of the point at that location. This sample weight is chosen such that pixels on the foreground blob have high weight, and pixels in the background have low weight. The mean-shift algorithm specifies how to combine the sample weights w(a) in a local neighborhood with a set of kernel weights K(a) to produce
an offset that tracks the centroid of the blob in the image
(Figure 1).
The problem we address in this paper is selecting the scale of the mean-shift kernel, which directly determines the size of the window within which sample weights are examined. This size should of course be proportional to the expected image area of the blob being tracked. Although kernel scale is a crucial parameter for the mean-shift algorithm, there is currently no sound mechanism for choosing this scale within the framework. We show how to combine a well-developed theory of feature scale selection due to Lindeberg [8] with the mean-shift algorithm, resulting in a method for tracking blobs through scale space.
This work is supported in part by DARPA/IAO HumanID under
ONR contract N00014-00-1-0915, and by DARPA/IPTO MARS contract
NBCHC020090.
The mean-shift offset $\Delta x$ at the current location $x$ is computed as

$$\Delta x \;=\; \frac{\sum_a K(a-x)\, w(a)\, a}{\sum_a K(a-x)\, w(a)} \;-\; x \;=\; \frac{\sum_a K(a-x)\, w(a)\, (a-x)}{\sum_a K(a-x)\, w(a)} \tag{1}$$
where K is a suitable kernel function and the summations
are performed over a local window of pixels a around the
current location x. A suitable kernel K is one that can be written in terms of a profile function k such that $K(y) = k(\|y\|^2)$, where the profile k is nonnegative, nonincreasing, piecewise continuous, and satisfies $\int_0^\infty k(r)\,dr < \infty$ [3].
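As an illustration, a single iteration of Eq. (1) on a 2D weight image might look like the following numpy sketch. The Epanechnikov profile, the function name, and the window handling are illustrative choices, not the paper's implementation:

```python
import numpy as np

def mean_shift_step(weights, x, radius):
    """One mean-shift iteration of Eq. (1) on a 2D weight image w(a).
    Returns the offset (delta_row, delta_col) from the current center x."""
    rows, cols = weights.shape
    r0, c0 = int(round(x[0])), int(round(x[1]))
    num, den = np.zeros(2), 0.0
    for r in range(max(0, r0 - radius), min(rows, r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(cols, c0 + radius + 1)):
            d2 = ((r - x[0]) ** 2 + (c - x[1]) ** 2) / radius ** 2
            if d2 >= 1.0:
                continue                        # outside the kernel support
            kw = (1.0 - d2) * weights[r, c]     # Epanechnikov K times sample weight w(a)
            num += kw * np.array([r - x[0], c - x[1]])
            den += kw
    return num / den if den != 0 else np.zeros(2)

# Typical use: repeat until the offset becomes negligible.
# x = x + mean_shift_step(w_img, x, radius=16)
```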
An important theoretical property of the mean-shift algorithm is that the local mean-shift offset $\Delta x$ computed at position x using kernel K points in the gradient direction of the convolution surface

$$C(x) \;=\; \sum_a H(a-x)\, w(a) \tag{2}$$

where H is a shadow kernel of K, meaning that their profiles are related by

$$h'(r) \;=\; -\gamma\, k(r) \tag{3}$$

for some constant $\gamma > 0$ [3]. A natural way to cope with negative sample weights is to add a constant c, large enough to make all weights nonnegative. Substituting w(a) + c into Eq. (1) yields

$$\Delta x_c \;=\; \frac{\sum_a K(a-x)\,(w(a)+c)\,(a-x)}{\sum_a K(a-x)\,(w(a)+c)} \tag{4}$$

$$=\; \frac{\sum_a K(a-x)\,w(a)\,(a-x) \;+\; c \sum_a K(a-x)\,(a-x)}{\sum_a K(a-x)\,(w(a)+c)} \tag{5}$$

$$=\; \frac{\sum_a K(a-x)\,w(a)\,(a-x)}{\sum_a K(a-x)\,(w(a)+c)} \tag{6}$$

since $\sum_a K(a-x)\,(a-x) = 0$ for a symmetric kernel, showing that the direction of the mean-shift vector is invariant, and just the step size changes.
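A quick numeric check makes Eqs. (4)-(6) concrete. This is a minimal numpy sketch; the window, kernel, and signed weights below are arbitrary illustrative choices:

```python
import numpy as np

# Check of Eqs. (4)-(6): adding a constant c to the sample weights leaves
# the mean-shift direction unchanged and only rescales the step. The
# window is symmetric about x = 0, so sum_a K(a) a = 0 drops the c term.
a = np.array([[i, j] for i in range(-3, 4) for j in range(-3, 4)], float)
K = np.maximum(0.0, 1.0 - (a ** 2).sum(axis=1) / 9.0)  # Epanechnikov kernel
w = 1.0 + 0.5 * a[:, 0]                                # signed sample weights

def offset(c):
    kw = K * (w + c)
    return (kw[:, None] * a).sum(axis=0) / kw.sum()

d0, d5 = offset(0.0), offset(5.0)
print(d0, d5)                          # parallel vectors; shorter step for c = 5
print(d0[0] * d5[1] - d0[1] * d5[0])   # 2D cross product ~ 0: same direction
```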
Instead, we should interpret a point's weight as a vote for the direction and magnitude of the mean-shift offset vector, towards or away from that point. For a single point p, a positive weight w specifies that the offset vector at neighboring points should be directed towards p, with magnitude w. If instead we have a negative weight w, the offset vector at neighboring points is directed away from p, but with magnitude |w|. The mean-shift equation can be interpreted as forming the superposition of all these point-wise offset votes to produce an overall average offset vector.
We now see how to modify the mean-shift equation (1)
to make sense for negative weights. The numerator of
that equation votes for both the magnitude and direction
of point-wise offset vectors, so the negative weights should
stay. However, the denominator normalizes by the overall
total magnitude of the votes, and therefore we must sum
only the magnitude (the absolute value) of each term. The
modified equation is
$$\Delta x \;=\; \frac{\sum_a K(a-x)\, w(a)\, (a-x)}{\sum_a \left| K(a-x)\, w(a) \right|} \tag{7}$$
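A minimal numpy sketch of Eq. (7), assuming the kernel values and signed sample weights have already been evaluated at the sample positions (all names illustrative):

```python
import numpy as np

def mean_shift_step_signed(weights, kernel, coords, x):
    """Mean-shift offset of Eq. (7), valid for signed sample weights.

    weights : (N,) possibly negative sample weights w(a).
    kernel  : (N,) kernel values K(a - x) at the same samples.
    coords  : (N, 2) sample positions a.
    x       : (2,) current position.
    """
    kw = kernel * weights
    num = (kw[:, None] * (coords - x)).sum(axis=0)
    den = np.abs(kw).sum()          # normalize by total vote magnitude
    return num / den if den > 0 else np.zeros(2)
```

With all-nonnegative weights the denominator equals that of Eq. (1), so this step reduces to the classic mean shift.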
Lindeberg [8] detects blob features as local maxima, over both space and scale, of the image filtered with scale-normalized differential operators. The appropriate operator for blob detection is the Laplacian of Gaussian (LOG)

$$\mathrm{LOG}(x;\sigma) \;=\; \frac{2\sigma^2 - \|x\|^2}{2\pi\sigma^6}\; e^{-\|x\|^2/(2\sigma^2)} \tag{8}$$

whose response to an image f at scale $\sigma$ is the convolution

$$L(x;\sigma) \;=\; \left[\frac{2\sigma^2 - \|x\|^2}{2\pi\sigma^6}\; e^{-\|x\|^2/(2\sigma^2)}\right] * f(x). \tag{9}$$

The scale-normalized response $\sigma^2 L(x;\sigma)$ is invariant to image resizing. If the image is rescaled by a factor s, so that $f_s(y) = f(y/s)$, then substituting $a = sb$ (with Jacobian $|1/s^2|$ in the reverse direction) gives

$$L_s(sy;\, s\sigma) \;=\; \int \mathrm{LOG}(sy-a;\, s\sigma)\, f(a/s)\, da \tag{10}$$

$$=\; \int \mathrm{LOG}(s(y-b);\, s\sigma)\, f(b)\, s^2\, db \tag{11}$$

$$=\; \frac{1}{s^2} \int \mathrm{LOG}(y-b;\, \sigma)\, f(b)\, db \;=\; \frac{1}{s^2}\, L(y;\sigma), \tag{12}$$

so that at a blob centered at the origin

$$(s\sigma)^2\, L_s(0;\, s\sigma) \;=\; \sigma^2\, L(0;\, \sigma). \tag{13}$$

A blob that gives a maximal normalized response at scale $\sigma$ therefore gives the same maximal response at scale $s\sigma$ after resizing, which is the basis of feature scale selection. The LOG operator at scale $\sigma$ is closely approximated by a difference of Gaussian (DOG) operator formed from Gaussians whose variances are in the ratio 1.6:

$$\mathrm{DOG}(x;\sigma) \;=\; \frac{1}{2\pi\sigma^2}\, e^{-\|x\|^2/(2\sigma^2)} \;-\; \frac{1}{2\pi\sigma^2 (1.6)}\, e^{-\|x\|^2/(2\sigma^2 (1.6))} \;\approx\; 0.4875\,\sigma^2\, \mathrm{LOG}(x;\sigma). \tag{14}$$

The extra $\sigma^2$ factor is what makes the DOG operator invariant across scales. This precise manner in which the DOG operator approximates the LOG operator seems not to be widely discussed. The two are typically presented in the context of locating zero crossings, where the scale factor does not matter [7].
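For concreteness, the DOG filter of Eq. (14) can be sampled on a pixel grid as below. This is a minimal numpy sketch assuming the variance-ratio form given above; the grid size and scale values are illustrative:

```python
import numpy as np

def gaussian2d(size, var):
    """Unit-mass 2D Gaussian with variance `var` on a size x size grid."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    return np.exp(-(x ** 2 + y ** 2) / (2.0 * var)) / (2.0 * np.pi * var)

def dog_kernel(size, sigma, ratio=1.6):
    """Sampled DOG filter of Eq. (14): Gaussian variances sigma^2 and 1.6 sigma^2."""
    return gaussian2d(size, sigma ** 2) - gaussian2d(size, ratio * sigma ** 2)

# A small bank of DOG filters at several nearby scales.
sigma0, b, n = 8.0, 1.1, 2
size = 2 * int(3 * np.sqrt(1.6) * sigma0 * b ** n) + 1   # odd support, ~3 std
bank = [dog_kernel(size, sigma0 * b ** s) for s in range(-n, n + 1)]
```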
We adapt feature scale selection theory to create a mechanism for adapting mean-shift kernel size while tracking
blobs through changes in scale. Intuitively, we will track a
blob's location and scale by using the mean-shift algorithm to track the local maximum (Eq. 9) that represents the blob
feature in the scale space generated by the DOG operator.
Define a 3D shadow kernel $H(x;\sigma)$ with two spatial dimensions $x$ and one scale dimension $\sigma$. At any given scale $\sigma_0$, the 2D marginal kernel $H(x;\sigma=\sigma_0)$ will be the spatial filter $\mathrm{DOG}(x;\sigma_0)$. The set of DOG filters at multiple scales
form a scale-space filter bank, which is convolved with a
sample weight image where each pixel is proportional to
the likelihood that it belongs to the object being tracked.
See Figure 3. The results of the DOG filters at multiple
scales are then convolved with an Epanechnikov shadow kernel in the scale dimension. The result is a 3D scale space
representation in which modes represent blobs at different
spatial positions and scales. We want to design a mean-shift
filter to track modes through this scale space representation.
More formally, define a set of kernel scales around the current scale $\sigma_0$ as

$$\{\,\sigma_s = \sigma_0\, b^s \ \text{ for } \ -n \le s \le n\,\}. \tag{15}$$

At each sampled scale, the spatial marginal of the shadow kernel, centered on the current image position $x_0$, is the DOG filter at that scale,

$$H_x(x;\,\sigma_s) \;=\; \mathrm{DOG}(x - x_0;\, \sigma_s), \tag{17}$$

formed from Gaussians $G(x; x_0, \sigma^2)$ as in Eq. (14), while the marginal in the scale dimension is an Epanechnikov kernel centered on $\sigma_0$. The interleaved procedure alternates between holding the scale fixed while mean-shifting in space using Eq. (7), and holding the position fixed while mean-shifting in the scale dimension, which updates the current scale to

$$\hat{\sigma} \;=\; \frac{\sum_s \sum_x H_x(x;\,\sigma_s)\, w(x)\, \sigma_s}{\sum_s \sum_x H_x(x;\,\sigma_s)\, w(x)}. \tag{22}$$
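A simplified sketch of the scale step, assuming a precomputed weight image. The Epanechnikov smoothing across scales is omitted, the update uses an Eq. (7)-style magnitude normalization since DOG responses are signed, and all function names are illustrative:

```python
import numpy as np

def dog_response(w_img, x, sigma, ratio=1.6):
    """Correlate the weight image with DOG(.; sigma) centered at x (Eq. 14)."""
    r = int(3 * np.sqrt(ratio) * sigma)
    r0, c0 = int(round(x[0])), int(round(x[1]))
    rows = np.arange(max(0, r0 - r), min(w_img.shape[0], r0 + r + 1))
    cols = np.arange(max(0, c0 - r), min(w_img.shape[1], c0 + r + 1))
    d2 = (rows[:, None] - x[0]) ** 2 + (cols[None, :] - x[1]) ** 2
    g1 = np.exp(-d2 / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    g2 = np.exp(-d2 / (2 * ratio * sigma ** 2)) / (2 * np.pi * ratio * sigma ** 2)
    patch = w_img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    return ((g1 - g2) * patch).sum()

def scale_mean_shift(w_img, x, sigma, b=1.1, n=2):
    """One mean-shift step in the scale dimension, in the spirit of Eq. (22)."""
    scales = sigma * b ** np.arange(-n, n + 1)
    resp = np.array([dog_response(w_img, x, s) for s in scales])
    den = np.abs(resp).sum()
    if den == 0:
        return sigma
    # Shift sigma toward the scales with the strongest DOG response.
    return sigma + (resp * (scales - sigma)).sum() / den
```

In a full tracker this step would be interleaved with the spatial mean shift of Eq. (7) until both converge.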
5. Examples
Figure 4 illustrates the motivation behind the scale selection approach presented in this paper. The top of Figure 4 shows three blobs (squares) of different sizes. The bottom of Figure 4 shows one slice through the 3D scale space generated by convolution with the DOG-Epanechnikov filter bank defined in the previous section. The modes in this scale space clearly localize each of the blobs both spatially and in the scale dimension, as expected from the feature scale selection theory of Lindeberg [8]. In the previous section we designed an interleaved mean-shift procedure that can find and track these modes without having to explicitly generate the scale space.
The sample weight image is computed using the histogram-based weights used in [4], who show that these weights are related to maximizing the similarity of the histograms $d_i$ and $m_i$, as measured by the Bhattacharyya coefficient.
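For reference, the weights of [4] backproject the ratio of the model histogram to the histogram observed in the current window; a minimal sketch, assuming per-pixel bin indices have already been computed (names illustrative):

```python
import numpy as np

def weight_image(bin_idx, m, d, eps=1e-6):
    """Sample weights in the spirit of [4]: w(a) = sqrt(m_i / d_i), where i
    is the histogram bin of pixel a, m is the model histogram, and d is the
    histogram observed in the current window."""
    return np.sqrt(m / np.maximum(d, eps))[bin_idx]
```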
Sample frames of the tracking results from three different algorithms are presented. Figure 5A shows tracking results from the classic mean-shift algorithm without
any changes of scale. The person is successfully tracked
throughout the sequence, but the fixed scale of the tracking kernel provides poor localization of the centroid of the
person as their size increases near the end of the sequence.
Figure 5B shows tracking results when the scale is
adapted using the approach suggested in [4]. At each iteration, the mean-shift algorithm is run three times: once with the current scale, and once each with window sizes 10 percent larger and 10 percent smaller than the current size. For each run, the color distribution observed within the mean-shift window after convergence is compared to the model color distribution using the
Bhattacharyya coefficient, and the window size yielding the
most similar distribution is chosen as the new current scale.
We see that the window quickly shrinks too much, a common failure of this scale selection approach. At roughly a
third of the way through the sequence, the tracking is lost.
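For comparison, that heuristic can be sketched as follows, assuming a hypothetical `run_tracker` routine that runs mean shift at a given window size and returns the color histogram observed after convergence:

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalized histograms."""
    return np.sqrt(p * q).sum()

def select_scale(run_tracker, model_hist, h):
    """Try window sizes 0.9h, h, 1.1h and keep the one whose converged
    color histogram best matches the model (the heuristic of [4])."""
    candidates = [0.9 * h, h, 1.1 * h]
    scores = [bhattacharyya(run_tracker(hc), model_hist) for hc in candidates]
    return candidates[int(np.argmax(scores))]
```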
Finally, Figure 5C shows the results of our kernel size
selection method, described in the last section. The person is consistently tracked, both in image location AND in
scale. Note in particular the correct adaptation of kernel
scale as the person's size rapidly expands near the end of
the sequence.
Figure 5: Tracking examples. (A) Using a fixed-scale mean-shift kernel. The person is tracked through the sequence, but localization is poor when the person's size increases. (B) Using the plus or minus 10 percent scale adaptation method (see text). The kernel soon shrinks too much, leading to tracking failure. (C) Using the scale-space mode-tracking method presented in this paper. The person is tracked well, both spatially and in scale.

6. Conclusion

We have adapted Lindeberg's theory of feature scale selection, based on local maxima of differential scale-space filters, to the problem of selecting and updating kernel scale for mean-shift blob tracking, and shown that a difference of Gaussian mean-shift kernel, together with a generalization of mean shift to negative sample weights, enables efficient tracking of blobs through scale space.
References
[1] Bradski, G.R., Computer Vision Face Tracking for
Use in a Perceptual User Interface, IEEE Workshop
on Applications of Computer Vision, Princeton, NJ,
1998, pp.214-219.
[2] Bretzner, L. and Lindeberg, T., Qualitative Multiscale Feature Hierarchies for Object Tracking, Journal of Visual Communication and Image Representation, Vol 11(2), June 2000, pp.115-129.
[3] Cheng, Y., Mean Shift, Mode Seeking, and Clustering, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol 17(8), August 1995, pp.790-799.
[4] Comaniciu, D., Ramesh, V. and Meer, P., Real-Time
Tracking of Non-Rigid Objects using Mean Shift,
IEEE Computer Vision and Pattern Recognition, Vol
II, 2000, pp.142-149.
[5] Comaniciu, D., Ramesh, V. and Meer, P., The Variable Bandwidth Mean Shift and Data-Driven Scale Selection, International Conference on Computer Vision, Vol I, 2001, pp.438-445.
[6] Fukunaga, K. and Hostetler, L.D., The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition, IEEE Trans. Information Theory, Vol 21, 1975, pp.32-40.
[7] Hildreth, E.C., The Detection of Intensity Changes
by Computer and Biological Vision Systems, Computer Vision, Graphics and Image Processing, Vol
22(1), April 1983, pp.1-27.
[8] Lindeberg, T., Feature Detection with Automatic
Scale Selection, International Journal of Computer
Vision, Vol 30(2), November 1998, pp.79-116.