Real-Time Obstacle Detection and Tracking for Sense-and-Avoid Mechanism in UAVs
only on object detection or object tracking, rather than creating an intelligent system capable of simultaneous detection and tracking in real-time. In this paper, we propose a novel and intelligent vision-based system that can automatically detect, localize, and track objects at high speed. Extensive experiments demonstrate that the proposed approach stands out among all the state-of-the-art detectors and trackers in terms of speed and precision.

Mathematically, a saliency map can be understood as a probability map that expresses the probability of salient pixels in terms of intensity relative to the entire image [7], [8]. Wei et al. [9] claimed that a major portion of the image is occupied by the image background and is homogeneous; hence, the image boundary can be easily connected (connectivity prior). They also assumed that objects are generally absent on the image boundaries, so these boundaries can be presumed to be background as well (background prior). Zhang et al. [10] successfully used the minimum barrier distance [11], [12] along with a raster scanning algorithm, utilizing both the connectivity prior and the background prior, to generate the saliency map in their work.

Classical tracking approaches can be categorized as generative and discriminative models. In the generative trackers [7], [8], [13]–[15], the targets are represented as a set of basis vectors in a subspace and the trackers search for regions similar to previously tracked targets, while the discriminative trackers [5], [16], [17] use binary classification to differentiate the background from the desired target. It has been mathematically proven that the asymptotic error of a discriminative model is lower than that of a generative model [18].

Tracking-by-detection approaches [15], [17], [19] provide a new concept for detection and tracking; however, such approaches suffer from the well-known stability-plasticity dilemma [20], where the drifting of an object in the later frames cannot be rectified, since the classifier cannot be trained with stable samples like those of the first frame. Thus, these approaches barely identify noisy images with occlusion.

Henriques et al. [5] harnessed the circulant structure of the samples in the tracking problem with the aid of a kernelized correlation filter (KCF). This method is computationally inexpensive because it transforms the correlation operation from the spatial domain to the frequency domain by exploiting the circulant structure and Parseval's identity, yielding only O(n log n) complexity. It is based on the principle that correlation with a circulant matrix (used in the kernel ridge regression algorithm) in the spatial domain is equivalent to element-wise multiplication in the frequency domain, according to the Fourier transform. However, experiments show that algorithms using correlation filtering alone fail to track an obstacle over a long period of time. This problem was successfully solved by Bharati et al. in [21] by providing feedback to the tracker about the current state of the object being tracked and using an adaptive detection scheme to re-detect the object in the cases of failure.
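As a concrete check of this principle, the short NumPy sketch below (our own illustration, not code from the paper) verifies numerically that circular cross-correlation computed in the spatial domain matches the inverse FFT of the element-wise product of the two spectra.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)          # signal (e.g., one row of image features)
h = rng.standard_normal(8)          # filter of the same length

# Spatial domain: circular cross-correlation g[k] = sum_n h[n] * x[(n + k) mod N]
g_spatial = np.array([np.dot(h, np.roll(x, -k)) for k in range(len(x))])

# Frequency domain: element-wise product of FFT(x) with the conjugate of FFT(h)
g_freq = np.real(np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(h))))

print(np.allclose(g_spatial, g_freq))   # True: the two computations agree
```

The same identity is what allows a correlation filter to be trained and evaluated over all cyclic shifts of a patch at O(n log n) cost.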
In addition, most of the generative and discriminative trackers are required to be manually initialized with the position of the target in the first frame, making them an incomplete automated system. Since manual labeling is required in such trackers, they are not suitable for fully autonomous UAVs. Moreover, such trackers are not fit for long-term tracking, as discussed before. Similarly, most of the trackers assign a fixed-size bounding box to track only a part of the moving object in a scene, although the object changes its shape and size throughout a sequence. Such approaches are inappropriate for estimating the shape and size of the obstacle and therefore for rectifying the path of UAVs to avoid possible collisions with the obstacle. Some of the tracking-by-detection methods aim to provide a changing bounding box according to the object's shape and size, but they perform slower and are thus inapt for real-time implementation.

In this paper, we propose a fast, reliable and accurate object localization and tracking approach for the autonomous navigation of flying UAVs by integrating the techniques for salient object detection [10] with the kernelized correlation filter [5]. Our approach achieves better detection and tracking results compared to the state-of-the-art methods in terms of speed and accuracy, as demonstrated in our experimentation section. Fig. 1 shows the result of the proposed method along with SAMF [4] and KCF [5]. It can be clearly observed that, although the appearance of the flying object undergoes deformations, illumination or scale variations, the proposed method accurately confines the concerned object compared to other peer trackers. The main contributions of this paper are listed below:
• The proposed approach correctly localizes and generates an adaptive bounding box in real-time despite varying shape and size of the object throughout the sequence.
• Our approach, by integrating the detection and tracking strategies together and forming a closed-loop system, achieves long-term error-free tracking.
• The proposed approach, by training the filter from previous frames, tracks the object in subsequent frames without the need of any computationally expensive supervised training for the detection.
• The proposed system is fully automated, accurate and has superior real-time speed without requiring any sort of manual intervention.

II. RELATED WORK

An object tracking-by-detection approach was proposed in [22]. However, since their detector needs training with a large number of data samples, auto-initialization is not feasible in their approach. Optical flow motion cues were leveraged in [23] to design a tracker combined with a detection scheme based on a saliency map with auto-initialization in the first frame. However, such trackers are computationally too expensive for real-time applications. Multiple cameras in an aircraft that measure the altitude and act as a sense-and-avoid collision sensor were used in [24], but their method is inefficient without the use of GPS data (which may have delays) and thus not feasible for real-time use.

Previous research on saliency map-based object detection can be broadly classified into two methods — top-down and bottom-up.
Fig. 2. Illustration of our approach: (a) input frames, (b) correlation filter and tracking, (c) (from left to right) frame to redetect the salient object, saliency
map generated, thresholded binary image using our post processing technique and new bounding box detected on the object.
In the top-down methods [7], [8], [25], detection is executed on a reduced search space since all the possible objects in an image are localized. But these methods are unrealistic for real-time object detection because they are mostly task-driven and accompanied by supervised learning. On the other hand, bottom-up methods [9], [10], [26]–[28] compare the feature contrast of the salient region with the background contrast by using low-level features (like color, contrast, shape, texture, gradient and spatio-temporal features) from an image. Such methods are more likely to fail in the case of complex images, as they have no prior knowledge of the localization of the object or the number of objects present in an image. In contrast, the top-down methods require proper training before detection. However, our approach identifies the approximate location of the object from the previous tracking results and then performs the re-detection on a much smaller search region. Hence, our method is computationally efficient, since it does not require any type of supervised training for the detection. Additionally, the reduced search region enhances qualitative efficiency during detection.

Object detection in [9], [29] used a geodesic saliency map, looking at the contrast of an image and calculating the distance of each pixel from the background seeds to segment a region in the image. A supervised regression-based segmentation approach in [27] used a binary classifier but was limited to detecting single objects in an image. Instead of scanning an image with sliding windows, a ranked list of innumerable proposal windows in an image was proposed in [30], [31]. Such methods improved the recall rate but failed to correctly localize an object in a given scene. A minimum barrier saliency map was generated using a raster scanning method in [10], which performed better than the geodesic saliency map. However, this method generated the saliency map over the entire image, whereas in our approach it is exploited to look only around the region where the object is more likely to be found, given its previous location.
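To make the raster-scanning idea of [10] more tangible, the following is a minimal, unoptimized NumPy sketch of a minimum barrier distance transform seeded at the image boundary (the background prior). The function name mbd_saliency, the fixed number of passes and the pure-Python loops are our own simplifications; the detector in [10] applies further approximations and post-processing to reach its reported speed.

```python
import numpy as np

def mbd_saliency(img, n_iters=3):
    """Approximate minimum barrier distance from the image boundary (background seeds).

    img: 2-D float array in [0, 1]. Large values in the returned map indicate pixels
    that are hard to reach from the boundary, i.e., likely to belong to a salient object.
    """
    h, w = img.shape
    dist = np.full((h, w), np.inf)
    hi = img.copy()          # running maximum along the best path found so far
    lo = img.copy()          # running minimum along the best path found so far

    # Seeds: boundary pixels are assumed to be background, distance 0.
    dist[0, :] = dist[-1, :] = dist[:, 0] = dist[:, -1] = 0.0

    def relax(y, x, ny, nx):
        # Barrier cost if the path to (y, x) is extended through neighbour (ny, nx).
        new_hi = max(hi[ny, nx], img[y, x])
        new_lo = min(lo[ny, nx], img[y, x])
        cost = new_hi - new_lo
        if cost < dist[y, x]:
            dist[y, x], hi[y, x], lo[y, x] = cost, new_hi, new_lo

    for it in range(n_iters):
        if it % 2 == 0:      # forward raster scan: use the top and left neighbours
            for y in range(1, h):
                for x in range(1, w):
                    relax(y, x, y - 1, x)
                    relax(y, x, y, x - 1)
        else:                # backward raster scan: use the bottom and right neighbours
            for y in range(h - 2, -1, -1):
                for x in range(w - 2, -1, -1):
                    relax(y, x, y + 1, x)
                    relax(y, x, y, x + 1)
    return dist
```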
Since most of the previous work required supervised training, correlation filters, though adept at object tracking, long seemed inappropriate for real-time object tracking. The minimum output sum of squared error (MOSSE) filter [32] and its derivatives [33]–[35] were found to be computationally efficient for real-time object tracking, as the correlation filter was trained on gray-scale images in this approach. Subsequently, ample research has been done on correlation filter-based tracking. As a result, the MOSSE filter was improved in [36] by introducing a kernel-based correlation filter trained on gray-scale images, reaching high tracking speed on benchmark datasets [37]. Henriques et al. integrated Gaussian and polynomial kernels together with multi-channel HoG features in [5] to achieve higher accuracy and speed than most of the state-of-the-art discriminative and generative trackers. However, their method suffered from an inability to deal with scale variations because of the fixed template size. Li et al. [4] tried to solve this problem by combining adaptive templates and HoG features in the SAMF tracker. To adapt to the changing size and appearance of the object, Danelljan et al. [38] used HoG features in a multiscale correlation filter in the DSST tracker. However, all these trackers are prone to mishandling the cases of occlusion and camera instability throughout the sequence.

A part-based tracking algorithm using a correlation filter was proposed in [39] to deal with occlusion. Only some part of the object is visible during a partial occlusion, and a part-based tracker exploits this feature to successfully handle the partial occlusion. However, such algorithms fail if the object undergoes complete occlusion (becomes invisible) between certain consecutive frames. Correlation between temporal contexts was used in [20] to estimate the translation and scale change of the objects. This approach also used a re-detection scheme, by training a fern classifier to handle tracking failures, for long-term tracking. However, the added re-detection made their tracker run slower. Some other detectors [26], [28], [40] and trackers [41], [42] rely on deep learning techniques to improve the accuracy of the trackers and thus require a large-scale training database, making them slower and unsuitable for real-time applications.

III. PROPOSED APPROACH

In this section, we describe the details of the proposed strategy for fast and robust object detection and tracking. A flowchart of the proposed technique is shown in Fig. 2.

First, a saliency map S of the entire image is generated to segment the salient object out from the background and auto-initialize the tracker with the current location of the salient object for tracking in the consecutive frame. In this process, we generate a saliency map, post-process the generated saliency map using the proposed post-processing technique to segment the salient object, and feed the location of the salient object to initialize the tracker. Next, the filter starts training itself on the salient object in each frame while tracking of the object runs simultaneously, until a low peak of filter response (confidence value) is observed. The confidence value measures the resemblance of the object in the consecutive frame compared
Fig. 4. From left to right: Original frame, generated saliency map, binary image and the object boundary using our object detection technique.

The mean intensity of pixels in C1 is given by

$$m_1 = \sum_{i=0}^{t} i\,P(i \mid C_1) = \sum_{i=0}^{t} i\,\frac{P(C_1 \mid i)\,P(i)}{P(C_1)} = \frac{1}{P_1}\sum_{i=0}^{t} i\,H_{n_i} \qquad (8)$$

where P(C1 | i) = 1 and P(i) = Hni. Similarly, the mean intensity of pixels in C2 is given by

$$m_2 = \frac{1}{P_2}\sum_{i=t+1}^{l-1} i\,H_{n_i} \qquad (9)$$

Let mg represent the mean global intensity and mt the mean intensity up to level t. Then, the inter-class variance is derived as in [43]:

$$\sigma_b^2 = P_1\,(m_1 - m_g)^2 + P_2\,(m_2 - m_g)^2 = \frac{(m_g\,P_1 - m_t)^2}{P_1\,(1 - P_1)} \qquad (10)$$

For each t ∈ L, we calculate σb²(t), and the optimal threshold topt for S is given by

$$\sigma_b^2(t_{opt}) = \max_{0 < t < l-1} \sigma_b^2(t) \qquad (11)$$

By applying this method, we successfully obtain a binary image where the salient object is distinguishably highlighted from the background and the noisy image background is eliminated. Now, the tracker is able to correctly locate the required salient object in an analyzed scene, as shown in Fig. 4, and begin tracking in the consecutive frames. The auto-initialization algorithm is given in Algorithm 1.

Algorithm 1 Auto-initialization
Input: first frame (f)
Output: co-ordinates (x, y, width, height) of salient object in f
1: Generate saliency map S using equations (1)-(4)
2: Compute normalized histogram Hn of S
3: Divide into two groups with probabilities P1 and P2 as in equations (6), (7)
4: for threshold level t = 1 to maximum intensity in S do
5:   Compute global intensity mg = P1 m1 + P2 m2 using equations (8), (9)
6:   Compute mean intensity up to level t: mt = sum_{i=0}^{t} i Pi
7:   Compute σb² using equation (10)
8: end for
9: Derive optimal threshold using equation (11)
10: Draw bounding box covering the maximum area of the salient object in the optimally thresholded binary image
11: return salient object coordinates
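The following NumPy sketch (our own simplified illustration, not the authors' implementation) covers the thresholding portion of Algorithm 1. It assumes the saliency map S from equations (1)-(4) is already available as a float array in [0, 1], and for brevity it returns the bounding box of all above-threshold pixels rather than isolating the single largest salient region as step 10 prescribes.

```python
import numpy as np

def auto_initialize(saliency, levels=256):
    """Otsu-style optimal threshold on a saliency map S, following equations (6)-(11).

    saliency: 2-D float array in [0, 1].
    Returns (x, y, width, height) of the thresholded salient region, or None.
    """
    s = np.clip((saliency * (levels - 1)).astype(int), 0, levels - 1)
    hist = np.bincount(s.ravel(), minlength=levels).astype(float)
    hn = hist / hist.sum()                                  # normalized histogram H_n
    i = np.arange(levels)

    best_t, best_var = 1, -1.0
    for t in range(1, levels - 1):
        p1, p2 = hn[: t + 1].sum(), hn[t + 1 :].sum()       # class probabilities P1, P2
        if p1 == 0 or p2 == 0:
            continue
        m1 = (i[: t + 1] * hn[: t + 1]).sum() / p1          # class means m1, m2 (eqs. 8, 9)
        m2 = (i[t + 1 :] * hn[t + 1 :]).sum() / p2
        mg = p1 * m1 + p2 * m2                              # global mean intensity
        var_b = p1 * (m1 - mg) ** 2 + p2 * (m2 - mg) ** 2   # inter-class variance (eq. 10)
        if var_b > best_var:
            best_var, best_t = var_b, t                     # eq. (11): keep the maximizer

    mask = s > best_t                                       # binary image of the salient object
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None                                         # nothing salient was found
    x, y = xs.min(), ys.min()
    return int(x), int(y), int(xs.max() - x + 1), int(ys.max() - y + 1)
```

The box returned by such a routine is what would seed the correlation filter tracker described next.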
C. Object Tracking

This section briefly introduces correlation filters as well as the tracking mechanism, for a better understanding of the proposed technique.

Correlation filter-based trackers use filters trained on previously tracked objects and their immediately surrounding background for tracking the object. Usually, a small test window is selected on the object that needs to be tracked [32]. Thereafter, tracking of the object and training of the filter are performed simultaneously in the consecutive frames. The filter is correlated over a search window in an adjacent frame to obtain a correlation map. The peak value in the correlation map helps to determine the position of the object being tracked in this frame. However, computational efficiency can be significantly increased by performing the correlation in the frequency domain. To perform a correlation in the frequency domain, the Fast Fourier Transform (FFT) of the filter is element-wise multiplied with a two-dimensional FFT of the input image. This is possible because an element-wise multiplication in the frequency domain is equivalent to correlation in the spatial domain.

The correlation G between the FFT of R (denoted by I), where the object needs to be tracked, and the FFT of the filter (denoted by H) is given by

$$G = I \odot H^{*} \qquad (12)$$

where ⊙ is element-wise multiplication and * denotes the complex conjugate. We can use the inverse Fourier transform to bring the correlation output back to the spatial domain.

It can be derived from the properties of circulant matrices that any m × 1 vector v can be diagonalized using the Discrete Fourier Transform (DFT) [44]. Hence,

$$C = F\,\mathrm{diag}(\hat{v})\,F^{H} \qquad (13)$$

where v̂ denotes the DFT of v, v̂ = F(v) = √n F v, and F is a constant matrix independent of v.

Motivated by [5], we use ridge regression along with the kernelized correlation filter to implement the tracking. Our main aim is to find a function f(β) = α^T β that minimizes the squared error between the training samples xi ∈ X and the outputs yi ∈ Y. If we transform our linear object of interest O to a non-linear feature space φ(O) and apply the kernel trick [45] on it, we have

$$\alpha = \sum_{i} \delta_i\,\phi(O_i) \qquad (14)$$

Here, we need to optimize the parameters δi for the least squared error. Representing φ(O) in terms of dot products,

$$\phi^{T}(O)\,\phi(O') = K(O, O') \qquad (15)$$

where ' denotes the cyclic shifts and K is a Gaussian kernel function. Therefore,

$$f(\beta) = \alpha^{T}\beta = \sum_{i=1}^{n} \delta_i\,K(\beta, O_i) \qquad (16)$$

The solution to equation (16), as derived in [46], is

$$\delta = (K + \lambda I)^{-1}\,Y \qquad (17)$$
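A minimal single-channel NumPy sketch of this formulation is given below (our own illustration based on [5], not the authors' implementation): train() evaluates the solution of equation (17) in the Fourier domain, detect() evaluates equation (16) over all cyclic shifts of a search patch, and the returned maximum of the response map plays the role of the confidence value that the full system monitors to decide when to re-detect. Multi-channel HoG features, the cosine window and the model interpolation used in practice are omitted, and the kernel bandwidth and regularization values below are assumptions.

```python
import numpy as np

def gaussian_label(shape, sigma=2.0):
    """Desired correlation output: a 2-D Gaussian peaked at zero shift (top-left, wrapped)."""
    h, w = shape
    ys = np.minimum(np.arange(h), h - np.arange(h))   # wrapped distance to row 0
    xs = np.minimum(np.arange(w), w - np.arange(w))   # wrapped distance to column 0
    d2 = ys[:, None] ** 2 + xs[None, :] ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_correlation(x, z, sigma=0.5):
    """Gaussian kernel correlation between patch x and every cyclic shift of patch z."""
    cross = np.real(np.fft.ifft2(np.conj(np.fft.fft2(x)) * np.fft.fft2(z)))
    d2 = np.maximum((x ** 2).sum() + (z ** 2).sum() - 2.0 * cross, 0.0)
    return np.exp(-d2 / (sigma ** 2 * x.size))

def train(x, y, lam=1e-4):
    """Dual coefficients in the Fourier domain: the fast solution of equation (17)."""
    return np.fft.fft2(y) / (np.fft.fft2(kernel_correlation(x, x)) + lam)

def detect(alpha_f, x, z):
    """Response of equation (16) over every cyclic shift of the search patch z."""
    response = np.real(np.fft.ifft2(np.fft.fft2(kernel_correlation(x, z)) * alpha_f))
    peak = np.unravel_index(np.argmax(response), response.shape)
    return peak, float(response.max())                # peak location and confidence value

# Toy usage: learn a filter on a random patch, then find a circularly shifted copy of it.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
alpha_f = train(x, gaussian_label(x.shape))
z = np.roll(x, (5, 3), axis=(0, 1))                   # target translated by (5, 3) pixels
print(detect(alpha_f, x, z))                          # response peak at the applied shift (5, 3)
```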
Fig. 6. Peak sensitivity curve demonstrating the high peak of sensitivity variance (bad) of KCF versus the low peak of sensitivity variance (good) of our approach in the first 50 frames of the airplane 006 dataset. (Best viewed in color)
Fig. 7. Peak of filter response of KCF in (a) the airplane 011 and (b) the youtube 3 dataset, showing a fall in the peak values of the filter, leading to inaccurate tracking when the object changes its shape, size or illumination.
Fig. 8. OPE and TRE curve demonstrating the average precision rate and the success rate of the proposed and 6 competing trackers over 25 video sequences.
(Best viewed in color)
TABLE I

                               Ours     CT      STC     CN      DSST    SAMF    KCF
Average Precision Rate (TRE)   0.82     0.31    0.47    0.45    0.51    0.45    0.46
Average Success Rate (TRE)     0.76     0.37    0.41    0.42    0.49    0.43    0.46
Average Precision Rate (OPE)   0.77     0.19    0.45    0.44    0.45    0.43    0.42
Average Success Rate (OPE)     0.60     0.27    0.38    0.41    0.43    0.42    0.40
CLE (in pixels)                13       170     47      75      59      79      87
Average Speed (fps)            115.53   29.13   27.21   27.50   4.97    6.54    71.69

TABLE II
Precision rate of the 25 sequences of the proposed and the 6 competing trackers. The best and the second best results are highlighted using bold-face and underline font styles, respectively.

                Ours    CN      CT      DSST    SAMF    STC     KCF
airplane 001    0.92    0.20    0.20    0.26    0.21    0.38    0.12
airplane 004    0.79    0.44    0.25    0.50    0.26    0.42    0.37
airplane 005    0.81    0.33    0.19    0.32    0.21    0.36    0.27
airplane 006    0.92    0.54    0.22    0.54    0.20    0.65    0.53
airplane 007    0.76    0.61    0.18    0.36    0.15    0.46    0.37
airplane 011    0.90    0.43    0.27    0.28    0.80    0.31    0.25
airplane 012    0.74    0.15    0.20    0.88    0.20    0.83    0.81
airplane 013    0.89    0.32    0.20    0.32    0.21    0.26    0.12
airplane 015    0.82    0.73    0.35    0.58    0.82    0.49    0.79
airplane 016    0.83    0.76    0.18    0.73    0.75    0.65    0.45
big 2           0.89    0.82    0.31    0.91    0.84    0.85    0.85
planestv 1      0.86    0.90    0.46    0.85    0.75    0.90    0.84
planestv 2      0.77    0.37    0.37    0.42    0.35    0.36    0.49
planestv 3      0.89    0.33    0.14    0.14    0.30    0.60    0.13
planestv 4      0.55    0.03    0.12    0.08    0.15    0.02    0.10
planestv 5      0.64    0.03    0.09    0.03    0.09    0.16    0.23
planestv 6      0.73    0.47    0.15    0.61    0.70    0.34    0.53
planestv 7      0.75    0.27    0.24    0.49    0.30    0.14    0.38
planestv 8      0.88    0.64    0.18    0.80    0.63    0.66    0.34
planestv 9      0.52    0.04    0.02    0.21    0.20    0.14    0.03
youtube 1       0.89    0.88    0.41    0.09    0.80    0.86    0.87
youtube 2       0.81    0.65    0.40    0.56    0.70    0.66    0.46
youtube 3       0.69    0.07    0.06    0.07    0.08    0.12    0.26
Dog             0.42    0.25    0.31    0.66    0.30    0.29    0.42
Skater          0.57    0.60    0.53    0.59    0.50    0.44    0.57

TABLE III
Success rate of the 25 sequences of the proposed and the 6 competing trackers.

                Ours    CN      CT      DSST    SAMF    STC     KCF
airplane 001    0.78    0.17    0.20    0.27    0.12    0.32    0.12
airplane 004    0.54    0.53    0.49    0.42    0.49    0.47    0.44
airplane 005    0.57    0.20    0.15    0.26    0.21    0.39    0.23
airplane 006    0.54    0.43    0.19    0.45    0.43    0.49    0.43
airplane 007    0.53    0.46    0.13    0.55    0.53    0.47    0.49
airplane 011    0.77    0.34    0.21    0.29    0.73    0.33    0.20
airplane 012    0.47    0.31    0.18    0.59    0.34    0.48    0.70
airplane 013    0.70    0.19    0.24    0.30    0.11    0.27    0.16
airplane 015    0.71    0.65    0.45    0.59    0.62    0.49    0.54
airplane 016    0.73    0.58    0.22    0.56    0.65    0.62    0.58
big 2           0.61    0.58    0.30    0.65    0.57    0.58    0.63
planestv 1      0.26    0.86    0.66    0.80    0.76    0.78    0.78
planestv 2      0.59    0.41    0.38    0.57    0.45    0.30    0.45
planestv 3      0.79    0.47    0.42    0.43    0.42    0.43    0.38
planestv 4      0.76    0.35    0.23    0.27    0.30    0.43    0.31
planestv 5      0.72    0.32    0.36    0.26    0.27    0.15    0.41
planestv 6      0.58    0.48    0.35    0.34    0.32    0.31    0.37
planestv 7      0.63    0.28    0.30    0.49    0.11    0.17    0.41
planestv 8      0.47    0.37    0.24    0.54    0.29    0.46    0.27
planestv 9      0.71    0.37    0.25    0.45    0.41    0.32    0.21
youtube 1       0.44    0.41    0.32    0.18    0.43    0.18    0.55
youtube 2       0.56    0.45    0.15    0.40    0.35    0.25    0.43
youtube 3       0.62    0.24    0.10    0.22    0.22    0.16    0.29
Dog             0.33    0.15    0.15    0.39    0.10    0.24    0.19
Skater          0.51    0.57    0.55    0.55    0.53    0.47    0.54

DSST [38], SAMF [4] and KCF [5] is shown in Table I. It can be observed from the OPE and TRE values that our method outperforms the competing trackers. Similarly, our approach also has the least CLE and real-time speed performance. Experimental results obtained during OPE against the six competing trackers on all the 25 video sequences are also reported: PR and SR are tabulated in Table II and Table III, respectively. The tables clearly demonstrate that our approach is more accurate on almost all of the experimented challenging datasets. Similarly, our approach stands best among the competing trackers by a great margin – in 20 out of 25 sequences in the PR evaluation and 17 out of 25 sequences in the SR evaluation.

Fig. 8 shows the precision as well as success rate plots for OPE as well as TRE, experimented over all the challenging datasets. It is clear from the plots that our approach is significantly better than the other trackers compared. In summary, we can verify that our approach is superior in robustness to the other competing trackers, as observed in the precision rate plot, and our method is also more adaptive to the variations in shape and size of the object being tracked in a given video sequence, as demonstrated by the success rate plot.
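For reference, the quantities reported above can be computed from per-frame bounding boxes as sketched below (our own illustration; the 20-pixel precision threshold and the 0.5 overlap threshold are the customary OTB [37] choices and are assumptions here, as is the (x, y, width, height) box format).

```python
import numpy as np

def center_error(pred, gt):
    """Center location error (CLE) in pixels between two (x, y, w, h) boxes."""
    px, py = pred[0] + pred[2] / 2.0, pred[1] + pred[3] / 2.0
    gx, gy = gt[0] + gt[2] / 2.0, gt[1] + gt[3] / 2.0
    return np.hypot(px - gx, py - gy)

def overlap(pred, gt):
    """Intersection-over-union between two (x, y, w, h) boxes."""
    x1, y1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    x2 = min(pred[0] + pred[2], gt[0] + gt[2])
    y2 = min(pred[1] + pred[3], gt[1] + gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0

def precision_and_success(preds, gts, dist_thr=20.0, iou_thr=0.5):
    """Precision rate: fraction of frames with CLE <= dist_thr.
    Success rate: fraction of frames with overlap >= iou_thr."""
    cle = np.array([center_error(p, g) for p, g in zip(preds, gts)])
    iou = np.array([overlap(p, g) for p, g in zip(preds, gts)])
    return float((cle <= dist_thr).mean()), float((iou >= iou_thr).mean())
```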
D. Speed Comparison

As presented in Table I, our algorithm (implemented in C++) achieves an average speed of 115.53 frames per second (fps), whereas KCF achieved 132.87 fps when implemented in C++ and 71.69 fps when implemented in MATLAB, on the 25 challenging video sequences. One of the major disadvantages of KCF, despite its good speed, is that it fails to track the object in the following frames once it loses track of the object in a given frame, thus making it unreliable for real-time tracking. However, our approach is able to re-detect the object if it loses track, thanks to the proposed re-detection scheme, thus making our algorithm more apt for such a purpose. Moreover, KCF is not adaptive to variations in shape and size of the object being tracked and draws a fixed bounding box around it. In contrast, our approach accurately adapts to such variations in shape and size of the object being tracked and adjusts the bounding box accordingly, thus making our approach more suitable for sense-and-avoid systems. Similarly, compared to CN, STC and CT, our approach is more than three times (3x) faster than their average speeds. Likewise, DSST and SAMF clearly do not fit real-time object tracking due to their very low speed. Hence, this tremendous advantage in speed and long-term tracking ability makes our algorithm more suitable for real-time object tracking than the compared state-of-the-art trackers.
Fig. 9. Tracking results of our approach and the output of 6 competing trackers in the representative frames: (a) scale variation (youtube dataset 3), (b) partial occlusion (airplane 005), (c) axial rotation (planestv 4), (d) planar rotation (big 2), (e) and (f) illumination variation (airplane 001 and airplane 006), (g) and (h) camera instability (airplane 001 and airplane 012), (i) and (j) roll, pitch and yaw (planestv 6 and planestv 9). (Best viewed in color)
E. Qualitative Evaluation

In this subsection, we present the qualitative comparisons of our approach against the 6 competing trackers. In Fig. 9 (top row), we can observe that our tracker is extremely adaptive to the change in scale as well as during partial occlusion. It is experimentally found that, though all the competing trackers produce acceptable outputs in the first few frames, as the object changes its shape (frame #1148, frame #1222 and #1317 in youtube dataset 3) or undergoes partial occlusion (frame #91, frame #109 and frame #121 in airplane 005), some of the trackers fail. For instance, CT, SAMF, KCF and STC are unable to take the scale variation into account. However, our method quickly adjusts to the changing appearance and size of the object. Similarly, only STC and our method are found to be invariant to partial occlusion. Almost all the other trackers fail in this situation.

In Fig. 9 (second row), the trackers are tested on the basis of axial rotation (frame #41, frame #182, frame #270 and frame #288 in planestv 4) and planar rotation (frame #87, frame #203, frame #325 and frame #372 in big 2) dynamics. It can be seen that CT is not apt for either axial or planar rotations. Though KCF is able to perform well (except for scale changes) during axial rotation, it fails against planar rotations. Despite such complicated rotation dynamics, our method stands out among all the other competing trackers, thus proving it to be more robust.

Similarly, in Fig. 9 (middle row), we depict the evaluation of all the trackers under illumination variation (frame #3, frame #35, frame #75 and frame #127 in airplane 001 as well as frame #4, frame #51, frame #87 and frame #177 in airplane 006). It is clear from the figure that CT, CN and SAMF are incapable of keeping track of the object under such constraints. However, our method is not affected by such illumination variation and keeps good track of the object.
Fig. 10. Frames showing how our tracker narrows down to the prime object in a scene once the multiple objects are at a suitable distance apart or one of the objects moves out of the scene.

The fourth row in Fig. 9 demonstrates the competing trackers' performance under camera instability (frame #190 - #193 in airplane 001 and frame #37 - #40 in airplane 012). It can be clearly observed that almost all of the trackers, except ours, fail to correctly track the object when there is a significant jerk in the camera pose or sudden perturbations.

Moreover, the last row in Fig. 9 demonstrates the experimentation of the trackers on roll-pitch-yaw motion (frame #109, frame #122, frame #145 and frame #167 in planestv 6 and frame #123, frame #147, frame #235 and frame #310 in planestv 9). It is evident that almost all of the trackers fail to keep track of the larger objects in such situations. However, our tracker is able to keep track of both the shape and pose of the object (small or large) being tracked.

F. Limitations

From several experimentations, we found that our approach, though fast and robust, has some limitations. Since our algorithm is designed to auto-initialize and utilizes the salient object detection scheme whenever necessary throughout the tracking process, it starts tracking all the objects in the presence of multiple objects in a scene (the bounding box comprises the maximum area occupied by the objects in a given scene), as shown in Fig. 10. However, our approach effectively adapts its bounding box and narrows down to one object once the prime object to be tracked is segregated in a given scene, as demonstrated in Fig. 10. It is also important to note that the above-mentioned limitation has an insignificant effect in sense-and-avoid UAVs, as a single bounding box around multiple objects may suggest the detection of several obstacles within the region, which need to be avoided during the UAV's trajectory. The other limitation of our approach is its inability to auto-initialize accurately given a too complex background. For instance, in Fig. 11, the tracker's bounding box comprises the background along with the object to be tracked. This limitation, however, is only observed during the initialization phase. Once the tracker starts learning, the object is tracked with significantly greater accuracy in the consecutive frames, as shown in Fig. 11.

V. CONCLUSION

It is vital for an intelligent autonomous UAV to have an automatic, robust and real-time object tracking system built in it. Therefore, in this paper, we have proposed a tracking method that incorporates variations in shape, size and illumination, as well as degenerate conditions like partial occlusion, planar/axial rotation and camera instability, for better performance than the existing state-of-the-art trackers. Most of the up-to-date trackers were found to fail in one or several such complex scenarios. However, our tracker is able to keep track of the object without any abrupt failures. Both qualitative and quantitative evaluation measures demonstrate that the proposed approach is more efficient than the competing trackers. Unlike other trackers, the proposed tracker is able to auto-initialize without any manual interference. Hence, our approach is found to be accurate and fast for real-time autonomous sense-and-avoid UAVs, drones or similar flying units. Nevertheless, some of the experiments show that our method may not perform as expected in the presence of several dubious salient objects in a scene or when the object is too tiny to detect and auto-initialize the tracker. Source code and the dataset for the proposed approach can be downloaded for research purposes from the author's web page (http://www.ittc.ku.edu/cviu/tracking.html).

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their constructive comments. This work was supported in part by the National Aeronautics and Space Administration LEARN II Program under Grant NNX15AN94N, in part by the General Research Fund of the University of Kansas under Grant 2228901, and in part by the Kansas NASA EPSCoR Program under Grant KNEP-PDG-10-2017-KU. We would also like to thank Mr. Arjan Gupta and Miss Nina Wang for helping us label the test data.

REFERENCES

[1] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Computing Surveys (CSUR), vol. 38, no. 4, p. 13, 2006.
[2] Z. Kim, "Robust lane detection and tracking in challenging scenarios," IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 1, pp. 16–26, March 2008.
[3] M. Shan, S. Worrall, and E. Nebot, "Probabilistic long-term vehicle motion prediction and tracking in large environments," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 539–552, June 2013.
[4] Y. Li and J. Zhu, "A scale adaptive kernel correlation filter tracker with feature integration," in European Conference on Computer Vision. Springer, 2014, pp. 254–265.
[5] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583–596, 2015.
[6] L. Itti and C. Koch, "Computational modelling of visual attention," Nature Reviews Neuroscience, vol. 2, no. 3, pp. 194–203, 2001.
[7] H. Cholakkal, D. Rajan, and J. Johnson, "Top-down saliency with locality-constrained contextual sparse coding," in BMVC, 2015.
[8] J. Yang and M.-H. Yang, "Top-down visual saliency via joint crf and dictionary learning," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2296–2303.
[9] Y. Wei, F. Wen, W. Zhu, and J. Sun, "Geodesic saliency using background priors," in European Conference on Computer Vision. Springer, 2012, pp. 29–42.
[10] J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and R. Mech, "Minimum barrier salient object detection at 80 fps," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1404–1412.
[11] K. C. Ciesielski, R. Strand, F. Malmberg, and P. K. Saha, "Efficient algorithm for finding the exact minimum barrier distance," Computer Vision and Image Understanding, vol. 123, pp. 53–64, 2014.
[12] R. Strand, K. C. Ciesielski, F. Malmberg, and P. K. Saha, "The minimum barrier distance," Computer Vision and Image Understanding, vol. 117, no. 4, pp. 429–437, 2013.
[13] S. Avidan, "Ensemble tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 2, pp. 261–271, 2007.
[14] B. Babenko, M.-H. Yang, and S. Belongie, "Visual tracking with online multiple instance learning," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 983–990.
[15] K. Zhang, L. Zhang, and M.-H. Yang, "Real-time compressive tracking," in European Conference on Computer Vision. Springer, 2012, pp. 864–877.
[16] S. Hare, S. Golodetz, A. Saffari, V. Vineet, M.-M. Cheng, S. Hicks, and P. Torr, "Struck: Structured output tracking with kernels," 2015.
[17] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1409–1422, 2012.
[18] A. Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes," Advances in Neural Information Processing Systems, vol. 14, p. 841, 2002.
[19] Y. Wu, Y. Sui, and G. Wang, "Vision-based real-time aerial object localization and tracking for uav sensing system," IEEE Access, vol. 5, pp. 23969–23978, 2017.
[20] C. Ma, X. Yang, C. Zhang, and M.-H. Yang, "Long-term correlation tracking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5388–5396.
[21] S. P. Bharati, S. Nandi, Y. Wu, Y. Sui, and G. Wang, "Fast and robust object tracking with adaptive detection," in 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI), Nov 2016, pp. 706–713.
[22] M. Andriluka, S. Roth, and B. Schiele, "People-tracking-by-detection and people-detection-by-tracking," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.
[23] V. Mahadevan and N. Vasconcelos, "Saliency-based discriminant tracking," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 1007–1013.
[24] A. Nussberger, H. Grabner, and L. Van Gool, "Robust aerial object tracking in high dynamic flight maneuvers," ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 2, no. 1, p. 1, 2015.
[25] A. Kocak, K. Cizmeciler, A. Erdem, and E. Erdem, "Top down saliency estimation via superpixel-based discriminative dictionaries," in BMVC, 2014.
[26] S. He, R. W. Lau, W. Liu, Z. Huang, and Q. Yang, "Supercnn: A superpixelwise convolutional neural network for salient object detection," International Journal of Computer Vision, vol. 115, no. 3, pp. 330–344, 2015.
[27] P. Wang, J. Wang, G. Zeng, J. Feng, H. Zha, and S. Li, "Salient object detection for searched web images via global saliency," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3194–3201.
[28] R. Zhao, W. Ouyang, H. Li, and X. Wang, "Saliency detection by multi-context deep learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1265–1274.
[29] W. Zhu, S. Liang, Y. Wei, and J. Sun, "Saliency optimization from robust background detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2814–2821.
[30] J. Feng, Y. Wei, L. Tao, C. Zhang, and J. Sun, "Salient object detection by composition," in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 1028–1035.
[31] P. Siva, C. Russell, T. Xiang, and L. Agapito, "Looking beyond the image: Unsupervised learning for object saliency and detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3238–3245.
[32] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, "Visual object tracking using adaptive correlation filters," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2544–2550.
[33] Y. Sui, G. Wang, and L. Zhang, "Correlation filter learning toward peak strength for visual tracking," IEEE Transactions on Cybernetics, vol. PP, no. 99, pp. 1–14, 2017.
[34] Y. Sui, Z. Zhang, G. Wang, Y. Tang, and L. Zhang, "Real-time visual tracking: Promoting the robustness of correlation filter learning," in European Conference on Computer Vision. Springer, 2016, pp. 662–678.
[35] Y. Sui, Y. Tang, L. Zhang, and G. Wang, "Visual tracking via subspace learning: A discriminative approach," International Journal of Computer Vision, pp. 1–22, 2017.
[36] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "Exploiting the circulant structure of tracking-by-detection with kernels," in European Conference on Computer Vision. Springer, 2012, pp. 702–715.
[37] Y. Wu, J. Lim, and M.-H. Yang, "Online object tracking: A benchmark," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2411–2418.
[38] M. Danelljan, G. Häger, F. Khan, and M. Felsberg, "Accurate scale estimation for robust visual tracking," in British Machine Vision Conference, Nottingham, September 1-5, 2014. BMVA Press, 2014.
[39] T. Liu, G. Wang, and Q. Yang, "Real-time part-based visual tracking via adaptive correlation filters," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4902–4912.
[40] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[41] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang, "Hierarchical convolutional features for visual tracking," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3074–3082.
[42] H. Li, Y. Li, and F. Porikli, "Deeptrack: Learning discriminative feature representations online for robust visual tracking," IEEE Transactions on Image Processing, vol. 25, no. 4, pp. 1834–1848, 2016.
[43] N. Otsu, "A threshold selection method from gray-level histograms," Automatica, vol. 11, no. 285-296, pp. 23–27, 1975.
[44] R. M. Gray, Toeplitz and Circulant Matrices: A Review. Now Publishers Inc, 2006.
[45] B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[46] R. Rifkin, G. Yeo, and T. Poggio, "Regularized least-squares classification," Nato Science Series Sub Series III Computer and Systems Sciences, vol. 190, pp. 131–154, 2003.
[47] A. Li, M. Lin, Y. Wu, M. Yang, and S. Yan, "NUS-PRO: A new visual tracking challenge," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 335–349, 2016.
[48] K. Zhang, L. Zhang, and M.-H. Yang, "Real-time compressive tracking," in European Conference on Computer Vision. Springer, 2012, pp. 864–877.
[49] K. Zhang, L. Zhang, Q. Liu, D. Zhang, and M.-H. Yang, "Fast visual tracking via dense spatio-temporal context learning," in European Conference on Computer Vision. Springer, 2014, pp. 127–141.
[50] M. Danelljan, F. Shahbaz Khan, M. Felsberg, and J. Van de Weijer, "Adaptive color attributes for real-time visual tracking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1090–1097.
Sushil Pratap Bharati earned his Bachelors degree from Motilal Nehru National Institute of Technology, Allahabad, India. He is currently pursuing his Masters degree at the University of Kansas. His research interests include real-time object detection and tracking, 3-D reconstruction and modeling, pattern recognition, autonomous robotics and broad applications of computer vision and deep learning.

Yuanwei Wu received his Masters degree from Tufts University. He is currently a PhD candidate at the University of Kansas. His research interests are focused on broad applications in deep learning and computer vision, in particular object detection, localization and visual tracking.

Guanghui Wang (M'10, SM'17) received his PhD in computer vision from the University of Waterloo, Canada, in 2014. He is currently an assistant professor at the University of Kansas, USA. He is also with the Institute of Automation, Chinese Academy of Sciences, China, as an adjunct professor.

From 2003 to 2005, he was a research fellow and visiting scholar with the Department of Electronic Engineering at the Chinese University of Hong Kong. From 2005 to 2006, he acted as a professor at the Department of Control Engineering in Changchun Aviation University, China. From 2006 to 2010, he was a research fellow with the Department of Electrical and Computer Engineering, University of Windsor, Canada. He has authored one book, Guide to Three Dimensional Structure and Motion Factorization, published by Springer-Verlag. He has published over 90 papers in peer-reviewed journals and conferences. His research interests include computer vision, structure from motion, object detection and tracking, artificial intelligence, and robot localization and navigation. Dr. Wang has served as associate editor and on the editorial board of two journals, as an area chair or TPC member of 20+ conferences, and as a reviewer of 20+ journals.