0% found this document useful (0 votes)
118 views13 pages

Real-Time Obstacle Detection and Tracking For Sense-And-Avoid Mechanism in Uavs

This document proposes a real-time obstacle detection and tracking system for unmanned aerial vehicles (UAVs). It introduces an adaptive obstacle detection strategy that integrates a kernelized correlation filter framework with salient object detection. The approach refines the location and boundary of detected objects when tracker confidence drops. It was tested on challenging datasets and demonstrated improved tracking speed and accuracy over state-of-the-art methods. The system aims to enable autonomous navigation of UAVs with obstacle avoidance capabilities.

Uploaded by

RazvanHarag
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
118 views13 pages

Real-Time Obstacle Detection and Tracking For Sense-And-Avoid Mechanism in Uavs

This document proposes a real-time obstacle detection and tracking system for unmanned aerial vehicles (UAVs). It introduces an adaptive obstacle detection strategy that integrates a kernelized correlation filter framework with salient object detection. The approach refines the location and boundary of detected objects when tracker confidence drops. It was tested on challenging datasets and demonstrated improved tracking speed and accuracy over state-of-the-art methods. The system aims to enable autonomous navigation of UAVs with obstacle avoidance capabilities.

Uploaded by

RazvanHarag
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIV.2018.2804166, IEEE
Transactions on Intelligent Vehicles
1

Real-time Obstacle Detection and Tracking for


Sense-and-Avoid Mechanism in UAVs
Sushil Pratap Bharati, Yuanwei Wu, Yao Sui, Curtis Padgett, Guanghui Wang, Senior Member, IEEE

Abstract—Obstacle detection and tracking is an important


research topic in computer vision with a number of practical
applications. Though an ample amount of research has been
done in this domain, implementing automatic obstacle detection
and tracking in real-time is still a big challenge. To address
this issue, we propose a fast and robust obstacle detection and
tracking approach by integrating an adaptive obstacle detection
strategy within a kernelized correlation filter (KCF) framework
in this paper. A suitable salient object detection method auto-
initializes the KCF tracker for this purpose. Moreover, an
adaptive obstacle detection strategy is proposed to refine the
location and boundary of the object when the confidence value
of the tracker drops below a predefined threshold. In addition, a
reliable post-processing technique is implemented to accurately
localize the obstacle from a saliency map recovered from the
search region. The proposed approach has been extensively tested
through quantitative and qualitative evaluations on a number Fig. 1. Frames demonstrating our proposed approach in action. Our algorithm
of challenging datasets. The experiments demonstrate that the quickly adjusts to variations in shape, size and illumination of the object. Color
proposed approach significantly outperforms the state-of-the-art code: ’red’ marks our approach, ’yellow’ marks SAMF [4] and ’green’ marks
methods in terms of tracking speed and accuracy. KCF [5]. (Best viewed in color)

Index Terms—Object tracking, detection, correlation filter,


salient object.
collision avoidance and autonomous navigation is essential.
Recognizing such possible threats in a real-time and embed-
ding such computationally sound algorithm in flying UAVs
demands an ample amount of research and engineering.
I. I NTRODUCTION
Among all recent advancements in technology, vision-based
Automating visual detection and tracking of moving objects sense and avoid system is becoming a more popular choice
by intelligent autonomous systems, such as unmanned aerial since cameras are light-weight and low-cost as well as they
vehicles (UAVs), has been an active research topic for the past provide richer information of the surrounding than other
decades in computer vision. The research has diverse applica- available sensors, thus appropriate for UAVs with limited
tions extending from military, surveillance, security systems, payload capacity. A successful sense and avoid system should
aerial photography, search and rescue, object recognition, be able to automatically detect a possible obstacle which
auto-navigation to human-machine interactions [1]. Recently, may be present in the path of the flying UAVs and track
computer vision is being extensively used in roadside vehicle it in order to prevent a possible collision. In this paper, we
positioning and tracking as well as in intelligent transportation propose a suitable vision-based approach that is able to assist
systems for the vehicle as well as the passengers’ safety [2], in autonomous navigation of UAVs with a forward looking
[3]. Due to its emerging multidisciplinary usage, a handsome camera. The proposed algorithm automatically localizes and
number of companies are developing their own UAV systems, tracks the obstacle in real-time. Thus, our method provides
such as Google’s Project Wing, Amazon Prime Air and DHL’s a practical solution for the vision-based sense and avoid
parcelcopter. problems in UAVs.
However, designing such intelligent UAVs is pragmatically Provided a scenery or a landscape, human eyes tend to first
challenging. It is vital to keep track of other UAVs, birds, notice the characteristic features they could sense from the
airplanes or other possible flying objects during the flight of entire view [6]. These characteristic features, which help the
an autonomous UAV. Hence, identifying such potential obsta- human brain distinguish between a particular object and its
cles precisely and localizing them in real-time for successful background could be the basis for a successful sense and avoid
S. P. Bharati, Y. Wu, Y. Sui and G. Wang are with the Department of Elec- algorithm that segregates the object from its background.
trical Engineering and Computer Science, University of Kansas, 1520 West On the other hand, given an initially detected position of
15th Street, Lawrence, KS 66045. Email: {sushil bharati, ghwang}@ku.edu an object in the initial frame, an intelligent system should
C. Padgett is with the Maritime and Aerial Perception Systems Group, Jet
Propulsion Laboratory, 4800 Oak Grove Drive Pasadena, CA 91109. Email: correctly localize the position of the moving object throughout
Curtis.W.Padgett@jpl.nasa.gov the sequence. However, most of the previous works focused

2379-8858 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIV.2018.2804166, IEEE
Transactions on Intelligent Vehicles
2

only on object detection or object tracking, rather than creating automated system. Since manual labeling is required in such
an intelligent system capable of simultaneous detection and trackers, they are not suitable for fully autonomous UAVs.
tracking in real-time. In this paper, we propose a novel and Moreover, such trackers are not fit for long-term tracking as
intelligent vision-based system that can automatically detect, discussed before. Similarly, most of the trackers assign a fixed
localize, and track the objects in high speed. Extensive ex- size bounding box only to track a part of the moving object
periments demonstrate that the proposed approach stands out in a scene, although the object changes its shape and size
among all the state-of-the-art detectors and trackers in terms throughout a sequence. Such approaches are inappropriate for
of speed and precision. estimating the shape and size of the obstacle and therefore
Mathematically, a saliency map can be understood as a rectifying the path of UAVs to avoid possible collisions with
probability map that expresses the probability of salient pixels the obstacle. Some of the tracking by detection methods aim to
in terms of intensity relative to the entire image [7], [8]. Wei et provide a changing bounding box according to object’s shape
al. [9] claimed that a major portion of the image is occupied by and size but they perform slower and thus inapt for real-time
the image background and is homogeneous. Hence, the image implementation.
boundary can be easily connected - connectivity prior. They In this paper, we propose a fast, reliable and accurate
also assumed that objects are generally absent on the image object localization and tracking approach for the autonomous
boundaries so that we can presume these boundaries to be the navigation of the flying UAVs by integrating the techniques
background as well i.e. background prior. Zhang et al. [10] for salient object detection [10] with the kernelized correlation
successfully used minimum barrier distance [11], [12] along filter [5]. Our approach achieves better detection and tracking
with a raster scanning algorithm utilizing both connectivity results compared to the state-of-the-art method in terms of
prior and background prior to generate saliency map in their speed and accuracy as demonstrated in our experimentation
work. section. Fig. 1 shows the result of the proposed method along
Classical tracking approaches can be categorized as gen- with SAMF [4] and KCF [5]. It can be clearly observed
erative and discriminative models. In the generative trackers that, although the appearance of the flying object undergoes
[7], [8], [13]–[15], we can represent the targets as a set of deformations, illumination or scale variations, the proposed
basis vectors in a subspace and the trackers search for regions method accurately confines the concerned object compared to
similar to previously tracked targets while the discriminative other peer trackers. The main contributions of this paper are
trackers [5], [16], [17] uses binary classification to differentiate listed below:
the background with the desired target. It has been mathe- • The proposed approach correctly localizes and generates
matically proven that the asymptotic error of a discriminative an adaptive bounding box in real-time despite varying
model is lower than that of a generative model [18]. shape and size of the object throughout the sequence.
Tracking by detection approaches [15], [17], [19] provide • Our approach, by integrating the detection and tracking
a new concept for detection and tracking, however, such strategies together and forming a closed loop system,
approaches suffer from the well-known stability-plasticity achieves long-term error-free tracking.
dilemma [20], where the drifting of an object in the later • The proposed approach, by training the filter from previ-
frames cannot be rectified since the classifier cannot be trained ous frames, tracks the object in subsequent frames with-
with stable samples, like that of the first frame. Thus, these out the need of any computationally expensive supervised
approaches barely identify the noisy images with occlusion. training for the detection.
Henriques et al. [5] harnessed the circulant structure of the • The proposed system is fully automated, accurate and
samples in the tracking problem with an aid of a kernelized has superior real-time speed without requiring any sort
correlation filter (KCF). Since this method is computationally of manual intervention.
inexpensive as it transforms the correlation operation in the
spatial domain to the frequency domain by exploiting the
II. R ELATED W ORK
circulant structure and Parseval’s identity and yielding only
O(nlogn) complexity. It is based on a common principle An object tracking by detection approach was proposed in
that circular matrices (used in an algorithm for kernel ridge [22]. However, since their detector needs training with a large
regression) that performs correlation in the spatial domain number of data samples, auto-initialization is not feasible in
is equal to performing element-wise multiplication in the their approach. Optical flow motion cues were leveraged in
frequency domain according to the Fourier Transformation [23] to design a tracker combined with a detection scheme
principles. However, experiments show that the algorithms based on a saliency map with an auto-initialization in the
using correlation filtering fail to track an obstacle for a larger first frame. However, such trackers are computationally too
period of time. This problem was successfully solved by expensive for real-time applications. Multiple cameras in an
Bharati et al. in [21] by providing a feedback to the tracker aircraft that measure the altitude and act as a sense and
about the current state of the object being tracked and using an avoid collision sensor were used in [24] but their method
adaptive detection scheme to re-detect the object in the cases is inefficient without the use of GPS data (which may have
of failure. delays), thus not feasible for real-time.
In addition, it is required to manually initialize most of Previous research on saliency map-based object detection
the generative and discriminative trackers with the position can be broadly classified into two methods — top-down and
of the target in the first frame making them an incomplete bottom-up. In the top-down methods [7], [8], [25], detection

2379-8858 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIV.2018.2804166, IEEE
Transactions on Intelligent Vehicles
3

Fig. 2. Illustration of our approach: (a) input frames, (b) correlation filter and tracking, (c) (from left to right) frame to redetect the salient object, saliency
map generated, thresholded binary image using our post processing technique and new bounding box detected on the object.

is executed on the reduced search space since all the possible meeting high tracking speed in benchmark datasets [37].
objects in an image are localized. But these methods are Henriques et al. integrated Gaussian and polynomial kernels
unrealistic for real-time object detection because they are together with multi-channel HoG features in [5] to achieve
mostly task-driven and accompanied by supervised learning. higher accuracy and speed than most of the state-of-the-art
On the other hand, bottom-up methods [9], [10], [26]–[28] discriminative and generative trackers. However, their method
compare the feature contrast of the salient region with the suffered from its inability to deal with scale variations because
background contrast by using the low-level features (like the of the fixed template size. Li et al. [4] tried to solve this
color, contrast, shape, texture, gradient and spatio-temporal problem by combining adaptive templates and HoG features in
features) from an image. Such methods have higher possi- SAMF tracker. To adapt with the changing size and appearance
bilities to fail in the case of complex images as they do of the object, Danelljan et al. [38] used HoG features in
not have prior knowledge of the localization of the object or a multiscale correlation filter in DSST tracker. However, all
the number of objects present in an image. In contrast, the these trackers are prone to mishandling the cases of occlusion
top-down methods require proper training before detection. and camera instability throughout the sequence.
However, our approach identifies the approximate location A part-based tracking algorithm using a correlation filter
of the object from the previous tracking results and then was proposed in [39] to deal with the occlusion. Only some
performs the re-detection on a much smaller search region. part of the object is visible during a partial occlusion and part-
Hence, our method is computationally efficient since it does based tracker exploits this feature to successfully handle the
not require any type of supervised training for the detection. partial occlusion. However, such algorithms fail if the object
Additionally, the reduced search region enhances qualitative undergo complete occlusion (becomes invisible) between cer-
efficiency during detection. tain consecutive frames. Correlation between temporal con-
Object detection in [9], [29] used a geodesic saliency texts was used in [20] to estimate the translation and scale
map by looking at the contrast of an image and calculating change of the objects. This approach also used a re-detection
the distance of each pixel from the background seeds to scheme by training a fern classifier to handle tracking failures
segment a region in the image. A supervised regression-based for long-term tracking. However, the proposed approach made
segmentation approach in [27] used binary classifier but was their trackers run slower. Some other detectors [26], [28], [40]
limited to detecting single objects in an image. Instead of and trackers [41], [42] rely on deep learning techniques to
scanning an image with sliding windows, a ranked list of improve the accuracy of the trackers and thus require large-
innumerable proposal windows in an image was proposed in scale training database making them slower and unsuitable for
[30], [31]. Such methods improved the recall rate but failed real-time applications.
to correctly localize an object in a given scene. A minimum
barrier saliency map was generated using a raster scanning III. P ROPOSED A PPROACH
method in [10] which performed better than the geodesic In this section, we describe the details of the proposed
saliency map. However, this method used the entire image strategy for fast and robust object detection and tracking. A
for generating saliency map which was exploited to look only flowchart of the proposed technique is shown in Fig. 2.
around the region where the object has more probability to be First, a saliency map S of an entire image is generated to
found compared to its previous location. segment the salient object out from the background and auto-
Since most of the previous work required supervised train- initialize the tracker with the current location of the salient
ing, correlation filters, though adept at object tracking, seemed object for tracking in the consecutive frame. In this process, we
inappropriate for real-time object tracking. The minimum generate a saliency map, post-process the generated saliency
output sum of a squared error (MOSSE) filter [32] and its map using the proposed post-processing technique to segment
derivatives [33]–[35] was found to be computationally efficient the salient object and feed the location of the salient object
for real-time object tracking as correlation filter was trained on to initialize the tracker. Next, the filter starts training itself
gray-scale images in this approach. Subsequently, an ample of on the salient object on each frame while tracking of the
research has been done in the correlation filter-based tracking. object runs simultaneously until a low peak of filter response
As a result, MOSSE filter was improved in [36] by introducing (confidence value) is observed. Confidence value measures the
a kernel-based correlation filter trained on gray-scale images resemblance of the object in the consecutive frame compared

2379-8858 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIV.2018.2804166, IEEE
Transactions on Intelligent Vehicles
4

where ΠB,pi,j denotes a set of all possible paths connecting


elements in B with pi,j . In [9], the geodesic distance is used
for D, but in [10], [12] a formula more robust to noise is
proposed which is found to be the most appropriate approach
in our paper as well. The cost function is given as
n n
CI (P) = max I (P (i)) − min I (P (i)) (2)
i=0 i=0

Each pixel pi,j is visited during the raster scan as well


as the inverse raster scan (Fig. 3). During the raster scan,
we update pixel values of two adjacent neighbors pi,j−1 and
pi−1,j , whereas in the inverse raster scan, the values of pi,j+1
Fig. 3. Updation method for R (a) Raster scanning. (b) Inverse raster and pi+1,j are updated. The updates take place by
scanning.
M (pi,j ) = min{D (x) , CI Z p0i,j . p0i,j , pi,j } (3)



where Z p0i,j . p0i,j , pi,j is an appended path from p0i,j to




to the previous frame where the object was being tracked. pi,j with currently assigned

path for p0i,j , i.e., Z p0i,j .
Once such low confidence value is observed for the tracker, our 0 0
Let us denote Z pi,j . pi,j , pi,j with Zi,j .
proposed adaptive detection approach is applied to re-detect
CI (Zi,j ) = max{H p0i,j , I (pi,j )}−min{L p0i,j , I (pi,j )}
 
the object. The re-detection scheme is important because it
helps to increase the confidence value of the tracker to track (4)
where H p0i,j and L p0i,j are the highest and the lowest
 
in the later frames by re-estimating the accurate position and
the size of the object being tracked. To do so, we determine pixel values on Z p0i,j , respectively. Each iteration of the
an adaptive search region R based on the confidence value; raster/inverse raster scan updates H and L if the path assign-
the area of R is progressively increased to re-detect the object ment is found to be changed. The final outcome is a saliency
being tracked as the confidence value drops lower. This re- map S for the entire image, where a certain post processing
detection scheme is much similar to the detection process as needs to be done to obtain a binary image for the final object
performed in the first frame. A slight variation is that we detection and successfully initialize the tracker. An example
generate S only for search region R instead of an entire frame of a saliency map is shown in Fig. 4.
and update the tracker accordingly for a smooth training of the
KCF filter throughout the tracking process. B. Post Processing
Post processing helps to enhance the quality of the saliency
map S. For successful tracking, it is vital to obtain a binary
A. Automatic Salient Object Detection image from which we can segment a foreground salient object
This section best explains how our tracker is auto-initialized from its background. It is inefficient to apply a direct threshold
in the first frame of any given sequence. Most of the trackers to S because of the presence of different levels of noise
need to be provided with a ground truth of the initial frame content, relative size of objects and background, illuminance
to let know the whereabouts of the concerned object and and reflectance in different frames. Such wide variations
further to continue tracking the object in the consecutive necessitate to choose an adaptive threshold technique that suc-
frames, thus requiring manual initialization. However, our cessfully handles these subtle entities. Hence, in our approach,
tracker independently initializes from the very first frame and we rely on inter-class variance maximization [43] to find an
continues smooth tracking throughout the sequence. optimal value for the global threshold and obtain a binary
To initialize our tracker in the first frame, salient object image.
detection algorithm is run on an entire image as we are Let H be the histogram of S that contains pixels of intensity
unaware where the salient object to track is initially located. levels L ∈ [0, l − 1] and ηi be the numbers of pixels of
Inspired by the Minimum Barrier Distance (MBD) transform, intensity i where i ∈ L. Then,
we formulate the salient object detection as finding the shortest l−1
X
distance from pixel pi,j to the set of pixels B along the image H= ηi (5)
boundary. For simplicity, we consider a single-channel digital i=0
image I. In this paper, we consider 4-adjacent neighboring Let Hn be the normalized histogram, for every threshold value
pixels to calculate the distance of pi,j to B. For instance, t, t ∈ L, we define two classes C1 ∈ Hni , i ∈ [0, t] and
the neighbors of pi,j are pi−1,j , pi+1,j , pi,j−1 and pi,j+1 . A C2 ∈ Hni , i ∈ [t + 1, l − 1], such that
path P = < P(0), P(1), ..., P(k) > on I is a sequence of
t
pixels where consecutive pairs of pixels are adjacent. Given a X
P1 = P (C1 ) = Hni , (6)
distance cost function D, the distance map M for each pixel
i=0
in I is best defined as
l−1
X
M(pi,j ) = min D (δ) (1) P2 = P (C2 ) = Hni = 1 − P1 (7)
δ∈ΠB,pi,j i=t+1

2379-8858 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIV.2018.2804166, IEEE
Transactions on Intelligent Vehicles
5

C. Object Tracking
This section aims to introduce the correlation filters briefly
as well as the tracking mechanism for further understanding
of the proposed technique.
Correlation filter-based trackers use filters trained on pre-
Fig. 4. From left to right: Original frame, generated saliency map, binary viously tracked objects and their immediately surrounding
image and the object boundary using our object detection technique. background for tracking the object. Usually, a small test
window is selected on the object that needs to be tracked
[32]. Thereafter, tracking of an object and training of the
The mean intensity of pixels in C1 is given by filter is performed simultaneously in the consecutive frames.
t t t The filter is correlated over a search window in an adjacent
X X P (C1 /i).P (i) 1 X frame to obtain a correlation map. The peak value in the
m1 = i.P (i/C1 ) = i. = i.Hni
i=0 i=0
P (Ci ) P1 i=0 correlation map helps to determine the position of the object
(8) being tracked in this frame. However, computational efficiency
can be significantly increased by performing the correlation
where P(C1 /i) = 1, P(i) = Hni . Similarly mean intensity of
in the frequency domain. To perform a correlation in the
pixels in C2 is given by
frequency domain, Fast Fourier Transform (FFT) of the filter
l−1 is element-wise multiplied with a two-dimensional FFT of
1 X
m2 = i.Hni (9) the input image. This is possible because an element-wise
P2 i=t+1
multiplication in the frequency domain is equivalent to the
Let mg represent mean global intensity and mt represent mean correlation in the spatial domain.
intensity upto t level. Then, the inter-class variance is derived The correlation G between the FFT of R (denoted by I)
as in [43]. where the object needs to be tracked and the FFT of the filter
mg .P1 − mt (denoted by H) is given by
σb2 = P1 .(m1 − mg )2 + P2 .(m2 − mg )2 =
P1 .(1 − P1 ) G = I H∗ (12)
(10) ∗
where is element-wise multiplication and denotes complex
For each t ∈ L, we calculate σb2 (t) and optimal threshold topt conjugate. We can use inverse Fourier transform to transform
for S is given by the correlation output back to the spatial domain.
It can be derived from the properties of the circulant
σb2 (topt ) = max σb2 (t) (11)
0<t<l−1 matrices that any m × 1 vector v can be diagonalized using
By applying this method, we successfully obtain a binary Discrete Fourier Transform (DFT) [44].Hence,
image where the salient object is distinguishably highlighted C = F.diag(v̂).F H (13)
from the background and the noisy image background is √
eliminated. Now, the tracker is able to correctly locate the where v̂ denotes the DFT of v, v̂ = F(v), F = nF v and
required salient object in an analyzed scene as shown in Fig. F is a constant matrix independent of v.
4 and begin tracking in the consecutive frames. The auto- Motivated by [5], we use ridge regression along with
initialization algorithm is given in Algorithm 1. kernelized correlation filter to implement the tracking. Our
main aim is to find a function f (β) = αT β to minimize the
Algorithm 1 Auto-initialization squared error between the training samples xi ∈ X and the
Input: first frame (f) output yi ∈ Y. If we transform our linear object of interest O
Output: co-ordinates (x,y,width,height) of salient object in f to a non-linear feature space φ(O) and apply a kernel trick
1: Generate saliency map S using equation (1)-(4) [45] on it, we have,
2: Compute normalized histogram Hn of S
X
α= δi .φ(O) (14)
3: Divide into two groups with probabilities P1 and P2 as
i
in equations (6), (7)
Here, we need to optimize parameter δi for the least squared
4: for threshold level t=1 to maximum intensity in S do
error. Representing φ(O) in terms of dot products,
5: Compute global intensity mg = P1 m1 + P2 m2 using
equations (8), (9) Pt φT (O)φ(O0 ) = K(O, O0 ) (15)
6: Compute mean intensity upto level t mt = i=0 iPi
7: Compute σb2 using equation (10) where 0 denotes the cyclic shifts and K is a Gaussian kernel
8: end for
function. Therefore,
n
9: Derive optimal threshold using equation (11) X
10: Draw bounding box covering maximum area of the salient f (β) = αT β = δi K(β, Oi ) (16)
i=1
object in optimally thresholded binary image
11: return salient object coordinates The solution to equation (16) as derived in [46] is
σ = (K + λI)−1 Y (17)

2379-8858 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIV.2018.2804166, IEEE
Transactions on Intelligent Vehicles
6

where σ is the vector of coefficients δi representing the solu-


tion in the dual space. In [5], given a condition K(O, O0 ) =
K(MO, MO0 ) where M is a permutation matrix, K is shown
to be circulant and can be diagonalized for faster computation.
Since we need to find f (β) on multiple image locations that
can be arranged in a circulant matrix, let us define Kβ as the
kernel matrix between every training samples and candidate Fig. 5. Refinement process: (From left to right) Selected region around the
patches that are cyclic shifts of O and β. We have, prime object based on the peak value for the refinement process (brown
bounding box), the chosen area for refinement (zoomed in for better view),
Kβ = C(k Oβ ) (18) saliency map generated for the chosen area, post-processed binary image,
prime salient object re-detected (notice the change in the size bounding box
where C(k Oβ ) is the kernel correlation of O and β. Now, the before and after the refinement process) to update the tracker.
regression function for all candidate patches is given by
f (β) = (Kβ )T δ (19) selecting a certain peak of filter response value for adaptive
Diagonalizing for efficient computation, refinement process is presented in Section IV.
One of the examples that best demonstrates the refinement
ˆ = k̂ Oβ δ
f (β) (20) process used in our approach is shown in Fig.5. When the
where f (β)ˆ represents the DFT of f (β) and denotes object being tracked considerably changes in shape, size,
element-wise operation. illumination or reflectance the peak of response drops lower
The confidence values of the filter are monitored to deter- as the current frame significantly differ from the previous
mine the adaptive search region for the salient object detection frames. Thus, our refinement approach comes into action.
in our tracking approach. A tracker is less confident about the During the refining process, our approach successfully selects
object being tracked when its confidence value (measures the the region R around the prime object being tracked and applies
similarity of object in two consecutive frames while tracking) salient object detection algorithm (as described in subsection
drops below a certain defined threshold. Hence, we acquire A) only in this selected region, thus making our approach
a re-detection scheme to mitigate such off-guarded tracker. computationally efficient. Further, the generated saliency map
Therefore, we avidly monitor the confidence values throughout is post-processed (as described in subsection B) to relocate the
the tracking process and adjust our detection region adaptively prime object. The correct coordinates are further updated to
as the tracker’s confidence value surge lower. If the confidence the running tracker for successful tracking as shown in Fig. 2.
value drops too low, we set our search region R to an entire The tracking algorithm is given in Algorithm 2.
frame. Thus, such settlements help precise tracking of the
Algorithm 2 Real-time Salient Object Tracking
obstacle that undergo variations in shape, size, rotation, camera
Input: a sequence of images
instability and illumination.
for each frame f do
2: if (first frame) then
D. Refinement Auto-initialize using Algorithm 1
During the course of tracking, tracker may lose the track 4: continue
of the object being tracked due to several inconsistencies such end if
as abrupt motion dynamics, undefined perturbations, camera 6: observe f (β)ˆ (equation (20))
instability, projection/separation of similar or disparate foreign ˆ < set confidence value) then
if ( f (β)
objects into the scene or nearby the object being tracked. 8: Adaptively define R around last known coordinates
Most of the trackers are unable to handle such complicacies of the object being tracked
efficiently. Thus, to effectively monitor such circumstances, Generate saliency map S of R using equation (1)-(4)
we have implemented a refinement approach in our tracker. 10: Postprocess S using steps 2-10 of Algorithm 1
It is observed that the peak of filter response (confidence end if
value) drops below certain threshold when our tracker is 12: Update the tracker with new coordinates
unable to correctly track the object in a subsequent frame. end for
Therefore, a proper refining approach to correct the tracker
in such situations was found to be necessary. One of the
approaches that could be considered is to undergo salient IV. E XPERIMENTS
object detection approach (as described in subsection A) on We have implemented the proposed algorithm in this paper
an entire image to relocate the object and update the tracker using C++ and OpenCV v.3.0. All the experiments were
with necessary correction in the given frame. However, such performed on a Intel(R) Xeon(R) W3530 PC with 2.80 GHz
an attempt is computationally expensive when applied on processor and 4 GB RAM. The competing tracker’s codes were
each individual frame or on an entire frame iteratively while also experimented on the same PC and downloaded from the
tracking an object. Therefore, in our approach we adaptively respective author’s web page1 .
generate region R depending on the confidence value to run
the detection algorithm. Detailed analysis on the measures for 1 http://www.ittc.ku.edu/cviu/tracking.html

2379-8858 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIV.2018.2804166, IEEE
Transactions on Intelligent Vehicles
7

Fig. 6. Peak sensitivity curve demonstrating high peak of sensitivity variance (bad) of KCF versus low peak of sensitivity variance (good) of our approach
in first 50 frames of airplane 006 dataset. (Best viewed in color)

Fig. 7. Peak of filter response of KCF in (a) airplane 011 and (b) youtube 3 dataset showing fall in the peak values of the filter leading to inaccurate tracking
of the object when the object changes its shape, size or illumination.

A. Dataset the tracker with the new object localization parameters. We


Our approach is tested on 25 challenging video sequences designed an experiment, in which, we set a certain value for
where the object is subjected to variations of scale, partial peak of filter response (confidence value) and apply our re-
occlusion, axial and planar rotation, illumination variation and detection approach when the peak of filter response falls to
camera instability. The experimented sequences are namely; the set value or below it. Since we need a metric to compare
airplane 001 (200 frames), airplane 004 (200 frames), air- and evaluate our plots against the KCF plot for several peak of
plane 005 (200 frames), airplane 006 (200 frames), air- filter response values, we propose a peak sensitivity variance
plane 007 (200 frames), airplane 011 (300 frames), air- (pi − pm )2
pvar = where pi is the peak of the filter response
plane 012 (300 frames), airplane 013 (300 frames), air- th
n
value for i frame in a given dataset, pm is a mean peak of
plane 015 (300 frames), airplane 016 (300 frames), big 2
filter response value and n is the total number of frames in the
(382 frames) from [47], Dog (127 frames), planestv 1
dataset chosen. Thus, pvar gives the measure of the variation
(223 frames), planestv 2 (200 frames), planestv 3 (300
of sensitivity of the tracker’s filter response in the ith frame
frames), planestv 4 (350 frames), planestv 5 (200 frames),
from its mean value. We prefer lower pvar value (throughout
planestv 6 (230 frames), planestv 7 (250 frames), planestv 8
the frames in any given sequence) which shows that the tracker
(260 frames), planestv 9 (410 frames), Skater (160 frames),
is able to correctly and consistently track the position of the
youtube 1 (216 frames), youtube 2 (475 frames) together with
object in most of the frames and does not lose track of the
youtube 3 (301 frames) from publicly available videos on the
tracked objects ı.e. the bounding box does not deviate away
web to test our algorithm on several types of objects. Dog
from the tracked object which would otherwise change its peak
and Skater datasets have been chosen to observe our tracker’s
of response value on this frame (pi ) from mean peak value
performance on general objects other than aerial flying units.
(pm ), thus increases the sensitivity metric pvar .
We manually annotated the ground truth for each of the chosen
datasets to perform quantitative analysis as described in the We observed that as we increased the peak of response
latter sections. Datasets along with their annotated ground truth value for re-detection, the tracker performed better, i.e., lower
is made available on the author’s web page. peak sensitivity variance in most of the observed frames as
shown in Fig. 6. This is due to the re-detection scheme
B. Determination of Suitable Peak of Filter Response being performed more on the frames as soon as the peak of
It is essential to find a suitable value for a peak of response response value would fall to such higher values. In contrast,
of the tracking filter to ensure the success of object tracking, we found the speed of tracker decreased sharply as more
utilize our object re-detection approach to rectify the proper re-detection were being performed. Thus, a suitable balance
coordinates of the object being tracked, and further update between the speed of the reliable tracker and pvar is needed

2379-8858 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIV.2018.2804166, IEEE
Transactions on Intelligent Vehicles
8

Fig. 8. OPE and TRE curve demonstrating the average precision rate and the success rate of the proposed and 6 competing trackers over 25 video sequences.
(Best viewed in color)

to be determined. Several experiments on our 25 challenging C. Comparison with State-of-the-Art Trackers


datasets demonstrated average tracking speed of 83.02 fps, In quantitative analysis, we run the six competing trackers
115.53 fps, 122.18 fps for the set of peak of filter response along with our approach on 25 datasets and report the aver-
values 0.4, 0.5 and 0.6 respectively. Though, running the re- age performance. We use measures like precision rate (PR),
detection scheme for lower peak of filter response values success rate (SR) and central location error (CLE) to compare
reduces pvar , it severely impacts our tracking speed. Thus, our approach with other competing trackers. CLE is defined as
a peak of filter response value 0.5 was chosen for adaptive re- the Euclidean distance between the central coordinates of the
detection to maintain a sound balance between pvar and the the ground truth bounding box and that of the tracker’s output.
tracking speed. Thus, for a better performance of the tracker, a lower value of
CLE is preferred. PR is defined as the percentage of frames in
The higher filter response values are observed when the which CLE is lower than a given threshold. A threshold value
tracker tracks the object properly. Fig. 7 shows two such of 20 pixels is used in our paper for the evaluation as suggested
experiments where filter response is plotted throughout 100 in [37]. Tracking results are considered to be successful if
frames in two chosen datasets, namely; airplane 011 and (at ∩ ag )
> θ, where θ ∈ [0, 1], at and ag denote the areas
youtube 3. It can be clearly observed that the peak of filter (at ∪ ag )
response for KCF remains stable as long as the object being of the bounding boxes of the tracker’s output and the ground
tracked do not change much in shape, size or illumination truth, respectively. Thus, SR is defined as the percentage of
whereas the peak of filter response falls drastically in circum- frames where the overlap rates are greater than a threshold θ.
stances like partial occlusion, scale variation, or inability to Generally, θ is set to 0.5, which means a 50% overlap ratio
track properly. Since the sudden fall or rise in the peak of threshold.
filter response accounts for higher peak sensitivity variance, Similarly, one pass evaluation (OPE) and temporal robust
it is not a good characteristic for long-term stable trackers evaluation (TRE) experiments can be performed for a sound
as explained above. Several experiments as discussed in later evaluation of any tracker [37]. For OPE, each tracker is
sections clearly demonstrate that our approach is robust to run from the first frame till the last frame and compared
such scale variations or partial occlusions. Comparative study with the ground truth. TRE slightly differs from OPE as
of peak sensitivity variance of KCF versus our approach in the sequence is randomly divided into several portions (20
Fig. 6 further bolsters our claim. The sudden rise observed in in our experimentation) and the tracker is thereafter run on
the peak sensitivity variance curve in our method is due to each portion and finally compared with their respective ground
re-detection scheme being performed once the tracker tend to truths.
lose the object being tracked or the object changes its scale The quantitative evaluation between our approach versus
which is essential for a reliable object tracker. six competing visual trackers: CT [48], STC [49], CN [50],

2379-8858 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIV.2018.2804166, IEEE
Transactions on Intelligent Vehicles
9

TABLE I TABLE III


Q UANTITATIVE ANALYSIS OF THE PROPOSED AND 6 OTHER COMPETING S UCCESS RATE OF THE 25 SEQUENCES OF THE PROPOSED AND THE 6
TRACKERS ON 25 TEST SEQUENCES . T HE BEST AND THE SECOND BEST COMPETING TRACKERS . T HE BEST AND THE SECOND BEST RESULTS ARE
RESULTS ARE HIGHLIGHTED USING BOLD - FACE AND UNDERLINE HIGHLIGHTED USING BOLD - FACE AND UNDERLINE FONT STYLES ,
FONT- STYLES , RESPECTIVELY. RESPECTIVELY.

Ours CT STC CN DSST SAMF KCF Ours CN CT DSST SAMF STC KCF
Average Precision Rate (TRE) 0.82 0.31 0.47 0.45 0.51 0.45 0.46 airplane 001 0.78 0.17 0.20 0.27 0.12 0.32 0.12
Average Success Rate (TRE) 0.76 0.37 0.41 0.42 0.49 0.43 0.46
airplane 004 0.54 0.53 0.49 0.42 0.49 0.47 0.44
Average Precision Rate (OPE) 0.77 0.19 0.45 0.44 0.45 0.43 0.42
Average Success Rate (OPE) 0.6 0.27 0.38 0.41 0.43 0.42 0.40 airplane 005 0.57 0.20 0.15 0.26 0.21 0.39 0.23
CLE (in pixels) 13 170 47 75 59 79 87 airplane 006 0.54 0.43 0.19 0.45 0.43 0.49 0.43
Average Speed (fps) 115.53 29.13 27.21 27.50 4.97 6.54 71.69 airplane 007 0.53 0.46 0.13 0.55 0.53 0.47 0.49
airplane 011 0.77 0.34 0.21 0.29 0.73 0.33 0.20
airplane 012 0.47 0.31 0.18 0.59 0.34 0.48 0.70
TABLE II airplane 013 0.70 0.19 0.24 0.30 0.11 0.27 0.16
P RECISION RATE OF THE 25 SEQUENCES OF THE PROPOSED AND THE 6
airplane 015 0.71 0.65 0.45 0.59 0.62 0.49 0.54
COMPETING TRACKERS . T HE BEST AND THE SECOND BEST RESULTS ARE
HIGHLIGHTED USING BOLD - FACE AND UNDERLINE FONT STYLES , airplane 016 0.73 0.58 0.22 0.56 0.65 0.62 0.58
RESPECTIVELY. big 2 0.61 0.58 0.30 0.65 0.57 0.58 0.63
planestv 1 0.26 0.86 0.66 0.80 0.76 0.78 0.78
Ours CN CT DSST SAMF STC KCF planestv 2 0.59 0.41 0.38 0.57 0.45 0.30 0.45
airplane 001 0.92 0.20 0.20 0.26 0.21 0.38 0.12 planestv 3 0.79 0.47 0.42 0.43 0.42 0.43 0.38
airplane 004 0.79 0.44 0.25 0.50 0.26 0.42 0.37 planestv 4 0.76 0.35 0.23 0.27 0.30 0.43 0.31
airplane 005 0.81 0.33 0.19 0.32 0.21 0.36 0.27 planestv 5 0.72 0.32 0.36 0.26 0.27 0.15 0.41
airplane 006 0.92 0.54 0.22 0.54 0.20 0.65 0.53 planestv 6 0.58 0.48 0.35 0.34 0.32 0.31 0.37
airplane 007 0.76 0.61 0.18 0.36 0.15 0.46 0.37 planestv 7 0.63 0.28 0.30 0.49 0.11 0.17 0.41
airplane 011 0.90 0.43 0.27 0.28 0.8 0.31 0.25 planestv 8 0.47 0.37 0.24 0.54 0.29 0.46 0.27
airplane 012 0.74 0.15 0.20 0.88 0.20 0.83 0.81 planestv 9 0.71 0.37 0.25 0.45 0.41 0.32 0.21
airplane 013 0.89 0.32 0.20 0.32 0.21 0.26 0.12 youtube 1 0.44 0.41 0.32 0.18 0.43 0.18 0.55
airplane 015 0.82 0.73 0.35 0.58 0.82 0.49 0.79 youtube 2 0.56 0.45 0.15 0.40 0.35 0.25 0.43
airplane 016 0.83 0.76 0.18 0.73 0.75 0.65 0.45 youtube 3 0.62 0.24 0.10 0.22 0.22 0.16 0.29
big 2 0.89 0.82 0.31 0.91 0.84 0.85 0.85 Dog 0.33 0.15 0.15 0.39 0.10 0.24 0.19
planestv 1 0.86 0.90 0.46 0.85 0.75 0.90 0.84 Skater 0.51 0.57 0.55 0.55 0.53 0.47 0.54
planestv 2 0.77 0.37 0.37 0.42 0.35 0.36 0.49
planestv 3 0.89 0.33 0.14 0.14 0.30 0.60 0.13
planestv 4 0.55 0.03 0.12 0.08 0.15 0.02 0.10
planestv 5 0.64 0.03 0.09 0.03 0.09 0.16 0.23 the other competing trackers by observing the precision rate
planestv 6 0.73 0.47 0.15 0.61 0.70 0.34 0.53 plot and our method is also more adaptive to the variations in
planestv 7 0.75 0.27 0.24 0.49 0.30 0.14 0.38 shape and size of the object being tracked in a given video
planestv 8 0.88 0.64 0.18 0.80 0.63 0.66 0.34 sequence as demonstrated by success rate plot.
planestv 9 0.52 0.04 0.02 0.21 0.20 0.14 0.03
youtube 1 0.89 0.88 0.41 0.09 0.80 0.86 0.87
youtube 2 0.81 0.65 0.40 0.56 0.70 0.66 0.46
D. Speed Comparison
youtube 3 0.69 0.07 0.06 0.07 0.08 0.12 0.26
Dog 0.42 0.25 0.31 0.66 0.30 0.29 0.42 As presented in Table I, our algorithm (implemented in
Skater 0.57 0.60 0.53 0.59 0.50 0.44 0.57 C++) achieves an average speed of 115.53 frames per second
(fps) whereas KCF achieved 132.87 fps when implemented in
C++ and 71.69 fps when implemented in MATLAB on the 25
DSST [38], SAMF [4] and KCF [5] is shown in Table I. It challenging video sequences. One of the major disadvantages
can be observed from OPE and PRE values that our method of KCF, despite its good speed, is that it fails to track the object
outperforms the competing trackers. Similarly, our approach in the following frames once it loses track of the object in a
also has the least CLE and real-time speed performance. given frame, thus making it unreliable for real-time tracking.
Experimental results performed during OPE against the six However, our approach is able to re-detect the object if it
competing trackers on all the 25 video sequences are also loses track of the object due to the proposed re-detection
reported. PR and SR are tabulated in Table II and Table III scheme, thus making our algorithm more apt for such purpose.
respectively. The tables clearly demonstrate that our approach Moreover, KCF is not adaptive to variations in shape and size
is more accurate in the almost all of the experimented chal- of the object being tracked and draws a fixed bounding box
lenging datasets. Similarly, our approach stands best among around them. In contrast, our approach accurately adapts to
the competing trackers by a great margin – 20 out of 25 such variations in shape and size of the object being tracked
sequences in PR evaluation and 17 out of 25 sequences in and adjusts the bounding box accordingly thus making our
SR evaluation. approach more suitable for sense-and-avoid systems. Similarly,
Fig. 8 shows the precision as well as success rate plots for compared to CN, STC and CT, our approach stands out by
OPE as well as TRE experimented over all the challenging more than three times (3x) faster than their average speeds.
datasets. It is clear from the plots that our approach is signif- Similarly, DSST and SAMF clearly does not fit for real-
icantly better than the other trackers compared. In summary, time object tracking due to their very low speed. Hence, this
we can verify that our approach is superior in robustness to tremendous advantage in the speed and long-term tracking

2379-8858 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIV.2018.2804166, IEEE
Transactions on Intelligent Vehicles
10

(a) (b)

(c) (d)

(e) (f)

(g) (h)

(i) (j)

Fig. 9. Tracking results of our approach and the output of 6 competing trackers in the representative frames: (a) scale variation (youtube dataset 3), (b)
partial occlusion (airplane 005), (c) axial rotation (planestv 4), (d) planar rotation (big 2), (e) and (f) illumination variation (airplane 001 and airplane 006),
(g) and (h) camera instability (airplane 001 and airplane 012), (i) and (j) roll,pitch and yaw (planestv 6 and planestv 9). (Best viewed in color)

ability makes our algorithm more suitable for real-time object be invariant of partial occlusion. Almost all the other trackers
tracking than the compared state-of-the-art trackers. fail in this situation.
In Fig. 9 (second row), trackers are tested on the basis of
axial rotation (frame #41, frame #182, frame #270 and frame
E. Qualitative Evaluation #288 in planestv 4) and planar rotation (frame #87, frame
In this subsection, we present the qualitative comparisons of #203, frame #325 and frame #372 in big 2) dynamics. It can
our approach against the 6 competing trackers. In Fig. 9 (top be seen that CT is not apt for both axial or planar rotations.
row), we can observe that our tracker is extremely adaptive to Though KCF is able to perform well (except for scale changes)
the change in scale as well as during partial occlusion. It is during axial rotation, it fails against planar rotations. Despite
experimentally found that though all the competing trackers such complicated rotation dynamics, our method stands out
produce acceptable outputs in the first few frames, as the among all the other competing trackers, thus proving it to be
object changes its shape (frame #1148, frame #1222 and #1317 more robust.
in youtube dataset 3) or undergoes partial occlusion (frame Similarly, in Fig. 9 (mid row), we depict the evaluation of all
#91, frame #109 and frame #121 in airplane 005), some of the trackers under illumination variation (frame #3, frame #35,
the trackers fail. For instance, CT, SAMF, KCF and STC are frame #75 and frame #127 in airplane 001 as well as frame
unable to keep the scale variation in their account. However, #4, frame #51, frame #87 and frame #177 in airplane 006). It
our method quickly adjusts to changing appearance and size of is clear from the figure that CT, CN and SAMF are incapable
the object. Similarly, only STC and our method are found to of keeping track of the object under such constraints. However,

2379-8858 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIV.2018.2804166, IEEE
Transactions on Intelligent Vehicles
11

Fig. 11. Limitation of our approach: Auto-initialization in too complex


background is not accurate but our adaptive approach tracks smoothly in
successive frames. (Best viewed in color)

Fig. 10. Frames showing our tracker narrows down to the prime object in a
scene once the multiple objects are at a suitable distance apart or one of the
objects move out of scene. V. C ONCLUSION
It is vital for an intelligent autonomous UAV to have an
automatic, robust and real-time object tracking system built
our method is not affected by such illumination variation and in it. Therefore, in this paper, we have proposed a tracking
keeps good track of the object. method that incorporates variations in shape, size, illumina-
Fourth row in Fig. 9 demonstrates competing tracker’s tion as well as degenerate conditions like partial occlusion,
performance over camera instability (frame #190 - #193 in planar/axial rotation and camera instability in it for better
airplane 001 and frame #37 - #40 in airplane 012). It can be performance than the existing state-of-the-art trackers. Most of
clearly observed that almost all of the trackers, except ours, the up-to-date trackers were found to fail in one or several such
fail to correctly track the object when there is a significant complex scenarios. However, our tracker is able to keep track
jerk in the camera pose or sudden perturbations. of the object without any abrupt failures. Both qualitative and
Moreover, the last row in Fig. 9 demonstrates the exper- quantitative evaluation measures demonstrate that the proposed
imentation of the trackers on roll-pitch-yaw motion (frame approach is more efficient than the competing trackers. Unlike
#109, frame #122, frame #145 and frame #167 in planestv 6 other trackers, the proposed tracker is able to auto-initialize
and frame #123, frame #147, frame #235 and frame #310 in without any manual interference. Hence, our approach is
planestv 9). It is evident that almost all of the trackers fail to found to be accurate and fast in terms of speed for real-
keep track of the larger objects in such situations. However, time autonomous sense-and-avoid UAVs, drones or similar
our tracker is able to keep track of both the shape and pose flying units. Nevertheless, some of the experiments show that
of the object (small or large) being tracked. our method may not perform as expected in the presence of
several dubious salient objects in a scene or the object is too
tiny to detect and auto-initialize the tracker. Source code and
F. Limitations the dataset for the proposed approach can be downloaded for
research purpose from the author’s web page2 .
From several experimentations, we found that our approach,
though fast and robust, has some limitations. Since our
algorithm is designed to auto-initialize and utilizes salient ACKNOWLEDGMENT
object detection scheme whenever necessary throughout the The authors would like to thank the anonymous reviewers
tracking process, it starts tracking all the objects (bounding for their constructive comments. This work was supported in
box comprises maximum area occupancy of the objects in part by the National Aeronautics and Space Administration
a given scene) in the presence of multiple objects in a LEARN II Program under Grant NNX15AN94N, in part by
scene as shown in Fig. 10. However, our approach effectively the General Research Fund of the University of Kansas under
adapts its bounding box and narrows down to one object Grant 2228901, and in part by the Kansas NASA EPSCoR
once the prime object to be tracked is segregated in a given Program under Grant KNEP-PDG-10-2017-KU. We would
scene as demonstrated in Fig. 10. It is also important to also like to thank Mr. Arjan Gupta and Miss Nina Wang for
note that the above mentioned limitation has insignificant helping us label the test data.
effect in sense and avoid UAVs as a single bounding box
around multiple objects may suggest the detection of several R EFERENCES
obstacles within the region, thus need to be avoided during the
[1] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” Acm
UAV’s trajectory. The other limitation of our approach is an computing surveys (CSUR), vol. 38, no. 4, p. 13, 2006.
inability to accurate auto-initialization provided a too complex [2] Z. Kim, “Robust lane detection and tracking in challenging scenarios,”
background. For instance, in Fig. 11, the tracker’s bounding IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 1,
pp. 16–26, March 2008.
box comprises of the background along with the object to be [3] M. Shan, S. Worrall, and E. Nebot, “Probabilistic long-term vehicle mo-
tracked. This limitation, however, is only observed during the tion prediction and tracking in large environments,” IEEE Transactions
initialization phase. Once the tracker starts learning, the object on Intelligent Transportation Systems, vol. 14, no. 2, pp. 539–552, June
2013.
is significantly tracked to a greater accuracy in the consecutive
frames as shown in Fig.11. 2 http://www.ittc.ku.edu/cviu/tracking.html

2379-8858 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIV.2018.2804166, IEEE
Transactions on Intelligent Vehicles
12

[4] Y. Li and J. Zhu, “A scale adaptive kernel correlation filter tracker [29] W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency optimization from
with feature integration,” in European Conference on Computer Vision. robust background detection,” in Proceedings of the IEEE conference
Springer, 2014, pp. 254–265. on computer vision and pattern recognition, 2014, pp. 2814–2821.
[5] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed [30] J. Feng, Y. Wei, L. Tao, C. Zhang, and J. Sun, “Salient object detection
tracking with kernelized correlation filters,” IEEE Transactions on by composition,” in 2011 International Conference on Computer Vision.
Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583–596, IEEE, 2011, pp. 1028–1035.
2015. [31] P. Siva, C. Russell, T. Xiang, and L. Agapito, “Looking beyond the
[6] L. Itti and C. Koch, “Computational modelling of visual attention,” image: Unsupervised learning for object saliency and detection,” in
Nature reviews neuroscience, vol. 2, no. 3, pp. 194–203, 2001. Proceedings of the IEEE conference on computer vision and pattern
[7] H. Cholakkal, D. Rajan, and J. Johnson, “Top-down saliency with recognition, 2013, pp. 3238–3245.
locality-constrained contextual sparse coding.” BMVC, 2015. [32] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, “Visual
[8] J. Yang and M.-H. Yang, “Top-down visual saliency via joint crf object tracking using adaptive correlation filters,” in Computer Vision
and dictionary learning,” in Computer Vision and Pattern Recognition and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE,
(CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2296–2303. 2010, pp. 2544–2550.
[9] Y. Wei, F. Wen, W. Zhu, and J. Sun, “Geodesic saliency using back- [33] Y. Sui, G. Wang, and L. Zhang, “Correlation filter learning toward peak
ground priors,” in European Conference on Computer Vision. Springer, strength for visual tracking,” IEEE Transactions on Cybernetics, vol. PP,
2012, pp. 29–42. no. 99, pp. 1–14, 2017.
[10] J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and R. Mech, “Minimum [34] Y. Sui, Z. Zhang, G. Wang, Y. Tang, and L. Zhang, “Real-time visual
barrier salient object detection at 80 fps,” in Proceedings of the IEEE tracking: Promoting the robustness of correlation filter learning,” in
International Conference on Computer Vision, 2015, pp. 1404–1412. European Conference on Computer Vision. Springer, 2016, pp. 662–
[11] K. C. Ciesielski, R. Strand, F. Malmberg, and P. K. Saha, “Efficient 678.
algorithm for finding the exact minimum barrier distance,” Computer [35] Y. Sui, Y. Tang, L. Zhang, and G. Wang, “Visual tracking via subspace
Vision and Image Understanding, vol. 123, pp. 53–64, 2014. learning: A discriminative approach,” International Journal of Computer
[12] R. Strand, K. C. Ciesielski, F. Malmberg, and P. K. Saha, “The minimum Vision, pp. 1–22, 2017.
barrier distance,” Computer Vision and Image Understanding, vol. 117, [36] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “Exploiting the
no. 4, pp. 429–437, 2013. circulant structure of tracking-by-detection with kernels,” in European
[13] S. Avidan, “Ensemble tracking,” IEEE transactions on pattern analysis conference on computer vision. Springer, 2012, pp. 702–715.
and machine intelligence, vol. 29, no. 2, pp. 261–271, 2007. [37] Y. Wu, J. Lim, and M.-H. Yang, “Online object tracking: A benchmark,”
[14] B. Babenko, M.-H. Yang, and S. Belongie, “Visual tracking with online in Proceedings of the IEEE conference on computer vision and pattern
multiple instance learning,” in Computer Vision and Pattern Recognition, recognition, 2013, pp. 2411–2418.
2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 983–990. [38] M. Danelljan, G. Häger, F. Khan, and M. Felsberg, “Accurate scale esti-
[15] K. Zhang, L. Zhang, and M.-H. Yang, “Real-time compressive tracking,” mation for robust visual tracking,” in British Machine Vision Conference,
in European Conference on Computer Vision. Springer, 2012, pp. 864– Nottingham, September 1-5, 2014. BMVA Press, 2014.
877. [39] T. Liu, G. Wang, and Q. Yang, “Real-time part-based visual tracking
[16] S. Hare, S. Golodetz, A. Saffari, V. Vineet, M.-M. Cheng, S. Hicks, and via adaptive correlation filters,” in Proceedings of the IEEE Conference
P. Torr, “Struck: Structured output tracking with kernels,” 2015. on Computer Vision and Pattern Recognition, 2015, pp. 4902–4912.
[17] Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,” [40] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time
IEEE transactions on pattern analysis and machine intelligence, vol. 34, object detection with region proposal networks,” in Advances in neural
no. 7, pp. 1409–1422, 2012. information processing systems, 2015, pp. 91–99.
[18] A. Jordan, “On discriminative vs. generative classifiers: A comparison [41] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang, “Hierarchical con-
of logistic regression and naive bayes,” Advances in neural information volutional features for visual tracking,” in Proceedings of the IEEE
processing systems, vol. 14, p. 841, 2002. International Conference on Computer Vision, 2015, pp. 3074–3082.
[19] Y. Wu, Y. Sui, and G. Wang, “Vision-based real-time aerial object [42] H. Li, Y. Li, and F. Porikli, “Deeptrack: Learning discriminative feature
localization and tracking for uav sensing system,” IEEE Access, vol. 5, representations online for robust visual tracking,” IEEE Transactions on
pp. 23 969–23 978, 2017. Image Processing, vol. 25, no. 4, pp. 1834–1848, 2016.
[20] C. Ma, X. Yang, C. Zhang, and M.-H. Yang, “Long-term correlation [43] N. Otsu, “A threshold selection method from gray-level histograms,”
tracking,” in Proceedings of the IEEE Conference on Computer Vision Automatica, vol. 11, no. 285-296, pp. 23–27, 1975.
and Pattern Recognition, 2015, pp. 5388–5396. [44] R. M. Gray, Toeplitz and circulant matrices: A review. now publishers
[21] S. P. Bharati, S. Nandi, Y. Wu, Y. Sui, and G. Wang, “Fast and robust inc, 2006.
object tracking with adaptive detection,” in 2016 IEEE 28th International [45] B. Scholkopf and A. J. Smola, Learning with kernels: support vector
Conference on Tools with Artificial Intelligence (ICTAI), Nov 2016, pp. machines, regularization, optimization, and beyond. MIT press, 2001.
706–713. [46] R. Rifkin, G. Yeo, and T. Poggio, “Regularized least-squares classi-
[22] M. Andriluka, S. Roth, and B. Schiele, “People-tracking-by-detection fication,” Nato Science Series Sub Series III Computer and Systems
and people-detection-by-tracking,” in Computer Vision and Pattern Sciences, vol. 190, pp. 131–154, 2003.
Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, [47] A. Li, M. Lin, Y. Wu, M. Yang, and S. Yan, “NUS-PRO: A New
pp. 1–8. Visual Tracking Challenge,” IEEE Transactions on Pattern Analysis and
[23] V. Mahadevan and N. Vasconcelos, “Saliency-based discriminant track- Machine Intelligence, vol. 38, no. 2, pp. 335–349, 2016.
ing,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. [48] K. Zhang, L. Zhang, and M.-H. Yang, “Real-time compressive tracking,”
IEEE Conference on. IEEE, 2009, pp. 1007–1013. in European Conference on Computer Vision. Springer, 2012, pp. 864–
[24] A. Nussberger, H. Grabner, and L. Van Gool, “Robust aerial object 877.
tracking in high dynamic flight maneuvers,” ISPRS Annals of the [49] K. Zhang, L. Zhang, Q. Liu, D. Zhang, and M.-H. Yang, “Fast visual
Photogrammetry, Remote Sensing and Spatial Information Sciences, tracking via dense spatio-temporal context learning,” in European Con-
vol. 2, no. 1, p. 1, 2015. ference on Computer Vision. Springer, 2014, pp. 127–141.
[25] A. Kocak, K. Cizmeciler, A. Erdem, and E. Erdem, “Top down saliency [50] M. Danelljan, F. Shahbaz Khan, M. Felsberg, and J. Van de Weijer,
estimation via superpixel-based discriminative dictionaries.” in BMVC, “Adaptive color attributes for real-time visual tracking,” in Proceedings
2014. of the IEEE Conference on Computer Vision and Pattern Recognition,
[26] S. He, R. W. Lau, W. Liu, Z. Huang, and Q. Yang, “Supercnn: A su- 2014, pp. 1090–1097.
perpixelwise convolutional neural network for salient object detection,”
International Journal of Computer Vision, vol. 115, no. 3, pp. 330–344,
2015.
[27] P. Wang, J. Wang, G. Zeng, J. Feng, H. Zha, and S. Li, “Salient object
detection for searched web images via global saliency,” in Computer
Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.
IEEE, 2012, pp. 3194–3201.
[28] R. Zhao, W. Ouyang, H. Li, and X. Wang, “Saliency detection by
multi-context deep learning,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2015, pp. 1265–1274.

2379-8858 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIV.2018.2804166, IEEE
Transactions on Intelligent Vehicles
13

Sushil Pratap Bharati earned his Bachelors degree Guanghui Wang (M’10, SM’17) received his PhD
from Motilal Nehru National Institute of Technol- in computer vision from the University of Water-
ogy, Allahabad, India. He is currently pursuing his loo, Canada, in 2014. He is currently an assistant
Masters degree at the University of Kansas. His professor at the University of Kansas, USA. He
research interests include real-time object detection is also with the Institute of Automation, Chinese
and tracking, 3-D reconstruction and modeling, pat- Academy of Sciences, China, as an adjunct profes-
tern recognition, autonomous robotics and broad sor.
applications of computer vision and deep learning. From 2003 to 2005, he was a research fellow and
visiting scholar with the Department of Electronic
Engineering at the Chinese University of Hong
Kong. From 2005 to 2006, he acted as a professor
at the Department of Control Engineering in Changchun Aviation University,
China. From 2006 to 2010, He was a research fellow with the Department
of Electrical and Computer Engineering, University of Windsor, Canada. He
has authored one book, Guide to Three Dimensional Structure and Motion
Factorization, published at Springer-Verlag. He has published over 90 papers
in peer-reviewed journals and conferences. His research interests include
computer vision, structure from motion, object detection and tracking, artificial
intelligence, and robot localization and navigation. Dr. Wang has served as
associate editor and on the editorial board of two journals, as an area chair
Yuanwei Wu received his Masters degree from the or TPC member of 20+ conferences, and as a reviewer of 20+ journals.
Tufts University. He is currently a PhD candidate
at the University of Kansas. His research interests
are focused on broad applications in deep learning
and computer vision, in particular object detection,
localization and visual tracking.

Yao Sui received his Ph.D. degree in electronic en-


gineering from Tsinghua University, Beijing, China,
in 2015. He was a postdoctoral researcher in the
Department of Electrical Engineering and Computer
Science, University of Kansas, Lawrence, KS 66045,
USA, during 2015 to 2017. He is currently a research
fellow in Harvard Medical School. His research
interests include machine learning, computer vision,
image processing and pattern recognition.

Curtis Padgett Dr. Curtis Padgett is currently the


Supervisor for the Maritime and Aerial Perception
Systems Group and a Principal in the Robotics
Section at the Jet Propulsion Laboratory. Dr. Padgett
leads research efforts focused on aerial and maritime
imaging problems including: navigation support for
landing and proximity operations; automated, real-
time recovery of structure from motion; precision
geo-registration of imagery; automated landmark
generation and mapping for surface relative naviga-
tion; stereo image sea surface sensing for navigation
on water and image based, multi-platform contact range determination. He
has a Ph.D. in Computer Science from the University of California at San
Diego, and has been an employee of JPL since graduating in 1995. His
research interests include pattern recognition, image-based reconstruction, and
mapping.

2379-8858 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy