Dynamic background modeling using deep learning autoencoder network

Jeffin Gracewell · Mala John
https://doi.org/10.1007/s11042-019-7411-0
Abstract
Background modeling is a major prerequisite for a variety of multimedia applications such as video surveillance and traffic monitoring. Numerous approaches have been proposed for this task over the past few decades. However, the need for a real-time, low-cost approach based on artificial intelligence still exists. Moreover, some recently proposed approaches have not been validated on challenging applications, where their efficiency may fall short when tested. In this paper, an efficient deep learning technique based on an autoencoder network is used for modeling the background. The background model is obtained by training the network on the incoming frames of the surveillance video in an unsupervised manner. To optimize the weights of the network, a greedy layer-wise pre-training approach is used initially, and fine-tuning of the network is done using a conjugate-gradient-based back-propagation algorithm. The performance of the algorithm is validated on the application of unattended object detection in a dynamic environment. A comprehensive assessment of the proposed method using the CDNET 2014 dataset and other datasets demonstrates the efficiency of the technique in background modeling.
1 Introduction
In recent years, surveillance cameras have been deployed widely and in large numbers for various safety and security purposes [14, 16, 40, 45]. Monitoring all such surveillance video round the clock, manually, with human intervention alone, is a tedious task. Hence, a user-friendly, real-time, artificially intelligent system that can detect objects or events of interest by itself is highly desirable to reduce the burden of monitoring surveillance video. Therefore, the goal of the system is to design an approach
that can automatically identify unusual activities, such as abnormal event detection, unattended object detection, intruder detection and so on.
The fundamental technique used in surveillance is background subtraction. The purpose of background subtraction is to discriminate moving foreground objects from the background model. This is done by subtracting the current frame from the background model at each co-located pixel in the image. The accuracy of background subtraction depends on the quality of the background modeling.
Though many approaches have been proposed for background modeling over the last few decades, the need for improvement persists due to the inherent complexity of the problem. Performance is challenged by sudden changes in luminance, motion patterns created by waterfalls, waving trees, water ripples, fountains and shadows, the addition or removal of objects in the background, the opening or closing of doors, etc. Foreground extraction results are quite good for videos with a static background; however, accuracy suffers for non-static backgrounds. Surveillance videos in real-time scenarios have dynamic backgrounds and should be carefully modeled.
The simplest strategy to detect foreground regions is the frame differencing method, where the difference between the pixels in the current frame and those in a previous frame is calculated, with the previous frame acting as the background model. Clearly, this approach, although very simple, is prone to many false positive errors due to global illumination changes and other factors. As pioneering work, various simple background modeling techniques, such as simple averaging [39], the median [19], or histogram analysis over time [46], have been introduced. However, they are highly prone to error in various unconstrained real-world situations involving background clutter, outdoor dynamics, etc. In most of the previous works, the background model is updated at regular intervals for accurate results. Deng and Guo [9] proposed a self-adaptive background modeling method that improved performance by updating the Gaussian components of the background model at regular intervals.
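As a concrete point of reference for these simple models, the following is a minimal NumPy sketch of a running-average baseline in the spirit of [39]; the function name and the update rate are illustrative, not taken from the cited work.

```python
import numpy as np

def update_running_average(background, frame, alpha=0.02):
    """Exponential running-average background model (cf. [39]).

    background, frame: float32 grayscale images of equal shape.
    alpha: update rate (illustrative); larger values absorb scene
    changes into the background faster.
    """
    return (1.0 - alpha) * background + alpha * frame

# Usage sketch: seed with the first frame, then update frame by frame.
# bg = first_frame.astype(np.float32)
# for f in frames:
#     bg = update_running_average(bg, f.astype(np.float32))
```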
In order to cope with the different challenges of real-time scenarios, many authors have proposed methods with multiple background models rather than a single background model. Recently, Sajid and Cheung [36] proposed a universal multimode background subtraction method in which the training image is treated as a background model. Multiple color spaces are used for background subtraction depending on the lighting conditions: the RGB and Y color channels are used under poor lighting conditions, the Cb and Cr channels of the YCbCr color space are used under good lighting conditions, and both the YCbCr and RGB color spaces are used for foreground/background classification under intermediate lighting conditions.
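The lighting-dependent channel selection just described can be summarized schematically as follows; this is only an illustrative sketch, and the luminance thresholds and function name are assumptions, not values from [36].

```python
def select_channels(mean_luma, low=50.0, high=180.0):
    """Lighting-dependent channel selection in the spirit of UMBS [36].

    mean_luma: average Y (luma) of the frame in [0, 255].
    The thresholds `low`/`high` and this helper are illustrative
    assumptions, not values taken from [36].
    """
    if mean_luma < low:          # poor lighting: RGB plus Y
        return ["R", "G", "B", "Y"]
    if mean_luma > high:         # good lighting: chroma channels
        return ["Cb", "Cr"]
    return ["R", "G", "B", "Y", "Cb", "Cr"]  # intermediate: both spaces
```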
Several attempts have been made to provide neural-network-based background modeling techniques for foreground extraction. Gregorio and Giordano [8] proposed a background estimation method based on a weightless neural network, termed BEWiS. This method has a fast learning rate and is highly applicable to video surveillance of both long-term and live videos. Babaee et al. [1] proposed a deep learning method for background modeling based on a convolutional neural network that outperforms various other techniques.
In general, deep learning methods differ from all traditional approaches: they automatically learn features directly from the raw pixels that act as input to the network. In [17], the authors attempted to retrieve images using a deep autoencoder network and were successful on a large dataset containing about 80 million tiny images. In general, a major advantage of deep learning also lies in the fact that the performance of the system does not degrade as the amount of data increases, unlike other techniques. Moreover, Wang et al. [43] used autoencoders for data visualization applications. Also, the unsupervised pretraining used in the autoencoder acts as a pre-conditioner that optimizes the initial parameters, thereby making the network more effective for modeling the background image. Motivated by these factors, in this paper a deep-learning-based background modeling technique is proposed. The proposed method outperforms most other state-of-the-art methods on a challenging dataset. The robustness of the approach in modeling a dynamic background is demonstrated through abandoned object detection in remote video surveillance.
This paper is organized as follows. Section 2 presents the various works related to background modeling, Section 3 explains the proposed technique in detail, Section 4 elaborates on the applications of the proposed technique, Section 5 discusses the performance of the proposed technique along with its application, and Section 6 concludes the paper.
2 Related works
A plethora of work has been published on background modeling in the past decades. In general, based on the underlying methodology, background modeling can be classified into two main categories, namely pixel-based and region-based methods.
Pixel-wise methods are based on the assumption that the observation sequence of each pixel is independent of the others. These methods make use of statistical information over time to model the background. Wren et al. [44] modeled the background at each pixel location with a single Gaussian distribution. The single-Gaussian approach was then extended to mixture of Gaussians (MoG) models. Stauffer and Grimson [38] proposed the Gaussian mixture model (GMM), which models every pixel with a mixture of K Gaussian functions. An improvement to the GMM was then made using an online expectation maximization (EM) algorithm to initialize the parameters of the background model, but it suffers from high time complexity. Zivkovic [48] proposed an adaptive GMM (AGMM) to efficiently update the parameters of the GMM. This method showed improved performance in outdoor scenarios, but the need for a solution for dynamic background videos still prevailed. Elgammal et al. [11] developed statistical representations of the background and the foreground by utilizing nonparametric kernel density estimation. In order to efficiently handle backgrounds with large variations, Bayesian approaches have been introduced [3, 20]. Li et al. [20] introduced a Bayesian framework that incorporates spectral, spatial and temporal features to characterize the background appearance. Benedek et al. [3] proposed an adaptive shadow model, which improved performance for scenes under difficult lighting and coloring effects and in shadow regions.
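Returning to the per-pixel MoG idea above, the following is a simplified sketch in the spirit of Stauffer and Grimson [38]; the matching rule, learning rate and replacement heuristic are condensed illustrations, not the exact updates of [38].

```python
import numpy as np

def mog_pixel_update(x, mu, var, w, alpha=0.01, k_sigma=2.5):
    """Simplified per-pixel mixture-of-Gaussians update (cf. [38]).

    x: scalar pixel value; mu, var, w: arrays of K component parameters.
    A pixel matches a component if it lies within k_sigma standard
    deviations; all numeric parameters are illustrative.
    """
    match = np.abs(x - mu) < k_sigma * np.sqrt(var)
    if match.any():
        k = int(np.argmax(match))             # first matching component
        mu[k] += alpha * (x - mu[k])          # pull mean toward the sample
        var[k] += alpha * ((x - mu[k]) ** 2 - var[k])
    else:
        k = int(np.argmin(w))                 # replace the weakest component
        mu[k], var[k] = x, 100.0              # illustrative initial variance
    w[:] = (1.0 - alpha) * w                  # decay all weights...
    w[k] += alpha                             # ...and boost the matched one
    w[:] /= w.sum()                           # renormalize
    return mu, var, w
```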
Barnich and Droogenbroeck [2] presented a pixel-based nonparametric algorithm, termed ViBe, that detects the foreground using a random selection strategy. A few limitations of this method were later observed and corrected by Van Droogenbroeck and Paquot [10]. Their modifications include inhibition of propagation for some pixels in the updating mask, changes to the distance function and thresholding criterion, proper filtering of connected components, and so on. Lin et al. [23] proposed a method to initialize the background using a probabilistic Support Vector Machine (SVM). Output probabilities for each pixel are calculated using the SVM to classify the pixel as foreground or background, and these are used in the construction of the background model. The process of background initialization continues until no more new pixels are available for classification. Culibrk et al. [7] proposed a background segmentation method that relies on a Probabilistic Neural Network (PNN) architecture for background modeling and uses an unsupervised Bayesian classifier to classify pixels as foreground or background.
Kim et al. [15] presented a new adaptive background subtraction algorithm based on a codebook, in which each pixel is quantized into the codebook. This method achieves robust detection for compressed videos and can handle scenes containing moving backgrounds or illumination variations with limited memory. Although pixel-based background modeling methods can effectively obtain detailed shapes of foreground objects, they are easily affected by noise, illumination changes, and dynamic backgrounds.
Region-based methods make use of spatial information, in the form of image regions, to exploit the spatial dependencies of pixels in the background. Matsuyama et al. [32] proposed a correlation-based regional block matching method for background subtraction which is robust against varying illumination conditions. A few region-based methods make use of texture or pattern information to detect moving objects in the scene [13, 21, 24, 25, 35]. Li and Leung [21] used the difference in texture information between two adjacent frames to detect moving objects in the scene. Here, a discriminative texture feature called the local binary pattern (LBP), proposed by Ojala et al. [35], was used along with histogram information for modeling the background and detecting moving foreground objects. Moreover, such pattern information is also widely used in other applications, such as activity recognition in video surveillance. Bhargava et al. [5] proposed a framework based on contextual information for detecting abandoned baggage. Liu et al. [24] presented a method for activity recognition that uses temporal patterns as a mid-level feature representation of activities. This work was further extended in [25] for the recognition of complex activities by including adaptive Multi-Task Learning as an additional component, which captures the relatedness among activities and selects discriminative features. In [41], Toyama et al. proposed the Wallflower algorithm, a three-component system in which the region-level component segments the homogeneous regions of foreground objects. Maddalena and Petrosino [28] proposed a background modeling technique based on a new self-organizing method that classifies foreground and background by learning motion patterns. Furthermore, a neural-network-based mapping method is used to update the initial background model. In [47], Zhao et al. proposed a Stacked Multilayer Self-Organized Map (SMSOM) method for background modeling, in which every pixel is modeled by an SMSOM and spatial consistency is considered at each layer. This method has several merits, including a strong representational ability to learn background models for challenging scenarios and automatic determination of most network parameters. A basic sketch of the LBP feature used in the texture-based methods above is given below.
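The following is a minimal sketch of the basic 8-neighbour LBP code; the multiresolution, rotation-invariant variant of Ojala et al. [35] adds machinery that is omitted here.

```python
import numpy as np

def lbp8(img):
    """Basic 8-neighbour local binary pattern code per pixel (cf. [35]).

    img: 2-D grayscale array. Each interior pixel gets an 8-bit code where
    bit k is set if neighbour k is >= the center; borders are left zero.
    """
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.uint8)
    center = img[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        out[1:-1, 1:-1] |= ((neighbour >= center).astype(np.uint8) << bit)
    return out
```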
More recently, deep learning has proved effective in image processing and has achieved superior capabilities in classification, representation learning and several other applications [12, 18, 26, 30, 34]. In [30], Marsden et al. proposed a residual deep learning architecture with multiple objectives for violent behavior detection, crowd counting, and crowd density classification. In [34], Muhammad et al. presented a CNN architecture for fire detection in a video surveillance application. Though this deep learning method performs better than the AlexNet architecture for fire detection, it suffers from a higher false alarm rate. Guo et al. [12] presented a deep learning approach for object retrieval that learns multiple deep features of a visual object in surveillance videos. Moreover, an advantage also lies in the adaptability of such systems to perform well in various applications based on their design functions. Motivated by these works, in this paper an unsupervised deep learning technique is used to model the background, which can be efficiently used in surveillance applications.
3 Methodology
3.1 Overview
A background model is a preliminary component for foreground extraction. In this paper, a method for background modeling is proposed that builds a deep network by stacking layers of autoencoders which capture the underlying background model. The deep learning network is capable of providing a latent representation of the input pattern with its hidden layers. The first T incoming frames are used for modeling the background using the deep learning architecture. With sufficient training, the output of the network gives the background model of the input frames presented to the network. An overview of the proposed architecture for background modeling is depicted in Fig. 1. The details of the training of the network and the associated background modeling are presented in the next subsection.
The deep learning architecture used herein for background modeling belongs to the class of unsupervised learning. In this work, an adaptive deep learning architecture is used that transforms the images into a low-dimensional code and decodes it back to the same input image. The architecture of the deep learning network is presented in Fig. 2. The network consists of a single input layer, a single output layer and three fully connected hidden layers stacked from two autoencoder networks. The approach involves training one layer at a time, in an unsupervised way, while freezing the parameters of the other layers. To do this, the raw input is given to the input of the first autoencoder, which transforms the raw input into a vector of lower dimension. This lower-dimensional vector is the output of the hidden layer of the autoencoder. The weights between the input and hidden layer are updated during this encoding stage of training the first autoencoder. The weight matrix of the decoding stage is the transpose of the weight matrix of the encoding stage. This procedure is repeated for the subsequent network, with the output of the hidden layer of the first autoencoder acting as the input of the next autoencoder. The second autoencoder provides an additional level of compact encoding.
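A control-flow sketch of this greedy layer-wise procedure is given below, assuming a helper `train_rbm` that trains a single layer (one such routine is sketched in the CD-1 example later in this section); all names and parameters are illustrative.

```python
import numpy as np

def greedy_layerwise_pretrain(data, layer_sizes, train_rbm):
    """Greedy layer-wise pretraining of stacked autoencoders (sketch).

    data:        (n_samples, n_features) array of flattened frames/patches.
    layer_sizes: hidden-layer widths, e.g. [1024, 256] (illustrative).
    train_rbm:   assumed helper that trains one layer and returns
                 (weights, hidden_bias); decoding reuses the transposed
                 weights, as described in the text.
    """
    layers, x = [], data
    for n_hidden in layer_sizes:
        w, b = train_rbm(x, n_hidden)            # train this layer only
        x = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # hidden codes feed the next layer
        layers.append((w, b))
    return layers
```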
The two autoencoders are trained sequentially. The basic structure of the autoencoder is shown in Fig. 3. Based on Fig. 2, during the first phase of learning, the image frames are used to train the first autoencoder network, which has three layers, namely the input layer (x), the output layer (x̂) and the hidden layer (h1). For the second autoencoder network, hidden layer (h1) acts as both the input and output layer, while layer (h2) acts as its hidden layer. The autoencoder used is a Restricted Boltzmann Machine (RBM).
Training the RBM involves representing the input with visible units, a positive phase, a negative phase, and updates of the weights and biases. In the positive phase, given the input to the visible units ($v_i^+$) of the RBM, the hidden unit states ($h_j^+$) of the RBM are updated. The individual activation probability of a hidden unit is given by

$$P\left(h_j \mid v\right) = \operatorname{sigmoid}\Big(b_j + \sum_i w_{ij}\, v_i\Big) \qquad (1)$$
where sigmoid(·) is the logistic function, $b_j$ is the bias contributing to the hidden unit and $\{w_{ij}\}$ is the set of weights associated with the hidden unit. In the negative phase, the reconstruction of the visible units $v_i^-$ and the hidden units $h_j^-$ is computed. The individual activation probability of a visible unit is given by

$$P\left(v_i \mid h\right) = \operatorname{sigmoid}\Big(a_i + \sum_j w_{ij}\, h_j\Big) \qquad (2)$$

where sigmoid(·) is the logistic function, $a_i$ is the bias contributing to the visible unit and $\{w_{ij}\}$ are the weights associated with it.
In the final step, the weights are updated. Given a training set of input frames, the visible states $v_i^+$ and hidden states $h_j^+$ are sampled from the data distribution, while $v_i^-$ and $h_j^-$ are the reconstructed visible and hidden states. The change in weight is given by

$$\Delta W_{ij} = \varepsilon\left(\langle v_i^+ h_j^+ \rangle - \langle v_i^- h_j^- \rangle\right) \qquad (3)$$

where $\varepsilon$ is the learning rate. Once the weights are pre-trained with the RBM, a back-propagation algorithm is used for fine-tuning.
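A minimal NumPy sketch of one such pre-training update (a single step of contrastive divergence, CD-1, implementing Eqs. (1)-(3)) follows; the learning rate and sampling details are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v_pos, W, a, b, eps=1e-4, rng=np.random.default_rng(0)):
    """One CD-1 update for a binary RBM, following Eqs. (1)-(3).

    Shapes: v_pos (n, nv), W (nv, nh), a (nv,), b (nh,).
    eps is an illustrative learning rate.
    """
    # Positive phase, Eq. (1): hidden probabilities given the data.
    h_pos = sigmoid(v_pos @ W + b)
    h_sample = (rng.random(h_pos.shape) < h_pos).astype(v_pos.dtype)
    # Negative phase, Eq. (2): reconstruct visibles, then hiddens again.
    v_neg = sigmoid(h_sample @ W.T + a)
    h_neg = sigmoid(v_neg @ W + b)
    # Weight/bias updates, Eq. (3): data minus model correlations.
    n = v_pos.shape[0]
    W += eps * (v_pos.T @ h_pos - v_neg.T @ h_neg) / n
    a += eps * (v_pos - v_neg).mean(axis=0)
    b += eps * (h_pos - h_neg).mean(axis=0)
    return W, a, b
```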
In general, the standard method of updating the weights and biases of a neural network is backpropagation. Here, the conjugate gradient method [6] is used for back propagation. Consider a neural network having one input layer, one output layer and one or more hidden layers. Let $W_{ij}^L$ be the weight associated with the connection between the $i$th node in layer $L-1$ and the $j$th node in layer $L$. During the forward pass, the activations ($Q^L$) in each node of all the layers in the network are computed, where $Q$ denotes the activations and $L$ denotes the output layer. After the forward pass, the error at the output layer is computed; the error signal is denoted by $\delta^L$:

$$\delta^L = Q^L - y \qquad (4)$$
where $y$ is the desired output, which in this case is the current input frame from the training dataset, since the network reconstructs its input. The error is then propagated back through the network into the hidden layers, where the error at each hidden layer, $\delta^{L-1}, \delta^{L-2}, \ldots, \delta^{2}$, is computed as

$$\delta^{L-1} = \big(W^{L}\big)^{T} \delta^{L} \odot f'\big(z^{L-1}\big) \qquad (5)$$

where $f'$ is the derivative of the activation function and $z^{L-1}$ denotes the pre-activations of layer $L-1$.
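As a minimal sketch of this error propagation, the following computes the per-layer signals of Eqs. (4)-(5), assuming sigmoid activations and row-vector conventions; the conjugate-gradient optimizer [6] that consumes these gradients is omitted.

```python
import numpy as np

def backprop_deltas(output, zs, weights, y):
    """Per-layer error signals, following Eqs. (4)-(5).

    output:  network output Q^L, shape (n, d_out).
    zs:      pre-activations of each non-input layer, [z^2, ..., z^L].
    weights: weight matrices; weights[-1] is W^L between layers L-1 and L.
    y:       target; for the autoencoder this is the input frame itself.
    """
    def sig_prime(z):
        s = 1.0 / (1.0 + np.exp(-z))
        return s * (1.0 - s)

    deltas = [output - y]                                  # Eq. (4): delta^L
    for W, z in zip(reversed(weights), reversed(zs[:-1])):
        deltas.append((deltas[-1] @ W.T) * sig_prime(z))   # Eq. (5)
    return deltas[::-1]                                    # [delta^2, ..., delta^L]
```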
This method is effective and can be used when training neural networks with large amounts of data. In general, the deep learning architecture trains itself by minimizing the error between the input image and its reconstruction. The choice of learning rate influences how quickly a stationary or moving foreground object is absorbed into the background model: larger learning rates cause the network to learn changes in the foreground faster, while lower learning rates make the network slower to adapt to sudden changes in the background. When trained on sets of incoming surveillance video frames in an unsupervised way, the deep learning architecture can probabilistically reconstruct the inputs by learning the input frames from the video.
In foreground detection, pixels are labeled as background or foreground. The background modeled image is used in background subtraction to extract the foreground object. To obtain the foreground pixel region F(x, y), the incoming image I(x, y) is subtracted from the background modeled image B(x, y), and a threshold H is applied to the absolute difference to classify each pixel as foreground or background. Based on (12), foreground pixels are labeled as ones and background pixels as zeros:

$$F(x, y) = \begin{cases} 0, & |I(x, y) - B(x, y)| < H \\ 1, & |I(x, y) - B(x, y)| \ge H \end{cases} \qquad (12)$$
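A one-line NumPy realization of Eq. (12) is sketched below; the threshold value is illustrative, as the paper does not fix one here.

```python
import numpy as np

def foreground_mask(frame, background, H=30):
    """Label pixels per Eq. (12): 1 = foreground, 0 = background.

    Casting to int16 avoids uint8 wrap-around in the difference;
    H is an illustrative threshold.
    """
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return (diff >= H).astype(np.uint8)
```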
4 Application
The background model obtained using the deep learning approach is highly reliable in real-time applications, namely unattended object detection in video surveillance. The following subsection discusses this application in detail.
The proposed method is highly suitable for detecting unattended objects. Fig. 4 shows the general framework of unattended object detection. Here, the incoming video frames are used to train the deep learning network, which generates a background model at a regular periodic interval (T). Assuming that the static unattended object is not present while training on the initial set of incoming frames, the initial reference background is generated from frame numbers $P_k$ to $P_{k+T-1}$, where k = 0, 1, 2, 3, ... up to T-1; $P_0$ denotes the first frame, $P_1$ the second frame, and so on. The initial background model ($B_M$), generated by training on T frames, is termed the initial reference background model. The background models generated by training the deep learning autoencoder network compress the incoming frames into a latent space representation during the encoding stage and reconstruct them to generate a background model image during the decoding stage. Pre-training is done using the RBM to initialize the weights, and fine-tuning is done using backpropagation; an iterative procedure is followed to make the network learn the weights and biases. The details of network training are discussed in Section 3.2. Continuous training on the incoming video frames is performed, and the first updated background model ($B_{M+1}$) is obtained after training on 2T frames. This process continues for every set of incoming frames in the surveillance video, and each updated background model is obtained after training on a multiple of T frames. A scheduling sketch of this periodic retraining is given below.
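As a minimal sketch of this schedule, the following loop assumes a `train_background` routine standing in for the autoencoder training described above, mapping a list of frames to a background model image; the default T and the subsampling factor mirror the i-LIDS settings reported in Section 5.

```python
def periodic_background_models(frames, train_background, T=600, subsample=3):
    """Periodic background-model generation (Section 4 framework, sketch).

    Emits B_M after T frames, B_{M+1} after 2T frames, and so on;
    `train_background` is an assumed stand-in for the deep network.
    """
    models, seen = [], []
    for i, frame in enumerate(frames):
        if i % subsample:            # frame-rate reduction to cut cost
            continue
        seen.append(frame)
        if len(seen) % T == 0:       # after T, 2T, 3T, ... frames
            models.append(train_background(seen))
    return models                    # models[0] is the reference model B_M
```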
Objects that are in motion during the training of the background model do not influence the output much. Once the reference background model ($B_M$) and the updated background models ($B_{M+1}, B_{M+2}, \ldots, B_{M+n}$) are generated, an object that remains static in the scene becomes learned into the updated background models. It can be detected by subtracting each of the updated background models from the reference background model ($B_M$) in sequential order, a process termed "dual background subtraction". The binary mask of the unattended object is obtained via thresholding, and a rectangular blob is drawn around the foreground region to mark the detected abandoned object.
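The dual background subtraction step can be sketched as follows; the threshold and the persistence criterion are illustrative assumptions rather than parameters specified by the paper.

```python
import numpy as np

def detect_static_objects(reference_bm, updated_bms, H=30, persistence=2):
    """Dual background subtraction sketch: compare each updated model
    against the reference B_M; flag a region as an unattended object when
    it persists across `persistence` consecutive updated models.
    H and persistence are illustrative parameters.
    """
    masks = [np.abs(bm.astype(np.int16) - reference_bm.astype(np.int16)) >= H
             for bm in updated_bms]
    persistent = np.ones_like(masks[0], dtype=bool)
    for m in masks[-persistence:]:      # object must appear in the latest models
        persistent &= m
    return persistent.astype(np.uint8)  # binary mask; draw a box around blobs
```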
This technique can also be used for intruder detection in remote surveillance. The work can further be extended to activity recognition, similar to the methods proposed by Liu et al. [24, 25], by treating the temporal pattern of each activity as the feature for training the network. In order to enhance efficiency and compete with the adaptive Multi-Task Learning approach [25], an additional stream of a deep learning classification network can be trained by extracting features from the various activities to be classified.
5 Experimental results
In this section, the robustness of the algorithm for unattended object detection and intruder detection in remote video surveillance applications is demonstrated through the results obtained. The algorithm was implemented in MATLAB on a PC equipped with an AMD processor at 3.90 GHz and 8 GB of RAM.
The learning rate and the number of iterations play a major role in determining the accuracy of the proposed approach. Initially, the learning rate is selected using the AVSS 2007 abandoned bag dataset. In order to reduce the complexity of the network, the incoming video frames are downsized and partitioned into fixed patches of size 32 × 32. Three parameters, namely specificity, precision and F-measure, are used to evaluate the choice of learning rate. Fig. 5 depicts the results for these parameters at different learning rates.
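The patch partitioning described above can be sketched as follows; it assumes frame dimensions that are multiples of the patch size, which the resizing step can guarantee.

```python
import numpy as np

def to_patches(frame, p=32):
    """Partition a (H, W) frame into non-overlapping p x p patches and
    flatten each to a vector before feeding the network.

    Assumes H and W are multiples of p (ensured by the resize step).
    """
    H, W = frame.shape
    patches = (frame.reshape(H // p, p, W // p, p)
                    .swapaxes(1, 2)
                    .reshape(-1, p * p))
    return patches  # shape: (num_patches, 1024) for p = 32
```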
Based on the results, a learning rate of 0.0001 was selected. Being an iterative approach, the algorithm converges as the number of epochs increases from its initial value, and the accuracy remains roughly constant with further increases. Based on the experimental evaluation, 25 epochs are used for modeling the background image. The next subsection discusses the performance evaluation of the proposed method in the static environment.
To evaluate the performance of the proposed method, the CDNET 2014 video dataset, which includes a variety of indoor and outdoor environments, is used for qualitative and quantitative analysis. The dataset is significant as it includes videos covering a wide range of detection challenges distributed over various categories, viz. indoor, outdoor, night, thermal, turbulence, etc. The CDNET datasets provide ground-truth data and evaluation tools on their official homepage. Visual comparison of the results with a few state-of-the-art algorithms is done in the qualitative analysis. Fig. 6 shows the background model for a sequence from each category. Fig. 7 compares the results of the proposed algorithm with state-of-the-art algorithms, namely UMBS [36], Cp3 [22], MScale [27] and RMoG [42]. To reduce computational delay, each input image is resized to a lower dimension before being applied to the deep learning network. A single video frame from each category is used for the visual analysis. The first column displays the category name, the name of the scenario and the frame number; the input and ground truth of the selected image are shown in the second and third columns; and columns four to eight illustrate the results of the proposed algorithm and the above-mentioned state-of-the-art methods.
[Fig. 5: evaluation of accuracy, precision, specificity and F-measure for different learning rates on the AVSS 2007 dataset.]
Quantitative evaluation is done by comparing the performance metrics obtained by the proposed method with those of state-of-the-art algorithms, including UMBS [36], Cp3 [22], MScale [27], SC_SOBS [29], BMOG [31], GMM-Zivkovic [48], Euclidean distance [4] and Simplified SOBS [37]. Figures 8, 9 and 10 depict the performance comparison for the metrics Specificity, Percentage of Wrong Classifications (PWC) and Precision, respectively. Table 1 presents the overall average results obtained by the proposed algorithm over all the videos of each category. Equal importance has been given to all the metrics in the evaluation. For the specificity metric, the proposed algorithm ranks in the top two for three categories; in two categories, namely night video and dynamic background, it ranks first, with values of 0.9889 and 0.9980 respectively.
For the False Positive Rate (FPR), the proposed algorithm ranks in the top two for 3 of the 10 categories; in the night video category, it ranks first with a value of 0.01108. For the PWC metric, the night video and intermittent object motion categories rank first and second best among all categories. For Precision, the proposed algorithm ranks in the top three for 5 of the 10 categories.
From Table 1, it can be inferred that the overall average recall is 0.5449 and the overall average specificity is 0.9918. For the specificity, PWC and FPR metrics, the dynamic background category is the best among the 10 categories. For FPR, FNR and PWC, the minimum value is the best result. For the FNR metric, the baseline category has the minimum value and is rated the best among the categories. The baseline category also ranks best for the precision and F-measure metrics. Table 2 provides the overall comparison of the 10 categories of the CDNET 2014 dataset against the state-of-the-art algorithms.
The proposed method is found to be second best for the Specificity, FPR and PWC metrics. Universal Multimode Background Subtraction [36] ranks best for most of the metrics. However, this method, which uses multiple background models and color spaces for the foreground detection task, may not be suitable for real-time application because of the complexity of the multiple tasks that must be performed before the foreground pixels are extracted: background model generation, binary mask aggregation/fusion and binary mask pruning make the system more complex and time-consuming when compared with the proposed method.

[Figs. 6 and 7 show one sequence per category: Thermal (library #1156), Camera jitter (traffic #964), Shadow (peopleInShade #358), Night videos (boulevard #910), Baseline (highway #721), Dynamic background (fountain02 #1282), Bad weather (blizzard #3362), Intermittent object motion (streetLight #1498), Low framerate (tunnelExit_0_35fps #2081) and Turbulence (turbulence1 #2179).]

From Table 2,
based on the overall evaluation, it is seen that the proposed method, SC_SOBS [29] and Universal Multimode Background Subtraction [36] work better than the other approaches. Although the proposed algorithm performs generally well on certain indoor and outdoor categories, there remains room for improvement in a few challenging categories.
[Fig. 8: comparison chart of average specificity values for the CDNET 2014 dataset across the ten categories.]
In this subsection, the performance of the algorithm for the application of unattended object detection is presented. Several standard databases, namely the AVSS 2007 i-LIDS AB dataset, the PETS 2006 dataset and the CAVIAR dataset, are used for this evaluation task.
The i-LIDS AB abandoned object video sequence consists of 5474 frames, and the scene under consideration starts from frame number 252. The initial reference background
model is obtained by training on frame numbers 251 to 850.

[Fig. 9: comparison chart of the average Percentage of Wrong Classifications (PWC) for the CDNET 2014 dataset.]

[Fig. 10: comparison chart of average precision values for the CDNET 2014 dataset.]

On continuous training, the
subsequent outputs, obtained by training frames at periodic intervals in multiples of T frames, act as the updated background models. The value of T is set to 600 and, in order to reduce the computational cost, the incoming video sequence is subsampled in the time domain (frame rate reduction) by a factor of 3. The initial reference background is subtracted from the updated background models in order to detect any newly static object. If the object is detected in consecutive updated background models, it is termed an 'unattended object'. The bag remains stationary from around frame number 2012 to frame number 4805 in the original video, and the proposed method captures the unattended object. The results are shown in Fig. 11. As one can observe, transient events such as humans crossing the area under surveillance have no impact on the algorithm.
The PETS 2006 dataset S1-T1-C sequence contains 3020 frames; a man enters the scene and leaves his bag completely unattended from frame number 1915 until the end of the sequence. The interval T used for training the deep learning network is 600 and the incoming frames are subsampled by a factor of 3. Fig. 12b gives the initial background model, followed by the updated background models generated by the network. Fig. 12c shows the 3rd updated background model, obtained by training the background model along with frame numbers 1801 to 2400. Fig. 12d shows the binary mask, with white pixels indicating the presence of the unattended object, and a rectangular blob is drawn around the foreground unattended object, as seen in Fig. 12e. The same procedure is followed for the PETS 2006 dataset S5-T1-G sequence, and the results are shown in Fig. 13.

[Table 1: performance evaluation results of the proposed algorithm on the CDNET 2014 dataset. Table 2: performance comparison of various state-of-the-art algorithms on the CDNET 2014 dataset.]
The algorithm is also tested on the CAVIAR dataset, and the results for the LeftBox sequence are discussed below. The LeftBox sequence contains 863 frames in total. The value of T is set to 200, and the results for the LeftBox sequence are shown in Fig. 14.
[Fig. 11: results on the AVSS 2007 i-LIDS AB dataset — (c) 4th updated background model ($B_{M+4}$), (d) binary mask, (e) rectangular blob on the unattended foreground; (f) 6th updated background model ($B_{M+6}$), (g) binary mask, (h) rectangular blob on the unattended foreground.]

6 Conclusion

In this paper, a deep learning architecture is used to develop a background model for foreground extraction. Here, the network is trained with an input video sequence in order to
generate an output which is equivalent to a background model. Pretraining the deep network is
done with the RBM, and backpropagation is used for fine-tuning the weights. Finally, the algorithm is validated in the application of unattended object detection. Experiments on the benchmark dataset CDNET 2014 confirm that the algorithm performs reasonably well compared to the state-of-the-art methods across different scenarios and can be reliably used for real-world video surveillance applications.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
References
1. Babaee M, Dinh DT, Rigoll G (2018) A deep convolutional neural network for video sequence background subtraction. Pattern Recogn 76:635–649
2. Barnich O, Droogenbroeck MV (2011) ViBe: A Universal Background Subtraction Algorithm for Video
Sequences. IEEE Trans Image Process 20(6):1709–1724
3. Benedek C, Sziranyi T (2008) Bayesian foreground and shadow detection in uncertain frame rate surveil-
lance videos. IEEE Trans Image Process 17(4):608–621
4. Benezeth Y, Jodoin P-M, Emile B, Laurent H, Rosenberger C (2010) Comparative study of background
subtraction algorithms. J Electron Imaging 19(3)
5. Bhargava M, Chen C-C, Ryoo M, Aggarwal J (2009) Detection of object abandonment using temporal
logic. Mach Vis Appl 20(5):271–281
6. Charalambous C (1992) Conjugate gradient algorithm for efficient training of artificial neural networks. IEE
Proceedings G - Circuits, Devices and Systems 139(3):301–310
7. Culibrk D, Marques O, Socek D, Kalva H, Furht B (2007) Neural network approach to background modeling for video object segmentation. IEEE Trans Neural Netw 18(6):1614–1627
8. De Gregorio M, Giordano M (2017) Background estimation by weightless neural networks. Pattern Recogn
Lett 96. https://doi.org/10.1016/j.patrec.2017.05.029
9. Deng G, Guo K (2014) Self-adaptive background modeling research based on change detection and area
training. Proceedings of IEEE Workshop on Electronics, Computer and Applications, Ottawa, pp. 59-62
10. Droogenbroeck MV, Paquot O (2012) Background subtraction: experiments and improvements for ViBe.
In: Proceedings of IEEE Comput. Soc. Conf. Comput.Vis. Pattern Recognit. Workshops, pp. 32-37
11. Elgammal A, Duraiswami R, Harwood D, Davis LS (2002) Background and foreground modeling using nonparametric kernel density estimation for visual surveillance. Proc IEEE 90(7):1151–1163
12. Guo H, Wang J, Lu H (2016) Multiple deep features learning for object retrieval in surveillance videos. IET
Comput Vis 10(4):268–271. https://doi.org/10.1049/iet-cvi.2015.0291
13. Heikkilä M, Pietikäinen M (2006) A texture-based method for modeling the background and detecting
moving objects. IEEE Trans Pattern Anal Mach Intell 28(4):657–662
14. Kamijo S, Matsushita Y, Ikeuchi K, Sakauchi M (2000) Traffic monitoring and accident detection at
intersections. IEEE Trans Intell Transp Syst 1(2):108–118
15. Kim K, Chalidabhongse T, Harwood D, Davis L (2004) Background modeling and subtraction by codebook
construction. In: Proceedings of IEEE International Conference on Image Processing, ICIP
16. Krahnstoever N, Tu P, Sebastian T, Perera A, Collins R (2006) Multiview detection and tracking of travelers
and luggage in mass transit environments. In: Proceedings of Int. Workshop Performance Eval. Tracking
Surveillance, pp. 67–74
17. Krizhevsky A, Hinton GE (2011) Using very deep autoencoders for content-based image retrieval. In:
Proceedings of 19th ESANN, Bruges, pp 27-29
18. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural
networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems -
Volume 1 (NIPS'12)
19. Laugraud B, Piérard S, Braham M, Broeck MVD (2015) Simple median-based method for stationary
background generation using background subtraction algorithms. In Proceedings of ICIAP
20. Li L, Huang W, Gu IY-H, Tian Q (2004) Statistical modeling of complex backgrounds for foreground object
detection. IEEE Trans Image Process 13(11):1459–1472
21. Li L, Leung MKH (2002) Integrating intensity and texture differences for robust change detection. IEEE
Trans Image Process 11(2):105–112
22. Liang D, Kaneko S, Hashimoto M, Iwata K, Zhao X (2015) Co-occurrence probability-based pixel pairs
background model for robust object detection in dynamic scenes. Pattern Recogn 48(4):1374–1390
23. Lin H, Liu T, Chuang J (2002) A probabilistic SVM approach for background scene initialization. In:
Proceedings of the International Conference on Image Processing, ICIP, pp. 893–8963
24. Liu Y, Nie L, Han L, Zhang L, Rosenblum DS (2015) Action2Activity: recognizing complex activities from
sensor data. In: Proceedings of the 24th international conference on artificial intelligence (IJCAI'15). AAAI
Press, pp 1617–1623
25. Liu Y, Nie L, Liu L, Rosenblum DS (2016) From action to activity: sensor-based activity recognition.
Neurocomputing 181:108–115
26. Liu T, Stathaki T (2017) Enhanced pedestrian detection using deep learning based semantic image
segmentation. In: Proceedings of 22nd International Conference on Digital Signal Processing (DSP),
London, United Kingdom, 2017, pp. 1-5
27. Lu X (2014) A multiscale spatio-temporal background model for motion detection. In: Proceedings of IEEE Int. Conference on Image Processing (ICIP)
28. Maddalena L, Petrosino A (2008) A self-organizing approach to background subtraction for visual surveillance applications. IEEE Trans Image Process 17(7):1729–1736
29. Maddalena L, Petrosino A (2012) The SOBS algorithm: what are the limits? In: Proceedings of Computer Vision and Pattern Recognition Workshops
30. Marsden M, McGuinness K, Little S, O'Connor NE (2017) ResnetCrowd: A residual deep learning
architecture for crowd counting, violent behaviour detection and crowd density level classification. In:
Proceedings of IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS),
Lecce, Italy, pp. 1-7
31. Martins I, Carvalho P, Corte-Real L, Alba-Castro JL (2017) BMOG: Boosted Gaussian Mixture Model with
Controlled Complexity. Pattern Recognition and Image Analysis. IbPRIA 2017. LNCS, Springer, pp 50-57
32. Matsuyama T, Ohya T, Habe H (2000) Background subtraction for non-stationary scenes. In: Proceedings of
Asian Conference on Computer Vision, pp. 662–667
33. Miron A, Badii A (2015) Change detection based on graph cuts. In: Proceedings of International Conference
on Systems, Signals and Image Processing (IWSSIP), London
34. Muhammad K, Ahmad J, Mehmood I, Rho S, Baik SW (2018) Convolutional neural networks based fire detection in surveillance videos. IEEE Access 6:18174–18183. https://doi.org/10.1109/ACCESS.2018.2812835
35. Ojala T, Pietikäinen M, Mäenpää T (2002) Multiresolution gray-scale and rotation invariant texture
classification with local binary patterns. IEEE Trans Pattern Anal Mach Intell 24(7):971–987
36. Sajid H, Cheung SCS (2017) Universal Multimode Background Subtraction. IEEE Trans Image Process
26(7):3249–3260
37. Sehairi K, Chouireb F, Meunier J (2017) Comparative study of motion detection methods for video
surveillance systems. J Electron Imaging 26(2)
38. Stauffer C, Grimson E (1999) Adaptive background mixture models for realtime tracking. Proceedings of
IEEE Int Conf Comput Vis Pattern Recognit 2:246–252
39. Tang Z, Miao Z, Wan Y (2007) Background Subtraction Using Running Gaussian Average and Frame
Difference. In: Proceedings of Entertainment Computing – ICEC, pp 411-414
40. Tian Y, Wang Y, Hu Z, Huang T (2013) Selective Eigen background for background modeling and
subtraction in crowded scenes. IEEE Trans Circuits Syst Video Technol 23(11):1849–1864
41. Toyama K, Krumm J, Brumitt B, Meyers B (1999) Wallflower: principles and practice of background
maintenance. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, 1
Kerkyra, Greece, pp. 255–261
42. Varadarajan S, Miller P, Zhou H (2013) Spatial mixture of gaussians for dynamic background modelling. In:
Proceedings of IEEE International Conference on Advanced Video and Signal Based Surveillance
43. Wang Y, Yao H, Zhao S (2015) Auto-encoder based dimensionality reduction. Neurocomputing 184:232–242. https://doi.org/10.1016/j.neucom.2015.08.104
44. Wren CR, Azarbayejani A, Darrell T, Pentland AP (1997) Pfinder: Real-time tracking of the human body.
IEEE Trans Pattern Anal Mach Intell 19(7):780–785
45. Yi S, Li H, Wang X (2016) Pedestrian Behavior Modeling From Stationary Crowds With Applications to
Intelligent Surveillance. IEEE Trans Image Process 25(9):4354–4368
46. Zhang S, Yao H, Liu S (2008) Dynamic Background Subtraction Based on Local Dependency Histogram.
In: Proceedings of Eighth International Workshop on Visual Surveillance -VS2008, Marseille
47. Zhao Z, Zhang X, Fang Y (2015) Stacked Multilayer Self-Organizing Map for Background Modeling. IEEE
Trans Image Process 24(9):2841–2850
48. Zivkovic Z (2004) Improved adaptive Gaussian mixture model for background subtraction. In: Proceedings of International Conference on Pattern Recognition (ICPR) 2:28–31
Jeffin Gracewell received his B.E. degree in Electronics and Communication Engineering from Karunya University, Coimbatore, in 2010, and his M.E. degree in Communication Systems from Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, Chennai, in 2012. He is currently a research scholar at the Madras Institute of Technology, Chennai. His current research interests are in deep learning and video and image processing.
Mala John received her M.Sc and M.Tech degrees from the Indian Institute of Technology (IIT) Madras and IIT Delhi respectively, and her Ph.D from Anna University. She has been a faculty member of the Department of Electronics Engineering, Madras Institute of Technology Campus of Anna University, since 1992, where she is presently a Professor. Her interests include communication, signal processing and image processing.