Reference Paper 1
Signal Processing 89 (2009) 1723–1738
doi:10.1016/j.sigpro.2009.03.016

Article history: Received 6 August 2008; received in revised form 9 March 2009; accepted 11 March 2009; available online 5 April 2009.

Keywords: Behaviour understanding; Trajectory; Hidden Markov Model; Support vector machine; Homography

Abstract

In this paper a bottom-up approach for human behaviour understanding is presented, using a multi-camera system. The proposed methodology, given a training set of normal data only, classifies behaviour as normal or abnormal, using two different criteria of human behaviour abnormality (short-term behaviour and trajectory of a person). Within this system a one-class support vector machine decides short-term behaviour abnormality, while we propose a methodology that lets a continuous Hidden Markov Model function as a one-class classifier for trajectories. Furthermore, an approximation algorithm, referring to the Forward-Backward procedure of the continuous Hidden Markov Model, is proposed to overcome numerical stability problems in the calculation of the probability of emission for very long observations. It is also shown that multiple cameras, through homography estimation, provide a more precise position of the person, leading to more robust system performance. Experiments in an indoor environment without uniform background demonstrate the good performance of the system.
efficiency and mainly because motion can be viewed as a short-term stationary signal. Abstract Hidden Markov Models are used by Nguyen et al. in [26] to deal with noise and duration variation, while Wang et al. in [33] use conditional random fields for behaviour recognition in order to be able to model context dependence in behaviours. In our approach we use a continuous HMM to model trajectory, using a methodology that allows the model to be used as a one-class classifier.

Our presented approach focuses on the anomaly detection aspect of behaviour understanding, which differentiates it from the aforementioned methods. However, recent research has provided several anomaly-detection-focused approaches that we briefly review here. These approaches can be classified based on whether they are supervised, semi-supervised or unsupervised.

In [9,10] the authors use supervised approaches that need the classes of both normal and abnormal behaviour to have an adequately large number of labelled instances, provided as a priori information. In our method, on the other hand, the training set consists only of normal instances of data. The semi-supervised method of [36], which only uses normal data, has a different approach in that it creates a set of marginally normal instances as abnormal, to constitute an estimation of the abnormal class. In our work, we have used the derived feature of length-normalized log-probability to define the normal class, without attempting to generate abnormal instances at all. On the other hand, we also take into account motion-based features used in a one-class SVM to detect further abnormalities.

A set of unsupervised methods in the existing literature use large databases [37,6] containing all the observed normal behaviour patterns, matching any new instance against the instances represented in the database. In our work, we have a single composite model (including HMM and SVM classification) for all normal instances, thus avoiding the need for database storage and look-up. Jiang et al. in [18] start by representing normal trajectories by a single HMM model per trajectory, clustering and retraining these HMMs until a given condition holds. Other than the fact that, in the work presented herein, we also cover the case of short-term behaviours besides trajectory, we model the full set of normal trajectories into a single HMM from the beginning; therefore, fewer calculations are required. Lee et al. in [21] use n-cut clustering over motion energy images to determine outliers, which are then judged as abnormal. This approach differs from ours in that it requires repetition of the n-cut clustering whenever a new instance is to be judged. Another approach is found in [22], where a multi-layer finite state machine representation is used to model activities. According to [22], an abnormal activity is judged by the number of times a valid transition fails to be performed when matching the activity to the model state machine. Our approach uses probabilistic tools such as the HMM, instead of finite state machines, to model uncertainty within the modelling of normal activities. In [35], a single feature vector represents position, motion and shape information, which is used in a clustering process to detect abnormality. In our approach we extract separate information for each classifier, attempting to model more precisely two aspects of motion. This kind of modularity allows switching between using one or both classifiers for the detection of either abnormal short-term behaviours, abnormal trajectory, or both. Furthermore, one can use information from each classifier to determine the type of abnormality detected.

In behaviour understanding, only a few works employ homography estimation. Park et al. in [28] have used homography to extract object features and, using spatio-temporal relationships between people and vehicles, extract semantic information from interactions calculated from relative positions. Ribeiro et al. in [30] have estimated homography and enabled an orthographic view of the ground plane, which eliminates the perspective distortion originating from a single camera. Then, they have calculated features in order to classify the data into four activities (active, inactive, walking, running).

In the existing literature two basic assumptions are usually made in order to extract features. The first is that the targets move almost vertically to the camera z-axis, or within a range that is small compared to the distance from the camera. This assumption ensures that the size variation of moving objects is relatively small. The second assumption is that humans are planar objects, so that homography-based image rectification is possible. However, even though this latter assumption may be true when the cameras are close to being vertical to the ground plane, as in the case of cameras viewing from high ceilings, it does not stand in general. In our method we get over these limitations, as can be deduced from the section on homography estimation (Section 4).

3. Proposed methodology

The proposed methodology is based on the fusion of data that we collect from several cameras with overlapping fields of view. We perform classification using two different one-class classifiers, a support vector machine (SVM) and a continuous Hidden Markov Model, with each classifier having different feature vectors as input. The final decision on the behaviour is made by taking into account the outputs from both classifiers.

The system architecture is presented in Fig. 2. The low level addresses the problem of motion detection and blob analysis, providing the upper level with two different feature vectors per instance. We note that an object's blob is defined to be the set of the foreground pixels that belong to that object. Background subtraction is applied for motion detection and a bounding box is extracted. The blobs apparent within the viewing area of each camera are used to extract the objects' principal axes. These principal axes, in combination with the corresponding homography calculations, are used to locate each object, i.e. determine the points where the target object touches the ground plane. From the coordinates of the latter points we calculate the trajectories of the objects.

Additional object information, namely the object's centroid, blob size and shape, is made available during the preprocessing step.
Furthermore, a histogram is extracted from the moving object's shape, depicting the moving object's blob projection on the y-axis. The overall set of elementary features is used for the creation of the final two feature vectors per instance: one vector for each classifier.

The two classifiers used at this point are able to decide about the normality of the observed behaviour under two different views:

- The first classifier (a one-class support vector machine (SVM)) decides if the short-term behaviour is normal or not, supplied with feature vectors computed by taking into account both the background subtraction and the ground plane information. The features provided as input describe the short-term motion information, which we argue constitutes the short-term behaviour information.
- The second classifier is a continuous Hidden Markov Model (cHMM), also used as a one-class classifier, which is supplied with the trajectory of every instance-object. This classifier can decide whether a given trajectory follows the model of normal trajectories.

Our method has been implemented to work in two modes: offline and real-time. In the offline mode, the decision concerns the classification of a time window of arbitrary length, which can be used, for example, for the characterization of video shots for video retrieval purposes. In its real-time aspect, the system makes a decision in every frame whether to issue alerts as the events happen. This decision is made by taking into consideration a time window of relatively small duration concerning recent camera information (images).
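The overall decision logic can be summarized in a few lines. The following is a minimal sketch, assuming hypothetical wrappers svm_is_abnormal and hmm_is_abnormal around the two trained one-class models; it is not the authors' implementation, only an illustration of the logical-OR fusion of the two classifiers described above and in the recognition step below.

```python
# Minimal sketch of the two-classifier fusion (illustrative only).
# `svm_is_abnormal` and `hmm_is_abnormal` are assumed, hypothetical
# wrappers around the two trained one-class models.

def classify_frame(short_term_features, trajectory_so_far,
                   svm_is_abnormal, hmm_is_abnormal):
    """Return 'abnormal' if either one-class classifier flags the frame.

    short_term_features: 7-dimensional motion feature vector of the frame.
    trajectory_so_far:   sequence of (x, y) ground-plane positions so far.
    """
    motion_abnormal = svm_is_abnormal(short_term_features)
    trajectory_abnormal = hmm_is_abnormal(trajectory_so_far)
    # Logical "or": a value of True indicates abnormality.
    return "abnormal" if (motion_abnormal or trajectory_abnormal) else "normal"
```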
This aspect can be used for security purposes, aiding a human supervisor.

In the recognition step, if either classifier gives an "abnormal" characterization as an output, the system characterizes the scene as abnormal. This means that we take as output the logical "or" of the outputs, given that a value of true indicates abnormality.

4. Preprocessing

The proposed methodology uses a preprocessing step that includes background subtraction for moving target segmentation and then target localization using homography information. For the background subtraction, we adopted the adaptive Gaussian mixture background model for dynamic background modelling [39]. Similar or better methods could have been used for the same purpose, without changing our overall approach, and the reader is referred to the related literature for further information.

For target localization we have employed a homography-based approach. Planar homographies are geometric entities whose role is to provide associations between points on different planes, which are the ground and the camera planes in our case. In our indoor environment the target moves on the ground plane, so a mapping between the planes is possible. In the following we explain briefly how the approach works.

The scene viewed by a camera comprises a predominant plane, the ground. We assume that a homogeneous coordinate system is attached to the ground plane, so that a point on the plane is expressed as $P_p = (x_{p1}, x_{p2}, x_{p3})^T$. If this point is visible to the camera, which is a matter of proper camera configuration, the homogeneous coordinates of this point on the camera plane are given by $P_c = (x_{c1}, x_{c2}, x_{c3})^T$. The homography $H$ is a $3 \times 3$ matrix which relates $P_p$ and $P_c$ as follows:

$$P_p = H \cdot P_c \Leftrightarrow \begin{bmatrix} x_{p1} \\ x_{p2} \\ x_{p3} \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x_{c1} \\ x_{c2} \\ x_{c3} \end{bmatrix} \quad (1)$$

Let the inhomogeneous coordinates of a pair of matching points be $x_c = (x_c, y_c)$ and $x_p = (x_p, y_p)$ on the camera plane (pixel coordinates) and the ground plane correspondingly. Then

$$x_p = \frac{x_{p1}}{x_{p3}} = \frac{h_{11} x_c + h_{12} y_c + h_{13}}{h_{31} x_c + h_{32} y_c + h_{33}} \quad (2)$$

$$y_p = \frac{x_{p2}}{x_{p3}} = \frac{h_{21} x_c + h_{22} y_c + h_{23}}{h_{31} x_c + h_{32} y_c + h_{33}} \quad (3)$$

Each point correspondence gives an equation, and four points are sufficient for the calculation of $H$ up to a multiplicative factor, provided that no triplet of the used points contains collinear points. The calculation of $H$ is a procedure done once, offline, and in practice many points are used to compensate for errors.

The positioning of each target is done similarly to [16]. A background subtraction algorithm extracts the silhouettes of the targets, which move on the ground plane.
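As an illustration of this preprocessing chain, the sketch below shows how a foreground silhouette could be obtained with an adaptive Gaussian mixture background model and how an image point (for example, the foot of the vertical principal axis) is mapped to the ground plane with Eqs. (2) and (3). It is a minimal sketch using OpenCV and NumPy, under the assumption that the homography H has already been estimated offline from ground-plane correspondences; the parameter values and helper names are illustrative, not the authors' exact implementation.

```python
import cv2
import numpy as np

# Adaptive Gaussian mixture background model (as in [39]); parameters are illustrative.
bg_model = cv2.createBackgroundSubtractorMOG2(history=250, detectShadows=False)

def foreground_mask(frame):
    """Return a binary foreground mask (silhouette pixels) for one video frame."""
    mask = bg_model.apply(frame)
    return cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)[1]

def to_ground_plane(H, x_c, y_c):
    """Map a pixel (x_c, y_c) to ground-plane coordinates using Eqs. (2)-(3)."""
    p = H @ np.array([x_c, y_c, 1.0])
    return p[0] / p[2], p[1] / p[2]

# H would normally be estimated once, offline, from at least four non-collinear
# point correspondences between the image and the ground plane, e.g.:
# H, _ = cv2.findHomography(image_points, ground_points)
```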
Fig. 3. View from three cameras and extraction of the principal axis projection on the ground plane from two of the cameras. In (c) the projection is not
visible, however, the corresponding accumulator is still created in (d). In (d) three accumulators are visible—two of them very close to each other.
From each silhouette we extract the vertical principal axis, and we project it on the ground plane by replacing $(x_c, y_c, 1)^T$ and $(x_p, y_p, 1)^T$ in (1). The projection from each camera casts a "line" on the ground plane, as depicted in Fig. 3. The maxima of those projected lines indicate the positions of the monitored targets, i.e. where the vertical principal axis touches the ground. The method is not strongly affected when the target pose is not vertical, because a vertical principal axis is still extracted from the silhouettes. In such cases the indicated position is not the exact position of the feet touching the ground but the one indicated by the vertical axes, which may be a bit displaced. However, also in such cases the method still gives good position estimations.

5. Short-term behaviours

Our first source of information for evaluating behaviour is the so-called short-term behaviour. Our methodology represents short-term behaviour with a feature vector that consists of motion-based features. In the recognition step a one-class support vector machine is used, trained only with normal instances.

5.1. Feature calculation

In motion representation and analysis, our methodology uses information obtained by preprocessing, namely the object's bounding box, the object's blob and its sequential positions. In Fig. 4, all preprocessing-extracted information is illustrated.

Elaborating, from the background subtraction process we extract the position of the object's centroid inside the bounding box, the bounding box's width and height, and the object's blob. Figs. 4a-c show the captured frames from each camera with the corresponding bounding boxes. Figs. 4d-f show the background subtraction masks, from where the blob is extracted.

The blob histogram is calculated based on the blob information. The histogram of the blob indicates the number of pixels that belong to the blob for every y coordinate. Figs. 4g-i show the histograms of the given blob.

From homography estimation we calculate the object's ground position and thus the trajectory, which is expressed as a sequence of $(x, y)$ vectors on the ground plane. Fig. 4j illustrates the object's trajectory in the scene, calculated from all views.

The short-term activity is represented by a seven-dimensional feature vector, as follows:

$$f = (v(t),\ \bar{v}_{cT}(t),\ R_T(t),\ F(t),\ \Delta F(t),\ \max(\Delta H(t)),\ \max(\Delta SD(t))) \quad (4)$$

The features' calculation is presented in detail in Table 1, with the features being separated into four categories according to the type of information they depend on. The first two features, speed and algebraic mean speed, are computationally inexpensive and are calculated, in a time-efficient manner, only from trajectory data. Algebraic mean blob difference is also time efficient, calculated only from the background subtraction data on the object's bounding box. Mean optical flow and mean optical flow percentage difference are derived from simple operations on optical flow. For these two features we use data from both the object's bounding box and the full images of the video sequences. Optical flow is computationally expensive, but it is robust and discriminative [14]. The last two features are computationally inexpensive, and they are extracted from the blob histogram. We have said that the histogram reflects the number of pixels that constitute the foreground object per y coordinate. But if we weigh the histogram by the total number of the histogram's pixels, we have a probability distribution function (pdf), $p_c(y_j)$, that represents the probability of an object's pixel lying at a given coordinate $y_j$ in the bounding box.

Taking into account that features are extracted for every single video frame and constitute the frame's feature vector, we elaborate on the calculations presented in Table 1 (a short computational sketch follows this list):

(1) $v(t)$ is the Euclidean norm (over the x- and y-axes of the ground plane) of the instantaneous object's speed, calculated from the current frame's and the previous frame's object position.

(2) Algebraic mean speed, $\bar{v}_{cT}(t)$, is the algebraic mean value of an object's speed within a time window that consists of the T last frames, including the frame at $t_0$. This value is calculated based on the algebraic sum of the x and y coordinates of the speed vector, which is more robust against noise than $v(t)$.

(3) On the same grounds, the calculation of mean blob difference, $R(t)$, is based on the algebraic sum of the bounding boxes' area change within a shifting frame window $T'$ comprising the last e.g. 5-10 frames ($w_c(j)$, $h_c(j)$ represent the width and the height of the blob for camera c at $t = j$).

(4) Optical flow, $F_i$, is first calculated on every frame and for each camera i, but only for the object's edges inside the bounding box. Then, the optical flow value is normalized by the number of pixels that participate in the calculation (which are the pixels of the edges) and by the bounding box area. Then we compute the mean optical flow value from all cameras.

(5) Mean optical flow difference is the difference between the current and the previous value of the mean optical flow, divided by the previous value. This offers the percentage of optical flow change. We calculate the feature for each camera and we keep the maximum value over all cameras.

(6) Max entropy histogram difference, $\max(\Delta H(t))$, is based on the Shannon entropy, $H(t)$, which is a measure of the uncertainty associated with a random variable. This means that the more a given pdf resembles a uniform pdf, the greater the entropy value. The main idea is that when an abrupt motion occurs, the differences in entropy values will be significantly greater than those of a normal slow motion.

(7) Max standard deviation difference, $\max(\Delta SD(t))$, is also calculated from the object blob's histogram. The standard deviation of the histogram (std) is a measure of the spread of its values. The change of a histogram's standard deviation value from one point of view, $\Delta SD(t)$, can give us important information about the motion of the object, in that it indicates within-bounding-box movement. We calculate the feature for each camera and we keep the maximum value over all cameras as the final feature value.
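To make the per-frame feature extraction concrete, here is a minimal sketch of two of the cheaper features from Table 1: the instantaneous speed of item (1) and the blob-histogram entropy underlying item (6). The function and variable names are mine, and the entropy is the standard Shannon entropy of the normalized blob histogram; this illustrates the formulas rather than reproducing the authors' code.

```python
import numpy as np

def speed(prev_xy, curr_xy):
    """Item (1): Euclidean norm of the instantaneous ground-plane speed."""
    return float(np.linalg.norm(np.asarray(curr_xy) - np.asarray(prev_xy)))

def blob_histogram_entropy(blob_mask):
    """Shannon entropy of the blob's y-projection histogram (used by item 6).

    blob_mask: 2-D 0/1 array of the foreground pixels inside the bounding box.
    """
    hist = blob_mask.sum(axis=1).astype(float)   # pixels per y coordinate
    p = hist / hist.sum()                        # normalize to a pdf p_c(y_j)
    p = p[p > 0]                                 # empty rows contribute 0 * log 0 = 0
    return float(-(p * np.log(p)).sum())

def max_entropy_difference(entropies_t, entropies_t1):
    """Item (6): max over cameras of the relative entropy change between frames."""
    e_t = np.asarray(entropies_t, dtype=float)
    e_t1 = np.asarray(entropies_t1, dtype=float)
    return float(np.max((e_t - e_t1) / e_t1))
```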
Fig. 4. (a)–(c) Frames captured from each camera with bounding boxes. (d)–(f) Background subtraction masks and blob indication per camera. (g)–(i)
Histogram of the object’s blob for each camera. (j) Trajectory formed by the calculated ground points.
5.2. Short-term behaviours classification

The decision whether a short-term behaviour is normal or not can be taken by employing a one-class SVM, as proposed by Schölkopf et al. [31]. The selected model does not require a labelled training set to determine the decision surface. The one-class SVM is similar to the standard SVM in that it uses kernel functions to perform implicit mappings and dot products, and in that the solution is only dependent on the support vectors. Such an approach can be justified by the fact that normal behaviours are easier to observe, and thus whatever deviates from them can be defined as abnormal. Thus we do not need to model abnormal behaviours explicitly and we do not need labelling of data, as long as our assumption on the sparsity of abnormality stands. This is what makes this approach unsupervised.

The one-class SVM builds a boundary that separates the training data class from the rest of the feature space. For more details the reader is referred to [23].
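The paper's short-term classifier is built with LIBSVM [8]; as an accessible stand-in, the sketch below uses scikit-learn's OneClassSVM with an RBF kernel (the kernel reported in the experiments) trained on the seven-dimensional feature vectors of normal frames. The file name and the nu and gamma values are placeholders, not the authors' settings.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# X_normal: array of shape (n_frames, 7) with the feature vectors of Eq. (4)
# extracted from the normal training videos (hypothetical file name).
X_normal = np.load("normal_short_term_features.npy")

# One-class SVM with an RBF kernel; nu and gamma are illustrative values.
oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_normal)

def svm_is_abnormal(feature_vector):
    """True if the frame's short-term features fall outside the learned boundary."""
    return oc_svm.predict(np.asarray(feature_vector).reshape(1, -1))[0] == -1
```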
6. Trajectories classification

Our second information source for evaluating behaviour is the trajectory. In a museum scenario, the trajectory of a person entering from the designated entrance, then approaching the cashier to buy a ticket, then browsing into the room and looking around, and finally exiting from the designated exit, should be characterized as normal. Trajectories of persons entering from the exit without first visiting the ticket stand, or going in the wrong direction, should be labelled as abnormal.

Some works in the literature use rules to define the restricted areas and thereby distinguish normal from abnormal trajectories. We apply a one-class learning strategy, as in the short-term behaviours, by training our time-series classifier using only the normal trajectories. Each sample is a position vector $(x, y)$ of the target in the global coordinate system in each frame (calculated as described in Section 4). The extracted normal trajectories (sequences of $(x, y)$ vectors) are used for training a continuous Hidden Markov Model [29] and constitute the model observations.

For convenience, we use the compact notation $\lambda = (A, B, \pi)$ to indicate the complete parameter set of the model, where:

- $A$ is the state transition probability distribution matrix;
- $B$ is the matrix of observation probability density functions per state;
- $\pi$ is the initial state probability distribution.

The original Baum-Welch algorithm is used for the training step, while for the recognition step we propose a modified Forward-Backward procedure (see Section 6.2). The methodology presented here proposes solutions to two problems: the use of the Hidden Markov Model as a one-class classifier, and the numerical underflow of the Forward-Backward procedure for very long observation sequences.

Table 1. Calculation of the short-term behaviour features.

1. Speed: $v(t) = \sqrt{(x(t) - x(t-1))^2 + (y(t) - y(t-1))^2}$

2. Algebraic mean speed: $\bar{v}_{cT}(t) = \sqrt{\left(\frac{1}{T}\sum_{i=t-T+1}^{t} v_x(i)\right)^2 + \left(\frac{1}{T}\sum_{i=t-T+1}^{t} v_y(i)\right)^2}$

3. Algebraic mean bounding box difference: $R(t) = \frac{\sum_{i=1}^{numCam} R_i(t)_{T'}}{numCam}$, where $R_c(t)_{T'} = \frac{1}{T'}\sum_{j=t-T'+1}^{t} \frac{w_c(j)\,h_c(j) - w_c(j-1)\,h_c(j-1)}{w_c(j-1)\,h_c(j-1)}$

4. Mean optical flow: $F(t) = \frac{\sum_{i=1}^{numCam} F_i}{numCam}$, where $F_i$ is the normalized optical flow from camera $i$

5. Mean optical flow difference: $\Delta F(t) = \frac{F(t) - F(t-1)}{F(t-1)}$

6. Max entropy difference: $\max(\Delta H(t)) = \max_i \frac{H_i(t) - H_i(t-1)}{H_i(t-1)}$, with $1 \le i \le numCam$, where $H_c(t) = -\sum_{j=1}^{N} p_c(y_j)\log p_c(y_j)$, with $p_c(y_j)$ the histogram value at location $y_j$ for camera $c$ and $N$ the bounding box's height

7. Max standard deviation difference: $\max(\Delta SD(t)) = \max_i \frac{std_i(p_i(y)_t) - std_i(p_i(y)_{t-1})}{std_i(p_i(y)_{t-1})}$, with $1 \le i \le numCam$

6.1. One-class continuous Hidden Markov Model

The problem of discriminating between normal and abnormal trajectories concerns the definition of a measure that would give sufficiently different values for the two classes. The variable length of the trajectories poses additional difficulties. Long, normal trajectories would have cHMM generation probability values comparable to the small values of short, abnormal trajectories, so the observation's length factor needs to be removed.

If we can prove that for a normal observation sequence ($O_{normal}$) and for an abnormal one ($O_{abnormal}$) the following condition must hold:

$$\frac{\log P(O_{abnormal}\,|\,\lambda)}{length(O_{abnormal})} \ll \frac{\log P(O_{normal}\,|\,\lambda)}{length(O_{normal})} \quad (5)$$

then we will be able to use it as a classification measure. In (5) the logarithms help us sharpen the differences between values below 1, and the division by the sequence's length normalizes the computed measure.

The anomaly detection problem begins with the definition of "what can be labelled as normal". We may define as normal the trajectories for which, between two time instances $t$ and $t+1$, the probabilities of the corresponding observations are proportional to each other, and their fraction can be viewed as a random variable $D$. Taking into consideration that $O_t$ is the observation sequence from time $= 0$ until time $= t$, the random variable $D$ depends only on the model, $\lambda(A, B, \pi)$ [29].

Thus, given the model and two consecutive observations $O_t$, $O_{t+1}$, there is a variable $D$, with an expected value $d = E[D]$, such that

$$P(O_{t+1}) \simeq d \cdot P(O_t) \Rightarrow \frac{P(O_{t+1})}{P(O_t)} \simeq d \quad (6)$$

with $0 < t+1 \le T$. This assumption is derived from the facts that:

- $D$ depends only on the model;
- normal trajectories have a high probability of being generated by the model;
- the expected value represents the average amount one "expects" as the outcome of the random trial when identical odds are repeated many times.

We can also see that $0 < d \le 1$, because $P(O_{t+1}) \le P(O_t)$. According to (6), we can expand the calculations as follows:

$$P(O_{t+1}) \simeq d \cdot P(O_t) \;\Rightarrow\; P(O_{t+1}) \simeq d^{t} \cdot P(O_1) \;\Rightarrow\; \log P(O_{t+1}) \simeq t \log d + \log P(O_1) \;\Rightarrow\; \frac{\log P(O_{t+1})}{t+1} \simeq \frac{1}{t+1}\left(t \log d + \log P(O_1)\right)$$
which results, after replacing $t$ with $t-1$, in the following:

$$\frac{\log P(O_t)}{t} \simeq \frac{1}{t}\left((t-1)\log d + \log P(O_1)\right), \quad \forall t: 0 < t \le T \quad (7)$$

As abnormal, we define the trajectories for which the probability of their corresponding $D$ value will be very low. For those trajectories, we assume that there exists a transition from time $k$ to time $k+1$ where, due to either the transition probability $a_{ij}$ or the observation probability $b_j(O)$, the $D$ value probability (i.e. the probability of having such a $D$ value for the given model) decreases significantly, because the value of $D_{k+1}$ for the given time point $k+1$ becomes lower than expected:

$$\exists k: \ \frac{P(O_{k+1})}{P(O_k)} = D_{k+1}, \quad p(D) \ll 1, \quad D_{k+1} \ll d \quad (8)$$

Before that $k$, the trajectory can be characterized as normal, i.e.

$$\forall t: t < k, \quad \frac{P(O_{t+1})}{P(O_t)} = d \quad (9)$$

From the above we have

$$\log P(O_{k+1}) = \log D_{k+1} + \log\!\left(d^{\,k-1} P(O_1)\right) \;\Rightarrow\; \frac{\log P(O_{k+1})}{k+1} = \frac{1}{k+1}\left(\log D_{k+1} + (k-1)\log d + \log P(O_1)\right) \quad (10)$$

For the discrimination problem (see Eq. (5)), the following must hold:

$$\frac{\log P(O_{k+1})}{k+1} \ll \frac{\log P(O_k)}{k} \quad (11)$$

By letting $t = k$ in (7) and using (10) in (11) we have

$$\frac{1}{k+1}\left(\log D_{k+1} + (k-1)\log d + \log P(O_1)\right) \ll \frac{1}{k}\left((k-1)\log d + \log P(O_1)\right) \quad (12)$$

Because $k$ represents time, $k > 0$. On the other hand, $D_{k+1}$ and $d$ represent values of the probabilities' ratio, so $0 < D_{k+1}, d < 1$. According to that remark, we can assume that for sufficiently long sequences, e.g. for $k \ge 10$, $1/k \simeq 1/(k+1)$ in (12), due to the fact that $\log D_{k+1}, \log d \ll 1$. Thus, Eq. (12) becomes

$$\log D_{k+1} + (k-1)\log d + \log P(O_1) \ll (k-1)\log d + \log P(O_1) \;\Rightarrow\; \log D_{k+1} \ll 0 \quad (13)$$

Since $D_{k+1} \ll d$, $D_{k+1}$ is a sufficiently small value that gives $\log D_{k+1} \ll 0$. Given that (13) is valid, the initial assumption, Eq. (5), is true. Therefore, (5) can be used as a criterion for abnormal trajectory detection.

6.2. Log likelihood approximation in long sequences

As mentioned previously, continuous Hidden Markov Models have problems with long sequences. This is due to the multiplications in the Forward-Backward algorithm, which is used to calculate the observation probability given the model. The constant decrease of the observation probability results in a very low value, which ends up underflowing current computers' number storage. Solutions like sampling the trajectory only partially solve the problem.

In order to tackle the problem, one may rescale the conditional probabilities using carefully designed scaling, as proposed in [29]. We, however, have devised a method for the approximation of the log-probability of a long sequence that offers the advantage of computational simplicity and, in parallel, keeps the properties required for the classification of normal and abnormal trajectories (Eq. (5)). Our approximating methodology avoids the calculation of the scaling factor and uses integer instead of real values. We have named this method observation log-probability approximation (OLPA).

Given the trained continuous Hidden Markov Model, within the recognition step the Forward-Backward algorithm is used to compute the probability of a known observation sequence [29]. This algorithm consists of the following steps:

(1) Initialization: $\alpha_1(i) = \pi_i\, b_i(O_1)$.
(2) Induction: $\alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\right] b_j(O_{t+1})$.
(3) Termination: $P(O|\lambda) = \sum_{i=1}^{N} \alpha_T(i)$.

To compensate for the constant decrease in the likelihood in long sequences, we modified the above algorithm so that instead of multiplications we use additions of logarithms. Some background assumptions are given next.

By definition, if $\lfloor x \rfloor$ is the floor of the number $x$, then $|\log a - \lfloor \log a \rfloor| < 1$. Thus, we can approximate $\log P(O|\lambda)/length(O)$ with $\lfloor \log P(O|\lambda) \rfloor/length(O)$. Now, due to the fact that for long sequences $P(O|\lambda)$ is below 1 and that $\log a \to -\infty$, one may assume that $\log a \simeq \lfloor \log a \rfloor$. This approximation is acceptable, because the estimation error is bounded (less than 1). Long normal sequences give small values of cHMM probabilities, due to the successive multiplications, making the magnitude of the logarithm of those probabilities large enough that the error of at most 1 is not damaging. Assuming this approximation is acceptable, it can be inserted into the Forward-Backward algorithm.

First, we define the functions necessary for computations in the cHMM algorithms, using logarithms:

$$\lfloor \log(a \cdot b) \rfloor = \lfloor \log a + \log b \rfloor \simeq \lfloor \lfloor \log a \rfloor + \lfloor \log b \rfloor \rfloor = \lfloor \log a \rfloor + \lfloor \log b \rfloor$$

Additionally, the following applies for a sequence of $x_i$, the biggest of which is $x_{max}$:

$$x_{max} \le \sum x_i \le n \cdot x_{max} \;\Rightarrow\; \log(x_{max}) \le \log \sum x_i \le \log(n) + \log(x_{max})$$

The order of magnitude of the $x_i$ is $10^{-9}$ or less and that of $n$ is 10, so $\log(\sum x_i) \simeq \max_i(\log(x_i))$, or $\lfloor \log(\sum x_i) \rfloor \simeq \lfloor \max_i(\log(x_i)) \rfloor$.

According to all the above, we can conclude to a modification of the Forward-Backward algorithm, using the same dynamic programming idea: let $\mathrm{Log}\, a \triangleq \lfloor \log a \rfloor$, and let $\tilde{\alpha}$ be the approximated $\alpha$; then the following approximations apply:

$$\tilde{\alpha}_1(i) = \mathrm{Log}\!\left(\pi_i\, b_i(O_1)\right) = \lfloor \log \pi_i + \log b_i(O_1) \rfloor$$
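The following is a minimal sketch of an OLPA-style forward pass, assuming that the induction and termination steps mirror the initialization above by combining the floored-logarithm product rule with the max-approximation of the sum (the remainder of the derivation falls on a page not reproduced here, so the recursion below is an assumption rather than the authors' exact algorithm). It works with integer floored log-quantities throughout and finally divides by the sequence length, as required by Eq. (5).

```python
import math

def olpa_log_probability(observations, pi, A, log_b):
    """Approximate floor(log P(O|lambda)) with integer log-quantities (OLPA-style sketch).

    observations : list of observation vectors O_1..O_T
    pi           : initial state probabilities pi_i (assumed strictly positive)
    A            : transition matrix, A[i][j] = a_ij (assumed strictly positive)
    log_b        : function log_b(j, o) returning log b_j(o) for state j
    """
    n_states = len(pi)
    # Initialization: alpha~_1(i) = floor(log pi_i + log b_i(O_1))
    alpha = [math.floor(math.log(pi[i]) + log_b(i, observations[0]))
             for i in range(n_states)]
    # Induction: replace each product by an addition of floored logarithms and
    # the sum over predecessor states by its dominant term (max-approximation).
    for o in observations[1:]:
        alpha = [max(alpha[i] + math.floor(math.log(A[i][j])) for i in range(n_states))
                 + math.floor(log_b(j, o))
                 for j in range(n_states)]
    # Termination: again approximate the sum over states by the maximum.
    return max(alpha)

def normalized_log_probability(observations, pi, A, log_b):
    """Length-normalized measure used in the classification criterion of Eq. (5)."""
    return olpa_log_probability(observations, pi, A, log_b) / len(observations)
```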
Fig. 5. (a) View of our experimental room (exposition room). (b) Normal and abnormal trajectory example. In the latter the target goes over the barrier.
Fig. 9. (a) Percentage normality in normal and abnormal behaviours for support vector machine. (b) Output of continuous Hidden Markov Model for
normal and abnormal behaviours. Black colour is for normal behaviours and red for abnormal behaviours. (For interpretation of the references to colour in
this figure legend, the reader is referred to the web version of this article.)
generality, the cHMM's output probabilities are stored in order to be processed and used to extract the thresholds based on distributional characteristics (mean value, standard deviation and minimum value; also see Eq. (14)). For the decision concerning the SVM classifier, we also extract a threshold which indicates the maximum number of abnormal frames we allow within a normal, predefined-length sequence of frames. Therefore, SVM decisions are also used to determine this second threshold. At this point the system is considered to be calibrated. In case someone wishes to apply the system at a different location, only the training step needs to be repeated and the system will be applicable to the new environment.

The experiments prove that the system is highly automated, as minimal human interference is needed during the training step, and the results are very encouraging. We remind the reader that in the background subtraction step the first 250 frames are used for training, where no person is inside the scene. Those frames are used to extract the background edges (also see Section 5.1). Features identifying short-term behaviour are extracted and used to train a one-class SVM with a radial basis function kernel. Simultaneously, trajectories were extracted in order to be inserted into a continuous HMM for training.

The threshold values have been calculated based on the training set. In Fig. 9, distributions of SVM and cHMM outputs for normal as well as abnormal behaviours are shown. Fig. 9a depicts the normality percentage for normal and abnormal behaviours within a time window that includes the whole behaviour, i.e. how many feature vectors are recognized as normal in the entire behaviour. We used a t-test in order to ensure that the two density functions are different, and the resulting p-value was below 1%. Because the two pdfs are not Gaussian, we have also applied the Kolmogorov–Smirnov test (KS-test) [3], which does not require normal pdfs. The Kolmogorov–Smirnov test indicated that, indeed, the normal and abnormal samples come from different pdfs (p-value = 2.09e-07). Fig. 9b shows the cHMM's output for normal and abnormal behaviours. The two tests (t-test and KS-test) were also applied to these results, with both p-values substantially below 1%. According to the remark that normal and abnormal pdfs are different for both classifiers, thresholding their outputs was a logical decision.

For SVM-based classification we set the threshold to be the following function of the mean and the standard deviation of the distribution of the number of allowed abnormal frames within a normal sequence:

$$threshold_{SVM} = \mathrm{mean}(Hsvm_{normal}) - 2.5 \cdot \mathrm{std}(Hsvm_{normal}) \quad (14)$$

For HMM outputs, the minimum value of the distribution of normalized log-probabilities of the normal instances was considered to be the threshold value that separates normal trajectories from the abnormal ones:

$$threshold_{HMM} = \mathrm{min}(Hhmm_{normal}) \quad (15)$$

where $Hsvm$ is the histogram of the SVM's outputs and $Hhmm$ is the histogram of the HMM's outputs.

7.4. Real-time experiments

In both the online and offline approaches the same training set (and therefore the same models) and thresholds have been used. The only difference is that in the online approach we had the system emit a decision for every frame instead of for the whole behaviour. The system performance in both approaches is encouraging, as will be shown in the following paragraphs.

Real-time experiments follow a slightly different approach. Each frame is labelled as normal or abnormal depending on both classifiers' decisions. All the videos contain 34 479 normal frames, i.e. frames for which the behaviour should be judged as normal, and 5260 abnormal frames. From the 4537 frames, 1251 have motion-based abnormality and 4537 have trajectory-based abnormality. The SVM classifier classifies each frame, but the SVM-based decision also takes into account the labels of the previous 24 frames, based on the percentage of abnormal frames within this history of 25 frames. The cHMM returns a normalized log-probability value which characterizes the object's sampled trajectory since the object's first appearance in the scene and up to the current frame. The final system result for each frame is the logical "or" of the two classifiers' decisions.
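A compact sketch of this calibration and per-frame decision logic is given below. It assumes the calibration outputs are available as arrays of SVM normality percentages and normalized HMM log-probabilities for the normal training behaviours (names are mine), mirrors Eqs. (14) and (15), and applies the 25-frame history rule; how exactly the SVM threshold is compared against the windowed output is my reading of the text, so treat the comparison as illustrative rather than the authors' exact rule.

```python
import numpy as np
from collections import deque

def calibrate_thresholds(svm_outputs_normal, hmm_logprob_normal):
    """Thresholds from normal calibration data, following Eqs. (14) and (15)."""
    threshold_svm = np.mean(svm_outputs_normal) - 2.5 * np.std(svm_outputs_normal)
    threshold_hmm = np.min(hmm_logprob_normal)
    return threshold_svm, threshold_hmm

class RealTimeDecision:
    """Per-frame decision combining both classifiers (illustrative sketch)."""

    def __init__(self, threshold_svm, threshold_hmm, history=25):
        self.threshold_svm = threshold_svm
        self.threshold_hmm = threshold_hmm
        self.labels = deque(maxlen=history)   # most recent per-frame SVM labels

    def step(self, svm_frame_is_normal, hmm_normalized_logprob):
        self.labels.append(bool(svm_frame_is_normal))
        normality = sum(self.labels) / len(self.labels)   # fraction of normal labels
        svm_abnormal = normality < self.threshold_svm
        hmm_abnormal = hmm_normalized_logprob < self.threshold_hmm
        # Final result: logical "or" of the two classifiers' decisions.
        return "abnormal" if (svm_abnormal or hmm_abnormal) else "normal"
```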
Precision and recall have been calculated for the offline and the real-time experiments. For each approach we give the performance for both the SVM and the HMM classifier models separately, as well as for the whole system, in Table 2.

Table 2. Precision and recall for the 3-camera system on our dataset. The column "Overall" indicates the performance of the combined decision.

                SVM                   HMM                   Overall
                Precision  Recall     Precision  Recall     Precision  Recall
  Offline
    Normal      0.9048     0.9286     1          0.9762     1          0.9286
    Abnormal    0.7674     0.7071     0.95       1          0.88       1
  Real-time
    Normal      0.9875     0.9228     0.9960     0.9770     0.9960     0.9105
    Abnormal    0.2419     0.6788     0.8478     0.9704     0.8478     0.9375

Even though the overall system performance is very satisfactory, we should note that the precision on motion-based abnormal instances, through the use of the SVM classifier, appears to be low. This indicates that we should further optimize the SVM parameter values for the given classification problem, as it has been seen in the literature that SVM performance can be highly dependent on the selected parameters. However, the simultaneous use of both classifiers helps the system perform highly for the given dataset.

7.6. Multiple cameras vs. one camera

To clarify the reasons for using multiple cameras instead of one camera, we have performed a set of experiments with the data of only one camera from our lab dataset. The system's results (precision and recall) are shown in Table 3.

Table 3. Precision and recall for the single-camera system on our dataset.

                Precision  Recall     Precision  Recall
    Normal      0.8882     0.775      0.7625     0.7309
    Abnormal    0.3129     0.5125     0.2273     0.2582

As we can see, the system's performance is lower than that produced by multiple cameras, due to the fact that one camera is not able to give as robust a ground point estimation of the object as the estimation given by multiple cameras. Moreover, multiple cameras provide the benefit of more information, especially in the case where the object is not within the view of one of the available cameras.

It is worth pointing out that in Table 3 we average precision and recall taking into account two of our three cameras, due to the fact that the third camera could not give us proper output, since the object was frequently out of its view. The multi-camera system overcomes this problem by compensating for any missing camera data. In addition, as we can observe from Tables 2 and 3, cHMM precision and recall, in both the offline and real-time experiments, are greater with multiple cameras than with only one camera. On the other hand, precision and recall in both the offline and real-time experiments for the SVM are in most cases higher in the single-camera system than in the multi-camera system. These observations have led us to two main conclusions. The first is that our assumption that multiple cameras provide us with a more precise position of the object (a more accurate trajectory) is correct. The second is that our application of trivial fusion of the motion data from the different cameras (we just calculated mean feature values over the three cameras) can cause a decrease in performance and should be avoided. Future work should research how motion feature values from different cameras should be combined.

In order to further allow for solid comparison, we have chosen to use a commonly used dataset for additional comparisons. The corpus chosen is the set of video sequences available for result comparison from the PETS04 workshop [12]. The sequences have already been used by the CAVIAR project. The system's performance when applied on these data is depicted in Table 4. It is worth mentioning that:
videos⁴ and 4 abnormal.⁵ The extracted different behaviours were a total of 43 normal and 8 abnormal ones. The number of frames was 12 188 normal and 2669 abnormal.

In the CAVIAR dataset evaluation of performance, the detection of abnormal behaviour appears to be more difficult than in our dataset. Given this difference in performance, we have sought the reasons for the decrease in efficiency and found some possible causes. In our use of the CAVIAR dataset, we used the whole videos described as cases of "walking", "browsing" and "meeting" as input for normal behaviour. We then discovered that a quick (running) motion can be found within a walking video, inducing noise in the discriminative ability of the speed-based features. Then we saw that occlusion may have caused problems, due to the fact that there are data from only one camera. The edge-detection process and the optical flow extraction fail when, for example, two people are too close to each other and fighting. In these cases the positioning of the targets with respect to the camera highly affects the method concerning the use of optical flow, but only when a single camera is used. The use of three cameras and proper fusion of information may offer better edge detection and, thus, optical flow values. The two identified problems partially explain the loss of recall for abnormal instances, even though more experiments should be conducted to verify these findings. One final comment would be that abnormality in such actions as fighting can be detected much more easily if one uses interaction information between actors, which was not within the scope of this work.

Footnotes: 4. Namely, the normal videos were: browse1-browse4, wk1-wk3, meetSplit3rdGuy, meetWalkSplit, meetWalkTogether1-meetWalkTogether2. 5. Namely, the abnormal videos were: FightChase, FightOneManDown, FightRunAway1-FightRunAway2.

8. Conclusion and future work

In this paper, we have presented a set of theoretical and practical tools for the domain of behaviour recognition, which have been integrated within a unified, automatic, bottom-up system based on the use of multiple cameras, performing human behaviour recognition in an indoor environment without a uniform background. The approach's innovation is fourfold:

- We propose the application of two different criteria of human behaviour abnormality, used within a single methodology that needs only normal data for training.
- We have proven that the application of multiple cameras can be fruitful when it comes to determining abnormality based on the trajectory.
- We have presented a methodology that lets a continuous Hidden Markov Model function as a one-class classifier, with very promising experimental results.
- We have managed to offer an alternative to the Forward-Backward algorithm for the recognition step of cHMMs in order to overcome arithmetic underflow in the case of very long observation sequences, without loss of precision.

Our experimental results demonstrated the good performance of the system in the task of recognizing human behaviour abnormality in a somewhat noisy environment, with different scenarios of action and the participation of different actors. The experiments were conducted in offline and real-time conditions, with similar results, implying the robustness of the method. Furthermore, experiments with a single-camera version of the system give us the incentive to consider another, more robust method for the fusion of data in order to improve performance.

The multiple-camera methodology has, so far, been tested on scenarios with only one object inside the scene, without taking into account any interactions between actors. It would be worthwhile to further investigate the effectiveness of our system using more features, such as the distance of the object from each camera, in order to improve the motion-based discriminatory performance of the system. Moreover, other methodologies could also be tested in the place of the SVM classifier.

Acknowledgements

This work is being co-funded by the Greek General Secretariat of Research and Technology and the European Union via a PENED project.

References

[1] F. Bashir, A. Khokhar, D. Schonfeld, View-invariant motion trajectory-based activity classification and recognition, Multimedia Systems 12 (1) (2006) 45–54.
[2] F. Bashir, W. Qu, A. Khokhar, D. Schonfeld, HMM-based motion recognition system using segmented PCA, in: Proceedings of the IEEE International Conference on Image Processing (ICIP), Genoa, Italy, vol. 3, 2005, pp. 1288–1291.
[3] Z. Birnbaum, F. Tingey, One-sided confidence contours for probability distribution functions, The Annals of Mathematical Statistics 22 (4) (1951) 592–596.
[4] M. Black, A. Jepson, A probabilistic framework for matching temporal trajectories: condensation-based recognition of gestures and expressions, in: Proceedings of the European Conference on Computer Vision (ECCV), Freiburg, Germany, vol. 1406, 1998, pp. 909–924.
[5] F. Bobick, W. Davis, The recognition of human movement using temporal templates, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (3) (2001) 257–267.
[6] O. Boiman, M. Irani, Detecting irregularities in images and in video, International Journal of Computer Vision 74 (1) (2007) 17–31.
[7] C. Bregler, J. Malik, Learning appearance based models: mixtures of second moment experts, Advances in Neural Information Processing Systems 9 (2) (1997) 845.
[8] C. Chang, C. Lin, LIBSVM: a library for support vector machines, Software available at <http://www.csie.ntu.edu.tw/~cjlin/libsvm>, vol. 80, 2001, pp. 604–611.
[9] H. Dee, D. Hogg, Detecting inexplicable behaviour, in: British Machine Vision Conference, London, UK, 2004, pp. 477–486.
[10] T. Duong, H. Bui, D. Phung, S. Venkatesh, Activity recognition and abnormality detection with the switching hidden semi-Markov model, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, vol. 1, 2005, pp. 838–845.
[11] A. Efros, C. Berg, G. Mori, J. Malik, Recognizing action at a distance, in: Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV), Nice, France, vol. 2, 2003, pp. 726–733.
[12] R. Fisher, The PETS04 surveillance ground-truth data sets, in: International Workshop on Performance Evaluation of Tracking and Surveillance, 2004.
[13] J. Francois, Jahmm—hidden Markov model (HMM): an implementation in Java, 2006.
[14] B. Horn, B. Schunck, Determining optical flow, Artificial Intelligence 17 (1–3) (1981) 185–203.
[15] W. Hu, D. Xie, T. Tan, A hierarchical self-organizing approach for learning the patterns of motion trajectories, IEEE Transactions on Neural Networks 15 (1) (2004) 135–144.
[16] W. Hu, M. Hu, X. Zhou, T. Tan, J. Lou, Principal axis-based correspondence between multiple cameras for people tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (4) (2006) 663–671.
[17] A. Ivanov, F. Bobick, Recognition of visual activities and interactions by stochastic parsing, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 852–872.
[18] F. Jiang, Y. Wu, A. Katsaggelos, Abnormal event detection from surveillance video by dynamic hierarchical clustering, in: IEEE International Conference on Image Processing (ICIP), San Antonio, TX, USA, vol. 5, 2007, pp. 145–148.
[19] N. Johnson, D. Hogg, Learning the distribution of object trajectories for event recognition, Image and Vision Computing 14 (8) (1996) 609–615.
[20] D. Kosmopoulos, P. Antonakaki, K. Valasoulis, D. Katsoulas, Monitoring human behavior in an assistive environment using multiple views, in: 1st International Conference on Pervasive Technologies Related to Assistive Environments (PETRA '08), Athens, Greece, 2008.
[21] C. Lee, M. Ho, W. Wen, C. Huang, T. Hsin-Chu, Abnormal event detection in video using N-cut clustering, in: International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP), Pasadena, CA, USA, 2006.
[22] D. Mahajan, N. Kwatra, S. Jain, P. Kalra, S. Banerjee, A framework for activity recognition and detection of unusual activities, in: Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), 2004, pp. 15–21.
[23] L. Manevitz, M. Yousef, One-class SVMs for document classification, Journal of Machine Learning Research 2 (2) (2001) 139–154.
[24] T. Moeslund, A. Hilton, V. Krüger, A survey of advances in vision-based human motion capture and analysis, Computer Vision and Image Understanding 104 (2–3) (2006) 90–126.
[25] H. Neoh, A. Hazanohuk, Adaptive edge detection for real-time video processing using FPGAs, in: CD Proceedings of the 2004 Global Signal Processing Expo (GSPx) and International Signal Processing Conference (ISPC), Santa Clara, California, September 27–30, 2004.
[26] N. Nguyen, H. Bui, S. Venkatesh, G. West, Recognizing and monitoring high-level behaviors in complex spatial environments, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, 2003, pp. 620–625.
[27] J. Owens, A. Hunter, Application of the self-organising map to trajectory classification, in: Proceedings of the IEEE International Workshop on Visual Surveillance, Dublin, Ireland, 2000, pp. 77–83.
[28] S. Park, M. Trivedi, Analysis and query of person–vehicle interactions in homography domain, in: Proceedings of the 4th ACM International Workshop on Video Surveillance and Sensor Networks, Santa Barbara, CA, USA, 2006, pp. 101–110.
[29] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE 77 (2) (1989) 257–286.
[30] P. Ribeiro, J. Santos-Victor, Human activity recognition from video: modeling, feature selection and classification architecture, in: Proceedings of the International Workshop on Human Activity Recognition and Modelling, Beijing, 2005, pp. 61–78.
[31] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. Smola, R. Williamson, Estimating the support of a high-dimensional distribution, Neural Computation 13 (7) (2001) 1443–1471.
[32] G. Sukthankar, K. Sycara, Automatic recognition of human team behaviors, in: Proceedings of Modeling Others from Observations, Workshop at the International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh, Scotland, 2005.
[33] T. Wang, J. Li, Q. Diao, W. Hu, Y. Zhang, C. Dulong, Semantic event detection using conditional random fields, in: Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), 2006, pp. 109–114.
[34] D. Weinland, R. Ronfard, E. Boyer, Motion history volumes for free viewpoint action recognition, in: IEEE International Workshop on Modeling People and Human Interaction, 2005.
[35] T. Xiang, S. Gong, Video behavior profiling for anomaly detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (5) (2008) 893–908.
[36] D. Zhang, D. Gatica-Perez, S. Bengio, I. McCowan, Semi-supervised adapted HMMs for unusual event detection, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2005, pp. 611–618.
[37] H. Zhong, J. Shi, M. Visontai, Detecting unusual activity in video, in: Proceedings of IEEE Computer Vision and Pattern Recognition, vol. 2, 2004, pp. 819–826.
[38] H. Zhou, D. Kimber, Unusual event detection via multi-camera video mining, in: Proceedings of the 18th International Conference on Pattern Recognition (ICPR), vol. 3, 2006, pp. 1161–1166.
[39] Z. Zivkovic, F. van der Heijden, Efficient adaptive density estimation per image pixel for the task of background subtraction, Pattern Recognition Letters 27 (7) (2006) 773–780.