
Multimedia Systems (2010) 16:345–379

DOI 10.1007/s00530-010-0182-0

REGULAR PAPER

Multimodal fusion for multimedia analysis: a survey


Pradeep K. Atrey · M. Anwar Hossain · Abdulmotaleb El Saddik · Mohan S. Kankanhalli

@article{atrey2010multimodal,
  title={Multimodal fusion for multimedia analysis: a survey},
  author={Atrey, Pradeep K and Hossain, M Anwar and El Saddik, Abdulmotaleb and Kankanhalli, Mohan S},
  journal={Multimedia Systems},
  volume={16}, number={6}, pages={345--379},
  year={2010}, publisher={Springer}}

Received: 8 January 2009 / Accepted: 9 March 2010 / Published online: 4 April 2010
© Springer-Verlag 2010

Communicated by Wu-chi Feng.

P. K. Atrey, Department of Applied Computer Science, University of Winnipeg, Winnipeg, Canada (e-mail: p.atrey@uwinnipeg.ca) · M. A. Hossain and A. El Saddik, Multimedia Communications Research Laboratory, University of Ottawa, Ottawa, Canada (e-mail: anwar@mcrlab.uottawa.ca, abed@mcrlab.uottawa.ca) · M. S. Kankanhalli, School of Computing, National University of Singapore, Singapore (e-mail: mohan@comp.nus.edu.sg)

Abstract This survey aims at providing multimedia researchers with a state-of-the-art overview of fusion strategies, which are used for combining multiple modalities in order to accomplish various multimedia analysis tasks. The existing literature on multimodal fusion research is presented through several classifications based on the fusion methodology and the level of fusion (feature, decision, and hybrid). The fusion methods are described from the perspective of the basic concept, advantages, weaknesses, and their usage in various analysis tasks as reported in the literature. Moreover, several distinctive issues that influence a multimodal fusion process, such as the use of correlation and independence, confidence level, contextual information, synchronization between different modalities, and the optimal modality selection, are also highlighted. Finally, we present the open issues for further research in the area of multimodal fusion.

Keywords Multimodal information fusion · Multimedia analysis

1 Introduction

In recent times, multimodal fusion has gained much attention from many researchers due to the benefit it provides for various multimedia analysis tasks. The integration of multiple media, their associated features, or the intermediate decisions in order to perform an analysis task is referred to as multimodal fusion. A multimedia analysis task involves processing of multimodal data in order to obtain valuable insights about the data, a situation, or a higher-level activity. Examples of multimedia analysis tasks include semantic concept detection, audio-visual speaker detection, human tracking, event detection, etc. Multimedia data used for these tasks could be sensory (such as audio, video, RFID) as well as non-sensory (such as WWW resources, databases). These media and related features are fused together for the accomplishment of various analysis tasks. The fusion of multiple modalities can provide complementary information and increase the accuracy of the overall decision-making process. For example, fusion of audio-visual features along with other textual information has become more effective in detecting events from a team sports video [149], which would otherwise not be possible by using a single medium.

The benefit of multimodal fusion comes with a certain cost and complexity in the analysis process. This is due to the different characteristics of the involved modalities, which are briefly stated in the following:

• Different media are usually captured in different formats and at different rates. For example, a video
may be captured at a frame rate that could be different from the rate at which audio samples are obtained, or even two video sources could have different frame rates. Therefore, the fusion process needs to address this asynchrony to better accomplish a task.
• The processing times of different types of media streams are dissimilar, which influences the fusion strategy that needs to be adopted.
• The modalities may be correlated or independent. The correlation can be perceived at different levels, such as the correlation among low-level features that are extracted from different media streams and the correlation among semantic-level decisions that are obtained based on different streams. On the other hand, the independence among the modalities is also important as it may provide additional cues in obtaining a decision. When fusing multiple modalities, this correlation and independence may equally provide valuable insight based on a particular scenario or context.
• The different modalities usually have varying confidence levels in accomplishing different tasks. For example, for detecting the event of a human crying, we may have higher confidence in an audio modality than a video modality.
• The capturing and processing of media streams may involve certain costs, which may influence the fusion process. The cost may be incurred in units of time, money or other units of measure. For instance, the task of object localization could be accomplished cheaply by using an RFID sensor compared to using a video camera.

The above characteristics of multiple modalities influence the way the fusion process is carried out. Due to these varying characteristics and the objective tasks that need to be carried out, several challenges may appear in the multimodal fusion process, as stated in the following:

• Levels of fusion. One of the earliest considerations is to decide what strategy to follow when fusing multiple modalities. The most widely used strategy is to fuse the information at the feature level, which is also known as early fusion. The other approach is decision level fusion or late fusion [45, 121], which fuses multiple modalities in the semantic space. A combination of these approaches is also practiced as the hybrid fusion approach [144].
• How to fuse? There are several methods that are used in fusing different modalities. These methods are particularly suitable under different settings and are described in this paper in greater detail. The discussion also includes how the fusion process utilizes the feature and decision level correlation among the modalities [103], and how the contextual [100] and the confidence information [18] influence the overall fusion process.
• When to fuse? The time when the fusion should take place is an important consideration in the multimodal fusion process. Certain characteristics of media, such as varying data capture rates and processing times of the media, pose challenges on how to synchronize the overall process of fusion. Often this has been addressed by performing the multimedia analysis tasks (such as event detection) over a timeline [29]. A timeline refers to a measurable span of time with information denoted at designated points. The timeline-based accomplishment of a task requires identification of designated points at which fusion of data or information should take place. Due to the asynchrony and diversity among streams and due to the fact that different analysis tasks are performed at different granularity levels in time, the identification of these designated points, i.e. when the fusion should take place, is a challenging issue [8].
• What to fuse? The different modalities used in a fusion process may provide complementary or contradictory information, and therefore knowing which modalities are contributing towards accomplishing an analysis task needs to be understood. This is also related to finding the optimal number of media streams [9, 143] or feature sets required to accomplish an analysis task under the specified constraints. If the most suitable subset is unavailable, can one use alternate streams without much loss of cost-effectiveness and confidence?

This paper presents a survey of the research related to multimodal fusion for multimedia analysis in light of the above challenges. Existing surveys in this direction are mostly focused on a particular aspect of the analysis task, such as multimodal video indexing [26, 120]; automatic audio-visual speech recognition [106]; biometric audio-visual speech synchrony [20]; multi-sensor management for information fusion [146]; face recognition [153]; multimodal human computer interaction [60, 97]; audio-visual biometrics [5]; multi-sensor fusion [79] and many others. In spite of this literature, a comprehensive survey focusing on the different methodologies and issues related to multimodal fusion for performing different multimedia analysis tasks is still missing. The presented survey aims to contribute in this direction. The fusion problem has also been addressed in other domains such as machine learning [48], data mining [24] and information retrieval [133]; however, the focus of this paper is restricted to the multimedia research domain.

Consequently, this work comments on the state-of-the-art literature that uses different multimodal fusion strategies for various analysis tasks such as audio-visual person
tracking, video summarization, multimodal dialog understanding, speech recognition and so forth. It also presents several classifications of the existing literature based on the fusion methodology and the level of fusion. Various issues such as the use of correlation, context and confidence, and the optimal modality selection, which influence the performance of a multimodal fusion process, are also critically discussed.

The remainder of this paper is organized as follows. In Sect. 2, we first address the issue levels of fusion and accordingly describe three levels (feature, decision and hybrid) of multimodal fusion, their characteristics, advantages and limitations. Section 3 addresses the issue how to fuse by describing the various fusion methods that have been used for multimedia analysis. These fusion methods have been elaborated under three different categories—the rule-based methods, the estimation-based methods, and the classification-based methods. In this section, we analyze various related works from the perspective of the level of fusion, the modality used, and the multimedia analysis task performed. A discussion regarding the different fusion methodologies and the works we analyzed is also presented here. Some other issues (e.g. the use of correlation, confidence, and the context), also related to how to fuse, are described in Sect. 4. This section further elaborates the issues when to fuse (the synchronization), and what to fuse (the optimal media selection). Section 5 provides a brief overview of the publicly available data sets and evaluation measures in multimodal fusion research. Finally, Sect. 6 concludes the paper by pointing out the open issues and possible avenues of further research in the area of multimodal fusion for multimedia analysis.

2 Levels of fusion

The fusion of different modalities is generally performed at two levels: feature level or early fusion and decision level or late fusion [3, 45, 121]. Some researchers have also followed a hybrid approach by performing fusion at the feature as well as the decision level.

Figure 1 shows different variants of the feature, decision, and hybrid level fusion strategies. We now describe the three levels of fusion and highlight their pros and cons. Various works that have adopted different fusion models at different levels (feature, decision and hybrid) in different scenarios will be discussed in Sect. 3.

2.1 Feature level multimodal fusion

In the feature level or early fusion approach, the features extracted from input data are first combined and then sent as input to a single analysis unit (AU) that performs the analysis task. Here, features refer to some distinguishable properties of a media stream. For example, the feature fusion (FF) unit merges multimodal features such as skin color and motion cues into a larger feature vector which is taken as the input to the face detection unit in order to detect a face. An illustration of this is provided in Fig. 1. While Fig. 1a shows an AU that receives a set of either features or decisions and provides a semantic-level decision, Fig. 1b shows a FF unit that receives a set of features F1 to Fn and combines them into a feature vector F1,n. Figure 1d shows an instance of the feature level multimodal analysis task in which the extracted features are first fused using a FF unit and then the combined feature vector is passed to an AU for analysis.

In the feature level fusion approach, the number of features extracted from different modalities may be numerous; they may be summarized as follows [138, 150] (a small extraction sketch follows the list):

• Visual features. These may include features based on color (e.g. color histogram), texture (e.g. measures of coarseness, directionality, contrast), shape (e.g. blobs), and so on. These features are extracted from the entire image, fixed-size patches or blocks, segmented image blobs or automatically detected feature points.
• Text features. The textual features can be extracted from the automatic speech recognizer (ASR) transcript, video optical character recognition (OCR), video closed caption text, and production metadata.
• Audio features. The audio features may be generated based on the short-time Fourier transform, including the fast Fourier transform (FFT) and mel-frequency cepstral coefficients (MFCC), together with other features such as zero crossing rate (ZCR), linear predictive coding (LPC), volume standard deviation, non-silence ratio, spectral centroid and pitch.
• Motion features. These can be represented in the form of kinetic energy, which measures the pixel variation within a shot, motion direction and magnitude histograms, optical flows and motion patterns in specific directions.
• Metadata. The metadata features are used as supplementary information in the production process, such as the name, the time stamp, the source of an image or video as well as the duration and location of shots. They can provide extra information to text or visual features.
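To make the feature types above concrete, here is a small, self-contained sketch (NumPy only) that computes a color histogram from a video frame and two simple audio features; the randomly generated frame and audio window, and the particular feature choices, are illustrative assumptions rather than the pipeline of any cited system.

```python
import numpy as np

def color_histogram(frame_rgb: np.ndarray, bins: int = 8) -> np.ndarray:
    """Normalized per-channel color histogram of an H x W x 3 uint8 frame."""
    hist = [np.histogram(frame_rgb[..., c], bins=bins, range=(0, 256))[0]
            for c in range(3)]
    hist = np.concatenate(hist).astype(float)
    return hist / hist.sum()

def zero_crossing_rate(audio: np.ndarray) -> float:
    """Fraction of consecutive audio samples that change sign (ZCR)."""
    return float(np.mean(np.abs(np.diff(np.sign(audio))) > 0))

def short_time_energy(audio: np.ndarray) -> float:
    """Mean squared amplitude of the audio window."""
    return float(np.mean(audio.astype(float) ** 2))

# Hypothetical inputs standing in for one decoded video frame and 1 s of audio.
frame = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)
audio = np.random.randn(16000)

visual_feat = color_histogram(frame)
audio_feat = np.array([zero_crossing_rate(audio), short_time_energy(audio)])
```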
The feature level fusion is advantageous in that it can utilize the correlation between multiple features from different modalities at an early stage, which helps in better task accomplishment. Also, it requires only one learning phase on the combined feature vector [121]. However, in this approach it is hard to represent the time synchronization between the multimodal features [144]. This is because
[Fig. 1 Multimodal fusion strategies and conventions as used in this paper: a analysis unit (AU), b feature fusion (FF) unit, c decision fusion (DF) unit, d feature level multimodal analysis, e decision level multimodal analysis, f hybrid multimodal analysis.]

the features from different but closely coupled modalities could be extracted at different times. Moreover, the features to be fused should be represented in the same format before fusion. In addition, the increase in the number of modalities makes it difficult to learn the cross-correlation among the heterogeneous features. Various approaches to resolve the synchronization problem are discussed in Sect. 4.2.

Several researchers have adopted the early fusion approach for different multimedia analysis tasks. For instance, Nefian et al. [86] have adopted an early fusion approach in combining audio and visual features for speech recognition.
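A minimal sketch of this early-fusion pipeline (Fig. 1b/1d) is shown below; the randomly generated data, the feature dimensions and the choice of an SVM as the single analysis unit are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

def early_fusion_vector(audio_feat: np.ndarray, visual_feat: np.ndarray,
                        text_feat: np.ndarray) -> np.ndarray:
    """Concatenate per-modality feature vectors into one fused vector (FF unit)."""
    return np.concatenate([audio_feat, visual_feat, text_feat])

# Hypothetical training data: one row per sample, already extracted per modality.
audio = np.random.rand(100, 13)        # e.g. 13 MFCCs per sample
visual = np.random.rand(100, 18)       # e.g. shape + texture features
text = np.random.rand(100, 50)         # e.g. bag-of-words counts from an ASR transcript
labels = np.random.randint(0, 2, 100)  # semantic concept present / absent

X = np.vstack([early_fusion_vector(a, v, t) for a, v, t in zip(audio, visual, text)])
clf = SVC(kernel="rbf").fit(X, labels)  # single learning phase on the fused vector
```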
2.2 Decision level multimodal fusion

In the decision level or late fusion approach, the analysis units first provide the local decisions D1 to Dn (see Fig. 1) that are obtained based on individual features F1 to Fn. The local decisions are then combined using a decision fusion (DF) unit to make a fused decision vector that is analyzed further to obtain a final decision D about the task or the hypothesis. Here, a decision is the output of an analysis unit at the semantic level. An illustration of the DF unit is provided in Fig. 1c, whereas Fig. 1e shows an instance of the decision level multimodal analysis in which the decisions obtained from various AUs are fused using a DF unit and the combined decision vector is further processed by an AU.

The decision level fusion strategy has many advantages over feature fusion. For instance, unlike feature level fusion, where the features from different modalities (e.g. audio and video) may have different representations, the decisions (at the semantic level) usually have the same representation. Therefore, the fusion of decisions becomes easier. Moreover, the decision level fusion strategy offers scalability (i.e. graceful upgradation or degradation) in terms of the modalities used in the fusion process, which is difficult to achieve in the feature level fusion [9]. Another advantage of the late fusion strategy is that it allows us to use
the most suitable methods for analyzing each single modality, such as a hidden Markov model (HMM) for audio and a support vector machine (SVM) for image. This provides much more flexibility than the early fusion.

On the other hand, the disadvantage of the late fusion approach lies in its failure to utilize the feature level correlation among modalities. Moreover, as different classifiers are used to obtain the local decisions, the learning process for them becomes tedious and time-consuming.

Several researchers have successfully adopted the decision level fusion strategy. For example, Iyengar et al. [57] performed fusion of decisions obtained from a face detector and a speech recognizer along with their synchrony score by adopting two approaches—a linear weighted sum and a linear weighted product.
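The structure of such a decision-level pipeline can be sketched as follows; the synthetic data, the fixed weights and the use of a logistic-regression/SVM pair as stand-ins for the per-modality analysis units (e.g. an HMM on audio, an SVM on images) are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Each modality gets its own classifier; only their semantic-level scores are fused.
rng = np.random.default_rng(0)
X_audio, X_video = rng.random((200, 13)), rng.random((200, 32))
y = rng.integers(0, 2, 200)

audio_clf = LogisticRegression(max_iter=1000).fit(X_audio, y)
video_clf = SVC(probability=True).fit(X_video, y)

p_audio = audio_clf.predict_proba(X_audio)[:, 1]   # local decision D1
p_video = video_clf.predict_proba(X_video)[:, 1]   # local decision D2
fused = 0.6 * p_audio + 0.4 * p_video              # decision-level weighted sum (DF unit)
final_decision = (fused > 0.5).astype(int)
```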
2.3 Hybrid multimodal fusion

To exploit the advantages of both the feature level and the decision level fusion strategies, several researchers have opted to use a hybrid fusion strategy, which is a combination of both feature and decision level strategies. An illustration of the hybrid level strategy is presented in Fig. 1f, where the features are first fused by a FF unit and then the feature vector is analyzed by an AU. At the same time, other individual features are analyzed by different AUs and their decisions are fused using a DF unit. Finally, all the decisions obtained from the previous stages are further fused by a DF unit to obtain the final decision.

A hybrid fusion approach can utilize the advantages of both early and late fusion strategies. Therefore, many researchers ([16, 88, 149], etc.) have used the hybrid fusion strategy to solve various kinds of multimedia analysis problems.

3 Methods for multimodal fusion

In this section, we provide an overview of the different fusion methods that have been used by multimedia researchers to perform various multimedia analysis tasks. The advantages and the drawbacks of each method are also highlighted. The fusion methods are divided into the following three categories: rule-based methods, classification-based methods, and estimation-based methods (as shown in Fig. 2). This categorization is based on the basic nature of these methods and it inherently means a classification of the problem space; for example, a problem of estimating parameters is solved by estimation-based methods. Similarly, the problem of obtaining a decision based on certain observations can be solved by classification-based or rule-based methods. However, if the observations are obtained from different modalities, the method would require fusion of the observation scores before estimation or making a classification decision.

[Fig. 2 A categorization of the fusion methods: rule-based methods (linear weighted fusion, majority voting rule, custom-defined rules); classification-based methods (support vector machine, Bayesian inference, Dempster–Shafer theory, dynamic Bayesian networks, neural networks, maximum entropy model); estimation-based methods (Kalman filter, extended Kalman filter, particle filter).]

While the next three sections (Sects. 3.1–3.3) are devoted to the above three classes of fusion methods, in the last section (Sect. 3.4) we present a comparative analysis of all the fusion methods.

3.1 Rule-based fusion methods

The rule-based fusion methods include a variety of basic rules for combining multimodal information. These include statistical rule-based methods such as linear weighted fusion (sum and product), MAX, MIN, AND, OR, and majority voting. The work by Kittler et al. [69] has provided the theoretical introduction to these rules. In addition to these rules, there are custom-defined rules that are constructed from the specific application perspective. The rule-based schemes generally perform well if the quality of temporal alignment between different modalities is good. In the following, we describe some representative works that have adopted the rule-based fusion strategy.

3.1.1 Linear weighted fusion

Linear weighted fusion is one of the simplest and most widely used methods. In this method, the information obtained from different modalities is combined in a linear fashion. The information could be the low-level features (e.g. color and motion cues in video frames) [136] or the semantic-level decisions (i.e. occurrence of an event) [90]. To combine the information, one may assign normalized weights to the different modalities. In the literature, there are various methods for weight normalization such as min–max, decimal scaling, z-score, tanh-estimators and the sigmoid function [61]. Each of these methods has pros and cons. The min–max, decimal scaling and z-score methods are preferred when the matching scores (minimum and maximum values for min–max, maximum for decimal scaling, and mean and standard deviation for z-score) of the individual modalities can be easily computed. But these methods are sensitive to outliers. On the other hand, the tanh
normalization method is both robust and efficient but requires estimation of the parameters using training. Note that, in the absence of prior knowledge about the weights, equal weights are usually assigned to them.
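A small sketch of these normalization schemes, assuming hypothetical raw matching scores; the tanh estimator is shown in a simplified form that uses the plain mean and standard deviation rather than the robust Hampel estimators normally used with it.

```python
import numpy as np

def normalize_scores(scores: np.ndarray, method: str = "minmax") -> np.ndarray:
    """Map raw matching scores of one modality to a comparable range."""
    s = scores.astype(float)
    if method == "minmax":
        return (s - s.min()) / (s.max() - s.min())
    if method == "zscore":
        return (s - s.mean()) / s.std()
    if method == "tanh":
        # Simplified tanh estimator (mean/std instead of Hampel estimators).
        return 0.5 * (np.tanh(0.01 * (s - s.mean()) / s.std()) + 1.0)
    raise ValueError(method)

audio_scores = np.array([0.2, 3.5, 2.9, 10.0])   # hypothetical raw scores
print(normalize_scores(audio_scores, "minmax"))
print(normalize_scores(audio_scores, "tanh"))
```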
The general methodology of linear fusion can be described as follows. Let I_i, 1 ≤ i ≤ n, be a feature vector obtained from the ith media source (e.g. audio, video, etc.) or a decision obtained from a classifier. Also, let w_i, 1 ≤ i ≤ n, be the normalized weight assigned to the ith media source or classifier. (To maintain consistency, we will use these notations for modalities in the rest of this paper.) These vectors, assuming that they have the same dimensions, are combined by using sum or product operators and used by the classifiers to provide a high-level decision. This is shown in Eqs. 1 and 2, which are as follows:

I = \sum_{i=1}^{n} w_i I_i    (1)

I = \prod_{i=1}^{n} I_i^{w_i}    (2)

This method is computationally less expensive compared to other methods. However, a fusion system needs to determine and adjust the weights for the optimal accomplishment of a task.
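Eqs. 1 and 2 translate directly into code; the scores and weights below are invented for illustration.

```python
import numpy as np

def weighted_sum_fusion(vectors, weights):
    """Eq. 1: I = sum_i w_i * I_i, for same-dimension vectors (or scalar decisions)."""
    return sum(w * np.asarray(x, dtype=float) for w, x in zip(weights, vectors))

def weighted_product_fusion(vectors, weights):
    """Eq. 2: I = prod_i I_i ** w_i (element-wise), suited to probability-like scores."""
    fused = np.ones_like(np.asarray(vectors[0], dtype=float))
    for w, x in zip(weights, vectors):
        fused *= np.asarray(x, dtype=float) ** w
    return fused

# Hypothetical normalized detection scores from audio, video and text streams.
scores = [np.array([0.9, 0.1]), np.array([0.7, 0.3]), np.array([0.6, 0.4])]
weights = [0.5, 0.3, 0.2]                 # assumed normalized to sum to 1
print(weighted_sum_fusion(scores, weights))
print(weighted_product_fusion(scores, weights))
```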
Several researchers have adopted the linear fusion strategy at the feature level for performing various multimedia analysis tasks. Examples include Foresti and Snidaro [40] and Yang et al. [152] for detecting and tracking people, and Wang et al. [136] and Kankanhalli et al. [67] for video surveillance and traffic monitoring. The linear fusion strategy has also been adopted at the decision level by several researchers. These include Neti et al. [87] for speaker recognition and speech event detection, Iyengar et al. [57] for monologue detection, Iyengar et al. [58] for semantic concept detection and annotation in video, Lucey et al. [78] for spoken word recognition, Hua and Zhang [55] for image retrieval, McDonald and Smeaton [83] for video shot retrieval, and Jaffre and Pinquier [59] for person identification. We briefly describe these works in the following.

Foresti and Snidaro [40] used a linear weighted sum method to fuse trajectory information of the objects. The video data from each sensor in a distributed sensor network is processed for moving object detection (e.g. a blob). Once the blob locations are extracted from all sensors, their trajectory coordinates are averaged in a linear weighted fashion in order to estimate the correct location of the blob. The authors have also assigned weights to different sensors; however, the determination of these weights has been left to the user. Similar to [40], Yang et al. [152] also performed linear weighted fusion of the location information of the objects. However, unlike Foresti and Snidaro [40], Yang et al. [152] assigned equal weights to the different modalities.

The linear weighted sum strategy at the feature level has also been proposed by Wang et al. [136] for human tracking. In this work, the authors have fused several spatial cues such as color, motion and texture by assigning appropriate weights to them. However, in the fusion process, the issue of how different weights should be assigned to different cues has not been discussed. This work was extended by Kankanhalli et al. [67] for face detection, monologue detection, and traffic monitoring. In both works, the authors used a sigmoid function to normalize the weights of the different modalities.

Neti et al. [87] obtained individual decisions for speaker recognition and speech event detection from audio features (e.g. phonemes) and visual features (e.g. visemes). They adopted a linear weighted sum strategy to fuse these individual decisions. The authors used the training data to determine the relative reliability of the different modalities and accordingly adjusted their weights. Similar to this fusion approach, Iyengar et al. [57] fused multiple modalities (face, speech and the synchrony score between them) by adopting two approaches at the decision level—a linear weighted sum and a linear weighted product. This methodology was applied for monologue detection. The synchrony or correlation between face and speech has been computed in terms of mutual information between them by considering the audio and video features as locally Gaussian distributed. The mutual information is a measure of the information of one modality conveyed about another. The weights of the different modalities have been determined at the training stage. While fusing different modalities, the authors have found the linear weighted sum approach to be a better option than the linear weighted product for their data set. This approach was later extended for semantic concept detection and annotation in video by Iyengar et al. [58]. Similar to [57], the linear weighted product fusion strategy has also been adopted by Jaffre and Pinquier [59] for fusing different modalities. In this work, the authors have proposed a multimodal person identification system by automatically associating voice and image using a standard product rule. The association is done through fusion of video and audio indexes. The proposed work used a common indexing mechanism for both audio and video based on frame-by-frame analysis. The audio and video indexes were fused using a product fusion rule at the late stage.

In another work, Lucey et al. [78] performed a linear weighted fusion for the recognition of spoken words. The word recognizer modules, which work on audio and video data separately, provided decisions about a word in terms
of the log likelihoods. These decisions are linearly fused by assigning weights to them. To determine the weights of the two decision components, the authors have chosen the discrete values (0, 0.5 and 1), which is a simple but non-realistic choice.

A decision level fusion scheme proposed by Hua and Zhang [55] is based on human psychological observations which they call "attention". The core idea of this approach is to fuse the decisions taken based on different cues such as the strength of a sound, the speed of a motion, the size of an object and so forth. These cues are considered as the attention properties and are measured by obtaining a set of features including color histogram, color moment, wavelet, block wavelet, correlogram, and blocked correlogram. The authors proposed a new fusion function which they call the "attention fusion function". This new function is a variation of a linear weighted sum strategy and is derived by adding the difference of two decisions to their average (please refer to [55] for the formalism). The authors have demonstrated the utility of the proposed attention based fusion model for image retrieval. Experimental results have shown that the proposed approach performed better in comparison to average or maximal fusion rules.

In the context of video retrieval, McDonald and Smeaton [83] have employed a decision level linear weighted fusion strategy to combine the normalized scores and ranks of the retrieval results. The normalization was performed using the max–min method. The video shots were retrieved using different modalities such as text and multiple visual features (color, edge and texture). In this work, the authors found that combining scores with different weights has been best for combining text and visual results for TRECVid type searches, while combining scores and ranks with equal weights has been best for combining multiple features for a single query image. A similar approach was adopted by Yan et al. [151] for re-ranking the video. In this work, the authors used a linear weighted fusion strategy at the decision level in order to combine the retrieval scores obtained based on text and other modalities such as audio, video and motion.

From the works discussed above, it is observed that the optimal weight assignment is the major drawback of the linear weighted fusion method. The issue of finding the appropriate weight (or confidence level) for different modalities is an open research issue. This issue is further elaborated in Sect. 4.1.2.

3.1.2 Majority voting

Majority voting is a special case of weighted combination with all weights equal. In majority voting based fusion, the final decision is the one on which the majority of the classifiers reach a similar decision [113]. For example, Radova and Psutka [108] have presented a speaker identification system by employing multiple classifiers. Here, the raw speech samples from the speaker are treated as features. From the speech samples, a set of patterns is identified for each speaker. A pattern usually contains a current utterance of several vowels. Each pattern is classified by two different classifiers. The output scores of all the classifiers were fused in a late integration approach to obtain the majority decision regarding the identity of the unknown speaker.
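Majority voting itself is a one-liner; a minimal sketch with hypothetical per-pattern speaker decisions follows (ties are broken arbitrarily, which real systems would handle explicitly).

```python
from collections import Counter

def majority_vote(decisions: list[str]) -> str:
    """Return the label chosen by most classifiers."""
    return Counter(decisions).most_common(1)[0][0]

# Hypothetical speaker-identity decisions from several per-pattern classifiers.
print(majority_vote(["spk_3", "spk_1", "spk_3", "spk_3", "spk_7"]))  # -> "spk_3"
```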
3.1.3 Custom-defined rules

Unlike the above approaches that use standard statistical rules, Pfleger [100] presented a production rule-based decision level fusion approach for integrating inputs from the pen and speech modalities. In this approach, each input modality (e.g. pen input) is interpreted within its context of use, which is determined based on the previously recognized input events and dialog states belonging to the same user turn. A production rule consists of a weighting factor and a condition-action part. These rules are further divided into three classes that work together to contribute to the fusion process. First, the synchronization rules are applied to track the processing state of the individual recognizer (e.g. speech recognizer) and, in the case of pending recognition results, the other classes of rules are not fired, to ensure synchronization. Second, the rules for multimodal event interpretation are used to determine which of the input events has the lead and needs to be integrated. Furthermore, there may be conflicting events due to recognition or interpretation errors, which are addressed by obtaining the event with the highest score. Third, the rules for unimodal interpretations are adopted when one of the recognizers does not produce any meaningful result, for example a time-out by one recognizer, which will lead to a single modality based decision making. This approach is further extended [101] and applied for discourse processing in a multiparty dialog scenario.

In another work, Holzapfel et al. [49] showed an example of a multimodal integration approach using custom-defined rules. The authors combined speech and 3D pointing gestures as a means of natural interaction with a robot in a kitchen. Multimodal fusion is performed at the decision level based on the n-best lists generated by each of the event parsers. Their experiments showed that there is a close correlation in time between speech and gesture. Similarly, in [32], a rule-based system has been proposed for the fusion of speech and 2D gestures in human computer interaction. Here the audio and gesture modalities are fused at the decision level. A drawback of these approaches is the
overhead to determine the best action based on the n-best fused input.

In addition to video, audio and gesture, other modalities such as closed caption text and external metadata have been used for several applications such as video indexing and content analysis for team sports videos. On this account, Babaguchi et al. [12] presented a knowledge-based technique to leverage the closed caption text of broadcast video streams for indexing video shots based on the temporal correspondence between them. The closed caption text features are extracted as keywords and the video features are extracted as temporal changes of color distribution. This work presumably integrates textual and visual modalities using a late fusion strategy.

3.1.4 Remarks on rule-based fusion methods

A summary of all the works (related to the rule-based fusion methods) described above is provided in Table 1. As can be seen from the table, in the rule-based fusion category, the linear weighted fusion method has been widely used by researchers. It is a simple as well as computationally less expensive approach. This method performs well if the weights of the different modalities are appropriately determined, which has been a major issue in using this method. In the existing literature, this method has been used for face detection, human tracking, monologue detection, speech and speaker recognition, image and video retrieval, and person identification. On the other hand, fusion using custom-defined rules has the flexibility of adding rules based on the requirements. However, in general, these rules are domain specific and defining the rules requires proper knowledge of the domain. This fusion method is widely used in the domain of multimodal dialog systems and sports video analysis.
Table 1 A list of the representative works in the rule-based fusion methods category

| Fusion method | Level of fusion | The work | Modalities | Multimedia analysis task |
| Linear weighted fusion | Feature | Foresti and Snidaro [40] | Video (trajectory coordinates) | Human tracking |
| | Feature | Wang et al. [136] | Video (color, motion and texture) | Human tracking |
| | Feature | Yang et al. [152] | Video (trajectory coordinates) | Human tracking |
| | Feature | Kankanhalli et al. [67] | Video (color, motion and texture) | Face detection, monologue detection and traffic monitoring |
| | Decision | Neti et al. [87] | Audio (phonemes) and visual (visemes) | Speaker recognition |
| | Decision | Lucey et al. [78] | Audio (MFCC), video (Eigenlip) | Spoken word recognition |
| | Decision | Iyengar et al. [57, 58] | Audio (MFCC), video (DCT of the face region) and the synchrony score | Monologue detection, semantic concept detection and annotation in video |
| | Decision | Hua and Zhang [55] | Image (six features: color histogram, color moment, wavelet, block wavelet, correlogram, blocked correlogram) | Image retrieval |
| | Decision | Yan et al. [151] | Text (closed caption, video OCR), audio, video (color, edge and texture histogram), motion | Video retrieval |
| | Decision | McDonald and Smeaton [83] | Text and video (color, edge and texture) | Video retrieval |
| | Decision | Jaffre and Pinquier [59] | Audio, video index | Person identification from audio-visual sources |
| Majority voting rule | Decision | Radova and Psutka [108] | Raw speech (set of patterns) | Speaker identification from audio sources |
| Custom-defined rules | Decision | Babaguchi et al. [12] | Visual (color), closed caption text (keywords) | Semantic sports video indexing |
| | Decision | Corradini et al. [32] | Speech, 2D gesture | Human computer interaction |
| | Decision | Holzapfel et al. [49] | Speech, 3D pointing gesture | Multimodal interaction with a robot |
| | Decision | Pfleger [100] | Pen gesture, speech | Multimodal dialog system |

3.2 Classification-based fusion methods

This category of methods includes a range of classification techniques that have been used to classify the multimodal observation into one of the pre-defined classes. The methods in this category are the support vector machine,
Bayesian inference, Dempster–Shafer theory, dynamic Bayesian networks, neural networks and the maximum entropy model. Note that we can further classify these methods as generative and discriminative models from the machine learning perspective. For example, Bayesian inference and dynamic Bayesian networks are generative models, while support vector machines and neural networks are discriminative models. However, we skip further discussion of such a classification for brevity.

3.2.1 Support vector machine

The support vector machine (SVM) [23] has become increasingly popular for data classification and related tasks. More specifically, in the domain of multimedia, SVMs are being used for different tasks including feature categorization, concept classification, face detection, text categorization, modality fusion, etc. Basically, SVM is considered a supervised learning method and is used as an optimal binary linear classifier, where a set of input data vectors is partitioned as belonging to either one of the two learned classes. From the perspective of multimodal fusion, SVM is used to solve a pattern classification problem, where the input to this classifier is the scores given by the individual classifiers. The basic SVM method is extended to create a non-linear classifier by using the kernel concept, where every dot product in the basic SVM formalism is replaced using a non-linear kernel function.

Much of the existing literature uses the SVM-based fusion scheme. Adams et al. [3] adopted a late fusion approach in order to detect semantic concepts (e.g. sky, fire-smoke) in videos using visual, audio and textual modalities. They use a discriminative learning approach while fusing different modalities at the semantic level. For example, the scores of all intermediate concept classifiers are used to construct a vector that is passed as the semantic feature to an SVM, as shown in Fig. 3. This figure depicts that audio, video and text scores are combined in a high-dimensional vector before being classified by the SVM. The black and white dots in the figure represent two semantic concepts. A similar approach has been adopted by Iyengar et al. [58] for concept detection and annotation in video.

[Fig. 3 SVM based score space classification of combined information from multiple intermediate concepts [3]: audio, video and text scores are stacked into a high-dimensional vector and separated by an SVM boundary into two concept spaces.]
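A minimal sketch of this score-space (late) fusion in the spirit of Fig. 3, using randomly generated intermediate-concept scores as stand-ins for real detector outputs and a second-stage SVM as the fusion classifier; all data and dimensions are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n = 300
# Hypothetical confidence scores of intermediate concept classifiers per modality.
audio_scores = rng.random((n, 3))   # e.g. 3 audio concept detectors
video_scores = rng.random((n, 3))   # e.g. 3 visual concept detectors
text_scores = rng.random((n, 2))    # e.g. 2 textual concept detectors
y = rng.integers(0, 2, n)           # target semantic concept present / absent

score_space = np.hstack([audio_scores, video_scores, text_scores])
meta_svm = SVC(kernel="rbf").fit(score_space, y)   # second-stage fusion classifier
prediction = meta_svm.predict(score_space[:5])
```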
Wu et al. [141] reported two approaches to study the optimal combination of multimodal information for video concept detection: gradient-descent-optimization linear fusion (GLF) and super-kernel nonlinear fusion (NLF). In GLF, an individual kernel matrix is first constructed for each modality, providing a partial view of the target concept. The individual kernel matrices are then fused based on a weighted linear combination scheme. A gradient-descent technique is used to find the optimal weights to combine the individual kernels. Finally, SVM is used on the fused kernel matrix to classify the target concept. Unlike GLF, the NLF method is used for non-linear combination of multimodal information. This method is based on [3], where SVM is first used as a classifier for the individual modality and then super-kernel non-linear fusion is applied for optimal combination of the individual classifier models. The experiments on the TREC-2003 Video Track benchmark showed that NLF and GLF performed 8.0 and 5.0% better than the best single modality, respectively. Furthermore, NLF had on average 3.0% better performance than GLF. The NLF fusion approach was later extended by the authors [143] in order to obtain the best independent modalities (early fusion) and the strategy to fuse the best modalities (late fusion).
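The kernel-combination idea behind GLF can be sketched as follows; for brevity the combination weights are fixed constants here rather than being learned by gradient descent, and the data are random placeholders.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X_audio = rng.random((150, 13))
X_video = rng.random((150, 32))
y = rng.integers(0, 2, 150)

K_audio = rbf_kernel(X_audio)          # modality-specific Gram matrices
K_video = rbf_kernel(X_video)
weights = [0.6, 0.4]                   # fixed here; GLF learns these by gradient descent
K_fused = weights[0] * K_audio + weights[1] * K_video

svm = SVC(kernel="precomputed").fit(K_fused, y)   # SVM on the fused kernel matrix
train_predictions = svm.predict(K_fused)
```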


A hybrid fusion approach has been presented by Ayache et al. [11] as normalized early fusion and contextual late fusion for semantic indexing of multimedia resources using visual and text cues. Unlike other works, in the case of normalized early fusion, each entry of the concatenated vector is normalized and then fused. In the case of contextual late fusion, a second-layer classifier based on SVM is used to exploit the contextual relationship between the different concepts. Here, the authors have also presented a kernel-based fusion scheme based on SVMs, where the kernel functions are chosen according to the different modalities.

In the area of image classification, Zhu et al. [156] have reported a multimodal fusion framework to classify images that have embedded text within their spatial coordinates. The fusion process follows two steps. At first, a bag-of-words model [73] is applied to classify the given image based on its low-level visual features. In parallel, a text detector finds the text existing in the image using text color, size, location, edge density, brightness, contrast, etc. In the second step, a pair-wise SVM classifier is used for fusing the visual and textual features together. This is illustrated in Fig. 4.

[Fig. 4 Multimodal fusion using visual and text cues for image classification based on a pair-wise SVM classifier [156]: low-level visual cues feed a bag-of-words model yielding class probabilities, a text detector yields text lines and related features, and both are fused by an SVM-based classifier.]

In a recent work, Bredin and Chollet [19] proposed a biometric-based identification scheme of a talking face. The key idea was to utilize the synchrony measure between the talking face's voice and the corresponding video frames. Audio and visual sample rates are balanced by linear interpolation. By adopting a late fusion approach, the scores from the monomodal biometric speaker verification, face recognition, and synchrony were combined and passed to the SVM model, which provided the decision about the identity of the talking face. On another front, Aguilar et al. [4] provided a comparison between the rule-based fusion and the learning-based (trained) fusion strategy. The scores of face, fingerprint and online signature are combined using both the sum rule and a radial basis function SVM (RBF SVM) for comparison. The experimental results demonstrate that the learning-based RBF SVM scheme outperforms the rule-based scheme given some appropriate parameter selection.

Snoek et al. [121] have compared both the early and late fusion strategies for semantic video analysis. Using the former approach, the visual vector has been concatenated with the text vector and then normalized for use as input to an SVM to learn the semantic concept. In the latter approach the authors have adopted a probabilistic aggregation mechanism. Based on an experiment on 184 h of broadcast video using 20 semantic concepts, this study concluded that a late fusion strategy provided better performance for most concepts, but it bears an increased learning effort. The conclusion also suggested that when the early fusion performed better, the improvements were significant. However, which of the fusion strategies is better in which case needs further investigation.

3.2.2 Bayesian inference

Bayesian inference is often referred to as the 'classical' sensor fusion method because it has been widely used and many other methods are based on it [45]. In this method, the multimodal information is combined as per the rules of probability theory [79]. The method can be applied at the feature level as well as at the decision level. The observations obtained from multiple modalities or the decisions obtained from different classifiers are combined, and an inference of the joint probability of an observation or a decision is derived [109].

The Bayesian inference fusion method is briefly described as follows. Let us fuse the feature vectors or the decisions (I_1, I_2, \ldots, I_n) obtained from n different modalities. Assuming that these modalities are statistically independent, the joint probability of a hypothesis H based on the fused feature vectors or the fused decisions can be computed as [102]:

p(H | I_1, I_2, \ldots, I_n) = \frac{1}{N} \prod_{k=1}^{n} p(I_k | H)^{w_k}    (3)

where N is used to normalize the posterior probability estimate p(H | I_1, I_2, \ldots, I_n). The term w_k is the weight of the kth modality, and \sum_{k=1}^{n} w_k = 1. This posterior probability is computed for all the possible hypotheses, E. The hypothesis that has the maximum probability is determined using the MAP rule \hat{H} = \arg\max_{H \in E} p(H | I_1, I_2, \ldots, I_n).

The Bayesian inference method has various advantages. Based on new observations, it can incrementally compute the probability of the hypothesis being true. It allows any prior knowledge about the likelihood of the hypothesis to be utilized in the inference process. The new observation or decision is used to update the a priori probability in order to compute the posterior probability of the hypothesis. Moreover, in the absence of empirical data, this method permits the use of a subjective probability estimate for the a priori of hypotheses [140].
Snoek et al. [121] have compared both the early and late These advantages of the Bayesian method are seen as its
fusion strategies for semantic video analysis. Using the limitations in some cases. Bayesian inference method
former approach, the visual vector has been concatenated requires a priori and the conditional probabilities of the
with the text vector and then normalized to use as input in hypothesis to be well defined [110]. In absence of any
SVM to learn the semantic concept. In the latter approach knowledge of suitable priors, the method does not perform
the authors have adopted a probabilistic aggregation well. For example, in gesture recognition scenarios, it is
mechanism. Based on an experiment on 184 h of broadcast sometimes difficult to classify a gesture of two stretched
video using 20 semantic concepts, this study concluded that fingers in ‘‘V’’ form. This gesture can be interpreted as
a late fusion strategy provided better performance for most either ‘‘victory sign’’ or ‘‘sign indicating number two’’. In
concepts, but it bears an increased learning effort. The this case, since the priori probability of both the classes
conclusion also suggested that when the early fusion per- would be 0.5, the Bayesian method would provide
formed better, the improvements were significant. How- ambiguous results. Another limitation of this method is that
ever, which of the fusion strategies is better in which case it is often found unsuitable for handling mutually exclusive
needs further investigation. hypotheses and general uncertainty. It means that only one
hypothesis can be true at any given time. For example,
3.2.2 Bayesian inference Bayesian inference method would consider two events of
human’s running and walking mutually exclusive and
The Bayesian inference is often referred to as the ‘classi- cannot handle a fuzzy event of human’s fast walking or
cal’ sensor fusion method because it has been widely used slow running.
and many other methods are based on it [45]. In this Bayesian inference method has been successfully used
method, the multimodal information is combined as per the to fuse multimodal information (at the feature level and at
rules of probability theory [79]. The method can be applied the decision level) for performing various multimedia

123
Multimodal fusion for multimedia analysis 355

analysis tasks. An example of Bayesian inference fusion at exclusive hypotheses, so that it is able to assign evidence to
the feature level is the work by Pitsikalis et al. [102] for the union of hypotheses [140].
audio-visual speech recognition. Meyer et al. [85] and Xu The general methodology of fusing the multimodal
and Chua [149] have used the Bayesian inference method information using the D–S theory is as follows. The D–S
at the decision level for spoken digit recognition and sports reasoning system is based on a fundamental concept of
video analysis, respectively; while Atrey et al. [8] ‘‘the frame of discernment’’, which consists of a set H of
employed this fusion strategy at both the feature as well as all the possible mutually exclusive hypotheses. An
the decision level for event detection in the multimedia hypothesis is characterized by belief and plausibility. The
surveillance domain. These works are described in the degree of belief implies a lower bound of the confidence
following. with which a hypothesis is detected as true, whereas the
Pitsikalis et al. [102] used the Bayesian inference met- plausibility represents the upper bound of the possibility
hod to combine the audio-visual feature vectors. The audio that the hypothesis could be true. A probability is assigned
feature vector included 13 static MFCC and their deriva- to every hypothesis H 2 PðHÞ using a belief mass function
tives, while the visual feature vector was formed by con- m : PðHÞ ! ½0; 1 . The decision regarding a hypothesis is
catenating 6 shapes and 12 texture features. Based on the measured by a ‘‘confidence interval’’ bounded by its basic
combined features, the joint probability of a speech seg- belief and plausibility values, as shown in Fig. 5.
ment is computed. In this work, the authors have also When there are multiple independent modalities, the
proposed to model the measurement of noise uncertainty. D–S evidence combination rule is used to combine them.
At the decision level, Meyer et al. [85] fused the deci- Precisely, the mass of a hypothesis H based on two
sions obtained from speech and visual modalities. The modalities, Ii and Ij, is computed as:
authors have first extracted the MFCC features from speech P
Ii \Ij ¼H mi ðIi Þmj ðIj Þ
and the lip contour features from the speaker’s face in the ðmi  mj ÞðHÞ ¼ P ð4Þ
1  Ii \Ij ¼£ mi ðIi Þmj ðIj Þ
video, and then obtained individual decisions (in terms of
probabilities) for both using HMM classifiers. These Note that, the weights can also be assigned to different
probability estimates are then fused using the Bayesian modalities that are fused.
inference method to estimate the joint probability of a Although the Dempster–Shafer fusion method has been
spoken digit. Similar to this work, Xu and Chua [149] also found more suitable for handling mutually inclusive
used the Bayesian inference fusion method for integrating hypotheses, this method suffers from the combinatorial
the probabilistic decisions about the offset and non-offset explosion when the the number of frames of discernment is
events detected in a sport video. These events have been large [27].
detected by fusing audio-visual features with textual clues Some of the representative works that have used the
and by employing a HMM classifier. In this work, the D–S fusion method for various multimedia analysis tasks
authors have shown that the Bayesian inference has com- are Bendjebbour et al. [16] (at hybrid level) and Mena and
parable accuracy to the rule-based schemes. Malpica [84] (at the feature level) for segmentation of
In another work, Atrey et al. [8] adopted a Bayesian satellite images, Guironnet et al. [44] for video classifica-
inference fusion approach at hybrid levels (feature level as tion, Singh et al. [116] for finger print classification, and
well as decision level). The authors demonstrated the [110] for human computer interaction (at the decision
utility of this fusion (they call it ‘assimilation’) approach level).
for event detection in a multimedia surveillance scenario. Bendjebbour et al. [16] proposed to use the D–S theory
The feature level assimilation was performed at the intra- to fuse the mass functions of two regions (cloud and no
media stream level and the decision level assimilation was
adopted at the inter-media stream level.
3.2.3 Dempster–Shafer theory

Although the Bayesian inference fusion method allows for uncertainty modeling (usually by a Gaussian distribution), some researchers have preferred to use the Dempster–Shafer (D–S) evidence theory since it uses belief and plausibility values to represent the evidence and its corresponding uncertainty [110]. Moreover, the D–S method generalizes the Bayesian theory to relax the Bayesian inference method's restriction on mutually exclusive hypotheses, so that it is able to assign evidence to the union of hypotheses [140].

The general methodology of fusing the multimodal information using the D–S theory is as follows. The D–S reasoning system is based on the fundamental concept of "the frame of discernment", which consists of a set \Theta of all the possible mutually exclusive hypotheses. A hypothesis is characterized by belief and plausibility. The degree of belief implies a lower bound of the confidence with which a hypothesis is detected as true, whereas the plausibility represents the upper bound of the possibility that the hypothesis could be true. A probability is assigned to every hypothesis H \in P(\Theta) using a belief mass function m : P(\Theta) \to [0, 1]. The decision regarding a hypothesis is measured by a "confidence interval" bounded by its basic belief and plausibility values, as shown in Fig. 5.

[Fig. 5 An illustration of the belief and plausibility in the D–S theory [140]: the belief is the sum of all evidences in favor of the hypothesis, while the plausibility excludes only the sum of all evidences against the hypothesis.]

When there are multiple independent modalities, the D–S evidence combination rule is used to combine them. Precisely, the mass of a hypothesis H based on two modalities, I_i and I_j, is computed as:

(m_i \oplus m_j)(H) = \frac{\sum_{I_i \cap I_j = H} m_i(I_i)\, m_j(I_j)}{1 - \sum_{I_i \cap I_j = \emptyset} m_i(I_i)\, m_j(I_j)}    (4)

Note that weights can also be assigned to the different modalities that are fused.

Although the Dempster–Shafer fusion method has been found more suitable for handling mutually inclusive hypotheses, this method suffers from combinatorial explosion when the number of frames of discernment is large [27].
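Eq. 4 can be sketched compactly by representing each hypothesis subset as a frozenset; the mass assignments below are invented, and the optional per-modality weights mentioned above are omitted for brevity.

```python
from itertools import product

def dempster_combine(m1: dict, m2: dict) -> dict:
    """Eq. 4: combine two belief mass functions (keys are frozensets of hypotheses)."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb          # mass falling on the empty set (conflict)
    return {h: v / (1.0 - conflict) for h, v in combined.items()}

# Hypothetical masses from an audio and a video analysis unit over {run, walk}.
m_audio = {frozenset({"run"}): 0.6, frozenset({"run", "walk"}): 0.4}
m_video = {frozenset({"walk"}): 0.5, frozenset({"run", "walk"}): 0.5}
print(dempster_combine(m_audio, m_video))
```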
Some of the representative works that have used the D–S fusion method for various multimedia analysis tasks are Bendjebbour et al. [16] (at the hybrid level) and Mena and Malpica [84] (at the feature level) for segmentation of satellite images, Guironnet et al. [44] for video classification, Singh et al. [116] for fingerprint classification, and [110] for human computer interaction (at the decision level).

Bendjebbour et al. [16] proposed to use the D–S theory to fuse the mass functions of two regions (cloud and no
cloud) of the image obtained from radar. They performed fusion at two levels: the feature level and the decision level. At the feature level, the pixel intensity was used as a feature and the mass of a given pixel based on two sensors was computed and fused; while at the decision level, the decisions about a pixel obtained from the HMM classifier were used as mass and then the HMM outputs were combined. Similar to this work, Mena and Malpica [84] also used the D–S fusion approach for the segmentation of color images for extracting information from terrestrial, aerial or satellite images. However, they extracted the information of the same image from three different sources: the location of an isolated pixel, a group of pixels, and a pair of pixels. The evidences obtained based on the location analysis were fused using the D–S evidence fusion strategy.

Guironnet et al. [44] extracted low-level (color or texture) descriptors from a TREC video and applied an SVM classifier to recognize the pre-defined concepts (e.g. 'beach' or 'road') based on each descriptor. The SVM classifier outputs are integrated using the D–S fusion approach, which they call the "transferable belief model". In the area of biometrics, Singh et al. [116] used the D–S theory to combine the output scores of three different fingerprint classification algorithms based on minutiae, ridge and image pattern features. The authors showed that the D–S theory of fusing three independent evidences outperformed the individual approaches. Recently, Reddy [110] also used the D–S theory for fusing the outputs of two sensors, a hand gesture sensor and a brain computing interface sensor. Two concepts, "Come" and "Here", were detected using these two sensors. The fusion results showed that the D–S fusion approach helps in resolving the ambiguity in the sensors.

3.2.4 Dynamic Bayesian networks

Bayesian inferencing can be extended to a network (graph) in which the nodes represent random variables (observations or states) of different types, e.g. audio and video, and the edges denote their probabilistic dependencies. For example, as shown in Fig. 6, a speaker detection problem can be depicted by a Bayesian network [30]. The speaker node value is determined based on the values of three intermediate nodes 'visible', 'frontal' and 'speech', which are inferred from the measurement nodes 'skin', 'texture', 'face', 'mouth' and 'sound'. The figure shows the dependency of the nodes upon each other. However, the network shown in Fig. 6 is a static one, meaning it depicts the state at a particular time instant.

[Fig. 6 An example of a static Bayesian network [30]: in a kiosk setting, a speaker node is conditioned on the intermediate nodes visible, frontal and speech, which are inferred from the measurement nodes skin, texture, face detection, mouth and sound.]

[Fig. 7 An example of a dynamic Bayesian network [30]: the static network of Fig. 6 replicated over time slices t−1 and t, with temporal links between the speaker, frontal and speech nodes.]

A Bayesian network works as a dynamic Bayesian network (DBN) when the temporal aspect is added to it, as shown in Fig. 7. A DBN is also called a probabilistic generative model or a graphical model. Due to the fact that
these models describe the observed data in terms of the process that generates them, they are called generative. They are termed probabilistic because they describe probabilistic distributions rather than the sensor data. Moreover, since they have useful graphical representations, they are also called graphical [15]. Although the DBNs have been used under different names, such as probabilistic generative models, graphical models, etc., for a variety of applications, the most popular and simplest form of a DBN is the HMM.

The DBNs have a clear advantage over the other methods in two aspects. First, they are capable of modeling the multiple dependencies among the nodes. Second, by using them, the temporal dynamics of multimodal data can easily be integrated [119]. These advantages make them suitable for various multimedia analysis tasks that require decisions to be performed using time-series data. Although DBNs are very beneficial and widely used, the determination of the right DBN state is often seen as their problem [81].

In the following, we briefly outline some representative works that have used a DBN in one form or the other. Wang et al. [138] have used an HMM for video shot classification. The authors have extracted both audio (cepstral vector) and visual features (a gray-level histogram difference and two motion features) from each video frame and used them as the input data for the HMM. While this method used a single HMM that processed the joint audio-visual features, Nefian et al. [86] used the coupled HMM (CHMM), which is a generalization of the HMM. The CHMM suits multimodal scenarios where two or more streams need to be integrated. In this work, the authors have modeled the state asynchrony of the audio features (MFCC) and visual features (2D-DCT coefficients of the lips region) while preserving their correlation over time. This approach is used for speech recognition. The work by Adams et al. [3] also used a Bayesian network in addition to SVM and showed a comparison of both for video shot retrieval.

Unlike Nefian et al. [86], who used the CHMM, Bengio [17] has presented the asynchronous HMM (AHMM) at the feature level. The AHMM is a variant of the HMM to deal with asynchronous data streams. The authors modeled the joint probability distribution of asynchronous sequences—a speech (MFCC features) stream and a video (shape and intensity features) stream that described the same event. This method was used for biometric identity verification.

Nock et al. [90] and [91] employed a set of HMMs trained on joint sequences of audio and visual data. The features used were MFCC from speech and DCT coefficients of the lip region in the face from video. The joint features were presented to the HMMs at consecutive time instances in order to locate a speaker. The authors also computed the mutual information (MI) between the two types of features and analyzed its effect on the overall speaker location results. Similar to this work, Beal et al. [15] have used graphical models to fuse audio-visual observations for tracking a moving object in a cluttered, noisy environment. The authors have modeled audio and video observations jointly by computing their mutual dependencies. The expectation-maximization algorithm has been used to learn the model parameters from a sequence of audio-visual data. The results were demonstrated in a two-microphone and one-camera setting. Similarly, Hershey et al. [46] also used a probabilistic generative model to combine audio and video by learning the dependencies between the noisy speech signal from a single microphone and the fine-scale appearance and location of the lips during speech.

It is important to note that all the works described above assume multiple modalities (usually audio-visual data) that locally, as well as jointly, follow a Gaussian distribution. In contrast to these works, Fisher et al. [39] have presented a non-parametric approach to learn the joint distribution of audio and visual features. They estimated a linear projection onto low-dimensional subspaces to maximize the mutual information between the mapped random variables. This approach was used for audio–video localization. Although the non-parametric approach is free from any parametric assumptions, it often suffers from implementation difficulties as the method results in a system of undetermined equations. That is why the parametric approaches have been preferred [92]. With this rationale, Noulas and Krose [92] also presented a two-layer Bayesian network model for human face tracking. In the first layer, the independent modalities (audio and video) are analyzed, while the second layer performs the fusion incorporating their correlation. A similar approach was also presented by Zou and Bhanu [158] for tracking humans in a cluttered environment. Recently, Town [131] also used the Bayesian networks approach for multi-sensory fusion. In this work, the visual information obtained from calibrated cameras is integrated with the ultrasonic sensor data at the decision level to track people and devices in an office building. The authors presented a large-scale sentient computing system known as "SPIRIT".

In the context of news video analysis, Chua et al. [31] have emphasized the need to utilize multimodal features (text with audio-visual) for segmenting news video into story units. In their other work [25], the authors presented an HMM-based multi-modal approach for news video story segmentation by using a combination of features. The feature set included visual-based features such as color, object-based features such as face and video-text, temporal features such as audio and motion, and semantic features such as cue-phrases. Note that the fundamental assumption


which is often considered with the DBN methods is the independence among different observations/features. However, this assumption does not hold true in reality. To relax the assumption of independence between observations, Ding and Fan [38] presented a segmental HMM approach to analyze a sports video. In a segmental HMM, each hidden state emits a sequence of observations, which is called a segment. The observations within a segment are considered to be independent of the observations of other segments. The authors showed that the segmental HMM performed better than the traditional HMM. In another work, the importance of combining the text modality with the other modalities has been demonstrated by Xie et al. [145]. The authors proposed a layered dynamic mixture model for topic clustering in video. In their layered approach, first a hierarchical HMM is used to find clusters in the audio and visual streams; then latent semantic analysis is used to cluster the text from the speech transcript stream. At the next level, a mixture model is adopted to learn the joint probability of the clusters from the HMM and latent semantic analysis. The authors have performed experiments with the TRECVID 2003 data set, which demonstrated that the multi-modal fusion resulted in a higher accuracy in topic clustering.

An interesting work was presented by Wu et al. [142]. In this work, the authors used an influence diagram approach (a form of the Bayesian network) to represent the semantics of photos. The multimodal fusion framework integrated the context information (location, time and camera parameters) and content information (holistic and perceptual local features) with a domain-oriented semantic ontology (represented by a directed acyclic graph). Moreover, since the conditional probabilities that are used to infer the semantics can be misleading, the authors have utilized the causal strength between the context/content and the semantic ontology instead of using the correlation among features. The causal strength is based on the following idea. Two variables may co-vary with each other; however, there may be a third variable acting as a "cause" that affects the values of these two variables. For example, the two variables "wearing a warm jacket" and "drinking coffee" may have a large positive correlation; however, the cause behind both could be "cold weather". The authors have shown that the usage of causal strength in influence diagrams provides better results in the automatic annotation of photos.

3.2.5 Neural networks

Neural network (NN) is another approach for fusing multimodal data. Neural networks are considered a non-linear black box that can be trained to solve ill-defined and computationally expensive problems [140]. The NN method consists of a network of mainly three types of nodes—input, hidden and output nodes. The input nodes accept sensor observations or decisions (based on these observations), and the output nodes provide the results of the fusion of the observations or decisions. The nodes that are neither input nor output are referred to as hidden nodes. The network architecture design between the input and output nodes is an important factor for the success or failure of this method. The weights along the paths that connect the input nodes to the output nodes decide the input–output mapping behavior. These weights can be adjusted during the training phase to obtain the optimal fusion results [22]. This method can also be employed at both the feature level and the decision level.

In the following, we describe some works that illustrate the use of the NN fusion method for performing multimedia analysis tasks. Gandetto et al. [41] have used the NN fusion method to combine sensory data for detecting human activities in an environment equipped with a heterogeneous network of sensors with CCD cameras and computational units working together in a LAN. In this work, the authors considered two types of sensors—state sensors (e.g. CPU load, login process, and network load) and observation sensors (e.g. cameras). The human activities in regard to the usage of laboratory resources were detected by fusing the data from these two types of sensors at the decision level.

A variation of the NN fusion method is the time-delay neural network (TDNN), which has been used to handle temporal multimodal data fusion. Some researchers have adopted the TDNN approach for various multimedia analysis tasks, e.g. Cutler and Davis [34], Ni et al. [88], and Zou and Bhanu [158] for speaker tracking. Cutler and Davis [34] learned the correlation between audio and visual streams by using a TDNN method. The authors have used it for locating the speaking person in the scene. A similar approach was also presented by Zou and Bhanu [158]. In [158], the authors have also compared the TDNN approach with the BN approach and found that the BN approach performed better than the TDNN approach in many aspects. First, the choice of the initial parameters does not affect the DBN approach, while it does affect the TDNN approach. Second, the DBN approach was better at modeling the joint Gaussian distribution of audio-visual data, compared to the linear mapping between audio signals and the object position in video sequences in the TDNN method. Third, the graphical models provide an explicit and easily accessible structure compared to a TDNN, in which the inner structure and parameters are difficult to design. Finally, the DBN approach offers better tracking accuracy. Moreover, in the DBN approach, the a posteriori probability of the estimates is available as a quantitative measure in support of the decision.


While Cutler and Davis [34] and Zou and Bhanu [158] used the NN fusion approach at the feature level, Ni et al. [88] adopted this approach at the feature level as well as at the decision level. In [88], the authors have used the NN fusion method to fuse low-level features to recognize images. The decisions from multiple trained NN classifiers are further fused to come up with a final decision about an image.

Although the NN method is in general found suitable to work in a high-dimensional problem space and to generate high-order nonlinear mappings, there are some familiar complexities associated with it. For instance, the selection of an appropriate network architecture for a particular application is often difficult. Moreover, this method also suffers from slow training. Due to these limitations and other shortcomings (stated above in comparison to the BN method), the NN method has not been used as often for multimedia analysis tasks as the other fusion methods.

3.2.6 Maximum entropy model

In general, the maximum entropy model is a statistical classifier which follows an information-theoretic approach and provides the probability of an observation belonging to a particular class based on the information content it has. This method has been used by a few researchers for categorizing the fused multimedia observations into respective classes.

The maximum entropy model based fusion method is briefly described as follows. Let Ii and Ij be two different types of input observations. The probability of these observations belonging to a class X can be given by an exponential function:

P(X | I_i, I_j) = \frac{1}{Z(I_i, I_j)} \, e^{F(I_i, I_j)}    (5)

where F(Ii, Ij) is the combined feature (or decision) vector and Z(Ii, Ij) is the normalization factor to ensure a proper probability.

Recently, this fusion method has been used by Magalhães and Rüger [80] for semantic multimedia indexing. In this work, the authors combined text- and image-based features to retrieve images. The authors found that the maximum entropy model based fusion worked better than the Naive Bayes approach.
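A minimal sketch of this kind of exponential-form fusion is given below. It is not the implementation of [80]; the feature dimensions and class weights are hypothetical, the weighted linear score plays the role of F(Ii, Ij), and the normalization over classes plays the role of Z(Ii, Ij) in Eq. 5.

import numpy as np

def maxent_fuse(I_i, I_j, weights):
    # Score each class with an exponential (maximum entropy) model over the
    # concatenated observations; 'weights' holds one hypothetical vector per class.
    f = np.concatenate([I_i, I_j])               # combined feature/decision vector
    scores = np.array([w @ f for w in weights])  # one linear score per class
    expo = np.exp(scores - scores.max())         # shift for numerical stability
    return expo / expo.sum()                     # normalization acts as Z(Ii, Ij)

# Hypothetical text-based and image-based observation vectors and class weights
I_text, I_image = np.array([0.2, 0.8]), np.array([0.6, 0.1, 0.3])
weights = [np.array([1.0, -0.5, 0.4, 0.2, 0.0]),    # class "relevant"
           np.array([-0.3, 0.7, -0.2, 0.1, 0.5])]   # class "not relevant"
print(maxent_fuse(I_text, I_image, weights))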
There are other works, such as Jeon and Manmatha [63] and Argillander et al. [7], which have used the maximum entropy model for multimedia analysis tasks; however, in these works the authors used only a single modality rather than multiple modalities. Therefore, the discussion of these works is out of the scope of this paper.

3.2.7 Remarks on classification-based fusion methods

All the representative works related to the classification-based fusion methods are summarized in Table 2. Our observations are as follows:

• The Bayesian inference fusion method, which works on probabilistic principles, provides easy integration of new observations and the use of a priori information. However, it is not suitable for handling mutually exclusive hypotheses. Moreover, the lack of appropriate a priori information can lead to inaccurate fusion results using this method. On the other hand, Dempster–Shafer fusion methods are good at handling uncertainty and mutually exclusive hypotheses. However, in this method it is hard to handle the large number of combinations of hypotheses. This method has been used for speech recognition, sports video analysis and event detection tasks.

• The dynamic Bayesian networks have been widely used to deal with time-series data. This method is a variation of Bayesian inference when used over time. The DBN method in its different forms (such as HMMs) has been successfully used for various multimedia analysis tasks such as speech recognition, speaker identification and tracking, video shot classification, etc. However, in this method it is often difficult to determine the right DBN states. Compared to the DBN, the neural networks fusion method is generally suitable to work in a high-dimensional problem space and it generates a high-order nonlinear mapping, which is required in many realistic scenarios. However, due to the complex nature of a network, this method suffers from slow training.

• As can be seen from the table, among the various classification-based fusion methods, SVM and DBN have been widely used by researchers. SVMs have been preferred due to their improved classification performance, while the DBNs have been found more suitable to model temporal data.

• There are various other classification methods used in multimedia research. These include decision trees [76], relevance vector machines [36], logistic regression [71] and boosting [75]. However, these methods have been used more for traditional classification problems than for fusion problems. Hence, we skip the description of these methods.

3.3 Estimation-based fusion methods

The estimation category includes the Kalman filter, extended Kalman filter and particle filter fusion methods. These methods have been primarily used to better estimate

Table 2 A list of the representative works in the classification methods category used for multimodal fusion

Fusion method  Level of fusion  The work  Modalities  Multimedia analysis task

Support vector Decision Adams et al. [3] Video (color, structure, and shape), audio (MFCC) and textual cues Semantic concept detection
machine
Aguilar et al. [4] Fingerprint, signature, face (MCYT Multimodal Database, XM2VTS face database) Biometric verification
Iyenger et al. [58] Audio, video Semantic concept detection
Wu et al. [141] Color histogram, edge orientation histogram, color correlogram, co-occurrence Semantic concept detection
texture, motion vector histogram, visual perception texture, and speech
Bredin and Chollet [19] Audio (MFCC), video (DCT of lip area), audio-visual speech synchrony Biometric identification of
talking face
Hybrid Wu et al. [143] Video, audio Multimedia data analysis
Zhu et al. [156] Image (low-level visual features, text color, size, location, edge density, Image classification
brightness, contrast)
Ayache et al. [11] Visual, text cue Semantic indexing
Bayesian inference Feature Pitsikalis et al. [102] Audio (MFCC), video (Shape and texture) Speech recognition
Decision Meyer et al. [85] Audio (MFCC) and video (lips contour) Spoken digit recognition
Hybrid Xu and Chua [149] Audio, video, text, web log Sports video analysis
Atrey et al. [8] Audio (ZCR, LPC, LFCC) and video (blob location and area) Event detection for
surveillance
Dempster–Shafer theory Feature Mena and Malpica [84] Video (trajectory coordinates) Segmentation of satellite
images
Decision Guironnet et al. [44] Audio (phonemes) and visual (visemes) Video classification
Singh et al. [116] Audio (MFCC), video (DCT of the face region) and the synchrony score Finger print classification
Reddy [110] Audio (MFCC), video (Eigenlip) Human computer interaction
Hybrid Bendjebbour et al. [16] Video (trajectory coordinates) Segmentation of satellite
images
Dynamic Bayesian Feature Wang et al. [138] Audio (cepstral vector), visual (gray-level histogram difference and motion features) Video shot classification
networks Nefian et al. [86] Audio (MFCC) and visual (2D-DCT coefficients of the lips region) Speech recognition
Nock et al. [90, 91] Audio (MFCC) and video (DCT coefficients of the lips region Speaker localization
Chaisorn et al. [25] Audio (MFCCs and perceptual features), video (color, face, video-text, motion) Story segmentation in news
video
Adams et al. [3] Video (color, structure, and shape),audio (MFCC) and textual cues Video shot classification
Beal et al. [15] Audio and video—the details of features not available Object tracking
Bengio et al. [17] Speech (MFCC) and video (shape and intensity features) Biometric identity verification
Hershey et al. [46] Audio (Spectral components), video (fine-scale appearance and location of the lips) Speaker localization
Zou and Bhanu [158], Audio (MFCC) and video (pixel value variation) Human tracking
Noulas and Krose [92]
Ding and Fan [38] Video (spatial color distribution and the angle of yard lines) Shot classification in a sports
video
Table 2 continued
Fusion method  Level of fusion  The work  Modalities  Multimedia analysis task
Dynamic Bayesian networks  Hybrid  Xie et al. [145]  Text (closed caption), Audio (pitch, silence, significant pause), video (color histogram and motion intensity), Speech (ASR transcript)  Topic clustering in video
  Hybrid  Wu et al. [142]  Image (color, texture and shape) and Camera parameters  Photo annotation
  Decision  Town [131]  Video (face and blob), ultrasonic sensors  Human tracking
Neural networks  Decision  Gandetto et al. [41]  CPU load, Login process, Network load, Camera images  Human activity monitoring
  Feature  Cutler and Davis [34]  Audio (spectrogram) and video (blob)  Speaker localization
  Feature  Zou and Bhanu [158]  Audio (phoneme) and video (viseme)  Human tracking
  Hybrid  Ni et al. [88]  Image (feature details not provided in the paper)  Image recognition
Maximum entropy model  Feature  Magalhães and Rüger [80]  Text and Image  Semantic image indexing

the state of a moving object based on multimodal data. For example, for the task of object tracking, multiple modalities such as audio and video are fused to estimate the position of the object. The details of these methods are as follows.

3.3.1 Kalman filter

The Kalman filter (KF) [66, 112] allows for real-time processing of dynamic low-level data and provides state estimates of the system from the fused data with some statistical significance [79]. For this filter to work, a linear dynamic system model with Gaussian noise is assumed, where at time t, the system's true state x(t) and its observation y(t) are modeled based on the state at time t - 1. Precisely, this is represented using the state-space model given by Eqs. 6 and 7 in the following:

x(t) = A(t) x(t-1) + B(t) I(t) + w(t)    (6)

y(t) = H(t) x(t) + v(t)    (7)

where A(t) is the transition model, B(t) is the control input model, I(t) is the input vector, H(t) is the observation model, w(t) ~ N(0, Q(t)) is the process noise as a normal distribution with zero mean and covariance Q(t), and v(t) ~ N(0, R(t)) is the observation noise as a normal distribution with zero mean and covariance R(t).

Based on the above state-space model, the KF does not require preserving the history of observations and only depends on the state estimate from the previous timestamp. The benefit is obvious for systems with limited storage capabilities. However, the use of the KF is limited to the linear system model and is not suitable for systems with non-linear characteristics. For non-linear system models, a variant of the Kalman filter known as the extended Kalman filter (EKF) [111] is usually used. Some researchers also use the KF as an inverse Kalman filter (IKF) that reads an estimate and produces an observation, as opposed to the KF that reads an observation and produces an estimate [124]. Therefore, a KF and its associated IKF can logically be arranged in series to generate the observation at the output. Another variant of the KF has gained attention lately, which is the unscented Kalman filter (UKF) [65]. The benefit of the UKF is that it does not have a linearization step and the associated errors.
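The predict–update cycle implied by Eqs. 6 and 7 can be written compactly as follows. This is a generic sketch: the constant-velocity model and all numerical values are hypothetical rather than taken from any of the surveyed systems; a decision-level audio–visual fusion in the spirit of [124] would feed each modality's position estimate to such a filter as a measurement y.

import numpy as np

def kalman_step(x, P, y, A, H, Q, R, B=None, u=None):
    # One Kalman filter cycle for the model x(t) = A x(t-1) + B u(t) + w, y(t) = H x(t) + v
    x_pred = A @ x + (B @ u if B is not None else 0)   # predict the state (Eq. 6)
    P_pred = A @ P @ A.T + Q                           # predict its covariance
    S = H @ P_pred @ H.T + R                           # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)                # Kalman gain
    x_new = x_pred + K @ (y - H @ x_pred)              # correct with the observation (Eq. 7)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Hypothetical 1D constant-velocity tracker; audio- and video-based position
# measurements arrive sequentially and each one triggers a filter cycle.
A = np.array([[1.0, 1.0], [0.0, 1.0]])   # state: [position, velocity]
H = np.array([[1.0, 0.0]])               # both modalities observe position only
Q, R_audio, R_video = 0.01 * np.eye(2), np.array([[0.5]]), np.array([[0.1]])
x, P = np.zeros(2), np.eye(2)
x, P = kalman_step(x, P, np.array([2.1]), A, H, Q, R_audio)  # audio-based estimate
x, P = kalman_step(x, P, np.array([1.9]), A, H, Q, R_video)  # video-based estimate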
The KF is a popular fusion method. Loh et al. [77] proposed a feature level fusion method for estimating the translational motion of a single speaker. They used different audio-visual features for estimating the position, velocity and acceleration of the single sound source. For position estimation in 3D space, the measurements of three microphones are used in conjunction with the camera image point. Given the position estimate, a KF is then


used based on a constant acceleration model to estimate velocity and acceleration. Unlike Loh et al. [77], Potamitis et al. [107] presented an audio-based fusion scheme for detecting multiple moving speakers. The same speaker's state is determined by fusing the location estimates from multiple microphone arrays, where the location estimates are computed using a separate KF for each individual microphone array. A probabilistic data association technique is used with an interacting multiple model estimator to handle the speaker's motion and measurement origin uncertainty.

The KF as well as the EKF have been successfully used for source localization and tracking for many years. Strobel et al. [124] focused on the localization and tracking of single objects. The audio and video localization features are computed in terms of position estimates. An EKF is used due to the non-linear estimates based on the audio-based position. On the other hand, a basic KF is used at the video camera level. The outputs of the audio and video estimates are then fused within the fusion center, which is comprised of two single-input inverse KFs and a two-input basic KF. This is shown in Fig. 8. This work requires that the audio and video sources are in sync with each other. Likewise, Talantzis et al. [125] have adopted a decentralized KF that fuses audio and video modalities for better location estimation in real time. A decision level fusion approach has been adopted in this work.

A recent work [154] presents a multi-camera based tracking system, where multiple features such as spatial position, shape and color information are integrated together to track object blobs in consecutive image frames. The trajectories from multiple cameras are fused at the feature level to obtain the position and velocity of the object in the real world. The fusion of trajectories from multiple cameras, which uses an EKF, enables better tracking even when the object view is occluded. Gehrig et al. [43] also adopted an EKF based fusion approach using audio and video features. Based on the observations of the individual audio and video sensors, the state of the KF was incrementally updated to estimate the speaker's position.

Fig. 8 Extended/decentralized Kalman filter in the fusion process [124]

3.3.2 Particle filter

Particle filters are a set of sophisticated simulation-based methods, which are often used to estimate the state distribution of a non-linear and non-Gaussian state-space model [6]. These methods are also known as Sequential Monte Carlo (SMC) methods [33]. In this approach, the particles represent random samples of the state variable, where each particle is characterized by an associated weight. The particle filtering algorithm consists of prediction and update steps. The prediction step propagates each particle as per its dynamics, while the update step reweighs a particle according to the latest sensory information. While the KF, EKF or IKF are optimal only for linear Gaussian processes, the particle methods can provide Bayesian optimal estimates for non-linear non-Gaussian processes when a sufficiently large number of samples is taken.
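The prediction/reweighting cycle described above can be sketched as follows for an audio–visual tracking setting. The random-walk motion model, the Gaussian likelihood functions and all numbers are hypothetical placeholders rather than the formulation of any particular surveyed system; multimodal reweighting is done here by simply multiplying the per-modality observation likelihoods.

import numpy as np

rng = np.random.default_rng(0)
N = 500
particles = rng.uniform(0.0, 10.0, size=N)   # hypothesized 1D speaker positions
weights = np.full(N, 1.0 / N)

def likelihood(measurement, particles, sigma):
    # Gaussian observation model around each particle (assumed, for illustration)
    return np.exp(-0.5 * ((measurement - particles) / sigma) ** 2)

for audio_meas, video_meas in [(4.8, 5.2), (5.1, 5.4)]:    # fabricated measurements
    particles += rng.normal(0.0, 0.2, size=N)               # prediction: random-walk dynamics
    weights *= likelihood(audio_meas, particles, sigma=1.0) # update with audio evidence
    weights *= likelihood(video_meas, particles, sigma=0.3) # update with video evidence
    weights /= weights.sum()
    if 1.0 / np.sum(weights ** 2) < N / 2:                  # resample when the effective sample size is small
        idx = rng.choice(N, size=N, p=weights)
        particles, weights = particles[idx], np.full(N, 1.0 / N)

estimate = np.sum(weights * particles)   # weighted mean as the fused position estimate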
The particle methods have been widely used in multimedia analysis. For instance, Vermaak et al. [132] used particle filters to estimate the predictions from audio- and video-based observations. The reported system uses a single camera and a pair of microphones and was tested on stored audio-visual sequences. The fusion of audio-visual features took place at the feature level, meaning that the individual particle coordinates from the features of both modalities were combined to track the speaker. Similar to this approach, Perez et al. [99] adopted the particle filter approach to fuse 2D object shapes and audio information for speaker tracking. However, unlike Vermaak et al. [132], the latter uses the concept of the importance particle filter, where audio information was specifically used to generate an importance function that influenced the computation of the audio-based observation likelihood. The audio- and video-based observation likelihoods are then combined as a late fusion scheme using a standard probabilistic product formula that forms the multimodal particle.

A probabilistic particle filter framework is proposed by Zotkin et al. [157] that adopts a late fusion approach for tracking people in a videoconferencing environment. This framework used multiple cameras and microphones to estimate the 3D coordinates of the person using sampled projection. Multimodal particle filters are used to approximate the posterior distribution of the system parameters and the tracking position in the audio-visual state-space model. Unlike Vermaak et al. [132] or Perez et al. [99], this framework enables tracking multiple persons simultaneously.

Nickel et al. [89] presented an approach for real-time tracking of the speaker using multiple cameras and microphones. This work used particle filters to estimate the location of the speaker by sampled projection as proposed

by Zotkin et al. [157], where each particle filter represented a 3D coordinate in the space. The evidence from all the camera views and microphones is adjusted to assign weights to the corresponding particle filter. Finally, the weighted mean of a particle set is considered as the speaker location. This work adopted a late fusion approach to obtain the final decision.

3.3.3 Remarks on estimation-based fusion methods

The representative works in the estimation-based category are summarized in Table 3. The estimation-based fusion methods (Kalman filter, extended Kalman filter and particle filter) are generally used to estimate and predict the fused observations over a period. These methods are suitable for object localization and tracking tasks. While the Kalman filter is good for systems with a linear model, the extended Kalman filter is better suited for non-linear systems. However, the particle filter method is more robust for non-linear and non-Gaussian models as it approaches the Bayesian optimal estimate with a sufficiently large number of samples.

3.4 Further discussion

In the following, we provide our observations based on the analysis of the fusion methods described above.

• Most used methods. From the literature, it has been observed that many fusion methods such as linear weighted fusion, SVM, and DBN have been used more often in comparison to the other methods. This is due to the fact that linear weighted fusion can be easily used to prioritize different modalities while fusing; SVM has improved classification performance in many multimedia analysis scenarios; and the DBN fusion method is capable of handling temporal dependencies among multimodal data, which is an important issue often considered in multimodal fusion.

• Fusion methods and levels of fusion. The existing literature suggests that linear weighted fusion is suitable to work at the decision level. Also, although SVM is generally used to classify individual modalities at the feature level, in the case of multimodal fusion, the outputs of individual SVM classifiers are fused and further classified using another SVM. That is why most of the reported works have been seen to fall into the late fusion category. Among others, the DBNs have been used more at the feature level due to their suitability in handling temporal dependencies.

• Modalities used. The modalities that have been used for multimodal fusion are mostly based on audio and video. Some works also considered the text modality, while others have investigated gesture.

• Multimedia analysis tasks versus fusion methods. In Table 4, we summarize the existing literature in terms of the multimedia analysis tasks and the different fusion methods used for these tasks. This may be useful for the readers as a quick reference in order to decide which fusion method would be suitable for which task. It has been found that for a variety of tasks, various fusion methodologies have been adopted. However, based on the nature of a multimedia analysis task, some fusion methods have been preferred over the others. For

Table 3 A list of the representative works in the estimation methods category used for multimodal fusion
Fusion method  Level of fusion  The work  Modalities  Multimedia analysis task

Kalman filter and its Feature Potamitis et al. [107] Audio (position, velocity) Multiple speaker tracking
variants Loh et al. [77] Audio, video Single speaker tracking
Gehrig et al. [43] Audio (TDOA), video (position of the Single speaker tracking
speaker)
Zhou and Aggarwal Video [spatial position, shape, color (PCA), Person/vehicle tracking
[154] blob]
Decision Strobel et al. [124] Audio, video Single object localization and
tracking
Talantzis et al. [125] Audio (DOA), video (position, velocity, Person tracking
target size)
Particle filter Feature Vermaak et al. [132] Audio (TDOA), visual (gradient) Single speaker tracking
Decision Zotkin et al. [157] Audio (TDOA), video (skin color, shape Multiple speaker tracking
matching and color histograms)
Perez et al. [99] Audio (TDOA), video (coordinates) Single speaker tracking
Nickel et al. [89] Audio (TDOA), video (Haar-like features) Single speaker tracking


Table 4 A summary of the fusion method used for different multimedia analysis tasks
Multimedia analysis task Fusion method The works

Biometric identification and verification Support vector machine Bredin and Chollet [19], Aguilar et al. [4]
Dynamic Bayesian networks Bengio et al. [17]
Face detection, human tracking and Linear weighted fusion Kankanhalli et al. [67], Jaffre and
activity/event detection Pinquier [59]
Bayesian inference Atrey et al. [8]
Dynamic Bayesian networks Town [131], Beal et al. [15]
Neural networks Gandetto et al. [41], Zou and Bhanu [158]
Kalman filter Talantzis et al. [125], Zhou and Aggarwal
[154], Strobel et al. [124]
Human computer interaction and Custom-defined rules Corradini et al. [32], Pfleger [100],
multimodal dialog system Holzapfel et al. [49]
Dempster–Shafer theory Reddy [110]
Image segmentation, classification, Linear weighted fusion Hua and Zhang [55]
recognition, and retrieval Support vector machine Zhu et al. [156]
Neural networks Ni et al. [88]
Dempster–Shafer Theory Mena and Malpica [84], Bendjebbour
et al. [16]
Video classification and retrieval Linear weighted fusion Yan et al. [151], McDonald and Smeaton
[83]
Bayesian inference Xu and Chua [149]
Dempster–Shafer Theory Singh et al. [116]
Dynamic Bayesian networks Wang et al. [138], Ding and Fan [38],
Chaisorn et al. [25], Xie et al. [145],
Adams et al. [3]
Photo and video annotation Linear weighted fusion Iyenger et al. [58]
Dynamic Bayesian networks Wu et al. [142]
Semantic concept detection Linear weighted fusion Iyenger et al. [58]
Support vector machine Adams et al. [3], Iyenger et al. [58], Wu
et al. [141]
Semantic multimedia indexing Custom-defined rules Babaguchi et al. [12]
Support vector machine Ayache et al. [11]
Maximum entropy model Magalhães and Rüger [80]
Monologue detection Linear weighted fusion Kankanhalli et al. [67], Iyenger et al. [57]
Speaker localization and tracking Particle filter Vermaak et al. [132], Perez et al. [99],
Nickel et al. [89], Zotkin et al. [157]
Kalman filter Potamitis et al. [107]
Majority voting rule Radova and Psutka [108]
Dynamic Bayesian networks Nock et al. [91], Hershey et al. [46]
Neural networks Cutler and Davis [34]
Speech and speaker recognition Linear weighted fusion Neti et al. [87], Lucey et al. [78]
Bayesian inference Meyer et al. [85], Pitsikalis et al. [102]
Dynamic Bayesian networks Nefian et al. [86]

instance, for image and video classification and retrieval tasks, the classification-based fusion methods such as Bayesian inference, Dempster–Shafer theory and dynamic Bayesian networks have been used. Also, as an object tracking task involves dynamics and state transition and estimation, dynamic Bayesian networks and the estimation-based methods such as the Kalman filter have widely been found successful. Moreover, since the sports and news analysis tasks consist of complex rules, custom-defined rules have been found appropriate.

• Application constraints and the fusion methods. From the perspective of application constraints such as computation, delay and resources, we can analyze the different fusion methods as follows. It has been observed that the


linear weighted fusion method is applied to applications which have lesser computational needs. On the other hand, while the dynamic Bayesian networks fusion method is computationally more expensive than the others, the neural networks can be trained to solve computationally expensive problems. Regarding the time delay and synchronization problems, custom-defined rules have been found more appropriate as they are usually application specific. These time delays may occur due to resource constraints, since the input data can be obtained from different types of multimedia sensors and different CPU resources may be available for analysis.

4 Distinctive issues of multimodal fusion

This section provides a critical look at the distinctive issues that should be considered in a multimodal fusion process. These issues have been identified in the light of the following three aspects of fusion: how to fuse (in continuation with the fusion methodologies as discussed in Sect. 3), when to fuse, and what to fuse. From the aspect of how to fuse, we elaborate in Sect. 4.1 on the issues of the use of correlation, confidence and contextual information while fusing different modalities. The when to fuse aspect is related to the synchronization between different modalities, which will be discussed in Sect. 4.2. We cover the what to fuse aspect by describing the issue of optimal modality selection in Sect. 4.3. In the following, we highlight the importance of considering these distinctive issues and also describe the past works related to them.

4.1 Issues related to how to fuse

4.1.1 Correlation between different modalities

The correlation among different modalities represents how they co-vary with each other. In many situations, the correlation between them provides additional cues that are very useful in fusing them. Therefore, it is important to know the different methods of computing correlations and to analyze them from the perspective of how they affect fusion [103].

The correlation can be comprehended at various levels, e.g. the correlation between low-level features and the correlation between semantic-level decisions. Also, there are different forms of correlation that have been utilized by researchers in the multimodal fusion process. The correlation between features has been computed in the forms of the correlation coefficient, mutual information, latent semantic analysis (also called latent semantic indexing), canonical correlation analysis, and cross-modal factor analysis. On the other hand, the decision-level correlation has been exploited in the forms of causal link analysis, causal strength and the agreement coefficient.

Table 5 A list of some representative works that used the correlation information between different streams in the fusion process
Level of fusion The form of correlation The works Multimedia analysis task

Feature Correlation coefficient Wang et al. [138] Video shot classification


Nefian et al. [86] Speech recognition
Beal et al. [15] Object tracking
Li et al. [74] Talking face detection
Mutual information Fisher-III et al. [39] Speech recognition
Darrell et al. [35] Speech recognition
Hershey and Movellan [47] Speaker localization
Nock et al. [90], Iyengar et al. [57] Monologue detection
Nock et al. [91] Speaker localization
Noulas and Krose [92] Human tracking
Latent semantic analysis Li et al. [74] Talking face detection
Chetty and Wagner [28] Biometric person authentication
Canonical correlation analysis Slaney and Covell [117] Talking face detection
Chetty and Wagner [28] Biometric person authentication
Bredin and Chollet [20] Talking face identity verification
Cross-modal factor analysis Li et al. [72] Talking head analysis
Decision Casual link analysis Stauffer [123] Event detection for surveillance
Causal strength Wu et al. [142] Photo annotation
Agreement coefficient Atrey et al. [8] Event detection for surveillance


In the following, we describe the above eight forms of correlation and their usage for various multimedia analysis tasks. We also cast light on the cases where independence between different modalities can be useful for multimedia analysis tasks. A summary of the representative works that have used correlation in different forms is provided in Table 5.

Correlation coefficient. The correlation coefficient is a measure of the strength and direction of a linear relationship between any two modalities. It has been widely used by multimedia researchers for jointly modeling the audio–video relationship [138, 86, 15]. However, to jointly model the audio–video relationship, the authors have often assumed the modalities (1) to be independent, and (2) to locally and jointly follow a Gaussian distribution.

One of the simplest and most widely used forms of the correlation coefficient is Pearson's product-moment coefficient [20], which is computed as follows. Assume that Ii and Ij are the two modalities (of the same or different types). The correlation coefficient CC(Ii, Ij) between them can be computed as [138]:

CC(I_i, I_j) = \frac{\hat{C}(I_i, I_j)}{\sqrt{\hat{C}(I_i, I_i)} \, \sqrt{\hat{C}(I_j, I_j)}}    (8)

where \hat{C}(I_i, I_j) is the (i, j)th element of the covariance matrix C, which is given as:

C = \sum_{k=1}^{N} (I_{ik} - I_{im}) \cdot (I_{jk} - I_{jm})    (9)

where I_{ik} and I_{jk} are the kth values in the feature vectors Ii and Ij, respectively, and I_{im} and I_{jm} are the mean values of these feature vectors. This method of computing correlation has been used by many researchers such as Wang et al. [138] and Li et al. [74]. In [74], based on the correlation coefficient between audio and face feature vectors, the authors have selected the faces having the maximum correlation with the audio.
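As a small illustration, the following sketch computes Eq. 8 for two synthetic feature sequences. The per-frame audio energy and lip-motion values are fabricated; in a setting such as [74], they would be the features of the audio track and of each candidate face.

import numpy as np

def correlation_coefficient(I_i, I_j):
    # Pearson's product-moment coefficient between two equal-length feature vectors (Eq. 8)
    ci, cj = I_i - I_i.mean(), I_j - I_j.mean()
    return np.sum(ci * cj) / (np.sqrt(np.sum(ci * ci)) * np.sqrt(np.sum(cj * cj)))

# Fabricated per-frame audio energy and lip-motion magnitude for two candidate faces
audio = np.array([0.1, 0.9, 0.2, 0.8, 0.15, 0.85])
face_a = np.array([0.2, 0.7, 0.25, 0.75, 0.2, 0.8])    # moves with the audio
face_b = np.array([0.5, 0.45, 0.55, 0.5, 0.48, 0.52])  # nearly static
print(correlation_coefficient(audio, face_a), correlation_coefficient(audio, face_b))
# the face with the larger coefficient would be selected as the talking face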
Mutual information. The mutual information is an information-theoretic measure of correlation that represents the amount of information one modality conveys about another. The mutual information MI(Ii, Ij) between two modalities Ii and Ij, which are normally distributed with variances \Sigma_{I_i} and \Sigma_{I_j} and jointly distributed with covariance \Sigma_{I_i I_j}, is computed as:

MI(I_i, I_j) = \frac{1}{2} \log \frac{|\Sigma_{I_i}| \, |\Sigma_{I_j}|}{|\Sigma_{I_i I_j}|}    (10)

There are several works that have used mutual information as a measure of synchrony between audio and video. For instance, Iyengar et al. [57] computed the synchrony between face and speech using mutual information. The authors found that the face region had high mutual information with the speech data. Therefore, the mutual information score helped locate the speaker. Similarly, Fisher et al. [39] also learned the linear projections from a joint audio–video subspace where the mutual information was maximized. Other works that have used mutual information as a measure of correlation are Darrell et al. [35], Hershey and Movellan [47], Nock et al. [90], Nock et al. [91] and Noulas and Krose [92], for different tasks as detailed in Table 5.

Latent semantic analysis. Latent semantic analysis (LSA) is a technique often used for text information retrieval. This technique has proven useful to analyze the semantic relationships between different textual units. In the context of text information retrieval, the three primary goals that the LSA technique achieves are dimension reduction, noise removal and finding the semantic and hidden relations between keywords and documents. The LSA technique has also been used to uncover the correlation between audio-visual modalities for talking-face detection [74]. Learning the correlation using LSA consists of four steps: construction of a joint multimodal feature space, normalization, singular value decomposition and measuring the semantic association [28]. The mathematical details can be found in Li et al. [74]. In [74], the authors demonstrated the superiority of LSA over the traditional correlation coefficient.

Canonical correlation analysis. Canonical correlation analysis (CCA) is another powerful statistical technique that can be used to find linear mappings that maximize the cross-correlation between two feature sets. Given two feature sets Ii and Ij, the CCA is a set of two linear projections A and B that whiten Ii and Ij. A and B are called canonical correlation matrices. These matrices are constructed under the constraint that their cross-correlation becomes diagonal and maximally compact in the projected space. The computation details of A and B can be found in [117]. The first M vectors of A and B are used to compute the synchrony score CCA(Ii, Ij) between two modalities Ii and Ij as:

CCA(I_i, I_j) = \frac{1}{M} \sum_{m=1}^{M} |\mathrm{corr}(a_m^T I_i, b_m^T I_j)|    (11)

where a_m^T and b_m^T are elements of A and B.

It is important to note that finding the canonical correlations and maximizing the mutual information between the sets are considered equivalent if the underlying distributions are elliptically symmetric [28].

The canonical correlation analysis for computing the synchrony score between modalities has been explored by a few researchers. For instance, Chetty and Wagner [28] used the CCA score between audio and video modalities for


biometric person authentication. In this work, the authors also used LSA and achieved about 42% overall improvement in error rate with CCA and 61% improvement with LSA. In another work, Bredin and Chollet [20] also demonstrated the utility of considering CCA for audio–video based talking-face identity verification. Similarly, Slaney and Covell [117] used CCA for talking-face detection.

Cross-modal factor analysis. The weakness of the LSA method lies in its inability to distinguish features from different modalities in the joint space. To overcome this weakness, Li et al. [72] proposed the cross-modal factor analysis (CFA) method, in which the features from different modalities are treated as two subsets and the semantic patterns between these two subsets are discovered. The method works as follows. Let the two subsets of features be Ii and Ij. The objective is to find the orthogonal transformation matrices A and B that minimize the expression:

\| I_i A - I_j B \|_F^2    (12)

where A^T A and B^T B are unit matrices. \|\cdot\|_F denotes the Frobenius norm, which is calculated for a matrix M as:

\| M \|_F = \left( \sum_x \sum_y |m_{xy}|^2 \right)^{1/2}    (13)

By solving the above equation for the optimal transformation matrices A and B, the transformed versions of Ii and Ij can be calculated as follows:

\tilde{I}_i = I_i A, \qquad \tilde{I}_j = I_j B    (14)

The optimized vectors \tilde{I}_i and \tilde{I}_j represent the coupled relationships between the two feature subsets Ii and Ij. Note that, unlike CCA, the CFA provides a feature selection capability in addition to feature dimension reduction and noise removal. These advantages make CFA a promising tool for many multimedia analysis tasks. The authors in [72] have shown that although all three methods (LSA, CCA and CFA) achieved significant dimensionality reduction, CFA gave the best results for talking head analysis. The CFA method achieved 91% detection accuracy as compared to the LSA (66.1%) and the CCA (73.9%).
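For square orthogonal A and B, minimizing Eq. 12 is equivalent to maximizing trace(A^T I_i^T I_j B), whose optimizers can be read off the singular value decomposition of I_i^T I_j. The sketch below follows that observation with random stand-in feature matrices; it is an illustrative reading of Eqs. 12–14 only, not the exact pipeline of [72], which also includes normalization steps.

import numpy as np

rng = np.random.default_rng(1)
I_i = rng.normal(size=(100, 6))   # stand-in audio feature matrix (rows = samples)
I_j = rng.normal(size=(100, 4))   # stand-in visual feature matrix

# The SVD of I_i^T I_j supplies orthogonal matrices that maximize the coupling term.
U, s, Vt = np.linalg.svd(I_i.T @ I_j, full_matrices=True)
A, B = U, Vt.T                    # orthogonal transformation matrices of Eq. 12

I_i_tilde, I_j_tilde = I_i @ A, I_j @ B          # transformed feature sets (Eq. 14)
coupling = [np.corrcoef(I_i_tilde[:, k], I_j_tilde[:, k])[0, 1]
            for k in range(min(I_i_tilde.shape[1], I_j_tilde.shape[1]))]
print(coupling)   # per-dimension coupling between the two transformed subsets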
All the methods described above compute the correlation between the features extracted from different modalities. In the following, we describe the methods that have been used to compute the correlation at the semantic (or decision) level.

Causal link analysis. The events that happen in an environment are often correlated. For instance, the events of "elevator pinging", "elevator door opening" and "people coming out of the elevator" usually occur at relative times one after another. This temporal relationship between events has been utilized by Stauffer [123] for detecting events in a surveillance environment. This kind of analysis of the events has been called causal link analysis.

Assuming that two events are linked, the likelihood p(c_i, c_j, dt_{ij} | c_{i,j} = 1) of a pair (c_i, c_j) of events and their relative time dt_{ij} is estimated. Note that dt_{ij} is the time difference between the absolute times of event i and event j. The term c_{i,j} = 1 indicates that the occurrence of the first event c_i is directly responsible for the occurrence of the second event c_j. Once the estimates of the posterior likelihood of c_{i,j} = 1 for all i, j pairs of events have been computed, an optimal chaining hypothesis is iteratively determined. The authors have demonstrated that the causal link analysis significantly helps the overall accuracy of event detection in an audio–video surveillance environment.

Causal strength. The causal strength is a measure of the cause due to which two variables may co-vary with each other. Wu et al. [142] have preferred to use the causal strength between the context/content and the semantic ontology, as described in Sect. 3.2.4. Here we describe how the causal strength is computed by adopting a probabilistic model. Let u and d be chance and decision variables, respectively. The chance variables imply effects (e.g. wearing a warm jacket and drinking hot coffee) and the decision variables denote causes (e.g. cold weather). The causal strength CS_{u|d} is computed using Eq. 15:

CS_{u|d} = \frac{P(u | d, n) - P(u, n)}{1 - P(u, n)}    (15)

In the above equation, n refers to the state of the world, e.g. the indoor or outdoor environment in the above-mentioned example, and the terms P(u|d, n) and P(u, n) are the conditional probabilities, assuming that d and n are independent of each other.

The authors have shown that the usage of causal strength provides not only improved accuracy of photo annotation, but also a better capability of assessing the annotation quality.

Agreement coefficient. Atrey et al. [8] have used the correlation among streams at the intra-media stream and inter-media stream levels. At the intra-media stream level, they used the traditional correlation coefficient; however, at the inter-media stream level, they introduced the notion of the decision-level "agreement coefficient". The agreement coefficient among streams has been computed based on how concurring or contradictory the evidence is that they provide. Intuitively, the higher the agreement among the streams, the more confidence one would have in the global decision, and vice versa [115].

The authors have modeled the agreement coefficient in the context of event detection in a multimedia surveillance scenario. The agreement coefficient \gamma^k_{i,j}(t) between the


media streams Ii and Ij while detecting the kth event at time instant t is obtained by iteratively averaging the past agreement coefficients with the current observation. Precisely, \gamma^k_{i,j}(t) is computed as:

\gamma^k_{i,j}(t) = \left(1 - 2 \, |p_{i,k}(t) - p_{j,k}(t)|\right) + \gamma^k_{i,j}(t-1)    (16)

where p_{i,k}(t) and p_{j,k}(t) are the individual probabilities of the occurrence of the kth event based on the media streams Ii and Ij, respectively, at time t \geq 1, and \gamma^k_{i,j}(0) = 1 - 2 \, |p_{i,k}(0) - p_{j,k}(0)|. These probabilities represent decisions about the events. Exactly the same probabilities would imply full agreement (\gamma^k_{i,j} = 1) while detecting the kth event, whereas totally dissimilar probabilities would mean that the two streams fully contradict each other (\gamma^k_{i,j} = -1). The authors have shown that the usage of the agreement coefficient resulted in better overall event detection accuracy in a surveillance scenario.
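A direct transcription of this update is sketched below, with fabricated per-frame event probabilities for two streams; the running form follows Eq. 16 as printed above.

def agreement_update(gamma_prev, p_i, p_j):
    # One step of the agreement coefficient update of Eq. 16
    return (1.0 - 2.0 * abs(p_i - p_j)) + gamma_prev

# Fabricated probabilities of the same event from two media streams
p_stream_i = [0.80, 0.75, 0.90, 0.85]
p_stream_j = [0.78, 0.10, 0.05, 0.02]   # the streams start to contradict each other

gamma = 1.0 - 2.0 * abs(p_stream_i[0] - p_stream_j[0])   # gamma at t = 0
for p_i, p_j in zip(p_stream_i[1:], p_stream_j[1:]):
    gamma = agreement_update(gamma, p_i, p_j)
print(gamma)   # decreases as the streams disagree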
Independence. It should be noted that, in addition to using the correlation among modalities, independent modalities can also be very useful in some cases to obtain a better decision. Let us consider the case of a multimodal dialog system [49, 96, 100]. In such systems, multiple modalities such as gesture and speech are used as a means of interaction. It is sometimes very hard to fuse these modalities at the feature level due to a lack of direct correspondence between their features and their different temporal alignment. However, each modality can complement the other in obtaining a decision about the intended interaction event. In this regard, each modality can be processed separately in parallel to derive individual decisions, and these individual decisions can later be fused at a semantic level to obtain the final decision [96]. Similarly, other cases of independence among modalities are also possible. For instance, the environment context, device context, network context, task context and so forth may provide complementary information to the fusion process, thereby making the overall analysis tasks more robust and accurate.

4.1.2 Confidence level of different modalities

Different modalities may have varying capabilities of accomplishing a multimedia analysis task. For example, in good lighting conditions, the video analysis may be more useful in detecting a human than the audio analysis, while in a dark environment, the audio analysis could be more handy. Therefore, in the fusion process, it is important to assign the appropriate confidence level to the participating streams [115]. The confidence in a stream is usually expressed by assigning an appropriate weight to it.

Many fusion methods such as the linear weighted fusion and the Bayesian inference do have a notion of specifying the weights of different modalities. However, the main question that remains to be answered is how to determine the weights of different modalities. These weights can vary based on several factors such as the context and the task performed. Therefore, the weights should be dynamically adjusted in order to obtain optimal fusion results.

While performing multimodal fusion, several researchers have adopted the strategy of weighting different modalities. However, many of them either have considered equal weights [83, 152] or have not elaborated on the issue of weight determination [55, 67, 136], and have left it to the users to decide [40].

Other works, such as Neti et al. [87], Iyenger et al. [57], Tatbul et al. [126], Hsu and Chang [53] and Atrey et al. [8], have used pre-computed weights in the fusion process. The weights of different streams have usually been determined based on their past accuracy or other prior knowledge. The computation of past accuracy requires a significant amount of training and testing. Moreover, since the process of computing the accuracy has to be performed in advance, the confidence value or the weight determined based on such an accuracy value is considered "static" during the fusion process. It is obvious that a static value of the confidence of a modality does not reflect its true current value, especially under a changing context. On the other hand, determining the confidence level for each stream based on its past accuracy is difficult. This is because the system may provide dissimilar accuracies for various tasks under different contexts. Pre-computation of the accuracies of all the streams for various detection tasks under varying contexts requires a significant amount of training and testing, which is often tedious and time consuming. Therefore, a mechanism that can determine the confidence levels of different modalities "on the fly", without pre-computation, needs to be explored.

In contrast to the above methods that used static confidence, some efforts (e.g. Tavakoli et al. [127] and Atrey et al. [10]) have also been made towards the dynamic computation of the confidence levels of different modalities. Tavakoli et al. [127] have used spatial and temporal information in clusters in order to determine the confidence level of sensors. The spatial information indicates that more sensors are covering a specific area; hence a higher confidence is assigned to the observations obtained from that area. The temporal information is obtained in the form of the sensors detecting the target consecutively for a number of time slots. If a target is consecutively detected, it is assumed that the sensors are reporting correctly. This method is more suited to environments where the sensors' locations change over time. In a fixed sensor setting, the confidence value will likely remain constant.

Recently, Atrey et al. [10] have also presented a method to dynamically compute the confidence level of a media stream based on its agreement coefficient with a trusted stream. The trusted stream is the one that has a confidence level above a certain threshold. The agreement coefficient


The agreement coefficient between any two streams will be high when similar decisions are obtained based on them, and vice versa. In this work, the authors have adopted the following idea: suppose one follows a trusted news bulletin and, in parallel, starts following an arbitrary news bulletin, comparing the news content provided by the two. Over a period of time, one's confidence in the arbitrary bulletin will grow if the content of the two bulletins is found to be similar, and vice versa. The authors have demonstrated that, when the confidence levels of different media streams computed using this method are used in the fusion process, the event detection results are comparable to those obtained using pre-computed confidence. The drawback of this method is that the assumption of having at least one trusted stream might not always be realistic.
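A toy sketch of this idea is given below. It is not the formulation used in [10]; it only illustrates how an agreement score between per-interval decisions can be turned into a slowly adapting confidence value. The decision sequences, the smoothing factor alpha and the initial confidence are invented for illustration.

```python
def agreement(decisions_a, decisions_b):
    """Fraction of time slots on which two streams produce the same decision."""
    same = sum(1 for a, b in zip(decisions_a, decisions_b) if a == b)
    return same / len(decisions_a)

def update_confidence(old_conf, trusted, candidate, alpha=0.1):
    """Move the candidate stream's confidence towards its agreement with the trusted stream."""
    return (1 - alpha) * old_conf + alpha * agreement(trusted, candidate)

# Hypothetical per-interval event decisions (1 = event detected, 0 = no event).
trusted_stream   = [1, 0, 0, 1, 1, 0, 1, 0]
candidate_stream = [1, 0, 1, 1, 1, 0, 0, 0]
conf = 0.5
for _ in range(5):          # confidence drifts towards the observed agreement
    conf = update_confidence(conf, trusted_stream, candidate_stream)
print(round(conf, 3))
```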
The above discussion shows that, although there have been some attempts to address the issue of dynamic weight adjustment, it is still an open research problem, which is essential for the overall fusion process. A summary of the representative works related to the computation of confidence is provided in Table 6.

Table 6 A list of the representative works related to the usage of confidence level in the fusion process

Mode of computation | The works | Multimedia analysis task | The confidence is determined based on
Static | Neti et al. [87] | Speaker recognition and speech event detection | The past accuracy
Static | Iyengar et al. [57] | Monologue detection | The past accuracy
Static | Tatbul et al. [126] | Military smart uniform | The past accuracy
Static | Hsu and Chang [53] | News video analysis | The past accuracy
Static | Atrey et al. [8] | Event detection for surveillance | The past accuracy
Dynamic | Tavakoli et al. [127] | Event detection in undersea sensor networks | The spatial and the temporal information of sensors' observations
Dynamic | Atrey et al. [10] | Event detection for surveillance | The agreement/disagreement between different streams

4.1.3 Contextual information

Context is ancillary information that greatly influences the performance of a fusion process. For example, time and location information significantly improves the accuracy of automatic photo classification [142]. Also, the lighting conditions may help in selecting the right set of sensors for detecting events in a surveillance environment.

Some of the earlier works that emphasized the importance of using contextual information in the fusion process include Brémond and Thonnat [21], Teriyan and Puuronen [129], Teissier et al. [128] and Westerveld [139]. Later, many other researchers, such as Sridharan et al. [122], Wang and Kankanhalli [135], Pfleger [100] and Atrey et al. [8], have demonstrated the advantages of using context in the fusion process.

Two research issues related to context are (1) what are the different forms of contextual information and how is the contextual information determined? and (2) how is it used in the fusion process? In the following, we discuss how these two issues have been addressed by researchers.

Context has been represented in different forms for different multimedia analysis tasks. For example, for the image classification task, the context could be time, location and camera parameters [142], while for a multimedia music selection task, the mood of the requester could be considered as context. We identify two types of contextual information that have often been considered: environmental context and situational context. The environmental context consists of time, the sensor's location or geographical location, weather conditions, etc. For example, if it is a dark environment, audio and IR sensor information should preferably be fused to detect a person [8]. The situational context could be in the form of the identity, mood, or capability of a person, etc. For example, if the person's mood is happy, a smart mirror should select and play a romantic song when s/he enters a smart house [50].

The contextual information can be determined by explicitly processing the sensor data, e.g. a mood detection algorithm can be applied to the video data to determine the mood of a person. On the other hand, it can also be obtained through other mechanisms, such as the time from a system clock, the location from a GPS device, and the sensors' geometry and location as a priori information from the system designer.

To integrate the contextual information in the fusion process, most researchers, such as Westerveld [139], Jasinschi et al. [62], Wang and Kankanhalli [135], Pfleger [100], Wu et al. [142] and Atrey et al. [8], have adopted a rule-based scheme. This scheme is very straightforward, as it follows the "if–then–else" strategy: for example, if it is day time, then the video cameras are assigned a greater weight than the audio sensors in the fusion process for detecting the event of a "human walking in the garden", and vice versa otherwise.
We describe some of these works in the following. In [139], the author integrated image features (content) and the textual information that comes with an image (context) at the semantic level. Similar to this work, Jasinschi et al. [62] presented a layered probabilistic framework that integrates multimedia content and context information. Within each layer, the representation of content and context is based on Bayesian networks, with hierarchical priors providing the connection between the two layers. The authors applied the framework to an end-to-end system called the video scout, which selects, indexes, and stores TV program segments based on topic classification. In the context of dialog systems, Pfleger [100] presented a multimedia fusion scheme for detecting user actions and events. While detecting these input events, the user's 'local turn context' is considered. This local turn context comprises all previously recognized input events and the dialog states that belong to the same user turn. Wu et al. [142] used context information (in the form of time and location) for photo annotation. They adopted a Bayesian network fusion approach in which the context is used to govern the transitions between nodes. Wang and Kankanhalli [135] and Atrey et al. [8] used context in the form of environment and sensor information. The environment information consisted of the geometry of the space under surveillance, while the sensor information was related to the sensors' location and orientation. While the works described above have used context in a static manner, Sridharan et al. [122] provided a computational model of context evolution. The proposed model represents context using semantic-nets. Context is defined as the union of semantic-nets, each of which can specify a fact about the environment. The inter-relationships among the various aspects of the system (e.g. the user, the environment, the allowable interactions, etc.) are used to define the overall system context. The evolution of context is modeled using a leaky bucket algorithm, which has been widely used for traffic control in networks.

The representative works related to the use of contextual information in the fusion process are summarized in Table 7. Although the rule-based strategy of integrating context in the fusion process is appealing, the number of rules grows rapidly under varying context in a real-world scenario. Therefore, other strategies for context determination and its integration in multimodal fusion remain to be explored in the future.
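To make the "if–then–else" strategy concrete, the following sketch maps a hypothetical context to per-modality weights before a weighted fusion step. The rules, weights and detector scores are illustrative rather than taken from any of the surveyed systems.

```python
def context_weights(context):
    """Toy 'if-then-else' rules mapping context to per-modality fusion weights."""
    if context.get("time_of_day") == "day":
        return {"video": 0.7, "audio": 0.3}     # good light: trust the cameras more
    if context.get("time_of_day") == "night":
        return {"video": 0.2, "audio": 0.8}     # dark scene: lean on the microphones
    return {"video": 0.5, "audio": 0.5}         # unknown context: neutral weighting

def fuse(scores, weights):
    return sum(weights[m] * scores[m] for m in scores) / sum(weights.values())

# Hypothetical 'human walking in the garden' detector scores at night.
scores = {"video": 0.35, "audio": 0.9}
print(fuse(scores, context_weights({"time_of_day": "night"})))
```

Even this toy rule set hints at the scalability problem noted above: every additional context variable (weather, location, user mood, and so on) multiplies the number of branches that have to be written and maintained by hand.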
4.2 Issue related to when to fuse

Different modalities are usually captured in different formats and at different rates. Therefore, they need to be synchronized before fusion takes place [95]. As fusion can be performed at the feature as well as the decision level, the issue of synchronization is also considered at these two levels. In feature level synchronization, the features obtained from different but closely coupled modalities captured during the same time period are combined together [28]. On the other hand, decision level synchronization needs to determine the designated points along the timeline at which the decisions should be fused. However, at both levels of fusion, the problem of synchronization arises in different forms. In the following, we elaborate on these problems and also describe some works that have addressed them.

Feature level synchronization is illustrated in Fig. 9a. Assume that the raw data from the two different types of modalities (modality 1 and modality 2 in the figure) are obtained at the same time t = 1. The feature extraction from these modalities can take different time periods (e.g. 2 and 1.5 time units for modality 1 and modality 2, respectively, in Fig. 9a). Because of these different data processing and feature extraction time periods, when the two features should be combined remains an issue. To resolve this issue, a simple strategy could be to fuse the features at regular intervals [8]. Although this strategy may not be the best, it is computationally less expensive.

Table 7 The representative works related to the use of contextual information in the fusion process

Contextual information | The works | Multimedia analysis task
Textual information with an image | Westerveld [139] | Image retrieval
Signature, pattern, or underlying structure in audio, video and transcript | Jasinschi et al. [62] | TV program segmentation
Environment and sensor information | Wang and Kankanhalli [135], Atrey et al. [8] | Event detection for surveillance
Word nets | Sridharan et al. [122] | Multimedia visualization and annotation
Past input events and the dialog state | Pfleger [100] | Detecting the user intention in multimodal dialog systems
Time, location and camera parameters | Wu et al. [142] | Photo annotation

Fig. 9 Illustration of the synchronization between two modalities at the a feature level, b decision level (the panels show, along a timeline, when raw data are obtained, when features are extracted, and when features are processed and decisions obtained)
An alternative strategy could be to combine all the features at the time instant they are all available (e.g. at t = 3 in Fig. 9a).

An illustration of synchronization at the decision level is provided in Fig. 9b. In contrast to feature level synchronization, where only the feature extraction time impacts the asynchrony between the modalities, here the additional time needed to obtain decisions based on the extracted features further affects it. For example, as shown in Fig. 9b, the time taken to obtain the decisions could be 1.5 and 1.75 time units for modality 1 and modality 2, respectively. However, similar to feature level synchronization, in this case also the decisions are fused using the various strategies discussed earlier (e.g. at the time instant all the decisions are available, t = 4 in Fig. 9b). Exploring the best strategy is an issue that can be considered in the future.

Another important synchronization issue is to determine the amount of raw data needed from different modalities for accomplishing a task. To mark the start and end of a task (e.g. event detection) over a timeline, there is a need to obtain and process the data streams at certain time intervals. For example, from a video stream of 24 frames/s, 2 s of data (48 frames) could be sufficient to determine a human walking event (by computing the blob displacement in a sequence of images); however, the same event (the sound of footsteps) could be detected using one second of audio data sampled at 44 kHz. This time period, which is basically the minimum amount of time needed to accomplish a task, could be different for different tasks when accomplished using various modalities. Ideally, it should be as small as possible, since a smaller value allows task accomplishment at a finer granularity in time. In other words, the minimum time period for a specific task should be just large enough to capture the data needed to accomplish it. Determining the minimum time period for different tasks is a research issue that needs to be explored in the future.

In the multimedia fusion literature, the issue of synchronization has not been widely addressed. This is because many researchers have focused on the accuracy aspect of the analysis tasks and performed experiments in an offline manner. In offline mode, the synchronization has often been performed manually, by aligning the modalities along a timeline. The researchers who performed analysis tasks in real time have usually adopted simple strategies such as synchronization at a regular interval. However, a regular interval may not be optimal and may not lead to the accomplishment of the task with the highest accuracy. Therefore, the issue of synchronization still remains to be explored.

In the following, we discuss some representative works that have focused on the synchronization issue in one way or the other. In the area of audio-visual speech processing, several researchers have computed the audio-visual synchrony. These works include Hershey and Movellan [47], Slaney and Covell [117], Iyengar et al. [57], Nock et al. [91] and Bredin and Chollet [19]. In these works, the audio-visual synchrony has been used as a measure of correlation that can be perceived as synchronization at the feature level.

The problem of synchronization at the decision level, which is more difficult than feature level synchronization, has been addressed by a few researchers, including Holzapfel et al. [49], Atrey et al. [8] and Xu and Chua [149]. Holzapfel et al. [49] have aligned the decisions obtained from the processing of the gesture and speech modalities. To identify the instants at which these two decisions are to be fused along the timeline, the authors computed the temporal correlation between the two modalities. Unlike Holzapfel et al. [49], Atrey et al. [8] adopted a simple strategy of combining the decisions at regular intervals. These decisions were obtained from audio and video event detectors. The authors empirically found that a time interval of one second was optimal in improving the overall accuracy of event detection.
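The sketch below illustrates such a regular-interval strategy: decisions arriving asynchronously from two detectors are buffered, and the most recent decision from each stream is fused at fixed time steps. The timestamps, scores, equal weighting and the one-second interval are only illustrative and do not reproduce the implementations in the cited works.

```python
import bisect

def latest_before(stream, t):
    """Return the score of the most recent (timestamp, score) pair at or before time t."""
    times = [ts for ts, _ in stream]
    i = bisect.bisect_right(times, t)
    return stream[i - 1][1] if i else None

def fuse_at_intervals(audio_decisions, video_decisions, interval=1.0, horizon=10.0):
    """Fuse the latest decision from each asynchronously produced stream at a fixed interval."""
    fused, t = [], interval
    while t <= horizon:
        a = latest_before(audio_decisions, t)
        v = latest_before(video_decisions, t)
        if a is not None and v is not None:
            fused.append((t, 0.5 * (a + v)))    # equal weights, for illustration
        t += interval
    return fused

audio = [(0.9, 0.8), (1.7, 0.6), (2.6, 0.9)]    # decisions arrive at irregular times
video = [(1.2, 0.7), (2.9, 0.4)]
print(fuse_at_intervals(audio, video, interval=1.0, horizon=3.0))
```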
The issue of time synchronization has also been widely addressed in news and sports video analysis. Satoh et al. [114] adopted a multimodal approach for face identification and naming in news video by aligning the text, audio and video modalities. The authors proposed to detect face sequences from the images and to extract name candidates from the transcripts. The transcripts were obtained from the audio tracks using a speech recognition technique. Moreover, the video captions were also processed to extract the name titles. Based on the audio-generated transcript and the video captions, the corresponding faces in the video were aligned.

In the domain of sports event analysis, to determine the time when an event occurred in the broadcast sports video, Babaguchi et al. [13] used the textual overlays that usually appear in sports video. A similar approach was used by Xu and Chua [149]. In their work, the authors used the text modality in addition to audio and video. The authors observed a significant asynchronism that resulted from the different time granularities of the audio–video and text analysis. While the audio–video frames were available at a regular interval of seconds, the text availability was very slow (approximately every minute). This was because a human operator usually enters the texts for live matches, and it takes a few minutes for them to become available for automatic analysis. The authors performed synchronization between the text and audio–video modalities by using alignment. This alignment was performed by maximizing the number of matches between text events and audio–video events. A text event and an audio–video event are considered matched when they are both within a temporal range, occur in the same sequential order, and the audio–video event conforms to the modeling of the text event. For example, an offense followed by a break conforms to the modeling of a goal event. Although the above mentioned synchronization method works well, it cannot be generalized, since it is domain-oriented and highly specific to the user-defined rules. Note that, while Babaguchi et al. [13] and Xu and Chua [149] attempted to synchronize the video based on the time extracted from the time overlays in the video and the web-cast text, respectively, Xu et al. [147, 148] adopted a different approach. In their work, the authors extracted the timing information from the broadcast video by detecting the video event boundaries. The authors observed that, as the web-cast text is usually not available in the broadcast video, time recognition from the broadcast sports video is a better choice for performing the alignment of the text and video.

A summary of the works described above is provided in Table 8.

Table 8 A list of representative works that have addressed the synchronization problem

Level of fusion | The work | Multimedia analysis task
Feature | Hershey and Movellan [47], Slaney and Covell [117], Nock et al. [91] | Speech recognition and speaker localization
Feature | Iyengar et al. [57] | Monologue detection
Feature | Bredin and Chollet [19] | Biometric-based person identification
Decision | Holzapfel et al. [49] | Dialog understanding
Decision | Atrey et al. [8] | Event detection for surveillance
Decision | Satoh et al. [114] | News video analysis
Decision | Babaguchi et al. [13], Xu and Chua [149], Xu et al. [147, 148] | Sports video analysis

4.3 Issue related to what to fuse

In the literature, the issue of what to fuse has been addressed at two different levels: modality selection and feature vector reduction. Modality selection refers to choosing among different types of modalities. For example, one can select and fuse the data of two video cameras and one microphone to determine the presence of a person. On the other hand, the fusion of features usually results in a large feature vector, which becomes a bottleneck for a particular analysis task. This is known as the curse of dimensionality. To overcome this problem, different data reduction techniques are applied to reduce the feature vector. In the following, we discuss various works that have addressed the modality selection and feature vector reduction issues.

4.3.1 Modality selection

The modality selection problem is similar to the sensor stream selection problem, which has often been considered as an optimization problem in which the best set of sensors is obtained subject to some cost constraints. Some of the fundamental works on sensor stream selection in the context of discrete-event systems and failure diagnosis include Oshman [94], Debouk et al. [37] and Jiang et al. [64]. Similarly, in the context of wireless sensor networks, the optimal media stream selection has been studied by Pahalawatta et al. [98], Lam et al. [70], and Isler and Bajcsy [56]. The details of these works are omitted in the interest of brevity.

In the context of multimedia analysis, the problem of optimal modality selection has been targeted by Wu et al. [143], Kankanhalli et al. [68], and Atrey et al. [9]. In [143], the authors proposed a two-step optimal fusion approach. In the first step, they find statistically independent modalities from the raw features. The second step then involves super-kernel fusion to determine the optimal combination of the individual modalities.

The authors have provided a tradeoff between modality independence and the curse of dimensionality. Their idea of selecting the optimal modalities is as follows. When the number of modalities is one, all the feature components are treated as a single-vector representation, which suffers from the curse of dimensionality. On the other hand, a large number of modalities reduces the curse of dimensionality, but the inter-modality correlation increases. An optimal number of modalities tends to balance the curse of dimensionality against the inter-modality correlation. The authors have demonstrated the utility of their scheme for image classification and video concept detection. Kankanhalli et al. [68] have also presented an experiential sampling based method to find the optimal subset of streams in multimedia systems. Their method is an extension of the work by Debouk et al. [37] to the context of multimedia. This method may have a high cost of computing the optimal subset, as it requires a minimum expected number of tests to be performed in order to determine the optimal subset.

Recently, Atrey et al. [9] have presented a dynamic programming based method to select the optimal subset of media streams. This method provides a threefold tradeoff between the extent to which the multimedia analysis task is accomplished by the selected subset of media streams, the overall confidence in this subset, and the cost of using this subset. In addition, their method also provides the flexibility for the system designer to choose the next best sensor if the best sensor is not available. They have demonstrated the utility of the proposed method for event detection in an audio–video surveillance scenario.
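To give a flavour of this gain/confidence/cost tradeoff, the brute-force sketch below scores every affordable subset of hypothetical modalities and returns the best one. The per-modality figures, the scoring function and the cost budget are all invented; the surveyed works use more principled formulations (e.g. the dynamic programming approach in [9]).

```python
from itertools import combinations
from math import prod

# Hypothetical per-modality figures: (expected gain for the task, confidence, usage cost).
modalities = {
    "video_cam_1": (0.60, 0.9, 3.0),
    "video_cam_2": (0.45, 0.8, 3.0),
    "microphone":  (0.35, 0.7, 1.0),
    "ir_sensor":   (0.25, 0.6, 0.5),
}

def subset_utility(subset, w_gain=1.0, w_conf=0.5, w_cost=0.3):
    # Toy scoring: diminishing returns on gain, averaged confidence, additive cost.
    gain = 1.0 - prod(1.0 - modalities[m][0] for m in subset)
    conf = sum(modalities[m][1] for m in subset) / len(subset)
    cost = sum(modalities[m][2] for m in subset)
    return w_gain * gain + w_conf * conf - w_cost * cost

def best_subset(budget=5.0):
    names = list(modalities)
    candidates = [
        c for r in range(1, len(names) + 1) for c in combinations(names, r)
        if sum(modalities[m][2] for m in c) <= budget
    ]
    return max(candidates, key=subset_utility)

print(best_subset())
```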
From the above discussion, it can be observed that only a few attempts (Wu et al. [143], Kankanhalli et al. [68], and Atrey et al. [9]) have been made to select the best (or optimal) subset of modalities for multimodal fusion. However, these methods have their own limitations and drawbacks. Moreover, they do not consider the different contexts under which the modalities may be selected. Therefore, a lot more can be done in this aspect of multimodal fusion. The methods for optimal modality subset selection described above are summarized in Table 9.

4.3.2 Feature vector reduction

It is important to mention that, besides selecting the optimal set of modalities, the issue of "what to fuse" for a particular multimedia analysis task also involves the reduction of the feature vector. The fusion of features obtained from different modalities usually results in a large feature vector, which becomes a bottleneck when processed to accomplish a multimedia analysis task. To handle such situations, researchers have used various data reduction techniques. The most commonly used techniques are principal component analysis (PCA), singular value decomposition (SVD) and linear discriminant analysis (LDA).

PCA is used to project higher dimensional data into a lower dimensional space while preserving as much information as possible. The projection that minimizes the squared error in reconstructing the original data is chosen to represent the reduced set of features. The PCA technique often does not perform well when the dimensionality of the feature set is very large. This limitation is overcome by the SVD technique, which is used to determine the eigenvectors that best represent the input feature set. While PCA and SVD are unsupervised techniques, LDA works in a supervised mode. LDA is used for determining a linear combination of features, which not only gives a reduced set of features but can also be used for classification. The readers may refer to [134] for further details about these feature dimensionality reduction methods.

In the multimodal fusion domain, many researchers have used these methods for feature vector dimension reduction. Some representative works are: Guironnet et al. [44] used PCA for video classification, Chetty and Wagner [28] utilized SVD for biometric person authentication, and Potamianos et al. [105] adopted LDA for speech recognition.
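As a minimal sketch of how such a reduction fits into an early-fusion pipeline, the following numpy code concatenates per-modality feature vectors and projects them onto their top principal components. The feature dimensions, sample counts and the choice of 20 retained components are arbitrary and not taken from the cited works.

```python
import numpy as np

def pca_reduce(features, k):
    """Project row-wise feature vectors onto their top-k principal components."""
    mean = features.mean(axis=0)
    centered = features - mean
    # SVD of the centered data; the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]                      # (k, d)
    return centered @ components.T, components, mean

# Early fusion: concatenate per-modality features, then reduce the dimensionality.
rng = np.random.default_rng(0)
audio = rng.normal(size=(200, 40))           # e.g. 40-d audio features per sample
video = rng.normal(size=(200, 120))          # e.g. 120-d visual features per sample
fused = np.hstack([audio, video])            # 160-d fused feature vector
reduced, _, _ = pca_reduce(fused, k=20)      # 20-d representation for the classifier
print(reduced.shape)                         # (200, 20)
```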
4.3.3 Other considerations

There are other situations in which the issue of "what to fuse" needs special consideration, for instance, dealing with unlabeled data in fusion [130] and handling noisy positive data for fusion [54].

There are three approaches used for learning with unlabeled data: semi-supervised learning, transductive learning and active learning [155]. Semi-supervised learning methods automatically exploit unlabeled data to help estimate the data distribution in order to improve learning performance.

Table 9 A summary of approaches used for optimal modality subset selection

The work | The optimality criteria | Drawback
Wu et al. [143] | Curse of dimensionality versus inter-modality correlation | The gain from a modality is overlooked
Kankanhalli et al. [68] | Information gain versus cost (time) | Cost of computing the optimal subset could be high
Atrey et al. [9] | Gain versus confidence level versus processing cost | The issue of how frequently the optimal subset should be recomputed needs a formalization

Transductive learning differs from semi-supervised learning in that it selects the unlabeled data from the test data set. In active learning methods, on the other hand, the learning algorithm selects the unlabeled examples and actively queries the user/teacher for their labels. Here, the learning algorithm is supposed to be good enough to choose the least number of examples needed to learn a concept; otherwise, there is a risk of including unimportant and irrelevant examples.

Another important issue that needs to be resolved is how to reduce outliers or noise in the input data for fusion. In the multimodal fusion process, noisy data usually results in reduced classification accuracy and an increased training time and classifier size. There are various solutions for dealing with noisy data, for instance, employing a noise filtering mechanism to smooth the noisy data, or applying an appropriate sampling technique to separate the noisy data from the input data before the fusion takes place [137].
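One very simple instance of such a noise filtering mechanism is a sliding median filter applied to a detector's score sequence before fusion, as sketched below; the score values and the window size are illustrative only.

```python
from statistics import median

def median_smooth(scores, window=3):
    """Median-filter a sequence of detector scores to suppress isolated outliers."""
    half = window // 2
    return [
        median(scores[max(0, i - half): i + half + 1])
        for i in range(len(scores))
    ]

noisy = [0.1, 0.12, 0.95, 0.11, 0.13, 0.9, 0.92, 0.88]   # one spurious spike at index 2
print([round(s, 2) for s in median_smooth(noisy)])
```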
5 Benchmark datasets and evaluation

5.1 Datasets

Many multimodal fusion applications use publicly available datasets. One example is the small 2k image dataset from the Corel Image CDs. It contains representative images of fourteen categories, including architecture, bears, clouds, elephants, fabrics, fireworks, flowers, food, landscape, people, textures, tigers, tools, and waves. Different features, such as color and texture, can be extracted from this 2k image dataset and used for fusion, as shown in [143].

Among video-based fusion research, the most popular are the well-known TRECVID datasets [2], which have been available in different versions since 2001. A quick view of the high-level feature extraction from these datasets can be found in [118]. Depending on their release, these datasets contain data files about broadcast news video, sound and vision video, BBC rushes video, London Gatwick surveillance video, and test dataset annotations for surveillance event detection. Features from the visual, audio and caption tracks in the TRECVID datasets are extracted and used in fusion for various multimedia analysis tasks, such as video shot retrieval [83], semantic video analysis [121], news video story segmentation [52], video concept detection [58, 143] and so on.

The fusion literature related to biometric identification and verification makes ongoing efforts to build multimodal biometric databases. Examples include BANCA [14], which contains face and speech modalities; XM2VTS [82], which contains synchronized video and speech data; BIOMET [42], which contains face, speech, fingerprint, hand and signature modalities; MCYT [93], which contains 10-print fingerprint and signature modalities; and several others as mentioned in [104].

Another popular dataset standardization effort has been the agenda of the performance evaluation of tracking and surveillance (PETS) community [1]. Several researchers have used the PETS datasets for multimodal analysis tasks, for example, object tracking [154].

Although there are several available datasets that can be used for various analysis tasks, there is still no standardization effort towards a common dataset for multimodal fusion research.

5.2 Evaluation measures

Several evaluation metrics are usually used to measure the performance of fusion-based multimedia analysis tasks. For example, the NIST average precision metric is used to determine the accuracy of semantic concept detection at the video shot level [58, 121, 143]. For news video story segmentation, the precision and recall metrics are widely used [52]. Precision and recall measures are also commonly used for image retrieval [55]. Similarly, for video shot retrieval some researchers use mean average precision [83]. For image categorization, the accuracy of the classification is measured in terms of the image category detection rate [156].
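For readers unfamiliar with these measures, the sketch below computes precision, recall and (non-interpolated) average precision for a toy ranked retrieval result; the shot identifiers and the ground truth set are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against the relevant ground truth."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

def average_precision(ranked, relevant):
    """Average precision over a ranked result list (the basis of the NIST AP metric)."""
    hits, total = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

relevant = {"shot3", "shot7", "shot9"}
ranked   = ["shot3", "shot1", "shot7", "shot4", "shot9"]
print(precision_recall(ranked, relevant))             # (0.6, 1.0)
print(round(average_precision(ranked, relevant), 3))  # 0.756
```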
To measure the performance of tracking-related analysis tasks, the dominant evaluation metrics include the mean distance from the track, the detection rate, the false positive rate, recall and precision [131]. Similarly, for speaker position estimation, researchers have measured the tracking error and calculated the average distance between the true and estimated positions of the speaker [125]. In [154], the authors calculated the variance of the motion direction and the variance of compactness to assess the accuracy of object tracking.

Recently, Hossain et al. [51] presented a multi-criteria evaluation metric to determine the quality of information obtained based on multimodal fusion. The evaluation metric includes certainty, accuracy and timeliness. The authors showed its applicability in the domain of multimedia monitoring.

In human–computer interaction, fusion is mostly used to identify multimodal commands or input interactions of humans, such as gestures, speech, etc. Therefore, metrics such as the speech recognition accuracy and the gesture recognition accuracy are used to measure the accuracy of these tasks [49].

Furthermore, to evaluate the fusion result for biometric verification, the false acceptance rate (FAR) and false rejection rate (FRR) are used to identify the types of errors [104]. The FAR and FRR are often combined into the half total error rate (HTER), typically computed as their average, which is a measure to assess the quality of a biometric verification system [17].

Overall, we observed that researchers have used different evaluation criteria for different analysis tasks.

However, a common evaluation framework for multimodal fusion is yet to be achieved.

6 Conclusions and future research directions

We have surveyed the state-of-the-art research related to multimodal fusion and commented on these works from the perspective of the usage of different modalities, the levels of fusion, and the methods of fusion. We have further provided a discussion that summarizes our observations based on the reviewed literature, which can help readers in choosing an appropriate fusion methodology and level of fusion. Some distinctive issues (e.g. correlation, confidence) that influence the fusion process have also been elaborated in greater detail.

Although a significant number of multimedia analysis tasks have been successfully performed using a variety of fusion methods, there are several areas of investigation that may be explored in the future. We have identified some of them as follows:

1. Multimedia researchers have mostly used the audio, video and text modalities for various multimedia analysis tasks. However, the integration of some new modalities, such as RFID for person identification or haptics for dialog systems, can be explored further.
2. The appropriate synchronization of the different modalities is still a big research problem for multimodal fusion researchers. Specifically, when and how much data should be processed from different modalities to accomplish a multimedia analysis task is an issue that has not yet been explored exhaustively.
3. The problem of optimal weight assignment to the different modalities under a varying context is an open problem. Since we usually have different confidence levels in the different modalities for accomplishing various analysis tasks, the problem of dynamically computing the confidence information of the different streams for various tasks becomes challenging and worth researching in the future.
4. How should context be integrated in the fusion process? This question can be answered by thinking beyond the "if–then–else" strategy. There is a need to formalize the concept of context. How may a changing context influence the fusion process? What model would be most suitable to simulate the varying nature of context? These questions require greater attention from multimedia researchers.
5. The feature level correlation among different modalities has been utilized in an effective way. However, it has been observed that correlation at the semantic level (decision level) has not been fully explored, although some initial attempts have been reported.
6. The optimal modality selection for fusion is emerging as an important research issue. From the available set, which modalities should be fused to accomplish a task at a particular time instant? The utility of these modalities may change with the varying context. Moreover, the optimality of modality selection can be determined based on various constraints, such as the extent to which the task is accomplished, the confidence with which the task is accomplished, and the cost of using the modalities for performing the task. As the optimal subset changes over time, how frequently it should be recomputed, so that the cost of re-computation is reduced while meeting the timeliness requirement, is an open problem for multimedia researchers to consider.
7. Last but not least, there are various evaluation metrics that are used to measure the performance of different multimedia analysis tasks. However, it would be interesting to work on a common evaluation framework that can be used by the multimodal fusion community.

Multimodal fusion for multimedia analysis is a promising research area. This survey has covered the existing works in this domain and identified several relevant issues that deserve further investigation.

Acknowledgments The authors would like to thank the editor and the anonymous reviewers for their valuable comments in improving the content of this paper. This work is partially supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

References

1. PETS: Performance evaluation of tracking and surveillance (Last access date 31 August 2009). http://www.cvg.rdg.ac.uk/slides/pets.html
2. TRECVID data availability (Last access date 02 September 2009). http://www-nlpir.nist.gov/projects/trecvid/trecvid.data.html
3. Adams, W., Iyengar, G., Lin, C., Naphade, M., Neti, C., Nock, H., Smith, J.: Semantic indexing of multimedia content using visual, audio, and text cues. EURASIP J. Appl. Signal Process. 2003(2), 170–185 (2003)
4. Aguilar, J.F., Garcia, J.O., Romero, D.G., Rodriguez, J.G.: A comparative evaluation of fusion strategies for multimodal biometric verification. In: International Conference on Video-Based Biometric Person Authentication, pp. 830–837. Guildford (2003)
5. Aleksic, P.S., Katsaggelos, A.K.: Audio-visual biometrics. Proc. IEEE 94(11), 2025–2044 (2006)
6. Andrieu, C., Doucet, A., Singh, S., Tadic, V.: Particle methods for change detection, system identification, and control. Proc. IEEE 92(3), 423–438 (2004)
7. Argillander, J., Iyengar, G., Nock, H.: Semantic annotation of multimedia using maximum entropy models. In: International

Conference on Accoustic, Speech and Signal Processing, Conference on Acoustics, Speech, and Signal Processing, vol. 5,
pp. II–153–156. Philadelphia (2005) pp. 1005–1008. IEEE Computer Society, Philadelphia (2005)
8. Atrey, P.K., Kankanhalli, M.S., Jain, R.: Information assimila- 27. Chen, Q., Aickelin, U.: Anomaly detection using the dempster–
tion framework for event detection in multimedia surveillance shafer method. In: International Conference on Data Mining,
systems. Springer/ACM Multimed. Syst. J. 12(3), 239–253 pp. 232–240. Las Vegas (2006)
(2006) 28. Chetty, G., Wagner, M.: Audio-visual multimodal fusion for
9. Atrey, P.K., Kankanhalli, M.S., Oommen, J.B.: Goal-oriented biometric person authentication and liveness verification. In:
optimal subset selection of correlated multimedia streams. ACM NICTA-HCSNet Multimodal User Interaction Workshop,
Trans. Multimedia Comput. Commun. Appl. 3(1), 2 (2007) pp. 17–24. Sydney (2006)
10. Atrey, P.K., Kankanhalli, M.S., El Saddik, A.: Confidence 29. Chieu, H.L., Lee, Y.K.: Query based event extraction along a
building among correlated streams in multimedia surveillance timeline. In: International ACM Conference on Research and
systems. In: International Conference on Multimedia Modeling, Development in Information Retrieval, pp. 425–432. Sheffield
pp. 155–164. Singapore (2007) (2004)
11. Ayache, S., Quénot, G., Gensel, J.: Classifier fusion for svm- 30. Choudhury, T., Rehg, J.M., Pavlovic, V., Pentland, A.: Boosting
based multimedia semantic indexing. In: The 29th European and structure learning in dynamic bayesian networks for audio-
Conference on Information Retrieval Research, pp. 494–504. visual speaker detection. In: The 16th International Conference
Rome (2007) on Pattern Recognition, vol. 3, pp. 789–794. Quebec (2002)
12. Babaguchi, N., Kawai, Y., Kitahashi, T.: Event based indexing 31. Chua, T.S., Chang, S.F., Chaisorn, L., Hsu, W.: Story boundary
of broadcasted sports video by intermodal collaboration. IEEE detection in large broadcast news video archives: techniques,
Trans. Multimed. 4, 68–75 (2002) experience and trends. In: ACM International Conference on
13. Babaguchi, N., Kawai, Y., Ogura, T., Kitahashi, T.: Personal- Multimedia, pp. 656–659. New York, USA (2004)
ized abstraction of broadcasted american football video by 32. Corradini, A., Mehta, M., Bernsen, N., Martin, J., Abrilian, S.:
highlight selection. IEEE Trans. Multimed. 6(4), 575–586 Multimodal input fusion in human–computer interaction. In:
(2004) NATO-ASI Conference on Data Fusion for Situation Monitor-
14. Bailly-Bailliére, E., Bengio, S., Bimbot, F., Hamouz, M., Kittler, ing, Incident Detection, Alert and Response Management.
J., Mariéthoz, J., Matas, J., Messer, K., Popovici, V., Porée, F., Karlsruhe University, Germany (2003)
Ruı́z, B., Thiran, J.P.: The BANCA database and evaluation 33. Crisan, D., Doucet, A.: A survey of convergence results on
protocol. In: International Conference on Audio-and Video-Based particle filtering methods for practitioners. IEEE Trans. Signal
Biometrie Person Authentication, pp. 625–638. Guildford (2003) Process. 50(3), 736–746 (2002)
15. Beal, M.J., Jojic, N., Attias, H.: A graphical model for audio- 34. Cutler, R., Davis, L.: Look who’s talking: Speaker detection
visual object tracking. IEEE Trans. Pattern Anal. Mach. Intell. using video and audio correlation. In: IEEE International Con-
25, 828– 836 (2003) ference on Multimedia and Expo, pp. 1589–1592. New York
16. Bendjebbour, A., Delignon, Y., Fouque, L., Samson, V., Piec- City (2000)
zynski, W.: Multisensor image segmentation using Dempster– 35. Darrell, T., Fisher III, J.W., Viola, P., Freeman, W.: Audio-
Shafer fusion in markov fields context. IEEE Trans. Geosci. visual segmentation and ‘‘the cocktail party effect’’. In: Inter-
Remote Sens. 39(8), 1789–1798 (2001) national Conference on Multimodal Interfaces. Bejing (2000)
17. Bengio, S.: Multimodal authentication using asynchronous 36. Datcu, D., Rothkrantz, L.J.M.: Facial expression recognition
hmms. In: The 4th International Conference Audio and Video with relevance vector machines. In: IEEE International Con-
Based Biometric Person Authentication, pp. 770–777. Guildford ference on Multimedia and Expo, pp. 193–196. Amsterdam, The
(2003) Netherlands (2005)
18. Bengio, S., Marcel, C., Marcel, S., Mariethoz, J. Confidence 37. Debouk, R., Lafortune, S., Teneketzis, D.: On an optimal
measures for multimodal identity verification. Inf. Fusion 3(4), problem in sensor selection. J. Discret. Event Dyn. Syst. Theory
267–276 (2002) Appl. 12, 417–445 (2002)
19. Bredin, H., Chollet, G.: Audio-visual speech synchrony measure 38. Ding, Y., Fan, G.: Segmental hidden markov models for view-
for talking-face identity verification. In: IEEE International based sport video analysis. In: International Workshop on
Conference on Acoustics, Speech and Signal Processing, vol. 2, Semantic Learning Applications in Multimedia. Minneapolis
pp. 233–236. Paris (2007) (2007)
20. Bredin, H., Chollet, G.: Audiovisual speech synchrony measure: 39. Fisher-III, J., Darrell, T., Freeman, W., Viola, P.: Learning joint
application to biometrics. EURASIP J. Adv. Signal Process. statistical models for audio-visual fusion and segregation. In:
11 p. (2007). Article ID 70186 Advances in Neural Information Processing Systems, pp. 772–
21. Brémond, F., Thonnat, M.: A context representation of sur- 778. Denver (2000)
veillance systems. In: European Conference on Computer 40. Foresti, G.L., Snidaro, L.: A distributed sensor network for
Vision. Orlando (1996) video surveillance of outdoor environments. In: IEEE Interna-
22. Brooks, R.R., Iyengar, S.S.: Multi-sensor Fusion: Fundamentals tional Conference on Image Processing. Rochester (2002)
and Applications with Software. Prentice Hall PTR, Upper 41. Gandetto, M., Marchesotti, L., Sciutto, S., Negroni, D.,
Saddle River, NJ (1998) Regazzoni, C.S.: From multi-sensor surveillance towards smart
23. Burges, C.J.C.: A tutorial on support vector machines for pattern interactive spaces. In: IEEE International Conference on Mul-
recognition. Data Min. Knowl. Discov. 2(2), 121–167 (1998) timedia and Expo, pp. I:641–644. Baltimore (2003)
24. Caruana, R., Munson, A., Niculescu-Mizil, A.: Getting the most 42. Garcia Salicetti, S., Beumier, C., Chollet, G., Dorizzi, B., les
out of ensemble selection. In: ACM International Conference on Jardins, J., Lunter, J., Ni, Y., Petrovska Delacretaz, D.: BIO-
on Data Mining, pp. 828–833. Maryland (2006) MET: A multimodal person authentication database including
25. Chaisorn, L., Chua, T.S., Lee, C.H., Zhao, Y., Xu, H., Feng, H., face, voice, fingerprint, hand and signature modalities. In:
Tian, Q.: A multi-modal approach to story segmentation for International Conference on Audio-and Video-Based Biometrie
news video. World Wide Web 6, 187–208 (2003) Person Authentication, pp. 845–853. Guildford, UK (2003)
26. Chang, S.F., Manmatha, R., Chua, T.S.: Combining text and 43. Gehrig, T., Nickel, K., Ekenel, H., Klee, U., McDonough, J.:
audio-visual features in video indexing. In: IEEE International Kalman filters for audio–video source localization. In: IEEE

Workshop on Applications of Signal Processing to Audio and 62. Jasinschi, R.S., Dimitrova, N., McGee, T., Agnihotri, L., Zim-
Acoustics, pp. 118– 121. Karlsruhe University, Germany (2005) merman, J., Li, D., Louie, J.: A probabilistic layered framework
44. Guironnet, M., Pellerin, D., Rombaut, M.: Video classification for integrating multimedia content and context information. In:
based on low-level feature fusion model. In: The 13th European International Conference on Acoustics, Speech and Signal Pro-
Signal Processing Conference. Antalya, Turkey (2005) cessing, vol. II, pp. 2057–2060. Orlando (2002)
45. Hall, D.L., Llinas, J.: An introduction to multisensor fusion. In: 63. Jeon, J., Manmatha, R.: Using maximum entropy for automatic
Proceedings of the IEEE: Special Issues on Data Fusion, vol. 85, image annotation. In: International Conference on Image and
no. 1, pp. 6–23 (1997) Video Retrieval, vol. 3115, pp. 24–32. Dublin (2004)
46. Hershey, J., Attias, H., Jojic, N., Krisjianson, T.: Audio visual 64. Jiang, S., Kumar, R., Garcia, H.E.: Optimal sensor selection for
graphical models for speech processing. In: IEEE International discrete event systems with partial observation. IEEE Trans.
Conference on Speech, Acoustics, and Signal Processing, Automat. Contr. 48, 369–381 (2003)
pp. 649–652. Montreal (2004) 65. Julier, S.J., Uhlmann, J.K.: New extension of the Kalman filter
47. Hershey, J., Movellan, J.: Audio-vision: using audio-visual to nonlinear systems. In: Signal Processing, Sensor Fusion, and
synchrony to locate sounds. In: Advances in Neural Information Target Recognition VI, vol. 3068 SPIE, pp. 182–193. San Diego
Processing Systems, pp. 813–819. MIT Press, USA (2000) (1997)
48. Hinton, G.E., Osindero, S., Teh, Y.: A fast learning algorithm 66. Kalman, R.E.: A new approach to linear filtering and prediction
for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006) problems. Trans. ASME J. Basic Eng. 82(series D), 35–45
49. Holzapfel, H., Nickel, K., Stiefelhagen, R.: Implementation and (1960)
evaluation of a constraint-based multimodal fusion system for 67. Kankanhalli, M.S., Wang, J., Jain, R.: Experiential sampling in
speech and 3d pointing gestures. In: ACM International Con- multimedia systems. IEEE Trans. Multimed. 8(5), 937–946
ference on Multimodal Interfaces, pp. 175–182. State College, (2006)
PA (2004) 68. Kankanhalli, M.S., Wang, J., Jain, R.: Experiential sampling on
50. Hossain, M.A., Atrey, P.K., El Saddik, A.: Smart mirror for multiple data streams. IEEE Trans. Multimed. 8(5), 947–955
ambient home environment. In: The 3rd IET International (2006)
Conference on Intelligent Environments, pp. 589–596. Ulm 69. Kittler, J., Hatef, M., Duin, R.P., Matas, J.: On combining
(2007) classifiers. IEEE Trans. Pattern Anal. Mach. Intell. 20(3), 226–
51. Hossain, M.A., Atrey, P.K., El Saddik, A.: Modeling and 239 (1998)
assessing quality of information in multi-sensor multimedia 70. Lam, K.Y., Cheng, R., Liang, B.Y., Chau, J.: Sensor node
monitoring systems. ACM Trans. Multimed. Comput. Commun. selection for execution of continuous probabilistic queries in
Appl. 7(1) (2011) wireless sensor networks. In: ACM International Workshop on
52. Hsu, W., Kennedy, L., Huang, C.W., Chang, S.F., Lin, C.Y.: Video Surveillance and Sensor Networks, pp. 63–71. NY, USA
News video story segmentation using fusion of multi-level (2004)
multi-modal features in TRECVID 2003. In: International 71. León, T., Zuccarello, P., Ayala, G., de Ves, E., Domingo, J.:
Conference on Acoustics Speech and Signal Processing. Mon- Applying logistic regression to relevance feedback in image
treal, QC (2004) retrieval systems. Pattern Recognit. 40(10), 2621–2632 (2007)
53. Hsu, W.H.M., Chang, S.F.: Generative, discriminative, and 72. Li, D., Dimitrova, N., Li, M., Sethi, I.K.: Multimedia content
ensemble learning on multi-modal perceputal fusion toward processing through cross-modal association. In: ACM Interna-
news stroy segmentation. In: IEEE International Conference on tional Conference on Multimedia (2003)
Multimedia and Expos, pp. 1091–1094. Taipei (2004) 73. Li, F.F., Perona, P.: A bayesian hierarchical model for learning
54. Hu, H., Gan, J.Q.: Sensors and data fusion algorithms in mobile natural scene categories. In: IEEE Computer Society Conference
robotics. Technical report, CSM-422, Department of Computer on Computer Vision and Pattern Recognition, vol. 2, pp. 524–
Science, University of Essex, UK (2005) 531. Washington (2005)
55. Hua, X.S., Zhang, H.J.: An attention-based decision fusion 74. Li, M., Li, D., Dimitrove, N., Sethi, I.K.: Audio-visual talking
scheme for multimedia information retrieval. In: The 5th Paci- face detection. In: International Conference on Multimedia and
fic-Rim Conference on Multimedia. Tokyo, Japan (2004) Expo, pp. 473–476. Baltimore, MD (2003)
56. Isler, V., Bajcsy, R.: The sensor selection problem for bounded 75. Liu, X., Zhang, L., Li, M., Zhang, H., Wang, D.: Boosting image
uncertainty sensing models. In: International Symposium on classification with lda-based feature combination for digital
Information Processing in Sensor Networks, pp. 151–158. Los photograph management. Pattern Recognit. 38(6), 887–901
Angeles (2005) (2005)
57. Iyengar, G., Nock, H.J., Neti, C.: Audio-visual synchrony for 76. Liu, Y., Zhang, D., Lu, G., Tan, A.H.: Integrating semantic
detection of monologue in video archives. In: IEEE International templates with decision tree for image semantic learning. In:
Conference on Acoustics, Speech, and Signal Processing. Hong The 13th International Multimedia Modeling Conference, pp.
Kong (2003) 185–195. Singapore (2007)
58. Iyengar, G., Nock, H.J., Neti, C.: Discriminative model fusion 77. Loh, A., Guan, F., Ge, S.S.: Motion estimation using audio and
for semantic concept detection and annotation in video. In: video fusion. In: International Conference on Control, Auto-
ACM International Conference on Multimedia, pp. 255–258. mation, Robotics and Vision, vol. 3, pp. 1569–1574 (2004)
Berkeley (2003) 78. Lucey, S., Sridharan, S., Chandran, V.: Improved speech rec-
59. Jaffre, G., Pinquier, J.: Audio/video fusion: a preprocessing step ognition using adaptive audio-visual fusion via a stochastic
for multimodal person identification. In: International Workshop secondary classifier. In: International Symposium on Intelligent
on MultiModal User Authentification. Toulouse, France (2006) Multimedia, Video and Speech Processing, pp. 551–554. Hong
60. Jaimes, A., Sebe, N.: Multimodal human computer interaction: a Kong (2001)
survey. In: IEEE International Workshop on Human Computer 79. Luo, R.C., Yih, C.C., Su, K.L.: Multisensor fusion and inte-
Interaction. Beijing (2005) gration: Approaches, applications, and future research direc-
61. Jain, A., Nandakumar, K., Ross, A.: Score normalization in tions. IEEE Sens. J. 2(2), 107–119 (2002)
multimodal biometric systems. Pattern Recognit. 38(12), 2270– 80. Magalhães, J., Rüger, S.: Information-theoretic semantic multi-
2285 (2005) media indexing. In: International Conference on Image and

Video Retrieval, pp. 619–626. Amsterdam, The Netherlands 99. Perez, D.G., Lathoud, G., McCowan, I., Odobez, J.M., Moore,
(2007) D.: Audio-visual speaker tracking with importance particle filter.
81. Makkook, M.A.: A multimodal sensor fusion architecture for In: IEEE International Conference on Image Processing (2003)
audio-visual speech recognition. MS Thesis, University of 100. Pfleger, N.: Context based multimodal fusion. In: ACM Inter-
Waterloo, Canada (2007) national Conference on Multimodal Interfaces, pp. 265–272.
82. Matas, J., Hamouz, M., Jonsson, K., Kittler, J., Li, Y., Kotropoulos, State College (2004)
C., Tefas, A., Pitas, I., Tan, T., Yan, H., Smeraldi, F., Capdevielle, 101. Pfleger, N.: Fade - an integrated approach to multimodal fusion
N., Gerstner, W., Abdeljaoued, Y., Bigun, J., Ben-Yacoub, S., and discourse processing. In: Dotoral Spotlight at ICMI 2005.
Mayoraz, E.: Comparison of face verification results on the Trento, Italy (2005)
XM2VTS database. p. 4858. Los Alamitos, CA, USA (2000) 102. Pitsikalis, V., Katsamanis, A., Papandreou, G., Maragos, P.:
83. McDonald, K., Smeaton, A.F.: A comparison of score, rank and Adaptive multimodal fusion by uncertainty compensation. In:
probability-based fusion methods for video shot retrieval. In: Ninth International Conference on Spoken Language Process-
International Conference on Image and Video Retrieval, ing. Pittsburgh (2006)
pp. 61–70. Singapore (2005) 103. Poh, N., Bengio, S.: How do correlation and variance of base-
84. Mena, J.B., Malpica, J.: Color image segmentation using the experts affect fusion in biometric authentication tasks? IEEE
dempster–shafer theory of evidence for the fusion of texture. In: Trans. Signal Process. 53, 4384–4396 (2005)
International Archives of Photogrammetry, Remote Sensing and 104. Poh, N., Bengio, S.: Database, protocols and tools for evaluating
Spatial Information Sciences, vol. XXXIV, Part 3/W8, score-level fusion algorithms in biometric authentication. Pat-
pp. 139–144. Munich, Germany (2003) tern Recognit. 39(2), 223–233 (2006) (Part Special Issue:
85. Meyer, G.F., Mulligan, J.B., Wuerger, S.M.: Continuous audio- Complexity Reduction)
visual digit recognition using N-best decision fusion. J. Inf. 105. Potamianos, G., Luettin, J., Neti, C.: Hierarchical discriminant
Fusion 5, 91–101 (2004) features for audio-visual LVSCR. In: IEEE International Con-
86. Nefian, A.V., Liang, L., Pi, X., Liu, X., Murphye, K.: Dynamic ference on Acoustic Speech and Signal Processing, pp. 165–168.
bayesian networks for audio-visual speech recognition. EURA- Salt Lake City (2001)
SIP J. Appl. Signal Process. 11, 1–15 (2002) 106. Potamianos, G., Neti, C., Gravier, G., Garg, A., Senior, A.:
87. Neti, C., Maison, B., Senior, A., Iyengar, G., Cuetos, P., Basu, Recent advances in the automatic recognition of audiovisual
S., Verma, A.: Joint processing of audio and visual information speech. Proc. IEEE 91(9), 1306–1326 (2003)
for multimedia indexing and human-computer interaction. In: 107. Potamitis, I., Chen, H., Tremoulis, G.: Tracking of multiple
International Conference RIAO. Paris, France (2000) moving speakers with multiple microphone arrays. IEEE Trans.
88. Ni, J., , Ma, X., Xu, L., Wang, J.: An image recognition method Speech Audio Process. 12(5), 520–529 (2004)
based on multiple bp neural networks fusion. In: IEEE Inter- 108. Radova, V., Psutka, J.: An approach to speaker identification
national Conference on Information Acquisition (2004) using multiple classifiers. In: IEEE International Conference on
89. Nickel, K., Gehrig, T., Stiefelhagen, R., McDonough, J.: A joint Acoustics, Speech, and Signal Processing, 2, 1135–1138.
particle filter for audio-visual speaker tracking. In: The 7th Munich, Germany (1997)
International Conference on Multimodal Interfaces, pp. 61–68. 109. Rashidi, A., Ghassemian, H.: Extended dempster–shafer theory
Torento, Italy (2005) for multi-system/sensor decision fusion. In: Commission IV
90. Nock, H.J., Iyengar, G., Neti, C.: Assessing face and speech Joint Workshop on Challenges in Geospatial Analysis, Integra-
consistency for monologue detection in video. In: ACM Inter- tion and Visualization II, pp. 31–37. Germany (2003)
national Conference on Multimedia. French Riviera, France 110. Reddy, B.S.: Evidential reasoning for multimodal fusion in
(2002) human computer interaction (2007). MS Thesis, University of
91. Nock, H.J., Iyengar, G., Neti, C.: Speaker localisation using Waterloo, Canada
audio-visual synchrony: an empirical study. In: International 111. Ribeiro, M.I.: Kalman and extended Kalman filters: concept,
Conference on Image and Video Retrieval. Urbana, USA (2003) derivation and properties. Technical report., Institute for Sys-
92. Noulas, A.K., Krose, B.J.A.: Em detection of common origin of tems and Robotics, Lisboa (2004)
multi-modal cues. In: International Conference on Multimodal 112. Roweis, S., Ghahramani, Z.: A unifying review of linear
Interfaces, pp. 201–208. Banff (2006) gaussian models. Neural Comput. 11(2), 305–345 (1999)
93. Ortega-Garcia, J., Fierrez-Aguilar, J., Simon, D., Gonzalez, J., 113. Sanderson, C., Paliwal, K.K.: Identity verification using speech
Faundez-Zanuy, M., Espinosa, V., Satue, A., Hernaez, I., Igarza, and face information. Digit. Signal Process. 14(5), 449–480
J.J., Vivaracho, C., Escudero, D., Moro, Q.I.: Biometric on the (2004)
internet MCYT baseline corpus: a bimodal biometric database. 114. Satoh, S., Nakamura, Y., Kanade, T.: Name-It: Naming and
IEE Proc. Vis. Image Signal Process. 150(6), 395–401 (2003) detecting faces in news video. IEEE Multimed. 6(1), 22–35
94. Oshman, Y.: Optimal sensor selection strategy for discrete-time (1999)
state estimators. IEEE Trans. Aerosp. Electron. Syst. 30, 307– 115. Siegel, M., Wu, H.: Confidence fusion. In: IEEE International
314 (1994) Workshop on Robot Sensing, pp. 96–99 (2004)
95. Oviatt, S.: Ten myths of multimodal interaction. Commun. 116. Singh, R., Vatsa, M., Noore, A., Singh, S.K.: Dempster–shafer
ACM 42(11), 74–81 (1999) theory based finger print classifier fusion with update rule to
96. Oviatt, S.: Taming speech recognition errors within a multi- minimize training time. IEICE Electron. Express 3(20), 429–435
modal interface. Commun. ACM 43(9), 45–51 (2000) (2006)
97. Oviatt, S.L.: Multimodal interfaces. In: Jacko, J., Sears, A. (eds.) 117. Slaney, M., Covell, M.: Facesync: A linear operator for mea-
The Human–Computer Interaction Handbook: Fundamentals, suring synchronization of video facial images and audio tracks.
Evolving Technologies and Emerging Applications. Lawrence In: Neural Information Processing Society, vol. 13 (2000)
Erlbaum Assoc., NJ (2003) 118. Smeaton, A.F., Over, P., Kraaij, W.: High-level feature detec-
98. Pahalawatta, P., Pappas, T.N., Katsaggelos, A.K.: Optimal tion from video in TRECVid: a 5-year retrospective of
sensor selection for video-based target tracking in a wireless achievements. In: Divakaran, A. (ed.) Multimedia Content
sensor network. In: IEEE International Conference on Image Analysis, Theory and Applications, pp. 151–174. Springer,
Processing, pp. V:3073–3076. Singapore (2004) Berlin (2009)

119. Snoek, C.G.M., Worring, M.: A review on multimodal video indexing. In: IEEE International Conference on Multimedia and Expo, pp. 21–24. Lausanne, Switzerland (2002)
120. Snoek, C.G.M., Worring, M.: Multimodal video indexing: a review of the state-of-the-art. Multimed. Tools Appl. 25(1), 5–35 (2005)
121. Snoek, C.G.M., Worring, M., Smeulders, A.W.M.: Early versus late fusion in semantic video analysis. In: ACM International Conference on Multimedia, pp. 399–402. Singapore (2005)
122. Sridharan, H., Sundaram, H., Rikakis, T.: Computational models for experiences in the arts and multimedia. In: The ACM Workshop on Experiential Telepresence. Berkeley, CA (2003)
123. Stauffer, C.: Automated audio-visual activity analysis. Tech. rep. MIT-CSAIL-TR-2005-057, Massachusetts Institute of Technology, Cambridge, MA (2005)
124. Strobel, N., Spors, S., Rabenstein, R.: Joint audio–video object localization and tracking. IEEE Signal Process. Mag. 18(1), 22–31 (2001)
125. Talantzis, F., Pnevmatikakis, A., Polymenakos, L.C.: Real time audio-visual person tracking. In: IEEE 8th Workshop on Multimedia Signal Processing, pp. 243–247. IEEE Computer Society, Victoria, BC (2006)
126. Tatbul, N., Buller, M., Hoyt, R., Mullen, S., Zdonik, S.: Confidence-based data management for personal area sensor networks. In: The Workshop on Data Management for Sensor Networks (2004)
127. Tavakoli, A., Zhang, J., Son, S.H.: Group-based event detection in undersea sensor networks. In: Second International Workshop on Networked Sensing Systems. San Diego, CA (2005)
128. Teissier, P., Guerin-Dugue, A., Schwartz, J.L.: Models for audiovisual fusion in a noisy-vowel recognition task. J. VLSI Signal Process. 20, 25–44 (1998)
129. Teriyan, V.Y., Puuronen, S.: Multilevel context representation using semantic metanetwork. In: International and Interdisciplinary Conference on Modeling and Using Context, pp. 21–32. Rio de Janeiro, Brazil (1997)
130. Tesic, J., Natsev, A., Lexing, X., Smith, J.R.: Data modeling strategies for imbalanced learning in visual search. In: IEEE International Conference on Multimedia and Expo, pp. 1990–1993. Beijing (2007)
131. Town, C.: Multi-sensory and multi-modal fusion for sentient computing. Int. J. Comput. Vis. 71, 235–253 (2007)
132. Vermaak, J., Gangnet, M., Blake, A., Perez, P.: Sequential Monte Carlo fusion of sound and vision for speaker tracking. In: The 8th IEEE International Conference on Computer Vision, vol. 1, pp. 741–746. Paris, France (2001)
133. Voorhees, E.M., Gupta, N.K., Johnson-Laird, B.: Learning collection fusion strategies. In: ACM International Conference on Research and Development in Information Retrieval, pp. 172–179. Seattle, WA (1995)
134. Wall, M.E., Rechtsteiner, A., Rocha, L.M.: Singular Value Decomposition and Principal Component Analysis, Chap. 5, pp. 91–109. Kluwer, Norwell, MA (2003)
135. Wang, J., Kankanhalli, M.S.: Experience-based sampling technique for multimedia analysis. In: ACM International Conference on Multimedia, pp. 319–322. Berkeley, CA (2003)
136. Wang, J., Kankanhalli, M.S., Yan, W.Q., Jain, R.: Experiential sampling for video surveillance. In: ACM Workshop on Video Surveillance. Berkeley (2003)
137. Wang, S., Dash, M., Chia, L.T., Xu, M.: Efficient sampling of training set in large and noisy multimedia data. ACM Trans. Multimed. Comput. Commun. Appl. 3(3), 14 (2007)
138. Wang, Y., Liu, Z., Huang, J.C.: Multimedia content analysis: using both audio and visual clues. IEEE Signal Process. Mag., pp. 12–36 (2000)
139. Westerveld, T.: Image retrieval: content versus context. In: RIAO Content-Based Multimedia Information Access. Paris, France (2000)
140. Wu, H.: Sensor data fusion for context-aware computing using Dempster–Shafer theory. Ph.D. thesis, The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA (2003)
141. Wu, K., Lin, C.K., Chang, E., Smith, J.R.: Multimodal information fusion for video concept detection. In: IEEE International Conference on Image Processing, pp. 2391–2394. Singapore (2004)
142. Wu, Y., Chang, E., Tseng, B.L.: Multimodal metadata fusion using causal strength. In: ACM International Conference on Multimedia, pp. 872–881. Singapore (2005)
143. Wu, Y., Chang, E.Y., Chang, K.C.C., Smith, J.R.: Optimal multimodal fusion for multimedia data analysis. In: ACM International Conference on Multimedia, pp. 572–579. New York City, NY (2004)
144. Wu, Z., Cai, L., Meng, H.: Multi-level fusion of audio and visual features for speaker identification. In: International Conference on Advances in Biometrics, pp. 493–499 (2006)
145. Xie, L., Kennedy, L., Chang, S.F., Divakaran, A., Sun, H., Lin, C.Y.: Layered dynamic mixture model for pattern discovery in asynchronous multi-modal streams. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 1053–1056. Philadelphia, USA (2005)
146. Xiong, N., Svensson, P.: Multi-sensor management for information fusion: issues and approaches. Inf. Fusion 3(2), 163–186 (2002)
147. Xu, C., Wang, J., Lu, H., Zhang, Y.: A novel framework for semantic annotation and personalized retrieval of sports video. IEEE Trans. Multimed. 10(3), 421–436 (2008)
148. Xu, C., Zhang, Y.F., Zhu, G., Rui, Y., Lu, H., Huang, Q.: Using webcast text for semantic event detection in broadcast sports video. IEEE Trans. Multimed. 10(7), 1342–1355 (2008)
149. Xu, H., Chua, T.S.: Fusion of AV features and external information sources for event detection in team sports video. ACM Trans. Multimed. Comput. Commun. Appl. 2(1), 44–67 (2006)
150. Yan, R.: Probabilistic models for combining diverse knowledge sources in multimedia retrieval. Ph.D. thesis, Carnegie Mellon University (2006)
151. Yan, R., Yang, J., Hauptmann, A.: Learning query-class dependent weights in automatic video retrieval. In: ACM International Conference on Multimedia, pp. 548–555. New York, USA (2004)
152. Yang, M.T., Wang, S.C., Lin, Y.Y.: A multimodal fusion system for people detection and tracking. Int. J. Imaging Syst. Technol. 15, 131–142 (2005)
153. Zhao, W., Chellappa, R., Phillips, P.J., Rosenfeld, A.: Face recognition: a literature survey. ACM Comput. Surv. 35(4), 399–458 (2003)
154. Zhou, Q., Aggarwal, J.: Object tracking in an outdoor environment using fusion of features and cameras. Image Vis. Comput. 24(11), 1244–1255 (2006)
155. Zhou, Z.H.: Learning with unlabeled data and its application to image retrieval. In: The 9th Pacific Rim International Conference on Artificial Intelligence, pp. 5–10. Guilin (2006)
156. Zhu, Q., Yeh, M.C., Cheng, K.T.: Multimodal fusion using learned text concepts for image categorization. In: ACM International Conference on Multimedia, pp. 211–220. Santa Barbara (2006)
157. Zotkin, D.N., Duraiswami, R., Davis, L.S.: Joint audio-visual tracking using particle filters. EURASIP J. Appl. Signal Process. 2002(11), 1154–1164 (2002)
158. Zou, X., Bhanu, B.: Tracking humans using multimodal fusion. In: IEEE Conference on Computer Vision and Pattern Recognition, p. 4. Washington (2005)