
Minoo Soltanshahi

Abstract: Emotion recognition in conversations is a challenging Artificial
Intelligence (AI) task. In this work, I present the Multimodal EmotionLines
Dataset (MELD). MELD contains 13,708 utterances from 1,433 dialogues of the
Friends TV series, and every utterance is associated with an emotion and a
sentiment label. I describe two strong baselines which show that both contextual
and multimodal information play an important role in emotion recognition in
conversations: a graph-based dynamic fusion module that fuses multimodal context
features in a conversation [1], and a baseline with a simple early fusion
method [2]. Additionally, MELD includes multi-party conversations and is nearly
double the size of existing multimodal conversational datasets, providing a
substantial amount of data for training and testing. The baseline models
demonstrate the importance of considering both contextual and multimodal
information for emotion recognition in conversations. The MELD dataset can be
used for developing and evaluating multimodal affective dialogue systems for
enhanced grounded learning.

Introduction

Emotion Recognition in Conversations (ERC) [2] aims to detect the emotion of each utterance in a conversation. It has considerable prospects for developing empathetic machines. This report studies ERC in a multimodal setting, i.e., with acoustic and textual modalities. The dataset, called the Multimodal EmotionLines Dataset (MELD), includes not only textual dialogues but also their corresponding visual and audio counterparts. MELD also contains multi-party conversations, meaning there are multiple participants in a conversation. These are more challenging to classify, but they are more suitable for developing multimodal affective dialogue systems. Conversation in its natural form is multimodal. In dialogues, we rely on others' facial expressions, vocal tonality, language, and gestures to anticipate their stance. For emotion recognition, multimodality is particularly important. For utterances whose language is difficult to understand, we often resort to other modalities, such as prosodic and visual cues, to identify their emotions. Figure 1 presents examples from the dataset where multimodal signals, in addition to the text itself, are necessary to make correct predictions of emotion and sentiment.

The development of conversational AI thus depends on the use of both contextual and multimodal information. The publicly available datasets for multimodal emotion recognition in conversations – IEMOCAP and SEMAINE – have facilitated a significant number of research projects, but they also have limitations due to their relatively small number of total utterances and the lack of multi-party conversations.

• MELD contains multi-party conversations, which are more challenging to classify than the dyadic variants available in previous datasets.

• There are more than 13,000 utterances in MELD, which makes the dataset nearly double the size of existing multimodal conversational datasets.

• MELD provides multimodal sources and can be used in a multimodal affective dialogue system for enhanced grounded learning.
Figure 1. Importance of multimodal cues. Green shows the primary modalities responsible for sentiment and emotion.

MELD dataset

The MELD dataset has evolved from the EmotionLines dataset developed by [3]. EmotionLines contains dialogues from the popular sitcom Friends, where each dialogue contains utterances from multiple speakers [3]. EmotionLines was created by crawling the dialogues from each episode and then grouping them, based on the number of utterances in a dialogue, into four groups of [5, 9], [10, 14], [15, 19], and [20, 24] utterances, respectively. Finally, 250 dialogues were sampled randomly from each of these groups, resulting in a final dataset of 1,000 dialogues [3].

2.1 Annotation

The utterances in each dialogue were annotated with the most appropriate emotion category. For this purpose, Ekman's six universal emotions (joy, sadness, fear, anger, surprise, and disgust) were considered as annotation labels. This annotation list was extended with two additional emotion labels: neutral and non-neutral. Each utterance was annotated by five workers from the Amazon Mechanical Turk (AMT) platform. A majority voting scheme was applied to select a final emotion label for each utterance. The overall Fleiss' kappa score of this annotation process was 0.34.

2.2 Re-annotation

The utterances in the original EmotionLines dataset were annotated by looking only at the transcripts. However, due to the focus on multimodality, all the utterances were re-annotated by asking three annotators to also look at the available video clip of each utterance. Majority voting was then used to obtain the final label for each utterance.

We start the construction of the MELD corpus by extracting the starting and ending timestamps of all utterances from every dialogue in the EmotionLines dataset. To accomplish this, we crawl through the subtitles of all the episodes and heuristically extract the respective timestamps. In particular, we enforce the following constraints:

1. Timestamps of the utterances in a dialogue must be in increasing order.

2. All the utterances in a dialogue have to belong to the same episode and scene.

Timestamp alignment. There are many utterances in the subtitles that are grouped within identical timestamps in the subtitle files. In order to find the accurate timestamp for each utterance, we use a transcription alignment tool such as Gentle, which automatically aligns a transcript with the audio by extracting word-level timestamps (see Table 1). In Table 2, we show the final format of the MELD dataset.

We use seven emotions for the annotation, i.e., anger, disgust, fear, joy, neutral, sadness, and surprise. We present the emotion distribution in the training, development, and test sets in Table 3. We have also converted these fine-grained emotion labels into more coarse-grained sentiment classes by considering anger, disgust, fear, and sadness as negative, joy as positive, and neutral as the neutral sentiment-bearing class.

3 Resampling

Class Weights. One of the techniques tried is using class weights computed with the sklearn library. This method computes class weights that are then applied to the loss function during training in order to reduce the prediction skew of the model towards the most populated label in the dataset. The objective of this technique is to magnify the effect of predictions for the less populated classes and reduce the weight of the predictions for the most populated ones. The weights computed with the balanced mode offered by sklearn reflect the frequency of the class they are associated with: the less populated classes get a bigger weight on the loss function while, on the other hand, the most frequent classes generally get smaller weights (possibly smaller than one). The class distribution of MELD's training set, from which these weights are computed, is reported in Table 3. It is worth noting that this method improves the recall of the less populated classes but also reduces the precision of their predictions, and the recall of the other classes gets worse [4]. The overall F1-scores (weighted and macro) are either unchanged or worse.
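For illustration, a minimal sketch of this balanced class-weight computation with sklearn; the function and variable names are assumptions, not the exact code used in the report.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency (sklearn 'balanced' mode)."""
    classes = np.unique(labels)
    weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
    return dict(zip(classes, weights))  # rare classes (e.g. fear, disgust) get weights > 1

# Example: the returned dict can be passed as class_weight=... to Keras model.fit(...)
# so that the loss penalizes mistakes on under-represented emotions more heavily.
```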
Utterance                                  Season  Episode  Incorrect Split (Start – End)   Corrected Split (Start – End)
Chris says they're closing down the bar.   3       6        00:05:57,023 – 00:05:59,691     00:05:57,023 – 00:05:58,734
No way!                                    3       6        00:05:57,023 – 00:05:59,691     00:05:58,734 – 00:05:59,691

Table 1. Example of timestamp alignment using the Gentle alignment tool.
Utterance                                               Speaker  Emotion   D_ID  U_ID  Season  Episode  StartTime     EndTime
But then who? The waitress I went out with last month   Joey     surprise  1     0     9       23       00:36:40,364  00:36:42,824
You know? Forget it!                                    Rachel   sadness   1     1     9       23       00:36:44,368  00:36:46,578

Table 2. MELD dataset format for a dialogue. Notation: D_ID = dialogue ID, U_ID = utterance ID. StartTime and EndTime are in hh:mm:ss,ms format.

Category   Train  Dev  Test
anger      1109   153  345
disgust     271    22   68
fear        268    40   50
joy        1743   163  402
neutral    4710   470  1256
sadness     683   111  208
surprise   1205   150  281
negative   2945   406  833
neutral    4710   470  1256
positive   2334   233  521

Table 3. Emotion (top) and sentiment (bottom) distribution in MELD.
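As a quick sanity check, the per-class counts in Table 3 can be recomputed from the released split files. A minimal pandas sketch; the file name and the Emotion/Sentiment column names are assumptions about the CSV layout and may need adjusting.

```python
import pandas as pd

# Assumed file and column names for the MELD training split.
train = pd.read_csv("train_sent_emo.csv")
print(train["Emotion"].value_counts())    # anger, disgust, fear, joy, neutral, sadness, surprise
print(train["Sentiment"].value_counts())  # negative, neutral, positive
```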
Resample. The second technique applied is resampling. This strategy can be performed by upsampling the least populated classes, injecting copies of samples already present in the training split multiple times, or by downsampling the most populated classes by removing examples. This method has to be used with caution because it can reduce the generalization capability of the model, increasing the chance of overfitting due to the reduced diversity of the training examples. Given MELD's class distribution (see Table 3), it is desirable to increase the number of utterances with the fear, disgust, and sadness labels. It would also be possible to undersample the neutral class by removing samples labeled as such, but considering the limited amount of available data this was undesirable [4]. A further test could be to combine the two techniques and compute the weights on the upsampled dataset: we would still benefit from the effects of upsampling, and the computed weights would be more uniform, imposing a less severe penalty on the neutral class.

4 Feature Extraction

In this section we discuss the method of feature extraction for two modalities: text and audio. We followed [9] for this purpose.

4.1 Textual Feature Extraction

We can use a deep Convolutional Neural Network (CNN) [5] to extract textual features from the transcripts of the videos for emotion recognition. Each utterance in the text was represented as an array of pre-trained 300-dimensional GloVe vectors [6], which capture the semantic meaning of words. The utterances were truncated or padded with null vectors to have exactly 50 words, which is the maximum length allowed by the CNN architecture.

The textual features were extracted by passing the array


of vectors representing each utterance through two
different convolutional layers in the CNN. The first
convolutional layer had two filters of size 3 and 4,
respectively, with 50 feature maps each. The second
convolutional layer had a filter of size 2 with 100 feature
maps. Each convolutional layer was followed by a max-
pooling layer with window 2 x 2, which reduces the
dimensionality of the extracted features. The output of the
second max-pooling layer was fed to a fully-connected
layer with 500 neurons and a rectified linear unit
(ReLU) [7] activation function, which introduces non-
linearity and helps the network learn complex
relationships between the input and output. The output of
the ReLU layer was passed through a softmax output
layer, which normalized the output probabilities to
represent the predicted emotion categories.

The output of the penultimate fully-connected layer,


which had 500 neurons, was used as the textual feature.
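A rough Keras sketch of this CNN feature extractor (filter sizes and layer widths follow the description above; padding choices, names, and the GloVe input pipeline are assumptions rather than the exact implementation):

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Concatenate, Flatten, Dense

def build_text_cnn(seq_len=50, emb_dim=300, n_classes=7):
    inp = Input(shape=(seq_len, emb_dim))                 # 50 GloVe vectors per utterance
    # First convolutional layer: two filters of size 3 and 4, 50 feature maps each
    c3 = Conv1D(50, 3, activation="relu", padding="same")(inp)
    c4 = Conv1D(50, 4, activation="relu", padding="same")(inp)
    x = MaxPooling1D(2)(Concatenate()([c3, c4]))
    # Second convolutional layer: filter of size 2 with 100 feature maps
    x = Conv1D(100, 2, activation="relu", padding="same")(x)
    x = MaxPooling1D(2)(x)
    feat = Dense(500, activation="relu", name="utterance_feature")(Flatten()(x))
    out = Dense(n_classes, activation="softmax")(feat)    # emotion probabilities
    return Model(inp, out)

model = build_text_cnn()
# After training, the 500-dimensional penultimate activations serve as textual features:
feature_extractor = Model(model.input, model.get_layer("utterance_feature").output)
```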

The translation of the convolutional filter over the input array of vectors allows the CNN to learn abstract features from the input, and with each subsequent layer the context of the learned features expands further. Overall, this method effectively captures the contextual and semantic meaning of the text and has been shown to be highly effective for emotion recognition tasks. However, it requires significant computational resources and training data to achieve high accuracy.

The choice of a CNN for feature extraction is justified by the following considerations:

1. The convolution layers of a CNN can be seen as a feature extractor, whose output is fed into a rather simplistic classifier that is useful for training the network but not the best at actual classification. The CNN forms local features for each word and combines them to produce a global feature vector for the whole text. However, the features that the CNN builds internally can be extracted and used as input for another, more advanced classifier. This turns the CNN, originally a supervised classifier, into a trainable feature extractor.

2. As a feature extractor, the CNN is automatic and does not rely on handcrafted features. In particular, it adapts well to the peculiarities of the specific dataset, in a supervised manner.

3. The features it produces are based on a hierarchy of local features, which reflects the context well.

4.2 Audio Feature Extraction

Audio features are extracted at a 30 Hz frame rate with a sliding window of 100 ms. To compute the features, we use openSMILE [8], an open-source toolkit that automatically extracts audio features such as pitch and voice intensity. Voice normalization is performed and voice intensity is thresholded to identify samples with and without voice; Z-standardization is used to perform the voice normalization. The features extracted by openSMILE consist of several low-level descriptors (LLDs), e.g., MFCCs, voice intensity, and pitch, and their statistics, e.g., mean, root quadratic mean, etc. Specifically, we use the IS13-ComParE configuration file in openSMILE. Taking into account all functionals of each LLD, we obtain 6,373 features.

4.3 Context-Dependent Feature Extraction

Utterances in the videos are semantically dependent on each other. In other words, the complete meaning of an utterance may be determined by taking preceding utterances into consideration. We call this the context of an utterance. Following [9], we use an RNN, specifically a GRU, to model the semantic dependency among the utterances in a video. Let the following represent the unimodal features:

f_A ∈ R^(N×d_A)  (acoustic features)
f_T ∈ R^(N×d_T)  (textual features)

where N is the maximum number of utterances in a video. We pad shorter videos with dummy utterances represented by null vectors of corresponding length. For each modality m ∈ {A, T}, we feed the unimodal utterance features f_m of a video to GRU_m with output size D_m, which is defined as:

z_m = σ(f_mt U^mz + s_m(t−1) W^mz),
r_m = σ(f_mt U^mr + s_m(t−1) W^mr),
h_mt = tanh(f_mt U^mh + (s_m(t−1) ∗ r_m) W^mh),
F_mt = tanh(h_mt U^mx + u^mx),
s_mt = (1 − z_m) ∗ F_mt + z_m ∗ s_m(t−1),

where U^mz ∈ R^(d_m×D_m), W^mz ∈ R^(D_m×D_m), U^mr ∈ R^(d_m×D_m), W^mr ∈ R^(D_m×D_m), U^mh ∈ R^(d_m×D_m), W^mh ∈ R^(D_m×D_m), U^mx ∈ R^(d_m×D_m), u^mx ∈ R^(D_m), z_m ∈ R^(D_m), r_m ∈ R^(D_m), h_mt ∈ R^(D_m), F_mt ∈ R^(D_m), and s_mt ∈ R^(D_m). This yields the hidden outputs F_mt as context-aware unimodal features for each modality. Hence, we define F_m = GRU_m(f_m), where F_m ∈ R^(N×D_m). Thus, the context-aware unimodal features are:

F_A = GRU_A(f_A),   F_T = GRU_T(f_T).

5 Fusion

We then fuse F_A and F_T into a multimodal feature space. To obtain the fused representation of the modalities, F_AT, we simply concatenate F_A and F_T [9]. The main reason for choosing concatenation-based fusion is that it is very simple to implement, yet it is an effective fusion approach. Finally, F_AT is fed to a contextual GRU, i.e., GRU_AT, which incorporates the contextual information contributed by the utterances.

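A compact Keras sketch of the context GRUs and the concatenation fusion of Sections 4.3 and 5; the dimensions, names, and handling of padded utterances are illustrative assumptions, not the exact implementation.

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import GRU, Concatenate, TimeDistributed, Dense

N, d_A, d_T, D = 33, 6373, 500, 300     # max utterances, audio dim, text dim, GRU size (assumed)

f_A = Input(shape=(N, d_A))             # unimodal audio features of one dialogue/video
f_T = Input(shape=(N, d_T))             # unimodal textual features

F_A = GRU(D, return_sequences=True)(f_A)   # context-aware acoustic features, F_A = GRU_A(f_A)
F_T = GRU(D, return_sequences=True)(f_T)   # context-aware textual features,  F_T = GRU_T(f_T)

F_AT = Concatenate()([F_A, F_T])           # simple concatenation-based fusion
H = GRU(D, return_sequences=True)(F_AT)    # contextual GRU_AT over the fused features
out = TimeDistributed(Dense(7, activation="softmax"))(H)   # per-utterance emotion distribution

model = Model([f_A, f_T], out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
# In practice, padded (dummy) utterances would be masked or down-weighted in the loss.
```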
6 Training

At a high level, the baseline (bcLSTM) training script is organized as follows:

1. The code imports the necessary libraries such as `argparse`, `keras`, `sklearn`, `numpy`, and `os`.

2. The `bc_LSTM` class is defined. This class is responsible for loading the data, creating and training the model, and evaluating its performance.

3. The `__init__` function initializes class variables such as `classification_mode`, `modality`, `PATH`, and `OUTPUT_PATH` based on the input arguments passed through the command line.

4. The `load_data` function is responsible for loading the data for the specified modality and classification mode. It loads the training, validation, and test sets along with their corresponding labels and masks. It also loads the sequence length and the number of classes for the dataset.

5. The `calc_test_result` function calculates the test performance of the model by taking the predicted labels and the ground-truth labels along with the mask as input. It then prints the confusion matrix, the classification report, and the weighted F-score.

6. The `get_audio_model` function creates a model for the audio modality using Keras. It uses two layers of bidirectional LSTMs followed by a dense layer with softmax activation (see the sketch after this list).

7. The `get_text_model` function creates a model for the text modality using Keras. It uses an embedding layer followed by three convolutional layers with different filter sizes and max-pooling layers. Finally, a dense layer with softmax activation is added.

8. The code checks the modality and calls the corresponding function to create the model.

9. The `train` function trains the model using the specified hyperparameters such as the number of epochs and the batch size. It uses the Adam optimizer and the categorical cross-entropy loss function.

10. The `save_model` function saves the weights of the trained model to a file in the specified path.

11. The `predict` function predicts the labels for the test set using the trained model and the corresponding mask.

12. Finally, the `evaluate` function evaluates the performance of the model on the test set by calling the `calc_test_result` function and saves the predicted labels along with their corresponding IDs to a file in the specified path.
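For illustration, a minimal Keras sketch in the spirit of `get_audio_model` as described in item 6 (utterance-level audio features per dialogue, two bidirectional LSTM layers, per-utterance softmax); the layer sizes, masking choice, and names are assumptions:

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Masking, Bidirectional, LSTM, Dropout, TimeDistributed, Dense

def build_audio_model(seq_len, feat_dim, n_classes, units=300, dropout=0.4):
    inp = Input(shape=(seq_len, feat_dim))        # one dialogue = a sequence of utterance features
    x = Masking(mask_value=0.0)(inp)              # ignore padded (dummy) utterances
    x = Bidirectional(LSTM(units, return_sequences=True, dropout=dropout))(x)
    x = Bidirectional(LSTM(units, return_sequences=True, dropout=dropout))(x)
    x = Dropout(dropout)(x)
    out = TimeDistributed(Dense(n_classes, activation="softmax"))(x)  # per-utterance emotion
    model = Model(inp, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```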
The training of this network is performed using categorical cross-entropy on each utterance's softmax output per dialogue. Categorical cross-entropy is a commonly used loss function in deep learning for multiclass classification problems: it measures the difference between the predicted probability distribution and the true probability distribution.

In the context of dialogue systems, the softmax output refers to the probability distribution over the possible labels for a given input utterance. Training the network involves minimizing the categorical cross-entropy loss between the predicted softmax output and the true label for each utterance in a dialogue.

This approach optimizes the parameters of the model to make accurate predictions for a given input utterance. By minimizing the cross-entropy loss, the model learns to assign high probabilities to the correct label and low probabilities to incorrect ones, which results in a model that can accurately predict the label of each utterance in a dialogue.

As a regularization method, dropout between the GRU cell and the dense layer is introduced to avoid overfitting. We used Adam [10] as the optimizer, tuned the hyperparameters on the development data, and applied early stopping with patience 10 during training.

bcLSTM is a strong baseline proposed by [9], which represents context using a bi-directional RNN. It follows a two-step hierarchical process that models uni-modal context first and then bi-modal context features. For unimodal text, a CNN-LSTM model extracts contextual representations for each utterance, taking the GloVe embeddings as input. For unimodal audio, an LSTM model obtains audio representations for each audio utterance feature vector. Finally, the contextual representations from the unimodal variants are supplied to the bimodal model for classification. bcLSTM does not distinguish among different speakers and models a conversation as a single sequence.

7 Results

We provide results for the two tasks of sentiment and emotion classification on MELD. Table 4 shows the performance of sentiment classification using bcLSTM (66.68% F-score). Figure 2 shows the performance of each modality for each emotion class.

However, the improvement due to fusion is only 0.3% over the textual modality, which suggests the need for a better fusion mechanism. The textual modality outperforms the audio modality by more than 17%, which indicates the importance of spoken language in sentiment analysis. For the positive sentiment category, the audio modality performs poorly. The performance on the emotion classes disgust, fear, and sadness is particularly poor. The primary reason for this is the inherent imbalance in the dataset, which has fewer training instances for these emotion classes (see Table 3). We partially tackle this by using class weights as hyper-parameters. Multimodal fusion improves emotion recognition performance by 3%. However, the multimodal classifier performs worse than the textual classifier in classifying sadness.

Figure 2. Performance of each modality on each emotion class (F1-score).

                    anger  disgust  fear  joy    neutral  sadness  surprise  w-avg.
bcLSTM  Text        42.06  21.69    7.75  54.31  71.63    26.92    48.15     56.44
        Audio       25.85   6.06    2.90  15.74  61.86    14.71    19.34     39.08
        Text+Audio  43.39  23.66    9.38  54.48  76.67    24.34    51.04     59.25

Table 4: Test-set weighted F-score results of bcLSTM for emotion classification on MELD. Note: w-avg denotes the weighted average.
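The w-avg column is the per-class F1 averaged with class-frequency weights, which is what `calc_test_result` reports. A minimal sketch of computing it with sklearn (the argument names are placeholders):

```python
from sklearn.metrics import classification_report, f1_score

def report_scores(y_true, y_pred):
    """Per-class precision/recall/F1 plus the weighted-average F1 (the w-avg column)."""
    print(classification_report(y_true, y_pred, digits=4))
    print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
```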

Modality Encoder

This section describes the graph-based baseline, the multimodal dynamic fusion network [1], which first encodes each modality separately. For the textual modality, a bi-directional gated recurrent unit (BiGRU) is used; for the acoustic and visual modalities, we apply a fully connected network. The context embedding is computed by combining these embeddings with trainable parameters. To account for the impact of different speakers in a conversation, a shared-parameter BiGRU is used to encode the contextual information coming from multiple speakers: it is applied to each speaker's utterances in the conversation, and the resulting hidden states are used as speaker embeddings.

8 Graph-based Dynamic Fusion Modules

We build an undirected graph to represent a conversation [11], denoted as G = (V, E). V refers to a set of nodes: each utterance is represented by three nodes to differentiate the acoustic, visual, and textual modalities, so given N utterances there are 3N nodes in the graph. We add both the context embedding and the speaker embedding to initialize the embedding of each node:

x_δ = c_δ + γ_δ s_δ,   δ ∈ {a, v, t},

where γ_a, γ_v, γ_t are trade-off hyper-parameters. E refers to a set of edges, which are built based on two rules. The first rule is that any two nodes of the same modality in the same conversation are connected. The second rule is that each node is connected with the nodes corresponding to the same utterance but from the other modalities. Edge weights are computed as A_ij = 1 − arccos(sim(x_i, x_j)) / π, where sim(·) is the cosine similarity function [12].

8.1 Dynamic Fusion Module

Based on the graph, we improve [13] with gating mechanisms to fuse multimodal context features in the conversation. We utilize a graph convolution operation to aggregate context information of both inter- and intra-modality in a specific semantic space at each layer. Meanwhile, we leverage gating mechanisms to learn intrinsic sequential patterns of contextual information in different semantic spaces [14]. The updating process using gating mechanisms is defined as:

Γ_ε^(k) = σ(W_ε · [g^(k−1), H_t^(k−1)] + b_ε),   ε ∈ {u, f, o},
C̃^(k) = tanh(W_c · [g^(k−1), H_t^(k−1)] + b_c),
C^(k) = Γ_f^(k) ∗ C^(k−1) + Γ_u^(k) ∗ C̃^(k),
g^(k) = Γ_o^(k) ∗ tanh(C^(k)),

where Γ_u^(k), Γ_f^(k), Γ_o^(k) refer to the update gate, the forget gate, and the output gate in the k-th layer, respectively; g^(0) is initialized with zero; W_ε, b_ε, W_c, b_c are learnable parameters; and σ(·) is the sigmoid function. C̃^(k) stores contextual information of the previous layers. The update gate Γ_u^(k) controls what part of the contextual information is written to the memory, while the forget gate Γ_f^(k) decides what redundant information in C^(k) is deleted. The output gate Γ_o^(k) reads selectively for passing into the graph convolution operation. The modified convolution operation can be defined as:

H^(k) = ReLU( ((1 − α) P̃ H^(k−1) + α H^(0)) ((1 − β_{k−1}) I_n + β_{k−1} W^(k−1)) ),

where P̃ = D̃^(−1/2) Ã D̃^(−1/2) is the graph convolution matrix with the renormalization trick, α and β_k are hyperparameters with β_k = log(ρ/k + 1), and ρ is also a hyperparameter. W^(k) is the weight matrix, H^(0) is initialized with X_a, X_v, X_t, and I_n is an identity mapping matrix. The output of the k-th layer is then computed as H_t^(k) = H^(k) + g^(k).

After the stack of K layers, the representations of the three modalities for each utterance i are refined as o_a, o_v, o_t. Finally, a classifier is used to predict the emotion of each utterance:

ŷ_i = Softmax(W_z [x_a; x_v; x_t; o_a; o_v; o_t] + b_z),

where W_z and b_z are trainable parameters. We apply cross-entropy loss along with L2 regularization to train the model.
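As a concrete illustration of the edge-weight rule A_ij = 1 − arccos(sim(x_i, x_j))/π, a small NumPy sketch (function and variable names are assumptions):

```python
import numpy as np

def edge_weight(x_i, x_j, eps=1e-8):
    # cosine similarity between two node embeddings
    sim = np.dot(x_i, x_j) / (np.linalg.norm(x_i) * np.linalg.norm(x_j) + eps)
    sim = np.clip(sim, -1.0, 1.0)           # numerical safety before arccos
    return 1.0 - np.arccos(sim) / np.pi     # angular similarity mapped into [0, 1]
```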

MM_GDF Training

The GCNLayer1 class is defined with the initialization of a linear layer with input and output features, a boolean variable indicating whether to use a topic or not, and a boolean variable indicating whether to create a new graph or not. The forward method takes inputs, dia_len, and topicLabel as arguments and calls the message-passing methods based on the new_graph variable. The message_passing_wo_speaker method computes the adjacency matrix for the graph and performs message passing. The message_passing_directed_speaker method computes the adjacency matrix for the graph based on directed speakers and performs message passing. The cossim method computes the cosine similarity between two vectors, and the atom_calculate_edge_weight method computes the edge weight between two nodes based on this cosine similarity.

The message_passing_wo_speaker method computes the adjacency matrix by initializing it with the identity matrix, then loops through each conversation and its corresponding utterances, computes the edge weight between each utterance pair based on their cosine similarity, and updates the adjacency matrix. If the use_topic variable is True, the adjacency matrix is also updated with the edge weight between each utterance and its corresponding topic.

The message_passing_directed_speaker method computes the adjacency matrix by initializing it with the identity matrix, then loops through each conversation and its corresponding utterances. For each conversation, it computes the edge weight between each pair of utterances of the same speaker and updates the adjacency matrix. If use_utterance_edge is True, it also computes the edge weight between each pair of consecutive utterances and updates the adjacency matrix. Finally, if the use_topic variable is True, it computes the edge weight between each utterance and its corresponding topic and updates the adjacency matrix.

In both methods, the adjacency matrix is normalized using the degree matrix and passed to the linear layer for message passing.
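A small NumPy sketch of that degree-based (symmetric) normalization, Â = D̃^(−1/2) Ã D̃^(−1/2) with self-loops added; the names are assumptions, not the repository's code:

```python
import numpy as np

def normalize_adjacency(A):
    A_tilde = A + np.eye(A.shape[0])             # renormalization trick: add self-loops
    d = A_tilde.sum(axis=1)                      # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetric normalized adjacency
```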
The input audio data is first segmented into dialogues, and then, for each dialogue, a graph is constructed using message-passing relations between audio segments. The message_passing_relation_graph() function constructs the adjacency matrix of this graph, with each audio segment corresponding to a node and the edges between nodes determined by a window of audio segments surrounding each node. The adjacency matrix is then normalized using a symmetric normalized Laplacian matrix.

The GCN model is defined in the GCNII_lyc class. It consists of a stack of GraphConvolution layers and fully connected layers. Each GraphConvolution layer is followed by a non-linear activation function and dropout regularization. The final layer is a softmax layer for classification. If use_residue is set to True, the output of the GCN layers is concatenated with the input audio features before passing through the final fully connected layer.

In addition to the GCN layers, the model also includes an LSTM layer which takes as input the output of the GCN layers and performs sequential reasoning over the dialogue. This is controlled by the reason_flag flag: if it is set to True, the output of the GCN layers is passed through an LSTM layer before being concatenated with the input audio features.

During training, the model is optimized using the negative log-likelihood loss. The forward() function of the GCNII_lyc class takes the input audio features, dialogue length, topic label, and adjacency matrix as input, and returns the model output. If the return_feature flag is set to True, the output of each layer of the GCN is returned as well. If the test_label flag is set to True, the output of each layer of the GCN is saved to disk.

9 Experimental Results and Analysis

Comparison with Early Fusion. The graph-based dynamic fusion (GDF) module outperforms the early fusion module, since the graph-based fusion sufficiently captures intra- and inter-modality interactions in conversations, which provides complementarity between modalities. GDF achieves better performance by reducing redundancy and promoting complementarity between modalities, which shows the superiority of this multimodal fusion. Results are shown in Table 5.

Methods      Neutral  Surprise  Sadness  Happy  Anger  Acc    w-F1
bc-LSTM [9]  75.66    48.57     22.06    52.10  44.39  59.62  57.29
MM-DFN       77.76    50.69     22.93    54.78  47.82  62.49  59.46

Table 5: Results on MELD under the multimodal setting (A+V+T). We report the F1 score per class, except for two classes (Fear and Disgust), whose results are not statistically significant due to the smaller number of training samples.

Comparison

Early fusion and graph-based dynamic fusion are two approaches to combining multi-modal features for tasks such as audio-visual speech recognition, emotion recognition, and audio-visual event detection.

Early fusion combines the multi-modal features at the input level by concatenating the features from different modalities into a single feature vector or tensor. The concatenated features are then fed into a neural network for further processing. Early fusion is simple and efficient, but it may not capture the complex interplay between the modalities and may not be suitable for tasks where the modalities have different temporal or spatial resolutions.

Graph-based dynamic fusion, on the other hand, constructs a graph to model the relationships between the multi-modal features. The nodes of the graph represent the features from different modalities, and the edges represent the connections or interactions between them. The graph can be learned dynamically based on the input data or predefined based on prior knowledge of the task. The features are then propagated through the graph using graph convolutional networks (GCNs), which can capture the complex relationships between the modalities and model the temporal and spatial dynamics of the features.

Graph-based dynamic fusion has shown promising results in tasks such as audio-visual speech recognition and emotion recognition, where the modalities have different temporal or spatial resolutions and the relationships between them are complex and dynamic. However, it can be computationally expensive and requires careful design of the graph structure and the GCN architecture.

Overall, the choice of fusion method depends on the specific task, dataset, and available resources. Early fusion is simple and efficient but may not capture the complex interplay between the modalities, while graph-based dynamic fusion can capture the complex relationships and dynamics between the modalities but can be computationally expensive.

10 Conclusion

In the field of multimodal Emotion Recognition in Conversation (ERC), the fusion method is an important aspect to consider. This is because ERC tasks involve analyzing multiple modalities, such as text, audio, and visual cues, in order to accurately recognize and interpret emotions in conversations.

There are different methods for fusing these modalities together, such as early fusion, which combines the modalities at an early stage of processing by simply concatenating them. While these methods are easy to implement, they may not provide a comprehensive understanding of the conversational context, which can be crucial for accurately recognizing emotions.

On the other hand, fusion methods like Tensor Fusion [14] or graph-based dynamic fusion modules can enable a more nuanced understanding of the conversational context, as they allow more complex interactions between the modalities to be modeled. For example, Tensor Fusion can capture the interdependence between modalities by learning a shared representation space for them, while graph-based dynamic fusion modules can model the interactions between modalities as a graph.

In addition to the importance of fusion methods, this report highlights the potential of the MELD dataset in aiding conversation understanding research. MELD is a multimodal dataset that contains more than 13,000 utterances from multi-party TV-show dialogues, with annotations for emotions, sentiment, and other conversational features. This dataset can be useful for training and evaluating models for multimodal ERC tasks, as well as for other tasks related to conversation understanding.

Overall, this report emphasizes the importance of considering fusion methods in multimodal ERC tasks and highlights the potential of the MELD dataset for conversation understanding research.

References

[1] Dou Hu, Xiaolong Hou, Lingwei Wei, Lianxin Jiang, and Yang Mo. 2022. MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations.

[2] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2018. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations.

[3] Sheng-Yeh Chen, Chao-Chun Hsu, Chuan-Chun Kuo, Lun-Wei Ku, et al. 2018. EmotionLines: An emotion corpus of multi-party conversations. arXiv preprint arXiv:1802.08379.

[4] Luca Bolognini, Paolo Torroni, and Eleonora Mancini. 2022. Emotion Recognition for Human-Centered Conversational Agents.

[5] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732.

[6] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.

[7] Yee Whye Teh and Geoffrey E. Hinton. 2001. Rate-coded restricted Boltzmann machines for face recognition. In Advances in Neural Information Processing Systems, volume 13, pages 908–914.

[8] Florian Eyben, Martin Wöllmer, and Björn Schuller. 2010. openSMILE: the Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, pages 1459–1462. ACM.

[9] Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. 2017. Context-dependent sentiment analysis in user-generated videos. In ACL, pages 873–883.

[10] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[11] Jingwen Hu, Yuchen Liu, Jinming Zhao, et al. 2021. MMGCN: Multimodal fusion via deep graph convolution network for emotion recognition in conversation. In ACL/IJCNLP, pages 5666–5675.

[12] Konstantinos Skianis, Fragkiskos D. Malliaros, and Michalis Vazirgiannis. 2018. Fusing document, collection and label graph-based representations with word embeddings for text classification. In TextGraphs@NAACL-HLT, pages 49–58.

[13] Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.

[14] Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor Fusion Network for Multimodal Sentiment Analysis.

