The translation of the convolutional filter over the input array of vectors allows the CNN to learn abstract features from the input, and with each subsequent layer the context of the learned features expands further. Overall, this method effectively captures the contextual and semantic meaning of the text and has been shown to be highly effective for emotion recognition tasks. However, it requires significant computational resources and training data to achieve high accuracy.

The choice of CNN for feature extraction is justified by the following considerations:

1. The convolution layers of CNN can be seen as a set of learnable filters that detect local patterns (n-grams) in the utterance.

3. The features it gives are based on a hierarchy of local features, reflecting well the context.

The classification of an utterance may be determined by taking preceding utterances into consideration. We call this the context of an utterance. Following [9], we use an RNN, specifically a GRU, to model the semantic dependency among the utterances in a video. Let the following items represent the unimodal features:

f_A ∈ R^(N×d_A) (acoustic features)
f_T ∈ R^(N×d_T) (textual features)

where N is the maximum number of utterances in a video. We pad shorter videos with dummy utterances represented by null vectors of corresponding length. For each modality m, the GRU parameters and states are W_mh ∈ R^(D_m×D_m), U_mx ∈ R^(d_m×D_m), u_mx ∈ R^(D_m), and z_m, r_m, h_mt ∈ R^(D_m), where F_mt ∈ R^(d_m) is the feature vector of utterance t for modality m.
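The contextual modelling just described can be illustrated with a minimal Keras sketch: a dialogue is a padded sequence of N utterance-level feature vectors, and a GRU turns each of them into a context-dependent representation. Names, sizes, and the masking setup below are assumptions for illustration, not the authors' code.

```python
# Minimal sketch: a GRU over padded utterance features (assumed sizes/names).
import numpy as np
from keras.layers import Input, Masking, GRU
from keras.models import Model

N, d_m, D_m = 33, 100, 300              # max utterances, input dim, hidden dim (assumed)

inp = Input(shape=(N, d_m))              # f_m: one feature vector per utterance
x = Masking(mask_value=0.0)(inp)         # padded (null-vector) utterances are skipped
h = GRU(D_m, return_sequences=True)(x)   # h_{m,t}: contextual utterance representations

context_model = Model(inp, h)
dummy = np.random.rand(1, N, d_m).astype("float32")
print(context_model.predict(dummy).shape)   # (1, N, D_m)
```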
The implementation of this baseline (the `bc_LSTM` training script) can be summarized as follows:

1. The code imports the necessary libraries such as `argparse`, `keras`, `sklearn`, `numpy`, and `os`.

2. The `bc_LSTM` class is defined. This class is responsible for loading the data, creating and training the model, and evaluating its performance.

3. The `__init__` function initializes class variables such as `classification_mode`, `modality`, `PATH`, and `OUTPUT_PATH` based on the arguments passed through the command line.

4. The `load_data` function loads the data for the specified modality and classification mode. It loads the training, validation, and test sets along with their corresponding labels and masks, as well as the sequence length and the number of classes for the dataset.

5. The `calc_test_result` function calculates the test performance of the model, taking the predicted labels, the ground-truth labels, and the mask as input. It then prints the confusion matrix, the classification report, and the weighted F-score.

6. The `get_audio_model` function creates a model for the audio modality using Keras. It uses two layers of bidirectional LSTMs followed by a dense layer with softmax activation (a sketch of this model follows this section).

7. The `get_text_model` function creates a model for the text modality using Keras. It uses an embedding layer followed by three convolutional layers with different filter sizes and max-pooling layers. Finally, a dense layer with softmax activation is added.

8. The code checks the modality and calls the corresponding function to create the model.

9. The `train` function trains the model using the specified hyperparameters such as epochs and batch size. It uses the Adam optimizer and the categorical cross-entropy loss function.

10. The `save_model` function saves the weights of the trained model to a file in the specified path.

11. The `predict` function predicts the labels for the test set using the trained model and the corresponding mask.

12. Finally, the `evaluate` function evaluates the performance of the model on the test set by calling the `calc_test_result` function and saves the predicted labels along with their corresponding IDs to a file in the specified path.

The training of this network is performed using categorical cross-entropy on each utterance's softmax output per dialogue. Categorical cross-entropy is a commonly used loss function in deep learning for multiclass classification problems; it measures the difference between the predicted probability distribution and the true probability distribution.

Here, the softmax output is the probability distribution over the possible emotion classes for a given input utterance. Training minimizes the categorical cross-entropy loss between the predicted softmax output and the true label of each utterance in a dialogue. This optimizes the parameters of the model so that it assigns a high probability to the correct class and low probabilities to the incorrect ones, yielding a model that can accurately classify each utterance in a dialogue.

As a regularization method, dropout between the GRU cell and the dense layer is introduced to avoid overfitting. We used Adam [10] as the optimizer and the development data to tune the hyperparameters. Early stopping with patience 10 was used in the training.
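The audio model and training setup described above might look roughly like the following Keras sketch. Layer sizes, variable names, and data shapes are assumptions rather than the authors' exact code; the per-utterance mask is what keeps the padded dummy utterances out of the loss.

```python
# Hypothetical sketch of get_audio_model and the masked training setup
# (sizes and names are assumptions, not the original script).
from keras.layers import Input, Masking, Bidirectional, LSTM, Dropout, TimeDistributed, Dense
from keras.models import Model
from keras.callbacks import EarlyStopping

def get_audio_model(max_utts, audio_dim, num_classes):
    inp = Input(shape=(max_utts, audio_dim))          # (N utterances, d_A features)
    x = Masking(mask_value=0.0)(inp)                  # skip padded dummy utterances
    x = Bidirectional(LSTM(300, return_sequences=True, dropout=0.4))(x)
    x = Bidirectional(LSTM(300, return_sequences=True, dropout=0.4))(x)
    x = Dropout(0.5)(x)                               # dropout before the dense layer
    out = TimeDistributed(Dense(num_classes, activation='softmax'))(x)
    model = Model(inp, out)
    # 'temporal' lets a per-utterance 0/1 mask weight the cross-entropy loss
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  sample_weight_mode='temporal')
    return model

# Possible usage with early stopping (patience 10) on the development set:
# model = get_audio_model(max_utts, audio_dim, num_classes)
# model.fit(train_x, train_y, sample_weight=train_mask,
#           validation_data=(val_x, val_y, val_mask),
#           epochs=100, batch_size=32,
#           callbacks=[EarlyStopping(patience=10, restore_best_weights=True)])
```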
bcLSTM is a strong baseline proposed by [9], which represents context using a bi-directional RNN. It follows a two-step hierarchical process that models uni-modal context first and then bi-modal context features. For unimodal text, a CNN-LSTM model extracts contextual representations for each utterance, taking the GloVe embeddings as input. For unimodal audio, an LSTM model produces audio representations for each audio utterance feature vector. Finally, the contextual representations from the unimodal variants are supplied to the bimodal model for classification. bcLSTM does not distinguish among different speakers and models a conversation as a single sequence. A sketch of the bimodal stage is given below.
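The second, bimodal stage is essentially another contextual recurrent model over the concatenated unimodal features. The Keras sketch below is an assumption-laden illustration; layer sizes, names, and shapes are not taken from the released code.

```python
# Hypothetical sketch of the bimodal stage: contextual text and audio features
# are concatenated per utterance and passed through a bidirectional LSTM.
from keras.layers import Input, Masking, Bidirectional, LSTM, Dropout, TimeDistributed, Dense, Concatenate
from keras.models import Model

def get_bimodal_model(max_utts, text_dim, audio_dim, num_classes):
    text_in = Input(shape=(max_utts, text_dim))    # contextual text features
    audio_in = Input(shape=(max_utts, audio_dim))  # contextual audio features
    x = Concatenate(axis=-1)([text_in, audio_in])  # bimodal fusion by concatenation
    x = Masking(mask_value=0.0)(x)                 # padded utterances stay masked
    x = Bidirectional(LSTM(300, return_sequences=True, dropout=0.4))(x)
    x = Dropout(0.5)(x)
    out = TimeDistributed(Dense(num_classes, activation='softmax'))(x)
    model = Model([text_in, audio_in], out)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  sample_weight_mode='temporal')
    return model
```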
7 Results

We provide results for the two tasks of sentiment and emotion classification on MELD. Table 4 shows the performance of sentiment classification using bcLSTM (66.68% F-score). Figure 2 shows the performance of each modality on each emotion prediction in the form of plots.

Fig. 2: Performance of each modality on each emotion, by F1-score.

Emotion classification F1-scores (%) of bcLSTM per modality:

Modality      anger   disgust  fear    joy     neutral  sadness  surprise  w-avg.
Text          42.06   21.69    7.75    54.31   71.63    26.92    48.15     56.44
Audio         25.85    6.06    2.90    15.74   61.86    14.71    19.34     39.08
Text+Audio    43.39   23.66    9.38    54.48   76.67    24.34    51.04     59.25
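The w-avg. column is the support-weighted mean of the per-class F1 scores. A masked computation in the spirit of the `calc_test_result` function described earlier might look like the following sketch (variable names and shapes are assumptions):

```python
# Hypothetical masked evaluation: flatten dialogues, drop padded utterances,
# then report confusion matrix, per-class report, and weighted F-score.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, f1_score

def calc_test_result(pred_probs, true_onehot, mask):
    # pred_probs, true_onehot: (dialogues, max_utterances, num_classes)
    # mask: (dialogues, max_utterances), 1 for real utterances, 0 for padding
    true_label, pred_label = [], []
    for i in range(pred_probs.shape[0]):
        for j in range(pred_probs.shape[1]):
            if mask[i, j] == 1:                  # skip padded dummy utterances
                true_label.append(int(np.argmax(true_onehot[i, j])))
                pred_label.append(int(np.argmax(pred_probs[i, j])))
    print(confusion_matrix(true_label, pred_label))
    print(classification_report(true_label, pred_label, digits=4))
    print("Weighted F-score:", f1_score(true_label, pred_label, average="weighted"))
```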
Modality Encoder

For the textual modality, we employ a bi-directional gated recurrent unit (BiGRU); for the acoustic and visual modalities, we apply a fully connected network. The context embedding is computed by combining these embeddings with trainable parameters.

To account for the impact of different speakers in a conversation, a shared-parameter BiGRU is used to encode different contextual information from multiple speakers. The BiGRU is applied to each speaker's utterances in the conversation, and the resulting hidden states are used as speaker embeddings.
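A hedged PyTorch sketch of these modality encoders follows. Dimensions and names are assumptions rather than the released implementation, and the shared-parameter speaker BiGRU (which would be run over each speaker's own sub-sequence of utterances in the same way) is omitted for brevity.

```python
# Assumed-dimension sketch: BiGRU for text, fully connected layers for audio/visual.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    def __init__(self, d_text=600, d_audio=300, d_visual=342, d_hid=128):
        super().__init__()
        self.text_gru = nn.GRU(d_text, d_hid, bidirectional=True, batch_first=True)
        self.audio_fc = nn.Linear(d_audio, 2 * d_hid)
        self.visual_fc = nn.Linear(d_visual, 2 * d_hid)

    def forward(self, u_text, u_audio, u_visual):
        # u_*: (batch, num_utterances, feature_dim)
        c_text, _ = self.text_gru(u_text)     # contextual textual embeddings
        c_audio = self.audio_fc(u_audio)      # acoustic embeddings
        c_visual = self.visual_fc(u_visual)   # visual embeddings
        return c_text, c_audio, c_visual

# Example: ModalityEncoder()(torch.randn(2, 10, 600),
#                            torch.randn(2, 10, 300), torch.randn(2, 10, 342))
```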
8 Graph-based Dynamic Fusion Modules

We build an undirected graph to represent a conversation [11], denoted as G = (V, E). V refers to a set of nodes. Each utterance is represented by three nodes to differentiate the acoustic, visual, and textual modalities; given N utterances, there are therefore 3N nodes in the graph. We add both the context embedding and the speaker embedding to initialize the embedding of each node in the graph:

x_δ = c_δ + γ_δ s_δ,   δ ∈ {a, v, t},

where γ_a, γ_v, γ_t are trade-off hyper-parameters. E refers to a set of edges, which are built based on two rules. The first rule is that any two nodes of the same modality in the same conversation are connected. The second rule is that each node is connected with the nodes corresponding to the same utterance but from different modalities. Edge weights are computed as

A_ij = 1 − arccos(sim(x_i, x_j)) / π,

where sim(·) is the cosine similarity function [12].
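The node initialization and edge weighting above can be sketched in a few lines of PyTorch. This is an illustrative reading of the formulas, with assumed names and shapes, not the reference implementation.

```python
# Sketch of graph construction: node features x_delta = c_delta + gamma_delta * s_delta
# and angular-similarity edge weights A_ij = 1 - arccos(sim(x_i, x_j)) / pi.
import math
import torch
import torch.nn.functional as F

def build_node_features(c, s, gamma):
    # c, s, gamma: dicts keyed by modality 'a', 'v', 't';
    # c[m], s[m]: (num_utterances, dim), gamma[m]: scalar trade-off
    x = {m: c[m] + gamma[m] * s[m] for m in ("a", "v", "t")}
    return torch.cat([x["a"], x["v"], x["t"]], dim=0)   # 3N nodes in total

def edge_weight(xi, xj):
    sim = F.cosine_similarity(xi, xj, dim=0).clamp(-1.0, 1.0)
    return 1.0 - torch.acos(sim) / math.pi
```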
8.1 Dynamic Fusion Module

Based on the graph, we improve [13] with gating mechanisms to fuse multimodal context features in the conversation. We utilize a graph convolution operation to aggregate context information of both inter- and intra-modality in a specific semantic space at each layer. Meanwhile, we leverage gating mechanisms to learn the intrinsic sequential patterns of contextual information in different semantic spaces [14]. The updating process using gating mechanisms is defined as:

Γ_ε^(k) = σ(W_ε · [g^(k−1), H_t^(k−1)] + b_ε),   ε ∈ {u, f, o},
C̃^(k) = tanh(W_c · [g^(k−1), H_t^(k−1)] + b_c),
C^(k) = Γ_f^(k) ⊙ C^(k−1) + Γ_u^(k) ⊙ C̃^(k),
g^(k) = Γ_o^(k) ⊙ tanh(C^(k)),

where Γ_u^(k), Γ_f^(k), Γ_o^(k) refer to the update gate, the forget gate, and the output gate in the k-th layer, respectively, and ⊙ denotes element-wise multiplication. g^(0) is initialized with zero. W_ε, b_ε, W_c, b_c are learnable parameters. σ(·) is the sigmoid function. The memory C^(k) stores contextual information of previous layers. The update gate Γ_u^(k) controls what part of the contextual information is written to the memory, while the forget gate Γ_f^(k) decides what redundant information in C^(k) is deleted. The output gate Γ_o^(k) reads selectively for passing into a graph convolution operation. The modified convolution operation can be defined as:

H^(k) = ReLU( ((1 − α) P̃ H_t^(k−1) + α H^(0)) ((1 − β_(k−1)) I_n + β_(k−1) W^(k−1)) ),

where P̃ = D̃^(−1/2) Ã D̃^(−1/2) is the graph convolution matrix with the renormalization trick, α and β_k are two hyperparameters, β_k = log(ρ/k + 1), and ρ is also a hyperparameter. W^(k) is the weight matrix, H^(0) is initialized with X_a, X_v, X_t, and I_n is an identity mapping matrix. The output of the k-th layer is then computed as H_t^(k) = H^(k) + g^(k).

After the stack of K layers, the representations of the three modalities for each utterance i are refined as o_a, o_v, o_t. Finally, a classifier is used to predict the emotion of each utterance:

ŷ_i = Softmax(W_z [x_a; x_v; x_t; o_a; o_v; o_t] + b_z),

where W_z and b_z are trainable parameters. We apply cross-entropy loss along with L2 regularization to train the model.
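One fusion layer implementing the update equations above could look like the following PyTorch sketch. It is a hedged adaptation (GCNII-style propagation plus an LSTM-like gate over layers); class and variable names are assumptions, not the released MM-DFN code.

```python
# One gated graph-convolution layer: the gate state g and memory C are carried
# across stacked layers; g^(0) and C^(0) start at zero.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedGraphConvLayer(nn.Module):
    def __init__(self, dim, k, alpha=0.2, rho=0.5):
        super().__init__()
        self.gates = nn.Linear(2 * dim, 4 * dim)        # W_u, W_f, W_o, W_c stacked
        self.weight = nn.Linear(dim, dim, bias=False)   # W^(k-1)
        self.alpha = alpha
        self.beta = math.log(rho / k + 1.0)             # beta_k = log(rho/k + 1)

    def forward(self, H_prev, H0, g_prev, C_prev, P):
        # H_prev = H_t^(k-1), H0 = H^(0), P = normalized adjacency P~
        z = self.gates(torch.cat([g_prev, H_prev], dim=-1))
        u, f, o, c = z.chunk(4, dim=-1)
        gamma_u, gamma_f, gamma_o = torch.sigmoid(u), torch.sigmoid(f), torch.sigmoid(o)
        C = gamma_f * C_prev + gamma_u * torch.tanh(c)  # memory update C^(k)
        g = gamma_o * torch.tanh(C)                     # gate output g^(k)
        support = (1 - self.alpha) * (P @ H_prev) + self.alpha * H0
        H = F.relu((1 - self.beta) * support + self.beta * self.weight(support))
        return H + g, g, C                              # H_t^(k), g^(k), C^(k)
```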
MM_GDF Training

The GCNLayer1 class is defined with the initialization of a linear layer with input and output features, a boolean variable indicating whether to use a topic or not, and a boolean variable indicating whether to create a new graph or not.
The forward method takes inputs, dia_len, and topicLabel as arguments and calls the message-passing methods based on the new_graph variable. The message_passing_wo_speaker method computes the adjacency matrix for the graph and performs message passing. The message_passing_directed_speaker method computes the adjacency matrix for the graph based on directed speakers and performs message passing. The cossim method computes the cosine similarity between two vectors, and the atom_calculate_edge_weight method computes the edge weight between two nodes based on that cosine similarity.

The message_passing_wo_speaker method computes the adjacency matrix by initializing it with the identity matrix, then loops through each conversation and its corresponding utterances, computes the edge weight between each utterance pair based on their cosine similarity, and updates the adjacency matrix (a sketch of this construction is given below). If the use_topic variable is True, the adjacency matrix is additionally updated with the edge weight between each utterance and its corresponding topic. The resulting features are then used for classification. If use_residue is set to True, the output of the GCN layers is concatenated with the input audio features before passing through the final fully connected layer.

In addition to the GCN layers, the model also includes an LSTM layer which takes as input the output of the GCN layers and performs sequential reasoning over the dialogue. This is controlled by the reason_flag flag: if it is set to True, the output of the GCN layers is passed through an LSTM layer before being concatenated with the input audio features.

During training, the model is optimized using the negative log-likelihood loss. The forward() function of the GCNII_lyc class takes the input audio features, dialogue length, topic label, and adjacency matrix as input, and returns the model output. If the return_feature flag is set to True, the output of each layer of the GCN is returned as well. If the test_label flag is set to True, the output of each layer of the GCN is saved to disk.
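The adjacency construction just described might be written as below. The method names follow the description above, but the body is a hedged sketch, not the original code.

```python
# Sketch: identity-initialized adjacency with angular cosine-similarity edge
# weights between utterances of the same conversation.
import math
import torch

def atom_calculate_edge_weight(xi, xj):
    sim = torch.cosine_similarity(xi, xj, dim=0).clamp(-1.0, 1.0)
    return 1.0 - torch.acos(sim) / math.pi

def message_passing_wo_speaker(features, dia_len):
    # features: (total_utterances, dim); dia_len: list of utterances per conversation
    n = features.size(0)
    adj = torch.eye(n)                      # start from the identity matrix (self-loops)
    start = 0
    for length in dia_len:                  # loop over conversations
        for i in range(start, start + length):
            for j in range(i + 1, start + length):
                w = atom_calculate_edge_weight(features[i], features[j])
                adj[i, j] = adj[j, i] = w   # undirected graph
        start += length
    return adj
```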
Early fusion concatenates the modalities into a single feature vector or tensor. The concatenated features are then fed into a neural network for further processing. Early fusion is simple and efficient, but it may not capture the complex interplay between the modalities and may not be suitable for tasks where the modalities have different temporal or spatial resolutions.

Graph-based dynamic fusion, on the other hand, constructs a graph to model the relationships between the multi-modal features. The nodes of the graph represent the features from different modalities, and the edges represent the connections or interactions between them. The graph can be learned dynamically based on the input data or predefined based on prior knowledge of the task. The features are then propagated through the graph using graph convolutional neural networks (GCNs), which can capture the complex relationships between the modalities and model the temporal and spatial dynamics of the features.

Graph-based dynamic fusion has shown promising results in tasks such as audio-visual speech recognition and emotion recognition, where the modalities have different temporal or spatial resolutions and the relationships between them are complex and dynamic. However, it can be computationally expensive and requires careful design of the graph structure and the GCN architecture.

Overall, the choice of fusion method depends on the specific task, dataset, and available resources. Early fusion is simple and efficient but may not capture the complex interplay between the modalities, while graph-based dynamic fusion can capture the complex relationships and dynamics between the modalities at a higher computational cost. A toy example of early fusion by concatenation is sketched below.
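For contrast with the graph-based module sketched earlier, early fusion amounts to little more than stacking the modality features. The snippet below uses made-up shapes purely for illustration.

```python
# Toy early fusion: concatenate utterance-level features from each modality
# into one vector and feed the result to a small classifier (assumed shapes).
import torch
import torch.nn as nn

text_feat = torch.randn(16, 100)     # (utterances, d_text)
audio_feat = torch.randn(16, 73)     # (utterances, d_audio)
visual_feat = torch.randn(16, 512)   # (utterances, d_visual)

fused = torch.cat([text_feat, audio_feat, visual_feat], dim=-1)  # one vector per utterance
classifier = nn.Sequential(nn.Linear(fused.size(-1), 128), nn.ReLU(), nn.Linear(128, 7))
logits = classifier(fused)           # (utterances, num_emotion_classes)
```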
10 Conclusion

In the field of multimodal Emotion Recognition in Conversation (ERC), the fusion method is an important aspect to consider, because ERC tasks involve analyzing multiple modalities, such as text, audio, and visual cues, in order to accurately recognize and interpret emotions in conversations.

There are different methods for fusing these modalities together, such as early fusion, which combines the modalities at an early stage of processing, and concatenation, which simply stacks the modalities together. While these methods are easy to implement, they may not provide a comprehensive understanding of the conversational context, which can be crucial for accurately recognizing emotions.

On the other hand, fusion methods like TensorFusion [14] or graph-based dynamic fusion modules can enable a more nuanced understanding of the conversational context, as they allow more complex interactions between the modalities to be modeled. For example, TensorFusion can capture the interdependence between modalities by learning a shared representation space for them, while graph-based dynamic fusion modules model the interactions between modalities as a graph.

Beyond the importance of fusion methods, we also highlight the potential of the MELD dataset in aiding conversation-understanding research. MELD is a multimodal dataset containing more than 13,000 utterances from over 1,400 multi-party dialogues of the TV series Friends, annotated with emotions, sentiment, and other conversational features. It is useful for training and evaluating models for multimodal ERC, as well as for other tasks related to conversation understanding.

Overall, this work emphasizes the importance of considering fusion methods in multimodal ERC tasks and highlights the potential of the MELD dataset for conversation-understanding research.
References

[1] Dou Hu, Xiaolong Hou, Lingwei Wei, Lianxin Jiang, and Yang Mo. 2022. MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations.

[2] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2018. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations.

[3] Sheng-Yeh Chen, Chao-Chun Hsu, Chuan-Chun Kuo, Lun-Wei Ku, et al. 2018. EmotionLines: An Emotion Corpus of Multi-Party Conversations. arXiv preprint arXiv:1802.08379.

[4] Luca Bolognini, Paolo Torroni, and Eleonora Mancini. 2022. Emotion Recognition for Human-Centered Conversational Agents.

[11] Jingwen Hu, Yuchen Liu, Jinming Zhao, et al. 2021. MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation. In ACL/IJCNLP, pp. 5666–5675.

[12] Konstantinos Skianis, Fragkiskos D. Malliaros, and Michalis Vazirgiannis. 2018. Fusing Document, Collection and Label Graph-based Representations with Word Embeddings for Text Classification. In TextGraphs@NAACL-HLT, pp. 49–58.

[13] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR (Poster).