
Multimodal Speech Emotion Recognition and Ambiguity Resolution


Gaurav Sahu
David R. Cheriton School of Computer Science
University of Waterloo
Ontario, Canada
gaurav.sahu@uwaterloo.ca
arXiv:1904.06022v1 [cs.LG] 12 Apr 2019

Abstract—Identifying emotion from speech is a non-trivial task, owing to the ambiguous definition of emotion itself. In this work, we adopt a feature-engineering approach to tackle the task of speech emotion recognition. Formalizing the problem as multi-class classification, we compare the performance of two categories of models. For both, we extract eight hand-crafted features from the audio signal. In the first approach, the extracted features are used to train six traditional machine learning classifiers; the second approach is based on deep learning, wherein a baseline feed-forward neural network and an LSTM-based classifier are trained over the same features. In order to resolve ambiguity in communication, we also include features from the text domain. We report accuracy, F-score, precision and recall for the different experiment settings in which we evaluated our models. Overall, we show that lighter machine learning models trained over a few hand-crafted features achieve performance comparable to the current deep learning based state-of-the-art method for emotion recognition.

Index Terms—multimodal speech emotion recognition, machine learning, deep learning

I. INTRODUCTION

Communication is key to human existence, and more often than not we have to deal with ambiguous situations. For instance, the phrase "This is awesome" could be said in either a happy or a sad setting. Humans are able to resolve the ambiguity in most cases because we can efficiently comprehend information from multiple domains (henceforth referred to as modalities), namely speech, text and vision. With the rise of deep learning algorithms, there have been multiple attempts to tackle the task of Speech Emotion Recognition (SER), as in [1], [2] and [3]. However, this rise has made practitioners rely more on the power of deep learning models than on domain knowledge for constructing meaningful features and building models that both perform well and are interpretable. In this work, we explore the implications of hand-crafted features for SER and compare the performance of lighter machine learning models with that of heavily data-reliant deep learning models. Furthermore, we combine features from the textual modality to understand the correlation between different modalities and to aid ambiguity resolution. More formally, we pose our task as a multi-class classification problem and employ the two classes of models to solve it. For both approaches, we first extract hand-crafted features from the time domain of the audio signal and then train the respective models.

In the first approach, we train traditional machine learning classifiers, namely Random Forests, Gradient Boosting, Support Vector Machines, Naive Bayes and Logistic Regression. In the second approach, we build a Multi-Layer Perceptron and an LSTM [4] classifier to recognize emotion given a speech signal. The models are evaluated on the IEMOCAP [5] dataset under different settings, namely Audio-only, Text-only and Audio + Text (code for all the experiments is available at http://tinyurl.com/y55dlc3m).

The rest of the paper is organized as follows: Section II describes existing methods in the literature for the task of speech emotion recognition; Section III gives an overview of the dataset used in this work and the pre-processing steps applied before feature extraction; Section IV describes the proposed models and implementation details; results are reported in Section V, followed by the conclusion and future scope of this work in Section VI.
II. LITERATURE REVIEW

In this section, we review some of the work that has been done in the field of speech emotion recognition (SER). The task of SER is not new and has been studied for quite some time in the literature. A majority of the early approaches ([6], [7]) used Hidden Markov Models (HMMs) [8] to identify emotion from speech. The recent introduction of deep neural networks to the domain has also significantly improved the state-of-the-art performance. For instance, [3] and [9] use recurrent autoencoders to solve the task. More recently, methods such as Tensor Fusion Networks [10] and low-rank multimodal fusion [11] have also been proposed to combine features from multiple domains efficiently, instead of relying on trivial concatenation.

This work aims to provide a comparative study between 1) deep learning based models that are trained end-to-end, and 2) lighter machine learning and deep learning models trained over hand-crafted features. We also investigate the information residing in multiple modalities and how their combination affects performance.

III. DATASET

In this work, we use the IEMOCAP dataset [5], released in 2008 by researchers at the University of Southern California (USC). It contains five recorded sessions of conversations from ten speakers and amounts to nearly 12 hours of audio-visual information along with transcriptions. It is annotated with eight categorical emotion labels, namely anger, happiness, sadness, neutral, surprise, fear, frustration and excited. It also contains dimensional labels, such as activation and valence values from 1 to 5; however, these are not used in this work.

The dataset is already split into multiple utterances for each session, and we further split each utterance file to obtain wav files for each sentence. This was done using the start and end timestamps provided for the transcribed sentences. This results in a total of ~10K audio files, which are then used to extract features.
IV. METHODOLOGY

This section describes the data pre-processing steps, followed by a detailed description of the features extracted and the two classes of models applied to the classification problem.

A. Data Pre-processing

a) Audio: A preliminary frequency analysis revealed that the dataset is not balanced. The emotions "fear" and "surprise" were under-represented, so we used upsampling techniques to alleviate the issue. We then merged examples from the "happy" and "excited" classes, as "happy" was under-represented and the two emotions closely resemble each other. In addition, we discarded examples labeled "others"; these corresponded to examples that were ambiguous even for a human annotator. Applying the aforementioned operations resulted in 7837 examples in total. The final sample distribution for each emotion is shown in Table I.

TABLE I: Number of examples for each emotion

Class     Count
Angry       860
Happy      1309
Sad        2327
Fear       1007
Surprise    949
Neutral    1385
Total      7837

b) Text: The available transcriptions were first normalized to lowercase, and any special symbols were removed.

B. Feature Extraction

We now describe the hand-crafted features used to train both the ML- and the DL-based models.

1) Audio Features:

a) Pitch: Pitch is important because the waveforms produced by our vocal cords change depending on our emotion. Many algorithms for estimating the pitch signal exist; we use the most common method, based on the autocorrelation of center-clipped frames [12]. Formally, the input signal y[n] is center-clipped to give a resultant signal y_clipped[n]:

$$y_{\mathrm{clipped}}[n] = \begin{cases} y[n] - C_l, & y[n] \ge C_l \\ 0, & |y[n]| < C_l \\ y[n] + C_l, & y[n] \le -C_l \end{cases} \qquad (1)$$

Typically, C_l is nearly half the mean of the input signal, and [·] denotes the discrete nature of the signal. The autocorrelation is then calculated for the clipped signal y_clipped, which is further normalized, and the peak values associated with the pitch of the given input y[n] are extracted. It was found that center-clipping the input signal results in more distinct autocorrelation peaks.
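To make this feature concrete, the following is a minimal NumPy sketch of the center-clipping and autocorrelation procedure described above. The frame length, hop size, clipping level of half the mean absolute amplitude and the 50-400 Hz search range are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def pitch_features(y, sr, frame_len=2048, hop=512):
    """Per-frame pitch via autocorrelation of center-clipped frames, summarized as mean/std."""
    pitches = []
    for start in range(0, len(y) - frame_len, hop):
        frame = y[start:start + frame_len]
        c_l = 0.5 * np.mean(np.abs(frame))              # clipping level ~ half the mean amplitude
        clipped = np.where(frame >= c_l, frame - c_l,
                           np.where(frame <= -c_l, frame + c_l, 0.0))
        ac = np.correlate(clipped, clipped, mode="full")[frame_len - 1:]
        ac = ac / (ac[0] + 1e-10)                        # normalize by the zero-lag energy
        lo, hi = sr // 400, sr // 50                     # search lags for a 50-400 Hz pitch range
        lag = lo + np.argmax(ac[lo:hi])
        pitches.append(sr / lag)
    if not pitches:
        return 0.0, 0.0
    pitches = np.array(pitches)
    return pitches.mean(), pitches.std()                 # summary statistics used as features
```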
b) Harmonics: In the emotional state of anger, or for stressed speech, there are additional excitation signals other than pitch ([13], [14]). This additional excitation is apparent in the spectrum as harmonics and cross-harmonics (see Figure 1). We calculate harmonics using a median-based filter, as described in [15]. First, a median filter for a given window size l is defined as:

$$y[n] = \mathrm{median}\big(x[n-k : n+k]\big), \quad k = (l-1)/2 \qquad (2)$$

where l is odd; when l is even, the median is obtained as the mean of the two values in the middle of the sorted list. This filter is then applied to S_h, the h-th frequency slice of a given spectrogram S, to obtain the harmonic-enhanced spectrogram slice H_h:

$$H_h = \mathcal{M}(S_h, l_{\mathrm{harm}}) \qquad (3)$$

Here M is the median filter, applied at each time step i of the slice, and l_harm is the length of the harmonic filter.

Fig. 1: Harmonics of angry (red) and sad (blue) audio signals.
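Since the implementation relies on librosa (Section IV-F), one hedged way to realize the median-filtering idea of Eqs. (2)-(3) is through librosa's harmonic-percussive separation, which follows the same median-filtering approach as [15]. The kernel size and the mean/std summary below are assumptions made for illustration, not the paper's exact settings.

```python
import numpy as np
import librosa

def harmonic_features(y, sr, n_fft=2048, hop=512, kernel=31):
    """Median-filter the spectrogram along time to enhance harmonics (cf. Eq. 3)."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    # librosa.decompose.hpss median-filters across time (harmonic) and frequency
    # (percussive), following the approach of [15].
    H, _ = librosa.decompose.hpss(S, kernel_size=kernel)
    return H.mean(), H.std()    # coarse summary of the harmonic energy used as features
```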
c) Speech Energy: Since the energy of a speech signal is related to its loudness, we can use it to detect certain emotions. Figure 2 shows the difference in energy levels of an "angry" signal versus a "sad" signal. We use the standard Root Mean Square Energy (RMSE) to represent speech energy:

$$E = \sqrt{\frac{1}{n}\sum_{i=1}^{n} y[i]^2} \qquad (4)$$

RMSE is calculated frame by frame, and we take both the average and the standard deviation as features.

Fig. 2: RMSE plots of angry (red) and sad (blue) audio signals.

d) Pause: We use this feature to represent the "silent" portion of the audio signal. This quantity is directly related to our emotions; for instance, we tend to speak very fast when excited (say, angry or happy), resulting in a low Pause value. The feature value is given by:

$$\mathrm{Pause} = \Pr\big(y[n] < t\big) \qquad (5)$$

where t is a carefully chosen threshold of approximately 0.4 E, E being the RMSE.

e) Central moments: Finally, we use the mean and standard deviation of the amplitude of the signal to incorporate "summarized" information about the input.
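A compact sketch of the remaining audio features follows. Reading Eq. (5) as the fraction of samples whose absolute amplitude falls below 0.4 times the mean RMSE is one interpretation of the text, and the frame sizes are assumptions.

```python
import numpy as np
import librosa

def energy_pause_moment_features(y, sr, frame_len=2048, hop=512):
    """RMSE statistics (Eq. 4), pause fraction (Eq. 5) and central moments of the signal."""
    rms = librosa.feature.rms(y=y, frame_length=frame_len, hop_length=hop)[0]
    rms_mean, rms_std = rms.mean(), rms.std()
    threshold = 0.4 * rms_mean                      # t ~ 0.4 * E, as in Eq. (5)
    pause = np.mean(np.abs(y) < threshold)          # fraction of "silent" samples
    return np.array([rms_mean, rms_std, pause, y.mean(), y.std()])
```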
2) Text Features:

a) Term Frequency-Inverse Document Frequency (TFIDF): TFIDF is a numerical statistic that reflects the correlation between a word and a document in a collection or corpus. It consists of two parts:

• Term Frequency (TF): the number of times a word/token occurs in a document. The simplest choice is the raw count of a token in a document (sentences, in our case).
• Inverse Document Frequency (IDF): a term introduced to lessen the bias due to frequently occurring words in language, such as "the", "a" and "an". Usually, the idf for a term t and a document collection D is defined as:

$$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \qquad (6)$$

where the denominator is the number of documents containing the term t and N is the total number of documents.

Finally, the TFIDF value for a term is calculated by taking the product of its TF and IDF values.
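A minimal sketch of this step with scikit-learn's TfidfVectorizer, one of the libraries listed in Section IV-F, is shown below. The normalization mirrors the text pre-processing of Section IV-A; the exact regular expression is an assumption.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def normalize(sentence):
    """Lowercase and strip special symbols, mirroring the text pre-processing step."""
    return re.sub(r"[^a-z0-9\s]", "", sentence.lower())

def build_tfidf(train_sentences):
    """Fit a TFIDF vectorizer over the normalized transcriptions (one sentence per example)."""
    vectorizer = TfidfVectorizer(preprocessor=normalize)
    X_text = vectorizer.fit_transform(train_sentences)   # sparse (n_samples, vocab_size) matrix
    return vectorizer, X_text
```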
C. Machine Learning Models

This section describes the various ML-based classifiers considered in this work, namely Random Forests, Gradient Boosting, Support Vector Machines, Naive Bayes and Logistic Regression.

a) Random Forest (RF): Random forests are ensemble learners that operate by constructing multiple decision trees at training time and outputting the class that is the mode of the classes predicted by the individual trees. They rest on two working principles:

• Each decision tree predicts using a random subset of the features [16].
• Each decision tree is trained on only a subset of the training samples; this is known as bootstrap aggregating [17].

Finally, a majority vote over all the decision trees is taken to predict the class of a given input.

b) Gradient Boosting (XGB): XGB refers to eXtreme Gradient Boosting, an implementation of boosting that supports training the model in a fast and parallelized way. Boosting is another ensemble technique, combining a number of weak learners, typically decision trees. Unlike in RFs, the trees are trained sequentially using forward stage-wise additive modeling. During the early iterations, the decision trees learned are simple; as training progresses, the classifier becomes more powerful because it is made to focus on the instances where the previous learners made errors. At the end of training, the final prediction is a weighted linear combination of the outputs of the individual learners [18].

c) Support Vector Machines (SVMs): SVMs are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. An SVM training algorithm essentially builds a non-probabilistic binary linear classifier (although methods such as Platt scaling [19] exist to use SVMs in a probabilistic classification setting). It represents each training example as a point in space, mapped such that the examples of the separate categories are divided by a clear gap that is as wide as possible (usually achieved by minimizing the hinge loss). New examples are then mapped into the same space and predicted to belong to a category based on which side of the gap they fall. SVMs were originally introduced for linear classification; however, they can efficiently perform non-linear classification using the kernel trick [20], implicitly mapping their inputs into high-dimensional feature spaces.

d) Multinomial Naive Bayes (MNB): Naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Under the multinomial setting, the feature vectors represent the frequencies with which certain events have been generated by a multinomial distribution (p1, ..., pn), where pi is the probability that event i occurs. MNB is very popular for document classification in text [21], which is itself essentially a multi-class classification problem.

e) Logistic Regression (LR): LR is typically used for binary classification problems [22], that is, when there are only two labels. In this work, LR is implemented in a one-vs-rest manner: six classifiers are trained, one per class, and we finally take the class that is predicted with the highest probability.

Having trained the above classifiers, we take an ensemble of the best performing ones and use it for comparison with the current state-of-the-art for emotion recognition on the IEMOCAP dataset.
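These classifiers and the E1 ensemble reported later in Figure 5 map naturally onto scikit-learn and xgboost, the libraries listed in Section IV-F. The sketch below is a minimal illustration rather than the authors' released code: all hyperparameters are assumptions, and E2 would be built the same way by adding MNB and LR to the estimator list.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

def build_models():
    """Instantiate the traditional classifiers, the MLP, and the E1 voting ensemble."""
    models = {
        "rf":  RandomForestClassifier(n_estimators=300),
        "xgb": XGBClassifier(),
        "svm": SVC(kernel="rbf", probability=True),
        "mnb": MultinomialNB(),                       # expects non-negative features (e.g. TFIDF)
        "lr":  LogisticRegression(max_iter=1000),     # the paper applies LR one-vs-rest over six classes
        "mlp": MLPClassifier(hidden_layer_sizes=(128, 64)),
    }
    # E1: soft-voting ensemble of RF + XGB + MLP, as used in the Audio-only setting.
    models["e1"] = VotingClassifier(
        [("rf", models["rf"]), ("xgb", models["xgb"]), ("mlp", models["mlp"])],
        voting="soft")
    return models

# Usage sketch: models["e1"].fit(X_train, y_train); preds = models["e1"].predict(X_test)
```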
D. Deep Learning Models

In this section, we describe the deep learning models used. Typically, Deep Neural Networks (DNNs) are trained in an end-to-end fashion and are expected to "figure out" useful features completely on their own. However, training such a model can take a lot of time as well as computational resources. In order to minimize this computational overhead, we directly feed the hand-crafted features as input to these models and compare their performance with that of traditionally end-to-end trained counterparts. In this work, we implement two types of models:

a) Multi-Layer Perceptron (MLP): MLPs are a class of feed-forward neural networks. An MLP consists of at least three layers: an input, a hidden and an output layer. The layers are interleaved with a non-linear activation function, which stabilizes the network during training. The expressive power of an MLP increases with the number of hidden layers, up to a certain extent, and its non-linear nature allows it to distinguish data that is not linearly separable.

b) Long Short Term Memory (LSTM): LSTMs [4] were introduced to capture long-range context in sequences. Unlike an MLP, an LSTM has feedback connections that allow it to decide what information is important and what is not. It uses a gating mechanism with three types of gates: input, forget and output. Their equations are given below:

$$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f) \qquad (7)$$
$$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i) \qquad (8)$$
$$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o) \qquad (9)$$
$$c_t = f_t \cdot c_{t-1} + i_t \cdot \sigma_c(W_c x_t + U_c h_{t-1} + b_c) \qquad (10)$$
$$h_t = o_t \cdot \sigma_h(c_t) \qquad (11)$$
Here the initial values are c_0 = 0 and h_0 = 0, and · denotes the element-wise product. t denotes the time step (each element in a sequence belongs to one time step), x_t is the input vector to the LSTM unit, f_t is the forget gate's activation vector, i_t is the input gate's activation vector, o_t is the output gate's activation vector, h_t is the hidden state vector (which typically maps a vector from the feature space to a lower-dimensional latent space), c_t is the cell state vector, and W, U and b are the weight and bias matrices that need to be learned during training. As illustrated in Figure 3, an LSTM cell is able to keep track of hidden states at all time steps through this feedback mechanism.

Fig. 3: Visualization of an LSTM cell.

Fig. 4: LSTM classifier.

Figure 4 shows the network implemented in this work. We feed the feature vectors as input to the network and finally pass the output of the LSTM through a softmax layer to obtain probability scores for each of the six emotion classes. Since we use feature vectors as input, we do not need another decoder network to transform the hidden representation back to the output space, thereby reducing the network size.
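A minimal PyTorch sketch of the classifier in Figure 4 is given below, under the assumption that each hand-crafted feature vector is presented as a short sequence (per-frame features or a length-1 sequence); the hidden size and dropout rate are illustrative, and in practice the logits would be optimized with cross-entropy while the softmax is applied at inference time.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Feature-vector sequence in, scores over the six emotion classes out (cf. Fig. 4)."""
    def __init__(self, input_dim, hidden_dim=64, n_classes=6, p_drop=0.2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(p_drop)        # the dropout regularizer from Section IV-F
        self.fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):                        # x: (batch, seq_len, input_dim)
        _, (h_n, _) = self.lstm(x)               # final hidden state summarizes the sequence
        logits = self.fc(self.dropout(h_n[-1]))
        return logits                            # softmax(logits) gives the class probabilities

# Training-step sketch:
# model = LSTMClassifier(input_dim=8)
# loss = nn.CrossEntropyLoss()(model(batch_x), batch_y)
```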
Fig. 5: Performance of different models under the (a) Audio-only, (b) Text-only and (c) Audio+Text settings. E1: Ensemble (RF + XGB + MLP); E2: Ensemble (RF + XGB + MLP + MNB + LR).

Fig. 6: Confusion matrices of our ensemble models: (a) E1, Audio-only setting; (b) E2, Text-only setting; (c) E2, Audio+Text setting.
E. Experiments

Here, we describe the three different settings in which we conducted our experiments:

• Audio-only: we train all the classifiers using only the audio feature vectors described earlier.
• Text-only: we train all the classifiers using only the text feature vectors (TFIDF vectors).
• Audio+Text: we fuse the feature vectors from the two modalities. Some methods have been proposed to fuse vectors from multiple modalities efficiently, but we simply concatenate the audio and text feature vectors to obtain the combined feature vectors (a sketch of this fusion is given at the end of Section IV-F). Through this experiment, we are able to infer how much information is contained in each modality and how fusion influences the models' performance.

F. Implementation Details

In this section, we describe the implementation details adopted in this work.

• We use librosa [23], a Python library, to process the audio files and extract features from them.
• We use scikit-learn [24] and xgboost [25], machine learning libraries for Python, to implement all the ML classifiers (RF, XGB, SVM, MNB and LR) and the MLP.
• We use PyTorch [26] to implement the LSTM classifiers.
• In order to regularize the hidden space of the LSTM classifiers, we use a shut-off mechanism called dropout [27], in which a fraction of the neurons is not used for the final prediction. This has been shown to increase the robustness of a network and prevent overfitting.

We randomly split our dataset into a train (80%) and test (20%) set. The same split is used for all the experiments to ensure a fair comparison. The LSTM classifiers were trained on an NVIDIA Titan X GPU for faster processing. We stop training when we see no improvement in validation performance for more than 10 epochs, where one epoch refers to one iteration over all the training samples. Different batch sizes were used for different models. Hyperparameters for all the models under the three experiment settings can be found in the released repository.
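As referenced in the Audio+Text bullet above, the three settings and the 80/20 split can be wired together as sketched below. Treating the audio features as a dense (n_samples, 8) matrix and the TFIDF output as a sparse matrix is an assumption, and the fixed random seed simply stands in for "the same split is used for all the experiments".

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.model_selection import train_test_split

def make_setting(audio_feats, tfidf_feats, labels, setting="audio+text", seed=0):
    """Build the feature matrix for one experiment setting and apply the 80/20 split."""
    if setting == "audio":
        X = audio_feats                               # dense (n_samples, 8) array
    elif setting == "text":
        X = tfidf_feats                               # sparse TFIDF matrix
    else:                                             # trivial fusion: plain concatenation
        X = hstack([tfidf_feats, csr_matrix(audio_feats)])
    return train_test_split(X, labels, test_size=0.2, random_state=seed)
```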
G. Evaluation Metrics

In this section, we describe the evaluation metrics used for the three experiment settings.

a) Accuracy: the percentage of test samples that are classified correctly.

b) Precision: this measure tells us how many of the predictions for a class are actually present in the ground truth (i.e., the labels). It is calculated as:

$$\mathrm{Precision} = \frac{tp}{tp + fp} \qquad (12)$$

c) Recall: this measure tells us how many of the correct labels are recovered in the predicted output. It is calculated as:

$$\mathrm{Recall} = \frac{tp}{tp + fn} \qquad (13)$$

Here, tp, fp and fn stand for true positives, false positives and false negatives respectively; these values can be computed from the confusion matrix.

d) F-score: the harmonic mean of precision and recall. This measure was included because accuracy alone is not a complete measure of a model's predictive power, whereas the F-score, balancing precision and recall, is more informative.

We compare our best performing models with the current state-of-the-art as reported in [2]. They employ three types of recurrent encoders, namely ARE, TRE and MDRE, denoting Audio-, Text- and Multimodal Dual Recurrent Encoders respectively. It is important to mention that [2] only considers four emotions for classification, namely angry, happy, sad and neutral, as opposed to six in our case. In order to present a fair comparison with their method, we also run experiments for these four classes (models tagged 4-class in Figure 5).
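The reported metrics map directly onto scikit-learn, as sketched below. Averaging the per-class precision, recall and F-score with a macro average is an assumption, since the paper does not state the averaging scheme.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

def report(y_true, y_pred):
    """Accuracy, averaged precision/recall/F-score, and the confusion matrix."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    return {"accuracy": acc, "precision": prec, "recall": rec,
            "f1": f1, "confusion": confusion_matrix(y_true, y_pred)}
```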
V. RESULTS

In this section, we discuss the performance of the models described in Section IV. From Figure 5, we can see that our simpler and lighter ML models either outperform or are comparable to the much heavier current state-of-the-art on this dataset. A more detailed analysis follows.

a) Audio-only results: Results are especially interesting for this setting. The performance of the LSTM and ARE reveals that deep models indeed need a lot of information to learn features: the LSTM classifier trained on the eight-dimensional features achieves very low accuracy compared to the end-to-end trained ARE. However, neither of them is able to beat the lighter E1 model (ensemble of RF, XGB and MLP), which was trained on the same eight-dimensional audio feature vectors. A look at the confusion matrix (Fig. 6a) reveals that detecting "neutral", or distinguishing between "angry", "happy" and "sad", is the most difficult for the model.

b) Text-only results: We observe that the performance of all the models in this setting is similar. This could be attributed to the richness of TFIDF vectors, which are known to capture word-sentence correlation. We see from the confusion matrix (Fig. 6b) that our text-based models are able to distinguish the six emotions fairly well, on par with the end-to-end trained TRE. We observe that "sad" is the toughest emotion for textual features to identify clearly.

c) Audio+Text results: We see that combining audio and text features gives us a boost of ~14% across all the metrics. This is clear evidence of the strong correlation between text and speech features. This is also the only case in which the recurrent encoders perform slightly better in terms of accuracy, although at the cost of precision. The lower performance of E1 may be attributed to the trivial fusion method (concatenation) we use: for an ML model, a simple concatenation still carries mostly modality-specific connections instead of the desired inter-modal connections. The promising result is that combining features from both modalities indeed helps resolve the ambiguity observed for the modality-specific models, as shown in Fig. 6c. The textual features helped in the correct classification of the "angry" and "happy" classes, whereas the audio features enabled the model to detect "sad" better.

Overall, we can conclude that our simple ML methods are robust, achieving comparable performance even though they are modeled to predict six classes as opposed to four in previous works.

A. Most Important Features

In this section, we investigate which features contribute the most to prediction in this classification task. We chose the XGB model for this study and rank the eight audio features. We see that Harmonic, which is directly related to the excitation in signals, contributes the most. It is interesting to see that "silence", captured by the Pause feature, is almost as significant as the standard deviation of the autocorrelated signal (related to pitch). The low contribution of the central moments is expected, since a signal is very diverse and such global, coarse features are unable to capture the nuances present in it.

Fig. 7: Most important audio features.
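The ranking in Figure 7 can be read off a fitted XGB model's importance scores, as in the sketch below. The feature names are hypothetical labels for the eight-dimensional vector, since the exact ordering of the features is not spelled out in the text.

```python
import numpy as np

FEATURE_NAMES = ["pitch_mean", "pitch_std", "harmonic_mean", "harmonic_std",
                 "rmse_mean", "rmse_std", "pause", "amplitude_mean"]  # hypothetical labels

def rank_audio_features(fitted_xgb):
    """Rank the hand-crafted audio features by the trained XGB model's importance scores."""
    scores = fitted_xgb.feature_importances_
    order = np.argsort(scores)[::-1]
    return [(FEATURE_NAMES[i], float(scores[i])) for i in order]
```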
VI. CONCLUSION AND FUTURE WORK

In this work, we tackle the task of speech emotion recognition and study the contribution of different modalities towards ambiguity resolution on the IEMOCAP dataset. We compare both ML- and DL-based models and show that lighter, more interpretable ML models can achieve performance close to that of DL-based models. We also show that ensembling multiple ML models leads to some further improvement in performance. We only extract a handful of time-domain features from the audio signals; the audio feature space could be made even richer by including frequency-domain features such as Mel-Frequency Cepstral Coefficients (MFCC) [28] and spectral roll-off, as well as additional time-domain features such as the Zero Crossing Rate (ZCR) [29]. Better fusion methods such as TFN [10] and LMF [11] could also be employed to combine speech and text vectors more effectively. It would also be interesting to see how the performance of ML models scales against that of DL models if we include more data.
REFERENCES

[1] Z. Huang, M. Dong, Q. Mao, and Y. Zhan, "Speech emotion recognition using CNN," in Proceedings of the 22nd ACM International Conference on Multimedia, pp. 801-804, ACM, 2014.
[2] S. Yoon, S. Byun, and K. Jung, "Multimodal speech emotion recognition using audio and text," in 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 112-118, IEEE, 2018.
[3] K. Han, D. Yu, and I. Tashev, "Speech emotion recognition using deep neural network and extreme learning machine," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[4] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[5] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008.
[6] A. Nogueiras, A. Moreno, A. Bonafonte, and J. B. Mariño, "Speech emotion recognition using hidden markov models," in Seventh European Conference on Speech Communication and Technology, 2001.
[7] B. Schuller, G. Rigoll, and M. Lang, "Hidden markov model-based speech emotion recognition," in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), vol. 2, pp. II-1, IEEE, 2003.
[8] L. R. Rabiner and B.-H. Juang, "An introduction to hidden markov models," IEEE ASSP Magazine, vol. 3, no. 1, pp. 4-16, 1986.
[9] Y. Kim, H. Lee, and E. M. Provost, "Deep learning for robust feature generation in audiovisual emotion recognition," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3687-3691, IEEE, 2013.
[10] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency, "Tensor fusion network for multimodal sentiment analysis," arXiv preprint arXiv:1707.07250, 2017.
[11] Z. Liu, Y. Shen, V. B. Lakshminarasimhan, P. P. Liang, A. Zadeh, and L.-P. Morency, "Efficient low-rank multimodal fusion with modality-specific factors," arXiv preprint arXiv:1806.00064, 2018.
[12] M. Sondhi, "New methods of pitch extraction," IEEE Transactions on Audio and Electroacoustics, vol. 16, no. 2, pp. 262-266, 1968.
[13] H. Teager and S. Teager, "Evidence for nonlinear sound production mechanisms in the vocal tract," in Speech Production and Speech Modelling, pp. 241-261, Springer, 1990.
[14] G. Zhou, J. H. Hansen, and J. F. Kaiser, "Nonlinear feature based classification of speech under stress," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 201-216, 2001.
[15] D. Fitzgerald, "Harmonic/percussive separation using median filtering," 2010.
[16] Y. Amit, D. Geman, and K. Wilder, "Joint induction of shape features and tree classifiers," IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 11, pp. 1300-1305, 1997.
[17] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[18] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning, vol. 1. Springer Series in Statistics, New York, 2001.
[19] J. Platt et al., "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods," Advances in Large Margin Classifiers, vol. 10, no. 3, pp. 61-74, 1999.
[20] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[21] A. M. Kibriya, E. Frank, B. Pfahringer, and G. Holmes, "Multinomial naive bayes for text categorization revisited," in Australasian Joint Conference on Artificial Intelligence, pp. 488-499, Springer, 2004.
[22] G. King and L. Zeng, "Logistic regression in rare events data," Political Analysis, vol. 9, no. 2, pp. 137-163, 2001.
[23] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and music signal analysis in python," in Proceedings of the 14th Python in Science Conference, pp. 18-25, 2015.
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., "Scikit-learn: Machine learning in python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[25] T. Chen, "Scalable, portable and distributed gradient boosting (gbdt, gbrt or gbm) library, for python, r, java, scala, c++ and more. Runs on single machine, hadoop, spark, flink and dataflow," 2014.
[26] Facebook, "PyTorch," 2017.
[27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[28] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980.
[29] F. Gouyon, F. Pachet, O. Delerue, et al., "On the use of zero-crossing rate for an application of classification of percussive sounds," in Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-00), Verona, Italy, 2000.
