Abstract
This paper presents an approach to emotion recognition from speech signals and textual content. In the analysis of speech signals, thirty-three acoustic features are extracted from the speech input. After Principal Component Analysis (PCA) is performed, 14 principal components are selected for discriminative representation. In this representation, each principal component is a linear combination of the 33 original acoustic features and forms a feature subspace. Support Vector Machines (SVMs) are adopted to classify the emotional states. In text analysis, all emotional keywords and emotion modification words are manually defined. The emotion intensity levels of emotional keywords and emotion modification words are estimated based on a collected emotion corpus. The final emotional state is determined based on the emotion outputs from the acoustic and textual analyses. Experimental results show that the emotion recognition accuracy of the integrated system is better than that of either of the two individual approaches.
1. Introduction
Human-machine interface technology has been investigated for several decades. Recent
research has placed more emphasis on the recognition of nonverbal information, with a
particular focus on emotional reactions. Many kinds of physiological and behavioral signals are used
to extract emotions, such as voice, facial expressions, hand gestures, body movements,
heartbeat, and blood pressure. Scientists have found that emotion technology can be an
important component in artificial intelligence [Salovey et al. 1990], especially for
human-human communication. Although human-computer interaction is different from
human-human communication, some theories show that human-computer interaction shares
basic characteristics with human-human interaction [Reeves et al. 1996]. In addition, affective
information is pervasive in electronic documents, such as digital news reports, economic
reports, e-mail, etc. The conclusions reached by researchers with respect to emotion can be
extended to other types of subjective information [Subasic et al. 2001]. For example,
education assistance software should be able to detect the emotions of users and, therefore,
choose suitable teaching courses. Moreover, the study of emotions can be applied to
assistance systems, such as virtual babysitting systems or virtual psychologist systems.
In recent years, several research works have focused on emotion recognition. Cohn and
Katz [Cohn et al. 1998] developed a semi-automated method for emotion recognition from
faces and voices. Silva [Silva et al. 2000] used the HMM structure to recognize emotion from
both video and audio sources. Yoshitomi [Yoshitomi et al. 2000] combined the hidden
Markov model (HMM) and neural networks to extract emotion from speech and facial
expressions. Other researchers focused on extracting emotion from speech data only. Fukuda
and Kostov [Fukuda et al. 1999] applied a wavelet/cepstrum-based software tool to perform
emotion recognition from speech. Yu [Yu et al. 2001] developed a support vector machine
(SVM)-based emotion recognition system. However, few approaches have focused on emotion
recognition from textual input. Textual information is another important communication
medium and can be retrieved from many sources, such as books, newspapers, web pages,
e-mail messages, etc. It is not only one of the most popular communication media but is also rich in
emotional cues. With the help of natural language processing techniques, emotions can be extracted
from textual input by analyzing punctuation, emotional keywords, syntactic structure,
semantic information, etc. In [Chuang et al. 2002], the authors developed a semantic network
for performing emotion recognition from textual content. That investigation focused on the
use of textual information in emotion recognition systems. For example, identifying the
emotional keywords in a sentence is very helpful for determining the emotional state of the sentence.
A possible application of textual emotion recognition is the on-line chat system. With many
on-line chat systems, users are allowed to communicate with each other by typing or speaking.
A system can recognize a user’s emotion and give an appropriate response.
In this paper, a multi-modal emotion recognition system is constructed to extract emotion
information from both speech and text input. The emotion recognition system classifies
emotions according to six basic types: happiness, sadness, anger, fear, surprise and disgust. If
the emotion intensity value of the currently recognized emotion is lower than a predefined
threshold, the emotion output is determined to be neutral. The proposed emotion recognition
system can detect emotions from two different types of information: speech and text. To
evaluate the acoustic approach, a broadcast drama, including speech signal and textual content,
is adopted as the training corpus instead of artificially produced emotional speech. During feature
selection, an initial acoustic feature set containing 33 features is first extracted and
analyzed. These acoustic features capture several possible characteristics, such as intonation,
timbre, acoustics, tempo, and rhythm. We also extract some features to represent special
intonations, such as trembling speech, unvoiced speech, and crying speech. Finally, among
these diverse features, the most significant features are selected by means of principal
component analysis (PCA) to form an acoustic feature vector. The acoustic feature vector is
fed to the Support Vector Machines (SVMs) to determine the emotion output according to
hyperplanes determined by the training corpus.
For emotion recognition via text, we assume that the emotional reaction of an input
sentence is essentially conveyed by the words it contains. Two primary word types,
“emotional keywords” and “emotion modification words,” are manually defined and used to
extract emotion from the input sentence. All the extracted emotional keywords and emotion
modification words have their corresponding “emotion descriptors” and “emotion
modification values.” For each input sentence, the emotion descriptors are averaged and
modified using the emotion modification values to give the current emotion output. Finally,
the outputs of the textual and acoustic approaches are combined with the emotion history to
give the final emotion output.
The rest of the paper is organized as follows. Section 2 describes the module for
recognizing emotions from speech signals. The details of the SVM classification model are also
provided in this section. Then the textual emotion recognition module and the integration of
these two modules are presented in sections 2.3 and 3, respectively. Finally, experimental
results obtained using the integrated emotion recognition system are provided in section 5, and
some conclusions are drawn in section 6.
The ratios described in categories (11) and (12) represent not only the slope but also the
shape of each vibration in the contour. Figure 2 shows the difference between these
parameters. In this figure, each part shows the vibration of a contour. In order to show how the
parameters are used, we assume that the length and the amplitude of these two contours are the
same. In part A, the length of the upslope contour is longer than that of the downslope contour,
while the opposite is shown in part B. The ratio of upslope to downslope is 3.14 (22 upslope
samples to 7 downslope samples) in part A and 0.26 (6 upslope samples to 23 downslope
samples) in part B.
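For illustration, the following is a minimal sketch of how such a ratio could be computed from a sampled contour, assuming the contour is given as a 1-D array and that rising and falling samples are identified from consecutive differences; the function name and array layout are illustrative, not taken from the original system.

```python
import numpy as np

def upslope_downslope_ratio(contour):
    """Ratio of rising samples to falling samples in a contour.

    `contour` is assumed to be a 1-D array of pitch (or energy) values;
    consecutive differences greater than zero count as upslope samples and
    differences less than zero as downslope samples.
    """
    diffs = np.diff(np.asarray(contour, dtype=float))
    up = int(np.count_nonzero(diffs > 0))
    down = int(np.count_nonzero(diffs < 0))
    return up / down if down else float("inf")

# For contours like those in Figure 2, this would give roughly
# 22 / 7 ≈ 3.14 for part A and 6 / 23 ≈ 0.26 for part B.
```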
Trembling speech can be characterized by means of pitch vibration. For category (13),
the pitch vibration is defined and calculated as follows:
$$P_r = \frac{1}{N}\sum_{i=0}^{N-1}\delta\big[(P(i)-\bar{P})(P(i+1)-\bar{P})\big], \qquad \delta[x]=\begin{cases}1, & x<0\\ 0, & x\ge 0\end{cases} \tag{1}$$
where $\bar{P}$ is the mean value of the pitch contour.
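A possible implementation of Equation (1) is sketched below, assuming the pitch contour is available as a 1-D array of pitch values from the voiced frames and taking N as the number of consecutive sample pairs; the helper name is hypothetical.

```python
import numpy as np

def pitch_vibration(pitch_contour):
    """Pitch vibration P_r of Equation (1): the fraction of consecutive
    pitch samples that lie on opposite sides of the mean pitch.

    `pitch_contour` is assumed to be a 1-D array of pitch values from the
    voiced frames, and N is taken as the number of consecutive pairs.
    """
    p = np.asarray(pitch_contour, dtype=float)
    centered = p - p.mean()
    # delta[x] = 1 when x < 0, i.e. when P(i) and P(i+1) straddle the mean.
    crossings = centered[:-1] * centered[1:] < 0
    n = crossings.size
    return float(crossings.sum()) / n if n else 0.0
```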
detailed subspace. In PCA, each principal component is a linear combination of the original
features. If a principal component is selected, the features that have larger combination
weights are also selected and form a feature subspace. The combination weights of the original
features are given by the transformation matrix calculated in PCA. By setting the
threshold on the combination weights to a value of 0.2, we can select the significant
features for each principal component to form a feature set. Therefore, we have 14 feature
subspaces.
Table 1 shows an example of feature subspace generation. Suppose that F1 to F5 are the
original features, that P1 and P2 are the selected principal components in PCA, and that the
values indicate the combination weights. By selecting the original features according to values
that are greater than the threshold of 0.2, we can select {F1, F3, F4 } as the first feature
subspace from P1 and {F2, F4} as the second feature subspace from P2.
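The subspace-generation step can be sketched as follows, assuming the 33-dimensional acoustic feature vectors are stacked in a matrix and that the 0.2 threshold is applied to the magnitude of the combination weights; scikit-learn's PCA is used here only as a convenient stand-in for the transformation matrix computed in PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

def build_feature_subspaces(acoustic_features, n_components=14, weight_threshold=0.2):
    """Form one feature subspace per selected principal component.

    `acoustic_features` is assumed to be a matrix of 33-dimensional acoustic
    feature vectors; for each retained component, the original features whose
    combination weight exceeds the 0.2 threshold (in magnitude, an assumption)
    are kept as that component's subspace.
    """
    pca = PCA(n_components=n_components).fit(acoustic_features)
    subspaces = []
    for loadings in pca.components_:            # one weight vector per component
        selected = np.where(np.abs(loadings) > weight_threshold)[0]
        subspaces.append(selected)              # indices of the original features
    return subspaces
```

Applied to the Table 1 example, the first component would yield the subspace {F1, F3, F4} and the second {F2, F4}.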
Traditional SVMs can construct a hard decision boundary with no probability output. In this
study, SVMs with continuous probability output are proposed. Given a test sample $x'$, the
probability that $x'$ belongs to class $c$ is $P(\mathit{class}_c \mid x')$. This value is estimated based on the
following factors:
• the distance between the test input and the hyperplane,
$$R = \frac{D(x')/\lVert w \rVert}{1/\lVert w \rVert} = D(x')\,; \tag{3}$$
$$R' = \frac{R}{D(\bar{x})} = \frac{D(x')}{D(\bar{x})}\,; \tag{4}$$
$$P(\mathit{class}_c \mid x') = \frac{P_c}{1+\exp(1-R')} = \frac{P_c}{1+\exp\!\left(1 - \dfrac{D(x')}{D(\bar{x})}\right)}. \tag{6}$$
As described above, the acoustic feature set is divided into 14 feature sub-spaces. For
each sub-space, an SVM model is applied to decide on the best class of the speech input. The
final output is the combination of these SVM outputs, computed as follows:
$$P(\mathit{class}_c \mid x') = \left(\prod_{i=1}^{S} P_i(\mathit{class}_c \mid x')\right)^{1/S} = \left(\prod_{i=1}^{S} \frac{P_c}{1+\exp\big(1 - D(x')/D(\bar{x})\big)}\right)^{1/S}, \tag{7}$$
where the probability $P_i(\mathit{class}_c \mid x')$ is the output of the SVM in the $i$-th feature subspace and $S$ (= 14)
is the number of sub-spaces.
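A rough sketch of Equations (6) and (7) is given below, assuming each subspace SVM supplies its class term P_c, the distance D(x'), and the normalizing distance in the denominator; the function names and the reading of P_c as a class-dependent prior are assumptions.

```python
import numpy as np

def svm_probability(p_c, distance, reference_distance):
    """Per-subspace probability of Equation (6).

    `p_c` plays the role of the class-dependent term P_c (interpreted here
    as a prior, which is an assumption), `distance` is D(x'), and
    `reference_distance` is the normalising distance in the denominator.
    """
    return p_c / (1.0 + np.exp(1.0 - distance / reference_distance))

def combine_subspace_probabilities(subspace_probs):
    """Geometric-mean combination of Equation (7).

    `subspace_probs` is assumed to have shape (S, C): one probability
    P_i(class_c | x') per feature subspace i and emotion class c.
    """
    probs = np.asarray(subspace_probs, dtype=float)
    s = probs.shape[0]                      # S = 14 subspaces in the paper
    # Geometric mean over subspaces, computed in log space for stability.
    return np.exp(np.log(np.clip(probs, 1e-12, None)).sum(axis=0) / s)
```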
the emotion modification words can enhance or suppress the emotional state. Finally, the overall
emotional state is determined by combining the recognition results from both the textual content
and the speech signal.
from 0 to 1. The emotional state label can be one of the following six labels: happiness,
sadness, anger, fear, surprise, and disgust. The intensity value describes how strongly the
keyword belongs to this emotional state. In many cases, however, a word may convey more than
one emotional reaction. Accordingly, there may be more than one emotion descriptor for
each emotional keyword. For example, two emotional states, sadness and anger, are involved
in the keyword “disappointed.” However, the keyword “depressed” is annotated with only one
emotional state: sadness. After the tagging process is completed, the emotion descriptors of
the word “disappointed” are {(2, 0.2), (3, 0.6)}, and the emotion descriptor of the word
“depressed” is {(3, 0.6)}. The numbers 2 and 3 in the parentheses indicate the emotional states
anger and sadness, respectively. The numbers 0.2 and 0.6 represent the degree of the
emotional states. In the following, we describe how the emotional state is calculated. Consider
the following input sentence at time t:
$S_t$: “We felt very disappointed and depressed at the results.”
Here, the $i$th emotional keyword is represented by $k_i^t$, $1 \le i \le M_t$, and $M_t$ is the number of keywords in sentence $S_t$. In this example, $k_1^t$ and $k_2^t$ represent the words “disappointed” and “depressed,” respectively, and the value of $M_t$ is 2. For each emotional keyword $k_i^t$, the corresponding emotion descriptor is $(l_r^{ti}, v_r^{ti})$, $1 \le r \le R_i^t$, where $R_i^t$ represents the number of emotion descriptors of $k_i^t$. The variable $l_r^{ti}$ is the $r$th emotional state label, and $v_r^{ti}$ is the $r$th intensity value of $k_i^t$. The value of the emotional state label can range from 1 to 6, corresponding to the six emotional states: happiness, sadness, anger, fear, surprise, and disgust. In this case, the values of $R_1^t$ and $R_2^t$ are 2 and 1, respectively. For $k_1^t$, the values of $l_1^{t1}$, $l_2^{t1}$, $v_1^{t1}$, and $v_2^{t1}$ are 2, 3, 0.2, and 0.6, respectively. For $k_2^t$, the values of $l_1^{t2}$ and $v_1^{t2}$ are 3 and 0.6, respectively.
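The notation above can be pictured with a small, hypothetical Python structure; the dictionary layout is illustrative only.

```python
# Each emotional keyword maps to its emotion descriptors (label, intensity),
# with labels 1-6 standing for the six emotional states.
emotion_descriptors = {
    "disappointed": [(2, 0.2), (3, 0.6)],   # two descriptors, R = 2
    "depressed":    [(3, 0.6)],             # one descriptor,  R = 1
}

# For the sentence S_t above, M_t = 2 emotional keywords are spotted.
keywords_in_sentence = ["disappointed", "depressed"]
descriptors = [emotion_descriptors[k] for k in keywords_in_sentence]
```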
The emotion descriptors of each emotional keyword are manually defined based on a
Chinese lexicon containing 65620 words. In order to eliminate errors due to subjective
judgment, all the words are first tagged by three people individually and then cross-validated
by the other two people. For each word, if the results tagged by the different people are close, the
average of these values is set as the emotion descriptors of the word. If the three people
cannot reach a consensus, an additional person is asked to tag the word, and the
result is taken into consideration. In our experience, only a few words required additional
judgments.
The final tagged results for the emotion descriptors are shown in Table 2. A total of 496
words are defined as emotional keywords, and some of them are ambiguous: 423 of them
have a unique emotional label definition, 64 words have 2 emotional label definitions, and 9
words have 3 emotional label definitions. Most of the ambiguities occur in the anger and
sadness categories. For example, the word “unhappy” may indicate an angry emotion or a sad
emotion, according to the individual’s personality and situation.
The six elements in $E^{tC}$ represent the relationship between sentence $S_t$ and the six emotional states: happiness, sadness, anger, fear, surprise, and disgust, respectively. Each value is calculated as follows:
$$e_o^{tC} = \frac{1}{3}\left(\prod_{x=1}^{N} u_x^t\right)^{\!1/N} \left(\frac{\displaystyle\sum_{y=1}^{M_t}\sum_{z=1}^{R_y^t} S\!\left(l_z^{ty}, o\right) v_z^{ty}}{\displaystyle\sum_{y=1}^{M_t}\sum_{z=1}^{R_y^t} S\!\left(l_z^{ty}, o\right)}\right), \qquad 1 \le o \le 6. \tag{8}$$
The value in the first pair of parentheses is the geometric mean of all the emotion modification
values, and the value in the second pair of parentheses is the average of the intensity values that
belong to emotional state $o$. The function $S(l_z^{ty}, o)$ is a step function with a value of 1 when
$l_z^{ty} = o$ and a value of 0 when $l_z^{ty} \ne o$. The constant 1/3 is used to normalize the emotion
intensity value to the range from -1 to 1.
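A direct translation of Equation (8) might look like the following sketch, assuming the descriptors of the spotted keywords and the emotion modification values $u_x^t$ of the sentence are already available as plain Python lists; the handling of sentences without modification words and the positivity of the modification values are added assumptions.

```python
import numpy as np

def textual_emotion_vector(keyword_descriptors, modification_values):
    """Textual emotion output E^{tC} of Equation (8).

    `keyword_descriptors`: one list of (label, intensity) pairs per spotted
    emotional keyword; `modification_values`: the emotion modification
    values u_x^t of the sentence. Both layouts are illustrative assumptions,
    and the modification values are assumed positive so that their
    geometric mean is well defined.
    """
    u = np.asarray(modification_values, dtype=float)
    geo_mean = np.prod(u) ** (1.0 / len(u)) if len(u) else 1.0  # assumed default

    e = np.zeros(6)
    for o in range(1, 7):                       # the six emotional states
        num = den = 0.0
        for descriptors in keyword_descriptors:
            for label, intensity in descriptors:
                if label == o:                  # step function S(l, o)
                    num += intensity
                    den += 1.0
        avg = num / den if den else 0.0
        e[o - 1] = (1.0 / 3.0) * geo_mean * avg
    return e
```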
After the emotion reaction from the textual content has been calculated, the final emotion
output $E^t$ is the combination of $E^{tA}$ and $E^{tC}$:
$$E^t = \left(e_1^t, e_2^t, e_3^t, e_4^t, e_5^t, e_6^t\right), \qquad e_o^t = \alpha\, e_o^{tA} + (1-\alpha)\, e_o^{tC}, \quad \alpha = \max_{1 \le o \le 6}\left(e_o^{tA}\right). \tag{9}$$
The emotion output of the acoustic module, $e_o^{tA}$, ranges from 0 to 1, and the emotion output of the
textual module, $e_o^{tC}$, ranges from -1 to +1.
According to the assumption that the current emotional state is influenced by the
previous emotional states, the current emotion vector $E^t$ is modified by
means of the previous emotion vector $E^{t-1}$. The recursive calculation of the emotion history is
defined as follows:
$$E^{t\prime} = \delta E^{t} + (1-\delta)\, E^{t-1}, \qquad t \ge 1, \tag{10}$$
where $E^t$ is the $t$-th emotion vector calculated as described in the previous section, $E^{t\prime}$
is the final output that takes the emotion history into account, and the initial value $E^{0}$ is the
output without any modification. The combination coefficient $\delta$ is empirically set to 0.75.
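Equations (9) and (10) can be combined into one short sketch, assuming the acoustic and textual emotion vectors and the previous emotion vector are available as six-element arrays; the function name is illustrative.

```python
import numpy as np

def fuse_and_smooth(e_acoustic, e_textual, e_previous, delta=0.75):
    """Fuse the acoustic and textual outputs (Equation (9)) and smooth the
    result with the previous emotion vector (Equation (10)).

    All three inputs are assumed to be six-element vectors: acoustic values
    in [0, 1], textual values in [-1, 1], and e_previous = E^{t-1}.
    """
    e_a = np.asarray(e_acoustic, dtype=float)
    e_c = np.asarray(e_textual, dtype=float)
    alpha = e_a.max()                                   # alpha = max_o e_o^{tA}
    e_t = alpha * e_a + (1.0 - alpha) * e_c             # Equation (9)
    return delta * e_t + (1.0 - delta) * np.asarray(e_previous, dtype=float)  # Equation (10)
```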
5. Experimental Results
For the purpose of system evaluation, in order to obtain real emotional states from natural
speech signals, we collected the training corpus from broadcast dramas. There were 1085
sentences in 227 dialogues from the leading man and 1015 sentences in 213 dialogues from the
leading woman. The emotional states of these sentences were tagged manually. The emotion
tagging results are listed in Table 3.
The system was implemented on a personal computer with a Pentium IV CPU and 512
MB of memory. A high-sensitivity microphone was connected to the computer to capture the
speech signals in real time.
As shown in Figure 4, the achieved recognition rate was 63.33% when T = -1. When R²
= 91% and T = 0, the achieved recognition rate was 81.55%, the highest rate obtained in all the
tests. The results show that with R² = 91% and T = 0, PCA extracted an orthogonal feature space
from the original feature set, and the emotion recognition rate increased due to the
elimination of feature dependency.
Based on the results, we could decide on the appropriate number of feature sub-spaces.
Figure 5 shows the relation between the number of sub-spaces and R². Since the previous
experiment indicated that an appropriate value of R² was 91%, the appropriate number of
sub-spaces was chosen as 14 based on the curve in Figure 5.
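Assuming that R² here denotes the cumulative proportion of variance retained by the selected principal components, the choice of the number of components could be sketched as follows; scikit-learn is used only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(acoustic_features, target_ratio=0.91):
    """Number of principal components needed to reach the target R².

    Sketch behind the choice of 14 sub-spaces, assuming R² is read as the
    cumulative proportion of variance retained and `acoustic_features` is
    the matrix of 33-dimensional acoustic feature vectors.
    """
    pca = PCA().fit(acoustic_features)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # First index whose cumulative ratio reaches the target, counted from 1.
    return int(np.searchsorted(cumulative, target_ratio) + 1)
```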
Figure 5. The relationship between R² and the number of feature sub-spaces.
Figure 6. Textual emotion recognition rate (%) as a function of the keyword recognition rate (%).
As shown in Figure 6, the emotion recognition rate of the textual module did not increase
after the keyword recognition rate reached 75%. This means that once the keyword
recognition rate is higher than 75%, the output of the textual emotion recognition
module reaches an upper bound. Since the keyword recognition rate of the system can reach
89.6%, this keyword spotting system is suitable for the textual emotion recognition module.
The acoustic module is based on the assumption that the speech information is too
complicated to be classified using only one SVM. Thus, PCA is used to generate the feature
subspace. In order to test this assumption, we compared the recognition results for speech
input obtained using the classifier with a single SVM and multiple SVMs. Table 5 shows the
comparison and confirms the assumption.
Table 5. A comparison of the results obtained using the acoustic module with
a single SVM and multiple SVMs.
Emotion      Multiple SVMs      Single SVM
Happiness 75.37% 68.13%
Sadness 86.72% 75.91%
Anger 78.26% 66.57%
Fear 71.16% 60.55%
Surprise 68.05% 55.62%
Disgust 72.56% 64.54%
Neutral 82.96% 70.01%
Total 76.44% 65.90%
keyword-based approach is still helpful for improving performance when integrated with the
acoustic module.
6. Conclusion
In this paper, an emotion recognition system with multi-modal input has been proposed. With
PCA and the SVM model, the emotional state of a speech input can be classified; this result is then
combined with the output of the textual emotion recognition module. The approach to recognizing emotions
from textual information is based on pre-defined emotion descriptors and emotion
modification values. After all the emotion outputs have been integrated, the final emotional
state is further smoothed by means of the previous emotion history. The experimental results
show that the multi-modal strategy is a more promising approach to emotion recognition than
the single module strategy.
In our study, we investigated a method of textual emotion recognition and also tested the
combination of the two emotion recognition approaches. Our method can extract emotions
from both speech and textual information without the need for a sophisticated speech
recognizer. However, there are still many problems that remain to be solved. For example, in
the textual emotion recognition module, syntactic structure information is important for
natural language processing but cannot be obtained using HowNet alone. An additional parser
may be needed to solve this problem. In the acoustic module, crying and laughing sounds are
useful for deciding on the current emotional state but are hard to extract. A sound recognizer
may, thus, be useful for improving the emotion recognition performance.
References
Salovey, P. and J. Mayer, “Emotional Intelligence,” Imagination, Cognition and Personality,
vol. 9, no. 3, 1990, pp.185-211.
Reeves, B. and C. Nass, “The Media Equation: How People Treat Computers, Television
and New Media Like Real People and Places,” Cambridge University Press, 1996.
Subasic, P. and A. Huettner, “Affect Analysis of Text Using Fuzzy Semantic Typing,” IEEE
Transactions on Fuzzy Systems, vol. 9, no. 4, 2001, pp.483-496.
Cohn, J.F. and G.S. Katz, “Bimodal Expression of Emotion by Face and Voice,”
Proceedings of the Sixth ACM International Conference on Multimedia: Face/Gesture
Recognition and Their Applications, 1998, pp.41-44.
De Silva, L.C. and N.P. Chi, “Bimodal Emotion Recognition,” Proceedings of the Fourth
IEEE International Conference on Automatic Face and Gesture Recognition, 2000,
pp.332-335.
Yoshitomi, Y., S.I. Kim, T. Kawano, and T. Kitazoe, “Effect of Sensor Fusion for
Recognition of Emotional States Using Voice, Face Image and Thermal Image of
Face,” Proceedings of the Ninth IEEE International Workshop on Robot and Human
Interactive Communication, 2000, pp.173-183.
Fukuda, S. and V. Kostov, “Extracting Emotion from Voice,” Proceedings of IEEE
International Workshop on Systems, Man, and Cybernetics, vol. 4, 1999, pp.299-304.
Yu, F., E. Chang, Y.Q. Xu, and H.Y. Shum, “Emotion Detection from Speech to Enrich
Multimedia Content,” Proceedings of IEEE Pacific Rim Conference on Multimedia,
2001, pp.550-557.
Chuang, Z.J. and C.H. Wu, “Emotion Recognition from Textual Input using an Emotional
Semantic Network,” Proceedings of IEEE International Conference on Spoken
Language Processing, 2002, pp.2033-2036.
Abramowitz, M. and I.A. Stegun, “Legendre Functions and Orthogonal Polynomials,” in
Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical
Tables, New York: Dover, 1972, pp.331-339.
Cristianini, N. and J. Shawe-Taylor, “An Introduction to Support Vector Machines,”
Cambridge University Press, 2001.
Wu, C.H. and Y.J. Chen, “Multi-Keyword Spotting of Telephone Speech Using Fuzzy
Search Algorithm and Keyword-Driven Two-Level CBSM,” Speech Communication,
vol. 33, 2001, pp.197-212.
Lang, P.J., M.M. Bradley, and B.N. Cuthbert, “Emotion, Attention, and the Startle
Reflex,” Psychological Review, vol. 97, 1990, pp.377-395.