Thesis
December 2010
School of Engineering
College of Engineering and Computer Science
The Australian National University
Canberra, Australia
Declaration
This thesis describes the results of research undertaken in the School of Engineering,
College of Engineering and Computer Science, The Australian National University,
Canberra. This research was supported by a scholarship from The Australian National
University.
The results and analyses presented in this thesis are my own original work, ac-
complished under the supervision of Doctor Roland Göcke, Doctor Bruce Millar and
Doctor Antonio Robles-Kelly, except where otherwise acknowledged. This thesis has
not been submitted for any other degree.
Gordon McIntyre
School of Engineering
College of Engineering and Computer Science
The Australian National University
Canberra, Australia
10 May 2010
Acknowledgements
First of all I would like to thank the members of my supervisory panel Doctor Roland
Göcke, Doctor Bruce Millar and Doctor Antonio Robles-Kelly. They have added in-
valuable insight from their respective areas which has been a big help in this multi-
disciplinary project.
Roland, thank you for being an excellent supervisor and driving force, especially
at times when it all seemed a bit too hard. You abound with positive energy and are
blessed with shrewdness and patience well beyond your years. Bruce, I would like
to thank you for your constructive criticism and the benefit of the wisdom that you
accumulated over a distinguished career.
This PhD project would not have been as enjoyable without the support of staff and
fellow students at the College of Engineering and Computer Science and my colleagues
at the Centre for Mental Health Research. Thank you to all of you! I would like to also
thank the administrative and support staff for putting up with my inane questions and
providing prompt and professional assistance.
My gratitude goes to the Black Dog Institute in Sydney, it was the experience of
a lifetime to be a part of such a multi-disciplinary team in an incredibly innovative
organisation. It helped me to get an appreciation of the fantastic work they do in such
a complex field.
In the course of this project, I have made many friends from a diverse range of backgrounds and my life is all the richer for it. There are some special people that I need to acknowledge. To Dot, your spirit and determination are always an inspiration to me. To Abha, thank you for being so supportive and understanding. Lastly, to my children, who were oftentimes deprived of my attention but nevertheless seemed to show an interest in my work - thank you!
Abstract
Significant advances have been made in the field of computer vision in the last few
years. The mathematical underpinnings have evolved in conjunction with increases in
computer processing speed. Many researchers have attempted to apply these improve-
ments to the field of Facial Expression Recognition (FER).
In the typical FER approach, once an image has been acquired, possibly from cap-
turing frames in a video, the face is detected and local information is extracted from
the facial region in the image. One popular approach is to build a database of the raw
feature data, and then use statistical measures to group the data into representations
that correspond to facial expressions. Newly acquired images are then subjected to the
same feature extraction process, and the resulting feature data compared to that in the
database for matching facial expressions.
Academic studies tend to make use of freely available, annotated sets of images.
These community databases, used for training and testing, are usually built from acted or posed examples [Kanade 00, Wallhoff ] of primary or prototypical emotion expressions such as fear, anger and happiness. Making use of video or images captured
in a natural setting is less common, and fewer studies attempt to apply the techniques
to more subtle and pervasive moods and emotional states, such as boredom, arousal,
anxiety and depression.
Contents

Declaration
Acknowledgements
Abstract
List of Publications
Abbreviations
1 Introduction
1.1 Motivation
1.2 Objectives
2 Literature Review
2.1 Introduction
3 Affective Sensing
3.1 Introduction
3.3.3 Introduction
4.2.4 Classification
4.2.5 Miscellaneous
4.4.4 Segments
4.4.8 Classification
5.1.1 Introduction
5.2.1 Hypotheses
5.3 Methodology
5.4.1 Experiment 1
5.4.2 Experiment 2
5.4.3 Experiment 3
8 Conclusions
Bibliography
List of Figures

5.6 Real and imaginary parts of a Gabor wavelet, scale = 1.4, orientation = π/8 (5 × actual size)
5.9 Experiment 4 - Images fitted using generalised and specific AAMs
6.3 Participant’s view of the interview (video clip - Silence of the Lambs)
6.6 Old Paradigm - Stacked column chart comparing facial activity (Co - Control, Pa - Patient)
6.7 Old Paradigm - Clustered column chart comparing facial activity (Co - Control, Pa - Patient)
6.8 Old Paradigm - Line chart comparing accumulated facial activity (Co - Control, Pa - Patient)
6.9 Old Paradigm - Facial activity for each video
6.10 Old Paradigm - Number of happy expressions
6.11 Old Paradigm - Number of sad expressions
6.12 Old Paradigm - Number of neutral expressions
6.13 New Paradigm - Stacked column chart comparing facial activity (Co - Control, Pa - Patient)
6.14 New Paradigm - Clustered column chart comparing facial activity (Co - Control, Pa - Patient)
6.15 New Paradigm - Facial activity for each video
6.16 New Paradigm - Number of happy expressions
6.17 New Paradigm - Number of sad expressions
6.18 New Paradigm - Number of neutral expressions

List of Tables

3.1 Action units for surprise expressions [Ekman 76, Ekman 02]
3.2 Action units for fear expressions [Ekman 76, Ekman 02]
5.3 Results from poll - numbers labelled as fear and anxiety retained
B.4 Old Paradigm - Facial expressions - sorted by happy within video
B.5 Old Paradigm - Facial expressions - sorted by sad within video
B.6 Old Paradigm - Facial expressions - sorted by neutral within video
B.7 New Paradigm - Accumulated facial activity
B.8 New Paradigm - Facial activity for each video
B.9 New Paradigm - Facial expressions - sorted by happy within video
B.10 New Paradigm - Facial expressions - sorted by sad within video
B.11 New Paradigm - Facial expressions - sorted by neutral within video
Arthur Schopenhauer
1
Introduction
Significant advances have been made in the field of computer vision in the last few
years. The mathematical underpinnings have evolved, in conjunction with increases in
computer processing speed. Many researchers have attempted to apply these improve-
ments to the field of FER, following a process which typically resembles Figure 1.1.
Once an image has been acquired, possibly from capturing frames in a video, the
face is detected and local information is extracted from the facial region in the image.
One popular approach is to build a database of the raw feature data, and then use
statistical measures to group the data into representations that correspond to facial
expressions. Newly acquired images are then subjected to the same feature extraction
process, and the resulting feature data compared to that in the database for matching
facial expressions.
Academic studies tend to make use of freely available, annotated sets of images.
These community databases used for training and testing are usually built from acted,
posed or induced expressions [Kanade 00, jaf , MMI , Wallhoff , Wallhoff 06], and this
tends to give an oversimplified picture of emotional expression. Making use of video
or images captured in a natural setting is less common.
Most studies aim to recognise prototypical emotion expressions such as fear, anger
and happiness. Fewer attempt to apply the same techniques to more subtle and perva-
sive moods and emotional states, such as boredom, arousal, anxiety and depression.
The limitations placed on FER studies are quite understandable. [Ekman 82] has
shown that the facial expressions of anger, disgust, fear, joy, sadness, and surprise are
universal across human cultures (although in [Ekman 99] he did expand the list to in-
clude amusement, contempt, contentment, embarrassment, excitement, guilt, pride, re-
lief, satisfaction, sensory pleasure and shame). Outside of the “unbidden” [Ekman 82]
emotions, the display rules of facial expressions vary with factors such as culture, con-
text and personality type.
1.1 Motivation
The difficulties outlined above, however, should not preclude research into how the
use of recently developed techniques could be applied to non-primary, emotional ex-
pressions. Several studies have confirmed characteristics such as speech, face, ges-
ture, Galvanic Skin Response (GSR) and body temperature, as being useful in the
diagnosis and evaluation of therapy for anxiety, depression and psychomotor retarda-
tion [Flint 93, Moore 08]. Vocal indicators have been shown to be of use in the detec-
tion of mood change in depression [Ellgring 96]. Some studies have suggested linking
certain syndromes by comparing parameters from modalities, such as speech and mo-
tor, to discriminate different groups. In [Chen 03], the eye blink rate in adults was
used in an attempt to diagnose Parkinson’s disease. [Alvinoa 07] have attempted to
link the computerised measurement of facial expression to emotions in patients with
schizophrenia.
Thus, even if the results are to be used in conjunction with measurements of other
modal expressions, e.g. vocal, eye-blink or gaze, there is good reason to explore the
use of the recent developments in computer vision. One obvious incentive is to provide
low-cost and unobtrusive ways to sense for disorders.
1.2 Objectives
If the landmark points can be reliably and consistently found within an image then
the collective “shape” of the points, together with the pixel information, can be used
to build representations of facial expressions.
This dissertation comprises eight chapters, including this introduction. A brief outline
of each of the other chapters is as follows:
Chapter 5 - Sensing for Anxiety The experimental contribution of this thesis begins
in Chapter 5. Using anxiety as an example, the exercise serves as a proof of
concept for the techniques presented in Chapters 3 and 4 and tests whether more subtle expressions can be tracked using these concepts. The
experiments were conducted on the Cohn-Kanade [Kanade 00] database which
is available for academic use. Each of the experiments involves different aspects
and degrees of difficulty.
1
The Black Dog Institute is a not-for-profit, educational, research, clinical and community-oriented
facility in Sydney, Australia, offering specialist expertise in depression and bipolar disorder. Available
at http://www.blackdoginstitute.org.au/, last accessed 23 May 2010.
Even before I open my mouth to speak, the culture into which I’ve been born has entered and suffused it. My place of birth and the country where I’ve been raised, along with my mother tongue, all help regulate the setting of my jaw, the laxity of my lips, my most comfortable pitch.

Anne Karpf

2

Literature Review
2.1 Introduction
This chapter begins with a broad discussion of emotional expression. The accepted
practices for its elicitation and description are discussed before the focus is narrowed to a review of research on expressions and how they relate, more specifically, to anxiety and depression.
In Section 2.2, the somewhat difficult topic of defining emotions is broached, fol-
lowed by an introduction to the annotation schemes commonly used. This is followed,
in Section 2.3, by a brief description of the physiology contributing to emotional dis-
play. Next, in Section 2.4, FACS, the ubiquitous system for describing facial muscle
activity, is introduced. The chapter concludes with a discussion of how dysphoric con-
ditions, i.e. anxiety and depression, might affect emotional expression.
2.2 Describing Emotions

Defining emotion is a bit like trying to define “knowledge”. It has deep ontological
significance; it goes to the heart of human existence; yet there is no universal agreement
on its definition. Whilst it is possible to observe physical symptoms arising from our
internal state, attaining agreement on emotion definitions and categories is challenging.
Taxonomies vary across disciplines, and [Cowie 03] point out that psychology, biology
and ecology have different stances. There are qualitative and quantitative approaches.
Complicating the picture even further is the question of moods. We know that
moods are accompanied by physiological changes and affect our decisions, but are
they emotions? The common view in the literature is that moods are the longer-term form of the affective state [Picard 97]. Emotions are seen as reactions [Bower 92, Cowie 03], having a cause or stimulus and a brief experience associated with them, whereas mood is seen as lingering and less specific. Moods have valence but less intensity, and emo-
tions and mood can exist at the same time. Intuitively, one would think that they could
affect one another, and, presumably, affect the valence of the emotion.
[Cowie 03] present a good review of emotional classification regimes. Two ap-
proaches to describing emotions are dominant. The first, to define categories, is the
more common technique. The second approach uses dimensions. A third, less domi-
nant, approach makes use of the appraisal theory [Sander 05].
The various classification schemes are discussed in more detail in the following
subsections.
Category Approach
The most popular grouping of emotions, referred to as the “big six”, comprises fear,
anger, happiness, sadness, surprise and disgust [Cornelius 96, Ekman 99]. These are
regarded as full-blown emotions [Scherer 99] and are evolutionary, developmental and
cross-cultural in nature [Ekman 82, Ekman 99]. However, there are many alternative
groupings both across disciplines and within disciplines. Some studies concentrate
on only one or two select categories, others employ schemes using more than twenty
emotional archetypes. Thus, one of the difficulties in comparing results from studies
into emotions is that the choice of categories used between studies is not consistent
and will often depend on the application that the researcher has in mind. If the focus
of the research is to understand full-blown emotions, then the big-six or a subset might
be adequate. However, if the objective is to study the less dramatic emotional states in
everyday life, with all the shades and nuances that we know distinguish them, then the
choice of categories is much more difficult [Schröder 05].
Dimensional Approach
Another way to label the affective state is to use dimensions. Instead of choosing
discrete labels, one or more continuous scales, such as pleasant/unpleasant, atten-
tion/rejection or simple/complicated are used. Two common scales are valence (nega-
tive/positive) and arousal (calm/excited). Valence describes the degree of positivity or
negativity of an emotion or mood; arousal describes the level of activation or emotional
excitement. Sometimes a third dimension, control or attention, is used to address the
internal or external source of emotion. [Cowie 03] have developed a software applica-
tion called FEELTRACE to assist in the continuous tracking of emotional state on two
dimensions.
Appraisal Approach
Scherer has extensively studied the assessment process in humans and suggests that
people affectively appraise events with respect to novelty, intrinsic pleasantness, goal/need
significance, coping, and norm/self compatibility [Scherer 99].
It is not yet clear how to implement this approach in practice although there is at
least one quite complex paper describing a practical application of the model [Sander 05].
Both the categorical and dimensional approaches, whilst practical, suffer from being
highly subjective. This is not only due to complexity, but also because of the dif-
ferences in the efficiency of listeners’ physiology and the fact that the listener’s own
affective state influences their judgment of the speaker’s affective state. Hence, there
is a need for a model that includes listener attributes.
A study by [Devillers 05] has included labelling of blended and secondary emo-
tions in a corpus of medical emergency call centre dialogues, as well as including
task-specific context annotation to one of the corpora.
Apart from the obvious, speech carries a great deal more information than just the ver-
bal message. It can tell us about the speaker, their background and their emotional
state. Age, gender, culture, social setting, personality and well-being all play their part
in suffusing our communication apparatus even before we begin to speak. Studies by
[Koike 98, Shigeno 98] have shown that subjects more easily identify an emotion in a
speaker from their own culture, and that people will predominantly use visual informa-
tion to identify the emotion. Everyday expressions such as “lump in the throat”, “stiff
upper lip”, “plum in the mouth”, point to our awareness of the physiological changes
that emotions have on the voice.
Early work from James [James 90] contended that emotions could be equated with
awareness of a visceral response; in other words, the contention that emotions follow
physical stimuli. That may be true for fast primary emotions; however, the twentieth-century view is that emotions are antecedent, and can more often be detected from
physiological measurements. For example, your heart rate goes up when you discover
that you have won lotto, you think that you have lost your ATM card, or you realise
that you have forgotten an important birthday.
in memory. The stimulus may still be processed directly via the amygdala, but is now
also analysed in the thought process before processing.
Some affective states like anxiety can influence breathing, resulting in variations
in sub-glottal pressure. Drying of the mucous membrane causes shrinking of the voice.
Rapid breath alters the tempo of the voice and relaxation tends to deepen the breath
and lower the voice. Changes in facial expression can also alter the sound of the voice.
Figure 2.1 represents the typical cues to the six most common emotion categories
[Murray 93].
Darwin raised the issue of whether or not it was possible to inhibit emotional
expression [Ekman 03]. This is an important question in human emotion recognition
and in emotion recognition by computers. Intentional or not, the voice and face are
used in everyday life, to judge verisimilitude in speakers. Many studies [Anolli 97]
[Hirschberg 05] have investigated the detection of deception in the voice.
Although some studies have made use of MPEG-4 compliant Facial Definition Parameters
(FDP) [Cowie 05b], the most ubiquitous and versatile method of describing facial be-
haviour, pioneered by Ekman [Ekman 75, Ekman 82, Ekman 97, Ekman 99, Ekman 03],
is FACS. The goal of FACS is to provide an accurate description of facial activity based
Despite being based on musculature, FACS measurement units are Action Units (AUs), which are the identifiable actions of individual muscles or groups of muscles. One muscle can be represented by a single AU. Conversely, the appearance changes produced by one muscle can sometimes appear as two or more relatively independent actions attributable to different parts of the muscle. A FACS coder decomposes an observed expression into the specific AUs that produced the movement and records them. For example, during a smile, the zygomaticus major muscle
is activated, corresponding to AU12. During a spontaneous or Duchenne smile, the
orbicularis oculi muscle is recruited, corresponding to AU6. The FACS coder records
the AUs, and if needed, the duration, intensity, and asymmetry. Figure 2.2 and Table
2.1, together, are included to give a description of the facial muscles and an exam-
ple of some common AUs. FACS scores, in themselves, do not provide interpretations.
EMFACS, as mentioned previously, deals only with emotionally relevant facial action
units.
2.5 Anxious Expression

There are many types of afflictions that are labelled as “anxiety”, e.g. test anxiety (a
type of performance anxiety), death anxiety and stage fright. Typically, these have
short-term impact and, depending on the level of arousal, may or may not affect a
person’s performance. It is when anxiety begins to affect someone’s day-to-day life that it is classed as a disorder. An anxiety disorder is an umbrella term used to cover
AU Description
1 Inner Brow Raiser – Frontalis (pars medialis)
2 Outer Brow Raiser – Frontalis (pars lateralis)
4 Brow Lowerer – Corrugator and Depressor supercilii
5 Upper Lid Raiser – Levator palpebrae superioris
6 Cheek Raiser – Orbicularis oculi (pars orbitalis)
7 Lid Tightener – Orbicularis oculi (pars palpebralis)
9 Nose Wrinkler – Levator labii superioris alaeque nasi
10 Upper Lip Raiser – Levator labii superioris
11 Nasolabial Deepener – Zygomaticus minor
12 Lip Corner Puller – Zygomaticus major
13 Cheek Puffer – Levator anguli oris
14 Dimpler – Buccinator
15 Lip Corner Depressor – Depressor anguli oris
16 Lower Lip Depressor – Depressor labii inferioris
17 Chin Raiser – Mentalis
• phobia;
• panic disorder.
In the discussion that follows, social anxiety disorder has been included along with phobia.
GAD
GAD is usually diagnosed with reference to an instrument such as the Diagnostic and
Statistical Manual of Mental Disorders (DSM-IV) [dsm 00] [Kvaal 05]. Broadly speak-
ing, someone who has felt anxious for at least six months, and whose anxiety is adversely affecting their life, will meet the criteria. The anxiety might be associated with issues such as
finances, illness or family problems. The adverse impact might include factors such as
insomnia, missed work days or fatigue.
[beyondblue ] report that GAD affects approximately 5 per cent of people in Aus-
tralia at some time in their lives.1 Diagnosing GAD can be difficult for a clinician as
the symptoms are shared with other types of anxiety and it often coexists with other
psychiatric disorders, e.g. depression or dysthymia (chronic, “low-grade” depression).
The symptoms of GAD are so broad that it would be difficult to imagine it ever
being capable of detection by machine.
Phobia
• agoraphobia - fear of open spaces such as parks and big shopping centres;
• claustrophobia - fear of small spaces such as lifts, aeroplanes and crowded rooms;
• mysophobia - fear of dirt and germs in places such as toilets and kitchens;
• social phobia or social anxiety - fear of social situations such as parties and
meetings; and
Anxious episodes of the types listed above are the easiest to induce (ethically and
practically) and are the most common candidates for facial expression recognition ex-
periments.
OCD
• The thoughts, impulses, or images are not simply excessive worries about real-
life problems;
• The person recognises that the obsessional thoughts, impulses, or images are a
product of his/her own mind (not imposed from without, as in thought inser-
tion) [dsm 00].
• cleaning or hand-washing;
• checking things repeatedly, e.g. that appliances are turned off or that doors and
windows are locked;
OCD affects 2 to 3 per cent of people in Australia at some time in their lives
[beyondblue ].
The varied and sometimes serious nature of the compulsions and situations make
OCD an unlikely candidate for facial expression recognition.
PTSD
PTSD occurs when a person has been exposed to a traumatic event in which both of the following were present:
• the person experienced, witnessed, or was confronted with an event or events that
involved actual or threatened death or serious injury, or a threat to the physical
integrity of self or others; and
• insomnia;
• amnesia.
Approximately 8 per cent of people in Australia are affected by PTSD at some time
in their lives [beyondblue ]. PTSD has been the subject of some interesting virtual
reality applications for rehabilitation. In 2010, the U.S. Army began a four-year study
to track the results of using virtual reality therapy to treat Iraq and Afghanistan war
veterans suffering PTSD.2
The serious nature of this type of anxiety would mean that there would be some
important ethical and patient-care considerations to be met before any study is under-
taken.
Panic Disorder
The DSM-IV first sets out the definition of Panic Attack as:
2. sweating;
3. trembling or shaking;
5. feeling of choking;
After first discounting the effects of substance abuse, a general medical condition or some other medical disorder that might better account for the condition, e.g. OCD, the criteria for Panic Disorder are specified as recurring, unexpected Panic Attacks where
at least one of the attacks has been followed by 1 month (or more) of one (or more) of
the following:
• worry about the implications of the attack or its consequences (e.g., losing con-
trol, having a heart attack, “going crazy”); or
Around 3 per cent of the Australian population has experienced a panic disorder
[beyondblue ]. Although this type of anxiety would make an interesting research topic,
one would think that there would be some quite restrictive ethical considerations.
Numerous studies in the past twenty years have confirmed characteristics such as
speech, face, gesture, galvanic skin response (GSR) and body temperature, as being
useful in the diagnosis and evaluation of therapy for anxiety, depression and psy-
chomotor retardation [Flint 93]. Some earlier studies have suggested linking certain
syndromes by comparing parameters from modalities such as speech and motor to
discriminate different groups. [Chen 03] attempted to link eye blink rate in adults to
Parkinson’s disease.
Anxiety is sometimes confused with fear, which is a reaction normally commensu-
rate with some form of imminent threat. In the case of anxiety, the perceived threat is
usually in the future and the reaction tends to be irrational or out of proportion to the
threat. The facial expressions of fear and anxiety, however, are similar [Harrigan 96].
Confounding the problem is that the facial expression of surprise is similar to that
of fear. One key difference between the fearful and the surprise expression is the
mouth movement - a fearful expression involves a stretching of the lips in the horizontal
direction rather than opening of the mouth. The AUs for fearful expressions are shown
in Table 3.2.
Emotion: Fear

Prototypes:
1+2+4+5*+20*+25
1+2+4+5*+25

Major variants:
1+2+4+5*+L or R20*+25, 26, or 27
1+2+4+5*
1+2+5Z, with or without 25, 26, 27
5*+20*, with or without 25, 26, 27

Notes:
* means that in this combination the AU may be at any level of intensity.
L - Left, R - Right.
Action Unit 1 (Inner Brow Raiser), Action Unit 2 (Outer Brow Raiser), Action Unit 5 (Upper Lid Raiser), Action Unit 20 (Lip Stretcher), Action Unit 25 (Lips part Jaw Drop).
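To illustrate how a prototype table of this kind can be used computationally, the following Python sketch checks an observed set of AUs against the fear combinations listed above. The encoding of the combinations as sets of AU numbers, and the decision to ignore the intensity (*) and laterality (L/R) markers, are simplifications made here purely for illustration; they are not part of FACS or of the experimental work in this thesis.

# A minimal sketch: matching an observed set of AUs against the fear
# combinations in the table above. Intensity ("*") and laterality ("L"/"R")
# markers are deliberately ignored; a real FACS-based system would need to
# handle them explicitly.

FEAR_COMBINATIONS = [
    {1, 2, 4, 5, 20, 25},   # prototype 1+2+4+5*+20*+25
    {1, 2, 4, 5, 25},       # prototype 1+2+4+5*+25
    {1, 2, 4, 5},           # variant   1+2+4+5*
    {5, 20},                # variant   5*+20*
]

def looks_like_fear(observed_aus):
    """Return True if the observed AUs contain any fear combination.

    observed_aus is an iterable of integer AU numbers, e.g. the output of a
    hypothetical AU detector.
    """
    observed = set(observed_aus)
    return any(combo <= observed for combo in FEAR_COMBINATIONS)

print(looks_like_fear([1, 2, 4, 5, 20, 25, 26]))  # True
print(looks_like_fear([6, 12]))                   # False (a Duchenne smile)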
Relatively little research has been conducted to definitively map which action units
are associated with anxiety. While available research generally supports the efficacy
of human ability to judge anxiety from facial expressions [Harrigan 96, Harrigan 97,
Harrigan 04, Ladouceur 06], understandably, due to logistical considerations, much of
the work has been conducted within the confines of social anxieties and in specific situ-
ations such as dental treatment [Buchanan 02], examinations, public speaking, children
receiving immunisations, medical examinations [Buchheim 07] and human-computer
interactions [Kaiser 98]. [Harrigan 96] reports:
Of importance, [Harrigan 96] found that the brow movement exhibited when fear
is experienced, i.e. brows raised and drawn together, was displayed by the participants
in the study, but less often than the mouth movement for fear. They summarise:
“The most predominant fear element was the horizontal mouth stretch
movement. This horizontal pulling movement of the mouth and brief ten-
sion of the lips was clearly visible on the videotapes and could not be
confused with movements required in verbalization, smiling, or other fa-
cial action units. The brow movement exhibited when fear is experienced,
brows raised and drawn together, was displayed by the participants in this
study, but less often than the mouth movement for fear.”
The finding of [Kaiser 98], while studying human-computer interactions, was that
AU20 (lip stretcher) is found more often only in fear. [Ellgring 05] noted that actor
portrayals provide little evidence for an abundance of distinctive AUs or AU combina-
tions, that are specific for basic emotions. [Ellgring 05] reports that there is quite a
distribution of AUs through each emotion - including anxiety. Muddying the waters
is the fact that there can be large variations in the way individuals react to stressful
stimuli. Genetics, personality and biochemistry factors all play a part in the propensity
to display an anxious expression.
[Lazarus 91] concludes that a cognitive appraisal of threat is a prerequisite for the
experience of this type of emotion. If this is the case, then this does not augur well for
the ability to detect anxiety in a system trained from acted expressions. That is, one
would question how well actors could portray anxious expressions without a threat
stimulus.
2.6 Non-verbal Communication in Depression

As in the previous section, this section outlines the effects on facial expressions. Unlike the previous section, it is expanded to include the processing of emotional content by patients with MDD, in order to provide a background to the experimental work in Chapter 6.
Early attempts to link facial activity with depression used broad measurements
such as cries, smiles, frowns [Grinker 61]. Some studies have used Electromyography
(EMG) to measure muscle response, notwithstanding the somewhat intrusive and con-
straining nature of the equipment [Fridlund 83]. The more recent trend is to use
the FACS, as described in Section 2.4, to add rigour and objectivity to the process
[Reed 07, Renneberg 05, Cohn 09, McDuff 10].
The difficulty is that capturing and recording measurements of facial activity manually requires time, effort, training, and a regime for maintaining objectivity. Such manual work is tedious and prone to errors. However, even the abridged version of FACS,
EMFACS [EMFACS ], which deals only with emotionally relevant facial action units,
requires a scoring time of approximately 10 minutes of measurement for one minute of
facial behaviour. Further, only people who have passed the FACS final test are eligible
to download EMFACS for use.
In [Ellgring 08], the levels of facial activity, before and after treatment, of endogenous
and neurotic depressives were measured through several key indicators. [Ellgring 08]
hypothesised that facial activity and the repertoire of its elements would be reduced during depression and would expand with improvement of subjective wellbeing. Facial be-
haviour was analysed by applying EMFACS to the videotapes of 40 clinical interviews
of 20 endogenous depressed patients. After analysing a frequency distribution of all
of the AU observations across all of the interviews, 13 AUs or groups of AUs were
then used to complete the study. AUs that nearly always occur together, e.g. AU1 and
AU2 (inner and outer brow raiser), AU6 and AU12 (cheek raiser and lip corner puller)
were considered as one AU group. Activity, repertoire and patterns were defined as
parameters of facial activity and can be summarised as the following three measurements (a computational sketch follows the list):
• General facial activity: The first measurement, total number of AUs in a spe-
cific interval, counts the number of single, significant AUs and groups of closely
related AUs that occur within a 5 minute interval.
• Specific facial activity: The second, frequency of specific, major AUs, defines
the number of major AU combinations, e.g. AU6+12 in the case of a spontaneous
smile, occurring within a 5 minute interval.
• Repertoire: The third, repertoire of AUs, is the number of distinct AUs occurring
more than twice within a 5 minute interval.
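To make these definitions concrete, the sketch below computes the three measures from a list of timestamped AU observations over 5-minute intervals. The input format (pairs of time in seconds and an AU label) and the set of major AU combinations are assumptions made purely for illustration; they are not the data structures used in [Ellgring 08] or in this thesis.

from collections import Counter

INTERVAL = 5 * 60  # 5-minute interval, as in the measures described above

# Hypothetical set of "major" AU combinations, e.g. the spontaneous smile.
MAJOR_COMBINATIONS = {"6+12"}

def facial_activity_measures(observations):
    """Compute the three measures per 5-minute interval.

    observations is a list of (time_in_seconds, au_label) pairs, where
    au_label is a string such as "4" or "6+12" (closely related AUs are
    assumed to be grouped already).
    """
    intervals = {}
    for t, au in observations:
        intervals.setdefault(int(t // INTERVAL), []).append(au)

    results = {}
    for idx, aus in sorted(intervals.items()):
        counts = Counter(aus)
        results[idx] = {
            # General facial activity: total number of AU occurrences.
            "general_activity": len(aus),
            # Specific facial activity: occurrences of major AU combinations.
            "specific_activity": sum(counts[c] for c in MAJOR_COMBINATIONS),
            # Repertoire: distinct AUs occurring more than twice.
            "repertoire": sum(1 for n in counts.values() if n > 2),
        }
    return results

demo = [(10, "6+12"), (40, "4"), (70, "6+12"), (200, "6+12"), (400, "1+2")]
print(facial_activity_measures(demo))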
Depressed subjects have been shown to respond differently to images of negative and
positive content, when compared with non-depressed subjects. The underlying cause
could be the impaired inhibition of negative affect, which has been found in depressed
patients across several studies [Goeleven 06, Lee 07]. In turn, altered patterns of facial
activity have been reported in those patients suffering MDD [Reed 07, Renneberg 05].
If this “affective facial processing loop”, as shown in Figure 2.3, is reliable, then 1) ob-
jective observations could be of clinical importance in diagnosis; and 2) measurements
of facial activity could possibly predict response to treatments such as pharmacother-
apy and Cognitive Behaviour Therapy (CBT).
by dysfunctional inhibition toward negative content. It has been postulated that this is because depressed subjects show lowered activation in the regions responsible for gaining attentional control over emotional interference. In [Goeleven 06], inhibition
to positive and negative stimuli was studied across subjects including hospitalised
depressed patients, formerly depressed, and never-depressed control subjects. They
report that depressed patients show a specific failure to inhibit negative information
whereas positive information was unaffected. Surprisingly, they report that formerly depressed subjects also display impaired inhibition of negative content.
[Joormann 07] similarly found attentional bias was evident even after individuals
had recovered from a depressive episode. In that study, the attentional biases in the
processing of emotional faces in currently and formerly depressed participants and
healthy controls were examined. Faces expressing happy or sad emotions paired with
neutral faces were presented (in a dot-probe task). Currently and formerly depressed participants selectively attended to the sad faces, whereas the control participants selectively avoided the sad faces and oriented toward the happy faces. They also report a positive bias in the controls that was not observed for either of the depressed groups.
In an evaluation of the evidence in studies which have used modified Stroop and
visual probe tests, [Mogg 05] has found that the “inhibition theory” only holds true if the material is relevant to the subject’s “negative self-concept” and the stimulus is presented
for longer durations. [Joormann 06] found that depressed participants required signif-
icantly greater intensity of emotion to correctly identify happy expressions, and less
intensity to identify sad than angry expressions.
grees of sadness. Although there are some limitations to the study due to sample size,
there seems to be evidence that excessive amygdala activity correlated to the process-
ing of sad faces during episodes of acute depression.
This impairment may extend to the offspring of parents with MDD. [Monk 08]
found (small volume corrected) greater amygdala and nucleus accumbens activation
to fearful faces, and lower nucleus accumbens activation to happy faces, in high-risk
subjects when attention was unconstrained.
In a study of 116 participants (30 men, 86 women), some with a history of MDD
and individuals with no psychopathological history, the smile response to a comedy
clip was recorded. Participants were asked to rate a short film clip. FACS coding was
applied to 11 seconds of the clips - long enough to allow for the 4-6 second spontaneous
smile [Frank 93]. Those with a history of MDD and current depression symptoms were
more likely to control smiles than were the asymptomatic group [Reed 07].
There is always something ridiculous about the emotions of people whom one has ceased to love.

Oscar Wilde

3

Emotional Expression Recognition by Machines
3.1 Introduction
This section provides a summary of the more recent approaches and developments in
the field of emotional expression recognition. The scope is limited to FER, although
it would be incomplete without some reference to recognising vocal expression. The
objective of the chapter is to provide a theoretical grounding for later, more practical
chapters.
Section 3.2 introduces the broad concepts found in affective sensing systems. This
is quite a large section and encompasses the elicitation of training data, an overview
of the typical processing and a comparison of approaches to feature extraction. Sec-
tion 3.3 describes the computer vision techniques that are of particular interest in this
dissertation.
3.2 Affective Sensing Systems

Affect is emotional feeling, tone, and mood attached to a thought, including its external
manifestations. Affective communication is the, often complex, multimodal interplay
of affect between communication parties (which potentially includes non-humans).
Much of our daily dose of affective communication constitutes small talk, or phatic
speech. Recognising emotions from the modulations in another person’s voice and fa-
cial expressions is perhaps one of our most important human abilities. Yet it is one of the greatest challenges for machines seeking to become more human-like. Affec-
tive sensing is an attempt to map manifestations or measurable physical responses to
affective states.
tem usually resembles Figure 3.1 (a vocal analog of face detection is silence detection).
Whilst systems may use different types of classifiers, e.g. k-Nearest Neighbour (k-NN)
[Cover 67], SVM [Vapnik 95], AdaBoost [Freund 99], the main differences surround
the feature extraction process, i.e. the number and type of features used; and, whether
they incorporate rule-based logic such as that presented by Pantic and Rothkrantz
[Pantic 00].
One major difference between vocal and facial expression is that speech signals
are inherently one-dimensional, whereas facial signals can be 2D or 3D - although the
use of 3D processing has only become popular in the last few years [Lucey 06]. In
the case of facial expression recognition, an additional differentiating factor is whether
holistic (spanning the whole or a large part of the face) features or, what Pantic terms,
“analytic” (sub-regions) of the face are used [Pantic 07].
[ten Bosch 00] and [Schröder 05] have explored what is possible in extending the ASR
framework to emotion recognition. The common ASR approach is to train probabilistic
models from extracted speech features and then to use pattern matching to perform
recognition.
Most studies into affective communication begin with the collection of audio and/or
video communication samples. The topic of collection of emotional speech has been
well covered by other reviews [Cowie 03, Cowie 05a, Scherer 03], so it is only briefly
summarised here.
To date, call centre recordings, recordings of pilot conversations, and news readings
have provided sensible sources of data to research emotions in speech. Samples of
this nature have the highest ecological validity. However, aside from the copyright and
privacy issues, it is very difficult to construct a database of emotional speech from this
kind of naturally occurring emotional data. In audio samples, there are the complica-
tions of background noise and overlapping utterances. In video, there are difficulties
in detecting moving faces and facial expressions. A further complication is the sup-
pression of emotional behaviour by the speaker who is aware of being recorded.
One technique introduced by Velten [Velten 68], is to have subjects read emotive texts
and passages which, in turn, induce emotional states in the speaker. Other techniques
include the use of Wizard of Oz setups where, for example, a dialog between a human
and a computer is controlled without the knowledge of the human. This method has
the benefit of providing a degree of control over the dialogue and can simulate a natural interaction.
The principal shortcoming of these methods is that the response to stimuli may
induce different emotional states in different people.
A popular method is to engage actors to portray emotions. This technique provides a great deal of experimental control over a range of emotions and, like the previous method, a degree of control over the ambient conditions.
One problem with this approach is that acted speech elicits how emotions should
be portrayed, not necessarily how they are portrayed. The other serious drawback is
that acted emotions are unlikely to derive from emotions in the way that [Scherer 04]
describe them, i.e. episodes of massive, synchronised recruitment of mental and so-
matic resources to adapt or cope with a stimulus event subjectively appraised as being
highly pertinent to the needs, goals and values of the individual.
The affective sensing of facial expression is more commonly known as automatic FER.
FER is somewhat similar in approach to face recognition but the objectives of the two
are quite different. In the former, representation of expressions is sought in sets of
images, possibly from different people, whereas in the latter, discriminating features
are sought, which will distinguish one face from a set of faces. FER works better if there
is large variation among the different expressions generated by a given face, but small
variation in how a given expression is generated amongst different faces [Daugman ].
Nevertheless, both endeavours have common techniques.
Broadly speaking, there are two approaches to automatic facial expression recogni-
tion. The first is a holistic one, in which a set of raw features extracted from an image
are matched to an emotional facial expression such as happy, sad, anger, pain, or pos-
sibly to some gestural emblem such as a wink, stare or eye-roll [Ashraf 09, Liu 06,
Martin 08, Okada 09, Saatci 06, Sung 08]. The second approach, an analytic one, is
more fine-grained and the face is divided into regions. The most popular scheme for
annotating regions is through the use of the FACS [Ekman 75, Ekman 82, Ekman 97,
Ekman 99, Ekman 03], in which surface or musculature movements are tracked using
an analytical framework.
Computer vision techniques [Cootes 95, Lucey 06, Nixon 01, Sebe 03, Sebe 05]
are used to detect features and build evidence of FACS AUs [Bartlett 99, Bartlett 02,
Bartlett 03, Bartlett 06, Lien 98, Lucey 06, Tian 01, Valstar 06b]. The theoretical ben-
efit of this approach relates to its purported reusability. For example, if a system can
accurately detect 20 major FACS AUs then, in theory, any facial expression that has a
repertoire using a combination of the 20 AUs can be detected.
Table 3.1: Action units for surprise expressions [Ekman 76, Ekman 02]
Table 3.2: Action units for fear expressions [Ekman 76, Ekman 02]
One of the problems with the holistic approach in comparing and validating results
is that each published report tends to use a different set or subset of classes or expres-
sions. Most experiments use a closed set of expressions and it is unknown how well
the reported systems would work, if at all, when an extraneous expression, or indeed
non-expression, is introduced. For example, if the training and testing database con-
sists of 2,000 images split evenly with happy, sad, angry and neutral expressions, the
article might report a 93% accuracy rate. However, if the database is then interspersed
with 500 other expressions (perhaps surprise, fear or yawn), it is likely that the system
in question will try to find the best match to one of the classes.
That is not to say that the analytic approach is an automatic choice either. Ac-
credited FACS coders do not always achieve consensus on AU displays. Not all AUs
are easily detectable in an image, and the ability of the systems are impacted by the
quality of the recordings. Holistic classification might have an advantage where fea-
tures are occluded; where, for instance, the subject has a beard, wears spectacles or is
Whilst there are some alternative strategies in recognition systems, e.g. some systems operate on profile poses [Pantic 04a, Pantic 04b] and the system in [Pantic 04a] incorporates case-based reasoning, most follow the fairly generic model depicted in Figure 3.1.
If the system processes video then the first stage is to capture some or all of the
images from the video at predefined intervals. Next, faces are usually detected and
then segmented from the images and this is certainly the case with the AAM approach
discussed later.
The most common method of face detection is that formulated by Viola and Jones
[Viola 01], which is implemented in the popular “openCV” software [OpenCV ] from
Intel. [Bartlett 05] report an improved face detection implementation using the Gen-
tleBoost algorithm.
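As a concrete illustration of this face detection stage, the sketch below uses the OpenCV implementation of the Viola-Jones detector. The cascade file name, the example image path and the parameter values are common defaults chosen for the sketch, not the settings used in this research.

import cv2

def detect_faces(image_path):
    """Detect faces with OpenCV's Viola-Jones (Haar cascade) detector."""
    # Frontal-face cascade shipped with OpenCV (an illustrative default).
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Typical parameter values; tuning depends on image size and quality.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [(x, y, w, h) for (x, y, w, h) in faces]

# Usage with a hypothetical frame captured from a video:
# print(detect_faces("frame_0001.png"))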
The next stage, facial feature extraction, is where most variation between systems
arises. Optical Flow, Particle Filters and Gabor Filters, and more recently, Active Ap-
pearance Models are some of the choices, with the latter two the most widely re-
ported in recent years. Since Gabor filters and Active Appearance Models are used
in this research work, they are discussed in more detail in Section 3.3 onwards. Some-
times the techniques are combined as in [Gao 09] and in several instances, especially
in the use of Gabor wavelets, dimension reduction is attempted through some pre-
processing such as Boosting, Principal Component Analysis (PCA) or SVM classifica-
tion [Chen 07, Shen 07]. It is this stage where most research activity is taking place,
as it is critical to the success of facial expression recognition.
Finally, once the features have been extracted, the next stage is to adopt some
means of classification such as k-NN, Artificial Neural Network (ANN), SVM, or Ad-
aBoost. If the temporal patterns are to be classified, or the facial patterns combined
with other signals, e.g. vocal speech, then the solution may involve ensembles of clas-
sifiers or Hidden Markov Models (HMMs).
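As a sketch of this final classification stage, the example below trains one of the classifiers mentioned above, an SVM, on placeholder feature vectors using scikit-learn. The random feature matrix and labels stand in for the per-image feature data produced by the earlier stages; they are not real experimental data.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder data: one row of features per image (e.g. Gabor magnitudes or
# AAM parameters) and an expression label per row. Randomly generated here
# purely so that the sketch runs end to end.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = rng.integers(0, 4, size=200)  # e.g. happy / sad / angry / neutral

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = SVC(kernel="rbf", C=1.0)  # common defaults, not a tuned configuration
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))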
Fasel and Luettin [Fasel 03] provide an appealing dichotomy of facial feature ex-
traction methods into Deformation Extraction, either image or model-based, and Mo-
tion Extraction. From a different viewpoint, they follow on from Pantic and Rothkrantz
[Pantic 00] and dichotomise into “holistic methods”, where the face is processed in its
entirety, or “local methods” (similar to Pantic and Rothkrantz’s “Analytic approach”),
which analyse only areas of interest in the face, especially where transient facial fea-
tures involved in expressions exist (as opposed to intransient features such as furrows,
wrinkles and laughter lines). Of course, the selection of local features tends to be very
application dependent.
The Discrete Fourier Transform (DFT), commonly used in speech processing, can be
applied to rows and columns of pixels in an image, indexed by co-ordinates x and y.
The 2D DFT of an N x N pixel image can be given by
FP_{u,v} = \frac{1}{N} \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} P_{x,y}\, e^{-j\left(\frac{2\pi}{N}\right)(ux + vy)} \qquad (3.3.1)
where u and v are the horizontal and vertical dimensions of spatial frequency respec-
tively.
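In practice, the transform in Equation 3.3.1 is computed with a fast Fourier transform. A minimal NumPy sketch follows; note that numpy.fft.fft2 omits the 1/N factor shown above, so it is applied explicitly here.

import numpy as np

def dft2(image):
    """2D DFT of a square N x N image, normalised as in Equation 3.3.1."""
    n = image.shape[0]
    return np.fft.fft2(image) / n

img = np.random.rand(64, 64)      # stand-in for a grey-scale image
spectrum = dft2(img)
magnitude = np.abs(spectrum)      # often viewed with fftshift and a log scale
print(magnitude.shape)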
However, Fourier Transform (FT) and DFT are not well suited to non-stationary
signals, i.e. signals with time-varying spectra such as spikes. Both perform decima-
tion in frequency across the entire image in the forward transform and decimation in
time in the inverse transform. This can be overcome to some extend through the Short
Time Fourier Transform (STFT), which uses a window function to divide the signal
into segments.1 However, the inherent problem with STFT is in choosing the width of
the window. Too narrow a window will give good time resolution but poor frequency
resolution, and too wide a window will give good frequency resolution but poor time
resolution. If the spectral components in the input signal are already well separated
from each other, then a narrow window will possibly provide a satisfactory frequency
resolution. In the case of the frequency components being tightly packed, then a narrow
window will be needed, resulting in good time resolution but poor frequency resolu-
tion.
1
window is the term used when referring to the continuous-time STFT, whereas, frame is used in
discrete-time STFT. For explanatory purposes, only continuous-time STFT is referred to. However, the
same principles apply.
Invented by Dennis Gabor in 1946, Gabor wavelets have found many applications
including speech analysis, handwriting, fingerprint, face and facial expression recog-
nition. One reason for their popularity in computer vision is that, in their 2-D form and
primed with the appropriate values, their filter response resembles the neural responses
of the mammalian primary visual cortex [Daugman 85, Bhuiyan 07, Lee 96, Gao 09].
Just as the ability of bandpass filter banks to approximate cochlear processing and the ability of ANNs to model human neural processing appealed to researchers, this biological resemblance has attracted much attention.
The complex sinusoidal carrier takes the form

s(x, y) = e^{\,j(2\pi(u_0 x + v_0 y) + P)}

where (u0, v0) and P define the spatial frequency and the phase of the sinusoid, respectively [Movellan 08].
The complex sinusoid can be split into its real and imaginary parts as

\mathrm{Re}(s(x, y)) = \cos(2\pi(u_0 x + v_0 y) + P) \qquad (3.3.4)

and

\mathrm{Im}(s(x, y)) = \sin(2\pi(u_0 x + v_0 y) + P) \qquad (3.3.5)

The real and imaginary parts of a Gabor wavelet with an orientation of π/4 and a scale of 4 are shown in Figure 5.6.
The Gaussian envelope takes the form

K\, e^{-\pi\left(a^{2}(x - x_0)_r^{2} + b^{2}(y - y_0)_r^{2}\right)}

where K scales the magnitude of the Gaussian envelope, (x0, y0) is the peak of the
function, a and b are scaling parameters, and the r subscript is a rotation operation
such that
(x − x0 )r = (x − x0 ) cos θ + (y − y0 ) sin θ (3.3.8)
and
(y − y0 )r = −(x − x0 ) sin θ + (y − y0 ) cos θ (3.3.9)
filter with frequency and orientation to an image at point (x, y) on the image plane.
G(.) is obtained as
G(x, y, \theta, \phi) = \iint I(p, q)\, g(x - p,\, y - q,\, \theta, \phi)\, dp\, dq \qquad (3.3.10)
Figure 3.3 shows the original image on the left and the magnitude response image
on the right.
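The sketch below builds a complex Gabor kernel directly from the equations above (rotated coordinates, Gaussian envelope, complex sinusoidal carrier) and computes a magnitude response of the kind shown in Figure 3.3. The parameter values, the kernel size and the use of SciPy's FFT-based convolution are choices made only for this illustration.

import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(size=31, K=1.0, a=0.05, b=0.05, theta=np.pi / 4,
                 u0=0.1, v0=0.1, P=0.0, x0=0.0, y0=0.0):
    """Complex Gabor kernel: Gaussian envelope times complex sinusoidal carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotated coordinates (Equations 3.3.8 and 3.3.9).
    xr = (x - x0) * np.cos(theta) + (y - y0) * np.sin(theta)
    yr = -(x - x0) * np.sin(theta) + (y - y0) * np.cos(theta)
    envelope = K * np.exp(-np.pi * (a ** 2 * xr ** 2 + b ** 2 * yr ** 2))
    carrier = np.exp(1j * (2 * np.pi * (u0 * x + v0 * y) + P))
    return envelope * carrier

def magnitude_response(image, kernel):
    """Magnitude of the filter response (Equation 3.3.10, via convolution)."""
    return np.abs(fftconvolve(image, kernel, mode="same"))

img = np.random.rand(128, 128)    # stand-in for a grey-scale face image
resp = magnitude_response(img, gabor_kernel())
print(resp.shape)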
3.3.3 Introduction
In recent years, a powerful deformable model technique, known as the AAM [Edwards 98],
has become very popular for real-time face and facial expression recognition. The liter-
ature is not very clear in what exactly AAMs are and how they should be differentiated
from other approaches that represent the appearance of an object by shape and texture
subspaces, hence some explanation follows.
According to [Saragih 08], AAMs are examples of Linear Deformable Models (LDMs), a class which also includes Active Shape Models (ASMs) [Cootes 92], AAMs [Cootes 98] and
3D Morphable Models (3DMM) [Blanz 99]. According to [Matthews 04], AAMs, to-
gether with the closely related concepts of Morphable Models and Active Blobs, are
“generative models of a certain visual phenomenon” and, “are just one instance in a
large class of closely related linear shape and appearance models and their associated
fitting algorithms”. In [Gross 05], they are defined as, “generative parametric models
commonly used to model faces”.
In the AAM approach, the non-rigid shape and visual texture (intensity and colour)
of an object (a face in an image perhaps) are statistically modelled using a low dimen-
sional representation obtained by applying PCA to a set of labelled training data. After
the models have been created, they can be parameterised to fit a new object (of similar
properties), which might vary in shape or texture or both.
Usually, the AAMs are pre-trained on static images using a method such as the
Simultaneous Inverse Compositional (SIC) algorithm [Baker 01, Baker 03b] (discussed
in Subsection 3.3.5) and then, when ready to be applied, the model will be fitted to one
or more images that were not present in the training set.
For the better understanding of the following sections, some terms are introduced
and explained now.
Shape All the geometrical information that remains when location, scale and rota-
tional effects are filtered out from an object - invariant to Euclidean similarity
transformations [Stegmann 02].
Landmark point Point of correspondence on each object that matches between and
within populations.
Texture The pattern of intensity or colour across the region of the object or image
patch [Cootes 01].
Image Registration This is the process of finding the optimal transformation between
a set of images in order to get them into one coordinate system.
Fitting An efficient scheme for adjusting the model parameters so that a synthetic
example is generated, which matches the image as closely as possible.
Model A model consists of the mean positions of the points and a number of vectors
describing the modes of variation [Cootes 01].
In this dissertation, the term AAM refers to two things:

1. A statistical model of shape and appearance, trained from a set of images, each of
which has a set of corresponding landmark points (usually manually annotated);
and once built,
2. A method or algorithm for fitting the model to new and previously unseen im-
ages, i.e. images that were not in the set of images that were used to train the
model;
although sometimes they are used simply to mean the statistical model. It should be
noted that there are actually two types of AAM:
1. independent shape and appearance models, where the shape and appearance are
modelled separately; and
2. combined shape and appearance models, which use a single set of parameters to
describe shape and appearance [Matthews 04].
The reason for the different build strategies is to do with variations in the algorithms
that are used to “fit” the model to an image (discussed in Section 3.3.5).
Although there have been some attempts at automatically annotating images [Asthana 09,
Tong 09], there is no completely automated method for facial feature extraction and at
least one image has to be manually marked up as shown in Figure 5.4.
For each image, a corresponding set of points (each entry giving the x, y co-ordinates of a single landmark point) exists, as shown in Table 3.3.
n points:69
{
249.809 274.693
249.785 297.994
259.769 361.231
...
328.853 393.34
305.365 280.317
431.243 281.514
}
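A small sketch for reading landmark files of the kind shown above follows. The file layout (an "n points" header line, braces, and one x y pair per line) is inferred from the listing and may differ from the exact annotation format used; the file name in the usage comment is hypothetical.

import numpy as np

def read_points(path):
    """Read a landmark file of the form shown above into an (n, 2) array."""
    points = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("n points") or line in ("{", "}"):
                continue
            x, y = map(float, line.split())
            points.append((x, y))
    return np.array(points)

# Usage with a hypothetical annotation file:
# shape = read_points("subject01_frame001.pts")
# print(shape.shape)   # e.g. (69, 2)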
\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad (3.3.11)
However, one of the undesirable side effects of the Procrustes analysis, due to the
scaling and normalisation process, is that aligned shapes, or their shape vectors, will
now lie on the curved surface of a hypersphere, which can introduce non-linearities. A
popular approach to counter this problem is to transform or modify the shape vectors
into a tangent space to form a hyper plane [Cootes 01]. That way, linearity is assumed
and, the next step, “modelling the shape variation” is simplified and calculation per-
formance improved.
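A minimal sketch of aligning one shape to another in the Procrustes sense (removing translation, scale and rotation) is given below. It aligns a single shape to a reference; full Generalised Procrustes Analysis iterates this against a re-estimated mean, and the tangent-space projection discussed above is not included. Reflections are not handled either; a fuller implementation would constrain the rotation to have determinant +1.

import numpy as np

def align_shape(shape, reference):
    """Align shape to reference by removing translation, scale and rotation.

    Both arguments are (n, 2) arrays of landmark points.
    """
    # Remove translation by centring on the centroid.
    s = shape - shape.mean(axis=0)
    r = reference - reference.mean(axis=0)
    # Remove scale.
    s = s / np.linalg.norm(s)
    r = r / np.linalg.norm(r)
    # Optimal rotation from the SVD of the cross-covariance matrix.
    u, _, vt = np.linalg.svd(s.T @ r)
    return s @ (u @ vt)

ref = np.random.rand(69, 2)
copy = ref * 2.0 + 0.3                      # scaled and translated copy of ref
aligned = align_shape(copy, ref)
centred = ref - ref.mean(axis=0)
print(np.allclose(aligned, centred / np.linalg.norm(centred)))   # True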
At this point, a set of points exists that are aligned to a common co-ordinate frame.
What remains to be done is to model the point distributions, so that new and plausible
shapes can be generated. It is best to start with a reduction in dimensionality, most
commonly performed using PCA, which derives a set of t eigenvectors, Φ, corresponding
to the largest eigenvalues, that best explain the spread of the landmark points. The
normal procedure is to first derive the mean shape, as in Equation 3.3.11,
and then the covariance matrix of the aligned shapes:

\Sigma_x = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T \qquad (3.3.12)

The eigenvectors \Phi of \Sigma_x corresponding to the t largest eigenvalues form the basis of the shape model, so that any shape in the training set can be approximated by

x \approx \bar{x} + \Phi b \qquad (3.3.13)

where the shape parameters are given by

b = \Phi^T (x - \bar{x}) \qquad (3.3.14)
The total variation in the training samples is given by the sum of all the eigenvalues.
Restricting each parameter b_i to a range such as \pm 3\sqrt{\lambda_i} determines how
close generated shapes will remain to the training set.
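As an illustrative sketch only (not the implementation used in this work), the generation of a new, plausible shape from such a point distribution model can be written as follows; the function name, and the assumption that the model matrices are stored as double-precision OpenCV matrices with the eigenvectors as columns, are illustrative choices rather than details taken from the system described here.

// Sketch: synthesise a plausible shape x = x_bar + Phi*b, clamping each b_i to
// +/- 3*sqrt(lambda_i) so that the result stays close to the training distribution.
#include <opencv2/core.hpp>
#include <algorithm>
#include <cmath>

cv::Mat synthesiseShape(const cv::Mat& meanShape,     // 2n x 1 mean shape (CV_64F)
                        const cv::Mat& eigenvectors,  // 2n x t matrix Phi, columns are modes
                        const cv::Mat& eigenvalues,   // t x 1 eigenvalues lambda_i
                        cv::Mat b)                    // t x 1 shape parameters
{
    for (int i = 0; i < b.rows; ++i) {
        const double limit = 3.0 * std::sqrt(eigenvalues.at<double>(i));
        b.at<double>(i) = std::max(-limit, std::min(limit, b.at<double>(i)));
    }
    return meanShape + eigenvectors * b;              // Equation 3.3.13
}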
All that remains is to find a way to model the spread of the variation, or distribution,
around the points. A single Gaussian is a reasonable starting point, but for facial feature
processing, where there are likely to be non-linear shape variations due, for example,
to pitch, yaw and head roll, a richer model is required; a Gaussian mixture is a
reasonable choice.
Having already obtained a mean shape, each image texture is now warped to the mean
shape, obtaining what is described as a “shape-free” patch or “canonical frame”. This
is done by ensuring that each image’s control points match the mean shape, using a
triangulation algorithm that creates a mesh, such as that shown in Figure 3.5. Next, the
image texture over the warped image is sampled to obtain a texture vector, g_{im}.
The texture vectors are normalised by applying a linear transformation (not shown
here but discussed in [Cootes 01]). Applying PCA to the normalised training samples
produces a linear model of texture

g = \bar{g} + P_g b_g \qquad (3.3.15)
Shape and texture can now be expressed in terms of the shape parameters b_s
(Equation 3.3.14) and the grey-level texture parameters b_g. However, a further PCA is
applied to account for correlations between shape and grey-level texture so that,
finally, the shape and texture of the appearance model can be controlled by a single
vector of parameters c:

x = \bar{x} + Q_s c \qquad (3.3.16)

and

g = \bar{g} + Q_g c \qquad (3.3.17)

where c is a vector of appearance parameters controlling both the shape and the grey-
level texture, and Q_s and Q_g are matrices describing the modes of variation.
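For clarity, the intermediate step is sketched here following the general formulation in [Cootes 01]; W_s is a diagonal weighting matrix (not defined in the text above) that balances the differing units of the shape and texture parameters:

b = \begin{pmatrix} W_s b_s \\ b_g \end{pmatrix}, \qquad b = Q c, \qquad Q = \begin{pmatrix} Q_s \\ Q_g \end{pmatrix}

The matrices Q_s and Q_g of Equations 3.3.16 and 3.3.17 can then be taken to absorb the earlier shape and texture bases, so that a single parameter vector c drives both components.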
Two fitting schemes are described, since they are used in this research work. Until now,
the building of a statistical model of shape and texture has been discussed. Although
there has been some notion of generating shapes and texture, the actual process of
fitting or adjusting model parameters to build a synthetic model, in an attempt to match
it to a new and previously unseen image, i.e. one not in the model training set, has
not been covered. Although it is a generalisation, state-of-the-art recognition systems
are heavily dependent on improvements in the area of model fitting, which is crucial to
achieving the performance required for real-time FER.
There are many variations and performance improvements (some application spe-
cific) that can be made to AAMs. Searching schemes can employ shape only, appear-
ance only, or combined shape and appearance.
[Baker 01, Baker 02, Baker 03a, Baker 03b, Baker 04a, Baker 04b] treat the AAM search
process as an image alignment problem. The original image alignment algorithm was
formulated by Lucas and Kanade in 1981 [Lucas 81]. The goal is to align a template
image T(x) to an input image I by minimising

\sum_x \left[ I(W(x; p)) - T(x) \right]^2 \qquad (3.3.18)

where I is the image, W is the warp, x ranges over the pixels in the template and p is a
vector of warp parameters.
The fine details go beyond the scope of this thesis. However, a concise explanation
of [Baker 01] is that the formulation to solve Expression 3.3.18 requires minimising:
\sum_x \left[ I(W(x; p + \Delta p)) - T(x) \right]^2 \qquad (3.3.19)
with respect to ∆p. Performing a first order Taylor expansion on Expression 3.3.19
gives:
\sum_x \left[ I(W(x; p)) + \nabla I \frac{\partial W}{\partial p} \Delta p - T(x) \right]^2 \qquad (3.3.20)
repeat
    Warp I with W(x; p) to compute I(W(x; p))
    Compute the error image T(x) − I(W(x; p))
    Warp the gradient of image I to compute ∇I
    Evaluate the Jacobian ∂W/∂p
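For completeness, solving the linearised Expression 3.3.20 for \Delta p gives the standard Gauss-Newton update used by the Lucas-Kanade family of algorithms; this is a textbook result, following [Baker 04a], rather than a quotation from the fitting code used in this work:

\Delta p = H^{-1} \sum_x \left[ \nabla I \frac{\partial W}{\partial p} \right]^T \left[ T(x) - I(W(x; p)) \right], \qquad H = \sum_x \left[ \nabla I \frac{\partial W}{\partial p} \right]^T \left[ \nabla I \frac{\partial W}{\partial p} \right]

The warp parameters are then updated additively, p \leftarrow p + \Delta p, and the steps above are repeated until \Delta p becomes negligible.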
The POIC algorithm “projects out” the appearance variation. Using this technique implies a major difference in the way that the
AAMs are initially built: the shape and the appearance parameters need to be modelled independently.
This is a form of what [Saragih 08] describes as “Iterative Discriminative Fitting”. At-
tempts at improving the “fitting” efficiency of AAMs, such as those discussed at Sub-
section 3.3.5, involve various techniques to streamline the original Lucas-Kanade al-
gorithm [Lucas 81], and essentially aim to minimise a least squares error function over
the texture. [Matthews 04] provide experimental evidence to show that the Project-
Out Inverse Compositional (POIC) fitting method provides very fast fitting. However,
when there is a lot of shape and appearance variation, it comes at the expense of poor
generalisability, i.e. in the case of facial expression recognition the model becomes
very person-specific. Changing the algorithm to improve generalisability then impacts
performance.
To overcome these problems, [Saragih 06, Saragih 08] pre-learn the update model
by minimising the error bounds over the data, rather than minimising least squares
distances. Conceptually, this is akin to boosting [Freund 99], where a set of weak
classifiers is iteratively passed over a training set and, at each iteration, a distribution
of weights is updated so that the weights of incorrectly classified samples are increased.
Machine learning classification finds its way into many aspects of research and there
are many options available for taking the features extracted from, say, a bank of Gabor
filters and classifying them. The most frequently used include ANN, k-NN, SVM
[Burges 98, Chen 05, Vapnik 95] and AdaBoost [Freund 99] (a variant for multi-class
problems is MultiBoost [Casagrande 06]). Only SVM and AdaBoost are discussed in
this dissertation.
There are no facts, only interpretations.
Friedrich Nietzsche

4
Expression Analysis in Practice
A system capable of interpreting affect from a speaking face must recognise and fuse
signals from multiple cues. Building such a system necessitates the integration of such
tasks as image registration, video segmentation, facial expression analysis and recog-
nition, speech analysis and recognition.
Without the availability of vast sums of time and money to build the components
“from the ground up”, this almost certainly entails re-using publicly available soft-
ware. However, such components tend to be idiosyncratic, purpose-built, and driven
by scripts and peculiar configuration files. Integrating them to achieve the necessary
degree of flexibility to perform full multimodal affective recognition is a serious chal-
lenge.
If one contemplates the operation of a full-lifecycle system that can be trained from
audio and video samples and then used to perform multimodal affect recognition, the
requirements are extensive and diverse. For example, to detect emotion in the voice
the system must be capable of training, say, HMMs from prosody in the speech signals.
Another requirement might be that a SVM be trained to recognise still image facial
expressions, e.g. fear, anger, happiness, sadness, disgust, surprise or neutral. More
complex is the requirement to capture a sequence of video frames and, from the se-
quence, recognise temporal expressions. In order to perform the latter, it might be
necessary to use a deformable model, e.g. an AAM [Edwards 98] to fit to each image
and provide parameters that can then, in turn, be trained using some classifier - possibly
another HMM.
Other features might also be considered for input to the system, e.g. eye gaze and
blink rate. Ultimately, some strategy is required to assess the overall meaning of the
signals, whether it involves fusion using a combined HMM or some other technique.
From the concise consideration of the requirements it can be seen that a broad range
of expertise and software is needed. It is not practical to develop the software from first
principles. Software capable of recognising voice and facial expressions implements
techniques from different areas of specialisation. ASR techniques have evolved
over decades while computer vision has become practical in the last ten years, with the
evolution of statistical techniques and computer processing power.
The reasons behind choosing one software product over another are not within the
scope of this work. A brief overview of some of the critical components and the “com-
posite framework” used to harness them is presented.
Although, due to time constraints, the main focus in this dissertation and the exper-
imental work is on facial expression recognition, the NXS system has been designed to
support multi-modal analysis and recognition.
There are several levels of sophistication that a system capable of sensing affect could
provide.
Indeed, the system should be able to operate within each modality or across a con-
flation of modalities. The ultimate goal is to be able to recognise emotion from both
audio and video inputs from a speaking face. Consider a speaking face in a real world
situation. Voice expression is not necessarily continuous: there may be long pauses or
sustained periods of speech. Vocal speech and facial expression may not necessarily
be contemporaneous, and the verisimilitude of the voiced expression might be confirmed
or contradicted by the facial expressions. The face might be expressionless, hidden,
not available or occluded to some degree for certain periods of time. This implies that
the system needs to be able to:
3. weigh one signal against the other when more than one modality is available.
Lastly, the system must be flexible so that alternative software products and tech-
niques can be substituted without a large amount of effort or re-engineering work in-
volved. For example, to compare classification performance, it might be desirable to
substitute an ANN package for an SVM package or simply compare different types of
SVM implementations.
The following sections present a minimal statement of requirements for some of the
key functional areas.
The approach to recognising affect in speech is often similar to ASR processing; however,
emotion is normally mapped at a supra-segmental level, rather than at word boundaries
or silences. Nevertheless, features such as energy levels and variance in the signals
can be used to detect prosody. Indeed, use is often made of freely available ASR pack-
ages. Whatever the approach, this capability is mandatory.
Videos come in a wide variety of formats and containers. Unfortunately, not all freely
available computer vision software can operate on all formats. The very popular OpenCV
[OpenCV ] software, commonly used to perform face detection in videos, is also capable
of capturing frames from a video, in principle obviating the need for additional
specialised image-capture software. However, OpenCV will only process the Audio
Video Interleave (AVI) container format, introduced by Microsoft in 1992, thus con-
straining the solution to some extent.
Regardless of the video format, capturing image frames from the video at certain
intervals is essential and, in practice, may need to be performed both manually and
automatically. The frames need to be captured for training and recognition phases.
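As a hedged illustration of this requirement only (a sketch written against the modern OpenCV C++ API rather than the capture code of any particular system; the function name is arbitrary and the 500 ms default simply mirrors the interval used later in the processing sequence), frame capture at fixed intervals might look like this:

// Sketch: capture a frame roughly every intervalMs milliseconds from a video file.
#include <opencv2/core.hpp>
#include <opencv2/videoio.hpp>
#include <algorithm>
#include <string>
#include <vector>

std::vector<cv::Mat> captureFrames(const std::string& videoPath, double intervalMs = 500.0)
{
    std::vector<cv::Mat> frames;
    cv::VideoCapture capture(videoPath);
    if (!capture.isOpened())
        return frames;                                   // unsupported container or missing file

    const double fps = capture.get(cv::CAP_PROP_FPS);
    const int step = std::max(1, static_cast<int>(fps * intervalMs / 1000.0));

    cv::Mat frame;
    for (int i = 0; capture.read(frame); ++i)
        if (i % step == 0)
            frames.push_back(frame.clone());             // clone: the capture buffer is reused
    return frames;
}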
4.2.4 Classification
4.2.5 Miscellaneous
The NXS was built with the ultimate goal of evolving into a cross-platform, extensible,
“real-world” audio-visual FER system, capable of being used in a wide range of applications.
With that in mind, the following sections discuss some putative prescriptions
for such a system.
Ideally the system should have broad platform support. In practical terms, this trans-
lates to variants of Unix and Linux, Mac OS and Windows.
One of the first hurdles that one encounters, especially with video processing, is the
number of different video container formats and the lack of support for them in one or
more operating environments. Where possible the system should be capable of supporting
multiple audio and video container formats. At a minimum these should include
WAV, AVI, MP4, MPEG2, and MOV. If support for a format is not available, then there should
at least be a straightforward way of converting to a format that is supported.
Image Processing
The system needs the capability to train a classifier on a corpus of still emotional
expressions. The corpus could consist of images in JPEG, PNG or some other format.
Alternatively, there may be no image corpus but rather a video collection, from which
significant images will need to be captured into a suitable format, thus creating a de facto corpus.
The images will then be subjected to some recognition process.
Video Processing
A video can be hours in duration or it could simply be, as in the Cohn-Kanade database
[Kanade 00], collections of short, sample expressions. Both in training and testing, the
system needs to be able to capture frames from a video segment. The frames will then
be subjected to a treatment similar to the image processing mentioned previously, and
the resulting parameters input to, say, a HMM.
Classification
Ideally, the system will make use of multi-threading and multi-processing capabilities
of the operating system. Performance is critical, as is efficient memory usage. It is
preferable that the system be able to execute in online mode.
The core system must be simple to use so that the effort required to integrate compo-
nents and to re-run exercises is minimised.
The major benefit of the project concept is that results can be saved, experiments can be re-
run with different parameters, and outputs can be written to comma-delimited files for
use in third-party products, e.g. Excel. AAMs and SVMs built for one project can be
referenced and used in another project.
NXS is capable of extracting and cataloguing frames from video, as well as performing the processing functions described in the sections that follow.
While the Windows operating system dominates the commercial and home environ-
ments, systems such as Linux and Mac OS also remain popular. In order to meet
the cross-platform operating environment requirement, C++ or Java was considered
suitable for development of the core system. However, given the critical performance
requirements and the fact that most high-performing video processing libraries are C
or C++ based, C++ was selected.
Qt has another very attractive feature. Its MetaObject Pattern supports a programming
concept known as “reflection”. The benefits of reflection are realised when it
comes to saving and restoring exercises. The state and values of objects and their
properties are “reflected” fairly simply into, in this case, Extensible Markup Language
(XML). Making use of this metaobject facility, NXS can perform “round-trip” XML
processing: serialising the objects, persisting them, and then deserialising them. In
practical terms, this is the means of saving and restoring projects.
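A minimal sketch of the idea is given below. It is illustrative only, not the NXS serialiser: it assumes the objects expose their state through Q_PROPERTY declarations and writes them with QXmlStreamWriter.

// Sketch: write every declared Q_PROPERTY of an object out as an XML element.
#include <QObject>
#include <QMetaProperty>
#include <QXmlStreamWriter>

void writePropertiesAsXml(const QObject& obj, QXmlStreamWriter& xml)
{
    const QMetaObject* meta = obj.metaObject();
    xml.writeStartElement(meta->className());
    for (int i = 0; i < meta->propertyCount(); ++i) {
        const QMetaProperty prop = meta->property(i);
        // Each property is "reflected" as <name>value</name>.
        xml.writeTextElement(prop.name(), obj.property(prop.name()).toString());
    }
    xml.writeEndElement();
}

The reverse (deserialising) direction reads the elements back and calls setProperty() on a newly created object, which is what makes the round trip possible.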
The Vision-something-Libraries (VXL) are used for image processing [VXL ]. The
libraries are written in C++ and are very efficient. OpenCV is used for simple video
display and to capture images from videos [OpenCV ]. LIBSVM [Chang 01] is used
for facial expression recognition.
The system has been built using design patterns as described by [Pree 95]. Qt lends
itself to building with design patterns [Ezust 06]. Figure 4.1 depicts the conceptual
class structure. Use is made of the serializer and composite patterns to effect the round-
trip processing, mentioned in Subsection 4.4.1.
A “project” is the top level concept, created by a software factory, and is simply
a collection of segments (discussed in the next section). Facades are used to abstract
the details of external classes such as those used to perform AAM processing. A form
factory is used to create simple dialog boxes for user input, reducing the amount of
effort that would otherwise have been required if the dialogs had been hand-
crafted.
Multimedia types and their corresponding segment classes: Image → ImageSegment, Video → VideoSegment, Audio → AudioSegment, AudioVisual → AVSegment, Multimedia → Segment.
4.4.4 Segments
Central to the design of the system is the concept of segments. This borrows to some
extent from, but is simpler than, MPEG-7 [Salembier 01] and its concept of segment
types. This is no coincidence: MPEG-7, previously known as the “Multimedia Content
Description Interface”, is a standard for describing multimedia content data that
supports some degree of interpretation of the information’s meaning, which can be
passed onto, or accessed by, a device or a computer code. However, the implemen-
tation deviates slightly in that segment and multimedia data members and operations
are combined.
The various types of segments are created by a segment factory. As can be seen
in Figure 4.1, these include Image Segments, Image Sequence Segments, Image Col-
lections, Audio Collections and Video Collections. Using factories to provide a layer
of abstraction not only conceals the implementation complexity from the calling func-
tions but simplifies the creation of new types of segments. Figure 4.2 depicts the
segment factory class diagram.
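The following is a simplified sketch of the factory idea; the class and function names are illustrative only, and the real segment classes carry the multimedia data members and operations described above.

// Sketch: a segment factory returning concrete segments behind an abstract interface,
// so new segment types can be added without changing the calling code.
#include <memory>
#include <stdexcept>
#include <string>

class Segment {
public:
    virtual ~Segment() = default;
    virtual std::string describe() const = 0;
};

class ImageSegment : public Segment {
public:
    std::string describe() const override { return "image segment"; }
};

class VideoSegment : public Segment {
public:
    std::string describe() const override { return "video segment"; }
};

enum class SegmentType { Image, Video };

class SegmentFactory {
public:
    static std::unique_ptr<Segment> create(SegmentType type)
    {
        switch (type) {
            case SegmentType::Image: return std::unique_ptr<Segment>(new ImageSegment());
            case SegmentType::Video: return std::unique_ptr<Segment>(new VideoSegment());
        }
        throw std::invalid_argument("unknown segment type");
    }
};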
User dialogs are, in keeping with the design pattern approach, created by factories.
Figure 4.3 demonstrates the simplicity in creating new dialogs through the use of a
form factory [Ezust 06].
Rather than simply listing each function, the system is better described by a practical
walk-through of a typical processing scenario. Most major functions are accessible by
right-clicking to present a context menu, as shown in Figure 4.4. The use of the product
begins with the creation of a “Project”, as seen in Figure 4.5.
Figure 4.6 shows the project tree structure after a project has been created, an
image segment, image collection segment, video segment, and model segment have
been added to the project. The tree structure is effectively the XML that reflects the
objects’ states and data member values. From here the XML can be saved and reopened
later.
The facial image is subdivided into regions in order to track inter-region movement, and
a common scale is applied to the facial images. Similar to [Strupp 08], this is done by first
creating a horizontal or transverse delineation line, mid-way between the topmost and
bottommost landmark points on the facial image being examined, as shown in Figure
4.7 (the concept of an “RU”, which appears in the Figure, is explained in Section 5.2.2).
This process is applied to any facial image used to train or test the system.
Figure 4.7: Measurements from horizontal delineations (image from the Feedtum database [Wallhoff ])
The raw landmark measurements are converted to normalised and scaled vectors. The process of “scaling” the data is explained in
Section 5.3.2. The algorithm for the normalisation process is described in Algorithm 3.
4.4.8 Classification
Two sets of classifiers are used in this work: one for prototypical expressions, and the
others specific to Region Units (RUs). Those specific to RUs are trained to classify the
major intra-RU patterns that would normally accompany prototypical expressions. For
instance, the mouth and lip movements of the fiduciary points in an image of a prototypical
smile are re-used and represent one class in the RU3 classifier. The relevant normalised
measurements are used as inputs to all of the classifiers. For example, in the case of
RU1, only the landmark points within RU1 are input to the classifier. The mappings are
shown in Table 4.1.
Note that eye movement AUs, such as those that effect blinks (e.g. AU43, AU45-46
and AU61-68), are not incorporated at present.
The processing sequence is depicted in Figure 4.8. First, images are captured at predefined
intervals from the video, 500 ms in this case. Frontal face poses are then segmented
from the images using the Viola and Jones [Viola 01] (or a derivative) technique to determine
the global pose parameters. See [Viola 01] for details. Next, AAMs, which have
been prepared in advance, are fitted to new and previously unseen images to derive
the local shape and texture features. The measurements obtained (see Section 4.4.7)
are then used to classify the regions using SVM classifiers. Ultimately, the number and
mixture of classified regions are used to report facial activity.
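A compact sketch of this sequence is given below. It is illustrative only: the OpenCV face detector call is real, but fitAAM() and classifyRegionUnits() are placeholder stubs standing in for the external AAM fitting and SVM classification components, not real library calls.

// Sketch: per-frame processing - Viola-Jones detection, then (stubbed) AAM fitting
// and region-wise classification.
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/objdetect.hpp>
#include <vector>

struct RegionLabels { int r1 = 0, r2 = 0, r3 = 0; };        // per-region class ids

// Placeholder: in practice this is handled by the external AAM fitting library.
static std::vector<cv::Point2f> fitAAM(const cv::Mat& /*faceRegion*/)
{
    return std::vector<cv::Point2f>(69);                    // 69 landmarks, as in Table 3.3
}

// Placeholder: in practice the normalised measurements go to per-RU SVM classifiers.
static RegionLabels classifyRegionUnits(const std::vector<cv::Point2f>& /*shape*/)
{
    return RegionLabels{};
}

void processFrame(const cv::Mat& frame, cv::CascadeClassifier& faceDetector)
{
    // 1. Viola-Jones (or derived) detection gives the global face pose.
    cv::Mat grey;
    cv::cvtColor(frame, grey, cv::COLOR_BGR2GRAY);
    std::vector<cv::Rect> faces;
    faceDetector.detectMultiScale(grey, faces);

    for (const cv::Rect& face : faces) {
        // 2. Fit the pre-trained AAM to the detected face to obtain local shape features.
        std::vector<cv::Point2f> shape = fitAAM(grey(face));
        // 3. Classify each region unit (R1-R3); the mix of labels reports facial activity.
        RegionLabels labels = classifyRegionUnits(shape);
        (void)labels;
    }
}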
AU                                  Description                              RU
1, 2, 3, 4                          Brow movement                            1
5, 6, 7                             Eyelid activity and orbicularis oculi    2
9, 10, 12, 15, 17, 20, 25, 26, 27   Nose, mouth and lip regions              3

Table 4.1: Mapping Action Units to Region Units
Figure 4.9 depicts the facial triangulation mesh, based on the feature points, used
to divide the facial image into regions. In this case, AAMs are pre-trained on static
images using the SIC fitting method [Baker 03a]. SIC is another adaptation of inverse
compositional image alignment for AAM fitting that addresses the problem of
significant shape and texture variability by finding the optimal shape and texture
parameters simultaneously. Rather than re-computing the linear update model at every
iteration using the current estimate of the appearance parameters, the update model can be
approximated by evaluating it at the mean appearance parameters, allowing it to be
pre-computed, which is significantly more computationally efficient. See [Baker 03a]
for details. The system described in Section 7 involves the recording of frontal face
images while the participant views stimuli on a computer display. Under this setup, there
is a reasonable tolerance to head movement and short-duration occlusion.
Although there are several Gabor filter software packages available, few are written in
the C++ programming language and are easily incorporated into other software sys-
tems. The Gabor filter processing in NXS makes use of a well-designed software im-
plementation [cvG ], used in [Zhou 06, Zhou 09].
I am an old man and have known a great many troubles, but most of them never happened.
Mark Twain

5
Sensing for Anxiety
5.1.1 Introduction
In Australia, a substantial proportion of people will experience some type of anxiety disorder in any one year - around one in twelve
women and one in eight men. One in four people will experience an anxiety disorder
at some stage of their lives [beyondblue ]. In the United States, anxiety disorders affect
about forty million adults a year - around 18% of the population [Kessler 05]. Anxi-
ety is often comorbid with depression and between them they can have serious health
implications.
The term, “anxious expression”, is used in everyday language and most people
would give tacit acknowledgement to its intended meaning. Thus, given the prevalence
of anxiety and actual disorders, one would think that such an expression would be
straightforward to define and, similarly, to recognise automatically. However, the near-absence
of literature on automatic anxious expression recognition suggests that
this is not the case. This Chapter describes an attempt, through a set of novel experi-
ments, to test the feasibility of such an exercise.
In line with the broader objectives of this thesis, summarised in Section 1.1, the
motivation for the experiments presented in this chapter is to:
• understand how best to use and calibrate the NXS system; and
This chapter is organised as follows. Section 5.2 explains the hypotheses and ques-
tions to be addressed. This is followed in Section 5.3 with a description of the method-
ology used in the experiments. Section 5.4 presents the data and analysis from the
experiments. Finally, Section 5.5 concludes and evaluates the exercise.
5.2.1 Hypotheses
On the basis of the motivation for the experiments and the literature review in Chapter
3, the following hypotheses and questions were generated:
Figure 5.1 depicts the fiduciary or landmark points that are “fitted” to each image dur-
ing analysis. The collective landmark points, referred to in this dissertation as “shape”,
are captured as a set of x, y Cartesian coordinates. As explained in Chapter 3, texture,
1 A “non-primary” emotion is one that excludes the fast-acting emotions such as the “big-six”, discussed in Section 2.2.1.
2 “Differentiated” is defined as performance better than chance.
or the spatial variation in the gray values of pixel intensities in the image, is also
obtained.
Several options are available to make use of this feature data to classify expres-
sions, e.g. using shape information only, using texture information only, or by combin-
ing the shape and texture information in some way. In keeping with the motivation for
this Section, the following set of questions was posed:
How does facial expression recognition performance, i.e. Classification Accuracy (CA),
vary when using:
3. the location of facial landmark points concatenated with AAM texture parameters;
Regions
In Section 4.4.7 of Chapter 4, the subdivision of facial areas for analysis is explained.
Figure 5.2 is reproduced for convenience, showing the three facial regions used.
It would be useful to know if one facial region is more important than another for
all expressions or only for specific expressions. More formally, the following questions
were of interest:
Of interest is the cumulative execution time of face-detection, AAM fitting and clas-
sification of an image since these steps would apply to an online recognition system
(although the face-detection step is not normally undertaken in every frame).
The following question was therefore posed:
5.3 Methodology
Experiment 1 The first experiment, as a baseline, was to determine if the NXS system
could be trained to differentiate between prototypical facial expressions labelled
as ‘Fear’, ‘Anger’, ‘Happy’, ‘Sad’, ‘Surprise’, and ‘ Neutral’ from the Cohn-
Kanade database. The number of occurrences of each expression is shown at
Table 5.1.
Experiment 2 The second experiment was to test Hypothesis 1 in Section 5.2 and
determine if the NXS system could differentiate between facial expressions nominated
as anxiety and those nominated as fear.
There are no freely available databases of anxious facial expressions that have
FACS annotations. To improvise, a set of expressions from the Cohn-Kanade
[Kanade 00] database having FACS annotations corresponding to action units of
fear and anxiety was selected. They were first analysed by the author, who made
a preliminary judgement, labelling each expression as either ‘Anxiety’ or ‘Fear’.
The number of occurrences of each class label from the preliminary assessment
is shown at Table 5.2.
An anonymous poll was then conducted and participants asked to view each
expression and judge whether they thought the expression should be labelled as
‘Fear’, ‘Anxiety’ or ‘Uncertain’. Those invited to participate in the poll were not
given any indication of the preliminary label.
The raw data results of the poll are shown at Appendix A. A summary of the
revised numbers of each class is at Table 5.3.
For reasons discussed later, 3 images were removed from the exercise, leaving
31 images for use in the experiment. The number of occurrences of fear and
anxiety is shown in Table 5.4.
Of these occurrences, 9 are male and 22 are female. An even split would have
been ideal, but a little difficult to attain with such a small sample set when sixty-
five percent of subjects in the Cohn-Kanade database are female. Individual at-
tributes of the recorded actors are not known. We are told, however, that in
the database subjects range in age from 18 to 30 years, 15 percent are African-
American, and three percent are Asian or Latino.
Experiment 3 The third experiment was to test Hypothesis 2 in Section 5.2 and es-
tablish if the NXS system could differentiate facial expressions of anxiety from
a larger set of emotional expressions which included ‘Fear’, ‘Anger’, ‘Happy’,
‘Sad’, ‘Surprise’, and ‘Neutral’. The images of fearful and anxious expressions
used in Experiment 2 were combined with those of Experiment 1, replacing Ex-
periment 1’s fearful expressions with Experiment 2’s fearful expressions, and
adding Experiment 2’s anxious expressions. All images were from the Cohn-
Kanade database and the number of each expression is shown at Table 5.5.
Experiment 4 The objective of the fourth experiment was to test whether the classifier
built from the Cohn-Kanade database of images, in Experiment 1, could be used
to predict the facial expressions in the Feedtum database of images [Wallhoff ],
which were recorded with different subjects and under different lighting condi-
tions. Prototypical facial expressions of ‘Fear’, ‘Anger’, ‘Happy’, ‘Sad’, ‘Sur-
prise’, and ‘Neutral’ from both the Cohn-Kanade database and the Feedtum
database were selected.
A sample of an image from each database is presented at Figure 5.3, which shows
clearly the marked difference in the lighting conditions between the image from
the Cohn-Kanade database on the left and the Feedtum database on the right.
The broad concept of facial expression recognition has been explained in Chapter 3
and the implementation of various associated functions in the NXS system described in
(a) Sample image from the Cohn-Kanade database; (b) Sample image from the Feedtum database
Figure 5.3: Experiment 4 - Images from different databases showing different lighting conditions
Chapter 4. Although the NXS system can be used to both train and test classifiers, in
this set of experiments, it was used, predominantly, to build AAMs and fit them to the
set of images in the experiments. When fitting AAMs to images, NXS records:
• the Cartesian coordinates of each landmark point that makes up the face “shape”.
These are normalised to take into account differences in face sizes;
The term “shape”, as used in this dissertation, is used broadly to mean the facial
landmark points and the subdivision of facial regions into eyebrow (R1), eye (R2),
and mouth (R3). This is explained in Chapter 4.
One of NXS’ other functions, using the aforementioned stored face and texture
features, is to scale and normalise them and output them in a format suitable for in-
put to an external classification product such as LIBSVM [Chang 01] or RapidMiner
[RapidMiner ] (which uses LIBSVM).3
Eight output datasets were produced for each experiment. These contained feature
sets of shape, shape and Gabor texture concatenated, shape and AAM texture
parameters concatenated, Gabor magnitude, AAM texture, eyebrow shape (R1), eye
shape (R2) and mouth shape (R3). The tuning details of each major functional area
and parameter selections used in this set of experiments are explained below.
3 LIBSVM uses a sparse text format (the SVMlight format), whereas RapidMiner imports, amongst other things, tab-delimited files.
The Iterative Error Bound Minimisation (IEBM) method [Saragih 06] of building the AAM and fitting the model to
the image was chosen because of its fast fitting capabilities,4 which would be critical
in achieving real-time fitting as required in later experiments. Depending on the parameter
selection, model training time was longer than that experienced with the SIC
method. The implementation of the algorithm by [Saragih 06] was used,5 and after the
recommended settings were applied, the parameters were fine-tuned by trial and error
in consultation with the software provider.
A before and after example of “fitting” an AAM to a face in an image from the
Cohn-Kanade database is given at Figure 5.4.
Figure 5.4: Original image on left and image after fitting on right
4 This had been shown in earlier informal trials [Saragih 09].
5 The implementation is written in the C++ programming language.
Due to the high dimensionality of the output features from Gabor filters, processing was
only applied to the R1, R2 and R3 regions, not to the entire cropped face. Prior to
convolving the image with the Gabor filter, the region of interest, e.g. R1, is scaled or
warped using bicubic interpolation to a fixed, canonical size (100 × 10 pixels for R1
and R2, and 100 × 20 pixels for R3) and written to a grayscale image. To facilitate the
explanation, the before and after images are shown at Figure 5.5.
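As a small illustrative sketch of this pre-processing step (modern OpenCV C++ API; the function name is arbitrary and not taken from the system described here):

// Sketch: warp a region of interest to its canonical size with bicubic interpolation
// and convert it to greyscale before Gabor filtering.
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

cv::Mat toCanonicalPatch(const cv::Mat& regionBgr, bool isMouthRegion /* R3 */)
{
    const cv::Size canonical = isMouthRegion ? cv::Size(100, 20)   // R3
                                             : cv::Size(100, 10);  // R1 and R2
    cv::Mat grey, patch;
    cv::cvtColor(regionBgr, grey, cv::COLOR_BGR2GRAY);
    cv::resize(grey, patch, canonical, 0, 0, cv::INTER_CUBIC);     // bicubic warp
    return patch;
}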
The Gabor filter processing made use of a software implementation [cvG ], which
had been used by [Zhou 06].6 The program implements the Gabor wavelet using the
formula in [Zhou 06] and shown at Equation 5.3.1.
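Under the assumption that [Zhou 06] follow the widely used formulation of [Lades 93] and [Liu 04] (the kernel is reproduced here on that basis, consistent with the parameter descriptions that follow, rather than quoted from the source implementation), Equation 5.3.1 takes the form

\psi_{\mu,\nu}(z) = \frac{\|k_{\mu,\nu}\|^2}{\sigma^2} \exp\!\left(-\frac{\|k_{\mu,\nu}\|^2 \|z\|^2}{2\sigma^2}\right) \left[\exp(i\, k_{\mu,\nu} \cdot z) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right] \qquad (5.3.1)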
where z = (x, y) is the point with the horizontal coordinate x and the vertical coordi-
nate y. The parameters µ and ν define the orientation and scale of the Gabor kernel,
6 The implementation is also written in the C++ programming language.
‖ · ‖ denotes the norm operator, and σ is related to the standard deviation of the Gaussian
window in the kernel and determines the ratio of the Gaussian window width to
the wavelength. The wave vector k_{\mu,\nu} is defined as

k_{\mu,\nu} = k_\nu e^{i \phi_\mu} \qquad (5.3.2)

where k_\nu = k_{max}/f^{\nu} and \phi_\mu = \pi\mu/8 if 8 different orientations have been chosen. k_{max} is the
maximum frequency, and f^{\nu} is the spatial frequency between kernels in the frequency
domain.
The second term in the square brackets in Equation 5.3.1 compensates for the DC
value and its effect becomes negligible when the parameter σ, which determines the
ratio of the Gaussian window width to wavelength, is sufficiently large.
The approach taken was analogous to the “eigenface” recognition process. In this
method, each training image is “flattened” to a 1 × D vector and the vector pushed
onto a stack of vectors. Once all training images have been processed, dimensionality
reduction takes place. The eigenvectors of the most significant eigenvalues are used to
project the vectors into eigenspace and the coefficients used to train a classifier. The
recognition phase processes the images in a similar manner with coefficients matched
against those derived in the training phase.
In this experiment, each N × M image of the extracted facial region (R1, R2 and
R3) was convolved with each Gabor filter and the resulting Gabor magnitudes were
laid out end-to-end in a vector of dimension N × M. Thus, after convolving an image
region with 40 filters, the result is an N × M × 40 vector. After processing each image,
the N × M × 40 vector is pushed onto a stack of vectors of convolved images. Once all
of the images are stacked, OpenCV’s cvCalcPCA function is used to perform PCA to
get the eigenvalues and eigenvectors. A truncated set of eigenvectors is then used and
OpenCV’s cvProjectPCA function is called to project the vectors into “eigenspace”,
finally using the resulting coefficients for recognition. Through trial and error, it was
found that 20 eigenvectors explained approximately 90% of the variation in the set of
Gabor magnitude responses.
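A sketch of this reduction step is shown below. It is illustrative only: it uses the C++ cv::PCA wrapper (OpenCV 3.x style) rather than the cvCalcPCA/cvProjectPCA C functions mentioned above, the function and constant names are arbitrary, and it assumes the Gabor magnitude responses have already been flattened so that each row of the input matrix holds one region's N × M × 40 vector.

// Sketch: reduce stacked Gabor magnitude vectors to 20 eigenspace coefficients per image.
#include <opencv2/core.hpp>

static const int NUM_EIGENVECTORS = 20;   // ~90% of the variation in these experiments

cv::Mat projectToEigenspace(const cv::Mat& responses /* CV_32F, one flattened vector per row */)
{
    // Compute the PCA basis over the stacked response vectors (rows are samples).
    cv::PCA pca(responses, cv::Mat() /* compute the mean */, cv::PCA::DATA_AS_ROW,
                NUM_EIGENVECTORS);

    // Project every vector onto the truncated eigenvector basis; the resulting
    // coefficients (one row per image) are what the classifier is trained on.
    cv::Mat coefficients;
    pca.project(responses, coefficients);
    return coefficients;
}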
By far the most challenging aspect of the experiment was calibrating the Gabor
filters. Selection of the scale parameters σ and ν affects the width of the kernel and involves
a trade-off. Larger values are more robust to noise but less sensitive; smaller values
are more sensitive but less effective in removing noise. Finding optimum settings
involved reference to a number of articles [Bhuiyan 07, Chen 07, Fasel 02, Gao 09,
Kamarainen 06, Kanade 00, Lades 93, Lee 96, Liu 04, Liu 06, Movellan 08, Shen 06,
Shen 07, Wiskott 97, Wu 04] and a lot of trial and error.
Many articles do not explain the reasons behind parameter selection, instead refer-
ring, if at all, to other articles that, in turn, do not explain the settings or simply refer to
yet another article. The popular setting for k_{max} is π/2 and for the spatial frequency
f is √2. It would seem that the setting k_{max} = π/2 originates from [Lades 93],
who noted it yielded the best results after trialing values of 3π/4, π/2 and π/3. [Lades 93]
also seems to be responsible for the setting of the spatial frequency f to √2, after
trialing the values f = 2 and f = √2.
In a study into automatic coding of facial expressions displayed during posed and
genuine pain, [Littlewort 07, Littlewort 09] disclose that they convolved their 96 × 96
images with a bank of Gabor filters of eight orientations and 9 spatial frequencies (2-32
pixels per cycle at 1/2 octave steps). They then pass the output magnitudes to action
unit classifiers. Yet, in a study from the same laboratory, [Whitehill 09] describe an
attempt to provide an optimum set of parameters in the task of performing smile detec-
tion against a real-world set of photographs. The database, GENKI, consists of pictures
from thousands of different subjects, photographed by the subjects themselves. In the
experiment, all images were converted to grayscale, normalised by rotating and crop-
ping around the eye region to a canonical width of 24 pixels. The authors report that
they used energy filters to model the primate visual cortex, using 8 equally spaced
orientations of 22.5 degrees, but they do not explain how they arrived at the spatial fre-
quencies with wavelengths of 1.17, 1.65, 2.33, 3.20 and 4.67 Standard Iris Diameters.7
The authors refer to [Donato 99] for filter design, however, [Donato 99] report spatial
frequency values of ν ∈ {0, 1, 2, 3, 4} and go on to describe a further test using high
frequency values of ν ∈ {0, 1, 2} (scale is the inverse of frequency) and low frequency
values of ν ∈ {2, 3, 4}. They state that the performance of the high frequency subset
ν ∈ {0, 1, 2} was almost the same as ν ∈ {0, 1, 2, 3, 4}. It should be noted that the task
at hand was the classification of FACS Action Units. Intuitively, one would expect high
scale (low frequency) to generalise or provide a better representation of expressions
than low scale (high frequency), since the former is more likely to ignore artifacts.
The MPEG-7 [MPEG-7 ] Homogeneous Texture Descriptor standard has made use
of Gabor filter banks of 6 orientations and 5 scales. The number of orientations and
scales were based on previous results from [Manjunath 96, Ro 01]. Its use is pitched
towards automatic searching and browsing of images and the mean and standard devi-
ation of each filtered image, plus the mean and standard deviation of the input image
are typically used as features (30 × 2 + 2 = 62 features).
After much experimentation, values of σ = π and ν ∈ {0.0, 0.06, 1.4} were used as
in [Bhuiyan 07]. In this experiment, the kernel or mask width is automatically decided
by the spatial extent of the Gaussian envelope and was obtained from the formulation
implemented in [Zhou 06]
\text{width} = 6 \times \frac{\sigma}{k_{max}/f^{\nu}} + 1 \qquad (5.3.3)

which, substituting the chosen values,

= 6 \times \frac{\pi}{(\pi/2)/\sqrt{2}^{\,0.06}} + 1 \qquad (5.3.4)
gives a filter size of 13 × 13 pixels. A sample filter with scale = 1.4 and orientation = π/8
can be seen at Figure 5.6. The filter response magnitudes for the regions R1, R2 and R3
are shown at Figure 5.7.
Figure 5.6: Real and Imaginary part of a Gabor wavelet, scale = 1.4, orientation= π/8
(5 × actual size)
Classification
In all of the experiments described in this chapter, an SVM with a Radial Basis Function
(RBF) kernel and the standard regularised support vector classification algorithm
(C-SVC) was used. As discussed previously, datasets of features were output
from the NXS system as training and testing sets. Optimal parameters for the SVM
models were found using the grid-search approach (the LIBSVM package includes
the “grid.py” program to perform a grid-search). The training set data was reused for
testing in a 5-fold cross validation setup, as depicted in Figure 5.8. In this Figure,
“experiment ” is used to describe the training run.
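As a hedged sketch of this configuration only (written against the LIBSVM C API; the cost and gamma values are placeholders standing in for whatever the grid search returned, the function name is arbitrary, and the features are assumed to have been loaded into the svm_problem already):

// Sketch: C-SVC with an RBF kernel, evaluated with 5-fold cross validation.
#include <svm.h>      // LIBSVM
#include <cstdio>
#include <vector>

void crossValidate(const svm_problem& prob /* labels in prob.y, features in prob.x */)
{
    svm_parameter param = {};
    param.svm_type    = C_SVC;     // regularised support vector classification
    param.kernel_type = RBF;       // radial basis function kernel
    param.C           = 8.0;       // placeholder: taken from grid.py in practice
    param.gamma       = 0.125;     // placeholder: taken from grid.py in practice
    param.cache_size  = 100;       // MB
    param.eps         = 1e-3;
    param.shrinking   = 1;

    std::vector<double> predicted(prob.l);
    svm_cross_validation(&prob, &param, 5, predicted.data());

    int correct = 0;
    for (int i = 0; i < prob.l; ++i)
        if (predicted[i] == prob.y[i])
            ++correct;
    std::printf("5-fold CV accuracy: %.1f%%\n", 100.0 * correct / prob.l);
}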
Once found, the optimal parameters were transcribed to the RapidMiner product
[RapidMiner ] in order to produce a confusion matrix of CA of the type shown in Figure
5.7. There were slight differences between the LIBSVM and RapidMiner results, which
were very likely due to differences in the cross-validation sampling algorithms
and random seeds between the products. Using the GNU C library, the default seed for
LIBSVM is 1, whereas RapidMiner uses a default value of 1992. Keeping the random
seed the same for each test in RapidMiner improves reproducibility. RapidMiner has
several algorithms of its own available and the stratified sampling type was used to
create random subsets while keeping class distributions constant.
The sample size in all of the experiments was small and the misclassification of just
one or two expressions has a relatively high impact on the CA. Thus, some caution
is needed in interpreting results with regard to CA where differences of only a few
percentage points exist, and throughout this summary such differences are ignored. In
the table column headings, “true” denotes the actual or real classification and “pred.”,
abbreviated from predicted, denotes the derived classification.
5.4.1 Experiment 1
The first experiment was to determine if the NXS system could be trained to differ-
entiate between prototypical facial expressions labelled as ‘Fear’, ‘Anger’, ‘Happy’,
‘Sad’, ‘Surprise’, and ‘Neutral’ from the Cohn-Kanade database. The recognition re-
sults using shape, shape and Gabor magnitudes concatenated, shape and AAM texture
parameters concatenated, Gabor magnitudes and AAM texture parameters are given in
Figures 5.7, and using R1, R2 and R3 shape in Figure 5.8.
Classification using the shape concatenated with the Gabor magnitudes yielded the
best overall CA. Given their holistic nature, one would have thought that the shape
concatenated with the AAM texture parameters would have performed better than the
rest. Perhaps there was an advantage of having all of the image patches set to a fixed
canonical size, as was the case with the Gabor filter pre-processing. There were no
exceptionally performing feature sets that provided a CA much higher than the others.
Of the CA results obtained using R1, R2 and R3 (Figure 5.8), the eyebrow (R1)
shape features achieved a 91% accuracy in the prediction of Anger. In nearly all of the
individual results, where surprise was misclassified, it was most often misclassified as
fear. Interestingly, the converse was not true.
Anger was most often the expression with the highest CA, regardless of the feature
sets that were used to build the classifier. One might have expected that Fear and
Surprise would most often be confused [Russell 94]; however, this is not evident in
Tables 5.7 and 5.8.
true Anger true Fear true Happy true Neutral true Surprise true Sad class precision
pred. Anger 27 3 1 5 0 0 75%
pred. Fear 2 10 0 0 3 0 66%
pred. Happy 0 3 25 1 1 0 83%
pred. Neutral 3 0 4 29 1 6 67%
pred. Surprise 0 0 0 0 16 0 100%
pred. Sad 1 0 1 1 0 11 79%
class recall 82% 63% 81% 81% 76% 65%
true Anger true Fear true Happy true Neutral true Surprise true Sad class precision
pred. Anger 28 0 1 2 0 1 88%
pred. Fear 2 12 0 0 2 0 75%
pred. Happy 0 3 27 1 0 0 87%
pred. Neutral 3 1 2 33 1 1 80%
pred. Surprise 0 0 0 0 17 0 100%
pred. Sad 0 0 1 0 1 15 88%
class recall 85% 75% 87% 92% 81% 88%
(c) Confusion matrix of recognition using shape and AAM texture parameters
Accuracy: 83%
true Anger true Fear true Happy true Neutral true Surprise true Sad class precision
pred. Anger 27 1 1 2 0 0 87%
pred. Fear 2 13 1 0 2 0 72%
pred. Happy 0 2 25 1 2 0 83%
pred. Neutral 4 0 4 32 0 3 74%
pred. Surprise 0 0 0 0 17 0 100%
pred. Sad 0 0 0 1 0 14 93%
class recall 82% 81% 81% 89% 81% 82%
(d) Confusion matrix of recognition using Gabor magnitudes in eyebrow, eye and mouth regions
Accuracy: 82%
true Anger true Fear true Happy true Neutral true Surprise true Sad class precision
pred. Anger 28 1 1 2 0 0 88%
pred. Fear 2 11 0 0 1 0 79%
pred. Happy 0 0 25 3 1 0 86%
pred. Neutral 3 2 5 30 2 1 70%
pred. Surprise 0 2 0 0 17 0 89%
pred. Sad 0 0 0 1 0 16 94%
class recall 85% 69% 81% 83% 81% 94%
(e) Confusion matrix of recognition using AAM texture parameters from entire face
Accuracy: 79%
true Anger true Fear true Happy true Neutral true Surprise true Sad class precision
pred. Anger 27 3 1 3 0 3 73%
pred. Fear 2 10 0 0 2 0 71%
pred. Happy 0 1 26 1 2 0 87%
pred. Neutral 4 2 4 31 1 2 70%
pred. Surprise 0 0 0 0 15 0 100%
pred. Sad 0 0 0 1 1 12 86%
class recall 82% 63% 84% 86% 71% 71%
true Anger true Fear true Happy true Neutral true Surprise true Sad class precision
pred. Anger 30 1 1 2 0 1 86%
pred. Fear 0 8 0 0 2 1 73%
pred. Happy 0 2 23 3 0 1 79%
pred. Neutral 3 2 7 28 2 2 64%
pred. Surprise 0 2 0 2 16 0 80%
pred. Sad 0 1 0 1 1 12 80%
class recall 91% 50% 74% 78% 76% 71%
true Anger true Fear true Happy true Neutral true Surprise true Sad class precision
pred. Anger 29 0 2 3 1 0 83%
pred. Fear 0 11 0 0 2 2 73%
pred. Happy 3 2 26 2 0 0 79%
pred. Neutral 1 1 3 28 2 3 74%
pred. Surprise 0 1 0 1 16 0 89%
pred. Sad 0 1 0 2 0 12 80%
class recall 88% 69% 84% 78% 76% 71%
true Anger true Fear true Happy true Neutral true Surprise true Sad class precision
pred. Anger 24 2 1 5 1 2 69%
pred. Fear 1 10 0 0 3 0 71%
pred. Happy 3 1 25 1 1 0 81%
pred. Neutral 4 1 5 29 0 2 71%
pred. Surprise 1 2 0 0 16 0 84%
pred. Sad 0 0 0 1 0 13 93%
class recall 73% 63% 81% 81% 76% 76%
Table 5.8: Experiment 1 - Recognition results using shape from eyebrow (R1), eye
(R2) and mouth (R3) regions
5.4.2 Experiment 2
The second experiment was to determine if the system could differentiate between
facial expressions nominated as anxiety against those nominated as fear. The fitting
process did not work well with 3 images (this was likely due to the relatively small
number of images being used for AAM training) and, for expediency, the images were
removed from the training set. The revised number of images used in the exercise is
shown in Figure 5.9. In total, there were 31 images selected for the experiment - 9
male and 22 female.
The recognition results using shape, shape and Gabor magnitudes concatenated,
shape and AAM texture parameters concatenated, Gabor magnitudes and AAM texture
parameters are given in Figures 5.10, and using R1, R2 and R3 shape in Figure 5.11.
Overall, the CA was low. Shape alone, and individual R1, R2 and R3 shapes,
yielded the best CAs. Classifiers that were built using texture features did not per-
form nearly as well as those built with shape features. One surprising result was the
recognition performance using the shape extracted from the eyebrow region presented
at 5.10(a). One could theorise that this was because poll participants made more use
of the eyebrow region than the eye and mouth in their assessment of the expression.
Based on the prior discussion of anxious features at Subsection 2.5.2, it would
seem that the less exaggerated brow movements associated with an anxious expression would be
a discriminating factor between fear and anxiety expressions. This, of course, is quite
speculative, given the small sample size and the fact that there were only 14 poll
participants.
Another explanation considered was that it was simply due to the efficacy of the SVM processing.

(a) Confusion matrix of recognition using shape only - Accuracy: 70%
true Fear true Anxious class precision
pred. Fear 10 3 77%
pred. Anxious 6 12 67%
class recall 63% 80%

(b) Confusion matrix of recognition using shape and Gabor magnitudes - Accuracy: 68%
true Fear true Anxious class precision
pred. Fear 10 4 71%
pred. Anxious 6 11 65%
class recall 63% 73%

true Fear true Anxious class precision
pred. Fear 8 5 62%
pred. Anxious 8 10 56%
class recall 50% 67%

true Fear true Anxious class precision
pred. Fear 14 11 56%
pred. Anxious 2 4 67%
class recall 88% 27%

To examine the phenomenon further, two post-hoc experiments were
devised reusing the eyebrow region (R1) data. The first was to randomly change the
class labels in the samples that had been labelled as anxious and fear and to re-run the
experiment. The results are shown in Figure 5.11(a). This resulted in a much lower CA
- 68%.
The second post-hoc experiment was to test how well regression analysis would
separate the classes. Epsilon Support Vector Regression (SVR) was used and the op-
timal parameters determined using an alternative grid search program.8 The result is
shown in Figure 5.11(b). This time the CA was lower - 65%.
8 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#grid_parameter_search_for_regression, last accessed 1 March 2010
(a) Confusion matrix of recognition using eyebrow region (R1) shape - Accuracy: 90%
true Fear true Anxious class precision
pred. Fear 15 2 88%
pred. Anxious 1 13 93%
class recall 94% 87%

(b) Confusion matrix of recognition using eye region (R2) shape - Accuracy: 78%
true Fear true Anxious class precision
pred. Fear 13 4 76%
pred. Anxious 3 11 79%
class recall 81% 73%
Table 5.11: Experiment 2 - Recognition results using shape from eyebrow (R1), eye
(R2) and mouth (R3) regions
5.4.3 Experiment 3
The third experiment was to establish how well the facial expressions of anxiety could
be classified when infused into a larger set of emotional expressions including ‘Fear’,
‘Anger’, ‘Happy’, ‘Sad’, ‘Surprise’, and ‘Neutral’. Thus, the focus of the experiment
was not the overall classification but to establish if an anxious expression could be
distinguished from 6 prototypical expressions, which included the fear expressions
used in Experiment 2.
The recognition results using shape, shape and Gabor magnitudes concatenated,
shape and AAM texture parameters concatenated, Gabor magnitudes and AAM texture
parameters are given in Figures 5.13, and using R1, R2 and R3 shape in Figure 5.14. It
was anticipated that fear and anxiety would have the lowest CA and this was confirmed
in the CA of every classifier. One might have expected that fear would have been mis-
classified most often as anxious, rather than any other expression, and vice-versa but
this was not the case. Whilst there was a slight tendency towards anxious expressions
being misclassified as fear, they were also misclassified as every other expression, other
than those labelled as ‘Sad’.
Use of shape concatenated with Gabor magnitudes produced the best overall results,
despite the CA performance of the AAM texture parameters being slightly lower than
that of the Gabor magnitudes. However, as stated previously, with such a small sample,
CA differences of just a few percentage points are within reason.
true Anger true Fear true Happy true Neutral true Surprise true Sad true Anxious class precision
pred. Anger 29 1 1 1 0 0 0 91%
pred. Fear 1 10 0 0 3 0 3 59%
pred. Happy 2 1 28 0 0 0 2 85%
pred. Neutral 1 0 1 33 1 7 1 75%
pred. Surprise 0 2 0 0 17 0 0 89%
pred. Sad 0 0 1 2 0 10 0 77%
pred. Anxious 0 2 0 0 0 0 9 82%
class recall 88% 63% 90% 92% 81% 59% 60%
true Anger true Fear true Happy true Neutral true Surprise true Sad true Anxious class precision
pred. Anger 31 0 1 3 0 1 0 86%
pred. Fear 0 12 0 0 3 0 4 63%
pred. Happy 0 1 28 0 0 0 2 90%
pred. Neutral 2 1 1 32 0 3 1 80%
pred. Surprise 0 1 0 0 18 0 0 95%
pred. Sad 0 0 1 1 0 13 0 87%
pred. Anxious 0 1 0 0 0 0 8 89%
class recall 94% 75% 90% 89% 86% 76% 53%
(c) Confusion matrix of recognition using shape and AAM texture parameters
Accuracy: 82%
true Anger true Fear true Happy true Neutral true Surprise true Sad true Anxious class precision
pred. Anger 29 2 1 1 0 0 0 88%
pred. Fear 0 10 0 0 2 0 3 67%
pred. Happy 0 0 28 1 1 0 3 85%
pred. Neutral 4 0 2 33 0 3 2 75%
pred. Surprise 0 2 0 0 18 0 0 90%
pred. Sad 0 0 0 1 0 14 0 93%
pred. Anxious 0 2 0 0 0 0 7 78%
class recall 88% 62% 90% 92% 86% 82% 47%
(d) Confusion matrix of recognition using Gabor magnitudes in eyebrow, eye and mouth regions
Accuracy: 75%
true Anger true Fear true Happy true Neutral true Surprise true Sad true Anxious class precision
pred. Anger 28 1 1 4 0 1 0 80%
pred. Fear 0 6 0 0 2 0 4 50%
pred. Happy 0 1 25 3 0 0 2 81%
pred. Neutral 5 2 3 28 2 1 1 67%
pred. Surprise 0 1 0 0 17 0 1 89%
pred. Sad 0 0 0 1 0 15 0 94%
pred. Anxious 0 5 2 0 0 0 7 50%
class recall 85% 38% 81% 78% 81% 88% 47%
(e) Confusion matrix of recognition using AAM texture parameters from entire face
Accuracy: 76%
true Anger true Fear true Happy true Neutral true Surprise true Sad true Anxious class precision
pred. Anger 29 1 1 1 1 3 1 78%
pred. Fear 1 9 0 1 3 0 5 47%
pred. Happy 0 2 27 2 0 0 0 87%
pred. Neutral 3 1 3 31 2 3 1 70%
pred. Surprise 0 1 0 0 15 0 1 88%
pred. Sad 0 0 0 1 0 11 0 92%
pred. Anxious 0 2 0 0 0 0 7 78%
class recall 88% 56% 87% 86% 71% 65% 47%
true Anger true Fear true Happy true Neutral true Surprise true Sad true Anxious class precision
pred. Anger 30 2 1 2 0 1 0 83%
pred. Fear 0 8 0 0 4 3 2 47%
pred. Happy 2 1 25 1 0 0 3 78%
pred. Neutral 1 0 3 32 0 2 2 80%
pred. Surprise 0 1 0 0 15 0 0 94%
pred. Sad 0 2 0 1 2 11 0 69%
pred. Anxious 0 2 2 0 0 0 8 67%
class recall 91% 50% 81% 89% 71% 65% 53%
true Anger true Fear true Happy true Neutral true Surprise true Sad true Anxious class precision
pred. Anger 27 0 2 3 0 0 0 84%
pred. Fear 0 9 0 0 2 1 2 64%
pred. Happy 1 1 26 1 0 0 1 87%
pred. Neutral 4 1 3 30 5 5 1 61%
pred. Surprise 0 1 0 1 14 0 3 74%
pred. Sad 1 1 0 1 0 11 0 79%
pred. Anxious 0 3 0 0 0 0 8 73%
class recall 82% 56% 84% 83% 67% 65% 53%
true Anger true Fear true Happy true Neutral true Surprise true Sad true Anxious class precision
pred. Anger 27 1 2 3 1 2 1 73%
pred. Fear 0 10 0 0 2 0 3 67%
pred. Happy 0 0 24 1 0 0 2 89%
pred. Neutral 6 0 4 31 0 2 1 70%
pred. Surprise 0 1 0 0 16 0 1 89%
pred. Sad 0 1 0 1 1 13 0 81%
pred. Anxious 0 3 1 0 1 0 7 58%
class recall 82% 63% 77% 86% 76% 76% 47%
Table 5.14: Experiment 3 - Recognition results using shape from eyebrow (R1), eye
(R2) and mouth (R3) regions
The ultimate objective of the fourth experiment was to test whether the classifier built
from the Cohn-Kanade database of images, in Experiment 1, could be used to predict
the facial expressions in the Feedtum database of images, which were recorded using
different subjects and under different lighting conditions. Such cross-database testing
is rarely undertaken, probably because of the difficulty of the exercise. Classifiers trained
using shape vectors, shape and Gabor texture, Gabor texture, and eyebrow (R1), eye (R2)
and mouth (R3) regions were used.
As a preliminary step, a CA test was conducted using an SVM built from prototyp-
ical facial expressions of ‘Fear’, ‘Anger’, ‘Happy’, ‘Sad’, ‘Surprise’, and ‘Neutral’,
sourced entirely from images within the Feedtum database. This was so that the CA
could be compared to that attained in Experiment 1, which used the Cohn-Kanade
database. This is referred to here as the “baseline” experiment. The “baseline” was
acquired by SVM classification, similar to Experiment 1, using 5-fold cross validation.
At first, the AAM that had been built for Experiment 1, using images from the Cohn-
Kanade database, was used to fit the landmark points to images from the Feedtum
database. As can be seen in Figure 5.9(a), the landmark points were not placed perfectly.
This was likely due to the relatively small number of samples from the Cohn-Kanade
database (≈ 200) used to build the initial AAM (in the order of 500-1,000 would
be preferable). Since the object of the experiments was not to test the AAM per se, a
specific Feedtum AAM was built to overcome this. The accuracy of fitting improved
with the new model and an example is shown at Figure 5.9(b).
The baseline recognition results using shape, shape and Gabor magnitudes concate-
nated, shape and AAM texture parameters concatenated, Gabor magnitudes and AAM
texture parameters are given in Figures 5.15, and using R1, R2 and R3 shape in Figure
5.16.
Figures 5.15 and 5.16 show that the CA was lower than in Experiment 1. However,
the number of samples is smaller and a relatively larger variation in the CA will result
from each misclassification. Notwithstanding that, a plausible reason for the lower CA
is that the expressions portrayed in the Feedtum database are much less pronounced.
Figures 5.10(a) and 5.10(b) are reported, within the Feedtum database transcriptions,9
as the apex of expressions of anger and fear respectively. One would think that human
judges might have difficulty in correctly classifying these expressions. In addition, the
intensity of the expressions is clearly in contrast to that shown in Figure 5.4.
9
Feedtum metadata transcriptions at http://www.mmk.ei.tum.de/˜waf/fgnet/
metadata-feedtum.csv, image files are anger/0003 3/p 086.jpg and fear/0007 2/p 110.jpg, last
access 1 March 2010
(a) Feedtum image fitted using general AAM trained on Cohn-Kanade database
(b) Feedtum image fitted using specific AAM trained on Feedtum database
Figure 5.9: Experiment 4 - Images fitted using generalised and specific AAMs
true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 7 0 0 0 1 0 88%
pred. Fear 0 6 0 1 1 1 67%
pred. Happy 1 2 10 2 1 1 59%
pred. Sad 0 1 0 2 2 4 22%
pred. Surprise 0 0 2 1 5 0 63%
pred. Neutral 1 2 2 4 1 9 47%
class recall 78% 55% 71% 20% 45% 60%
true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 8 0 0 0 0 1 89%
pred. Fear 0 8 0 0 2 1 73%
pred. Happy 0 1 11 0 1 1 79%
pred. Sad 0 1 0 6 0 1 75%
pred. Surprise 0 0 2 1 8 0 73%
pred. Neutral 1 1 1 3 0 11 65%
class recall 89% 73% 79% 60% 73% 73%
(c) Confusion matrix of recognition using shape and AAM texture parameters
Accuracy: 54%
true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 7 2 0 0 2 0 64%
pred. Fear 1 4 2 1 1 1 40%
pred. Happy 0 1 9 0 2 0 75%
pred. Sad 0 0 1 4 1 4 40%
pred. Surprise 0 1 2 1 4 0 50%
pred. Neutral 1 3 0 4 1 10 53%
class recall 78% 36% 64% 40% 36% 67%
(d) Confusion matrix of recognition using Gabor magnitudes in eyebrow, eye and mouth regions
Accuracy: 64%
true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 7 0 0 0 0 3 70%
pred. Fear 1 7 0 1 3 1 54%
pred. Happy 0 0 11 0 1 1 85%
pred. Sad 0 2 0 5 0 2 56%
pred. Surprise 0 1 1 2 7 0 64%
pred. Neutral 1 1 2 2 0 8 57%
class recall 78% 64% 79% 50% 64% 53%
(e) Confusion matrix of recognition using AAM texture parameters from entire face
Accuracy: 64%
true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 7 1 3 2 0 1 50%
pred. Fear 0 6 0 3 0 1 60%
pred. Happy 1 1 10 0 0 0 83%
pred. Sad 0 0 0 2 2 1 40%
pred. Surprise 0 0 0 0 8 0 100%
pred. Neutral 1 3 1 3 1 12 57%
class recall 78% 55% 71% 20% 73% 80%
true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 6 1 0 0 1 0 75%
pred. Fear 0 6 2 1 3 1 46%
pred. Happy 0 2 8 1 2 2 53%
pred. Sad 1 2 0 2 0 6 18%
pred. Surprise 1 0 2 1 4 0 50%
pred. Neutral 1 0 2 5 1 6 40%
class recall 67% 55% 57% 20% 36% 40%
true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 8 1 0 0 1 0 80%
pred. Fear 0 4 1 0 5 2 33%
pred. Happy 0 1 10 1 1 0 77%
pred. Sad 0 2 1 5 0 4 42%
pred. Surprise 0 2 2 2 4 0 40%
pred. Neutral 1 1 0 2 0 9 69%
class recall 89% 36% 71% 50% 36% 60%
true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 7 0 0 0 2 0 78%
pred. Fear 1 5 1 0 0 3 50%
pred. Happy 0 1 11 0 2 0 79%
pred. Sad 0 2 0 3 1 3 33%
pred. Surprise 1 0 2 1 6 0 60%
pred. Neutral 0 3 0 6 0 9 50%
class recall 78% 45% 79% 30% 55% 60%
Table 5.16: Experiment 4 Baseline Feedtum database - Recognition results using shape
from eyebrow (R1), eye (R2) and mouth (R3) regions
Figure 5.10: (a) Feedtum image of anger expression; (b) Feedtum image of fear expression
Next, an attempt was made to automatically classify expressions in images from the
Feedtum database against the SVM models built in Experiment 1 using images from the
Cohn-Kanade database. As in the first part of the experiment, the images were fitted
with the Feedtum-specific AAM.
The PCA coefficients, after performing PCA on the Gabor filter magnitudes, were
obtained by projecting into the eigenspace that was created in Experiment 1. Simi-
larly, the scaling parameters obtained in Experiment 1 were applied when scaling the
features obtained in Experiment 4.
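The essential point is that no new eigenspace or scaling range is fitted on the Feedtum data; the transforms learned in Experiment 1 are simply re-applied. A minimal sketch of that idea, using scikit-learn and synthetic stand-in arrays rather than the actual implementation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Stand-ins for the real feature matrices (rows = samples, columns = Gabor magnitudes)
gabor_ck = rng.normal(size=(200, 480))   # "Experiment 1" training data (Cohn-Kanade)
gabor_ft = rng.normal(size=(60, 480))    # "Experiment 4" test data (Feedtum)

# Fit the eigenspace and the per-feature scaling on the training data only
pca = PCA(n_components=30).fit(gabor_ck)
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(pca.transform(gabor_ck))

# Test features are *projected* into the Experiment 1 eigenspace and scaled with the
# Experiment 1 parameters; nothing is re-fitted on the Feedtum data
feedtum_features = scaler.transform(pca.transform(gabor_ft))
# feedtum_features can now be passed to the SVM models trained in Experiment 1
```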
The recognition results for the second part of Experiment 4 using shape, shape and Gabor magnitudes concatenated, shape and AAM texture parameters concatenated, Gabor magnitudes, and AAM texture parameters are given in Table 5.18, and those using R1, R2 and R3 shape in Table 5.19. The respective chance levels are shown in Table 5.17.
The CA results from this experiment were all very low. One reason that was con-
sidered is the practice of scaling the data prior to building the classifier. Scaling is
used not only in SVM, but also in Neural Network classification. The case for scaling
is presented in [Hsu 03]: “We recommend linearly scaling each attribute to the range [−1, +1] or [0, 1].”
A key requirement in the use of scaling is that the same method used to scale the
training data is applied to the test data. Operationally, the scaling parameter that is
found for each feature during training needs to be saved and then applied to the test
data. One of the criticisms of this approach is that it “overtrains” the data. It has the
potential to increase the CA in exercises where the training set is used for testing after
the entire training set has been scaled, i.e. no new previously unseen data is introduced
in testing. This is sometimes referred to as “peeping the data”. When applied to new
and previously unseen data, there is no guarantee that the scaling parameters will have
a valid relationship, as might have been the case in the second part of Experiment 4.
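As a concrete illustration of this operational requirement (a sketch with synthetic data, not the thesis code), per-attribute linear scaling to [−1, +1] records each feature's training minimum and maximum and re-applies them, unchanged, to the test data:

```python
import numpy as np

def fit_scaling(X_train):
    """Record, per feature, the minimum and maximum seen in the training data."""
    return X_train.min(axis=0), X_train.max(axis=0)

def apply_scaling(X, lo, hi):
    """Linearly map each feature to [-1, +1] using the *training* min/max."""
    span = np.where(hi - lo == 0, 1.0, hi - lo)   # guard against constant features
    return 2.0 * (X - lo) / span - 1.0

# Stand-ins for real feature matrices (rows = samples)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))
X_test = rng.normal(loc=0.5, size=(30, 20))        # deliberately shifted "unseen" data

lo, hi = fit_scaling(X_train)                      # saved at training time ...
X_train_scaled = apply_scaling(X_train, lo, hi)
X_test_scaled = apply_scaling(X_test, lo, hi)      # ... and re-used, not re-estimated
# Test values may now fall outside [-1, +1]: there is no guarantee the training
# ranges have a valid relationship to new, previously unseen data.
```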
Experiments 1, 2, 3, and 4-baseline were all conducted using scaled and normalised
data. To get some idea of the effect that scaling might have had on the second part of
Experiment 4, an informal, post-hoc classification exercise was conducted, whereby
the data was normalised but not scaled. The results, which are not included here, did
not vary significantly and scaling was ruled out as a major contributor to the poor CA
performance.
true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 6 6 1 8 4 9 18%
pred. Fear 2 2 1 0 2 2 22%
pred. Happy 0 1 11 0 2 3 65%
pred. Sad 0 0 0 0 0 0 0%
pred. Surprise 1 2 1 0 3 0 43%
pred. Neutral 0 0 0 2 0 1 33%
class recall 67% 18% 79% 0% 27% 67%
true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 9 2 1 4 0 8 38%
pred. Fear 0 5 3 2 4 4 28%
pred. Happy 0 1 10 1 3 2 59%
pred. Sad 0 0 0 0 0 0 0%
pred. Surprise 0 3 0 0 4 0 57%
pred. Neutral 0 0 0 3 0 1 25%
class recall 100% 45% 71% 0% 36% 67%
(c) Confusion matrix of recognition using Gabor magnitudes in eyebrow, eye and mouth regions
Accuracy: 17%
true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 6 1 5 0 0 0 50%
pred. Fear 3 5 1 6 5 9 17%
pred. Happy 0 4 1 4 6 6 5%
pred. Sad 0 0 0 0 0 0 0%
pred. Surprise 0 1 0 0 0 0 0%
pred. Neutral 0 0 7 0 0 0 0%
class recall 67% 45% 7% 0% 0% 0%
true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 0 0 5 5 0 9 0%
pred. Fear 0 4 0 2 0 1 57%
pred. Happy 0 1 2 1 0 1 40%
pred. Sad 1 0 1 0 1 0 0%
pred. Surprise 8 5 5 1 10 2 32%
pred. Neutral 0 1 1 1 0 2 40%
class recall 0% 36% 14% 0% 92% 13%
true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 3 4 11 9 2 10 8%
pred. Fear 6 4 1 1 5 5 18%
pred. Happy 0 1 0 0 0 0 0%
pred. Sad 0 0 0 0 1 0 0%
pred. Surprise 0 0 0 0 3 0 100%
pred. Neutral 0 2 2 0 0 0 0%
class recall 33% 36% 0% 0% 27% 0%
true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 6 5 10 2 2 6 19%
pred. Fear 0 0 2 0 3 0 0%
pred. Happy 0 0 1 0 0 0 100%
pred. Sad 1 2 0 6 0 6 40%
pred. Surprise 0 0 0 0 1 0 100%
pred. Neutral 2 4 1 2 5 3 18%
class recall 67% 0% 7% 60% 9% 20%
Table 5.19: Experiment 4 - Recognition results using shape from eyebrow (R1), eye
(R2) and mouth (R3) regions against SVMs built in Experiment 1
5.5 Conclusions and Evaluation
5.5.1 Hypothesis 1
Fear and anxious expressions were sourced from a set of expressions having action
units corresponding to both fear and anxiety. In the absence of a verified database
of anxious expressions, the Cohn-Kanade database was a useful starting point, but it
is not clear how the slightly pronounced nature of the fear expressions might have
affected the results. A larger, more ecologically valid set of training data, both for the
fearful and the anxious expressions, would provide a more reliable set of results to
support this hypothesis. Obviously, procuring such a database is a major undertaking.
Nevertheless, the results from Experiment 2 suggest that anxiety can be automatically
distinguished from fear.
The results from Experiment 2 suggest that shape is a much better discriminator than texture in the differentiation of fear and anxious expressions. Although much more effort is needed to draw a conclusion, human judges may rely, to a large extent, on the shape of the eyebrow region when trying to differentiate between fear and anxious expressions.
5.5.2 Hypothesis 2
Anxious expressions were not classified with a high degree of accuracy. In Ex-
periment 3, although they were quite often misclassified as fear, they were also misclassified as every other expression except those labelled as ‘Sad’. Again, a much larger sample set is needed to draw a conclusion. One possibility to improve accuracy would be to use an ensemble of classifiers: first, recognising a more broadly labelled set of expressions which combined both fear and anxiety under one label, and, second, performing a binary classification between fear and anxiety using just shape, as sketched below.
10 “Differentiated” is defined as better than chance.
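As an illustration of the two-stage scheme suggested above (a sketch under assumed, synthetic feature and label arrays, not an implementation from the thesis), the first stage could classify a merged 'Fear/Anxious' label alongside the other expressions, and the second stage could separate fear from anxiety using shape features only:

```python
import numpy as np
from sklearn.svm import SVC

# Stand-ins for training data: full feature vectors, shape-only vectors, labels
rng = np.random.default_rng(0)
labels = ["Anger", "Fear", "Happy", "Neutral", "Surprise", "Sad", "Anxious"]
y = np.array(labels * 10)                      # 10 hypothetical samples per class
X_full = rng.normal(size=(len(y), 60))         # shape + texture features
X_shape = X_full[:, :24]                       # shape-only features

# Stage 1: recognise a broadly labelled set with fear and anxiety merged
y_merged = np.array(["Fear/Anxious" if l in ("Fear", "Anxious") else l for l in y])
stage1 = SVC(kernel="rbf").fit(X_full, y_merged)

# Stage 2: binary fear-vs-anxiety classifier trained on shape features only
mask = y_merged == "Fear/Anxious"
stage2 = SVC(kernel="rbf").fit(X_shape[mask], y[mask])

def classify(x_full, x_shape):
    label = stage1.predict([x_full])[0]
    if label == "Fear/Anxious":
        label = stage2.predict([x_shape])[0]   # refine using shape alone
    return label

print(classify(X_full[0], X_shape[0]))
```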
How does facial expression recognition performance, i.e. CA, vary when using:
• the location of facial landmark points concatenated with AAM texture parameters;
Shape concatenated with Gabor magnitudes also produced the best CA in the “base-
line” Experiment 4. This was a slightly odd result where, even though the overall CA
of Gabor magnitudes and AAM texture parameters was the same, concatenating Gabor
magnitudes to shape improved the CA (achieved by shape) by 20%, yet concatenating AAM texture parameters to shape resulted in a slight decrease in CA. One could speculate about the manner in which the feature data from the Gabor magnitudes better complements the shape data. However, the topic of how best to fuse heterogeneous
feature sets, i.e. shape and texture data, in order to achieve the best CA needs far more
investigation.
In the NXS system, described in Chapter 4, the face is subdivided into three regions: the eyebrow (R1), eye (R2) and mouth (R3) regions.
Knowing which facial regions are used by humans to differentiate facial expres-
sions would be useful for the field of facial expression recognition.
In summary, despite the lack of samples and natural data, the results suggest that the
recognition of anxious expressions is possible but becomes more difficult when fear-
ful expressions are also present. The difficulty increases when more primary expres-
sions are added to the classification problem. The exercise demonstrates that facial
expression classification is, in general, a difficult task and, in some situations, in the
absence of contextual information and/or temporal data revealing facial dynamics, may
not even be possible. Moreover, even with contextual and temporal evidence present,
the fact that a prototypical expression can take many forms, e.g. a ‘happy’ expression
can be portrayed with or without opening the mouth, compounds the degree of diffi-
culty. Any attempt at recognition may not be reliable without the presence of semantic
information.
The second part of Experiment 4 demonstrated that, even using two popular and
creditable databases, a classifier built from images from one database did not achieve a high CA in predicting expressions from the other, despite Gabor filtering being reasonably invariant to lighting conditions. This echoes the preliminary results reported in [Whitehill 09] (albeit much worse).
In this instance, it seems that the notion of using a database recorded under one
set of conditions to recognise expression in a database recorded under a different set
• automatically selecting the best features to convolve with the Gabor filters.
There are many proposed schemes for doing this [Littlewort 06, Shen 05, Zhou 06,
Zhou 09] (although in [Zhou 06, Zhou 09] the processing is applied to reduce the
Gabor feature set after convolution); or
• by optimising the Gabor wavelet basis for convolution, e.g. use of genetic algo-
rithms (GA)
The GA approach has shown some promise in addressing both problems but,
ironically, it too imposes a computational burden [Li 07, Tsai 01, Wang 02].
In this set of experiments, all three facial regions were convolved using the same
basis for convolution and no attempt was made to optimise for individual facial re-
gions. [Ro 01] suggests a way to improve Gabor filter performance is to reduce the
computation load by selecting the Gabor filter basis in a pattern-dependent way. Given
the similarity of the image patches used in this set of experiments, it is difficult to
envisage that using specific Gabor filter parameters would yield a significantly higher
CA or faster convolution times.
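For concreteness, and not as the exact configuration used in NXS, a Gabor magnitude response for a facial-region patch can be computed with a small bank of kernels; OpenCV's getGaborKernel returns the real (cosine) part, so the magnitude is formed from a quadrature pair. The patch below is a synthetic stand-in:

```python
import cv2
import numpy as np

def gabor_magnitudes(patch, ksize=21, sigma=4.0, lambd=10.0, gamma=0.5, orientations=8):
    """Convolve a grayscale facial-region patch with a bank of Gabor kernels and
    return one magnitude image per orientation."""
    responses = []
    for k in range(orientations):
        theta = k * np.pi / orientations
        # Quadrature pair: cosine (psi = 0) and sine (psi = pi/2) kernels
        k_re = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma, psi=0)
        k_im = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma, psi=np.pi / 2)
        re = cv2.filter2D(patch, cv2.CV_32F, k_re)
        im = cv2.filter2D(patch, cv2.CV_32F, k_im)
        responses.append(cv2.magnitude(re, im))
    return np.stack(responses)

# Stand-in for a cropped eyebrow/eye/mouth region (the real patches come from the AAM fit)
patch = np.random.rand(64, 64).astype(np.float32)
mags = gabor_magnitudes(patch)              # shape: (orientations, 64, 64)
features = mags.reshape(len(mags), -1)      # flattened, ready for PCA / the SVM
```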
During the setup stage of the experiments, a great deal of effort was spent trying
to optimise the system in areas such as Gabor filter settings, facial region image sizes,
data scaling and PCA process. As discussed above, much of this calibration work came
down to trial and error, and the quantitative impacts are not reported here. The impact
on percentage CA from any of these measures was not as significant as changes to
the SVM parameters, and not comparable to the variation in Experiment 2 between
Albert Ellis
6 Depression Analysis Using Computer Vision
Facial expressions convey information about a person’s internal state. The ability to
correctly interpret another’s facial expressions is important to the quality of social in-
teractions. In phatic speech, in particular, the affective facial processing loop, i.e. inter-
preting another’s facial expressions then responding with one’s own facial expression,
plays a critical role in the ability to form and maintain relationships.
The affective facial processing loop was discussed in Subsection 2.6.2 of Chapter 3.
A predisposition to misinterpret expressions often underlies dysphoria, e.g. anxiety and
depression. More specifically, the impaired inhibition of negative affective material
could be an important cognitive factor in depressive disorders. Recent studies using
fMRI have provided some evidence that the severity of depression in MDD groups
correlates with increased neural responses to pictures with negative content [Fu 08,
Lee 07]. In turn, this bias to favour negative material by depressed patients has been
shown to be signaled in their resulting facial expressions.
This chapter explores the feasibility of using state-of-the-art, low-cost, unobtrusive
techniques to measure facial activity and expressions, and then applies them to a real-
life application. The motivation for the experiments presented in this chapter is to:
• test if automatic facial activity and expression analysis could be used in a real-
life application.
Section 6.2 states the hypotheses that are to be tested. Section 6.3 describes the
methodology used in the experiments. The results are presented in Section 6.4 and
Section 6.5 concludes the chapter.
6.2 Hypotheses
To sharpen the focus of the experiments, the following hypotheses and questions were
generated on the basis of the motivation for the experiments and the literature review
in Chapter 3:
1. When viewing the stimuli, patients with a clinical diagnosis of unipolar melan-
cholic depression will show less facial activity than control subjects and patients
with other types of depression; and
2. When viewing the stimuli, patients with a clinical diagnosis of unipolar melan-
cholic depression will show a smaller repertoire of facial expressions than control
subjects and patients with other types of depression.
6.3 Methodology
Total participants: 16 controls and 11 patients.
Participant Ids take the form XX_G_CD_ID, where XX - “Co” for Control or “Pa” for Patient, G - Gender, CD - Diagnosis (patients only), ID - sequential Id number (controls and patients are numbered separately).
Watching movie clips Short movie segments of around two minutes each, some positive and some negative, are presented. With the exception of one clip, each movie
has previously been rated for its affective content [Gross 95].
Watching and rating International Affective Picture System (IAPS) pictures Pictures
from the IAPS [Lang 05] compilation are presented and participants rate each im-
age as either positive or negative. Reporting logs enable correlation of the image
presentations, the participant’s ratings, and their facial activity.
Reading sentences containing affective content Two sets of sentences used in the
study by [Brierley 07, Medforda 05] are read aloud. The first set contains emo-
tionally arousing “target” words. The second set repeats the first, with the “tar-
Answering specific questions Finally, participants are asked to describe events that
had aroused significant emotions. For instance, ideographic questions such as,
“Describe an incident that made you feel really sad.”
It is important to note that this chapter, and the thesis, report only on the first section of the experimental paradigm, i.e. watching movie clips. The initial experimental setup consisted of the movies listed in Table 6.3, with the intended induced emotion shown in parentheses. The list is referred to as the “Old Paradigm”. After some subjects had participated in the experiment, the movie sequence was re-evaluated, and it was decided to add a “fear” sample and incorporate a longer “surprise” clip. This is shown in Table 6.4 and is referred to as the “New Paradigm”.
Table 6.3: Old Paradigm movie clips (intended emotion)
Bill Cosby (Happy)
The Champ (Sad)
Weather (Happy)
Sea of Love (Surprise)
Cry Freedom (Anger)
Table 6.4: New Paradigm movie clips (intended emotion)
Bill Cosby (Happy)
The Champ (Sad)
Weather (Happy)
Silence of the Lambs (Fear)
Cry Freedom (Anger)
The Shining (Fear)
Capricorn One (Surprise)
Figure 6.2 shows a control subject being recorded as he watches the movie Cry
Freedom. The interview, from a participant’s view, is shown in Figure 6.3. When the
interview is in progress, the Research Assistant can monitor the session on the laptops
shown in Figure 6.4. The laptop on the left in Figure 6.4 is displaying a frame from the
movie clip Bill Cosby, while the one on the right shows the recording of the participant.
Once the video has been recorded, analysis begins by capturing sample frames
from each video, which are used to 1) build a person-specific AAM for each person
(person-specific AAMs give better fitting quality [Gross 05] and there is no need at
this juncture to have generic AAMs); and 2) construct an SVM classifier for each per-
son’s emotional expressions (which are rated subjectively at this point in time). For
the reasons explained in Section 5.3.2 of Chapter 5, the IEBM method [Saragih 06]
of building the AAM and fitting the model to the image has been chosen, as was the
LIBSVM [Chang 01] implementation of SVM.
Once the AAM and SVM have been built, frames are then captured from the video
at 200 ms intervals (this seemed a reasonable choice of interval based, anecdotally, on
the speed of movement of facial features, however, there is no restriction on the rate).
As each frame is captured, frontal facial images are detected using the Viola and Jones
[Viola 01] technique to determine the global location of the face in the image. Next,
the AAM is used to track and measure the local shape and texture features. As described in Chapter 5, “shape” refers to the collective landmark points, which are captured as a set of normalised x, y Cartesian coordinates. The features are then used to classify the expressions using an SVM classifier [Chang 01]. All outputs are stored within the system to allow for post-processing, which is described in the next two Subsections.
Figure 6.3: Participant’s view of the interview (video clip - Silence of the Lambs)
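The thesis does not list this per-frame loop; purely as an indicative sketch, with the AAM fitting and SVM classification represented by hypothetical placeholder functions (the actual system used DemoLib/IEBM and LIBSVM) and a hypothetical video file name, the pipeline just described could be arranged as:

```python
import cv2

def fit_person_specific_aam(gray, face_box):
    """Placeholder for the AAM shape fit (DemoLib/IEBM in the actual system)."""
    raise NotImplementedError

def classify_expression_svm(landmarks):
    """Placeholder for the per-person LIBSVM expression classifier."""
    raise NotImplementedError

# Haar cascade shipped with opencv-python; video file name is hypothetical
detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture("participant_session.avi")
step_ms = 200                                   # sample one frame every 200 ms

results, t = [], 0.0
while True:
    cap.set(cv2.CAP_PROP_POS_MSEC, t)           # seek to the next 200 ms mark
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces):                              # Viola-Jones global face location
        landmarks = fit_person_specific_aam(gray, faces[0])
        results.append((t, landmarks, classify_expression_svm(landmarks)))
    t += step_ms
cap.release()
```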
With the raw feature data captured, Algorithm 4 is used to measure the collective
movement of the landmark points between each frame. Although not shown in the
algorithm, extreme movements are ignored if the movement falls outside of predefined
thresholds. This is to cater for situations where the face detection in a frame has failed
and the AAM “fitting” has not converged, which typically leaves the landmark points
spread around the image.
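Algorithm 4 itself is not reproduced here; the following is only a plausible sketch of the kind of inter-frame movement measure it describes, with the threshold test (handled outside the algorithm in the actual system) folded in for clarity:

```python
import numpy as np

def facial_activity(landmark_frames, max_step=25.0):
    """Sum the mean landmark displacement between consecutive frames.

    landmark_frames: sequence of (n_points, 2) arrays of normalised x, y coordinates.
    max_step: a hypothetical threshold; frames whose mean displacement exceeds it
              (e.g. a failed AAM fit scattering the points) are ignored, mirroring
              the thresholding described in the text.
    """
    activity = 0.0
    for prev, curr in zip(landmark_frames[:-1], landmark_frames[1:]):
        step = np.linalg.norm(curr - prev, axis=1).mean()
        if step <= max_step:          # ignore implausibly large jumps
            activity += step
    return activity
```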
The NXS system outputs the facial activity measurements for each subject into a file
of comma-separated values, which can then be imported to a third-party product for
further analysis, e.g. Excel.
The classified expression is stored within the system for each captured image. The
images, captured at a rate of one every 200 ms and marked up with the automatically
fitted landmark points (these are fitted with a person-specific AAM, built using only
a few frames for training), can be assembled as an image sequence and played as a
short video. This assists in verifying that the AAM has fitted properly, and consistently
between frames. The coloured slider below the images, shown in Figure 6.5, pro-
vides a visual representation of the facial expressions over the course of the interview.
The colours represent the classification: pink - happy, blue - sad and white - neutral. In Figure 6.5, the slider has been positioned at a period of happy expressions,
which coincides with the Bill Cosby film clip. This allows a visual confirmation that
the expression recognition has worked successfully. Each reconstructed participant
“movie” can be played individually or along with several other participant “movies”,
thus allowing a comparison of participants’ facial responses at a specific time in the
interview.
The NXS system outputs the list of classifications for each subject into a file of
comma-separated values, which can then be imported to a third-party product for fur-
ther analysis, e.g. Excel.
6.4 Presentation and Analysis of Data
6.4.1 Introduction
Two sets of results are presented, one for each paradigm, in the form of charts at the
end of the chapter. The data used to construct the charts is available in Appendix
B. The y axis of the facial activity diagrams, e.g. Figure 6.6, is simply an internally
derived measurement of movement, described later in Algorithm 4, and no unit of
measurement has been attached to it.
Figure 6.6 displays a stacked column chart of facial activity for every Old Paradigm
participant over the entire video clip session. Overall, control subjects have tended to
have a higher facial activity score than patients. Figure 6.7 displays a clustered column
chart of the same facial activity data as Figure 6.6, i.e. simply another view of that data. Figure 6.8 is a comparison of the accumulated facial activity over
time, across the entire series of movie clips. Each sub-figure in Figure 6.9 shows the
facial activity specific to the relevant movie clip.
Overall, control subject Co_m_04 had a low facial activity score, but his score during the “sad” stimuli was in keeping with the other controls. This was an interesting result, since, anecdotally, the “sad” movie clip (The Champ) seemed to evoke strong feelings in all of the other control subjects. On viewing Co_m_04’s recording, he seems to be of Asian appearance and it is not known if there was a cultural factor influencing the results.
Figure 6.10 shows the number of happy expressions displayed by each subject, for each film clip and over the entire series of clips. Figure 6.11 shows the number of sad expressions displayed by each subject, for each film clip and over the entire series of clips. Figure 6.12 shows the number of neutral expressions displayed by each subject, for each film clip and over the entire series of clips.
Figure 6.6: Old Paradigm - Stacked column chart comparing facial activity (Co - Control, Pa - Patient)
Figure 6.7: Old Paradigm - Clustered column chart comparing facial activity (Co - Control, Pa - Patient)
Figure 6.8: Old Paradigm - Line chart comparing accumulated facial activity (Co - Control, Pa - Patient)
Figure 6.9: Old Paradigm - Facial activity for each individual movie clip (one sub-figure per clip)
Figure 6.10: Old Paradigm - Number of happy expressions displayed by each subject, per movie clip
Figure 6.11: Old Paradigm - Number of sad expressions displayed by each subject, per movie clip
Figure 6.12: Old Paradigm - Number of neutral expressions displayed by each subject, per movie clip
Figure 6.13 displays a stacked column chart of facial activity. Although patient Pa_f_UP_BP2_08 has a very high score, examination of the video revealed that she displayed non-purposeful or habitual mouth movement throughout the recording. Two patients, Pa_f_UP-NonMel_09 and Pa_f_UP-NonMel_06, with a clinical diagnosis of unipolar non-melancholic depression, also had a high facial activity score. Patients Pa_f_UP-Mel_10 and Pa_m_UP-Mel_05, both diagnosed with unipolar melancholic depression, scored lowest on the facial activity scale. Patient Pa_m_PD_11, who had a Mini diagnosis of unipolar melancholic depression and a clinical diagnosis of panic disorder, had a low facial activity score. Figure 6.14 displays a clustered column chart of the same facial activity data as Figure 6.13. Each of the sub-figures in Figure 6.15 shows the facial activity specific to a movie clip.
Figure 6.16 shows the number of happy expressions displayed by each subject, for each film clip and over the entire series of clips. Figure 6.17 shows the number of sad expressions displayed by each subject, for each film clip and over the entire series of clips. Figure 6.18 shows the number of neutral expressions displayed by each subject, for each film clip and over the entire series of clips.
Figure 6.13: New Paradigm - Stacked column chart comparing facial activity (Co - Control, Pa - Patient)
Figure 6.14: New Paradigm - Clustered column chart comparing facial activity (Co - Control, Pa - Patient)
Figure 6.15: New Paradigm - Facial activity for each individual movie clip (one sub-figure per clip)
Figure 6.16: New Paradigm - Number of happy expressions displayed by each subject, per movie clip
Figure 6.17: New Paradigm - Number of sad expressions displayed by each subject, per movie clip
Figure 6.18: New Paradigm - Number of neutral expressions displayed by each subject, per movie clip
6.5.1 Hypothesis 1
When viewing the stimuli, patients with a clinical diagnosis of unipolar melancholic
depression will show less facial activity than control subjects and patients with other
types of depression.
In both the Old Paradigm and New Paradigm results, there was a tendency for patients
with unipolar melancholic depression to have reduced facial activity.
6.5.2 Hypothesis 2
When viewing the stimuli, patients with a clinical diagnosis of unipolar melancholic
depression will show a smaller repertoire of facial expressions than control subjects and
patients with other types of depression.
In both the Old Paradigm and New Paradigm results, there was a tendency for patients
with unipolar melancholic depression to show fewer positive facial expressions. However, analysis of the results reveals that the same set of patients also display fewer negative (sad) expressions. This is in keeping with the lower facial activity scores and the tendency towards a high number of neutral expressions.
Although the results would tend to confirm the hypotheses, realistically, the sample
size is far too small. One possibility raised, after the analysis had been undertaken, is that some of the patients show psychomotor agitation, whereas others show retardation - this would suggest two clusters of patients. Many more recordings are needed before any conclusions could be drawn. Other factors such as age, gender and cultural background need to be considered, i.e. to test if there are other attributes that closely correlate with facial activity and expressiveness. At this stage, the result information is simply embedded within the charts. Section 8.3.2 of the concluding chapter briefly describes an informal application of the Mann-Whitney non-parametric test.
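For illustration only, such a Mann-Whitney U test comparing, say, the facial activity totals of control subjects with those of unipolar melancholic patients could be run with SciPy as follows; the numbers below are placeholders, not the study data:

```python
from scipy.stats import mannwhitneyu

# Placeholder facial activity totals; NOT the values recorded in the study
controls = [310, 275, 298, 260, 245, 330, 290, 305, 270, 250]
melancholic = [120, 95, 140, 110]

stat, p = mannwhitneyu(controls, melancholic, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")
```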
Interestingly, if one examines the charts reporting facial expressions against the
intended elicited emotion, i.e. Figures 6.10, 6.11, 6.16 and 6.17, there is an obvious correlation between the control subjects’ expressions and the intended emotion. This is not so clear in the case of the clinical subjects, and Figures 6.12 and 6.18 reveal that those sub-
jects dominate the numbers of “neutral” expressions, regardless of which movie clip is
being viewed.
The motivation for the exercise was to test the feasibility of the approach and, on
that basis, the results confirm that this method of facial activity and expression analy-
sis could be used successfully in this and similar studies. Anecdotally, even with the
small sample size, there were some interesting patterns. For instance, participants with other types of MDD seemed to have slightly higher levels of facial activity and expressions than control subjects and patients with unipolar melancholic depression. Another interesting observation was that, in one case, ethnic background seemed to influence the facial activity and expression responses to the emotion in the video clip. The control subject’s responses during the sad clip were similar to those of other control subjects, whereas his responses during other clips were much lower. Obviously, many more samples would be required to support the notion, but it would make an interesting follow-up study.
7 Semantics and Metadata
The map is not the territory.
Alfred Korzybski
This chapter reflects on some of the lessons learned in the earlier experimental tasks and, in keeping with Objective 4 of the dissertation, considers the limitations of emotional expression recognition and ways in which they could be overcome. The problem domain is expanded well beyond the experiments
described in this thesis, to the field of affective computing. Affective computing is com-
puting that relates to, arises from, or deliberately influences emotion or other affective
phenomena [Picard 97].
Section 7.2 discusses some of the strengths and weaknesses of the FER approach, used
in the earlier experiments. Approaches for improvement are suggested and, once the
background has been explained, an example framework for affective computing is de-
scribed in Section 7.3. This is further explained by way of examples in Sections 7.4
and 7.5.
7.2 Discussion
With the empirical footing in place, the requirements of an expression recognition sys-
tem can be revisited, and several observations made regarding the earlier experiments:
1. Culture-specific rules influence the display of affect [Cowie 05a], and in the ex-
periment in Chapter 6, the control subject Co_m_04’s low facial activity score
gave cause for speculation that there might be ethnic or cultural factors influenc-
ing the result. This raises the question of how far a purely statistical approach to
emotional expression recognition can extend. One would think that accounting for ethnic background or culture, as well as other factors such as context and personality type, is beyond the reach of such an approach.
ing, and although a technical solution could be found to synchronise the start
of the video recording with the stimuli presentation, a solution that incorporated
detailed temporal information about the stimuli would be more useful. For in-
stance, knowing which frames the “punch-line” occurs in a movie clip would
enable the latency between subjects’ reactions and the stimuli to be easily mea-
sured and compared.
3. The experiment in Chapter 6 was confined to the first part of the paradigm,
where participants view video clips, i.e. FER only. Subsequent steps in the ex-
perimental paradigm include viewing of IAPS [Lang 05] images and an open
interview, i.e. a question & answer stage at the end. The processing of the audio-
visual sections of the interview is much more difficult. During this stage, some
method of detecting when each interlocutor speaks is necessary and some means
of representing the dialogue is required. A participant’s facial display of a par-
ticular emotion will obviously be different during speech to that when viewing
a video clip. The AAM will need to be able to track the face during speech and,
possibly, additional classifiers will be needed to be able to match expressions in
speech.
4. With a larger sample size in the MDD experiment, it might be useful to incor-
porate other variables, e.g. the participant’s age and gender, or, even to consider
variables such as temperament [Parker 02] and personality type [Parker 06]. Of
course, this subject information can be recorded on a spreadsheet or word pro-
cessing document, as was the case in this experiment. However, as the amount of
data increases, so too does the degree of difficulty in maintaining spreadsheets.
Having the information stored within the system performing the analysis, NXS in
this case, would potentially create much more comprehensive outputs for analy-
sis.
In recent research, there have been attempts to add rules and record descriptions
to the emotional expression recognition process (audio and video), and each project
has devised its own approach. Some have used rulebases and case-based reasoning
type information [Pantic 04a, Valstar 06a], whereas others have attempted a more com-
plex level of integration [Vinciarelli 08, Vinciarelli 09]. [Athanaselisa 05, Devillers 05,
Liscombe 05] describe efforts to represent non-basic emotional patterns and context
features in real-life emotions in audio-video data. [Athanaselisa 05] demonstrate that
recognition of speech can be improved by incorporating a dictionary of affect with
the standard ASR dictionary. [Cowie 05b] reports on a “fuzzy” rule based system for
interpreting facial expressions.
Each of the studies mentioned so far has used its own technique to incorporate
rules, and this implies 1) a need to devise some common method of defining rules; and,
2) a requirement to record rules and recognition results in a standard, reusable way.
In the remainder of this section, two complementary and overlapping concepts are in-
troduced. Ontologies offer a means to unify and extend affective computing research
and these are explained in Subsection 7.2.1. Subsection 7.2.2 examines two descrip-
tion, or markup, schemes. One scheme, known as EmotionML [emo 10, Baggia 08], is
specifically for emotional content, and the other, MPEG-7, is broader and intended for
audio-visual content (including a modest amount of affective content).
concepts. Figures 7.1 and 7.2 illustrate two ontologies, one of human disease and another of cells. Once the concepts have been populated with values, it becomes akin
to a knowledge-base. One popular and well supported software product for building
ontologies, Protégé [Protégé ], is an ontology editor and knowledge-base framework,
and is supported by Stanford University. One very powerful feature of an ontology is
its ability to link to other ontologies, e.g. medical and gene ontologies1 .
A perplexing issue faced in automatic affect recognition is the reuse and verification of results. To compare and verify studies requires some consistent means of describing affect, and, until recently, there has been no universally accepted mark-up system
for describing emotional content. Various schemes have arisen, e.g. Synchronized
Multimedia Integration Language (SMIL), Speech Synthesis ML (SSML), Extensi-
ble Multi-Modal Annotation ML (EMMA), Virtual Human ML (VHML), and HU-
MAINE’s Emotion Annotation and Representation Language (EARL) [Schröder 07].
MPEG-7
Another initiative is that of the Moving Picture Experts Group (MPEG) which has
developed the MPEG-7 standard for audio, audio-video and multimedia description
[MPEG-7 ]. MPEG-7 uses metadata structures or Multimedia Description Schemes
(MDS) for describing and annotating audio-video content. These are provided as a
standardised way of describing the important concepts in content description and con-
tent management in order to facilitate searching, indexing, filtering, and access. They
are defined using the MPEG-7 Description Definition Language (DDL), which is XML
Schema-based. The output is a description expressed in XML, which can be used
for editing, searching, filtering. The standard also provides a description scheme
for compressed binary form for storage or transmission [Chiariglione 01, Rege 05,
Salembier 01]. Examples of the use of MPEG-7 exist in the video surveillance industry
where streams of video are matched against descriptions of training data [Annesley 05].
The standard also caters for the description of affective content.
Context
Context is linked to modality and emotion is strongly multi-modal in the way that cer-
tain emotions manifest themselves favouring one modality over the other [Cowie 05a].
Physiological measurements change depending on whether a subject is sedentary or
mobile. A stressful context, such as an emergency hot-line, air-traffic control, or a war
His findings underline the fact that most studies so far took place in an artificial
environment, ignoring social, cultural, contextual and personality aspects which, in
natural situations, are major factors modulating speech and affect presentation. The
model depicted in Figure 7.3 takes into account the importance of context in the anal-
ysis of affect in speech.
There have been some attempts to include contextual data in emotion recogni-
tion research [Schuller 09a, Wollmer 08]. [Devillers 05] includes context annotation
as metadata in a corpus of medical emergency call centre dialogues. Context infor-
mation was treated as either task-specific or global in nature. Unlike [Devillers 05],
the model described in this dissertation does not differentiate between task-specific
and global context as the difference is seen merely as temporal, i.e. pre-determined or
established at “run-time”.
The HUMAINE project [HUMAINE 06] has proposed that at least the following
issues be specified:
“It is proposed to refine this scheme through work with the HUMAINE
databases as they develop.”
[Millar 04] developed a methodology for the design of audio-visual data corpora of the
speaking face in which the need to make corpora re-usable is discussed. The method-
ology, aimed at corpus design, takes into account the need for speaker and speaking
environment factors.
The model presented in this dissertation treats agent characteristics and social con-
straints separately from context information. This is because their effects on discourse are
seen as separate topics for research.
Agent characteristics
As [Scherer 03] points out, most studies are either speaker oriented or listener oriented,
with most being the former. This is significant when you consider that the emotional
state of someone labelling affective content in a corpus could impact the label that is
ascribed to a speaker’s message, or facial expression.
The literature has not given much attention to the role that agent characteristics,
such as personality type, play in affective presentation. This is surprising when one
considers the obvious difference in expression between extroverted and introverted
types. Intuitively, one would expect a marked difference in signals between these types
of speakers. One would also think that knowing a person’s personality type would be
of great benefit in applications monitoring an individual’s emotions [Parker 06].
At a more physical level, agent characteristics such as facial hair, whether they
wear spectacles, and their head and eye movements all affect the ability to visually
detect and interpret emotions.
Cultural
Culture-specific rules influence the display of affect [Cowie 05a], and gender and age
are established as important factors in shaping conversation style and content in many
societies. Studies by [Koike 98] and [Shigeno 98] have shown that it is difficult to
identify the emotion of a speaker from a different culture and that people will pre-
dominantly use visual information to identify emotion. Putting it in the perspective of
the proposed model, cognisance of the speaker and listener’s cultural backgrounds, the
context, and whether visual cues are available, obviously influence the effectiveness of
affect recognition.
Physiological
It might be stating the obvious but there are marked differences in speech signals and
facial expressions between people of different age, gender and health. The habitual
settings of facial features and vocal organs determine the speaker’s range of possible
visual appearances and sounds produced. The configuration of facial features, such as
chin, lips, nose, and eyes, provide the visual cues, whereas the vocal tract length and
internal muscle tone guide the interpretation of acoustic output [Millar 04].
Social
Social factors temper spoken language to the demands of civil discourse [Cowie 05a].
For example, affective bursts are likely to be constrained in the case of a minor relating
to an adult, yet totally unconstrained in a scenario of sibling rivalry. Similarly, a social
setting in a library is less likely to yield loud and extroverted displays of affect than a
family setting.
Internal state
Internal state has been included in the model for completeness. At the core of affective
states is the person and their experiences. Recent events such as winning the lottery or
losing a job are likely to influence emotions and their display.
To help explain the differences between the factors that influence the expression of
affect, Figure 7.4 lists some examples. The factors are divided into two groups. On the
left, is a list of factors that modulate or influence the speaker’s display of affect, i.e.
cultural, social and contextual. On the right, are the factors that influence production
The three ontologies described in this section are a means by which the model de-
scribed in the previous section could be implemented. Figure 7.5 depicts the rela-
tionships between the ontologies and gives examples of each. Formality and rigour
increase towards the apex of the diagram.
It needs to be emphasised that this proposal is not confined just to experiments such
as those described within this dissertation. It is much broader and the intended users of
the set of ontologies extends beyond research exercises. There could be many types of
users such as librarians, decision support systems, application developers and teachers.
The top level ontology correlates to the model discussed in Section 7.3 and is a for-
mal description of the domain of affective communication. It contains internal state,
personality, physiological, social, cultural, and contextual factors. It can be linked to
external ontologies in fields such as medicine, anatomy, and biology. A fragment of
the top-level, domain ontology of concepts is shown in Figure 7.6.
This ontology is more loosely defined and includes the concepts and semantics used to
define research in the field. It has been left generic and can be further subdivided into
an affective computing domain at a later stage, if needed. It is used to specify the rules
by which accredited research reports are catalogued. It includes metadata to describe,
for example,
• manner in which corpora or results have been annotated, e.g. categorical or di-
mensional.
Creating an ontology this way introduces a common way of reporting the knowledge
and facilitates intelligent searching and reuse of knowledge within the domain. For
instance, an ontology just based on the models described here could be used to find all
research reports where:
SPEAKER(internalState=‘happy’,
        physiology=‘any’,
        agentCharacteristics=‘extrovert’,
        social=‘friendly’,
        context=‘public’,
        elicitation=‘dimension’)
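If the ontologies were expressed in OWL/RDF, the same selection could be phrased as a SPARQL query. The sketch below uses rdflib with an entirely hypothetical namespace, property names and report URI, purely to illustrate the idea of querying catalogued research reports:

```python
from rdflib import Graph, Literal, Namespace, URIRef

AFF = Namespace("http://example.org/affect#")   # hypothetical namespace
g = Graph()

# A hypothetical catalogued research report, described with the middle-level ontology
report = URIRef("http://example.org/reports/study42")
g.add((report, AFF.speakerInternalState, Literal("happy")))
g.add((report, AFF.agentCharacteristics, Literal("extrovert")))
g.add((report, AFF.socialSetting, Literal("friendly")))
g.add((report, AFF.context, Literal("public")))
g.add((report, AFF.elicitation, Literal("dimension")))

query = """
PREFIX aff: <http://example.org/affect#>
SELECT ?report WHERE {
    ?report aff:speakerInternalState "happy" ;
            aff:agentCharacteristics "extrovert" ;
            aff:socialSetting        "friendly" ;
            aff:context              "public" ;
            aff:elicitation          "dimension" .
}
"""
for row in g.query(query):
    print(row.report)
```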
This ontology is more correctly a repository containing both formal and informal rules,
as well as data. It is a combination of semantic, structural and syntactic metadata. It
contains information about resources such as corpora, toolkits, audio and video sam-
ples, and raw research result data.
The next section explains the bottom level, application ontology, in more detail.
Sensing
Figure 7.7 illustrates an example application ontology for affective sensing, in a context
of investigating dialogues. In the context of the experiments in Chapter 6, a dialogue
is the interaction between the participant and the stimuli. During a dialogue, various
events can occur, triggered by one of the dialogue participants and recorded by the
sensor system. These are recorded as time stamped instances of events, so that they
can be easily identified.
In the ontology, the roles for each interlocutor, sender and receiver, are distin-
guished. This caters for the type of open interview session in the experiments in Chap-
ter 6. At various points in time, each interlocutor can take on different roles. On the
sensory side, facial, gestural, textual, speech, physiological and verbal2 cues are dis-
tinguished. The ontology could be extended for other cues and is meant to serve as
an example here, rather than a complete list of affective cues. Finally, the emotion
classification method used in the investigation of a particular dialogue is also recorded.
2
The difference between speech and verbal cues here being spoken language versus other verbal
utterings.
8 Conclusions
8.1 Introduction
such as anxiety and depression. Following the experiments, the practical limitations of
the statistical approach to FER were discussed, along with ways in which these could
be overcome through the use of a model and ontologies for affective computing.
In Section 8.2, each of the objectives as stated in Chapter 1 is examined, and consid-
eration is given to the results from the experimental work in Chapters 5 and 6. Finally,
in Section 8.3, the conclusions and contributions of this dissertation are stated before
discussing open issues and future directions.
8.2 Objectives
8.2.1 Objective 1
Explore, through the construction of a prototype system that incorporates AAMs, what
would be required in order to build a fully-functional FER system, i.e. one where the
system could be trained and then used to recognise new and previously unseen video
or images.
The system proved to be flexible and robust throughout the experiments, and dealt
with the requirements to process sets of images, as in Chapter 5, as well as the more dif-
ficult task of video processing. The underlying software components that are required to perform FER were shown to be very stable, i.e. [OpenCV ] for face detection and image
capture, [VXL ] for image processing, DemoLib [Saragih 08] for AAM development,
LIBSVM [Chang 01] for SVM classification. Components were easily interchanged,
and the MultiBoost classification software [MultiBoost 06] was easily replaced with
the LIBSVM implementation of SVM classification.
The system was built using the Qt framework [Qt 09], which ensures that it can be deployed in Windows, Mac OS or Unix/Linux environments. Anecdotally, the performance of NXS was quite adequate for real-time FER, despite little effort being expended on tuning the software. Overall, this demonstrates that the underlying software components are mature enough to be incorporated into a production-like system.
8.2.2 Objective 2
The second part of Experiment 4 demonstrated that, even using two popular and
creditable databases, a classifier built from images from one database did not achieve a high CA in predicting expressions from the other, despite Gabor filtering being reasonably invariant to lighting conditions. This echoes the preliminary results reported in [Whitehill 09] (albeit with much poorer performance here), and there seem to be two approaches to addressing the problem. The first is to expand the training set of images to include samples captured under variant conditions, e.g. different lighting. However, the results in Chapter 5 suggest that this will not be a complete solution. The second is to incorporate some form of logic or rule processing into expression recognition, as discussed in Chapter 7.
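A minimal sketch of the cross-database protocol behind this observation is given below: a classifier is trained on one corpus and tested on the other, and the resulting classification accuracy (CA) is compared with within-database cross-validation. The feature matrices are random placeholders and scikit-learn is used for convenience; this is not the experiment code from Chapter 5.

# Sketch of a cross-database evaluation: train on corpus A, test on corpus B,
# and compare with within-corpus cross-validation. Features are placeholders.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def cross_database_ca(X_a, y_a, X_b, y_b):
    clf = SVC(kernel="rbf", gamma="scale")
    within_ca = cross_val_score(clf, X_a, y_a, cv=5).mean()   # within-database CA
    cross_ca = clf.fit(X_a, y_a).score(X_b, y_b)              # cross-database CA
    return within_ca, cross_ca

# Placeholder Gabor-style feature matrices for two hypothetical corpora.
X_a, y_a = np.random.rand(300, 80), np.random.randint(0, 6, 300)
X_b, y_b = np.random.rand(150, 80), np.random.randint(0, 6, 150)
within, cross = cross_database_ca(X_a, y_a, X_b, y_b)
print(f"within-database CA: {within:.2f}, cross-database CA: {cross:.2f}")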
8.2.3 Objective 3
Anecdotally, even with the small sample size, there were some interesting patterns. For instance, participants with other types of MDD seemed to have slightly higher levels of facial activity and expression than control subjects and patients with unipolar melancholic depression. Another interesting observation was that, in one case, ethnic background seemed to influence the facial activity and expression responses to the emotion in the video clip: the control subject's responses during the sad clip were similar to those of other control subjects, whereas his responses during the other clips were much lower. Obviously, many more samples would be required to support this notion, but it would make an interesting follow-up study.
8.2.4 Objective 4
Chapter 7 reflected on some of the lessons learned in the earlier experimental tasks, with a scope that extended well beyond the field of FER to affective computing in general. Two recurring themes are emerging in the literature: the need to incorporate rules into the recognition process, and the need for a means of representing the recognition results in a standard format. To address these requirements, the use of a model and ontologies of affective computing was proposed.
8.3 Conclusions and Future Work
This is a broad thesis, ranging in scope from computer vision techniques to psychology and taking in knowledge management concepts such as ontologies along the way. Cross-disciplinary dissertations of this kind are necessary to advance a field as broad and, at times, nebulous as affective computing. This dissertation contributes in several ways.
An artifact of the work is NXS, which will continue to be evolved and made available to other researchers. It has been used successfully in a collaborative study investigating the links between facial expressions and MDD, undertaken at the Black Dog Institute, Sydney. The system is easily extendable and is not confined to FER: it is suitable for experimental work in full multi-modal emotional expression recognition, it could be used to sense states beyond the primary emotional expressions, e.g. attraction, boredom or level of interest (or all of them), and it could even be applied to face recognition.
Even though the sample size was modest, the results in anxiety recognition are very encouraging. They suggest that anxious facial expressions can be recognised with FER techniques, which could have many applications extending beyond scientific and medical research, such as interactive games and passenger screening technology.
Similarly, the results of the experiments on depression are encouraging, and once a
large enough sample size has been attained, the hypotheses in Chapter 6 can be prop-
erly tested. Even at this stage, interesting patterns in the data have emerged, enough to
suggest that other hypotheses could be tested within MDD populations, and with other
objectives in mind, e.g. cross-cultural responses to affective content.
The FER in this dissertation was confined to recognition within static images; no attempt was made to classify expressions based on temporal features. The raw facial landmark coordinates, already stored within NXS, could be used to train Hidden Markov Models (HMMs). The major difficulty with this type of approach is recognising the onset or apex of an expression; however, an ensemble of classifiers, perhaps using an SVM to detect the peak of the expression and an HMM to recognise its temporal evolution, could be attempted, similar to [Fan 05, Wen 10].
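A minimal sketch of this temporal extension is given below, using the hmmlearn package as one possible HMM implementation: one Gaussian HMM is trained per expression class on sequences of (reduced) landmark coordinates, and a new sequence is assigned to the class whose model gives the highest log-likelihood. The data are random placeholders, not NXS output.

# Sketch of the temporal approach suggested above: one Gaussian HMM per
# expression class, trained on sequences of (reduced) landmark coordinates.
# Uses hmmlearn as one possible HMM library; the data are random placeholders.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_expression_hmms(sequences_by_class):
    # sequences_by_class: {label: [array of shape (n_frames, n_features), ...]}
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        models[label] = GaussianHMM(n_components=3).fit(X, lengths)
    return models

def classify_sequence(models, seq):
    # Pick the expression whose HMM assigns the highest log-likelihood.
    return max(models, key=lambda label: models[label].score(seq))

# Placeholder data: 10 sequences of 30 frames x 10 features per expression class.
data = {label: [np.random.rand(30, 10) for _ in range(10)]
        for label in ("happy", "sad", "surprise")}
models = train_expression_hmms(data)
print(classify_sequence(models, np.random.rand(30, 10)))

In practice, the sequences would come from the landmark coordinates already stored in NXS, segmented around a detected expression peak as suggested above.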
Assessing the Relative Contributions to FER from AAM and Gabor Features
Chapter 6 explored the application of FER techniques to the study of facial activity and emotional expressions of patients with MDD. At the time of writing this dissertation, the number of participants was not sufficiently large, nor were the groups properly matched for age and gender, to permit publication of statistical results. However, informal results using the non-parametric Mann-Whitney test¹ are encouraging. More subjects have since participated in the project; it is expected that the results will soon be published, and this will be a continuing avenue of investigation.
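A sketch of this kind of informal group comparison, using SciPy's Mann-Whitney U test with a two-sided alternative at the 0.05 level, is shown below; the facial-activity scores are arbitrary placeholders, not the study data.

# Sketch of the informal group comparison: a two-sided Mann-Whitney U test on
# per-participant facial-activity scores at alpha = 0.05. Numbers are placeholders.
from scipy.stats import mannwhitneyu

control_activity = [64.2, 58.9, 71.3, 49.5, 55.0, 61.8]   # hypothetical scores
patient_activity = [32.1, 28.4, 40.7, 25.9]               # hypothetical scores

stat, p_value = mannwhitneyu(control_activity, patient_activity,
                             alternative="two-sided")
print(f"U = {stat}, p = {p_value:.3f}, "
      f"{'significant' if p_value < 0.05 else 'not significant'} at the 0.05 level")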
Anxiety and depression have been used in this dissertation to test the application of
FER techniques to non-prototypical facial activity and expressions. There are many
other disorders that could be investigated using a paradigm similar to that described in
Chapter 6, e.g. borderline personality disorder, schizophrenia and Parkinson’s disease.
This approach could also be used to investigate the mirroring effect between the stimuli
and the viewers. Beyond the laboratory, there are many commercial applications that
could make use of FER, including interactive learning, gaming and robotics.
One significant improvement that could be made in the field of FER relates to experimental methods and the reporting of results. At present, the reporting format is not consistent across studies, and no standard database is used that would enable comparison of results. The set of expressions included for recognition also varies between studies, which makes it difficult to compare and validate results. Reporting tends to be constrained or dictated by journal or conference paper stipulations in relation to the
¹ This was based on a two-tailed, or non-directional, hypothesis with a significance level of 0.05.
Analysis and Data - Anxiety
A
Uncertain 1 7.14%
Q4. Fear 6 42.86%
Anxiety 7 50.00% 1
Uncertain 1 7.14%
Q5. Fear 4 28.57%
Anxiety 9 64.29% 1
Uncertain 1 7.14%
Q6. Fear 8 57.14% 1
Anxiety 4 28.57%
Uncertain 2 14.29%
Q7. Fear 11 78.57% 1
Anxiety 1 7.14%
Uncertain 2 14.29%
Q8. Fear 5 35.71%
Anxiety 8 57.14% 1
Uncertain 1 7.14%
Q9. Fear 12 85.71% 1
Anxiety 0 0.00%
Uncertain 2 14.29%
Q10. Fear 1 7.14%
Anxiety 9 64.29% 1
Uncertain 4 28.57%
Q11. Fear 10 71.43% 1
Anxiety 4 28.57%
Uncertain 0 0.00%
Q12. Fear 5 35.71%
Anxiety 4 28.57%
Uncertain 5 35.71%
Q13. Fear 8 57.14% 1
Anxiety 6 42.86%
Uncertain 0 0.00%
Q14. Fear 1 7.14%
Anxiety 8 57.14% 1
Uncertain 5 35.71%
Q15. Fear 5 35.71%
Anxiety 4 28.57%
Uncertain 5 35.71%
Q16. Fear 11 78.57% 1
Anxiety 2 14.29%
Uncertain 1 7.14%
Q17. Fear 11 78.57% 1
Anxiety 0 0.00%
Uncertain 3 21.43%
Q18. Fear 5 35.71%
Anxiety 4 28.57%
Uncertain 4 28.57%
Q19. Fear 6 42.86%
Anxiety 6 42.86%
Uncertain 2 14.29%
Q20. Fear 3 21.43%
Anxiety 4 28.57%
Uncertain 7 50.00%
Q21. Fear 4 28.57%
Anxiety 6 42.86%
Uncertain 4 28.57%
Q22. Fear 6 42.86%
Anxiety 5 35.71%
Uncertain 3 21.43%
Q23. Fear 5 35.71%
Anxiety 7 50.00% 1
Uncertain 2 14.29%
Q24. Fear 10 71.43%
Anxiety 1 7.14%
Uncertain 3 21.43%
Q25. Fear 7 50.00% 1
Anxiety 4 28.57%
Uncertain 3 21.43%
Q26. Fear 3 21.43%
Anxiety 8 57.14% 1
Uncertain 3 21.43%
Q27. Fear 3 21.43%
Anxiety 8 57.14% 1
Uncertain 3 21.43%
Q28. Fear 2 14.29%
Anxiety 6 42.86%
Uncertain 6 42.86%
Q29. Fear 0 0.00%
Anxiety 5 35.71%
Uncertain 9 64.29%
Q30. Fear 2 14.29%
Anxiety 6 42.86%
Uncertain 6 42.86%
Uncertain 0 0.00%
Q45. Fear 1 7.14%
Anxiety 5 35.71%
Uncertain 8 57.14%
Q46. Fear 6 42.86%
Anxiety 5 35.71%
Uncertain 3 21.43%
Q47. Fear 1 7.14%
Anxiety 4 28.57%
Uncertain 9 64.29%
Q48. Fear 5 35.71%
Anxiety 4 28.57%
Uncertain 5 35.71%
Q49. Fear 9 64.29% 1
Anxiety 1 7.14%
Uncertain 3 21.43%
Q50. Fear 8 57.14% 1
Anxiety 3 21.43%
Uncertain 3 21.43%
Q51. Fear 10 71.43% 1
Anxiety 2 14.29%
Uncertain 2 14.29%
Q52. Fear 5 35.71%
Anxiety 6 42.86%
Uncertain 3 21.43%
Q53. Fear 2 14.29%
Anxiety 9 64.29% 1
Uncertain 3 21.43%
Q54. Fear 11 78.57% 1
Anxiety 0 0.00%
Uncertain 3 21.43%
Q55. Fear 7 50.00% 1
Anxiety 6 42.86%
Uncertain 1 7.14%
18 16
Analysis and Data - Depression
B
B.1 Old Paradigm
ID Bill Cosby (Happy) The Champ (Sad) Weather (Happy) Sea of Love (Surprise) Cry Freedom (Anger) Total
Co f 09 94.3759 88.349 32.5771 3.64379 79.0066 297.95239
Co m 02 66.0356 79.069 29.5169 8.87003 99.3888 282.88033
Co f 07 76.6297 71.4058 24.4274 2.28335 103.664 278.41025
Co f 03 68.7369 74.2912 30.0492 5.53615 86.7574 265.37085
Co m 01 49.9946 58.0265 43.1402 4.2638 80.2981 235.7232
Co f 05 49.8066 56.9214 33.3114 7.09537 65.8639 212.99867
Co f 08 60.8075 33.9613 25.0592 8.87829 67.4343 196.14059
Co f 06 48.9984 27.7255 23.5991 2.13677 66.0349 168.49467
Pa m UP-Mel 01 47.1827 48.7715 28.6823 0.939169 34.0904 159.666069
Co f 10 40.9297 26.9097 15.5889 0.920074 31.2753 115.623674
Pa m UP-Mel 03 45.8346 21.2189 17.1226 2.76461 23.1864 110.12711
Co m 04 15.6498 28.7875 7.7389 1.48216 48.9529 102.61126
Pa m UP-Mel 02 18.1763 12.5658 5.02547 0.599598 17.4755 53.842668
Pa m UP-Mel 04 9.42419 9.92479 4.89164 0.704307 14.052 38.996927
ID Bill Cosby (Happy) The Champ (Sad) Weather (Happy) Sea of Love (Surprise) Cry Freedom (Anger)
Co f 09 94.3759 182.7249 215.302 218.94579 297.95239
Co m 02 66.0356 145.1046 174.6215 183.49153 282.88033
Co f 07 76.6297 148.0355 172.4629 174.74625 278.41025
Co f 03 68.7369 143.0281 173.0773 178.61345 265.37085
Co m 01 49.9946 108.0211 151.1613 155.4251 235.7232
Co f 05 49.8066 106.728 140.0394 147.13477 212.99867
Co f 08 60.8075 94.7688 119.828 128.70629 196.14059
Co f 06 48.9984 76.7239 100.323 102.45977 168.49467
Pa m UP-Mel 01 47.1827 95.9542 124.6365 125.575669 159.666069
Co f 10 40.9297 67.8394 83.4283 84.348374 115.623674
Pa m UP-Mel 03 45.8346 67.0535 84.1761 86.94071 110.12711
Co m 04 15.6498 44.4373 52.1762 53.65836 102.61126
Pa m UP-Mel 02 18.1763 30.7421 35.76757 36.367168 53.842668
Pa m UP-Mel 04 9.42419 19.34898 24.24062 24.944927 38.996927
(a) Old Paradigm - Facial activity (Bill Cosby); (b) Old Paradigm - Facial activity (The Champ); (c) Old Paradigm - Facial activity (Weather)
Bill Cosby (Happy) The Champ (Sad) Weather (Happy)
Co f 09 94.3759 Co f 09 88.349 Co m 01 43.1402
Co f 07 76.6297 Co m 02 79.069 Co f 05 33.3114
Co f 03 68.7369 Co f 03 74.2912 Co f 09 32.5771
Co m 02 66.0356 Co f 07 71.4058 Co f 03 30.0492
Co f 08 60.8075 Co m 01 58.0265 Co m 02 29.5169
Co m 01 49.9946 Co f 05 56.9214 Pa m UP-Mel 01 28.6823
Co f 05 49.8066 Pa m UP-Mel 01 48.7715 Co f 08 25.0592
Co f 06 48.9984 Co f 08 33.9613 Co f 07 24.4274
Pa m UP-Mel 01 47.1827 Co m 04 28.7875 Co f 06 23.5991
Pa m UP-Mel 03 45.8346 Co f 06 27.7255 Pa m UP-Mel 03 17.1226
Co f 10 40.9297 Co f 10 26.9097 Co f 10 15.5889
Pa m UP-Mel 02 18.1763 Pa m UP-Mel 03 21.2189 Co m 04 7.7389
Co m 04 15.6498 Pa m UP-Mel 02 12.5658 Pa m UP-Mel 02 5.02547
Pa m UP-Mel 04 9.42419 Pa m UP-Mel 04 9.92479 Pa m UP-Mel 04 4.89164
(d) Old Paradigm - Facial activity (Sea of Love); (e) Old Paradigm - Facial activity (Cry Freedom)
Sea of Love (Surprise) Cry Freedom (Anger)
Co f 08 8.87829 Co f 07 103.664
Co m 02 8.87003 Co m 02 99.3888
Co f 05 7.09537 Co f 03 86.7574
Co f 03 5.53615 Co m 01 80.2981
Co m 01 4.2638 Co f 09 79.0066
Co f 09 3.64379 Co f 08 67.4343
Pa m UP-Mel 03 2.76461 Co f 06 66.0349
Co f 07 2.28335 Co f 05 65.8639
Co f 06 2.13677 Co m 04 48.9529
Co m 04 1.48216 Pa m UP-Mel 01 34.0904
Pa m UP-Mel 01 0.939169 Co f 10 31.2753
Co f 10 0.920074 Pa m UP-Mel 03 23.1864
Pa m UP-Mel 04 0.704307 Pa m UP-Mel 02 17.4755
Pa m UP-Mel 02 0.599598 Pa m UP-Mel 04 14.052
Table B.4: Old Paradigm - Facial expressions - sorted by happy within video
Table B.5: Old Paradigm - Facial Expressions - sorted by sad within video
Table B.6: Old Paradigm - Facial Expressions - sorted by neutral within video
B.2 New Paradigm
(a) New Paradigm - Facial activity (Bill Cosby); (b) New Paradigm - Facial activity (The Champ); (c) New Paradigm - Facial activity (Weather)
Bill Cosby (Happy) The Champ (Sad) Weather (Happy)
Pa f UP-NonMel 09 74.982 Pa f UP-NonMel 09 43.9521 Pa f UP BP2 08 67.7825
Pa f UP BP2 08 61.9718 Pa f UP-NonMel 06 41.9455 Co m 11 40.8223
Co f 12 58.5707 Co f 12 39.9424 Co m 14 38.1457
Co f 15 50.9681 Pa f UP BP2 08 34.5945 Pa f UP-NonMel 09 27.4592
Co m 11 47.8463 Co m 11 34.5049 Co m 13 26.0144
Pa m Unkown 07 42.8085 Pa m Unkown 07 32.9788 Co f 12 25.3561
Co m 14 40.6977 Co f 15 29.948 Pa m Unkown 07 19.9293
Pa f UP-NonMel 06 35.8466 Co m 16 26.5557 Co m 16 17.9017
Co m 16 34.7066 Pa m PD 11 22.6704 Pa m PD 11 17.5711
Pa m PD 11 31.7382 Pa f UP-Mel 10 21.373 Pa f UP-NonMel 06 17.4541
Co m 13 30.6725 Co m 14 20.1737 Co f 15 17.1747
Pa f UP-Mel 10 21.657 Co m 13 16.9278 Pa f UP-Mel 10 10.28
Pa m UP-Mel 05 5.82715 Pa m UP-Mel 05 6.73422 Pa m UP-Mel 05 3.32408
(d) New Paradigm - Facial activity (Silence of the Lambs); (e) New Paradigm - Facial activity (Cry Freedom)
Silence of the Lambs (Fear) Cry Freedom (Anger)
Pa f UP BP2 08 91.4877 Pa f UP BP2 08 149.151
Pa f UP-NonMel 09 84.8893 Pa f UP-NonMel 06 49.7708
Co m 13 73.5504 Pa f UP-NonMel 09 48.3166
Pa f UP-NonMel 06 62.5193 Co m 11 43.3826
Co m 16 53.904 Pa m Unkown 07 37.639
Co m 11 51.2466 Co m 16 29.9717
Co f 15 46.0423 Co f 15 28.2632
Co f 12 43.6596 Co f 12 26.4961
Pa m Unkown 07 40.8624 Co m 13 18.4885
Co m 14 31.162 Pa m PD 11 18.0523
Pa m PD 11 30.8778 Co m 14 16.2503
Pa f UP-Mel 10 15.0949 Pa f UP-Mel 10 8.99202
Pa m UP-Mel 05 12.3734 Pa m UP-Mel 05 8.72638
(f) New Paradigm - Facial activity (The Shining); (g) New Paradigm - Facial activity (Capricorn One)
The Shining (Fear) Capricorn One (Surprise)
Pa f UP BP2 08 69.3599 Pa f UP BP2 08 43.5648
Pa f UP-NonMel 09 29.007 Pa f UP-NonMel 09 23.8494
Co m 11 25.3941 Co f 12 18.4008
Pa f UP-NonMel 06 24.5637 Pa f UP-NonMel 06 17.7679
Co f 12 24.3349 Co m 16 17.0196
Pa m Unkown 07 14.1273 Co f 15 15.9124
Co f 15 14.003 Co m 11 14.8772
Co m 13 11.9067 Co m 14 13.7803
Co m 14 11.0059 Co m 13 12.2761
Co m 16 9.40495 Pa m PD 11 11.1434
Pa m UP-Mel 05 5.31982 Pa m Unkown 07 10.4134
Pa f UP-Mel 10 5.11657 Pa f UP-Mel 10 3.99677
Pa m PD 11 4.66098 Pa m UP-Mel 05 3.20501
Table B.9: New Paradigm - Facial expressions - sorted by happy within video
Table B.10: New Paradigm - Facial expressions - sorted by sad within video
Table B.11: New Paradigm - Facial expressions - sorted by neutral within video
Extract of Patient Diagnosis
C
Unipolar motor disturbance Bipolar Schizoaffective other History of MI
ID age Gender (male = 1) None Mel non-Mel Yes No type SAD (bipolar) SAD (dep) Specify
Pa m UP Mel 03 22 1 X
Pa m UP Mel 01 48 1 X
Pa m UP Mel 02 56 1 X
Pa m UP Mel 04 48 1 X
Pa f UP NonMel 09 27 2 X X
Pa f UP Mel 10 50 2 X X Anxiety? X
Pa m PD 11 53 1 X? panic disorder, substance abuse
Pa m UP Mel 05 26 1 X X
Pa f UP NonMel 06 34 2 X X? 2 X
Pa f UP BP2 08 32 2 2 X
Pa m Unkown 07 45 1
Bibliography
[Annesley 05] J. Annesley and J. Orwell. On the Use of MPEG-7 for Visual Surveil-
lance. Technical report, Digital Imaging Research Center, Kingston Univer-
sity, Kingston-upon-Thames, Surrey, UK., 2005.
[Anolli 97] L. Anolli and R. Ciceri. The Voice of Deception: Vocal Strategies of Naive
and Able Liars. Journal of Nonverbal Behavior, 21:259–284, 1997.
[Baker 01] S. Baker and I. Matthews. Equivalence and Efficiency of Image Alignment
Algorithms. Computer Vision and Pattern Recognition, IEEE Computer So-
ciety Conference on, 1:1090–1097, 2001.
[Baker 02] S. Baker and I. Matthews. Lucas-Kanade 20 Years On: A Unifying Frame-
work: Part 1. Technical Report CMU-RI-TR-02-16, Robotics Institute,
Pittsburgh, PA, July 2002.
[Baker 03a] S. Baker, R. Gross, and I. Matthews. Lucas-Kanade 20 years on: A unify-
ing framework: Part 3. Technical Report CMU-RI-TR-03-35, Robotics In-
stitute, Carnegie Mellon University, Pittsburgh (PA), USA, November 2003.
[Baker 04a] S. Baker, R. Gross, and I. Matthews. Lucas-Kanade 20 Years On: A Uni-
fying Framework: Part 4. Technical Report CMU-RI-TR-04-14, Robotics
Institute, Pittsburgh, PA, February 2004.
[Bartlett 03] M. Bartlett, G. Littlewort, I. Fasel, and J. Movellan. Real Time Face De-
tection and Facial Expression Recognition: Development and Applications
to Human Computer Interaction. In CVPR Workshop on CVPR for HCI, 2003.
[Bertini 05] M. Bertini, A. Del Bimbo, and C. Torniai. Video Annotation and Re-
trieval with Pictorially Enriched Ontologies. Technical report, Università di Firenze, Italy, 2005.
[Bhuiyan 07] A. Bhuiyan and C. Liu. On Face Recognition using Gabor Filters. In
Proceedings of World Academy of Science, Engineering and Technology,
22, pages 51–56, 2007.
[Blanz 99] V. Blanz and T. Vetter. A Morphable Model for the Synthesis of 3D Faces.
In Special Interest Group on Graphics and Interactive Techniques, pages
187–194, 1999.
[Bower 92] G.H. Bower. The handbook of emotion and memory: Research and theory,
chapter How Might Emotions Affect Learning?, pages 3–31. Lawrence Erl-
baum Associates, Inc, 365 Broadway, Hillsdale, New Jersey 07642, 1992.
[Burges 98] C. Burges. A Tutorial on Support Vector Machines for Pattern Recogni-
tion. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
[Chang 01] C. Chang and C. Lin. LIBSVM: a library for support vector machines. 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[Chen 03] W. Chen, T. Chiang, M. Hsu, and J. Liu. The validity of eye blink rate in
Chinese adults for the diagnosis of Parkinson's disease. Clinical Neurology
and Neurosurgery, 105:90–92, 2003.
[Chen 05] P. Chen, C. Lin, and B. Schölkopf. A tutorial on V-support vector machines:
Research Articles. Applied Stochastic Models in Business and Industry,
21(2):111–136, 2005.
[Chen 07] F. Chen and K. Kotani. Facial Expression Recognition by SVM-based Two-
stage Classifier on Gabor Features. In Machine Vision Applications, pages
453–456, 2007.
[Cootes 92] T. Cootes and C. Taylor. Active Shape Models - Smart Snakes. British
Machine Vision Conference, pages 266–275, 1992.
[Cootes 95] T. Cootes, C. Taylor, D. Cooper, and J. Graham. Active Shape Models—
their training and applications. Computer Vision and Image Understanding,
61(1):38–59, 1995.
[Cootes 98] T. Cootes, G. Edwards, and C. Taylor. Active Appearance Models. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 1407:484–498,
1998.
[Cootes 01] T. Cootes and C. Taylor. Statistical Models of Appearance for Computer
Vision. Technical report, University of Manchester, 2001.
[Cornelius 96] R. Cornelius. The science of emotion. New Jersey: Prentice Hall, 1996.
[Cover 67] T. Cover and P. Hart. Nearest neighbor pattern classification. Information
Theory, IEEE Transactions on, 13(1):21–27, 1967.
[Cowie 03] R. Cowie and R. Cornelius. Describing the emotional states that are ex-
pressed in speech. Speech Communication, 40:5–32, 2003.
[Daugman 85] J. Daugman. Uncertainty relation for resolution in space, spatial fre-
quency, and orientation optimized by two-dimensional visual cortical filters.
Journal of the Optical Society of America A: Optics, Image Science, and Vi-
sion, 2(7):1160–1169, 1985.
[Dunn 95] D. Dunn and W. Higgins. Optimal Gabor filters for texture segmentation.
Image Processing, IEEE Transactions on, 4(7):947–964, 1995.
[Edwards 98] G. Edwards, C. Taylor, and T. Cootes. Interpreting Face Images Using Active Appearance Models. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, 1998.
[Ekman 71] P. Ekman and W. Friesen. Constants across cultures in the face and emo-
tion. Journal of Personality and Social Psychology, 17(2):124–129, 02 1971.
[Ekman 75] P. Ekman and W. Friesen. Unmasking the Face. Prentice Hall, Englewood
Cliffs NJ, 1975.
[Ekman 76] P. Ekman and W. Friesen. Pictures of Facial Affect. Consulting Psychol-
ogists Press, Palo Alto, CA, 1976.
[Ekman 82] P. Ekman and H. Oster. Emotion in the Human Face. New York: Cam-
bridge University Press, 2nd edition, 1982.
[Ekman 97] P. Ekman and E.L. Rosenberg. What the Face Reveals. Series in Affective
Science. Oxford University Press, Oxford, UK, 1997.
[Ekman 99] P. Ekman. Handbook of cognition and emotions, chapter Basic Emotions,
pages 301–320. Wiley, New York., 1999.
[Ekman 02] P. Ekman, W. Friesen, and J. Hager. Facial Action Coding System (FACS):
Manual. Salt Lake City, UT, 2002. Research Nexus eBook.
[Ekman 03] P. Ekman. Darwin, Deception, and Facial Expression. Annals New York
Academy of Sciences, pages 205–221, 2003.
[Ellgring 96] H. Ellgring and K. Scherer. Vocal indicators of mood change in depres-
sion. Journal of Nonverbal Behavior, 20:83–110, 1996.
[Ellgring 05] K. Scherer & H. Ellgring. Multimodal markers of appraisal and emo-
tion. In Paper presented at the ISRE Conference, Bari, 2005.
[Ezust 06] Alan Ezust and Paul Ezust. An Introduction to Design Patterns in C++
with Qt 4 (Bruce Perens Open Source). Prentice Hall PTR, Upper Saddle
River, NJ, USA, 2006.
[Fan 05] Y. Fan, N. Cheng, Z. Wang, J. Liu, and C. Zhu. Real-Time Facial Expression
Recognition System Based on HMM and Feature Point Localization. In Jian-
hua Tao, Tieniu Tan, and Rosalind Picard, editors, Affective Computing and
Intelligent Interaction, Volume 3784 of Lecture Notes in Computer Science,
pages 210–217. Springer Berlin / Heidelberg, 2005.
[Fasel 02] I. Fasel, M. Bartlett, and J. Movellan. A Comparison of Gabor Filter Meth-
ods for Automatic Detection of Facial Landmarks. Fifth IEEE International
Conference on Automatic Face and Gesture Recognition, page 242, 2002.
[Fasel 03] B. Fasel and J. Luettin. Automatic facial expression analysis: a survey.
Pattern Recognition, 36(1):259–275, 2003.
[Frank 93] M. Frank, P. Ekman, and W. Friesen. Behavioral markers and recogniz-
ability of the smile of enjoyment. Journal of Personality and Social Psychol-
ogy, 64(1):83–93, 1993.
[Fu 08] C.H.Y. Fu, S.C.R. Williams, A.J. Cleare, J. Scott, M.T. Mitterschiffthaler,
N.D. Walsh, C. Donaldson, J. Suckling, C. Andrew, H. Steiner, and R.M.
Murray. Neural Responses to Sad Facial Expressions in Major Depression
Following Cognitive Behavioral Therapy. Biological Psychiatry, 64(6):505–
512, 2008.
[Gao 09] X. Gao, Y. Su, X. Li, and D. Tao. Gabor texture in active appearance
models. Neurocomputing, 72(13-15):3174–3181, 2009.
[Grana 05] C. Grana, D. Bulgarelli, and R. Cucchiara. Video Clip Clustering for As-
sisted Creation of MPEG-7 Pictorially Enriched Ontologies. Technical re-
port, University of Modena and Reggio Emilia, Italy, 2005.
[Grinker 61] R. Grinker, N. Miller, M. Sabshin, R. Nunn, and J. Nunnally. The phe-
nomena of depressions. Harper and Row, New York, 1961.
[Gross 95] J. Gross and R. Levenson. Emotion elicitation using films. Cognition &
Emotion, 9:87–108, 1995.
[Gross 05] R. Gross, I. Matthews, and S. Baker. Generic vs. person specific active
appearance models. Image Vision Computing, 23(12):1080–1093, 2005.
[Gross 10] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image
Vision Comput., 28(5):807–813, 2010.
[Harrigan 96] J.A. Harrigan and D.M. O’Connell. How do you look when feeling
anxious? Facial displays of anxiety. Personality and Individual Differences,
21:205–212, August 1996.
[Harrigan 04] J. Harrigan, K. Wilson, and R. Rosenthal. Detecting state and trait
anxiety from auditory and visual cues: a meta-analysis. Personality and
Social Psychology Bulletin, 30(1):56–66, 2004.
[Hsu 03] C. Hsu, C. Chang, and C. Lin. A Practical Guide to Support Vector Classi-
fication. Bioinformatics, 2003.
[Joormann 06] J. Joormann and I. Gotlib. Is This Happiness I See? Biases in the Iden-
tification of Emotional Facial Expressions in Depression and Social Phobia.
Journal of Affective Disorders, 93(1-3):149–157, 2006.
[Kaiser 98] S. Kaiser, T. Wehrle, and S. Schmidt. Emotional Episodes, Facial Expres-
sions, and Reported Feelings in Human-Computer Interactions. In ISRE
Publications, pages 82–86, 1998.
[Kanade 00] T. Kanade, Y. Tian, and J. Cohn. Comprehensive Database for Facial
Expression Analysis. Fourth IEEE International Conference on Automatic
Face and Gesture Recognition, pages 46–53, 2000.
[Kohavi 95] R. Kohavi. A study of cross-validation and bootstrap for accuracy es-
timation and model selection. International Joint Conference on Artificial
Intelligence, pages 1137–1143, 1995.
[Kvaal 05] K. Kvaal, I. Ulstein, I. Nordhus, and K. Engedal. The Spielberger State-
Trait Anxiety Inventory (STAI): the state scale in detecting mental disorders
in geriatric patients. International Journal of Geriatric Psychiatry, 20:629–
634, 2005.
[Lades 93] M. Lades, J. Vorbrüggen, J. Buhmann, J. Lange, C. von der Malsburg, R. Würtz, and
W. Konen. Distortion Invariant Object Recognition in the Dynamic Link
Architecture. IEEE Trans. Computers, 42:300–311, 1993.
[Lagoze 01] C. Lagoze and J. Hunter. The ABC Ontology and Model. Technical re-
port, Cornell University Ithaca, NY and DSTC Pty, Ltd. Brisbane, Australia,
2001.
[Lang 05] P. Lang, M. Bradley, and B. Cuthbert. International affective picture sys-
tem (IAPS): Affective ratings of pictures and instruction manual. Technical
Report A-6, University of Florida, Gainesville, FL, 2005.
[Lazarus 91] R. Lazarus. Emotion and adaptation. Oxford University Press, New
York :, 1991.
[Lee 96] T. Lee. Image Representation Using 2D Gabor Wavelets. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 18(10):959–971, 1996.
[Lee 07] B. Lee, S. Cho, and H. Khang. The neural substrates of affective processing
toward positive and negative affective pictures in patients with major de-
pressive disorder. Progress in Neuro-Psychopharmacology and Biological
Psychiatry, 31(7):1487–1492, 2007.
[Li 07] F. Li and K. Xu. Optimal Gabor Kernel’s Scale and orientation selection
for face classification. Optics & Laser Technology, 39(4):852–857, 2007.
[Lien 98] J. Lien, J. Cohn, T. Kanade, and C. Li. Automated Facial Expression Recog-
nition Based on FACS Action Units. In Third IEEE International Conference
on Automatic Face and Gesture Recognition, pages 390–395, April 1998.
[Littlewort 07] G. Littlewort, M. Bartlett, and K. Lee. Faces of pain: Automated Mea-
surement of Spontaneous Facial Expressions of Genuine and Posed Pain.
9th International Conference on Multimodal Interfaces, pages 15–21, 2007.
[Liu 04] C. Liu. Gabor-based Kernel PCA with Fractional Power Polynomial Models
for Face Recognition. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 26:572–581, 2004.
[Liu 06] W. Liu and Z. Wang. Facial Expression Recognition Based on Fusion of
Multiple Gabor Features. 18th International Conference on Pattern Recog-
nition, 3:536–539, 2006.
[Lucas 81] B. Lucas and T. Kanade. An Iterative Image Registration Technique with
an Application to Stereo Vision. International Joint Conferences on Artificial
Intelligence, pages 674–679, April 1981.
[Luo 06] H. Luo and J. Fan. Building concept ontology for medical video annotation.
In MULTIMEDIA ’06: Proceedings of the 14th annual ACM international
conference on Multimedia, pages 57–60, New York, NY, USA, 2006. ACM.
[Manjunath 96] B. Manjunath and W. Ma. Texture Features for Browsing and Re-
trieval of Image Data. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(8):837–842, August 1996.
[Martin 08] C. Martin, U. Werner, and H. Gross. A real-time facial expression recog-
nition system based on Active Appearance Models using gray images and
edge images. 8th IEEE International Conference on Automatic Face & Ges-
ture Recognition, pages 1–6, Sept. 2008.
[McDuff 10] D. McDuff, R. Kaliouby, K. Kassam, and R. Picard. Affect valence infer-
ence from facial action unit spectrograms. In Computer Vision and Pattern
Recognition Workshops (CVPRW), 2010 IEEE Computer Society Confer-
ence on, pages 17–24, June 2010.
[Mogg 05] K. Mogg and B. Bradley. Attentional Bias in Generalized Anxiety Disorder
Versus Depressive Disorder. Cognitive Therapy and Research, 29:29–45,
2005.
[Monk 08] C. Monk, R. Klein, and E. Telzer et al. Amygdala and Nucleus Accumbens
Activation to Emotional Facial Expressions in Children and Adolescents at
Risk for Major Depression. American Journal of Psychiatry, 165(3):90–98,
Jan 2008.
[Murray 93] I. Murray and L. Arnott. Toward the simulation of emotion in synthetic
speech. Journal Acoustical Society of America, 93(2):1097–1108, 1993.
[Navigli 03] R. Navigli, P. Velardi, and A. Gangemi. Ontology Learning and Its Ap-
plication to Automated Terminology Translation. IEEE Intelligent Systems,
pages 22–31, 2003.
[Nixon 01] M. Nixon and A. Aguado. Feature Extraction and Image Processing.
MPG Books Ltd, Bodmin, Cornwall, 2001.
[Okada 09] T. Okada, T. Takiguchi, and Y. Ariki. Pose robust and person independent
facial expressions recognition using AAM selection. IEEE 13th International
Symposium on Consumer Electronics, pages 637–638, May 2009.
[Pantic 04b] M. Pantic and L. Rothkrantz. Facial action recognition for facial
expression analysis from static face images. IEEE Transactions on Systems,
Man, and Cybernetics, Part B, 34(3):1449–1461, 2004.
[Pantic 07] M. Pantic and M. Bartlett. Face recognition, chapter Machine Analysis
of Facial Expressions, pages 377–416. I-Tech Education and Publishing,
Vienna, Austria, July 2007.
[Parker 02] G. Parker and K. Roy. Examining the utility of a temperament model
for modelling non-melancholic depression. Acta Psychiatrica Scandinavica,
106(1):54–61, 2002.
[Picard 97] R.W. Picard. Affective Computing. MIT Press, Cambridge (MA), USA,
1997.
[Pree 95] W. Pree. Design Patterns for Object-Oriented Software Development. Ad-
dison Wesley Longman, 1st edition, 1995.
[Reed 07] L. Reed, M. Sayette, and J. Cohn. Impact of depression on response to com-
edy: A dynamic facial coding analysis. Journal of Abnormal Psychology,
117(4):804–809, May 2007.
[Rege 05] M. Rege, M. Dong, F. Fotouhi, M. Siadat, and L. Zamorano. Using MPEG-
7 to build a Human Brain Image Database for Image-guided Neurosurgery.
Medical Imaging 2005: Visualization, Image-Guided Procedures, and Dis-
play, pages 512–519, 2005.
[Ro 01] Yong Man Ro, Munchurl Kim, Ho Kyung Kang, and B. S. Manjunath.
MPEG-7 Homogeneous Texture Descriptor. Electronics and Telecommu-
nications Research Institute Journal, 23:41–51, 2001.
[Saatci 06] Y. Saatci and C. Town. Cascaded classification of gender and facial
expression using active appearance models. In Automatic Face and Ges-
ture Recognition, 2006. FGR 2006. 7th International Conference on, pages
393–398, April 2006.
[Saragih 06] J. Saragih and R. Göcke. Iterative Error Bound Minimisation for AAM
Alignment. International Conference on Pattern Recognition, 2:1192–1195,
2006.
[Saragih 08] J. Saragih. The Generative Learning and Discriminative Fitting of Linear
Deformable Models. PhD thesis, Research School of Information Sciences
and Engineering, The Australian National University, Canberra, Australia,
2008.
[Saragih 09] J. Saragih and R. Göcke. Learning AAM fitting through simulation. Pat-
tern Recognition, 42(11):2628–2636, 2009.
[Schuller 09b] B. Schuller, S. Steidl, and A. Batliner. The INTERSPEECH 2009 Emo-
tion Challenge. In ISCA, editor, Proceedings of Interspeech 2009, pages
312–315, 2009.
[Sebe 03] N. Sebe and M. Lew. Robust Computer Vision Theory and Applications.
Springer, 2003.
[Sebe 05] N. Sebe, I. Cohen, A. Garg, and Th. Huang. Machine Learning in Computer
Vision. Springer, 2005.
[Shen 06] L. Shen and L. Bai. A review on Gabor wavelets for face recognition. Pat-
tern Analysis & Applications, 9(2-3):273–292, 2006.
[Shen 07] L. Shen, L. Bai, and Z. Ji. Advances in visual information systems, Volume
4781, chapter A SVM Face Recognition Method Based on Optimized Gabor
Features, pages 165–174. Springer Berlin / Heidelberg, 2007.
[Song 05] D. Song, H. Lie, M. Cho, H. Kim, and P. Kim. Image and video retrieval,
Volume 3568/2005, chapter Domain Knowledge Ontology Building for Se-
mantic Video Event Description, pages 267–275. Springer Berlin / Heidel-
berg, 2005.
[Strupp 08] S. Strupp, N. Schmitz, and K. Berns. Visual-Based Emotion Detection for
Natural Man-Machine Interaction. In KI ’08: Proceedings of the 31st Annual German Conference on Advances in Artificial Intelligence, 2008.
[Sung 08] J. Sung and D. Kim. Pose-Robust Facial Expression Recognition Using
View-Based 2D 3D AAM. Systems, Man and Cybernetics, Part A: Systems
and Humans, IEEE Transactions on, 38(4):852–866, July 2008.
[ten Bosch 00] L. ten Bosch. Emotions: What is Possible in the ASR Framework.
SpeechEmotion, 2000.
[Tian 01] Y. Tian, T. Kanade, and J. Cohn. Recognizing Action Units for Facial
Expression Analysis. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 23(2):97–115, Feb 2001.
[Tong 09] Y. Tong, X. Liu, F. Wheeler, and P. Tu. Automatic Facial Landmark Labeling
with Minimal Supervision. Computer Vision and Pattern Recognition, pages
2097–2104, June 2009.
[Tsai 01] D. Tsai, S. Wu, and M. Chen. Optimal Gabor filter design for texture
segmentation using stochastic optimization. Image Vision Computing,
19(5):299–316, 2001.
[Valstar 06a] M. Valstar and M. Pantic. Biologically vs. Logic Inspired Encoding of
Facial Actions and Emotions in Video. In ICME, pages 325–328, 2006.
[Valstar 06b] M. Valstar and M. Pantic. Fully Automatic Facial Action Unit Detection
and Temporal Analysis. In Conference on Computer Vision and Pattern
Recognition Workshop, 2006.
[Velten 68] E. Velten. A laboratory task for induction of mood states. Behaviour
Research and Therapy, 6:473–482, 1968.
[Viola 01] P. Viola and M. Jones. Robust real-time face detection. Proceedings Eighth
IEEE International Conference on Computer Vision ICCV 2001, 2:747,
2001.
[Wallhoff ] F. Wallhoff. Database with Facial Expressions and Emotions from Tech-
nical University of Munich (FEEDTUM). http://www.mmk.ei.tum.
[Wang 02] X. Wang and H. Qi. Face Recognition Using Optimal Non-Orthogonal
Wavelet Basis Evaluated by Information Complexity. In 16th International
Conference on Pattern Recognition, Volume 1, page 10164, 2002.
[Wen 10] C. Wen and Y. Zhan. Facial expression recognition based on combined
HMM. International Journal of Computing and Applications in Technology,
38:172–176, July 2010.
[Wu 04] B. Wu, H. Ai, and R. Liu. Glasses Detection by Boosting Simple Wavelet
Features. International Conference on Pattern Recognition, 1:292–295,
2004.
[Zhou 06] M. Zhou and H. Wei. Face Verification Using Gabor Wavelets and AdaBoost. In International Conference on Pattern Recognition, pages 404–407,
2006.
[Zhou 09] M. Zhou and H. Wei. Facial Feature Extraction and Selection by Gabor
Wavelets and Boosting. In 2nd International Congress on Image and Signal
Processing, pages 1–5, 2009.