
The Computer Analysis of Facial Expressions: On the Example of Depression and Anxiety

Gordon James McIntyre

A thesis submitted for the degree of Doctor of Philosophy of the Australian National University

December 2010

School of Engineering
College of Engineering and Computer Science
The Australian National University
Canberra, Australia
Declaration

This thesis describes the results of research undertaken in the School of Engineering,
College of Engineering and Computer Science, The Australian National University,
Canberra. This research was supported by a scholarship from The Australian National
University.
The results and analyses presented in this thesis are my own original work, accomplished
under the supervision of Doctor Roland Göcke, Doctor Bruce Millar and Doctor Antonio
Robles-Kelly, except where otherwise acknowledged. This thesis has not been submitted
for any other degree.

Gordon McIntyre
School of Engineering
College of Engineering and Computer Science
The Australian National University
Canberra, Australia
10 May 2010

Acknowledgements

First of all I would like to thank the members of my supervisory panel, Doctor Roland
Göcke, Doctor Bruce Millar and Doctor Antonio Robles-Kelly. They have added invaluable
insight from their respective areas, which has been a big help in this multi-disciplinary
project.

Roland, thank you for being an excellent supervisor and driving force, especially
at times when it all seemed a bit too hard. You abound with positive energy and are
blessed with shrewdness and patience well beyond your years. Bruce, I would like
to thank you for your constructive criticism and the benefit of the wisdom that you
accumulated over a distinguished career.

This PhD project would not have been as enjoyable without the support of staff and
fellow students at the College of Engineering and Computer Science and my colleagues
at the Centre for Mental Health Research. Thank you to all of you! I would like to also
thank the administrative and support staff for putting up with my inane questions and
providing prompt and professional assistance.

My gratitude goes to the Black Dog Institute in Sydney; it was the experience of a
lifetime to be a part of such a multi-disciplinary team in an incredibly innovative
organisation. It helped me to get an appreciation of the fantastic work they do in such
a complex field.


In the course of this project, I have made many friends from a diverse range of
backgrounds and my life is all the richer for it. There are some special people
that I need to acknowledge. To Dot, your spirit and determination are always
an inspiration to me. To Abha, thank you for being so supportive and understanding.
Lastly, to my children, who were oftentimes deprived of my attention but nevertheless
seemed to show an interest in my work - thank you!
Abstract

Significant advances have been made in the field of computer vision in the last few
years. The mathematical underpinnings have evolved in conjunction with increases in
computer processing speed. Many researchers have attempted to apply these improvements
to the field of Facial Expression Recognition (FER).

In the typical FER approach, once an image has been acquired, possibly by capturing
frames in a video, the face is detected and local information is extracted from
the facial region in the image. One popular approach is to build a database of the raw
feature data, and then use statistical measures to group the data into representations
that correspond to facial expressions. Newly acquired images are then subjected to the
same feature extraction process, and the resulting feature data are compared to that in
the database for matching facial expressions.

Academic studies tend to make use of freely available, annotated sets of images.
These community databases, used for training and testing, are usually built from acted
or posed examples [Kanade 00, Wallhoff ] of primary or prototypical emotion
expressions such as fear, anger and happiness. Making use of video or images captured
in a natural setting is less common, and fewer studies attempt to apply the techniques
to more subtle and pervasive moods and emotional states, such as boredom, arousal,
anxiety and depression.


This dissertation aims to test whether state-of-the-art developments in the field of
computer vision can be successfully applied in a practical situation involving
non-primary FER. The functional requirements of a system that can perform full lifecycle
video analysis of vocal and facial expressions are outlined. These have been used to
build a fully-functional prototype system that incorporates Active Appearance Models
(AAMs). The system has been integral to supporting the experimental aspects of this
dissertation.
Of particular interest in this dissertation is the recent evolution in computer vision
of the AAM. These are used to locate fiducial, or landmark, points around a face in an
image. If the landmark points can be reliably and consistently found within an image,
then the collective “shape” of the points, together with the pixel information, can be
used to build representations of facial expressions.
Two experiments were undertaken and are reported in this thesis. The first
investigated whether FER practices could be applied to sense for anxious expressions
in images. The second was designed to analyse the facial activity and expressions in
video recordings of patients diagnosed with Major Depressive Disorder (MDD).
Finally, the practical limitations of the statistical approach to FER are considered
along with strategies for overcoming those limitations.
List of Publications

• G. McIntyre, R. Göcke, M. Hyett, M. Green, and M. Breakspear. An Approach for Automatically Measuring Facial Activity in Depressed Subjects. In 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, ACII 2009, September 2009. DOI 10.1109/ACII.2009.5349593

• G. McIntyre and R. Göcke. A Composite Framework for Affective Sensing. In Proceedings of Interspeech 2008, pages 2767–2770. ISCA, 22-26 September 2008

• G. McIntyre and R. Göcke. Affect and Emotion in Human-Computer Interactions, chapter The Composite Sensing of Affect, pages 104–115. Lecture Notes in Computer Science LNCS 4868. Springer, August 2008

• G. McIntyre and R. Göcke. Towards Affective Sensing. In Proceedings of the 12th International Conference on Human-Computer Interaction HCII2007, Volume 3 of Lecture Notes in Computer Science LNCS 4552, pages 411–420, Beijing, China, July 2007. Springer

• G. McIntyre and R. Göcke. Researching Emotions in Speech. In 11th Australasian International Conference on Speech Science and Technology, pages 264–369, Auckland, New Zealand, December 2006. ASSTA

Abbreviations

AAM Active Appearance Model


ANN Artificial Neural Network
ASM Active Shape Model
ASR Automatic Speech Recognition
AU Action Unit
AVI Audio Video Interleave
CA Classification Accuracy
CBT Cognitive Behaviour Therapy
DDL Description Definition Language
DFT Discrete Fourier Transform
DSM-IV Diagnostic and Statistical Manual of Mental Disorders
EMG Electromyography
EmotionML Emotion Markup Language
FACS Facial Action Coding System
FDP Facial Definition Parameters
FER Facial Expression Recognition
fMRI functional Magnetic Resonance Imaging
FR Face Recognition
FT Fourier Transform
GAD Generalised Anxiety Disorder


GSR Galvanic Skin Response


HMM Hidden Markov Model
HUMAINE Human-Machine Interaction Network on Emotion
IAPS International Affective Picture System
IEBM Iterative Error Bound Minimisation
k-NN k-Nearest Neighbour
LDA Linear Discriminant Analysis
MDD Major Depressive Disorder
MDS Multimedia Description Schemes
MPEG Moving Picture Experts Group
NXS Any Expression Recognition System
OCD Obsessive Compulsive Disorder
OO Object Oriented
PCA Principal Component Analysis
POIC Project-Out Inverse Compositional
PTSD Post-Traumatic Stress Disorder
ROC Receiver Operating Characteristic
RU Region Unit
SIC Simultaneous Inverse Compositional
SML Statistical Machine Learning
STFT Short Time Fourier Transform
SVD Singular Value Decomposition
SVM Support Vector Machine
VXL Vision-something Libraries
XML Extensible Markup Language
Contents

Declaration iii

Acknowledgements v

Abstract vii

List of Publications ix

Abbreviations xi

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Literature Review 7

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 Describing Emotions . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2.1 Emotion Classification Schemes . . . . . . . . . . . . . . . . 9

2.3 The Physiology of Emotional Display . . . . . . . . . . . . . . . . . 12


2.4 Describing Facial Activity . . . . . . . . . . . . . . . . . . . . . . . 13

2.5 Anxious Expression . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.5.1 Anxiety Disorders . . . . . . . . . . . . . . . . . . . . . . . 15

2.5.2 Anxious Facial Expressions . . . . . . . . . . . . . . . . . . 21

2.6 Non-verbal Communication in Depression . . . . . . . . . . . . . . . 24

2.6.1 Ellgring’s Study . . . . . . . . . . . . . . . . . . . . . . . . 25

2.6.2 Processing of Emotional Content . . . . . . . . . . . . . . . 26

2.6.3 Facial Feedback in Depression . . . . . . . . . . . . . . . . . 28

3 Affective Sensing 29

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.2 Affective Sensing Systems . . . . . . . . . . . . . . . . . . . . . . . 30

3.2.1 Eliciting Training Data . . . . . . . . . . . . . . . . . . . . . 32

3.2.2 Approaches to Affective Sensing . . . . . . . . . . . . . . . . 33

3.2.3 Description of Process . . . . . . . . . . . . . . . . . . . . . 36

3.3 Computer Vision Techniques . . . . . . . . . . . . . . . . . . . . . . 37

3.3.1 Gabor Filters . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.3.2 Active Appearance Models (AAM) . . . . . . . . . . . . . . . 41

3.3.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.3.4 Building an AAM . . . . . . . . . . . . . . . . . . . . . . . . 43

3.3.5 Model Fitting Schemes . . . . . . . . . . . . . . . . . . . . . 49

3.4 Classification Techniques . . . . . . . . . . . . . . . . . . . . . . . . 52

4 Expression Analysis in Practice 53

4.1 Introduction and Motivation . . . . . . . . . . . . . . . . . . . . . . 53

4.2 Functional Requirements . . . . . . . . . . . . . . . . . . . . . . . . 55

4.2.1 Audio Processing . . . . . . . . . . . . . . . . . . . . . . . . 56

4.2.2 Image Processing . . . . . . . . . . . . . . . . . . . . . . . . 56



4.2.3 Video Processing . . . . . . . . . . . . . . . . . . . . . . . . 57

4.2.4 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.2.5 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.3 System Operational Requirements . . . . . . . . . . . . . . . . . . . 58

4.3.1 Implementation Platforms . . . . . . . . . . . . . . . . . . . 58

4.3.2 Audio and Video Formats . . . . . . . . . . . . . . . . . . . 58

4.3.3 System Performance . . . . . . . . . . . . . . . . . . . . . . 59

4.3.4 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.4 The Any Expression Recognition System (NXS) . . . . . . . . . . . . 60

4.4.1 Software Selection for the Core System . . . . . . . . . . . . 60

4.4.2 Software Selection for Major Functions . . . . . . . . . . . . 61

4.4.3 Class Structure . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.4.4 Segments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.4.5 Dialog Creation . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.4.6 Processing Scenario . . . . . . . . . . . . . . . . . . . . . . 64

4.4.7 Measuring Facial Features . . . . . . . . . . . . . . . . . . . 65

4.4.8 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.4.9 System Processing . . . . . . . . . . . . . . . . . . . . . . . 67

4.4.10 Active Appearance Models . . . . . . . . . . . . . . . . . . . 68

4.4.11 Classification Using Support Vector Machine (SVM) . . . . . 69

4.4.12 Gabor Filter Processing . . . . . . . . . . . . . . . . . . . . . 70

5 Sensing for Anxiety 71

5.1 Introduction and Motivation for Experiments . . . . . . . . . . . . . 71

5.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.2 Questions and Hypotheses . . . . . . . . . . . . . . . . . . . . . . . 73

5.2.1 Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.2.2 Questions Pertaining to the Importance of Feature Data . . . . 73

5.2.3 Questions Pertaining to the Relative Importance of Facial Regions . . . . . . 75

5.2.4 Question Pertaining to System Performance . . . . . . . . . . 76

5.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . 76

5.3.2 System Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 79

5.4 Presentation and Analysis of Data . . . . . . . . . . . . . . . . . . . 88

5.4.1 Experiment 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 88

5.4.2 Experiment 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 92

5.4.3 Experiment 3 . . . . . . . . . . . . . . . . . . . . . . . . . . 95

5.4.4 Experiment 4 - Baseline . . . . . . . . . . . . . . . . . . . . 98

5.4.5 Experiment 4 - Classification against Cohn-Kanade SVM . . . 103

5.5 Conclusions and Evaluation . . . . . . . . . . . . . . . . . . . . . . . 107

5.5.1 Hypothesis 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.5.2 Hypothesis 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.5.3 Question Set 1 . . . . . . . . . . . . . . . . . . . . . . . . . 108

5.5.4 Question Set 2 . . . . . . . . . . . . . . . . . . . . . . . . . 109

5.5.5 Question Set 3 . . . . . . . . . . . . . . . . . . . . . . . . . 110

5.6 Overall Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

6 Depression Analysis Using Computer Vision 113

6.1 Introduction and Motivation for Experiments . . . . . . . . . . . . . 113

6.2 Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

6.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

6.3.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . 115

6.3.2 System Setup and Processing . . . . . . . . . . . . . . . . . 118



6.4 Presentation and Analysis of Data . . . . . . . . . . . . . . . . . . . 124

6.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

6.4.2 Old Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . 124

6.4.3 New Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . 133

6.5 Evaluation and Conclusions . . . . . . . . . . . . . . . . . . . . . . . 140

6.5.1 Hypothesis 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 140

6.5.2 Hypothesis 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 140

6.6 Overall Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

7 Semantics and Metadata 143

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

7.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

7.2.1 Use of Ontologies to Describe Content . . . . . . . . . . . . 147

7.2.2 Semantic Markup . . . . . . . . . . . . . . . . . . . . . . . . 148

7.3 An Affective Communication Framework . . . . . . . . . . . . . . . 149

7.3.1 Factors in the Proposed Framework . . . . . . . . . . . . . . 150

7.3.2 Influences in the Display of Affect . . . . . . . . . . . . . . 154

7.4 A Set of Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

7.4.1 Ontology 1 - Affective Communication Concepts . . . . . . . 156

7.4.2 Ontology 2 - Affective Communication Research . . . . . . . 156

7.4.3 Ontology 3 - Affective Communication Resources . . . . . . 157

7.5 An Exemplary Application Ontology for Affective Sensing . . . . . . 157

8 Conclusions 161

8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

8.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

8.2.1 Objective 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

8.2.2 Objective 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 163



8.2.3 Objective 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 164


8.2.4 Objective 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
8.3 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . 165
8.3.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . 165
8.3.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 166

A Analysis and Data - Anxiety 169

B Analysis and Data - Depression 175


B.1 Old Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
B.2 New Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

C Extract of Patient Diagnosis 185

Bibliography 186
List of Figures

1.1 Affective sensing system . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Original image on left and “fitted” image on right . . . . . . . . . . . 4

2.1 The effect of emotion on the human voice [Murray 93] . . . . . . . . 14

2.2 Facial muscles [Fac ] . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.3 Schematic of the basic affective facial processing loop . . . . . . . . 26

3.1 Conceptual overview of processing of facial activity measurements . . 31

3.2 Real and imaginary part of a Gabor wavelet . . . . . . . . . . . . . . 40

3.3 Original image with Gabor magnitude . . . . . . . . . . . . . . . . . 41

3.4 Original image with transform . . . . . . . . . . . . . . . . . . . . . 44

3.5 Face mesh used to build an Active Appearance Model . . . . . . . . . 48

4.1 Class hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.2 Class diagram of the Segment Factory . . . . . . . . . . . . . . . . . 63

4.3 Dialog creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.4 The system menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.5 Project creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.6 User interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65


4.7 Measurements from horizontal delineations image from Feedtum database [Wallhoff ] . . . . . . 66

4.8 Processing of facial activity measurements . . . . . . . . . . . . . . . 68

4.9 Face mesh used to group Action Units . . . . . . . . . . . . . . . . . 69

5.1 Facial landmark points “fitted” to an image . . . . . . . . . . . . . . 74

5.2 Facial region demarcation in image [Wallhoff ] . . . . . . . . . . . . 75

5.3 Experiment 4 - Images from different databases showing different lighting conditions . . . . . . 80

5.4 Original image on left and image after fitting on right . . . . . . . . . 81

5.5 Regions before and after rescaling (3 × actual size) . . . . . . . . . . 82

5.6 Real and Imaginary part of a Gabor wavelet, scale = 1.4, orientation = π/8 (5 × actual size) . . . . . . 86

5.7 Magnitude responses of regions R1, R2 and R3 after convolution (3 × actual size) . . . . . . 86

5.8 K-Fold Cross Validation [PRISM ] . . . . . . . . . . . . . . . . . . . 87

5.9 Experiment 4 - Images fitted using generalised and specific AAMs . . 100

5.10 Experiment 4 - Feedtum images of anger and fear. . . . . . . . . . . . 102

6.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

6.2 Control subject watching video clip - Cry Freedom . . . . . . . . . . 119

6.3 Participant’s view of the interview (video clip - Silence of the Lambs) 120

6.4 Laptops displaying stimuli and recording of subject . . . . . . . . . . 121

6.5 The NXS System - Replaying captured images . . . . . . . . . . . . . 123

6.6 Old Paradigm - Stacked column chart comparing facial activity (Co -
Control, Pa - Patient) . . . . . . . . . . . . . . . . . . . . . . . . . . 126

6.7 Old Paradigm - Clustered column chart comparing facial activity (Co
- Control, Pa - Patient) . . . . . . . . . . . . . . . . . . . . . . . . . 127

6.8 Old Paradigm - Line chart comparing accumulated facial activity (Co
- Control, Pa - Patient) . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.9 Old Paradigm - Facial activity for each video . . . . . . . . . . . . . 129
6.10 Old Paradigm - Number of happy expressions . . . . . . . . . . . . . 130
6.11 Old Paradigm - Number of sad expressions . . . . . . . . . . . . . . 131
6.12 Old Paradigm - Number of neutral expressions . . . . . . . . . . . . . 132
6.13 New Paradigm - Stacked column chart comparing facial Activity (Co
- Control, Pa - Patient) . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.14 New Paradigm - Clustered column chart comparing facial activity (Co
- Control, Pa - Patient) . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.15 New Paradigm - Facial Activity for each video . . . . . . . . . . . . . 136
6.16 New Paradigm - Number of happy expressions . . . . . . . . . . . . 137
6.17 New Paradigm - Number of sad expressions . . . . . . . . . . . . . . 138
6.18 New Paradigm - Number of neutral expressions . . . . . . . . . . . . 139

7.1 Human disease ontology . . . . . . . . . . . . . . . . . . . . . . . . 147


7.2 Cell ontology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.3 A generic model of affective communication . . . . . . . . . . . . . . 150
7.4 Use of the model in practice . . . . . . . . . . . . . . . . . . . . . . 154
7.5 A set of ontologies for affective computing . . . . . . . . . . . . . . 155
7.6 A fragment of the domain ontology of concepts . . . . . . . . . . . . 156
7.7 An application ontology for affective sensing . . . . . . . . . . . . . 159
List of Tables

2.1 Facial Action Coding System - Sample Action Units . . . . . . . . . 16

2.2 Action units for fear expressions [Kanade 00] . . . . . . . . . . . . . 22

3.1 Action units for surprise expressions [Ekman 76, Ekman 02] . . . . . 34

3.2 Action units for fear expressions [Ekman 76, Ekman 02] . . . . . . . 35

3.3 Sample point file with x, y co-ordinates of landmark points . . . . . . 45

4.1 Mapping Action Units to Region Units . . . . . . . . . . . . . . . . . 68

5.1 Experiment 1 - Number of occurrences of each expression [Kanade 00] 76

5.2 Initial numbers of each expression [Kanade 00] . . . . . . . . . . . . 77

5.3 Results from poll - numbers labelled as fear and anxiety retained. . . . 78

5.4 Experiment 2 - Final number of occurrences of each expression retained. 78

5.5 Experiment 3 - Numbers of each expression from Cohn-Kanade database 78

5.6 Experiment 4 - Numbers of each expression from Feedtum database . 79

5.7 Experiment 1 - Recognition results . . . . . . . . . . . . . . . . . . . 90

5.8 Experiment 1 - Recognition results using shape from eyebrow (R1), eye (R2) and mouth (R3) regions . . . . . . 91

5.9 Experiment 2 - Numbers of each expression from Cohn-Kanade database 92


5.10 Experiment 2 - Recognition results . . . . . . . . . . . . . . . . . . . 93

5.11 Experiment 2 - Recognition results using shape from eyebrow (R1), eye (R2) and mouth (R3) regions . . . . . . 94

5.12 Experiment 2 - Post-hoc . . . . . . . . . . . . . . . . . . . . . . . . 94

5.13 Experiment 3 - Recognition results . . . . . . . . . . . . . . . . . . . 96

5.14 Experiment 3 - Recognition results using shape from eyebrow (R1), eye (R2) and mouth (R3) regions . . . . . . 97

5.15 Experiment 4 - Baseline recognition results using Feedtum database . 101

5.16 Experiment 4 Baseline Feedtum database - Recognition results using shape from eyebrow (R1), eye (R2) and mouth (R3) regions . . . . . . 102

5.17 Experiment 4 - Respective chance of each expressions . . . . . . . . 103

5.18 Experiment 4 - Recognition results using SVMs built in experiment 1 . 105

5.19 Experiment 4 - Recognition results using shape from eyebrow (R1), eye (R2) and mouth (R3) regions against SVMs built in experiment 1 . . . . . . 106

6.1 Participant details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

6.2 Participant details and diagnosis . . . . . . . . . . . . . . . . . . . . 116

6.3 Old paradigm - movie list . . . . . . . . . . . . . . . . . . . . . . . . 117

6.4 New paradigm - movie list . . . . . . . . . . . . . . . . . . . . . . . 118

6.5 Participant summary . . . . . . . . . . . . . . . . . . . . . . . . . . 118

A.1 Experiment 1 - Poll results . . . . . . . . . . . . . . . . . . . . . . . 169

B.1 Old Paradigm - Facial activity . . . . . . . . . . . . . . . . . . . . . 175

B.2 Old Paradigm - Accumulated facial activity . . . . . . . . . . . . . . 176

B.3 Old Paradigm - Facial activity for each video . . . . . . . . . . . . . 176

B.4 Old Paradigm - Facial expressions - sorted by happy within video . . 177

B.5 Old Paradigm - Facial Expressions - sorted by sad within video . . . . 178

B.6 Old Paradigm - Facial Expressions - sorted by neutral within video . . 179
B.7 New Paradigm - Accumulated facial activity . . . . . . . . . . . . . . 180
B.8 New Paradigm - Facial activity for each video . . . . . . . . . . . . . 181
B.9 New Paradigm - Facial expressions - sorted by happy within video . . 182
B.10 New Paradigm - Facial expressions - sorted by sad within video . . . 183
B.11 New Paradigm - Facial expressions - sorted by neutral within video . . 184

C.1 Extract of Patient Diagnosis . . . . . . . . . . . . . . . . . . . . . . 185


A man’s face as a rule says more, and more interesting things, than his mouth, for it is
a compendium of everything his mouth will ever say, in that it is the monogram of all
this man’s thoughts and aspirations.

Arthur Schopenhauer

1 Introduction

Significant advances have been made in the field of computer vision in the last few
years. The mathematical underpinnings have evolved in conjunction with increases in
computer processing speed. Many researchers have attempted to apply these improvements
to the field of FER, in a process which typically resembles that shown in Figure 1.1.

Once an image has been acquired, possibly from capturing frames in a video, the
face is detected and local information is extracted from the facial region in the image.
One popular approach is to build a database of the raw feature data, and then use
statistical measures to group the data into representations that correspond to facial
expressions. Newly acquired images are then subjected to the same feature extraction
process, and the resulting feature data are compared to that in the database for
matching facial expressions.

Figure 1.1: Affective sensing system
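To make the pipeline sketched in Figure 1.1 concrete, the following illustrative Python
skeleton shows one way such a train-then-match system could be organised. It is a sketch
under stated assumptions, not the system built in this thesis: the face detector and
feature extractor are trivial stand-ins (in practice they would be a real detector and a
descriptor such as AAM shape parameters or Gabor responses), and the Support Vector
Machine is just one possible choice for the statistical grouping step.

```python
# A schematic sketch of the generic FER pipeline (acquire, detect, extract
# features, train, match). The "detector" and feature extractor below are
# deliberately trivial stand-ins so that the example runs; they are not the
# components used in this thesis.
import numpy as np
from sklearn.svm import SVC


def detect_face(image: np.ndarray) -> np.ndarray:
    """Stand-in face detector: here we simply assume the image is the face."""
    return image


def extract_features(face: np.ndarray) -> np.ndarray:
    """Stand-in feature extractor: a coarse 8x8 block-average of the image."""
    h, w = face.shape[:2]
    cropped = face[: h - h % 8, : w - w % 8]
    grid = cropped.reshape(8, h // 8, 8, w // 8, -1)
    return grid.mean(axis=(1, 3)).ravel()


def train_expression_model(images, labels):
    """Build the feature database and fit a statistical classifier to it."""
    features = np.vstack([extract_features(detect_face(img)) for img in images])
    model = SVC(kernel="rbf")            # one possible statistical grouping step
    model.fit(features, labels)          # labels are expression names, e.g. "fear"
    return model


def recognise_expression(model, image) -> str:
    """Apply the same extraction to a new image and match it against the model."""
    feature = extract_features(detect_face(image)).reshape(1, -1)
    return model.predict(feature)[0]
```

Swapping in a different detector, descriptor or classifier changes only the individual
functions, not the overall train-then-match structure described above.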

Academic studies tend to make use of freely available, annotated sets of images.
These community databases used for training and testing are usually built from acted,
posed or induced expressions [Kanade 00, jaf , MMI , Wallhoff , Wallhoff 06], and this
tends to give an oversimplified picture of emotional expression. Making use of video
or images captured in a natural setting is less common.

Most studies aim to recognise prototypical emotion expressions such as fear, anger
and happiness. Fewer attempt to apply the same techniques to more subtle and pervasive
moods and emotional states, such as boredom, arousal, anxiety and depression.

The limitations placed on FER studies are quite understandable. [Ekman 82] has
shown that the facial expressions of anger, disgust, fear, joy, sadness, and surprise are
universal across human cultures (although in [Ekman 99] he did expand the list to
include amusement, contempt, contentment, embarrassment, excitement, guilt, pride,
relief, satisfaction, sensory pleasure and shame). Outside of the “unbidden” [Ekman 82]
emotions, the display rules of facial expressions vary with factors such as culture,
context and personality type.

1.1 Motivation

The difficulties outlined above, however, should not preclude research into how
recently developed techniques could be applied to non-primary emotional expressions.
Several studies have confirmed characteristics such as speech, face, gesture, Galvanic
Skin Response (GSR) and body temperature as being useful in the diagnosis and
evaluation of therapy for anxiety, depression and psychomotor retardation
[Flint 93, Moore 08]. Vocal indicators have been shown to be of use in the detection
of mood change in depression [Ellgring 96]. Some studies have suggested linking
certain syndromes by comparing parameters from modalities, such as speech and motor
behaviour, to discriminate different groups. In [Chen 03], the eye blink rate in adults
was used in an attempt to diagnose Parkinson’s disease. [Alvinoa 07] have attempted to
link the computerised measurement of facial expression to emotions in patients with
schizophrenia.

Thus, even if the results are to be used in conjunction with measurements of other
modal expressions, e.g. vocal, eye-blink or gaze, there is good reason to explore the
use of the recent developments in computer vision. One obvious incentive is to provide
low-cost and unobtrusive ways to sense for disorders.

The motivation behind this dissertation is to test whether recent improvements in
the field of computer vision could be used to verify the existence of states such as
anxiety and depression.

1.2 Objectives

Of particular interest in this dissertation is the recent evolution in computer vision of
the AAM [Edwards 98]. These are used to locate fiducial, or landmark, points around
a face in an image. An example of the before and after “fitting” of the landmark points
is shown in Figure 1.2.

(a) Original image (b) Image with landmark points

Figure 1.2: Original image on left and “fitted” image on right

If the landmark points can be reliably and consistently found within an image then
the collective “shape” for the points, together with the pixel information, can be used
to build representations of facial expressions.
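As a simple illustration of this idea, the sketch below collects a set of fitted landmark
points into a single normalised shape vector. The translation and scale normalisation is
a common pre-processing choice assumed here for the example; it is not a step prescribed
by this thesis.

```python
# A small sketch of turning fitted landmark points into a "shape" vector that
# can be compared across images. The translation/scale normalisation is an
# assumed, common pre-processing choice, not a step taken from the thesis.
import numpy as np


def shape_vector(landmarks: np.ndarray) -> np.ndarray:
    """landmarks: (N, 2) array of fitted (x, y) points for one face image."""
    pts = np.asarray(landmarks, dtype=float)
    pts = pts - pts.mean(axis=0)            # remove translation
    scale = np.sqrt((pts ** 2).sum())       # overall size of the point cloud
    if scale > 0:
        pts = pts / scale                   # remove scale
    return pts.ravel()                      # concatenate into a 2N-vector


# Stacking such vectors for many annotated images and summarising them with PCA
# gives a compact shape basis, which is how statistical shape and appearance
# models are typically built.
example = shape_vector(np.array([[10.0, 20.0], [30.0, 25.0], [22.0, 40.0]]))
print(example.shape)                        # (6,)
```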

The major objectives in this dissertation are outlined as follows:

1. Explore, through the construction of a prototype system that incorporates AAMs, what would be required in order to build a fully-functional FER system, i.e. one where the system could be trained and then used to recognise new and previously unseen video or images;

2. Investigate whether FER practices could be applied to non-primary emotional expressions such as anxiety;

3. Examine whether FER practices could be applied to non-primary emotional expressions, such as those displayed by someone suffering from MDD; and

4. Identify avenues for improvement in the emotional expression recognition process.

1.3 Thesis Outline

This dissertation comprises eight chapters, including this introduction. A brief outline
of each of the other chapters is as follows:

Chapter 2 - Literature Review In the literature review, the controversial topic of defining emotions is discussed, followed by an explanation of some of the schemes used to describe emotions. This is followed by a brief description of the physiology contributing to emotional display. Next, the Facial Action Coding System (FACS), used ubiquitously to describe facial musculature activity, is introduced. The chapter concludes with a discussion of how dysphoric conditions, i.e. anxiety and depression, might affect emotional expression.

Chapter 3 - Affective Sensing Affective sensing is reviewed in Chapter 3. Schemes used to elicit speech samples (audio and video) are discussed along with their relative strengths and weaknesses. The chapter then narrows the focus to facial expression analysis, and a detailed view of the state-of-the-art components for analysing them in image and video is presented. The remainder of the chapter is devoted to the theoretical bases of the components, real-life applications, and their strengths and weaknesses.

Chapter 4 - Expression Analysis in Practice The practical exploration and experimental contribution in this dissertation is presented, firstly, with a discussion of some of the capabilities that a full lifecycle, real-world, multi-modal affective sensing system might require, before turning, briefly, to describe the NXS system.

Chapter 5 - Sensing for Anxiety The experimental contribution of this thesis begins
in Chapter 5. Using anxiety as an example, the exercise serves as a proof of
concept for the techniques presented in Chapters 3 and 4, and to test whether
more subtle expressions could possibly be tracked using these concepts. The
experiments were conducted on the Cohn-Kanade [Kanade 00] database which
is available for academic use. Each of the experiments involves different aspects
and degrees of difficulty.

Chapter 6 - Depression Analysis Using Computer Vision Chapter 6 builds on the work from the previous chapter and describes an experiment that is currently incorporated in a collaborative project at the Black Dog Institute, Sydney, Australia.1 The contribution is an exploration into applying state-of-the-art, low-cost, unobtrusive techniques to measure facial activity and expression, and then applying them to a real-life application.

Chapter 7 - Semantics and Metadata With the results from Chapters 5 and 6 at hand, the reality and limitations of the statistical approach to facial expression analysis are considered, along with strategies to improve the field and overcome some of the limitations.

Chapter 8 - Conclusions Finally, the conclusions and a summary of the contributions of this dissertation are presented. Open issues and future directions for ongoing research are discussed.

1 The Black Dog Institute is a not-for-profit, educational, research, clinical and community-oriented facility in Sydney, Australia, offering specialist expertise in depression and bipolar disorder. Available at http://www.blackdoginstitute.org.au/, last accessed 23 May 2010.
Even before I open my mouth to speak, the culture into which I’ve been born has entered
and suffused it. My place of birth and the country where I’ve been raised, along with my
mother tongue, all help regulate the setting of my jaw, the laxity of my lips, my most
comfortable pitch.

Anne Karpf

2 Literature Review

2.1 Introduction

This chapter begins with a broad discussion of emotional expression. The accepted
practices for its elicitation and description are discussed before narrowing the focus
and reviewing the research on expressions and how they relate, more specifically, to
anxiety and depression.

In Section 2.2, the somewhat difficult topic of defining emotions is broached, followed
by an introduction to the annotation schemes commonly used. This is followed, in
Section 2.3, by a brief description of the physiology contributing to emotional display.
Next, in Section 2.4, FACS, the ubiquitous system for describing facial muscle activity,
is introduced. The chapter concludes with a discussion of how dysphoric conditions,
i.e. anxiety and depression, might affect emotional expression.

2.2 Describing Emotions

Defining emotion is a bit like trying to define “knowledge”. It has deep ontological
significance; it goes to the heart of human existence; yet there is no universal agreement
on its definition. Whilst it is possible to observe physical symptoms arising from our
internal state, attaining agreement on emotion definitions and categories is challenging.
Taxonomies vary across disciplines, and [Cowie 03] point out that psychology, biology
and ecology have different stances. There are qualitative and quantitative approaches.

Definitions in philosophy such as, “Emotion is evolution’s way of giving meaning to our
lives”, might advance a philosophical view but do not translate neatly into computer
science. Some authors suggest that nostalgia, jealousy and disappointment are emotions
[Bower 92], while others propose a taxonomy of around five basic categories, e.g. Love,
Joy, Anger, Sadness, and Fear, with all other emotions as sub-categories. Many of the
cross-disciplinary articles and books agree on a small set of emotions, e.g. fear,
anger, sadness, happiness, sometimes disgust, sometimes surprise, and some include
neutral. [Bower 92] “weed” the 600-plus items in the affective lexicon, removing
behavioural responses, physical and body states, and short-hand expressions.
Whissell [Whissell 89], on the other hand, has created a dictionary of thousands of
affective language words that has been used to rate the affective content of the Bible,
the Quran, Shakespeare, Dickens, Beowulf, and the works of several poets.

Complicating the picture even further is the question of moods. We know that moods are
accompanied by physiological changes and affect our decisions, but are they emotions?
The common view in the literature is that they are the longer-term form of the affective
state [Picard 97]. Emotions are seen as reactions [Bower 92, Cowie 03], having a cause
or stimulus and a brief experience associated with them, whereas mood is seen as
lingering and less specific. Moods have valence but less intensity, and emotions and
mood can exist at the same time. Intuitively, one would think that they could affect one
another, and, presumably, affect the valence of the emotion.

2.2.1 Emotion Classification Schemes

Traditional emotion theory is a large amalgam of approaches to the study of emotion,
mostly based on cognitive psychology but with contributions from learning theory,
physiological psychology, clinical psychology and other disciplines, including
philosophy. Classification schemes have traditionally been sourced from these areas.

[Cowie 03] present a good review of emotional classification regimes. Two approaches to
describing emotions are dominant. The first, defining categories, is the more common
technique. The second approach uses dimensions. A third, less dominant, approach makes
use of appraisal theory [Sander 05].

The various classification schemes are discussed in more detail in the following
subsections.

Category Approach

The most popular grouping of emotions, referred to as the “big six”, comprises fear,
anger, happiness, sadness, surprise and disgust [Cornelius 96, Ekman 99]. These are
regarded as full-blown emotions [Scherer 99] and are evolutionary, developmental and
cross-cultural in nature [Ekman 82, Ekman 99]. However, there are many alternative
groupings both across disciplines and within disciplines. Some studies concentrate
on only one or two select categories, while others employ schemes using more than twenty
emotional archetypes. Thus, one of the difficulties in comparing results from studies
into emotions is that the choice of categories used between studies is not consistent
and will often depend on the application that the researcher has in mind. If the focus
of the research is to understand full-blown emotions, then the big-six or a subset might
be adequate. However, if the objective is to study the less dramatic emotional states in
everyday life, with all the shades and nuances that we know distinguish them, then the
choice of categories is much more difficult [Schröder 05].

The main criticisms of the category approach are that:

• there is no agreed number of categories or definitions;

• a large number of descriptors exist, with overlapping meanings;

• there is a lack of consistency of words across languages; and

• the use of categories in research is inconsistent.

Dimensional Approach

Another way to label the affective state is to use dimensions. Instead of choosing
discrete labels, one or more continuous scales, such as pleasant/unpleasant,
attention/rejection or simple/complicated, are used. Two common scales are valence
(negative/positive) and arousal (calm/excited). Valence describes the degree of
positivity or negativity of an emotion or mood; arousal describes the level of
activation or emotional excitement. Sometimes a third dimension, control or attention,
is used to address the internal or external source of emotion. [Cowie 03] have developed
a software application called FEELTRACE to assist in the continuous tracking of
emotional state on these two dimensions.
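As a minimal illustration of the dimensional approach, an annotation can be stored as a
point on the two scales rather than as a discrete label. The [-1, 1] scaling and the
quadrant names in the sketch below are illustrative assumptions only, not part of any
particular annotation standard.

```python
# A minimal sketch of a dimensional (valence/arousal) annotation. The [-1, 1]
# range and the quadrant names are illustrative assumptions only.
from dataclasses import dataclass


@dataclass
class AffectPoint:
    valence: float   # negative (-1) .. positive (+1)
    arousal: float   # calm (-1) .. excited (+1)

    def quadrant(self) -> str:
        """A coarse reading of where the point falls on the two dimensions."""
        if self.valence >= 0:
            return "positive-excited" if self.arousal >= 0 else "positive-calm"
        return "negative-excited" if self.arousal >= 0 else "negative-calm"


# A FEELTRACE-style continuous rating would then be a time-indexed sequence of
# such points rather than a single categorical label.
print(AffectPoint(valence=-0.4, arousal=0.7).quadrant())   # negative-excited
```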

Appraisal Approach

Scherer has extensively studied the assessment process in humans and suggests that
people affectively appraise events with respect to novelty, intrinsic pleasantness, goal/need
significance, coping, and norm/self compatibility [Scherer 99].

It is not yet clear how to implement this approach in practice although there is at
least one quite complex paper describing a practical application of the model [Sander 05].

Discussion of Description Schemes

Both the categorical and dimensional approaches, whilst practical, suffer from being
highly subjective. This is not only due to complexity, but also because of the
differences in the efficiency of listeners’ physiology and the fact that the listener’s
own affective state influences their judgment of the speaker’s affective state. Hence,
there is a need for a model that includes listener attributes.

Describing More Subtle Emotions

Analysis of the Belfast Naturalistic Database, a corpus of emotional utterances, has
shown that full-blown emotions occur more rarely in natural speech than is commonly
assumed [Cowie 03].

A study by the Human-Machine Interaction Network on Emotion [HUMAINE 06] group proposed
that emotions be dichotomised into episodic and pervasive categories. Episodic emotions
are more like the traditional set of full-blown emotions, or that which Scherer would
describe as emotions. Pervasive emotions are the everyday emotions such as grief/sorrow,
sarcasm/irony and surprise/astonishment, but could include the more subtle states of
anxiety and depression. Regardless of the method employed to describe emotions, adding
more subtle emotions to any exercise greatly complicates the task of describing
emotional content in video and audio speech samples.

A study by [Devillers 05] has included labelling of blended and secondary emotions in a
corpus of medical emergency call centre dialogues, as well as including task-specific
context annotation to one of the corpora.

2.3 The Physiology of Emotional Display

Apart from the obvious, speech carries a great deal more information than just the
verbal message. It can tell us about the speaker, their background and their emotional
state. Age, gender, culture, social setting, personality and well-being all play their
part in suffusing our communication apparatus even before we begin to speak. Studies by
[Koike 98, Shigeno 98] have shown that subjects more easily identify an emotion in a
speaker from their own culture, and that people will predominantly use visual
information to identify the emotion. Everyday expressions such as “lump in the throat”,
“stiff upper lip” and “plum in the mouth” point to our awareness of the physiological
changes that emotions have on the voice.

Early work from James [James 90] contended that emotions could be equated with awareness
of a visceral response; in other words, the contention that emotions follow physical
stimuli. That may be true for fast primary emotions; however, the twentieth-century view
is that emotions are antecedent, and can more often be detected from physiological
measurements. For example, your heart rate goes up when you discover that you have won
lotto, you think that you have lost your ATM card, or you realise that you have
forgotten an important birthday.

The neurobiological explanation of human emotion is that it is a pleasant or unpleasant
mental state organised in the limbic system. Recent studies establish that emotional
stimuli are given priority, or a privileged status, within the brain [Davidson 04].
Primary emotions such as fear use the limbic system circuitry along with the amygdala
and anterior cingulate gyrus. Secondary emotions take a slightly different path that
takes in memory. The stimulus may still be processed directly via the amygdala, but is
now also analysed in the thought process before further processing.

Changes in brain patterns result in modulations in our major anatomical systems. Stress
tenses the laryngeal muscles, in turn tightening the vocal folds. The result is that
more pressure is required to produce sound. Consequently, the fundamental frequency and
amplitude of the larynx wave vary, particularly with regard to the ratio of the open to
the closed phase of the cycle. The harmonics of the larynx wave vary according to the
specific balance of mass, length and tension that is set up to produce a given
frequency [Fry 79].

Some affective states like anxiety can influence breathing, resulting in variations in
sub-glottal pressure. Drying of the mucous membrane causes shrinking of the voice. Rapid
breath alters the tempo of the voice, and relaxation tends to deepen the breath and
lower the voice. Changes in facial expression can also alter the sound of the voice.

Figure 2.1 represents the typical cues to the six most common emotion categories
[Murray 93].

Darwin raised the issue of whether or not it was possible to inhibit emotional
expression [Ekman 03]. This is an important question in human emotion recognition and in
emotion recognition by computers. Intentional or not, the voice and face are used in
everyday life to judge verisimilitude in speakers. Many studies [Anolli 97,
Hirschberg 05] have investigated the detection of deception in the voice.

2.4 Describing Facial Activity

Although some studies have made use of MPEG-4 compliant Facial Definition Parameters
(FDP) [Cowie 05b], the most ubiquitous and versatile method of describing facial
behaviour, pioneered by Ekman [Ekman 75, Ekman 82, Ekman 97, Ekman 99, Ekman 03], is
FACS. The goal of FACS is to provide an accurate description of facial activity based
on musculature, and to lead to a system of consistent and objective description of
facial expression.

fear - speech rate: much faster; pitch average: very much higher; pitch range: much wider; intensity: normal; voice quality: irregular voicing; pitch changes: normal; articulation: precise
anger - speech rate: slightly faster; pitch average: very much higher; pitch range: much wider; intensity: higher; voice quality: breathy, chest tone; pitch changes: abrupt, on stressed syllables; articulation: tense
sorrow - speech rate: slightly slower; pitch average: slightly lower; pitch range: slightly narrower; intensity: lower; voice quality: resonant; pitch changes: downward inflections; articulation: slurring
joy - speech rate: faster or slower; pitch average: much higher; pitch range: much wider; intensity: higher; voice quality: breathy, blaring; pitch changes: smooth, upward inflections; articulation: normal
disgust - speech rate: very much slower; pitch average: very much lower; pitch range: slightly wider; intensity: lower; voice quality: grumbled chest tone; pitch changes: wide, downward terminal inflections; articulation: normal
surprise - speech rate: much faster; pitch average: much higher; intensity: higher; pitch changes: rising contour

Figure 2.1: The effect of emotion on the human voice [Murray 93]

Despite being based on musculature, the FACS measurement units are Action Units (AUs),
the identifiable actions of individual muscles or groups of muscles. One muscle can be
represented by a single AU. Conversely, the appearance changes produced by one muscle
can sometimes appear as two or more relatively independent actions attributable to
different parts of the muscle. A FACS coder decomposes an observed expression into the
specific AUs that produced the movement and records them. For example, during a smile,
the zygomaticus major muscle is activated, corresponding to AU12. During a spontaneous
or Duchenne smile, the orbicularis oculi muscle is also recruited, corresponding to AU6.
The FACS coder records the AUs and, if needed, their duration, intensity and asymmetry.
Figure 2.2 and Table 2.1, together, give a description of the facial muscles and an
example of some common AUs. FACS scores, in themselves, do not provide interpretations.
EMFACS, an emotion-focused subset of FACS, deals only with emotionally relevant facial
action units.

Figure 2.2: Facial muscles [Fac ]

AU Description
1 Inner Brow Raiser – Frontalis (pars medialis)
2 Outer Brow Raiser – Frontalis (pars lateralis)
4 Brow Lowerer – Corrugator and Depressor supercilii
5 Upper Lid Raiser – Levator palpebrae superioris
6 Cheek Raiser – Orbicularis oculi (pars orbitalis)
7 Lid Tightener – Orbicularis oculi (pars palpebralis)
9 Nose Wrinkler – Levator labii superioris alaeque nasi
10 Upper Lip Raiser – Levator labii superioris
11 Nasolabial Deepener – Zygomaticus minor
12 Lip Corner Puller – Zygomaticus major
13 Cheek Puffer – Levator anguli oris
14 Dimpler – Buccinator
15 Lip Corner Depressor – Depressor anguli oris
16 Lower Lip Depressor – Depressor labii inferioris
17 Chin Raiser – Mentalis

Table 2.1: Facial Action Coding System - Sample Action Units
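As a toy illustration of how such a coding might be interpreted downstream, the short
sketch below applies the AU12/AU6 smile example from the paragraph above. It is a
deliberately simplified, illustrative function (not part of the thesis system): real
FACS scoring also records intensity, duration and asymmetry, which are ignored here.

```python
# A toy interpretation of a FACS coding, using the AU12 / AU6 smile example.
# Intensity, duration and asymmetry, which a FACS coder may also record, are
# ignored in this illustration.
def describe_smile(action_units: set) -> str:
    if 12 not in action_units:       # AU12: Lip Corner Puller (zygomaticus major)
        return "no smile coded"
    if 6 in action_units:            # AU6: Cheek Raiser (orbicularis oculi)
        return "Duchenne (spontaneous) smile: AU6 + AU12"
    return "non-Duchenne smile: AU12 without AU6"


print(describe_smile({6, 12}))       # Duchenne (spontaneous) smile: AU6 + AU12
print(describe_smile({12, 25}))      # non-Duchenne smile: AU12 without AU6
```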

2.5 Anxious Expression

2.5.1 Anxiety Disorders

There are many types of afflictions that are labelled as “anxiety”, e.g. test anxiety (a
type of performance anxiety), death anxiety and stage fright. Typically, these have
short-term impact and, depending on the level of arousal, may or may not affect a
person’s performance. It is when anxiety begins to affect someone’s day-to-day life
that it is classed as a disorder. An anxiety disorder is an umbrella term used to cover
(at least) six types of disorder [beyondblue , NIMH ]:

• Generalised Anxiety Disorder (GAD);

• social anxiety disorder;

• phobia;

• Obsessive Compulsive Disorder (OCD);

• Post-Traumatic Stress Disorder (PTSD); and

• panic disorder.

In the discussion that follows, the social anxiety disorder has been included along
with the phobia disorder.

GAD

GAD is usually diagnosed with reference to an instrument such as the Diagnostic and
Statistical Manual of Mental Disorders (DSM-IV) [dsm 00] [Kvaal 05]. Broadly speaking,
someone who has felt anxious for at least six months, and whose anxiety is adversely
affecting their life, will meet the criteria. The anxiety might be associated with
issues such as finances, illness or family problems. The adverse impact might include
factors such as insomnia, missed work days or fatigue.
[beyondblue ] report that GAD affects approximately 5 per cent of people in Aus-
tralia at some time in their lives.1 Diagnosing GAD can be difficult for a clinician as
the symptoms are shared with other types of anxiety and it often coexists with other
psychiatric disorders, e.g. depression or dysthymia (chronic, “low-grade” depression).
The symptoms of GAD are so broad that it would be difficult to imagine it ever
being capable of detection by machine.

Phobia

The most common phobias are:

• acrophobia - fear of heights;

• agoraphobia - fear of open spaces such as parks and big shopping centres;

• claustrophobia - fear of small spaces such as lifts, aeroplanes and crowded rooms;

• mysophobia - fear of dirt and germs in places such as toilets and kitchens;

• social phobia or social anxiety - fear of social situations such as parties and
meetings; and

• zoophobia - fear of animals.

[beyondblue ] report that, “...approximately 9 per cent of people in Australia
experience a phobia at some time in their lives. Phobias are twice as common in women as
in men and can start at any age.”
1 beyondblue is a national, independent, not-for-profit organisation working to address issues associated with depression, anxiety and related substance misuse disorders in Australia.

Anxious episodes of the types listed above are the easiest to induce (ethically and
practically) and are the most common candidates for facial expression recognition
experiments.

OCD

OCD symptoms are characterised by obsessions or compulsions, which are defined by the
following criteria:

• Recurrent and persistent thoughts, impulses, or images are experienced at some time during the disturbance as intrusive and inappropriate and cause marked anxiety and distress. Those with this disorder recognise the craziness of these unwanted thoughts (such as fears of hurting their children) and would not act on them, but the thoughts are very disturbing and difficult to tell others about;

• The thoughts, impulses, or images are not simply excessive worries about real-life problems;

• The person attempts to suppress or ignore such thoughts, impulses, or images or to neutralise them with some other thought or action; and

• The person recognises that the obsessional thoughts, impulses, or images are a product of his/her own mind (not imposed from without, as in thought insertion) [dsm 00].

Typical compulsions include:

• cleaning or hand-washing;

• checking things repeatedly, e.g. that appliances are turned off or that doors and
windows are locked;

• constantly counting or checking the order or symmetry of objects;



• superstitions about colours or numbers; and

• hoarding items such as newspapers, books, or clothes.

OCD affects 2 to 3 per cent of people in Australia at some time in their lives
[beyondblue ].

The varied and sometimes serious nature of the compulsions and situations makes OCD an
unlikely candidate for facial expression recognition.

PTSD

PTSD occurs when a person has been exposed to a traumatic event in which both of the
following were present:

• the person experienced, witnessed, or was confronted with an event or events that
involved actual or threatened death or serious injury, or a threat to the physical
integrity of self or others; and

• the person’s response involved intense fear, helplessness, or horror.


Note: In children, this may be expressed instead by disorganised or agitated
behaviour [dsm 00].

The symptoms of PTSD can include:

• flashbacks and nightmares;

• insomnia;

• loss of interest or enjoyment in life;

• difficulty concentrating; and

• amnesia.

Approximately 8 per cent of people in Australia are affected by PTSD at some time
in their lives [beyondblue ]. PTSD has been the subject of some interesting virtual
reality applications for rehabilitation. In 2010, the U.S. Army began a four-year study
to track the results of using virtual reality therapy to treat Iraq and Afghanistan war
veterans suffering PTSD.2

2 http://www.army.mil/, last accessed 10 April 2010

The serious nature of this type of anxiety would mean that there would be some important
ethical and patient-care considerations to be met before any study is undertaken.

Panic Disorder

The DSM-IV first sets out the definition of Panic Attack as:

A discrete period of intense fear or discomfort, in which four (or more) of


the following symptoms developed abruptly and reached a peak within 10
minutes:

1. palpitations, pounding heart, or accelerated heart rate;

2. sweating;

3. trembling or shaking;

4. sensations of shortness of breath or smothering;

5. feeling of choking;

6. chest pain or discomfort;

7. nausea or abdominal distress;

8. feeling dizzy, unsteady, lightheaded, or faint;

9. derealization (feelings of unreality) or depersonalization (being de-


tached from oneself);
2 http://www.army.mil/, last accessed 10 April 2010

10. fear of losing control or going crazy;

11. fear of dying;

12. paresthesias (numbness or tingling sensations); or

13. chills or hot flushes [dsm 00].

After first discounting the effects of substance abuse, a general medical condition or
some other mental disorder that might better account for the condition, e.g. OCD, the
criteria for Panic Disorder are specified as recurring, unexpected Panic Attacks where
at least one of the attacks has been followed by 1 month (or more) of one (or more) of
the following:

• persistent concern about having additional attacks;

• worry about the implications of the attack or its consequences (e.g., losing con-
trol, having a heart attack, “going crazy”); or

• a significant change in behavior related to the attacks.

Around 3 per cent of the Australian population has experienced a panic disorder
[beyondblue ]. Although this type of anxiety would make an interesting research topic,
one would think that there would be some quite restrictive ethical considerations.

2.5.2 Anxious Facial Expressions

Numerous studies in the past twenty years have confirmed characteristics such as
speech, face, gesture, galvanic skin response (GSR) and body temperature, as being
useful in the diagnosis and evaluation of therapy for anxiety, depression and psychomotor
retardation [Flint 93]. Some earlier studies have suggested linking certain syndromes by
comparing parameters from modalities such as speech and motor activity to discriminate
between different groups. [Chen 03] attempted to link eye blink rate in adults to
Parkinson's disease.
Anxiety is sometimes confused with fear, which is a reaction normally commensu-
rate with some form of imminent threat. In the case of anxiety, the perceived threat is
usually in the future and the reaction tends to be irrational or out of proportion to the
threat. The facial expressions of fear and anxiety, however, are similar [Harrigan 96].
Confounding the problem is that the facial expression of surprise is similar to that
of fear. One key difference between the fearful and the surprise expression is the
mouth movement - a fearful expression involves a stretching of the lips in the horizontal
direction rather than opening of the mouth. The AUs for fearful expressions are shown
in Table 2.2.
    Emotion   Prototype            Major Variants
    Fear      1+2+4+5*+20*+25      1+2+4+5*+L or R20*+25, 26, or 27
              1+2+4+5*+25          1+2+4+5*
                                   1+2+5Z, with or without 25, 26, 27
                                   5*+20*, with or without 25, 26, 27

    * means in this combination the AU may be at any level of intensity
    L - Left, R - Right
    Action Unit 1 (Inner Brow Raiser), Action Unit 2 (Outer Brow Raiser),
    Action Unit 5 (Upper Lid Raiser), Action Unit 20 (Lip Stretcher),
    Action Unit 25 (Lips Part), Action Unit 26 (Jaw Drop)

Table 2.2: Action units for fear expressions [Kanade 00]

Relatively little research has been conducted to definitively map which action units
are associated with anxiety. While available research generally supports the efficacy
of human ability to judge anxiety from facial expressions [Harrigan 96, Harrigan 97,
Harrigan 04, Ladouceur 06], understandably, due to logistical considerations, much of
the work has been conducted within the confines of social anxieties and in specific situ-
ations such as dental treatment [Buchanan 02], examinations, public speaking, children
receiving immunisations, medical examinations [Buchheim 07] and human-computer
interactions [Kaiser 98]. [Harrigan 96] reports:

“First, people reveal feelings of anxiety facially in the form of actions


composing the expression of fear rather than other affects thought to com-
pose anxiety (distress, interest). These actions were not the full fear expression
of widened, tense eyes, raised and drawn brows, and horizontal mouth
stretch typically displayed in more intense situations described as fearful
[Ekman 71]. Rather, partial actions of the fear expression involving the
mouth or brows were exhibited and corresponded to the degree of fear-
anxiety experienced by the participants (i.e. moderate anxiety).”

Of importance, [Harrigan 96] found that the brow movement exhibited when fear
is experienced, i.e. brows raised and drawn together, was displayed by the participants
in the study, but less often than the mouth movement for fear. They summarise:

“The most predominant fear element was the horizontal mouth stretch
movement. This horizontal pulling movement of the mouth and brief ten-
sion of the lips was clearly visible on the videotapes and could not be
confused with movements required in verbalization, smiling, or other fa-
cial action units. The brow movement exhibited when fear is experienced,
brows raised and drawn together, was displayed by the participants in this
study, but less often than the mouth movement for fear.”

The finding of [Kaiser 98], while studying human-computer interactions, was that
AU20 (lip stretcher) is found predominantly in fear. [Ellgring 05] noted that actor
portrayals provide little evidence for an abundance of distinctive AUs or AU combina-
tions that are specific to basic emotions. [Ellgring 05] reports that there is quite a
distribution of AUs through each emotion - including anxiety. Muddying the waters
is the fact that there can be large variations in the way individuals react to stressful
stimuli. Genetics, personality and biochemistry factors all play a part in the propensity
to display an anxious expression.

Trait anxiety is a longer-term predisposition to anxiety whereas state anxiety is


the short-term anxiety induced by some recent event. These are defined in the clinical
practitioner’s Spielberger State-Trait Anxiety Inventory [Kvaal 05]. In a meta-analysis
of 46 state anxiety studies and 34 studies on trait anxiety, [Harrigan 04] conclude that
state anxiety was recognised by observers with greater accuracy than was trait anxiety,
but the modality was important. State anxiety was identified best from auditory signals,
whereas trait anxiety was identified best from visual (video) signals. Intuitively, this
makes some sense.

[Lazarus 91] concludes that a cognitive appraisal of threat is a prerequisite for the
experience of this type of emotion. If this is the case, then this does not augur well for
the ability to detect anxiety in a system trained from acted expressions. That is, one
would question how well actors could portray anxious expressions without a threat
stimulus.

2.6 Non-verbal Communication in Depression

As in the previous section, this section outlines the effects of the disorder on facial
expressions. Unlike the previous section, it is expanded to include the processing of
emotional content by patients with MDD, in order to provide a background to the
experimental work in Chapter 6.

Early attempts to link facial activity with depression used broad measurements
such as cries, smiles and frowns [Grinker 61]. Some studies have used Electromyography
(EMG) to measure muscle response, notwithstanding the somewhat intrusive and con-
straining nature of the equipment [Fridlund 83]. The more recent trend is to use
the FACS, as described in Section 2.4, to add rigour and objectivity to the process
[Reed 07, Renneberg 05, Cohn 09, McDuff 10].

The difficulty is that capturing and recording measurements of facial activity manually
requires time, effort, training, and a regime for maintaining objectivity. Such manual
work is tedious and prone to errors. Even the abridged version of FACS,
EMFACS [EMFACS ], which deals only with emotionally relevant facial action units,
requires a scoring time of approximately 10 minutes of measurement for one minute of
facial behaviour. Further, only people who have passed the FACS final test are eligible
to download EMFACS for use.

2.6.1 Ellgring’s Study

In [Ellgring 08], the levels of facial activity, before and after treatment, of endogenous
and neurotic depressives were measured through several key indicators. [Ellgring 08]
hypothesised that facial activity and the repertoire of its elements will be reduced dur-
ing depression and will expand with improvement of subjective wellbeing. Facial be-
haviour was analysed by applying EMFACS to the videotapes of 40 clinical interviews
of 20 endogenous depressed patients. After analysing a frequency distribution of all
of the AU observations across all of the interviews, 13 AUs or groups of AUs were
then used to complete the study. AUs that nearly always occur together, e.g. AU1 and
AU2 (inner and outer brow raiser), AU6 and AU12 (cheek raiser and lip corner puller)
were considered as one AU group. Activity, repertoire and patterns were defined as
parameters of facial activity and can be summarised as the following three measure-
ments:

• General facial activity: The first measurement, total number of AUs in a spe-
cific interval, counts the number of single, significant AUs and groups of closely
related AUs that occur within a 5 minute interval.

• Specific facial activity: The second, frequency of specific, major AUs, defines
the number of major AU combinations, e.g. AU6+12 in the case of a spontaneous
smile, occurring within a 5 minute interval.
• Repertoire: The third, repertoire of AUs, is the number of distinct AUs occurring
more than twice within a 5 minute interval.

Figure 2.3: Schematic of the basic affective facial processing loop
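
As a rough illustration of how these three measurements could be computed automatically,
the following Python sketch assumes a hypothetical list of (timestamp, AU label) events
produced by some upstream AU detector; the data format and the 300-second interval are
illustrative assumptions, not part of [Ellgring 08].

    from collections import Counter

    def facial_activity_measures(events, start, interval=300, target_combo="6+12"):
        # events: assumed list of (timestamp_in_seconds, au_label) pairs, where an
        # au_label is a string such as "12" or a combination such as "6+12"
        window = [au for t, au in events if start <= t < start + interval]
        counts = Counter(window)
        general_activity = len(window)                         # total AU occurrences
        specific_activity = counts[target_combo]               # e.g. AU6+12 spontaneous smiles
        repertoire = sum(1 for n in counts.values() if n > 2)  # distinct AUs seen more than twice
        return general_activity, specific_activity, repertoire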

2.6.2 Processing of Emotional Content

Depressed subjects have been shown to respond differently to images of negative and
positive content, when compared with non-depressed subjects. The underlying cause
could be the impaired inhibition of negative affect, which has been found in depressed
patients across several studies [Goeleven 06, Lee 07]. In turn, altered patterns of facial
activity have been reported in those patients suffering MDD [Reed 07, Renneberg 05].
If this “affective facial processing loop”, as shown in Figure 2.3, is reliable, then 1) ob-
jective observations could be of clinical importance in diagnosis; and 2) measurements
of facial activity could possibly predict response to treatments such as pharmacother-
apy and Cognitive Behaviour Therapy (CBT).

As stated previously, it is commonly hypothesised that depression is characterised



by dysfunctional inhibition toward negative content. The postulated reason for this is
that depressed subjects show lowered activation in the regions responsible for
gaining attentional control over emotional interference. In [Goeleven 06], inhibition
to positive and negative stimuli was studied across subjects including hospitalised
depressed patients, formerly depressed, and never-depressed control subjects. They
report that depressed patients show a specific failure to inhibit negative information
whereas positive information was unaffected. Surprisingly, they report that formerly
depressed subjects also display impaired inhibition of negative content.

[Joormann 07] similarly found attentional bias was evident even after individuals
had recovered from a depressive episode. In that study, the attentional biases in the
processing of emotional faces in currently and formerly depressed participants and
healthy controls were examined. Faces expressing happy or sad emotions paired with
neutral faces were presented (in a dot-probe task). Currently and formerly depressed
participants selectively attended to the sad faces, the control participants selectively
avoided the sad faces and oriented toward the happy faces. They also report a positive
bias that was not observed in either of the depressed groups.

In an evaluation of the evidence from studies which have used modified Stroop and
visual probe tests, [Mogg 05] found that the “inhibition theory” only holds true if
the material is relevant to the subject's “negative self-concept” and the stimulus is presented
for longer durations. [Joormann 06] found that depressed participants required signif-
icantly greater intensity of emotion to correctly identify happy expressions, and less
intensity to identify sad than angry expressions.

Medical imaging studies also find impairment of neural processing of depressed


subjects. [Fu 08] conducted a comparison of 16 participants who had suffered from
acute unipolar major depression and 16 healthy volunteers. The patients received 16
weeks of CBT. Functional Magnetic Resonance Imaging (fMRI) scans were undertaken
at weeks 0 and 16 while the participants viewed facial stimuli displaying varying degrees
of sadness. Although there are some limitations to the study due to sample size,
there seems to be evidence that excessive amygdala activity correlated to the process-
ing of sad faces during episodes of acute depression.
This impairment may extend to the offspring of parents with MDD. [Monk 08]
found (small volume corrected) greater amygdala and nucleus accumbens activation
to fearful faces, and lower nucleus accumbens activation to happy faces, in high-risk
subjects when attention was unconstrained.

2.6.3 Facial Feedback in Depression

In a study of 116 participants (30 men, 86 women), some with a history of MDD
and others with no psychopathological history, the smile response to a comedy
clip was recorded. Participants were asked to rate a short film clip. FACS coding was
applied to 11 seconds of the clips - long enough to allow for the 4-6 second spontaneous
smile [Frank 93]. Those with a history of MDD and current depression symptoms were
more likely to control smiles than were the asymptomatic group [Reed 07].
There is always something ridiculous about the emotions of people whom one has ceased to love.

Oscar Wilde

3 Emotional Expression Recognition by Machines

3.1 Introduction

This chapter provides a summary of the more recent approaches and developments in
the field of emotional expression recognition. The scope is limited to FER, although
it would be incomplete without some reference to recognising vocal expression. The
objective of the chapter is to provide a theoretical grounding for later, more practical
chapters.


This chapter is organised as follows:

Section 3.2 introduces the broad concepts found in affective sensing systems. This
is quite a large section and encompasses the elicitation of training data, an overview
of the typical processing and a comparison of approaches to feature extraction. Sec-
tion 3.3 describes the computer vision techniques that are of particular interest in this
dissertation.

3.2 Affective Sensing Systems

Affect is emotional feeling, tone, and mood attached to a thought, including its external
manifestations. Affective communication is the, often complex, multimodal interplay
of affect between communication parties (which potentially includes non-humans).
Much of our daily dose of affective communication constitutes small talk, or phatic
speech. Recognising emotions from the modulations in another person’s voice and fa-
cial expressions is perhaps one of our most important human abilities. Yet, it is one
of the greatest challenges in order for machines to become more human-like. Affec-
tive sensing is an attempt to map manifestations or measurable physical responses to
affective states.

Historically, attempts at Affective Sensing or emotional expression recognition systems
were based on the audio modality and had their origins in Automatic Speech
Recognition (ASR). The usual tack is to first acquire samples of vocal or facial expres-
sions, use them to train some form of system and then to test the system’s recognition
against some newly introduced samples. In practice, for development purposes, the
training and the testing samples are often taken from the same set and recognition per-
formance is measured using a K-fold or leave-one-out cross-validation method over
the sample collection [Dellaert 96, Kohavi 95, Yacoub 03].
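
For concreteness, a minimal sketch of such an evaluation protocol is given below using
scikit-learn; the feature matrix, labels and classifier choice are placeholders and are not
tied to any particular study cited above.

    import numpy as np
    from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
    from sklearn.svm import SVC

    # Placeholder data: one feature vector per expression sample, with class labels
    X = np.random.rand(100, 20)
    y = np.random.randint(0, 4, size=100)

    clf = SVC(kernel="rbf")
    kfold = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
    loo = cross_val_score(clf, X, y, cv=LeaveOneOut())
    print("10-fold mean accuracy:", kfold.mean())
    print("leave-one-out mean accuracy:", loo.mean())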

Whether it be a vocal or a facial expression recognition system, the conceptual system
usually resembles Figure 3.1 (a vocal analogue of face detection is silence detection).
Whilst systems may use different types of classifiers, e.g. k-Nearest Neighbour (k-NN)
[Cover 67], SVM [Vapnik 95], AdaBoost [Freund 99], the main differences surround
the feature extraction process, i.e. the number and type of features used; and, whether
they incorporate rule-based logic such as that presented by Pantic and Rothkrantz
[Pantic 00].

One major difference between vocal and facial expression is that speech signals
are inherently one-dimensional, whereas facial signals can be 2D or 3D - although the
use of 3D processing has only become popular in the last few years [Lucey 06]. In
the case of facial expression recognition, an additional differentiating factor is whether
holistic (spanning the whole or a large part of the face) features or, what Pantic terms,
“analytic” (sub-regions) of the face are used [Pantic 07].

Figure 3.1: Conceptual overview of processing of facial activity measurements



3.2.1 Eliciting Training Data

[ten Bosch 00] and [Schröder 05] have explored what is possible in extending the ASR
framework to emotion recognition. The common ASR approach is to train probabilistic
models from extracted speech features and then to use pattern matching to perform
recognition.

Most studies into affective communication begin with the collection of audio and/or
video communication samples. The topic of collection of emotional speech has been
well covered by other reviews [Cowie 03, Cowie 05a, Scherer 03], so it is only briefly
summarised here.

Naturally Occurring Speech

To date, call centre recordings, recordings of pilot conversations, and news readings
have provided sensible sources of data to research emotions in speech. Samples of
this nature have the highest ecological validity. However, aside from the copyright and
privacy issues, it is very difficult to construct a database of emotional speech from this
kind of naturally occurring emotional data. In audio samples, there are the complica-
tions of background noise and overlapping utterances. In video, there are difficulties
in detecting moving faces and facial expressions. A further complication is the sup-
pression of emotional behaviour by the speaker who is aware of being recorded.

Induced Emotional Speech

One technique, introduced by Velten [Velten 68], is to have subjects read emotive texts
and passages which, in turn, induce emotional states in the speaker. Other techniques
include the use of Wizard of Oz setups where, for example, a dialog between a human
and a computer is controlled without the knowledge of the human. This method has
the benefit of providing a degree of control over the dialogue and can simulate a natural

setting [Hajdinjak 03].

The principal shortcoming of these methods is that the response to stimuli may
induce different emotional states in different people.

Acted Emotional Speech

A popular method is to engage actors to portray emotions. This technique provides for
a lot of experimental control over a range of emotions and, like the previous method,
provides for a degree of control over the ambient conditions.

One problem with this approach is that acted speech elicits how emotions should
be portrayed, not necessarily how they are portrayed. The other serious drawback is
that acted emotions are unlikely to arise in the way that [Scherer 04]
describe genuine emotions, i.e. as episodes of massive, synchronised recruitment of mental and so-
matic resources to adapt or cope with a stimulus event subjectively appraised as being
highly pertinent to the needs, goals and values of the individual.

3.2.2 Approaches to Affective Sensing

The affective sensing of facial expression is more commonly known as automatic FER.
FER is somewhat similar in approach to face recognition but the objectives of the two
are quite different. In the former, representation of expressions is sought in sets of
images, possibly from different people, whereas in the latter, discriminating features
are sought, which will distinguish one face from a set of faces. FER works better if there
is large variation among the different expressions generated by a given face, but small
variation in how a given expression is generated amongst different faces [Daugman ].
Nevertheless, both endeavours have common techniques.

Broadly speaking, there are two approaches to automatic facial expression recogni-
tion. The first is a holistic one, in which a set of raw features extracted from an image
are matched to an emotional facial expression such as happy, sad, anger, pain, or pos-
sibly to some gestural emblem such as a wink, stare or eye-roll [Ashraf 09, Liu 06,
Martin 08, Okada 09, Saatci 06, Sung 08]. The second approach, an analytic one, is
more fine-grained and the face is divided into regions. The most popular scheme for
annotating regions is through the use of the FACS [Ekman 75, Ekman 82, Ekman 97,
Ekman 99, Ekman 03], in which surface or musculature movements are tracked using
an analytical framework.

Computer vision techniques [Cootes 95, Lucey 06, Nixon 01, Sebe 03, Sebe 05]
are used to detect features and build evidence of FACS AUs [Bartlett 99, Bartlett 02,
Bartlett 03, Bartlett 06, Lien 98, Lucey 06, Tian 01, Valstar 06b]. The theoretical ben-
efit of this approach relates to its purported reusability. For example, if a system can
accurately detect 20 major FACS AUs then, in theory, any facial expression that has a
repertoire using a combination of the 20 AUs can be detected.

However, in practice, this is not as straightforward as it might seem. For instance,


the AU combinations for surprise and fear are very similar and the AUs are shown in
Tables 3.1 and 3.2:

    Emotion    Prototype     Major Variants
    Surprise   1+2+5B+26     1+2+5B
               1+2+5B+27     1+2+26
                             1+2+27
                             5B+26
                             5B+27

Table 3.1: Action units for surprise expressions [Ekman 76, Ekman 02]

    Emotion   Prototype            Major Variants
    Fear      1+2+4+5*+20*+25      1+2+4+5*+L or R20*+25, 26, or 27
              1+2+4+5*+25          1+2+4+5*
                                   1+2+5Z, with or without 25, 26, 27
                                   5*+20*, with or without 25, 26, 27

    * means in this combination the AU may be at any level of intensity.

Table 3.2: Action units for fear expressions [Ekman 76, Ekman 02]

So which approach is correct? Regardless of the approach, testing and comparing the
claims of FER research reports is difficult. It is not like Face Recognition (FR), which
tends to be a binary classification problem, where a Receiver Operating Characteristic
(ROC) curve makes visual comparison of reports quite simple (even though it might not be the best
instrument). FR vendors can even evaluate their software against the US government
“Face Recognition Vendor Test” [FRV ]. FER, on the other hand, has a lot more sub-
jectivity and variability with regards to the expressions being recognised. Even human
judges may not agree or be able to correctly identify a facial expression.
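
To make the overlap between the prototypes in Tables 3.1 and 3.2 concrete, the following
sketch checks a set of detected AUs against intensity-stripped prototypes; the detector
output format and the simplified subset test are assumptions for illustration only.

    # Intensity codes (B, Z, *) are dropped for this toy comparison
    SURPRISE_PROTOTYPES = [{1, 2, 5, 26}, {1, 2, 5, 27}]
    FEAR_PROTOTYPES = [{1, 2, 4, 5, 20, 25}]

    def matching_emotions(detected_aus):
        # return every emotion whose prototype AUs are all present in the detection
        detected = set(detected_aus)
        labels = []
        if any(p <= detected for p in SURPRISE_PROTOTYPES):
            labels.append("surprise")
        if any(p <= detected for p in FEAR_PROTOTYPES):
            labels.append("fear")
        return labels

    # AUs 1+2+4+5+20+25+26 satisfy both a fear and a surprise prototype,
    # illustrating why the two expressions are easily confused.
    print(matching_emotions([1, 2, 4, 5, 20, 25, 26]))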

One of the problems with the holistic approach in comparing and validating results
is that each published report tends to use a different set or subset of classes or expres-
sions. Most experiments use a closed set of expressions and it is unknown how well
the reported systems would work, if at all, when an extraneous expression, or indeed
non-expression, is introduced. For example, if the training and testing database con-
sists of 2,000 images split evenly with happy, sad, angry and neutral expressions, the
article might report a 93% accuracy rate. However, if the database is then interspersed
with 500 other expressions (perhaps surprise, fear or yawn), it is likely that the system
in question will try to find the best match to one of the classes.

That is not to say that the analytic approach is an automatic choice either. Ac-
credited FACS coders do not always achieve consensus on AU displays. Not all AUs
are easily detectable in an image, and the ability of the systems is impacted by the
quality of the recordings. Holistic classification might have an advantage where fea-
tures are occluded; where, for instance, the subject has a beard, wears spectacles or is

in a non-frontal pose. However, if reliable detection of the muscle movements could be
guaranteed, one would think that the analytic approach would be a better option to
advance research.

There is a lacuna in objective reporting instruments. To address the comparison and
validation problems and to advance the field of research, a standard reporting system
needs to be introduced. Exercises such as the Emotion Challenge held at the
INTERSPEECH conference in 2009 go some way to addressing this [Schuller 09b].

3.2.3 Description of Process

Whilst there are some variations among recognition systems, e.g. some systems operate on
profile poses [Pantic 04a, Pantic 04b] and the system in [Pantic 04a] incorporates
case-based reasoning, most follow the fairly generic model depicted in Figure 3.1.

If the system processes video then the first stage is to capture some or all of the
images from the video at predefined intervals. Next, faces are usually detected and
then segmented from the images and this is certainly the case with the AAM approach
discussed later.

The most common method of face detection is that formulated by Viola and Jones
[Viola 01], which is implemented in the popular “openCV” software [OpenCV ] from
Intel. [Bartlett 05] report an improved face detection implementation using the Gen-
tleBoost algorithm.
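
A minimal OpenCV sketch of this face detection and segmentation step is shown below;
the cascade file location and the input image name are assumptions that depend on the
local installation.

    import cv2

    # Pre-trained Haar cascade shipped with OpenCV (path depends on the installation)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    image = cv2.imread("subject.png")                  # hypothetical input image
    grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Viola-Jones style multi-scale detection; returns (x, y, w, h) boxes
    faces = cascade.detectMultiScale(grey, scaleFactor=1.1, minNeighbors=5)
    face_patches = [grey[y:y + h, x:x + w] for (x, y, w, h) in faces]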

The next stage, facial feature extraction, is where most variation between systems
arises. Optical Flow, Particle Filters and Gabor Filters, and more recently, Active Ap-
pearance Models are some of the choices, with the latter two the most widely re-
ported in recent years. Since Gabor filters and Active Appearance Models are used
in this research work, they are discussed in more detail in Section 3.3 onwards. Some-
times the techniques are combined as in [Gao 09] and in several instances, especially

in the use of Gabor wavelets, dimension reduction is attempted through some pre-
processing such as Boosting, Principal Component Analysis (PCA) or SVM classifica-
tion [Chen 07, Shen 07]. It is at this stage that most research activity is taking place,
as it is critical to the success of facial expression recognition.

Finally, once the features have been extracted, the next stage is to adopt some
means of classification such as k-NN, Artificial Neural Network (ANN), SVM, or Ad-
aBoost. If the temporal patterns are to be classified, or the facial patterns combined
with other signals, e.g. vocal speech, then the solution may involve ensembles of clas-
sifiers or Hidden Markov Models (HMMs).

3.3 Computer Vision Techniques

Whilst there have been attempts to provide a top-down categorisation of automatic


facial feature extraction methods, the lines between the categories are quite blurred.
Pantic and Rothkrantz [Pantic 00] attempt to summarise the approaches, on one hand,
into “analysis from static facial images” and “analysis from facial image sequences”.
On the other hand, they also split the methods into “Holistic approach”, “Analytic ap-
proach” and “Hybrid approach”. They introduce further terms “Template-based” meth-
ods, which align with the “Holistic approach”, and “Feature-based” methods, which
align with the “analytic approach”.

Fasel and Luettin [Fasel 03] provide an appealing dichotomy of facial feature ex-
traction methods into Deformation Extraction, either image or model-based, and Mo-
tion Extraction. From a different viewpoint, they follow on from Pantic and Rothkrantz
[Pantic 00] and dichotomise into “holistic methods”, where the face is processed in its
entirety, or “local methods” (similar to Pantic and Rothkrantz’s “Analytic approach”),
which analyse only areas of interest in the face, especially where transient facial fea-
tures involved in expressions exist (as opposed to intransient features such as furrows,

wrinkles and laughter lines). Of course, the selection of local features tends to be very
application dependent.

3.3.1 Gabor Filters

The Discrete Fourier Transform (DFT), commonly used in speech processing, can be
applied to rows and columns of pixels in an image, indexed by co-ordinates x and y.
The 2D DFT of an N x N pixel image can be given by

FP_{u,v} = \frac{1}{N} \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} P_{x,y} \, e^{-j\left(\frac{2\pi}{N}\right)(ux+vy)} \qquad (3.3.1)

where u and v are the horizontal and vertical dimensions of spatial frequency respec-
tively.
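
In practice, Equation 3.3.1 is rarely computed directly; a fast Fourier transform is used
instead. The short numpy check below is only an illustration of the correspondence (the
1/N scaling follows the convention of Equation 3.3.1).

    import numpy as np

    N = 8
    P = np.random.rand(N, N)                   # toy N x N greyscale image

    # np.fft.fft2 computes the unscaled double sum of Equation 3.3.1
    FP = np.fft.fft2(P) / N

    u, v = 2, 3
    manual = sum(P[x, y] * np.exp(-2j * np.pi * (u * x + v * y) / N)
                 for x in range(N) for y in range(N)) / N
    assert np.isclose(FP[u, v], manual)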

However, Fourier Transform (FT) and DFT are not well suited to non-stationary
signals, i.e. signals with time-varying spectra such as spikes. Both perform decima-
tion in frequency across the entire image in the forward transform and decimation in
time in the inverse transform. This can be overcome to some extent through the Short
Time Fourier Transform (STFT), which uses a window function to divide the signal
into segments.1 However, the inherent problem with STFT is in choosing the width of
the window. Too narrow a window will give good time resolution but poor frequency
resolution, and too wide a window will give good frequency resolution but poor time
resolution. If the spectral components in the input signal are already well separated
from each other, then a narrow window will possibly provide a satisfactory frequency
resolution. In the case of the frequency components being tightly packed, a wide window
will be needed, resulting in good frequency resolution but poor time resolution.
1 window is the term used when referring to the continuous-time STFT, whereas frame is used in discrete-time STFT. For explanatory purposes, only continuous-time STFT is referred to; however, the same principles apply.

Wavelet transformations offer a solution to this resolution problem by delivering


decimation in frequency and space simultaneously. Although they are performed in a
similar fashion to STFT, i.e. the function (wavelet) is applied to different segments of
the time domain signal, the signal is analysed at different resolutions, i.e. the width of
the window is computed for every frequency.

Invented by Dennis Gabor in 1946, Gabor wavelets have found many applications
including speech analysis, handwriting, fingerprint, face and facial expression recog-
nition. One reason for their popularity in computer vision is that, in their 2-D form and
primed with the appropriate values, their filter response resembles the neural responses
of the mammalian primary visual cortex [Daugman 85, Bhuiyan 07, Lee 96, Gao 09].
Just as the ability of bandpass filter banks to approximate cochlear processing and of
ANNs to model human neural processing appealed to researchers, this biological
resemblance has also attracted much attention.

A spatial 2D complex Gabor function is given by the formula

g(x, y) = s(x, y)wr (x, y) (3.3.2)

where s(x, y) is a complex sinusoidal referred to as the carrier, and wr (x, y) is a


Gaussian-shaped function, known as the envelope, modulated by the sinusoidal.

The carrier is defined as

s(x, y) = exp(j(2π(u0 x + v0 y) + P )) (3.3.3)

where (u0 , v0 ) and P define the spatial frequency and the phase of the sinusoidal,
respectively [Movellan 08].

The complex sinusoidal can be split into its real and imaginary parts as

Re (s(x, y)) = cos(2π(u0 x + v0 y) + P ) (3.3.4)



and
Im (s(x, y)) = sin(2π(u0 x + v0 y) + P ) (3.3.5)

The real and imaginary parts of a Gabor wavelet with an orientation of π/4 and a scale
of 4 are shown in Figure 3.2.

(a) Real part of wavelet (b) Imaginary part of wavelet

Figure 3.2: Real and imaginary part of a Gabor wavelet

This spatial frequency can be expressed in polar coordinates as magnitude F0 and


direction ω0
s(x, y) = exp(j(2πF0 (x cos ω0 + y sin ω0 ) + P )) (3.3.6)

The envelope is defined as

w(x, y) = K \exp\left(-\pi\left(a^2 (x - x_0)_r^2 + b^2 (y - y_0)_r^2\right)\right) \qquad (3.3.7)

where K scales the magnitude of the Gaussian envelope, (x0 , y0 ) is the peak of the
function, a and b are scaling parameters, and the r subscript is a rotation operation
such that
(x − x0 )r = (x − x0 ) cos θ + (y − y0 ) sin θ (3.3.8)

and
(y − y0 )r = −(x − x0 ) sin θ + (y − y0 ) cos θ (3.3.9)

The response of a Gabor filter to an image is obtained by a 2D convolution operation.
Let I(x, y) denote the image and G(x, y, θ, φ) denote the response of a Gabor filter
with orientation θ and frequency φ to an image at point (x, y) on the image plane.
G(.) is obtained as

G(x, y, \theta, \phi) = \iint I(p, q) \, g(x - p, y - q, \theta, \phi) \, dp \, dq \qquad (3.3.10)

Figure 3.3 shows the original image on the left and the magnitude response image
on the right.

(a) Original image (b) Magnitude after transform

Figure 3.3: Original image with Gabor magnitude
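
The following sketch assembles a complex Gabor kernel from the carrier (Equation 3.3.6)
and the Gaussian envelope (Equation 3.3.7) and convolves it with an image, much as in
Figure 3.3; the kernel size and parameter values are illustrative assumptions only.

    import numpy as np
    from scipy.signal import fftconvolve

    def gabor_kernel(size=31, F0=0.1, omega0=np.pi / 4, a=0.05, b=0.05, P=0.0, K=1.0):
        # Kernel centred at (x0, y0) = (0, 0) with no envelope rotation (theta = 0)
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
        carrier = np.exp(1j * (2 * np.pi * F0 * (x * np.cos(omega0) + y * np.sin(omega0)) + P))
        envelope = K * np.exp(-np.pi * (a ** 2 * x ** 2 + b ** 2 * y ** 2))
        return carrier * envelope

    image = np.random.rand(128, 128)                    # stand-in for a face image
    response = fftconvolve(image, gabor_kernel(), mode="same")
    magnitude = np.abs(response)                        # the quantity visualised in Figure 3.3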

3.3.2 Active Appearance Models (AAM)

3.3.3 Introduction

In recent years, a powerful deformable model technique, known as the AAM [Edwards 98],
has become very popular for real-time face and facial expression recognition. The liter-
ature is not very clear about what exactly AAMs are and how they should be differentiated
from other approaches that represent the appearance of an object by shape and texture
subspaces, hence some explanation follows.
According to [Saragih 08], AAMs are examples of Linear Deformable Models (LDMs), a class
which also includes Active Shape Models (ASM) [Cootes 92], AAMs [Cootes 98] and

3D Morphable Models (3DMM) [Blanz 99]. According to [Matthews 04], AAMs, to-
gether with the closely related concepts of Morphable Models and Active Blobs, are
“generative models of a certain visual phenomenon” and, “are just one instance in a
large class of closely related linear shape and appearance models and their associated
fitting algorithms”. In [Gross 05], they are defined as, “generative parametric models
commonly used to model faces”.

In the AAM approach, the non-rigid shape and visual texture (intensity and colour)
of an object (a face in an image perhaps) are statistically modelled using a low dimen-
sional representation obtained by applying PCA to a set of labelled training data. After
the models have been created, they can be parameterised to fit a new object (of similar
properties), which might vary in shape or texture or both.

Usually, the AAMs are pre-trained on static images using a method such as the
Simultaneous Inverse Compositional (SIC) algorithm [Baker 01, Baker 03b] (discussed
in Subsection 3.3.5) and then, when ready to be applied, the model will be fitted to one
or more images that were not present in the training set.

For the better understanding of the following sections, some terms are introduced
and explained now.

Shape All the geometrical information that remains when location, scale and rota-
tional effects are filtered out from an object - invariant to Euclidean similarity
transformations [Stegmann 02].

Landmark point Point of correspondence on each object that matches between and
within populations.

Shape space Set of all possible shapes of the object in question.

Texture The pattern of intensity or colour across the region of the object or image
patch [Cootes 01].

Image Registration This is the process of finding the optimal transformation between
a set of images in order to get them into one coordinate system.

Image Segmentation Segmentation is used to separate an object in an image from the


background. In practice, it is the process of separating objects, as represented by
sets of pixels, in an image.

Image Warping Image warping is a type of geometric manipulation of an image or


region of an image. It is the non-uniform mapping of a set of points in one shape
to a set of points in another shape.

Fitting An efficient scheme for adjusting the model parameters so that a synthetic
example is generated, which matches the image as closely as possible.

Model A model consists of the mean positions of the points and a number of vectors
describing the modes of variation [Cootes 01].

3.3.4 Building an AAM

AAMs usually refer to two things:

1. A statistical model of shape and appearance, trained from a set of images, each of
which has a set of corresponding landmark points (usually manually annotated);
and once built,

2. A method or algorithm for fitting the model to new and previously unseen im-
ages, i.e. images that were not in the set of images that were used to train the
model;

although sometimes they are used simply to mean the statistical model. It should be
noted that there are actually two types of AAM:

1. independent shape and appearance models, where the shape and appearance are
modelled separately; and

2. combined shape and appearance models, which use a single set of parameters to
describe shape and appearance [Matthews 04]

The reason for the different build strategies is to do with variations in the algorithms
that are used to “fit” the model to an image (discussed in Section 3.3.5).

The steps to building an AAM are covered in the following sections.

Annotate the Landmark Points

Although there have been some attempts at automatically annotating images [Asthana 09,
Tong 09], there is no completely automated method for facial feature extraction and at
least one image has to be manually marked up, as shown in Figure 3.4.

(a) Original image (b) Image with landmark points

Figure 3.4: Original image with transform

For each image, a corresponding set of points (each point denoting the x, y co-
ordinates of a single point) exists, as shown in Table 3.3.

n points:69
{
249.809 274.693
249.785 297.994
259.769 361.231
...
328.853 393.34
305.365 280.317
431.243 281.514
}

Table 3.3: Sample point file with x, y co-ordinates of landmark points

Align the Set of Points

A generalised Procrustes analysis is shown in Algorithm 1. The mean shape, if using
the Procrustes mean, is given by:

\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad (3.3.11)

Algorithm 1: Aligning the training set

foreach image in the training set do


Translate so that its centre of gravity is at the origin
Choose one example as an initial estimate of the mean shape and scale it so
that |x| = 1
Record the first estimate as x0 to define the default reference frame
repeat
Align all the shapes with the current estimate of the mean shape
Re-estimate mean from aligned shapes
Apply constraints on the current estimate of the mean by aligning it with x0
and scaling so that |x| = 1
until convergence, i.e. until the change in the estimate of the mean ≤ ε, where ε is
some suitable threshold
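
A rough Python sketch of Algorithm 1 is given below. It assumes each shape is an
(n_points, 2) array and uses the standard orthogonal Procrustes solution for the
per-shape alignment; reflection handling and tangent-space projection are deliberately
omitted.

    import numpy as np

    def align_to(shape, ref):
        # Orthogonal Procrustes: rotation (and uniform scale) aligning one centred shape to another
        U, _, Vt = np.linalg.svd(shape.T @ ref)
        rotated = shape @ (U @ Vt)
        scale = np.sum(ref * rotated) / np.sum(rotated ** 2)
        return scale * rotated

    def generalised_procrustes(shapes, tol=1e-7, max_iter=100):
        shapes = [s - s.mean(axis=0) for s in shapes]           # centre each shape at the origin
        mean = shapes[0] / np.linalg.norm(shapes[0])            # initial estimate, |x| = 1
        x0 = mean.copy()                                        # default reference frame
        for _ in range(max_iter):
            shapes = [align_to(s, mean) for s in shapes]        # align all shapes to the mean
            new_mean = align_to(np.mean(shapes, axis=0), x0)    # re-estimate and constrain to x0
            new_mean /= np.linalg.norm(new_mean)                # rescale so that |x| = 1
            if np.linalg.norm(new_mean - mean) < tol:
                break
            mean = new_mean
        return mean, shapes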

However, one of the undesirable side effects of the Procrustes analysis, due to the
scaling and normalisation process, is that aligned shapes, or their shape vectors, will
now lie on the curved surface of a hypersphere, which can introduce non-linearities. A
popular approach to counter this problem is to transform or modify the shape vectors
into a tangent space to form a hyper plane [Cootes 01]. That way, linearity is assumed
and, the next step, “modelling the shape variation” is simplified and calculation per-
formance improved.

Model the Shape Variation

At this point, a set of points exist that are aligned to a common co-ordinate frame.
What remains to be done is to somehow model the point distributions, so that new and
plausible shapes can be generated. It is best to start with a reduction in dimensionality
and this is most commonly performed using PCA, which will derive a set of t eigen-
vectors, Φ, corresponding to the largest eigenvalues that best explain the data (spread
of the landmark points). The normal procedure is to first derive the mean shape as in
Equation 3.3.11.

The next step is then to compute the covariance matrix

\Sigma_x = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T \qquad (3.3.12)

Since computer vision applications are typically resource intensive, alternative


ways to derive the eigenvectors, Φ, and corresponding eigenvalues, λ, of the covari-
ance matrix have been devised and the choice depends on whether there are more
samples, s, than dimensions, n, in the feature vectors. If s > n then Singular Value
Decomposition (SVD) can be used.

Any set of points, x, can then be approximated by

x ≈ x̄ + Φb    (3.3.13)

where Φ = (φ1 , φ2 , ..., φt ) and b is a t dimensional vector given by

b = Φ^T (x − x̄)    (3.3.14)

The total variation in the training samples can be explained by the sum of all eigen-

values. Limiting the number of eigenvectors, t, and the range of each parameter b_i,
e.g. to ±3√λ_i, will determine how close generated shapes will be to the training set.
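
The steps above can be summarised in a short sketch: the covariance of Equation 3.3.12
is decomposed, the leading eigenvectors are retained, and new shapes are generated via
Equation 3.3.13 with each parameter limited to ±3√λ_i. The retained-variance threshold
is an arbitrary choice for illustration.

    import numpy as np

    def build_shape_model(X, var_kept=0.98):
        # X: (n_samples, 2 * n_points) matrix of aligned shape vectors
        mean = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)                        # Equation 3.3.12
        eigvals, eigvecs = np.linalg.eigh(cov)               # symmetric matrix, so eigh
        order = np.argsort(eigvals)[::-1]                    # largest eigenvalues first
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        t = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_kept)) + 1
        return mean, eigvecs[:, :t], eigvals[:t]

    def generate_shape(mean, Phi, eigvals, b):
        # x = mean + Phi b (Equation 3.3.13), with b clamped to +/- 3 sqrt(lambda_i)
        limit = 3.0 * np.sqrt(eigvals)
        return mean + Phi @ np.clip(b, -limit, limit)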
All that remains is to find a way to model the spread of the variation or distribution
around the points. A Gaussian is a reasonable starting point but for facial feature
processing, where there are likely to be non-linear shape variations due, for example,
to pitch, yaw and head roll, a better solution is required. A Gaussian mixture is a
reasonable approach.

Build a Combined Appearance Model

Having already obtained a mean shape, each image texture is now warped to the mean
shape, obtaining what is described as a “shape-free” patch or “canonical frame”. This
is done by ensuring that each image’s control points match the mean shape. A trian-
gulation algorithm that creates a mesh, such as that shown in Figure 3.5 is used. Next,
the image texture over the warped image is sampled to obtain a texture vector, g_im.
The texture vectors are normalised by applying a linear transformation (not shown
here but discussed in [Cootes 01]). Applying PCA to the normalised training samples
produces a linear model of texture

g = ḡ + P_g b_g    (3.3.15)

Figure 3.5: Face mesh used to build an Active Appearance Model

where ḡ is the mean normalised grey-level vector, P_g is a set of orthogonal modes of
variation and b_g is a set of grey-level parameters.

Shape and texture can now be expressed in terms of the shape parameters b_s
(Equation 3.3.14) and the grey-level parameters b_g. However, a further PCA is applied
to account for correlations between shape and grey-level texture and finally appearance
model shape and texture can be controlled by a vector of parameters c

x = x̄ + Q_s c    (3.3.16)

and
g = ḡ + Q_g c    (3.3.17)

where c is a vector of appearance parameters controlling both the shape and the grey-
level texture, and Qs and Qg are matrices describing modes of variation.

Reconstruction of an image is almost the reverse process of building the appearance


model. The texture is generated within the mean-shaped patch and then warped to
suit some image points that have been generated by applying a transformation (scale,
rotation and translation) from the image frame.

3.3.5 Model Fitting Schemes

Two fitting schemes are described, since they are used in this research work. Until now,
the building of a statistical model of shape and texture has been discussed. Although
there has been some notion of generating shapes and texture, the actual process of
fitting or adjusting model parameters to build a synthetic model, in an attempt to match
it to a new and previously unseen image, i.e. one not in the model training set, has
not been covered. Although it is a generalisation, state-of-the-art recognition systems
are heavily dependent on improvements in the area of model fitting, as it is crucial to
achieving the performance required for a real-time FER system.

There are many variations and performance improvements (some application spe-
cific) that can be made to AAMs. Searching schemes can employ shape only, appear-
ance only, or combined shape and appearance.

Simultaneous Project-out Inverse Compositional Method

[Baker 01, Baker 02, Baker 03a, Baker 03b, Baker 04a, Baker 04b] treat the AAM search
process as an image alignment problem. The original image alignment algorithm was
formulated by Lucas and Kanade in 1981 [Lucas 81]. The goal is to minimise the dif-
ference between an image and a template image T (x) by minimising

\sum_{x} \left[ I(W(x; p)) - T(x) \right]^2 \qquad (3.3.18)

where I is the image, W is the warp, x are the pixels in the template and p is a
vector of parameters.

The fine details go beyond the scope of this thesis. However, a concise explanation

of [Baker 01] is that the formulation to solve Expression 3.3.18 requires minimising:

\sum_{x} \left[ I(W(x; p + \Delta p)) - T(x) \right]^2 \qquad (3.3.19)

with respect to ∆p. Performing a first order Taylor expansion on Expression 3.3.19
gives:

\sum_{x} \left[ I(W(x; p)) + \nabla I \frac{\partial W}{\partial p} \Delta p - T(x) \right]^2 \qquad (3.3.20)

This can be minimised using Algorithm 2

Algorithm 2: Algorithm to minimise Equation 3.3.19

repeat
Warp I with W (x; p) to compute I(W (x; p))
Compute the error image T (x) − I(W (x; p))
Warp the gradient ∇I of image I with W(x; p) to compute ∇I(W(x; p))
Evaluate the Jacobian ∂W/∂p

Compute the Hessian matrix (used to solve Expression 3.3.20)


Compute ∆p
Update the parameters p ← p + ∆p
until convergence
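
To make Algorithm 2 concrete, the sketch below performs a single Gauss-Newton update
for the deliberately simple special case of a pure translation warp W(x; p) = x + p; it
is a toy illustration of the forwards-additive Lucas-Kanade step, not the inverse
compositional or AAM fitting algorithms.

    import numpy as np
    from scipy.ndimage import shift

    def translation_update(I, T, p):
        # p is a (row, column) offset; W(x; p) = x + p
        warped = shift(I, shift=-np.asarray(p), order=1)     # I(W(x; p)) = I(x + p)
        error = T - warped                                    # error image T(x) - I(W(x; p))
        grads = np.gradient(warped)                           # image gradient [d/d(row), d/d(col)]
        # For a translation the Jacobian dW/dp is the identity, so the
        # steepest-descent images are simply the image gradients.
        J = np.stack([g.ravel() for g in grads], axis=1)
        H = J.T @ J                                           # Gauss-Newton (approximate) Hessian
        dp = np.linalg.solve(H, J.T @ error.ravel())
        return np.asarray(p) + dp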

As can be seen, Algorithm 2 necessitates the re-evaluation of a Hessian and a


Jacobian at every iteration until convergence, which is a very expensive operation.
[Baker 01, Baker 02] propose an analytically-derived gradient descent algorithm called
the inverse compositional algorithm, in which the roles of the image and the template
are reversed and which allows both the Hessian and the Jacobian to be pre-computed
and then held constant when iterating the remainder of the algorithm.

[Matthews 04] provide a further efficiency improvement by “projecting out” appearance
variation. Using this technique implies a major difference in the way that the
AAMs are initially built - the shape and the appearance parameters need to be modelled

separately - the independent AAMs previously mentioned in Subsection 3.3.4.

The SIC algorithm [Baker 03a] is another adaptation of inverse compositional


image alignment for AAM fitting that addresses the problem of significant shape and
texture variability by finding the optimal shape and texture parameters simultaneously.
Rather than re-evaluating the linear update model at every iteration using the current
estimate of appearance parameters, it can be approximated by evaluating it at the mean
appearance parameters, allowing the update model to be pre-computed, which is
significantly more computationally efficient.

Iterative Error Bound Minimisation (IEBM)

This is a form of what [Saragih 08] describes as “Iterative Discriminative Fitting”. At-
tempts at improving the “fitting” efficiency of AAMs, such as those discussed in
Subsection 3.3.5, involve various techniques to streamline the original Lucas-Kanade al-
gorithm [Lucas 81], and essentially aim to minimise a least squares error function over
the texture. [Matthews 04] provide experimental evidence to show that the Project-
Out Inverse Compositional (POIC) fitting method provides very fast fitting. However,
when there is a lot of shape and appearance variation, it comes at the expense of poor
generalisability, i.e. in the case of facial expression recognition the model becomes
very person-specific. Changing the algorithm to improve generalisability then impacts
performance.

To overcome these problems, [Saragih 06, Saragih 08] pre-learn the update model
by minimising the error bounds over the data, rather than minimising least squares
distances. Conceptually, this is akin to boosting [Freund 99] where a bunch of weak
classifiers are iteratively passed over a training set of data and, for each iteration, a
distribution of weights is updated so that the weights of each incorrectly classified

example are increased, thereby resulting in a strong classifier. However, to continue


with the analogy, in the case of IEBM, when a new weak classifier is introduced after
each iteration, as the data used in calculating the weak learners changes, it is subject to
what the authors describe as “resampling” [Saragih 06]. It is this resampling process
that promotes generalisability.

3.4 Classification Techniques

Machine learning classification finds its way into many aspects of research and there
are many options available to take the features extracted from, say, a bank of Ga-
bor filters and classify them. The most frequently used include ANN, k-NN, SVM
[Burges 98, Chen 05, Vapnik 95] and AdaBoost [Freund 99] (although a variant for
a multi-class problem is MultiBoost [Casagrande 06]). Only SVM and AdaBoost are
discussed in this dissertation.
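
As an indicative sketch of how two such classifiers could be compared on the same
extracted features (using scikit-learn; the random data stand in for real feature
vectors):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score

    # Stand-ins for feature vectors (e.g. reduced Gabor responses or AAM parameters)
    X = np.random.rand(200, 30)
    y = np.random.randint(0, 4, size=200)

    for name, clf in [("SVM (RBF)", SVC(kernel="rbf", C=1.0)),
                      ("AdaBoost", AdaBoostClassifier(n_estimators=100))]:
        scores = cross_val_score(clf, X, y, cv=5)
        print(name, "mean accuracy:", scores.mean())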
There are no facts, only interpretations.

Friedrich Nietzsche

4 Expression Analysis in Practice

4.1 Introduction and Motivation

A system capable of interpreting affect from a speaking face must recognise and fuse
signals from multiple cues. Building such a system necessitates the integration of such
tasks as image registration, video segmentation, facial expression analysis and recog-
nition, speech analysis and recognition.

Without the availability of vast sums of time and money to build the components
“from the ground up”, this almost certainly entails re-using publicly available soft-
ware. However, such components tend to be idiosyncratic, purpose-built, and driven


by scripts and peculiar configuration files. Integrating them to achieve the necessary
degree of flexibility to perform full multimodal affective recognition is a serious chal-
lenge.

If one contemplates the operation of a full-lifecycle system that can be trained from
audio and video samples and then used to perform multimodal affect recognition, the
requirements are extensive and diverse. For example, to detect emotion in the voice
the system must be capable of training, say, HMM from prosody in the speech signals.

Another requirement might be that a SVM be trained to recognise still image facial
expressions, e.g. fear, anger, happiness, sadness, disgust, surprise or neutral. More
complex, is the requirement to capture a sequence of video frames and, from the se-
quence, recognise temporal expressions. In order to perform the latter, it might be
necessary to use a deformable model, e.g. an AAM [Edwards 98] to fit to each image
and provide parameters that can then, in turn, be trained using some classifier - possibly
another HMM.

Other features might also be considered for input to the system, e.g. eye gaze and
blink rate. Ultimately, some strategy is required to assess the overall meaning of the
signals, whether it involves fusion using a combined HMM or some other technique.

From the concise consideration of the requirements it can be seen that a broad range
of expertise and software is needed. It is not practical to develop the software from first
principles. Software capable of the recognition of voice and facial expressions imple-
ment techniques from different areas of specialisation. ASR techniques have evolved
over decades while computer vision has become practical in the last ten years, with the
evolution of statistical techniques and computer processing power.

The reasons behind choosing one software product over another are not within the
scope of this work. A brief overview of some of the critical components and the “com-
posite framework” used to harness them is presented.

This chapter is organised as follows:



Section 4.2 discusses the functional requirements of a system capable of sensing


multiple variable inputs from voice, facial expression and movement, making some as-
sessment of the signals of each, and then fusing them to provide some degree of affect
recognition. Section 4.4 describes how the key requirements have been implemented
in the NXS, which has been built to support the experimental aspects of this dissertation.

Although, due to time constraints, the main focus in this dissertation and the exper-
imental work is on facial expression recognition, the NXS system has been designed to
support multi-modal analysis and recognition.

4.2 Functional Requirements

There are several levels of sophistication that a system capable of sensing affect could
provide:

1. recognition of affect from audio only;

2. recognition of affect from image only;

3. recognition of affect from video without audio; and

4. recognition of affect from video with audio.

Indeed, the system should be able to operate within each modality or across a con-
flation of modalities. The ultimate goal is to be able to recognise emotion from both
audio and video inputs from a speaking face. Consider a speaking face in a real world
situation. Voice expression is not necessarily continuous; there may be long pauses or
sustained periods of speech. Vocal speech and facial expression may not necessarily
be contemporaneous; the verisimilitude of the voiced expression might be confirmed
or contradicted by the facial expressions. The face might be expressionless, hidden,

not available or occluded to some degree for certain periods of time. This implies that
the system needs to be able to:

1. detect the voice and facial expressions independently;

2. operate on only one modality in some cases; and

3. weigh one signal against the other when more than one modality is available.

Lastly, the system must be flexible so that alternative software products and tech-
niques can be substituted without a large amount of effort or re-engineering work in-
volved. For example, to compare classification performance, it might be desirable to
substitute an ANN package for an SVM package or simply compare different types of
SVM implementations.
The following sections present a minimalist requirement statement of some of the
key individual functional areas.

4.2.1 Audio Processing

The approach to recognising affect in speech is often similar to ASR processing; however,
emotion is normally mapped at a supra-segmental level, rather than at word boundaries
or silences. Nevertheless, features such as energy levels and variance in the signals
can be used to detect prosody. Indeed, use is often made of freely available ASR pack-
ages. Whatever the approach, this capability is mandatory.
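
As a simple stand-in for such prosodic measurements, the sketch below computes a
short-time log-energy contour from a WAV file; the file name, frame and hop sizes are
illustrative assumptions.

    import numpy as np
    from scipy.io import wavfile

    def frame_energy(path, frame_ms=25, hop_ms=10):
        rate, samples = wavfile.read(path)
        samples = samples.astype(float)
        if samples.ndim > 1:                                  # mix down stereo recordings
            samples = samples.mean(axis=1)
        frame, hop = int(rate * frame_ms / 1000), int(rate * hop_ms / 1000)
        energies = [np.log(np.sum(samples[i:i + frame] ** 2) + 1e-10)
                    for i in range(0, len(samples) - frame, hop)]
        return np.array(energies)

    e = frame_energy("utterance.wav")                         # hypothetical recording
    print("energy mean / variance:", e.mean(), e.var())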

4.2.2 Image Processing

Before a facial expression in an image can be analysed it is necessary to have software


capable of first detecting the face. Once found, facial features need to be extracted,
e.g. by fitting an AAM to the face. This will yield parameters that can be used in the
classification process.

A major issue in recognising facial expressions from video is boundary detection,


i.e. the boundary between the onset and offset of each expression. One approach to this
is to try to recognise either the onset or the apex of a facial expression. For example, in
a simple situation a subject might begin with a neutral expression and then progress to
an expression of surprise. [Lucey 06] attempted to recognise expressions from peak to
peak. Whatever the temporal position in the expression, this implies training a classifier
of still images and then capturing one or more video frames from a video sequence
before attempting to match the expression.

4.2.3 Video Processing

Videos come in a wide variety of formats and containers. Unfortunately, not all freely
available computer vision software can operate on all formats. The very popular OpenCV
[OpenCV ] software, commonly used to perform face detection in videos, is also capa-
ble of capturing frames from a video, in principle obviating the need for additional
specialised image capture software. However, OpenCV will only process the Audio
Video Interleave (AVI) container format, introduced by Microsoft in 1992, thus con-
straining the solution to some extent.

Regardless of the video format, capturing image frames from the video at certain
intervals is essential and, in practice, may need to be performed both manually and
automatically. The frames need to be captured for training and recognition phases.
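
As an illustration of this requirement, the following is a minimal sketch that uses OpenCV’s C API to step through an AVI file and save a frame every 500 ms. The file names and the 500 ms interval are illustrative, and error handling is kept to a minimum.

#include <cstdio>
#include <opencv/cv.h>
#include <opencv/highgui.h>

int main()
{
    CvCapture* capture = cvCaptureFromFile("input.avi");
    if (!capture) return 1;

    double fps = cvGetCaptureProperty(capture, CV_CAP_PROP_FPS);
    if (fps <= 0) fps = 25.0;                       // fall back if the container lacks FPS information
    const int step = (int)(fps * 0.5 + 0.5);        // number of frames in 500 ms

    IplImage* frame = 0;
    int frameNo = 0, saved = 0;
    while ((frame = cvQueryFrame(capture)) != 0) {  // the returned frame is owned by the capture structure
        if (frameNo % step == 0) {
            char name[64];
            std::sprintf(name, "frame_%04d.png", saved++);
            cvSaveImage(name, frame);               // write the captured frame to disk
        }
        ++frameNo;
    }
    cvReleaseCapture(&capture);
    return 0;
}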

4.2.4 Classification

It is essential that the system be capable of incorporating different types of classifiers,


e.g. SVM, ANN, Boosting, for individual inputs and possibly an ensemble of classifiers
for fusing the individual classifications and weighting them.

4.2.5 Miscellaneous

In order to compare different techniques it must be possible to re-run training and


testing phases, i.e. persisting all the available inputs, interim and final results. It is also
desirable to be able to compare studies or projects and store the results separately for
later comparison. The system must be capable of running in offline or real-time mode.
For example, models to be used in the classification phase might be pre-built, but the
system should be able to perform real-time analysis. Easy analysis of results must be
possible in order to determine how best to tune and improve the system.

4.3 System Operational Requirements

The NXS was built with the ultimate goal of evolving into a cross-platform, extensible, “real-world” audio-visual FER system, capable of being used in a wide range of applications. With that in mind, the following sections discuss some putative prescriptions of such a system.

4.3.1 Implementation Platforms

Ideally the system should have broad platform support. In practical terms, this trans-
lates to variants of Unix and Linux, Mac OS and Windows.

4.3.2 Audio and Video Formats

One of the first hurdles that one encounters, especially with video processing, is the
number of different video container formats and their lack of availability on one or
more operating environments. Where possible the system should be capable of sup-
porting multiple audio and video container formats. At a minimum these should in-
clude WAV, AVI, MP4, MPEG2, and MOV. If support is not available, then there should

be some simple way of converting between formats where this is possible.

Image Processing

The system needs the capability to train a classifier on a corpus of still emotional expressions. The corpus could consist of images in JPEG, PNG or some other format. Alternatively, there may be no image corpus but rather a video collection from which the significant frames will need to be captured into a suitable format, thus creating a de facto corpus. The images will then be subjected to some recognition process.

Video Processing

A video can be hours in duration or it could simply be, as in the Cohn-Kanade database
[Kanade 00], collections of short, sample expressions. Both in training and testing, the
system needs to be able to capture frames from a video segment. The frames will then
be subjected to a treatment similar to the image processing mentioned previously, and
the resulting parameters input to, say, a HMM.

Classification

Several subsystems require some form of classification component. It is essential that the system be able to perform data reduction on any input vectors that are large in dimensionality, e.g. using PCA or Linear Discriminant Analysis (LDA).

4.3.3 System Performance

Ideally, the system will make use of multi-threading and multi-processing capabilities
of the operating system. Performance is critical, as is efficient memory usage. It is
preferable that the system be able to execute in online mode.

4.3.4 User Interface

The core system must be simple to use so that the effort required to integrate compo-
nents and to re-run exercises is minimised.

4.4 The Any Expression Recognition System (NXS)

At the heart of NXS is the concept of a “Project”. A project is a meta-object, e.g. a FER experiment, containing metadata about each of the objects used in that experiment. However, it is not limited to FER and can also contain information about audio objects. Each project can contain references to collections of videos, images, image sequences, audio segments, multiple AAMs and SVMs, and even other projects.

The major benefit of projects is that results can be saved, experiments can be re-
run with different parameters and outputs can be written to comma-delimited files for
use in third-party products, e.g. Excel. AAMs and SVMs built for one project can be
referenced and used in another project.

NXS is capable of extracting and cataloguing frames from video, as well as performing real-time expression analysis.

4.4.1 Software Selection for the Core System

While the Windows operating system dominates the commercial and home environ-
ments, systems such as Linux and Mac OS also remain popular. In order to meet
the cross-platform operating environment requirement, C++ or Java was considered
suitable for development of the core system. However, given the critical performance
requirements and the fact that most high-performing video processing libraries are C
or C++ based, C++ was selected.

There is sometimes a misconception that the C or C++ programming languages, by themselves, yield systems that are portable between different platforms. Without a lot of effort, this is not usually the case. Building the system with Nokia’s Qt integrated development environment [Qt 09] makes it far easier to keep the C++ programs cross-platform compatible, and Qt was therefore selected for development of the core system and user interface.

Qt has another very attractive feature: its meta-object system supports a programming concept known as “reflection”. The benefits of reflection are realised when it comes to saving and restoring exercises. The state and property values of objects are “reflected” fairly simply into, in this case, Extensible Markup Language (XML). Using the meta-object, the system can perform “round-trip” XML processing - serialising the objects, persisting them, and later deserialising them. In practical terms, this is the means of saving and restoring projects.
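
The following is a minimal sketch of the idea, assuming a QObject-derived class whose state is exposed through Qt properties. The element layout and the decision to store every property as a string are illustrative simplifications rather than the actual NXS serialiser.

#include <QMetaProperty>
#include <QObject>
#include <QXmlStreamWriter>

// Walk an object's properties via the Qt meta-object and write them as XML attributes.
void writeObjectAsXml(const QObject* obj, QXmlStreamWriter& xml)
{
    const QMetaObject* meta = obj->metaObject();
    xml.writeStartElement(meta->className());
    for (int i = 0; i < meta->propertyCount(); ++i) {
        QMetaProperty prop = meta->property(i);
        // read the current value through the meta-object and serialise it as text
        xml.writeAttribute(prop.name(), obj->property(prop.name()).toString());
    }
    xml.writeEndElement();
}

Deserialisation reverses the process, reading each attribute back and applying it with QObject::setProperty().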

4.4.2 Software Selection for Major Functions

The Vision-something Libraries (VXL) are used for image processing [VXL ]. The libraries are written in C++ and are very efficient. OpenCV is used for simple video display and to capture images from videos [OpenCV ]. LIBSVM [Chang 01] is used for facial expression recognition.

4.4.3 Class Structure

The system has been built using design patterns as described by [Pree 95]. Qt lends
itself to building with design patterns [Ezust 06]. Figure 4.1 depicts the conceptual
class structure. Use is made of the serializer and composite patterns to effect the round-
trip processing, mentioned in Subsection 4.4.1.

A “project” is the top-level concept, created by a software factory, and is simply a collection of segments (discussed in the next section). Facades are used to abstract the details of external classes, such as those used to perform AAM processing. A form factory is used to create simple dialog boxes for user input, reducing the amount of effort that would otherwise have been required if the dialogs had been hand-crafted.

Figure 4.1: Class hierarchy (the Multimedia and Segment base classes are specialised as Image/ImageSegment, Video/VideoSegment, Audio/AudioSegment and AudioVisual/AVSegment)

4.4.4 Segments

Central to the design of the system is the concept of segments. This borrows to some extent from, but is simpler than, MPEG-7 [Salembier 01] and its concept of segment types. This is no coincidence: MPEG-7, formally known as the “Multimedia Content Description Interface”, is a standard for describing multimedia content data that supports some degree of interpretation of the information’s meaning, which can be passed on to, or accessed by, a device or computer program. However, the implementation here deviates slightly in that the segment and multimedia data members and operations are combined.

The various types of segments are created by a segment factory. As can be seen
in Figure 4.1, these include Image Segments, Image Sequence Segments, Image Col-
lections, Audio Collections and Video Collections. Using factories to provide a layer

of abstraction not only conceals the implementation complexity from the calling func-
tions but simplifies the creation of new types of segments. Figure 4.2 depicts the
segment factory class diagram.

Figure 4.2: Class diagram of the Segment Factory
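
A minimal sketch of the factory idea is shown below. The Segment interface, the enumeration values and the method names are illustrative; they follow the spirit of Figure 4.1 rather than the actual NXS class definitions.

#include <memory>
#include <stdexcept>
#include <string>

class Segment {
public:
    virtual ~Segment() {}
    virtual std::string describe() const = 0;
};

class ImageSegment : public Segment {
public:
    std::string describe() const { return "image segment"; }
};

class VideoSegment : public Segment {
public:
    std::string describe() const { return "video segment"; }
};

// Callers name the kind of segment they need and never see the concrete classes.
class SegmentFactory {
public:
    enum Type { Image, Video /*, Audio, AudioVisual, ImageSequence, ... */ };

    static std::unique_ptr<Segment> create(Type type)
    {
        switch (type) {
        case Image: return std::unique_ptr<Segment>(new ImageSegment);
        case Video: return std::unique_ptr<Segment>(new VideoSegment);
        default:    throw std::runtime_error("unknown segment type");
        }
    }
};

Adding a new segment type then only requires a new concrete class and one extra case in the factory.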

4.4.5 Dialog Creation

User dialogs are, in keeping with the design pattern approach, created by factories.
Figure 4.3 demonstrates the simplicity in creating new dialogs through the use of a
form factory [Ezust 06].

Figure 4.3: Dialog creation

4.4.6 Processing Scenario

Rather than simply listing each function, the system is better described by a practical walk through a typical processing scenario. Most major functions are accessible by right-clicking to present a context menu, as shown in Figure 4.4. Use of the product begins with the creation of a “Project”, as seen in Figure 4.5.

Figure 4.4: The system menu

Figure 4.6 shows the project tree structure after a project has been created and an image segment, an image collection segment, a video segment and a model segment have been added to it. The tree structure is effectively the XML that reflects the objects’ states and data member values. From here, the XML can be saved and reopened later.

Figure 4.5: Project creation

Figure 4.6: User interface

4.4.7 Measuring Facial Features

The facial image is subdivided into regions in order to track inter-region movement, and a common scale is applied to the facial images. Similar to [Strupp 08], this is done by first creating a horizontal or transverse delineation line, mid-way between the topmost and bottommost landmark points on the facial image being examined, as shown in Figure 4.7 (the concept of an “RU”, which appears in the figure, is explained in Section 5.2.2). This process is applied to any facial image used to train or test the system.

Figure 4.7: Measurements from horizontal delineations (image from the Feedtum database [Wallhoff ])

Within the experiments described in Chapters 5 and 6, reference to “shape” refers to normalised and scaled vectors. The process of “scaling” the data is explained in Section 5.3.2. The algorithm for the normalisation process is described in Algorithm 3.
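
The following is a minimal sketch of the delineation step described above: the topmost and bottommost landmark y-coordinates are found and the transverse line is placed midway between them. The Point structure and the function name are illustrative.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Return the y-coordinate of the horizontal (transverse) delineation line, placed
// mid-way between the topmost and bottommost landmark points (at least one is assumed).
double transverseMidline(const std::vector<Point>& landmarks)
{
    double top = landmarks[0].y, bottom = landmarks[0].y;
    for (std::size_t i = 1; i < landmarks.size(); ++i) {
        top = std::min(top, landmarks[i].y);        // smallest y = topmost point
        bottom = std::max(bottom, landmarks[i].y);  // largest y = bottommost point
    }
    return 0.5 * (top + bottom);
}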

4.4.8 Classification

Two sets of classifiers are used in this work: one for prototypical expressions, and others specific to Region Units (RUs). Those specific to RUs are trained to classify the major intra-RU patterns that would normally accompany prototypical expressions. For instance, the positions of the mouth and lip fiduciary points in an image of a prototypical smile are re-used and represent one class in the RU3 classifier. The relevant normalised measurements are used as inputs to all of the classifiers; for example, in the case of RU1, only the landmark points within RU1 are input to the classifier. The mappings are shown in Table 4.1.

Note that eye movement AUs, such as those that effect blinks (e.g. AU43, AU45-46 and AU61-68), are not incorporated at present.

Algorithm 3: Measuring and normalising the x, y distances

input: FaceXYCoordinates (interleaved x, y landmark coordinates)
let n = (total number of coordinates) / 2
declare xPoints[n], yPoints[n], xNormalised[n], yNormalised[n]

for i = 0 to n - 1 do
    xPoints[i] = FaceXYCoordinates[2*i]      // group x coordinates
    yPoints[i] = FaceXYCoordinates[2*i + 1]  // group y coordinates

// get the offset from the left and top of the image frame
xMin = xPoints.min()
yMin = yPoints.min()

// get the mean of each coordinate group
xMean = xPoints.mean()
yMean = yPoints.mean()

// centre the points on the mean
for j = 0 to n - 1 do
    xNormalised[j] = xPoints[j] - xMean
    yNormalised[j] = yPoints[j] - yMean

// normalise (scale to unit length) the centred vectors
xNormalised.normalise()
yNormalised.normalise()

4.4.9 System Processing

The processing sequence is depicted in Figure 4.8. First, images are captured at prede-
fined intervals from video, 500ms in this case. Frontal face poses are then segmented
from the images using the Viola and Jones [Viola 01] (or a derived) technique to deter-
mine the global pose parameters. See [Viola 01] for details. Next, AAMs, which have
been prepared in advance, are fitted to new and previously unseen images to derive
the local shape and texture features. The measurements obtained (see Section 4.4.7)
are then used to classify the regions using SVM classifiers. Ultimately, the number and
mixture of classified regions are used to report facial activity.
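
As a concrete illustration of the face detection stage, the following is a minimal sketch using OpenCV’s Viola-Jones implementation. The cascade would typically be loaded once with cvLoad() and the memory storage created with cvCreateMemStorage(0); the detection parameters are illustrative, and the AAM fitting and SVM classification stages are indicated only by comments because their interfaces are internal to NXS.

#include <opencv/cv.h>

// Returns true and fills 'face' if a frontal face was found in 'frame'.
bool detectFrontalFace(IplImage* frame, CvHaarClassifierCascade* cascade,
                       CvMemStorage* storage, CvRect& face)
{
    cvClearMemStorage(storage);
    CvSeq* faces = cvHaarDetectObjects(frame, cascade, storage,
                                       1.1,   // scale step between search scales
                                       3,     // minimum neighbouring detections
                                       CV_HAAR_DO_CANNY_PRUNING,
                                       cvSize(40, 40));
    if (!faces || faces->total == 0)
        return false;                              // no frontal face in this frame
    face = *(CvRect*)cvGetSeqElem(faces, 0);       // use the first detection as the global pose
    // Subsequent stages (not shown): fit the pre-trained AAM inside 'face',
    // compute the normalised region measurements and classify them with the RU SVMs.
    return true;
}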

AU                                   Description                             RU
1, 2, 3, 4                           Brow movement                           1
5, 6, 7                              Eyelid activity and orbicularis oculi   2
9, 10, 12, 15, 17, 20, 25, 26, 27    Nose, mouth and lip regions             3

Table 4.1: Mapping Action Units to Region Units

Figure 4.8: Processing of facial activity measurements

4.4.10 Active Appearance Models

To recap on Chapter 3: in recent years, a powerful deformable model technique, known as the Active Appearance Model [Edwards 98], has become very popular for real-time face and facial expression recognition. In the AAM approach, the non-rigid shape and visual texture (intensity and colour) of an object are statistically modelled using a low-dimensional representation obtained by applying PCA to a set of labelled training data. After these models have been created, they can be parameterised to fit a new image of the object, which might vary in shape, texture or both.
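
For reference, the underlying linear models can be written in the standard AAM form (generic notation, not specific to the NXS implementation):

s = \bar{s} + P_s\, b_s, \qquad g = \bar{g} + P_g\, b_g

where s is the shape vector of concatenated landmark coordinates, g the shape-normalised texture, \bar{s} and \bar{g} the respective means over the training set, P_s and P_g the matrices of retained PCA eigenvectors, and b_s and b_g the low-dimensional parameter vectors that are adjusted when fitting the model to a new image.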

Figure 4.9 depicts the facial triangulation mesh, based on the feature points, that is used to divide the facial image into regions. In this case, AAMs are pre-trained on static images using the SIC fitting method [Baker 03a]. SIC is another adaptation of inverse compositional image alignment for AAM fitting that addresses the problem of significant shape and texture variability by finding the optimal shape and texture parameters simultaneously. Rather than re-computing the linear update model at every iteration using the current estimate of the appearance parameters, the update model is approximated by evaluating it at the mean appearance parameters, allowing it to be pre-computed, which significantly improves computational efficiency. See [Baker 03a] for details. The system described in Section 7 involves the recording of frontal face images while the participant views stimuli on a computer display. Under this setup, there is a reasonable tolerance to head movement and short-duration occlusion.

Figure 4.9: Face mesh used to group Action Units

4.4.11 Classification Using SVM

One implementation of SVM, often incorporated within large-scale and comprehensive data mining packages, is LIBSVM [Chang 01]. It is well supported and can be used independently of any host product. SVM generalisation performance depends on the appropriate setting of the meta-parameters, namely the penalty parameter C and the kernel parameter γ, and to assist with this LIBSVM includes a grid-search program to establish optimum parameter settings.
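
The following is a minimal sketch of that grid search using the LIBSVM C API directly (NXS relied on the grid.py script shipped with LIBSVM for the same purpose). The svm_problem is assumed to be already populated, and the search ranges follow the LIBSVM defaults of C = 2^-5 ... 2^15 and γ = 2^-15 ... 2^3.

#include <cmath>
#include <cstdio>
#include "svm.h"   // LIBSVM header

void gridSearch(const svm_problem& prob)
{
    svm_parameter param;
    param.svm_type = C_SVC;          // standard C-SVC
    param.kernel_type = RBF;         // radial basis function kernel
    param.degree = 3;
    param.coef0 = 0;
    param.nu = 0.5;
    param.cache_size = 100;
    param.eps = 1e-3;
    param.p = 0.1;
    param.shrinking = 1;
    param.probability = 0;
    param.nr_weight = 0;
    param.weight_label = 0;
    param.weight = 0;

    double* target = new double[prob.l];
    for (int logC = -5; logC <= 15; logC += 2) {
        for (int logG = -15; logG <= 3; logG += 2) {
            param.C = std::pow(2.0, logC);
            param.gamma = std::pow(2.0, logG);
            svm_cross_validation(&prob, &param, 5, target);   // 5-fold cross-validation predictions
            int correct = 0;
            for (int i = 0; i < prob.l; ++i)
                if (target[i] == prob.y[i]) ++correct;
            std::printf("C=2^%d gamma=2^%d accuracy=%.1f%%\n",
                        logC, logG, 100.0 * correct / prob.l);
        }
    }
    delete[] target;
}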

4.4.12 Gabor Filter Processing

Although there are several Gabor filter software packages available, few are written in
the C++ programming language and are easily incorporated into other software sys-
tems. The Gabor filter processing in NXS makes use of a well-designed software im-
plementation [cvG ], used in [Zhou 06, Zhou 09].
I am an old man and have known a great many troubles, but most of them never happened.

Mark Twain

5
Sensing for Anxiety

5.1 Introduction and Motivation for Experiments

5.1.1 Introduction

Anxiety is a normal reaction to everyday events and everyone experiences it at some point in time. Elevated arousal levels accompany many common activities such as examinations, public speaking and visits to the dentist. It is when anxiety starts to affect a person’s everyday life that it becomes classed as a disorder. Anxiety disorders are the most common mental disorders in Australia. Nearly 10% of the Australian population


will experience some type of anxiety disorder in any one year - around one in twelve
women and one in eight men. One in four people will experience an anxiety disorder
at some stage of their lives [beyondblue ]. In the United States, anxiety disorders affect
about forty million adults a year - around 18% of the population [Kessler 05]. Anxi-
ety is often comorbid with depression and between them they can have serious health
implications.

The term “anxious expression” is used in everyday language and most people would tacitly acknowledge its intended meaning. Thus, given the prevalence of anxiety and of actual anxiety disorders, one would think that such an expression would be straightforward to define and, similarly, to recognise automatically. However, the near-absence of literature on automatic anxious expression recognition suggests that this is not the case. This chapter describes an attempt, through a set of novel experiments, to test the feasibility of such an exercise.

Chapter 3 introduced several state-of-the-art techniques used in the computer analysis of facial expressions. Chapter 4 described a computer system, NXS, that has been developed to support the experimental aspects of this dissertation. Although the system provides flexibility in processing video and images, incorporating implementations of AAMs, Gabor wavelet transforms and SVMs, its successful application is conditional on proper calibration of these functions. Indeed, even with a depth of knowledge and a thorough review of the literature of similar research, some of the optimal settings are obtained through experience and empirically, i.e. through trial and error, which can appear like a black art at times.

In line with the broader objectives of this thesis, summarised in Section 1.1, the
motivation for the experiments presented in this chapter is to:

• prove that the concepts described in previous chapters can be applied in a practical experiment;

• understand how best to use and calibrate the NXS system; and

• taking anxiety as an example, try to establish if automatic expression recogni-


tion can be used to differentiate a subtle, non-primary emotion from other facial
expressions1 .

This chapter is organised as follows. Section 5.2 explains the hypotheses and ques-
tions to be addressed. This is followed in Section 5.3 with a description of the method-
ology used in the experiments. Section 5.4 presents the data and analysis from the
experiments. Finally, Section 5.5 concludes and evaluates the exercise.

5.2 Questions and Hypotheses

5.2.1 Hypotheses

On the basis of the motivation for the experiments and the literature review in Chapter
3, the following hypotheses and questions were generated:

1. Using computerised facial expression recognition techniques, anxious expres-


sions can be differentiated2 from fearful expressions.

2. Using computerised facial expression recognition techniques, anxious expres-


sions can be differentiated from a larger set of prototypical expressions.

5.2.2 Questions Pertaining to the Importance of Feature Data

Figure 5.1 depicts the fiduciary or landmark points that are “fitted” to each image during analysis. The collective landmark points, referred to in this dissertation as “shape”, are captured as a set of x, y Cartesian coordinates. As explained in Chapter 3, texture, or the spatial variation in the gray values of pixel intensities in the image, is also obtained.

Figure 5.1: Facial landmark points “fitted” to an image

1 A “non-primary” emotion is one that excludes the fast-acting emotions such as the “big-six”, discussed in Section 2.2.1.
2 “Differentiated” is defined as performance better than chance.

Several options are available to make use of this feature data to classify expres-
sions, e.g. using shape information only, using texture information only, or by combin-
ing the shape and texture information in some way. In keeping with the motivation for
this section, the following set of questions was posed:

How does facial expression recognition performance, i.e. Classification Accuracy (CA),
vary when using:

1. the location of facial landmark points only;

2. the location of facial landmark points concatenated with Gabor magnitude;

3. the location of facial landmark points concatenated with AAM texture parameters;

4. AAM texture parameters only; and

5. Gabor magnitude only?



5.2.3 Questions Pertaining to the Relative Importance of Facial

Regions

In Section 4.4.7 of Chapter 4 the subdivision of facial areas for analysis is explained.
Figure 5.2 is reproduced for convenience, showing the three facial regions to be used:

• R1 - The eyebrow region;

• R2 - The eye region; and

• R3 - The mouth region.

Figure 5.2: Facial region demarcation in image [Wallhoff ]

It would be useful to know if one facial region is more important than another for
all expressions or only for specific expressions. More formally, the following questions
were of interest:

1. Is one facial region generally more reliable for recognition of expressions?

2. Is one facial region more reliable for recognition of a specific expression?

5.2.4 Question Pertaining to System Performance

Of interest is the cumulative execution time of face-detection, AAM fitting and clas-
sification of an image since these steps would apply to an online recognition system
(although the face-detection step is not normally undertaken in every frame).
The following question was therefore posed:

1. Would performance be sufficient to achieve on-line recognition of facial expres-


sions, in a video running at 30 frames per second (online recognition has the
additional step of capturing the frame from video)?

5.3 Methodology

5.3.1 Experimental Setup

Experiment 1 The first experiment, as a baseline, was to determine if the NXS system
could be trained to differentiate between prototypical facial expressions labelled
as ‘Fear’, ‘Anger’, ‘Happy’, ‘Sad’, ‘Surprise’, and ‘ Neutral’ from the Cohn-
Kanade database. The number of occurrences of each expression is shown at
Table 5.1.

Anger Fear Sad Surprise Neutral Happy Total


33 16 17 21 36 31 154
Table 5.1: Experiment 1 - Number of occurrences of each expression [Kanade 00]

Experiment 2 The second experiment was to test Hypothesis 1 in Section 5.2 and determine if the NXS system could differentiate between facial expressions nominated by human judges as anxiety and those nominated by human judges as fear.

There are no freely available databases of anxious facial expressions that have
FACS annotations. To improvise, a set of expressions from the Cohn-Kanade
[Kanade 00] database having FACS annotations corresponding to action units of
fear and anxiety were selected. They were first analysed by the author who made
a preliminary judgement, labelling the expression as either ‘Anxiety’ or ‘Fear’.
The number of occurrences of each class label from the preliminary assessment
is shown at Table 5.2.

An anonymous poll was then conducted and participants were asked to view each expression and judge whether they thought it should be labelled as ‘Fear’, ‘Anxiety’ or ‘Uncertain’. Those invited to participate in the poll were not given any indication of the preliminary label.

Where 50% or more of judges labelled an expression as either ‘Fear’ or ‘Anxi-


ety’, i.e. one or the other, the image was retained for use in the exercise with the
voted label attached. An additional caveat is that the margin between the ‘Fear’
and ‘Anxiety’ vote had to exceed 5%. Note that there is no suggestion that any
of the expressions are, in fact, anxiety. For the purpose of the experiment it is
not essential that they be anxiety - merely that the computer system be trained
to classify them as such, and agree with the judges’ opinions on the expressions
used in training.

Fear Anxious Total


29 26 55
Table 5.2: Initial numbers of each expression [Kanade 00]

The raw data results of the poll are shown at Appendix A. A summary of the
revised numbers of each class is at Table 5.3.

Fear Anxious Total


18 16 34
Table 5.3: Results from poll - numbers labelled as fear and anxiety retained.

For reasons discussed later, 3 images were removed from the exercise, leaving
31 images for use in the experiment. The number of occurrences of fear and
anxiety is shown in Table 5.4.

Fear Anxious Total


16 15 31
Table 5.4: Experiment 2 - Final number of occurrences of each expression retained.

Of these occurrences, 9 are male and 22 are female. An even split would have
been ideal, but a little difficult to attain with such a small sample set when sixty-
five percent of subjects in the Cohn-Kanade database are female. Individual at-
tributes of the recorded actors are not known. We are told, however, that in
the database subjects range in age from 18 to 30 years, 15 percent are African-
American, and three percent are Asian or Latino.

Experiment 3 The third experiment was to test Hypothesis 2 in Section 5.2 and es-
tablish if the NXS system could differentiate facial expressions of anxiety from
a larger set of emotional expressions which included, ‘Fear’, ‘Anger’, ‘Happy’,
‘Sad’, ‘Surprise’, and ‘ Neutral’. The images of fearful and anxious expressions
used in Experiment 2 were combined with those of Experiment 1, replacing Ex-
periment 1’s fearful expressions with Experiment 2’s fearful expressions, and
adding Experiment 2’s anxious expressions. All images were from the Cohn-
Kanade database and the numbers of each expression is shown at Table 5.5.

Anger Fear Sad Surprise Neutral Happy Anxious Total


33 16 17 21 36 31 15 169
Table 5.5: Experiment 3 - Numbers of each expression from Cohn-Kanade database

Experiment 4 The objective of the fourth experiment was to test whether the classifier
built from the Cohn-Kanade database of images, in Experiment 1, could be used
to predict the facial expressions in the Feedtum database of images [Wallhoff ],
which were recorded with different subjects and under different lighting condi-
tions. Prototypical facial expressions of ‘Fear’, ‘Anger’, ‘Happy’, ‘Sad’, ‘Sur-
prise’, and ‘ Neutral’ from both the Cohn-Kanade database and the Feedtum
database were selected.

As a preliminary step, a baseline classification performance test was conducted


using prototypical facial expressions of ‘Fear’, ‘Anger’, ‘Happy’, ‘Sad’, ‘Sur-
prise’, and ‘ Neutral’ from only the Feedtum database, so that classification
performance could be compared with that found in Experiment 1, where im-
ages from the Cohn-Kanade database were used. This is referred to here as the
“baseline” experiment.

Next, an attempt was made to automatically classify expressions in images from


the Feedtum database against the SVM models built in Experiment 1. The total
numbers of each expression is shown at Table 5.6.

Anger Fear Sad Surprise Neutral Happy Total


9 11 10 11 15 14 70
Table 5.6: Experiment 4 - Numbers of each expression from Feedtum database

A sample image from each database is presented in Figure 5.3, which shows clearly the marked difference in lighting conditions between the image from the Cohn-Kanade database on the left and that from the Feedtum database on the right.

5.3.2 System Setup

The broad concept of facial expression recognition has been explained in Chapter 3
and the implementation of various associated functions in the NXS system described in

Figure 5.3: Experiment 4 - Sample images from the Cohn-Kanade database (left) and the Feedtum database (right), showing the different lighting conditions

Chapter 4. Although the NXS system can be used to both train and test classifiers, in
this set of experiments, it was used, predominantly, to build AAMs and fit them to the
set of images in the experiments. When fitting AAMs to images, NXS records:

• the Cartesian coordinates of each landmark point that makes up the face “shape”.
These are normalised to take into account differences in face sizes;

• the AAM texture parameters; and

• the Gabor magnitudes.

The term “shape”, as used in this dissertation, refers broadly to the facial landmark points and the subdivision of the face into the eyebrow (R1), eye (R2) and mouth (R3) regions. This is explained in Chapter 4.

One of NXS’ other functions, using the aforementioned stored face and texture
features, is to scale and normalise them and output them in a format suitable for in-
put to an external classification product such as LIBSVM [Chang 01] or RapidMiner
[RapidMiner ] (which uses LIBSVM).3

Eight output datasets were produced for each experiment. These contained fea-
ture sets of shape, shape and Gabor texture concatenated, shape and AAM texture
3 LIBSVM uses a format called SVMlight, whereas RapidMiner imports, amongst other things, tab-delimited files.

parameters concatenated, Gabor magnitude, AAM texture, eyebrow shape (R1), eye
shape (R2) and mouth shape (R3). The tuning details of each major functional area
and parameter selections used in this set of experiments are explained below.

AAM Choice and Parameter Selection

The Iterative Error Bound Minimisation (IEBM) method [Saragih 06] of building the AAM and fitting the model to the image was chosen because of its fast fitting capabilities,4 which would be critical in achieving real-time fitting as required in later experiments. Depending on the parameter selection, model training time was longer than that experienced with the SIC method. The implementation of the algorithm by [Saragih 06] was used,5 and after the recommended settings were applied, the parameters were fine-tuned by trial and error in consultation with the software provider.

A before and after example of “fitting” an AAM to a face in an image from the
Cohn-Kanade database is given at Figure 5.4.

Figure 5.4: Original image on left and image after fitting on right

4 This had been shown in earlier informal trials [Saragih 09].
5 The implementation is written in the C++ programming language.

Gabor Processing and Parameter Selection

Due to the high dimensionality of the output features from Gabor filters, processing was only applied to the R1, R2 and R3 regions - not to the entire cropped face. Prior to convolving the image with the Gabor filter, the region of interest, e.g. R1, is scaled or warped using bicubic interpolation to a fixed, canonical size (100 × 10 pixels for R1 and R2, and 100 × 20 pixels for R3) and written to a grayscale image. To facilitate the explanation, the before and after images are shown in Figure 5.5.

Figure 5.5: Regions R1, R2 and R3 before and after rescaling (3 × actual size)

The Gabor filter processing made use of a software implementation [cvG ], which had been used by [Zhou 06].6 The program implements the Gabor wavelet using the formula in [Zhou 06], shown in Equation 5.3.1:

\psi_{\mu,\nu}(z) = \frac{\|k_{\mu,\nu}\|^2}{\sigma^2} \, e^{-\frac{\|k_{\mu,\nu}\|^2\|z\|^2}{2\sigma^2}} \left[ e^{i k_{\mu,\nu} \cdot z} - e^{-\frac{\sigma^2}{2}} \right]   (5.3.1)

where z = (x, y) is the point with horizontal coordinate x and vertical coordinate y. The parameters µ and ν define the orientation and scale of the Gabor kernel, ‖ · ‖ denotes the norm operator, and σ is related to the standard deviation of the Gaussian window in the kernel and determines the ratio of the Gaussian window width to the wavelength. The wave vector k_{\mu,\nu} is defined as follows:

k_{\mu,\nu} = k_\nu e^{i\phi_\mu}   (5.3.2)

where k_\nu = \frac{k_{max}}{f^\nu} and \phi_\mu = \frac{\pi\mu}{8} if 8 different orientations have been chosen. k_{max} is the maximum frequency, and f^\nu is the spatial frequency between kernels in the frequency domain.

6 The implementation is also written in the C++ programming language.

The second term in the square brackets in Equation 5.3.1 compensates for the DC
value and its effect becomes negligible when the parameter σ, which determines the
ratio of the Gaussian window width to wavelength, is sufficiently large.
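
For illustration, a minimal sketch of Equation 5.3.1 in C++ is given below. It builds one complex Gabor kernel of a given odd width for orientation index µ and scale ν; the parameter defaults follow the values discussed later in this section, and centring the kernel on the middle pixel is the usual convention rather than anything specific to the [cvG ] implementation.

#include <cmath>
#include <complex>
#include <vector>

const double kPi = 3.14159265358979323846;

std::vector<std::complex<double> >
gaborKernel(int width, int mu, double nu,
            double sigma = kPi, double kmax = kPi / 2.0, double f = std::sqrt(2.0))
{
    const double kv = kmax / std::pow(f, nu);           // k_nu = kmax / f^nu
    const double phi = kPi * mu / 8.0;                  // phi_mu for 8 orientations
    const double kx = kv * std::cos(phi);
    const double ky = kv * std::sin(phi);
    const double k2 = kv * kv;                          // ||k_{mu,nu}||^2
    const double dc = std::exp(-sigma * sigma / 2.0);   // DC compensation term

    std::vector<std::complex<double> > kernel(width * width);
    const int half = width / 2;
    for (int y = -half; y <= half; ++y) {
        for (int x = -half; x <= half; ++x) {
            const double z2 = double(x * x + y * y);    // ||z||^2
            const double envelope = (k2 / (sigma * sigma))
                                  * std::exp(-k2 * z2 / (2.0 * sigma * sigma));
            const double phase = kx * x + ky * y;       // k_{mu,nu} . z
            kernel[(y + half) * width + (x + half)] =
                envelope * (std::complex<double>(std::cos(phase), std::sin(phase))
                            - std::complex<double>(dc, 0.0));
        }
    }
    return kernel;
}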

The approach taken was analogous to the “eigenface” recognition process. In this
method, each training image is “flattened” to a 1 × D vector and the vector pushed
onto a stack of vectors. Once all training images have been processed, dimensionality
reduction takes place. The eigenvectors of the most significant eigenvalues are used to
project the vectors into eigenspace and the coefficients used to train a classifier. The
recognition phase processes the images in a similar manner with coefficients matched
against those derived in the training phase.

In this experiment, each N × M image of the extracted facial region (R1, R2 and
R3) was convolved with each Gabor filter and the resulting Gabor magnitudes are
laid out end-to-end in a vector of dimension N × M . Thus, after convolving an image
region with 40 filters, the result is an N × M ×40 vector. After processing each image,
the N × M × 40 vector is pushed into a stack of vectors of convolved images. Once all
of the images are stacked, OpenCV’s cvCalcPCA function is used to perform PCA to
get the eigenvalues and eigenvectors. A truncated set of eigenvectors is then used and
OpenCV’s cvProjectPCA function is called to project the vectors into “eigenspace”,

finally using the resulting coefficients for recognition. Through trial and error, it was
found that 20 eigenvectors explained approximately 90% of the variation in the set of
Gabor magnitude responses.
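
A minimal sketch of this reduction step using OpenCV’s C API is given below. Each row of the input matrix is assumed to hold one flattened Gabor magnitude vector; the matrix names and the choice of 32-bit floats are illustrative.

#include <opencv/cv.h>

// 'data' holds one flattened N x M x 40 magnitude vector per row; 'coeffs' must be
// (number of samples) x 20 and receives the projection coefficients used by the SVM.
void projectGaborResponses(const CvMat* data, CvMat* coeffs)
{
    const int dim = data->cols;
    const int keep = 20;                                  // retained eigenvectors

    CvMat* mean = cvCreateMat(1, dim, CV_32FC1);
    CvMat* evals = cvCreateMat(1, keep, CV_32FC1);
    CvMat* evecs = cvCreateMat(keep, dim, CV_32FC1);

    // PCA over the stacked vectors (one sample per row)
    cvCalcPCA(data, mean, evals, evecs, CV_PCA_DATA_AS_ROW);

    // project every sample onto the retained eigenvectors ("eigenspace")
    cvProjectPCA(data, mean, evecs, coeffs);

    cvReleaseMat(&mean);
    cvReleaseMat(&evals);
    cvReleaseMat(&evecs);
}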

By far the most challenging aspect of the experiment was calibrating the Gabor filters. Selection of the scale parameters σ and ν affects the width of the kernel and involves a tradeoff: larger values are more robust to noise but less sensitive, while smaller values are more sensitive but less effective in removing noise. Finding optimum settings
involved reference to a number of articles [Bhuiyan 07, Chen 07, Fasel 02, Gao 09,
Kamarainen 06, Kanade 00, Lades 93, Lee 96, Liu 04, Liu 06, Movellan 08, Shen 06,
Shen 07, Wiskott 97, Wu 04] and a lot of trial and error.

Many articles do not explain the reasons behind parameter selection, instead referring, if at all, to other articles that, in turn, do not explain the settings or simply refer to yet another article. The popular setting for kmax is π/2 and for the spatial frequency f it is √2. It would seem that the setting kmax = π/2 originates from [Lades 93], who noted that it yielded the best results after trialling the values 3π/4, π/2 and π/3. [Lades 93] also seems to be responsible for the setting of the spatial frequency f to √2, after trialling the values f = √2 and f = 2.

In a study into automatic coding of facial expressions displayed during posed and genuine pain, [Littlewort 07, Littlewort 09] disclose that they convolved their 96 × 96 images with a bank of Gabor filters of eight orientations and 9 spatial frequencies (2-32 pixels per cycle at 1/2 octave steps). They then pass the output magnitudes to action
unit classifiers. Yet, in a study from the same laboratory, [Whitehill 09] describe an
attempt to provide an optimum set of parameters in the task of performing smile detec-
tion against a real-world set of photographs. The database, GENKI, consists of pictures
from thousands of different subjects, photographed by the subjects themselves. In the
experiment, all images were converted to grayscale, normalised by rotating and crop-
ping around the eye region to a canonical width of 24 pixels. The authors report that

they used energy filters to model the primate visual cortex, using 8 equally spaced
orientations of 22.5 degrees, but they do not explain how they arrived at the spatial fre-
quencies with wavelengths of 1.17, 1.65, 2.33, 3.20 and 4.67 Standard Iris Diameters 7 .
The authors refer to [Donato 99] for filter design, however, [Donato 99] report spatial
frequency values of ν ∈ {0, 1, 2, 3, 4} and go on to describe a further test using high
frequency values of ν ∈ {0, 1, 2} (scale is the inverse of frequency) and low frequency
values of ν ∈ {2, 3, 4}. They state that the performance of the high frequency subset
ν ∈ {0, 1, 2} was almost the same as ν ∈ {0, 1, 2, 3, 4}. It should be noted that the task
at hand was the classification of FACS Action Units. Intuitively, one would expect high
scale (low frequency) to generalise or provide a better representation of expressions
than low scale (high frequency), since the former is more likely to ignore artifacts.

The MPEG-7 [MPEG-7 ] Homogeneous Texture Descriptor standard has made use
of Gabor filter banks of 6 orientations and 5 scales. The number of orientations and
scales were based on previous results from [Manjunath 96, Ro 01]. Its use is pitched
towards automatic searching and browsing of images and the mean and standard devi-
ation of each filtered image, plus the mean and standard deviation of the input image
are typically used as features (30 × 2 + 2 = 62 features).

After much experimentation, values of σ = π and ν ∈ {0.0, 0.06, 1.4} were used, as in [Bhuiyan 07]. In this experiment, the kernel or mask width is automatically decided by the spatial extent of the Gaussian envelope and was obtained from the formulation implemented in [Zhou 06]:

6 \times \left( \sigma \Big/ \frac{k_{max}}{f^{\nu}} \right) + 1   (5.3.3)

Truncating the Gabor filters to a width of 6σ + 1 points (pixels), as in [Dunn 95], and using σ = π, f = √2, kmax = π/2 and ν = 0.06 gives

6 \times \left( \pi \Big/ \frac{\pi/2}{\sqrt{2}^{\,0.06}} \right) + 1   (5.3.4)

i.e. a filter size of 13 × 13 pixels. A sample filter with scale = 1.4 and orientation = π/8 can be seen in Figure 5.6. The filter response magnitudes of regions R1, R2 and R3 are shown in Figure 5.7.

7 A Standard Iris Diameter is defined as 1/7 of the distance between the centre of the left and right eyes.

Figure 5.6: Real and imaginary parts of a Gabor wavelet, scale = 1.4, orientation = π/8 (5 × actual size)

Figure 5.7: Magnitude responses of regions R1, R2 and R3 after convolution (3 × actual size)

Classification

In all of the experiments described in this chapter, an SVM with a Radial Basis Function (RBF) kernel and the regularised support vector classification algorithm (standard C-SVC) was used. As discussed previously, datasets of features were output from the NXS system as training and testing sets. Optimal parameters for the SVM models were found using the grid-search approach (the LIBSVM package includes the “grid.py” program to perform a grid search). The training set data was reused for testing in a 5-fold cross-validation setup, as depicted in Figure 5.8, in which “experiment” is used to denote the training run.

Figure 5.8: K-Fold Cross Validation [PRISM ]

Once found, the optimal parameters were transcribed to the RapidMiner product [RapidMiner ] in order to produce a confusion matrix of CA of the type shown in Table 5.7. There were slight differences between the LIBSVM and RapidMiner results, which were very likely due to differences in the cross-validation sampling algorithms and random seeds between the products. Using the GNU C library, the default seed for LIBSVM is 1, whereas RapidMiner uses a default value of 1992. Keeping the random seed the same for each test in RapidMiner improves reproducibility. RapidMiner has several sampling algorithms of its own available, and the stratified sampling type was used to create random subsets while keeping class distributions constant.

5.4 Presentation and Analysis of Data

The sample size in all of the experiments was small and the misclassification of just one or two expressions has a relatively high impact on the CA. Thus, some caution is needed in interpreting results with regard to CA where differences of only a few percentage points exist, and throughout this summary such differences are ignored. In the table column headings, “true” denotes the actual or real classification and “pred.”, abbreviated from predicted, denotes the derived classification.

5.4.1 Experiment 1

The first experiment was to determine if the NXS system could be trained to differ-
entiate between prototypical facial expressions labelled as ‘Fear’, ‘Anger’, ‘Happy’,
‘Sad’, ‘Surprise’, and ‘ Neutral’ from the Cohn-Kanade database. The recognition re-
sults using shape, shape and Gabor magnitudes concatenated, shape and AAM texture
parameters concatenated, Gabor magnitudes and AAM texture parameters are given in
Table 5.7, and those using R1, R2 and R3 shape in Table 5.8.

Classification using the shape concatenated with the Gabor magnitudes yielded the
best overall CA. Given their holistic nature, one would have thought that the shape
concatenated with the AAM texture parameters would have performed better than the
rest. Perhaps there was an advantage of having all of the image patches set to a fixed
canonical size, as was the case with the Gabor filter pre-processing. There were no
exceptionally performing feature sets that provided a CA much higher than the others.

Of the CA results obtained using R1, R2 and R3 (Table 5.8), the eyebrow (R1)
shape features achieved a 91% accuracy in the prediction of Anger. In nearly all of the
individual results, where surprise was misclassified, it was most often misclassified as
fear. Interestingly, the converse was not true.

Anger was most often the expression with the highest CA, regardless of the feature sets that were used to build the classifier. One might have expected that Fear and Surprise would most often be confused [Russell 94]; however, this is not evident in Tables 5.7 and 5.8.

(a) Confusion matrix of recognition using shape only


Accuracy: 77%

true Anger true Fear true Happy true Neutral true Surprise true Sad class precision
pred. Anger 27 3 1 5 0 0 75%
pred. Fear 2 10 0 0 3 0 66%
pred. Happy 0 3 25 1 1 0 83%
pred. Neutral 3 0 4 29 1 6 67%
pred. Surprise 0 0 0 0 16 0 100%
pred. Sad 1 0 1 1 0 11 79%
class recall 82% 63% 81% 81% 76% 65%

(b) Confusion matrix of recognition using shape and Gabor magnitudes


Accuracy: 86%

true Anger true Fear true Happy true Neutral true Surprise true Sad class precision
pred. Anger 28 0 1 2 0 1 88%
pred. Fear 2 12 0 0 2 0 75%
pred. Happy 0 3 27 1 0 0 87%
pred. Neutral 3 1 2 33 1 1 80%
pred. Surprise 0 0 0 0 17 0 100%
pred. Sad 0 0 1 0 1 15 88%
class recall 85% 75% 87% 92% 81% 88%

(c) Confusion matrix of recognition using shape and AAM texture parameters
Accuracy: 83%

true Anger true Fear true Happy true Neutral true Surprise true Sad class precision
pred. Anger 27 1 1 2 0 0 87%
pred. Fear 2 13 1 0 2 0 72%
pred. Happy 0 2 25 1 2 0 83%
pred. Neutral 4 0 4 32 0 3 74%
pred. Surprise 0 0 0 0 17 0 100%
pred. Sad 0 0 0 1 0 14 93%
class recall 82% 81% 81% 89% 81% 82%

(d) Confusion matrix of recognition using Gabor magnitudes in eyebrow, eye and mouth regions
Accuracy: 82%

true Anger true Fear true Happy true Neutral true Surprise true Sad class precision
pred. Anger 28 1 1 2 0 0 88%
pred. Fear 2 11 0 0 1 0 79%
pred. Happy 0 0 25 3 1 0 86%
pred. Neutral 3 2 5 30 2 1 70%
pred. Surprise 0 2 0 0 17 0 89%
pred. Sad 0 0 0 1 0 16 94%
class recall 85% 69% 81% 83% 81% 94%

(e) Confusion matrix of recognition using AAM texture parameters from entire face
Accuracy: 79%

true Anger true Fear true Happy true Neutral true Surprise true Sad class precision
pred. Anger 27 3 1 3 0 3 73%
pred. Fear 2 10 0 0 2 0 71%
pred. Happy 0 1 26 1 2 0 87%
pred. Neutral 4 2 4 31 1 2 70%
pred. Surprise 0 0 0 0 15 0 100%
pred. Sad 0 0 0 1 1 12 86%
class recall 82% 63% 84% 86% 71% 71%

Table 5.7: Experiment 1 - Recognition results



(a) Confusion matrix of recognition using eyebrow region (R1) shape


Accuracy: 76%

true Anger true Fear true Happy true Neutral true Surprise true Sad class precision
pred. Anger 30 1 1 2 0 1 86%
pred. Fear 0 8 0 0 2 1 73%
pred. Happy 0 2 23 3 0 1 79%
pred. Neutral 3 2 7 28 2 2 64%
pred. Surprise 0 2 0 2 16 0 80%
pred. Sad 0 1 0 1 1 12 80%
class recall 91% 50% 74% 78% 76% 71%

(b) Confusion matrix of recognition using eye region (R2) shape


Accuracy: 79%

true Anger true Fear true Happy true Neutral true Surprise true Sad class precision
pred. Anger 29 0 2 3 1 0 83%
pred. Fear 0 11 0 0 2 2 73%
pred. Happy 3 2 26 2 0 0 79%
pred. Neutral 1 1 3 28 2 3 74%
pred. Surprise 0 1 0 1 16 0 89%
pred. Sad 0 1 0 2 0 12 80%
class recall 88% 69% 84% 78% 76% 71%

(c) Confusion matrix of recognition using mouth region (R3) shape


Accuracy: 76%

true Anger true Fear true Happy true Neutral true Surprise true Sad class precision
pred. Anger 24 2 1 5 1 2 69%
pred. Fear 1 10 0 0 3 0 71%
pred. Happy 3 1 25 1 1 0 81%
pred. Neutral 4 1 5 29 0 2 71%
pred. Surprise 1 2 0 0 16 0 84%
pred. Sad 0 0 0 1 0 13 93%
class recall 73% 63% 81% 81% 76% 76%

Table 5.8: Experiment 1 - Recognition results using shape from eyebrow (R1), eye
(R2) and mouth (R3) regions

5.4.2 Experiment 2

The second experiment was to determine if the system could differentiate between facial expressions nominated as anxiety and those nominated as fear. The fitting process did not work well with 3 images (this was likely due to the relatively small number of images being used for AAM training) and, for expediency, these images were removed from the training set. The revised number of images used in the exercise is shown in Table 5.9. In total, there were 31 images selected for the experiment - 9 male and 22 female.

Fear Anxious Total


16 15 31
Table 5.9: Experiment 2 - Numbers of each expression from Cohn-Kanade database

The recognition results using shape, shape and Gabor magnitudes concatenated,
shape and AAM texture parameters concatenated, Gabor magnitudes and AAM texture
parameters are given in Table 5.10, and those using R1, R2 and R3 shape in Table 5.11.

Overall, the CA was low. Shape alone, and individual R1, R2 and R3 shapes,
yielded the best CAs. Classifiers that were built using texture features did not per-
form nearly as well as those built with shape features. One surprising result was the
recognition performance using the shape extracted from the eyebrow region, presented in Table 5.11(a). One could theorise that this was because poll participants made more use
of the eyebrow region than the eye and mouth in their assessment of the expression.
And, based on the prior discussion of anxious features at Subsection 2.5.2, it would
seem that the less exaggerated brow movements with an anxious expression would be
a discriminating factor between fear and anxiety expressions. This, of course, is quite
speculative, with such a small sample size and the fact that there were only 14 poll
participants.

Another explanation considered was that it was simply due to the efficacy of the

(a) Confusion matrix of recognition using (b) Confusion matrix of recognition using
shape only shape and Gabor magnitudes
Accuracy: 70% Accuracy: 68%

true Fear true Anxious class precision true Fear true Anxious class precision
pred. Fear 10 3 77% pred. Fear 10 4 71%
pred. Anxious 6 12 67% pred. Anxious 6 11 65%
class recall 63% 80% class recall 63% 73%

(d) Confusion matrix of recognition using


(c) Confusion matrix of recognition using Gabor magnitudes in eyebrow, eye and mouth
shape and AAM texture parameters regions
Accuracy: 58% Accuracy: 58%

true Fear true Anxious class precision true Fear true Anxious class precision
pred. Fear 8 5 62% pred. Fear 14 11 56%
pred. Anxious 8 10 56% pred. Anxious 2 4 67%
class recall 50% 67% class recall 88% 27%

(e) Confusion matrix of recognition using


AAM texture parameters from entire face
Accuracy: 55%

true Fear true Anxious class precision


pred. Fear 12 10 55%
pred. Anxious 4 5 56%
class recall 75% 33%

Table 5.10: Experiment 2 - Recognition results

SVM processing. To examine the phenomenon further, two post-hoc experiments were devised, reusing the eyebrow region (R1) data. The first was to randomly change the class labels in the samples that had been labelled as anxious and fear and to re-run the experiment. The results are shown in Table 5.12(a). This resulted in a much lower CA of 68%.
The second post-hoc experiment was to test how well regression analysis would separate the classes. Epsilon Support Vector Regression (SVR) was used and the optimal parameters were determined using an alternative grid search program.8 The result is shown in Table 5.12(b). This time the CA was lower, at 65%.

8 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#grid_parameter_search_for_regression, last accessed 1 March 2010

(a) Confusion matrix of recognition using (b) Confusion matrix of recognition using eye
eyebrow region (R1) shape region (R2) shape
Accuracy: 90% Accuracy: 78%

true Fear true Anxious class precision true Fear true Anxious class precision
pred. Fear 15 2 88% pred. Fear 13 4 76%
pred. Anxious 1 13 93% pred. Anxious 3 11 79%
class recall 94% 87% class recall 81% 73%

(c) Confusion matrix of recognition using


mouth region (R3) shape
Accuracy: 74%

true Fear true Anxious class precision


pred. Fear 11 3 79%
pred. Anxious 5 12 71%
class recall 69% 80%

Table 5.11: Experiment 2 - Recognition results using shape from eyebrow (R1), eye
(R2) and mouth (R3) regions

(a) Confusion matrix of recognition using eyebrow


region (R1) shape and randomised class labels
Accuracy: 68%

true Fear true Anxious class precision


pred. Anxious 10 5 67%
pred. Fear 5 11 69%
class recall 67% 69%

(b) Confusion matrix of regression analysis using


eyebrow region (R1) shape
Accuracy: 65%

true Fear true Anxious class precision


pred. Anxious 9 5 64%
pred. Fear 6 11 65%
class recall 60% 69%

Table 5.12: Experiment 2 - Post-hoc



5.4.3 Experiment 3

The third experiment was to establish how well the facial expressions of anxiety could
be classified when infused into a larger set of emotional expressions including ‘Fear’,
‘Anger’, ‘Happy’, ‘Sad’, ‘Surprise’, and ‘ Neutral’. Thus, the focus of the experiment
was not the overall classification but to establish if an anxious expression could be
distinguished from 6 prototypical expressions, which included the fear expressions
used in Experiment 2.
The recognition results using shape, shape and Gabor magnitudes concatenated,
shape and AAM texture parameters concatenated, Gabor magnitudes and AAM texture
parameters are given in Table 5.13, and those using R1, R2 and R3 shape in Table 5.14. It
was anticipated that fear and anxiety would have the lowest CA and this was confirmed
in the CA of every classifier. One might have expected that fear would have been mis-
classified most often as anxious, rather than any other expression, and vice-versa but
this was not the case. Whilst there was a slight tendency towards anxious expressions
being misclassified as fear, they were also misclassified as every other expression, other
than those labelled as ‘Sad’.
Use of shape concatenated with Gabor magnitudes produced the best overall results, despite the CA performance of the AAM texture parameters being slightly less than that of the Gabor magnitudes. However, as stated previously, with such small samples, CA differences of just a few percentage points are within reason.

(a) Confusion matrix of recognition using shape only


Accuracy: 80%

true Anger true Fear true Happy true Neutral true Surprise true Sad true Anxious class precision
pred. Anger 29 1 1 1 0 0 0 91%
pred. Fear 1 10 0 0 3 0 3 59%
pred. Happy 2 1 28 0 0 0 2 85%
pred. Neutral 1 0 1 33 1 7 1 75%
pred. Surprise 0 2 0 0 17 0 0 89%
pred. Sad 0 0 1 2 0 10 0 77%
pred. Anxious 0 2 0 0 0 0 9 82%
class recall 88% 63% 90% 92% 81% 59% 60%

(b) Confusion matrix of recognition using shape and Gabor magnitudes


Accuracy: 84%

true Anger true Fear true Happy true Neutral true Surprise true Sad true Anxious class precision
pred. Anger 31 0 1 3 0 1 0 86%
pred. Fear 0 12 0 0 3 0 4 63%
pred. Happy 0 1 28 0 0 0 2 90%
pred. Neutral 2 1 1 32 0 3 1 80%
pred. Surprise 0 1 0 0 18 0 0 95%
pred. Sad 0 0 1 1 0 13 0 87%
pred. Anxious 0 1 0 0 0 0 8 89%
class recall 94% 75% 90% 89% 86% 76% 53%

(c) Confusion matrix of recognition using shape and AAM texture parameters
Accuracy: 82%

true Anger true Fear true Happy true Neutral true Surprise true Sad true Anxious class precision
pred. Anger 29 2 1 1 0 0 0 88%
pred. Fear 0 10 0 0 2 0 3 67%
pred. Happy 0 0 28 1 1 0 3 85%
pred. Neutral 4 0 2 33 0 3 2 75%
pred. Surprise 0 2 0 0 18 0 0 90%
pred. Sad 0 0 0 1 0 14 0 93%
pred. Anxious 0 2 0 0 0 0 7 78%
class recall 88% 62% 90% 92% 86% 82% 47%

(d) Confusion matrix of recognition using Gabor magnitudes in eyebrow, eye and mouth regions
Accuracy: 75%

true Anger true Fear true Happy true Neutral true Surprise true Sad true Anxious class precision
pred. Anger 28 1 1 4 0 1 0 80%
pred. Fear 0 6 0 0 2 0 4 50%
pred. Happy 0 1 25 3 0 0 2 81%
pred. Neutral 5 2 3 28 2 1 1 67%
pred. Surprise 0 1 0 0 17 0 1 89%
pred. Sad 0 0 0 1 0 15 0 94%
pred. Anxious 0 5 2 0 0 0 7 50%
class recall 85% 38% 81% 78% 81% 88% 47%

(e) Confusion matrix of recognition using AAM texture parameters from entire face
Accuracy: 76%

true Anger true Fear true Happy true Neutral true Surprise true Sad true Anxious class precision
pred. Anger 29 1 1 1 1 3 1 78%
pred. Fear 1 9 0 1 3 0 5 47%
pred. Happy 0 2 27 2 0 0 0 87%
pred. Neutral 3 1 3 31 2 3 1 70%
pred. Surprise 0 1 0 0 15 0 1 88%
pred. Sad 0 0 0 1 0 11 0 92%
pred. Anxious 0 2 0 0 0 0 7 78%
class recall 88% 56% 87% 86% 71% 65% 47%

Table 5.13: Experiment 3 - Recognition results



(a) Confusion matrix of recognition using eyebrow region (R1) shape


Accuracy: 76%

true Anger true Fear true Happy true Neutral true Surprise true Sad true Anxious class precision
pred. Anger 30 2 1 2 0 1 0 83%
pred. Fear 0 8 0 0 4 3 2 47%
pred. Happy 2 1 25 1 0 0 3 78%
pred. Neutral 1 0 3 32 0 2 2 80%
pred. Surprise 0 1 0 0 15 0 0 94%
pred. Sad 0 2 0 1 2 11 0 69%
pred. Anxious 0 2 2 0 0 0 8 67%
class recall 91% 50% 81% 89% 71% 65% 53%

(b) Confusion matrix of recognition using eye region (R2) shape


Accuracy: 74%

true Anger true Fear true Happy true Neutral true Surprise true Sad true Anxious class precision
pred. Anger 27 0 2 3 0 0 0 84%
pred. Fear 0 9 0 0 2 1 2 64%
pred. Happy 1 1 26 1 0 0 1 87%
pred. Neutral 4 1 3 30 5 5 1 61%
pred. Surprise 0 1 0 1 14 0 3 74%
pred. Sad 1 1 0 1 0 11 0 79%
pred. Anxious 0 3 0 0 0 0 8 73%
class recall 82% 56% 84% 83% 67% 65% 53%

(c) Confusion matrix of recognition using mouth region (R3) shape


Accuracy: 76%

true Anger true Fear true Happy true Neutral true Surprise true Sad true Anxious class precision
pred. Anger 27 1 2 3 1 2 1 73%
pred. Fear 0 10 0 0 2 0 3 67%
pred. Happy 0 0 24 1 0 0 2 89%
pred. Neutral 6 0 4 31 0 2 1 70%
pred. Surprise 0 1 0 0 16 0 1 89%
pred. Sad 0 1 0 1 1 13 0 81%
pred. Anxious 0 3 1 0 1 0 7 58%
class recall 82% 63% 77% 86% 76% 76% 47%

Table 5.14: Experiment 3 - Recognition results using shape from eyebrow (R1), eye
(R2) and mouth (R3) regions

5.4.4 Experiment 4 - Baseline

The ultimate objective of the fourth experiment was to test whether the classifier built
from the Cohn-Kanade database of images, in Experiment 1, could be used to predict
the facial expressions in the Feedtum database of images, which were recorded using
different subjects and under different lighting conditions. This is a somewhat novel
undertaking, probably because of the difficulty of the exercise. Classifiers trained using shape vectors, shape and Gabor texture concatenated, Gabor texture, and the eyebrow (R1), eye (R2) and mouth (R3) region shapes were used.

As a preliminary step, a CA test was conducted using an SVM built from prototyp-
ical facial expressions of ‘Fear’, ‘Anger’, ‘Happy’, ‘Sad’, ‘Surprise’, and ‘Neutral’,
sourced entirely from images within the Feedtum database. This was so that the CA
could be compared to that attained in Experiment 1, which used the Cohn-Kanade
database. This is referred to here as the “baseline” experiment. The “baseline” was
acquired by SVM classification, similar to Experiment 1, using 5-fold cross validation.
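
For illustration only, this baseline step can be reproduced in outline with scikit-learn, whose SVC class wraps the same LIBSVM library used in these experiments. The feature and label arrays below are placeholders, not the thesis data:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for AAM-derived feature vectors and expression labels.
features = np.random.default_rng(0).normal(size=(72, 40))
labels = np.repeat(np.arange(6), 12)           # six expression classes, 12 samples each

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # SVC wraps LIBSVM
scores = cross_val_score(clf, features, labels, cv=5)
print("5-fold CA: %.1f%% (+/- %.1f%%)" % (100 * scores.mean(), 100 * scores.std()))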

At first, the AAM that had been built for Experiment 1, using images from the Cohn-
Kanade database, was used to fit the landmark points to images from the Feedtum
database. As can be seen in Figure 5.9(a), the landmark points were not placed perfectly.
This was likely due to the relatively small number of samples used from the Cohn-
Kanade database (≈ 200) to build the initial AAM (in the order of 500 − 1, 000 would
be preferable). Since the object of the experiments was not to test the AAM per se, a Feedtum-specific AAM was built to overcome this. The accuracy of fitting improved with the new model; an example is shown in Figure 5.9(b).

The baseline recognition results using shape, shape and Gabor magnitudes concatenated, shape and AAM texture parameters concatenated, Gabor magnitudes, and AAM texture parameters are given in Table 5.15; results using R1, R2 and R3 shape are given in Table 5.16.
5.4. PRESENTATION AND ANALYSIS OF DATA 99

Tables 5.15 and 5.16 show that the CA was lower than in Experiment 1. However, the number of samples is smaller, so each misclassification produces a relatively larger variation in the CA. Notwithstanding that, a plausible reason for the lower CA is that the expressions portrayed in the Feedtum database are much less pronounced. Figures 5.10(a) and 5.10(b) are reported within the Feedtum database transcriptions,9 as the apex of expressions of anger and fear respectively. One would think that human judges might have difficulty in correctly classifying these expressions. In addition, the intensity of these expressions is clearly in contrast to that shown in Figure 5.4.

9
Feedtum metadata transcriptions at http://www.mmk.ei.tum.de/˜waf/fgnet/
metadata-feedtum.csv, image files are anger/0003 3/p 086.jpg and fear/0007 2/p 110.jpg, last
access 1 March 2010
100 CHAPTER 5. SENSING FOR ANXIETY

(a) Feedtum image fitted using general AAM trained on Cohn-Kanade database

(b) Feedtum image fitted using specific AAM trained on Feedtum database

Figure 5.9: Experiment 4 - Images fitted using generalised and specific AAMs
5.4. PRESENTATION AND ANALYSIS OF DATA 101

(a) Confusion matrix of recognition using shape only


Accuracy: 56%

true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 7 0 0 0 1 0 88%
pred. Fear 0 6 0 1 1 1 67%
pred. Happy 1 2 10 2 1 1 59%
pred. Sad 0 1 0 2 2 4 22%
pred. Surprise 0 0 2 1 5 0 63%
pred. Neutral 1 2 2 4 1 9 47%
class recall 78% 55% 71% 20% 45% 60%

(b) Confusion matrix of recognition using shape and Gabor magnitudes


Accuracy: 74%

true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 8 0 0 0 0 1 89%
pred. Fear 0 8 0 0 2 1 73%
pred. Happy 0 1 11 0 1 1 79%
pred. Sad 0 1 0 6 0 1 75%
pred. Surprise 0 0 2 1 8 0 73%
pred. Neutral 1 1 1 3 0 11 65%
class recall 89% 73% 79% 60% 73% 73%

(c) Confusion matrix of recognition using shape and AAM texture parameters
Accuracy: 54%

true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 7 2 0 0 2 0 64%
pred. Fear 1 4 2 1 1 1 40%
pred. Happy 0 1 9 0 2 0 75%
pred. Sad 0 0 1 4 1 4 40%
pred. Surprise 0 1 2 1 4 0 50%
pred. Neutral 1 3 0 4 1 10 53%
class recall 78% 36% 64% 40% 36% 67%

(d) Confusion matrix of recognition using Gabor magnitudes in eyebrow, eye and mouth regions
Accuracy: 64%

true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 7 0 0 0 0 3 70%
pred. Fear 1 7 0 1 3 1 54%
pred. Happy 0 0 11 0 1 1 85%
pred. Sad 0 2 0 5 0 2 56%
pred. Surprise 0 1 1 2 7 0 64%
pred. Neutral 1 1 2 2 0 8 57%
class recall 78% 64% 79% 50% 64% 53%

(e) Confusion matrix of recognition using AAM texture parameters from entire face
Accuracy: 64%

true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 7 1 3 2 0 1 50%
pred. Fear 0 6 0 3 0 1 60%
pred. Happy 1 1 10 0 0 0 83%
pred. Sad 0 0 0 2 2 1 40%
pred. Surprise 0 0 0 0 8 0 100%
pred. Neutral 1 3 1 3 1 12 57%
class recall 78% 55% 71% 20% 73% 80%

Table 5.15: Experiment 4 - Baseline recognition results using Feedtum database


102 CHAPTER 5. SENSING FOR ANXIETY

(a) Confusion matrix of recognition using eyebrow region (R1) shape


Accuracy: 46%

true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 6 1 0 0 1 0 75%
pred. Fear 0 6 2 1 3 1 46%
pred. Happy 0 2 8 1 2 2 53%
pred. Sad 1 2 0 2 0 6 18%
pred. Surprise 1 0 2 1 4 0 50%
pred. Neutral 1 0 2 5 1 6 40%
class recall 67% 55% 57% 20% 36% 40%

(b) Confusion matrix of recognition using eye region (R2) shape


Accuracy: 57%

true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 8 1 0 0 1 0 80%
pred. Fear 0 4 1 0 5 2 33%
pred. Happy 0 1 10 1 1 0 77%
pred. Sad 0 2 1 5 0 4 42%
pred. Surprise 0 2 2 2 4 0 40%
pred. Neutral 1 1 0 2 0 9 69%
class recall 89% 36% 71% 50% 36% 60%

(c) Confusion matrix of recognition using mouth region (R3) shape


Accuracy: 59%

true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 7 0 0 0 2 0 78%
pred. Fear 1 5 1 0 0 3 50%
pred. Happy 0 1 11 0 2 0 79%
pred. Sad 0 2 0 3 1 3 33%
pred. Surprise 1 0 2 1 6 0 60%
pred. Neutral 0 3 0 6 0 9 50%
class recall 78% 45% 79% 30% 55% 60%

Table 5.16: Experiment 4 Baseline Feedtum database - Recognition results using shape
from eyebrow (R1), eye (R2) and mouth (R3) regions

(a) Feedtum image of anger expression (b) Feedtum image of fear expression

Figure 5.10: Experiment 4 - Feedtum images of anger and fear.


5.4. PRESENTATION AND ANALYSIS OF DATA 103

5.4.5 Experiment 4 - Classification against Cohn-Kanade SVM

Next, an attempt was made to automatically classify expressions in images from the
Feedtum database against the SVM models built in Experiment 1 using images from the
Cohn-Kanade database. As in the first part of the experiment, the images were fitted
with the Feedtum-specific AAM.

The PCA coefficients, after performing PCA on the Gabor filter magnitudes, were
obtained by projecting into the eigenspace that was created in Experiment 1. Simi-
larly, the scaling parameters obtained in Experiment 1 were applied when scaling the
features obtained in Experiment 4.
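
A minimal sketch of this projection step, assuming scikit-learn and placeholder arrays (the real feature dimensions and sample counts differ):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
gabor_ck = rng.normal(size=(200, 1024))       # stand-in for Experiment 1 Gabor magnitudes
gabor_feedtum = rng.normal(size=(70, 1024))   # stand-in for Experiment 4 Gabor magnitudes

pca = PCA(n_components=50).fit(gabor_ck)      # eigenspace fitted on Experiment 1 data only
coeffs_feedtum = pca.transform(gabor_feedtum) # Experiment 4 features projected into that eigenspace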

The recognition results for the second part of Experiment 4 using shape, shape and Gabor magnitudes concatenated, shape and AAM texture parameters concatenated, Gabor magnitudes and AAM texture parameters are given in Table 5.18, and those using R1, R2 and R3 shape in Table 5.19. The respective chance levels are shown in Table 5.17.

Anger Fear Happy Sad Surprise Neutral Total


Number of subjects 9 11 14 10 11 15 70
Chance 0.13 0.16 0.2 0.14 0.16 0.21 1
% 13 16 20 14 16 21 100

Table 5.17: Experiment 4 - Chance level for each expression
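
The chance levels in Table 5.17 are simply each class's share of the 70 samples; a short check of the arithmetic:

counts = {"Anger": 9, "Fear": 11, "Happy": 14, "Sad": 10, "Surprise": 11, "Neutral": 15}
total = sum(counts.values())                                    # 70
chance = {name: round(n / total, 2) for name, n in counts.items()}
print(chance)   # {'Anger': 0.13, 'Fear': 0.16, 'Happy': 0.2, 'Sad': 0.14, ...}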

The CA results from this experiment were all very low. One reason that was con-
sidered is the practice of scaling the data prior to building the classifier. Scaling is
used not only in SVM, but also in Neural Network classification. The case for scaling
is presented in [Hsu 03]:

“The main advantage is to avoid attributes in greater numeric ranges dominate those in smaller numeric ranges. Another advantage is to avoid numerical difficulties during the calculation. Because kernel values usually depend on the inner products of feature vectors, e.g. the linear kernel and the polynomial kernel, large attribute values might cause numerical problems. We recommend linearly scaling each attribute to the range [−1, +1] or [0, 1].”

A key requirement in the use of scaling is that the same method used to scale the training data is applied to the test data. Operationally, the scaling parameters found for each feature during training need to be saved and then applied to the test data. One criticism of this approach is that it “overtrains” the data: it has the potential to inflate the CA in exercises where the training set is reused for testing after the entire training set has been scaled, i.e. no previously unseen data is introduced in testing. This is sometimes referred to as “peeking at the data”. When the classifier is applied to new and previously unseen data, there is no guarantee that the scaling parameters remain valid, as may have been the case in the second part of Experiment 4.
Experiments 1, 2, 3, and 4-baseline were all conducted using scaled and normalised
data. To get some idea of the effect that scaling might have had on the second part of
Experiment 4, an informal, post-hoc classification exercise was conducted, whereby
the data was normalised but not scaled. The results, which are not included here, did
not vary significantly and scaling was ruled out as a major contributor to the poor CA
performance.
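
The scaling convention described above can be sketched as follows; the [−1, +1] range follows [Hsu 03], and the array and variable names are illustrative only:

import numpy as np

def fit_scaling(train):
    # Per-attribute minima and maxima, found on the training data only.
    return train.min(axis=0), train.max(axis=0)

def apply_scaling(x, lo, hi):
    span = np.where(hi - lo == 0, 1.0, hi - lo)   # guard against constant attributes
    return 2.0 * (x - lo) / span - 1.0            # linear scaling to [-1, +1]

rng = np.random.default_rng(2)
train, test = rng.normal(size=(100, 20)), rng.normal(size=(30, 20))
lo, hi = fit_scaling(train)
train_scaled = apply_scaling(train, lo, hi)
test_scaled = apply_scaling(test, lo, hi)         # unseen data may fall outside [-1, +1]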
5.4. PRESENTATION AND ANALYSIS OF DATA 105

(a) Confusion matrix of recognition using shape only


Accuracy: 33%

true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 6 6 1 8 4 9 18%
pred. Fear 2 2 1 0 2 2 22%
pred. Happy 0 1 11 0 2 3 65%
pred. Sad 0 0 0 0 0 0 0%
pred. Surprise 1 2 1 0 3 0 43%
pred. Neutral 0 0 0 2 0 1 33%
class recall 67% 18% 79% 0% 27% 67%

(b) Confusion matrix of recognition using shape and Gabor magnitudes


Accuracy: 41%

true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 9 2 1 4 0 8 38%
pred. Fear 0 5 3 2 4 4 28%
pred. Happy 0 1 10 1 3 2 59%
pred. Sad 0 0 0 0 0 0 0%
pred. Surprise 0 3 0 0 4 0 57%
pred. Neutral 0 0 0 3 0 1 25%
class recall 100% 45% 71% 0% 36% 67%

(c) Confusion matrix of recognition using Gabor magnitudes in eyebrow, eye and mouth regions
Accuracy: 17%

true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 6 1 5 0 0 0 50%
pred. Fear 3 5 1 6 5 9 17%
pred. Happy 0 4 1 4 6 6 5%
pred. Sad 0 0 0 0 0 0 0%
pred. Surprise 0 1 0 0 0 0 0%
pred. Neutral 0 0 7 0 0 0 0%
class recall 67% 45% 7% 0% 0% 0%

Table 5.18: Experiment 4 - Recognition results using SVMs built in experiment 1


106 CHAPTER 5. SENSING FOR ANXIETY

(a) Confusion matrix of recognition using eyebrow region (R1) shape


Accuracy: 26%

true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 0 0 5 5 0 9 0%
pred. Fear 0 4 0 2 0 1 57%
pred. Happy 0 1 2 1 0 1 40%
pred. Sad 1 0 1 0 1 0 0%
pred. Surprise 8 5 5 1 10 2 32%
pred. Neutral 0 1 1 1 0 2 40%
class recall 0% 36% 14% 0% 92% 13%

(b) Confusion matrix of recognition using eye region (R2) shape


Accuracy: 14%

true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 3 4 11 9 2 10 8%
pred. Fear 6 4 1 1 5 5 18%
pred. Happy 0 1 0 0 0 0 0%
pred. Sad 0 0 0 0 1 0 0%
pred. Surprise 0 0 0 0 3 0 100%
pred. Neutral 0 2 2 0 0 0 0%
class recall 33% 36% 0% 0% 27% 0%

(c) Confusion matrix of recognition using mouth region (R3) shape


Accuracy: 24%

true Anger true Fear true Happy true Sad true Surprise true Neutral class precision
pred. Anger 6 5 10 2 2 6 19%
pred. Fear 0 0 2 0 3 0 0%
pred. Happy 0 0 1 0 0 0 100%
pred. Sad 1 2 0 6 0 6 40%
pred. Surprise 0 0 0 0 1 0 100%
pred. Neutral 2 4 1 2 5 3 18%
class recall 67% 0% 7% 60% 9% 20%

Table 5.19: Experiment 4 - Recognition results using shape from eyebrow (R1), eye
(R2) and mouth (R3) regions against SVMs built in experiment 1
5.5. CONCLUSIONS AND EVALUATION 107

5.5 Conclusions and Evaluation

5.5.1 Hypothesis 1

Using computerised facial expression recognition techniques, anxious expressions can


be differentiated10 from fearful expressions.

Fear and anxious expressions were sourced from a set of expressions having action
units corresponding to both fear and anxiety. In the absence of a verified database
of anxious expressions, the Cohn-Kanade database was a useful starting point, but it
is not clear how the slightly pronounced nature of the fear expressions might have
affected the results. A larger, more ecologically valid set of training data, both for the
fearful and the anxious expressions, would provide a more reliable set of results to
support this hypothesis. Obviously, procuring such a database is a major undertaking.
Nevertheless, the results from Experiment 2 suggest that anxiety can be automatically
distinguished from fear.

The results from Experiment 2 suggest that shape is a much better discriminator than texture in the differentiation of fearful and anxious expressions. Although much more work is needed to draw a firm conclusion, human judges may rely, to a large extent, on the shape of the eyebrow region when trying to differentiate between fearful and anxious expressions.

5.5.2 Hypothesis 2

Using computerised facial expression recognition techniques, anxious expressions can


be differentiated from a larger set of prototypical expressions.

Anxious expressions were not classified with a high degree of accuracy. In Ex-
periment 3, although they were quite often misclassified as fear, they were also mis-
10
“Differentiated” is defined as classification performance better than chance.
108 CHAPTER 5. SENSING FOR ANXIETY

classified as every other expression except those labelled as ‘Sad’. Again, a much
larger sample set is needed to draw a conclusion. One possibility to improve accuracy would be to use an ensemble of classifiers: first recognising a more broadly labelled set of expressions in which fear and anxiety are combined under one label, and then performing a binary classification between fear and anxiety using shape alone. A sketch of this two-stage idea follows.
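
The sketch below was not implemented in these experiments; the array and label names are placeholders, and scikit-learn's SVC (a LIBSVM wrapper) stands in for the SVM implementation:

import numpy as np
from sklearn.svm import SVC

def train_two_stage(shape, texture, labels):
    # labels: 1-D NumPy array of expression names.
    # Stage 1: broad labels, with fear and anxiety merged into a single class.
    broad = np.where(np.isin(labels, ["Fear", "Anxious"]), "Fear/Anxious", labels)
    stage1 = SVC().fit(np.hstack([shape, texture]), broad)
    # Stage 2: binary fear-versus-anxiety classifier trained on shape features only.
    mask = broad == "Fear/Anxious"
    stage2 = SVC().fit(shape[mask], labels[mask])
    return stage1, stage2

def predict_two_stage(stage1, stage2, shape, texture):
    pred = stage1.predict(np.hstack([shape, texture]))
    mask = pred == "Fear/Anxious"
    if mask.any():
        pred[mask] = stage2.predict(shape[mask])  # refine merged predictions using shape only
    return pred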

5.5.3 Question Set 1

How does facial expression recognition performance, i.e. CA, vary when using:

• the location of facial landmark points only;

• the location of facial landmark points concatenated with Gabor magnitude;

• the location of facial landmark points concatenated with AAM texture parameters;

• AAM texture parameters only; and

• Gabor magnitude only?

In Experiment 1, shape when concatenated with Gabor magnitudes produced the


highest CA but, given the number of samples, there was no significant percentage ad-
vantage over the use of shape concatenated with AAM texture parameters, or use of
only the Gabor magnitudes. Again in Experiment 3, shape concatenated with Gabor
magnitudes produced the highest CA, but in this case, it outperformed the use of only
the Gabor magnitudes.

Shape concatenated with Gabor magnitudes also produced the best CA in the “baseline” Experiment 4. This was a slightly odd result: even though the overall CA of Gabor magnitudes alone and of AAM texture parameters alone was the same (64%), concatenating Gabor magnitudes to shape lifted the CA achieved by shape alone from 56% to 74%, yet concatenating AAM texture parameters to shape resulted in a slight decrease in CA (to 54%). One could speculate about the manner in which the Gabor magnitude features better complement the shape data. However, the topic of how best to fuse heterogeneous feature sets, i.e. shape and texture data, in order to achieve the best CA needs far more investigation.
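
One simple way to fuse the heterogeneous feature sets discussed above is early fusion: standardise each block and concatenate. The sketch below is only an illustration of the general idea, not the method used in the experiments (which scaled each attribute to [−1, +1] before concatenation), and the dimensions are illustrative:

import numpy as np

def zscore(block, eps=1e-8):
    return (block - block.mean(axis=0)) / (block.std(axis=0) + eps)

rng = np.random.default_rng(3)
shape = rng.normal(size=(72, 136))     # e.g. 68 landmarks x 2 coordinates
gabor = rng.normal(size=(72, 50))      # e.g. PCA-reduced Gabor magnitudes

fused = np.hstack([zscore(shape), zscore(gabor)])   # one vector per sample, fed to a single SVM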

5.5.4 Question Set 2

In the NXS system, described in Chapter 4, the face is subdivided into three regions:

• R1 - The eyebrow region;

• R2 - The eye region; and

• R3 - the mouth region

There are two parts to this question:

1. Is one facial region generally more reliable for recognition of expressions?


It could not be concluded that any one region of the face was consistently associated with a higher CA rate than the other regions.

2. Is one facial region more reliable for recognition of a specific expression?


One surprising result was the CA based on just the eyebrow (R1) shape features, shown in Figure 5.10(a). Although purely speculative, it suggests that, when faced with a binary choice between fearful and anxious expressions, a viewer may place more emphasis on the eyebrow features.

Knowing which facial regions are used by humans to differentiate facial expres-
sions would be useful for the field of facial expression recognition.
110 CHAPTER 5. SENSING FOR ANXIETY

5.5.5 Question Set 3

Would performance be sufficient to achieve on-line recognition of facial expressions in a video running at 30 frames per second?

It was clear when running the experiments that the execution time to classify the sets of images increased noticeably as the number of Gabor filters increased. However, these observations are only anecdotal and were not formally recorded.

5.6 Overall Evaluation

In summary, despite the lack of samples and natural data, the results suggest that the
recognition of anxious expressions is possible but becomes more difficult when fear-
ful expressions are also present. The difficulty increases when more primary expres-
sions are added to the classification problem. The exercise demonstrates that facial
expression classification is, in general, a difficult task and, in some situations, in the
absence of contextual information and/or temporal data revealing facial dynamics, may
not even be possible. Moreover, even with contextual and temporal evidence present,
the fact that a prototypical expression can take many forms, e.g. a ‘happy’ expression
can be portrayed with or without opening the mouth, compounds the degree of diffi-
culty. Any attempt at recognition may not be reliable without the presence of semantic
information.

The second part of Experiment 4 demonstrated that, even using two popular and creditable databases, a classifier built from images in one database did not achieve a high CA in predicting expressions from the other, despite Gabor filtering being reasonably invariant to lighting conditions. This echoes the preliminary results reported in [Whitehill 09], albeit with much worse performance here.

In this instance, it seems that the notion of using a database recorded under one
set of conditions to recognise expressions in a database recorded under a different set
5.6. OVERALL EVALUATION 111

of lighting conditions was overly simplistic. Part of the solution, as demonstrated by


[Whitehill 09], would be to 1) vastly increase the size of the training set; and 2) include
image samples recorded under a wide range of lighting conditions.
Gabor filter processing proved cumbersome. Several studies have attempted to improve its efficiency and reduce its computational overhead, either by:

• automatically selecting the best features to convolve with the Gabor filters. There are many proposed schemes for doing this [Littlewort 06, Shen 05, Zhou 06, Zhou 09] (although in [Zhou 06, Zhou 09] the processing is applied to reduce the Gabor feature set after convolution); or

• optimising the Gabor wavelet basis for convolution, e.g. through the use of genetic algorithms (GA).

The GA approach has shown some promise in addressing both problems but, ironically, it too imposes a computational burden [Li 07, Tsai 01, Wang 02].

In this set of experiments, all three facial regions were convolved using the same basis and no attempt was made to optimise for individual facial regions. [Ro 01] suggests that one way to improve Gabor filter performance is to reduce the computational load by selecting the Gabor filter basis in a pattern-dependent way. Given the similarity of the image patches used in this set of experiments, it is difficult to envisage that region-specific Gabor filter parameters would yield a significantly higher CA or faster convolution times.
During the setup stage of the experiments, a great deal of effort was spent trying to optimise the system in areas such as Gabor filter settings, facial region image sizes, data scaling and the PCA process. As discussed above, much of this calibration work came down to trial and error, and the quantitative impacts are not reported here. The impact on percentage CA from any of these measures was not as significant as changes to the SVM parameters, and not comparable to the variation in Experiment 2 between shape-based and texture-based classification. Re-architecting the recognition process to make use of a hierarchy of specialised classifiers, depending on the expression, could be expected to improve the CA.
You largely constructed your depression. It wasn’t given to you. Therefore, you can deconstruct it.

Albert Ellis

6
Depression Analysis Using Computer Vision

6.1 Introduction and Motivation for Experiments

Facial expressions convey information about a person’s internal state. The ability to
correctly interpret another’s facial expressions is important to the quality of social in-
teractions. In phatic speech, in particular, the affective facial processing loop, i.e. inter-
preting another’s facial expressions then responding with one’s own facial expression,
plays a critical role in the ability to form and maintain relationships.

113
114 CHAPTER 6. DEPRESSION ANALYSIS USING COMPUTER VISION

The affective facial processing loop was discussed in Subsection 2.6.2 of Chapter 3.
A predisposition to misinterpret expressions often underlies dysphoria, e.g. anxiety and
depression. More specifically, the impaired inhibition of negative affective material
could be an important cognitive factor in depressive disorders. Recent studies using fMRI have provided some evidence that the severity of depression in MDD groups
correlates with increased neural responses to pictures with negative content [Fu 08,
Lee 07]. In turn, this bias to favour negative material by depressed patients has been
shown to be signaled in their resulting facial expressions.
This chapter explores the feasibility of using state-of-the-art, low-cost, unobtrusive
techniques to measure facial activity and expressions, and then applies them to a real-
life application. The motivation for the experiments presented in this chapter is to:

• prove that the concepts described in previous chapters can be applied in a practical


experiment involving the analysis of video;

• taking depression as an example, try to establish if automatic expression analysis


can be used to track facial activity and expression in video; and

• test if automatic facial activity and expression analysis could be used in a real-
life application.

Section 6.2 states the hypotheses that are to be tested. Section 6.3 describes the
methodology used in the experiments. The results are presented in Section 6.4 and
Section 6.5 concludes the chapter.

6.2 Hypotheses

To sharpen the focus of the experiments, the following hypotheses and questions were
generated on the basis of the motivation for the experiments and the literature review
in Chapter 3:
6.3. METHODOLOGY 115

1. When viewing the stimuli, patients with a clinical diagnosis of unipolar melan-
cholic depression will show less facial activity than control subjects and patients
with other types of depression; and

2. When viewing the stimuli, patients with a clinical diagnosis of unipolar melancholic depression will show a smaller repertoire of facial expressions than control subjects and patients with other types of depression.

6.3 Methodology

6.3.1 Experimental setup

The experiment described in this chapter is currently incorporated in a collaborative


project at the Black Dog Institute, Sydney, Australia. Twenty-seven participants were recorded
using a high-quality video camera [AVT ] while viewing affective content and answer-
ing emotive questions. Table 6.1 summarises the ratios of female to male and controls
to patients in the interview,1 while Table 6.2 summarises the subjects in the exercise.
Some additional patient information is given in Appendix C.
Control Patient Total
Male 7 7 14
Female 9 4 13
Total 16 11 27

Table 6.1: Participant details

The experimental mood/affect induction paradigm is presented to participants of


the trial by way of an interactive computer package [NBS ], in a setup conceptually
similar to that in Figure 6.1.
In the paradigm, the participant’s facial and vocal expressions are recorded as they
face the computer display. In its entirety, the interview or session takes around 30
minutes. The experimental paradigm includes:
1
The terms “interview”, “session” and “recording” are used synonymously unless otherwise stated.
116 CHAPTER 6. DEPRESSION ANALYSIS USING COMPUTER VISION

Patient Id Age Gender Diagnosis Clinical Diagnosis MINI Control Patient


Co m 01 19 m 1
Co m 02 20 m 1
Co f 03 20 f 1
Co m 04 20 m 1
Co f 05 19 f 1
Co f 06 22 f 1
Co f 07 20 f 1
Co f 08 21 f 1
Co f 09 19 f 1
Co f 10 30 f 1
Pa m UP-Mel 01 48 m Unclear - Grey UP-Mel 1
Pa m UP-Mel 02 56 m UP-Mel UP-Mel 1
Pa m UP-Mel 03 22 m UP-Mel UP-Mel 1
Pa m UP-Mel 04 48 m UP-Mel UP-Mel 1
Co m 11 39 m 1
Co f 12 26 f 1
Co m 13 33 m 1
Co m 14 31 m 1
Co f 15 31 f 1
Co m 16 20 m 1
Pa m UP-Mel 05 26 m UP-MEL UP-MEL 1
Pa f UP-NonMel 06 34 f BP2, melancholic UP NON-MEL 1
Pa m Unkown 07 45 m Unkown Unkown 1
Pa f UP BP2 08 32 f BP2 BP2 1
Pa f UP-NonMel 09 27 f no clinical diagnosis UP-NON MEL 1
Pa f UP-Mel 10 50 f UP-MEL UP-MEL 1
Pa m PD 11 53 m PD UP-MEL 1

Total Participants 16 11
Participant Ids in the table take the form
XX G CD ID
XX - “Co” for Control or “Pa” for Patient, G - Gender, CD - Diagnosis (Patients only), ID - Sequential Id number (control and
patients numbered separately)

Table 6.2: Participant details and diagnosis

Watching movie clips Short movie segments of around two minutes each, some posi-
tive and some negative are presented. With the exception of one clip, each movie
has previously been rated for its affective content [Gross 95].

Watching and rating International Affective Picture System (IAPS) pictures Pictures
from the IAPS [Lang 05] compilation are presented and participants rate each im-
age as either positive or negative. Reporting logs enable correlation of the image
presentations, the participant’s ratings, and their facial activity.

Reading sentences containing affective content Two sets of sentences used in the study by [Brierley 07, Medforda 05] are read aloud. The first set contains emotionally arousing “target” words. The second set repeats the first, with the “target” words replaced by well-matched neutral words.

Figure 6.1: Experimental setup

Answering specific questions Finally, participants are asked to describe events that
had aroused significant emotions. For instance, idiographic questions such as,
“Describe an incident that made you feel really sad.”

It is important to note that this chapter, and the thesis as a whole, report only on the first
section of the experimental paradigm, i.e. watching movie clips. The initial experi-
mental setup consisted of the movies listed in Table 6.3, with intended induced emotion
shown in parenthesis. The list is referred to as the “Old Paradigm”. After some sub-
jects had participated in the experiment, the movie sequence was re-evaluated, and it
was decided to add a “fear” sample, and incorporate a longer “surprise” clip. This is
shown in Table 6.4 and is referred to as the “New Paradigm”.

Movie (emotion)
Bill Cosby (Happy)
The Champ (Sad)
Weather (Happy)
Sea of Love (Surprise)
Cry Freedom (Anger)

Table 6.3: Old paradigm - movie list

A summary of the numbers of participants in each paradigm is shown at Table 6.5.


118 CHAPTER 6. DEPRESSION ANALYSIS USING COMPUTER VISION

Movie (emotion)
Bill Cosby (Happy)
The Champ (Sad)
Weather (Happy)
Silence of the Lambs (Fear)
Cry Freedom (Anger)
The Shining (Fear)
Capricorn One (Surprise)

Table 6.4: New paradigm - movie list

Control Patient Total


Old Paradigm 10 4 14
New Paradigm 6 7 13
Total 16 11 27

Table 6.5: Participant summary

Figure 6.2 shows a control subject being recorded as he watches the movie Cry
Freedom. The interview, from a participant’s view, is shown in Figure 6.3. When the
interview is in progress, the Research Assistant can monitor the session on the laptops
shown in Figure 6.4. The laptop on the left in Figure 6.4 is displaying a frame from the
movie clip Bill Cosby, while the one on the right shows the recording of the participant.

6.3.2 System Setup and Processing

The experimental mood/affect induction paradigm is presented to participants by way


of an interactive computer package [NBS ]. Videos of the subjects viewing the paradigm
are recorded with an Allied Vision Technologies (AVT) Pike 100C camera [AVT ], cap-
turing 800 × 600 pixel images at 24.94 frames per second. The camera is connected
through a Firewire IEEE 1394b connection to an Apple MacBook Pro. The videos are
recorded in very high quality in order to support future research. At this point in time,
however, prior to analysis, the videos are exported to Microsoft’s AVI format. This is
because AVI is the only container format supported by OpenCV [OpenCV ], which is
used in the analysis software (the NXS system).

Once the video has been recorded, analysis begins by capturing sample frames
from each video, which are used to 1) build a person-specific AAM for each person (person-specific AAMs give better fitting quality [Gross 05] and there is no need at this juncture to have generic AAMs); and 2) construct an SVM classifier for each person’s emotional expressions (which are rated subjectively at this point in time). For the reasons explained in Section 5.3.2 of Chapter 5, the IEBM method [Saragih 06] of building the AAM and fitting the model to the image was chosen, as was the LIBSVM [Chang 01] implementation of SVM.

Figure 6.2: Control subject watching video clip - Cry Freedom

Once the AAM and SVM have been built, frames are captured from the video at 200 ms intervals (anecdotally, this seemed a reasonable interval given the speed at which facial features move; the system places no restriction on the rate).
As each frame is captured, frontal facial images are detected using the Viola and Jones
[Viola 01] technique to determine the global location of the face in the image. Next,
the AAM is used to track and measure the local shape and texture features. As described in Chapter 5, “shape” refers to the collective landmark points, which are captured as a set of normalised x, y Cartesian coordinates. The features are then used to classify the expressions using an SVM classifier [Chang 01]. All outputs are stored within the system to allow for post-processing, which is described in the next two Subsections.

Figure 6.3: Participant’s view of the interview (video clip - Silence of the Lambs)
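
The frame-sampling and face-detection stage of this pipeline can be sketched with OpenCV's Python bindings as below. This is not the NXS implementation: the file name is a placeholder, the bundled Haar cascade stands in for the face detector, and the AAM fitting and SVM classification steps are only indicated by a comment.

import cv2

# Frontal face detector; cv2.data.haarcascades ships with the opencv-python package.
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture("participant.avi")      # AVI export, the container read by OpenCV here
interval_ms, t = 200, 0.0                      # sample one frame every 200 ms

while True:
    cap.set(cv2.CAP_PROP_POS_MSEC, t)          # seek to the next sampling point
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    # Each detected face region would be passed to the AAM fitter and SVM classifier here.
    t += interval_ms
cap.release()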

Measuring Facial Activity

With the raw feature data captured, Algorithm 4 is used to measure the collective
movement of the landmark points between each frame. Although not shown in the
algorithm, extreme movements are ignored if the movement falls outside of predefined
thresholds. This is to cater for situations where the face detection in a frame has failed
and the AAM “fitting” has not converged, which typically leaves the landmark points
spread around the image.
6.3. METHODOLOGY 121

Figure 6.4: Laptops displaying stimuli and recording of subject

Algorithm 4: Measuring facial activity

input: set of facial landmark points for every image of each video
int i = 0
for each video do
    int j = 0
    for each set of facial landmark points (each frame after the first) do
        tempx ← distances between x coordinates of this and the previous frame
        tempy ← distances between y coordinates of this and the previous frame
        // one.norm is the sum of absolute values
        allSubjectsMovements[i][j] ← one.norm(tempx) + one.norm(tempy)
        j++
    i++
output: set of scalar values representing the distance between each consecutive pair of landmark point sets, for each video
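
A NumPy rendering of Algorithm 4 is sketched below; it is illustrative rather than the NXS code, and the rejection threshold for failed fits is an arbitrary placeholder.

import numpy as np

def facial_activity(landmarks, max_jump=50.0):
    """landmarks: array of shape (n_frames, n_points, 2) holding x, y coordinates."""
    diffs = np.abs(np.diff(landmarks, axis=0))   # |dx|, |dy| between consecutive frames
    movement = diffs.sum(axis=(1, 2))            # one-norm of the displacement per frame pair
    return movement[movement <= max_jump]        # drop extreme jumps (failed AAM fits)

rng = np.random.default_rng(4)
example = rng.normal(scale=0.1, size=(300, 68, 2))   # placeholder: 300 sampled frames, 68 points
print(facial_activity(example).sum())                # total facial activity for one recording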
122 CHAPTER 6. DEPRESSION ANALYSIS USING COMPUTER VISION

The NXS system outputs the facial activity measurements for each subject into a file
of comma-separated values, which can then be imported to a third-party product for
further analysis, e.g. Excel.

Tracking Prototype Expressions

The classified expression is stored within the system for each captured image. The
images, captured at a rate of one every 200 ms and marked up with the automatically
fitted landmark points (these are fitted with a person-specific AAM, built using only
a few frames for training), can be assembled as an image sequence and played as a
short video. This assists in verifying that the AAM has fitted properly, and consistently
between frames. The coloured slider below the images, shown in Figure 6.5, pro-
vides a visual representation of the facial expressions over the course of the interview.
The colours represent the classification, pink - happy, blue - sad and white - neu-
tral. In Figure 6.5, the slider has been positioned to a period of happy expressions,
which coincides with the Bill Cosby film clip. This allows a visual confirmation that
the expression recognition has worked successfully. Each reconstructed participant
“movie” can be played individually or along with several other participant “movies”,
thus allowing a comparison of participants’ facial responses at a specific time in the
interview.
The NXS system outputs the list of classifications for each subject into a file of
comma-separated values, which can then be imported to a third-party product for fur-
ther analysis, e.g. Excel.
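
As an example of the post-processing this allows, the per-frame classifications can be tallied per movie clip with a few lines of pandas; the column names below are assumptions about the export format rather than the actual NXS schema.

import pandas as pd

# Hypothetical excerpt of one participant's exported classifications.
df = pd.DataFrame({
    "clip": ["Bill Cosby", "Bill Cosby", "The Champ", "The Champ", "The Champ"],
    "expression": ["happy", "neutral", "sad", "sad", "neutral"],
})
counts = df.groupby(["clip", "expression"]).size().unstack(fill_value=0)
print(counts)   # rows: movie clips; columns: happy/neutral/sad frame counts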
6.3. METHODOLOGY 123

Figure 6.5: The NXS System - Replaying captured images


124 CHAPTER 6. DEPRESSION ANALYSIS USING COMPUTER VISION

6.4 Presentation and Analysis of Data

6.4.1 Introduction

Two sets of results are presented, one for each paradigm, in the form of charts at the end of the chapter. The data used to construct the charts is available in Appendix B. The y axis of the facial activity diagrams, e.g. Figure 6.6, is simply the internally derived measurement of movement described in Algorithm 4, and no unit of measurement has been attached to it.

6.4.2 Old Paradigm

Figure 6.6 displays a stacked column chart of facial activity for every Old Paradigm participant over the entire video clip session. Overall, control subjects tended to have a higher facial activity score than patients. Figure 6.7 displays a clustered column chart of the same facial activity data, i.e. it is simply another view of the data in Figure 6.6. Figure 6.8 compares the accumulated facial activity over time, across the entire series of movie clips. Each sub-figure in Figure 6.9 shows the facial activity specific to the relevant movie clip.

Overall, control subject Co m 04 had a low facial activity score, but his score during the “sad” stimuli was in keeping with the other controls. This was an interesting result since, anecdotally, the “sad” movie clip (The Champ) seemed to evoke strong feelings in all of the other control subjects. On viewing Co m 04’s recording, he appears to be of Asian background, and it is not known whether a cultural factor influenced the results.

Figure 6.10 shows the number of happy expressions displayed by each subject, for each film clip and over the entire series of clips. Figure 6.11 shows the corresponding number of sad expressions, and Figure 6.12 the number of neutral expressions.
CHAPTER 6. DEPRESSION ANALYSIS USING COMPUTER VISION

Figure 6.6: Old Paradigm - Stacked column chart comparing facial activity (Co - Control, Pa - Patient). [Chart: total facial activity per participant, stacked by movie clip - Bill Cosby (Happy), The Champ (Sad), Weather (Happy), Sea of Love (Surprise), Cry Freedom (Anger).]
Figure 6.7: Old Paradigm - Clustered column chart comparing facial activity (Co - Control, Pa - Patient). [Chart: facial activity per participant, clustered by movie clip.]
CHAPTER 6. DEPRESSION ANALYSIS USING COMPUTER VISION

Figure 6.8: Old Paradigm - Line chart comparing accumulated facial activity (Co - Control, Pa - Patient). [Chart: accumulated facial activity per participant across Bill Cosby (Happy), The Champ (Sad), Weather (Happy), Sea of Love (Surprise) and Cry Freedom (Anger).]
Figure 6.9: Old Paradigm - Facial activity for each video. [Charts: (a) Bill Cosby, (b) The Champ, (c) Weather, (d) Sea of Love, (e) Cry Freedom; facial activity per participant for each clip.]
130 CHAPTER 6. DEPRESSION ANALYSIS USING COMPUTER VISION

Figure 6.10: Old Paradigm - Number of happy expressions. [Chart: counts per participant for each clip - Bill Cosby, The Champ, Weather, Sea of Love, Cry Freedom.]


6.4. PRESENTATION AND ANALYSIS OF DATA 131

Figure 6.11: Old Paradigm - Number of sad expressions. [Chart: counts per participant for each clip.]


132 CHAPTER 6. DEPRESSION ANALYSIS USING COMPUTER VISION

Figure 6.12: Old Paradigm - Number of neutral expressions. [Chart: counts per participant for each clip.]


6.4. PRESENTATION AND ANALYSIS OF DATA 133

6.4.3 New Paradigm

Figure 6.13 displays a stacked column chart of facial activity. Although patient Pa f UP BP2 08 has a very high score, examination of the video revealed that she displayed non-purposeful or habitual mouth movement throughout the recording. Two patients, Pa f UP-NonMel 09 and Pa f UP-NonMel 06, with a clinical diagnosis of unipolar non-melancholic depression, also had a high facial activity score. Patients Pa f UP-Mel 10 and Pa m UP-Mel 05, both diagnosed with unipolar melancholic depression, scored lowest on the facial activity scale. Patient Pa m PD 11, who had a MINI diagnosis of unipolar melancholic depression and a clinical diagnosis of panic disorder, also had a low facial activity score. Figure 6.14 displays a clustered column chart of the same facial activity data as Figure 6.13. Each of the sub-figures in Figure 6.15 shows the facial activity specific to a movie clip.
Figure 6.16 shows the number of happy expressions displayed by each subject, for each film clip and over the entire series of clips. Figure 6.17 shows the corresponding number of sad expressions, and Figure 6.18 the number of neutral expressions.
CHAPTER 6. DEPRESSION ANALYSIS USING COMPUTER VISION

Figure 6.13: New Paradigm - Stacked column chart comparing facial activity (Co - Control, Pa - Patient). [Chart: total facial activity per participant, stacked by movie clip - Bill Cosby (Happy), The Champ (Sad), Weather (Happy), Silence of the Lambs (Fear), Cry Freedom (Anger), The Shining (Fear), Capricorn One (Surprise).]
Figure 6.14: New Paradigm - Clustered column chart comparing facial activity (Co - Control, Pa - Patient). [Chart: facial activity per participant, clustered by movie clip.]

Figure 6.15: New Paradigm - Facial activity for each video. [Charts: (a) Bill Cosby, (b) The Champ, (c) Weather, (d) Silence of the Lambs, (e) Cry Freedom, (f) The Shining, (g) Capricorn One; facial activity per participant for each clip.]
6.4. PRESENTATION AND ANALYSIS OF DATA 137

Figure 6.16: New Paradigm - Number of happy expressions. [Chart: counts per participant for each clip - Bill Cosby, The Champ, Weather, Silence of the Lambs, Cry Freedom, The Shining, Capricorn One.]


138 CHAPTER 6. DEPRESSION ANALYSIS USING COMPUTER VISION

Figure 6.17: New Paradigm - Number of sad expressions. [Chart: counts per participant for each clip.]


6.4. PRESENTATION AND ANALYSIS OF DATA 139

Figure 6.18: New Paradigm - Number of neutral expressions. [Chart: counts per participant for each clip.]


140 CHAPTER 6. DEPRESSION ANALYSIS USING COMPUTER VISION

6.5 Evaluation and Conclusions

There was some support for both hypotheses.

6.5.1 Hypothesis 1

When viewing the stimuli, patients with a clinical diagnosis of unipolar melancholic
depression will show less facial activity than control subjects and patients with other
types of depression.
In both the Old Paradigm and New Paradigm results, there was a tendency for patients
with unipolar melancholic depression to have reduced facial activity.

6.5.2 Hypothesis 2

When viewing the stimuli, patients with a clinical diagnosis of unipolar melancholic depression will show a smaller repertoire of facial expressions than control subjects and patients with other types of depression.
In both the Old Paradigm and New Paradigm results, there was a tendency for patients with unipolar melancholic depression to show fewer positive facial expressions. However, analysis of the results reveals that the same set of patients also displayed fewer negative (sad) expressions. This is in keeping with their lower facial activity scores and their tendency towards a high number of neutral expressions.

6.6 Overall Evaluation

Although the results tend to confirm the hypotheses, realistically, the sample size is far too small. One possibility raised after the analysis had been undertaken is that some of the patients show psychomotor agitation, whereas others show retardation; this would suggest two clusters of patients. Many more recordings are needed before any conclusions can be drawn. Other factors such as age, gender and cultural background also need to be considered, i.e. to test whether other attributes closely correlate with facial activity and expressiveness. At this stage, the result information is simply embedded within the charts. Section 8.3.2 of the concluding chapter briefly describes an informal application of the Mann-Whitney non-parametric test.
Interestingly, if one examines the charts reporting facial expressions against the intended elicited emotion, i.e. Figures 6.10, 6.11, 6.16 and 6.17, there is an obvious correlation between the control subjects’ expressions and the intended emotion. This is not so clear in the case of the clinical subjects, and Figures 6.12 and 6.18 reveal that those subjects dominate the counts of “neutral” expressions, regardless of which movie clip is being viewed.
The motivation for the exercise was to test the feasibility of the approach and, on that basis, the results confirm that this method of facial activity and expression analysis could be used successfully in this and similar studies. Anecdotally, even with the small sample size, there were some interesting patterns. For instance, participants with other types of MDD seemed to have slightly higher levels of facial activity and expression than control subjects and patients with unipolar melancholic depression. Another interesting observation was that, in one case, ethnic background seemed to influence the facial activity and expression responses to the emotion in the video clip. The control subject’s responses during the sad clip were similar to those of the other control subjects, whereas his responses during the other clips were much lower. Obviously, many more samples would be required to support this notion, but it would make an interesting follow-up study.
142 CHAPTER 6. DEPRESSION ANALYSIS USING COMPUTER VISION
The map is not the territory.

Alfred Korzybski

7
Semantics and Metadata
7.1 Introduction

This chapter reflects on some of the lessons learned in the earlier experimental tasks,
and in keeping with objective 4 of the dissertation

“Identify avenues for improvement in the emotional expression recognition


process.”

considers the limitations in emotional expression recognition, and ways in which they
could be overcome. The problem domain is expanded well beyond the experiments

143
144 CHAPTER 7. SEMANTICS AND METADATA

described in this thesis, to the field of affective computing. Affective computing is com-
puting that relates to, arises from, or deliberately influences emotion or other affective
phenomena [Picard 97].

This chapter is organised as follows:

Section 7.2 discusses some of the strengths and weaknesses of the FER approach used
in the earlier experiments. Approaches for improvement are suggested and, once the
background has been explained, an example framework for affective computing is de-
scribed in Section 7.3. This is further explained by way of examples in Sections 7.4
and 7.5.

7.2 Discussion

With the empirical footing in place, the requirements of an expression recognition sys-
tem can be revisited, and several observations made regarding the earlier experiments:

1. Culture-specific rules influence the display of affect [Cowie 05a], and in the ex-
periment in Chapter 6, the control subject Co m 04’s low facial activity score
gave cause for speculation that there might be ethnic or cultural factors influenc-
ing the result. This raises the question of how far a purely statistical approach to
emotional expression recognition can extend. One would think that accounting for ethnic background or culture, as well as other factors such as context and personality type, is beyond the limits of such an approach.

2. The results of the MDD experiment in Chapter 6 compared participants’ expres-


sions during specific movie clips (see Figure 6.10). The actual stimuli infor-
mation was not recorded in NXS, and the expressions had to be matched to the
temporal sequences manually and a posteriori. This was quite time consum-
7.2. DISCUSSION 145

ing, and although a technical solution could be found to synchronise the start
of the video recording with the stimuli presentation, a solution that incorporated
detailed temporal information about the stimuli would be more useful. For instance, knowing in which frames the “punch-line” of a movie clip occurs would enable the latency between subjects’ reactions and the stimuli to be easily measured and compared (a sketch of such a latency measurement is given after this list).

3. The experiment in Chapter 6 was confined to the first part of the paradigm,
where participants view video clips, i.e. FER only. Subsequent steps in the ex-
perimental paradigm include viewing of IAPS [Lang 05] images and an open
interview, i.e. a question & answer stage at the end. The processing of the audio-
visual sections of the interview is much more difficult. During this stage, some
method of detecting when each interlocutor speaks is necessary and some means
of representing the dialogue is required. A participant’s facial display of a par-
ticular emotion will obviously be different during speech to that when viewing
a video clip. The AAM will need to be able to track the face during speech and,
possibly, additional classifiers will be needed to be able to match expressions in
speech.

4. With a larger sample size in the MDD experiment, it might be useful to incor-
porate other variables, e.g. the participant’s age and gender, or, even to consider
variables such as temperament [Parker 02] and personality type [Parker 06]. Of
course, this subject information can be recorded on a spreadsheet or word pro-
cessing document, as was the case in this experiment. However, as the amount of
data increases, so too does the degree of difficulty in maintaining spreadsheets.
Having the information stored within the system performing the analysis, NXS in
this case, would potentially create much more comprehensive outputs for analy-
sis.

5. In the course of both the experiments, although the aggregated or summarised


data was exported to files of comma-separated values, the raw data was actu-
ally stored in a system-specific format. Details pertaining to the subjects in the
images, the frame sequence numbers, the location of the facial landmark points
and the expression classified for each frame, are all in a format known only to
NXS. Thus, without additional effort, the data is system-specific and unlikely to
be usable, in its raw form, by other systems or studies.

In recent research, there have been attempts to add rules and record descriptions
to the emotional expression recognition process (audio and video), and each project
has devised its own approach. Some have used rulebases and case-based reasoning
type information [Pantic 04a, Valstar 06a], whereas others have attempted a more com-
plex level of integration [Vinciarelli 08, Vinciarelli 09]. [Athanaselisa 05, Devillers 05,
Liscombe 05] describe efforts to represent non-basic emotional patterns and context
features in real-life emotions in audio-video data. [Athanaselisa 05] demonstrate that
recognition of speech can be improved by incorporating a dictionary of affect with
the standard ASR dictionary. [Cowie 05b] reports on a “fuzzy” rule based system for
interpreting facial expressions.

Each of the studies mentioned so far has used its own technique to incorporate
rules, and this implies 1) a need to devise some common method of defining rules; and,
2) a requirement to record rules and recognition results in a standard, reusable way.
In the remainder of this section, two complementary and overlapping concepts are in-
troduced. Ontologies offer a means to unify and extend affective computing research
and these are explained in Subsection 7.2.1. Subsection 7.2.2 examines two descrip-
tion, or markup, schemes. One scheme, known as EmotionML [emo 10, Baggia 08], is
specifically for emotional content, and the other, MPEG-7, is broader and intended for
audio-visual content (including a modest amount of affective content).

7.2.1 Use of Ontologies to Describe Content

An ontology is a statement of concepts which facilitates the specification of an agreed


vocabulary within a domain of interest. Ontologies have been used for some time in
the annotation of web pages. More recently, they have been used to semantically index
commercial video productions [Benitez 03, Bertini 05, Grana 05, Hunter 02, Jaimes 03,
Lagoze 01, Luo 06, Navigli 03, Obrenovic 05, Rahman 05, Tsinaraki 03, Tsinaraki 07,
Song 05, Polydoros 06].

Figure 7.1: Human disease ontology

Figure 7.2: Cell ontology

In its crudest form, an ontology can be thought of as a hierarchical database of
concepts. Figures 7.1 and 7.2 illustrate two ontologies, one of human disease, and an-
other of cells. Once the concepts have been populated with values, it becomes akin
to a knowledge-base. One popular and well supported software product for building
ontologies, Protégé [Protégé ], is an ontology editor and knowledge-base framework,
and is supported by Stanford University. One very powerful feature of an ontology is
its ability to link to other ontologies, e.g. medical and gene ontologies1.

1
Examples can be located at http://bioportal.bioontology.org/ontologies, last
accessed 22 April 2010.

7.2.2 Semantic Markup

Emotion Markup Language (EmotionML)

A perplexing issue faced in automatic affect recognition is the reuse and verification
of results. To compare and verify studies requires some consistent means of describ-
ing affect and, until recently, there has been no universally accepted markup system
for describing emotional content. Various schemes have arisen, e.g. Synchronized
Multimedia Integration Language (SMIL), Speech Synthesis ML (SSML), Extensi-
ble Multi-Modal Annotation ML (EMMA), Virtual Human ML (VHML), and HU-
MAINE’s Emotion Annotation and Representation Language (EARL) [Schröder 07].

EmotionML, a W3C working draft as at 29 July 2010, provides a general solution


to the markup problem and is intended to be of use in 1) the manual annotation of
data; 2) the automatic recognition of emotion-related states from user behavior; and
3) the generation of emotion-related system behavior [emo 10, Baggia 08].

EmotionML is an XML-based language for representing and annotating emotions


in technological contexts. Using EmotionML, emotional expression can be described
using either categories, dimensions or even appraisal theory. Examples of XML el-
ements for annotation include “Emotion descriptor”, which could be a category or
a dimension; “Intensity”, expressed in terms of numeric values or discrete labels;
and, “Start” and “End”.
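To make the idea concrete, the following is a minimal sketch of how a single recognition result might be serialised in the spirit of EmotionML, using Python's standard XML library. The element and attribute names (emotion, category, intensity, start, end) mirror the descriptors listed above but are illustrative rather than taken verbatim from the draft, and the namespace URI is assumed.

# Illustrative only: serialise one classified expression as an
# EmotionML-style annotation. Element and attribute names follow the
# descriptors discussed above, not the normative W3C schema.
import xml.etree.ElementTree as ET

EMO_NS = "http://www.w3.org/2009/10/emotionml"  # assumed namespace URI

def annotate(category, intensity, start_ms, end_ms):
    root = ET.Element("emotionml", {"xmlns": EMO_NS})
    emotion = ET.SubElement(root, "emotion",
                            {"start": str(start_ms), "end": str(end_ms)})
    ET.SubElement(emotion, "category", {"name": category})
    ET.SubElement(emotion, "intensity", {"value": str(intensity)})
    return ET.tostring(root, encoding="unicode")

# e.g. an anxious expression detected between 1.2 s and 2.0 s of a clip
print(annotate("anxiety", 0.7, 1200, 2000))

An annotation of this kind, attached to each classified frame range exported from NXS, would make results directly reusable by other systems and studies.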

MPEG-7

Another initiative is that of the Moving Picture Experts Group (MPEG) which has
developed the MPEG-7 standard for audio, audio-video and multimedia description
[MPEG-7 ]. MPEG-7 uses metadata structures or Multimedia Description Schemes
(MDS) for describing and annotating audio-video content. These are provided as a
standardised way of describing the important concepts in content description and con-
tent management in order to facilitate searching, indexing, filtering, and access. They
are defined using the MPEG-7 Description Definition Language (DDL), which is XML
Schema-based. The output is a description expressed in XML, which can be used
for editing, searching, filtering. The standard also provides a description scheme
for compressed binary form for storage or transmission [Chiariglione 01, Rege 05,
Salembier 01]. Examples of the use of MPEG-7 exist in the video surveillance industry
where streams of video are matched against descriptions of training data [Annesley 05].
The standard also caters for the description of affective content.

7.3 An Affective Communication Framework

To illustrate how the concepts previously described might be applied, an exemplary


approach is presented which consists of 1) a generic model of affective communica-
tion; and 2) a set of ontologies. The model and ontologies, intended to be used in
conjunction with one another, describe:

1. affective communication concepts;

2. affective computing research; and

3. affective computing resources.



Figure 7.3 presents an example of a base-level model of emotions in spoken language.


It includes speaker and listener, in keeping with the Brunswikian lens model, as pro-
posed by [Scherer 03]. Modelling attributes of both speaker and listener caters for the
fact that the listener’s cultural and social presentation vis-à-vis the speaker may also
influence judgement of emotional content. It also includes a number of factors that
influence the expression of affect in spoken language. Each of these factors is briefly
discussed, with more attention given to context, as this is seen as a much neglected
factor in the study of automatic affective state recognition.

Figure 7.3: A generic model of affective communication

7.3.1 Factors in the Proposed Framework

Context

Context is linked to modality, and emotion is strongly multi-modal in the way that cer-
tain emotions manifest themselves, favouring one modality over another [Cowie 05a].
Physiological measurements change depending on whether a subject is sedentary or
mobile. A stressful context, such as an emergency hot-line, air-traffic control, or a war
zone, is likely to yield more examples of affect than everyday conversation.


It is also likely to produce quite different responses from those seen in a recording studio used to record
posed facial expressions for scientific experiments. [Stibbard 01] recommended

“...the expansion of the data collected to include relevant non-phonetic


factors including contextual and inter-personal information.”

His findings underline the fact that most studies so far took place in an artificial
environment, ignoring social, cultural, contextual and personality aspects which, in
natural situations, are major factors modulating speech and affect presentation. The
model depicted in Figure 7.3 takes into account the importance of context in the anal-
ysis of affect in speech.
There have been some attempts to include contextual data in emotion recogni-
tion research [Schuller 09a, Wollmer 08]. [Devillers 05] includes context annotation
as metadata in a corpus of medical emergency call centre dialogues. Context infor-
mation was treated as either task-specific or global in nature. Unlike [Devillers 05],
the model described in this dissertation does not differentiate between task-specific
and global context as the difference is seen merely as temporal, i.e. pre-determined or
established at “run-time”.
The HUMAINE project [HUMAINE 06] has proposed that at least the following
issues be specified (a sketch of how such fields might be recorded as structured metadata
follows the list):

• Agent characteristics (age, gender, race);

• Recording context (intrusiveness, formality, etc.);

• Intended audience (kin, colleagues, public);

• Overall communicative goal (to claim, to sway, to share a feeling, etc.);

• Social setting (none, passive other, interactant, group);



• Spatial focus (physical focus, imagined focus, none);

• Physical constraint (unrestricted, posture constrained, hands constrained); and

• Social constraint (pressure to expressiveness, neutral, pressure to formality).

“It is proposed to refine this scheme through work with the HUMAINE
databases as they develop.”
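As a rough illustration only, and not a schema proposed by HUMAINE or implemented in NXS, the sketch below shows how these fields might be recorded as structured metadata for a single recording session, in Python.

# Hypothetical sketch: recording the HUMAINE-style context fields listed
# above as structured metadata for one recording session.
from dataclasses import dataclass, asdict
import json

@dataclass
class RecordingContext:
    agent_age: int
    agent_gender: str
    recording_context: str    # e.g. "clinical interview, unobtrusive camera"
    intended_audience: str    # e.g. "kin", "colleagues", "public"
    communicative_goal: str   # e.g. "to claim", "to share a feeling"
    social_setting: str       # e.g. "none", "passive other", "interactant"
    spatial_focus: str        # e.g. "physical focus", "imagined focus", "none"
    physical_constraint: str  # e.g. "unrestricted", "posture constrained"
    social_constraint: str    # e.g. "neutral", "pressure to formality"

ctx = RecordingContext(34, "female", "clinical interview", "clinician",
                       "to share a feeling", "interactant", "physical focus",
                       "unrestricted", "neutral")
print(json.dumps(asdict(ctx), indent=2))  # stored alongside the video session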

[Millar 04] developed a methodology for the design of audio-visual data corpora of the
speaking face in which the need to make corpora re-usable is discussed. The method-
ology, aimed at corpus design, takes into account the need for speaker and speaking
environment factors.

The model presented in this dissertation treats agent characteristics and social con-
straints separately from context information. This is because their effects on discourse are
seen as separate topics for research.

Agent characteristics

As [Scherer 03] points out, most studies are either speaker oriented or listener oriented,
with most being the former. This is significant when one considers that the emotional
state of someone labelling affective content in a corpus could impact the label that is
ascribed to a speaker’s message, or facial expression.

The literature has not given much attention to the role that agent characteristics,
such as personality type, play in affective presentation. This is surprising when one
considers the obvious difference in expression between extroverted and introverted
types. Intuitively, one would expect a marked difference in signals between these types
of speakers. One would also think that knowing a person’s personality type would be
of great benefit in applications monitoring an individual’s emotions [Parker 06].

At a more physical level, agent characteristics such as facial hair, whether spectacles
are worn, and head and eye movements all affect the ability to visually
detect and interpret emotions.

Cultural

Culture-specific rules influence the display of affect [Cowie 05a], and gender and age
are established as important factors in shaping conversation style and content in many
societies. Studies by [Koike 98] and [Shigeno 98] have shown that it is difficult to
identify the emotion of a speaker from a different culture and that people will pre-
dominantly use visual information to identify emotion. Put in the perspective of
the proposed model, cognisance of the speaker’s and listener’s cultural backgrounds, the
context, and whether visual cues are available obviously influences the effectiveness of
affect recognition.

Physiological

It might be stating the obvious but there are marked differences in speech signals and
facial expressions between people of different age, gender and health. The habitual
settings of facial features and vocal organs determine the speaker’s range of possible
visual appearances and sounds produced. The configuration of facial features, such as
chin, lips, nose, and eyes, provide the visual cues, whereas the vocal tract length and
internal muscle tone guide the interpretation of acoustic output [Millar 04].

Social

Social factors temper spoken language to the demands of civil discourse [Cowie 05a].
For example, affective bursts are likely to be constrained in the case of a minor relating
to an adult, yet totally unconstrained in a scenario of sibling rivalry. Similarly, a social

setting in a library is less likely to yield loud and extroverted displays of affect than a
family setting.

Internal state

Internal state has been included in the model for completeness. At the core of affective
states is the person and their experiences. Recent events such as winning the lottery or
losing a job are likely to influence emotions and their display.

7.3.2 Influences in the Display of Affect

[Figure 7.4 content: example factors in two groups, namely modulating factors (cultural, social, context) and production and detection factors (agent characteristics, physiological, internal state), with examples such as the speaker's age and gender vis-à-vis the listener, familiarity and rapport with the listener or system, ambient conditions, dialogue turn, extroversion or introversion, voice quality, vocal tract length, appearance (e.g. facial hair, head and eye movement), impairment, and recent events such as lottery wins or losses.]

Figure 7.4: Use of the model in practice

To help explain the differences between the factors that influence the expression of
affect, Figure 7.4 lists some examples. The factors are divided into two groups. On the
left is a list of factors that modulate or influence the speaker’s display of affect, i.e.
cultural, social and contextual. On the right are the factors that influence production
or detection in the speaker or listener, respectively, i.e. personality type, physiological


make-up and internal state.

7.4 A Set of Ontologies

The three ontologies described in this section are a means by which the model de-
scribed in the previous section could be implemented. Figure 7.5 depicts the rela-
tionships between the ontologies and gives examples of each. Formality and rigour
increase towards the apex of the diagram.

It needs to be emphasised that this proposal is not confined just to experiments such
as those described within this dissertation. It is much broader, and the intended users of
the set of ontologies extend beyond research exercises. There could be many types of
users such as librarians, decision support systems, application developers and teachers.

Figure 7.5: A set of ontologies for affective computing



7.4.1 Ontology 1 - Affective Communication Concepts

The top level ontology correlates to the model discussed in Section 7.3 and is a for-
mal description of the domain of affective communication. It contains internal state,
personality, physiological, social, cultural, and contextual factors. It can be linked to
external ontologies in fields such as medicine, anatomy, and biology. A fragment of
the top-level, domain ontology of concepts is shown in Figure 7.6.

Figure 7.6: A fragment of the domain ontology of concepts

7.4.2 Ontology 2 - Affective Communication Research

This ontology is more loosely defined and includes the concepts and semantics used to
define research in the field. It has been left generic and can be further subdivided into
an affective computing domain at a later stage, if needed. It is used to specify the rules
by which accredited research reports are catalogued. It includes metadata to describe,
for example,

• classification techniques used;

• the method of eliciting speech, e.g. acted or natural; and

• manner in which corpora or results have been annotated, e.g. categorical or di-
mensional.

Creating an ontology this way introduces a common way of reporting the knowledge
and facilitates intelligent searching and reuse of knowledge within the domain. For
instance, an ontology just based on the models described here could be used to find all
research reports where:

SPEAKER(internalState='happy',
physiology='any',
agentCharacteristics='extrovert',
social='friendly', context='public',
elicitation='dimension')
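As an indication of how such a query could be posed against an actual ontology store, the sketch below expresses the same constraints as a SPARQL query using the Python rdflib package. The namespace, property names and file name are hypothetical; they would be fixed by the research ontology itself.

# Hypothetical sketch: the namespace, predicates and file are placeholders,
# not part of any published affective computing ontology.
import rdflib

g = rdflib.Graph()
g.parse("research_reports.ttl", format="turtle")  # assumed RDF export of Ontology 2

QUERY = """
PREFIX ac: <http://example.org/affective-computing#>
SELECT ?report WHERE {
    ?report ac:speakerInternalState "happy" ;
            ac:agentCharacteristics "extrovert" ;
            ac:socialSetting "friendly" ;
            ac:context "public" ;
            ac:elicitation "dimension" .
}
"""

for row in g.query(QUERY):
    print(row.report)  # URIs of matching research reports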

7.4.3 Ontology 3 - Affective Communication Resources

This ontology is more correctly a repository containing both formal and informal rules,
as well as data. It is a combination of semantic, structural and syntactic metadata. It
contains information about resources such as corpora, toolkits, audio and video sam-
ples, and raw research result data.

The next section explains the bottom level, application ontology, in more detail.

7.5 An Exemplary Application Ontology for Affective Sensing

Figure 7.7 illustrates an example application ontology for affective sensing, in a context
of investigating dialogues. In the context of the experiments in Chapter 6, a dialogue
is the interaction between the participant and the stimuli. During a dialogue, various
events can occur, triggered by one of the dialogue participants and recorded by the
sensor system. These are recorded as time stamped instances of events, so that they
can be easily identified.

In the ontology, the roles for each interlocutor, sender and receiver, are distin-
guished. This caters for the type of open interview session in the experiments in Chap-
ter 6. At various points in time, each interlocutor can take on different roles. On the
sensory side, facial, gestural, textual, speech, physiological and verbal2 cues are dis-
tinguished. The ontology could be extended for other cues and is meant to serve as
an example here, rather than a complete list of affective cues. Finally, the emotion
classification method used in the investigation of a particular dialogue is also recorded.

2
The difference between speech and verbal cues here being spoken language versus other verbal
utterances.

Figure 7.7: An application ontology for affective sensing


8
Conclusions

8.1 Introduction

In this dissertation, state-of-the-art computer vision techniques that apply to Facial


Expression Recognition (FER) have been reviewed. After outlining the requirements
for building an emotional expression recognition system, the design of such a system,
the Any Expression Recognition System (NXS), was presented. Two experiments were
devised, with the objectives of 1) proving the concepts of the NXS system and their
implementation; and 2) in a much broader sense, establishing if Facial Expression
Recognition (FER) techniques could be applied to more subtle emotional expressions,


such as anxiety and depression. Following the experiments, the practical limitations of
the statistical approach to FER were discussed, along with ways in which these could
be overcome through the use of a model and ontologies for affective computing.

This chapter concludes the dissertation and is organised as follows:

In Section 8.2, each of the objectives as stated in Chapter 1 are examined, and consid-
eration is given to the results from the experimental work in Chapters 5 and 6. Finally,
in Section 8.3, the conclusions and contributions of this dissertation are stated before
discussing open issues and future directions.

8.2 Objectives

8.2.1 Objective 1

Explore, through the construction of a prototype system that incorporates AAMs, what
would be required in order to build a fully-functional FER system, i.e. one where the
system could be trained and then used to recognise new and previously unseen video
or images.

Chapter 4 discussed the functional requirements of a system capable of sensing


multiple variable inputs from voice, facial expression and movement, making some
assessment of the signals of each, and then fusing them to provide some degree of
affect recognition. The chapter went on to describe how the requirements had been
implemented in a prototype system, NXS, which was built to support the experimental
aspects of this dissertation. Although the system has been built to cater for multi-modal
analysis and recognition, its application was confined to FER within the experiments.

The system proved to be flexible and robust throughout the experiments, and dealt
with the requirements to process sets of images, as in Chapter 5, as well as the more dif-
ficult task of video processing. The underlying software components that are required
to perform FER were shown to be very stable, i.e. [OpenCV ] for face detection and image
capture, [VXL ] for image processing, DemoLib [Saragih 08] for AAM development, and
LIBSVM [Chang 01] for SVM classification. Components were easily interchanged,
and the MultiBoost classification software [MultiBoost 06] was easily replaced with
the LIBSVM implementation of SVM classification.
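As a rough indication of how such components chain together, the sketch below reproduces the basic detect-extract-classify flow in Python, with OpenCV's Haar cascade face detector and scikit-learn's SVC (itself built on LIBSVM) standing in for the C++ components of NXS. It is illustrative only and omits the AAM fitting and Gabor feature stages.

# Illustrative stand-in for the NXS pipeline, not the thesis implementation:
# detect the face, extract a crude feature vector, classify the expression.
import cv2
import numpy as np
from sklearn.svm import SVC

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_features(image_bgr, size=(32, 32)):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    crop = cv2.resize(gray[y:y + h, x:x + w], size)
    return crop.flatten().astype(np.float32) / 255.0  # placeholder features

# Training and prediction on an annotated image set would then follow:
# clf = SVC(kernel="rbf").fit(feature_matrix, expression_labels)
# prediction = clf.predict([face_features(new_image)])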

The system was built using the Qt software from [Qt 09], which will ensure that it
can be deployed in Windows, Mac OS, or Unix/Linux variant environments. Anecdo-
tally, NXS’ performance was quite adequate for real-time FER, despite little effort being
expended in tuning the performance of the software. Overall, it demonstrates that un-
derlying software components are mature enough to be incorporated in a production-
like system.

8.2.2 Objective 2

Investigate whether FER practices could be applied to non-primary emotional expressions,


such as anxiety. A great deal of experience was gained from the experiment, and, de-
spite the lack of samples and natural data, the results suggest that the recognition of
anxious expressions is possible, but becomes more difficult when fearful expressions
are also present. The difficulty increases when more primary expressions are added to
the classification problem. The exercise demonstrates that facial expression classifica-
tion is, in general, a difficult task and in some situations, in the absence of contextual
information and/or temporal data revealing facial dynamics, may not even be possible.
Moreover, even with contextual and temporal evidence present, the fact that a proto-
typical expression can take many forms, e.g. a ‘happy’ expression can be portrayed
with or without opening the mouth, compounds the degree of difficulty. Any attempt
at recognition may not be reliable without the presence of semantic information.

The second part of Experiment 4 demonstrated that, even using two popular and
creditable databases, a classifier built from images from one database did not achieve
a high CA in predicting expressions from the other, despite Gabor filtering being rea-
sonably invariant to lighting conditions. This echoes the preliminary results reported
in [Whitehill 09] (albeit much worse), and, to address this problem, there seem to be
two approaches. The first is to expand the training set of images, including samples
with variant conditions, e.g. recorded under different lighting conditions. However,
the results in Chapter 5 suggest that this will not be a complete solution. The second
is to incorporate some form of logic or rule processing in expression recognition, and
this was discussed in Chapter 7.

8.2.3 Objective 3

Examine whether FER practices could be applied to non-primary emotional expres-


sions, such as those displayed by someone suffering from a MDD. The experiment
tested two hypotheses relating to facial activity and expressions and unipolar melan-
cholic depression. Although there were clearly not enough samples to undertake a
significant analysis of variance or draw conclusions, the results were encouraging.

Anecdotally, even with the small sample size, there were some interesting patterns.
For instance, other types of MDD participants seemed to have slightly higher levels of
facial activity and expressions than control subjects and patients with unipolar melan-
cholic depression. Another interesting observation was that, in one case, ethnic background
seemed to influence the facial activity and expression responses to the emotion in the
video clip. The control subject’s responses during the sad clip were similar to those of other
control subjects, whereas his responses during the other clips were much lower. Obviously,
many more samples would be required to support this notion, but it would make an
interesting follow-up study.

8.2.4 Objective 4

Identify avenues for improvement in the emotional expression recognition process.

Chapter 7 reflected on some of the lessons learned in the earlier experimental tasks,
with a scope that extended well beyond the field of FER to affective computing in general. Two
recurring themes are emerging in the literature, i.e. the need to incor-
porate rules into the recognition process, and a means to represent recognition results in a
standard format. To address these requirements, the use of a model and ontologies of
affective computing was proposed.

8.3 Conclusions and Future Work

8.3.1 Summary of Contributions

This is a very broad thesis, ranging in scope from computer vision techniques
to psychology, and taking in knowledge management concepts like ontologies along the
way. Such cross-disciplinary dissertations are necessary to advance a broad and,
at times, nebulous field such as affective computing. This dissertation contributes in
several ways.

An artifact of the work is the NXS, which will be evolved and made available to
other researchers. It has been used successfully in a collaborative study that inves-
tigates the links between facial expressions and MDD, undertaken at the Black Dog
Institute, Sydney. The system is easily extendable, and its use is not confined to FER,
instead being suitable for experimental work in full multi-modal emotional expression
recognition. Its use is not confined to the recognition of primary emotional expression,
and it could be used to sense for any other state, e.g. attraction, boredom, level of
interest, or even all of them. In fact, it is not confined to expression recognition and could
be used for face recognition.

Even though the sample size was modest, the results in anxiety recognition are
very encouraging. They suggest that anxious facial expressions can be recognised with
FER techniques, which could have many applications, extending beyond scientific and
medical research, such as interactive games and passenger screening technology.

Similarly, the results of the experiments on depression are encouraging, and once a
large enough sample size has been attained, the hypotheses in Chapter 6 can be prop-
erly tested. Even at this stage, interesting patterns in the data have emerged, enough to
suggest that other hypotheses could be tested within MDD populations, and with other
objectives in mind, e.g. cross-cultural responses to affective content.

8.3.2 Future Work

Recognition Based on Temporal Features

The FER within this dissertation was confined to recognition within images and no
attempt was made to classify expressions based on temporal features. The raw fa-
cial landmark coordinates, already stored within NXS, could be used to train Hidden
Markov Models (HMMs). The major difficulty with this type of approach is to recognise
the onset or apex of the expressions; however, some ensemble of classifiers, perhaps
using an SVM to detect the peak of the expression and an HMM to recognise the temporal
expression, could be attempted, similar to [Fan 05, Wen 10].
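A minimal sketch of the HMM part of such an approach is given below, assuming the per-frame landmark coordinates are exported from NXS as fixed-length feature vectors and the hmmlearn package is used; it is a starting point only, not the SVM-HMM ensemble itself.

# Sketch only: fit one Gaussian HMM per expression class on sequences of
# per-frame landmark vectors, then score an unseen sequence against each.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_expression_hmm(sequences, n_states=3):
    # sequences: list of (n_frames, n_landmark_coords) arrays for one class
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

def classify(sequence, models):
    # models: dict mapping expression label -> trained GaussianHMM
    scores = {label: model.score(sequence) for label, model in models.items()}
    return max(scores, key=scores.get)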

Assessing the Relative Contributions to FER from AAM and Gabor Features

In the experiments undertaken in Chapter 5, FER performance was assessed using


shape, AAM texture parameters and Gabor magnitudes. It would be useful to know
if the addition of Gabor features to shape provided incremental validity to the facial
expression classifications. Knowing at what point adding AAM and Gabor features to
FER can improve recognition would be of use in boosting system performance.

Extension of Depression Analysis

Chapter 6 explored the application of FER techniques to the study of facial activity
and emotional expressions of patients with MDD. At the time of writing this disserta-
tion, the number of participants was not sufficiently large, and the groups not properly
matched for age and gender, to publish statistical results. However, informal results us-
ing the non-parametric Mann-Whitney test1 are encouraging. More subjects have now
participated in the project, and it is expected that the results will soon be published and
this will be a continuing avenue of investigation.
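For reference, the kind of informal check referred to here takes only a few lines; the sketch below runs a two-sided Mann-Whitney U test in SciPy on the (rounded) total facial activity scores from Table B.1, and is shown to illustrate the procedure rather than to report a finding.

# Illustration only: two-sided Mann-Whitney U test on the total facial
# activity scores of Table B.1 (controls vs unipolar melancholic patients).
from scipy.stats import mannwhitneyu

controls = [297.95, 282.88, 278.41, 265.37, 235.72,
            213.00, 196.14, 168.49, 115.62, 102.61]
patients = [159.67, 110.13, 53.84, 39.00]

u_stat, p_value = mannwhitneyu(controls, patients, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.3f}")  # compared against alpha = 0.05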

Extension to Other Fields of Interest

Anxiety and depression have been used in this dissertation to test the application of
FER techniques to non-prototypical facial activity and expressions. There are many
other disorders that could be investigated using a paradigm similar to that described in
Chapter 6, e.g. borderline personality disorder, schizophrenia and Parkinson’s disease.
This approach could also be used to investigate the mirroring effect between the stimuli
and the viewers. Beyond the laboratory, there are many commercial applications that
could make use of FER, including interactive learning, gaming and robotics.

Standardisation of FER Reporting

One significant improvement that could be made in the field of FER, relates to the
experimental methods and reporting of results. At present, the format for reporting is
not consistent across studies and there is no standard database used which would enable
comparison of results. The set of expressions included for recognition varies between
studies, and this makes it difficult to compare and validate results. Reporting tends to
be constrained or dictated by journal or conference paper stipulations in relation to the
format and length of submissions.

1
This was based on a two-tailed, or non-directional, hypothesis with a significance level
of 0.05.


The Emotion Challenge held at the INTERSPEECH conference in 2009 was an
innovative attempt to promote standardisation [Schuller 09b]. Settling on one or more
databases, such as Multi-PIE [Gross 10], as the standard for FER experiments would
be another step forward. Multi-PIE comprises recordings of 337 subjects, portraying 6
different expressions, recorded under different illumination conditions and from numerous
cameras. Whilst it might not be suitable for every study or guarantee consistency, the
recommended use of such a database by conference organisers and journals, where
feasible, would at least encourage and promote uniformity.
Another improvement would be the adoption of a simple FER reporting template
or checklist by major conferences or journals. The checklist would be a précis, used
to ensure that salient and consistent information was presented to the conference or-
ganisers or journal editors. Since many of the key researchers are editors in chief of
prominent journals and/or chair influential conferences, this seems like an achievable
goal.
A
Analysis and Data - Anxiety
Table A.1: Experiment 1 - Poll results
Question Emotion Vote Percentage Winner Winner

Q1. Fear 1 7.14%


Anxiety 8 57.14% 1
Uncertain 5 35.71%
Q2. Fear 5 35.71%
Anxiety 7 50.00% 1
Uncertain 2 14.29%
Q3. Fear 4 28.57%
Anxiety 9 64.29% 1
Uncertain 1 7.14%
Q4. Fear 6 42.86%
Anxiety 7 50.00% 1
Uncertain 1 7.14%
Q5. Fear 4 28.57%
Anxiety 9 64.29% 1
Uncertain 1 7.14%
Q6. Fear 8 57.14% 1
Anxiety 4 28.57%
Uncertain 2 14.29%
Q7. Fear 11 78.57% 1
Anxiety 1 7.14%
Uncertain 2 14.29%
Q8. Fear 5 35.71%
Anxiety 8 57.14% 1
Uncertain 1 7.14%
Q9. Fear 12 85.71% 1
Anxiety 0 0.00%
Uncertain 2 14.29%
Q10. Fear 1 7.14%
Anxiety 9 64.29% 1
Uncertain 4 28.57%
Q11. Fear 10 71.43% 1
Anxiety 4 28.57%
Uncertain 0 0.00%
Q12. Fear 5 35.71%
Anxiety 4 28.57%
Uncertain 5 35.71%
Q13. Fear 8 57.14% 1
Anxiety 6 42.86%
Uncertain 0 0.00%
Q14. Fear 1 7.14%
Anxiety 8 57.14% 1
Uncertain 5 35.71%
Q15. Fear 5 35.71%
Anxiety 4 28.57%
Uncertain 5 35.71%
Q16. Fear 11 78.57% 1
Anxiety 2 14.29%
Uncertain 1 7.14%
Q17. Fear 11 78.57% 1
Anxiety 0 0.00%
Uncertain 3 21.43%
Q18. Fear 5 35.71%
Anxiety 4 28.57%
Uncertain 4 28.57%
Q19. Fear 6 42.86%
Anxiety 6 42.86%
Uncertain 2 14.29%
Q20. Fear 3 21.43%
Anxiety 4 28.57%
Uncertain 7 50.00%
Q21. Fear 4 28.57%
Anxiety 6 42.86%
Uncertain 4 28.57%
Q22. Fear 6 42.86%
Anxiety 5 35.71%
Uncertain 3 21.43%
Q23. Fear 5 35.71%
Anxiety 7 50.00% 1
Uncertain 2 14.29%
Q24. Fear 10 71.43%
Anxiety 1 7.14%
Uncertain 3 21.43%
Q25. Fear 7 50.00% 1
Anxiety 4 28.57%
Uncertain 3 21.43%
Q26. Fear 3 21.43%
Anxiety 8 57.14% 1
Uncertain 3 21.43%
Q27. Fear 3 21.43%
Anxiety 8 57.14% 1
Uncertain 3 21.43%
Q28. Fear 2 14.29%
Anxiety 6 42.86%
Uncertain 6 42.86%
Q29. Fear 0 0.00%
Anxiety 5 35.71%
Uncertain 9 64.29%
Q30. Fear 2 14.29%
Anxiety 6 42.86%
Uncertain 6 42.86%
Q31. Fear 5 35.71%


Anxiety 5 35.71%
Uncertain 4 28.57%
Q32. Fear 4 28.57%
Anxiety 7 50.00% 1
Uncertain 3 21.43%
Q33. Fear 8 57.14% 1
Anxiety 4 28.57%
Uncertain 2 14.29%
Q34. Fear 6 42.86%
Anxiety 5 35.71%
Uncertain 3 21.43%
Q35. Fear 5 35.71%
Anxiety 4 28.57%
Uncertain 5 35.71%
Q36. Fear 7 50.00% 1
Anxiety 5 35.71%
Uncertain 2 14.29%
Q37. Fear 14 100.00% 1
Anxiety 0 0.00%
Uncertain 0 0.00%
Q38. Fear 1 7.14%
Anxiety 6 42.86%
Uncertain 7 50.00%
Q39. Fear 4 28.57%
Anxiety 5 35.71%
Uncertain 5 35.71%
Q40. Fear 0 0.00%
Anxiety 9 64.29% 1
Uncertain 5 35.71%
Q41. Fear 2 14.29%
Anxiety 7 50.00% 1
Uncertain 5 35.71%
Q42. Fear 7 50.00% 1
Anxiety 6 42.86%
Uncertain 1 7.14%
Q43. Fear 2 14.29%
Anxiety 9 64.29% 1
Uncertain 3 21.43%
Q44. Fear 12 85.71% 1
Anxiety 2 14.29%
Uncertain 0 0.00%
Q45. Fear 1 7.14%
Anxiety 5 35.71%
Uncertain 8 57.14%
Q46. Fear 6 42.86%
Anxiety 5 35.71%
Uncertain 3 21.43%
Q47. Fear 1 7.14%
Anxiety 4 28.57%
Uncertain 9 64.29%
Q48. Fear 5 35.71%
Anxiety 4 28.57%
Uncertain 5 35.71%
Q49. Fear 9 64.29% 1
Anxiety 1 7.14%
Uncertain 3 21.43%
Q50. Fear 8 57.14% 1
Anxiety 3 21.43%
Uncertain 3 21.43%
Q51. Fear 10 71.43% 1
Anxiety 2 14.29%
Uncertain 2 14.29%
Q52. Fear 5 35.71%
Anxiety 6 42.86%
Uncertain 3 21.43%
Q53. Fear 2 14.29%
Anxiety 9 64.29% 1
Uncertain 3 21.43%
Q54. Fear 11 78.57% 1
Anxiety 0 0.00%
Uncertain 3 21.43%
Q55. Fear 7 50.00% 1
Anxiety 6 42.86%
Uncertain 1 7.14%
Total winners: Fear 18, Anxiety 16
B
Analysis and Data - Depression
B.1 Old Paradigm

Bill Cosby (Happy) The Champ (Sad) Weather (Happy) Sea of Love (Surprise) Cry Freedom (Anger) Total
Co f 09 94.3759 88.349 32.5771 3.64379 79.0066 297.95239
Co m 02 66.0356 79.069 29.5169 8.87003 99.3888 282.88033
Co f 07 76.6297 71.4058 24.4274 2.28335 103.664 278.41025
Co f 03 68.7369 74.2912 30.0492 5.53615 86.7574 265.37085
Co m 01 49.9946 58.0265 43.1402 4.2638 80.2981 235.7232
Co f 05 49.8066 56.9214 33.3114 7.09537 65.8639 212.99867
Co f 08 60.8075 33.9613 25.0592 8.87829 67.4343 196.14059
Co f 06 48.9984 27.7255 23.5991 2.13677 66.0349 168.49467
Pa m UP-Mel 01 47.1827 48.7715 28.6823 0.939169 34.0904 159.666069
Co f 10 40.9297 26.9097 15.5889 0.920074 31.2753 115.623674
Pa m UP-Mel 03 45.8346 21.2189 17.1226 2.76461 23.1864 110.12711
Co m 04 15.6498 28.7875 7.7389 1.48216 48.9529 102.61126
Pa m UP-Mel 02 18.1763 12.5658 5.02547 0.599598 17.4755 53.842668
Pa m UP-Mel 04 9.42419 9.92479 4.89164 0.704307 14.052 38.996927

Table B.1: Old Paradigm - Facial activity


Bill Cosby (Happy) The Champ (Sad) Weather (Happy) Sea of Love (Surprise) Cry Freedom (Anger)
Co f 09 94.3759 182.7249 215.302 218.94579 297.95239
Co m 02 66.0356 145.1046 174.6215 183.49153 282.88033
Co f 07 76.6297 148.0355 172.4629 174.74625 278.41025
Co f 03 68.7369 143.0281 173.0773 178.61345 265.37085
Co m 01 49.9946 108.0211 151.1613 155.4251 235.7232
Co f 05 49.8066 106.728 140.0394 147.13477 212.99867
Co f 08 60.8075 94.7688 119.828 128.70629 196.14059
Co f 06 48.9984 76.7239 100.323 102.45977 168.49467
Pa m UP-Mel 01 47.1827 95.9542 124.6365 125.575669 159.666069
Co f 10 40.9297 67.8394 83.4283 84.348374 115.623674
Pa m UP-Mel 03 45.8346 67.0535 84.1761 86.94071 110.12711
Co m 04 15.6498 44.4373 52.1762 53.65836 102.61126
Pa m UP-Mel 02 18.1763 30.7421 35.76757 36.367168 53.842668
Pa m UP-Mel 04 9.42419 19.34898 24.24062 24.944927 38.996927

Table B.2: Old Paradigm - Accumulated facial activity

(a) Old Paradigm - Facial activity (Bill Cosby); (b) Old Paradigm - Facial activity (The Champ); (c) Old Paradigm - Facial activity (Weather)
Bill Cosby (Happy) The Champ (Sad) Weather (Happy)
Co f 09 94.3759 Co f 09 88.349 Co m 01 43.1402
Co f 07 76.6297 Co m 02 79.069 Co f 05 33.3114
Co f 03 68.7369 Co f 03 74.2912 Co f 09 32.5771
Co m 02 66.0356 Co f 07 71.4058 Co f 03 30.0492
Co f 08 60.8075 Co m 01 58.0265 Co m 02 29.5169
Co m 01 49.9946 Co f 05 56.9214 Pa m UP-Mel 01 28.6823
Co f 05 49.8066 Pa m UP-Mel 01 48.7715 Co f 08 25.0592
Co f 06 48.9984 Co f 08 33.9613 Co f 07 24.4274
Pa m UP-Mel 01 47.1827 Co m 04 28.7875 Co f 06 23.5991
Pa m UP-Mel 03 45.8346 Co f 06 27.7255 Pa m UP-Mel 03 17.1226
Co f 10 40.9297 Co f 10 26.9097 Co f 10 15.5889
Pa m UP-Mel 02 18.1763 Pa m UP-Mel 03 21.2189 Co m 04 7.7389
Co m 04 15.6498 Pa m UP-Mel 02 12.5658 Pa m UP-Mel 02 5.02547
Pa m UP-Mel 04 9.42419 Pa m UP-Mel 04 9.92479 Pa m UP-Mel 04 4.89164

(d) Old Paradigm - Facial activity (Sea of Love); (e) Old Paradigm - Facial activity (Cry Freedom)
Sea of Love (Surprise) Cry Freedom (Anger)
Co f 08 8.87829 Co f 07 103.664
Co m 02 8.87003 Co m 02 99.3888
Co f 05 7.09537 Co f 03 86.7574
Co f 03 5.53615 Co m 01 80.2981
Co m 01 4.2638 Co f 09 79.0066
Co f 09 3.64379 Co f 08 67.4343
Pa m UP-Mel 03 2.76461 Co f 06 66.0349
Co f 07 2.28335 Co f 05 65.8639
Co f 06 2.13677 Co m 04 48.9529
Co m 04 1.48216 Pa m UP-Mel 01 34.0904
Pa m UP-Mel 01 0.939169 Co f 10 31.2753
Co f 10 0.920074 Pa m UP-Mel 03 23.1864
Pa m UP-Mel 04 0.704307 Pa m UP-Mel 02 17.4755
Pa m UP-Mel 02 0.599598 Pa m UP-Mel 04 14.052

Table B.3: Old Paradigm - Facial activity for each video



Sad Happy Neutral


Bill Cosby Co f 06 91 479 24
Co f 09 65 425 104
Co f 10 88 378 128
Co m 02 116 335 143
Co m 01 124 333 137
Pa m UP-Mel 03 307 186 101
Co f 07 0 185 409
Co f 03 185 145 264
Co m 04 307 119 168
Co f 08 125 110 359
Co f 05 93 18 483
Pa m UP-Mel 02 0 14 580
Pa m UP-Mel 01 12 12 570
Pa m UP-Mel 04 0 0 594
The Champ Co m 01 428 300 40
Co f 07 0 178 590
Pa m UP-Mel 02 0 112 656
Co f 09 685 36 47
Co f 05 30 14 724
Co f 10 707 10 51
Co f 06 758 8 2
Pa m UP-Mel 03 669 8 91
Co f 03 728 5 35
Co m 02 764 0 4
Co m 04 756 0 12
Co f 08 768 0 0
Pa m UP-Mel 01 0 0 768
Pa m UP-Mel 04 0 0 768
Weather Co f 09 46 221 2
Co m 01 33 200 36
Co m 02 60 198 11
Pa m UP-Mel 02 0 195 74
Co f 06 88 181 0
Co f 05 20 180 69
Co f 10 244 25 0
Co f 07 0 24 245
Co f 03 244 13 12
Pa m UP-Mel 03 155 9 105
Co m 04 257 0 12
Co f 08 269 0 0
Pa m UP-Mel 01 0 0 269
Pa m UP-Mel 04 0 0 269
Sea of Love Co m 01 7 51 3
Co m 02 17 28 16
Co f 06 36 25 0
Co f 05 6 23 32
Co f 07 0 18 43
Co f 09 2 2 57
Co f 08 22 1 38
Pa m UP-Mel 03 41 1 19
Co f 03 61 0 0
Co m 04 56 0 5
Co f 10 61 0 0
Pa m UP-Mel 01 0 0 61
Pa m UP-Mel 02 0 0 61
Pa m UP-Mel 04 0 0 61
Cry Freedom Co m 01 56 446 311
Co f 07 0 408 405
Co f 06 711 42 60
Co f 03 283 36 494
Co f 08 612 35 166
Co f 10 610 23 180
Pa m UP-Mel 03 609 21 183
Co m 02 736 8 69
Co f 05 71 4 738
Co m 04 613 3 197
Co f 09 351 2 460
Pa m UP-Mel 02 0 1 812
Pa m UP-Mel 01 0 0 813
Pa m UP-Mel 04 0 0 813

Table B.4: Old Paradigm - Facial expressions - sorted by happy within video

Sad Happy Neutral


Bill Cosby Pa m UP-Mel 03 307 186 101
Co m 04 307 119 168
Co f 03 185 145 264
Co f 08 125 110 359
Co m 01 124 333 137
Co m 02 116 335 143
Co f 05 93 18 483
Co f 06 91 479 24
Co f 10 88 378 128
Co f 09 65 425 104
Pa m UP-Mel 01 12 12 570
Co f 07 0 185 409
Pa m UP-Mel 02 0 14 580
Pa m UP-Mel 04 0 0 594
The Champ Co f 08 768 0 0
Co m 02 764 0 4
Co f 06 758 8 2
Co m 04 756 0 12
Co f 03 728 5 35
Co f 10 707 10 51
Co f 09 685 36 47
Pa m UP-Mel 03 669 8 91
Co m 01 428 300 40
Co f 05 30 14 724
Co f 07 0 178 590
Pa m UP-Mel 02 0 112 656
Pa m UP-Mel 01 0 0 768
Pa m UP-Mel 04 0 0 768
Weather Co f 08 269 0 0
Co m 04 257 0 12
Co f 10 244 25 0
Co f 03 244 13 12
Pa m UP-Mel 03 155 9 105
Co f 06 88 181 0
Co m 02 60 198 11
Co f 09 46 221 2
Co m 01 33 200 36
Co f 05 20 180 69
Pa m UP-Mel 02 0 195 74
Co f 07 0 24 245
Pa m UP-Mel 01 0 0 269
Pa m UP-Mel 04 0 0 269
Sea of Love Co f 03 61 0 0
Co f 10 61 0 0
Co m 04 56 0 5
Pa m UP-Mel 03 41 1 19
Co f 06 36 25 0
Co f 08 22 1 38
Co m 02 17 28 16
Co m 01 7 51 3
Co f 05 6 23 32
Co f 09 2 2 57
Co f 07 0 18 43
Pa m UP-Mel 01 0 0 61
Pa m UP-Mel 02 0 0 61
Pa m UP-Mel 04 0 0 61
Cry Freedom Co m 02 736 8 69
Co f 06 711 42 60
Co m 04 613 3 197
Co f 08 612 35 166
Co f 10 610 23 180
Pa m UP-Mel 03 609 21 183
Co f 09 351 2 460
Co f 03 283 36 494
Co f 05 71 4 738
Co m 01 56 446 311
Co f 07 0 408 405
Pa m UP-Mel 02 0 1 812
Pa m UP-Mel 01 0 0 813
Pa m UP-Mel 04 0 0 813

Table B.5: Old Paradigm - Facial Expressions - sorted by sad within video

Sad Happy Neutral


Bill Cosby Pa m UP-Mel 04 0 0 594
Pa m UP-Mel 02 0 14 580
Pa m UP-Mel 01 12 12 570
Co f 05 93 18 483
Co f 07 0 185 409
Co f 08 125 110 359
Co f 03 185 145 264
Co m 04 307 119 168
Co m 02 116 335 143
Co m 01 124 333 137
Co f 10 88 378 128
Co f 09 65 425 104
Pa m UP-Mel 03 307 186 101
Co f 06 91 479 24
The Champ Pa m UP-Mel 01 0 0 768
Pa m UP-Mel 04 0 0 768
Co f 05 30 14 724
Pa m UP-Mel 02 0 112 656
Co f 07 0 178 590
Pa m UP-Mel 03 669 8 91
Co f 10 707 10 51
Co f 09 685 36 47
Co m 01 428 300 40
Co f 03 728 5 35
Co m 04 756 0 12
Co m 02 764 0 4
Co f 06 758 8 2
Co f 08 768 0 0
Weather Pa m UP-Mel 01 0 0 269
Pa m UP-Mel 04 0 0 269
Co f 07 0 24 245
Pa m UP-Mel 03 155 9 105
Pa m UP-Mel 02 0 195 74
Co f 05 20 180 69
Co m 01 33 200 36
Co m 04 257 0 12
Co f 03 244 13 12
Co m 02 60 198 11
Co f 09 46 221 2
Co f 08 269 0 0
Co f 10 244 25 0
Co f 06 88 181 0
Sea of Love Pa m UP-Mel 01 0 0 61
Pa m UP-Mel 02 0 0 61
Pa m UP-Mel 04 0 0 61
Co f 09 2 2 57
Co f 07 0 18 43
Co f 08 22 1 38
Co f 05 6 23 32
Pa m UP-Mel 03 41 1 19
Co m 02 17 28 16
Co m 04 56 0 5
Co m 01 7 51 3
Co f 03 61 0 0
Co f 10 61 0 0
Co f 06 36 25 0
Cry Freedom Pa m UP-Mel 01 0 0 813
Pa m UP-Mel 04 0 0 813
Pa m UP-Mel 02 0 1 812
Co f 05 71 4 738
Co f 03 283 36 494
Co f 09 351 2 460
Co f 07 0 408 405
Co m 01 56 446 311
Co m 04 613 3 197
Pa m UP-Mel 03 609 21 183
Co f 10 610 23 180
Co f 08 612 35 166
Co m 02 736 8 69
Co f 06 711 42 60

Table B.6: Old Paradigm - Facial Expressions - sorted by neutral within video

B.2 New Paradigm


Bill Cosby (Happy) The Champ (Sad) Weather (Happy) Silence of the Lambs (Fear) Cry Freedom (Anger) The Shining (Fear) Capricorn One (Surprise) Total
Pa f UP BP2 08 61.9718 34.5945 67.7825 91.4877 149.151 69.3599 43.5648 517.9122
Pa f UP-NonMel 09 74.982 43.9521 27.4592 84.8893 48.3166 29.007 23.8494 332.4556
Co m 11 47.8463 34.5049 40.8223 51.2466 43.3826 25.3941 14.8772 258.074
Pa f UP-NonMel 06 35.8466 41.9455 17.4541 62.5193 49.7708 24.5637 17.7679 249.8679
Co f 12 58.5707 39.9424 25.3561 43.6596 26.4961 24.3349 18.4008 236.7606
Co f 15 50.9681 29.948 17.1747 46.0423 28.2632 14.003 15.9124 202.3117
Pa m Unkown 07 42.8085 32.9788 19.9293 40.8624 37.639 14.1273 10.4134 198.7587
Co m 13 30.6725 16.9278 26.0144 73.5504 18.4885 11.9067 12.2761 189.8364
Co m 16 34.7066 26.5557 17.9017 53.904 29.9717 9.40495 17.0196 189.46425
Co m 14 40.6977 20.1737 38.1457 31.162 16.2503 11.0059 13.7803 171.2156
Pa m PD 11 31.7382 22.6704 17.5711 30.8778 18.0523 4.66098 11.1434 136.71418
Pa f UP-Mel 10 21.657 21.373 10.28 15.0949 8.99202 5.11657 3.99677 86.51026
Pa m UP-Mel 05 5.82715 6.73422 3.32408 12.3734 8.72638 5.31982 3.20501 45.51006

Table B.7: New Paradigm - Accumulated facial activity



(a) New Paradigm - Facial activity (Bill Cosby); (b) New Paradigm - Facial activity (The Champ); (c) New Paradigm - Facial activity (Weather)
Bill Cosby (Happy) The Champ (Sad) Weather (Happy)
Pa f UP-NonMel 09 74.982 Pa f UP-NonMel 09 43.9521 Pa f UP BP2 08 67.7825
Pa f UP BP2 08 61.9718 Pa f UP-NonMel 06 41.9455 Co m 11 40.8223
Co f 12 58.5707 Co f 12 39.9424 Co m 14 38.1457
Co f 15 50.9681 Pa f UP BP2 08 34.5945 Pa f UP-NonMel 09 27.4592
Co m 11 47.8463 Co m 11 34.5049 Co m 13 26.0144
Pa m Unkown 07 42.8085 Pa m Unkown 07 32.9788 Co f 12 25.3561
Co m 14 40.6977 Co f 15 29.948 Pa m Unkown 07 19.9293
Pa f UP-NonMel 06 35.8466 Co m 16 26.5557 Co m 16 17.9017
Co m 16 34.7066 Pa m PD 11 22.6704 Pa m PD 11 17.5711
Pa m PD 11 31.7382 Pa f UP-Mel 10 21.373 Pa f UP-NonMel 06 17.4541
Co m 13 30.6725 Co m 14 20.1737 Co f 15 17.1747
Pa f UP-Mel 10 21.657 Co m 13 16.9278 Pa f UP-Mel 10 10.28
Pa m UP-Mel 05 5.82715 Pa m UP-Mel 05 6.73422 Pa m UP-Mel 05 3.32408

(d) New Paradigm - Facial activity (Silence of the Lambs); (e) New Paradigm - Facial activity (Cry Freedom)
Silence of the Lambs (Fear) Cry Freedom (Anger)
Pa f UP BP2 08 91.4877 Pa f UP BP2 08 149.151
Pa f UP-NonMel 09 84.8893 Pa f UP-NonMel 06 49.7708
Co m 13 73.5504 Pa f UP-NonMel 09 48.3166
Pa f UP-NonMel 06 62.5193 Co m 11 43.3826
Co m 16 53.904 Pa m Unkown 07 37.639
Co m 11 51.2466 Co m 16 29.9717
Co f 15 46.0423 Co f 15 28.2632
Co f 12 43.6596 Co f 12 26.4961
Pa m Unkown 07 40.8624 Co m 13 18.4885
Co m 14 31.162 Pa m PD 11 18.0523
Pa m PD 11 30.8778 Co m 14 16.2503
Pa f UP-Mel 10 15.0949 Pa f UP-Mel 10 8.99202
Pa m UP-Mel 05 12.3734 Pa m UP-Mel 05 8.72638

(f) New Paradigm - Facial activity (The Shining); (g) New Paradigm - Facial activity (Capricorn One)
The Shining (Fear) Capricorn One (Surprise)
Pa f UP BP2 08 69.3599 Pa f UP BP2 08 43.5648
Pa f UP-NonMel 09 29.007 Pa f UP-NonMel 09 23.8494
Co m 11 25.3941 Co f 12 18.4008
Pa f UP-NonMel 06 24.5637 Pa f UP-NonMel 06 17.7679
Co f 12 24.3349 Co m 16 17.0196
Pa m Unkown 07 14.1273 Co f 15 15.9124
Co f 15 14.003 Co m 11 14.8772
Co m 13 11.9067 Co m 14 13.7803
Co m 14 11.0059 Co m 13 12.2761
Co m 16 9.40495 Pa m PD 11 11.1434
Pa m UP-Mel 05 5.31982 Pa m Unkown 07 10.4134
Pa f UP-Mel 10 5.11657 Pa f UP-Mel 10 3.99677
Pa m PD 11 4.66098 Pa m UP-Mel 05 3.20501

Table B.8: New Paradigm - Facial activity for each video



Sad Happy Neutral


Bill Cosby Pa f UP-NonMel 09 6 549 39
Co m 11 7 451 136
Pa f UP-NonMel 06 177 321 96
Co f 15 6 310 278
Co f 12 179 300 115
Pa f UP BP2 08 71 281 242
Pa m PD 11 0 182 412
Co m 16 0 63 531
Co m 14 141 62 391
Co m 13 479 50 65
Pa m UP-Mel 05 0 0 594
Pa m Unkown 07 0 0 594
Pa f UP-Mel 10 0 0 594
The Champ Pa f UP-NonMel 09 14 356 398
Pa f UP BP2 08 479 130 159
Co f 12 606 70 92
Co m 16 0 46 722
Pa f UP-NonMel 06 721 45 2
Co f 15 481 16 271
Co m 14 748 15 5
Co m 13 592 12 164
Co m 11 681 10 77
Pa m UP-Mel 05 0 0 768
Pa m Unkown 07 0 0 768
Pa f UP-Mel 10 0 0 768
Pa m PD 11 0 0 768
Weather Pa f UP-NonMel 06 45 224 0
Pa m PD 11 0 206 63
Pa f UP-NonMel 09 70 147 52
Co f 12 133 111 25
Co m 11 37 104 128
Co m 16 0 104 165
Co m 14 81 76 112
Co m 13 114 35 120
Pa f UP BP2 08 23 24 222
Co f 15 213 6 50
Pa m UP-Mel 05 0 0 269
Pa m Unkown 07 0 0 269
Pa f UP-Mel 10 0 0 269
Silence of the Lambs Pa m PD 11 0 392 614
Pa f UP BP2 08 261 293 452
Pa f UP-NonMel 09 334 110 562
Pa f UP-NonMel 06 351 58 597
Co f 12 234 40 732
Co m 16 0 39 967
Co m 14 977 18 11
Co m 13 636 10 360
Co f 15 515 6 485
Co m 11 7 3 996
Pa m UP-Mel 05 0 0 1006
Pa m Unkown 07 0 0 1006
Pa f UP-Mel 10 0 0 1006
Cry Freedom Pa f UP-NonMel 09 36 169 455
Co f 12 7 98 555
Pa m PD 11 0 40 620
Co m 14 629 29 2
Co f 15 629 26 5
Pa f UP BP2 08 430 24 206
Co m 16 0 13 647
Pa f UP-NonMel 06 228 12 420
Co m 13 646 10 4
Co m 11 111 1 548
Pa m UP-Mel 05 0 0 660
Pa m Unkown 07 0 0 660
Pa f UP-Mel 10 0 0 660
The Shining Pa f UP-NonMel 09 5 110 193
Co f 12 244 39 25
Co m 16 0 21 287
Pa f UP BP2 08 97 21 190
Co f 15 177 6 125
Co m 14 292 5 11
Co m 13 305 1 2
Pa f UP-NonMel 06 18 1 289
Co m 11 83 0 225
Pa m UP-Mel 05 0 0 308
Pa m Unkown 07 0 0 308
Pa f UP-Mel 10 0 0 308
Pa m PD 11 0 0 308
Capricorn One Pa f UP-NonMel 09 11 124 98
Co m 16 0 19 214
Pa f UP BP2 08 152 16 65
Co m 14 203 8 22
Co f 12 68 5 160
Pa f UP-NonMel 06 104 2 127
Co m 11 0 0 233
Co m 13 207 0 26
Co f 15 186 0 47
Pa m UP-Mel 05 0 0 233
Pa m Unkown 07 0 0 233
Pa f UP-Mel 10 0 0 233
Pa m PD 11 0 0 233

Table B.9: New Paradigm - Facial expressions - sorted by happy within video

Sad Happy Neutral


Bill Cosby Co m 13 479 50 65
Co f 12 179 300 115
Pa f UP-NonMel 06 177 321 96
Co m 14 141 62 391
Pa f UP BP2 08 71 281 242
Co m 11 7 451 136
Co f 15 6 310 278
Pa f UP-NonMel 09 6 549 39
Co m 16 0 63 531
Pa m UP-Mel 05 0 0 594
Pa m Unkown 07 0 0 594
Pa f UP-Mel 10 0 0 594
Pa m PD 11 0 182 412
The Champ Co m 14 748 15 5
Pa f UP-NonMel 06 721 45 2
Co m 11 681 10 77
Co f 12 606 70 92
Co m 13 592 12 164
Co f 15 481 16 271
Pa f UP BP2 08 479 130 159
Pa f UP-NonMel 09 14 356 398
Co m 16 0 46 722
Pa m UP-Mel 05 0 0 768
Pa m Unkown 07 0 0 768
Pa f UP-Mel 10 0 0 768
Pa m PD 11 0 0 768
Weather Co f 15 213 6 50
Co f 12 133 111 25
Co m 13 114 35 120
Co m 14 81 76 112
Pa f UP-NonMel 09 70 147 52
Pa f UP-NonMel 06 45 224 0
Co m 11 37 104 128
Pa f UP BP2 08 23 24 222
Co m 16 0 104 165
Pa m UP-Mel 05 0 0 269
Pa m Unkown 07 0 0 269
Pa f UP-Mel 10 0 0 269
Pa m PD 11 0 206 63
Silence of the Lambs Co m 14 977 18 11
Co m 13 636 10 360
Co f 15 515 6 485
Pa f UP-NonMel 06 351 58 597
Pa f UP-NonMel 09 334 110 562
Pa f UP BP2 08 261 293 452
Co f 12 234 40 732
Co m 11 7 3 996
Co m 16 0 39 967
Pa m UP-Mel 05 0 0 1006
Pa m Unkown 07 0 0 1006
Pa f UP-Mel 10 0 0 1006
Pa m PD 11 0 392 614
Cry Freedom Co m 13 646 10 4
Co m 14 629 29 2
Co f 15 629 26 5
Pa f UP BP2 08 430 24 206
Pa f UP-NonMel 06 228 12 420
Co m 11 111 1 548
Pa f UP-NonMel 09 36 169 455
Co f 12 7 98 555
Co m 16 0 13 647
Pa m UP-Mel 05 0 0 660
Pa m Unkown 07 0 0 660
Pa f UP-Mel 10 0 0 660
Pa m PD 11 0 40 620
The Shining Co m 13 305 1 2
Co m 14 292 5 11
Co f 12 244 39 25
Co f 15 177 6 125
Pa f UP BP2 08 97 21 190
Co m 11 83 0 225
Pa f UP-NonMel 06 18 1 289
Pa f UP-NonMel 09 5 110 193
Co m 16 0 21 287
Pa m UP-Mel 05 0 0 308
Pa m Unkown 07 0 0 308
Pa f UP-Mel 10 0 0 308
Pa m PD 11 0 0 308
Capricorn One Co m 13 207 0 26
Co m 14 203 8 22
Co f 15 186 0 47
Pa f UP BP2 08 152 16 65
Pa f UP-NonMel 06 104 2 127
Co f 12 68 5 160
Pa f UP-NonMel 09 11 124 98
Co m 11 0 0 233
Co m 16 0 19 214
Pa m UP-Mel 05 0 0 233
Pa m Unkown 07 0 0 233
Pa f UP-Mel 10 0 0 233
Pa m PD 11 0 0 233

Table B.10: New Paradigm - Facial expressions - sorted by sad within video

Sad Happy Neutral


Bill Cosby Pa m UP-Mel 05 0 0 594
Pa m Unkown 07 0 0 594
Pa f UP-Mel 10 0 0 594
Co m 16 0 63 531
Pa m PD 11 0 182 412
Co m 14 141 62 391
Co f 15 6 310 278
Pa f UP BP2 08 71 281 242
Co m 11 7 451 136
Co f 12 179 300 115
Pa f UP-NonMel 06 177 321 96
Co m 13 479 50 65
Pa f UP-NonMel 09 6 549 39
The Champ Pa m UP-Mel 05 0 0 768
Pa m Unkown 07 0 0 768
Pa f UP-Mel 10 0 0 768
Pa m PD 11 0 0 768
Co m 16 0 46 722
Pa f UP-NonMel 09 14 356 398
Co f 15 481 16 271
Co m 13 592 12 164
Pa f UP BP2 08 479 130 159
Co f 12 606 70 92
Co m 11 681 10 77
Co m 14 748 15 5
Pa f UP-NonMel 06 721 45 2
Weather Pa m UP-Mel 05 0 0 269
Pa m Unkown 07 0 0 269
Pa f UP-Mel 10 0 0 269
Pa f UP BP2 08 23 24 222
Co m 16 0 104 165
Co m 11 37 104 128
Co m 13 114 35 120
Co m 14 81 76 112
Pa m PD 11 0 206 63
Pa f UP-NonMel 09 70 147 52
Co f 15 213 6 50
Co f 12 133 111 25
Pa f UP-NonMel 06 45 224 0
Silence of the Lambs Pa m UP-Mel 05 0 0 1006
Pa m Unkown 07 0 0 1006
Pa f UP-Mel 10 0 0 1006
Co m 11 7 3 996
Co m 16 0 39 967
Co f 12 234 40 732
Pa m PD 11 0 392 614
Pa f UP-NonMel 06 351 58 597
Pa f UP-NonMel 09 334 110 562
Co f 15 515 6 485
Pa f UP BP2 08 261 293 452
Co m 13 636 10 360
Co m 14 977 18 11
Cry Freedom Pa m UP-Mel 05 0 0 660
Pa m Unkown 07 0 0 660
Pa f UP-Mel 10 0 0 660
Co m 16 0 13 647
Pa m PD 11 0 40 620
Co f 12 7 98 555
Co m 11 111 1 548
Pa f UP-NonMel 09 36 169 455
Pa f UP-NonMel 06 228 12 420
Pa f UP BP2 08 430 24 206
Co f 15 629 26 5
Co m 13 646 10 4
Co m 14 629 29 2
The Shining Pa m UP-Mel 05 0 0 308
Pa m Unkown 07 0 0 308
Pa f UP-Mel 10 0 0 308
Pa m PD 11 0 0 308
Pa f UP-NonMel 06 18 1 289
Co m 16 0 21 287
Co m 11 83 0 225
Pa f UP-NonMel 09 5 110 193
Pa f UP BP2 08 97 21 190
Co f 15 177 6 125
Co f 12 244 39 25
Co m 14 292 5 11
Co m 13 305 1 2
Capricorn One Co m 11 0 0 233
Pa m UP-Mel 05 0 0 233
Pa m Unkown 07 0 0 233
Pa f UP-Mel 10 0 0 233
Pa m PD 11 0 0 233
Co m 16 0 19 214
Co f 12 68 5 160
Pa f UP-NonMel 06 104 2 127
Pa f UP-NonMel 09 11 124 98
Pa f UP BP2 08 152 16 65
Co f 15 186 0 47
Co m 13 207 0 26
Co m 14 203 8 22

Table B.11: New Paradigm - Facial expressions - sorted by neutral within video
C
Extract of Patient Diagnosis
Unipolar motor disturbance Bipolar Schizoaffective other History of MI
ID age Gender (male =1) None Mel non-Mel Yes No type SAD (bipolar) SAD (dep) Specify
Pa m UP Mel 03 22 1 X
Pa m UP Mel 01 48 1 X
Pa m UP Mel 02 56 1 X
Pa m UP Mel 04 48 1 X
Pa f UP NonMel 09 27 2 X X
Pa f UP Mel 10 50 2 X X Anxiety? X
Pa m PD 11 53 1 X? panic disorder, substance abuse
Pa m UP Mel 05 26 1 X X
Pa f UP NonMel 06 34 2 X X? 2 X
Pa f UP BP2 08 32 2 2 X
Pa m Unkown 07 45 1

Table C.1: Extract of Patient Diagnosis

Bibliography

[Alvinoa 07] C. Alvinoa, C. Kohlerb, F. Barrett, and R. Gurb. Computerized measure-


ment of facial expression of emotions in schizophrenia. Journal of Neuro-
science Methods, 163(6):350–361, July 2007.

[Annesley 05] J. Annesley and J. Orwell. On the Use of MPEG-7 for Visual Surveil-
lance. Technical report, Digital Imaging Research Center, Kingston Univer-
sity, Kingston-upon-Thames, Surrey, UK., 2005.

[Anolli 97] L. Anolli and R. Ciceri. The Voice of Deception: Vocal Strategies of Naive
and Able Liars. Journal of Nonverbal Behavior, 21:259–284, 1997.

[Ashraf 09] A. Ashraf, S. Lucey, J. Cohn, T. Chen, Z. Ambadar, K. Prkachin, and


P. Solomon. The painful face - Pain expression recognition using active
appearance models. Image Vision Computing, 27(12):1788–1796, 2009.

[Asthana 09] A. Asthana, R. Göcke, N. Quadrianto, and T. Gedeon. Learning


Based Automatic Face Annotation for Arbitrary Poses and Expressions from
Frontal Images Only. In Proceedings of the IEEE Computer Society Con-
ference on Computer Vision and Pattern Recognition CVPR 2009, Miami
(FL), USA, June 2009. IEEE Computer Society.


[Athanaselisa 05] T. Athanaselisa, S. Bakamidisa, I. Dologloua, R. Cowie,


E. Douglas-Cowie, and C. Cox. ASR for emotional speech: Clarifying the
issues and enhancing performance. Neural Networks, 18:437–444, 2005.

[AVT ] AVT. Allied Vision Technologies. http://www.alliedvisiontec.com/emea/products/cameras.html.
Last accessed 13 April 2010.

[Baggia 08] P. Baggia, F. Burkhardt, J. Martin, C. Pelachaud, C. Peter, B. Schuller,


I. Wilson, and E. Zovato. Elements of an EmotionML 1.0. W3C Incubator
Group Report, November 2008.

[Baker 01] S. Baker and I. Matthews. Equivalence and Efficiency of Image Alignment
Algorithms. Computer Vision and Pattern Recognition, IEEE Computer So-
ciety Conference on, 1:1090–1097, 2001.

[Baker 02] S. Baker and I. Matthews. Lucas-Kanade 20 Years On: A Unifying Frame-
work: Part 1. Technical Report CMU-RI-TR-02-16, Robotics Institute,
Pittsburgh, PA, July 2002.

[Baker 03a] S. Baker, R. Gross, and I. Matthews. Lucas-Kanade 20 years on: A unify-
ing framework: Part 3. Technical Report CMU-RI-TR-03-35, Robotics In-
stitute, Carnegie Mellon University, Pittsburgh (PA), USA, November 2003.

[Baker 03b] S. Baker, R. Gross, I. Matthews, and T. Ishikawa. Lucas-Kanade 20 Years


On: A Unifying Framework: Part 2. Technical Report CMU-RI-TR-03-01,
Robotics Institute, Pittsburgh, PA, February 2003.

[Baker 04a] S. Baker, R. Gross, and I. Matthews. Lucas-Kanade 20 Years On: A Uni-
fying Framework: Part 4. Technical Report CMU-RI-TR-04-14, Robotics
Institute, Pittsburgh, PA, February 2004.

[Baker 04b] S. Baker, R. Patil, K. Cheung, and I. Matthews. Lucas-Kanade 20 Years


On: Part 5. Technical Report CMU-RI-TR-04-64, Robotics Institute, Pitts-
burgh, PA, November 2004.

[Bartlett 99] M. Bartlett, J. Hager, P. Ekman, and T. Sejnowski. Measuring Facial


Expressions by Computer Image Analysis. Psychophysiology, 36:253–263,
1999.

[Bartlett 02] M. Bartlett, G. Littlewort, B. Braathen, T. Sejnowski, and J. Movellan. A


Prototype for Automatic Recognition of Spontaneous Facial Actions. In Ad-
vances in Neural Information Processing Systems, pages 1271–1278, 2002.

[Bartlett 03] M. Bartlett, G. Littlewort, I. Fasel, and J. Movellan. Real Time Face De-
tection and Facial Expression Recognition: Development and Applications
to Human Computer Interaction. In In CVPR Workshop on CVPR for HCI,
2003.

[Bartlett 05] M. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movel-


lan. Recognizing Facial Expression: Machine Learning and Application to
Spontaneous Behavior. In IEEE Conference on Computer Vision and Pat-
tern Recognition, pages 568–573, 2005.

[Bartlett 06] M. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and Javier R.


Movellan. Fully Automatic Facial Action Recognition in Spontaneous Be-
havior. In 7th International Conference on Automatic Face and Gesture
Recognition, pages 223–230, 2006.

[Benitez 03] A. Benitez and S. Chang. Automatic Multimedia Knowledge Discovery,


Summarization and Evaluation. Technical report, Department of Electrical
Engineering, Columbia University, 2003.

[Bertini 05] M. Bertini, A. Del Bimbo, and C. Torniai. Video Annotation and Re-
trieval with Pictorially Enriched Ontologies. Technical report, Università di Firenze, Italy, 2005.

[beyondblue ] beyondblue. http://www.beyondblue.org.au/index.aspx?link_id=90. Last accessed 11 December 2009.

[Bhuiyan 07] A. Bhuiyan and C. Liu. On Face Recognition using Gabor Filters. In
Proceedings of World Academy of Science, Engineering and Technology,
22, pages 51–56, 2007.

[Blanz 99] V. Blanz and T. Vetter. A Morphable Model for the Synthesis of 3D Faces.
In Special Interest Group on Graphics and Interactive Techniques, pages
187–194, 1999.

[Bower 92] G.H. Bower. The handbook of emotion and memory: Research and theory,
chapter How Might Emotions Affect Learning?, pages 3–31. Lawrence Erl-
baum Associates, Inc, 365 Broadway, Hillsdale, New Jersey 07642, 1992.

[Brierley 07] B. Brierley, N. Medford, P. Shaw, and A. Davidson. Emotional mem-


ory for words: Separating content and context. Cognition & Emotion, 21
(3):495–521, 2007.

[Buchanan 02] H. Buchanan and N. Niven. Validation of a Facial Image Scale to


assess child dental anxiety. International Journal of Paediatric Dentistry,
12(1):47–52, January 2002.

[Buchheim 07] A. Buchheim and C. Benecke. Affective facial behavior of patients


with anxiety disorders during the adult attachment interview: a pilot study.
Psychotherapie Psychosomatik Medizinische Psychologie, 8(57):343–347,
March 2007.

[Burges 98] C. Burges. A Tutorial on Support Vector Machines for Pattern Recogni-
tion. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

[Casagrande 06] N. Casagrande. MultiBoost: An open source multi-class AdaBoost learner. 2005–2006. http://www.iro.umontreal.ca/~casagran/multiboost.html, last accessed 20 August 2008.

[Chang 01] C. Chang and C. Lin. LIBSVM: a library for support vector machines. 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[Chen 03] W. Chen, T. Chiang, M. Hsu, and J. Liu. The validity of eye blink rate in
Chinese adults for the diagnosis of Parkinson's disease. Clinical Neurology
and Neurosurgery, 105:90–92, 2003.

[Chen 05] P. Chen, C. Lin, and B. Schölkopf. A tutorial on ν-support vector machines:
Research Articles. Applied Stochastic Models in Business and Industry,
21(2):111–136, 2005.

[Chen 07] F. Chen and K. Kotani. Facial Expression Recognition by SVM-based Two-
stage Classifier on Gabor Features. In Machine Vision Applications, pages
453–456, 2007.

[Chiariglione 01] L. Chiariglione. Introduction to MPEG-7: Multimedia Content De-


scription Interface. Technical report, Telecom Italia Lab, Italy, 2001.

[Cohn 09] J. Cohn, T. Kruez, I. Matthews, Y. Yang, M. Hoai Nguyen, M. Padilla,


F. Zhou, and F. De la Torre. Detecting Depression from Facial Actions and
Vocal Prosody. In Affective Computing and Intelligent Interaction (ACII),
September 2009.

[Cootes 92] T. Cootes and C. Taylor. Active Shape Models - Smart Snakes. British
Machine Vision Conference, pages 266–275, 1992.

[Cootes 95] T. Cootes, C. Taylor, D. Cooper, and J. Graham. Active Shape Models—
their training and applications. Computer Vision and Image Understanding,
61(1):38–59, 1995.

[Cootes 98] T. Cootes, G. Edwards, and C. Taylor. Active Appearance Models. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 1407:484–498,
1998.

[Cootes 01] T. Cootes and C. Taylor. Statistical Models of Appearance for Computer
Vision. Technical report, University of Manchester, 2001.

[Cornelius 96] R. Cornelius. The science of emotion. New Jersey: Prentice Hall, 1996.

[Cover 67] T. Cover and P. Hart. Nearest neighbor pattern classification. Information
Theory, IEEE Transactions on, 13(1):21–27, 1967.

[Cowie 03] R. Cowie and R. Cornelius. Describing the emotional states that are ex-
pressed in speech. Speech Communication, 40:5–32, 2003.

[Cowie 05a] R. Cowie, E. Douglas-Cowie, and C. Cox. Beyond emotion archetypes:


Databases for emotion modelling using neural networks. Neural Networks,
18:371–388, 2005.

[Cowie 05b] R. Cowie, E. Douglas-Cowie, J. Taylor, S. Ioannou, M. Wallace, and


S. Kollias. An Intelligent System For Facial Emotion Recognition. IEEE
International Conference on Multimedia and Expo, 2005.

[cvG ] cvGabor C++ source code download. http://www.personal.rdg.ac.uk/~sir02mz/. Last accessed 24 January 2010.

[Daugman ] Daugman. Computer Science Tripos: 16 Lectures by J. G. Daugman. http://www.cl.cam.ac.uk/teaching/0910/CompVision/LectureNotes2010.pdf. Last accessed 28 January 2010.

[Daugman 85] J. Daugman. Uncertainty relation for resolution in space, spatial fre-
quency, and orientation optimized by two-dimensional visual cortical filters.
Journal of the Optical Society of America A: Optics, Image Science, and Vi-
sion, 2(7):1160–1169, 1985.

[Davidson 04] R. Davidson, J. S. Maxwell, and A. J. Shackman. The privileged status


of emotion in the brain. Proceedings of the National Academy of Sciences
USA, 101:11915–11916, August 2004.

[Dellaert 96] F. Dellaert, T. Polzin, and A. Waibel. Recognizing Emotion in Speech.


International Conference on Spoken Language Processing, October 1996.

[Devillers 05] L. Devillers, L. Vidrascu, and L. Lamel. Challenges in real-life emo-


tion annotation and machine learning based detection. Neural Networks,
18:407–422, 2005.

[Donato 99] G. Donato, M. Bartlett, J. Hager, P. Ekman, and T. Sejnowski. Classi-


fying Facial Actions. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 21:974–989, 1999.

[dsm 00] DSM-IV-TR : Diagnostic and Statistical Manual of Mental Disorders.


American Psychiatric Press Inc (DC), 4th edition, July 2000.

[Dunn 95] D. Dunn and W. Higgins. Optimal Gabor filters for texture segmentation.
Image Processing, IEEE Transactions on, 4(7):947–964, 1995.

[Edwards 98] G. Edwards, C. Taylor, and T. Cootes. Interpreting Face Images Using
Active Appearance Models. In Proceedings of the IEEE International Con-
ference on Automatic Face and Gesture Recognition FG’98, pages 300–305,


Nara, Japan, April 1998. IEEE.

[Ekman 71] P. Ekman and W. Friesen. Constants across cultures in the face and emo-
tion. Journal of Personality and Social Psychology, 17(2):124–129, February 1971.

[Ekman 75] P. Ekman and W. Friesen. Unmasking the Face. Prentice Hall, Englewood
Cliffs NJ, 1975.

[Ekman 76] P. Ekman and W. Friesen. Pictures of Facial Affect. Consulting Psychol-
ogists Press, Palo Alto, CA, 1976.

[Ekman 82] P. Ekman and H. Oster. Emotion in the Human Face. New York: Cam-
bridge University Press, 2nd edition, 1982.

[Ekman 97] P. Ekman and E.L. Rosenberg. What the Face Reveals. Series in Affective
Science. Oxford University Press, Oxford, UK, 1997.

[Ekman 99] P. Ekman. Handbook of cognition and emotions, chapter Basic Emotions,
pages 301–320. Wiley, New York., 1999.

[Ekman 02] P. Ekman, W. Friesen, and J. Hager. Facial Action Coding System (FACS):
Manual. Salt Lake City, UT, 2002. Research Nexus eBook.

[Ekman 03] P. Ekman. Darwin, Deception, and Facial Expression. Annals New York
Academy of Sciences, pages 205–221, 2003.

[Ellgring 96] H. Ellgring and K. Scherer. Vocal indicators of mood change in depres-
sion. Journal of Nonverbal Behavior, 20:83–110, 1996.

[Ellgring 05] K. Scherer and H. Ellgring. Multimodal markers of appraisal and emotion. Paper presented at the ISRE Conference, Bari, 2005.

[Ellgring 08] H. Ellgring. Nonverbal communication in depression. Cambridge Uni-


versity Press, Cambridge, UK, 2008.

[EMFACS ] EMFACS. http://www.face-and-emotion.com/dataface/facs/emfacs.jsp. Last accessed 2 April 2009.

[emo 10] Emotion Markup Language (EmotionML) 1.0. http://www.w3.org/TR/2010/WD-emotionml-20100729/, 2010. Last accessed 9 December 2010.

[Ezust 06] Alan Ezust and Paul Ezust. An Introduction to Design Patterns in C++
with Qt 4 (Bruce Perens Open Source). Prentice Hall PTR, Upper Saddle
River, NJ, USA, 2006.

[Fac ] Facial Muscles. http://www.csupomona.edu/~jlbath/LabPics/facial.htm. Last accessed 30 June 2009.

[Fan 05] Y. Fan, N. Cheng, Z. Wang, J. Liu, and C. Zhu. Real-Time Facial Expression
Recognition System Based on HMM and Feature Point Localization. In Jian-
hua Tao, Tieniu Tan, and Rosalind Picard, editors, Affective Computing and
Intelligent Interaction, Volume 3784 of Lecture Notes in Computer Science,
pages 210–217. Springer Berlin / Heidelberg, 2005.

[Fasel 02] I. Fasel, M. Bartlett, and J. Movellan. A Comparison of Gabor Filter Meth-
ods for Automatic Detection of Facial Landmarks. Fifth IEEE International
Conference on Automatic Face and Gesture Recognition, page 242, 2002.

[Fasel 03] B. Fasel and J. Luettin. Automatic facial expression analysis: a survey.
Pattern Recognition, 36(1):259–275, 2003.

[Flint 93] A. Flint, S. Black, I. Campbell-Taylor, G. Gailey, and C. Levinton. Ab-


normal speech articulation, psychomotor retardation, and subcortical dys-
function in major depression. Journal of Psychiatric Research, 27:309–319,


1993.

[Frank 93] M. Frank, P. Ekman, and W. Friesen. Behavioral markers and recogniz-
ability of the smile of enjoyment. Journal of Personality and Social Psychol-
ogy, 64(1):83–93, 1993.

[Freund 99] Y. Freund and R. Schapire. A short introduction to boosting. Journal of


Japanese Society for Artificial Intelligence, 14(5):771–780, 1999.

[Fridlund 83] A. Fridlund and J. Izard. Social psychophysiology: A sourcebook, chap-


ter Electromyographic studies of facial expressions of emotions and patterns
of emotion, pages 163–218. Academic Press, New York, 1983.

[FRV ] Face Recognition Vendor Test. Last accessed 24 November 2009.

[Fry 79] D. B. Fry. The Physics of Speech. Cambridge Textbooks in Linguistics.


Cambridge University Press, Cambridge, United Kingdom, 1979.

[Fu 08] C.H.Y. Fu, S.C.R. Williams, A.J. Cleare, J. Scott, M.T. Mitterschiffthaler,
N.D. Walsh, C. Donaldson, J. Suckling, C. Andrew, H. Steiner, and R.M.
Murray. Neural Responses to Sad Facial Expressions in Major Depression
Following Cognitive Behavioral Therapy. Biological Psychiatry, 64(6):505–
512, 2008.

[Gao 09] X. Gao, Y. Su, X. Li, and D. Tao. Gabor texture in active appearance
models. Neurocomputing, 72(13-15):3174–3181, 2009.

[Goeleven 06] E. Goeleven, R. De Raedt, S. Baert, and E. Koster. Deficient inhibi-


tion of emotional information in depression. Journal of Affective Disorders,
93(1-3):149–157, 2006.

[Grana 05] C. Grana, D. Bulgarelli, and R. Cucchiara. Video Clip Clustering for As-
sisted Creation of MPEG-7 Pictorially Enriched Ontologies. Technical re-
port, University of Modena and Reggio Emilia, Italy, 2005.

[Grinker 61] R. Grinker, N. Miller, M. Sabshin, R. Nunn, and J. Nunnally. The phe-
nomena of depressions. Harper and Row, New York, 1961.

[Gross 95] J. Gross and R. Levenson. Emotion elicitation using films. Cognition &
Emotion, 9:87–108, 1995.

[Gross 05] R. Gross, I. Matthews, and S. Baker. Generic vs. person specific active
appearance models. Image Vision Computing, 23(12):1080–1093, 2005.

[Gross 10] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image
Vision Comput., 28(5):807–813, 2010.

[Hajdinjak 03] M. Hajdinjak and F. Mihelič. Wizard of Oz Experiments. In EURO-


CON, Ljubljana, Slovenia, 2003.

[Harrigan 96] J.A. Harrigan and D.M. O’Connell. How do you look when feeling
anxious? Facial displays of anxiety. Personality and Individual Differences,
21:205–212, August 1996.

[Harrigan 97] J. Harrigan and K. Taing. Fooled by a Smile: Detecting Anxiety in


Others. Journal of Nonverbal Behaviour, 21:203, 1997.

[Harrigan 04] J. Harrigan, K. Wilson, and R. Rosenthal. Detecting state and trait
anxiety from auditory and visual cues: a meta-analysis. Personality and
Social Psychology Bulletin, 30(1):56–66, 2004.

[Hirschberg 05] J. Hirschberg, S. Benus, J. Brenier, and F. Enos. Distinguishing


Deceptive from Non-Deceptive Speech. In Interspeech, 2005.

[Hsu 03] C. Hsu, C. Chang, and C. Lin. A Practical Guide to Support Vector Classi-
fication. Bioinformatics, 2003.

[HUMAINE 06] HUMAINE. http://emotion-research.net/, 2006. Last


accessed 24 January 2010.

[Hunter 02] J. Hunter. Enhancing the Semantic Interoperability of Multimedia through


a Core Ontology. Technical report, Harmony Project, funded by the Co-
operative Research Centre for Enterprise Distributed Systems Technology
(DSTC), 2002.

[jaf ] The Japanese Female Facial Expression (JAFFE) Database. http://www.kasrl.org/jaffe.html. Last accessed on 26 November 2010.

[Jaimes 03] A. Jaimes and J. Smith. Semi-Automatic Data-driven Construction of


Multimedia Ontologies. International Conference on Multimedia and Expo,
2003.

[James 90] W. James. Principles of Psychology. Harvard University, 1890.

[Joormann 06] J. Joormann and I. Gotlib. Is This Happiness I See? Biases in the Iden-
tification of Emotional Facial Expressions in Depression and Social Phobia.
Journal of Affective Disorders, 93(1-3):149–157, 2006.

[Joormann 07] J. Joormann and I. Gotlib. Selective Attention to Emotional Faces


Following Recovery From Depression. Journal of Abnormal Psychology,
116(1):80–85, 2007.

[Kaiser 98] S. Kaiser, T. Wehrle, and S. Schmidt. Emotional Episodes, Facial Expres-
sions, and Reported Feelings in Human-Computer Interactions. In ISRE
Publications, pages 82–86, 1998.

[Kamarainen 06] J. Kamarainen, V. Kyrki, and H. Kälviäinen. Invariance properties of


Gabor filter-based features-overview and applications. IEEE Transactions
on Image Processing, 15(5):1088–1099, 2006.

[Kanade 00] T. Kanade, Y. Tian, and J. Cohn. Comprehensive Database for Facial
Expression Analysis. Fourth IEEE International Conference on Automatic
Face and Gesture Recognition, pages 46–53, 2000.

[Kessler 05] R. Kessler, W. Chiu, O. Demler, K. Merikangas, and E. Walters.


Prevalence, severity, and comorbidity of 12-month DSM-IV disorders in the
National Comorbidity Survey Replication. Arch Gen Psychiatry, 62(6):617–
27, 2005.

[Kohavi 95] R. Kohavi. A study of cross-validation and bootstrap for accuracy es-
timation and model selection. International Joint Conference on Artificial
Intelligence, pages 1137–1143, 1995.

[Koike 98] K. Koike, H. Suzuki, and H. Saito. Prosodic Parameters in Emotional


Speech. In International Conference on Spoken Language Processing, pages
679–682, 1998.

[Kvaal 05] K. Kvaal, I. Ulstein, I. Nordhus, and K. Engedal. The Spielberger State-
Trait Anxiety Inventory (STAI): the state scale in detecting mental disorders
in geriatric patients. International Journal of Geriatric Psychiatry, 20:629–
634, 2005.

[Lades 93] M. Lades, J. Vorbrüggen, J. Buhmann, J. Lange, C. Malsburg, R. Würtz, and
W. Konen. Distortion Invariant Object Recognition in the Dynamic Link
Architecture. IEEE Trans. Computers, 42:300–311, 1993.

[Ladouceur 06] C. Ladouceur, R. Dahl, D. Williamson, B. Birmaher, D. Axelson,


N. Ryan, and B. Casey. Processing emotional facial expressions influ-
ences performance on a Go/NoGo task in pediatric anxiety and depres-
sion. Journal of Child Psychology and Psychiatry and Allied Disciplines,
47(11):1107–1115, Nov 2006.

[Lagoze 01] C. Lagoze and J. Hunter. The ABC Ontology and Model. Technical re-
port, Cornell University Ithaca, NY and DSTC Pty, Ltd. Brisbane, Australia,
2001.

[Lang 05] P. Lang, M. Bradley, and B. Cuthbert. International affective picture sys-
tem (IAPS): Affective ratings of pictures and instruction manual. Technical
Report A-6, University of Florida, Gainesville, FL, 2005.

[Lazarus 91] R. Lazarus. Emotion and adaptation. Oxford University Press, New York, 1991.

[Lee 96] T. Lee. Image Representation Using 2D Gabor Wavelets. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 18(10):959–971, 1996.

[Lee 07] B. Lee, S. Cho, and H. Khang. The neural substrates of affective processing
toward positive and negative affective pictures in patients with major de-
pressive disorder. Progress in Neuro-Psychopharmacology and Biological
Psychiatry, 31(7):1487–1492, 2007.

[Li 07] F. Li and K. Xu. Optimal Gabor Kernel’s Scale and orientation selection
for face classification. Optics & Laser Technology, 39(4):852–857, 2007.

[Lien 98] J. Lien, J. Cohn, T. Kanade, and C. Li. Automated Facial Expression Recog-
nition Based on FACS Action Units. In Third IEEE International Conference
on Automatic Face and Gesture Recognition, pages 390–395, April 1998.

[Liscombe 05] J. Liscombe, G. Riccardi, and D. Hakkani-Tür. Using Context to Im-


prove Emotion Detection in Spoken Dialog Systems. In EUROSPEECH’05,
9th European Conference on Speech Communication and Technology,
pages 1845–1848, September 2005.

[Littlewort 06] G. Littlewort, M. Bartlett, I. Fasel, J. Susskind, and J. Movellan. An


automatic system for measuring facial expression in video. Computer Vi-
sion and Image Understanding, Special Issue on Face Processing in Video,
24(6):615–625, 2006.

[Littlewort 07] G. Littlewort, M. Bartlett, and K. Lee. Faces of pain: Automated Mea-
surement of Spontaneous Facial Expressions of Genuine and Posed Pain.
9th International Conference on Multimodal Interfaces, pages 15–21, 2007.

[Littlewort 09] G. Littlewort, M. Bartlett, and K. Lee. Automatic coding of facial


expressions displayed during posed and genuine pain. Image and Vision
Computing, 27:1797–1803, November 2009.

[Liu 04] C. Liu. Gabor-based Kernel PCA with Fractional Power Polynomial Models
for Face Recognition. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 26:572–581, 2004.

[Liu 06] W. Liu and Z. Wang. Facial Expression Recognition Based on Fusion of
Multiple Gabor Features. 18th International Conference on Pattern Recog-
nition, 3:536–539, 2006.

[Lucas 81] B. Lucas and T. Kanade. An Iterative Image Registration Technique with
an Application to Stereo Vision. International Joint Conferences on Artificial
Intelligence, pages 674–679, April 1981.

[Lucey 06] S. Lucey, I. Matthews, C. Hu, Z. Ambadar, F. de la Torre, and J. Cohn.


AAM Derived Face Representations for Robust Facial Action Recognition.
Automatic Face and Gesture Recognition, IEEE International Conference
on, 0:155–162, 2006.

[Luo 06] H. Luo and J. Fan. Building concept ontology for medical video annotation.
In MULTIMEDIA ’06: Proceedings of the 14th annual ACM international
conference on Multimedia, pages 57–60, New York, NY, USA, 2006. ACM.

[Manjunath 96] B. Manjunath and W. Ma. Texture Features for Browsing and Re-
trieval of Image Data. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(8):837–842, August 1996.

[Martin 08] C. Martin, U. Werner, and H. Gross. A real-time facial expression recog-
nition system based on Active Appearance Models using gray images and
edge images. 8th IEEE International Conference on Automatic Face & Ges-
ture Recognition, pages 1–6, Sept. 2008.

[Matthews 04] I. Matthews and S. Baker. Active Appearance Models Revisited. In


International Journal of Computer Vision, Volume 60, pages 135–164, 2004.

[McDuff 10] D. McDuff, R. Kaliouby, K. Kassam, and R. Picard. Affect valence infer-
ence from facial action unit spectrograms. In Computer Vision and Pattern
Recognition Workshops (CVPRW), 2010 IEEE Computer Society Confer-
ence on, pages 17–24, June 2010.

[McIntyre 06] G. McIntyre and R. Göcke. Researching Emotions in Speech. In 11th


Australasian International Conference on Speech Science and Technology,
pages 264–369, Auckland, New Zealand, December 2006. ASSTA.

[McIntyre 07] G. McIntyre and R. Göcke. Towards Affective Sensing. In Proceed-


ings of the 12th International Conference on Human-Computer Interaction
HCII2007, Volume 3 of Lecture Notes in Computer Science LNCS 4552,
pages 411–420, Beijing, China, July 2007. Springer.

[McIntyre 08a] G. McIntyre and R. Göcke. A Composite Framework for Affective


Sensing. In Proceedings of Interspeech 2008, pages 2767–2770. ISCA, 22-
26 September 2008.

[McIntyre 08b] G. McIntyre and R. Göcke. Affect and Emotion in Human-Computer


Interactions, chapter The Composite Sensing of Affect, pages 104–115.
Lecture Notes in Computer Science LNCS 4868. Springer, August 2008.

[McIntyre 09] G. McIntyre, R. Göcke, M. Hyett, M. Green, and M. Breakspear. An


Approach for Automatically Measuring Facial Activity in Depressed Sub-
jects. In 3rd International Conference on Affective Computing and In-
telligent Interaction and Workshops, ACII 2009, September 2009. DOI
10.1109/ACII.2009.5349593.

[Medforda 05] N. Medford, M. Phillips, B. Brierley, M. Brammer, E. Bullmore, and A. David. Emotional memory: Separating content and context. Psychiatry Research: Neuroimaging, 138:247–258, 2005.

[Millar 04] J. B. Millar, M. Wagner, and R. Göcke. Aspects of Speaking-Face Data


Corpus Design Methodology. In International Conference on Spoken Lan-
guage Processing 2004, Volume II, pages 1157–1160, Jeju, Korea, October
2004.

[MMI ] MMI Facial Expression Database. http://www.mmifacedb.com/.


Last accessed 28 November 2010.

[Mogg 05] K. Mogg and B. Bradley. Attentional Bias in Generalized Anxiety Disorder
Versus Depressive Disorder. Cognitive Therapy and Research, 29:29–45,
2005.

[Monk 08] C. Monk, R. Klein, and E. Telzer et al. Amygdala and Nucleus Accumbens
Activation to Emotional Facial Expressions in Children and Adolescents at
Risk for Major Depression. American Journal of Psychiatry, 165(3):90–98,
Jan 2008.

[Moore 08] E. Moore, M. Clements, J. Peifer, and L. Weisser. Critical analysis of


the impact of glottal features in the classification of clinical depression
in speech. IEEE Transactions on Biomedical Engineering, 55(1):96–107,
2008.

[Movellan 08] J. Movellan. Tutorial on Gabor Filters. Tutorial paper, http://mplab.ucsd.edu/tutorials/pdfs/gabor.pdf, 2008.

[MPEG-7 ] MPEG-7. Multimedia Content Description Interface. http://www.darmstadt.gmd.de/mobile/MPEG7. Last accessed 23 April 2010.

[MultiBoost 06] MultiBoost. An open source multi-class AdaBoost learner. http://www.iro.umontreal.ca/~casagran/multiboost.html, 2005–2006.

[Murray 93] I. Murray and L. Arnott. Toward the simulation of emotion in synthetic
speech. Journal Acoustical Society of America, 93(2):1097–1108, 1993.

[Navigli 03] R. Navigli, P. Velardi, and A. Gangemi. Ontology Learning and Its Ap-
plication to Automated Terminology Translation. IEEE Intelligent Systems,
pages 22–31, 2003.

[NBS ] NBS. Neurobehavioral Systems. http://www.neurobs.com/. last


accessed 14 April 2010.

[NIMH ] NIMH. National Institute of Mental Health. http://www.nimh.nih.gov/health/publications/anxiety-disorders/index.shtml. Last accessed 10 April 2010.

[Nixon 01] M. Nixon and A. Aguado. Feature Extraction and Image Processing.
MPG Books Ltd, Bodmin, Cornwall, 2001.

[Obrenovic 05] Z. Obrenovic, N. Garay, J. López, I. Fajardo, and I. Cearreta. An On-


tology for Description of Emotional Cues. Technical report, Laboratory for
Multimodal Communications, University of Belgrade, 2005.

[Okada 09] T. Okada, T. Takiguchi, and Y. Ariki. Pose robust and person independent
facial expressions recognition using AAM selection. IEEE 13th International
Symposium on Consumer Electronics, pages 637–638, May 2009.

[OpenCV ] OpenCV. Image Processing Library Download. http://sourceforge.net/projects/opencvlibrary/files/. Last accessed 10 November 2009.

[Pantic 00] M. Pantic and L. Rothkrantz. Automatic Analysis of Facial Expressions:


The State of the Art. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22(12):1424–1445, 2000.

[Pantic 04a] M. Pantic and L. Rothkrantz. Case-based reasoning for user-profiled


recognition of emotions from face images. Proceedings of the 2004 IEEE
International Conference on Multimedia and Expo, ICME 2004, 27-30 June
2004, Taipei, Taiwan, 2004.

[Pantic 04b] M. Pantic and L. Rothkrantz. Facial action recognition for facial
expression analysis from static face images. IEEE Transactions on Systems,
Man, and Cybernetics, Part B, 34(3):1449–1461, 2004.

[Pantic 07] M. Pantic and M. Bartlett. Face recognition, chapter Machine Analysis
of Facial Expressions, pages 377–416. I-Tech Education and Publishing,
Vienna, Austria, July 2007.

[Parker 02] G. Parker and K. Roy. Examining the utility of a temperament model
for modelling non-melancholic depression. Acta Psychiatrica Scandinavica,
106(1):54–61, 2002.

[Parker 06] G. Parker, V. Manicavasagar, J. Crawford, L. Tully, and G. Gladstone.


Assessing personality traits associated with depression: the utility of a tiered
model. Psychological Medicine, 36(8):1131–1139, August 2006.

[Picard 97] R.W. Picard. Affective Computing. MIT Press, Cambridge (MA), USA,
1997.

[Polydoros 06] P. Polydoros, C. Tsinaraki, and S. Christodoulakis. GraphOnto: OWL-Based Ontology Management and Multimedia Annotation in the DS-MIRF Framework. Technical report, Lab. of Distributed Multimedia Information Systems, Technical University of Crete (MUSIC/TUC), University Campus, Kounoupidiana, Chania, Greece, 2006.

[Pree 95] W. Pree. Design Patterns for Object-Oriented Software Development. Ad-
dison Wesley Longman, 1st edition, 1995.

[PRISM ] PRISM. http://research.cs.tamu.edu/prism/. Last accessed


on 26 February 2010.

[Protégé ] Protégé. http://protege.stanford.edu/. last accessed, 22 April


2010.

[Qt 09] Qt. A cross-platform application and UI framework. http://qt.nokia.com/, November 2009. Last accessed 10 November 2009.

[Rahman 05] A. Rahman, I. Kiringa, and A. Saddik. An Ontology for Unification of


MPEG-7 Semantic Descriptions. Technical report, School of Information
Technology and Engineering, University of Ottawa, Ontario, Canada, 2005.

[RapidMiner ] RapidMiner. http://rapid-i.com/content/view/181/190/. Last accessed 4 August 2010.

[Reed 07] L. Reed, M. Sayette, and J. Cohn. Impact of depression on response to com-
edy: A dynamic facial coding analysis. Journal of Abnormal Psychology,
117(4):804–809, May 2007.

[Rege 05] M. Rege, M. Dong, F. Fotouhi, M. Siadat, and L. Zamorano. Using MPEG-
7 to build a Human Brain Image Database for Image-guided Neurosurgery.
Medical Imaging 2005: Visualization, Image-Guided Procedures, and Dis-
play, pages 512–519, 2005.

[Renneberg 05] B. Renneberg, K. Heyn, R. Gebhard, and S. Bachmann. Facial


expression of emotions in borderline personality disorder and depression.
Journal of Behavior Therapy and Experimental Psychiatry, 36(3):183–196,
2005.

[Ro 01] Yong Man Ro, Munchurl Kim, Ho Kyung Kang, and B. S. Manjunath.
MPEG-7 Homogeneous Texture Descriptor. Electronics and Telecommu-
nications Research Institute Journal, 23:41–51, 2001.

[Russell 94] J. Russell. Is there universal recognition of emotion from facial


expression? A review of the cross-cultural studies. Psychological Bulletin,
115:102–141, 1994.

[Saatci 06] Y. Saatci and C. Town. Cascaded classification of gender and facial
expression using active appearance models. In Automatic Face and Ges-
ture Recognition, 2006. FGR 2006. 7th International Conference on, pages
393–398, April 2006.

[Salembier 01] P. Salembier and J. Smith. MPEG-7 Multimedia Description Schemes.


IEEE Transactions on Circuits and Systems for Video Technology, 11(6):748–759, 2001.

[Sander 05] D. Sander, D. Grandjean, and K. Scherer. A systems approach to ap-


praisal mechanisms in emotion. Neural Networks, 18:317–352, 2005.

[Saragih 06] J. Saragih and R. Göcke. Iterative Error Bound Minimisation for AAM
Alignment. International Conference on Pattern Recognition, 2:1192–1195,
2006.

[Saragih 08] J. Saragih. The Generative Learning and Discriminative Fitting of Linear
Deformable Models. PhD thesis, Research School of Information Sciences
and Engineering, The Australian National University, Canberra, Australia,
2008.

[Saragih 09] J. Saragih and R. Göcke. Learning AAM fitting through simulation. Pat-
tern Recognition, 42(11):2628–2636, 2009.

[Scherer 99] K. R. Scherer. Handbook of cognition and emotion, chapter Appraisal


theory. New York: John Wiley, 1999.

[Scherer 03] K. R. Scherer. Vocal communication of emotion: A review of research


paradigms. Speech Communication, 40:227–256, 2003.

[Scherer 04] K. R. Scherer. HUMAINE Deliverable D3c: Preliminary plans


for exemplars: theory. Retrieved 26 October 2006 from http://emotion-research.net/publicnews/d3c/, 2004.

[Schröder 05] M. Schröder and R. Cowie. HUMAINE project: Developing a Con-


sistent View on Emotion-oriented Computing. Retrieved 26 October 2006 from http://emotion-research.net/aboutHUMAINE, 2005.

[Schröder 07] M. Schröder, L. Devillers, K. Karpouzis, J. Martin, C. Pelachaud, C. Pe-


ter, H. Pirker, B. Schuller, J. Tao, and I. Wilson. What Should a Generic
Emotion Markup Language Be Able to Represent? In Ana Paiva, Rui Prada,
and Rosalind W. Picard, editors, Affective Computing & Intelligent Interac-
tion, Volume 4738 of Lecture Notes in Computer Science, pages 440–451.
Springer, 2007.

[Schuller 09a] B. Schuller, R. Müller, F. Eyben, J. Gast, B. Hörnler, M. Wöllmer,


G. Rigoll, A. Höthker, and H. Konosu. Being bored? Recognising natural
interest by extensive audiovisual integration for real-life application. Image
and Vision Computing, 27:1760–1774, November 2009.

[Schuller 09b] B. Schuller, S. Steidl, and A. Batliner. The INTERSPEECH 2009 Emo-
tion Challenge. In ISCA, editor, Proceedings of Interspeech 2009, pages
312–315, 2009.

[Sebe 03] N. Sebe and M. Lew. Robust Computer Vision Theory and Applications.
Springer, 2003.

[Sebe 05] N. Sebe, I. Cohen, A. Garg, and Th. Huang. Machine Learning in Computer
Vision. Springer, 2005.

[Shen 05] L. Shen. Recognizing Faces — An Approach Based on Gabor Wavelets.


PhD thesis, School of Computer Science, University of Nottingham, 2005.

[Shen 06] L. Shen and L. Bai. A review on Gabor wavelets for face recognition. Pat-
tern Analysis & Applications, 9(2-3):273–292, 2006.

[Shen 07] L. Shen, L. Bai, and Z. Ji. Advances in visual information systems, Volume
4781, chapter A SVM Face Recognition Method Based on Optimized Gabor
Features, pages 165–174. Springer Berlin / Heidelberg, 2007.

[Shigeno 98] S. Shigeno. Cultural Similarities and Differences in the Recognition of


Audio-Visual Speech Stimuli. In 5th International Conference on Spoken
Language Processing, Volume 1057, pages 281–284. International Confer-
ence on Spoken Language Processing, 1998.

[Song 05] D. Song, H. Lie, M. Cho, H. Kim, and P. Kim. Image and video retrieval,
Volume 3568/2005, chapter Domain Knowledge Ontology Building for Se-
mantic Video Event Description, pages 267–275. Springer Berlin / Heidel-
berg, 2005.

[Stegmann 02] M. Stegmann and D. Gomez. A Brief Introduction to Statistical Shape


Analysis. Technical report, Technical University of Denmark (DTU), March 2002.

[Stibbard 01] R. Stibbard. Vocal expression of emotions in non-laboratory speech: An


investigation of the Reading/Leeds Emotion in Speech Project annotation
data. PhD thesis, University of Reading, UK, 2001.

[Strupp 08] S. Strupp, N. Schmitz, and K. Berns. Visual-Based Emotion Detection for
Natural Man-Machine Interaction. In KI ’08: Proceedings of the 31st an-
nual German conference on Advances in Artificial Intelligence, pages 356–


363, Berlin, Heidelberg, 2008. Springer-Verlag.

[Sung 08] J. Sung and D. Kim. Pose-Robust Facial Expression Recognition Using
View-Based 2D 3D AAM. Systems, Man and Cybernetics, Part A: Systems
and Humans, IEEE Transactions on, 38(4):852–866, July 2008.

[ten Bosch 00] L. ten Bosch. Emotions: What is Possible in the ASR Framework.
SpeechEmotion, 2000.

[Tian 01] Y. Tian, T. Kanade, and J. Cohn. Recognizing Action Units for Facial
Expression Analysis. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 23(2):97–115, Feb 2001.

[Tong 09] Y. Tong, X. Liu, F. Wheeler, and P. Tu. Automatic Facial Landmark Labeling
with Minimal Supervision. Computer Vision and Pattern Recognition, pages
2097–2104, June 2009.

[Tsai 01] D. Tsai, S. Wu, and M. Chen. Optimal Gabor filter design for texture
segmentation using stochastic optimization. Image Vision Computing,
19(5):299–316, 2001.

[Tsinaraki 03] C. Tsinaraki, P. Polydoros, F. Kazasis, and S. Christodoulakis.


Ontology-based Semantic Indexing for MPEG-7 and TV-Anytime Audiovi-
sual Content. Technical report, Lab. of Distributed Multimedia Information
Systems and Applications (MUSIC/TUC), Technical University of Crete
Campus, 2003.

[Tsinaraki 07] C. Tsinaraki, P. Polydoros, and S. Christodoulakis. Interoperability


Support between MPEG-7/21 and OWL in DS-MIRF. IEEE Transactions on
Knowledge and Data Engineering, 19(2):219–232, 2007.

[Valstar 06a] M. Valstar and M. Pantic. Biologically vs. Logic Inspired Encoding of
Facial Actions and Emotions in Video. In ICME, pages 325–328, 2006.

[Valstar 06b] M. Valstar and M. Pantic. Fully Automatic Facial Action Unit Detection
and Temporal Analysis. In Conference on Computer Vision and Pattern
Recognition Workshop, 2006.

[Vapnik 95] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag,


New York, NY, USA, 1995.

[Velten 68] E. Velten. A laboratory task for induction of mood states. Behaviour
Research and Therapy, 6:473–482, 1968.

[Vinciarelli 08] A. Vinciarelli, M. Pantic, H. Bourlard, and A. Pentland. Social Signal


Processing: State-of-the-Art and Future Perspectives of an Emerging Do-
main. In 16th ACM International Conference on Multimedia, pages 1061–
1070, New York, NY, USA, 2008. ACM.

[Vinciarelli 09] A. Vinciarelli, M. Pantic, and H. Bourlard. Social signal processing:


Survey of an emerging domain. Image Vision Computing, 27(12):1743–
1759, 2009.

[Viola 01] P. Viola and M. Jones. Robust real-time face detection. Proceedings Eighth
IEEE International Conference on Computer Vision ICCV 2001, 2:747,
2001.

[VXL ] VXL. http://vxl.sourceforge.net/. Last accessed 10 November 2009.

[Wallhoff ] F. Wallhoff. Database with Facial Expressions and Emotions from Technical University of Munich (FEEDTUM). http://www.mmk.ei.tum.de/~waf/fgnet/feedtum.html. Last accessed on 26 February 2010.

[Wallhoff 06] F. Wallhoff, B. Schuller, M. Hawellek, and G. Rigoll. Efficient Recogni-


tion of Authentic Dynamic Facial Expressions on the Feedtum Database. In
International Conference on Multimedia and Expo, pages 493–496. IEEE,
2006.

[Wang 02] X. Wang and H. Qi. Face Recognition Using Optimal Non-Orthogonal
Wavelet Basis Evaluated by Information Complexity. In 16th International
Conference on Pattern Recognition, Volume 1, page 10164, 2002.

[Wen 10] C. Wen and Y. Zhan. Facial expression recognition based on combined
HMM. International Journal of Computing and Applications in Technology,
38:172–176, July 2010.

[Whissell 89] C. Whissell. The Dictionary of Affect in Language. In Emotion: Theory,


Research and Experience, 1989.

[Whitehill 09] J. Whitehill, G. Littlewort, I. Fasel, M. Bartlett, and J. Movellan. To-


ward Practical Smile Detection. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 31:2106–2111, 2009.

[Wiskott 97] L. Wiskott, J. Fellous, N. Krüger, and C. Malsburg. Face Recognition by


Elastic Bunch Graph Matching. IEEE Trans. Pattern Analysis and Machine
Intelligence, 19:775–779, 1997.

[Wollmer 08] M. Wöllmer, F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas-Cowie, and R. Cowie. Abandoning Emotion Classes - Towards Continuous Emotion Recognition with Modelling of Long-Range Dependencies. In Proceedings of Interspeech 2008, Brisbane, Australia, pages 597–600. ISCA, 22-26 September 2008.

[Wu 04] B. Wu, H. Ai, and R. Liu. Glasses Detection by Boosting Simple Wavelet
Features. International Conference on Pattern Recognition, 1:292–295,
2004.

[Yacoub 03] S. Yacoub, S. Simske, X. Lin, and J. Burns. Recognition of Emotions in


Interactive Voice Response Systems. Technical report, HP Laboratories Palo
Alto, 2003.

[Zhou 06] M. Zhou and H. Wei. Face Verification Using Gabor Wavelets and Ad-
aBoost. In International Conference on Pattern Recognition, pages 404–407,
2006.

[Zhou 09] M. Zhou and H. Wei. Facial Feature Extraction and Selection by Gabor
Wavelets and Boosting. In 2nd International Congress on Image and Signal
Processing, pages 1–5, 2009.
