
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

Facial Micro-Expressions:
An Overview

By Guoying Zhao, Fellow IEEE, Xiaobai Li, Senior Member IEEE, Yante Li, Member IEEE, and Matti Pietikäinen, Life Fellow IEEE

ABSTRACT | Micro-expression (ME) is an involuntary, fleeting, and subtle facial expression. It may occur in high-stake situations when people attempt to conceal or suppress their true feelings. Therefore, MEs can provide essential clues to people's true feelings and have many potential applications, such as national security, clinical diagnosis, and interrogation. In recent years, ME analysis has gained much attention in various fields due to its practical importance, especially automatic ME analysis in computer vision, since MEs are difficult to perceive with the naked eye. In this survey, we provide a comprehensive review of ME development in the field of computer vision, from ME studies in psychology and early attempts in computer vision to various computational ME analysis methods and future directions. Four main tasks in ME analysis are specifically discussed, including ME spotting, ME recognition, ME action unit detection, and ME generation, in terms of approaches, recent developments, and challenges. Through this survey, readers can understand MEs from the perspectives of both psychology and computer vision and apprehend future research directions in ME analysis.

KEYWORDS | Affective computing; computer vision; machine learning; micro-expression (ME); survey.

Manuscript received 31 August 2022; revised 21 April 2023 and 28 April 2023; accepted 5 May 2023. This work was supported in part by the Academy of Finland for the Academy Professor project EmotionAI under Grant 336116 and Grant 345122, in part by the University of Oulu & The Academy of Finland Profi 7 under Grant 352788, and in part by the Ministry of Education and Culture of Finland for the AI Forum project. (Corresponding author: Guoying Zhao.)
This work involved human subjects or animals in its research. The authors confirm that all human/animal subject research procedures and protocols are exempt from review board approval.
The authors are with the Center for Machine Vision and Signal Analysis, University of Oulu, 90570 Oulu, Finland (e-mail: guoying.zhao@oulu.fi).
Digital Object Identifier 10.1109/JPROC.2023.3275192
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

I. INTRODUCTION

Emotions are neurophysiological responses to external and/or internal stimuli [1], [2], [3]. They are associated with feelings, thoughts, behavioral responses, and pleasure or displeasure [3], and they influence human cognition, decision-making, perception, learning, and so on [4], [5]. Thus, emotions play a crucial role in everyday human life. However, expressing and perceiving emotions are not easy for some people, such as those who suffer from psychological disorders, e.g., alexithymia [6].

In recent years, research on emotions has grown significantly in interdisciplinary fields spanning psychology and computer science. In the beginning, affects were mainly studied by psychologists. The concept of affective computing was introduced by Picard [7] in 1997, proposing to automatically quantify and recognize human affects based on psychophysiology, biomedical engineering, computer science, and artificial intelligence. Affective computing aims to endow computers with human-like capabilities to observe, understand, and interpret human affects, referring to feelings, emotions, and moods [7], [8], [9].

Psychological research demonstrates that the body language we use, specifically our facial expressions, relates to 55% of the message when people perceive others' feelings [10], [11]. Facial expressions are thus a major channel that humans use to convey emotions. Analyzing facial expressions is therefore meaningful and important, as can be seen from the wide range of studies on facial expressions. However, people may try to conceal their true feelings under certain conditions, e.g., when they want to avoid losses or gain benefits [12]. In such cases, facial micro-expressions (MEs) may occur.

Recent research illustrates that, besides ordinary facial expressions, affect also manifests itself in the special form of MEs. MEs are spontaneous, subtle, and fleeting facial movements reacting to an emotional stimulus [13], [14].



MEs are almost impossible to control through one's willpower. Learning to detect and recognize MEs is critical for emotional intelligence and has various potential applications, such as clinical diagnosis, education, business interactions, and interrogation. Due to the practical importance of ME analysis in daily life, researchers have recently shown increasing interest in ME analysis.

In this article, we review ME studies from psychology to computer science, focusing on research progress in computer vision and machine learning. Although a few ME surveys [15], [16], [17], [18], [19], [20], [21] have been published, they focus only on machine learning approaches for ME spotting and recognition and are intended for researchers and professionals in the related field. Different from previous surveys, this overview introduces the general development of ME analysis from psychological study to automatic ME analysis in the computer vision field. The goal is to provide a tutorial that can serve as a reference point for everyone interested in MEs. We start from the discovery of the ME phenomenon and explorations in psychological studies and then track studies in cognitive neuroscience on the neural mechanism beneath the behavioral phenomenon. After that, we introduce the technological studies of MEs in the computer vision field, from early attempts to advanced machine learning methods for recognition, spotting, related AU detection tasks, and ME synthesis or generation. Finally, open challenges and future directions are identified.

The rest of this article is organized as follows. Section II presents ME studies in psychology. Section III introduces the early attempts of computer vision studies. Section IV discusses spontaneous ME datasets. Section V reviews computational methods for ME analysis. The open challenges and future directions are discussed in Section VI.

II. ME STUDIES IN PSYCHOLOGY

The research on MEs can be traced back to 1966, when Haggard and Isaacs [22] first reported finding a kind of short-lived facial behavior in psychotherapy that is too fast to be observed with the naked eye. In their report, these short-lived facial behaviors were referred to as micromomentary facial expressions. This phenomenon was also found by Ekman and Friesen [23] one year later, who named it micro-facial expression. Ekman and Friesen studied clinical depression patients who claimed they had recovered but later committed suicide. When examining the films of one patient in slow motion, they found that, although the patient appeared to be happy most of the time, a fleeting agonized look lasting only 1/12 s revealed the patient's strong negative feelings. Upon the doctor's questioning, the patient confessed that she was trying to hide her plan to commit suicide. This finding illustrates the existence and essence of MEs as important behavioral clues revealing humans' hidden true emotions.

MEs have different appearance characteristics compared to ordinary facial expressions [referred to as macro-expressions (MaEs)]. MaEs can involve multiple facial muscles with different levels of intensity (see subtle expressions [24] for MaEs with low intensity) and last between 0.5 and 4 s [25], [26], while MEs involve fewer facial muscles (usually only one or two) with very low intensity and short duration (the criteria for ME duration vary among researchers, but it is commonly agreed that an ME should last less than 500 ms [27], [28]). This is because MEs occur under special conditions [23]: when people attempt to cover their true feelings in high-stake situations, some feelings may involuntarily leak in the form of an ME. With such constrained conditions and strong intentions to inhibit and disguise, it is natural that MEs present in a condensed and fragmented way. A pair of examples is shown in Fig. 1 to illustrate the differences between MaEs and MEs.

Fig. 1. Examples of MaEs and MEs from the same person. Compared to an MaE, an ME involves fewer muscles [action units (AUs)] with much lower intensity and shorter duration. AU6, AU7, AU12, and AU25 represent cheek raiser, lid tightener, lip corner puller, and lips part, respectively.

Ordinary persons can recognize MaEs effortlessly, but it is very challenging, if not completely impossible, to recognize MEs with the naked eye. According to psychological studies [29], [30], without special training, people perform only slightly better than chance on ME recognition (MER). On the other side, MEs are very important behavioral clues for lie detection, and some special occupations (e.g., law enforcement and psychotherapy) could benefit if there were a way to train their staff to better detect and recognize MEs. In 2002, a micro-expression training tool (METT) [30] was developed for such a purpose. It was reported [14] that, after a training of 1.5 h, a trainee's MER ability can be improved by 30%–40%. One limitation of METT is that it is composed of "man-made" ME clips, e.g., created by inserting one happy face image into a sequence of neutral faces, which differ from real, spontaneous MEs. Besides, Matsumoto and his colleagues also developed training tools¹ for both micro-expression recognition (MiX) and subtle expression recognition (SubX), as it was reported [31] that the ability to read subtle expressions is also related to MER and lie detection.

¹ https://www.humintell.com/


Such tools have been commercialized for both academic research and business purposes.

Another area of psychological research that is closely related to MEs is the facial action coding system (FACS). FACS is a comprehensive tool to measure facial movements objectively [32]. FACS encodes and taxonomizes each visually discernible facial muscle movement according to the anatomical structure of the human face; each such movement is called an AU. FACS provides detailed descriptions and coding criteria for 28 main-code AUs, together with dozens of side codes for eye and head movements. Each action/movement can be coded both with a category (e.g., AU4 and AU12) and an intensity level (A–E: A for the weakest and E for the strongest). FACS and AUs are very important for studies of FER and MER, although not all AUs are related to emotions. The FACS Investigator's Guide provides an AU-emotion table (page 136) that maps AU combinations to corresponding emotion categories according to observational evidence from psychological studies. Such AU combinations are considered the "prototypes" of facial expressions, e.g., AU6 + 12 for happiness and AU1 + 2 + 5 + 25 for surprise. However, it is worth mentioning that the mapping table is for MaE emotions and not directly for MEs. Although, in practice, most current ME datasets follow FACS and adopt the same AU-mapping rules as MaEs based on the premise that "MEs and MaEs share the same emotion categories and appearances," this premise has not been systematically verified yet, and the correspondences between AUs and ME emotion categories are still open for discussion [33].

The neural mechanisms of MEs have also been explored and explained. Two neural pathways [34] originating from different brain areas are involved in mediating facial expressions. One pathway originates from the subcortical areas (i.e., the amygdala) and drives involuntary emotional expression, including facial expressions and other bodily responses, while the other pathway originates from the cortical motor strip and drives voluntary facial actions. According to Matsumoto and Hwang [35], when people feel strong emotions but try to control or suppress their expressions in high-stake situations, the two pathways meet in the middle and engage in a neural "tug of war" over control of the face, which may lead to fleeting leakage of MEs.

Over the years, ME study has gained the interest of researchers from various fields. It even inspired the award-winning American crime drama television series "Lie to Me." The show invited an ME researcher to analyze each episode's script and teach the actors and staff the science of deception detection. Many episodes of "Lie to Me" referred to true experiences with MEs.

III. EARLY ATTEMPTS FROM COMPUTER VISION STUDY

The ME analysis topic was first introduced to the computer vision field around the year 2009. Early research on automatic ME analysis mainly concerns ME spotting and MER. The ME spotting task aims to detect and locate ME occurrences in context clips, and the recognition task aims to classify MEs into emotional categories. ME data are needed to research the two tasks, but no ME dataset was available at that time. It is not easy to induce and collect spontaneous MEs as they only occur on very special occasions. To overcome this obstacle, some early works attempted to build posed ME datasets [36], [37] and used them for evaluating proposed methods for the ME spotting and recognition tasks.

Polikovsky et al. [36] collected the first posed ME dataset by asking participants to act out facial expressions as fast as possible. Eleven participants were enrolled, and the data were recorded at a pixel resolution of 640 × 480 and a frame rate of 200 frames/s (fps) in a laboratory environment. Each frame was labeled with AUs, and a 3-D-gradient orientation histogram descriptor was proposed for AU analysis in specific facial regions, which was related to the recognition of MEs. However, the authors did not mention the length of such posed MEs, nor did they consider the temporal progress of an ME in their experiments.

Meanwhile, Shreve et al. [37], [38] also conducted work on posed ME data. They built a dataset called USF-HD, in which ME examples were shown to the participants, and the participants were asked to mimic the ME motions. In this way, they collected 100 posed ME samples with a pixel resolution of 720 × 1280 at 29.7 fps. According to the authors, the USF-HD dataset contains both MEs and MaEs (a detailed description of the data is limited), and they proposed a spotting method using spatiotemporal strain, which can spot 74% of all MEs and 85% of all MaEs in the USF-HD dataset.

These studies represent the early attempts at building computational algorithms for automatic ME analysis, and they contributed by drawing more attention from computer vision researchers to the topic of MEs. On the other side, the limitation of these studies is obvious: posed MEs differ from real, naturally occurring MEs in appearance in both the spatial and temporal domains; thus, it is questionable whether methods trained on posed ME data could help detect and recognize real MEs in practice. Early posed ME datasets are no longer used in current ME studies, and multiple spontaneous ME datasets were later built and shared with the research community, which are introduced in Section IV.

IV. DATASETS

In recent years, several spontaneous ME datasets (SMIC [39] and its extended version SMIC-E, CASME [41], CASME II [42], CAS(ME)2 [43], SAMM [44], MEVIEW [46], CAS(ME)3 [47], the micro-and-macro expression warehouse (MMEW) [20], and 4DME [33]) have been built, and details of these datasets are summarized in Table 1.


Fig. 2. Sample figures of the most popular ME datasets. (a) CASME. (b) CASME II. (c) CAS(ME)2. (d) SAMM. (e) MEVIEW. (f) MMEW. (g) SMIC. (h) CAS(ME)3. (i) 4DME.

One common way to induce spontaneous MEs is to use movie clips with strong emotional content. Pfister et al. [48] first proposed a protocol for inducing and collecting spontaneous MEs. They asked each participant to watch movie clips with strong emotional content in a monitored lab and to try their best to keep a poker face so as not to reveal their true feelings. If they failed (as noticed by the experimenter via the monitoring camera), there would be a punishment, e.g., filling out a long questionnaire. Many of the later ME datasets followed the same protocol, as it was demonstrated to be effective. However, this emotion-eliciting approach also has limitations, as the "movie watching" setup is simplified compared with practical scenes in daily life. Some studies considered different scenarios for ME data collection. For example, in real life, MaEs and MEs often occur during interpersonal interactions. Husák et al. [46] collected an in-the-wild ME dataset, MEVIEW. The samples were taken from TV series of poker games. Poker games are one kind of high-stake scenario in which the players need to hide or disguise their true emotions from their opponents to achieve a win, so MEs are likely to occur. The MEVIEW dataset brought a new sight into ME study, but, on the other side, the dataset also has constraints. First, the videos were recorded for TV shows and not for research, so there were frequent scene changes as well as other confounding factors, such as side views and occlusions. Second, the dataset is very small, with only 31 video clips from 16 persons, which further limits its usage.

Table 1 Spontaneous ME Datasets


Recently, inspired by the paradigm of mock crime in psychology, Li et al. [47] collected a new ME dataset, CAS(ME)3, which introduced a new approach to inducing MEs.

The spontaneous ME datasets also update and evolve in data form. Earlier spontaneous ME datasets only contain frontal 2-D videos, as these were relatively easy to collect and analyze, which leads to the fact that most existing ME methods can only analyze frontal faces and are incapable of dealing with challenges in real-world applications, such as illumination variation, occlusion, and pose variations. One study [49] on facial expression recognition illustrated that dynamic 3-D videos with richer information could facilitate facial expression analysis and alleviate the problems of self-occlusion, head motion, and lighting changes. Currently, the fast technological development of 3-D scanning makes recording and reconstructing high-quality 3-D facial videos possible. Li et al. collected a 4-D ME dataset (4DME) [33]. Moreover, 4DME consists of multimodal videos, including Kinect color videos, Kinect depth videos, gray-scale 2-D frontal facial videos, and reconstructed dynamic 3-D facial meshes, since leveraging multiple modalities can provide complementary information to improve the robustness of the analysis. CAS(ME)3 also includes depth information, physiological signals, and voice signals in addition to 2-D color videos. Fig. 2 shows examples from the ME datasets.

MMEW [20], CAS(ME)2 [43], CAS(ME)3 [47], and 4DME [33] contain both MEs and MaEs, which can be used to further distinguish MEs from MaEs. Moreover, the 4DME [33] dataset annotates the cases in which MEs are mixed with MaEs based on AUs. This dataset provides the possibility to analyze the co-occurrence and relations of MEs and MaEs.

The above-discussed ME datasets are mostly constructed for MER. For researching ME spotting, some ME datasets have been extended by including non-micro frames before and after the annotated ME samples to generate longer videos, such as the extended versions of CASME, CASME II, and SMIC. However, the video lengths in these datasets are still quite short, which means that they only address the spotting task under a simplified scenario. Later, CAS(ME)2 [43] was released, which raised the challenge level of the spotting task by introducing long video clips (148 s on average) that include both MaEs and MEs. Another dataset, SMIC-E-Long [40], was established for the same purpose, i.e., extending the SMIC-E clips into much longer clips by adding context frames from the original recordings. For such long clips, the challenge of ME spotting lies in multiple aspects: not only is the duration of the videos longer, but other factors, such as eye blinks, head movements and rotations, and MaEs, all become more significant and complex compared to those within a narrow observation window. Moreover, SAMM-LV [45] and CAS(ME)3 provide long videos containing annotated intervals of MaEs and MEs according to the AUs.

However, the above datasets mainly focus on separated MEs and MaEs even though, in practical situations, MEs may occur together with MaEs. The 4DME dataset emphasizes the situation in which MEs and MaEs coexist, which is suitable for studying the spotting of MEs and MaEs simultaneously in realistic situations.

V. COMPUTATIONAL METHODS FOR ME ANALYSIS

A general pipeline for ME analysis is shown in Fig. 3. Given raw collected videos, preprocessing is usually performed first, and then the ME clips can be detected and located through ME spotting. After that, MER and ME-AU detection can be carried out as separate or joint tasks. ME generation can synthesize either long videos including mixed MEs and MaEs or short clips with just MEs, which is expected to benefit different ME analyses. In this section, we first introduce preprocessing steps in ME analysis and discuss different inputs. Four main tasks in present-day computational ME analysis are then discussed, including ME spotting, MER, ME-AU detection, and ME generation.

Fig. 3. Pipeline of ME analysis.

A. Preprocessing

Given one ME dataset, there are multiple interfering factors in the raw facial videos that need to be dealt with before the actual ME analysis can be carried out, e.g., background removal, head pose change, and face size/shape variation. As subtle as MEs are, ME analysis performance might be significantly impeded if these problems are not solved properly. Like most facial video analysis tasks, ME analysis methods include two "common" preprocessing steps, i.e., face detection and face alignment. The former removes the background and keeps only the facial region, and the latter reduces the variation of facial shapes and poses by aligning corresponding facial landmarks. Moreover, there are two other "special" preprocessing steps that are commonly used in many ME methods as they are specifically helpful for ME analysis. The first one is motion magnification. Since MEs have very low intensity, motion magnification can help enlarge the motion, thus facilitating ME analysis. The second one is temporal interpolation. Since MEs are fleeting phenomena with very short duration and various clip lengths, temporal interpolation can help generate more frames or normalize the video length. The four preprocessing steps, i.e., face detection, face alignment, motion magnification, and temporal interpolation, are each elaborated on in the following.
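To make the four-step chain described above concrete, the following is a minimal sketch (Python with OpenCV/NumPy) of how a raw ME clip could be preprocessed. Only the detection/cropping stage uses a real detector (OpenCV's stock Haar cascade); the magnification and length-normalization steps are simple stand-ins for the dedicated EVM and TIM methods discussed below, and landmark-based registration is left as a placeholder. All function names and parameter values are illustrative assumptions, not taken from the surveyed works.

```python
# Minimal preprocessing chain sketch: detect/crop -> (register) -> magnify -> interpolate.
import cv2
import numpy as np

_FACE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face(frame, size=128):
    """Detect the largest face and return a cropped, resized grayscale patch."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = _FACE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(boxes) == 0:
        return cv2.resize(gray, (size, size))          # fall back to full frame
    x, y, w, h = max(boxes, key=lambda b: b[2] * b[3])  # keep largest detection
    return cv2.resize(gray[y:y + h, x:x + w], (size, size))

def magnify(frames, alpha=10.0):
    """Crude motion amplification: exaggerate deviation from the onset frame
    (a stand-in for EVM / learning-based magnification; alpha ~5-30 in practice)."""
    base = frames[0].astype(np.float32)
    out = base + alpha * (frames.astype(np.float32) - base)
    return np.clip(out, 0, 255).astype(np.uint8)

def interpolate(frames, target_len=32):
    """Resample a clip to a fixed length by linear blending of neighboring frames
    (a simple stand-in for TIM-style length normalization)."""
    t = np.linspace(0, len(frames) - 1, target_len)
    lo = np.floor(t).astype(int)
    hi = np.minimum(lo + 1, len(frames) - 1)
    w = (t - lo)[:, None, None]
    return ((1 - w) * frames[lo] + w * frames[hi]).astype(np.uint8)

def preprocess(raw_frames):
    faces = np.stack([crop_face(f) for f in raw_frames])   # detection + cropping
    # Landmark-based registration would normally be applied here (see below).
    return interpolate(magnify(faces))
```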


1) Face Detection: Face detection is the first step in an ME analysis system. It finds the face location and removes the background. The Viola–Jones face detector [50] is one of the most widely used face detection algorithms. It can achieve robust face detection on near-frontal faces with cascaded weak classifiers. Moreover, it is computationally efficient and can run in real time, so it is widely employed in many face analysis scenarios. However, the Viola–Jones detector does not work well with significant pose variations and occlusions [51]. Wu et al. [52] proposed to utilize probabilistic and pose-specific detector approaches for handling these problems. Later, with the development of deep learning, deep learning-based face detection methods have been presented to deal with scale and pose variations [53], [54], [55]. A lightweight deep convolutional network was proposed for robust face detection by using incremental facial part learning [54]. HyperFace, proposed in [55], exploited the synergy among the multiple tasks of landmark localization, face detection, gender recognition, and pose estimation for better face detection performance. Currently, deep learning-based face detection methods have been integrated into popular open-source libraries, such as OpenCV and Dlib, to facilitate facial video analysis tasks, including ME spotting and recognition.

2) Face Registration: MEs are very subtle movements, which can easily be affected by head pose variations. Face registration aims to align each detected face to a reference face according to key facial landmarks so as to alleviate pose variation and head movement problems. Therefore, face registration is an essential preprocessing step for MER. There are multiple methods available for face registration, such as discriminative response map fitting (DRMF) [56], active appearance models (AAMs) [57], and active shape models (ASMs) [58], which are all widely adopted in related studies. AAM [57] is able to match faces with different expressions rapidly, and DRMF is good at handling occlusions in complex backgrounds efficiently in real time [56]. Similar to face detection, deep learning is also exploited for face registration. For example, a deep cascaded framework was proposed by Zhang et al. [59] to exploit the correlation between alignment and detection to further enhance the alignment performance in unconstrained environments, i.e., when various poses, illuminations, and occlusions are involved.

3) Motion Magnification: Motion magnification aims to enhance the intensity level of subtle motions in videos, e.g., the invisible trembling of a working machine. It was found to be helpful for ME analysis tasks and is employed as a special preprocessing step. The Eulerian video magnification (EVM) method [60] is one of the most popular magnification methods [61]. The original method can be used to magnify either the color or the motion content of an input frame sequence. There is one adjustable parameter for the magnification level, i.e., a larger amplification value leads to a larger scale of motion amplification. For ME magnification, it is not the case that the larger the magnification the better. One issue to be aware of is that, at a very large magnification level, bigger artifacts and more noise are introduced at the same time. According to empirical studies [61], [62], [63], a suitable amplification factor is about 5–30, depending on the data. Besides EVM, Oh et al. [64] presented a learning-based motion magnification method that was employed to magnify MEs in [65]. One advantage of the learning-based motion magnification method is that it causes less noise compared to EVM.

4) Temporal Interpolation: Besides low intensity, another challenge of ME analysis is that ME clips are very short and of varying lengths, which is not good for clip-based analysis, especially when recording MEs with a relatively low-speed camera. Temporal interpolation solves this issue by interpolating sequences into a designated length. Specifically, temporal interpolation can be used to upsample ME clips [61] with too few frames to get longer and unified ME clips for stable spatial–temporal feature extraction. Also, extending ME clips and subsampling them into multiple short clips with temporal interpolation can be used for data augmentation [66]. The temporal interpolation model (TIM) [67] is one of the most popular methods used in ME analysis, which characterizes the sequence structure by a path graph. Moreover, Niklaus and Liu [68] designed a network for temporal interpolation, which extracts information from pixelwise contextual information of the input frames in order to calculate a high-quality intermediate frame for interpolation. Their method can perform well in complex temporal interpolation scenarios in reality when large motions and occlusions are involved.
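As a complement to the detection sketch given earlier, the face registration step described above is often approximated in practice by a similarity transform that maps the detected eye centers to fixed canonical positions. The sketch below assumes dlib's 68-point landmark model (the file shape_predictor_68_face_landmarks.dat must be downloaded separately); it is an illustrative stand-in, not the DRMF/AAM/ASM or deep cascaded methods cited above.

```python
# Minimal landmark-based registration sketch: rotate/scale the face so that the
# eye centers sit on a horizontal line at a fixed distance and position.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # external file

def align_face(gray, out_size=128, eye_dist_ratio=0.4):
    rects = detector(gray, 1)
    if len(rects) == 0:
        return cv2.resize(gray, (out_size, out_size))
    pts = np.array([(p.x, p.y) for p in predictor(gray, rects[0]).parts()])
    left_eye, right_eye = pts[36:42].mean(axis=0), pts[42:48].mean(axis=0)
    dx, dy = right_eye - left_eye
    angle = np.degrees(np.arctan2(dy, dx))                 # tilt of the eye line
    scale = (eye_dist_ratio * out_size) / np.hypot(dx, dy) # canonical eye distance
    center = ((left_eye + right_eye) / 2).astype(np.float32)
    M = cv2.getRotationMatrix2D((float(center[0]), float(center[1])), angle, scale)
    # Shift the eye midpoint to a fixed location in the output image.
    M[0, 2] += out_size * 0.5 - center[0]
    M[1, 2] += out_size * 0.35 - center[1]
    return cv2.warpAffine(gray, M, (out_size, out_size), flags=cv2.INTER_LINEAR)
```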


Table 2 Comparisons of Inputs for ME Analysis

B. Inputs

The characteristics of low intensity and short duration make MER very challenging. Different inputs can be used for MER, in the form of images, videos, or optical flow.

1) Images: A large number of facial expression recognition studies are based on static images because of the convenience of image processing and the availability of massive amounts of facial images. However, different from MaEs, MEs involve subtle facial movements. To this end, Li et al. [62], [69] studied using magnified apex frames for ME analysis. Their experimental results demonstrated that MER with a single apex frame can achieve performance comparable to that of using the whole ME clip. Following [62], [69], Sun et al. [70] further studied apex frame-based MER, which can leverage massive images in MaE databases and turned out to obtain better performance than employing the whole videos.

One motivation for researchers to develop single apex frame-based ME analysis methods is that processing single apex frames reduces computational complexity. However, on the other side, temporal information is lost when using one single frame. Some works [71], [72] proposed to use multiple key ME frames as inputs. ME sequences recorded at high frame rates (e.g., 200 fps) might contain redundant information for ME analysis, and Liong and Wong [72] demonstrated that using only the onset and apex frames of one ME as the input can provide sufficient spatial and temporal information for the analysis. Furthermore, Kumar and Bhanu [73] and Liu et al. [74] designed strategies for automatic key frame selection, and the selected key frames are aggregated as the input for MER.

Embedding the dynamic information of a video into a standard image forms a dynamic image input, which has been proposed for action recognition [75]. Considering that the dynamic image can summarize appearance and subtle dynamics into a single image, multiple MER methods [76], [77], [78], [79] employed dynamic images as input and achieved promising performance. The dynamic image can simultaneously consider spatial and temporal information while keeping computational efficiency by processing only one image.

2) Video Input: ME video clips are a commonly utilized input for ME analysis [80], [81], [82], [83], [84], [85], [86]. They consider spatial and continuous temporal information simultaneously and can be processed directly without extra operations. However, ME videos have short and varying durations. Many approaches employed TIM to interpolate the ME clips to longer and/or the same length [61]. There are methods using spatial–temporal descriptors, such as local binary patterns from three orthogonal planes (LBP-TOP) [81], and methods utilizing long short-term memory (LSTM) and recurrent neural networks (RNNs) [87], which aim to process time-series data with various durations. However, there is redundancy in ME sequences, and the computational cost is relatively high [88].

3) Optical Flow: Another widely used input is optical flow. Optical flow estimates the local movement between images by computing the direction and magnitude of pixel movement in image sequences [89]. Optical flow has been verified as effective for movement representation. In recent years, various optical flow computation methods have been proposed [90], [91], [92], [93], [94], such as Lucas–Kanade [90], Farnebäck's method [93], TV-L1 [94], and FlowNet [95]. Inspired by the effectiveness of optical flow, many ME analysis methods utilize optical flow to represent micro-facial motion [96], [97], [98]. Furthermore, optical flow can reduce identity characteristics to some degree [96]. MER approaches based on optical flow often outperform MER approaches based on appearance [96], [97]. However, current optical flow-based MER approaches mainly employ traditional optical flow with complicated feature operations, leading to slow computation.

Moreover, considering the strengths of different inputs, some works combined multiple inputs to learn multiview information to further improve performance [84], [97], [99], [100]. A summary of the strengths and shortcomings of the inputs is given in Table 2.

C. ME Spotting

As discussed earlier, ME spotting is one of the main tasks in automatic ME analysis, which identifies the temporal locations (and sometimes also the spatial locations on the face) of MEs in video clips. Three keyframes are included in the complete process of one ME, i.e., the onset, the apex, and the offset. The onset is the first frame in which the ME motion is discriminable. The frame with the highest motion intensity in the ME clip is the apex frame. The offset is the frame marking the end of the motion. ME spotting can thus be cast as identifying these keyframes in long clips. Spotting is the very first step before many other ME analysis tasks, such as MER and ME action unit detection, but it is even more challenging. Human test experiments in [61] indicate that automatic MER technology can outperform humans, while performance decreases substantially when the ME spotting step is included in the complete ME system.

The ME community has carried out preliminary research on ME spotting. The second Facial Micro-Expression Grand Challenge (MEGC 2019) proposed challenges for ME spotting in long videos [101]. The long videos contain many non-ME movements, such as eye blinking, weak head rotation, swallowing, and MaEs, which are similar to practical situations. Furthermore, the third challenge, MEGC 2020 [102], extended the task to spotting both MaEs and MEs in long videos. In this section, we discuss ME spotting in terms of heuristic and machine learning-based methods, followed by a discussion.

1) Heuristic ME Spotting: Traditional algorithms are usually training-free and heuristic, and they spot MEs by comparing feature differences within a sliding window of fixed length [61], [62], [111]. The locations of MEs can then be determined by thresholding.
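The sliding-window, feature-difference idea behind heuristic spotting can be illustrated with a short sketch that uses the mean Farnebäck optical-flow magnitude as the per-frame feature and thresholds the resulting contrast curve. The window length and threshold below are arbitrary placeholders rather than values from any cited method, and real systems additionally need to filter out blinks and head motion.

```python
# Illustrative sliding-window spotting sketch: describe each frame by its motion
# contrast against the head and tail of a window, then threshold the curve.
import cv2
import numpy as np

def flow_magnitude(prev_gray, cur_gray):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return np.linalg.norm(flow, axis=2).mean()

def spot_candidates(gray_frames, half_win=8, thresh=0.5):
    """Return frame indices whose motion contrast exceeds a relative threshold."""
    n = len(gray_frames)
    contrast = np.zeros(n)
    for i in range(half_win, n - half_win):
        head = flow_magnitude(gray_frames[i - half_win], gray_frames[i])
        tail = flow_magnitude(gray_frames[i], gray_frames[i + half_win])
        contrast[i] = 0.5 * (head + tail)
    peaks = np.where(contrast > thresh * contrast.max())[0]
    return peaks, contrast
```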


Table 3 Comparison of Heuristic and Machine Learning-Based ME Spotting Methods

LBP [61], [111], HOG [61], and optical flow [103], [112], [113], as shown in Fig. 4, are the commonly used features for ME spotting and are specifically discussed in this section. Li et al. [61] proposed a training-free ME spotting approach based on LBP difference contrast in 6 × 6 blocks and a sliding window. Moreover, baseline results of ME spotting in long videos were provided in that study [61]. On the other hand, Patel et al. [103] located the apex, onset, and offset frames in ME clips by computing how the motion amplitude in the optical flow shifts over time. Main directional maximal difference analysis (MDMD) was proposed [104], [113] to spot MEs based on the magnitude of the maximal difference in the main direction of optical flow. Similar to [61], MDMD also utilized a sliding window and block-based division. Ma et al. [114] employed the oriented optical flow histogram to further improve apex frame spotting performance. To distinguish MaEs and MEs in long videos, Zhang et al. [105] proposed to disentangle head movement by computing the mean optical flow of the nose region and utilizing a multiscale filter to increase the ability to spot MEs and MaEs.

All of the above methods spot MEs in the spatial–temporal domain. However, MEs have rapid and low-intensity spatial movements, which are not obvious in the spatial–temporal domain but can lead to large changes in the frequency domain. To this end, frequency-based ME spotting methods [62], [69] were presented to spot the apex frame in the ME sequence by exploiting information in the frequency domain, which can reflect the rate of facial changes.

The strength of feature-difference-based approaches is that they consider the temporal characteristics according to the size of the sliding window, and the spotting results can be obtained simply by setting thresholds. However, as they are heuristic and mainly based on practical experience regarding, e.g., the threshold on the feature-difference value, the spotting results can easily be influenced by other facial movements with similar intensity or duration, such as eye blinks. Thus, it is hard to distinguish MEs from other similar facial movements with feature-difference-based ME spotting methods.

2) Machine Learning-Based ME Spotting: As the heuristic methods based on thresholds are weak at distinguishing MEs from other facial movements, machine learning-based methods were developed to tell apart different facial movements; they regard ME spotting as a binary classification of non-ME and ME frames. In general, these methods first extract features of each frame, and a classifier is utilized to recognize the ME frames.

Husák et al. [46] computed an image intensity change descriptor on RoIs and applied an SVM classifier to spot MEs. Xia et al. [106] and Borza et al. [115] both used an Adaboost model to classify ME frames for ME spotting. Specifically, the former utilized geometric features of the landmarks of face shapes, and the latter employed motion descriptors based on absolute frame differences.

Inspired by the successful application of deep learning in action detection [116], Zhang et al. [107] first proposed to search for the ME apex frame in long ME videos by adopting a convolutional neural network (CNN) with feature matrix processing. Later, LSTM networks were introduced in [108] and [40] to spot MEs in long videos due to their strength in processing sequences with various lengths. Moreover, Wang et al. [110] presented an end-to-end deep framework consisting of three modules, clip proposal, 2 + 1D CNN, and classification regression, to extract features, propose ME clips, and classify MEs, respectively. In addition, a local bilinear structure-based network was proposed to extract local and global features in a fine-grained way to identify MEs and MaEs [109].

3) Discussion: In general, the heuristic ME spotting methods are mostly based on thresholds to classify MEs and non-MEs, and they are weak at distinguishing MEs from other facial movements, such as eye blinks. The machine learning-based methods can recognize different facial movements by training classifiers. However, the performance of ME spotting is restricted by the small-scale ME datasets and the unbalanced ME and non-ME samples.

With the increase in ME spotting research, various evaluation protocols have been proposed, using different training and testing sets and various metrics, such as the area under the curve (AUC), the receiver operating characteristic (ROC) curve, mean absolute error (MAE), accuracy, recall, and F1-score [18]. This causes inconsistencies and makes fair comparisons very hard, as shown in Table 3.
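Because metrics differ across papers, interval-level F1-score has become a common choice for comparing spotting methods: a predicted [onset, offset] interval counts as a true positive when it overlaps an unmatched ground-truth interval with an intersection-over-union of at least 0.5, a criterion adopted by recent spotting challenges. A minimal sketch of this protocol:

```python
# Interval-level spotting evaluation sketch: IoU >= 0.5 with an unmatched
# ground-truth interval counts as a true positive; the rest are FP/FN.
def interval_iou(a, b):
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def spotting_f1(pred, gt, iou_thr=0.5):
    matched, tp = set(), 0
    for p in pred:
        for j, g in enumerate(gt):
            if j not in matched and interval_iou(p, g) >= iou_thr:
                matched.add(j)
                tp += 1
                break
    fp, fn = len(pred) - tp, len(gt) - tp
    precision = tp / (tp + fp) if pred else 0.0
    recall = tp / (tp + fn) if gt else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Example: one of two predictions overlaps a ground-truth ME enough to count.
print(spotting_f1(pred=[(10, 24), (80, 90)], gt=[(12, 26), (150, 160)]))  # 0.5
```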


Thus, in the future, it is essential to design standard evaluation protocols for ME spotting; a first step has been taken in Tran et al.'s [40] benchmark work. In addition, multiple studies have attempted to explore spotting MEs and MaEs simultaneously in long videos. Due to the presence of noise, irrelevant movements, and mixed MEs and MaEs, it is very challenging to learn discriminative features on limited datasets and accurately locate the various MEs and MaEs, which should be further studied.

D. MER

Spontaneous MER research with computational technology can be traced to the work of [48]. Pfister et al. [48] proposed to use spatiotemporal local texture descriptors combined with a TIM. Later, various methods were developed for efficient MER. In the beginning, most of the MER methods were based on traditional handcrafted features. In recent years, the fast development of deep learning technology has enabled deep learning-based methods to achieve state-of-the-art performance in MER. In this section, we discuss the MER methods in terms of traditional learning and deep learning methods.

1) Traditional Learning Methods: Since most ME datasets have limited samples, handcrafted features are widely researched in MER. Handcrafted features represent image details without explicit semantic knowledge/meaning, such as intensities [120] and gradients [61]. Basic classifiers such as kNN and SVM are employed to classify the features. In this section, we mainly discuss LBP-TOP and its variants, gradient variants, and optical flow-based approaches, which are widely used in MER, as shown in Table 4.

Fig. 4. Handcrafted features for ME analysis. (a) LBP-TOP [48]. (b) HOG [117]. (c) Optical flow [118]. (d) Delaunay triangulation [119].

The most popular appearance feature utilized for MER is LBP-TOP [81]. LBP-TOP is a texture descriptor that thresholds the neighbors of each pixel with a binary code in the spatial and temporal dimensions. Most MER research employs LBP-TOP as a baseline due to its computational simplicity, as shown in Fig. 4(a). Later, several LBP-TOP variants were proposed to meet the different needs of MER [120]. Wang et al. [120] presented a spatiotemporal descriptor, LBP with six intersection points (LBP-SIP), to enhance the efficiency of MER by suppressing the redundant information of LBP-TOP. Then, a more compact variant with less computational time, LBP-MOP, was proposed. LBP-MOP concatenates the LBP features from the temporal pooling results of image sequences in three orthogonal planes [121]. Huang et al. [122] designed a spatiotemporal completed local quantized pattern (STCLQP), which considers not only the pixel intensity but also the sign, magnitude, and orientation components.

Aside from LBP, another widely used feature for MER is the gradient-based feature. High-order gradients can represent the structural information of an image in detail [123]. The histogram of gradients (HOG) is one of the most widely used features for its ability to describe the edges in an image with geometric invariance [123], as shown in Fig. 4(b). Li et al. [61] developed the histogram of image gradient orientation (HIGO), which ignores the magnitude weighting of the first-order derivatives to suppress the influence of illumination. Moreover, both HOG-TOP and HIGO-TOP were utilized together in [124] to further improve the performance of MER.

Considering the strength of optical flow discussed in Section V-B, several works designed feature descriptors based on optical flow for MER. Liu et al. [118] proposed the main directional mean optical flow (MDMO) feature, which considers both local information and its location through regions of interest (RoIs), as shown in Fig. 4(c). MDMO only exploits the dominant direction of optical flow in each RoI. However, facial motions spread progressively because of the elasticity of the skin. Allaert et al. [125] proposed to extract the coherent movement of the face from dense optical flow to better describe facial movement. Inspired by the strength of optical strain in capturing small facial deformations, Liong et al. [126] proposed to apply optical strain to the MER task. Optical strain computes the shear and normal strain tensor components of optical flow. To reduce the dimensionality and enhance computational efficiency, Liong et al. [126] resized and max-normalized the strain maps to a relatively low resolution to keep consistency across the database. To effectively learn ME information from the active regions, optical strain weighted (OSW) features were presented to weight local LBP-TOP features according to the temporal mean-pooled optical strain map [127]. Moreover, Liong and Wong [72] designed a biweighted oriented optical flow (BI-WOOF) descriptor that adds local and global weighting to HOOF by optical strain and magnitude values, respectively, in order to reduce the effect of noisy optical flow.
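To make the LBP-TOP baseline above concrete, the sketch below computes uniform LBP histograms on the XY, XT, and YT planes of a clip and concatenates them, using scikit-image's local_binary_pattern. It is a single-block simplification: the cited works typically divide the face into blocks, concatenate per-block histograms, and feed the result to an SVM under leave-one-subject-out evaluation (shown as a usage comment).

```python
# Compact LBP-TOP sketch for a grayscale clip of shape (T, H, W).
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_hist(img, P=8, R=1):
    codes = local_binary_pattern(img, P, R, method="uniform")   # values in [0, P+1]
    hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
    return hist

def lbp_top(clip, P=8, R=1):
    T, H, W = clip.shape
    xy = np.mean([lbp_hist(clip[t], P, R) for t in range(T)], axis=0)        # XY planes
    xt = np.mean([lbp_hist(clip[:, y, :], P, R) for y in range(H)], axis=0)  # XT planes
    yt = np.mean([lbp_hist(clip[:, :, x], P, R) for x in range(W)], axis=0)  # YT planes
    return np.concatenate([xy, xt, yt])   # 3 * (P + 2) dimensions

# Typical usage: LBP-TOP features + a linear SVM, evaluated leave-one-subject-out.
# from sklearn.svm import SVC
# from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
# scores = cross_val_score(SVC(kernel="linear"), X, y,
#                          groups=subject_ids, cv=LeaveOneGroupOut())
```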


In contrast to the above works [72], [126], [127], Happy and Routray [128] proposed a fuzzy histogram of optical flow orientation (FHOFO), which collects the motion directions and ignores the motion magnitudes due to the low intensity of MEs. A fuzzy membership function is utilized to map the directions of motion into angular bins to create smooth histograms for motion representation.

Besides the above features, there are descriptors representing MEs from other views, such as color [129] and facial geometry [119], [130]. Wang et al. [129] proposed a tensor-independent color space (TICS) method. In TICS, the RGB color is transformed into independent color components and combined with dynamic texture to further increase MER accuracy. Lu et al. [119] designed a Delaunay-based temporal coding model (DTCM) to normalize the ME sequences temporally and spatially based on Delaunay triangulation in order to suppress the influence of personal appearance, which is not relevant to MEs, as shown in Fig. 4(d). Furthermore, a facial dynamics map (FDM) [130] was proposed to handle subtle face displacements by characterizing the movements of an ME at different granularities.

Once features are extracted, classifiers are used to categorize the MEs. Classification involves two stages: training and testing. In the training stage, classifiers learn to recognize MEs based on the labels and extracted features. In the testing stage, the trained classifier's performance is assessed with evaluation metrics, such as accuracy and F1 score. Various supervised classification methods have been used for MER, e.g., support vector machine (SVM) [131], Adaboost [132], random forest [133], k-nearest neighbor (kNN) [128], [134], and linear discriminant analysis (LDA) [135]. SVM is the most widely used classifier because of its robustness, accuracy, and effectiveness, especially when the training samples are limited.

2) Deep Learning Methods: In recent years, deep learning has achieved excellent performance in many research fields, such as facial expression recognition [138], object detection [139], and image classification [140]. Several researchers have attempted to explore MER with deep learning. However, deep learning is a data-driven method that requires a large amount of data to learn a robust representation. ME datasets are small-scale and MEs have low intensity, which makes deep learning-based MER hard. Patel et al. [141] first attempted to utilize facial expression and object-based CNN models and selected relevant deep features for representing MEs. However, the MER accuracy on CASME II was 47.3%, much lower than that of handcrafted descriptors. Following Patel's work [141], various deep learning-based methods have been proposed to improve MER. To date, deep learning-based MER has achieved state-of-the-art performance by leveraging massive facial images and designing effective network structures and special blocks. In the following, we first discuss the MER methods that take advantage of existing data through fine-tuning, learning from multiview data, and transfer learning. Then, effective structures and blocks designed for learning discriminative ME features are specifically discussed. Finally, we introduce the losses applied in MER.

In the beginning, most MER works adopted well-designed classical convolutional networks, such as the ResNet family [140], [142], [143], Inception networks [144], [145], [146], AlexNet, and VGG-FACE [62], [147]. The effectiveness of these networks has been verified on common tasks, such as image classification and face recognition. Furthermore, these networks are pretrained on datasets with a large number of images, such as the ImageNet and VGG-FACE datasets [147]. Fine-tuning deep networks pretrained on large datasets can effectively avoid the overfitting problem caused by small-scale ME datasets [62], [74], [77].

To further leverage the information in the limited ME samples, multiple works adopted multistream network structures to extract multiview features from various inputs. The optical flow features from the apex frame network (OFF-Apex-Net) [148] built a dual-stream CNN for MER based on optical flow-derived components, i.e., the strain along the horizontal and vertical directions. Khor et al. [149] proposed a dual-stream shallow network (DSSN) based on heterogeneous features. Moreover, other works developed multiple substreams to extract features from frame sequences, static images, optical flow, or RoIs [85], [150], [151], [152]. Song et al. [97], [136] designed three-stream CNN (TSCNN) models extracting features from the apex frames, local facial regions, and the optical flow between onset, apex, and offset frames to leverage static spatial, local, and dynamic temporal information, as shown in Fig. 5(a). In addition, She et al. [152] employed three RoIs and the global region and designed a four-stream model to explore local and global information. To further explore the temporal information of MEs, several works [88], [153], [154] cascaded a CNN with an RNN or LSTM to extract features from the individual frames of an ME sequence and capture the facial evolution of MEs.

Recent research demonstrates that taking advantage of information from relevant tasks can also benefit facial expression recognition [139]. Inspired by this finding, to make full use of the information on faces, multiple studies developed multitask learning for better MER by leveraging different side tasks [76], [155]. Nie et al. [76] designed a GEnder-based MER (GEME) model incorporating a gender detection task with MER, as shown in Fig. 5(b). Furthermore, Zhou et al. [146], [156] proposed to recognize AUs and MEs jointly and further aggregated the AU representation into the ME representation to improve MER performance. Other methods leverage the knowledge of other tasks through transfer learning. Directly fine-tuning a pretrained model is the simplest approach. Besides fine-tuning, knowledge distillation is also widely applied to MER [70].
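The multistream designs described above can be illustrated with a generic PyTorch sketch: one shallow branch processes the apex frame, another processes the onset-to-apex optical flow, and the two feature vectors are fused before classification. The layer sizes, input resolution, and three-class output are arbitrary assumptions; this is not the architecture of OFF-Apex-Net, DSSN, TSCNN, or any other cited model.

```python
# Generic dual-stream MER sketch: apex-frame branch + optical-flow branch,
# fused by concatenation before a small classifier.
import torch
import torch.nn as nn

def conv_branch(in_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten())          # -> (N, 32)

class DualStreamMER(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.apex_stream = conv_branch(in_ch=1)          # grayscale apex frame
        self.flow_stream = conv_branch(in_ch=2)          # (dx, dy) optical flow
        self.classifier = nn.Sequential(
            nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(32, num_classes))

    def forward(self, apex, flow):
        fused = torch.cat([self.apex_stream(apex), self.flow_stream(flow)], dim=1)
        return self.classifier(fused)

# Shape check with dummy inputs (batch of 4, 128x128 crops).
logits = DualStreamMER()(torch.randn(4, 1, 128, 128), torch.randn(4, 2, 128, 128))
print(logits.shape)   # torch.Size([4, 3])
```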


Fig. 5. Demonstration of improving MER performance through different methods. (a) TSCNN employing a multistream structure to leverage
multiview inputs [97], [136]. (b) Knowledge distillation [70]. (c) GEME with multitask learning framework [76]. (d) Channel attention
module [137]. (e) Spatial attention module [137]. (f) Graph on AU representation [83]. Different colors in the figure represent different
paths in the network.

Knowledge distillation utilizes pretrained high-capacity learning the weights of each feature channel, channel
networks to guide the training of small and fast net- attention was employed in MER with spatiotemporal atten-
works [157]. Sun et al. [70] guided the shallow network tion to improve the representational ability of MEs [66],
learning for MER by mimicking the intermediate features [137], [174], [175], as shown in Fig. 5(d) and (e).
of a network trained for AU detection and facial expression In addition, MEs perform as specific combinations
recognition. Fig. 5(c) illustrates the knowledge distilla- of multiple facial AUs. The latent semantic information
tion process. However, it is not reasonable to directly among facial changes has an important contribution to
mimic the MaE representation, as the appearances of MEs MER. The graph convolutional network (GCN) has been
and MaEs have differences. To this end, instead of fea- verified to effectively model the semantic relationships.
tures, Zhou et al. [158] proposed to transfer attention to Inspired by the successful application of GCN in face
improve MER. Another effective transfer learning approach analysis tasks, the research [65], [83], [146], [176]
is domain adaptation that obtains domain invariant repre- applied the GCN to model the relationship between the
sentations through embedding domain adaptation into the local facial movements. Specifically, Lei et al. [65], [177]
deep learning pipeline. The adversarial learning strategy is designed graphs based on the ROIs along facial landmarks,
adopted in MER to narrow down the gap between the MEs while [73], [83], [146], and [176] built graphs on AU-level
and MaEs, and leverage massive MaE images to boost the representations to infer the AU relationship and boost MER
MER performance [159], [160], [161]. performance, as shown in Fig. 5(f).
Besides utilizing more data, many studies designed effective shallow networks for MER to avoid overfitting [82], [162], [163]. Zhao and Xu [164] designed a six-layer CNN and utilized a 1 × 1 convolutional layer to process the ME input and increase the nonlinear representation. Liong et al. [165] designed a Shallow Triple Stream Three-dimensional CNN (STSTNet) with two layers to learn features from the optical flow computed between the onset and apex frames of each ME video clip. Other works trim multiple convolutional layers off a deep network to obtain a shallow network [149], [166].
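To make the shallow-network idea concrete, the sketch below computes dense optical flow between the onset and apex frames (here with OpenCV's Farnebäck method) and feeds it to a network with only two convolutional layers. This is a simplified 2-D stand-in loosely inspired by STSTNet [165]; the flow algorithm, channel counts, and input resolution are illustrative assumptions rather than the published design.

```python
import cv2
import numpy as np
import torch
import torch.nn as nn

def onset_apex_flow(onset_gray, apex_gray):
    """Dense optical flow between the onset and apex frames, shape (H, W, 2)."""
    return cv2.calcOpticalFlowFarneback(onset_gray, apex_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0).astype(np.float32)

class ShallowFlowNet(nn.Module):
    """A deliberately small CNN over flow maps to limit overfitting on small ME sets."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 8, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, flow):               # flow: (B, 2, H, W)
        return self.classifier(self.features(flow))

if __name__ == "__main__":
    onset = np.random.randint(0, 255, (128, 128), dtype=np.uint8)
    apex = np.random.randint(0, 255, (128, 128), dtype=np.uint8)
    flow = torch.from_numpy(onset_apex_flow(onset, apex)).permute(2, 0, 1).unsqueeze(0)
    print(ShallowFlowNet()(flow).shape)    # torch.Size([1, 3])
```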
Since MEs involve fewer facial muscles (usually only one or two) with low intensity, MEs are related to localized changes [167] in RoIs. In order to emphasize learning on RoIs and reduce the influence of information unrelated to MEs, multiple works introduced attention modules [168], [169], [170], [171], [172], [173]. Inspired by the squeeze-and-excitation blocks [150] that adaptively learn the weights of each feature channel, channel attention was employed in MER together with spatiotemporal attention to improve the representational ability of MEs [66], [137], [174], [175], as shown in Fig. 5(d) and (e).
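The module below sketches the kind of squeeze-and-excitation-style channel attention [150] that such works insert into their MER backbones; the reduction ratio and the place where the module is plugged in are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention: reweight feature channels."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                      # squeeze: global average pooling
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)  # excitation: per-channel weights
        return x * w                                # emphasize ME-relevant channels

if __name__ == "__main__":
    feat = torch.randn(4, 64, 14, 14)
    print(ChannelAttention(64)(feat).shape)         # torch.Size([4, 64, 14, 14])
```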
In addition, MEs appear as specific combinations of multiple facial AUs, and the latent semantic information among facial changes makes an important contribution to MER. The graph convolutional network (GCN) has been verified to effectively model such semantic relationships. Inspired by the successful application of GCNs in face analysis tasks, the studies in [65], [83], [146], and [176] applied GCNs to model the relationships between local facial movements. Specifically, Lei et al. [65], [177] designed graphs based on the RoIs along facial landmarks, while [73], [83], [146], and [176] built graphs on AU-level representations to infer the AU relationships and boost MER performance, as shown in Fig. 5(f).
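The following is a minimal, hedged sketch of a single graph convolution over AU-level node features with a symmetrically normalized adjacency; the AU set, the hand-set adjacency, and the feature dimensions are purely illustrative and do not reproduce the graphs used in the cited works.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One GCN layer: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, a_hat):               # h: (N, in_dim), a_hat: (N, N)
        return torch.relu(a_hat @ self.weight(h))

def normalize_adjacency(a):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2."""
    a = a + torch.eye(a.size(0))
    d = a.sum(dim=1).pow(-0.5)
    return d.unsqueeze(1) * a * d.unsqueeze(0)

if __name__ == "__main__":
    # Hypothetical 5-node AU graph (say AU1, AU2, AU4, AU12, AU14) whose edges
    # encode assumed co-occurrences; node features could be pooled RoI descriptors.
    adj = torch.tensor([[0., 1., 0., 0., 0.],
                        [1., 0., 1., 0., 0.],
                        [0., 1., 0., 0., 1.],
                        [0., 0., 0., 0., 1.],
                        [0., 0., 1., 1., 0.]])
    nodes = torch.randn(5, 32)                 # per-AU features
    out = GraphConv(32, 16)(nodes, normalize_adjacency(adj))
    print(out.shape)                           # torch.Size([5, 16])
```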
Deep networks apply loss functions to perform end-to-end classification, where the loss function penalizes the deviation between the predicted and true labels during learning. Most ME analysis works directly utilize the softmax cross-entropy loss that is widely used in classification tasks [178]. However, ME datasets suffer from low interclass differences due to the low intensity of MEs; contrastive loss [179], triplet loss, and center loss [180] were therefore introduced to MER to increase the intraclass compactness and interclass separability of MEs [69], [160]. In addition, the samples in ME datasets have an imbalanced distribution, since some MEs, such as fear, are difficult to trigger. The focal loss was employed to alleviate this issue by focusing on misclassified and hard samples [76], [96], [146].
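For reference, a compact multi-class focal loss can be sketched as follows; the focusing parameter gamma and the absence of per-class weighting are illustrative simplifications of the variants used in practice.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Down-weight easy examples so training focuses on hard and minority samples."""
    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p of true class
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()

if __name__ == "__main__":
    logits = torch.randn(8, 5, requires_grad=True)   # e.g., five ME classes
    labels = torch.randint(0, 5, (8,))
    focal_loss(logits, labels).backward()
```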


3) Discussion: MEs are involuntary, rapid, and subtle facial movements. The main challenge for robust MER is how to effectively extract discriminative representations. Handcrafted features are low-level representations that can effectively describe texture, color, and so on, but they are weak at extracting high-level semantic information. In contrast, deep learning-based features are abstract, high-level representations.

Table 4 Traditional Learning-Based MER on SMIC, CASME, and CASME II

Table 5 Deep Learning-Based MER on SMIC, CASME II, and SAMM

As shown in Tables 4 and 5, early MER works used handcrafted features. In recent years, with the development of deep learning, most current MER methods are based on CNNs, and deep learning-based MER achieves state-of-the-art performance. The performance of MER is influenced by various factors, such as preprocessing, features, and network structure, so it is difficult to compare the methods at every single step directly. However, the general trends of MER can be identified from the experimental results.

In general, the preprocessing step can benefit both traditional learning and deep learning-based MER approaches. Fig. 6 compares the performance with and without TIM and magnification. From Fig. 6, we can conclude that magnification and TIM benefit MER; however, the suitable magnification and temporal interpolation factors should be studied further.

Fig. 6. Comparisons of MER performance with/without TIM and/or magnification. Specifically, results are based on LBP-TOP and LBP-TOP + TIM on CASME II [181], STRCN-A and STRCN-A + Mag on CASME II [96], and HIGO, HIGO + TIM, and HIGO + TIM + Mag on SMIC [61].
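TIM itself is a graph-embedding-based interpolation model, so the sketch below is only a rough, hedged stand-in: it resamples a clip to a fixed number of frames by linear interpolation along time, which conveys the core idea of normalizing clip length before extracting spatiotemporal features such as LBP-TOP or HIGO.

```python
import numpy as np

def resample_clip(frames, target_len=10):
    """Linearly resample a clip (T x H x W [x C]) to target_len frames along time."""
    frames = np.asarray(frames, dtype=np.float32)
    t = frames.shape[0]
    src = np.linspace(0.0, t - 1.0, num=target_len)   # fractional source positions
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, t - 1)
    w = (src - lo).reshape(-1, *([1] * (frames.ndim - 1)))
    return (1.0 - w) * frames[lo] + w * frames[hi]

if __name__ == "__main__":
    clip = np.random.rand(6, 64, 64)                  # a 6-frame ME clip
    print(resample_clip(clip, target_len=10).shape)   # (10, 64, 64)
```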
Deep learning is a data-driven method, and the small amount of ME data is far from enough to train a robust network. Current MER research therefore designs shallow networks or leverages massive MaE images to address the data limitation. Shallow networks and transfer learning have brought substantial progress for MER on small-scale ME datasets, with methods such as STSTNet [165], MiMaNet [161], and DIKD [70], as shown in Table 5. However, the performance is still far from satisfying for real-world applications. For the former approach, more effective blocks and structures should be developed to learn discriminative ME features with fewer parameters. For the latter approach, considering the appearance difference between MEs and MaEs, transfer learning methods should be further studied to solve the domain shift problem, and leveraging information from other related tasks, such as age estimation and identity classification, could be considered. In addition, unsupervised and semisupervised learning [182], [183] are promising future directions for MER, as they could leverage massive unlabeled images.

E. ME-AU Detection

ME analysis is a relatively new topic, and most research has focused on ME spotting and recognition [186], [187], [188], [189]. Studies of facial expression recognition indicate that AU detection is able to facilitate complex facial expression analysis and that developing facial expression recognition together with AU analysis can boost recognition performance [190], [191].

Inspired by the contribution of AUs to facial behavior analysis, researchers started to study AU detection in MEs. However, compared to MaEs, AU detection becomes more challenging due to the low intensity of MEs and the small scale of ME datasets. AU detection is a complicated, fine-grained facial analysis task. Common facial AU datasets contain a large number of facial samples and high identity diversity [192], e.g., Aff-Wild2 [193] (564 videos/2 800 000 frames of hundreds of subjects). In contrast, an ME dataset may only contain thousands of images, e.g., CASME contains around 2500 images of 19 subjects.
Moreover, the AUs in MEs have an imbalanced distribution: e.g., in CASME II, AU4 occurs 129 times, while some other AUs occur only 13 times. Existing ME-AU detection research proposed to utilize MaEs [194] or specific ME characteristics, such as subtle local facial movements [66], [195], [196], which are discussed in more detail in the following, to overcome these issues.

In order to overcome the lack of ME data, Li et al. [194] proposed a dual-view attentive similarity-preserving (DVASP) knowledge distillation method that utilizes facial images in the wild to achieve robust ME-AU detection. Considering that one of the key factors for successful knowledge distillation is a generalized teacher network, DVASP utilizes a semisupervised dual-view cotraining approach [197], [198] to construct a generalized teacher network by exploiting massive labeled and unlabeled facial images in the wild. To address the appearance gap between MEs and MaEs, an attentive similarity-preserving distillation method was proposed to overcome the domain shift problem by transferring the correlation of important activations instead of directly mimicking the features.
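The sketch below illustrates the general idea of transferring the correlation of activations rather than the activations themselves, in the style of similarity-preserving distillation; the batch-level similarity formulation and the feature dimensions are assumptions for illustration and are not the exact DVASP objective.

```python
import torch
import torch.nn.functional as F

def similarity_preserving_loss(student_feats, teacher_feats):
    """Match pairwise sample-similarity patterns instead of raw features.

    The student and teacher feature dimensions may differ; only the
    B x B similarity structures within a batch are aligned.
    """
    gs = F.normalize(student_feats @ student_feats.t(), p=2, dim=1)
    gt = F.normalize(teacher_feats @ teacher_feats.t(), p=2, dim=1)
    return ((gs - gt) ** 2).mean()

if __name__ == "__main__":
    s = torch.randn(8, 128, requires_grad=True)   # student ME-AU features
    t = torch.randn(8, 512)                       # frozen teacher (wild-face) features
    similarity_preserving_loss(s, t).backward()
```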
Other ME-AU research focuses on modeling subtle AUs [66], [195] based on ME characteristics. An intra- and inter-contrastive learning method was proposed to enlarge and utilize the contrastive information between the onset and apex frames to obtain a discriminative representation for low-intensity ME-AU detection [195]. To effectively learn local facial movements and leverage the relationship information between different facial regions to enhance the robustness of ME-AU detection, a spatial and channel attention module was designed to capture subtle ME-AUs by exploring high-order statistics [66]. On the other hand, Zhang et al. [196] proposed an AU-based segmentation method to extract features from key facial regions and utilized multilabel classification to classify the AUs.
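Since AUs can co-occur, ME-AU detection is naturally cast as multilabel classification. The sketch below shows a minimal AU head trained with binary cross-entropy; the number of AUs, the feature dimension, and the pos_weight value used to counter label imbalance are hypothetical.

```python
import torch
import torch.nn as nn

class AUHead(nn.Module):
    """Multilabel AU detector: one sigmoid output per AU, since AUs can co-occur."""
    def __init__(self, feat_dim=128, num_aus=8):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_aus)

    def forward(self, feat):
        return self.fc(feat)                   # raw logits, one per AU

if __name__ == "__main__":
    head = AUHead()
    # pos_weight partially counters the strong AU imbalance found in ME datasets.
    criterion = nn.BCEWithLogitsLoss(pos_weight=torch.full((8,), 4.0))
    feats = torch.randn(16, 128)               # features from any ME backbone
    au_labels = torch.randint(0, 2, (16, 8)).float()
    criterion(head(feats), au_labels).backward()
```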

Table 6 AU Detection on CASME II and SAMM

The specific information on the abovementioned AU detection methods is shown in Table 6. We can see that both the amount of work on ME-AU detection and its performance are still limited. AU detection is a fine-grained analysis that identifies different facial movements, and the low intensity of MEs increases its difficulty. Moreover, ME-AU detection suffers from small-scale and extremely unbalanced datasets, as some AUs coexist and the occurrence of some AUs is very low. In the future, more effective AU detection approaches should be explored to better study MEs.

F. ME Generation

As the discussion in Section IV implies, it is challenging to collect MEs compared to ordinary facial expressions. Also, annotating MEs requires certified FACS coders to check videos frame by frame several times, which is time-consuming and labor-intensive. These issues lead to limited samples and imbalanced distributions in ME analysis. Synthesizing MEs is an option to solve these problems. With the development of generative adversarial networks (GANs) [138], [200], [201], image and video generation has been widely applied for data augmentation and image translation [202], [203] and has achieved distinctly improved performance in various fields, such as face generation [204], [205] and style transfer [206], [207]. Recently, ME researchers have started to explore
utilizing GANs to generate facial images. However, MEs are subtle and rapid, and straightforwardly applying GANs cannot generate satisfying MEs. The MEGC 2021 workshop started to include an ME generation task [208], leading to increased interest in ME generation. Current ME generation methods mainly leverage AUs or facial key points.

Since facial expressions are constituted by AUs [209], [210], AU-ICGAN [83], FAMGAN [211], and MiE-X [199] introduced GANs based on AUs to generate MEs. Xie et al. [83] proposed the AU Intensity Controllable GAN (AU-ICGAN) to synthesize subtle MEs. Considering that MEs change rapidly and that temporal information plays an important role, AU-ICGAN simultaneously evaluates image quality and video quality to generate nearly indistinguishable ME sequences, which effectively improves MER performance. Xu et al. [211] designed fine-grained AU modulation (FAMGAN) to eliminate noise and deal with asymmetrical AUs, and super-resolution was incorporated into FAMGAN to enhance the quality of the generated ME images. In addition, Liu et al. [199] synthesized a large-scale and trainable ME dataset (MiE-X, which includes 5000 identities and 45 000 samples in total) based on the relationship between AUs and expression categories. The experiments demonstrated that generated MEs can help improve MER performance. As shown in Fig. 7, the accuracy of ApexME [62] and Branches [159] pretrained on MiE-X improves by 3.1% and 3.2% on MMEW and by 5.4% and 3.0% on SAMM compared to pretraining on ImageNet.

Fig. 7. Comparisons with the methods pretrained with ImageNet or MiE-X established by the ME generation method [199]. ApexME and Branches are pretrained on ImageNet, while ApexME + MiE-X and Branches + MiE-X are pretrained on the synthesized dataset MiE-X.
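In practice, such synthesized data are typically used to pretrain a recognizer that is then fine-tuned on a small real ME dataset. The sketch below shows only this generic pretrain-then-fine-tune pattern, with toy tensors standing in for MiE-X-like and real ME data; it does not reproduce the exact training protocols of [62], [159], or [199].

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def make_model(num_classes=3):
    # A small stand-in recognizer; any MER backbone could be used instead.
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, num_classes),
    )

def train(model, loader, epochs=1, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            nn.functional.cross_entropy(model(x), y).backward()
            opt.step()

if __name__ == "__main__":
    # 'synthetic' stands in for a large generated set (e.g., MiE-X),
    # 'real' for a small spontaneous ME dataset.
    synthetic = DataLoader(TensorDataset(torch.randn(64, 3, 112, 112),
                                         torch.randint(0, 3, (64,))), batch_size=16)
    real = DataLoader(TensorDataset(torch.randn(16, 3, 112, 112),
                                    torch.randint(0, 3, (16,))), batch_size=8)
    model = make_model()
    train(model, synthetic)                 # pretrain on generated MEs
    train(model, real, lr=1e-4)             # fine-tune on the small real dataset
```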
Other works estimated facial motion using key points to generate ME images [212], [213]. Specifically, Fan et al. [212] developed a deep motion retargeting (DMR) network to capture subtle ME variations by learning key points. Zhang et al. [213] combined a key-point-based motion model and local affine transformations with facial prior knowledge and achieved first place in MEGC 2021. Besides, Yu et al. [214] proposed an Identity-aware and Capsule-Enhanced GAN (ICE-GAN) to synthesize MEs, with the discriminator detecting image authenticity and expression categories. Moreover, instead of generating ME images, Liong et al. [215] synthesized optical-flow images of MEs from computed optical flow to improve MER performance.

ME generation is a new direction in ME analysis, and the subtle and rapid facial movements make it challenging. Currently, the quality of the generated MEs is not realistic enough. However, with further investigation, generated MEs are expected to be helpful not only for other ME analysis tasks, such as MER, ME spotting, and ME-AU detection, but also for ME synthesis in augmented reality, HCI, and so on.

VI. OPEN CHALLENGES AND FUTURE DIRECTIONS

Knowing how others feel is an important element in social interactions, but it is not always an easy task. Sometimes, people may intentionally express their emotions in the form of, e.g., facial MaEs in order to deliver messages or attitudes, and sometimes, people may hide their true feelings for different reasons. Computational methods developed for ME analysis, including spotting, recognition, and generation, could help in multiple use cases, e.g., covert emotion understanding and expert training.

Covert Emotion Understanding: ME analysis tools can aid doctors and therapists in better understanding people's covert emotions for the sake of their emotional well-being. Just as in the setting where the ME phenomenon was first discovered, an important application case is mental assessment and tutoring, especially for young and disordered people. About one in ten young people is affected by mental health problems [216], which can cause wide-ranging effects. They can be long-lasting, and there are well-identified increased physical health problems associated with mental health [217]. Therapists often check recorded videos to examine and review patients'
conditions, which is very time-consuming, and ME analysis methods can be implemented as a tool aiding this process, e.g., to locate and tag suspect moments for review. Such fine-level video analysis can also help doctors find covert emotions in assessments, which are valuable for diagnosis and treatment. The technique can also be applied in other scenarios, such as law enforcement investigations, online education, and intelligent human–computer interaction, in which MEs should be considered for fine-level, accurate emotion interpretation.

Expert Training: Besides working as a tool aiding experts in multiple scenarios, ME analysis tools (especially ME generation) can also help in training experts to improve their ability to read covert emotions. One thing worth noticing is that sample diversity in terms of age, gender, and race should be considered and balanced in designing such tools to ensure fair usage of the technology. The ability to read others' emotions is essential for some occupations, such as those in medicine and security. A law enforcement officer or a doctor interacts with a large number of people daily and must make judgments and decisions based on observations within a few minutes. Training tools that can improve their ability to read people's emotions (even the covert ones) will benefit their performance at work. One study illustrated that medical students' communication skills were significantly improved after being trained with the ME training tool METT [218]. However, as mentioned in Section II, METT is composed of "man-made" ME clips that are different from real MEs. Using real spontaneous MEs would have better training effects, but it is not possible to obtain a large number of real samples covering all categories and designated identities, as required for developing a training tool. The technique of ME generation could provide another option: robust generation models can learn characteristics from real ME samples and generate a large number of MEs on designated model faces, which could provide better and richer training materials.

Even though MEs have become a hot topic with great potential in multiple application scenarios, there are noteworthy challenges from both technological and ethical perspectives, which need to be addressed in future studies.

A. Challenges From the Technological Perspective

1) Small-Scale and Imbalanced Data: Data are the central part of ME research. Although multiple datasets have been collected and released, the scale of most current datasets is still limited, i.e., a few hundred samples. Data annotation is one key issue that hinders the development of large-scale ME datasets, as it requires certified expertise and is very time-consuming. Moreover, some emotions, such as fear, are difficult to evoke, which causes data imbalance issues. Data-driven methods tend to classify test samples into the majority class, leading to poor classification performance. Thus, the lack of large-scale, well-annotated, and balanced ME data is still a big barrier to ME research. Since it is challenging to induce and label MEs from scratch, leveraging vast online videos with computer–human-cooperated annotations and synthesizing ME samples for imbalanced categories could be possible solutions for the data issues [219].

2) Compound MEs: Many MER studies are built on the assumption of a simplified scenario in which each ME appearing at one single time window corresponds to one emotion, e.g., happiness, anger, surprise, sadness, fear, or disgust. However, that is not always the case in practice. Psychological studies [220], [221] show that people can produce mixed expressions in their daily life when two or more "basic" emotions are felt. For instance, surprise can occur together with either happiness or fear and be expressed on the face as "happily surprised" or "fearfully surprised." Such mixed expressions are referred to as compound expressions; 17 compound expressions were identified in [221], suggesting that MaEs and emotions are more complex than previously believed. It is reasonable to assume that there could also be compound MEs, as MEs are initiated the same way as MaEs, driven by people's felt emotions. So far, only a few works [222] have considered compound emotions for ME analysis. Compound MEs could be rare and more challenging to study, but, as they reflect specific emotional states that practically exist, they should be considered and not ignored in future ME studies. On the other side, one should always be cautious when inferring a complex status such as a compound emotion based on the observation of such a short time window as an ME, as there are open discussions [223] about whether recognizing emotions without context is reliable.

3) Multimodal MER: Psychological research illustrates that there are various ways to express emotions. Visual scenes, voices, bodies, other faces, cultural orientation, and even words shape how emotions are perceived in a face [223]. Leveraging complementary information from multiple modalities can also enhance ME analysis for a better understanding of humans' covert emotions. With the rapid development of social media, a large amount of data, including texts, videos, and audio, is shared online, which could be employed for multimodality research. Moreover, videos recorded by various sensors provide different forms of visual information, such as RGB, depth, thermal, and 3-D meshes, which might contribute to the task of ME analysis in different ways. The two latest ME datasets (i.e., 4DME and CAS(ME)3) have already considered this in their data building. It would be interesting and valuable to explore integrating multiple channels and modalities for ME analysis in the future.

4) MEs and MaEs: Most previous facial expression studies explored MaEs and MEs separately, i.e., when exploring the MER task, concerning only ME clips and ignoring MaE cases. However, in practice, it is natural and frequent that MaEs and MEs coexist and even overlap [23] with each other. In the MEGC challenges (MEGC 2019 and MEGC 2020), the organizers posed one track to spot both MEs and MaEs
in long video clips. So far, the challenge data only consider a simpler situation in which MaEs and MEs coexist but occur separately. Future studies should dig deeper to explore more challenging situations where MaEs and MEs overlap with each other, which would be a substantial step toward the accurate understanding of human emotion in realistic situations.

5) MEs in Realistic Situations: So far, most existing ME studies are still restricted to analyzing MEs collected in lab environments. They usually concern frontal-view facial images without big head movements, illumination variations, or occlusions. However, in realistic situations, it is impossible to avoid these noises, and ME analysis methods built on the basis of constrained settings might not generalize well to wild environments. Effective algorithms for analyzing MEs in unconstrained settings with pose changes and illumination variations must be developed in the future.

Moreover, considering the facts that: 1) there is too little training data and 2) numerous "handcrafted" traditional features and "handcrafted" neural architectures have been proposed, more research could be dedicated to specific applications (e.g., in the medical field, HCI, security, and business), in which application-dependent constraints dealing, e.g., with illumination, viewing angles, and types of facial expression, can be used to simplify the problem. Collecting enough training samples and the use of multimodal data are also well-motivated and natural.

B. Ethical Issues

1) Privacy and Data Protection: Data are one of the most valuable assets in ME studies. ME data contain facial videos, which are sensitive data that must be considered for privacy protection to avoid potential leakage of the participants' personal information. Data protection laws, such as the EU General Data Protection Regulation (GDPR) [224] and the California Consumer Privacy Act (CCPA) [225], have been established to protect the privacy of personal data, referring to international data protection agreements, the transfer of participant names, record data, and so on.

In data collection for research purposes, participants are usually gathered in a voluntary way, and they sign a consent form before any data are collected. The consent form explains issues related to the data collection procedure, data processing, and data sharing, and it lists all rights and options that participants have. For example, a participant has the right to withdraw his/her own data at any time. Moreover, since people's faces include sensitive biometric information, a consent form should also concern and specify proper usage in various application scenes. Besides defining rules to regulate data usage, another aspect worth attention is privacy-preserving data sharing protocols and techniques, e.g., to remove sensitive information (e.g., the identity) while preserving the facial movement properties needed for ME analysis [226].

2) Fairness and Diversity Among the General Population: New technology should consider its fairness and validity among the general population with diverse ages, genders, cultures, ethnicities, and so on. For ME studies, this issue needs to be addressed from four aspects.

First, ME datasets should be more diverse. Most existing ME datasets contain samples from young college students of 18–35 years old from Asia and/or Europe due to the sites of research and the availability of participant recruitment, while data from older people or from African or Latino populations are lacking.

Second, the fairness and reliability of ME data labeling should be considered. MEs (and AUs) are difficult to label. The current standard for labeling is that two or more professional annotators work together and cross-check their labels, and the FACS system plays an important role, as an annotator should pass the FACS test to get his/her certificate and become a qualified annotator. The FACS test helps improve the reliability among different annotators, but more factors need to be considered. One is the cultural background of the annotators (e.g., Asian or Caucasian), which might impact their judgment. This is hard to tackle, as the overall number of certified FACS annotators is very small, and it is already hard for a research group to gather two or more for cross-checking. The other issue might be addressed or improved in the future, i.e., the FACS training materials only contain face images from a few Caucasians but not from other ethnicities. It is not known whether this impacts the labeling of Asian or African faces, but this could be addressed with help from psychologists and the FACS developers by adding more diverse faces to the training materials.

Third, the fairness of the MiX models should be considered in terms of their outputs on different populations. As current data are biased toward young Caucasian and Asian people, it is not known whether the trained models can generalize well to other population groups, or whether the outputs might be significantly biased on a diverse sample set.

Fourth, ME generation methods could be a helpful tool for improving the diversity of samples. As generation models can synthesize ME movements on any given face, we can select and generate samples on a balanced face set covering multiple age, gender, and ethnic groups. Such a balanced and diverse set of synthesized ME samples could serve better in applications, e.g., training experts to recognize covert emotions.

3) Regulated Usage of ME Technology: MEs provide important clues to people's true feelings and, thus, are useful in many potential applications. Meanwhile, there are risks if such technologies [227], [228], [229] are misused for malicious purposes. In both research communities and practical applications, the right to privacy and the right to know should be respected, and consent agreements should be made in any scenario where human participants are involved. People have the right to know that such technology is applied when they are entering a certain area, and they should also have the right to opt out unless
in law-enforcement scenarios. Legislation should be further developed to define specific rules that regulate the use of ME data and technologies.

VII. CONCLUSION

In conclusion, micro-expressions, being involuntary, subtle, and rapid facial expressions, possess the ability to unveil individuals' genuine emotions. The field of computer vision holds significant promise for automatic micro-expression analysis, presenting numerous potential applications and impacting our daily lives. This article offers a comprehensive review of the development of micro-expressions within the realm of computer vision. Instead of solely focusing on the introduction of machine-learning techniques for micro-expression detection and recognition, this overview encompasses the exploration of micro-expression analysis from its roots in psychology and early endeavors in computer vision to the diverse range of contemporary computational methods. The survey not only addresses the current state of the field but also highlights open challenges and outlines future directions, aiming to provide a tutorial-like reference point for anyone with an interest in micro-expressions.

REFERENCES
[1] J. Panksepp, Affective Neuroscience: The Jul. 2018. [34] W. E. Rinn, “The neuropsychology of facial
Foundations of Human and Animal Emotions. [19] Y. Li, J. Wei, Y. Liu, J. Kauttonen, and G. Zhao, expression: A review of the neurological and
London, U.K.: Oxford Univ. Press, 2004. “Deep learning for micro-expression recognition: psychological mechanisms for producing facial
[2] W. James, “What is emotion?” 2023 Amer. A survey,” IEEE Trans. Affect. Comput., vol. 13, expressions,” Psychol. Bull., vol. 95, no. 1,
Psychol. Assoc., Washington, DC, USA, no. 4, pp. 2028–2046, Oct. 2022. pp. 52–77, 1984.
Tech. Rep., 1948, pp. 290–303. [20] X. Ben et al., “Video-based facial micro-expression [35] D. Matsumoto and H. S. Hwang, “Evidence for
[3] P. E. Ekman and R. J. Davidson, The Nature of analysis: A survey of datasets, features and training the ability to read microexpressions of
Emotion: Fundamental Questions. London, U.K.: algorithms,” IEEE Trans. Pattern Anal. Mach. emotion,” Motivat. Emotion, vol. 35, no. 2,
Oxford Univ. Press, 1994. Intell., vol. 44, no. 9, pp. 5826–5846, Sep. 2022. pp. 181–191, Jun. 2011.
[4] H. Okon-Singer, T. Hendler, L. Pessoa, and [21] L. Zhou, X. Shao, and Q. Mao, “A survey of [36] S. Polikovsky, Y. Kameda, and Y. Ohta, “Facial
A. J. Shackman, “The neurobiology of micro-expression recognition,” Image Vis. micro-expressions recognition using high speed
emotion–cognition interactions: Fundamental Comput., vol. 105, Jan. 2021, Art. no. 104043. camera and 3D-gradient descriptor,” in Proc. 3rd
questions and strategies for future research,” [22] E. A. Haggard and K. S. Isaacs, “Micromomentary Int. Conf. Imag. Crime Detection Prevention, 2009,
Frontiers Hum. Neurosci., vol. 9, p. 58, Feb. 2015. facial expressions as indicators of ego mechanisms p. 16.
[5] L. Pessoa, “On the relationship between emotion in psychotherapy,” in Methods of Research in [37] M. Shreve, S. Godavarthy, D. Goldgof, and
and cognition,” Nature Rev. Neurosci., vol. 9, no. 2, Psychotherapy. Cham, Switzerland: Springer, S. Sarkar, “Macro- and micro-expression spotting
pp. 148–158, 2008. 1966, pp. 154–165. in long videos using spatio-temporal strain,” in
[6] P. E. Sifneos, “The prevalence of ‘alexithymic’ [23] P. Ekman and W. Friesen, “Nonverbal leakage and Proc. IEEE Conf. Workshops Autom. Face Gesture
characteristics in psychosomatic patients,” clues to deception,” Psychiatry, vol. 32, no. 1, Recognit., Mar. 2011, pp. 51–56.
Psychotherapy Psychosomatics, vol. 22, nos. 2–6, pp. 88–106, 1969. [38] M. Shreve, S. Godavarthy, V. Manohar, D. Goldgof,
pp. 255–262, 1973. [24] D. Matsumoto and H. Hwang. (2011). Reading and S. Sarkar, “Towards macro- and
[7] R. W. Picard, Affective Computing. Cambridge, MA, Facial Expressions of Emotion. [Online]. Available: micro-expression spotting in video using strain
USA: MIT Press, 2000. https://www.apa.org/science/about/psa/2011/05/ patterns,” in Proc. Workshop Appl. Comput. Vis.
[8] J. Tao and T. Tan, “Affective computing: A review,” facial-expressions (WACV), Dec. 2009, pp. 1–6.
in Proc. Int. Conf. Affect. Comput. Intell. Interact. [25] P. Ekman and W. V. Friesen, “Felt, false, and [39] X. Li, T. Pfister, X. Huang, G. Zhao, and
Cham, Switzerland: Springer, 2005, pp. 981–995. miserable smiles,” J. Nonverbal Behav., vol. 6, M. Pietikäinen, “A spontaneous micro-expression
[9] M. A. Hogg and D. Abrams, “Social cognition and no. 4, pp. 238–252, 1982. database: Inducement, collection and baseline,” in
attitudes,” Univ. Kent, Canterbury, U.K., [26] E. Svetieva and M. G. Frank, “Empathy, emotion Proc. 10th IEEE Int. Conf. Workshops Autom. Face
Tech. Rep., 2007, pp. 684–721. dysregulation, and enhanced microexpression Gesture Recognit. (FG), Apr. 2013, pp. 1–6.
[10] A. Mehrabian and M. Wiener, “Decoding of recognition ability,” Motivat. Emotion, vol. 40, [40] T.-K. Tran, Q.-N. Vo, X. Hong, X. Li, and G. Zhao,
inconsistent communications,” J. Pers. Social no. 2, pp. 309–320, Apr. 2016. “Micro-expression spotting: A new benchmark,”
Psychol., vol. 6, no. 1, p. 109, 1967. [27] U. Hess and R. E. Kleck, “Differentiating emotion Neurocomputing, vol. 443, pp. 356–368, Jul. 2021.
[11] T. T. Amsel, An Urban Legend Called: ‘The 7/38/55 elicited and deliberate emotional facial [41] W.-J. Yan, Q. Wu, Y.-J. Liu, S.-J. Wang, and X. Fu,
Ratio Rule’, vol. 13, 2nd ed. Warsaw, Poland: expressions,” Eur. J. Social Psychol., vol. 20, no. 5, “CASME database: A dataset of spontaneous
Sciendo, De Gruyter Poland Sp. z o.o., Jun. 2019, pp. 369–385, Sep. 1990. micro-expressions collected from neutralized
pp. 95–99. [28] C. M. Hurley, A. E. Anker, M. G. Frank, faces,” in Proc. 10th IEEE Int. Conf. Workshops
[12] P. Ekman, “Darwin, deception, and facial D. Matsumoto, and H. C. Hwang, “Background Autom. Face Gesture Recognit. (FG), Apr. 2013,
expression,” Ann. New York Acad. Sci., vol. 1000, factors predicting accuracy and improvement in pp. 1–7.
no. 1, pp. 205–221, Jan. 2006. micro expression recognition,” Motivat. Emotion, [42] W.-J. Yan et al., “CASME II: An improved
[13] P. Ekman and W. V. Friesen, “Constants across vol. 38, no. 5, pp. 700–714, Oct. 2014. spontaneous micro-expression database and the
cultures in the face and emotion,” J. Pers. Social [29] M. Frank, M. Herbasz, K. Sinuk, A. Keller, and baseline evaluation,” PLoS ONE, vol. 9, no. 1,
Psychol., vol. 17, no. 2, pp. 124–129, 1971. C. Nolan, “I see how you feel: Training laypeople Jan. 2014, Art. no. e86041.
[14] P. Ekman, “Lie catching and microexpressions,” in and professionals to recognize fleeting emotions,” [43] F. Qu, S.-J. Wang, W.-J. Yan, H. Li, S. Wu, and
The Philosophy of Deception. Oxford, U.K.: Oxford in Proc. Annu. Meeting Int. Commun. Assoc. X. Fu, “CAS(ME)2 : A database for spontaneous
Univ. Press, 2009, pp. 118–133. New York, NY, USA: Sheraton, 2009, pp. 1–35. macro-expression and micro-expression spotting
[15] H.-X. Xie, L. Lo, H.-H. Shuai, and W.-H. Cheng, [30] P. Ekman, “Microexpression training tool and recognition,” IEEE Trans. Affect. Comput.,
“An overview of facial micro-expression analysis: (METT),” Stanford Univ., Stanford, CA, USA, vol. 9, no. 4, pp. 424–436, Oct. 2018.
Data, methodology and challenge,” 2020, Tech. Rep., 2002. [44] A. K. Davison, C. Lansley, N. Costen, K. Tan, and
arXiv:2012.11307. [31] G. Warren, E. Schertler, and P. Bull, “Detecting M. H. Yap, “SAMM: A spontaneous micro-facial
[16] K. M. Goh, C. H. Ng, L. L. Lim, and U. U. Sheikh, deception from emotional and unemotional cues,” movement dataset,” IEEE Trans. Affect. Comput.,
“Micro-expression recognition: An updated review J. Nonverbal Behav., vol. 33, no. 1, pp. 59–69, vol. 9, no. 1, pp. 116–129, Jan. 2018.
of current trends, challenges and solutions,” Vis. Mar. 2009. [45] C. H. Yap, C. Kendrick, and M. H. Yap, “SAMM
Comput., vol. 36, no. 3, pp. 445–468, Mar. 2020. [32] W. V. Friesen and P. Ekman, “Facial action coding long videos: A spontaneous facial micro- and
[17] M. Takalkar, M. Xu, Q. Wu, and Z. Chaczko, system: A technique for the measurement of facial macro-expressions dataset,” in Proc. 15th IEEE Int.
“A survey: Facial micro-expression recognition,” movement,” Consulting, Palo Alto, CA, USA, Conf. Autom. Face Gesture Recognit. (FG),
Multimedia Tools Appl., vol. 77, no. 15, Tech. Rep. 22, 1978, vol. 3. Nov. 2020, pp. 771–776.
pp. 19301–19325, Aug. 2018. [33] X. Li et al., “4DME: A spontaneous 4D [46] P. Husák, J. Cech, and J. Matas, “Spotting facial
[18] Y.-H. Oh, J. See, A. C. L. Ngo, R. C. -W. Phan, and micro-expression dataset with multimodalities,” micro-expressions ‘in the wild,”’ in Proc. 22nd
V. M. Baskaran, “A survey of automatic facial IEEE Trans. Affect. Comput., early access, Comput. Vis. Winter Workshop (RETZ), 2017,
micro-expression analysis: Databases, methods, Jun. 14, 2022, doi: 10.1109/TAFFC.2022. pp. 1–9.
and challenges,” Frontiers Psychol., vol. 9, p. 1128, 3182342. [47] J. Li et al., “CAS(ME)3 : A third generation facial


spontaneous micro-expression database with Int. Conf. Multimedia, Oct. 2020, pp. 2237–2245. “Micro-expression recognition based on 3D flow
depth information and high ecological validity,” [66] Y. Li, X. Huang, and G. Zhao, “Micro-expression convolutional neural network,” Pattern Anal.
IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, action unit detection with spatial and channel Appl., vol. 22, pp. 1331–1339, Nov. 2018.
no. 3, pp. 2782–2800, Mar. 2023. attention,” Neurocomputing, vol. 436, [86] M. Peng et al., “Recognizing micro-expression in
[48] T. Pfister, X. Li, G. Zhao, and M. Pietikainen, pp. 221–231, May 2021. video clip with adaptive key-frame mining,” 2020,
“Differentiating spontaneous from posed facial [67] Z. Zhou, G. Zhao, and M. Pietikäinen, “Towards a arXiv:2009.09179.
expressions within a generic facial expression practical lipreading system,” in Proc. CVPR, [87] L. R. Medsker and L. Jain, “Recurrent neural
recognition framework,” in Proc. IEEE Int. Conf. Jun. 2011, pp. 137–144. networks,” Design Appl., vol. 5, pp. 64–67,
Comput. Vis. Workshops (ICCV Workshops), [68] S. Niklaus and F. Liu, “Context-aware synthesis for Dec. 2001.
Nov. 2011, pp. 1449–1456. video frame interpolation,” in Proc. IEEE/CVF [88] M. Bai and R. Goecke, “Investigating LSTM for
[49] G. Sandbach, S. Zafeiriou, M. Pantic, and L. Yin, Conf. Comput. Vis. Pattern Recognit., Jun. 2018, micro-expression recognition,” in Proc. Companion
“Static and dynamic 3D facial expression pp. 1701–1710. Publication Int. Conf. Multimodal Interact.,
recognition: A comprehensive survey,” Image Vis. [69] Y. Li, X. Huang, and G. Zhao, “Joint local and Oct. 2020, pp. 7–11.
Comput., vol. 30, no. 10, pp. 683–697, Oct. 2012. global information learning with single apex [89] J. L. Barron, D. J. Fleet, and S. S. Beauchemin,
[50] P. Viola and M. J. Jones, “Robust real-time face frame detection for micro-expression recognition,” “Performance of optical flow techniques,” Int. J.
detection,” Int. J. Comput. Vis., vol. 57, IEEE Trans. Image Process., vol. 30, pp. 249–263, Comput. Vis., vol. 12, no. 1, pp. 43–77, Feb. 1994.
pp. 137–154, May 2004. 2021. [90] B. D. Lucas, “Generalized image matching by the
[51] A. A. Salah, N. Sebe, and T. Gevers, [70] B. Sun, S. Cao, D. Li, J. He, and L. Yu, “Dynamic method of differences,” Carnegie Mellon Univ.,
“Communication and automatic interpretation of micro-expression recognition using knowledge Pittsburgh, PA, USA, Tech. Rep., 1986, p. 163.
affect from facial expressions,” in Affective distillation,” IEEE Trans. Affect. Comput., vol. 13, [91] B. K. P. Horn and B. G. Schunck, “Determining
Computing and Interaction: Psychological, no. 2, pp. 1037–1043, Apr. 2022. optical flow,” Artif. Intell., vol. 17, nos. 1–3,
Cognitive and Neuroscientific Perspectives. Hershey, [71] S.-T. Liong, J. See, K. Wong, and R. C.-W. Phan, pp. 185–203, Aug. 1981.
PA, USA: IGI Global, 2011, pp. 157–183. “Less is more: Micro-expression recognition from [92] T. Senst, V. Eiselein, and T. Sikora, “Robust local
[52] B. Wu, H. Ai, C. Huang, and S. Lao, “Fast rotation video using apex frame,” Signal Process., Image optical flow for feature tracking,” IEEE Trans.
invariant multi-view face detection based on real Commun., vol. 62, pp. 82–92, Mar. 2018. Circuits Syst. Video Technol., vol. 22, no. 9,
AdaBoost,” in Proc. 6th IEEE Int. Conf. Autom. Face [72] S. Liong and K. Wong, “Micro-expression pp. 1377–1387, Sep. 2012.
Gesture Recognit., May 2004, pp. 79–84. recognition using apex frame with phase [93] G. Farnebäck, “Two-frame motion estimation
[53] M. Matsugu, K. Mori, Y. Mitari, and Y. Kaneda, information,” in Proc. Asia–Pacific Signal Inf. based on polynomial expansion,” in Proc. Scand.
“Subject independent facial expression Process. Assoc. Annu. Summit Conf. (APSIPA ASC), Conf. Image Anal. Cham, Switzerland: Springer,
recognition with robust face detection using a Dec. 2017, pp. 534–537. 2003, pp. 363–370.
convolutional neural network,” Neural Netw., [73] A. J. R. Kumar and B. Bhanu, “Micro-expression [94] A. Wedel, T. Pock, C. Zach, H. Bischof, and
vol. 16, nos. 5–6, pp. 555–559, Jun. 2003. classification based on landmark relations with D. Cremers, “An improved algorithm for TV-L1
[54] D. Triantafyllidou and A. Tefas, “Face detection graph attention convolutional network,” in Proc. optical flow,” in Statistical and Geometrical
based on deep convolutional neural networks IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Approaches to Visual Motion Analysis. Cham,
exploiting incremental facial part learning,” in Workshops (CVPRW), Jun. 2021, pp. 1511–1520. Switzerland: Springer, 2009, pp. 23–45.
Proc. 23rd Int. Conf. Pattern Recognit. (ICPR), [74] J. Liu, W. Zheng, and Y. Zong, “SMA-STN: [95] A. Dosovitskiy et al., “FlowNet: Learning optical
Dec. 2016, pp. 3560–3565. Segmented movement-attending spatiotemporal flow with convolutional networks,” in Proc. IEEE
[55] R. Ranjan, V. M. Patel, and R. Chellappa, network for micro-expression recognition,” 2020, Int. Conf. Comput. Vis. (ICCV), Dec. 2015,
“HyperFace: A deep multi-task learning arXiv:2010.09342. pp. 2758–2766.
framework for face detection, landmark [75] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and [96] Z. Xia, X. Hong, X. Gao, X. Feng, and G. Zhao,
localization, pose estimation, and gender S. Gould, “Dynamic image networks for action “Spatiotemporal recurrent convolutional networks
recognition,” IEEE Trans. Pattern Anal. Mach. recognition,” in Proc. IEEE Conf. Comput. Vis. for recognizing spontaneous micro-expressions,”
Intell., vol. 41, no. 1, pp. 121–135, Jan. 2019. Pattern Recognit. (CVPR), Jun. 2016, IEEE Trans. Multimedia, vol. 22, no. 3,
[56] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic, pp. 3034–3042. pp. 626–640, Mar. 2020.
“Robust discriminative response map fitting with [76] X. Nie, M. A. Takalkar, M. Duan, H. Zhang, and [97] B. Song et al., “Recognizing spontaneous
constrained local models,” in Proc. IEEE Conf. M. Xu, “GEME: Dual-stream multi-task micro-expression using a three-stream
Comput. Vis. Pattern Recognit., Jun. 2013, gender-based micro-expression recognition,” convolutional neural network,” IEEE Access, vol. 7,
pp. 3444–3451. Neurocomputing, vol. 427, pp. 13–28, Feb. 2021. pp. 184537–184551, 2019.
[57] T. F. Cootes, G. J. Edwards, and C. J. Taylor, [77] T. T. Q. Le, T. Tran, and M. Rege, “Dynamic image [98] B. Allaert, I. R. Ward, I. M. Bilasco, C. Djeraba,
“Active appearance models,” IEEE Trans. Pattern for micro-expression recognition on region-based and M. Bennamoun, “Optical flow techniques for
Anal. Mach. Intell., vol. 23, no. 6, pp. 681–685, framework,” in Proc. IEEE 21st Int. Conf. Inf. Reuse facial expression analysis—A practical evaluation
Jun. 2001. Integr. Data Sci. (IRI), Aug. 2020, pp. 75–81. study,” 2019, arXiv:1904.11592.
[58] T. F. Cootes, C. J. Taylor, D. H. Cooper, and [78] M. Verma, S. K. Vipparthi, G. Singh, and [99] N. Liu, X. Liu, Z. Zhang, X. Xu, and T. Chen,
J. Graham, “Active shape models-their training S. Murala, “LEARNet: Dynamic imaging network “Offset or onset frame: A multi-stream
and application,” Comput. Vis. Image Understand., for micro expression recognition,” IEEE Trans. convolutional neural network with CapsuleNet
vol. 61, no. 1, pp. 38–59, Jan. 1995. Image Process., vol. 29, pp. 1618–1627, 2020. module for micro-expression recognition,” in Proc.
[59] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face [79] H. Bilen, B. Fernando, E. Gavves, and A. Vedaldi, 5th Int. Conf. Intell. Informat. Biomed. Sci.
detection and alignment using multitask cascaded “Action recognition with dynamic image (ICIIBMS), Nov. 2020, pp. 236–240.
convolutional networks,” IEEE Signal Process. networks,” IEEE Trans. Pattern Anal. Mach. Intell., [100] B. Sun, S. Cao, J. He, and L. Yu, “Two-stream
Lett., vol. 23, no. 10, pp. 1499–1503, Oct. 2016. vol. 40, no. 12, pp. 2799–2813, Dec. 2018. attention-aware network for spontaneous
[60] H.-Y. Wu, M. Rubinstein, E. Shih, J. Guttag, [80] S.-J. Wang et al., “Micro-expression recognition micro-expression movement spotting,” in Proc.
F. Durand, and W. Freeman, “Eulerian video using color spaces,” IEEE Trans. Image Process., IEEE 10th Int. Conf. Softw. Eng. Service Sci.
magnification for revealing subtle changes in the vol. 24, no. 12, pp. 6034–6047, Dec. 2015. (ICSESS), Oct. 2019, pp. 702–705.
world,” ACM Trans. Graph., vol. 31, no. 4, pp. 1–8, [81] G. Zhao and M. Pietikainen, “Dynamic texture [101] J. See, M. H. Yap, J. Li, X. Hong, and S. Wang,
Aug. 2012. recognition using local binary patterns with an “MEGC 2019—The second facial
[61] X. Li et al., “Towards reading hidden emotions: A application to facial expressions,” IEEE Trans. micro-expressions grand challenge,” in Proc. 14th
comparative study of spontaneous Pattern Anal. Mach. Intell., vol. 29, no. 6, IEEE Int. Conf. Autom. Face Gesture Recognit. (FG),
micro-expression spotting and recognition pp. 915–928, Jun. 2007. May 2019, pp. 1–5.
methods,” IEEE Trans. Affect. Comput., vol. 9, [82] V. Mayya, R. M. Pai, and M. M. M. Pai, “Combining [102] J. Li, S. Wang, M. H. Yap, J. See, X. Hong, and
no. 4, pp. 563–577, Oct. 2018. temporal interpolation and DCNN for faster X. Li, “MEGC2020—The third facial
[62] Y. Li, X. Huang, and G. Zhao, “Can recognition of micro-expressions in video micro-expression grand challenge,” in Proc. 15th
micro-expression be recognized based on single sequences,” in Proc. Int. Conf. Adv. Comput., IEEE Int. Conf. Autom. Face Gesture Recognit. (FG),
apex frame?” in Proc. 25th IEEE Int. Conf. Image Commun. Informat. (ICACCI), Sep. 2016, Nov. 2020, pp. 777–780.
Process. (ICIP), Oct. 2018, pp. 3094–3098. pp. 699–703. [103] D. Patel, G. Zhao, and M. Pietikäinen,
[63] Z. Xia, W. Peng, H. Khor, X. Feng, and G. Zhao, [83] H.-X. Xie, L. Lo, H.-H. Shuai, and W.-H. Cheng, “Spatiotemporal integration of optical flow
“Revealing the invisible with model and data “AU-assisted graph attention convolutional vectors for micro-expression detection,” in Proc.
shrinking for composite-database network for micro-expression recognition,” in Int. Conf. Adv. Concepts Intell. Vis. Syst. Cham,
micro-expression recognition,” IEEE Trans. Image Proc. 28th ACM Int. Conf. Multimedia, Oct. 2020, Switzerland: Springer, 2015, pp. 369–380.
Process., vol. 29, pp. 8590–8605, 2020. pp. 2871–2880. [104] S.-J. Wang, S. Wu, and X. Fu, “A main directional
[64] T.-H. Oh et al., “Learning-based video motion [84] D. Kim, W. J. Baddar, and Y. M. Ro, maximal difference analysis for spotting
magnification,” in Proc. Eur. Conf. Comput. Vis. “Micro-expression recognition with micro-expressions,” in Proc. Asian Conf. Comput.
(ECCV), 2018, pp. 633–648. expression-state constrained spatio-temporal Vis. Cham, Switzerland: Springer, 2016,
[65] L. Lei, J. Li, T. Chen, and S. Li, “A novel graph-TCN feature representations,” in Proc. 24th ACM Int. pp. 449–461.
with a graph structured representation for Conf. Multimedia, 2016, pp. 382–386. [105] L. Zhang et al., “Spatio-temporal fusion for macro-
micro-expression recognition,” in Proc. 28th ACM [85] J. Li, Y. Wang, J. See, and W. Liu, and micro-expression spotting in long video


sequences,” in Proc. 15th IEEE Int. Conf. Autom. Jan. 2016. [143] M. Peng, C. Wang, T. Bi, Y. Shi, X. Zhou, and
Face Gesture Recognit. (FG), Nov. 2020, [123] N. Dalal and B. Triggs, “Histograms of oriented T. Chen, “A novel apex-time network for
pp. 734–741. gradients for human detection,” in Proc. IEEE cross-dataset micro-expression recognition,” in
[106] Z. Xia, X. Feng, J. Peng, X. Peng, and G. Zhao, Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Proc. 8th Int. Conf. Affect. Comput. Intell. Interact.
“Spontaneous micro-expression spotting via (CVPR), Jun. 2005, pp. 886–893. (ACII), Sep. 2019, pp. 1–6.
geometric deformation modeling,” Comput. Vis. [124] Y. Zhang, H. Jiang, X. Li, B. Lu, K. M. Rabie, and [144] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi,
Image Understand., vol. 147, pp. 87–94, A. U. Rehman, “A new framework combining “Inception-v4, inception-ResNet and the impact of
Jun. 2016. local-region division and feature selection for residual connections on learning,” in Proc. AAAI
[107] Z. Zhang, T. Chen, H. Meng, G. Liu, and X. Fu, micro-expressions recognition,” IEEE Access, Conf. Artif. Intell., vol. 31, 2017, pp. 1–7.
“SMEConvNet: A convolutional neural network vol. 8, pp. 94499–94509, 2020. [145] L. Zhou, Q. Mao, and L. Xue, “Dual-inception
for spotting spontaneous facial micro-expression [125] B. Allaert, I. M. Bilasco, and C. Djeraba, network for cross-database micro-expression
from long videos,” IEEE Access, vol. 6, “Consistent optical flow maps for full and micro recognition,” in Proc. 14th IEEE Int. Conf. Autom.
pp. 71143–71151, 2018. facial expression recognition,” in Proc. VISAPP, Face Gesture Recognit. (FG), May 2019, pp. 1–5.
[108] T.-K. Tran, Q.-N. Vo, X. Hong, and G. Zhao, “Dense vol. 5. Setúbal, Portugal: SciTePress, 2017, [146] L. Zhou, Q. Mao, and M. Dong, “Objective
prediction for micro-expression spotting based on pp. 235–242. class-based micro-expression recognition through
deep sequence model,” Electron. Imag., vol. 31, [126] S.-T. Liong, R. C.-W. Phan, J. See, Y.-H. Oh, and simultaneous action unit detection and feature
no. 8, pp. 401-1–401-6, Jan. 2019. K. Wong, “Optical strain based recognition of aggregation,” 2020, arXiv:2012.13148.
[109] H. Pan, L. Xie, and Z. Wang, “Local bilinear subtle emotions,” in Proc. Int. Symp. Intell. Signal [147] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep
convolutional neural network for spotting macro- Process. Commun. Syst. (ISPACS), Dec. 2014, face recognition,” in Proc. Brit. Mach. Vis. Conf.,
and micro-expression intervals in long video pp. 180–184. 2015, pp. 1–12.
sequences,” in Proc. 15th IEEE Int. Conf. Autom. [127] S.-T. Liong et al., “Spontaneous subtle expression [148] Y. S. Gan, S.-T. Liong, W.-C. Yau, Y.-C. Huang, and
Face Gesture Recognit. (FG), Nov. 2020, detection and recognition based on facial strain,” L.-K. Tan, “OFF-ApexNet on micro-expression
pp. 343–347. Signal Process., Image Commun., vol. 47, recognition system,” Signal Process., Image
[110] S. Wang, Y. He, J. Li, and X. Fu, “MESNet: A pp. 170–182, Sep. 2016. Commun., vol. 74, pp. 129–139, May 2019.
convolutional neural network for spotting [128] S. L. Happy and A. Routray, “Fuzzy histogram of [149] H. Khor, J. See, S. Liong, R. C. W. Phan, and
multi-scale micro-expression intervals in long optical flow orientations for micro-expression W. Lin, “Dual-stream shallow networks for facial
videos,” IEEE Trans. Image Process., vol. 30, recognition,” IEEE Trans. Affect. Comput., vol. 10, micro-expression recognition,” in Proc. IEEE Int.
pp. 3956–3969, 2021. no. 3, pp. 394–406, Jul. 2019. Conf. Image Process. (ICIP), Sep. 2019, pp. 36–40.
[111] A. Moilanen, G. Zhao, and M. Pietikäinen, [129] S.-J. Wang, W.-J. Yan, X. Li, G. Zhao, and X. Fu, [150] L. Yao, X. Xiao, R. Cao, F. Chen, and T. Chen,
“Spotting rapid facial movements from videos “Micro-expression recognition using dynamic “Three stream 3D CNN with SE block for
using appearance-based feature difference textures on tensor independent color space,” in micro-expression recognition,” in Proc. Int. Conf.
analysis,” in Proc. 22nd Int. Conf. Pattern Recognit., Proc. IEEE Int. Conf. Pattern Recognit., Aug. 2014, Comput. Eng. Appl. (ICCEA), Mar. 2020,
Aug. 2014, pp. 1722–1727. pp. 4678–4683. pp. 439–443.
[112] S.-T. Liong, J. See, K. Wong, A. C. L. Ngo, [130] F. Xu, J. Zhang, and J. Z. Wang, “Microexpression [151] H. Yan and L. Li, “Micro-expression recognition
Y.-H. Oh, and R. Phan, “Automatic apex frame identification and categorization using a facial using enriched two stream 3D convolutional
spotting in micro-expression database,” in Proc. dynamics map,” IEEE Trans. Affect. Comput., network,” in Proc. 4th Int. Conf. Comput. Sci. Appl.
3rd IAPR Asian Conf. Pattern Recognit. (ACPR), vol. 8, no. 2, pp. 254–267, Apr. 2017. Eng., Oct. 2020, pp. 1–5.
Nov. 2015, pp. 665–669. [131] C. Cortes and V. Vapnik, “Support-vector [152] W. She, Z. Lv, J. Taoi, B. Liu, and M. Niu,
[113] S.-J. Wang, S. Wu, X. Qian, J. Li, and X. Fu, networks,” Mach. Learn., vol. 20, no. 3, “Micro-expression recognition based on multiple
“A main directional maximal difference analysis pp. 273–297, Jul. 1995. aggregation networks,” in Proc. Asia–Pacific Signal
for spotting facial movements from long-term [132] A. C. L. Ngo, R. C.-W. Phan, and J. See, Inf. Process. Assoc. Annu. Summit Conf. (APSIPA
videos,” Neurocomputing, vol. 230, pp. 382–389, “Spontaneous subtle expression recognition: ASC), Dec. 2020, pp. 1043–1047.
Mar. 2017. Imbalanced databases and solutions,” in Proc. [153] S. C. Nistor, “Multi-staged training of deep neural
ABOUT THE AUTHORS

Guoying Zhao (Fellow, IEEE) received the Ph.D. degree in computer science from the Chinese Academy of Sciences, Beijing, China, in 2005.
She is currently an Academy Professor and a Full Professor (tenured in 2017) with the University of Oulu, Oulu, Finland. She is also a Visiting Professor with Aalto University, Espoo, Finland. She has authored or coauthored more than 300 papers in journals and conferences with more than 23 330 citations in Google Scholar and an H-index of 72 (May 2023). Her current research interests include image and video descriptors, facial expression and micro-expression recognition, emotional gesture analysis, affective computing, and biometrics.
Dr. Zhao is a member of Academia Europaea, a member of the Finnish Academy of Sciences and Letters, and a Fellow of the International Association for Pattern Recognition (IAPR) and Asia-Pacific Artificial Intelligence Association (AAIA). She was the Panel Chair of International Conference on Automatic Face and Gesture Recognition (FG) 2023, the Co-Program Chair of the ACM International Conference on Multimodal Interaction (ICMI 2021), and the Publicity Chair of Scandinavian Conference on Image Analysis (SCIA) 2023 and FG 2018. She has served as the area chair of several conferences and was/is an Associate Editor for IEEE TRANSACTIONS ON MULTIMEDIA, Pattern Recognition, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, and Image and Vision Computing journals. Her research has been reported by Finnish TV programs, newspapers, and MIT Technology Review.

Xiaobai Li (Senior Member, IEEE) received the B.Sc. degree in psychology from Peking University, Beijing, China, in 2004, the M.Sc. degree in biophysics from the Chinese Academy of Sciences, Beijing, in 2007, and the Ph.D. degree in computer science from the University of Oulu, Oulu, Finland, in 2017.
She is currently an Assistant Professor with the Center for Machine Vision and Signal Analysis, University of Oulu. Her research interests include facial expression recognition, micro-expression analysis, remote physiological signal measurement from facial videos, and related applications in affective computing and healthcare.
Dr. Li was the Co-Chair of several international workshops in Conference on Computer Vision and Pattern Recognition (CVPR), International Conference on Computer Vision (ICCV), International Conference on Automatic Face and Gesture Recognition (FG), and ACM Multimedia. She is an Associate Editor of IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (TCSVT), Frontiers in Psychology, and Image and Vision Computing.

Yante Li (Member, IEEE) received the Ph.D. degree in computer science from the University of Oulu, Oulu, Finland, in 2022.
She currently holds a postdoctoral position at the Center for Machine Vision and Signal Analysis, University of Oulu. Her current research interests include affective computing, micro-expression analysis, and facial action unit detection.

Matti Pietikäinen (Life Fellow, IEEE) received the Doctor of Science in Technology degree from the University of Oulu, Oulu, Finland, in 1982.
From 1980 to 1981 and 1984 to 1985, he visited the Computer Vision Laboratory, University of Maryland, Baltimore, MD, USA. He is currently an Emeritus Professor with the Center for Machine Vision and Signal Analysis, University of Oulu. He has made fundamental contributions, e.g., to the local binary pattern (LBP) methodology, texture-based image and video analysis, and facial image analysis. He has authored over 350 refereed papers in international journals, books, and conferences. His papers have over 83 500 citations in Google Scholar (H-index: 100) (April 2023).
Dr. Pietikäinen served as a member of the Governing Board of the International Association for Pattern Recognition (IAPR) from 1989 to 2007. He became one of the founding fellows of the IAPR in 1994. He is an IEEE Fellow for his contributions to texture and facial image analysis for machine vision. In 2014, his research on LBP-based face description was awarded the Koenderink Prize for Fundamental Contributions in Computer Vision. He was a recipient of the prestigious IAPR King-Sun Fu Prize 2018 for fundamental contributions to texture analysis and facial image analysis. He received the Computer Science Leader Award from Research.com. He ranked 351 in the world and 1 in Finland. He was an Associate Editor of IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (TPAMI), Pattern Recognition, IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, IEEE TRANSACTIONS ON BIOMETRICS, BEHAVIOR, AND IDENTITY SCIENCE, and Image and Vision Computing journals. He serves as a Guest Editor for special issues of IEEE TPAMI and PROCEEDINGS OF THE IEEE. He was the President of the Pattern Recognition Society of Finland from 1989 to 1992 and was named its Honorary Member in 2014.
