Recognizing Upper Face Action Units for Facial Expression Analysis

Ying-li Tian (1), Takeo Kanade (1), and Jeffrey F. Cohn (2)
(1) Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213
(2) Department of Psychology, University of Pittsburgh, Pittsburgh, PA 15260
Email: {yltian, tk}@cs.cmu.edu, jeffcohn@pitt.edu

Abstract
We develop an automatic system to analyze subtle changes in upper face expressions based on both permanent facial features (brows, eyes, mouth) and transient facial features (deepening of facial furrows) in a nearly frontal image sequence. Our system recognizes fine-grained changes in facial expression based on Facial Action Coding System (FACS) action units (AUs). Multi-state facial component models are proposed for tracking and modeling different facial features, including eyes, brows, cheeks, and furrows. Then we convert the results of tracking to detailed parametric descriptions of the facial features. These feature parameters are fed to a neural network which recognizes 7 upper face action units. A recognition rate of 95% is obtained for test data that include both single action units and AU combinations.
1. Introduction

Recently, facial expression analysis has attracted attention in the computer vision literature [3, 5, 6, 8, 10, 12, 15]. Most automatic expression analysis systems attempt to recognize a small set of prototypic expressions (i.e., joy, surprise, anger, sadness, fear, and disgust) [10, 15]. In everyday life, however, such prototypic expressions occur relatively infrequently. Instead, emotion is communicated by changes in one or two discrete facial features, such as tightening the lips in anger or obliquely lowering the lip corners in sadness [2]. Change in isolated features, especially in the area of the brows or eyelids, is typical of paralinguistic displays; for instance, raising the brows signals greeting. To capture the subtlety of human emotion and paralinguistic communication, automated recognition of fine-grained changes in facial expression is needed. Ekman and Friesen [4] developed the Facial Action Coding System (FACS) for describing facial expressions by action units (AUs). AUs are anatomically related to contraction of specific facial muscles. They can occur either singly or in combinations. AU combinations may be additive, in which case the combination does not change the appearance of the constituents, or non-additive, in which case the appearance of the constituents changes. Although the number of atomic action units is small, numbering only 44, more than 7,000 combinations of action units have been observed [11]. FACS provides the necessary detail with which to describe facial expression.

Automatic recognition of AUs is a difficult problem. AUs have no quantitative definitions and, as noted, can appear in complex combinations. Mase [10] and Essa [5] described patterns of optical flow that corresponded to several AUs but did not attempt to recognize them. Several researchers have tried to recognize AUs [1, 3, 8, 14]. The system of Lien et al. [8] used dense flow, feature point tracking, and edge extraction to recognize 3 upper face AUs (AU1+2, AU1+4, and AU4) and 6 lower face AUs. A separate hidden Markov model (HMM) was used for each AU or AU combination. However, it is intractable to model more than 7,000 AU combinations separately. Bartlett et al. [1] recognized 6 individual upper face AUs (AU1, AU2, AU4, AU5, AU6, and AU7) but no combinations. Donato et al. [3] compared several techniques for recognizing 6 single upper face AUs and 6 lower face AUs. These techniques included optical flow, principal component analysis, independent component analysis, local feature analysis, and Gabor wavelet representation. The best performances were obtained using a Gabor wavelet representation and independent component analysis. All of these systems [1, 3, 8] used a manual step to align the input images with a standard face image using the centers of the eyes and mouth.

We developed a feature-based AU recognition system. This system explicitly analyzes appearance changes in localized facial features. Since each AU is associated with a specific set of facial muscles, we believe that accurate geometrical modeling of facial features will lead to better recognition results. Furthermore, knowledge of exact facial feature positions could benefit area-based [15], holistic [1], or optical flow based [8] classifiers. Figure 1 depicts the overview of the analysis system. First, the head orientation and face position are detected. Then, appearance changes in the facial features are measured based on the multi-state facial component models. Motivated by FACS action units, these changes are represented as a collection of mid-level feature parameters. Finally, AUs are classified by feeding these parameters into two neural networks (one for the upper face, one for the lower face), because facial actions in the upper and lower face are relatively independent [4]. The networks can recognize AUs whether the AUs occur singly or in combinations. In previous work [14], we recognized 11 AUs in the lower face and achieved a 96.7% average recognition rate.
Figure 2. Dual-state eye model. (a) An open eye. (b) A closed eye. (c) The state transition diagram. (d) The open eye parameter model. (e) The closed eye parameter model.
Figure 2 shows the dual-state eye model. Using information from the iris of the eye, we distinguish two eye states, open and closed. When the eye is open, part of the iris normally will be visible. When it is closed, the iris is absent. For the different states, specific eye templates and different algorithms are used to obtain eye features. For an open eye, we assume the outer contour of the eye is symmetrical about the perpendicular bisector of the line connecting the two eye corners. The template, illustrated in Figure 2 (d), is composed of a circle with three parameters (x0, y0, r) and two parabolic arcs with six parameters (xc, yc, h1, h2, w, θ). This is the same eye template as Yuille's [16], except for the two points located at the centers of the whites. For a closed eye, the template is reduced to 4 parameters: the positions of the two eye corners, (x1, y1) and (x2, y2) (Figure 2 (e)).
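As a small illustration of this parameterization, the sketch below simply stores the two template states; the class and field names are our own shorthand, not identifiers from the paper.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class OpenEyeTemplate:
    # Iris modeled as a circle: center (x0, y0) and radius r.
    x0: float
    y0: float
    r: float
    # Eyelid contours modeled as two parabolic arcs around the eye center
    # (xc, yc), with upper height h1, lower height h2, width w, and
    # orientation theta of the line joining the eye corners.
    xc: float
    yc: float
    h1: float
    h2: float
    w: float
    theta: float

@dataclass
class ClosedEyeTemplate:
    # For a closed eye only the two corner positions are kept
    # (four parameters in total).
    inner_corner: Tuple[float, float]
    outer_corner: Tuple[float, float]
```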
Figure 3. Half-circle iris mask. (x0, y0) is the iris center; r0 is the iris radius; r1 is the minimum radius of the mask; r2 is the maximum radius of the mask.
Most eye trackers developed so far handle only open eyes and simply track the eye locations. For facial expression analysis, however, a more detailed description of the eye is needed. The dual-state eye model is used to detect an open eye or a closed/blinking eye. The default eye state is open. After the open-eye template is located in the first frame, the inner corners of the eyes are tracked accurately by feature point tracking. We found that the outer corners are hard to track and less stable than the inner corners, so we assume the outer corners lie on the line that connects the inner corners. The outer corners can then be obtained from the eye width, which is calculated from the first frame. The iris provides important information about the eye state. Intensity and edge information are used to detect the iris, and a half-circle iris mask is used to obtain correct iris edges (Figure 3). If the iris is detected, the eye is open and the iris center is the iris mask center (x0, y0). In an image sequence, the eyelid contours are tracked for open eyes by feature point tracking. For a closed eye, tracking of the eyelid contours is omitted; a line connecting the inner and outer corners of the eye is used as the eye boundary. Some eye tracking results for the different states are shown in Figure 4. Detailed eye feature tracking techniques can be found in [13].
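A rough sketch of how such a half-circle mask could be used to test for a visible iris is given below. It assumes the lower half of the annulus between r1 and r2 and a simple gradient-magnitude edge test; the thresholds and the exact decision rule are placeholders, since the paper does not specify them.

```python
import numpy as np

def half_circle_mask(shape, center, r1, r2):
    """Boolean mask of the lower half-annulus between radii r1 and r2.

    Using the lower half is an assumption; the paper only describes the
    mask as a half circle around the iris center.
    """
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    x0, y0 = center
    dist = np.hypot(xx - x0, yy - y0)
    return (dist >= r1) & (dist <= r2) & (yy >= y0)

def looks_like_open_eye(gray_eye, center, r1, r2,
                        edge_thresh=40.0, min_edge_frac=0.2):
    """Rough open/closed decision: enough strong edges along the iris
    boundary inside the half-circle mask suggests a visible iris (open eye).
    Threshold values are illustrative, not taken from the paper."""
    gy, gx = np.gradient(gray_eye.astype(float))
    edge_mag = np.hypot(gx, gy)
    mask = half_circle_mask(gray_eye.shape, center, r1, r2)
    if mask.sum() == 0:
        return False
    edge_frac = np.mean(edge_mag[mask] > edge_thresh)
    return edge_frac > min_edge_frac
```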
Figure 4. Tracking results for (a) a narrowly opened eye and (b) a widely opened eye with blinking.
Transient features such as facial furrows appear and deepen with contraction of the activated muscle. These transient features provide crucial information for the recognition of action units. Contraction of the corrugator muscle, for instance, produces vertical furrows between the brows, which is coded in FACS as AU4, while contraction of the medial portion of the frontalis muscle (AU1) causes horizontal wrinkling in the center of the forehead.
Action units can occur either singly or in combinations. AU combinations may be additive, in which case the combination does not change the appearance of the constituents (e.g., AU1+5), or non-additive, in which case the appearance of the constituents does change (e.g., AU1+4). Table 2 shows the definitions of 7 individual upper face AUs and 5 non-additive combinations involving these action units. As an example of a non-additive effect, AU4 appears differently depending on whether it occurs alone or in combination with AU1, as in AU1+4. When AU4 occurs alone, the brows are drawn together and lowered. In AU1+4, the brows are drawn together but are raised by the action of AU1. As another example, it is difficult to notice any difference between the static images of AU2 and AU1+2 because the action of AU2 pulls the inner brow up, which results in a very similar appearance to AU1+2. In contrast, the action of AU1 alone has little effect on the outer brow.
Crows-feet wrinkles appearing to the side of the outer eye corners are useful features for recognizing upper face AUs. For example, the lower eyelid is raised for both AU6 and AU7, but the crows-feet wrinkles appear for AU6 only. Compared with the neutral frame, the wrinkle state is present if the wrinkles appear, deepen, or lengthen; otherwise, it is absent. After locating the outer corners of the eyes, edge detectors search for crows-feet wrinkles. We compare the number of edge pixels E in the wrinkle areas of the current frame with the number E0 in the first frame. If E/E0 is larger than the threshold T, the crows-feet wrinkles are present. Otherwise, they are absent.
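This decision rule is simple enough to write out directly. The sketch below assumes a gradient-magnitude edge detector and an illustrative threshold T, since the text does not name the detector or the threshold value.

```python
import numpy as np

def edge_pixel_count(gray_patch, mag_thresh=30.0):
    """Count edge pixels in a wrinkle region using a simple gradient-magnitude
    detector (the paper does not specify which edge detector it uses)."""
    gy, gx = np.gradient(gray_patch.astype(float))
    return int(np.count_nonzero(np.hypot(gx, gy) > mag_thresh))

def crows_feet_present(current_patch, neutral_patch, T=1.2):
    """Wrinkle state: present if E / E0 > T, where E and E0 are the edge pixel
    counts in the current and first (neutral) frames. T is illustrative."""
    E = edge_pixel_count(current_patch)
    E0 = max(edge_pixel_count(neutral_patch), 1)  # avoid division by zero
    return (E / E0) > T
```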
To recognize subtle changes of facial expression, we represent the upper face features as 15 parameters. Of these, 12 parameters describe the motion and shape of the eyes, brows, and cheeks; 2 parameters describe the state of the crows-feet wrinkles; and 1 parameter describes the distance between the brows. To define these parameters, we first define a coordinate system. Because we found that the inner corners of the eyes are the most stable features in the face and are insensitive to deformation by facial expressions, we define the x-axis as the line connecting the two inner corners of the eyes and the y-axis as perpendicular to the x-axis. Figure 5 shows the coordinate system and the parameter definitions. The definitions of the upper face parameters are listed in Table 3. To remove the effects of different face-image sizes across image sequences, all the parameters (except the furrow parameters) are normalized by dividing by the distance between each feature and the line connecting the two inner corners of the eyes in the neutral frame.

Figure 5. Upper face features. hl (hl1 + hl2) and hr (hr1 + hr2) are the heights of the left and right eyes; D is the distance between the brows; cl and cr are the motions of the left and right cheeks; bli and bri are the motions of the inner parts of the left and right brows; blo and bro are the motions of the outer parts of the left and right brows; fl and fr are the left and right crows-feet wrinkle areas.

Table 3. Upper face feature parameters (a subscript 0 denotes the value in the neutral frame). Eye height: r_eheight = ((h1 + h2) - (h10 + h20)) / (h10 + h20); if r_eheight > 0, the eye height increases. Eye bottom lid motion: r_btm = -(h2 - h20) / h20; if r_btm > 0, the eye bottom lid moves up. Distance between brows: r_D = (D - D0) / D0. Inner and outer brow motion (r_binner, r_bouter) and cheek motion (r_cheek) are defined analogously as relative changes with respect to the neutral frame. Left and right crows-feet wrinkles (W_left, W_right): if W_left = 1, the left crows-feet wrinkles are present.
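As a minimal sketch, a few of the Table 3 parameters can be computed from tracked measurements as follows; the dictionary keys and helper function are our own, and only the parameters whose formulas appear above are included.

```python
def ratio(value, neutral):
    """Relative change with respect to the neutral (first) frame."""
    return (value - neutral) / neutral

def upper_face_parameters(cur, neu):
    """Compute a few upper face parameters from Table 3.

    `cur` and `neu` are dicts of measurements in the current and neutral
    frames: eyelid heights h1 and h2, and brow distance D.
    """
    r_eheight = ratio(cur["h1"] + cur["h2"], neu["h1"] + neu["h2"])  # eye height
    r_btm = -(cur["h2"] - neu["h2"]) / neu["h2"]  # > 0 means bottom lid moves up
    r_D = ratio(cur["D"], neu["D"])               # distance between brows
    return {"r_eheight": r_eheight, "r_btm": r_btm, "r_D": r_D}

# Example with made-up measurements (not data from the paper):
neutral = {"h1": 5.0, "h2": 4.0, "D": 60.0}
current = {"h1": 7.0, "h2": 3.0, "D": 55.0}
print(upper_face_parameters(current, neutral))
```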
5. Image Databases
Two databases were used to evaluate our system: the CMU-Pittsburgh AU-Coded face expression image database (CMU-Pittsburgh database) [7] and Ekman and Hager's facial action exemplars (Ekman&Hager database). The latter was used by Donato and Bartlett [1, 3]. We used the Ekman&Hager database to train the network; during testing, both databases were used. Moreover, in part of our evaluation, we trained and tested on completely disjoint databases
that were collected by different research teams under different recording conditions and coded (ground truth) by separate teams of FACS coders. This is a more rigorous test of generalizability than the more customary method of dividing a single database into test and training sets.

Ekman&Hager database: This image database was obtained from 24 Caucasian subjects, 12 males and 12 females. Each image sequence consists of 6-8 frames, beginning with a neutral expression or very low magnitude facial actions and ending with high magnitude facial actions. For each sequence, action units were coded by a certified FACS coder.

CMU-Pittsburgh database: This database currently consists of facial behavior recorded from 210 adults between the ages of 18 and 50 years. Subjects were 69% female, 31% male, 81% Euro-American, 13% Afro-American, and 6% other groups. Subjects sat directly in front of the camera and performed a series of facial expressions that included single AUs and AU combinations. To date, 1917 image sequences from 182 subjects have been FACS coded for either target AUs or the entire sequence. Approximately fifteen percent of the 1917 sequences were re-coded by a second certified FACS coder to validate the accuracy of the coding. Each expression sequence began from a neutral face.
6. AU Recognition
We used three-layer neural networks with one hidden layer to recognize AUs. The inputs of the neural networks are the 15 parameters shown in Table 3. The outputs are the upper face AUs; each output unit gives an estimate of the probability that the input image contains the associated AU. In this section, we conducted three experiments. In the first, we compare our results with others' using the same database. In the second, we study the more difficult case in which AUs occur either individually or in combinations. In the third, we investigate the generalizability of our system on independent databases recorded under different conditions and in different laboratories. The optimal number of hidden units needed to achieve the best average recognition rate was also studied.
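A minimal sketch of such a network is shown below: 15 inputs, one hidden layer (6 units, as in the single-AU experiment reported below), and sigmoid outputs read as per-AU probabilities. The weights here are random stand-ins; the paper's trained weights, output ordering, and learning procedure are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
N_INPUT, N_HIDDEN, N_OUTPUT = 15, 6, 7  # 15 feature parameters, 7 upper face AU outputs

# Randomly initialized weights stand in for the trained network.
W1 = rng.normal(scale=0.1, size=(N_HIDDEN, N_INPUT))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(scale=0.1, size=(N_OUTPUT, N_HIDDEN))
b2 = np.zeros(N_OUTPUT)

# Assumed output order; the paper does not specify one.
AU_NAMES = ["AU0", "AU1", "AU2", "AU4", "AU5", "AU6", "AU7"]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upper_face_au_scores(params):
    """Forward pass: 15 feature parameters -> per-AU scores in [0, 1].

    Each output is read as an estimate of the probability that the input
    contains the corresponding AU, so one network covers single AUs and
    AU combinations alike.
    """
    hidden = np.tanh(W1 @ params + b1)
    scores = sigmoid(W2 @ hidden + b2)
    return dict(zip(AU_NAMES, scores.round(3)))

print(upper_face_au_scores(rng.normal(size=N_INPUT)))
```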
For comparison with the AU recognition results of Bartlett et al. [1], we trained and tested our system on the same database (the Ekman&Hager database). In this experiment, 99 image sequences containing only individual AUs in the upper face were used. Two test sets were selected, as shown in Table 4. In the Familiar faces test set, some subjects appear in both the training and test sets, although the sets have no sequences in common. In the Novel faces test set, to study the robustness of the system to novel faces, we ensured that no subject appears in both the training and test sets. Training and testing were performed on the initial and final two frames of each image sequence. For some of the image sequences with large lighting changes, lighting normalization was performed.
Table 4. Data distribution of each data set for upper face AU recognition (Ekman&Hager database).

Dataset                   AU0  AU1  AU2  AU4  AU5  AU6  AU7
Training set               47   14   12   16   22   12    8
Familiar faces test set    52   14   12   20   24   14   20
Novel faces test set       49   10   10   22   28    4   22

Table 5. Comparison with Donato's and Bartlett's systems for AU recognition using the Ekman&Hager database.

Method                                     Recognition rate
Bartlett's system (feature-based)          85.3% familiar faces, 57% novel faces
Our system (feature-based)                 92.3% familiar faces, 92.9% novel faces (best performance 95%)
Bartlett's system (hybrid)                 90.9%
Donato's system (ICA or Gabor wavelet)     95%
In the systems of Bartlett and Donato [1, 3], 80 image sequences containing only individual AUs in the upper face were used. They manually aligned the faces using three coordinates, rotated the eyes to horizontal, scaled the image, and cropped it to a fixed size. Their systems were trained and tested using leave-one-out cross-validation, and the mean classification accuracy was calculated across all of the test cases. The comparison is shown in Table 5. For the 7 single AUs in the upper face, our system achieves an average recognition rate of 92.3% for familiar faces (new images of the faces used for training) on the Familiar faces test set and 92.9% on the Novel faces test set, with zero false alarms. From experiments, we found that 6 hidden units gave the best performance. Bartlett's [1] feature-based classifier achieved 85.3% on familiar faces and 57% on novel faces. Donato et al. [3] did not report whether the test images were familiar or novel faces. Our system achieved a 95% average recognition rate as its best performance, for recognizing 7 single AUs and more than 10 AU combinations in the upper face. Bartlett et al. [1] increased their recognition accuracy to 90.9% by combining holistic spatial analysis and optical flow with local features in a hybrid system for 6 single upper face AUs. The best performance of Donato et al.'s [3] system was obtained using a Gabor wavelet representation and independent component analysis (ICA), achieving a 95% average recognition rate for 6 single upper face AUs and 6 lower face AUs. From this comparison, we see that our recognition performance from facial feature measurements is comparable to that of holistic analysis and Gabor wavelet representations for AU recognition.
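For readers unfamiliar with that protocol, leave-one-out evaluation of the kind used in [1, 3] can be sketched as follows; the data are random stand-ins and scikit-learn is only one possible implementation, not the tooling used by those authors.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 15))      # stand-in feature parameters (80 sequences)
y = rng.integers(0, 7, size=80)    # stand-in single-AU labels

clf = MLPClassifier(hidden_layer_sizes=(6,), max_iter=2000, random_state=0)
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())  # one held-out sample per fold
print("mean leave-one-out accuracy:", scores.mean())
```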
Figure 6. Neural networks for upper face AU and AU-combination recognition. (a) Modeling the 7 single AUs only. (b) Separately modeling the non-additive AU combinations.

AU Combination Recognition When Modeling 7 Single AUs Only: In this experiment, the network outputs are the 7 individual upper face AUs (Figure 6 (a)), and an AU combination is reported as the set of AUs whose outputs are high. For example, if the outputs were AU1=0.85 and AU2=0.89 for a human-labeled AU2, the result was treated as AU1+AU2; AU2 is recognized, but AU1 is counted as a false alarm. The recognition results are shown in Table 6, and an average recognition rate of 95% is achieved.

Table 6. Upper face AU recognition with AU combinations when modeling 7 single AUs only. The rows correspond to NN outputs, and the columns correspond to human labels.

AU Combination Recognition When Modeling Non-additive Combinations: In order to study the effects of the non-additive combinations, we separately model the non-additive AU combinations in the network. The 11 outputs consist of the 7 individual upper face AUs and 4 non-additive AU combinations (Figure 6 (b)). The recognition results are shown in Table 7. An average recognition rate of 93.7% is achieved, with a slightly lower false alarm rate of 4.5%. In this case, separately modeling the non-additive combinations does not improve the recognition rate.

Table 7. Upper face AU recognition with AU combinations when separately modeling non-additive AU combinations.

To test the generalizability of our system, independent databases recorded under different conditions and in different laboratories were used. The network was trained on the Ekman&Hager database, and 72 image sequences from the CMU-Pittsburgh database were used for testing. The recognition results are shown in Table 8; a 93.2% recognition rate is achieved. From these results, we see that our system is robust and achieves a high recognition rate on the new database.

Table 8. Upper face AU recognition results for tests on the CMU-Pittsburgh database when the network is trained on the Ekman&Hager database.
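The scoring convention illustrated earlier in this section (an above-threshold output counts as a recognized AU, and an extra recognized AU counts as a false alarm) can be written down in a few lines; the 0.5 threshold and the helper names below are assumptions, not values from the paper.

```python
def decode_aus(outputs, threshold=0.5):
    """Turn per-AU network outputs into a recognized AU set.
    The 0.5 threshold is an assumption; the paper does not state its value."""
    return {au for au, score in outputs.items() if score > threshold}

def score_sequence(outputs, true_aus, threshold=0.5):
    """Compare the recognized set with the human-labeled set.
    Missing AUs count as misses; extra AUs count as false alarms."""
    recognized = decode_aus(outputs, threshold)
    hits = recognized & true_aus
    false_alarms = recognized - true_aus
    misses = true_aus - recognized
    return hits, false_alarms, misses

# The example from the text: outputs AU1=0.85 and AU2=0.89 for a human-labeled
# AU2 are treated as AU1+AU2 -- AU2 is recognized, AU1 is a false alarm.
print(score_sequence({"AU1": 0.85, "AU2": 0.89, "AU4": 0.1}, {"AU2"}))
```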
7. Conclusion
In this paper, we developed a multi-state feature-based facial expression recognition system to recognize both individual AUs and AU combinations. All the facial features were represented by a group of feature parameters. The network was able to learn the correlations between facial feature parameter patterns and specific action units, and it has high sensitivity and specificity for subtle differences in facial expressions. Our system was tested on image sequences from a large number of subjects, including people of African and Asian descent in addition to European, providing a sufficient test of how well the initial training analyses generalized to new image sequences and new databases. From the experimental results, we have the following observations:
1. The recognition performance from facial feature measurements is comparable to that of holistic analysis and Gabor wavelet representations for AU recognition.
2. 5 to 7 hidden units are sufficient to code the 7 individual upper face AUs; 10 to 16 hidden units are needed when AUs may occur either singly or in complex combinations.
3. For upper face AU recognition, separately modeling non-additive AU combinations affords no increase in recognition accuracy.
4. Given sufficient training data, our system is robust in recognizing AUs and AU combinations for new faces and new databases.
Unlike a previous method [8], which built a separate model for each AU and AU combination, we developed a single model that recognizes AUs whether they occur singly or in combinations. This is an important capability, since the number of possible AU combinations is too large (over 7,000) for each combination to be modeled separately. An average recognition rate of 95% was achieved for 7 upper face AUs and more than 10 AU combinations. Our system was robust across independent databases recorded under different conditions and in different laboratories.
Acknowledgements
The Ekman&Hager database was provided by Paul Ekman at the Human Interaction Laboratory, University of California, San Francisco. The authors would like to thank Zara Ambadar, Bethany Peters, and Michelle Lemenager for processing the images. This work is supported by NIMH grant R01 MH51435.
References

[1] M. Bartlett, J. Hager, P. Ekman, and T. Sejnowski. Measuring facial expressions by computer image analysis. Psychophysiology, 36:253-264, 1999.
[2] J. M. Carroll and J. Russell. Facial expression in Hollywood's portrayal of emotion. Journal of Personality and Social Psychology, 72:164-176, 1997.
[3] G. Donato, M. S. Bartlett, J. C. Hager, P. Ekman, and T. J. Sejnowski. Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(10):974-989, October 1999.
[4] P. Ekman and W. V. Friesen. The Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, San Francisco, CA, 1978.
[5] I. A. Essa and A. P. Pentland. Coding, analysis, interpretation, and recognition of facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):757-763, July 1997.
[6] K. Fukui and O. Yamaguchi. Facial feature point extraction method based on combination of shape extraction and pattern matching. Systems and Computers in Japan, 29(6):49-58, 1998.
[7] T. Kanade, J. Cohn, and Y. Tian. Comprehensive database for facial expression analysis. In Proceedings of the International Conference on Face and Gesture Recognition, March 2000.
[8] J.-J. J. Lien, T. Kanade, J. F. Cohn, and C. C. Li. Detection, tracking, and classification of action units in facial expression. Journal of Robotics and Autonomous Systems, in press.
[9] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In The 7th International Joint Conference on Artificial Intelligence, pages 674-679, 1981.
[10] K. Mase. Recognition of facial expression from optical flow. IEICE Transactions, E74(10):3474-3483, October 1991.
[11] K. Scherer and P. Ekman. Handbook of Methods in Nonverbal Behavior Research. Cambridge University Press, Cambridge, UK, 1982.
[12] D. Terzopoulos and K. Waters. Analysis of facial images using physical and anatomical models. In IEEE International Conference on Computer Vision, pages 727-732, 1990.
[13] Y. Tian, T. Kanade, and J. Cohn. Dual-state parametric eye tracking. In Proceedings of the International Conference on Face and Gesture Recognition, March 2000.
[14] Y. Tian, T. Kanade, and J. Cohn. Recognizing lower face actions for facial expression analysis. In Proceedings of the International Conference on Face and Gesture Recognition, March 2000.
[15] Y. Yacoob and L. S. Davis. Recognizing human facial expression from long image sequences using optical flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):636-642, June 1996.
[16] A. Yuille, P. Hallinan, and D. S. Cohen. Feature extraction from faces using deformable templates. International Journal of Computer Vision, 8(2):99-111, 1992.