
2021 14th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) | DOI: 10.1109/CISP-BMEI53629.2021.9624360 | ©2021 IEEE

A Novel Speech-Driven Lip-Sync Model with CNN and LSTM
Xiaohong Li, Xiang Wang, Kai Wang and Shiguo Lian
AI Innovation and Application Center,
China Unicom, Beijing, China

Abstract—Generating lip movement that is synchronized and natural with speech is one of the most important tasks in creating realistic virtual characters. In this paper, we present a combined deep neural network of one-dimensional convolutions and LSTM to generate the vertex displacement of a 3D template face model from variable-length speech input. The motion of the lower part of the face, which is represented by the vertex movement of 3D lip shapes, is consistent with the input speech. In order to enhance the robustness of the network to different sound signals, we adapt a trained speech recognition model to extract speech features, and a velocity loss term is adopted to reduce the jitter of the generated facial animation. We recorded a series of videos of a Chinese adult speaking Mandarin and created a new speech-animation dataset to compensate for the lack of such public data. Qualitative and quantitative evaluations indicate that our model is able to generate smooth and natural lip movements synchronized with speech.

Index Terms—speech-driven lip-sync; facial animation synthesis; convolutional neural network; LSTM

I. INTRODUCTION

Creating lifelike and emotional digital humans has wide applications in many fields, such as film and games, facial repair and therapy, and child education. Nowadays, digital humans are often seen serving as newscasters or narrators. Facial movement, especially the lip movement when speaking, is one of the most important components for a digital human to express itself. The lip shape must match the pronunciation; otherwise the audience will feel uncomfortable or perceive the character as fake, leading to the uncanny valley effect. In recent years, a lot of research on speech-driven lip-sync has been carried out. These works can be divided into two categories, 2D and 3D, depending on whether the output is a 2D video or a 3D animation. In this work, we focus on 3D facial animation generation, which is widely used in video games and films.

Although various works on 3D facial animation have been proposed in recent years [13], [9], [4], few works have addressed the issue of animating a 3D model speaking Chinese, due to the lack of open datasets and adapted algorithms. To compensate for the lack of training data, we used a web camera to record several hours of video of a male Chinese speaker speaking Mandarin at a rate of 60 frames per second (fps). The recorded data includes images and the synchronized voice. Afterwards, we created a 3D mesh model of a human head, which was used to transfer the facial motion captured from the human performer to a digital face with the help of a facial tracking software and a retargeting software from Faceware Technologies [7].

To effectively make use of the collected data, we designed a deep neural network which takes speech features as input and outputs the 3D vertex displacement. The speech features are extracted with a pre-trained speech recognition model, and the vertex displacement is used to drive 3D mesh models to generate accurate facial movements synchronized with the input speech. The proposed network combines one-dimensional convolutions and LSTM (Long Short-Term Memory) and is able to generate realistic, smooth and natural facial animations. Note that although we only collected the voice and motion of a male character as training data, the facial animation generated by our model can be applied to 3D virtual characters of different styles and genders. In addition, our model is robust to the speech of different people with different voices.

The remainder of the paper is organized as follows: Section II reviews related work on speech-driven facial animation methods. The details of the data collection and the proposed network are introduced in Section III. Section IV shows the experimental results, and the conclusion is given in Section V.

II. RELATED WORK

Speech-driven facial animation has received intensive attention in the past decades. Existing methods can be classified into viseme-driven and data-driven approaches. The viseme-driven approaches adopt a two-phase strategy: speech recognition algorithms are first used to segment speech into phonemes, which are then mapped to visual units, a.k.a. visemes. In contrast, the data-driven approaches learn, from a large amount of data, a model which directly maps speech or text to facial animations.

Mattheyses et al. declared that an accurate mapping from phonemes to visemes should be many-to-many due to the coarticulation effect [10]. They introduced a many-to-many phoneme-to-viseme mapping scheme using tree-based and k-means clustering approaches, and achieved better visual results. Edwards et al. depicted the many-valued phoneme-to-viseme mapping using two visually distinct anatomical actions of the jaw and lip, and proposed the JALI viseme model [6]. They first computed a sequence of phonemes from the speech transcript and audio, and then extracted the jaw and lip motions for individual phonemes as viseme action units and blended

the corresponding visemes into coarticulated action units to produce animation curves. Finally, a viseme-compatible rig was driven by the computed viseme values. Zhou et al. applied the JALI model and further introduced a three-stage LSTM network to predict compact animator-centric viseme curves [16].

As it is difficult to accurately identify phonemes, and the coarticulation effect is difficult to simulate with simple functions, more and more researchers have preferred data-driven methods. Shimba et al. trained a regression model from speech to talking head with an LSTM network and used lower-level audio features instead of phonemes as input [13]. The output of their network was Active Appearance Model (AAM) parameters [3]. Suwajanakorn et al. also used a recurrent neural network to learn the mapping from raw audio features to mouth shapes, based on many hours of high-quality videos of the target person [14]. They synthesized mouth textures and composed them with proper images to generate a 2D video that matched the input speech. Unlike the above 2D methods, Cudeiro et al. proposed an audio-driven 3D facial animation method [4]. They trained a neural network on their 4D scan dataset captured from 12 speakers, and used a subject label to control the generated speaking style. In addition, they integrated the speech feature extraction approach DeepSpeech [8] into their model to improve the robustness to different audio sources. However, their trained model is only applicable to targets of the FLAME model [9], which is a statistical head model. Meanwhile, their model performs better on English than on other languages because their training data only includes English speech.

III. METHOD

When dealing with temporal and sequential tasks, such as speech recognition, machine translation and context-dependent text processing, Recurrent Neural Networks (RNNs) are often used, considering their advantage over traditional feed-forward neural networks, which cannot exhibit temporal dynamic behavior. RNNs are a class of neural networks that allow the previous output to be used as input to the recurrent layer. However, it is difficult to solve problems that require learning long-term temporal dependencies using standard RNNs due to the vanishing gradient phenomenon. Therefore, a memory cell that is able to maintain information over long periods was introduced, known as the Long Short-Term Memory (LSTM) unit. In this work, we use LSTM as the backbone network to map the speech features extracted with DeepSpeech to vertex offsets. Specifically, we use unidirectional LSTM to model this mapping and make further improvements on this basis.

A common LSTM unit contains a cell, an input gate, an output gate and a forget gate. The cell remembers information from previous intervals and the three gates control the memorizing process. Let x_t denote the input of an LSTM unit at time t. The activation of this unit is updated as follows:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)        (1)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)        (2)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)        (3)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)     (4)
C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t            (5)
h_t = o_t ∗ tanh(C_t),                      (6)

where f, i, o, C and h represent the forget gate, the input gate, the output gate, the cell state and the cell output respectively, σ denotes the sigmoid activation function, and W and b are the corresponding weights and biases.
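For illustration, the following is a minimal NumPy sketch of a single LSTM step following Eqs. (1)-(6); the dictionary-based weight layout is an assumption made for readability, not the layout of any particular framework.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM update following Eqs. (1)-(6).

    W and b are dicts holding the weights/biases of the four gates;
    each W[k] has shape (hidden_dim, hidden_dim + input_dim).
    """
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W['f'] @ z + b['f'])       # forget gate, Eq. (1)
    i_t = sigmoid(W['i'] @ z + b['i'])       # input gate, Eq. (2)
    o_t = sigmoid(W['o'] @ z + b['o'])       # output gate, Eq. (3)
    C_tilde = np.tanh(W['C'] @ z + b['C'])   # candidate cell state, Eq. (4)
    C_t = f_t * C_prev + i_t * C_tilde       # new cell state, Eq. (5)
    h_t = o_t * np.tanh(C_t)                 # cell output, Eq. (6)
    return h_t, C_t
```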
Next, we will introduce our dataset generation method, the process of speech feature extraction, our network framework and the loss function in detail.

A. Dataset Generation

To the best of our knowledge, there has not been any public Chinese dataset that can be used for training speech-driven 3D facial animation, so we produced our own dataset. We used the THCHS-30 corpus [15], an open Chinese speech database published by the Center for Speech and Language Technology (CSLT) at Tsinghua University, as our speech script. It contains 1000 sentences and covers most of the phonemes in Chinese. A male speaker was invited to read these sentences at normal speed in front of a camera. The camera contains a microphone and is able to record synchronized video and sound: it captured the image data at a rate of 60 frames per second and recorded the surrounding sound signals at the same time. In total, we collected four hours of video data.

Next, we utilized two tools from the Faceware software kit [7] to convert the collected videos into 3D animations. One is the face tracking tool Analyzer, which tracks facial landmarks from video in a markerless way (see Figure 1). The other is Retargeter, a high-quality facial animation solving tool that retargets facial motion from a tracked video onto a 3D character. It needs a prepared 3D model with proper rigs and generates facial animations by controlling the rig movement. We used these two tools to transfer the facial animations from the captured video to a 3D character, and obtained the 3D face data by recording the vertex positions of the head of the animated 3D character in every frame. Figure 2 shows the retargeting result from video frames to a virtual 3D character. The prepared virtual 3D character is regarded as a template, which is a mesh model in the "zero pose" (see Figure 3). All the animated 3D faces have the same topology as the template.

We split the generated 3D facial animation data and the accompanying audio data into a training set (900 sentences), a validation set (50 sentences) and a test set (50 sentences). The ground truth of the vertex displacement is calculated by subtracting the template from the generated 3D faces.
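This data preparation step can be summarized by the hedged sketch below; the file name 'template_vertices.npy' and the per-sentence vertex arrays are hypothetical stand-ins for however the retargeted meshes are actually exported and stored.

```python
import numpy as np

# Hypothetical inputs: template vertices (5713, 3) and, per sentence,
# an array of animated vertices with shape (num_frames, 5713, 3).
template = np.load('template_vertices.npy')   # mesh in the "zero pose"

def displacement_ground_truth(animated_vertices):
    """Per-frame vertex displacement: animated mesh minus the template."""
    return animated_vertices - template[None, :, :]

# 900 / 50 / 50 split over the 1000 THCHS-30 sentences.
sentence_ids = np.arange(1000)
train_ids = sentence_ids[:900]
val_ids = sentence_ids[900:950]
test_ids = sentence_ids[950:]
```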

Figure 1. Facial landmarks tracked with Faceware Analyzer.

Figure 2. Facial motions transferred from video to 3D character with Faceware Retargeter.

Figure 3. 3D virtual character used to generate 3D facial animation data.

B. Speech Feature Extraction

In order to improve the robustness of our model to input speech signals, we use a pre-trained speech recognition model to extract speech features. Like VOCA [4], we adapt the DeepSpeech model [8], which is trained on hundreds of hours of voice data and generalizes to different speakers, speaking speeds, environmental noise, etc. We used the pre-trained model provided at the DeepSpeech GitHub releases page, which used Mel Frequency Cepstral Coefficients (MFCCs) [5] to extract the input audio features and replaced the fourth recurrent layer with an LSTM unit. The output of the model is a sequence of character probabilities, which is used as the input of our network.

Given an audio clip with a length of T seconds, we first resample the audio to a fixed 16 kHz and calculate its MFCCs. After normalization, we feed them to the pre-trained DeepSpeech model to extract audio features, and resample the features to 60 fps, which is consistent with the frame rate of the videos in our dataset. Finally, the output is a two-dimensional array of size 60T × D, where D is the number of letters in the alphabet plus a blank label.
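A hedged outline of this feature extraction pipeline is given below; `run_deepspeech` is a placeholder for whatever interface exposes the pre-trained DeepSpeech character probabilities (it is not an actual DeepSpeech API call), and linear interpolation is assumed for the resampling to 60 fps.

```python
import numpy as np
import librosa

def extract_speech_features(wav_path, run_deepspeech, fps=60):
    """Audio file -> per-frame DeepSpeech character probabilities at `fps`.

    `run_deepspeech(audio_16k)` is assumed to return an array of shape
    (num_windows, D), where D = number of letters in the alphabet + blank.
    """
    audio, sr = librosa.load(wav_path, sr=16000)   # resample to 16 kHz
    duration = len(audio) / float(sr)              # T seconds
    probs = run_deepspeech(audio)                  # (num_windows, D)

    # Linearly interpolate the feature sequence to 60T output frames.
    n_out = int(round(duration * fps))
    src_t = np.linspace(0.0, 1.0, num=probs.shape[0])
    dst_t = np.linspace(0.0, 1.0, num=n_out)
    features = np.stack(
        [np.interp(dst_t, src_t, probs[:, d]) for d in range(probs.shape[1])],
        axis=1)                                    # (60T, D)
    return features
```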
C. Network Architecture

The architecture of our network is inspired by WaveNet [11], proposed by van den Oord et al., which is a deep generative model of raw audio waveforms. It is able to generate speech that mimics many different speakers and sounds more natural than many existing text-to-speech systems. It realizes time-sequence generation with only a stack of convolutional layers, which is faster to train than a recurrent network. However, its receptive field is limited. To overcome this limitation, we combine convolutional layers with LSTM blocks to construct our network structure, as shown in Figure 4.

The network is implemented using the TensorFlow framework [1]. The first part of the network is composed of two one-dimensional convolutional layers, four unidirectional LSTM blocks and two fully connected layers, which together transform the speech features extracted by DeepSpeech into a low-dimensional embedding. The latter part of the network is a decoder consisting of a fully connected layer with linear activation. The decoder maps the embedding into the high-dimensional space of 3D vertex displacements, whose dimension is 5713×3 in our case, as the 3D facial model we used for the training set has 5713 vertices and the position of each vertex is represented by Cartesian coordinates in 3D space. Table I shows the specific parameters of our network.

Figure 4. Network architecture

TABLE I. The proposed network architecture

Type              Kernel   Stride   Output    Activation   Num
DeepSpeech        -        -        29        -            -
Convolution       5×1      1×1      32        ReLU         2
LSTM              -        -        128       -            2
LSTM              -        -        64        -            2
Fully connected   -        -        128       tanh         -
Fully connected   -        -        50        linear       -
Fully connected   -        -        5713×3    linear       -

During inference, we take variable-length audio clips as input and output vertex displacement at 60 fps. Then the computed displacement is added to the position of each vertex of a 3D face template to drive the face deformation. The topology of the face template here needs to be the same as that of the one we used for training data generation.
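Based on the parameters listed in Table I, a minimal tf.keras sketch of such a network could look as follows; the padding scheme and other unlisted hyperparameters are assumptions, so this is an illustration rather than the released implementation.

```python
import tensorflow as tf

NUM_VERTICES = 5713  # vertices of the 3D face model used for training

def build_model(feature_dim=29):
    """Conv1D + LSTM network mapping DeepSpeech features to vertex displacements.

    Input:  (batch, time, feature_dim) per-frame speech features at 60 fps.
    Output: (batch, time, NUM_VERTICES * 3) per-frame vertex displacements.
    """
    inputs = tf.keras.Input(shape=(None, feature_dim))
    x = inputs
    for _ in range(2):                                  # two 1D convolutions
        x = tf.keras.layers.Conv1D(32, kernel_size=5, strides=1,
                                   padding='same', activation='relu')(x)
    for units in (128, 128, 64, 64):                    # four unidirectional LSTMs
        x = tf.keras.layers.LSTM(units, return_sequences=True)(x)
    x = tf.keras.layers.Dense(128, activation='tanh')(x)
    x = tf.keras.layers.Dense(50, activation=None)(x)   # low-dimensional embedding
    outputs = tf.keras.layers.Dense(NUM_VERTICES * 3,
                                    activation=None)(x)  # linear decoder
    return tf.keras.Model(inputs, outputs)
```

At inference time, the predicted frames would be reshaped to (T, 5713, 3) and added to the template vertices, as described above.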
D. Loss Function

Suppose the vertex displacement sequence output by the network during training is {ỹ_t}, t = 1..T, and the ground truth of the vertex displacement is {y_t}, t = 1..T, where T is the number of video frames. We train the model by minimizing the following loss function:

Loss = ω_1 ∗ L_p + ω_2 ∗ L_v,        (7)

where L_p is the reconstruction loss, L_v is the velocity loss, and ω_1 and ω_2 are the corresponding weight coefficients. L_p is defined in Eq. 8, in which ‖·‖_F denotes the Frobenius norm. It calculates the Euclidean distance of the vertices between the predicted output and the real facial animation, and is thus used to constrain the gap between the predicted vertex coordinates and the ground truth:

L_p = ‖ y_t − ỹ_t ‖²_F.        (8)

The velocity loss L_v is defined in Eq. 9. It uses backward finite differences of the mesh vertices of adjacent frames to estimate the deformation speed of the face vertices, and calculates the difference between the predicted value and the ground truth. It has a smoothing effect: when only the reconstruction loss is used, the mouth movement is obvious, but lip jitter cannot be avoided. Through qualitative and quantitative analysis, we found that the velocity loss reduces lip shaking and improves the model accuracy significantly.

L_v = ‖ (y_t − y_{t-1}) − (ỹ_t − ỹ_{t-1}) ‖²_F.        (9)

The weight coefficients for L_p and L_v have to be set carefully, otherwise the mouth shape will be inaccurate. We set the weights of the reconstruction loss and velocity loss to 1.0 and 0.5 respectively. More details can be found in the comparative experiments in Section IV-B.
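A minimal TensorFlow sketch of this combined loss, using the weights 1.0 and 0.5 mentioned above, is shown below; the reduction over frames and batch elements is an assumption.

```python
import tensorflow as tf

def lip_sync_loss(y_true, y_pred, w_pos=1.0, w_vel=0.5):
    """Reconstruction + velocity loss, Eqs. (7)-(9).

    y_true, y_pred: (batch, time, num_vertices * 3) vertex displacements.
    """
    # L_p: squared norm of the per-frame displacement error (Eq. 8).
    l_pos = tf.reduce_sum(tf.square(y_true - y_pred), axis=-1)

    # L_v: error of backward finite differences between adjacent frames (Eq. 9).
    vel_true = y_true[:, 1:, :] - y_true[:, :-1, :]
    vel_pred = y_pred[:, 1:, :] - y_pred[:, :-1, :]
    l_vel = tf.reduce_sum(tf.square(vel_true - vel_pred), axis=-1)

    # Eq. (7): weighted sum, averaged over frames and the batch.
    return w_pos * tf.reduce_mean(l_pos) + w_vel * tf.reduce_mean(l_vel)
```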
IV. EXPERIMENTS

We compared our network with other methods and conducted ablation experiments for the velocity loss term. Both qualitative and quantitative evaluations and analysis are presented in this section.

A. Qualitative Evaluation

Although we only used the voice of a male speaker for training, our method generalizes well to female and even synthesized voices. The test results for different speech sources can be found in the supplementary video¹. We compared our network with LSTM and VOCA, whose results are also shown in the supplementary video. We used the pre-trained VOCA model, which is claimed to work for any language, and used the same test audio spoken in Mandarin as input.

Figure 5 shows some samples. The first column on the left shows the video frames of the character, and the second column is the ground truth of the 3D face reconstructed using Faceware. Facial animations generated by LSTM, VOCA and our model are shown in the third, fourth and last columns respectively. The test speech of the first three rows comes from the test set and is the same male voice as the training data. The test speech of the last two rows is a female voice, for which there is no 3D ground truth. It can be clearly seen that our model shows the best performance and generates more accurate mouth shapes than the other two methods.

Figure 5. Samples generated by different models. From left to right, the columns show the face appearance of the character, the ground truth of the 3D face generated by Faceware, and the results of LSTM, VOCA and our model respectively.

¹ https://www.dropbox.com/s/71oayo97aywd3l9/A%20Novel%20Speech-Driven%20Lip-Sync%20Model%20with%20CNN%20and%20LSTM.mp4?dl=0

B. Quantitative Evaluation

Two metrics are used to measure the accuracy of the generated facial animation and lip movements for different models: the positional error and the velocity error of the 2D facial landmarks.

Two network structures with and without velocity loss are compared, and the results are shown in Table II.

TABLE II. Error metrics of four models on the test data

Metric                                                   LSTM    LSTM+v loss   Conv+LSTM   Conv+LSTM+v loss
Position error of the 2D facial landmarks (unit: pixel)  2.449   2.401         2.275       2.296
Position error of the 2D mouth landmarks (unit: pixel)   4.920   4.812         4.514       4.592
Velocity error of the 2D facial landmarks                4.022   3.593         3.719       3.397
Velocity error of the 2D mouth landmarks                 5.704   4.537         4.881       4.204
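For illustration, the sketch below shows one way such positional and velocity errors could be computed from aligned 2D landmark sequences detected on renderings of the predicted and ground-truth animations; the exact averaging protocol behind Table II may differ.

```python
import numpy as np

def landmark_errors(pred, gt):
    """Positional and velocity errors of 2D landmarks.

    pred, gt: arrays of shape (num_frames, num_landmarks, 2), in pixels.
    Returns the mean per-landmark Euclidean position error and velocity error.
    """
    pos_err = np.linalg.norm(pred - gt, axis=-1).mean()

    # Backward finite differences of the landmark trajectories.
    vel_pred = pred[1:] - pred[:-1]
    vel_gt = gt[1:] - gt[:-1]
    vel_err = np.linalg.norm(vel_pred - vel_gt, axis=-1).mean()
    return pos_err, vel_err

# Mouth-only errors can be obtained by slicing the mouth landmark indices,
# e.g. landmark_errors(pred[:, mouth_idx], gt[:, mouth_idx]).
```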
It can be seen from the quantitative results that the combination of convolution layers with LSTM achieves higher accuracy than the application of LSTM alone, both on the landmark position error and on the motion error. The comparison results for the velocity loss function are a bit more complicated. For the LSTM network, the introduction of the velocity loss function has shown positive effects on both the positional error and the motion error. For the network with the combination of convolution layers and LSTM blocks, the velocity loss function reduces the motion error of the vertices, but it does not decrease the positional error of the landmarks. In fact, the positional error of the landmarks only represents the difference between the facial shape output by the network and the ground truth, rather than the realism and naturalness of the facial animation. For example, the amplitude of the mouth movement differs when different people speak; the mapping between speech and face shape is not unique. Therefore, it is not appropriate to judge the contribution of the velocity loss function to the fidelity of the facial animations from the positional error of the landmarks. On the other hand, as the velocity error of the landmarks is able to reflect the error of facial motion, we believe that the introduction of the velocity loss function is beneficial to the network we proposed.

We also conducted an experiment to verify whether the added velocity loss is able to reduce lip jitter and make the animation transition between words smoother. Similar to [12], we render the 3D facial animations into images and detect 2D facial landmarks with the facial behavior analysis toolkit OpenFace [2] (see Figure 6). We extract the position of a feature point located in the middle of the upper lip and draw the movement curve of this feature point in the y direction. Figure 7 and Figure 8 show the movement curves generated by models trained with and without the velocity loss term respectively. The base model in Figure 7 uses our network architecture, which combines convolutional layers with LSTM blocks, while that in Figure 8 uses the LSTM network. It can be seen from the two figures that the introduced velocity loss term significantly reduces lip jitter and smooths the lip movement for both networks. It is also noted that the velocity loss reduces the motion range of the mouth. Therefore, the trade-off must be considered carefully when selecting the weight of the velocity loss term.

Figure 6. Rendered facial animations and the detected 2D facial landmarks.

Figure 7. Movement curves of the selected lip feature point generated by our models trained with and without the velocity
loss term.

Figure 8. Movement curves of the selected lip feature point generated by LSTM models trained with and without the velocity
loss term.

V. CONCLUSION

In this work, we created a new speech-animation dataset which contains the video and animation data of an adult speaking Mandarin, and proposed a novel network for the generation of facial animation from speech. We conducted both quantitative and qualitative evaluations to show that our network is able to generate more accurate and smooth facial animation and is more robust to various audio sources. It can be further improved by incorporating more movement data of the upper part of the face, such as the eyebrows and eyes, to make the facial animation of digital humans more natural and realistic.

REFERENCES

[1] Abadi, M., Agarwal, A., Barham, P., et al., TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] Baltrusaitis, T., Zadeh, A., Lim, Y. C. and Morency, L. P., OpenFace 2.0: Facial behavior analysis toolkit. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 59-66. IEEE, 2018.
[3] Cootes, T.F., Edwards, G.J. and Taylor, C.J., Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6), pp. 681-685, 2001.
[4] Cudeiro, D., Bolkart, T., Laidlaw, C., Ranjan, A. and Black, M.J., Capture, learning, and synthesis of 3D speaking styles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10101-10111, 2019.
[5] Davis, S. and Mermelstein, P., Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), pp. 357-366, 1980.
[6] Edwards, P., Landreth, C., Fiume, E. and Singh, K., JALI: an animator-centric viseme model for expressive lip synchronization. ACM Transactions on Graphics (TOG), 35(4), pp. 1-11, 2016.
[7] Faceware, http://support.facewaretech.com/home.
[8] Hannun, A., Case, C., Casper, J., et al., Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
[9] Li, T., Bolkart, T., Black, M.J., Li, H. and Romero, J., Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics (TOG), 36(6), pp. 194:1-194:17, 2017.
[10] Mattheyses, W., Latacz, L. and Verhelst, W., Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis. Speech Communication, 55(7-8), pp. 857-876, 2013.
[11] Oord, A. van den, Dieleman, S., Zen, H., et al., WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[12] Richard, A., Lea, C., Ma, S., Gall, J., de la Torre, F. and Sheikh, Y., Audio- and gaze-driven facial animation of codec avatars. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 41-50, 2021.
[13] Shimba, T., Sakurai, R., Yamazoe, H. and Lee, J.H., Talking heads synthesis from audio with deep neural networks. In 2015 IEEE/SICE International Symposium on System Integration (SII), pp. 100-105. IEEE, 2015.
[14] Suwajanakorn, S., Seitz, S.M. and Kemelmacher-Shlizerman, I., Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4), pp. 1-13, 2017.
[15] Wang, D. and Zhang, X., THCHS-30: A free Chinese speech corpus. arXiv preprint arXiv:1512.01882, 2015.
[16] Zhou, Y., Xu, Z., Landreth, C., Kalogerakis, E., Maji, S. and Singh, K., VisemeNet: Audio-driven animator-centric speech animation. ACM Transactions on Graphics (TOG), 37(4), pp. 1-10, 2018.

