
2021 14th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) | DOI: 10.1109/CISP-BMEI53629.2021.9624360 | ©2021 IEEE

A Novel Speech-Driven Lip-Sync Model with CNN and LSTM
Xiaohong Li, Xiang Wang, Kai Wang and Shiguo Lian
AI Innovation and Application Center,
China Unicom, Beijing, China

Abstract—Generating lip movement that is synchronized and natural with speech is one of the most important tasks in creating realistic virtual characters. In this paper, we present a combined deep neural network of one-dimensional convolutions and LSTM to generate the vertex displacement of a 3D template face model from variable-length speech input. The motion of the lower part of the face, which is represented by the vertex movement of 3D lip shapes, is consistent with the input speech. In order to enhance the robustness of the network to different sound signals, we adapt a trained speech recognition model to extract speech features, and a velocity loss term is adopted to reduce the jitter of the generated facial animation. We recorded a series of videos of a Chinese adult speaking Mandarin and created a new speech-animation dataset to compensate for the lack of such public data. Qualitative and quantitative evaluations indicate that our model is able to generate smooth and natural lip movements synchronized with speech.

Index Terms—speech-driven lip-sync; facial animation synthesis; convolutional neural network; LSTM

I. INTRODUCTION

Creating lifelike and emotional digital humans has wide applications in many fields, such as film and games, facial repair and therapy, and child education. Nowadays, digital humans are often seen serving as newscasters or narrators. Facial movement, especially the lip movement when speaking, is one of the most important components for a digital human to express itself. The lip shape must match the pronunciation; otherwise the audience will feel uncomfortable or perceive the character as fake, leading to the uncanny valley effect. In recent years, a lot of research on speech-driven lip-sync has been carried out. These works can be divided into two categories, 2D and 3D, depending on whether the output is a 2D video or a 3D animation. In this work, we focus on 3D facial animation generation, which is widely used in video games and films.

Although various works on 3D facial animation have been proposed in recent years [13], [9], [4], few works have addressed the issue of animating a 3D model speaking Chinese, due to the lack of open datasets and adapted algorithms. To compensate for the lack of training data, we used a web camera to record several hours of video of a male Chinese speaker speaking Mandarin at a rate of 60 frames per second (fps). The recorded data includes images and the synchronized voice. Afterwards, we created a 3D mesh model of a human head, which was used to transfer the facial motion captured from the human performer to a digital face with the help of a facial tracking software and a retargeting software from Faceware Technologies [7].

To effectively make use of the collected data, we designed a deep neural network which takes speech features as input and outputs the 3D vertex displacement. The speech features are extracted with a pre-trained speech recognition model, and the vertex displacement is used to drive 3D mesh models to generate accurate facial movements synchronized with the input speech. The proposed network combines one-dimensional convolutions and LSTM (Long Short-Term Memory) and is able to generate realistic, smooth and natural facial animations. Note that although we only collected the voice and motion of a male character as training data, the facial animation generated by our model can be applied to 3D virtual characters of different styles and genders. In addition, our model is robust to the speech of different people with different voices.

The remainder of the paper is organized as follows: Section II reviews related work on speech-driven facial animation methods. The details of the data collection and the proposed network are introduced in Section III. Section IV shows the experimental results, and the conclusion is given in Section V.

II. RELATED WORK

Speech-driven facial animation has received intensive attention in the past decades. Existing methods can be classified into viseme-driven and data-driven approaches. The viseme-driven approaches adopt a two-phase strategy: speech recognition algorithms are first used to segment speech into phonemes, which are then mapped to visual units, a.k.a. visemes. In contrast, the data-driven approaches learn, from a large amount of data, a model which directly maps speech or text to facial animations.

Mattheyses et al. declared that an accurate mapping from phonemes to visemes should be many-to-many due to the coarticulation effect [10]. They introduced a many-to-many phoneme-to-viseme mapping scheme using tree-based and k-means clustering approaches, and achieved better visual results. Edwards et al. depicted the many-valued phoneme-to-viseme mapping using two visually distinct anatomical actions of the jaw and lip, and proposed the JALI viseme model [6]. They first computed a sequence of phonemes from the speech transcript and audio, and then extracted the jaw and lip motions for individual phonemes as viseme action units and blended

the corresponding visemes into coarticulated action units to produce animation curves. Finally, a viseme-compatible rig was driven by the computed viseme values. Zhou et al. applied the JALI model and further introduced a three-stage LSTM network to predict compact animator-centric viseme curves [16].

As it is difficult to accurately identify phonemes, and the coarticulation effect is difficult to simulate with simple functions, more and more researchers have preferred data-driven methods. Shimba et al. trained a regression model from speech to talking head with an LSTM network and used lower-level audio features instead of phonemes as input [13]. The output of their network was Active Appearance Model (AAM) parameters [3]. Suwajanakorn et al. also used a recurrent neural network to learn the mapping from raw audio features to mouth shapes, based on many hours of high-quality videos of the target person [14]. They synthesized mouth textures and composed them with proper images to generate a 2D video that matched the input speech. Unlike the above 2D methods, Cudeiro et al. proposed an audio-driven 3D facial animation method [4]. They trained a neural network on their 4D scan dataset captured from 12 speakers, and used a subject label to control the generated speaking style. In addition, they integrated the speech feature extraction approach DeepSpeech [8] into their model to improve the robustness to different audio sources. However, their trained model is only applicable to targets of the FLAME model [9], which is a statistical head model. Meanwhile, their model performs better on English than on other languages because their training data only includes English speech.

III. METHOD

When dealing with temporal and sequential tasks, such as speech recognition, machine translation and context-dependent text processing, Recurrent Neural Networks (RNNs) are often used, considering their advantage over traditional feed-forward neural networks, which cannot exhibit temporal dynamic behavior. RNNs are a class of neural networks that allow the previous output to be used as input to the recurrent layer. However, it is difficult to solve problems that require learning long-term temporal dependencies using standard RNNs due to the vanishing gradient phenomenon. Therefore, a memory cell that is able to maintain information over long periods was introduced, known as the Long Short-Term Memory (LSTM) unit. In this work, we use LSTM as the backbone network to map the speech features extracted with DeepSpeech to vertex offsets. Specifically, we use unidirectional LSTM to model this mapping and make further improvements on this basis.

A common LSTM unit contains a cell, an input gate, an output gate and a forget gate. The cell remembers information from previous intervals and the three gates control the memorizing process. Let x_t denote the input of an LSTM unit at time t. The activation of this unit is updated as follows:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)        (1)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)        (2)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)        (3)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)     (4)
C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t            (5)
h_t = o_t ∗ tanh(C_t),                      (6)

where f, i, o, C and h represent the forget gate, the input gate, the output gate, the cell state and the cell output respectively, σ denotes the sigmoid activation function, and W and b are the corresponding weights and biases.
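For illustration, the following is a minimal NumPy sketch of a single LSTM step following Eqs. (1)-(6); the dictionary-based weight layout is an assumption made for readability, not the layout of any particular framework.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM update following Eqs. (1)-(6).

    W and b are dicts holding the weights/biases of the four gates;
    each W[k] has shape (hidden_dim, hidden_dim + input_dim).
    """
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W['f'] @ z + b['f'])       # forget gate, Eq. (1)
    i_t = sigmoid(W['i'] @ z + b['i'])       # input gate, Eq. (2)
    o_t = sigmoid(W['o'] @ z + b['o'])       # output gate, Eq. (3)
    C_tilde = np.tanh(W['C'] @ z + b['C'])   # candidate cell state, Eq. (4)
    C_t = f_t * C_prev + i_t * C_tilde       # new cell state, Eq. (5)
    h_t = o_t * np.tanh(C_t)                 # cell output, Eq. (6)
    return h_t, C_t
```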
Next, we will introduce our dataset generation method, the process of speech feature extraction, our network framework and the loss function in detail.

A. Dataset Generation

To the best of our knowledge, there has not been any public Chinese dataset that can be used for training speech-driven 3D facial animation, so we produced our own dataset. We used the THCHS-30 corpus [15], an open Chinese speech database published by the Center for Speech and Language Technology (CSLT) at Tsinghua University, as our speech script. It contains 1000 sentences and covers most of the phonemes in Chinese. A male speaker was invited to read these sentences at normal speed in front of a camera. The camera contains a microphone and is able to record synchronized video and sound: it captured the image data at a rate of 60 frames per second and recorded the surrounding sound signals at the same time. In total, we collected four hours of video data.

Next, we utilized two tools from the Faceware software kit [7] to convert the collected videos into 3D animations. One is the face tracking tool Analyzer, which tracks facial landmarks from video in a markerless way (see Figure 1). The other is Retargeter, a high-quality facial animation solving tool that retargets facial motion from a tracked video onto a 3D character. It needs a prepared 3D model with proper rigs and generates facial animations by controlling the rig movement. We used these two tools to transfer the facial animations from the captured video to a 3D character, and obtained the 3D face data by recording the vertex positions of the head of the animated 3D character in every frame. Figure 2 shows the retargeting result from video frames to a virtual 3D character. The prepared virtual 3D character is regarded as a template, which is a mesh model in the "zero pose" (see Figure 3). All the animated 3D faces have the same topology as the template.

We split the generated 3D facial animation data and the accompanying audio data into a training set (900 sentences), a validation set (50 sentences) and a test set (50 sentences). The ground truth of the vertex displacement is calculated by subtracting the template from the generated 3D faces.
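This data preparation step can be summarized by the hedged sketch below; the file name 'template_vertices.npy' and the per-sentence vertex arrays are hypothetical stand-ins for however the retargeted meshes are actually exported and stored.

```python
import numpy as np

# Hypothetical inputs: template vertices (5713, 3) and, per sentence,
# an array of animated vertices with shape (num_frames, 5713, 3).
template = np.load('template_vertices.npy')   # mesh in the "zero pose"

def displacement_ground_truth(animated_vertices):
    """Per-frame vertex displacement: animated mesh minus the template."""
    return animated_vertices - template[None, :, :]

# 900 / 50 / 50 split over the 1000 THCHS-30 sentences.
sentence_ids = np.arange(1000)
train_ids = sentence_ids[:900]
val_ids = sentence_ids[900:950]
test_ids = sentence_ids[950:]
```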

Figure 1. Facial landmarks tracked with Faceware Analyzer.

Figure 2. Facial motions transferred from video to 3D character with Faceware Retargeter.

Figure 3. 3D virtual character used to generate 3D facial animation data.

B. Speech Feature Extraction

In order to improve the robustness of our model to input speech signals, we use a pre-trained speech recognition model to extract speech features. Like VOCA [4], we adapt the DeepSpeech model [8], which is trained on hundreds of hours of voice data and generalizes to different speakers, speaking speeds, environmental noise, etc. We used the pre-trained model provided at the DeepSpeech GitHub releases page, which used Mel Frequency Cepstral Coefficients (MFCCs) [5] to extract the input audio features and replaced the fourth recurrent layer with an LSTM unit. The output of the model is a sequence of character probabilities, which is used as the input of our network.

Given an audio clip with a length of T seconds, we first resample the audio to a fixed 16 kHz and calculate its MFCCs. After normalization, we feed them to the pre-trained DeepSpeech model to extract audio features, and resample the features to 60 fps, which is consistent with the frame rate of the videos in our dataset. Finally, the output is a two-dimensional array of size 60T × D, where D is the number of letters in the alphabet plus a blank label.
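A hedged outline of this feature extraction pipeline is given below; `run_deepspeech` is a placeholder for whatever interface exposes the pre-trained DeepSpeech character probabilities (it is not an actual DeepSpeech API call), and linear interpolation is assumed for the resampling to 60 fps.

```python
import numpy as np
import librosa

def extract_speech_features(wav_path, run_deepspeech, fps=60):
    """Audio file -> per-frame DeepSpeech character probabilities at `fps`.

    `run_deepspeech(audio_16k)` is assumed to return an array of shape
    (num_windows, D), where D = number of letters in the alphabet + blank.
    """
    audio, sr = librosa.load(wav_path, sr=16000)   # resample to 16 kHz
    duration = len(audio) / float(sr)              # T seconds
    probs = run_deepspeech(audio)                  # (num_windows, D)

    # Linearly interpolate the feature sequence to 60T output frames.
    n_out = int(round(duration * fps))
    src_t = np.linspace(0.0, 1.0, num=probs.shape[0])
    dst_t = np.linspace(0.0, 1.0, num=n_out)
    features = np.stack(
        [np.interp(dst_t, src_t, probs[:, d]) for d in range(probs.shape[1])],
        axis=1)                                    # (60T, D)
    return features
```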
C. Network Architecture

The architecture of our network is inspired by WaveNet [11], proposed by van den Oord et al., which is a deep generative model of raw audio waveforms. It is able to generate speech that mimics many different speakers and sounds more natural than many existing text-to-speech systems. It realizes time-sequence generation with only a stack of convolutional layers, which is faster to train than a recurrent network. However, its receptive field is limited. To overcome this limitation, we combine convolutional layers with LSTM blocks to construct our network structure, as shown in Figure 4.

The network is implemented using the TensorFlow framework [1]. The first part of the network is composed of two one-dimensional convolutional layers, four unidirectional LSTM blocks and two fully connected layers, which together transform the speech features extracted by DeepSpeech into a low-dimensional embedding. The latter part of the network is a decoder consisting of a fully connected layer with linear activation. The decoder maps the embedding into the high-dimensional space of 3D vertex displacements, whose dimension is 5713×3 in our case, as the 3D facial model we used for the training set has 5713 vertices and the position of each vertex is represented by Cartesian coordinates in 3D space. Table I shows the specific parameters of our network.

Figure 4. Network architecture

TABLE I. The proposed network architecture

Type              Kernel   Stride   Output    Activation   Num
DeepSpeech        -        -        29        -            -
Convolution       5×1      1×1      32        ReLU         2
LSTM              -        -        128       -            2
LSTM              -        -        64        -            2
Fully connected   -        -        128       tanh         -
Fully connected   -        -        50        linear       -
Fully connected   -        -        5713×3    linear       -

During inference, we take variable-length audio clips as input and output vertex displacement at 60 fps. Then the computed displacement is added to the position of each vertex of a 3D face template to drive the face deformation. The topology of the face template here needs to be the same as that of the one we used for training data generation.
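Based on the parameters listed in Table I, a minimal tf.keras sketch of such a network could look as follows; the padding scheme and other unlisted hyperparameters are assumptions, so this is an illustration rather than the released implementation.

```python
import tensorflow as tf

NUM_VERTICES = 5713  # vertices of the 3D face model used for training

def build_model(feature_dim=29):
    """Conv1D + LSTM network mapping DeepSpeech features to vertex displacements.

    Input:  (batch, time, feature_dim) per-frame speech features at 60 fps.
    Output: (batch, time, NUM_VERTICES * 3) per-frame vertex displacements.
    """
    inputs = tf.keras.Input(shape=(None, feature_dim))
    x = inputs
    for _ in range(2):                                  # two 1D convolutions
        x = tf.keras.layers.Conv1D(32, kernel_size=5, strides=1,
                                   padding='same', activation='relu')(x)
    for units in (128, 128, 64, 64):                    # four unidirectional LSTMs
        x = tf.keras.layers.LSTM(units, return_sequences=True)(x)
    x = tf.keras.layers.Dense(128, activation='tanh')(x)
    x = tf.keras.layers.Dense(50, activation=None)(x)   # low-dimensional embedding
    outputs = tf.keras.layers.Dense(NUM_VERTICES * 3,
                                    activation=None)(x)  # linear decoder
    return tf.keras.Model(inputs, outputs)
```

At inference time, the predicted frames would be reshaped to (T, 5713, 3) and added to the template vertices, as described above.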
D. Loss Function

Suppose the vertex displacement sequence output by the network during training is {ỹ_t}, t = 1..T, and the ground truth of the vertex displacement is {y_t}, t = 1..T, where T is the number of video frames. We train the model by minimizing the following loss function:

Loss = ω_1 ∗ L_p + ω_2 ∗ L_v,        (7)

where L_p is the reconstruction loss, L_v is the velocity loss, and ω_1 and ω_2 are the corresponding weight coefficients. L_p is defined in Eq. 8, in which ‖·‖_F denotes the Frobenius norm. It calculates the Euclidean distance of the vertices between the predicted output and the real facial animation, and is thus used to constrain the gap between the predicted vertex coordinates and the ground truth:

L_p = ‖ y_t − ỹ_t ‖²_F.        (8)

The velocity loss L_v is defined in Eq. 9. It uses backward finite differences of the mesh vertices of adjacent frames to estimate the deformation speed of the face vertices, and calculates the difference between the predicted value and the ground truth. It has a smoothing effect: when only the reconstruction loss is used, the mouth movement is obvious, but lip jitter cannot be avoided. Through qualitative and quantitative analysis, we found that the velocity loss reduces lip shaking and improves the model accuracy significantly.

L_v = ‖ (y_t − y_{t-1}) − (ỹ_t − ỹ_{t-1}) ‖²_F.        (9)

The weight coefficients for L_p and L_v have to be set carefully, otherwise the mouth shape will be inaccurate. We set the weights of the reconstruction loss and velocity loss to 1.0 and 0.5 respectively. More details can be found in the comparative experiments in Section IV-B.
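A minimal TensorFlow sketch of this combined loss, using the weights 1.0 and 0.5 mentioned above, is shown below; the reduction over frames and batch elements is an assumption.

```python
import tensorflow as tf

def lip_sync_loss(y_true, y_pred, w_pos=1.0, w_vel=0.5):
    """Reconstruction + velocity loss, Eqs. (7)-(9).

    y_true, y_pred: (batch, time, num_vertices * 3) vertex displacements.
    """
    # L_p: squared norm of the per-frame displacement error (Eq. 8).
    l_pos = tf.reduce_sum(tf.square(y_true - y_pred), axis=-1)

    # L_v: error of backward finite differences between adjacent frames (Eq. 9).
    vel_true = y_true[:, 1:, :] - y_true[:, :-1, :]
    vel_pred = y_pred[:, 1:, :] - y_pred[:, :-1, :]
    l_vel = tf.reduce_sum(tf.square(vel_true - vel_pred), axis=-1)

    # Eq. (7): weighted sum, averaged over frames and the batch.
    return w_pos * tf.reduce_mean(l_pos) + w_vel * tf.reduce_mean(l_vel)
```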
IV. EXPERIMENTS

We compared our network with other methods and conducted ablation experiments for the velocity loss term. Both qualitative and quantitative evaluations and analysis are presented in this section.

A. Qualitative Evaluation

Although we only used the voice of a male speaker for training, our method generalizes well to female and even synthesized voices. The test results for different speech sources can be found in the supplementary video¹. We compared our network with LSTM and VOCA, whose results are also shown in the supplementary video. We used the pre-trained VOCA model, which is claimed to work for any language, and used the same test audio spoken in Mandarin as input.

Figure 5 shows some samples. The first column on the left shows the video frames of the character, and the second column is the ground truth of the 3D face reconstructed using Faceware. Facial animations generated by LSTM, VOCA and our model are shown in the third, fourth and last columns respectively. The test speech of the first three rows comes from the test set and is the same male voice as the training data. The test speech of the last two rows is a female voice, for which there is no 3D ground truth. It can be clearly seen that our model shows the best performance and generates more accurate mouth shapes than the other two methods.

Figure 5. Samples generated by different models. From left to right, the columns show the face appearance of the character, the ground truth of the 3D face generated by Faceware, and the results of LSTM, VOCA and our model respectively.

¹ https://www.dropbox.com/s/71oayo97aywd3l9/A%20Novel%20Speech-Driven%20Lip-Sync%20Model%20with%20CNN%20and%20LSTM.mp4?dl=0

B. Quantitative Evaluation

Two metrics are used to measure the accuracy of the generated facial animation and lip movements for different models: the positional error and the velocity error of the 2D facial landmarks.

Two network structures with and without velocity loss are compared, and the results are shown in Table II.

TABLE II. Error metrics of four models on the test data

Metric                                                   LSTM    LSTM+v loss   Conv+LSTM   Conv+LSTM+v loss
Position error of the 2D facial landmarks (unit: pixel)  2.449   2.401         2.275       2.296
Position error of the 2D mouth landmarks (unit: pixel)   4.920   4.812         4.514       4.592
Velocity error of the 2D facial landmarks                4.022   3.593         3.719       3.397
Velocity error of the 2D mouth landmarks                 5.704   4.537         4.881       4.204
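For illustration, the sketch below shows one way such positional and velocity errors could be computed from aligned 2D landmark sequences detected on renderings of the predicted and ground-truth animations; the exact averaging protocol behind Table II may differ.

```python
import numpy as np

def landmark_errors(pred, gt):
    """Positional and velocity errors of 2D landmarks.

    pred, gt: arrays of shape (num_frames, num_landmarks, 2), in pixels.
    Returns the mean per-landmark Euclidean position error and velocity error.
    """
    pos_err = np.linalg.norm(pred - gt, axis=-1).mean()

    # Backward finite differences of the landmark trajectories.
    vel_pred = pred[1:] - pred[:-1]
    vel_gt = gt[1:] - gt[:-1]
    vel_err = np.linalg.norm(vel_pred - vel_gt, axis=-1).mean()
    return pos_err, vel_err

# Mouth-only errors can be obtained by slicing the mouth landmark indices,
# e.g. landmark_errors(pred[:, mouth_idx], gt[:, mouth_idx]).
```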
It can be seen from the quantitative results that the combination of convolution layers with LSTM achieves higher accuracy than the application of LSTM alone, both on the landmark position error and on the motion error. The comparison results for the velocity loss function are a bit more complicated. For the LSTM network, the introduction of the velocity loss function has shown positive effects on both the positional error and the motion error. For the network with the combination of convolution layers and LSTM blocks, the velocity loss function reduces the motion error of the vertices, but it does not decrease the positional error of the landmarks. In fact, the positional error of the landmarks only represents the difference between the facial shape output by the network and the ground truth, rather than the realism and naturalness of the facial animation. For example, the amplitude of the mouth movement differs when different people speak; the mapping between speech and face shape is not unique. Therefore, it is not appropriate to judge the contribution of the velocity loss function to the fidelity of the facial animations from the positional error of the landmarks. On the other hand, as the velocity error of the landmarks is able to reflect the error of facial motion, we believe that the introduction of the velocity loss function is beneficial to the network we proposed.

We also conducted an experiment to verify whether the added velocity loss is able to reduce lip jitter and make the animation transition between words smoother. Similar to [12], we render the 3D facial animations into images and detect 2D facial landmarks with the facial behavior analysis toolkit OpenFace [2] (see Figure 6). We extract the position of a feature point located in the middle of the upper lip and draw the movement curve of this feature point in the y direction. Figure 7 and Figure 8 show the movement curves generated by models trained with and without the velocity loss term respectively. The base model in Figure 7 uses our network architecture, which combines convolutional layers with LSTM blocks, while that in Figure 8 uses the LSTM network. It can be seen from the two figures that the introduced velocity loss term significantly reduces lip jitter and smooths the lip movement for both networks. It is also noted that the velocity loss reduces the motion range of the mouth. Therefore, the trade-off must be considered carefully when selecting the weight of the velocity loss term.

Figure 6. Rendered facial animations and the detected 2D facial landmarks.

Figure 7. Movement curves of the selected lip feature point generated by our models trained with and without the velocity
loss term.

Figure 8. Movement curves of the selected lip feature point generated by LSTM models trained with and without the velocity
loss term.

V. CONCLUSION

In this work, we created a new speech-animation dataset which contains the video and animation data of an adult speaking Mandarin, and proposed a novel network for the generation of facial animation from speech. We conducted both quantitative and qualitative evaluations to show that our network is able to generate more accurate and smooth facial animation and is more robust to various audio sources. It can be further improved by incorporating more movement data of the upper part of the face, such as the eyebrows and eyes, to make the facial animation of digital humans more natural and realistic.

REFERENCES

[1] Abadi, M., Agarwal, A., Barham, P., et al., TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] Baltrusaitis, T., Zadeh, A., Lim, Y. C. and Morency, L. P., OpenFace 2.0: Facial behavior analysis toolkit. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 59-66. IEEE, 2018.
[3] Cootes, T.F., Edwards, G.J. and Taylor, C.J., Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6), pp. 681-685, 2001.
[4] Cudeiro, D., Bolkart, T., Laidlaw, C., Ranjan, A. and Black, M.J., Capture, learning, and synthesis of 3D speaking styles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10101-10111, 2019.
[5] Davis, S. and Mermelstein, P., Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), pp. 357-366, 1980.
[6] Edwards, P., Landreth, C., Fiume, E. and Singh, K., JALI: an animator-centric viseme model for expressive lip synchronization. ACM Transactions on Graphics (TOG), 35(4), pp. 1-11, 2016.
[7] Faceware, http://support.facewaretech.com/home.
[8] Hannun, A., Case, C., Casper, J., et al., Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
[9] Li, T., Bolkart, T., Black, M.J., Li, H. and Romero, J., Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics (TOG), 36(6), pp. 194:1-194:17, 2017.
[10] Mattheyses, W., Latacz, L. and Verhelst, W., Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis. Speech Communication, 55(7-8), pp. 857-876, 2013.
[11] Oord, A. van den, Dieleman, S., Zen, H., et al., WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[12] Richard, A., Lea, C., Ma, S., Gall, J., de la Torre, F. and Sheikh, Y., Audio- and gaze-driven facial animation of codec avatars. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 41-50, 2021.
[13] Shimba, T., Sakurai, R., Yamazoe, H. and Lee, J.H., Talking heads synthesis from audio with deep neural networks. In 2015 IEEE/SICE International Symposium on System Integration (SII), pp. 100-105. IEEE, 2015.
[14] Suwajanakorn, S., Seitz, S.M. and Kemelmacher-Shlizerman, I., Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4), pp. 1-13, 2017.
[15] Wang, D. and Zhang, X., THCHS-30: A free Chinese speech corpus. arXiv preprint arXiv:1512.01882, 2015.
[16] Zhou, Y., Xu, Z., Landreth, C., Kalogerakis, E., Maji, S. and Singh, K., VisemeNet: Audio-driven animator-centric speech animation. ACM Transactions on Graphics (TOG), 37(4), pp. 1-10, 2018.

