Surmo: Surface-Based 4D Motion Modeling For Dynamic Human Rendering
Abstract
Dynamic human rendering from video sequences has
representation. At the core of the paradigm is a feature encoder-decoder framework with three key components: 1) surface-based motion encoding; 2) physical motion decoding; and 3) 4D appearance decoding.

Firstly, in contrast to existing pose-guided methods [30, 49, 50, 71] that focus on static poses as a conditional variable, we extract an expressive 4D motion input from the 3D body mesh sequences obtained from the training video, which includes both a static pose, represented by a spatial 3D mesh, and its temporal dynamics. Furthermore, we notice that the non-rigid deformations of garments typically occur around the body surface instead of in a 3D volume, and hence we propose to model human motions on the body surface by projecting the extracted spatial-temporal 4D motion input to the dense 2D surface UV manifold of a clothless body template (e.g., SMPL [34]). To model temporal clothing offsets, a motion encoder is employed to lift the clothless motion features into a motion triplane that encodes both spatial and temporal motion relations in a compact 3D triplane conditioned on time-varying dynamics. The triplane is defined in the surface $u$-$v$-$h$ coordinate system, with $u$-$v$ representing the motion of the clothless body template and an extensional coordinate $h$ representing the secondary motion of clothes, i.e., the temporal clothing offsets are parameterized by a signed distance to the body surface. In this way, 4D motions can be effectively represented by a surface-based triplane.

Secondly, we propose to physically model the spatial and temporal motion dynamics in the rendering network. Specifically, with the surface-based triplane conditioned at time $t$, a motion decoder is employed to decode the motion triplane to predict the motion at the next timestep $t+1$, i.e., the spatial derivative of the motion, physically corresponding to the surface normal map, and the temporal derivative, corresponding to the surface velocity. We show that this physical motion learning significantly improves the rendering quality.

Thirdly, the motion triplanes are decoded into high-quality images in two stages: a volumetric surface-conditioned renderer that focuses on the rendering around the human body surface and filters out query points far from the body surface for efficient volumetric rendering, and a geometry-aware super-resolution module for efficient high-quality image synthesis.

We conduct a systematic analysis of how human appearances are affected by temporal dynamics, and observe that some baseline methods mainly generate pose-dependent appearances instead of time-varying appearances for free-view video generation. In addition, quantitative and qualitative experiments are performed on three datasets with a total of 9 subject sequences, including ZJU-MoCap [50], AIST++ [28], and MPII-RDDC [14], which validate the effectiveness of SurMo in different scenarios.

In summary, our contributions are:
1) A new paradigm for learning dynamic humans from videos that jointly models temporal motions and human appearances in a unified framework, and one of the early works that systematically analyzes how human appearances are affected by temporal dynamics.
2) An efficient surface-based triplane that encodes both spatial and temporal motion relations for expressive 4D motion modeling.
3) We achieve state-of-the-art results and show that our new paradigm is capable of learning high-fidelity appearances from fast motion sequences (e.g., AIST++ dance videos) and synthesizing motion-dependent shadows in challenging scenarios.

2. Related Work

Our method is closely related to many sub-fields of visual computing, and below we discuss a representative set of works.

3D Shape Representations. To capture detailed shapes of 3D objects, most recent papers utilize implicit functions [8, 9, 22, 39-41, 46, 48, 55-57, 64, 66, 75, 76] or point clouds [18, 37, 38] due to their topological flexibility. These methods aim to learn geometry from 3D datasets, whereas we synthesize human images of novel poses only from 2D RGB training images.

Rendering Humans by 2D GANs. Some existing approaches render human avatars by neural image translation, i.e., they utilize GAN [11] networks to learn a mapping from poses (given in the form of renderings of a skeleton [4, 27, 52, 59, 68, 78], a dense mesh [12, 31, 32, 42, 58, 68], or joint position heatmaps [1, 35, 36]) to human images [4, 68]. To improve temporal stability, some methods [19, 51, 53, 63] propose to utilize SMPL [34] priors for pose-guided generation. However, these methods do not reconstruct geometry explicitly and cannot handle self-occlusions effectively.

Rendering Humans by 3D-aware Renderers. For stable view synthesis, recent papers [7, 30, 44, 49, 50, 61] propose to unify geometry reconstruction with view synthesis by volume rendering, which, however, is computationally heavy. To address this issue, some recent 3D-GAN papers [5, 13, 17, 20, 21, 43, 45, 77] propose a hybrid rendering strategy for efficient geometry-aware rendering, that is, rendering low-resolution volumetric features for geometry learning and employing a super-resolution module for high-resolution image synthesis. We adopt this strategy for efficient rendering, whereas ours is distinguished by rendering articulated humans with 4D motion modeling.

UV-based Pose Representations. Some methods [20, 21, 37, 38, 54, 73] propose to project 3D posed meshes into a 2D positional map for pose encoding, which can be used for different downstream tasks, such as 3D reconstruction and novel view synthesis. [37, 38] rely on 3D supervision to learn 3D reconstructions, and they do not take as input the dynamics.
To improve the rendering quality, [20, 54] propose to utilize additional driving views as input for faithful rendering in telepresence applications. However, they do not explore how to learn temporal dynamics from pose sequences. By taking as input a 3D normal and velocity map, [73] encodes motion dynamics in their rendering network, whereas they do not explicitly learn motions, i.e., predict the motion status at the next timestep; besides, [73] is a 2D rendering method.

3. Methodology

3.1. Problem Setup

Given a sparse multi-view video of a clothed human in motion and corresponding 3D pose estimations $\{P_0, ..., P_t\}$, our goal is to synthesize time-varying appearances of the individual under novel views. Existing methods [30, 49, 50, 71] formulate the problem of human rendering as learning a representation via a feature encoder-decoder framework:

$$f_t = E_P(P_t, z_t), \qquad I_t = D_R(f_t) \tag{1}$$

where a pose encoder $E_P$ takes as input a representation of the posed body $P_t$ (e.g., 2D or 3D keypoints, or 3D body mesh vertices) and a timestamp embedding $z_t$ at time $t$, and outputs intermediate pose-dependent features $f_t$ that can be rendered by a decoder $D_R$ to reconstruct the appearance image $I_t$ of the corresponding pose at time $t$. The time embedding $z_t$ often serves as a residual that is updated passively in backpropagation by the image reconstruction loss.

However, there are two issues with this widely adopted paradigm. First, the appearance of clothed humans undergoes complex geometric transformations induced not only by the static pose $P_t$ but also by its dynamics, whereas the modeling of dynamics is ignored in Eq. 1, and the residual $z_t$ cannot expressively model physical dynamics either. Second, existing methods focus on per-image reconstruction while ignoring the temporal relations of a motion sequence in training, i.e., they sample the input pose $P_t$ and timestamp $z_t$ individually and supervise image reconstructions frame by frame, partly because defining temporal supervision in the 2D image space is challenging, especially for articulated humans that often suffer from misalignment problems due to pose estimation errors.

To solve these issues, we propose a new paradigm for learning view synthesis of dynamic humans from video sequences, which is formulated as:

$$f_t = E_M(P_t, D_t), \qquad \left\{ \frac{\partial P_{t+1}}{\partial x}, \frac{\partial P_{t+1}}{\partial t} \right\} = D_M(f_t), \qquad I_t = D_R(f_t) \tag{2}$$

where the input is represented by dynamic motions including a static pose $P_t$ and its physical dynamics $D_t$ at time $t$, and a motion encoder $E_M$ is employed to encode the inputs into motion-dependent features (Sec. 3.2). Besides, in the training stage, a physical motion decoder $D_M$ is introduced to enforce the learning of spatial and temporal relations by decoding the intermediate feature $f_t$ to predict the spatial derivative at $t+1$, physically corresponding to the surface normal $N_{t+1} = \frac{\partial P_{t+1}}{\partial x}$, and the temporal derivative at $t+1$, corresponding to the surface velocity $V_{t+1} = \frac{\partial P_{t+1}}{\partial t}$ (Sec. 3.3). $f_t$ is rendered into human images by a decoder $D_R: G_2 \circ G_1$ (Sec. 3.4). The framework is depicted in Fig. 2.

Figure 2. Framework overview. Given a set of time-varying 3D body meshes $\{P_t, ..., P_{t-n}\}$ obtained from training video sequences, we aim to synthesize high-fidelity appearances of a clothed human in motion via a feature encoder-decoder framework: Motion Encoding, and joint Motion and Appearance Decoding. 1) We take as input an expressive 4D motion representation extracted from the mesh sequences, including the 3D pose, the 3D velocity at time $t$, and the motion trajectory over the past $w$ timesteps, which encode both spatial and temporal relations of the motion sequence and are projected to the spatially aligned UV surface space. A motion encoder $E_M$ is employed to lift the 2D UV-aligned features to a 3D surface-based triplane $f_t^{uvh}$ in a UV-plus-height space, with a signed distance height to model temporal clothing offsets. 2) A motion decoder $D_M$ is designed to encourage physical motion learning in training by decoding the triplane features $f_t^{uvh}$ to predict the motion at the next timestep $t+1$, i.e., the spatial derivative (surface normal $N_{t+1}^{uv}$) and the temporal derivative (surface velocity $V_{t+1}^{uv}$) in UV space. 3) Finally, given a target camera view, the triplane $f_t^{uvh}$ is rendered into high-quality images by a volumetric surface-conditioned renderer, including low-resolution volumetric rendering by $G_1$ and efficient geometry-aware super-resolution by $G_2$.
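To make the data flow of Eq. 2 concrete, here is a minimal PyTorch-style sketch of the training-time forward pass; the module names and interfaces are illustrative assumptions rather than the released SurMo implementation.

```python
import torch.nn as nn

class SurMoSketch(nn.Module):
    """Minimal sketch of the encoder-decoder paradigm in Eq. (2)."""
    def __init__(self, enc: nn.Module, motion_dec: nn.Module, renderer: nn.Module):
        super().__init__()
        self.E_M = enc          # motion encoder: (P_t, D_t) -> f_t
        self.D_M = motion_dec   # physical motion decoder: f_t -> (N_{t+1}, V_{t+1})
        self.D_R = renderer     # appearance decoder: f_t -> image I_t

    def forward(self, pose_t, dyn_t):
        f_t = self.E_M(pose_t, dyn_t)          # motion-dependent features (Sec. 3.2)
        normal_next, vel_next = self.D_M(f_t)  # spatial / temporal derivatives at t+1 (Sec. 3.3)
        image_t = self.D_R(f_t)                # rendered image at time t (Sec. 3.4)
        return image_t, normal_next, vel_next
```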
Notation. In the following parts, we use $f_t^{space}$ to denote features in the pipeline, i.e., $f_t^{3D}$ is a 4D representation defined in 3D space and conditioned on time $t$, and $M_t^{3D}$ is the 4D motion input defined in 3D space and conditioned on $t$.

3.2. Surface-based 4D Motion Encoding

Extracting 4D Motions. We first extract 4D motions from a sequence of time-varying parametric posed body (e.g., SMPL) meshes obtained from the training video sequences. We describe the motion $M_t^{3D}$ at time $t$ as a static skeleton pose $P_t$ and its dynamics $D_t$. The dynamics at $t$ are physically determined by the current pose, the 3D velocity, and the motion trajectory of the past several timesteps, which also contribute to the time-varying appearances of the secondary motion. The pose $P_t$ is represented by the 3D vertices of the posed mesh, and the dynamics $D_t$ are parameterized by 1) the body surface velocity $V_t$, corresponding to the temporal derivative of the current pose at $t$, and 2) the motion trajectory $T_t$, which aggregates the temporal derivatives over the past several timesteps with a sliding window of size $w$ and weight $\lambda$:

$$M_t^{3D} = [P_t, D_t], \quad D_t = [V_t, T_t], \quad V_t = \frac{\partial P_t}{\partial t}, \quad T_t = P_t + \frac{1}{\sum_i \lambda^i} \sum_{i=1}^{w} V_{t-i} \, \lambda^i \tag{3}$$

Note that the motion trajectory aggregated over several consecutive timesteps makes the motion representation robust to pose estimation errors, i.e., the pose estimates for two consecutive timesteps may be identical due to pose estimation errors in practice. An ablation study of the motion trajectory representation can be found in the supp. mat.
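As a concrete illustration of Eq. 3, the sketch below computes the velocity and trajectory terms from a sequence of posed mesh vertices with finite differences; the frame spacing `dt`, window size, and decay weight values are assumptions for illustration only.

```python
import torch

def motion_dynamics(verts, t, w=3, lam=0.9, dt=1.0):
    """Compute velocity V_t and trajectory T_t (Eq. 3) from mesh vertices.

    verts: (T, N, 3) tensor of posed SMPL vertices over time.
    t:     current frame index (requires t > w).
    w:     sliding window size; lam: decay weight; dt: frame spacing.
    """
    # Finite-difference approximation of the temporal derivative dP_t/dt.
    V = lambda i: (verts[i] - verts[i - 1]) / dt

    V_t = V(t)
    # Trajectory: current pose plus a normalized, exponentially weighted sum of past velocities.
    weights = torch.tensor([lam ** i for i in range(1, w + 1)])
    T_t = verts[t] + sum(V(t - i) * weights[i - 1] for i in range(1, w + 1)) / weights.sum()
    return V_t, T_t   # dynamics D_t = [V_t, T_t]
```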
Recording 4D Motions on the Surface Manifold. Modeling the motions with temporal dynamics often requires dense observations to construct a 4D motion volume. Instead, we notice that non-rigid deformations of human geometry in motion typically occur around the body surface instead of in a 3D volume, and hence we propose to model the motions on the human body surface.
Specifically, we project the 4D motion input $M_t^{3D}$, including the spatial pose and temporal dynamics, from 3D space into a compact, spatially aligned UV space using the geometric transformation $W$ that is pre-defined by the parametric (e.g., SMPL) body template, which yields $f_t^{uv} = W M_t^{3D}$. The UV-aligned motion feature $f_t^{uv}$ faithfully preserves the articulated structures of the body topology in a compact 2D space.

Generating Surface-based Triplanes for Motion Modeling. To model clothed humans, we further employ a motion encoder $E_M$ that lifts the clothless 2D features $f_t^{uv}$ to a 3D triplane representation $f_t^{uvh}$ that is defined in a $u$-$v$-$h$ system, with $u$-$v$ representing the motion of the clothless body template and an extensional coordinate $h$ representing the secondary motion of clothes, i.e., the temporal clothing offsets. Hence, 4D clothed motions can be parameterized by a surface-based triplane:

$$f_t^{uvh} = E_M(f_t^{uv}) = E_M(W M_t^{3D}) \tag{4}$$

where $W$ is the geometric transformation from 3D space to UV space. $f_t^{uvh}$ consists of three planes $x_{uv}, x_{uh}, x_{hv} \in \mathbb{R}^{U \times V \times C}$ to form the spatial relationship, where $U, V$ denote the spatial resolution and $C$ is the channel number. In contrast to the volumetric triplanes used in EG3D [5], where the three planes represent three vertical planes in the 3D volumetric space, $f_t^{uvh}$ is defined on the human body surface, inheriting human topology priors for motion modeling, as illustrated by an unwarped triplane in 3D space in Fig. 2.
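The following sketch illustrates one plausible way to realize Eq. 4: per-vertex motion attributes are scattered into UV maps using a precomputed texel-to-vertex lookup from the SMPL UV template, and a small convolutional encoder emits the three surface planes. The channel counts, the `uv_vert_idx` lookup, and the interpretation of the triplane size reported in the appendix (256 × 256 × 48) as spatial resolution × feature channels are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

def to_uv(attr, uv_vert_idx):
    """Scatter per-vertex attributes (N, C) into a UV map (C, U, V) using a
    precomputed texel-to-vertex lookup (U, V) defined by the SMPL UV template."""
    return attr[uv_vert_idx].permute(2, 0, 1)            # (C, U, V)

class TriplaneEncoder(nn.Module):
    """Lift UV-aligned motion features f_t^uv to a surface triplane f_t^uvh (Eq. 4)."""
    def __init__(self, in_ch=9, feat_ch=32):
        super().__init__()
        # One shared backbone, three heads producing x_uv, x_uh, x_hv.
        self.backbone = nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.head_uv = nn.Conv2d(64, feat_ch, 1)
        self.head_uh = nn.Conv2d(64, feat_ch, 1)
        self.head_hv = nn.Conv2d(64, feat_ch, 1)

    def forward(self, pose_uv, vel_uv, traj_uv):
        # W(M_t^3D): UV maps of pose, velocity and trajectory, e.g. (B, 3+3+3, 256, 256).
        f_uv = torch.cat([pose_uv, vel_uv, traj_uv], dim=1)
        x = self.backbone(f_uv)
        return self.head_uv(x), self.head_uh(x), self.head_hv(x)
```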
3.3. Physical Motion Decoding

We propose a physical motion decoder $D_M$ to learn spatial and temporal features of motion, which is achieved by decoding the intermediate motion feature $f_t^{uvh}$ learned by $E_M$ to predict the motion at the next timestep $t+1$, i.e., the spatial derivative corresponding to the surface normal $N_{t+1}^{uv}$ and the temporal derivative corresponding to the surface velocity $V_{t+1}^{uv}$. Note that we model the normal and velocity in UV space:

$$\{N_{t+1}^{uv}, V_{t+1}^{uv}\} = D_M(f_t^{uvh}) = D_M \circ E_M(W M_t^{3D}) \tag{5}$$
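A minimal sketch of a motion decoder consistent with Eq. 5: a convolutional head maps the UV-aligned plane of the triplane to normal and velocity maps at $t+1$. Decoding only the $x_{uv}$ plane and the layer sizes are simplifying assumptions; the paper specifies only the inputs and outputs of $D_M$.

```python
import torch.nn as nn
import torch.nn.functional as F

class PhysicalMotionDecoder(nn.Module):
    """D_M in Eq. (5): decode triplane features into N^uv_{t+1} and V^uv_{t+1}."""
    def __init__(self, feat_ch=32):
        super().__init__()
        # The x_uv plane is already UV-aligned, so a 2D conv head suffices in this sketch.
        self.net = nn.Sequential(nn.Conv2d(feat_ch, 64, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(64, 6, 3, padding=1))

    def forward(self, x_uv):
        out = self.net(x_uv)                     # (B, 6, U, V)
        normal = F.normalize(out[:, :3], dim=1)  # unit surface normals N^uv_{t+1}
        velocity = out[:, 3:]                    # surface velocities V^uv_{t+1}
        return normal, velocity

# Training-time supervision (weights follow the appendix: lambda_velocity = lambda_normal = 1):
# loss_motion = l1(normal, normal_gt) + l1(velocity, velocity_gt)
```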
3.4. 4D Appearance Decoding

Volumetric Surface-conditioned Rendering. Given a target camera viewpoint, the triplane $f_t^{uvh}$ is first rendered into low-resolution volumetric features $I_F \in \mathbb{R}^{H \times W \times \hat{C}}$ by a conditional NeRF $F_\Phi$, where $H$ and $W$ are the spatial resolution and $\hat{C}$ is the channel number. To be more specific, given a 3D query point $p_i$, we transform it into a surface-based local coordinate $\hat{p}_i = (u_i, v_i, h_i)$ w.r.t. the tracked body mesh $P_t$. Here we search the nearest face $f_i$ of $P_t$ for the query point $p_i$, and $(u_i, v_i)$ and $h_i$ are the barycentric coordinates of the nearest point on $f_i$ and the signed distance, respectively. We therefore obtain the local feature $z_i^{uvh}$ of the query point $p_i$ as

$$z_i^{uvh} = \mathrm{cat}\left[\Pi(x_{uv}; u_i, v_i),\ \Pi(x_{uh}; u_i, h_i),\ \Pi(x_{hv}; h_i, v_i)\right] \tag{6}$$

where $\Pi(\cdot)$ denotes the sampling operation and $\mathrm{cat}[\cdot]$ denotes the concatenation operator.

Given the camera direction $d_i$, the appearance features $c_i$ and density features $\sigma_i$ of point $p_i$ are predicted by

$$\{c_i, \sigma_i\} = F_\Phi(d_i, z_i^{uvh}) \tag{7}$$

We integrate all the radiance features of the sampled points into a 2D feature map $I_F$ at time $t$ through a volume renderer.
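Below is a sketch of the query-point feature lookup of Eqs. 6-7: the three planes are bilinearly sampled at surface-local coordinates, and the concatenated feature, together with the view direction, is fed to the conditional NeRF. The `nearest_surface_point` helper for the nearest-face search and the normalization of $(u, v, h)$ to $[-1, 1]$ are assumed for illustration.

```python
import torch
import torch.nn.functional as F

def sample_plane(plane, a, b):
    """Bilinearly sample a feature plane (B, C, H, W) at normalized coords a, b in [-1, 1]."""
    grid = torch.stack([a, b], dim=-1).view(plane.shape[0], -1, 1, 2)   # (B, M, 1, 2)
    feat = F.grid_sample(plane, grid, align_corners=True)               # (B, C, M, 1)
    return feat.squeeze(-1).permute(0, 2, 1)                            # (B, M, C)

def query_features(x_uv, x_uh, x_hv, u, v, h):
    """Eq. (6): z_i^uvh = cat[ Pi(x_uv; u, v), Pi(x_uh; u, h), Pi(x_hv; h, v) ]."""
    return torch.cat([sample_plane(x_uv, u, v),
                      sample_plane(x_uh, u, h),
                      sample_plane(x_hv, h, v)], dim=-1)

# The surface-local coordinates (u_i, v_i, h_i) of query points p_i are obtained by a
# nearest-face search and barycentric interpolation on the tracked mesh P_t, e.g.:
#   u, v, h = nearest_surface_point(p, body_mesh)   # hypothetical helper, not in the paper
# The radiance features are then predicted by the conditional NeRF MLP of Eq. (7):
#   c_i, sigma_i = F_phi(view_dir, query_features(x_uv, x_uh, x_hv, u, v, h))
```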
Figure 3. Qualitative comparisons on novel view synthesis on subject S313 of the ZJU-MoCap dataset. Two motion sequences S1 (swing arms left to right) and S2 (raise and lower arms) are shown. We specifically focus on the synthesis of time-varying appearances (especially T-shirt wrinkles), by evaluating the rendering results under similar poses yet with different movement directions, which are marked in the same color, such as the pairs of ①②, ③④, and ⑤⑥. Our method synthesizes high-fidelity time-varying appearances, whereas SOTA HumanNeRF generates almost the same cloth wrinkles.
Figure 4. Qualitative comparisons on novel view synthesis on subjects S387 and S315 of the ZJU-MoCap dataset. Rows 1 and 2 show similar poses occurring at different timesteps (not consecutive frames). The results indicate that our method synthesizes time-varying appearances while other methods mainly generate pose-dependent appearances.
We evaluate all methods with four metrics: the per-pixel metrics SSIM [70] and PSNR, and the perception metrics LPIPS [74] and FID [16].

Table 1. Quantitative comparisons on the ZJU-MoCap dataset (averaged over all test views and poses on 6 sequences) for novel-view synthesis. To reduce the influence of the background, all scores are calculated from images cropped to 2D bounding boxes. Note that the perception metrics LPIPS [74] and FID [16] capture human judgment better than per-pixel metrics such as SSIM [70] or PSNR, as stated in [30, 58].

ZJU-S1-6           LPIPS ↓   FID ↓    SSIM ↑   PSNR ↑
Neural Body [50]   .164      125.10   .703     21.438
Instant-NVR [10]   .202      144.99   .738     21.748
HumanNeRF [71]     .106      88.382   .792     23.624
Ours               .075      67.725   .833     24.815

Baselines. We compare our method against SOTA methods including Neural Body [50], HumanNeRF [71], Instant-NVR [10], ARAH [67], DVA [54], and HVTR++ [20]. Neural Body models human poses in 3D space with a point cloud as the pose representation, HumanNeRF and Instant-NVR model poses in a canonical space with inverse skinning, ARAH is a forward-skinning-based method, and DVA and HVTR++ encode poses in UV space and take driving-view signals as input for telepresence applications.

4.1. Comparisons to SOTA Methods

Quantitative comparisons. We conduct quantitative comparisons on the three datasets with a total of 9 subjects. The quantitative results on ZJU-MoCap are summarized in Tab. 1, which suggests that our new paradigm outperforms the SOTA by a big margin, and we achieve the best quantitative results on all four metrics. The detailed comparisons on each subject of ZJU-MoCap and the quantitative results on MPII-RDDC and AIST++ are listed in the supp. mat. Refer to the comparisons with ARAH [67], DVA [54] and HVTR++ [20] in the supp. mat.

Time-varying Appearances with Dynamics Conditioning on ZJU-MoCap. We compare with baseline methods for novel view synthesis on two motion sequences of subject S313 from the ZJU-MoCap dataset, as shown in Fig. 3. We evaluate the capability of each method to synthesize time-varying appearances. Specifically, two sequences S1 (swing arms left to right) and S2 (raise and lower arms) are evaluated, where similar poses occurring at different timesteps with different dynamics (e.g., movement directions, trajectory, or velocity) are marked in the same color, such as the pairs of ①②, ③④, and ⑤⑥. Fig. 3 suggests that Instant-NVR [10] fails to synthesize time-varying high-frequency wrinkles, and Neural Body [50] synthesizes time-varying yet blurry wrinkles. HumanNeRF [71] synthesizes detailed yet static T-shirt wrinkles, i.e., the wrinkles are almost the same for similar poses, such as ①②, ③④, and ⑤⑥. In contrast, our method renders both high-fidelity and time-varying wrinkles. The comparisons on the other subjects S387 and S315 are shown in Fig. 4, which illustrate the effectiveness of our proposed paradigm in dynamics learning.

Time-varying Appearances with Motion-dependent Shadows on MPII-RDDC. We compare with baseline methods for novel view synthesis on MPII-RDDC, as shown in Fig. 5. The sequence is captured in a studio with top-down lighting that casts shadows on the human body due to self-occlusions. We notice that the synthesis of shadows can be formulated as learning motion-dependent shadows in the motion representation, without the need to model the lighting explicitly.
Figure 5. Novel view synthesis of time-varying appearances with both pose and lighting conditioning on the MPII-RDDC dataset. The sequence is captured in a studio with top-down lighting that casts shadows on the human performer due to self-occlusion. In Row 1, we specifically focus on synthesizing time-varying shadows (e.g., ① vs. ②, and ③ vs. ④) for different poses with different self-occlusions. In Row 2, we evaluate the synthesis of: 1) time-varying appearances for similar poses occurring in a jump-up-and-down motion sequence, e.g., ⑤ vs. ⑥, 2) shadows ⑦ vs. ⑧, and 3) clothing offsets ⑤ vs. ⑥.
Figure 7. Ablation study of the 3D volumetric triplane (Vol-Trip) vs. the surface-based triplane (Surf-Trip) for human modeling on subjects S313 and S387 from ZJU-MoCap. We focus on the convergence in training (left) and self-occlusion rendering (right). The convergence curve of S313 is shifted for visualization purposes. Vol-Trip is not effective in handling self-occlusion, e.g., ①②③④, though it performs well in another viewpoint without self-occlusions ⑥. In addition, Vol-Trip cannot synthesize high-quality facial details.
Table 2. Ablation study of motion conditioning and learning. Pcond and Dcond denote the conditioning on pose and dynamics; Vpred and Npred denote the prediction of surface velocity and surface normal in motion learning.

S313      LPIPS ↓   FID ↓    SSIM ↑   PSNR ↑
Pcond     .120      104.83   .793     22.781
+ Dcond   .085      73.674   .834     24.908
+ Vpred   .069      62.092   .856     25.845
+ Npred   .060      50.170   .869     26.654

Fig. 7 compares the results qualitatively, which indicates that Surf-Trip is more effective in handling self-occlusions, whereas Vol-Trip fails (①②③④), though it performs well in another viewpoint without self-occlusions (⑥). In addition, Surf-Trip generates high-fidelity clothing wrinkles and facial details, whereas the face rendering of Vol-Trip is blurry.

Motion Conditioning and Learning. We also analyze the effectiveness of our proposed motion conditioning and learning qualitatively and quantitatively. Tab. 2 summarizes the quantitative results, which suggest that the results are significantly improved by taking as input the dynamics conditioning Dcond. With physical motion learning, i.e., predicting the temporal derivative (surface velocity) and the spatial derivative (surface normal) at the next timestep, the quantitative results are further improved by a big margin.

The qualitative comparisons are shown in Fig. 8, where in Row 1, we observe higher-fidelity appearances with dynamics conditioning and with physical motion learning for both velocity and normal prediction. In Row 2, we evaluate whether our method learns to decouple the temporal dynamics (e.g., velocity) and static poses from the motion conditioning, which corresponds to the rendering of clothing wrinkles of the secondary motion (T-shirt ①) and tight parts (e.g., shorts ②). We only change the velocity for each variant in the ablation study. It suggests that the appearance of the T-shirt varies when the velocity or dynamics are changed, whereas the renderings of tight parts remain the same, such as the head, tight shorts, and shoes. This is consistent with our daily observations. The ablation study illustrates that our method decouples the pose and dynamics, and is capable of generating dynamics-dependent appearances. Note that the appearance changes are small between V × 1 and V × 0.01 when we reduce the velocity, partly because the method is not sensitive to smaller velocities due to the pose estimation errors in the training dataset, where consecutive frames may be estimated with the same pose fitting.

5. Discussion

We propose SurMo, a new paradigm for learning dynamic humans from videos by jointly modeling temporal motions and human appearances in a unified framework, based on an efficient surface-based triplane. We conduct a systematic analysis of how human appearances are affected by temporal dynamics, and extensive experiments validate the expressiveness of the surface-based triplane in rendering fast motions and motion-dependent shadows. Quantitative experiments illustrate that SurMo achieves SOTA results on different motion sequences.

Acknowledgement

This study is supported by the Ministry of Education, Singapore, under its MOE AcRF Tier 2 (MOET2EP20221-0012), NTU NAP, and under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from the industry partner(s).
Appendix

A. Implementation

Surface-based Triplane. The size of the triplane is 256 × 256 × 48.

Discriminator. We adopt the discriminator architecture of PatchGAN [23] for adversarial training. Note that, different from EG3D [6], which applies the image discriminator at both resolutions, we only supervise the final rendered images with adversarial training and supervise the volumetric features with a reconstruction loss.

A.2. Optimization

SurMo is trained end-to-end to optimize $E_M$, $D_M$, and the renderers $G_1$, $G_2$ with 2D image losses. Given a ground truth image $I_{gt}$, we predict a target RGB image $I_{RGB}^{+}$ with the following losses:

Pixel Loss. We enforce an $\ell_1$ loss between the generated image and the ground truth, $L_{pix} = \| I_{gt} - I_{RGB}^{+} \|_1$.

Perceptual Loss. The pixel loss is sensitive to image misalignment due to pose estimation errors, and we further use a perceptual loss [24] to measure the differences between the activations on different layers of the pre-trained VGG network [60] for the generated image $I_{RGB}^{+}$ and the ground truth image $I_{gt}$,

$$L_{vgg} = \sum_j \frac{1}{N^j} \left\| g^j(I_{gt}) - g^j(I_{RGB}^{+}) \right\|_2 \tag{10}$$

where $g^j$ is the activation and $N^j$ the number of elements of the $j$-th layer in the pretrained VGG network.

Adversarial Loss. We leverage a multi-scale discriminator $D$ [69] as an adversarial loss $L_{adv}$ to enforce the realism of the rendering, especially for cases where the estimated human poses are not well aligned with the ground truth images.

Face Identity Loss. We use a pre-trained network to ensure that the renderers preserve the face identity on the cropped faces of the generated and ground truth images.

Predicting $N_t^{uv}$ is easier for the network than predicting $N_{t+1}^{uv}$ directly, and $N_{t+1}^{uv}$ can be derived and normalized from $N_{t+1}^{uv} = \frac{\partial P_{t+1}^{uv}}{\partial x} = \frac{\partial \{P_t^{uv} + V_{t+1}^{uv}\}}{\partial x} = N_t^{uv} + \frac{\partial V_{t+1}^{uv}}{\partial x}$. With $V_{t+1}^{uv}$ predicted for temporal motion supervision, the prediction of $N_t^{uv}$ enforces a supervision similar to $N_{t+1}^{uv}$ for the spatial motion learning.

Volume Rendering Loss. We supervise the training of volume rendering at low resolution, applied on the first three channels of $I_F$: $L_{vol} = \| I_F[:3] - I_{gt}^{D} \|_2$, where $I_{gt}^{D}$ is the downsampled reference image.
The networks were trained using the Adam optimizer [26]. The loss weights $\{\lambda_{pix}, \lambda_{vgg}, \lambda_{adv}, \lambda_{face}, \lambda_{velocity}, \lambda_{normal}, \lambda_{vol}\}$ are set empirically to $\{.5, 10, 1, 5, 1, 1, 15\}$. It takes about 12 hours to train a model from about 3000 images for 200 epochs on two NVIDIA V100 GPUs.

A.3. Training Data Processing

We evaluate novel view synthesis on three datasets: ZJU-MoCap [50] (including the sequences S313, S315, S377, S386, S387, S394) at resolution 1024 × 1024, MPII-RDDC [65] at resolution 1285 × 940, and AIST++ [28] at 1920 × 1080. Note that the sequences of ZJU-MoCap used in Neural Body are generally short, e.g., only 60 frames for S313. Instead, to evaluate the time-varying effects, we extend the original training frames of S313, S315, S387, and S394 to 400, 700, 600, and 600 frames respectively, depending on the pose variance of each sequence, whereas S377 and S386 keep the same 300 frames as the setup of Neural Body [50]. For ZJU-MoCap, 4 cameras are used for training and the others are used for testing. For AIST++, 6 cameras are used for training and 3 for testing; for MPII-RDDC, 18 cameras are used for training and 9 for testing.
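For reference, the data setup described above can be condensed into a small configuration sketch; the values come from the text, while the key names are ours.

```python
DATA_CONFIG = {
    "ZJU-MoCap": {"resolution": (1024, 1024), "train_cams": 4, "test_cams": "rest",
                  "train_frames": {"S313": 400, "S315": 700, "S387": 600,
                                   "S394": 600, "S377": 300, "S386": 300}},
    "MPII-RDDC": {"resolution": (1285, 940), "train_cams": 18, "test_cams": 9},
    "AIST++":    {"resolution": (1920, 1080), "train_cams": 6, "test_cams": 3},
}
```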
Figure 9. Illustration of Volumetric Triplane vs. Surface-based Triplane.
Figure 10. Qualitative comparisons against the 3D pose- and image-driven approaches DVA [54] and HVTR++ [20] for novel view synthesis
of training poses on ZJU-MoCap. For each example, from left to right: DVA, HVTR++, Ours, Ground Truth. Rendering results of DVA
and HVTR++ are provided by the authors.
B.1. Comparisons with SOTA Methods

Comparisons with 3D pose- and image-driven approaches. In contrast to pose-driven methods (e.g., Neural Body [50], Instant-NVR [10], HumanNeRF [71]), DVA [54] and HVTR++ [20] propose to utilize both pose and driving-view features in rendering. They model both the pose and texture features in UV space, whereas ours is distinguished by modeling motions in a surface-based triplane, and we jointly learn physical motions and rendering in a unified framework.

Tab. 3 summarizes the quantitative results for novel view synthesis on the two sequences (S386 and S387) mentioned in DVA, which suggest that our method significantly outperforms DVA and HVTR++ in terms of both per-pixel and perception metrics. Qualitative comparisons are provided in Fig. 10, which shows that our method produces sharper reconstructions with more faithful wrinkles than both DVA and HVTR++. In contrast to the image resolution of 512 × 512 used in Neural Body [50], HumanNeRF [71] and Instant-NVR [10], DVA and HVTR++ were trained and evaluated at the image resolution of 1024 × 1024.
Table 3. Quantitative comparisons against the 3D pose- and image-driven approaches DVA [54] and HVTR++ [20] on the ZJU-MoCap dataset (averaged over all test views and poses) for novel view synthesis. To reduce the influence of the background, all scores are calculated from images cropped to 2D bounding boxes as used in [20]. Note that training and testing are conducted at the image resolution of 1024 × 1024, following the setup in DVA [54]. For reference, we report the quantitative results of HVTR++ and DVA from the HVTR++ paper.
Table 5. Quantitative comparisons on the MPII-RDDC dataset [14]. To reduce the influence of the background, all scores are calculated from images cropped to 2D bounding boxes.

Methods            LPIPS ↓   FID ↓    SSIM ↑   PSNR ↑
HumanNeRF [71]     .175      116.53   .615     17.443
Ours               .153      107.79   .627     18.048

Table 6. Quantitative comparisons on the S13 and S21 sequences from the AIST++ dataset [28]. To reduce the influence of the background, all scores are calculated from images cropped to 2D bounding boxes.

S13                LPIPS ↓   FID ↓    SSIM ↑   PSNR ↑
Neural Body [50]   .266      276.70   .732     17.649
Ours               .183      161.68   .751     17.488

S21                LPIPS ↓   FID ↓    SSIM ↑   PSNR ↑
Neural Body [50]   .296      333.03   .731     17.137
Ours               .205      177.36   .757     17.334

Table 7. Ablation study of motion prediction and training views.

S313               LPIPS ↓   FID ↓    SSIM ↑   PSNR ↑
w/o Pred           .085      73.674   .834     24.908
Pred_t             .073      60.942   .848     25.537
Pred_t+1           .060      50.170   .869     26.654
Pred_t+1 (1 view)  .126      112.19   .788     22.830

S387               LPIPS ↓   FID ↓    SSIM ↑   PSNR ↑
w/o Pred           .115      93.688   .761     22.152
Pred_t             .096      83.825   .790     23.083
Pred_t+1           .084      71.216   .810     23.735
Pred_t+1 (1 view)  .151      128.18   .729     21.093

Table 8. Ablation study of dynamics conditioning.

Methods                  LPIPS ↓   FID ↓    SSIM ↑   PSNR ↑
Norm. + Velo. [73]       .093      81.900   .825     24.113
Ours w/ Dcond            .085      73.674   .834     24.908
Ours w/ Vpred, Npred     .060      50.170   .869     26.654
module to synthesize high-quality images. The quantitative results are summarized in Tab. 9. It is observed that the performance improves when the upsampling factor is reduced from 4 to 2, which indicates that more geometric features are utilized by increasing the resolution of volumetric rendering.

B.4. Efficiency

At test time, our method runs at 3.2 FPS on one NVIDIA V100 GPU to render 512 × 512 resolution images, about 39× faster than Neural Body [50], 17× faster than HumanNeRF [71], and 9× faster than Instant-NVR [10].

B.5. Failure Cases

Our method fails to generate high-quality wrinkles for the complicated textures of AIST++ [28], as shown in Fig. 13. This is because we cannot learn to infer dynamic wrinkles from such complicated appearances.

Figure 13. Failure cases.

Table 10. Quantitative comparisons with Neural Body [50], Instant-NVR [10], and HumanNeRF [71] on ZJU-MoCap. Instant-NVR* and Instant-NVR are trained with 100 and 30 epochs respectively, which generates better results than the official models that were trained with 6 epochs. Qualitative results can be found in the demo video.

References

[1] Kfir Aberman, M. Shi, Jing Liao, Dani Lischinski, B. Chen, and D. Cohen-Or. Deep video-based performance cloning. Computer Graphics Forum, 38, 2019.
[2] George Borshukov, Dan Piponi, Oystein Larsen, J. P. Lewis, and Christina Tempelaar-Lietz. Universal capture: image-based facial animation for "the matrix reloaded". In SIGGRAPH '03, 2003.
[3] Joel Carranza, Christian Theobalt, Marcus A. Magnor, and Hans-Peter Seidel. Free-viewpoint video of human actors. ACM SIGGRAPH 2003 Papers, 2003.
[4] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A. Efros. Everybody dance now. ICCV, pages 5932–5941, 2019.
[5] Eric Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J. Guibas, Jonathan Tremblay, S. Khamis, Tero Karras, and Gordon Wetzstein. Efficient geometry-aware 3d generative adversarial networks. ArXiv, abs/2112.07945, 2021.
[6] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16123–16133, 2022.
[7] Jianchuan Chen, Ying Zhang, Di Kang, Xuefei Zhe, Linchao Bao, and Huchuan Lu. Animatable neural radiance fields from monocular rgb video. ArXiv, abs/2106.13629, 2021.
[8] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. CVPR, 2019.
[9] Boyang Deng, John P Lewis, Timothy Jeruzalski, Gerard Pons-Moll, Geoffrey Hinton, Mohammad Norouzi, and Andrea Tagliasacchi. Nasa neural articulated shape approximation. In ECCV, 2020.
[10] Chen Geng, Sida Peng, Zhen Xu, Hujun Bao, and Xiaowei Zhou. Learning neural volumetric representations of dynamic humans in minutes. In CVPR, 2023.
[11] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[12] A. K. Grigor'ev, Artem Sevastopolsky, Alexander Vakhitov, and Victor S. Lempitsky. Coordinate-based texture inpainting for pose-guided human image generation. CVPR, pages 12127–12136, 2019.
[13] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. Stylenerf: A style-based 3d-aware generator for high-resolution image synthesis. ArXiv, abs/2110.08985, 2021.
[14] Marc Habermann, Lingjie Liu, Weipeng Xu, Michael Zollhoefer, Gerard Pons-Moll, and Christian Theobalt. Real-time deep dynamic characters. ACM Transactions on Graphics (TOG), 40:1–16, 2021.
[15] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR, pages 770–778, 2016.
[16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017.
[17] Yang Hong, Bo Peng, Haiyao Xiao, Ligang Liu, and Juyong Zhang. Headnerf: A real-time nerf-based parametric head model. ArXiv, abs/2112.05637, 2021.
[18] Tao Hu, Geng Lin, Zhizhong Han, and Matthias Zwicker. Learning to generate dense point clouds with textures on multiple categories. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2170–2179, January 2021.
[19] Tao Hu, Kripasindhu Sarkar, Lingjie Liu, Matthias Zwicker, and Christian Theobalt. Egorenderer: Rendering human avatars from egocentric camera images. In ICCV, 2021.
[20] Tao Hu, Hongyi Xu, Linjie Luo, Tao Yu, Zerong Zheng, He Zhang, Yebin Liu, and Matthias Zwicker. Hvtr++: Image and pose driven human avatars using hybrid volumetric-textural rendering. IEEE Transactions on Visualization and Computer Graphics, pages 1–15, 2023.
[21] T. Hu, Tao Yu, Zerong Zheng, He Zhang, Yebin Liu, and Matthias Zwicker. Hvtr: Hybrid volumetric-textural rendering for human avatars. 3DV, 2022.
[22] Zeng Huang, Yuanlu Xu, Christoph Lassner, Hao Li, and Tony Tung. Arch: Animatable reconstruction of clothed humans. CVPR, pages 3090–3099, 2020.
[23] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. CVPR, pages 5967–5976, 2017.
[24] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. Volume 9906, pages 694–711, 2016.
[25] James T. Kajiya and Brian Von Herzen. Ray tracing volume densities. Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, 1984.
[26] Diederick P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[27] Bernhard Kratzwald, Zhiwu Huang, Danda Pani Paudel, and Luc Van Gool. Towards an understanding of our world by GANing videos in the wild. arXiv:1711.11453, 2017.
[28] Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. Learn to dance with aist++: Music conditioned 3d dance generation, 2021.
[29] Zhe Li, Zerong Zheng, Yuxiao Liu, Boyao Zhou, and Yebin Liu. Posevocab: Learning joint-structured pose embeddings for human avatar modeling. In ACM SIGGRAPH Conference Proceedings, 2023.
[30] Lingjie Liu, Marc Habermann, Viktor Rudnev, Kripasindhu Sarkar, Jiatao Gu, and Christian Theobalt. Neural actor: Neural free-view synthesis of human actors with pose control. TOG, 40, 2021.
[31] Lingjie Liu, Weipeng Xu, Marc Habermann, Michael Zollhöfer, Florian Bernard, Hyeongwoo Kim, Wenping Wang, and Christian Theobalt. Neural human video rendering by learning dynamic textures and rendering-to-video translation. IEEE Transactions on Visualization and Computer Graphics, 2020.
[32] Lingjie Liu, Weipeng Xu, Michael Zollhoefer, Hyeongwoo Kim, Florian Bernard, Marc Habermann, Wenping Wang, and Christian Theobalt. Neural rendering and reenactment of human actor videos. ACM Transactions on Graphics (TOG), 2019.
[33] Weiyang Liu, Y. Wen, Zhiding Yu, Ming Li, B. Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. CVPR, pages 6738–6746, 2017.
[34] M. Loper, Naureen Mahmood, J. Romero, Gerard Pons-Moll, and Michael J. Black. Smpl: a skinned multi-person linear model. ACM Trans. Graph., 34:248:1–16, 2015.
[35] Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided person image generation. In NeurIPS, pages 405–415, 2017.
[36] Liqian Ma, Qianru Sun, Stamatios Georgoulis, Luc van Gool, Bernt Schiele, and Mario Fritz. Disentangled person image generation. CVPR, 2018.
[37] Qianli Ma, Shunsuke Saito, Jinlong Yang, Siyu Tang, and Michael J. Black. Scale: Modeling clothed humans with a surface codec of articulated local elements. In CVPR, 2021.
[38] Qianli Ma, Jinlong Yang, Siyu Tang, and Michael J Black. The power of points for modeling humans in clothing. In ICCV, 2021.
[39] Lars M. Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. CVPR, 2019.
[40] Mateusz Michalkiewicz, Jhony Kaesemodel Pontes, Dominic Jack, Mahsa Baktash, and Anders P. Eriksson. Deep level sets: Implicit surface representations for 3d shape inference. ArXiv, 2019.
[41] Marko Mihajlovic, Yan Zhang, Michael J Black, and Siyu Tang. Leap: Learning articulated occupancy of people. In CVPR, 2021.
[42] Natalia Neverova, Riza Alp Güler, and Iasonas Kokkinos. Dense pose transfer. ECCV, 2018.
[43] Michael Niemeyer and Andreas Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. CVPR, pages 11448–11459, 2021.
[44] Atsuhiro Noguchi, Xiao Sun, Stephen Lin, and Tatsuya Harada. Neural articulated radiance field. In IEEE/CVF ICCV, 2021.
[45] Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shechtman, Jeong Joon Park, and Ira Kemelmacher-Shlizerman. Stylesdf: High-resolution 3d-consistent image and geometry generation. ArXiv, abs/2112.11427, 2021.
[46] Pablo Palafox, Aljaž Božič, Justus Thies, Matthias Nießner, and Angela Dai. Npms: Neural parametric models for 3d deformable shapes. In IEEE/CVF ICCV, 2021.
[47] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. CVPR, 2019.
[48] Jeong Joon Park, Peter R. Florence, Julian Straub, Richard A. Newcombe, and S. Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. CVPR, pages 165–174, 2019.
[49] Sida Peng, Junting Dong, Qianqian Wang, Shangzhan Zhang, Qing Shuai, Xiaowei Zhou, and Hujun Bao. Animatable neural radiance fields for modeling dynamic human bodies. In ICCV, 2021.
[50] Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. CVPR, 2021.
[51] Sergey Prokudin, Michael J. Black, and Javier Romero. Smplpix: Neural avatars from 3d human models. WACV, 2021.
[52] Albert Pumarola, Antonio Agudo, Alberto Sanfeliu, and Francesc Moreno-Noguer. Unsupervised person image synthesis in arbitrary poses. In CVPR, June 2018.
[53] Amit Raj, Julian Tanke, James Hays, Minh Vo, Carsten Stoll, and Christoph Lassner. Anr: Articulated neural rendering for virtual avatars. CVPR, pages 3721–3730, 2021.
[54] Edoardo Remelli, Timur M. Bagautdinov, Shunsuke Saito, Chenglei Wu, Tomas Simon, Shih-En Wei, Kaiwen Guo, Zhe Cao, Fabián Prada, Jason M. Saragih, and Yaser Sheikh. Drivable volumetric avatars using texel-aligned features. ACM SIGGRAPH, 2022.
[55] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. IEEE/CVF ICCV, pages 2304–2314, 2019.
[56] Shunsuke Saito, Tomas Simon, Jason M. Saragih, and Hanbyul Joo. Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. CVPR, pages 81–90, 2020.
[57] Shunsuke Saito, Jinlong Yang, Qianli Ma, and Michael J. Black. Scanimate: Weakly supervised learning of skinned clothed avatar networks. CVPR, pages 2885–2896, 2021.
[58] Kripasindhu Sarkar, Dushyant Mehta, Weipeng Xu, Vladislav Golyanik, and Christian Theobalt. Neural re-rendering of humans from a single image. In ECCV, 2020.
[59] Aliaksandr Siarohin, Enver Sangineto, Stephane Lathuiliere, and Nicu Sebe. Deformable GANs for pose-based human image generation. In CVPR, 2018.
[60] K. Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2015.
[61] Shih-Yang Su, Frank Yu, Michael Zollhoefer, and Helge Rhodin. A-nerf: Articulated neural radiance fields for learning human shape, appearance, and pose. In NeurIPS, 2021.
[62] Ayush Tewari, Ohad Fried, Justus Thies, Vincent Sitzmann, Stephen Lombardi, Kalyan Sunkavalli, Ricardo Martin-Brualla, Tomas Simon, Jason M. Saragih, Matthias Nießner, Rohit Pandey, S. Fanello, Gordon Wetzstein, Jun-Yan Zhu, Christian Theobalt, Maneesh Agrawala, Eli Shechtman, Dan B. Goldman, and Michael Zollhofer. State of the art on neural rendering. Computer Graphics Forum, 2020.
[63] Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred neural rendering: image synthesis using neural textures. ACM Transactions on Graphics (TOG), 38, 2019.
[64] Garvita Tiwari, Nikolaos Sarafianos, Tony Tung, and Gerard Pons-Moll. Neural-gif: Neural generalized implicit functions for animating people in clothing. In ICCV, 2021.
[65] G. Varol, J. Romero, X. Martin, Naureen Mahmood, Michael J. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. CVPR, pages 4627–4635, 2017.
[66] Shaofei Wang, Marko Mihajlovic, Qianli Ma, Andreas Geiger, and Siyu Tang. Metaavatar: Learning animatable clothed human models from few depth images. NeurIPS, 2021.
[67] Shaofei Wang, Katja Schwarz, Andreas Geiger, and Siyu Tang. Arah: Animatable volume rendering of articulated human sdfs. In European Conference on Computer Vision, 2022.
[68] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. In NeurIPS, 2018.
[69] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, 2018.
[70] Zhou Wang, A. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13:600–612, 2004.
[71] Chung-Yi Weng, Brian Curless, Pratul P. Srinivasan, Jonathan T. Barron, and Ira Kemelmacher-Shlizerman. Humannerf: Free-viewpoint rendering of moving people from monocular video. ArXiv, abs/2201.04127, 2022.
[72] Feng Xu, Yebin Liu, Carsten Stoll, James Tompkin, Gaurav Bharaj, Qionghai Dai, Hans-Peter Seidel, Jan Kautz, and Christian Theobalt. Video-based characters: creating new human performances from a multi-view video database. ACM SIGGRAPH, 2011.
[73] Jae Shin Yoon, Duygu Ceylan, Tuanfeng Y. Wang, Jingwan Lu, Jimei Yang, Zhixin Shu, and Hyunjung Park. Learning motion-dependent appearance for high-fidelity rendering of dynamic humans from a single camera. CVPR, pages 3397–3407, 2022.
[74] Richard Zhang, Phillip Isola, Alexei A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. CVPR, pages 586–595, 2018.
[75] Yang Zheng, Ruizhi Shao, Yuxiang Zhang, Tao Yu, Zerong Zheng, Qionghai Dai, and Yebin Liu. Deepmulticap: Performance capture of multiple characters using sparse multiview cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6239–6249, 2021.
[76] Zerong Zheng, Tao Yu, Yebin Liu, and Qionghai Dai. Pamir: Parametric model-conditioned implicit representation for image-based human reconstruction. TPAMI, 2021.
[77] Peng Zhou, Lingxi Xie, Bingbing Ni, and Qi Tian. Cips-3d: A 3d-aware generator of gans based on conditionally-independent pixel synthesis. ArXiv, 2021.
[78] Zhen Zhu, Tengteng Huang, Baoguang Shi, Miao Yu, Bofei Wang, and Xiang Bai. Progressive pose attention transfer for person image generation. In CVPR, pages 2347–2356, 2019.