Surmo: Surface-Based 4D Motion Modeling For Dynamic Human Rendering
Abstract
Dynamic human rendering from video sequences has
representation. At the core of the paradigm is a feature encoder-decoder framework with three key components: 1) surface-based motion encoding; 2) physical motion decoding; and 3) 4D appearance decoding.

Firstly, in contrast to existing pose-guided methods [30, 49, 50, 71] that focus on static poses as a conditional variable, we extract an expressive 4D motion input from the 3D body mesh sequences obtained from the training video, which includes both a static pose, represented by a spatial 3D mesh, and its temporal dynamics. Furthermore, we notice that the non-rigid deformations of garments typically occur around the body surface instead of in a 3D volume, and hence we propose to model human motions on the body surface by projecting the extracted spatial-temporal 4D motion input to the dense 2D surface UV manifold of a clothless body template (e.g., SMPL [34]). To model temporal clothing offsets, a motion encoder is employed to lift the clothless motion features into a motion triplane that encodes both spatial and temporal motion relations in a compact 3D triplane conditioned on time-varying dynamics. The triplane is defined in the surface $u$-$v$-$h$ coordinate system, with $u$-$v$ representing the motion of the clothless body template and an extensional coordinate $h$ representing the secondary motion of clothes, i.e., the temporal clothing offsets are parameterized by a signed distance to the body surface. In this way, 4D motions can be effectively represented by a surface-based triplane.

Secondly, we propose to physically model the spatial and temporal motion dynamics in the rendering network. Specifically, with the surface-based triplane conditioned at time $t$, a motion decoder is employed to decode the motion triplane to predict the motion at the next timestep $t+1$, i.e., the spatial derivative of the motion, physically corresponding to the surface normal map, and the temporal derivative, corresponding to the surface velocity. We show that this physical motion learning significantly improves the rendering quality.

Thirdly, the motion triplanes are decoded into high-quality images in two stages: a volumetric surface-conditioned renderer that focuses on the rendering around the human body surface and filters out query points far from the body surface for efficient volumetric rendering, and a geometry-aware super-resolution module for efficient high-quality image synthesis.

We conduct a systematic analysis of how human appearances are affected by temporal dynamics, and observe that some baseline methods mainly generate pose-dependent appearances instead of time-varying appearances for free-view video generation. In addition, quantitative and qualitative experiments are performed on three datasets with a total of 9 subject sequences, including ZJU-MoCap [50], AIST++ [28], and MPII-RDDC [14], which validate the effectiveness of SurMo in different scenarios.

In summary, our contributions are:
1) A new paradigm for learning dynamic humans from videos that jointly models temporal motions and human appearances in a unified framework, and one of the early works that systematically analyzes how human appearances are affected by temporal dynamics.
2) An efficient surface-based triplane that encodes both spatial and temporal motion relations for expressive 4D motion modeling.
3) We achieve state-of-the-art results and show that our new paradigm is capable of learning high-fidelity appearances from fast motion sequences (e.g., AIST++ dance videos) and synthesizing motion-dependent shadows in challenging scenarios.

2. Related Work

Our method is closely related to many sub-fields of visual computing, and below we discuss a representative set of works.

3D Shape Representations. To capture detailed shapes of 3D objects, most recent papers utilize implicit functions [8, 9, 22, 39-41, 46, 48, 55-57, 64, 66, 75, 76] or point clouds [18, 37, 38] due to their topological flexibility. These methods aim to learn geometry from 3D datasets, whereas we synthesize human images of novel poses only from 2D RGB training images.

Rendering Humans by 2D GANs. Some existing approaches render human avatars by neural image translation, i.e., they utilize GAN [11] networks to learn a mapping from poses (given in the form of renderings of a skeleton [4, 27, 52, 59, 68, 78], a dense mesh [12, 31, 32, 42, 58, 68], or joint position heatmaps [1, 35, 36]) to human images [4, 68]. To improve temporal stability, some methods [19, 51, 53, 63] propose to utilize SMPL [34] priors for pose-guided generation. However, these methods do not reconstruct geometry explicitly and cannot handle self-occlusions effectively.

Rendering Humans by 3D-aware Renderers. For stable view synthesis, recent papers [7, 30, 44, 49, 50, 61] propose to unify geometry reconstruction with view synthesis by volume rendering, which, however, is computationally heavy. To address this issue, some recent 3D-GAN papers [5, 13, 17, 20, 21, 43, 45, 77] propose a hybrid rendering strategy for efficient geometry-aware rendering, that is, rendering low-resolution volumetric features for geometry learning and employing a super-resolution module for high-resolution image synthesis. We adopt this strategy for efficient rendering, whereas ours is distinguished by rendering articulated humans with 4D motion modeling.

UV-based Pose Representations. Some methods [20, 21, 37, 38, 54, 73] propose to project 3D posed meshes into a 2D positional map for pose encoding, which can be used for different downstream tasks, such as 3D reconstruction and novel view synthesis. [37, 38] rely on 3D supervision to learn 3D reconstructions, and they do not take as input the dynamics.
To improve the rendering quality, [20, 54] propose to utilize additional driving views as input for faithful rendering in telepresence applications. However, they do not explore how to learn temporal dynamics from pose sequences. By taking as input a 3D normal and velocity map, [73] encodes motion dynamics in their rendering network, whereas they do not explicitly learn motions, i.e., predict the motion status at the next timestep; besides, [73] is a 2D rendering method.

3. Methodology

3.1. Problem Setup

Given a sparse multi-view video of a clothed human in motion and corresponding 3D pose estimations $\{P_0, ..., P_t\}$, our goal is to synthesize time-varying appearances of the individual under novel views. Existing methods [30, 49, 50, 71] formulate the problem of human rendering as learning a representation via a feature encoder-decoder framework:

$$f_t = E_P(P_t, z_t), \qquad I_t = D_R(f_t) \tag{1}$$

where a pose encoder $E_P$ takes as input a representation of the posed body $P_t$ (e.g., 2D or 3D keypoints, or 3D body mesh vertices) and a timestamp embedding $z_t$ at time $t$, and outputs intermediate pose-dependent features $f_t$ that can be rendered by a decoder $D_R$ to reconstruct the appearance image $I_t$ of the corresponding pose at time $t$. The time embedding $z_t$ often serves as a residual that is updated passively in backpropagation by the image reconstruction loss.

However, there are two issues with this widely adopted paradigm. First, the appearance of clothed humans undergoes complex geometric transformations induced not only by the static pose $P_t$ but also by its dynamics, whereas the modeling of dynamics is ignored in Eq. 1, and the residual $z_t$ cannot expressively model physical dynamics either. Second, existing methods focus on per-image reconstruction while ignoring the temporal relations of a motion sequence in training, i.e., they sample the input pose $P_t$ and timestamp $z_t$ individually and supervise image reconstructions frame by frame, partly because defining temporal supervision in the 2D image space is challenging, especially for articulated humans that often suffer from misalignment problems due to pose estimation errors.

To solve these issues, we propose a new paradigm for learning view synthesis of dynamic humans from video sequences, which is formulated as:

$$f_t = E_M(P_t, D_t), \qquad \left\{ \frac{\partial P_{t+1}}{\partial x}, \frac{\partial P_{t+1}}{\partial t} \right\} = D_M(f_t), \qquad I_t = D_R(f_t) \tag{2}$$

where the input is represented by dynamic motions including a static pose $P_t$ and its physical dynamics $D_t$ at time $t$, and a motion encoder $E_M$ is employed to encode the inputs into motion-dependent features (Sec. 3.2). Besides, in the training stage, a physical motion decoder $D_M$ is introduced to enforce the learning of spatial and temporal relations by decoding the intermediate feature $f_t$ to predict the spatial derivative at $t+1$, physically corresponding to the surface normal $N_{t+1} = \frac{\partial P_{t+1}}{\partial x}$, and the temporal derivative at $t+1$, corresponding to the surface velocity $V_{t+1} = \frac{\partial P_{t+1}}{\partial t}$ (Sec. 3.3). $f_t$ is rendered into human images by a decoder $D_R: G_2 \circ G_1$ (Sec. 3.4). The framework is depicted in Fig. 2.

Figure 2. Framework overview. Given a set of time-varying 3D body meshes $\{P_t, ..., P_{t-n}\}$ obtained from training video sequences, we aim to synthesize high-fidelity appearances of a clothed human in motion via a feature encoder-decoder framework: Motion Encoding, and joint Motion and Appearance Decoding. 1) We take as input an expressive 4D motion representation extracted from the mesh sequences, including the 3D pose, the 3D velocity at time $t$, and the motion trajectory over the past $w$ timesteps, which encode both spatial and temporal relations of the motion sequence and are projected to the spatially aligned UV surface space. A motion encoder $E_M$ is employed to lift the 2D UV-aligned features to a 3D surface-based triplane $f_t^{uvh}$ in a UV-plus-height space, with a signed distance height to model temporal clothing offsets. 2) A motion decoder $D_M$ is designed to encourage physical motion learning in training by decoding the triplane features $f_t^{uvh}$ to predict the motion at the next timestep $t+1$, i.e., the spatial derivative (surface normal $N_{t+1}^{uv}$) and the temporal derivative (surface velocity $V_{t+1}^{uv}$) in UV space. 3) Finally, given a target camera view, the triplane $f_t^{uvh}$ is rendered into high-quality images by a volumetric surface-conditioned renderer, including low-resolution volumetric rendering by $G_1$ and efficient geometry-aware super-resolution by $G_2$.
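To make the data flow of Eq. 2 concrete, here is a minimal PyTorch-style sketch of the training-time forward pass; the module names and interfaces are illustrative assumptions rather than the released SurMo implementation.

```python
import torch.nn as nn

class SurMoSketch(nn.Module):
    """Minimal sketch of the encoder-decoder paradigm in Eq. (2)."""
    def __init__(self, enc: nn.Module, motion_dec: nn.Module, renderer: nn.Module):
        super().__init__()
        self.E_M = enc          # motion encoder: (P_t, D_t) -> f_t
        self.D_M = motion_dec   # physical motion decoder: f_t -> (N_{t+1}, V_{t+1})
        self.D_R = renderer     # appearance decoder: f_t -> image I_t

    def forward(self, pose_t, dyn_t):
        f_t = self.E_M(pose_t, dyn_t)          # motion-dependent features (Sec. 3.2)
        normal_next, vel_next = self.D_M(f_t)  # spatial / temporal derivatives at t+1 (Sec. 3.3)
        image_t = self.D_R(f_t)                # rendered image at time t (Sec. 3.4)
        return image_t, normal_next, vel_next
```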
Notation. In the following parts, we use $f_t^{space}$ to denote features in the pipeline, i.e., $f_t^{3D}$ is a 4D representation defined in 3D space and conditioned on time $t$, and $M_t^{3D}$ is the 4D motion input defined in 3D space and conditioned on $t$.

3.2. Surface-based 4D Motion Encoding

Extracting 4D Motions. We first extract 4D motions from a sequence of time-varying parametric posed body (e.g., SMPL) meshes obtained from the training video sequences. We describe the motion $M_t^{3D}$ at time $t$ as a static skeleton pose $P_t$ and its dynamics $D_t$. The dynamics at $t$ are physically determined by the current pose, the 3D velocity, and the motion trajectory of the past several timesteps, which also contribute to the time-varying appearances of the secondary motion. The pose $P_t$ is represented by the 3D vertices of the posed mesh, and the dynamics $D_t$ are parameterized by 1) the body surface velocity $V_t$, corresponding to the temporal derivative of the current pose at $t$, and 2) the motion trajectory $T_t$, which aggregates the temporal derivatives over the past several timesteps with a sliding window of size $w$ and weight $\lambda$:

$$M_t^{3D} = [P_t, D_t], \quad D_t = [V_t, T_t], \quad V_t = \frac{\partial P_t}{\partial t}, \quad T_t = P_t + \frac{1}{\sum_i \lambda^i} \sum_{i=1}^{w} V_{t-i} \, \lambda^i \tag{3}$$

Note that the motion trajectory aggregated over several consecutive timesteps makes the motion representation robust to pose estimation errors, i.e., the pose estimates for two consecutive timesteps may be identical due to pose estimation errors in practice. An ablation study of the motion trajectory representation can be found in the supp. mat.
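As a concrete illustration of Eq. 3, the sketch below computes the velocity and trajectory terms from a sequence of posed mesh vertices with finite differences; the frame spacing `dt`, window size, and decay weight values are assumptions for illustration only.

```python
import torch

def motion_dynamics(verts, t, w=3, lam=0.9, dt=1.0):
    """Compute velocity V_t and trajectory T_t (Eq. 3) from mesh vertices.

    verts: (T, N, 3) tensor of posed SMPL vertices over time.
    t:     current frame index (requires t > w).
    w:     sliding window size; lam: decay weight; dt: frame spacing.
    """
    # Finite-difference approximation of the temporal derivative dP_t/dt.
    V = lambda i: (verts[i] - verts[i - 1]) / dt

    V_t = V(t)
    # Trajectory: current pose plus a normalized, exponentially weighted sum of past velocities.
    weights = torch.tensor([lam ** i for i in range(1, w + 1)])
    T_t = verts[t] + sum(V(t - i) * weights[i - 1] for i in range(1, w + 1)) / weights.sum()
    return V_t, T_t   # dynamics D_t = [V_t, T_t]
```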
Recording 4D Motions on the Surface Manifold. Modeling the motions with temporal dynamics often requires dense observations to construct a 4D motion volume. Instead, we notice that non-rigid deformations of human geometry in motion typically occur around the body surface instead of in a 3D volume, and hence we propose to model the motions on the human body surface.
Specifically, we project the 4D motion input $M_t^{3D}$, including the spatial pose and temporal dynamics, from 3D space into a compact, spatially aligned UV space using the geometric transformation $W$ that is pre-defined by the parametric (e.g., SMPL) body template, which yields $f_t^{uv} = W M_t^{3D}$. The UV-aligned motion feature $f_t^{uv}$ faithfully preserves the articulated structures of the body topology in a compact 2D space.

Generating Surface-based Triplanes for Motion Modeling. To model clothed humans, we further employ a motion encoder $E_M$ that lifts the clothless 2D features $f_t^{uv}$ to a 3D triplane representation $f_t^{uvh}$ that is defined in a $u$-$v$-$h$ system, with $u$-$v$ representing the motion of the clothless body template and an extensional coordinate $h$ representing the secondary motion of clothes, i.e., the temporal clothing offsets. Hence, 4D clothed motions can be parameterized by a surface-based triplane:

$$f_t^{uvh} = E_M(f_t^{uv}) = E_M(W M_t^{3D}) \tag{4}$$

where $W$ is the geometric transformation from 3D space to UV space. $f_t^{uvh}$ consists of three planes $x_{uv}, x_{uh}, x_{hv} \in \mathbb{R}^{U \times V \times C}$ to form the spatial relationship, where $U, V$ denote the spatial resolution and $C$ is the channel number. In contrast to the volumetric triplanes used in EG3D [5], where the three planes represent three vertical planes in the 3D volumetric space, $f_t^{uvh}$ is defined on the human body surface, inheriting human topology priors for motion modeling, as illustrated by an unwarped triplane in 3D space in Fig. 2.
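The following sketch illustrates one plausible way to realize Eq. 4: per-vertex motion attributes are scattered into UV maps using a precomputed texel-to-vertex lookup from the SMPL UV template, and a small convolutional encoder emits the three surface planes. The channel counts, the `uv_vert_idx` lookup, and the interpretation of the triplane size reported in the appendix (256 × 256 × 48) as spatial resolution × feature channels are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

def to_uv(attr, uv_vert_idx):
    """Scatter per-vertex attributes (N, C) into a UV map (C, U, V) using a
    precomputed texel-to-vertex lookup (U, V) defined by the SMPL UV template."""
    return attr[uv_vert_idx].permute(2, 0, 1)            # (C, U, V)

class TriplaneEncoder(nn.Module):
    """Lift UV-aligned motion features f_t^uv to a surface triplane f_t^uvh (Eq. 4)."""
    def __init__(self, in_ch=9, feat_ch=32):
        super().__init__()
        # One shared backbone, three heads producing x_uv, x_uh, x_hv.
        self.backbone = nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.head_uv = nn.Conv2d(64, feat_ch, 1)
        self.head_uh = nn.Conv2d(64, feat_ch, 1)
        self.head_hv = nn.Conv2d(64, feat_ch, 1)

    def forward(self, pose_uv, vel_uv, traj_uv):
        # W(M_t^3D): UV maps of pose, velocity and trajectory, e.g. (B, 3+3+3, 256, 256).
        f_uv = torch.cat([pose_uv, vel_uv, traj_uv], dim=1)
        x = self.backbone(f_uv)
        return self.head_uv(x), self.head_uh(x), self.head_hv(x)
```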
3.3. Physical Motion Decoding

We propose a physical motion decoder $D_M$ to learn spatial and temporal features of motion, which is achieved by decoding the intermediate motion feature $f_t^{uvh}$ learned by $E_M$ to predict the motion at the next timestep $t+1$, i.e., the spatial derivative corresponding to the surface normal $N_{t+1}^{uv}$ and the temporal derivative corresponding to the surface velocity $V_{t+1}^{uv}$. Note that we model the normal and velocity in UV space:

$$\{N_{t+1}^{uv}, V_{t+1}^{uv}\} = D_M(f_t^{uvh}) = D_M \circ E_M(W M_t^{3D}) \tag{5}$$
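A minimal sketch of a motion decoder consistent with Eq. 5: a convolutional head maps the UV-aligned plane of the triplane to normal and velocity maps at $t+1$. Decoding only the $x_{uv}$ plane and the layer sizes are simplifying assumptions; the paper specifies only the inputs and outputs of $D_M$.

```python
import torch.nn as nn
import torch.nn.functional as F

class PhysicalMotionDecoder(nn.Module):
    """D_M in Eq. (5): decode triplane features into N^uv_{t+1} and V^uv_{t+1}."""
    def __init__(self, feat_ch=32):
        super().__init__()
        # The x_uv plane is already UV-aligned, so a 2D conv head suffices in this sketch.
        self.net = nn.Sequential(nn.Conv2d(feat_ch, 64, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(64, 6, 3, padding=1))

    def forward(self, x_uv):
        out = self.net(x_uv)                     # (B, 6, U, V)
        normal = F.normalize(out[:, :3], dim=1)  # unit surface normals N^uv_{t+1}
        velocity = out[:, 3:]                    # surface velocities V^uv_{t+1}
        return normal, velocity

# Training-time supervision (weights follow the appendix: lambda_velocity = lambda_normal = 1):
# loss_motion = l1(normal, normal_gt) + l1(velocity, velocity_gt)
```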
3.4. 4D Appearance Decoding

Volumetric Surface-conditioned Rendering. Given a target camera viewpoint, the triplane $f_t^{uvh}$ is first rendered into low-resolution volumetric features $I_F \in \mathbb{R}^{H \times W \times \hat{C}}$ by a conditional NeRF $F_\Phi$, where $H$ and $W$ are the spatial resolution and $\hat{C}$ is the channel number. To be more specific, given a 3D query point $p_i$, we transform it into a surface-based local coordinate $\hat{p}_i = (u_i, v_i, h_i)$ w.r.t. the tracked body mesh $P_t$. Here we search the nearest face $f_i$ of $P_t$ for the query point $p_i$, and $(u_i, v_i)$ and $h_i$ are the barycentric coordinates of the nearest point on $f_i$ and the signed distance, respectively. We therefore obtain the local feature $z_i^{uvh}$ of the query point $p_i$ as

$$z_i^{uvh} = \mathrm{cat}\left[\Pi(x_{uv}; u_i, v_i),\ \Pi(x_{uh}; u_i, h_i),\ \Pi(x_{hv}; h_i, v_i)\right] \tag{6}$$

where $\Pi(\cdot)$ denotes the sampling operation and $\mathrm{cat}[\cdot]$ denotes the concatenation operator.

Given the camera direction $d_i$, the appearance features $c_i$ and density features $\sigma_i$ of point $p_i$ are predicted by

$$\{c_i, \sigma_i\} = F_\Phi(d_i, z_i^{uvh}) \tag{7}$$

We integrate all the radiance features of the sampled points into a 2D feature map $I_F$ at time $t$ through a volume renderer.
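Below is a sketch of the query-point feature lookup of Eqs. 6-7: the three planes are bilinearly sampled at surface-local coordinates, and the concatenated feature, together with the view direction, is fed to the conditional NeRF. The `nearest_surface_point` helper for the nearest-face search and the normalization of $(u, v, h)$ to $[-1, 1]$ are assumed for illustration.

```python
import torch
import torch.nn.functional as F

def sample_plane(plane, a, b):
    """Bilinearly sample a feature plane (B, C, H, W) at normalized coords a, b in [-1, 1]."""
    grid = torch.stack([a, b], dim=-1).view(plane.shape[0], -1, 1, 2)   # (B, M, 1, 2)
    feat = F.grid_sample(plane, grid, align_corners=True)               # (B, C, M, 1)
    return feat.squeeze(-1).permute(0, 2, 1)                            # (B, M, C)

def query_features(x_uv, x_uh, x_hv, u, v, h):
    """Eq. (6): z_i^uvh = cat[ Pi(x_uv; u, v), Pi(x_uh; u, h), Pi(x_hv; h, v) ]."""
    return torch.cat([sample_plane(x_uv, u, v),
                      sample_plane(x_uh, u, h),
                      sample_plane(x_hv, h, v)], dim=-1)

# The surface-local coordinates (u_i, v_i, h_i) of query points p_i are obtained by a
# nearest-face search and barycentric interpolation on the tracked mesh P_t, e.g.:
#   u, v, h = nearest_surface_point(p, body_mesh)   # hypothetical helper, not in the paper
# The radiance features are then predicted by the conditional NeRF MLP of Eq. (7):
#   c_i, sigma_i = F_phi(view_dir, query_features(x_uv, x_uh, x_hv, u, v, h))
```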
Figure 3. Qualitative comparisons on novel view synthesis on subject S313 of the ZJU-MoCap dataset. Two motion sequences S1 (swing arms left to right) and S2 (raise and lower arms) are shown. We specifically focus on the synthesis of time-varying appearances (especially T-shirt wrinkles), by evaluating the rendering results under similar poses yet with different movement directions, which are marked in the same color, such as the pairs of ①②, ③④, and ⑤⑥. Our method synthesizes high-fidelity time-varying appearances, whereas SOTA HumanNeRF generates almost the same cloth wrinkles.
Figure 4. Qualitative comparisons on novel view synthesis on subjects S387 and S315 of the ZJU-MoCap dataset. Rows 1 and 2 show similar poses occurring at different timesteps (not consecutive frames). The results indicate that our method synthesizes time-varying appearances while other methods mainly generate pose-dependent appearances.
We evaluate all methods with four metrics: the per-pixel metrics SSIM [70] and PSNR, and the perception metrics LPIPS [74] and FID [16].

Table 1. Quantitative comparisons on the ZJU-MoCap dataset (averaged over all test views and poses on 6 sequences) for novel-view synthesis. To reduce the influence of the background, all scores are calculated from images cropped to 2D bounding boxes. Note that the perception metrics LPIPS [74] and FID [16] capture human judgment better than per-pixel metrics such as SSIM [70] or PSNR, as stated in [30, 58].

ZJU-S1-6           LPIPS ↓   FID ↓    SSIM ↑   PSNR ↑
Neural Body [50]   .164      125.10   .703     21.438
Instant-NVR [10]   .202      144.99   .738     21.748
HumanNeRF [71]     .106      88.382   .792     23.624
Ours               .075      67.725   .833     24.815

Baselines. We compare our method against SOTA methods including Neural Body [50], HumanNeRF [71], Instant-NVR [10], ARAH [67], DVA [54], and HVTR++ [20]. Neural Body models human poses in 3D space with a point cloud as the pose representation, HumanNeRF and Instant-NVR model poses in a canonical space with inverse skinning, ARAH is a forward-skinning-based method, and DVA and HVTR++ encode poses in UV space and take driving-view signals as input for telepresence applications.

4.1. Comparisons to SOTA Methods

Quantitative comparisons. We conduct quantitative comparisons on the three datasets with a total of 9 subjects. The quantitative results on ZJU-MoCap are summarized in Tab. 1, which suggests that our new paradigm outperforms the SOTA by a big margin, and we achieve the best quantitative results on all four metrics. The detailed comparisons on each subject of ZJU-MoCap and the quantitative results on MPII-RDDC and AIST++ are listed in the supp. mat. Refer to the comparisons with ARAH [67], DVA [54] and HVTR++ [20] in the supp. mat.

Time-varying Appearances with Dynamics Conditioning on ZJU-MoCap. We compare with baseline methods for novel view synthesis on two motion sequences of subject S313 from the ZJU-MoCap dataset, as shown in Fig. 3. We evaluate the capability of each method to synthesize time-varying appearances. Specifically, two sequences S1 (swing arms left to right) and S2 (raise and lower arms) are evaluated, where similar poses occurring at different timesteps with different dynamics (e.g., movement directions, trajectory, or velocity) are marked in the same color, such as the pairs of ①②, ③④, and ⑤⑥. Fig. 3 suggests that Instant-NVR [10] fails to synthesize time-varying high-frequency wrinkles, and Neural Body [50] synthesizes time-varying yet blurry wrinkles. HumanNeRF [71] synthesizes detailed yet static T-shirt wrinkles, i.e., the wrinkles are almost the same for similar poses, such as ①②, ③④, and ⑤⑥. In contrast, our method renders both high-fidelity and time-varying wrinkles. The comparisons on the other subjects S387 and S315 are shown in Fig. 4, which illustrate the effectiveness of our proposed paradigm in dynamics learning.

Time-varying Appearances with Motion-dependent Shadows on MPII-RDDC. We compare with baseline methods for novel view synthesis on MPII-RDDC, as shown in Fig. 5. The sequence is captured in a studio with top-down lighting that casts shadows on the human body due to self-occlusions. We notice that the synthesis of shadows can be formulated as learning motion-dependent shadows in the motion representation, without the need to model the lighting explicitly.
Figure 5. Novel view synthesis of time-varying appearances with both pose and lighting conditioning on the MPII-RDDC dataset. The sequence is captured in a studio with top-down lighting that casts shadows on the human performer due to self-occlusion. In Row 1, we specifically focus on synthesizing time-varying shadows (e.g., ① vs. ②, and ③ vs. ④) for different poses with different self-occlusions. In Row 2, we evaluate the synthesis of: 1) time-varying appearances for similar poses occurring in a jump-up-and-down motion sequence, e.g., ⑤ vs. ⑥, 2) shadows ⑦ vs. ⑧, and 3) clothing offsets ⑤ vs. ⑥.
Figure 7. Ablation study of the 3D volumetric triplane (Vol-Trip) vs. the surface-based triplane (Surf-Trip) for human modeling on subjects S313 and S387 from ZJU-MoCap. We focus on the convergence in training (left) and self-occlusion rendering (right). The convergence curve of S313 is shifted for visualization purposes. Vol-Trip is not effective in handling self-occlusion, e.g., ①②③④, though it performs well in another viewpoint without self-occlusions ⑥. In addition, Vol-Trip cannot synthesize high-quality facial details.
Table 2. Ablation study of motion conditioning and learning. Pcond and Dcond denote the conditioning on pose and dynamics; Vpred and Npred denote the prediction of surface velocity and surface normal in motion learning.

S313      LPIPS ↓   FID ↓    SSIM ↑   PSNR ↑
Pcond     .120      104.83   .793     22.781
+ Dcond   .085      73.674   .834     24.908
+ Vpred   .069      62.092   .856     25.845
+ Npred   .060      50.170   .869     26.654

Fig. 7 compares the results qualitatively, which indicates that Surf-Trip is more effective in handling self-occlusions, whereas Vol-Trip fails (①②③④), though it performs well in another viewpoint without self-occlusions (⑥). In addition, Surf-Trip generates high-fidelity clothing wrinkles and facial details, whereas the face rendering of Vol-Trip is blurry.

Motion Conditioning and Learning. We also analyze the effectiveness of our proposed motion conditioning and learning qualitatively and quantitatively. Tab. 2 summarizes the quantitative results, which suggest that the results are significantly improved by taking as input the dynamics conditioning Dcond. With physical motion learning, i.e., predicting the temporal derivative (surface velocity) and the spatial derivative (surface normal) at the next timestep, the quantitative results are further improved by a big margin.

The qualitative comparisons are shown in Fig. 8, where in Row 1, we observe higher-fidelity appearances with dynamics conditioning and with physical motion learning for both velocity and normal prediction. In Row 2, we evaluate whether our method learns to decouple the temporal dynamics (e.g., velocity) and static poses from the motion conditioning, which corresponds to the rendering of clothing wrinkles of the secondary motion (T-shirt ①) and tight parts (e.g., shorts ②). We only change the velocity for each variant in the ablation study. It suggests that the appearance of the T-shirt varies when the velocity or dynamics are changed, whereas the renderings of tight parts remain the same, such as the head, tight shorts, and shoes. This is consistent with our daily observations. The ablation study illustrates that our method decouples the pose and dynamics, and is capable of generating dynamics-dependent appearances. Note that the appearance changes are small between V × 1 and V × 0.01 when we reduce the velocity, partly because the method is not sensitive to smaller velocities due to the pose estimation errors in the training dataset, where consecutive frames may be estimated with the same pose fitting.

5. Discussion

We propose SurMo, a new paradigm for learning dynamic humans from videos by jointly modeling temporal motions and human appearances in a unified framework, based on an efficient surface-based triplane. We conduct a systematic analysis of how human appearances are affected by temporal dynamics, and extensive experiments validate the expressiveness of the surface-based triplane in rendering fast motions and motion-dependent shadows. Quantitative experiments illustrate that SurMo achieves SOTA results on different motion sequences.

Acknowledgement

This study is supported by the Ministry of Education, Singapore, under its MOE AcRF Tier 2 (MOET2EP20221-0012), NTU NAP, and under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from the industry partner(s).
Appendix

A. Implementation

Surface-based Triplane. The size of the triplane is 256 × 256 × 48.

Discriminator. We adopt the discriminator architecture of PatchGAN [23] for adversarial training. Note that, different from EG3D [6], which applies the image discriminator at both resolutions, we only supervise the final rendered images with adversarial training and supervise the volumetric features with a reconstruction loss.

A.2. Optimization

SurMo is trained end-to-end to optimize $E_M$, $D_M$, and the renderers $G_1$, $G_2$ with 2D image losses. Given a ground truth image $I_{gt}$, we predict a target RGB image $I_{RGB}^{+}$ with the following losses:

Pixel Loss. We enforce an $\ell_1$ loss between the generated image and the ground truth, $L_{pix} = \| I_{gt} - I_{RGB}^{+} \|_1$.

Perceptual Loss. The pixel loss is sensitive to image misalignment due to pose estimation errors, and we further use a perceptual loss [24] to measure the differences between the activations on different layers of the pre-trained VGG network [60] for the generated image $I_{RGB}^{+}$ and the ground truth image $I_{gt}$,

$$L_{vgg} = \sum_j \frac{1}{N^j} \left\| g^j(I_{gt}) - g^j(I_{RGB}^{+}) \right\|_2 \tag{10}$$

where $g^j$ is the activation and $N^j$ the number of elements of the $j$-th layer in the pretrained VGG network.

Adversarial Loss. We leverage a multi-scale discriminator $D$ [69] as an adversarial loss $L_{adv}$ to enforce the realism of the rendering, especially for cases where the estimated human poses are not well aligned with the ground truth images.

Face Identity Loss. We use a pre-trained network to ensure that the renderers preserve the face identity on the cropped faces of the generated and ground truth images.

Predicting $N_t^{uv}$ is easier for the network than predicting $N_{t+1}^{uv}$ directly, and $N_{t+1}^{uv}$ can be derived and normalized from $N_{t+1}^{uv} = \frac{\partial P_{t+1}^{uv}}{\partial x} = \frac{\partial \{P_t^{uv} + V_{t+1}^{uv}\}}{\partial x} = N_t^{uv} + \frac{\partial V_{t+1}^{uv}}{\partial x}$. With $V_{t+1}^{uv}$ predicted for temporal motion supervision, the prediction of $N_t^{uv}$ enforces a supervision similar to $N_{t+1}^{uv}$ for the spatial motion learning.

Volume Rendering Loss. We supervise the training of volume rendering at low resolution, applied on the first three channels of $I_F$: $L_{vol} = \| I_F[:3] - I_{gt}^{D} \|_2$, where $I_{gt}^{D}$ is the downsampled reference image.
The networks were trained using the Adam optimizer [26]. The loss weights $\{\lambda_{pix}, \lambda_{vgg}, \lambda_{adv}, \lambda_{face}, \lambda_{velocity}, \lambda_{normal}, \lambda_{vol}\}$ are set empirically to $\{.5, 10, 1, 5, 1, 1, 15\}$. It takes about 12 hours to train a model from about 3000 images for 200 epochs on two NVIDIA V100 GPUs.

A.3. Training Data Processing

We evaluate novel view synthesis on three datasets: ZJU-MoCap [50] (including the sequences S313, S315, S377, S386, S387, S394) at resolution 1024 × 1024, MPII-RDDC [65] at resolution 1285 × 940, and AIST++ [28] at 1920 × 1080. Note that the sequences of ZJU-MoCap used in Neural Body are generally short, e.g., only 60 frames for S313. Instead, to evaluate the time-varying effects, we extend the original training frames of S313, S315, S387, and S394 to 400, 700, 600, and 600 frames respectively, depending on the pose variance of each sequence, whereas S377 and S386 keep the same 300 frames as the setup of Neural Body [50]. For ZJU-MoCap, 4 cameras are used for training and the others are used for testing. For AIST++, 6 cameras are used for training and 3 for testing; for MPII-RDDC, 18 cameras are used for training and 9 for testing.
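For reference, the data setup described above can be condensed into a small configuration sketch; the values come from the text, while the key names are ours.

```python
DATA_CONFIG = {
    "ZJU-MoCap": {"resolution": (1024, 1024), "train_cams": 4, "test_cams": "rest",
                  "train_frames": {"S313": 400, "S315": 700, "S387": 600,
                                   "S394": 600, "S377": 300, "S386": 300}},
    "MPII-RDDC": {"resolution": (1285, 940), "train_cams": 18, "test_cams": 9},
    "AIST++":    {"resolution": (1920, 1080), "train_cams": 6, "test_cams": 3},
}
```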
Figure 9. Illustration of Volumetric Triplane vs. Surface-based Triplane.
Figure 10. Qualitative comparisons against the 3D pose- and image-driven approaches DVA [54] and HVTR++ [20] for novel view synthesis
of training poses on ZJU-MoCap. For each example, from left to right: DVA, HVTR++, Ours, Ground Truth. Rendering results of DVA
and HVTR++ are provided by the authors.
B.1. Comparisons with SOTA Methods

Comparisons with 3D pose- and image-driven approaches. In contrast to pose-driven methods (e.g., Neural Body [50], Instant-NVR [10], HumanNeRF [71]), DVA [54] and HVTR++ [20] propose to utilize both pose and driving-view features in rendering. They model both the pose and texture features in UV space, whereas ours is distinguished by modeling motions in a surface-based triplane, and we jointly learn physical motions and rendering in a unified framework.

Tab. 3 summarizes the quantitative results for novel view synthesis on the two sequences (S386 and S387) mentioned in DVA, which suggest that our method significantly outperforms DVA and HVTR++ in terms of both per-pixel and perception metrics. Qualitative comparisons are provided in Fig. 10, which shows that our method produces sharper reconstructions with more faithful wrinkles than both DVA and HVTR++. In contrast to the image resolution of 512 × 512 used in Neural Body [50], HumanNeRF [71] and Instant-NVR [10], DVA and HVTR++ were trained and evaluated at the image resolution of 1024 × 1024.
Table 3. Quantitative comparisons against the 3D pose- and image-driven approaches DVA [54] and HVTR++ [20] on the ZJU-MoCap dataset (averaged over all test views and poses) for novel view synthesis. To reduce the influence of the background, all scores are calculated from images cropped to 2D bounding boxes as used in [20]. Note that training and testing are conducted at the image resolution of 1024 × 1024, following the setup in DVA [54]. For reference, we report the quantitative results of HVTR++ and DVA from the HVTR++ paper.
Table 5. Quantitative comparisons on the MPII-RDDC dataset [14]. To reduce the influence of the background, all scores are calculated from images cropped to 2D bounding boxes.

Methods            LPIPS ↓   FID ↓    SSIM ↑   PSNR ↑
HumanNeRF [71]     .175      116.53   .615     17.443
Ours               .153      107.79   .627     18.048

Table 6. Quantitative comparisons on the S13 and S21 sequences from the AIST++ dataset [28]. To reduce the influence of the background, all scores are calculated from images cropped to 2D bounding boxes.

S13                LPIPS ↓   FID ↓    SSIM ↑   PSNR ↑
Neural Body [50]   .266      276.70   .732     17.649
Ours               .183      161.68   .751     17.488

S21                LPIPS ↓   FID ↓    SSIM ↑   PSNR ↑
Neural Body [50]   .296      333.03   .731     17.137
Ours               .205      177.36   .757     17.334

Table 7. Ablation study of motion prediction and training views.

S313               LPIPS ↓   FID ↓    SSIM ↑   PSNR ↑
w/o Pred           .085      73.674   .834     24.908
Pred_t             .073      60.942   .848     25.537
Pred_t+1           .060      50.170   .869     26.654
Pred_t+1 (1 view)  .126      112.19   .788     22.830

S387               LPIPS ↓   FID ↓    SSIM ↑   PSNR ↑
w/o Pred           .115      93.688   .761     22.152
Pred_t             .096      83.825   .790     23.083
Pred_t+1           .084      71.216   .810     23.735
Pred_t+1 (1 view)  .151      128.18   .729     21.093

Table 8. Ablation study of dynamics conditioning.

Methods                  LPIPS ↓   FID ↓    SSIM ↑   PSNR ↑
Norm. + Velo. [73]       .093      81.900   .825     24.113
Ours w/ Dcond            .085      73.674   .834     24.908
Ours w/ Vpred, Npred     .060      50.170   .869     26.654
module to synthesize high-quality images. The quantitative results are summarized in Tab. 9. It is observed that the performance improves when the upsampling factor is reduced from 4 to 2, which indicates that more geometric features are utilized by increasing the resolution of volumetric rendering.

B.4. Efficiency

At test time, our method runs at 3.2 FPS on one NVIDIA V100 GPU to render 512 × 512 resolution images, about 39× faster than Neural Body [50], 17× faster than HumanNeRF [71], and 9× faster than Instant-NVR [10].

B.5. Failure Cases

Our method fails to generate high-quality wrinkles for the complicated textures of AIST++ [28], as shown in Fig. 13. This is because we cannot learn to infer dynamic wrinkles from such complicated appearances.

Figure 13. Failure cases.

Table 10. Quantitative comparisons with Neural Body [50], Instant-NVR [10], and HumanNeRF [71] on ZJU-MoCap. Instant-NVR* and Instant-NVR are trained with 100 and 30 epochs respectively, which generates better results than the official models that were trained with 6 epochs. Qualitative results can be found in the demo video.

References

[1] Kfir Aberman, M. Shi, Jing Liao, Dani Lischinski, B. Chen, and D. Cohen-Or. Deep video-based performance cloning. Computer Graphics Forum, 38, 2019.
[2] George Borshukov, Dan Piponi, Oystein Larsen, J. P. Lewis, and Christina Tempelaar-Lietz. Universal capture: image-based facial animation for "the matrix reloaded". In SIGGRAPH '03, 2003.
[3] Joel Carranza, Christian Theobalt, Marcus A. Magnor, and Hans-Peter Seidel. Free-viewpoint video of human actors. ACM SIGGRAPH 2003 Papers, 2003.
[4] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A. Efros. Everybody dance now. ICCV, pages 5932–5941, 2019.
[5] Eric Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J. Guibas, Jonathan Tremblay, S. Khamis, Tero Karras, and Gordon Wetzstein. Efficient geometry-aware 3d generative adversarial networks. ArXiv, abs/2112.07945, 2021.
[6] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16123–16133, 2022.
[7] Jianchuan Chen, Ying Zhang, Di Kang, Xuefei Zhe, Linchao Bao, and Huchuan Lu. Animatable neural radiance fields from monocular rgb video. ArXiv, abs/2106.13629, 2021.
[8] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. CVPR, 2019.
[9] Boyang Deng, John P Lewis, Timothy Jeruzalski, Gerard Pons-Moll, Geoffrey Hinton, Mohammad Norouzi, and Andrea Tagliasacchi. Nasa neural articulated shape approximation. In ECCV, 2020.
[10] Chen Geng, Sida Peng, Zhen Xu, Hujun Bao, and Xiaowei Zhou. Learning neural volumetric representations of dynamic humans in minutes. In CVPR, 2023.
[11] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[12] A. K. Grigor'ev, Artem Sevastopolsky, Alexander Vakhitov, and Victor S. Lempitsky. Coordinate-based texture inpainting for pose-guided human image generation. CVPR, pages 12127–12136, 2019.
[13] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. Stylenerf: A style-based 3d-aware generator for high-resolution image synthesis. ArXiv, abs/2110.08985, 2021.
[14] Marc Habermann, Lingjie Liu, Weipeng Xu, Michael Zollhoefer, Gerard Pons-Moll, and Christian Theobalt. Real-time deep dynamic characters. ACM Transactions on Graphics (TOG), 40:1–16, 2021.
[15] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR, pages 770–778, 2016.
[16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017.
[17] Yang Hong, Bo Peng, Haiyao Xiao, Ligang Liu, and Juyong Zhang. Headnerf: A real-time nerf-based parametric head model. ArXiv, abs/2112.05637, 2021.
[18] Tao Hu, Geng Lin, Zhizhong Han, and Matthias Zwicker. Learning to generate dense point clouds with textures on multiple categories. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2170–2179, January 2021.
[19] Tao Hu, Kripasindhu Sarkar, Lingjie Liu, Matthias Zwicker, and Christian Theobalt. Egorenderer: Rendering human avatars from egocentric camera images. In ICCV, 2021.
[20] Tao Hu, Hongyi Xu, Linjie Luo, Tao Yu, Zerong Zheng, He Zhang, Yebin Liu, and Matthias Zwicker. Hvtr++: Image and pose driven human avatars using hybrid volumetric-textural rendering. IEEE Transactions on Visualization and Computer Graphics, pages 1–15, 2023.
[21] T. Hu, Tao Yu, Zerong Zheng, He Zhang, Yebin Liu, and Matthias Zwicker. Hvtr: Hybrid volumetric-textural rendering for human avatars. 3DV, 2022.
[22] Zeng Huang, Yuanlu Xu, Christoph Lassner, Hao Li, and Tony Tung. Arch: Animatable reconstruction of clothed humans. CVPR, pages 3090–3099, 2020.
[23] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. CVPR, pages 5967–5976, 2017.
[24] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. Volume 9906, pages 694–711, 2016.
[25] James T. Kajiya and Brian Von Herzen. Ray tracing volume densities. Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, 1984.
[26] Diederick P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[27] Bernhard Kratzwald, Zhiwu Huang, Danda Pani Paudel, and Luc Van Gool. Towards an understanding of our world by GANing videos in the wild. arXiv:1711.11453, 2017.
[28] Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. Learn to dance with aist++: Music conditioned 3d dance generation, 2021.
[29] Zhe Li, Zerong Zheng, Yuxiao Liu, Boyao Zhou, and Yebin Liu. Posevocab: Learning joint-structured pose embeddings for human avatar modeling. In ACM SIGGRAPH Conference Proceedings, 2023.
[30] Lingjie Liu, Marc Habermann, Viktor Rudnev, Kripasindhu Sarkar, Jiatao Gu, and Christian Theobalt. Neural actor: Neural free-view synthesis of human actors with pose control. TOG, 40, 2021.
[31] Lingjie Liu, Weipeng Xu, Marc Habermann, Michael Zollhöfer, Florian Bernard, Hyeongwoo Kim, Wenping Wang, and Christian Theobalt. Neural human video rendering by learning dynamic textures and rendering-to-video translation. IEEE Transactions on Visualization and Computer Graphics, 2020.
[32] Lingjie Liu, Weipeng Xu, Michael Zollhoefer, Hyeongwoo Kim, Florian Bernard, Marc Habermann, Wenping Wang, and Christian Theobalt. Neural rendering and reenactment of human actor videos. ACM Transactions on Graphics (TOG), 2019.
[33] Weiyang Liu, Y. Wen, Zhiding Yu, Ming Li, B. Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. CVPR, pages 6738–6746, 2017.
[34] M. Loper, Naureen Mahmood, J. Romero, Gerard Pons-Moll, and Michael J. Black. Smpl: a skinned multi-person linear model. ACM Trans. Graph., 34:248:1–16, 2015.
[35] Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided person image generation. In NeurIPS, pages 405–415, 2017.
[36] Liqian Ma, Qianru Sun, Stamatios Georgoulis, Luc van Gool, Bernt Schiele, and Mario Fritz. Disentangled person image generation. CVPR, 2018.
[37] Qianli Ma, Shunsuke Saito, Jinlong Yang, Siyu Tang, and Michael J. Black. Scale: Modeling clothed humans with a surface codec of articulated local elements. In CVPR, 2021.
[38] Qianli Ma, Jinlong Yang, Siyu Tang, and Michael J Black. The power of points for modeling humans in clothing. In ICCV, 2021.
[39] Lars M. Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. CVPR, 2019.
[40] Mateusz Michalkiewicz, Jhony Kaesemodel Pontes, Dominic Jack, Mahsa Baktash, and Anders P. Eriksson. Deep level sets: Implicit surface representations for 3d shape inference. ArXiv, 2019.
[41] Marko Mihajlovic, Yan Zhang, Michael J Black, and Siyu Tang. Leap: Learning articulated occupancy of people. In CVPR, 2021.
[42] Natalia Neverova, Riza Alp Güler, and Iasonas Kokkinos. Dense pose transfer. ECCV, 2018.
[43] Michael Niemeyer and Andreas Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. CVPR, pages 11448–11459, 2021.
[44] Atsuhiro Noguchi, Xiao Sun, Stephen Lin, and Tatsuya Harada. Neural articulated radiance field. In IEEE/CVF ICCV, 2021.
[45] Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shechtman, Jeong Joon Park, and Ira Kemelmacher-Shlizerman. Stylesdf: High-resolution 3d-consistent image and geometry generation. ArXiv, abs/2112.11427, 2021.
[46] Pablo Palafox, Aljaž Božič, Justus Thies, Matthias Nießner, and Angela Dai. Npms: Neural parametric models for 3d deformable shapes. In IEEE/CVF ICCV, 2021.
[47] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. CVPR, 2019.
[48] Jeong Joon Park, Peter R. Florence, Julian Straub, Richard A. Newcombe, and S. Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. CVPR, pages 165–174, 2019.
[49] Sida Peng, Junting Dong, Qianqian Wang, Shangzhan Zhang, Qing Shuai, Xiaowei Zhou, and Hujun Bao. Animatable neural radiance fields for modeling dynamic human bodies. In ICCV, 2021.
[50] Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. CVPR, 2021.
[51] Sergey Prokudin, Michael J. Black, and Javier Romero. Smplpix: Neural avatars from 3d human models. WACV, 2021.
[52] Albert Pumarola, Antonio Agudo, Alberto Sanfeliu, and Francesc Moreno-Noguer. Unsupervised person image synthesis in arbitrary poses. In CVPR, June 2018.
[53] Amit Raj, Julian Tanke, James Hays, Minh Vo, Carsten Stoll, and Christoph Lassner. Anr: Articulated neural rendering for virtual avatars. CVPR, pages 3721–3730, 2021.
[54] Edoardo Remelli, Timur M. Bagautdinov, Shunsuke Saito, Chenglei Wu, Tomas Simon, Shih-En Wei, Kaiwen Guo, Zhe Cao, Fabián Prada, Jason M. Saragih, and Yaser Sheikh. Drivable volumetric avatars using texel-aligned features. ACM SIGGRAPH, 2022.
[55] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. IEEE/CVF ICCV, pages 2304–2314, 2019.
[56] Shunsuke Saito, Tomas Simon, Jason M. Saragih, and Hanbyul Joo. Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. CVPR, pages 81–90, 2020.
[57] Shunsuke Saito, Jinlong Yang, Qianli Ma, and Michael J. Black. Scanimate: Weakly supervised learning of skinned clothed avatar networks. CVPR, pages 2885–2896, 2021.
[58] Kripasindhu Sarkar, Dushyant Mehta, Weipeng Xu, Vladislav Golyanik, and Christian Theobalt. Neural re-rendering of humans from a single image. In ECCV, 2020.
[59] Aliaksandr Siarohin, Enver Sangineto, Stephane Lathuiliere, and Nicu Sebe. Deformable GANs for pose-based human image generation. In CVPR, 2018.
[60] K. Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2015.
[61] Shih-Yang Su, Frank Yu, Michael Zollhoefer, and Helge Rhodin. A-nerf: Articulated neural radiance fields for learning human shape, appearance, and pose. In NeurIPS, 2021.
[62] Ayush Tewari, Ohad Fried, Justus Thies, Vincent Sitzmann, Stephen Lombardi, Kalyan Sunkavalli, Ricardo Martin-Brualla, Tomas Simon, Jason M. Saragih, Matthias Nießner, Rohit Pandey, S. Fanello, Gordon Wetzstein, Jun-Yan Zhu, Christian Theobalt, Maneesh Agrawala, Eli Shechtman, Dan B. Goldman, and Michael Zollhofer. State of the art on neural rendering. Computer Graphics Forum, 2020.
[63] Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred neural rendering: image synthesis using neural textures. ACM Transactions on Graphics (TOG), 38, 2019.
[64] Garvita Tiwari, Nikolaos Sarafianos, Tony Tung, and Gerard Pons-Moll. Neural-gif: Neural generalized implicit functions for animating people in clothing. In ICCV, 2021.
[65] G. Varol, J. Romero, X. Martin, Naureen Mahmood, Michael J. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. CVPR, pages 4627–4635, 2017.
[66] Shaofei Wang, Marko Mihajlovic, Qianli Ma, Andreas Geiger, and Siyu Tang. Metaavatar: Learning animatable clothed human models from few depth images. NeurIPS, 2021.
[67] Shaofei Wang, Katja Schwarz, Andreas Geiger, and Siyu Tang. Arah: Animatable volume rendering of articulated human sdfs. In European Conference on Computer Vision, 2022.
[68] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. In NeurIPS, 2018.
[69] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, 2018.
[70] Zhou Wang, A. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13:600–612, 2004.
[71] Chung-Yi Weng, Brian Curless, Pratul P. Srinivasan, Jonathan T. Barron, and Ira Kemelmacher-Shlizerman. Humannerf: Free-viewpoint rendering of moving people from monocular video. ArXiv, abs/2201.04127, 2022.
[72] Feng Xu, Yebin Liu, Carsten Stoll, James Tompkin, Gaurav Bharaj, Qionghai Dai, Hans-Peter Seidel, Jan Kautz, and Christian Theobalt. Video-based characters: creating new human performances from a multi-view video database. ACM SIGGRAPH, 2011.
[73] Jae Shin Yoon, Duygu Ceylan, Tuanfeng Y. Wang, Jingwan Lu, Jimei Yang, Zhixin Shu, and Hyunjung Park. Learning motion-dependent appearance for high-fidelity rendering of dynamic humans from a single camera. CVPR, pages 3397–3407, 2022.
[74] Richard Zhang, Phillip Isola, Alexei A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. CVPR, pages 586–595, 2018.
[75] Yang Zheng, Ruizhi Shao, Yuxiang Zhang, Tao Yu, Zerong Zheng, Qionghai Dai, and Yebin Liu. Deepmulticap: Performance capture of multiple characters using sparse multiview cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6239–6249, 2021.
[76] Zerong Zheng, Tao Yu, Yebin Liu, and Qionghai Dai. Pamir: Parametric model-conditioned implicit representation for image-based human reconstruction. TPAMI, 2021.
[77] Peng Zhou, Lingxi Xie, Bingbing Ni, and Qi Tian. Cips-3d: A 3d-aware generator of gans based on conditionally-independent pixel synthesis. ArXiv, 2021.
[78] Zhen Zhu, Tengteng Huang, Baoguang Shi, Miao Yu, Bofei Wang, and Xiang Bai. Progressive pose attention transfer for person image generation. In CVPR, pages 2347–2356, 2019.