
The Visual Computer

https://doi.org/10.1007/s00371-019-01755-x

ORIGINAL ARTICLE

Latent transformations neural network for object view synthesis


Sangpil Kim · Nick Winovich · Hyung-Gun Chi · Guang Lin · Karthik Ramani
Purdue University, West Lafayette, IN 47906, USA
Correspondence: Karthik Ramani (ramani@purdue.edu); Sangpil Kim (kim2030@purdue.edu)

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Abstract
We propose a fully convolutional conditional generative neural network, the latent transformation neural network, capable
of rigid and non-rigid object view synthesis using a lightweight architecture suited for real-time applications and embedded
systems. In contrast to existing object view synthesis methods which incorporate conditioning information via concatenation,
we introduce a dedicated network component, the conditional transformation unit. This unit is designed to learn the latent
space transformations corresponding to specified target views. In addition, a consistency loss term is defined to guide the
network toward learning the desired latent space mappings, a task-divided decoder is constructed to refine the quality of
generated views of objects, and an adaptive discriminator is introduced to improve the adversarial training process. The
generalizability of the proposed methodology is demonstrated on a collection of three diverse tasks: multi-view synthesis on
real hand depth images, view synthesis of real and synthetic faces, and the rotation of rigid objects. The proposed model
is shown to be comparable with the state-of-the-art methods in structural similarity index measure and L1 metrics while
simultaneously achieving a 24% reduction in the compute time for inference of novel images.

Keywords Object view synthesis · Latent transformation · Fully convolutional · Conditional generative model

1 Introduction

The task of synthesizing novel views of objects from a single reference frame/view is an important problem which has a variety of practical applications in computer vision, graphics, and robotics. In computer vision, view synthesis can be used to generate 3D point cloud representations of objects from a single input image [40]; in the context of hand pose estimation algorithms, generating additional synthetic views can also help reduce occlusion and improve the accuracy of the estimated poses [12,14]. In computer graphics, view synthesis has been used to apply changes in lighting and viewpoint to single-view images of faces [24]. In robotics, synthetic views can be used to help predict unobserved part locations and improve the performance of object grasping with manipulators [42]. However, synthesizing novel views from a single input image is a formidable task, with serious complications arising from the complexity of the target object and the presence of heavily self-occluded parts.

Generative models have been shown to provide effective frameworks for representing complex, structured datasets and generating realistic samples from underlying data distributions [7,13]. This concept has also been extended to form conditional models capable of sampling from conditional distributions in order to allow certain properties of the generated data to be controlled or selected [26,43]. Generative models without encoders [5,46] are generally used to sample from broad classes of the data distributions; however, these models are not designed to incorporate input data and therefore cannot preserve characteristic features of specified input data. Models have also been proposed which incorporate encoding components to overcome this by learning to map input data onto an associated latent space representation within a generative framework [21,25]. The resulting inference models allow for the defining features of inputs to be preserved while specified target properties are adjusted through conditioning [45].

Conventional conditional models have largely relied on rather simple methods, such as concatenation, for implementing this conditioning process; however, cGANs [27] have shown that utilizing the conditioning information in a less trivial, more methodical manner has the potential to significantly improve the performance of conditional generative models.

In this work, we provide a general framework for effectively performing inference with conditional generative models by strategically controlling the interaction between conditioning information and latent representations within a generative inference model. In this framework, a conditional transformation unit (CTU), Φ, is introduced to provide a means for navigating the underlying manifold structure of the latent space. The CTU is realized in the form of a collection of convolutional layers which are designed to approximate the latent space operators defined by mapping encoded inputs to the encoded representations of specified targets (see Fig. 1). This is enforced by introducing a consistency loss term to guide the CTU mappings during training. In addition, a conditional discriminator unit (CDU), Ψ, also realized as a collection of convolutional layers, is included in the network's discriminator. This CDU is designed to improve the network's ability to identify and eliminate transformation-specific artifacts in the network's predictions.

Fig. 1  The conditional transformation unit Φ constructs a collection of mappings {Φ_k} in the latent space which produce object view changes to the decoded outputs. Conditioning information is used to select the appropriate convolutional weights ω_k for the specified transformation; the encoding l_x of the original input image x is transformed to l̂_yk = Φ_k(l_x) = conv(l_x, ω_k), which provides an approximation to the encoding l_yk of the attribute-modified target image y_k

The network has also been equipped with RGB balance parameters consisting of three values {θ_R, θ_G, θ_B} designed to give the network the ability to quickly adjust the global color balance of the images it produces to better align with that of the true data distribution. In this way, the network is easily able to remove unnatural hues and focus on estimating local pixel values by adjusting the three RGB parameters rather than correcting each pixel individually. In addition, we introduce a novel estimation strategy for efficiently learning shape and color properties simultaneously; a task-divided decoder (TD) is designed to produce a coarse pixel value map along with a refinement map in order to split the network's overall task into distinct, dedicated network components.

Summary of contributions:

1. We introduce the conditional transformation unit, with a family of modular filter weights, to learn high-level mappings within a low-dimensional latent space. In addition, we present a consistency loss term which is used to guide the transformations learned during training.
2. We propose a novel framework for 3D object view synthesis which separates the generative process into distinct network components dedicated to learning (i) coarse pixel value estimates, (ii) a pixel refinement map, and (iii) the global RGB color balance of the dataset.
3. We introduce the conditional discriminator unit, designed to improve adversarial training by identifying and eliminating transformation-specific artifacts present in the generated images.

Each contribution proposed above has been shown to provide a significant improvement to the network's overall performance through a series of ablation studies. The resulting latent transformation neural network (LTNN) is placed through a series of comparative studies on a diverse range of experiments where it is seen to be comparable with the existing state-of-the-art models for (i) simultaneous multi-view synthesis of real hand depth images in real time, (ii) the synthesis of rotated views of rigid objects given a single image, and (iii) object view synthesis and attribute modification of real and synthetic faces.

Moreover, the CTU conditioning framework allows for additional conditioning information, or target views, to be added to the training procedure ad infinitum without any increase to the network's inference speed.

2 Related work

Conditional generative models have been widely used in computer vision areas such as geometric prediction [30,34,40,48] and non-rigid object modification such as human face deformation [1,11,33,47]. Dosovitskiy et al. [8] proposed a supervised, conditional generative model trained to generate images of chairs, tables, and cars with specified attributes which are controlled by transformation and view parameters passed to the network. MV3D [40] is pioneering deep learning work for object view synthesis which uses an encoder–decoder network to directly generate the pixels of a target view, with depth information in the loss function and viewpoint information passed as a conditional term. The appearance flow network (AFN) [48] proposed a method for view synthesis of objects by predicting appearance flow fields, which are used to move pixels from an input view to a target view. However, this method requires detailed camera pose information and is not capable of predicting pixels which are missing in the source views. The M2N of [38] proposed view prediction using a recurrent network and a self-learned confidence map, iteratively synthesizing views with a recurrent pixel generator and appearance flow.
TVSN [30] uses a visibility map, which indicates the visible parts in a target image, to identify occlusion in different views. However, this method requires mesh models for each object in order to extract visibility maps for training the network. The DFN by Jia et al. [20] proposed using a dynamic filter which is conditioned on a sequence of previous frames; this is fundamentally different from our method, since the filter is applied to the original inputs rather than the latent embeddings. Moreover, it relies on temporal information and is not applicable for predictions given a single image. The IterGAN model introduced by Galama and Mensink [10] is also designed to synthesize novel views from a single image, with a specific emphasis on the synthesis of rotated views of objects in small, iterative steps. The conditional variational autoencoder (CVAE) incorporates conditioning information into the standard variational autoencoder (VAE) framework [23] and is capable of synthesizing specified attribute changes in an identity-preserving manner [37,45]. Other works have introduced a clamping strategy to enforce a specific organizational structure in the latent space [24,33]; these networks require extremely detailed labels for supervision, such as the graphics code parameters used to create each example, and are therefore very difficult to implement for more general tasks (e.g., training with real images). These models are all reliant on additional knowledge for training, such as depth information, camera poses, or mesh models, and are not applicable in embedded systems and real-time applications due to their high computational demand and large parameter counts, since these methods did not consider the efficiency of the model.

CVAE-GAN [2] further adds adversarial training to the CVAE framework in order to improve the quality of generated predictions. The work of Zhang et al. [47] introduced the conditional adversarial autoencoder (CAAE), designed to model age progression/regression in human faces. This is achieved by concatenating conditioning information (i.e., age) with the input's latent representation before proceeding to the decoding process. The framework also includes an adaptive discriminator with conditional information passed using a resize/concatenate procedure. To the best of our knowledge, all existing conditional generative models designed for inference use fixed hidden layers and concatenate conditioning information directly with latent representations. In contrast to these existing methods, the proposed model incorporates conditioning information by defining dedicated, transformation-specific convolutional layers at the latent level. This conditioning framework allows the network to synthesize multiple transformed views from a single input, while retaining a fully convolutional structure which avoids the dense connections used in existing inference-based conditional models. Most significantly, the proposed LTNN framework is shown to be comparable with the state-of-the-art models in a diverse range of object view synthesis tasks, while requiring substantially fewer FLOPs and less memory for inference than other methods.

3 Latent transformation neural network

In this section, we introduce the methods used to define the proposed LTNN model. We first give a brief overview of the LTNN network structure. We then detail how conditional transformation unit mappings are defined and trained to operate on the latent space, followed by a description of the conditional discriminator unit implementation and the network loss function used to guide the training process. Lastly, we describe the task division framework used for the decoding process.

The basic workflow of the proposed model is as follows (a minimal code sketch of steps 1–5 is given after the list):

1. Encode the input image x to a latent representation l_x = Encode(x).
2. Use conditioning information k to select conditional convolutional filter weights ω_k.
3. Map the latent representation l_x to l̂_yk = Φ_k(l_x) = conv(l_x, ω_k), an approximation of the encoded latent representation l_yk of the specified target image y_k.
4. Decode l̂_yk to obtain a coarse pixel value map and a refinement map.
5. Scale the channels of the pixel value map by the RGB balance parameters and take the Hadamard product with the refinement map to obtain the final prediction ŷ_k.
6. Pass real images y_k as well as generated images ŷ_k to the discriminator, and use the conditioning information to select the discriminator's conditional filter weights ω_k.
7. Compute the loss and update the weights using ADAM optimization and backpropagation.
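The sketch below is a minimal NumPy mock-up of steps 1–5 of this workflow. The encoder, decoder, and CTU are random placeholders, and the latent size (2 × 2 × 64) and output resolution (64 × 64 × 3) are assumptions for illustration; only the data flow — conditioning index selects the CTU weights, decoder emits a value map and a refinement map, RGB balance plus Hadamard product assembles the prediction — is meant to mirror the text.

```python
import numpy as np

# Assumed shapes (not taken from the paper's exact configuration).
LATENT = (2, 2, 64)      # latent feature size at the bottleneck
IMG = (64, 64, 3)        # output resolution

def encode(x):
    return np.random.randn(*LATENT)                 # step 1: l_x = Encode(x) (placeholder)

def decode(latent):
    value_map = np.random.rand(*IMG)                # coarse pixel value estimates (placeholder)
    refine_map = np.random.rand(*IMG)               # refinement (masking) map (placeholder)
    return value_map, refine_map                    # step 4

def ctu(latent, omega_k):
    # One CTU mapping; a channel-mixing matrix stands in for the 3x3 convolution.
    return np.tensordot(latent, omega_k, axes=([-1], [0]))

omegas = {k: np.random.randn(64, 64) * 0.1 for k in range(9)}   # per-condition weights
theta_rgb = np.ones(3)                                          # RGB balance parameters

def predict(x, k):
    l_hat = ctu(encode(x), omegas[k])               # steps 2-3: condition k selects its weights
    value_map, refine_map = decode(l_hat)
    return theta_rgb * value_map * refine_map       # step 5: RGB balance, then Hadamard product

print(predict(np.zeros(IMG), k=3).shape)            # (64, 64, 3)
```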
3.1 Conditional transformation unit

Generative models have frequently been designed to explicitly disentangle the latent space in order to enable high-level attribute modification through linear, latent space interpolation. This linear latent structure is imposed by design decisions, however, and may not be the most natural way for a network to internalize features of the data distribution. Several approaches have been proposed which include nonlinear layers for processing conditioning information at the latent space level. In these conventional conditional generative frameworks, conditioning information is introduced by combining features extracted from the input with features extracted from the conditioning information (often using dense connection layers); these features are typically combined using standard vector concatenation, although some have opted to use channel concatenation.
In particular, conventional approaches for incorporating conditional information generally fall into three classes: (1) apply a fully connected layer before and after concatenating a vector storing conditional information [24,40,47,48], (2) flatten the network features and concatenate them with a vector storing conditional information [30], and (3) tile a conditional vector to create a two-dimensional array with the same shape as the network features and concatenate channel-wise [2,38]. Since the first class is more prevalent than the others in practice, we have subdivided this class into four cases: FC-Concat-FC [47], FC-Concat-2FC [24], 2FC-Concat-FC [48], and 2FC-Concat-2FC [40]. Six of these conventional conditional network designs are illustrated in Fig. 2 along with the proposed LTNN network design for incorporating conditioning information.

Fig. 2  Selected methods for incorporating conditioning information; the proposed LTNN method is illustrated on the left, and six conventional alternatives are shown to the right

Rather than directly concatenating conditioning information with network features, we propose using a conditional transformation unit (CTU), consisting of a collection of distinct convolutional mappings in the network's latent space. More specifically, the CTU maintains independent convolution kernel weights for each target view under consideration. Conditioning information is used to select which collection of kernel weights, i.e., which CTU mapping, should be used in the CTU convolutional layer to perform a specified transformation. In addition to the convolutional kernel weights, each CTU mapping incorporates a Swish activation [32] with independent parameters for each specified target view. The kernel weights and Swish parameters of each CTU mapping are selectively updated by controlling the gradient flow based on the conditioning information provided.

The CTU mappings are trained to transform the encoded, latent space representation of the network's input in a manner which produces high-level view or attribute changes upon decoding. In this way, different angles of view, light directions, and deformations, for example, can be generated from a single input image. In one embodiment, the training process for the conditional transformation units can be designed to form a semigroup {Φ_t}_{t≥0} of operators:

    Φ_0 = id   and   Φ_{t+s} = Φ_t ∘ Φ_s   for all t, s ≥ 0,    (1)

defined on the latent space and trained to follow the geometric flow corresponding to a specified attribute. In the context of rotating three-dimensional objects, for example, the transformation units are trained on input images paired with several target outputs corresponding to different angles of rotation; the network then uses conditioning information, which specifies the angle by which the object should be rotated, to select the appropriate transformation unit. In this context, the semigroup criteria correspond to the fact that rotating an object by 10° twice should align with the result of rotating the object by 20° once.

Since the encoder and decoder are not influenced by the specified angle of rotation, the network's encoding/decoding structure learns to model objects at different angles simultaneously; the single, low-dimensional latent representation of the input contains all of the information required to produce rotated views of the original object. Other embodiments can depart from this semigroup formulation, however, training conditional transformation units to instead produce a more diverse collection of non-sequential viewpoints, for example, as is the case for multi-view hand synthesis.
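To make the CTU structure concrete, the sketch below implements one conditional mapping as a 3 × 3 same-padded convolution followed by a Swish activation with a per-condition parameter, selected by the conditioning index. The channel count, number of views, and initialization are assumptions for illustration; the convolution is written in plain NumPy rather than with the paper's TensorFlow layers.

```python
import numpy as np

def swish(x, beta):
    # Swish activation with a per-condition trainable parameter beta: x * sigmoid(beta * x).
    return x / (1.0 + np.exp(-beta * x))

def conv3x3_same(features, kernel):
    # features: (H, W, C_in), kernel: (3, 3, C_in, C_out); stride 1, zero padding.
    H, W, _ = features.shape
    C_out = kernel.shape[-1]
    padded = np.pad(features, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, C_out))
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + 3, j:j + 3, :]               # 3x3 receptive field
            out[i, j] = np.tensordot(patch, kernel, axes=3)   # sum over 3 x 3 x C_in
    return out

# Hypothetical CTU: independent 3x3 kernels and Swish parameters for each target view.
num_views, C = 9, 64
ctu_kernels = [np.random.randn(3, 3, C, C) * 0.01 for _ in range(num_views)]
ctu_betas = [1.0 for _ in range(num_views)]

def ctu_apply(latent, k):
    # Conditioning index k selects which kernel/Swish pair is used (and, in training,
    # which weights receive gradients).
    return swish(conv3x3_same(latent, ctu_kernels[k]), ctu_betas[k])

latent = np.random.randn(2, 2, C)
print(ctu_apply(latent, k=4).shape)   # (2, 2, 64)
```

Under the semigroup formulation of Eq. 1, one would additionally expect ctu_apply(ctu_apply(l, k_10deg), k_10deg) to approximate ctu_apply(l, k_20deg) after training; the consistency loss introduced next is what pushes the mappings toward such behavior.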
To enforce this behavior on the latent space CTU mappings in practice, a consistency term is introduced into the loss function, as specified in Eq. 2. This loss term is minimized precisely when the CTU mappings behave as depicted in Fig. 1; in particular, the output of the CTU mapping associated with a particular transformation is designed to match the encoding of the associated ground truth target view. More precisely, given an input image x, the consistency loss associated with the kth transformation is defined in terms of the ground truth, transformed target view y_k by:

    L_consist = ‖Φ_k(Encode[x]) − Encode[y_k]‖_1.    (2)

LTNN Training Procedure

Provide: a labeled dataset (x, {y_k}_{k∈T}) with targets indexed by a fixed set T; encoder weights θ_E; decoder weights θ_D; RGB balance parameters {θ_R, θ_G, θ_B}; conditional transformation unit weights {ω_k}_{k∈T}; a discriminator D with standard weights θ_D and conditionally selected CDU weights {ω_k}_{k∈T}; and loss function hyperparameters γ, ρ, λ, κ corresponding to the smoothness, reconstruction, adversarial, and consistency loss terms, respectively. The individual loss components are defined in Sect. 3.2.

     1: procedure Train( )
     2:   // Sample input and targets from the training set
     3:   x, {y_k}_{k∈T} = get_train_batch()
     4:   // Compute the encoding of the original input image
     5:   l_x = Encode[x]
     6:   for k in T do
     7:     // Compute the true encoding of the specified target image
     8:     l_yk = Encode[y_k]
     9:     // Compute the approximate encoding of the target with the CTU
    10:     l̂_yk = conv(l_x, ω_k)
    11:     // Compute the RGB value and refinement maps
    12:     ŷ_k^value, ŷ_k^refine = Decode[l̂_yk]
    13:     // Assemble the final network prediction for the target
    14:     ŷ_k = [ θ_C · ŷ_{k,C}^value ⊙ ŷ_{k,C}^refine ]_{C∈{R,G,B}}
    15:
    16:     // Update encoder, decoder, RGB, and CTU weights
    17:     L_adv = −log(D(ŷ_k, ω_k))
    18:     L_guide = γ · L_smooth(ŷ_k) + ρ · L_recon(ŷ_k, y_k)
    19:     L_consist = ‖l̂_yk − l_yk‖_1
    20:     L = λ · L_adv + L_guide + κ · L_consist
    21:     for θ in {θ_E, θ_D, θ_R, θ_G, θ_B, ω_k} do
    22:       θ = θ − ∇_θ L
    23:
    24:     // Update discriminator and CDU weights
    25:     L_adv^D = −log(D(y_k, ω_k)) − log(1 − D(ŷ_k, ω_k))
    26:     for θ in {θ_D, ω_k} do
    27:       θ = θ − ∇_θ L_adv^D

3.2 Discriminator and loss function

The discriminator used in the adversarial training process is also passed conditioning information which specifies the transformation which the model has attempted to make. The conditional discriminator unit (CDU), which is implemented as a convolutional layer with modular weights similar to the CTU, is trained to specifically identify unrealistic artifacts which are being produced by the corresponding conditional transformation unit mappings. This is accomplished by maintaining independent convolutional kernel weights for each specified target view and using the conditioning information passed to the discriminator to select the kernel weights for the CDU layer. The incorporation of this context-aware discriminator structure has significantly boosted the performance of the network (see Table 1). The discriminator, D, is trained using the adversarial loss term L_adv^D defined in Eq. 3. The proposed model uses the adversarial loss in Eq. 4 to effectively capture multimodal distributions [36], which helps to sharpen the generated views:

    L_adv^D = −log D(y_k, ω_k) − log(1 − D(ŷ_k, ω_k))    (3)
    L_adv = −log D(ŷ_k, ω_k).    (4)

Reducing the total variation is widely used in view synthesis methods [30,47]. In particular, the L_smooth term is used to reduce noise in the generated images by reducing the variation between neighboring pixels, which is inspired by total variation image denoising. Experimental evidence shows that the inclusion of the L_smooth loss term leads to an improvement in the overall quality of the synthesized images (see Table 1). We have experimented with various shift sizes and found that a shift size of τ = 1 yields the best performance.

Additional loss terms corresponding to accurate structural reconstruction and smoothness [19] in the generated views are defined in Eqs. 5 and 6:

    L_recon = ‖ŷ_k − y_k‖_2^2    (5)
    L_smooth = Σ_{i∈{0,±1}} Σ_{j∈{0,±1}} ‖ŷ_k − τ_{i,j} ŷ_k‖_1,    (6)

where y_k is the modified target image corresponding to an input x, ω_k are the weights of the CDU mapping corresponding to the kth transformation, Φ_k is the CTU mapping for the kth transformation, ŷ_k = Decode[Φ_k(Encode[x])] is the network prediction, and τ_{i,j} is the two-dimensional, discrete shift operator. The final loss function for the encoder and decoder components is given by:

    L = λ · L_adv + ρ · L_recon + γ · L_smooth + κ · L_consist    (7)

with hyperparameters typically selected so that λ, ρ ≫ γ, κ. The consistency loss is designed to guide the CTU mappings toward approximations of the latent space mappings which connect the latent representations of input images and target images, as depicted in Fig. 1. In particular, the consistency term enforces the condition that the transformed encoding, l̂_yk = Φ_k(Encode[x]), approximates the encoding of the kth target image, l_yk = Encode[y_k], during the training process.
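To pin down the arithmetic of Eqs. 2–7, the sketch below evaluates each loss term on placeholder arrays. The tensors and discriminator scores are random stand-ins, np.roll is used as a circular-shift stand-in for the discrete shift operator τ_{i,j}, and the hyperparameter values are the ones reported in Sect. 4; everything else about the real training loop (batching, gradients, optimizer) is omitted.

```python
import numpy as np

def l_consist(l_hat, l_target):
    # Eq. 2: L1 distance between the CTU-transformed encoding and the target encoding.
    return np.abs(l_hat - l_target).sum()

def l_recon(y_hat, y):
    # Eq. 5: squared L2 reconstruction error.
    return ((y_hat - y) ** 2).sum()

def l_smooth(y_hat):
    # Eq. 6 with tau = 1: L1 difference between the prediction and its eight unit shifts.
    total = 0.0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            shifted = np.roll(np.roll(y_hat, di, axis=0), dj, axis=1)  # circular-shift stand-in
            total += np.abs(y_hat - shifted).sum()
    return total

def l_adv_generator(d_fake):
    # Eq. 4: generator-side adversarial loss, d_fake = D(y_hat, omega_k) in (0, 1).
    return -np.log(d_fake)

def l_adv_discriminator(d_real, d_fake):
    # Eq. 3: discriminator-side loss with the conditionally selected CDU weights.
    return -np.log(d_real) - np.log(1.0 - d_fake)

# Placeholder tensors and the hyperparameters from Sect. 4.
lam, rho, gamma, kappa = 0.8, 0.2, 0.000025, 0.00005
y, y_hat = np.random.rand(64, 64, 3), np.random.rand(64, 64, 3)
l_target, l_hat = np.random.randn(2, 2, 64), np.random.randn(2, 2, 64)
d_real, d_fake = 0.8, 0.3

total = (lam * l_adv_generator(d_fake) + rho * l_recon(y_hat, y)
         + gamma * l_smooth(y_hat) + kappa * l_consist(l_hat, l_target))   # Eq. 7
print(float(total), float(l_adv_discriminator(d_real, d_fake)))
```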
Table 1  Ablation/comparison results of six different conventional alternatives for fusing condition information into the latent space, and ablation study of the conditional transformation unit (CTU), conditional discriminator unit (CDU), and task-divided decoder (TD)

Model                    Elevation       Azimuth         Light direction   Age
                         SSIM    L1      SSIM    L1      SSIM    L1        SSIM    L1
LTNN (CTU + CDU + TD)    .923    .107    .923    .108    .941    .093      .925    .102
LTNN w/o L_smooth        .918    .118    .921    .114    .935    .112      .911    .110
CTU + CDU                .901    .135    .908    .125    .921    .121      .868    .118
CTU                      .889    .142    .878    .135    .901    .131      .831    .148
Channel Concat + Conv    .803    .179    .821    .173    .816    .182      .780    .188
2-FC + Concat + 2-FC     .674    .258    .499    .355    .779    .322      .686    .243
2-FC + Concat + FC       .691    .233    .506    .358    .787    .316      .687    .240
FC + Concat + 2-FC       .673    .261    .500    .360    .774    .346      .683    .249
FC + Concat + FC         .681    .271    .497    .355    .785    .315      .692    .246
Reshape + Concat + FC    .671    .276    .489    .357    .780    .318      .685    .251

For a valid comparison, we used an identical encoder, decoder, and training procedure with the synthetic face dataset. Bold values indicate the best performance

3.3 Task-divided decoder

The decoding process has been divided into three tasks: estimating the refinement map, the pixel values, and the RGB color balance of the dataset. We have found that this decoupled framework for estimation helps the network converge to better minima and produce sharp, realistic outputs without additional loss terms. The decoding process begins with a series of convolutional layers followed by bilinear interpolation to upsample the low-resolution latent information. The last component of the decoder's upsampling process consists of two distinct convolutional layers used for the task division: one layer is allocated for predicting the refinement map, while the other is trained to predict pixel values.

Fig. 3  Proposed task-divided design for the LTNN decoder. The coarse pixel value estimation map is split into RGB channels, rescaled by the RGB balance parameters, and multiplied element-wise by the refinement map values to produce the final network prediction

Fig. 4  The proposed network structure for the encoder/decoder (left) and discriminator (right) for 64 × 64 input images. Features have been color-coded according to the type of layer which has produced them. The CTU and CDU components both store and train separate collections of 3 × 3 filter weights for each conditional transformation; in particular, the number of distinct 3 × 3 filters associated with the CTU and CDU corresponds to the number of distinct conditional transformations the network is designed to produce. For 256 × 256 input images, we have added two Block v1/MaxPool layers at the front of the encoder and two Conv/Interpolation layers at the end of the decoder
The refinement map layer incorporates a sigmoidal activation function which outputs scaling factors intended to refine the coarse pixel value estimations; the pixel value estimation layer does not use an activation, so that the output values are not restricted to the range of a specific activation function. RGB balance parameters, consisting of three trainable variables, are used as weights for balancing the color channels of the pixel value map. The Hadamard product, ⊙, of the refinement map and the RGB-rescaled value map serves as the network's final output:

    ŷ = [ŷ_R, ŷ_G, ŷ_B]   where   ŷ_C = θ_C · ŷ_C^value ⊙ ŷ_C^refine   for C ∈ {R, G, B}.    (8)

In this way, the network has the capacity to mask values which lie outside of the target object (i.e., by setting refinement map values to zero), which allows the value map to focus on the object itself during the training process. Experimental results show that the refinement maps learn to produce masks which closely resemble the target objects' shapes and have sharp drop-offs along the boundaries. No additional information has been provided to the network for training the refinement map; the masking behavior illustrated in Figs. 3 and 6 is learned implicitly by the network during training and is made possible by the design of the network's architecture. As shown in Fig. 3, the refinement map produces a shape mask and masks out per-pixel errors by suppressing values which lie outside of the target object (i.e., by setting refinement map values to zero).

4 Architecture details

An overview of the pipeline is shown in Fig. 4. Input images are passed through a Block v1 collaborative filter layer (see Fig. 5) along with a max pooling layer to produce the features at the far left end of the figure. At the bottleneck between the encoder and decoder, a conditional transformation unit (CTU) is applied to map the 2 × 2 latent features directly to the transformed 2 × 2 latent features on the right. This CTU is implemented as a convolutional layer with 3 × 3 filter weights selected based on the conditioning information provided to the network. The features near the end of the decoder component are processed by two independent layers, transposed convolution for non-rigid objects and bilinear interpolation for rigid objects: one corresponding to the value estimation map and the other corresponding to the refinement map. The channels of the value estimation map are rescaled by the RGB balance parameters, and the Hadamard product is taken with the refinement map to produce the final network output. For the rigid object experiment, we added a hyperbolic tangent activation function after the Hadamard product to bound the output values to the range [−1, 1]. The CDU is designed to have the same 3 × 3 kernel size as the CTU and is applied between the third and fourth layers of the discriminator. For the stereo face dataset [9] experiment, we added an additional Block v1 layer in the encoder and an additional convolutional layer followed by bilinear interpolation in the decoder to utilize the full 128 × 128 × 3 resolution images, and two Block v1 layers and two convolutional layers followed by bilinear interpolation for the 256 × 256 × 3 resolution images of rigid object views.

Fig. 5  Layer definitions for Block v1 and Block v2 collaborative filters. Once the total number of output channels, N_out, is specified, the remaining N_out − N_in output channels are allocated to the non-identity filters (where N_in denotes the number of input channels). For the Block v1 layer at the start of the proposed LTNN model, for example, the input is an image with N_in = 3 channels and the specified number of output channels is N_out = 32. One of the 32 channels is accounted for by the identity component, and the remaining 29 channels are allocated to the three non-identity filters. When the remaining channel count is not divisible by 3, we allocate the remainder of the output channels to the single 3 × 3 convolutional layer. Swish activation functions are used for each filter; however, the filters with multiple convolutional layers do not use activation functions for the intermediate 3 × 3 convolutional layers

The encoder incorporates two main block layers, as defined in Fig. 5, which are designed to provide efficient feature extraction; these blocks follow a similar design to that proposed by Szegedy et al. [39], but include dense connections between blocks, as introduced by Huang et al. [16]. We normalize the output of each network layer using the batch normalization method described in [18]. For the decoder, we have opted for a minimalist design, inspired by the work of [31]. Standard convolutional layers with 3 × 3 filters and same padding are used through the penultimate decoding layer, and transpose convolutional layers, with 1 × 1 filters for non-rigid objects and 5 × 5 filters for the other experiments and same padding, are used to produce the value estimation and refinement maps. All parameters have been initialized using the variance scaling initialization method described in [15].

Our method has been implemented and developed using the TensorFlow framework. The models have been trained using stochastic gradient descent (SGD) with the ADAM optimizer [22] and initial parameters learning_rate = 0.005, β1 = 0.9, and β2 = 0.999 (as defined in the TensorFlow API r1.6 documentation for tf.train.AdamOptimizer), along with loss function hyperparameters λ = 0.8, ρ = 0.2, γ = 0.000025, and κ = 0.00005 (as introduced in Eq. 7). The discriminator is updated once every two encoder/decoder updates, and one-sided label smoothing [36] has been used to improve the stability of the discriminator training procedure.
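As a concrete reading of the channel allocation rule described in the Fig. 5 caption above, the helper below splits the requested output channels between the identity path and the three non-identity filters, sending any remainder to the single 3 × 3 convolution. The caption is ambiguous about exactly how the N_out − N_in non-identity channels are divided among the three filters, so the even split with the remainder assigned to the 3 × 3 branch is an assumption; the helper is bookkeeping only, not the layer itself.

```python
def block_v1_allocation(n_in, n_out):
    # Identity path carries the n_in input channels through unchanged (assumption);
    # the remaining n_out - n_in channels are split across the three non-identity
    # filters, with any remainder given to the single 3x3 convolution.
    base, rem = divmod(n_out - n_in, 3)
    return {"identity": n_in,
            "conv3x3": base + rem,
            "filter_2": base,
            "filter_3": base}

print(block_v1_allocation(3, 32))
# {'identity': 3, 'conv3x3': 11, 'filter_2': 9, 'filter_3': 9}
```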
Fig. 6  Qualitative comparison of 360° view prediction of rigid objects. A single image, shown in the first column of the "Ground" row, is used as the input for the network. Results are shown for the proposed network with and without task division ("w/o TD") as well as a comparison with M2N. The pixel value map and refinement maps corresponding to the task division framework are also provided, as well as an inverted view of the refinement map for better visibility

5 Experiments and results

We conduct experiments on a diverse collection of datasets including both rigid and non-rigid objects. To show the generalizability of our method, we have conducted a series of experiments: (i) hand pose estimation using a synthetic training set and real NYU hand depth image data [41] for testing, (ii) synthesis of rotated views of rigid objects using the 3D object dataset [4], (iii) synthesis of rotated views using a real face dataset [9], and (iv) the modification of a diverse range of attributes on a synthetic face dataset [17]. For each experiment, we have trained the models using 80% of the datasets. Since ground truth target depth images were not available for the real hand dataset, an indirect metric has been used to quantitatively evaluate the model, as described in Sect. 5.2. Ground truth data were available for all other experiments, and models were evaluated directly using the L1 mean pixel-wise error and the structural similarity index measure (SSIM) [44] used in [30,38]. To evaluate the proposed framework against existing works, two comparison groups have been formed: conditional inference methods, CVAE-GAN [2] and CAAE [47], with comparable hourglass structures, for comparison on experiments with non-rigid objects; and view synthesis methods, MV3D [40], M2N [38], AFN [48], and TVSN [30], for comparison on experiments with rigid objects. Additional ablation experiments have been performed to compare the proposed CTU conditioning method with other conventional concatenation methods (see Fig. 2); results are shown in Fig. 9 and Table 1.
Table 2  FLOPs and parameter counts corresponding to inference for a single image with resolution 256 × 256 × 3

Model   Parameters (million)   GFLOPs/image
Ours    17.0                   2.183
M2N     127.1                  341.404
TVSN    57.3                   2.860
AFN     70.3                   2.671
MV3D    69.7                   3.056

These calculations are based on code provided by the authors and the definitions prescribed in the associated papers. Smaller numbers are better for parameters and GFLOPs/image. Bold values indicate the best performance

Table 3  Quantitative comparison for 360° view synthesis of rigid objects

Model           Car              Chair
                SSIM    L1       SSIM    L1
Ours            .902    .121     .897    .178
Ours (w/o TD)   .861    .187     .871    .261
M2N             .923    .098     .895    .181
TVSN            .913    .119     .894    .230
AFN             .877    .148     .891    .240
MV3D            .875    .139     .895    .248

Smaller numbers are better for L1 and higher numbers are better for SSIM. We performed an ablation experiment with and without the task-divided decoder (TD) and compared with other methods. Bold values indicate the best performance

Fig. 7  Quantitative evaluation for multi-view hand synthesis using the real NYU dataset

Fig. 8  Comparison of CVAE-GAN (top) with the proposed LTNN model (bottom) using the noisy NYU hand dataset [41]. The input depth-map hand pose image is shown to the far left, followed by the network predictions for nine synthesized view points. The views synthesized using LTNN are seen to be sharper and also yield higher accuracy for pose estimation (see Fig. 11)

5.1 Experiment on rigid objects

Rigid object experiment  We have experimented with novel 3D view synthesis tasks given a single view of an object with an arbitrary pose. The goal of this experiment is to synthesize an image of the object after a specified transformation or change in viewpoint has been applied to the original view. To evaluate our method in the context of rigid objects, we have performed a collection of tests on the chair and car datasets. Given a single input view of an object, we leverage the LTNN model to produce 360° views of the object. We have tested our model's ability to perform 360° view estimation on 3D objects and compared the results with the other state-of-the-art methods. The models are trained on the same dataset used in M2N [38]. The car and chair categories from the ShapeNet [3] 3D model repository have been rotated horizontally 18 times by 20°, along with elevation changes of 0°, 10°, and 20°. The M2N and TVSN results are slightly better for the car category; however, these works have incorporated skip connections between the encoder layers and decoder layers, as proposed in U-Net [35], which substantially increases the computational demand for these networks (see Table 2). As can be seen in Tables 2 and 3, the proposed model is comparable with existing models specifically designed for the task of multi-view prediction while requiring the fewest FLOPs for inference compared with all other methods. The low computational cost of the LTNN model highlights the efficiency of the CTU/CDU framework for incorporating conditional information into the network for view synthesis. Moreover, as seen in the qualitative results provided in Fig. 6, using a task-divided decoder helps to eliminate artifacts in the generated views; in particular, the spokes on the back of the chair and the spoiler on the back of the car are seen to be synthesized much more clearly when using a task-divided decoder.

5.2 Experiment on non-rigid objects

Hand pose experiment  To assess the performance of the proposed network on non-rigid objects, we consider the problem of hand pose estimation. As the number of available view points of a given hand is increased, the task of estimating the associated hand pose becomes significantly easier [14].
Motivated by this fact, we synthesize multiple views of a hand given a single view and evaluate the accuracy of the estimated hand pose using the synthesized views. The underlying assumption of the assessment is that the accuracy of the hand pose estimation will be improved precisely when the synthesized views provide faithful representations of the true hand pose. Since ground truth predictions for the real NYU hand dataset were not available, the LTNN model has been trained using a synthetic dataset generated using 3D mesh hand models. The NYU dataset does, however, provide ground truth coordinates for the input hand pose; using this, we were able to indirectly evaluate the performance of the model by assessing the accuracy of a hand pose estimation method using the network's multi-view predictions as input. More specifically, the LTNN model was trained to generate nine different views, which were then fed into the pose estimation network from Choi et al. [6] (also trained using the synthetic dataset). For the evaluation metric, the maximum error in the predicted joint locations has been computed for each frame (i.e., each hand pose in the dataset). The cumulative number of frames with maximum error below a threshold distance D has then been computed, as is commonly used in hand pose estimation tasks [6,29]. A comparison of the pose estimation results using synthetic views generated by the proposed model, the CVAE-GAN model, and the CAAE model is presented in Fig. 7, along with the results obtained by performing pose estimation using the single-view input frame alone. In particular, for a threshold distance D = 40 mm, the proposed model yields the highest accuracy, with 61.98% of the frames having all predicted joint locations within a distance of 40 mm from the ground truth values. The second highest accuracy is achieved by the CVAE-GAN model, with 45.70% of frames predicted within the 40 mm threshold.
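For readers who want the evaluation metric pinned down, the sketch below computes it under the stated assumptions: a frame counts as correct at threshold D if its maximum per-joint Euclidean error is below D, and the curve reports the fraction of correct frames as D varies. The frame count, joint count, and the predictions themselves are random placeholders standing in for the pose estimator's outputs on the synthesized views.

```python
import numpy as np

def fraction_within_threshold(pred, gt, thresholds_mm):
    # pred, gt: (num_frames, num_joints, 3) joint coordinates in millimeters.
    joint_err = np.linalg.norm(pred - gt, axis=-1)      # per-joint error, (num_frames, num_joints)
    max_err = joint_err.max(axis=-1)                    # worst joint in each frame
    return [(max_err <= D).mean() for D in thresholds_mm]

# Placeholder data; 14 joints per frame is an assumption for illustration.
rng = np.random.default_rng(0)
gt = rng.uniform(-100, 100, size=(500, 14, 3))
pred = gt + rng.normal(scale=15.0, size=gt.shape)
print(fraction_within_threshold(pred, gt, thresholds_mm=[20, 40, 60, 80]))
```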
A comparison of the quantitative hand pose estimation results is provided in Fig. 7, where the proposed LTNN framework is seen to provide a substantial improvement over existing methods; qualitative results are also available in Fig. 8. Ablation study results for assessing the impact of individual components of the LTNN model are also provided in Fig. 9; in particular, we note that the inclusion of the CTU, CDU, and task-divided decoder each provides significant improvements to the performance of the network. With regard to real-time applications, the proposed model runs at 114 fps without batching and at 1975 fps when applied to a mini-batch of size 128 (using a single TITAN Xp GPU and an Intel i7-6850K CPU).

Fig. 9  LTNN ablation experiment results and comparison with alternative conditioning frameworks using the synthetic hand dataset. Our models: conditional transformation unit (CTU), conditional discriminator unit (CDU), task-divided decoder (TD), and LTNN consisting of all previous components. Alternative concatenation methods: channel-wise concatenation (CH Concat), fully connected concatenation (FC Concat), and reshaped fully connected feature vector concatenation (RE Concat)

Real face experiment  We have also conducted an experiment using a real face dataset to show the applicability of LTNN to real images. The stereo face database [9], consisting of images of 100 individuals from 10 different viewpoints, was used for experiments with real faces. These faces were first segmented using the method of [28], and we then manually cleaned up the failure cases. The cleaned faces have been cropped and centered to form the final dataset. The LTNN model was trained to synthesize images of input faces corresponding to three consecutive horizontal rotations. Qualitative results for the real face experiment are provided in Fig. 10; in particular, we note that the quality of the views generated by the proposed LTNN model is consistent for each of the four views, while the quality of the views generated using other methods decreases substantially as the change in angle is increased. This illustrates the advantage of using CTU mappings to navigate the latent space and avoid the accumulation of errors inherent to iterative methods. Moreover, as shown in Figs. 11 and 12, the LTNN model provides substantial improvements over alternative methods with respect to the SSIM and L1 metrics and converges much faster as well.

Fig. 10  Qualitative evaluation for view synthesis of real faces using the image dataset [9]
Fig. 11  Quantitative evaluation with SSIM of model performances for the experiment on the real face dataset [9]. Higher values are better

Fig. 12  Quantitative evaluation with L1 of model performances for the experiment on the real face dataset [9]. Lower values are better

5.3 Diverse attribute exploration

To evaluate the proposed framework's performance on a more diverse range of attribute modification tasks, a synthetic face dataset and other conditional generative models, CVAE-GAN and CAAE, with comparable hourglass structures to the LTNN model, have been selected for comparison. The generated images from the LTNN model are available in Fig. 13. These models have been trained to synthesize discrete changes in elevation, azimuth, light direction, and age from a single image. As shown in Tables 4 and 5, the LTNN model outperforms the CVAE-GAN and CAAE models by a significant margin in both SSIM and L1 metrics; additional quantitative results are provided in Table 1, along with a collection of ablation results for the LTNN model.

Fig. 13  Simultaneous learning of multiple attribute modifications. Azimuth and age (left), light and age (center), and light and azimuth (right) combined modifications are shown. The network has been trained using four CTU mappings per attribute (e.g., four azimuth mappings and four age mappings); results shown have been generated by composing CTU mappings in the latent space and decoding

Multiple attributes can also be modified simultaneously using LTNN by composing CTU mappings. For example, one can train four CTU mappings {Φ_k^light}_{k=0}^{3} corresponding to incremental changes in lighting and four CTU mappings {Φ_k^azim}_{k=0}^{3} corresponding to incremental changes in azimuth. In this setting, the network predictions for lighting and azimuth changes correspond to the values of Decode[Φ_k^light(l_x)] and Decode[Φ_k^azim(l_x)], respectively (where l_x denotes the encoding of the original input image). To predict the effect of simultaneously changing both lighting and azimuth, we can compose the associated CTU mappings in the latent space; that is, we may take our network prediction for the lighting change associated with Φ_i^light combined with the azimuth change associated with Φ_j^azim to be:
    ŷ = Decode[l̂_y]   where   l̂_y = (Φ_i^light ∘ Φ_j^azim)(l_x) = Φ_i^light(Φ_j^azim(l_x)).    (9)

Table 4  Quantitative results for light direction and age modification on the synthetic face dataset

Model       Light direction    Age
            SSIM     L1        SSIM     L1
Ours        .941     .093      .925     .102
CVAE-GAN    .824     .209      .848     .166
CAAE        .856     .270      .751     .207

Bold values indicate the best performance

Table 5  Quantitative results for azimuth and elevation modification on the synthetic face dataset

Model       Elevation          Azimuth
            SSIM     L1        SSIM     L1
Ours        .923     .107      .923     .108
CVAE-GAN    .864     .158      .863     .180
CAAE        .777     .175      .521     .338

Bold values indicate the best performance

5.4 Near-continuous attribute modification

Near-continuous attribute modification is also possible within the proposed framework; this can be performed by a simple, piecewise-linear interpolation procedure in the latent space. For example, we can train nine CTU mappings {Φ_k}_{k=0}^{8} corresponding to incremental 7° changes in elevation {θ_k}_{k=0}^{8}. The network predictions for an elevation change of θ_0 = 0° and θ_1 = 7° are then given by the values Decode[Φ_0(l_x)] and Decode[Φ_1(l_x)], respectively (where l_x denotes the encoding of the input image). To predict an elevation change of 3.5°, we can perform linear interpolation in the latent space between the representations Φ_0(l_x) and Φ_1(l_x); that is, we may take our network prediction for the intermediate change of 3.5° to be:

    ŷ = Decode[l̂_y]   where   l̂_y = 0.5 · Φ_0(l_x) + 0.5 · Φ_1(l_x).    (10)

More generally, we can interpolate between the latent CTU map representations to predict a change θ via:

    ŷ = Decode[l̂_y]   where   l̂_y = λ · Φ_k(l_x) + (1 − λ) · Φ_{k+1}(l_x),    (11)

where k ∈ {0, ..., 7} and λ ∈ [0, 1] are chosen so that θ = λ · θ_k + (1 − λ) · θ_{k+1}. In this way, the proposed framework naturally allows for continuous attribute changes to be approximated while only requiring training for a finite collection of discrete changes. Qualitative results for near-continuous attribute modification on the synthetic face dataset are provided in Fig. 14; in particular, we note that the views generated by the network effectively model gradual changes in the attributes without any noticeable degradation in quality. This highlights the fact that the model has learned a smooth latent space structure which can be navigated effectively by the CTU mappings while maintaining the identities of the original input faces.

Fig. 14  Near-continuous attribute modification is attainable using piecewise-linear interpolation in the latent space. Provided a grayscale image (corresponding to the faces on the far left), modified images corresponding to changes in light direction (first), age (second), azimuth (third), and elevation (fourth) are produced with 17 degrees of variation. These attribute-modified images have been produced using nine CTU mappings, corresponding to varying degrees of modification, and linearly interpolating between the discrete transformation encodings in the latent space
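The two mechanisms above — composition of CTU mappings for joint attribute changes (Eq. 9) and piecewise-linear interpolation between neighboring CTU outputs for intermediate changes (Eqs. 10–11) — act purely on latent representations. The sketch below shows both, with the decoder and the CTU maps replaced by identity/random-linear stand-ins and the latent dimension chosen arbitrarily; only the latent-space algebra is intended to mirror the text.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 64
decode = lambda l: l                              # placeholder decoder

# Stand-ins for trained CTU mappings: random linear maps on the latent space.
make_phi = lambda: (lambda W: (lambda l: l @ W))(rng.normal(size=(dim, dim)) * 0.1)
phi_light = [make_phi() for _ in range(4)]        # four lighting increments
phi_azim = [make_phi() for _ in range(4)]         # four azimuth increments
phi_elev = [make_phi() for _ in range(9)]         # nine elevation increments of 7 degrees

l_x = rng.normal(size=(dim,))                     # encoding of the input image

# Eq. 9: joint lighting + azimuth change by composing the two CTU maps in latent space.
y_joint = decode(phi_light[1](phi_azim[2](l_x)))

# Eqs. 10-11: approximate an elevation change of theta degrees by interpolating
# between the two nearest trained increments theta_k = 7k.
def elevation_change(theta_deg):
    k = min(int(theta_deg // 7), 7)
    lam = 1.0 - (theta_deg - 7 * k) / 7.0          # so that theta = lam*theta_k + (1-lam)*theta_{k+1}
    l_y = lam * phi_elev[k](l_x) + (1.0 - lam) * phi_elev[k + 1](l_x)
    return decode(l_y)

y_35 = elevation_change(3.5)                      # halfway between the 0 and 7 degree mappings
print(y_joint.shape, y_35.shape)
```

For theta_deg = 3.5 the interpolation reduces to the 0.5/0.5 blend of Eq. 10, and for any trained increment it reduces to the corresponding single CTU output.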
6 Conclusion

In this work, we have introduced an effective, general framework for incorporating conditioning information into inference-based generative models. We have proposed a modular approach to incorporating conditioning information using CTUs and a consistency loss term, defined an efficient task-divided decoder setup for deconstructing the data generation process into manageable subtasks, and shown that a context-aware discriminator can be used to improve the performance of the adversarial training process. The performance of this framework has been assessed on a diverse range of tasks and shown to perform comparably with the state-of-the-art methods while reducing computational operations and memory consumption.

Acknowledgements  Karthik Ramani acknowledges the US National Science Foundation Awards NRI-1637961 and IIP-1632154. Guang Lin acknowledges the US National Science Foundation Awards DMS-1555072, DMS-1736364 and DMS-1821233. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agency. We gratefully appreciate the support of NVIDIA Corporation with the donation of GPUs used for this research.

References

1. Antipov, G., Baccouche, M., Dugelay, J.L.: Face aging with conditional generative adversarial networks (2017). arXiv preprint arXiv:1702.01983
2. Bao, J., Chen, D., Wen, F., Li, H., Hua, G.: CVAE-GAN: fine-grained image generation through asymmetric training (2017). arXiv preprint arXiv:1703.10155
3. Chang, A., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: An information-rich 3D model repository. 1(7), 8 (2015). arXiv preprint arXiv:1512.03012
4. Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., Yu, F.: ShapeNet: an information-rich 3D model repository. Technical Report, Stanford University–Princeton University–Toyota Technological Institute at Chicago (2015). arXiv:1512.03012 [cs.GR]
5. Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2172–2180 (2016)
6. Choi, C., Kim, S., Ramani, K.: Learning hand articulations by hallucinating heat distribution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3104–3113 (2017)
7. Dinerstein, J., Egbert, P.K., Cline, D.: Enhancing computer graphics through machine learning: a survey. Vis. Comput. 23(1), 25–43 (2007)
8. Dosovitskiy, A., Tobias Springenberg, J., Brox, T.: Learning to generate chairs with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1538–1546 (2015)
9. Fransens, R., Strecha, C., Van Gool, L.: Parametric stereo for multi-pose face recognition and 3D-face modeling. In: International Workshop on Analysis and Modeling of Faces and Gestures, pp. 109–124. Springer (2005)
10. Galama, Y., Mensink, T.: Iterative GANs for rotating visual objects (2018)
11. Gauthier, J.: Conditional generative adversarial nets for convolutional face generation. Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter Semester 2014(5), 2 (2014)
12. Ge, L., Liang, H., Yuan, J., Thalmann, D.: Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3593–3601 (2016)
13. Goodfellow, I.J.: NIPS 2016 tutorial: generative adversarial networks (2017). CoRR arXiv:1701.00160
14. Guan, H., Chang, J.S., Chen, L., Feris, R.S., Turk, M.: Multi-view appearance-based 3D hand pose estimation. In: 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06), pp. 154–154. IEEE (2006)
15. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)
16. Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks (2016). arXiv preprint arXiv:1608.06993
17. IEEE: A 3D face model for pose and illumination invariant face recognition (2009)
18. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456 (2015)
19. Jason, J.Y., Harley, A.W., Derpanis, K.G.: Back to basics: unsupervised learning of optical flow via brightness constancy and motion smoothness. In: Computer Vision—ECCV 2016 Workshops, pp. 3–10. Springer (2016)
20. Jia, X., De Brabandere, B., Tuytelaars, T., Gool, L.V.: Dynamic filter networks. In: Advances in Neural Information Processing Systems, pp. 667–675 (2016)
21. Kim, S., Kim, D., Choi, S.: CityCraft: 3D virtual city creation from a single image. Vis. Comput. (2019). https://doi.org/10.1007/s00371-019-01701-x
22. Kingma, D., Ba, J.: Adam: a method for stochastic optimization (2014). arXiv preprint arXiv:1412.6980
23. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes (2013). arXiv preprint arXiv:1312.6114
24. Kulkarni, T.D., Whitney, W.F., Kohli, P., Tenenbaum, J.: Deep convolutional inverse graphics network. In: Advances in Neural Information Processing Systems, pp. 2539–2547 (2015)
25. Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., Frey, B.: Adversarial autoencoders (2015). arXiv preprint arXiv:1511.05644
26. Mirza, M., Osindero, S.: Conditional generative adversarial nets (2014). arXiv preprint arXiv:1411.1784
27. Miyato, T., Koyama, M.: cGANs with projection discriminator (2018). arXiv preprint arXiv:1802.05637
28. Nirkin, Y., Masi, I., Tuan, A.T., Hassner, T., Medioni, G.: On face segmentation, face swapping, and face perception. In: 2018 13th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2018), pp. 98–105. IEEE (2018)
29. Oberweger, M., Lepetit, V.: DeepPrior++: improving fast and accurate 3D hand pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 585–594 (2017)
30. Park, E., Yang, J., Yumer, E., Ceylan, D., Berg, A.C.: Transformation-grounded image generation network for novel 3D view synthesis. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 702–711. IEEE (2017)
31. Paszke, A., Chaurasia, A., Kim, S., Culurciello, E.: ENet: a deep neural network architecture for real-time semantic segmentation (2016). arXiv preprint arXiv:1606.02147
32. Ramachandran, P., Zoph, B., Le, Q.V.: Swish: a self-gated activation function (2017). arXiv preprint arXiv:1710.05941
33. Reed, S., Sohn, K., Zhang, Y., Lee, H.: Learning to disentangle factors of variation with manifold interaction. In: International Conference on Machine Learning, pp. 1431–1439 (2014)
34. Rezende, D.J., Eslami, S.A., Mohamed, S., Battaglia, P., Jaderberg, M., Heess, N.: Unsupervised learning of 3D structure from images. In: Advances in Neural Information Processing Systems, pp. 4996–5004 (2016)
35. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer (2015)
36. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems, pp. 2234–2242 (2016)
37. Sohn, K., Lee, H., Yan, X.: Learning structured output representation using deep conditional generative models. In: Advances in Neural Information Processing Systems, pp. 3483–3491 (2015)
38. Sun, S.H., Huh, M., Liao, Y.H., Zhang, N., Lim, J.J.: Multi-view to novel view: synthesizing novel views with self-learned confidence. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 155–171 (2018)
39. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
40. Tatarchenko, M., Dosovitskiy, A., Brox, T.: Multi-view 3D models from single images with a convolutional network. In: European Conference on Computer Vision, pp. 322–337. Springer (2016)
41. Tompson, J., Stein, M., Lecun, Y., Perlin, K.: Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Graph. 33(5), 169 (2014)
42. Varley, J., DeChant, C., Richardson, A., Ruales, J., Allen, P.: Shape completion enabled robotic grasping. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2442–2447. IEEE (2017)
43. Wang, Q., Artières, T., Chen, M., Denoyer, L.: Adversarial learning for modeling human motion. Vis. Comput. (2018). https://doi.org/10.1007/s00371-018-1594-7
44. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
45. Yan, X., Yang, J., Sohn, K., Lee, H.: Attribute2Image: conditional image generation from visual attributes. In: European Conference on Computer Vision, pp. 776–791. Springer (2016)
46. Zhang, S., Han, Z., Lai, Y.K., Zwicker, M., Zhang, H.: Stylistic scene enhancement GAN: mixed stylistic enhancement generation for 3D indoor scenes. Vis. Comput. 35(6–8), 1157–1169 (2019)
47. Zhang, Z., Song, Y., Qi, H.: Age progression/regression by conditional adversarial autoencoder. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5810–5818 (2017)
48. Zhou, T., Tulsiani, S., Sun, W., Malik, J., Efros, A.A.: View synthesis by appearance flow. In: European Conference on Computer Vision, pp. 286–301. Springer (2016)

Publisher's Note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Sangpil Kim is a Ph.D. student in Electrical and Computer Engineering at Purdue University. He received his B.S. degree from Korea University, South Korea, in 2015. His current research interests are in computer vision and deep learning.

Nick Winovich is a Ph.D. candidate in the Department of Mathematics at Purdue University. He earned his B.A. in Mathematics and Spanish at the University of Notre Dame in 2012 and subsequently received an M.S. in Mathematics at the University of Oregon in 2015. His current research focuses on the intersection of probability theory and numerical computing, with an emphasis on applications of Gaussian processes and neural network models for partial differential equations.

Hyung-gun Chi is a Master's student in Mechanical Engineering at Purdue University. He received his B.S. degree from the School of Mechanical Engineering, Yonsei University, South Korea, in 2017. His current research interests lie at the intersection of computer vision and robotics.
Guang Lin is the Director of the Purdue Data Science Consulting Service and Associate Professor of Mathematics and of Mechanical Engineering at Purdue University, with a courtesy appointment in Statistics. He earned his B.S. in mechanics from Zhejiang University in 1997, and an M.S. and a Ph.D. in applied mathematics from Brown University in 2004 and 2007, respectively. He has received many awards from the National Science Foundation and other organizations. He has served as an Associate Editor of SIAM Multiscale Modeling and Simulation and on the editorial boards of many international journals. In 2019, he received the University Faculty Scholar award from Purdue University.

Karthik Ramani is the Donald W. Feddersen Professor in the School of Mechanical Engineering at Purdue University, with courtesy appointments in Electrical and Computer Engineering and the College of Education. He earned his B.Tech from the Indian Institute of Technology, Madras, in 1985, an M.S. from Ohio State University in 1987, and a Ph.D. from Stanford University in 1991, all in Mechanical Engineering. His research interests are in collaborative intelligence, human–machine interactions, spatial interfaces, deep shape learning, and manufacturing productivity. He has published recently in ACM [CHI & UIST], IEEE [CVPR, ECCV, ICCV], ICLR, ICRA, Scientific Reports, and ASME JMD.
