https://doi.org/10.1007/s00371-019-01755-x
ORIGINAL ARTICLE
Latent transformations neural network for object view synthesis
Abstract
We propose a fully convolutional conditional generative neural network, the latent transformation neural network, capable
of rigid and non-rigid object view synthesis using a lightweight architecture suited for real-time applications and embedded
systems. In contrast to existing object view synthesis methods which incorporate conditioning information via concatenation,
we introduce a dedicated network component, the conditional transformation unit. This unit is designed to learn the latent
space transformations corresponding to specified target views. In addition, a consistency loss term is defined to guide the
network toward learning the desired latent space mappings, a task-divided decoder is constructed to refine the quality of
generated views of objects, and an adaptive discriminator is introduced to improve the adversarial training process. The
generalizability of the proposed methodology is demonstrated on a collection of three diverse tasks: multi-view synthesis on
real hand depth images, view synthesis of real and synthetic faces, and the rotation of rigid objects. The proposed model
is shown to be comparable with the state-of-the-art methods in structural similarity index measure and L1 metrics while simultaneously achieving a 24% reduction in the compute time for inference of novel images.
Keywords Object view synthesis · Latent transformation · Fully convolutional · Conditional generative model
visibility map, which indicates visible parts in a target image to identify occlusion in different views. However, this method requires mesh models for each object in order to extract visibility maps for training the network. The DFN by Jia et al. [20] proposed using a dynamic filter which is conditioned on a sequence of previous frames; this is fundamentally different from our method since the filter is applied to the original inputs rather than the latent embeddings. Moreover, it relies on temporal information and is not applicable for predictions given a single image. The IterGAN model introduced by Galama and Mensink [10] is also designed to synthesize novel views from a single image, with a specific emphasis on the synthesis of rotated views of objects in small, iterative steps. The conditional variational autoencoder (CVAE) incorporates conditioning information into the standard variational autoencoder (VAE) framework [23] and is capable of synthesizing specified attribute changes in an identity-preserving manner [37,45]. Other works have introduced a clamping strategy to enforce a specific organizational structure in the latent space [24,33]; these networks require extremely detailed labels for supervision, such as the graphics code parameters used to create each example, and are therefore very difficult to implement for more general tasks (e.g., training with real images). These models are all reliant on additional knowledge for training, such as depth information, camera poses, or mesh models, and are not applicable in embedded systems and real-time applications due to their high computational demand and large number of network parameters, since these methods did not consider the efficiency of the model.

CVAE-GAN [2] further adds adversarial training to the CVAE framework in order to improve the quality of generated predictions. The work from Zhang et al. [47] introduced the conditional adversarial autoencoder (CAAE), designed to model age progression/regression in human faces. This is achieved by concatenating conditioning information (i.e., age) with the input's latent representation before proceeding to the decoding process. The framework also includes an adaptive discriminator with conditional information passed using a resize/concatenate procedure. To the best of our knowledge, all existing conditional generative models designed for inference use fixed hidden layers and concatenate conditioning information directly with latent representations. In contrast to these existing methods, the proposed model incorporates conditioning information by defining dedicated, transformation-specific convolutional layers at the latent level. This conditioning framework allows the network to synthesize multiple transformed views from a single input, while retaining a fully convolutional structure which avoids the dense connections used in existing inference-based conditional models. Most significantly, the proposed LTNN framework is shown to be comparable with the state-of-the-art models in a diverse range of object view synthesis tasks, while requiring substantially fewer FLOPs and less memory for inference than other methods.

3 Latent transformation neural network

In this section, we introduce the methods used to define the proposed LTNN model. We first give a brief overview of the LTNN network structure. We then detail how conditional transformation unit mappings are defined and trained to operate on the latent space, followed by a description of the conditional discriminator unit implementation and the network loss function used to guide the training process. Lastly, we describe the task division framework used for the decoding process.

The basic workflow of the proposed model is as follows:

1. Encode the input image x to a latent representation lx = Encode(x).
2. Use conditioning information k to select conditional, convolutional filter weights ωk.
3. Map the latent representation lx to lyk = Φk(lx) = conv(lx, ωk), an approximation of the encoded latent representation lyk of the specified target image yk.
4. Decode lyk to obtain a coarse pixel value map and a refinement map.
5. Scale the channels of the pixel value map by the RGB balance parameters and take the Hadamard product with the refinement map to obtain the final prediction ŷk.
6. Pass real images yk as well as generated images ŷk to the discriminator, and use the conditioning information to select the discriminator's conditional filter weights ωk.
7. Compute the loss and update the weights using ADAM optimization and backpropagation.
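To make the workflow above concrete, the following sketch traces steps 1-5 in code. It is our own simplification under stated assumptions, not the authors' released implementation: the class name, channel counts, layer depths, and initialization scale are hypothetical, and the Swish activations, consistency loss, and conditional discriminator are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LTNNSketch(nn.Module):
    """Minimal sketch of the LTNN forward pass (steps 1-5 of the workflow)."""

    def __init__(self, num_conditions: int, latent_ch: int = 64):
        super().__init__()
        # Encoder stand-in (the paper uses a deeper fully convolutional design).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, latent_ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(latent_ch, latent_ch, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Conditional transformation unit: one 3x3 kernel bank per target view (steps 2-3).
        self.ctu_weights = nn.Parameter(
            torch.randn(num_conditions, latent_ch, latent_ch, 3, 3) * 0.02)
        # Decoder producing features for the coarse pixel value map and refinement map (step 4).
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(latent_ch, latent_ch, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        )
        self.pixel_head = nn.Conv2d(latent_ch, 3, 3, padding=1)
        self.refine_head = nn.Conv2d(latent_ch, 3, 3, padding=1)
        # Learned RGB balance parameters (step 5).
        self.rgb_balance = nn.Parameter(torch.ones(1, 3, 1, 1))

    def forward(self, x: torch.Tensor, k: int) -> torch.Tensor:
        lx = self.encoder(x)                                    # step 1
        lyk = F.conv2d(lx, self.ctu_weights[k], padding=1)      # steps 2-3: CTU mapping
        feats = self.decoder(lyk)                               # step 4
        pixel_map = torch.sigmoid(self.pixel_head(feats))
        refine_map = torch.sigmoid(self.refine_head(feats))
        # step 5: per-channel RGB balance, then Hadamard product with the refinement map.
        return self.rgb_balance * pixel_map * refine_map

# Example usage: y_hat = LTNNSketch(num_conditions=9)(torch.rand(1, 3, 64, 64), k=2)
```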
3.1 Conditional transformation unit

Generative models have frequently been designed to explicitly disentangle the latent space in order to enable high-level attribute modification through linear, latent space interpolation. This linear latent structure is imposed by design decisions, however, and may not be the most natural way for a network to internalize features of the data distribution. Several approaches have been proposed which include nonlinear layers for processing conditioning information at the latent space level. In these conventional conditional generative frameworks, conditioning information is introduced by combining features extracted from the input with features extracted from the conditioning information (often using dense connection layers); these features are typically combined using standard vector concatenation, although some have opted to use channel concatenation.
Fig. 2 Selected methods for incorporating conditioning information; the proposed LTNN method is illustrated on the left, and six conventional alternatives are shown to the right
In particular, conventional approaches for incorporating conditional information generally fall into three classes: (1) apply a fully connected layer before and after concatenating a vector storing conditional information [24,40,47,48], (2) flatten the network features and concatenate with a vector storing conditional information [30], (3) tile a conditional vector to create a two-dimensional array with the same shape as the network features and concatenate channel-wise [2,38]. Since the first class is more prevalent than the others in practice, we have subdivided this class into four cases, i.e., FC-Concat-FC [47], FC-Concat-2FC [24], 2FC-Concat-FC [48], and 2FC-Concat-2FC [40]. Six of these conventional conditional network designs are illustrated in Fig. 2 along with the proposed LTNN network design for incorporating conditioning information.

Rather than directly concatenating conditioning information with network features, we propose using a conditional transformation unit (CTU), consisting of a collection of distinct convolutional mappings in the network's latent space. More specifically, the CTU maintains independent convolution kernel weights for each target view in consideration. Conditioning information is used to select which collection of kernel weights, i.e., which CTU mapping, should be used in the CTU convolutional layer to perform a specified transformation. In addition to the convolutional kernel weights, each CTU mapping incorporates a Swish activation [32] with independent parameters for each specified target view. The kernel weights and Swish parameters of each CTU mapping are selectively updated by controlling the gradient flow based on the conditioning information provided.

The CTU mappings are trained to transform the encoded, latent space representation of the network's input in a manner which produces high-level view or attribute changes upon decoding. In this way, different angles of view, light directions, and deformations, for example, can be generated from a single input image. In one embodiment, the training process for the conditional transformation units can be designed to form a semigroup {Φt}t≥0 of operators:

    Φ0 = id,
    Φt+s = Φt ∘ Φs   for all t, s ≥ 0,     (1)

defined on the latent space and trained to follow the geometric flow corresponding to a specified attribute. In the context of rotating three-dimensional objects, for example, the transformation units are trained on input images paired with several target outputs corresponding to different angles of rotation; the network then uses conditioning information, which specifies the angle by which the object should be rotated, to select the appropriate transformation unit. In this context, the semigroup criteria correspond to the fact that rotating an object by 10° twice should align with the result of rotating the object by 20° once.

Since the encoder and decoder are not influenced by the specified angle of rotation, the network's encoding/decoding structure learns to model objects at different angles simultaneously; the single, low-dimensional latent representation of the input contains all information required to produce rotated views of the original object. Other embodiments can depart from this semigroup formulation, however, by training the conditional transformation units to instead produce a more diverse collection of non-sequential viewpoints, as is the case, for example, for multi-view hand synthesis.
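As a concrete illustration of the CTU mechanism described above, the sketch below (our own simplification, not the authors' implementation; the class name, channel count, and initialization scale are hypothetical) keeps an independent 3 × 3 kernel bank and an independent Swish parameter for each target view, and indexes into them with the conditioning information so that only the selected weights receive gradient updates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalTransformationUnit(nn.Module):
    """Per-condition 3x3 convolutions with per-condition Swish activations (sketch)."""

    def __init__(self, num_conditions: int, channels: int):
        super().__init__()
        # One independent kernel bank and one Swish parameter per target view.
        self.kernels = nn.Parameter(
            torch.randn(num_conditions, channels, channels, 3, 3) * 0.02)
        self.swish_beta = nn.Parameter(torch.ones(num_conditions))

    def forward(self, latent: torch.Tensor, k: int) -> torch.Tensor:
        # Indexing selects a single kernel bank, so backpropagation only updates
        # the weights associated with condition k; the other banks receive zero gradient.
        out = F.conv2d(latent, self.kernels[k], padding=1)
        beta = self.swish_beta[k]
        return out * torch.sigmoid(beta * out)   # Swish: x * sigmoid(beta * x)

# The semigroup property of Eq. (1) can be encouraged during training by penalizing
# the gap between a composed small-step mapping and a single large-step mapping,
# e.g. || ctu(ctu(l, k_10deg), k_10deg) - ctu(l, k_20deg) || for 10° and 20° rotations.
```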
Table 1 Ablation/comparison results of six different conventional alternatives for fusing condition information into the latent space and ablation study of conditional transformation unit (CTU), conditional discriminator unit (CDU), and task-divided decoder (TD)

Model                     Elevation       Azimuth         Light direction   Age
                          SSIM    L1      SSIM    L1      SSIM    L1        SSIM    L1
LTNN (CTU + CDU + TD)     .923    .107    .923    .108    .941    .093      .925    .102
LTNN w/o L_smooth         .918    .118    .921    .114    .935    .112      .911    .110
CTU + CDU                 .901    .135    .908    .125    .921    .121      .868    .118
CTU                       .889    .142    .878    .135    .901    .131      .831    .148
Channel Concat + Conv     .803    .179    .821    .173    .816    .182      .780    .188
2-FC + Concat + 2-FC      .674    .258    .499    .355    .779    .322      .686    .243
2-FC + Concat + FC        .691    .233    .506    .358    .787    .316      .687    .240
FC + Concat + 2-FC        .673    .261    .500    .360    .774    .346      .683    .249
FC + Concat + FC          .681    .271    .497    .355    .785    .315      .692    .246
Reshape + Concat + FC     .671    .276    .489    .357    .780    .318      .685    .251

For valid comparison, we used an identical encoder, decoder, and training procedure with the synthetic face dataset
Bold values indicate the best performance
The decoding process has been divided into three tasks: estimating the refinement map, the pixel values, and the RGB color balance of the dataset. We have found this decoupled framework for estimation helps the network converge to better minima and produce sharp, realistic outputs without additional loss terms. The decoding process begins with a series of convolutional layers followed by bilinear interpolation to upsample the low-resolution latent information. The last component of the decoder's upsampling process consists of two distinct convolutional layers used for the task division; one layer is allocated for predicting the refinement map, while the other is allocated for estimating the coarse pixel value map.

Fig. 3 Proposed task-divided design for the LTNN decoder. The coarse pixel value estimation map is split into RGB channels, rescaled by the RGB balance parameters, and multiplied element-wise by the refinement map values to produce the final network prediction

Fig. 4 The proposed network structure for the encoder/decoder (left) and discriminator (right) for 64 × 64 input images. Features have been color-coded according to the type of layer which has produced them. The CTU and CDU components both store and train separate collections of 3 × 3 filter weights for each conditional transformation; in particular, the number of distinct 3 × 3 filters associated with the CTU and CDU corresponds to the number of distinct conditional transformations the network is designed to produce. For 256 × 256 input images, we have added two Block v1/MaxPool layers at the front of the encoder and two Conv/Interpolation layers at the end of the decoder
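In our own notation (the symbols below are not taken from the paper), let $\hat{p}_k$ denote the coarse pixel value map, $\hat{r}_k$ the refinement map, and $b = (b_R, b_G, b_B)$ the learned RGB balance parameters. Assuming the refinement map is applied identically to each color channel, the assembly illustrated in Fig. 3 and in step 5 of the workflow can be written channel-wise as

$$ \hat{y}_k^{(c)} \;=\; b_c \, \hat{p}_k^{(c)} \odot \hat{r}_k, \qquad c \in \{R, G, B\}, $$

where $\odot$ denotes the element-wise (Hadamard) product.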
4 Architecture details
Fig. 6 Qualitative comparison of 360° view prediction of rigid objects. A single image, shown in the first column of the "Ground" row, is used as the input for the network. Results are shown for the proposed network with and without task division ("w/o TD") as well as a comparison with M2N. The pixel value map and refinement maps corresponding to the task division framework are also provided, as well as an inverted view of the refinement map for better visibility
The discriminator is updated once every two encoder/decoder updates, and one-sided label smoothing [36] has been used to improve the stability of the discriminator training procedure.

5 Experiments and results

We conduct experiments on a diverse collection of datasets including both rigid and non-rigid objects. To show the generalizability of our method, we have conducted a series of experiments: (i) hand pose estimation using a synthetic training set and real NYU hand depth image data [41] for testing, (ii) synthesis of rotated views of rigid objects using the 3D object dataset [4], (iii) synthesis of rotated views using a real face dataset [9], and (iv) the modification of a diverse range of attributes on a synthetic face dataset [17]. For each experiment, we have trained the models using 80% of the datasets. Since ground truth target depth images were not available for the real hand dataset, an indirect metric has been used to quantitatively evaluate the model, as described in Sect. 5.2. Ground truth data were available for all other experiments, and models were evaluated directly using the L1 mean pixel-wise error and the structural similarity index measure (SSIM) [44] used in [30,38]. To compare the proposed framework with existing works, two comparison groups have been formed: conditional inference methods, CVAE-GAN [2] and CAAE [47], with comparable hourglass structures, for comparison on experiments with non-rigid objects, and view synthesis methods, MV3D [40], M2N [38], AFN [48], and TVSN [30], for comparison on experiments with rigid objects. Additional ablation experiments have been performed to compare the proposed CTU conditioning method with other conventional concatenation methods (see Fig. 2); results are shown in Fig. 9 and Table 1.
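The one-sided label smoothing mentioned above can be implemented, for example, by softening only the real-image targets in the discriminator loss. The sketch below is a generic illustration rather than the paper's exact formulation; the 0.9 target value and the binary cross-entropy loss are our assumptions.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor,
                       real_target: float = 0.9) -> torch.Tensor:
    """One-sided label smoothing: real labels are softened (e.g., 1 -> 0.9) while
    fake labels are left at 0, which helps stabilize adversarial training."""
    real_labels = torch.full_like(d_real, real_target)
    fake_labels = torch.zeros_like(d_fake)
    loss_real = F.binary_cross_entropy_with_logits(d_real, real_labels)
    loss_fake = F.binary_cross_entropy_with_logits(d_fake, fake_labels)
    return loss_real + loss_fake
```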
Fig. 10 Qualitative evaluation for view synthesis of real faces using the image dataset [9]

Fig. 9 LTNN ablation experiment results and comparison with alternative conditioning frameworks using the synthetic hand dataset. Our models: conditional transformation unit (CTU), conditional discriminator unit (CDU), task-divided decoder (TD), and LTNN consisting of all previous components. Alternative concatenation methods: channel-wise concatenation (CH Concat), fully connected concatenation (FC Concat), and reshape fully connected feature vector concatenation (RE Concat)

hand given a single view and evaluate the accuracy of the estimated hand pose using the synthesized views. The underlying assumption of the assessment is that the accuracy of the hand pose estimation will be improved precisely when the synthesized views provide faithful representations of the true hand pose. Since ground truth predictions for the real NYU hand dataset were not available, the LTNN model has been trained using a synthetic dataset generated using 3D mesh hand models. The NYU dataset does, however, provide ground truth coordinates for the input hand pose; using this, we were able to indirectly evaluate the performance of the model by assessing the accuracy of a hand pose estimation method using the network's multi-view predictions as input. More specifically, the LTNN model was trained to generate nine different views which were then fed into the pose estimation network from Choi et al. [6] (also trained using the synthetic dataset). For an evaluation metric, the maximum error in the predicted joint locations has been computed for each frame (i.e., each hand pose in the dataset). The cumulative number of frames with maximum error below a threshold distance D has then been computed, as is commonly used in hand pose estimation tasks [6,29]. A comparison of the pose estimation results using synthetic views generated by the proposed model, the CVAE-GAN model, and the CAAE model is presented in Fig. 7, along with the results obtained by performing pose estimation using the single-view input frame alone. In particular, for a threshold distance D = 40 mm, the proposed model yields the highest accuracy, with 61.98% of the frames having all predicted joint locations within a distance of 40 mm from the ground truth values. The second highest accuracy is achieved with the CVAE-GAN model, with 45.70% of frames predicted within the 40 mm threshold.

A comparison of the quantitative hand pose estimation results is provided in Fig. 7, where the proposed LTNN framework is seen to provide a substantial improvement over existing methods; qualitative results are also available in Fig. 8. Ablation study results for assessing the impact of individual components of the LTNN model are also provided in Fig. 9; in particular, we note that the inclusion of the CTU, CDU, and task-divided decoder each provides significant improvements to the performance of the network. With regard to real-time applications, the proposed model runs at 114 fps without batching and at 1975 fps when applied to a mini-batch of size 128 (using a single TITAN Xp GPU and an Intel i7-6850K CPU).
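As an illustration of the cumulative error metric described above (the function and array names are ours; the per-frame maximum criterion and the 40 mm threshold follow the text), the fraction of frames whose predicted joints all fall within a threshold D of the ground truth can be computed as follows:

```python
import numpy as np

def fraction_within_threshold(pred_joints: np.ndarray,
                              gt_joints: np.ndarray,
                              threshold_mm: float = 40.0) -> float:
    """pred_joints, gt_joints: arrays of shape (num_frames, num_joints, 3) in mm.
    A frame counts as correct only if *all* of its predicted joints lie within
    threshold_mm of the ground truth, i.e., its maximum joint error is below D."""
    errors = np.linalg.norm(pred_joints - gt_joints, axis=-1)  # (num_frames, num_joints)
    max_error_per_frame = errors.max(axis=-1)                  # (num_frames,)
    return float((max_error_per_frame < threshold_mm).mean())
```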
Real face experiment We have also conducted an experiment using a real face dataset to show the applicability of LTNN to real images. The stereo face database [9], consisting of images of 100 individuals from 10 different viewpoints, was used for the experiments with real faces. These faces were first segmented using the method of [28], and then we manually cleaned up the failure cases. The cleaned faces have been cropped and centered to form the final dataset. The LTNN model was trained to synthesize images of input faces corresponding to three consecutive horizontal rotations. Qualitative results for the real face experiment are provided in Fig. 10; in particular, we note that the quality of the views generated by the proposed LTNN model is consistent for each of the four views, while the quality of the views generated
Fig. 13 Simultaneous learning of multiple attribute modifications. Azimuth and age (left), light and age (center), and light and azimuth (right) combined modifications are shown. The network has been trained using four CTU mappings per attribute (e.g., four azimuth mappings and four age mappings); results shown have been generated by composing CTU mappings in the latent space and decoding
Table 4 Quantitative results for light direction and age modification on the synthetic face dataset

Model                 Light direction         Age
                      SSIM        L1          SSIM        L1

Fig. 14 Near-continuous attribute modification is attainable using piecewise-linear interpolation in the latent space. Provided a grayscale image (corresponding to the faces on the far left), modified images corresponding to changes in light direction (first), age (second), azimuth (third), and elevation (fourth) are produced with 17 degrees of variation. These attribute-modified images have been produced using nine CTU mappings, corresponding to varying degrees of modification, and linearly interpolating between the discrete transformation encodings in the latent space

3.5°, we can perform linear interpolation in the latent space between the representations Φ0(lx) and Φ1(lx); that is, we may take our network prediction for the intermediate change of 3.5° to be:
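A minimal sketch of the interpolated prediction described above, assuming the intermediate view is decoded from the midpoint of the two latent encodings (the symbols $\hat{y}_{3.5^\circ}$ and $\mathrm{Decode}$ are our notation, and the averaging rule is our assumption rather than a formula reproduced from the paper):

$$ \hat{y}_{3.5^\circ} \;=\; \mathrm{Decode}\!\left( \tfrac{1}{2}\,\Phi_0(l_x) \;+\; \tfrac{1}{2}\,\Phi_1(l_x) \right). $$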
using CTUs and a consistency loss term, defined an efficient task-divided decoder setup for deconstructing the data generation process into manageable subtasks, and shown that a context-aware discriminator can be used to improve the performance of the adversarial training process. The performance of this framework has been assessed on a diverse range of tasks and shown to perform comparably with the state-of-the-art methods while reducing computational operations and memory consumption.

Acknowledgements Karthik Ramani acknowledges the US National Science Foundation Awards NRI-1637961 and IIP-1632154. Guang Lin acknowledges the US National Science Foundation Awards DMS-1555072, DMS-1736364 and DMS-1821233. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agency. We gratefully appreciate the support of NVIDIA Corporation with the donation of GPUs used for this research.

References

1. Antipov, G., Baccouche, M., Dugelay, J.L.: Face aging with conditional generative adversarial networks (2017). arXiv preprint arXiv:1702.01983
2. Bao, J., Chen, D., Wen, F., Li, H., Hua, G.: CVAE-GAN: fine-grained image generation through asymmetric training (2017). arXiv preprint arXiv:1703.10155
3. Chang, A., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: An information-rich 3D model repository. 1(7), 8 (2015). arXiv preprint arXiv:1512.03012
4. Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., Yu, F.: ShapeNet: an information-rich 3D model repository. Technical Report, Stanford University, Princeton University, Toyota Technological Institute at Chicago (2015). arXiv:1512.03012 [cs.GR]
5. Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2172–2180 (2016)
6. Choi, C., Kim, S., Ramani, K.: Learning hand articulations by hallucinating heat distribution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3104–3113 (2017)
7. Dinerstein, J., Egbert, P.K., Cline, D.: Enhancing computer graphics through machine learning: a survey. Vis. Comput. 23(1), 25–43 (2007)
8. Dosovitskiy, A., Tobias Springenberg, J., Brox, T.: Learning to generate chairs with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1538–1546 (2015)
9. Fransens, R., Strecha, C., Van Gool, L.: Parametric stereo for multi-pose face recognition and 3D-face modeling. In: International Workshop on Analysis and Modeling of Faces and Gestures, pp. 109–124. Springer (2005)
10. Galama, Y., Mensink, T.: Iterative GANs for rotating visual objects (2018)
11. Gauthier, J.: Conditional generative adversarial nets for convolutional face generation. In: Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition. Winter Semester 2014(5), 2 (2014)
12. Ge, L., Liang, H., Yuan, J., Thalmann, D.: Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3593–3601 (2016)
13. Goodfellow, I.J.: NIPS 2016 Tutorial: Generative Adversarial Networks (2017). CoRR arXiv:1701.00160
14. Guan, H., Chang, J.S., Chen, L., Feris, R.S., Turk, M.: Multi-view appearance-based 3D hand pose estimation. In: 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06), pp. 154–154. IEEE (2006)
15. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)
16. Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks (2016). arXiv preprint arXiv:1608.06993
17. IEEE: A 3D Face Model for Pose and Illumination Invariant Face Recognition (2009)
18. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456 (2015)
19. Jason, J.Y., Harley, A.W., Derpanis, K.G.: Back to basics: unsupervised learning of optical flow via brightness constancy and motion smoothness. In: Computer Vision – ECCV 2016 Workshops, pp. 3–10. Springer (2016)
20. Jia, X., De Brabandere, B., Tuytelaars, T., Gool, L.V.: Dynamic filter networks. In: Advances in Neural Information Processing Systems, pp. 667–675 (2016)
21. Kim, S., Kim, D., Choi, S.: Citycraft: 3D virtual city creation from a single image. Vis. Comput. (2019). https://doi.org/10.1007/s00371-019-01701-x
22. Kingma, D., Ba, J.: Adam: a method for stochastic optimization (2014). arXiv preprint arXiv:1412.6980
23. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes (2013). arXiv preprint arXiv:1312.6114
24. Kulkarni, T.D., Whitney, W.F., Kohli, P., Tenenbaum, J.: Deep convolutional inverse graphics network. In: Advances in Neural Information Processing Systems, pp. 2539–2547 (2015)
25. Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., Frey, B.: Adversarial autoencoders (2015). arXiv preprint arXiv:1511.05644
26. Mirza, M., Osindero, S.: Conditional generative adversarial nets (2014). arXiv preprint arXiv:1411.1784
27. Miyato, T., Koyama, M.: cGANs with projection discriminator (2018). arXiv preprint arXiv:1802.05637
28. Nirkin, Y., Masi, I., Tuan, A.T., Hassner, T., Medioni, G.: On face segmentation, face swapping, and face perception. In: 2018 13th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2018), pp. 98–105. IEEE (2018)
29. Oberweger, M., Lepetit, V.: Deepprior++: improving fast and accurate 3D hand pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 585–594 (2017)
30. Park, E., Yang, J., Yumer, E., Ceylan, D., Berg, A.C.: Transformation-grounded image generation network for novel 3D view synthesis. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 702–711. IEEE (2017)
31. Paszke, A., Chaurasia, A., Kim, S., Culurciello, E.: ENet: a deep neural network architecture for real-time semantic segmentation (2016). arXiv preprint arXiv:1606.02147
32. Ramachandran, P., Zoph, B., Le, Q.V.: Swish: a self-gated activation function (2017). arXiv preprint arXiv:1710.05941
33. Reed, S., Sohn, K., Zhang, Y., Lee, H.: Learning to disentangle factors of variation with manifold interaction. In: International Conference on Machine Learning, pp. 1431–1439 (2014)
34. Rezende, D.J., Eslami, S.A., Mohamed, S., Battaglia, P., Jaderberg, M., Heess, N.: Unsupervised learning of 3D structure from images. In: Advances in Neural Information Processing Systems, pp. 4996–5004 (2016)
35. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer (2015)
36. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems, pp. 2234–2242 (2016)
37. Sohn, K., Lee, H., Yan, X.: Learning structured output representation using deep conditional generative models. In: Advances in Neural Information Processing Systems, pp. 3483–3491 (2015)
38. Sun, S.H., Huh, M., Liao, Y.H., Zhang, N., Lim, J.J.: Multi-view to novel view: synthesizing novel views with self-learned confidence. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 155–171 (2018)
39. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
40. Tatarchenko, M., Dosovitskiy, A., Brox, T.: Multi-view 3D models from single images with a convolutional network. In: European Conference on Computer Vision, pp. 322–337. Springer (2016)
41. Tompson, J., Stein, M., Lecun, Y., Perlin, K.: Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Gr. 33(5), 169 (2014)
42. Varley, J., DeChant, C., Richardson, A., Ruales, J., Allen, P.: Shape completion enabled robotic grasping. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2442–2447. IEEE (2017)
43. Wang, Q., Artières, T., Chen, M., Denoyer, L.: Adversarial learning for modeling human motion. Vis. Comput. (2018). https://doi.org/10.1007/s00371-018-1594-7
44. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
45. Yan, X., Yang, J., Sohn, K., Lee, H.: Attribute2image: conditional image generation from visual attributes. In: European Conference on Computer Vision, pp. 776–791. Springer (2016)
46. Zhang, S., Han, Z., Lai, Y.K., Zwicker, M., Zhang, H.: Stylistic scene enhancement GAN: mixed stylistic enhancement generation for 3D indoor scenes. Vis. Comput. 35(6–8), 1157–1169 (2019)
47. Zhang, Z., Song, Y., Qi, H.: Age progression/regression by conditional adversarial autoencoder. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5810–5818 (2017)
48. Zhou, T., Tulsiani, S., Sun, W., Malik, J., Efros, A.A.: View synthesis by appearance flow. In: European Conference on Computer Vision, pp. 286–301. Springer (2016)

Sangpil Kim is a Ph.D. student in Electrical and Computer Engineering at Purdue University. He received his B.S. degree from Korea University, South Korea, in 2015. His current research interests are in computer vision and deep learning.

Nick Winovich is a Ph.D. candidate in the Department of Mathematics at Purdue University. He earned his B.A. in Mathematics and Spanish at the University of Notre Dame in 2012 and subsequently received an M.S. in Mathematics at the University of Oregon in 2015. His current research focuses on the intersection of probability theory and numerical computing, with an emphasis on applications of Gaussian processes and neural network models for partial differential equations.

Hyung-gun Chi is a Master's student in Mechanical Engineering at Purdue University. He received his B.S. degree from the School of Mechanical Engineering, Yonsei University, South Korea, in 2017. His current research interests lie at the intersection of computer vision and robotics.