Advances in 3D Generation - A Survey
1 Tencent AI Lab 2 ARC Lab, Tencent PCG 3 City University of Hong Kong 4 South China University of Technology
⋆ Equal contribution. † Corresponding author.
Figure 1: In this survey, we investigate a large variety of 3D generation methods. Over the past decade, 3D generation has achieved remarkable
progress and has recently garnered considerable attention due to the success of generative AI in images and videos. 3D generation results
from 3D-GAN [WZX∗ 16], DeepSDF [PFS∗ 19], DMTet [SGY∗ 21], EG3D [CLC∗ 22], DreamFusion [PJBM23], PointE [NJD∗ 22], Zero-1-
to-3 [LWVH∗ 23] and Instant3D [LTZ∗ 23].
Abstract
Generating 3D models lies at the core of computer graphics and has been the focus of decades of research. With the emergence
of advanced neural representations and generative models, the field of 3D content generation is developing rapidly, enabling
the creation of increasingly high-quality and diverse 3D models. The rapid growth of this field makes it difficult to stay abreast
of all recent developments. In this survey, we aim to introduce the fundamental methodologies of 3D generation methods and es-
tablish a structured roadmap, encompassing 3D representation, generation methods, datasets, and corresponding applications.
Specifically, we introduce the 3D representations that serve as the backbone for 3D generation. Furthermore, we provide a com-
prehensive overview of the rapidly growing literature on generation methods, categorized by the type of algorithmic paradigms,
including feedforward generation, optimization-based generation, procedural generation, and generative novel view synthesis.
Lastly, we discuss available datasets, applications, and open challenges. We hope this survey will help readers explore this
exciting topic and foster further advancements in the field of 3D content generation.
1. Introduction

In the realm of 2D content generation, recent advancements in generative models have steadily enhanced the capacity for image generation and editing, leading to increasingly diverse and high-quality results. Pioneering research on generative adversarial networks (GANs) [GPAM∗14, AQW19], variational autoencoders (VAEs) [KPHL17, PGH∗16, KW13], and autoregressive models [RWC∗19, BMR∗20] has demonstrated impressive outcomes. Furthermore, the advent of generative artificial intelligence (AI) and diffusion models [HJA20, ND21, SCS∗22] signifies a paradigm shift in image manipulation techniques, such as Stable Diffusion [RBL∗22a], Imagen [SCS∗22], Midjourney [Mid], or DALL-E 3 [Ope]. These generative AI models enable the creation and editing of photorealistic or stylized images, or even videos [CZC∗24, HSG∗22, SPH∗23, GNL∗23], using minimal input like text prompts. As a result, they often generate imaginative content that transcends the boundaries of the real world, pushing the limits of creativity and artistic expression. Owing to their "emergent" capabilities, these models have redefined the limits of what is achievable in content generation, expanding the horizons of creativity and artistic expression.

The demand to extend 2D content generation into 3D space is becoming increasingly crucial for applications in generating 3D assets or creating immersive experiences, particularly with the rapid development of the metaverse. The transition from 2D to 3D content generation, however, is not merely a technological evolution. It is primarily a response to the demands of modern applications that necessitate a more intricate replication of the physical world, which 2D representations often fail to provide. This shift highlights the limitations of 2D content in applications that require a comprehensive understanding of spatial relationships and depth perception.

As the significance of 3D content becomes increasingly evident, there has been a surge in research efforts dedicated to this domain. However, the transition from 2D to 3D content generation is not a straightforward extension of existing 2D methodologies. Instead, it involves tackling unique challenges and re-evaluating data representation, formulation, and underlying generative models to effectively address the complexities of 3D space. For instance, it is not obvious how to integrate 3D scene representations into 2D generative models to handle the higher dimensions required for 3D generation. Unlike images or videos, which can be easily collected from the web, 3D assets are relatively scarce. Furthermore, evaluating the quality of generated 3D models presents additional challenges, as it is necessary to develop better formulations for objective functions, particularly when considering multi-view consistency in 3D space. These complexities demand innovative approaches and novel solutions to bridge the gap between 2D and 3D content generation.

While not as prominently featured as its 2D counterpart, 3D content generation has been steadily progressing with a series of notable achievements. The representative examples shown in Fig. 1 demonstrate significant improvements in both quality and diversity, transitioning from early methods like 3D-GAN [WZX∗16] to recent approaches like Instant3D [LTZ∗23]. Therefore, this survey seeks to systematically explore the rapid advancements and multifaceted developments in 3D content generation. We present a structured overview and comprehensive roadmap of the many recent works focusing on 3D representations, 3D generation methods, datasets, and applications of 3D content generation, and outline open challenges.

Fig. 2 presents an overview of this survey. We first discuss the scope and related work of this survey in Sec. 2. In the following sections, we examine the core methodologies that form the foundation of 3D content generation. Sec. 3 introduces the primary scene representations and their corresponding rendering functions used in 3D content generation. Sec. 4 explores a wide variety of 3D generation methods, which can be divided into four categories based on their algorithmic methodologies: feedforward generation, optimization-based generation, procedural generation, and generative novel view synthesis. An evolutionary tree of these methodologies is also depicted to illustrate their primary branches. As data accumulation plays a vital role in ensuring the success of deep learning models, we present related datasets employed for training 3D generation methods. In the end, we include a brief discussion on related applications, such as 3D human and face generation, outline open challenges, and conclude this survey. We hope this survey offers a systematic summary of 3D generation that could inspire subsequent work for interested readers.

In this work, we present a comprehensive survey on 3D generation, with two main contributions:

• Given the recent surge in contributions based on generative models in the field of 3D vision, we provide a comprehensive and timely literature review of 3D content generation, aiming to offer readers a rapid understanding of the 3D generation framework and its underlying principles.
• We propose a multi-perspective categorization of 3D generation methods, aiming to assist researchers working on 3D content generation in specific domains to quickly identify relevant works and facilitate a better understanding of the related techniques.

2. Scope of This Survey

In this survey, we concentrate on the techniques for the generation of 3D models and their related datasets and applications. Specifically, we first give a short introduction to the scene representation. Our focus then shifts to the integration of these representations and the generative models. Then, we provide a comprehensive overview of the prominent methodologies of generation methods. We also explore the related datasets and cutting-edge applications such as 3D human generation, 3D face generation, and 3D editing, all of which are enhanced by these techniques.

This survey is dedicated to systematically summarizing and categorizing 3D generation methods, along with the related datasets and applications. The surveyed papers are mostly published in major computer vision and computer graphics conferences/journals, as well as some preprints released on arXiv in 2023. While it is challenging to exhaust all methods related to 3D generation, we hope to include as many major branches of 3D generation as possible. We do not delve into detailed explanations for each branch; instead, we typically introduce some representative works within it to explain its paradigm. The details of each branch can be found in the related work sections of the cited papers.
Figure 2: Overview of this survey, including 3D representations, 3D generation methods, datasets and applications. Specifically, we introduce
the 3D representations that serve as the backbone for 3D generation. Furthermore, we provide a comprehensive overview of the rapidly
growing literature on generation methods, categorized by the type of algorithmic paradigms, including feedforward generation, optimization-
based generation, procedural generation, and generative novel view synthesis. Finally, we provide a brief discussion on popular datasets and
available applications.
Related Survey. Neural reconstruction and rendering with scene representations are closely related to 3D generation. However, we consider these topics to be outside the purview of this report. For a comprehensive discussion on neural rendering, we direct readers to [TFT∗20, TTM∗22], and for a broader examination of other neural representations, we recommend [KBM∗20, XTS∗22]. Our primary focus is on exploring techniques that generate 3D models. Therefore, this review does not encompass research on generation methods for 2D images within the realm of visual computing. For further information on a specific generation method, readers can refer to [Doe16] (VAEs), [GSW∗21] (GANs), [PYG∗23, CHIS23] (Diffusion) and [KNH∗22] (Transformers) for a more detailed understanding. There are also some surveys related to 3D generation that have their own focuses, such as 3D-aware image synthesis [XX23], 3D generative models [SPX∗22], Text-to-3D [LZW∗23] and deep learning for 3D point clouds [GWH∗20]. In this survey, we give a comprehensive analysis of different 3D generation methods.

3. Neural Scene Representations

In the domain of 3D AI-generated content, adopting a suitable representation of 3D models is essential. The generation process typically involves a scene representation and a differentiable rendering algorithm for creating 3D models and rendering 2D images. Conversely, the created 3D models or 2D images can be supervised in the reconstruction domain or the image domain, as illustrated in Fig. 3. Some methods directly supervise the 3D models of the scene representation, while others render the scene representation into images and supervise the resulting renderings. In the following, we broadly classify the scene representations into three groups: explicit scene representations (Section 3.1), implicit representations (Section 3.2), and hybrid representations (Section 3.3). Note that the rendering methods (e.g. ray casting, volume rendering, rasterization), which should be differentiable in order to optimize the scene representations from various inputs, are also introduced.

3.1. Explicit Representations

Explicit scene representations serve as a fundamental module in computer graphics and vision, as they offer a comprehensive means of describing 3D scenes. By depicting scenes as an assembly of basic primitives, including point-like primitives, triangle-based meshes, and advanced parametric surfaces, these representations can create detailed and accurate visualizations of various environments and objects.

3.1.1. Point Clouds

A point cloud is a collection of elements in Euclidean space, representing discrete points with additional attributes (e.g. colors and normals) in three-dimensional space. In addition to simple points, which can be considered infinitesimally small surface patches, oriented point clouds with a radius (surfels) can also be used [PZVBG00].
Figure 3: Neural scene representations used for 3D generation, including explicit, implicit, and hybrid representations. The 3D generation
involves the use of scene representations and a differentiable rendering algorithm to create 3D models or render 2D images. On the flip
side, these 3D models or 2D images can function as the reconstruction domain or image domain, overseeing the 3D generation of scene
representations.
Surfels are used in computer graphics for rendering point clouds (called splatting), which is differentiable [YSW∗19, KKLD23] and allows researchers to define differentiable rendering pipelines to adjust point cloud positions and features, such as radius or color. Techniques like Neural Point-based Rendering [ASK∗20, DZL∗20], SynSin [WGSJ20], Pulsar [LZ21, KPLD21] and ADOP [RFS22] leverage learnable features to store information about the surface appearance and shape, enabling more accurate and detailed rendering results. Several other methods, such as FVS [RK20], SVS [RK21], and FWD-Transformer [CRJ22], also employ learnable features to improve the rendering quality. These methods typically embed features into point clouds and warp them to target views to decode color values, allowing for more accurate and detailed reconstructions of the scene.

By incorporating point cloud-based differentiable renderers into the 3D generation process, researchers can leverage the benefits of point clouds while maintaining compatibility with gradient-based optimization techniques. This process can be generally categorized into two different ways: point splatting, which blends the discrete samples with some local deterministic blurring kernels [ZPVBG02, LKL18, ID18, RROG18], and conventional point renderers [ASK∗20, DZL∗20, KPLD21, RALB22]. These methods facilitate the generation and manipulation of 3D point cloud models while maintaining differentiability, which is essential for training and optimizing neural networks in 3D generation tasks.
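To make the splatting idea concrete, the following NumPy sketch projects a colored point cloud with a pinhole camera and blends each point into the image with a Gaussian kernel, so that pixel colors vary smoothly with point positions and colors. It is a minimal illustration of the splatting principle rather than the pipeline of any specific method above; the camera intrinsics and kernel width are assumptions made for the example.

```python
import numpy as np

def splat_points(points, colors, K, image_size=(64, 64), sigma=1.5):
    """Project 3D points with intrinsics K and blend them into an image
    using a Gaussian splatting kernel (soft, differentiable-style blending)."""
    H, W = image_size
    image = np.zeros((H, W, 3))
    weight = np.zeros((H, W, 1))
    # Pixel-center grid used to evaluate the blurring kernel.
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    for p, c in zip(points, colors):
        if p[2] <= 0:                      # skip points behind the camera
            continue
        uvw = K @ p                        # pinhole projection
        u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
        # Gaussian kernel centered at the projected location.
        w = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2 * sigma ** 2))[..., None]
        image += w * c                     # accumulate weighted color
        weight += w
    return image / np.clip(weight, 1e-8, None)

# Toy usage: a random colored point cloud in front of the camera.
rng = np.random.default_rng(0)
points = rng.uniform([-1, -1, 2], [1, 1, 4], size=(200, 3))
colors = rng.uniform(0, 1, size=(200, 3))
K = np.array([[60.0, 0.0, 32.0], [0.0, 60.0, 32.0], [0.0, 0.0, 1.0]])
print(splat_points(points, colors, K).shape)   # (64, 64, 3)
```

Because the kernel is smooth, gradients with respect to point positions, radii, and colors are well defined, which is exactly the property the differentiable point renderers cited above exploit.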
3.1.2. Meshes

By connecting multiple vertices with edges, more complex geometric structures (e.g. wireframes and meshes) can be formed [BKP∗10]. These structures can then be further refined by using polygons, typically triangles or quadrilaterals, to create realistic representations of objects [SS87]. Meshes provide a versatile and efficient means of representing intricate shapes and structures, as they can be easily manipulated and rendered by computer algorithms. The majority of graphic editing toolchains utilize triangle meshes. This type of representation is indispensable for any digital content creation (DCC) pipeline, given its wide acceptance and compatibility. To align seamlessly with these pipelines, neural networks can be strategically trained to predict discrete vertex locations [BNT21, TZN19]. This ability allows for the direct importation of these locations into any DCC pipeline, facilitating a smooth
and efficient workflow. In contrast to predicting discrete textures, continuous texture methods optimized via neural networks have been proposed, such as texture fields [OMN∗19] and NeRF-Tex [BGP∗22]. In this way, they can provide more refined and detailed textures, enhancing the overall quality and realism of the generated 3D models.

Integrating mesh representations into 3D generation requires the use of mesh-based differentiable rendering methods, which enable meshes to be rasterized in a manner that is compatible with gradient-based optimization. Several such techniques have been proposed, including OpenDR [LB14], neural mesh renderer [KUH18], Paparazzi [LTJ18], and Soft Rasterizer [LLCL19]. Additionally, general-purpose physically based renderers like Mitsuba 2 [NDVZJ19] and Taichi [HLA∗19] support mesh-based differentiable rendering through automatic differentiation.

3.1.3. Multi-layer Representations

The use of multiple semi-transparent colored layers for representing scenes has been a popular and successful scheme in real-time novel view synthesis [ZTF∗18]. The Layered Depth Image (LDI) representation [SGHS98] is a notable example, extending traditional depth maps by incorporating multiple layers of depth maps, each with associated color values. Several methods [PZ17, CGT∗19, SSKH20] have drawn inspiration from the LDI representation and employed deep learning advancements to create networks capable of predicting LDIs. In addition to LDIs, Stereomagnification [ZTF∗18] initially introduced the multiplane image (MPI) representation. It describes scenes using multiple front-parallel semi-transparent layers, including colors and opacity, at fixed depth ranges through plane sweep volumes. With the help of volume rendering and homography projection, novel views can be synthesized in real time. Building on Stereomagnification [ZTF∗18], various methods [FBD∗19, MSOC∗19, STB∗19] have adopted the MPI representation to enhance rendering quality. The multi-layer representation has been further expanded to accommodate wider fields of view in [BFO∗20, ALG∗20, LXM∗20] by substituting planes with spheres. As research in this domain continues to evolve, we can expect further advancements in these methods, leading to more efficient and effective 3D generation techniques for real-time rendering.

3.2. Implicit Representations

Implicit representations have become the scene representation of choice for problems in view synthesis or shape reconstruction, as well as many other applications across computer graphics and vision. Unlike explicit scene representations that usually focus on object surfaces, implicit representations can define the entire volume of a 3D object and use volume rendering for image synthesis. These representations utilize mathematical functions, such as radiance fields [MST∗20] or signed distance fields [PFS∗19, CZ19], to describe the properties of a 3D space.

3.2.1. Neural Radiance Fields

Neural Radiance Fields (NeRFs) [MST∗20] have gained prominence as a favored scene representation method for a wide range of applications. Fundamentally, NeRFs introduce a novel representation of 3D scenes or geometries. Rather than utilizing point clouds and meshes, NeRFs depict the scene as a continuous volume. This approach involves obtaining volumetric parameters, such as view-dependent radiance and volume density, by querying an implicit neural network. This innovative representation offers a more fluid and adaptable way to capture the intricacies of 3D scenes, paving the way for enhanced rendering and modeling techniques.

Specifically, NeRF [MST∗20] represents the scene with a continuous volumetric radiance field, which utilizes MLPs to map the position x and view direction d to a density σ and color c. To render a pixel's color, NeRF casts a single ray r(t) = o + td and evaluates a series of points {t_i} along the ray. The evaluated {(σ_i, c_i)} at the sampled points are accumulated into the color C(r) of the pixel via volume rendering [Max95]:

C(r) = ∑_i T_i α_i c_i,  where T_i = exp(−∑_{k=0}^{i−1} σ_k δ_k),   (1)

and α_i = 1 − exp(−σ_i δ_i) indicates the opacity of the sampled point. The accumulated transmittance T_i quantifies the probability of the ray traveling from t_0 to t_i without encountering other particles, and δ_i = t_i − t_{i−1} denotes the distance between adjacent samples.
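As a concrete illustration of Eq. (1), the short NumPy sketch below composites per-sample densities and colors along one ray into a pixel color. The density and color values are synthetic placeholders; a real NeRF would obtain them by querying an MLP at each sample position and view direction.

```python
import numpy as np

def volume_render(sigmas, colors, deltas):
    """Composite per-sample densities/colors into a pixel color (Eq. 1).
    sigmas: (N,) densities, colors: (N, 3) RGB, deltas: (N,) sample spacings."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                 # alpha_i = 1 - exp(-sigma_i * delta_i)
    # Transmittance T_i = exp(-sum_{k<i} sigma_k * delta_k)
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]]))
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)          # C(r) = sum_i T_i * alpha_i * c_i

# Toy ray: 64 samples with made-up densities and colors standing in for MLP outputs.
t = np.linspace(2.0, 6.0, 64)
deltas = np.diff(t, append=t[-1] + (t[-1] - t[-2]))
sigmas = np.exp(-((t - 4.0) ** 2) / 0.1) * 10.0             # a density "bump" near t = 4
colors = np.tile(np.array([0.8, 0.3, 0.2]), (64, 1))
print(volume_render(sigmas, colors, deltas))
```

Because every operation is differentiable, the photometric loss between rendered and observed pixels can be back-propagated to the network that predicts σ and c, which is the core of NeRF optimization.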
NeRFs [MST∗20, NG21, BMT∗21, BMV∗22, VHM∗22, LWC∗23] have seen widespread success in problems such as editing [MBRS∗21, ZLLD21, CZL∗22, YSL∗22], joint optimization of cameras [LMTL21, WWX∗21, CCW∗23, TRMT23], inverse rendering [ZLW∗21, SDZ∗21, BBJ∗21, ZSD∗21, ZZW∗23, LZF∗23], generalization [YYTK21, WWG∗21, CXZ∗21, LFS∗21, JLF22, HZF∗23b], acceleration [RPLG21, GKJ∗21, ZZZ∗23b], and free-viewpoint video [DZY∗21, LSZ∗22, PCPMMN21]. Apart from the above applications, NeRF-based representations can also be used for digital avatar generation, such as face and body reenactment [PDW∗21, GCL∗21, LHR∗21, WCS∗22, HPX∗22]. NeRFs have also been extended to various fields such as robotics [KFH∗22, ZKW∗23, ACC∗22], tomography [RWL∗22, ZLZ∗22], image processing [HZF∗22, MLL∗22b, HZF∗23a], and astronomy [LSC∗22].

3.2.2. Neural Implicit Surfaces

Within the scope of shape reconstruction, a neural network processes a 3D coordinate as input and generates a scalar value, which usually signifies the signed distance to the surface. This method is particularly effective in filling in missing information and generating smooth, continuous surfaces. The implicit surface representation defines the scene's surface as a learnable function f that specifies the signed distance f(x) from each point to the surface. The fundamental surface can then be extracted from the zero-level set, S = {x ∈ R³ | f(x) = 0}, providing a flexible and efficient way to reconstruct complex 3D shapes. Implicit surface representations offer numerous advantages, as they eliminate the need to define mesh templates. As a result, they can represent objects with unknown or changing topology in dynamic scenarios. Specifically, implicit surface representations recover signed distance fields for shape modeling using MLPs with coordinate inputs. These initial proposals sparked widespread enthusiasm and led to various improvements focusing on different aspects, such as enhancing training schemes [DZW∗20, YAK∗20, ZML∗22], leveraging global-local context [XWC∗19, EGO∗20, ZPL∗22], adopting specific
parameterizations [GCV∗19, CTZ20, YRSh21, BSKG22], and employing spatial partitions [GCS∗20, TTG∗20, TLY∗21, WLG∗23].
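To make the zero-level-set definition above tangible, the snippet below evaluates an analytic signed distance function (a sphere standing in for a learned MLP) on a dense grid and extracts S = {x | f(x) = 0} as a triangle mesh with marching cubes. It is a generic illustration under the assumption that scikit-image is available, not the extraction procedure of any particular paper.

```python
import numpy as np
from skimage import measure  # assumed dependency providing marching cubes

def sphere_sdf(x, radius=0.5):
    """Analytic SDF standing in for a learned implicit network f(x)."""
    return np.linalg.norm(x, axis=-1) - radius

# Evaluate f on a regular grid covering [-1, 1]^3.
n = 64
lin = np.linspace(-1.0, 1.0, n)
grid = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1)
sdf = sphere_sdf(grid.reshape(-1, 3)).reshape(n, n, n)

# Extract the zero-level set S = {x | f(x) = 0} as a triangle mesh.
verts, faces, normals, _ = measure.marching_cubes(sdf, level=0.0,
                                                  spacing=(lin[1] - lin[0],) * 3)
verts += lin[0]  # shift from grid coordinates back to world coordinates
print(verts.shape, faces.shape)
```

The same extraction step is what turns a trained SDF network into an explicit mesh that downstream tools can consume.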
NeuS [WLL∗21] and VolSDF [YGKL21] extend the basic NeRF formulation by integrating an SDF into volume rendering, which defines a function to map the signed distance to the density σ. It attains a locally maximal value at surface intersection points. Specifically, the accumulated transmittance T(t) along the ray r(t) = o + td is formulated with a sigmoid function: T(t) = Φ_s(f(t)) = (1 + e^{−s f(t)})^{−1}, where s and f(t) refer to a learnable parameter and the signed distance function of points at r(t), respectively. Discrete opacity values α_i can then be derived as:

α_i = max( (Φ_s(f(t_i)) − Φ_s(f(t_{i+1}))) / Φ_s(f(t_i)), 0 ).   (2)

NeuS employs volume rendering to recover the underlying SDF based on Eqs. (1) and (2). The SDF is optimized by minimizing the photometric loss between the rendering results and ground-truth images.
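A small sketch of the SDF-to-opacity conversion in Eq. (2): given signed distances at consecutive samples along a ray, the discrete opacities follow from the logistic sigmoid Φ_s, and the resulting alphas can then be composited exactly as in Eq. (1). The SDF values and the scale s used below are toy placeholders, not values from any trained model.

```python
import numpy as np

def sdf_to_alpha(sdf_vals, s=64.0):
    """Discrete opacities from SDF samples along a ray (Eq. 2).
    sdf_vals: (N,) signed distances f(t_i); returns (N-1,) alphas."""
    phi = 1.0 / (1.0 + np.exp(-s * sdf_vals))        # logistic sigmoid Phi_s
    alpha = (phi[:-1] - phi[1:]) / np.clip(phi[:-1], 1e-8, None)
    return np.clip(alpha, 0.0, None)                 # max(..., 0)

# Toy ray crossing a surface: SDF goes from positive (outside) to negative (inside).
t = np.linspace(0.0, 1.0, 32)
sdf_vals = 0.4 - t                                   # surface located at t = 0.4
alphas = sdf_to_alpha(sdf_vals)
print(alphas.argmax(), alphas.max())                 # opacity peaks near the crossing
```

The key property is that the opacity concentrates around the sign change of f, so supervising rendered colors indirectly pins down the surface location.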
scenes, utilizing 2D diffusion model backbones and having built-in
Building upon NeuS and VolSDF, NeuralWarp [DBD∗ 22], Geo- tri-plane representation.
NeuS [FXOT22], MonoSDF [YPN∗ 22] leverage prior geometry in-
formation from MVS methods. IRON [ZLLS22], MII [ZSH∗ 22], 3.3.3. Hybrid Surface Representation
and WildLight [CLL23] apply high-fidelity shape reconstruction
via SDF for inverse rendering. HF-NeuS [WSW22] and PET-Neus DMTet, a recent development cited in [SGY∗ 21], is a hybrid three-
[WSW23] integrate additional displacement networks to fit the dimensional surface representation that combines both explicit and
high-frequency details. LoD-NeuS [ZZF∗ 23] adaptively encodes implicit forms to create a versatile and efficient model. It segments
Level of Detail (LoD) features for shape reconstruction. the 3D space into dense tetrahedra, thereby forming an explicit par-
tition. By integrating explicit and implicit representations, DMTet
can be optimized more efficiently and transformed seamlessly into
3.3. Hybrid Representations explicit structures like mesh representations. During the generation
Implicit representations have indeed demonstrated impressive re- process, DMTet can be differentiably converted into a mesh, which
sults in various applications as mentioned above. However, most enables swift high-resolution multi-view rendering. This innovative
of the current implicit methods rely on regression to NeRF or SDF approach offers significant improvements in terms of efficiency and
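The core tri-plane operation is to project a 3D query point onto the three axis-aligned feature planes, bilinearly sample a feature vector from each, and fuse the three (EG3D sums them) before decoding with a small MLP. The sketch below shows only this lookup step with randomly initialized planes; the resolutions and channel counts are illustrative, and the downstream decoder is omitted.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly sample a (R, R, C) feature plane at continuous coords (u, v) in [0, 1]."""
    R = plane.shape[0]
    x, y = u * (R - 1), v * (R - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, R - 1), min(y0 + 1, R - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * plane[x0, y0] + wx * (1 - wy) * plane[x1, y0] +
            (1 - wx) * wy * plane[x0, y1] + wx * wy * plane[x1, y1])

def triplane_feature(planes, point):
    """Fuse features from the XY, XZ, and YZ planes for a point in [-1, 1]^3 (summed, as in EG3D)."""
    x, y, z = (point + 1.0) / 2.0                      # map to [0, 1]
    f_xy = bilinear_sample(planes["xy"], x, y)
    f_xz = bilinear_sample(planes["xz"], x, z)
    f_yz = bilinear_sample(planes["yz"], y, z)
    return f_xy + f_xz + f_yz                          # decoded by a small MLP downstream

rng = np.random.default_rng(0)
planes = {k: rng.normal(size=(128, 128, 32)) for k in ("xy", "xz", "yz")}
print(triplane_feature(planes, np.array([0.1, -0.3, 0.5])).shape)  # (32,)
```

Because the planes are 2D tensors, a 2D generative backbone (GAN or diffusion model) can synthesize them directly, which is why the representation pairs so naturally with image generators.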
3.3.3. Hybrid Surface Representation

DMTet [SGY∗21] is a recently developed hybrid three-dimensional surface representation that combines both explicit and implicit forms to create a versatile and efficient model. It segments the 3D space into dense tetrahedra, thereby forming an explicit partition. By integrating explicit and implicit representations, DMTet can be optimized more efficiently and transformed seamlessly into explicit structures like mesh representations. During the generation process, DMTet can be differentiably converted into a mesh, which enables swift high-resolution multi-view rendering. This innovative approach offers significant improvements in terms of efficiency and versatility in 3D modeling and rendering.

4. Generation Methods

In the past few years, the rapid development of generative models in 2D image synthesis, such as generative adversarial networks (GANs) [GPAM∗14, AQW19], variational autoencoders (VAEs) [KPHL17, PGH∗16, KW13], autoregressive models [RWC∗19, BMR∗20], and diffusion models [HJA20, ND21, SCS∗22], has led to their extension and combination with the scene representations above for 3D generation. Tab. 1 shows well-known examples of 3D generation using generative models and scene representations. These methods may use different scene representations in the generation space, where the representation is produced by the generative model, and in the reconstruction space, where the output is represented. For example, AutoSDF [MCST22a] uses a transformer-based autoregressive model to learn a feature voxel grid and decodes this representation into an SDF for reconstruction. EG3D [CLC∗22] employs GANs to generate samples in a latent space and introduces a tri-plane representation for rendering the output. SSDNeRF [CGC∗23] uses a diffusion model to generate tri-plane features and decodes them into a NeRF for rendering. By leveraging the advantages of neural scene representations and generative models, these approaches have demonstrated remarkable potential in generating realistic and intricate 3D models while maintaining view consistency.
Table 1: Some examples of 3D generation methods. We first divide the methods according to the generative models and their corresponding
representations in generation space. The representations in the reconstruction space determine how the 3D objects are formatted and rendered.
We also list the main supervision and conditions of these methods. For the 2D supervision, a rendering technique is utilized to generate the
images.
Method | Generative Model | Generation Space | Reconstruction Space | Rendering | Supervision | Condition
In this section, we explore a large variety of 3D generation methods, which are organized into four categories based on their algorithmic paradigms: Feedforward Generation (Sec. 4.1), generating results in a forward pass; Optimization-Based Generation (Sec. 4.2), necessitating a test-time optimization for each generation; Procedural Generation (Sec. 4.3), creating 3D models from sets of rules; and Generative Novel View Synthesis (Sec. 4.4), synthesizing multi-view images rather than an explicit 3D representation for 3D generation. An evolutionary tree of 3D generation methods is depicted in Fig. 4, which illustrates the primary branches of generation techniques, along with associated works and subsequent developments. A comprehensive analysis is provided in the subsequent subsections.

4.1. Feedforward Generation

A primary technical approach for generation methods is feedforward generation, which can directly produce 3D representations using generative models. In this section, we explore these methods based on their generative models, as shown in Fig. 5, which include generative adversarial networks (GANs), diffusion models, autoregressive models, variational autoencoders (VAEs) and normalizing flows.

4.1.1. Generative Adversarial Networks

Generative Adversarial Networks (GANs) [GPAM∗14] have demonstrated remarkable outcomes in image synthesis tasks, consisting of a generator G(·) and a discriminator D(·). The generator network G produces synthetic data by accepting a latent code as input, while the discriminator network D differentiates between the data generated by G and real data. Throughout the training optimization process, the generator G and discriminator D are jointly optimized, guiding the generator to create synthetic data that is as realistic as the real data.
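For concreteness, the following PyTorch sketch shows one adversarial training step for a toy voxel GAN: the generator maps a latent code to a 32³ occupancy grid and the discriminator scores real versus generated grids, in the spirit of voxel-based 3D GANs. The architectures, sizes, and random "real" batch are placeholder assumptions rather than any published configuration.

```python
import torch
import torch.nn as nn

latent_dim, vox = 128, 32

# Placeholder generator / discriminator; real 3D GANs use 3D (de)convolutions.
G = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                  nn.Linear(512, vox ** 3), nn.Sigmoid())          # occupancy in [0, 1]
D = nn.Sequential(nn.Linear(vox ** 3, 512), nn.LeakyReLU(0.2),
                  nn.Linear(512, 1))                                # real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(real_voxels):                     # real_voxels: (B, vox**3) from a 3D dataset
    b = real_voxels.shape[0]
    z = torch.randn(b, latent_dim)
    fake = G(z)
    # Discriminator update: push real samples toward 1 and generated samples toward 0.
    d_loss = bce(D(real_voxels), torch.ones(b, 1)) + bce(D(fake.detach()), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator update: fool the discriminator into predicting 1 for generated samples.
    g_loss = bce(D(fake), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

print(gan_step(torch.rand(8, vox ** 3)))       # one step on a random stand-in batch
```

The same alternating update applies whether the supervision is the 3D representation itself or, as in 3D-aware GANs discussed below, 2D renderings of it.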
Building on the impressive results achieved by GANs in 2D image synthesis, researchers have begun to explore the application of these models to 3D generation tasks. The core idea is to marry GANs with various 3D representations, such as point clouds (l-GAN/r-GAN [ADMG18], tree-GAN [SPK19]), voxel grids (3D-GAN [WZX∗16], Z-GAN [KKR18]), meshes (MeshGAN [CBZ∗19]), or SDFs (SurfGen [LLZL21], SDF-StyleGAN [ZLWT22]). In this context, the 3D generation process can be viewed as a series of adversarial steps, where the generator learns to create realistic 3D data from input latent codes, and the discriminator differentiates between generated data and real data. By iteratively optimizing the generator and discriminator networks, GANs learn to generate 3D data that closely resembles the realism of actual data.

For 3D object generation, prior GAN methodologies, such as l-GAN [ADMG18], 3D-GAN [WZX∗16], and Multi-chart Generation [BHMK∗18], directly utilize explicit 3D object representations of real data to instruct the generator networks. Their discriminators employ the 3D representation as supervision, directing the
(Figure 4 diagram: the evolutionary tree groups methods into feedforward generation (3D GANs, 3D-aware GANs, VAEs, and diffusion models over mesh, radiance-field, and tri-plane representations), optimization-based generation (CLIP-based, text-to-3D, image-to-3D, and text-to-texture methods), and generative novel view synthesis (AE/VAE-based, GAN-based, VQGAN-based, and diffusion-based methods), with representative works along each branch.)
Figure 4: The evolutionary tree of 3D generation illustrates the primary branch of generation methods and their developments in recent
years. Specifically, we provide a comprehensive overview of the rapidly growing literature on generation methods, categorized by the type of
algorithmic paradigms, including feedforward generation, optimization-based generation, procedural generation, and generative novel view
synthesis.
Figure 5: Exemplary feedforward 3D generation models. We showcase several representative pipelines of feedforward 3D generation models,
including (a) generative adversarial networks, (b) diffusion models, (c) autoregressive models, (d) variational autoencoders and (e) normal-
izing flows.
generator to produce synthetic data that closely resembles the realism of actual data. During training, specialized generators generate the corresponding supervisory 3D representations, such as point clouds, voxel grids, and meshes. Some studies, like SurfGen [LLZL21], have progressed further to generate intermediate implicit representations and then convert them to the corresponding 3D representations instead of directly generating explicit ones, achieving superior performance. In particular, the generators of l-GAN [ADMG18], 3D-GAN [WZX∗16], and Multi-chart Generation [BHMK∗18] generate the positions of point clouds, voxel grids, and meshes directly, respectively, taking a latent code as input, while SurfGen [LLZL21] generates an implicit representation and then extracts the explicit 3D representation.

In addition to GANs that directly generate various 3D representations, researchers have suggested incorporating 2D supervision through differentiable rendering to guide 3D generation, which is commonly referred to as 3D-aware GANs. Given the abundance of 2D images, GANs can better understand the implicit relationship between 2D and 3D data than by relying solely on 3D supervision. In this approach, the generator of the GAN renders 2D images from an implicit or explicit 3D representation. The discriminator then distinguishes between rendered 2D images and real 2D images to guide the training of the generator.

Specifically, HoloGAN [NPLT∗19] first learns a 3D representation of 3D features, which is then projected to 2D features by the camera pose. These 2D feature maps are then rendered to generate the final images. BlockGAN [NPRM∗20] extends it to generate 3D features of both background and foreground objects and combine them into 3D features for the whole scene. In addition, PrGAN [GMW17] and PlatonicGAN [HMR19a] employ an explicit voxel grid structure to represent 3D shapes and use a render layer to create images. Other methods like DIB-R [CZ19], ConvMesh [PSH∗20], Textured3DGAN [PKHL21] and GET3D [GSW∗22] propose GAN frameworks for generating triangle meshes and textures using only 2D supervision.

Building upon representations such as NeRFs, GRAF [SLNG20] proposes generative radiance fields utilizing adversarial frameworks and achieves controllable image synthesis at high resolutions. pi-GAN [CMK∗21a] introduces SIREN-based implicit GANs with FiLM conditioning to further improve image quality and view consistency. GIRAFFE [NG21] represents scenes as compositional generative neural feature fields to model multi-object scenes. Furthermore, EG3D [CLC∗22] first proposes a hybrid explicit-implicit tri-plane representation that is both efficient and expressive and has been widely adopted in many following works.

4.1.2. Diffusion Models

Diffusion models [HJA20, RBL∗22a] are a class of generative models that learn to generate data samples by simulating a diffusion process. The key idea behind diffusion models is to transform the original data distribution into a simpler distribution, such as a Gaussian, through a series of noise-adding steps called the forward process. The model then learns to reverse this process, known as the backward process, to generate new samples that resemble the original data distribution. The forward process can be thought of as gradually adding noise to the original data until it reaches the target distribution. The backward process, on the other hand, involves iteratively denoising samples from that distribution to generate the final output. By learning this denoising process, diffusion models can effectively capture the underlying structure and patterns of the data, allowing them to generate high-quality and diverse samples.

Building on the impressive results achieved by diffusion models in generating 2D images, researchers have begun to explore the applications of these models to 3D generation tasks. The core idea is to marry denoising diffusion models with various 3D representations. In this context, the 3D generation process can be viewed as a series of denoising steps, reversing the diffusion process from input 3D data to Gaussian noise. The diffusion models learn to generate 3D data from this noisy distribution through denoising.
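To ground the forward/backward description, here is a minimal DDPM-style training step for a point-cloud denoiser in PyTorch: a clean sample is noised according to a linear schedule and the network regresses the added noise. The tiny MLP denoiser, the schedule, and the flattened point-cloud input are illustrative assumptions; actual 3D diffusion models use timestep embeddings and much larger, representation-specific architectures.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                      # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)             # cumulative product \bar{alpha}_t

n_points = 256
denoiser = nn.Sequential(nn.Linear(n_points * 3 + 1, 512), nn.SiLU(),
                         nn.Linear(512, n_points * 3))     # predicts the added noise
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

def diffusion_step(x0):
    """One training step; x0 is a batch of clean point clouds, shape (B, n_points*3)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].unsqueeze(1)
    # Forward process: q(x_t | x_0) = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * noise.
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    # The network learns the reverse process by predicting the injected noise.
    pred = denoiser(torch.cat([x_t, t.float().unsqueeze(1) / T], dim=1))
    loss = ((pred - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(diffusion_step(torch.randn(8, n_points * 3)))
```

Sampling then runs the learned denoiser iteratively from pure Gaussian noise back to a clean 3D sample; the same recipe applies whether the denoised quantity is a point cloud, an SDF grid, or tri-plane features.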
Specifically, Cai et al. [CYAE∗20] build upon a denoising score-matching framework to learn distributions for point cloud generation. PVD [ZDW21] combines the benefits of both point-based and voxel-based representations for 3D generation. The model learns a diffusion process that transforms point clouds into voxel grids and vice versa, effectively capturing the underlying structure and patterns of the 3D data. Similarly, DPM [LH21] focuses on learning a denoising process for point cloud data by iteratively denoising noisy point cloud samples. Following the advancements made by PVD [ZDW21] and DPM [LH21], LION [ZVW∗22] builds upon the idea of denoising point clouds and introduces the concept of denoising in the latent space of point clouds, which is analogous to the shift in 2D image generation from denoising pixels to denoising latent-space representations. To generate point clouds from text prompts, Point·E [NJD∗22] initially employs the GLIDE model [NDR∗21] to generate text-conditional synthetic views, followed by the production of a point cloud using a diffusion model conditioned on the generated image. By training the model on a large-scale 3D dataset, it achieves remarkable generalization capabilities.

In addition to point clouds, MeshDiffusion [LFB∗23], Tetrahedral Diffusion Models [KPWS22], and SLIDE [LWA∗23] explore the application of diffusion models to mesh generation. MeshDiffusion [LFB∗23] adopts the DMTet representation [SGY∗21] for meshes and optimizes the model by treating the optimization of signed distance functions as a denoising process. Tetrahedral Diffusion Models [KPWS22] extend diffusion models to tetrahedral meshes, learning displacement vectors and signed distance values on the tetrahedral grid through denoising. SLIDE [LWA∗23] explores diffusion models on sparse latent points for mesh generation.

Apart from applying diffusion operations to explicit 3D representations, some works focus on performing the diffusion process on implicit representations. SSDNeRF [CGC∗23], DiffRF [MSP∗23] and Shap·E [JN23] operate on 3D radiance fields, while SDF-Diffusion [SKJ23], LAS-Diffusion [ZPW∗23], Neural Wavelet-domain Diffusion [HLHF22], One-2-3-45++ [LXJ∗23], SDFusion [CLT∗23] and 3D-LDM [NKR∗22] focus on signed distance field representations. Specifically, Diffusion-SDF [LDZL23] utilizes a voxel-shaped SDF representation to generate high-quality and continuous 3D shapes. 3D-LDM [NKR∗22] creates neural implicit representations of SDFs by first using a diffusion model to generate the latent space of an auto-decoder; subsequently, the latent space is decoded into SDFs to acquire 3D shapes. Moreover, Rodin [WZZ∗23] and Shue et al. [SCP∗23] adopt the tri-plane as the representation and optimize the tri-plane features using diffusion methods. Shue et al. [SCP∗23] generate 3D shapes using occupancy networks, while Rodin [WZZ∗23] obtains 3D shapes through volumetric rendering.

These approaches showcase the versatility of diffusion models in managing various 3D representations, including both explicit and implicit forms. By tailoring the denoising process to different representation types, diffusion models can effectively capture the underlying structure and patterns of 3D data, leading to improved generation quality and diversity. As research in this area continues to advance, it is expected that diffusion models will play a crucial role in pushing the boundaries of 3D shape generation across a wide range of applications.

4.1.3. Autoregressive Models

A 3D object can be represented as a joint probability of the occurrences of multiple 3D elements:

p(x_0, x_1, ..., x_n),   (3)

where x_i is the i-th element, which can be the coordinate of a point or a voxel. A joint probability with a large number of random variables is usually hard to learn and estimate. However, one can factorize it into a product of conditional probabilities:

p(x_0, x_1, ..., x_n) = p(x_0) ∏_{i=1}^{n} p(x_i | x_{<i}),   (4)

which enables learning the conditional probabilities and estimating the joint probability via sampling. Autoregressive models for data generation are a type of model that specifies the current output depending on its previous outputs. Assuming that the elements x_0, x_1, ..., x_n form an ordered sequence, a model can be trained by providing it with the previous inputs x_0, ..., x_{i−1} and supervising it to fit the probability of the outcome x_i:

p(x_i | x_{<i}) = f(x_0, ..., x_{i−1}),   (5)

where the conditional probabilities are learned by the model function f. This training process is often called teacher forcing. The model can then be used to autoregressively generate the elements step by step:

x_i = argmax_x p(x | x_{<i}).   (6)

State-of-the-art generative models such as GPTs [RWC∗19, BMR∗20] are autoregressive generators with Transformer networks as the model function. They achieve great success in generating natural language and images. In 3D generation, several studies have been conducted based on autoregressive models. In this section, we discuss some notable examples of employing autoregressive models for 3D generation.
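A minimal sketch of the factorization in Eqs. (4)-(6): a tiny model predicts a categorical distribution over the next quantized coordinate given the previously generated ones, and elements are sampled (or arg-maxed) step by step. The vocabulary of 256 quantized values and the fixed-length MLP predictor are illustrative stand-ins for the transformer networks used by methods such as PointGrow or PolyGen.

```python
import torch
import torch.nn as nn

vocab, max_len = 256, 64                      # 256 quantized coordinate values per element

class TinyAutoregressor(nn.Module):
    """Predicts p(x_i | x_<i) from a zero-padded prefix (stand-in for a transformer)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab + 1, 32)          # extra index = padding token
        self.net = nn.Sequential(nn.Linear(max_len * 32, 256), nn.ReLU(),
                                 nn.Linear(256, vocab))
    def forward(self, prefix):                            # prefix: (B, max_len) long tensor
        h = self.embed(prefix).flatten(1)
        return self.net(h)                                # logits over the next element

model = TinyAutoregressor()

@torch.no_grad()
def generate(n_elements=16):
    seq = torch.full((1, max_len), vocab)                 # start from an all-padding context
    out = []
    for i in range(n_elements):
        logits = model(seq)                               # p(x_i | x_<i), Eq. (5)
        nxt = torch.distributions.Categorical(logits=logits).sample().item()
        # For deterministic decoding, Eq. (6), use logits.argmax(-1).item() instead.
        seq[0, i] = nxt
        out.append(nxt)
    return out

print(generate())
```

Training with teacher forcing simply feeds ground-truth prefixes and applies a cross-entropy loss on the predicted logits against the true next element.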
PointGrow [SWL∗20b] generates point clouds using an autoregressive network with self-attention context-awareness operations in a point-by-point manner. Given its previously generated points, PointGrow reforms the points by axes and passes them into three branches. Each branch takes the inputs to predict a coordinate value of one axis. The model can also condition on an embedding vector, which can be a class category or an image, to generate point clouds. Inspired by the network from PointGrow, PolyGen [NGEB20b] generates 3D meshes with two transformer-based networks, one for vertices and one for faces. The vertex transformer autoregressively generates the next vertex coordinate based on previous vertices. The face transformer takes all the output vertices as context to generate faces. PolyGen can condition on a context of object classes or images, which are cross-attended by the transformer networks.

Recently, AutoSDF [MCST22b] generates 3D shapes represented by a volumetric truncated-signed distance function (T-SDF). AutoSDF learns a quantized codebook regarding local regions of T-SDFs using a VQ-VAE. The shapes are then represented by the codebook tokens and learned by a transformer-based network in a non-sequential autoregressive manner. In detail, given previous
Table 2: Quantitative comparison of image-to-3D methods on surface reconstruction. We summarize the Chamfer distance and volume IoU as the metrics to evaluate the quality of surface reconstruction.

Method | Chamfer Distance ↓ | Volume IoU ↑
RealFusion [MKLRV23] | 0.0819 | 0.2741
Magic123 [QMH∗23] | 0.0516 | 0.4528
Make-it-3D [TWZ∗23] | 0.0732 | 0.2937
One-2-3-45 [LXJ∗23] | 0.0629 | 0.4086
Point-E [NJD∗22] | 0.0426 | 0.2875
Shap-E [JN23] | 0.0436 | 0.3584
Zero-1-to-3 [LWVH∗23] | 0.0339 | 0.5035
SyncDreamer [LLZ∗23] | 0.0261 | 0.5421

Table 3: Quantitative comparison of image-to-3D methods on novel view synthesis. We report the CLIP-Similarity, PSNR, and LPIPS as the metrics to evaluate the quality of view synthesis.

Method | CLIP-Similarity ↑ | PSNR ↑ | LPIPS ↓
RealFusion [MKLRV23] | 0.735 | 20.216 | 0.197
Magic123 [QMH∗23] | 0.747 | 25.637 | 0.062
Make-it-3D [TWZ∗23] | 0.839 | 20.010 | 0.119
One-2-3-45 [LXJ∗23] | 0.788 | 23.159 | 0.096
Zero-1-to-3 [LWVH∗23] | 0.759 | 25.386 | 0.068
SyncDreamer [LLZ∗23] | 0.837 | 25.896 | 0.059
sentation via VQGAN to obtain an abstract latent space for training transformers. ViewFormer [KDSB22] also uses a two-stage training consisting of a Vector Quantized Variational Autoencoder (VQ-VAE) codebook and a transformer model, and [SMP∗22] employs an encoder-decoder model based on transformers to learn an implicit representation.

On the other hand, generative adversarial networks can produce high-quality results in image synthesis and consequently have been applied to novel view synthesis [WGSJ20, KLY∗21, RFJ21, LTJ∗21, LWSK22]. Some methods [WGSJ20, KLY∗21, RFJ21] maintain a 3D point cloud as the representation, which can be projected onto novel views followed by a GAN to hallucinate the missing regions and synthesize the output image, while [LTJ∗21] and [LWSK22] focus on long-range view generation from a single view with adversarial training. At an earlier stage of deep learning methods, when auto-encoders and variational autoencoders began to be explored, they were also used to synthesize novel views [KWKT15, ZTS∗16, TDB16, CSH19].

In summary, generative novel view synthesis can be regarded as a subset of image synthesis techniques and continues to evolve alongside advancements in image synthesis methods. Besides the generative models typically involved, determining how to integrate information from the input view as a condition for synthesizing the novel view is the primary issue these methods are concerned with.

5. Datasets for 3D Generation

With the rapid development of technology, the means of data acquisition and storage have become more feasible and affordable, resulting in an exponential increase in the amount of available data. As data accumulates, the paradigm for problem solving gradually shifts from model-driven to data-driven approaches, which in turn contributes to the growth of "Big Data" and "AIGC". Nowadays, data plays a crucial role in ensuring the success of algorithms. A well-curated dataset can significantly enhance a model's robustness and performance. On the contrary, noisy and flawed data may cause model bias that requires considerable effort in algorithm design to rectify.

In this section, we go over the common data used for 3D generation. Depending on the methods employed, this usually includes 3D data (Section 5.1), multi-view image data (Section 5.2), and single-view image data (Section 5.3), which are also summarized in Tab. 4.

5.1. Learning from 3D Data

3D data can be collected by RGB-D sensors and other technologies for scanning and reconstruction. Apart from 3D generation, 3D data is also widely used for other tasks, such as improving classical 2D vision task performance through data synthesis, environment simulation for training embodied AI agents, and 3D object understanding. One popular and frequently used 3D model database in the early stage is the Princeton Shape Benchmark [SMKF04], which contains about 1,800 polygonal models collected from the World Wide Web. [KXD12] constructs a special rig that contains a 3D digitizer, a turntable, and a pair of cameras mounted on a sled that can move along a bent rail to capture the KIT object models database. To evaluate algorithms that detect and estimate objects in an image given 3D models, [LPT13] introduces a dataset of 3D IKEA models obtained from Google Warehouse. Other 3D model databases are presented for tasks like robotic manipulation [CWS∗15, MCL20], 3D shape retrieval [LLL∗14], and 3D shape modeling from a single image [SWZ∗18]. BigBIRD [SSN∗14] presents a large-scale dataset of 3D object instances that also includes multi-view images and depths, camera pose information, and segmented objects for each image.

However, those datasets are very small and only contain hundreds or thousands of objects. Collecting, organizing, and labeling larger datasets in the computer vision and graphics communities is needed for data-driven methods of 3D content. To address this, ShapeNet [CFG∗15] is introduced to build a large-scale repository of 3D CAD models of objects. The core of ShapeNet covers 55 common object categories with about 51,300 models that have manually verified category and alignment annotations. Thingi10K [ZJ16] collects 10,000 3D printing models from the online repository Thingiverse, while PhotoShape [PRFS18] produces 11,000 photorealis-
tic, relightable 3D shapes based on online data. Other datasets such as 3D-Future [FJG∗21], ABO [CGD∗22], GSO [DFK∗22] and OmniObject3D [WZF∗23] try to improve the texture quality but only contain thousands of models. Recently, Objaverse [DSS∗23] presents a large-scale corpus of 3D objects that contains over 800K 3D assets for research in the field of AI and makes a step toward a large-scale 3D dataset. Objaverse-XL [DLW∗23] further extends Objaverse to a larger 3D dataset of 10.2M unique objects from a diverse set of sources. These large-scale 3D datasets have the potential to facilitate large-scale training and boost the performance of 3D generation.

5.2. Learning from Multi-view Images

3D objects have been traditionally created through manual 3D modeling, object scanning, conversion of CAD models, or combinations of these techniques [DFK∗22]. These techniques may only produce synthetic data or real-world data of specific objects with limited reconstruction accuracy. Therefore, some datasets directly provide multi-view images in the wild, which are also widely used in many 3D generation methods. ScanNet [DCS∗17] introduces an RGB-D video dataset containing 2.5M views in 1,513 scenes, and Objectron [AZA∗21] contains object-centric short videos and includes 4 million images in 14,819 annotated videos, of which only a limited number cover the full 360 degrees. CO3D [RSH∗21] extends the dataset from [HRL∗21] and increases the size to nearly 19,000 videos capturing objects from 50 MS-COCO categories, and has been widely used in the training and evaluation of novel view synthesis and 3D generation or reconstruction methods. Recently, MVImgNet [YXZ∗23] presents a large-scale dataset of multi-view images that collects 6.5 million frames from 219,188 videos by shooting videos of real-world objects in human daily life. Compared with these works, other lines of work provide multi-view datasets of small-scale RGB-D videos [LBRF11, SHG∗22, CX∗23], large-scale synthetic videos [TME∗22], or egocentric videos [ZXA∗23]. A large-scale dataset is still a remarkable trend for deep learning methods, especially for generation tasks.

5.3. Learning from Single-view Images

3D generation methods usually rely on multi-view images or 3D ground truth to supervise the reconstruction and generation of the 3D representation. Synthesizing high-quality multi-view images or 3D shapes using only collections of single-view images is a challenging problem. Benefiting from the unsupervised training of generative adversarial networks, 3D-aware GANs are introduced that can learn 3D representations in an unsupervised way from natural images. Therefore, several single-view image datasets are proposed and commonly used for these 3D generation methods. Although many large-scale image datasets have been presented for 2D generation, it is hard to directly use them for 3D generation due to the high uncertainty of this problem. Normally, these image datasets only contain a specific category or domain. FFHQ [KLA19], a real-world human face dataset consisting of 70,000 high-quality images at 1024² resolution, and AFHQ [CUYH20], an animal face dataset consisting of 15,000 high-quality images at 512² resolution, are introduced for 2D image synthesis and are frequently used for 3D generation based on 3D-aware GANs. In the domain of the human body, SHHQ [FLJ∗22] and DeepFashion [LLQ∗16] have been adopted for 3D human generation. In terms of objects, many methods [LSMG20, GMW17, HMR19a, ZZZ∗18, WZX∗16] render synthetic single-view datasets using several major object categories of ShapeNet, while GRAF [SLNG20] renders 150k chairs from PhotoShape [PRFS18]. Moreover, the CelebA [LLWT15] and Cats [ZST08] datasets are also commonly used to train models like HoloGAN [NPLT∗19] and pi-GAN [CMK∗21a]. Since single-view images are easy to obtain, these methods can collect their own datasets for the tasks.

Table 5: Recent 3D human generation techniques and their corresponding input-output formats.

Method | Input Condition | Output Texture
ICON [XYTB22] | Single image | ✗
ECON [XYC∗23] | Single image | ✗
gDNA [CJS∗22] | Latent | ✗
Chupa [KKL∗23] | Text/Latent | ✗
ELICIT [HYL∗23] | Single image | ✓
TeCH [HYX∗23] | Single image | ✓
Get3DHuman [XKJ∗23] | Latent | ✓
EVA3D [HCL∗22] | Latent | ✓
AvatarCraft [JWZ∗23] | Text | ✓
DreamHuman [KAZ∗23] | Text | ✓
TADA [LYX∗24] | Text | ✓

6. Applications

In this section, we introduce various 3D generation tasks (Sec. 6.1-6.3) and closely related 3D editing tasks (Sec. 6.4). The generation tasks are divided into three categories, including 3D human generation (Sec. 6.1), 3D face generation (Sec. 6.2), and generic object and scene generation (Sec. 6.3).

6.1. 3D Human Generation

With the emergence of the metaverse and the advancements in virtual 3D social interaction, the field of 3D human digitization and generation has gained significant attention in recent years. Different from general 3D generation methods that focus on category-free rigid objects with simple geometric structures [PJBM23, LXZ∗23], most 3D human generation methods aim to tackle the complexities of articulated pose changes and the intricate geometric details of clothing. Tab. 5 presents a compilation of notable 3D human body generation methods in recent years, organized according to the input conditions and the output format of the generated 3D human bodies. Some results of these methods are shown in Fig. 8. Specifically, in terms of the input condition, current 3D human body generation methods can be categorized based on the driving factors, including latent features randomly sampled from a pre-defined latent space [MYR∗20, CJS∗22, HCL∗22], a single reference image [APMTM19, CPA∗21, XYC∗23, HYX∗23, ZLZ∗23], or text prompts [KKL∗23, JWZ∗23, KAZ∗23, LYX∗24]. According to the form of the final output, these methods can be classified into two categories: textureless shape generation [APMTM19,
XYTB22, XYC∗23, CJS∗22, MYR∗20, CPA∗21, KKL∗23] and textured body generation [AZS22, LYX∗24, HYL∗23, KAZ∗23, XKJ∗23, HYX∗23, ZLZ∗23]. While the latter focuses on generating fully textured 3D clothed humans, the former aims to obtain textureless body geometry with realistic details.

In terms of textureless shape generation, early works [CPB∗20, OBB20, LXC∗21] attempt to predict SMPL parameters from the input image and infer a skinned SMPL mesh as the generated 3D representation of the target human. Nevertheless, such a skinned body representation fails to represent the geometry of clothes. To overcome this issue, [APMTM19, XYTB22, XYC∗23] leverage a pre-trained neural network to infer the normal information and combine it with the skinned SMPL mesh to deduce a clothed full-body geometry with details. In contrast to such methods, which require reference images as input, CAPE [MYR∗20] proposes a generative 3D mesh model conditioned on latents of SMPL pose and clothing type to form the clothing deformation from the SMPL body. gDNA [CJS∗22] introduces a generation framework conditioned on latent codes of shape and surface details to learn the underlying statistics of 3D clothing details from scanned human datasets via an adversarial loss. Different from the previous methods that generate an integrated 3D clothed human body geometry, SMPLicit [CPA∗21] adopts an implicit model conditioned on shape and pose parameters to individually generate diverse 3D clothes. By combining the SMPL body and the associated generated 3D clothes, SMPLicit is able to produce 3D clothed human shapes. To further improve the quality of the generated human shape, Chupa [KKL∗23] introduces diffusion models to generate realistic human geometry and decomposes the 3D generation task into 2D normal map generation and normal-map-based 3D reconstruction.

As text-to-image models [RKH∗21b, RBL∗22b, SCS∗22] continue to advance rapidly, the field of text-to-3D has also reached its pinnacle of development. For text-driven human generation, existing methods inject priors from pre-trained text-to-image models into the 3D human generation framework to achieve text-driven textured human generation, such as AvatarCLIP [HZP∗22], AvatarCraft [JWZ∗23], DreamHuman [KAZ∗23], and TADA [LYX∗24]. Indeed, text-driven human generation methods effectively address the challenge of limited 3D training data and significantly enhance the generation capabilities of 3D human assets. Nevertheless, in contrast to the generation of unseen 3D humans, it is also important in real-life applications to generate a 3D human body from a specified single image. In terms of single-image-conditioned 3D human generation methods, producing generated results with textures and geometries aligned with the input reference image is widely studied. To this end, PIFu [SHN∗19], PaMIR [ZYLD21], and PHORHUM [AZS22] propose learning-based 3D generators trained on scanned human datasets to infer human body geometry and texture from input images. However, their performance is constrained by the limitations of the training data. Consequently, they struggle to accurately infer detailed textures and fine geometry from in-the-wild input images, particularly in areas that are not directly visible in the input. To achieve data-free 3D human generation, ELICIT [HYL∗23], Human-SGD [AST∗23], TeCH [HYX∗23], and HumanRef [ZLZ∗23] leverage priors of pre-trained CLIP [RKH∗21b] or image diffusion models [RBL∗22b, SCS∗22] to predict the geometry and texture based on the input reference image without the need for 3D datasets, and achieve impressive quality in the generated 3D clothed humans.
X. Li & Q. Zhang & D. Kang & W. Cheng & Y. Gao et al. / Advances in 3D Generation: A Survey 17
NerFace, CVPR' 21
NHA, CVPR'22
PiCA, CVPR'21
NeRFBlendshape, TOG'22
i3DMM, CVPR'21
3D Face Neural Implicit
ImFace, CVPR'22
Generation 3DMMs
NPHM, CVPR'23
HoloGAN, PlatonicGAN
PanoHead, CVPR'23
RODIN, CVPR'23
Next3D, CVPR'23
StyleAvatar3D, arXiv'23
Table 6: Applications of general scene generation methods. shape generation from random noise. In addition to these latent-
based methods, another important research direction is text-driven
Texture 3D asset generation [CCS∗ 19, LWQF22]. For example, 3D-LDM
Methods Type Condition [NKR∗ 22], SDFusion [CLT∗ 23], and Diffusion-SDF [LDZL23]
Generation
achieve text-to-3D shape generation by designing the diffusion
PVD [ZDW21] Object-Centered Latent % process in 3D feature space. Due to such methods requiring 3D
NFD [SCP∗ 23] Object-Centered Latent % datasets to train the diffusion-based 3D generators, they are limited
Point-E [NJD∗ 22] Object-Centered Text % to the training data in terms of the categories and diversity of gen-
Diffusion-SDF [LDZL23] Object-Centered Text % erated results. By contrast, CLIP-Forge [SCL∗ 22], CLIP-Sculptor
Deep3DSketch+ [CFZ∗ 23] Object-Centered Sketch % [SFL∗ 23], and Michelangelo [ZLC∗ 23] directly employ the prior
Zero-1-to-3 [LWVH∗ 23] Object-Centered Single-Image ! of the pre-trained CLIP model to constrain the 3D generation pro-
Make-It-3D [TWZ∗ 23] Object-Centered Single-Image ! cess, effectively improving the generalization of the method and the
GET3D [GSW∗ 22] Object-Centered Latent ! diversity of generation results. Unlike the above latent-conditioned
EG3D [CLC∗ 22] Object-Centered Latent ! or text-driven 3D generation methods, to generate 3D assets with
CLIP-Mesh [MKXBP22] Object-Centered Text ! expected shapes, there are some works [HMR19a, CFZ∗ 23] that
DreamFusion [PJBM23] Object-Centered Text ! explore image or sketch-conditioned generation.
ProlificDreamer [WLW∗ 23] Object-Centered Text !
PixelSynth [RFJ21] Outward-Facing Single-Image !
In comparison to textureless 3D shape generation, textured
DiffDreamer [CCP∗ 23] Outward-Facing Single-Image !
3D asset generation not only produces realistic geometric struc-
Xiang et al. [XYHT23] Outward-Facing Latent !
tures but also captures intricate texture details. For example,
CC3D [BPP∗ 23] Outward-Facing Layout !
HoloGAN [NPLT∗ 19], GET3D [GSW∗ 22], and EG3D [CLC∗ 22]
Text2Room [HCO∗ 23] Outward-Facing Text !
employ GAN-based 3D generators conditioned on latent vec-
Text2NeRF [ZLW∗ 23] Outward-Facing Text !
tors to produce category-specific textured 3D assets. By contrast,
text-driven 3D generation methods rely on the prior knowledge
of pre-trained large-scale text-image models to enable category-
free 3D asset generation. For instance, CLIP-Mesh [MKXBP22],
stream applications (e.g. editing, talking head generation) are en- Dream Fields [JMB∗ 22], and PureCLIPNeRF [LC22] employ
abled or become less data-hungry, including 3D consistent edit- the prior of CLIP model to constrain the optimization process
ing [SWW∗ 23, SWZ∗ 22, SWS∗ 22, LFLSY∗ 23, JCL∗ 22], 3D talk- and achieve text-driven 3D generation. Furthermore, DreamFu-
ing head generation [BFW∗ 23, XSJ∗ 23, WDY∗ 22], etc. sion [PJBM23] and SJC [WDL∗ 23] propose a score distillation
sampling (SDS) method to achieve 3D constraint which priors ex-
tracted from pre-trained 2D diffusion models. Then, some meth-
6.3. General Scene Generation
ods further improve the SDS-based 3D generation process in terms
Different from 3D human and face generation, which can use ex- of generation quality, multi-face problem, and optimization effi-
isting prior knowledge such as SMPL and 3DMM, general scene ciency, such as Magic3D [LGT∗ 23], Latent-NeRF [MRP∗ 23b],
generation methods are more based on the similarity of semantics Fantasia3D [CCJJ23], DreamBooth3D [RKP∗ 23], HiFA [ZZ23],
or categories to design a 3D model generation framework. Based on ATT3D [LXZ∗ 23], ProlificDreamer [WLW∗ 23], IT3D [CZY∗ 23],
the differences in generation results, as shown in Fig. 11 and Tab. 6, DreamGaussian [TRZ∗ 23], and CAD [WPH∗ 23]. On the other
we further subdivide general scene generation into object-centered hand, distinct from text-driven 3D generation, single-image-
asset generation and outward-facing scene generation. conditioned 3D generation is also a significant research direction
[LWVH∗ 23, MKLRV23, CGC∗ 23, WLY∗ 23, KDJ∗ 23].
6.3.1. Object-Centered Asset Generation
6.3.2. Outward-Facing Scene Generation
The field of object-centered asset generation has seen significant
advancements in recent years, with a focus on both textureless Early scene generation methods often require specific scene data
shape generation and textured asset generation. For the textureless for training to obtain category-specific scene generators, such as
shape generation, early works use GAN-based networks to learn GAUDI [BGA∗ 22] and the work of Xiang et al. [XYHT23],
a mapping from latent space to 3D object space based on spe- or implement a single scene reconstruction based on the input
cific categories of 3D data, such as 3D-GAN [WZX∗ 16], Holo- image, such as PixelSynth [RFJ21] and Worldsheet [HRBP21].
GAN [NPLT∗ 19], and PlatonicGAN [HMR19b]. However, limited However, these methods are either limited by the quality of
by the generation capabilities of GANs, these methods can only the generation or by the extensibility of the scene. With the
generate rough 3D assets of specific categories. To improve the rise of diffusion models in image inpainting, various methods
quality of generated results, SingleShapeGen [WZ22] leverages a are beginning to use the scene completion capabilities of dif-
pyramid of generators to generate 3D assets in a coarse to fine fusion models to implement scene generation tasks [CCP∗ 23,
manner. Given the remarkable achievements of diffusion models HCO∗ 23,ZLW∗ 23]. Recently, SceneScape [FAKD23], Text2Room
in image generation, researchers are directing their attention to- [HCO∗ 23], Text2NeRF [ZLW∗ 23], and LucidDreamer [CLN∗ 23]
wards the application of diffusion extensions in the realm of 3D propose progressive inpainting and updating strategies for gener-
generation. Thus, subsequent methods [LH21, ZDW21, HLHF22, ating realistic 3D scenes using pre-trained diffusion models. Sce-
SCP∗ 23, EMS∗ 23] explore the use of diffusion processes for 3D neScape and Text2Room utilize explicit polygon meshes as their
X. Li & Q. Zhang & D. Kang & W. Cheng & Y. Gao et al. / Advances in 3D Generation: A Survey 19
Figure 11: Some examples of general scene generation methods. 3D generation results source from Deep3DSketch+ [CFZ∗ 23],
NFD [SCP∗ 23], Diffusion-SDF [LDZL23], Make-It-3D [TWZ∗ 23], GET3D [GSW∗ 22], ProlificDreamer [WLW∗ 23], Diff-
Dreamer [CCP∗ 23], CC3D [BPP∗ 23], Xiang et al. [XYHT23], Text2NeRF [ZLW∗ 23], and LucidDreamer [CLN∗ 23].
3D representation during the generation procedure. However, this methods [WJC∗ 23, HTE∗ 23] can support textual format style def-
choice of representation imposes limitations on the generation of inition by utilizing the learned prior knowledge from large-scale
outdoor scenes, resulting in stretched geometry and blurry artifacts language-vision models such as CLIP [RKH∗ 21a] and Stable Dif-
in the fusion regions of mesh faces. In contrast, Text2NeRF and fusion [RBL∗ 22a]. Other than commonly seen artistic style trans-
LucidDreamer adopt implicit representations, which offer the abil- fer, there also exist some special types of “style” manipulation
ity to model fine-grained geometry and textures without specific tasks such as seasonal and illumination manipulations [LLF∗ 23,
scene requirements. Consequently, Text2NeRF and LucidDreamer CZL∗ 22, HTE∗ 23, CYL∗ 22] and climate changes.
can generate both indoor and outdoor scenes with high fidelity.
Single-Object Manipulation. There are many papers specifically
aim at manipulating a single 3D object. For example, one rep-
6.4. 3D Editing resentative task is texturing or painting a given 3D object (usu-
Based on the region where editing happens, we classify the existing ally in mesh format) [MBOL∗ 22, LZJ∗ 22, MRP∗ 23b, CCJJ23,
works into global editing and local editing. CSL∗ 23]. Except for diffuse albedo color and vertex displace-
ment [MBOL∗ 22, MZS∗ 23, LYX∗ 24], other common property
maps may be involved, including normal map [CCJJ23, LZJ∗ 22],
6.4.1. Global Editing
roughness map [CCJJ23, LZJ∗ 22], specular map [LZJ∗ 22], and
Global editing works aim at changing the appearance or geom- metallic map [CCJJ23], etc. A more general setting would be
etry of the competing 3D scene globally. Different from local directly manipulating a NeRF-like object [WCH∗ 22, LZJ∗ 22,
editing, they usually do not intentionally isolate a specific region TLYCS22, YBZ∗ 22]. Notably, the human face/head is one special
from a complete and complicated scene or object. Most commonly, type of object that has drawn a lot of interest [ATDN23, ZQL∗ 23].
they only care if the resultant scene is in a desired new “style” In the meanwhile, many works focus on fine-grained local face
and resembles (maintains some features of) the original scene. manipulation, including expression and appearance manipula-
Most representative tasks falling into this category include styliza- tion [SWZ∗ 22, SWS∗ 22, LFLSY∗ 23, JCL∗ 22, WDY∗ 22, XSJ∗ 23,
tion [HTS∗ 21,HHY∗ 22,FJW∗ 22,ZKB∗ 22,WJC∗ 23,HTE∗ 23], and MLL∗ 22a, ZLW∗ 22] and face swapping [LMY∗ 23] since human
single-object manipulation (e.g. re-texturing [MBOL∗ 22, LZJ∗ 22, face related understanding tasks (e.g. recognition, parsing, attribute
MRP∗ 23b, CCJJ23]) as shown in Fig. 12. classification) have been extensively studied previously.
7. Open Challenges
Iron Man Brick Lamp Colorful Crochet Candle Astronaut Horse The quality and diversity of 3D generation results have experienced
Texturing significant progress due to advancements in generative models, 3D
representations, and algorithmic paradigms. Considerable attention
has been drawn to 3D generation recently as a result of the suc-
cess achieved by large-scale models in natural language process-
ing and image generation. However, numerous challenges remain
before the generated 3D models can meet the high industrial stan-
dards required for video games, movies, or immersive digital con-
tent in VR/AR. In this section, we will explore some of the open
Original Affine Non-affine Duplication challenges and potential future directions in this field.
Evaluation. Quantifying the quality of generated 3D models objec-
tively is an important and not widely explored problem. Using met-
rics such as PSNR, SSIM, and F-Score to evaluate rendering and
reconstruction results requires ground truth data on the one hand,
Local Editing but on the other hand, it can not comprehensively reflect the quality
and diversity of the generated content. In addition, user studies are
Figure 12: Representative 3D editing tasks. Images adapted from usually time-consuming, and the study results tend to be influenced
ARF [ZKB∗ 22], Text2Mesh [MBOL∗ 22], NeRFShop [JKK∗ 23], by the bias and number of surveyed users. Metrics that capture both
SKED [MPS∗ 23], and DreamEditor [ZWL∗ 23]. the quality and diversity of the results like FID can be applied to
3D data, but may not be always aligned with 3D domain and hu-
man preferences. Better metrics to judge the results objectively in
terms of generation quality, diversity, and matching degree with the
local editing types include appearance manipulation [YBZ∗ 22, conditions still need further exploration.
ZWL∗ 23], geometry deformation [JKK∗ 23, PYL∗ 22, YSL∗ 22,
TLYCS22], object-/semantic-level duplication/deletion and mov- Dataset. Unlike language or 2D image data which can be easily
ing/removing [YZX∗ 21, WLC∗ 22, KMS22, WWL∗ 23]. For exam- captured and collected, 3D assets often require 3D artists or de-
ple, NeuMesh [YBZ∗ 22] supports several kinds of texture manip- signers to spend a significant amount of time using professional
ulation including swapping, filling, and painting since they dis- software to create. Moreover, due to the different usage scenarios
till a NeRF scene into a mesh-based neural representation. NeRF- and creators’ personal styles, these 3D assets may differ greatly in
Shop [JKK∗ 23] and CageNeRF [PYL∗ 22] transform/deform the scale, quality, and style, increasing the complexity of 3D data. Spe-
volume bounded by a mesh cage, resulting in moved or de- cific rules are needed to normalize this diverse 3D data, making it
formed/articulated object. SINE [BZY∗ 23] updates both the NeRF more suitable for generation methods. A large-scale, high-quality
geometry and the appearance with geometry prior and semantic 3D dataset is still highly desirable in 3D generation. Meanwhile,
(image feature) texture prior as regularizations. exploring how to utilize extensive 2D data for 3D generation could
also be a potential solution to address the scarcity of 3D data.
Another line of works (e.g. ObjectNeRF [YZX∗ 21], Ob-
Representation. Representation is an essential part of the 3D gen-
jectSDF [WLC∗ 22], DFF [KMS22]) focus on automatically de-
eration, as we discuss various representations and the associated
composing the scene into individual objects or semantic parts dur-
methods in Sec. 3. Implicit representation is able to model com-
ing reconstruction, which is made possible by utilizing extra 2D
plex geometric topology efficiently but faces challenges with slow
image understanding networks (e.g. instance segmentation), and
optimization; explicit representation facilitates rapid optimization
support subsequent object-level manipulations such as re-coloring,
convergence but struggles to encapsulate complex topology and
removal, displacement, duplication.
demands substantial storage resources; Hybrid representation at-
Recently, it is possible to create new textures and/or content tempts to consider the trade-off between these two, but there are
X. Li & Q. Zhang & D. Kang & W. Cheng & Y. Gao et al. / Advances in 3D Generation: A Survey 21
still shortcomings. In general, we are motivated to develop a repre- or even daily. We hope this survey offers a systematic sum-
sentation that balances optimization efficiency, geometric topology mary that could inspire subsequent work for interested readers.
flexibility, and resource usage.
Controllability. The purpose of the 3D generation technique is
to generate a large amount of user-friendly, high-quality, and di-
References
verse 3D content in a cheap and controllable way. However, em-
bedding the generated 3D content into practical applications re- [ACC∗ 22] A DAMKIEWICZ M., C HEN T., C ACCAVALE A., G ARDNER
mains a challenge: most methods [PJBM23, CLC∗ 22, YHH∗ 19b] R., C ULBERTSON P., B OHG J., S CHWAGER M.: Vision-only robot navi-
gation in a neural radiance world. IEEE Robotics and Automation Letters
rely on volume rendering or neural rendering, and fail to generate 7, 2 (2022), 4606–4613. 5
content suitable for rasterization graphics pipeline. As for meth-
[ADMG18] ACHLIOPTAS P., D IAMANTI O., M ITLIAGKAS I., G UIBAS
ods [CCJJ23, WLW∗ 23, TRZ∗ 23] that generate the content repre- L.: Learning representations and generative models for 3d point clouds.
sented by polygons, they do not take into account layout (e.g. the In International conference on machine learning (2018), PMLR, pp. 40–
rectangular plane of a table can be represented by two triangles) and 49. 7, 9
high-quality UV unwrapping and the generated textures also face [ALG∗ 20] ATTAL B., L ING S., G OKASLAN A., R ICHARDT C., T OMP -
some issues such as baked shadows. These problems make the gen- KIN J.: Matryodshka: Real-time 6dof video view synthesis using multi-
erated content unfavorable for artist-friendly interaction and edit- sphere images. In European Conference on Computer Vision (2020),
ing. Furthermore, the style of generated content is still limited by Springer, pp. 441–459. 5
training datasets. Furthermore, the establishment of comprehensive [APMTM19] A LLDIECK T., P ONS -M OLL G., T HEOBALT C., M AGNOR
toolchains is a crucial aspect of the practical implementation of 3D M.: Tex2shape: Detailed full human body geometry from a single image.
In Proceedings of the IEEE/CVF International Conference on Computer
generation. In modern workflows, artists use tools (e.g. LookDev) Vision (2019), pp. 2293–2303. 15, 16
to harmonize 3D content by examining and contrasting the relight-
ing results of their materials across various lighting conditions. [AQW19] A BDAL R., Q IN Y., W ONKA P.: Image2stylegan: How
to embed images into the stylegan latent space? In Proceedings of
Concurrently, modern Digital Content Creation (DCC) software the IEEE/CVF international conference on computer vision (2019),
offers extensive and fine-grained content editing capabilities. It is pp. 4432–4441. 2, 6
promising to unify 3D content produced through diverse methods [ASK∗ 20] A LIEV K.-A., S EVASTOPOLSKY A., KOLOS M., U LYANOV
and establish tool chains that encompass abundant editing capabil- D., L EMPITSKY V.: Neural point-based graphics. In Computer Vision–
ities. ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28,
2020, Proceedings, Part XXII 16 (2020), Springer, pp. 696–712. 4
Large-scale Model. Recently, the popularity of large-scale models
[AST∗ 23] A L BAHAR B., S AITO S., T SENG H.-Y., K IM C., KOPF J.,
has gradually affected the field of 3D generation. Researchers are H UANG J.-B.: Single-image 3d human digitization with shape-guided
no longer satisfied with using distillation scores that use large-scale diffusion. In SIGGRAPH Asia 2023 Conference Papers (2023), pp. 1–11.
image models as the priors to optimize 3D content, but directly 16
train large-scale 3D models. MeshGPT [SAA∗ 23] follows large [ATDN23] A NEJA S., T HIES J., DAI A., N IESSNER M.: ClipFace: Text-
language models and adopts a sequence-based approach to autore- guided editing of textured 3d morphable models. In ACM SIGGRAPH
gressively generate sequences of triangles in the generated mesh. 2023 Conference Proceedings (2023). 19
MeshGPT takes into account layout information and generates [AYS∗ 23] A BDAL R., Y IFAN W., S HI Z., X U Y., P O R., K UANG Z.,
compact and sharp meshes that match the style created by artists. C HEN Q., Y EUNG D.-Y., W ETZSTEIN G.: Gaussian shell maps for
Since MeshGPT is a decoder-only transformer, compared with the efficient 3d human generation. arXiv preprint arXiv:2311.17857 (2023).
16
optimization-based generation, it gets rid of inefficient multi-step
sequential optimization, achieving rapid generation. Despite this, [AZA∗ 21] A HMADYAN A., Z HANG L., A BLAVATSKI A., W EI J.,
MeshGPT’s performance is still limited by training datasets and G RUNDMANN M.: Objectron: A large scale dataset of object-centric
videos in the wild with pose annotations. In Proceedings of the
can only generate regular furniture objects. But there is no doubt IEEE/CVF conference on computer vision and pattern recognition
that large-scale 3D generation models have great potential worth (2021), pp. 7822–7831. 15
exploring.
[AZS22] A LLDIECK T., Z ANFIR M., S MINCHISESCU C.: Photorealis-
tic monocular 3d reconstruction of humans wearing clothing. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern
8. Conclusion Recognition (2022), pp. 1506–1515. 16
In this work, we present a comprehensive survey on 3D gen- [BBJ∗ 21] B OSS M., B RAUN R., JAMPANI V., BARRON J. T., L IU C.,
eration, encompassing four main aspects: 3D representations, L ENSCH H.: Nerd: Neural reflectance decomposition from image col-
generation methods, datasets, and various applications. We be- lections. In Proceedings of the IEEE/CVF International Conference on
Computer Vision (2021), pp. 12684–12694. 5
gin by introducing the 3D representation, which serves as the
backbone and determines the characteristics of the generated [BFO∗ 20] B ROXTON M., F LYNN J., OVERBECK R., E RICKSON D.,
results. Next, we summarize and categorize a wide range of H EDMAN P., D UVALL M., D OURGARIAN J., B USCH J., W HALEN M.,
D EBEVEC P.: Immersive light field video with a layered mesh repre-
generation methods, creating an evolutionary tree to visualize their sentation. ACM Transactions on Graphics (TOG) 39, 4 (2020), 86–1.
branches and developments. Finally, we provide an overview of 5
related datasets, applications, and open challenges in this field. [BFW∗ 23] BAI Y., FAN Y., WANG X., Z HANG Y., S UN J., Y UAN C.,
The realm of 3D generation is currently witnessing explosive S HAN Y.: High-fidelity facial avatar reconstruction from monocular
growth and development, with new work emerging every week video with generative priors. In CVPR (2023). 18
22 X. Li & Q. Zhang & D. Kang & W. Cheng & Y. Gao et al. / Advances in 3D Generation: A Survey
[BGA∗ 22] BAUTISTA M. A., G UO P., A BNAR S., TALBOTT W., T O - [CCP∗ 23] C AI S., C HAN E. R., P ENG S., S HAHBAZI M., O BUKHOV
SHEV A., C HEN Z., D INH L., Z HAI S., G OH H., U LBRICHT D., A., VAN G OOL L., W ETZSTEIN G.: Diffdreamer: Towards consistent
ET AL .: Gaudi: A neural architect for immersive 3d scene generation. unsupervised single-view scene extrapolation with conditional diffusion
Advances in Neural Information Processing Systems 35 (2022), 25102– models. In Proceedings of the IEEE/CVF International Conference on
25116. 18 Computer Vision (2023), pp. 2139–2150. 18, 19
[BGP∗ 22] BAATZ H., G RANSKOG J., PAPAS M., ROUSSELLE F., [CCS∗ 19] C HEN K., C HOY C. B., S AVVA M., C HANG A. X.,
N OVÁK J.: Nerf-tex: Neural reflectance field textures. In Computer F UNKHOUSER T., S AVARESE S.: Text2shape: Generating shapes from
Graphics Forum (2022), vol. 41, Wiley Online Library, pp. 287–301. 5 natural language by learning joint embeddings. In Computer Vision–
[BHE23] B ROOKS T., H OLYNSKI A., E FROS A. A.: InstructPix2Pix: ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Aus-
Learning to follow image editing instructions. In CVPR (2023). 20 tralia, December 2–6, 2018, Revised Selected Papers, Part III 14 (2019),
Springer, pp. 100–116. 18
[BHMK∗ 18] B EN -H AMU H., M ARON H., K EZURER I., AVINERI G.,
L IPMAN Y.: Multi-chart generative surface modeling. ACM Transac- [CCW∗ 23] C HEN Y., C HEN X., WANG X., Z HANG Q., G UO Y., S HAN
tions on Graphics (TOG) 37, 6 (2018), 1–15. 7, 9 Y., WANG F.: Local-to-global registration for bundle-adjusting neural
radiance fields. In Proceedings of the IEEE/CVF Conference on Com-
[BKP∗ 10] B OTSCH M., KOBBELT L., PAULY M., A LLIEZ P., L ÉVY B.: puter Vision and Pattern Recognition (2023), pp. 8264–8273. 5
Polygon mesh processing. CRC press, 2010. 4
[CFG∗ 15] C HANG A. X., F UNKHOUSER T., G UIBAS L., H ANRAHAN
[BKY∗ 22] B ERGMAN A., K ELLNHOFER P., Y IFAN W., C HAN E., L IN - P., H UANG Q., L I Z., S AVARESE S., S AVVA M., S ONG S., S U H.,
DELL D., W ETZSTEIN G.: Generative neural articulated radiance fields. ET AL .: Shapenet: An information-rich 3d model repository. arXiv
Advances in Neural Information Processing Systems 35 (2022), 19900– preprint arXiv:1512.03012 (2015). 14
19916. 16
[CFZ∗ 23] C HEN T., F U C., Z ANG Y., Z HU L., Z HANG J., M AO P., S UN
[BLW16] B ROCK A., L IM T., W ESTON N.: Generative and discrimina- L.: Deep3dsketch+: Rapid 3d modeling from single free-hand sketches.
tive voxel modeling with convolutional neural networks. arXiv preprint In International Conference on Multimedia Modeling (2023), Springer,
arXiv:1608.04236 (2016). 6, 11 pp. 16–28. 18, 19
[BMR∗ 20] B ROWN T., M ANN B., RYDER N., S UBBIAH M., K APLAN [CGC∗ 23] C HEN H., G U J., C HEN A., T IAN W., T U Z., L IU L., S U
J. D., D HARIWAL P., N EELAKANTAN A., S HYAM P., S ASTRY G., H.: Single-stage diffusion nerf: A unified approach to 3d generation and
A SKELL A., ET AL .: Language models are few-shot learners. NeurIPS reconstruction. arXiv preprint arXiv:2304.06714 (2023). 6, 7, 10, 18
(2020). 2, 6, 10
[CGD∗ 22] C OLLINS J., G OEL S., D ENG K., L UTHRA A., X U L., G UN -
[BMT∗ 21] BARRON J. T., M ILDENHALL B., TANCIK M., H EDMAN
DOGDU E., Z HANG X., V ICENTE T. F. Y., D IDERIKSEN T., A RORA
P., M ARTIN -B RUALLA R., S RINIVASAN P. P.: Mip-nerf: A multiscale
H., ET AL .: Abo: Dataset and benchmarks for real-world 3d object un-
representation for anti-aliasing neural radiance fields. In Proceedings
derstanding. In Proceedings of the IEEE/CVF Conference on Computer
of the IEEE/CVF International Conference on Computer Vision (2021),
Vision and Pattern Recognition (2022), pp. 21126–21136. 15
pp. 5855–5864. 1, 5
[CGT∗ 19] C HOI I., G ALLO O., T ROCCOLI A., K IM M. H., K AUTZ J.:
[BMV∗ 22] BARRON J. T., M ILDENHALL B., V ERBIN D., S RINIVASAN
Extreme view synthesis. In Proceedings of the IEEE/CVF International
P. P., H EDMAN P.: Mip-nerf 360: Unbounded anti-aliased neural radi-
Conference on Computer Vision (2019), pp. 7781–7790. 5
ance fields. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (2022), pp. 5470–5479. 5 [CHB∗ 23] C HEN X., H UANG J., B IN Y., Y U L., L IAO Y.: Veri3d: Gen-
[BNT21] B UROV A., N IESSNER M., T HIES J.: Dynamic surface erative vertex-based radiance fields for 3d controllable human image syn-
function networks for clothed human bodies. In Proceedings of thesis. In Proceedings of the IEEE/CVF International Conference on
the IEEE/CVF International Conference on Computer Vision (2021), Computer Vision (2023), pp. 8986–8997. 16
pp. 10754–10764. 4 [CHIS23] C ROITORU F.-A., H ONDRU V., I ONESCU R. T., S HAH M.:
[BPP∗ 23] BAHMANI S., PARK J. J., PASCHALIDOU D., YAN X., W ET- Diffusion models in vision: A survey. IEEE Transactions on Pattern
ZSTEIN G., G UIBAS L., TAGLIASACCHI A.: Cc3d: Layout-conditioned
Analysis and Machine Intelligence (2023). 3
generation of compositional 3d scenes. arXiv preprint arXiv:2303.12074 [CJS∗ 22] C HEN X., J IANG T., S ONG J., YANG J., B LACK M. J.,
(2023). 18, 19 G EIGER A., H ILLIGES O.: gdna: Towards generative detailed neural
[BSKG22] B EN -S HABAT Y., KONEPUTUGODAGE C. H., G OULD S.: avatars. In Proceedings of the IEEE/CVF Conference on Computer Vi-
Digs: Divergence guided shape implicit neural representation for un- sion and Pattern Recognition (2022), pp. 20427–20437. 7, 15, 16
oriented point clouds. In Proceedings of the IEEE/CVF Conference on [CLC∗ 22] C HAN E. R., L IN C. Z., C HAN M. A., NAGANO K., PAN B.,
Computer Vision and Pattern Recognition (2022), pp. 19323–19332. 6 D E M ELLO S., G ALLO O., G UIBAS L. J., T REMBLAY J., K HAMIS S.,
[BTH∗ 23] BAI Z., TAN F., H UANG Z., S ARKAR K., TANG D., Q IU D., ET AL .: Efficient geometry-aware 3d generative adversarial networks. In
M EKA A., D U R., D OU M., O RTS -E SCOLANO S., ET AL .: Learning CVPR (2022). 1, 6, 7, 9, 17, 18, 21
personalized high quality volumetric head avatars from monocular rgb [CLL23] C HENG Z., L I J., L I H.: Wildlight: In-the-wild inverse render-
videos. In CVPR (2023). 17 ing with a flashlight. In Proceedings of the IEEE/CVF Conference on
[BZY∗ 23] BAO C., Z HANG Y., YANG B., FAN T., YANG Z., BAO H., Computer Vision and Pattern Recognition (2023), pp. 4305–4314. 6
Z HANG G., C UI Z.: SINE: Semantic-driven image-based nerf editing [CLN∗ 23] C HUNG J., L EE S., NAM H., L EE J., L EE K. M.: Lucid-
with prior-guided editing field. In CVPR (2023). 20 dreamer: Domain-free generation of 3d gaussian splatting scenes. arXiv
[CBZ∗ 19] C HENG S., B RONSTEIN M., Z HOU Y., KOTSIA I., PANTIC preprint arXiv:2311.13384 (2023). 18, 19
M., Z AFEIRIOU S.: Meshgan: Non-linear 3d morphable models of faces. [CLT∗ 23] C HENG Y.-C., L EE H.-Y., T ULYAKOV S., S CHWING A. G.,
arXiv preprint arXiv:1903.10384 (2019). 7 G UI L.-Y.: Sdfusion: Multimodal 3d shape completion, reconstruction,
[CCH∗ 23] C AO Y., C AO Y.-P., H AN K., S HAN Y., W ONG K.-Y. K.: and generation. In Proceedings of the IEEE/CVF Conference on Com-
Dreamavatar: Text-and-shape guided 3d human avatar generation via dif- puter Vision and Pattern Recognition (2023), pp. 4456–4465. 10, 18
fusion models. arXiv preprint arXiv:2304.00916 (2023). 16
[CMA∗ 22] C HOI H., M OON G., A RMANDO M., L EROY V., L EE
[CCJJ23] C HEN R., C HEN Y., J IAO N., J IA K.: Fantasia3D: Disentan- K. M., ROGEZ G.: Mononhr: Monocular neural human renderer.
gling geometry and appearance for high-quality text-to-3d content cre- In 2022 International Conference on 3D Vision (3DV) (2022), IEEE,
ation. In ICCV (2023). 18, 19, 21 pp. 242–251. 16
X. Li & Q. Zhang & D. Kang & W. Cheng & Y. Gao et al. / Advances in 3D Generation: A Survey 23
[CMK∗ 21a] C HAN E. R., M ONTEIRO M., K ELLNHOFER P., W U J., [CYL∗ 22] C HEN Y., Y UAN Q., L I Z., L IU Y., WANG W., X IE C., W EN
W ETZSTEIN G.: pi-gan: Periodic implicit generative adversarial net- X., Y U Q.: UPST-NeRF: Universal photorealistic style transfer of neural
works for 3d-aware image synthesis. In Proceedings of the IEEE/CVF radiance fields for 3d scene. In arXiv preprint arXiv:2208.07059 (2022).
conference on computer vision and pattern recognition (2021), pp. 5799– 19
5809. 9, 15
[CYW∗ 23] C HENG X., YANG T., WANG J., L I Y., Z HANG L., Z HANG
[CMK∗ 21b] C HAN E. R., M ONTEIRO M., K ELLNHOFER P., W U J., J., Y UAN L.: Progressive3D: Progressively local editing for text-to-
W ETZSTEIN G.: pi-GAN: Periodic implicit generative adversarial net- 3d content creation with complex semantic prompts. arXiv preprint
works for 3d-aware image synthesis. In CVPR (2021). 17 arXiv:2310.11784 (2023). 19
[CNC∗ 23] C HAN E. R., NAGANO K., C HAN M. A., B ERGMAN A. W., [CZ19] C HEN Z., Z HANG H.: Learning implicit fields for generative
PARK J. J., L EVY A., A ITTALA M., D E M ELLO S., K ARRAS T., W ET- shape modeling. In Proceedings of the IEEE/CVF Conference on Com-
ZSTEIN G.: Generative novel view synthesis with 3d-aware diffusion puter Vision and Pattern Recognition (2019), pp. 5939–5948. 5, 9
models. arXiv preprint arXiv:2304.02602 (2023). 13
[CZC∗ 24] C HEN H., Z HANG Y., C UN X., X IA M., WANG X., W ENG
[CPA∗ 21] C ORONA E., P UMAROLA A., A LENYA G., P ONS -M OLL G., C., S HAN Y.: Videocrafter2: Overcoming data limitations for high-
M ORENO -N OGUER F.: Smplicit: Topology-aware generative model for quality video diffusion models, 2024. arXiv:2401.09047. 2
clothed people. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition (2021), pp. 11875–11885. 15, 16 [CZL∗ 22] C HEN X., Z HANG Q., L I X., C HEN Y., F ENG Y., WANG X.,
WANG J.: Hallucinated neural radiance fields in the wild. In CVPR
[CPB∗ 20] C HOUTAS V., PAVLAKOS G., B OLKART T., T ZIONAS D., (2022), pp. 12943–12952. 5, 19
B LACK M. J.: Monocular expressive body regression through body-
driven attention. In Computer Vision–ECCV 2020: 16th European Con- [CZY∗ 23] C HEN Y., Z HANG C., YANG X., C AI Z., Y U G., YANG L.,
ference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16 L IN G.: It3d: Improved text-to-3d generation with explicit view synthe-
(2020), Springer, pp. 20–40. 16 sis. arXiv preprint arXiv:2308.11473 (2023). 18
[CRJ22] C AO A., ROCKWELL C., J OHNSON J.: Fwd: Real-time novel [DBD∗ 22] DARMON F., BASCLE B., D EVAUX J.-C., M ONASSE P.,
view synthesis with forward warping and depth. In Proceedings of AUBRY M.: Improving neural implicit surfaces geometry with patch
the IEEE/CVF Conference on Computer Vision and Pattern Recognition warping. In CVPR (2022). 6
(2022), pp. 15713–15724. 4
[DCS∗ 17] DAI A., C HANG A. X., S AVVA M., H ALBER M.,
[CSH19] C HEN X., S ONG J., H ILLIGES O.: Monocular neural im- F UNKHOUSER T., N IESSNER M.: Scannet: Richly-annotated 3d recon-
age based rendering with continuous view control. In Proceedings structions of indoor scenes. In Proceedings of the IEEE conference on
of the IEEE/CVF international conference on computer vision (2019), computer vision and pattern recognition (2017), pp. 5828–5839. 14, 15
pp. 4090–4100. 14
[DFK∗ 22] D OWNS L., F RANCIS A., KOENIG N., K INMAN B., H ICK -
[CSL∗ 23] C HEN D. Z., S IDDIQUI Y., L EE H.-Y., T ULYAKOV S., MAN R., R EYMANN K., M C H UGH T. B., VANHOUCKE V.: Google
N IESSNER M.: Text2Tex: Text-driven texture synthesis via diffusion scanned objects: A high-quality dataset of 3d scanned household items.
models. In ICCV (2023). 19 In 2022 International Conference on Robotics and Automation (ICRA)
[CTZ20] C HEN Z., TAGLIASACCHI A., Z HANG H.: Bsp-net: Generat- (2022), IEEE, pp. 2553–2560. 14, 15
ing compact meshes via binary space partitioning. In Proceedings of [DJQ∗ 23] D ENG C., J IANG C., Q I C. R., YAN X., Z HOU Y., G UIBAS
the IEEE/CVF Conference on Computer Vision and Pattern Recognition L., A NGUELOV D., ET AL .: Nerdi: Single-view nerf synthesis with
(2020), pp. 45–54. 6 language-guided diffusion as general image priors. In Proceedings of
[CUYH20] C HOI Y., U H Y., YOO J., H A J.-W.: Stargan v2: Diverse the IEEE/CVF Conference on Computer Vision and Pattern Recognition
image synthesis for multiple domains. In Proceedings of the IEEE/CVF (2023), pp. 20637–20647. 12
conference on computer vision and pattern recognition (2020), pp. 8188– [DLW∗ 23] D EITKE M., L IU R., WALLINGFORD M., N GO H., M ICHEL
8197. 14, 15 O., K USUPATI A., FAN A., L AFORTE C., VOLETI V., G ADRE S. Y.,
[CWS∗ 15] C ALLI B., WALSMAN A., S INGH A., S RINIVASA S., ET AL .: Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint
A BBEEL P., D OLLAR A. M.: Benchmarking in manipulation research: arXiv:2307.05663 (2023). 14, 15
The ycb object and model set and benchmarking protocols. arXiv
[Doe16] D OERSCH C.: Tutorial on variational autoencoders. arXiv
preprint arXiv:1502.03143 (2015). 14
preprint arXiv:1606.05908 (2016). 3
[CX∗ 23] C HAO Y.-W., X IANG Y., ET AL .: Fewsol: A dataset for few-
shot object learning in robotic environments. In 2023 IEEE Interna- [DRB∗ 18] DAI A., R ITCHIE D., B OKELOH M., R EED S., S TURM J.,
tional Conference on Robotics and Automation (ICRA) (2023), IEEE, N IESSNER M.: Scancomplete: Large-scale scene completion and seman-
pp. 9140–9146. 15 tic segmentation for 3d scans. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (2018), pp. 4578–4587. 6
[CXG∗ 16] C HOY C. B., X U D., G WAK J., C HEN K., S AVARESE S.:
3d-r2n2: A unified approach for single and multi-view 3d object recon- [DSS∗ 23] D EITKE M., S CHWENK D., S ALVADOR J., W EIHS L.,
struction. In Computer Vision–ECCV 2016: 14th European Conference, M ICHEL O., VANDER B ILT E., S CHMIDT L., E HSANI K., K EMBHAVI
Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part A., FARHADI A.: Objaverse: A universe of annotated 3d objects. In Pro-
VIII 14 (2016), Springer, pp. 628–644. 6 ceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (2023), pp. 13142–13153. 14, 15
[CXG∗ 22] C HEN A., X U Z., G EIGER A., Y U J., S U H.: TensoRF:
Tensorial radiance fields. In European Conference on Computer Vision [DZL∗ 20] DAI P., Z HANG Y., L I Z., L IU S., Z ENG B.: Neural point
(2022). 6 cloud rendering via multi-plane projection. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition
[CXZ∗ 21] C HEN A., X U Z., Z HAO F., Z HANG X., X IANG F., Y U J., (2020), pp. 7830–7839. 4
S U H.: MVSNeRF: Fast generalizable radiance field reconstruction from
multi-view stereo. In Proceedings of the IEEE/CVF International Con- [DZW∗ 20] D UAN Y., Z HU H., WANG H., Y I L., N EVATIA R., G UIBAS
ference on Computer Vision (2021), pp. 14124–14133. 5, 13 L. J.: Curriculum deepsdf. In European Conference on Computer Vision
(2020), Springer, pp. 51–67. 5
[CYAE∗ 20] C AI R., YANG G., AVERBUCH -E LOR H., H AO Z., B E -
LONGIE S., S NAVELY N., H ARIHARAN B.: Learning gradient fields [DZY∗ 21] D U Y., Z HANG Y., Y U H.-X., T ENENBAUM J. B., W U J.:
for shape generation. In Computer Vision–ECCV 2020: 16th European Neural radiance flow for 4d view synthesis and video processing. In
Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16 Proceedings of the IEEE/CVF International Conference on Computer
(2020), Springer, pp. 364–381. 10 Vision (2021), IEEE Computer Society, pp. 14304–14314. 5
24 X. Li & Q. Zhang & D. Kang & W. Cheng & Y. Gao et al. / Advances in 3D Generation: A Survey
[EGO∗ 20] E RLER P., G UERRERO P., O HRHALLINGER S., M ITRA [GKJ∗ 21] G ARBIN S. J., KOWALSKI M., J OHNSON M., S HOTTON J.,
N. J., W IMMER M.: Points2surf learning implicit surfaces from point VALENTIN J.: Fastnerf: High-fidelity neural rendering at 200fps. In
clouds. In European Conference on Computer Vision (2020), Springer, Proceedings of the IEEE/CVF International Conference on Computer
pp. 108–124. 5 Vision (2021), pp. 14346–14355. 5
[EMS∗ 23] E RKOÇ Z., M A F., S HAN Q., N IESSNER M., DAI A.: Hyper- [GLWT22] G U J., L IU L., WANG P., T HEOBALT C.: StyleNeRF: A
diffusion: Generating implicit neural fields with weight-space diffusion. style-based 3d-aware generator for high-resolution image synthesis. In
arXiv preprint arXiv:2303.17015 (2023). 18 Int. Conf. Learn. Represent. (2022). 17
[EST∗ 20] E GGER B., S MITH W. A., T EWARI A., W UHRER S., Z OLL - [GLZ∗ 23] G AO X., L I X., Z HANG C., Z HANG Q., C AO Y., S HAN
HOEFER M., B EELER T., B ERNARD F., B OLKART T., KORTYLEWSKI Y., Q UAN L.: Contex-human: Free-view rendering of human from
A., ROMDHANI S., ET AL .: 3d morphable face models—past, present, a single image with texture-consistent synthesis. arXiv preprint
and future. ACM Trans. Graph. (2020). 17 arXiv:2311.17123 (2023). 16
[FAKD23] F RIDMAN R., A BECASIS A., K ASTEN Y., D EKEL T.: Sce- [GMW17] G ADELHA M., M AJI S., WANG R.: 3d shape induction from
nescape: Text-driven consistent scene generation. arXiv preprint 2d views of multiple objects. In 2017 International Conference on 3D
arXiv:2302.01133 (2023). 18 Vision (3DV) (2017), IEEE, pp. 402–411. 9, 15
[GNL∗ 23] G E S., NAH S., L IU G., P OON T., TAO A., C ATANZARO B.,
[FBD∗ 19] F LYNN J., B ROXTON M., D EBEVEC P., D U VALL M., F YFFE
JACOBS D., H UANG J.-B., L IU M.-Y., BALAJI Y.: Preserve your own
G., OVERBECK R., S NAVELY N., T UCKER R.: Deepview: View synthe-
correlation: A noise prior for video diffusion models. In ICCV (2023). 2
sis with learned gradient descent. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition (2019), pp. 2367– [GPAM∗ 14] G OODFELLOW I., P OUGET-A BADIE J., M IRZA M., X U
2376. 5 B., WARDE -FARLEY D., O ZAIR S., C OURVILLE A., B ENGIO Y.: Gen-
erative adversarial nets. Advances in neural information processing sys-
[FJG∗ 21] F U H., J IA R., G AO L., G ONG M., Z HAO B., M AYBANK S., tems 27 (2014). 2, 6, 7
TAO D.: 3d-future: 3d furniture shape with texture. International Journal
of Computer Vision 129 (2021), 3313–3337. 14, 15 [GPL∗ 22] G RASSAL P.-W., P RINZLER M., L EISTNER T., ROTHER C.,
N IESSNER M., T HIES J.: Neural head avatars from monocular RGB
[FJW∗ 22] FAN Z., J IANG Y., WANG P., G ONG X., X U D., WANG Z.: videos. In CVPR (2022). 17
Unified implicit neural stylization. In ECCV (2022), Springer. 19
[GSW∗ 21] G UI J., S UN Z., W EN Y., TAO D., Y E J.: A review on gen-
[FKYT∗ 22] F RIDOVICH -K EIL S., Y U A., TANCIK M., C HEN Q., erative adversarial networks: Algorithms, theory, and applications. IEEE
R ECHT B., K ANAZAWA A.: Plenoxels: Radiance fields without neu- transactions on knowledge and data engineering 35, 4 (2021), 3313–
ral networks. In Proceedings of the IEEE/CVF Conference on Computer 3332. 3
Vision and Pattern Recognition (2022), pp. 5501–5510. 6
[GSW∗ 22] G AO J., S HEN T., WANG Z., C HEN W., Y IN K., L I D.,
[FLJ∗ 22] F U J., L I S., J IANG Y., L IN K.-Y., Q IAN C., L OY C. C., W U L ITANY O., G OJCIC Z., F IDLER S.: GET3D: A generative model of
W., L IU Z.: Stylegan-human: A data-centric odyssey of human gener- high quality 3d textured shapes learned from images. In NeurIPS (2022).
ation. In European Conference on Computer Vision (2022), Springer, 9, 18, 19
pp. 1–19. 14, 15
[GTZN21] G AFNI G., T HIES J., Z OLLHOFER M., N IESSNER M.: Dy-
[FXOT22] F U Q., X U Q., O NG Y. S., TAO W.: Geo-NeuS: Geometry- namic neural radiance fields for monocular 4D facial avatar reconstruc-
consistent neural implicit surfaces learning for multi-view reconstruc- tion. In CVPR (2021). 17
tion. Advances in Neural Information Processing Systems 35 (2022), [GWH∗ 20] G UO Y., WANG H., H U Q., L IU H., L IU L., B ENNAMOUN
3403–3416. 6 M.: Deep learning for 3d point clouds: A survey. IEEE transactions on
[GAA∗ 22] G AL R., A LALUF Y., ATZMON Y., PATASHNIK O., pattern analysis and machine intelligence 43, 12 (2020), 4338–4364. 3
B ERMANO A. H., C HECHIK G., C OHEN -O R D.: An image is worth [GWY∗ 21] G AO L., W U T., Y UAN Y.-J., L IN M.-X., L AI Y.-K.,
one word: Personalizing text-to-image generation using textual inver- Z HANG H.: Tm-net: Deep generative networks for textured meshes.
sion. arXiv preprint arXiv:2208.01618 (2022). 12 ACM Transactions on Graphics (TOG) 40, 6 (2021), 1–15. 11
[GCL∗ 21] G UO Y., C HEN K., L IANG S., L IU Y.-J., BAO H., Z HANG J.: [GXN∗ 23] G UPTA A., X IONG W., N IE Y., J ONES I., O ĞUZ B.: 3dgen:
Ad-nerf: Audio driven neural radiance fields for talking head synthesis. Triplane latent diffusion for textured mesh generation. arXiv preprint
In Proceedings of the IEEE/CVF International Conference on Computer arXiv:2303.05371 (2023). 7
Vision (2021), pp. 5784–5794. 5
[GYW∗ 19a] G AO L., YANG J., W U T., Y UAN Y.-J., F U H., L AI Y.-
[GCS∗ 20] G ENOVA K., C OLE F., S UD A., S ARNA A., F UNKHOUSER K., Z HANG H.: Sdm-net: Deep generative network for structured de-
T.: Local deep implicit functions for 3d shape. In Proceedings of formable mesh. ACM Transactions on Graphics (TOG) 38, 6 (2019),
the IEEE/CVF Conference on Computer Vision and Pattern Recognition 1–15. 7
(2020), pp. 4857–4866. 6
[GYW∗ 19b] G AO L., YANG J., W U T., Y UAN Y.-J., F U H., L AI Y.-
[GCV∗ 19] G ENOVA K., C OLE F., V LASIC D., S ARNA A., F REEMAN K., Z HANG H.: Sdm-net: Deep generative network for structured de-
W. T., F UNKHOUSER T.: Learning shape templates with structured im- formable mesh. ACM Transactions on Graphics (TOG) 38, 6 (2019),
plicit functions. In Proceedings of the IEEE/CVF International Confer- 1–15. 11
ence on Computer Vision (2019), pp. 7154–7164. 6 [GZX∗ 22] G AO X., Z HONG C., X IANG J., H ONG Y., G UO Y., Z HANG
[GEB16] G ATYS L. A., E CKER A. S., B ETHGE M.: Image style transfer J.: Reconstructing personalized semantic facial nerf models from
using convolutional neural networks. In CVPR (2016). 19 monocular video. ACM Trans. Graph. (2022). 17
[GII∗ 21] G RIGOREV A., I SKAKOV K., I ANINA A., BASHIROV R., Z A - [HB17] H UANG X., B ELONGIE S.: Arbitrary style transfer in real-time
KHARKIN I., VAKHITOV A., L EMPITSKY V.: Stylepeople: A generative with adaptive instance normalization. In ICCV (2017). 19
model of fullbody human avatars. In Proceedings of the IEEE/CVF Con- [HCL∗ 22] H ONG F., C HEN Z., L AN Y., PAN L., L IU Z.: Eva3d:
ference on Computer Vision and Pattern Recognition (2021), pp. 5151– Compositional 3d human generation from 2d image collections. arXiv
5160. 16 preprint arXiv:2210.04888 (2022). 15, 16
[GKG∗ 23] G IEBENHAIN S., K IRSCHSTEIN T., G EORGOPOULOS M., [HCO∗ 23] H ÖLLEIN L., C AO A., OWENS A., J OHNSON J., N IESSNER
R ÜNZ M., AGAPITO L., N IESSNER M.: Learning neural parametric M.: Text2room: Extracting textured 3d meshes from 2d text-to-image
head models. In CVPR (2023). 17 models. arXiv preprint arXiv:2303.11989 (2023). 18
X. Li & Q. Zhang & D. Kang & W. Cheng & Y. Gao et al. / Advances in 3D Generation: A Survey 25
[HHP∗ 23] H U S., H ONG F., PAN L., M EI H., YANG L., L IU Z.: [HZF∗ 23b] H UANG X., Z HANG Q., F ENG Y., L I X., WANG X., WANG
Sherf: Generalizable human nerf from a single image. arXiv preprint Q.: Local implicit ray function for generalizable radiance field represen-
arXiv:2303.12791 (2023). 16 tation. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (2023), pp. 97–107. 5
[HHY∗ 22] H UANG Y.-H., H E Y., Y UAN Y.-J., L AI Y.-K., G AO L.:
Stylizednerf: consistent 3d scene stylization as stylized nerf via 2d-3d [HZP∗ 22] H ONG F., Z HANG M., PAN L., C AI Z., YANG L., L IU Z.:
mutual learning. In CVPR (2022). 19 Avatarclip: Zero-shot text-driven generation and animation of 3d avatars.
arXiv preprint arXiv:2205.08535 (2022). 16
[HJA20] H O J., JAIN A., A BBEEL P.: Denoising diffusion probabilistic
models. Advances in neural information processing systems 33 (2020), [ID18] I NSAFUTDINOV E., D OSOVITSKIY A.: Unsupervised learning
6840–6851. 1, 2, 6, 9 of shape and pose with differentiable point clouds. Advances in neural
information processing systems 31 (2018). 4
[HLA∗ 19] H U Y., L I T.-M., A NDERSON L., R AGAN -K ELLEY J., D U -
RAND F.: Taichi: a language for high-performance computation on spa- [JCL∗ 22] J IANG K., C HEN S.-Y., L IU F.-L., F U H., G AO L.: NeRF-
tially sparse data structures. ACM Transactions on Graphics (TOG) 38, FaceEditing: Disentangled face editing in neural radiance fields. In SIG-
6 (2019), 1–16. 5 GRAPH Asia Conference Papers (2022). 18, 19
[HLHF22] H UI K.-H., L I R., H U J., F U C.-W.: Neural wavelet-domain [JJW∗ 23] J IANG S., J IANG H., WANG Z., L UO H., C HEN W., X U L.:
diffusion for 3d shape generation. In SIGGRAPH Asia 2022 Conference Humangen: Generating human radiance fields with explicit priors. In
Papers (2022), pp. 1–9. 10, 18 Proceedings of the IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (2023), pp. 12543–12554. 16
[HMR19a] H ENZLER P., M ITRA N. J., R ITSCHEL T.: Escaping plato’s
cave: 3D shape from adversarial rendering. In ICCV (2019). 9, 15, 18 [JKK∗ 23] JAMBON C., K ERBL B., KOPANAS G., D IOLATZIS S., D RET-
TAKIS G., L EIMKÜHLER T.: Nerfshop: Interactive editing of neural ra-
[HMR19b] H ENZLER P., M ITRA N. J., R ITSCHEL T.: Escaping plato’s diance fields. In Proceedings of the ACM on Computer Graphics and
cave: 3d shape from adversarial rendering. In ICCV (2019). 17, 18 Interactive Techniques (2023). 19, 20
[HPX∗ 22] H ONG Y., P ENG B., X IAO H., L IU L., Z HANG J.: Head- [JLF22] J OHARI M. M., L EPOITTEVIN Y., F LEURET F.: GeoN-
NeRF: A real-time nerf-based parametric head model. In CVPR (2022). eRF: Generalizing nerf with geometry priors. In Proceedings of the
5 IEEE/CVF Conference on Computer Vision and Pattern Recognition
[HRBP21] H U R., R AVI N., B ERG A. C., PATHAK D.: Worldsheet: (2022), pp. 18365–18375. 5
Wrapping the world in a 3d sheet for view synthesis from a single image. [JMB∗ 22] JAIN A., M ILDENHALL B., BARRON J. T., A BBEEL P.,
In Proceedings of the IEEE/CVF International Conference on Computer P OOLE B.: Zero-shot text-guided object generation with dream fields.
Vision (2021), pp. 12528–12537. 18 In Proceedings of the IEEE/CVF Conference on Computer Vision and
[HRL∗ 21] H ENZLER P., R EIZENSTEIN J., L ABATUT P., S HAPOVALOV Pattern Recognition (2022), pp. 867–876. 18
R., R ITSCHEL T., V EDALDI A., N OVOTNY D.: Unsupervised learn- [JN23] J UN H., N ICHOL A.: Shap-e: Generating conditional 3d implicit
ing of 3d object categories from videos in the wild. In Proceedings of functions. arXiv preprint arXiv:2305.02463 (2023). 10, 12
the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(2021), pp. 4700–4709. 15 [JWZ∗ 23] J IANG R., WANG C., Z HANG J., C HAI M., H E M., C HEN
D., L IAO J.: Avatarcraft: Transforming text into neural human
[HSG∗ 22] H O J., S ALIMANS T., G RITSENKO A., C HAN W., N OROUZI avatars with parameterized shape and pose control. arXiv preprint
M., F LEET D. J.: Video diffusion models. In NeurIPS (2022). 2 arXiv:2303.17606 (2023). 15, 16
[HSZ∗ 23] H UANG X., S HAO R., Z HANG Q., Z HANG H., F ENG Y., [KAZ∗ 23] KOLOTOUROS N., A LLDIECK T., Z ANFIR A., BAZAVAN
L IU Y., WANG Q.: Humannorm: Learning normal diffusion model E. G., F IERARU M., S MINCHISESCU C.: Dreamhuman: Animatable
for high-quality and realistic 3d human generation. arXiv preprint 3d avatars from text. arXiv preprint arXiv:2306.09329 (2023). 15, 16
arXiv:2310.01406 (2023). 16
[KBM∗ 20] K ATO H., B EKER D., M ORARIU M., A NDO T., M ATSUOKA
[HTE∗ 23] H AQUE A., TANCIK M., E FROS A. A., H OLYNSKI A., T., K EHL W., G AIDON A.: Differentiable rendering: A survey. arXiv
K ANAZAWA A.: Instruct-NeRF2NeRF: Editing 3d scenes with instruc- preprint arXiv:2006.12057 (2020). 3
tions. In ICCV (2023). 19, 20
[KBV20] K LOKOV R., B OYER E., V ERBEEK J.: Discrete point flow
[HTS∗ 21] H UANG H.-P., T SENG H.-Y., S AINI S., S INGH M., YANG networks for efficient point cloud generation. In European Conference
M.-H.: Learning to stylize novel views. In ICCV (2021). 19 on Computer Vision (2020), Springer, pp. 694–710. 11
[HWZ∗ 23] H UANG Y., WANG J., Z ENG A., C AO H., Q I X., S HI Y., [KDJ∗ 23] K WAK J.- G ., D ONG E., J IN Y., KO H., M AHAJAN S., Y I
Z HA Z.-J., Z HANG L.: Dreamwaltz: Make a scene with complex 3d K. M.: Vivid-1-to-3: Novel view synthesis with video diffusion models.
animatable avatars. arXiv preprint arXiv:2305.12529 (2023). 16 arXiv preprint arXiv:2312.01305 (2023).
[HYL∗ 23] H UANG Y., Y I H., L IU W., WANG H., W U B., WANG W., [KDSB22] K ULHÁNEK J., D ERNER E., S ATTLER T., BABUŠKA R.:
L IN B., Z HANG D., C AI D.: One-shot implicit animatable avatars with Viewformer: Nerf-free neural rendering from few images using trans-
model-based priors. In Proceedings of the IEEE/CVF International Con- formers. In European Conference on Computer Vision (2022), Springer,
ference on Computer Vision (2023), pp. 8974–8985. 15, 16 pp. 198–216. 13, 14
[HYX∗ 23] H UANG Y., Y I H., X IU Y., L IAO T., TANG J., C AI D., [KFH∗ 22] K ERR J., F U L., H UANG H., AVIGAL Y., TANCIK M., I CH -
T HIES J.: Tech: Text-guided reconstruction of lifelike clothed humans. NOWSKI J., K ANAZAWA A., G OLDBERG K.: Evo-nerf: Evolving nerf