Advanced Topics in Science and Technology in China aims to present the latest
and most cutting-edge theories, techniques, and methodologies in various
research areas in China. It covers all disciplines in the fields of natural science
and technology, including but not limited to computer science, materials
science, life sciences, engineering, environmental sciences, mathematics, and
physics.
Faxin Yu
Zheming Lu
Hao Luo
Pinghui Wang
Three-Dimensional Model
Analysis and Processing
ISBN 978-7-308-07412-4
Zhejiang University Press, Hangzhou
Three-Dimensional Model Analysis and Processing (in English) / by Faxin Yu et al. — Hangzhou: Zhejiang University Press, 2010.4
(Advanced Topics in Science and Technology in China)
ISBN 978-7-308-07412-4
Published and distributed by Zhejiang University Press (http://www.zjupress.com)
and Springer-Verlag GmbH (http://www.springer.com)
Format: 710 mm × 1000 mm, 1/16
Printed sheets: 27.25
Word count: 785 thousand
Edition: first edition, first printing, April 2010
Book numbers: ISBN 978-7-308-07412-4 (Zhejiang University Press)
ISBN 978-3-642-12650-5 (Springer-Verlag GmbH)
Price: 176.00 yuan
All rights reserved. Copies with printing or binding defects will be exchanged.
Zhejiang University Press distribution department, mail-order telephone: (0571) 88925591
Preface
With the increasing popularization of the Internet, together with the rapid
development of 3D scanning technologies and modeling tools, 3D model
databases have become more and more common in fields such as biology,
chemistry, archaeology and geography. People can distribute their own 3D works
over the Internet, search and download 3D model data, and also carry out
electronic trade over the Internet. However, this raises several serious issues: (1) How
to efficiently transmit and store huge 3D model data with
limited bandwidth and storage capacity; (2) How to prevent 3D works from being
pirated and tampered with; (3) How to search for the desired 3D models in huge
multimedia databases. This book is devoted to partially solving the above issues.
Compression is useful because it helps reduce the consumption of expensive
resources, such as hard disk space and transmission bandwidth. On the downside,
compressed data must be decompressed to be used, and this extra processing may
be detrimental to some applications. The 3D polygonal mesh (carrying geometry, color,
normal vector and texture coordinate information) is a common surface
representation that is now heavily used in various multimedia applications such as
computer games, animations and simulation applications. To maintain a
convincing level of realism, many applications require highly detailed mesh
models. However, such complex models demand broad network bandwidth and
much storage capacity to transmit and store. To address these problems, 3D mesh
compression is essential for reducing the size of 3D model representation.
Feature extraction is a special form of dimensionality reduction. When the
input data to an algorithm are too large to be processed and suspected to be
highly redundant (much data, but not much information), they are transformed
into a reduced set of features (also called a feature vector). If the extracted
features are carefully chosen, the feature set is expected to capture the relevant
information from the input data, so that the desired task can be performed using
this reduced representation instead of the full-size input. Feature extraction is an
essential step in content-based 3D model retrieval systems. In general, the shape
of a 3D object is described by a feature vector that
serves as a search key in the database. If an unsuitable feature extraction method
has been used, the whole retrieval system will be unusable. We must realize that
3D objects can be saved in many representations, such as polyhedral meshes,
The authors
Hangzhou, China
January, 2010
Contents
1 Introduction ...............................................................................................1
1.1 Background ............................................................................................ 1
1.1.1 Technical Development Course of Multimedia.......................... 1
1.1.2 Information Explosion ............................................................... 3
1.1.3 Network Information Security ................................................... 6
1.1.4 Technical Requirements of 3D Models...................................... 9
1.2 Concepts and Descriptions of 3D Models ............................................ 11
1.2.1 3D Models................................................................................ 11
1.2.2 3D Modeling Schemes ............................................................. 13
1.2.3 Polygon Meshes ....................................................................... 20
1.2.4 3D Model File Formats and Processing Software.................... 22
1.3 Overview of 3D Model Analysis and Processing ................................. 31
1.3.1 Overview of 3D Model Processing Techniques ....................... 31
1.3.2 Overview of 3D Model Analysis Techniques........................... 35
1.4 Overview of Multimedia Compression Techniques.............................. 38
1.4.1 Concepts of Data Compression................................................ 38
1.4.2 Overview of Audio Compression Techniques.......................... 39
1.4.3 Overview of Image Compression Techniques.......................... 42
1.4.4 Overview of Video Compression Techniques .......................... 46
1.5 Overview of Digital Watermarking Techniques ................................... 48
1.5.1 Requirement Background ........................................................ 48
1.5.2 Concepts of Digital Watermarks .............................................. 50
1.5.3 Basic Framework of Digital Watermarking Systems ............... 51
1.5.4 Communication-Based Digital Watermarking Models ............ 52
1.5.5 Classification of Digital Watermarking Techniques................. 54
1.5.6 Applications of Digital Watermarking Techniques .................. 56
1.5.7 Characteristics of Watermarking Systems................................ 58
1.6 Overview of Multimedia Retrieval Techniques .................................... 62
1.6.1 Concepts of Information Retrieval........................................... 62
1.6.2 Summary of Content-Based Multimedia Retrieval .................. 65
2 3D Mesh Compression...............................................................................91
2.1 Introduction .......................................................................................... 91
2.1.1 Background .............................................................................. 91
2.1.2 Basic Concepts and Definitions ............................................... 93
2.1.3 Algorithm Classification ........................................................ 100
2.2 Single-Rate Connectivity Compression.............................................. 102
2.2.1 Representation of Indexed Face Set....................................... 103
2.2.2 Triangle-Strip-Based Connectivity Coding............................ 104
2.2.3 Spanning-Tree-Based Connectivity Coding........................... 105
2.2.4 Layered-Decomposition-Based Connectivity Coding............ 107
2.2.5 Valence-Driven Connectivity Coding Approach.................... 108
2.2.6 Triangle Conquest Based Connectivity Coding ..................... 111
2.2.7 Summary ................................................................................ 115
2.3 Progressive Connectivity Compression.............................................. 116
2.3.1 Progressive Meshes................................................................ 117
2.3.2 Patch Coloring ....................................................................... 121
2.3.3 Valence-Driven Conquest ...................................................... 122
2.3.4 Embedded Coding.................................................................. 124
2.3.5 Layered Decomposition ......................................................... 125
2.3.6 Summary ................................................................................ 126
2.4 Spatial-Domain Geometry Compression ............................................ 127
2.4.1 Scalar Quantization ................................................................ 128
2.4.2 Prediction ............................................................................... 129
2.4.3 k-d Tree .................................................................................. 132
2.4.4 Octree Decomposition............................................................ 133
2.5 Transform Based Geometric Compression......................................... 134
2.5.1 Single-Rate Spectral Compression of Mesh Geometry.......... 135
2.5.2 Progressive Compression Based on Wavelet Transform........ 136
2.5.3 Geometry Image Coding........................................................ 139
2.5.4 Summary ................................................................................ 140
Index ...........................................................................................417
1 Introduction
The digitization of multimedia data, such as images, graphics, speech, text, audio,
video and 3D models, has made the storage of multimedia more and more
convenient, and has simultaneously improved the efficiency and accuracy of
information representation. With the increasing popularization of the Internet,
multimedia communication has reached an unprecedented level of depth and
breadth, and the channels of multimedia distribution are becoming more and more diverse.
People can distribute their own works over the Internet, search and download
multimedia data, and also carry out electronic trade over the Internet. However,
some serious issues accompany this as follows: (1) How can we efficiently
transmit and store huge multimedia information with limited bandwidth and
storage capacity? (2) How can we prevent multimedia works from being pirated
and tampered with? (3) How can we search for the desired multimedia content in
huge multimedia databases?
1.1 Background
We first introduce the background to three urgent issues for multimedia, i.e.,
(1) storage and transmission, (2) protection and authentication, (3) retrieval and
recognition.
Human civilization has truly entered the networked era with the arrival of the Internet.
In fact, we live with all kinds of networks, such as electrical networks, telephone
networks, broadcast and television networks, commercial networks and traffic networks.
However, all of these differ greatly from the Internet, which has affected so many
governments, enterprises and individuals in such a short time. Nowadays, the word
"network" has become practically synonymous with the Internet. In the past few years,
with the rapid development of computer and network techniques, the scale of the
Internet has expanded dramatically. Internet technology breaks down traditional
borders, making the world smaller and smaller while making the market larger and
larger. The wide world has become like a global village, where the global
economy and information networking promote and depend on each other. The
Internet makes the speed and scale of information acquisition and transmission
reach an unprecedented level. In the era of information networking, the Internet
should be considered for any product or technique. Network information systems
are playing more and more important roles in politics, military affairs, finance,
commerce, transportation, telecommunication, culture and education. Modern
communication and transmission techniques have greatly improved the speed and
extent of information transmission. The technical means include broadcasts,
television, satellite communication and computer communication using
microwave and optical fiber communication networks, which overcome traditional
obstacles in space and time and further unite the whole world. However, the
accompanying issues and side effects are as follows: a surge of information
overwhelms people, and it is very hard to retrieve the most needed information
accurately and rapidly from the tremendous amount available. This
phenomenon is called the information explosion [2], also called “information
overload” or “knowledge bombing”.
The information explosion describes the rapid growth in the amount of
information or human knowledge in recent years, spreading like a bomb blast
engulfing the whole world. The phrase "information explosion" dates back to
the 1980s. At that time, besides broadcasting, television, telephone,
newspapers and various publications, new means of communication, i.e.,
computers and communication satellites emerged, making the amount of
information increase suddenly like an explosion. Statistics show that over the past
decade the amount of information all over the world doubled every 20 months.
During the 1990s, the amount of information continued to increase dramatically.
At the end of the 1990s, due to the emergence of the Internet, information
distribution and transmission got out of control, and a great deal of false or useless
information was generated, resulting in the pollution of information environments
and the birth of “waste messages”. Because everyone can freely air his opinion
over the Internet, and the distribution cost can be ignored, in a sense everyone can
become an information manufacturer on the global level, and thus information
really starts to explode. As time goes by, the information explosion manifests itself
mainly in five aspects: (1) the rapid increase in the amount of news; (2) the
dramatic increase in the amount of amusement information; (3) a barrage of
advertisements; (4) the rapid increase in scientific and technical information; (5)
the overloading of our personal receptiveness. However, faced with the inflated
amount of information and the enormous pressure of a "chaotic information space"
and an "information surplus", people who once urgently pursued and expected
information suddenly find themselves hesitant. Even if we spend 24 hours every day reading
information, we cannot take it all in, and besides, there is a great deal of useless or
false information. Useful information can increase economic benefits and promote
the development of human society, but if the information increases in a disorderly
fashion and even runs out of control, it will bring about various social problems
such as information crime and information pollution. People on the one hand are
enjoying the convenience brought about by abundant information over the Internet;
on the other hand they are suffering from annoyance due to the “information
The security problems of most modern computer networks were neglected at the
beginning of their construction and, even where they were not, the security
mechanism was based only on physical security. Therefore, with the enlargement
of the networking scale, this physical security mechanism becomes little more than
an empty shell in the network environment. In addition, the protocols in use
nowadays, e.g., the TCP/IP protocol, did not take security into account from the
beginning. Thus, openness and resource sharing are the main roots of the computer
network security problem, and security mainly depends on encryption, network user
authentication and access control strategies. Facing such severe threats that harm
network information systems and considering the importance of network security
and secrecy, we must take effective measures in order to guarantee the security
and secrecy of the network information. The network measures for security can be
classified in the following three categories: logical-based, physical-based and
policy-based. In the face of threats that harm computer network security more and
more severely, physical-based or policy-based means alone cannot effectively ward
off computer crime. People should therefore adopt logical-based measures, that is,
research and develop effective techniques for network and information security.
Even if we have comprehensive policies and rules for security and secrecy, very
advanced security and secrecy techniques and flawless physical security
mechanisms, all efforts will be in vain if this knowledge cannot be popularized.
People’s understanding of information security is continually updated. In the
era of host computers, people understood information security as the protection of
the confidentiality, integrity and availability of information, which is data-oriented.
In the era of microcomputers and local area networks in the 1980s, because of the
simple structure of users and networks, information security was administrator-
oriented and regulation-oriented. In the era of the Internet in the 1990s, every user
could access, use and control the connected computers everywhere, and thus
information security over the Internet emphasizes connection-oriented and
user-oriented security. Thus it can be seen that data-oriented security considers the
confidentiality, integrity and availability of information, while user-oriented
security considers authentication, authorization, access control, non-repudiation
and serviceability, together with content-based individual privacy and copyright
protection. Combining the above two aspects of security, we can obtain the
concept of generalized information security [3], that is, all theories and techniques
related to information confidentiality, integrity, availability, authenticity and
controllability, encompassing physical security, network security, data security,
information content security, information infrastructure security and public
information security. On the other hand, information security in the narrow sense
indicates information content security, which is the protection of the secrecy,
authenticity and integrity of the information, preventing attackers from wiretapping,
impersonation, deception and misappropriation, and protecting legitimate users'
interests and privacy. The security services in the information security architecture rely
on ciphers, digital signatures, authentication techniques, firewalls, security auditing,
disaster recovery, anti-virus measures, hacker intrusion prevention, and so on. Among them,
cryptographic techniques and management means are the core of information
security, while security standards and system evaluation methods are its foundation.
Technically, information security is an interdisciplinary
subject involving computer science, network techniques, communication
techniques, applied mathematics, number theory, information theory, and so on.
Network information security covers four aspects, i.e., the security of
information in communication, the security of information in storage, the auditing
of network information content, and user authentication. To maintain the security of data
transmission, it is necessary to apply data encryption and integrity verification
techniques. To guarantee the security of information storage, it is necessary to
guarantee database security and terminal security. An information content
audit checks the content of information entering and leaving the network, so as
to prevent or trace possible leaks of secrets. User identification is the process of
verifying the identity of a principal in the network. Usually there are three kinds of
methods for verifying a principal's identity. The first relies on a secret known
only by the principal, e.g., a password or key. The second relies on objects
carried by the principal, e.g., smart cards or token cards. The third relies on the
principal's unique characteristics or abilities, e.g., fingerprints, voice, retina
patterns, signatures, etc. The technical
characteristics of network information security mainly embody the following five
aspects: (1) Integrity. It means that network information cannot be altered without
authorization. It guards against active attacks, guaranteeing data consistency and
preventing data from being modified or destroyed by unauthorized users. (2)
Confidentiality. It is the characteristic that network information cannot be leaked
to unauthorized users. It guards against passive attacks so as to guarantee that
secret information is not leaked to unauthorized users. (3) Availability. It is the
characteristic that network information can be accessed and used by legitimate
users whenever needed. It prevents legitimate users from being unreasonably
denied access to information and resources. (4) Non-repudiation. It means that no
participant in the network can deny or disavow completed operations and promises.
The sender cannot deny information that has already been sent, while the receiver
cannot deny information that has already been received. (5) Controllability. It is
the ability to control the content of network information and its dissemination,
namely, to monitor the security of network information.
The coming of the network information era also poses a new challenge to
industrial manufacturing, and one can also find applications in electronic business
and web-based search engines. Therefore, how to rapidly search for the required
3D models has become another popular topic, following the retrieval techniques for
text, audio, images and video. 3D model retrieval technology involves
several areas such as artificial intelligence, computer vision and pattern
recognition. The underlying problem in content-based 3D model retrieval systems
is to select appropriate features to distinguish dissimilar shapes and index 3D
models. Based on these requirements, this book discusses 3D model feature
extraction techniques in Chapter 3, and introduces 3D model retrieval techniques
in Chapter 4.
On the other hand, with the ceaseless emergence of advanced modeling tools
and the increasing maturity of 3D shape scanning techniques, people have
placed greater demands on the accuracy and detail of 3D geometric data, which
has at the same time brought about a rapid growth in the scale and complexity of
3D geometric data. Huge geometric data have enormously challenged the capacity
and speed of current 3D graphics search engines. Furthermore, the development of
the Internet makes the application of 3D geometric data broader and broader.
However, the limitation of bandwidth has severely restricted the distribution of
this kind of media. It is not sufficient to solve this problem merely by increasing
the capability of hardware devices; we also need to research 3D
model compression techniques. Thus, this book discusses 3D model compression
techniques in Chapter 2.
More seriously, with the development of computer technologies, CAD, virtual
reality and network technologies have made considerable progress, and more and
more 3D models have been created, distributed, downloaded and used. Because
3D models possess commercial value, visual value and economic benefits, the
producers and copyright owners of these 3D products will inevitably have to face
up to the practical issues of copyright (or intellectual property rights) protection
and content authentication during the distribution of 3D models over the Internet.
Thus, this book discusses the watermarking and reversible data hiding techniques
of 3D models in Chapters 5 and 6.
Besides the above three technical requirements, there are some other
technical requirements for 3D models including simplification, reconstruction,
segmentation, interactive display, matching and recognition, and so on. For
example, computer-aided geometric modeling techniques have been widely used
during product development and manufacturing processes, but there are still many
products not originally described by CAD models because the designers or
manufacturers are faced with material objects. In order to utilize the advanced
manufacturing technology, we should transform material objects into CAD models,
and this has been a relatively independent research area in CAD or CAM
(computer-aided manufacturing) systems, i.e., reverse engineering [4]. To take a
second example, mesh segmentation [5] has become a hot research topic because
it has become an important technical requirement to modify current models
according to the new design goal by reusing previous models. Mesh segmentation
stands for the technique of segmenting a closed mesh polyhedron or orientable 2D
manifold, according to certain geometric or topological characteristics, into a certain
1.2 Concepts and Descriptions of 3D Models
In the following, the concepts, descriptions and research directions for 3D models,
a newly-developed type of digital media, are presented. Based on the three aspects of
technical requirements, the basic concepts and the commonly-used techniques for
multimedia compression, multimedia watermarking, multimedia retrieval and
multimedia perceptual hashing are then summarized.
1.2.1 3D Models
3D models can be roughly classified into two categories: (1) Solid models.
These models define the volume of the object they represent (like a rock). These
are more realistic, but more difficult to build. Solid models are mostly used for
non-visual simulations such as medical and engineering simulations, and for CAD
and specialized visual applications such as ray tracing and constructive solid
geometry. (2) Shell/Boundary models. These models represent the surface, e.g.,
the boundary of the object, not its volume (like an infinitesimally thin eggshell).
These are easier to work with than solid models. Almost all visual models used in
games and films are shell models.
Because the appearance of an object depends largely on the exterior of the
object, boundary representations are common in computer graphics. 2D surfaces
are a good analogy for the objects used in graphics, though quite often these
objects are non-manifold. Since surfaces are not finite, a discrete digital
approximation is required: polygonal meshes are by far the most common
representations, although point-based representations have been gaining some
popularity in recent years. Level sets are a useful representation for deforming
surfaces which undergo many topological changes, such as fluids.
The process of transforming representations of objects, such as the center
coordinates of a sphere and a point on its circumference, into a polygon
representation of a sphere is called tessellation. This step is used in polygon-based
rendering, where objects are broken down from abstract representations
(“primitives”) such as spheres, cones, etc., to so-called meshes, which are nets of
interconnected triangles. Meshes of triangles (instead of e.g. squares) are popular
as they have proven to be easy to render using scan line rendering. Polygon
representations are not used in all rendering techniques, and in these cases the
tessellation step is not included in the transition from abstract representation to the
rendered scene.
There are two types of information in a 3D model, geometrical information
and topological information. Geometrical information generally represents shapes,
locations and sizes in the Euclidean space, while topological information stands
for the connectivity between different parts of the 3D model. The 3D model itself
is invisible, but we can perform the rendering operation at different levels of detail
surface textures. The data that record such information are called 3D data, and 3D
data acquisition is the process by which the 3D information is acquired from
samples and organized as the representation consistent with the samples’
structures. The methods of acquiring 3D information from samples can be
classified in the following five categories:
(1) Methods based on direct design or measurement. They are often used in
early architectural 3D modeling. They utilize engineering drawings to obtain the
three views of each model.
(2) Image-based methods. They construct 3D models based on pictures. They
first obtain geometrical and texture information simultaneously by taking photos,
and then construct 3D models based on obtained images.
(3) Mechanical-probe-based methods. They acquire the surface data by
physical touch between the probe and the object. They require the object to have
a certain hardness.
(4) Methods based on volume data restoration. They adopt a series of slicing
images of the object to restore the 3D shape of the object. They are often used in
medical departments with X-ray slice images, CT images and MRI images.
(5) Region-scanning-based methods. They obtain the position of each vertex in
the space by estimating the distance between the measuring instrument and each
point on the object surface. Two examples of the methods are optical triangulation
and interferometry.
The main problem in 3D modeling is to generate 3D models based on 3D data.
To achieve a better visual effect, we should guarantee that the model has smooth
surfaces, free of burrs and holes, and make the 3D model convey a sense of
three-dimensionality and realism. At the same time, we should organize the data
in a better manner to reduce the storage space and speed up display. Current modeling
techniques can be mainly classified in three categories: geometric-modeling-based,
3D scanner-based and image-based, which can be described in detail as follows.
representation of freeform surfaces like those used for ship hulls, aerospace
exterior surfaces and car bodies, which could be exactly reproduced whenever
technically needed. Prior representations of this kind of surface only existed as a
single physical model created by a designer. The pioneers of this development
were Pierre Bézier who worked as an engineer at Renault, and Paul de Casteljau
who worked at Citroën, both in France. Bézier worked almost in parallel to de
Casteljau, neither knowing about the work of the other. But because Bézier
published the results of his work, the average computer graphics user today
recognizes splines — which are represented with control points lying off the curve
itself — as Bézier splines, while de Casteljau’s name is only known and used for
the algorithms he developed to evaluate parametric surfaces. In the 1960s, it
became clear that NURBSs are a generalization of Bézier splines, which can be
regarded as uniform, non-rational B-splines. At first, non-uniform rational B-splines were only
used in the proprietary CAD packages of car companies. Later they became part of
standard computer graphics packages. In 1985, the first interactive NURBS
modeler for PCs, called Macsurf (later Maxsurf), was developed by Formation
Design Systems, a small startup company based in Australia. Maxsurf is a marine
hull design system intended for the creation of ships, workboats and yachts, whose
designers have a need for highly accurate sculptured surfaces. Real-time,
interactive rendering of NURBS curves and surfaces was first made available on
Silicon Graphics workstations in 1989. Today, most professional computer
graphics applications available for desktop use offer NURBS technology, which is
most often realized by integrating a NURBS engine from a specialized company.
3) Subdivision surface modeling. Subdivision surface modeling, in the field of
3D computer graphics, is a method of representing a smooth surface via the
specification of a coarser piecewise linear polygon mesh. The smooth surface can
be calculated from the coarse mesh as the limit of a recursive process of
subdividing each polygonal face into smaller faces that better approximate the
smooth surface. The subdivision surfaces are defined recursively. The process
starts with a given polygonal mesh. A refinement scheme is then applied to this
mesh. This process takes that mesh and subdivides it, creating new vertices and
new faces. The positions of the new vertices in the mesh are computed based on
the positions of nearby old vertices. In some refinement schemes, the positions of
old vertices might also be altered (possibly based on the positions of new vertices).
This process produces a denser mesh than the original one, containing more
polygonal faces. This resulting mesh can be passed through the same refinement
scheme again. The limit subdivision surface is the surface produced from this
process being iteratively applied infinitely many times. In practical use, however,
this algorithm is only applied a limited number of times.
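To make the recursive refinement idea concrete, the following minimal sketch applies Chaikin's corner-cutting scheme to a closed 2D polygon; this is only an illustrative analogue of the surface schemes described above, written in Python for brevity, and it is not an algorithm prescribed by this book.

# Minimal illustration of recursive refinement: Chaikin's corner-cutting
# on a closed 2D polygon. Surface schemes follow the same pattern but
# subdivide mesh faces instead of polygon edges.
def chaikin(points, iterations=3):
    """Each pass replaces every edge (p, q) with two new points at
    1/4 and 3/4 along the edge, so the control polygon is refined."""
    for _ in range(iterations):
        refined = []
        n = len(points)
        for i in range(n):
            (x0, y0), (x1, y1) = points[i], points[(i + 1) % n]
            refined.append((0.75 * x0 + 0.25 * x1, 0.75 * y0 + 0.25 * y1))
            refined.append((0.25 * x0 + 0.75 * x1, 0.25 * y0 + 0.75 * y1))
        points = refined
    return points

# A coarse square converges toward a smooth closed limit curve.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(len(chaikin(square)))   # 4 -> 8 -> 16 -> 32 control points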
(3) Solid modeling. Solid modeling is the unambiguous representation of the
solid parts of an object, which means models of solid objects suitable for computer
processing. As we know, surface models are used extensively in automotive and
consumer product design as well as entertainment animation, while wireframe
models are ambiguous about solid volume. Primary uses of solid modeling are for
CAD, engineering analysis, computer graphics and animation, rapid prototyping,
medical testing, product visualization and visualization of scientific research.
(1) Contact. Contact 3D scanners probe the subject through physical touch. A
coordinate measuring machine (CMM) is an example of a contact 3D scanner. It is
used mostly in manufacturing and can be very precise. The disadvantage of
CMMs is that they require contact with the object being scanned. Thus, the
scanning operation might modify or damage the object. This fact is very
significant when scanning delicate or valuable objects such as historical artifacts.
The other disadvantage of CMMs is that they are relatively slow compared to the
other scanning methods. Physically moving the arm that the probe is mounted on
can be very slow and the fastest CMMs can only operate at a few hundred hertz.
In contrast, an optical system like a laser scanner can operate from 10 to 500 kHz.
Other examples are the hand-driven touch probes used to digitize clay models in
the computer animation industry.
(2) Non-contact active. Active scanners emit some kind of radiation or light
and detect its reflection in order to probe an object or environment. Possible types
of emissions used include light, ultrasound or X-ray. For example, both
time-of-flight and triangulation 3D laser scanners are active scanners that use laser
lights to probe the subject or environment. The advantage of time-of-flight range
finders is that they are capable of operating over very long distances, in the order
of kilometers. These scanners are thus suitable for scanning large structures like
buildings or geographic features. The disadvantage of time-of-flight range finders
is their accuracy. Due to the high speed of light, timing the round-trip time is
difficult and the accuracy of the distance measurement is relatively low, in the
order of millimeters. Triangulation range finders are exactly the opposite. They
have a limited range of some meters, but their accuracy is relatively high. The
accuracy of triangulation range finders is in the order of tens of micrometers.
(3) Non-contact passive. Passive scanners do not emit any radiation
themselves, but instead rely on detecting reflected ambient radiation. Most
scanners of this type detect visible light because it is a readily available ambient
radiation. Other types of radiation, such as infrared, could also be used. Passive
methods can be very cheap, because in most cases they do not need particular
hardware. For example, stereoscopic systems usually employ two video cameras,
slightly apart, looking at the same scene. By analyzing the slight differences
between the images seen by each camera, it is possible to determine the distance at
each point in the images. This method is based on human stereoscopic vision. In
contrast, photometric systems usually use a single camera, but take multiple
images under varying lighting conditions. These techniques attempt to invert the
image formation model in order to recover the surface orientation at each pixel. In
addition, silhouette-based 3D scanners use outlines generated from a sequence of
photographs around a 3D object against a well-contrasted background. These
silhouettes are extruded and intersected to form the visual hull approximation of
the object. However, some types of concavities in an object (like the interior of a
bowl) cannot be detected by these techniques.
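As a rough sketch of the stereoscopic principle mentioned above (the standard pinhole-camera relation, stated here as background and not taken from this book): for two parallel cameras with focal length f and baseline b, a scene point whose projections differ by a disparity d lies at depth Z = f·b/d, so larger disparities correspond to nearer points; this is why the slight differences between the two images encode distance.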
polygon meshes over a network. Volumetric meshes are distinct from polygon
meshes in that they explicitly represent both the surface and volume of a structure,
while polygon meshes only explicitly represent the surface (the volume is
implicit). As polygonal meshes are extensively used in computer graphics,
algorithms also exist for ray tracing, collision detection and rigid-body dynamics
of polygon meshes.
Objects created with polygon meshes must store different types of elements,
including vertices, edges, faces, polygons and surfaces. In many applications, only
vertices, edges and either faces or polygons are stored as shown in Fig. 1.3. A
renderer may support only 3-sided faces, so polygons must be composed of many
of these. However, many renderers either support quadrangles and higher-sided
polygons, or are able to triangulate polygons to triangles on the fly, making it
unnecessary to store a mesh in a triangulated form. Also, in certain applications
like head modeling, it is desirable to be able to create both 3- and 4-sided
polygons.
where {i_k, j_k} denotes the k-th edge that connects the i_k-th and j_k-th vertices.
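A minimal sketch of how such an indexed face set is commonly held in memory (an illustration in Python, not the book's notation): geometry is a list of vertex coordinates, topology is a list of faces indexing into it, and the edges {i_k, j_k} can be derived from the faces on demand.

# A minimal indexed face set: geometry as a vertex list, topology as
# faces that index into it; edges are derived from the faces on demand.
vertices = [              # (x, y, z) coordinates of a tetrahedron
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
]
faces = [                 # each face lists three vertex indices
    (0, 2, 1),
    (0, 1, 3),
    (0, 3, 2),
    (1, 2, 3),
]

def edges(faces):
    """Collect the undirected edges {i_k, j_k} shared by the faces."""
    found = set()
    for f in faces:
        for a, b in zip(f, f[1:] + f[:1]):
            found.add((min(a, b), max(a, b)))
    return sorted(found)

print(len(vertices), len(faces), len(edges(faces)))   # 4 4 6

1.2.4 3D Model File Formats and Processing Software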
Currently, there are many types of software for 3D model generation, design and
processing. The famous ones include AutoCAD, 3ds Max, Maya, Art of Illusion,
ngPlant, Multigen, SketchUp, and so on. The most common ones are AutoCAD,
3DSMAX and MAYA, which will be introduced in detail below. 3D data can be
stored in various formats, including 3DS, OBJ, ASE, MD2, MD3, MS3D, WRL,
MDL, BSP, GEO, DXF, DWG, STL, NFF, RAW, POV, TTF, COB, VRML, OFF,
and so on. Currently, the most common ones are 3DS, OBJ and DXF, and OFF
and OBJ are the two most common formats used in academic research, which will
be introduced in detail below. Before introducing these types of software and file
formats, we must introduce OpenGL, the industrial standard for high-performance
graphics.
1.2.4.1 OpenGL
1.2.4.2 AutoCAD
1.2.4.4 Maya
released for the IRIX operating system, and subsequently ported to the Microsoft
Windows, Linux, and Mac OS X operating systems. IRIX support was
discontinued after the release of Version 6.5. When Autodesk acquired Alias in
October 2005, they continued the development of Maya. The latest version, 2009
(10.0), was released in October 2008. An important feature of Maya is its
openness to third-party software, which can strip the software completely of its
standard appearance and, using only the kernel, transform it into a highly
customized version of the software. This feature in itself made Maya appealing to
large studios, which tend to write custom codes for their productions using the
provided software development kit. A Tcl-like cross-platform scripting language
called Maya Embedded Language (MEL) is provided not only as a scripting
language, but as a means to customize Maya’s core functionality. Additionally,
user interactions are implemented and recorded as MEL scripting codes which
users can store on a toolbar, allowing animators to add functionality without
experience in C or C++, though that option is provided with the software
development kit. Support for Python scripting was added in Version 8.5. The core
of Maya itself is written in C++. Project files, including all geometry and
animation data, are stored as sequences of MEL operations which can be
optionally saved as a human-readable file (.ma, for “Maya ASCII”), editable in
any text editor outside of the Maya environment, thus allowing for a high level of
flexibility when working with external tools. A marking menu is built into a larger
menu system called Hotbox that provides instant access to a majority of features
in Maya at the press of a key.
The 3DS format is one of the file formats used by Discreet Software’s 3D Studio
Max. It is close to being the most common format and is supported by many
applications. DirectX does not provide native support for loading 3DS files, but
you can find code to convert a 3DS file to DirectX's internal format.
The 3DS file format is made up of chunks. They describe what information is
to follow, what it is made up of, its ID and the location of the next block. If you do
not understand a chunk you can quite simply skip it. The next chunk pointer is
relative to the start of the current chunk and is given in bytes. The binary
information in the 3DS file is written in little-endian order; namely, the least
significant byte comes first in an integer. For example: 4A 5C (2 bytes in hex) would be 5C high byte and
4A low byte. In a long integer, it is 4A 5C 3B 8F where 5C 4A is the low word and
8F 3B is the high word. A chunk is defined as:
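The chunk layout itself is not reproduced above; as commonly documented for the 3DS format (an assumption here, not a listing taken from this book), each chunk starts with a 2-byte identifier followed by a 4-byte length that covers the whole chunk, both little-endian. A sketch in Python of walking such headers:

import struct

def read_chunks(data, offset=0, end=None):
    """Walk 3DS chunk headers in a byte string: a 2-byte ID plus a 4-byte
    length (little-endian); the length spans the whole chunk, so the next
    chunk starts at offset + length, which lets a reader skip unknown chunks."""
    end = len(data) if end is None else end
    while offset + 6 <= end:
        chunk_id, length = struct.unpack_from('<HI', data, offset)
        yield chunk_id, offset, length
        offset += length

# Example: a primary chunk 0x4D4D wrapping one empty sub-chunk 0x3D3D.
blob = struct.pack('<HI', 0x4D4D, 12) + struct.pack('<HI', 0x3D3D, 6)
for cid, off, length in read_chunks(blob):
    print(hex(cid), length)                              # 0x4d4d 12
    for sub_id, _, sub_len in read_chunks(blob, off + 6, off + length):
        print('  sub-chunk', hex(sub_id), sub_len)       # 0x3d3d 6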
Chunks have a hierarchy imposed on them that is identified by their IDs. A 3DS
file has the primary chunk ID 4D4Dh. This is always the first chunk of the file.
Within the primary chunk are the main chunks.
# This is a comment
# Here is the first vertex, with (x,y,z) coordinates.
v 0.123 0.234 0.345
v ...
...
# Texture coordinates
vt ...
...
# Normals in (x,y,z) form; normals might not be unit.
vn ...
...
# Each face is given by a set of indices to the vertex/texture/normal
# coordinate array that precedes this.
# Hence f 1/1/1 2/2/2 3/3/3 is a triangle having texture coordinates and
# normals for those 3 vertices,
# and having the vertex 1 from the “v” list, texture coordinate 2 from
# the “vt” list, and the normal 3 from the “vn” list
f v0/vt0/vn0 v1/vt1/vn1 ...
f ...
...
# When there are named polygon groups or materials groups the following
# tags appear in the face section,
g [group name]
usemtl [material name]
# the latter matches the named material definitions in the external .mtl file.
# Each tag applies to all faces following, until another tag of the same type
# appears.
...
...
An OBJ file also supports smoothing parameters to allow for curved objects,
and also the possibility to name groups of polygons. It also supports materials by
referring to an external MTL material file. OBJ files, due to their list structure, are
able to reference vertices, normals, etc., either by their absolute (1-indexed) list
position, or relatively by using negative indices and counting backwards. However,
not all software supports the latter approach, and conversely some software
inherently writes only the latter form (due to the convenience of appending
elements without the need to recalculate vertex offsets, etc.), leading to occasional
incompatibilities.
Now let us see a practical case. We create a polygon cube using the Maya
software, as shown in Fig. 1.5. Select this cube and use the menu item "File → Export
Selection..." to export it as an OBJ file named "cube.obj". If the OBJ option is not
available, load "objExport.mll" in the Plug-in Manager. Opening "cube.obj" in a text
editor such as Notepad, we obtain the following code:
Fig. 1.5. The polygon with holes created by the Maya software
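The exported listing is not reproduced here in full; a representative OBJ file for such a unit cube (an illustrative sketch only, with texture coordinates, normals and material statements omitted, so an actual Maya export will contain more lines) looks like this:

# cube.obj (simplified)
v -0.5 -0.5  0.5
v  0.5 -0.5  0.5
v -0.5  0.5  0.5
v  0.5  0.5  0.5
v -0.5  0.5 -0.5
v  0.5  0.5 -0.5
v -0.5 -0.5 -0.5
v  0.5 -0.5 -0.5
f 1 2 4 3
f 3 4 6 5
f 5 6 8 7
f 7 8 2 1
f 2 8 6 4
f 7 1 3 5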
Object file format (OFF) files are used to represent the geometry of a model by
specifying the polygons of the model’s surface. The polygons can have any
number of vertices. The .off files in the Princeton Shape Benchmark conform to
the following standard. OFF files are all ASCII files beginning with the keyword
OFF. The next line states the number of vertices, the number of faces and the
number of edges. The number of edges can be safely ignored. The vertices are
listed with x, y, z coordinates, written one per line. After the list of vertices, the
faces are listed, with one face per line. For each face, the number of vertices is
specified, followed by indices into the list of vertices. Note that earlier versions of
the model files had faces with -1 indices into the vertex list. That was due to an
error in the conversion program and has now been corrected.
Note that vertices are numbered starting at 0 (not starting at 1), and that
numEdges will always be zero. A simple example for a cube is as follows:
OFF
8 6 0
-0.500000 -0.500000 0.500000
0.500000 -0.500000 0.500000
-0.500000 0.500000 0.500000
0.500000 0.500000 0.500000
-0.500000 0.500000 -0.500000
0.500000 0.500000 -0.500000
-0.500000 -0.500000 -0.500000
0.500000 -0.500000 -0.500000
4 0 1 3 2
4 2 3 5 4
4 4 5 7 6
4 6 7 1 0
4 1 7 5 3
4 6 0 2 4
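A minimal reader for files that follow exactly this layout can be written in a few lines; the sketch below (Python, not robust against comments or unusual whitespace) simply tokenizes the file and rebuilds the vertex and face lists:

def read_off(path):
    """Read a simple OFF file: the OFF keyword, a counts line, then the
    vertex coordinates and the faces (vertex count followed by indices)."""
    with open(path) as fh:
        tokens = fh.read().split()
    assert tokens[0] == 'OFF', 'not an OFF file'
    n_vertices, n_faces = int(tokens[1]), int(tokens[2])   # edge count ignored
    pos = 4                                                # skip OFF + 3 counts
    vertices = []
    for _ in range(n_vertices):
        vertices.append(tuple(float(t) for t in tokens[pos:pos + 3]))
        pos += 3
    faces = []
    for _ in range(n_faces):
        k = int(tokens[pos])
        faces.append([int(t) for t in tokens[pos + 1:pos + 1 + k]])
        pos += 1 + k
    return vertices, faces

# vertices, faces = read_off('cube.off')   # 8 vertices, 6 quadrilateral faces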
The DXF format is a tagged data representation of all the information contained in
an AutoCAD drawing file. Tagged data means that each data element in the file is
preceded by an integer number that is called a group code. A group code’s value
indicates what type of data element follows. This value also indicates the meaning
of a data element for a given object type. Virtually all user-specified information
in a drawing file can be represented in the DXF format. The DXF reference
presents the DXF group codes found in DXF files and encountered by AutoLISP
and ObjectARX™ applications. The reference first describes the general DXF
conventions, and its remaining chapters list the group codes organized by object
type. The group codes are presented in the order they are found in a DXF file, and
each chapter is named according to the associated section of a DXF file. In the
DXF format, the definition of objects differs from entities: objects have no
graphical representation but entities do. For example, dictionaries are objects
without entities. Entities are also referred to as graphical objects, while objects are
referred to as non-graphical objects. Entities appear in both the BLOCK and
ENTITIES sections of the DXF file. The use of group codes in the two sections is
identical. Some group codes that define an entity always appear; others are
optional and appear only if their values differ from the defaults. The end of an
entity is indicated by the next 0 group, which begins the next entity or indicates
the end of the section. Group codes define the type of the associated value as an
integer, a floating-point number, or a string, according to the table of group code
ranges.
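As an illustration of the tagged layout just described (a sketch only; a real DXF file contains several more sections, such as HEADER and TABLES), group codes and values simply alternate line by line, so a minimal reader can pair them up:

# DXF is line-oriented: a group code line is followed by a value line.
# The fragment below sketches a single LINE entity inside ENTITIES.
minimal_dxf = """0
SECTION
2
ENTITIES
0
LINE
8
0
10
0.0
20
0.0
30
0.0
11
1.0
21
1.0
31
0.0
0
ENDSEC
0
EOF
"""

lines = minimal_dxf.splitlines()
pairs = [(int(lines[i]), lines[i + 1].strip()) for i in range(0, len(lines), 2)]
for code, value in pairs:
    print(code, value)     # e.g. 0 SECTION, 2 ENTITIES, 0 LINE, 10 0.0, ...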
1.3 Overview of 3D Model Analysis and Processing
3D models are the fourth type of digital media following audio data, images and
video data. Compared to the first three kinds of digital media, the 3D model has its
own characteristics: (1) no data sequence; (2) no specific sampling rate; (3)
non-unique description; (4) containing both the geometric information and
topological information; (5) Both geometry and topology information can be
modified easily. Therefore, the analysis and processing techniques for 3D models
are very different from those for other media. Similar to other media, the analysis
and processing techniques for 3D models include pre-processing, de-noising,
coding and compression, copyright protection, content authentication, retrieval
and identification, segmentation, feature extraction, reconstruction, matching and
stitching, visualization, etc. However, due to the special characteristics of 3D
models, both the realization and the meaning of these techniques differ greatly
from those for traditional media. In addition, there are some special analysis and processing
techniques for 3D models, including model simplification, model voxelization,
texture mapping, speedup of the drawing, transformation of 2D graphics into 3D
models, rendering techniques, reverse engineering, 2D projection of 3D models,
contour line extraction algorithms, and so on. In the following subsections, we
briefly introduce the concepts of 3D-model-related techniques in two aspects, i.e.,
3D model processing techniques and 3D model analysis techniques. Detailed
techniques will be discussed from Chapter 2 to Chapter 6.
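1.3.1 Overview of 3D Model Processing Techniques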
The so-called 3D model processing operations are those operations whose inputs
and outputs are both 3D models or 3D objects. 3D model processing techniques
comprise many aspects, including 3D model construction, format conversion, 3D
model transmission and compression, 3D model management and retrieval.
with relatively refined models and the far objects with relatively coarse models.
The aim is to reduce the number of triangles representing the model as much as
we can, while guaranteeing a good approximation in shape to the original model.
We can describe this process as: (1) inputting the original triangle mesh data,
including geometric data, surface data, color information, texture information,
normal vectors, etc.; (2) automatically generating multiple levels of detail
through model simplification; (3) describing different parts of the
model with different levels of detail during the rendering process, guaranteeing
that the difference between the result image and the rendering result with the most
refined model is within a predefined range.
Mesh de-noising [8] is used in the surface reconstruction procedure to reduce
noise and output a higher quality triangle mesh which describes more precisely the
geometry of the scanned object. 3D surface mesh de-noising has been an active
research field for several years. Although much progress has been made, mesh
de-noising technology is still not mature. The presence of intrinsic fine details and
sharp features in a noisy mesh makes it hard to simultaneously de-noise the mesh
and preserve the features. Mesh de-noising is usually posed as a problem of
adjusting vertex positions while keeping the connectivity of the mesh unchanged.
In the literature, mesh de-noising is often confused with surface smoothing or
fairing, because all of them use vertex adjustment to make the mesh surface
smooth. However, they have different purposes and different algorithms are
needed to meet their specific requirements, and we should keep in mind the
distinctions. The main goal of mesh fairing is related to aesthetics, while the goal
of mesh de-noising has more to do with fidelity, and mesh smoothing generally
attempts to remove small scale details. Another commonly used term, mesh
filtering, is also often used in place of mesh fairing, smoothing or de-noising.
Filtering, however, is a rather general term which simply refers to some black box
which processes a signal to produce a new signal, and could, in principle, perform
some quite different function such as feature enhancement.
Voxelization [9] refers to converting geometric objects from their continuous
geometric representation into a set of voxels that best approximates the continuous
object. As this process mimics the scan-conversion process that pixelizes
(rasterizes) 2D geometric objects, it is also referred to as 3D scan conversion. In
2D rasterization, the pixels are directly drawn onto the screen to be visualized and
filtering is applied to reduce the aliasing artifacts. However, the voxelization
process does not render the voxels but merely generates a database of the discrete
digitization of the continuous object.
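A very rough sketch of this idea (illustrative Python only; a practical voxelizer scan-converts triangles, whereas this snippet merely bins pre-sampled surface points into a cubic grid):

def voxelize_points(points, resolution=16):
    """Bin 3D points (assumed to lie in the unit cube [0, 1]^3) into a
    resolution^3 voxel grid and return the set of occupied voxel indices."""
    occupied = set()
    for x, y, z in points:
        i = min(int(x * resolution), resolution - 1)
        j = min(int(y * resolution), resolution - 1)
        k = min(int(z * resolution), resolution - 1)
        occupied.add((i, j, k))
    return occupied

# A diagonal run of sample points occupies one voxel per step.
samples = [(t / 15.0, t / 15.0, t / 15.0) for t in range(16)]
print(len(voxelize_points(samples)))   # 16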
Texture mapping [10] in computer graphics generally refers to the process of
mapping a 2D image onto geometric primitives. The primitives are annotated with
an extra set of 2D coordinates that orient the image on the primitive. The
coordinate system axes of the image space are typically denoted as u and v for the
horizontal and vertical axes, respectively. When the geometry is processed, the
texture is applied to the geometry and appears draped over the geometry primitive
like painting on cloth. The texture to be draped on the geometric primitive can be
stored as an array of colors that will eventually be mapped onto the polygonal
surface. The surface to be textured is specified with vertex coordinates and texture
coordinates (u,v), the latter being used to map the color array on the polygon’s
surface. The u and v are interpolated across the span and then used as indices into
the texture map to obtain the texture color. This color is combined with the
primitive color (obtained by interpolating vertex colors across spans) or the colors
specified by the application to obtain a final color value at the pixel location.
Texture maps do not have to be color arrays but can be arrays of intensities used
for color modulation. In this case, the application can specify two colors to
modulate with the intensity, or it can take one of the colors from the primitive. The
software takes the colors and uses the intensity in the texture map to determine
how much of each color to blend to produce the color of the pixel. This is
useful for defining mottled textures found in landscape or cloth.
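A tiny sketch of the lookup step described above (nearest-texel sampling in Python; real renderers interpolate u and v across the span per pixel and usually filter the texture):

def sample_nearest(texture, u, v):
    """Map interpolated (u, v) in [0, 1] to texel indices and fetch a color.
    The texture is a 2D list of RGB tuples indexed as texture[row][column]."""
    height, width = len(texture), len(texture[0])
    col = min(int(u * width), width - 1)
    row = min(int(v * height), height - 1)
    return texture[row][col]

# A 4x4 checkerboard used as the texture map.
checker = [[(255, 255, 255) if (x + y) % 2 == 0 else (0, 0, 0)
            for x in range(4)] for y in range(4)]
print(sample_nearest(checker, 0.1, 0.1))   # (255, 255, 255)
print(sample_nearest(checker, 0.3, 0.1))   # (0, 0, 0)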
Subdivision surface refinement schemes [11] can be broadly classified into
two categories: interpolating and approximating. Interpolating schemes are
required to match the original position of vertices in the original mesh, while
approximating schemes will adjust these positions as needed. In general,
approximating schemes have greater smoothness, but editing applications that
allow users to set exact surface constraints require an optimization step. This is
analogous to spline surfaces and curves, where Bézier splines are required to
interpolate certain control points, while B-splines are not. There is another
classification of subdivision surface schemes as well, i.e., the type of polygon that
they operate on. Some work on quadrilaterals (quads), while others operate on
triangles. Approximating means that the limit surfaces approximate the initial
meshes and that, after subdivision, the newly generated control points do not lie
on the limit surfaces. Interpolating means that, after subdivision, the control points
of the original mesh and the newly generated control points lie on the limit
surface. Subdivision surfaces can be naturally edited at different levels of
subdivision. Starting with basic shapes you can use binary operators to create the
correct topology. You can edit the coarse mesh to create the basic shape and edit
the offsets for the next subdivision step, and then repeat this at finer and finer
levels. You can always see how your edit affects the limit surface via GPU
(graphic processing unit) evaluation of the surface.
compression; the other is the compression method for the 3D vertex data and some
other attribute data such as colors, texture and normal vectors, which is called
geometric compression, among which vertex compression is the focus. In 1996,
Hoppe presented a new representation scheme for 3D models, called progressive
mesh [12]. It describes a dynamic data structure that is used to represent a given
(usually quite complex) triangle mesh. At runtime, a progressive mesh provides a
triangle mesh representation whose complexity is appropriate for the current view
conditions. The purpose of progressive meshes is to speed up the rendering
process by avoiding the rendering of details that are unimportant or completely
invisible. This efficient, lossless, continuous-resolution representation addresses
several practical problems in graphics: smooth geomorphing of level-of-detail
approximations, progressive transmission, mesh compression and selective
refinement. While conventional methods use a small set of discrete LODs,
Schmalstieg et al. introduced a new class of polygonal simplification: Smooth
LODs [13]. A very large number of small details encoded in a data stream allow a
progressive refinement of the object from a very coarse approximation to the
original high quality representation. Advantages of the new approach include
progressive transmission and encoding suitable for networked applications,
interactive selection of any desired quality, and compression of the data by
incremental and redundancy-free encoding.
3D model encryption is the process of transforming 3D model data (referred to
as plaintext) using an algorithm (called cipher) to make it unreadable to anyone
except those possessing special knowledge, usually referred to as a key. The result
of the process is the encrypted 3D model (in cryptography, referred to as
ciphertext). In many contexts, the word encryption also implicitly refers to the
reverse process, decryption (e.g. “software for encryption” can typically also
perform decryption), to make the encrypted information readable again (i.e., to
make it unencrypted).
3D model information hiding refers to the process of invisibly embedding the
copyright information, the authentication information or other secret information
into 3D models to fulfill the purpose of copyright protection, content
authentication or covert communication. People usually embed information in 3D
models with digital watermarking techniques, which will be discussed in Chapters
5 and 6 of this book.
depends on the center of mass, which is defined as the center of its surface points.
To normalize a 3D model for scaling, the average distance of the points on its
surface to the center of mass should be scaled to a constant. Note that normalizing
a 3D model by scaling its bounding box is sensitive to outliers. To normalize for
translation, the center of mass is translated to the origin. To normalize a 3D model
for rotation, usually the principal component analysis (PCA) method is applied. It
aligns the principal axes to the x-, y-, and z-axes of a canonical coordinate system
by an affine transformation based on a set of surface points, e.g. the set of vertices
of a 3D model. After translation of the center of mass to the origin, a rotation is
applied so that the largest variance of the transformed points is along the x-axis.
Then a rotation around the x-axis is carried out such that the maximal spread in the
yz-plane occurs along the y-axis.
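A compact NumPy sketch of the normalization pipeline just described: translate the center of mass to the origin, rotate the principal axes onto x, y and z (largest variance first) and scale the average distance to the origin to a constant. The function name is ours, the random points stand in for surface samples, and the sign/reflection ambiguity of PCA axes is ignored.

```python
import numpy as np

def normalize_pose(points):
    """Translation, rotation (PCA) and scale normalization of a 3D point set."""
    p = np.asarray(points, dtype=float)
    centered = p - p.mean(axis=0)                     # center of mass -> origin
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    order = np.argsort(eigvals)[::-1]                 # largest variance -> x-axis
    rotated = centered @ eigvecs[:, order]            # align principal axes
    scale = np.mean(np.linalg.norm(rotated, axis=1))  # average distance to origin
    return rotated / scale

rng = np.random.default_rng(0)
model = rng.normal(size=(500, 3)) * [5.0, 2.0, 0.5] + [10.0, -3.0, 7.0]
canon = normalize_pose(model)
print(canon.mean(axis=0).round(6))     # ~ (0, 0, 0): translation removed
print(np.var(canon, axis=0).round(3))  # variances now decrease from x to z
```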
Content-based 3D model retrieval [14] has been an area of research in
disciplines such as computer vision, mechanical engineering, artifact searching,
molecular biology and chemistry. Recently, a lot of specific problems about
content-based 3D shape retrieval have been investigated by researchers. At a
conceptual level, a typical 3D shape retrieval framework consists of a database
with an index structure created offline and an online query engine. Each 3D model
has to be identified with a shape descriptor, providing a compact overall
description of the shape. To efficiently search a large collection online, an
index data structure and searching algorithm should be available. The online
query engine computes the query descriptor, and models similar to the query
model are retrieved by matching descriptors to the query descriptor from the index
structure of the database. The similarity between two descriptors is quantified by a
dissimilarity measure. Three approaches can be distinguished to provide a query
object: (1) browsing to select a new query object from the obtained results; (2)
handling a direct query by providing a query descriptor; (3) querying by example
by providing an existing 3D model or by creating a 3D shape query from scratch
using a 3D tool or sketching 2D projections of the 3D model. Finally, the retrieved
models can be visualized. 3D model retrieval techniques will be discussed in
Chapter 4.
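To make the descriptor-plus-dissimilarity idea concrete, here is a minimal sketch in the spirit of a distance-distribution descriptor: a histogram of distances between randomly chosen surface points serves as the search key, and an L1 distance between histograms serves as the dissimilarity measure. Sampling from vertices only, the bin count and the function names are our simplifications, not a prescription.

```python
import numpy as np

def distance_histogram(vertices, bins=32, n_pairs=20000, seed=0):
    """Shape descriptor: normalized histogram of distances between random
    vertex pairs, divided by the largest sampled distance (scale invariant)."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(vertices), size=n_pairs)
    j = rng.integers(0, len(vertices), size=n_pairs)
    d = np.linalg.norm(vertices[i] - vertices[j], axis=1)
    hist, _ = np.histogram(d / d.max(), bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()

def dissimilarity(h1, h2):
    """L1 dissimilarity between two descriptors (0 means identical)."""
    return float(np.abs(h1 - h2).sum())

rng = np.random.default_rng(1)
sphere = rng.normal(size=(2000, 3))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)   # points on a unit sphere
box = rng.uniform(-1.0, 1.0, size=(2000, 3))              # points inside a cube

query = distance_histogram(sphere)                        # descriptor of the query model
database = {"sphere_copy": distance_histogram(sphere, seed=2),
            "box": distance_histogram(box)}
ranking = sorted(database, key=lambda name: dissimilarity(query, database[name]))
print(ranking)   # the model most similar to the query comes first
```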
Volume visualization is used to create images from scalar and vector datasets
defined on multiple dimensional grids; i.e., it is the process of projecting a
multidimensional (usually 3D) dataset onto a 2D image plane to gain an
understanding of the structure contained within the data. Most techniques are
applicable to 3D lattice structures. Techniques for higher dimensional systems are
rare. It is a new but rapidly growing field in both computer graphics and data
visualization. These techniques are used in medicine, geosciences, astrophysics,
chemistry, microscopy, mechanical engineering, and so on.
So-called 3D model analysis operations are those operations whose inputs are 3D
models or 3D objects while the outputs are features, classification results, recognition
results and so on.
Audio compression [17] is a form of data compression designed to reduce the size
of audio files. Audio compression algorithms are implemented in computer
software as audio codecs. Generic data compression algorithms perform poorly
with audio data, seldom reducing file sizes much below 87% of the original, and
are not designed for use in real-time. Consequently, specific audio “lossless” and
“lossy” algorithms have been designed. Lossy algorithms provide far greater
compression ratios and are used in mainstream consumer audio devices. As with
image compression, both lossy and lossless compression algorithms are used in
audio compression, lossy being the most common for everyday use. In both lossy
and lossless compression, information redundancy is reduced, using methods such
as coding, pattern recognition and linear prediction to reduce the amount of
information used to describe the data. The trade-off of slightly reduced audio
quality is clearly outweighed for most practical audio applications, where users
cannot perceive any difference and space requirements are substantially reduced.
For example, on one CD, one can fit an hour of high fidelity music, less than two
hours of music compressed losslessly, or seven hours of music compressed in
MP3 format at medium bit rates.
Lossless audio compression allows one to preserve an exact copy of one’s audio
files, in contrast to the irreversible changes from lossy compression techniques
such as Vorbis and MP3. Compression ratios are similar to those for generic
lossless data compression (around 50%–60% of the original size), and substantially
less than those for lossy compression (which typically yields 5%–20% of the
original size).
The primary uses of lossless encoding are: (1) Archives. For archival purposes,
one naturally wishes to maximize quality. (2) Editing. Editing lossily compressed
data leads to digital generation loss, since the decoding and re-encoding introduce
artifacts at each generation. Thus audio engineers use lossless compression. (3)
Audio quality. Being lossless, these formats completely avoid compression
artifacts. Audiophiles thus favor lossless compression. A specific application is to
store lossless copies of audio, and then produce lossily compressed versions for a
digital audio player. As formats and encoders are improved, one can produce
updated lossily compressed files from the lossless master. As file storage space
and communication bandwidth have become less expensive and more available,
lossless audio compression has become more popular.
“Shorten” was an early lossless format, and newer ones include Free Lossless
Audio Codec (FLAC), Apple’s Apple Lossless, MPEG-4 ALS, Monkey’s Audio
and TTA. Some audio formats feature a combination of a lossy format and a
lossless correction, which allows stripping the correction to easily obtain a lossy
file. Such formats include MPEG-4 SLS (Scalable to Lossless), WavPack and
OptimFROG DualStream. Some formats are associated with a technology, such as
Direct Stream Transfer used in Super Audio CD, Meridian Lossless Packing used
in DVD-Audio, Dolby TrueHD, Blu-ray and HD DVD.
It is difficult to maintain all the data in an audio stream and achieve substantial
compression. First, the vast majority of sound recordings are highly complex,
recorded from the real world. As one of the key methods of compression is to find
patterns and repetition, more chaotic data such as audios cannot be compressed
well. In a similar manner, photographs can be compressed less efficiently with
lossless methods than simpler computer-generated images. But interestingly, even
computer-generated sounds can contain very complicated waveforms that present
a challenge to many compression algorithms. This is due to the nature of audio
waveforms, which are generally difficult to simplify without a conversion to
frequency information, as performed by the human ear. The second reason is that
values of audio samples change very quickly, so generic data compression
algorithms do not work well for audios, and strings of consecutive bytes do not
generally appear very often. However, convolution with the filter [−1 1] tends to
slightly whiten the spectrum, thereby allowing traditional lossless compression at
the encoder to do its job, while integration at the decoder restores the original
signal. Codecs such as FLAC, “Shorten” and TTA use linear prediction to estimate
the spectrum of the signal. At the encoder, the inverse of the estimator is used to
whiten the signal by removing spectral peaks, while the estimator is used to
reconstruct the original signal at the decoder.
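A minimal numerical illustration of this decorrelation idea, under the simplifying assumption that the predictor is just the [−1 1] difference filter: first differencing at the encoder shrinks and whitens the samples, and a cumulative sum at the decoder restores them exactly. This is only the simplest special case of the prediction used by codecs such as Shorten, FLAC and TTA.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(4096)
# Toy "audio": a slowly varying waveform plus a little noise, as integer samples.
x = np.round(1000 * np.sin(2 * np.pi * t / 256) + rng.normal(0, 2, t.size)).astype(np.int32)

# Encoder: convolution with [-1 1], i.e. first differences. The residual has a
# flatter ("whiter") spectrum and much smaller magnitudes, so a generic entropy
# coder compresses it far better than the raw samples.
residual = np.diff(x, prepend=x[:1])

# Decoder: integration (cumulative sum) restores the original samples exactly.
assert np.array_equal(np.cumsum(residual), x)

print("raw sample range:     ", x.min(), x.max())
print("residual sample range:", residual.min(), residual.max())
```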
Lossless audio codecs have no quality issues, so the usability can be estimated
by: (1) speed of compression and decompression; (2) degree of compression; (3)
software and hardware support; (4) robustness and error correction.
Lossy audio compression is used in a wide range of applications. In
addition to the direct applications, digitally compressed audio streams are used in
most video DVDs, digital television, streaming media on the Internet, satellite and
cable radio and increasingly in terrestrial radio broadcasts. Lossy compression
typically achieves far greater compression than lossless compression by discarding
less-critical data.
The innovation of lossy audio compression was to use psychoacoustics to
recognize that not all data in an audio stream can be perceived by the human
auditory system. Most lossy compression reduces perceptual redundancy by first
identifying sounds which are considered perceptually irrelevant, i.e., sounds that
are very hard to hear. Typical examples include high frequencies, or sounds that
occur at the same time as louder sounds. Those sounds are coded with decreased
accuracy or not coded at all.
While removing or reducing these “unhearable” sounds may account for a
small percentage of bits saved in lossy compression, the real reduction comes
from a complementary phenomenon: noise shaping. Reducing the number of bits
used to code a signal increases the amount of noise in that signal. In
psychoacoustics-based lossy compression, the real key is to “hide” the noise
generated by the bit savings in areas of the audio stream that cannot be perceived.
This is done by, for instance, using very small numbers of bits to code the high
frequencies of most signals (not because the signal has little high frequency
information, but rather because the human ear can only perceive very loud signals
in this region), so that softer sounds “hidden” there simply are not heard.
If reducing perceptual redundancy does not achieve sufficient compression for
a particular application, it may require further lossy compression. Depending on
the audio source, this still may not produce perceptible differences. Speech, for
example, can be compressed far more than music. Most lossy compression
schemes allow compression parameters to be adjusted to achieve a target rate of
data, usually expressed as a bit rate. Again, the data reduction will be guided by
some model of how important the sound is as perceived by the human ear, with
the goal of efficiency and optimized quality for the target data rate. Hence,
depending on the bandwidth and storage requirements, the use of lossy
compression may result in a perceived reduction of the audio quality that ranges
from none to severe, but generally an obviously audible reduction in quality is
unacceptable to listeners.
Because data is removed during lossy compression and cannot be recovered by
decompression, some people prefer not to use lossy compression for archival
storage. Hence, as noted, even those who use lossy compression may wish to keep
a losslessly compressed archive for other applications. In addition, the
compression technology continues to advance, and achieving state-of-the-art lossy
compression would require one to begin again with the lossless, original audio
data and compress with the new lossy codec. The nature of lossy compression
results in increasing degradation of quality if data are decompressed and then
recompressed with lossy compression.
There are two kinds of coding methods: transform domain methods and time
domain methods.
(1) Transform domain methods. To determine what information in an audio
signal is perceptually irrelevant, most lossy compression algorithms use
transforms such as the modified discrete cosine transform (MDCT) to convert
time domain sampled waveforms into a transform domain. Once transformed,
typically into the frequency domain, component frequencies can be allocated bits
according to how audible they are. The audibility of spectral components is
determined by first calculating a masking threshold, below which it is estimated
that sounds will be beyond the limits of human perception.
The masking threshold is calculated with the absolute threshold of hearing and
the principles of simultaneous masking (the phenomenon wherein a signal is
masked by another signal separated by frequency) and, in some cases, temporal
masking (where a signal is masked by another signal separated by time).
Equal-loudness contours may also be used to weigh the perceptual importance of
different components. Models of the human ear-brain combination incorporating
such effects are often called psychoacoustic models.
(2) Time domain methods. Other types of lossy compressors, such as linear
predictive coding (LPC) used for speech signals, are source-based coders. These
coders use a model of the sound’s generator to whiten the audio signal prior to
quantization. LPC may also be thought of as a basic perceptual coding technique,
where reconstruction of an audio signal using a linear predictor shapes the coder’s
quantization noise into the spectrum of the target signal, partially masking it.
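A small sketch of the linear-prediction idea itself, assuming a plain least-squares fit of the predictor (real speech and audio coders use the Levinson-Durbin recursion, quantize the coefficients and add further modeling): the encoder keeps only the prediction residual, whose variance is far smaller than that of the signal, and a decoder with the same coefficients rebuilds each sample as prediction plus residual.

```python
import numpy as np

def lpc_coefficients(x, order):
    """Least-squares fit of a linear predictor of the previous `order` samples."""
    X = np.column_stack([x[order - k - 1: len(x) - k - 1] for k in range(order)])
    y = x[order:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a                     # a[k] weights x[n-1-k]

rng = np.random.default_rng(1)
t = np.arange(2048)
x = np.sin(2 * np.pi * t / 64) + 0.05 * rng.normal(size=t.size)   # toy "speech" signal

order = 8
a = lpc_coefficients(x, order)

# Encoder: keep only the prediction residual (the "whitened" signal).
pred = np.zeros_like(x)
for n in range(order, len(x)):
    pred[n] = np.dot(a, x[n - order:n][::-1])
residual = x - pred

print("signal variance:  ", round(float(np.var(x)), 5))
print("residual variance:", round(float(np.var(residual[order:])), 5))
```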
(5) Deflation. Deflation is used in PNG, MNG and TIFF. It is a lossless data
compression algorithm that uses a combination of the LZ77 algorithm and
Huffman coding. It was originally defined by Phil Katz for Version 2 of his PKZIP
archiving tool, and was later specified in RFC 1951. Deflation is widely thought to
be free of any subsisting patents and, for a time before the patent on LZW (which
is used in the GIF file format) expired, this led to its use in gzip compressed files
and PNG image files, in addition to the ZIP file format for which Katz originally
designed it.
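As a quick illustration, Python's standard zlib module exposes Deflate (wrapped in the zlib container format), so a lossless round trip looks like this; the sample data is arbitrary.

```python
import zlib

data = b"Deflate combines LZ77 back-references with Huffman coding. " * 200

compressed = zlib.compress(data, 9)      # a zlib stream wrapping a Deflate payload
restored = zlib.decompress(compressed)

assert restored == data                  # lossless: a bit-for-bit round trip
print(len(data), "bytes ->", len(compressed), "bytes")
```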
(6) Fractal compression. Fractal compression is a lossy image compression method using fractals to achieve high compression ratios. The method is best
suited for photographs of natural scenes such as trees, mountains, ferns and clouds.
The fractal compression technique relies on the fact that in certain images, parts of
the image resemble other parts of the same image. Fractal algorithms convert
these parts or, more precisely, geometric shapes into mathematical data called
“fractal codes” which are used to recreate the encoded image. Fractal compression
differs from pixel-based compression schemes such as JPEG, GIF and MPEG
since no pixels are saved. Once an image has been converted into fractal code, its
relationship to a specific resolution has been lost, and it becomes resolution
independent. The image can be recreated to fill any screen size without the
introduction of image artifacts or loss of sharpness that occurs in pixel-based
compression schemes. With fractal compression, encoding is very computationally
expensive because of the search used to find the self-similarities. However,
decoding is quite fast. At common compression ratios, up to about 50:1, fractal
compression provides similar results to DCT-based algorithms such as JPEG. At
high compression ratios, fractal compression may offer superior quality. For
satellite imagery, ratios of over 170:1 have been achieved with acceptable results.
Fractal video compression ratios of 25:1 to 244:1 have been achieved in reasonable
compression time (2.4 to 66 s/frame).
The quality of a compression method is often measured by the peak
signal-to-noise ratio. It measures the amount of noise introduced through a lossy
compression of the image. However, the subjective judgment of the viewer is also
regarded as an important measure, perhaps the most important one. The best
image quality at a given bit-rate is the main goal of image compression. However,
there are other important requirements in image compression as follows:
(1) Scalability. It generally refers to a quality reduction achieved by
manipulation of the bitstream or file. Other names for scalability are progressive
coding or embedded bitstreams. Despite its contrary nature, scalability can also be
found in lossless codecs, usually in the form of coarse-to-fine pixel scans.
Scalability is especially useful for previewing images while downloading them or
for providing variable quality access to image databases. There are several types
of scalability: 1) Quality progressive or layer progressive: the bitstream
successively refines the reconstructed image; 2) Resolution progressive: to first
encode a lower image resolution and then encode the difference to higher
resolutions; 3) Component progressive: to first encode the grey component and
then color components.
(2) Region-of-interest coding. Certain parts of the image are encoded with a
higher quality than others. This can be combined with scalability, i.e., to encode
these parts first, others later.
(3) Meta information. Compressed data can contain information about the
image which can be used to categorize, search or browse images. Such
information can include color and texture statistics, small preview images and
author/copyright information.
(4) Processing power. Compression algorithms require different amounts of
processing power to encode and decode. Some compression algorithms with high
compression ratios require high processing power.
Video compression [18] refers to reducing the quantity of data used to represent
digital video frames, and is a combination of spatial image compression and
temporal motion compensation. Compressed video can effectively reduce the
bandwidth required to transmit video via terrestrial broadcast, cable TV or satellite
TV services. Most video compression is lossy, for it operates on the premise that
much of the data present before compression is not necessary for achieving good
perceptual quality. For example, DVDs use a video coding standard called
MPEG-2 that can compress around two hours of video data by 15 to 30 times,
while still producing a picture quality that is generally considered high-quality for
a standard-definition video. Video compression is a tradeoff between disk space,
video quality, and the cost of hardware required to decompress the video in a
reasonable time. However, if the video is overcompressed in a lossy manner,
visible artifacts may appear. Video compression typically operates on
square-shaped groups of neighboring pixels, often called macroblocks. These pixel
groups or blocks of pixels are compared from one frame to the next and the video
compression codec sends only the differences within those blocks. This works
extremely well if the video has no motion. A still frame of text, for example, can
be repeated with very little transmitted data. In areas of the video with more
motion, more pixels change from one frame to the next. When more pixels change,
the video compression scheme must send more data to keep up with the larger
number of pixels that are changing. If the video content includes an explosion,
flames, a flock of thousands of birds, or any other image with a great deal of
high-frequency detail, the quality will decrease, or the variable bit rate must be
increased to render this added information with the same level of detail.
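As a toy illustration of the macroblock idea (conditional replenishment only; real codecs add motion estimation, transforms and entropy coding), the sketch below flags which 16×16 blocks of a frame differ enough from the previous frame to need re-encoding. The threshold value and the function name are our choices.

```python
import numpy as np

def changed_blocks(prev, curr, block=16, threshold=2.0):
    """Indices (row, col) of macroblocks whose mean absolute difference from the
    previous frame exceeds the threshold; only these would be (re)encoded."""
    h, w = curr.shape
    changed = []
    for by in range(0, h, block):
        for bx in range(0, w, block):
            diff = np.abs(curr[by:by + block, bx:bx + block].astype(np.int16) -
                          prev[by:by + block, bx:bx + block].astype(np.int16))
            if diff.mean() > threshold:
                changed.append((by // block, bx // block))
    return changed

# Two synthetic 64x64 grey frames: only a small bright patch appears in frame 2.
prev = np.zeros((64, 64), dtype=np.uint8)
curr = prev.copy()
curr[20:36, 40:56] = 200

print(changed_blocks(prev, curr))   # only the few blocks covering the patch
```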
The programming providers have control over the amount of video
compression applied to their video programming before it is sent to their
distribution system. DVDs, Blu-ray discs, and HD DVDs have video compression
applied during their mastering process, though Blu-ray and HD DVD have enough
disc capacity so that most compression applied in these formats is light, when
compared to such examples as most of the video streamed over the Internet, or
taken on a cellphone. Software used for storing videos on hard drives or various
optical disc formats will often have a lower image quality, although not in all
cases. High-bitrate video codecs, with little or no compression, exist for video
post-production work, but create very large files and are thus almost never used
for the distribution of finished videos. Once excessive lossy video compression
compromises image quality, it is impossible to restore the image to its original
quality.
A video is basically a 3D array of color pixels. Two dimensions serve as
spatial directions of the moving pictures, and one dimension represents the time
domain. A data frame is a set of all pixels that correspond to a single time moment.
Basically, a frame is the same as a still picture. Video data contains spatial and
temporal redundancy. Similarities can thus be encoded by merely registering
differences within a frame (spatial), and/or between frames (temporal). Spatial
encoding is performed by taking advantage of the fact that the human eye is
unable to distinguish small differences in color as easily as it can perceive changes
in brightness, so that very similar areas of color can be “averaged out” in a similar
way to JPEG images. With temporal compression, only the changes from one
frame to the next are encoded, as often a large number of the pixels will be the
same on a series of frames.
Some forms of data compression are lossless. This means that when the data is
decompressed, the result is a bit-for-bit perfect match with the original. While
lossless compression of video is possible, it is rarely used, as lossy compression
results in far higher compression ratios at an acceptable level of quality.
One of the most powerful techniques for compressing videos is interframe
compression. Interframe compression uses one or more earlier or later frames in a
sequence to compress the current frame. Intraframe compression is applied only to
the current frame, where we can just adopt effective image compression methods.
The most commonly-used method works by comparing each frame in the video
with the previous one. If the frame contains areas where nothing has moved, the
system simply issues a short command that copies that part of the previous frame,
bit-for-bit, into the next one. If sections of the frame move in a simple manner, the
compressor emits a command that tells the decompressor to shift, rotate, lighten,
or darken the copy. This is a longer command, but still much shorter than
intraframe compression. Interframe compression works well for programs that will
simply be played back by the viewer, but can cause problems if the video
sequence needs to be edited. Since interframe compression copies data from one
frame to another, if the original frame is simply cut out, the following frames
cannot be reconstructed properly. Some video formats, such as DV, compress each
frame independently through intraframe compression. Making “cuts” in the
intraframe-compressed video is almost as easy as editing the uncompressed video,
i.e., one finds the beginning and end of each frame, and simply copies bit-for-bit
each frame that one wants to keep, and discards the frames one does not want.
Another difference between intraframe and interframe compression is that with
intraframe systems, each frame uses a similar amount of data. In most interframe
systems, certain frames are not allowed to copy data from other frames, and thus
they require much more data than other frames nearby. It is possible to build a
computer-based video editor that spots problems caused when frames are edited
out (i.e., deleted) while other frames need them. This has allowed newer formats
like HDV to be used for editing. However, this process demands much more
computing power than editing intraframe-compressed videos with the same
picture quality.
Today, nearly all video compression methods in common use, e.g., those in
standards approved by the ITU-T or ISO, apply a discrete cosine transform for
spatial redundancy reduction. Other methods, such as fractal compression,
matching pursuit and the use of a discrete wavelet transform (DWT), have been
the subjects of some research, but are typically not used in practical products. The
interest in fractal compression seems to be waning, due to recent theoretical
analysis showing a comparative lack of effectiveness of such methods.
Digital watermarking [19] is a rapidly developing technique that has already attracted
great interest from the international academic and business communities.
Watermarking is an emerging interdisciplinary technique that draws on
ideas and theories from different scientific and academic fields, such as signal
processing, image processing, information theory, coding theory, cryptography,
detection theory, probability theory, random theory, digital communication, game
theory, computer science, network technique, algorithm design, etc., but also
including public strategy and law. Therefore, whether from the point of theories or
applications, carrying out research on digital watermarking techniques is not only
a matter of great academic significance, but also a matter of great economic
significance.
The suppliers of digital products include the copyright owner, editors and retailers, and they try to distribute the digital
product x via the network. The consumers, which also can be called customers
(clients), hope to receive the digital product x via the network. The pirates are
unauthorized suppliers, such as the pirate A, who redistributes the product x
without the legal copyright owner’s permission, and the pirate B, who
intentionally destroys the original product and redistributes the unauthentic edition
x̂ , so it is hard for consumers to avoid receiving the pirate edition x or x̂
indirectly. There are three common illegal forms of behavior as follows: (1) Illegal
visit, i.e., to copy or pirate digital products without the permission of copyright
owners. (2) Intentional tampering, i.e., the pirates maliciously change digital
products or insert characteristics and then redistribute them, resulting in the loss of
the original copyright information. (3) Copyright destruction, i.e., the pirates
resell digital products without the permission of the copyright owner after
receiving them.
Fig. 1.6. The basic model of digital product distribution over the Internet
Encryption alone has drawbacks: there is no way to make more people obtain their required information via public
systems. At the same time, once the information is decoded illegally, there is no
direct evidence to prove the information has been illegally copied and resent.
Furthermore, for some people, encryption is a challenging task, because people
can hardly prevent an encrypted file from being cut during the decoding process.
Therefore, it is necessary to seek a more valid method to ensure secure
transmission and protect the digital products’ copyright.
where N is the length of the watermark sequence, and O represents the value range.
Actually, watermarks can be not only 1D sequences, but also 2D sequences, even
multi-dimensional sequences, which are usually decided by the carrier object’s
dimension. For instance, audio, images and video correspond to 1D, 2D and 3D
sequences respectively. For convenience, this book usually uses Eq. (1.3) to
represent watermark signals, and for multi-dimensional sequences it is equivalent
to expanding them into 1D sequences in a certain order. The range of watermark
signals can be in binary forms, such as O = {0, 1} and O = {−1, 1}, or in some other
forms, such as white Gaussian noise (with mean 0 and variance 1, i.e., N(0, 1)).
Roughly speaking, a digital watermarking system contains two main parts, the
embedder and the detector. The embedder has at least two inputs, the original
information which will be properly transformed into the watermark signal, and the
carrier product which will be embedded with watermarks. The output of the
embedder is the watermarked product, which will be transmitted or recorded. The
input of the detector may be the watermarked work or another random work that
has never been embedded with watermarks. Most detectors try their best to
estimate whether there are watermarks in the work or not. If the answer is yes, the
output will be the watermark signal previously embedded in the carrier product.
Fig. 1.7 presents the particular sketch map of the basic framework of digital
watermarking systems. It can be defined as a set with nine elements (M, X, W, K,
G, Em, At, D, Ex), and they are defined below separately:
(1) M stands for the set of all possible original information m.
(2) X is the set of digital products (or works) x, i.e., the content.
(3) W is the set of watermark signals w.
(4) K is the set of watermarking keys k.
(5) G is the watermark generation algorithm, which generates the watermark w from the original information m, the digital product x and the key k, i.e.,
$G: M \times X \times K \rightarrow W, \quad w = G(m, x, k).$  (1.4)
It should be pointed out that the original digital product does not necessarily
participate in generating watermarks, so we use dashed lines in Fig. 1.7.
(6) Em is the embedding algorithm, which embeds the watermark w into the
digital product x, i.e.,
$Em: X \times W \rightarrow X, \quad x_w = Em(x, w),$  (1.5)
here x denotes the original product and x_w denotes the watermarked product. To
enhance the security, sometimes secret keys are included in the embedding
algorithms.
(7) At is the attacking algorithm performed on the watermarked product x_w, i.e.,
$At: X \times K \rightarrow X, \quad \hat{x} = At(x_w, c),$  (1.6)
(8) D is the detection algorithm, i.e.,
$D: X \times K \rightarrow \{0,1\}, \quad D(\hat{x}, k)=\begin{cases}1, & \text{if the watermark exists in } \hat{x}\ (H_1);\\ 0, & \text{if the watermark does not exist in } \hat{x}\ (H_0),\end{cases}$  (1.7)
here, H1 and H0 stand for binary hypotheses, which indicate the watermark exists
or not.
(9) Ex is the extraction algorithm, i.e.,
$Ex: X \times K \rightarrow W, \quad \hat{w} = Ex(\hat{x}, k).$  (1.8)
In the first model, the carrier work is simply considered as noise. In the second model, the carrier work is still considered as
noise, but the noise is input into the channel encoder as additional information. In
the third model, the carrier work is not considered as noise but as a second piece of
information; this information and the original information are transmitted in a
multiplexed manner. Here we only show the first kind of model.
Figs. 1.8 and 1.9 present two basic digital watermarking system models.
Fig. 1.8 adopts the non-blind detector and Fig. 1.9 adopts the blind detector. In
these two kinds of models, the watermark embedder is considered as a channel.
The input information is transmitted via the channel, and the carrier work is a part
of it. To depict this conveniently, here the watermark generation algorithm is
called the watermark encoder, and it is combined into the watermark embedder.
No matter whether adopting the non-blind detector or the blind detector, the first
step in the embedding process is mapping the information m to an embedding
pattern w_a with the same format and dimension as the original product x, which is
actually a watermark generation process. For instance, if we embed watermarks
into images in the spatial domain, the watermark encoder, i.e., the watermark
generator, will generate a 2D image pattern with the same size as the original
image. However, when we embed watermarks into audio clips in the time domain,
the watermark encoder will generate a 1D pattern with the same length as the
original audio clip. This kind of mapping usually needs the aid of the
watermarking secret key K. The embedding pattern is calculated with several steps:
(1) Predefining one or several reference patterns (represented by wr, e.g., a
pseudorandom or chaotic sequence), which depend on some secret key K. (2) These
reference patterns are combined together to form a pattern to encode the
information m, which is usually called the information pattern w. In this book, it is
called the watermark w to be embedded, which is the output of the watermark
generation algorithm. (3) Then this information pattern is scaled proportionally or
modified to generate the embedding pattern w_a (in this book this process falls
under the first step of the embedding process). The watermark encoders in Figs.
1.8 and 1.9 do not take the carrier work into account, so we call them
non-adaptive generators. The watermarked work x_w is obtained by embedding the
pattern w_a into the work x, and it will then undergo some kind of processing, whose
effect is equal to adding noise n to the work. Here the processes may be
unintentional attacks such as compression, decompression, analog/digital conversion
and signal enhancement, or malicious attack behaviors such as wiping off watermarks.
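The following is a minimal, self-contained sketch of this kind of pipeline in the additive spread-spectrum spirit: a key-dependent pseudorandom reference pattern is scaled into an embedding pattern, added to the work, and later detected blindly by correlation. It is only an illustration of the framework described above, not any of the specific algorithms discussed later in this book; all function names and the strength parameter `alpha` are ours.

```python
import numpy as np

def reference_pattern(key, shape):
    """Key-dependent pseudorandom +/-1 reference pattern w_r."""
    return np.random.default_rng(key).choice([-1.0, 1.0], size=shape)

def embed(x, key, bit, alpha=2.0):
    """Embed one bit: scale the reference pattern into the embedding pattern w_a
    (the sign encodes the bit) and add it to the carrier work."""
    w_a = (alpha if bit == 1 else -alpha) * reference_pattern(key, x.shape)
    return x + w_a

def detect(y, key):
    """Blind detection: correlate the received work with the reference pattern."""
    corr = float(np.mean(y * reference_pattern(key, y.shape)))
    return (1 if corr > 0 else 0), corr

rng = np.random.default_rng(42)
x = rng.normal(0, 10, size=4096)                 # stand-in for a cover work
key = 1234

x_w = embed(x, key, bit=1)
noisy = x_w + rng.normal(0, 1, size=x_w.shape)   # unintentional processing noise

print(detect(noisy, key))    # recovers bit 1 with a clearly positive correlation
print(detect(x, key))        # unwatermarked work: correlation near zero
```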
Digital watermarks are signals embedded in digital media such as images, audio
clips or video clips. These signals enable people to establish products’ ownership,
identify purchasers and provide some extra information about products. According
to the visibility in the carrier work, watermarks can be divided into two categories,
visible and invisible watermarks. This book mainly discusses invisible watermarks.
Therefore, if there is no special announcement, watermarks in the following
discussions refer to invisible watermarks. According to whether the watermark
generation process depends on the original carrier work or not, it can be divided
into non-adaptive watermarks (independent of the original cover media) and
adaptive watermarks. Non-adaptive watermarks, which are independent of the original
cover media, can be generated not only randomly or by algorithms, but can also be
given in advance, while adaptive watermarks are generated considering the characteristics of the
original cover media. According to the watermarked product’s ability against
attacks, watermarks can be divided into fragile watermarks, semi-fragile
watermarks and robust watermarks. Fragile watermarks are very sensitive to any
transforms or processing. Semi-fragile watermarks are robust against some special
image processing operations while not robust to other operations. Robust
watermarks are robust to various popular image processing operations. According
to whether the original image is required in the watermark detection process or not,
watermarks can be divided into non-blind-detection watermarks (private
watermarks) and blind-detection watermarks (public watermarks). Private
watermark detection requires the original image, while public watermarks do not.
According to different application purposes, watermarks can be divided into
copyright protection watermarks, content authentication watermarks, transaction
tracking watermarks, copy control watermarks, annotation watermarks, covert
communications watermarks, etc.
Accordingly, watermarking algorithms also can be classified into two
categories, visible watermarking algorithms and invisible watermarking
algorithms. This book mainly discusses invisible watermarking algorithms, which
can be mainly classified into three categories, time/spatial-domain-based,
transform-domain-based and compression-domain-based schemes. Time/spatial
domain watermarking uses various methods to directly modify cover media’s
time/spatial samples (e.g., pixels’ LSB). The robustness of this kind of algorithm
is not strong, and the capacity is not very large; otherwise watermarks will become
visible. Transform domain watermarking embeds watermarks after various
transforms of the original cover media, e.g., DCT transform, DFT transform,
wavelet transform, etc. Compression domain watermarking refers to embedding a
watermark in the JPEG domain, MPEG domain, VQ compression domain or
fractal compression domain. This kind of algorithm is robust against the
associated compression attack. Some researchers use public key cryptosystems in
watermarking systems where the detection key and the embedding key are
different. These kinds of watermarking systems are called public key
watermarking systems; systems in which the detection key and the embedding key
are the same are called private key watermarking systems.
According to whether the original cover media can be losslessly recovered or not,
watermarking systems can be classified into two categories, reversible
watermarking systems and irreversible watermarking systems. According to
different types of original cover media, watermarking processing can be classified
into audio watermarking, image watermarking, video watermarking, 3D model or
3D image watermarking, document watermarking, database watermarking, and so on.
The application fields of watermarking techniques are very wide. There are mainly
the following seven categories: broadcast monitoring, owner identification,
ownership verification, transaction tracking, content authentication, copy control
and device control. Each application is concretely introduced below. Problem
characteristics are analyzed and the reasons for applying watermarking techniques
to solve these problems are given.
(1) Broadcast monitoring. The advertiser hopes that his advertisements can be
aired completely in the airtime that is bought from the broadcaster, while the
broadcaster hopes that he can obtain advertisement dollars from the advertiser. To
realize broadcast monitoring, we can hire people to directly survey and
monitor the aired content, but this method is not only costly but also error-prone.
We can also use a dynamic monitoring system to put
recognition information outside the area of the broadcast signal, e.g., vertical
blanking interval (VBI); however there are some compatibility problems to be
solved. The watermarking technique can encode recognition information, and it
is a good method to replace the dynamic monitoring technique. It uses the
characteristic of embedding itself in content and requires no special fragments
of the broadcast signal. Thus it is completely compatible with the installed
analog or digital broadcast device.
(2) Owner identification. There are some limitations in using the text copyright
announcement for product owner recognition. First, during the copying process,
this announcement is very easily removed, sometimes accidentally. For example,
when a professor copies several pages of a book, the copyright announcement on
the title page is probably not copied, simply through negligence. Another problem is
that the announcement may occupy part of the image space, destroying the original image, and it is
easy to be cropped. As a watermark is not only invisible, but also cannot be
separated from the watermarked product, the watermark is therefore more
beneficial than a text announcement in owner identification. If the product user
has a watermark detector, he can recognize the watermarked product’s owner.
Even if the watermarked product is altered by a method that can remove the text
announcement, the owner can still be identified from the embedded watermark.
(6) Copy control. The purpose of copy control is to prevent protected content from
being illegally copied. The primary defense against illegal
copying is encryption. After encrypting the product with a special key, the product
simply cannot be used by those without this key. Then this key can be provided to
legal users in a secure manner such that the key is difficult to copy or redistribute.
However, people usually hope that the media data can be viewed, but cannot be
copied by others. At this time, people can embed watermarks in content and play it
with the content. If each recording device is installed with a watermark detector,
the device can forbid copying when it detects the watermark “copy forbidden”.
(7) Device control. In fact, copy control belongs to a larger application
category called device control. Device control refers to the phenomenon where a
device can react when the watermark is detected. For example, the “media bridge”
system of Digimarc can embed the watermark in printed images such as
magazines, advertisements, parcels and bills. If this image is captured by a digital
camera again, the “media bridge” software and recognition unit in the computer
will open a link to related websites.
$\mathrm{MSE} = \dfrac{1}{N}\sum_{i=1}^{N}\left\| v_i - v_i' \right\|^2;$  (1.9)

$\mathrm{PSNR} = 10\log_{10}\dfrac{\max_{1\le i\le N}\left\| v_i \right\|^2}{\mathrm{MSE}};$  (1.10)

$\mathrm{SNR} = 10\log_{10}\dfrac{\sum_{i=1}^{N}\left\| v_i \right\|^2}{\sum_{i=1}^{N}\left\| v_i' - v_i \right\|^2},$  (1.11)

where N is the number of vertices, and v_i and v_i' denote the i-th vertex of the
original model M and of the watermarked model M', respectively.
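As a small illustration, the following sketch evaluates Eqs. (1.9)–(1.11) for a randomly perturbed vertex set using NumPy; the function names are ours and a one-to-one vertex correspondence between the two models is assumed.

```python
import numpy as np

def mesh_mse(v, v_prime):
    """Eq. (1.9): mean squared vertex-to-vertex error."""
    return float(np.mean(np.sum((v - v_prime) ** 2, axis=1)))

def mesh_psnr(v, v_prime):
    """Eq. (1.10): peak taken as the largest squared vertex norm of the original."""
    peak = float(np.max(np.sum(v ** 2, axis=1)))
    return 10 * np.log10(peak / mesh_mse(v, v_prime))

def mesh_snr(v, v_prime):
    """Eq. (1.11): ratio of original vertex energy to distortion energy."""
    return 10 * np.log10(np.sum(v ** 2) / np.sum((v_prime - v) ** 2))

rng = np.random.default_rng(0)
v = rng.normal(size=(1000, 3))                    # original vertices
v_w = v + rng.normal(scale=1e-3, size=v.shape)    # slightly perturbed (watermarked)

print(mesh_mse(v, v_w), mesh_psnr(v, v_w), mesh_snr(v, v_w))
```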
(3) Data capacity. Data capacity refers to the number of bits embedded in unit
time or a product. For an image, data capacity refers to the number of bits
embedded in this image. For audios, it refers to the number of bits embedded in
one second of transmission. For videos, it refers to either the number of bits
embedded in each frame, or that embedded in one second. A watermark encoded
with N bits is called an N-bit watermark. Such a system can be used to embed 2^N
different messages. Many situations require the detector to execute two-layer
functions. The first one is to determine whether the watermark exists or not. If it
exists, then continue to determine which one of the 2^N messages it is. This kind of
detector has 2^N + 1 possible output values, i.e., 2^N messages together with the case
of “no watermark”.
(4) Blind detection and informed detection. The detector that requires the
original copy as an input is called an informed detector. This kind of detector also
refers to the detector requiring only a small part of the original product
information instead of the whole product. The detector that does not require the
original product is called a blind detector. Whether a watermarking system uses a
blind or an informed detector determines which concrete applications it is suitable
for. Informed (non-blind) detectors can only be used in those situations where the
original product can be obtained.
(5) False positive probability. False positive refers to the case where
watermarks can be detected in the product without watermarks. There are two
definitions for this probability, and their difference lies in that the random variable
is a watermark or a product. In the first definition, the false positive probability
refers to the probability that the detector finds the watermark, given a product and
several randomly selected watermarks. In the second definition, the false positive
probability refers to the probability that the detector finds the watermark, given a
watermark and several randomly selected products. In most applications, people
are more interested in the second definition. But in a few applications, the first
definition is also important. For example, in transaction tracking, false pirate
accusation often appears when detecting a random watermark in the given
product.
(6) Robustness. Robustness refers to the ability for the watermark to be
detected if the watermarked product suffers some common signal processing
operations, such as spatial filtering, lossy compression, printing and copying,
geometric deformation (rotation, translation, scaling and others). In some cases,
robustness is unnecessary and may even be undesirable. For example, another important
research branch of watermarking, fragile watermarking, has the opposite characteristic
of robustness. For example, the watermark for content authentication should be
fragile, namely any signal processing operation will destroy the watermark. In
another kind of extreme application, the watermark must be robust against any
distortion that will not destroy the watermarked product.
The three commonly-used evaluation criteria for robustness are given as follows:
(i) Normalized correlation (NC). This criterion is used to quantitatively
evaluate the similarity between the extracted watermark and the original
watermark, especially for binary watermarks. When the watermarked media is
distorted, the robust watermarking algorithm tries to make the NC value maximal,
while the fragile watermarking algorithm tries to make the NC value minimal. The
definition of NC is as follows:
$\mathrm{NC}(w, \hat{w}) = \dfrac{\sum_{i=1}^{N_w} w(i)\,\hat{w}(i)}{\sqrt{\sum_{i=1}^{N_w} w^2(i)}\,\sqrt{\sum_{i=1}^{N_w} \hat{w}^2(i)}};$  (1.12)

$\rho = \dfrac{1}{N_w}\sum_{i=1}^{N_w} w(i)\,\hat{w}(i);$  (1.13)

$\mathrm{PSNR} = 10\log_{10}\dfrac{w_{\max}^2}{\frac{1}{MN}\sum_{(i,j)}\left[ w(i,j) - \hat{w}(i,j) \right]^2},$  (1.14)

where N_w is the length of the watermark sequence, w(i) and ŵ(i) are the i-th value
of the original watermark sequence and of the extracted watermark sequence,
respectively, w(i, j) and ŵ(i, j) are the original watermark image and the extracted
watermark image, respectively, w_max denotes the maximal watermark pixel value,
and M × N is the size of the watermark image.
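A short NumPy sketch of Eqs. (1.12)–(1.14); the function names and the toy ±1 watermark are ours, and w_max is assumed to be 255 for 8-bit watermark images.

```python
import numpy as np

def nc(w, w_hat):
    """Eq. (1.12): normalized correlation of two watermark sequences."""
    return float(np.sum(w * w_hat) /
                 (np.sqrt(np.sum(w ** 2)) * np.sqrt(np.sum(w_hat ** 2))))

def rho(w, w_hat):
    """Eq. (1.13): mean sample-wise correlation, convenient for +/-1 watermarks."""
    return float(np.mean(w * w_hat))

def watermark_psnr(w_img, w_hat_img, w_max=255.0):
    """Eq. (1.14): PSNR between original and extracted watermark images."""
    mse = np.mean((w_img.astype(float) - w_hat_img.astype(float)) ** 2)
    return 10 * np.log10(w_max ** 2 / mse)

w = np.sign(np.random.default_rng(0).normal(size=1000))   # original +/-1 watermark
w_hat = w.copy()
w_hat[:50] *= -1                                           # 5% of the samples flipped
print(nc(w, w_hat), rho(w, w_hat))                         # both close to 0.9

img = np.zeros((8, 8)); img_hat = img.copy(); img_hat[0, 0] = 4.0
print(round(watermark_psnr(img, img_hat), 1))              # PSNR of a tiny example
```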
(7) Security. Security indicates the ability of watermarks to resist malicious
attacks. The malicious attack refers to any behavior that destroys the function of
watermarks. Attacks can be summarized into three categories: unauthorized
removing, unauthorized embedding and unauthorized detection. Unauthorized
removing and unauthorized embedding may change the watermarked products,
and thus they are regarded as active attacks, while unauthorized detection does not
change the watermarked products, and thus it is regarded as a passive attack.
Unauthorized removing refers to making the watermark in products unable to be
detected. Unauthorized embedding also means forgery, namely embedding illegal
watermark information in products. Unauthorized detection can be divided into
three levels. The most serious level is that the opponent detects and deciphers the
embedded message. The second level is that the opponent detects watermarks and
recognizes each mark, but he cannot decipher the meaning of these marks. The
least serious level is that the opponent can only determine the existence of
watermarks, but cannot decipher the message or recognize the embedded
positions.
(8) Ciphers and watermarking keys. In modern cryptography systems, security
depends only on keys instead of algorithms. People hope watermarking systems
also have the same standard. In ideal cases, if the key is unknown, it is impossible
to detect whether the product contains a watermark or not, even if the
watermarking algorithm is known. Even if a part of the keys is known by the
opponent, it is impossible to successfully remove the watermark on the
precondition that the quality of the watermarked product is well maintained. Since
the security of keys used in embedding and extraction is different from that
provided in cryptography, two keys are usually used in watermarking systems.
One is used in encoding and the other is used in embedding. To distinguish these
two keys, they are called the generation key and the embedding key, respectively.
(9) Content alteration and multiple watermarking. When a watermark is
embedded in a product, the watermark transmitter may be concerned about the watermark
alteration problem. In some applications, the watermark should not be modified
easily, but in some other situations, watermark alteration is necessary. In copy
control, broadcast content will be marked with “copy once”, and after being
recorded, it will be labeled with “copy forbidden”. Embedding multiple
watermarks in a product is suitable for transaction tracking. Before being obtained
by the final user, content is often transmitted through several middlemen. Each copy
first includes the watermark of the copyright owner. After that, the product may be
distributed to some music websites, and each product copy may be embedded
with a unique watermark to label each distributor’s information. Finally, each
website may embed the unique watermark to label the associated purchaser.
(10) Cost. The economics of deploying watermark embedders and detectors is
very complex, and it depends on the business model involved.
From the technical viewpoint, two main problems are the speed of watermark
embedding and detection and the required number of embedders and detectors.
Other problems may be whether the embedder and detector are implemented by
hardware, software, or by a plug-in unit.
Information retrieval (IR) [21] is the science of searching for documents, for
information within documents and for metadata about documents, as well as that
of searching relational databases and the World Wide Web. There is overlap in the
usage of the terms data retrieval, document retrieval, information retrieval and text
retrieval, but each also has its own body of literature, theory, praxis and
technologies. IR is interdisciplinary, based on computer science, mathematics,
library science, information science, information architecture, cognitive
psychology, linguistics, statistics and physics. Automated information retrieval
systems are used to reduce what has been called “information overload”. Many
universities and public libraries use IR systems to provide access to books,
journals and other documents. Web search engines are the most visible IR
applications.
The idea of using computers to search for relevant pieces of information was
popularized in an article by Vannevar Bush in 1945 [21]. The first
implementations of information retrieval systems were introduced in the 1950s
and 1960s. By 1990 several different techniques had been shown to perform well
on small text corpora (several thousand documents). In 1992 the US Department
of Defense, along with the National Institute of Standards and Technology (NIST),
co-sponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text
program. The aim of this was to support research within the information retrieval community by
supplying the infrastructure that was needed for evaluation of text retrieval
methodologies on a very large text collection. This catalyzed the research into
methods that scale to huge corpora. The introduction of web search engines has
boosted the need for very large scale retrieval systems even further. The use of
digital methods for storing and retrieving information has led to the phenomenon
of digital obsolescence, where a digital resource ceases to be readable because the
physical media, the reader required to read the media, the hardware, or the
software that runs on it, is no longer available. The information is initially easier
to retrieve than if it were on paper, but is then effectively lost.
An information retrieval process begins when a user enters a query into the
system. Queries are formal statements of information needs, for example search
strings in web search engines. In information retrieval a query does not uniquely
identify a single object in the collection. Instead, several objects may match the
query, perhaps with different degrees of relevancy. An object is an entity which
keeps or stores information in a database. User queries are matched to objects
stored in the database. Depending on the application of the data, objects may be,
for example, text documents, images or videos. Often the documents themselves
are not kept or stored directly in the IR system, but are instead represented in the
system by document surrogates. Most IR systems compute a numeric score on
how well each object in the database matches the query, and rank the objects
according to this value. The top ranking objects are then shown to the user. The
process may then be iterated if the user wishes to refine the query.
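To make the scoring-and-ranking step concrete, here is a minimal sketch of one classical way to do it, TF-IDF weighting with cosine similarity. The toy documents, the tokenizer and the helper names (`weight_vector`, `cosine`) are ours; real IR systems add inverted indexes, stemming and more refined weighting schemes.

```python
import math
from collections import Counter

docs = {
    "d1": "information retrieval ranks documents by how well they match a query",
    "d2": "content based image retrieval uses color texture and shape features",
    "d3": "audio compression removes statistical redundancy from the signal",
}
query = "content based image retrieval"

def tokenize(text):
    return text.lower().split()

# Document frequencies over the collection, used for IDF weighting.
df = Counter(term for text in docs.values() for term in set(tokenize(text)))
n_docs = len(docs)

def weight_vector(text):
    """TF-IDF vector; terms unseen in the collection simply get no weight."""
    tf = Counter(tokenize(text))
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if t in df}

def cosine(a, b):
    dot = sum(wt * b.get(t, 0.0) for t, wt in a.items())
    na = math.sqrt(sum(wt * wt for wt in a.values()))
    nb = math.sqrt(sum(wt * wt for wt in b.values()))
    return dot / (na * nb) if na and nb else 0.0

q = weight_vector(query)
scores = {doc_id: cosine(q, weight_vector(text)) for doc_id, text in docs.items()}
for doc_id, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(doc_id, round(score, 3))     # d2 is ranked first for this query
```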
According to the objects of IR, the techniques used in IR can be classified into
three categories: literature retrieval, data retrieval and document retrieval. The
main difference between these types of information retrieval systems lies in the
following: Data retrieval and document retrieval are required to retrieve the
information itself in the literature, while literature retrieval is only required to
retrieve the literature including the input information. According to the search
means, information retrieval systems can be classified into three categories:
manual retrieval systems, mechanical retrieval systems and computer-based
retrieval systems. At present, the rapidly developing computer-based retrieval is
“network information retrieval”, which stands for the behavior of web users to
search required information over the Internet with specific network-based
searching tools or simple browsing manners. Information retrieval methods can be
also classified into direct retrieval and indirect retrieval methods. Currently, the
research hotspots in the domain of IR lie in the following three areas.
(1) Knowledge retrieval or intelligent retrieval. Knowledge retrieval (KR) [22]
is a field of study which seeks to return information in a structured form,
consistent with human cognitive processes as opposed to simple lists of data items.
It draws on a range of fields including epistemology (theory of knowledge),
cognitive psychology, cognitive neuroscience, logic and inference, machine
64 1 Introduction
The growth in the Internet and multimedia technologies brings a huge sea of
multimedia information. Most content-based image retrieval (CBIR) systems
make use of lower-level features like texture, colors and shapes, although some
systems take advantage of very common higher-level features like faces. Not
every CBIR system is generic. Some systems are designed for a specific domain,
e.g. shape-matching can be used for finding parts inside a CAD-CAM database.
(3) Other query methods. Other query methods include browsing for example
images, navigating customized/hierarchical categories, querying by image regions
(rather than the entire image), querying by multiple example images, querying by
visual sketches, querying by direct specification of image features, and
multimodal queries (e.g. combining touch, voice, etc.).
CBIR systems can also make use of relevance feedback, where the user
progressively refines the search results by marking images in the results as
“relevant”, “not relevant”, or “neutral” to the search query, then repeating the
search with the new information. The following are some commonly-used features
for CBIR.
(1) Color. Retrieving images based on color similarity is achieved by
computing a color histogram for each image that identifies the proportion of pixels
within an image holding specific values. Current research is attempting to segment
color proportion by region and by spatial relationships among several color
regions. Examining images based on the colors they contain is one of the most
widely-used techniques because it does not depend on image sizes or orientations.
Color searches will usually involve comparing color histograms, though this is not
the only technique in practice; a minimal histogram-comparison sketch is given after
this list of features.
(2) Texture. Texture measures look for visual patterns in images and how they
are spatially defined. Textures are represented by texels which are then placed into
a number of sets, depending on how many textures are detected in the image.
These sets not only define the texture, but also where the texture is located in the
image. Texture is a difficult concept to represent. The identification of specific
textures in an image is achieved primarily by modeling texture as a 2D gray level
variation. The relative brightness of pairs of pixels is computed such that the
degree of contrast, regularity, coarseness and directionality may be estimated.
However, the problem is in identifying patterns of co-pixel variation and
associating them with particular classes of textures such as “silky” or “rough”.
(3) Shape. Shape does not refer to the shape of an image but to the shape of a
particular region that is being sought out. Shapes will often be determined by first
applying segmentation or edge detection to an image. Other methods use shape
filters to identify given shapes of an image. In some cases accurate shape detection
will require human intervention because methods like segmentation are very
difficult to completely automate.
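The color-histogram comparison mentioned under feature (1) can be sketched in a few lines of NumPy; the joint-RGB binning, the intersection measure and the synthetic test images are illustrative choices, not a prescription.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Joint RGB histogram: the fraction of pixels in each of bins**3 color
    cells, so the descriptor does not depend on image size."""
    pixels = image.reshape(-1, 3)
    edges = np.linspace(0, 256, bins + 1)
    hist, _ = np.histogramdd(pixels, bins=(edges, edges, edges))
    hist = hist.ravel()
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical color distributions."""
    return float(np.minimum(h1, h2).sum())

rng = np.random.default_rng(0)
reddish = rng.integers(0, 156, size=(64, 64, 3))
reddish[..., 0] += 100                       # push the red channel up
bluish = rng.integers(0, 156, size=(64, 64, 3))
bluish[..., 2] += 100                        # push the blue channel up

q = color_histogram(reddish)
print(histogram_intersection(q, color_histogram(reddish)))  # 1.0: same image
print(histogram_intersection(q, color_histogram(bluish)))   # noticeably smaller
```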
CBIR belongs to the image analysis research area. Image analysis is a typical
domain for which a high degree of abstraction from low-level methods is required,
and where the semantic gap immediately affects the user. If image content is to be
identified to understand the meaning of an image, the only available independent
information is the low-level pixel data. Textual annotations always depend on the
knowledge, capability of expression and specific language of the annotator and
therefore are unreliable. To recognize the displayed scenes from the raw data of an
image the algorithms for selection and manipulation of pixels must be combined
and parameterized in an adequate manner and finally linked with the natural
description. Even the simple linguistic representation of shape or color, such as
round or yellow, requires entirely different mathematical formalization methods,
which are neither intuitive nor unique and sound. The above description involves
the concept of semantic gap. The semantic gap characterizes the difference
between two descriptions of an object by different linguistic representations, for
instance, languages or symbols. In computer science, the concept is relevant
whenever ordinary human activities, observations and tasks are transferred into a
computational representation. More precisely, the gap means the difference
between ambiguous formulation of contextual knowledge in a powerful language
(e.g. natural language) and its sound, reproducible and computational representation
in a formal language (e.g. programming language). The semantics of an object
depends on the context it is regarded within. For practical applications, this means
any formal representation of real world tasks requires the translation of the
contextual expert knowledge of an application (high-level) into the elementary and
reproducible operations of a computing machine (low-level). Since natural
language allows the expression of tasks which are impossible to compute in a
formal language, there is no way to automate this translation in a general way.
Moreover, the examination of languages within the Chomsky hierarchy indicates
that there is no formal and consequently automated way of translating from one
language into another above a certain level of expressional power.
The following are some famous CBIR systems.
(1) QBIC. The earliest CBIR system is the QBIC (query by image content)
system, developed at IBM's Almaden Research Center. QBIC lets users query
large image databases based on visual image content, i.e., properties such as
color percentages, color layout, and textures occurring in the images. Such queries
use the visual properties of images, so users can match colors, textures and their
positions without describing them in words. Content-based queries are often
combined with text and keyword predicates to get powerful retrieval methods for
image and multimedia databases.
(2) Photobook. Photobook is a content-based image browsing and retrieval
system developed by the MIT Media Lab. Instead of relying on textual annotations,
it searches image databases directly by image content, using compact descriptions
of appearance (e.g., faces), 2D shape and texture. The user queries the database by
selecting example images and asking for the images that are most similar
according to the chosen description, and different description models can be
combined to obtain more selective searches.
(3) VisualSEEk. VisualSEEk is a fully automated content-based image query
system developed by Columbia University. VisualSEEk is distinct from other
content-based image query systems in that the user may query for images using
both the visual properties of regions and their spatial layout. Furthermore, the
image analysis for region extraction is fully automated. VisualSEEk uses a novel
system for region extraction and representation based upon color sets. Through a
process of color set back-projection, the system automatically extracts salient
color regions from images.
(4) Other CBIR systems. Some other famous CBIR systems include the MARS system.
A typical content-based video retrieval (CBVR) [25] is shown in Fig. 1.11. First,
we should analyze the video structure and segment the video into shots, and then
we select keyframes in each shot, which is the basis and key problem of a highly
efficient CBVR system. Second, we extract the motion features from each shot
and the visual features from the keyframes in this shot, and store these two kinds
of features as a retrieval mechanism in the video database. Finally, we return the
retrieval results to users based on their queries according to the similarities
between features. If the user is not satisfied with the search results, the system can
optimize the retrieval results according to the users’ feedback.
Shot transitions can be divided into two types. (1) Abrupt transitions. This is a sudden transition from one shot to another; i.e.,
one frame belongs to the first shot, and the next frame belongs to the second shot.
They are also known as hard cuts or simple cuts. (2) Gradual transitions. In this
kind of transition the two shots are combined using chromatic, spatial or
spatial-chromatic effects which gradually replace one shot by another. These are
also often known as soft transitions and can be of various types, e.g., wipes,
dissolves, fades, and so on.
The entire process of constructing the video structure can be divided into the
following three steps: extracting the video shots from the camera, selecting the
key frames from the shots and constructing the scenes or groups from the video
stream.
(1) Extracting the video shots from the camera (i.e., shot detection). A shot is
the basic unit of video data. The first task in video processing or content-based
video retrieval is to automatically segment the video into shots and use them as
fundamental indexing units. This process is called shot boundary detection. In shot
detection, the abrupt transition detection is the keystone, and the related
algorithms and ideas can be used in other steps; therefore it is a focus of attention.
The main schemes for abrupt transition detection are as follows: 1)
color-feature-based methods, such as template matching (sum of absolute
differences) and histogram-difference-based schemes; 2) edge-based methods; 3)
optical-flow detection-based methods; 4) compressed-domain-based methods; 5)
the double-threshold-based method; 6) the sliding window detection method; 7)
the dual-window method.
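To make the histogram-difference idea in 1) concrete, the following sketch flags an abrupt transition whenever the color-histogram distance between consecutive frames exceeds a threshold. It is a minimal illustration, not one of the cited algorithms; the bin count, the L1 distance and the threshold value are assumptions chosen for the toy input.

```python
import numpy as np

def histogram(frame, bins=16):
    """Per-channel color histogram of one frame, normalized to sum to 1."""
    hist = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
            for c in range(frame.shape[-1])]
    hist = np.concatenate(hist).astype(float)
    return hist / hist.sum()

def detect_cuts(frames, threshold=0.4):
    """Return the indices i where an abrupt transition occurs between frame i-1 and frame i."""
    cuts = []
    prev = histogram(frames[0])
    for i in range(1, len(frames)):
        cur = histogram(frames[i])
        # L1 distance between normalized histograms (0 = identical, 2 = disjoint)
        if np.abs(cur - prev).sum() > threshold:
            cuts.append(i)
        prev = cur
    return cuts

# Toy usage: 20 dark frames followed by 20 bright frames -> one cut at index 20.
frames = np.concatenate([np.full((20, 48, 64, 3), 30, dtype=np.uint8),
                         np.full((20, 48, 64, 3), 220, dtype=np.uint8)])
print(detect_cuts(frames))  # [20]
```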
(2) Selecting the keyframes from the shots. A keyframe is a frame that
represents the content of a shot or scene. This content must be as representative as
possible. In the large amount of video data, we first reduce each video to a set of
representative key frames (though we enrich our representations with shot-level
motion-based descriptors as well). In practice, often the first frame or center frame
of a shot is chosen, which causes information loss in the case of long shots
containing considerable zooming and panning. This is why unsupervised
approaches have been suggested that provide multiple key frames per shot. Since
for online videos the structure varies strongly, we use a two-step approach that
delivers multiple key frames per shot in an efficient way. It follows a "divide and
conquer" strategy: shot boundary detection, for which reliable standard techniques
exist, is first used to divide keyframe extraction into shot-level sub-problems that
are then solved separately. Keyframe selection methods
can be divided into the following categories: 1) Methods based on the shots. A
video clip is first segmented into several shots, and then the first (or last) frame in
each shot is viewed as the keyframe. 2) Content-based analysis. This method is
based on the change in color, texture and other visual information of each frame to
extract the keyframe. When the information changes significantly, the current
frame is viewed as a keyframe. 3) Motion-analysis-based methods. 4) Clustering-
based methods.
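The clustering-based idea in 4) can be sketched as follows: frames within a shot are described by gray-level histograms, grouped by a tiny k-means, and the frame closest to each cluster center is kept as a keyframe. This is only an illustrative sketch; the feature, the number of clusters and the deterministic initialization are assumptions, not a method prescribed in the text.

```python
import numpy as np

def frame_feature(frame, bins=16):
    """Gray-level histogram of a frame, normalized to sum to 1."""
    hist = np.histogram(frame.mean(axis=-1), bins=bins, range=(0, 256))[0].astype(float)
    return hist / hist.sum()

def select_keyframes(shot_frames, k=2, iters=10):
    """Tiny k-means over frame histograms; returns one representative frame index per cluster."""
    feats = np.stack([frame_feature(f) for f in shot_frames])
    # deterministic initialization: evenly spaced frames within the shot
    centers = feats[np.linspace(0, len(feats) - 1, k).astype(int)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = feats[labels == j].mean(axis=0)
    # keyframe of each cluster = the frame closest to its center
    keyframes = []
    for j in range(k):
        members = np.where(labels == j)[0]
        if len(members):
            closest = members[np.linalg.norm(feats[members] - centers[j], axis=1).argmin()]
            keyframes.append(int(closest))
    return sorted(keyframes)

# Toy shot: 10 dark frames followed by 10 bright frames -> one keyframe from each half.
shot = np.concatenate([np.full((10, 32, 32, 3), 40, dtype=np.uint8),
                       np.full((10, 32, 32, 3), 200, dtype=np.uint8)])
print(select_keyframes(shot))  # [0, 10]
```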
(3) Constructing the scenes or groups from the video stream. First we calculate
the similarity between the shots (in fact, the key frames), and then select the
appropriate clustering algorithm for analysis. According to the chronological order
and the similarity between key frames, we can divide the video stream into scenes,
or we can perform the grouping operation only according to the similarity between
key frames.
After the keyframe extraction process and the feature extraction operation on
keyframes, we need to index video clips based on their characteristics. Through
the index, you can use the keyframe-based features or the motion features of the
shots, or a combination of both for the video search and browsing. Content-based
retrieval is a kind of approximate match, a cycle of stepwise refinement processes,
including initial query description, similarity matching, the return of results, the
adjustment of features, human-computer interaction, retrieval feedback, and so on,
until the results satisfy the customers. The richness and complexity of video
content, as well as the subjective evaluation of video content, make it difficult to
evaluate the retrieval performance with a uniform standard. This is also a research
direction of CBVR. Currently, there are two commonly used criteria, recall and
precision, which are defined as:
recall = correct / (correct + missed),  (1.15)
precision = correct / (correct + false positive),  (1.16)
where correct is the number of relevant items returned, missed is the number of relevant items that are not returned, and false positive is the number of returned items that are not relevant.
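In terms of item counts, the two criteria can be computed directly; the snippet below is a minimal illustration, and the retrieval results in it are made-up examples.

```python
def recall_precision(retrieved, relevant):
    """recall = correct / (correct + missed); precision = correct / (correct + false positives)."""
    retrieved, relevant = set(retrieved), set(relevant)
    correct = len(retrieved & relevant)
    missed = len(relevant - retrieved)
    false_positive = len(retrieved - relevant)
    recall = correct / (correct + missed) if relevant else 0.0
    precision = correct / (correct + false_positive) if retrieved else 0.0
    return recall, precision

# Hypothetical example: 4 of the 6 returned clips are relevant; 2 relevant clips are missed.
print(recall_precision(retrieved=[1, 2, 3, 4, 8, 9], relevant=[1, 2, 3, 4, 5, 6]))
# (0.666..., 0.666...)
```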
Much previous research on audio analysis and processing was related to speech
signal processing, e.g., speech recognition. It is easy for machines to automatically
identify isolated words, as used in dictation and telephone applications, while it is
relatively hard for machines to perform continuous speech recognition. But
recently some breakthroughs have been made in this area, and at the same time
research into speaker identification has also been carried out. All these advances
will be of great help to audio information retrieval systems.
Audio is an important medium in multimedia. The frequency range of audio that we
can hear is from 60 Hz to 20 kHz, and the speech frequency range is from 300 Hz
to 4 kHz, while music and other natural sounds are within the full range of audio
frequency. The audio that we can hear is first recorded or regenerated by analog
recording equipment, and then digitized into digital audio. During digitalization,
the sampling rate must be larger than twice the signal bandwidth in order to
correctly restore the signal. Each sample can be represented with 8 or 16 bits.
Audio can be classified into three categories: (1) Waveform sound. We
perform the digitization operation on the analog sound to obtain the digital audio
signals. It can represent the voice, music, natural and synthetic sounds. (2) Speech.
It possesses morphemes such as words and grammars, and it is a kind of highly
abstract media for concept communication. Speech can be converted to text
through recognition, and text is the script form of speech. (3) Music. It possesses
elements such as rhythm, melody or harmony, and it is a kind of sound composed
of the human voice and/or sounds from musical instruments.
Overall, the audio content can be divided into three levels: the lowest level of
physical samples, the middle level of acoustic characteristics and the highest
level of semantics. From lower levels to higher levels, the content becomes more
and more abstract. In the level of physical samples, the audio content is
represented in the form of streaming media, and users can retrieve or call the
audio data according to the time scale, e.g., the common audio playback API. The
middle level is the level of acoustic characteristics. Acoustic characteristics are
extracted from audio data automatically. Some auditory features representing
users’ perception of audio can be used directly for retrieval, and some features can
be used for speech recognition or detection, supporting the representation for
higher level content. In addition, the space-time structure of audio can also be
used. The semantic level is the highest level, i.e., the concept level of representing
audio content and objects. Specifically, at this level, the audio content is the result
of recognition, detection and identification, or the description of music rhythms, as
well as the description of audio objects and concepts. Content-based audio
retrieval is mainly concerned with the latter two levels. In these two levels, the user
can submit a concept query or perform the query by auditory perception.
The perceptual auditory features include volume, tone and intensity. With respect
to speech recognition, IBM’s Via Voice has become more and more mature, and
the VMR system of the University of Cambridge and Carnegie Mellon
University’s Informedia are both very good audio processing systems. With
respect to content-based audio information retrieval, Muscle Fish of the United
States has introduced a prototype of a more comprehensive system for audio
retrieval and classification with a high accuracy.
With respect to the query interface, users can adopt the following query types:
(1) Query by example. Users choose audio examples to express their queries,
searching all sounds similar to the characteristics of query audio, for example, to
search for all sounds similar to the roar of aircraft. (2) Simile. A number of
acoustic/perceptual features are selected to describe the query, such as loudness,
tone and volume. This scheme is similar to the visual query in CBIR or CBVR. (3)
Onomatopoeia. We can describe our queries by uttering the sound similar to the
sounds we would like to search for. For example, we can search for the bees’ hum
or electrical noise by uttering buzzes. (4) Subjective features. That means the
sound is described by individuals. This method requires training the system to
understand the meaning of these terms. For example, the user may search “happy”
sounds in the database. (5) Browsing. This is an important means of information
discovery, especially for such time-based audio media. Besides the browsing based
on pre-classification, it is more important to browse based on the audio structure.
According to the classification of audio media, we know that speech, music
and other sounds possess significantly different characteristics, so current CBAR
approaches can be divided into three categories: retrieval of “speech” audio,
retrieval of “non-speech non-music” audio and retrieval of “music” audio. In other
words, the first one is mainly based on automatic speech recognition technologies,
and the latter two are based on more general audio analysis to suit a wider range of
audio media, such as music and sound effects, also including digital speech signals
of course. Thus, CBAR can be divided into the following three areas, sound
retrieval, speech retrieval and music retrieval.
select the keyword from adjectives on retrieval. The adjective values, which are
determined for the retrieval keyword, are set to a retrieval point for each sound.
This means more retrieval points are given for a sound that is more generally
associated with the input adjective.
Speech search [27] is concerned with the retrieval of spoken content from
collections of speech or multimedia data. The key challenges raised by speech
search are indexing via an appropriate process of speech recognition and
efficiently accessing specific content elements within spoken data. The specific
limitations of speech recognition in terms of vocabulary and word accuracy mean
that effective speech search often does not reduce to an application of information
retrieval to speech recognition transcripts. Although text information retrieval
techniques are clearly helpful, speech retrieval involves confronting issues less apt
to arise in the text domain, such as high levels of noise in the indexed data and
lack of a clearly defined unit of retrieval. A speech retrieval system accepts vague
queries and it performs best-match searches to find speech recordings that are
likely to be relevant to the queries. Efficient best-match searches require that the
speech recordings be indexed in a previous step. People focus on effective
automatic indexing methods that are based on automatic speech recognition.
Automatic indexing of speech recordings is a difficult task for several reasons.
One main reason is the limited size of vocabularies of speech recognition systems,
which are at least one order of magnitude smaller than the indexing vocabularies
of text retrieval systems. Another main problem is the deterioration of the retrieval
effectiveness due to speech recognition errors that invariably occur when speech
recordings are converted into sequences of language units (e.g. words or
phonemes).
Humming a tune is by far the most straightforward and natural way for normal
users to make a melody query. Thus music query-by-humming has attracted much
research interest recently. It is a challenging problem since the humming query
inevitably contains tremendous variation and inaccuracy. And when the hummed
tune corresponds to some arbitrary part in the middle of a melody and is rendered
at an unknown speed, the problem becomes even tougher. This is because
exhaustive search of location and humming speeds is computationally prohibitive
for a feasible music retrieval system. The efficiency of retrieval becomes a key
issue when the database is very large. Based on the types of features used for
melody representation and matching methods, the past works on query-by-
humming can be broadly classified into three categories [28]: the string-matching
approach, the beat alignment approach and time-series-matching approach. In the
string matching approach, a hummed query is translated into a series of musical
notes. The note differences between adjacent notes are then represented by letters
or symbols according to the directions and/or the quantity of the differences. The
hummed query is thus represented by a string. In the database, the notes of the
MIDI music are also translated into strings in the same manner. The retrieval is
done by approximate string matching. String edit distance is used as the similarity
measure. There are many limitations to this approach. It requires precise
identification of each note’s onset, offset and note values. Any inaccuracies of note
articulation in the humming can lead to a large number of wrong notes detected
and can result in a poor retrieval accuracy. In the beat alignment approach for
query-by-humming, the user expresses the hummed query according to a
metronome, by which the hummed tune can be aligned with the notes of the MIDI
music clips in the database. Since the timing/speed of humming is controlled, the
errors in humming can only come from the pitch/note values and alignment is not
affected. By computing the statistical information of the notes in a fixed number
of beats, a histogram-based feature vector is constructed and used to match the
feature vectors for the MIDI music clip database. However, humming with a
metronome is a rather restrictive condition for normal use. Many people usually
are not very discriminating when it comes to their awareness of the beat of a
melody. Different meters (e.g. duple, triple, quadruple meters) of the music can
also contribute to the difficulties. In the pitch time-series-matching approaches, a
melody is represented by a time series of pitch values. Time-warping distance is
used as the similarity metric between the time series. However, current methods
have an efficiency problem, especially for matching anywhere in the middle of
melodies.
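The string-matching approach described above can be illustrated with a small sketch: a hummed pitch sequence is reduced to a string of pitch directions (up, down, same) and compared with database melodies via edit distance. The note sequences and tune names below are made-up examples, and real systems use richer alphabets and approximate substring matching.

```python
def contour(notes):
    """Map a sequence of MIDI pitch values to a U/D/S string of pitch directions."""
    return ''.join('U' if b > a else 'D' if b < a else 'S' for a, b in zip(notes, notes[1:]))

def edit_distance(s, t):
    """Classic dynamic-programming string edit (Levenshtein) distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical database of note sequences and a hummed query (illustrative values only).
database = {
    'tune_a': [60, 62, 64, 62, 60, 67, 67],
    'tune_b': [60, 60, 67, 67, 69, 69, 67],
}
query = [62, 64, 65, 64, 62, 69, 69]   # same contour as tune_a, transposed and slightly off
best = min(database, key=lambda name: edit_distance(contour(query), contour(database[name])))
print(best)  # tune_a
```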
1.7 Overview of Multimedia Perceptual Hashing Techniques
This section briefly introduces multimedia perceptual hashing techniques that can
be used in the fields of copyright protection, content authentication and
content-based retrieval. In this section, the basic concept of hashing functions is
first introduced. Secondly, definitions and properties of perceptual hashing
functions are given. Thirdly, the basic framework and state-of-the-art of perceptual
hashing techniques are briefly discussed. Finally, some typical applications of
perceptual hashing functions are illustrated.
From the above description, we can see that hashing functions can be used to
extract the digital digest of the original data irreversibly, and they are one-way and
fragile to guarantee the uniqueness and unmodifiability of the original data.
Various hashing functions have been successfully used in information retrieval and
management, data authentication, and so on. However, with the increasing
popularization of multimedia services, traditional hashing functions can no longer
satisfy the demands of multimedia information management and protection. The
reasons lie in two aspects: (1) The perceptual redundancy of multimedia requires a
specific abstraction technique. Traditional hash functions only possess the function
of data compression, and they cannot eliminate the redundancy in multimedia
perceptual content. Therefore, we need to perform the perceptual abstraction on
multimedia information according to human perceptual characteristics, obtaining
the concise summary while at the same time retaining the content. (2) The
many-to-one mapping properties between digital presentation and multimedia
content require that the content digest possess perceptual robustness. We should
research multimedia authentication methods that are fragile to tampering
operations but robust to content-preserving operations. Therefore, according to
the distinct properties of multimedia that are different from those of general
computer data, we should study the one-way multimedia digest methods and
techniques that possess perceptual robustness and the capability of data
compression. Thus, perceptual hashing [29] has gradually become a hotspot in the
field of multimedia signal processing and multimedia security.
The distinct characteristics of multimedia information that are different from
general computer data are determined by the human psychological process of
cognizing multimedia. According to the theory of cognitive psychology, this
process includes the following stages: sensory input, perceptual content, extraction
and cognitive recognition. The theory of perception threshold points out that only
when the stimuli brought about by objective things exceed the perceptual
threshold can we perceive the objective things and, before that, objective things
are just a kind of "data". Elements whose differences are below the perception
threshold are mapped to the same element in another set. The perceptual
content of multimedia information is the basic feeling of humans for objective
things, and it is also the basis for carrying out high-level mental activities and
responding to stimuli. In addition, information processing in the cognitive stage
mainly depends on subjective analysis, which has exceeded the current research
range of information technology.
The perceptual hash function is an information processing theory based on
cognitive psychology, and it is a one-way mapping from a multimedia data set to a
multimedia perceptual digest set. The perceptual hash function maps the
multimedia data possessing the same perceptual content into one unique segment
of digital digest, satisfying the security requirements. We denote the perceptual
hashing function by PH, as shown in Eq.(1.17):
PH: M → H. (1.17)
(1) Anti-collision
That means two pieces of multimedia work with different perceptual content
should not be mapped to the same perceptual hash value.
(2) Robustness
Assume a' ∈ Ocp(a), i.e., a' is a content-preserved version of a; then PH(a') = PH(a).
That means two pieces of multimedia work should be mapped into the same hash
value if they possess the same content or one is the content-preserved version of
another.
(3) One way
Given ha and PH(·), it is very hard to reversely compute the value a from
PH(a) = ha; in other words, no valid information about a can be obtained.
(4) Randomicity
The entropy of the perceptual hash value should be equal to its length in bits,
meaning the ideal perceptual hash value should be completely random.
(5) Transitivity
That means that, under the perception threshold constraint, perceptual hash functions
possess transitivity; beyond the threshold, transitivity no longer holds.
(6) Compactness
Besides the above basic properties, the amount of perceptual hash data should be
as small as possible.
In addition, easy implementation is also an important evaluation index. Only
simple and fast perceptual hash functions can meet the application requirements of
massive multimedia data analysis.
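The above properties can be illustrated with the classic "average hash" construction for images. This is a simple didactic scheme, not a method from the literature discussed in this book: it is robust to mild content-preserving changes (e.g., a small brightness shift), compact (64 bits here), but has no key-dependency and no cryptographic security on its own. The image sizes and the perturbation are illustrative assumptions.

```python
import numpy as np

def average_hash(image, size=8):
    """Toy perceptual hash: downsample to size x size, threshold against the mean -> 64 bits."""
    img = image.astype(float)
    if img.ndim == 3:                        # convert color to gray level
        img = img.mean(axis=-1)
    h, w = img.shape
    ys = np.arange(size + 1) * h // size     # crude block-average downsampling grid
    xs = np.arange(size + 1) * w // size
    small = np.array([[img[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean() for j in range(size)]
                      for i in range(size)])
    return (small > small.mean()).astype(np.uint8).flatten()

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return int(np.sum(h1 != h2))

# A content-preserving change (mild brightness shift) barely changes the hash,
# while a perceptually different image yields a distant hash.
rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
bright = np.clip(img.astype(int) + 10, 0, 255)
other = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
print(hamming(average_hash(img), average_hash(bright)))  # small (content-preserved version)
print(hamming(average_hash(img), average_hash(other)))   # large (different content)
```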
The overall framework of the perceptual hashing function is shown in Fig. 1.12.
Multimedia input can be not only audio, images and videos, but also biometric
templates and 3D models that are stored as digital sequences in the computer.
Perceptual feature extraction is based on the human perceptual model, obtaining
the perceptual invariant features resisting content-preserved operations. The
preprocessing operations such as framing and filtering can improve the accuracy
of feature selection. A variety of signal processing methods in line with the human
perception model can remove the perceptual redundancy and select the most
perceptually significant characteristic parameters. Furthermore, in order to facilitate
hardware implementation and reduce storage requirements, these characteristic
parameters need to be quantized and encoded, i.e., to undergo some post-
processing operations. Accurate perceptual feature extraction is the prerequisite
for the perceptual hash value to possess a good perceptual robustness. The aim of
hash construction is to perform a further dimensionality reduction on the
perceptual characteristics, outputting the final result, i.e., the perceptual hash value.
During the design process of hash construction, we should ensure several security
requirements such as anti-collision, one-wayness and randomness. According to
different levels of security needs, we may choose whether or not to use perceptual
hash keys and at which stages to achieve key-dependency.
Fig. 1.12. The overall framework of perceptual hashing: the multimedia input goes through preprocessing, perceptual feature extraction, post-processing and hash construction to yield the perceptual hash value
At present, there are two similar concepts with respect to perceptual hashes. In
order to avoid confusion, we briefly state their differences and connections as
follows: (1) Robust hashing. Robust hashing is very close to perceptual
hashing in concept, and they both require robust multimedia mapping. However,
for robust hashing, the mapping establishment is based on the choice of invariant
variables, while for perceptual hashing the invariance is based on multimedia
perceptual features in line with the human perceptual model, realizing more
accurate multimedia content analysis and protection. (2) Digital fingerprinting.
At present, the definition and use of digital fingerprinting is somewhat confusing.
There are mainly two types: one is the digital watermarking technique for
copyright protection, the other is the media abstraction technique for media
content identification. The perceptual hash is similar to a digital fingerprint since
it is also a digital digest of multimedia, but it requires more security than the
digital fingerprint technology.
The research into perceptual hash functions is still in its infancy. The research
content mainly focuses on the one-way mapping from the dataset to the perception
data. As research deepens, the perception set itself is bound to be investigated in
order to achieve deeper content protection. At present, a lot of research results in the
perceptual hashing area have been published for all kinds of multimedia. Among
them, a large number of research results in audio fingerprinting have laid a solid
foundation for research into audio perceptual hashing. The perceptual hashing
technique for images has been a research hotspot in recent years, and a large
number of research results have been published. The research into video
perceptual hashing functions is gradually advancing. The state-of-the-art of
perceptual hashing research work for these three kinds of multimedia can be given
as follows.
(1) Extensive research on audio hashing functions started at the beginning of
this century. Philips Research, Delft University and NYU-Poly in the USA
have achieved significant research results. In China, the research into
perceptual audio hashing is still in its infancy, and papers on speech perceptual
hashing technology are seldom published. Based on audio signal processing
techniques and psychoacoustic models, the audio perceptual feature extraction
methods are relatively mature. Mel-frequency cepstrum coefficients and spectral
smoothness can be used to evaluate the pitch and noise characteristics of each
sub-band. A more common feature is the energy in each critical sub-band. Haitsma
and Kalker [30] used 33 sub-band energy values in non-overlapping logarithmic
scales to obtain the ultimate digital fingerprint, which is composed of the signs of
differential results between adjacent sub-bands (both in the time and frequency
axes). The compressed-domain perceptual hashing functions for MPEG audio
often adopt MDCT coefficients to calculate the perceptual hash value. This
method is notably robust to MP3 encoding conversion. Performing the
post-processing operations such as quantization can further improve the
robustness and reduce the amount of data, and discretization is used to enhance the
randomness of hash values so as to reduce the probability of their collision.
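The sign-of-difference construction described above for sub-band energies can be sketched as follows. The input is assumed to be a matrix of sub-band energies, one row per analysis frame (33 bands in Haitsma and Kalker's scheme); each fingerprint bit is the sign of an energy difference taken along the band axis and then differenced again along the time axis. The random energies and the perturbation below are purely illustrative stand-ins for real audio features.

```python
import numpy as np

def fingerprint(energies):
    """
    energies: array of shape (n_frames, n_bands) of sub-band energies.
    Returns an (n_frames - 1, n_bands - 1) bit matrix: the sign of the energy
    difference between adjacent bands, differenced again between adjacent frames.
    """
    band_diff = energies[:, :-1] - energies[:, 1:]          # adjacent-band differences
    time_band_diff = band_diff[1:, :] - band_diff[:-1, :]   # adjacent-frame differences
    return (time_band_diff > 0).astype(np.uint8)

def bit_error_rate(f1, f2):
    return float(np.mean(f1 != f2))

# Toy usage with assumed sub-band energies; a small perturbation (standing in for a
# content-preserving operation such as re-encoding) keeps the bit error rate low.
rng = np.random.default_rng(0)
energy = rng.random((200, 33))
perturbed = energy + 0.01 * rng.random((200, 33))
print(bit_error_rate(fingerprint(energy), fingerprint(perturbed)))   # small, e.g. well below 0.1
```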
(2) Image perceptual hashing functions have become research hot spots in the
field of perceptual hashing recently. Due to plenty of research results in the field
of digital image processing, there are various perceptually-invariant feature
extraction methods for images, such as histogram-based, edge-information-based
and DCT-coefficient-interrelationship-based methods. Unlike audio perceptual
hashing functions, image perceptual hashing functions mainly focus on the image
authentication problem. Therefore, the security problem in hashing is also an
important research part of image perceptual hashing functions. Currently, there are
mainly two methods for improving the security of image hashing. One is to
encrypt the extracted features to assure the security of hashing. However, the
encryption mechanism will greatly reduce the robustness of hashing. The other is
to perform random mapping on the features, for example, random
block selection or low-pass projection of the features.
(3) How to extract video perceptual features is still the most crucial and most
challenging research content in the field of video perceptual hashing. Currently,
unlike the spectrum-domain or other transform-domain features extracted from
images and audios, many algorithms extract spatial features from video signals.
The main aim is to reduce the computational complexity. During the preprocessing
process, the video signal is segmented into shots, each shot being composed of
frames with similar content. The image perceptual hashing function is adopted to
extract the perceptual hash value from keyframes in each shot, and then the final
hash value is obtained for the whole video sequence. This kind of method inherits
good properties from image perceptual hashing functions. We can select the
keyframes with a key, and thus the perceptual hash value is key-dependent.
However, the above methods segment the video sequence into isolated images
such that the interrelation between frames is neglected, and thus it is hard to
completely and accurately describe the video perceptual content. Therefore, the
exploitation of spatial-temporal features is the research direction in the field of
video perceptual feature extraction. In general, the low-level statistics of the
luminance component are viewed as the perceptual features of video, and of
course the chromatic components can also be used to extract the perceptual
features. However, based on the characteristics of the human visual system,
human eyes are more sensitive to the luminance component than to chromatic
components, and the luminance component reflects the main feature of videos.
for the widespread use of perceptual hashing functions. Fig. 1.13 shows the
identification diagram of a typical audio recognition system.
Fig. 1.13. The diagram of audio recognition based on perceptual hashing functions
Fig. 1.14. The diagram of image retrieval based on perceptual hashing functions
1.8 Main Content of This Book
This book mainly focuses on three technical issues: (1) storage and transmission;
(2) watermarking and reversible data hiding; (3) retrieval issues for 3D models.
Succeeding chapters are organized as follows: From the point of view of lowering
the burden of storage and transmission and improving the transmission efficiency,
Chapter 2 discusses 3D model compression technology. From the perspective of the
application to retrieval, Chapter 3 introduces a variety of 3D model feature
extraction techniques, and Chapter 4 is devoted to content-based 3D model retrieval
technology. From the perspective of the application of copyright protection and
content authentication, Chapter 5 and Chapter 6 discuss 3D digital watermarking
techniques, including robust, fragile and reversible watermarking techniques.
3D Mesh Compression
3D meshes have been widely used in graphics and simulation applications for
representing 3D objects. They generally require a huge amount of data for storage
and/or transmission in the raw data format. Since most applications demand
compact storage, fast transmission and efficient processing of 3D meshes, many
algorithms have been proposed in the literature to compress 3D meshes efficiently
since the early 1990s [1]. Because most of the 3D models in use are polygonal
meshes, most of the published papers focus on coding that type of data, which is
composed of two main components: connectivity data and geometry data. This
chapter discusses 3D mesh compression technologies that have been developed
over the last decade, with the main focus on triangle mesh compression
technologies.
2.1 Introduction
2.1.1 Background
Graphics data are more and more widely adopted in various applications, including
video games, engineering design, architectural walkthrough, virtual reality,
e-commerce and scientific visualization. The emerging demand for visualizing and
simulating 3D geometric data in networked environments has aroused research
interests in representations of such data. Among various representation tools,
triangle meshes provide an effective way to represent 3D models. Typically,
connectivity, geometry and property data are together used to represent a 3D
polygonal mesh. Connectivity data describe the adjacency relationship between
vertices, geometry data specify vertex locations and property data specify several
attributes such as normal vectors, material reflectance and texture coordinates.
Geometry and property data are often attached to vertices in many cases, where
they are often called vertex data, and most 3D triangle mesh compression
algorithms handle geometry and property data in a similar way. Therefore, we
focus on the compression of connectivity and geometry data in this chapter.
As the number and the complexity of existing 3D meshes increase explosively,
higher resource demands are placed on the storage space, computing power and
network bandwidth. Among these resources, the network bandwidth is the most
severe bottleneck in network-based graphics that demands real-time interactivity.
Thus, it is essential to compress graphics data efficiently. This research area has
received a lot of attention since the early 1990s, and there has been a significant
amount of progress in this direction over the last decade [2].
Due to the significance of 3D mesh compression, it has been incorporated into
several international standards. VRML [3] has established a standard for
transmitting 3D models over the Internet. Originally, a 3D mesh was represented
in ASCII format without any compression in VRML. To implement efficient
transmission, Taubin et al. developed a compressed binary format for VRML [4]
based on the topological surgery algorithm [5], which can easily achieve a
compression ratio of 50 over the VRML ASCII format. MPEG-4 [6], which is an
ISO/IEC multimedia standard developed by the Moving Picture Experts Group for
digital TV, interactive graphics and interactive multimedia applications, also
includes the 3D mesh coding (3DMC) algorithm to encode graphics data. The
3DMC algorithm is also based on the topological surgery algorithm, which is
basically a single-rate coder for manifold triangle meshes. Furthermore, MPEG-4
3DMC incorporates progressive 3D mesh compression, non-manifold 3D mesh
encoding, error resiliency and quality scalability as optional modes. In this book,
we intend to review various 3D mesh compression technologies with the main
focus on triangle mesh compression.
With respect to 3D mesh compression, there have been several survey papers.
Taubin and Rossignac [5] briefly summarized prior schemes on vertex data
compression and connectivity data compression for triangle meshes. Taubin [8]
gave a survey on various geometry and progressive compression schemes, but the
focus was on two schemes in the MPEG-4 standard. Shikhare [9] classified and
described mesh compression schemes, but progressive schemes were not
discussed in enough depth. Gotsman et al. [10] gave an overview on mesh
simplification, connectivity compression and geometry compression techniques,
but the review on connectivity coding algorithms focused mostly on single-rate
region-growing schemes. Recently, Alliez and Gotsman [1] surveyed techniques
for both single-rate and progressive compression of 3D meshes, but the review
focused only on static (single-rate) compression. Compared with previous survey
papers, this chapter attempts to achieve the following three goals: (1) To be
comprehensive. This chapter covers both single-rate and progressive mesh
compression schemes. (2) To be in-depth. This chapter attempts to make a more
detailed classification and explanation of different algorithms. For example,
techniques based on vector quantization (VQ) are discussed in a whole section. (3)
manifold with boundary. However, there are also quite common surface models
that are not manifold, e.g., the other two examples in Fig. 2.1. In Fig. 2.1(c), the
two cubes touch at a common edge, which contains points with a neighborhood
not equivalent to a disk or a half disk. And in Fig. 2.1(d), the tetrahedra touch at
points with a non-manifold neighborhood.
2.1.2.2 Connectivity
In order to analyze and represent complex surfaces, we subdivide the surfaces into
polygonal patches enclosed by edges and vertices. Fig. 2.2(a) shows the
subdivision of the torus surface into four patches p1, p2, p3, p4. Each patch can be
embedded into the Euclidean plane resulting in four planar polygons as shown in
Fig. 2.2(b). The embedding allows the mapping of the Euclidean topology to the
interior of each patch on the surface. The collection of polygons can represent the
same topology as the surface if the edges and vertices of adjacent patches are
identified. In Fig. 2.2(b), identified edges and vertices are labeled with the same
specifier. The topology of the points on two identified edges is defined as follows.
The points on the edges are parameterized over the interval [0, 1], where zero
corresponds to the vertex with a smaller index and one to the vertex with a larger
index. The points on the identified edges with the same parameter value are
identified and the neighborhood of the unified point is composed of the unions of
half-disks with the same diameter in both adjacent patches. In this way, the
identified edges are treated as one edge. The topology around vertices is defined
similarly. Here the neighborhood is composed of disks put together from several
pie-shaped sectors with the same radius, one from each incident patch.
We are now in the position to split the surface into two constituents: the
connectivity and the geometry. The connectivity C defines the polygons, edges
and vertices and their incidence relation. The geometry G on the other hand
defines the mappings from the polygons, edges and vertices to patches, possibly
bent edges and vertices in the 3D Euclidean space. The pair M = (C, G) defines a
polygonal mesh and allows the representation of solids via their surface. First we
discuss the connectivity, which defines the incidence among polygons, edges and
vertices and which is independent of the geometric realization.
Definition 2.3 (Polygonal Connectivity) The polygonal connectivity is a
quadruple (V, E, F, I) of the set of vertices V, the set of edges E, the set of faces F
and the incidence relation I, such that: 1) each edge is incident to its two end
vertices; 2) each face is incident to an ordered closed loop of edges (e1, e2, …, en)
with ei ∈ E, such that e1 is incident to v1 and v2, …, ei is incident to vi and vi+1, i =
2, …, n−1, and en is incident to vn and v1; 3) in the notation of the previous item, the
face is also incident to the vertices v1, …, vn; 4) the incidence relation is reflexive.
The collection of all vertices, all edges and all faces are called the mesh
elements. We next define the relation "adjacent", which is defined on pairs of
mesh elements of the same type.
mesh elements of the same type.
Definition 2.4 (Adjacent) Two faces are adjacent, if there exists an edge
incident to both of them. Two edges are adjacent, if there exists a vertex incident
to both. Two vertices are adjacent, if there exists an edge incident to both.
Up to now we have defined only terms for very local properties among the mesh
elements. Now we move on to global properties.
Definition 2.5 (Edge-connected) A polygonal connectivity is edge-connected,
if each two faces are connected by a path of faces such that two successive faces
in the path are adjacent.
Definition 2.6 (Valence, Degree and Ring) The valence of a vertex is the
number of edges incident to it, and the degree of a face is the number of edges
incident to it. The ring of a vertex is the ordered list of all its incident faces.
Fig. 2.3 gives an example to show the valence of a vertex and the degree of a
face.
Fig. 2.3. Close-up of a polygon mesh: the valence of a vertex is the number of edges incident
to this vertex, while the degree of a face is the number of edges enclosing it
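Definitions 2.4 and 2.6 translate directly into a few lines of code: given the face list of a polygonal mesh, the edges incident to each vertex and to each face can be counted. The sketch below and its tetrahedron input are only illustrative; it assumes each face is a simple loop of vertex indices.

```python
from collections import defaultdict

def valences_and_degrees(faces):
    """faces: list of vertex-index loops, e.g. [(0, 1, 2), (0, 2, 3), ...].
    Returns (valence per vertex, degree per face) following Definitions 2.4 and 2.6."""
    edges = set()
    for face in faces:
        n = len(face)
        for i in range(n):
            a, b = face[i], face[(i + 1) % n]
            edges.add((min(a, b), max(a, b)))     # undirected edge, stored once
    valence = defaultdict(int)
    for a, b in edges:
        valence[a] += 1                           # valence = number of incident edges
        valence[b] += 1
    degree = [len(face) for face in faces]        # edges enclosing a face = its vertex count
    return dict(valence), degree

# Example: a tetrahedron (4 triangles); every vertex has valence 3, every face degree 3.
tetra = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
print(valences_and_degrees(tetra))
```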
As the connectivity is used to define the topology of the mesh and the
represented surface, one can define the following criterion for the surface to be
manifold.
Definition 2.7 (Potentially Manifold) A polygonal connectivity is potentially
96 2 3D Mesh Compression
manifold, if 1) each edge is incident to exactly two faces; 2) the non-empty set of
faces around each vertex forms a closed cycle.
Definition 2.8 (Potentially Manifold with Border) A polygonal connectivity
is potentially manifold with border, if 1) each edge is incident to one or two faces;
2) the non-empty set of faces around each vertex forms an open or closed cycle.
A surface defined by a mesh is manifold, if the connectivity is potentially
manifold and no patch has a self-intersection and the intersection of two different
patches is either empty or equal to the identified edges and vertices. All the
non-manifold meshes in Fig. 2.1 are not potentially manifold.
Definition 2.9 (Genus of a Manifold) The genus of a connected orientable
manifold without boundary is defined as the number of handles.
As we know, there is no handle in a sphere, one handle in a torus, and two
handles in an eight-shaped surface as shown in Fig. 2.4. Thus, their genera are 0, 1
and 2, respectively. For a connected orientable manifold without boundary,
Euler’s formula is given by
Nv - Ne + Nf = 2 - 2G, (2.1)
where G is the genus of the manifold, and the total number of vertices, edges and
faces of a mesh are denoted as Nv, Ne, and Nf respectively.
Fig. 2.4. Examples to show the genus of a manifold. (a) Sphere; (b) Torus; (c) Eight-shaped mesh
In a typical closed triangle mesh, each face has three edges and each edge is shared by two faces, so
Ne = 3Nf / 2. (2.2)
Substituting this into Euler's formula (for meshes whose genus is small compared with the mesh size) gives
Nv ≈ Nf / 2. (2.3)
That is to say, a typical triangle mesh has twice as many triangles as vertices.
Similarly, the edge count satisfies
Ne ≈ 3Nv, (2.4)
and, since every edge is incident to two vertices, the vertex valences sum to
Σ valence = 2Ne ≈ 6Nv, (2.5)
i.e., the average vertex valence in a typical triangle mesh is about 6.
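Eqs.(2.1) - (2.5) can be checked numerically on a small closed triangle mesh. The helper below counts Nv, Ne and Nf for an octahedron (genus 0); the face list is an illustrative input, not taken from the text.

```python
def euler_check(faces):
    """Count vertices, edges and faces of a closed triangle mesh and its Euler characteristic."""
    vertices = {v for f in faces for v in f}
    edges = {tuple(sorted((f[i], f[(i + 1) % 3]))) for f in faces for i in range(3)}
    nv, ne, nf = len(vertices), len(edges), len(faces)
    return nv, ne, nf, nv - ne + nf   # the last value should equal 2 - 2G

# Octahedron: 6 vertices, 12 edges, 8 triangular faces -> Euler characteristic 2 (genus 0).
octa = [(0, 1, 2), (0, 2, 3), (0, 3, 4), (0, 4, 1),
        (5, 2, 1), (5, 3, 2), (5, 4, 3), (5, 1, 4)]
print(euler_check(octa))  # (6, 12, 8, 2); note Ne = 3*Nf/2 holds exactly here
```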
space with non-zero volume. For this we define the topological polyhedron as
follows.
Definition 2.12 (Topological Polyhedron) A topological polyhedron is a
potentially manifold and edge-connected polygonal connectivity.
Fig. 2.5. Examples of orientable and non-orientable meshes. (a) Orientable manifold mesh; (b)
Non-orientable non-manifold mesh; (c) Orientable non-manifold mesh
2.1.2.3 Geometry
It is now time to add some geometry to the connectivity. We want to describe this
procedure only for the typical case of polygonal and polyhedral geometry in the
Euclidean space. Similarly, meshes with curved edges and surfaces could be
defined.
Definition 2.13 (Euclidean Polygonal/Polyhedral Geometry) The Euclidean
geometry G of a polygonal/polyhedral mesh M = (C, G) is a mapping from the
mesh elements in C to R3 with the following properties: 1) a vertex is mapped to a
point in R3; 2) an edge is mapped to the line segment connecting the points of its
incident vertices; 3) a face is mapped to the inside of the polygon formed by the
line segments of the incident edges; 4) a topological polyhedron is mapped to the
sub-volume of R3 enclosed by its incident faces.
Here a problem arises that also often occurs in practice. In R3, the edges of a
face often do not lie in the same plane. Therefore, the geometric representation of
a face is not defined properly and also a sound 2D parameterization of the polygon
is not easily defined. In practice, this is often ignored and the polygon is split into
triangles for which a unique plane is given in the Euclidean space. Often further
attributes like physical properties of the described surface/volume, the surface
color, the surface normal or a parameterization of the surface are necessary. In
practice, we often simplify the problem to the simplest types of mesh elements,
the simplices. The k-dimensional simplex (or k-simplex for short) is formed by the
convex hull of k+1 points in the Euclidean space. A 0-simplex is just a point, a
1-simplex is a line segment, a 2-simplex is a triangle and the 3-simplex forms a
tetrahedron. For simplices, the linear and quadratic interpolations of vertex and
edge attributes are simply defined via the barycentric coordinates.
In some applications, the handling of mixed dimensional meshes is necessary.
As the handling of mixed dimensional polygonal/polyhedral meshes becomes very
complicated, one often gives up polygons and polyhedra and restricts oneself to
simplicial complexes, which allow for singleton vertices and edges and
non-manifold mesh elements. A simplicial complex is defined as follows.
Definition 2.14 (Simplicial Complex) A k-dimensional simplicial complex is
a (k+1)-tuple (S0, …, Sk), where Si contains all i-simplices of the complex. The
simplices fulfill the condition that the intersection of two i-simplices is either
empty or equal to a simplex of lower dimension.
As a simplex and therefore a simplicial complex is only a geometric
description, we have to define the connectivity of a simplicial complex, which is
easily done by specifying the incidence relation among the simplices of different
dimensions. An i-simplex is incident to a j-simplex with i < j if the i-simplex
forms a sub-simplex of the j-simplex.
Definition 2.16 (Simple Mesh) A simple mesh is a triangle mesh that forms a
connected, orientable, manifold surface that is homeomorphic to a sphere or to a
half-sphere. Such meshes have no handle and either have no boundary or have a
boundary that is a connected, manifold, closed curve, i.e., a simple loop.
For simple meshes, the Euler equation yields
Nt - Ne + Nv = 1, (2.6)
where Nt = |T| is the number of triangles, Nv = |VI| + |VE| is the total number of internal and external (boundary) vertices, and Ne is the total number of external and internal edges. Since there are |VE| external edges and (3|T| - |VE|)/2 internal edges, we have Ne = (3|T| + |VE|)/2. Thus, based on Eq.(2.6), we can easily obtain
|T| = 2|VI| + |VE| - 2. (2.7)
When reporting the compression performance, some papers employ the measure
of bits per triangle (bpt) while others use bits per vertex (bpv). For consistency, we
adopt the bpv measure exclusively, and convert the bpt metric to the bpv metric by
assuming that a mesh has twice as many triangles as vertices.
Single resolution mesh compression methods are important for encoding large
databases of small objects, base meshes of progressive representations or for fast
transmission of meshes over the Internet. We can classify the single resolution
techniques into two classes: (1) techniques aiming at coding the original mesh
without making any assumption about its complexity, regularity or uniformity;
(2) techniques which remesh the model before compression. The original mesh is
considered as just one instance of the shape geometry.
Single-rate or static connectivity compression methods perform the single-rate
compression only on the connectivity data, without considering the geometry data.
Single-rate connectivity compression can be roughly divided into two types:
edge-based and vertex-based coders. Here, we classify existing typical single-rate
connectivity compression algorithms into six classes: the indexed face set, the
triangle strip, the spanning tree, the layered decomposition, the valence-driven approach
and the triangle conquest method. They can be described in detail as follows.
In the VRML ASCII format [3], a triangle mesh is represented with an indexed
face set that is composed of a coordinate array and a face array. The coordinate
array gives the coordinates of all vertices, and the face array shows each face by
indexing its three vertices in the coordinate array. Fig. 2.6 gives a mesh example
and its face array.
Fig. 2.6. The indexed face set representation of a mesh. (a) A mesh example; (b) Its face array
If the number of vertices in a mesh is Nv, then we need log2Nv bits to represent
the index of each vertex. Thus, 3log2Nv bits are required to represent the
connectivity information of a triangular face. Since there are about twice as many
triangles as vertices in a typical triangle mesh, the connectivity information costs
about 6log2Nv bpv in the indexed face set method. This method provides a
straightforward way for the representation of triangle meshes. There is actually no
compression applied in this method, but we still list it here to provide a basis of
comparison for the following compression schemes.
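The 3·log2 Nv bits per triangle (about 6·log2 Nv bpv) figure can be reproduced in a few lines; the vertex count used below is a made-up example, and ceil(log2 Nv) is used as the per-index cost.

```python
import math

def indexed_face_set_cost(num_vertices, num_triangles=None):
    """Connectivity cost of the indexed face set: ceil(log2 Nv) bits per vertex index,
    3 indices per triangle, and about 2 triangles per vertex in a typical mesh."""
    if num_triangles is None:
        num_triangles = 2 * num_vertices
    bits_per_index = math.ceil(math.log2(num_vertices))
    total_bits = 3 * bits_per_index * num_triangles
    return bits_per_index, total_bits / num_vertices   # (bits per index, bpv)

# For a hypothetical mesh with 100,000 vertices: 17 bits per index, about 102 bpv of connectivity.
print(indexed_face_set_cost(100_000))  # (17, 102.0)
```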
Obviously, in this representation, each vertex may be indexed several times by
all its adjacent triangles. Repeated vertex references will definitely degrade the
efficiency of connectivity representation. In other words, a good connectivity
compression method should reduce the number of repeated vertex references. This
observation motivates researchers to develop the following triangle strip scheme.
The triangle strip scheme attempts to segment a 3D mesh into long strips of
triangles, and then encode them. The main aim of this method is to reduce the
amount of data transmitted between the CPU and the graphics card, since triangle
strips are well supported by most graphics cards. Although this method requires
less storage space and transmission bandwidth than the indexed face set, it is still
not very efficient for the compression purpose.
Fig. 2.7(a) shows a triangle strip, where each vertex is combined with the
previous two vertices in a vertex sequence to form a new triangle. Fig. 2.7(b)
shows a triangle fan, where each vertex after the first two forms a new triangle
with the previous vertex and the first vertex. Fig. 2.7(c) shows a generalized
triangle strip that is a mixture of triangle strips and triangle fans. Note that, in a
generalized triangle strip, a new triangle is introduced by each vertex after the first
two in a vertex sequence. However, in an indexed face set, a new triangle is
introduced by three vertices. Therefore, the generalized triangle strip provides a
more compact representation than the indexed face set, especially when the strip
length is long. In a rather long generalized triangle strip, the ratio of the number of
triangles to the number of vertices is very close to 1, meaning that a triangle can
be represented by almost exactly 1 vertex index.
Fig. 2.7. Examples of triangle strips. (a) Triangle strip; (b) Triangle fan; (c) Generalized triangle strip
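The decoding rules for strips and fans described above are easy to state in code. The sketch below is only illustrative: it ignores the alternating triangle orientation used by real strip formats and uses a made-up index sequence.

```python
def strip_to_triangles(vertex_sequence):
    """Decode a triangle strip: each vertex after the first two forms a triangle
    with the two preceding vertices (orientation flipping is ignored for brevity)."""
    return [(vertex_sequence[i], vertex_sequence[i + 1], vertex_sequence[i + 2])
            for i in range(len(vertex_sequence) - 2)]

def fan_to_triangles(vertex_sequence):
    """Decode a triangle fan: each vertex after the first two forms a triangle
    with the previous vertex and the first vertex."""
    first = vertex_sequence[0]
    return [(first, vertex_sequence[i], vertex_sequence[i + 1])
            for i in range(1, len(vertex_sequence) - 1)]

strip = [0, 1, 2, 3, 4, 5]           # 6 vertex indices -> 4 triangles
print(strip_to_triangles(strip))     # [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5)]
print(fan_to_triangles(strip))       # [(0, 1, 2), (0, 2, 3), (0, 3, 4), (0, 4, 5)]
```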
However, since there are about twice as many triangles as vertices in a typical
mesh, some vertex indices should be repeated in the generalized triangle strip
representation of the mesh, which indicates a waste of storage. To alleviate this
problem, several schemes have been developed, where a vertex buffer is utilized
to store the indices of recently traversed vertices. Deering [12] first introduced the
concept of the generalized triangle mesh. A generalized triangle mesh is formed by
combining generalized triangle strips with a vertex buffer. He used a
first-in-first-out (FIFO) buffer to store the indices of up to 16 recently-visited
vertices. If a vertex is saved in the vertex buffer, it can be represented with the
buffer index that requires a lower number of bits than the global vertex index.
Assuming that each vertex is reused by the buffer index only once, Taubin and
Rossignac [5] showed that the generalized triangle mesh representation requires
approximately 11 bpv to encode the connectivity data for large meshes. Deering,
however, did not propose a method to decompose a mesh into triangle strips.
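The benefit of a 16-entry FIFO vertex buffer can be simulated on a stream of vertex references: a reference found in the buffer costs only a short buffer index instead of a full global index. The sketch below is a rough bit-cost model of that idea under assumed flag and index sizes, not a reimplementation of Deering's coder.

```python
from collections import deque
import math

def fifo_cost(vertex_stream, num_vertices, buffer_size=16):
    """Rough bit-cost model: a vertex already in the FIFO buffer is referenced by a
    log2(buffer_size)-bit index, otherwise by a full log2(num_vertices)-bit index
    (plus one flag bit per reference to distinguish the two cases)."""
    full_bits = math.ceil(math.log2(num_vertices))
    buf_bits = math.ceil(math.log2(buffer_size))
    buffer, bits = deque(maxlen=buffer_size), 0
    for v in vertex_stream:
        if v in buffer:
            bits += 1 + buf_bits          # hit: flag + buffer index
        else:
            bits += 1 + full_bits         # miss: flag + global index
            buffer.append(v)              # the oldest entry is evicted automatically
    return bits

# Hypothetical reference stream in which vertices are re-referenced shortly after first use.
stream = [0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4, 5, 4, 5, 6]
print(fifo_cost(stream, num_vertices=100_000))
```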
Based on Deering’s work, Chow [13] proposed a mesh compression scheme
Turan [16] observed that the connectivity of a planar graph can be encoded with a
constant number of bpv using two spanning trees: a vertex spanning tree and a
triangle spanning tree. Based on this observation, Taubin and Rossignac [5]
presented a topological surgery approach to encode mesh connectivity. The basic
idea is to cut a given mesh along a selected set of cut edges to make a planar
polygon. The mesh connectivity is then represented by the structures of cut edges
and the polygon. In a simple mesh, any vertex spanning tree can be selected as the
set of cut edges.
Fig. 2.9 illustrates the encoding process. Fig. 2.9(a) is an octahedron mesh.
First, the encoder constructs a vertex spanning tree as shown in Fig. 2.9(b), where
each node corresponds to a vertex in the input mesh. Then, it cuts the mesh along
the edges of the vertex spanning tree. Fig. 2.9(c) shows the resulting planar
polygon and the triangle spanning tree. Each node in the triangle spanning tree
corresponds to a triangle in the polygon, and two nodes are connected if and only
if the corresponding triangles share an edge.
Fig. 2.9. Topological surgery encoding of an octahedron: (a) the input mesh; (b) the vertex spanning tree; (c) the resulting planar polygon and its triangle spanning tree
Then, the two spanning trees are run-length encoded. A run is defined as a tree
segment between two nodes with degrees not equal to 2. For each run of the vertex
spanning tree, the encoder records its length with two additional flags. The first
flag is the branching bit indicating whether a run subsequent to the current run
starts at the same branching node, and the second flag is the leaf bit indicating
whether the current run ends at a leaf node. For example, let us encode the vertex
spanning tree in Fig. 2.9(b), where the edges are labeled with their run indices.
The first run is represented by (1, 0, 0), since its length is 1, the next run does not
start at the same node and it does not end at a leaf node. In this way, the vertex
spanning tree in Fig. 2.9(b) is represented by (1,0,0), (1,1,1), (1,0,0), (1,1,1),
(1,0,1). Similarly, for each run of the triangle spanning tree, the encoder writes its
length and the leaf bit. Note that the triangle spanning tree is always binary so that
it does not need the branching bit. Furthermore, the encoder records the marching
pattern with one bit per triangle to indicate how to triangulate the planar polygon
internally. The decoder can reconstruct the original mesh connectivity from this
set of information.
In both vertex and triangle spanning trees, a run is a basic coding unit. Thus,
the coding cost is proportional to the number of runs, which in turn depends on
how the vertex spanning tree is constructed. Taubin and Rossignac’s algorithm
builds the vertex spanning tree based on layered decomposition, which is similar
to the way we peel an orange along a spiral path, to maximize the length of each
run and minimize the number of runs generated.
Taubin and Rossignac also presented several modifications so that their
algorithm can encode general manifold meshes: meshes with arbitrary genus,
meshes with boundary and non-orientable meshes. However, their algorithm
cannot directly deal with non-manifold meshes. As a preprocessing step, the
mesh data, for the effects of transmission errors can be localized by encoding
different vertex and triangle layers independently. Based on the layered
decomposition method, Bajaj et al. [18] also proposed an algorithm to encode
large CAD models. This algorithm extends the layered decomposition method to
compress quadrilateral and general polygonal models as well as CAD models with
smooth non-uniform rational B-splines (NURBS) patches.
Fig. 2.10. Three cases in the triangle layer, where contours are depicted with solid lines and
other edges with dashed lines. (a) The layered vertex structure and the branching point depicted
by a black dot; (b) A triangle strip; (c) Bubble triangles; (d) A cross-contour triangle fan
The main idea of the valence-driven approach is as follows. First, it selects a seed
triangle whose three edges form the initial borderline. Then, the borderline
partitions the whole mesh into two parts, i.e., the inner part that has been
processed and the outer part that is to be processed. Next, the borderline gradually
expands outwards until the whole mesh is processed. The output is a stream of
vertex valences, from which the original connectivity can be reconstructed.
In [19], Touma and Gotsman presented a pioneering algorithm known as the
valence-driven approach. It starts from an arbitrary triangle, and pushes its three
vertices into a list called the active list. Then, it pops a vertex from the active
list, traverses all untraversed edges connected to that vertex, and pushes the new
vertices into the end of the list. For each processed vertex, it outputs the valence.
Sometimes it needs to split the current active list or merge it with another active
list. These cases are encoded with special codes. Before encoding, for each
boundary loop, a dummy vertex is added and connected to all the vertices in that
boundary loop, making the topology closed. Fig. 2.11 shows an example of the
encoding process, where the active list is depicted by thick lines, and the focus
vertex by the black dot, and the dummy vertex by the gray dot. Table 2.1 lists the
output of each step associated with Fig. 2.11.
Fig. 2.11. (a)–(s) A mesh connectivity encoding example by Touma and Gotsman [19],
where the active list is shown with thick lines, the focus vertex with the black dot and the dummy
vertex with the gray dot (With courtesy of Touma and Gotsman)
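The simplified Python sketch below illustrates the flavor of this traversal. It assumes a closed manifold mesh given as a vertex adjacency map, and it ignores the split and merge cases as well as dummy vertices, so it is only a schematic illustration of the valence-output idea, not Touma and Gotsman's actual coder.

from collections import deque

def valence_stream(adjacency, seed_triangle):
    # Vertices of the seed triangle initialize the active list.
    visited = set(seed_triangle)
    active = deque(seed_triangle)
    output = []
    while active:
        focus = active.popleft()
        output.append(len(adjacency[focus]))   # emit the valence of the processed vertex
        for neighbor in adjacency[focus]:       # traverse untraversed edges
            if neighbor not in visited:
                visited.add(neighbor)
                active.append(neighbor)
    return output

# Tetrahedron: every vertex has valence 3.
tetra = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
print(valence_stream(tetra, (0, 1, 2)))   # [3, 3, 3, 3]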
The remaining part of the valence-based connectivity code must be due to the split operations (or some other
essential piece of information). In other words, the number of split operations in
the code is linear in the size of the mesh, albeit with a very small constant. This
means that the empirical observation that the number of split operations is
negligible is incorrect, and is probably due to the experiments being performed on
a small subset of relatively “well-behaved” mesh connectivities. At present, there
is no way of bounding this number, meaning that even if the coding algorithms
minimize the number of split operations, there is no way for us to eliminate the
possibility that the size of the code may actually exceed the Tutte entropy (due to
these split operations). The question of the optimality of valence-based coding of
3D meshes will remain open until more concrete information on the expected
number of split operations incurred during the mesh conquest is available. We do
believe, nonetheless, that even if the valence-based coding is not optimal, it is
probably not far from optimal.
Similar to the valence-driven approach, the triangle conquest approach starts from
the initial borderline, which partitions the whole mesh into conquered and
unconquered parts, and then inserts triangle by triangle into the conquered parts.
The main difference is that the triangle conquest scheme outputs the building
operations of new triangles, while the valence-driven approach outputs the
valences of new vertices. Gumhold and Straßer [24] first presented a triangle
conquest approach, called the cut-border machine. At each step, this scheme
inserts a new triangle into the conquered part, which is enclosed by the cut-border, using one
of the five building operations: “new vertex”, “forward”, “backward”, “split” and
“close”. The sequence of building operations is encoded with Huffman codes. This
method is applicable to manifold meshes that are either orientable or
non-orientable. Experimentally, its compression cost lies within 3.22–8.94 bpv,
mostly around 4 bpv. The most important advantage of this scheme is that the
decompression speed is very fast and the decompression method is easy to
implement with hardware. Furthermore, compression and decompression
operations can be performed in parallel. These properties make this method very
attractive in real-time coding applications. In [25], Gumhold further improved the
compression performance by using an adaptive arithmetic coder to optimize the
border encoding. The experimental compression ratio is within the range of
0.3–2.7 bpv, and 1.9 bpv on average.
Rossignac [26] proposed another triangle conquest approach called the
edgebreaker algorithm. It is nearly equivalent to the cut-border machine, except
that it does not encode the offset data associated with the split operation. The
triangle traversal is controlled by edge loops as shown in Fig. 2.12(a). Each edge
loop bounds a conquered region and contains a gate edge. At each step, this
approach focuses on one edge loop and its gate edge is called the active gate,
while the other edge loops are stored in a stack and will be processed later.
Initially, for each connected component, one edge loop is defined. If the
component has no physical boundary, two half edges corresponding to one edge
are set as the edge loop. For example, in Fig. 2.12(b), the mesh has no boundary
and the initial edge loop is formed by g and g·o, where g·o is the opposite half
edge of g. In Fig. 2.12(c), the initial edge loop is the mesh boundary.
Fig. 2.12. Illustration of the Edgebreaker algorithm, where thick lines depict edge loops, and g
denotes the gate. (a) Edge loops; (b) Gates and initial edge loops for a mesh without boundary; (c)
Gates and initial edge loops for a mesh with boundary
At each step, this scheme conquers a triangle incident on the active gate,
updates the current loop, and moves the active gate to the next edge in the updated
loop. For each conquered triangle, this algorithm outputs an op-code. Assume that
the triangle to be removed is defined by the active gate g and a vertex v; then there
are five kinds of possible op-codes, as shown in Fig. 2.13(a): (1) C (loop
extension), if v is not on the edge loop; (2) L (left), if v immediately precedes g in
the edge loop; (3) R (right), if v immediately follows g; (4) E (end), if v precedes
and follows g; (5) S (split), otherwise. Essentially, the compression process is a
depth-first traversal of the dual graph of the mesh. When the split case is
encountered, the current loop is split into two, and one of them is pushed into the
stack while the other is further traced. Fig. 2.13(b) shows an example of the
encoding process, where the arrows and the numbers give the order of the triangle
conquest. The triangles are filled with different patterns to represent different
op-codes, which are produced when they are conquered. In this case, the encoder
outputs the series of op-codes as CCRSRLLRSEERLRE.
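A minimal Python sketch of the op-code decision rule is given below. The active loop is modeled as a circular list of vertex indices and the gate as the directed edge starting at position gate_index; loop maintenance after each op-code (and hence the full Edgebreaker traversal) is deliberately omitted, so this only illustrates the classification step described above.

def classify(loop, gate_index, v):
    n = len(loop)
    prev_v = loop[(gate_index - 1) % n]   # vertex immediately preceding the gate
    next_v = loop[(gate_index + 2) % n]   # vertex immediately following the gate
    if v not in loop:
        return 'C'                        # loop extension: v is a new vertex
    if v == prev_v and v == next_v:
        return 'E'                        # loop closes completely
    if v == prev_v:
        return 'L'
    if v == next_v:
        return 'R'
    return 'S'                            # otherwise the loop must be split

# Gate is the edge (0, 1) of the loop [0, 1, 2, 3]; vertex 9 is not on the loop.
print(classify([0, 1, 2, 3], 0, 9))   # 'C'
print(classify([0, 1, 2, 3], 0, 2))   # 'R'
print(classify([0, 1, 2, 3], 0, 3))   # 'L'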
Fig. 2.13. Five op-codes used in the Edgebreaker algorithm. (a) Five op-codes C, L, R, E, and S,
where the gate g is marked with an arrow; (b) An example of the encoding process in the
Edgebreaker algorithm, where the arrows and the numbers show the traversal order and different
filling patterns are used to represent different op-codes
The Edgebreaker method can encode the topology data of orientable manifold
meshes with multiple boundary loops or with arbitrary genus, and guarantees a
worst-case coding cost of 4 bpv for simple meshes. However, it is unsuitable for
streaming applications, since it requires a two-pass process for decompression,
and the decompression time is O(Nv^2). Another disadvantage is that, even for
regular meshes, it requires about the same bitrate as for non-regular meshes.
King and Rossignac [27] modified the Edgebreaker method to guarantee a
worst-case coding cost of 3.67 bpv for simple meshes, and Gumhold [28] further
improved this upper bound to 3.522 bpv. The decoding efficiency of the
Edgebreaker method was also improved to exhibit linear time and space
complexities in [27, 29, 30]. Furthermore, Szymczak et al. [31] optimized the
Edgebreaker method for meshes with high regularity by exploiting dependencies
of output symbols. It guarantees a worst-case performance of 1.622 bpv for
sufficiently large meshes with high regularity.
As mentioned earlier, we can reduce the amount of data transmission between
the CPU and the graphic card by decomposing a mesh into long triangle strips, but
finding a good decomposition is often computationally intensive. Thus, it is often
desirable to generate long strips from a given mesh only once and distribute the
stripification information together with the mesh. Based on this observation,
Isenburg [32] presented an approach to encode the mesh connectivity together
with its stripification information. It is basically a modification of the Edgebreaker
method, but its traversal order is guided by strips obtained by the STRIPE
algorithm [15]. When a new triangle is included, its relation to the underlying
triangle strip is encoded with a label. The label sequences are then entropy
encoded. The experimental compression performance ranges from 3.0 to 5.0 bpv.
Recently, Jong et al. proposed an edge-based single-resolution compression
scheme [33] to encode and decode 3D models straightforwardly via a single-pass
traversal in sequential order. Most algorithms use the split operation to separate
the 3D model into two components; in that case, either the displacement is recorded or an
extra operator is required for identifying the branch. This study suggested using
the J operator to skip to the next edge of the active boundary, and thus it does not
require split overhead. Considering all possible configurations of the active gate and the third
vertex, this study adopted five operators, Q, C, R, L and J, and used them to encode
and decode triangular meshes. This algorithm adopts Rossignac’s C, R and L operators
[26] as shown in Fig. 2.13(a), and proposes two new operators, Q and J, as
illustrated in Fig. 2.14(a). For explanatory purposes, the Q and J operators are
described as follows:
(1) Q. The third vertex is a new vertex and its consecutive triangle is R. These
two triangles, which comprise a quadrilateral, are then shifted from the
un-compressed area into the compressed area. The active gate is then removed, the
two sides of the quadrilateral that are not on the active boundary are added to the
active boundary, and the right one of them serves as the new active gate.
Geometrically, the Q operator represents two triangles that would otherwise be
coded as CR. Unlike the further context-based encoding of CR codes conducted by
Rossignac, this approach only needs to read Q during decompression and treat it as
two triangles, whereas the context-based coder must first transform the code back
to CR before recognizing the two triangles.
(2) J. The third vertex lies on the active boundary and is neither the previous nor
the next vertex of the active gate. This operator does not compress any triangle;
the active gate simply skips to the next edge of the active boundary, which serves
as the new active gate. The third vertex and the active gate form a triangle that
divides the un-compressed area into two parts.
Fig. 2.14. Two new operators and the corresponding compression process adopted in [33]. (a)
Operators Q and J; (b) A compression example (© [2005] IEEE)
Explicitly identifying the third vertex would require many extra bits in this case.
Thus, this triangle is not compressed at this point and is eventually compressed by
an “R” or “L” operator.
Fig. 2.14(b) illustrates the compression process of Jong et al.’s algorithm,
where the dotted lines represent J operators. A total of 27 operators,
CQQJRLRCJQQRRLLLRQQQRRLLRLR, are generated by the algorithm.
Furthermore, the adaptive arithmetic coder is applied in Jong et al.’s algorithm to
achieve an improved compression ratio.
2.2.7 Summary
Table 2.2 summarizes the bitrates of various connectivity coding schemes
introduced above. The bitrates marked by “*” are the theoretical upper bounds
obtained by the worst-case analysis, while the others are experimental bitrates.
Among these methods, Touma and Gotsman’s algorithm [19] is viewed as the
state-of-the-art technique for single-rate 3D mesh compression. With some minor
improvements on Touma and Gotsman’s algorithm, Alliez and Desbrun’s
algorithm [20] yields an improved compression ratio. The indexed face set,
triangle strip and layered decomposition methods can encode meshes with
arbitrary topology. In contrast, the other approaches can handle only manifold meshes
with additional constraints.
Table 2.2 Comparisons of bitrates for various single-rate connectivity coding algorithms

| Category | Algorithm | Bitrate (bpv) | Comment |
| Indexed face set | VRML ASCII format [3] | 6log2Nv | No compression |
| Triangle strip | Deering [12] | 11 | |
| Spanning tree | Taubin and Rossignac [5] | 2.4–7.0 | |
| Layered decomposition | Bajaj et al. [17] | 1.40–6.08 | |
| Valence-driven approach | Touma and Gotsman [19] | 0.2–2.4, 1.5 on average | Especially good for regular meshes |
| | Alliez and Desbrun [20] | 0.024–2.96, 3.24* | |
| Triangle conquest | Gumhold and Straßer [24] | 3.22–8.94, 4 on average | Optimized for real-time applications |
| | Gumhold [25] | 0.3–2.7, 1.9 on average | |
| | Rossignac [26] | 4* | |
| | King and Rossignac [27] | 3.67* | |
| | Gumhold [28] | 3.522* | |
| | Szymczak et al. [31] | 1.622* for sufficiently large meshes with high regularity | Optimized for regular meshes |
| | Jong et al. [33] | 1.19 on average | An adaptive arithmetic coder is used |

* Theoretical upper bounds obtained by the worst-case analysis
For instance, the valence-driven approach [19, 20] requires that the manifold also be orientable. Szymczak et al.’s algorithm [31]
requires that the manifold have neither boundary nor handles. Note that using
these algorithms, a non-manifold mesh can be handled only if it is pre-converted
to a manifold mesh by replicating non-manifold vertices, edges and faces as in
[34].
Fig. 2.15. Intermediate meshes [1]. (a) Based on a single-rate technique; (b) Using a
progressive technique (With courtesy of Alliez and Gotsman)
2.3 Progressive Connectivity Compression
Hoppe [35] first introduced the progressive mesh (PM) representation, a new
scheme for storing and transmitting arbitrary triangle meshes. This efficient,
lossless, continuous-resolution representation addresses several practical problems
in graphics: smooth geomorphing of level-of-detail approximations, progressive
transmission, mesh compression and selective refinement. This scheme simplifies
a given orientable manifold mesh with successive edge collapse operations. As
shown in Fig. 2.16, if an edge is collapsed, its two end points are merged into one,
and two triangles (or one triangle if the collapsed edge is on the boundary)
incident to this edge are removed, and all vertices previously connected to the two
end points are re-connected to the merged vertex. The inverse operation of edge
collapse (e_col as shown in Fig. 2.16) is vertex split (v_split as shown in Fig. 2.16)
that inserts a new vertex into the mesh together with corresponding edges and
triangles.
An original mesh M = Mk can be simplified into a coarser mesh M0 by
performing k successive edge collapse operations. Each edge collapse operation
ecoli transforms the mesh Mi into Mi−1, with i = k, k−1, …, 1. Since edge collapse
operations are invertible, we can represent an arbitrary triangle mesh M by its
base mesh M0 together with a sequence of vertex split operations. Each vertex
split operation vspliti refines the mesh Mi−1 back to Mi, with i = 1, 2, …, k. Thus,
we can view (M0, vsplit1, …, vsplitk) as the progressive mesh representation of M.
Fig. 2.16. Illustration of the edge collapse and vertex split processes
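The progressive mesh representation itself is just a base mesh plus an ordered list of refinement records. The Python sketch below shows one plausible way to store it; the field names (vs, vl, vr, vt) follow the usual edge collapse labeling of Fig. 2.16, but the exact record layout is an assumption, not Hoppe's file format.

from dataclasses import dataclass, field

@dataclass
class VertexSplit:
    vs: int             # vertex to be split
    vl: int             # left neighbor of the collapsed edge
    vr: int             # right neighbor of the collapsed edge
    vt_position: tuple  # position of the re-inserted vertex vt

@dataclass
class ProgressiveMesh:
    base_vertices: list     # vertex positions of the base mesh M0
    base_faces: list        # triangles of M0 as index triples
    splits: list = field(default_factory=list)   # vsplit1, ..., vsplitk

    def vertex_count(self, level):
        # Each vertex split inserts exactly one new vertex, so the size of
        # every intermediate mesh M_level is known without decoding it.
        return len(self.base_vertices) + level

pm = ProgressiveMesh(base_vertices=[(0, 0, 0), (1, 0, 0), (0, 1, 0)],
                     base_faces=[(0, 1, 2)],
                     splits=[VertexSplit(vs=1, vl=0, vr=2, vt_position=(1, 1, 0))])
print(pm.vertex_count(len(pm.splits)))   # 4 vertices in the fully refined mesh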
Popovic and Hoppe [38] observed that the original PM has two restrictions: (1) It
is applicable only to orientable manifold meshes; (2) It does not possess the
freedom to change the topological type of a given mesh during the simplification
and refinement, which limits its coding efficiency. To alleviate these problems,
they presented a method called progressive simplicial complex (PSC). In this
scheme, a more general vertex split operation is exploited to encode the changes in
both geometry and topology. A PSC representation consists of a single-vertex base
model followed by a sequence of generalized vertex split operations. PSC can be
used to compress meshes of any topology type.
To construct a PSC representation, a sequence of vertex merging operations
are performed to simplify a given mesh model. Each vertex merging operation
merges an arbitrary pair of vertices, which are not necessarily connected by an
edge, into a single vertex. The inverse operation of vertex merging is the
generalized vertex split operation that splits a vertex into two. Suppose that the
vertex vi in the mesh Mi is to be split to generate a new vertex whose index is i+1
in the mesh Mi+1. Each simplex adjacent to vi in Mi is the merging result of one of
four cases as shown in Fig. 2.17. For a rigorous definition of simplex, readers can
refer to [38]. Intuitively, a 0-dimensional simplex is a point, a 1D simplex is an
edge and a 2D simplex is a triangle face, and so on. For each simplex adjacent to
vi, PSC assigns a code to indicate one of the four cases as given in Fig. 2.17.
Since the generalized vertex split operation is more flexible than the original
vertex split operation in PM, PSC may require more bits in connectivity coding
than PM. Specifically, PSC requires about (log2Nvi+8) bits to specify the
connectivity change around the split vertex, while PM requires only about
(log2Nvi+5) bits. However, the main advantage of PSC is its capability to handle
arbitrary triangular models without any topology constraint. Similar to PM, the
geometry data in PSC are also encoded based on delta prediction.
Taubin et al. [39] suggested the progressive forest split (PFS) representation for
manifold meshes. Similar to the PM representation [35], a triangle mesh is
represented with a low resolution base model and a series of refinement operations
in PFS. Instead of the vertex split operation, the PFS scheme exploits the forest
split operation as illustrated in Fig. 2.18. The forest split operation cuts a mesh
along the edges in the forest and fills in the resulting crevice with triangles. For
the sake of simplicity, the forest contains only one tree in Fig. 2.18. In practice, a
forest may be composed of many complex trees, and a single forest split operation
may double the number of triangles in a mesh. Therefore, PFS can obtain a much
higher compression ratio than PM at the cost of reduced granularity.
Fig. 2.17. Possible cases after a generalized vertex split for different-dimensional simplices
Fig. 2.18. Illustration of a forest split process. (a) The original mesh with a forest marked with
thick lines; (b) The cut of the original mesh along the forest edges; (c) Triangulation of the
crevice; (d) The cut mesh in (b) filled with the triangulation in (c)
For each forest split operation, the forest structure, the triangulation
information of the crevices and the vertex displacements are encoded. To encode
the forest structure, one bit is required for each edge indicating whether it belongs
to the forest or not. To encode the triangulation of the crevices, the triangle
spanning tree and the marching patterns can be adopted as in Taubin and
Rossignac’s algorithm [5], or a simple constant-length encoding scheme can be
employed, which requires exactly 2 bits per new triangle. To encode the vertex
displacements, a smoothing algorithm [40] is first applied after connectivity
refinement, and then the difference between the original vertex position and the
smoothed vertex position is Huffman-coded.
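As a rough illustration of the connectivity cost of one forest split operation, the Python sketch below tallies one bit per mesh edge for the forest mask plus two bits per new triangle under the simple constant-length triangulation scheme mentioned above; the Huffman-coded vertex displacements are not modeled, and the example numbers are hypothetical.

def forest_split_connectivity_bits(num_edges, num_new_triangles):
    forest_mask_bits = num_edges                  # one bit per mesh edge: in the forest or not
    triangulation_bits = 2 * num_new_triangles    # constant-length scheme: 2 bits per new triangle
    return forest_mask_bits + triangulation_bits

# Hypothetical refinement: a mesh with 3,000 edges gaining 1,000 new triangles.
print(forest_split_connectivity_bits(3000, 1000))   # 5000 bits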
With respect to the coding efficiency, to progressively encode a given mesh
with four or five LODs, PFS requires about 7–10 bpv for the connectivity data and
20–40 bpv for the geometry data at the 6-bit quantization resolution. Here, we
should point out that the bpv performance is measured with respect to the number
of vertices in the original mesh. PFS has been adopted in MPEG-4 3DMC [6] as
an optional mode for progressive mesh coding.
A different family of progressive coders is based on vertex decimation, which
iteratively removes a vertex and re-triangulates the resulting hole. The topology data record the way of
re-triangulation after each vertex is decimated, or equivalently, the neighborhood
of each new vertex before it is inserted.
Cohen-Or et al. [48] suggested the patch coloring algorithm for progressive
mesh compression based on vertex decimation. First, the original mesh is
simplified by iteratively decimating a set of vertices. At each iteration, decimated
vertices are selected such that they are not adjacent to one another. Each vertex
decimation results in a hole, which is then re-triangulated. The set of new triangles
filling in this hole is called a patch. By reversing the simplification process, a
hierarchical progressive reconstruction process can be obtained. In order to
identify the patches in the decoding process, two patch coloring techniques were
proposed: 4-coloring and 2-coloring. The 4-coloring scheme colors adjacent
patches with distinct colors, requiring 2 bits per triangle. It is applicable to patches
of any degree. The 2-coloring scheme further saves topology bits by coloring the
whole mesh with only two colors. It enforces the re-triangulation of each patch in
a zigzag manner and encodes the two outer triangles with the bit “1”, and the other
triangles with the bit “0”. Therefore, it requires only 1 bit per triangle but applies
only to the patches with a degree greater than 4. During the encoding process, at
each level of detail, either the 2-coloring or 4-coloring scheme is selected based on
the distribution of patch degrees. Then, the coloring bitstream is encoded with the
famous Ziv-Lempel coder. For geometry coding, the position of a new vertex is
simply predicted by averaging over its direct neighboring vertices. Experimentally,
this approach requires about 6 bpv for connectivity data and about 16–22 bpv for
geometry data at the 12-bit quantization resolution.
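The per-level choice between the two coloring schemes can be sketched in Python as follows; the selection rule (fall back to 4-coloring whenever some patch has degree 4 or less) is a straightforward reading of the constraints above, and the simple cost model ignores the final Ziv-Lempel pass.

def coloring_bits(patch_degrees, num_triangles):
    # 2-coloring needs 1 bit per triangle but works only if every patch has degree > 4.
    if all(degree > 4 for degree in patch_degrees):
        return '2-coloring', 1 * num_triangles
    # 4-coloring works for patches of any degree at 2 bits per triangle.
    return '4-coloring', 2 * num_triangles

print(coloring_bits([5, 6, 7], 2400))   # ('2-coloring', 2400)
print(coloring_bits([3, 6, 7], 2400))   # ('4-coloring', 4800)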
Alliez and Desbrun [49] proposed a progressive mesh coder for manifold 3D
meshes. Observing the fact that the entropy of mesh connectivity is dependent on
the distribution of vertex valences, they iteratively applied the valence-driven
decimating conquest and the cleaning conquest in pairs to obtain multiresolution
meshes. The vertex valences are output and entropy encoded during this process.
The decimating conquest is a mesh simplification process based on vertex
decimation. It only decimates vertices with valences not larger than 6 to maintain
a statistical concentration of valences around 6. In the decimating conquest, a 3D
mesh is traversed from patch to patch. A degree-n patch is a set of triangles
incident to a common vertex of valence n, and a gate is an oriented boundary edge
of a patch, storing the reference to its front vertex. The encoder enters a patch
through one of its boundary edges, called the input gate. If the front vertex of the
input gate has a valence not larger than 6, the encoder decimates the front vertex,
re-triangulates the remaining polygon, and outputs the front vertex valence. Then,
it pushes the other boundary edges, called output gates, into a FIFO list, and
replaces the current input gate with the next available gate in the FIFO list. This
Fig. 2.19. An example to explain valence-driven conquests. (a) The decimating conquest; (b)
The cleaning conquest; (c) The resulting mesh after the decimating conquest and the cleaning
conquest. The shaded areas represent the conquered patches and the thick lines represent the
gates. The gates to be processed are depicted in black, while the gates already processed are in
normal color. Each arrow represents the direction of entrance into a patch
From Fig. 2.19(c), we can see that the resulting mesh is also a 6-regular mesh, like the original mesh in Fig.
2.19(a). If an input mesh is irregular, it may not be completely covered by patches
in the decimating conquest. In such a case, null patches are generated. For
geometry coding, Alliez and Desbrun [49] adopted the barycentric prediction and
the approximate Frenet coordinate frame. The normal and the barycenter of a
patch approximate the tangent plane of the surface. Then, the position of the
inserted vertex is encoded as an offset from the tangent plane.
Experimentally, for connectivity coding, this scheme requires about 2–5 bpv,
3.7 bpv on average, which is about 40% lower than the results reported in [41, 48].
For geometry coding, the performance typically ranges from 10 to 16 bpv with
quantization resolutions between 10 and 12 bits. In particular, the geometry coding
rate is much less than 10 bpv for meshes with high-connectivity regularity and
geometry uniformity. Furthermore, this scheme has a comparable performance
with that of the state-of-the-art single-rate coder. This scheme yields a compressed
file size only about 1.1 times larger than Touma and Gotsman’s algorithm [19],
even though it supports full progressiveness.
Fig. 2.20. The multiplexing of topology and geometry data, where the zigzag lines illustrate the
bit order
In [51], Bajaj et al. generalized their single-rate mesh coder [17] based on layered
decomposition to a progressive mesh coder that is applicable to arbitrary meshes.
An input mesh is decomposed into layers of vertices and triangles. Then the mesh
is simplified through three stages: intra-layer simplification, inter-layer
simplification and generalized triangle contraction. The former two are topology-
preserving, whereas the last one may change the mesh topology.
The intra-layer simplification operation selects vertices to be removed from
each contour. After those vertices are removed, re-triangulation is performed in the
region between the simplified contour and its adjacent contours. A bit string is
encoded to indicate which vertices are removed, and extra bits are encoded to
reconstruct the original connectivity between the decimated vertex and its
neighbors in the refinement process.
In the inter-layer simplification stage, a contour can be totally removed. Then,
the two triangle strips sharing the removed contour are replaced by a single coarse
strip [52]. Fig. 2.21 illustrates the process of contour removal and re-triangulation.
A dashed line in Fig. 2.21(b), called a constraining chord, is associated with each
edge in the contour to be removed, which is illustrated with a thick line. The
simplification process is encoded as (0, 6, 2, 3, 1, 3), where the first bit indicates
whether the contour is open or closed, the second value denotes the number of
vertices in the removed contour, and the remaining values indicate the number of
triangles between every two consecutive constraining chords in the coarse strip.
Fig. 2.21. Illustration of the inter-layer simplification process. (a) The fine level; (b)
Constraining chords; (c) The coarse strip. Dashed lines depict constraining chords and thick lines
depict the contour to be removed
2.3.6 Summary
Table 2.3 Comparisons of bitrates for typical progressive connectivity coding algorithms

| Category | Algorithm | Bitrate C:G (Q) | Comment |
| Progressive meshes | Hoppe [35] | O(Nv log2 Nv) : N/A | |
| | Popovic and Hoppe [38] | O(Nv log2 Nv) : N/A | |
| | Taubin et al. [39] | (7–10) : (20–40) (6) | |
| | Pajarola and Rossignac [41] | 7 : (12–15) (8, 10, 12) | |
| Patch coloring | Cohen-Or et al. [48] | 6 : (16–22) (12) | |
| Valence-driven conquest | Alliez and Desbrun [49] | 3.7 : (10–16) (10, 12) | |
| Embedded coding | Li and Kuo [50] | O(Nv log2 Nv) : N/A | Embedded multiplexing |
| Layered decomposition | Bajaj et al. [51] | (10–17) : 30 (10, 12) | |

C and G denote the connectivity and geometry bitrates in bpv, respectively, and Q denotes the quantization resolution in bits.
2.4.2 Prediction
After the quantization of vertex coordinates, the resulting values are then typically
compressed by entropy coding after prediction relying on some data smoothness
assumptions. A prediction is a mathematical operation where future values of a
discrete-time signal are estimated as a certain function of previous samples. In 3D
mesh compression, the prediction step makes full use of the correlation between
adjacent vertex coordinates and it is most crucial in reducing the amount of
geometry data. A good prediction scheme produces prediction errors with a highly
skewed distribution, which are then encoded with entropy coders, such as the
Huffman coder or the arithmetic coder.
Different types of prediction schemes for 3D mesh geometry coding have been
proposed in the literature, such as delta prediction [12, 13], linear prediction [5],
parallelogram prediction [19] and second-order prediction [17]. All these
prediction methods can be treated as a special case of the linear prediction scheme
with carefully selected coefficients.
The early work employed simple delta coding or linear prediction along a vertex
ordering guided by connectivity coding. Delta coding or delta prediction is based
on the fact that adjacent vertices tend to have slightly different coordinates, and
the differences (or deltas) between them are usually very small. Deering’s work
[12] and Chow’s work [13] encode the deltas of coordinates instead of the original
coordinates with variable length codes according to the distribution of deltas.
Deering’s scheme adopts quantization resolutions between 10 and 16 bits per
coordinate component, and its coding cost is roughly between 17 and 36 bpv. In
Chow’s geometry coder, bitrates of 13–18 bpv can be achieved at quantization
resolutions of 9–12 bits per coordinate component.
Taubin and Rossignac [5] adopted a more general linear prediction scheme, in which
a new vertex position vn is estimated from the K previously traversed vertex positions as

v_n = \sum_{i=1}^{K} \lambda_i v_{n-i} + \varepsilon(n),    (2.8)

where the coefficients λ1, λ2, …, λK are carefully selected to minimize the mean square error

E = \mathrm{E}\left\{ \left\| v_n - \sum_{i=1}^{K} \lambda_i v_{n-i} \right\|^2 \right\}    (2.9)

and transmitted to the decoder as the side information. The bitrate of this method
is not directly reported in [5]. However, as estimated by Touma and Gotsman [19],
it costs about 13 bpv at the 8-bit quantization resolution. Note that the delta
prediction is a special case of linear prediction with K = 1 and λ1 = 1.
The approach proposed by Lee et al. [55] consists of quantizing in the angle
space after prediction. By applying different levels of precision while quantizing
the dihedral or the internal angles between or inside each facet, this method
achieves better visual appearance by allocating more precision to the dihedral
angles, since they are more related to the geometry and normals.
Touma and Gotsman [19] used a more sophisticated prediction scheme. To encode
a new vertex vn, it considers a triangle with two vertices v̂n−1 and v̂n−2 on the
active list, where the triangle (v̂n−1, v̂n−2, v̂n−3) has already been encoded, as shown in Fig. 2.22.
The parallelogram prediction assumes that the four vertices v̂n−1, v̂n−2, v̂n−3 and vn
form a parallelogram, so the new vertex position can be predicted as v̂n = v̂n−1 + v̂n−2 − v̂n−3.
This method performs well only if the four vertices are exactly or nearly co-planar.
To further improve the prediction accuracy, the crease angle between the two
triangles (v̂n−1, v̂n−2, v̂n−3) and (v̂n−1, v̂n−2, vn) can also be estimated using the crease
angle θ between the two triangles (v̂n−2, v̂n−3, v̂n−4) and (v̂n−2, v̂n−4, v̂n−5). In Fig. 2.22,
v′n is the predicted position of vn using the crease angle estimation. This work
achieves an average bitrate of 9 bpv at the 8-bit quantization resolution. The
parallelogram prediction is also a linear prediction in essence, since the predicted
vertex position is a linear combination of the three previously visited vertex positions.
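A minimal Python sketch of the parallelogram rule follows: the predictor is the linear combination v̂n−1 + v̂n−2 − v̂n−3, and only the residual between the true and predicted positions would subsequently be quantized and entropy coded. The crease angle correction is not modeled, and the coordinates below are arbitrary example values.

import numpy as np

def parallelogram_predict(v1, v2, v3):
    """Predict v_n from the already-decoded vertices v_{n-1}, v_{n-2}, v_{n-3}."""
    return v1 + v2 - v3

v_nm1 = np.array([1.0, 0.0, 0.0])
v_nm2 = np.array([0.0, 1.0, 0.0])
v_nm3 = np.array([0.0, 0.0, 0.0])
v_n   = np.array([1.1, 0.9, 0.1])            # actual position of the new vertex

prediction = parallelogram_predict(v_nm1, v_nm2, v_nm3)
residual = v_n - prediction                   # this small vector is what gets coded
print(prediction, residual)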
Inspired by the above TG parallelogram prediction scheme, Isenburg and
Alliez [56] generalized it to polygon mesh geometry compression. They let the
polygon information dictate where to apply the parallelogram rule that they use to
predict vertex positions. Since polygons tend to be fairly planar and fairly convex,
it is beneficial to make predictions within a polygon rather than across polygons.
Fig. 2.22. Illustration of the parallelogram prediction and the crease angle estimation (the figure shows the vertices v̂n−1 to v̂n−5, the crease angle θ, the new vertex vn and its crease-angle-based prediction v′n)
This, for example, avoids poor predictions due to a crease angle between polygons.
Up to 90% of the vertices can be predicted in this way. Their strategy improves
geometry compression performance by 10%40%, depending on how polygonal
the mesh is and the quality (planarity/convexity) of the polygons.
Since the traversal order is dictated by the connectivity coder and the prediction
from one polygon to the next is performed along this order, it cannot be expected
to do the best job.
The first approach to improve the prediction is called prediction trees [57],
where the geometry drives the traversal instead of the connectivity as before. This
is based on the solution of an optimization problem. In some cases, it results in a
reduction of up to 50% in the geometry code entropy, particularly in meshes with
significant creases and corners, e.g. CAD models. The main drawback of this
method is the complexity of the encoder. Due to the need to run an optimization
procedure at the encoder, it is up to one order of magnitude slower than, for
example, the TG encoder. The decoder, however, is very fast, so for many
applications where the encoding is done offline, the encoder speed is not an
impediment. Cohen-Or et al. [58] suggested a multi-way prediction technique,
where each vertex position was predicted from all its neighboring vertices, as
opposed to the one-way parallelogram prediction. In addition, an extreme
approach to prediction is the feature discovery approach by Shikhare et al. [59],
which removes the redundancy by detecting similar geometric patterns. However,
this technique works well only for a certain class of models and involves
expensive matching computations.
We now introduce progressive geometry coding schemes in this and the
next subsections. In most mesh compression techniques, geometry coding is
guided by the underlying connectivity coding. Gandoin and Devillers [60]
proposed a fundamentally different strategy, where connectivity coding is guided
by geometry coding. Their algorithm works in two passes: the first pass encodes
geometry data progressively without considering connectivity data. The second
pass encodes connectivity changes between two successive LODs. Their algorithm
can encode arbitrary simplicial complexes without any topological constraint.
For geometry coding, their algorithm employs a k-d tree decomposition based
on cell subdivisions [61]. At each iteration, it subdivides a cell into two child cells,
and then it encodes the number of vertices in one of the two child cells. If the
parent cell contains Nvp vertices, the number of vertices in one of the child cells
can be encoded using log2(Nvp+1) bits with the arithmetic coder [62]. This
subdivision is recursively applied, until each nonempty cell is small enough to
contain only one vertex and enables a sufficiently precise reconstruction of the
vertex position. Fig. 2.23 illustrates the geometry coding process based on a 2D
example. First, the total number of vertices, 7, is encoded using a fixed number of
bits (32 in this example). Then, the entire cell is divided vertically into two cells,
and the number of vertices in the left cell, 4, is encoded using log2(7+1) bits. Note
that the number of vertices in the right cell is not encoded, since it is deducible
from the number of vertices in the entire cell and the number of vertices in the left
cell. The left and right cells are then horizontally divided, respectively, and the
numbers of vertices in the upper cells are encoded, and so on. To improve the
coding gain, the number of vertices in a cell can be predicted from the point
distribution in its neighborhood.
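The Python sketch below tallies the idealized coding cost of this recursive subdivision for a toy 1D point set: each subdivision of a cell with n vertices is charged log2(n + 1) bits for the child-cell count. Prediction from neighboring cells, the initial fixed-length vertex count, and the extra subdivisions needed to reach full coordinate precision are all omitted, so this only illustrates the cost model, not Gandoin and Devillers' coder.

import math

def kd_coding_bits(points, lo, hi, depth):
    n = len(points)
    if n <= 1 or depth == 0:                 # cell resolved: at most one vertex left
        return 0.0
    mid = (lo + hi) / 2.0
    left = [p for p in points if p < mid]
    right = [p for p in points if p >= mid]
    bits = math.log2(n + 1)                  # encode |left|; |right| is then implied
    return bits + kd_coding_bits(left, lo, mid, depth - 1) \
                + kd_coding_bits(right, mid, hi, depth - 1)

print(round(kd_coding_bits([0.1, 0.4, 0.6, 0.9], 0.0, 1.0, 8), 2))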
For connectivity coding, their algorithm encodes the topology change after
each cell subdivision using one of two operations: vertex split [35] or generalized
vertex split [38]. Specifically, after each cell subdivision, the connectivity coder
records a symbol, indicating which operation is used, and parameters specific to
that operation. Compared to [35, 38], their algorithm has the advantage that split
vertices are implicitly determined by the subdivision order given in geometry
coding, resulting in a reduction in the topology coding cost. Moreover, to improve
the coding gain further, they proposed several rules, which predict the parameters
for vertex split operations efficiently using already encoded geometry data.
On average, this scheme requires 3.5 bpv for connectivity coding and 15.7 bpv
for geometry coding at the 10-bit or 12-bit quantization resolution, which
outperforms progressive mesh coders presented in [44, 49]. This scheme is even
comparable to the single-rate mesh coder given in [19], achieving a full
progressiveness at a cost of only 5% overhead bitrate. It is also worthwhile to
point out that this scheme is especially useful for terrain models and densely
sampled objects, where topology data can be losslessly reconstructed from
geometry data. Besides its good coding gain, it can be easily extended to compress
tetrahedral meshes.
Peng and Kuo [63] proposed a progressive lossless mesh coder based on the octree
decomposition, which can encode triangle meshes with arbitrary topology. Given a
3D mesh, an octree structure is first constructed through recursive partitioning of
the bounding box. The mesh coder traverses the octree in a top-down fashion and
encodes the local changes of geometry and connectivity associated with each
octree cell subdivision.
In [63], the geometry coder does not encode the vertex number in each cell,
but encodes the information whether each cell is empty or not, which is usually
more concise in the top levels of the octree. For connectivity coding, a uniform
approach is adopted, which is efficient and easily extendable to arbitrary
polygonal meshes.
For each octree cell subdivision, the geometry coder encodes the number T
(1 ≤ T ≤ 8) of non-empty child cells and the configuration of the non-empty child
cells among KT = C(8, T) possible combinations. When the data are encoded
straightforwardly, T takes 3 bits and the non-empty-child-cell configuration takes
log2KT bits. To further improve the coding efficiency, T is arithmetic coded using
the context of the parent cell’s octree level and valence, resulting in a 30%–50%
bitrate reduction. Furthermore, all KT possible configurations are sorted according
to their estimated probability values, and the index of the configuration in the
sorted array is arithmetic coded. The probability estimation is based on the
observation that non-empty-child cells tend to gather around the centroid of the
parent-cell’s neighbors. This technique leads to a more than 20% improvement.
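For the straightforward (context-free) variant, the per-subdivision cost is easy to write down: 3 bits for T plus log2 C(8, T) bits for the configuration index, as the short Python sketch below computes. The context-based arithmetic coding that yields the reported savings is not modeled here.

from math import comb, log2

def straightforward_subdivision_bits(T):
    assert 1 <= T <= 8
    return 3 + log2(comb(8, T))   # 3 bits for T, log2 C(8, T) bits for the configuration

for T in (1, 4, 8):
    print(T, round(straightforward_subdivision_bits(T), 2))
# 1 -> 6.0 bits, 4 -> 9.13 bits, 8 -> 3.0 bits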
For the connectivity coding, each octree cell subdivision is simulated by a
sequence of k-d tree cell subdivisions. Each vertex split corresponds to a k-d tree
cell subdivision, which generates two non-empty child cells. Let the vertex to be
split be denoted by v, the neighboring vertices before the vertex split by P =
{p1, p2, …, pK} and the two new vertices from the vertex split by v1 and v2. Then,
the following information will be encoded: (1) Vertices among P that are
connected to both v1 and v2 (called the pivot vertices); (2) Whether each non-pivot
vertex in P is connected to v1 or v2; and (3) Whether v1 and v2 are connected in the
refined mesh. During the coding process, a triangle regularity metric is used to
predict each neighboring vertex’s probability of being a pivot vertex, and a spatial
distance metric is used to predict the connectivity of non-pivot neighbor vertices
to the new vertices. At the decoder side, the facets are constructed from the
edge-based connectivity without an extra coding cost. To further improve the R-D
performance, the prioritized cell subdivision is applied. Higher priorities are given
to cells of a bigger size, a bigger valence and a larger distance from neighbors.
The octree-based mesh coder outperforms the k-d tree algorithm [60] in both
geometry and connectivity coding efficiency. For geometry coding, it provides
about a 10%–20% improvement for typical meshes, but up to a 50%–60%
improvement for meshes with highly regular geometry data and/or tightly
clustered vertices. With respect to connectivity coding, the improvement ranges
from 10% to 60%.
Transform coding is a type of data compression for “natural” data like audio
signals or photographic images [64]. The transformation is typically lossy,
resulting in a lower quality copy of the original input. In transform coding,
knowledge of the application is used to choose information to discard, thereby
lowering its bandwidth. The remaining information can then be compressed using
a variety of methods. When the output is decoded, the result may not be identical
to the original input, but is expected to be close enough for the purpose of
applications. The discrete cosine transform (DCT) or the discrete Fourier transform
(DFT) is often used to transform a sequence of source samples into another sequence
of transform coefficients, whose energy is concentrated in relatively few
low-frequency coefficients. Thus, a high compression ratio can be obtained if we
encode only the low-frequency coefficients while discarding the higher-frequency ones. The common
JPEG image format is an example of transform coding, one that examines small
blocks of the image and “averages out” the color using a discrete cosine transform
to form an image with far fewer colors in total. MPEG modifies this across frames
in a motion image, further reducing the size compared to a series of JPEGs. MPEG
audio compression analyzes the transformed data according to a psychoacoustic
model that describes the human ear’s sensitivity to parts of the signal, similar to
the TV model. In this section, we briefly introduce several typical 3D mesh
geometry compression methods based on DFT and wavelet transforms. Some are
single-rate compression techniques, and others are progressive schemes.
Karni and Gotsman [65] used the spectral theory on meshes [40] to compress
geometry data. It is a single-rate geometry compression scheme. Suppose that a
mesh consists of Nv vertices. Then the mesh Laplacian matrix L of size Nv u Nv is
derived from the mesh connectivity as follows:
L_{ij} = \begin{cases} 1, & i = j; \\ -1/d_i, & i \text{ and } j \text{ are adjacent}; \\ 0, & \text{otherwise}, \end{cases}    (2.11)

where di denotes the valence of vertex i. The vertex coordinates are then projected
onto the eigenvectors of L, and the resulting spectral coefficients are quantized and
entropy coded, with most of the energy concentrated in the low-frequency
coefficients. Since computing the eigenvectors of L is too expensive for large
meshes, the input mesh is partitioned into
several segments and each segment can be independently encoded. However, the
eigenvectors should be computed in the decoder as well. Thus, even though the
partitioning is incorporated, the decoding complexity is too high for real-time
applications. To alleviate this problem, Karni and Gotsman [66] proposed to use
fixed basis functions, which are computed from a 6-regular connectivity. Those
basis functions are actually the Fourier basis functions. Therefore, the encoding
and decoding processes can be performed with the fast Fourier transform (FFT)
efficiently. Before encoding, the connectivity of an input mesh is mapped into a
6-regular connectivity. No geometry information is used during the mapping. Thus,
the decoder can perform the same mapping with separately received connectivity
data and determine the correct ordering of vertices. The exploitation of fixed basis
functions is obviously not optimal, but provides an acceptable performance at
much lower complexity.
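The Python sketch below builds the Laplacian of Eq. (2.11) for a toy mesh and projects the coordinate signals onto its eigenvectors; keeping and quantizing only the leading (low-frequency) coefficients would then give the lossy spectral code. The symmetrization step is a numerical convenience of this sketch, not part of Karni and Gotsman's formulation.

import numpy as np

def mesh_laplacian(adjacency):
    n = len(adjacency)
    L = np.eye(n)
    for i, neighbors in adjacency.items():
        for j in neighbors:
            L[i, j] = -1.0 / len(neighbors)   # -1/d_i for adjacent vertices, as in Eq. (2.11)
    return L

# Tetrahedron connectivity and vertex positions (rows: vertices, columns: x, y, z).
adjacency = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
positions = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])

L = mesh_laplacian(adjacency)
_, eigvecs = np.linalg.eigh((L + L.T) / 2.0)   # symmetrize for a stable eigenbasis
coeffs = eigvecs.T @ positions                  # spectral coefficients of the x, y, z signals
print(coeffs.round(3))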
In addition, Sorkine et al. [67] addressed the issue of reducing the visual
effect of quantization errors. Considering the fact that the human visual system
is more sensitive to normal distortion than to geometric distortion, they proposed
to apply quantization not in the coordinate space as usual, but rather in a
transformed coordinate space obtained by applying a so-called “k-anchor
invertible Laplacian transformation” over the original vertex coordinates. This
concentrates the quantization error at the low-frequency end of the spectrum,
thus preserving the normal variations over the surface, even after aggressive
quantization. To avoid significant low-frequency errors, a set of anchor vertex
positions are also selected to “nail down” the geometry at a selected number of
vertex locations.
It is well known from image coding that wavelet representations are very effective
in decorrelating the original data, greatly facilitating subsequent entropy coding.
In essence, coarser level data provides excellent predictors for finer level data,
leaving only generally small prediction residuals for the coding step. For tensor
product surfaces, many of these ideas can be applied in a straightforward fashion.
However, the arbitrary topology surface case is much more challenging. To begin
with, wavelet decompositions of general surfaces were not known until the
pioneering work by Lounsbery [68]. These constructions were subsequently
applied to progressive approximation of surfaces as well as data on surfaces.
Khodakovsky et al. [69] proposed a progressive geometry compression (PGC)
algorithm based on the wavelet transform. It first remeshes an arbitrary manifold
mesh M into a semi-regular mesh, where most vertices are of degree 6, using the
MAPS algorithm [70]. MAPS generates a semi-regular approximation of M by
finding a coarse base mesh and successively subdividing each triangle into four
triangles. Fig. 2.24 shows a remeshing example. In this figure, vertices within the
region bounded by white curves in Fig. 2.24(a) are projected onto a base triangle.
These projected vertices are depicted by black dots in Fig. 2.24(b). Each vertex
projected onto the base triangle contains the information of the original vertex
position. By interpolating these original vertex positions, each subdivision point
can be mapped approximately to a point (not necessarily a vertex) in the original
mesh. Note that the connectivity information of the semi-regular mesh can be
efficiently encoded, since it can be reconstructed using only the connectivity of the
base mesh and the number of subdivisions. However, this algorithm attempts to
preserve only the geometry information. Thus, the original connectivity of M
cannot be reconstructed at the decoder.
Fig. 2.24. A remeshing example [2]. (a) An irregular mesh; (b) The corresponding base mesh;
(c) The corresponding semi-regular mesh. Triangles are illustrated with a normal flipping pattern
to clarify the semi-regular connectivity (With permission of Elsevier)
Based on the Loop algorithm [71], this algorithm then represents the
semi-regular mesh geometry with the base mesh geometry and a sequence of
wavelet coefficients. These coefficients represent the differences between
successive LODs with a concentrated distribution around zero, which is suitable
for entropy coding. The wavelet coefficients are encoded using a zerotree
approach, introducing progressiveness into the geometry data. More specifically,
they modified the SPIHT algorithm [72], which is one of the successful 2D image
coders, to compress the Loop wavelet coefficients. Their algorithm provides about
12 dB (i.e., about four times) better quality than CPM [41], and an even better
performance than Touma and Gotsman’s single-rate coder [19]. This is mainly due
performance than Touma and Gotsman’s single-rate coder [19]. This is mainly due
to the fact that they employed semi-regular meshes, enabling the wavelet coding
approach.
Khodakovsky and Guskov [73] later proposed another wavelet coder based on
the normal mesh representation [74]. In the subdivision, their algorithm restricts
the offset vector to lie in the normal direction of the surface. Therefore,
whereas 3D coefficients are used in [69], 1D coefficients are used in the normal
mesh algorithm. Furthermore, their algorithm employs the uplifted version of
butterfly wavelets [42, 43] as the transform. As a result, it achieves about 2–5 dB
quality improvement over that in [69].
In addition, Payan and Antonini [75] proposed an efficient low complexity
compression scheme for densely sampled irregular 3D meshes. This scheme is
based on 3D multiresolution analysis (3D discrete wavelet transform) and includes
one and eliminating redundant points. Their method for constructing the wavelet
transform requires three steps: vertex split, prediction and update. With respect to
zerotree coding, they adopted a new approach. In their approach, vertices do not
have a tree structure, but the edges and faces do. Each edge and each face is the
parent of four edges of the same orientation in the finer mesh. Hence, each edge
and face of the coarsest domain mesh forms the root of each zerotree, and it
groups all the wavelet coefficients of a fixed wavelet subband from its incident
base domain faces. No coefficient is accounted for multiple times or left out by
this grouping.
Surface geometry is often modeled with irregular triangle meshes. The process of
remeshing refers to approximating such geometry using a mesh with
(semi)-regular connectivity, which has advantages for many graphics applications.
However, current techniques for remeshing arbitrary surfaces create only
semi-regular meshes. The original mesh is typically decomposed into a set of
disk-like charts, onto which the geometry is parameterized and sampled. Unlike
this approach, Gu et al. [79] proposed to remesh an arbitrary surface onto a
completely regular structure called a geometry image. It captures geometry as a
simple 2D array of quantized points. Surface signals like normals and colors are
stored in similar 2D arrays using the same implicit surface parameterization,
where texture coordinates are absent. Each pixel value in the geometry image
represents a 3D position vector (x, y, z). Fig. 2.26 shows the geometry image of
the Stanford Bunny. Due to its regular structure, the geometry image
representation can facilitate the compression and rendering of 3D data.
Fig. 2.26. The geometry image of the Stanford Bunny. (a) The Stanford Bunny; (b) Its
geometry image
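Conceptually, a geometry image is just a regular 2D array whose entries are (quantized) 3D positions, so off-the-shelf image tools can operate on it. The Python sketch below builds a small synthetic geometry image and quantizes it to 12 bits per component; it illustrates the data layout only, not the cut-and-parameterize pipeline described next.

import numpy as np

height, width = 64, 64
u, v = np.meshgrid(np.linspace(0.0, 1.0, width), np.linspace(0.0, 1.0, height))
z = 0.05 * (np.sin(6.0 * u) * np.cos(6.0 * v) + 1.0)   # a synthetic height field in [0, 0.1]
geometry_image = np.stack([u, v, z], axis=-1)           # each pixel is an (x, y, z) sample

# Quantize each coordinate component to 12 bits, as an image codec would expect.
quantized = np.round(geometry_image * (2 ** 12 - 1)).astype(np.uint16)
print(quantized.shape)                                  # (64, 64, 3)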
To generate the geometry image, an input manifold mesh is cut and opened to
be homeomorphic to a disk. The cut mesh is then parameterized onto a 2D square,
which is in turn regularly sampled. In the cut process, an initial cut is first selected
and then iteratively refined. At each iteration, it selects a vertex of the triangle
with the biggest geometric stretch and inserts the path, connecting the selected
vertex to the previous cut, into the refined cut. After the final cut is determined,
the boundary of the square domain is parameterized with special constraints to
prevent cracks along the cut, and the interior is parameterized using
geometry-stretch parameterization in [80], which attempts to distribute vertex
samples evenly over the 3D surface.
Geometry images can be compressed using standard 2D image compression
techniques, such as wavelet-based coders. To seamlessly zip the cut in the
reconstructed 3D surface, especially when the geometry image is compressed in a
lossy manner, it encodes the sideband signal, which records the topological
structure of the cut boundary and its alignment with the boundary of the square
domain.
The geometry image compression provides about 3 dB worse R-D
performance than the wavelet mesh coder [69]. Also, since it maps complex 3D
shapes onto a simple square, it may yield large distortions for high-genus meshes
and unwanted smoothing of 3D features. References [81] and [82] proposed an
approach to parameterize a manifold 3D mesh with genus 0 onto a spherical
domain. Compared with the square domain approach [79], this approach leads to a
simple cut topology and an easy-to-extend image boundary. It was shown by
experiments that the spherical geometry image coder achieves better R-D
performance than the square domain approach [79] and the wavelet mesh coder
[69], but slightly worse performance than the normal mesh coder [73].
2.5.4 Summary
Most of the transform-based coding schemes introduced in this section can
deal with manifold triangular meshes only. In the wavelet coding methods [69, 73]
and the geometry image coding methods [79, 81, 82], the original connectivity is
lost due to the remeshing procedure.
Recently, vector quantization (VQ) has been proposed for geometry compression,
which does not follow the conventional “quantization+prediction+entropy coding”
approach. The conventional approach pre-quantizes each vertex coordinate using a
scalar quantizer and then predictively encodes the quantized coordinates. In
contrast, typical VQ approaches first predict vertex positions and then jointly
compress the three components of each prediction residual. Thus, it can utilize the
correlation between different coordinate components of the residual. Compared
with scalar quantization, the main advantages of VQ include a superior
rate-distortion performance, more freedom in choosing shapes of quantization
cells, and better exploitation of redundancy between vector components. In this
section, we first introduce some basic concepts of VQ and then introduce several
typical VQ-based geometry compression methods.
In the encoding phase of VQ, each k-dimensional input vector x is compared with
all the codewords cj in a codebook of size N, and the best-matched codeword ci is
found such that

d(x, c_i) = \min_{0 \le j \le N-1} d(x, c_j),    (2.12)

where the distortion is usually measured by the squared Euclidean distance

d(x, c_j) = \sum_{l=1}^{k} (x_l - c_{jl})^2.    (2.13)

Then the index i of the best matching codeword assigned to the input vector x
is transmitted over the channel to the decoder. The decoder has the same codebook
is transmitted over the channel to the decoder. The decoder has the same codebook
as the encoder. In the decoding phase, for each index i, the decoder merely
performs a simple table look-up operation to obtain ci and then uses ci to
reconstruct the input vector x. Compression is achieved by transmitting or storing
the index of a codeword rather than the codeword itself. The compression ratio is
determined by the codebook size and the dimension of the input vectors, and the
overall distortion is dependent on the codebook size and the selection of
codewords.
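The basic encode/decode loop of Eqs. (2.12) and (2.13) amounts to a nearest-neighbor search followed by a table lookup, as in the short Python sketch below; the toy codebook is illustrative and is not trained with the generalized Lloyd algorithm.

import numpy as np

codebook = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.5, 0.5, 0.5]])

def vq_encode(x, codebook):
    distortions = np.sum((codebook - x) ** 2, axis=1)   # squared Euclidean, Eq. (2.13)
    return int(np.argmin(distortions))                   # best-match index, Eq. (2.12)

def vq_decode(index, codebook):
    return codebook[index]                               # simple table lookup

x = np.array([0.45, 0.55, 0.4])
i = vq_encode(x, codebook)
print(i, vq_decode(i, codebook))   # 3 [0.5 0.5 0.5]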
In Lee and Ko’s work [84], the Cartesian coordinates of a vertex were transformed
into a model space vector using the three previous vertex positions. In fact, the
model space transformation is a kind of prediction and the model space vector can
be regarded as a prediction residual. Then the model space vector was quantized
using the generalized Lloyd algorithm [83]. Since they used the original positions
of previous vertices in the model space transform, the quantization errors will be
accumulated in the decoder. To overcome this encoder-decoder mismatch problem,
they periodically inserted correction vectors into the bitstream. Experimentally,
this scheme requires about 6.7 bpv on average to achieve the same visual quality
as conventional methods at 8-bit quantization resolution. Note that Touma and
Gotsman’s work requires about 9 bpv at 8-bit resolution [19]. This method is
especially efficient for 3D meshes with high-geometry regularity.
A fast codevector search can be performed by rejecting impossible codevectors in a
transformed space. The aim is to find an appropriate set of orthonormal base
vectors V = {v1, v2, …, vk} so that the coefficient along each base vector provides a
criterion for rejecting impossible codevectors. The possible nearest codevectors for
an input vector x lie inside the hypersphere centered at x with radius dmin, where
dmin is the distortion between x and the current best-matched codevector. This
hypersphere can be enclosed by k pairs of parallel hyperplanes tangential to it in
the Euclidean space Rk, and these hyperplanes form a hypercube containing the
hypersphere, thus reducing the search space to a great extent. It follows that if we
select the k unit normal vectors of these hyperplanes as V, we can reject impossible
codevectors according to each component of the transformed input vector X.
In Li and Lu’s work [89], 3D meshes are vector quantized based on the
parallelogram prediction, so each input vector is a 3D residual vector. They set V
to be the unit normal vectors of the 3 pairs of parallel hyperplanes enclosing the
sphere on which all the possible nearest codevectors lie, i.e., v1 = (1/\sqrt{3}, 1/\sqrt{3}, 1/\sqrt{3}),
v2 = (1/\sqrt{6}, 1/\sqrt{6}, -2/\sqrt{6}) and v3 = (1/\sqrt{2}, -1/\sqrt{2}, 0). So the kick-out conditions for
judging possible nearest codevectors are:
X_{i,\min} \le Y_{ji} \le X_{i,\max}, \quad i = 1, 2, 3,    (2.14)

where Yj = (Yj1, Yj2, Yj3) is the coefficient vector of yj in the transformed space and

X_{i,\min} = X_i - d_{\min},    (2.15)

X_{i,\max} = X_i + d_{\min}.    (2.16)
2.6.4.1 Preprocessing
The first step is to transform each codevector of the codebook into the space with
base vectors V = {v1, v2, v3} in order that each input vector can be quantized in the
transformed space with the transformed codebook. This process involves 3N
multiplications and 6N additions.
Then, the transformed codevectors are sorted in the ascending order of their
first elements, i.e., the coefficients along the base vector v1.
Step 1: To carry out the codevector search in the transformed space, we first
perform the transformation on the input vector x to obtain X. This process
involves 3 multiplications and 6 additions.
Step 2: A probable nearby codevector Yj is guessed, based on the minimum
first element difference criterion. This is easy to implement with the bisection
technique. dmin, Xi,min and Xi,max are calculated.
Step 3: For each codevector Yj, we check whether Eq. (2.14) is satisfied. If not, then
Yj is rejected, thus discarding those codevectors that are far away from X and
reducing the search space to a cube containing the sphere centered at X with
radius dmin; otherwise we proceed to the next step.
Step 4: If Yj is not rejected in the third step, then d(X, Yj) is calculated. If d(X, Yj) <
dmin, then the current closest codevector to X is taken as Yj, with dmin set to
d(X, Yj), and Xi,min and Xi,max are updated accordingly. The procedure is repeated
until we arrive at the best-matched codevector Yp for X.
Step 5: Inversely transform Yp to yp in the original space. This process needs 3
multiplications and 6 additions.
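To make the procedure concrete, the following Python sketch implements a full-search-equivalent codevector search with the kick-out test of Eq.(2.14). It is only an illustration of the idea described above, not Li and Lu's actual implementation; the base vectors and function names are assumptions taken from the text.

    import numpy as np

    # Orthonormal base vectors assumed from the construction described above.
    V = np.array([
        [1/np.sqrt(3),  1/np.sqrt(3),  1/np.sqrt(3)],
        [1/np.sqrt(6),  1/np.sqrt(6), -2/np.sqrt(6)],
        [1/np.sqrt(2), -1/np.sqrt(2),  0.0],
    ])

    def preprocess(codebook):
        """Transform the codebook into the new base and sort by the first coefficient."""
        Y = codebook @ V.T          # the paper exploits the structure of V to reduce this cost
        order = np.argsort(Y[:, 0])
        return Y[order], order

    def fast_search(x, Y_sorted, order):
        """Full-search-equivalent nearest codevector search with kick-out tests."""
        X = V @ x                                   # Step 1: transform the input vector
        # Step 2: initial guess = codevector with the closest first coefficient (bisection)
        j = np.searchsorted(Y_sorted[:, 0], X[0])
        j = min(max(j, 0), len(Y_sorted) - 1)
        best = j
        d_min = np.linalg.norm(X - Y_sorted[j])
        # Steps 3-4: test every codevector against the cube [X - d_min, X + d_min]
        for k in range(len(Y_sorted)):
            Yk = Y_sorted[k]
            if np.any(Yk < X - d_min) or np.any(Yk > X + d_min):
                continue                            # rejected by Eq. (2.14)
            d = np.linalg.norm(X - Yk)
            if d < d_min:
                d_min, best = d, k                  # shrink the search cube
        return order[best], d_min                   # index in the original codebook

    # toy usage
    codebook = np.random.randn(256, 3)
    Y_sorted, order = preprocess(codebook)
    idx, dist = fast_search(np.random.randn(3), Y_sorted, order)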
In the codevector search process, we expect the “so far” dmin to be as small as
possible, so that impossible codevectors can be rejected with less computation.
The projection of x on v1 is proportional to the mean of x, so it has a clear
physical meaning and is regarded as the best single value to represent x. In this
sense, the initial dmin obtained in Step 2 tends to be small, and further rejection of
codevectors based on Eq.(2.14) is more likely to occur.
It is obvious that this fast method can be extended to VQ in a Euclidean space
of any dimension by finding an orthonormal transform of the original space. The
number of kick-out conditions for the nearest codevector can be equal to or less
than the dimension of the space.
The computational efficiency of the proposed algorithm in compressing 3D
mesh geometry data, in comparison to the PDS [90], ENNS [91] and EENNS [92]
algorithms, was evaluated in [89]. In the fast VQ scheme [89], 20 meshes were
randomly selected from the well-known Princeton 3D mesh library and 42,507 3D
residual vectors were generated from these meshes based on the parallelogram
prediction. The residual vectors were then used to generate codebooks of sizes
256, 1,024 and 8,192. Table 2.5 shows the time needed for quantizing the
geometry of two 3D mesh models, Stanford Dragon (100,250 vertices and 202,520
triangles) and Stanford Bunny (35,947 vertices and 69,451 triangles). The time is
the average over three experiments. The encoding qualities for different codebooks
are also shown. The coding quality remains the same for all the algorithms since
they are full-search equivalent. No extra memory is demanded for Full Search (FS),
PDS and Li and Lu's approach, while ENNS and EENNS need N and 2N pre-stored
float values respectively, where N is the size of the codebook. The platform was
Visual C++ 6.0 on a 2.0 GHz PC.
The search efficiency is evaluated as the ratio of the average number of
Euclidean distance computations to the codebook size, as shown in Table 2.6.
This ratio serves as a baseline that, unlike encoding time, excludes the effect of
programming skill, but it ignores the online computation required for rejecting
non-winner codevectors. A smaller ratio is better.
Table 2.5  Performance comparison among the algorithms on the time used to quantize the
Dragon and Bunny meshes

Mesh    Codebook size  PSNR (dB)  Time (s)
                                  FS      PDS     ENNS   EENNS  Li and Lu's approach
Dragon  256            41.00      1.45    0.86    0.25   0.28   0.15
        1,024          48.25      5.34    2.89    0.44   0.41   0.20
        8,192          56.40      43.12   26.13   1.58   0.95   0.55
Bunny   256            41.72      0.49    0.30    0.08   0.09   0.04
        1,024          49.96      1.94    1.02    0.16   0.14   0.07
        8,192          58.47      15.41   10.70   0.50   0.27   0.17
Table 2.6  Ratio of the reduced search space after each check step compared to FS (100%) for
the Dragon and Bunny meshes

Mesh    Codebook size  PDS     ENNS   EENNS  Li and Lu's approach
Dragon  256            11.90   7.60   3.00   1.52
        1,024          3.67    3.65   1.00   0.43
        8,192          5.43    1.83   0.26   0.08
Bunny   256            11.26   7.20   2.79   1.50
        1,024          3.59    3.19   0.84   0.40
        8,192          5.31    1.47   0.19   0.07
As is evident in Table 2.5 and Table 2.6, Li and Lu's approach [89] is
computationally efficient in terms of both encoding time and search space
reduction, compared with state-of-the-art fast search algorithms that can be
extended to mesh VQ.
In DRCVQ (dynamically restricted codebook based VQ) [93], a parameter is used
to control the encoding quality so as to reach the desired compression rate within
a range using only one codebook, instead of using codebooks of different levels to
obtain different compression rates. During the encoding process, the indexes of
the preceding encoded residual vectors which have high correlation with the
current input vector are pre-stored in an FIFO buffer, so both the codevector
search range and the bit rate are reduced on average. The proposed scheme also
incorporates a very effective Laplacian smoothing operator. A unique feature of
this scheme is its adjustable parameter, with which the user can conveniently
obtain the desired rate-distortion performance without re-encoding the vertex data
with a codebook of another quality level. In addition, it is compatible with most
of the existing algorithms for geometry data compression; combined with other
schemes, the rate-distortion performance may be further improved.
The DRCVQ approach uses a fixed-length first-in-first-out (FIFO) buffer to
store the previously encoded codevector indexes. The sequence of vertices
encountered during the mesh traversal defines which vector is to be coded, and
the correlation between the codevectors of the processed input vectors is also
exploited. When the encoding procedure begins, the approach initializes the FIFO
to be empty, and then appends the index of each encoded vertex to the buffer if it
is not already found in the buffer.
Using a fixed-length FIFO, the codevector search range of an input vector can
be reduced and so can the bit rate, as illustrated as follows. First we define the
stationary codebook C0, which has N0 codevectors, and its restricted part C1. The
restricted codebook C1 contains the N1 most likely codevector indexes when the
stationary codebook C0 is applied to the source. Here, the restricted codebook C1
is dynamic for each encoded vertex and is regenerated by buffering a series of
codevector indexes, since the statistics of the ongoing sequence of vectors may
undergo a sudden and substantial change. As each input vector is encoded using
codebook C0, there are in total N0 possible codevector indexes for each input
vector. If the input vectors are highly correlated, an input vector can often be
specified by one of the codevector indexes in C1, so that log2N1 bits are sufficient
to represent it instead of log2N0 bits. Since N1 is normally much smaller than N0,
the bpv can be greatly reduced.
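As an illustration of how the FIFO shortens the index code, the following minimal Python sketch (with hypothetical function names) counts the bits spent when an index is found in the restricted codebook C1 versus the stationary codebook C0; the real DRCVQ additionally uses the distortion threshold T discussed below.

    import math
    from collections import deque

    def encode_index(full_index, fifo, n0, n1):
        """Return (code, bits): 1 flag bit tells whether the index was found in the FIFO (C1)."""
        if full_index in fifo:
            bits = 1 + math.ceil(math.log2(n1))   # flag + log2(N1) bits for the FIFO position
            code = ('C1', list(fifo).index(full_index))
        else:
            bits = 1 + math.ceil(math.log2(n0))   # flag + log2(N0) bits for the global index
            code = ('C0', full_index)
            fifo.append(full_index)               # deque discards the oldest entry when full
        return code, bits

    # toy usage: a highly correlated index stream
    n0, n1 = 8192, 16
    fifo = deque(maxlen=n1)
    stream = [5, 5, 7, 5, 7, 7, 1200, 5, 7]
    total = sum(encode_index(i, fifo, n0, n1)[1] for i in stream)
    print(f"{total} bits instead of {len(stream) * math.ceil(math.log2(n0))} bits")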
The first issue in designing a VQ scheme for compressing any kind of source is
how to map the source data into a vector sequence as the input of the vector
quantizer. For 2D signals such as images, the vector sequence is commonly
formed from blocks of neighboring pixels, and the blocks can be directly used as
input vectors for the quantizer. In the case of triangle meshes, neighboring vertices
are also likely to be correlated. However, blocking multiple vertices is not as
straightforward as in the image case. The coordinate vector of a vertex cannot be
directly regarded as an input vector to the quantizer, because if multiple vertices
were mapped onto the same codevector (i.e., the same reconstructed position), the
distortion of the mesh would be unacceptable and the mesh connectivity would
degenerate.
Since the principle of the vector quantizer design method remains the same in
both ordinary VQ and DRCVQ, we only discuss ordinary VQ here. In order to
exploit the correlation between vertices, it is necessary to use a vector quantizer
with memory. Thus, Lu and Li [93] employed predictive vector quantization (PVQ).
The index identifying the residual vector in PVQ is then stored or transmitted to
the decoder. There are two components in a PVQ system: prediction and residual
vector quantization. We first discuss the design of the predictor. The goal of the
predictor is to minimize the variance of the residuals while maintaining low
computational complexity, allowing the residuals to be coded more efficiently by
the vector quantizer.
Lu and Li [93] used the “parallelogram” prediction illustrated in Fig. 2.22. The
three vertices of the initial triangle in the traversal order are uniformly scalar
quantized at 10 bits per coordinate and then Huffman encoded. Any other vertex
can be predicted from its neighboring triangles, exploiting the tendency of
neighboring triangles to be roughly coplanar and similar in size. This is
particularly true for high-resolution scanned models, which have little variation in
triangle size. As shown in Fig. 2.22 and Eq.(2.10), the prediction error between vn
and its prediction may be accumulated in the subsequent predictions.
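The parallelogram rule itself is simple; the following sketch (an illustration with hypothetical variable names) predicts a vertex from an adjacent decoded triangle and forms the residual vector that is subsequently vector quantized.

    import numpy as np

    def parallelogram_predict(a, b, c):
        """Predict the vertex opposite vertex a of the decoded triangle (a, b, c):
        the prediction completes the parallelogram, assuming near-coplanar,
        similar-sized triangles."""
        return b + c - a

    # toy usage: encode a vertex as a residual against its prediction
    a = np.array([0.0, 0.0, 0.0])
    b = np.array([1.0, 0.0, 0.0])
    c = np.array([0.0, 1.0, 0.0])
    v_true = np.array([1.05, 0.98, 0.02])
    residual = v_true - parallelogram_predict(a, b, c)   # small residual -> cheap to quantize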
In order to achieve the desired compression ratio, Lu and Li assumed that some
applications can tolerate a small degradation of PSNR to reduce the bpv. They set
a threshold T as the parameter to control the PSNR degradation. Note that T is a
parameter for additional distortion control, because the compression is always
lossy due to the restriction to N0 codevectors in the global codebook. When the
Euclidean distance between the input vector and its closest codevector specified
by an index stored in C1 is not more than the desired T, we assign that index in C1
to the input vector as its encoded index, and its corresponding codevector is easily
found. This method allows the user to adjust T to obtain a satisfactory R-D
performance, rather than switching to a codebook of another size as in
conventional VQ compression methods. In Lu and Li's scheme, 1 bit of side
information is needed to identify whether a codevector index refers to C0 or C1.
The correlation of consecutive subsets of residual vectors in the connectivity
traversal order, which the algorithm takes advantage of, is shown graphically in
Fig. 2.27. Stars represent an example of 16 typical consecutive residual vectors
generated during compression of the Caltech Feline mesh model; their bounding
sphere radius is 0.02. The dots indicate part of the codevectors of the universal
codebook consisting of 8,192 codevectors, whose bounding sphere radius is 2.00.
It is evident that consecutive residual vectors concentrate in a small region
relative to the whole codebook. Thus multiple residual vectors among the 16
consecutive vectors may be mapped to the same codevector and, if we increase T
for further distortion tolerance, any residual vector in the sphere with radius T
centered at that codevector will be mapped to it, making the local search in the
FIFO more likely and thus reducing the bit rate.
Fig. 2.27. Zoom-in of an example of consecutive residual vectors (in stars) and codevectors (in
dots)
The most computationally intensive part of the DRCVQ algorithm is the distortion
calculation between an input vector and each codevector in the stationary
codebook C0 when finding the closest codevector. In full search VQ, the distance
computation in the Euclidean space R3 needs 3N0 multiplications, 5N0 additions
and N0 comparisons to encode each input vector. Lu and Li [93] adopted the
mean-distance-ordered partial codebook search (MPS) [94] as an efficient fast
codevector search algorithm, which uses the mean of the input vector to reduce
the computational burden of the full search algorithm without sacrificing
performance. In [94], the codevectors are sorted according to their component
means, and the search for the codevector having the minimum Euclidean distance
to a given input vector starts with the one having the minimum mean distance to it.
The search is then terminated as soon as possible, since a mean distance out of a
certain range must correspond to a larger Euclidean distance.
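The following Python sketch illustrates the MPS idea under stated assumptions (sorting by component mean and using the bound k(mean(x) − mean(y))² ≤ ||x − y||²); it is a simplified reading of [94], not the reference implementation.

    import numpy as np

    def mps_search(x, codebook):
        """Mean-distance-ordered partial search (full-search equivalent), a sketch.
        Uses the bound k*(mean(x)-mean(y))^2 <= ||x-y||^2 to stop early on each side."""
        k = codebook.shape[1]
        means = codebook.mean(axis=1)
        order = np.argsort(means)
        sorted_means = means[order]
        mx = x.mean()
        start = int(np.clip(np.searchsorted(sorted_means, mx), 0, len(order) - 1))
        best = order[start]
        d2_min = np.sum((x - codebook[best]) ** 2)
        lo, hi = start - 1, start + 1
        while lo >= 0 or hi < len(order):
            for side in ('lo', 'hi'):
                idx = lo if side == 'lo' else hi
                if not (0 <= idx < len(order)):
                    continue
                if k * (sorted_means[idx] - mx) ** 2 >= d2_min:
                    # all remaining codevectors on this side are at least d_min away
                    if side == 'lo':
                        lo = -1
                    else:
                        hi = len(order)
                    continue
                cand = order[idx]
                d2 = np.sum((x - codebook[cand]) ** 2)
                if d2 < d2_min:
                    d2_min, best = d2, cand
                if side == 'lo':
                    lo -= 1
                else:
                    hi += 1
        return best, np.sqrt(d2_min)

    # toy usage
    cb = np.random.randn(1024, 3)
    idx, d = mps_search(np.random.randn(3), cb)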
The mesh distortion metric is also an important issue. Let d(x, Y) be the
Euclidean distance from a point x on X to its closest point on Y; then the distance
from X to Y is defined as follows:
$d(X, Y) = \left( \frac{1}{A(X)} \int_{x \in X} d(x, Y)^2 \, \mathrm{d}x \right)^{1/2}$,  (2.17)
where A(X) is the area of X. Since this distance is not symmetric, the distortion
between X and Y is given as:
$d = \max\{ d(X, Y),\ d(Y, X) \}$.  (2.18)
$v_i' = \sum_j L_{ij}\, v_j / 2$,  (2.19)
where $v_i'$ is the filtered version of $v_i$. This filter can be operated iteratively. Based
on the assumption that similar mesh models should have similar surface areas, the
criterion for terminating the Laplacian filter is set to be
$\left| \mathrm{area}(M') - \mathrm{area}(M) \right| \le \delta \cdot \mathrm{area}(M)$,  (2.20)
where M and M' denote the mesh before and after filtering, respectively.
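As an illustration of the smoothing and its area-based stopping rule, the sketch below applies a simple umbrella-operator Laplacian filter and stops when the relative surface-area change exceeds δ; the weights and the exact criterion used in [93] may differ, so this is only a sketch under those assumptions.

    import numpy as np

    def triangle_areas(verts, faces):
        a, b, c = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
        return 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)

    def laplacian_smooth(verts, faces, delta=0.01, max_iters=50):
        """Iterative Laplacian filter: v_i' = (v_i + neighbourhood centroid) / 2,
        stopped when the relative surface-area change would exceed delta."""
        n = len(verts)
        neighbors = [set() for _ in range(n)]
        for f in faces:
            for i in range(3):
                neighbors[f[i]].update((f[(i + 1) % 3], f[(i + 2) % 3]))
        area0 = triangle_areas(verts, faces).sum()
        v = verts.copy()
        for _ in range(max_iters):
            centroid = np.array([v[list(nb)].mean(axis=0) if nb else v[i]
                                 for i, nb in enumerate(neighbors)])
            v_new = 0.5 * (v + centroid)
            if abs(triangle_areas(v_new, faces).sum() - area0) > delta * area0:
                break                      # stop before the surface area drifts too far
            v = v_new
        return v

    # toy usage: smooth a slightly perturbed tetrahedron
    v = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float) + 0.01 * np.random.randn(4, 3)
    f = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])
    smoothed = laplacian_smooth(v, f)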
Fig. 2.28. DRCVQ compared with conventional VQ. (a) Caltech Feline; (b) Stanford Bunny; (c)
Fandisk; (d) Stanford simplified Bunny
Fig. 2.29. Comparisons with Wavemesh. (a) Fandisk; (b) Venus head; (c) Venus body
Fig. 2.30 shows reconstructed meshes obtained by the proposed method with
entropy coding and Laplacian filtering. Lu and Li's scheme has the advantage of
low computational complexity. Since MPS is incorporated in DRCVQ, the
codevector search time is rather low. With T increasing from 0 to 1E-3 relative to
the mesh bounding box diagonal, the geometry compression time ranges from
0.15 to 0.05 s for Bunny and from 0.20 to 0.07 s for Feline. The platform was
Visual C++ 6.0 on a 2.0 GHz PC.
Fig. 2.30. Reconstructed meshes of typical models using DRCVQ with entropy coding and
Laplacian smooth. (a) Original Fandisk; (b) 7.22 bpv, 59.24 dB; (c) 5.94 bpv, 53.79 dB; (d)
Original Venus head; (e) 11.00 bpv, 62.85 dB; (f) 6.76 bpv, 55.86 dB; (g) Original Venus body;
(h) 7.39 bpv, 63.43 dB; (i) 5.86 bpv, 56.54 dB
2.7 Summary
References
2003.
[2] J. L. Peng, C. S. Kim and C. C. Jay Kuo. Technologies for 3D mesh compression:
A survey. Journal of Visual Communication and Image Representation, 2005,
16(6):688-733.
[3] ISO/IEC 14772-1. The Virtual Reality Modeling Language VRML. 1997.
[4] G. Taubin, W. Horn, F. Lazarus, et al. Geometry coding and VRML. Proceedings
of the IEEE, 1998, 86(6):1228-1243.
[5] G. Taubin and J. Rossignac. Geometric compression through topological surgery.
ACM Trans. Graph., 1998, 17(2):84-115.
[6] ISO/IEC 14496-2. Coding of Audio-Visual Objects: Visual. 2001.
[7] O. Devillers and P. Gandoin. Geometric compression for interactive transmission.
In: Proceedings of the IEEE Conference on Visualization, 2000, pp. 319-326.
[8] G. Taubin. 3D geometry compression and progressive transmission.
EUROGRAPHICS—State of the Art Report, 1999.
[9] D. Shikhare. State of the art in geometry compression. Technical Report,
National Centre for Software Technology, India, 2000.
[10] C. Gotsman, S. Gumhold and L. Kobbelt. Simplification and compression of 3D
meshes. Tutorials on Multiresolution in Geometric Modelling, 2002.
[11] J. Gross and J. Yellen. Graph Theory and Its Applications. CRC Press, 1998.
[12] M. Deering. Geometry compression. ACM SIGGRAPH, 1995, pp. 13-20.
[13] M. Chow. Optimized geometry compression for real-time rendering. IEEE
Visualization, 1997, pp. 347-354.
[14] E. M. Arkin, M. Held, J. S. B. Mitchell, et al. Hamiltonian triangulations for fast
rendering. Visual Computation, 1996, 12(9):429-444.
[15] F. Evans, S. S. Skiena and A. Varshney. Optimizing triangle strips for fast
rendering. IEEE Visualization, 1996, pp. 319-326.
[16] G. Turan. On the succinct representations of graphs. Discr. Appl. Math, 1984,
8:289-294.
[17] C. L. Bajaj, V. Pascucci and G. Zhuang. Single resolution compression of
arbitrary triangular meshes with properties. Comput. Geom. Theor. Appl., 1999,
14:167-186.
[18] C. Bajaj, V. Pascucci and G. Zhuang. Compression and coding of large CAD
models. Technical Report, University of Texas, 1998.
[19] C. Touma and C. Gotsman. Triangle mesh compression. In: Proceedings of
Graphics Interface, 1998, pp. 26-34.
[20] P. Alliez and M. Desbrun. Valence-driven connectivity encoding for 3D meshes.
EUROGRAPHICS, 2001, pp. 480-489.
[21] M. Schindler. A fast renormalization for arithmetic coding. In: Proceedings of
IEEE Data Compression Conference, 1998, p. 572.
[22] W. Tutte. A census of planar triangulations. Can. J. Math., 1962, 14:21-38.
[23] C. Gotsman. On the optimality of valence-based connectivity coding. Computer
Graphics Forum, 2003, 22(1):99-102.
[24] S. Gumhold and W. Straßer. Real time compression of triangle mesh connectivity.
ACM SIGGRAPH, 1998, pp. 133-140.
[25] S. Gumhold. Improved cut-border machine for triangle mesh compression. Paper
presented at The Erlangen Workshop’99 on Vision, Modeling and Visualization,
1999.
[26] J. Rossignac. Edgebreaker: connectivity compression for triangle meshes. IEEE
Transactions on Visualization and Computer Graphics, 1999, 5(1):47-61.
[90] C. D. Bei and R. M. Gray. An improvement of the minimum distortion encoding
algorithm for vector quantization. IEEE Transactions on Communications, 1985,
33(10):1132-1133.
[91] L. Guan and M. Kamel. Equal-average hyperplane partitioning method for
vector quantization of image data. Pattern Recognition Letters, 1992,
13(10):693-699.
[92] H. Lee and L. H. Chen. Fast closest codevector search algorithms for vector
quantization. Signal Processing, 1995, 43:323-331.
[93] Z. M. Lu and Z. Li. Dynamically restricted codebook based vector quantization
scheme for mesh geometry compression. Signal Image and Video Processing,
2008, 2(3):251-260.
[94] S. W. Ra and J. K. Kim. Fast mean-distance-ordered partial codebook search
algorithm for image vector quantization. IEEE. Transactions on Circuits and
Systems-II, 1993, 40(9):576-579.
[95] Z. Li, Z. M. Lu and L. Sun. Dynamic extended codebook based vector
quantization scheme for mesh geometry compression. Paper presented at The
IEEE Third International Conference on Intelligent Information Hiding and
Multimedia Signal Processing (IIHMSP2007), 2007, Vol. 1, pp. 178-181.
[96] S. Valette and R. Prost. Wavelet-based progressive compression scheme for
triangle meshes: Wavemesh. IEEE Transactions on Visualization and Computer
Graphics, 2004, 10(2):123-129.
3 3D Model Feature Extraction
Features are important parts of geometric models. They come in different varieties
[1]: sharp edges, smoothed edges, ridges or valleys, prongs, bridges and others, as
shown in Fig. 3.1. The crucial role of features in the correct appearance and
accurate representation of a geometric model has led to increasing research
activity on feature extraction. Feature extraction from 3D models is an essential
preliminary task for subsequent analysis, retrieval, recognition, classification and
tracking processes. This chapter focuses on techniques for feature extraction from
3D models.
3.1 Introduction
3.1.1 Background
Fig. 3.1. Example of automatic feature classification: ridges (orange), valleys (blue), and
prongs (pink) [1] ([2007]IEEE)
In general, a 3D model only consists of geometry data, connectivity data and
appearance data, and there are few descriptions of high-level semantic features
available for automatic matching. How to describe 3D models appropriately (i.e.,
feature extraction) is therefore an urgent issue, and a fully satisfying solution has
been hard to obtain up to now.
Building correct feature correspondences for 3D models is more difficult and
time-consuming [5]. 3D models possess more complex and varied poses than
2D media, with different translations, rotations, scales and reflections. This gives
3D models many more arbitrary and unpredictable positions, orientations and
measurements, and makes them difficult to parameterize and search. The features
newly adopted in content-based 3D model retrieval include 2D shape projections,
3D shapes, 3D appearances and even high-level semantics, which are required not
only to be extracted, represented and indexed easily and efficiently, but also to
effectively distinguish similar models from dissimilar ones while being invariant
to typical affine transformations.
Scan registration [6] can be defined as finding the translation and rotation of a
projected scan contour that produces maximum overlap with a reference scan or a
previous model. Scan matching is a highly non-linear problem, with no analytical
solution, which requires an initial estimation to be solved iteratively. In addition,
some applications of registration with 3D laser range-finders, like mobile robotics,
impose time constraints on this problem, in spite of the large amount of raw data
to be processed.
Registration of 3D scenes from laser range data is more complex than
matching 2D views: (1) The amount of raw data is substantially bigger; (2) The
number of degrees of freedom increases twofold. Moreover, registration of 3D
scenes is different from modeling single objects in several aspects: (1) The scene
can have more occlusions and more invalid ranges; (2) The scene may contain
points from unconnected regions; (3) All scan directions in the scene may contain
relevant information.
There are two general approaches to 3D scan registration: feature matching
and point matching. The goal of feature matching is to find correspondences
between singular points, edges or surfaces from range images. The segmentation
process used to extract and select image primitives determines the computation
time and maximum accuracy. On the other hand, point matching techniques try to
directly establish correspondences between spatial points from two views. Exact
point correspondence between different scans is impossible due to a number of
factors: spurious ranges, random noise, mixed pixels, occluded areas and discrete
angular resolution. This is why point matching is usually regarded as an
optimization problem, where the maximum expected precision is intrinsically
limited by the working environment and by the rangefinder performance.
3.1.2.1 Features
The shape of a 3D object is described by the feature vector that serves as a search
key in the database. If an unsuitable feature extraction method had been used, the
whole retrieval system would not be usable. Therefore, the following text is
dedicated to properties that an ideal feature extraction method should have [7]:
(1) Independence of 3D object representations. At first we have to realize that
3D objects can be saved in many representations such as polyhedral meshes,
volumetric data, parametric or implicit equations. The method for feature
extraction should accept this fact and it should be independent of data
representations.
(2) Invariance under transformations. The computed descriptor values have to
be invariant under an application-dependent set of transformations. Usually, these
are the similarity transformations, but some applications, like the retrieval of
articulated objects, may additionally demand invariance under certain deformations.
This is perhaps the most important requirement, because 3D objects are usually
saved in various poses and scales.
(3) Insensitiveness to noise. The 3D object can be obtained either from a 3D
graphics program or from a 3D input device. The second way is more susceptible
to some errors. Thus, the feature extraction method should also be insensitive to
noise.
(4) Descriptive power. The similarity measure based on the descriptor should
deliver a similarity ordering that is close to the application driven notion of
resemblance. The features between different models should be distinguishable.
(5) Conciseness and ease of indexing. The database can contain thousands of
objects and the agility of the system would also be one of the main requirements.
The descriptor should be compact in order to minimize the storage requirements
and accelerate the search by reducing the dimensionality of the problem. Very
importantly, it should provide some means of indexing and thereby structuring the
database in order to further accelerate the search process.
The feature extraction method that would have all the above mentioned
requirements probably does not exist. For all that, some methods that try to find a
compromise among ideal properties exist.
various orders of vertices and feature coefficients of various transforms, and so on.
Statistical-data-based feature extraction approaches sample points on the
surface of 3D models and extract characteristics from the sample points. These
characteristics are typically organized in the form of histograms or distributions
representing frequencies of occurrence. The most extensively used statistical
properties are “moments”, such as Hu's image moments [20]. There are also
many other kinds of statistical features expressed in the form of discrete
histograms of geometrical statistics [21]. Using histograms, the shape
representation is simplified to a probability distribution problem and the model
normalization process is avoided.
Compared with other methods, most statistical feature extraction methods are
not only fast and easy to implement, but also have some desirable properties, such
as robustness and invariance. In many cases, they are also robust against noise, or
the small cracks and holes that exist in a 3D model. Unfortunately, as an inherent
drawback of a histogram representation, they provide only limited discrimination
between objects: they neither preserve nor construct spatial information. Thus,
they are often not discriminating enough to capture the small differences between
dissimilar 3D shapes, and they usually fail to distinguish different shapes having the
same histogram. In this section, we mainly introduce several typical
moment-based and histogram-based feature descriptors for 3D models, including
one method proposed by the authors of this book.
$m_{pqr} = \iiint_{\omega} x^p\, y^q\, z^r \, \mathrm{d}x\,\mathrm{d}y\,\mathrm{d}z$,  (3.1)
The crux of Elad et al.'s algorithm lies in the computation of a subset of the (p, q, r)-th
moments of each object, which are used as the feature set. Thus, it is necessary to
perform a pre-processing stage where the features are calculated for each database
object. A practical way to evaluate the integral defining the moments is to compute
it analytically for each facet of the object and then sum over all the facets. They
use an alternative approach, yielding an approximation of the moments. The
algorithm draws a sequence of points (x, y, z) distributed uniformly over the
object's surface. The number of points drawn from each of the object's facets is
proportional to its relative surface area. If we denote the list of points for a given
object by {xi, yi, zi}, i = 1, 2, …, N, then the (p, q, r)-th moment is approximated by
$\hat{m}_{pqr} = \frac{1}{N} \sum_{i=1}^{N} x_i^{\,p}\, y_i^{\,q}\, z_i^{\,r}$.  (3.2)
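A direct implementation of Eq.(3.2) is straightforward; the sketch below assumes the surface points have already been sampled uniformly with respect to area.

    import numpy as np

    def approx_moment(points, p, q, r):
        """Approximate the (p, q, r)-th moment of Eq. (3.2) from N surface samples."""
        x, y, z = points[:, 0], points[:, 1], points[:, 2]
        return np.mean(x**p * y**q * z**r)

    # toy usage: the first-order moments estimate the centre of mass
    pts = np.random.rand(1000, 3)
    centroid = [approx_moment(pts, 1, 0, 0),
                approx_moment(pts, 0, 1, 0),
                approx_moment(pts, 0, 0, 1)]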
The similarity measure should be invariant to the spatial position, scale and rotation
of the different objects. One is therefore required to normalize the feature vectors of
all objects. The first moments m100, m010 and m001 represent the object's center of
mass. Thus, the normalization starts by estimating the first moments for each
object, represented as a set of surface sample points, and subtracting them from
each of these points:
$[\hat{x}_i, \hat{y}_i, \hat{z}_i]^{\mathrm T} = [x_i, y_i, z_i]^{\mathrm T} - [m_{100}, m_{010}, m_{001}]^{\mathrm T}, \quad i = 1, 2, \ldots, N$.  (3.3)
$U \Delta U^{\mathrm T} = \mathrm{SVD}(\cdot)$,  (3.5)
where the unitary matrix U represents the rotation and the diagonal matrix Δ
represents the scale along each axis, ordered in decreasing size.
$[\tilde{x}_i, \tilde{y}_i, \tilde{z}_i]^{\mathrm T} = \frac{1}{\Delta(1,1)}\, U^{\mathrm T}\, [\hat{x}_i, \hat{y}_i, \hat{z}_i]^{\mathrm T}$.  (3.6)
Finally, the algorithm should also determine each object's orientation relative
to each axis. To do this, we count the number of points on each side of the center
of the body. In order to normalize such that all the objects have the same
orientation, we flip each object so that it is “heavier” on the positive side. By
counting the number of points and flipping accordingly, we are actually forcing
the median center to lie on a predetermined side relative to the center of mass.
After applying all the normalization stages to each object, the moments are
computed once more, up to the pre-specified order. Obviously, the normalization
process fixes m̂100 , m̂010 , m̂001 and m̂200 to 0, 0, 0 and 1, respectively, for
each and every object. These are therefore no longer useful as object features.
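The whole normalization chain can be sketched as follows; the scale factor (the dominant singular value) and the flipping test are assumptions based on the description above rather than Elad et al.'s exact code.

    import numpy as np

    def normalize_pose(points):
        """Translate, rotate, scale and flip a surface point set, as described above."""
        pts = points - points.mean(axis=0)             # fix m100, m010, m001 to zero
        cov = pts.T @ pts / len(pts)                   # second-order moment matrix
        U, svals, _ = np.linalg.svd(cov)               # U: rotation, svals: decreasing scales
        pts = pts @ U                                  # align principal axes with coordinate axes
        pts /= np.sqrt(svals[0])                       # normalize scale by the dominant axis
        # orientation: flip each axis so that more points lie on the positive side
        flips = np.where((pts > 0).sum(axis=0) >= (pts < 0).sum(axis=0), 1.0, -1.0)
        return pts * flips

    # toy usage
    normalized = normalize_pose(np.random.rand(500, 3) * [3.0, 1.0, 0.5] + 2.0)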
$m_{pqr} = \int_{x^2 + y^2 + z^2 \le 1} f(x, y, z)\, x^p\, y^q\, z^r \, \mathrm{d}x\,\mathrm{d}y\,\mathrm{d}z$  (3.7)
$\Omega_{nl}^{m} = \frac{3}{4\pi} \sum_{p+q+r \le n} \chi_{nlm}^{pqr}\, m_{pqr}$,  (3.8)
where $\chi_{nlm}^{pqr}$ is the intermediate monomial coefficient that can be found in [25] for more
details. Note that the summation has to be conducted only over the nonzero
coefficients $\chi_{nlm}^{pqr}$. Also note that for m < 0, $\Omega_{nl}^{m}$ may be computed using the
symmetry relation $\Omega_{nl}^{-m} = (-1)^{m}\, \overline{\Omega_{nl}^{m}}$.
(4) 3D Zernike descriptor generation. Compute the rotationally invariant 3D
Zernike descriptors as the norms of the vectors $\Omega_{nl} = (\Omega_{nl}^{-l}, \ldots, \Omega_{nl}^{l})$.
The similarity of two objects is then defined as the vicinity of their feature vectors
in the feature space. Ankerst et al. [26] introduced 3D shape histograms as
intuitive feature vectors. In general, histograms are based on a partitioning of the
space in which the objects reside, i.e., a complete and disjoint decomposition into
cells which correspond to the bins of the histograms. The space may be geometric
(2D, 3D), thematic (e.g., physical or chemical properties), or temporal (modeling
the behavior of objects). They suggested three techniques for decomposing the
space: a shell model, a sector model and a spiderweb model as the combination of
the former two, as shown in Fig. 3.2. In the preprocessing step, a 3D solid is
moved to the origin. Thus the models are aligned to the center of mass of the solid.
Fig. 3.2. Shells and sectors as basic space decompositions for shape histograms. (a) 4 shell bins;
(b) 12 sector bins; (c) 48 combined bins. In each of the 2D examples, a single bin is marked
Fig. 3.3. Several 3D shape histograms of the example protein 1SER-B. From top to bottom, the
number of shells decreases and the number of sectors increases [13] (With kind permission of
Springer Science+Business Media)
of around 120. Note that the histograms are not built from volume elements but
from uniformly distributed surface points taken from the molecular surfaces.
Besl [27] constructed 3D histograms of the crease angles for all edges in a 3D
triangular mesh to match 3D shapes. Fig. 3.4 shows the crease angle histograms
(CAHs) and hidden-line drawings for eight simple shapes: a block, a cylinder, a
sphere, a block with a channel, a “soap-shape” superquadric, two blocks glued
together, a “double horn” superquadric, and a “jack-shaped” superquadric.
Working from the bottom up, we see that the block CAH consists of two simple
peaks: one peak at 90 degrees for the 12 edges and one peak at zero for the
adjacent triangles within a face. The cylinder's creases have angles that are zero or
small and positive, as well as a peak at 90 degrees. The three ideal peaks, one for
flatness, one for convex curvature and one for 90-degree angles, are the signature
of the cylinder. An ideal cone's histogram looks very similar, except that the peak
at 90 degrees should be half the size.
Fig. 3.4. Crease angle histograms for simple shapes. (a) Double-horn superquadric; (b)
Jack-shaped superquadric; (c) Soap superquadric; (d) Two blocks glued; (e) Sphere; (f) Block
with channel; (g) Block; (h) Cylinder [27] (With kind permission of Springer
Science+Business Media)
For rigid 3D shapes, Novotni et al. [28] introduced so-called “distance
histograms” as a basic representation. Their fundamental idea is that if two objects
are similar, only a small part of the volume of one of the objects lies outside the
boundary of the other, and the average distance from the boundary is also small.
They first computed the offset hulls of each object based on a 3D distance field,
and then constructed distance histograms for each object to indicate how much of
the volume of one object is inside the offset hull of the other.
Suzuki et al. [33] suggested that several steps are required to create rotation-
invariant feature descriptors: (1) information associated with shape features has to
be extracted from the data files; (2) the extracted information is converted into
feature vectors serving as indices of the database; (3) the feature vectors are
grouped into equivalence classes, so that they can be converted into rotation-
invariant feature vectors. In their paper, only 3D model shapes are of concern, thus
only information related to vertices is used.
When a 3D graphical object is displayed, a set of points is used to represent
the shape. This set of points is connected by lines to form a wireframe. This
wireframe shows a set of polygons. Once polygons have been created, the
rendering algorithm can shade the individual polygons to produce a solid object.
Suzuki et al. [33] used the density of the point clouds as feature vectors. Each 3D
model is placed into the unit cube, and then the unit cube is divided into coarse
grids. The number of points is counted in each grid cell to compute the density of
the point clouds. In their paper, only the density of the point clouds is used.
However, other features can also be used, such as normal vectors of polygon
faces.
Since the distributions of the point clouds depend on how the 3D model is
generated, they normalized point positions by using polygon triangulation
programs. The density of the point clouds gives us rough shape descriptors of the
3D models which include curvature, height, width and positions. These feature
descriptors are not rotation invariant, because orientations of 3D models are
defined by those who designed the 3D models. Orientations may be normalized by
rules. Suitable rules to set 3D model orientations depend on the purpose of the
applications.
To explain the concept of equivalence classes, Fig. 3.5 illustrates rotations about
one of the coordinate axes in steps of 90 degrees. Each cell can be moved to a new
position by a rotation. When rotations are repeated, eventually each cell returns to
its original position. In this cell-moving process, some unique paths are generated.
For example, the coordinates of the 8 cells which lie at the corners of the grid are:
(−1, −1, −1), (−1, −1, +1), (−1, +1, −1), (−1, +1, +1), (+1, −1, −1), (+1, −1, +1),
(+1, +1, −1), (+1, +1, +1). When we apply a rotation to a cell which has one of the
above coordinates, the calculated new coordinate is also one of the above. This
means that these 8 cells have no path to any other cells. For instance, the cell
which lies at the origin keeps its own position even if rotations are applied, so it
has an independent path.
The rotations are applied using the rotation matrices $R_x$, $R_y$ and $R_z$ (in homogeneous coordinates):

$R_x = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta & 0 \\ 0 & \sin\theta & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$,  (3.10)

$R_y = \begin{pmatrix} \cos\theta & 0 & \sin\theta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin\theta & 0 & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$,  (3.11)

$R_z = \begin{pmatrix} \cos\theta & -\sin\theta & 0 & 0 \\ \sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$.  (3.12)
Cells have an equivalence relation if they belong to the same path. The cell sets
that have equivalence relations are called equivalence classes. Fig. 3.6 shows the
equivalence classes of the 3×3×3 grid. Each cell is classified into one of four
equivalence classes. The 3×3×3 grid contains 27 cells. Since we define the cells of
each class as having an identical relation, the values of the cells in the same class
can be summed. Each cell contains the density of the point clouds: Pn(x, y, z)
contains the point cloud density of the cell located at coordinates (x, y, z), where n
is the index of the cell as shown in Fig. 3.6. In the case of the 3×3×3 grid, we can
define the following four functions to calculate the rotation-invariant feature
vector under 90-degree rotations. Twenty-seven vectors are reduced to 4 vectors
by these equations. Since these 4 vectors are recalculated to be rotation-invariant
vectors, some of the fine details of the feature descriptors are lost.
$f_1 = P_0 + P_2 + P_6 + P_8 + P_{18} + P_{20} + P_{24} + P_{26}$  (the 8 corner cells, from $P_0(-1,-1,-1)$ to $P_{26}(1,1,1)$),  (3.13)

$f_2 = P_1 + P_3 + P_5 + P_7 + P_9 + P_{11} + P_{15} + P_{17} + P_{19} + P_{21} + P_{23} + P_{25}$  (the 12 edge-center cells),  (3.14)

$f_3 = P_4 + P_{10} + P_{12} + P_{14} + P_{16} + P_{22}$  (the 6 face-center cells),  (3.15)

$f_4 = P_{13}(0, 0, 0)$  (the center cell).  (3.16)
$Q_{num} = \begin{cases} \displaystyle\sum_{j=0}^{n} F_j + \sum_{j=0}^{n-2} F_j, & n > 3; \\[2mm] \displaystyle\sum_{j=0}^{n} F_j, & n \le 3, \end{cases}$  (3.17)
with
$F_j = \displaystyle\sum_{k=0}^{j} (j - k)$.  (3.18)
Here, for an N×N×N grid we have N = 2n + 1. Thus, if the grid size is larger than
7×7×7, the first part of Eq.(3.17) is used; otherwise the second part is used. We
can easily see that the number of cells increases much more rapidly with the
resolution of the N×N×N grid than the number of equivalence classes does.
Comparing a huge number of vectors makes retrieval inefficient, and more
memory is required to store the vectors. Statistical approaches such as principal
component analysis (PCA), multidimensional scaling and multiple regression
analysis can be used to reduce the size of the vectors for similarity retrieval.
However, these approaches need a sufficient number of data samples and
additional processing to determine which vectors can be eliminated.
In fact, the basic idea of this method is similar to that of 3D shape histograms:
both calculate a point distribution, but the implementations are different. The
detailed procedure of Suzuki et al.'s method [33] can be expressed as follows:
Step 1: Transform the 3D model into the normalized coordinate system by the
PCA method.
Step 2: Partition the cube into N×N×N cells.
Step 3: Classify each cell into the equivalence class it belongs to.
Step 4: Compute the number of vertices in each class, and divide it by the total
number of vertices in the 3D model, composing a feature vector for the 3D model.
Experimentally, it has been shown that the computational complexity of the
point density approach is low, and in retrieval applications based on this feature
we can obtain good retrieval performance in terms of precision and recall.
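The procedure can be sketched in Python as follows; the orbit computation under the 24 axis-aligned rotations and the assumption that the vertices have already been PCA-normalized into the unit cube are illustrative choices, not Suzuki et al.'s original implementation.

    import numpy as np
    from itertools import product

    def rotation_orbits(n):
        """Group the cells of an n x n x n grid into equivalence classes under the
        axis-aligned 90-degree rotations (a sketch of the classes described above)."""
        def rx(v): return (v[0], -v[2], v[1])      # 90-degree rotation about x
        def rz(v): return (-v[1], v[0], v[2])      # 90-degree rotation about z
        half = (n - 1) / 2.0
        orbit, n_classes = {}, 0
        for cell in product(range(n), repeat=3):
            if cell in orbit:
                continue
            start = tuple(c - half for c in cell)
            seen, frontier = {start}, [start]
            while frontier:                         # closure under the two generators
                v = frontier.pop()
                for g in (rx, rz):
                    w = g(v)
                    if w not in seen:
                        seen.add(w)
                        frontier.append(w)
            for v in seen:
                orbit[tuple(int(round(c + half)) for c in v)] = n_classes
            n_classes += 1
        return orbit, n_classes

    def density_feature(verts, n=3):
        """Steps 2-4, assuming verts are already PCA-normalized into [0, 1]^3."""
        orbit, n_classes = rotation_orbits(n)
        cells = np.clip((verts * n).astype(int), 0, n - 1)
        hist = np.zeros(n_classes)
        for c in cells:
            hist[orbit[tuple(int(x) for x in c)]] += 1
        return hist / len(verts)

    # toy usage: the 3x3x3 grid yields the 4 classes described above
    print(rotation_orbits(3)[1])   # -> 4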
Osada et al. [34] described and analyzed a method for computing 3D shape
signatures and dissimilarity measures for arbitrary objects described by possibly
degenerate 3D polygonal models. The key idea is to represent the signature of an
object as a shape distribution sampled from a shape function measuring global
geometric properties of the object. The primary motivation for this approach is
that the shape matching problem is reduced to the comparison of two probability
distributions, which is a relatively simple problem when compared to the more
difficult problems encountered by traditional shape matching methods, such as
pose registration, parameterization, feature correspondence and model fitting. The
challenges of this approach are to select discriminating shape functions, to develop
efficient methods for sampling them, and to robustly compute the dissimilarity of
probability distributions.
The first and most interesting issue is to select a function whose distribution
provides a good signature for the shape of a 3D polygonal model. Ideally, the
distribution should be invariant under similarity transformations, and it should be
insensitive to noise, cracks, tessellation and insertion/removal of small polygons.
In general, any function could be sampled to form a shape distribution,
including ones that incorporate domain-specific knowledge, visibility information
(e.g., the distance between random but mutually visible points), and/or surface
attributes (e.g., color, texture coordinates, normals and curvature). However, for
the sake of clarity, Osada et al. focused on a small set of shape functions based on
geometric measurements (e.g., angles, distances, areas, and volumes). Specifically,
in their initial investigation, they have experimented with the following shape
functions (see Fig. 3.7):
(1) A3: Measures the angle between three random points on the surface of a
3D model.
(2) D1: Measures the distance between a fixed point and one random point on
the surface. We use the centroid of the boundary of the model as the fixed point.
(3) D2: Measures the distance between two random points on the surface.
(4) D3: Measures the square root of the area of the triangle between three
random points on the surface.
(5) D4: Measures the cube root of the volume of the tetrahedron between four
random points on the surface.
These five shape functions were chosen mostly for their simplicity and
invariance. In particular, they are quick to compute, easy to understand, and
produce distributions that are invariant to rigid motions (translations and rotations).
They are invariant to tessellation of the 3D polygonal model, since points are
selected randomly from the surface. They are insensitive to small perturbations
due to noise, cracks, and insertion/removal of polygons, since sampling is area
weighted. In addition, the A3 shape function is invariant to scale, while the
others have to be normalized to enable comparisons. Finally, the D2, D3, and D4
shape functions provide a nice comparison of 1D, 2D, and 3D geometric
measurements.
Fig. 3.7. Five simple shape functions based on angles (A3), lengths (D1, D2), areas (D3) and
volumes (D4)
Fig. 3.8. Example D2 shape distributions. In each plot, the horizontal axis represents distance,
and the vertical axis represents the probability of that distance between two points on the surface.
(a) Line segment; (b) Circle (perimeter only); (c) Triangle; (d) Cube; (e) Sphere; (f) Cylinder
(without caps); (g) Ellipsoids of different radii; (h) Two adjacent unit spheres; (i) Two unit
spheres separated by 1, 2, 3, and 4 units
A shape function having been chosen, the next issue is to compute and store a
representation of its distribution. Analytic calculation of the distribution is feasible
only for certain combinations of shape functions and models (e.g., the D2 function
for a sphere or line). Thus, in general, Osada et al. employed stochastic methods.
Specifically, Osada et al. evaluated N samples from the shape distribution and
constructed a histogram by counting how many samples fall into each of B
fixed-size bins. From the histogram, they reconstructed a piecewise linear
function with V (≤ B) equally spaced vertices, which forms the representation of
the shape distribution. They computed the shape distribution once for each
model and stored it as a sequence of V integers.
One issue we must be concerned with is the sampling density. On one hand,
the more samples we take, the more accurately and precisely we can reconstruct
the shape distribution. On the other hand, the time to sample a shape distribution is
linearly proportional to the number of samples, so there is an accuracy/time
tradeoff in the choice of N. Similarly, a larger number of vertices yields higher
resolution distributions, while increasing the storage and comparison costs of the
shape signature. In their experiments, Osada et al. chose to err on the side of
robustness, taking a large number of samples for each histogram bin. Empirically,
they found that using N = 1,024^2 samples, B = 1,024 bins, and V = 64 vertices
yields shape distributions with low enough variance and high enough resolution
to be useful for their initial experiments. Adaptive sampling methods could be
used in future work to make the construction of shape distributions more efficient
and robust.
A second issue is sample generation. Although it would be simplest to sample
vertices of the 3D model directly, the resulting shape distributions would be biased
and sensitive to changes in tessellation. Instead, Osada et al.'s shape functions are
sampled from random points on the surface of a 3D model. The method for
generating random points unbiased with respect to the surface area of a polygonal
model proceeds as follows. First, Osada et al. iterated through all polygons,
splitting them into triangles as necessary. Then, for each triangle, they computed
its area and stored it in an array along with the cumulative area of the triangles
visited so far. Next, they selected a triangle with probability proportional to its
area by generating a random number between 0 and the total cumulative area and
performing a binary search on the array of cumulative areas. For each selected
triangle with vertices (A, B, C), they constructed a point on its surface by
generating two random numbers, r1 and r2, between 0 and 1, and evaluating the
following equation:
$P = (1 - \sqrt{r_1})\,A + \sqrt{r_1}\,(1 - r_2)\,B + \sqrt{r_1}\,r_2\,C$.  (3.19)
Intuitively, r1 sets the percentage from vertex A to the opposing edge, while r2
represents the percentage along that edge (see Fig. 3.9). Taking the square root of
r1 gives a uniform random point with respect to surface area.
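The sampling rule of Eq.(3.19) and the D2 histogram can be sketched as follows; drawing triangle indices with probability proportional to area replaces the explicit cumulative-area binary search, and the number of pairs and bins here are illustrative choices.

    import numpy as np

    def sample_surface(verts, faces, n_samples, seed=0):
        """Draw points uniformly w.r.t. surface area: pick triangles with probability
        proportional to area, then apply Eq. (3.19) with sqrt(r1)."""
        rng = np.random.default_rng(seed)
        a, b, c = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
        areas = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)
        tri = rng.choice(len(faces), size=n_samples, p=areas / areas.sum())
        r1 = np.sqrt(rng.random((n_samples, 1)))
        r2 = rng.random((n_samples, 1))
        return (1 - r1) * a[tri] + r1 * (1 - r2) * b[tri] + r1 * r2 * c[tri]

    def d2_distribution(points, n_pairs=100000, bins=64, seed=1):
        """D2 shape function: histogram of distances between random point pairs."""
        rng = np.random.default_rng(seed)
        i = rng.integers(0, len(points), n_pairs)
        j = rng.integers(0, len(points), n_pairs)
        d = np.linalg.norm(points[i] - points[j], axis=1)
        hist, _ = np.histogram(d, bins=bins, range=(0, d.max()))
        return hist / hist.sum()

    # toy usage: D2 signature of a unit tetrahedron
    v = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
    f = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])
    sig = d2_distribution(sample_surface(v, f, 2000))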
Fig. 3.9. Construction of a random point in the triangle with vertices A, B and C, parameterized by r1 and r2
Osada et al. have shown that D2 is the best feature among their five features. It
represents the distribution of distances between two random points. This feature is
invariant to tessellation of 3D polygonal models, since points are randomly
selected from the object’s surface. However, it is sensitive to small deformation
due to noise, cracks, or insertion/removal of polygons, since sampling is area
weighted. To finely represent the complex components of a 3D object, a 3D model
often requires many polygons. The random sampling of a 3D model would be
dominated by those complex components. Thus, a novel feature, called grid D2, is
proposed by Shih et al. [35] to improve the performance of the traditional D2.
First, the 3D model is decomposed by a voxel grid. A voxel is regarded as valid if
there is a polygonal surface located within it, and invalid otherwise. Then the
distribution of distances between two valid voxels instead of two points on the
surface is calculated. Therefore, the area weighted defect in the sampling process
will be greatly reduced since each valid voxel is weighted equally irrespective of
how many points are located within this voxel. The main steps for computing the
grid D2 are described as follows:
(1) First, a 3D model is segmented into a 2R×2R×2R voxel grid. To be
invariant to translation and scaling, the object's mass centre is moved to the
location (R, R, R) and the average distance from valid voxels to the mass centre is
scaled to be R/2. R is set to 32, which provides adequate resolution for
discriminating objects while filtering out those high-frequency polygonal surfaces
$GD2 = \left\{ \frac{B_1}{U}, \frac{B_2}{U}, \frac{B_3}{U}, \ldots, \frac{B_{256}}{U} \right\}$,  (3.20)
where U is set to 64^3. From Fig. 3.10 we can see that the D2 distributions are
clearly different while GD2 distributions are similar for these two similar
airplanes. Experimental results show that Shih et al.’s method is superior to others,
and the new shape descriptor is both discriminating and robust.
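A minimal sketch of the GD2 computation is given below; the voxel pair enumeration and the normalization constant follow the description above, while the scaling step uses the sample points as a stand-in for the valid voxels, which is a simplification.

    import numpy as np

    def grid_d2(points, R=32, bins=256):
        """Sketch of a GD2-style descriptor: voxelize into a 2R x 2R x 2R grid, then
        histogram distances between pairs of valid (occupied) voxels."""
        # translate the mass centre to (R, R, R) and scale the mean centre distance to R/2
        centred = points - points.mean(axis=0)
        scale = (R / 2) / np.linalg.norm(centred, axis=1).mean()
        coords = np.clip(np.floor(centred * scale + R).astype(int), 0, 2 * R - 1)
        valid = np.unique(coords, axis=0)               # each occupied voxel counts once
        # pairwise distances between valid voxels (subsample 'valid' if it is very large)
        diff = valid[:, None, :] - valid[None, :, :]
        d = np.sqrt((diff ** 2).sum(-1))[np.triu_indices(len(valid), k=1)]
        hist, _ = np.histogram(d, bins=bins, range=(0, 2 * R * np.sqrt(3)))
        return hist / float((2 * R) ** 3)               # U = 64^3 when R = 32, as in Eq. (3.20)

    # toy usage
    gd2 = grid_d2(np.random.rand(5000, 3))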
In addition, Song et al. [36] also adopted a histogram representation, based on
shape functions to match 3D shapes by generating histograms using the discrete
Gaussian curvature and discrete mean curvature of every vertex of a 3D triangle
mesh.
Fig. 3.10. D2 and GD2 distributions for two similar airplane objects [35] ([2005]IEEE)
In [37], Horn defined the extended Gaussian image (EGI), discussed its properties,
and gave examples. Methods for determining the extended Gaussian images of
polyhedra, solids of revolution and smoothly curved objects in general were
shown. The orientation histogram, a discrete approximation of the extended
Gaussian image, was described along with a variety of ways of tessellating the
sphere. The detailed concepts and properties of EGI can be described as follows.
Minkowski showed in 1897 that a convex polyhedron is fully specified by the area
and orientation of its faces. Surface normal vector information for any object can
be mapped onto a unit sphere, called the Gaussian sphere. We can represent area
and orientation of the faces conveniently by point masses on this sphere. A weight
is assigned to each point on the Gaussian sphere equal to the area of the surface
having the given normal. Weights are represented by vectors parallel to the surface
normals, with length equal to the weight. Imagine moving the unit surface normal
of each face so that its tail is at the center of a unit sphere. The head of the unit
normal then lies on the surface of the unit sphere. Each point on the Gaussian
sphere corresponds to a particular surface orientation. The extended Gaussian
image of the polyhedron is obtained by placing a mass at each point equal to the
surface area of the corresponding face.
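As an illustration, the sketch below accumulates a discrete EGI (an orientation histogram) for a triangle mesh; the longitude/latitude tessellation of the sphere is only one of the tessellations mentioned above, and its cells do not have equal solid angle, so it is an illustrative choice.

    import numpy as np

    def extended_gaussian_image(verts, faces, subdiv=8):
        """Discrete EGI / orientation histogram: accumulate each face's area at the
        sphere cell containing its unit normal (longitude/latitude binning)."""
        a, b, c = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
        cross = np.cross(b - a, c - a)
        areas = 0.5 * np.linalg.norm(cross, axis=1)
        normals = cross / (2 * areas[:, None] + 1e-12)
        theta = np.arccos(np.clip(normals[:, 2], -1, 1))            # polar angle
        phi = np.mod(np.arctan2(normals[:, 1], normals[:, 0]), 2 * np.pi)
        ti = np.minimum((theta / np.pi * subdiv).astype(int), subdiv - 1)
        pj = np.minimum((phi / (2 * np.pi) * (2 * subdiv)).astype(int), 2 * subdiv - 1)
        egi = np.zeros((subdiv, 2 * subdiv))
        np.add.at(egi, (ti, pj), areas)                             # mass = face area
        return egi

    # toy usage: EGI of a unit tetrahedron
    v = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
    f = np.array([[0, 2, 1], [0, 1, 3], [0, 3, 2], [1, 2, 3]])
    hist = extended_gaussian_image(v, f)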
It seems at first as if some information is lost in this mapping, since the
position of the surface normals is discarded. Viewed from another angle, no note is
made of the shape of the faces or their adjacency relationships. It can nevertheless
be shown that the extended Gaussian image uniquely defines a convex polyhedron.
Iterative algorithms can be used for recovering a convex polyhedron from its
extended Gaussian image.
One can associate a point on the Gaussian sphere with a given point on a surface
by finding the point on the sphere which has the same surface normal. Thus it is
possible to map information associated with points on the surface onto points on
the Gaussian sphere. In the case of a convex object with positive Gaussian
curvature everywhere, no two points have the same surface normal. The mapping
from the object to the Gaussian sphere in this case is invertible: Corresponding to
each point on the Gaussian sphere, there is a unique point on the surface. If the
convex surface has patches with zero Gaussian curvature, curves or even areas on
it may correspond to a single point on the Gaussian sphere.
One useful property of the Gaussian image is that it rotates with the object.
Consider two parallel surface normals, one on the object and the other on the
Gaussian sphere. The two normals will remain parallel if the object and the
Gaussian sphere are rotated in the same fashion. A rotation of the object thus
corresponds to an equal rotation of the Gaussian sphere.
Consider a small patch δO on the object. Each point in this patch corresponds to a
particular point on the Gaussian sphere. The patch δO on the object maps into a
patch, say δS, on the Gaussian sphere. On the one hand, if the surface is strongly
curved, the normals of points in the patch will point into a wide fan of directions,
and the corresponding points on the Gaussian sphere will be spread out. On the
other hand, if the surface is planar, the surface normals are parallel and map into a
single point.
These considerations suggest a suitable definition of curvature. The Gaussian
curvature is defined to be equal to the limit of the ratio of the two areas as they
tend to zero. That is,
$K = \lim_{\delta O \to 0} \frac{\delta S}{\delta O} = \frac{\mathrm{d}S}{\mathrm{d}O}$.  (3.21)
From this differential relationship we can obtain two useful integrals. Consider
first integrating K over a finite patch O on the object:
$\iint_{O} K \,\mathrm{d}O = \iint_{S} \mathrm{d}S = A_S$,  (3.22)
where AS is the area of the corresponding patch on the Gaussian sphere. The
expression on the left is called the integral curvature. This relationship allows one
to deal with surfaces which have discontinuities in surface normal.
Now consider instead integrating 1/K over a patch S on the Gaussian sphere:
$\iint_{S} (1/K) \,\mathrm{d}S = \iint_{O} \mathrm{d}O = A_O$,  (3.23)
where A_O is the area of the corresponding patch on the object. This relationship
suggests the use of the inverse of the Gaussian curvature in the definition of the
extended Gaussian image of a smoothly curved object, as we shall see. It also
shows, by the way, that the integral of 1/K over the whole Gaussian sphere equals
the total area of the object.
We can define a mapping which associates the inverse of the Gaussian curvature at
a point on the surface of the object with the corresponding point on the Gaussian
sphere. Let u and v be parameters used to identify points on the original surface.
Similarly, let ξ and η be parameters used to identify points on the Gaussian sphere;
these could be longitude and latitude, for example. Then we define the extended
Gaussian image as
$G(\xi, \eta) = \frac{1}{K(u, v)}$,  (3.24)
where (ξ, η) is the point on the Gaussian sphere which has the same normal as the
point (u, v) on the original surface. It can be shown that this mapping is unique for
convex objects; that is, there is only one convex object corresponding to a
particular extended Gaussian image. The proof is unfortunately non-constructive,
and no direct method for recovering the object is known.
The extended Gaussian image is not affected by translation of the object. Rotation
of the object induces an equal rotation of the extended Gaussian image, since the
unit surface normals rotate with the object.
A mass distribution which lies entirely within one hemisphere, being zero in the
complementary hemisphere, does not correspond to a closed object: it can be
shown that the center of mass of an extended Gaussian image has to lie at the
origin, which is clearly impossible if a whole hemisphere is empty. Also, a mass
distribution which is nonzero only on a great circle of the sphere corresponds to
the limit of a sequence of cylindrical objects of increasing length and decreasing
diameter. Here, such pathological cases are excluded and our attention is confined
to closed, bounded objects.
Some properties of the extended Gaussian image are important. First, the total
mass of the extended Gaussian image is obviously just equal to the total surface
area of the polyhedron. If the polyhedron is closed, it will have the same projected
area when viewed from any pair of opposite directions. This allows us to compute
the location of the center of mass of the extended Gaussian image.
An equivalent representation, called a spike model, is a collection of vectors
each of which is parallel to one of the surface normals and of length equal to the
area of the corresponding face. The result regarding the center of mass is
equivalent to the statement that these vectors must form a closed chain when
placed end to end.
3.3 Rotation-Based Shape Descriptor

Recently, the authors of this book [38] presented a new shape descriptor based on
rotation. The proposed method is designed for 3D mesh models. Our approach is
to represent 3D shape as a 1D histogram. The motivation originates from a
question such as this: As a 3D model rotates in the spatial domain, why is the
human vision system, from the fixed viewing angle, sensitive to the fact that the
shape after rotation differs from the initial shape, as shown in Fig. 3.11? If points
are sampled uniformly on the model surface, we notice that the orientation of the
normal vector of points is changed after rotation. As Fig. 3.12 shows, regardless of
the position of point p, we translate its normal vector n so that its origin coincides
with the origin of the coordinate system, so that the end of the unit normal lies on
a unit sphere.
Fig. 3.11. Shape of a 3D model viewing from the same angle after various rotations. (a) The
shape of the original model; (b)-(g) Shapes after various random rotations
For a triangulated mesh model, N random points are sampled uniformly on the
surface. Suppose si and k denote the area of the triangle i and the number of
triangles, respectively. Then we can compute ni, namely the number of sample
points on the triangle i as follows:
$n_i = \frac{N s_i}{\sum_{i=1}^{k} s_i}$.  (3.25)
The normal vector of the point p is estimated by the normal of the triangle ABC
in which p lies, as follows:
$\mathbf{n}_p = \mathbf{n}_{\triangle ABC}$.  (3.26)
Hereby, a mesh model is converted into a point set with orientations. Notice that
the proposed method does not need to accurately determine the positions of the
random points, but only needs to obtain the orientations of their normals. In
contrast, the positions of the sample points must be obtained in Osada's D2 [34]
and Ohbuchi's improvement [39]. Consequently, the computational complexity of
our descriptor is lower than that in [34] and [39].
$p' = R\,p$  (3.28)
Actually, we rotate a model in order to find the shape difference after rotation.
This can be translated into analyzing normal distributions on the unit sphere. Let
us assume we rotate a model T times with T groups of rotation angles, each
randomly selected in the range [0, 2π]. When rotating a model, the normal
distribution of the points changes accordingly.
As shown in Fig. 3.13, the triangle ABC and the point p are rotated to A'B'C'
and p', respectively. Then $\mathbf{n}_p$ and $\mathbf{n}_{p'}$ have the following relationship:
$\mathbf{n}_{p'} = R\,\mathbf{n}_p$.  (3.29)
$V = (v_1, v_2, \ldots, v_8)$,  (3.30)
$N = \sum_{i=1}^{8} v_i$,  (3.31)
where $v_i$ is the number of sampled normals falling into the $i$-th section of the spherical surface.
Based on these 8 sections, the spherical surface also can be further segmented into
24 sections. As shown in Fig. 3.14(b), one eighth of the surface is divided into
three subsections by finding the maximum absolute value of three components of
the normal.
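Counting normals per octant, as in Eq.(3.31), can be sketched as follows; the sign-based octant indexing is an assumption consistent with the section assignment shown in Fig. 3.15.

    import numpy as np

    def octant_normal_histogram(normals):
        """Count how many unit normals fall into each of the 8 octants of the Gaussian
        sphere, keyed by the signs of their components (a sketch of V and Eq. (3.31))."""
        signs = (normals >= 0).astype(int)              # 1 for non-negative, 0 for negative
        octant = signs[:, 0] * 4 + signs[:, 1] * 2 + signs[:, 2]
        v = np.bincount(octant, minlength=8)
        return v, v.sum()                               # (v_1..v_8, N)

    # toy usage
    n = np.random.randn(1000, 3)
    n /= np.linalg.norm(n, axis=1, keepdims=True)
    v, N = octant_normal_histogram(n)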
Fig. 3.15. Calculation of normal distribution. (a) Signs and corresponding section; (b) Example
normals
The authors of this book also proposed a novel feature for 3D mesh models, i.e.,
the vector quantization index histogram [40]. The main idea is as follows. Firstly,
points are sampled uniformly on the mesh surface. Secondly, for each point, five
sub-features representing global and local properties are extracted, so that feature
vectors of the points are obtained. Thirdly, we select several models from each
class and employ their feature vectors as a training set; after training with the
LBG algorithm, a public codebook is constructed. Next, the codeword index
histograms of the query model and of the models in the database are computed.
The last step is to compute the distance between the histogram of the query and
those of the models in the database. Experimental results show the effectiveness
of our method. The following is the detailed description of our method.
For each sample, a random point inside the selected triangle with vertices A, B and C is generated as
\[ p = \left(1 - \sqrt{r_1}\right) A + \sqrt{r_1}\left(1 - r_2\right) B + \sqrt{r_1}\, r_2\, C , \qquad (3.33) \]
where the random numbers r_1 and r_2 are uniformly distributed between 0 and 1. Clearly, the number of sample points on a triangle is proportional to its area. This step aims to guarantee that the number of sample points is exactly the same for all models. Suppose n denotes this number.
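A minimal sketch of the sampling rule in Eq. (3.33), assuming the triangle vertices A, B and C are given as NumPy arrays; the formula is the standard one for drawing points uniformly inside a triangle.

    import numpy as np

    def sample_point_in_triangle(A, B, C, rng=np.random):
        """Eq. (3.33): p = (1 - sqrt(r1)) A + sqrt(r1) (1 - r2) B + sqrt(r1) r2 C."""
        r1, r2 = rng.random(), rng.random()
        s = np.sqrt(r1)
        return (1.0 - s) * A + s * (1.0 - r2) * B + s * r2 * C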
Then the covariance matrix of the sample points is computed as
\[ CV = \frac{1}{n} \sum_{i=1}^{n} (p_i - m)(p_i - m)^{\mathrm{T}} , \qquad (3.34) \]
where p_i is a sample point and m is the center of mass. The center of mass is computed as follows:
\[ m = \frac{1}{S} \sum_{i=1}^{k} s_i\, g_i , \qquad (3.35) \]
where s_i and g_i are the area and the centroid of triangle T_i, respectively, and S is the total surface area. The three eigenvectors of the covariance matrix CV are the principal axes of inertia of the model. The first, second and third most significant principal axes correspond to the eigenvalues sorted in decreasing order of magnitude.
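The center of mass of Eq. (3.35) and the principal axes derived from the covariance matrix of Eq. (3.34) can be sketched as follows; this is an illustrative NumPy version with hypothetical inputs, not the authors' code.

    import numpy as np

    def center_of_mass(vertices, faces):
        """Eq. (3.35): m = (1/S) * sum_i s_i g_i, with triangle areas s_i and centroids g_i."""
        a, b, c = vertices[faces[:, 0]], vertices[faces[:, 1]], vertices[faces[:, 2]]
        s = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)
        g = (a + b + c) / 3.0
        return (s[:, None] * g).sum(axis=0) / s.sum()

    def principal_axes(points, m):
        """Eq. (3.34): CV = (1/n) * sum_i (p_i - m)(p_i - m)^T; its eigenvectors are the axes."""
        centered = points - m
        CV = centered.T @ centered / len(points)
        eigvals, eigvecs = np.linalg.eigh(CV)         # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1]             # most significant axis first
        return eigvals[order], eigvecs[:, order]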
Next, the following sub-features are extracted for each point. Suppose a cord c_i is defined to be a vector that goes from the center of mass m to the sample point p_i.
D1: the Euclidean distance between p_i and m, i.e. the length of c_i.
α: the angle between c_i and the first (most significant) principal axis.
β: the angle between c_i and the second most significant principal axis.
γ: the angle between c_i and the third most significant principal axis.
θ: the angle between c_i and the normal vector of p_i.
VI: the visual importance of the point p_i.
Here the normal vector of a point is estimated as the normal of the triangle it lies on. Clearly, D1, α, β, γ and θ describe the relationship between the local points and the global properties, while VI denotes the local characteristics.
Suppose φ is the angle between two vectors OM and ON. The cosine of this angle is computed as
\[ \cos\varphi = \frac{ \mathbf{OM} \cdot \mathbf{ON} }{ \left| \mathbf{OM} \right| \left| \mathbf{ON} \right| } . \qquad (3.36) \]
Thus cos α, cos β, cos γ and cos θ can be computed in this way.
We associate a vertex v with a value that represents its visual importance [13], defined by:
\[ VI_v = 1 - \frac{\left\| \sum_i s_i \mathbf{n}_i \right\|}{\sum_i s_i} , \qquad (3.37) \]
where n_i and s_i denote the unit normal and the area of the i-th face incident to v. The visual importance of a sample point p_i is then taken as the average over the three vertices A, B and C of the triangle it lies on:
\[ VI_{p_i} = \frac{1}{3} \left( VI_A + VI_B + VI_C \right) . \qquad (3.38) \]
It is obvious that VI lies in the range [0, 1], which can indicate the local curvature around p_i. When VI equals 0, the vertex v lies on a flat plane, and VI increases as the local curvature increases. After calculating the above sub-features, we can construct a feature vector f_i for each point; the feature vectors of all the sample points are collected as
\[ F = [\, f_1, f_2, \ldots, f_N \,]^{\mathrm{T}} . \qquad (3.40) \]
For all of the models in the database, we construct their codeword index histograms offline, while that of the query model is obtained online, all based on the public codebook. As the number of sample points is the same (N) for all models, no normalization operation is required before comparison. Suppose all index histograms contain B bins.
This step is to measure the similarity between the histogram of the query and those
of the models in the database. We employ the Euclidean distance as the similarity
metric. Suppose Q = {q_1, q_2, ..., q_B} denotes the index histogram of the query and H = {h_1, h_2, ..., h_B} is the histogram of a model from the database; then we have
\[ D = \sqrt{ \sum_{i=1}^{B} \left( q_i - h_i \right)^2 } . \qquad (3.41) \]
After computing the distances, the retrieval results can be returned, ranked in ascending order of the distance between the query and the models in the database (the smaller the distance, the better the match).
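Assuming a public codebook has already been trained (e.g., with the LBG algorithm [41]), the index-histogram construction and the Euclidean ranking of Eq. (3.41) could look roughly as follows; the array names are hypothetical.

    import numpy as np

    def index_histogram(features, codebook):
        """Assign each per-point feature vector to its nearest codeword and count the indices."""
        d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        indices = d2.argmin(axis=1)
        return np.bincount(indices, minlength=len(codebook))

    def rank_models(query_hist, database_hists):
        """Eq. (3.41): Euclidean distance between histograms; the smallest distance ranks first."""
        dists = np.sqrt(((database_hists - query_hist) ** 2).sum(axis=1))
        return np.argsort(dists), dists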
In the experiment, the test database contains 95 models, which are classified into
10 categories. The names of the categories are: bottles (5 models), cars (8), dogs
(6), human bodies (24), planes (8), tanks (5), televisions (7), fire balloons (19),
helicopters (5) and chess (8). From each class, we randomly select one model and
thus our training set has ten models. For each model, we sample 30,000 points on
its surface, thus there are 300,000 sub-feature vectors as training vectors. The
codebook contains 500 codewords. Each index histogram also consists of 500 bins.
Some samples of 3D model retrieval results are shown in Fig. 3.17, from which
we can see our method is effective.
Fig. 3.17. 3D query models and the four top matches listed from left to right
The global geometry of a 3D model is analyzed by directly sampling the vertex set,
the polygon mesh set, or the voxel set in the spatial domain. Aspect ratio, binary
3D voxel bitmap, and 3D angles of vertices or edges may be considered as the
most simple and straightforward features [42], although their discriminative
powers are limited. These types of analyses generally use PCA-like methods to
align the model into a canonical coordinate frame at first, and then define the
shape representation on this normalized orientation.
The common characteristic of these methods is that they are almost all derived
directly from the elementary unit of a 3D model, that is the vertex, polygon, or
voxel, and a 3D model is viewed and handled as a vertex set, a polygon mesh set
or a voxel set. Their advantages lie in their easy and direct derivation from 3D
data structures, together with their relatively good representation power. However,
the computation processes are usually too time-consuming and sensitive to small features. Also, the storage requirements are too high due to the difficulties in
building a concise and efficient indexing mechanism for them in large model
databases.
Suppose we have a given set of L directional vectors {u1, u2, …, uL}, as shown in
Fig. 3.18. Then the triangle mesh is intersected with the ray emanating from the
origin of the PCA coordinate system and traveling in the direction u_i (i ∈ {1, ..., L}).
The distance to the farthest intersection is taken as the i-th component of the
feature vector which is scaled to the Euclidean unit length to ensure scale
invariance. In Vranić et al.'s experiment, L is set to 20. The vertices of a dodecahedron, centered at the coordinate origin, are taken as the directions. This feature is invariant with respect to rotation and translation because the initial coordinate axes have already been transformed into the canonical (PCA) frame. The scaling invariance is accomplished by normalizing the feature vector.
Fig. 3.18. Illustration of ray-based shape descriptor [53] (With permission of Comenius
University Press)
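A rough sketch of such a ray-based descriptor is given below, using the 20 dodecahedron vertices as directions and the standard Möller–Trumbore ray–triangle intersection test; it assumes the mesh has already been transformed into its PCA coordinate frame, and it is not the implementation used in [53].

    import numpy as np

    def dodecahedron_directions():
        """The 20 unit directions towards the vertices of a regular dodecahedron."""
        p = (1 + np.sqrt(5)) / 2
        verts = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
        verts += [(0, s1 / p, s2 * p) for s1 in (-1, 1) for s2 in (-1, 1)]
        verts += [(s1 / p, s2 * p, 0) for s1 in (-1, 1) for s2 in (-1, 1)]
        verts += [(s1 * p, 0, s2 / p) for s1 in (-1, 1) for s2 in (-1, 1)]
        verts = np.array(verts, dtype=float)
        return verts / np.linalg.norm(verts, axis=1, keepdims=True)

    def ray_triangle_distance(direction, A, B, C, eps=1e-12):
        """Moller-Trumbore test for a ray starting at the origin; returns the hit distance or None."""
        e1, e2 = B - A, C - A
        h = np.cross(direction, e2)
        det = np.dot(e1, h)
        if abs(det) < eps:
            return None
        inv = 1.0 / det
        s = -A                                   # ray origin (0, 0, 0) minus A
        u = np.dot(s, h) * inv
        if u < 0.0 or u > 1.0:
            return None
        q = np.cross(s, e1)
        v = np.dot(direction, q) * inv
        if v < 0.0 or u + v > 1.0:
            return None
        t = np.dot(e2, q) * inv
        return t if t > eps else None

    def ray_based_descriptor(vertices, faces):
        """Farthest intersection along each direction, scaled to unit Euclidean length."""
        dirs = dodecahedron_directions()
        feat = np.zeros(len(dirs))
        for i, d in enumerate(dirs):
            for f in faces:
                t = ray_triangle_distance(d, *vertices[f])
                if t is not None:
                    feat[i] = max(feat[i], t)
        return feat / (np.linalg.norm(feat) + 1e-12)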
After extraction of features, the next step is their formal description. As we know,
the MPEG-7 standard provides a rich set of standardized mechanisms and means
aimed at describing multimedia content. The MPEG-7 terminology has been
adopted and the mutual relation between a descriptor and a feature is explained in
the following definition: A descriptor is a representation of a feature. A descriptor
is used to define the syntax and the semantics of the feature representation [44].
Therefore, the descriptor of the above feature vector is determined with 20
non-negative real numbers, where the i-th component is the object extension in the
direction of the i-th vertex of the mentioned dodecahedron, which is defined (the
vertex coordinates and the numbering) internally. This defines the semantics of the
descriptor. The syntax is defined by description schemes (DS) for real vectors.
MPEG-7 is not a restrictive system for audio-visual content description. It is a
flexible and extensible scope for describing multimedia data with a developed set
of methods and tools. As mentioned in MPEG-7, the 3D Model DS should support “the hierarchical representation of different descriptors in order that queries may be processed more efficiently at successive levels (where N level descriptors complement (N−1) level descriptors)”. Hence, different features at different levels of detail should be considered. Vranić et al. were encouraged by the reflector of the MPEG-7 DS group to implement their own DS for 3D models.
This DS should comply with MPEG-7 specification [44].
Using a similar idea, Yu et al. [45] extracted the 3D global geometry as distance map and surface penetration map features. These two spatial feature maps describe the geometry and topology of the surface patches on the object, while preserving
the spatial information of the patches in the maps. The feature maps capture the
amount of effort required to morph a 3D object into a canonical sphere, without
Fig. 3.19. Computing feature maps. Rays (dashed lines) are shot from the center (white dot) of
a bounding sphere (dashed circle) through the object points (black dots) to the sphere’s surface.
The distance di traveled by the ray from a point pi to the sphere’s surface and the number of
object surfaces (solid lines; 2, in this case) penetrated by the ray since it leaves the sphere’s
center are recorded in the feature maps [45] ([2003]IEEE)
Tangelder et al. proposed a method using weighted point sets as the shape
descriptor for a 3D polygon mesh [46]. They assumed that a 3D shape is
represented by a polyhedral mesh. They do not require the polyhedral mesh to be
closed. Therefore, their method can also handle polyhedral models that may
contain gaps. They also enveloped the object in a 3D voxel grid and represented
the shape as a weighted point set by selecting one representative point for each
non-empty grid cell. They then selected the vertex with the highest Gaussian
curvature or the area-weighted mean of all the vertices in a grid cell, to represent
the model’s geometry features.
Many methods mentioned in previous sections do not take the overall relative
spatial location into account, but throw away some of this information, in order to
deal with data of lower complexity, e.g. 2D views or 1D histograms. What is new
in Tangelder et al.’s method is that they use the overall relative spatial position by
representing the 3D shape as a weighted point set, without taking the connectivity
relations into account. The weighted point sets, which can be viewed as 3D
probability distributions, are compared using a new transportation distance that is
Feature extraction methods based on signal analysis analyze 3D models from the
point of view of the frequency domain. However, because the 3D model is not a
regularly sampled signal, the preprocessing process before feature extraction is
generally complicated. In this section, we would like to introduce three typical
shape descriptors based on transform domains.
We introduce the discrete Fourier transform, Vranić and Saupe's scheme and other schemes.
the input happens to be periodic (forever). Therefore, it is often said that the DFT
is a transform for Fourier analysis of finite-domain discrete-time functions. The
sinusoidal basis functions of the decomposition have the same properties. Since
the input function is a finite sequence of real or complex numbers, the DFT is
ideal for processing information stored in computers. In particular, the DFT is
widely employed in signal processing and related fields to analyze the frequencies
contained in a sampled signal, to solve partial differential equations and to
perform other operations such as convolutions. The DFT can be computed
efficiently in practice using a fast Fourier transform (FFT) algorithm.
The sequence of N complex numbers x_0, ..., x_{N−1} is transformed into the sequence of N complex numbers X_0, ..., X_{N−1} by the DFT according to the formula:
\[ X_k = \sum_{n=0}^{N-1} x_n\, e^{-j \frac{2\pi}{N} k n} , \qquad k = 0, \ldots, N-1 , \qquad (3.42) \]
where e^{−j2π/N} is a primitive N-th root of unity. The inverse discrete Fourier transform (IDFT) is given by
\[ x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k\, e^{\,j \frac{2\pi}{N} k n} , \qquad n = 0, \ldots, N-1 . \qquad (3.43) \]
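In practice the transform pair (3.42)-(3.43) is rarely coded by hand; NumPy's FFT routines, for example, use exactly this sign convention. The following check is illustrative only and is not part of the original text.

    import numpy as np

    x = np.random.rand(8)                 # a finite sequence x_0, ..., x_{N-1}
    X = np.fft.fft(x)                     # Eq. (3.42), computed with an FFT algorithm
    x_back = np.fft.ifft(X)               # Eq. (3.43), the inverse DFT
    assert np.allclose(x, x_back.real)    # the round trip recovers the input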
Vranić and Saupe formerly [49] used a similar voxelization as a feature in the spatial domain with a reasonably small N. The feature vector had N³ components and the L1 or L2 norms were engaged for calculating distances. In [53], their modification is as follows: a greater value of N is selected and the feature is represented in the frequency domain by applying the 3D-DFT to the voxelized model (i.e., to the values calculated in the N³ cells).
Let Q = {q_{ikl} | q_{ikl} ∈ ℝ, −N/2 ≤ i, k, l < N/2} be the set of all voxels. The set Q is transformed into the set G = {g_{uvw} | g_{uvw} ∈ ℂ, −N/2 ≤ u, v, w < N/2} by
\[ g_{uvw} = \sum_{i=-N/2}^{N/2-1} \; \sum_{k=-N/2}^{N/2-1} \; \sum_{l=-N/2}^{N/2-1} q_{ikl}\, e^{-j \frac{2\pi}{N}(iu + kv + lw)} . \qquad (3.44) \]
Finally, we take the absolute values of the coefficients g_{uvw} with indices −K ≤ u, v, w ≤ K (the lowest frequencies). Except for the coefficient g_{000}, all selected complex numbers are pairwise conjugated. Therefore, the feature vector consists of ((2K+1)³+1)/2 real-valued components. In Vranić and Saupe's experiments, they select K = 1, 2, 3, i.e., the descriptors possess 14, 63 and 172 components, respectively.
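A simplified sketch of this frequency-domain feature is shown below; for brevity it keeps all (2K+1)^3 magnitudes instead of discarding the conjugate-redundant half, and the voxel grid is a hypothetical input.

    import numpy as np

    def dft3d_descriptor(voxels, K=2):
        """Magnitudes of the 3D-DFT coefficients with |u|, |v|, |w| <= K (lowest frequencies)."""
        N = voxels.shape[0]
        G = np.fft.fftshift(np.fft.fftn(voxels))   # move the zero frequency to the centre
        c = N // 2
        low = G[c - K:c + K + 1, c - K:c + K + 1, c - K:c + K + 1]
        return np.abs(low).ravel()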
The value of the parameter N (the resolution of the voxelization) should be sufficiently large in order to capture the spatial properties of a model by the 3D DFT. In practice, Vranić and Saupe selected N = 128, and on average about 20,000 voxels (out of the 128³ elements of the set Q) have values greater than zero. This makes the octree representation very efficient. During the 3D-DFT, they computed only those elements of the set G that are used in the feature vector (14, 63 or 172 out of 128³). The proposed descriptor shows better retrieval performance than the voxel-based feature presented in [49]. Having in mind that the ray-based descriptor [49] was improved by incorporating spherical harmonics [54], they inferred that if the L1 or L2 norm is engaged, the representation of a feature in the frequency domain is more efficient than the representation of the same feature in the spatial domain.
Vranić [54] first introduced harmonic analysis into the field of 3D model feature extraction, obtaining a rotation-dependent feature descriptor. Kazhdan et al. [59] improved this scheme, making it rotation invariant. The key idea of this approach
is to describe a spherical function in terms of the amount of energy it contains at
different frequencies. Since these values do not change when the function is
rotated, the resulting descriptor is rotation invariant. This approach can be viewed
as a generalization of the Fourier Descriptor method to the case of spherical
functions. The detailed procedure can be described as follows.
A spherical function f(θ, φ) can be decomposed into its spherical harmonics as
\[ f(\theta, \varphi) = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} a_{lm}\, Y_l^{m}(\theta, \varphi) . \qquad (3.45) \]
The harmonics are visualized in Fig. 3.20. The key property of this decomposition is that if we restrict it to some frequency l and define the subspace of functions
\[ V_l = \mathrm{Span}\!\left( Y_l^{-l}, Y_l^{-l+1}, \ldots, Y_l^{l-1}, Y_l^{l} \right) , \qquad (3.46) \]
we then have the following two properties: (1) V_l is a representation for the rotation group: for any function f ∈ V_l and any rotation R, we have R(f) ∈ V_l. This can also be expressed in the following manner: if π_l is the projection onto the subspace V_l, then π_l commutes with rotations:
\[ \pi_l\big(R(f)\big) = R\big(\pi_l(f)\big) . \qquad (3.47) \]
Using the properties of spherical harmonics and the observation that rotating a spherical function does not change its L2-norm, we represent the energies of a spherical function f(θ, φ) as:
\[ SH(f) = \big\{ \| f_0(\theta, \varphi) \|, \| f_1(\theta, \varphi) \|, \ldots \big\} , \qquad (3.48) \]
where f_l denotes the frequency components of f, as shown in steps (3) and (4) of Fig. 3.21:
\[ f_l(\theta, \varphi) = \pi_l(f) = \sum_{m=-l}^{l} a_{lm}\, Y_l^{m}(\theta, \varphi) . \qquad (3.49) \]
This representation has the property that it is independent of the orientation of the spherical function. To see this, we let R be any rotation and we have:
\[ SH\big(R(f)\big) = \big\{ \| \pi_0(R(f)) \|, \| \pi_1(R(f)) \|, \ldots \big\} = \big\{ \| R(\pi_0(f)) \|, \| R(\pi_1(f)) \|, \ldots \big\} \qquad (3.50) \]
\[ \phantom{SH\big(R(f)\big)} = \big\{ \| \pi_0(f) \|, \| \pi_1(f) \|, \ldots \big\} = SH(f) , \]
so that applying a rotation to a spherical function f does not change its energy
representation.
(Fig. 3.21: harmonic functions and spherical signatures)
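Assuming the spherical harmonic coefficients a_{lm} of the shape function have already been computed (by whatever projection scheme), the energy signature SH(f) of Eq. (3.48) reduces to per-band norms:

    import numpy as np

    def sh_energy_signature(coeffs):
        """SH(f) = { ||f_0||, ||f_1||, ... }: coeffs[l] holds the 2l+1 coefficients of band l.

        The L2-norm of each band is unchanged by rotations of the underlying function,
        which is exactly the rotation invariance argued above.
        """
        return np.array([np.sqrt(np.sum(np.abs(np.asarray(a)) ** 2)) for a in coeffs])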
Kazhdan et al. [59] made their representation still more discriminating by refining the treatment of the second order component. It can be proved that the L2-difference between the quadratic components of two spherical functions is minimized when the two functions are aligned with their principal axes. Thus, instead of describing the constant and quadratic components by the two scalars ‖f_0‖ and ‖f_2‖, Kazhdan et al. [59] represented them by the three scalars a_1, a_2 and a_3, where, after alignment to the principal axes:
\[ f_0 + f_2 = a_1 x^2 + a_2 y^2 + a_3 z^2 . \qquad (3.51) \]
However, care must be taken because, as functions on the unit sphere, x², y² and z² are not orthonormal. By fixing an orthonormal basis {v_1, v_2, v_3} for the span of {x², y², z²}, the harmonic representation SH(f) defined above can be replaced with the more discriminating representation:
\[ SHQ(f) = \big\{ R^{-1}(a_1, a_2, a_3), \| f_1 \|, \| f_3 \|, \ldots \big\} , \qquad (3.52) \]
where R is the matrix whose columns are the orthonormal vectors v_i.
Fig. 3.22. The model (a) obtained by applying a rotation to the interior part of the model (b).
While the models differ by more than a single rotation, their rotation invariant representations are
the same [59] (With courtesy of Kazhdan et al.)
A wavelet can also be used to describe the features of 3D models. Laga et al. [60] for the first time applied the spherical wavelet transform (SWT) to content-based 3D model retrieval. They proposed three new descriptors: the spherical wavelet coefficients as a feature vector (SWC_d), the L1 energy of the spherical wavelet sub-bands (SWEL1), and the L2 energy of the spherical wavelet sub-bands (SWEL2).
Let us first consider the problem of descriptor extraction from the spherical shape
function. Wavelets are basis functions which represent a given signal at multiple
levels of detail, called resolutions. They are suitable for sparse approximations of
functions. In the Euclidean space, wavelets are defined by translating and dilating
one function called mother wavelet. In the S2 space, however, the metric is no
longer Euclidean. Schröder and Sweldens [61] introduced the second generation
wavelets. The idea behind this was to build wavelets with all desirable properties
adapted to much more general settings than real lines and 2D images. The general
wavelet transform of a function is constructed as follows.
Analysis (forward transform):
\[ \lambda_{j,k} = \sum_{l \in K(j)} \tilde{h}_{j,k,l}\, \lambda_{j+1,l} , \qquad \gamma_{j,m} = \sum_{l \in M(j)} \tilde{g}_{j,m,l}\, \lambda_{j+1,l} ; \qquad (3.53) \]
Synthesis (inverse transform):
\[ \lambda_{j+1,l} = \sum_{k \in K(j)} h_{j,k,l}\, \lambda_{j,k} + \sum_{m \in M(j)} g_{j,m,l}\, \gamma_{j,m} , \qquad (3.54) \]
where λ_{j,k} and γ_{j,m} are respectively the approximation and the wavelet coefficients of the function at resolution j. The decomposition filters h̃, g̃ and the synthesis filters h, g denote spherical wavelet basis functions. The forward transform is performed recursively, starting from the shape function λ_n at the finest resolution n, to get λ_j and γ_j at level j, j = n−1, ..., 0. The coarsest approximation λ_{n−i} is obtained after i iterations (0 < i ≤ n). The sets M(j) and K(j) are index sets on the sphere such that K(j) ∪ M(j) = K(j+1), and K(n) = K is the index set at the finest resolution.
To analyze a 3D model, Laga et al. first applied the spherical wavelet transform (SWT) to the spherical shape function and collected the coefficients to construct the shape descriptors.
Laga et al. proposed three methods to compare 3D shapes using their spherical wavelet transform: (1) the wavelet coefficients as a shape descriptor (SWC_d), where the shape signature is built by directly considering the spherical wavelet coefficients; (2) the spherical wavelet energy SWEL1, based on the L1 energy of the wavelet sub-bands; and (3) SWEL2, based on their L2 energy. Fig. 3.24 shows an example model and its three different SW descriptors. The following parts detail each method.
(1) Wavelet coefficients as a shape descriptor. Once the spherical wavelet
transform is performed, one may use the wavelet coefficients as the shape
descriptor. Using the entire set of coefficients is computationally expensive. Instead, we can choose to keep the coefficients up to level d. Thus the obtained shape descriptor is called SWC_d, where d = 0, ..., n−1. In Laga et al.'s implementation, they used d = 3, therefore they obtained two-dimensional feature vectors F of size N = 2^{d+2} × 2^{d+1} = 32 × 16.
Directly comparing wavelet coefficients requires an efficient alignment of the 3D model prior to the wavelet transform. A popular method for finding the reference coordinate frame is pose normalization based on principal component analysis (PCA), as described in Section 3.2. During the preprocessing, they used the maximum area technique to resolve the positive and negative directions of the principal axes. Fig. 3.24 shows the SWC_3 descriptor extracted from the 3D “tree” model. Note that the vector F can provide an embedded multi-resolution representation of 3D shape features. This approach acts as a filter on the 3D shape by removing outliers. A major difference from spherical harmonics is that the SWT preserves the localization and orientation of local features. However, a feature space of dimension 512 is still computationally expensive.
Fig. 3.24. Example of the “tree” model with its spherical wavelet-based descriptors [60]. (a) 3D shape; (b) Associated geometry image; (c) Spherical wavelet coefficients as descriptor (SWC_3); (d) L2 energy descriptor (SWEL2); (e) L1 energy descriptor (SWEL1) ([2006]IEEE)
The L2 and L1 energies of the l-th wavelet sub-band are defined as
\[ F_l^{(2)} = \left( \frac{1}{k_l} \sum_{j=1}^{k_l} x_{l,j}^{2} \right)^{1/2} , \qquad (3.55) \]
\[ F_l^{(1)} = \frac{1}{k_l} \sum_{j=1}^{k_l} \left| x_{l,j} \right| , \qquad (3.56) \]
where x_{l,j} (j = 1, 2, ..., k_l) are the wavelet coefficients of the l-th wavelet sub-band.
Using the observation that rotating a spherical function does not change its energy, Laga et al. proposed to adopt it to build generally rotation invariant shape descriptors. For this purpose, they performed n−1 decompositions and then computed the energy of the approximation A^(1) and the energy of each detail sub-band HV^(l), VH^(l) and HH^(l), yielding a 1D shape descriptor F = {F_l}, l = 0, ..., 3×(n−1), of size N = 3×(n−1)+1. In Laga et al.'s case, they adopted n = 7, therefore N = 19. Laga et al. referred to the L1 and L2 energy-based descriptors as SWEL1 and SWEL2, respectively.
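The energy computations of Eqs. (3.55) and (3.56) are easy to illustrate. The sketch below uses a planar 2D wavelet decomposition (PyWavelets) of a geometry image as a stand-in for the spherical wavelet transform, so it only mimics the structure of the SWEL1/SWEL2 descriptors; with 6 decomposition levels it yields 3×6+1 = 19 sub-band energies, matching N = 19 above.

    import numpy as np
    import pywt

    def wavelet_energy_descriptor(geometry_image, levels=6, wavelet='haar'):
        """L1 and L2 energies of the approximation and of every detail sub-band."""
        coeffs = pywt.wavedec2(geometry_image, wavelet, level=levels)
        bands = [coeffs[0]] + [band for triple in coeffs[1:] for band in triple]
        F_l1 = [np.mean(np.abs(band)) for band in bands]                            # Eq. (3.56)
        F_l2 = [np.sqrt(np.mean(np.asarray(band, float) ** 2)) for band in bands]   # Eq. (3.55)
        return np.array(F_l1), np.array(F_l2)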
The main benefits of this descriptor are its compactness and its rotation
invariance. Therefore, the storage and computation time required for comparison
are reduced. Since Laga et al. adopted the rotation invariant sampling method in
[60], the shape descriptors invariant to general rotations can be obtained. However,
similar to the power spectrum, information such as feature localization is lost in
the energy spectrum.
Note that the above spherical wavelet analysis framework supports retrieval at
different acuity levels. In some situations, only the main structures of the shapes
are required for comparison while, in others, fine details are essential. In the
former case, shape matching can be performed by considering only the wavelet
coefficients on large scales while, in the latter, coefficients on small scales are used.
Hence, the flexibility of the developed method benefits different retrieval
requirements. Finally, Table 3.1 summarizes the length and the retrieval performance of the proposed descriptors [60]. E-measure means the expected number of failures detected. Discounted cumulative gain (DCG) measures the usefulness, or gain, of a document based on its position in the result list; the gain is accumulated from the top of the result list to the bottom, with the gain of each result discounted at lower ranks.
Table 3.1 Performance of SW descriptors on the PSB base test classification [60] ([2006]IEEE)

              Length N   NN     1st-tier   2nd-tier   E-measure   DCG
  SWC_d       512        46.9   31.4       39.7       20.5        65.4
  SWEL1       19         37.3   27.6       35.9       18.6        62.6
  SWEL2       19         30.3   24.9       31.5       16.1        59.4

Values of the length are in bytes, others are in (%). The length refers to the dimension of the feature space.
From this table we can see that SWEL1 and SWEL2 are more efficient in terms of storage requirement and comparison time, and they are also rotation invariant.
consists of many surfaces, there is a large set of spin images generated for each 3D
model. To achieve more concise and compact feature representation, the original
set of spin images is compressed by the PCA method.
Gu et al. and Praun et al. [64, 65] discussed the “geometry image” concept, a
simple 2D array of quantized points with useful attributes, such as vertex positions,
surface normals and textures. In fact, in Chapter 2 we have introduced the concept
of geometry images. Laga et al. [66] applied this method to 3D shape matching by
simplifying the 3D matching problem to measure similarities between
parameterized 2D geometry images. All those methods make use of specific 3D
geometry information from a 3D model in their 2D mapping process.
3.7.1.3 2D Slicing
Fig. 3.25. Slice-based shape representation, where the shape on the right is reconstructed with
more slices than the middle one [67] ([2004]IEEE)
3.26(g) and (h). The latter’s ability to handle occlusion comes from the way the
boundary mapping is constructed when mapping the boundary of D(v, R) onto the
boundary of P. Because of the boundary mapping, the images remain
approximately the same in the presence of occlusion. From the above generation process, it can be seen that the only requirement imposed on creating harmonic shape images is that the underlying surface patch is connected and without holes. This requirement is called the topology constraint.
Fig. 3.26. Examples of surface patches and harmonic shape images [68]. (a), (e) Surface patches on a given surface; (b), (f) The surface patches in wireframe; (c), (g) Their harmonic images; (d), (h) Their harmonic shape images (With courtesy of Zhang and Hebert)
Harmonic shape images have some properties that are important for surface
matching. They are unique and their existence is guaranteed for any valid surface
patches. More importantly, those images preserve both the shape and the
continuity of the underlying surfaces. Furthermore, harmonic shape images are not
designed specifically for representing surface shapes. Instead, they provide a
general framework to represent surface attributes such as surface normal, color,
texture and material. Harmonic shape images are discriminative and stable, and
they are robust with respect to surface sampling resolution and occlusion.
Extensive experiments have been conducted to analyze and demonstrate the
properties of harmonic shape images in [68].
Compared with the methods in Subsection 3.7.1, the 2D mapping methods that
establish mappings from a 3D view to a set of specific 2D planar views from
different angles are much more natural and simple. The basic idea is that if two 3D
shapes are similar, they should be similar from many different views. Thus, 2D
shapes, such as 2D silhouettes, can be extracted and adopted for 3D shape
matching. There is a prolific amount of literature on these particular techniques.
Chen et al. [73] borrowed the concept of the “light field” from image-based rendering and proposed a light field descriptor that represents the 4D light field of a 3D model with a collection of 2D images captured by a set of uniformly distributed cameras. When measuring the similarity between the descriptors of two 3D models, the cameras are rotated many times, as shown in Fig. 3.28, so as to be switched onto different vertices. The final 3D model retrieval results are combined from the matching results of all the acquired 2D images by integrating 2D Zernike moments and Fourier descriptors.
Fig. 3.28. (a)-(d) Rotation and comparison in a light field [73] (With permission of Chen)
Ohbuchi et al. [74] presented a similar method. They generated a depth or z-value
image of a 3D model from multiple viewpoints that are equally spaced on the unit
sphere. The 3D model matching is then performed by adopting a 2D Fourier
descriptor [70] for similarity matching of 2D images. The main difference is that
Chen’s 2D image only contains silhouettes while Ohbuchi’s has depth information.
Fig. 3.29 depicts Ohbuchi’s feature extraction process. The depth image is first
mapped from Cartesian coordinates into polar coordinates to perform the Fourier
transformation before Fourier descriptors are computed.
(Fig. 3.29: the depth image g(r, θ) resampled in polar coordinates (r, θ) and its Fourier spectrum G)
Since many more features can be extracted for a 2D shape, the function
mapping methods make the retrieval process more flexible. They can also largely
reduce the complexity of feature computation and make the feature descriptor
more compact. However, this inevitably causes much loss of important 3D
information, since the function mapping process is restricted by different
constraints. Moreover, for 2D planar view mapping, how to decide the necessary
number of 2D projection views is another problem in practice [71].
3.8.1 Introduction
Hilaga et al. [77] proposed a novel technique, called topology matching, in which
similarity between polyhedral models is quickly, accurately and automatically
calculated by comparing multi-resolution Reeb graphs (MRGs). The basic idea of
MRGs can be introduced as follows.
The Reeb graph is constructed with respect to a continuous function μ defined on the object surface; for example, the height function is given by
\[ \mu\big(v(x, y, z)\big) = z . \qquad (3.57) \]
Most existing studies have used the height function as the function μ for generating the Reeb graph. Fig. 3.30 shows the distribution of the height function
on the surface of a torus and the corresponding Reeb graph. In the left figure, the
red and blue coloring represents minimum and maximum values, respectively, and
the black lines represent the isovalued contours. The Reeb graph in the right figure
corresponds to connectivity information for these isovalued contours.
Fig. 3.30. Torus (a) and its Reeb graph (b) using a height function [77] (2001, Association for
Computing Machinery, Inc. Reprinted by permission)
The basic idea of the MRG is to develop a series of Reeb graphs for an object at various levels of detail. To construct a Reeb graph for a certain level, the object is partitioned into regions based on the function μ. A node of the Reeb graph
represents a connected component in a particular region, and adjacent nodes are
linked by an edge if the corresponding connected components of the object contact
each other. The Reeb graph for a finer level is constructed by re-partitioning each
region. In topology matching, the re-partitioning is done in a binary manner for
simplicity. Fig. 3.31 shows an example where a height function is employed as the
function for convenience. In Fig. 3.31(a), there is only one region r0 and one
connected component s0. Therefore, the Reeb graph consists of one node n0 that
corresponds to s0. In Fig. 3.31(b), the region r0 is re-partitioned into r1 and r2,
producing connected components s1 and s2 in r1, and s3 in r2. The corresponding
nodes are n1, n2 and n3 respectively. According to the connectivities of s1, s2 and s3,
edges are generated between n1 and n3, and also between n2 and n3. Finer levels of
the Reeb graph are constructed in the same manner, as shown in Fig. 3.31(c). The
MRG has the following properties:
Property 1 There are parent-child relationships between nodes of adjacent
levels. In Fig. 3.31, the node n0 is the parent of n1, n2 and n3, and the node n1 is the
parent of n4 and n6, etc.
Property 2 By repeating the re-partitioning, the MRG converges to the
original Reeb graph as defined by Reeb. That is, finer levels approximate the
original object more exactly.
Property 3 A Reeb graph of a certain level implicitly contains all of the
information of the coarser levels. Once a Reeb graph is generated at a certain
resolution level, a coarser Reeb graph can be constructed by unifying adjacent
nodes. Consider the construction of the Reeb graph shown in Fig. 3.31(b) from
that shown in Fig. 3.31(c) as an example. The nodes {n4, n6} are unified to n1, {n5, n7, n8} to n2, and {n9, n10, n11} to n3. Note that the unified nodes satisfy the
parent-child relationship.
Using the above three properties, MRGs are easily constructed and a similarity
between objects can then be calculated using a coarse-to-fine strategy of different
resolution levels as described in [77].
Fig. 3.31. Multi-resolution Reeb graph [77]. (a) With one node; (b) With three nodes; (c) With finer levels (2001, Association for Computing Machinery, Inc. Reprinted by permission)
The MRG uses a continuous function based on the distribution of the geodesic distance, which is defined as follows:
\[ \mu(v) = \int_{p \in S} g(v, p)\, \mathrm{d}S , \qquad (3.58) \]
where g(v, p) denotes the geodesic distance between v and p on the surface S. The function is then normalized as
\[ \mu_n(v) = \frac{ \mu(v) - \min_{p \in S} \mu(p) }{ \max_{p \in S} \mu(p) } . \qquad (3.59) \]
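Eq. (3.58) is usually evaluated approximately. A rough sketch, assuming the mesh is given as vertex and edge arrays and approximating geodesic distances by shortest paths along mesh edges from a small set of base vertices (a crude stand-in for Hilaga et al.'s base-point scheme), could look as follows:

    import numpy as np
    from scipy.sparse import coo_matrix
    from scipy.sparse.csgraph import dijkstra

    def mu_function(vertices, edges, n_base=32, seed=0):
        """Approximate mu(v) of Eq. (3.58) and normalise it as in Eq. (3.59)."""
        i, j = edges[:, 0], edges[:, 1]
        w = np.linalg.norm(vertices[i] - vertices[j], axis=1)     # edge lengths
        n = len(vertices)
        graph = coo_matrix((np.r_[w, w], (np.r_[i, j], np.r_[j, i])), shape=(n, n))
        base = np.random.default_rng(seed).choice(n, size=min(n_base, n), replace=False)
        d = dijkstra(graph.tocsr(), indices=base)                 # distances from the base vertices
        mu = d.sum(axis=0)                                        # crude approximation of the integral
        return (mu - mu.min()) / mu.max()                         # Eq. (3.59)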
The MRG feature is invariant to translation and rotation, and robust against changes in topological structure caused by mesh simplification or subdivision; consequently, it remains discriminative across different levels of detail. However, the MRG lacks the ability to correctly distinguish the corresponding parts of 3D models.
In [88], Sundar et al. encoded the geometric and topological information in the
form of a skeletal graph and used graph matching techniques to match the
skeletons and to compare them. The skeletal graphs can be manually annotated to
refine or restructure the search. This is a directed graph structure adopted to
represent the skeleton of a 3D volumetric model [88], where an edge is directional
according to a principle similar to a shock graph [89]. The skeleton is a nice shape
descriptor because it can be utilized in the following ways:
(1) Part/Component matching. In contrast to a global shape measure,
skeleton-matching can accommodate part-matching, i.e. whether the object to be
matched can be found as part of a larger object, or vice versa. This feature can
potentially give the users flexibility towards the matching algorithm, allowing
them to specify what part of the object they would like to match or whether the
matching algorithm should weight one part of the object more than another.
(2) Visualization. The skeleton can be used to register one object to another
and visualize the result. This is very important in scientific applications where one
is interested in both finding a similar object and understanding the extent of the
similarity.
(3) Intuitiveness. The skeleton is an intuitive representation of shape and can
be understood by the user, allowing the user more control in the matching process.
(4) Articulation. The method can be used for articulated object matching,
because the skeleton topology does not change during articulated motion.
(5) Indexing. We can index the skeletal graph for restricting the search space
for the graph matching process.
The steps in the skeletal graph matching process include: obtaining a volume,
computing a set of skeletal nodes, connecting the nodes into a graph, and then
indexing into a database and/or verification with one or more objects. The results
of the match are then visualized. Here we focus on the construction of the skeleton
and preliminary results of using the graph matching in conjunction with
skeletonization.
The term skeleton has many meanings. It generally refers to a “central-spine” or “stick-figure” like representation of an object, where the lines are centered within the 3D/2D object. For 2D objects, the skeleton is related to the medial axis of the 2D picture. For 3D objects, a medial surface is computed. To use graph matching, what is needed is a medial core/skeleton, also known as a curve-skeleton, which can be represented as a graph. The method utilized in [88] is a parameter-based
thinning algorithm. This algorithm thins the volumes to a desired threshold based
on a parameter given by the user. A family of different point sets can be obtained,
each one thinner than its parent. This point set, termed skeletal voxels, is
unconnected and must be connected to form an appropriate stick-figure
representation. In what follows, we describe the various steps necessary to
compute the skeleton/graph representation.
First, a volumetric cube is thinned into a skeletal-graph, a line-like sketch
composed of the points on the medial axis of the medial surface planes. Then a
clustering algorithm is implemented on the thinned voxels to increase the
robustness against small perturbations on the surface and to reduce the number of
graph nodes. An undirected acyclic graph is first generated out of the skeletal
points by applying the minimum spanning tree (MST) algorithm. After that, the
directed graph is finally constructed by directing the edge from a voxel with the
higher distance to the one with the lower distance. Here the distance means the
minimum distance from a voxel to the boundary of the volumetric object. Fig.
3.32 shows two examples of skeletal graphs.
Fig. 3.32. Sample skeletal graphs: In the upper row, different volumes are shown. At the
bottom are the resulting skeletal graphs [88] ([2003]IEEE)
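A compact sketch of the graph-construction step described above (a minimum spanning tree over the skeletal voxels, with edges then directed from larger to smaller distance-to-boundary) might look as follows; the inputs are hypothetical and the clustering step is omitted.

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.sparse import coo_matrix
    from scipy.sparse.csgraph import minimum_spanning_tree

    def skeletal_graph(skeletal_voxels, boundary_distance, k=6):
        """Build an undirected MST over the voxels, then direct each edge downhill in distance."""
        n = len(skeletal_voxels)
        tree = cKDTree(skeletal_voxels)
        dists, idx = tree.query(skeletal_voxels, k=k + 1)   # the first neighbour is the point itself
        rows = np.repeat(np.arange(n), k)
        cols = idx[:, 1:].ravel()
        vals = dists[:, 1:].ravel()
        knn = coo_matrix((vals, (rows, cols)), shape=(n, n))
        mst = minimum_spanning_tree(knn).tocoo()            # undirected acyclic graph
        directed = []
        for a, b in zip(mst.row, mst.col):
            if boundary_distance[a] >= boundary_distance[b]:
                directed.append((int(a), int(b)))
            else:
                directed.append((int(b), int(a)))
        return directed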
The 3D models usually possess multimodal feature descriptors. Besides the shape
features, the appearance attributes of 3D models such as material color, color
distribution and texture, are also an important part of content-based 3D model
retrieval. In particular, color and texture databases are necessary to render 3D models.
3.9.1 Introduction
Suzuki et al. evaluated another appearance feature representation using the surface
textures of 3D models where the higher order local autocorrelation (HLAC) masks
are extracted as texture features [91].
2D HLAC has been used as a feature descriptor for various 2D image pattern recognition applications. It is well known that the autocorrelation function is shift-invariant. The N-th order autocorrelation functions with N displacements a_1, ..., a_N are defined by
\[ x_N^{m}(a_1, \ldots, a_N) = \int P^{m}\langle \mathbf{r} \rangle\, P^{m}\langle \mathbf{r} + a_1 \rangle \cdots P^{m}\langle \mathbf{r} + a_N \rangle\, \mathrm{d}\mathbf{r} , \qquad (3.60) \]
where the function P^m⟨r⟩ denotes the m-th order PARCOR coefficient of the pixel ⟨r⟩ = ⟨x, y⟩.
Since the number of these autocorrelation functions obtained by combining the displacements over the PARCOR images P^m is enormous, we must reduce them for practical applications. First, we restrict the order N up to the second, i.e., N = 0, 1, 2. We also restrict the range of displacements to a local 3×3 window, the center of which is the reference point. By eliminating the displacements which are equivalent up to a shift, the number of displacement patterns is reduced to 25.
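The mechanics of evaluating an HLAC feature for a given displacement set follow directly from Eq. (3.60). The sketch below shows this for a toy binary image and a few example masks only; it does not reproduce the full catalogue of 25 (2D) or 251 (3D) patterns.

    import numpy as np

    def hlac_feature(image, displacements):
        """Sum over reference pixels r of image[r] * prod_k image[r + a_k] (cf. Eq. 3.60)."""
        h, w = image.shape
        prod = image[1:h - 1, 1:w - 1].astype(float)        # reference pixels with a valid 3x3 window
        for dy, dx in displacements:
            prod = prod * image[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        return prod.sum()

    # a few example displacement sets: order 0, one order-1 mask and one order-2 mask
    example_masks = [[], [(0, 1)], [(-1, -1), (1, 1)]]
    image = (np.random.rand(64, 64) > 0.5).astype(float)    # a toy binary image
    features = [hlac_feature(image, mask) for mask in example_masks]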
Although the HLAC mask patterns were previously applied to 2D images, they
have not been applied to 3D models or volume data. Suzuki et al. extended 2D
HLAC mask patterns to 3D HLAC mask patterns, and this method enables masks
to extract features from 3D models. 3D HLAC mask patterns are generated by
using a simulation program, and 251 patterns have been found, about 10 times more than the number of 2D HLAC mask patterns. By using these 3D HLAC mask
patterns, the search system can perform efficient retrieval.
3.10 Summary
In this chapter, we have discussed six types of feature extraction methods for 3D
models. It should be borne in mind that these methods are not absolutely
independent and isolated. In fact, many of them are quite interdependent. The
purpose of our taxonomy is to provide a rational and comprehensible classification
and summarization of the existing research literature. Currently, most of the work
on shape feature extraction places emphasis on geometrical and surface
topological properties of 3D shape features, based on surfaces, voxels, vertex sets,
and structural shape models. Generally, geometrical features usually represent the
specific shape and spatial position of surfaces, edges and vertices, while
topological features maintain the linking relationship between surfaces, edges and
vertices.
The common characteristic of global-geometrical-analysis-based methods is
that they are almost all derived directly from the elementary unit of a 3D model,
that is the vertex, polygon, or voxel, and a 3D model is viewed and handled as a
vertex set, a polygon mesh set or a voxel set. Their advantages lie in their easy and
direct derivation from 3D data structures, together with their relatively good
representation power. However, the computation processes are usually too
time-consuming and sensitive to small features. Also, the storage requirements
are too high due to the difficulties in building a concise and efficient indexing
mechanism for them in large model databases.
The spherical mapping based methods produce invariant shape features, which
avoids the time-consuming canonical coordinate normalization process in feature
extraction. However, they also have some shortcomings. Firstly, it is generally
assumed that a 3D model will have valid topology (for meshes), or explicit
volume (for volumetric models), which cannot be guaranteed in practice. Secondly,
the spherical function mapping process is complicated and time-consuming. Since
many more features can be extracted for a 2D shape, the function mapping
methods make the retrieval process more flexible. They can also largely reduce the
complexity of feature computation and make the feature descriptor more compact.
However, this inevitably causes much loss of important 3D information, since the
function mapping process is restricted by different constraints. Moreover, for 2D
planar view mapping, how to decide the necessary number of 2D projection views
is another problem in practice.
Many statistical shape feature descriptors are simple to compute and useful for
keeping invariant properties. In many cases they are also robust against noise, or
the small cracks and holes that exist in a 3D model. Unfortunately, as an inherent
drawback of a histogram representation, they provide only limited discrimination
between objects: They neither preserve nor construct spatial information. Thus,
they are often not discriminating enough to capture the small differences between dissimilar 3D shapes and usually fail to distinguish different shapes having the
same histogram.
The topological and skeletal shape features are attractive for 3D retrieval
because they are able to capture the significant shape structures of a 3D object.
Meanwhile, they are relatively high-level and close to human intuitive perception,
which makes them useful for defining more natural 3D query representation. They
can also perform part-matching tasks by containing both local and global
structural properties. For some kinds of topological representations, they are also
robust against the LOD structure of 3D models due to their multiresolution
properties. However, 3D models are not always defined well enough to be easily
and naturally decomposed into a canonical set of features or basic shapes. In
addition, the decomposition process is usually computationally expensive.
Moreover, model decomposition processes are quite sensitive to small perturbations (noise) of the model. Thus, extra effort is, in turn, required to handle them.
Finally, compared with the comparatively straightforward indexing and similarity
matching algorithms based on numeric feature vectors, the indexing and matching
References
[1] Y. K. Lai, Q. Y. Zhou, S. M. Hu, et al. Robust feature classification and editing.
IEEE Transactions on Visualization and Computer Graphics, 2007, 13(1):34-45.
[2] H. T. Ho and D. Gibbins. Multi-scale feature extraction for 3D models using local
surface curvature. In: Digital Image Computing: Techniques and Applications
(DICTA’2008), 2008, pp. 16-23.
[3] C. B. Akgül, B. Sankur, Y. Yemez, et al. Density-based 3D shape descriptors.
EURASIP Journal on Advances in Signal Processing, 2007, pp. 1-16.
[4] C. Cui, D. Wang and X. Yuan. Feature extraction of 3D model based on fuzzy
clustering. In: Proceedings of the SPIE, 2005, Vol. 5637, pp. 559-566.
[5] Y. Yang, H. Lin and Y. Zhang. Content-based 3-D model retrieval: A survey.
IEEE Transactions on Systems, Man and Cybernetics-Part C: Applications and
Reviews, 2007, 37(6):1081-1098.
[6] J. L. Martínez, A. Reina and A. Mandow. Spherical laser point sampling with
application to 3D scene genetic registration. In: 2007 IEEE International
Conference on Robotics and Automation, 2007, pp. 1104-1109.
[7] T. Hlavaty and V. Skala. A survey of methods for 3D model feature extraction.
Bulletin of IV Seminar Geometry and Graphics in Teaching Contemporary
Engineer, 2003, 13(3):5-8.
[8] K. Siddiqi, A. Shokoufandeh, S. J. Dickinson, et al. Shock graphs and shape
matching. In: Proceedings of the IEEE International Conference on Computer
Vision (ICCV’98), 1998, pp. 222-229.
[9] T. Tung and F. Schmitt. The augmented multiresolution Reeb graph approach for
content-based retrieval of 3D shapes. International Journal of Shape Modeling,
2005, 11(1):91-120.
[10] H. Sundar, D. Silver, N. Gagvani, et al. Skeleton based shape matching and
retrieval. In: Proceedings of the International Conference on Shape Modeling and
Applications (SMI’03), 2003, pp. 130-139.
[11] S. Kang and K. Ikeuchi. The complex EGI: a new representation for 3-D pose
determination. IEEE Transactions on Pattern Analysis and Machine Intelligence,
1993, 15(7):707-721.
[12] E. Paquet and M. Rioux. Nefertiti: a query by content software for
three-dimensional models databases management. In: Proceedings of the 1st
International Conference on Recent Advances in 3-D Digital Imaging and
[31] E. Paquet and M. Rioux. A content-based search engine for VRML databases. In:
Proc. IEEE Int. Conf. Comput. Vis. and Pattern Recognit., Santa Barbara, CA,
USA, 1998, pp. 541-546.
[32] MPEG Video Group. MPEG-7 Visual Part of eXperimentation Model (version
9.0 ed.). Pisa, Italy, 2001.
[33] M. T. Suzuki, T. Kato and N. Otsu. A similarity retrieval of 3D polygonal models
using rotation invariant shape descriptors. Paper presented at The IEEE
International Conference on Systems, Man, and Cybernetics, 2000, pp.
2946-2952.
[34] R. Osada, T. Funkhouser, B. Chazelle, et al. Shape distributions. ACM
Transactions on Graphics, 2002, 21(4):807-832.
[35] J. L. Shih, C. H. Lee and J. T. Wang. 3D object retrieval system based on grid D2.
Electronics Letters, 2005, 41(4):179-181.
[36] J. J. Song and F. Golshani. Shape-based 3D model retrieval. In: Proc. 15th IEEE
Int. Conf. Tools Artif. Intell., 2003, pp. 636-640.
[37] B. K. P. Horn. Extended Gaussian Image. In: Proc. of IEEE, 1984,
72(12):1671-1686.
[38] H. Luo, J. S. Pan, Z. M. Lu, et al. A new 3D shape descriptor based on rotation.
Paper presented at The Sixth International Conference on Intelligent Systems
Design and Applications (ISDA2006), 2006.
[39] R. Ohbuchi, T. Minamitani and T. Takei. Shape-similarity search of 3D models by
using enhanced shape functions. International Journal of Computer Applications
in Technology, 2005, 23(2/3/4):70-85.
[40] Z. M. Lu, H. Luo and J. S. Pan. 3D model retrieval based on vector quantization
index histograms. Paper presented at The 4th International Symposium on
Instrumentation Science and Technology (ISIST’2006), 2006.
[41] Y. Linde, A. Buzo and R. M. Gray. An algorithm for vector quantizer design.
IEEE Trans. Communications, 1980, 28(1):84-95.
[42] L. Kolonias, D. Tzovaras, S. Malassiotis, et al. Fast content based search of
VRML models based on shape descriptors. In: Proc. IEEE Int. Conf. Image
Process., 2001, Vol. 2, pp. 133-136.
[43] D. V. Vranić and D. Saupe. 3D model retrieval. Paper presented at The Spring
Conf. Comput. Graph. (SCCG 2000), 2000.
[44] MPEG Requirements Group. Overview of the MPEG-7 Standard. Doc.
ISO/MPEG N3158, Maui, Hawaii, 1999.
[45] M. Yu, I. Atmosukarto, W. K. Leow, et al. 3D model retrieval with morphing-
based geometric and topological feature maps. In: Proc. IEEE Conf. Comput. Vis.
Pattern Recognit., 2003, pp. 656-661.
[46] J. Tangelder and R. Veltkamp. Polyhedral model retrieval using weighted point
sets. Int. J. Image Graph., 2003, 3:1-21.
[47] Y. Rubner, C. Tomasi and L. J. Guibas. A metric for distributions with
applications to image databases. Paper presented at The IEEE Int. Conf. on
Computer Vision, 1998, pp. 59-66.
[48] J. Rossignac and P. Borrel. Multi-resolution 3D approximation for rendering
complex scenes. Geometric Modeling in Computer Graphics, 1993, pp. 455-465.
[49] M. Heczko, D. Keim, D. Saupe, et al. A method for similarity search of 3D
objects (in German). In: Proc. BTW, 2001, pp. 384-401.
[50] V. Cicirello and W. Regli. Machining feature-based comparisons of mechanical
parts. In: Proc. Int. Conf. Shape Model. Appl., 2001, pp. 176-185.
[51] D. McWherter, M. Peabody, W. Regli, et al. Transformation invariant shape
similarity comparison of solid models. Paper presented at The ASME DETC,
Pittsburgh, PA, 2001.
[52] C. Zhang and T. Chen. Efficient feature extraction for 2D/3D objects in mesh
representation. Paper presented at The ICIP, 2001.
[53] D. Vranić and D. Saupe. 3D shape descriptor based on 3D Fourier transform. In:
The EURASIP Conference on Digital Signal Processing for Multimedia
Communications and Services, 2001, pp. 271-274.
[54] D. Vranić, D. Saupe and J. Richter. Tools for 3D-object retrieval:
Karhunen-Loeve transform and spherical harmonics. In: Proc. IEEE 2001
Workshop Multimedia Signal Process, Cannes, France, 2001, pp. 293-298.
[55] K. Arbter, W. E. Snyder, H. Burkhardt, et al. Application of affine invariant
Fourier descriptors to recognition of 3-D objects. IEEE Trans. on Pattern Analysis
and Machine Intelligence, 1990, 12(7):640-647.
[56] C. W. Richard and H. Hemami. Identification of 3D objects using Fourier
descriptors of the boundary curve. IEEE Transactions on Systems, Man, and
Cybernetics, 1974, 4(4):371-378.
[57] H. Zhang and E. Fiume. Shape matching of 3D contours using normalized
Fourier descriptors. Paper presented at International Conference on Shape
Modeling and Applications, 2002, pp. 261-271.
[58] J. Sijbers, T. Ceulemans and D. van Dyck. Efficient algorithm for the
computation of 3D Fourier descriptors. Paper presented at The 1st International
Symposium on 3D Data Processing Visualization and Transmission, 2002, pp.
640-643.
[59] M. Kazhdan, T. Funkhouser and S. Rusinkiewicz. Rotation invariant spherical
harmonic representation of 3D shape descriptors. Paper presented at The
Eurographics/ACM Siggraph Symposium on Geometry Processing, 2003, pp.
156-164.
[60] H. Laga, H. Takahashi and M. Nakajima. Spherical wavelet descriptors for
content-based 3D model retrieval. Paper presented at The IEEE International
Conference on Shape Modeling and Applications, 2006, pp. 15-25.
[61] P. Schröder and W. Sweldens. Spherical wavelets: efficiently representing
functions on the sphere. In: SIGGRAPH’95: Proceedings of the 22nd Annual
Conference on Computer Graphics and Interactive Techniques, 1995, pp.
161-172.
[62] G. van de Wouwer, P. Scheunders and D. van Dyck. Statistical texture
characterization from discrete wavelet representations. IEEE Transactions on
Image Processing, 1999, 8(4):592-598.
[63] A. Johnson and M. Hebert. Using spin images for efficient object recognition in
cluttered 3D scenes. IEEE Trans. Pattern Anal. Mach. Intell., 1999,
21(5):433-449.
[64] X. Gu, S. Gortler and H. Hoppe. Geometry images. In: Proc. ACM Siggraph,
2002, pp. 355-361.
[65] E. Praun and H. Hoppe. Spherical parametrization and remeshing. In: Proc.
SIGGRAPH, 2003, pp. 340-349.
[66] H. Laga, H. Takahashi and M. Nakajima. Geometry image matching for
similarity estimation of 3D shapes. In: Proc. Comput. Graph. Int., Crete, Greece,
[88] H. Sundar, D. Silver, N. Gagvani, et al. Skeleton based shape matching and
retrieval. In: Proc. Shape Model. Int., 2003, pp. 130-139.
[89] K. Siddiqi, A. Shokoufandeh, S. Dickinson, et al. Shock graphs and shape
Matching. Comput. Vis., 1998, pp. 222-229.
[90] M. Suzuki. A web-based retrieval system for 3D polygonal models. In: Proc.
Joint 9th IFSA World Congr. 20th NAFIPS (IFSA/NAFIPS 2001), 2001, pp.
2271-2276.
[91] M. Suzuki, Y. Yaginuma and Y. Shimizu. A texture similarity evaluation method
for 3D models. In: Proc. Int. Conf. Internet Multimedia Syst. Appl. (IMSA 2005),
2005, pp. 185-190.
4 Content-Based 3D Model Retrieval
4.1 Introduction
4.1.1 Background
If we view audio as the first wave of multimedia, images as the second wave of
multimedia and video as the third wave of multimedia, then we can regard 3D
digital models and 3D scenes as the fourth wave of multimedia. Unlike 2D images,
3D models are capable of overcoming the illusion problem caused by the human
eye, and therefore object segmentation becomes less error-prone and easier to
achieve. Modern computer technology and powerful computing capacity, together
with new acquisition and new modeling tools, make it much easier and cheaper to
create and process 3D models with basic hardware, resulting in an increasing
number of 3D models from various sources, such as those over the Internet and
those from professional 3D model databases in the areas of biology, medicine,
chemistry, archaeology and geography, and so on. In the past two decades, tools
for retrieving and visualizing complex 3D models have become an integrated part
significant research aspects in 3D model retrieval systems, and has become one
aspect of MPEG-7 standards [7]. The key problem in similarity comparison
between two 3D models is to generate shape descriptors that can form an index
conveniently and achieve geometry shape matching effectively. In general, 3D
descriptors should possess the following four characteristics: transformation
invariance, high-speed computing, convenient index structures and easy storage.
Since there are many kinds of specialized 3D models in different domains, the
relevant research work, including versatile shape representations and similarity
measures, may also affect the retrieval task in different ways. As a result, when
considering the performance evaluation issue, the first response will be to define a
relatively common and general-purpose 3D model collection as a benchmark
database, in order to define a common method to provide relevance judgments.
Currently, there are several representative 3D model databases for the purpose
of performance evaluation, among which the Princeton Shape Benchmark (PSB)
[8] is perhaps the most popular and well-organized one. PSB is a publicly available
3D model benchmark database containing 1,814 classified 3D models, which have
been collected from the Internet and organized into hierarchical semantic
classifications by experts. PSB provides us with separate training and test sets, and
each 3D model has a set of annotations. Fig. 4.1 shows some samples from the
PSB.
Besides the PSB, some other 3D model databases, which contain a wide
variety of 3D objects that have been independently gathered by different research
groups, can also be employed as standard benchmarks. These include the Utrecht
databases [9], MPEG-7 databases [10] and Taiwan databases [11]. In addition,
there are also several benchmark databases constructed for specific domains, e.g.,
CAD models [12] and 3D protein structures [13]. To know more detailed statistics
for most currently available 3D model databases, readers can refer to [8].
Unfortunately, since most 3D model databases primarily focus on 3D shapes,
there are currently no standard benchmark databases constructed for appearance
attributes, such as color, texture and light distribution. Although the PSB can
partially perform this function, it is still neither ideal nor optimal.
The two most common evaluation measures adopted in 3D model retrieval are
precision and recall, which were introduced from the information retrieval (IR)
community and have been widely employed to evaluate image retrieval systems
[14]. Given a query model belonging to the category C, precision measures the
ability of the system to retrieve models from C, thus precision can be defined as
follows:
\[ \mathrm{precision} = \frac{N_{rc}}{N_r} , \qquad (4.1) \]
where Nrc is the number of retrieved models belonging to C and Nr is the number
of retrieved models. On the other hand, recall measures how many relevant
models are retrieved to answer a query, thus recall is defined as
\[ \mathrm{recall} = \frac{N_{rc}}{N_c} , \qquad (4.2) \]
where Nc is the number of models in C. Fig. 4.2 shows the relationship between
precision and recall.
In general, recall and precision are in a trade-off relationship. If one goes up,
the other usually comes down. As the standard database is designed for
similarity-based search, on the one hand, if the similarity matching criteria are rather strict, precision tends to be high while recall drops. On the other hand, if the matching criteria are too loose, most of the retrieved 3D models are useless.
Precision and recall can be separately used to evaluate the retrieval
performance, e.g., the graph of precision vs. the number of retrieved models, or
the graph of recall vs. the number of retrieved models. They can be also combined
as a “precision-recall” (P-R) graph [15], which shows how precision falls and how
recall rises as more and more 3D objects are retrieved. Fig. 4.3 gives a vivid
example of achieving the P-R graph. Here, we assume that there are five 3D
models in the same class as the query model, i.e. Nc = 5. With an increase in
retrieved models, the precision value decreases but the recall value increases. The
closer the precision value is to 1, the better the performance is obtained. Moreover,
the performance can also be evaluated from some other aspects based on the P-R
graph, such as effectiveness and robustness [16]. However, since “relevant” and
“irrelevant” are both judged subjectively by users, this evaluation is naturally
subjective.
Besides the P-R graph, to integrate the precision and recall criteria, another
commonly-used criterion is the F1 score [17]. In statistics, the F1 score (also
F-score or F-measure) is a measure of a test's accuracy, which considers both the
precision and the recall of the test. The F1 score can be interpreted as a weighted
average of the precision and recall values, where an F1 score reaches its best value
at 1 and its worst score at 0. The traditional F-measure or balanced F-score (F1
score) is the harmonic mean of the precision and recall values, which can be defined
as follows:
\[
F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \times 100\%. \tag{4.3}
\]
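As an illustration, the following minimal Python sketch (not taken from the book) evaluates Eqs. (4.1)-(4.3) for a single query and also produces the points of a P-R graph; the function names and the representation of the retrieval results as id collections are assumptions made for this example.

```python
# Minimal sketch: precision, recall and F1 of Eqs. (4.1)-(4.3) for one query.

def precision_recall_f1(retrieved, relevant):
    """retrieved: ranked list of model ids returned by the system;
    relevant: set of model ids belonging to the query's class C."""
    n_r = len(retrieved)                                    # number of retrieved models
    n_rc = sum(1 for m in retrieved if m in relevant)       # retrieved AND relevant
    n_c = len(relevant)                                     # size of class C
    precision = n_rc / n_r if n_r else 0.0                  # Eq. (4.1)
    recall = n_rc / n_c if n_c else 0.0                     # Eq. (4.2)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                   # Eq. (4.3)
    return precision, recall, f1

def pr_curve(ranked, relevant):
    """P-R graph: evaluate the measures after each additional retrieved model."""
    return [precision_recall_f1(ranked[:k], relevant)[:2]
            for k in range(1, len(ranked) + 1)]
```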
Fig. 4.4. A typical “best matches” evaluation measure (With courtesy of Shilane et al.)
The measure of “distance image” is an image of the distance matrix where the
lightness of each pixel (i, j) is proportional to the magnitude of the distance
between models Mi and Mj [19]. Models are grouped by class along each axis, and
lines are added to separate classes, which makes it easy to evaluate patterns in the
match results qualitatively. The optimal result is a set of darkest, class-sized
blocks of pixels along the diagonal indicating that every model matches the
models within its class better than those in other classes. Otherwise, the reasons
for poor match results can often be seen in the image, e.g., off-diagonal blocks of
dark pixels indicate that two classes match each other well. A typical example is
shown in Fig. 4.5.
Fig. 4.5. A typical “distance image” evaluation measure (With courtesy of Shilane et al.)
The measure of “tier image” is an image visualizing nearest neighbor, first tier
and second tier matches [19]. Specifically, for each row representing a query with
the model Mj in a class with |C| members, the pixel (i, j) is white if the model Mi is
just the model Mj or its nearest neighbor, yellow if the model Mi is among the
(|C|−1) top matches (i.e., the first tier) and golden if the model Mi is among the
2·(|C|−1) top matches (i.e., the second tier). Similar to the distance image, models
are grouped by class along each axis, and lines are added to separate classes.
However, this image is often more useful than the distance image because the best
matches are clearly shown for every model, regardless of the magnitude of their
distance values. The optimal result is a set of white/yellow, class-sized blocks of
pixels along the diagonal indicating that every model matches the models within
its class better than those in other classes. Otherwise, more colored pixels in the
class-sized blocks along the diagonal represent a better result. A typical example is
shown in Fig. 4.6.
Fig. 4.6. A typical “tier image” evaluation measure (With courtesy of Shilane et al.)
Then, we analyze and discuss several topics for content-based 3D model retrieval,
including preprocessing, feature extraction, similarity matching and query
interfaces.
values are then sorted in descending order so that the models having the largest
similarity values are returned as the matching results, on the basis of which
browsing and retrieval in 3D model databases are finally implemented. Here,
“content-based” means that the retriever utilizes the visual features of 3D models
themselves, rather than relying on human-inputted metadata such as captions or
keywords. The visual features of 3D models should be automatically or
semi-automatically extracted and expected to characterize their contents.
The ultimate aim of content-based 3D model retrieval systems is to
approximate human visual perception so that semantically similar 3D models can
be correctly retrieved based on their looks. However, most existing 3D feature
extraction methods, which can be termed "low-level similarity-induced
semantics," capture some, but not all, aspects of the content of a 3D model, and
do not coincide with the high-level semantics it contains. As shown in Fig. 4.7, a
sphere-like shape feature alone can be used to describe either a 3D ball or a 3D
model of the globe. This is the well-known “semantic gap” issue [20] that
indicates the relatively limited descriptive power of low-level visual features for
approaching human subjective high-level perception. Therefore, high-level feature
extraction methods that can derive semantics from low-level features should also
be integrated as an important part in the content-based 3D model retrieval system.
If 3D shape content is to be extracted in order to understand the meaning of a 3D
model, the only available independent information is the low-level geometry data,
connectivity data and surface appearance data. Annotations always depend on the
knowledge, capability of expression and specific language of the annotator. They
are therefore unreliable. To recognize the displayed scenes from the raw data of a
model, the algorithms for selection and manipulation of vertices must be
combined and parameterized in an adequate manner and finally linked with the
natural description. Even the simple linguistic representation of a shape or texture
mapping attribute, such as round or yellow, requires entirely different mathematical
formalization methods, which are neither intuitive, unique nor sound.
Fig. 4.7. A sphere-like shape can be used to describe (a) a 3D ball or (b) a 3D model of the globe
From the point of view of the conceptual level, a typical 3D model retrieval
system framework as shown in Fig. 4.8 consists of a database with an index
structure created offline and an online query engine [6]. This system generally
consists of four main components: 1) the model preprocessing module for pose
registration, noise removing and so on; 2) the feature extraction module for
generating both low-level 3D shapes or appearance features and high-level
semantic features; 3) the similarity matching phase, i.e., the relevance ranking
procedure according to calculated similarity degrees; 4) the query interface, i.e., a
practical online user interface designed to represent and process user queries. In
general, a 3D model retrieval procedure is performed in four steps: indexing,
querying, matching and visualizing. Except the first step that is done off-line, the
remaining three steps are performed online to deal with each user query that
supports input modes based on text, 3D sketches or 3D model examples, 2D
projections and 2D sketches. For each of these input modes, the relevant shape
descriptors are extracted from the 3D database models during the offline stage in
order that they can be compared with the queries efficiently in the online phase.
These shape descriptors provide a compact overall description of each 3D model.
Fig. 4.8. Typical architecture framework of content-based 3-D model retrieval [6]
([2008]IEEE)
The first important issue is the type of model file format that a model retrieval
system can accept. Most of the 3D models provided over the Internet are meshes
defined in a file format supporting visual appearance [25]. Currently, the
commonly-used formats for 3D model retrieval include VRML, 3D studio, PLY,
AutoCAD, Wavefront, Lightwave objects, etc. These 3D model files are provided
over the Internet both as plain files and as compressed archives. As VRML is
designed to be used over the Internet, it is often kept in a non-compressed format.
Thus, the most commonly used format for retrieval is the VRML format. Most 3D
models are represented by "polygon soups" consisting of unorganized and
degenerate sets of polygons. They are rarely manifold, most are not even
self-consistent, and they seldom carry any solid modeling information. By contrast, for
volume models, many retrieval techniques depending on a properly defined
volume can be applied.
4.2.4.2 Normalization
Without prior knowledge, most 3D model search systems need the normalization
step before feature extraction. Typically, this step is just a conversion of 3D
models into their canonical representations to guarantee that the corresponding
shape descriptors are invariant to rotation, translation and scaling operations. The
Principal Component Analysis (PCA) algorithm for pose registration is fairly
simple and efficient [26]. There are also some similarity measures which are
invariant under the rotation operation [27-29]. We will discuss in detail the
normalization step in the next section.
In general, the index structure is adopted to avoid the sequential scan that may be
time-consuming during similarity matching. Researchers have presented many
index structures and algorithms for efficient querying in the high-dimensional
space. For example, metric access methods are index structures that utilize the
metric properties of the distance function (especially triangle inequality) to filter
out zones of the search space [30], while spatial access methods are index
structures especially designed for vector spaces that, together with the metric
properties of the distance function, use geometric information to discard unlikely
points from the space [31].
4.3.1 Overview
PCA can be thought of as revealing the internal structure of the data in a way which best explains the
variance in the data. If a multivariate dataset is visualized as a set of coordinates in
a high-dimensional data space (1 axis per variable), PCA provides the user with a
lower-dimensional picture, i.e., a “shadow” of this object when viewed from its (in
some sense) most informative viewpoint. PCA is closely related to factor analysis,
and indeed, some statistical packages deliberately conflate the two techniques.
Actually, true factor analysis makes different assumptions about the underlying
structure and solves eigenvectors from a slightly different matrix.
In 3D model normalization, the aim of PCA is to change the coordinate system
axes to new ones which coincide with the directions of the three largest spreads of
the point (i.e. vertex) distribution. The detailed steps can be described as follows:
Step 1: Translation. First, the model’s center of mass should be shifted to the
coordinate origin as follows:
\[
I_1 = I - c = \{\, v - c \mid v \in I \,\}. \tag{4.5}
\]
Step 2: Rotation. Then the translated model is rotated by the matrix R built from the principal directions:
\[
I_2 = R \cdot I_1 = \{\, R v \mid v \in I_1 \,\}, \tag{4.6}
\]
where I1 is the 3D model's coordinate frame before rotation and I2 is the new
coordinate frame after rotation, whose axes coincide with the directions having the top
three largest variances of the point distribution.
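A minimal sketch of point-based PCA pose normalization along these lines is given below, assuming the model is available as an (N, 3) NumPy array of vertices; the helper name pca_normalize and the eigen-decomposition details are illustrative, not the book's implementation.

```python
import numpy as np

def pca_normalize(vertices):
    """Point-based PCA pose normalization sketch: vertices is an (N, 3) array.
    Returns the vertices translated to the center of mass and rotated so that
    the coordinate axes coincide with the principal directions of the cloud."""
    c = vertices.mean(axis=0)                     # center of mass (Step 1: translation)
    centered = vertices - c
    cov = centered.T @ centered / len(vertices)   # 3x3 covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)          # eigenvalues in ascending order
    R = eigvec[:, ::-1]                           # columns = axes of decreasing variance
    if np.linalg.det(R) < 0:                      # keep a right-handed coordinate frame
        R[:, -1] = -R[:, -1]
    return centered @ R                           # Step 2: rotation
```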
The general PCA transformation in 3D model retrieval is defined on the given
set of representative points of a 3D model, such as vertices, centroids of each
surface, or even randomly selected locations on each surface using statistical
techniques, e.g., the Monte Carlo approach [36]. In considering the different sizes
of triangles or meshes of a 3D model, appropriate weighting factors, proportional
to their surface areas, can be incorporated so as to make the transformation more
robust and improve the reliability and veracity of feature
representation [32, 37, 38]. However, the point-based PCA transformation may
cause an inaccurate normalization result that will seriously affect the retrieval
precision if the chosen vertices do not have an even distribution on the surface.
Therefore, a more thorough improvement, termed CPCA (continuous PCA), which
performs PCA transformation based on the whole 3D polygon mesh, is proposed
in [39]. CPCA generalizes the PCA transformation by using the sums of integrals
over surfaces instead of the sums over selective vertices. Assume that the whole
size of all the surfaces in a 3D model is represented as
\[
S = \sum_{i=1}^{N_f} \int_{S_i} \mathrm{d}v, \tag{4.7}
\]
where v ∈ I is a point on the surface, Nf is the number of surfaces of the 3D
model and I is the point set of the 3D model, defined as
\[
I = \bigcup_{i=1}^{N_v} v_i, \tag{4.8}
\]
where Nv is the number of points, vi is the i-th point. Similarly, the triangle set T
can be denoted as
\[
T = \bigcup_{i=1}^{N_t} t_i, \qquad t_i = (v_{i1}, v_{i2}, v_{i3}), \tag{4.9}
\]
where Nt is the number of triangles and ti denotes the i-th triangle. The covariance
matrix R is then defined as
\[
R = \frac{1}{S} \int_{I} v\, v^{\mathrm{T}}\, \mathrm{d}v. \tag{4.10}
\]
Fig. 4.9. Principal component analysis [32] (With courtesy of Vraníc and Saupe)
For a triangle with vertices p, q and r, its unit normal is computed as
\[
N_d = \frac{\vec{pq} \times \vec{qr}}{\|\vec{pq} \times \vec{qr}\|}. \tag{4.12}
\]
Secondly, the area of each triangle is calculated and the areas of all triangles
with the same or opposite normals are added. Here Pu et al. considered normals
pointing in the same direction to belong to the same distribution.
The next step is to determine the three principal axes. From all normal
distributions, the normal with the maximum area is selected as the first principal
axis bu. To get the next principal axis bv, we can search from the remaining normal
distributions and find out the normal that satisfies two conditions: (1) with the
maximum area; (2) orthogonal to the first normal. Naturally, the third axis can be
obtained by doing a cross product between bu and bv:
\[
b_w = b_u \times b_v. \tag{4.13}
\]
Fig. 4.10. Bounding box examples [41]. The bounding boxes shown in (a), (b) and (c) are
obtained by the IPA method, while the boxes shown in (d), (e) and (f) by the MND method (With
courtesy of Gottschalk)
To find the center and the half-length of the bounding box, Pu et al. projected
the points of the polygon mesh onto each direction vector and found the minimum
and maximum along each direction. Finally, the positive direction for each
principal axis has to be decided. For this purpose, Pu et al. proposed a rule: the
side farthest from the centroid is the positive direction. In Fig. 4.10, the boxes
shown in (d), (e) and (f) are obtained by the maximum normal distribution method,
and they look much better than those in Figs. 4.10(a), (b) and (c).
For models with obvious normal distributions, such as CAD models, the MND
method outperforms the IPA method. However, for models without obvious
normal distributions, as shown in Fig. 4.11, the former method will fail because
the normal distribution has a random property for this case. From Fig. 4.11, we
can observe that the IPA method is good at describing the mass distribution of 3D
models, and it can find the symmetric axes according to the mass distributions.
Therefore, to overcome this limitation and make full use of the merits of the two
methods, Pu et al. proposed a rule to combine them: select the bounding box with
the smaller volume as the final box. Its validity has been verified on their 3D
library consisting of more than 2,700 models.
Fig. 4.11. An example for the bounding box of a mesh model, in which the MND method fails
[41]. (a) The bounding box obtained by the MND method; (b) The bounding box obtained by the
IPA method (With courtesy of Gottschalk)
required to triangulate the polygons of the mesh. Here we introduce the polygon
triangulation problem and algorithms.
In computational geometry, polygon triangulation [44] is the decomposition of
a polygon into a set of triangles. A triangulation of a polygon P is its partition into
non-overlapping triangles whose union is P. In a strict sense, these triangles may
have vertices only at the vertices of P. In a less strict sense, points can be added
anywhere on or inside the polygon to serve as vertices of triangles.
Triangulations are special cases of planar straight-line graphs. It is trivial to
triangulate a convex polygon in linear time, by adding edges from one vertex to all
other vertices. A monotone polygon can easily be triangulated in linear time as
described by Fournier and Montuno [45].
For a long time there was an open problem in computational geometry: whether a
simple polygon can be triangulated faster than O(Nv log Nv) time [44], where Nv is
the number of vertices of the polygon. In 1990, researchers discovered an
O(Nv log log Nv) algorithm for triangulation. Eventually, Chazelle showed in
1991 that any simple polygon can be triangulated in linear time. This algorithm is
very complex though, so Chazelle and others are still looking for easier algorithms
[46]. Although a practical linear-time algorithm has yet to be found, simple
randomized methods such as Seidel's [47] or Clarkson et al.'s have O(Nv log* Nv)
behavior which, in practice, is indistinguishable from O(Nv). The time
complexity of the triangulation of a polygon with holes has an O(Nv log Nv) lower
bound [44]. Over time, a number of algorithms have been proposed to triangulate
a polygon. The following are two typical ones.
One way to triangulate a simple polygon is to use the assertion that any simple
polygon without holes has at least two so-called “ears”. As shown in Fig. 4.12, an
ear is a triangle with two sides on the edge of the polygon and the other one
completely inside it. The algorithm then consists of finding such an ear, removing it
from the polygon (which results in a new polygon that still meets the conditions) and
repeating this until there is only one triangle left. This algorithm is easy to
implement, but suboptimal, and it only works on polygons without holes. An
implementation that keeps separate lists of convex and reflex vertices will run in
O(Nv²) time. This method is also known as ear clipping and sometimes ear trimming.
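The sketch below illustrates the ear-clipping idea for a simple polygon without holes, assuming the vertices are given in counter-clockwise order and contain no degenerate (collinear) triples; the helper names are hypothetical and the implementation is not optimized.

```python
def _cross(o, a, b):
    """Signed area of the triangle (o, a, b); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def _point_in_triangle(p, a, b, c):
    d1, d2, d3 = _cross(a, b, p), _cross(b, c, p), _cross(c, a, p)
    return not (min(d1, d2, d3) < 0 and max(d1, d2, d3) > 0)

def ear_clip(polygon):
    """Triangulate a simple polygon (list of (x, y) vertices, counter-clockwise,
    no holes) by repeatedly cutting off 'ears'.  Returns vertex index triples."""
    idx = list(range(len(polygon)))
    triangles = []
    while len(idx) > 3:
        for k in range(len(idx)):
            i, j, l = idx[k - 1], idx[k], idx[(k + 1) % len(idx)]
            a, b, c = polygon[i], polygon[j], polygon[l]
            if _cross(a, b, c) <= 0:                 # reflex vertex: not an ear
                continue
            # an ear must not contain any other polygon vertex
            if any(_point_in_triangle(polygon[m], a, b, c)
                   for m in idx if m not in (i, j, l)):
                continue
            triangles.append((i, j, l))              # cut the ear off
            idx.pop(k)
            break
    triangles.append(tuple(idx))                     # the last remaining triangle
    return triangles
```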
The partition of model units is also required if we extract features from various
parts of the 3D models. It is a segmentation problem. Mesh segmentation has
become an important and challenging problem in computer graphics, with
applications in areas as diverse as modeling, metamorphosis, compression,
simplification, 3D shape retrieval, collision detection, texture mapping and
skeleton extraction.
Mesh, and more generally shape, segmentation can be interpreted either in a
purely geometric sense or in a more semantics-oriented manner. In the first case,
the mesh is segmented into a number of patches that are uniform with respect to
some property (e.g., curvature or distance to a fitting plane), while in the latter
case the segmentation is aimed at identifying parts that correspond to relevant
features of the shape. Methods that can be grouped under the first category have
been presented as a pre-processing step for the recognition of meaningful features.
Semantics-oriented approaches to shape segmentation have gained great interest
recently in the research community, because they can support parameterization or
re-meshing schemes, metamorphosis, 3D shape retrieval, skeleton extraction and other applications.
Fig. 4.14. Segmentations of miscellaneous models by various methods [48]. (a) Fuzzy
clustering and cuts based; (b) Feature point and core extraction based; (c) Tailor; (d) Plumber; (e)
Fitting primitives based ([2006]IEEE)
(1) Mesh decomposition using fuzzy clustering and cuts [49]. The key idea of
this algorithm is to first find the meaningful components using a clustering
algorithm, while keeping the boundaries between the components fuzzy. Then, the
algorithm focuses on the small fuzzy areas and finds the exact boundaries which
go along the features of the object.
(2) Mesh segmentation using feature point and core extraction [50]. This
approach is based on three key ideas. First, multi-dimensional scaling (MDS) is
used to transform the mesh vertices into a pose insensitive representation. Second,
260 4 Content-Based 3D Model Retrieval
prominent feature points are extracted using the MDS representation. Third, the
core component of the mesh is found. The core, along with the feature points,
provides sufficient information for meaningful segmentation.
(3) Tailor: multi-scale mesh analysis using blowing bubbles [51]. This method
provides a segmentation of a shape into clusters of vertices that have a uniform
behavior from the point of view of the shape morphology, analyzed on different
scales. The main idea is to analyze the shape by using a set of spheres of
increasing radius, placed at the vertices of the mesh. The type and length of the
sphere-mesh intersection curve are good descriptors of the shape and can be used
to provide a multi-scale analysis of the surface.
(4) Plumber: mesh segmentation into tubular parts [52]. Based on the Tailor
shape analysis, the Plumber method decomposes the shape into tubular features
and body components and extracts, simultaneously, the skeletal axis of the
features. Tubular features capture the elongated parts of the shape, protrusions or
wells, and are well suited for articulated objects.
(5) Hierarchical mesh segmentation based on fitting primitives (HFP) [53].
Based on a hierarchal face clustering algorithm, the mesh is segmented into
patches that best fit a pre-defined set of primitives. In the current prototype, these
primitives are planes, spheres, and cylinders. Initially, each triangle represents a
single cluster. At each iteration, all the pairs of adjacent clusters are considered,
and the one that can be better approximated with one of the primitives forms a
new single cluster. The approximation error is evaluated using the same metric for
all the primitives, so that it makes sense to choose the most suitable primitive to
approximate the set of triangles in a cluster.
Some retrieval systems may require the mesh simplification step before feature
extraction. Vertex clustering [54] is a practical technique to automatically compute
approximations of polygonal representations of 3D objects. It is based on a
previously developed model simplification technique which applies vertex-
clustering. Major advantages of the vertex-clustering technique are its low
computational cost and high data reduction rate, which make it suitable for
interactive applications.
As we know, in a synthetic scene, when an object is far away from the
viewpoint, its image size is small. Due to the discreteness of the image space,
many points on the object are mapped onto the same pixels, and this happens often
when the object’s model is complex and the image size is relatively small. For
points mapped to the same pixel, only one point appears on the image at the pixel,
and the others are eliminated by hidden-surface removal. This is wastage in
rendering as many such points are processed but never make their way to the final
image. A potential solution to cut down this wasteful processing is to find out
which are the points that are going to fall onto the same pixel and use a new point
to represent them. Only this new point is sent for rendering.
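A minimal vertex-clustering sketch along these lines is shown below, assuming the mesh is given as NumPy vertex and face arrays and using a uniform grid whose cell size controls the reduction rate; all names are illustrative and not the cited method's exact formulation.

```python
import numpy as np

def vertex_clustering(vertices, faces, cell_size):
    """Vertex-clustering simplification sketch: vertices (N, 3) float array,
    faces (M, 3) int array.  All vertices that fall into the same grid cell are
    merged into one representative vertex (their mean position)."""
    cells = np.floor(vertices / cell_size).astype(int)            # grid cell per vertex
    keys, remap = np.unique(cells, axis=0, return_inverse=True)   # cluster ids
    counts = np.bincount(remap, minlength=len(keys)).astype(float)
    new_vertices = np.zeros((len(keys), 3))
    for d in range(3):                                            # mean position per cluster
        new_vertices[:, d] = np.bincount(remap, weights=vertices[:, d]) / counts
    new_faces = remap[faces]                                      # re-index the faces
    # drop faces that became degenerate (two or more corners merged)
    keep = ((new_faces[:, 0] != new_faces[:, 1]) &
            (new_faces[:, 1] != new_faces[:, 2]) &
            (new_faces[:, 0] != new_faces[:, 2]))
    return new_vertices, new_faces[keep]
```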
In fact, feature extraction techniques have been discussed in detail in the last
chapter. In this section, we would like to briefly introduce them with another
categorical method. Here, methods addressing retrieval by global similarity of 3D
models are classified according to the principles under which shape
representations are derived. This section discusses feature extraction methods in
the following four categories, i.e., primitive-based, statistics-based, geometry-
based and view-based.
\[
f_{app}(x, y) = a_1 f_1(x, y) + \cdots + a_d f_d(x, y) = (a_1, \ldots, a_d) \cdot (f_1, \ldots, f_d)(x, y). \tag{4.14}
\]
The notion by which Kriegel and Seidl related 3D surface segments and
multi-parametric approximation models is the approximation error. For any
arbitrary 3D surface segment s and any instance app of approximation parameters,
the approximation error indicates the deviation of the surface function fapp from
the points of the segment s:
Definition 4.2 (Approximation Error) Let the 3D surface segment s be
represented by a set of n surface points. Given an approximation model f and a
vector app of approximation parameters, the approximation error of app and s is
defined as
\[
d_s(app) = \frac{1}{n} \sum_{p \in s} \big( f_{app}(p_x, p_y) - p_z \big)^2, \tag{4.15}
\]
where p = (px, py, pz) is a 3D point in s. Given this definition, from all possible
choices, Kriegel and Seidl selected the parameter vector app which yields the
minimum approximation error for a given 3D segment s.
Definition 4.3 (Approximation of a Segment) Given an approximation model
f and a 3D surface segment s, the (unique) approximation of s is given by the
parameter set apps for which the approximation error is minimum:
\[
app_s = \arg\min_{app} d_s(app).
\]
The (unique) approximation apps is closest to the original surface points and
may be used as a more or less coarse representation of the shape of s, whereas the
other surface functions do not fit the shape of the segment s very well.
Kriegel and Seidl focused on two immediate implications of this definition:
first, the relative approximation error never evaluates to a negative value, and
second, it reaches zero for the (unique) approximation of a segment.
Lemma 4.1 (1) For any 3D surface segment s and any approximation
parameter set app, the relative approximation error is non-negative:
\[
\Delta d_s^2(app) \ge 0.
\]
(2) The relative approximation error reaches zero; in particular, \(\Delta d_s^2(app_s) = 0\) for all segments s.
Two different segments s and q may share the same approximation apps = appq.
Consequently, they cannot be distinguished by a simple comparison of their approximations alone.
For Kriegel and Seidl’s approximation models, they restrict themselves to the class
of linear combinations of non-parameterized base functions as introduced in
Definition 4.1. According to Definitions 4.2 and 4.3, finding an approximation is a
least squares minimization problem for which an efficient numerical computation
method is required. For linearly parameterized functions in particular, it is
recommended to perform least-squares approximation by Singular Value
Decomposition (SVD) [56].
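A small sketch of such an SVD-based least-squares approximation is given below, assuming NumPy and a user-supplied list of base functions f_i(x, y); the quadratic basis at the end is only an example and not necessarily the model used in [56].

```python
import numpy as np

def fit_segment(points, basis):
    """Least-squares approximation sketch in the spirit of Definition 4.3:
    points is an (n, 3) array of surface points, basis a list of d base
    functions f_i(x, y).  Returns the parameter vector app_s minimizing the
    approximation error of Eq. (4.15), computed via SVD."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    A = np.column_stack([f(x, y) for f in basis])       # n x d design matrix
    U, w, Vt = np.linalg.svd(A, full_matrices=False)    # SVD of the design matrix
    inv_w = np.where(w > 1e-12, 1.0 / w, 0.0)           # guard tiny singular values
    app = Vt.T @ (inv_w * (U.T @ z))                    # pseudo-inverse solution
    error = np.mean((A @ app - z) ** 2)                 # approximation error, Eq. (4.15)
    return app, error, w, Vt.T                          # w, V as used in Eq. (4.18)

# Example basis: a quadratic approximation model with d = 6 base functions.
quadratic = [lambda x, y: np.ones_like(x), lambda x, y: x, lambda x, y: y,
             lambda x, y: x * y, lambda x, y: x ** 2, lambda x, y: y ** 2]
```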
Besides the d approximation parameters apps = (a1, ..., ad), the SVD also
returns a d-dimensional vector ws of confidence or condition factors, and an
orthogonal d×d matrix Vs. Using Vs, we can compute the relative approximation
error for any approximation parameter vector app with respect to the segment s.
Let As = Vs·diag(ws)²·VsT and let us denote the rows of Vs by Vsi. Now the error
formula can be written as:
\[
\Delta d_s^2(app) = \sum_{i=1,\ldots,d} \big( w_{si} \cdot ((app - app_s) \cdot V_{si}) \big)^2 = (app - app_s)\, A_s\, (app - app_s)^{\mathrm{T}}. \tag{4.18}
\]
In general, the points of a segment s are located anywhere in the 3D space and are
oriented arbitrarily. Since we are only interested in the shape of the segment s, but
not in its location and orientation in the 3D space, we transform s by a rigid 3D
transformation into a normalized representation. There are two ways to integrate
normalization into Kriegel and Seidl’s method: (1) Separate. We first normalize
the segment s, and then compute the approximation apps by least-squares
minimization. (2) Combined. We minimize the approximation error simultaneously
over all the normalization and approximation parameters. In Kriegel and Seidl’s
experiments, they used the combined normalization approach. For similarity
search purposes, only the resulting approximation parameters are used. However,
the normalization parameters may be required later for superimposing segments.
4.4.2.1 Overview
Recently, Antini et al. [61] proposed curvature correlograms to capture the spatial
distribution of curvature values on the object surface. Previously, correlograms
have been successfully used for image retrieval based on color content [62]. In
particular, with respect to a description based on histograms of local features,
correlograms also enable us to encode the information about the relative
localization of local features. In [63], histograms of surface curvature have been
used to support the description and retrieval of 3D objects. However, since
histograms do not include any spatial information, the system is liable to false
positives. Therefore, Antini et al. presented a model for representation and
retrieval of 3D objects based on curvature correlograms. Correlograms are used to
encode the information about curvature values and their localization on the object
surface. Owing to this property, the description of 3D objects based on correlograms
of curvature proves to be very effective for the purpose of content-based retrieval
of 3D objects.
High resolution 3D models obtained through scanning real world objects are
often affected by high frequency noise, due to either the scanning device or the
subsequent registration process. Hence, smoothing is required to deal with such
models for the purpose of extracting their salient features. This is especially true if
salient features are related to differential properties of the mesh surface, e.g.
surface curvature. Selection of a smoothing filter is a critical step, as application
of some filters entails changes in the model’s shape. In the proposed solution,
Antini et al. adopted the filter first proposed by Taubin [64]. This filter, also
known as the λ|μ filter, operates iteratively and interleaves a Laplacian smoothing
step weighted by λ with a second smoothing step weighted by a negative factor μ
(λ > 0, μ < 0). This second step is introduced so that the model's original shape
can be preserved.
Let M be a mesh. We denote by E, V and F the sets of all edges, vertices and
faces of the mesh, and the cardinality of V, E and F by Nv, Ne and Nf, respectively.
Given a vertex v ∈ M, the principal curvatures of M at the vertex v are indicated
as k1(v) and k2(v), respectively. The mean curvature \(\bar{k}(v)\) is related to
the principal curvatures k1(v) and k2(v) by the equation:
\[
\bar{k}(v) = \frac{k_1(v) + k_2(v)}{2}. \tag{4.19}
\]
Details about the computation of the principal curvatures for a mesh can be found
in [65].
Values of the mean curvature are quantized into 2N+1 intervals of discrete
values. For this purpose, a quantization module processes the mean curvature
value through a stairstep function so that many neighboring values are mapped to
one output value as follows:
\[
Q(\bar{k}) =
\begin{cases}
N & \text{if } \bar{k} > N\Delta \\
i & \text{if } \bar{k} \in [\, i\Delta,\ (i+1)\Delta \,) \\
-i & \text{if } \bar{k} \in (\, -(i+1)\Delta,\ -i\Delta \,] \\
-N & \text{if } \bar{k} < -N\Delta
\end{cases} \tag{4.20}
\]
with i ∈ {0, ..., N−1}, where Δ is a suitable quantization parameter. The function
Q(·) quantizes values of \(\bar{k}\) into 2N+1 distinct classes \(\{c_i\}_{i=-N}^{N}\).
To simplify notation, v ∈ M_{c_i} is synonymous with v ∈ M and Q(\(\bar{k}(v)\)) = c_i
in the following. The histogram of quantized curvature values is then defined as
\[
h_{c_i}(M) = \sum_{v \in M} \big[\, v \in M_{c_i} \,\big], \tag{4.21}
\]
where Nv is the number of mesh vertices; h_{c_i}(M)/Nv is the probability that the
quantized curvature of a generic vertex of the mesh belongs to the interval c_i.
The correlogram of curvatures is defined with respect to a predefined distance
value δ. In particular, the curvature correlogram \(\gamma^{(\delta)}_{c_i c_j}\) of a mesh M is defined as:
\[
\gamma^{(\delta)}_{c_i c_j}(M) = \Pr_{v_1, v_2 \in M} \big[\, v_1 \in M_{c_i},\ v_2 \in M_{c_j} \ \big|\ \|v_1 - v_2\| = \delta \,\big], \tag{4.22}
\]
where \(\gamma^{(\delta)}_{c_i c_j}(M)\) is the probability that two vertices a distance δ apart
have curvatures belonging to intervals c_i and c_j, respectively. Ideally,
\(\|v_1 - v_2\|\) should be the geodesic distance between the two vertices v_1 and v_2. However,
it can be approximated with the k-ring distance if the mesh M is regular and
triangulated [66].
Definition 4.6 (1-ring) Given a generic vertex v_i ∈ M, the neighborhood or
1-ring of v_i is the set:
\[
V_{v_i} = \{\, v_j : e_{ij} \in E \,\}, \tag{4.23}
\]
where E is the set of all mesh edges (if e_{ij} ∈ E, there is an edge that links vertices v_i
and v_j). The set V_{v_i} can be easily computed using the morphological operator
\[
V_{v_i} = \mathrm{dilate}(v_i). \tag{4.24}
\]
Through the dilate operator, the concept of the 1-ring can be used to define,
recursively, the generic k-th order neighborhood. The definition of the k-th order
neighborhood enables the definition of a true metric between vertices in a mesh.
This metric can be used for the purpose of computing curvature correlograms as an
approximation of the usual geodesic distance (which is computationally much more
demanding). According to this, the k-ring distance between two mesh vertices is
defined as d_ring(v_1, v_2) = k if v_2 ∈ ring_k(v_1), and the function
d_ring(v_1, v_2) is a true metric.
The curvature correlogram is then computed with respect to the k-ring distance:
\[
\gamma^{(k)}_{c_i c_j}(M) = \Pr_{v_1, v_2 \in M} \big[\, v_1 \in M_{c_i},\ v_2 \in M_{c_j} \ \big|\ d_{ring}(v_1, v_2) = k \,\big]. \tag{4.26}
\]
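A minimal sketch of computing k-ring indices by breadth-first search over the mesh edge graph, which is the building block needed to accumulate the correlogram of Eq. (4.26), is shown below; the function name and data layout are assumptions of this example.

```python
from collections import deque

def k_ring_distances(edges, n_vertices, source, k_max):
    """k-ring distance sketch: breadth-first search over the mesh edge graph.
    edges is an iterable of (i, j) vertex index pairs; returns, for the given
    source vertex, the ring index (1..k_max) of every reachable vertex."""
    adjacency = [[] for _ in range(n_vertices)]        # dilate() neighborhoods
    for i, j in edges:
        adjacency[i].append(j)
        adjacency[j].append(i)
    ring = {source: 0}
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        if ring[v] == k_max:
            continue
        for u in adjacency[v]:
            if u not in ring:                          # first visit = ring index
                ring[u] = ring[v] + 1
                frontier.append(u)
    return {v: r for v, r in ring.items() if r > 0}
```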
The tensor of inertia is defined as
\[
I = [\, I_{qr} \,] = \left[ \frac{1}{n} \sum_{i=1}^{n} S_i \,(q_i - q_{CM})(r_i - r_{CM}) \right], \tag{4.27}
\]
where q, r ∈ {x, y, z}.
The identification of the axes is performed by comparing the eigenvalues: the
eigenvector with the highest eigenvalue is labeled one, the second highest is
labeled two and the remaining axis is labeled three. The tensor of inertia has a
mirror-symmetry ambiguity, which can be handled by computing the statistical
distribution of the mass in the positive and negative directions in order to identify
the positive direction. For each axis, the points are divided between "North" and
"South": a point belongs to the North group if the angle between the
corresponding cord and the given axis is smaller than 90°, and to the South group if
it is greater than 90°. A cord is defined as a vector that goes from the center of
mass of the model to the center of mass of a triangle. The standard deviation of
the length of the cords is calculated for each group of each axis and it is defined as
\[
s = \sqrt{ \frac{ n \sum_{i=1}^{n} d_i^2 - \left( \sum_{i=1}^{n} d_i \right)^2 }{ n(n-1) } }, \tag{4.29}
\]
where d_i is the length of a cord and n is the number of points. If the standard
deviation of the North group is higher than the standard deviation of the South
group, then the direction of the corresponding eigenvector is not changed, while,
in the other case, the direction is flipped by 180°. This technique is also applied to
the first and second axes. Then the outer product between them is calculated; if the
third axis does not have the same direction, the resulting vector is flipped by
180° in order to obtain a direct orthogonal system.
The scale is simply handled by a bounding box which is the smallest box that
can contain the model. The axes of the box are parallel to the principal axes of the
tensor of inertia. A rough description of the mass distribution inside the box is
obtained by using the eigenvalues of the tensor of inertia (i.e., moment
description).
In [73], the shape is analyzed at three levels. The local level is defined by the
normals. Assuming a triangular decomposition of the object and a normal for each
triangle, the angles between the normals and the first two principal axes are
computed using
\[
\alpha_q = \cos^{-1}\!\left( \frac{ n \cdot a_q }{ \|n\|\, \|a_q\| } \right), \tag{4.30}
\]
where
\[
n = \frac{ (r_2 - r_1) \times (r_3 - r_1) }{ \| (r_2 - r_1) \times (r_3 - r_1) \| }. \tag{4.31}
\]
\[
\begin{bmatrix} x \\ y \\ z \end{bmatrix} =
\begin{bmatrix} i\,\Delta x \\ j\,\Delta y \\ k\,\Delta z \end{bmatrix}, \tag{4.32}
\]
where Δx, Δy and Δz are the dimensions of the voxel and i, j and k are the discrete
coordinates. If the density of points in the original model is not high enough, it
may be necessary to interpolate the original model so as to generate more points
\[
\psi_{j,q}(n) = 2^{j/2}\, \psi(2^{j} n - q), \qquad j \in \mathbb{Z}. \tag{4.33}
\]
Reference [73] used DAU4 (Daubechies) wavelets which have two vanishing
moments. The N×N (N being a multiple of two) matrix corresponding to the 1D
transform is
\[
W =
\begin{bmatrix}
c_0 & c_1 & c_2 & c_3 & & & & \\
c_3 & -c_2 & c_1 & -c_0 & & & & \\
 & & c_0 & c_1 & c_2 & c_3 & & \\
 & & c_3 & -c_2 & c_1 & -c_0 & & \\
 & & & & \ddots & & & \\
 & & & & & c_0 & c_1 & c_2 & c_3 \\
 & & & & & c_3 & -c_2 & c_1 & -c_0 \\
c_2 & c_3 & & & & & & c_0 & c_1 \\
c_1 & -c_0 & & & & & & c_3 & -c_2
\end{bmatrix}, \tag{4.34}
\]
where
\[
c_0 = \frac{1 + \sqrt{3}}{4\sqrt{2}}, \quad c_1 = \frac{3 + \sqrt{3}}{4\sqrt{2}}, \quad
c_2 = \frac{3 - \sqrt{3}}{4\sqrt{2}}, \quad c_3 = \frac{1 - \sqrt{3}}{4\sqrt{2}}. \tag{4.35}
\]
The corresponding filters are
\[
H = [\, c_0 \ \ c_1 \ \ c_2 \ \ c_3 \,], \tag{4.36}
\]
\[
G = [\, c_3 \ \ {-c_2} \ \ c_1 \ \ {-c_0} \,]. \tag{4.37}
\]
H is a smoothing filter, while G is a filter with two vanishing moments. The 1D wavelet
transform is computed by applying the wavelet transform matrix hierarchically,
first on the full vector of length N, then on the N/2 values smoothed by H, then on the
N/4 values smoothed again by H, until two components remain. In order to
compute the wavelet transform in three dimensions, the array is transformed
sequentially on the first dimension (for all values of its other dimensions), then on
its second dimension and finally on its third dimension. The final result of the
wavelet transform is an array of the same dimension as the initial voxel array.
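The following sketch implements the hierarchical 1D DAU4 transform described above with NumPy, using the coefficients of Eq. (4.35) and the circular wrap-around of the matrix in Eq. (4.34); it stores the smooth values before the detail coefficients at each level, which is an implementation choice of this example rather than the book's convention.

```python
import numpy as np

SQ3, DEN = np.sqrt(3.0), 4.0 * np.sqrt(2.0)
C = np.array([(1 + SQ3) / DEN, (3 + SQ3) / DEN,
              (3 - SQ3) / DEN, (1 - SQ3) / DEN])   # c0..c3 of Eq. (4.35)
H = C                                              # smoothing filter, Eq. (4.36)
G = np.array([C[3], -C[2], C[1], -C[0]])           # detail filter,   Eq. (4.37)

def dau4_step(a):
    """One application of the N x N matrix of Eq. (4.34): returns N/2 smoothed
    values followed by N/2 detail coefficients (circular wrap-around)."""
    n = len(a)
    smooth = np.empty(n // 2)
    detail = np.empty(n // 2)
    for i in range(n // 2):
        win = a[np.arange(2 * i, 2 * i + 4) % n]   # 4-tap window, circular
        smooth[i] = H @ win
        detail[i] = G @ win
    return np.concatenate([smooth, detail])

def dau4_transform(signal):
    """Hierarchical 1D transform: keep smoothing the leading part until only
    two smooth components remain (len(signal) must be a power of two >= 4)."""
    out = np.array(signal, dtype=float)
    n = len(out)
    while n >= 4:
        out[:n] = dau4_step(out[:n])
        n //= 2
    return out
```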
The set of wavelet coefficients represents a tremendous amount of information.
In order to reduce it, Reference [73] computed the base-2 logarithm of the
coefficients in order to enhance the coefficients corresponding to small details,
which usually have very low values compared to the large coefficients, and then
integrated the signal for each scale. A histogram representing the distribution of
the signal at different scales is then constructed: the vertical axis represents the
total amount of signal at a given scale and the horizontal axis represents the
"scale" or level of resolution. It is important to notice that each "scale" in the
histogram in fact represents a triplet of scales corresponding to sx, sy and sz.
Currently, distance metrics are perhaps the most popular and widely used
similarity matching methods, most of which have already been used in
content-based 2D media retrieval.
For two N-dimensional feature vectors x and y, the Minkowski (Lp) distance is defined as
\[
L_p(x, y) = \left( \sum_{i=1}^{N} | x_i - y_i |^{p} \right)^{1/p}. \tag{4.38}
\]
All Lp distances are metrics when p ≥ 1. The Lp distance itself can also be directly
used as a similarity measurement. For example, Osada et al. [19] employed it to
implement a similarity match on the probability density function of shape
distribution features. In particular, to assign different impacts to different features
or to allow relevance feedback, Euclidean distance is often modified into the
weighted Euclidean distance with the weight matrix [19, 70, 79].
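For reference, a direct NumPy sketch of the Lp distance of Eq. (4.38) and of a weighted Euclidean variant (the per-feature weight vector w is supplied by the caller, e.g. from relevance feedback):

```python
import numpy as np

def minkowski(x, y, p=2.0):
    """L_p distance of Eq. (4.38) between two feature vectors."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance with a per-feature weight vector w."""
    return np.sqrt(np.sum(w * (x - y) ** 2))
```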
The Hausdorff distance, another frequently used metric, is defined for comparing
two point sets of different sizes as follows:
\[
h(A, B) = \max_{a \in A} \min_{b \in B} d(a, b), \tag{4.39}
\]
where d(a, b) is a distance metric, e.g., the Euclidean distance. However, it is very
sensitive to noise since even a single outlier can change the Hausdorff distance
[80].
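A brute-force NumPy sketch of the directed and symmetric Hausdorff distances between two point sets follows; it is quadratic in the set sizes and therefore only suitable for small sets, and the function names are illustrative.

```python
import numpy as np

def directed_hausdorff(A, B):
    """Directed Hausdorff distance h(A, B) between two point sets given as
    (n, d) and (m, d) arrays, using the Euclidean ground distance."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # pairwise distances
    return d.min(axis=1).max()          # max over a in A of min over b in B

def hausdorff(A, B):
    """Symmetric Hausdorff distance max(h(A, B), h(B, A))."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))
```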
Many other distance metrics have also been studied for the 3D model retrieval
task. Ohbuchi et al. [36, 81] introduced an elastic-matching distance in order to
compensate for the "larger-than-wanted" effect caused by "rigid" distance metrics,
e.g., the Euclidean distance, and the results were promising. Elastic matching has
been used extensively in speech recognition. Ohbuchi et al. performed elastic
matching along the distance axis, using the dynamic programming technique for
its implementation to compute the distance DE(X,Y). It locally stretches and
shrinks the distance axis of the histogram in order to find minimal distance
matches. If the matching is too elastic, a pair of shapes having very different
histograms could have a low distance value. Ohbuchi et al. implemented and
experimentally compared the performance of the linear and the quadratic penalty
functions, the latter of which is depicted in Eq.(4.42). Ohbuchi et al. used the
better performing quadratic penalty function for their experiments:
\[
D_E(X, Y) = g(n, n), \tag{4.40}
\]
\[
g(m, n) = \min \begin{cases}
g(m, n-1) + \Delta g(m, n) \\
g(m-1, n-1) + 2\,\Delta g(m, n) \\
g(m-1, n) + \Delta g(m, n)
\end{cases}, \tag{4.41}
\]
\[
\Delta g(i, j) = \sum_{k=1}^{I_a} (x_{i,k} - y_{j,k})^2, \tag{4.42}
\]
where X = (x_{i,k}) and Y = (y_{i,k}) are the feature vectors (2D histograms having I_d × I_a
elements) for models A and B, respectively, and n = I_d is the number of bins along
the distance axis.
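The sketch below implements the dynamic-programming recurrence as reconstructed in Eqs. (4.40)-(4.42), assuming the two histograms have the same shape (I_d, I_a); it illustrates the technique rather than reproducing Ohbuchi et al.'s exact code.

```python
import numpy as np

def elastic_distance(X, Y):
    """Elastic matching along the distance axis via dynamic programming.
    X and Y are 2D histograms of shape (I_d, I_a)."""
    I_d = X.shape[0]
    delta = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)   # Eq. (4.42)
    g = np.full((I_d, I_d), np.inf)
    g[0, 0] = delta[0, 0]
    for m in range(I_d):
        for n in range(I_d):
            if m == 0 and n == 0:
                continue
            candidates = []
            if n > 0:
                candidates.append(g[m, n - 1] + delta[m, n])
            if m > 0 and n > 0:
                candidates.append(g[m - 1, n - 1] + 2 * delta[m, n])
            if m > 0:
                candidates.append(g[m - 1, n] + delta[m, n])
            g[m, n] = min(candidates)                             # Eq. (4.41)
    return g[I_d - 1, I_d - 1]                                    # Eq. (4.40)
```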
Tangelder et al. [9] used an improved Earthmover’s Distance (EMD) [82] as the
distance measure. Intuitively, given two distributions, one can be seen as a mass of
earth properly spread in space, the other as a collection of holes in that same space.
4.5 Similarity Matching 275
Then the EMD measures the least amount of work needed to fill the holes with
earth. Here a unit of work corresponds to transporting a unit of earth by a unit of
ground distance. Computing the EMD is based on a solution to the well-known
transportation problem a.k.a. the Monge-Kantorovich problem. That is, signature
matching can be naturally cast as a transportation problem by defining one
signature as the supplier and the other as the consumer, and by setting the cost for
a supplier-consumer pair to equal the ground distance between an element in the
first signature and an element in the second signature. Intuitively, the solution is
then the minimum amount of “work” required to transform one signature into the
other. Thus, the EMD naturally extends the notion of a distance between single
elements to that of a distance between sets or distributions of elements. The
advantages of the EMD over previous definitions of distribution distances should
now be apparent. First, the EMD applies to signatures, which subsume histograms.
The greater compactness of signatures is in itself an advantage, and having a
distance measure that can handle these variable-size structures is important.
Second, the cost of moving “earth” reflects the notion of nearness properly,
without the quantization problems in most current measures. Even for histograms,
in fact, items from neighboring bins now contribute similar costs, as appropriate.
Third, the EMD allows for partial matches in a very natural way. This is important,
for instance, in order to deal with occlusions and clutter in image retrieval
applications and when matching only parts of an image. Fourth, if the ground
distance is a metric and the total weights of two signatures are equal, the EMD is a
true metric, which allows endowing image spaces with a metric structure. Of
course, it is important that the EMD can be computed efficiently, especially if it is
used for image retrieval systems where a quick response is required. In addition,
retrieval speed can be increased if lower bounds to the EMD can be computed at
low cost. These bounds can significantly reduce the number of EMDs that actually
need to be computed by pre-filtering the database and ignoring images that are too
far from the query. Fortunately, efficient algorithms for the transportation problem
are available. For example, we can use the transportation-simplex method [12], a
streamlined simplex algorithm that exploits the special structure of the
transportation problem. A good initial basic feasible solution can drastically
decrease the number of iterations needed. We can compute the initial basic
feasible solution by Russell’s method [23].
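In the special 1D case of two histograms defined on the same equally spaced bins and normalized to the same total weight, the transportation problem has a closed-form solution, which the following sketch uses; the general EMD requires a linear-programming solver and is not shown here.

```python
import numpy as np

def emd_1d(p, q, bin_width=1.0):
    """EMD sketch for two 1D histograms on the same equally spaced bins with
    equal total weight: the area between the two cumulative distributions."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * bin_width
```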
The skeleton-based ARG matching is formulated as the minimization of the following objective function:
\[
E_{ARG} = -\frac{1}{2} \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{I} \sum_{l=1}^{J} P_{ij} P_{kl} \sum_{r=1}^{R} C^{(r)}_{ijkl}
- \sum_{i=1}^{I} \sum_{j=1}^{J} P_{ij} \sum_{s=1}^{S} C^{(s)}_{ij}, \tag{4.43}
\]
subject to:
\[
\forall i:\ \sum_{j=1}^{J} P_{ij} \le 1; \qquad
\forall j:\ \sum_{i=1}^{I} P_{ij} \le 1; \qquad
\forall i, j:\ P_{ij} \in \{0, 1\}, \tag{4.44}
\]
where \(\{C^{(r)}_{ijkl}\}\) is the compatibility matrix for a link of type r, whose components
are defined as \(C^{(r)}_{ijkl} = c_l^{(r)}(G^{(r)}_{ij}, H^{(r)}_{kl})\) (0 if either \(G^{(r)}_{ij}\) or \(H^{(r)}_{kl}\) is NULL); \(\{C^{(s)}_{ij}\}\)
is the similarity matrix for an attribute of type s, whose components are defined as
\(C^{(s)}_{ij} = c_n^{(s)}(G^{(s)}_{i}, H^{(s)}_{j})\); \(\{G^{(r)}_{ij}\}\) and \(\{H^{(r)}_{kl}\}\) are the adjacency matrices for the
r-link; \(c_l^{(r)}(\cdot,\cdot)\) is a compatibility measure between an r-link in G and an r-link in H;
\(\{G^{(s)}_{i}\}\) and \(\{H^{(s)}_{j}\}\) are vectors corresponding to the s-attribute of the nodes of G
and H; and \(c_n^{(s)}(\cdot,\cdot)\) is a measure of similarity between a node in G and a node in H,
with respect to the same attribute s. P is an I×J association matrix that at the end
of the minimization process provides the correspondences between one set of
primitives and the other: Pij=1 if Node i in G corresponds to Node j in H, 0
otherwise. Note that the approach does not always converge to an exact
permutation matrix, thus a clean-up heuristic should be defined. Bardinet et al. set
in each column of the association matrix P the maximum element to 1 and others
to 0. In this specific case, P provides the correspondences between the skeleton
parts of the two objects to be compared. The above constraints adopted in the
objective function guarantee that two graph nodes, or two object skeleton parts,
will be matched only if they are similar and if they share the same type of relations
with their neighboring primitives in their respective graphs. Fig. 4.15 gives an
example of skeleton-based ARG matching.
Fig. 4.15. Example of graph matching [82]. (a) Original object with superimposed skeleton and
labeled object partition; (b) Deformed object obtained by occlusion with a polygonal shape and
scaling, rotation and translation, with superimposed skeleton and labeled object partition; (c)
Original object labeled by propagating labels of the deformed object through the skeleton-based
ARG matching (With courtesy of Bardinet et al.)
result-known training samples, which allows for great flexibility in the retrieval
process.
4.5.3.1 SVM
Support vector machines (SVMs) [89] are a set of related supervised learning
methods used for classification and regression. Viewing the input data as two sets
of vectors in the n-dimensional space, an SVM will generate a separating
hyperplane that maximizes the margin between the two data sets. To compute this
margin, two parallel hyperplanes are constructed, one on each side of the
separating hyperplane, which are “pushed up against” the two data sets. Intuitively,
a good separation can be achieved by the hyperplane with the largest distance to
the neighboring data points of both classes since, in general, the larger the margin,
the lower the generalization error of the obtained classifier. The basic idea of the
SVM approach can be described as follows.
Given some training data, i.e., a set of points of the following form
\[
D = \left\{ (x_i, c_i) \;\middle|\; x_i \in \mathbb{R}^p,\ c_i \in \{-1, 1\} \right\}_{i=1}^{n}, \tag{4.45}
\]
where c_i is either 1 or −1, indicating one of the two classes to which the point x_i
belongs. Each x_i is a p-dimensional real vector. Our goal is to find the
maximum-margin hyperplane which divides the points with c_i = 1 from those with
c_i = −1. In fact, any hyperplane can be written as the set of points x satisfying
\[
w \cdot x - b = 0, \tag{4.46}
\]
where · denotes the dot product between two vectors. The vector w is a normal
vector that is perpendicular to the hyperplane. The parameter b/‖w‖ is the offset
of the hyperplane from the origin along the normal vector w. Our aim is to choose
the w and b to maximize the margin, namely the distance between the two parallel
hyperplanes that are as far apart as possible while still separating the data into two
classes. These hyperplanes can be described by the equations
\[
w \cdot x - b = 1 \tag{4.47}
\]
and
\[
w \cdot x - b = -1. \tag{4.48}
\]
Note that if the training data are linearly separable, we can select the two
hyperplanes of the margin in such a way that there are no points between them and
then try to maximize their distance. According to geometry, we can find that the
distance between these two hyperplanes equals 2/||w||, thus our goal is transformed
to minimize ||w||. As we should also prevent data points from falling into the
margin, we may add the following constraint: for each i, either w · x_i − b ≥ 1 for x_i in
the first class, or w · x_i − b ≤ −1 for x_i in the second class. These two constraints can be rewritten as
\[
c_i (w \cdot x_i - b) \ge 1, \qquad \text{for all } 1 \le i \le n. \tag{4.49}
\]
\[
\text{Minimize (in } w, b\text{):}\quad \tfrac{1}{2}\,\|w\|^2,
\]
\[
\text{subject to (for } 1 \le i \le n\text{):}\quad c_i\,(w \cdot x_i - b) \ge 1. \tag{4.51}
\]
Note that the factor of 0.5 is used for mathematical convenience. This problem
can now be solved by standard quadratic programming techniques and programs.
A typical 2D case is shown in Fig. 4.16.
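A toy example of training such a maximum-margin classifier with scikit-learn follows; scikit-learn is an assumed third-party dependency (not used in the book), and note that its convention is w·x + b rather than the w·x − b used above.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0],    # class +1
              [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])   # class -1
c = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)       # a very large C approximates the hard margin
clf.fit(X, c)

w, b = clf.coef_[0], clf.intercept_[0]  # separating hyperplane w.x + b = 0
margin = 2.0 / np.linalg.norm(w)        # distance between the two margin hyperplanes
print(w, b, margin, clf.predict([[3.0, 2.0]]))
```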
list of previous retrieval results, the system learns the models the user desires by
using the SVM approach. Ibato et al. carried out many experiments by combining
the transform-invariant D2 shape features [19] with the SVM, feeding the feature
vector to an SVM to compute the dissimilarity. The experimental results show that,
despite its simplicity, the system works well in retrieving shapes that a user feels
“similar” to the given examples.
4.5.3.2 SOM
respond similarly to certain input patterns. This is partly motivated by the way that
the visual, auditory or other sensory information is handled in separate parts of the
cerebral cortex in the human brain. The weights of the neurons are initialized
either as small random values or sampled evenly from the subspace spanned by
the two largest principal component eigenvectors. Obviously, with the latter
alternative, learning is much faster since the initial weights already give good
approximation of SOM weights. The network must be fed a large number of
example vectors that represent, as closely as possible, the kinds of vectors
expected during the mapping process. The examples are usually administered
multiple times. The training utilizes competitive learning methods. When a
training example is fed to the network, its Euclidean distance to all weight vectors
is calculated. The neuron with its weight vector most similar to the input is called
the best matching unit (BMU). The weights of the BMU and neurons close to the
input in the SOM lattice are then adjusted towards the input vector. The magnitude
of the modification decreases with both time and the distance from the BMU. In
the simplest form, the magnitude is one for all neurons close enough to BMU and
zero for others. A Gaussian function is also a common choice. Regardless of the
functional form, the neighborhood function shrinks with time. At the beginning,
when the neighborhood is broad, the self-organizing operation takes place on a
global scale. When the neighborhood has shrunk to just a couple of neurons, the
weights are converging to local estimates. This process is repeated for each input
vector for a large number of cycles. The network winds up the associated output
nodes with groups or patterns in the input data set. If these patterns can be named,
the names can be attached to the associated nodes in the trained net. During the
mapping process, there will be one single winning neuron, i.e., the neuron whose
weight vector lies nearest to the input vector. This can be simply determined by
computing the Euclidean distance between the input and weight vectors. It should
be noted that any kind of object that can be represented digitally, and with which
an appropriate distance measure is associated and in which the necessary
operations for training are possible, can be used to construct a self-organizing map.
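A compact NumPy sketch of the training loop described above is shown below, with random initialization, a Gaussian neighborhood function, and exponentially decaying learning rate and neighborhood radius; all hyper-parameter values and names are arbitrary choices for this example.

```python
import numpy as np

def train_som(data, grid=(10, 10), n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal SOM training loop sketch: data is an (n, d) array of feature
    vectors; returns the (grid_h, grid_w, d) array of neuron weight vectors."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.random((h, w, data.shape[1]))                  # random initialization
    # 2D lattice coordinates of the neurons, used for the neighborhood function
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)
    for t in range(n_iter):
        x = data[rng.integers(len(data))]                        # random training vector
        dist = np.linalg.norm(weights - x, axis=-1)              # distance to all neurons
        bmu = np.unravel_index(np.argmin(dist), dist.shape)      # best matching unit
        lr = lr0 * np.exp(-t / n_iter)                           # decaying learning rate
        sigma = sigma0 * np.exp(-t / n_iter)                     # shrinking neighborhood
        lattice_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
        theta = np.exp(-lattice_d2 / (2 * sigma ** 2))           # Gaussian neighborhood
        weights += lr * theta[..., None] * (x - weights)         # pull towards the input
    return weights
```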
Pedro et al. [91] described a system for querying 3D model databases based on
the spin image representation as a shape signature for objects depicted as
triangular meshes. The spin image representation facilitates the task of aligning
the query object with respect to matched models. The main contribution of this
work is the introduction of a three-level indexing schema with artificial neural
networks. The indexing schema improves greatly the efficiency in matching the
query spin images against those stored in the database. Their results are suitable
for content-based retrieval in 3D general object databases. Their method achieves
both compression and indexing of the original set of spin images. Basically, a
self-organized map is built from the stack of spin images of a given object. This is
a way of “summarizing” the whole stack into a set of representative spin images.
Then, the kernel K-means clustering algorithm is utilized in order to group
representative views in the SOM map into a reduced set of clusters. At the query
time, the input spin images will be first compared with the clusters’ centers
resulting from the kernel K-means method and subsequently with the SOM map if
a finer answer is requested.
of the closest training sample (i.e. when k = 1) is called the nearest neighbor
algorithm. The accuracy of the KNN algorithm will be severely degraded if there
are noisy or irrelevant features, or if the feature scales are not consistent with their
importance. Many research efforts have been put into selecting or scaling features
to improve classification. A particularly popular approach is to utilize evolutionary
algorithms to optimize feature scaling. Another popular approach is to scale
features by the mutual information of the training data with the training classes.
Ip et al. [92] proposed a weighted similarity function for CAD model
classification based on an underlying shape distribution feature representation and
a KNN learning algorithm. Given a set of CAD solid models and corresponding
classes, the KNN learning method was used to extract the related patterns to
automatically construct a model classifier and identify new or hidden
classifications using the shape distribution feature, learning from the stored,
correctly categorized training examples. In addition, probabilistic approaches,
such as Bayes theorem, are also a practical way for similarity matching, in which
specific probabilities of features are calculated and the 3D model having the
highest probability will be identified as the closest matching result [93].
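A minimal sketch of KNN classification with an optional per-feature weight vector, in the spirit of the weighted similarity function mentioned above; the interface and names are assumptions of this example.

```python
import numpy as np
from collections import Counter

def knn_classify(query, features, labels, k=5, weights=None):
    """KNN classification sketch: features is an (n, d) array of training
    descriptors, labels a length-n list of class labels for those samples."""
    w = np.ones(features.shape[1]) if weights is None else weights
    d = np.sqrt(np.sum(w * (features - query) ** 2, axis=1))   # weighted distances
    nearest = np.argsort(d)[:k]                                # k closest samples
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]                          # majority class
```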
Using a graded relevance scale of documents in a search engine result set, DCG measures
the usefulness, or gain, of a document based on its position in the result list. The
gain is accumulated from the top of the result list to the bottom, with
the gain of each result discounted at lower ranks. Other measures include the
precision at k (i.e., precision of top k results) and the mean average precision.
Implicit feedback is inferred from the user behavior, such as noting which
documents they do or do not select for viewing, the duration of time spent in
viewing a document, or page browsing or scrolling actions. The key differences
between implicit and explicit relevance feedback include the following: The user
is not assessing relevance for the benefit of the IR system, but only satisfying their
own needs and the user is not necessarily informed that their behavior (selected
documents) will be used as relevance feedback. An example of this is the Surf
Canyon browser extension, which advances search results from later pages of the
result set based on both the user interaction (clicking an icon) and the time spent
in viewing the page linked to a search result. Blind or “pseudo” relevance
feedback is obtained by assuming that the top k documents in the result set
containing n results (usually where k << n) are relevant. Blind feedback automates
the manual part of relevance feedback and has the advantage that assessors are not
required.
Actually, machine-learning methods can also be used to implement users’
relevance feedback mechanism in 3D model retrieval to iteratively refine the
retrieval results step by step, by making designed reactions to the user’s
interactive evaluations. This can also achieve a personalized retrieval, based on
different user’s preferences. A good example is Elad et al.’s work on relevance
feedback [70, 94]. They made use of the SVM learning algorithm to derive the
optimal weight combination for a weighted Euclidean distance metric, and made
stepwise improvements to the similarity match, according to every iteration of the
user’s interactive evaluation. The detailed approach can be illustrated as follows.
Assuming that two feature vectors X and Y constitute partial descriptions of
database objects DX and DY respectively, we can measure the distance between the
objects using the squared Euclidean distance
\[
d(D_X, D_Y) = \| X - Y \|^2. \tag{4.52}
\]
Using the Euclidean distance alone, the automatic search of the database will
indeed produce objects that are geometrically close to the given one. However,
these may not be what the human user has in mind when initiating the search.
Therefore, Elad et al. employed a further "parameterization" of this distance by
adding a weight matrix W and a bias value b:
\[
d(D_X, D_Y) = (X - Y)^{\mathrm{T}}\, W\, (X - Y) + b, \tag{4.53}
\]
relevant and some of them irrelevant, no matter that they are all geometrically
close. The adaptation of the distance function can be done by re-computing
distances, based on the user preferences. The additional requirement is that the
new distance between the given object and the relevant results should be small and,
obviously, the new distance between the given object and the irrelevant results
should be large. In essence, this is a classification or a learning problem. One way
of formulating the requirements is to define weights on the components of the
distance function and writing a set of constraints. Denote the feature vector of the
object for which the system is to search by O, the feature vectors of the “relevant”
results by \(\{G_k\}_{k=1}^{n_G}\), and the feature vectors of the "irrelevant" results by \(\{B_l\}_{l=1}^{n_B}\):
\[
\begin{aligned}
&\forall k = 1, 2, \ldots, n_G:\quad d(D_O, D_{G_k}) = [O - G_k]^{\mathrm{T}} W [O - G_k] + b \le 1, \\
&\forall l = 1, 2, \ldots, n_B:\quad d(D_O, D_{B_l}) = [O - B_l]^{\mathrm{T}} W [O - B_l] + b \ge 2.
\end{aligned} \tag{4.54}
\]
This generates a margin between the "relevant" and "irrelevant" results. The
above inequalities are linear with respect to the entries of W. Denoting the main
diagonal of W by w, we may rewrite the constraints as follows:
\[
\begin{aligned}
&\forall k = 1, 2, \ldots, n_G:\quad d(D_O, D_{G_k}) = w^{\mathrm{T}} [O - G_k]^2 + b \le 1, \\
&\forall l = 1, 2, \ldots, n_B:\quad d(D_O, D_{B_l}) = w^{\mathrm{T}} [O - B_l]^2 + b \ge 2,
\end{aligned} \tag{4.55}
\]
where the square is taken element-wise.
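One hedged way to realize this learning step is to feed the element-wise squared differences to a linear SVM, so that the learned coefficients play the role of the diagonal weights w and the intercept that of b; the sketch below uses scikit-learn's LinearSVC as an assumed dependency and does not enforce non-negative weights, which a practical system would likely need to add.

```python
import numpy as np
from sklearn.svm import LinearSVC

def learn_weights(O, relevant, irrelevant):
    """Sketch of the weight-learning step: squared per-feature differences to
    the query O act as SVM inputs; the learned linear classifier yields a
    diagonal weight vector w and a bias b in the spirit of Eq. (4.55).
    relevant/irrelevant are non-empty lists of user-marked feature vectors."""
    Z = np.array([(O - g) ** 2 for g in relevant] +
                 [(O - h) ** 2 for h in irrelevant])
    y = np.array([-1] * len(relevant) + [1] * len(irrelevant))   # small vs. large distance
    clf = LinearSVC(C=10.0).fit(Z, y)
    return clf.coef_[0], clf.intercept_[0]

def weighted_distance(O, X, w, b):
    """Refined distance: relevant models should now score low."""
    return float(w @ (O - X) ** 2 + b)
```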
As 3D model retrieval results achieved with low-level features have proven not to be as discriminative as people had expected, another important issue arises, that is, subjective semantic measurement in similarity comparison. Furthermore,
whether a retrieved 3D model is “relevant” or “irrelevant” to the query is also
judged by the users according to their subjective perception, related to the
semantic content. Consequently, it is highly significant to develop semantic
similarity-matching methods that take human perception into account in
content-based 3D model retrieval systems.
Many approaches that have been proposed in 2D media retrieval to reduce the
“semantic gap” try to perform similarity measurement based on high-level
semantics. One method is to learn the connections between a 3D model and a set
of semantic descriptors, or the semantic meanings from those automatically
extracted 3D model features. This approach is usually based on machine learning
and statistical classification, which groups 3D models into semantically
meaningful categories using low-level features so that semantically-adaptive
searching methods can be applied to different categories. Examples are as follows.
Suzuki et al. [78] constructed a multidimensional scaling mechanism so that
semantic keyword descriptors used in the query and the shape features calculated
from the 3D shapes were strongly correlated, based on a training data set. The
multidimensional scaling mechanism can analyze matrices of similarity or dissimilarity data by representing the rows and the columns as points in Euclidean space and then measuring their similarities using Euclidean distances. They then created a
special user preference space according to this principle, in which a function
mapping from the 3D model space was constructed to integrate semantic
keywords and 3D shapes as a representation of human subjective perception.
Zhang et al. [95] introduced the concept of “hidden annotation” to construct a
semantic tree of the whole 3D model database. They used an active learning
method to calculate a list of probabilities for each 3D model, which indicated the
the upper-left corner of the interface is the query model inputted by the users,
while the returned 16 similar models are listed below.
Fig. 4.17. The QBE-based 3D model retrieval demo system developed by the authors of this book
Draft or sketch is the most extensively applied query interface in practice. Since users draw the basic features of a 3D model based on their conception, the system extracts shape features from the sketches to match and retrieve models in the database. The 2D sketch is currently very attractive in image retrieval, and it can naturally be extended to view-based 3D model retrieval. In this manner, with a number of sketches drawn by users as the query request, the matching operation is conducted according to 2D projections of the 3D object from different view angles. Apart from the 2D sketch interface, there also exist 3D sketch query interfaces. Teddy is a very typical 3D sketch editing environment: from a user's 2D strokes, it can construct a 3D shape in accordance with certain rules. This technology has been adopted by the 3D search engine at Princeton University as a user input interface.
In the subsequent three subsections, we will introduce query by 2D projections,
query by 2D sketches and query by 3D sketches, respectively.
3D-to-2D projection denotes any method of mapping 3D points onto a 2D plane. Since most of the current methods for displaying graphical data are based on planar 2D media, the use of this type of projection is widespread, especially in computer graphics, engineering and drafting. There are two typical projection methods: orthographic projection and perspective projection.
In the orthographic projection, a 3D point a = (a_x, a_y, a_z) is mapped onto the 2D plane as
$$b_x = s_x a_x + c_x,\qquad b_y = s_z a_z + c_z, \qquad (4.57)$$
where the vector s is an arbitrary scale factor and c is an arbitrary offset. These
constants are optional, and can be used to properly align the viewport. The
projection can be shown through the following matrix notation, where we
introduce a temporary vector d for clarity.
$$\begin{bmatrix} d_x \\ d_y \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} a_x \\ a_y \\ a_z \end{bmatrix},\qquad \begin{bmatrix} b_x \\ b_y \end{bmatrix} = \begin{bmatrix} s_x & 0 \\ 0 & s_z \end{bmatrix}\begin{bmatrix} d_x \\ d_y \end{bmatrix} + \begin{bmatrix} c_x \\ c_z \end{bmatrix}. \qquad (4.58)$$
a rotation matrix to the result. This transformation is often called a camera transform (note that these calculations assume a left-handed system of axes). The transformed point can then be projected onto the 2D plane using the following formula (here the x-y plane is used as the projection plane, though other literature may also use x-z):
$$b_x = (d_x - e_x)\,(e_z / d_z),\qquad b_y = (d_y - e_y)\,(e_z / d_z). \qquad (4.60)$$
The distance of the viewer from the display surface, $e_z$, directly relates to the field of view, where $\alpha = 2\tan^{-1}(1/e_z)$ is the viewed angle. Note that this assumes that you map the points $(-1, -1)$ and $(1, 1)$ to the corners of your viewing surface.
Subsequent clipping and scaling operations may be necessary to map the 2D plane
onto any particular display media.
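The two projection formulas above can be summarized in a few lines of code. The sketch below (plain numpy, with illustrative parameter defaults) computes the orthographic mapping of Eqs.(4.57)-(4.58) and the perspective mapping of Eq.(4.60) for a single point; in a retrieval system such a routine would be applied to every vertex of the model for each chosen viewpoint.

```python
import numpy as np

def orthographic_xz(a, s=(1.0, 1.0), c=(0.0, 0.0)):
    """Orthographic projection onto the x-z plane, as in Eqs.(4.57)-(4.58):
    b_x = s_x*a_x + c_x, b_y = s_z*a_z + c_z."""
    ax, ay, az = a
    return np.array([s[0] * ax + c[0], s[1] * az + c[1]])

def perspective(d, e=(0.0, 0.0, 1.0)):
    """Perspective projection of a camera-space point d onto the x-y plane,
    as in Eq.(4.60): b = (d_xy - e_xy) * (e_z / d_z)."""
    dx, dy, dz = d
    ex, ey, ez = e
    return np.array([(dx - ex) * (ez / dz), (dy - ey) * (ez / dz)])

if __name__ == "__main__":
    p = np.array([0.5, -0.2, 2.0])
    print("orthographic:", orthographic_xz(p))
    print("perspective :", perspective(p, e=(0.0, 0.0, 1.0)))
```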
In content-based 3D model retrieval, 2D projection views themselves can be
adopted as features of a 3D model [104], while the query by 2D projections means
representing a query with a set of 2D projection images of a 3D example model
from different viewpoints [33]. Since both 2D projections and 2D sketches are 2D images, readers can refer to Fig. 4.18 for a similar demo system of 3D model retrieval with queries by 2D projections.
Query by text means that the query interface is based on text keywords [33] and/or
semantic descriptions [95]. Attempting to find a 3D model using just text
keywords suffers from the same problems as any text search: a text description
may be too limited, incorrect, ambiguous, or in a different language. Furthermore,
3D models contain shape and appearance information which is hard to query just
based on text. In many cases, a shape query is able to describe a property of a 3D
model that is hard to specify using text alone. As shown in Fig. 4.20, querying by the overly common keyword "plane" produces poor retrieval results. Thus, we often combine the text-based query with the sketch-based query, as discussed in the subsection below.
Fig. 4.20. The retrieval results for the query by the text keyword “plane” [105] (With courtesy of
Min et al.)
Fig. 4.21. The retrieval results for the query by the text keyword “table” and 2D sketch [105]
(With courtesy of Min et al.)
Fig. 4.22. Relevance feedback interface developed by the authors of this book
4.7 Summary
References
pp. 375-380.
[8] P. Shilane, M. Kazhdan, P. Min, et al. The Princeton shape benchmark. In:
Proceedings of Shape Modeling International, 2004.
[9] J. Tangelder and R. Veltkamp. Polyhedral model retrieval using weighted point
sets. Int. J. Image Graph., 2003, 3:1-21.
[10] T. Zaharia and F. Prêteux. 3D versus 2D/3D shape descriptors: A comparative
study. In: Proc. SPIE Conf. Image Process.: Algorithms Syst. III—SPIE Symp.
Electron. Imaging, Sci. Technol., 2004, Vol. 5298, pp. 47-58.
[11] Meshnose, the 3D Objects Search Engine. [Online]. Available:
http://www.deepfx.com/meshnose. 2003.
[12] National Design Repository. [Online]. Available: http://www.deepfx.com/meshnose.
2003.
[13] H. Berman, J. Westbrook, Z. Feng, et al. The protein data bank. Nucleic Acids
Res., 2000, 28:235-242.
[14] K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly
relevant documents. In: Proc. 23rd ACM SIGIR Conf. Res. Dev. Inf. Retrieval,
2000, pp. 41-48.
[15] J. Rocchio. Relevance feedback in information retrieval. The SMART Retrieval
System: Experiments in Automatic Document Processing, Prentice-Hall, Englewood
Cliffs, NJ, 1971, pp. 313-323.
[16] B. Bustos, D. Keim, D. Saupe, et al. An experimental comparison of feature-based
3D retrieval methods. Paper presented at The Int. Symp. 3D Data Process., Vis.,
Transmiss., 2004, pp. 215-222.
[17] S. M. Beitzel. On Understanding and Classifying Web Queries. Ph.D Thesis,
2006.
[18] P. Min. A 3D model search engine. Ph.D Dissertation. Dept. Comput. Sci.
Princeton Univ., Princeton, NJ, 2004.
[19] R. Osada, T. Funkhouser, B. Chazelle, et al. Matching 3D models with shape
distributions. Shape Modeling International, 2001, pp. 154-166.
[20] A. W. M. Smeulders, M. Worring, S. Santini, et al. Content-based image retrieval
in the early years. IEEE Trans. Pattern Anal. Mach. Intell., 2000,
22(12):1349-1380.
[21] R. Ohbuchi, M. Nakazawa and T. Takei. Retrieving 3D shapes based on their
appearance. In: Proc. 5th ACM SIGMM, Int. Workshop Multimedia Inf.
Retrieval, Berkeley, CA, 2003, pp. 39-45.
[22] R. Ohbuchi and T. Takei. Shape-similarity comparison of 3D models using alpha
shapes. In: Proc. 11th Pacific Conf. Comput Graph. Appl. (PG 2003), 2003, pp.
293-302.
[23] P. Min, M. Kazhdan and T. Funkhouser. A comparison of text and shape
matching for retrieval of online 3D models. In: Proceedings of the 8th European
Conference on Digital Libraries (ECDL 2004), 2004, pp. 209-220.
[24] M. Kazhdan, T. Funkhouser and S. Rusinkiewicz. Shape matching and
anisotropy. ACM Trans. Graph., 2004, 23(3):623-629.
[25] J. W. H. Tangelder and R. C. Veltkamp. A survey of content based 3D shape
retrieval methods. In: Proceedings of the Shape Modeling International 2004
(SMI’04), 2004, pp. 145-156.
[26] D. Y. Chen and M. Ouhyoung. A 3D model alignment and retrieval system. In:
Proceedings of International Computer Symposium, Workshop on Multimedia
Technologies, 2002, pp. 1436-1443.
[27] T. Funkhouser, P. Min, M. Kazhdan, et al. A search engine for 3D models. ACM
Transactions on Graphics (TOG), 2003, 22:83-105.
[28] M. Kazhdan, T. Funkhouser and S. Rusinkiewicz. Rotation invariant spherical
harmonic representation of 3D shape descriptors. In: Proceedings of the
Eurographics/ACM SIGGRAPH Symposium on Geometry Processing, 2003, pp.
156-164.
[29] R. Osada, T. Funkhouser, B. Chazelle, et al. Shape distributions. ACM
Transactions on Graphics (TOG), 2002, 21:807-832.
[30] E. Chávez, G. Navarro, R. Baeza-Yates et al. Searching in metric spaces. ACM
Computing Surveys (CSUR), 2001, 33:273-321.
[31] C. Böhm, S. Berchtold and D. A. Keim. Searching in high-dimensional spaces:
Index structures for improving the performance of multimedia databases. ACM
Computing Surveys (CSUR), 2001, 33:322-373.
[32] D. V. Vraníc and D. Saupe. 3D model retrieval. Paper presented at The Spring
Conf. Comput. Graph. (SCCG 2000), 2000.
[33] P. Min, A. Halderman, M. Kazhdan, et al. Early experiences with a 3D model
search engine. In: Proc. Web3D Symp., 2003, pp. 7-18.
[34] M. Ankerst, G. Kastenmuller, H. Kriegel, et al. Nearest neighbor classification in
3D protein databases. In: Proc. ISMB, 1999, pp. 34-43.
[35] K. Pearson. On lines and planes of closest fit to systems of points in space.
Philosophical Magazine, 1901, 2(6):559-572.
[36] R. Ohbuchi, T. Otagiri, M. Ibato, et al. Shape-similarity search of
three-dimensional models using parameterized statistics. In: Proc. 10th Pacific
Conf. Comput. Graph. Appl., 2002, pp. 265-275.
[37] E. Paquet, A. Murching, T. Naveen, et al. Description of shape information for
2-D and 3-D objects. Signal Process.: Image Commun., 2000, 16:103-122.
[38] M. Heczko, D. Keim, D. Saupe, et al. A method for similarity search of 3D
objects (in German). In: Proc. BTW, 2001, pp. 384-401.
[39] D. Vraníc, D. Saupe and J. Richter. Tools for 3D-object retrieval: Karhunen-Loeve
transform and spherical harmonics. In: Proc. IEEE, Workshop Multimedia Signal
Process, 2001, pp. 293-298.
[40] M. Kazhdan. Shape representations and algorithms for 3D model retrieval. Ph.D
Dissertation, Dept. Comput. Sci., Princeton University, Princeton, NJ, 2004.
[41] S. Gottschalk. Collision queries using oriented bounding boxes. Ph. D Dissertation,
Department of Computer Science, University of North Carolina at Chapel Hill,
1999.
[42] T. Akenine-Möller and E. Haines. Real-Time Rendering (2nd ed.). A K Peters, Ltd., 2002, pp. 564-567.
[43] J. Pu, Y. Liu, G. Xin, et al. 3D model retrieval based on 2D slice
similarity measurements. In: Proc. 2nd International Symposium on 3D Data
Processing, Visualization and Transmission (3DPVT 2004), 2004, pp. 95-101.
[44] M. de Berg, M. van Kreveld, M. Overmars, et al. Computational Geometry (2nd
revised ed.). Springer-Verlag, 2000, pp.45-61.
[45] A. Fournier and D. Y. Montuno. Triangulating simple polygons and equivalent
problems. ACM Transactions on Graphics, 1984, 3(2):153-174.
[46] B. Chazelle. Triangulating a simple polygon in linear time. Discrete &
Computational Geometry, 1991, 6:485-524.
[47] R. Seidel. A simple and fast incremental randomized algorithm for computing
trapezoidal decompositions and for triangulating polygons. Computational
[105] P. Min, J. Chen and T. Funkhouser. A 2D sketch interface for a 3D model search
engine. In: Proc. SIGGRAPH Tech. Sketches, 2002, p. 138.
[106] T. Igarashi, S. Matsuoka and H. Tanaka. Teddy: A sketching interface for 3D
freeform design. In: Proc. SIGGRAPH 1999, ACM, 1999, pp. 409-416.
[107] C. Zhang and T. Chen. Efficient feature extraction for 2D/3D objects in mesh
representation. Paper presented at The ICIP, 2001.
[108] J. Corney, H. Rea, D. Clark, et al. Coarse filters for shape matching. IEEE
Comput. Graph. Appl., 2002, 22(3):65-74.
[109] D. McWherter, M. Peabody, A. Shokoufandeh, et al. Solid model databases:
Techniques and empirical results. ASME/ACM Trans., J. Comput. Inf. Sci. Eng.,
2001, 1(4):300-310.
[110] M. Suzuki, Y. Yaginuma and Y. Sugimoto. A 3D model retrieval system for
cellular phones. In: Proc. IEEE Int. Conf. Syst Man Cybern, 2003, pp.
3846-3851.
[111] M. Novotni and R. Klein. A geometric approach to 3D object comparison. In:
Proc. Int. Conf. Shape Model. Appl., 2001, pp. 167-175.
5
3D Model Watermarking
5.1 Introduction
3D meshes have been used more and more widely in industrial, medical and
entertainment applications during the last decade. Many researchers, from both the
academic and industrial sectors, have become aware of intellectual property
protection and authentication problems arising with their increasing use. Apart from familiar multimedia content, such as images, text, audio and video, the issues of copyright protection and piracy detection are now emerging in the fields of CAD, CAM, computer aided education (CAE) and computer graphics (CG), etc. Scientific visualization, computer animation and virtual reality (VR) are
three hot topics in the field of computer graphics. On the one hand, with the
development of collaborative design and virtual products in the network
environment, it is expected that consumers will prefer models consisting of points, lines and faces, rather than material objects or accessories. As a result, only the authorized user should be able to replicate, modify or recreate the model. The models that we handle are all three-dimensional and digital, and can be called 3D graphics, 3D objects or 3D models. The issue of how to protect, and even manipulate and control, 3D models and other CAD products thus arises. On the other hand,
with the rapid development in communication and distribution technology, digital
content creation sometimes requires the cooperation of many creators. In
particular, the scale of 3D objects is large and special skills are needed for the
creation of 3D objects. Therefore, to create good and complex 3D content, the
cooperation of many creators may be necessary and important. In the scenario of
the joint-creation of 3D objects in a manufacturing environment, the creatorship of the participating creators becomes a big issue. There are some concerns for
participating creators during the creation process. Firstly, each participating
creator wants to prove his/her creatorship. Secondly, all of the participating
creators want to verify the joint-creatorship of the whole product. Thirdly, it is
necessary to prevent some creators from neglecting other creators and asserting
the whole creatorship of the final product and selling the product to a buyer. How to protect each creator's creatorship and how to account for his/her level of contribution are major challenges.
Digital watermarking has been considered a potentially efficient solution for
copyright protection of various multimedia content. This technique carefully hides
some secret information in the functional part of the cover content. Compared
with cryptography, the digital watermarking technique is able to protect digital
works (assets) after the transmission phase and legal access. Thus, digital
watermarking techniques can provide us with a very effective approach to embed
digital watermarks in 3D model data, such that the copyright of 3D models and
other CAD products can be effectively protected. Nowadays, this research area is
becoming a new hot topic in the field of digital watermarking. 3D model digital
watermarking technology is a branch of digital watermarking technology, and its
main aim is to embed invisible watermarks in 3D models to authenticate 3D
models or embed information to claim the model’s ownership. Watermarking 3D
objects has been performed from various perspectives. In [1], an optical-based
system employing phase shift interferometry was devised for mixing holograms of
3D objects, representing the cover media and the hidden data, respectively.
Watermarking of texture attributes has been attempted by Garcia and Dugelay [2].
Hartung et al. watermarked the stream of MPEG-4 animation parameters,
representing information about shape, texture and motion, by using a spread
spectrum approach [3].
Attributes of 3D graphical objects can be easily removed or replaced. This is
why most of the 3D watermarking algorithms are applied on the 3D graphical
object geometry. Authentication is concerned with the protection of the cover
media and should indicate when it has been modified. Authentication of 3D
graphical objects by means of fragile watermarking has been considered in [4, 5].
Ohbuchi et al. discussed three methods for embedding data into 3D polygonal
models in [6]. Many approaches applied to 3D object geometry aim to ensure
invariance at geometrical transformations. This can be realized by using ratios of
various 2D or 3D geometrical measures [6-9]. Results provided by a watermarking
algorithm for copyright protection, by employing modifications in histograms of
surface normals, were reported by Benedens in [10]. Local statistics have been
used for watermarking 3D objects in [11, 12]. Multiresolution filters for mesh
watermarking have been considered in connection with interpolating surface basis
functions [13] and with pyramid-based algorithms [14]. Benedens and Busch
introduced three different algorithms, each of them having robustness to certain
attacks while being suitable for specific applications [15]. Algorithms that embed
data in surfaces described by NURBS use changes in control points [9] or
re-parameterization [16]. Wavelet decomposition of polygons was used for 3D
watermarking in [17, 18]. Watermarking algorithms that embed information in the
mesh spectral domain using graph Laplacian have been proposed in [19-21]. A few characteristics can be outlined for the existing 3D watermarking approaches.
Some of the 3D watermarking algorithms are based on displacing vertex locations
[9, 12, 13] or on changing the local mesh connectivity [14, 16]. Minimization of
local norms has been considered in the context of 3D watermarking in [15, 22].
Localized embedding has been employed in [6, 9, 15]. Arrangements of
embedding primitives have been classified according to their locality in: global,
local and indexed [6]. Localized and repetitive embedding is used in order to
increase the robustness to 3D object cropping [21].
Preferably, a watermarking system would require in the detection stage only
the knowledge of the watermark given by a key and that of the stego object.
However, most of the approaches developed for watermarking 3D graphical
objects are nonblind and require the knowledge of the cover media in the detection
stage [2, 3, 5, 10, 13, 14, 16-20, 22]. Some algorithms require complex
registration procedures [13-15, 18, 22] or should be provided with additional
information about the embedding process in the detection stage [8, 9, 17]. A
nonlinear 3D watermarking methodology that employs perturbations in the 3D
geometrical structure is described in [23]. The watermark embedding is performed
by a processing algorithm in two steps. In the first step, a string of vertices and
their neighborhoods are selected from the cover object. The selected vertices are
ordered according to a minimal distortion criterion. This criterion relies on the
calculation of the sum of Euclidean distances from a vertex to its connected
neighbors. A study of the effects of perturbations in the surface structure is
provided in this paper. The second step estimates first and second order moments
and defines two regions: one for embedding the bit “1” and another one for
embedding the bit “0”. First- and second-order moments have desirable properties
of invariance to various transformations [24, 25] and have been used for shape
description [7]. These properties can ensure the detection of the watermark after
the 3D graphical object is transformed by affine transformations. Two different
approaches that produce controlled local geometrical perturbations are considered
for data embedding, i.e., using parallel planes and bounding ellipsoids. The
detection stage is completely blind in the sense that it does not require the cover object.
This chapter is organized as follows. The description of general requirements
for 3D watermarking is provided in Section 5.2. Section 5.3 focuses on the
classification of 3D model watermarking algorithms. Section 5.4 discusses typical
spatial domain 3D mesh model watermarking schemes. Section 5.5 introduces the
robust adaptive 3D mesh watermarking algorithm proposed by the authors of this book, which belongs to the spatial-domain techniques. Section 5.6 introduces
typical transform-domain 3D mesh model watermarking schemes. Section 5.7
overviews watermarking algorithms for other types of 3D models. Finally,
conclusions and summaries are given in Section 5.8.
Fig. 5.1. A visible watermark embedded in the Lena image [27] (© 2003 IEEE)
$$\hat{X} = E_K(X, W), \qquad (5.1)$$
$$\hat{W} = D_K(\hat{X}'), \qquad (5.2)$$
A typical 3D model watermarking system [26] is shown in Fig. 5.2. During the
watermark embedding process of this system, the watermark is embedded in some
way in the spatial or transformed domains of the original 3D model (i.e., cover
model), so that the watermarked 3D model (i.e., stego model) is acquired. For
example, a watermark bit can be embedded into the original 3D NURBS model
surface to get the watermarked NURBS model. The stego 3D model is transmitted
or sent through various channels, during which the stego model may be subject to
a variety of attacks, including unintentional attacks and intentional attacks. Here,
unintentional modifications are applied to a data object during the course of its
normal use, while intentional modifications are applied to the data object with the
intention of modifying or destroying the watermark.
At the detection end, we can extract the watermark from a suspect model
through blind or non-blind detection methods. By comparison of the extracted
watermark with the original watermark to calculate the similarity, the existence of
the original watermark can be judged and the authenticity of the 3D model
copyright source or content can be identified. On some special occasions, the
original 3D model may also need to be restored in the watermark extraction, such
as reversible watermarking applications.
5.2.3 Difficulties
There are still few watermarking methods for 3D meshes, in contrast with the
relative maturity of the theory and practices of image, audio and video
watermarking. This situation is mainly caused by the difficulties encountered
while handling the arbitrary topology and irregular sampling of 3D meshes, as
well as the complexity of the possible attacks on watermarked meshes.
A 3D mesh model can be very small, so the payload capacity can be low. Besides, there are multiple representations for exactly the same 3D model because of the lack of an inherent order. We can consider an image as a
matrix, and each pixel as an element of this matrix. This means that all of these
pixels have an intrinsic order in the image, for example, the order established by
row or column scanning. This order is usually used to synchronize watermark bits
(i.e. to know where the watermark bits are and in which order). On the contrary,
there is no simple robust intrinsic ordering for mesh elements, which often
constitute the watermark bit carriers (primitives). Some intuitive orders, such as
the order of the vertices and facets in the mesh file, and the order of vertices
obtained by ranking their projections on an axis of the objective Cartesian
coordinate system, are easy to alter. In addition, because of their irregular
sampling, it is very difficult to transform a 3D model into the frequency domain
for further operation, and thus we still lack an effective spectral analysis tool for
3D meshes. This situation makes it difficult to apply existing successful spectral
watermarking schemes on 3D meshes.
In addition to the above point, robust watermarks also have to face various intractable attacks. Many attacks on geometry or topology may undermine the watermark, such as mesh simplification and remeshing. The reordering of vertices
and facets does not have any impact on the shape of the mesh, while it can
seriously desynchronize the watermarks that rely on this straightforward ordering.
The similarity transformations, including translation, rotation, uniform scaling and
their combination, are supposed to be common operations through which a robust
watermark should survive. Even worse, the original watermark primitives can
disappear after a mesh simplification or remeshing. Such tools are available in
many software packages, and they can completely destroy the connectivity
information of the watermarked mesh while well conserving its shape. Usually,
the possible attacks can be classified into two groups: the geometric attacks that only modify the positions of the vertices, and the connectivity attacks that also alter the connectivity of the mesh.
5.2.4 Requirements
The aim of digital watermarking not only lies in ensuring that the embedded data will not be found and destroyed, but also in ensuring that, after the carrier together with the embedded information has been subjected to intentional or unintentional operations (such as conversion, compression and simplification), the information can still be extracted correctly from the carrier, or that some measure can be designed to estimate the possibility that the information exists. Therefore, a digital watermark should normally have the following characteristics: (1) Vindicability. The watermark should be able to provide complete and reliable evidence for the attribution of copyright-protected multimedia products; (2) Imperceptivity. It is not visible and is statistically undetectable; (3) Robustness. It should be able to bear a large number of
different physical and geometric distortions, including intentional or unintentional
attacks. The watermarking diagram for 3D mesh is basically similar to that for
other media, as shown in Fig. 5.2. However, the point, line and surface data of a 3D mesh have no natural ordering, and 3D meshes are usually subject to affine transformations such as translation, rotation and scaling, as well as to mesh compression and mesh simplification; 3D mesh watermarking methods are therefore
distinguished greatly from other media watermarking methods. A brief description
for all of the requirements for 3D model watermarking is given as follows.
The second important requirement for 3D watermarks is the ability to detect the
watermark even after the object has undergone various transformations or attacks.
In any watermarking or fingerprinting approach, there is a trade-off between being
able to make the watermark survive a set of transformations and the actual
visibility of the watermark. Such transformations can be inherent for 3D object
manipulation in computer graphics or computer vision or they may be done
intentionally with the malicious purpose of removing the watermark.
Transformations of 3D meshes can be classified into geometrical and topological
transformations. Geometrical transformations include affine transformations such
as rotation, translation, scale normalization, vertex randomization, or their
combinations, and can be local or applied to the entire object. Topological
transformations consist of changing the order of vertices in the object description
file, mesh simplification for the purpose of accelerating the rendering speed, mesh
smoothing, insection operation, remeshing, partial deformation or cropping parts
of the object. Other processing algorithms include object compression and
encoding, such as MPEG-4. Smoothing and noise corruption algorithms can be
mentioned in the category of intentional attacks. A large variety of attacks can be
modeled generically by noise corruption. Noise corruption in 3D models amounts
to a succession of small perturbations in the location of vertices.
Table 5.1 compares the potential attacks on image watermarking algorithms and on 3D object watermarking algorithms [27]. As is evident from the table, virtually every attack on image watermarking algorithms has a counterpart among the attacks on 3D watermarking algorithms. However, an important distinction must not be
ignored: Attack methods on 3D meshes are much more complicated. In fact, an
image is 2D and is uniformly sampled, while a 3D mesh corresponds to 3D space
points with a certain topology and non-uniform sampling. Therefore, many image
processing methods cannot be directly extended to 3D geometric data. In Table 5.1,
the remeshing operation is a unique attack on 3D models. Remeshing is actually a
resampling operation on the geometric shape of a 3D model and usually causes
topology alterations.
Space utilization and robustness normally contradict each other. As a result, making the most efficient use of the available space is one important criterion in the evaluation of mesh watermarking algorithms, and this involves properly coordinating the relationship between the robustness of the watermark and space utilization.
The description forms of 3D models may also be redundant. In this case, people may amend the description of the shape itself, without altering the shape, in order to embed information. For example, we can insert knots into NURBS surfaces without altering the geometry. Once embedded, the knots are very difficult to remove if the model geometry is forced to be maintained.
There will also be some redundancy in encoding the shape description, thus we can also embed watermarks without changing the geometry or the shape description. For example, suppose each control point coordinate of a CAD model has an accuracy of up to 6 bits, while the data format allows up to 10 bits; thus 4 bits out of the 10 can be used to carry embedded information.
Robustness and transparency are often at odds, i.e., making a watermark more robust tends to make it less transparent.
In 1999, Praun from Princeton University and Hoppe from Microsoft Research applied spread spectrum technology to triangle meshes, providing a robust mesh watermarking algorithm for arbitrary triangle meshes [39].
Spread-spectrum technology is a technical means used in information transmission,
which makes the signal bandwidth much wider than the minimum requirements to
send information. Spread spectrum is implemented with a code independent of the
data to be sent. The spread spectrum code should be received by the receiver
synchronously for the subsequent de-spread and data recovery processes.
Spread-spectrum technology makes signal detection and removal more difficult,
therefore the watermarking methods based on spread-spectrum technologies are
quite robust. Considering that the representation of mesh surfaces lacks natural
parametric methods based on frequency decomposition, Praun et al. constructed a
group of scalar functions using multi-resolution analysis on the mesh vertex
structure (Due to space limitations, the construction details are not illustrated here).
During the watermark embedding process, the basic idea is to disturb vertex
coordinates slightly along the direction of the surface normals, weighted by the corresponding basis function. Suppose that the watermark is a Gaussian noise sequence with zero mean and unit variance, w = {w_0, w_1, …, w_{m-1}}. To guarantee irreversibility, the original 3D model and its related information are both encrypted with hash functions, e.g., the MD5 or SHA-1 algorithms, and the encrypted sequence is used as the seed for the pseudo-randomizer. Basis functions, multiplied by a coefficient, are added to the 3D vertex coordinates. Every basis function i has a scalar impact factor φ_{ij} and a global displacement d_i for every vertex j, 0 ≤ i ≤ m−1, 0 ≤ j ≤ k−1. For each of the directions X, Y and Z, the embedding formula is as follows (take X for example):
$$x_j^w = x_j + \varepsilon \sum_{i=0}^{m-1} w_i\, \phi_{ij}\, h_i\, d_i^x,$$
where $x_j^w$ and $x_j$ are the X coordinates of the watermarked vertex $p_j^w$ and the original vertex $p_j$ respectively, $0 \le j \le k-1$, $\varepsilon$ is the embedding parameter, $d_i^x$ is the X component of the global displacement $d_i$, and $h_i$ is the amplitude of the $i$-th basis function. To counter topology attacks such as mesh simplification, an optimization method is used in this algorithm to remesh the attacked mesh model based on the connectivity of the original mesh model.
Simulation results show that this watermarking method is rather robust to such
operations as translation, rotation, uniform scaling, insection, smoothing,
simplification and remeshing and can also resist attacks of added noise, least
significant bits alteration and so on.
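The following numpy sketch illustrates the spirit of this spread-spectrum embedding. It does not reproduce Praun et al.'s multiresolution construction of the basis functions; instead, the matrix phi stands in for the per-vertex values of the m basis functions, the global displacement is folded into the vertex normals, and eps plays the role of the embedding parameter ε. Detection is non-blind, projecting the residual onto each basis function.

```python
import numpy as np

def embed_spread_spectrum(vertices, normals, watermark, phi, eps=0.002):
    """Perturb vertex coordinates along their normals, weighted by basis functions
    phi (m x k) and the watermark values (illustrative form only):
        v_j^w = v_j + eps * sum_i w_i * phi[i, j] * n_j
    Here the per-basis global displacement is folded into the normal n_j."""
    disp = (watermark @ phi)[:, None] * normals          # (k, 3) displacement field
    return vertices + eps * disp

def detect_spread_spectrum(suspect, original, normals, phi):
    """Non-blind detection: project the residual onto the normals and then onto each
    basis function; the responses are correlated with candidate watermarks outside."""
    residual = np.sum((suspect - original) * normals, axis=1)    # scalar offset per vertex
    return phi @ residual                                        # one response per basis

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    k, m = 2000, 32
    vertices = rng.normal(size=(k, 3))
    normals = vertices / np.linalg.norm(vertices, axis=1, keepdims=True)
    phi = np.abs(rng.normal(size=(m, k)))                # stand-in for multiresolution basis functions
    w = rng.standard_normal(m)                           # zero-mean, unit-variance watermark
    stego = embed_spread_spectrum(vertices, normals, w, phi)
    response = detect_spread_spectrum(stego, vertices, normals, phi)
    print("correlation with true watermark:", np.corrcoef(response, w)[0, 1])
```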
$$\mathbf{n}_i = \frac{1}{|S_i|}\sum_{j \in S_i}(\mathbf{p}_j - \mathbf{p}_i) = (n_{ix}, n_{iy}, n_{iz}), \qquad (5.7)$$
where $|S_i|$ represents the set cardinality, and the vector $\mathbf{n}_i$ is in essence a "discrete normal vector" that represents the change of the coordinates around $p_i$. Thus the mask function can be defined as follows:
$$\Lambda(p_i) = \{n_{ix}, n_{iy}, n_{iz}\}. \qquad (5.8)$$
In [41, 42], the embedding location is first confirmed and then a dithering
embedding method is performed in the ellipsoid that is derived from the vertices
connected to the selected location (vertex). The selection of embedding locations
is based on a geometry criterion. First, every “discrete normal vector” ni for every
vertex pi is computed according to Eq.(5.7). Then an ellipsoid is defined for each
vertex, which encloses all the connected vertices to pi. Obviously, the centroid of
the ellipsoid is calculated as follows:
$$\boldsymbol{\mu}_i = \frac{1}{|S_i|}\sum_{j \in S_i}\mathbf{p}_j, \qquad (5.9)$$
while the shape of the ellipsoid is determined by the variance (2-order statistics) as
follows:
$$U_i = K\,\frac{\sum_{j \in S_i}(\mathbf{p}_j - \boldsymbol{\mu}_i)(\mathbf{p}_j - \boldsymbol{\mu}_i)^{\mathrm{T}}}{|S_i|}, \qquad (5.10)$$
where K is a normalization factor. In general, U_i is not singular unless all the vertices connected to p_i are coplanar. Obviously, we should avoid any vertex p_i that produces a singular matrix U_i. In the case that U_i is non-singular, any vector q on the ellipsoid surface should satisfy the following condition:
$$(\mathbf{q} - \boldsymbol{\mu}_i)^{\mathrm{T}}\, U_i^{-1}\, (\mathbf{q} - \boldsymbol{\mu}_i) = 1. \qquad (5.11)$$
The sum of distances from $p_i$ to its connected vertices is defined as
$$D_i = \sum_{j \in S_i}\left\|\mathbf{p}_j - \mathbf{p}_i\right\|. \qquad (5.12)$$
Now we can safely select the vertices that satisfy D_i < T as the embedding primitives, where T is a predefined threshold. It should be noted that once a vertex is selected as an embedding location, all the vertices connected to it should be excluded from embedding watermark bits, in order not to interfere with each other.
All the vertices that satisfy the condition D_i < T are divided into several groups, each group consisting of m vertices, and then a binary watermark sequence of length m is embedded repeatedly. For any vertex in each group, two
embedding methods are adopted in [42]. In the first method, two parallel planes
that are at the same distance from the centroid μ_i are defined, and the normal vector Q_i of the parallel planes and their distance e_i from the centroid are calculated respectively as follows:
$$\mathbf{Q}_i = \frac{1}{|S_i|}\sum_{j \in S_i}\mathbf{n}_j, \qquad (5.13)$$
$$e_i = \frac{1}{|S_i|}\sum_{j \in S_i}\left[(\mathbf{p}_j - \boldsymbol{\mu}_i)^{\mathrm{T}}\mathbf{Q}_i\right]^2. \qquad (5.14)$$
If the watermark bit is "1", we should make the following formula hold:
$$(\mathbf{p}_i^w - \boldsymbol{\mu}_i)^{\mathrm{T}}\mathbf{Q}_i \ge e_i, \qquad (5.15)$$
where $p_i^w$ is the watermarked vertex. If the watermark bit is "0", then the following formula should hold:
$$(\mathbf{p}_i^w - \boldsymbol{\mu}_i)^{\mathrm{T}}\mathbf{Q}_i \le -e_i. \qquad (5.16)$$
In the second method, to embed a watermark bit "1", we can make $p_i^w$ lie inside the ellipsoid, i.e.,
$$(\mathbf{p}_i^w - \boldsymbol{\mu}_i)^{\mathrm{T}}\, U_i^{-1}\, (\mathbf{p}_i^w - \boldsymbol{\mu}_i) < 1. \qquad (5.17)$$
Otherwise, we can make piw outside the ellipsoid so that a watermark bit “0”
can be embedded, making piw satisfy the following formula,
$$(\mathbf{p}_i^w - \boldsymbol{\mu}_i)^{\mathrm{T}}\, U_i^{-1}\, (\mathbf{p}_i^w - \boldsymbol{\mu}_i) > 1. \qquad (5.18)$$
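A compact sketch of the first (parallel-plane) embedding method is given below, assuming the neighbor positions and normals of the selected vertex are available. The plane offset is taken as the root of the mean squared projection, a slight simplification of Eq.(5.14); the function names and the margin delta are illustrative only.

```python
import numpy as np

def embed_bit_parallel_planes(p_i, neighbors, neighbor_normals, bit, delta=1e-3):
    """Move p_i along the mean neighbor normal Q_i (Eq.(5.13)) so that its signed
    projection with respect to the neighbor centroid (Eq.(5.9)) lies beyond +e_i
    for bit 1 or beyond -e_i for bit 0, cf. Eqs.(5.15)-(5.16)."""
    mu = neighbors.mean(axis=0)                              # centroid of the connected vertices
    q = neighbor_normals.mean(axis=0)
    q = q / np.linalg.norm(q)                                # unit mean normal Q_i
    e = np.sqrt(np.mean(((neighbors - mu) @ q) ** 2))        # RMS projection, standing in for e_i
    target = e + delta if bit == 1 else -e - delta
    proj = (p_i - mu) @ q
    return p_i + (target - proj) * q                         # shift the vertex along Q_i

def extract_bit_parallel_planes(p_w, neighbors, neighbor_normals):
    """Decide the bit from the side of the centroid plane on which the vertex lies."""
    mu = neighbors.mean(axis=0)
    q = neighbor_normals.mean(axis=0)
    q = q / np.linalg.norm(q)
    return 1 if (p_w - mu) @ q > 0 else 0

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    neighbors = rng.normal(size=(6, 3))
    normals = np.tile([0.0, 0.0, 1.0], (6, 1)) + 0.05 * rng.normal(size=(6, 3))
    p = rng.normal(size=3)
    for bit in (0, 1):
        p_w = embed_bit_parallel_planes(p, neighbors, normals, bit)
        print("embedded", bit, "-> extracted", extract_bit_parallel_planes(p_w, neighbors, normals))
```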
Besides the above algorithms, Yeung and Yeo from Intel presented the first fragile 3D mesh watermarking algorithm for verification in 1999. The proposed algorithm can be used to verify whether or not a change to a 3D polygon mesh is authentic [29, 30]. As we know, in order to achieve this purpose, the embedded watermark should be very sensitive to even minor changes, so that any mesh change will be immediately detected, located and then presented in an intuitive way. The basic process is as follows. Firstly, the centroid μ_i of all the vertices connected to the vertex p_i is computed according to Eq.(5.9). Then the floating-point vector μ_i is converted to an integer vector t_i = (t_ix, t_iy, t_iz) using a certain function. Finally, another function is utilized to convert t_i = (t_ix, t_iy, t_iz) into two integers L_ix and L_iy, thus the mapping from the centroid to a 2D mesh is acquired, where (L_ix, L_iy) is the corresponding position in the 2D mesh. In fact, a 3D vertex coordinate can be converted into an integer using a certain function, where the integer can be regarded as a pixel value while (L_ix, L_iy) is the pixel's corresponding coordinate. As a result, the watermark can be embedded by slightly altering the vertex coordinates. The study of fragile watermarking is an important branch of watermarking and can be widely used in 3D model authentication and multi-level user management in collaborative design.
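Since the conversion functions used in [29, 30] are not specified above, the sketch below uses placeholder quantization and hashing functions merely to illustrate the verification side of such a fragile scheme: each vertex passes the check only if the bit derived from its neighbor centroid matches a reference watermark pattern at the mapped position. Embedding (not shown) would slightly perturb the vertex coordinates until the check passes for every vertex.

```python
import numpy as np

M = 64                                                  # size of a hypothetical binary watermark image
WATERMARK = (np.indices((M, M)).sum(axis=0) % 2).astype(np.uint8)   # e.g. a checkerboard pattern

def centroid_to_int(centroid, step=1e-3):
    """Hypothetical quantization: convert the floating-point centroid to an integer vector t_i."""
    return np.floor(np.asarray(centroid) / step).astype(np.int64)

def location_and_value(t):
    """Hypothetical mapping: derive a 2D position (Lx, Ly) and a 1-bit value from t_i."""
    lx = int((t[0] * 131 + t[1]) % M)
    ly = int((t[1] * 137 + t[2]) % M)
    value = int((t[0] + t[1] + t[2]) & 1)
    return lx, ly, value

def verify(vertices, adjacency):
    """Fragile check: a vertex is flagged if the bit derived from the centroid of its
    neighbors does not match the watermark image at the mapped position."""
    tampered = []
    for i, nbrs in adjacency.items():
        centroid = vertices[nbrs].mean(axis=0)
        lx, ly, value = location_and_value(centroid_to_int(centroid))
        if value != WATERMARK[lx, ly]:
            tampered.append(i)
    return tampered

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    verts = rng.normal(size=(50, 3))
    adjacency = {i: [(i + 1) % 50, (i + 2) % 50, (i - 1) % 50] for i in range(50)}
    # without any embedding step, roughly half of the vertices fail the check
    print("vertices failing the check:", len(verify(verts, adjacency)))
```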
A 3D mesh watermarking technique [43, 44] that utilizes the distances from the centroid to the vertices was proposed by Yu et al. from Northwestern Polytechnical University. The watermark embedding process is as follows.
Step 1: Input the watermark to be embedded and/or the secret key into the pseudo-randomizer, and the corresponding binary watermark sequence w = {w_0, w_1, …, w_{m-1}} is generated, where m is the length of the watermark sequence, w = G(K) represents the watermark generation algorithm and K is taken from a large enough set of keys.
Step 2: Use the function "Permute" to reorder the original vertex set P = {p_i}, i = 0, 1, 2, …, k−1, with the key as the parameter: P′ = Permute(P, K), where
k is the number of vertices of a 3D model, K is the secret key for reordering and
P′ = {p′_i} is the reordered vertex sequence of the 3D model.
Step 3: Select L×m vertices in order from the reordered vertices P′ = {p′_i} and divide them into m groups, i.e. P′ = {P′_0, P′_1, …, P′_{m−1}}, where P′_i = {p′_{i0}, p′_{i1}, …, p′_{i(L−1)}}, 0 ≤ i ≤ m−1, and L is the number of vertices in each group.
Step 4: Each group can be regarded as an embedding primitive P′_i and can be embedded with a watermark bit w_i. In [43], the watermark is embedded in the following manner:
$$\mathbf{L}_{ij}^{w} = \mathbf{L}_{ij} + \alpha\, w_i\, \mathbf{U}_{ij}, \qquad (5.19)$$
where L_{ij} denotes the vector from the center to the j-th vertex in the i-th group, L^w_{ij} represents the corresponding watermarked vector, α is the embedding weight, w_i is the i-th bit of the watermark sequence and U_{ij} is the unit vector of L_{ij}. To improve the transparency, the watermark can be embedded in the following manner:
$$\mathbf{L}_{ij}^{w} = \mathbf{L}_{ij} + \alpha\, \beta_{ij}\, w_i\, \mathbf{U}_{ij}, \qquad (5.20)$$
where α is the global embedding weight parameter that controls the overall energy of the embedded watermark, and β_{ij} is the local embedding weight parameter that makes the embedding process adaptive to the local characteristics of the 3D model. In [44], the watermark is embedded in the following manner:
$$\mathbf{L}_{ij}^{w} = \mathbf{L}_{ij} + \beta_{ij}(\alpha)\, w_i\, \mathbf{U}_{ij}, \qquad (5.21)$$
where β_{ij}(α) indicates that the local embedding weight is related to the global embedding weight α.
Step 5: Reorder the watermarked 3D model back to its original order.
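As a rough illustration of Steps 1-5, the numpy sketch below permutes the vertex indices with a keyed pseudo-random permutation, groups L vertices per watermark bit and displaces each grouped vertex along the unit vector from the model centre, following the additive rule of Eq.(5.19) reconstructed above (bits are mapped to ±1); the permutation and parameter choices are illustrative.

```python
import numpy as np

def embed_centroid_distance(vertices, watermark, key, L=10, alpha=0.01):
    """Sketch of Steps 1-5: reorder the vertices with a keyed permutation, take L*m
    of them, split them into m groups and move each grouped vertex along the unit
    vector from the model centre (L_ij^w = L_ij + alpha*w_i*U_ij, w_i -> +/-1)."""
    center = vertices.mean(axis=0)
    perm = np.random.default_rng(key).permutation(len(vertices))     # stands in for Permute(P, K)
    stego = vertices.copy()
    for i, bit in enumerate(watermark):                               # one bit per group
        sign = 1.0 if bit else -1.0
        for idx in perm[i * L:(i + 1) * L]:
            vec = stego[idx] - center                                 # vector L_ij
            stego[idx] += alpha * sign * vec / np.linalg.norm(vec)    # displace along U_ij
    return stego

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model = rng.normal(size=(500, 3))
    bits = rng.integers(0, 2, size=16)
    stego = embed_centroid_distance(model, bits, key=42)
    print("mean vertex displacement:", np.linalg.norm(stego - model, axis=1).mean())
```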
The corresponding detection method for the above-mentioned watermark embedding methods involves the original 3D model M, and the detailed procedure is as follows:
Step 1: Some attackers may use simple translation, rotation and scaling
operations to change the watermarked 3D model. Before the watermark extraction,
the attacked 3D model must be registered to its original position, direction and
scale. Usually, there is always a balance between computation complexity and
accuracy, which affects the speed and accuracy of watermark extraction. As a
result, we should make an appropriate trade-off between complexity and accuracy.
The registration process should be performed between the model M̂ to be detected and the original model M, because if the registration is performed between M̂ and the stego mesh M_w, some additional information may be introduced into M̂.
Step 2: Since some attacks may alter the mesh topology, such as simplification,
insection and remeshing, the watermark cannot be correctly extracted from the
attacked model through a non-blind watermark detection method. In this case,
resampling is required to recover the model with the original connectivity. The
resampling process is as follows: a line from the center of the original model M through the vertex p_i is drawn and intersected with M̂. If one or more intersection points exist, the one closest to p_i is taken as the match point p̂_i of p_i; otherwise p̂_i = p_i is taken.
Step 3: This process is the same as Steps 2 and 3 in the embedding algorithm: reorder M and M̂ and group them to get P′ = {P′_0, P′_1, …, P′_{m−1}} and P̂′ = {P̂′_0, P̂′_1, …, P̂′_{m−1}}.
Step 4: Regard the center of the original model as the center of the model to be detected. Compute the magnitude difference between the vector from the model center to the original vertices and the vector from the model center to the vertices to be detected in each group:
$$D_{ij} = L_{ij} - \hat{L}_{ij}, \qquad (5.22)$$
where L_{ij} is the magnitude of the vector from the center to the j-th vertex in the i-th group and L̂_{ij} is the corresponding vector magnitude for M̂.
Step 5: Sum the vector magnitude differences in each group:
$$D_i = \frac{1}{L}\sum_{j=0}^{L-1} D_{ij}. \qquad (5.23)$$
Each watermark bit ŵ_j is then extracted from D_i, and the similarity between the extracted sequence ŵ and the original watermark w is measured by the normalized correlation
$$\mathrm{Cor}(\hat{w}, w) = \frac{\sum_{j=0}^{m-1}(\hat{w}_j - \hat{w}_{\mathrm{ave}})(w_j - w_{\mathrm{ave}})}{\sqrt{\sum_{j=0}^{m-1}(\hat{w}_j - \hat{w}_{\mathrm{ave}})^2 \sum_{j=0}^{m-1}(w_j - w_{\mathrm{ave}})^2}}, \qquad (5.25)$$
where ŵ is the extracted watermark sequence, w is the original watermark sequence, ŵ_ave is the mean of ŵ, w_ave is the mean of w and m is the length of the watermark sequence.
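The corresponding detection can be sketched as follows (registration and resampling, Steps 1-2, are assumed to have been carried out already): the keyed grouping is recomputed, the radial length differences of Eqs.(5.22)-(5.23) are averaged per group and thresholded at zero to recover the bits, and the result is compared with the original watermark using the normalized correlation of Eq.(5.25). The zero threshold is an illustrative choice that pairs with the ±1 mapping used in the embedding sketch above.

```python
import numpy as np

def detect_centroid_distance(suspect, original, key, m, L=10):
    """Recompute the keyed grouping, average the radial length differences D_ij per
    group (Eqs.(5.22)-(5.23)) and threshold them at zero to obtain the bits."""
    center = original.mean(axis=0)
    perm = np.random.default_rng(key).permutation(len(original))
    extracted = np.zeros(m, dtype=int)
    for i in range(m):
        idx = perm[i * L:(i + 1) * L]
        d_ij = (np.linalg.norm(suspect[idx] - center, axis=1)
                - np.linalg.norm(original[idx] - center, axis=1))
        extracted[i] = 1 if d_ij.mean() > 0 else 0            # D_i > 0 -> bit "1"
    return extracted

def correlation(w_hat, w):
    """Normalized correlation of Eq.(5.25)."""
    a = np.asarray(w_hat, dtype=float); a -= a.mean()
    b = np.asarray(w, dtype=float); b -= b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = rng.normal(size=(500, 3))
    bits = rng.integers(0, 2, size=16)
    # build a synthetic stego model exactly as in the embedding sketch above
    center = original.mean(axis=0)
    perm = np.random.default_rng(42).permutation(len(original))
    suspect = original.copy()
    for i, bit in enumerate(bits):
        for idx in perm[i * 10:(i + 1) * 10]:
            vec = suspect[idx] - center
            suspect[idx] += 0.01 * (1.0 if bit else -1.0) * vec / np.linalg.norm(vec)
    w_hat = detect_centroid_distance(suspect, original, key=42, m=16)
    print("Cor =", correlation(w_hat, bits))
```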
The algorithms in [44] have the following characteristics: (1) They use the
overall geometric features as primitives; (2) They distribute the watermark
information throughout the model; (3) The watermark embedding strength is
adaptive to local geometric features. Experiments show that this watermarking
algorithm can resist ordinary attacks for a 3D model, such as simplification,
adding noise, insection and their combinations. In addition, a progressive
transmission method of 3D models is introduced in [45]. This literature has also
proposed a watermarking algorithm based on the distance from the vertices to the
mean of the base. This algorithm adopts the simple additive embedding
mechanism. Due to space limitations, it will not be illustrated here.
$$d = \frac{1}{k}\sum_{i=0}^{k-1}\left\|\mathbf{n}_i\right\|, \qquad (5.26)$$
and according to
$$\bar{n}_i = \operatorname{round}\!\left(\frac{c\,\|\mathbf{n}_i\|}{d}\right), \qquad (5.27)$$
we can convert each vector n_i to an integer n̄_i, where c is the primary parameter that
is a fixed real value. The value of n̄_i remains unchanged during geometry transformations of the 3D model. The watermark data are defined as a function f(v) on the sphere surface, e.g. f(v) = constant. Similarly, according to
$$w_i = \operatorname{round}\!\left(2^{b}\, f\!\left(\frac{\mathbf{n}_i}{\|\mathbf{n}_i\|}\right)\right), \qquad (5.28)$$
the value of f(v) can be converted to an integer w_i. From the binary representation of n̄_i, b bits can be selected to be replaced by the watermark data w_i (for each n̄_i, the embedding location is fixed), so the modified vector n_i^w is acquired. With the above formulae, the watermarked vertex p_i^w can be calculated according to n_i^w. The watermark extraction process is relatively simple, only requiring the calculation of n̂_i and the appropriate positions for extraction.
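The quantize-and-replace idea of Eqs.(5.26)-(5.28) can be sketched as follows. The code works directly on a set of vectors (the mapping back to watermarked vertices via Eq.(5.7) is not shown), writes the watermark value into the b least-significant bits of the quantized length (a fixed, illustrative embedding position) and, for simplicity, reuses the original quantization step d at extraction; the constants and the sample function f on the sphere are illustrative.

```python
import numpy as np

def embed_in_vector_lengths(vectors, f, b=4, c=1e6):
    """Quantize each vector length with the model-dependent step d (Eq.(5.26)-(5.27)),
    derive b watermark bits from a function f on the unit sphere (Eq.(5.28)), write
    them into the b low bits of the quantized length and rescale the vector."""
    lengths = np.linalg.norm(vectors, axis=1)
    d = lengths.mean()                                    # Eq.(5.26)
    n_int = np.round(c * lengths / d).astype(np.int64)    # Eq.(5.27)
    vals = np.array([f(v / l) for v, l in zip(vectors, lengths)])
    w = np.round((2 ** b) * vals).astype(np.int64) % (2 ** b)   # Eq.(5.28), clipped to b bits
    n_marked = (n_int & ~((1 << b) - 1)) | w              # replace the b low bits
    return vectors * (n_marked * d / (c * lengths))[:, None], d

def extract_from_vector_lengths(vectors, d, b=4, c=1e6):
    """Read the b low bits back from the re-quantized lengths."""
    lengths = np.linalg.norm(vectors, axis=1)
    n_int = np.round(c * lengths / d).astype(np.int64)
    return n_int & ((1 << b) - 1)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    vecs = rng.normal(size=(10, 3))
    f = lambda u: 0.25 * (u[2] + 1.0)                     # a simple function on the sphere
    marked, d = embed_in_vector_lengths(vecs, f)
    expected = np.round(16 * np.array([f(v / np.linalg.norm(v)) for v in vecs])).astype(int) % 16
    print("embedded :", expected)
    print("extracted:", extract_from_vector_lengths(marked, d))
```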
Fig. 5.3. Two-tuple {b/a, h/c}
Fig. 5.4. The 4 triangles in MEP
Fig. 5.5. Watermarked insections of the 3D model of Beethoven bust [19]. (a) With 4,889
triangles; (b) With 2,443 triangles; (c) With 1,192 triangles; (d) With 399 triangles. (1997,
Association for Computing Machinery, Inc. Reprinted by permission)
Fig. 5.6. A mesh model with a visible watermark [19] (1997, Association for Computing
Machinery, Inc. Reprinted by permission)
Fig. 5.7. Simplified stego mesh [19] (1997, Association for Computing Machinery, Inc.
Reprinted by permission)
First, a triangle strip peeling sequence is established based on a secret key, and the process is shown in Fig. 5.8. The initial triangle is determined by a specific geometric characteristic. The next triangle in the sequence is either the first candidate triangle (whose new entry edge is AC) or the second candidate triangle (whose new entry edge is BC), which is determined by the bits of the secret key. Here, the length of the secret key is allowed to be the same as the number of triangles. The path of the accessed triangles is called the "Stencil" in [49].
Fig. 5.8. Construction of the triangle strip peeling sequence (TSPS) [8]. (a) Two types of
triangle edges; (b) TSPS is gray and the embedded location is black (© 2003 IEEE)
The initial vertex and the initial spanning direction are given, and a vertex spanning tree Vt is sought over the triangle mesh. At a given vertex, the connecting edges are scanned counterclockwise until an edge is found that is not yet in Vt and is not connected to any vertex already scanned into Vt. If an edge satisfying the above conditions is found, it is appended to Vt. Then a certain edge is sought to serve as the initial edge, such that the volume of the enclosed tetrahedron is maximal. A triangle bounding edge (TBE) list is required to be constructed before Vt is converted into a triangle list, where the initial list consists of the edges of a series of vertices in Vt. The list can be constructed as follows: scan Vt from the root node and then span all the vertices, and scan all connected edges clockwise at each vertex. If the scanned edge is not in TBE, then append it. If the three edges of a triangle are found in TBE for the first time, and the triangle is not yet in the triangle sequence "Tris", then append the triangle to "Tris", as shown in Fig. 8 in [19]. Convert Tris into a tetrahedron sequence "Tets", and regard the first tetrahedron of Tets as the denominator. Converting "Tets" into a volume ratio sequence "Vrs", a data symbol can be embedded into each volume ratio by replacing the vertices of the numerator tetrahedrons. The embedded locations are depicted in Fig. 11 in [19], where the dark gray parts represent the embedded locations.
The watermark extraction process involves testing the candidate edges to find the proper initial edge using pre-embedded symbols, because, owing to factors such as noise, it is usually not accurate to determine the initial edge only according to the largest tetrahedron volume. The algorithm is highly robust to affine transforms (such as projection transformations), but is fragile to topology changes (such as remeshing and randomization of the vertex order) and geometry transformations. The stego mesh and the attacked stego mesh with an affine transform and an insection are rendered in Fig. 5.11. Simulation results show that the TVR algorithm can resist these two attacks.
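The core of the TVR idea, quantizing a tetrahedron-to-reference volume ratio so that its index encodes a data symbol, can be sketched as below. The spanning-tree traversal and the construction of the Tets and Vrs sequences are omitted; the quantization step, alphabet size and the choice of moving the apex along its altitude are illustrative assumptions.

```python
import numpy as np

def tet_volume(a, b, c, d):
    """Signed volume of the tetrahedron (a, b, c, d)."""
    return np.dot(np.cross(b - a, c - a), d - a) / 6.0

def embed_symbol_in_ratio(ref_tet, tet, symbol, alphabet=4, step=0.01):
    """Make the volume ratio |V|/|V_ref| fall into a quantization cell whose index is
    congruent to `symbol` modulo the alphabet size, by moving the apex of `tet`
    along its altitude over the base triangle."""
    v_ref = abs(tet_volume(*ref_tet))
    a, b, c, apex = tet
    ratio = abs(tet_volume(a, b, c, apex)) / v_ref
    k = int(round(ratio / step))
    k += (symbol - k) % alphabet                     # nearest index >= k encoding `symbol`
    if k == 0:                                       # avoid collapsing the tetrahedron
        k = alphabet
    target = k * step
    centroid = (a + b + c) / 3.0
    n = np.cross(b - a, c - a)
    n = n / np.linalg.norm(n)                        # unit normal of the base triangle
    height = np.dot(apex - centroid, n)              # signed apex height over the base
    new_apex = apex + ((target / ratio) - 1.0) * height * n   # scaling height scales the volume
    return (a, b, c, new_apex)

def extract_symbol(ref_tet, tet, alphabet=4, step=0.01):
    v_ref = abs(tet_volume(*ref_tet))
    ratio = abs(tet_volume(*tet)) / v_ref
    return int(round(ratio / step)) % alphabet

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    ref = rng.normal(size=(4, 3))
    tet = tuple(rng.normal(size=(4, 3)))
    marked = embed_symbol_in_ratio(ref, tet, symbol=3)
    print("extracted:", extract_symbol(ref, marked))
```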
In addition to TVR, another mesh watermarking algorithm based on Affine
Invariant Embedding (AIE) was proposed by Benedens and Busch [50, 51].
Inspired by TVR, AIE uses tetrahedrons as embedding primitives as follows: A
triangle with vertices V = {v1, v2, v3} is selected and then an edge with an end in V
is selected. The other end of the selected edge is denoted as v4, where the distance
from v4 to {v1, v2, v3} is large enough. Thus two initial triangles {v1, v2, v3} and {v2,
v3, v4} are acquired, as shown in Fig. 5.12. Two sets G1 and G2 are constructed: G1
consists of all vertices that only have one neighboring vertex in V = {v1, v2, v3, v4},
i.e. a, b, c, d, e in Fig. 5.12; G2 is comprised of all vertices that are neighboring to
the initial triangles through an edge and are located in a certain triangle, i.e. A, B, C, D in Fig. 5.12. A set G is constructed based on G1 and G2: if |G1| < 4 (meaning that the cardinality is less than 4) and |G2| < 4, then set G = G1 ∪ G2; otherwise, let G be the smaller of the sets G_i (i = 1, 2) with |G_i| ≥ 4. If |G| < 4, then abandon this primitive. The case of G = G2 is shown in Fig. 5.12. Finally, divide G into 4 subsets g1, g2, g3, g4 (the numbers of elements should be similar) and record the watermark information and the control information in the vertices that form g1, g2, g3, g4, as shown in Table 5.3, where the first 2 bits are the flag for the groups, I5-I0 are index bits, and D9-D0 are embedded data bits.
Fig. 5.11. Results of watermarking and attacks by TVR [19]. (a) Cover model; (b) Stego model;
(c) Affine transform; (d) Insection. (1997, Association for Computing Machinery, Inc.
Reprinted by permission)
Fig. 5.13. The selection of the triangle strip according to the watermark bits
the centroids of the bins, i.e. the average normal vectors. n bin centroids should be moved in order to embed a watermark with n bits. The displacement is realized by moving mesh vertices, which changes the normal vectors of the triangles and hence the centroids of the corresponding bins. Simulation results
show that the algorithm is robust to vertex randomization, remeshing and
simplification. The embedded watermark can still survive when the stego mesh is
simplified to 36% of the cover mesh. In addition, another mesh watermarking
algorithm based on alteration of surface normal vectors is available in [53]. Due to
space limitations, the details are not elaborated here.
Apart from the above algorithms, several algorithms [32] based on the redundant
data in a polygon mesh have been proposed by Ichikawa et al. from Japan’s
Toyohashi University in 2002. The algorithms, which maintain the original
geometry and topology, are as follows: (1) Full permutation scheme (FPS) and
partial permutation scheme (PPS) that permute the order of mesh vertices and
polygons; (2) Polygon vertex rotation scheme (PVR), packet PVR, full PVR
(FPVR) and partial PVR (PPVR) that embed watermarks through rotating vertices.
Due to the low embedding capacity of these methods, they are only supplementary
methods to those methods based on alteration of geometry and topology, and will
not be detailed here.
The basic flows of the watermark embedding and extraction processes are outlined below.
The detailed watermark embedding process is shown in Fig. 5.14. Firstly, we adopt a non-adaptive watermark generation algorithm operating on the copyright information. The copyright information and the secret key are input to a pseudo-random sequence generator G, and the output is the permuted binary watermark:
$$W = G(m, K), \qquad (5.29)$$
where m denotes the original copyright information and K is the secret key.
Secondly, we permute the order of the vertices of the original model with the secret key:
$$V_p = P(V_o, K), \qquad (5.31)$$
where $V_o = \{v_i^{(o)}\}$ and $V_p = \{v_i^{(p)}\}$ are the sets of vertices of the original model and the permuted model respectively, $0 \le i \le L-1$ and L is the number of vertices.
Thirdly, we choose N×Q vertices from the vertex set $V_p$ of the permuted model and then divide these vertices into Q subsets $V_l^{(p)}$, $0 \le l \le Q-1$, as follows:
$$V_l^{(p)} = \{v_{lj}^{(p)}\},\quad 0 \le l \le Q-1,\ 0 \le j \le N-1, \qquad (5.32)$$
where N is the number of vertices in each section, which equals the length of the
watermark sequence.
Fourthly, we embed one watermark bit into each section by the following formula:
$$\mathbf{e}_{lj}^{(w)} = \mathbf{e}_{lj}^{(p)} + \beta\, \alpha_{lj}\, w_l\, \mathbf{n}_{lj}^{(p)}, \qquad (5.33)$$
where $\mathbf{e}_{lj}^{(p)}$ denotes the original vector from the centroid of the model to the j-th vertex of the l-th section, $\mathbf{e}_{lj}^{(w)}$ denotes the watermarked vector from the centroid of the model to the j-th vertex of the l-th section, β is the watermarking coefficient that controls the global energy of the embedded watermark sequence, $w_l$ is
the l-th bit of the watermark sequence, $\alpha_{lj}$ is the parameter controlling the local watermarking weight, which is adaptive to the local geometry of the model and will be detailed in Subsection 5.5.2, and $\mathbf{n}_{lj}^{(p)}$ is the direction along which the watermark bit is embedded at the j-th vertex of the l-th section, which will also be detailed in Subsection 5.5.2. The same watermark sequence is embedded into each
section repeatedly in order to ensure robustness to local deformation. When a
vertex is embedded with a watermark bit, its neighboring vertices cannot be used
as embedding locations.
Finally, we apply the inverse permutation to the order of the watermarked vertices using the original key K.
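A compact numpy sketch of the embedding, together with the corresponding correlation-style detection, is given below. The keyed permutation, the uniform local weights and the radial embedding directions are illustrative stand-ins for Permute(·), α_lj and n_lj^(p); following the detection formulas Eqs.(5.34)-(5.35), the j-th bit of the sequence is placed at the j-th vertex of every section.

```python
import numpy as np

def split_sections(n_vertices, key, N, Q):
    """Keyed permutation and grouping into Q sections of N vertices (Eqs.(5.31)-(5.32))."""
    perm = np.random.default_rng(key).permutation(n_vertices)
    return perm[:N * Q].reshape(Q, N)

def embed(vertices, watermark, key, Q, alpha, directions, beta=0.01):
    """Sketch of Eq.(5.33): displace each selected vertex along its embedding direction,
    scaled by the global coefficient beta, the per-vertex local weight alpha and the
    watermark bit (mapped to +/-1). The same N-bit sequence is repeated in every section."""
    N = len(watermark)
    w = 2.0 * np.asarray(watermark) - 1.0
    sections = split_sections(len(vertices), key, N, Q)
    stego = vertices.copy()
    for l in range(Q):
        for j, v in enumerate(sections[l]):
            stego[v] += beta * alpha[v] * w[j] * directions[v]
    return stego

def detect(suspect, original, key, N, Q, directions):
    """Sketch of Eqs.(5.34)-(5.35): residual vectors per vertex, projected onto the
    embedding directions and accumulated over the Q sections; the sign of s_j gives bit j."""
    sections = split_sections(len(original), key, N, Q)
    s = np.zeros(N)
    for l in range(Q):
        for j, v in enumerate(sections[l]):
            r = suspect[v] - original[v]                      # r_lj, Eq.(5.34)
            s[j] += r @ directions[v]                         # Eq.(5.35)
    return (s > 0).astype(int)

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    verts = rng.normal(size=(2000, 3))
    dirs = verts / np.linalg.norm(verts, axis=1, keepdims=True)   # radial directions as a stand-in
    alpha = np.ones(len(verts))                                   # uniform local weights for the demo
    bits = rng.integers(0, 2, size=32)
    stego = embed(verts, bits, key=7, Q=20, alpha=alpha, directions=dirs)
    noisy = stego + 0.002 * rng.normal(size=stego.shape)          # mild noise attack
    errors = np.sum(detect(noisy, verts, key=7, N=32, Q=20, directions=dirs) != bits)
    print("bit errors:", int(errors))
```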
The detailed watermark extraction procedure is shown in Fig. 5.15. Note that an
attack might change the 3D model via similarity transforms. We extract the
potential watermark as follows.
Before extracting watermarks, we should firstly recover the object to its
original location and scale via model registration. The annealing algorithm in [57]
is adopted in our work. Secondly, we use the re-sampling scheme proposed in [58] in case attacks that change the mesh connectivity have been applied to the watermarked mesh. Thirdly, for both the original model and the model to be detected, we permute and divide their vertices to get their own Q sections, as described by Eqs.(5.31) and (5.32) for the embedding procedure. We
then compute the residual vectors between the vectors that connect the origin with
the vertices in each section of the original model and those of the model to be
detected as follows:
$$\mathbf{r}_{lj} = \mathbf{e}_{lj}^{(d)} - \mathbf{e}_{lj}^{(p)}, \qquad (5.34)$$
where $\mathbf{e}_{lj}^{(p)}$ and $\mathbf{e}_{lj}^{(d)}$ are the vector from the origin to the j-th vertex in the l-th section of the original model and the vector from the origin to the j-th vertex in the
l-th section of the model to be detected, respectively. Fourthly, we sum up the dot
products of the residual vectors and their corresponding watermarking directions
as follows:
$$s_j = \sum_{l=0}^{Q-1}\mathbf{r}_{lj}\cdot\mathbf{n}_{lj}^{(p)}, \qquad (5.35)$$
Finally, the extracted watermark sequence $\hat{w}^{(d)}$ is compared with the original watermark w using the normalized correlation
$$c(\hat{w}^{(d)}, w) = \frac{\sum_{i=0}^{N-1}\left(\hat{w}_i^{(d)} - \bar{w}^{(d)}\right)(w_i - \bar{w})}{\sqrt{\sum_{i=0}^{N-1}\left(\hat{w}_i^{(d)} - \bar{w}^{(d)}\right)^2 \sum_{i=0}^{N-1}(w_i - \bar{w})^2}}. \qquad (5.36)$$
For a vertex $v_i$, the distance to each of its neighboring vertices $v_j \in N(v_i)$ is
$$d_{ji} = \left\|\mathbf{v}_j - \mathbf{v}_i\right\|_2. \qquad (5.37)$$
Regard the vertex vi as a node in a circuit, the distances between it and its
neighboring vertices as impedances between nodes vi and its neighboring vertices,
and the parallel connection impedance between v_i and its neighboring vertices as the watermark embedding weight of the vertex v_i. The computation formula is
defined as follows:
$$wt_i = 1\bigg/\sum_{v_j \in N(v_i)}\frac{1}{d_{ji}}. \qquad (5.38)$$
$$WT_i = wt_i \cdot q\sin\!\left(\frac{\pi}{2q}\right)\cdot\sin\!\left(\frac{\theta_i}{2}\right), \qquad (5.39)$$
where q denotes the number of A’s neighboring vertices. The first item of the right
side of the above equation makes sure that the embedding weight is mainly
determined by the minimal length between A and A’s neighbors, the second one
shows how the number of A’s neighbors affects the embedding weight, and the last
one is the effect of angles between the neighboring edges connecting with A to the
embedding weight.
In our algorithm, a vertex and its neighbors can be regarded as a primitive
where the watermark is embedded without computing the watermark embedding
weight according to each triangle surface connected to the vertex. Thus, the algorithm can make full use of the local geometry of the model while preserving the imperceptivity of the watermark, and it is computationally timesaving, especially in the case where the number of surfaces of the model is considerable.
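The parallel-impedance weight of Eq.(5.38) is cheap to compute per vertex, as the following small sketch shows; the example patches are illustrative.

```python
import numpy as np

def local_weight(vertex, neighbors):
    """Local embedding weight of Eq.(5.38): treat the edge lengths d_ji to the
    neighboring vertices as impedances and take their parallel combination, so the
    weight is dominated by the shortest incident edge."""
    d = np.linalg.norm(np.asarray(neighbors) - np.asarray(vertex), axis=1)   # d_ji, Eq.(5.37)
    return 1.0 / np.sum(1.0 / d)

if __name__ == "__main__":
    v = np.zeros(3)
    flat_patch = [(1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0)]
    tight_patch = [(0.1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0)]
    print(local_weight(v, flat_patch))    # larger weight: all incident edges are long
    print(local_weight(v, tight_patch))   # smaller weight: one short edge dominates
```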
Fig.5.16. Two example cases of computing the locally adaptive watermarking strength with the
local geometry. (a) Point A has 3 neighbors; (b) Point A has 6 neighbors
The local strength for embedding watermarks has been ascertained in the previous
part. Now the direction in which the watermark should be embedded is to be
confirmed. Once both parameters are acquired, the watermarking scheme is fixed. By optimizing the watermark embedding direction, more of the watermark energy can be embedded while remaining imperceptible; namely, the visual change in the model is relatively smaller if a watermark with fixed energy is embedded along the optimized direction. Enhancing the watermark strength and minimizing the visual change in the model thus complement each other.
In most of the previous literature on 3D model watermarking techniques, the
watermark is embedded along the vector that links the model centroid to a
vertex, whose length is the embedding primitive. Though the primitive is a
global geometry feature, it may not allow the maximum possible watermark energy
to be embedded. A rather novel method for ascertaining the watermark embedding
direction is proposed here: the locally adaptive watermark embedding direction
not only keeps a global geometry feature as the primitive to be embedded with
the watermark, but also ensures that more watermark energy can be embedded under
the precondition of imperceptibility.
The watermark energy that can be embedded depends on the watermark embedding
direction, under the precondition that the local geometry feature and the visual
characteristic of the model are fixed. As shown in [59] and the example in Fig.
5.17, if the dot products of the unit vector of watermark embedding directions and
normalized normals of triangle surfaces connecting to the vertex increase, the
watermark energy that can be embedded decreases. The watermark energy that can
be embedded is determined by the minimum value of the dot products to satisfy
imperceptivity. Thus, the watermark energy that can be embedded is determined
by
$$\lambda = \max_i\{\,p_i \cdot n\,\}, \qquad (5.40)$$

$$v = \sum_{i=1}^{q} p_i \Big/ q. \qquad (5.41)$$
Let the angle between each of the q triangle surfaces and the underside of the
polyhedron be

$$\theta = \arccos\Bigl(S \Big/ \sum_{i=1}^{q} s_i\Bigr), \qquad \theta \geq 0, \qquad (5.42)$$

where S equals the area of the polyhedron underside, s_i denotes the area of a
triangle surface connecting to A, i = 1, 2, ..., q, and, as a result, the normals
of the surfaces are as follows:

$$p_i = \Bigl(\cos\Bigl[\frac{2\pi(i-1)}{q}\Bigr]\sin\theta,\ \sin\Bigl[\frac{2\pi(i-1)}{q}\Bigr]\sin\theta,\ \cos\theta\Bigr). \qquad (5.43)$$

$$\lambda_1 = \cos\theta. \qquad (5.44)$$
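As a quick numerical illustration of Eqs.(5.43) and (5.44), and assuming the reconstruction of Eq.(5.43) given above, the following Python snippet builds the q surface normals for a given tilt angle and verifies that their dot products with the average normal (the Z axis) all equal cos θ; the values of q and θ are arbitrary.

```python
import numpy as np

q, theta = 6, np.deg2rad(40.0)                      # number of faces around A, tilt angle
i = np.arange(1, q + 1)
p = np.column_stack([np.cos(2*np.pi*(i-1)/q) * np.sin(theta),
                     np.sin(2*np.pi*(i-1)/q) * np.sin(theta),
                     np.full(q, np.cos(theta))])    # surface normals, Eq.(5.43)
n = np.array([0.0, 0.0, 1.0])                       # direction along the average normal
print(p @ n)                                        # every dot product equals cos(theta), Eq.(5.44)
```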
Let the unit vector from the model centroid to A be u = (x, y, z), where
$$x^2 + y^2 + z^2 = 1, \qquad z \geq 0. \qquad (5.45)$$
If n is chosen as u, then

$$\lambda_2 = \max_i\{\,p_i \cdot u\,\}. \qquad (5.46)$$

Due to the rotation symmetry, we can let $\lambda_2 = p_1 \cdot u = x\sin\theta + z\cos\theta$, and then we have

$$p_1 \cdot u \geq p_k \cdot u, \qquad k = 2, 3, \ldots, q, \qquad (5.47)$$

$$x\Bigl(1 - \cos\frac{2\pi}{q}\Bigr) - y\sin\frac{2\pi}{q} \geq 0. \qquad (5.48)$$

From the restriction conditions Eqs.(5.45) and (5.48), it can be deduced that

$$\Bigl[\Bigl(\frac{1 - \cos\frac{2\pi}{q}}{\sin\frac{2\pi}{q}}\Bigr)^{2} + 1\Bigr]\,x^{2} = 1 - z^{2}, \qquad (5.49)$$

$$\lambda_2 = \frac{\sqrt{1 - z^{2}}\,\sin\theta}{\sqrt{\Bigl(\frac{1 - \cos\frac{2\pi}{q}}{\sin\frac{2\pi}{q}}\Bigr)^{2} + 1}} + z\cos\theta. \qquad (5.50)$$
Fig. 5.18. Face models and the watermark embedded. (a) Original face model; (b)
Watermarked model by Algorithm 1; (c) Watermarked model by Algorithm 2; (d) Copyright
information
To test the robustness against noise attacks, we add a noise vector to each
vertex, with noise amplitudes of 0.5%, 1.2% and 3.0%, respectively, of the
length of the longest vector extending from the model centroid to a vertex. From
Fig. 5.19 it can be seen that when the amplitude of the noise is 3.0% of the
longest vector, the model is changed greatly. However, Table 5.4 shows that the
watermark correlation of Algorithm 1 is still 0.77, which is better than that of
Algorithm 2.
Fig. 5.19. Noise attacks on the watermarked model with different noise amplitudes. (a) 0.5%;
(b) 1.2%; (c) 3.0%
The experimental results in Table 5.9 show high robustness of Algorithm 1, even if
20% of vertices are removed.
It can be seen from Table 5.10 that Algorithm 1 is highly robust against
insection operations. Even if only 50% of the vertices are left, the correlation
value is still around 0.60.
Two different watermarks can be embedded via our algorithm by using two
different secret keys. The dual-watermarked face model is shown in Fig. 5.20.
Table 5.11 lists the correlation value corresponding to each watermark. It can
be seen from the table that each watermark is well extracted via Algorithm 1.
To test the robustness of our technique against combined attacks, the face
model is subjected to a combination of simplification, insection, additive
noise, translation, rotation and uniform scaling. Re-sampling operations are
applied before the watermark is extracted. Experimental results are shown in
Table 5.12: Algorithm 1 remains highly robust against these combined attacks,
while the watermark cannot be extracted via Algorithm 2.
From all the above experiments we can conclude that, in comparison with
Algorithm 2, the proposed watermarking technique is highly robust against many
common attacks imposed on 3D mesh models. The experimental results of Algorithm
1 and Algorithm 2 against simplification and insection attacks are nearly the
same because, under such attacks, vertices are removed together with some
watermark information, while the remaining watermark information can be entirely
extracted.
5.5.4 Conclusions
domain rather than in the spatial domain to achieve higher robustness. Since a
watermark is embedded in the crucial position of the carrier in spectral domain
based watermarking algorithms, the embedded watermark can resist attacks such
as simplification. Most of the algorithms with high robustness are in the spectral
domain. The principle of spectral domain based watermarking is to analyze the
mesh spectrum, which can be obtained from the mesh topology using graph theory
[60]. Currently, there is relatively little literature on transform-domain 3D
model watermarking algorithms; the representative schemes are reviewed in this
section as follows.
Fig. 5.21. The watermark embedding process [17] (With permission of ASME)
$$H = \begin{bmatrix}
\displaystyle\sum_{i=0}^{k-1} x_i'^{\,2} & \displaystyle\sum_{i=0}^{k-1} y_i' x_i' & \displaystyle\sum_{i=0}^{k-1} z_i' x_i' \\[2mm]
\displaystyle\sum_{i=0}^{k-1} x_i' y_i' & \displaystyle\sum_{i=0}^{k-1} y_i'^{\,2} & \displaystyle\sum_{i=0}^{k-1} z_i' y_i' \\[2mm]
\displaystyle\sum_{i=0}^{k-1} x_i' z_i' & \displaystyle\sum_{i=0}^{k-1} y_i' z_i' & \displaystyle\sum_{i=0}^{k-1} z_i'^{\,2}
\end{bmatrix}. \qquad (5.51)$$
(4) Model rotation. Rotate the model so that the principal eigenvector T of H
is aligned with the Z axis, so that rotation invariance is achieved.
(5) Transform the mesh into spherical coordinates, in other words, represent
each vertex p_i'' by the coordinates (r_i, θ_i, φ_i). The watermark is embedded in
the r_i component, so scaling invariance is also achieved.
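As an illustration of Eq.(5.51) and steps (4) and (5), the following NumPy sketch computes H from vertices assumed to be already translated so that the centroid lies at the origin, rotates the principal eigenvector onto the Z axis, and converts the vertices to spherical coordinates. The function name and the Rodrigues-rotation construction are our own choices, not part of the original algorithm.

```python
import numpy as np

def align_and_to_spherical(V):
    """Pose normalization sketch: build H of Eq.(5.51) from centered vertices,
    rotate the principal eigenvector onto the Z axis, and return the spherical
    coordinates (r_i, theta_i, phi_i) in which r_i would carry the watermark."""
    Vc = V - V.mean(axis=0)                      # translate the centroid to the origin
    H = Vc.T @ Vc                                # 3x3 matrix of Eq.(5.51)
    eigval, eigvec = np.linalg.eigh(H)
    t = eigvec[:, np.argmax(eigval)]             # principal eigenvector
    z = np.array([0.0, 0.0, 1.0])
    v, c = np.cross(t, z), float(t @ z)
    K = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    R = np.eye(3) + K + K @ K / (1.0 + c)        # rotation taking t to z (requires c != -1)
    P = Vc @ R.T
    r = np.linalg.norm(P, axis=1)
    theta = np.arccos(np.clip(P[:, 2] / np.maximum(r, 1e-12), -1, 1))
    phi = np.arctan2(P[:, 1], P[:, 0])
    return r, theta, phi
```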
$$r_i^{w} = \begin{cases} r_i, & w_i = 0; \\ g_1(r_i, n_i), & w_i = 1; \\ g_2(r_i, n_i), & w_i = -1, \end{cases} \qquad (5.52)$$
where n_i denotes the function value determined by the neighborhood of r_i, and
g_1(r_i, n_i) and g_2(r_i, n_i) are functions for embedding:

$$g_1(r_i, n_i) = n_i + \alpha_1 r_i, \qquad g_2(r_i, n_i) = n_i + \alpha_2 r_i, \qquad (5.53)$$
where α_1 > 0 and α_2 < 0 are the embedding parameters. Accordingly, the detection
formula is easily designed:

$$\hat{w}_i = \begin{cases} 1, & \hat{r}_i > \hat{n}_i; \\ -1, & \hat{r}_i \leq \hat{n}_i. \end{cases} \qquad (5.54)$$
Yin et al. from the CAD & CG State Key Laboratory of Zhejiang University
addressed two difficulties in mesh watermarking, namely mesh decomposition and
topology recovery from the attacked mesh, by constructing a Burt-Adelson
pyramid using a relaxation operator and embedding the watermark in the final
coarsest approximation mesh [14]. This algorithm is integrated with the
multi-resolution mesh processing toolbox of Guskov, and can embed watermarks
in the low spectral coefficients without extra data structures or complex
computation. In addition, the embedded watermark can survive the operations in
the mesh processing toolbox. The mesh resampling algorithm described is simple
but efficient, which enables watermark detection on simplified meshes and other
meshes with topology changes. In this subsection, the relaxation operator and
the Burt-Adelson pyramid are first introduced, and then the embedding algorithm,
followed by the detection algorithm, is given.
$$R p_i = \sum_{j \in V_2(i)} W_{i,j}\, p_j\,, \qquad (5.55)$$
According to the specific connectivity in Fig. 5.23, c_{e,j} has the following 4
choices, where L_e is the length of the shared edge e, A represents the signed
area of a triangle, and A[s, l_2, l_1] and A[j, l_1, l_2] are the areas of the rotated
triangles sl_2l_1 and jl_1l_2, respectively.
A pure progressive mesh method only removes vertices, leaving the coordinates
of the other vertices unchanged, while in the pyramid algorithms the coordinates
of the remaining vertices may differ from their counterparts in the finer mesh,
so that differences between levels come into being. Here the new coordinates of
the remaining vertices are denoted as q_j^n, and the differences between levels
are represented as d_j^n, which is also called the detail information. The
detailed construction of the BA pyramid is illustrated in Fig. 5.24. The mesh
sequence (P^n, C^n) can be constructed starting from P^N = P, 1 ≤ n ≤ N. There are
4 steps to construct P^{n-1} from P^n (i.e., removing vertex n) as follows:
(1) Pre-smoothing. Update the coordinates of the 1-ring vertex neighborhood
$j \in V_1^n(n)$ of vertex $n$: $p_j^{n-1} = \sum_{k \in V_2^n(j)} w_{j,k}^n\, p_k^n$; the other vertices
$j \in V_1^{n-1} \setminus V_1^n(n)$ remain unchanged.

$$q_n^n = \sum_{j \in V_2^n(n)} w_{n,j}^n\, p_j^{n-1}. \qquad (5.58)$$

$$\forall j \in V_1^n(n):\quad q_j^n = \sum_{k \in V_2^n(j)\setminus\{n\}} w_{j,k}^n\, p_k^{n-1} + w_{j,n}^n\, q_n^n. \qquad (5.59)$$
(4) Computation of details. Compute the details in the local structure F^{n-1} for
the vertex n and its neighborhood as follows:

$$\forall j \in V_1^n(n)\cup\{n\}:\quad d_j^n = F_j^{\,n-1}\bigl(p_j^n - q_j^n\bigr), \qquad (5.60)$$

where Q^n = { q_j^n } and D^n = { d_j^n }.
Fig. 5.24. Construction of the BA pyramid: P^n is pre-smoothed to give P^{n-1};
subdivision of P^{n-1} yields Q^n; the details D^n relate Q^n to P^n in the local
frames F^{n-1}
In the reconstruction of a lower (finer) level of the pyramid from the upper
(coarser) level, Q^n is first acquired by subdivision using the vertices of
P^{n-1}, and the details D^n are then added so that P^n is recovered. At the same
time, the pyramid data information is recorded in a proper data structure, such
as the half-edge collapse sequence, the relaxation operator sequence W^n and the
details sequence D^n, which are all necessary for mesh multi-resolution
processing as well as mesh watermark embedding and detection.
From the above pyramid structure construction process we can see that the
coarser mesh in an upper level can be regarded as the low-frequency coefficients
of the finer mesh in a lower level. From the point of view of signal processing, a
vertex of a coarser mesh is a smoothed, downsampled version of the finer mesh
and corresponds to the low-frequency components. In the construction process,
the most significant
features are maintained while the details are abandoned. As a result, the process of
embedding the watermark in a coarse mesh is analogous to watermarking in the
low-frequency coefficients in still images.
A bipolar sequence w = {w_1, w_2, ..., w_m} is used as the watermark and the
embedding process is as follows:
(1) Construct a BA pyramid from the original mesh M; an appropriate level of
coarse mesh M_c is the embedding object.
(2) Select [m/3] vertices p_i randomly or according to some rules from M_c, i = 1,
2, ..., [m/3]. Compute the minimum length of the 1-ring edge neighborhood of p_i:
lm_i = min{length(e) | e ∈ E_1(i)}; then the watermark embedding equations are as
follows:
$$\begin{cases} \hat{w}_{3i-2} = \operatorname{sgn}\bigl(\hat{p}_{ix} - p_{ix}\bigr), \\ \hat{w}_{3i-1} = \operatorname{sgn}\bigl(\hat{p}_{iy} - p_{iy}\bigr), \\ \hat{w}_{3i} = \operatorname{sgn}\bigl(\hat{p}_{iz} - p_{iz}\bigr), \end{cases} \qquad (5.62)$$
$$\{i^*\} = \{\, j \mid p_j \in P,\ (i, j) \in E \,\}. \qquad (5.63)$$

Define d_i as the degree of p_i, i.e. d_i = |{i*}|. Thus the k×k Laplacian matrix L
defined by Taubin [69] is as follows:

$$L_{ij} = \begin{cases} 1, & i = j; \\ -1/d_i, & j \in \{i^*\} \text{ and } d_i \neq 0; \\ 0, & \text{otherwise}. \end{cases} \qquad (5.64)$$
$$\begin{bmatrix} e_0 & & & 0 \\ & \ddots & & \\ & & e_i & \\ 0 & & & e_{k-1} \end{bmatrix} = B^{-1} L B. \qquad (5.65)$$

Then we can perform the orthogonal transform on the three k-dimensional
coordinate vectors X, Y and Z, and thus the so-called spectrum or
pseudo-frequency vectors O, Q and R can be derived:

$$O = BX, \qquad Q = BY, \qquad R = BZ, \qquad (5.66)$$
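A minimal NumPy sketch of Eqs.(5.64)-(5.66) is given below. The adjacency-list input and the ordering of the eigenvectors by increasing eigenvalue are our own assumptions, and, following Eq.(5.66) literally, the coordinate vectors are multiplied by B.

```python
import numpy as np

def mesh_spectrum(V, adjacency):
    """Sketch of Eqs.(5.64)-(5.66): build Taubin's k x k Laplacian L from the
    vertex adjacency, diagonalize it to obtain B, and project the coordinate
    vectors X, Y, Z onto the spectral basis.

    V         : (k, 3) array of vertex coordinates.
    adjacency : list where adjacency[i] holds the indices {i*} of the vertices
                adjacent to vertex i.
    """
    k = len(V)
    L = np.eye(k)
    for i, nbrs in enumerate(adjacency):
        if nbrs:
            L[i, nbrs] = -1.0 / len(nbrs)          # Eq.(5.64)
    eigval, B = np.linalg.eig(L)                    # columns of B diagonalize L, Eq.(5.65)
    order = np.argsort(eigval.real)                 # low pseudo-frequencies first
    B = B[:, order].real
    O, Q, R = B @ V[:, 0], B @ V[:, 1], B @ V[:, 2] # spectral vectors, following Eq.(5.66)
    return O, Q, R, B
```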
$$S_i = |O_i|^2 + |Q_i|^2 + |R_i|^2, \qquad 0 \leq i \leq k-1, \qquad (5.68)$$
where
otherwise, if the watermark bit is "1", then alter C_inter to make it fall in the
interval W_1. Let C_mean = 0.5(C_min + C_max); then the embedding can be formulated as

$$C_{inter}^{w} = \begin{cases} C_{inter} - \dfrac{|C_{inter} - C_{mean}|}{m}, & w = 0; \\[2mm] C_{inter} + \dfrac{|C_{inter} - C_{mean}|}{m}, & w = 1, \end{cases} \qquad (5.71)$$
where the parameter m is used to control the trade-off between the robustness and
imperceptivity, and is set to be 10 in [21]. The watermark extraction is simple and
blind, only requiring judging whether or not Ĉinter falls in the interval W0.
The above-mentioned algorithms are all designed for 3D polygon mesh models. In
fact, not all 3D models are represented by polygons. As a result, watermarking
algorithms for other types of 3D models are also available. Due to space
limitations, they are briefly introduced here.
elements is limited to be less than that of the control points. Then three 2D
virtual images are extracted, whose pixels are the distances from the sample
points to the x, y and z planes, respectively. The watermark is embedded into
these 2D images, which leads to the modification of the control points of the
NURBS model. As a result, the original model is changed by the watermark data
only in proportion to the quantity of embedded data, but the data size of the
NURBS model is preserved because there is no change in the number of knots and
control points. For the
extraction of embedded information, modified virtual sample points are first
acquired by the matrix operation of basis functions in accordance with the {u, v}
sequence. Even if a third party has the original NURBS model, the embedded
information cannot be acquired without the {u, v} sequence as a key, which is a
good property for steganography. The second algorithm is suitable for robust
watermarking. This algorithm also samples the 3D virtual model. But the
difference from the steganography algorithm is that the number of sampled points
is not limited by the number of control points of the original NURBS model.
Instead, the sequence {u, v} is chosen so that the sampling interval in the physical
space is kept constant. This makes the model robust against attacks on knot
vectors, such as knot insertion, removal and so forth. The procedure for making
2D virtual images is the same as for the steganography algorithm. Then, the
watermarking algorithms for 2D images are applied to these virtual images and a
new NURBS model is made by the approximation of watermarked sample points.
The watermarks in the coordinates of each sample point are distorted within the
error bound by the approximation, but such distortion can be controlled by the
strength of the embedded watermarks and the magnitude of the error bound. Since
the points are sampled not in the physical space (x-, y-, z-coordinates) but in
the parametric space (u-, v-coordinates), the proposed watermarking algorithm is
also found to be robust against attacks on the control points that determine the
model's translation, rotation, scaling and projection.
Some 3D models are acquired using some special equipment (such as 3D laser
scanners). Similar to 2D pixel-based images, the data unit of a 3D image is a voxel,
which also has a color or gray-scale property. Watermarks can be embedded
through altering the colors or gray properties in the spatial domain or transformed
domains (e.g. 3D DCT, DFT, 3D DWT). Detailed descriptions of 3D image
watermarking algorithms can be found in [35-38].
of motion due to the phenomenon of persistence of vision, and can be created and
demonstrated in a number of ways. The most common method of presenting
animation is as a motion picture or video program, although several other forms of
presenting animation also exist. Computer animation (or CGI animation) is the art
of creating moving images with the use of computers. It is a subfield of computer
graphics and animation. Increasingly it is created by means of 3D computer
graphics, though 2D computer graphics are still widely used for stylistic, low
bandwidth and faster real-time rendering needs. Sometimes the target of the
animation is the computer itself, but sometimes the target is another medium, such
as film. It is also referred to as CGI (computer-generated imagery or
computer-generated imaging), especially when used in films. For 3D animations,
all frames must be rendered after modeling is complete. For 2D vector animations,
the rendering process is the key frame illustration process, while in-between
frames are rendered as needed. For pre-recorded presentations, the rendered
frames are transferred to a different format or medium such as film or digital video.
The frames may also be rendered in real time as they are presented to the end-user
audience. Low-bandwidth animations transmitted via the Internet (e.g. 2D Flash,
X3D) often use software on the end-user's computer to render in real time as an
alternative to streaming or pre-loaded high-bandwidth animations.
3D animation watermarking technology is a brand-new application for 3D
animation data protection. Here, animation refers to a character (role) moving
continuously for a certain period of time. The role can be compactly
represented by a skeleton
formed by some key points with one or more degrees of freedom. The change of
each degree of freedom in the time domain can be viewed as an independent
signal, while the whole animation is a function of time. DCT can be used for a 3D
animation oblivious watermarking algorithm by performing a slight quantization
disturbance to mid-coefficients of DCT and combining the ideas of spread
spectrum and quantization. Choosing a reasonable quantization step can ensure
that the original movement is visually acceptable. At the same time, spreading
every watermark bit over many frequency coefficients by spread spectrum can
effectively increase the robustness. This algorithm exhibits high robustness to
white Gaussian noise, resampling, movement smoothing and reordering.
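The exact embedding rule of that algorithm is not reproduced here, but the following Python sketch illustrates the general idea of combining spread spectrum and quantization on the mid-band DCT coefficients of a single degree-of-freedom signal. The band, the quantization step and the PN sequence are hypothetical parameters, and the chip-wise quantization rule is our own simplification rather than the algorithm referred to above.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix (n x n), built explicitly with NumPy."""
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0, :] *= 1 / np.sqrt(2)
    return C * np.sqrt(2.0 / n)

def embed_bit(signal, bit, pn, band=(8, 40), step=0.02):
    """Embed one bit (+1/-1) into the mid-band DCT coefficients of one
    degree-of-freedom signal, spread by the PN chips and quantized per chip."""
    C = dct_matrix(len(signal))
    coeffs = C @ signal
    for k, chip in zip(range(band[0], band[1]), pn):
        offset = 0.0 if bit * chip > 0 else step / 2.0          # choose the coset
        coeffs[k] = np.round((coeffs[k] - offset) / step) * step + offset
    return C.T @ coeffs                                         # inverse orthonormal DCT

def extract_bit(signal, pn, band=(8, 40), step=0.02):
    C = dct_matrix(len(signal))
    coeffs = (C @ signal)[band[0]:band[1]]
    # nearest-coset decision per chip, then de-spread by the PN sequence
    chips = np.where(np.abs(coeffs / step - np.round(coeffs / step)) < 0.25, 1, -1)
    return 1 if np.sum(chips * pn[:len(chips)]) >= 0 else -1
```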
In addition, Hartung et al. developed a watermarking algorithm [3] in the
MPEG-4 facial animation parameters (FAP) sequence using spread spectrum
technology. A remarkable aspect of this method is that not only can watermarks be
extracted from parameters, but the facial animation parameter sequence (from
which the watermark can be extracted) can also be generated from the real facial
video sequence using the facial feature tracking system.
5.8 Summary
watermarking methods in the spatial domain were introduced. Next, a robust mesh
watermarking scheme proposed by the authors of this book was introduced in
detail. Then, according to different transformations when embedding information,
we briefed some typical 3D model watermarking algorithms in the transform
domain. Finally, watermarking algorithms for other types of 3D models were
briefly introduced.
Through this chapter, we can see that 3D model watermarking is a new field of
watermarking research. It has become a focus for domestic and foreign
researchers, who have done much exploratory work and provided many new ideas for
those working in CAD research and development, so that a new research area has
opened up. However, analysis shows that there is still much unfinished work:
many outstanding issues remain, leaving a large space for further study of 3D
model watermarking. A number of issues, centered around 3D mesh watermarking,
need to be addressed by thorough studies:
Robust watermarking also needs improving. Robust watermarking research
includes robustness against insection, non-uniform scaling and mesh
simplification, as well as the introduction of geometric noise interference, and so
on. In 3D mesh digital watermarking research, we can learn from the still image
digital watermarking ideas and methods. In particular, we should introduce
transform-domain methods into 3D mesh watermarking research, such as the
pioneering work done by Kanai in this direction [61, 62]. With consideration of a
balanced robustness-capacity relationship, improving the robustness of public
watermarks is still a problem.
The applied research area of fragile watermarking is not yet mature.
Visualization tools for detecting and locating alterations should be further
improved. In addition, research into authentication for VRML (virtual reality
modeling language) models, along with multi-level verification of 3D meshes, has
so far involved few researchers.
It is necessary to develop watermarking methods for VRML files. VRML is
widely used for creating a dynamic 3D virtual space over the Internet. VRML
documents are text documents and send commands to Internet browsers about
how to create 3D models for the virtual space. Research into watermarking
methods for VRML files has a direct practical value.
Watermarking technology has been extended to CAD systems and other forms of
representation, mainly free-form surfaces and solid models. There are many
ways of describing object shapes, such as representation by voxels, CSG trees
and boundaries. Boundary representation includes implicit function surfaces,
parametric surfaces, subdivision surfaces and points, as well as polygonal
meshes. Ohbuchi et al. and Mitsuhashi et al. have done exploratory work in the
field of watermarking for interval curve surfaces and triangle-domain curve
surfaces. The solid model is
far more extensively applied in the CAD field than mesh models, so it is more
significant for copyright protection and product verification if we extend the
watermarking technology to the CAD field.
Now, a potential application example of 3D watermark technology is given: the
Virtual Museum. Although a museum exists for the collection, protection and
use of important cultural relics, for various reasons most museums have the
References
[1] S. Kishk and B. Javidi. 3D object watermarking by 3-D hidden object. Opt. Exp.,
2003, 11(8):874-888.
[2] E. Garcia and J. L. Dugelay. Texture-based watermarking of 3-D video objects.
IEEE Trans. Circuits Syst. Video Technol., 2003, 13(8):853-866.
[61] S. Kanai, H. Date and T. Kishinami. Digital watermarking for 3D polygons using
multiresolution wavelet decomposition. In: Proc. Sixth IFIP WG 5.2 GEO-6,
1998, pp. 296-307.
[62] H. Date, S. Kanai and T. Kishinami. Digital watermarking for 3D polygonal
model based on wavelet transform. In: Proceedings of DETC’99, 1999.
[63] J. M. Lounsbery. Multiresolution analysis for surfaces of arbitrary topological
type. Ph.D Thesis, Department of Computer Science and Engineering,
University of Washington, 1994.
[64] E. J. Stollnitz, T. D. DeRose and D. H. Salesin. Wavelets for Computer
Graphics: Theory and Applications. Morgan Kaufmann Publishers, 1996.
[65] A. Kalivas, A. Tefas and I. Pitas. Watermarking of 3D models using principal
component analysis. In: IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP’03), 2003, 5:676-679.
[66] I. Guskov, W. Sweldens and P. Schröder. Multiresolution signal processing for
meshes. In: SIGGRAPH'99 Conference Proceedings, 1999, pp. 325-334.
[67] H. Hoppe. Progressive Meshes. In: SIGGRAPH’96 Proceedings, 1996, pp.
99-108.
[68] M. Garland and P. S. Heckbert. Surface simplification using quadric error
metrics. In: SIGGRAPH’97 Proceedings, 1997, pp. 119-128.
[69] G. Taubin, T. Zhang and G. Golub. Optimal surface smoothing as filter design.
IBM Technical Report RC-20404, 1996.
[70] H. S. Song, N. I. Cho and J. W. Kim. Robust watermarking of 3D mesh models.
In: IEEE Workshop on Multimedia Signal Processing, 2002, pp. 332-335.
[71] K. Muratani and K. Sugihara. Watermarking 3D polygonal meshes using the
singular spectrum analysis. Paper presented at The IMA Conference on the
Mathematics of Surfaces, 2003, pp. 85-98.
[72] J. Lee, N. I. Cho and S. U. Lee. Watermarking algorithms for 3D NURBS
graphic data. EURASIP Journal on Applied Signal Processing, 2004, 14:
2142-2152.
6 Reversible Data Hiding in 3D Models
6.1 Introduction
6.1.1 Background
Data hiding is a technique that embeds secret information called a mark into host
media for various purposes such as copyright protection, broadcast monitoring and
authentication. Although cryptography is another way to protect the digital content,
it only protects the content in transit. Once the content is decrypted, it has no
further protection. Moreover, cryptographic techniques cannot provide sufficient
integrity for content authentication. Data hiding techniques can be used in a wide
variety of applications, each of which has its own specific requirements: different
payload, perceptual transparency, robustness and security [10-13].
Digital watermarking is a form of data hiding. From the application point of
view, digital watermarking methods can be classified into two categories: robust
watermarking and fragile watermarking [10]. On the one hand, robust
watermarking aims at making a watermark robust to all possible distortions to
preserve the contents. On the other hand, fragile watermarking makes a watermark
invalid even after the slightest modification of the contents, so it is useful to
control content integrity and authentication. Most multimedia data embedding
techniques modify, and hence distort, the host signal in order to insert the
additional information. Often, this embedding distortion is small, yet irreversible;
i.e., it cannot be removed to recover the original host signal. In many applications,
the loss of host signal fidelity is not prohibitive as long as original and modified
signals are perceptually equivalent. However, in some cases, although some
embedding distortion is admissible, permanent loss of signal fidelity is undesirable.
For example, in quality-sensitive applications such as medical imaging, military
imaging, law enforcement and remote sensing where a slight modification can
lead to a significant difference in the final decision-making process, the original
media without any modification is required during data analysis. Even if the
modification is quite small and imperceptible to the human eye, it is not
acceptable because it may affect the right decision and lead to legal problems.
This highlights the need for reversible (lossless) data embedding techniques.
These techniques, like their lossy counterparts, insert information bits by
modifying the host signal, thus inducing an embedding distortion. Nevertheless,
they also enable the removal of such distortions and the lossless restoration of the
original host signal after extraction of embedded information. Most of the
reversible data hiding schemes, or so-called lossless data hiding (invertible data
hiding) schemes, belong to fragile watermarking. For content authentication and
tamper proofing, this enables exact recovery of the original media from the
watermarked image after watermark removal [14]. The hash value of the original
content, as well as electronic patient records (EPRs) and metadata regarding the
The general principle of reversible data hiding is that, for a digital object
(say a JPEG image file) I, a subset J of I is chosen. J has the structural
property that it can be easily randomized without changing the essential
property of I, and its lossless compression frees enough space (at least 128
bits) to embed the authentication message (say the hash of I). During embedding,
J is replaced by the authentication message concatenated with the compressed J.
If J is highly compressible, only a subset of J needs to be used. During the
decoding process, the authentication information together with the compressed J
is extracted. The extracted compressed J is decompressed to replace the modified
features in the watermarked object; hence an exact copy of the original object
is obtained. The decoding process is just the reverse of the embedding process.
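The following Python sketch illustrates the embedding side of this general principle under several arbitrary choices that merely stand in for the components named above: J is taken to be the LSB plane of an 8-bit image, SHA-256 plays the role of the authentication hash, and zlib is the lossless compressor.

```python
import hashlib
import struct
import zlib
import numpy as np

def embed_authentication(image):
    """Illustrative sketch only: compress the LSB plane J of a uint8 image and
    overwrite J with a length header, the hash of the original image and the
    compressed J; the decoder can restore J and re-verify the hash."""
    flat = image.flatten()                                   # uint8 pixels
    lsb_bytes = np.packbits(flat & 1).tobytes()              # the subset J
    body = hashlib.sha256(image.tobytes()).digest() + zlib.compress(lsb_bytes)
    payload = struct.pack(">I", len(body)) + body            # length header for decoding
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
    if bits.size > flat.size:
        raise ValueError("J is not compressible enough for this image")
    flat[:bits.size] = (flat[:bits.size] & np.uint8(254)) | bits
    return flat.reshape(image.shape)
```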
Three basic requirements for reversible data hiding can be summarized as
follows:
(1) Reversibility. Reversibility is defined as “one can remove his embedded
data to restore the original media.” It is the most important and essential property
for reversible data hiding.
(2) Capacity. The data to be embedded should be as large as possible. A small
capacity will restrict the range of applications. The capacity is one of the
important factors for measuring the performance of the algorithm.
(3) Fidelity. Data hiding techniques with high capacity might lead to low
fidelity. The perceptual quality of the host media should not be degraded severely
after data embedding, although the original content is supposed to be recovered
completely.
In particular, the performance of a 3D model reversible data hiding algorithm
is measured by the following aspects: (1) embedding capacity; (2) visual quality of
the marked model; (3) computational complexity. Reversible data hiding aims at
developing a method that increases the embedding capacity as much as possible
while keeping the distortion and the computational complexity at a low level.
Before introducing reversible data hiding schemes for 3D models, this section first
introduces classifications, applications and typical schemes of reversible data
hiding for images.
The type-I algorithms are based on lossless data compression techniques. They
losslessly compress selected features from the host media to obtain enough
space,
which is then filled up with the secret data to be hidden. For example, Fridrich et
al. [16] used a JBIG lossless compression scheme for compressing a proper
bit-plane that offers minimum redundancy and embedded the image hash by
appending it to the compressed bit-stream. However, a noisy image may force us
to embed the hash in the higher bit-plane, and hence it causes visual artifacts.
Celik et al. [17] used a CALIC lossless compression algorithm and achieved high
capacity by using a generalized least significant bit embedding (G-LSB) technique,
but the capacity depends on image structures.
coefficients of image blocks. The capacity and visual quality were adjusted by
selecting different numbers of AC coefficients in different frequencies. In [19], an
integer wavelet transform is employed. Secret bits are embedded into a middle
bit-plane of the integer wavelet coefficients in the high frequency sub-band. In
[15], Lee et al. applied the integer-to-integer wavelet transform to image blocks
and embedded message bits into the high-frequency wavelet coefficients of each
block.
The type-III algorithms can be grouped into two categories: difference expansion
(DE) and histogram modification. The original difference expansion technique
was proposed by Tian in [20]. It applies the integer Haar wavelet transform to
obtain high-pass components considered as the differences of pixel pairs. Secret
bits are embedded by expanding these differences. The main advantage is its high
embedding capacity, but its disadvantages are the undesirable distortion at low
capacities and lack of capacity control due to embedding of a location map which
contains the location information of all selected expandable difference values.
Alattar developed the DE technique for color images using triplets [21] and quads
[22] of adjacent pixels and generalized DE for any integer transform [23].
Kamstra and Heijmans [24] improved the DE technique by employing low-pass
components to predict which location will be expandable, so their scheme is
capable of embedding small capacities at low distortions. To overcome the
drawbacks of the DE technique, Thodi and Rodriguez [25] presented a
histogram-shifting technique to embed a location map for capacity control and
suggested a prediction error expansion approach utilizing the spatial correlation in
the neighborhood of a pixel.
Histogram modification techniques use the image histogram to hide message
bits and achieve reversibility. Since most histogram-based methods do not apply
any transform, all processing is performed in the spatial domain, and thus the
computational cost is moderately lower than that of type-I and type-II
algorithms. Ni et al. [26] utilized a zero point and a peak point of a given
image histogram, where the embedding capacity equals the number of pixels in the
peak point. Versaki et al. [27] also proposed a reversible scheme using peak and
zero points. One drawback of these algorithms is that they require the
information of the histogram's peak or zero points to recover the original
image. In [28] and [29], Ni's scheme was extended and a location map was applied
to achieve reversibility without knowledge of the peak and zero points. Tsai et
al. [30] achieved a higher embedding capacity than the previous histogram-based
methods by using a residue image indicating the difference between a basic pixel
and each pixel in a non-overlapping block. However, in their scheme, since the
peak and zero point information of each block has to be attached to the message
bits, the actual embedding capacity is lowered. Lee et al. [31] explored the
peak point in the difference image histogram and embedded data into locations
where the values of the difference image are -1 and +1. In [32], Lin et al.
divided the image into non-overlapping
blocks and generated a difference image block by block. Then, message bits are
embedded by modifying the difference image of each block after making an empty
bin through histogram shifting. Although this technique is a high capacity
reversible method using a multi-level hiding strategy, it is required to transmit the
peak information of all blocks.
In the type-I algorithms, the embedding capacity varies according to the
characteristic of the image and the performance highly depends on the adopted
lossless compression algorithm. The type-II algorithms show satisfactory results,
but require additional computational costs to convert the media into transform
domains. The DE technique among the type-III algorithms requires extra effort to
control the capacity, due to the embedding of the location map. Although histogram-based
methods simply work through histogram modification, overhead information
should be as little as possible. In the following two subsections, two typical
reversible data hiding schemes for images are detailed.
6.2.2 Difference-Expansion-Based Reversible Data Hiding
In [20], Tian proposed a reversible data hiding method for images based on
difference expansion. In this method, the secret data is embedded in the
differences of image pixel values. For a pair of pixels (x, y) in a gray-level
image, their average l and difference h are defined as
$$l = \left\lfloor \frac{x + y}{2} \right\rfloor, \qquad h = x - y. \qquad (6.1)$$

To embed a secret bit b, the difference is expanded as h' = 2h + b, and the
watermarked pixel values are computed as

$$x' = l + \left\lfloor \frac{h' + 1}{2} \right\rfloor, \qquad y' = l - \left\lfloor \frac{h'}{2} \right\rfloor. \qquad (6.2)$$
During data extraction, the secret bit is extracted as b = h' mod 2 and the
original difference is computed as
$$h = \left\lfloor \frac{x' - y'}{2} \right\rfloor. \qquad (6.3)$$
$$x = \left\lfloor \frac{x' + y'}{2} \right\rfloor + \left\lfloor \frac{h + 1}{2} \right\rfloor, \qquad y = \left\lfloor \frac{x' + y'}{2} \right\rfloor - \left\lfloor \frac{h}{2} \right\rfloor. \qquad (6.4)$$
The major problem is that overflow and underflow might occur. The secret bit
can be embedded only in the pixels which satisfy
$$0 \leq l - \left\lfloor \frac{h'}{2} \right\rfloor, \qquad l + \left\lfloor \frac{h' + 1}{2} \right\rfloor \leq 255. \qquad (6.5)$$
A pixel pair satisfying Eq.(6.5) is called the expandable pixel pair. In order to
achieve lossless data embedding, a location map is employed to record the
expandable pixel pair. The location map is then compressed by lossless
compression methods and concatenated with the original secret message to be
superimposed on the host signal later.
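A direct Python transcription of Eqs.(6.1)-(6.5) may make the scheme concrete; the function names are ours, and Python's floor division and modulo match the floor operations used above.

```python
def de_embed(x, y, b):
    """Tian's difference expansion on one pixel pair, Eqs.(6.1)-(6.2)."""
    l, h = (x + y) // 2, x - y
    h2 = 2 * h + b                         # expanded difference carrying the bit b
    return l + (h2 + 1) // 2, l - h2 // 2

def de_extract(x2, y2):
    """Recover the bit and the original pair, Eqs.(6.3)-(6.4)."""
    l, h2 = (x2 + y2) // 2, x2 - y2
    b, h = h2 % 2, h2 // 2                 # Python // and % implement the floor/mod above
    return b, l + (h + 1) // 2, l - h // 2

def expandable(x, y, b):
    """Overflow/underflow test of Eq.(6.5)."""
    l, h2 = (x + y) // 2, 2 * (x - y) + b
    return 0 <= l - h2 // 2 and l + (h2 + 1) // 2 <= 255

print(de_embed(100, 97, 1))    # -> (102, 95); de_extract(102, 95) -> (1, 100, 97)
```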
In [23], Alattar extended Tian’s scheme using difference expansion of a vector
instead of a pixel pair to hide message data for color images. In their scheme, a
vector is formed by k non-overlapping pixels. Then they use a reversible integer
transform function to transform the vector. If the transformed vector can be used
to hide message data, then they use Tian’s difference expansion algorithm to
conceal the data. For restoring the host image, the algorithm needs a location map,
as well as Tian’s location map, to indicate whether the vector can be used to hide
message bits or not.
For example, a vector with four pixels is used to embed three message bits. Let
p = (p_1, p_2, p_3, p_4) be the vector and b_1, b_2, b_3 be the message bits. First, the
reversible integer transformation function is used to compute the weighted
average q_1 and the differences q_2, q_3 and q_4 of p_2, p_3, p_4 from p_1. The weighted
average and the differences are calculated by

$$\begin{cases} q_1 = \left\lfloor \dfrac{a_1 p_1 + a_2 p_2 + a_3 p_3 + a_4 p_4}{a_1 + a_2 + a_3 + a_4} \right\rfloor, \\ q_2 = p_2 - p_1, \\ q_3 = p_3 - p_1, \\ q_4 = p_4 - p_1, \end{cases} \qquad (6.6)$$
where a1, a2, a3, a4 are constant coefficients. Then, the weighted average and the
differences are shifted according to the message bits to generate the one-bit
left-shifted values q'1, q'2, q'3 and q'4. The shifted values are computed by
$$\begin{cases} q_1' = q_1, \\ q_2' = 2q_2 + b_1, \\ q_3' = 2q_3 + b_2, \\ q_4' = 2q_4 + b_3. \end{cases} \qquad (6.7)$$
Finally, the pixels containing the message bits p'1, p'2, p'3 and p'4 are calculated
by
$$\begin{cases} p_1' = q_1 - \left\lfloor \dfrac{a_2 q_2' + a_3 q_3' + a_4 q_4'}{a_1 + a_2 + a_3 + a_4} \right\rfloor, \\[2mm] p_2' = q_2' + q_1 - \left\lfloor \dfrac{a_2 q_2' + a_3 q_3' + a_4 q_4'}{a_1 + a_2 + a_3 + a_4} \right\rfloor, \\[2mm] p_3' = q_3' + q_1 - \left\lfloor \dfrac{a_2 q_2' + a_3 q_3' + a_4 q_4'}{a_1 + a_2 + a_3 + a_4} \right\rfloor, \\[2mm] p_4' = q_4' + q_1 - \left\lfloor \dfrac{a_2 q_2' + a_3 q_3' + a_4 q_4'}{a_1 + a_2 + a_3 + a_4} \right\rfloor. \end{cases} \qquad (6.8)$$
The embedding data is inferred from the shifted values that are computed as
$$\begin{cases} b_1 = q_2'' - 2\left\lfloor \dfrac{q_2''}{2} \right\rfloor, \\[2mm] b_2 = q_3'' - 2\left\lfloor \dfrac{q_3''}{2} \right\rfloor, \\[2mm] b_3 = q_4'' - 2\left\lfloor \dfrac{q_4''}{2} \right\rfloor. \end{cases} \qquad (6.10)$$
$$\begin{cases} q_1 = q_1'', \\ q_2 = \left\lfloor \dfrac{q_2''}{2} \right\rfloor, \\[2mm] q_3 = \left\lfloor \dfrac{q_3''}{2} \right\rfloor, \\[2mm] q_4 = \left\lfloor \dfrac{q_4''}{2} \right\rfloor. \end{cases} \qquad (6.11)$$
$$\begin{cases} p_1 = q_1 - \left\lfloor \dfrac{a_2 q_2 + a_3 q_3 + a_4 q_4}{a_1 + a_2 + a_3 + a_4} \right\rfloor, \\ p_2 = q_2 + p_1, \\ p_3 = q_3 + p_1, \\ p_4 = q_4 + p_1. \end{cases} \qquad (6.12)$$
In this way, the secret data is extracted and the host image is accurately
recovered.
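A compact Python sketch of the quad-based scheme is given below, with all weights a_i set to 1 by default and the overflow test and location map omitted; the function names are ours.

```python
def quad_embed(p, bits, a=(1, 1, 1, 1)):
    """Quad difference expansion following Eqs.(6.6)-(6.8)."""
    s = sum(a)
    q1 = (a[0]*p[0] + a[1]*p[1] + a[2]*p[2] + a[3]*p[3]) // s        # Eq.(6.6)
    q = [q1, p[1] - p[0], p[2] - p[0], p[3] - p[0]]
    qs = [q[0], 2*q[1] + bits[0], 2*q[2] + bits[1], 2*q[3] + bits[2]]  # Eq.(6.7)
    p1 = qs[0] - (a[1]*qs[1] + a[2]*qs[2] + a[3]*qs[3]) // s           # Eq.(6.8)
    return [p1, qs[1] + p1, qs[2] + p1, qs[3] + p1]

def quad_extract(pw, a=(1, 1, 1, 1)):
    """Invert the embedding, following Eqs.(6.10)-(6.12)."""
    s = sum(a)
    q1 = (a[0]*pw[0] + a[1]*pw[1] + a[2]*pw[2] + a[3]*pw[3]) // s
    qs = [q1, pw[1] - pw[0], pw[2] - pw[0], pw[3] - pw[0]]             # forward transform again
    bits = [qs[1] % 2, qs[2] % 2, qs[3] % 2]                           # Eq.(6.10)
    q = [qs[0], qs[1] // 2, qs[2] // 2, qs[3] // 2]                    # Eq.(6.11)
    p1 = q[0] - (a[1]*q[1] + a[2]*q[2] + a[3]*q[3]) // s               # Eq.(6.12)
    return bits, [p1, q[1] + p1, q[2] + p1, q[3] + p1]

# Example: quad_extract(quad_embed([60, 62, 59, 61], [1, 0, 1]))
#          -> ([1, 0, 1], [60, 62, 59, 61])
```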
In these histograms, the horizontal axis denotes the pixel values in the range
[0, 255], while N on the vertical axis is the number of pixels corresponding to
the peak pixel value P. In [33], P is called the peak point and the first bin
with magnitude 0 on the right side of P is called the zero point Z.
The peak and zero points must be found before shifting the histogram. Then all
bins in [P, Z-1] are shifted one gray level rightward; that is, 1 is added to all
pixel values in [P, Z-1], and thus the bin P is emptied. As a result, the
magnitude of the bin P+1 becomes N.
Next, we can embed secret data by modulating 0 and 1 on P and P+1,
respectively. In particular, the pixel values belonging to the bin P+1 are scanned
one by one. If the bit “0” is to be embedded, the pixels with the value P+1 are
modified as P, while they are kept unchanged when the bit “1” is to be embedded.
In this way, the data embedding process is completed.
The data extraction and image recovery is the inverse process of data
embedding. First, the peak point P and the zero point Z must be located
accurately. Then we scan the whole image: if we come across a pixel with the
value P, a secret bit "0" is extracted; if P+1 is encountered, a secret bit "1"
is extracted. After the data is extracted, we only need to subtract 1 from all
pixel values in [P+1, Z], and thus the original image can be perfectly recovered.
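The following Python sketch follows the embedding and recovery steps described above for an 8-bit image; the function names are ours, and it assumes that a true zero point exists to the right of the peak (ignoring the bookkeeping needed when it does not).

```python
import numpy as np

def hs_embed(image, bits):
    """Histogram shifting as described above: shift bins [P, Z-1] right by one,
    then encode bits on the pixels now at P+1 (0 -> move to P, 1 -> keep)."""
    img = image.copy()                                # uint8 image, peak assumed < 255
    hist = np.bincount(img.ravel(), minlength=256)
    P = int(np.argmax(hist))
    Z = P + 1 + int(np.argmin(hist[P + 1:]))          # first zero (or minimum) bin right of P
    img[(img >= P) & (img <= Z - 1)] += 1             # empty bin P; its pixels move to P+1
    carriers = np.flatnonzero(img.ravel() == P + 1)   # capacity = number of carriers
    flat = img.ravel()
    for pos, b in zip(carriers, bits):
        if b == 0:
            flat[pos] = P
    return flat.reshape(img.shape), P, Z

def hs_extract(marked, P, Z, n_bits):
    """Extract the bits and recover the original image (inverse process)."""
    flat = marked.ravel().copy()
    carriers = np.flatnonzero((flat == P) | (flat == P + 1))[:n_bits]
    bits = [0 if flat[pos] == P else 1 for pos in carriers]
    flat[(flat >= P + 1) & (flat <= Z)] -= 1          # undo the shift
    return bits, flat.reshape(marked.shape)
```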
Although reversible data hiding was first introduced for digital images, it also
has wide application scenarios for hiding data in 3D models. For example,
suppose there is a column on a 3D mechanical model obtained by computer-aided
design, and the diameter of this column is changed by a given data hiding
scheme. In some applications, it is not enough that the hidden content is
accurately extracted, because the remaining watermarked model is still
distorted: even if the column diameter is increased or decreased by only 1 mm,
it may cause a severe problem because the mechanical part can no longer be
assembled well with other mechanical accessories. Therefore, it is also
significant to design reversible data hiding methods for 3D models.
As shown in Fig. 6.2, the general system for 3D model reversible data hiding can
be deduced from that designed for images. In this typical system, M and W denote
the host model and the original secret data, respectively. W is embedded in M with
the key K and the marked model MW is produced. Suppose the MW is losslessly
transmitted to the receiver and then the secret data is extracted as WR with the
same key K. Meanwhile, the original model is recovered as MR. The definition of
3D model reversible data hiding requires that both the secret data and the host
model should be recovered accurately, i.e., W_R = W and M_R = M. In a word, 3D
model reversible data hiding schemes must also satisfy the imperceptibility and
inseparability properties that general irreversible data hiding schemes do.
According to the general model shown in Fig. 6.2, we can find that the
requirements of 3D model reversible data hiding are more restricted than those of
irreversible ones. Besides, as a special host media, 3D model reversible data
hiding has several technical challenges as follows.
(1) Nowadays there are many types of 3D models such as 3D meshes and
point cloud models. Most 3D models are represented as meshes, while point cloud
models are stored and used in some specific applications such as 3D face
recognition. Moreover, there exist many formats of meshes, such as .off and .obj.
In practical applications, various types and formats of models are often
interconverted. In contrast, most available reversible data hiding schemes are
designed for one specific type or format. Thus, these schemes are usually not
suitable for other types or formats. Therefore, developing a universal reversible
data-hiding scheme is a challenging work.
(2) Various models may have different levels of detail. For example, a desk
model may contain only tens of vertices and faces, while a plane model may have
thousands of vertices and faces. This diversity in the level of detail should be
considered when developing reversible data hiding schemes for 3D models.
(3) The elements of data hiding in images are pixels, while in a 3D model the
elements are usually vertices and faces. In an image, each pixel has fixed
coordinates and data hiding simply modifies its pixel value. In contrast, the
coordinates of the watermarked vertices of a 3D model are usually changed before
data extraction; for example, the watermarked model may be rotated and
translated. Thus, pose estimation is usually required, which makes it difficult
to extract the data and recover the host model. Sometimes some affiliated
knowledge must be used to assist the data extraction and model recovery. This
affiliated knowledge must be securely sent to the decoder along with the
watermarked model, so researchers must try to reduce the amount of affiliated
knowledge.
Nowadays, some reversible data hiding schemes for 3D models are proposed in
the literature [39-45]. According to different embedding domains, they can be
classified into spatial-domain-based, compressed-domain-based and transform-
domain-based methods. In spatial-domain-based methods [39, 42, 43], the task of
data embedding is to modify the vertex coordinates, edge connections, face slopes
and so on. These schemes usually have a low computational complexity. The
compressed-domain-based methods [44, 45] are for embedding data with certain
compression techniques involved, e.g., vector quantization. In addition, some of
these methods are designed for compressed content of 3D models. Their
advantage is to hide data without decompressing the host model. In transform
domain-based methods [40, 41], the original model is transformed into a certain
transform domain and then data are embedded in transform coefficients. In these
schemes, the reversibility is guaranteed by that of the transforms.
Most available 3D model reversible data hiding schemes belong to spatial domain
methods. In [39], Chou et al. proposed a reversible data hiding scheme for 3D
models. In this method, all of the 3D vertices are divided into a set of groups.
Then they are transformed into the invariant space for resisting the attacks such as
rotation, translation and scaling. The secret data are embedded in some carefully
selected positions with unnoticeable distortions introduced. In this way, some
parameters are generated for data extraction, and these parameters are also hidden
in 3D models. In data extraction, these parameters are retrieved for data
extraction and model recovery. In [42], a reversible data hiding scheme for 3D
meshes is
and model recovery. In [42], a reversible data hiding scheme for 3D meshes is
proposed based on prediction-error expansion. The principle is to predict a
vertex’s position by calculating the centroid of its traversed neighbors, and then
the prediction error, i.e. the difference between the predicted and real positions, is
expanded for data embedding. In this scheme, only the vertex coordinates are
modified to embed data, and thus the mesh topology is unchanged. The visual
distortion is reduced by adaptively choosing a threshold so that the prediction
errors with too large a magnitude will not be expanded. The selected threshold
value and the location information are saved in the mesh for model recovery. As
the original mesh can be exactly recovered, this algorithm can be used for
symmetric or public key authentication of 3D mesh models.
This section introduces another spatial-domain-based reversible data hiding
With the widespread use of polygonal meshes, how to authenticate them has
become a real need, especially in the web environment. As an effective measure,
data hiding for multimedia content (e.g. digital images, 3D models, video and
audio streams) has been widely studied to prove the ownership of digital works,
verify their integrity, convey additional information, and so forth. Depending on
the applications, digital watermarking can be mainly classified into robust
watermarking (e.g. [46-48]) and fragile watermarking. In this subsection, we
concentrate on the latter only, in which the embedded watermark will change or
even disappear if the watermarked object is tampered with. Therefore, fragile
watermarking has been used to verify the integrity of digital works. In the
literature, only a few fragile ones [5, 49-51] have been proposed to verify the
integrity. Actually, the first fragile watermarking method for 3D object verification
is addressed by Yeo and Yeung in [49], as a 3D version of the method for 2D
image watermarking. In [52], invertible authentication of 3D meshes is first
introduced by combining a public verifiable digital signature protocol with the
embedding method in [53], which appends extra faces and vertices to the original
mesh. After extracting the embedded signature, the appended faces and vertices
can be removed on demand to reproduce the original mesh with a secret key. One
of the algorithms proposed in [5] called Vertex Flood Algorithm can be used for
model authentication with certain tolerances, e.g. truncation of mantissas of vertex
coordinates. A fragile watermarking scheme for triangle meshes is presented by
Cayre et al. in [50] to embed a watermark with robustness against translation,
rotation and scaling transforms. Nevertheless, all those proposed algorithms are
not reversible, i.e. the original mesh cannot be recovered from the watermarked
mesh. Actually, it is advantageous to recover the original mesh from its
watermarked version because the mesh distortion introduced by the encoding
process can be compensated. In this subsection, a reversible data-hiding method is
introduced to authenticate 3D meshes [43]. By keeping the modulation
information in the watermarked mesh, the reversibility of the embedding process
in [54] is achieved. Since the embedded watermark is sensitive to geometrical and
topological processing, unauthorized modifications on the watermarked mesh can
In [54], the distance from the mesh faces to the mesh centroid is modulated to
embed the fragile watermark to detect the modifications on the watermarked mesh.
As a result, the original mesh is changed after the watermarking process.
Nevertheless, we notice that the mesh topology is unchanged during the encoding
process; the original mesh can be recovered by moving every vertex back to its
original position. It can be achieved by keeping the modulation information in the
watermarked mesh. Accordingly, the encoding and decoding processes will be
shown as follows, respectively.
In the encoding process, a special case of quantization index modulation called
dither modulation [55] is extended to the mesh. By modulating the distances from
the mesh faces to the mesh centroid, a sequence of data bits is embedded into the
original mesh.
Suppose V = {v_1, ..., v_U} is the set of vertex positions in R^3; the position v_c of
the mesh centroid is defined as

$$v_c = \frac{1}{U}\sum_{i=1}^{U} v_i\,. \qquad (6.13)$$

The distance d_fi from the centroid v_ic of a face f_i to the mesh centroid is

$$d_{fi} = \sqrt{(v_{icx} - v_{cx})^2 + (v_{icy} - v_{cy})^2 + (v_{icz} - v_{cz})^2}\,, \qquad (6.14)$$

where (v_icx, v_icy, v_icz) and (v_cx, v_cy, v_cz) are the coordinates of the face centroid v_ic
and the mesh centroid v_c in R^3, respectively. It can be concluded that d_fi is
sensitive to both geometrical and topological modifications made to the mesh
model.
The distance d_i from a vertex with the position v_i to the mesh centroid is
defined as

$$d_i = \sqrt{(v_{ix} - v_{cx})^2 + (v_{iy} - v_{cy})^2 + (v_{iz} - v_{cz})^2}\,, \qquad (6.15)$$

where (v_ix, v_iy, v_iz) is the vertex coordinate in R^3. The quantization step S of the
modulation is chosen as
$$S = D/N, \qquad (6.16)$$

where N is a specified value and D is the distance from the furthest vertex to
the mesh centroid. With the modulation step S, the integer quotient Q_i and the
remainder R_i are obtained by

$$Q_i = \left\lfloor \frac{d_{fi}}{S} \right\rfloor, \qquad (6.17)$$

$$R_i = d_{fi} \bmod S. \qquad (6.18)$$
To embed one watermark bit wi, Wu and Yiu [43] modulated the distance
dfi from f to the mesh centroid so that the modulated integer quotient Q'i meets
Q'i%2 = wi. To keep the modulation information in the watermarked mesh, the
modulated distance d'fi is defined as
$$d_{fi}' = \begin{cases} Q_i S + S/2 + m_i, & \text{if } Q_i \bmod 2 = w_i; \\ Q_i S - S/2 + m_i, & \text{if } Q_i \bmod 2 \neq w_i \text{ and } R_i < S/2; \\ Q_i S + 3S/2 + m_i, & \text{if } Q_i \bmod 2 \neq w_i \text{ and } R_i \geq S/2, \end{cases} \qquad (6.19)$$

and the corresponding modulated integer quotient is

$$Q_i' = \begin{cases} Q_i, & \text{if } Q_i \bmod 2 = w_i; \\ Q_i - 1, & \text{if } Q_i \bmod 2 \neq w_i \text{ and } R_i < S/2; \\ Q_i + 1, & \text{if } Q_i \bmod 2 \neq w_i \text{ and } R_i \geq S/2. \end{cases} \qquad (6.20)$$
Consequently, the resulting d'fi is used to adjust the position of the face
centroid. Only one vertex in f is selected to move the face centroid to the desired
position. Suppose vis is the position of the selected vertex, the adjusted vertex
position would be
$$v_{is}' = N_i\Bigl[v_c + \frac{d_{fi}'}{d_{fi}}\,\bigl(v_{ic} - v_c\bigr)\Bigr] - \sum_{j=1,\, j \neq s}^{N_i} v_{ij}\,, \qquad (6.21)$$

where v_ij is the position of the j-th vertex of f, which has N_i vertices, and v_ic
is the former face
centroid. To prevent the embedded watermark bits from being changed by the
subsequent encoding operations, all vertices in the face should not be moved any
more after the adjustment.
The detailed procedure to reversibly embed the watermark is as follows. At
first, the original mesh centroid position is calculated by Eq.(6.13). Then the
furthest vertex from the mesh centroid is found using Eq.(6.15) and its distance
D to the mesh centroid is obtained. After that, the modulation step S is chosen
by specifying the value of N in Eq.(6.16). Using the key Key, the sequence of
face indices I is scrambled to generate the scrambled version I', which
determines the order in which the mesh faces are visited. For a face f indexed
by I', if there is at least one unvisited vertex, the distance from f to the
mesh centroid is calculated by Eq.(6.14) and modulated by Eq.(6.19) according to
the watermark bit value. Subsequently, the position of the unvisited vertex is
modified using Eq.(6.21), whereby the face centroid is moved to the desired
position. If there is no unvisited vertex in f, the procedure skips to the next
face indexed by I', until all watermark bits are embedded.
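A simplified Python sketch of the distance modulation in Eqs.(6.16)-(6.21) is given below. For brevity it omits the modulation-information term m_i of Eq.(6.19), and hence the reversibility bookkeeping, so it only illustrates how the quotient parity is forced to the watermark bit by moving a single vertex of the face; the function interface is ours.

```python
import numpy as np

def embed_bit_in_face(vertices, face, v_c, S, w, s_idx=0):
    """Modulate the distance from one face centroid to the mesh centroid so
    that its quotient parity encodes the bit w (Eqs.(6.17)-(6.20), m_i omitted),
    then realize the new distance by moving one vertex (Eq.(6.21))."""
    V = vertices[face]                          # (N_i, 3) positions of the face vertices
    v_ic = V.mean(axis=0)                       # face centroid
    d_fi = np.linalg.norm(v_ic - v_c)           # Eq.(6.14)
    Q, R = int(d_fi // S), d_fi % S             # Eqs.(6.17)-(6.18)
    if Q % 2 == w:
        d_new = Q * S + S / 2                   # stay in the same quantization cell
    elif R < S / 2:
        d_new = Q * S - S / 2                   # move to the previous cell
    else:
        d_new = Q * S + 3 * S / 2               # move to the next cell
    v_ic_new = v_c + (d_new / d_fi) * (v_ic - v_c)
    others = np.delete(V, s_idx, axis=0).sum(axis=0)
    vertices[face[s_idx]] = len(face) * v_ic_new - others   # Eq.(6.21)

def extract_bit_from_face(vertices, face, v_c, S):
    """Read the bit back as the parity of the modulated quotient, Eq.(6.22)."""
    v_ic = vertices[face].mean(axis=0)
    return int(np.linalg.norm(v_ic - v_c) // S) % 2
```

Note that, as in the scheme itself, the decoder must be given the original mesh centroid position v_c and the step S, since moving vertices changes the centroid.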
In the decoding process, the original mesh centroid position vc, the modulation
step S, as well as the secret key Key and the original watermark are required. The
embedded watermark needs to be extracted from the watermarked mesh and
compared with the original watermark to detect illegal tampering on the
watermarked mesh. The original mesh can be recovered if the watermarked mesh
is intact.
The detailed decoding process is conducted as follows: At first, the sequence
of face indices I is scrambled using the key Key to generate the scrambled version
I', which is followed to retrieve the embedded watermark. If there is at least one
unvisited vertex in a face f'i, the modulated distance d'fi from f'i to the mesh
centroid is calculated by Eq.(6.14). With the given S', the modulated integer
quotient Q'i is obtained by
« d cffi »
Qic « c ». (6.22)
¬S ¼
After the watermark extraction, the extracted watermark W' is compared with
the original watermark W to detect the modifications that might have been made
to the watermarked mesh. Supposing the length of the watermark is K, the
normalized cross-correlation value NC between the original and the extracted
watermarks is given by
$$NC = \frac{1}{K}\sum_{i=1}^{K} I\bigl(w_i', w_i\bigr), \qquad (6.24)$$

with

$$I\bigl(w_i', w_i\bigr) = \begin{cases} 1, & \text{if } w_i' = w_i; \\ -1, & \text{otherwise}. \end{cases} \qquad (6.25)$$
According to the definition of m_i, for i = 2, ..., K-1, the original distance
d_fi = d_fi' + m_{i+1} × 4, while d_fK = d_fK' + m_1 × 4 and d_f1 = Q_1' × S' + S'/2 + m_2 × 4. With the
obtained d_fi, all the vertices whose positions have been adjusted can be moved
back by
$$v_{is} = N_i\Bigl[v_c + \frac{d_{fi}}{d_{fi}'}\,\bigl(v_{ic}' - v_c\bigr)\Bigr] - \sum_{j=1,\, j \neq s}^{N_i} v_{ij}'\,, \qquad (6.27)$$

where v_ij' is the vertex position in the face f_i', which consists of N_i vertices,
with v_ic' as the adjusted centroid position, v_is is the recovered vertex position
and v_c is the
original mesh centroid position. After the original mesh is recovered from the
watermarked mesh, an additional way to detect the modifications on the
watermarked mesh is to compare the centroid position of the recovered mesh with
that of the original mesh, which should be identical to each other.
The above algorithm is conducted in the spatial domain and applicable to all
meshes without any restriction. The modulation step S should be carefully set,
providing a trade-off between imperceptibility and false alarm probability. Wu
and Yiu [43] have investigated the algorithm on several meshes listed in Table 6.1.
A 2D binary image is chosen as the watermark, which can also be a hashed value.
The capacities of the meshes are also listed in Table 6.1, which depends on the
vertex number and mesh traversal. Wu and Yiu [43] wished to hide sufficient
watermark bits in the mesh so that the modification made to each vertex position
can be efficiently detected. Fig. 6.3(a) and Fig. 6.3(b) illustrate the original mesh
model “dog” and its watermarked version, while Fig. 6.3(c) shows the recovered
one. It can be seen that the watermarking process has not caused noticeable
distortion.
Fig. 6.3. Experimental results on the “dog” mesh with N = 10000 [43]. (a) Original mesh; (b)
Watermarked mesh; (c) Recovered mesh ([2005]IEEE)
to all the vertex positions, with n_x, n_y and n_z uniformly distributed within the
interval [-S, S], respectively. The watermarks were extracted from the modified
meshes with and without the key Key. The centroid positions of the meshes
recovered from those modified meshes were compared with those of the original
meshes. The obtained NC values are all below 1, and the recovered mesh centroid
positions are different from the original ones in most of the cases, so that
modifications on the watermarked mesh can be efficiently detected.
Fig. 6.4. The normalized Hausdorff distance versus the modulation step S [43]
([2005]IEEE)
A general reversible data embedding diagram [44] is illustrated in Fig. 6.5. First,
we compress the original mesh M0 into the cover mesh M that is the object for
payload embedding based on the VQ technique. Although the VQ compression
technique introduces a small amount of distortion to the mesh, as long as the
distortion is small enough, we can ignore it. Besides, the VQ technique can make
the distortion as small as desired simply by choosing a higher-quality
codebook. In this sense, M0 as well as M can both be reversibly authenticated as
long as they are close enough. Then we embed a payload into M by modifying its
prediction mechanisms during the VQ encoding process, and obtain the stego
mesh M'. Before it is sent to the decoder, M' might or might not have been
tampered with by some intentional or unintentional attacks. If the decoder finds
that no tampering happened in M', i.e. M' is authentic, then the decoder can
remove the embedded payload from M' to restore the cover mesh, which results in
a new mesh M". According to the definition of reversible data embedding, the
restored mesh M" should be exactly the same as the cover mesh M, vertex by
vertex and bit by bit.
Fig. 6.5. The general reversible data embedding diagram: the original mesh M0 is
compressed by vector quantization into the cover mesh M; payload embedding
produces the stego mesh M'; if M' is judged authentic, decoding and cover mesh
restoration yield the restored mesh M'' (= M), otherwise M' is declared tampered
Fig. 6.6. The candidate predictions ṽ_n(1), ṽ_n(2) and ṽ_n(3) of the vertex v_n,
derived from the decoded neighboring vertices v̂_n1, v̂_n2 and v̂_n3, together with the
reconstructed vertices v̂_n and v̂_n'
which corresponds to ṽ_n(1) in Fig. 6.6. However, there are two less common
prediction mechanisms as follows:

$$\tilde{v}_n(2) = 2\hat{v}_{n2} - \hat{v}_{n3}, \qquad (6.29)$$

and

$$\tilde{v}_n(3) = 2\hat{v}_{n1} - \hat{v}_{n3}, \qquad (6.30)$$

which correspond to ṽ_n(2) and ṽ_n(3) in Fig. 6.6, respectively. During the
encoding process, we employ the mechanism of Eq.(6.28). The residual e_n = v_n - ṽ_n(1)
In this work, 42507 training vectors were randomly selected from the famous
Princeton 3D mesh library [67] for training the approximate universal codebook
off-line.
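For concreteness, a minimal sketch of such off-line codebook training follows. Plain k-means from scikit-learn is used only as a generic stand-in for the actual codebook design method (MMPDCL [68], used later in this chapter), and `residuals` is assumed to be an (N, 3) array of prediction residual vectors collected from the training meshes.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(residuals, codebook_size=8192, seed=0):
    # Cluster the training residual vectors; each cluster center becomes a codevector.
    km = KMeans(n_clusters=codebook_size, n_init=1, random_state=seed)
    km.fit(residuals)
    return km.cluster_centers_              # shape (codebook_size, 3)

def quantize(residual, codebook):
    # Full-search VQ: index of the nearest codevector to the residual.
    return int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
```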
Before embedding a bit into the current vertex, we compute
D = min{ ‖ṽn(2) − v̂n‖₂, ‖ṽn(3) − v̂n‖₂ }. (6.32)
A payload bit can be embedded in the current vertex only if
‖ṽn(1) − v̂n‖₂ ≤ α × D. (6.33)
Under the above condition, if the payload bit is “0”, we maintain the codeword
index unchanged. Otherwise, if the payload bit is “1”, we make a further judgment
as follows.
Firstly, the nearer prediction to v̂n out of ṽn(2) and ṽn(3) is adopted as the new
prediction of vn. For example, in Fig. 6.6 the new prediction of vn is ṽn(2), and
the residual vector e′n is computed with respect to this new prediction and then
quantized. The quantized residual vector ê′n and its corresponding codeword index
i′n are acquired by matching the codebook, and the new quantized vector is obtained
by adding ê′n to the new prediction. If Eq.(6.37) holds, i.e., the reconstructed
vector after the change of prediction mechanism can be exactly restored to the
original reconstructed vector v̂n before embedding, then the payload bit “1” can be
embedded. In this situation, we replace the codeword index of ên with i′n, while v̂n
remains unchanged.
The payload bit “1” cannot be embedded, even when Eq.(6.33) and Eq.(6.37) are
satisfied, in the following unlikely case: the nearest prediction to v̂′n out of
ṽn(1), ṽn(2) and ṽn(3) is not ṽn(1). This case can be avoided by reducing α or by
increasing the size of the codebook to achieve a better quantization precision.
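A sketch of the per-vertex embedding decision follows, based on the reconstruction of Eqs.(6.32)-(6.33) above. The re-quantization step for the bit “1” case and the acceptance test standing in for Eq.(6.37) are assumptions, since those equations are not reproduced here; `codebook` is an array of residual codevectors and `index` the codeword index chosen by the ordinary encoder.

```python
import numpy as np

def can_carry_bit(p1, p2, p3, v_hat, alpha):
    # Eq. (6.32): distance D to the nearer of the two alternative predictions.
    D = min(np.linalg.norm(p2 - v_hat), np.linalg.norm(p3 - v_hat))
    # Eq. (6.33): the parallelogram prediction p1 must lie within alpha * D of v_hat.
    return np.linalg.norm(p1 - v_hat) <= alpha * D

def embed_bit(bit, p1, p2, p3, v_hat, codebook, index, alpha):
    # Returns (codeword index to transmit, whether the bit was embedded).
    if not can_carry_bit(p1, p2, p3, v_hat, alpha):
        return index, False
    if bit == 0:
        return index, True                          # bit "0": keep the index unchanged
    # Bit "1": re-quantize the residual against the nearer alternative prediction.
    p_new = p2 if np.linalg.norm(p2 - v_hat) <= np.linalg.norm(p3 - v_hat) else p3
    new_index = int(np.argmin(np.linalg.norm(codebook - (v_hat - p_new), axis=1)))
    if np.allclose(p_new + codebook[new_index], v_hat):   # stand-in for Eq. (6.37)
        return new_index, True
    return index, False                             # the unlikely case: "1" cannot be embedded
```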
One flag bit of the side information is required to indicate whether a vertex is
embedded with a payload bit or not. In this work, the bit “1” indicates that the
vertex is embedded with a payload bit while “0” indicates not.
When the flag bit is “1”, we find the residual vector by table lookup operations in
the codebook. Then we compute a temporary vector xn by subtracting the residual
vector from ṽn(1). It can be easily deduced from the payload embedding process
that if the nearest vector to xn out of ṽn(1), ṽn(2) and ṽn(3) is ṽn(1), the
embedded payload bit is “0”; otherwise, the embedded payload bit is “1”.
Whenever Eq.(6.37) is not satisfied during the decoding process, we terminate the
procedure, because the stego mesh must then have been tampered with and is certainly
unauthorized.
When the flag bit is “1”, the nearest vector to xn out of ṽn(1), ṽn(2) and ṽn(3)
is obviously the prediction used for v̂′n. v̂′n is computed by adding this prediction
to ê′n, and v̂n can then easily be acquired based on Eqs.(6.36) and (6.37). After all
vertices have been restored to their original values, the restored mesh is acquired.
Because the embedded payload bits are judged by the nearest
vector to xn out of the three predictions, a distortion within a certain range can be
tolerated. We use the following model to simulate the channel noise effect on the
indices:
e*i = êi + β × ‖êi‖₂ × Ni, (6.38)
where Ni denotes zero-mean Gaussian noise and β controls the noise strength. Under
moderate noise, the perturbed residual still decodes near the original one, so the
watermark bit may still be correctly extracted. Based on this, the proposed method
is robust to noise attacks. Besides, attacks on mesh topology such as mesh
simplification, re-sampling or insection are not feasible, because the geometric
coordinates and topology of the mesh are unknown before the VQ bitstream is
decoded.
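As reconstructed, Eq.(6.38) perturbs each decoded residual codevector by zero-mean Gaussian noise whose strength scales with β and with the codevector's norm. A minimal sketch:

```python
import numpy as np

def add_channel_noise(residuals, beta, seed=None):
    # residuals: (N, 3) array of decoded residual codevectors; beta: noise strength.
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(residuals.shape)
    return residuals + beta * np.linalg.norm(residuals, axis=1, keepdims=True) * noise
```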
The distortion between the original mesh and the processed mesh is measured by
PSNR = 10 log₁₀ ( B / Σ_{i=1}^{B} ‖v′i − vi‖₂² ),
where vi and v′i denote the original and processed positions of the i-th vertex and
B is the number of vertices.
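Assuming the reconstruction of the PSNR formula above (10 log10 of B divided by the summed squared vertex errors), the measurement is straightforward:

```python
import numpy as np

def mesh_psnr(v, v_prime):
    # v, v_prime: (B, 3) arrays of original and processed vertex positions.
    squared_errors = np.sum((v_prime - v) ** 2, axis=1)
    return 10.0 * np.log10(len(v) / np.sum(squared_errors))
```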
As shown in Table 6.2 and Table 6.3, with α increasing, the embedding
capacities for Shark and Chessman increase, while the correlation values between
the extracted payloads and the original ones remain 1.0, with β set to 0.005.
The data in Table 6.4 and Table 6.5 indicate the robustness performance for Shark
and Chessman with α set to 0.8. Here, the capacity is represented by the ratio
of the payload capacity to the number of mesh vertices. From the above results, we
can see that the proposed scheme is effective.
Table 6.2 Capacity and robustness values for Shark with different α (β = 0.005)
α      Capacity   Correlation
0.2    0.004      1.00
0.3    0.016      1.00
0.4    0.027      1.00
0.5    0.049      1.00
0.6    0.078      1.00
0.7    0.110      1.00
0.8    0.134      1.00
0.9    0.166      1.00
1.0    0.209      1.00
Table 6.3 Capacity and robustness values for Chessman with different α (β = 0.005)
α      Capacity   Correlation
0.2    0.002      1.00
0.3    0.022      1.00
0.4    0.060      1.00
0.5    0.091      1.00
0.6    0.145      1.00
0.7    0.171      1.00
0.8    0.198      1.00
0.9    0.219      1.00
1.0    0.243      1.00
Table 6.4 PSNR and robustness values for Shark with different β (α = 0.8)
β        PSNR (dB)   Correlation
0.001    –           1.00
0.002    –           1.00
–        –           1.00
–        48.63       0.99
–        25.02       0.84
Table 6.5 PSNR and robustness values for Chessman with different β (α = 0.8)
β        PSNR (dB)   Correlation
0.001    –           1.00
0.002    –           1.00
–        –           1.00
–        42.19       1.00
–        23.62       0.94
Although the data hiding scheme [44] is very robust to zero-mean Gaussian noise
in a noisy channel, its main drawback is that the data hiding capacity is not high.
To evaluate the capacity enhancement performance, 20 meshes were randomly selected
from the famous Princeton 3D mesh library [67] and 42,507 training vectors were
generated from these meshes for training the approximate universal codebook
off-line. The residual vectors are then used to generate the codebook based on the
minimax partial distortion competitive learning (MMPDCL) method [68] for optimal
codebook design. In this way, we expect the codebook to be suitable for nearly all
triangle meshes for VQ compression, so that it can be pre-stored in each terminal
in the network [45] and the compressed bitstream can be conveniently transmitted on
its own. The improvement in [45] over [44] can be illustrated as follows.
We compute
D = ‖ṽn(2) − v̂n‖₂, (6.39)
and a payload bit can be hidden in the current vertex only if
‖ṽn(1) − v̂n‖₂ ≤ α × D. (6.40)
Under the above condition, if the payload bit is “0”, we maintain the codeword
index unchanged. Otherwise, if the payload bit is “1”, we should make a further
judgment as follows.
ṽn(2) is adopted as the new prediction of vn. Thus we quantize the residual
vector
e′n = vn − ṽn(2). (6.41)
The quantized residual vector ê′n and its corresponding codeword index i′n
are acquired by matching the codebook. Thus, the new quantized vector is
v̂′n = ṽn(2) + ê′n. (6.42)
In other words, if the reconstructed vector after the change of prediction mechanism
can be exactly restored to the original reconstructed vector v̂n before embedding,
the payload bit “1” can be hidden. In this situation, we replace the codeword
index of ên with i′n, while v̂n remains unchanged.
The payload bit “1” cannot be hidden, even when Eq.(6.40) and Eq.(6.44) are
satisfied, in the following unlikely case: the nearest prediction to the vector
v̂n − ê′n out of ṽn(1) and ṽn(2) is not ṽn(2). This case can be avoided by reducing
the value of α or by increasing the size of the codebook to achieve a better
quantization precision. When the payload bit “1” cannot be hidden, we proceed to
the next vertex until a vertex satisfying the hiding conditions is found.
One flag bit of side information is required to indicate whether a vertex is hidden
with a payload bit. In this work, the bit “1” indicates that the vertex is hidden with
a payload bit while “0” indicates not. The vertex order in the payload embedding
process is the same as in the VQ quantization process.
When the flag bit is “1”, we find the codevector specified by the received index by
table lookup operations in the codebook. Then we compute a temporary vector xn
by subtracting this codevector (ên or ê′n) from v̂n. It can be easily deduced from
the payload hiding process that if the nearest vector to xn out of ṽn(1) and ṽn(2)
is ṽn(1), the hidden payload bit is “0”; otherwise, the hidden payload bit is “1”.
Whenever Eq.(6.44) is not satisfied during the decoding process, we terminate
the procedure because the mesh bitstream must have been tampered with and is
certainly unauthorized. Thus, if a mesh bitstream is tampered with, the decoding
process cannot be completed in most cases.
When the hidden payload bit is judged to be “1”, v̂′n is computed by adding
ṽn(2) and ê′n. Then we can easily acquire v̂n according to Eqs.(6.43) and (6.44).
When the hidden payload bit is judged to be “0”, no operation is needed.
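The per-vertex extraction rule just described can be sketched as follows; `v_hat` is the vertex obtained by ordinary VQ decoding, `e_hat` is the residual codevector addressed by the received index, and p1, p2 are the parallelogram and alternative predictions recomputed at the decoder. This is an illustrative sketch of the decision rule, not the full decoder of [45].

```python
import numpy as np

def extract_bit(flag, v_hat, e_hat, p1, p2):
    if flag == 0:
        return None                 # this vertex carries no payload bit
    x = v_hat - e_hat               # temporary vector x_n
    # If x_n is nearer to the parallelogram prediction p1, the residual was taken
    # against p1 and the hidden bit is "0"; otherwise the hidden bit is "1".
    return 0 if np.linalg.norm(x - p1) <= np.linalg.norm(x - p2) else 1
```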
After all vertices have been restored to their original values, the restored mesh
M'' in its uncompressed form is acquired. For content authentication, we compare
the authentication hash hidden in the bitstream with the hash of M''. If they match
exactly, then the mesh content is authentic and the restored mesh is exactly the
same as the cover mesh M. Most likely a tampered mesh will not reach this step,
because, as mentioned, a decoding error could already occur in the payload
extraction process. We reconstruct the restored mesh first, and then authenticate
the content of the stego mesh.
The capacity bottleneck is satisfying Eq.(6.44), which is the same as in [44].
In [44], two uncommon prediction rules are used in addition to the parallelogram
prediction. When the payload bit “1” is embedded, one of the two uncommon
prediction rules is used, resulting in a large residual vector and hence a large
vector quantization error. As a result, Eq.(6.44) is not likely to be satisfied
in [44]. In [45], both ên and ê′n are small, so a small vector quantization error
is to be expected and Eq.(6.44) is more likely to be satisfied. As a result, a
higher payload hiding capacity can be achieved.
Attacks on mesh topology such as mesh simplification, re-sampling or
insection are not feasible, because the geometric coordinates and topology of the
mesh are unknown before the VQ bitstream is decoded.
Residual vectors are kept small after the payload hiding process, so the
statistical characteristics of the bitstream do not change much. Thus, one cannot
judge whether a codeword index corresponds to a payload bit by simply observing
it; the payload can only be extracted by the payload extraction algorithm. The
flag bits in the bitstream can be shuffled with a secure key. In this sense, the
payload is imperceptible.
Any small change to the authenticated mesh will be detected with high
probability, because the chance of obtaining a match between the calculated mesh
hash and the extracted hash equals that of finding a collision for the hash.
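The book does not fix a particular hash function; the sketch below uses SHA-256 over the vertex coordinate bytes purely as an illustration of how the hidden hash and the hash of the restored mesh M'' can be compared bit for bit.

```python
import hashlib
import numpy as np

def mesh_hash_bits(vertices, n_bits=256):
    # Hash the raw coordinate bytes and return the leading n_bits as a 0/1 array.
    digest = hashlib.sha256(np.ascontiguousarray(vertices).tobytes()).digest()
    return np.unpackbits(np.frombuffer(digest, dtype=np.uint8))[:n_bits]

def is_authentic(restored_vertices, extracted_hash_bits):
    computed = mesh_hash_bits(restored_vertices, len(extracted_hash_bits))
    return np.array_equal(computed, extracted_hash_bits)
```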
In addition, in order to reduce the encoding time of VQ, we adopt the
mean-distance-ordered partial codebook search (MPS) [69] as an efficient fast
codevector search algorithm, which uses the mean of the input vector to
dramatically reduce the computational burden of the full search algorithm without
sacrificing performance.
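A sketch of a mean-distance-ordered partial search in the spirit of [69] is given below: codevectors are sorted by their mean, the search starts from the codevector whose mean is closest to the input mean, and candidates are rejected with the bound k(m_x − m_y)² ≤ ‖x − y‖², which follows from the Cauchy–Schwarz inequality. The exact bookkeeping is an assumption rather than the algorithm of [69] verbatim.

```python
import numpy as np

class MPSCodebook:
    def __init__(self, codebook):
        self.codebook = np.asarray(codebook, dtype=float)
        self.means = self.codebook.mean(axis=1)
        self.order = np.argsort(self.means)        # codevector indices sorted by mean
        self.sorted_means = self.means[self.order]

    def nearest(self, x):
        k = self.codebook.shape[1]
        mx = float(np.mean(x))
        start = int(np.searchsorted(self.sorted_means, mx))
        best_idx, best_d2 = -1, np.inf
        lo, hi = start - 1, start
        while lo >= 0 or hi < len(self.order):
            # Examine the not-yet-visited codevector whose mean is closest to mx.
            take_hi = (lo < 0) or (hi < len(self.order) and
                                   abs(self.sorted_means[hi] - mx) <= abs(self.sorted_means[lo] - mx))
            pos = hi if take_hi else lo
            hi, lo = (hi + 1, lo) if take_hi else (hi, lo - 1)
            j = self.order[pos]
            # All remaining codevectors have means at least this far from mx, so if
            # the bound already exceeds the best distortion the search can stop.
            if k * (self.sorted_means[pos] - mx) ** 2 >= best_d2:
                break
            d2 = float(np.sum((self.codebook[j] - x) ** 2))
            if d2 < best_d2:
                best_idx, best_d2 = j, d2
        return best_idx
```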
To evaluate the effectiveness of the proposed method in [45], we adopt the 8
meshes listed in Table 6.6 as the experimental objects. First, we quantize the
original mesh M0 to acquire the cover mesh M with a universal codebook consisting
of 8,192 codewords. The PSNR values between M0 and M are 50.99 dB and 56.40 dB for
the Stanford Bunny and Dragon meshes, respectively. The PSNR values could be
further improved by other sophisticated VQ encoding techniques, but that is not
the aim of this work.
M0, M and the restored meshes M'' for Bunny and Dragon are shown in Fig.
6.7. Comparing these meshes visually, we can see that there are no significant
differences among the Bunny meshes or among the Dragon meshes. The other original
meshes used here are depicted in Fig. 6.8.
Fig. 6.7. Comparisons of rendered meshes (implemented with OpenGL). (a) Original Bunny
mesh; (b) Cover Bunny mesh; (c) Restored Bunny mesh; (d) Original Dragon mesh; (e) Cover
Dragon mesh; (f) Restored Dragon mesh
Fig. 6.8. Other original meshes (implemented with OpenGL). (a) Goldfish; (b) Tiger; (c) Head;
(d) Dove; (e) Fist; (f) Shark
Table 6.6 lists the PSNR values of the vector quantized meshes and the numbers of
their vertices and faces. As shown in Table 6.7, with α increasing, the embedding
capacities for the various meshes increase, while the correlation values between
the extracted payloads and the original ones remain 1.0. Each capacity in all tables
is represented by the ratio of hidden payload bits to the number of mesh vertices.
As is evident in Table 6.7, the capacity for each mesh is as high as about 0.5,
except for the Dragon model. This is because the Dragon model has very high definition and
the prediction error vectors are of small norm compared to the codevectors in the
universal codebook. Payload in this case can be increased by using a larger
codebook that contains enough small codevectors. The payload of the proposed
data hiding method is about 2 to 3 times the capacity reported in [44].
Table 6.6 PSNR values of the vector quantized meshes and numbers of their vertices and faces
Mesh       PSNR (dB)   Number of vertices   Number of faces
Bunny 50.99 8,171 16,301
Dragon 56.40 100,250 202,520
Goldfish 41.15 1,004 1,930
Tiger 44.19 956 1,908
Head 42.31 1,543 2,688
Dove 39.33 649 1,156
Fist 38.82 1,198 2,392
Shark 47.90 1,583 3,164
In this section, we introduce a reversible data hiding scheme for 3D point cloud
models proposed in [40] by the authors of this book. This method exploits the high
correlation among neighboring vertices to embed data. It starts by creating a set
of 8-vertex clusters around randomly selected seed vertices. Then an 8-point
integer DCT is performed on these clusters, and an efficient highest frequency
coefficient modification technique in the integer DCT domain is employed to
modulate the watermark bits. After that, the modified coefficients are inversely
transformed into coordinates in the spatial domain. In data extraction, we need to
recreate the modified clusters first; the other operations are the inverse of the
data hiding process. The original model can be perfectly recovered using the
cluster information if it is intact. This technique is suitable for specific
applications where the content accuracy of the original model must be guaranteed.
Moreover, the method can be easily extended to 3D point cloud model
authentication. The following is a detailed description of our scheme.
6.6.1 Introduction
In recent years, 3D point cloud models have gained the status of one of the
mainstream 3D shape representations. A point cloud is a set of vertices in a 3D
coordinate system; these vertices are usually defined by their X, Y and Z coordinates.
Compared to a polygonal mesh representation, a point set representation has the
advantage of being lightweight to store and transmit, due to its lack of
connectivity information. Point clouds are most often created by 3D scanners.
These devices measure a large number of points on the surface of an object and
output a point cloud as a data file. The point cloud represents the visible surface of
the object that has been scanned or digitized. Point clouds are used for many
purposes, such as creating 3D CAD models for manufactured parts,
metrology/quality inspection, and a multitude of visualization, animation,
rendering and mass customization applications. Point clouds themselves are
generally not directly usable in most 3D applications, and therefore are usually
converted to triangle mesh models, NURBS surface models, or CAD models
through a process commonly referred to as reverse engineering, so that they can be
used for various purposes. Techniques for converting a point cloud to a polygon
mesh include Delaunay triangulation and more recent techniques such as
Marching triangles, Marching cubes, and the Ball-Pivoting algorithm. One
application in which point clouds are directly usable is industrial metrology or
inspection. The point cloud of a manufactured part can be aligned to a CAD model
(or even another point cloud) and compared to check for differences. These
differences can be displayed as color maps that give a visual indicator of the
deviation between the manufactured part and the CAD model. Geometric
dimensions and tolerances can also be extracted directly from the point cloud.
Point clouds can also be used to represent volumetric data, used for example in
medical imaging, where multi-sampling and data compression can be achieved.
Nowadays, most existing data hiding methods are for 3D mesh models.
However, fewer approaches for 3D point cloud models have been developed. In
[70], Wang et al. proposed two spatial-domain-based methods to hide data in point
cloud models. In both schemes, principal component analysis (PCA) is applied to
translate the points’ coordinates to a new coordinate system. In the first scheme, a
list of intervals for each axis is established according to the secret key. Then a
secret bit is embedded into each interval by changing the points’ position. In the
second scheme, a list of macro embedding primitives (MEPs) is located, and then
multiple secret bits are embedded in each MEP. Blind extraction is achieved in
both of the schemes, and robustness against translation, rotation and scaling is
demonstrated. In addition, these schemes are fast and can achieve high data
capacity with insignificant visual distortion in the marked models.
Most existing data hiding processes introduce irreversible degradation to the
original medium. Although slight, this degradation may not be acceptable in
some applications where content accuracy of the original model must be
guaranteed, e.g. a medical model. Hence there is a need for reversible data hiding.
In our context, reversibility refers to the ability to recover the original model in
data extraction. It is in fact advantageous to recover the original model from its
watermarked version, since the distortion introduced by data hiding can then be
compensated. However, up until now, there has been little attention paid to
reversible data-hiding techniques for 3D point cloud models.
The original idea of our method derives from the high correlation among
neighboring vertices. It is well known that the discrete cosine transform (DCT)
exhibits high efficiency in energy compaction of highly correlated data. For
highly correlated data, higher frequencies are statistically associated with
coefficients of smaller amplitude. Usually, the first harmonic coefficient is
larger than the last one, and this fact is the basic principle of our reversible
data hiding scheme.
However, due to the finite representation of numbers in the computer,
floating-point DCT is sometimes not reversible and therefore not able to guarantee
the reversibility of the data hiding process.
In this research, we employ an 8-point integer-to-integer DCT, which exhibits a
similar energy compaction property and ensures perfect recovery of the original
data in data extraction. First, some vertex clusters are chosen as the entries of
the integer DCT; then the 8-point integer DCT is performed on these clusters
and an efficient highest frequency coefficient modification technique is used to
modulate the data bit. After modulation, the inverse integer DCT is used to
transform the modified coefficients into spatial coordinates. In data extraction, we
need to recreate the modified clusters first, and subsequent procedures are the
inverse process of data hiding.
Most existing data hiding methods are for 3D polygonal mesh models. 3D
polygonal meshes consist of coordinates of vertices and their connectivity
information. As we know, these methods can be roughly divided into two
categories: spatial domain based and transform domain based. Approaches based
on spatial domain directly modify either the vertex coordinates or the connectivity,
or both, to embed data. Ohbuchi et al. [71-74] presented a sequence of
watermarking algorithms for polygonal meshes. However, their approaches are
not robust enough to be used for copyright protection. In [75] Benedens developed
a robust watermarking for copyright protection. Nevertheless, this method requires
a significant amount of data for decoding and is therefore not suitable for public
data hiding. Yeo et al. [76] introduced a fragile watermarking for 3D objects
verification. Wagner [77] presented two variations of a robust watermarking
method for general polygonal meshes of arbitrary topology. In contrast, relatively
fewer techniques based on transform domain have been developed. Praun et al.
[78] introduced a watermarking scheme based on wavelet encoding. This method
requires a registration procedure for decoding and is also not public. Ohbuchi et al.
[79] also developed a frequency-domain approach employing mesh spectral analysis.
(1) Use the key K to retrieve the cluster information and hence the modified
clusters.
(2) Perform the forward 8-point integer DCT on the modified clusters.
(3) Demodulate the AC7 coefficients of the clusters and extract the embedded data
sequence.
(4) Perform the inverse 8-point integer DCT on the restored clusters to obtain
the recovered model.
The block diagram of the data embedding and extraction processes is shown
in Fig. 6.9. Details are given in the following sections.
Suppose the cover model M has n vertices V = {v1, v2, …, vn} with 3D space
coordinates vi = (xi, yi, zi) (1 ≤ i ≤ n). The number m of seed vertices, and hence
of clusters, satisfies
m ≤ ⌊n/8⌋. (6.45)
6.6.3.2 Clustering
This step aims to select appropriate point sets as the targets of data hiding. As
the example in Fig. 6.10 shows, a point cluster consists of a given seed sj
(1 ≤ j ≤ m) and its 7 nearest neighbor vertices N1, N2, …, N7, with their distances
to sj ranked in ascending order. The clustering starts from the first seed s1: the
3D Euclidean distances between s1 and the other n − 1 vertices are calculated, the
7 vertices corresponding to the 7 smallest distances are chosen, and a cluster of 8
vertices including the seed is formed. Moving to s2, its 7 nearest points are chosen
from n − 9 distances, excluding the points already visited in the first cluster.
Such operations are repeated for all seeds, and m clusters are created. In general,
let dl denote the number of distances that need to be computed for sl. It can be
estimated by Eq.(6.46):
dl = n − 8l + 7, 1 ≤ l ≤ m. (6.46)
The clusters’ information must be saved for data extraction. In our approach, it
refers to the indices of the vertices of all clusters. A secret key K is used to
permute the index information.
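A minimal sketch of this clustering step is given below; drawing the seeds at random and excluding all seeds from the neighbor search (so that clusters never overlap) are simplifying assumptions, and the number of seeds is assumed not to exceed ⌊n/8⌋.

```python
import numpy as np

def build_clusters(vertices, num_seeds, rng_seed=None):
    # vertices: (n, 3) array; returns a list of clusters, each a list of 8 vertex
    # indices with the seed first and its 7 nearest unused neighbors after it.
    rng = np.random.default_rng(rng_seed)
    n = len(vertices)
    seeds = rng.choice(n, size=num_seeds, replace=False)
    used = np.zeros(n, dtype=bool)
    used[seeds] = True
    clusters = []
    for s in seeds:
        d = np.linalg.norm(vertices - vertices[s], axis=1)
        d[used] = np.inf                        # skip seeds and already clustered vertices
        neighbors = np.argsort(d)[:7]           # the 7 smallest remaining distances
        used[neighbors] = True
        clusters.append([int(s)] + neighbors.tolist())
    return clusters
```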
For highly correlated data, the DCT's energy compaction property results in large
values of the first harmonics. Once the point cloud model is clustered, we apply
the 8-point integer-to-integer DCT introduced in [82] to all clusters. For each
cluster, the coordinates of the 8 vertices are input in the following order: the
seed is the first entry, and the other vertices' coordinates are input successively
as their distance to the seed grows. Taking the example in Fig. 6.10, the input
sequence is sj, N1, N2, …, N7. In this way, 8 DCT coefficients, DC and AC1,
AC2, …, AC7, are acquired from a cluster.
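The per-cluster transform can be sketched as follows. SciPy's ordinary floating-point DCT-II is used here only as a stand-in for the invertible integer-to-integer DCT of [82], which the actual scheme needs in order to guarantee exact reversibility.

```python
import numpy as np
from scipy.fft import dct, idct

def cluster_coefficients(vertices, cluster):
    # `cluster` lists 8 vertex indices, seed first and then neighbors by distance.
    coords = vertices[cluster]                          # shape (8, 3): x, y, z columns
    return dct(coords, type=2, axis=0, norm='ortho')    # rows: DC, AC1, ..., AC7 per axis

def cluster_from_coefficients(coeffs):
    return idct(coeffs, type=2, axis=0, norm='ortho')   # inverse 8-point transform
```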
Since a cluster has x, y and z coordinate sets, it has three sets of DCT coefficients.
Here we only take the example of coefficients associated with the x-coordinates to
demonstrate data embedding and extraction; the operations on the other two sets of
coefficients are similar.
It is reasonable to suppose that, in most cases, the magnitude of the highest
frequency coefficient AC7 is quite small, no larger than the largest magnitude
among |AC1|, |AC2|, …, |AC6|, as long as the 8 neighboring vertices are relatively
closely distributed and thus highly correlated. That is to say, in most cases, the
results of the 8-point integer DCT should satisfy Eq.(6.47):
|AC7| ≤ |ACmax|, (6.47)
where |ACmax| is the maximum magnitude among |ACi| (i = 1, 2, …, 6). All clusters
in the DCT domain can be divided into two categories according to Eq.(6.47): if it
is satisfied, the cluster is a normal cluster (NC); otherwise it is an exceptional
cluster (EC). An NC can be used to embed data, while an EC cannot. In data
embedding, if the cluster is an EC, its coefficients are modified as in Eq.(6.48);
this operation can be regarded as magnitude superposition.
In this way, data are inserted into the clusters, namely into the point cloud model,
and Eq.(6.50) is satisfied for all modified clusters. In a word, to embed data, we
add ACmax to AC7 to modulate the data bit “1”, and keep all the coefficients
unchanged to modulate the data bit “0”. Obviously, the modified AC′7 no longer
satisfies Eq.(6.47), and thus a new exceptional cluster occurs; we regard it as an
artificial exceptional cluster (AE).
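The modulation of one coefficient set (DC, AC1, ..., AC7) can be sketched as follows; treating the addition as a sign-consistent "magnitude superposition", for both ECs and the bit "1" case, is an assumption about the exact form of Eq.(6.48) and of the bit-"1" rule.

```python
import numpy as np

def modulate(coeffs, bit):
    # coeffs: length-8 array (DC, AC1, ..., AC7) for one coordinate axis of a cluster.
    c = np.array(coeffs, dtype=float)
    ac_max = np.max(np.abs(c[1:7]))                 # |AC_max| over AC1..AC6
    shift = (np.sign(c[7]) if c[7] != 0 else 1.0) * ac_max
    if abs(c[7]) <= ac_max:                         # normal cluster (NC), Eq. (6.47)
        if bit == 1:
            c[7] += shift                           # turn the NC into an artificial EC (AE)
        return c, True                              # an NC always carries one bit
    c[7] += shift                                   # exceptional cluster (EC): push |AC7|
    return c, False                                 # beyond 2|AC_max|; no bit is carried
```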
After coefficient modulation, the last step is to perform the inverse 8-point integer
DCT on all clusters and the watermarked model is obtained.
Data extraction includes four steps, i.e. cluster recovery, forward integer DCT,
coefficient demodulation and inverse integer DCT.
The clusters must be recovered first for further data extraction. The same key K
and the cluster information are used to retrieve the index of vertices of all clusters,
and the coordinates of these clusters, i.e. the modified clusters, are used as entries
of the integer DCT.
This step is to perform forward 8-point integer DCT on the modified clusters.
Meanwhile, each cluster is transformed into three sets of DCT coefficients.
This step is the inverse process of the coefficient modulation in data embedding.
We again take the coefficients corresponding to the x-coordinates of a cluster as
an example to describe the demodulation operation. After data embedding, clusters
can be classified into three kinds of states: NC, EC and AE. These three categories
are distinguished according to Eq.(6.51):
NC: |AC′7| ≤ |ACmax|;
EC: |AC′7| ≥ 2|ACmax|; (6.51)
AE: |ACmax| < |AC′7| < 2|ACmax|.
No data is inserted into an EC. A bit “0” is inserted into an NC, while a bit “1” is
inserted into an AE. The extracted data are as shown in Eq.(6.52), where W denotes
the extracted watermark bit:
W = 0, if NC;
W = 1, if AE. (6.52)
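The matching classification and demodulation can be sketched as follows; it follows Eqs.(6.51) and (6.52), performing an equivalent check after undoing the sign-consistent superposition assumed in the embedding sketch above.

```python
import numpy as np

def demodulate(coeffs):
    # Returns (restored coefficient set, extracted bit or None for an EC).
    c = np.array(coeffs, dtype=float)
    ac_max = np.max(np.abs(c[1:7]))
    if abs(c[7]) <= ac_max:                 # NC: a bit "0" was embedded, nothing to undo
        return c, 0
    c[7] -= np.sign(c[7]) * ac_max          # undo the superposition on AC7
    if abs(c[7]) > ac_max:                  # it was an EC: no bit was embedded
        return c, None
    return c, 1                             # it was an AE: a bit "1" was embedded
```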
This step performs the inverse 8-point integer DCT on the demodulated coefficients,
so that the spatial coordinates of the vertices are recovered; namely, the original
model is perfectly restored if it is intact.
To test the performance and effectiveness of our scheme, a point cloud model, the
Stanford Bunny with 34,835 vertices, is selected as the test model, as shown in Fig.
6.11(c). The original data to be hidden are a 32×32 binary image “KUAS”, as shown
in Fig. 6.11(a). Experimental results show that 1,024 bits of data can be inserted
into 502 clusters (i.e. 1,506 sets of coordinates). In other words, in these
clusters 1,024 sets of coordinates belong to NCs, and the remaining 482 sets belong
to ECs. From Figs. 6.11(c) and 6.11(d), only slight degradation is introduced to the
visual quality of the original model. The recovered model is exactly the same as
the original model if the watermarked model suffers no alteration; this can be
verified by the Hausdorff distance between the original model and the recovered
model being equal to 0. Although the original model is not required, our method is
semi-blind, because the cluster information is required for data extraction.
Fig. 6.11. Experimental results. (a) Original watermark; (b) Extracted watermark; (c) Original
Bunny; (d) Watermarked Bunny; (e) Recovered Bunny
AC′i = 2ACi, if |ACi| ≤ |ACmax|;
AC′i = ACi + ACmax, if |ACi| > |ACmax|, i ∈ P2. (6.54)
6.7 Summary
secure hash function. This technique embeds the hash or some invariant features
of the whole mesh as a payload. This method can be localized to blocks rather
than applied to the whole mesh. In addition, it is argued that all typical meshes can
be authenticated and this technique can be further generalized to other data types,
e.g. 2D vector maps, arbitrary polygonal 3D meshes and 3D animations.
Third, a reversible data hiding scheme for a 3D point cloud model was
presented. Its principle is to employ the high correlation among neighboring
vertices to embed data, and an 8-point integer-to-integer DCT is applied to
guarantee the reversibility. Two strategies of transform domain coefficient
modulation/demodulation are introduced. Low distortion is introduced to the
original model and it can be perfectly recovered if intact, using some prior
knowledge.
Future work in 3D model reversible data hiding will involve further improving
the capacity and robustness of the schemes.
References
44(10):2325-2384.
[66] P. H. Chou and T. H. Meng. Vertex data compression through vector quantization.
IEEE Transactions on Visualization and Computer Graphics, 2002,
8(4):373-382.
[67] Princeton University. 3D Model Search Engine. http://shape.cs.princeton.edu.
[68] C. Zhu and L. M. Po. Minimax partial distortion competitive learning for
optimal codebook design. IEEE Trans. on Image Processing, 1998,
7(10):1400-1409.
[69] S. W. Ra and J. K. Kim. Fast mean-distance-ordered partial codebook search
algorithm for image vector quantization. IEEE. Trans. on Circuits and Systems-II,
1993, 40(9):576-579.
[70] C. M. Wang and P. C. Wang. Steganography on point-sampled geometry.
Computers & Graphics, 2006, 30:244-254.
[71] R. Ohbuchi, H. Masuda and M. Aono. Embedding watermark in 3D models. In:
Proceedings of the IDMS’97, Lecture Notes in Computer Science, Springer,
1997, pp. 1-11.
[72] R. Ohbuchi, H. Masuda and M. Aono. Watermarking three-dimensional
polygonal models. In: Proceedings of the ACM Multimedia’97, 1997, pp.
261-272.
[73] R. Ohbuchi, H. Masuda and M. Aono. Watermarking three-dimensional
polygonal models through geometric and topological modifications. IEEE
Journal on Selected Areas in Communications, 1998, 16(4):551-560.
[74] R. Ohbuchi, H. Masuda and M. Aono. Watermark embedding algorithms for
geometrical and non-geometrical targets in three-dimensional polygonal models.
Computer Communications, 1998.
[75] O. Benedens. Geometry-based watermarking of 3D models. IEEE Computer
Graphics and Applications, 1999, 19(1):46-55.
[76] B. L. Yeo and M. M. Yeung. Watermarking 3D Objects for Verification. IEEE
Computer Graphics and Applications, 1999, 19(1):36-45.
[77] M. G. Wagner. Robust watermarking of polygonal meshes. In: Proceedings of
Geometric Modeling and Processing, 2000, pp. 10-12.
[78] E. Praun, H. Hoppe and A. Finkelstein. Robust mesh watermarking. Microsoft
Technical Report TR-99-05, 1999.
[79] R. Ohbuchi, A. Mukaiyama and S. Takahashi. A frequency-domain approach to
watermarking 3D shapes. In: Proc. EUROGRAPHICS 2002, 2002.
[80] R. Ohbuchi, A. Mukaiyama and S. Takahashi. Watermarking a 3D shape model
defined as a point set. In: Proc. of Cyber Worlds 2004, IEEE Computer Society
Press, 2004, pp. 392-399.
[81] M. Voigt, B. Yang and C. Busch. Reversible watermarking of 2D-vector
data. In: Proceedings of the Multimedia and Security Workshop 2004
(MM&SEC’04), 2004, pp. 160-165.
[82] G. Plonka and M. Tasche. Invertible integer DCT algorithms. Appl. Comput.
Harmon. Anal., 2003, 15:70-88.