Computational Photography: Methods and Applications

Series Editor
Rastislav Lukac
Foveon, Inc./Sigma Corporation
San Jose, California, U.S.A.

Edited by
Rastislav Lukac
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Photography suits the temper of this age — of active bodies and minds. It is a perfect medium for one whose mind is teeming with ideas, imagery, for a prolific worker who would be slowed down by painting or sculpting, for one who sees quickly and acts decisively, accurately.
Preface xi
Contributors xix
1 Single Capture Image Fusion 1
James E. Adams, Jr., John F. Hamilton, Jr., Mrityunjay Kumar,
Efraı́n O. Morales, Russell Palum, and Bruce H. Pillman
2 Single Capture Image Fusion with Motion Consideration 63
James E. Adams, Jr., Aaron Deever, John F. Hamilton, Jr.,
Mrityunjay Kumar, Russell Palum, and Bruce H. Pillman
3 Lossless Compression of Bayer Color Filter Array Images 83
King-Hong Chung and Yuk-Hee Chan
4 Color Restoration and Enhancement in the Compressed Domain 103
Jayanta Mukherjee and Sanjit K. Mitra
5 Principal Component Analysis-Based Denoising of Color Filter
Array Images 131
Rastislav Lukac and Lei Zhang
6 Regularization-Based Color Image Demosaicking 153
Daniele Menon and Giancarlo Calvagno
7 Super-Resolution Imaging 175
Bahadir K. Gunturk
8 Image Deblurring Using Multi-Exposed Images 209
Seung-Won Jung and Sung-Jea Ko
9 Color High-Dynamic Range Imaging: Algorithms for Acquisition
and Display 227
Ossi Pirinen, Alessandro Foi, and Atanas Gotchev
10 High-Dynamic Range Imaging for Dynamic Scenes 259
Celine Loscos and Katrien Jacobs
11 Shadow Detection in Digital Images and Videos 283
Csaba Benedek and Tamás Szirányi
Preface
Computational photography is a new and rapidly developing research field. It has evolved
from computer vision, image processing, computer graphics, and applied optics, and refers
broadly to computational imaging techniques that enhance or extend the capabilities of
digital photography. The output of these techniques is an image which cannot be produced
by today’s common imaging solutions and devices. Despite the recent establishment of
computational photography as a recognized research area, numerous commercial products
capitalizing on its principles have already appeared in diverse market applications due to the
gradual migration of computational algorithms from computers to image-enabled consumer
electronic devices and imaging software.
Image processing methods for computational photography are of paramount importance
in the research and development community specializing in computational imaging due
to the urgent needs and challenges of emerging digital camera applications. There exist
consumer digital cameras which use face detection to better focus and expose the image,
while others perform preliminary panorama stitching directly in the camera and use local
tone mapping to manage difficult lighting situations. There are also successful attempts
to use the information from a set of images, for instance, to reduce or eliminate image
blur, suppress noise, increase image resolution, and remove objects from or add them to a
captured image.
Thus it is not difficult to see that many imaging devices and applications already rely on
research advances in the field of computational photography. The commercial proliferation
of digital still and video cameras, image-enabled mobile phones and personal digital as-
sistants, surveillance and automotive apparatuses, machine vision systems, and computer
graphic systems has increased the demand for technical developments in the area. It is
expected that the growing interest in image processing methods for computational photog-
raphy and their use in emerging applications such as digital photography and art, visual
communication, online sharing in social networks, digital entertainment, surveillance, and
multimedia will continue.
The purpose of this book is to fill the existing gap in the literature and comprehensively
cover the system design, implementation, and application aspects of image processing-
driven computational photography. Due to the rapid developments in specialized areas of
computational photography, the book is a contributed volume in which well-known experts
deal with specific research and application problems. It presents the state-of-the-art as well
as the most recent trends in image processing methods and applications for computational
photography. It serves the needs of different readers at different levels. It can be used as a textbook in support of graduate courses in computer vision, digital imaging, visual data processing, and computer graphics, or as a stand-alone reference for graduate students, researchers, and practitioners. For example, a researcher can use it as an up-to-date reference since it offers a broad survey of the relevant literature. Development engineers, techni-
cal managers, and executives may also find it useful in the design and implementation of
various digital image and video processing tasks.
This book provides a strong, fundamental understanding of theory and methods, and
a foundation upon which solutions for many of today’s most interesting and challenging
computational imaging problems can be built. It details recent advances in digital imag-
ing, camera image processing, and computational photography methods and explores their
applications. The book begins by focusing on single capture image fusion technology for
consumer digital cameras. This is followed by the discussion of various steps in a camera
image processing pipeline, such as data compression, color correction and enhancement,
denoising, demosaicking, super-resolution reconstruction, deblurring, and high-dynamic
range imaging. Then, the reader’s attention is turned to bilateral filtering and its applica-
tions, painterly rendering of digital images, shadow detection for surveillance applications,
and camera-driven document rectification. The next part of the book presents machine
learning methods for automatic image colorization and digital face beautification. The
remaining chapters explore light field acquisition and processing, space-time light field
rendering, and dynamic view synthesis with an array of cameras.
Chapters 1 and 2 discuss concepts and technologies that allow effective design and high
performance of single-sensor digital cameras. Using a four-channel color filter array, an
image capture system can produce images with high color fidelity and improved signal-to-
noise performance relative to traditional three-channel systems. This is accomplished by
adding a panchromatic or spectrally nonselective channel to the digital camera sensor to
decouple sensing luminance information from chrominance information. To create a full-
color image on output, as typically required for storage and display purposes, single capture
image fusion techniques and methodology are used as the means for reducing the original
four-channel image data down to three channels in a way that makes the best use of the
additional fourth channel. Single capture image fusion with motion consideration enhances
these concepts to provide a capture system that can additionally address the issue of motion
occurring during a capture. By allowing different integration times for the panchromatic
and color pixels, an imaging system produces images with reduced motion blur.
Chapters 3 and 4 address important issues of data compression and color manipulation in
the compressed domain of captured camera images. Lossless compression of Bayer color
filter array images has become a de facto standard for image storage in single-lens reflex digital cameras, since stored raw images can be fully processed on a personal computer to achieve higher quality than is possible with resource-limited in-camera processing.
This approach poses a unique challenge of spectral decorrelation of spatially interleaved
samples of three or more sampling colors. Among a number of reversible lossless trans-
forms, algorithms that rely on predictive and entropy coding seem to be very effective in
removing statistical redundancies in both spectral and spatial domains using the spatial
correlation in the raw image and the statistical distribution of the prediction residue.
Color restoration and enhancement in the compressed domain address the problem of
adjusting a camera image represented in the block discrete cosine transform space. The
goal is to compensate for shifts from the colors perceived in the scene, caused by the ambient illumination, and for a poor dynamic range of brightness values caused by strong background illumination. The objective of restoring colors from varying illumination is to
Chapters 9 and 10 deal with the enhancement of the dynamic range of an image us-
ing multiple captures of the scene. Color high-dynamic range imaging enables access to
a wider range of color values than traditional digital photography. Methods for capture,
composition, and display of high-dynamic range images have become quite common in
modern imaging systems. In particular, luminance-chrominance space-driven composition
techniques seem to be suitable in various real-life situations where the source images are
corrupted by noise and/or misalignment and the faithful treatment of color is essential. As
the objects in the scene often move during the capture process, high-dynamic range imag-
ing for dynamic scenes is needed to enhance the performance of an imaging system and
extend the range of its applications by integrating motion and dynamic scenes in underly-
ing technology targeting both photographs and movies.
Chapter 11 focuses on shadow detection in digital images and videos, with application to
video surveillance. Addressing the problem of color modeling of cast shadows in real-life
situations requires a robust adaptive model for shadow segmentation without strong restric-
tions on a priori probabilities, image quality, objects’ shapes, and processing speed. Such
a modeling framework can be generalized for and used to compare different color spaces,
as the appropriate color space selection is key to reliable shadow detection and classifica-
tion, for example, using color-based pixel clustering and Bayesian foreground/background
shadow segmentation.
Chapter 12 presents another way of using information from more than one image. Doc-
ument image rectification using single-view or two-view camera input in digital camera-
driven systems for document image acquisition, analysis, and processing represents an
alternative to flatbed scanners. A stereo-based method can be employed to complete the
rectification task using explicit three-dimensional reconstruction. Since the method works
irrespective of document contents and removes specular reflections, it can be used as a pre-
processing tool for optical character recognition and digitization of figures and pictures. In
situations when a user-provided bounding box is available, a single-view method allows
rectifying a figure inside this bounding box in an efficient, robust, and easy-to-use manner.
Chapter 13 discusses both the theory and applications of the bilateral filter. This filter is
widely used in various image processing and computer vision applications due to its ability
to preserve edges while performing spatial smoothing. The filter is shown to relate to pop-
ular approaches based on robust estimation, weighted least squares estimation, and partial
differential equations. It has a number of extensions and variations that make the bilateral
filter an indispensable tool in modern image and video processing systems, although a fast
implementation is usually critical for practical applications.
Chapter 14 focuses on painterly rendering methods. These methods convert an input
image into an artistic image in a given style. Artistic images can be generated by simulating
the process of putting paint on paper or canvas. A synthetic painting is represented as a list
of brush strokes that are rendered on a white or canvas textured background. Brush strokes
can be mathematically modeled or their attributes can be extracted from the source image.
Another approach is to abstract from the classical tools that have been used by artists and
focus on the visual properties, such as sharp edges or absence of natural texture, which
distinguish painting from photographic images.
Chapters 15 and 16 deal with two training-based image analysis and processing steps.
Machine learning methods for automatic image colorization focus on adding colors to a
grayscale image without any user intervention. This can be done by formally stating the
color prediction task as an optimization problem with respect to an energy function. Differ-
ent machine learning methods, in particular nonparametric methods such as Parzen window
estimators and support vector machines, provide a natural and efficient way of incorporat-
ing information from various sources. In order to cope with the multimodal nature of the
problem, the solution can be found directly at the global level with the help of graph cuts,
which makes the approach more robust to noise and local prediction errors and allows re-
solving large-scale ambiguities and handling cases with more texture noise. The approach
provides a way of learning local color predictors along with spatial coherence criteria and
permits a large number of possible colors.
In another application of training-based methods, machine learning for digital face beau-
tification constitutes a powerful tool for automatically enhancing the attractiveness of a face
in a given portrait. It aims at introducing only subtle modifications to the original image
by manipulating the geometry of the face, such that the resulting beautified face main-
tains a strong, unmistakable similarity to the original. Using a variety of facial locations to
calculate a feature vector of a given face, a feature space is searched for a vector that corre-
sponds to a more attractive face. This can be done by employing an automatic facial beauty
rating machine which has the form of two support vector regressors trained separately on
a database of female and male faces with accompanying facial attractiveness ratings col-
lected from a group of human raters. The feature vector output by the regressor serves as
a target to define a two-dimensional warp field which maps the original facial features to
their beautified locations. The method augments image enhancement and retouching tools
available in existing digital image editing packages.
Finally, Chapters 17 and 18 discuss various light field-related issues. High-quality light
field acquisition and processing methods rely on various hardware and software approaches
to overcome the lack of spatial resolution and to avoid photometric distortion and aliasing in output images. A programmable aperture is an example of a device for high-resolution light field acquisition. It exploits the fast multiple-exposure feature of digital sensors to capture the light field sequentially without trading off sensor resolution, which, in turn, enables the multiplexing of light rays. The quality of the captured light field can be further improved by two algorithms: a calibration algorithm that removes the photometric distortion unique to the light field, estimating the distortion directly from the captured data without any reference object, and a depth estimation algorithm that exploits the multi-view property of the light field and visibility reasoning to generate view-dependent depth maps for view interpolation. The device and algorithms constitute a complete system for high-quality light
field acquisition.
Light field-style rendering techniques have an important position among image-based
modeling methods for dynamic view synthesis with an array of cameras. These techniques
can be extended for dynamic scenes, constituting an approach termed as space-time light
field rendering. Instead of capturing the dynamic scene in strict synchronization and treat-
ing each image set as an independent static light field, the notion of a space-time light field
assumes a collection of video sequences that may or may not be synchronized and can have
different capture rates. In order to be able to synthesize novel views from any viewpoint at
any instant in time, feature correspondences are robustly identified across frames and used as landmarks to digitally synchronize the input frames and improve view synthesis quality. This concept is further elaborated in reconfigurable light field rendering, where both
the scene content and the camera configurations can be dynamic. Automatically adjusting
the cameras’ placement allows achieving optimal view synthesis results for different scene
contents.
The bibliographic links included in all chapters of the book provide a good basis for
further exploration of the presented topics. The volume includes numerous examples
and illustrations of computational photography results, as well as tables summarizing the
results of quantitative analysis studies. Complementary material is available online at
http://www.colorimageprocessing.org.
I would like to thank the contributors for their effort, valuable time, and motivation to
enhance the profession by providing material for a wide audience while still offering their
individual research insights and opinions. I am very grateful for their enthusiastic support,
timely response, and willingness to incorporate suggestions from me to improve the quality
of contributions. I also thank Rudy Guttosch, my colleague at Foveon, Inc., for his help
with proofreading some of the chapters. Finally, a word of appreciation for CRC Press /
Taylor & Francis for giving me the opportunity to edit a book on computational photogra-
phy. In particular, I would like to thank Nora Konopka for supporting this project, Jennifer
Ahringer for coordinating the manuscript preparation, Shashi Kumar for his LaTeX assis-
tance, Karen Simon for handling the final production, Phoebe Roth for proofreading the
book, and James Miller for designing the book cover.
Rastislav Lukac
Foveon, Inc. / Sigma Corp., San Jose, CA, USA
E-mail: lukacr@colorimageprocessing.com
Web: www.colorimageprocessing.com
The Editor
ber 2008). He is a Digital Imaging and Computer Vision book series founder and editor
for CRC Press / Taylor & Francis. He serves as a technical reviewer for various scientific
journals, and participates as a member of numerous international conference committees.
He is the recipient of the 2003 North Atlantic Treaty Organization / National Sciences
and Engineering Research Council of Canada (NATO/NSERC) Science Award, the Most
Cited Paper Award for the Journal of Visual Communication and Image Representation for
the years 2005–2007, and the author of the #1 article in the ScienceDirect Top 25 Hottest
Articles in Signal Processing for April–June 2008.
Contributors
James E. Adams, Jr. Eastman Kodak Company, Rochester, New York, USA
Yasemin Altun Max Planck Institute for Biological Cybernetics, Tübingen, Germany
Csaba Benedek Computer and Automation Research Institute, Budapest, Hungary
Ilja Bezrukov Max Planck Institute for Biological Cybernetics, Tübingen, Germany
Giancarlo Calvagno University of Padova, Padova, Italy
Yuk-Hee Chan The Hong Kong Polytechnic University, Hong Kong SAR
Guillaume Charpiat INRIA, Sophia-Antipolis, France
Homer H. Chen National Taiwan University, Taipei, Taiwan R.O.C.
Nam Ik Cho Seoul National University, Seoul, Korea
King-Hong Chung The Hong Kong Polytechnic University, Hong Kong SAR
Aaron Deever Eastman Kodak Company, Rochester, New York, USA
Gideon Dror The Academic College of Tel-Aviv-Yaffo, Tel Aviv, Israel
Alessandro Foi Tampere University of Technology, Tampere, Finland
Atanas Gotchev Tampere University of Technology, Tampere, Finland
Bahadir K. Gunturk Louisiana State University, Baton Rouge, Louisiana, USA
John F. Hamilton, Jr. Rochester Institute of Technology, Rochester, New York, USA
Matthias Hofmann Institute for Biological Cybernetics, Tübingen, Germany
Katrien Jacobs University College London, London, UK
Seung-Won Jung Korea University, Seoul, Korea
Sung-Jea Ko Korea University, Seoul, Korea
Hyung Il Koo Seoul National University, Seoul, Korea
Mrityunjay Kumar Eastman Kodak Company, Rochester, New York, USA
Chia-Kai Liang National Taiwan University, Taipei, Taiwan R.O.C.
Celine Loscos Universitat de Girona, Girona, Spain
Rastislav Lukac Foveon, Inc. / Sigma Corp., San Jose, California, USA
Daniele Menon University of Padova, Padova, Italy
Sanjit K. Mitra University of Southern California, Los Angeles, California, USA
Efraı́n O. Morales Eastman Kodak Company, Rochester, New York, USA
Jayanta Mukherjee Indian Institute of Technology, Kharagpur, India
Russell Palum Eastman Kodak Company, Rochester, New York, USA
Giuseppe Papari University of Groningen, Groningen, The Netherlands
Nicolai Petkov University of Groningen, Groningen, The Netherlands
Bruce H. Pillman Eastman Kodak Company, Rochester, New York, USA
Ossi Pirinen OptoFidelity Ltd., Tampere, Finland
Bernhard Schölkopf Institute for Biological Cybernetics, Tübingen, Germany
Tamás Szirányi Péter Pázmány Catholic University, Budapest, Hungary
Huaming Wang University of California, Berkeley, Berkeley, California, USA
Ruigang Yang University of Kentucky, Lexington, Kentucky, USA
Cha Zhang Microsoft Research, Redmond, Washington, USA
Lei Zhang The Hong Kong Polytechnic University, Hong Kong
1
Single Capture Image Fusion
James E. Adams, Jr., John F. Hamilton, Jr., Mrityunjay Kumar, Efraı́n O. Morales,
Russell Palum, and Bruce H. Pillman
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Image Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Chapter Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Color Camera Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Three-Channel Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.2 Four-Channel Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.3 Color Fidelity versus Spatial Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3 Demosaicking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3.1 Special Functions and Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3.2 The Panchromatic Sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3.3 Demosaicking Algorithm Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3.3.1 Rectilinear Grid-Based Nonadaptive Interpolation . . . . . . . . . . . 20
1.3.3.2 Diamond Grid-Based Nonadaptive Interpolation . . . . . . . . . . . . . 21
1.3.4 The Bayer Color Filter Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.3.4.1 Bilinear Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.3.4.2 Adaptive Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.3.5 Four-Channel CFA Demosaicking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.3.5.1 Adaptive Linear Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.3.5.2 Adaptive Cubic Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.3.5.3 Alternating Panchromatic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.3.5.4 Double Panchromatic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.3.5.5 Triple Panchromatic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
1.3.5.6 Two-Color Alternating Panchromatic . . . . . . . . . . . . . . . . . . . . . . . 37
1.3.6 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.4 Noise and Noise Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.4.1 Image Noise Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.4.2 Image Noise Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
1.4.2.1 High-Frequency Noise Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
1.4.2.2 Mid-Frequency and Low-Frequency Noise Reduction . . . . . . . 49
1.4.3 Image Sharpening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
1.5 Example Single-Sensor Image Fusion Capture System . . . . . . . . . . . . . . . . . . . . . . . 52
1.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
1.1 Introduction
A persistent challenge in the design and manufacture of digital cameras is how to im-
prove the signal-to-noise performance of these devices while simultaneously maintaining
high color fidelity captures. The present industry standard three-color channel system is
constrained in that the fewest possible color channels are employed for the purposes of
both luminance and chrominance image information detection. Without additional degrees
of freedom, for instance, additional color channels, digital camera designs are generally
limited to solutions based on improving sensor hardware (larger pixels, lower readout noise,
etc.) or better image processing (improved denoising, system-wide image processing chain optimization, etc.). Because the system is constrained to three channels, the requirements for improved signal-to-noise and high color fidelity are frequently in opposition to each other, thereby limiting how much either can be improved. For example,
to improve the light sensitivity of the sensor system, one might wish to make the color
channels broader spectrally. While this results in lower image noise in the raw capture, the
color correction required to restore the color fidelity amplifies the noise so much that there
can be a net loss in overall signal-to-noise performance.
This chapter explores the approach of adding a fourth, panchromatic or spectrally non-
selective, channel to the digital camera sensor in order to decouple sensing luminance (spa-
tial) information from chrominance (color) information [1]. Now the task of improving
signal-to-noise can be largely confined to just the panchromatic channel while leaving the
requirement for high color fidelity captures to the three color channels. As any such system
must eventually create a full-color image on output, some means is needed for reducing
the original four-channel image data down to three channels in a way that makes the best
use of the additional fourth channel. To this end, image fusion techniques and method-
ology are selected as the means for accomplishing this task. Therefore, the remainder of
this introduction first reviews the concept of image fusion and then sets the stage for how
this body of work can be applied to the problem of capturing and processing a single four-
channel digital camera capture to produce a full-color image with improved signal-to-noise
and high color fidelity.
The concept of data fusion is not new. It is naturally performed by living beings to
achieve a more accurate assessment of the surrounding environment and identification of
threats, thereby improving their chances of survival. For example, human beings and ani-
mals use a combination of sight, touch, smell, and taste to perceive the quality of an edible
object. Tremendous advances in sensor and hardware technology and signal processing
techniques have provided the ability to design software and hardware modules to mimic
the natural data fusion capabilities of humans and animals [3].
Data fusion applied to image-based applications is commonly referred to as image fusion.
The goal of image fusion is to extract information from input images such that the fused
image provides better information for human or machine perception as compared to any
of the input images [8], [9], [10], [11]. Image fusion has been used extensively in various
areas of image processing such as digital camera imaging, remote sensing, and biomedical
imaging [12], [13], [14].
From the perspective of fusion, information present in the observed images that are to
be fused can be broadly categorized in the following three classes: i) common informa-
tion – these are features that are present in all the observed images, ii) complementary
information – features that are present only in one of the observed images, and iii) noise –
features that are random in nature and do not contain any relevant information. Note that
this categorization of the information could be global or local in nature. A fusion algorithm
should be able to select the feature type automatically and then fuse the information ap-
propriately. For example, if the features are similar, then the algorithm should perform an
operation similar to averaging, but in the case of complementary information, should select
the feature that contains relevant information.
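As a concrete, deliberately simplified illustration of this behavior, the Python sketch below fuses two registered grayscale images: it averages them where their local activity is similar and selects the more detailed input where the information is complementary. The window size, the variance-based activity measure, and the threshold are arbitrary illustrative choices, not a method described in this book; NumPy and SciPy are assumed to be available.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fuse_pixelwise(a, b, win=7, ratio_thresh=1.5):
    """Toy low-level fusion of two registered grayscale images.

    Local activity (variance over a small window) decides whether the two
    inputs carry similar or complementary information at each pixel.
    """
    a = a.astype(np.float64)
    b = b.astype(np.float64)

    def local_variance(x):
        mean = uniform_filter(x, win)
        mean_sq = uniform_filter(x * x, win)
        return np.maximum(mean_sq - mean * mean, 0.0)

    va, vb = local_variance(a), local_variance(b)

    fused = 0.5 * (a + b)                               # similar information: average (noise reduction)
    fused = np.where(va > ratio_thresh * vb, a, fused)  # detail present only in a: keep a
    fused = np.where(vb > ratio_thresh * va, b, fused)  # detail present only in b: keep b
    return fused

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img_a = rng.normal(128, 5, (64, 64))
    img_b = img_a + rng.normal(0, 5, (64, 64))
    print(fuse_pixelwise(img_a, img_b).shape)
```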
Due to the large number of applications as well as the diversity of fusion techniques, con-
siderable efforts have been made in developing standards for data fusion. Several models
for data fusion have been proposed in the recent past [15], [16]. One of the models com-
monly used in signal processing applications is the three-level fusion model that is based
on the levels at which information is represented [17]. This model classifies data fusion
into three levels depending on the way information present in the data/image is represented
and combined. If the raw images are directly used for fusion then it is called low level
fusion. In the case when the features of the raw images such as edge, texture, etc., are used
for fusion then it is called feature or intermediate level fusion. The third level of fusion
is known as decision or high level fusion in which decisions made by several experts are
combined. A detailed description of these three levels of fusion is given below.
• Low level fusion: At this level, several raw images are combined to produce a new
“raw” image that is expected to be more informative than the inputs [18], [19], [20],
[21], [22]. The main advantage of low level fusion is that the original measured
quantities are directly involved in the fusion process. Furthermore, algorithms are
computationally efficient and easy to implement. Low level fusion requires a precise
registration of the available images.
• Feature or intermediate level fusion: Feature level fusion combines various features
such as edges, corners, lines, and texture parameters [23], [24], [25]. In this model
several feature extraction methods are used to extract the features of input images and
a set of relevant features is selected from the available features. Methods of feature
fusion include, for example, Principal Component Analysis (PCA) and Multi-layer
Perceptron (MLP) [26]. The fused features are typically used for computer vision
applications such as segmentation and object detection [26]. In feature level fusion,
not all pixels of the input images contribute to the fusion. Only the salient features
of the images are extracted and fused.
• Decision or high level fusion: This stage combines decisions from several ex-
perts [27], [28], [29], [30]. Methods of decision fusion include, for instance, sta-
tistical methods [27], voting methods [28], fuzzy logic-based methods [29], and
the Dempster-Shafer method [30].
Typically, in image fusion applications, input images for fusion are captured using mul-
tiple sensors. For example, in a typical remote sensing system, multispectral sensors are
used to obtain information about the Earth’s surface and image fusion techniques are used
to fuse the outputs of multispectral sensors to generate a thematic map [4], [6]. As another
example, recent developments in medical imaging have resulted in many imaging sensors
to capture various aspects of the patient’s anatomy and metabolism [31], [32]. For exam-
ple, magnetic resonance imaging (MRI) is very useful for defining anatomical structure
whereas metabolic activity can be captured very reliably using positron emission tomogra-
phy (PET). The concept of image fusion is used to combine the output of MRI and PET
sensors to obtain a single image that describes anatomical as well as metabolic activity of
the patient effectively [33].
Image fusion techniques can also be applied to fuse multiple images obtained from a
single sensor [34], [35], [36], [37], [38], [39], [40]. An excellent example of generating
complementary information from a single sensor is the Bayer color filter array (CFA) pat-
tern [41] extensively used in digital cameras. To reduce cost and complexity, most digital
cameras are designed using a single CCD or CMOS sensor that has the panchromatic re-
sponsivity of silicon. As shown in Figure 1.1, the panchromatic responsivity, which passes all visible wavelengths, is higher than the color (red, green, and blue) responsivities [42].
FIGURE 1.1
Relative responsivity of panchromatic, red, green, and blue channels (relative channel responsivity versus wavelength, 350–700 nm).
FIGURE 1.2
Bayer CFA pattern:

G R G R
B G B G
G R G R
B G B G
Note that the pixels with panchromatic responsivity are spectrally nonselective in nature.
Therefore, digital cameras use a color filter array (CFA) to capture color images, an exam-
ple of which is the Bayer CFA pattern shown in Figure 1.2. The CFA pattern provides only a single color sample at each pixel location; the missing color samples are estimated using a CFA interpolation or demosaicking algorithm [43], [44], [45]. An example of a Bayer CFA image is shown in Figure 1.3a. The inset shows
the individual red, green, and blue pixels in the captured image. The corresponding full
color image generated by applying CFA interpolation to the Bayer CFA image is shown in
Figure 1.3b.
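As a minimal illustration of CFA interpolation, the sketch below performs classical bilinear demosaicking of a GRBG Bayer mosaic (the layout of Figure 1.2) using the standard bilinear convolution kernels. It is a generic textbook baseline, not the algorithm of any particular camera or of the adaptive methods discussed later in this chapter; NumPy and SciPy are assumed.

```python
import numpy as np
from scipy.ndimage import convolve

def bayer_bilinear_demosaic(cfa):
    """Bilinear demosaicking of a GRBG Bayer mosaic (rows: G R G R / B G B G)."""
    h, w = cfa.shape
    r_mask = np.zeros((h, w)); r_mask[0::2, 1::2] = 1
    b_mask = np.zeros((h, w)); b_mask[1::2, 0::2] = 1
    g_mask = 1 - r_mask - b_mask

    # Standard bilinear kernels: green uses its 4 axial neighbors; red/blue use
    # up to 4 axial or diagonal neighbors depending on the pixel position.
    k_g = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]) / 4.0
    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 4.0

    g = convolve(cfa * g_mask, k_g)
    r = convolve(cfa * r_mask, k_rb)
    b = convolve(cfa * b_mask, k_rb)
    return np.dstack([r, g, b])

if __name__ == "__main__":
    mosaic = np.random.default_rng(1).uniform(0, 255, (8, 8))
    print(bayer_bilinear_demosaic(mosaic).shape)   # (8, 8, 3)
```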
FIGURE 1.4
Example image processing chain for a four-channel system.
are exposed for the same amount of time during the capture period. This is, of course,
the standard situation. In Chapter 2, this assumption will be dropped and systems with
different, although concurrent, exposure times for panchromatic and color pixels will be
described. As such systems are ideal for the detection and compensation of image motion,
it is convenient to delay all motion-related considerations until the next chapter.
Figure 1.4 is an example image processing chain for the four-channel system that will
be the reference for the discussion in this chapter. Section 1.2 focuses on CFA image for-
mation, and color and tone correction. It begins by reviewing the fundamentals of digital
camera color imaging using standard three-channel systems. From this basis it extends
the discussion into the design of four-channel CFA patterns that produce high color fi-
delity while providing the additional degree of freedom of a panchromatic channel. It is
this panchromatic channel that, in turn, will be used to enable the use of image fusion
techniques to produce higher quality images not possible with typical three-channel sys-
tems. Section 1.3 discusses the problem of demosaicking four-channel CFA images both
from the perspective of algorithm design and spatial frequency response. Both adaptive
and nonadaptive approaches are presented and comparisons are made to standard Bayer
CFA processing methods and results. Section 1.4 focuses on noise cleaning and sharpen-
ing. This includes an analytical investigation into the effects of the relative photometric
gain differences between the panchromatic and color channels and how, through the use of
image fusion, these gain differences can result in a fundamentally higher signal-to-noise
image capture system compared to three-channel systems. Explicit investigations of how
image fusion techniques are applied during both the demosaicking and sharpening opera-
tions to achieve these advantages are discussed. Section 1.5 brings the preceding material
in the chapter together to illustrate with an example how the entire system shown in Fig-
ure 1.4 uses image fusion techniques to produce the final image. The performance of this
system compared to a standard Bayer system is assessed both numerically and qualitatively
through example images. Finally, the chapter is summarized in Section 1.6.
different illuminants make the reproduction of a color image a challenge well beyond the
technological one of producing different color stimuli. While color reproduction is very
complex, the problem of camera design is slightly simpler. The camera design objective
is to capture scene information to support the best possible reproduction and to do this
under a wide range of imaging conditions. Because the reproduction is judged by a human
observer, information about the human visual system is used in determining whether image
information is common, complementary, or noise.
To set the context for this image capture problem, some history of color imaging will be
reviewed. One of the first photographic color reproduction systems was demonstrated in
Reference [48], later described in Reference [46]. This system captured black-and-white
photographs of a scene through red, green, and blue filters and then projected them through the same filters. This technique was used as a demonstration of the trichromatic theory of color vision, although the film used was not sensitive to red light, which limited the quality of the
demonstration [49]. Reference [50] reproduced color images by a very different technique.
This technique captured the spectrum of light from a scene in an analog fashion and re-
produced the actual spectrum when viewed by reflected light under the correct conditions.
This allowed good reproduction of color, yet the process was extremely slow — the emul-
sion used very fine grains (10 to 40 nm in diameter) and the process required minutes of
exposure even in strong daylight.
Fortunately, human color sensitivity is essentially a trichromatic system and capture of
detailed spectral information is not necessary for good color reproduction. That is, human
visual color response to different spectral stimuli is essentially based on three integrals over
wavelength, most commonly represented as follows:
X = k \int_{\lambda_{\min}}^{\lambda_{\max}} S(\lambda)\, R(\lambda)\, x(\lambda)\, d\lambda,

Y = k \int_{\lambda_{\min}}^{\lambda_{\max}} S(\lambda)\, R(\lambda)\, y(\lambda)\, d\lambda,

Z = k \int_{\lambda_{\min}}^{\lambda_{\max}} S(\lambda)\, R(\lambda)\, z(\lambda)\, d\lambda,
where S(λ) is an illuminant spectral power distribution varying with wavelength λ, and R(λ) is a spectral reflectance curve. The visual response functions x(λ), y(λ), and z(λ) are standardized color matching functions, defined over the wavelength range 380 nm to 780 nm and zero outside this range. The constant k is a normalization factor, normally computed to produce a value of 100 for Y with a spectrally flat 100% reflector under a chosen illumination, and X, Y, and Z are the standard tristimulus values.
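A rough numerical sketch of these integrals is given below. The discrete sum stands in for the integral, and the spectra in the example are made-up placeholders; real use requires the tabulated CIE 1931 color matching functions and measured illuminant and reflectance data. Only NumPy is assumed.

```python
import numpy as np

def tristimulus(wavelengths, S, R, xbar, ybar, zbar):
    """Discrete approximation of X, Y, Z = k * integral of S(l) R(l) cmf(l) dl,
    with k chosen so that a 100% reflector gives Y = 100."""
    d_lambda = np.gradient(wavelengths)
    k = 100.0 / np.sum(S * ybar * d_lambda)     # normalization: Y = 100 when R(l) = 1
    X = k * np.sum(S * R * xbar * d_lambda)
    Y = k * np.sum(S * R * ybar * d_lambda)
    Z = k * np.sum(S * R * zbar * d_lambda)
    return X, Y, Z

if __name__ == "__main__":
    wl = np.arange(380, 781, 5.0)
    # Placeholder spectra: flat illuminant, 50% gray reflector, and Gaussian
    # stand-ins for the CIE color matching functions (real use needs CIE tables).
    S = np.ones_like(wl)
    R = np.full_like(wl, 0.5)
    g = lambda mu, s: np.exp(-0.5 * ((wl - mu) / s) ** 2)
    xbar, ybar, zbar = g(600, 40) + 0.35 * g(450, 20), g(555, 45), 1.7 * g(450, 25)
    print(tristimulus(wl, S, R, xbar, ybar, zbar))
```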
Different stimuli are perceived as matching colors if the different stimuli produce the
same three tristimulus values. Color matching studies in the 1920s and 1930s provided the
basis for the CIE standard color matching functions, providing a quantitative reference for
how different colors can be matched by a three primary system. While spectral cameras
continue to be developed and serve research needs, they are more complex than a trichro-
matic camera and usually require more time, light, or both, to capture a scene. For most
color reproduction purposes, the trichromatic camera is a better match to the human visual
system.
FIGURE 1.5
Color and tone correction block details.
Many different approaches have been developed to acquire trichromatic color images.
Some cameras scan a scene with a trilinear sensor incorporating color filters to capture
three channels of color information for every pixel in the scene [51], [52]. These cameras
can achieve very high spatial resolution and excellent color quality. Because the exposure
of each scan line of the scene is sequential, the exposure time for each line must be a small
fraction of the capture time for the whole scene. These cameras often take seconds or even
minutes to acquire a full scene, so they work poorly for recording scenes that have any
motion or variation in lighting. Some cameras use a single panchromatic area sensor and
filters in the lens system to sequentially capture three color channels for every pixel in the
scene. These are faster than the linear scanning cameras, since all pixels in a color channel
are exposed simultaneously, although the channels are exposed sequentially. These are
commonly used and particularly effective in astronomy and other scientific applications
where motion is not a factor. Some cameras use dichroic beam splitters and three area
sensors to simultaneously capture three channels of color for each pixel in the scene [53],
[54]. This is particularly common and successful in high-quality video cameras, where a
high pixel rate makes the processing for demosaicking difficult to implement well. These
are the fastest cameras, since all color channels are exposed simultaneously. Each of these
approaches can perform well, but they increase the cost and complexity of the camera and
restrict the range of camera operation.
Cost and complexity issues drive consumer digital color cameras in the direction of a
single sensor that captures all color information simultaneously. Two approaches based
on single array sensors are presently used. One approach fabricates a sensor with three
layers of photodiodes and uses the wavelength-dependent depth of photon absorption to
provide spectral sensitivity [55], [56]. This system allows sampling three channels of color
information at every pixel, although the spectral sensitivity poses several challenges for
image processing. The approach more commonly used is the fusion of multiple color chan-
nels from a sensor with a color filter array into a full-color image. This fusion approach,
specifically with a four-channel color filter array, is the focus of this chapter.
A camera embodying this approach includes a single lens, a single area array sensor with
a color filter array, and a processing path to convert pixel values read from the sensor to
a suitable image for color reproduction. The image processing chain used is shown as a
block diagram in Figure 1.4. The block labeled as color and tone processing is examined
in more detail here, as shown in Figure 1.5. This figure breaks the overall color and tone
processing down into three distinct steps. The first, white balance, applies gain factors to
each color channel of camera pixel values to provide an image with equal mean code values
in each color channel for neutral scene content. The second, color correction, converts the
white balanced image to a known set of color primaries, such as the primaries used for
sRGB. The final step applies a tone correction to convert the image to a rendered image
suitable for viewing. This is often referred to as gamma correction, although optimal tone correction is rarely as simple as correcting for a standard display nonlinearity. More details behind these operations are discussed in Reference [47].

FIGURE 1.6
(a) CIE xy chromaticity diagram, and (b) CIE 1931 color matching functions.
FIGURE 1.7
Color matching functions and approximations: (a) sRGB, and (b) RIMM.
as P_C = M P_O, where P_C and P_O are 3 × 1 vectors of converted pixel values and original color pixel values, respectively. This matrix operation is also referred to as color correction.
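A minimal sketch of the white balance, color correction (P_C = M P_O), and tone steps of Figure 1.5 is shown below. The balance gains and the 3 × 3 matrix in the example are invented placeholders, and the standard sRGB encoding curve is used only as a convenient stand-in for a camera's actual rendering tone curve; NumPy is assumed.

```python
import numpy as np

def color_and_tone(raw_rgb, wb_gains, color_matrix):
    """White balance -> color correction -> tone (gamma-like) correction.

    raw_rgb      : (..., 3) linear camera RGB scaled to [0, 1]
    wb_gains     : per-channel gains G1, G2, G3
    color_matrix : 3x3 matrix M mapping balanced camera RGB to output primaries
    """
    balanced = raw_rgb * np.asarray(wb_gains)            # white balance
    corrected = balanced @ np.asarray(color_matrix).T    # P_C = M P_O, applied per pixel
    corrected = np.clip(corrected, 0.0, 1.0)
    # sRGB encoding curve used as a simple stand-in tone correction.
    rendered = np.where(corrected <= 0.0031308,
                        12.92 * corrected,
                        1.055 * corrected ** (1 / 2.4) - 0.055)
    return rendered

if __name__ == "__main__":
    raw = np.array([[0.20, 0.30, 0.25]])                 # one example pixel
    gains = [1.8, 1.0, 1.5]                              # placeholder balance gains
    M = np.array([[ 1.6, -0.4, -0.2],                    # placeholder correction matrix
                  [-0.3,  1.5, -0.2],                    # (each row sums to 1 so neutrals
                  [-0.1, -0.5,  1.6]])                   #  remain neutral)
    print(color_and_tone(raw, gains, M))
```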
Each set of primaries has a corresponding set of color matching functions. An exam-
ple shown in Figure 1.7a presents the color matching functions for the sRGB primaries.
Because cameras cannot provide negative spectral sensitivity, they use all-positive ap-
proximations to color matching functions (ACMF) instead. The figure also shows one
simple approximation to the sRGB color matching functions, formed by clipping at zero
and eliminating the red channel sensitivity to blue light. A second example, for the RIMM
set of primaries, is shown in Figure 1.7b, along with a set of simple all-positive approxi-
mations to these color matching functions. Note the RIMM color matching functions have
smaller negative lobes than the sRGB color matching functions. The size of the negative
excursions in the color matching functions corresponds to how far the spectral locus lies
outside the color gamut triangle, as can be seen by comparing the curves in Figures 1.7
and 1.6b with the gamut triangles in Figure 1.6a. Cameras with spectral sensitivities that
are not color matching functions produce color errors because the camera integration of
the spectrum is different from the human integration of the spectrum. In a successful color
camera, the spectral sensitivities must be chosen so these color errors are acceptable for the
intended application.
Digital camera images are usually corrected to one of several standardized RGB color
spaces, such as sRGB [58], [59], RIMM RGB [57], [60], and Adobe RGB (1998) [61],
each with somewhat different characteristics. Some of these color spaces and others are
compared in Reference [62].
The deviation of a set of spectral sensitivities from color matching functions was consid-
ered in Reference [63], which proposed a q factor for measuring how well a single spectral
sensitivity curve compared with its nearest projection onto color matching functions. This
concept was extended in Reference [64] to the ν factor, which considers the projection of
a set of spectral sensitivities (referred to as scanning filters) onto the human visual sensi-
tivities. Because q and ν are computed on spectral sensitivities, the factors are not well
correlated to color errors calculated in a visually uniform space, such as CIE Lab.
FIGURE 1.8
Example quantum efficiencies: (a) typical RGB, and (b) CMY from typical RGB.
Several three-channel systems are used to illustrate the impact of spectral sensitivity on
image noise. These examples use sample spectral sensitivity curves for a typical RGB
camera from Reference [56] converted to quantum efficiencies and cascaded with a typical
infrared cut filter. The resulting overall quantum efficiency curves are shown, together
with the quantum efficiency of the underlying sensor, in Figure 1.8a. One way to improve
the signal-to-noise ratio of this camera would be to increase the quantum efficiency of
the sensor itself. This is difficult and begs the question of selecting the optimal quantum
efficiencies for the three color channels. Given the sensor quantum efficiency as a limit for
peak quantum efficiency for any color, widening the spectral response for one or more color
channels is the available option to significantly improve camera sensitivity. The effects of
widening the spectral sensitivity are illustrated in this chapter by considering a camera
with red, panchromatic, and blue channels and a camera with cyan, magenta, and yellow
channels, shown in Figure 1.8. The CMY quantum efficiencies were created by summing
pairs of the RGB quantum efficiency curves and thus are not precisely what would normally
be found on a CMY sensor. In particular, the yellow channel has a dip in sensitivity near
a wavelength of 560 nm, which is not typical of yellow filters. The primary effect of this
dip is to reduce color errors rather than change the color correction matrix or sensitivity
significantly.
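The construction of the CMY curves described above amounts to a pairwise sum of the RGB quantum efficiency curves, as in the short sketch below; the Gaussian curves are placeholders rather than the measured data of Figure 1.8a.

```python
import numpy as np

wl = np.arange(400, 701, 5.0)
g = lambda mu, s, peak: peak * np.exp(-0.5 * ((wl - mu) / s) ** 2)

# Placeholder RGB quantum efficiency curves (not the measured data of Figure 1.8a).
qe_r, qe_g, qe_b = g(600, 40, 0.30), g(530, 45, 0.35), g(460, 35, 0.30)

# CMY curves formed by summing pairs of RGB curves, as described in the text.
qe_c = qe_g + qe_b      # cyan    = green + blue
qe_m = qe_r + qe_b      # magenta = red   + blue
qe_y = qe_r + qe_g      # yellow  = red   + green

print(qe_c.max(), qe_m.max(), qe_y.max())
```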
Reference [65] considers the trade-off of noise and color error by examining the sensitiv-
ity and noise in sensors with both RGB and CMYG filters. It is concluded that the CMYG
system has more noise in a color-corrected image than the RGB system. Reference [66]
proposes optimal spectral sensitivity curves for both RGB and CMY systems consider-
ing Poisson noise, minimizing a weighted sum of color errors and noise. Fundamentally,
the overlap between color matching functions drives use of substantial color correction
to provide good color reproduction. All three systems in the current illustration produce
reasonable color errors, so the illustration will compare the noise in the three systems.
This chapter focuses on random noise from two sources. The first is Poisson-distributed
noise associated with the random process of photons being absorbed and converted to
photo-electrons within a pixel, also called shot noise. The second is electronic amplifier read noise, which is modeled with a Gaussian distribution. These two processes are
independent, so the resulting pixel values are the sum of the two processes. A pixel value Q may be modeled as Q = k_Q (q + g), where k_Q is the amplifier gain, q is a Poisson random variable with mean m_q and variance σ_q^2, and g is a Gaussian random variable with mean m_g and variance σ_g^2. Note that σ_q^2 = m_q since q is a Poisson variable, and it is entirely defined by the spectral power distribution impinging upon the sensor and the channel spectral responsivities. Also note that for a given sensor m_g and σ_g^2 are independent of m_q. For this discussion, the original pixel values are assumed to be independent, so the covariance matrix of the original pixel values, K_O, is a diagonal matrix. Because the two random processes are independent, the variance of the pixel values is the sum of the two variances:

K_O = \mathrm{diag}\bigl( k_Q^2 (m_{q,1} + \sigma_g^2),\; k_Q^2 (m_{q,2} + \sigma_g^2),\; k_Q^2 (m_{q,3} + \sigma_g^2) \bigr), \qquad (1.1)
where m_{q,i} is the mean original signal level (captured photo-electrons) for channel i ∈ {1, 2, 3} and σ_g^2 is the read noise. In the processing path of Figure 1.5, the white balance gain factors scale camera pixel values to equalize the channel responses for neutral scene content. The gain factors are represented here with a diagonal matrix, G_B = diag(G_1, G_2, G_3). Accordingly, the covariance matrix for white balanced pixels, K_B, is

K_B = G_B K_O G_B^T, \qquad (1.2)

where the superscript T denotes a transposed matrix or vector. Color correction is also a 3 × 3 matrix; the covariance matrix for color corrected pixels is

K_C = M K_B M^T. \qquad (1.3)
Photometric sensitivity and noise amplification will be compared by examining the di-
agonal elements of KC and KB . The elements on the diagonal of the covariance matrix are
the variance of each color channel. Since the visual impression of noise is affected by all
three color channels, the sum of the variance terms can be used to compare noise levels.
This sum is referred to as Tr(a), the trace of matrix a. More precise noise measurements convert the color image to provide a luminance channel and consider the variance in the luminance channel [67]. The luminance coefficients recommended in the ISO standard are L = [0.2125, 0.7154, 0.0721], so the appropriate estimate for the luminance variance is

\sigma_L^2 = L\, K\, L^T, \qquad (1.4)

where K is the covariance matrix of the color channels and σ_L^2 is the variance observed in a luminance channel. The weighting values shown
are specified in the ISO standard and come from ITU-R BT.709, which specifies primaries
that sRGB also uses.
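Equations 1.1 through 1.4 can be exercised numerically with the sketch below. The signal levels, gains, color correction matrix, and read noise are invented placeholders rather than the values of Table 1.1, so the printed numbers only demonstrate the propagation itself; NumPy is assumed.

```python
import numpy as np

def noise_propagation(m_q, sigma_g2, k_Q, G, M, L):
    """Propagate shot plus read noise through white balance and color correction.

    m_q      : mean photo-electrons per channel (Poisson, so variance = mean)
    sigma_g2 : read noise variance (electrons^2)
    k_Q      : amplifier gain
    G        : white balance gains (G1, G2, G3)
    M        : 3x3 color correction matrix
    L        : luminance weights from ITU-R BT.709
    """
    K_O = np.diag(k_Q**2 * (np.asarray(m_q) + sigma_g2))   # Equation 1.1
    G_B = np.diag(G)
    K_B = G_B @ K_O @ G_B.T                                 # Equation 1.2
    K_C = M @ K_B @ M.T                                     # Equation 1.3
    var_L = L @ K_C @ L                                     # Equation 1.4
    return K_B, K_C, var_L

if __name__ == "__main__":
    m_q = [400.0, 700.0, 300.0]          # placeholder mean electrons per channel
    M = np.array([[ 1.6, -0.4, -0.2],
                  [-0.3,  1.5, -0.2],
                  [-0.1, -0.5,  1.6]])   # placeholder color correction matrix
    L = np.array([0.2125, 0.7154, 0.0721])
    K_B, K_C, var_L = noise_propagation(m_q, sigma_g2=16.0, k_Q=1.0,
                                        G=[1.75, 1.0, 2.33], M=M, L=L)
    print(np.trace(K_B), np.trace(K_C), np.sqrt(var_L))
```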
The following equation shows the calculation for the number of photo-electrons captured
by each set of spectral quantum efficiencies:
P_{O,i} = \frac{l^2}{I_{EI}} \cdot \frac{55.6}{683 \int I_0(\lambda) V(\lambda)\, d\lambda} \int \frac{I_0(\lambda) R(\lambda)}{hc/\lambda}\, Q_i(\lambda)\, d\lambda, \qquad (1.5)
where P_{O,i} is the mean number of photo-electrons captured in a square pixel with size l, the term I_0 denotes the illuminant relative spectral power distribution, R is the scene spectral reflectance, Q_i is the quantum efficiency, and I_{EI} is the exposure index. The additional values are Planck's constant h, the speed of light c, the spectral luminous efficiency function V, and normalization constants arising from the definition of exposure index. Using a relative spectral power distribution of D65 for the illuminant, a pixel size of l = 2.2 µm, and a spectrally flat 100% diffuse reflector, the mean numbers of photo-electrons captured in each pixel at an exposure index of ISO 1000 are shown under “Channel Response” in Table 1.1.

TABLE 1.1
Summary of channel sensitivity and color correction matrices. The balance gains and the sensitivity gain are respectively denoted by {G_1, G_2, G_3} and G_E.
The balance gains listed are factors to equalize the color channel responses. The sensi-
tivity gain shown is calculated to equalize the white balanced pixel values for all sets of
quantum efficiencies. The color correction matrix shown for each set of quantum efficien-
cies was computed by calculating Equation 1.5 for 64 different color patch spectra, then
finding a color correction matrix that minimized errors between color corrected camera
data and scene colorimetry, as described in Reference [68].
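The fitting step can be posed as an ordinary least-squares problem, as in the sketch below. The 64 "patches" are random placeholders rather than the actual color patch spectra, and no visual weighting or row constraint is applied; Reference [68] describes the procedure actually used. NumPy is assumed.

```python
import numpy as np

def fit_color_matrix(camera_rgb, target):
    """Least-squares 3x3 matrix M mapping white-balanced camera responses to
    target colorimetry: minimize || camera_rgb @ M.T - target ||^2."""
    M_T, *_ = np.linalg.lstsq(camera_rgb, target, rcond=None)
    return M_T.T

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    true_M = np.array([[ 1.5, -0.3, -0.2],
                       [-0.2,  1.4, -0.2],
                       [-0.1, -0.4,  1.5]])
    cam = rng.uniform(0.0, 1.0, (64, 3))                 # 64 placeholder patches
    target = cam @ true_M.T + rng.normal(0, 0.01, (64, 3))
    print(np.round(fit_color_matrix(cam, target), 2))    # recovers true_M approximately
```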
The illustration compares the noise level in images captured at the same exposure index
and corrected to pixel value PC . For a neutral, the mean of the balanced pixel values is
the same as the color corrected pixel values. Since the raw signals are related to the bal-
anced signal by the gains shown in Table 1.1, the original signal levels can be expressed as
follows:
P_O = diag( 1/(G_E G_1), 1/(G_E G_2), 1/(G_E G_3) ) [P_C, P_C, P_C]^T.  (1.6)
Defining a modified balance matrix that includes the sensitivity gain along with the white
balance gains, G_B = G_E diag(G_1, G_2, G_3), and substituting Equation 1.6 into Equation 1.1
produces the following covariance matrix for the white balanced and gain corrected pixel
values:
K_B = diag( G_E² G_1² (P_C/(G_E G_1) + σ_g²), G_E² G_2² (P_C/(G_E G_2) + σ_g²), G_E² G_3² (P_C/(G_E G_3) + σ_g²) ).  (1.7)
TABLE 1.2
Summary of relative noise in white balanced and color corrected signals.
In the case where σ_g² ≪ P_C/(G_E G_i) for i ∈ {1, 2, 3}, this simplifies to
K_B = G_E P_C diag(G_1, G_2, G_3).  (1.8)
To focus on the relative sensitivity, the matrix S_B is defined by leaving out the factor of P_C:
S_B = G_E diag(G_1, G_2, G_3).  (1.9)
The values on the diagonal of SB show the relative noise levels in white balanced images
before color correction, accounting for the differences in photometric sensitivity. To finish
the comparison, the matrix SC is defined as MSB MT . The values on the diagonal of SC in-
dicate the relative noise levels in color corrected images. The values σL,B and σL,C indicate
the estimated relative standard deviation for a luminance channel based on Equation 1.4.
As shown in Table 1.2, the Tr(SB ) and σL,B are smaller for CMY and for RPB than for
RGB, reflecting the sensitivity advantage of the broader spectral sensitivities. However,
Tr(SC ) and σL,C are greater for RPB and CMY than for RGB, reflecting the noise ampli-
fication from the color correction matrix. In summary, while optimal selection of spectral
sensitivity is important for limiting noise, a well-selected relatively narrow set of RGB
spectral sensitivities is close to optimum, as found in References [65] and [66]. Given these
results, it is tempting to consider narrower spectral bands for each color channel, reduc-
ing the need for color correction. This would help to a limited extent, but eventually the
signal loss from narrower bands would take over. Further, narrower spectral sensitivities
would produce substantially larger color errors, leading to lower overall image quality. The
fundamental problem is that providing acceptable color reproduction constrains the three
channel system, precluding substantial improvement in sensitivity.
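To make the preceding comparison concrete, the following Python sketch propagates a relative noise covariance through the white balance and color correction steps and reports Tr(S_B), Tr(S_C), and the luminance standard deviations of Equation 1.4. The gains and the color correction matrix below are illustrative placeholders, not the values of Table 1.1.

```python
import numpy as np

# Illustrative placeholder values; the actual gains and matrices are listed in Table 1.1.
G = np.diag([1.8, 1.0, 1.5])            # white balance gains G1, G2, G3
G_E = 0.9                               # sensitivity gain
CCM = np.array([[ 1.6, -0.4, -0.2],     # color correction matrix (placeholder)
                [-0.3,  1.5, -0.2],
                [-0.1, -0.6,  1.7]])
L = np.array([0.2125, 0.7154, 0.0721])  # ITU-R BT.709 luminance weights

S_B = G_E * G                # Equation 1.9: relative noise covariance after balancing
S_C = CCM @ S_B @ CCM.T      # relative noise covariance after color correction

print("Tr(S_B)   =", np.trace(S_B))
print("Tr(S_C)   =", np.trace(S_C))
print("sigma_L,B =", np.sqrt(L @ S_B @ L))   # Equation 1.4 applied to S_B
print("sigma_L,C =", np.sqrt(L @ S_C @ L))   # Equation 1.4 applied to S_C
```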
Reference [65] considers the possibility of reducing the color saturation of the image,
lowering the noise level at the expense of larger color errors. However, the concept of
lowering the color saturation can be applied with RGB quantum efficiencies as well. Ref-
erence [66] shows that by allowing larger color errors at higher exposure index values, the
optimum set of quantum efficiencies changes with exposure index. In particular, at a high
exposure index, the optimum red quantum efficiency peaks at a longer wavelength and has
less overlap with the green channel. This is another way to accept larger color errors to
reduce the noise in the color corrected image.
1.3 Demosaicking
Demosaicking, or color filter array interpolation, is the process of producing a full-color
image from the sparsely sampled digital camera capture. It generally involves some sort of
interpolation of neighboring pixel values within a given support region. This process may
be based on strictly linear, shift-invariant systems theory, or may be conducted in a more
heuristic nonlinear, adaptive manner. Both approaches will be described below. Because
of the large breadth of knowledge now available on demosaicking in general, the following
discussion will be restricted to a particular body of research conducted in the area of four-
channel color filter array image processing [72], [73], [74], [75].
The delta function, δ (x), is the function that has the following properties:
δ(x − x₀) = 0,  x ≠ x₀,
∫_{x₁}^{x₂} f(α) δ(α − x₀) dα = f(x₀),  x₁ < x₀ < x₂,
δ((x − x₀)/b) = |b| δ(x − x₀).
For convenience, pairs of delta functions can be defined as follows:
δδ((x − x₀)/b) = |b| [δ(x − x₀ + b) + δ(x − x₀ − b)].
Fourier analysis of the tri function is expressed in terms of the sinc function:
sinc((x − x₀)/b) = sin(π(x − x₀)/b) / (π(x − x₀)/b).
The forward Fourier transform pairs of the aforementioned special functions are defined
as follows:
δ((x − x₀)/b) →F |b| e^{−i2π x₀ ξ},
δδ((x − x₀)/b) →F 2b e^{−i2π x₀ ξ} cos(2πbξ),
comb((x − x₀)/b) →F |b| e^{−i2π x₀ ξ} comb(bξ),
tri((x − x₀)/b) →F |b| e^{−i2π x₀ ξ} sinc²(bξ).
Two-dimensional versions of these special functions as well as their Fourier transforms can
be constructed by multiplying together one-dimensional versions, resulting in the following
(note that the results are separable):
δ((x − x₀)/b, (y − y₀)/d) = δ((x − x₀)/b) δ((y − y₀)/d),
δδ((x − x₀)/b, (y − y₀)/d) = δδ((x − x₀)/b) δδ((y − y₀)/d),
comb((x − x₀)/b, (y − y₀)/d) = comb((x − x₀)/b) comb((y − y₀)/d),
tri((x − x₀)/b, (y − y₀)/d) = tri((x − x₀)/b) tri((y − y₀)/d),
sinc((x − x₀)/b, (y − y₀)/d) = sinc((x − x₀)/b) sinc((y − y₀)/d),
cos(2πbx, 2πdy) = cos(2πbx) cos(2πdy),
δ((x − x₀)/b, (y − y₀)/d) →F |bd| e^{−i2π(x₀ξ + y₀η)},
δδ((x − x₀)/b, (y − y₀)/d) →F 4|bd| e^{−i2π(x₀ξ + y₀η)} cos(2πbξ, 2πdη),
comb((x − x₀)/b, (y − y₀)/d) →F |bd| e^{−i2π(x₀ξ + y₀η)} comb(bξ, dη),
tri((x − x₀)/b, (y − y₀)/d) →F |bd| e^{−i2π(x₀ξ + y₀η)} sinc²(bξ, dη).
FIGURE 1.9
Panchromatic pixel neighborhood.
Finally, it is necessary to look at functions that have been rotated and skewed and their
corresponding Fourier transforms. The general rule to be used can be written as follows:
f((x − x₀)/b − (y − y₀)/d, (x − x₀)/b + (y − y₀)/d) →F (|bd|/2) e^{−i2π(x₀ξ + y₀η)} F((bξ − dη)/2, (bξ + dη)/2),
where
f(x, y) →F F(ξ, η).
S_P = comb(ξ),
F_P′ = F_P ∗ S_P = ∫_{−∞}^{∞} F_P(α) Σ_{n=−∞}^{∞} δ(ξ − α − n) dα = Σ_{n=−∞}^{∞} F_P(ξ − n).  (1.11)
1 In this analysis the pixels are considered to be point entities modeled by delta functions. These delta functions
could be convolved with a finite-area mask, such as the rect function of Reference [77], to more accurately simulate
their physical dimensions. However, as doing so would not significantly impact the results of this analysis, it
is omitted for the sake of simplicity. The interested reader is referred to Reference [77] for a more detailed
discussion of this topic.
FIGURE 1.10
Demosaicking algorithm flowchart.
It can be seen from Equation 1.11 that if the initial panchromatic image is appropriately
bandlimited to be zero for |ξ| ≥ 1/2, then the fundamental component (n = 0) is not
aliased by any of the sidebands (n ≠ 0). In practice, this bandlimiting is usually imposed
by an optical antialiasing filter [78]. Restricting attention to the portion of the resulting
panchromatic spectrum 0 ≤ ξ < 1/2 and considering this to be the rendered portion of the
reconstructed image, this idealized case can be seen to produce perfect image reconstruction,
that is, F_P′ = F_P, 0 ≤ ξ < 1/2.
FIGURE 1.11
Color difference interpolation rectilinear neighborhood (samples D_{x₀,y₀}, D_{x₀+M,y₀}, D_{x₀,y₀+N}, and D_{x₀+M,y₀+N}).
In adaptive interpolation, primitive edge detection computations are performed and the direction of least
edge activity is chosen. One set of terminology exists for describing this process. Classification
is the selection of a preferred interpolation direction through edge detection. In this
context, the edge detectors become classifiers. Prediction is the estimation of the missing
pixel value. The expressions used for computing these missing values are then called
predictors.
The simplest demosaicking algorithms use nonadaptive methods for both panchromatic
and color difference interpolation. Since nonadaptive methods cannot respond
to or take advantage of any feature (edge) information in the image, the algorithmic
simplicity comes at the cost of reconstruction image fidelity. Note that this is primarily a liability for
the panchromatic channel, since the color differences are predominantly low spatial frequency
records, similar to the chrominance channels in a YCC color space. Color differences,
being largely devoid of edge information, are well suited to nonadaptive demosaicking
methods. For improved reconstruction image fidelity, adaptive methods can be used for the
demosaicking of the panchromatic channel. This, of course, comes at the price of increased
interpolation algorithm complexity.
f_D′ = (f_D s_D) ∗ b.  (1.12)
The sampling function for the color differences in Figure 1.11 is given by
s_D = (1/(MN)) comb((x − x₀)/M, (y − y₀)/N).
Standard Fourier analysis produces the spatial frequency response for Equation 1.12, as
follows:
F_D′ = (F_D ∗ S_D) B.
This translates into the general frequency response for bilinear interpolation on a rectilinear
grid:
F_D′ = Σ_{m=−∞}^{∞} Σ_{n=−∞}^{∞} A_{mn}(ξ, η) F_D(ξ − m/M, η − n/N),  (1.13)
where
A_{mn} = (e^{−i2π(x₀ m/M + y₀ n/N)} / (MN)) B(ξ, η)  (1.14)
denotes the transfer function and B is defined as follows (see Appendix for more details):
B = MN Σ_{p=−∞}^{∞} Σ_{q=−∞}^{∞} sinc²[M(ξ − p), N(η − q)]
  = [1 + 2 Σ_{j=1}^{M−1} tri(j/M) cos(2πjξ)] [1 + 2 Σ_{k=1}^{N−1} tri(k/N) cos(2πkη)].  (1.15)
Equations of the form of Equation 1.13 occur several times in the subsequent analysis.
These equations can be viewed as consisting of two components: repeated spectral
components, for instance F_D(ξ − m/M, η − n/N), which describe the aliasing behavior, and the
transfer functions, A_{mn}(ξ, η), which describe the spectral component fidelity. As a rule
of thumb, the larger the values of M and N, the more prone the CFA is to aliasing
artifacts. Similarly, the greater the departure of the transfer functions from a unity response
over all spatial frequencies of interest, the more distorted the demosaicked image appears,
usually as a lack of sharpness or definition.
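The closed form of Equation 1.15 makes these transfer functions easy to evaluate numerically. The sketch below (NumPy assumed; the offsets x₀ = 1, y₀ = 0 and M = N = 2 are chosen only for illustration) computes B(ξ, η) and the fundamental transfer function A₀₀ along the ξ axis.

```python
import numpy as np

def tri(t):
    # tri(t) = 1 - |t| for |t| < 1, zero otherwise.
    return np.maximum(0.0, 1.0 - np.abs(t))

def B(xi, eta, M=2, N=2):
    # Closed form of Equation 1.15.
    bx = 1 + 2 * sum(tri(j / M) * np.cos(2 * np.pi * j * xi) for j in range(1, M))
    by = 1 + 2 * sum(tri(k / N) * np.cos(2 * np.pi * k * eta) for k in range(1, N))
    return bx * by

def A(m, n, xi, eta, M=2, N=2, x0=1, y0=0):
    # Transfer function of Equation 1.14.
    phase = np.exp(-1j * 2 * np.pi * (x0 * m / M + y0 * n / N))
    return phase * B(xi, eta, M, N) / (M * N)

xi = np.linspace(0.0, 0.5, 6)
print("fundamental A_00 along the xi axis:", np.real(A(0, 0, xi, 0.0)))
```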
FIGURE 1.12
Color difference interpolation diamond neighborhood (samples D_{x₀,y₀}, D_{x₀+M,y₀−N}, D_{x₀+M,y₀+N}, and D_{x₀+2M,y₀}).
The interpolation operation is formally the same as in the rectilinear case (Equation 1.12)
with only a change in the sampling function, sD , and the convolution kernel, b. The sam-
pling function for the color differences in Figure 1.12 is given by
s_D = (1/(2MN)) comb((x − x₀)/(2M) − (y − y₀)/(2N), (x − x₀)/(2M) + (y − y₀)/(2N)).
Standard Fourier analysis produces the equivalent spatial frequency response for Equa-
tion 1.12 using the new values for sD and b as follows (see Appendix for more details):
FIGURE 1.13
Bayer pattern: (a) CFA, and (b) row neighborhood (R₋₂ G₋₁ R₀ G₁ R₂).
FIGURE 1.14
Fundamental transfer function frequency responses.
For the purposes of analysis, in the Bayer pattern the green channel will be taken to be the
luminance channel and the color differences will be red minus green and blue minus green.3
3 While formally one justifies interpolating color differences by performing the computations in a logarithmic
space [79], for all but the most extreme pixel differences computing color differences in video gamma or even
linear space is usually visually acceptable.
FIGURE 1.15
Bayer green bilinear interpolation results: (a) original image, (b) bilinear green interpolation, (c) interpolation
error map, and (d) bilinear interpolation full color result.
where
A_{mn} = [2 + cos(2πξ) + cos(2πη)] / 4.
The ξ -axis response of Amn is plotted in Figure 1.14 as “Bayer green bilinear.” Another
way to analyze the performance of the Bayer bilinear algorithm is to test the algorithm
on a chirp circle test chart. Figure 1.15a is a chirp circle target in which the spatial fre-
quency of the circles increases linearly from the center out. Figure 1.15b is the equivalent
where t is the threshold set to a value of 22 for Figure 1.15c as well as all subsequent in-
terpolation error maps. Note that the original image code value range of Figure 1.15a is 0
to 255. The central circular region in Figure 1.15c represents an area of low interpolation
error whereas the rest of the error map is dominated by aliasing and transfer function dis-
tortions. A qualitative assessment of the resulting aliasing can be made from the full-color
results of the bilinear interpolation in Figure 1.15d. In this figure, the green-magenta alias-
ing patterns in the corners of the image represent the aliasing due to bilinear interpolation
of the green channel.
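The exact expression for the error map is not reproduced above; a plausible reading, sketched below in Python on a synthetic chirp-like target, flags pixels whose absolute green interpolation error exceeds the threshold t = 22 used in the text.

```python
import numpy as np

def bilinear_green(green, mask):
    # Fill non-green sites with the average of the four cross neighbors (wrap-around
    # edges are acceptable for this illustration); keep original values at green sites.
    shifts = ((1, 0), (-1, 0), (1, 1), (-1, 1))
    vals = sum(np.roll(green * mask, s, axis=a) for s, a in shifts)
    wts = sum(np.roll(mask.astype(float), s, axis=a) for s, a in shifts)
    return np.where(mask, green, vals / np.maximum(wts, 1e-12))

# Chirp-like test target (a stand-in for the chirp circle chart of Figure 1.15a).
y, x = np.mgrid[0:256, 0:256].astype(float)
r2 = (x - 128.0) ** 2 + (y - 128.0) ** 2
original = 127.5 * (1.0 + np.cos(2.0 * np.pi * r2 / 2000.0))

mask = ((x + y) % 2 == 0)               # Bayer green sites form a checkerboard
interpolated = bilinear_green(original, mask)

t = 22                                   # threshold used for the error maps in the text
error_map = np.abs(original - interpolated) > t
print("fraction of pixels flagged:", error_map.mean())
```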
Once the green channel has been fully populated by the interpolation process, red and
blue color differences, DR = R − G and DB = B − G, can be formed at each red and blue
pixel location and the method of Section 1.3.3.1 can be used with M = 2, N = 2, x0 = 1,
and y0 = 0 for the red channel and x0 = 0 and y0 = 1 for the blue channel. The bilinear
interpolation function
b = (1/4) [2δ(x) + δδ(x)] [2δ(y) + δδ(y)]
is the same for both color difference channels. The equivalent convolution kernel would be
b = (1/4) [1 2 1; 2 4 2; 1 2 1],
with the red and blue transfer functions, respectively, defined as follows:
A_{mn,R} = ((−1)^m / 4) B(ξ, η),
A_{mn,B} = ((−1)^n / 4) B(ξ, η),
and the frequency response of the bilinear interpolating function defined (from Equation 1.15 with M = N = 2) as
B(ξ, η) = [1 + cos(2πξ)] [1 + cos(2πη)].
The aliasing consequences of the final image can be seen in Figure 1.15d with the addition
of blue-orange aliasing patterns in the centers of the image sides.
FIGURE 1.16
Bayer adaptive interpolation neighborhood (horizontal row R₃ G₄ R₅ G₆ R₇ and vertical column R₁ G₂ R₅ G₈ R₉, centered on R₅).
The interpolated green value at the central pixel R₅ of Figure 1.16 is computed as follows:
f_G′ = G₅′ = (u/(u + v)) V + (v/(u + v)) U,  (1.20)
where U and V are the horizontal and vertical predictors to be derived below. It can be
seen in Equation 1.20 that the direction of the smaller classifier gives the greater weight
to the corresponding predictor; for example, a smaller value of u will produce a dominant
weighting of U. It can also be seen that the classifiers freely combine color Laplacians and
green gradients. This is an image fusion technique that will be discussed shortly.
In this adaptive algorithm the derivation of a suitable predictor becomes a one-
dimensional interpolation problem. Figure 1.13b shows an example of a five-point hori-
zontal neighborhood. The corresponding predictor is defined as follows (these results will
be stated more broadly in Section 1.3.5.1):
f_G′ = f_G s_G + (f_G ∗ b) s_R + (f_R ∗ h) s_R ≈ f_G s_G + (f_G ∗ b) s_R + (f_G ∗ h) s_R,  (1.21)
FIGURE 1.17
Bayer green interpolation error maps: (a) bilinear interpolation error map, (b) adaptive linear interpolation with
α = 0, and (c) adaptive linear interpolation with α = 1/2.
where b and h denote, respectively, a low-pass filter and a high-pass filter, defined as:
b = (1/2) δδ(x),
h = (α/4) [2δ(x) − (1/2) δδ(x/2)],
where α is a design parameter. The terms s_G and s_R are defined as follows:
s_G = (1/2) comb((x − 1)/2),
s_R = (1/2) comb(x/2).
Image fusion occurs in the substitution of the high-pass image component ( fR ∗ h) sR for
the unavailable high-pass image component ( fG ∗ h) sR . This is justified on the assumption
that G = R + constant over the pixel neighborhood [83]. The corresponding frequency
response is given by
F_G′ = Σ_{n=−∞}^{∞} A_n(ξ) F_G(ξ − n/2),
A_n = {(−1)^n [1 + cos(2πξ)] + α sin²(2πξ)} / 2.
The design parameter α can be set to satisfy a number of different constraints. Here, α
will be chosen to make the fundamental transfer function,
A₀ = [1 + cos(2πξ) + α sin²(2πξ)] / 2,
as flat as possible at the origin. Since dA₀/dξ is zero at ξ = 0 for any α, requiring the second
derivative to vanish there as well gives α = 1/2. Therefore, h can be restated with this value of α and the predictors
written in terms of the pixel values in Figure 1.16 as follows:
FIGURE 1.18
Bayer adaptive interpolation results: (a) adaptive interpolation full color result, and (b) bilinear-adaptive interpolation difference green channel.
h = (1/8) [2δ(x) − (1/2) δδ(x/2)],
U = (G₄ + G₆)/2 + (−R₃ + 2R₅ − R₇)/8,
V = (G₂ + G₈)/2 + (−R₁ + 2R₅ − R₉)/8.
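The following Python sketch assembles the predictors U and V and the blend of Equation 1.20 for a single 5 × 5 neighborhood laid out as in Figure 1.16. The classifier expressions u and v are not reproduced in the excerpt above; the gradient-plus-Laplacian form used here is an assumption, consistent only with the statement that the classifiers combine green gradients and color Laplacians.

```python
import numpy as np

def adaptive_green_at_red(win):
    """Interpolate green at the central red pixel of a 5x5 Bayer window.

    win is laid out as in Figure 1.16, with the red pixel R5 at the center.
    The classifier form below (gradient plus Laplacian magnitude) is an
    assumption; the predictors U, V and the blend of Equation 1.20 follow
    the text.
    """
    R1, R3, R5, R7, R9 = win[0, 2], win[2, 0], win[2, 2], win[2, 4], win[4, 2]
    G2, G4, G6, G8 = win[1, 2], win[2, 1], win[2, 3], win[3, 2]

    # Assumed classifiers: green gradient plus red Laplacian magnitude.
    u = abs(G4 - G6) + abs(-R3 + 2 * R5 - R7)   # horizontal edge activity
    v = abs(G2 - G8) + abs(-R1 + 2 * R5 - R9)   # vertical edge activity

    # Predictors from the text (alpha = 1/2 case).
    U = (G4 + G6) / 2 + (-R3 + 2 * R5 - R7) / 8
    V = (G2 + G8) / 2 + (-R1 + 2 * R5 - R9) / 8

    # Equation 1.20: the smaller classifier gives more weight to its predictor.
    if u + v == 0:
        return (U + V) / 2
    return (u * V + v * U) / (u + v)

# Tiny usage example: a vertical edge, so the vertical predictor dominates.
win = np.tile(np.array([10., 10., 10., 200., 200.]), (5, 1))
print(adaptive_green_at_red(win))   # prints 10.0
```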
The response of A0 is equivalent to one of the four-channel situations analyzed below and
is therefore the same as the plot in Figure 1.14 labeled “alternating panchromatic linear.”
The interpolation error map is shown in Figure 1.17c. If α is set to zero and just the linear
interpolation of green values is used in the adaptive interpolation, the fundamental transfer
function becomes
A₀ = [1 + cos(2πξ)] / 2.
This response of A0 is labeled as “two-point average” in Figure 1.14. The interpolation
error map is shown in Figure 1.17b. Comparing the bilinear interpolation error map shown
in Figure 1.17a with Figures 1.17b and 1.17c reveals that interpolation error is greatest
with bilinear interpolation and least with adaptive interpolation and α = 1/2. The adaptive
interpolation error with α = 0 is clearly between these two extremes.
Bilinear interpolation of color differences is still used for demosaicking the red and blue
channels. The resulting full color image from using adaptive interpolation for green and
bilinear interpolation for red and blue is shown in Figure 1.18a. A difference map of the
green channel between the all-bilinear interpolation case of Figures 1.15d and 1.18a is
shown in Figure 1.18b, indicating that the largest region of improvement realized in the
adaptive interpolation case is in the middle spatial frequency range of the green channel.
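A minimal sketch of the color difference approach for the red channel is given below: form D_R = R − G at the red sites, interpolate D_R with the kernel b given earlier (implemented here as a normalized convolution over the sparse grid, an implementation choice), and add the fully populated green channel back.

```python
import numpy as np
from scipy.signal import convolve2d

def demosaick_red_via_color_difference(red_sparse, red_mask, green_full):
    # D_R = R - G at red sites only, bilinearly interpolated with
    # b = (1/4)[1 2 1; 2 4 2; 1 2 1], then green is added back.
    b = np.array([[1, 2, 1],
                  [2, 4, 2],
                  [1, 2, 1]], dtype=float) / 4.0
    d_sparse = (red_sparse - green_full) * red_mask
    num = convolve2d(d_sparse, b, mode="same")
    den = convolve2d(red_mask.astype(float), b, mode="same")
    return num / np.maximum(den, 1e-12) + green_full

# Usage on a flat patch with red sampled on a 2x2 lattice (M = N = 2).
y, x = np.mgrid[0:8, 0:8]
red_mask = (x % 2 == 1) & (y % 2 == 0)
red_true = np.full((8, 8), 90.0)
green_full = np.full((8, 8), 120.0)
recovered = demosaick_red_via_color_difference(red_true * red_mask, red_mask, green_full)
print(np.allclose(recovered, red_true))   # a flat scene is reconstructed exactly
```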
FIGURE 1.19
Four-channel CFA patterns: five arrangements, (a) through (e), of panchromatic (P) and color (R, G, B) pixels.
FIGURE 1.20
Four-channel row neighborhoods: (a) alternating panchromatic, (b) double panchromatic, (c) triple panchromatic, and (d) two-color alternating panchromatic.
f_P′ = f_P s_P + (f_P ∗ b + f_C ∗ h) s_C,  (1.22)
where
s_C = (1/(N + 1)) comb(x/(N + 1)),   s_P = comb(x) − s_C,
b = (1/2) δδ(x),   h = (1/(2(N + 1)²)) [2δ(x) − (1/(N + 1)) δδ(x/(N + 1))].  (1.23)
The resulting frequency response has the same form as before, namely F_P′ = Σ_{n=−∞}^{∞} A_n(ξ) F_P(ξ − n/(N + 1)), where
A_n = [c_n + B(ξ − n/(N + 1)) + H(ξ − n/(N + 1))] / (N + 1),
c_n = { N  for n/(N + 1) ∈ Z;  −1  otherwise },
B = cos(2πξ),   H = (2/(N + 1)²) sin²[(N + 1)πξ].  (1.25)
The classifier is given below. The scale factor in front of the gradient term is to balance
the contributions of the gradient and Laplacian terms:
u = ((N + 1)²/2) |δ(x + 1) − δ(x − 1)| + |−δ(x + N + 1) + 2δ(x) − δ(x − N − 1)|.
In the case of Figure 1.20d only h in Equation 1.23 (and H in Equation 1.25) and the
classifier need be modified as follows:
h = (1/(2(2N + 2)²)) [2δ(x) − (1/(2N + 2)) δδ(x/(2N + 2))],   H = (2/(2N + 2)²) sin²[(2N + 2)πξ],
u = ((N + 1)²/2) |δ(x + 1) − δ(x − 1)| + |−δ(x + N + 1) + 2δ(x) − δ(x − N − 1)|.
These expressions work for N > 1, but the N = 1 case needs a slightly different expression
for b:
b = (1/16) [9δδ(x) − (1/3) δδ(x/3)],   B = [9cos(2πξ) − cos(6πξ)] / 8.  (1.27)
The design parameter α is still zero so Equation 1.26 is still applicable. Note that dAn /d ξ
is still zero at the origin.
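A quick numerical check of Equation 1.25 is given below (NumPy assumed): it evaluates the fundamental transfer function near the origin for several values of N and confirms that A₀(0) = 1 with an essentially flat response there.

```python
import numpy as np

def A(n, xi, N):
    # Transfer function of Equation 1.25 for the N-panchromatic row cases.
    c = N if n % (N + 1) == 0 else -1
    shifted = xi - n / (N + 1)
    B = np.cos(2 * np.pi * shifted)
    H = 2.0 / (N + 1) ** 2 * np.sin((N + 1) * np.pi * shifted) ** 2
    return (c + B + H) / (N + 1)

for N in (1, 2, 3):
    xi = np.array([0.0, 1e-4])
    a = A(0, xi, N)
    print(f"N={N}: A_0(0)={a[0]:.6f}, finite-difference slope at 0 = {(a[1]-a[0])/1e-4:.2e}")
```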
FIGURE 1.21
Alternating panchromatic interpolation: (a) linear interpolation error map, (b) cubic interpolation error map,
and (c) fully processed image.
cubic interpolation is used. This is why Figure 1.21a appears to be a blend of Figures 1.17c
(linear in both directions) and 1.21b (cubic in both directions). As a result, the linear interpolation
method appears to have marginally lower error overall than the cubic interpolation
method, at least along the vertical axis.
Color difference interpolation is done in the standard nonadaptive manner. Again refer-
ring to the CFA pattern of Figure 1.19d, the green color difference interpolation can be cast
as a convolution with the following kernel:
b_G = (1/4) [1 2 1; 2 4 2; 1 2 1].
The corresponding transfer function is
A_{mn} = (1/4) [1 + cos(2πξ)] [1 + cos(2πη)].
In the case of the red channel, M = 4, N = 2, x0 = 1, and y0 = 1. The corresponding
convolution kernel and frequency response are as follows:
b_RB = (1/8) [1 2 3 4 3 2 1; 2 4 6 8 6 4 2; 1 2 3 4 3 2 1],
F_D′ = Σ_{m=−∞}^{∞} Σ_{n=−∞}^{∞} A_{mn}(ξ, η) F_D(ξ − m/4, η − n/2),
A_{mn} = (e^{−iπ(m/2 + n)} / 8) [1 + (3/2)cos(2πξ) + cos(4πξ) + (1/2)cos(6πξ)] [1 + cos(2πη)].
Finally, in the case of the blue channel, M = 4, N = 2 , and x0 = −1, y0 = −1. The
convolution kernel bRB is used for both the red and blue channels, providing
F_D′ = Σ_{m=−∞}^{∞} Σ_{n=−∞}^{∞} A_{mn}(ξ, η) F_D(ξ − m/4, η − n/2),
A_{mn} = (e^{iπ(m/2 + n)} / 8) [1 + (3/2)cos(2πξ) + cos(4πξ) + (1/2)cos(6πξ)] [1 + cos(2πη)].
The aliasing characteristics of Figure 1.19d can be observed in Figure 1.21c. The aliasing
patterns along the edge of the image are different from the Bayer case, and some new faint
bands have appeared along the horizontal axis halfway out from the center.
u = (9/2) |δ(x + 1) − δ(x − 1)| + |−δ(x + 3) + 2δ(x) − δ(x − 3)|
  ⇒ 9 |δ(x + 1) − δ(x − 1)| + 2 |−δ(x + 3) + 2δ(x) − δ(x − 3)|.
The general solution with cubic interpolation and N = 2 has the same functional form as
Equation 1.31 with a different transfer function:
A_n = {cos(2nπ/3) [6 + 4cos(2πξ) − cos(4πξ)] + sin(2nπ/3) [4sin(2πξ) + sin(4πξ)]} / 9.  (1.33)
The fundamental components of Equations 1.32 and 1.33 are plotted in Figure 1.14 as
“double panchromatic linear” and “double panchromatic cubic,” respectively. Interpolation
error maps of these algorithms assuming the pattern of Figure 1.19b are shown in Fig-
ure 1.22. Using the aliasing patterns as a visual guide, no more than subtle differences can
be seen between the two error maps. It would appear that both interpolation methods are
comparable.
A benefit of the CFA pattern of Figure 1.19b is that all three color difference channels can
be interpolated in the same manner. The corresponding convolution kernel is expressed as
b_RGB = (1/9) [1 2 3 2 1; 2 4 6 4 2; 3 6 9 6 3; 2 4 6 4 2; 1 2 3 2 1].
FIGURE 1.22
Double panchromatic interpolation: (a) linear interpolation error map, (b) cubic interpolation error map, and (c) fully processed image.
FIGURE 1.23
Triple panchromatic interpolation: (a) linear interpolation error map, (b) cubic interpolation error map, and (c)
fully processed image.
F_P′ = Σ_{n=−∞}^{∞} A_n(ξ) F_P(ξ − n/4),  (1.34)
A_n = {8(−1)^n + 8cos(nπ/2) [2 + cos(2πξ)] + 8sin(nπ/2) sin(2πξ) + sin(4πξ)} / 32.  (1.35)
The general solution with cubic interpolation and N = 3 has the same functional form as
Equation 1.34 with a different transfer function:
A_n = {8(−1)^n + cos(nπ/2) [16 + 9cos(2πξ) − cos(6πξ)] + sin(nπ/2) [9sin(2πξ) + sin(6πξ)]} / 32.  (1.36)
The fundamental components of Equations 1.35 and 1.36 are plotted in Figure 1.14 as
“triple panchromatic linear” and “triple panchromatic cubic,” respectively. Interpolation
error maps of these algorithms assuming the pattern of Figure 1.19c are shown in Fig-
ure 1.23. A crossover has occurred, with the cubic interpolation method now clearly
producing less error overall than the linear interpolation method.
Color difference interpolation is once again done in the standard nonadaptive manner.
Referring to the CFA pattern of Figure 1.19c, the green color difference interpolation can
A_{mn} = (e^{iπ(m−n)/2} / 16) {1 + 2[(3/4)cos(2πξ) + (1/2)cos(4πξ) + (1/4)cos(6πξ)]} {1 + 2[(3/4)cos(2πη) + (1/2)cos(4πη) + (1/4)cos(6πη)]}.
Finally, in the case of the blue channel, M = 4, N = 4, x0 = 1, and y0 = −1. The convo-
lution kernel bRB is used for both the red and blue channels. The corresponding frequency
response is as follows:
F_D′ = Σ_{m=−∞}^{∞} Σ_{n=−∞}^{∞} A_{mn}(ξ, η) F_D(ξ − m/4, η − n/4),
A_{mn} = (e^{−iπ(m−n)/2} / 16) {1 + 2[(3/4)cos(2πξ) + (1/2)cos(4πξ) + (1/4)cos(6πξ)]} {1 + 2[(3/4)cos(2πη) + (1/2)cos(4πη) + (1/4)cos(6πη)]}.
The aliasing characteristics of Figure 1.19c can be observed in Figure 1.23c. Colored
aliasing patterns are evident along the edges of the image, half-way to the corners from
both the horizontal and vertical axes. There are also four strong aliasing patterns half-way
out from the center in both the horizontal and vertical directions. Two of these patterns,
along the color pixel diagonal of Figure 1.19c, are colored, whereas the other two patterns
are neutral (that is, luminance patterns). There are also strong luminance aliasing patterns
in the corners of the image itself.
FIGURE 1.24
Two-color alternating panchromatic interpolation: (a) linear interpolation error map, (b) cubic interpolation
error map, (c) fully processed image using Figure 1.19a, and (d) fully processed image using Figure 1.19e.
A_{mn} = 1/8 + (1/4) [(9/16)cos(2πξ) + (1/4)cos(4πξ) + (1/16)cos(6πξ)]
       + (1/4) [(9/16)cos(2πη) + (1/4)cos(4πη) + (1/16)cos(6πη)]
       + (1/2) [(1/2)cos(2πξ, 2πη) + (3/16)cos(2πξ, 4πη) + (3/16)cos(4πξ, 2πη)].
Adjustments are made to x0 and y0 in the case of the red and blue channels. For the red
channel the results of Section 1.3.3.1, M = 4, N = 4, x0 = 2, and y0 = 0 are used. The
corresponding convolution kernel bRB is as described in the previous section whereas the
F_D′ = Σ_{m=−∞}^{∞} Σ_{n=−∞}^{∞} A_{mn}(ξ, η) F_D(ξ − m/4, η − n/4),
A_{mn} = ((−1)^m / 16) {1 + 2[(3/4)cos(2πξ) + (1/2)cos(4πξ) + (1/4)cos(6πξ)]} {1 + 2[(3/4)cos(2πη) + (1/2)cos(4πη) + (1/4)cos(6πη)]}.
For the blue channel, M = 4, N = 4, x0 = 0, and y0 = 2. The convolution kernel bRB is used
for both the red and blue channels. The corresponding frequency response is as follows:
F_D′ = Σ_{m=−∞}^{∞} Σ_{n=−∞}^{∞} A_{mn}(ξ, η) F_D(ξ − m/4, η − n/4),
A_{mn} = ((−1)^n / 16) {1 + 2[(3/4)cos(2πξ) + (1/2)cos(4πξ) + (1/4)cos(6πξ)]} {1 + 2[(3/4)cos(2πη) + (1/2)cos(4πη) + (1/4)cos(6πη)]}.
Color difference interpolation for the CFA pattern of Figure 1.19e provides a minor twist
over the other patterns. This pattern can be viewed as consisting of diagonal pairs of like-
colored pixels. Assuming the salient high spatial frequency information of the image is
contained in the panchromatic channel, the option exists to treat each diagonal pair of color
pixels as a single larger pixel for the purposes of noise cleaning.4 Therefore, a strategy
that largely averages adjacent diagonal pixel pairs is used for color difference interpola-
tion. Beginning with the green channel, it is treated as the sum of two diamond-shaped
neighborhoods:
s_D = (1/8) comb((x − y)/4, (x + y)/4) + (1/8) comb((x − y)/4, (x + y − 2)/4).
4 In all the other CFA patterns of Figure 1.19, the pixels of a given color are separated by at least one panchro-
matic pixel. Averaging these more widely spaced pixels would introduce greater amounts of color aliasing into
the demosaicked image.
A_{mn} = (1 + (−1)^n)/16 + ((1 + (−1)^n)/8) [(9/16)cos(2πξ) + (1/4)cos(4πξ) + (1/16)cos(6πξ)]
       + ((1 + (−1)^n)/8) [(9/16)cos(2πη) + (1/4)cos(4πη) + (1/16)cos(6πη)]
       + ((1 + (−1)^n)/4) [(1/2)cos(2πξ, 2πη) + (3/16)cos(2πξ, 4πη) + (3/16)cos(4πξ, 2πη)].
The same approach is used for red and blue color difference interpolation. The sampling
function is the sum of two rectilinear grids and the interpolating function is scaled by one-
half. The red channel is considered first:
s_D = (1/16) comb(x/4, (y − 2)/4) + (1/16) comb((x − 1)/4, (y − 3)/4).
As a consequence of having two rectilinear neighborhoods, the interpolating function must
be scaled by 1/2, resulting in the following:
b_RB = (1/2) tri(x/4, y/4) comb(x, y).
The convolution kernel is the same for the red and blue channels and is defined as follows:
b_RB = (1/32) [1 2 3 4 3 2 1; 2 4 6 8 6 4 2; 3 6 9 12 9 6 3; 4 8 12 16 12 8 4; 3 6 9 12 9 6 3; 2 4 6 8 6 4 2; 1 2 3 4 3 2 1].
The resulting frequency response is given by
F_D′ = Σ_{m=−∞}^{∞} Σ_{n=−∞}^{∞} A_{mn}(ξ, η) F_D(ξ − m/4, η − n/4),
A_{mn} = (((−1)^n + e^{−iπ(m + 3n)/2}) / 32) [1 + (3/2)cos(2πξ) + cos(4πξ) + (1/2)cos(6πξ)] [1 + (3/2)cos(2πη) + cos(4πη) + (1/2)cos(6πη)].
The blue channel frequency response requires only a change to the phase term in the transfer
functions.
A_{mn} = (((−1)^m + e^{−iπ(3m + n)/2}) / 32) [1 + (3/2)cos(2πξ) + cos(4πξ) + (1/2)cos(6πξ)] [1 + (3/2)cos(2πη) + cos(4πη) + (1/2)cos(6πη)].
The aliasing characteristics of Figure 1.19a can be observed in Figure 1.24c and the
aliasing patterns for Figure 1.19e are shown in Figure 1.24d. The predominant aliasing
patterns occur half-way out from the center with Figure 1.19a having four such patterns,
whereas Figure 1.19e has only two.
1.3.6 Comments
From the foregoing analysis a number of conclusions can be drawn. Even the simplest
adaptive demosaicking of the luminance (i.e., green or panchromatic) channel produces
greater image reconstruction fidelity than nonadaptive demosaicking, as illustrated in Fig-
ure 1.17. The best forms of adaptive demosaicking are either linear interpolation of lumi-
nance combined with appropriately weighted color Laplacians or cubic interpolation of lu-
minance values alone, for example, Figure 1.21. In a four-channel system, color aliasing in
the demosaicked image is determined by the number and arrangement of color pixels within
the CFA pattern. The fewer the number of color pixels present and the more widely they
are separated, the greater the resulting aliasing. Compare Figure 1.21c, which has a high
number of closely spaced color pixels, to Figure 1.24c, which has a low number of widely
spaced color pixels. Of the four-channel CFA patterns discussed (see Figure 1.19), the
pattern of Figure 1.19d demosaicked with a combined linear and cubic interpolation
strategy produces the highest overall reconstruction fidelity with the least low-frequency
color aliasing. It should be noted, however, that there are other possible considerations
when selecting a CFA pattern, most notably signal-to-noise performance (see Section 1.4).
With the opportunity to average diagonally adjacent color pixels, the CFA pattern in Fig-
ure 1.19e can be a better choice for certain applications, for instance, low light imaging. As
with all such trade-offs, the relative importance of aliasing versus signal-to-noise needs to
be assessed on a case-by-case basis.
conductor in proportion to the number of incoming photons, and those electrons are gathered
within the imaging chip. Image capture is therefore essentially a photon-counting process.
As such, image capture is governed by the Poisson distribution, which is defined with a
photon arrival rate variance equal to the mean photon arrival rate. The arrival rate variance
is a source of image noise because if a uniformly illuminated, uniform color patch is cap-
tured with a perfect optical system and sensor, the resulting image will not be uniform but
rather have a dispersion about a mean value. The dispersion is called image noise because
it reduces the quality of an image when a human is observing it [84].
Image noise can also be structured, as is the case with dead pixels or optical pixel cross-
talk [85]. This book chapter does not discuss structured noise, but rather focuses on the
Poisson-distributed noise (also called shot noise) with the addition of electronic amplifier
read noise, which is modeled with a Gaussian distribution [85]. A pixel value Q may
be modeled as Q = kQ (q + g), where kQ is the amplifier gain, q is a Poisson variable with
mean mq and variance σq2 , and g is a Gaussian variable with mean mg and variance σg2 . Note
that σq2 = mq since q is a Poisson variable, and it is entirely defined by the spectral power
distribution impinging upon the sensor and the channel spectral responsivities. The mean
signal level of pixel Q is derived from the pixel model and is written as mQ = kQ (mq + mg ).
An objective measure of image noise is the signal-to-noise ratio (SNR). To increase the
perceived quality of an image it is desirable to increase the SNR [86]. The SNR is defined
as the signal mean level divided by the signal standard deviation and in this case the SNR
of a pixel is
SNR_Q = k_Q (m_q + m_g) / [k_Q² (σ_q² + σ_g²)]^{1/2} = (m_q + m_g) / (σ_q² + σ_g²)^{1/2}.  (1.39)
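The pixel model and Equation 1.39 can be checked with a short Monte Carlo sketch in Python; the gain, signal level, and read-noise parameters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

k_Q = 2.0                  # amplifier gain (arbitrary)
m_q = 400.0                # mean photo-electron count (arbitrary)
m_g, sigma_g = 5.0, 4.0    # read-noise mean and standard deviation (arbitrary)

# Pixel model Q = k_Q (q + g) with q ~ Poisson(m_q) and g ~ Normal(m_g, sigma_g).
q = rng.poisson(m_q, size=1_000_000)
g = rng.normal(m_g, sigma_g, size=q.size)
Q = k_Q * (q + g)

snr_measured = Q.mean() / Q.std()
snr_formula = (m_q + m_g) / np.sqrt(m_q + sigma_g ** 2)   # Equation 1.39 with sigma_q^2 = m_q
print(f"measured SNR = {snr_measured:.3f}, Equation 1.39 = {snr_formula:.3f}")
```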
FIGURE 1.25
Simulated images of two identical sensors that have different filters: (a) green filters only, and (b) panchromatic filters only.
Although the noise level of the panchromatic channel is lower than that of a color channel, the panchromatic channel may still exhibit visible or even
objectionable noise under low exposure conditions. Additionally, the chroma information is
obtained solely from the noisier color channels. It is therefore important to include noise
reduction techniques for both the panchromatic and color channels in the image processing
chain. Image fusion techniques may be included in the color noise reduction to again ex-
ploit the increased SNR of the panchromatic image. Such noise reduction techniques will
be discussed later in this section.
With these suppositions, the noise variance may be propagated through the demosaicking
process that is described in Section 1.3. For simplicity, suppose that the color Laplacian
image fusion component is not used to demosaick the panchromatic channel. Also for
simplicity, only the noise variance of one color channel C is derived here. It is of course
understood that the noise variance for any number of color channels may be propagated the
same way.
where Tr (a) is the trace of matrix a and the superscript T denotes a transposed kernel. The
variance of the color difference, σD2 , is obtained by adding the noise contribution from a
color channel C, gained by kC , to the noise contribution from the demosaicked panchro-
matic. Therefore,
σ_D² = k_C² σ_C² + σ_P² Tr(b b^T).
Now suppose that a kernel d is used to demosaick the color differences. By defining a
kernel e = −b ∗ d, where ∗ is the convolution operator, the variance of the demosaicked
color differences may be written as follows:
σ_D′² = k_C² σ_C² Tr(d d^T) + σ_P² Tr(e e^T).
To finally derive the variance for the demosaicked color, the contribution from the
panchromatic pixel at the same location as the color pixel that is being demosaicked must
be added to the contribution from the demosaicked color difference. This panchromatic
contribution depends on whether that particular panchromatic pixel is an original sample or
the result of demosaicking. Let a kernel f be equal to b (zero-padded to make it the same
size as e) if the color difference demosaicking is centered on a color pixel and equal to the
discrete delta function if the color difference demosaicking is centered on a panchromatic
pixel. The variance of the demosaicked color is therefore written as
σ_C′² = k_C² σ_C² Tr(d d^T) + σ_P² Tr[(e + f)(e + f)^T].  (1.42)
In order for the SNR of the demosaicked color pixels to be equal to or greater than that of the
gained original color pixels, the following constraint is introduced:
σ_C′² ≤ k_C² σ_C².  (1.43)
The implication upon kC that results from this expression depends on the negligibility of
the read noise. The case where the read noise is negligible is discussed next, followed by
the case where the read noise is considered.
For the case that the read noise is negligible, the gain given by Equation 1.41 is defined
with Poisson variances. Equation 1.41 with Poisson variances and Equation 1.43 together
imply that
k_C ≥ Tr[(e + f)(e + f)^T] / (1 − Tr(d d^T)).  (1.44)
Note that this result is independent of signal levels. As a simple example, suppose that
neighbor averaging is used in only the horizontal dimension to demosaick the neighborhood
shown in Figure 1.20a. Then, b = d = [0.5 0.0 0.5] and f = [0 0 1 0 0]. In this case
kC must be greater than or equal to 3/4 for the SNR of the demosaicked color pixels to be
equal to or greater than that of the gained original color pixels. Of course, the higher kC is made,
the better the SNR of the demosaicked color pixels as compared to the SNR of the gained
original color pixels.
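The bound of Equation 1.44 is easy to evaluate for arbitrary kernels, as in the Python sketch below, which reproduces the 3/4 figure for the example kernels (f is assumed to be supplied on the same support as e).

```python
import numpy as np

def kc_lower_bound(b, d, f):
    # Equation 1.44: k_C >= Tr[(e + f)(e + f)^T] / (1 - Tr(d d^T)), with e = -b * d
    # (convolution). For 1-D kernels, Tr(a a^T) reduces to the sum of squared taps.
    e = -np.convolve(b, d)              # same support as f in this example
    f = np.asarray(f, dtype=float)
    num = np.sum((e + f) ** 2)
    den = 1.0 - np.sum(np.asarray(d, dtype=float) ** 2)
    return num / den

b = d = [0.5, 0.0, 0.5]
f = [0.0, 0.0, 1.0, 0.0, 0.0]
print(kc_lower_bound(b, d, f))          # prints 0.75, matching the text
```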
FIGURE 1.26
A graph representation of Equation 1.46 where kC is plotted against ρ for (a) σ_g = 4, (b) σ_g = 8, (c) σ_g = 16,
and (d) σ_g = 64. Each subfigure depicts the graphs of m_C − m_g values between 0 and 250 in steps of 10. The
demosaicking kernels are b = d = [0.5 0.0 0.5] and f = [0 0 1 0 0].
When the read noise is significant, the color and panchromatic variances include both
the Poisson noise and the Gaussian noise contributions. Therefore, in this case σ_C² = σ_CP² + σ_g²
and σ_P² = σ_PP² + σ_g², where σ_CP² and σ_PP² are the Poisson variances for the color and
panchromatic pixels, respectively. By definition of the Poisson distribution, σ_CP² = m_C − m_g
and σ_PP² = m_P − m_g, where m_C and m_P are the mean signal responses of the color and
panchromatic pixels, respectively. The color gain is defined in this case as follows:
k_C = (m_P − m_g) / (m_C − m_g) = σ_PP² / σ_CP².  (1.45)
Using Equations 1.45 and 1.43, in order to have a demosaicked color pixel SNR equal to
or greater than the gained original color pixels the following inequality must hold:
k_C ≥ (σ_P / σ_C) √( Tr[(e + f)(e + f)^T] / (1 − Tr(d d^T)) ).  (1.46)
This result is not independent of signal levels and the right-hand side of the inequality,
which from now on is denoted as ρ , cannot be computed without first defining kC . To
illustrate the behavior of the inequality, the same demosaicking kernels as in the previous
simple example are used along with mC − mg values between 0 and 250 in steps of 10.
Figure 1.26a shows plots of kC against ρ for the case where σg is 4, where the graph for an
mC − mg of 0 is vertical and the graph for an mC − mg of 250 is the one with the least slope
for a given value of ρ . The dashed line indicates the values where kC is equal to ρ , and
therefore any value of kC that intersects a graph above the dashed line will yield the SNR
of the demosaicked color pixels that is equal to or greater than the gained original color
pixel for the mC − mg associated with the intersected graph. Figures 1.26b to 1.26d show
the same plots for the σg values of 8, 16, and 64, respectively. It is evident from the graphs
that at very low mC − mg , the read noise dominates over the shot noise in Equation 1.46 and
in the limit that mC − mg goes to zero, which is the same as σCP and σPP going to zero, ρ
becomes independent of both kC and σg because
lim_{σ_CP → 0} σ_P/σ_C = √(0 + σ_g²) / √(0 + σ_g²) = 1,
and therefore
lim_{σ_CP → 0} ρ = √( Tr[(e + f)(e + f)^T] / (1 − Tr(d d^T)) ).  (1.47)
If it is required that the SNR of the demosaicked color pixels be equal to or greater than that of the
gained original color pixels for any color pixel value, then the gain must be chosen to be
this limiting value in Equation 1.47, which is equal to √(3/4) for the demosaicking kernels
in the previous example calculation.
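Because ρ depends on k_C through σ_P, the smallest admissible gain under Equation 1.46 must be found implicitly; the Python sketch below does so with a simple scan for the same example kernels. The signal levels and σ_g value are arbitrary.

```python
import numpy as np

base = np.sqrt(0.375 / 0.5)   # sqrt(Tr[(e+f)(e+f)^T] / (1 - Tr(d d^T))) for the example kernels

def rho(k_C, m_C_minus_m_g, sigma_g):
    # Right-hand side of Equation 1.46 with Poisson-plus-read-noise variances:
    # sigma_C^2 = (m_C - m_g) + sigma_g^2 and sigma_P^2 = k_C (m_C - m_g) + sigma_g^2.
    sigma_C2 = m_C_minus_m_g + sigma_g ** 2
    sigma_P2 = k_C * m_C_minus_m_g + sigma_g ** 2
    return np.sqrt(sigma_P2 / sigma_C2) * base

# Smallest k_C satisfying k_C >= rho(k_C) for a few signal levels at sigma_g = 8.
for signal in (10.0, 50.0, 250.0):
    ks = np.linspace(0.5, 4.0, 3501)
    ok = ks[ks >= rho(ks, signal, sigma_g=8.0)]
    print(f"m_C - m_g = {signal:5.0f}: smallest admissible k_C ~ {ok[0]:.3f}")
```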
technique with panchromatic image fusion is shown to yield a lower NPS everywhere but
below about 0.035 cycles/sample. Any four-channel CFA will yield a similar situation
because it necessarily has the color pixels further away from each other to make room for
the panchromatic pixels. The exact frequency where the four-channel NPS becomes larger
than the Bayer NPS is dependent upon the specific four-channel CFA and the demosaicking
technique. Any noise reduction strategy for a four-channel CFA image must address this
low-frequency noise as well as the high-frequency noise.
Given the demosaicking strategies discussed in Section 1.3, it is clear that to take the most
advantage of the panchromatic channel it is best to apply noise reduction to the panchro-
matic pixels before any demosaicking is done. Any single-channel noise reduction strategy
may be used to process the panchromatic pixels, therefore only color-pixel noise reduc-
tion techniques are discussed in this section. It is assumed for the rest of this section that
the panchromatic pixels are median-filtered and boxcar-filtered. Since the green channel is
used as a substitute for the panchromatic channel in the Bayer CFA comparison, this sec-
tion also assumes that the Bayer green pixels are median filtered and boxcar filtered. The
assumed median filter is an adaptive filter as described in Reference [87] with a window
size that includes 9 of the nearest panchromatic (or green for the Bayer CFA) pixels. The
assumed boxcar filter kernel is adjusted for each type of CFA image such that it includes
15 of the nearest pixels.
FIGURE 1.28
(left) Noise power spectra (NPS) of the demosaicked red channel from a Bayer CFA and a four-channel CFA
shown in Figure 1.27 and (right) the associated difference of the four-channel NPS minus Bayer NPS. The
NPS and difference plots are shown for the cases of: (a,b) no noise reduction, (c,d) median filtering only, (e,f)
median and boxcar filtering only, and (g,h) median, boxcar, and low-frequency filtering. The horizontal axis is
in units of cycles per sample.
noise, which has spectral power from low to very high frequencies. Figures 1.27c and 1.27f
show the respective effects of median filtering the colors before demosaicking the Bayer
and the four-channel CFA images with an adaptive filter as described in Reference [87] with
window sizes that include nine like pixels in the filtering. The NPS and NPS-difference
plots in Figures 1.28c and 1.28d show that the overall effect is of slightly lowering the NPS
nearly uniformly across all frequencies. However, the appearance of the images is changed
the most at the highest frequencies. It is also clear that both images could be improved
with more noise reduction. A boxcar filter is a low pass filter that can be effectively used to
reduce high frequencies in images. Figures 1.27c and 1.27g show the demosaicked results
of boxcar filtering the Bayer and four-channel CFA images after median filtering. The NPS
plots in Figures 1.28e and 1.28f show that for the image demosaicked from the Bayer CFA
the noise has been reduced for all but the lowest frequencies, but for the image demosaicked
from the four-channel CFA the noise still has some mid-frequency power. Again, one of
the effects of having panchromatic pixels is also shown; the low-frequency portion of the
NPS is much higher than that of the Bayer image because the color pixels are farther away
from each other in the four-channel CFA than in the Bayer CFA. Both images again could
be improved with more noise reduction.
Σ_{m=−M}^{M} Σ_{n=−N}^{N} g_C(m, n) [f_κ(x, y) s_C(x, y) − f_C(x − m, y − n) s_C(x − m, y − n)] =
Σ_{m=−M}^{M} Σ_{n=−N}^{N} g_C(m, n) [f_P′(x, y) s_C(x, y) − f_P′(x − m, y − n) s_C(x − m, y − n)],
where g_C is a normalized low-pass kernel with a low cutoff frequency that attenuates the
desired frequencies, f_κ is the noise-reduced color channel to be determined, s_C is a
color sampling function, and x and y denote the location of the pixel being noise-reduced.
Since g_C is normalized, the equality may be rewritten as
f_κ(x, y) s_C(x, y) − Σ_{m=−M}^{M} Σ_{n=−N}^{N} g_C(m, n) f_C(x − m, y − n) s_C(x − m, y − n) =
f_P′(x, y) s_C(x, y) − Σ_{m=−M}^{M} Σ_{n=−N}^{N} g_C(m, n) f_P′(x − m, y − n) s_C(x − m, y − n).
Rearranging terms and switching away from the coordinate notation gives
(f_κ − f_P′) s_C = g_C ∗ [(f_C − f_P′) s_C].
The quantities within parentheses are color differences, so this last equality may be written
as
f_Δ s_C = g_C ∗ (f_D s_C),  (1.48)
where f_D is the noisy color-difference image and f_Δ is a noise-reduced version of f_D.
FIGURE 1.29
Linear sharpening for image enhancement.
The heuristic explanation for why this method tends to preserve image content is that
f∆ is a low-frequency signal and the final color image is derived by adding the panchro-
matic channel, which contains the high-frequency content. Figure 1.27h shows the result
of applying Equation 1.48 to the four-channel CFA image before demosaicking and after
median and boxcar filtering. For comparison, Figure 1.27c shows the equivalent noise re-
duction for a Bayer image, where the green channel is substituted for the panchromatic
channel in Equation 1.48. For the red channel of the four-channel example, g_C is defined as
a 33 × 33 kernel that, when centered on a red pixel, is 1/153 at the red pixel locations and
zero elsewhere. This means that g_C averages together 153 red color differences (for this
case, red minus panchromatic). For the red channel of the Bayer example, g_C is defined
as a 33 × 17 kernel that, when centered on a red pixel, is 1/153 at the red pixel locations
and zero elsewhere. This means that this version of g_C also averages together 153 red color
differences (for this case, red minus green). Note from the examples that even though the
same number of color differences are averaged together, because the panchromatic channel
has a higher SNR than the Bayer green channel, the final image simulated with the four-
channel CFA has much lower noise. The NPS plots in Figures 1.28g and 1.28h support the
visual appearance of the images by showing that the NPS of the Bayer image is now larger
than the NPS of the image from the four-channel CFA. To show that this method of noise
cleaning indeed preserves image content, Section 1.5 contains a discussion and results of
the full image processing chain. The noise reduction results obtained in Section 1.5 are
typical of the noise reduction techniques discussed in this section.
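A minimal Python sketch of the cleaning step of Equation 1.48 is given below. The small dense averaging kernel and the normalized convolution over the color sites are stand-ins for the large sparse g_C kernels described above, chosen only to keep the example short.

```python
import numpy as np
from scipy.signal import convolve2d

def clean_color_differences(f_C, s_C, f_P, g_size=9):
    """Low-frequency color noise cleaning in the spirit of Equation 1.48.

    f_C: noisy color samples (valid only where s_C == 1)
    s_C: color sampling function (1 at color sites, 0 elsewhere)
    f_P: noise-reduced panchromatic channel (fully populated)
    The small g_size x g_size averaging kernel is a stand-in for the large
    sparse kernels described in the text.
    """
    g = np.ones((g_size, g_size))
    f_D = (f_C - f_P) * s_C                        # noisy color differences on the color lattice
    num = convolve2d(f_D, g, mode="same")          # g_C * (f_D s_C), normalized below
    den = convolve2d(s_C.astype(float), g, mode="same")
    f_delta = num / np.maximum(den, 1e-12)
    return (f_delta + f_P) * s_C                   # cleaned color back at the color sites

# Usage: a flat scene with noisy color samples recovers the true color closely.
rng = np.random.default_rng(1)
s_C = np.zeros((64, 64)); s_C[::4, ::4] = 1
f_P = np.full((64, 64), 100.0)
f_C = (130.0 + rng.normal(0, 8, (64, 64))) * s_C
cleaned = clean_color_differences(f_C, s_C, f_P)
print(cleaned[s_C == 1].std())   # noticeably smaller than the original 8
```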
FIGURE 1.30
Four-channel demosaicked images: (a) color image, and (b) panchromatic image.
FIGURE 1.31
High-frequency information extracted from: (a) the green channel, and (b) the panchromatic image.
of linear sharpening in which the blurred image is also used as the reference image, that is,
f(x, y) = f_R(x, y). This method works well in many applications; however, it is extremely
sensitive to noise. This leads to undesirable distortions, especially in flat regions. Typically,
nonlinear coring functions are used to mitigate this type of distortion [93].
An example of linear sharpening is presented next. Figure 1.30 shows a four-channel
demosaicked color and a corresponding panchromatic image. The flat, gray patches shown
in the upper left quadrant of these images were inserted for measuring the noise presented
in these images. It is evident that the color image shown in Figure 1.30a is noisy and soft.
Therefore, a linear sharpening algorithm was applied to enhance its visual appearance.
FIGURE 1.32
Linear sharpening results: (a) sharpened using green channel high-frequency information, and (b) sharpened
using panchromatic high-frequency information.
The high-frequency details extracted from the green channel of the blurred color image
of Figure 1.30a and the corresponding panchromatic image are shown in Figures 1.31a
and 1.31b, respectively. The sharpened image shown in Figure 1.32a was obtained by
adding the green channel high-frequency information to the blurred color image. Sim-
ilarly, the sharpened image of Figure 1.32b was estimated by adding the panchromatic
high-frequency information to the blurred color image. The standard deviations of the gray
patches before and after sharpening are summarized in Table 1.3. Clearly, the panchromatic
image was less noisy than the blurred color image. Furthermore, it proved to be a better
source of high-frequency information for sharpening than the green channel.
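A minimal Python sketch of this form of linear sharpening is shown below: high-frequency detail is extracted from the panchromatic image by a blur-and-subtract and added to every color channel. The uniform blur, unit gain, and absence of a coring function are illustrative simplifications.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sharpen_with_pan(color, pan, blur_size=5, gain=1.0):
    """Linear sharpening: add panchromatic high-frequency detail to the color image.

    color: H x W x 3 demosaicked color image
    pan:   H x W panchromatic image
    The uniform blur and unit gain are illustrative choices, not the chapter's
    exact filter; a coring function would normally be applied to the detail.
    """
    detail = pan - uniform_filter(pan, size=blur_size)   # high-frequency record
    return color + gain * detail[..., np.newaxis]

# Usage with a synthetic soft edge.
x = np.linspace(0, 1, 128)
pan = np.tile(1.0 / (1.0 + np.exp(-20 * (x - 0.5))), (128, 1))
color = np.repeat(pan[..., np.newaxis], 3, axis=2) * 0.9
sharpened = sharpen_with_pan(color, pan)
print(sharpened.shape)
```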
TABLE 1.3
Standard deviations of gray patches before and after sharpening.
FIGURE 1.33
Image processing chain example, Part 1: (a) original CFA image, (b) noise-cleaned image, and (c) demosaicked
image.
The first step is to select the CFA pattern to be used with the sensor. In this case, the
CFA pattern of Figure 1.19e will be used. Using this pattern, a simulation of a capture is
generated and shown in Figure 1.33a. For the purposes of making noise measurements, a
flat, gray patch is placed in the upper left quadrant of the image. The first thing that can
be seen is that the image has enough noise that the underlying CFA pattern
is somewhat hard to discern. Table 1.4 shows the standard deviations of the gray patch
for each of the four channels present in the image. It can be seen that the panchromatic
channel standard deviation is roughly half of the color channel standard deviations. The
lower noise of the panchromatic channel is a consequence of its broader spectral response
and associated greater light sensitivity.
Once the original CFA image is in hand, it is noise-cleaned (denoised). Using the meth-
ods described in Section 1.4, the CFA image is cleaned to produce the image shown in
Figure 1.33b. The CFA pattern is now evident as a regular pattern, especially in the region
of the life preserver in the lower left-hand corner of the image. The standard deviations
(Table 1.4) have been reduced significantly with the panchromatic channel still having a
lower amount of noise.
TABLE 1.4
Standard deviations of gray patch in image processing chain example.
FIGURE 1.34
Image processing chain example, Part 2: (a) color and tone corrected image, and (b) sharpened image.
After noise cleaning, the image is demosaicked as discussed in Section 1.3. Figure 1.33c
shows the result; color has been restored, although it is desaturated and flat. As there are
now only three color channels in the image, Table 1.4 no longer has a panchromatic entry.
Of more interest is how the demosaicked color channels now have the same noise standard
deviation as the noise-cleaned panchromatic channel. This is a consequence of the image
fusion techniques used, specifically as accomplished through the use of color differences.
The color along with the tone scale of the image is next corrected using the techniques
described in Section 1.2 with the results shown in Figure 1.34a. Since color correction
and tone scaling are generally signal amplifying steps, the standard deviations of the gray
patch increase. In this particular case, these corrections are relatively mild, so the noise
amplification is correspondingly low.
As the image is still lacking a bit in sharpness, the final step is to sharpen the image as
shown in Figure 1.34b. As indicated in Table 1.4, the standard deviations have been sig-
nificantly amplified, although they are still about half of what the original CFA image had.
A nonadaptive sharpening algorithm was used here. An adaptive algorithm capable of rec-
ognizing flat regions and reducing the sharpening accordingly would reduce the resulting
noise amplification for the gray patch region. Still, the final image is certainly acceptable.
TABLE 1.5
Standard deviations of gray patch in Bayer image pro-
cessing chain example.
FIGURE 1.35
Bayer image processing chain example, Part 1: (a) original CFA image, (b) noise-cleaned image, and (c)
demosaicked image.
FIGURE 1.36
Bayer image processing chain example, Part 2: (a) color and tone corrected image, and (b) sharpened image.
As a comparison, the simulation is repeated using the Bayer CFA pattern and processing
as shown in Figure 1.35 and Figure 1.36. The standard deviations of the gray patch are
given in Table 1.5. It can be seen that the original CFA image starts off with the same
amount of noise as in the four-channel case. Noise cleaning produces results comparable
to before. The demosaicking step produces the first notable differences as the color differ-
ences cannot benefit from a significantly less noisy luminance channel. The resulting noise
is amplified by the color and tone scale correction step as before. Finally, the sharpening
operation is performed and the resulting noise level has almost returned to that of the orig-
inal CFA image. Comparison with the four-channel system shows roughly a doubling of the
gray patch standard deviations relative to the four-channel example.
1.6 Conclusion
Image fusion provides a way of creating enhanced and even impossible-to-capture im-
ages through the appropriate combination of image components. These components are
traditionally full-image captures acquired from either a system consisting of several spe-
cialized sensors (e.g., each with different spectral characteristics) or as part of a multicap-
ture sequence (e.g., burst or video). This chapter describes a new approach that uses a
single capture from a single sensor to produce the necessary image components for subse-
quent image fusion operations. This capability is achieved by the inclusion of panchromatic
pixels in the color filter array pattern. Inherently, panchromatic pixels will be more light
sensitive, which results in improved signal-to-noise characteristics. Additionally, being
spectrally nonselective, edge and texture detail extracted from the panchromatic channel
will be more complete and robust across the visible spectrum. Image fusion techniques can
then be used to impart these benefits onto the color channels while still preserving color
fidelity. These image fusion techniques are generally implemented as parts of the noise
cleaning, demosaicking, and sharpening operations in the image processing chain. In addition
to the benefit of requiring only one capture to enable image fusion, the noise
cleaning and demosaicking operations described in this chapter work on sparsely sampled
CFA data. This reduction in the amount of data to be processed provides additional effi-
ciency in the application of image fusion techniques.
Acknowledgment
The authors dedicate this chapter to the memories of our dear colleague Michele O’Brien
and Stacy L. Moor, Efraı́n’s well-beloved wife.
Appendix
This appendix provides a derivation of the relationship first appearing in Equation 1.15
and restated below:
MN Σ_{p=−∞}^{∞} Σ_{q=−∞}^{∞} sinc²[M(ξ − p), N(η − q)]
  = [1 + 2 Σ_{j=1}^{M−1} tri(j/M) cos(2πjξ)] [1 + 2 Σ_{k=1}^{N−1} tri(k/N) cos(2πkη)].
FIGURE 1.37
Discrete tri function: (a) tri and comb functions, and (b) delta functions.
Since this relationship is separable, only one dimension needs to be derived, as follows:
M Σ_{p=−∞}^{∞} sinc²[M(ξ − p)] = 1 + 2 Σ_{j=1}^{M−1} tri(j/M) cos(2πjξ).
Figure 1.37a shows the functions tri (x/M) and comb (x) superimposed on coordinate
axes. The result of multiplying these two functions together is shown in Figure 1.37b.
Only a finite number of delta functions remain and these are scaled by the tri function.
Therefore, this discrete form of the tri function can be written as follows:
tri(x/M) comb(x) = δ(x) + Σ_{j=1}^{M−1} tri(j/M) (1/j) δδ(x/j).  (1.50)
Taking the Fourier transform of each side produces the required relationship:
M sinc²(Mξ) ∗ comb(ξ) = 1 + 2 Σ_{j=1}^{M−1} tri(j/M) cos(2πjξ),
M Σ_{p=−∞}^{∞} sinc²[M(ξ − p)] = 1 + 2 Σ_{j=1}^{M−1} tri(j/M) cos(2πjξ).
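The identity can also be confirmed numerically, up to truncation of the infinite sum, as in the short Python sketch below (np.sinc implements sin(πt)/(πt), matching the definition used in this chapter).

```python
import numpy as np

def lhs(xi, M, terms=200):
    # M * sum_p sinc^2[M(xi - p)], truncated to a finite number of replicas.
    p = np.arange(-terms, terms + 1)[:, None]
    return M * np.sum(np.sinc(M * (xi[None, :] - p)) ** 2, axis=0)

def rhs(xi, M):
    # 1 + 2 * sum_j tri(j/M) cos(2 pi j xi), with tri(t) = 1 - |t|.
    return 1 + 2 * sum((1 - j / M) * np.cos(2 * np.pi * j * xi) for j in range(1, M))

xi = np.linspace(-1.0, 1.0, 401)
for M in (2, 3, 4):
    print(M, np.max(np.abs(lhs(xi, M) - rhs(xi, M))))   # small truncation error only
```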
The case of Equation 1.16 is handled in a similar manner, as follows:
tri(x/(2M) − y/(2N), x/(2M) + y/(2N)) comb(x, y)
  = δ(x, y) + Σ_{j=1}^{2M−1} tri²(j/(2M)) (1/j) δδ(x/j) + Σ_{k=1}^{2N−1} tri²(k/(2N)) (1/k) δδ(y/k)
  + Σ_{j=1}^{2M−1} Σ_{k=1}^{2N−1} tri(j/(2M) − k/(2N), j/(2M) + k/(2N)) (1/(jk)) δδ(x/j, y/k).
References
[1] J.T. Compton and J.F. Hamilton Jr., “Image sensor with improved light sensitivity,” U.S. Patent
Application 11/191 729, February 2007.
[2] M. Kokar and K. Kim, “Review of multisensor data fusion architectures,” in Proceedings of
the IEEE International Symposium on Intelligent Control, Chicago, IL, USA, August 1993,
pp. 261–266.
[3] D.L. Hall and J. Llinas, “An introduction to multisensor data fusion,” Proceedings of the IEEE,
vol. 85, no. 1, pp. 6–23, January 1997.
[4] R. Mahler, “A unified foundation for data fusion,” in Proceedings of the Data Fusion System
Conference, Laurel, MD, USA, June 1987.
[5] M.A. Abidi and R.C. Gonzalez, Data Fusion in Robotics and Machine Intelligence, San
Diego, CA: Academic Press, October 1992.
[6] L. Waltz, J. Llinas, and E. Waltz, Multisensor Data Fusion, Boston, MA, USA: Artech House
Publishers, August 1990.
[7] M. Kumar, “Optimal image fusion using the rayleigh quotient,” in Proceedings of the IEEE
Sensors Applications Symposium, New Orleans, LA, USA, February 2009, pp. 269–274.
[8] E. Waltz, “The principles and practice of image and spatial data fusion,” in Proceedings of the
Eighth National Data Fusion Conference, Dallas, TX, USA, March 1995, pp. 257–278.
[9] R.S. Blum, Z. Xue, and Z. Zhang, Multi-Sensor Image Fusion and Its Applications, ch. An
overview of image fusion, R.S. Blum and Z. Liu (eds.), Boca Raton, FL: CRC Press, July
2005, pp. 1–36.
[10] Z. Zhang and R.S. Blum, “A categorization and study of multiscale-decomposition-based im-
age fusion schemes,” Proceedings of the IEEE, vol. 87, no. 8, pp. 1315–1326, August 1999.
[11] R.S. Blum, “Robust image fusion using a statistical signal processing approach,” Information
Fusion, vol. 6, no. 2, pp. 119–128, June 2005.
[12] R. Raskar, A. Agrawal, and J. Tumblin, “Coded exposure photography: Motion deblurring
using fluttered shutter,” ACM Transactions on Graphics, vol. 25, no. 3, pp. 795–804, July
2006.
[13] Y. Chibani, “Multisource image fusion by using the redundant wavelet decomposition,” in
Proceedings of the IEEE Geoscience and Remote Sensing Symposium, Toulouse, France, July
2003, pp. 1383–1385.
[14] M. Hurn, K. Mardia, T. Hainsworth, J. Kirkbride, and E. Berry, “Bayesian fused classification
of medical images,” IEEE Transactions on Medical Imaging, vol. 15, no. 6, pp. 850–858,
December 1996.
[15] D.L. Hall and S.A.H. McMullen, Mathematical Techniques in Multisensor Data Fusion.
Boston, MA, USA: Artech House, 2nd edition, February 2004.
[16] J. Llinas and D. Hall, “A challenge for the data fusion community I: Research imperatives for
improved processing,” in Proceedings of the Seventh National Symposium On Sensor Fusion,
Albuquerque, New Mexico, USA, March 1994.
[17] C. Pohl and J.L. van Genderen, “Multisensor image fusion in remote sensing: concepts, meth-
ods and applications,” International Journal of Remote Sensing, vol. 19, no. 5, pp. 823–854,
March 1998.
[18] M. Daniel and A. Willsky, “A multiresolution methodology for signal-level fusion and data
assimilation with applications to remote sensing,” Proceedings of the IEEE, vol. 85, no. 1,
pp. 164–180, January 1997.
[19] E. Lallier and M. Farooq, “A real time pixel-level based image fusion via adaptive weight av-
eraging,” in Proceedings of the Third International Conference on Information Fusion, Paris,
France, July 2000, pp. WEC3/3–WEC3/13.
[20] G. Pajares and J.M. de la Cruz, “A wavelet-based image fusion tutorial,” Pattern Recognition,
vol. 37, no. 9, pp. 1855–1872, September 2004.
[21] J. Richards, “Thematic mapping from multitemporal image data using the principal compo-
nents transformation,” Remote Sensing of Environment, vol. 16, pp. 36–46, August 1984.
[22] I. Bloch, “Information combination operators for data fusion: A comparative review with clas-
sification,” IEEE Transactions on Systems, Man, and Cybernetics. C, vol. 26, no. 1, pp. 52–67,
January 1996.
[23] Y. Gao and M. Maggs, “Feature-level fusion in personal identification,” in Proceedings of the
IEEE International Conference on Computer Vision and Pattern Recognition, San Diego, CA,
USA, June 2005, pp. 468–473.
[24] V. Sharma and J.W. Davis, “Feature-level fusion for object segmentation using mutual infor-
mation,” in Proceedings of Computer Vision and Pattern Recognition Workshop, New York,
USA, June 2006, pp. 139–146.
[25] A. Kumar and D. Zhang, “Personal recognition using hand shape and texture,” IEEE Transac-
tions on Image Processing, vol. 15, no. 8, pp. 2454–2461, August 2006.
[26] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York, USA: Academic
Press, October 1990.
[27] Y. Liao, L.W. Nolte, and L.M. Collins, “Decision fusion of ground-penetrating radar and metal
detector algorithms – A robust approach,” IEEE Transactions on Geoscience and Remote
Sensing, vol. 45, no. 2, pp. 398–409, February 2007.
[28] L.O. Jimenez, A. Morales-Morell, and A. Creus, “Classification of hyperdimensional data
based on feature and decision fusion approaches using projection pursuit, majority voting,
and neural networks,” IEEE Transactions on Geoscience and Remote Sensing, vol. 37, no. 3,
pp. 1360–1366, May 1999.
[29] M. Fauvel, J. Chanussot, and J.A. Benediktsson, “Decision fusion for the classification of ur-
ban remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 44,
no. 10, pp. 2828–2838, October 2006.
[30] S. Foucher, M. Germain, J.M. Boucher, and G.B. Benie, “Multisource classification using
ICM and Dempster-Shafer theory,” IEEE Transactions on Instrumentation and Measurement,
vol. 51, no. 2, pp. 277–281, April 2002.
[31] F. Nebeker, “Golden accomplishments in biomedical engineering,” IEEE Engineering in
Medicine and Biology Magazine, vol. 21, no. 3, pp. 17–47, May/June 2002.
[32] D. Barnes, G. Egan, G. O'Keefe, and D. Abbott, “Characterization of dynamic 3-D PET imaging
for functional brain mapping,” IEEE Transactions on Medical Imaging, vol. 16, no. 3, pp. 261–
269, June 1997.
[33] S. Wong, R. Knowlton, R. Hawkins, and K. Laxer, “Multimodal image fusion for noninvasive
epilepsy surgery planning,” International Journal of Remote Sensing, vol. 16, no. 1, pp. 30–38,
January 1996.
[34] R. Raskar, A. Agrawal, and J. Tumblin, “Coded exposure photography: Motion deblurring
using fluttered shutter,” ACM Transactions on Graphics, vol. 25, no. 3, pp. 795–804, July
2006.
[35] A. Agrawal and R. Raskar, “Resolving objects at higher resolution from a single motion-
blurred image,” in Proceedings of the IEEE International Conference on Computer Vision and
Pattern Recognition, Minneapolis, MN, USA, June 2007, pp. 1–8.
[36] M. Kumar and P. Ramuhalli, “Dynamic programming based multichannel image restoration,”
in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Pro-
cessing, Philadelphia, PA, USA, March 2005, pp. 609–612.
[37] L. Yuan, J. Sun, L. Quan, and H.Y. Shum, “Image deblurring with blurred/noisy image pairs,”
ACM Transactions on Graphics, vol. 26, no. 3, July 2007.
[38] G. Petschnigg, R. Szeliski, M. Agrawala, M. Cohen, H. Hoppe, and K. Toyama, “Digital
photography with flash and no-flash image pairs,” ACM Transactions on Graphics, vol. 23,
no. 3, pp. 664–672, August 2004.
[39] P.J. Burt and R.J. Kolczynski, “Enhanced image capture through fusion,” in Proceedings of the
Fourth International Conference on Computer Vision, Berlin, Germany, May 1993, pp. 173–
182.
[40] H. Li, B.S. Manjunath, and S.K. Mitra, “Multisensor image fusion using the wavelet trans-
form,” Graphical Models and Image Processing, vol. 57, no. 3, pp. 235–245, May 1995.
[41] B.E. Bayer, “Color imaging array,” U.S. Patent 3 971 065, July 1976.
[42] Eastman Kodak Company, Kodak KAI-11002 Image Sensor Device Performance Specifica-
tion, 2006.
[43] J.E. Adams Jr. and J.F. Hamilton Jr., “Adaptive color plan interpolation in single color elec-
tronic camera,” U.S. Patent 5 506 619, April 1996.
[44] C.W. Kim and M.G. Kang, “Noise insensitive high resolution demosaicing algorithm consid-
ering cross-channel correlation,” in Proceedings of the International Conference on Image
Processing, Genoa, Italy, September 2005, pp. 1100–1103.
[45] R. Kimmel, “Demosaicing: Image reconstruction from color CCD samples,” IEEE Transac-
tions on Image Processing, vol. 8, no. 9, pp. 1221–1228, September 1999.
[46] R.W.G. Hunt, The Reproduction of Colour, 5th Edition, Kingston-upon-Thames, UK: Foun-
tain Press, 1995.
[47] E.J. Giorgianni and T.E. Madden, Digital Color Management Encoding Solutions, 2nd Edi-
tion, Chichester, UK: John Wiley and Sons, Ltd., 2008.
[48] J.C. Maxwell, On the Theory of Three Primary Colours, Cambridge, England: Cambridge
University Press, 1890.
[49] R.M. Evans, “Some notes on Maxwell’s colour photograph,” Journal of Photographic Science,
vol. 9, no. 4, p. 243, July-August 1961.
[50] J.S. Friedman, History of Color Photography, Boston, Massachusetts: The American Photo-
graphic Publishing Company, 1944.
[51] M.L. Collette, “Digital image recording device,” U.S. Patent 5 570 146, May 1994.
[52] T.E. Lynch and F. Huettig, “High resolution RGB color line scan camera,” in Proceedings of
SPIE, Digital Solid State Cameras: Designs and Applications, San Jose, CA, USA, January
1998, pp. 21–28.
[53] G. Sharma and H.J. Trussell, “Digital color imaging,” IEEE Transactions on Image Process-
ing, vol. 6, no. 7, pp. 901–932, July 1997.
[54] R.F. Lyon, “Prism-based color separation for professional digital photography,” in Proceed-
ings of the Image Processing, Image Quality, Image Capture, Systems Conference, Portland,
OR, USA, March 2000, pp. 50–54.
[55] R.F. Lyon and P.M. Hubel, “Eying the camera: Into the next century,” in Proceedings of the
IS&T SID Tenth Color Imaging Conference, Scottsdale, AZ, USA, November 2002, pp. 349–
355.
[56] R.M. Turner and R.J. Guttosch, “Development challenges of a new image capture technology:
Foveon X3 image sensors,” in Proceedings of the International Congress of Imaging Science,
Rochester, NY, USA, May 2006, pp. 175–181.
[57] K.E. Spaulding, E.J. Giorgianni, and G. Woolfe, “Optimized extended gamut color encodings
for scene-referred and output-referred image states,” Journal of Imaging Science and Technol-
ogy, vol. 45, no. 5, September/October 2001, pp. 418–426.
[58] M. Anderson, R. Motta, S. Chandrasekar, and M. Stokes, “Proposal for a standard default
color space for the internet: sRGB,” in Proceedings of the Fourth IS&T/SID Color Imaging
Conference, Scottsdale, AZ, USA, November 1995, pp. 238–245.
[59] K.E. Spaulding and J. Holm, “Color encodings: sRGB and beyond,” in Proceedings of the
IS&T’s Conference on Digital Image Capture and Associated System, Reproduction and Image
Quality Technologies, Portland, OR, USA, April 2002, pp. 167–171.
[60] K.E. Spaulding, E.J. Giorgianni, and G. Woolfe, “Reference input/output medium metric RGB
color encoding (RIMM/ROMM RGB),” in Proceedings of the Image Processing, Image Qual-
ity, Image Capture, Systems Conference, Portland, OR, USA, March 2000, pp. 155–163.
[61] “Adobe RGB (1998) color image encoding,” Tech. Rep. http://www.adobe.com/adobergb,
Adobe Systems, Inc., 1998.
[62] S. Süsstrunk, “Standard RGB color spaces,” in Proceedings of the Seventh IS&T/SID Color
Imaging Conference, Scottsdale, AZ, USA, November 1999, pp. 127–134.
[63] H.E.J. Neugebauer, “Quality factor for filters whose spectral transmittances are different from
color mixture curves, and its application to color photography,” Journal of the Optical Society
of America, vol. 46, no. 10, pp. 821–824, October 1956.
[64] P. Vora and H.J. Trussel, “Measure of goodness of a set of color-scanning filters,” Journal of
the Optical Society of America, vol. 10, no. 7, pp. 1499–1508, July 1993.
[65] R.L. Baer, W.D. Holland, J. Holm, and P. Vora, “A comparison of primary and complementary
color filters for CCD-based digital photography,” Proceedings of the SPIE, pp. 16–25, January
1999.
[66] H. Kuniba and R.S. Berns, “Spectral sensitivity optimization of color image sensors consider-
ing photon shot noise,” Journal of Electronic Imaging, vol. 18, no. 2, pp. 023002/1–14, April
2009.
[67] “Photography - electronic still picture imaging - noise measurements,” Tech. Rep. ISO
15739:2003, ISO TC42/WG 18, 2003.
[68] R. Vogel, “Digital imaging device optimized for color performance,” U.S. Patent 5 668 596,
September 1997.
[69] M. Parmar and S.J. Reeves, “Optimization of color filter sensitivity functions for color filter
array based image acquisition,” in Proceedings of the Fourteenth Color Imaging Conference,
Scottsdale, AZ, USA, November 2006, pp. 96–101.
[70] G.J.C. van der Horst, C.M.M. de Weert, and M.A. Bouman, “Transfer of spatial chromaticity-
contrast at threshold in the human eye,” Journal of the Optical Society of America, vol. 57,
no. 10, pp. 1260–1266, October 1967.
[71] K.T. Mullen, “The contrast sensitivity of human colour vision to red-green and blue-yellow
chromatic gratings,” Journal of Physiology, vol. 359, pp. 381–400, February 1985.
[72] J.E. Adams Jr., M. Kumar, B.H. Pillman, and J.A. Hamilton, “Four-channel color filter array
pattern,” U.S. Patent Application 12/472 563, May 2009.
[73] J.E. Adams Jr., M. Kumar, B.H. Pillman, and J.A. Hamilton, “Four-channel color filter array
interpolation,” U.S. Patent Application 12/473 305, May 2009.
[74] J.E. Adams Jr., M. Kumar, B.H. Pillman, and J.A. Hamilton, “Color filter array pattern having
four channels,” U.S. Patent Application 12/478 810, June 2009.
[75] J.E. Adams Jr., M. Kumar, B.H. Pillman, and J.A. Hamilton, “Interpolation for four-channel
color filter array,” U.S. Patent Application 12/480 820, June 2009.
[76] J.D. Gaskill, Linear Systems, Fourier Transforms, and Optics, New York: John Wiley & Sons,
1978.
[77] J.D. Gaskill, Linear Systems, Fourier Transforms, and Optics, ch. Characteristics and Appli-
cations of Linear Filters, New York: John Wiley & Sons, 1978, pp. 279–281.
[78] R. Palum, Single-Sensor Imaging: Methods and Applications for Digital Cameras, ch. Optical
antialiasing filters, R. Lukac (ed.), Boca Raton, FL: CRC Press / Taylor & Francis, September
2008, pp. 105–135.
[79] D. Cok, “Signal processing method and apparatus for producing interpolated chrominance
values in a sampled color image signal,” U.S. Patent 4 642 678, February 1987.
[80] J.F. Hamilton Jr. and J.E. Adams Jr., “Adaptive color plan interpolation in single color elec-
tronic camera,” U.S. Patent 5 629 734, May 1997.
[81] P.S. Tsai, T. Acharya, and A. Ray, “Adaptive fuzzy color interpolation,” Journal of Electronic
Imaging, vol. 11, no. 3, pp. 293–305, July 2002.
[82] K. Hirakawa and T. Parks, “Adaptive homogeneity-directed demosaicking algorithm,” in Pro-
ceedings of the International Conference on Image Processing, Barcelona, Spain, September
2003, pp. 669–672.
[83] J.E. Adams Jr., “Design of practical color filter array interpolation algorithms for digital cam-
eras,” in Proceedings of the SPIE Conference on Real-Time Imaging, San Jose, CA, USA,
February 1997, pp. 117–125.
[84] B.W. Keelan, Handbook of Image Quality: Characterization and Prediction. New York, NY,
USA: Marcel Dekker, March 2002.
[85] G.C. Holst and T.S. Lomheim, CMOS/CCD sensors and camera systems. Bellingham, WA:
The International Society for Optical Engineering, October 2007.
[86] Z. Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli, “Image quality assessment: From error visi-
bility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–
612, April 2004.
[87] J.E. Adams Jr., J.F. Hamilton Jr., and E.B. Gindele, “Noise-reducing a color filter array image,”
U.S. Patent Application 10/869 678, December 2005.
[88] E.O. Morales and J.F. Hamilton Jr., “Noise reduced color image using panchromatic image,”
U.S. Patent Application 11/752 484, November 2007.
[89] A. Polesel, G. Ramponi, and V.J. Mathews, “Image enhancement via adaptive unsharp mask-
ing,” IEEE Transactions on Image Processing, vol. 9, no. 3, pp. 505–510, March 2000.
[90] F.P. de Vries, “Automatic, adaptive, brightness independent contrast enhancement,” Signal
Processing, vol. 21, no. 2, pp. 169–182, October 1990.
[91] G. Ramponi, N. Strobel, S.K. Mitra, and T. Yu, “Nonlinear unsharp masking methods for
image contrast enhancement,” Journal of Electronic Imaging, vol. 5, no. 3, pp. 353–366, July
1996.
[92] G. Ramponi, “A cubic unsharp masking technique for contrast enhancement,” Signal Process-
ing, vol. 67, no. 2, pp. 211–222, June 1998.
[93] J.E. Adams Jr. and J.F. Hamilton Jr., Single-Sensor Imaging: Methods and Applications for
Digital Cameras, ch. Digital camera image processing chain design, R. Lukac (ed.), Boca
Raton, FL: CRC Press / Taylor & Francis, September 2008, pp. 67–103.
2
Single Capture Image Fusion with Motion
Consideration
James E. Adams, Jr., Aaron Deever, John F. Hamilton, Jr., Mrityunjay Kumar,
Russell Palum, and Bruce H. Pillman
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.2 Single-Capture Motion Estimation and Compensation . . . . . . . . . . . . . . . . . . . . . . . . 65
2.2.1 Motion Estimation Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.2.2 Motion Estimation for the Proposed Capture System . . . . . . . . . . . . . . . . . . 68
2.2.3 Motion Compensation for the Proposed Capture System . . . . . . . . . . . . . 71
2.3 Four-Channel Single Capture Motion Deblurring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
2.3.1 Fusion for Motion Deblurring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
2.4 Example Single-Sensor Image Fusion Capture System . . . . . . . . . . . . . . . . . . . . . . . 76
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Appendix: Estimation of Panchromatic Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
2.1 Introduction
In Chapter 1, an image capture system was introduced that uses a four-channel color filter
array to obtain images with high color fidelity and improved signal-to-noise performance
relative to traditional three-channel systems. A panchromatic (spectrally nonselective)
channel was added to the digital camera sensor to decouple sensing luminance (spatial)
information from chrominance (color) information. In this chapter, that basic foundation
is enhanced to provide a capture system that can additionally address the issue of motion
occurring during a capture to produce images with reduced motion blur.
Motion blur is a common problem in digital imaging that occurs when there is rela-
tive motion between the camera and the scene being captured. The degree of motion blur
present in an image is a function of both the characteristics of the motion and the
integration time of the sensor. Motion blur may be caused by camera motion or it may
be caused by object motion within the scene. It is particularly problematic in low-light
imaging, which typically requires long integration times to acquire images with acceptable
signal-to-noise levels. Motion blur is also often a problem for captures taken with signifi-
cant optical magnification. Not only does the magnification amplify the motion that occurs,
it also decreases the amount of light reaching the sensor, causing a need for longer integra-
tion times. A familiar trade-off often exists in these situations. The integration time can
be kept short to avoid motion blur, but at a cost of poor signal-to-noise performance. Con-
versely, the integration time can be lengthened to allow sufficient light to reach the sensor,
but at a cost of increased motion blur in the image. Due to this trade-off between motion
blur and noise, images are typically captured with sufficiently long exposure time to ensure
satisfactory signal-to-noise levels, and signal processing techniques are used to reduce the
motion blur [1], [2], [3], [4]. Reducing motion blur, especially motion blur corresponding
to objects moving within the scene, is a challenging task and often requires multiple cap-
tures of the same scene [5], [6], [7], [8]. These approaches are computationally complex
and memory-intensive. In contrast, the proposed four-channel image sensor architecture al-
lows the design of computationally efficient image fusion algorithms for motion deblurring
of a color image from a single capture.
By varying the length of time that different image sensor pixels integrate light, it is possi-
ble to capture a low-light image with reduced motion blur while still achieving acceptable
signal-to-noise performance. In particular, the panchromatic channel of the image sen-
sor is integrated for a shorter period of time than the color channels of the image sensor.
This approach can be motivated from the perspective of spectral sensitivity. Panchromatic
pixels are more sensitive to light than color (red, green and blue) pixels, and panchro-
matic pixel integration time can be kept shorter to reduce motion blur while still acquiring
enough photons for acceptable signal-to-noise performance. This approach can also be
motivated from a human visual system perspective. The human visual system has greater
spatial sensitivity to high-frequency luminance information than high-frequency chromi-
nance information [9]. It is thus desirable to keep the panchromatic pixel integration time
as short as possible to retain high-frequency luminance information in the form of sharp
edges and textures. High frequencies are less important in the chrominance channels, thus
greater motion blur can be tolerated in these channels in exchange for longer integration
and improved signal-to-noise performance. The sharp panchromatic channel information is
combined with the low-noise chrominance channel information to produce an output image
with reduced motion blur compared to a standard capture in which all image sensor pixels
have equal integration.
An image captured with varying integration times for panchromatic and color channels
has properties that require novel image processing. Motion occurring during capture can
cause edges in the scene to appear out of alignment between the panchromatic and color
channels. Motion estimation and compensation steps can be used to align the panchromatic
and color data. After alignment, the data may still exhibit uneven motion blur between the
panchromatic and color channels. A subsequent image fusion step can account for this
while combining the data to produce an output image.
This chapter looks at the issues involved with using a four-channel image sensor and
allowing different integration times for the panchromatic and color pixels to produce an
image with reduced motion blur. Section 2.2 provides a brief introduction to the topics of
motion estimation and compensation and focuses on how these techniques can be applied to
align the panchromatic and color data in the proposed capture system. Section 2.3 discusses
the use of image fusion techniques to combine the complementary information provided
by the shorter integration panchromatic channel and longer integration color channels to
produce an output image with reduced motion blur. Section 2.4 presents an example single-
sensor, single-capture, image-fusion capture system. The processing path contains the steps
of motion estimation and compensation, and deblurring through image fusion, as well as
techniques introduced in Chapter 1. Finally, conclusions are offered in Section 2.5.
where (x, y) is the pixel location, d(x, y) is the motion at (x, y), and b1 , b2 , ..., b6 are the six
parameters that define the affine model. This model can be used to represent translation,
rotation, dilation, shear, and stretching. Global motion typically occurs as a result of camera
unsteadiness during an exposure or between successive frames of an image sequence.
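The six-parameter model can be illustrated with a short sketch. The parameterization below, d(x, y) = (b1 + b2 x + b3 y, b4 + b5 x + b6 y), is one common form of the affine motion field and is offered only as an illustration; the function name and the example parameter values are assumptions, not taken from the text.

import numpy as np

def affine_motion(x, y, b):
    # One common six-parameter affine motion field:
    # d(x, y) = (b1 + b2*x + b3*y, b4 + b5*x + b6*y).
    b1, b2, b3, b4, b5, b6 = b
    return b1 + b2 * x + b3 * y, b4 + b5 * x + b6 * y

# Example: a small rotation about the origin combined with a translation.
theta = np.deg2rad(1.0)
b = (2.0, np.cos(theta) - 1.0, -np.sin(theta),
     0.5, np.sin(theta), np.cos(theta) - 1.0)
print(affine_motion(100.0, 50.0, b))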
A second category of motion is local or object motion. Local motion occurs when an
object within a scene moves relative to the camera. Accurate estimation of object motion
requires the ability to segment an image into multiple regions or objects with different
motion characteristics. Often a compromise is made sacrificing segmentation accuracy for
speed by dividing an image into a regular array of rectangular tiles and computing a local
motion value for each tile, as discussed below. Additional discussion on the topic of motion
can be found in References [10] and [11].
Motion plays an important role in the proposed capture system, in which panchromatic
pixels have a shorter integration than color pixels. Given this difference in integration, the
color and panchromatic data may initially be misaligned. Because this misalignment can produce
artifacts such as halos beyond the apparent boundary of a moving object, it is desirable to
FIGURE 2.1
Captured image with short panchromatic integration and long color integration processed (a) without alignment
and (b) with alignment.
align the panchromatic and color data, particularly at edge locations where misalignment
is most visible. Figure 2.1 shows an example capture in which the color pixels were inte-
grated for three times as long as the panchromatic pixels, and there was movement of the
subject’s head throughout the capture. The image on the left shows the result of subsequent
processing without alignment, while the image on the right shows the result of equivalent
processing occurring after an alignment step. The misalignment of color and panchromatic
data is visible along the edge of the subject’s face. This halo artifact is reduced by a pre-
processing motion compensation step to align the panchromatic and color data. Note that
even after alignment, the color data and panchromatic data may retain differing degrees of
motion blur as a result of their differing integration times. The process of fusing the data
into a single output image while accounting for this varying motion blur is discussed later
in this chapter.
At one extreme, a single motion model is applied to the entire image. In this case, global
motion is being modeled. Global motion estimation describes the motion of all image
points with just a few parameters and is computationally efficient, but it fails to capture
multiple, local motions in a scene [12], [13]. At the other extreme, a separate motion model
(for instance, translational) is applied to every individual pixel. This generates a dense
motion representation having at least two parameters (in the case of a translational model)
for every pixel. Dense motion representations have the potential to accurately represent
local motion in a scene but are computationally complex [14], [15], [16].
In between the extremes lie block-based motion models, which typically partition an
image into uniform, non-overlapping rectangular blocks and estimate motion for each in-
dividual block [17]. Block-based motion models have moderate ability to represent local
motion, constrained by the block boundaries imparted by the partition. Block-based trans-
lational motion models have been used extensively in digital video compression standards,
for example, MPEG-1 and MPEG-2 [18], [19].
A second element of a motion estimation algorithm is the criteria that are used to deter-
mine the quality of a given motion estimate. One common strategy is to evaluate a motion
vector based on a prediction error between the reference pixel(s) and the corresponding
pixel(s) that are mapped to by the motion estimate. The prediction error can be written as:
e(x, y) = α(Ik(x, y) − Îk(x, y)), (2.2)
where I is the reference image, (x, y) is the pixel being predicted, Î is the prediction, k
represents the kth image, and α is a function that assesses a penalty for non-matching data.
The prediction Îk incorporates the motion vector as Îk(x, y) = Ik−1((x, y) − d(x, y)). In this
case, the previous image, Ik−1, is used to form the prediction, and d(x, y) is the motion
vector for the given pixel (x, y).
A quadratic penalty function α (e) = e2 is commonly used. One of the drawbacks of the
quadratic penalty function is that individual outliers with large errors can significantly affect
the overall error for a block of pixels. Alternatively, an absolute value penalty function
α (e) = |e| can be used. The absolute value penalty function is more robust to outliers and
has the further advantage that it can be computed without multiplications. Another robust
criterion commonly used is a cross-correlation function
C(d) = ∑_{(x,y)} Ik(x, y) Îk(x, y). (2.3)
In this case, the function is maximized rather than minimized to determine the optimal
motion vector.
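The three criteria can be compared with a small sketch. The Python fragment below is an illustration rather than part of the chapter; it evaluates the quadratic penalty, the absolute-value penalty, and the cross-correlation of Equation 2.3 for a single candidate block, and the function name and toy data are assumptions.

import numpy as np

def block_scores(ref_block, cand_block):
    # Matching criteria for one candidate offset: quadratic and absolute-value
    # penalties on the prediction error (Equation 2.2), and the cross-correlation
    # of Equation 2.3, which is maximized rather than minimized.
    err = ref_block.astype(np.float64) - cand_block.astype(np.float64)
    ssd = np.sum(err ** 2)        # quadratic penalty, sensitive to outliers
    sad = np.sum(np.abs(err))     # absolute penalty, more robust, no multiplications
    xcorr = np.sum(ref_block.astype(np.float64) * cand_block.astype(np.float64))
    return ssd, sad, xcorr

# Toy example: a reference block and a nearly matching candidate block.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (16, 16))
cand = ref + rng.integers(-3, 4, (16, 16))
print(block_scores(ref, cand))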
The error function given by Equation 2.2 can be used to compare pixel intensity values.
Other choices exist, including comparing pixel intensity gradients. This corresponds to
comparing edge maps of images rather than comparing pixel values themselves. The use
of gradients can be advantageous when the images being compared have similar image
structure but potentially different mean value. This can occur, for example, when there is
an illumination change, and one image is darker than another. It can also occur when the
images correspond to different spectral responses.
A third element of a motion estimation algorithm is a search strategy used to locate the
solution. Search strategies can vary based on the motion model and estimation criteria in
P G P R
G P R P
P B P G
B P G P
FIGURE 2.2
A color filter array pattern containing red, green, blue, and panchromatic pixels.
place. Many algorithms use iterative approaches to converge on a solution [14], [15], [16].
For motion models with only a small number of parameters to estimate and a small state
space for each of the parameters, an exhaustive matching search is often used to minimize a
prediction error such as given by Equation 2.2. A popular example of this approach is used
with block-based motion estimation in which each block is modeled as having translational
motion (only two parameters to estimate for each block of pixels), and the range of possible
motion vectors is limited to integer pixel or possibly half-pixel values within a fixed-size
window. Each possible motion offset is considered, and the offset resulting in the best
match based on the estimation criteria is chosen as the motion estimate. Many algorithms
have been proposed to reduce the complexity of exhaustive matching searches [20], [21].
These techniques focus on intelligently reducing the number of offsets searched as well as
truncating the computation of error terms once it is known that the current offset is not the
best match.
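A minimal sketch of such an exhaustive (full) search is given below, assuming a translational model, integer offsets, and a sum-of-absolute-differences criterion; the function name, block size, and search radius are illustrative choices, not values from the text.

import numpy as np

def full_search(prev, curr, top, left, block=16, radius=7):
    # Exhaustive translational search for one block of `curr` against `prev`,
    # minimizing the sum of absolute differences over integer offsets within a
    # (2*radius + 1)^2 window. Practical implementations add reduced search
    # patterns and early termination of the error computation [20], [21].
    ref = curr[top:top + block, left:left + block].astype(np.float64)
    best_err, best_mv = np.inf, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r, c = top + dy, left + dx
            if r < 0 or c < 0 or r + block > prev.shape[0] or c + block > prev.shape[1]:
                continue  # candidate block falls outside the image
            sad = np.sum(np.abs(ref - prev[r:r + block, c:c + block]))
            if sad < best_err:
                best_err, best_mv = sad, (dy, dx)
    return best_mv, best_err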
A synthetic panchromatic value can be formed as a linear combination of the color values at a pixel:

P^Syn = αR + βG + γB. (2.4)

The linear weights α, β, and γ are chosen to generate an overall spectral response for
P^Syn as similar as possible to a natural panchromatic pixel spectral response. Details are
provided in the Appendix.
Depending on when motion estimation occurs in the overall image processing path, syn-
thetic panchromatic pixel values may be computed at some or all pixel locations. The
panchromatic and color pixel data are both initially available as sparse checkerboard ar-
rays, as shown in Figure 2.3. In one processing path, the data is interpolated to generate
fully populated panchromatic and color channels prior to motion estimation. In this case,
synthetic panchromatic pixel values can be computed at each pixel using the available red,
green, and blue values at that pixel, and motion can be estimated using the full images.
Alternatively, motion estimation can be carried out at lower resolution, with the objec-
tive of retaining a CFA image after motion compensation. Both the panchromatic and color
image data are reduced to lower resolution. The color data are interpolated to form a full-
color low-resolution image from which synthetic panchromatic pixel values are computed.
Motion is estimated using the low-resolution images, with the results of the motion esti-
mation applied to the original checkerboard pixel data [25]. In order to retain the original
checkerboard CFA pattern, the motion estimation can be constrained to appropriate integer
translational offsets.
Figure 2.3 also illustrates a scenario in which motion is estimated by directly comparing
the panchromatic and color data, bypassing the need to compute a synthetic panchromatic
channel. In this case, the green channel is used as an approximate match to the panchro-
matic channel. Each channel is fully populated by an interpolation step. Edge maps are
formed and used during motion estimation to minimize the effects of spectral differences
between the panchromatic and green channels [26].
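A simple way to realize such a structure-based comparison is to match gradient-magnitude (edge) maps instead of raw intensities. The fragment below is only an illustrative sketch using forward differences; it is not the specific edge operator of Reference [26], and the function name is an assumption.

import numpy as np

def gradient_magnitude(img):
    # Forward-difference gradient magnitude, usable as a crude edge map so that
    # channels with different spectral responses (e.g., panchromatic and green)
    # can be compared by structure rather than by absolute intensity.
    img = img.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, :-1] = img[:, 1:] - img[:, :-1]
    gy[:-1, :] = img[1:, :] - img[:-1, :]
    return np.hypot(gx, gy)

# Block matching would then operate on gradient_magnitude(P) and
# gradient_magnitude(G) rather than on the pixel values themselves.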
The panchromatic and synthetic panchromatic channels are likely to differ in the amount
of noise present, as well as the amount of motion blur present. These differences make
estimation difficult for algorithms that derive individual motion vectors for every pixel.
Block-based motion estimation provides some robustness to the varying noise and blur
within the images while also providing some ability to detect local object motion in the
scene.
With appropriate hardware capability, the panchromatic channel integration interval can
align in various ways with the integration interval for the color channels, as shown in Fig-
ure 2.4 [27]. Different alignments produce different relative motion offsets under most con-
ditions, especially with constant or nearly constant motion (both velocity and direction), as
used in the following examples.
If the integration interval of the panchromatic channel is concentrically located within
the integration interval of the color channels, as in Figure 2.4a, the motion offset between
the channels will be very small or zero. This minimizes any need to align the color chan-
nels with the panchromatic channel during fusion of the channels, but also minimizes the
information that motion offset estimation can provide regarding motion blur in the captured
image. If the two integration intervals are aligned at the end of integration as in Figure 2.4b,
the motion offset between the panchromatic channel and the color channels will be greater
than with concentric integration intervals. This may require alignment of the color chan-
nels with the panchromatic channel to minimize artifacts during fusion of the images. In
this case, motion offset estimation can also provide more information about the blur in the
FIGURE 2.3
Three options for motion estimation in the proposed capture system. The term P^Syn represents the synthetic
panchromatic channel, whereas P′ and G′ are gradients of P and G, respectively.
FIGURE 2.4
Integration timing options: (a) concentric integration, (b) simultaneous readout integration, and (c) non-
overlapping integration.
captured image, to aid fusion or deblurring operations. If the integration intervals do not
overlap, as shown in Figure 2.4c, the motion offset between the panchromatic and color
channels will be still larger.
One advantage of overlapping integration intervals is to limit any motion offset between
the channels and increase the correlation between the motion during the panchromatic in-
terval and the motion during the color integration interval. The ratio of the color integration
time to the panchromatic integration time, tC /tP , also affects the amount of motion offset.
As this ratio decreases toward one, the capture converges to a standard single capture, and
the relative motion offset converges to zero.
The alignment of the integration intervals has hardware implications in addition to the
image processing implications just mentioned. In particular, use of end-aligned integration
intervals tends to reduce the complexity of readout circuitry and buffer needs, since all
pixels are read out at the same time. Concentric alignment of the integration intervals tends
to maximize the complexity of readout, since the panchromatic integration interval both
begins and ends at a time different from the color integration interval.
FIGURE 2.5
Motion compensation with the reference image constituted by (a) the color data and (b) the panchromatic data.
For the proposed capture system, it is possible to consider either the panchromatic data
or the color data as the reference image. The choice of reference image affects which data
is left untouched, and which data is shifted, and possibly interpolated. The advantage of
selecting the panchromatic data as the reference image lies in keeping the sharp, panchro-
matic backbone of the image untouched, preserving as much strong edge and texture in-
formation as possible. The advantage of selecting the chrominance data as the reference
image becomes apparent in the case that the motion compensation is performed on sparse
CFA data with the intention of providing a CFA image as the output of the motion com-
pensation step, as illustrated in Figure 2.5. In this case, shifting a block of panchromatic
FIGURE 2.6
Single capture: (a) CFA image, (b) demosaicked panchromatic image, and (c) demosaicked color image.
data to exactly fit the CFA pattern requires integer motion vectors for which the horizon-
tal and vertical components have the same even/odd parity. Arbitrary shifts of a block of
panchromatic data require an interpolation step for which the four nearest neighbors are
no more than two pixels away. Shifting the chrominance data is more difficult, however,
due to the sparseness of the colors. Shifting a block of chrominance data to exactly fit
the CFA pattern requires motion vectors that are multiples of four, both horizontally and
vertically. Arbitrary shifts of a block of chrominance data require an interpolation step for
which neighboring color information may be four pixels away.
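The parity and multiple-of-four constraints described above can be summarized in a few lines. The following check is an illustrative sketch for the checkerboard CFA of Figure 2.2; the function name and channel labels are assumptions.

def cfa_shift_is_valid(dy, dx, channel):
    # Returns True if an integer block shift keeps the CFA pattern intact:
    # panchromatic blocks need vertical and horizontal components of equal
    # even/odd parity, while color (chrominance) blocks need shifts that are
    # multiples of four in both directions.
    if channel == "pan":
        return (dy % 2) == (dx % 2)
    if channel == "color":
        return dy % 4 == 0 and dx % 4 == 0
    raise ValueError("channel must be 'pan' or 'color'")

print(cfa_shift_is_valid(3, 1, "pan"))     # True: both components odd
print(cfa_shift_is_valid(2, 1, "pan"))     # False: mixed parity
print(cfa_shift_is_valid(4, -8, "color"))  # True: multiples of four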
Signal processing techniques are also often used to reduce the motion blur [1], [2], [3],
[4]. As explained in Section 2.2, in a four-channel system, it is possible to capture an
image with acceptable signal-to-noise by using a relatively shorter integration time for
the panchromatic pixels as compared to the color (red, green, and blue) pixels. This is a
highly desirable feature for color image motion deblurring. Due to short integration time
and high photometric sensitivity, panchromatic pixels do not suffer much from motion blur
and at the same time produce a luminance image of the scene with high signal-to-noise,
whereas a long integration time for color pixels leads to a motion blurred color image
with reliable color information. An example of a four-channel CFA image is shown in
Figure 2.6a. The integration time ratio of panchromatic to color pixels was set to 1:5.
The corresponding demosaicked panchromatic and color images are shown in Figures 2.6b
and 2.6c, respectively. The basketball is more clearly defined in the panchromatic image
but appears blurred in the color image. This example illustrates that by using different
integration times for panchromatic and color pixels in the four-channel imaging sensor, it
is possible to generate complementary information at the sensor level, which subsequently
can be exploited to generate a color image with reduced motion blur. A pixel-level fusion
algorithm designed to fuse demosaicked panchromatic and color images [32] is explained
below.
The red and blue chroma images of the observed color image are first computed as

CR = R − P^Syn, (2.5)
CB = B − P^Syn. (2.6)
To restore the high-frequency luminance information of the deblurred color image, the
synthetic panchromatic image PSyn can be replaced with the observed panchromatic image
P. However, this operation only ensures reconstruction of the luminance information. In or-
der to restore color information, chroma images corresponding to P must be reconstructed.
Note that P is a luminance image and does not contain color information. Therefore, its
chroma images must be estimated from the observed RGB color image. In order to do this,
a system model is determined to relate PSyn and the corresponding chroma images (CR and
CB ) which in turn is used to predict chroma images for P. For the sake of simplicity and
computational efficiency, the model is linear:
CR = mR P^Syn, (2.7)
CB = mB P^Syn, (2.8)
FIGURE 2.7
Motion deblurring from a single capture: (a) panchromatic image, (b) color image, and (c) deblurred image.
Let CR^P and CB^P be the red and the blue chroma images, respectively, corresponding to P.
Then, from Equations 2.7 and 2.8 it is apparent that

CR^P = mR P = (CR / P^Syn) P, (2.9)
CB^P = mB P = (CB / P^Syn) P. (2.10)
The new motion-deblurred color image (R^N, G^N, and B^N) can be estimated as follows:

R^N = CR^P + P, (2.11)
B^N = CB^P + P, (2.12)
G^N = (P − α R^N − γ B^N) / β. (2.13)
ALGORITHM 2.1
1. Compute synthetic panchromatic and chroma images using Equations 2.4, 2.5, and 2.6.
2. Compute chroma images corresponding to the observed panchromatic image P using Equations 2.9 and 2.10.
3. Compute the deblurred color image using Equations 2.11, 2.12, and 2.13.
A summary of the fusion algorithm is presented in Algorithm 2.1, whereas its feasibility
is demonstrated in Figure 2.7. The integration time ratio for the panchromatic image shown in
Figure 2.7a and the color image shown in Figure 2.7b was set to 1:5, and these two images were
read out of the sensor simultaneously, as illustrated in Figure 2.4b. The restored image is
shown in Figure 2.7c.
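For readers who wish to experiment, the fusion of Algorithm 2.1 can be prototyped in a few lines once fully demosaicked P, R, G, and B images and the panchromatic coefficients are available. The sketch below illustrates Equations 2.4 through 2.13 only; the function name and the small eps guard against division by near-zero synthetic panchromatic values are added assumptions.

import numpy as np

def fuse_pan_color(P, R, G, B, alpha, beta, gamma, eps=1e-6):
    # Pixel-level fusion following Algorithm 2.1 on demosaicked images.
    P = P.astype(np.float64)
    R, G, B = (c.astype(np.float64) for c in (R, G, B))

    # Step 1: synthetic panchromatic and chroma images (Equations 2.4-2.6).
    P_syn = alpha * R + beta * G + gamma * B
    C_R = R - P_syn
    C_B = B - P_syn

    # Step 2: chroma images corresponding to the observed panchromatic image P
    # (Equations 2.9 and 2.10); eps avoids division by very small P_syn values.
    C_R_P = C_R / (P_syn + eps) * P
    C_B_P = C_B / (P_syn + eps) * P

    # Step 3: motion-deblurred color image (Equations 2.11-2.13).
    R_new = C_R_P + P
    B_new = C_B_P + P
    G_new = (P - alpha * R_new - gamma * B_new) / beta
    return R_new, G_new, B_new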
motion compensation → motion deblurring → demosaicking
FIGURE 2.8
Example image processing chain incorporating motion estimation and compensation.
FIGURE 2.9
Edge detection using the original CFA panchromatic channel: (a) CFA panchromatic channel, and (b) edge
map with mid and high intensities corresponding to the panchromatic and color pixels, respectively.
FIGURE 2.10
Edge detection using the motion compensated CFA panchromatic channel: (a) CFA panchromatic channel, and
(b) edge map with block boundaries and mid and high intensities corresponding to the panchromatic and color
pixels, respectively.
Figure 2.11 shows the image before deblurring (Figure 2.11a) and after deblurring (Figure 2.11b). The differences between these two images are
dramatic, especially with respect to the two balls being tossed. Finally, color and tone cor-
rection and sharpening are applied, and the final result is shown in Figure 2.12a. For reference,
the same image without motion processing is shown in Figure 2.12b.
2.5 Conclusion
The estimation of motion and the compensation of associated artifacts classically require
the capture of a sequence of images. Image fusion techniques are then used to reduce this
set of multiple images into a single image with reduced motion blur.
This chapter described a new approach to motion deblurring that uses a single capture
from a single sensor, which produces the required image components for subsequent image
fusion. The result is a vastly simplified hardware system that is motion-aware and provides
the necessary information for performing motion deblurring. This is accomplished by the
use of a four-channel color filter array consisting of three color channels and a panchro-
matic channel. Due to superior light sensitivity, the panchromatic pixels are exposed for a
shorter duration than the color pixels during image capture. As a result, the panchromatic
channel produces a luminance record of the image with significantly reduced motion blur
while maintaining acceptable signal-to-noise performance. To this low motion blur lumi-
nance record, the chrominance information from the color channels is fused to produce a
final motion-deblurred color image. There are a number of ways to achieve this motion-
reduced result including the alignment of edges in motion between the panchromatic and
color image components and the exchange of panchromatic and color-derived luminance
image components. Many of these techniques can be applied to CFA data directly, thereby
reducing the computational overhead of the system. Finally, apart from the motion ele-
ments, the image processing chain is as described in Chapter 1, thus allowing the system to
realize most, if not all, of the advantages of the four-channel system as described there.
Appendix: Estimation of Panchromatic Coefficients

The color pixel values can be modeled as

R(x, y) = \int_{\lambda_{\min}}^{\lambda_{\max}} I(\lambda)\, Q_R(\lambda)\, S(x, y, \lambda)\, d\lambda,

G(x, y) = \int_{\lambda_{\min}}^{\lambda_{\max}} I(\lambda)\, Q_G(\lambda)\, S(x, y, \lambda)\, d\lambda,

B(x, y) = \int_{\lambda_{\min}}^{\lambda_{\max}} I(\lambda)\, Q_B(\lambda)\, S(x, y, \lambda)\, d\lambda,
where I(λ) is the spectrum of the illumination as a function of wavelength and S(x, y, λ) is the
surface spectral reflectance function. The spectral quantum efficiencies of the P, R, G, and B
sensors are represented by QP(λ), QR(λ), QG(λ), and QB(λ), respectively. The panchro-
matic coefficients α, β, and γ are computed by minimizing the cost function g(α, β, γ), as
given below:

\min_{\alpha, \beta, \gamma} g(\alpha, \beta, \gamma) = \min_{\alpha, \beta, \gamma} \left\| \sum_{x, y \in \Omega} P(x, y) - \alpha \sum_{x, y \in \Omega} R(x, y) - \beta \sum_{x, y \in \Omega} G(x, y) - \gamma \sum_{x, y \in \Omega} B(x, y) \right\|_2^2 ,
where ‖ · ‖2 denotes the L2-norm. Note that the integration times for P, R, G, and B pixels are
assumed to be the same in this analysis.
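The minimization can be carried out in closed form. The sketch below fits α, β, and γ by ordinary least squares over the pixels of a calibration region, which is a per-pixel variant of the block-sum cost function above; the function name and the use of NumPy's lstsq solver are assumptions made for illustration.

import numpy as np

def estimate_pan_coefficients(P, R, G, B):
    # Least-squares fit of (alpha, beta, gamma) so that alpha*R + beta*G + gamma*B
    # approximates the observed panchromatic channel over a calibration region,
    # assuming equal integration times for all four channels.
    A = np.column_stack([R.ravel(), G.ravel(), B.ravel()]).astype(np.float64)
    y = P.ravel().astype(np.float64)
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs  # (alpha, beta, gamma)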
References
[1] E. Shechtman, Y. Caspi, and M. Irani, “Space-time super-resolution,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 27, no. 4, pp. 531–545, April 2005.
[2] Q. Shan, J. Jia, and A. Agarwala, “High-quality motion deblurring from a single image,” ACM
Transactions on Graphics, vol. 27, no. 3, pp. 1–10, August 2008.
[3] J. Jia, “Single image motion deblurring using transparency,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, June 2007,
pp. 1–8.
[4] A. Levin, “Blind motion deblurring using image statistics,” in Proceedings of the Twentieth
Annual Conference on Advances in Neural Information Processing Systems, Vancouver, BC,
Canada, December 2006, pp. 841–848.
[5] M. Kumar and P. Ramuhalli, “Dynamic programming based multichannel image restoration,”
in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Pro-
cessing, Philadelphia, PA, USA, March 2005, pp. 609–612.
[6] L. Yuan, J. Sun, L. Quan, and H.Y. Shum, “Image deblurring with blurred/noisy image pairs,”
ACM Transactions on Graphics, vol. 26, no. 3, July 2007.
[7] A. Rav-Acha and S. Peleg, “Two motion-blurred images are better than one,” Pattern Recog-
nition Letters, vol. 26, no. 3, pp. 311–317, February 2005.
[8] X. Liu and A.E. Gamal, “Simultaneous image formation and motion blur restoration via mul-
tiple capture,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and
Signal Processing, Salt Lake City, UT, USA, May 2001, pp. 1841–1844.
[9] K.T. Mullen, “The contrast sensitivity of human colour vision to red-green and blue-yellow
chromatic gratings,” Journal of Physiology, vol. 359, pp. 381–400, February 1985.
[10] A.M. Tekalp, Digital Video Processing, Upper Saddle River, NJ: Prentice Hall, August 1995.
[11] A. Bovik, Handbook of Image and Video Processing, 2nd Edition, New York: Academic
Press, June 2005.
[12] F. Dufaux and J. Konrad, “Efficient, robust, and fast global motion estimation for video cod-
ing,” IEEE Transactions on Image Processing, vol. 9, no. 3, pp. 497–501, March 2000.
[13] J.M. Odobez and P. Bouthemy, “Robust multiresolution estimation of parametric motion mod-
els,” Journal of Visual Communication and Image Representation, vol. 6, no. 4, pp. 348–365,
December 1995.
[14] M. Black, “The robust estimation of multiple motions: Parametric and piecewise-smooth flow
fields,” Computer Vision and Image Understanding, vol. 63, no. 1, pp. 75–104, January 1996.
[15] J. Konrad and E. Dubois, “Bayesian estimation of motion vector fields,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 14, no. 9, pp. 910–927, September 1992.
[16] B.K.P. Horn and B.G. Schunck, “Determining optical flow,” Artificial Intelligence, vol. 17,
pp. 185–203, August 1981.
[17] F. Dufaux and F. Moscheni, “Motion estimation techniques for digital TV: A review and a new
contribution,” Proceedings of the IEEE, vol. 83, no. 6, pp. 858–876, June 1995.
[18] “Information technology-coding of moving pictures and associated audio for digital storage
media up to about 1.5 mbit/s.” ISO/IEC JTC1 IS 11172-2 (MPEG-1), 1993.
[19] “Information technology–generic coding of moving pictures and associated audio.” ISO/IEC
JTC1 IS 13818-2 (MPEG-2), 1994.
[20] R. Li, B. Zeng, and M.L. Liou, “A new three-step search algorithm for block motion es-
timation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 4, no. 4,
pp. 438–442, August 1994.
[21] S. Zhu and K.K. Ma, “A new diamond search algorithm for fast block-matching motion esti-
mation,” IEEE Transactions on Image Processing, vol. 9, no. 2, pp. 287–290, February 2000.
[22] K.A. Prabhu and A.N. Netravali, “Motion compensated component color coding,” IEEE
Transactions on Communications, vol. 30, no. 12, pp. 2519–2527, December 1982.
[23] N.R. Shah and A. Zakhor, “Resolution enhancement of color video sequences,” IEEE Trans-
actions on Image Processing, vol. 8, no. 6, pp. 879–885, June 1999.
[24] B.C. Tom and A. Katsaggelos, “Resolution enhancement of monochrome and color video
using motion compensation,” IEEE Transactions on Image Processing, vol. 10, no. 2, pp. 278–
287, February 2001.
[25] A.T. Deever, J.E. Adams Jr., and J.F. Hamilton Jr., “Improving defective color and panchro-
matic CFA image,” U.S. Patent Application 12/258 389, 2009.
[26] J.E. Adams Jr., A.T. Deever, and R.J. Palum, “Modifying color and panchromatic channel
CFA Image,” U.S. Patent Application 12/266 824, 2009.
[27] J.A. Hamilton, J.T. Compton, and B.H. Pillman, “Concentric exposures sequence for image
sensor,” U.S. Patent Application 12/111 219, April 2008.
[28] M. Ben-Ezra and S.K. Nayar, “Motion-based motion deblurring,” IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 689–698, June 2004.
[29] S. Bottini, “On the visual motion blur restoration,” in Proceedings of the Second International
Conference on Visual Psychophysics and Medical Imaging, Brussels, Belgium, July 1981,
p. 143.
[30] W.G. Chen, N. Nandhakumar, and W.N. Martin, “Image motion estimation from motion smear
– a new computational model,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, vol. 18, no. 4, pp. 412–425, April 1996.
[31] S.H. Lee, N.S. Moon, and C.W. Lee, “Recovery of blurred video signals using iterative image
restoration combined with motion estimation,” in Proceedings of the International Conference
on Image Processing, Santa Barbara, CA, USA, October 1997, pp. 755–758.
[32] M. Kumar and J.E. Adams Jr., “Producing full-color image using CFA image,” U.S. Patent
Application 12/412 429, 2009.
3
Lossless Compression of Bayer Color Filter Array
Images
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.2 Concerns in CFA Image Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.3 Common Compression Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.4 Compression Using Context Matching-Based Prediction . . . . . . . . . . . . . . . . . . . . . 87
3.4.1 Context Matching-Based Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.4.1.1 Luminance Subimage-Based Prediction . . . . . . . . . . . . . . . . . . . . . 88
3.4.1.2 Chrominance Subimage-Based Prediction . . . . . . . . . . . . . . . . . . . 89
3.4.2 Adaptive Color Difference Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.4.3 Subimage Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.5 Compression Based on Statistical Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.5.1 Statistic-Based Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.5.2 Subband Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.6 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.6.1 Bit Rate Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.6.2 Complexity Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.1 Introduction
Most digital cameras reduce their cost, size and complexity by using a single-sensor
image acquisition system to acquire a scene in digital format [1]. In such a system, an
image sensor is overlaid with a color filter array (CFA), such as the Bayer pattern [2] shown
in Figure 3.1a, to record one of the three primary color components at each pixel location.
Consequently, a gray-scale mosaic-like image, commonly referred to as a CFA image, is
produced as the sensor output.
An imaging pipeline is required to turn a CFA image into a full-color image. Images
are commonly compressed to reduce the storage requirement and store as many images as
possible in a given storage medium. Figure 3.2a shows the simplified pipeline which first
converts the CFA image to a full-color image using color demosaicking [3], [4], [5] and
then compresses the demosaicked full-color image for storage.
FIGURE 3.1
A Bayer color filter array with green-red-green-red phase in the first row: (a) complete pattern, (b) luminance
plane, and (c) chrominance plane.
This approach reduces the burden of compression, since one can use any existing image
coding scheme to compress the demosaicked full-color data in either a lossless or lossy
manner. However, it may be considered suboptimal from the compression point of view
because the demosaicking process always introduces some redundancy which should even-
tually be removed in the subsequent compression step [6], [7], [8].
To address this issue, an alternative approach, as shown in Figure 3.2b, aims at compress-
ing the CFA image prior to demosaicking [9], [10], [11], [12]. As a result, more sophis-
ticated demosaicking algorithms can be applied offline on a personal computer to produce
a more visually pleasing full-color output. Moreover, since the data size of a CFA image to be
handled by the compression step is only one-third that of the corresponding demosaicked
image, this alternative approach can effectively increase the pipeline throughput without
degrading the output image quality.
Since this alternative pipeline has been proven to outperform the conventional one when
the quality requirement of the output color image is high [7], [8], it has been adopted in
many prosumer and professional grade digital cameras to serve as an optional imaging
path to deliver a precise, high quality output. This motivates the demand for CFA image
compression schemes accordingly.
FIGURE 3.2
Simplified imaging pipelines for single-sensor digital cameras: (a) conventional approach, (b) alternative ap-
proach.
Both lossy and lossless compression schemes can be applied to CFA image data. Lossy
compression schemes, such as those presented in References [13], [14], [15], [16], [17]
and [18], compress the CFA image by discarding its visually redundant information. Thus,
only an approximation of the original image can eventually be reconstructed. Since loss
of information is allowed, these schemes usually yield a higher compression ratio than the
lossless schemes. However, lossless compression schemes preserve all the information of
the original and hence allow perfect reconstruction of the original image. Therefore, they
are crucial in coding the CFA images which can be seen as digital negatives and used as an
ideal original archive format for producing high quality color images especially in high-end
photography applications such as commercial poster production.
Certainly, many standard lossless image compression schemes, such as JPEG-LS [19] and JPEG2000 in its lossless mode [20], can be used to compress a CFA image directly. However, they achieve only fair compression performance, as the spatial correlation among adjacent pixels is generally weakened in a CFA image due to its mosaic-like structure. To perform the compression more efficiently, later methods [21], [22] aim at increasing the spatial correlation by de-interleaving the CFA image according to the red, green, and blue color channels and then compressing the three subsampled color planes individually with the lossless image compression standards. Nevertheless, redundancy among the color channels still remains. Recently, some advanced lossless CFA image compression algorithms [23], [24], [25] have been reported to efficiently remove the pixel redundancy in both the spatial and spectral domains. These algorithms do not rely on any single coding technique but rather combine various techniques to remove the data redundancy by different means. This chapter surveys relevant lossless coding techniques and presents two new lossless compression algorithms to show how different techniques can be mixed to achieve effective compression. Performance comparisons, in terms of compression ratio and computational complexity, are included.
This chapter is structured as follows. Section 3.2 discusses some major concerns in
the design of a lossless compression algorithm for CFA images. Section 3.3 focuses on
some common coding techniques used in lossless CFA image coding. Sections 3.4 and 3.5
present two lossless compression algorithms which serve as examples to show how the
various coding techniques discussed in Section 3.3 can work together to remove the re-
dundancy in different forms. The simulation results in Section 3.6 show that remarkable
compression performance can be achieved with these two algorithms. Finally, conclusions
are drawn in Section 3.7.
The second consideration is the complexity of the algorithm. For in-camera compression,
real-time processing is always expected. As compression is required for each image, the
processing time of the compression algorithm determines the frame rate of the camera.
Parallel processing support can help reduce an algorithm’s processing time, but it still may
not be a solution as it does not reduce the overall complexity.
The complexity of an algorithm can be measured in terms of the number of operations required to compress an image. This measure may not reflect the real performance of the algorithm, as the number of operations is not the sole factor that determines the required processing time. The number of branch decisions, the number of data transfers involved, the hardware used to realize the algorithm, and many other factors also play their roles. The impact of these factors on the processing time is hardware dependent and can fluctuate from case to case. The power consumption induced by an algorithm is also a hardware issue and is highly reliant on the hardware design. Without a matched hardware platform as the test bed of an algorithm, it is impossible to judge its real performance. Therefore, in this chapter, the processing time required to execute an algorithm on a specified general-purpose hardware platform is measured to indicate the complexity of the algorithm.
Since minimizing the complexity and maximizing the compression ratio are generally conflicting goals, a compromise is always required in practice. Note that the energy
consumption increases with the complexity of any algorithm, thus affecting both the battery
size and the operation time of a camera directly. There are some other factors, such as the
memory requirements, that one has to consider when designing a compression algorithm,
but their analysis is beyond the scope of this chapter.
Predictive coding removes redundancy by making use of the correlation among the input
data. For each data entry, a prediction is performed to estimate its value based on the
correlation and the prediction error is encoded. The spatial correlation among pixels is
commonly used in predictive coding. Since color channels in a CFA image are interlaced,
the spatial correlation in CFA images is generally lower than in color images. Therefore,
many spatial predictive coding algorithms designed for color images cannot provide a good
compression performance when they are used to encode a CFA image directly.
However, there are solutions which preprocess the CFA image to provide an output with improved correlation characteristics that is more suitable for predictive coding than the original input. This can be achieved by deinterleaving the CFA image into several separate images, each of which contains the pixels from the same color channel [21], [22]. It can also be achieved by converting the data from the RGB space to the YCrCb space [10], [12]. Although a number of preprocessing procedures can be designed, not all of them are reversible, and only reversible ones can be used in lossless compression of CFA images.
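As a minimal illustration of such a reversible preprocessing step, the sketch below (hypothetical function names) separates a Bayer CFA image with the green-red phase of Figure 3.1a into four subsampled color planes and reassembles it losslessly; splitting the green channel into two rectangular planes is a simplification of the quincunx separation used in References [21] and [22].

```python
import numpy as np

def deinterleave_bayer(cfa):
    """Split a Bayer CFA image (green-red phase in the first row, as in
    Figure 3.1a) into four subsampled planes: green on red rows, red,
    blue, and green on blue rows.  Each plane recovers stronger spatial
    correlation and can be coded with a standard lossless scheme."""
    g1 = cfa[0::2, 0::2]   # green samples on red rows
    r  = cfa[0::2, 1::2]   # red samples
    b  = cfa[1::2, 0::2]   # blue samples
    g2 = cfa[1::2, 1::2]   # green samples on blue rows
    return g1, r, b, g2

def interleave_bayer(g1, r, b, g2):
    """Inverse (lossless) operation: rebuild the original CFA image."""
    h, w = g1.shape[0] + b.shape[0], g1.shape[1] + r.shape[1]
    cfa = np.empty((h, w), dtype=g1.dtype)
    cfa[0::2, 0::2] = g1
    cfa[0::2, 1::2] = r
    cfa[1::2, 0::2] = b
    cfa[1::2, 1::2] = g2
    return cfa

# round-trip check on random 10-bit CFA data
cfa = np.random.randint(0, 1024, (512, 768), dtype=np.uint16)
assert np.array_equal(cfa, interleave_bayer(*deinterleave_bayer(cfa)))
```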
In transform coding, the discrete cosine transform and the wavelet transform are usually
used to decorrelate the image data. Since typical images generally contain redundant edges
and details, insignificant high-frequency contents can thus be discarded to save coding bits.
When distortion is allowed, transform coding helps to achieve good rate-distortion perfor-
mance and hence it is widely used in lossy image compression. In particular, the integer
Mallat wavelet packet transform is highly suitable to decorrelate mosaic CFA data [23],
[24]. This encourages the use of transform coding in lossless compression of CFA images.
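Figure 3.7 later in this chapter indicates that the statistic-based scheme builds on a 5/3 wavelet. As a hedged illustration of such a reversible integer transform, the sketch below (hypothetical function names) implements one level of the 1D 5/3 lifting step; applying it separably and recursively along rows and columns yields an integer wavelet decomposition of the kind referred to above.

```python
import numpy as np

def lwt53_forward(x):
    """One level of the reversible integer 5/3 lifting wavelet for a 1D
    signal of even length; returns the (lowpass, highpass) subbands."""
    x = np.asarray(x, dtype=np.int64)
    even, odd = x[0::2], x[1::2]
    right = np.append(even[1:], even[-1])          # symmetric extension
    d = odd - (even + right) // 2                  # predict step (highpass)
    left = np.insert(d[:-1], 0, d[0])              # symmetric extension
    s = even + (left + d + 2) // 4                 # update step (lowpass)
    return s, d

def lwt53_inverse(s, d):
    """Exact inverse of lwt53_forward (perfect reconstruction)."""
    left = np.insert(d[:-1], 0, d[0])
    even = s - (left + d + 2) // 4
    right = np.append(even[1:], even[-1])
    odd = d + (even + right) // 2
    x = np.empty(even.size + odd.size, dtype=np.int64)
    x[0::2], x[1::2] = even, odd
    return x

x = np.random.randint(0, 4096, 256)
s, d = lwt53_forward(x)
assert np.array_equal(x, lwt53_inverse(s, d))      # lossless round trip
```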
Other lossless coding techniques, such as run-length coding [26], Burrows-Wheeler
transform [27], and adaptive dictionary coding (e.g., LZW [28]) are either designed for
a specific type of input other than CFA images (for example, run-length coding is suitable
for coding binary images) or designed for universal input. Since they do not take the prop-
erties of a CFA image into account, it is expected that the redundancy in a CFA image
cannot be effectively removed if one just treats the CFA image as a typical gray-level im-
age or even a raster-scanned sequence of symbols when using these coding techniques. A
preprocessing step would be necessary to turn a CFA image into a better form to improve
the compression performance when these techniques are exploited.
At the moment, most, if not all, lossless compression algorithms designed for coding
CFA images mainly rely on predictive, entropy, and transform coding. In the following,
two dedicated lossless compression algorithms for CFA image coding are presented. These
algorithms serve as examples of combining the three aforementioned coding techniques to
achieve remarkable compression performance.
FIGURE 3.3
Structure of the context-matching-based lossless CFA image compression method: (a) encoder (CFA image → color plane separation → CM-based prediction → entropy encoding of the residue → code stream), and (b) decoder (code stream → entropy decoding → inverse CM-based prediction → color plane combination → CFA image).
in the same subimage are raster-scanned and each one of them undergoes a prediction pro-
cess based on context matching and an entropy coding process as shown in Figure 3.3a.
Due to the higher number of green samples in the CFA image compared to red or blue
samples, the luminance subimage is encoded before encoding the chrominance subimage.
When handling the chrominance subimage, the luminance subimage is used as a reference
to remove the interchannel correlation.
Decoding is just the reverse process of encoding as shown in Figure 3.3b. The lumi-
nance subimage is decoded first to be used as a reference when decoding the chrominance
subimage. The original CFA image is reconstructed by combining the two subimages.
FIGURE 3.4
Prediction in the luminance subimage: (a) four closest neighbors used to predict the intensity value of sample
g(i, j), and (b) pixels used to construct the context of sample g(i, j).
Theoretically, other metrics such as the well-known Euclidean distance can be accom-
modated in the above equation to enhance matching performance. However, the achieved
improvement is usually not significant enough to compensate for the increased implemen-
tation complexity.
Let g(mk, nk) ∈ Φg(i,j), for k = 1, 2, 3, 4, represent the four ranked neighbors of sample g(i, j) such that D1(Sg(i,j), Sg(mu,nu)) ≤ D1(Sg(i,j), Sg(mv,nv)) for 1 ≤ u < v ≤ 4. The value of g(i, j) can then be predicted with a prediction filter as follows:

ĝ(i, j) = round( ∑_{k=1}^{4} wk g(mk, nk) ),   (3.2)
FIGURE 3.5
Prediction in the chrominance subimage: (a) four closest neighbors used to predict the color difference value
of sample c(i, j) = r(i, j) or c(i, j) = b(i, j), and (b) pixels used to construct the context of sample c(i, j).
available green samples, as shown in Figures 3.5a and 3.5b, respectively. This arrangement
is based on the fact that the green channel has a double sampling rate compared to the red
and blue channels in the CFA image and green samples are encoded first. As a consequence,
it provides a more reliable noncausal context for matching.
Color difference values d(i, j −2), d(i−2, j −2), d(i−2, j), and d(i−2, j +2) are ranked
according to the absolute difference between their context and the context of d(i, j). The
predicted value of d(i, j) is determinable as follows:
à !
4
ˆ j) = round ∑ wk d(mk , nk ) ,
d(i, (3.4)
k=1
where
¡ wk is the weight ¢ associated
¡ with ¢the the kth ranked neighbor d(mk , nk ) such that
D2 Sc(i, j) , Sc(mu ,nu ) ≤ D2 Sc(i, j) , Sc(mv ,nv ) for 1 ≤ u < v ≤ 4, where
¡ ¢
D2 Sc(i, j) , Sc(m,n) = |g(i, j − 1) − g(m, n − 1)| + |g(i, j + 1) − g(m, n + 1)|
(3.5)
+ |g(i − 1, j) − g(m − 1, n)| + |g(i + 1, j) − g(m + 1, n)|
measures the difference between two contexts.
Weights wk , for k = 1, 2, 3, 4, are trained similarly to the weights used in luminance signal
prediction. Under this training condition, color difference prediction is obtained as follows:
d̂(i, j) = round( (4d(m1, n1) + 2d(m2, n2) + d(m3, n3) + d(m4, n4)) / 8 ),   (3.6)
which also involves shift-add operations only.
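To make the shift-add structure explicit, the following sketch (hypothetical function name) realizes the weighted prediction of Equations 3.2 and 3.6; the caller is assumed to supply the four causal neighbor values already ranked by increasing context distance (D1 or D2).

```python
def predict_ranked(neighbors):
    """Weighted prediction of Equations 3.2 and 3.6: the four causal
    neighbors are assumed to be ranked by increasing context distance,
    and the weights 4/8, 2/8, 1/8, 1/8 reduce the filter to shifts and
    additions."""
    v1, v2, v3, v4 = neighbors
    total = (v1 << 2) + (v2 << 1) + v3 + v4   # 4*v1 + 2*v2 + v3 + v4
    return (total + 4) >> 3                   # integer rounding of total/8

# example: predict_ranked([30, 28, 35, 27]) -> 30
```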
where g(i, j) is the real value of the luminance sample, d(i, j) is the color difference value
estimated with the method described in Section 3.4.2, ĝ(i, j) is the predicted value of g(i, j),
and d̂(i, j) is the predicted value of d(i, j). The error residue, e(i, j), is then mapped to a nonnegative integer via
E(i, j) = −2e(i, j)        if e(i, j) ≤ 0,
E(i, j) = 2e(i, j) − 1     otherwise,                     (3.10)
to reshape its value distribution from a Laplacian type to a geometric one for Rice coding.
When Rice coding is used, each mapped residue E(i, j) is split into a quotient Q = floor(E(i, j)/2^λ) and a remainder R = E(i, j) mod 2^λ, where the parameter λ is a nonnegative integer. The quotient and the remainder are then saved for storage or transmission. The length of the codeword used for representing E(i, j) depends on λ and is determinable as follows:

L(E(i, j) | λ) = floor( E(i, j) / 2^λ ) + 1 + λ.   (3.11)
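As a minimal illustration under these definitions (hypothetical function names, fixed λ), the mapping of Equation 3.10 and the construction of a Rice codeword whose length follows Equation 3.11 can be sketched as follows.

```python
def map_residue(e):
    """Map a prediction error to a non-negative integer (Equation 3.10)."""
    return -2 * e if e <= 0 else 2 * e - 1

def rice_encode(E, lam):
    """Rice codeword of a mapped residue E with parameter lam: the quotient
    floor(E / 2**lam) is sent in unary ('1' * q followed by '0'), then lam
    remainder bits.  The length equals floor(E / 2**lam) + 1 + lam, as in
    Equation 3.11."""
    q, r = divmod(E, 1 << lam)
    remainder_bits = format(r, '0{}b'.format(lam)) if lam > 0 else ''
    return '1' * q + '0' + remainder_bits

# example: e = -3 -> E = 6; with lam = 2 the codeword is '1010' (length 4)
```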
Parameter λ is critical for the compression performance as it determines the code length of E(i, j). For a geometric source S with distribution parameter ρ ∈ (0, 1) (i.e., Prob(S = s) = (1 − ρ)ρ^s for s = 0, 1, 2, . . .), the optimal coding parameter λ is given as

λ = max{ 0, ceil( log₂( log φ / log ρ⁻¹ ) ) },   (3.12)
where φ = (√5 + 1)/2 is the golden ratio [29]. Since the expectation value of the source is µ = ρ/(1 − ρ), it follows that

ρ = µ / (1 + µ).   (3.13)
As long as µ is known, parameter ρ and hence the optimal coding parameter λ for the
whole source can be determined easily.
To enhance coding efficiency, µ can be estimated adaptively in the course of encoding
the mapped residues E(i, j) as follows:
µ̃ = round( (α µ̃p + Mi,j) / (1 + α) ),   (3.14)

where µ̃ is the current estimate of µ for selecting λ to determine the codeword format of the current E(i, j). The weighting factor α specifies the significance of µ̃p and Mi,j when updating µ̃. The term µ̃p, initially set to zero for all residue subplanes, is the previous estimate of µ̃, which is updated for each E(i, j). The term
Mi,j = (1/4) ∑_{(a,b)∈ξi,j} E(a, b)   (3.15)
denotes the local mean of E(i, j) in a region defined as the set ξi, j of four processed pixel
locations which are closest to (i, j) and, at the same time, possess samples from the same
color channel as that of (i, j). When coding the residues from the luminance subimage,
ξi, j = {(i, j − 2), (i − 1, j − 1), (i − 2, j), (i − 1, j + 1)}. When coding the residues from the
chrominance subimage, ξi, j = {(i, j − 2), (i − 2, j − 2), (i − 2, j), (i − 2, j + 2)}.
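A small sketch of this parameter selection (hypothetical function names) is given below: the Rice parameter follows Equations 3.12 and 3.13, and the adaptive mean estimate combines Equations 3.14 and 3.15.

```python
import math

def optimal_rice_parameter(mu):
    """Rice parameter for a geometric source with mean mu
    (Equations 3.12 and 3.13)."""
    if mu <= 0:
        return 0
    rho = mu / (1.0 + mu)                    # Equation 3.13
    phi = (math.sqrt(5.0) + 1.0) / 2.0       # golden ratio
    return max(0, math.ceil(math.log2(math.log(phi) / math.log(1.0 / rho))))

def update_mu(mu_prev, E_map, xi, alpha=1.0):
    """Adaptive mean estimate (Equations 3.14 and 3.15): blend the previous
    estimate with the local mean of the four causal same-channel residues
    whose locations are listed in xi; E_map maps (row, col) to a residue."""
    M = sum(E_map[loc] for loc in xi) / 4.0
    return int(round((alpha * mu_prev + M) / (1.0 + alpha)))
```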
The value of the weighting factor α is determined empirically. Figure 3.6 shows the
effect of α on the final compression ratio of the above compression algorithm. Curves “G”
and “R and B” respectively show the cases when coding the residues from the luminance
and the chrominance subimages. The curve marked as “all” depicts the overall performance
when all residue subplanes are compressed with a common α value. This figure indicates
that α = 1 can provide good compression performance.
FIGURE 3.6
Average output bit rates (in bpp) versus different α values.
FIGURE 3.7
Workflow of the statistic-based lossless CFA image compression method: (a) encoder (CFA image → 5/3 wavelet transform → statistics-based prediction → entropy encoding of the residue → code stream), and (b) decoder (code stream → entropy decoding → inverse statistics-based prediction → inverse 5/3 wavelet transform → CFA image).
FIGURE 3.8
Subband coefficient c0 and its causal adjacent neighbors used in statistic-based prediction: (a) causal template with c1, c2, c3, and c4 at the western, northwestern, northern, and northeastern positions, respectively, and (b) the four possible optimal prediction directions for c0, indexed 1 (W), 2 (NW), 3 (N), and 4 (NE).
ĉ0 = ∑_{k=1}^{4} wk ck,   (3.16)
where wk is the weight associated with the coefficient ck. These weights are constrained as w1 + w2 + w3 + w4 = 1 and indicate the likelihood of each neighboring coefficient having the value closest to that of c0, under a condition derived from the local value distribution of the coefficients. The likelihood is estimated from the frequency with which the corresponding event has occurred so far while processing the coefficients in the same subband.
Using the template shown in Figure 3.8a, the optimal neighbor of c0 is the one which minimizes the difference to c0, as follows:

arg min_{ck} |ck − c0|,  for k = 1, 2, 3, 4.   (3.17)
When there is more than one optimal neighbor, one of them is randomly selected. The
optimal neighbor of c0 can be located in any of four directions shown in Figure 3.8b. The
direction from c0 to its optimal neighbor is referred to as the optimal prediction direction
of c0 and its corresponding index value is denoted as dc0 hereafter.
The weights needed for predicting c0 can be determined as follows:
wk = Prob(dc0 = k|dc1 , dc2 , dc3 , dc4 ), for k = 1, 2, 3, 4, (3.18)
where dc j is the index value of the optimal prediction direction of the coefficient c j . The
term Prob(dc0 = k|dc1 , dc2 , dc3 , dc4 ) denotes the probability that the optimal prediction di-
rection index of c0 is k under the condition that dc1 , dc2 , dc3 , and dc4 are known.
Since dc0 and hence Prob(dc0 = k|dc1 , dc2 , dc3 , dc4 ) are not available during decoding, to
predict the current coefficient c0 in this method the probability is estimated as follows:
Prob(dc0 = k | dc1, dc2, dc3, dc4) = C(k | dc1, dc2, dc3, dc4) / ∑_{j=1}^{4} C(j | dc1, dc2, dc3, dc4),  for k = 1, 2, 3, 4,   (3.19)
where C(·|·) is the current value of a conditional counter used to keep track of the occur-
rence frequency of k being the optimal prediction direction index of a processed coefficient
whose western, northwestern, northern and northeastern neighbors’ optimal prediction di-
rection indices are dc1 , dc2 , dc3 , and dc4 , respectively.
Since dcj ∈ {1, 2, 3, 4} for j = 1, 2, 3, 4, there are in total 256 possible combinations of dc1, dc2, dc3, and dc4. Accordingly, 256 × 4 counters are required and a table of 256 × 4 entries needs to be constructed to maintain these counters. This table is initialized with all entries set to one before the compression starts and is updated in the course of the compression. As soon as the coefficient c0 is encoded, the counter C(dc0 | dc1, dc2, dc3, dc4) is increased by one to update the table. With this table, which keeps track of the occurrence frequencies of particular optimal prediction direction index values while processing the subband, the predictor can learn from experience to improve its prediction performance adaptively.
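A compact sketch of this learning predictor is given below (class and method names are hypothetical; direction indices are 0-based here, and ties in Equation 3.17 are broken by taking the first minimum instead of a random choice).

```python
import numpy as np

class DirectionPredictor:
    """Statistic-based prediction of a subband coefficient c0 from its
    causal neighbors (c1 = W, c2 = NW, c3 = N, c4 = NE), following
    Equations 3.16 to 3.19."""

    def __init__(self):
        # counters C(k | d1, d2, d3, d4), all initialized to one
        self.counts = np.ones((4, 4, 4, 4, 4), dtype=np.int64)

    def predict(self, c, d):
        """c: values of the four neighbors; d: their optimal direction
        indices (each in 0..3).  Returns the prediction of Equation 3.16."""
        ctx = self.counts[d[0], d[1], d[2], d[3]]
        w = ctx / ctx.sum()                      # weights via Equation 3.19
        return float(np.dot(w, c))               # Equation 3.16

    def update(self, c, d, c0):
        """After coding c0, record its optimal prediction direction."""
        d0 = int(np.argmin(np.abs(np.asarray(c) - c0)))   # Equation 3.17
        self.counts[d[0], d[1], d[2], d[3], d0] += 1
```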
where c(i, j) is the real coefficient value and ĉ(i, j) is the predicted value of c(i, j). Similar
to the approach presented in Section 3.4.3, the error residue e(i, j) is mapped to a nonneg-
ative integer E(i, j) with Equation 3.10 and encoded with adaptive Rice code.
When estimating the expectation value of E(i, j) for obtaining the optimal coding pa-
rameter λ for Rice code with Equations 3.12, 3.13, and 3.14, the local mean of E(i, j) is
adaptively estimated with the causal adjacent mapped errors as follows:
Mi,j = (1/4) ∑_{(m,n)∈ξi,j} E(m, n),   (3.21)

where ξi,j = {(i − 1, j), (i − 1, j − 1), (i − 1, j + 1), (i, j − 1)}. Note that the members
of ξi, j are derived based on the coefficients in the same subband and they all have the same
nature. This estimation thus differs from the approach presented in Section 3.4.3 where the
members of ξi, j are obtained using samples from different color channels and hence should
not be used to derive the local mean of E(i, j).
With Mi, j available, the expectation value of E(i, j) is estimated adaptively using Equa-
tion 3.14. Through a training process similar to that discussed in Section 3.4.3, the weight-
ing factor α is set to one. The term µ̃ is updated for each E(i, j).
FIGURE 3.9
Test images: (a) Kodak image set – referred to as Images 1 to 24 in raster scan order, and (b) real raw Bayer
image set captured with a Nikon D200 camera – referred to as Images 25 to 40 in raster scan order.
Average 7.671 6.095 5.900 5.020 5.004 5.231 4.893 4.629 4.749
Table 3.2 lists bit rates for various coding approaches evaluated on real raw Bayer sensor
images. Similar to the previous experiment, CP provides the lowest output bit rate among
all the evaluated coding approaches. Note that the original bit rate is twelve bits per pixel
for this test set. The compression ratio is hence around two to one. As compared with the
original storage format (NEF), CP can save up to 1.558M bytes per image on average.
For this test database, quincunx separation helps both JPEG-LS and JPEG2000 to reduce the output bit rates, achieving significant improvements over direct JPEG-LS coding and even outperforming LCCMI. This can be explained based on the observation that the spatial correlation increases with image resolution. The larger the image size, the higher the spatial correlation of the output of quincunx separation, and the more suitable it is for JPEG-LS compression. Based on this finding, it may be worth exploring whether there is a better preprocessing step that allows better coding performance using JPEG-LS.
25 11.613 10.828 7.551 6.584 6.535 6.726 6.523 6.172 6.304 7.615
26 10.596 10.344 8.637 5.948 5.470 5.521 5.522 5.382 5.403 6.331
27 11.136 9.839 8.438 6.551 6.108 6.277 6.321 6.085 6.132 7.364
28 11.511 10.447 8.058 6.381 6.024 6.186 6.247 6.014 6.035 7.163
29 10.454 8.935 6.981 5.489 5.137 5.211 5.387 5.295 5.244 6.867
30 12.202 11.147 8.796 6.993 6.742 6.958 6.804 6.483 6.582 7.633
31 11.658 10.242 7.013 6.528 6.364 6.379 6.623 6.408 6.368 6.805
32 11.962 10.944 7.994 6.373 6.198 6.332 6.220 5.970 6.015 7.375
33 11.546 10.725 7.290 6.295 6.145 6.360 6.124 5.840 5.952 7.551
34 12.040 10.727 7.920 6.812 6.596 6.679 6.769 6.537 6.523 7.466
35 12.669 10.882 8.667 6.933 6.621 6.670 6.779 6.505 6.595 8.051
36 11.668 10.809 8.708 6.784 6.526 6.717 6.660 6.395 6.435 7.413
37 11.446 10.530 8.585 6.948 6.633 6.804 6.852 6.607 6.609 7.448
38 10.106 9.925 6.866 5.660 5.460 5.589 5.476 5.288 5.347 6.991
39 11.864 10.955 7.705 6.364 6.192 6.390 6.201 5.925 6.001 7.339
40 11.713 10.552 7.752 6.566 6.534 6.735 6.461 6.127 6.252 8.158
Average 11.512 10.489 7.935 6.450 6.206 6.346 6.311 6.065 6.112 7.348
individually compressed ten times and the average processing time per image was recorded. As can be seen, JPEG-LS is the most efficient of the considered approaches. On average, it is almost twice as fast as the other approaches, among which SL shows the highest efficiency when handling large real raw CFA images. Finally, it should be noted that comparing the processing times of JPEG-LS and S+JLS indicates the low complexity of quincunx separation.
3.7 Conclusion
Compressing the CFA image can improve the efficiency of in-camera processing as one
can skip the demosaicking process to eliminate the overhead. Without the demosaicking
step, no extra redundant information is added to the image to increase the loading of the
subsequent compression process. Since digital camera images are commonly stored in
the so-called “raw” format to allow their high quality processing on a personal computer,
lossless compression of CFA images becomes necessary to avoid information loss. Though
a number of lossy compression algorithms have been proposed for coding CFA images [6],
[7], [10], [11], [12], [13], [14], [15], [16], [17], [18], only a few lossless compression
algorithms have been reported in the literature [23], [24], [25], [33].
This chapter revisited some common lossless image coding techniques and presented two
new CFA image compression algorithms. These algorithms rely on predictive and entropy
coding to remove the redundancy using the spatial correlation in the CFA image and the
statistical distribution of the prediction residue. Table 3.4 highlights the approaches used in the proposed CP and SL algorithms to remove various kinds of data redundancy.

TABLE 3.3
Average execution time (in seconds per frame) for compressing images from two test sets.

TABLE 3.4
Approaches used in algorithms CP and SL to remove the image redundancy.
Interchannel redundancy:
  CP: operates in the color difference domain.
  SL: uses the integer Mallat wavelet packet transform.
Spatial redundancy:
  CP: linear prediction with four neighbors; the neighbor whose context is closer to that of the pixel of interest is weighted more.
  SL: linear prediction with four neighbors; the neighbor whose value is more likely to be the closest to that of the pixel of interest is weighted more.
Extensive experimentation showed that CP provides the best average output bit rates for
various test image databases. An interesting finding is that quincunx separation greatly en-
hances the performance of JPEG-LS. When the input CFA image is large enough, quincunx
separation produces gray-level images with strong spatial correlation characteristics and
hence JPEG-LS can compress them easily. Considering its low computational complexity,
S+JLS could be a potential rival for dedicated CFA image compression algorithms. On the
other hand, since S+JLS does not remove the interchannel redundancy during compression,
well-designed dedicated CFA coding algorithms which take the interchannel redundancy
into account should be able to achieve better compression ratios than S+JLS.
Compressing raw mosaic-like single-sensor images constitutes a rapidly developing re-
search field. Despite recent progress, a number of challenges remain in the design of low-
complexity high-performance lossless coding algorithms. It is therefore expected that there
will be new CFA image coding algorithms proposed in the near future.
Acknowledgment
We would like to thank the Research Grants Council of the Hong Kong Special Administrative Region, China (PolyU 5123/08E) and the Center for Multimedia Signal Processing, The Hong Kong Polytechnic University, Hong Kong (G-U413) for supporting our research project on the topics presented in this chapter.
References
[1] R. Lukac (ed.), Single-Sensor Imaging: Methods and Applications for Digital Cameras. Boca
Raton, FL: CRC Press / Taylor & Francis, September 2008.
[2] B.E. Bayer, “Color imaging array,” U.S. Patent 3 971 065, July 1976.
[3] K. Hirakawa and T.W. Parks, “Adaptive homogeneity-directed demosaicing algorithm,” IEEE
Transactions on Image Processing, vol. 14, no. 3, pp. 360–369, March 2005.
[4] B.K. Gunturk, J. Glotzbach, Y. Altunbasak, R.W. Schafer, and R.M. Mersereau, “Demosaick-
ing: Color filter array interpolation,” IEEE Signal Processing Magazine, vol. 22, no. 1, pp. 44–
54, January 2005.
[5] K.H. Chung and Y.H. Chan, “Color demosaicing using variance of color differences,” IEEE
Transactions on Image Processing, vol. 15, no. 10, pp. 2944–2955, October 2006.
[6] C.C. Koh, J. Mukherjee, and S.K. Mitra, “New efficient methods of image compression in
digital cameras with color filter array,” IEEE Transactions on Consumer Electronics, vol. 49,
no. 4, pp. 1448–1456, November 2003.
[7] N.X. Lian, L. Chang, V. Zagorodnov, and Y.P. Tan, “Reversing demosaicking and compression
in color filter array image processing: Performance analysis and modeling,” IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3261–3278, November 2006.
[8] D. Menon, S. Andriani, G. Calvagno, and T. Eirseghe, “On the dependency between compres-
sion and demosaicing in digital cinema,” in Proceedings of the IEE European Conference on
Visual Media Production, London, UK, December 2005, pp. 104–111.
[9] N.X. Lian, V. Zagorodnov, and Y.P. Tan, Single-Sensor Imaging: Methods and Applications
for Digital Cameras, ch. Modelling of image processing pipelines in single-sensor digital
cameras, R. Lukac (ed.), Boca Raton, FL: CRC Press / Taylor & Francis, September 2008,
pp. 381–404.
[10] S.Y. Lee and A. Ortega, “A novel approach of image compression in digital cameras with
a Bayer color filter array,” in Proceedings of the IEEE International Conference on Image
Processing, Thessaloniki, Greece, October 2001, pp. 482–485.
[11] T. Toi and M. Ohta, “A subband coding technique for image compression in single CCD cameras with Bayer color filter arrays,” IEEE Transactions on Consumer Electronics, vol. 45, no. 1,
pp. 176–180, February 1999.
[12] G.L.X. Xie, Z. Wang, D.L.C. Zhang, and X. Li, “A novel method of lossy image compression
for digital image sensors with Bayer color filter arrays,” in Proceedings of the IEEE Interna-
tional Symposium on Circuits and Systems, Kobe, Japan, May 2005, pp. 4995–4998.
[13] Y.T. Tsai, “Color image compression for single-chip cameras,” IEEE Transactions on Electron
Devices, vol. 38, no. 5, pp. 1226–1232, May 1991.
[14] A. Bruna, F. Vella, A. Buemi, and S. Curti, “Predictive differential modulation for CFA
compression,” in Proceedings of the 6th Nordic Signal Processing Symposium, Espoo, Fin-
land, June 2004, pp. 101–104.
[15] S. Battiato, A. Bruna, A. Buemi, and F. Naccari, “Coding techniques for CFA data images,” in
Proceedings of IEEE International Conference on Image Analysis and Processing, Mantova,
Italy, September 2003, pp. 418–423.
[16] A. Bazhyna, A. Gotchev, and K. Egiazarian, “Near-lossless compression algorithm for Bayer
pattern color filter arrays,” Proceedings of SPIE, vol. 5678, pp. 198–209, January 2005.
[17] B. Parrein, M. Tarin, and P. Horain, “Demosaicking and JPEG2000 compression of mi-
croscopy images,” in Proceedings of the IEEE International Conference on Image Processing,
Singapore, October 2004, pp. 521–524.
[18] R. Lukac and K.N. Plataniotis, “Single-sensor camera image compression,” IEEE Transac-
tions on Consumer Electronics, vol. 52, no. 2, pp. 299–307, May 2006.
[19] “Information technology - lossless and near-lossless compression of continuous-tone still im-
ages (JPEG-LS),” ISO/IEC Standard 14495-1, 1999.
[20] “Information technology - JPEG 2000 image coding system - Part 1: Core coding system,”
INCITS/ISO/IEC Standard 15444-1, 2000.
[21] A. Bazhyna, A.P. Gotchev, and K.O. Egiazarian, “Lossless compression of Bayer pattern color
filter arrays,” in Proceedings of SPIE, vol. 5672, pp. 378–387, January 2005.
[22] X. Xie, G. Li, and Z. Wang, “A low-complexity and high-quality image compression method
for digital cameras,” ETRI Journal, vol. 28, no. 2, pp. 260–263, April 2006.
[23] N. Zhang and X.L. Wu, “Lossless compression of color mosaic images,” IEEE Transactions
on Image Processing, vol. 15, no. 6, pp. 1379–1388, June 2006.
[24] N. Zhang, X. Wu, and L. Zhang, Single-Sensor Imaging: Methods and Applications for Digital
Cameras, ch. Lossless compression of color mosaic images and video, R. Lukac (ed.), Boca
Raton, FL: CRC Press / Taylor & Francis, September 2008, pp. 405–428.
[25] K.H. Chung and Y.H. Chan, “A lossless compression scheme for Bayer color filter array im-
ages,” IEEE Transactions on Image Processing, vol. 17, no. 2, pp. 134–144, February 2008.
[26] S. Golomb, “Run-length encodings,” IEEE Transactions on Information Theory, vol. 12, no. 3,
pp. 399–401, July 1966.
[27] M. Burrows and D.J. Wheeler, “A block sorting lossless data compression algorithm,” Tech-
nical Report 124, Digital Equipment Corporation, 1994.
[28] T.A. Welch, “A technique for high-performance data compression,” Computer, vol. 17, no. 6,
pp. 8–19, June 1984.
[29] A. Said, “On the determination of optimal parameterized prefix codes for adaptive entropy
coding,” Technical Report HPL-2006-74, HP Laboratories Palo Alto, 2006.
[30] R. Franzen, “Kodak lossless true color image suite.” Available online:
http://r0k.us/graphics/kodak.
[31] D. Coffin, “Decoding raw digital photos in Linux.” Available online:
http://www.cybercom.net/~dcoffin/dcraw.
[32] A. Bazhyna and K. Egiazarian, “Lossless and near lossless compression of real color filter
array data,” IEEE Transactions on Consumer Electronics, vol. 54, no. 4, pp. 1492–1500,
November 2008.
[33] S. Andriani, G. Calvagno, and D. Menon, “Lossless compression of Bayer mask images using
an optimal vector prediction technique,” in Proceedings of the 14th European Signal Process-
ing Conference, Florence, Italy, September 2006.
4
Color Restoration and Enhancement in the
Compressed Domain
4.1 Introduction
The quality of an image captured by a camera is influenced primarily by three main
factors, namely the three-dimensional (3D) scene consisting of the objects present in it,
the illuminant(s) or radiations received by the scene from various sources, and the camera
characteristics (of its optical lenses and sensors). In a typical scenario, the 3D scene and the
camera may be considered as invariants, whereas the illumination varies depending on the
nature of the illuminant. For example, the same scene may be captured at different times
of the day. The pixel values of these images then would be quite different from each other
and the colors may also be rendered differently in the scene.
Interestingly, a human observer is able to perceive the true colors of the objects even in
complex scenes with varying illumination. Restoration of colors from varying illumination
is also known as solving for color constancy of a scene. The objective of the computation of
color constancy is to derive an illumination-independent representation of an image, so that
it could be suitably rendered with a desired illuminant. The problem has two components,
one in estimating the spectral component of the illuminant(s) and the other one in perform-
ing the color correction for rendering the image with a target illumination. The latter task
is usually carried out by following the Von Kries equation of diagonal correction [1].
Another factor involved in the visualization of a color image is the color reproduction capability of a display device, which depends on its displayable color gamut and the brightness range it can handle. The captured image may also have a poor dynamic range of brightness values due to the presence of strong background illumination. In such situations, for good color rendition on a display device, one may have to enhance an image. This enhancement is mostly done independently of solving for color constancy, as the illuminants are usually assumed to be invariant conventional sources. However, one may need to apply both color correction and enhancement for the display of color images when the scene suffers from both widely varying spectral components and brightness of illuminants.
Several methods have been reported in the literature for solving these problems, mostly in the spatial representation of images. However, as more and more imaging devices produce their end results in the compressed domain, namely in the block discrete cosine transform (DCT) space of Joint Photographic Experts Group (JPEG) compression, it is of interest to study these methods in that domain. The primary objective of processing these images directly in the compressed domain is to reduce the computational and storage requirements. Because the compressed images are processed in their own domain, the computational overhead of the inverse and forward transforms between the spatial and compressed domains is eliminated. In particular, processing in the DCT domain has drawn significant attention from researchers due to its use in the JPEG and Moving Picture Experts Group (MPEG) compression standards. There are also other advantages of using the compressed domain representation; for example, one may exploit the spectral separation of the DCT coefficients in designing these algorithms.
This chapter discusses the two above aspects of color restoration. Unlike previous work, which dealt independently with color correction and color enhancement of images represented in the block DCT space of JPEG compression, this chapter presents the color restoration task as a combination of these two computational stages. Here, restoration of colors from a noisy environment is not considered; the attention is rather focused on the limitations of sensors and display devices under varying illumination of a scene.
The following section presents the fundamentals related to the block DCT space. These
are required to understand and to design algorithms in this space. Next, the color constancy
problem is introduced and different methods for solving this problem with the DCT co-
efficients [2] are discussed. This is followed by the discussion on color enhancement in
the compressed domain. A simple approach based on scaling of DCT coefficients is also
elaborated here. Finally, some examples of color restoration using both color correction
and enhancement are shown and discussed.
The C(0, 0) coefficient is the DC coefficient and the rest are called AC coefficients for a block. The normalized transform coefficients ĉ(k, l) are defined as

ĉ(k, l) = C(k, l) / N.   (4.3)
Let µ and σ denote the mean and standard deviation of an N × N image. Then µ and σ are related to the normalized DCT coefficients as given below:

µ = ĉ(0, 0),   (4.4)

σ² = ∑_{(k,l)≠(0,0)} ĉ(k, l)².   (4.5)

In fact, from Equation 4.5, it is obvious that the sum of the squares of the normalized AC
coefficients provides the variance of the image. Hence, any change in the DC component
does not have any bearing on its standard deviation (σ ). These two statistical measures
computable directly in the compressed domain, are quite useful for designing algorithms
of color constancy and enhancement. Moreover, there exist two interesting relationships
between the block DCT coefficients, namely the relationship between the coefficients of
adjacent blocks [3] and between the higher order coefficients and the lower ones (or sub-
band relationship) [4]. Using these relationships, one may efficiently compose or decom-
pose DCT blocks, or perform interpolation or decimation operations. For details, refer to
the discussion in Reference [5].
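The relations in Equations 4.3 to 4.5 can be checked numerically with the small, self-contained sketch below, which builds an orthonormal DCT matrix explicitly; note that, in an actual JPEG pipeline, the quantized coefficients would have to be dequantized first, an aspect not covered by this sketch.

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal N-point DCT-II matrix: row k holds a(k)*cos(pi*(2m+1)*k/(2N))."""
    k = np.arange(N)[:, None]
    m = np.arange(N)[None, :]
    T = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * m + 1) * k / (2 * N))
    T[0, :] = np.sqrt(1.0 / N)
    return T

N = 8
block = np.random.rand(N, N) * 255.0           # one image block
T = dct_matrix(N)
C = T @ block @ T.T                            # 2D DCT coefficients C(k, l)
c_hat = C / N                                  # normalized coefficients (Equation 4.3)

mu = c_hat[0, 0]                               # Equation 4.4: block mean
sigma = np.sqrt((c_hat ** 2).sum() - mu ** 2)  # Equation 4.5: AC energy gives the variance
assert np.isclose(mu, block.mean()) and np.isclose(sigma, block.std())
```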
comprehensive surveys are available in References [6], [7], and [8]. All these techniques
solve the color constancy problem in the spatial representation of the images in the RGB
color space. In this chapter, the solution of this problem is considered in the block DCT
space. Moreover, since the color space in JPEG compression is YCbCr, the Von Kries
model will be adapted to the YCbCr space, and it will be demonstrated how this model can be further simplified to obtain reasonably good results with less computation.
In a simplified model [6], assuming all reflecting bodies to be ideal 2D flat Lambertian surfaces, the brightness I(x) at an image coordinate x is related to the illuminant, the reflectance property of the surface, and the camera sensor as follows:

I(x) = ∫_ω E(λ) RX(λ) S(λ) dλ,   (4.6)
where E(λ ) is the spectral power distribution (SPD) of the incident illuminant, X is the
surface point projected on x, RX (λ ) represents the surface reflectance spectrum at that point,
and S(λ ) is the relative spectral response of the sensor. The responses are accumulated over
the range of wavelength ω on which the sensors are active. In this chapter, it is assumed
that there is a single illuminant for a scene.
In an optical color camera with three sensors, each sensor operates on a different zone of the optical wavelengths, namely short wavelengths (blue zone), the mid-wavelength range (green zone), and long wavelengths (red zone). Computation for color constancy involves estimating E(λ) from these three responses. Typically, the SPD of the illuminant for the three different zones of the optical range of wavelengths needs to be estimated. It is explained in Reference [6] that the problem is underconstrained (the number of unknowns is greater than the number of observations). That is why many researchers have made additional assumptions to reduce the number of unknowns.
Namely, in the gray world assumption [9], [10], the average reflectance of all surfaces is taken as gray or achromatic. Hence, the average of the color components provides the colors of the incident illuminant. This approach has also been extended to the gradient space of images [11], where it is assumed that the average edge difference in a scene is achromatic. This hypothesis is termed the gray edge hypothesis. Some researchers [12] assume the existence of a white object in the scene; this assumption is referred to as the white world assumption. In this case, the maximum values of the individual color components provide the colors of the incident illuminant. However, the method is very sensitive to the dynamic ranges of the sensors; nevertheless, given a scene whose dynamic range of brightness distribution is in accordance with the linear response of the sensor, this assumption works well in many cases. A more recent trend in solving the color constancy problem is to use sta-
minant (or a set of illuminants) is chosen from a select set of canonical illuminants based
on certain criteria. There are color gamut mapping approaches both in the 3D [13] and the
2D [14] color spaces, where one tries to maximize the evidence of color maps with known
color maps of canonical illuminants. Reference [6] reports a color by correlation technique
which attempts to maximize a likelihood of an illuminant given the distribution of pixels in
the 2D chromatic space. The same work shows that many other algorithms, like the gamut
mapping method in the 2D chromatic space, could be implemented under the same framework. Note that all these techniques require a significant amount of storage space for storing
the statistics of each canonical illuminant. As one of the motivations in this chapter is to
reduce the storage requirement for efficient implementation in the block DCT space, a sim-
ple nearest neighbor (NN) classification approach for determining the canonical illuminant
has also been explored. Interestingly, it is found that the nearest neighbor classification in
the 2D chromatic space performs equally well as other existing techniques such as color by
correlation [6] or gamut mapping [14].
Once the SPDs of the illuminant in three spectral zones are estimated, the next step is to
convert the pixel values to those under a target illuminant (which may be fixed to a standard
illumination). This computation is performed by the diagonal color correction following
the Von Kries model [1]. Let Rs , Gs and Bs be the spectral components for the source
illuminant (for Red, Green and Blue zones). Let the corresponding spectral components
for the target illuminant be Rd , Gd and Bd . Then, given a pixel in RGB color space with
R, G, and B as its corresponding color components, the updated color components, Ru , Gu ,
and Bu , are expressed as follows:
kr = Rd / Rs ,   kg = Gd / Gs ,   kb = Bd / Bs ,

f = (R + G + B) / (kr R + kg G + kb B),   (4.7)

Ru = f kr R,   Gu = f kg G,   Bu = f kb B.
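A minimal sketch of this diagonal correction (hypothetical function name), assuming an H × W × 3 floating-point RGB image, is given below.

```python
import numpy as np

def von_kries_correction(rgb, source_rgb, target_rgb):
    """Diagonal (von Kries) color correction of Equation 4.7.  rgb is an
    H x W x 3 float array; source_rgb and target_rgb are the estimated and
    the desired illuminant triplets (Rs, Gs, Bs) and (Rd, Gd, Bd)."""
    k = np.asarray(target_rgb, float) / np.asarray(source_rgb, float)  # kr, kg, kb
    num = rgb.sum(axis=2)                        # R + G + B
    den = (rgb * k).sum(axis=2)                  # kr*R + kg*G + kb*B
    f = num / np.maximum(den, 1e-12)             # brightness-preserving factor
    return rgb * k * f[..., None]                # Ru, Gu, Bu
```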
The next section discusses the usage of these techniques in the block DCT domain.
4.4.1 Color Constancy in the YCbCr Color Space and Proposed Variations
The YCbCr color space is related to the RGB space as given below:
Y = 0.256R + 0.502G + 0.098B,
Cb = −0.148R − 0.290G + 0.438B + 128,   (4.8)
Cr = 0.438R − 0.366G − 0.071B + 128,
assuming eight bits for each color component.
For implementing the gray world algorithm in the YCbCr space, one can directly obtain
the mean values by computing the means of the DC coefficients in individual Y , Cb and Cr
components. However, finding the maximum of a color component is not a linear operation.
Hence, for the white world algorithm, one needs to convert all the DC coefficients in the
RGB space and then compute their maximum values. To this end, a simple heuristic can
be used; it is assumed here that the color of the maximum luminance value is the color of
the illuminant. This implies that only the maximum in the Y component is computed and
the corresponding Cb and Cr values at that point provide the color of the illuminant. This
significantly reduces the computation, as it does not require conversion of DC values from
the YCbCr space to RGB space. Further, the maximum finding operation is restricted to
one component only. This assumption is referred to as white world in YCbCr.
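Under these assumptions, both estimators reduce to a few operations on the DC planes of the block DCT representation; in the hypothetical sketch below, dc_y, dc_cb, and dc_cr are assumed to hold the DC coefficients of the Y, Cb, and Cr channels arranged as 2D arrays of blocks.

```python
import numpy as np

def estimate_illuminant_dct(dc_y, dc_cb, dc_cr, method='gray'):
    """Illuminant estimate from the DC coefficients of the Y, Cb, and Cr
    block-DCT planes.  'gray' averages the DC values (gray world in the
    DCT domain); 'white' returns the chrominance at the block with the
    largest luminance DC (the white world in YCbCr heuristic)."""
    if method == 'gray':
        return dc_y.mean(), dc_cb.mean(), dc_cr.mean()
    idx = np.unravel_index(np.argmax(dc_y), dc_y.shape)
    return dc_y[idx], dc_cb[idx], dc_cr[idx]
```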
With regard to the statistical techniques, the color by correlation technique [6] and the
gamut mapping approach in 2D chromatic space [14] were adapted here for use in the
YCbCr space. Naturally, CbCr was chosen as the chrominance space instead of rg space
as used in Reference [6], where r = R/(R + G + B) and g = G/(R + G + B). This space
was discretized into 32 × 32 cells to accumulate the distribution of pixels. A new statistical
approach based on the nearest neighbor classification was also explored; this approach
will be described in the following subsection. Note that there are other techniques, such
as neural network-based classification [15], probabilistic approaches [16], [17], [18], and
gamut mapping in the 3D color space [13], which are not considered in this study.
where µ = [µCb µCr] is the mean of the distribution and Σ is the covariance matrix defined as

Σ = ( σ²Cb     σCbCr
      σCbCr    σ²Cr ).
Following the Bayesian classification rule and assuming that all the illuminants are equally probable, a minimum distance classifier can be designed. Let the mean chromatic components of an image be Cm. Then, for an illuminant L with the mean µL and the covariance matrix ΣL, the distance function for the nearest neighbor classifier is nothing but the Mahalanobis distance function [19], defined as

dL(Cm) = (Cm − µL)ᵀ ΣL⁻¹ (Cm − µL).
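A minimal sketch of this nearest neighbor selection (hypothetical function name; the per-illuminant means and covariance matrices are assumed to be precomputed) is given below.

```python
import numpy as np

def nearest_illuminant(cm, illuminants):
    """Nearest neighbor selection of the canonical illuminant.  cm is the
    mean (Cb, Cr) vector of the image; illuminants maps an illuminant name
    to a pair (mu, Sigma) of its 2D mean and covariance.  The illuminant
    with the smallest Mahalanobis distance to cm is returned."""
    best, best_d = None, np.inf
    cm = np.asarray(cm, dtype=float)
    for name, (mu, sigma) in illuminants.items():
        diff = cm - np.asarray(mu, dtype=float)
        inv = np.linalg.inv(np.asarray(sigma, dtype=float))
        d = float(diff @ inv @ diff)             # Mahalanobis distance
        if d < best_d:
            best, best_d = name, d
    return best
```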
Once the illuminant is selected, applying the correction would require converting the data to the RGB color space in order to use the diagonal correction of Equation 4.7. Additionally, it would be necessary to transform the updated color values back to the YCbCr space. These color space transformations can be avoided by performing the diagonal correction directly in the YCbCr space, as outlined in the following theorem.
Theorem 4.1
Let kr , kg , and kb be the parameters for diagonal correction as defined in Equation 4.7.
Given a pixel with color values in the YCbCr color space, the updated color values Yu , Cbu ,
and Cru are expressed by the following equations:
Cb′ = Cb − 128,
Cr′ = Cr − 128,
f = (3.51Y + 1.63Cb′ + 0.78Cr′) / (1.17(kr + kg + kb)Y + (2.02kb − 0.39kg)Cb′ + (1.6kr − 0.82kg)Cr′),
Yu = f ((0.58kg + 0.12kb + 0.30kr)Y + 0.2(kb − kg)Cb′ + 0.41(kr − kg)Cr′),
Cbu = f ((0.52kb − 0.34kg − 0.18kr)Y + (0.11kg + 0.89kb)Cb′ + 0.24(kg − kr)Cr′) + 128,
Cru = f ((0.52kr − 0.43kg − 0.09kb)Y + 0.14(kg − kb)Cb′ + (0.3kg + 0.7kr)Cr′) + 128.
The proof of the above theorem is straightforward. However, one should note that the number of multiplications and additions in the above equations is not reduced compared to the diagonal correction method applied in the RGB color space.
of the illuminant in the RGB color space. These were used here for collecting the statistics
(means and covariance matrices of the SPD of the illuminants) related to the nearest neigh-
bor classification-based technique. Further, from the chromatic components of the images
captured under different illuminants, statistics related to the color by correlation and gamut mapping techniques are formed. It should be mentioned that all these techniques provide the estimate of an illuminant as its mean SPD in the RGB or YCbCr color space.

FIGURE 4.1
Images of the same scene (ball) captured under three different illuminants: (a) ph-ulm, (b) syl-50mr16q, and (c) syl-50mr16q+3202.
Experiments were performed using the images with objects having minimum specularity. Though the scenes were captured at different instances, it was observed that the images are more or less registered. The images were captured under three different fluorescent lights, four different incandescent lights, and also under each of these illuminants in conjunction with a blue filter (Roscolux 3202). In these experiments, Sylvania 50MR16Q is taken as the target illuminant, as it is quite similar to a regular incandescent lamp. The list of different illuminants is given in Table 4.1. Figure 4.1 shows a typical set of images for the same scene under some of these illuminants.
Different metrics were used to compare the performances of all techniques in the block
DCT domain as well as their performances with respect to different spatial domain algo-
rithms. Four metrics described below, reported earlier in References [7] and [8], were also
used for studying the performances of different algorithms in estimating the spectral com-
ponents of illuminants. Let the target illuminant T be expressed by the spectral component
triplet in the RGB colors-pace as (RT , GT , BT ) and let the corresponding estimated illumi-
nant be represented by E = (RE , GE , BE ). The respective illuminants in the (r, g) chromatic
TABLE 4.1
List of illuminants.
space can be expressed as (rT , gT ) = (RT /ST , GT /ST ) and (rE , gE ) = (RE /SE , GE /SE ),
where ST = RT + GT + BT and SE = RE + GE + BE . Then, different performance metrics
can be defined as follows:
∆θ = cos⁻¹( (T ◦ E) / (|T| |E|) ),
∆rg = |(rT − rE , gT − gE )|,
∆RGB = |T − E|,                                   (4.12)
∆L = |ST − SE |,
where ∆θ , ∆rg , ∆RGB , and ∆L denote the angular, rg, RGB, and luminance error, respec-
tively. In the above definitions, ◦ denotes the dot product between two vectors and |.|
denotes the magnitude of the vector.
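The four measures can be computed for a pair of illuminant triplets as in the sketch below (hypothetical function name; returning the angular error in degrees is an assumption about the units used in the tables).

```python
import numpy as np

def illuminant_errors(T, E):
    """Error measures of Equation 4.12 between the target illuminant
    T = (RT, GT, BT) and the estimate E = (RE, GE, BE)."""
    T, E = np.asarray(T, dtype=float), np.asarray(E, dtype=float)
    cos_angle = np.dot(T, E) / (np.linalg.norm(T) * np.linalg.norm(E))
    d_theta = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))  # angular error
    sT, sE = T.sum(), E.sum()
    rgT, rgE = T[:2] / sT, E[:2] / sE            # (r, g) chromaticities
    d_rg = np.linalg.norm(rgT - rgE)             # rg error
    d_RGB = np.linalg.norm(T - E)                # RGB error
    d_L = abs(sT - sE)                           # luminance error
    return d_theta, d_rg, d_RGB, d_L
```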
Next, the performances on image rendering were studied for different algorithms after
applying the color correction with the estimated illuminants. It was observed that the im-
ages in the dataset are roughly registered. In the experiment, the image captured at the
target illuminant (syl-50mr16q) is considered to be the reference image. The images ob-
tained by applying different color constancy algorithms are compared with respect to this
image. Two different measures were used for this purpose; the usual PSNR measure and
the so-called WBQM similarity measure proposed in Reference [22]. The latter measure
was used because the reference images are not strongly registered. For two distributions x
and y, the WBQM between these two distributions is defined as follows:
WBQM(x, y) = 4 σxy x̄ ȳ / ((σx² + σy²)(x̄² + ȳ²)),   (4.13)

where σx and σy are the standard deviations of x and y, respectively, x̄ and ȳ denote their respective means, and σxy is the covariance between x and y. It may be noted that
this measure takes into account the correlation between the two distributions and also their
proximity in terms of brightness and contrast. The WBQM values should lie in the interval
[−1, 1]. Processed images with WBQM values closer to one are more similar in quality
according to human visual perception. Applying WBQM independently to each component
in the YCbCr space provides Y-WBQM, Cb-WBQM, and Cr-WBQM, respectively.
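A minimal sketch of this measure, computed globally over a single channel (hypothetical function name; a sliding-window variant would also be possible), is given below; applying it separately to the Y, Cb, and Cr planes gives the three component scores.

```python
import numpy as np

def wbqm(x, y):
    """Similarity measure of Equation 4.13 between two single-channel
    images x and y, computed globally over the whole channel."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()                    # sigma_x^2, sigma_y^2
    cov = np.mean((x - mx) * (y - my))           # sigma_xy
    return 4.0 * cov * mx * my / ((vx + vy) * (mx ** 2 + my ** 2))
```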
Reference [23] suggests another no-reference metric, called here the JPEG quality metric (JPQM), for judging the quality of images reconstructed from the block DCT space, taking into account visible blocking and blurring artifacts. To measure the
quality of the images obtained by DCT domain algorithms, the source code available
at http://anchovy.ece.utexas.edu/∼ zwang/research/nr jpeg quality/index.html was used to
compute JPQM values. It should be noted that for an image with a good visual quality, the
JPQM value should be close to ten.
algorithms are referred to by their short names as given in the table. These are followed
by any one of the two color correction techniques, which is either the diagonal correction
(DGN) method or the chromatic shift (CS) method, as discussed previously.
TABLE 4.3
Average ∆θ for different techniques and various illuminants (IL).†
Method IL1 IL2 IL3 IL4 IL5 IL6 IL7 IL8 IL9 IL10 IL11
GRW-DCT 13.28 11.57 14.90 11.22 22.09 13.19 28.07 15.44 31.70 10.39 18.87
MXW-DCT 11.28 9.98 12.16 7.32 21.52 12.83 25.47 17.47 28.30 7.62 19.55
MXW-DCT-Y 15.26 12.94 17.80 8.30 22.52 13.55 27.55 16.34 29.33 6.04 19.18
COR-DCT 4.30 8.84 5.91 5.74 19.38 9.02 21.96 10.08 20.31 0.43 12.69
GMAP-DCT 4.65 9.99 6.57 6.88 16.40 8.77 16.71 11.42 16.17 0.43 10.09
NN-DCT 7.69 6.94 9.32 10.78 14.65 12.86 12.91 12.56 14.33 0.43 13.15
GRW 10.79 11.77 15.13 10.33 21.31 12.45 27.69 15.31 29.38 9.64 18.56
MXW 29.83 26.65 30.66 26.71 29.90 26.07 32.46 27.30 33.73 27.73 28.30
COR 4.64 10.03 7.28 7.04 21.32 10.06 23.39 9.97 22.05 0.43 15.89
GMAP 4.36 10.93 6.93 5.98 20.74 11.45 21.00 12.42 20.84 0.43 15.09
NN 8.90 13.17 7.50 7.77 11.71 9.60 12.22 13.35 12.87 0.43 10.01
† IL1 : ph-ulm, IL2 : syl-cwf, IL3 : syl-wwf, IL4 : syl-50mr16q, IL5 : syl-50mr16q+3202, IL6 : solux-3500, IL7 :
solux-3500+3202, IL8 : solux-4100, IL9 : solux-4100+3202, IL10 : solux-4700, IL11 : solux-4700+3202.
TABLE 4.4
Average ∆rg for different techniques and various illuminants (IL).†
Method IL1 IL2 IL3 IL4 IL5 IL6 IL7 IL8 IL9 IL10 IL11
GRW-DCT 0.093 0.081 0.119 0.087 0.161 0.100 0.205 0.114 0.239 0.075 0.136
MXW-DCT 0.081 0.082 0.089 0.058 0.197 0.112 0.244 0.160 0.278 0.056 0.188
MXW-DCT-Y 0.100 0.080 0.146 0.058 0.157 0.093 0.197 0.113 0.213 0.039 0.131
COR-DCT 0.036 0.060 0.056 0.044 0.129 0.062 0.148 0.071 0.137 0.003 0.086
GMAP-DCT 0.038 0.068 0.061 0.051 0.108 0.060 0.109 0.079 0.107 0.003 0.066
NN-DCT 0.058 0.054 0.085 0.082 0.102 0.093 0.090 0.089 0.099 0.003 0.092
GRW 0.078 0.081 0.135 0.078 0.146 0.087 0.191 0.107 0.206 0.077 0.127
MXW 0.199 0.186 0.211 0.176 0.271 0.195 0.317 0.217 0.339 0.177 0.240
COR 0.038 0.064 0.070 0.049 0.143 0.065 0.158 0.066 0.151 0.003 0.103
GMAP 0.037 0.073 0.064 0.045 0.139 0.078 0.142 0.086 0.142 0.003 0.100
NN 0.073 0.096 0.074 0.064 0.082 0.072 0.085 0.095 0.090 0.003 0.073
TABLE 4.5
Average ∆RGB for different techniques and various illuminants (IL).†
Method IL1 IL2 IL3 IL4 IL5 IL6 IL7 IL8 IL9 IL10 IL11
GRW-DCT 247.5 243.6 243.8 217.1 246.2 228.1 259.7 250.2 259.4 229.9 251.3
MXW-DCT 114.6 121.7 108.7 93.4 155.1 122.9 173.9 151.9 176.6 96.6 155.1
MXW-DCT-Y 139.7 145.0 143.0 105.2 179.8 140.4 205.2 171.3 209.8 109.0 181.0
COR-DCT 60.0 71.2 70.1 63.9 117.6 74.1 128.3 75.0 122.8 52.9 92.5
GMAP-DCT 60.6 71.0 75.2 65.1 102.8 75.8 103.2 83.4 102.1 52.9 83.4
NN-DCT 81.4 65.0 74.3 78.6 99.6 89.6 87.4 87.4 89.8 52.9 91.7
GRW 241.5 237.1 239.7 209.0 241.0 220.4 254.7 244.1 254.1 224.5 246.2
MXW 171.4 164.4 162.9 147.9 172.1 148.4 190.0 172.8 194.0 150.5 171.2
COR 60.7 76.5 71.6 67.2 122.2 78.9 135.1 73.4 131.7 52.9 98.9
GMAP 60.3 75.7 77.6 64.5 130.4 77.9 126.2 84.4 126.0 52.9 103.8
NN 79.9 88.1 72.6 71.5 84.8 78.3 86.3 90.9 85.2 52.9 81.8
TABLE 4.6
Average ∆L for different techniques and various illuminants (IL).†
Method IL1 IL2 IL3 IL4 IL5 IL6 IL7 IL8 IL9 IL10 IL11
GRW-DCT 426.8 420.4 420.8 374.0 420.4 391.8 443.5 429.5 443.7 397.1 432.8
MXW-DCT 179.6 191.8 169.3 147.0 224.6 187.9 259.4 231.0 259.8 156.9 244.4
MXW-DCT-Y 218.0 232.4 224.9 167.4 279.9 224.3 328.3 274.4 335.1 182.6 301.6
COR-DCT 95.0 87.0 105.9 95.6 135.0 95.4 146.5 95.1 146.3 91.3 119.1
GMAP-DCT 93.5 81.0 114.1 89.9 116.4 94.6 114.8 105.4 120.2 91.3 108.3
NN-DCT 125.4 89.2 94.0 95.1 121.5 107.6 105.9 107.8 96.5 91.3 113.1
GRW 416.7 409.2 413.9 359.9 410.8 378.3 434.1 418.7 433.7 388.0 423.8
MXW 189.5 194.8 159.3 129.3 200.3 159.7 237.4 204.3 255.7 127.4 223.0
COR 93.3 90.8 100.5 91.2 133.7 96.6 153.3 90.1 155.5 91.3 116.3
GMAP 94.8 87.9 117.2 93.3 158.5 84.4 150.0 102.0 150.3 91.3 126.7
NN 114.2 102.9 101.2 100.5 101.9 102.7 111.9 110.5 103.8 91.3 109.7
† IL1 : ph-ulm, IL2 : syl-cwf, IL3 : syl-wwf, IL4 : syl-50mr16q, IL5 : syl-50mr16q+3202, IL6 : solux-3500, IL7 :
solux-3500+3202, IL8 : solux-4100, IL9 : solux-4100+3202, IL10 : solux-4700, IL11 : solux-4700+3202.
TABLE 4.7
Overall average performances of different techniques on estimating illuminants. © 2009 IEEE
Comparing different error measures in estimating the illuminants reveals that statistical
techniques perform better than the others in both the compressed domain and spatial do-
main. It is also noted that recovering illuminants in conjunction with the blue filter is more
difficult than those without it. Moreover, with the blue filter, in most cases the proposed nearest neighbor classification-based techniques (NN and NN-DCT) perform better than the others. The proposed technique is found to be equally good with respect to the other statistical techniques such as COR and GMAP. Finally, as shown in Table 4.7, which indicates the overall performances of all considered techniques in estimating the illuminants, the GMAP technique is found to have the best performance in the block DCT domain, whereas the NN algorithm tops the list in all respects in the spatial domain.
FIGURE 4.5
Color corrected images for the illuminant syl-50mr16q+3202: (a) MXW-DCT-Y, (b) COR, and (c) COR-DCT.
TABLE 4.8
Performances of different techniques for color correction of the ball from the illuminant
solux-4100 to syl-50mr16q.
TABLE 4.9
Performances of different techniques for color correction of the books from the illuminant
syl-50mr16q+3202 to syl-50mr16q.
TABLE 4.10
Performances of different techniques for color correction of the Macbeth from the illuminant
ph-ulm to syl-50mr16q.
the target illuminant itself in conjunction with the blue filter. The reference images cap-
tured under syl-50mr16q are shown in Figure 4.2 for three different objects — ball, books,
and Macbeth. The corresponding source images are shown in Figure 4.3. The results are
first presented for the diagonal color correction method. This is followed by the results
obtained using the chromatic shift color correction method.
TABLE 4.11
Average PSNR for different techniques (under diagonal correction) and various illuminant (IL).†
Method IL1 IL2 IL3 IL4 IL5 IL6 IL7 IL8 IL9 IL10
GRW-DCT 23.99 23.46 24.20 26.75 22.76 27.26 21.06 24.35 20.28 22.71
MXW-DCT 24.57 23.60 24.82 27.39 22.36 26.72 20.65 24.35 20.71 22.63
MXW-DCT-Y 22.66 23.20 23.25 26.77 22.65 27.27 21.06 24.45 20.96 22.94
COR-DCT 24.93 23.81 25.24 27.46 22.50 26.45 20.93 24.04 20.73 22.62
GMAP-DCT 24.82 23.46 25.21 27.96 22.47 26.83 20.53 24.02 20.51 22.42
NN-DCT 24.49 23.37 25.06 26.44 22.17 27.38 20.14 24.31 20.17 22.94
GRW 24.24 23.65 24.50 27.42 22.91 28.22 21.21 24.55 20.99 22.82
MXW 24.82 23.69 25.36 28.04 22.62 27.28 21.03 24.29 20.93 23.08
COR 24.95 24.05 25.44 28.53 22.80 27.08 21.09 24.21 20.87 23.16
GMAP 25.02 23.58 25.16 29.22 22.85 27.55 20.80 24.19 20.74 23.09
NN 24.01 23.58 25.48 27.09 21.90 27.09 20.20 24.62 20.35 22.72
† IL1 : ph-ulm, IL2 : syl-cwf, IL3 : syl-wwf, IL4 : solux-3500, IL5 : solux-3500+3202, IL6 : solux-4100, IL7 :
solux-4100+3202, IL8 : solux-4700, IL9 : solux-4700+3202, IL10 : syl-50mr16q+3202
TABLE 4.12
Average Y-WBQM for different techniques (under diagonal correction) and various illuminants (IL).†
Method IL1 IL2 IL3 IL4 IL5 IL6 IL7 IL8 IL9 IL10
GRW-DCT 0.789 0.750 0.814 0.861 0.683 0.850 0.604 0.764 0.540 0.693
MXW-DCT 0.812 0.788 0.852 0.937 0.733 0.885 0.655 0.801 0.580 0.751
MXW-DCT-Y 0.725 0.718 0.720 0.916 0.647 0.841 0.548 0.734 0.464 0.657
COR-DCT 0.838 0.768 0.879 0.936 0.608 0.834 0.469 0.690 0.343 0.585
GMAP-DCT 0.836 0.767 0.876 0.929 0.574 0.844 0.395 0.699 0.296 0.549
NN-DCT 0.842 0.759 0.863 0.875 0.580 0.849 0.360 0.734 0.262 0.627
GRW 0.792 0.739 0.808 0.867 0.655 0.836 0.566 0.744 0.481 0.663
MXW 0.817 0.763 0.848 0.944 0.615 0.853 0.498 0.728 0.412 0.633
COR 0.836 0.795 0.874 0.950 0.653 0.859 0.502 0.685 0.383 0.648
GMAP 0.837 0.783 0.873 0.952 0.655 0.861 0.463 0.710 0.360 0.635
NN 0.820 0.767 0.867 0.907 0.544 0.841 0.358 0.740 0.276 0.569
TABLE 4.13
Average Cb-WBQM for different techniques (under diagonal correction) and various illuminants (IL).†
Method IL1 IL2 IL3 IL4 IL5 IL6 IL7 IL8 IL9 IL10
GRW-DCT 0.796 0.785 0.795 0.917 0.717 0.876 0.586 0.795 0.462 0.698
MXW-DCT 0.781 0.739 0.714 0.936 0.541 0.829 0.334 0.670 0.241 0.489
MXW-DCT-Y 0.713 0.750 0.718 0.925 0.720 0.862 0.610 0.784 0.534 0.708
COR-DCT 0.818 0.800 0.769 0.937 0.778 0.899 0.650 0.817 0.542 0.714
GMAP-DCT 0.815 0.792 0.790 0.933 0.775 0.902 0.634 0.823 0.528 0.727
NN-DCT 0.819 0.780 0.823 0.921 0.734 0.900 0.569 0.815 0.480 0.710
GRW 0.824 0.827 0.848 0.922 0.788 0.914 0.691 0.859 0.602 0.781
MXW 0.813 0.809 0.802 0.953 0.744 0.904 0.647 0.819 0.567 0.733
COR 0.819 0.817 0.816 0.939 0.785 0.923 0.657 0.837 0.550 0.770
GMAP 0.821 0.806 0.813 0.951 0.777 0.919 0.640 0.828 0.537 0.754
NN 0.820 0.812 0.841 0.940 0.726 0.917 0.574 0.842 0.478 0.720
TABLE 4.14
Average Cr-WBQM for different techniques (under diagonal correction) and various illuminants (IL).†
Method IL1 IL2 IL3 IL4 IL5 IL6 IL7 IL8 IL9 IL10
GRW-DCT 0.789 0.839 0.811 0.906 0.866 0.934 0.790 0.893 0.707 0.851
MXW-DCT 0.819 0.846 0.856 0.901 0.889 0.931 0.861 0.933 0.853 0.904
MXW-DCT-Y 0.733 0.826 0.778 0.875 0.853 0.911 0.798 0.905 0.781 0.858
COR-DCT 0.796 0.848 0.828 0.910 0.870 0.929 0.854 0.917 0.865 0.894
GMAP-DCT 0.795 0.835 0.817 0.919 0.886 0.932 0.888 0.913 0.889 0.911
NN-DCT 0.726 0.851 0.790 0.933 0.903 0.930 0.910 0.909 0.903 0.905
GRW 0.804 0.853 0.827 0.936 0.874 0.946 0.797 0.902 0.761 0.855
MXW 0.816 0.853 0.850 0.921 0.899 0.947 0.840 0.931 0.818 0.901
COR 0.802 0.832 0.826 0.933 0.856 0.936 0.830 0.939 0.834 0.877
GMAP 0.803 0.831 0.809 0.930 0.839 0.939 0.831 0.916 0.842 0.867
NN 0.738 0.839 0.824 0.935 0.921 0.949 0.918 0.915 0.915 0.925
† IL1 : ph-ulm, IL2 : syl-cwf, IL3 : syl-wwf, IL4 : solux-3500, IL5 : solux-3500+3202, IL6 : solux-4100, IL7 :
solux-4100+3202, IL8 : solux-4700, IL9 : solux-4700+3202, IL10 : syl-50mr16q+3202
TABLE 4.15
Average JPQM for different techniques in the compressed domain (under diagonal correction) and
various illuminants (IL).†
Method IL1 IL2 IL3 IL4 IL5 IL6 IL7 IL8 IL9 IL10
GRW-DCT 12.06 12.25 12.05 12.12 11.24 12.02 10.79 12.00 10.04 11.84
MXW-DCT 12.31 12.26 12.33 12.35 11.06 12.00 10.72 11.75 10.70 11.61
MXW-DCT-Y 11.65 12.16 11.69 12.25 11.41 12.06 11.15 11.93 11.16 11.93
COR-DCT 12.51 12.39 12.53 12.38 11.73 12.35 11.73 12.38 12.03 12.37
GMAP-DCT 12.50 12.28 12.51 12.30 12.00 12.37 12.17 12.28 12.38 12.62
NN-DCT 12.30 12.50 12.36 12.11 12.07 12.03 12.50 12.19 12.64 12.35
† IL1 : ph-ulm, IL2 : syl-cwf, IL3 : syl-wwf, IL4 : solux-3500, IL5 : solux-3500+3202, IL6 : solux-4100, IL7 :
solux-4100+3202, IL8 : solux-4700, IL9 : solux-4700+3202, IL10 : syl-50mr16q+3202
TABLE 4.16
Overall average performances of different techniques on rendering color corrected
images across different illuminants.
TABLE 4.17
Average PSNR for different techniques (under chromatic shift correction) and various illuminants (IL).†
Method IL1 IL2 IL3 IL4 IL5 IL6 IL7 IL8 IL9 IL10
GRW-DCT 23.05 22.47 23.21 24.10 21.19 25.11 19.78 22.86 19.89 21.74
MXW-DCT 19.25 20.68 19.24 23.07 17.73 21.59 16.61 19.37 16.06 18.04
MXW-DCT-Y 18.17 20.91 19.16 23.63 19.40 23.09 18.26 21.27 18.63 20.43
COR-DCT 23.25 20.48 22.18 22.94 18.69 22.26 17.56 21.09 17.88 19.47
GMAP-DCT 23.04 20.22 21.93 22.99 19.68 22.44 18.45 21.04 18.53 20.09
NN-DCT 22.13 20.64 20.80 21.30 19.19 20.82 18.60 20.22 18.09 19.09
GRW 23.55 22.85 23.70 25.11 21.32 25.84 19.82 23.18 19.88 21.91
MXW 20.80 22.15 21.23 24.72 19.65 23.62 18.19 21.27 18.37 20.19
COR 23.14 20.90 21.87 23.51 18.76 22.53 17.47 21.33 17.91 19.24
GMAP 23.27 20.91 21.34 24.27 19.06 22.06 18.17 20.72 18.14 19.30
NN 20.87 19.26 21.41 22.45 19.83 22.35 19.14 20.09 18.59 19.70
† IL1 : ph-ulm, IL2 : syl-cwf, IL3 : syl-wwf, IL4 : solux-3500, IL5 : solux-3500+3202, IL6 : solux-4100, IL7 :
solux-4100+3202, IL8 : solux-4700, IL9 : solux-4700+3202, IL10 : syl-50mr16q+3202
in many categories of illuminants, their performance measures have topped the list. This
is also observed for the MXW-DCT-Y technique which has achieved the highest average
PSNR among other techniques for the illuminants solux-4700 and solux-4700+3202. For
judging the recovery of colors, one may look into the values corresponding to Cb-WBQM
and Cr-WBQM. It is generally observed that with the blue filter, the reconstruction of the Cb
component is relatively poorer. This may be why lower performance measures are
obtained in such cases. In the compressed domain, the statistical techniques performed
better than the others in reconstructing the color components.
Table 4.16 summarizes the overall performances of all the techniques in transferring the
images to the target illuminant. One may observe that in the block DCT domain the overall
performances of the COR-DCT are usually better than the others. In the spatial domain,
the GMAP and the COR techniques have better performance indices in most cases. It is
interesting to note that although the illuminant estimation errors in the block DCT
domain are usually smaller than those in the spatial domain (see Table 4.7), the end results after
color correction provide higher PSNR values in the spatial domain. It is felt that applying the color
correction to all the pixels in the spatial domain, as compared to correcting only the DC
coefficients in the block DCT domain, makes the rendering more successful.
One may conjecture that for natural images the inherent combination of brightness, contrast, and colors
should be preserved in the process of enhancement. This leads to the development of a con-
trast and color preserving enhancement algorithm in the compressed domain [24], which
is elaborated below.
Theorem 4.2
Let κdc be the scale factor for the normalized DC coefficient and κac be the scale factor for
the normalized AC coefficients of an image Y of size N × N, such that the DCT coefficients
in the processed image Ỹ are given by:
$$
\tilde{Y}(i,j) =
\begin{cases}
\kappa_{dc}\, Y(i,j) & \text{for } i = j = 0,\\
\kappa_{ac}\, Y(i,j) & \text{otherwise.}
\end{cases}
\tag{4.15}
$$
The contrast of the processed image then becomes κac/κdc times the contrast of the original
image. □
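The scaling in Theorem 4.2 is easy to verify numerically. The sketch below is illustrative only and not taken from the chapter: it assumes SciPy's orthonormal DCT as a stand-in for the JPEG block transform and uses the ratio of standard deviation to mean as a simple proxy for block contrast.

```python
# Hedged sketch of Theorem 4.2: scaling the DC coefficient by k_dc and the
# AC coefficients by k_ac changes the block contrast by k_ac / k_dc.
import numpy as np
from scipy.fft import dctn, idctn

def scale_block(block, k_dc, k_ac):
    """Scale the DCT coefficients of one luminance block (Eq. 4.15)."""
    coeffs = dctn(block, norm='ortho')
    scaled = coeffs * k_ac               # AC terms scaled by k_ac ...
    scaled[0, 0] = coeffs[0, 0] * k_dc   # ... DC term scaled by k_dc
    return idctn(scaled, norm='ortho')

def contrast(block):
    """Simple contrast proxy: AC energy (std) relative to the mean (DC)."""
    return np.std(block) / np.mean(block)

rng = np.random.default_rng(0)
y = rng.uniform(80, 170, size=(8, 8))          # synthetic luminance block
y2 = scale_block(y, k_dc=1.2, k_ac=1.5)
print(contrast(y2) / contrast(y))              # ~ k_ac / k_dc = 1.25
```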
One may note here that in the block DCT space, to preserve the local contrast of the
image, the scale factor should be kept as κdc = κac = κ within a block. However, though the above
operations with the Y component of an image preserve the contrast, they do not preserve
the colors or color vectors of the pixels. Hence, additional operations with the chromatic
components, that is, Cr and Cb components of the image in the compressed domain, need
to be carried out. Theorem 4.3 states how colors could be preserved under the uniform
scaling operation. The proof of this theorem is given in Reference [24].
Theorem 4.3
Let U = {U(k, l)|0 ≤ k, l ≤ (N − 1)} and V = {V (k, l)|0 ≤ k, l ≤ (N − 1)} be the DCT
coefficients of the Cb and Cr components, respectively. If the luminance (Y ) component of
an image is uniformly scaled by a factor κ, the colors of the processed image with Ỹ, Ũ,
and Ṽ are preserved by the following operations:
$$
\tilde{U}(i,j) =
\begin{cases}
N\left(\kappa\left(\dfrac{U(i,j)}{N} - 128\right) + 128\right) & \text{for } i = j = 0,\\
\kappa\, U(i,j) & \text{otherwise,}
\end{cases}
\tag{4.16}
$$
$$
\tilde{V}(i,j) =
\begin{cases}
N\left(\kappa\left(\dfrac{V(i,j)}{N} - 128\right) + 128\right) & \text{for } i = j = 0,\\
\kappa\, V(i,j) & \text{otherwise.}
\end{cases}
\tag{4.17}
$$
□
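A minimal sketch of Equations 4.16 and 4.17, again assuming SciPy's orthonormal 8 × 8 DCT and chroma values centered at the neutral value 128 (both assumptions of this illustration): when the luminance block is scaled uniformly by κ, remapping the chroma DC term about 128 and scaling the chroma AC terms by the same κ keeps the chroma offsets proportional to the luminance, which is what preserves the color vectors.

```python
# Hedged sketch of the color-preserving scaling of Theorem 4.3.
import numpy as np
from scipy.fft import dctn, idctn

N = 8  # block size

def scale_luma(Y, kappa):
    """Uniform scaling of the luminance block (k_dc = k_ac = kappa)."""
    return idctn(kappa * dctn(Y, norm='ortho'), norm='ortho')

def scale_chroma(C, kappa):
    """Eq. (4.16)/(4.17): remap the DC term about 128, scale AC terms by kappa."""
    coeffs = dctn(C, norm='ortho')
    out = kappa * coeffs
    out[0, 0] = N * (kappa * (coeffs[0, 0] / N - 128) + 128)
    return idctn(out, norm='ortho')

rng = np.random.default_rng(1)
Y, Cb = rng.uniform(60, 180, (N, N)), rng.uniform(100, 150, (N, N))
kappa = 1.3
Y2, Cb2 = scale_luma(Y, kappa), scale_chroma(Cb, kappa)
print(np.allclose(Cb2 - 128, kappa * (Cb - 128)))   # chroma offset scales with luminance
```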
2. Preservation of local contrast. Once the scale factor is obtained from the mapping
of the DC coefficient using any of the functions mentioned above, the same fac-
tor is used for scaling the AC coefficients of the block (according to Theorem 4.2).
However, while performing this operation, there is a risk of crossing the maximum
displayable brightness value at a pixel in that block. To restrict this overflow, the
scale factor is clipped by a value as stated in Theorem 4.4. The proof of this theorem
is given in Reference [24].
3. Preservation of colors. Finally, the colors are preserved by performing the scaling of
the DCT coefficients of the Cb and Cr components as described in Theorem 4.3.
Theorem 4.4
If the values in a block are assumed to lie within µ ± λσ, the scaled values will not exceed
the maximum displayable brightness value Bmax if 1 ≤ κ ≤ Bmax/(µ + λσ). □
Due to the independent processing of blocks, blocking artifacts near edges or near sharp
changes of brightness and colors of pixels can occur. To suppress this effect, in such cases
the same computation is carried out in blocks of smaller size. These smaller blocks are
obtained by decomposing the given block using the block decomposition operation [3].
Similarly, the resulting scaled coefficients of smaller blocks are recomposed into the larger
one using the block composition operation [3]. It may be noted that both these operations
can be performed efficiently in the block DCT domain. Hence, the algorithm for color
enhancement remains totally confined within the block DCT domain. For detecting a block
requiring these additional operations, the standard deviation σ of that block was used here
as a measure. If σ is greater than a threshold (say, σth), an 8 × 8 block is decomposed
into four subblocks to perform the scaling operations. The typical value of σth in these
experiments was empirically chosen as 5.
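The two per-block safeguards described above fit in a few lines; this sketch is illustrative only (λ = 3 is an assumed value, and the block decomposition/composition operations of Reference [3] are not implemented here, only the decision of when to invoke them).

```python
# Hedged sketch of the scale-factor clipping (Theorem 4.4) and the
# sigma-based sub-block decision used to limit blocking artifacts.
import numpy as np

B_MAX = 255.0    # maximum displayable brightness
LAM = 3.0        # lambda in Theorem 4.4 (assumed value)
SIGMA_TH = 5.0   # empirically chosen threshold from the text

def clip_scale_factor(kappa, block):
    """Keep kappa within [1, B_max / (mu + lambda*sigma)] per Theorem 4.4."""
    mu, sigma = block.mean(), block.std()
    return float(np.clip(kappa, 1.0, max(1.0, B_MAX / (mu + LAM * sigma))))

def needs_subblock_processing(block):
    """Blocks with sigma above the threshold (edges, sharp transitions) are
    decomposed into four sub-blocks before scaling."""
    return block.std() > SIGMA_TH
```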
4.8.3 Results
This section presents the results of the enhancement algorithms and compares these results
with those of other compressed domain techniques reported in the literature. Table 4.19
lists the considered techniques, including the algorithms discussed in this section, referred to
under the category of color enhancement by scaling (CES). The details of these techniques
and the description of their parameters can be found in the literature cited in the
table. For the sake of completeness, parameter values are also presented in the table.
TABLE 4.19
List of techniques considered for comparative study.
FIGURE 4.8
Original images used in the implementation of the color enhancement algorithms: (a) Bridge, (b) Mountain,
and (c) Under-water.
To evaluate these techniques, two measures for judging the quality of reconstruction in
the compressed domain were used. One, based on JPEG-quality-metric (JPQM) [23], is
related to the visibility of blocking artifacts, whereas the other one, a no-reference metric
of Reference [25], is related to the enhancement of colors in images. The definition for this
latter metric in the RGB color space is given below.
Let the red, green and blue components of an image I be denoted by R, G, and B, respectively.
Let α = R − G and β = (R + G)/2 − B. Then the colorfulness of the image is defined as follows:
$$
CM(I) = \sqrt{\sigma_\alpha^2 + \sigma_\beta^2} + 0.3\,\sqrt{\mu_\alpha^2 + \mu_\beta^2},
\tag{4.21}
$$
where σα and σβ are the standard deviations of α and β, respectively, and µα and µβ are
their means. In this comparison, however, the ratio of CM values between the enhanced image
and its original was used to observe the color enhancement; it is referred to here as the
color enhancement factor (CEF).
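Equation 4.21 and the derived CEF ratio translate directly into code; the sketch below assumes a floating-point RGB image of shape (H, W, 3).

```python
# Colorfulness metric of Eq. (4.21) and the color enhancement factor (CEF).
import numpy as np

def colorfulness(img):
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    alpha = r - g                      # alpha = R - G
    beta = 0.5 * (r + g) - b           # beta  = (R + G)/2 - B
    return (np.hypot(alpha.std(), beta.std())
            + 0.3 * np.hypot(alpha.mean(), beta.mean()))

def cef(enhanced, original):
    """Ratio of colorfulness values between the enhanced and original image."""
    return colorfulness(enhanced) / colorfulness(original)
```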
Typical examples of the enhanced images are presented here for the set of images shown
in Figure 4.8. The average JPQM and CEF values obtained on this set of images are
presented in Table 4.20. From this table, one can observe that the color enhancement
performance indicated by the CEF measure is considerably improved by the color enhancement
by scaling (CES) algorithms. In particular, the TW-CES-BLK algorithm, which uses a very
simple mapping function (Equation 4.18), is found to provide excellent color rendition
in the enhanced images. However, the JPQM measure indicates that its performance in
suppressing blocking artifacts is marginally poorer than that of the other schemes.
TABLE 4.20
Average performance measures obtained by different color enhancement techniques.
FIGURE 4.10
Color enhancement of the image Bridge with increasing number of iterations: (a) 2, (b) 3, and (c) 4.
Figure 4.9 shows the enhancement results obtained by three typical techniques, namely
MCE, MCEDRC, and TW-CES-BLK. One may observe that in all these cases TW-CES-BLK
provides better visualization of these images than the other two techniques.
One may further enhance images by iteratively subjecting the resulting image to the next
stage of enhancement using the same algorithm. Figure 4.10 shows a typical example
TABLE 4.21
Performance measures after iterative ap-
plication of the TW-CES-BLK algorithm
on the image Bridge.
number of iterations
Measures 2 3 4 5
of iterative enhancement of the image Bridge. It was observed that for a few iterations
the CEF measure shows improvements with respect to the original image. However, the
process suffers from the risk of making blocking artifacts more and more visible. This is
also evidenced by an increased degradation of the JPQM values with an increase in the
number of iterations, as shown in Table 4.21.
computation of color constancy, as demonstrated in Figure 4.11c. In this case, the illumination
is transferred to syl-50mr16q (see Section 4.3) using the COR-DCT method [6] fol-
lowed by the TW-CES-BLK enhancement algorithm applied to the color corrected image.
4.10 Conclusion
This chapter discussed the restoration of colors under varying illuminants and illumination
in the block DCT domain. The basic computational task involved in this process is to
obtain an illuminant-independent representation by solving the problem of color constancy.
Once the spectral components of the (global) illuminant of a scene are computed, one is
required to render the image under a canonical illuminant. This computational task is
referred to as color correction. However, due to the wide variation of illumination in a scene and
the limited dynamic range available for displaying colors, one may be required to further modify
the colors of pixels through enhancement. It may be noted that in an ordinary
situation color enhancement does not necessarily follow the color correction stage.
This chapter reviewed several color constancy algorithms and discussed their extension
to the block DCT domain. It was observed that many algorithms are
quite suitable for computation considering only the DC coefficients of the blocks. The
chapter further discussed the enhancement algorithm in the block DCT domain. As the
computations are restricted to the compressed domain, the overhead of decompression and
recompression is avoided. This makes the algorithms fast and less memory intensive.
Acknowledgment
Tables 4.2, 4.7, and 4.18 are reprinted from Reference [2], with the permission of IEEE.
References
[1] V. Kries, Handbuch der Physiologie des Menschen, vol. 3. Braunschweig, Germany: Vieweg
und Sohn, 1905.
[2] J. Mukhopadhyay and S.K. Mitra, “Color constancy in the compressed domain,” in Proceed-
ings of the IEEE International Conference on Image Processing, Cairo, Egypt, November
2009.
[3] J. Jiang and G. Feng, “The spatial relationships of DCT coefficients between a block and its
sub-blocks,” IEEE Transactions on Signal Processing, vol. 50, no. 5, pp. 1160–1169, May
2002.
[4] S.H. Jung, S.K. Mitra, and D. Mukherjee, “Subband DCT: Definition, analysis and appli-
cations,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 3,
pp. 273–286, June 1996.
[5] J. Mukherjee and S.K. Mitra, Color Image Processing: Methods and Applications, ch. “Re-
sizing of color images in the compressed domain,” R. Lukac and K.N. Plataniotis (eds.), Boca
Raton, FL: CRC Press / Taylor & Francis, October 2006, pp. 129–156.
[6] G. Finlayson, S. Hordley, and P. Hubel, “Color by correlation: A simple, unifying frame-
work for color constancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 23, no. 11, pp. 1209–1221, November 2001.
[7] K. Barnard, V. Cardei, and B. Funt, “A comparison of computational color constancy algo-
rithms – Part I: Methodology and experiments with synthesized data,” IEEE Transactions on
Image Processing, vol. 11, no. 9, pp. 972–984, September 2002.
[8] K. Barnard, V. Cardei, and B. Funt, “A comparison of computational color constancy algo-
rithms – Part II: Experiments with image data,” IEEE Transactions on Image Processing,
vol. 11, no. 9, pp. 985–996, September 2002.
[9] G. Buchsbaum, “A spatial processor model for object colour perception,” Journal of Franklin
Inst., vol. 310, pp. 1–26, 1980.
[10] R. Gershon, A. Jepson, and J. Tsotsos, “From [r,g,b] to surface reflectance: Computing color
constant descriptors in images,” Perception, pp. 755–758, 1988.
[11] J. Van de Weijer, T. Gevers, and A. Gijsenij, “Edge-based color constancy,” IEEE Transactions
on Image Processing, vol. 16, no. 9, pp. 2207–2214, September 2007.
[12] E. Land, “The retinex theory of color vision,” Scientific American, vol. 3, pp. 108–129, 1977.
[13] D. Forsyth, “A novel algorithm for color constancy,” Int. Journal of Computer Vision, vol. 5,
no. 1, pp. 5–36, August 1990.
[14] G. Finlayson, “Color in perspective,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 18, no. 10, pp. 1034–1038, October 1996.
[15] B. Funt, V. Cardei, and K. Barnard, “Learning color constancy,” Proceedings of the 4th
IS&T/SID Color Imaging Conference, Scottsdale, AZ, USA, November 1996, pp. 58–60.
[16] G. Sapiro, “Color and illuminant voting,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 10, no. 11, pp. 1210–1215, November 1999.
[17] G. Sapiro, “Bilinear voting,” in Proceedings of the Sixth International Conference on Com-
puter Vision, Bombay, India, January 1998, pp. 178–183.
[18] D. Brainard and W. Freeman, “Bayesian color constancy,” Journal of the Optical Society of
America A, vol. 14, no. 7, pp. 1393–1411, July 1997.
[19] P. Mahalanobis, “On the generalised distance in statistics,” Proceedings of the National Insti-
tute of Science of India, vol. 12, pp. 49–55, 1936.
[20] M. Ebner, G. Tischler, and J. Albert, “Integrating color constancy into JPEG2000,” IEEE
Transactions on Image Processing, vol. 16, no. 11, pp. 2697–2706, November 2007.
[21] K. Barnard, L. Martin, B. Funt, and A. Coath, “A data set for color research,” Color Research
and Applications, vol. 27, no. 3, pp. 148–152, June 2000.
[22] Z. Wang and A. Bovik, “A universal image quality index,” IEEE Signal Processing Letters,
vol. 9, no. 3, pp. 81–84, March 2002.
[23] Z. Wang, H. Sheikh, and A. Bovik, “No-reference perceptual quality assessment of JPEG com-
pressed images,” in Proceedings of the IEEE International Conference on Image Processing,
vol. I, September 2002, pp. 477–480.
[24] J. Mukherjee and S.K. Mitra, “Enhancement of color images by scaling the DCT coefficients,”
IEEE Transactions on Image Processing, vol. 17, no. 10, pp. 1783–1794, October 2008.
[25] S. Susstrunk and S. Winkler, “Color image quality on the Internet,” Proceedings of SPIE,
vol. 5304, pp. 118–131, January 2004.
[26] S.K. Mitra and T.H. Yu, “Transform amplitude sharpening: A new method of image enhance-
ment,” Computer Vision Graphics and Image Processing, vol. 40, no. 2, pp. 205–218, Novem-
ber 1987.
[27] S. Lee, “An efficient content-based image enhancement in the compressed domain using
retinex theory,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 17,
no. 2, pp. 199–213, February 2007.
[28] T. De, “A simple programmable S-function for digital image processing,” in Proceedings of the
Fourth IEEE Region 10 International Conference, Bombay, India, November 1989, pp. 573–
576.
[29] S. Aghagolzadeh and O. Ersoy, “Transform image enhancement,” Optical Engineering,
vol. 31, no. 3, pp. 614–626, March 1992.
[30] J. Tang, E. Peli, and S. Acton, “Image enhancement using a contrast measure in the com-
pressed domain,” IEEE Signal Processing Letters, vol. 10, no. 10, pp. 289–292, October 2003.
5
Principal Component Analysis-Based Denoising of
Color Filter Array Images
5.1 Introduction
In the last decade, advances in hardware and software technology have allowed for mas-
sive replacement of conventional film cameras with their digital successors. This reflects
the fact that capturing and developing photos using chemical and mechanical processes
cannot provide users with the conveniences of digital cameras which record, store and ma-
nipulate photographs electronically using image sensors and built-in computers. The ability
to display an image immediately after it is recorded, to store thousands of images on a small
memory device and delete them from this device in order to allow its further re-use, to edit
captured visual data, and to record images and video with sound makes digital cameras
very attractive consumer electronic products.
To create an image of a scene, digital cameras use a series of lenses that focus light onto
a sensor which samples the light and records electronic information which is subsequently
converted into digital data. The sensor is an array of light-sensitive cells which record the
FIGURE 5.1
The concept of acquiring the visual information using the image sensor with a color filter array.
total intensity of the light that strikes their surfaces. Various image analysis and processing
steps are then typically needed to transform digital sensor image data into a full-color fully
processed image, commonly referred to as a digital photograph.
This chapter focuses on denoising of image data captured using a digital camera equipped
with a color filter array and a monochrome image sensor (Figure 5.1). The chapter presents
a principal component analysis-driven approach which takes advantage of local similari-
ties that exist among blocks of image data in order to improve the estimation accuracy of
the principal component analysis transformation matrix. This adaptive calculation of a co-
variance matrix and the utilization of both spatial and spectral correlation characteristics
of a CFA image allow effective signal energy clustering and efficient noise removal with
simultaneous preservation of local image structures such as edges and fine details.
The chapter begins with Section 5.2, which briefly discusses digital color camera imag-
ing fundamentals and relevant denoising frameworks. Section 5.3 presents principal com-
ponent analysis basics and notations used throughout this chapter, and outlines the concept
of a spatially adaptive denoising method using principal component analysis of color filter
array mosaic data. Section 5.4 is devoted to the design of the denoising method. Included
examples and experimental results indicate that principal component analysis-driven de-
noising of color filter array mosaic images constitutes an attractive tool for a digital camera
image processing pipeline, since it yields good performance and produces images of rea-
sonable visual quality. The chapter concludes with Section 5.5, which summarizes the main
camera image denoising ideas.
FIGURE 5.2
Color filter array imaging: (a) acquired image and (b) cropped region showing the mosaic layout.
corresponding digital representation of the sensor values. Since common image sensors,
such as charge-coupled devices (CCD) [1], [2] and complementary metal oxide semicon-
ductor (CMOS) sensors [3], [4], are monochromatic devices, digital camera manufacturers
place a color filter on top of each sensor cell to capture color information. Figure 5.1 shows
a typical solution, termed as a color filter array (CFA), which is a mosaic of color filters
with different spectral responses. Both the choice of a color system and the arrangement of
color filters in the array have significant impacts on the design, implementation and perfor-
mance characteristics of a digital camera. Detailed discussion on this topic can be found in
References [5], [6], [7], and [8].
The acquired CFA sensor readings constitute a single plane of data. Figure 5.2 shows an
example. The image has a mosaic-like structure dictated by the CFA. It can either be stored
as a so-called raw camera file [9] together with accompanying metadata containing infor-
mation about the camera settings to allow its processing on a personal computer (PC) or
directly undergo in-camera image processing realized by an application-specific integrated
circuit (ASIC) and a microprocessor. In the former case, which is typical for a digital sin-
gle lens reflex (SLR) camera, the Tagged Image File Format for Electronic Photography
(TIFF-EP) [10] is used to compress image data in a lossless manner. This application sce-
nario allows for developing high-quality digital photographs on a PC using sophisticated
solutions, under different settings, and reprocessing the image until certain quality criteria
are met. In the latter case, the captured image is completely processed in a camera under
real-time constraints to produce the final image which is typically stored using lossy Joint
Photographic Experts Group (JPEG) compression [11] in the Exchangeable Image File
(EXIF) format [12] together with the metadata. Image compression methods suitable for
these tasks can be found in References [13], [14], [15], and [16].
In either case, extensive processing is needed to faithfully restore full-color information
required by common output media such as displays, image storage systems, and printers.
Various image processing and analysis techniques usually operate based on the assumption
of noise-free CFA data. Unfortunately, this assumption does not hold well in practice.
Noise is an inherent property of image sensors and cannot be eliminated in the digital
camera design.
FIGURE 5.3
Example digital camera images corrupted by sensor noise: (a) CFA image, and (b) demosaicked image.
[24], [25] to restore full-color information from CFA mosaic data, white balancing [26],
[27] to compensate for the scene illuminant, color correction [19] to achieve visually pleas-
ing scene reproduction, and tone / scale rendering to transform the color data from an
unrendered to a rendered space and make the tonality of a captured image match the non-
linear characteristics of the human visual system [18]. Figure 5.4 illustrates the effect of
these steps on the image as it progresses through the pipeline. The visual quality of the captured
image is also highly dependent on denoising [28] to suppress noise and various outliers, im-
age sharpening [29] to enhance structural content such as edges and color transitions, and
exposure correction [30] to compensate for inaccurate exposure settings.
In addition to the above processing operations, a resizing step [31] can be employed to
produce images of dimensions different from those of the sensor. Red-eye removal [32] is
used to detect and correct defects caused by the reflection of the blood vessels in the retina
due to flash exposure. Face detection [33], [34] can help to improve auto-focus, optimize
exposure and flash settings, and allow more accurate color manipulation to produce pho-
tographs with enhanced color and tonal quality. Image stabilization [35] compensates for
undesired camera movements, whereas image deblurring [36] removes the blurring effect
caused by the camera optical system, lack of focus, or camera motion during exposure.
Advanced imaging systems can enhance resolution [37], [38] and dynamic-range [39].
FIGURE 5.5
Pipelining the demosaicking and denoising steps: (a) demosaicked image denoising, (b) joint demosaicking
and denoising, and (c) color filter array image denoising.
The overall performance of the imaging pipeline can vary significantly depending on
the choice and order of the processing steps. The way an image processing pipeline is
constructed usually differs among camera manufacturers due to different design charac-
teristics, implementation constraints, and preferences regarding the visual appearance of
digital photographs. Note that there is no ideal way of cascading individual processing
steps; therefore, the problem of designing the pipeline is often simplified by analyzing just
a very few steps at a time. Detailed discussion on pipelining the image processing and
analysis steps can be found in References [5], [18], and [19].
• The framework shown in Figure 5.5a performs denoising after demosaicking. Algo-
rithms directly adopted from grayscale imaging, such as various median [40], [41],
averaging [42], [43], multiresolution [44], and wavelet [45], [46], [47] filters, process
each color channel of the demosaicked image separately, whereas modern filters for
digital color imaging process color pixels as vectors to preserve the essential spectral
correlation and avoid new color artifacts in the output image [41], [48], [49]. Un-
fortunately, the CFA sensor readings corresponding to different color channels have
different noise statistics and the demosaicking process blends the noise contributions
across channels, thus producing compound noise that is difficult to characterize and
remove by traditional filtering approaches.
• Figure 5.5b shows the framework which produces the output image by performing
demosaicking and image denoising simultaneously. Approaches designed within
this framework include additive white noise assumption-driven demosaicking us-
ing minimum mean square error estimation [50], bilateral filter-based demosaick-
ing [51], and joint demosaicking and denoising using total least square estima-
tion [52], wavelets [53], [54], color difference signals [55], and local polynomial
approximation-based nonlinear spatially adaptive filtering [56]. At the expense of
increased complexity of the design, performing the two estimation processes jointly
avoids the problem associated with the other two processing frameworks which tend
to amplify artifacts created in the first processing step.
• Finally, the framework depicted in Figure 5.5c addresses denoising before demo-
saicking. This approach aims at restoring the desired signal for subsequent color
interpolation, thus enhancing the performance of the demosaicking process which
can fail in edge regions in the presence of noise. Existing denoising methods for
grayscale images cannot be directly used on the CFA image due to its underlying
mosaic structure. However, these methods are applicable to subimages [13], [15]
extracted from the CFA image; the denoised CFA image is obtained by combining
the denoised subimages. This approach often results in various color shifts and arti-
facts due to the omission of the essential spectral characteristics during processing.
Therefore, recent methods for denoising the CFA mosaic data exploit both spatial
and spectral correlations to produce color artifact-free estimates without the need to
extract subimages [57].
This chapter focuses on denoising the CFA image, that is, the framework depicted in
Figure 5.5c, as this is the most natural way of handling the denoising problem in the digital
cameras under consideration. The framework can effectively suppress noise while pre-
serving color edges and details. Since it performs denoising before color restoration and
manipulation steps, it gives a camera image processing pipeline that uses this strategy the
advantage of fewer noise-caused color artifacts. Moreover, since CFA images contain three
times less data than demosaicked images, this framework has an obvious potential
to achieve high processing rates.
be the sample matrix of x, with $x_i^j$ denoting the discrete samples of variable $x_i$ and
$X_i = [x_i^1\ x_i^2\ \cdots\ x_i^n]$ denoting the sample vector of $x_i$, for $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$. The
centralized version $\bar{X}$ of the sample matrix X can be written as
$$
\bar{X} =
\begin{bmatrix} \bar{X}_1 \\ \bar{X}_2 \\ \vdots \\ \bar{X}_m \end{bmatrix}
=
\begin{bmatrix}
\bar{x}_1^1 & \bar{x}_1^2 & \cdots & \bar{x}_1^n \\
\bar{x}_2^1 & \bar{x}_2^2 & \cdots & \bar{x}_2^n \\
\vdots & \vdots & \ddots & \vdots \\
\bar{x}_m^1 & \bar{x}_m^2 & \cdots & \bar{x}_m^n
\end{bmatrix},
$$
where $\bar{x}_i^j = x_i^j - \mu_i$ is obtained using the mean value of $x_i$ estimated as follows:
$$
\mu_i = E[x_i] \approx \frac{1}{n}\sum_{j=1}^{n} X_i(j).
$$
A set of all such mean values gives $\mu = E[x] = [\mu_1\ \mu_2\ \cdots\ \mu_m]^T$, which is the mean value
vector of x. This mean vector is used to express the centralized vector as x̄ = x − µ, with the
elements of x̄ defined as $\bar{x}_i = x_i - \mu_i$ and the corresponding sample vectors as
$\bar{X}_i = X_i - \mu_i = [\bar{x}_i^1\ \bar{x}_i^2\ \cdots\ \bar{x}_i^n]$, where $\bar{x}_i^j = x_i^j - \mu_i$. Accordingly, the covariance matrix of x̄ is calculated as
$\Omega = E\left[\bar{x}\bar{x}^T\right] \approx \frac{1}{n}\bar{X}\bar{X}^T$.
The goal of principal component analysis is to find an orthonormal transformation matrix
P to decorrelate x̄. This transformation can be written as ȳ = Px̄, with the covariance matrix
of ȳ being diagonal. Since Ω is symmetrical, its singular value decomposition (SVD) can
be expressed as follows:
$$
\Omega = \Phi \Lambda \Phi^T,
$$
where $\Phi = [\phi_1\ \phi_2\ \cdots\ \phi_m]$ denotes the m × m orthonormal eigenvector matrix and
$\Lambda = \mathrm{diag}\{\lambda_1\ \lambda_2\ \cdots\ \lambda_m\}$ is the diagonal eigenvalue matrix with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$. By setting
$P = \Phi^T$, the vector x̄ can be decorrelated, resulting in $\bar{Y} = P\bar{X}$ and $\Lambda = E\left[\bar{y}\bar{y}^T\right] \approx \frac{1}{n}\bar{Y}\bar{Y}^T$.
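The construction above can be reproduced in a few lines of NumPy on synthetic data; this is a generic illustration of the notation rather than the chapter's implementation.

```python
# PCA decorrelation: centralize the m x n sample matrix, estimate the
# covariance, and transform with the eigenvector matrix.
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 500
cov = [[4, 2, 1, 0], [2, 3, 1, 0], [1, 1, 2, 0], [0, 0, 0, 1]]
X = rng.multivariate_normal(np.zeros(m), cov, size=n).T   # m x n sample matrix

mu = X.mean(axis=1, keepdims=True)
Xc = X - mu                                   # centralized sample matrix X-bar
Omega = Xc @ Xc.T / n                         # covariance estimate
eigvals, Phi = np.linalg.eigh(Omega)          # Omega = Phi Lambda Phi^T
order = np.argsort(eigvals)[::-1]             # sort lambda_1 >= ... >= lambda_m
Phi, eigvals = Phi[:, order], eigvals[order]
P = Phi.T                                     # PCA transformation matrix
Yc = P @ Xc                                   # decorrelated dataset Y-bar
print(np.allclose(Yc @ Yc.T / n, np.diag(eigvals)))   # diagonal covariance
```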
FIGURE 5.6
Digital camera noise modeling: (a) original, noise-free CFA image and (b) its demosaicked version; (c) noised
CFA image and (d) its demosaicked version.
Principal component analysis not only decorrelates the data, but it is also an optimal
way to represent the original signal using a subset of principal components. This property,
known as optimal dimensionality reduction [59], refers to the situations when the k most
important eigenvectors are used to form the transformation matrix PT = [φ1 φ2 · · · φk ], for
k < m. In this case, the transformed dataset Ȳ = PX̄ will be of dimensions k × n, as opposed
to the original dataset X̄ of dimensions m × n, while preserving most of the energy of X̄.
tions. As shown in Figure 5.6, noise observed in digital camera images can be approximated
using specialized models. Many popular noise models are based on certain assumptions
which simplify the problem at hand and allow for a faster design.
Image sensor noise is signal-dependent [52], [60], [61], as the noise variance depends
on the signal magnitude. Reference [60] argues that various techniques, such as Poisson,
film-grain, multiplicative, and speckle models can be used to approximate such noise char-
acteristics. Reference [52] proposes simulating the noise effect using Gaussian white noise
and sensor dependent parameters. A widely used approximation of image sensor noise is
the signal-independent additive noise model, as it is simple to use in the design and anal-
ysis of denoising algorithms and allows modeling signal-dependent noise characteristics
by estimating the noise variance adaptively in each local area [50]. By considering the
different types of color filters in the image acquisition process, it is reasonable to use a
channel-dependent version of the signal-independent additive noise model [55]. This ap-
proach allows varying noise statistics in different channels to simulate the sensor’s response
in different wavelengths while keeping the sensor noise contributions independent of signal
within each channel to simplify the design and analysis of the denoising algorithm [57].
The channel-dependent additive noise model can be defined as follows [56], [57]:
r̃ = r + υr , g̃ = g + υg , b̃ = b + υb , (5.1)
where υr , υg and υb are mutually uncorrelated noise signals in the red, green and blue loca-
tions of the CFA image. Following the additive nature of this model, the noise contributions
are added to the desired sample values r, g and b to obtain the noisy (acquired) signals r̃, g̃,
and b̃, respectively. Note that the standard deviations σr , σg , and σb corresponding to υr ,
υg and υb , may have different values. Figure 5.6 shows that this noise model can produce
similar effects as can be seen in real-life camera images in Figure 5.3.
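The model of Equation 5.1 is straightforward to simulate; the sketch below assumes a GRBG Bayer layout (as in Figure 5.7) and an 8-bit intensity range, and the final clipping is a practical choice rather than part of the model.

```python
# Channel-dependent additive noise (Eq. 5.1) applied to a Bayer CFA image.
import numpy as np

def add_cfa_noise(cfa, sigma_r, sigma_g, sigma_b, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    noisy = cfa.astype(float).copy()
    # Bayer masks for a GRBG pattern:  G R
    #                                  B G
    g_mask = np.zeros(cfa.shape, dtype=bool)
    g_mask[0::2, 0::2] = True
    g_mask[1::2, 1::2] = True
    r_mask = np.zeros(cfa.shape, dtype=bool); r_mask[0::2, 1::2] = True
    b_mask = np.zeros(cfa.shape, dtype=bool); b_mask[1::2, 0::2] = True
    noise = rng.normal(size=cfa.shape)
    noisy[r_mask] += sigma_r * noise[r_mask]   # r~ = r + v_r
    noisy[g_mask] += sigma_g * noise[g_mask]   # g~ = g + v_g
    noisy[b_mask] += sigma_b * noise[b_mask]   # b~ = b + v_b
    return np.clip(noisy, 0, 255)              # clipping: practical, not in Eq. (5.1)
```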
G R G R G R
B G B G B G
G R g1 r2 G R
B G b3 g4 B G
G R G R G R
B G B G B G
FIGURE 5.7
Illustration of the 6 × 6 variable block and 2 × 2 training block (g1 , r2 , b3 , and g4 ) in the spatially adaptive
PCA-based CFA image denoising method.
[g1 r2 b3 g4 ]T . Practical implementations, however, can use larger size variable blocks. The
whole dataset of x can be written as $X = [G_1^T\ R_2^T\ B_3^T\ G_4^T]^T$, where G1, R2, B3, and G4 denote
the row vectors containing all the samples associated with g1 , r2 , b3 , and g4 , respectively.
The mean values µg1 , µr2 , µb3 , and µg4 of variables g1 , r2 , b3 , and g4 can be estimated
as the average of all the samples in G1 , R2 , B3 , and G4 , respectively. These mean values
constitute the mean vector µ = [µg1 µr2 µb3 µg4 ]T of the variable vector x. Using the
mean vector µ, the centralized version of x and X can be expressed as x̄ = x − µ and
$\bar{X} = [G_1^T - \mu_{g_1}\ \ R_2^T - \mu_{r_2}\ \ B_3^T - \mu_{b_3}\ \ G_4^T - \mu_{g_4}]^T$, respectively.
Using the additive noise model, the noisy observation of x can be expressed as x̃ = x + v,
where v = [υg1 υr2 υb3 υg4 ]T is the noise variable vector. Assuming additive noise with
zero mean, the mean vectors of x̃ and x are identical, that is, E [x̃] = E [x] = µ. Since x is
unavailable in practice, µ is calculated from the samples of x̃, resulting in $\bar{\tilde{x}} = \tilde{x} - \mu = \bar{x} + v$
as the centralized vector of x̃.
The whole dataset of additive channel-dependent noise v can be written as
$V = [V_{g_1}^T\ V_{r_2}^T\ V_{b_3}^T\ V_{g_4}^T]^T$, where $V_{r_2}$ comes from the red channel noise υr, $V_{g_1}$ and $V_{g_4}$ come from
the green channel noise υg, and $V_{b_3}$ comes from the blue channel noise υb. The available
measurements of the noise-free dataset X can thus be expressed as X̃ = X + V. Subtracting
the mean vector µ from X̃ provides the centralized dataset $\bar{\tilde{X}} = \bar{X} + V$ of the vector $\bar{\tilde{x}}$.
The problem can now be seen as estimating X̄ from the noisy measurement $\bar{\tilde{X}}$; the use
of PCA to complete this task is discussed in the next section. Assuming that $\hat{\bar{X}}$, which is
the estimated dataset of X̄, is available, the samples in the training block are denoised.
Since pixels located far away from the location under consideration usually have very little
or even no influence on the denoising estimate, the central part of the training block can
be used as the denoising block [57]. The CFA image is denoised by moving the denoising
block across the pixel array to affect all the pixels in the image.
the optimal dimensionality reduction property of principal component analysis can be used
in noise removal. Namely, keeping the most important subset of the transformed dataset to
conduct the inverse PCA transform can significantly reduce noise while still being able to
restore the desired signal.
$$
\Omega_{\bar{\tilde{y}}} = \Omega_{\bar{y}} + \Omega_{v_y} \approx \frac{1}{n}\,\bar{\tilde{Y}}\bar{\tilde{Y}}^T,
\tag{5.6}
$$
where $\Omega_{\bar{y}} = \Lambda_{\bar{x}} \approx \frac{1}{n}\bar{Y}\bar{Y}^T$ and $\Omega_{v_y} = P_{\bar{x}}\,\Omega_v\,P_{\bar{x}}^T \approx \frac{1}{n}V_Y V_Y^T$ are the covariance matrices of $\bar{Y}$
and $V_Y$, respectively.
Given the fact that most of the energy of $\bar{Y}$ concentrates in the first several rows of $\bar{\tilde{Y}}$
whereas the energy of $V_Y$ is distributed in $\bar{\tilde{Y}}$ much more evenly, setting the last several
rows of $\bar{\tilde{Y}}$ to zero preserves the signal $\bar{Y}$ while removing the noise $V_Y$. The unaltered rows
of $\bar{\tilde{Y}}$ constitute the so-called dimension reduced dataset $\bar{\tilde{Y}}'$. It holds that $\bar{\tilde{Y}}' = \bar{Y}' + V_Y'$,
where $\bar{Y}'$ and $V_Y'$ represent the dimension reduced datasets of $\bar{Y}$ and $V_Y$, respectively. The
corresponding covariance matrices relate as $\Omega_{\bar{\tilde{y}}'} = \Omega_{\bar{y}'} + \Omega_{v_y'}$.
Further denoising of $\bar{\tilde{Y}}'$ can be achieved via linear minimum mean square error estimation
(LMMSE) applied to individual rows, as follows [57]:
$$
\hat{\bar{Y}}'_i = c_i \cdot \bar{\tilde{Y}}'_i, \qquad
c_i = \Omega_{\bar{y}'}(i,i)\,/\,\bigl(\Omega_{\bar{y}'}(i,i) + \Omega_{v_y'}(i,i)\bigr),
\tag{5.7}
$$
where i denotes the row index. Repeating the estimation procedure for each nonzero row
of $\bar{Y}'$ yields the denoised dataset $\hat{\bar{Y}}'$. The denoised version of the original dataset $\bar{\tilde{X}}$, which
represents the estimate of an unknown noiseless dataset X̄, can be obtained as $\hat{\bar{X}} = P_{\bar{x}}^{-1}\hat{\bar{Y}}'$
by performing the transform from the PCA domain to the time domain. The denoised CFA
block is produced by reformatting $\hat{\bar{X}}$.
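Putting the pieces together, the following sketch mirrors the sequence just described (decorrelation, dimension reduction, and the row-wise LMMSE shrinkage of Equation 5.7) for a generic m × n training dataset. The known diagonal noise covariance and the estimation of the signal covariance by subtraction are simplifying assumptions of this illustration, not the chapter's exact procedure.

```python
# Hedged sketch of PCA-domain denoising with dimension reduction and LMMSE.
import numpy as np

def pca_lmmse_denoise(X_noisy, Omega_v, keep):
    """Denoise an m x n dataset (one variable per row, n training samples).

    Omega_v : m x m noise covariance (diagonal for channel-dependent noise)
    keep    : number of most energetic PCA rows retained
    """
    m, n = X_noisy.shape
    mu = X_noisy.mean(axis=1, keepdims=True)
    Xc = X_noisy - mu                                 # centralized noisy dataset
    Omega_noisy = Xc @ Xc.T / n
    Omega_x = Omega_noisy - Omega_v                   # rough signal covariance estimate
    eigvals, Phi = np.linalg.eigh(Omega_x)
    P = Phi[:, np.argsort(eigvals)[::-1]].T           # PCA transformation matrix
    Yc = P @ Xc                                       # noisy data in the PCA domain
    Omega_y = Yc @ Yc.T / n                           # covariance of transformed data
    Omega_vy = P @ Omega_v @ P.T                      # transformed noise covariance
    Y_hat = np.zeros_like(Yc)
    for i in range(min(keep, m)):                     # rows beyond `keep` stay zero
        s = max(Omega_y[i, i] - Omega_vy[i, i], 0.0)  # signal energy in row i
        Y_hat[i] = s / (s + Omega_vy[i, i]) * Yc[i]   # LMMSE shrinkage, Eq. (5.7)
    return P.T @ Y_hat + mu                           # P orthonormal: P^-1 = P^T
```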
Choosing a suitable value of the scaling parameter s allows for $I_\upsilon^l$ with blurred structural
content and almost completely removed noise. This implies that a complementary image
$I_\upsilon^h$ contains almost complete high-frequency content, including the essential edge information
to be preserved and undesired structures which are attributed to noise. Since noise is
dominant in the flat areas of $I_\upsilon^h$, it can be effectively suppressed by LMMSE filtering in the
PCA domain using the method described in the previous section. The denoising procedure
outputs the image $\hat{I}_\upsilon^h$ which can be used to produce the denoised CFA image $\hat{I} = I_\upsilon^l + \hat{I}_\upsilon^h$.
where m denotes the vector length and $\sigma_a = \frac{1}{2}\left(\sigma_r^2 + 2\sigma_g^2 + \sigma_b^2\right)^{1/2}$. Vectors $\vec{x}_k$ and $\vec{x}_0$ are
the noiseless counterparts of $\vec{\tilde{x}}_k$ and $\vec{\tilde{x}}_0$, respectively. Obviously, the smaller the distance $d_k$
is, the more similar $\vec{x}_k$ is to $\vec{x}_0$.
The training sample selection criterion can be defined as follows [57]:
$$
d_k \leq T^2 + \sigma_a^2,
\tag{5.11}
$$
where T is a predetermined parameter. If the above condition is met, then $\vec{\tilde{x}}_k$ is selected as
one training sample of x̃. Note that a high number of sample vectors may be required in
practice to guarantee a reasonable estimation of the covariance matrix of x̃. Assuming that
X̃b denotes the dataset composed of the sample vectors that give the smallest distance to $\vec{\tilde{x}}_0$,
the algorithms described in Section 5.4.1 should be applied to X̃b , instead of the original
dataset X̃.
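The selection rule of Equation 5.11 can be sketched as follows; since the distance dk is defined in an equation not reproduced above, the mean squared difference between the noisy block vectors is used here as an assumed stand-in.

```python
# Hedged sketch of training sample selection for the reference block vector x0.
import numpy as np

def select_training_samples(x0, candidates, sigma_r, sigma_g, sigma_b, T=10.0):
    """Keep candidate block vectors whose distance to x0 satisfies Eq. (5.11)."""
    sigma_a2 = (sigma_r**2 + 2 * sigma_g**2 + sigma_b**2) / 4.0   # sigma_a squared
    selected = []
    for xk in candidates:                    # each candidate is a flattened block
        dk = np.mean((xk - x0) ** 2)         # assumed distance between noisy vectors
        if dk <= T**2 + sigma_a2:
            selected.append(xk)
    return np.array(selected)
```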
FIGURE 5.8
PCA-driven denoising of an artificially noised image shown in Figure 5.6c: (a) denoised CFA image, (b) its
demosaicked version, and (c) original, noise-free image.
settings can produce good results in most situations, while the size of the training block should
be at least 16 times larger, that is, 24 × 24 or 30 × 30 for a 6 × 6 variable block.
Figure 5.8a shows the result when the proposed method is applied to the CFA image
with simulated noise. Comparing this image with its noisy version shown in Figure 5.6c
reveals that noise is effectively suppressed in both smooth and edge regions while there is
no obvious loss of the structural contents. Visual inspection of the corresponding images
demosaicked using the method of Reference [64] confirms what was expected. Namely,
as shown in Figure 5.6d, demosaicking the noisy CFA image with no denoising produces
poor results; in some situations the noise level actually increases due to blending the noise
contributions across channels. This is not the case for the demosaicked denoised image in
Figure 5.8b which is qualitatively similar to the original test image shown in Figure 5.8c.
FIGURE 5.9
PCA-driven denoising of a real-life camera image shown in Figure 5.3a: (a) denoised CFA image and (b) its
demosaicked version.
The performance of the proposed method will now be evaluated using digital camera im-
ages with real, non-approximated noise. Denoising raw sensor images using the proposed
method requires calculating the noise energy of each channel from the acquired CFA data.
This can be accomplished by dividing the CFA image into subimages and then processing
each subimage using the one-stage orthogonal wavelet transform [65]. Assuming that w
denotes the diagonal subband at the decomposed first stage, the noise level in each subim-
age can be estimated as $\sigma = \mathrm{median}(|w|)/0.6745$ [66] or $\sigma = \bigl(\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N} w^2(i,j)\bigr)^{0.5}$,
where (i, j) denotes the spatial location and M and N denote the subband dimensions. In
the situations when there is more than one subimage per color channel, the noise level is
estimated as the average of σ values calculated for all spectrally equivalent subimages.
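The per-channel estimate can be sketched with PyWavelets; the GRBG subimage indexing is an assumption of this illustration, and the median-based rule uses the usual 0.6745 constant of Reference [66].

```python
# Hedged sketch of CFA noise level estimation from the diagonal wavelet subband.
import numpy as np
import pywt

def estimate_channel_sigma(subimage, robust=True):
    _, (_, _, w) = pywt.dwt2(subimage.astype(float), 'haar')  # diagonal subband
    if robust:
        return np.median(np.abs(w)) / 0.6745     # median-based rule [66]
    return np.sqrt(np.mean(w ** 2))              # energy-based estimate

def estimate_cfa_noise(cfa):
    """Return (sigma_r, sigma_g, sigma_b) for an assumed GRBG Bayer image."""
    g1, r = cfa[0::2, 0::2], cfa[0::2, 1::2]
    b, g2 = cfa[1::2, 0::2], cfa[1::2, 1::2]
    sigma_g = 0.5 * (estimate_channel_sigma(g1) + estimate_channel_sigma(g2))
    return estimate_channel_sigma(r), sigma_g, estimate_channel_sigma(b)
```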
Figures 5.9 and 5.10 demonstrate good performance of the proposed method in environ-
ments with the presence of real image sensor noise. Comparing the denoised CFA images
with the acquired ones clearly shows that the proposed method efficiently uses spatial and
spectral image characteristics to suppress noise and simultaneously preserve edges and im-
age details. The same conclusion can be made when visually inspecting the corresponding
demosaicked images.
Full-color results presented in this chapter are available at http://www4.comp.polyu.edu.hk/~cslzhang/paper/cpPCA.pdf. Additional results and detailed performance analysis can
be found in Reference [57].
5.5 Conclusion
This chapter presented image denoising solutions for digital cameras equipped with a
color filter array placed on top of a monochrome image sensor. Namely, taking into con-
FIGURE 5.10
PCA-driven denoising of a real-life digital camera image: (a) acquired CFA sensor image and (b) its demo-
saicked version; (c) denoised CFA image and (d) its demosaicked version.
sideration the fundamentals of single-sensor color imaging and digital camera image pro-
cessing, the chapter identified three pipelining frameworks that can be used to produce a
denoised image. These frameworks differ in the position of the denoising step with respect
to the demosaicking step in the camera image processing pipeline, thus having their own
design, performance, and implementation challenges.
The framework that performs denoising before demosaicking was the main focus of this
chapter. Denoising the color filter array mosaic data is the most natural way of handling
the image noise problem in the digital cameras under consideration. The framework can
effectively suppress noise and preserve color edges and details, while having the poten-
tial to achieve high processing rates. This is particularly true for the proposed principal
component analysis-driven approach that adaptively calculates covariance matrices to al-
low effective signal energy clustering and efficient noise removal. The approach utilizes
both spatial and spectral correlation characteristics of the captured image and takes advan-
tage of local similarities that exist among blocks of color filter array mosaic data in order
to improve the estimation accuracy of the principal component analysis transformation ma-
trix. This constitutes a basis for achieving the desired visual quality using the proposed
approach.
Obviously, image denoising solutions have an extremely valuable position in digital
imaging. The trade-off between performance and efficiency makes many denoising meth-
ods indispensable tools for digital cameras and their applications. Since the proposed
method is reasonably robust in dealing with the infinite number of variations in the
visual scene and varying image sensor noise, it can play a key role in modern imaging sys-
tems and consumer electronic devices with image-capturing capabilities which attempt to
mimic human perception of the visual environment.
References
[1] P.L.P. Dillon, D.M. Lewis, and F.G. Kaspar, “Color imaging system using a single CCD area
array,” IEEE Journal of Solid-State Circuits, vol. 13, no. 1, pp. 28–33, February 1978.
[2] B.T. Turko and G.J. Yates, “Low smear CCD camera for high frame rates,” IEEE Transactions
on Nuclear Science, vol. 36, no. 1, pp. 165–169, February 1989.
[3] A.J. Blanksby and M.J. Loinaz, “Performance analysis of a color CMOS photogate image
sensor,” IEEE Transactions on Electron Devices, vol. 47, no. 1, pp. 55–64, January 2000.
[4] D. Doswald, J. Haflinger, P. Blessing, N. Felber, P. Niederer, and W. Fichtner, “A 30-frames/s
megapixel real-time CMOS image processor,” IEEE Journal of Solid-State Circuits, vol. 35,
no. 11, pp. 1732–1743, November 2000.
[5] R. Lukac, Single-Sensor Imaging: Methods and Applications for Digital Cameras, ch. Single-
sensor digital color imaging fundamentals, R. Lukac (ed.), Boca Raton, FL: CRC Press /
Taylor & Francis, September 2008, pp. 1–29.
[6] R. Lukac and K.N. Plataniotis, “Color filter arrays: Design and performance analysis,” IEEE
Transactions on Consumer Electronics, vol. 51, no. 4, pp. 1260–1267, November 2005.
[7] FillFactory, “Technology - image sensor: The color filter array faq.” Available online:
http://www.fillfactory.com/htm/technology/htm/rgbfaq.htm.
[8] K. Hirakawa and P.J. Wolfe, Single-Sensor Imaging: Methods and Applications for Digital
Cameras, ch. Spatio-spectral sampling and color filter array design, R. Lukac (ed.), Boca
Raton, FL: CRC Press / Taylor & Francis, September 2008, pp. 137–151.
[9] K.A. Parulski and R. Reisch, Single-Sensor Imaging: Methods and Applications for Digital
Cameras, ch. Digital camera image storage formats, R. Lukac (ed.), Boca Raton, FL: CRC
Press / Taylor & Francis, September 2008, pp. 351–379.
[10] Technical Committee ISO/TC 42, Photography, “Electronic still picture imaging - removable
memory, part 2: Image data format - TIFF/EP,” ISO 12234-2, January 2001.
[11] “Information technology - digital compression and coding of continuous-tone still images:
Requirements and guidelines.” ISO/IEC International Standard 10918-1, ITU-T Recommen-
dation T.81, 1994.
[12] Japan Electronics and Information Technology Industries Association, “Exchangeable image
file format for digital still cameras: Exif Version 2.2,” Technical Report, JEITA CP-3451,
April 2002.
[13] N. Zhang, X. Wu, and L. Zhang, Single-Sensor Imaging: Methods and Applications for Digital
Cameras, ch. Lossless compression of color mosaic images and videos, R. Lukac (ed.), Boca
Raton, FL: CRC Press / Taylor & Francis, September 2008, pp. 405–428.
[14] N.X. Lian, V. Zagorodnov, and Y.P. Tan, Single-Sensor Imaging: Methods and Applications
for Digital Cameras, ch. Modelling of image processing pipelines in single-sensor digital
cameras, R. Lukac (ed.), Boca Raton, FL: CRC Press / Taylor & Francis, September 2008,
pp. 381–404.
[15] C.C. Koh, J. Mukherjee, and S.K. Mitra, “New efficient methods of image compression in
digital cameras with color filter array,” IEEE Transactions on Consumer Electronics, vol. 49,
no. 4, pp. 1448–1456, November 2003.
[16] R. Lukac and K.N. Plataniotis, “Single-sensor camera image compression,” IEEE Transac-
tions on Consumer Electronics, vol. 52, no. 2, pp. 299–307, May 2006.
[17] S.T. McHugh, “Digital camera image noise.” Available online:
http://www.cambridgeincolour.com/tutorials/noise.htm.
[18] K. Parulski and K.E. Spaulding, Digital Color Imaging Handbook, ch. Color image processing
for digital cameras, G. Sharma (ed.), Boca Raton, FL: CRC Press, 2002, pp. 728–757.
[19] J.E. Adams and J.F. Hamilton, Single-Sensor Imaging: Methods and Applications for Digital
Cameras, ch. Digital camera image processing chain design, R. Lukac (ed.), Boca Raton, FL:
CRC Press / Taylor & Francis, September 2008, pp. 67–103.
[20] R. Ramanath, W.E. Snyder, Y. Yoo, and M.S. Drew, “Color image processing pipeline,” IEEE
Signal Processing Magazine, Special Issue on Color Image Processing, vol. 22, no. 1, pp. 34–
43, January 2005.
[21] R. Lukac and K.N. Plataniotis, Color Image Processing: Methods and Applications,
ch. Single-sensor camera image processing, R. Lukac and K.N. Plataniotis (eds.), Boca Raton,
FL: CRC Press / Taylor & Francis, October 2006, pp. 363–392.
[22] B.K. Gunturk, J. Glotzbach, Y. Altunbasak, R.W. Schafer, and R.M. Mersereau, “Demosaick-
ing: Color filter array interpolation,” IEEE Signal Processing Magazine, Special Issue on
Color Image Processing, vol. 22, no. 1, pp. 44–54, January 2005.
[23] E. Dubois, Single-Sensor Imaging: Methods and Applications for Digital Cameras, ch. Color
filter array sampling of color images: Frequency-domain analysis and associated demosaick-
ing algorithms, R. Lukac (ed.), Boca Raton, FL: CRC Press / Taylor & Francis, September
2008, pp. 183–212.
[24] D. Alleysson, B.C. de Lavarène, S. Süsstrunk, and J. Hérault, Single-Sensor Imaging: Methods
and Applications for Digital Cameras, ch. Linear minimum mean square error demosaicking,
R. Lukac (ed.), Boca Raton, FL: CRC Press / Taylor & Francis, September 2008, pp. 213–237.
[25] L. Zhang and W. Lian, Single-Sensor Imaging: Methods and Applications for Digital Cam-
eras, ch. Video-demosaicking, R. Lukac (ed.), Boca Raton, FL: CRC Press / Taylor & Francis,
September 2008, pp. 485–502.
[26] E.Y. Lam and G.S.K. Fung, Single-Sensor Imaging: Methods and Applications for Digital
Cameras, ch. Automatic white balancing in digital photography, R. Lukac (ed.), Boca Raton,
FL: CRC Press / Taylor & Francis, September 2008, pp. 267–294.
[27] R. Lukac, “New framework for automatic white balancing of digital camera images,” Signal
Processing, vol. 88, no. 3, pp. 582–593, March 2008.
[28] K. Hirakawa, Single-Sensor Imaging: Methods and Applications for Digital Cameras,
ch. Color filter array image analysis for joint demosaicking and denoising, R. Lukac (ed.),
Boca Raton, FL: CRC Press / Taylor & Francis, September 2008, pp. 239–266.
[29] R. Lukac and K.N. Plataniotis, “A new image sharpening approach for single-sensor digital
cameras,” International Journal of Imaging Systems and Technology, Special Issue on Applied
Color Image Processing, vol. 17, no. 3, pp. 123–131, June 2007.
[30] S. Battiato, G. Messina, and A. Castorina, Single-Sensor Imaging: Methods and Applications
for Digital Cameras, ch. Exposure correction for imaging devices: An overview, R. Lukac
(ed.), Boca Raton, FL: CRC Press / Taylor & Francis, September 2008, pp. 323–349.
[31] R. Lukac, Single-Sensor Imaging: Methods and Applications for Digital Cameras, ch. Image
resizing solutions for single-sensor digital cameras, R. Lukac (ed.), Boca Raton, FL: CRC
Press / Taylor & Francis, September 2008, pp. 459–484.
[32] F. Gasparini and R. Schettini, Single-Sensor Imaging: Methods and Applications for Digital
Cameras, ch. Automatic red-eye removal for digital photography, R. Lukac (ed.), Boca Raton,
FL: CRC Press / Taylor & Francis, September 2008, pp. 429–457.
[33] P. Viola and M.J. Jones, “Robust real-time object detection,” Tech. Rep. CRL 2001/01, Com-
paq Cambridge Research Laboratory, Cambridge, Massachusetts, February 2001.
[34] R.L. Hsu, M. Abdel-Mottaleb, and A.K. Jain, “Face detection in color images,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 696–706, May 2002.
[35] W.C. Kao and S.Y. Lin, Single-Sensor Imaging: Methods and Applications for Digital Cam-
eras, ch. An overview of image/video stabilization techniques, R. Lukac (ed.), Boca Raton,
FL: CRC Press / Taylor & Francis, September 2008, pp. 535–561.
[36] M. Ben-Ezra and S.K. Nayar, “Motion-based motion deblurring,” IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 689–698, June 2004.
[37] S. Farsiu, D. Robinson, M. Elad, and P. Milanfar, Single-Sensor Imaging: Methods and Ap-
plications for Digital Cameras, ch. Simultaneous demosaicking and resolution enhancement
from under-sampled image sequences, R. Lukac (ed.), Boca Raton, FL: CRC Press / Taylor &
Francis, September 2008, pp. 503–533.
[38] S.G. Narasimhan and S.K. Nayar, “Enhancing resolution along multiple imaging dimensions
using assorted pixels,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 27, no. 4, pp. 518–530, April 2005.
[39] E. Reinhard, G. Ward, S. Pattanaik, and P. Debevec, High Dynamic Range Imaging. San
Francisco, CA, USA: Morgan Kaufmann Publishers, November 2005.
[40] I. Pitas and A.N. Venetsanopoulos, “Order statistics in digital image processing,” Proceedings
of the IEEE, vol. 80, no. 12, pp. 1892–1919, December 1992.
[41] R. Lukac and K.N. Plataniotis, Advances in Imaging and Electron Physics, ch. A taxonomy
of color image filtering and enhancement solutions, pp. 187–264, San Diego, CA: Elsevier /
Academic Press, February/March 2006.
[42] J.S. Lee, “Digital image smoothing and the sigma filter,” Graphical Models and Image Pro-
cessing, vol. 24, no. 2, pp. 255–269, November 1983.
[43] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in Proceedings
of the IEEE International Conference on Computer Vision, Bombay, India, January 1998,
pp. 839–846.
[44] M. Zhang and B.K. Gunturk, “Multiresolution bilateral filtering for image denoising,” IEEE
Transactions on Image Processing, vol. 17, no. 12, pp. 2324–2333, December 2008.
[45] S.G. Chang, B. Yu, and M. Vetterli, “Spatially adaptive wavelet thresholding with context
modeling for image denoising,” IEEE Transactions on Image Processing, vol. 9, no. 9,
pp. 1522–1531, September 2000.
[46] A. Pizurica and W. Philips, “Estimating the probability of the presence of a signal of inter-
est in multiresolution single- and multiband image denoising,” IEEE Transactions on Image
Processing, vol. 15, no. 3, pp. 654–665, March 2006.
[47] J. Portilla, V. Strela, M.J. Wainwright, and E.P. Simoncelli, “Image denoising using scale mix-
tures of Gaussians in the wavelet domain,” IEEE Transactions on Image Processing, vol. 12,
no. 11, pp. 1338–1351, November 2003.
[48] R. Lukac, B. Smolka, K. Martin, K.N. Plataniotis, and A.N. Venetsanopoulos, “Vector fil-
tering for color imaging,” IEEE Signal Processing Magazine, Special Issue on Color Image
Processing, vol. 22, no. 1, pp. 74–86, January 2005.
[49] K.N. Plataniotis and A.N. Venetsanopoulos, Color Image Processing and Applications. New
York: Springer Verlag, 2000.
[50] H.J. Trussell and R.E. Hartwig, “Mathematics for demosaicking,” IEEE Transactions on Im-
age Processing, vol. 11, no. 4, pp. 485–492, April 2002.
[51] R. Ramanath and W.E. Snyder, “Adaptive demosaicking,” Journal of Electronic Imaging,
vol. 12, no. 4, pp. 633–642, October 2003.
[52] K. Hirakawa and T.W. Parks, “Joint demosaicking and denoising,” IEEE Transactions on Im-
age Processing, vol. 15, no. 8, pp. 2146–2157, August 2006.
[53] K. Hirakawa, X.L. Meng, and P.J. Wolfe, “A framework for wavelet-based analysis and pro-
cessing of color filter array images with applications to denoising and demosaicing,” in Pro-
ceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing,
Honolulu, Hawaii, USA, April 2007, vol. 1, pp. 597–600.
[54] K. Hirakawa and X.L. Meng, “An empirical Bayes EM-wavelet unification for simultaneous
denoising, interpolation, and/or demosaicing,” in Proceedings of the IEEE International Con-
ference on Image Processing, Atlanta, GA, USA, October 2006, pp. 1453–1456.
[55] L. Zhang, X. Wu and D. Zhang, “Color reproduction from noisy CFA data of single sensor
digital cameras,” IEEE Transactions on Image Processing, vol. 16, no. 9, pp. 2184–2197,
September 2007.
[56] D. Paliy, V. Katkovnik, R. Bilcu, S. Alenius and K. Egiazarian, “Spatially adaptive color filter
array interpolation for noiseless and noisy data,” International Journal of Imaging Systems and
Technology, Special Issue on Applied Color Image Processing, vol. 17, no. 3, pp. 105–122,
June 2007.
[57] L. Zhang, R. Lukac, X. Wu, and D. Zhang, “PCA-based spatially adaptive denoising of CFA
images for single-sensor digital cameras,” IEEE Transactions on Image Processing, vol. 18,
no. 4, pp. 797–812, April 2009.
[58] S. Haykin, Neural Networks: A Comprehensive Foundation. 2nd Edition, India: Prentice Hall,
July 1998.
[59] K. Fukunaga, Introduction to Statistical Pattern Recognition. 2nd Edition, San Diego, CA:
Academic Press, October 1990.
[60] A. Foi, S. Alenius, V. Katkovnik, and K. Egiazarian, “Noise measurement for raw-data of digi-
tal imaging sensors by automatic segmentation of nonuniform targets,” IEEE Sensors Journal,
vol. 7, no. 10, pp. 1456–1461, October 2007.
[61] A. Foi, V. Katkovnik, D. Paliy, K. Egiazarian, M. Trimeche, S. Alenius, R. Bilcu, and M. Ve-
hvilainen, “Apparatus, method, mobile station and computer program product for noise esti-
mation, modeling and filtering of a digital image,” U.S. Patent Application No. 11/426,128,
2006.
[62] D.D. Muresan and T.W. Parks, “Adaptive principal components and image denoising,” in
Proceedings of the IEEE International Conference on Image Processing, Barcelona, Spain,
September 2003, vol. 1, pp. 101–104.
[63] B.E. Bayer, “Color imaging array,” U.S. Patent 3 971 065, July 1976.
[64] L. Zhang and X. Wu, “Color demosaicking via directional linear minimum mean square-
error estimation,” IEEE Transactions on Image Processing, vol. 14, no. 12, pp. 2167–2178,
December 2005.
[65] S. Mallat, A Wavelet Tour of Signal Processing. 2nd Edition, New York: Academic Press,
September 1999.
[66] D.L. Donoho and I.M. Johnstone, “Ideal spatial adaptation via wavelet shrinkage,” Biometrika,
vol. 81, no. 3, pp. 425–455, September 1994.
6
Regularization-Based Color Image Demosaicking
6.1 Introduction
Demosaicking is the process of reconstructing the full color representation of an image
acquired by a digital camera equipped with a color filter array (CFA). Most demosaicking
approaches have been designed for the Bayer pattern [1] shown in Figure 6.1a. However, a
number of different patterns [2], [3], [4], such as those shown in Figures 6.1b to 6.1d, have
been recently proposed to enhance color image acquisition and restoration processes.
Various demosaicking methods have been surveyed in References [5], [6], [7], and [8].
Popular demosaicking approaches rely on directional filtering [9], [10], [11], wavelet [12],
[13], frequency-domain analysis [14], and reconstruction [15] methods. In particular, an
effective strategy is to treat demosaicking as an inverse problem that can be solved by making
use of prior knowledge about natural color images. This approach, generally known as
regularization, has also been exploited for demosaicking [16], [17], [18], [19], [20], [21], as it
allows the design of algorithms that are suitable for any CFA.
This chapter presents regularization methods for demosaicking. Namely, Section 6.2 fo-
cuses on the problem formulation and introduces the notation to be used throughout the
chapter. Section 6.3 surveys existing regularization methods for demosaicking alone or
jointly with super-resolution. Section 6.4 presents a new regularization technique which
allows noniterative demosaicking. Section 6.5 presents performance comparisons of differ-
ent methods. Finally, conclusions are drawn in Section 6.6.
FIGURE 6.2
Image acquisition in a digital camera. © 2009 IEEE
Common practical models for the impulse response of the prefilter are the Gaussian or the
rect functions. However, today’s digital cameras use a CFA placed in front of the sensor to
capture one color component in each pixel location [1], [2]. Moreover, some CFAs capture
colors which can be expressed as a linear combination of the traditional red, green and blue
components [3], [4]. The CFA-sampled image I_s(n) can thus be expressed as

I_s(n) = \sum_{X \in \{R,G,B\}} c_X(n)\, X_p(n) + \eta(n), \qquad (6.2)
where the acquisition functions cX (n), for X = R, G, and B, are periodic and for any pixel
n ∈ Γ constrained as cR (n) + cG (n) + cB (n) = 1. The term η (n) characterizes sensor noise
introduced during image acquisition. It is assumed here that η (n) is uncorrelated with
respect to the acquired image Is (n). More complex models for the noise can be found in
Reference [22].
To this end, the image acquisition process can be represented as in Figure 6.2, and the
relation between Is (n), for n ∈ Γ, and the continuous image Ic (u) is given by
Based on these considerations, frequency-domain analysis of the acquired image Is (n) can
be carried out. Since the acquisition functions cX (n) are periodic, they can be represented
as a finite sum of harmonically related complex exponentials using the discrete Fourier
series as follows:
c_X(n) = \sum_{k \in P(\Lambda)} \alpha_X(k)\, e^{j k^T 2\pi V^{-1} n}, \qquad (6.4)
where V is the periodicity matrix and Λ is the lattice1 generated by V. The term P(Λ)
is a fundamental parallelepiped of the lattice Λ, and the coefficients α_X(k) are complex, with
α_X(k) = ᾱ_X(−k) in order to ensure real values for c_X(n).
Equation 6.2 can thus be rewritten as
I_s(n) = \sum_{X} \sum_{k} \alpha_X(k)\, e^{j k^T 2\pi V^{-1} n}\, X_p(n) + \eta(n), \qquad (6.5)
where X = R, G, B and k ∈ P(Λ). The Fourier transform of the acquired image can be
expressed as follows:
where Xp (ω) denotes the Fourier transform of the color component Xp (n). As a conse-
quence, the spectrum of a CFA image has at most |P(V)| frequency peaks and each peak
is a linear combination of the spectra of the three color components. In particular, the base-
band component is given by a weighted positive sum of the three color channels, where
each weight αX (0) is equivalent to the ratio between the number of acquired samples for
1 When the acquisition functions are periodic on different lattices, Λ indicates the densest lattice over which all
the c(n), for X = R, G, B, are periodic. This lattice denotes also the periodicity of the CFA.
FIGURE 6.3
Spectrum of the test image Lighthouse sampled using the Bayer pattern (magnitude in dB over the normalized frequencies ω_x/π and ω_y/π). © 2009 IEEE
the component X and the number of the pixels in the image. In many demosaicking strate-
gies it is suitable to define this weighted sum as the luminance component of the image,
representing the achromatic information of the original scene.
For instance, in the case of the Bayer pattern, under the assumption of an ideal impulse
response and zero noise, the spectrum becomes (see also Reference [23])

I_s(\omega_1, \omega_2) = \frac{1}{4}\big[ B(\omega_1 \pm \pi, \omega_2) - R(\omega_1 \pm \pi, \omega_2) \big] + \frac{1}{4}\big[ R(\omega_1, \omega_2 \pm \pi) - B(\omega_1, \omega_2 \pm \pi) \big]
+ \frac{1}{4}\big[ -R(\omega_1 \pm \pi, \omega_2 \pm \pi) + 2G(\omega_1 \pm \pi, \omega_2 \pm \pi) - B(\omega_1 \pm \pi, \omega_2 \pm \pi) \big]
+ \frac{1}{4}\big[ R(\omega_1, \omega_2) + 2G(\omega_1, \omega_2) + B(\omega_1, \omega_2) \big], \qquad (6.7)
where ω1 and ω2 indicate the horizontal and vertical frequencies. Therefore, it is possible
to identify nine regions containing the signal energy, where the central region corresponds
to the luminance and the others to the modulated chrominance components. Due to the
correlation between the high-frequencies of the color components, the chrominance repli-
cas have limited support, while the luminance overlies a large part of the frequency plane.
This is clearly shown in Figure 6.3 which depicts the spectrum of the Kodak test image
Lighthouse sampled with the Bayer pattern.
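As an added illustration (not part of the original text), the frequency structure described above can be reproduced numerically: multiplexing an RGB image onto a Bayer CFA and taking the 2-D DFT exposes the baseband luminance of Equation 6.7 together with the chrominance replicas at (±π, 0), (0, ±π), and (±π, ±π). The GRBG layout, the array size, and the synthetic test image are assumptions of this sketch.

import numpy as np

def bayer_sample(rgb):
    """Multiplex an RGB image (H x W x 3, float) onto a Bayer CFA; a GRBG layout is assumed."""
    h, w, _ = rgb.shape
    cfa = np.zeros((h, w))
    cfa[0::2, 0::2] = rgb[0::2, 0::2, 1]   # green
    cfa[0::2, 1::2] = rgb[0::2, 1::2, 0]   # red
    cfa[1::2, 0::2] = rgb[1::2, 0::2, 2]   # blue
    cfa[1::2, 1::2] = rgb[1::2, 1::2, 1]   # green
    return cfa

# Synthetic stand-in for a natural test image such as Lighthouse in Figure 6.3.
rng = np.random.default_rng(0)
rgb = rng.random((256, 256, 3))
cfa = bayer_sample(rgb)

# Log-magnitude spectrum: after fftshift the luminance occupies the center of the plane,
# while the modulated chrominance terms appear at the band edges and corners.
spectrum = np.fft.fftshift(np.fft.fft2(cfa))
magnitude_db = 20.0 * np.log10(np.abs(spectrum) + 1e-12)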
where the first term Ψ(I , Is ), called the data-fidelity term, denotes a measure of the dis-
tance between the estimated image and the observed data, while the terms Jk (I ) denote
regularizing constraints based on a priori knowledge of the original image. The regulariza-
tion parameters λk control the tradeoff between the various terms.
In the matrix notation used in this chapter, r, g, and b denote the stacking vectors of
the three full-resolution color components R(n), G(n), and B(n), respectively. The term
i denotes the vector obtained by stacking the three color component vectors, that is, iT =
[rT , gT , bT ], whereas is is the stacking version of the image Is (n) acquired by the sensor
and η is the stacking vector of the noise. Then, the relation between the acquired samples
is and the full-resolution image i is described as follows:
is = Hi + η, (6.9)
H = [CR PR , CG PG , CB PB ]. (6.10)
The square matrices PX account for the impulse response of the filters pX (n), respectively.
The entries of the diagonal matrices CX are obtained by stacking the acquisition functions
cX (n).
Using the above notation, the solution to the regularization problem in Equation 6.8 can
be found as follows:

\hat{i} = \arg\min_{i} \left\{ \Psi(i, i_s) + \sum_k \lambda_k J_k(i) \right\}. \qquad (6.11)
where ‖·‖_2 indicates the ℓ_2 norm. However, in some techniques, the characteristics of the
sensor noise are considered and a weighted ℓ_2 norm (Equation 6.13) is used instead.
where ∇_i^h x and ∇_i^v x are discrete approximations to the horizontal and vertical first-order
differences, respectively, at pixel i. The term N_x denotes the length of the vector x, and
β is a constant included to make the term differentiable at zero.
functional in Equation 6.11 is minimized with an iterative algorithm for each color component.
References [18] and [19] improve this approach by using the total-variation technique to
impose smoothness also on the color difference signals G − R and G − B and on the color sum
signals G + R and G + B. This is equivalent to requiring strong color correlation in high-
frequency regions, which is a common assumption in many demosaicking approaches [6].
The estimation of the green component, which has the highest density in the Bayer pattern,
is obtained as in Reference [16]. The missing red and blue values are estimated with an
iterative procedure taking into account total-variation terms both of the color components
and of the color differences and sums.
Reference [17] introduces a novel regularizing term to impose smoothness on the chromi-
nances. This demosaicking algorithm uses suitable edge-directed weights to avoid exces-
sive smoothing in edge regions. The prior term is defined as follows:
\sum_{n_1=1}^{N_1} \sum_{n_2=1}^{N_2} \sum_{l=-1}^{1} \sum_{m=-1}^{1} \tilde{e}^{\,l,m}_{n_1,n_2} \Big[ \big(X_{cb}(n_1, n_2) - X_{cb}(n_1+l, n_2+m)\big)^2 + \big(X_{cr}(n_1, n_2) - X_{cr}(n_1+l, n_2+m)\big)^2 \Big], \qquad (6.15)
where Xcb and Xcr are the chrominances obtained using the following linear transform:
The terms N_1 and N_2 denote the width and the height of the image, respectively. The
weights ẽ^{l,m}_{n_1,n_2}, computed using the values of the CFA samples in the locations (n_1, n_2)
and (n_1 + 2l, n_2 + 2m), are used to discourage smoothing across the edges. Using vector
notation, Equation 6.15 can be rewritten as follows:

\sum_{l=-1}^{1} \sum_{m=-1}^{1} \left( \left\| x_{cb} - Z^{l,m} x_{cb} \right\|^2_{W^{l,m}_{i_s}} + \left\| x_{cr} - Z^{l,m} x_{cr} \right\|^2_{W^{l,m}_{i_s}} \right), \qquad (6.17)

where x_{cb} and x_{cr} denote the vector representation of the chrominances, and Z^{l,m} is the
operator such that Z^{l,m} x corresponds to shifting the image x by l pixels horizontally and m
pixels vertically. The matrix W^{l,m}_{i_s} is diagonal; its diagonal values correspond to the vector
representation of the weights ẽ^{l,m}_{n_1,n_2}. This constraint set is convex and a steepest descent
optimization is used to find the solution.
Reference [20] exploits the sparse nature of color images to design the prior terms.
Instead of imposing smoothness on the intensity and chrominance values, it is assumed
that for any natural image i there exists a sparse linear combination of a limited number of
fixed-size patches that approximates it well. Given a dictionary D, a 3N_x × k matrix
containing the k prototype patches, this implies that for any image there exists α ∈ R^k
such that i ≈ Dα and ‖α‖_0 ≪ 3N_x, where ‖·‖_0 denotes the ℓ_0 quasi-norm, which counts
the number of nonzero elements. Based on these assumptions, novel regularizing terms
can be proposed and an estimate î of the original image can be found using an iterative
method that incorporates the K-SVD (singular value decomposition) algorithm, as presented
in Reference [27].
where the diagonal matrix Λ_d contains suitable weights determined by detecting the edge
orientation at each pixel.
The second constraint imposes isotropic smoothness on the chrominance components
with a high-pass filter S as follows:
where the matrices Z^{l,m} are defined as in Equation 6.17 and the scalar weight 0 < α < 1 is
applied to give a decreasing effect as l and m increase. The quadratic penalty term in
Equation 6.20 is used to describe the bandlimited characteristic of the chrominances, and
another term penalizes the mismatch between the locations or orientations of edges across the
color bands using the element-by-element multiplication operator ⊙, as follows:
J_3(i) = \sum_{l=-1}^{1} \sum_{m=-1}^{1} \Big[ \left\| g \odot Z^{l,m} b - b \odot Z^{l,m} g \right\|_2^2 + \left\| b \odot Z^{l,m} r - r \odot Z^{l,m} b \right\|_2^2 + \left\| r \odot Z^{l,m} g - g \odot Z^{l,m} r \right\|_2^2 \Big]. \qquad (6.22)
The data fidelity term measuring the similarity between the resulting high-resolution im-
age and the original low-resolution images is based on the `1 norm. A steepest descent
optimization is used to minimize the cost function expressed as the sum of the three terms.
2 In Reference [21], a finite impulse response (FIR) filter with coefficients [0.2, −0.5, 0.65, −0.5, 0.2] for S_h
and S_v is chosen. For the second constraint the filter coefficients [−0.5, 1, −0.5] are used.
Then, exploiting the properties of the Kronecker product, this constraint can be expressed
as J_2(i) = ‖M_2 i‖_2^2, where

M_2 = \left( \begin{bmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{bmatrix} \otimes S_2^T S_2 \right)^{1/2} = \begin{bmatrix} 1.1547 & -0.5774 & -0.5774 \\ -0.5774 & 1.1547 & -0.5774 \\ -0.5774 & -0.5774 & 1.1547 \end{bmatrix} \otimes S_2. \qquad (6.25)
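The numerical entries in Equation 6.25 can be checked directly; the short numpy sketch below is an added verification (not from the original text) that the symmetric square root of the 3 × 3 inter-channel matrix has diagonal entries 2/√3 ≈ 1.1547 and off-diagonal entries −1/√3 ≈ −0.5774.

import numpy as np

# Inter-channel coupling matrix appearing in Equation 6.25.
A = np.array([[ 2., -1., -1.],
              [-1.,  2., -1.],
              [-1., -1.,  2.]])

# Principal square root via the symmetric eigendecomposition.
w, V = np.linalg.eigh(A)
A_sqrt = (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T
print(np.round(A_sqrt, 4))                  # diagonal ~1.1547, off-diagonal ~-0.5774
print(np.allclose(A_sqrt @ A_sqrt, A))      # True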
Using the two regularizing constraints J1 (i) and J2 (i) defined as above and the data-
fidelity term defined in Equation 6.13, the solution of Equation 6.11 can be obtained by
solving the problem
\left( H^T R_\eta^{-1} H + \lambda_1 M_1^T M_1 + \lambda_2 M_2^T M_2 \right) \hat{i} - H^T R_\eta^{-1} i_s = 0, \qquad (6.26)
The coefficients of the filters that estimate the three color components from the CFA
sampled image can be extracted from the matrix G . In fact, G can be written as G =
[GR , GG , GB ], where the submatrices GR , GG , and GB are the representation (according to
the matrix notation introduced in Section 6.2) of the filters that estimate the red, green and
blue components from the CFA image. Due to the data sampling structure of the CFA, the
resulting filters are periodically space-varying and the number of different states depends
on the periodicity of the CFA.
\hat{\ell} = G_\ell\, i_s, \qquad (6.28)

where G_ℓ = AG represents the filter that estimates the luminance component from the CFA
data.
As for the estimation of the red, green, and blue components, the resulting filter is peri-
odically space-varying. Reference [14] describes the luminance estimation process for the
Bayer pattern with the sensor PSFs assumed to be ideal impulses, that is, pX (n) = δ (n).
In this case, inspecting the rows of the matrix G_ℓ reveals two different states for the recon-
struction filter. For pixels corresponding to the red and blue locations of the Bayer pattern
the resulting filter has the frequency response as shown in Figure 6.4a. For pixels corre-
sponding to the green locations in the Bayer pattern, the frequency response of the filter is
as shown in Figure 6.4b.
Recalling the frequency analysis of a Bayer-sampled image reported in Section 6.2, the
estimation of the luminance corresponds to eliminating the chrominance replicas using
FIGURE 6.4
(a) Frequency response of the 9 × 9 filter used for the luminance estimation in the red/blue pixels. (b) Frequency response of the 5 × 5 filter used for the luminance estimation in the green pixels. © 2009 IEEE
appropriate low-pass filters. For green pixels in the quincunx layout, the color difference
terms modulated at (0, ±π ) and (±π , 0) vanish, as reported in Reference [14]. Thus, only
the chrominance components modulated at (±π, ±π) have to be eliminated. This is not
the case for the red and blue locations, where all the chrominance components have to be
filtered and the spectrum of the image is as shown in Figure 6.3. Therefore, the frequency
response of the low-pass filters follows Figure 6.4. Since the requirements imposed on the
filter design are less demanding in the green locations than in the red and blue locations, a
smaller number of filter coefficients is usually sufficient. A similar analysis can be carried
out for other CFA arrangements.
where Wi is a diagonal matrix estimated from the image in order to adapt the penalty term
to the local features of the image [31]. If a regularization term of this type is considered
together with the two quadratic penalties J1 (i) and J2 (i) proposed in the previous section,
the solution to Equation 6.11 is found by solving
\left( H^T R_\eta^{-1} H + \lambda_1 M_1^T M_1 + \lambda_2 M_2^T M_2 + \lambda_3 M_3^T W_i M_3 \right) i - H^T R_\eta^{-1} i_s = 0. \qquad (6.30)
Since W_i depends on i, Equation 6.30 is nonlinear and is often solved with a Landweber
fixed-point iteration [25], [31]. However, a large number of iterations can be required before
convergence is reached, precluding fast implementations.
An alternative approach is proposed below. An initial estimate, ĩ, of the original image
is used such that the value of MT3 Wĩ M3 ĩ approximates MT3 Wi M3 i. Therefore, the resulting
image î is obtained as follows:
\hat{i} = \left( H^T R_\eta^{-1} H + \lambda_1 M_1^T M_1 + \lambda_2 M_2^T M_2 \right)^{-1} \left( H^T R_\eta^{-1} i_s - \lambda_3 M_3^T W_{\tilde{i}} M_3\, \tilde{i} \right). \qquad (6.31)
Since the operator M3 is set to be equivalent to the first quadratic operator M1 , and both the
matrices are designed using two directional filters, it can be written that
M_1 = M_3 = I_3 \otimes S_1 = I_3 \otimes \begin{bmatrix} S_{1h} \\ S_{1v} \end{bmatrix}, \qquad (6.32)
in order to detect the discontinuities of the image along horizontal and vertical directions,
respectively. The diagonal entries of W_ĩ depend on the horizontal and vertical high
frequencies of the estimated image ĩ. In fact, W_ĩ = diag(W_{r,h}, W_{r,v}, W_{g,h}, W_{g,v}, W_{b,h}, W_{b,v}),
where diag(·) denotes the diagonal entries and W_{x,h} and W_{x,v}, for x = r, g, b, are diagonal
matrices with their values defined as follows:
\{W_{x,h}\}_j = \xi\!\left( \frac{\{e_{x,v}\}_j}{\{e_{x,h}\}_j + \{e_{x,v}\}_j} \right), \qquad (6.33)

\{W_{x,v}\}_j = \xi\!\left( \frac{\{e_{x,h}\}_j}{\{e_{x,h}\}_j + \{e_{x,v}\}_j} \right), \qquad (6.34)
where {ex,h } j and {ex,v } j are the energies of the j-th value of S1h x and S1v x, respectively,
and ξ (·) is a function defined as
\xi(y) = \begin{cases} 0 & \text{if } y < \varepsilon, \\ \dfrac{y - \varepsilon}{1 - 2\varepsilon} & \text{if } \varepsilon \le y \le 1 - \varepsilon, \\ 1 & \text{if } y > 1 - \varepsilon, \end{cases} \qquad (6.35)
with 0 ≤ ε ≤ 1/2 (in Reference [21] ε = 0.25 is used). In this way, when {S_{1h}x}_j ≫
{S_{1v}x}_j the presence of a vertical edge can be assumed; therefore, {W_{x,h}}_j = 0 and the
constraint of smoothness of the color components is not considered along the horizontal
direction, while it is preserved for the vertical direction. The same analysis holds when
horizontal edges are found. Finally, when {S_{1h}x}_j and {S_{1v}x}_j have similar energies, smoothing
is imposed along both horizontal and vertical directions.
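The following sketch is an added illustration of Equations 6.33 to 6.35 for a single color channel; the filter taps follow the footnoted choice of Reference [21], while the boundary handling and the test image are arbitrary assumptions.

import numpy as np
from scipy.ndimage import convolve

def xi(y, eps=0.25):
    """Piecewise-linear mapping of Equation 6.35."""
    return np.clip((y - eps) / (1.0 - 2.0 * eps), 0.0, 1.0)

def directional_weights(x, eps=0.25):
    """Per-pixel weights W_{x,h}, W_{x,v} (Equations 6.33 and 6.34) for one color channel x (2-D float array)."""
    s1 = np.array([0.2, -0.5, 0.65, -0.5, 0.2])                  # FIR taps reported in footnote 2 for S_h / S_v
    e_h = convolve(x, s1[np.newaxis, :], mode='nearest') ** 2    # horizontal high-frequency energy
    e_v = convolve(x, s1[:, np.newaxis], mode='nearest') ** 2    # vertical high-frequency energy
    denom = e_h + e_v + 1e-12
    w_h = xi(e_v / denom, eps)   # close to zero where a vertical edge dominates (e_h >> e_v)
    w_v = xi(e_h / denom, eps)
    return w_h, w_v

# Example: on a channel containing a vertical edge, w_h vanishes along the edge while w_v stays high.
x = np.zeros((64, 64)); x[:, 32:] = 1.0
w_h, w_v = directional_weights(x)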
A similar approach was adopted in Reference [31], where a visibility function was applied
to compute the diagonal values of Wi . The visibility function depends on the local variance
of the image and goes to zero near the edges. However, this technique does not discriminate
between horizontal and vertical edges, so the high-frequency penalty is disabled for both
directions. Moreover, this approach is applied in iterative restoration methods.
It can be pointed out that there are two smoothing penalties in Equation 6.31, as the
adaptive term J3 (i) is included together with the quadratic constraint J1 (i). In fact, J1 (i)
cannot be removed, as the matrix H^T R_\eta^{-1} H + λ_2 M_2^T M_2 is not invertible since
ker(H^T H) ∩ ker(M_2^T M_2) ≠ {0}. Therefore, the regularization process with respect to the
spatial smoothness of the color components uses two constraints, where the quadratic one
allows inverting the matrix H^T R_\eta^{-1} H + λ_1 M_1^T M_1 + λ_2 M_2^T M_2 and the second one
includes adaptivity in the solution of the problem. The same approach is also applied in the
half-quadratic minimization methods in the additive form [32]. However, in these approaches,
the diagonal submatrices W_{x,h} for the horizontal details and W_{x,v} for the vertical details do
not take into account the vertical frequencies S_{1v}x and the horizontal frequencies S_{1h}x,
respectively. Thus, the local adaptivity is not based on the comparison between S_{1h}x and
S_{1v}x as in Equations 6.33 and 6.34, and convergence to the optimal solution is therefore
reached more slowly, after many iterations.
As for the initial estimate ĩ used in Equation 6.31, an efficient solution is to apply the
quadratic approach described in Section 6.4.1, leading to ĩ = G i_s. In this way, the approximation
M_3^T W_ĩ M_3 ĩ ≈ M_3^T W_i M_3 i is verified, and the proposed scheme provides a reliable estimate
of the color image i, as proved by the experimental results reported in Reference [21]
and in Section 6.5.
Since M_3 = I_3 ⊗ S_1 and W_ĩ = diag(W_r̃, W_g̃, W_b̃), with W_x̃ = diag(W_{x̃,h}, W_{x̃,v}), it can be
written that

M_3^T W_{\tilde{i}} M_3\, \tilde{i} = \begin{bmatrix} S_1^T W_{\tilde{r}} S_1 \tilde{r} \\ S_1^T W_{\tilde{g}} S_1 \tilde{g} \\ S_1^T W_{\tilde{b}} S_1 \tilde{b} \end{bmatrix}, \qquad (6.37)

and, considering that the high frequencies of the three color components are highly correlated
with those of the luminance, the following approximation holds:

S_1^T W_{\tilde{r}} S_1 \tilde{r} \simeq S_1^T W_{\tilde{g}} S_1 \tilde{g} \simeq S_1^T W_{\tilde{b}} S_1 \tilde{b} \simeq S_1^T W_{\tilde{\ell}} S_1 \tilde{\ell}. \qquad (6.38)
FIGURE 6.5
Adaptive luminance estimation scheme. © 2009 IEEE
where G_ℓ is defined in the previous section. If the initial estimate of the luminance ℓ̃ is
computed with the quadratic approach described in Section 6.4.1, that is, ℓ̃ = G_ℓ i_s, Equation
6.40 can be written as

\hat{\ell} = \left( I - \lambda_3 F_\ell S_1^T W_{\tilde{\ell}} S_1 \right) G_\ell\, i_s. \qquad (6.41)
This equation indicates the procedure to compute the luminance from the CFA-sampled
image using the proposed adaptive method. The resulting scheme is depicted in Figure 6.5,
where G_ℓ is the space-varying filter designed in Section 6.4.1 and the filter F_ℓ is obtained
from the matrix F_ℓ. The filters S_h and S_v are the horizontal and vertical high-pass filters
represented by the matrices S_{1h} and S_{1v}, respectively.3
\mathrm{CPSNR} = 10 \log_{10} \frac{255^2}{\frac{1}{3 N_1 N_2} \sum_{X} \sum_{n_1} \sum_{n_2} \left( \hat{X}(n_1, n_2) - X(n_1, n_2) \right)^2}, \qquad (6.42)
where N1 and N2 denote image dimensions, X = R, G, B denotes the color channel, and
n1 = 1, 2, ..., N1 and n2 = 1, 2, ..., N2 denote the pixel coordinates.
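Equation 6.42 maps directly to a few lines of code; the function below is a minimal added sketch, assuming 8-bit RGB images stored as (N1, N2, 3) arrays.

import numpy as np

def cpsnr(reference, estimate):
    """CPSNR (dB) of Equation 6.42 for 8-bit RGB images given as (N1, N2, 3) arrays."""
    err = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(err ** 2)          # averages over the three channels and all pixels
    return 10.0 * np.log10(255.0 ** 2 / mse)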
Performance of regularization-based algorithms was tested with four different acquisition
models to produce: i) noise-free Bayer CFA data, ii) noise-free CFA data generated using
the CFA with panchromatic sensors from Reference [4], iii) the Bayer CFA data corrupted
3 In Figure 6.5, it is assumed that the filters S_h and S_v are even-symmetric, since in this case S_{1h}^T = S_{1h} and S_{1v}^T = S_{1v}.
Instead, if S_h and S_v are odd-symmetric, that is, S_{1h}^T = −S_{1h} and S_{1v}^T = −S_{1v}, after the adaptive weighting S_h
and S_v have to be replaced with −S_h and −S_v, respectively.
FIGURE 6.6
The test images of the Kodak dataset used in the experiments. © 2009 IEEE
by Gaussian noise, and iv) CFA data generated using the Bayer pattern and sensors with a
non-ideal impulse response. In addition to these simulated scenarios, raw images acquired
by a Pentax *ist DS2 digital camera are also used for performance comparisons.
FIGURE 6.7
Complete scheme of the adaptive color reconstruction described in Section 6.4.2. © 2009 IEEE
FIGURE 6.8
Portion of the image #6 of the Kodak set: (a) original image, and (b-f) demosaicked images obtained using the methods presented in (b) Reference [9], (c) Reference [10], (d) Reference [19], (e) Reference [20], and (f) Reference [21]. © 2009 IEEE
TABLE 6.1
CPSNR (dB) evaluation of demosaicking methods using images sampled by the Bayer CFA shown in Figure 6.1a.
regularization-based techniques described in Sections 6.3 and 6.4 produce high average
CPSNR values and seem to be more robust on noisy images than edge-adaptive approaches
which tend to fail due to inaccurate edge detection on noisy data. The quality of the de-
mosaicked images can be improved by applying a denoising procedure. An alternative
strategy [34] could be to use techniques that perform demosaicking jointly with denoising.
the total variation-based image deconvolution [35] and the deconvolution using a sparse
prior [36] method.4 The performance of these methods is also compared with the results
obtained with the non-iterative regularization method described in Section 6.4.
Without deblurring, the methods presented in References [9] and [10] produce images
of poor quality. Employing the restoration method after the demosaicking step
considerably improves performance, providing average CPSNR improvements of up to 3.6
dB. The adaptive approach is able to produce sharp demosaicked images, thus making
the use of enhancement procedures unnecessary. The average CPSNR value obtained by the
regularization-based method is higher compared to values achieved using the demosaick-
ing methods [9], [10] followed by computationally demanding deblurring methods of Ref-
erences [35] and [36]. Figure 6.9 allows visual comparisons of different methods, with
the original image shown in Figure 6.8a. As can be seen, the image shown in Figure 6.9a
reconstructed without sharpening is blurred. Figure 6.9b shows the image with demosaick-
ing artifacts amplified by the deblurring algorithm. The image shown in Figure 6.9c is
excessively smoothed in the homogeneous regions. The best compromise between sharpness
and absence of demosaicking artifacts is demonstrated by the regularization approach of
Reference [21], whose output image is shown in Figure 6.9d.
4 The source codes of the deblurring algorithms presented in References [35] and [36] are available
FIGURE 6.9
Portion of the image #6 of the Kodak set: (a) image reconstructed using the method of Reference [9] without deblurring, (b) image reconstructed using the method of Reference [9] and sharpened using the method of Reference [35], (c) image reconstructed using the method of Reference [9] and sharpened with the method of Reference [36], (d) image reconstructed by the adaptive method of Reference [21]. © 2009 IEEE
6.6 Conclusion
This chapter presented demosaicking methods based on the concept of regularization.
Demosaicking is considered as an inverse problem and suitable regularization terms are
designed using the characteristics of natural images. As demonstrated in this chapter, tak-
ing advantage of assumptions based on the smoothness of the color components and the
high-frequency correlation between the color channels allows the design of efficient algorithms.

FIGURE 6.10
Portion of an image captured using a Pentax *ist DS2 camera: (a) CFA image; (b-d) images reconstructed using the method of (b) Reference [9], (c) Reference [14], and (d) Reference [21].

The regularization-based methods are easily applicable to any CFA and can efficiently
demosaick images acquired with sensors having a non-ideal impulse response, since
the characteristics of the PSF are taken into account in the reconstruction method. More-
over, the regularization-based strategy permits coupling demosaicking with other frequent
problems in image reconstruction and restoration.
Acknowledgment
Figures 6.1 to 6.9 and Tables 6.2 to 6.4 are reprinted from Reference [21], with the
permission of IEEE.
References
[1] B.E. Bayer, “Color imaging array,” U.S. Patent 3 971 065, July 1976.
[2] R. Lukac and K.N. Plataniotis, “Color filter arrays: Design and performance analysis,” IEEE
Transactions on Consumer Electronics, vol. 51, no. 4, pp. 1260–1267, November 2005.
[3] J.F. Hamilton and J.T. Compton, “Processing color and panchromatic pixels,” U.S. Patent
Application 0024879, February 2007.
[4] K. Hirakawa and P.J. Wolfe, “Spatio-spectral color filter array design for optimal image re-
covery,” IEEE Transactions on Image Processing, vol. 17, no. 10, pp. 1876–1890, October 2008.
[5] R. Lukac (ed.), Single-Sensor Imaging: Methods and Applications for Digital Cameras. Boca
Raton, FL: CRC Press / Taylor & Francis, September 2008.
[6] B.K. Gunturk, J. Glotzbach, Y. Altunbasak, R.W. Schafer, and R.M. Mersereau, “Demosaick-
ing: Color filter array interpolation,” IEEE Signal Processing Magazine, vol. 22, no. 1, pp. 44–
54, January 2005.
[7] X. Li, B.K. Gunturk, and L. Zhang, “Image demosaicing: A systematic survey,” Proceedings
of SPIE, vol. 6822, pp. 68221J:1–15, January 2008.
[8] S. Battiato, M. Guarnera, G. Messina, and V. Tomaselli, “Recent patents on color demosaic-
ing,” Recent Patents on Computer Science, vol. 1, no. 3, pp. 94–207, November 2008.
[9] L. Zhang and X. Wu, “Color demosaicking via directional linear minimum mean square-error
estimation,” IEEE Transactions on Image Processing, vol. 14, no. 12, pp. 2167–2177, December
2005.
[10] K.H. Chung and Y.H. Chan, “Color demosaicing using variance of color differences,” IEEE
Transactions on Image Processing, vol. 15, no. 10, pp. 2944–2955, October 2006.
[11] D. Menon, S. Andriani, and G. Calvagno, “Demosaicing with directional filtering and a poste-
riori decision,” IEEE Transactions on Image Processing, vol. 16, no. 1, pp. 132–141, January
2007.
[12] B.K. Gunturk, Y. Altunbasak, and R.M. Mersereau, “Color plane interpolation using alter-
nating projections,” IEEE Transactions on Image Processing, vol. 11, no. 9, pp. 997–1013,
September 2002.
[13] X. Li, “Demosaicing by successive approximation,” IEEE Transactions on Image Processing,
vol. 14, no. 3, pp. 370–379, March 2005.
[14] N.X. Lian, L. Chang, Y.P. Tan, and V. Zagorodnov, “Adaptive filtering for color filter array
demosaicking,” IEEE Transactions on Image Processing, vol. 16, no. 10, pp. 2515–2525,
October 2007.
[15] D. Taubman, “Generalized Wiener reconstruction of images from colour sensor data using a
scale invariant prior,” in Proceedings of IEEE International Conference on Image Processing,
Vancouver, BC, Canada, September 2000, pp. 801–804.
[16] T. Saito and T. Komatsu, “Sharpening-demosaicking method with a total-variation-based
super-resolution technique,” Proceedings of SPIE, vol. 5678, pp. 1801–1812, January 2005.
[17] O.A. Omer and T. Tanaka, “Image demosaicking based on chrominance regularization with
region-adaptive weights,” in Proceedings of the International Conference on Information,
Communications, and Signal Processing, Singapore, December 2007, pp. 1–5.
[18] T. Saito and T. Komatsu, “Demosaicing method using the extended color total-variation regu-
larization,” Proceedings of SPIE, vol. 6817, pp. 68170C:1–12, January 2008.
[19] T. Saito and T. Komatsu, “Demosaicing approach based on extended color total-variation reg-
ularization,” in Proceedings of the IEEE International Conference on Image Processing, San
Diego, CA, USA, September 2008, pp. 885–888.
[20] J. Mairal, M. Elad, and G. Sapiro, “Sparse representation for color image restoration,” IEEE
Transactions on Image Processing, vol. 17, no. 1, pp. 53–69, January 2008.
[21] D. Menon and G. Calvagno, “Regularization approaches to demosaicking,” IEEE Transactions
on Image Processing, vol. 18, no. 10, pp. 2209–2220, October 2009.
[22] K. Hirakawa, Single-Sensor Imaging: Methods and Applications for Digital Cameras,
ch. Color filter array image analysis for joint denoising and demosaicking, R. Lukac (ed.),
Boca Raton, FL: CRC Press / Taylor & Francis, September 2008, pp. 239–266.
[23] D. Alleysson, S. Süsstrunk, and J. Hérault, “Linear demosaicing inspired by the human visual
system,” IEEE Transactions on Image Processing, vol. 14, no. 4, pp. 439–449, April 2005.
[24] G. Demoment, “Image reconstruction and restoration: Overview of common estimation struc-
tures and problems,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37,
no. 12, pp. 2024–2036, December 1989.
[25] W.C. Karl, Handbook of Image and Video Processing, ch. Regularization in image restoration
and reconstruction, A. Bovik (ed.), San Diego, CA: Academic Press, June 2000, pp. 141–160.
[26] L.I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algo-
rithms,” Physica D, vol. 60, pp. 259–268, November 1992.
[27] M. Aharon, M. Elad, and A. Bruckstein, “The K-SVD: An algorithm for designing overcom-
plete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54,
no. 11, pp. 4311–4322, November 2006.
[28] S.C. Park, M.K. Park, and M.G. Kang, “Super-resolution image reconstruction: A technical
overview,” IEEE Signal Processing Magazine, vol. 20, no. 5, pp. 21–36, May 2003.
[29] T. Gotoh and M. Okutomi, “Direct super-resolution and registration using raw CFA images,”
in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recog-
nition, Washington, DC, USA, July 2004, pp. 600–607.
[30] S. Farsiu, M. Elad, and P. Milanfar, “Multiframe demosaicing and super-resolution of color
images,” IEEE Transactions on Image Processing, vol. 15, no. 1, pp. 141–159, January 2006.
[31] A.K. Katsaggelos, J. Biemond, R.W. Schafer, and R.M. Mersereau, “A regularized iterative
image restoration algorithm,” IEEE Transactions on Signal Processing, vol. 39, no. 4, pp. 914–
929, April 1991.
[32] M. Nikolova and M.K. Ng, “Analysis of half-quadratic minimization methods for signal and
image recovery,” SIAM Journal on Scientific Computing, vol. 27, no. 3, pp. 937–966, June
2005.
[33] K. Hirakawa and P.J. Wolfe, “Second-generation color filter array and demosaicking designs,”
Proceedings of SPIE, vol. 6822, pp. 68221P:1–12, January 2008.
[34] L. Zhang, X. Wu, and D. Zhang, “Color reproduction from noisy CFA data of single sen-
sor digital cameras,” IEEE Transactions on Image Processing, vol. 16, no. 9, pp. 2184–2197,
September 2007.
[35] J.M. Bioucas-Dias, M.A.T. Figueiredo, and J.P. Oliveira, “Total variation-based image de-
convolution: A majorization-minimization approach,” in Proceedings of the IEEE Interna-
tional Conference on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006,
pp. 861–864.
[36] A. Levin, R. Fergus, F. Durand, and W.T. Freeman, “Image and depth from a conventional
camera with a coded aperture,” ACM Transactions on Graphics, vol. 26, no. 3, July 2007.
7
Super-Resolution Imaging
Bahadir K. Gunturk
7.1 Introduction
Super-resolution (SR) image restoration is the process of producing a high-resolution im-
age (or a sequence of high-resolution images) from a set of low-resolution images [1], [2],
[3]. The process requires an image acquisition model that relates a high-resolution image
to multiple low-resolution images and involves solving the resulting inverse problem. The
acquisition model includes aliasing, blurring, and noise as the main sources of information
FIGURE 7.1
A subset of 21 low-resolution input images.
FIGURE 7.2
Interpolation vs. super-resolution: (a) one of the input images resized with bilinear interpolation, and (b)
high-resolution image obtained with the SR restoration algorithm presented in Reference [3].
loss. A super-resolution algorithm increases the spatial detail in an image, and equivalently
recovers the high-frequency information that is lost during the imaging process.
There is a wide variety of application areas for SR image restoration. In biomedical
imaging, multiple images can be combined to improve the resolution, which may help in
diagnosis. In surveillance systems, the resolution of a video sequence can be increased
to obtain critical information, such as license plate or facial data. High-definition televi-
sion (HDTV) sets may utilize SR image restoration to produce and display higher quality
video from a standard definition input signal. High-quality prints from low-resolution im-
ages can be made possible with SR image restoration. Imaging devices such as video
cameras and microscopes might be designed to create intentional sensor shifts to produce
high-resolution images [1]. Similarly, super-resolution projectors can be realized by super-
imposing multiple shifted images on a screen [2]. Other application areas include satellite
and aerial imaging, astronomy, video coding, and radar imaging. Figure 7.1 shows a num-
ber of low-resolution images. A set of these images is used to create a high-resolution
image; Figure 7.2 shows that the restored image is de-aliased and more legible than the
bilinearly interpolated input.
FIGURE 7.3
Imaging model: the high-resolution input image f(u, v) is warped to f(M_k(x, y)), blurred to b(u, v) * f(M_k(x, y)), downsampled, and corrupted by noise n_k(x, y) to form the observation g_k(x, y).
SR image restoration has been an active research field for more than two decades. As
revealed in various survey studies [4], [5], [6], [7], [8], [9], there are both well-established
methods and open problems to be addressed.
This chapter aims to serve as an introductory material for SR imaging, and also provide
a comprehensive review of SR methods. Section 7.2 describes a commonly used imaging
model and two implementation approaches. Section 7.3 presents the main SR methods.
Registration and parameter estimation issues are discussed in Section 7.4. Variations on SR
imaging model are presented in Section 7.5. Finally, conclusions and research directions
are given in Section 7.6.
FIGURE 7.4
The high-resolution image is warped onto the kth frame, convolved with a discrete PSF, and downsampled to form the observation.
f is the high-resolution image, M_k is the geometric mapping between f and the kth observa-
tion, and nk denotes observation noise. The term Mk (x, y) relates the coordinates of the
kth observation and the high-resolution image; in other words, f (Mk (x, y)) is the warped
high-resolution image and is the higher resolution version of gk . This model makes several
assumptions, including a shift-invariant PSF, additive noise, and constant illumination con-
ditions. Later in the chapter, three enhancements of this model, through motion modeling,
photometric modeling, and modeling of color filter array sampling, will be discussed.
The discretized imaging model can be written in matrix form as follows:
gk = Hk f + nk , (7.2)
where f is the vectorized version of the high-resolution image, gk is the kth vectorized
observation, nk is the kth vectorized noise, and Hk is the matrix that includes the linear
operations, that is, geometric warping, convolution with the PSF, and downsampling.
Sometimes, all N observations are stacked to form a simplified representation of the
problem:
\underbrace{\begin{bmatrix} g_1 \\ g_2 \\ \vdots \\ g_N \end{bmatrix}}_{g} = \underbrace{\begin{bmatrix} H_1 \\ H_2 \\ \vdots \\ H_N \end{bmatrix}}_{H} f + \underbrace{\begin{bmatrix} n_1 \\ n_2 \\ \vdots \\ n_N \end{bmatrix}}_{n} \;\Longrightarrow\; g = Hf + n \qquad (7.3)
There are two main approaches to implement the forward imaging process. In the first
method, the high-resolution image is warped to align with the low-resolution observation,
convolved with a discrete PSF, and downsampled to simulate the observation. This method
FIGURE 7.5
For a low-resolution position (x, y), a continuous PSF is placed at location (u, v) on the high-resolution image
based on the geometric mapping Mk (x, y). The weights of the PSF corresponding to the high-resolution pixels
under the kernel support are calculated. The weighted sum of the high-resolution pixels produces the intensity
at low-resolution image position (x, y). This process is repeated for all low-resolution image locations.
FIGURE 7.6
Low-resolution observations are registered and interpolated on a high-resolution grid. This process is followed
by a deconvolution process.
Combining this equation with the shifting property of the Fourier transform, F_k(w_x, w_y) =
e^{j2\pi(T_x w_x + T_y w_y)} F(w_x, w_y), results in a set of linear equations relating the DFT of the observa-
tions with the samples of the CFT of the high-resolution image. The CFT samples are then
solved to form the high-resolution image.
This frequency domain algorithm was later extended to include blur and noise in the
model [12]. A total least squares version was presented in Reference [13] for regulariz-
ing against registration errors; and a DCT domain version was presented in Reference [14]
to reduce the computational cost. The frequency domain approach has the advantage of
low-computational complexity and an explicit dealiasing mechanism. Among the disad-
vantages are the limitation to global translational motion, limitation to shift-invariant blur,
and limited capability of incorporating spatial domain priors for regularization.
the optimality of separating the interpolation and deconvolution steps is an open question.
A general observation is that the interpolation-deconvolution method does not perform as
well as the methods that do not separate interpolation from deconvolution.
In practice, the direct solution is not computationally feasible due to the sizes of the ma-
trices involved. If Hk was a block-circulant matrix, which is not the case in general, an
efficient frequency domain implementation would be possible. Therefore, iterative meth-
ods, such as the steepest descent and conjugate gradient methods, are adopted. These
methods start with an initial estimate and update it iteratively until a convergence criterion
is reached. The convergence criterion could be, for instance, the maximum number of iter-
ations or the rate of change between two successive iterations. An iteration of the steepest
descent method is the following:
f^{(i+1)} = f^{(i)} - \alpha \left. \frac{\partial C(f)}{\partial f} \right|_{f^{(i)}} = f^{(i)} + 2\alpha \sum_k H_k^T \left( g_k - H_k f^{(i)} \right), \qquad (7.8)
where f(i) is the ith estimate. The step size α should be small enough to guarantee conver-
gence; on the other hand, the convergence would be slow if it is too small. The value of α
could be fixed or adaptive, changing at each iteration. A commonly used method to choose
the step size is the exact line search method. In this method, defining d = ∂C(f)/∂f|_{f^{(i)}} at
the ith iteration, the step size that minimizes the next cost C(f^{(i+1)}) = C(f^{(i)} − αd) is given
by

\alpha = \frac{d^T d}{d^T \left( \sum_k H_k^T H_k \right) d}. \qquad (7.9)
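The iteration of Equations 7.8 and 7.9 can be sketched compactly in code. The fragment below is an added illustration rather than the implementation of any cited method: the warps are restricted to integer translations, the PSF is Gaussian, and the scale factor and boundary handling are arbitrary choices; the exact line-search step is written in terms of the backprojected residual, treating Ht as the adjoint of H.

import numpy as np
from scipy.ndimage import gaussian_filter, shift

SCALE = 2          # downsampling factor (assumed)
BLUR_SIGMA = 1.0   # Gaussian PSF width (assumed)

def H(f, dx, dy):
    """Forward model H_k: warp (integer shift), blur, and downsample a high-resolution image."""
    warped = shift(f, (dy, dx), order=0, mode='nearest')
    blurred = gaussian_filter(warped, BLUR_SIGMA)
    return blurred[::SCALE, ::SCALE]

def Ht(g, dx, dy):
    """Transpose operator H_k^T: zero-upsample, blur (symmetric PSF), and inverse warp."""
    up = np.zeros((g.shape[0] * SCALE, g.shape[1] * SCALE))
    up[::SCALE, ::SCALE] = g
    blurred = gaussian_filter(up, BLUR_SIGMA)
    return shift(blurred, (-dy, -dx), order=0, mode='nearest')

def sr_steepest_descent(observations, motions, f0, n_iter=50):
    """Least-squares SR by steepest descent with an exact line search on the quadratic cost."""
    f = f0.copy()
    for _ in range(n_iter):
        # Backprojected residual p = sum_k H_k^T (g_k - H_k f); with Ht acting as the
        # adjoint of H, p is (minus one half of) the cost gradient.
        p = np.zeros_like(f)
        for g, (dx, dy) in zip(observations, motions):
            p += Ht(g - H(f, dx, dy), dx, dy)
        # Exact line-search step for the cost sum_k ||g_k - H_k f||^2 along direction p.
        denom = sum(np.sum(H(p, dx, dy) ** 2) for _, (dx, dy) in zip(observations, motions))
        if denom == 0:
            break
        f = f + (np.sum(p * p) / denom) * p
    return f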
The algorithm suggested by Equation 7.8 can be taken literally by converting images
to vectors and obtaining the matrix forms of warping, blurring, and downsampling opera-
tions. There is an alternative implementation, which involves simple image manipulations.
The key is to understand the image domain operations implied by the transpose matrices
H_k^T = (D B W_k)^T = W_k^T B^T D^T. Through the analysis of the matrices, it can be seen that
seeks the solution that maximizes the probability P (g|f), while the maximum a posteri-
ori (MAP) estimator maximizes P (f|g). Using the Bayes rule, the MAP estimator can be
written in terms of the conditional probability P (g|f) and the prior probability of P (f) as
follows:
f_{map} = \arg\max_{f} P(f|g) = \arg\max_{f} \frac{P(g|f)\, P(f)}{P(g)}. \qquad (7.10)
In most SR restoration problems, image noise is modeled to be a zero-mean indepen-
dent identically distributed (iid) Gaussian random variable. Thus, the probability of an
observation given the high-resolution image can be expressed as follows:
P(g_k|f) = \prod_{x,y} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2}\big(g_k(x, y) - \hat{g}_k(x, y)\big)^2 \right) = \frac{1}{(\sqrt{2\pi}\,\sigma)^M} \exp\left( -\frac{1}{2\sigma^2}\|g_k - H_k f\|^2 \right), \qquad (7.11)
where σ is the noise standard deviation, ĝk (x, y) is the predicted low-resolution pixel value
using f and the forward imaging process Hk , and M is the total number of pixels in the
low-resolution image.
The probability of all N observations given the high-resolution image can then be ex-
pressed as
à !
N
1 1 2
P (g|f) = ∏ P (gk |f) = ¡√ ¢NM exp − 2 ∑ kgk − Hk fk ,
k=1 2πσ 2 σ k
µ ¶
1 1 2
= ¡√ ¢NM exp − 2 kg − Hfk . (7.12)
2πσ 2σ
Substituting this into Equation 7.10, and by taking the logarithm and neglecting the irrele-
vant terms, provides the following:
where µ is the average image and Q is the inverse of the covariance matrix. When both the
data fidelity and the prior terms are quadratic, a straightforward analytical implementation
is possible. Another popular quadratic model is P(f) ∝ exp(−‖Lf‖²), which leads to the
following optimization problem:

f_{map} = \arg\min_{f} \left\{ \|g - Hf\|^2 + \lambda \|Lf\|^2 \right\}, \qquad (7.16)
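For problems small enough to form the matrices explicitly, Equation 7.16 has the closed-form solution f = (H^T H + λ L^T L)^{-1} H^T g; the toy sketch below is an added illustration with a random H, a first-difference L, and an arbitrary 1-D signal standing in for the image.

import numpy as np

rng = np.random.default_rng(1)
n, m = 64, 32                       # high- and low-resolution sizes (toy values)
H = rng.standard_normal((m, n))     # stand-in for the stacked warp/blur/downsample operator
f_true = np.sin(np.linspace(0, 4 * np.pi, n))
g = H @ f_true + 0.05 * rng.standard_normal(m)

# First-order difference operator as the regularization matrix L.
L = np.eye(n) - np.eye(n, k=1)
lam = 0.1

# Closed-form Tikhonov/MAP solution of Equation 7.16.
f_map = np.linalg.solve(H.T @ H + lam * L.T @ L, H.T @ g)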
with
d_1(x, y) = f(x+1, y) − 2 f(x, y) + f(x−1, y),
d_2(x, y) = f(x, y+1) − 2 f(x, y) + f(x, y−1),
d_3(x, y) = f(x+1, y+1)/2 − f(x, y) + f(x−1, y−1)/2,
d_4(x, y) = f(x+1, y−1)/2 − f(x, y) + f(x−1, y+1)/2, \qquad (7.20)
where the clique potentials dn (x, y) measure the spatial activity in horizontal, vertical and
diagonal directions using second order derivatives. This prior is an example of the Gauss
Markov Random Field (GMRF). For this model, the energy function can be obtained by
linear filtering; for example, the clique potentials for d1 (x, y) can be obtained by convolving
the image with the filter [1, −2, 1]. Therefore, the energy can be written as
E(f) = \sum_{c \in C} V_c(f) = \sum_{n=1}^{4} \|\Phi_n f\|^2 = \|\Phi f\|^2, \qquad (7.21)

where Φ_n is the convolution matrix for d_n(x, y), and Φ = [Φ_1^T Φ_2^T Φ_3^T Φ_4^T]^T is constructed
by stacking Φn . Substituting Equation 7.21 into Equation 7.18 reveals that the implemen-
tation approaches discussed previously for quadratic cost functions can be applied here as
well.
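As an added illustration of the filtering interpretation above, the sketch below evaluates the GMRF energy of Equations 7.20 and 7.21 by convolving an image with the four second-difference kernels and summing the squared responses; the boundary handling is an arbitrary choice.

import numpy as np
from scipy.ndimage import convolve

def gmrf_energy(f):
    """E(f) = sum_n ||Phi_n f||^2 using the clique-potential kernels of Equation 7.20 (f: 2-D float array)."""
    kernels = [
        np.array([[1., -2., 1.]]),                                  # d1: second difference along one axis
        np.array([[1.], [-2.], [1.]]),                              # d2: second difference along the other axis
        np.array([[0.5, 0., 0.], [0., -1., 0.], [0., 0., 0.5]]),    # d3: one diagonal
        np.array([[0., 0., 0.5], [0., -1., 0.], [0.5, 0., 0.]]),    # d4: the other diagonal
    ]
    return sum(np.sum(convolve(f, k, mode='nearest') ** 2) for k in kernels)

# Example: energy of a random test image.
energy = gmrf_energy(np.random.default_rng(0).random((64, 64)))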
A criticism of the GMRF prior is the over-smoothing effect. In Reference [28], the Huber
function is used to define the Huber Markov Random Field (HMRF), with clique potentials
V_c(f) = \sum_{n=1}^{4} \rho_\phi\big(d_n(x, y)\big), \qquad (7.22)
where 1 is a vector of ones. This optimization problem can be solved using a gradient
descent algorithm, which requires the gradient of the cost function:
\frac{\partial}{\partial f} \left\{ \|g - Hf\|^2 + \lambda\, \mathbf{1}^T \rho_\phi(\Phi f) \right\} = -2 H^T (g - Hf) + \lambda\, \Phi^T \rho_\phi'(\Phi f), \qquad (7.25)
where the derivative of the Huber function is
\rho_\phi'(z) = \begin{cases} 2z & \text{if } |z| \le \phi, \\ 2\phi\, \mathrm{sign}(z) & \text{otherwise}. \end{cases} \qquad (7.26)
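The sketch below is an added illustration giving a vectorized Huber penalty (in the standard form whose derivative matches Equation 7.26) and its derivative; with these, the HMRF gradient of Equation 7.25 can be assembled by filtering with the operators Φ_n, applying the derivative elementwise, and filtering back with their transposes.

import numpy as np

def huber(z, phi):
    """Huber penalty: quadratic for |z| <= phi, linear beyond (standard form assumed)."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= phi, z ** 2, 2.0 * phi * np.abs(z) - phi ** 2)

def huber_prime(z, phi):
    """Derivative of the Huber penalty as in Equation 7.26."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= phi, 2.0 * z, 2.0 * phi * np.sign(z))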
The GMRF and HMRF prior formulations above lead to efficient analytical implementa-
tions. In general, the use of Gibbs prior requires numerical methods, such as the simulated
annealing and iterated conditional modes, for optimization.
Another commonly used prior model is based on total variation, where the L1 norm (i.e.,
the sum of the absolute value of the elements) is used instead of the L2 norm:
The total variation regularization has been shown to preserve edges better than the
Tikhonov regularization [25]. A number of SR algorithms, including those presented in
References [31] and [32], have adopted this prior. Recently, Reference [33] proposed a
bilateral total variation prior:
à !
R R
P(f) ∝ exp − ∑ ∑ γ |m|+|l| kf − Sxl Sym fk1 , (7.28)
l=−R m=−R
where γ is a regularization constant in the range (0,1), Sxl and Sym shift an image by integer
amounts l and m in x and y directions, respectively, and R is the maximum shift amount.
This prior term penalizes intensity differences at different scales, and γ gives a spatial decay
effect. The gradient descent solution requires the derivative of the prior, which is
\frac{\partial \log P(f)}{\partial f} \propto \sum_{l} \sum_{m} \gamma^{|l|+|m|} \left( I - S_y^{-m} S_x^{-l} \right) \mathrm{sign}\left( f - S_x^l S_y^m f \right) \qquad (7.29)
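A direct transcription of Equation 7.29 is sketched below as an added illustration; circular shifts stand in for the shift operators S_x and S_y, and the values of R and γ are arbitrary.

import numpy as np

def shift2d(f, l, m):
    """Shift an image by l pixels horizontally and m pixels vertically (circular boundaries assumed)."""
    return np.roll(np.roll(f, m, axis=0), l, axis=1)

def btv_gradient(f, R=2, gamma=0.7):
    """Gradient (up to sign) of the bilateral total variation prior of Equations 7.28 and 7.29."""
    grad = np.zeros_like(f, dtype=float)
    for l in range(-R, R + 1):
        for m in range(-R, R + 1):
            if l == 0 and m == 0:
                continue    # the (0, 0) term contributes nothing
            s = np.sign(f - shift2d(f, l, m))
            # (I - S_y^{-m} S_x^{-l}) sign(f - S_x^l S_y^m f), weighted by gamma^(|l|+|m|)
            grad += gamma ** (abs(l) + abs(m)) * (s - shift2d(s, -l, -m))
    return grad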
It should be noted that in the case of the iid Gaussian model, the MAP estimation can be
interpreted in the regularized least squares estimation context, and vice versa. However, there
are non-Gaussian cases, where the MAP approach and the least squares approach lead to
different solutions. For example, in tomographic imaging, the observation noise is better
modeled with a Poisson distribution; and the MAP approach works better than the least
squares approach. Another advantage of the MAP approach is that it provides an elegant
and effective formulation when some uncertainties (for example, in registration or PSF
estimation) are modeled as random variables.
f^{(i+1)}(u, v) = \begin{cases}
f^{(i)}(u, v) + \dfrac{\big(r_k(x, y) - T_k(x, y)\big)\, h_k(x, y; u, v)}{\sum_{u,v \in S_{xy}} h_k^2(x, y; u, v)} & \text{for } r_k(x, y) > T_k(x, y), \\[2mm]
f^{(i)}(u, v) & \text{for } |r_k(x, y)| \le T_k(x, y), \\[2mm]
f^{(i)}(u, v) + \dfrac{\big(r_k(x, y) + T_k(x, y)\big)\, h_k(x, y; u, v)}{\sum_{u,v \in S_{xy}} h_k^2(x, y; u, v)} & \text{for } r_k(x, y) < -T_k(x, y),
\end{cases} \qquad (7.32)
where Sxy is the set of pixels under the support of the PSF, centered by the mapping Mk (x, y).
The projection operation is repeated for all low-resolution pixels.
In addition to the data fidelity constraint set, it is possible to define other constraint sets.
For example, the amplitude constraint set, C_A = { f : 0 ≤ f(u, v) ≤ 255 }, ensures that the
resulting image has pixel values within a certain range. A smoothness constraint set could
be defined as C_S = { f : | f(u, v) − f̄(u, v) | ≤ δ_S }, where f̄(u, v) is the average image and δ_S
is a nonnegative threshold.
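The projection of Equation 7.32 acts on the high-resolution pixels under one PSF support at a time; the sketch below is an added illustration of that single-pixel update, assuming the residual r_k(x, y), the threshold T_k(x, y), and the PSF weights h_k(x, y; u, v) over S_xy have already been computed from the current estimate.

import numpy as np

def project_data_fidelity(f_patch, residual, threshold, weights):
    """Project the high-resolution pixels under one PSF support onto the data-fidelity set (Equation 7.32).
    f_patch: current values f^(i)(u, v) over S_xy; residual: r_k(x, y); weights: h_k(x, y; u, v)."""
    energy = np.sum(weights ** 2)
    if energy == 0 or abs(residual) <= threshold:
        return f_patch                                        # already inside the constraint set
    if residual > threshold:
        return f_patch + (residual - threshold) * weights / energy
    return f_patch + (residual + threshold) * weights / energy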
FIGURE 7.8
POCS implementation: the simulated observation Σ_{u,v} h_k(x, y; u, v) f(u, v), formed from the high-resolution image estimate through the mapping M_k(x, y), is compared with the actual observation g_k(x, y).
FIGURE 7.9
High-resolution pixel f_{p_1}(0, 0) and the corresponding local low-resolution image patch in Reference [38].
With this approach, the solution is constrained to the subspace spanned by the average
image µ and the basis vectors [v1 , ..., vK ]. Noise that is orthogonal to the subspace is elimi-
nated automatically. This may turn out to be very helpful in some applications, for example,
face recognition from low-resolution surveillance video [37].
where f̃ is the subspace representation as in Equation 7.33. Using Equations 7.33 and 7.34,
the estimator in Equation 7.37 becomes

f_{ls} = \arg\min_{f} \| g - Hf \|^2 + \lambda \left\| \left( I - \Lambda\Lambda^T \right)(f - \mu) \right\|^2. \qquad (7.38)
low-resolution patch [g p1 (−1, −1), g p1 (0, −1), · · · , g p1 (1, 1)] and a corresponding high-
resolution pixel f p1 (0, 0). The low-resolution patch is binarized with ADRC as follows:
\mathrm{ADRC}\big(g_{p_1}(x, y)\big) = \begin{cases} 1 & \text{if } g_{p_1}(x, y) \ge \bar{g}_{p_1}, \\ 0 & \text{otherwise}, \end{cases} \qquad (7.39)
where ḡ p1 is the average of the patch. Applying ADRC to the entire patch, a binary code-
word is obtained. This codeword determines the class of that patch. That is, a 3 × 3 patch
is coded with a 3 × 3 matrix of ones and zeroes. There are, therefore, a total of 2^9 = 512
classes. During training, each low-resolution patch is classified; and for each class, lin-
ear regression is applied to learn the relation between the low-resolution patch and the
corresponding high-resolution pixel. Assuming M low-resolution patches and correspond-
ing high-resolution pixels are available for a particular class c, the regression parameters,
[a_{c,1}, a_{c,2}, ..., a_{c,9}], are then found by solving:

\begin{bmatrix} f_{p_1}(0,0) \\ f_{p_2}(0,0) \\ \vdots \\ f_{p_M}(0,0) \end{bmatrix} = \begin{bmatrix} g_{p_1}(-1,-1) & g_{p_1}(0,-1) & \cdots & g_{p_1}(1,1) \\ g_{p_2}(-1,-1) & g_{p_2}(0,-1) & \cdots & g_{p_2}(1,1) \\ \vdots & & & \vdots \\ g_{p_M}(-1,-1) & g_{p_M}(0,-1) & \cdots & g_{p_M}(1,1) \end{bmatrix} \begin{bmatrix} a_{c,1} \\ a_{c,2} \\ \vdots \\ a_{c,9} \end{bmatrix} \qquad (7.40)
During testing, for each pixel, the local patch around that pixel is taken and classified.
According to its class, the regression parameters are taken from a look-up table and applied
to obtain the high-resolution pixels corresponding to that pixel. This single-image method
can be extended to a multi-image method by including local patches from neighboring
frames. This would, however, increase the dimensionality of the problem significantly and
requires a much higher volume of training data. For example, if three frames were taken,
the size of the ADRC would be 3 × 3 × 3 = 27; and the total number of classes would be
2^27.
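The classification and regression steps of Equations 7.39 and 7.40 can be prototyped in a few lines; the sketch below is an added illustration in which the training patches are assumed to be supplied as flattened arrays and the coefficient table is assumed to have been built beforehand.

import numpy as np

def adrc_class(patch):
    """9-bit class index (0..511) of a 3x3 low-resolution patch, using the ADRC rule of Equation 7.39."""
    bits = (patch.ravel() >= patch.mean()).astype(int)
    return int(np.dot(bits, 1 << np.arange(9)))          # pack the binary codeword into an integer

def fit_class_regression(lr_patches, hr_pixels):
    """Least-squares fit of the coefficients [a_c,1 ... a_c,9] of Equation 7.40 for one class.
    lr_patches: (M, 9) flattened low-resolution patches; hr_pixels: (M,) center high-resolution values."""
    coeffs, *_ = np.linalg.lstsq(lr_patches, hr_pixels, rcond=None)
    return coeffs

def predict_hr_pixel(patch, coeff_table):
    """Apply the regression of the patch's class; coeff_table maps class index -> (9,) coefficient vector."""
    return float(patch.ravel() @ coeff_table[adrc_class(patch)])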
There are also other patch-based methods. In Reference [39], a feature vector from a
patch is extracted through a nonlinear transformation, which is designed to form more edge
classes. The feature vectors are then clustered to form classes. And, as in the case of
Reference [38], a weighted sum of the pixels in the patch is taken to get a high-resolution
pixel; the weights are learned during training. Reference [40] uses a Markov network to
find the relation between low- and high-resolution patches. Given a test patch, the best
matching low-resolution patch is determined and the corresponding high-resolution patch
learned from training is obtained. In Reference [41], a multi-scale decomposition is applied
to an image to obtain feature vectors, and the solution is forced to have feature vectors close
to ones learned during training.
Nonparametric techniques are not limited to video sequences with global geometric
transformations. Their disadvantage is the high computational cost. Among the nonpara-
metric image registration techniques are:
• Block matching methods. A block around the pixel in question is taken, and the best matching block in the other frame is found based on a criterion, such as the mean squared error or the sum of absolute differences (a minimal block-matching sketch is given after this list). References [10] and [28] are among the methods that utilize the hierarchical block matching technique to estimate motion vectors. Reference [65] evaluates the performance of block matching algorithms for estimating subpixel motion vectors in noisy and aliased images. It is shown that a (1/p)-pixel-accurate motion estimator exhibits errors bounded within ±1/(2p), for p ≥ 1; however, for real data the accuracy does not increase much beyond p > 4.
• Optical flow methods. These methods assume brightness constancy along the motion
path and derive the motion vector at each pixel based on a local or global smoothness
model. References [31], [66], [67] and [68] are among the methods that use optical
flow based motion estimation. A comparison of optical flow methods is presented in
Reference [69].
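The following is a minimal full-search block-matching sketch using the mean squared error criterion, as mentioned in the first bullet above; the hierarchical search of References [10] and [28] and the subpixel refinement studied in Reference [65] are not implemented here, and all names are illustrative.

```python
# A minimal full-search block-matching sketch (MSE criterion).
import numpy as np

def block_match(ref, tgt, x, y, block=8, search=4):
    """Return the motion vector (dy, dx) minimizing the MSE between the block at
    (y, x) in `ref` and candidate blocks in `tgt` within a +/- `search` window."""
    b_ref = ref[y:y + block, x:x + block]
    best, best_mv = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if yy < 0 or xx < 0 or yy + block > tgt.shape[0] or xx + block > tgt.shape[1]:
                continue
            mse = np.mean((b_ref - tgt[yy:yy + block, xx:xx + block]) ** 2)
            if mse < best:
                best, best_mv = mse, (dy, dx)
    return best_mv

rng = np.random.default_rng(1)
frame0 = rng.random((64, 64))
frame1 = np.roll(frame0, shift=(2, -3), axis=(0, 1))   # known global shift
print(block_match(frame0, frame1, x=24, y=24))          # expected (2, -3)
```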
It is possible that there are misregistered pixels, and these may degrade the result dur-
ing restoration. Such inaccurate motion vectors can be detected and excluded from the
restoration process. In Reference [70], two threshold values, one for regions of low local
variance and the other for regions of high local variance, are applied on the motion com-
pensated pixel residuals to determine the unreliable motion vectors, and the corresponding
observed data are excluded from the POCS iterations.
Most SR algorithms, including the ones mentioned above, perform registration and
restoration in two separate successive steps. There are a few SR algorithms that do joint
registration and restoration. A popular approach is the Bayesian approach, where the high-
resolution image f and the registration parameters p are calculated to maximize the condi-
tional probability P(f, p|g):
\{ f_{MAP}, p_{MAP} \} = \arg\max_{f,p} P(f, p \mid g) = \arg\max_{f,p} P(f) \, P(p \mid f) \, P(g \mid f, p). \qquad (7.42)
An example of such an algorithm is presented in Reference [71], which models both the
SR image and the registration parameters as Gaussian random variables, and employs an
iterative scheme to get the estimates.
FIGURE 7.10
The L-curve: the prior term ‖Lf(λ)‖ plotted against the residual ‖g − Hf(λ)‖ as λ varies from small to large values.
Visual inspection. When the viewer has considerable prior knowledge of the scene, it is
reasonable to choose the regularization parameter through visual inspection of results with
different parameter values. Obviously, this approach is not appropriate for all applications.
L-curve method [72]. Since the regularization parameter controls the trade-off between the data fidelity and prior information fidelity, it makes sense to determine the parameter by examining the behavior of these fidelity terms. Assuming that f(λ) is the solution with a particular regularization parameter λ, the data fidelity is measured as the norm of the residual ‖g − Hf(λ)‖, and the prior information fidelity is measured as the norm of the prior term, for example, ‖Lf(λ)‖ in the case of Tikhonov regularization. The plot of these terms as λ is varied forms an L-shaped curve (Figure 7.10). For some values of λ, the residual changes rapidly while the prior term does not change much; this is the over-regularized region. For some other values of λ, the residual changes very little while the prior term changes significantly; this is the under-regularized region. Intuitively, the optimal λ value is the one that corresponds to the corner of the L-curve. The corner point may be defined in a number of ways, including the point of maximum curvature and the point with slope −1. A sample application of the L-curve method in SR is presented in Reference [73].
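As an illustration, the following sketch sweeps the regularization parameter of a small one-dimensional Tikhonov-regularized deblurring problem and picks the L-curve corner as the point of maximum curvature in log-log coordinates; the Gaussian blur matrix, the first-difference operator L, and the noise level are assumptions made only for the example.

```python
# A minimal L-curve sketch for a small 1-D Tikhonov-regularized deblurring problem.
import numpy as np

n = 64
x = np.linspace(-3, 3, n)
f_true = (np.abs(x) < 1).astype(float)                     # simple box signal
H = np.array([[np.exp(-0.5 * ((i - j) / 1.5) ** 2) for j in range(n)] for i in range(n)])
H /= H.sum(axis=1, keepdims=True)                          # row-normalized blur matrix
L = np.eye(n) - np.eye(n, k=1)                             # first-difference operator
rng = np.random.default_rng(2)
g = H @ f_true + 0.01 * rng.standard_normal(n)             # blurred, noisy observation

lams = np.logspace(-6, 2, 60)
res, prior = [], []
for lam in lams:
    f_lam = np.linalg.solve(H.T @ H + lam * (L.T @ L), H.T @ g)   # Tikhonov solution f(lambda)
    res.append(np.linalg.norm(g - H @ f_lam))                     # data-fidelity term
    prior.append(np.linalg.norm(L @ f_lam))                       # prior-fidelity term

# curvature of the L-curve (log residual vs. log prior norm), parameterized by lambda
u, v = np.log(res), np.log(prior)
du, dv = np.gradient(u), np.gradient(v)
ddu, ddv = np.gradient(du), np.gradient(dv)
curvature = (du * ddv - dv * ddu) / (du**2 + dv**2) ** 1.5
print("lambda at the L-curve corner:", lams[np.nanargmax(np.abs(curvature))])
```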
Generalized cross-validation (GCV) method [74]. GCV is an estimator for the predictive
risk ‖Hf − Hf(λ)‖². The underlying idea is that the solution that is obtained using all but
one observation should predict that left-out observation well if the regularization parameter
is a good choice. The total error for a particular choice of the parameter is calculated by
summing up the prediction errors over all observations. The optimal parameter value is the
one that minimizes the total error. A search technique or an optimization method could be
used to determine the optimal value.
Discrepancy principle [75]. If the variance of the noise is known, then the bound on the residual norm ‖g − Hf(λ)‖ can be determined. Since under-regularization causes excessive noise amplification, one can choose the regularization parameter so that the residual norm is large but not larger than the bound. That is, if the bound is δ, then one needs to find λ such that ‖g − Hf(λ)‖ = δ.
Statistical approach. As already discussed, the statistical methods look for the image f by maximizing the conditional probability p(g|f) (maximum likelihood solution) or p(f|g) (maximum a posteriori solution).
During image acquisition, the signal f_t is blurred with a linear shift-invariant PSF b, which is due to optical and sensor blurs:

b(u,v) \ast f_t(u,v) = \int b\big((u,v) - (\xi_1, \xi_2)\big) \, f_t(\xi_1, \xi_2) \, d\xi_1 \, d\xi_2. \qquad (7.44)
By making the change of variables (u_r, v_r) = M_t(\xi_1, \xi_2), Equation 7.44 becomes

b(u,v) \ast f_t(u,v) = \int b\big((u,v) - M_t^{-1}(u_r, v_r)\big) \, f(u_r, v_r) \, |J(M_t)|^{-1} \, du_r \, dv_r = \int b(u, v; u_r, v_r; t) \, f(u_r, v_r) \, du_r \, dv_r, \qquad (7.45)
where |J(M_t)| is the determinant of the Jacobian of M_t, M_t^{-1} is the inverse motion mapping, and b(u, v; u_r, v_r; t) is defined as follows:
b(u, v; u_r, v_r; t) = b\big((u,v) - M_t^{-1}(u_r, v_r)\big) \, |J(M_t)|^{-1}. \qquad (7.46)
Note that b (u, v; ur , vr ;t) is not invariant in space or time. The video signal in Equation 7.45
is then integrated during the exposure time tk to obtain
\hat{f}_k(u,v) = \frac{1}{t_k} \int_0^{t_k} \!\! \int b(u, v; u_r, v_r; t) \, f(u_r, v_r) \, du_r \, dv_r \, dt = \int b_k(u, v; u_r, v_r) \, f(u_r, v_r) \, du_r \, dv_r, \qquad (7.47)
where bk (u, v; ur , vr ) is the linear shift- and time-variant blur defined as follows:
b_k(u, v; u_r, v_r) = \frac{1}{t_k} \int_0^{t_k} b(u, v; u_r, v_r; t) \, dt. \qquad (7.48)
Finally, fˆk (u, v) is downsampled to obtain the kth observation gk (x, y). Discretizing this
continuous model, one can write gk = Hk f + nk and apply the techniques that do not require a linear shift-invariant PSF, as before. The only difference would be the construction
of the matrix Hk which is no longer block circulant. In Reference [10], the POCS tech-
nique, which is very suitable for shift-variant blurs, is utilized; the algorithm is successfully
demonstrated on real video sequences with pronounced motion blur.
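A simplified sketch of the POCS data-consistency projection (in the spirit of Figure 7.8 and Reference [10]) is given below; the observation operator H_k is built here as an explicit matrix implementing a box blur followed by 2× decimation, which is an illustrative choice rather than the shift-variant blur b_k of Equation 7.48.

```python
# A simplified POCS data-consistency projection for gk = Hk f (+ nk).
import numpy as np

def build_observation_matrix(hr_size, factor=2):
    """Each row maps the flattened HR image to one LR pixel: average of a factor x factor block."""
    lr_size = hr_size // factor
    H = np.zeros((lr_size * lr_size, hr_size * hr_size))
    for yl in range(lr_size):
        for xl in range(lr_size):
            row = yl * lr_size + xl
            for dy in range(factor):
                for dx in range(factor):
                    H[row, (yl * factor + dy) * hr_size + (xl * factor + dx)] = 1.0 / factor**2
    return H

def pocs_data_projection(f, g, H, delta0=0.01):
    """Project f onto C_k = {f : |g - Hf| <= delta0}, one LR sample at a time."""
    f = f.copy()
    for m in range(H.shape[0]):
        r = g[m] - H[m] @ f
        if abs(r) > delta0:
            correction = (r - np.sign(r) * delta0) / (H[m] @ H[m])
            f += correction * H[m]                   # back-project the excess residual
    return f

hr_size, rng = 16, np.random.default_rng(3)
f_true = rng.random((hr_size, hr_size)).ravel()
H = build_observation_matrix(hr_size)
g = H @ f_true
f_est = np.full(hr_size * hr_size, f_true.mean())    # flat initial estimate
for _ in range(20):                                  # repeated projections
    f_est = pocs_data_projection(f_est, g, H)
print("residual norm:", np.linalg.norm(g - H @ f_est))
```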
The error in the photometric conversion increases in the lower and higher parts of the intensity range. This is mainly due to low signal-to-noise ratio in the lower parts and saturation in the higher parts. Less weight should be given to pixels with intensities in the lower and higher parts of the range in constructing the cost function, and the diagonal matrix W_i reflects that.
Figure 7.11 compares the affine and nonlinear photometric models. It is clear that the
nonlinear model works better than the affine model. One may notice that the residual in
Figure 7.11d is not as small as the residual in Figure 7.11i. The reason is the saturation
in Figure 7.11b; some pixels in Figure 7.11a cannot be estimated in any way from Fig-
ure 7.11b. As seen in Figure 7.11e, the weighting Wi would suppress these pixels and
prevent them from degrading the solution.
Figure 7.12 shows a number of input images from a dataset consisting of images captured
at different exposure times and camera positions. These images are processed with the
algorithm of Reference [84] to produce higher resolution versions. Two of these high-
resolution images are shown in Figure 7.13. The resulting high-resolution images can then
be processed with a HDR imaging algorithm to produce an image of high resolution and
high dynamic range. Reference [3] combines these steps; the tone-mapped image for this
dataset is given in Figure 7.14.
FIGURE 7.15
The spatio-intensity neighborhood of a pixel (with spatial and intensity extents τ_S and τ_I) illustrated for a one-dimensional case. The gray region is the neighborhood of the pixel in the middle.
where Mk is the warping operation to account for the relative motion between observations,
B is the convolution operation to account for the point spread function of the camera, and D
is the downsampling operation. Note that in Section 7.2, these operations were represented
as matrices DBMk .
The full-color image (g_k^{(R)}, g_k^{(G)}, g_k^{(B)}) is then converted to a mosaicked observation z_k according to a CFA sampling pattern as follows:

z_k(x, y) = \sum_{S=R,G,B} P_S(x, y) \, g_k^{(S)}(x, y), \qquad (7.56)
where P_S(x, y) takes only one of the color samples at a pixel according to the CFA pattern. For example, at a red pixel location, [P_R(x, y), P_G(x, y), P_B(x, y)] is [1, 0, 0].
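A minimal sketch of the CFA sampling of Equation 7.56 is shown below for a Bayer pattern; the particular layout (red at even rows and even columns, blue at odd rows and odd columns) is an assumption made only for the example.

```python
# A minimal sketch of the CFA sampling of Equation 7.56 for an assumed Bayer layout.
import numpy as np

def bayer_masks(height, width):
    """Return the indicator functions P_R, P_G, P_B of Equation 7.56."""
    yy, xx = np.mgrid[0:height, 0:width]
    P_R = ((yy % 2 == 0) & (xx % 2 == 0)).astype(float)
    P_B = ((yy % 2 == 1) & (xx % 2 == 1)).astype(float)
    P_G = 1.0 - P_R - P_B
    return P_R, P_G, P_B

def mosaic(g_R, g_G, g_B):
    """z_k(x, y) = sum_S P_S(x, y) g_k^(S)(x, y)."""
    P_R, P_G, P_B = bayer_masks(*g_R.shape)
    return P_R * g_R + P_G * g_G + P_B * g_B

rng = np.random.default_rng(4)
g_R, g_G, g_B = (rng.random((8, 8)) for _ in range(3))
z = mosaic(g_R, g_G, g_B)
print(z[0, 0] == g_R[0, 0], z[0, 1] == g_G[0, 1], z[1, 1] == g_B[1, 1])
```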
There are a number of SR papers utilizing the ideas developed in demosaicking research.
In Reference [91], the alternating projection method of Reference [92] is extended to mul-
tiple frames. In addition to the data fidelity constraint set, Reference [92] defines two more
constraint sets; namely, the detail constraint set and the color consistency constraint set.
The detail constraint set is based on the observation that the high-frequency contents of
color channels are similar to each other for natural images. Since the green channel is more
densely sampled and therefore less likely to be aliased, the high-frequency contents of the
red and blue channels are constrained to be close to the high-frequency content of the green
channel. Let Wi be an operator that produces the ith frequency subband of an image. There
are four frequency subbands (i = LL, LH, HL, HH) corresponding to low-pass filtering and
high-pass filtering permutations along horizontal and vertical dimensions [93]. The detail
constraint set, Cd , that forces the details (high-frequency components) of the red and blue
channels to be similar to the details of the green channel at every pixel location (x, y), is
defined as follows:
C_d = \left\{ g_k^{(S)}(x, y) : \left| \big(W_i \, g_k^{(S)}\big)(x, y) - \big(W_i \, g_k^{(G)}\big)(x, y) \right| \le T_d(x, y) \right\}, \qquad (7.57)
where i = LH, HL, HH, and S = R, B. The term Td (x, y) is a nonnegative threshold that
quantifies the closeness of the detail subbands to each other.
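The following sketch illustrates the clipping step behind the detail constraint of Equation 7.57. Reference [92] uses the subband operators W_i of Reference [93]; here a box-blur high-pass is used as a crude stand-in for the detail operator, so this is only an illustration of the idea, not the original projection.

```python
# A simplified sketch of the detail constraint of Equation 7.57 (box-blur stand-in for W_i).
import numpy as np

def box_blur(img, radius=1):
    """Mean filter with reflected borders (stand-in for the low-pass part of W_i)."""
    pad = np.pad(img, radius, mode="reflect")
    out = np.zeros_like(img)
    k = 2 * radius + 1
    for dy in range(k):
        for dx in range(k):
            out += pad[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / k**2

def project_detail(g_S, g_G, T_d=0.02):
    """Clip the detail of channel S toward the detail of the green channel."""
    low_S = box_blur(g_S)
    d_S, d_G = g_S - low_S, g_G - box_blur(g_G)
    d_S_proj = np.clip(d_S, d_G - T_d, d_G + T_d)    # enforce |detail_S - detail_G| <= T_d
    return low_S + d_S_proj

rng = np.random.default_rng(5)
g_G = rng.random((16, 16))
g_R = g_G + 0.2 * rng.standard_normal((16, 16))       # red channel with divergent detail
# maximum deviation of the projected red detail from the green detail (at most T_d = 0.02)
print(np.abs((project_detail(g_R, g_G) - box_blur(g_R)) - (g_G - box_blur(g_G))).max())
```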
Color consistency constraint set: It is reasonable to expect pixels with similar green
intensities to have similar red and blue intensities within a small spatial neighborhood.
FIGURE 7.16
Extension of the spatio-intensity neighborhood of a pixel to multiple images. The corresponding point of a pixel is found using motion vectors; using the parameters τ_S and τ_I, the neighborhood (gray regions) is determined.
This leads to the concept of the spatio-intensity neighborhood of a pixel. Suppose that the green channel g_k^{(G)} of an image is already interpolated and the goal here is to estimate the red value at a particular pixel (x, y). Then, the spatio-intensity neighborhood of the pixel (x, y) is defined as follows:

N(x, y) = \left\{ (u, v) : \|(u, v) - (x, y)\| \le \tau_S \ \text{and} \ \left| g_k^{(G)}(u, v) - g_k^{(G)}(x, y) \right| \le \tau_I \right\}, \qquad (7.58)
where τS and τI determine the extents of the spatial and intensity neighborhoods. Fig-
ure 7.15 illustrates the spatio-intensity neighborhood for a one-dimensional signal. Note
that this single-frame spatio-intensity neighborhood can be extended to multiple images
using motion vectors. The idea is illustrated in Figure 7.16.
The spatio-intensity neighbors of a pixel should have similar color values. One way to
measure color similarity is to inspect color differences between the red and green channels
and between the blue and green channels. These differences are expected to be similar
within the spatio-intensity neighborhood N (x, y). Therefore, the color consistency con-
straint set can be defined as follows:
C_c = \left\{ g_k^{(S)}(x, y) : \left| \left( g_k^{(S)}(x, y) - g_k^{(G)}(x, y) \right) - \overline{ \left( g_k^{(S)}(x, y) - g_k^{(G)}(x, y) \right) } \right| \le T_c(x, y) \right\}, \qquad (7.59)

where S = R, B. The term \overline{(\cdot)} denotes averaging within the neighborhood N(x, y), and
Tc (x, y) is a nonnegative threshold. It should be noted here that the spatio-intensity neigh-
borhood concept is indeed a variant of the bilateral filter [95] with uniform box kernels
instead of Gaussian kernels.
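A brute-force, single-pixel sketch of the spatio-intensity neighborhood of Equation 7.58 and of the color-difference average that appears in Equation 7.59 is given below; the thresholds τ_S and τ_I and the function names are illustrative.

```python
# A minimal sketch of the spatio-intensity neighborhood (Eq. 7.58) and color-difference average (Eq. 7.59).
import numpy as np

def spatio_intensity_neighborhood(g_G, x, y, tau_S=2, tau_I=0.05):
    """Pixels within spatial radius tau_S whose green value differs by at most tau_I."""
    h, w = g_G.shape
    nbrs = []
    for v in range(max(0, y - tau_S), min(h, y + tau_S + 1)):
        for u in range(max(0, x - tau_S), min(w, x + tau_S + 1)):
            if np.hypot(u - x, v - y) <= tau_S and abs(g_G[v, u] - g_G[y, x]) <= tau_I:
                nbrs.append((v, u))
    return nbrs

def mean_color_difference(g_S, g_G, x, y, **kwargs):
    """Average of g^(S) - g^(G) over N(x, y); the overline term of Equation 7.59."""
    nbrs = spatio_intensity_neighborhood(g_G, x, y, **kwargs)
    return np.mean([g_S[v, u] - g_G[v, u] for v, u in nbrs])

rng = np.random.default_rng(6)
g_G = rng.random((16, 16))
g_R = g_G + 0.1                                     # constant red-green color difference
print(mean_color_difference(g_R, g_G, x=8, y=8))    # expected 0.1
```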
The method starts with an initial estimate and projects it onto the constraint sets defined
over multiple images iteratively to obtain the missing pixels. If the blur function is set
to a delta function and the downsampling operation is not included, only the missing color
samples are obtained; this is called multi-frame demosaicking. Figure 7.17 provides sample
results. As seen, single-frame interpolation methods do not produce satisfactory results.
Multi-frame demosaicking removes most of the color artifacts. Finally, SR restoration deblurs the image and increases the resolution further [92].
Another method combining demosaicking and super-resolution is presented in Refer-
ences [6] and [33]. This method is based on least squares estimation with demosaicking
related regularization terms. Specifically, there are three regularization terms. The first
regularization term is the bilateral total variation regularization as in Equation 7.28 applied
on the luminance channel. The second regularization term is the Tikhonov regularization
applied on the chrominance channels. And the third regularization term is orientation regu-
larization, which basically forces different color channels to have similar edge orientations.
7.6 Conclusions
This chapter presented an overview of SR imaging. Basic SR methods were described,
sample results were provided, and critical issues such as motion estimation and parameter
estimation were discussed. Key references were provided for further reading. Two recent
advances in modeling, namely photometric modeling and color filter array modeling, were
discussed. While there are well-studied issues in SR imaging, there are also open problems
that need further investigation, including real-time implementation and algorithm paral-
lelization, space-variant blur identification, fast and accurate motion estimation for noisy
and aliased image sequences, identifying and handling occlusion and misregistration, im-
age prior modeling, and classification and regression in training-based methods.
As cameras are equipped with more computational power, it is becoming possible to
incorporate specialized hardware with associated software and exceed the performance of
traditional cameras. The jitter camera [1] is a good example; the sensor is shifted in hori-
zontal and vertical directions during video capture, and the resulting sequence is processed
with a SR algorithm to produce a higher-resolution image sequence. Imagine that the pix-
els in the jitter camera have a mosaic pattern of ISO gains; in that case, not only a spatial
diversity but also a photometric diversity could be created. Such a camera would have the
capability of producing a high-dynamic range and high-resolution image sequence. This
joint hardware and software approach will, of course, bring new challenges in algorithm
and hardware design.
Acknowledgment
This work was supported in part by the National Science Foundation under Grant No
0528785 and National Institutes of Health under Grant No 1R21AG032231-01.
References
[1] M. Ben-Ezra, A. Zomet, and S. Nayar, “Video super-resolution using controlled subpixel de-
tector shifts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6,
pp. 977–987, June 2005.
[2] N. Damera-Venkata and N.L. Chang, “Realizing super-resolution with superimposed pro-
jection,” in Proceedings of IEEE International Conference on Computer Vision and Pattern
Recognition, Minneapolis, MN, USA, June 2007, pp. 1–8.
[3] B.K. Gunturk and M. Gevrekci, “High-resolution image reconstruction from multiple differ-
ently exposed images,” IEEE Signal Processing Letters, vol. 13, no.4, pp. 197–200, April
2006.
[4] S. Borman and R.L. Stevenson, “Super-resolution from image sequences — A review,” in
Proceedings of Midwest Symposium on Circuits and Systems, Notre Dame, IN, USA, August
1998, pp. 374–378.
[5] M.G. Kang and S. Chaudhuri, “Super-resolution image reconstruction,” IEEE Signal Process-
ing Magazine, vol. 20, no. 3, pp. 19–20, May 2003.
[6] S. Farsiu, D. Robinson, M. Elad, and P. Milanfar, “Advances and challenges in super-
resolution,” in International Journal of Imaging Systems and Technology, vol. 14, no. 2,
pp. 47–57, August 2004.
[7] S. Chaudhuri, Super-Resolution Imaging. Boston, MA: Kluwer Academic Publishers, January
2001.
[8] D. Capel, Image Mosaicing and Super Resolution. London, UK: Springer, January 2004.
[9] A. Katsaggelos, R. Molina, and J. Mateos, Super Resolution of Images and Video. San Rafael,
CA: Morgan and Claypool Publishers, November 2006.
[10] A.J. Patti, M.I. Sezan, and A.M. Tekalp, “Superresolution video reconstruction with arbitrary
sampling lattices and nonzero aperture time,” IEEE Transactions on Image Processing, vol. 6,
no. 8, pp. 1064–1076, August 1997.
[11] R.Y. Tsai and T.S. Huang, “Multiframe Image Restoration and Registration.” In Advances in
Computer Vision and Image Processing. Greenwich, CT: JAI Press, 1984.
[12] S.P. Kim and W.Y. Su, “Recursive high-resolution reconstruction of blurred multiframe im-
ages,” IEEE Transactions on Image Processing, vol. 2, no. 4, pp. 534–539, October 1993.
[13] N.K. Bose, H.C. Kim, and H.M. Valenzuela, “Recursive total least squares algorithm for im-
age reconstruction from noisy, undersampled frames,” Multidimensional Systems and Signal
Processing, vol. 4, no. 3, pp. 253–268, July 1993.
[14] S. Rhee and M. Kang, “Discrete cosine transform based regularized high-resolution image
reconstruction algorithm,” Optical Engineering, vol. 38, no. 8, pp. 1348–1356, April 1999.
[15] S.C. Park, M.K. Park, and M.G. Kang, “Super-resolution image reconstruction: A technical
overview,” IEEE Signal Processing Magazine, vol. 20, no. 3, pp. 21–36, May 2003.
[16] S.P. Kim and N.K. Bose, “Reconstruction of 2D bandlimited discrete signals from nonuniform
samples,” in IEE Proceedings on Radar and Signal Processing, vol. 137, no. 3, pp. 197–204,
June 1990.
[17] S. Lertrattanapanich and N.K. Bose, “High resolution image formation from low resolution
frames using Delaunay triangulation,” IEEE Transactions on Image Processing, vol. 11, no. 2,
pp. 1427–1441, December 2002.
[18] T. Strohmer, “Computationally attractive reconstruction of bandlimited images from irregular
samples,” IEEE Transactions on Image Processing, vol. 6, no. 4, pp. 540–548, April 1997.
[19] T.Q. Pham, L.J. van Vliet, and K. Schutte, “Robust fusion of irregularly sampled data using
adaptive normalized convolution,” EURASIP Journal on Applied Signal Processing, vol. 2006,
Article ID 83268, pp. 236–236, 2006.
[20] B.K. Gunturk, Y. Altunbasak, and R.M. Mersereau, “Super-resolution reconstruction of com-
pressed video using transform-domain statistics,” IEEE Transactions on Image Processing,
vol. 13, no. 1, pp. 33–43, January 2004.
[21] M. Irani and S. Peleg, “Improving resolution by image registration,” CVGIP: Graphical Mod-
els and Image Processing, vol. 53, no 3, pp. 231–239, May 1991.
[22] A. Zomet, A. Rav-Acha, and S. Peleg, “Robust super-resolution,” in Proceedings of IEEE
International Conference on Computer Vision and Pattern Recognition, Kauai, HI, USA, De-
cember 2001, pp. 645–650.
[23] H.W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems. Dordrecht,
Netherlands: Kluwer Academic Publishers, 1996.
[24] C. Groetsch, Theory of Tikhonov Regularization for Fredholm Equations of the First Kind.
Boston, MA: Pittman, April 1984.
[25] L.I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algo-
rithms,” Physica D, vol. 60, no. 1–4, pp. 259–268, November 1992.
[26] D. Geman and C. Yang, “Nonlinear image recovery with half-quadratic regularization,” IEEE
Transactions on Image Processing, vol. 4, no. 7, pp. 932–945, July 1995.
[27] P. Cheeseman, B. Kanefsky, R. Hanson, and J. Stutz, “Super-resolved surface reconstruc-
tion from multiple images,” Tech. Rep. FIA-94-12, NASA Ames Research Center, December
1994.
[28] R.R. Schultz and R.L. Stevenson, “Extraction of high-resolution frames from video se-
quences,” IEEE Transactions on Image Processing, vol. 5, no. 6, pp. 996–1011, June 1996.
[29] R.C. Hardie, K.J. Barnard, and E.E. Armstrong, “Joint map registration and high-resolution
image estimation using a sequence of undersampled images,” IEEE Transactions on Image
Processing, vol. 6, no. 12, pp. 1621–1633, December 1997.
[30] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restora-
tion of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6,
no. 6, pp. 721–741, November 1984.
[31] M.K. Ng, H. Shen, E.Y. Lam, and L. Zhang, “A total variation regularization based super-
resolution reconstruction algorithm for digital video,” EURASIP Journal on Advances in Sig-
nal Processing, vol. 2007, Article ID 74585, pp. 1–16, 2007.
[32] S.D. Babacan, R. Molina, and A.K. Katsaggelos, “Total variation super resolution using a vari-
ational approach,” in Proceedings of the IEEE International Conference on Image Processing,
San Diego, CA, USA, October 2008, pp. 641–644.
[33] S. Farsiu, D. Robinson, M. Elad, and P. Milanfar, “Fast and robust multiframe super reso-
lution,” IEEE Transactions on Image Processing, vol. 13, no. 10, pp. 1327–1344, October
2004.
[34] H. Stark and P. Oskoui, “High-resolution image recovery from image-plane arrays, using con-
vex projections,” Journal of the Optical Society of America, vol. 6, no. 11, pp. 1715–1726,
November 1989.
[35] A.M. Tekalp, M.K. Ozkan, and M.I. Sezan, “High-resolution image reconstruction from
lower-resolution image sequences and space-varying image restoration,” in Proceedings of the
IEEE International Conference on Acoustics, Speech, and Signal Processing, San Francisco,
CA, USA, March 1992, vol. 3, pp. 169–172.
[36] D. Capel and A. Zisserman, “Super-resolution from multiple views using learnt image mod-
els,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern
Recognition, Kauai, HI, USA, December 2001, vol. 2, pp. 627–634.
[37] B.K. Gunturk, A.U. Batur, Y. Altunbasak, M.H. Hayes, and R.M. Mersereau, “Eigenface-
domain super-resolution for face recognition,” IEEE Transactions on Image Processing,
vol. 12, no. 5, pp. 597–606, May 2003.
[38] T. Kondo, Y. Node, T. Fujiwara, and Y. Okumura, “Picture conversion apparatus, picture con-
version method, learning apparatus and learning method,” U.S. Patent 6 323 905, November
2001.
[39] C.B. Atkins, C.A. Bouman, and J.P. Allebach, “Optimal image scaling using pixel classifi-
cation,” in Proceedings of the IEEE International Conference on Image Processing, Thessa-
loniki, Greece, October 2001, vol. 3, pp. 864–867.
[40] W.T. Freeman, T.R. Jones, and E.C. Pasztor, “Example-based super-resolution,” IEEE Com-
puter Graphics and Applications, vol. 22, no. 2, pp. 56–65, March/April 2002.
[41] S. Baker and T. Kanade, “Hallucinating faces,” in Proceedings of the IEEE Fourth Inter-
national Conference on Automatic Face and Gesture Recognition, Grenoble, France, March
2000, pp. 83–88.
[42] Y. Altunbasak, A.J. Patti, and R.M. Mersereau, “Super-resolution still and video reconstruc-
tion from mpeg-coded video,” IEEE Transactions on Circuits and Systems for Video Technol-
ogy, vol. 12, no. 4, pp. 217–226, April 2002.
[43] C.A. Segall, R. Molina, and A.K. Katsaggelos, “High-resolution images from low-resolution
compressed video,” IEEE Signal Processing Magazine, vol. 20, no. 3, pp. 37–48, May 2003.
[44] C.A. Segall, R. Molina, and A.K. Katsaggelos, “Bayesian resolution enhancement of com-
pressed video,” IEEE Transactions on Image Processing, vol. 13, no. 7, pp. 898–911, July
2004.
[45] A.J. Patti, A.M. Tekalp, and M.I. Sezan, “A new motion-compensated reduced-order model
kalman filter for space varying restoration of progressive and interlaced video,” IEEE Trans-
actions on Image Processing, vol. 7, no. 4, pp. 543–554, April 1998.
[46] M. Elad and A. Feuer, “Superresolution restoration of an image sequence: Adaptive filter
approach,” IEEE Transactions on Image Processing, vol. 8, no. 3, pp. 387–395, March 1999.
[47] D. Rajan, S. Chaudhuri, and M.V. Joshi, “Multi-objective super resolution: Concepts and
examples,” IEEE Signal Processing Magazine, vol. 20, no. 3, pp. 49–61, May 2003.
[48] M. Elad and A. Feuer, “Restoration of a single superresolution image from several blurred,
noisy and undersampled measured images,” IEEE Transactions on Image Processing, vol. 6,
no. 12, pp. 1646–1658, December 1997.
[49] H. Shekarforoush, M. Berthod, J. Zerubia, and M. Werman, “Sub-pixel Bayesian estimation
of albedo and height,” International Journal of Computer Vision, vol. 19, no. 3, pp. 289–300,
August 1996.
[50] E. Shechtman, Y. Caspi, and M. Irani, “Space-time super-resolution,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 27, no. 4, pp. 531–545, April 2005.
[51] Z. Lin and H.Y. Shum, “Fundamental limits of reconstruction-based superresolution algo-
rithms under local translation,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, vol. 26, no. 1, pp. 83–97, January 2004.
[52] D. Robinson and P. Milanfar, “Statistical performance analysis of super-resolution,” IEEE
Transactions on Image Processing, vol. 15, no. 6, pp. 1413–1428, June 2006.
[53] L.G. Brown, “A survey of image registration techniques,” ACM Computing Survey, vol. 24,
no. 4, pp. 325–376, December 1992.
[54] B. Zitova and J. Flusser, “Image registration methods: A survey,” Image and Vision Comput-
ing, vol. 21, no. 11, pp. 977–1000, October 2003.
[55] H.S. Stone, M.T. Orchard, E.C. Chang, and S.A. Martucci, “A fast direct Fourier-based al-
gorithm for subpixel registration of images,” IEEE Transactions on Geoscience and Remote
Sensing, vol. 39, no. 10, pp. 2235–2243, October 2001.
[56] P. Vandewalle, S. Susstrunk, and M. Vetterli, “Double resolution from a set of aliased images,”
in Proceedings of SPIE Electronic Imaging, San Jose, CA, USA, January 2004, pp. 374–382.
[57] H. Foroosh, J.B. Zerubia, and M. Berthod, “Extension of phase correlation to subpixel regis-
tration,” IEEE Transactions on Image Processing, vol. 11, no. 3, pp. 188–200, March 2002.
[58] B.S. Reddy and B.N. Chatterji, “An FFT-based technique for translation, rotation and scale-
invariant image registration,” IEEE Transactions on Image Processing, vol. 5, no. 8, pp. 1266–
1271, August 1996.
[59] L. Lucchese and G.M. Cortelazzo, “A noise-robust frequency domain technique for estimating
planar roto-translations,” IEEE Transactions on Image Processing, vol. 48, no. 6, pp. 1769–
1786, June 2000.
[60] S.P. Kim and W.Y. Su, “Subpixel accuracy image registration by spectrum cancellation,” in
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Process-
ing, Minneapolis, MN, USA, April 1993, vol. 5, pp. 153–156.
[61] P. Vandewalle, S. Susstrunk, and M. Vetterli, “A frequency domain approach to registration
of aliased images with application to super-resolution,” EURASIP Journal on Applied Signal
Processing, vol. 2006, Article ID 71459, pp. 1–14, 2006.
[62] D. Keren, S. Peleg, and R. Brada, “Image sequence enhancement using sub-pixel displace-
ments,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern
Recognition, Ann Arbor, MI, USA, June 1988, pp. 742–746.
[63] D. Capel and A. Zisserman, “Computer vision applied to super resolution,” IEEE Signal Pro-
cessing Magazine, vol. 20, no. 3, pp. 75–86, May 2003.
[64] M.A. Fischler and R.C. Bolles, “Random sample consensus: A paradigm for model fitting
with applications to image analysis and automated cartography,” Communications of the ACM,
vol. 24, no. 6, pp. 381–395, June 1981.
[65] S. Borman, M. Robertson, and R.L. Stevenson, “Block-matching sub-pixel motion estima-
tion from noisy, under-sampled frames - An empirical performance evaluation,” SPIE Visual
Communications and Image Processing, vol. 3653, no. 2, pp. 1442–1451, January 1999.
[66] S. Baker and T. Kanade, “Super-resolution optical flow,” Tech. Rep. CMU-RI-TR-99-36, The
Robotics Institute, Carnegie Mellon University, October 1999.
[67] R. Fransens, C. Strecha, and L.V. Gool, “A probabilistic approach to optical flow based super-
resolution,” in Proceedings of the IEEE International Conference on Computer Vision and
Pattern Recognition, Washington, DC, USA, June 2004, vol. 12, pp. 191–191.
[68] W. Zhao and H.S. Sawhney, “Is super-resolution with optical flow feasible?,” in Proceedings
of the 7th European Conference on Computer Vision, London, UK, May 2002, pp. 599–613.
[69] B. Galvin, B. McCane, K. Novins, D. Mason, and S. Mills, “Recovering motion fields: An
evaluation of eight optical flow algorithms,” in Proceedings of the British Machine Vision
Conference, Southampton, UK, September 1998, pp. 195–204.
[70] P.E. Eren, M.I. Sezan, and A.M. Tekalp, “Robust, object-based high-resolution image re-
construction from low-resolution,” IEEE Transactions on Image Processing, vol. 6, no. 6,
pp. 1446–1451, October 1997.
[71] P. Cheeseman, B. Kanefsky, R. Kraft, J. Stutz, and R. Hanson, Maximum Entropy and
Bayesian Methods, ch. Super-resolved surface reconstruction from multiple images, G.R. Hei-
dbreder (ed.), Dordrecht, Netherlands: Kluwer Academic Publishers, 1996, pp. 293–308.
[72] P.C. Hansen, “Analysis of discrete ill-posed problems by means of the L-curve,” SIAM Review,
vol. 34, no. 4, pp. 561–580, December 1992.
[73] N.K. Bose, S. Lertrattanapanich, and J. Koo, “Advances in superresolution using L-curve,” in
Proceedings of the IEEE International Symposium on Circuit and Systems, Sydney, Australia,
May 2001, vol. 2, pp. 433–436.
[74] G. Golub, M. Heath, and G. Wahba, “Generalized cross-validation as a method for choosing
a good ridge parameter,” Technometrics, vol. 21, no. 2, pp. 215–223, May 1979.
[75] V.A. Morozov, “On the solution of functional equations by the method of regularization,”
Soviet Math. Dokl., vol. 7, pp. 414–417, 1966.
[76] A.P. Dempster, N.M. Laird, and D.B. Rubin, “Maximum likelihood from incomplete data
via the EM algorithm,” Journal of the Royal Statistical Society - Series B (Methodological),
vol. 39, no. 1, pp. 1–38, 1977.
[77] D. Kundur and D. Hatzinakos, “Blind image deconvolution,” IEEE Signal Processing Maga-
zine, vol. 13, no. 3, pp. 43–64, May 1996.
[78] S.J. Reeves and R.M. Mersereau, “Blur identification by the method of generalized cross-
validation,” IEEE Transactions on Image Processing, vol. 1, no. 3, pp. 301–311, July 1992.
[79] R.L. Lagendijk, A.M. Tekalp, and J. Biemond, “Maximum likelihood image and blur identi-
fication: A unifying approach,” Journal of Optical Engineering, vol. 29, no. 5, pp. 422–435,
May 1990.
[80] N. Nguyen, G. Golub, and P. Milanfar, “Blind restoration / superresolution with generalized
cross-validation using Gauss-type quadrature rules,” in Conference Record of the Thirty-Third
Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, October
1999, vol. 2, pp. 1257–1261.
[81] N. Nguyen, P. Milanfar, and G. Golub, “Efficient generalized cross-validation with applica-
tions to parametric image restoration and resolution enhancement,” IEEE Transactions on
Image Processing, vol. 10, no. 9, pp. 1299–1308, September 2001.
[82] J. Abad, M. Vega, R. Molina, and A.K. Katsaggelos, “Parameter estimation in super-resolution
image reconstruction problems,” in Proceedings of the IEEE International Conference on
Acoustics, Speech, and Signal Processing, Hong Kong, April 2003, vol. 3, pp. 709–712.
[83] M.E. Tipping and C.M. Bishop, “Bayesian image super-resolution,” in Proceedings of the
International Conference on Advances in Neural Information Processing Systems, Vancouver,
BC, Canada, December 2002, vol. 15, pp. 1279–1286.
[84] M. Gevrekci and B.K. Gunturk, “Superresolution under photometric diversity of images,”
EURASIP Journal on Advances in Signal Processing, vol. 2007, Article ID 36076, pp. 1–12,
2007.
[85] P. Debevec and J. Malik, “Recovering high dynamic range radiance maps from photographs,”
in Proceedings of the International Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA, USA, August 1997, pp. 369–378.
[86] S. Mann, “Comparametric equations with practical applications in quantigraphic image pro-
cessing,” IEEE Transactions on Image Processing, vol. 9, no. 8, pp. 1389–1406, August 2000.
[87] T. Mitsunaga and S. Nayar, “Radiometric self calibration,” in Proceedings of the IEEE Inter-
national Conference Computer Vision and Pattern Recognition, Fort Collins, CO, USA, June
1999, vol. 1, pp. 374–380.
[88] A. Litvinov and Y. Schechner, “Radiometric framework for image mosaicking,” Journal of the
Optical Society of America A, vol. 22, no. 5, pp. 839–848, May 2005.
[89] M.D. Grossberg and S.K. Nayar, “Determining the camera response from images: What is
knowable?,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 11,
pp. 1455–1467, November 2003.
[90] B.K. Gunturk, J. Glotzbach, Y. Altunbasak, R.W. Schafer, and R.M. Mersereau, “Demosaick-
ing: Color filter array interpolation in single-chip digital cameras,” IEEE Signal Processing
Magazine, vol. 22, no. 1, pp. 44–55, January 2005.
[91] M. Gevrekci, B.K. Gunturk, and Y. Altunbasak, “Restoration of Bayer-sampled image se-
quences,” Oxford University Press, Computer Journal, vol. 52, no. 1, pp. 1–14, January 2009.
[92] B.K. Gunturk, Y. Altunbasak, and R.M. Mersereau, “Color plane interpolation using alter-
nating projections,” IEEE Transactions on Image Processing, vol. 11, no. 9, pp. 997–1013,
September 2002.
8.1 Introduction
Image blur arises when a single object point spreads over several image pixels. This phe-
nomenon is mainly caused by camera motion during exposure or a lens that is out-of-focus.
The conventional approach to image deblurring is to construct an image degradation model
and then solve the ill-posed inverse problem of this model. A new approach takes advan-
tage of recent advances in image sensing technology which enable splitting or controlling
the exposure time [1], [2], [3]. This approach exploits the mutually different pieces of
information from multi-exposed images of the same scene to produce the deblurred image.
This chapter presents a general framework of image deblurring using multi-exposed im-
ages. The objective of this approach is to reconstruct an image that faithfully represents a
real scene by using multi-exposed images of the same scene. The multi-exposed images
are assumed to be captured by a single camera or multiple cameras placed at different
locations. It is often further assumed that the multiple captured images do not have any
grid mismatch, as they can be aligned at the same grid by applying image registration algo-
rithms. With these assumptions, the various applications and recent endeavors in this field
are provided in detail.
This chapter is organized as follows. In Section 8.2, the characteristics of the multi-
exposed images are analyzed. The typical camera architecture for capturing multi-exposed
images is described in Section 8.2.1. Section 8.2.2 focuses on the characteristics of multi-
exposed images captured by various methods. Section 8.3 presents various techniques for
FIGURE 8.1
Block diagram of the camera system (optical lens, iris and focus/zoom motors, microcontroller, and UI device).
image deblurring using multi-exposed images. Section 8.3.1 describes the basic concept
of multi-exposed image deblurring, Section 8.3.2 presents an approach that relies on the
degradation model while using multi-exposed images, and Section 8.3.3 introduces a new
approach which does not require a deconvolution operation. Finally, conclusions are drawn
in Section 8.4.
FIGURE 8.2
Example of images captured with (left) a long exposure time, and (right) a short exposure time. © 2008 IEEE
in a captured image with incomplete information about the scene. In order to extract more
useful visual information in such situations, images are captured multiple times. Then,
mutually different pieces of the information are merged into the final image.
To design a deblurring algorithm which can take advantage of multi-exposed images, the
characteristics of these images need to be known. Multi-exposed images can be catego-
rized as i) images taken with the same exposure time and ii) images taken with different
exposure times. Before comparing these two, the basic relationship between the exposure
time and image quality factors such as blur, noise, brightness, and color distortion should
be discussed. Figure 8.2 shows a simple example of images taken with different exposure
times. Namely, the image on the left is captured with a long exposure time and exhibits strong motion blur due to camera shake, which often occurs in such scenarios. This is not the case for the image on the right, which is taken with a short exposure time; camera shake is then negligible, resulting in sharp edges but also in undesired noise and brightness distortion. In short, a long exposure
time is a source of image blur. Setting the exposure time short enough reduces blur while
introducing noise, color artifacts, and brightness loss. Due to this relationship, the expo-
sure time setting plays an important role in multi-exposed deblurring. Figure 8.3 illustrates
three possible exposure time settings.
As shown in Figure 8.3a, using uniform long exposure intervals results in multiple
blurred images. Since the speed and direction of the camera or object motion usually
change during the exposure time, multiple blurred images tend to contain different infor-
mation about the original scene. In other words, the point spread function (PSF) applied to
the ideal image varies amongst the multi-exposed images.
Figure 8.3b shows the effect of short and uniformly split exposure intervals. As can be
seen, image blur is inherently reduced and multiple noisy images are obtained instead. In
this case, the deblurring problem becomes a denoising problem. Since noise is generally
uncorrelated among these images, the denoising problem can be solved by accumulating
FIGURE 8.3
Three possible methods of exposure time splitting: (a) uniform long intervals, (b) uniform short intervals, and (c) nonuniform intervals.
multiple noisy images. Unfortunately, this can result in various color and brightness degra-
dations, thus necessitating image restoration methods to produce an image with the desired
visual quality.
Finally, Figure 8.3c shows the effect of nonuniform exposure intervals. Unlike the above
two methods, multi-exposed images of different characteristics are taken by partitioning
the exposure time into short and long intervals, thus producing both noisy and blurred im-
ages. Color and brightness information can be acquired from blurred images, whereas edge
and detail information can be derived from noisy images. This image capturing method
has recently gained much interest since difficult denoising or deblurring problems can be
reduced into a simpler image merging problem. Moreover, the method allows focusing on
the deblurring problem by using noisy images as additional information or the denoising
problem by extracting brightness and color information from blurred images.
g = h ∗ f + n, (8.1)
where g, h, f , and n denote the blurred image, the PSF, the original image, and the ob-
servation noise term, respectively. The objective here is to estimate the original image f
using the observed image g. The most classical approach is to estimate h and n from g, and then solve for f by applying image deconvolution algorithms such as Richardson-Lucy (RL) deconvolution [4]. However, the estimation of h and n is also a very demanding problem and the estimation error is unavoidable in most practical cases. To address this, joint optimization of blur identification and image restoration has been considered [5]. Though this
approach performs relatively well, it is computationally complex and does not guarantee
robust performance when the images are severely blurred. Since the basic difficulties of the
conventional methods originate from the lack of sufficient knowledge about a real scene,
the performance is inherently limited. The reader can refer to References [4], [6], [7], [8],
[9], and [10] for details on image deblurring using a single image.
Recently, many researchers have devoted their attention to a new approach that utilizes
multi-exposed images. Unlike the conventional approach, multiple images of the same
scene are captured and used together to reconstruct an output image. Since the amount of
available information about the original scene is increased by taking images multiple times,
the difficulty of the original deblurring problem is significantly alleviated.
The degradation model for multi-exposed images can be expressed as follows:

g_i = h_i \ast f + n_i, \qquad i = 1, 2, \ldots, N, \qquad (8.2)

where g_i, h_i, and n_i denote the ith blurred image, its corresponding PSF, and the observation noise, respectively. The amount of noise and blur in the N multi-exposed images is
dependent on the image capturing methods mentioned in Section 8.2.2. The multi-exposed
deblurring techniques can be divided into two groups. The first group includes techniques
which preserve the classical structure of single image deblurring and use multi-exposed im-
ages to enhance deblurring performance. The techniques in the second group convert the
deblurring problem to a new formulation and use a specific solution for this new problem
at hand. These two approaches are discussed in detail below.
where x and y denote the image coordinates. By pessimistically setting the support of h1
and h2 to a large number, K1 and K2 , respectively, the derivatives of Equation 8.5 provide
K1 + K2 linear equations. By solving these equations, the PSFs can be obtained. Additional
constraints can be included in Equation 8.5 to stabilize the solution [11].
After estimating the PSFs, the deblurred image \hat{f} is obtained by minimizing the following error function:

E = \sum_{i=1}^{2} \left\| g_i - \hat{f} \ast h_i^{\theta_i} \right\|^2 + \lambda \left[ \left\| \hat{f}_x \right\|_p^p + \left\| \hat{f}_y \right\|_p^p \right], \qquad (8.6)
where λ controls the fidelity and stability of the solution. The regularization term is computed using the p-norm of the horizontal and vertical derivatives of \hat{f}, denoted \hat{f}_x and \hat{f}_y. Then, the solution \hat{f} is iteratively estimated as follows:
\hat{f}^{(k+1)} = \hat{f}^{(k)} + \sum_{i=1}^{2} h_i^{T,\theta_i} \ast \left( g_i - \hat{f}^{(k)} \ast h_i^{\theta_i} \right) - \lambda \left. \frac{\partial L}{\partial f} \right|_{\hat{f}^{(k)}}, \qquad (8.7)
where \hat{f}^{(k)} is the deblurred image at the kth iteration and h_i^T denotes the flipped version of h_i. Equations 8.5, 8.6, and 8.7 are almost the same as those used in conventional single-image deconvolution except for the summation over the two images. However, using two blurred images moves the ill-posed problem closer to a well-posed one, and therefore this approach can improve the quality of the restored images. This straightforward generalization indicates that deblurring based on multi-exposed images can outperform its single-exposed image counterpart.
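A minimal sketch of the additive update of Equation 8.7 is given below for two observations with known PSFs; FFT-based circular convolution, a step size beta, and the omission of the regularization gradient (λ = 0) are simplifying assumptions made for the example, and the symmetric Gaussian PSFs used here mean that h^T equals h.

```python
# A minimal sketch of the multi-image additive update of Equation 8.7 (lambda = 0, step size beta).
import numpy as np

def conv2_fft(img, psf):
    """Circular 2-D convolution via the FFT (PSF centred at the origin)."""
    return np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(psf, img.shape)))

def gaussian_psf(shape, sigma):
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    yy = np.minimum(yy, shape[0] - yy)          # wrap-around coordinates
    xx = np.minimum(xx, shape[1] - xx)
    psf = np.exp(-(yy**2 + xx**2) / (2 * sigma**2))
    return psf / psf.sum()

rng = np.random.default_rng(7)
f_true = rng.random((32, 32))
psfs = [gaussian_psf(f_true.shape, 1.0), gaussian_psf(f_true.shape, 2.0)]
obs = [conv2_fft(f_true, h) + 0.005 * rng.standard_normal(f_true.shape) for h in psfs]

f_hat, beta = np.mean(obs, axis=0), 0.5
for _ in range(100):
    update = np.zeros_like(f_hat)
    for g, h in zip(obs, psfs):
        residual = g - conv2_fft(f_hat, h)
        update += conv2_fft(residual, h)        # h^T * residual; h is symmetric here, so h^T = h
    f_hat += beta * update
print("RMSE:", np.sqrt(np.mean((f_hat - f_true) ** 2)))
```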
Reference [12] focuses on the fundamental relationship between exposure time and im-
age characteristics discussed in Section 8.2. It introduces an imaging system shown in
Figure 8.4 which can be seen as an alternative to the system shown in Figure 8.3c. The
image obtained by the primary detector truly represents the color and brightness of the
original scene but the details are blurred. On the other hand, the image taken by the sec-
ondary detector possesses the image details which allow robust motion estimation. The
FIGURE 8.4
A conceptual design of a hybrid camera with primary and secondary detectors.
algorithm first estimates the global motion between successive frames by minimizing the
following optical flow-based error function [14]:
\arg\min_{(u,v)} \sum \left( \frac{\partial g}{\partial x} u + \frac{\partial g}{\partial y} v + \frac{\partial g}{\partial t} \right)^2, \qquad (8.8)
where g is the image captured by the secondary detector, ∂ g/∂ x and ∂ g/∂ y are the spatial
partial derivatives, ∂ g/∂ t is the temporal derivative of the image, and (u, v) is the instanta-
neous motion at time t. Each estimated motion trajectory is then interpolated and the PSF
is estimated with an assumption that the motion direction is the same as the blur direction.
Then, the RL deconvolution, an iterative method that updates the estimate at each iteration as

\hat{f}^{(k+1)} = \hat{f}^{(k)} \cdot \left( h^{T} \ast \frac{g}{h \ast \hat{f}^{(k)}} \right), \qquad (8.9)

is applied to the image captured by the primary camera. In the above equation, ∗ is the convolution operator, h is the PSF estimated using the multiple images from the secondary detector, and h^T is the flipped version of h. The initial estimate, \hat{f}^{(0)}, is set to g. Exper-
imental results reveal good deblurring performance even in situations where the camera
moves in an arbitrary direction [12]. Moreover, significant performance improvements can
be achieved using special hardware, suggesting that there is no need to cling to conventional camera
structures. Introducing a new or modified hardware structure allows efficiently solving the
problem. Additional examples can be found in Reference [15].
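The following is a minimal sketch of the Richardson-Lucy update of Equation 8.9 using FFT-based circular convolution; the synthetic Gaussian PSF (whose flipped version equals itself) and the small epsilon guarding the division are assumptions of the example, not part of Reference [12].

```python
# A minimal Richardson-Lucy sketch implementing the update of Equation 8.9.
import numpy as np

def conv2_fft(img, psf):
    return np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(psf, img.shape)))

def gaussian_psf(shape, sigma=1.5):
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    yy, xx = np.minimum(yy, shape[0] - yy), np.minimum(xx, shape[1] - xx)
    psf = np.exp(-(yy**2 + xx**2) / (2 * sigma**2))
    return psf / psf.sum()

def richardson_lucy(g, h, iterations=50, eps=1e-8):
    f_hat = g.copy()                                   # initial estimate f^(0) = g
    for _ in range(iterations):
        ratio = g / (conv2_fft(f_hat, h) + eps)
        f_hat = f_hat * conv2_fft(ratio, h)            # h^T * ratio (h is symmetric here)
    return f_hat

rng = np.random.default_rng(8)
f_true = rng.random((32, 32)) + 0.1                    # strictly positive image
h = gaussian_psf(f_true.shape)
g = conv2_fft(f_true, h)
f_rl = richardson_lucy(g, h)
print("RMSE:", np.sqrt(np.mean((f_rl - f_true) ** 2)))
```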
Reference [13] presents a multi-exposed image deblurring algorithm that requires only two images, a blurred and a noisy one, as shown in Figure 8.5. The blurred image, g_b,
is assumed to be obtained as follows:
gb = f ∗ h, (8.10)
which indicates that noise in the blurred image is negligible. In contrast, when the image is captured with a short exposure time, the original image f can be represented as:
f = gn + ∆ f , (8.11)
where gn is the noisy image and ∆ f is called the residual image. It should be noted here that
gn is not the captured noisy image but a scaled version of that image. The scaling is required to compensate for the exposure difference between the blurred and noisy images. Based
on the above modeling, h is first estimated from Equation 8.10 using the Landweber iter-
ative method [16] with Tikhonov regularization [17]. Unlike the conventional approaches,
gn can be used here as an initial estimate of f since gn is a very close approximation of f
compared to gb . Therefore, the PSF estimation accuracy is significantly improved.
The estimated PSF is not directly used for the image deconvolution. Instead of recov-
ering f, the residual image Δf is first recovered from the blurred image g_b. Combining Equations 8.10 and 8.11 results in the following:

Δg_b = g_b − g_n ∗ h = Δf ∗ h. \qquad (8.12)
Then, the estimated h is used to reconstruct ∆ f , as this method tends to produce fewer
deconvolution artifacts than estimating f directly. Also, the conventional RL deconvolution
is modified to suppress the ringing artifacts in smooth image regions. As can be seen in
Figure 8.5, this algorithm produces a clear image without ringing artifacts.
In summary, the key advantage of multi-exposed image deconvolution is the increased
accuracy of the PSF. When estimating the PSF, the noisy, but not blurred, images play an
important role. Using a more accurate PSF significantly improves the output image qual-
ity compared to the classical single-image deconvolution. The three techniques described
above represent the main directions in the area of multi-exposed image deconvolution. A
method of capturing multi-exposed images is still an open issue, and a new exposure time
setting associated with a proper deconvolution technique can further improve the perfor-
mance of conventional approaches.
It should be also noted that the conventional multi-exposed image deconvolution algo-
rithms still have some limitations. First, when the objects in a scene move, the linear shift
invariant (LSI) degradation model does not work any longer. In this case, the piecewise
LSI model [18], [19] or the linear shift variant degradation model (LSV) [20] should be
adopted instead of the LSI model. Second, even though the accuracy of the PSF is signif-
icantly improved by using information from multi-exposed images, the estimation error is
inevitable in most cases. Therefore, the deblurred output image is still distinguishable from
the ground truth image. Finally, the computational complexity of the PSF estimation and
image deconvolution methods is demanding, thus preventing these methods from being widely used in practical applications.
FIGURE 8.6
Color variations due to image blur: (a) original image, (b) blurred image, and (c) difference between the
original image and the blurred image.
The above constitutes a novel approach that converts the deblurring problem into a color
mapping problem. However, this method does not carefully address noise in an underex-
posed image. Since an image captured using a short exposure time tends to contain noise,
the color mapping result can still be noisy. Therefore, a robust mapping function needs to
be devised to improve the performance.
Reference [23] presents an image merging algorithm that requires three multi-exposed
images. The first image and the third image are captured using a long exposure time to
preserve brightness and color information, whereas the second image is captured with a
short exposure time to preserve edges and details. The objective of this technique is to
compensate the color and brightness loss of the second image using the other two images.
The flow of this algorithm is shown in Figure 8.7.
The algorithm starts with global motion estimation to find the corresponding regions be-
tween successively captured images. Since the camera or objects can move during the time
needed to capture three images, this step is necessary to remove a global mismatch between
images. A single parametric model is adopted to estimate global motion as follows:
\begin{bmatrix} x_l \\ y_l \\ s_l \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x_h \\ y_h \\ s_h \end{bmatrix}, \qquad (8.14)
where (xl , yl ) and (xh , yh ) are pixel positions in the underexposed image and the blurred
images, respectively. A scaling parameter s_l (s_h) is used to represent the position in homogeneous coordinates. Nine unknown parameters, h_{11} to h_{33}, are then estimated by
minimizing the intensity discrepancies between two images. However, since the brightness
FIGURE 8.7
The block diagram of the image merging-driven deblurring algorithm.
FIGURE 8.8
Bilateral optical flow estimation.
differences between the underexposed image and the two blurred images are usually sig-
nificant, the direct motion estimation between differently exposed images is not desired.
Therefore, the global motion is estimated using two blurred images as these are captured
using the same exposure time setting. Then, the motion between the first blurred image
and the underexposed image is simply assumed to be half of the estimated motion. Since
optical flow estimation follows this global motion estimation, the loss of accuracy by this
simplified assumption is not severe.
Since the color distribution of the underexposed image tends to occupy low levels, his-
togram processing is used to correct the color of the underexposed image based on the color
distribution of the corresponding regions in the long-exposed (blurred) images. However,
the noise in the underexposed image still remains. Therefore, for each pixel in the under-
exposed image, bilateral motion estimation is used to search for the correlated pixels in
the blurred images. This estimation is based on the assumption of linear motion between
two intervals. Unlike conventional optical flow estimation, the motion trajectory passing
through a point in the intermediate frame is found by comparing a pixel at the shifted posi-
tion in the first blurred image and one at the opposite position in the other blurred image as
shown in Figure 8.8. Finally, the output image is obtained by averaging the pixel values of
two corresponding pixels in the blurred images and the one in the underexposed image as
follows:
\hat{f}(u) = \lambda \hat{g}_l(u) + \frac{1-\lambda}{2} \left( g_{h,1}(u - v) + g_{h,2}(u + v) \right), \qquad (8.15)

where \hat{f} denotes the resultant image, \hat{g}_l denotes the scaled underexposed image, and g_{h,1} and g_{h,2} denote the two blurred images. The terms u and v are the vector representation
of the pixel position and its corresponding motion vector. The weighting coefficient λ is
determined by the characteristics of the image sensor.
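A minimal sketch of the merging step of Equation 8.15 is given below; for simplicity a single global motion vector v is assumed instead of the per-pixel bilateral motion field, and the weighting λ is chosen arbitrarily.

```python
# A minimal sketch of the merging step of Equation 8.15 with one global motion vector.
import numpy as np

def merge_exposures(g_l_scaled, g_h1, g_h2, v, lam=0.4):
    """f_hat(u) = lam * g_l(u) + (1 - lam)/2 * (g_h1(u - v) + g_h2(u + v))."""
    g1 = np.roll(g_h1, shift=v, axis=(0, 1))                 # g_h1 sampled at u - v
    g2 = np.roll(g_h2, shift=(-v[0], -v[1]), axis=(0, 1))    # g_h2 sampled at u + v
    return lam * g_l_scaled + 0.5 * (1.0 - lam) * (g1 + g2)

rng = np.random.default_rng(9)
scene = rng.random((32, 32))
v = (1, -2)                                                  # half of the global motion
g_h1 = np.roll(scene, shift=(-v[0], -v[1]), axis=(0, 1))     # frame 1: content displaced by -v
g_h2 = np.roll(scene, shift=(v[0], v[1]), axis=(0, 1))       # frame 2: content displaced by +v
g_l = scene + 0.05 * rng.standard_normal(scene.shape)        # noisy, scaled short exposure
f_hat = merge_exposures(g_l, g_h1, g_h2, v)
print("RMSE:", np.sqrt(np.mean((f_hat - scene) ** 2)))       # noise is attenuated by the averaging
```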
This algorithm, which can be seen as the first attempt to use motion estimation and com-
pensation concepts for deblurring, produces high-quality images. Since the output image is
generated by averaging two blurred images and one noisy image, most noise contributions
are reduced, if not completely eliminated. Although the computational complexity for esti-
mating global and local motion is demanding, this technique does not require any hardware
modifications, and produces a natural output image without visually annoying artifacts.
Figure 8.9 allows some comparison of the methods in References [22] and [23]. Three
test images, shown in Figures 8.9a to 8.9c, are respectively captured with a long, a short,
and a long exposure time. As mentioned above, the method in Reference [22] uses two
differently blurred images shown in Figures 8.9a and 8.9b, whereas the method in Refer-
ence [23] combines all three input images. As can be seen, the images output by both of
these algorithms do not suffer from image blur or restoration artifacts, which is a signifi-
cant advantage of this deconvolution-free approach. It can be further noted that the color
in Figure 8.9d is different from that of Figure 8.9a or Figure 8.9c. This is because noise in
Figure 8.9b prevents finding the optimal color mapping function.
Reference [24] approaches multi-exposed image deblurring using the assumption that
the multi-exposed images are captured within a normal exposure time and these images
are differently blurred. In other words, unlike previously cited techniques which rely on
noisy data for image details and edge information, this method uses only blurred images.
Figure 8.10 illustrates the basic concept. The original image is blurred in the diagonal, hor-
izontal, and vertical directions, respectively. Then, the frequency spectra of these blurred
images are found by applying the fast Fourier transform (FFT). Since motion blur behaves
as a directional low-pass filter, each spectrum loses high frequency components depending
on blur direction. In other words, each spectrum tends to have mutually different partial
information of the original image spectrum. Therefore, it is possible to gather this informa-
tion for compensating the loss of frequency components.
The merging operation is performed in the frequency domain via fuzzy projection onto
convex sets (POCS). Based on the fact that image blur exhibits the low-pass characteristics
in the frequency domain regardless of the blur type, the reliability of frequency coefficients
FIGURE 8.10
Frequency spectra of blurred images: (top) three differently blurred images, and (bottom) magnitude of their
frequency spectra.
FIGURE 8.11
Procedure of the projection onto the fuzzy convex sets. © 2009 IEEE
reduces as the frequency increases. Therefore, only low-frequency regions of each spec-
trum are considered as convex sets. However, projection onto these sets cannot recover
enough high-frequency parts. To this end, the convex sets can be expanded through fuzzi-
fication [26], [27]. The fuzzification process is illustrated in Figure 8.11. Each ellipse
represents the convex set and the arrow indicates a projection operation. Both projection
and fuzzification are repeated until the predefined criterion is reached. By this approach,
the mutually different information about the scene can be efficiently combined. Since this
method merges all the available information from multi-exposed images, the quality im-
provement is limited if multi-exposed images are strongly correlated to each other.
FIGURE 8.12
Deblurring using real-life blurred images: (a-c) three test images, and (d) the deblurred image.
Figure 8.12 shows the performance of the method in Reference [24] on real-life images.
For the sake of visual comparison, small patches of 400 × 400 pixels were cropped from
1024 × 1024 test images. The image registration algorithm from Reference [28] is applied
to all test images to match the image grids. Figure 8.12d shows the combined result when
Figure 8.12a is used as a reference image for image registration. As can be seen, the output image successfully merges the available information from the multiple blurred images.
This section explored various multi-exposed image deblurring algorithms. Since each algorithm uses its own experimental conditions, a direct comparison of these algorithms is not feasible. Instead, the conceptual differences and the characteristics of the output images need to be understood. In multi-exposed image deconvolution, the underexposed (noisy) image is used to estimate the PSF with higher accuracy. By using additional images, the parametric assumption of the PSF [9], [29] and computationally complex blind
deconvolution [30], [31] are not necessary. Also, all the existing techniques developed for
single-image deconvolution can be generalized for multi-exposed image deconvolution. In
multi-exposed image deblurring without deconvolution, no typical degradation model is
required. Therefore, this approach can be more generally used in practical applications. Compared to the first approach, it also leaves considerable room for improvement. For example, by changing the image capturing method and/or applying techniques used in other fields, such as histogram mapping and motion-compensated prediction, the performance can be further improved.
8.4 Conclusion
For several decades, researchers have tried to solve a troublesome deblurring problem.
Due to the fundamental difficulty — the lack of available information — the performance
of conventional single image deblurring methods is rather unsatisfactory. Therefore, multi-
exposed image deblurring is now considered a promising breakthrough. Thanks to recent
advances in image sensing technology, the amount of available information about the orig-
inal scene is significantly increased. The remaining problem is how to use this information
effectively. By referring to the image capturing methods and the conventional multi-exposed deblurring algorithms described in this chapter, the reader can freely develop new solutions without having to adhere to the classical deblurring approach any longer.
Other camera settings such as ISO and aperture can be also tuned to generate images
with different characteristics. Therefore, more general image capturing methods and de-
blurring solutions can be designed by considering these additional factors. In addition to
deblurring, multi-exposed images have another application. Based on the fact that a set of
multi-exposed images can capture a wider dynamic range than a single image, a dynamic
range improvement by using multi-exposed images is also a promising research topic. By
jointly considering these two image processing operations, the ultimate goal of truly repro-
ducing the real scene can be reached in the near future.
Acknowledgment
Figure 8.2 is reprinted from Reference [23] and Figure 8.11 is reprinted from Refer-
ence [24], with the permission of IEEE.
References
[1] E.R. Fossum, “Active pixel sensors: Are CCDs dinosaurs?,” Proceedings of SPIE, vol. 1900,
pp. 2–14, February 1993.
[2] N. Stevanovic, M. Hillegrand, B.J. Hostica, and A. Teuner, “A CMOS image sensor for high
speed imaging,” in Proceedings of the IEEE International Solid-State Circuits Conference,
San Francisco, CA, USA, February 2000, pp. 104–105.
[3] O. Yadid-Pecht and E. Fossum, “Wide intrascene dynamic range CMOS aps using dual sam-
pling,” IEEE Transactions on Electron Devices, vol. 44, no. 10, pp. 1721–1723, October 1997.
[4] P.A. Jansson, Deconvolution of Image and Spectra. New York: Academic Press, 2nd edition,
October 1996.
[5] Y.L. You and M. Kaveh, “A regularization approach to joint blur identification and image
restoration,” IEEE Transactions on Image Processing, vol. 5, no. 3, pp. 416–428, February
1996.
[6] M.R. Banham and A.K. Katsaggelos, “Digital image restoration,” IEEE Signal Processing
Magazine, vol. 14, no. 2, pp. 24–41, March 1997.
[7] R.L. Lagendijk, J. Biemond, and D.E. Boekee, “Identification and restoration of noisy blurred
images using the expectation-maximization algorithm,” IEEE Transactions on Acoustics,
Speech, and Signal Processing, vol. 38, no. 7, pp. 1180–1191, July 1990.
[8] A.K. Katsaggelos, “Iterative image restoration algorithms,” Optical Engineering, vol. 28,
no. 7, pp. 735–748, July 1989.
[9] G. Pavlović and A. M. Tekalp, “Maximum likelihood parametric blur identification based on
a continuous spatial domain model,” IEEE Transactions on Image Processing, vol. 1, no. 4,
pp. 496–504, October 1992.
[10] D.L. Tull and A.K. Katsaggelos, “Iterative restoration of fast-moving objects in dynamic im-
age sequences,” Optical Engineering, vol. 35, no. 12, pp. 3460–3469, December 1996.
[11] A. Rav-Acha and S. Peleg, “Two motion-blurred images are better than one,” Pattern Recog-
nition Letters, vol. 26, no. 3, pp. 311–317, February 2005.
[12] M. Ben-Ezra and S.K. Nayar, “Motion-based motion deblurring,” IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 689–698, June 2004.
[13] L. Yuan, J. Sun, L. Quan, and H.-Y. Shum, “Image deblurring with blurred / noisy image
pairs,” ACM Transactions on Graphics, vol. 26, no. 3, July 2007.
[14] B.D. Lucas and T. Kanade, “An iterative image registration technique with an application to
stereo vision,” Defense Advanced Research Projects Agency, DARPA81, pp. 121–130, 1981.
[15] R. Raskar, A. Agrawal, and J. Tumblin, “Coded exposure photography: Motion deblurring
using fluttered shutter,” ACM Transactions on Graphics, vol. 25, no. 3, pp. 795–804, July
2006.
[16] H.W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems. Dordrecht, The
Netherlands: Kluwer Academic, March 2000.
[17] A. Tarantola, Inverse Problem Theory. Philadelphia, PA: Society for Industrial and Applied
Mathematics, December 2004.
[18] J. Bardsley, S. Jefferies, J. Nagy, and R. Plemmons, “A computational method for the restora-
tion of images with an unknown, spatially-varying blur,” Optics Express, vol. 14, no. 5,
pp. 1767–1782, March 2006.
[19] J. Bardsley, S. Jefferies, J. Nagy, and R. Plemmons, “Variational pairing of image segmenta-
tion and blind restoration,” in Proceedings of the European Conference on Computer Vision,
Prague, Czech Republic, May 2004, pp. 166–177.
[20] M.K. Ozkan, A.M. Tekalp, and M.I. Sezan, “POCS-based restoration of space-varying blurred
images,” IEEE Transactions on Image Processing, vol. 3, no. 4, pp. 450–454, July 1994.
[21] Z. Wei, Y. Cao, and A.R. Newton, “Digital image restoration by exposure-splitting and regis-
tration,” in Proceedings of the International Conference on Pattern Recognition, Cambridge,
UK, August 2004, vol. 4, pp. 657–660.
[22] J. Jia, J. Sun, C.K. Tang, and H.-Y. Shum, “Bayesian correction of image intensity with spa-
tial consideration,” in Proceedings of the European Conference on Computer Vision, Prague,
Czech Republic, May 2004, pp. 342–354.
[23] B.D. Choi, S.W. Jung, and S.J. Ko, “Motion-blur-free camera system splitting exposure time,”
IEEE Transactions on Consumer Electronics, vol. 54, no. 3, pp. 981–986, August 2008.
[24] S.W. Jung, T.H. Kim, and S.J. Ko, “A novel multiple image deblurring technique using fuzzy
projection onto convex sets,” IEEE Signal Processing Letters, vol. 16, no. 3, pp. 192–195,
March 2009.
[25] X. Liu and A.E. Gamal, “Simultaneous image formation and motion blur restoration via mul-
tiple capture,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and
Signal Processing, Salt Lake City, UT, USA, May 2001, vol. 3, pp. 1841–1844.
[26] S. Oh and R.J. Marks, “Alternating projection onto fuzzy convex sets,” in Proceedings of IEEE
International Conference on Fuzzy Systems, San Francisco, CA, USA, March 1993, pp. 1148–
1155.
[27] M.R. Civanlar and H.J. Trussell, “Digital signal restoration using fuzzy sets,” IEEE Trans-
actions on Acoustics, Speech, and Signal Processing, vol. 34, no. 4, pp. 919–936, August
1986.
[28] B.S. Reddy and B.N. Chatterji, “An FFT-based technique for translation, rotation, and
scale-invariant image registration,” IEEE Transactions on Image Processing, vol. 5, no. 8,
pp. 1266–1271, August 1996.
[29] D.B. Gennery, “Determination of optical transfer function by inspection of frequency-domain
plot,” Journal of the Optical Society of America, vol. 63, no. 12, pp. 1571–1577, December 1973.
[30] A. Levin, “Blind motion deblurring using image statistics,” in Proceedings of the Twentieth
Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, De-
cember 2006, pp. 841–848.
[31] G. Harikumar and Y. Bresler, “Perfect blind restoration of images blurred by multiple fil-
ters: Theory and efficient algorithms,” IEEE Transactions on Image Processing, vol. 8, no. 2,
pp. 202–219, February 1999.
9
Color High-Dynamic Range Imaging: Algorithms for
Acquisition and Display
9.1 Introduction
Dynamic range is the term used in many fields to describe the ratio between the highest
and the lowest value of a variable quantity. In imaging and especially in display technology,
dynamic range is also known as the contrast ratio or, simply contrast, which denotes the
brightness ratio between black and white pixels visible on the screen at the same time. For
natural scenes the dynamic range is the ratio between the density of luminous intensity
of the brightest sunbeam and the darkest shadow. In photography, the unit of luminance,
cd/m² (candelas per square meter), is also known as a “nit.”
The problem of high-dynamic range (HDR) imaging is two-fold: i) how to capture the
true luminance and the full chromatic information of an HDR scene, possibly with a captur-
ing device with a dynamic range smaller than the scene’s, and ii) how to faithfully represent
this information on a device that is capable of reproducing the actual luminances of neither
the darkest nor the brightest spots, nor the true colors. The first half of the issue is com-
monly referred to as high-dynamic range composition or recovery and the latter as com-
pression or, more commonly, tone mapping. In this chapter the terms used are composition
for the acquisition process and tone mapping for the displaying part.
Figure 9.1 illustrates the HDR imaging problem. Namely, Figure 9.1a shows that the
conventional low-dynamic range (LDR) snapshot has a very limited dynamic range, which
is seen as clipped bright parts as well as underexposed dark parts. The HDR representation
shown in Figure 9.1b has all the information but only a fraction of it is visible on an LDR
media, such as the printout. The tone mapped HDR image shown in Figure 9.1c boasts the
greatest amount of visible information.
HDR images can be acquired by capturing real scenes or by rendering three-dimensional
(3D) computer graphics, using techniques like radiosity and ray tracing. HDR images of
natural scenes can be acquired either by capturing multiple images of the same scene with
different exposures or utilizing only recently introduced special cameras able to directly
capture HDR data [1]. The multiple exposure method relies on combining the different
exposures into a single image, spanning the whole dynamic range of the scene [2], [3]. The
image composition step requires a preliminary calibration of the camera response [3], [4].
Though HDR capture and display hardware are both briefly discussed, the focus of this
chapter is on the approach of multiple capture composition.
The second part of the problem is to appropriately tone map the so-obtained HDR image
back to an LDR display. Tone mapping techniques range in complexity from a simple
gamma-curve to sophisticated histogram equalization methods and complicated lightness
perception models [5], [6], [7]. Recently, display devices equipped with extended dynamic
range have also started to appear [8].
HDR imaging methods have been originally developed for RGB color models. However,
techniques working in luminance-chrominance spaces seem more meaningful and prefer-
able for a number of reasons:
• Any HDR technique operating in RGB space requires post-composition white bal-
ancing since the three color channels undergo parallel transformations. While the
white balancing would yield perceptually convincing colors, they might not be the
true ones. For the sake of hue preservation and better compression, it is beneficial
to opt for a luminance-chrominance space, even if the input data is in RGB (for
instance, an uncompressed TIFF image).
• The luminance channel, being a weighted average of the R, G, and B channels, enjoys
a better signal-to-noise ratio (SNR), which is crucial if the HDR imaging process
takes place in noisy conditions. In this chapter, the problem of HDR imaging in
a generic luminance-chrominance space is addressed, efficient algorithms for HDR
image composition and tone mapping are presented, and the functionality of such an
approach under various conditions is studied.
In this chapter, Section 9.2 provides general information concerning color spaces and
the notation used in the chapter. Techniques essential to HDR acquisition, namely camera
response calibration, are discussed in Section 9.3. A brief overview of HDR sensors is also
included in Section 9.3. Section 9.4 focuses on the creation of HDR images; techniques for
both monochromatic and color HDR are presented. Section 9.5 deals with displaying HDR
images; display devices capable of presenting HDR content are also discussed. Section 9.6
contains examples and discussion. Conclusions are drawn in Section 9.7.
generate the physical sensation of different colors, which is combined by the human visual
system (HVS) to form the color image. From this it is intuitive to describe color with
three numerical components. All the possible values of this three-component vector form a
space called a color space or color model. The three components can be defined in various
meaningful ways, which leads to definition of different color spaces [1], [9].
is also used in television broadcasting, it fits the definition of an opponent color space because it consists of a luminance channel and two color difference channels. As a side note
it can be mentioned that the usual scheme for compressing a YUV signal is to downsample
each chromatic channel with a factor of two or four so that chromatic data occupies at most
half of the total video bandwidth. This can be done without apparent loss of visual quality
as the human visual system is far less sensitive to spatial details in the chrominances than
in the luminance. The same compression approach is also utilized among others in the
well-known JPEG (Joint Photographic Experts Group) image compression standard.
The transformation matrices for the YUV color space are
\[
A_{YUV} = \begin{bmatrix} 0.30 & -0.17 & 0.50 \\ 0.59 & -0.33 & -0.42 \\ 0.11 & 0.50 & -0.08 \end{bmatrix}
\quad\text{and}\quad
B_{YUV} = A_{YUV}^{-1} = \begin{bmatrix} 1 & 0 & 1.4020 \\ 1 & -0.3441 & -0.7141 \\ 1 & 1.7720 & 0 \end{bmatrix}. \qquad (9.2)
\]
Because of the similar nature of the color models, the orthogonality properties given in
the previous section for opponent color space hold also for the YUV space. One should note
that these properties do not depend on the orthogonality of the matrix A (in fact the three
columns of AYUV are not orthogonal), but rather on the orthogonality between a constant
vector and the second and third columns of the matrix.
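The following sketch (Python, with illustrative function names) applies the transform of Equation 9.2 treating pixels as row vectors, ζ = zA; the inverse is computed numerically, which keeps the forward and backward transforms consistent whatever the row/column orientation of the printed B_YUV. The last line checks the orthogonality property noted above: a constant gray pixel yields zero chrominance.

import numpy as np

# Forward RGB-to-YUV matrix from Equation 9.2, used with pixels as row vectors
# (zeta = z A); the inverse is computed numerically rather than hardcoded.
A_YUV = np.array([[0.30, -0.17,  0.50],
                  [0.59, -0.33, -0.42],
                  [0.11,  0.50, -0.08]])
B_YUV = np.linalg.inv(A_YUV)

def rgb_to_yuv(z):
    # z: array of shape (..., 3) holding RGB pixels as row vectors.
    return z @ A_YUV

def yuv_to_rgb(zeta):
    return zeta @ B_YUV

# Columns 2 and 3 of A_YUV sum to zero, so a gray pixel carries no chrominance.
assert np.allclose(np.ones(3) @ A_YUV, [1.0, 0.0, 0.0], atol=0.01)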
can be interpreted as the angular component and the length of a planar vector. Thus the
triplet Y , H, and S corresponds to a luminance-chrominance representation with respect to
cylindrical coordinates. It is noted that multiple definitions of both hue and saturation can
be found in the literature [12] and while the presented ones are among the more frequently
noted ones, they are chosen here first and foremost for the sake of simplicity.
where f is the function describing the response of the output to the given exposure. There-
fore, the irradiance can be obtained from the pixel output by the formal expression as
follows [3]:
ln E(x) = g(z (x)) − ln ∆t, (9.4)
where g(·) = ln f −1 (·) is the inverse camera response function. This function can be esti-
mated from a set of images of a fixed scene captured with different exposure times. Images
for the calibration sequence are assumed to be perfectly aligned, shot under constant il-
lumination, and contain negligible noise. For the camera response function calibration, a
much larger set of images (i.e., a denser set of exposures) than typically available for the
HDR composition, is used. The camera response function is estimated (calibrated) only
once for each camera. The estimated function can then be used for the linearization of the
input values in all subsequent HDR compositions of the same device.
Because of underexposure (which produces dramatically low SNR) and overexposure
(which results in clipping the values which would otherwise exceed the dynamic range)
not all pixels from the given set of images should be used for the calibration of the camera
response function, nor for the subsequent composition, with equal weights. Near the min-
imum of the value interval of an imaging device, the information is distorted by numerous
noise components, most influential of which is the Poisson photon shot noise [13]. As the
number of photons hitting a well on the sensor within a given time interval is a random
process with Poisson distribution, at low light levels the variations in the number of pho-
tons become substantial in comparison to the mean levels, effectively rendering the current
created in the well completely unreliable. In addition to this, components like thermal and
read-out noise play some role in corrupting the low-level sensor output. In the vicinity of
FIGURE 9.2
Weight function w_cam(r) used for the luminance values in the camera response function definition.
the maximum of the value interval, clipping starts to play its role. Naturally, a saturated pixel can only convey a very limited amount of information about the scene. The described
phenomena motivate the penalization of underexposed and overexposed pixels in both the
camera response calibration and the HDR image composition phase. Additionally, only
pixels having monotonically strictly increasing values between underexposure and over-
exposure, throughout the sequence, can be considered to be valid in the camera response
calibration. This is a natural idea, since the pixels used for response calibration are assumed
to come from a sequence shot with increasing exposure time. As the scene is assumed to
be static and the pixel irradiance constant, the only deviation from a strictly monotonic increase in pixel value can come from noise. Noisy pixels should not be used for camera response calibration, as an erroneous response function would induce a systematic error in the
subsequent HDR compositions. Using these criteria for pixel validity, the function g can
be fitted by minimizing the quadratic objective function [3]
\[
\sum_{j=1}^{P}\sum_{i=1}^{N} w_{cam}\!\left(\zeta_i^{Y}(x_j)\right)\left[\,g\!\left(\zeta_i^{Y}(x_j)\right) - \ln E(x_j) - \ln \Delta t_i\,\right]^2 + \lambda \int_0^1 w_{cam}(\zeta)\, g''(\zeta)^2\, d\zeta, \qquad (9.5)
\]
where P is the number of pixels used from each image and wcam is a weight function lim-
iting the effect of the underexposed and overexposed pixels. The regularization term on
the right-hand side uses a penalty on the second derivative of g to ensure the smoothness
of the solution. In this chapter, the preference is to use a relatively high pixel count (e.g.,
1000 pixels) in order to minimize the need for regularization in the data fitting. The system
of equations is solved by employing singular-value decomposition, in a fashion similar to
that of Reference [3]. If, instead of directly processing one of the sensor’s RGB outputs, a combination ζ_i^Y (e.g., luminance) is considered, it should be noted that some pixels in this combination can include overexposed components without reaching the upper limit of the range of the combination. Such pixels have to be penalized in the processing. To ensure a more powerful attenuation of the possibly overexposed pixels, instead of a simple triangular hat function (e.g., as in Reference [3]), an asymmetric function of the form $w_{cam}(\rho) = \rho^{\alpha}(1-\rho)^{\beta}$, where $0 \le \rho \le 1$ and $1 \le \alpha < \beta$, is used. An example of such a
FIGURE 9.3
The estimated inverse response function g for the luminance channel of the Nikon COOLPIX E4500 (log exposure versus luminance pixel value).
weight function is given in Figure 9.2. An example of a camera response function for the
luminance solved with this method is illustrated in Figure 9.3. When looking at the camera
response one can observe an abrupt peak towards the right end of the plot. This is a result
of the instability of the system for overexposed pixels. It becomes evident that such pixels
cannot be directly included in the subsequent composition step. Instead, they have to be
penalized with weights. As discussed in Reference [14], the issue of overexposure affecting pixels whose value is below the maximum also arises in the conventional approaches operating in the RGB space.
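A minimal sketch of such an asymmetric weight is given below; the exponents α = 2 and β = 4 are illustrative choices only, since the text merely requires 1 ≤ α < β.

import numpy as np

def w_cam(rho, alpha=2.0, beta=4.0):
    # Asymmetric weight rho**alpha * (1 - rho)**beta with 1 <= alpha < beta,
    # penalizing overexposed pixels (rho near 1) more strongly than dark ones.
    rho = np.clip(rho, 0.0, 1.0)
    return rho**alpha * (1.0 - rho)**beta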
proves the dynamic range of the captured photograph by including two sensor wells for
each pixel on the sensor. The two wells have different sensitivities, where the less sensitive
well starts reacting only when the normal well is saturated. The dynamic range improve-
ment takes place at the bright end of the scene and the total dynamic range is according
to Fuji comparable to traditional film. Another example of HDR imaging in a consumer
camera is the Pentax K-7. Though not featuring a true HDR sensor, their 2009 released
DSLR is probably the first commercial camera to provide in-camera HDR capture and tone
mapping. An automated bracketing and composition program is made available, and the
user is presented with a tone mapped HDR image.
FIGURE 9.4
The block diagram of the general multiple exposure HDR imaging pipeline. Different paths indicate alter-
natives in the process, such as displaying a tone mapped HDR image on a normal display or alternatively
displaying an HDR image on an HDR capable display.
With the camera response function solved as described in Section 9.3.1 and the exposure times for each captured frame known, the logarithmic HDR radiance map for a monochromatic channel C can then be composed as a weighted sum of the camera output pixel values as follows:
\[
\ln E^{C}(x_j) = \frac{\sum_{i=1}^{N} w^{C}\!\left(z_i(x_j)\right)\left(g^{C}\!\left(z_i(x_j)\right) - \ln \Delta t_i\right)}{\sum_{i=1}^{N} w^{C}\!\left(z_i(x_j)\right)}.
\]
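A minimal sketch of this composition is given below; the function names, the normalization of the frames to [0, 1], the assumption that the frames are already aligned, and the small guard against a zero weight sum are illustrative choices, not part of the original formulation.

import numpy as np

def compose_hdr_channel(frames, exposure_times, g, w, eps=1e-12):
    # frames: list of N aligned LDR images of one channel, values in [0, 1];
    # g: inverse camera response, g(z) = ln f^{-1}(z); w: weight function.
    num = np.zeros_like(frames[0], dtype=np.float64)
    den = np.zeros_like(frames[0], dtype=np.float64)
    for z, dt in zip(frames, exposure_times):
        wz = w(z)
        num += wz * (g(z) - np.log(dt))
        den += wz
    ln_E = num / np.maximum(den, eps)   # logarithmic radiance map
    return np.exp(ln_E)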
In the case of color images, the RGB channels are treated in parallel. This assumes that the interactions between channels are negligible, which is, as admitted in Reference [3], problematic to defend. As a result of the parallel treatment of the three channels, color distortions are in many cases introduced into the composed image. These have to be corrected by
post-composition white-balancing which in turn may lead to colors that are not faithful to
the original ones. Nevertheless, the approach works well enough to produce satisfactory
results provided that the source sequence is very well aligned and noise is negligible.
FIGURE 9.5
Weight function w_Y(r) used for the luminance channel composition.
Because of the nature of the camera response function g, the HDR luminance is obtained in logarithmic scale. After applying the natural exponential, the resulting values are positive, normally spanning $[10^{-4}, 10^{4}]$, thus truly constituting a high dynamic range.
FIGURE 9.6
Weight function w_UV(S) = S^a used for the composition of the chrominance channels.
FIGURE 9.7
The illustration of the desaturating effect of the luminance-chrominance HDR composition.
be the scalar proportionality factor between the HDR luminance ζ̃ Y and the weighted av-
erage of the LDR luminances ζiY (x) with weights wUV (Si (x)). In other words, µ (x) is a
pixelwise scaling parameter defining the ratio between the weighted average of the original
pixel values and the pixel irradiance values obtained through the HDR composition pro-
cess. The HDR image in RGB space can now be obtained by the normalized inverse color
transformation
\[
\tilde{z}(x) = \tilde{\zeta}(x)\begin{bmatrix} 1 & 0 & 0 \\ 0 & \mu(x) & 0 \\ 0 & 0 & \mu(x) \end{bmatrix} B.
\]
The problem of tone mapping is essentially one of compression with preserved visibility.
A linear scene-to-output mapping of an HDR image produces results similar to the image
shown in Figure 9.9a. The tone mapped version of the same scene, using the method
described in Section 9.5.2.1, is shown in Figure 9.9b. As can be seen, an HDR image is
useless on an LDR display device without tone mapping.
The history of tone mapping dates back much further than the introduction of HDR imaging. As the dynamic range of film has always exceeded that of the photographic paper of the era, manual control has been necessary in the development process. The
darkroom was introduced in the late 19th century and since that time, a technique called
dodging and burning has been used to manipulate the exposures of the photographic print.
The same idea has later been transported into controlling the visibility of digital HDR
images. In this context the operation is known as tone mapping. The very basic aim of tone
mapping or tone reproduction (the two terms are interchangeable) is to provoke the same
response through the tone mapped image that would be provoked by the real scene. In other
words, matching visibility. In the darkroom the procedure was of course done manually,
but for digital purposes the ideal solution would fit every problem and not need any human
interaction. So far, the optimal tone mapping method has not been invented, and one has to choose the method depending on the problem. Many of the methods require parameter
adjustment, while some are automatic. The results of the tone mapping depend highly
on the chosen method as well as the parameters used. New tone mapping approaches are
introduced almost monthly and the aim of the scientific community is, as usual, to develop
more and more universal approaches well suited for the majority of problems. Some of the
major tone mapping algorithms were reviewed in Reference [6].
The global methods are the simplest class of tone mapping operators (TMOs). They all share a couple of inherent properties: the same mapping is applied to all pixels, and the tone mapping curve is always monotonic. The mapping has to be monotonic in order to avoid disturbing artifacts. This imposes a great limitation on the achievable combination of compression and visibility preservation. As the usual target is to map an HDR scene into the range of the standard eight-bit representation, only 256 distinct brightness values are available. Global methods generally excel in their low computational complexity. As the same mapping is applied to all pixels, the operations can be performed very efficiently. On the other hand, for scenes with very high dynamic range, the compression ability of the global class may not be sufficient.
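As a concrete, if generic, illustration of the global class, the sketch below applies a single monotonic logarithmic curve to the HDR luminance; it is not one of the operators described in Section 9.5.2.

import numpy as np

def global_log_tmo(Y):
    # One monotonic, image-wide curve applied to every pixel: logarithmic
    # compression of the HDR luminance Y into [0, 1].
    Y = np.maximum(Y, 0.0)
    return np.log1p(Y) / np.log1p(Y.max() + 1e-12)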
The local methods are able, to an extent, to escape the limitations met with global TMOs.
In general, local methods do not rely on image-wide statistics. Instead, every pixel is com-
pressed depending on its luminance value and the values of a local neighborhood. Often
local methods try to mimic properties of the HVS; the eye is known to focus locally on an
area of a scene forming an independent adaptation based on the local neighborhood con-
tents. As a result, the cost of the more flexible compression is increased computational complexity. The number of necessary computations grows with the number of local adaptation neighborhoods. Also, the stronger local compression of scene brightness may at times lead to halo-like artifacts around objects.
The transform domain operators are distinguished from global as well as local meth-
ods by the fact that they operate on the data in some domain other than the conventional
spatial one. Frequency domain operators compress data, as the name suggests, utilizing
a frequency-dependent scheme. The first digital tone mapping operator was already pub-
lished in 1968 and it was a frequency domain one [20]. Many of the properties of modern
frequency domain operators are inherited from this approach. Gradient domain operators
rely on the notion that a discrimination between illuminance and reflectance is for many
scenes relatively well approximated by the gradient approach. This is supported by the
notion that an image area with a high dynamic range usually manifests a large gradient
between neighboring pixels. The result is a tone mapping operator working in the gradient domain, using gradient manipulation for dynamic range reduction.
The majority of tone mapping methods work on the spatial domain and are therefore cat-
egorized under either local or global umbrella, depending on the nature of the compression.
Finally it is noted that as there are numerous methods for tone mapping of HDR scenes and
their basic functionality is, apart from the core idea of range compression, very different
from one method to another, it is not meaningful to give a detailed description of an exam-
ple implementation. All the methods are well described in the literature and the methods
implemented for the luminance-chrominance approach of this thesis are described in detail
in Section 9.5.2.
FIGURE 9.10
The block diagram of the anchoring-based tone mapping process. The blocks represent the four central stages
of the procedure.
is a luminance image with range [0, 1], that is, T (ζ̃ Y ) (·) ∈ [0, 1]. As for the chromatic
channels, a simple, yet effective approach is presented.
For the compression of the luminance channel, two global luminance range reduction
operators are presented. The selection is limited to global operations simply because thus
far, local operators have not been able to produce results faithful to the original scene. It
should be noted that the majority of tone mapping methods developed for RGB can be
applied more or less directly for the compression of the luminance channel as if it were a
grayscale HDR image. As such, the continuous development of tone mapping methods for
RGB HDR images also benefits the compression of luminance-chrominance HDR data.
That the described method allows no manual tuning serves as both a drawback and a benefit; for most images the majority of the scene is made visible in a believable manner, but for images containing an extremely high dynamic range, the lack of global contrast may at times lead to results appearing slightly too flat.
FIGURE 9.11
Illustration of the definition of the chromatic tone mapping parameter δ.
\[
\mathring{z}_{gray}(x) = \begin{bmatrix} \mathring{z}^{R}_{gray}(x) & \mathring{z}^{G}_{gray}(x) & \mathring{z}^{B}_{gray}(x) \end{bmatrix} = \begin{bmatrix} T\!\left(\tilde{\zeta}^{Y}(x)\right) & 0 & 0 \end{bmatrix} B, \qquad (9.9)
\]
\[
z_{chrom}(x) = \begin{bmatrix} z^{R}_{chrom}(x) & z^{G}_{chrom}(x) & z^{B}_{chrom}(x) \end{bmatrix} = \begin{bmatrix} 0 & \tilde{\zeta}^{U}(x) & \tilde{\zeta}^{V}(x) \end{bmatrix} B. \qquad (9.10)
\]
It can be noted that z̊gray(x) is truly a gray image because in RGB to luminance-
chrominance transforms b1,1 = b1,2 = b1,3 . Then a map δ ≥ 0 is needed, such that
and δ^G and δ^B are defined analogously. Thus, δ(x) is the largest scalar smaller than or equal to one for which the condition in Equation 9.11 holds. Figure 9.11 illustrates the
definition of δ (x). From the figure it is easy to realize that the hue, that is, the angle of the
vector, of z̊ (x) is not influenced by δ , whereas the saturation is scaled proportionally to it.
Roughly speaking, the low-dynamic range image z̊ (x) has colors which have the same hue
as those in the HDR image ζ̃ and which are desaturated as little as is needed to fit within
the sRGB gamut.
It is now obvious that the tone mapped LDR image can be defined in luminance-
chrominance space as follows:
\[
\mathring{\zeta}(x) = \begin{bmatrix} T\!\left(\tilde{\zeta}^{Y}(x)\right) & \delta(x)\,\tilde{\zeta}^{U}(x) & \delta(x)\,\tilde{\zeta}^{V}(x) \end{bmatrix}. \qquad (9.13)
\]
The tone mapped luminance-chrominance image ζ̊ can be compressed and stored directly
with an arbitrary method (for example, DCT-based compression, as in JPEG), and for dis-
play transformed into RGB using the matrix B. It is demonstrated that this approach yields
lively, realistic colors.
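A possible implementation of the chromatic scaling is sketched below, assuming that the condition of Equation 9.11 is the gamut constraint 0 ≤ z̊_gray(x) + δ(x) z_chrom(x) ≤ 1 suggested by Figure 9.11; the function name and the numerical guard are illustrative.

import numpy as np

def chroma_scale(z_gray, z_chrom, eps=1e-12):
    # z_gray, z_chrom: arrays of shape (..., 3) holding the RGB components of
    # Equations 9.9 and 9.10. Returns delta(x) in [0, 1]: the largest factor
    # keeping z_gray + delta * z_chrom inside the RGB gamut [0, 1].
    limit = np.ones_like(z_chrom)
    pos = z_chrom > eps          # channels pushed towards the upper bound
    neg = z_chrom < -eps         # channels pushed towards the lower bound
    limit[pos] = (1.0 - z_gray[pos]) / z_chrom[pos]
    limit[neg] = (0.0 - z_gray[neg]) / z_chrom[neg]
    return np.clip(limit.min(axis=-1), 0.0, 1.0)

The tone mapped pixel is then assembled as in Equation 9.13 from T(ζ̃^Y(x)) and the δ-scaled chrominances.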
9.6 Examples
This section presents examples of HDR scenes imaged and visualized with methods de-
scribed in this chapter. In the experiments, both real and synthetic images are used. Special
attention is focused on the effects of noise and misalignment, both realistic components of
distortion when HDR scenes are imaged with off-the-shelf equipment. Extensive perfor-
mance comparisons against state-of-the-art RGB methods can be found in Reference [14].
RGB space can be used as ground truth for subsequent synthetic composition examples.
The HDR image is assumed to represent the true scene irradiance E, with an HDR pixel
value E(x) = 1 equal to a scene irradiance of 1 W/m². Where comparison with RGB
techniques is provided, the tone mapping method described in Reference [22] is used to
process the images composed both in RGB and in luminance-chrominance to allow for
relevant visual comparison of the results.
TABLE 9.1
Measured average LED position variance for the test sequences.

Sequence                     σ²_LED
Reference                    0
1st degraded, no noise       0.2289
2nd degraded, added noise    0.2833
Only added noise             0
3rd degraded, no noise       2.4859
4th degraded, added noise    1.5339
(LEDs, used as markers in the HDR scene) over the sequence of frames. In particular,
for each sequence of LDR frames, the standard deviation of the position of the center of
each of the three LEDs was computed. The numbers reported in Table 9.1 are the average variances of the positions of the three LEDs.
A synthetic LDR image sequence is obtained by simulating acquisition of the above
HDR image into LDR frames. The LDR frames are produced using Equations 9.3 and 9.4
applied separately to the R, G, and B channels. The camera response function f used in the
equations is defined as follows:
\[
f(e(x)) = \max\left\{0,\, \min\left\{1,\, \left\lfloor \left(2^{8}-1\right)\kappa\, e(x) \right\rceil \left(2^{8}-1\right)^{-1} \right\}\right\}, \qquad (9.14)
\]
where κ = 1 is a fixed factor representing the acquisition range of the device (full well) and the ⌊·⌉ brackets denote rounding to the nearest integer, thus expressing 8-bit quantization. This is a simplified model which, modulo the quantization operation, corresponds to a linear response of the LDR acquisition device.
Noise is also introduced to the obtained LDR images using a signal-dependent noise model of the sensor [24], [25]. More precisely, noise corrupts the term κe(x) in the acquisition formula, Equation 9.14, which thus becomes
\[
f(e(x)) = \max\left\{0,\, \min\left\{1,\, \left\lfloor \left(2^{8}-1\right)\varepsilon(x) \right\rceil \left(2^{8}-1\right)^{-1} \right\}\right\}, \qquad (9.15)
\]
where $\varepsilon(x) = \kappa e(x) + \sigma(\kappa e(x))\,\eta(x)$, the signal-dependent variance being $\sigma^{2}(\kappa e(x)) = a\,\kappa e(x) + b$, and η is standard Gaussian noise, η(x) ∼ N(0, 1). In the experiment, parameters a = 0.004 and b = 0.022 are used; this setting corresponds to the noise model of a Fujifilm FinePix S9600 digital camera at ISO1600 [25].
Note that the selected noise level is relatively high and intentionally selected to clearly
visualize the impact of noise on the HDR compositions.
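The acquisition simulation can be sketched as follows; the exposure e(x) = E(x)Δt and the noise parameterization σ²(y) = ay + b are assumptions consistent with the description above and with Reference [25], not a verbatim transcription of Equations 9.14 and 9.15.

import numpy as np

def simulate_ldr(E, dt, a=0.004, b=0.022, kappa=1.0, rng=None):
    # Simulate one 8-bit LDR frame from an HDR irradiance map E and exposure
    # time dt: linear response, signal-dependent noise, rounding to 256 levels,
    # and clipping to [0, 1]. Exposure e(x) = E(x) * dt and the variance model
    # sigma^2(y) = a*y + b are assumptions made for this sketch.
    rng = np.random.default_rng() if rng is None else rng
    y = kappa * E * dt
    eps_x = y + np.sqrt(np.maximum(a * y + b, 0.0)) * rng.standard_normal(y.shape)
    z = np.rint((2**8 - 1) * eps_x) / (2**8 - 1)
    return np.clip(z, 0.0, 1.0)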
FIGURE 9.15
The four synthetic LDR frames used to create the HDR images shown tone mapped in the first row of Fig-
ure 9.16. The exposure times used in creating the frames are 0.064, 0.256, 1.024, and 4.096 seconds.
same inverse response function is used for the luminance channel, while chrominance chan-
nels are processed as described in Section 9.4.2.3. Since RGB composition is essentially an
inverse of the acquisition process followed to obtain the synthetic LDR frames, the RGB
composition result is, apart from some minute quantization-induced differences, perfect.
For luminance-chrominance this cannot be expected, since the luminance channel is always
obtained as a combination of the RGB channels, which due to clipping reduces the accuracy
of the response. As will be shown, however, the reduced accuracy does not compromise
the actual quality of the composed HDR image. Instead, the luminance-chrominance ap-
proach leads to much more accurate composition when the frames are degraded by noise
or misaligned.
Because of noisy image data, the normalized saturation definition given in Section 9.2.3.1, modified with a regularization term τ, is employed. The regularized saturation is then obtained as $S = \sqrt{(\zeta^{U})^{2} + (\zeta^{V})^{2}} \big/ \sqrt{(\zeta^{Y})^{2} + \tau^{2}}$ with τ = 0.1. Without regularization, such saturation would become unstable at low luminance values, eventually resulting in miscalculation of the weights for the composition of the chrominance channels.
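In code, this regularized saturation is a one-liner; the sketch below assumes the luminance and chrominance channels are available as NumPy arrays.

import numpy as np

def regularized_saturation(zeta_Y, zeta_U, zeta_V, tau=0.1):
    # Chrominance magnitude divided by a stabilized luminance; the tau term
    # prevents the ratio from blowing up at low luminance values.
    return np.sqrt(zeta_U**2 + zeta_V**2) / np.sqrt(zeta_Y**2 + tau**2)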
FIGURE 9.16
Experiments with synthetic data: (a,b) luminance-chrominance and RGB images composed using synthetic
LDR data subject to no degradation, (c,d) images composed using sequence subject to misalignment with
measured average LED position variance of 0.2289, and (e,f) images composed using sequence subject to
misalignment with measured average LED position variance of 0.2833 as well as noise.
for images degraded by misalignment (average LED position variance 0.2833 pixels) and
noise. Though masked somewhat by the noise, color artifacts can again be witnessed, for
example in the upper right corner of the color chart and in the vertical shelf edge found in
the left side of the scene. It is also evident that both these compositions suffer from noise,
which is particularly obvious in dark regions.
Figure 9.18 shows similar behavior as discussed above. Figures 9.18a and 9.18b show
composition results for a synthetic sequence degraded by noise; again, the comparison
FIGURE 9.17
Magnified details extracted from the images of Figure 9.16: (a,b) luminance-chrominance and RGB images
composed using synthetic LDR data subject to no degradation, (c,d) images composed using sequence subject
to misalignment with measured average LED position variance of 0.2289, and (e,f) images composed using
sequence subject to misalignment with measured average LED position variance of 0.2833 as well as noise.
favors the luminance-chrominance composed image. Figures 9.18c and 9.18d show composed images obtained using the source sequence degraded by camera shake with the average variance of 2.4859 pixels. The luminance-chrominance composed image, apart from slight blurriness, seems visually acceptable, whereas the RGB image suffers from color ar-
tifacts present at almost all of the edges. Among the worst are the greenish artifacts at the
upper-right edge of the white-screen and red distortion on the wires of the LED at the upper
part of the scene, as further illustrated in Figure 9.19 which displays magnified fragments of
FIGURE 9.18
Experiments with synthetic data: (a,b) luminance-chrominance and RGB images composed using synthetic
LDR data subject to noise, (c,d) images composed using sequence subject to misalignment with measured
average LED position variance of 2.4859, and (e,f) images composed using sequence subject to misalignment
with measured average LED position variance of 1.5339 as well as noise.
Figure 9.18. Figures 9.18e and 9.18f show the composition results for a sequence degraded
by noise and camera shake with average variance of 1.5339 pixels. Again, both noise and
color artifacts are very much present in the RGB composed image, whereas the luminance-
chrominance composed image handles the imperfect conditions significantly better.
It is interesting to comment on the blueish noise visible in the darker parts of the HDR images produced by the RGB composition of noisy LDR frames. First, as
can be seen in Figure 9.15, these are areas which remain rather dark even in the frame
FIGURE 9.19
Magnified details extracted from the images of Figure 9.18: (a,b) luminance-chrominance and RGB images
composed using synthetic LDR data subject to noise, (c,d) images composed using sequence subject to mis-
alignment with measured average LED position variance of 2.4859, and (e,f) images composed using sequence
subject to misalignment with measured average LED position variance of 1.5339 as well as noise.
with the longest exposure (∆t = 4.096 s). Second, as can be observed in Figure 9.14c, the
scene is dominated by a cream yellow cast, mainly due to the tone of the lamp used for
lighting. Thus, in each LDR frame, in these areas the blue component is the one with both
the lowest intensity and the poorest signal-to-noise ratio. Because of the way weights are
defined for the composition of these dark areas, only the longest exposed frame contributes
with significant weights. This situation is quite different from that of the other parts of the
image, which are instead produced as an average of two or more frames, one of which is
properly exposed. In particular, for the darkest components, it is the right tail of the noise
distribution which is awarded larger weights (see Figure 9.5). Moreover, because of the
clipping at zero in Equation 9.15, the noise distribution itself is asymmetric. This results in
a positive bias in the composition of the darker parts, which causes the blueish appearance
of the noise over those parts.
9.7 Conclusions
This chapter presented methods for the capture, composition, and display of color HDR images. In particular, composition techniques which effectively allow producing HDR images using LDR acquisition devices were considered. Composition in luminance-chrominance space is shown to be especially suitable for the realistic imaging case where the source LDR frames are corrupted by noise and/or misalignment. In addition to being more robust to degradations, the luminance-chrominance approach to HDR imaging focuses special attention on the faithful treatment of color. As opposed to traditional methods working with the RGB channels, this method does not suffer from systematic color artifacts in cases of misaligned data, nor from color balancing errors usually induced, among other things, by an imperfectly calibrated camera response.
With the ongoing introduction of HDR acquisition and display hardware, HDR imaging
techniques are set to gain an even more important role in all parts of computational pho-
tography and image processing. With new applications, both industrial and commercial,
introduced nearly daily, it is not farfetched to say that HDR will in more than one meaning
be “the new color television.”
Acknowledgment
This work was in part supported by OptoFidelity Ltd (www.optofidelity.com) and in
part by the Academy of Finland (project no. 213462, Finnish Programme for Centres of
Excellence in Research 2006-2011, project no. 118312, Finland Distinguished Professor
Programme 2007-2010, and project no. 129118, Postdoctoral Researcher’s Project 2009-
2011).
Figures 9.2, 9.3, 9.5, 9.6, and 9.11 are reprinted from Reference [14], with the permission
of John Wiley and Sons.
References
[1] E. Reinhard, G. Ward, S. Pattanaik, and P. Debevec, High Dynamic Range Imaging. San
Francisco, CA, USA: Morgan Kaufmann Publishers, November 2005.
[2] S. Mann and R. Picard, “Being ‘undigital’ with digital cameras: Extending dynamic range by
combining differently exposed pictures,” in Proceedings of the IS&T 46th Annual Conference,
Boston, MA, USA, May 1995, pp. 422–428.
[3] P. Debevec and J. Malik, “Recovering high dynamic range radiance maps from photographs,”
in Proceedings of the 24th Conference on Computer Graphics and Interactive Techniques,
New York, USA, August 1997, pp. 369–378.
[4] T. Mitsunaga and S. Nayar, “Radiometric self calibration,” in Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, Fort Collins, CO, USA, June 1999,
pp. 1374–1380.
[5] G. Larson, H. Rushmeier, and C. Piatko, “A visibility matching tone reproduction operator for
high dynamic range scenes,” IEEE Transactions on Visualization and Computer Graphics, vol. 3, no. 4,
pp. 291–306, October 1997.
[6] K. Devlin, A. Chalmers, A. Wilkie, and W. Purgathofer, “Tone reproduction and physi-
cally based spectral rendering,” in Proceedings of EUROGRAPHICS, Saarbrucken, Germany,
September 2002, pp. 101–123.
[7] G. Krawczyk, K. Myszkowski, and H. Seidel, “Lightness perception in tone reproduction for
high dynamic range images,” Computer Graphics Forum, vol. 24, no. 3, pp. 635–645, October
2005.
[8] H. Seetzen, W. Heidrich, W. Stuerzlinger, G. Ward, L. Whitehead, M. Trentacoste, A. Ghosh,
and A. Vorozcovs, “High dynamic range display systems,” ACM Transactions on Graphics,
vol. 23, no. 3, pp. 760–768, August 2004.
[9] K.N. Plataniotis and A.N. Venetsanopoulos, Color Image Processing and Applications, New
York, USA: Springer-Verlag, July 2000.
[10] R.C. Gonzalez and R.E. Woods, Digital Image Processing, 2nd Edition, Upper Saddle River,
NJ: Prentice-Hall, January 2002.
[11] S. Susstrunk, R. Buckley, and S. Swen, “Standard RGB color spaces,” in Proceedings of
IS&T/SID 7th Color Imaging Conference, Scottsdale, AZ, USA, November 1999, pp. 127–
134.
[12] A. Ford and A. Roberts, “Colour space conversions.” Available online: http://herakles.fav.zcu.cz/research/night road/westminster.pdf.
[13] A. Blanksby, M. Loinaz, D. Inglis, and B. Ackland, “Noise performance of a color CMOS
photogate image sensor,” Electron Devices Meeting, Technical Digest, Washington, DC, USA,
December 1997.
[14] O. Pirinen, A. Foi, and A. Gotchev, “Color high dynamic range imaging: The luminance-
chrominance approach,” International Journal of Imaging Systems and Technology, vol. 17,
no. 3, pp. 152–162, October 2007.
[15] S. Kavusi and A.E. Gamal, “Quantitative study of high-dynamic-range image sensor architec-
tures,” in Proceedings of Sensors and Camera Systems for Scientific, Industrial, and Digital
Photography Applications, San Jose, CA, USA, January 2004, pp. 264–275.
[16] J. Burghartz, H.G. Graf, C. Harendt, W. Klingler, H. Richter, and M. Strobel, “HDR CMOS
imagers and their applications,” in Proceedings of the 8th International Conference on Solid-
State and Integrated Circuit Technology, Shanghai, China, October 2006, pp. 528–531.
[17] B.C. Madden, “Extended intensity range imaging,” Technical Report, GRASP Laboratory,
University of Pennsylvania, 1993.
[18] S. O’Malley, “A simple, effective system for automated capture of high dynamic range im-
ages,” in Proceedings of the IEEE International Conference on Computer Vision Systems,
New York, USA, January 2006, p. 15.
[19] M. Goesele, W. Heidrich, and H.P. Seidel, “Color calibrated high dynamic range imaging
with icc profiles,” in Proceedings of the 9th Color Imaging Conference, Scottsdale, AZ, USA,
November 2001, pp. 286–290.
[20] A.V. Oppenheim, R. Schafer, and T. Stockham, “Nonlinear filtering of multiplied and con-
volved signals,” Proceedings of the IEEE, vol. 56, no. 8, pp. 1264–1291, August 1968.
[21] A. Gilchrist, C. Kossyfidis, F. Bonato, T. Agostini, J. Cataliotti, X. Li, B. Spehar, V. An-
nan, and E. Economou, “An anchoring theory of lightness perception,” Psychological Review,
vol. 106, no. 4, pp. 795–834, October 1999.
[22] F. Drago, K. Myszkowski, T. Annen, and N. Chiba, “Adaptive logarithmic mapping for dis-
playing high contrast scenes,” Computer Graphics Forum, vol. 22, no. 3, pp. 419–426, July
2003.
[23] “Picturenaut 3.0.” Available online: http://www.hdrlabs.com/picturenaut/index.html.
[24] A. Foi, M. Trimeche, V. Katkovnik, and K. Egiazarian, “Practical Poissonian-Gaussian noise
modeling and fitting for single-image raw-data,” IEEE Transactions on Image Processing,
vol. 17, no. 10, pp. 1737–1754, October 2008.
[25] A. Foi, “Clipped noisy images: Heteroskedastic modeling and practical denoising,” Signal
Processing, vol. 89, no. 12, pp. 2609–2629, December 2009.
10
High-Dynamic Range Imaging for Dynamic Scenes
10.1 Introduction
Computational photography makes it possible to enhance traditional photographs digi-
tally [1]. One of its branches is high-dynamic range imaging, which enables access to a
wider range of color values than traditional digital photography. Typically, a high-dynamic
range (HDR) image stores RGB color values as floating point numbers, enlarging the con-
ventional discretized RGB format (8-bit per color channel) used for low-dynamic range
(LDR) images [2]. It is possible to visualize the HDR images through a specifically-built
HDR display [3], [4] or to perceptually adapt them for LDR displays using tone map-
ping [5], [6], [7].
HDR imaging revolutionized digital imaging. Its usage became popular with professional and amateur photographers after its inclusion in software such as HDRShop [8], Photogenics [9], and Cinepaint [10]. The autobracketing functionality nowadays available on many mainstream digital cameras makes HDR capture less cumbersome and hence more attractive. Its dissemination is becoming common in press articles on photography. Besides becoming a must-have in digital photography, it is also used for scientific activities such as computer graphics, image processing, and virtual reality. With its larger range of luminance values it provides a much more detailed support for imagery. Showing no or fewer saturated areas ensures more accurate calculations and more realistic simulations.
While HDR imaging becomes more mature in its use, its development is still at an early
stage. This chapter will show that most HDR capture methods only work well when scenes
are static, meaning that no movement or changes in the scene content are allowed during
the capture process. The real world, however, is made of dynamic scenes (Figure 10.1),
with objects in movement during the capture process. In indoor scenes this can be people
moving, doors or windows opening, and objects being manipulated. In outdoor scenes this
dynamic character is even more prominently present as pedestrians walk through the scene,
leaves and trees move due to wind, and cloud movement or reflection on water change the
lighting in the scene. The camera itself may also be in motion. Thus, in order to consider
HDR imaging for a wider range of applications, motion and dynamic scenes need to be
integrated in HDR technology targeting both HDR photographs and movies.
In this chapter, Section 10.2 presents the definition of HDR images. Section 10.3 de-
scribes how to create an HDR image from multiple exposures. Section 10.4 explores new
approaches that extend the types of possible scenes. Section 10.5 presents a vision on future
HDR imaging development for generic scenes. Conclusions are drawn in Section 10.6.
TABLE 10.1

t     4 s    2 s    1 s    1/2 s    1/4 s
EV    −2     −1     0      1        2
dynamic range of one LDR image only [13], [14]. These methods extrapolate information
in underexposed or overexposed areas by either using user intervention or neighboring parts
of the image. Although this approach can produce reasonable results, it is generally more
accurate to recover information using several exposures of the same scene.
where N is the f-number (aperture value) and t is the exposure time (shutter speed). It is
usually preferred to fix the aperture and vary the exposure time to prevent traditional photo-
graphic effects associated with aperture changes, such as depth of field variation. Table 10.1
lists regular or linear intervals of EV for N = 1.0 and the required shutter speed. Having
regular intervals for EV is not compulsory but makes it easier for future computations.
The auto-bracketing function of existing cameras is very useful to capture a set of LDR
images at regular EV intervals. The camera chooses the best EV0 exposure value by cal-
culating what combination of aperture and shutter speed introduces the least saturation
in the image. Once this EV0 is established, darker and brighter pictures are taken by re-
spectively decreasing and increasing the shutter speed while keeping the aperture width
fixed. As an example, if 2n + 1 is the number of pictures taken, EV varies along the range
EV−n , ...EV−1 , EV0 , EV1 , ..., EVn . The number of pictures taken varies with the camera,
three, five or nine pictures (n = 1, 2, 4) being the standard. An example of a series of five
pictures is shown in Figure 10.2.
FIGURE 10.3
The different steps needed to create an HDR image from several LDR images in traditional methods.
from other known functions [24]. The procedure is very sensitive to the chosen pixels and
to the input images. If the camera parameters do not change, the same curve is often reused
for other sets of LDR images.
Converting the intensity values of the LDR images to radiance values: The inverse re-
sponse function f is applied to each pixel of the LDR images to convert the RGB color
value into an HDR radiance value.
Generating the HDR image: The radiance values stored in the LDR images are combined
using a weighted average function into a single value for each pixel that will form the final
HDR image. The weights are used to eliminate saturated values, or misaligned pixels.
Examples of a resulting image are shown in Figure 10.4; namely, Figure 10.4a shows the
output image obtained using the input images in Figure 10.2 after alignment (translation and
rotation) whereas Figure 10.4b shows the image resulting from the aligned image sequence
shown in Figure 10.1. Traditional methods calculate the final radiance E(i, j) of an HDR
image for pixel (i, j) as follows:
\[
E(i,j) = \frac{\displaystyle\sum_{r=1}^{R} w\!\left(M_r(i,j)\right)\frac{g^{-1}\!\left(M_r(i,j)\right)}{\Delta t_r}}{\displaystyle\sum_{r=1}^{R} w\!\left(M_r(i,j)\right)}, \qquad (10.3)
\]
where Mr (i, j) is the intensity of pixel (i, j) for the image r of R images, w its associated
weight and ∆tr the exposure time. In traditional combining methods, w often only takes
into account overexposure or underexposure. A hat function, illustrated in Figure 10.5, can be used to select well-exposed values [25].
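A minimal sketch of Equation 10.3 with a hat-shaped weight is given below; the helper names, the assumption of 8-bit input values, and the guard against a zero weight sum are illustrative.

import numpy as np

def hat_weight(z, z_min=0, z_max=255):
    # Hat-shaped weight: favors mid-range values, down-weights underexposed
    # and overexposed pixels (z is assumed to hold 8-bit values).
    mid = 0.5 * (z_min + z_max)
    return np.where(z <= mid, z - z_min, z_max - z)

def merge_hdr(ldr_images, exposure_times, g_inv, eps=1e-8):
    # Weighted average of Equation 10.3: each image M_r is linearized with the
    # inverse response g_inv, divided by its exposure time, and averaged.
    num = np.zeros(ldr_images[0].shape, dtype=np.float64)
    den = np.zeros(ldr_images[0].shape, dtype=np.float64)
    for M, dt in zip(ldr_images, exposure_times):
        w = hat_weight(M).astype(np.float64)
        num += w * g_inv(M) / dt
        den += w
    return num / np.maximum(den, eps)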
10.3.3 Limitations
Limitations occur at different stages of the construction of an HDR image from several
LDR images. The existing alignment procedures often used in HDR generation methods
rarely consider nonlinear transformations (such as zooming or warping) which are much
more computationally expensive and difficult to recover. Moreover, the alignment can eas-
ily be perturbed when object movement is present in the scene.
FIGURE 10.5
The hat function used to clamp values underexposed or overexposed in Reference [25].
The currently known algorithms that retrieve the camera response function work reason-
ably well for many cases, but there is no guarantee that these will always work. Firstly, the
chosen shape (e.g., log, gamma or polynomial) may not fit the original curve; secondly, it is
very dependent on the chosen pixels; thirdly a large dataset of curves is required to ensure a
reasonable success in the procedure used in Reference [24]. In practice, since the response
curve estimation is unstable, it is often preferable to calculate it once for a certain setting
of the camera, and use this retrieved curve for other sets of LDR images.
The reconstruction of an HDR image from aligned LDR images works only if the scene is
static. Ghosting effects or other types of wrongly recovered radiance values will be present
for dynamic scenes. Such effects are illustrated in Figure 10.6. The time to take an HDR
image is at the minimum the sum of the exposure times. For certain scenes, such as exterior
scenes or populated scenes, it is almost impossible to guarantee complete stillness during
this time. A problem ignored in the literature is changes in illumination or occluding
objects in movement during the shoot. This happens, for example, on a cloudy day, which
may lead to rapid changes in illumination. It may also happen that an object moves quickly
FIGURE 10.6
Example of ghosting effects in the HDR image due to movement in the input LDR image sequence: (a) a
zoomed portion of the HDR image shown in Figure 10.4 with ghosting effects due to people walking near the
bridge, (b) a zoomed portion of the HDR reconstructed image from the sequence shown in Figure 10.1, with
ghosting effects present in the zone with pedestrians and on the platform floating on the water.
and changes position reaching both shaded and illuminated areas. This leads to radiance
incoherence in the LDR input images.
The assumption is made that, in general, moving objects cover a wide range of adjacent
pixel clusters, called movement clusters, rather than only isolated pixels. The
movement clusters are derived by applying a threshold TV I on V I, resulting in a binary
image V IT . For well-defined and closed movement clusters, the morphological operations
erosion and dilation are applied to the binary image V IT . In Reference [16] a suitable
threshold value for TV I is stated to be 0.18. The HDR reconstruction is done using the
weighted sum as in Equation 10.3 except for the identified movement clusters. In those
regions, pixels are replaced by the ones of the best exposed LDR radiance image.
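A minimal sketch of this movement-cluster step, assuming NumPy and SciPy and using the threshold of 0.18 suggested in Reference [16], might look as follows; the size of the structuring element is a hypothetical choice.

```python
import numpy as np
from scipy import ndimage

def movement_clusters(variance_image, t_vi=0.18, struct_size=3):
    """Threshold the variance image VI and clean the result with erosion
    and dilation to obtain well-defined, closed movement clusters."""
    vi_t = variance_image > t_vi                      # binary image VI_T
    struct = np.ones((struct_size, struct_size), dtype=bool)
    vi_t = ndimage.binary_erosion(vi_t, structure=struct)
    vi_t = ndimage.binary_dilation(vi_t, structure=struct)
    return vi_t

def reconstruct_with_clusters(hdr_weighted, best_exposed_radiance, clusters):
    """Use the weighted HDR result outside the movement clusters and the
    best-exposed LDR radiance image inside them."""
    return np.where(clusters, best_exposed_radiance, hdr_weighted)
```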
An example is shown in Figure 10.7. Namely, Figure 10.7a shows a set of four exposures
in which people walk through the viewing window. Figure 10.7b presents the variance
image used to detect this movement. The HDR image generated using this variance image
is shown in Figure 10.7d.
This method assumes that highly variant pixels in V I indicate movement. Other influences
exist, besides remaining camera misalignments, that might result in a highly variant V I
value:
• Camera curve: The camera curve might fail to convert the intensity values to irradi-
ance values correctly. This influences the variance between corresponding pixels in
the LDR images and might compromise the applicability of the threshold to retrieve
movement clusters.
• Inaccuracies in exposure speed and aperture width used: In combination with the
camera curve this produces incorrect irradiance values after transformation. Chang-
ing the aperture width causes change in depth of field, which influences the quality
of the irradiance values.
To derive the entropy of an image I, written as H(I), the intensity of a pixel in an image
is thought of as a statistical process. In other words, X is the intensity value of a pixel, and
p(x) = P(X = x) is the probability that a pixel has intensity x. The probability function
p(x) = P(X = x) is the normalized histogram of the image. Note that the pixel intensities
range over a discrete interval, usually defined as the integers in [0, 255], but the number of
bins M of the histogram used to calculate the entropy can be less than 256.
The entropy of an image provides some useful information and the following remarks
can be made:
• The entropy of an image is a value in the interval [0, log(M)]. The lower the
entropy, the fewer distinct intensity values are present in the image; the higher the
entropy, the more distinct intensity values there are in the image. However, the
actual intensity values do not influence the entropy.
• The actual order or organization of the pixel intensities in an image does not influ-
ence the entropy. As an example, consider two images with equal amounts of black
and white organized in squares as in a checkerboard. If the first image has only 4
large squares and the second image consists of 100 smaller squares they still contain
the same amount of entropy.
• Applying a scaling factor on the intensity values of an image does not change its
entropy, if the intensity values do not saturate. In fact, the entropy of an image does
not change if an injective function is applied to the intensity values. An injective
function associates distinct arguments to distinct values, examples are the logarithm,
exponential, scaling, etc.
• The entropy of an image gives a measure of the uncertainty of the pixels in the image.
If all intensity values are equal, the entropy is zero and there is no uncertainty about
the intensity value a randomly chosen pixel can have. If all intensity values are
different, the entropy is high and there is a lot of uncertainty about the intensity value
of any particular pixel.
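These remarks can be illustrated with a short Python sketch (assuming NumPy) that computes the entropy of an image from its normalized M-bin histogram; the bin count and intensity range are illustrative choices.

```python
import numpy as np

def image_entropy(image, m_bins=256, value_range=(0, 255)):
    """Entropy H(I) from the normalized M-bin intensity histogram."""
    hist, _ = np.histogram(image, bins=m_bins, range=value_range)
    p = hist / hist.sum()
    p = p[p > 0]                    # 0 * log(0) is treated as 0
    return -np.sum(p * np.log(p))   # value lies in [0, log(M)]

# Two checkerboards with equal black/white area have the same entropy
# (log 2), regardless of how many squares they contain.
coarse = np.kron(np.indices((2, 2)).sum(0) % 2, np.ones((50, 50))) * 255
fine = np.kron(np.indices((10, 10)).sum(0) % 2, np.ones((10, 10))) * 255
assert np.isclose(image_entropy(coarse), image_entropy(fine))
```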
The movement detection method discussed in this section shares some common ele-
ments with the one presented in References [26] and [27]. Both methods detect movement
in a sequence of images, but restrict this sequence to be captured under the same condi-
tions (illumination and exposure settings). The method presented here can be applied to
a sequence of images captured under different exposure settings. It starts by creating an
uncertainty image UI, which has a similar interpretation as the variance image V I used in
Section 10.4.1.1; pixels with a high UI value indicate movement. The following explains
how the calculation of UI proceeds.
For each pixel with coordinates (k, l) in each LDR image Ii the local entropy is calculated
from the histograms constructed from the pixels that fall within a two-dimensional window
V with size (2v + 1) × (2v + 1) around (k, l). Each image Ii therefore defines an entropy
image Hi , where the pixel value Hi (k, l) is calculated as follows:
$$ H_i(k, l) = -\sum_{x=0}^{M-1} P(X = x) \log\big(P(X = x)\big), \qquad (10.6) $$
where the probability function P(X = x) is derived from the normalized histogram con-
structed from the intensity values of the pixels within the two-dimensional window V, or
over all pixels p in
{p ∈ Ii (k − v : k + v, l − v : l + v)}. (10.7)
From these entropy images a final uncertainty image UI is defined as the local weighted
entropy difference:
$$ UI(k, l) = \sum_{i=0}^{N-1} \sum_{j=0}^{j<i} \frac{w_{ij}}{\sum_{i=0}^{N-1} \sum_{j=0}^{j<i} w_{ij}}\, h_{ij}(k, l), \qquad (10.8) $$
$$ h_{ij}(k, l) = |H_i(k, l) - H_j(k, l)|, \qquad (10.9) $$
$$ w_{ij} = \min\big(W_i(k, l), W_j(k, l)\big). \qquad (10.10) $$
It is important that the weights Wi (k, l) and W j (k, l) remove any form of underexposure
or saturation to ensure the transformation between the different exposures is an injective
function. Therefore they are slightly different from those used during the HDR generation.
In Reference [16] a relatively small hat function with lower and upper thresholds equal to
0.05 and 0.95 for normalized pixel intensities is used. The weight wi j is created as the
minimum of Wi (k, l) and W j (k, l), which further reflects the idea that underexposed and
saturated pixels do not yield any entropic information.
The reasoning behind this uncertainty measure follows from the edge enhancement that
the entropy images Hi provide. The local entropy is high in areas with edges and details.
These high entropic areas do not change between the images in the exposure sequence, ex-
cept when corrupted by a moving object or saturation. The difference between the entropy
images therefore provides a measure for the difference in features, such as intensity edges,
between the exposures. Entropy does this without the need to search for edges and corners
which can be difficult in low contrast areas. In fact, the entropy images are invariant to
the local contrast in the areas around these features. If two image regions share the exact
same structure, but with a different intensity, the local entropy images will fail to detect this
change. This can be considered a drawback of the entropic movement detector as it also
implies that when one homogeneously colored object moves against another homogeneously
colored object, the uncertainty measure would only detect the boundaries of the moving
objects as having changed. Fortunately, real-world objects usually show some spatial variety,
which is sufficient for the uncertainty detector to detect movement. Therefore the indif-
ference to local contrast is only an advantage, particularly in comparison to the variance
detector discussed previously in this section.
The difference in local entropy between two images induced by the moving object, de-
pends on the difference in entropy of the moving object and the background environment.
Though the uncertainty measure is invariant to the contrast of these two, it is not invariant
to the entropic similarity of the two. For instance, if the local window is relatively large, the
moving object is small relative to this window, and the background consists of many static
objects that are small and similar, then the entropic difference defined in Equation 10.8
might not be large. Decreasing the size of the local window will result in an increased
entropic difference, but a too small local window might be subject to noise and outliers. It
was found in Reference [16] that a window size of 5 × 5 pixels returned good results.
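The following Python sketch (assuming NumPy) illustrates Equations 10.6 to 10.10 with the 5 × 5 window and the narrow hat weights mentioned above; the hat is approximated here by a simple box indicator on normalized intensities, the local histograms use M = 200 bins, and the brute-force loops are kept for clarity rather than speed.

```python
import numpy as np

def local_entropy(image, v=2, m_bins=200):
    """Entropy image H_i: entropy of the (2v+1)x(2v+1) neighborhood histogram
    of each pixel (Equation 10.6); image values are assumed to lie in [0, 1]."""
    h, w = image.shape
    H = np.zeros((h, w))
    for k in range(v, h - v):
        for l in range(v, w - v):
            win = image[k - v:k + v + 1, l - v:l + v + 1]
            p, _ = np.histogram(win, bins=m_bins, range=(0.0, 1.0))
            p = p / p.sum()
            p = p[p > 0]
            H[k, l] = -np.sum(p * np.log(p))
    return H

def exposure_weight(image, lo=0.05, hi=0.95):
    """Narrow hat approximated by a box: zero for under- or overexposed
    normalized intensities, one otherwise."""
    return ((image > lo) & (image < hi)).astype(float)

def uncertainty_image(images, v=2, m_bins=200):
    """Uncertainty image UI from pairwise local-entropy differences
    (Equations 10.8 to 10.10)."""
    H = [local_entropy(im, v, m_bins) for im in images]
    W = [exposure_weight(im) for im in images]
    num = np.zeros_like(images[0])
    den = np.zeros_like(images[0])
    for i in range(len(images)):
        for j in range(i):
            w_ij = np.minimum(W[i], W[j])          # Equation 10.10
            h_ij = np.abs(H[i] - H[j])             # Equation 10.9
            num += w_ij * h_ij
            den += w_ij
    return num / np.maximum(den, 1e-8)             # Equation 10.8
```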
Similarly to the variance-based method [2], [16], the movement clusters are now defined
by applying a threshold TUI on UI, resulting in a binary image UIT . For well-defined,
closed, movement clusters, the morphological operations erosion and dilation are applied
to UIT . A threshold TUI equal to 0.7 for M = 200 provides satisfactory results, although it
does not seem to be as robust as the threshold for the variance detector. As for the variance-
based method, for the HDR reconstruction, pixels in a detected movement area are replaced
by the pixels of only one LDR image chosen to have least saturation in this area.
An example is shown in Figure 10.7. Namely, Figure 10.7a shows a set of four exposures
that indicate object movement. Figure 10.7c presents the uncertainty image. The resulting
HDR image after movement removal using this uncertainty image is shown in Figure 10.7e.
The creation of UI is independent from the camera curve calibration. As mentioned ear-
lier, this has as an extra advantage that the detection of movement clusters could potentially
be used in the camera calibration phase.
FIGURE 10.8
HDR reconstruction using the background estimation method of Reference [28]: (a) input LDR images, (b)
computed labeling, and (c) reconstructed HDR image. Courtesy of Granados et al. [28].
final HDR image is directly built from the input LDR images. Weights are calculated
through an iterative process to represent the chance of each pixel to belong to the static part
of the scene together with its chance of being correctly exposed. A specific weight w, that
takes into account not only exposure quality but also the chance of being static, is obtained.
Calculations are made in the Lαβ color space.
An iterative process is set to calculate the weights w pqs of a pixel (p, q) in image s with
an intensity Z. All weights are first initialized to the average over the color space of a
hat function which is low for values close to the extremes 0 and 255. This hat function is
represented in Figure 10.5 and defined as follows:
$$ w(Z) = 1 - \left( \frac{2Z}{255} - 1 \right)^{12}. \qquad (10.11) $$
Reference [25] uses the set $N$ that contains neighboring pixels $y_{pqs}$ of pixel $x_{ijr}$ with
$(p, q) \neq (i, j)$, where $x$ and $y$ denote vectors in $\mathbb{R}^5$ representing the Lαβ color space and
the two-dimensional position. In practice, the neighboring pixels are located inside a 3 × 3
window. The neighboring pixels help to evaluate the likelihood of the pixel to belong to the
background. New weights are calculated at iteration t + 1 as follows:
$$ w_{pqs,t+1} = w\big(Z_s(p, q)\big)\, P(x_{pqs} \mid F). \qquad (10.12) $$
The probability function P(·) is defined as
$$ P(x_{ijr} \mid F) = \frac{\sum_{pqs \in N(x_{ijr})} w_{pqs}\, K_H(x_{ijr} - y_{pqs})}{\sum_{pqs \in N(x_{ijr})} w_{pqs}}, \qquad (10.13) $$
where
$$ K_H(x) = |H|^{-1/2} (2\pi)^{-d/2} \exp\!\left( -\tfrac{1}{2}\, x^T H^{-1} x \right) \qquad (10.14) $$
for a d-variate Gaussian, with H being a symmetric, positive definite, d × d bandwidth
matrix. For example, H can be an identity matrix.
By performing 10 to 15 iterations, the method can effectively remove ghosting effects;
it works particularly well if the background is predominant in the image [25]. When the
moving region overlaps across most of the input images, the moving object remains present
in the reconstructed image. This is shown in Figure 10.9.
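A rough sketch of this iterative weighting, assuming NumPy, an identity bandwidth matrix as suggested above, and 3 × 3 spatial neighborhoods, is given below; the array shapes and the number of iterations are illustrative, the normalization constant of the Gaussian kernel is dropped because it cancels in Equation 10.13, and the brute-force loops ignore efficiency and border handling.

```python
import numpy as np

def khan_weights(lab_images, intensities, n_iters=10):
    """Iterative background-membership weights (Equations 10.11 to 10.14).

    lab_images  : array (S, H, W, 3) of L-alpha-beta values
    intensities : array (S, H, W) of 8-bit intensities Z used by w(Z)
    """
    S, H, W, _ = lab_images.shape
    hat = 1.0 - (2.0 * intensities / 255.0 - 1.0) ** 12   # Equation 10.11
    w = hat.copy()   # initialization (the chapter averages the hat over the color space)
    # 5-vectors x = (L, alpha, beta, p, q); the bandwidth matrix H is the identity.
    coords = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"),
                      axis=-1).astype(float)
    x = np.concatenate([lab_images,
                        np.broadcast_to(coords, (S, H, W, 2))], axis=-1)

    for _ in range(n_iters):
        w_new = w.copy()
        for s in range(S):
            for p in range(1, H - 1):
                for q in range(1, W - 1):
                    # neighboring pixels y_pqs in a 3x3 window, over all exposures
                    nb_x = x[:, p - 1:p + 2, q - 1:q + 2].reshape(-1, 5)
                    nb_w = w[:, p - 1:p + 2, q - 1:q + 2].reshape(-1)
                    diff = nb_x - x[s, p, q]
                    k = np.exp(-0.5 * np.sum(diff * diff, axis=1))      # Eq. 10.14, H = I
                    prob = np.sum(nb_w * k) / max(np.sum(nb_w), 1e-8)   # Eq. 10.13
                    w_new[s, p, q] = hat[s, p, q] * prob                # Eq. 10.12
        w = w_new
    return w
```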
This method is extended in Reference [31]. The color values of the input images are
calibrated through a histogram matching; however, a more robust radiometric alignment
could be used instead. The matrix H used in Equation 10.13 is set to
$$ H = \begin{pmatrix} \tilde{\sigma}_L(N_{i,j}) & 0 & 0 \\ 0 & \tilde{\sigma}_\alpha(N_{i,j}) & 0 \\ 0 & 0 & \tilde{\sigma}_\beta(N_{i,j}) \end{pmatrix}, \qquad (10.15) $$
with the weighted standard deviation σ̃ calculated for L, α , and β . This matrix is used
instead of the identity matrix chosen in Reference [25]. To prevent objects from being
segmented in the final reconstruction, a weight propagation algorithm is proposed to cover
regions that are likely to belong to the same object in each image. This algorithm requires
fewer iterations than the algorithm described in Reference [25], often using only one iteration
while yielding improved results. An example is shown in Figure 10.10.
FIGURE 10.9
HDR reconstruction using the background estimation method of Reference [25]: (a,b) two of the input LDR
images, (c) HDR reconstruction with traditional methods, and (d) HDR reconstruction after 10 iterations.
Courtesy of Khan et al. [25]. © 2006 IEEE.
FIGURE 10.10
Comparison of various HDR image reconstruction methods: (a) HDR image reconstructed without any move-
ment identification, (b) results obtained using the method in Reference [25] after six iterations, and (c) results
obtained using the method in Reference [31] using only one iteration. Courtesy of Pedone et al. [31].
FIGURE 10.11
(a-e) LDR sequence of images showing walking pedestrians, (f) HDR reconstruction using traditional methods,
and (g) HDR reconstruction after ghost removal [32]. Image courtesy of Nokia, copyright 2009. © 2009 IEEE.
To identify inconsistent pixels, the algorithm actually looks at the offset of the logarith-
mic relation
ln(X(p2 )) = ln(X(p1 )) + ln(ev12 ) (10.17)
for two patches, 1 and 2, of two different exposures. A measure of outliers gives the
percentage of ghosting in each patch. This measure is used when combining the exposures
together to stitch the final HDR image. Patches found inconsistent are not considered in
the stitching. A significant advantage of the approach is that if the movement occurs only
in parts of the images, the radiance of a moving object can still be reconstructed from input
images where this object was still. Only the largest set I of patches with consistent values
is considered. Reconstruction of the radiance around a moving object is done so that no
seams will be visible. A reconstruction based on a Poisson equation is used to create valid
values around the moving area from the identified set I of consistent images for each area.
An example of the method is shown in Figure 10.11.
A similar approach is employed in Reference [17] which identifies errors in pixels be-
tween images using the computation of a predicted color from Equation 10.2 as follows:
$$ \tilde{z}_{i,k} = f\!\left( \frac{\Delta t_k}{\Delta t_j}\, f^{-1}(z_{i,j}) \right), \qquad (10.18) $$
where $\Delta t_j$ and $\Delta t_k$ denote the exposures, $i$ denotes the pixel, and $f(\cdot)$ is the camera response
function, with $f^{-1}(z_{i,j}) = X_{i,j} = E_i \Delta t_j$. Errors are computed by comparing $\tilde{z}_{i,k}$ with $z_{i,k}$.
If an error is identified, the pixel is not used in the final HDR combination. The final image
is built using a selected region showing least saturation in the region of detected invalid
pixels. Unlike Reference [32], this method aims to have all moving objects appearing in
images, although this requires the user to interact manually with the picture.
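A compact sketch of this consistency test, assuming NumPy and placeholder response functions f and f_inv (the threshold tau is likewise a hypothetical choice), is given below.

```python
import numpy as np

def inconsistent_pixels(z_j, z_k, dt_j, dt_k, f, f_inv, tau=10.0):
    """Flag pixels whose observed value in exposure k deviates from the value
    predicted from exposure j via Equation 10.18."""
    z_pred = f((dt_k / dt_j) * f_inv(z_j))   # predicted color z~_{i,k}
    return np.abs(z_k - z_pred) > tau        # True where the pixel is rejected
```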
FIGURE 10.12
HDR stitching procedure of Reference [19] for an image sequence S− , L, and S+ .
FIGURE 10.13
HDR video reconstruction: (top) input photographs of a video sequence, and (bottom) HDR reconstruction.
Courtesy of Kang et al. [19]. © 2003 ACM.
needs to be used, but is incoherent with what is contained in L. New images are created,
SU− , SU+ and SB− , SB+ , that match the content of L for the HDR reconstruction. All images are
converted to new images with radiance values (ŜU− , ŜU+ ,ŜB− , ŜB+ , L̂) using a retrieved inverse
camera curve, as proposed in Reference [22]. The reconstructed HDR radiance image L̂HDR
contains pixels with the following values:
• If the considered pixel in L̂ is not saturated, a weighted average is computed using
values in ŜB− , ŜB+ and L, with the weights being low for overexposed or underexposed
values and representing a plausibility map (Hermite cubic).
• If the considered pixel in L̂ is saturated, the value in ŜB− is used if it is not overexposed
or underexposed, and ŜB+ otherwise.
The main difficulty lies in computing SU− , SU+ , SB− , and SB+ . Motion estimation between
S− , L, and S+ is calculated by combining optical flow [33], [34] with hierarchical homography
for more accuracy. Residual flow vectors are computed between two warped images from
the source images and are accumulated in a pyramid. A global affine flow is also added.
This estimate is used to generate (SU− , SU+ ) for a unidirectional warping (unidirectional flow
field) and (SB− , SB+ ) for a bidirectional warping. To display the final video, a temporal tone-
mapper is proposed. Figure 10.13 shows some results. This technique can also be applied
to produce a still HDR photograph. One extremely interesting aspect of this method is that
camera motion and scene content movements are treated together with the same method.
algorithm runs on each pair of images to match. If images contain substantial intensity
differences (illumination or exposure), images are normalized. For each pixel of an image,
the algorithm analyzes the surroundings of its corresponding pixel in the paired image using a
3 × 3 window. A score is defined based on the maximum and minimum intensities found. If
the pixel value is outside this range, it receives a penalty. A pixel intensity dissimilarity is
calculated by averaging the scores over a region, which then serves to compute a probability of
pixel matching. The computation of Mi uses weights that measure a correspondence field
between matched feature points, vector fields using locally weighted linear regression, and
a predicted locally weighted regression.
This technique is shown to have an application to HDR video for aligning several image
sequences taken at different exposures before the HDR reconstruction is done, for example,
with the method of Reference [19].
10.4.5 Limitations
The methods presented in this section show that it is possible to address the problem of
dynamic scenes in HDR imaging. They all seem to work well for some typical scenes,
where some small parts are in movement, but cannot be considered generic for all types of
scenes. In particular, most of the methods require a well-defined background region with
foreground moving objects. Only the entropy-based method [16] can reliably differentiate
moving objects from a similar background. Most methods assume that the moving objects are
located in a cluster that is small compared to the static background in the image. This is a
strong limitation, since scenes with a predominant moving area cannot be treated.
Certain methods only work for background reconstruction [17], [25], [28], [32]. As a
consequence, objects in movement disappear from the reconstructed HDR image or can
be cut [25], [32] during the reconstruction process. When the moving areas of the image
are conserved in the final HDR reconstruction, they are often represented by the associated
region of a chosen LDR radiance image [16], [17] or their HDR representation is limited
by the amount of valid data [25], [31]. Depending on the algorithm, this may imply du-
plication of some objects and overexposed or underexposed areas still present in the final
HDR reconstruction. Warping, as presented in References [19] and [20], can solve the re-
construction problem, particularly for nonrigid movement, but it is difficult to make optical
flow-based methods robust. In particular, the method in Reference [19] relies strongly
on proximity of the moving object in the image sequence.
An interesting point is that some methods solve directly for image alignment and move-
ment detection [19], [20]. Finally, none of the above methods propose solutions to the
retrieval of the inverse camera response function when scenes are dynamic. All assume
that the camera curve was previously precomputed.
propose to store the captured image directly in a RAW format that most of the time stores
pixel values without any processing. The RAW format also uses more bits, typically 12 to
14 bits per color channel versus the 8 bits of more traditional formats, to store the intensity
information, thus increasing the range of representable intensities. The RAW format stores
all the necessary information to convert its uncompressed stored image into the final image.
While the RAW format helps to acquire more precise information for HDR photography,
its storage size makes it impractical for HDR video. Therefore, more research needs to be
undertaken to calculate the inverse camera curves for camera acquired dynamic scenes, so
that a natural extension could be made to HDR video capture.
For a more accurate and robust camera inverse curve retrieval, it is crucial to develop
methods that make no assumption on the camera curve shape; as mentioned in Refer-
ence [24], the shape of the curve varies a lot from one camera to another. A function
built with no assumption on the shape will be more robust and should lead to more accu-
rate reconstructed radiance values. Further improvements can be achieved by reliably
identifying viewpoint and object movement; it is vital to develop methods that take movement
into account rather than ignoring pixels in motion, since in some pictures most pixels in the
image may be in movement.
The reconstruction of the radiance in moving areas is important when the required recon-
structed HDR image needs to be faithful to the input scene so that all objects of the scene
and only those are present in the final image and are represented only with valid radiance
values. Methods based on HDR reconstruction from only one LDR image may be used to
fill in missing areas of the reconstructed HDR image.
Another issue is to guarantee the accuracy of the captured radiance. Reference [36]
presents a method to characterize color accuracy in HDR images. Up to now, methods
have rather focused on plausible radiance results. Some applications may require physi-
cally accurate or perceptually accurate radiance values, and it is important to develop HDR
methods that take this into account.
Finally, hardware implementation can reduce the need for postprocessing [19], [12]. Mul-
tiple camera systems could be used, although image registration and synchronization will
be a crucial determinant in the HDR reconstruction. It is also expected that manufactur-
ers will develop new sensors that capture higher ranges of values, as is the current
trend. New storage and compression formats will probably be designed to fit HDR
video requirements.
10.6 Conclusion
HDR imaging is a growing field that is already popular in use. However, up to now, it
has mostly been restricted to static scenes. This chapter identified the issues linked to dynamic
scenes and described the methods recently presented in the literature that contribute to a
significant step in solving HDR imaging for dynamic scenes. Nevertheless, some work
is still needed to address some of their limitations and improve their robustness to more
generic scenes. There is also a need for postprocessing approaches for image calibration
and HDR reconstruction in moving scenes as well as hardware development of new HDR
capture systems and sensors to capture a higher range in a shorter time.
The research field in HDR imaging is extremely active, both in computer vision and
computer graphics communities, and important advances should continue to appear in the
near future.
Acknowledgment
This work is funded by a grant given by the Spanish ministry of education and science
(MEC) for the project TIN2008-02046-E/TIN “High-Dynamic Range Imaging for Uncon-
trollable Environments.” It is also supported by the Ramon y Cajal program of the Spanish
government. We would like to thank authors of References [19], [25], [28], [31], and [32]
for allowing us to use their images for illustration. We would also like to thank Florent
Duguet (Altimesh) for his valuable advice.
Figure 10.7 is reprinted from Reference [16], Figure 10.9 is reprinted from Refer-
ence [25], Figure 10.11 is reprinted from Reference [32], with the permission of IEEE.
Figure 10.13 is reprinted from Reference [19], with the permission of ACM.
References
[1] R. Raskar and J. Tumblin, Computational Photography: Mastering New Techniques for
Lenses, Lighting, and Sensors. A K Peters Ltd., December 2009.
[2] E. Reinhard, G. Ward, S. Pattanaik, and P. Debevec, High Dynamic Range Imaging: Acquisi-
tion, Display, and Image-Based Lighting. San Francisco, CA: Morgan Kaufmann Publishers,
August 2005.
[3] BrightSide Technologies, “High dynamic range displays.” www.brightsidetech.com.
[4] P. Ledda, A. Chalmers, and H. Seetzen, “HDR displays: A validation against reality,” in
Proceedings of IEEE International Conference on Systems, Man and Cybernetics, The Hague,
Netherlands, October 2004, vol. 3, pp. 2777–2782.
[5] P. Ledda, A. Chalmers, T. Troscianko, and H. Seetzen, “Evaluation of tone mapping operators
using a high dynamic range display,” ACM Transactions on Graphics, vol. 24, no. 3, pp. 640–
648, August 2005.
[6] M. Čadík, M. Wimmer, L. Neumann, and A. Artusi, “Evaluation of HDR tone mapping meth-
ods using essential perceptual attributes,” Computers & Graphics, vol. 32, no. 3, pp. 330–349,
June 2008.
[7] A. Yoshida, V. Blanz, K. Myszkowski, and H.-P. Seidel, “Perceptual evaluation of tone map-
ping operators with real-world scenes,” Proceedings of SPIE, vol. 5666, pp. 192–203, January
2005.
[8] HDRshop 2.0. www.hdrshop.com.
[9] Idruna software, “Photogenics HDR.” www.idruna.com.
[27] G. Jing, C.E. Siong, and D. Rajan, “Foreground motion detection by difference-based spatial
temporal entropy image,” in Proceedings of the IEEE Region Ten Conference, Chiang Mai,
Thailand, November 2004, pp. 379–382.
[28] M. Granados, H.P. Seidel, and H. Lensch, “Background estimation from non-time sequence
images,” in Proceedings of the Graphics Interface Conference, Windsor, ON, Canada, May
2008, pp. 33–40.
[29] A. Agarwala, M. Dontcheva, M. Agrawala, S. Drucker, A. Colburn, B. Curless, D. Salesin,
and M. Cohen, “Interactive digital photomontage,” ACM Transactions on Graphics, vol. 23,
no. 3, pp. 294–302, August 2004.
[30] S. Cohen, “Background estimation as a labeling problem,” in Proceedings of the IEEE Inter-
national Conference on Computer Vision, Beijing, China, October 2005, pp. 1034–1041.
[31] M. Pedone and J. Heikkilä, “Constrain propagation for ghost removal in high dynamic range
images,” in Proceedings of the International Conference on Computer Vision Theory and Ap-
plications, Funchal, Portugal, January 2008, pp. 36–41.
[32] O. Gallo, N. Gelfand, W. Chen, M. Tico, and K. Pulli, “Artifact-free high dynamic range imag-
ing,” in Proceedings of the IEEE International Conference on Computational Photography,
San Francisco, CA, USA, April 2009.
[33] J.R. Bergen, P. Anandan, K.J. Hanna, and R. Hingorani, “Hierarchical model-based motion
estimation,” in Proceedings of the Second European Conference on Computer Vision, Santa
Margherita Ligure, Italy, May 1992, pp. 237–252.
[34] B.D. Lucas and T. Kanade, “An iterative image registration technique with an application
to stereo vision,” in Proceedings of the Seventh International Joint Conference on Artificial
Intelligence, Vancouver, BC, Canada, August 1981, pp. 674–679.
[35] S. Birchfield and C. Tomasi, “A pixel dissimilarity measure that is insensitive to image sam-
pling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 4,
pp. 401–406, April 1998.
[36] M.H. Kim and J. Kautz, “Characterization for high dynamic range imaging,” Computer
Graphics Forum, vol. 27, no. 2, pp. 691–697, April 2008.
11
Shadow Detection in Digital Images and Videos
11.1 Introduction
Shadow detection is an important preprocessing task and a hot topic in computer vision.
There exist numerous applications which vary in their motivations to address shadows in
acquired digital images and video. For example, in video surveillance [1], [2], aerial ex-
ploitation [3], and traffic monitoring [4] shadows are usually mentioned as harmful effects,
because they make it difficult to separate and track moving objects via background sub-
traction (Figure 11.1). In remote sensing, shadows may reduce the performance of change
detection techniques [5]. Similarly, in scene reconstruction it is a fundamental problem to
distinguish surface edges from illumination differences [6]. It should also be noted that
shadow-free images are commonly required for purely visual purposes [6].
On the other hand, shadows may be helpful phenomena in many situations. The so-called
shape from shading [8] methods derive the three-dimensional (3D) parameters of objects
based on estimated shadowing effects. Shadows also provide general descriptors for the
illumination conditions in scenes, which can be used for image and video indexing or event
analysis [9]. For example, the darkness of a shadow indicates whether an outdoor shot was
taken in sunlit or overcast weather; meanwhile the size and orientation of the shadow blobs
are related to the time and date of frame capture. If multiple shadows are observable with
different darkness, several light sources in the scene can be expected. Object extraction in
still images can also be facilitated by shadow detection. In aerial image analysis, it is often
necessary to detect static scene objects, such as buildings [10], [11] and trees [12], which
constitute a challenging pattern recognition problem. Note that even a noisy shadow map
is a valuable information source, because the object candidate regions can be estimated
as image areas lying next to the shadow blobs in the sun direction as demonstrated in
Figure 11.2.
As suggested above, shadow detection is a wide concept; different classes of approaches
should be separated depending on the environmental conditions and the exact goals of the
systems. This chapter focuses on the video surveillance problem, demonstrating some
challenges and solutions related to shadow detection in digital video. In surveillance video
streams, foreground areas usually contain the regions of interest, moreover, an accurate
object-silhouette mask can directly provide useful information for several applications, for
instance, people detection [13], [14], [15], vehicle detection [4], tracking [16], [17], bio-
metrical identification through gait recognition [18], [19], and activity analysis [7]. How-
ever, moving cast shadows on the background make it difficult to estimate shape [20] or
behavior [14] of moving objects, because they can be erroneously classified as part of the
foreground mask. Considering that under some illumination conditions more than half of
the nonbackground image areas may belong to cast shadows, their filtering has a crucial
role in scene analysis.
This chapter will build upon a few assumptions for the scene and the input data. First, the
camera is fixed and has no significant ego-motion. Static background objects are expected
(for example, there is no running river or flickering object in the background); therefore,
all motion is caused either by moving objects or by shadows. Moreover, an up-to-date
background estimate is expected at each moment; this can be obtained by the conventional
Gaussian mixture method [7]. There is one emissive light source in the scene (the sun or an artificial
source), but the presence of additional effects (for example, reflection), is considered; such
effects may change the spectrum of illumination locally. It is assumed that the estimated
background values of the pixels correspond to the illuminated surface points.
On the other hand, several properties of real situations are considered. The background
may change over time, due to varying lighting conditions and changes of static objects.
Crowded and empty scenarios may alternate, and background or shadow colored object
parts are expected. Due to the daily changes of the sun position and weather, shadow
properties may strongly alter as well.
Considering the above discussion, this chapter focuses on video and region-based shadow
modeling techniques. These techniques can be categorized with respect to the description
of the shadow-background color transform, which can be nonparametric and paramet-
ric [26]. Nonparametric techniques are often referred to as shadow invariant approaches,
since instead of detecting the shadows they remove them by converting the pixel values
into an illuminant invariant feature space. Usually a conventional color space transfor-
mation is applied to fulfill this task; the normalized RGB (or rg) [27], [28] and C1C2C3
spaces [29] purely contain chrominance color components which are less dependent on lu-
minance. Similar constancy of the hue channel in HSV space is exploited in Reference [30].
However, as Reference [29] points out, illumination invariant approaches have several lim-
itations regarding reflective surfaces and the lighting conditions of the scenes. Outdoors,
shadows will have a blue color cast (due to the sky), while lit regions have a yellow cast
(sunlight), hence the chrominance color values corresponding to the same surface point
in shadow and sunlight [25] may differ significantly. It was also found that the shadow
invariant methods often fail in outdoor scenes and are more usable in indoor scenes. More-
over, by ignoring the luminance components of the color, these models become sensitive to
noise.
Consequently, parametric models will be of interest in this chapter. First, the mean
background values of the individual pixels are estimated using a statistical background
model [7], then feature vectors from the actual and the estimated background values of the
pixels are extracted in order to model the feature domain of shadows in a probabilistic way.
Parametric shadow models may be categorized as local or global.
In a local shadow model [31], independent shadow processes are proposed for each
pixel. The local shadow parameters are trained using a second mixture model similar to
background-based training [7]. This way, the differences in the light absorption / reflection
properties of the scene points can be taken into account. However, each pixel should be
shadowed several times until its estimated parameters converge under unchanged illumina-
tion conditions, a hypothesis that is often not satisfied in surveillance videos.
In this chapter, Section 11.3 introduces a novel statistical shadow model. This model
follows an approach which characterizes shadows in an image using global parameters;
this approach describes the relation of the corresponding background and shadow color
values. Since this transformation is considered here as a random transformation affected
by a perturbation, illumination artifacts are taken into consideration. On the other hand, the
shadow parameters are derived from global image statistics; therefore, the model perfor-
mance is reasonable also on image regions where motion is rare.
Color space choice is a key issue in a number of methods; this problem will be intensively
studied in Section 11.4. The initial model presented in Section 11.3 will be extended to the
CIE L*u*v* space which allows measuring the perceptual distance between colors using
Euclidean distance [32] and in which the color components are approximately uncorrelated
with respect to camera noise and changes in illumination [33]. Since the model parameters
will be derived statistically, there is no need for accurate color calibration and the CIE D65
standard can be used. It is also not critical to consider an exact physical meaning of the
color components, which is usually environment-dependent [29], [34]; only an approximate
interpretation of the L, u, v components will be used and the validity of the model will be
shown via experiments.
TABLE 11.1
Comparison of various methods.
Section 11.4 presents a detailed qualitative and quantitative study of the color space se-
lection problem of shadow detection. In this sense, this section can be considered both
as the premise and generalization of Section 11.3. The choice of the CIE L*u*v* space
will be justified here, but on the other hand, experiments will refer to the previously in-
troduced model elements, extending their validity to various color spaces. The reason for
dedicating an independent part to this issue is that statistical feature modeling and color
space analysis are two different and, in themselves, composite aspects of shadow detec-
tion. Although interaction between the two approaches will be emphasized several times,
a separate discussion can help the clarity of presentation. Due to the various experiments,
the consequences of Section 11.4 may be more generally usable than in the context of the
proposed statistical model framework.
For validation, both real surveillance video shots and test sequences from a well-known
benchmark set [26] will be used. Table 11.1 summarizes the different goals and tools
regarding some of the above mentioned state-of-the-art methods and the proposed model.
For a detailed comparison see also Section 11.3.6.
Note that foreground modeling is not addressed in this chapter, since it has been exten-
sively covered in the literature. The simplest approach is using uniform foreground distri-
butions pfg = u [35] which is equivalent to outlier detection. More sophisticated models are
based on temporal foreground descriptions [36] or pixel state transition probabilities [37].
The model described below uses the spatial foreground calculus [1] which is insensitive
to the frame rate parameter of a video stream, thus ensuring robustness in surveillance
environments.
The background’s and shadow’s conditional density functions are defined in Sec-
tions 11.3.1 to 11.3.4, and the segmentation procedure will be presented in detail in Sec-
tion 11.3.6. Note that minimization will be done using the minus-logarithm of the global
probability term. Therefore, εφ (s) = − log pφ (s) will be used to denote the local energy
terms in order to simplify the notation.
with C = 2 log 2π. According to Equation 11.1, each feature contributes its own additive
term to the energy calculus. Therefore, the model is modular; the one-dimensional
model parameters, $[\mu_{k,i}(s), \sigma^2_{k,i}(s)]$, can be estimated separately.
The use of a Gaussian distribution to model the observed color of a single background
pixel is well established in the literature; the corresponding parameter estimation proce-
dures can be found in References [7] and [40]. The components of the background pa-
rameters [µ bg (s), σ bg (s)] can be trained in a similar manner to the conventional online
K-means algorithm [7]. The vector [µbg,L (s), µbg,u (s), µbg,v (s)]T estimates the mean back-
ground color of pixel s measured over the recent frames, whereas σ bg (s) is an adaptive
noise parameter. An efficient outlier filtering technique [7] excludes most of the nonback-
ground pixel values from the parameter estimation process, which works without user in-
teraction.
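As a rough illustration of this per-pixel training, the sketch below (assuming NumPy) maintains a single running Gaussian per pixel and channel; the full method of Reference [7] keeps a mixture of Gaussians and a more elaborate outlier filter, so this is only a simplified stand-in, with the learning rate and the outlier threshold k chosen arbitrarily.

```python
import numpy as np

def update_background(mu_bg, var_bg, frame, alpha=0.01, k=2.5):
    """Running per-pixel background mean and variance (single Gaussian).

    Pixels within k standard deviations of the current mean are treated as
    background and pulled into the running estimates; other pixels are
    skipped, which crudely mimics the outlier filtering of Reference [7].
    """
    diff = frame - mu_bg
    is_bg = np.abs(diff) < k * np.sqrt(var_bg)
    mu_bg = np.where(is_bg, mu_bg + alpha * diff, mu_bg)
    var_bg = np.where(is_bg, (1 - alpha) * var_bg + alpha * diff ** 2, var_bg)
    return mu_bg, var_bg
```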
As stated in Section 11.2, shadows are characterized by describing the background-
shadow color value transformation in the images. The shadow calculus is based on the
illumination-reflection model [41] introduced in Section 11.3.2. This model assumes con-
stant lighting, along with flat and Lambertian reflecting surfaces; however, video surveil-
lance scenes do not usually fulfill these requirements. Therefore, a probabilistic approach
FIGURE 11.3
Illustration of two illumination artifacts using the frame extracted from the Entrance am test sequence: (left)
input frame with “1” indicating the dark shadow part between the legs (more object parts change the reflected
light) and “2” indicating penumbra artifact near the edge of the shadow, (middle) segmented image using the
constant ratio model which causes errors, (right) segmented image using the proposed model which is more
robust.
presented in Section 11.3.3 will be used to describe the deviation of the scene from the
ideal surface assumptions in order to obtain more robust shadow detection (Figure 11.3).
where e(λ , s) is the illumination function at a given wavelength λ , the term ρ (s) depends
on the surface albedo and geometry, and ν (λ ) denotes the sensor sensitivity. Accordingly,
the difference between the shadowed and illuminated background values of a given surface
point is caused only by the different local value of e(λ , s). In outdoor scenes, the illu-
mination function observed in sunlight is the composition of the direct component (sun),
the Rayleigh scattering (sky), resulting in a blue tinge to ambient light [42], and residual
light components reflected from other objects. On the other hand, the effect of the direct
component is missing in the shadow.
Although the validity of Equation 11.2 is already limited by several scene assump-
tions [41], it is in general still too difficult to exploit appropriate information about the
corresponding background-shadow values since the components of the illumination func-
tion are unknown. Therefore, further strong simplifications are used in the applications.
According to Reference [6], the camera sensors must be exact Dirac delta functions
ν (λ ) = q0 · δ (λ − λ0 ) and the illumination must be Planckian [43]. In this case, Equa-
tion 11.2 implies the well-known constant ratio rule. Namely, the ratio of the shadowed
gsh (s) and illuminated value gbg (s) of a given surface point is considered to be constant
over the image, that is, gsh (s)/gbg (s) = A.
The constant ratio rule has been used in several applications [35], [37], [39]. Here the
shadow and background Gaussian terms corresponding to the same pixel are related via
a globally constant linear density transform. In this way, the results may be reasonable
when all the direct, diffused and reflected light can be considered constant over the scene.
FIGURE 11.4
Histograms of (a) ψL , (b) ψu , and (c) ψv values for (top) shadowed and (bottom) foreground points collected
over a 100-frame period of the video sequence Entrance pm (frame rate 1 fps).
However, the reflected light may vary over the image in case of several static or moving
objects, and the reflecting properties of the surfaces may differ significantly from the Lam-
bertian model (see Figure 11.3). The efficiency of the constant ratio model is also restricted
by several practical reasons, like quantization errors of the sensor values, saturation of the
sensors, imprecise estimation of gbg (s) and A, or video compression artifacts. Based on the
experiments presented in Section 11.3.6, these inaccuracies cause poor detection rates in
some outdoor scenes.
Using Equations 11.3 and 11.4, the color values in the shadow at a given pixel position are
also generated by a Gaussian distribution, that is,
scene; therefore, adaptive algorithms should be used in this case. Note that as discussed
in Section 11.3.1, only the one dimensional marginal distribution parameters should be es-
timated for the background and shadow processes. The background parameter estimation
and update procedure are automated, based on the work in Reference [7] which presents
reasonable results and it is computationally more effective than the standard expectation-
maximization algorithm.
As shown in Figure 11.5, changes in global illumination significantly alter the shadow
properties. Moreover, changes can occur rapidly; in indoor scenes due to switching on/off
different light sources and in outdoor scenes due to the appearance of clouds. Regarding
the shadow parameter settings, parameter initialization and re-estimation are discriminated.
From a practical point of view, initialization may be supervised by marking shadowed re-
gions in a few video frames by hand, once after switching on the system. Maximum likeli-
hood estimates of the shadow parameters can be calculated based on the training data. On
the other hand, there is usually no opportunity for continuous user interaction in an auto-
mated surveillance environment; thus, the system must adapt to illumination changes by
invoking an automatic re-estimation procedure. Therefore, supervised initializa-
tion is used here. The parameter adaptation process will be described later.
According to Section 11.3.3, the shadow process has six parameters, stored in three-
component vectors µ ψ and σ ψ . Figure 11.6a shows the one-dimensional histograms for the
ψL , ψu and ψv values of shadowed points for each video shot. It can be observed that while
the variation of parameters σ ψ , µψ ,u and µψ ,v are low, µψ ,L varies in time significantly.
Therefore, the parameters should be updated in two different ways.
FIGURE 11.6
Extracted ψ statistics from four sequences recorded by the entrance camera of the university campus: (a)
shadow statistics, (b) nonbackground statistics, (left) ψL , (middle) ψu , and (right) ψv . Rows correspond to
video shots from different parts of the day.
condition dependent. In outdoor scenes, it can vary between 0.6 in direct sunlight and 0.95
in overcast weather. The simple re-estimation from the previous section does not work in
this case, since the illumination properties between time t and t + T may change a lot,
which would result in falsely detected shadow values in the set Wt , providing false Mt
and Dt parameters for the re-estimation procedure.
For this reason, the actual µψ ,L value is obtained using the statistics of all nonbackground
ψL values (with background filtering performed using the Stauffer-Grimson algorithm).
Figure 11.6b shows that the peaks of the nonbackground ψL -histograms are approximately
in the same location as they were in Figure 11.6a. The videos of the first and second rows
were recorded around noon, where the shadows were relatively small, but the peak is still
in the right location.
The previous experiments encourage identifying µψ ,L with the location of the peak on
the nonbackground ψL -histograms for the scene. The µψ ,L value is updated as depicted
in Algorithm 11.1. Namely, using a data structure [ψL ,t] which contains a ψL value with
its timestamp, the latest occurring [ψL ,t] pairs of the nonbackground points in a set Q are
stored, and the histogram hL of the ψL values in Q are continuously updated. The key
point is the management of set Q; therefore, two parameters, MAX and MIN, are defined
to control the size of Q.
1. For each frame t, determine $\Psi_t = \{ [\psi_L^{[t]}(s), t] \mid s \in S,\ \omega^{[t]}(s) \neq \mathrm{bg} \}$.
2. Append $\Psi_t$ to Q.
3. Remove elements from Q as follows:
   • if |Q| < MIN, keep all the elements,
   • if |Q| ≥ MIN, find the oldest timestamp te in Q and remove all the elements from Q with timestamp te .
4. If |Q| > MAX after step 3, remove further “old” elements from Q in order of their timestamp until |Q| ≤ MAX is reached.
5. Update the histogram hL regarding Q and apply $\mu_{\psi,L}^{[t+1]} = \arg\max\{h_L\}$.
Consequently, Q always contains the latest available ψL values. The algorithm keeps
the size of Q between prescribed bounds MAX and MIN ensuring the topicality and rele-
vancy of the data contained. The actual size of Q is around MAX in the case of cluttered
scenarios. In the case of few or no motions in the scene, the size of Q decreases until it
reaches MIN. This increases the influence of the forthcoming elements, and causes quicker
adaptation, since it is faster to modify the shape of a smaller histogram.
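A possible Python implementation of this update is sketched below (assuming NumPy); the histogram range, the bin count, and the MIN and MAX bounds are hypothetical settings.

```python
import numpy as np
from collections import deque

class MuPsiLUpdater:
    """Sliding statistics for mu_psi_L following Algorithm 11.1."""

    def __init__(self, n_bins=100, psi_range=(0.0, 1.5),
                 min_size=2000, max_size=20000):
        self.q = deque()                       # stores (psi_L, timestamp) pairs
        self.n_bins, self.psi_range = n_bins, psi_range
        self.min_size, self.max_size = min_size, max_size

    def update(self, psi_l_nonbg, t):
        # Steps 1-2: append the psi_L values of nonbackground pixels of frame t.
        self.q.extend((v, t) for v in psi_l_nonbg)
        # Step 3: if large enough, drop all elements with the oldest timestamp.
        if len(self.q) >= self.min_size:
            t_old = self.q[0][1]
            while self.q and self.q[0][1] == t_old:
                self.q.popleft()
        # Step 4: enforce the upper bound by removing further old elements.
        while len(self.q) > self.max_size:
            self.q.popleft()
        # Step 5: the histogram peak gives the new mu_psi_L.
        values = np.fromiter((v for v, _ in self.q), dtype=float)
        hist, edges = np.histogram(values, bins=self.n_bins, range=self.psi_range)
        peak = np.argmax(hist)
        return 0.5 * (edges[peak] + edges[peak + 1])
```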
The parameter σψ ,L is updated similarly to σψ ,u but only in the time periods when µψ ,L
does not change significantly. Note that the above update process may fail in shadow-free
scenarios. However, that case occurs mostly under artificial illumination conditions, where
the shadow detector can be switched off.
TABLE 11.2

Method / Reference   [31]      [36]        [37]      Proposed
Classes              3         2           3         3
MRF optimization     —         graph cut   ICM       ICM
Frame-rate           10 fps    11 fps      1-2 fps   3 fps
the global labeling, $\hat{\omega}$, defined as follows:
$$ \hat{\omega} = \arg\min_{\omega \in \Omega} \left( \sum_{s \in S} \underbrace{-\log P\big(o(s) \mid \omega(s)\big)}_{\varepsilon_{\omega(s)}(s)} + \sum_{r,s \in S} \Theta\big(\omega(r), \omega(s)\big) \right), \qquad (11.13) $$
where the minimum is searched over all the possible segmentations (Ω) of a given input
frame. The first part of Equation 11.13 contains the sum of the local class-energy terms
regarding the pixels of the image (see Equation 11.1). The second part is responsible for
the smooth segmentation; with $\Theta(\omega(r), \omega(s)) = 0$ if s and r are not neighboring pixels,
and
$$ \Theta\big(\omega(r), \omega(s)\big) = \begin{cases} -\delta & \text{if } \omega(r) = \omega(s), \\ +\delta & \text{if } \omega(r) \neq \omega(s). \end{cases} \qquad (11.14) $$
As for optimization, the deterministic modified Metropolis (MMD) [46] relaxation
method was found similarly efficient but significantly faster for this task than the original
stochastic algorithm [47]. Namely, it runs about 1 fps when processing 320 × 240 images
whereas the running speed of the ICM method [48] with the proposed model is 3 fps in
exchange for some degradation in the segmentation results. For comparison, frame-rates of
three latest reference methods are shown in Table 11.2. It can be observed that the proposed
model has approximately the same complexity as [37]. Although the processing speed of
methods in References [31] and [36] is notably higher, one should consider that the method
in Reference [31] does not use any spatial smoothing (like MRF), thus requiring a separate
noise filter in the postprocessing phase. On the other hand, the method in Reference [36]
performs only a two-class segmentation (background and foreground). That simplification
enables using the quick graph-cut based MRF optimization techniques, which is not the
case for three classes [49].
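For concreteness, a brute-force Python sketch of an ICM relaxation of Equations 11.13 and 11.14 is given below (assuming NumPy); it takes a precomputed array of local class energies for the background, shadow, and foreground classes, and the value of delta and the number of sweeps are illustrative.

```python
import numpy as np

def icm_segmentation(eps, delta=1.0, n_iters=5):
    """ICM relaxation of the energy in Equation 11.13 with the smoothness
    term of Equation 11.14 on a 4-neighborhood.

    eps : array (H, W, 3) of local class energies -log P(o(s) | class)
    """
    h, w, n_classes = eps.shape
    labels = np.argmin(eps, axis=-1)            # initialize with the local optimum
    for _ in range(n_iters):
        for y in range(h):
            for x in range(w):
                best_c, best_e = labels[y, x], np.inf
                for c in range(n_classes):
                    e = eps[y, x, c]
                    # smoothness: -delta for agreeing neighbors, +delta otherwise
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            e += -delta if labels[ny, nx] == c else delta
                    if e < best_e:
                        best_c, best_e = c, e
                labels[y, x] = best_c
    return labels
```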
FIGURE 11.7
Shadow model validation: (a) video image, (b) C1C2C3 space-based illumination invariants [29], (c) constant
ratio model of Reference [35] without object-based postprocessing, and (d) proposed statistical shadow model.
Test image sequences: (top) Laboratory, (middle) Highway, and (bottom) Entrance am.
• Highway video (ATON benchmark set). This sequence contains dark shadows
but homogenous background without illumination artifacts. In contrast to Refer-
ence [35], the proposed method reaches the appropriate results without postprocess-
ing, which is strongly environment-dependent.
• Four surveillance video sequences captured by the entrance (outdoor) camera of the
university campus in different lighting conditions (Figure 11.5: Entrance am, En-
trance noon, Entrance pm, and Entrance overcast). These sequences contain difficult
illumination and reflection effects and suffer from sensor saturation (dark objects and
shadows). Here, the presented model improves the segmentation results significantly
versus previous methods.
Figure 11.7 shows the results of different shadow detectors. For the sake of comparison,
an illumination invariant method based on Reference [29] and a constant ratio model [35]
were implemented in the same framework. It was observed that the performance differ-
ences between the previous and the proposed methods increase with surveillance scene
complexity. In the Laboratory sequence, the constant ratio and the proposed method are
similarly accurate. For the Highway video, the illumination invariant and constant ratio
methods approximately find objects without shadows; these results are, however, much
TABLE 11.3
Overview of the evaluation parameters.
noisier compared to that of the proposed model. The illumination invariant method fails
completely on the Entrance am surveillance video, as shadows are not removed while the
foreground component is noisy due to the lack of using luminance features in the model.
The constant ratio model also produces poor results; due to the long shadows and various
field objects the constant ratio model becomes inaccurate. The proposed model handles
these artifacts in a robust way.
The quantitative evaluations are done through manually generated ground-truth se-
quences. Since the application’s goal is foreground detection, the crossover between
shadow and background does not account for errors. Denoting the number of correctly
identified foreground pixels by TP, misclassified background pixels by FP, and misclassi-
fied foreground pixels of evaluation images by FN, the evaluation metrics consisting of the
Recall rate, Rc, and the Precision of the detection, Pr, can be defined as follows:
$$ \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}. \qquad (11.15) $$
In addition to these measures, the so-called F-measure (FM) [50]
$$ FM = \frac{2 \cdot Rc \cdot Pr}{Rc + Pr} \qquad (11.16) $$
which combines Rc and Pr in a single efficiency measure (it is the harmonic mean of Rc
and Pr) will be also used. It should be noted that while Rc and Pr have to be used jointly
to characterize a given algorithm, FM constitutes a stand-alone evaluation metric.
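These measures are straightforward to compute from binary masks, as the following sketch (assuming NumPy) shows; following the evaluation protocol above, only the foreground versus non-foreground decision is scored, so shadow and background pixels are merged before calling it.

```python
import numpy as np

def foreground_metrics(detected_fg, gt_fg):
    """Recall, Precision, and F-measure of a binary foreground mask
    (Equations 11.15 and 11.16)."""
    tp = np.logical_and(detected_fg, gt_fg).sum()
    fp = np.logical_and(detected_fg, ~gt_fg).sum()
    fn = np.logical_and(~detected_fg, gt_fg).sum()
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    fm = (2 * recall * precision / (recall + precision)
          if recall + precision else 0.0)
    return recall, precision, fm
```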
TABLE 11.4
Quantitative evaluation results.
Dataset CR SS CR SS CR SS
FIGURE 11.8
Distribution of the shadowed ψ values in simultaneous sequences from a street scenario recorded by differ-
ent CCD cameras: (a,c-e) three-sensor camera recorder, (b,f-h) digital camera with the Bayer CFA; (c,f) L
component, (d,g) u component, and (e,h) v component.
For numerical validation, 861 frames were used from the Laboratory, Highway, Entrance
am, Entrance noon, and Entrance pm sequences. Table 11.3 lists some details about these
test sets. Table 11.4 compares the detection results of the proposed method and the constant
ratio model, showing that the proposed shadow calculus improves the precision rate as
it significantly decreases the number of false negative shadow pixels and simultaneously
preserves the high foreground recall rate. Consequently, the proposed model outperforms
the constant ratio method on all test sequences in terms of the FM-measure.
TABLE 11.5
Color space selection in the state-of-the-art methods.

Reference   Color space        PPCC
[28]        rg                 invariant
[29]        C1C2C3             invariant
[27]        rg                 invariant
[35]        RGB                1
[39]        grayscale          2
[37]        grayscale          2
[51]        HSV                1.33
[31]        CIE L*u*v*         2
[52]        CIE L*a*b*/HSV     —
[53]        RGB                —‡
[2]         all from above     2

PPCC – the average number of shadow parameters for one color channel in parametric methods; ‡ proportional
to the number of support vectors after training.
tion of color spaces for shadow edge classification can be found in Reference [25] and that
this chapter addresses detection of the shadowed and foreground regions, which is a fairly
different problem.
For the above reasons, this section aims to experimentally compare different color mod-
els for the purpose of cast shadow detection in digital video. Since the validity of such
experiments is limited to the examined model structures, it is important to make the com-
parison in a relevant framework. Taking a general approach, the task is considered as a
classification problem in the space of the extracted features, describing the different cluster
domains with relatively few free parameters. Note that most models in Table 11.5 use two
parameters for each color channel; drawbacks of methods which use fewer parameters were
discussed in Section 11.3.
Popular models can be categorized as deterministic (per pixel) [51] or statistical (prob-
abilistic) [35]. Up to now, this chapter has only dealt with statistical models, since these
models have proved to be advantageous considering the whole segmentation process. A deterministic
method is introduced here: the pixels are classified independently, and then the rate of
correct pixel classification is investigated. In this way, a relevant quantitative
comparison of the different color spaces can be achieved, because the decision for
each pixel depends only on the corresponding local color-feature value. Note that post-
processing and prior effects whose efficiency may be environment-dependent will not be
considered. A probabilistic interpretation of this model will be given and used in the MRF
framework which was introduced in Section 11.3. The results after MRF optimization will
be compared both qualitatively and quantitatively.
FIGURE 11.9
One dimensional projection of histograms of ψ values in the Entrance pm test sequence: (top) shadow, (bot-
tom) foreground; (a) C1C2C3 , (b) HSV, (c) RGB, and (d) L*u*v*.
\psi_i(s) = \frac{o_i(s)}{\mu_{\mathrm{bg},i}(s)}, \quad i = 0, 1, 2, \qquad (11.17)
if i is the index of a chrominance component. The descriptors in grayscale and in the rg space
are defined similarly to Equations 11.17 and 11.18, considering that ψ will be a scalar and
a two-dimensional vector, respectively.
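For the ratio-type components, the descriptor of Equation 11.17 amounts to a per-channel division of the observed frame by the background mean image. A minimal NumPy sketch is given below; the array names and the small constant guarding against division by zero are assumptions, and the luminance case of Equation 11.18 (not reproduced in this excerpt) is not covered.

import numpy as np

def shadow_descriptor(frame, bg_mean, eps=1e-6):
    """Per-pixel ratio feature psi_i(s) = o_i(s) / mu_bg,i(s) (Equation 11.17).

    frame   : H x W x 3 observed image in the chosen color space
    bg_mean : H x W x 3 background mean image in the same color space
    """
    return frame.astype(np.float64) / (bg_mean.astype(np.float64) + eps)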
The efficiency of the proposed feature selection regarding three color spaces is demon-
strated in Figure 11.9 on the plots of one-dimensional marginal histograms of the ψ0 ,
ψ1 , and ψ2 values for manually marked shadowed and foreground points of a 75-frame
long outdoor surveillance video sequence, Entrance pm. Apart from some outliers, the
shadowed ψi values lie in a short interval for each color space and each color component,
while the difference between the upper and lower bounds of the foreground values is
usually greater.
TABLE 11.6
Luminance and chrominance channels in different color spaces.

Color space:   grayscale | rg  | C1C2C3     | HSV | RGB   | CIE L*a*b* | CIE L*u*v*
Luminance:     g         | —   | —          | H   | R,G,B | L*         | L*
Chrominance:   —         | r,g | C1,C2,C3   | S,V | —     | a*,b*      | u*,v*
where [a0 , a1 , a2 ] are the coordinates of the ellipsoid center and (b0 , b1 , b2 ) are the semi-axis
lengths. In other words, [a0 , a1 , a2 ] is equivalent to the mean ψ (s) value of shadowed pixels
in a given scene, while b0 , b1 , and b2 depend on the spatiotemporal variance of the ψ (s)
measurements under shadows. It will be shown later that the similarity to the µ ψ and σ ψ
parameters from Section 11.3 is not by chance; thus, parameter adaptation can also be done
in a similar manner.
Note that with the SVM method [53], the number of free parameters is related to the
number of support vectors, which can be much greater than the six scalars of the proposed
model. Moreover, a new SVM must be trained for each situation. Also note that one
could use an arbitrarily oriented ellipsoid, but compared to Equation 11.19, it is more diffi-
cult to define since it needs the accurate estimation of nine parameters. The domain defined
by Equation 11.19 becomes an interval for grayscale images and a two dimensional ellipse
for the rg space.
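The ellipsoidal shadow domain described above admits a very compact per-pixel test. The sketch below assumes the standard axis-aligned form implied by the text, with center [a0, a1, a2] and semi-axis lengths (b0, b1, b2); Equation 11.19 itself is not reproduced in this excerpt, so this is an interpretation rather than the authors' exact formulation.

import numpy as np

def is_shadow(psi, center, semi_axes):
    """Deterministic shadow test: psi(s) is labeled shadow when it falls inside
    the axis-aligned ellipsoid with the given center and semi-axis lengths
    (assumed form of Equation 11.19)."""
    psi = np.asarray(psi, dtype=np.float64)
    center = np.asarray(center, dtype=np.float64)
    semi_axes = np.asarray(semi_axes, dtype=np.float64)
    # Sum over the last axis so the test also works on an H x W x 3 psi image.
    return np.sum(((psi - center) / semi_axes) ** 2, axis=-1) <= 1.0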
Figure 11.10 shows two-dimensional scatter plots of the foreground and shadow
ψ values. It can be observed that the components of the vector ψ are strongly correlated in the
RGB space (and also in C1C2C3) and that the previously defined ellipse cannot present a
narrow boundary. In HSV space, the shadowed values are not within a convex hull, even
if the hue component is considered periodic (hue = 2kπ means the same color for each
k = 0, 1, . . .). Based on the above facts, the CIE L*u*v* space seems to be a good choice.
In the following, this statement will be supported by numerical results.
FIGURE 11.11
Evaluation of the deterministic model using recall-precision curves corresponding to different parameter
settings for the (a) Laboratory and (b) Entrance pm sequences. The compared color spaces are grayscale, rg,
C1C2C3, HSV, RGB, CIE L*a*b*, and CIE L*u*v*.
The CIE L*a*b* and L*u*v* spaces yield the best results (uppermost Pr/Rc curves) in both cases. However, the relative performance of the other color systems
strongly varies for these two test videos. In the indoor scene, the grayscale and RGB
segmentation procedures are less efficient than the other ones, whereas for the outdoor
scenes the performance of the chrominance spaces is notably poor.
Table 11.7 shows the achieved FM rates. Also here, it can be seen that the CIE L*a*b*
and L*u*v* spaces are the most efficient. As for the other color systems, in sequences
containing dark shadows (Entrance pm, Highway), the chrominance spaces produce poor
results, while the luminance and mixed spaces are similarly effective. Performance of
the chrominance spaces is reasonable for brighter shadows (Entrance am, Laboratory),
whereas the luminance spaces are relatively poor. In the latter case, the color constancy
of the chrominance channels seems to be more relevant than the luminance-darkening do-
main. It was also observed that the hue coordinate in HSV is very sensitive to illumination
artifacts (see also Figure 11.9), thus the HSV space is more efficient in the case of light
shadows. Table 11.8 summarizes the relationship between the darkness of shadows and the
performance of various color spaces. In this table, darkness is characterized by the mean
of the grayscale ψ0 values of shadowed points.
TABLE 11.7
Evaluation of the deterministic model (Equation 11.16) using the FM measure.
FIGURE 11.12
MRF segmentation results with different color models: (a) input video frame, (b) grayscale, (c) C1C2C3, (d) HSV, (e) RGB, and (f) CIE L*u*v*. Test sequences from top
to bottom: Laboratory, Highway, Entrance am, Entrance pm, and Entrance noon.
where

\mu_\psi = [a_0, a_1, a_2]^T, \quad \Sigma_\psi = \mathrm{diag}\{b_0^2, b_1^2, b_2^2\}, \quad t = (2\pi)^{-3/2} (b_0 b_1 b_2)^{-1} e^{-1/2}. \qquad (11.23)
In the following, the previously defined probability density functions will be used in the
MRF model in a straightforward way, as psh (s) = f (ψ (s)). The flexibility of this MRF
model comes from the fact that ψ (s) shadow descriptors are defined for different color
spaces differently (see Section 11.4.2).
TABLE 11.9
Evaluation of the MRF model using F* coefficients.
The experimental results described in this section confirm that appropriate color space selection is also crucial in these
applications, and that the CIE L*u*v* space is preferred for this task.
11.5 Conclusion
This chapter examined the color modeling problem of cast shadows, focusing on video
surveillance applications. A novel adaptive model for shadow segmentation without strong
restrictions on a priori probabilities, image quality, objects’ shapes and processing speed
was introduced. The proposed modeling framework was generalized for different color
spaces and used to compare these color spaces in detail. It was observed that appropriate
color space selection is an important issue in classification, and that the CIE L*u*v*
space is the most efficient both in color-based clustering of the individual pixels and in
Bayesian foreground-background-shadow segmentation. The proposed method
was validated on various video shots, including well-known benchmark videos and real-
life surveillance sequences, indoor and outdoor shots, which contain both dark and light
shadows. Experimental results showed the advantages of the proposed statistical approach
over earlier methods.
References
[1] C. Benedek and T. Szirányi, “Bayesian foreground and shadow detection in uncertain frame
rate surveillance videos,” IEEE Transactions on Image Processing, vol. 17, no. 4, pp. 608–621,
April 2008.
[2] C. Benedek and T. Szirányi, “Study on color space selection for detecting cast shadows in
video surveillance,” International Journal of Imaging Systems and Technology, vol. 17, no. 3,
pp. 190–201, June 2007.
[3] C. Benedek, T. Szirányi, Z. Kato, and J. Zerubia, “Detection of object motion regions in aerial
image pairs with a multi-layer Markovian model,” IEEE Transactions on Image Processing,
vol. 18, no. 10, pp. 2303–2315, October 2009.
[4] J. Kato, T. Watanabe, S. Joga, L. Ying, and H. Hase, “An HMM/MRF-based stochastic frame-
work for robust vehicle tracking,” IEEE Transactions on Intelligent Transportation Systems,
vol. 5, no. 3, pp. 142–154, September 2004.
[5] C. Benedek and T. Szirányi, “Change detection in optical aerial images by a multi-layer condi-
tional mixed Markov model,” IEEE Transactions on Geoscience and Remote Sensing, vol. 47,
no. 10, pp. 3416–3430, October 2009.
[6] G.D. Finlayson, S.D. Hordley, C. Lu, and M.S. Drew, “On the removal of shadows from images,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 1, pp. 59–68,
January 2006.
[7] C. Stauffer and W.E.L. Grimson, “Learning patterns of activity using real-time tracking,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 747–757, Au-
gust 2000.
Shadow Detection in Digital Images and Videos 309
[8] R. Zhang, P.-S. Tsai, J.E. Cryer, and M. Shah, “Shape from shading: A
survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 8,
pp. 690–706, August 1999.
[9] Z. Szlávik, L. Kovács, L. Havasi, C. Benedek, I. Petrás, A. Utasi, A. Licsár, L. Czúni, and
T. Szirányi, “Behavior and event detection for annotation and surveillance,” in Proceedings of
the International Workshop on Content-Based Multimedia Indexing, London, UK, June 2008,
pp. 117–124.
[10] A. Katartzis and H. Sahli, “A stochastic framework for the identification of building rooftops
using a single remote sensing image,” IEEE Transactions on Geoscience and Remote Sensing,
vol. 46, no. 1, pp. 259–271, January 2008.
[11] B. Sirmacek and C. Unsalan, “Building detection from aerial imagery using invariant color
features and shadow information,” in Proceedings of the International Symposium on Com-
puter and Information Sciences, Istanbul, Turkey, October 2008, pp. 1–5.
[12] G. Perrin, X. Descombes, and J. Zerubia, “2D and 3D vegetation resource parameters as-
sessment using marked point processes,” in Proceedings of the International Conference on
Pattern Recognition, Hong-Kong, August 2006, pp. 1–4.
[13] R. Cutler and L.S. Davis, “Robust real-time periodic motion detection, analysis, and appli-
cations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8,
pp. 781–796, August 2000.
[14] L. Havasi, Z. Szlávik, and T. Szirányi, “Higher order symmetry for non-linear classification of
human walk detection,” Pattern Recognition Letters, vol. 27, no. 7, pp. 822–829, May 2006.
[15] L. Havasi, Z. Szlávik, and T. Szirányi, “Detection of gait characteristics for scene registra-
tion in video surveillance system,” IEEE Transactions on Image Processing, vol. 16, no. 2,
pp. 503–510, February 2007.
[16] L. Czúni and T. Szirányi, “Motion segmentation and tracking with edge relaxation and op-
timization using fully parallel methods in the cellular nonlinear network architecture,” Real-
Time Imaging, vol. 7, no. 1, pp. 77–95, February 2001.
[17] Z. Zivkovic, Motion Detection and Object Tracking in Image Sequences, PhD thesis,
University of Twente, 2003.
[18] J.B. Hayfron-Acquah, M.S. Nixon, and J.N. Carter, “Human identification by spatio-temporal
symmetry,” in Proceedings of the International Conference on Pattern Recognition, Washing-
ton, DC, USA, August 2002, pp. 632–635.
[19] L. Wang, T. Tan, H. Ning, and W. Hu, “Silhouette analysis-based gait recognition for hu-
man identification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25,
no. 12, pp. 1505–1518, December 2003.
[20] S.C. Zhu and A.L. Yuille, “A flexible object recognition and modeling system,” International
Journal of Computer Vision, vol. 20, no. 3, pp. 187–212, October 1996.
[21] L. Havasi and T. Szirányi, “Estimation of vanishing point in camera-mirror scenes using
video,” Optics Letters, vol. 31, no. 10, pp. 1411–1413, May 2006.
[22] A. Yoneyama, C.H. Yeh, and C.C.J. Kuo, “Moving cast shadow elimination for robust vehicle
extraction based on 2D joint vehicle/shadow models,” in Proceedings of the IEEE Conference
on Advanced Video and Signal Based Surveillance, Miami, FL, USA, July 2003, p. 229.
[23] C. Fredembach and G.D. Finlayson, “Hamiltonian path based shadow removal,” in Proceed-
ings of the British Machine Vision Conference, Oxford, UK, September 2005, pp. 970–980.
[24] T. Gevers and H. Stokman, “Classifying color edges in video into shadow-geometry, highlight,
or material transitions,” IEEE Transactions on Multimedia, vol. 5, no. 2, pp. 237–243, June
2003.
310 Computational Photography: Methods and Applications
[25] E.A. Khan and E. Reinhard, “Evaluation of color spaces for edge classification in outdoor
scenes,” in Proceedings of the International Conference on Image Processing, Genoa, Italy,
September 2005, pp. 952–955.
[26] A. Prati, I. Mikic, M.M. Trivedi, and R. Cucchiara, “Detecting moving shadows: Algorithms
and evaluation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25,
no. 7, pp. 918–923, July 2003.
[27] N. Paragios and V. Ramesh, “A MRF-based real-time approach for subway monitoring,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hawaii,
USA, December 2001, pp. 1034–1040.
[28] A. Cavallaro, E. Salvador, and T. Ebrahimi, “Detecting shadows in image sequences,” in Pro-
ceedings of the European Conference on Visual Media Production, London, UK, March
2004, pp. 167–174.
[29] E. Salvador, A. Cavallaro, and T. Ebrahimi, “Cast shadow segmentation using invariant color
features,” Computer Vision and Image Understanding, vol. 95, no. 2, pp. 238–259, August
2004.
[30] F. Porikli and J. Thornton, “Shadow flow: A recursive method to learn moving cast shadows,”
in Proceedings of the IEEE International Conference on Computer Vision, Beijing, China,
October 2005, pp. 891–898.
[31] N. Martel-Brisson and A. Zaccarin, “Moving cast shadow detection from a Gaussian mixture
shadow model,” in Proceedings of the IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, San Diego, CA, USA, June 2005, pp. 643–648.
[32] Y. Haeghen, J. Naeyaert, I. Lemahieu, and W. Philips, “An imaging system with calibrated
color image acquisition for use in dermatology,” IEEE Transactions on Medical Imaging,
vol. 19, no. 7, pp. 722–730, July 2000.
[33] M.G.A. Thomson, R.J. Paltridge, T. Yates, and S. Westland, “Color spaces for discrimination
and categorization in natural scenes,” in Proceedings of Congress of the International Colour
Association, Rochester, NY, USA, June 2002, pp. 877–880.
[34] T. Gevers and A.W. Smeulders, “Color based object recognition,” Pattern Recognition, vol. 32,
no. 3, pp. 453–464, March 1999.
[35] I. Mikic, P. Cosman, G. Kogut, and M.M. Trivedi, “Moving shadow and object detection
in traffic scenes,” in Proceedings of the International Conference on Pattern Recognition,
Barcelona, Spain, September 2000, pp. 321–324.
[36] Y. Sheikh and M. Shah, “Bayesian modeling of dynamic scenes for object detection,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1778–1792,
November 2005.
[37] Y. Wang, K.F. Loe, and J.K. Wu, “A dynamic conditional random field model for foreground
and shadow segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 28, no. 2, pp. 279–289, February 2006.
[38] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions and the Bayesian restora-
tion of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6,
no. 6, pp. 721–741, June 1984.
[39] J. Rittscher, J. Kato, S. Joga, and A. Blake, “An HMM-based segmentation method for traffic
monitoring,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 9,
pp. 1291–1296, September 2002.
[40] D.S. Lee, “Effective Gaussian mixture learning for video background subtraction,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 827–832, May
2005.
Shadow Detection in Digital Images and Videos 311
[41] D.A. Forsyth, “A novel algorithm for color constancy,” International Journal of Computer
Vision, vol. 5, no. 1, pp. 5–36, January 1990.
[42] D.K. Lynch and W. Livingston, Color and Light in Nature, UK: Cambridge University Press,
1995.
[43] G. Wyszecki and W. Stiles, Color Science: Concepts and Methods, Quantitative Data and
Formulas, 2nd Edition, USA: John Wiley & Sons, 1982.
[44] Y. Wang and T. Tan, “Adaptive foreground and shadow detection in image sequences,” in
Proceedings of the International Conference on Pattern Recognition, Quebec, Canada, August
2002, pp. 983–986.
[45] R. Potts, “Some generalized order-disorder transformations,” Proceedings of the Cambridge
Philosophical Society, vol. 48, no. 1, p. 106, January 1952.
[46] Z. Kato, J. Zerubia, and M. Berthod, “Satellite image classification using a modified Metropo-
lis dynamics,” in Proceedings of the International Conference on Acoustics, Speech and Signal
Processing, San Francisco, CA, USA, March 1992, pp. 573–576.
[47] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, “Equation of state cal-
culations by fast computing machines,” Journal of Chemical Physics, vol. 21, no. 6, pp. 1087–
1092, June 1953.
[48] J. Besag, “On the statistical analysis of dirty pictures,” Journal of the Royal Statistical Society, Series B,
vol. 48, no. 3, pp. 259–302, March 1986.
[49] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222–
1239, November 2001.
[50] C.J. van Rijsbergen, Information Retrieval, 2nd Edition, London, UK: Butterworths, 1979.
[51] R. Cucchiara, C. Grana, G. Neri, M. Piccardi, and A. Prati, “The Sakbot system for moving
object detection and tracking,” in Video-Based Surveillance Systems-Computer Vision and
Distributed Processing, Boston, MA, USA, November 2001, pp. 145–157.
[52] M. Rautiainen, T. Ojala, and H. Kauniskangas, “Detecting perceptual color changes from
sequential images for scene surveillance,” IEICE Transactions on Information and Systems,
vol. 84, no. 12, pp. 1676–1683, December 2001.
[53] K. Siala, M. Chakchouk, F. Chaieb, and O. Besbes, “Moving shadow detection with support
vector domain description in the color ratios space,” in Proceedings of the International Con-
ference on Pattern Recognition, Cambridge, UK, August 2004, pp. 384–387.
[54] V. Meas-Yedid, E. Glory, E. Morelon, C. Pinset, G. Stamon, and J.C. Olivo-Marin, “Automatic
color space selection for biological image segmentation,” in Proceedings of the International
Conference on Pattern Recognition, Washington, DC, USA, August 2004, pp. 514–517.
[55] P. Guo and M.R. Lyu, “A study on color space selection for determining image segmentation
region number,” in Proceedings of the International Conference on Artificial Intelligence, Las
Vegas, NV, USA, June 2000, pp. 1127–1132.
[56] A. Prati, I. Mikic, C. Grana, and M. Trivedi, “Shadow detection algorithms for traffic flow
analysis: A comparative study,” in Proceedings of the IEEE Intelligent Transportation Systems
Conference, Oakland, CA, USA, August 2001, pp. 340–345.
[57] M. Tkalcic and J. Tasic, “Colour spaces - perceptual, historical and applicational background,”
in Proceedings of Eurocon, Ljubljana, Slovenia, September 2003, pp. 304–308.
12
Document Image Rectification Using Single-View or
Two-View Camera Input
12.1 Introduction
Digital cameras have several advantages, for instance, portability and fast response, over
flatbed scanners. Therefore, there have been a number of attempts to replace flatbed scan-
ners with digital cameras. Unfortunately, camera captured images often suffer from per-
spective distortions due to oblique shot angle, geometric distortions caused by curved book
surfaces, specular reflections, and unevenness of brightness due to uncontrolled illumina-
tion and vignetting. Hence, their visual quality is usually inferior to that of flatbed-scanned images,
and the optical character recognition (OCR) rate is also low. Camera-captured document
images thus need to be enhanced to alleviate these problems and to widen the applicability of
valuable text processing tools (for example, OCR, text-to-speech (TTS) for the visually
impaired, automatic translation of books, and easy digitization of printed material) to
camera-captured inputs. This chapter focuses on removing perspective and geometric
distortions in captured document images, operations that are referred to as document
dewarping or document rectification.
FIGURE 12.1
Document image rectification using a stereo pair by the algorithm presented in Section 12.2: (a,b) input stereo
pair, and (c) rectified result with specular reflection removal. © 2009 IEEE
12.1.1 Overview
To rectify document images, many methods directly estimate a three-dimensional (3D)
structure. A straightforward method is to use depth-measuring hardware, such as structured
light or a laser scanner. Since the inferred surface may not be isometric to the plane,
several methods for modifying the surface have been proposed [1], [2], [3]. Although these
approaches can be used in a wide range of paper material including old documents damaged
by aging or water, it is burdensome to use depth-measuring equipment.
To overcome this problem, 3D structure can be estimated from multiple images without
using depth measuring devices. For example, a specialized stereo vision system is proposed
in Reference [4]. However, this method still needs hardware, although it is much simpler
than the depth acquisition hardware. A stereo vision method presented in Reference [5]
has a disadvantage in that it requires reference points. A more recent document dewarp-
ing algorithm presented in Reference [6] alleviates the problems by using video sequences.
However, the algorithm requires hundreds of input images, obtained by carefully scanning
the entire book surface. Section 12.2 presents a method [7] which rectifies docu-
ments using a stereo pair. The method needs no special hardware; as shown in Figure 12.1,
it uses just two images captured from different viewpoints.
Although using multiple images has several desirable properties, such as content inde-
pendence and the ability to remove specular reflection, such methods suffer from com-
putational complexity in 3D reconstruction. Therefore, a number of single-view methods
that do not require 3D reconstruction have also been proposed [8], [9], [10], [11]. Most
of these methods avoid the 3D reconstruction problem by exploiting the clues from the
two-dimensional text line structure with some additional assumptions. These methods are
usually computationally efficient and easy-to-use; however, they are limited to the rectifi-
cation of text regions due to their dependencies on text lines.
Unlike the text region, the rectification of figures using a single image requires deter-
mining the boundaries of distorted figures. Given the distorted boundaries, the rectification
can be done by using applicable surface assumptions [12] or simple boundary interpola-
tion [13], [14], [15]. However, it is burdensome to extract the boundaries. Section 12.3
presents a method that segments figures from a single view using a bounding box inter-
face [16], which substantially facilitates the segmentation process. A new boundary inter-
polation method that can improve the visual quality of the output image (Figure 12.2) is
also presented. The overall process is very efficient, so that a rectified result is obtained
within a few seconds, whereas the stereo methods require almost a minute.
FIGURE 12.2
Document image rectification using a single image by the algorithm presented in Section 12.3: (a) input image
with a user-provided bounding box, and (b) segmented and rectified result.
12.2.1 Assumptions
The framework operates under three assumptions. The first is that book surfaces satisfy
the cylindrical surface model (CSM) assumption, which is known to be sufficient for many
kinds of document surfaces including the unfolded books [17]. The second assumption is
that the intrinsic matrix of a camera is known. More precisely, a standard pin-hole camera
(image coordinates are Euclidean coordinates having equal scales in both directions and
the principal point is at the center of the image) is assumed, and it is also assumed that the
estimated focal length is available from Exchangeable Image File Format (EXIF) tags of
image files; most current digital cameras satisfy these assumptions. The third assumption
FIGURE 12.3
A book coordinate system and a modeled book surface. © 2009 IEEE
is that the contents of a book are quite distinctive compared to the background (that is,
most of correspondences are found on the book surface), which can be easily achieved by
placing the document on the relatively homogeneous background or capturing a book as
large as possible.
The book surface is modeled using the CSM assumption. That is, a point (u, v) on a flat surface goes to [x g(x) v] in the
world coordinate space when the surface is warped, where g(x) is the height of the book
surface from the uv plane. The relation between u, x and g(x) is given by
u = \int_0^x \sqrt{1 + \left( \frac{dg}{dt} \right)^2}\, dt, \qquad (12.2)
because u is the arc length of the curve g(x). Although the center O of the world coordinate system
is located at the top of the book as shown in Figure 12.3, its position is not important as
long as it is located on the book’s binding.
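Equation 12.2 ties the flat-page coordinate u to the warped coordinate x through the arc length of g. A small numerical sketch of this relation follows; the example profile g and the sampling density are arbitrary illustrative choices, not values from the chapter.

import numpy as np

def arc_length_u(g, x, num_samples=1000):
    """Numerically evaluate Equation 12.2: u is the arc length of g between 0 and x (x > 0)."""
    t = np.linspace(0.0, x, num_samples)
    dg = np.gradient(g(t), t)                      # dg/dt sampled along the curve
    return np.trapz(np.sqrt(1.0 + dg ** 2), t)     # integral of sqrt(1 + (dg/dt)^2)

# Example with an arbitrary polynomial surface profile.
g = lambda t: 0.02 * t ** 2 - 0.001 * t ** 3
print(arc_length_u(g, 10.0))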
corresponding points which satisfy the similarity relation with P1 , P2 , and {Xi } for some
similarity transformation H ∈ ℜ4×4 , as follows:
P'_1 = P_1 H^{-1}, \qquad (12.3)
P'_2 = P_2 H^{-1}, \qquad (12.4)
\tilde{X}'_i = H \tilde{X}_i, \qquad (12.5)
where \tilde{X}_i ∈ ℜ^4 and \tilde{X}'_i ∈ ℜ^4 are the homogeneous representations of X_i and X'_i, respectively. The
similarity transform H can be represented as
H = \begin{bmatrix} sR & t \\ 0_{1 \times 3} & 1 \end{bmatrix}, \qquad (12.6)
where s is a scale factor, R = [r1 r2 r3 ] is a 3 × 3 rotation matrix, and t is a 3 × 1 translation
vector. By setting s = 1, Equation 12.5 reduces to
X'_i = R X_i + t. \qquad (12.7)
Note that X_i represents a point on the book surface in the book (world) coordinate system
shown in Figure 12.3 and X'_i denotes a point on the book surface in the camera coordinate
system. By applying a proper rotation and translation, {X'_i} are transformed to {X_i}, and a
surface function g(x) can be estimated from them. The cost function for finding R, t, and g
can be expressed as follows:
(R, t, g) = \arg\min_{R,t,g} \sum_{i=1}^{N} d^2(S(g), X_i) = \arg\min_{R,t,g} \sum_{i=1}^{N} d^2\left( S(g), R^T (X'_i - t) \right), \qquad (12.8)
where S(g) is the surface induced from a function g, and d(·, ·) is the distance between a
surface and a point. Intuitively, this cost function can be seen as finding the best surface
that fits 3D points {Xi }.
Although a surface induced from any g(x) is isometric to the plane (flat surface), the class
of g(x) can be restricted using a priori knowledge that the book surface is smooth except
at the binding. Specifically, two polynomials are used for the modeling of g(x) as follows:
g(x) = \begin{cases} g_+(x) & \text{if } x > 0, \\ 0 & \text{if } x = 0, \\ g_-(x) & \text{if } x < 0, \end{cases} \qquad (12.9)
where g_+(x) is for the right side and g_-(x) is for the left side of the book binding. Although
only a small number of parameters of g(x) need to be estimated, the direct minimization of
Equation 12.8 is intractable due to the highly nonlinear nature of the function and the presence
of outliers.
FIGURE 12.4
Points on the book surface and their projection in the direction of r3. © 2009 IEEE
where

r(\theta) = \frac{v_2}{|v_2|} \cos\theta + \frac{v_3}{|v_3|} \sin\theta. \qquad (12.12)
The term P_r(X'_i) denotes the projection of X'_i onto the plane whose normal vector is r,
and µ(·) is a measure of the area that the distributed points occupy. Since only a finite number of
noisy points is available, the area measure µ(·) is approximated by a discrete function.
By changing θ in predefined steps, the angle θ̂ that minimizes µ({P_r(X'_i)}) is found and r_3 is
computed from Equation 12.10. Finally, r_1 can be obtained from r_1 = r_2 × r_3. In the implementation,
a coarse-to-fine approach is adopted for efficiency, and the estimate of R
is refined by a local two-dimensional search around the current estimate.
Although the 3D projection analysis for r3 is robust to outliers, the estimate of r2 using
eigenvalue decomposition may fail in the presence of a single significant outlier. For the
rejection of such outliers, only samples whose distances from the center of mass are less than 3σ
are used, where σ is the standard deviation of the distances from the center to the points.
In this outlier rejection step, only significant outliers are rejected, as a refined rejection
process is addressed in the following stage.
where (x_i, y_i) = (r_1^T X'_i, r_2^T X'_i) and (a, b) = (r_1^T t, r_2^T t). The term d(·, ·) in Equation 12.13 is
a distance function between a point and a surface, whereas d(·, ·) in Equation 12.14 is a
distance function between a point and a curve. Since (xi − a, yi − b) should be on the curve
g, the parameters of g, that is, coefficients of a polynomial, can be estimated using the
least squares method. In this case, the problem in Equation 12.8 reduces to the following
minimization problem:
\sum_{i=1}^{M} \left( g(x_i - a) - (y_i - b) \right)^2, \qquad (12.15)
where M is the number of points. The curves on the left side (x < a) and the right side (x > a)
are represented by different polynomials, g_-(x - a) + b and g_+(x - a) + b, where g_+(x) =
\sum_{k=1}^{K} p_k x^k, g_-(x) = \sum_{k=1}^{K} q_k x^k, and K is the order of the polynomials, empirically set
to four [7].
In order to find pk , qk , and (a, b), a set of candidates for a is determined from the his-
togram of {xi }M i=1 . A point is chosen as a candidate if its corresponding bin is the local
minimum in the histogram, which is based on the fact that there is a relatively small number
of features around the book’s binding. Then for each a, a total of m samples are randomly
selected on both sides and an overdetermined system is solved to get the 2K + 1 unknowns
(p_1, p_2, ..., p_K, q_1, q_2, ..., q_K, and b). Differing from the conventional random sample con-
sensus (RANSAC), the criterion used here is to find the minimum of
\sum_{i=1}^{M} \phi_T\left( g(x_i - a) - (y_i - b) \right), \qquad (12.16)
where

\phi_T(x) = \begin{cases} x^2 & \text{if } |x| < T, \\ T^2 & \text{otherwise}, \end{cases} \qquad (12.17)
is a truncated square function and T is set to ten percent of the range of {xi }, which is
equivalent to MSAC (M-estimator sample consensus) in Reference [20]. After iteration,
the hypothesis that minimizes Equation 12.16 is selected, and a rough estimate of a and
inliers are obtained, where the inlier criterion is |g(xi − a) − (yi − b)| < T . Finally, two
polynomials g+ (x) and g− (x) on each side are estimated using a standard curve fitting
method and their intersection point is determined as (a, b).
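To make the robust fitting step concrete, the sketch below evaluates the MSAC-style cost of Equation 12.16 with the truncated square loss of Equation 12.17 for one candidate binding position a and one parameter hypothesis; the packing of the parameters into a single vector is an implementation assumption, not the authors' code.

import numpy as np

def phi_T(r, T):
    """Truncated square loss of Equation 12.17."""
    return np.minimum(r ** 2, T ** 2)

def msac_score(params, x, y, a, T):
    """MSAC-style cost of Equation 12.16 for one hypothesis.

    params = (p_1..p_K, q_1..q_K, b): polynomial coefficients for the right (g+)
    and left (g-) sides of the binding, plus the vertical offset b.
    x, y   : 1D arrays of feature point coordinates.
    """
    params = np.asarray(params, dtype=np.float64)
    K = (len(params) - 1) // 2
    p, q, b = params[:K], params[K:2 * K], params[-1]
    xs = x - a
    powers = np.vstack([xs ** k for k in range(1, K + 1)])   # K x M matrix of powers of (x - a)
    g = np.where(xs > 0, p @ powers, q @ powers)             # piecewise polynomial g(x - a)
    return np.sum(phi_T(g - (y - b), T))

For each candidate a, the hypothesis with the smallest cost is kept, inliers are collected with |g(x_i - a) - (y_i - b)| < T, and the two polynomials are refitted by least squares, as described in the text.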
where N p is the set of first order neighborhoods at the site p and the data penalty Cd repre-
sents the sharpness of the pixel p in JL(p) , which will be explained in the next subsection.
The interaction penalty term Ci is expressed as follows:
C_i(p, q, L(p), L(q)) = \| J_{L(p)}(p) - J_{L(q)}(p) \| + \| J_{L(p)}(q) - J_{L(q)}(q) \|, \qquad (12.19)
where J j (p) means the pixel value of J j at p. Note that this cost function is the same
as that of Reference [21]. The optimal labeling can be found using the graph-cut tech-
nique [22].
where θ (p, L(p)) is the angle between the line of sight and the surface normal at the point
p in J_{L(p)}. Because a small value of θ(p, L(p)) means that the tangent plane at point
p is closer to being perpendicular to the line of sight, it is approximately proportional
to the sharpness and naturally handles problems caused by slight misalignment.
Since the surface and camera matrices were estimated during the 3D reconstruction pro-
cess, the algorithm can handle the specular reflection without any device, whereas the exist-
ing algorithm dealt with it using some hardware [23]. Assuming that the distance from the
flash to the camera center is small compared to the distance between the camera and a docu-
ment, the possible glare regions can be determined. This follows from the specular distribution
function model, which behaves as cos^n φ, where φ is the angle between the line of sight and
the principal reflected ray [23]. Hence, specular reflection can be removed by discarding the
region with small φ. This can be done by modifying Equation 12.20 as follows:

C_d^{(2)}(p, L(p)) = \begin{cases} -\cos\theta(p, L(p)) & \text{if } \theta(p, L(p)) \ge \theta_0, \\ B & \text{otherwise}, \end{cases} \qquad (12.21)
where n(·) is the number of elements in a set. It is experimentally found for images with
dimensions of 1600 × 1200 pixels that S > 1000 in the presence of noticeable specular reflection
and S < 100 otherwise.
Figures 12.5 and 12.6 show experimental results for the glossy papers with flash illumi-
nation. The stitching boundaries resulting from two data penalty terms (Equations 12.20
and 12.21) are shown in Figures 12.5c and 12.5f. As expected, the boundaries in Figure 12.5c
are more complex than those in Figure 12.5f. Hence, text misalignment is sometimes observed,
which seldom occurs when Equation 12.20 is used, due to its simple boundary. However,
specular reflections and illumination inconsistency are successfully removed, as shown in
Figure 12.6b.
The character recognition rate (CRR) is commonly measured by counting the number
of correctly recognized characters or words. However, it is not an accurate measure in the
FIGURE 12.7
Images used in Table 12.1. © 2009 IEEE
sense that this number does not faithfully reflect the amount of hand-labor needed to fix
incorrect characters or words after OCR. To faithfully reflect the amount of hand-labor,
that is, deleting and/or inserting characters, a new CRR measure is defined here as follows:
\mathrm{CRR} = 100 \times \frac{n(\mathrm{match})}{n(\mathrm{match}) + n(\mathrm{insertion}) + n(\mathrm{deletion})}. \qquad (12.23)
For example, if “Hello, World!” is recognized as “Helmo, Wold!!”, there are 9 matches.
Also note that two deletions (“m” and “!”) and two insertions (“l” and “r”) are required
to correct the recognition result. Hence, the CRR of that text is CRR =
100 × 9/(9 + 2 + 2) ≈ 69%. In the computation of CRR, dynamic programming is used to
find correspondences between recognized text and ground truth.
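The CRR of Equation 12.23 can be computed with a standard dynamic-programming alignment that counts matches, insertions, and deletions; the sketch below is one possible implementation, not the authors' code.

def crr(recognized, truth):
    """Character recognition rate of Equation 12.23.

    Aligns the two strings with dynamic programming (matches cost 0,
    insertions and deletions cost 1, no substitutions) and returns
    100 * matches / (matches + insertions + deletions).
    """
    n, m = len(recognized), len(truth)
    # dp[i][j] = minimal number of insert/delete operations to align prefixes.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if recognized[i - 1] == truth[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1])
    edits = dp[n][m]                      # total insertions + deletions
    matches = (n + m - edits) // 2        # characters aligned without an edit
    return 100.0 * matches / (matches + edits)

Because only insertions and deletions are allowed, a substitution is counted as one deletion plus one insertion, which matches the way the worked example above counts the correction operations.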
Because OCR performance depends on the line of sight and the types of content and
material, quantitative evaluation is actually not a simple task. Thus, experiments in four
different situations (Figure 12.7) are conducted, with results summarized in Table 12.1.
Namely, Figure 12.7a represents the situation when the images are taken rather perpen-
dicularly to the book surface; in this case the CRRs of I1 and I2 are relatively high
and the CRR of rectified images is close to that of an image from a flatbed scanner. Fig-
ure 12.7b represents the situation when images are captured obliquely. Although the OCR
improvement is very high, the final recognition rate for Figure 12.7b is less than that of
TABLE 12.1
CRR for several image pairs (I1 and I2 are image pairs before dewarp-
ing, J1 and J2 are image pairs after dewarping, “composite” denotes a
stitched image obtained from J1 and J2 , and “scan” denotes an image
from a flatbed scanner).
the image pair shown in Figure 12.7a, due to out-of-focus blur. Figures 12.7c and 12.7d
correspond to the cases where each image does not have complete information due to blur
and specular reflection. The problem is alleviated only after combining the information
of two images. Figure 12.8 shows more results; additional examples can be found at
http://ispl.snu.ac.kr/~hikoo/documents.html.
FIGURE 12.10
Bounding box definition: R denotes the bounding box given by the user, Γ represents four bounding curves
(γL , γT , γR and γB ), and R(Γ) indicates the region corresponding to the printed figure.
this set is searched by using an alternating optimization scheme. After boundary extraction,
the extracted figure is rectified. The rectification method is the combination of the metric
rectification method used for planar documents [25], [26] and the boundary interpolation
methods used for curved documents [13], [27].
where
Φ(Γ, Θ) = ΦL (Γ, Θ) + ΦS (Γ) (12.25)
is defined as the sum of two terms; the first one encodes data fidelity and the second one
encodes the energy of the boundaries. Finally, Θ represents the parameters of a probabilistic
model that is explained later in this chapter.
To design VF (x, y; Θ) and VB (x, y; Θ), a Gaussian mixture model (GMM) of color distribu-
tions [16] is used. Therefore, Θ represents the GMM parameters of two distributions and
the pixelwise energies are defined as the negative log of probabilities which are given by
GMM.
The estimate of Θ is obtained from user-provided seed pixels. Although the method
of user interaction is the same as that of Reference [16], there is a subtle difference in
the estimation process. In the conventional object segmentation problem, a given image
usually consists of an object and background, and it is reasonable to consider the pixels in
a box (R) as seeds for the object, and the pixels outside the box (Ω\R) as seeds for the
background. However, typical documents contain several figures, and it is very likely that
other figures (especially those having similar color distributions) exist outside the box as
can be seen in Figure 12.9a. Hence the initialization method that considers all pixels in
Ω\R as seeds for the background is often problematic. Rather, the pixels along the box
boundary (∂R) are used as seeds for the background, and the pixels inside the box (R\∂R) are
used as seeds for the figure. The modified initialization method not only provides better
performance but also improves efficiency due to a small number of seeds. This method will
be used in the Grabcut algorithm [16] for fair comparison in the experimental section.
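One possible realization of the seed-based energies V_F and V_B with scikit-learn's Gaussian mixture model is sketched below; the number of mixture components, the width of the ∂R band, and all variable names are illustrative assumptions rather than the authors' settings.

import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_energies(image, box, band=5, n_components=5):
    """Pixelwise energies V_F, V_B as negative log-likelihoods of two color GMMs.

    image : H x W x 3 array; box = (x0, y0, x1, y1) is the user-provided rectangle R.
    Figure seeds come from the interior R \ dR; background seeds come from the band dR
    along the box border (the modified initialization described above).
    """
    x0, y0, x1, y1 = box
    inner = image[y0 + band:y1 - band, x0 + band:x1 - band].reshape(-1, 3)
    top = image[y0:y0 + band, x0:x1].reshape(-1, 3)
    bottom = image[y1 - band:y1, x0:x1].reshape(-1, 3)
    left = image[y0:y1, x0:x0 + band].reshape(-1, 3)
    right = image[y0:y1, x1 - band:x1].reshape(-1, 3)
    border = np.vstack([top, bottom, left, right])

    fg = GaussianMixture(n_components).fit(inner)
    bg = GaussianMixture(n_components).fit(border)

    pixels = image.reshape(-1, 3)
    V_F = -fg.score_samples(pixels).reshape(image.shape[:2])   # -log p(figure)
    V_B = -bg.score_samples(pixels).reshape(image.shape[:2])   # -log p(background)
    return V_F, V_B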
FIGURE 12.11
Support areas used to obtain the statistics for low curvature edge extraction.
where

\psi_H(\gamma) = \sum_{E_H(x,y)=1} \phi_T\left( y - \gamma(x) \right). \qquad (12.28)
Here, φ_T(·) is the truncated square function defined in Equation 12.17, and ψ_V(·) is defined
similarly. Intuitively, both functions are minimized when the curve passes through as many
edge points as possible. In all experiments, λ = 100 and T = 3.
Plausible curves in Equation 12.27 are the ones that satisfy several hard constraints. Hard
constraints are imposed on the slope of curves, the intersection point of γL and γR (that is,
the position of the vanishing point), and so on. However, these restrictions are not critical
because any Γ violating these conditions is likely to have high energy. These conditions
are instead used to reject unlikely candidates at the early stage.
12.3.2.5 Optimization
Since minimizing the term in Equation 12.25 is not a simple task, a new optimization
method is presented. This method consists of four steps: i) clustering edges (EH and EV )
into line segments A , ii) constructing a candidate boundary set C from A , iii) finding
a coarse solution in the candidate set C as Γ̂ = arg minΓ∈C Φ(Γ, Θ), and iv) refining the
coarse estimate Γ̂. In this section, the width and height of R are respectively denoted as W
and H.
FIGURE 12.12
Three choices of (u, v) ∈ AHT × AHT .
bottom sides of R), two side curves are estimated. Then, these two estimated side curves
are used to estimate the top and bottom curves. Thus, the problem in Equation 12.32 reduces
to two subproblems, namely
fixing (γL , γR ). Since the minimization methods for Equations 12.33 and 12.34 are similar,
only the method for Equation 12.33 is presented. When γ_T and γ_B are fixed, Φ_L(Γ, Θ) can
be represented in terms of

\eta_L(\gamma_L) = \sum_{y=0}^{H-1} \left( \sum_{x=0}^{\gamma_L(y)} V'_B(x, y) + \sum_{x=\gamma_L(y)}^{W-1} V'_F(x, y) \right) \qquad (12.36)

and

\eta_R(\gamma_R) = \sum_{y=0}^{H-1} \sum_{x=\gamma_R(y)}^{W-1} \left( V'_B(x, y) - V'_F(x, y) \right). \qquad (12.37)
Here, V'_F and V'_B are obtained from V_F and V_B by setting the values outside the top and bottom
curves to zero. By constructing the following tables:
T_1(y, z) = \sum_{x=0}^{z} V'_B(x, y), \qquad (12.38)
T_2(y, z) = \sum_{x=z}^{W-1} V'_F(x, y), \qquad (12.39)
T_3(y, z) = \sum_{x=z}^{W-1} \left( V'_B(x, y) - V'_F(x, y) \right), \qquad (12.40)
the term ΦL (Γ, Θ) can be evaluated in O(H) operations using the tables. Moreover, ΦS (Γ)
is also represented via ψV (γL ) + ψV (γR ) + constant. Putting it all together into Equa-
tion 12.33 provides
(\hat{\gamma}_L, \hat{\gamma}_R) = \arg\min_{\gamma_L \in A_V^L,\, \gamma_R \in A_V^R} \left( \eta_L(\gamma_L) + \psi_V(\gamma_L) + \eta_R(\gamma_R) + \psi_V(\gamma_R) \right). \qquad (12.41)
The computational cost for this scheme can be summarized as i) W × H operations re-
quired for the construction of tables, ii) |A | × H operations required for the precompu-
tations of ηL (γL ) and ψV (γL ) for all γL ∈ AVL , iii) |A | × H operations required for the
precomputations of ηR (γR ) and ψV (γR ) for all γR ∈ AVR , and iv) |A |2 operations used for
the test of hard constraints and the minimization of Equation 12.41. This results in total
O(W × H + |A | × H + |A |2 ) computations. The computational cost of Equation 12.34
can be reduced in a similar manner. Experiments show that Equations 12.33 and 12.34
converge to their optimal solutions very quickly; in practice, they are repeated only twice.
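The table-based evaluation can be reproduced with cumulative sums. The sketch below builds T1, T2, and T3 (Equations 12.38 to 12.40) and evaluates η_L(γ_L) in O(H) per candidate curve; apart from the names taken from the text, the implementation details are assumptions.

import numpy as np

def build_tables(V_Fp, V_Bp):
    """Cumulative-sum tables T1, T2, T3 of Equations 12.38 to 12.40.

    V_Fp, V_Bp : H x W arrays V'_F and V'_B (energies zeroed outside
                 the current top and bottom curves).
    """
    T1 = np.cumsum(V_Bp, axis=1)                              # T1(y, z) = sum_{x<=z} V'_B(x, y)
    T2 = np.cumsum(V_Fp[:, ::-1], axis=1)[:, ::-1]            # T2(y, z) = sum_{x>=z} V'_F(x, y)
    T3 = np.cumsum((V_Bp - V_Fp)[:, ::-1], axis=1)[:, ::-1]   # T3(y, z) = sum_{x>=z} (V'_B - V'_F)
    return T1, T2, T3

def eta_L(gamma_L, T1, T2):
    """Evaluate eta_L of Equation 12.36 in O(H) using the tables.

    gamma_L : length-H integer array, gamma_L[y] = column of the left boundary at row y.
    """
    rows = np.arange(len(gamma_L))
    return np.sum(T1[rows, gamma_L] + T2[rows, gamma_L])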
12.3.3 Rectification
This section introduces a rectification algorithm which improves conventional boundary
interpolation methods. Using the boundaries in Figures 12.9c and 12.13a, the conventional
methods yield distorted results, as shown in Figures 12.9d and 12.13c. In order to alleviate
this distortion, a new rectification method is presented. The method consists of two steps.
FIGURE 12.13
Figure rectification: (a) segmentation result, (b) transformed result, note that the four corners of the figure
compose a rectangle, (c) rectification of the segmentation result using boundary interpolation, aspect ratio
1.02, (d) rectification of the transformed result using boundary interpolation, aspect ratio 0.614, and (e)
ground truth, scanned image with aspect ratio 0.606.
The first is the rectification process for an imaginary rectangle consisting of four corners
of a figure, which is the same as metric rectification methods for planar documents used in
References [25] and [26] except that the rectangle is an imaginary one. Boundary interpo-
lation is then applied to the transformed image [13]. Figure 12.13 illustrates this process.
In the proposed method, boundary interpolation is applied to Figure 12.13b to produce the
result shown in Figure 12.13d. For completeness, conventional methods are applied to Fig-
ure 12.13a to produce the result shown in Figure 12.13c. As can be seen, the proposed
method can largely remove distortions. In metric rectification of an imaginary rectangle,
the focal length is obtained from EXIF (if available) [7].
FIGURE 12.14
Comparison of the Grabcut algorithm and the proposed segmentation method: (a) input with user interaction,
(b) feature space, where light pixels indicate figure regions and dark pixels indicate blank regions, (c) Grabcut output, and (d)
output of the presented method.
scene is obliquely captured as illustrated in the bottom row in Figure 12.16. In such cases,
although both camera-driven methods suffer from blur caused by perspective contraction
and shallow depths of field, the stereo method is less affected by geometric distortions than
the method presented in this section. On the other hand, the latter method has advantages
on the boundaries because the algorithm forces the boundaries of restored images to be
straight, while skews and boundary fluctuations are usually observed in the method pre-
sented in Section 12.2 (an image was manually deskewed and cropped in order to obtain
Figures 12.16a and 12.16d).
In terms of computational complexity, the stereo system usually requires about one
minute to produce a 1600 × 1200 output image. The method presented in this section takes
4.8 seconds for feature extraction, 1.3 seconds for segmentation, and 0.5 seconds for rectification
when handling a 3216 × 2136 image. Since the user interaction takes about five seconds, during
which feature extraction can be performed, this method produces an output within two seconds
of the interaction even for seven-megapixel images.
FIGURE 12.15
Evaluation of the method in Section 12.2: (a,b) stereo pair, and (c,d) magnified and cropped images with user
interactions.
12.4 Conclusion
This chapter presented two camera-driven methods for geometric rectification of docu-
ments. One is a stereo-based method using explicit 3D reconstruction. The method works
irrespective of contents on documents and provides several advantages, such as specular
reflection removal. Therefore, this method can be used for OCR and digitization of figures
and pictures in indoor environment. The other one is a single-view method which recti-
fies a figure from a user-provided bounding box. This method is shown to be efficient,
robust, and easy-to-use. It should be noted that the camera captured images often suffer
from photometric and geometric distortion. Therefore, removal of uneven illumination and
motion/out-of-focus blur are also essential in enhancing camera captured document images,
although these operations are not discussed in this chapter. Nevertheless, as demonstrated
in this chapter, digital camera-driven systems for document image acquisition, analysis,
and processing have the potential to replace flatbed scanners.
Acknowledgment
Figures 12.1 and 12.3 to 12.8 are reprinted from Reference [7], with the permission of
IEEE.
References
[1] M.S. Brown and C.J. Pisula, “Conformal deskewing of non-planar documents,” in Proceedings
of IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, June
2005, pp. 998–1004.
[2] M.S. Brown and W.B. Seales, “Document restoration using 3D shape: A general deskewing
algorithm for arbitrarily warped documents,” in Proceedings of International Conference on
Computer Vision, Vancouver, BC, Canada, July 2001, pp. 367–374.
[3] M. Pilu, “Undoing paper curl distortion using applicable surfaces,” in Proceedings of IEEE
Conference on Computer Vision and Pattern Recognition, Kauai, HI, USA, December 2001,
pp. 67–72.
[4] A. Yamashita, A. Kawarago, T. Kaneko, and K.T. Miura, “Shape reconstruction and image
restoration for non-flat surfaces of documents with a stereo vision system,” in Proceedings
of International Conference on Pattern Recognition, Cambridge, UK, August 2004, pp. 482–
485.
[5] A. Ulges, C.H. Lampert, and T. Breuel, “Document capture using stereo vision,” in Proceed-
ings of the ACM Symposium on Document Engineering, Milwaukee, WI, USA, October 2004,
pp. 198–200.
[6] A. Iketani, T. Sato, S. Ikeda, M. Kanbara, N. Nakajima, and N. Yokoya, “Video mosaicing
based on structure from motion for distortion-free document digitization,” in Proceedings of
Asian Conference on Computer Vision, Tokyo, Japan, November 2007, pp. 73–84.
[7] H.I. Koo, J. Kim, and N.I. Cho, “Composition of a dewarped and enhanced document image
from two view images,” IEEE Transactions on Image Processing, vol. 18, no. 7, pp. 1551–
1562, July 2009.
[8] J. Liang, D. DeMenthon, and D. Doermann, “Flattening curved documents in images,” in
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, San Diego,
CA, USA, June 2005, pp. 338–345.
[9] F. Shafait and T.M. Breuel, “Document image dewarping contest,” in Proceedings of 2nd Inter-
national Workshop on Camera-Based Document Analysis and Recognition, Curitiba, Brazil,
September 2007, pp. 181–188.
[10] N. Stamatopoulos, B. Gatos, I. Pratikakis, and S. Perantonis, “A two-step dewarping of camera
document images,” in Proceedings of International Workshop on Document Analysis Systems,
Nara, Japan, September 2008, pp. 209–216.
[11] S.S. Bukhari, F. Shafait, and T.M. Breuel, “Coupled snakelet model for curled textline seg-
mentation of camera-captured document images,” in Proceedings of International Conference
on Document Analysis and Recognition, Barcelona, Spain, July 2009, pp. 61–65.
[12] N.A. Gumerov, A. Zandifar, R. Duraiswami, and L.S. Davis, “3D structure recovery and un-
warping of surfaces applicable to planes,” International Journal of Computer Vision, vol. 66,
no. 3, pp. 261–281, March 2006.
[13] Y.C. Tsoi and M.S. Brown, “Geometric and shading correction for images of printed materials:
A unified approach using boundary,” in Proceedings of IEEE Conference on Computer Vision
and Pattern Recognition, Washington, DC, June 2004, pp. 240–246.
[14] Y.C. Tsoi and M.S. Brown, “Multi-view document rectification using boundary,” in Proceed-
ings of IEEE Conference on Computer Vision and Pattern Recognition, Chicago, IL, USA,
June 2007, pp. 1–8.
[15] M.S. Brown, M. Sun, R. Yang, L. Yung, and W.B. Seales, “Restoring 2D content from dis-
torted documents,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29,
no. 11, pp. 1904–1916, November 2007.
[16] C. Rother, V. Kolmogorov, and A. Blake, “Grabcut: Interactive foreground extraction using
iterated graph cuts,” ACM Transactions on Graphics, vol. 23, no. 3, pp. 309–314, August
2004.
[17] H. Cao, X. Ding, and C. Liu, “A cylindrical surface model to rectify the bound document,”
in Proceedings of International Conference on Computer Vision, Nice, France, October 2003,
pp. 228–233.
[18] D.G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal
of Computer Vision, vol. 60, no. 2, pp. 91–110, November 2004.
[19] R.I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge, UK:
Cambridge University Press, 2nd edition, April 2004.
[20] P.H.S. Torr and A. Zisserman, “MLESAC: A new robust estimator with application to estimat-
ing image geometry,” Computer Vision and Image Understanding, vol. 78, no. 1, pp. 138–156,
April 2000.
[21] A. Agarwala, M. Dontcheva, S. Drucker, A. Colburn, B. Curless, D. Salesin, and M. Cohen,
“Interactive digital photomontage,” ACM Transactions on Graphics, vol. 23, no. 3, pp. 294–
302, August 2004.
13
Bilateral Filter: Theory and Applications
Bahadir K. Gunturk
13.1 Introduction
The bilateral filter is a nonlinear weighted averaging filter, where the weights depend
on both the spatial distance and the intensity distance with respect to the center pixel.
The main feature of the bilateral filter is its ability to preserve edges while doing spatial
smoothing. The term bilateral filter was introduced in Reference [1]; the same filter was
earlier called the SUSAN (Smallest Univalue Segment Assimilating Nucleus) filter [2].
Variants of the bilateral filter were published even earlier as the sigma filter [3]
and the neighborhood filter [4].
FIGURE 13.1
Illustrative application of the bilateral filter. The range kernel and the spatial kernel are placed at I(x = 260).
The product of Kr (·) and Kd (·) determines the weights of the pixels in a local neighborhood. As seen in the
overall weights subplot, pixels on the left side of the edge have zero weights in getting the output at x = 260
even though they are spatially close.
At a pixel location x = (x1 , x2 ), the output of the bilateral filter is calculated as follows:
\hat{I}(x) = \frac{1}{C(x)} \sum_{y \in \mathcal{N}(x)} K_d(\|y - x\|)\, K_r(|I(y) - I(x)|)\, I(y), \qquad (13.1)
where K_d(·) is the spatial domain kernel, K_r(·) is the intensity range kernel, \mathcal{N}(x) is a
spatial neighborhood of x, and C(x) is the normalization constant expressed as

C(x) = \sum_{y \in \mathcal{N}(x)} K_d(\|y - x\|)\, K_r(|I(y) - I(x)|). \qquad (13.2)
The kernels Kd (·) and Kr (·) determine how the spatial and intensity differences are
treated. The most commonly used kernel is the Gaussian kernel1 defined as follows:
K_d(\|y - x\|) = \exp\left( \frac{-\|y - x\|^2}{2\sigma_d^2} \right), \qquad (13.3)
1 In the text, the Gaussian kernel is the default kernel unless otherwise stated.
FIGURE 13.2
Top: Input signal. Middle: Output of the Gaussian filter with σd = 10. Bottom: Output of the bilateral filter
with σd = 10 and σr = 1.5.
K_r(|I(y) - I(x)|) = \exp\left( \frac{-|I(y) - I(x)|^2}{2\sigma_r^2} \right). \qquad (13.4)
The contribution (weight) of a pixel I(y) is determined by the product of Kd (·) and Kr (·).
As illustrated in Figure 13.1, the range kernel pulls down the weights of the pixels that are
not close in intensity to the center pixel even if they are in close spatial proximity. This
leads to the preservation of edges. Figure 13.2 demonstrates this property of the bilateral
filter and compares it with the Gaussian low-pass filter which blurs the edge.
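A direct, unoptimized implementation of Equations 13.1 to 13.4 for a grayscale image is sketched below; the window radius and the border handling are arbitrary choices, not prescribed by the text.

import numpy as np

def bilateral_filter(image, sigma_d=3.0, sigma_r=30.0, radius=None):
    """Brute-force bilateral filter (Equation 13.1) with Gaussian kernels
    for both the spatial domain (Equation 13.3) and the intensity range
    (Equation 13.4)."""
    if radius is None:
        radius = int(3 * sigma_d)
    image = image.astype(np.float64)
    padded = np.pad(image, radius, mode='reflect')
    H, W = image.shape

    # Precompute the spatial kernel K_d over the (2r+1) x (2r+1) window.
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    K_d = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma_d ** 2))

    out = np.empty_like(image)
    for i in range(H):
        for j in range(W):
            window = padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            K_r = np.exp(-(window - image[i, j]) ** 2 / (2 * sigma_r ** 2))
            weights = K_d * K_r
            out[i, j] = np.sum(weights * window) / np.sum(weights)   # Equation 13.1 with C(x)
    return out

The double loop is written for clarity; practical implementations are vectorized or approximated.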
While the Gaussian kernel is the choice for both Kd (·) and Kr (·) in References [1] and [2],
the sigma filter [3] and the neighborhood filter [4] use different kernels. The sigma filter [3]
calculates the local standard deviation σ around I(x) and uses a thresholded uniform kernel:

K_r(|I(y) - I(x)|) = \begin{cases} 1 & \text{if } |I(y) - I(x)| \le 2\sigma, \\ 0 & \text{otherwise}. \end{cases} \qquad (13.5)
This range kernel essentially eliminates the use of outliers in calculating the spatial average.
The spatial kernel of the sigma filter, on the other hand, is a uniform box kernel with a
rectangular or a circular support. For a circular support with radius ρd , the spatial kernel is
defined as follows:

K_d(\|y - x\|) = \begin{cases} 1 & \text{if } \|y - x\| \le \rho_d, \\ 0 & \text{otherwise}. \end{cases} \qquad (13.6)
In case of the neighborhood filter [4], the range kernel is Gaussian as in Equation 13.3
and the spatial kernel is a uniform box as in Equation 13.6. Among these kernel options,
FIGURE 13.3
Output of the bilateral filter for different values of σd and σr : (top) σd = 1, (middle) σd = 3, (bottom) σd = 9;
(a) σr = 10, (b) σr = 30, (c) σr = 90, and (d) σr → ∞.
the Gaussian kernel is the most popular choice for both the range and spatial kernels, as it
gives an intuitive and simple control of the behavior of the filter with two parameters.
The Gaussian kernel parameters σd and σr control the decay of the weights in space
and intensity. Figure 13.3 demonstrates the behavior of the bilateral filter for different
combinations of σd and σr . It can be seen that the edges are preserved better for small
values of σr. In fact, an image is hardly changed as σr → 0. As σr → ∞, Kr(·) approaches 1
and the bilateral filter becomes a Gaussian low-pass filter. On the other hand, σd
controls the spatial extent of pixel contribution. As σd → 0, the filter acts on a single pixel.
As σd → ∞, the spatial extent of the filter will increase, and eventually, the bilateral filter
will act only on intensities regardless of position, in other words, on histograms.
Denoting H(·) as the histogram over the entire spatial domain, the filter becomes:
\hat{I}(x) = \frac{1}{C(x)} \sum_{y} \exp\left( \frac{-|I(y) - I(x)|^2}{2\sigma_r^2} \right) I(y)
           = \frac{1}{C(x)} \sum_{i=0}^{255} i\, H(i) \exp\left( \frac{-|i - I(x)|^2}{2\sigma_r^2} \right), \qquad (13.7)
FIGURE 13.4
Iterative application of the bilateral filter with σd = 12 and σr = 30. Left: Input image. Middle: Result of the first
iteration. Right: Result of the third iteration.
FIGURE 13.5
Histograms of the input image and the output image after the third iteration of the bilateral filter.
where

C(x) = \sum_{y} \exp\left( \frac{-|I(y) - I(x)|^2}{2\sigma_r^2} \right)
     = \sum_{i=0}^{255} H(i) \exp\left( \frac{-|i - I(x)|^2}{2\sigma_r^2} \right). \qquad (13.8)
Considering the histogram H(i) as the probability density function (pdf) of intensities,
H(i) exp(−|i − I(x)|2 /(2σr2 )) is the smoothed pdf, and Equation 13.7 can be interpreted
as finding the expected value of the pixel intensities. When σr also goes to infinity, the
bilateral filter returns the expected (average) value of all intensities.
From the histogram perspective, the bilateral filter can be interpreted as a local mode
filter [5], returning the expected value of local histograms. This effect is demonstrated
through iterative application of the bilateral filter in Figures 13.4 and 13.5. As seen, the
filtered image approaches the modes of the distribution through the iterations. The output
histogram has certain peaks, and in-between values are reduced.
Using the bilateral filter, an image can be decomposed into its large-scale (base) and
small-scale (detail) components. The large-scale component is a smoothed version of the
input image with main edges preserved, and the small-scale component is interpreted as
having the texture details or noise, depending on the application and parameter selection.
The small-scale component is obtained by subtracting the filtered image from the original
image. Figure 13.6 shows the effect of the σr value on extracting detail components.
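A sketch of this base/detail decomposition, reusing the bilateral_filter function from the earlier sketch and assuming a floating-point grayscale array named image:

```python
# Base/detail decomposition: the filtered image is the large-scale (base)
# component and the residual is the small-scale (detail) component.
base = bilateral_filter(image, sigma_d=3.0, sigma_r=30.0)   # large-scale component
detail = image - base                                       # texture/noise component
reconstructed = base + detail                               # recovers the input exactly
```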
(a) (b)
(c) (d)
FIGURE 13.6
A detail component is obtained by subtracting the filtered image from the original image. In this figure, σd = 3
and (a) σr = 10, (b) σr = 30, (c) σr = 90, and (d) σr → ∞ in raster scan order.
13.2 Applications
The bilateral filter has found a number of applications in image processing and computer
vision. This section reviews some popular examples of using the bilateral filter in practice.
FIGURE 13.7
Average MSE values between original images and denoised images for different values of σd , σr , and the noise
standard deviation σn : (a) σ = 5, (b) σ = 10, (c) σ = 15, and (d) σ = 20.
An important practical issue is how to select the parameters of the bilateral filter as a function of noise or local texture. Reference [6]
presents an empirical study on optimal parameter selection. To understand the relation
among σd , σr , and the noise standard deviation σn , zero-mean white Gaussian noise is
added to some test images and the bilateral filter is applied with different values of the
parameters σd and σr . The experiment is repeated for different noise variances and the
mean squared error (MSE) values are recorded. The average MSE values are given in
Figure 13.7. These MSE plots indicate that the optimal σd value is relatively insensitive
to noise variance compared to the optimal σr value. It appears that σd could be chosen
around two regardless of the noise power; on the other hand, the optimal σr value changes
significantly as the noise standard deviation σn changes. This is an expected result because
if σr is smaller than σn , noisy data could remain isolated and untouched, as in the case of the
salt-and-pepper noise problem of the bilateral filter [1]. That is, σr should be sufficiently
large with respect to σn .
To see the relation between σn and the optimal σr , σd is set to some constant values,
and the optimal σr values (minimizing MSE) are determined as a function of σn . The
experiments are again repeated for a set of images; the average values and the standard
deviations are displayed in Figure 13.8. It can be observed that the optimal σr is linearly
proportional to σn . There is obviously no single value for σr /σn that is optimal for all
images and σd values; and in fact, future research should investigate spatially adaptive
parameter selection to take local texture characteristics into account. On the other hand,
these experiments give us some guidelines in selection of the parameters.
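The following sketch turns these guidelines into code: σd is kept near two and σr is tied to an estimate of the noise standard deviation, with a multiplier in the 2.0 to 2.6 range suggested by the slopes in Figure 13.8. The noise estimator below (a median-absolute-deviation estimate on horizontal differences) is my own simple stand-in, not the estimator used in Reference [6].

```python
import numpy as np

def estimate_noise_std(image):
    # Rough noise estimate from horizontal first differences (an assumption,
    # not the estimator used in [6]); MAD/0.6745 approximates the std for
    # Gaussian noise, and the extra sqrt(2) accounts for the differencing.
    d = np.diff(image.astype(np.float64), axis=1)
    return np.median(np.abs(d - np.median(d))) / 0.6745 / np.sqrt(2)

def choose_bilateral_params(image, ratio=2.2):
    """Return (sigma_d, sigma_r) following the empirical guidelines."""
    sigma_n = estimate_noise_std(image)
    return 2.0, ratio * sigma_n
```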
Reference [6] further suggests a multiresolution framework for the bilateral filter. In
this way, different noise components (fine-grain and coarse-grain noise) can be determined
FIGURE 13.8
The optimal σr values plotted as a function of the noise standard deviation σn based on the experiments with
a number of test images [6]: (a) σd = 1.5, (b) σd = 3.0, and (c) σd = 5.0. The data points are the mean of
optimal σr values that produce the smallest MSE for each σn value. The vertical lines denote the standard
deviation of the optimal σr for the test images. The least squares fits to the means of the optimal σr /σn data
are plotted as diagonal lines. The slopes of these lines are, from left to right, 2.56, 2.16, and 1.97.
(B: bilateral filter, W: wavelet thresholding)
FIGURE 13.9
Illustration of the multiresolution denoising framework [6]. The analysis and synthesis filters (La , Ha , Ls , and
Hs ) form a perfect reconstruction filter bank.
(a) (b)
(c) (d)
FIGURE 13.10
Image denoising using the bilateral filter: (a) input image, (b) bilateral filtering [1] with σd = 1.8 and σr =
3 × σn , (c) bilateral filtering [1] with σd = 5.0 and σr = 20 × σn , and (d) multiresolution bilateral filtering [6]
with σd = 1.8 and σr = 3 × σn at each resolution level.
and eliminated at different resolution levels for better results. The proposed framework
is illustrated in Figure 13.9. A signal is decomposed into its frequency subbands with
wavelet decomposition; as the signal is reconstructed back, bilateral filtering is applied to
the approximation subbands and wavelet thresholding is applied to the detail subbands. At
each level, the noise standard deviation σn is estimated, and the bilateral filter parameter
σr is set accordingly. Unlike the standard single-level bilateral filter [1], this multiresolu-
tion bilateral filter has the potential of eliminating coarse-grain noise components. This is
demonstrated in Figure 13.10.
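A minimal sketch of this multiresolution scheme, assuming the PyWavelets package and the bilateral_filter and estimate_noise_std sketches given earlier; the wavelet, the number of levels, and the threshold rule are illustrative choices rather than the exact settings of Reference [6].

```python
import numpy as np
import pywt

def multiresolution_bilateral(image, levels=2, sigma_d=1.8, k=3.0):
    """Bilateral filtering of approximation subbands plus soft thresholding of details."""
    if levels == 0:
        return image
    cA, (cH, cV, cD) = pywt.dwt2(image, 'db2')
    cA = multiresolution_bilateral(cA, levels - 1, sigma_d, k)   # process coarser levels first
    cA = cA[:cH.shape[0], :cH.shape[1]]                          # guard against odd-size padding
    sigma_n = estimate_noise_std(cA)                             # per-level noise estimate
    cA = bilateral_filter(cA, sigma_d=sigma_d, sigma_r=k * sigma_n)
    thr = k * sigma_n
    cH, cV, cD = (pywt.threshold(c, thr, mode='soft') for c in (cH, cV, cD))
    return pywt.idwt2((cA, (cH, cV, cD)), 'db2')
```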
FIGURE 13.11
The tone-mapping method of Reference [10].
Multiple images captured at different exposure settings are combined to produce an HDR image [7], [8], [9]. This process requires estimation or knowledge
of the exposure rates and camera response function. Geometric registration, lens flare and
ghost removal, vignetting correction, compression and display of HDR images are some of
the other challenges in HDR imaging.
After an HDR image is generated, it has to be tone-mapped for display on a screen, which typically has a smaller dynamic range than the HDR image. The bilateral filter has been successfully used for this purpose [10]. As illustrated in Figure 13.11, the intensity and color channels of an HDR image are first extracted. The intensity channel is then decomposed
into its large-scale and detail components using the bilateral filter. The dynamic range of
the large-scale component is reduced (using, for instance, linear or logarithmic scaling) to
fit into the dynamic range of the display; it is then combined with the detail component
to form the tone-mapped intensity, which is finally combined with the color channel to
form the final image. The detail component preserves the high frequency content of the
image. Since bilateral filtering is used to obtain the large-scale component, the edges are
not blurred and the so-called halo artifacts are avoided. Figure 13.12 demonstrates this
framework in a practical situation.
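The following is a simplified sketch of this tone-mapping pipeline in the spirit of Reference [10]; the intensity definition, the logarithmic compression of the base layer, and the constants are assumptions made for illustration, and the bilateral_filter sketch from above is reused.

```python
import numpy as np

def tone_map(hdr_rgb, sigma_d=16.0, sigma_r=0.4, target_contrast=50.0):
    """Bilateral-filter tone mapping sketch: compress the base, keep the detail."""
    eps = 1e-6
    intensity = hdr_rgb.mean(axis=2) + eps              # simple intensity channel
    color = hdr_rgb / intensity[..., None]              # per-channel color ratios
    log_i = np.log10(intensity)
    base = bilateral_filter(log_i, sigma_d, sigma_r)    # large-scale component
    detail = log_i - base                               # small-scale component
    # compress only the base so that its range maps to log10(target_contrast)
    scale = np.log10(target_contrast) / max(base.max() - base.min(), eps)
    out_log = base * scale + detail
    out_i = 10.0 ** (out_log - base.max() * scale)      # normalize so the maximum maps to 1
    return np.clip(color * out_i[..., None], 0.0, 1.0)
```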
In the retinex model, the log-image s is the sum of an illumination component l and a reflectance component r, so that by smoothing s the illumination component l can be estimated.
Reference [11] uses two bilateral filters, one for extracting the illumination component
and the other for denoising the reflectance component. Since the reflectance R is in the
range [0, 1], it holds that s ≤ l. Therefore, in calculating the illumination component l, only
the pixels with value larger than the value of the center pixel are included in the bilateral
filter. Once l is calculated, s − l gives the reflectance component r. Reference [11] uses a
second bilateral filter to remove noise from the reflectance component. As the noise is more
pronounced in the darker regions, the bilateral filter is adapted spatially through the range
parameter as σr (x) = (c1 s(x)c2 + c3 )−1 , where c1 , c2 , and c3 are some constants. With this
adaptation, larger σr (therefore, stronger filtering) is applied for smaller s.
Another contrast enhancement algorithm where bilateral filtering is utilized is presented
in Reference [12]. Similar to Reference [10], an image is decomposed into its large-scale
and detail components using the bilateral filter. The large-scale component is modified with
a histogram specification; the detail component is modified according to a textureness mea-
sure, which quantifies the degree of local texture. The textureness measure TI is obtained
by cross (or joint) bilateral filtering the high-pass filtered image HI as follows:
1
TI (x) = ∑ Kd (ky − xk) Kr (|I(y) − I(x)|) |HI (y)|,
C(x) y∈N
(13.9)
(x)
where |·| returns the absolute values, and the cross bilateral filter smooths |HI | without
blurring edges. The term cross (or joint) bilateral filter [13], [14] is used because input to
the kernel Kr (·) is I, but not |HI |. In other words, the edge information comes from I while
|HI | is filtered.
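A sketch of this textureness computation: the cross (joint) bilateral filter below takes the edge information from a guide image while filtering a different target, following Equation 13.9. The simple box high-pass used to obtain |HI| and all parameter values are assumptions made for illustration.

```python
import numpy as np

def cross_bilateral(guide, target, sigma_d=3.0, sigma_r=30.0):
    """Smooth `target` with weights computed from `guide` (cross/joint bilateral)."""
    radius = int(3 * sigma_d)
    pg = np.pad(guide, radius, mode='reflect')
    pt = np.pad(target, radius, mode='reflect')
    out = np.zeros_like(target, dtype=np.float64)
    norm = np.zeros_like(target, dtype=np.float64)
    H, W = target.shape
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            kd = np.exp(-(dy * dy + dx * dx) / (2 * sigma_d ** 2))
            g = pg[radius + dy: radius + dy + H, radius + dx: radius + dx + W]
            t = pt[radius + dy: radius + dy + H, radius + dx: radius + dx + W]
            kr = np.exp(-(g - guide) ** 2 / (2 * sigma_r ** 2))   # range kernel uses I, not |H_I|
            out += kd * kr * t
            norm += kd * kr
    return out / norm

def textureness(image, sigma_d=3.0, sigma_r=30.0):
    # crude 5x5 box high-pass as a stand-in for H_I (an assumption)
    k = 5
    pad = np.pad(image, k // 2, mode='reflect')
    box = np.mean([pad[i:i + image.shape[0], j:j + image.shape[1]]
                   for i in range(k) for j in range(k)], axis=0)
    return cross_bilateral(image, np.abs(image - box), sigma_d, sigma_r)
```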
FIGURE 13.13
Flowchart for the fusion of flash and no-flash images.
FIGURE 13.15
Bilateral filtering is applied on a step edge signal to illustrate the effects of the filter parameters σr and σd on
blocking artifacts.
One issue in fusing flash and no-flash images is the presence of flash shadows. In Reference [13], shadow regions are detected
and excluded from bilateral filtering in extracting the detail layer. The same work also
proposes the use of the cross bilateral filter in obtaining the large-scale component of the
no-flash image when it is too dark and thus suffers from a low signal-to-noise ratio:
Îno-flash(x) = (1/C(x)) Σ_{y∈N(x)} Kd(‖y − x‖) Kr(|Iflash(y) − Iflash(x)|) Ino-flash(y).  (13.10)
FIGURE 13.16
The block diagram of the method in Reference [16]. Discontinuity and texture detection modules produce
space varying maps that are used to compute the range and domain parameters of the bilateral filter. The
bilateral filter is then applied to the image based on these parameters.
FIGURE 13.17
Discontinuity and texture map generation: (a) input compressed image, (b) texture map, and (c) block discon-
tinuity map produced by the method of Reference [16].
Compression introduces intensity discontinuities at block boundaries, particularly in smooth regions, and the parameters of the bilateral filter should be carefully chosen to suppress them.
As illustrated in Figure 13.15, when the σr value is less than the discontinuity amount, the
filter is basically useless for eliminating the discontinuity. When σr is larger than the dis-
continuity amount, the discontinuity can be eliminated. The extent of the smoothing can
be controlled by the σd value. The larger the σd value, the wider the extent of smoothing
is. On the other hand, if σr value is less than the discontinuity amount, elimination of the
discontinuity is impossible no matter the value of σd .
Figure 13.16 shows the flowchart of this method. The block discontinuity amounts are
detected at the block boundaries, and then spatially interpolated to obtain a discontinuity
map. The σr value at each pixel is adjusted accordingly; specifically, σr at a pixel is linearly
proportional to the discontinuity map value. On the other hand, the σd value is adjusted
according to the local texture to avoid over-smoothing. A texture map is obtained by calcu-
lating the local standard deviation at every pixel; the σd value is set inversely proportional
to the texture map value at each pixel. Figure 13.17 shows discontinuity and texture maps
for a compressed image. Figure 13.18 compares the results of the standard bilateral filter
and the spatially adaptive bilateral filter.
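The sketch below illustrates how such spatially varying parameter maps could be computed; the 8×8 block grid, the linear mapping constants, and the exact definitions of the discontinuity and texture maps are assumptions made for illustration and may differ from Reference [16]. The image size is assumed to be a multiple of the block size.

```python
import numpy as np

def parameter_maps(image, block=8, a_r=2.0, a_d=6.0):
    """Return per-pixel (sigma_r, sigma_d) maps for adaptive deblocking (a sketch)."""
    img = image.astype(np.float64)
    H, W = img.shape
    # block discontinuity: intensity jumps measured across block boundaries
    disc = np.zeros_like(img)
    disc[:, block::block] = np.abs(img[:, block::block] - img[:, block - 1:-1:block])
    disc[block::block, :] = np.maximum(disc[block::block, :],
                                       np.abs(img[block::block, :] - img[block - 1:-1:block, :]))
    disc_map = np.repeat(np.repeat(
        disc.reshape(H // block, block, W // block, block).max(axis=(1, 3)),
        block, axis=0), block, axis=1)
    # texture map: local standard deviation computed over each block
    blocks = img.reshape(H // block, block, W // block, block)
    tex_map = np.repeat(np.repeat(blocks.std(axis=(1, 3)), block, axis=0), block, axis=1)
    sigma_r_map = a_r * disc_map                 # stronger range smoothing at discontinuities
    sigma_d_map = a_d / (1.0 + tex_map)          # smaller spatial support in textured areas
    return sigma_r_map, sigma_d_map
```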
(a) (b)
(c) (d)
FIGURE 13.18
Compression artifact reduction using the bilateral filter: (a) original image, (b) compressed image, (c) result of
the standard bilateral filter with σr = 20 and σd = 3, and (d) result of the adaptive bilateral filter [16].
FIGURE 13.19
Illustration of the mesh denoising methods of References [17] and [18].
d = (1/C(x)) Σ_y Kd(‖y − x‖) Kr(⟨nx, (y − x)⟩) ⟨nx, (y − x)⟩,  (13.13)
where the inner product hnx , (y − x)i gives the projection of the difference (y − x) onto the
surface normal nx , and thus the bilateral filter smooths the projections within a local space
and updates the vertex as a weighted sum of the projections.
Reference [18] takes a different approach to mesh smoothing. Suppose that q is a surface
within a neighborhood of the vertex x, cq is the centroid of the surface, and aq is the area
of the surface. The prediction pq (x) of the vertex x based on the surface q is the projection
of x to the plane tangent to the surface q. Then the vertex x is updated as follows:
x̂ = (1/C(x)) Σ_q aq Kd(‖cq − x‖) Kr(‖pq(x) − x‖) pq(x).  (13.14)
The inclusion of the area aq is to give more weight to predictions coming from larger
surfaces. Figure 13.19 illustrates the methods of References [17] and [18].
where MR is a mask of ones and zeros, indicating the locations of red samples, and
Kr (|IG (y) − IG (x)|) captures the edge information from the green channel. The blue chan-
nel is updated similarly.
FIGURE 13.20
Demosaicking using the bilateral filter: (a) bilinear interpolation of a Bayer sampled data, (b) the standard
POCS interpolation method of Reference [22], (c) the POCS method with the addition of the bilateral constraint
set [21].
In Reference [21], the green channel is again used as a reference to interpolate the red and
blue channels. Instead of applying the bilateral filter directly, the interpolation problem is
formulated as an optimization problem, and solved using the projections onto convex sets
(POCS) technique [22]. The POCS technique starts with an initial estimate and updates
it iteratively by projecting onto constraint sets. Denoting BF(·) as the bilateral filter, a constraint set SR is defined on the red (or blue) channel to limit the deviation from the green channel: the color difference IR − IG is required to stay within a positive threshold T of its bilaterally filtered version BF(IR − IG). For an object in the scene, the difference IR − IG should be
constant or change smoothly. This constraint set guarantees that IR − IG changes smoothly
in space; and the bilateral filter prevents crossing edges. Reference [21] defines additional
constraint sets, including data fidelity and frequency similarity constraint sets, and uses the
POCS technique to perform the interpolation. A sample result is shown in Figure 13.20.
This vector representation can be used to interpret the bilateral filter as linear filtering of
the entries of a vector-valued image separately, followed by division of the first entry by
the second.
More explicitly, the bilateral filter is implemented by defining the 3D grids, Γ1 and Γ2 ,
of a 2D image I as follows:
Γ1(x1, x2, r) = I(x1, x2) if r = I(x1, x2), and 0 otherwise.  (13.18)
Γ2(x1, x2, r) = 1 if r = I(x1, x2), and 0 otherwise.  (13.19)
These grids are then convolved with a 3D Gaussian, Kd,r, whose standard deviation is σd in the 2D spatial domain and σr in the 1D intensity domain; the filter output is obtained by dividing the smoothed Γ1 by the smoothed Γ2 and evaluating the ratio at r = I(x1, x2).
When the spatial kernel is a uniform box over the neighborhood N(x), the filter output can also be written in terms of local histograms:
Î(x) = (1/C(x)) Σ_{y∈N(x)} Kr(|I(y) − I(x)|) I(y) = (1/C(x)) Σ_{i=0}^{255} i Hx(i) Kr(|i − I(x)|),  (13.23)
where Hx(i) denotes the local histogram of intensities within the neighborhood N(x).
This equation reveals that the bilateral filter could be approximated from spatially filtered versions of I, I², and I³. Defining zn as the convolution of Kd(·) with Iⁿ, the filter output can then be expressed in terms of the zn.
In the robust estimation framework, the filtered value at x minimizes a cost function of the form Φ(I(x)) = Σ_y Kd(‖y − x‖) ρ(I(y) − I(x)), where ρ(·) is a robust function that penalizes the difference between I(x) and I(y). The reg-
ularization term also includes Kd (ky − xk) to give more weight to the pixels that are close
to x. If ρ (·) is differentiable, the solution that minimizes the cost function can be found
iteratively using a gradient descent technique. For instance, an iteration of the steepest
descent algorithm is
Î(x) = I(x) − µ ∂Φ(I(x))/∂I(x) = I(x) + µ Σ_y Kd(‖y − x‖) ρ′(I(y) − I(x)),  (13.30)
y
where µ is set to 1/C(x). This means that different versions of the bilateral filter can be
defined based on robust estimation with the range kernel Kr(α) = ρ′(α)/α. For example, using ρ(α) = 1 − exp(−α²/(2σ²)) provides the standard Gaussian kernel. Other possible
choices include the Tukey function
ρ(α) = (α/σ)² − (α/σ)⁴ + (1/3)(α/σ)⁶ if |α| ≤ σ, and 1/3 otherwise,  (13.32)
and the Huber function
ρ(α) = α²/(2σ) + σ/2 if |α| ≤ σ, and |α| otherwise.  (13.33)
Also note that, as seen in Equation 13.30, the contribution of an input to the update is
proportional to ρ′(·), the so-called influence function [10]. The influence function, in other
FIGURE 13.21
Example robust functions with their corresponding influence functions and kernels: (top) Gaussian function,
(middle) Tukey function, (bottom) Huber function; (a) ρ, (b) Kr, and (c) ρ′.
words, can be used to analyze the response of the filter to outliers. The Gaussian, Tukey,
and Huber robust functions, with corresponding influence functions and kernels are given
in Figure 13.21.
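The range kernels obtained from the three robust functions above, Kr(α) = ρ′(α)/α, can be written down directly; the following sketch evaluates them, with σ playing the role of the scale parameter and the normalizations left as they come out of the derivatives.

```python
import numpy as np

def gaussian_kernel(alpha, sigma):
    # rho(alpha) = 1 - exp(-alpha^2/(2 sigma^2))  ->  Kr = exp(-alpha^2/(2 sigma^2)) / sigma^2
    return np.exp(-alpha ** 2 / (2 * sigma ** 2)) / sigma ** 2

def tukey_kernel(alpha, sigma):
    # rho'(alpha)/alpha = (2/sigma^2) (1 - (alpha/sigma)^2)^2 for |alpha| <= sigma, 0 outside
    u = alpha / sigma
    return np.where(np.abs(u) <= 1.0, (2.0 / sigma ** 2) * (1.0 - u ** 2) ** 2, 0.0)

def huber_kernel(alpha, sigma):
    # rho'(alpha)/alpha = 1/sigma for |alpha| <= sigma, 1/|alpha| outside
    return np.where(np.abs(alpha) <= sigma, 1.0 / sigma, 1.0 / np.maximum(np.abs(alpha), 1e-12))
```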
The bilateral filter is also related to weighted least squares estimation [35]. Defining v as the vectorized version of the image I, the regularization term in weighted least squares estimation is written in terms of shift operators Dm, where the operator Dm shifts a signal by m samples, and a weighting matrix W. With proper
choice of weighting matrix (in particular, choosing the weighting matrix such that it penal-
izes the pixel intensity differences and shift amounts with Gaussian functions), an iteration
of the weighted least squares estimation becomes equivalent to the bilateral filter [35].
To study the connection with partial differential equations, a one-dimensional continuous version of the bilateral filter,
Î(x) = (1/C(x)) ∫_{x−σd}^{x+σd} exp(−|I(y) − I(x)|² / (2σr²)) I(y) dy,  (13.35)
is considered in Reference [37] to show that the evolution (or temporal derivative) of the signal, It(x) ≡ Î(x) − I(x), is proportional to its second derivatives Iηη(x) in the gradient direction and Iξξ(x) in the direction orthogonal to the gradient:
It(x) ≅ a1 Iξξ(x) + a2 Iηη(x),  (13.36)
where a1 and a2 are functions of σd , σr , and the gradient of I(x). The signs of a1 and a2
determine the behavior of the filter.
Namely, if σr ≫ σd, both a1 and a2 become equal positive constants, the sum a1 Iξξ(x) + a2 Iηη(x) becomes the Laplacian of the signal, and the bilateral filter becomes a low-pass filter.
If σr and σd have the same order of magnitude, the filter behaves like a Perona-Malik
model [38] and shows smoothing and enhancing characteristics. Since a1 is positive and
decreasing, there is always diffusion in the tangent direction Iξ ξ . On the other hand, a2 is
positive when the gradient is less than a threshold (which is proportional to σd /σr ), and the
filter is smoothing in the normal direction Iηη . The term a2 is negative when the gradient
is larger than the threshold; in this case, the filter shows an enhancing behavior. Finally, if
σr ≪ σd, then a1 and a2 both tend to zero, and the signal is hardly altered.
Reference [37] also shows that, when σr and σd have the same order of magnitude,
the bilateral filter and the Perona-Malik filter can be decomposed in the same way as in
Equation 13.36, and produce visually very similar results even though their weights are
not identical. Similar to the Perona-Malik filter, the bilateral filter can create contouring
artifacts, also known as shock or staircase effects. These shock effects occur at signal lo-
cations where the convex and concave parts of the signal meet, in other words, at inflection
points where the second derivative is zero (see Figure 13.22). Reference [37] proposes to
do linear regression to avoid these artifacts. With linear regression, the weights a1 and a2
in Equation 13.36 become positive in smooth regions, and no contours or flat zones are
created.
FIGURE 13.22
The shock effect is illustrated through iterative application of the bilateral filter with σr = 4 and σd = 10.
FIGURE 13.23
The trilateral filter [39] adapts to local slope.
Î(x) = (1/C(x)) Σ_{y∈N(x)} Kd(‖y − x‖) Kr(‖I(y) − I(x)‖) I(y).  (13.39)
Clearly, the bilateral filter is a special case of the non-local means filter. The non-local means filter has been shown to perform better than the bilateral filter since it can be used with a larger spatial support σd, as it robustly finds similar regions through Kr(‖I(y) − I(x)‖). The disadvantage is, however, its high computational cost.
In addition to the kernels that have been mentioned so far (e.g., the uniform box kernel, the Gaussian kernel, and kernels derived from robust functions in Section 13.4.1), it is possible to use other kernel types and modifications. For example, salt-and-pepper noise cannot be eliminated effectively with the standard bilateral filter because a noisy pixel is likely to be significantly different from its surroundings, resulting in Kr(·) being close to zero. By using the me-
dian value Imed (x) of the local neighborhood around x, the impulse noise can be eliminated:
Kr(|I(y) − Imed(x)|) = exp(−|I(y) − Imed(x)|² / (2σr²)).  (13.40)
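A sketch of this median-centered variant, which reuses the brute-force structure of the earlier bilateral filter sketch and obtains Imed(x) with scipy.ndimage.median_filter; the window sizes and parameters are illustrative.

```python
import numpy as np
from scipy.ndimage import median_filter

def median_centered_bilateral(image, sigma_d=2.0, sigma_r=30.0, med_size=3):
    """Bilateral filter whose range kernel is centered on the local median (Eq. 13.40)."""
    med = median_filter(image.astype(np.float64), size=med_size)   # Imed(x)
    radius = int(3 * sigma_d)
    pad = np.pad(image.astype(np.float64), radius, mode='reflect')
    out = np.zeros_like(med)
    norm = np.zeros_like(med)
    H, W = image.shape
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            kd = np.exp(-(dy * dy + dx * dx) / (2 * sigma_d ** 2))
            shifted = pad[radius + dy: radius + dy + H, radius + dx: radius + dx + W]
            kr = np.exp(-(shifted - med) ** 2 / (2 * sigma_r ** 2))   # compare to Imed(x)
            out += kd * kr * shifted
            norm += kd * kr
    return out / norm
```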
13.6 Conclusions
This chapter surveyed bilateral filter-driven methods and their applications in image
processing and computer vision. The theoretical foundations of the filter were provided; in
particular, the connections with robust estimation, weighted least squares estimation, and
partial differential equations were pointed out. A number of extensions and variations of
the filter were discussed. Since the filter is nonlinear, its fast implementation is critical
for practical applications; therefore, the main approaches to fast bilateral filtering were discussed as well.
The bilateral filter has started receiving attention very recently, and there are open prob-
lems and room for improvement. Future research topics include optimal kernel and param-
eter selection specific to applications, fast and accurate implementations for multidimen-
sional signals, efficient hardware implementations, spatial adaptation, and modifications to
avoid staircase artifacts.
Acknowledgment
This work was supported in part by the National Science Foundation under Grant No.
0528785 and National Institutes of Health under Grant No. 1R21AG032231-01.
References
[1] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in Proceedings
of the IEEE International Conference on Computer Vision, Bombay, India, January 1998,
pp. 839–846.
[2] S.M. Smith and J.M. Brady, “SUSAN – A new approach to low level image processing,” Inter-
national Journal of Computer Vision, vol. 23, no. 1, pp. 45–78, May 1997.
[3] J.S. Lee, “Digital image smoothing and the sigma filter,” Graphical Models and Image Pro-
cessing, vol. 24, no. 2, pp. 255–269, November 1983.
[4] L. Yaroslavsky, Digital Picture Processing - An Introduction. Berlin, Germany: Springer-
Verlag, December 1985.
[5] J. van de Weijer and R. van den Boomgaard, “Local mode filtering,” in Proceedings of IEEE
Conference on Computer Vision and Pattern Recognition, Kauai, HI, USA, December 2001,
vol. 2, pp. 428–433.
[6] M. Zhang and B.K. Gunturk, “Multiresolution bilateral filtering for image denoising,” IEEE
Transactions on Image Processing, vol. 17, no. 12, pp. 2324–2333, December 2008.
[7] E. Reinhard, G. Ward, S. Pattanaik, and P. Debevec, High Dynamic Range Imaging: Acqui-
sition, Display, and Image-Based Lighting. San Francisco, CA: Morgan Kaufmann, August
2005.
[8] P. Debevec and J. Malik, “Recovering high dynamic range radiance maps from photographs,”
in Proceedings of International Conference on Computer Graphics and Interactive Tech-
niques, San Diego, CA, USA, August 1997, pp. 369–378.
[9] M.A. Robertson, S. Borman, and R.L. Stevenson, “Dynamic range improvement through mul-
tiple exposures,” in Proceedings of the IEEE International Conference on Image Processing,
Kobe, Japan, October 1999, vol. 3, pp. 159–163.
[10] F. Durand and J. Dorsey, “Fast bilateral filtering for the display of high-dynamic-range im-
ages,” ACM Transactions on Graphics, vol. 21, no. 3, pp. 257–266, July 2002.
[11] M. Elad, “Retinex by two bilateral filters,” in Proceedings of Scale-Space and PDE Methods
in Computer Vision, vol. 3459, Hofgeismar, Germany, April 2005, pp. 217–229.
[12] S. Bae, S. Paris, and F. Durand, “Two-scale tone management for photographic look,” ACM
Transactions on Graphics, vol. 25, no. 3, pp. 637–645, July 2006.
[13] E. Eisemann and F. Durand, “Flash photography enhancement via intrinsic relighting,” ACM
Transactions on Graphics, vol. 25, no. 3, pp. 673–678, August 2004.
[14] G. Petschnigg, M. Agrawala, H. Hoppe, R. Szeliski, M. Cohen, and K. Toyama, “Digital
photography with flash and no-flash image pairs,” ACM Transactions on Graphics, vol. 25,
no. 3, pp. 664–672, August 2004.
[15] E.P. Bennett, J.L. Mason, and L. McMillan, “Multispectral video fusion,” in Proceedings of
International Conference on Computer Graphics and Interactive Techniques, Boston, MA,
USA, July 2006, p. 123.
[16] M. Zhang and B.K. Gunturk, “Compression artifact reduction with adaptive bilateral filtering,”
Proceedings of SPIE, vol. 7257, p. 72571A, January 2009.
[17] S. Fleishman, I. Drori, and D. Cohen-Or, “Bilateral mesh denoising,” ACM Transactions on
Graphics, vol. 22, no. 3, pp. 950–953, July 2003.
[18] T.R. Jones, F. Durand, and M. Desbrun, “Non-iterative, feature-preserving mesh smoothing,”
ACM Transactions on Graphics, vol. 22, no. 3, pp. 943–949, July 2003.
[19] B.K. Gunturk, J. Glotzbach, Y. Altunbasak, R.W. Schafer, and R.M. Mersereau, “Demosaick-
ing: Color filter array interpolation in single-chip digital cameras,” IEEE Signal Processing
Magazine, vol. 22, no. 1, pp. 44–55, January 2005.
[20] R. Ramanath and W.E. Snyder, “Adaptive demosaicking,” Journal of Electronic Imaging,
vol. 12, no. 4, pp. 633–642, October 2003.
[21] M. Gevrekci, B.K. Gunturk, and Y. Altunbasak, “Restoration of Bayer-sampled image se-
quences,” Computer Journal, vol. 52, no. 1, pp. 1–14, January 2009.
[22] B.K. Gunturk, Y. Altunbasak, and R.M. Mersereau, “Color plane interpolation using alter-
nating projections,” IEEE Transactions on Image Processing, vol. 11, no. 9, pp. 997–1013,
September 2002.
[23] J. Xiao, H. Cheng, H. Sawhney, C. Rao, and M. Isnardi, “Bilateral filtering-based optical flow
estimation with occlusion detection,” in Proceedings of European Conference on Computer
Vision, Graz, Austria, May 2006, vol. 1, pp. 211–224.
[24] E.A. Khan, E. Reinhard, R. Fleming, and H. Buelthoff, “Image-based material editing,” ACM
Transactions on Graphics, vol. 25, no. 3, pp. 654–663, July 2006.
[25] H. Winnemoller, S.C. Olsen, and B. Gooch, “Real-time video abstraction,” ACM Transactions
on Graphics, vol. 25, no. 3, pp. 1221–1226, July 2006.
[26] B.M. Oh, M. Chen, J. Dorsey, and F. Durand, “Image-based modeling and photo editing,” in
Proceedings of ACM Annual conference on Computer Graphics and Interactive Techniques,
Los Angeles, CA, USA, August 2001, pp. 433–442.
[27] S. Paris, H. Briceno, and F. Sillion, “Capture of hair geometry from multiple images,” ACM
Transactions on Graphics, vol. 23, no. 3, pp. 712–719, July 2004.
[28] W.C.K. Wong, A.C.S. Chung, and S.C.H. Yu, “Trilateral filtering for biomedical images,” in
Proceedings of IEEE International Symposium on Biomedical Imaging, Arlington, VA, USA,
April 2004, pp. 820–823.
[29] E.P. Bennett and L. McMillan, “Video enhancement using per-pixel virtual exposures,” ACM
Transactions on Graphics, vol. 24, no. 3, pp. 845–852, July 2005.
[30] T.Q. Pham and L.J. Vliet, “Separable bilateral filtering for fast video preprocessing,” in Pro-
ceedings of the IEEE International Conference on Multimedia and Expo, Amsterdam, Nether-
lands, July 2005, pp. 1–4.
[31] S. Paris and F. Durand, “A fast approximation of the bilateral filter using a signal processing
approach,” in Proceedings of European Conference on Computer Vision, Graz, Austria, May
2006, pp. 568–580.
[32] F. Porikli, “Constant time O(1) bilateral filtering,” in Proceedings of International Conference
on Computer Vision and Pattern Recognition, Anchorage, AK, USA, June 2008, pp. 1–8.
[33] F. Porikli, “Integral histogram: A fast way to extract histograms in Cartesian spaces,” in
Proceedings of International Conference on Computer Vision and Pattern Recognition, San
Diego, CA, USA, June 2005, pp. 829–836.
[34] B. Weiss, “Fast median and bilateral filtering,” ACM Transactions on Graphics, vol. 25, no. 3,
pp. 519–526, July 2006.
[35] M. Elad, “On the origin of the bilateral filter and ways to improve it,” IEEE Transactions on
Image Processing, vol. 11, no. 10, pp. 1141–1151, October 2002.
[36] D. Barash, “Fundamental relationship between bilateral filtering, adaptive smoothing, and the
nonlinear diffusion equation,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, vol. 24, no. 6, pp. 844–847, June 2002.
[37] A. Buades, B. Coll, and J. Morel, “Neighborhood filters and PDE’s,” Numerische Mathematik,
vol. 105, no. 1, pp. 1–34, October 2006.
[38] P. Perona and J. Malik, “Scale-space and edge detection using anisotropic diffusion,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 7, pp. 629–639, July
1990.
[39] P. Choudhury and J.E. Tumblin, “The trilateral filter for high contrast images and meshes,” in
Proceedings of Eurographics Symposium on Rendering, Leuven, Belgium, June 2003, vol. 44,
pp. 186–196.
[40] A. Buades, B. Coll, and J. Morel, “On image denoising methods,” Technical Report 2004-15,
CMLA, 2004.
14
Painterly Rendering
14.1 Introduction
For centuries, artists have been developing different tools and various styles to produce
artistic images, which are usually in a way more interesting than mere representations of
scenes from the real world. While classical tools, such as brushes, ink pens or pencils,
require skills, effort and talent, science and technology can make more advanced tools that
can be used by all people, not only artists, to produce artistic images with little effort.
Recently, scientists have been showing increasing interest in visual arts. On one side,
psychologists and neurophysiologists attempt to understand the relation between the way
artists produce their works and the function of the visual system of the brain. Examples
of such studies include the use of principles of gestalt psychology to understand and de-
scribe art [1], [2], [3], deriving spatial organization principles and composition rules of
artwork from neural principles [1], [4], or understanding the biological basis of aesthetic
experiences [5], [6], [7]. A recent overview of these findings is presented in Reference [8].
On the other side, computer scientists and engineers are developing painterly rendering al-
gorithms which imitate painting styles. There is a large variety of such algorithms, both
unsupervised [9], [10], [11], [12] and interactive [13], [14], [15], [16]. Much effort has
been made to model different painterly styles [17], [18], [19], [20] and techniques [21],
[22], [23], [24], [25], [26], especially watercolor [27], [21], [28], [29], [30], [31], [32],
[33], [34], and to design efficient interactive user interfaces [14], [35], [36], some of which
deploy special-purpose hardware [35]. An overview of these techniques can be found in
Reference [37]. The importance of such algorithms is two-fold; computers have the poten-
tial to help the non-specialist to produce their own art and the artist to develop new forms
of art such as impressionistic movies [9], [38], [39] or stereoscopic paintings [40], [41].
This chapter focuses on unsupervised painterly rendering, that is, fully automatic algo-
rithms which convert an input image into a painterly image in a given style. This problem
has been faced in different ways; for example, artistic images can be generated by simulat-
ing the process of putting paint on paper or canvas. A synthetic painting is represented as a
list of brush strokes which are rendered on a white or canvas textured background. Several
mathematical models of a brush stroke are proposed, and special algorithms are developed
to automatically extract brush stroke attributes from the input image. Another approach
suggests abstracting from the classical tools that have been used by artists and focusing on
the visual properties, such as sharp edges or absence of natural texture, which distinguish
painting from photographic images. These two classes of algorithms will be explored and
discussed in the next sections. The existence of major areas of painterly rendering, in which
the input is not a photographic image and whose treatment goes beyond the scope of this
chapter, will also be acknowledged. Important examples are painterly rendering on video
sequences and the generation of artistic images from three-dimensional models of a real
scene.
The chapter is organized as follows. Section 14.2 describes brush stroke oriented
painterly rendering algorithms, including physical models of the interaction of a fluid pig-
ment with paper or canvas. More specifically, Section 14.2.1 focuses on imitating the
appearance of a single brush stroke in a given technique, such as watercolor, impasto, or
oil-painting, when all its attributes are given. Section 14.2.2 presents algorithms for ex-
tracting the brush stroke attributes from an input image. Section 14.3 describes methods
which aim at simulating the visual properties of a painting regardless of the process that
artists perform. Conclusions are drawn in Section 14.4.
input image I(r) → brush stroke attribute extraction → list of brush strokes L → brush stroke rendering → output image y(r)
FIGURE 14.1
Schematic representation of brush stroke-based painterly rendering.
(a) (b)
FIGURE 14.2
Brush stroke-based painterly rendering using the approach described in Reference [9]: (a) input image and (b)
output image.
bristle trajectory. Specifically, the trajectory T of the brush is computed as B(r1, ..., rN). The trajectory of the k-th bristle is computed as B(r1 + δ1^(k) u1, ..., rN + δN^(k) uN), where the coefficient δi^(k), associated with the i-th control point and the k-th bristle, is proportional to the pressure value pi. This takes into account the fact that when a brush is pushed against the
paper with higher pressure, the bristles will spread out more. Once the trajectory Bk of each
bristle is determined, it is rendered by changing the color of those pixels which are crossed
by Bk . The color of the trace left out by each bristle can be determined in different ways.
The simplest one would be to assign a constant color to all bristles of a given brush stroke.
A more complex diffusion scheme can change the color across the brush stroke; the color
ck (t) of the k-th bristle at time t can be computed from the diffusion equation
ck(t + Δt) = (1 − λ) ck(t) + λ (ck−1(t) + ck+1(t)) / 2,  (14.1)
where the coefficient λ ∈ [0, 1] is related to the speed of diffusion. Another aspect taken into account is the amount of ink in each bristle. Specifically, the length of the trace left
by each bristle is made proportional to the amount of ink of that bristle. This influences the
appearance of the resulting brush stroke, as it will be more “compact” at its starting point
and more “hairy” at its end point. In Reference [42], the speed at which bristles become
empty is made proportional to the local pressure of the brush.
More sophisticated models perform fluid simulation by means of cellular automata. A cellular automaton is described by the following components:
• A lattice L of cells; here a two-dimensional square lattice is considered, but other
geometries, such as hexagonal, could be used as well. Each cell in the lattice is
identified by an index i = (i1, i2), which indicates the position of that cell in the
lattice.
• A set S of states in which each cell can be. The state of the cell in position i is
defined by a set of state variables ai , bi , ci , etc.
• A state transition function which determines the evolution of the cellular automata
over (discretized) time. Specifically, the state si (n + 1) of the cell i at time n + 1 is a
function F of the states of all cells k in the neighborhood Ni of i, that is, si (n + 1) =
F[sk(n) : k ∈ Ni]. The transition function is the most important part of a cellular automaton, since it determines the state of the entire automaton at any time for
every initial condition.
The notion of coupled cellular automata is also needed; two cellular automata C1 and C2
are coupled if the evolution of each one of them is determined by the states of both C1
and C2. In other words, for si^(1) and si^(2) denoting the states of the i-th cell of C1 and C2, respectively, the state transition functions of C1 and C2 can be written as follows:
si^(u)(n + 1) = F^(u)[sk1^(1)(n), sk2^(2)(n) : k1, k2 ∈ Ni],  (14.2)
with u = 1, 2. The following focuses on cellular automata with a deterministic state tran-
sition function, that is, one for which a given configuration of states at time n will always
produce the same configuration at time n + 1. Stochastic cellular automata could be con-
sidered as well, which are well known in the literature as discrete Markov random fields. In
general, cellular automata are capable of modeling any phenomenon in which each part in-
teracts only with its neighbors1 . Moreover, their structure makes them suitable for parallel
implementations. A deeper treatment of the subject can be found in Reference [43].
Cellular automata are used in painterly rendering to simulate the diffusion process that
occurs when a fluid pigment is transferred from a brush to the paper [44]. A brush stroke
simulation system based on cellular automata consists of a cellular automata that models
the paper, a cellular automata that models the brush, and a coupling equation. In the case
of a cellular automata that models the paper, the state of every cell describes the amount
(p) (p)
of water Wi and ink Ii that are present in each point of the paper. The state transition
function Fp is designed to take into account several phenomena, such as the diffusion of
water in paper, the diffusion of ink in water, and water evaporation. The diffusion of water
in paper can be isotropic, that is, water flows in all directions at equal speed, or anisotropic.
Anisotropy can be due to several factors, such as local directionality in the structure of
the paper, or by the effect of gravity, so that water flows downward at higher speed than
upward. In the case of a cellular automaton that models the brush, the state of every cell describes the amounts of water Wi^(b) and ink Ii^(b) that are present locally at each bristle of
the brush. The state transition function Fb regulates the amount of fluid that is transferred
to the paper, the flow of fluid from the tip of the brush to the contact point with the paper,
the diffusion of fluid between bristles, and the rate at which each bristle empties. The last element, a coupling equation, relates the states of the paper cellular automaton to the state of the brush cellular automaton. In the simplest case, this is just a balancing equation between the amount of
fluid that leaves the brush and the amount of fluid that comes into the paper at the contact
point between paper and brush.
In this framework, a brush stroke is simulated as described below. Let i(n) be the dis-
cretized position of the brush at time n, which is supposed to be given as input to the brush
stroke simulation system, and let 0 be the index of the cell of the brush cellular automata
that touches the paper (as shown in Figure 14.3 for a one-dimensional example). First, all
cells of the paper cellular automata are initialized to the value zero (no water or ink) and
the cells of the brush cellular automata are initialized to a value that is proportional to the
amount of pigment the brush has been filled with. Then, the states of both cellular automata
are iteratively updated according to their state transition functions. Specifically, at each iteration, given amounts ∆W and ∆I of water and ink are transferred from the brush to the paper at position i(n). Diffusion is then simulated by updating the states of all cells of the paper and brush cellular automata according to the respective transition functions Fp and Fb.
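A toy sketch of such a coupled brush/paper simulation is given below; the one-dimensional brush automaton, the transfer amounts, the conservative flow rule toward the tip, and the four-neighbor diffusion with periodic borders are all simplifying assumptions made for illustration, not the model of Reference [44].

```python
import numpy as np

def simulate_stroke(path, paper_shape=(64, 64), bristle_cells=8,
                    dW=0.05, dI=0.02, diff=0.2, steps_per_pos=3):
    """Deposit ink along a discretized brush path using two coupled automata."""
    W_p = np.zeros(paper_shape)           # water on the paper
    I_p = np.zeros(paper_shape)           # ink on the paper
    W_b = np.full(bristle_cells, 1.0)     # water stored along the brush (cell 0 = tip)
    I_b = np.full(bristle_cells, 1.0)     # ink stored along the brush
    for (r, c) in path:                   # discretized brush positions i(n)
        for _ in range(steps_per_pos):
            # coupling: transfer water and ink from the brush tip to the paper
            w_t, i_t = min(dW, W_b[0]), min(dI, I_b[0])
            W_b[0] -= w_t; I_b[0] -= i_t
            W_p[r, c] += w_t; I_p[r, c] += i_t
            # brush transition: fluid flows along the brush toward the tip
            for F in (W_b, I_b):
                flow = diff * np.diff(F)
                F[:-1] += flow; F[1:] -= flow
            # paper transition: isotropic 4-neighbor diffusion of water and ink
            for F in (W_p, I_p):
                lap = (np.roll(F, 1, 0) + np.roll(F, -1, 0) +
                       np.roll(F, 1, 1) + np.roll(F, -1, 1) - 4 * F)
                F += 0.25 * diff * lap
    return I_p                            # ink deposited on the paper

ink_map = simulate_stroke([(32, c) for c in range(10, 50)])   # a short horizontal stroke
```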
1 Due to the maximum speed at which information can be conveyed, namely, the speed of light, one can argue
that cellular automata can model every phenomenon in the real world.
FIGURE 14.3
Brush stroke simulation by means of cellular automata.
Once convergence is reached, the state of the paper cellular automata must be converted
into an image. The simplest approach is to modify the color of each pixel in proportion to
the amount of ink that is present in each cell of the paper cellular automata. Specifically, let
cb be the color of the ink and c p (i) the color of the paper at position i before the application
of the brush stroke. Then, the color of the paper at position i after the application of the
brush stroke is given by Ii cb + (1 − Ii )c p (i). Such a linear combination of cb and c p (i)
allows the simulation of the transparency of brush strokes which are placed on top of each
other. A more sophisticated approach consists of determining the color of each pixel of
the image by simulating the interaction between white light and a layer of dry pigment of a
given thickness. Specifically, it can be proved that the reflectance of the pigment layer can
be expressed as
Rx = 1 + K/S − √((K/S)² + 2K/S),  (14.4)
where K and S are the absorption and scattering coefficients of the pigment [45], [46].
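The two compositing options just described can be sketched as follows; the Kubelka-Munk expression uses the conventional form of Equation 14.4 (with the minus sign before the square root), and the array shapes are assumptions made for illustration.

```python
import numpy as np

def blend(ink_amount, ink_color, paper_color):
    """Linear blend I*cb + (1 - I)*cp per pixel; ink_amount in [0, 1],
    ink_color is an RGB triple, paper_color is an (H, W, 3) array."""
    a = np.clip(ink_amount, 0.0, 1.0)[..., None]
    return a * np.asarray(ink_color) + (1.0 - a) * paper_color

def kubelka_munk_reflectance(K, S):
    """Reflectance of a thick pigment layer with absorption K and scattering S."""
    ratio = K / np.maximum(S, 1e-12)
    return 1.0 + ratio - np.sqrt(ratio ** 2 + 2.0 * ratio)
```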
Reference [21] proposes a more sophisticated fluid simulation model for synthetic water-
color generation. In this approach, the interaction between water, ink, and paper is modeled
by means of three layers, each one of which is associated with a cellular automata. Namely,
a shallow water layer which models the motion of water, a pigment deposition layer which
models the absorption and desorption of pigment from paper, and a capillarity layer which
models the penetration of water into the pores of the paper. This model takes into account a
large number of factors, such as the fluid velocity in the x and y directions, water pressure,
the effect of gravity due to the inclination of the paper with respect to the vertical direction,
and physical properties of the solution of ink and water, such as viscosity and viscous drag.
As a result, the brush stroke obtained with this system is virtually indistinguishable from
a true watercolor brush stroke.
FIGURE 14.4
Example painting styles: (a) Seurat’s painting La Parade de Cirque from 1889, (b) Monet’s painting Impres-
sion, Sunrise from 1872, and (c) Van Gogh’s painting Road with Cypress and Star from 1890.
Gogh, used elongated brush strokes to add to a painting a geometric structure which makes
it more vibrant (Figure 14.4c). Therefore, length l and width w of brush strokes are usu-
ally determined by the desired painting style rather than automatically extracted from the
input image. In a number of algorithms, the values of l and w are the same for all brush
strokes and are specified by the user depending on the desired effect. In some cases, an
interface is provided in which the user selects the painting style and the values of l and w
are determined accordingly. Other approaches take into account the fact that artists often
render different areas of the paintings at different levels of detail. In particular, artists first
make a simplified sketch of their painting by using coarse brush strokes, and then paint on
top of it with smaller brushes to render finer details. The so-called coarse-to-fine rendering
algorithms [17], [38] produce a final synthetic painting as the superposition of different
layers to be rendered on top of each other. On the lowest layers, coarse brush strokes are
rendered, while at the highest layers finer brush strokes are present only on those regions
which need to be rendered at finer detail. The success of such an approach depends on the
strategies deployed to identify which regions of the painting should be present in each layer.
In Reference [17], such areas are determined iteratively; specifically, the coarsest layer is
initialized by rendering brush strokes on the whole image. Then, in the k-th layer, brush
strokes are rendered only in those regions for which the difference in color between the
input image and the painting rendered up to the (k − 1)-th layer exceeds a given threshold.
A different approach is presented in Reference [38]; regions to be rendered at a higher level
of detail are detected by looking at the frequency content of each edge. In Reference [19],
the level of detail in different areas of the painting is determined by a measure of saliency,
or visual interest, of the different areas of the painting. Specifically, a saliency map is com-
puted by looking at how frequently the local pattern around each pixel occurs in the image
and assigning high saliency to the most rare patterns; afterwards, high saliency areas are
rendered with smaller brush strokes.
The simplest method to compute the position of each brush stroke is to place them on a
square lattice, whose spacing is less than or equal to the width of the brush stroke. This
guarantees that every pixel of the output image is covered by at least one brush stroke.
374 Computational Photography: Methods and Applications
The main disadvantage of this approach is that it works well only with brush stroke of fixed
size. In more sophisticated approaches [10], the position of each brush stroke is determined
by its area. Specifically, an area map A(r) is first extracted from the input image; then, a
random point set S is generated, whose local density is a decreasing function of A(r). Each
point S is the position of a brush stroke. One possible way to compute the function A(r),
proposed in Reference [10], is to analyze the moments up to the second order of regions
whose color variation is below a given threshold.
Orientation is another attribute to consider. A vector field v(r) is introduced such that
a brush stroke placed at point r0 is locally oriented along v(r0) [9], [17], [38], [47]. The
simplest approach to automatic extraction of a suitable vector field from the input image
consists of orienting v(r) orthogonally to the gradient of the input image. This simulates
the fact that artists draw brush strokes along the object contours. However, the gradient
orientation is a reliable indicator only in the presence of very high contrast edges and tends
to be random on textures as well as on regions with slowly varying color. Therefore, the
other approach is to impose v(r) ⊥ ∇I(r) only on points for which the gradient magnitude
is sufficiently high, and to compute v(r) on the other pixels by means of diffusion or in-
terpolation processes [38]. On the one hand, this simple expedient considerably improves
the appearance of the final output (Figure 14.5); on the other hand, these gradient-based
approaches for extracting v(r) only look at a small neighborhood of each pixel while ne-
glecting the global geometric structure of the input image. A global method for vector
field extraction, inspired by fluid dynamics, is proposed in Reference [39]. Specifically, the
input image is first partitioned into N regions Rk , k = 1, ..., N by means of image segmen-
tation. Then, inside each region Rk , the motion of a fluid mass is simulated by solving
the Bernoulli equation of fluid dynamics with the constraint that the fluid velocity v(r) is
orthogonal to Rk on the boundary of Rk . This approach has the advantages of deriving
the vector field from general principles, taking into account the global structure of the in-
FIGURE 14.6
Illustration of the superiority of tensor hyperstreamlines with respect to vector streamlines: (a) topological con-
figuration of hyperstreamlines, called trisector, which cannot be reproduced by the streamlines of a continuous
vector field, (b) streamlines of a vector field, and (c) hyperstreamlines of a tensor field, also oriented along the
edges of an eye which form a trisector configuration.
put image, and allowing an easy control of interesting geometric features of the resulting
vector field, such as vorticity (the presence of vortices in v(r) results in whirls in the final
output which resemble some Van Gogh paintings). Unfortunately, the fluid dynamic model
introduces several input parameters which do not have a clear interpretation in the context
of painterly rendering and for which is not obvious how to choose their value (for example,
the mixture parameter [39]).
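A minimal sketch of the gradient-based alternative described earlier in this section: the stroke direction is set orthogonal to the gradient where the gradient magnitude is reliable, and a Gaussian smoothing of a doubled-angle representation stands in for the diffusion/interpolation step of Reference [38]; the threshold and smoothing values are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def stroke_orientation_field(image, grad_threshold=10.0, smooth_sigma=8.0):
    """Return a per-pixel stroke angle (radians) orthogonal to strong gradients."""
    img = image.astype(np.float64)
    gy, gx = np.gradient(img)
    theta = np.arctan2(gy, gx) + np.pi / 2.0     # orthogonal to the gradient
    mag = np.hypot(gx, gy)
    mask = (mag > grad_threshold).astype(np.float64)
    # doubling the angle removes the 180-degree sign ambiguity during smoothing
    c = gaussian_filter(mask * np.cos(2 * theta), smooth_sigma)
    s = gaussian_filter(mask * np.sin(2 * theta), smooth_sigma)
    return 0.5 * np.arctan2(s, c)
```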
More recently, it has been suggested to locally orient brush strokes along the so-called
hyperstreamlines of a symmetric tensor field T (r) rather than along a vector field. Hy-
perstreamlines are defined as lines locally oriented along the eigenvector associated to the
largest eigenvalue of T (r). Hyperstreamlines are not defined on points for which the two
eigenvalues of T (r) are equal. The main difference between hyperstreamlines of a tensor
field and streamlines of a vector field is that the former are unsigned. Consequently, tensor
fields offer a larger variety of topologies compared to vector fields. An example is given
in Figure 14.6. The geometric structure shown in Figure 14.6c, which is more natural for
painting an eye, can be reproduced by tensor fields but not by vector fields. On the other
hand, both geometric structures in Figures 14.6b and 14.6c can be synthesized by using
tensor fields.
The simplest approach to assigning a shape to each brush stroke is to fix a constant shape
for all brush strokes. The most common choices are rectangles [9], [10], [11], [12]. In
the system developed in Reference [38], the user can choose among a larger set of shapes,
including irregular rectangloid brush strokes, irregular blob-like shaped brush strokes, and
flower shaped brush strokes. While these approaches are simple, they do not take into ac-
count the fact that in true paintings artists draw brush strokes of different shapes depending
on the image content. A first step toward brush strokes of adaptive shape is made in Refer-
ence [17], where curved brush strokes with fixed thickness are generated. Specifically, the
shape of each brush stroke is obtained by thickening a cubic B-spline curve with a disk of a
given radius. The control points ri of the spline are obtained in two steps. First, the starting
point r0 is calculated by means of one of the methods for determining the position. Then,
each point is obtained from the previous one by moving orthogonally to the gradient by a fixed spacing h as follows:
ri+1 = ri + hδ i , (14.5)
where δ i is a unit vector oriented orthogonally to the gradient of the input image in ri . Fully
adaptive procedures to determine the brush stroke shape are presented in References [15]
and [18]. The idea is to presegment the input image into a large number of components,
and to generate brush strokes by merging segments of similar colors. In Reference [15],
morphological techniques are used to prevent brush strokes with holes.
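The control-point construction of Equation 14.5 can be sketched as follows; a grayscale input is assumed, and the step length h and the stopping rule in flat regions are illustrative choices.

```python
import numpy as np

def curved_stroke(image, r0, n_points=12, h=4.0):
    """Grow spline control points from r0 = (row, col) by stepping orthogonally
    to the image gradient, as in Equation 14.5."""
    gy, gx = np.gradient(image.astype(np.float64))
    H, W = image.shape
    pts = [np.asarray(r0, dtype=np.float64)]
    for _ in range(n_points - 1):
        r, c = np.clip(pts[-1], [0, 0], [H - 1, W - 1]).astype(int)
        g = np.array([gy[r, c], gx[r, c]])
        norm = np.hypot(*g)
        if norm < 1e-6:                              # stop in flat regions
            break
        delta = np.array([-g[1], g[0]]) / norm       # unit vector orthogonal to the gradient
        pts.append(pts[-1] + h * delta)              # r_{i+1} = r_i + h * delta_i
    return np.array(pts)
```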
In all approaches discussed so far, the algorithms deployed to extract the brush stroke
attributes from the input image are not derived from general principles. As a result, all such
algorithms can imitate a very limited number of painting styles and need the introduction
of many additional parameters whose values are not theoretically justified. To overcome
these limitations, the task of computing the brush stroke attributes can be formulated as an
optimization problem [19], [48]. Specifically, let b be a vector whose components are the
attributes of all brush strokes in the painting, and let Pb be the image obtained by rendering the brush strokes of b with one of the techniques reviewed in Section 14.2.1. Let also E(b) be an energy function which measures the dissimilarity between Pb and a painting in a given style. Then, the brush stroke attributes are computed in order to minimize E(b).
The most challenging aspects of this approach are the definition of a suitable energy func-
tion and the development of search algorithms able to minimize an energy in the extremely
high dimensional space in which b is set. Reference [48] proposes an energy function
which is a weighted sum of four terms:
E(b) ≜ Eapp + warea Earea + wnstr Enstr + wcov Ecov,  (14.6)
where the coefficients warea, wnstr, and wcov are different for each painting style. The term Eapp ≜ ∫ wapp(r) [I(r) − pb(r)]² d²r measures the dissimilarity between the input image I(r) and the resulting painting pb(r). The term Earea is proportional to the total area of all
brush strokes and reflects the total amount of paint that is used by the artist. Since brush
strokes overlap on top of each other, this term could be much higher than the total area
of the painting. Therefore, minimizing Earea corresponds to minimizing the brush stroke
overlap and reflects the principle of economy that is followed in art. The term Enstr is
proportional to the number of brush strokes in the painting. A high number of brush strokes
will result in a very detailed image, while a low number will give rise to very coarse and
sketchy representation. Therefore, the weight assigned to Enstr influences the painting style
in terms of the level of detail at which the image is rendered. The term Ecov is proportional
to the number of pixels that are not covered by any brush stroke. Assigning a high weight
to this term will prevent areas of the canvas from remaining uncovered by brush strokes. As can
be seen, the minimization of E(b) results in a trade off between competing goals, such as
the amount of details in the rendered image and the number of brush strokes. If, following
Reference [48], one assumes that painting styles differ for the importance that is given to
each subgoal, it results that different styles can be imitated by simply changing the weights
of each one of the four terms in E(b). An important advantage of energy minimization-
driven painterly rendering algorithms is that new painting styles can be included by simply
modifying the cost function.
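A sketch of how the four terms of Equation 14.6 could be evaluated for a candidate painting, given the rendered image, the list of stroke areas, and a boolean coverage mask; the uniform appearance weight and the default weights are placeholders, not the style-specific values of Reference [48].

```python
import numpy as np

def painting_energy(input_img, rendered, stroke_areas, coverage,
                    w_area=1e-4, w_nstr=1.0, w_cov=10.0):
    """Evaluate E(b) = Eapp + w_area*Earea + w_nstr*Enstr + w_cov*Ecov (a sketch).
    coverage is a boolean mask of pixels covered by at least one stroke."""
    e_app = np.sum((input_img.astype(np.float64) - rendered) ** 2)   # appearance term
    e_area = float(np.sum(stroke_areas))          # total painted area (with overlaps)
    e_nstr = float(len(stroke_areas))             # number of brush strokes
    e_cov = float(np.sum(~coverage))              # pixels not covered by any stroke
    return e_app + w_area * e_area + w_nstr * e_nstr + w_cov * e_cov
```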
Once a suitable energy function E(b) is found, there is still the problem of developing
procedures to minimize it. At the current state of the art, this problem is very far from being
solved, especially due to the extremely high dimensionality of the search space in which
b is set and the huge number of local minima in which a search algorithm can be trapped.
Reference [48] addresses this problem by means of a relaxation algorithm in combination
with heuristic search. Specifically, the algorithm is initialized by an empty painting. Then,
a trial-and-error iterative procedure is followed. At each iteration, several modifications
of the painting are attempted (such as adding or removing a brush stroke, or modifying
the attributes of a given brush stroke), and the new painting energy is computed. Changes
which result in an energy decrease are adopted and the entire process is reiterated until
convergence. Unfortunately, the algorithm is tremendously slow and, in general, converges
neither to a global optimum nor even to a local minimum. A more principled approach
to minimize E(b) can be found in Reference [19], where genetic algorithms are deployed
with some strategies to avoid undesired local minima.
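The relaxation procedure just described can be summarized by the following sketch, in which the energy function (for instance, the one sketched above) and the move generator are supplied by the caller and only energy-decreasing modifications are kept; the interfaces and the stopping rule are assumptions made for illustration.

```python
def relaxation_paint(energy_fn, propose_moves, max_iter=10000):
    """Hypothetical sketch of the trial-and-error relaxation of Reference [48].

    energy_fn(strokes)      -- returns E(b) for the current stroke list
    propose_moves(strokes)  -- yields candidate stroke lists obtained by adding,
                               removing, or perturbing a brush stroke
    """
    strokes = []                      # start from an empty painting
    best_energy = energy_fn(strokes)

    for _ in range(max_iter):
        improved = False
        for candidate in propose_moves(strokes):
            e = energy_fn(candidate)
            if e < best_energy:       # keep only changes that decrease the energy
                strokes, best_energy = candidate, e
                improved = True
        if not improved:              # no accepted change: stop (local stagnation)
            break
    return strokes
```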
FIGURE 14.7
Morphological operators applied to a one-dimensional signal: (a) input f(x), (b) output fC(x) of area opening, and (c) output fCO(x) of area close-opening.
Area closing flattens local minima up to a certain area, while area opening flattens local maxima.
The combined application of a closing and an opening reduces the amount of texture in a
signal while preserving edges. Figure 14.8 shows the result of an area open-closing applied
independently to each RGB component of a photographic image; this operator effectively
adds an artistic effect to the input image. The rationale behind it is that area open-closing
induces flat zones with irregular shapes which simulate irregular brush strokes. A more
formal introduction to these operators, as well as efficient algorithms for their computation,
can be found in Reference [49].
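As an illustration of the operator whose output is shown in Figure 14.8, the sketch below applies an area opening followed by an area closing to each RGB channel. It assumes a scikit-image release that provides area_opening and area_closing, and the area threshold is an arbitrary illustrative value.

```python
import numpy as np
from skimage.morphology import area_opening, area_closing

def area_open_close_rgb(image, area_threshold=200):
    """Area open-closing applied independently to each RGB channel.

    image          -- H x W x 3 array (e.g., uint8)
    area_threshold -- extrema smaller than this many pixels are flattened;
                      larger values produce flatter, more 'painted' zones
    """
    out = np.empty_like(image)
    for c in range(image.shape[2]):
        channel = image[..., c]
        opened = area_opening(channel, area_threshold=area_threshold)      # flattens small maxima
        out[..., c] = area_closing(opened, area_threshold=area_threshold)  # flattens small minima
    return out
```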
The following describes the Kuwahara filter and its variants. Let us consider a gray level
image I(x, y) and a square of side length 2a centered around a point (x, y), which is partitioned
FIGURE 14.9
Kuwahara filtering: (a) regions Qi on which local averages and standard deviations are computed, and (b)
the square with the smallest standard deviation, delineated by a thick line, determines the output of the filter.
© 2007 IEEE
into four identical squares Q1, ..., Q4 (Figure 14.9a). Let mi(x, y) and si(x, y) be the local
average and the local standard deviation, respectively, computed on each square Qi(x, y),
for i = 1, ..., 4. For a given point (x, y), the output Φ(x, y) of the Kuwahara filter is given
by the value of mi(x, y) that corresponds to the i-th square providing the minimum value of
si(x, y) [50]. Figure 14.9b shows the behavior of the Kuwahara operator in the proximity of
an edge. When the central point (x, y) is on the dark side of the edge (point A), the chosen
value of mi corresponds to the square that completely lies on the dark side (Q4 here), as this
is the most homogeneous area corresponding to minimum si . On the other hand, as soon
as the point (x, y) moves to the bright side (point B), the output is determined by the square
that lies completely in the bright area (Q2 here), since now it corresponds to the minimum
standard deviation si . This flipping mechanism guarantees the preservation of edges and
corners, while the local averaging smooths out texture and noise.
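A compact sketch of the basic Kuwahara filter follows. It assumes a grayscale float image and an even value of a, and the border handling (reflection for the local statistics, wrap-around for the quadrant shifts) is a simplification chosen for brevity.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def kuwahara(image, a=4):
    """Basic Kuwahara filter for a 2D grayscale image.

    'a' (assumed even here) is half the side of the (2a+1)x(2a+1) analysis
    window; each quadrant Qi is an (a+1)x(a+1) square.  The output at every
    pixel is the mean m_i of the quadrant with the smallest standard deviation s_i.
    """
    img = image.astype(np.float64)
    size = a + 1
    mean = uniform_filter(img, size)                     # local averages m_i
    sq_mean = uniform_filter(img ** 2, size)
    std = np.sqrt(np.maximum(sq_mean - mean ** 2, 0.0))  # local standard deviations s_i

    h = a // 2
    # centers of the four quadrants, relative to the current pixel
    offsets = [(h, h), (h, -h), (-h, h), (-h, -h)]
    means = np.stack([np.roll(mean, off, axis=(0, 1)) for off in offsets])
    stds = np.stack([np.roll(std, off, axis=(0, 1)) for off in offsets])

    best = np.argmin(stds, axis=0)                       # quadrant with minimum s_i
    return np.take_along_axis(means, best[None], axis=0)[0]
```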
One limitation of Kuwahara filtering is the block structure of the output, particularly evi-
dent on textured areas (Figure 14.9), that is due to the square shape of the regions Q1 to Q4
and to the Gibbs phenomenon [51]. This problem can be avoided by using different shapes
for the regions Qi and by replacing the local averages with weighted local averages. For ex-
ample, the squares Qi can be replaced by pentagons and hexagons [52]. Reference [53] uses
circular regions whereas References [54] and [55] take into account a larger set of overlap-
ping windows. Namely, Reference [55] avoids the Gibbs phenomenon by using Gaussian
weighted local averages (Gaussian Kuwahara filtering) instead of the local averages. Other
solutions are based on smoothed circular [56] and elliptical [57] sectors. Specifically, in
Reference [56] local averages and standard deviations are computed as follows:
mi(x, y) = wi(x, y) ⋆ I(x, y),   si²(x, y) = wi(x, y) ⋆ I²(x, y) − mi²(x, y),   (14.7)
where the weighting functions wi (x, y) are given by the product of a two-dimensional
isotropic Gaussian function with an angular sector.
A more serious problem is that the Kuwahara filter is not a mathematically well defined
operator; every time the minimum value of si is reached by two or more squares, the output
FIGURE 14.10
Sector selection in various situations (a homogeneous texture area, an edge, a corner, and a sharp corner). The sectors selected to determine the output are delineated by a thick line. © 2007 IEEE
where q > 0 is an input parameter which controls the sharpness of the edges. For q = 0
this reduces to linear filtering which has high stability to noise but poor edge preservation;
conversely, for q → ∞, only the term in the sum with minimum si survives, thus reducing
to the minimum standard deviation criterion of Kuwahara filtering which has high edge
preserving performance but poor noise stability. This operator is thus an intermediate case
between linear filtering and Kuwahara-like filtering, taking on the advantages of both.
Another advantage of this combination criterion is that it automatically selects the most
interesting subregions. This is illustrated in Figure 14.10 for the circular sectors deployed
in Reference [56]. On areas that contain no edges (case a), the si values are very similar
to each other; therefore, the output is close to the average of the mi values. The operator
behaves very similarly to a Gaussian filter; texture and noise are averaged out and the Gibbs
phenomenon is avoided. On the other hand, in the presence of an edge (case b), the sectors
placed across it give higher si values than the other sectors. If q is sufficiently
large (for instance, q = 4), the sectors intersected by the edge (S5 to S8) give a negligible
contribution to the output. Similarly, in the presence of corners (case c) and sharp corners
(case d), only those sectors which are placed inside the corner (S6 and S7 for case c, and S1 for
case d) give an appreciable contribution to the output, whereas the contributions of the others are negligible.
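The operator discussed above can be sketched as follows: per-sector means and standard deviations are obtained through the convolutions of Equation 14.7, and the output combines the sector means with weights proportional to si^(−q). The sector construction and the exact combination formula are assumptions inferred from the limiting behaviors described in the text (q = 0 yielding plain averaging, q → ∞ yielding the minimum-standard-deviation sector).

```python
import numpy as np
from scipy.ndimage import convolve

def sector_weights(size, n_sectors=8, sigma=None):
    """Build Gaussian-weighted angular sector masks w_i (an illustrative
    construction; the exact smoothed sectors of Reference [56] may differ)."""
    sigma = sigma or size / 4.0
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    gauss = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    theta = np.arctan2(y, x) % (2 * np.pi)
    weights = []
    for i in range(n_sectors):
        lo, hi = 2 * np.pi * i / n_sectors, 2 * np.pi * (i + 1) / n_sectors
        w = gauss * ((theta >= lo) & (theta < hi))
        weights.append(w / w.sum())
    return weights

def sector_smoother(image, size=11, n_sectors=8, q=4.0):
    """Sector means and standard deviations via Equation 14.7, combined with
    weights proportional to s_i^(-q) (assumed combination form)."""
    img = image.astype(np.float64)
    eps = 1e-6
    num = np.zeros_like(img)
    den = np.zeros_like(img)
    for w in sector_weights(size, n_sectors):
        m = convolve(img, w)                          # m_i = w_i * I
        s2 = np.maximum(convolve(img ** 2, w) - m ** 2, 0.0)
        weight = (np.sqrt(s2) + eps) ** (-q)          # s_i^(-q)
        num += m * weight
        den += weight
    return num / den
```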
In all methods described so far, the values of mi and si are computed over regions with
fixed shapes. In Reference [57], adaptive anisotropic subregions are deployed. Specifically,
an ellipse of a given size is placed on each point of the input image, where orientation
and ellipticity of each ellipse are determined by means of the structure tensor of the input
image [58]. Then, the subregions considered are sectors of these ellipses. Deploying sectors
FIGURE 14.11
Painterly rendering by edge preserving smoothing: (a) input image, (b) output of the Kuwahara filter [50], (c) output of the isotropic operator proposed in Reference [56], and (d) output of the anisotropic approach deployed in Reference [57].
of adaptable ellipses instead of circles results in better behavior in the presence of elongated
structures and in a smaller error in the presence of edges, since the filter is not restricted to a small
number of primary orientations. On the other hand, this makes the algorithm much slower,
since spatially variant linear filters cannot be evaluated in the Fourier domain.
A comparison is shown in Figure 14.11. The blocky structure and the Gibbs phe-
nomenon, clearly visible in the Kuwahara output (Figure 14.11b), are avoided by the ap-
proaches proposed in Reference [56] (Figure 14.11c) and Reference [57] (Figure 14.11d).
Moreover, the use of adaptable windows better preserves elongated structures, such as the
whiskers of the baboon.
FIGURE 14.12
Glass patterns obtained by several geometric transformations: (a) isotropic scaling, (b) expansion and compression in the horizontal and vertical directions, respectively, (c) combination of rotation and isotropic scaling, and (d) translation. © 2009 IEEE
with the initial condition r(0) = r0. For a fixed value of t, the term Φv denotes a map from R² to R², which satisfies the condition Φv(r, 0) = r. Let S = {r1, ..., rN} be a random point set and Φv(S,t) ≜ {Φv(r,t) | r ∈ S}. Using this notation, the glass pattern Gv,t(S) associated with S, v, and t is defined as Gv,t(S) ≜ S ∪ Φv(S,t). In general, the geometrical structure exhibited by a glass pattern is related to the streamlines of v(r).
Due to their randomness and geometric structure, glass patterns capture the essence of
the motifs induced by brush strokes in several impressionistic paintings, and provide cor-
responding mathematical models (Figure 14.13). The following will show that transferring
the microstructure of a glass pattern to an input image results in outputs perceptually similar
to paintings.
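For illustration, a discrete glass pattern of the kind shown in Figure 14.12 can be generated by superposing a random point set S with a transformed copy Φv(S, t); the specific transformations and parameter values below are illustrative choices.

```python
import numpy as np

def glass_pattern(n_points=4000, t=0.05, transform='rotation_scaling', seed=0):
    """Discrete glass pattern G_{v,t}(S) = S ∪ Φ_v(S, t): a random point set is
    superposed with a slightly transformed copy of itself."""
    rng = np.random.default_rng(seed)
    S = rng.uniform(-1.0, 1.0, size=(n_points, 2))

    if transform == 'isotropic_scaling':          # v(r) proportional to r
        T = S * (1.0 + t)
    elif transform == 'rotation_scaling':         # spiral-like field
        c, s = np.cos(t), np.sin(t)
        T = (1.0 + t) * S @ np.array([[c, -s], [s, c]]).T
    elif transform == 'translation':              # constant field
        T = S + np.array([t, 0.0])
    else:
        raise ValueError('unknown transform')

    return np.vstack([S, T])                      # superposition of the two point sets
```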
FIGURE 14.13
Comparison of the real and synthesized painting styles: (a) Vincent Van Gogh’s painting Road with Cypress and Star, and (b) manually generated glass pattern. The two images exhibit similar geometric structures.
Consider the binary field bS(r) associated with a point set S, which takes the value 1 for points of S and the value 0 for other points. Clearly, the binary field associated with the superposition of two point sets S1 and S2 is equal to max{bS1(r), bS2(r)}; the binary field associated with a glass pattern is thus equal to bGv,t(S)(r) = max{bS(r), bS[Φv(r,t)]}. With this notation, generalizing glass patterns to the continuous case is straightforward. Namely, a continuous set of patterns bS[Φv(r, τ)] with τ ∈ [0, 1] is considered instead of only two patterns, and any real-valued random image z(r) can be used instead of bS(r). Specifically, a continuous glass pattern Gv(r) is defined as
Gv(r) ≜ max{ z(ρ) : ρ ∈ Av(r) },
where Av(r) is the arc of the streamline r(t) = Φv(r,t) with t ∈ [0, 1], that is, Av(r) ≜ {Φv(r,t) | t ∈ [0, 1]}.
Given a vector field v(r) and a random image z(r), the glass pattern can easily be com-
puted by integrating the differential equation dr/dt = v(r) with the Euler algorithm [62]
and by taking the maximum of z(r) over an arc of the solving trajectory Φv (r,t). Examples
FIGURE 14.14
Random image z(r) and examples of continuous glass patterns obtained from it. Their geometrical structure is analogous to the discrete case. © 2009 IEEE
of continuous glass patterns are shown in Figure 14.14, with the histograms of all images
being equalized for visualization purposes. As can be seen, these patterns exhibit similar
geometric structures to the corresponding discrete patterns.
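The construction just described, Euler integration of dr/dt = v(r) followed by a running maximum of z along each trajectory, can be sketched as follows. Nearest-neighbor sampling, clipping at the image border, and the example rotational field are simplifications chosen for brevity.

```python
import numpy as np

def continuous_glass_pattern(z, v, n_steps=20, dt=0.05):
    """Continuous glass pattern: integrate dr/dt = v(r) with the Euler method
    and take a running maximum of the random image z along each trajectory.

    z -- 2D array, the random image z(r)
    v -- callable mapping coordinate arrays (y, x) to velocity arrays (vy, vx)
    """
    h, w = z.shape
    y, x = np.mgrid[0:h, 0:w].astype(np.float64)
    out = z.copy()
    for _ in range(n_steps):
        vy, vx = v(y, x)
        y = np.clip(y + dt * vy, 0, h - 1)   # one Euler step along the streamline
        x = np.clip(x + dt * vx, 0, w - 1)
        sample = z[y.round().astype(int), x.round().astype(int)]
        out = np.maximum(out, sample)        # maximum of z over the arc so far
    return out

def rotation_field(h, w, speed=40.0):
    """Example vector field rotating around the image center (whirl-like arcs)."""
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    def v(y, x):
        dy, dx = y - cy, x - cx
        norm = np.sqrt(dy ** 2 + dx ** 2) + 1e-6
        return dx / norm * speed, -dy / norm * speed
    return v
```

For instance, calling continuous_glass_pattern on a uniform random image together with rotation_field produces a whirl-like pattern qualitatively similar to the rotational example of Figure 14.14.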
A simple way to obtain painterly images from continuous glass patterns is depicted in
Figures 14.15 and 14.16. The first step is edge preserving smoothing with the output
IEPS (r), which can be obtained using the techniques described above. The second step
is the generation of synthetic painterly texture USPT (r) which simulates oriented brush
strokes (Figure 14.16c). This is simply a continuous glass pattern associated with a vector
field which forms a constant angle θ0 with the color gradient of the input image. An ex-
ample of such a texture is shown in Figure 14.16c for θ0 = π /2 and a = 18, for an image
of size 320 × 480 pixels. It can be seen that the geometric structure of USPT (r) is similar
to the elongated brush strokes that artists use in paintings. For θ0 = π /2, such strokes are
oriented orthogonally to ∇σ IEPS (r). This mimics the fact that artists usually tend to draw
FIGURE 14.15
Schematic representation of artistic image generation by means of continuous glass patterns.
brush strokes along object contours. Moreover, it is easy to prove that for θ0 = π /2 the
streamlines of V (r,t) are closed curves [47]. Thus, the brush strokes tend to form whirls
which are typical of some impressionist paintings.
Finally, the artistic effect is achieved by adding the synthetic texture to the smoothed im-
age, thus obtaining the final output y(r) ≜ IEPS(r) + λ USPT(r) (Figure 14.16d). The param-
eter λ controls the strength of the synthetic texture. Comparing Figures 14.16a and 14.16d
reveals that the natural texture of the input image is replaced by USPT(r). Such a simple texture
manipulation produces images which look like paintings.
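Putting the pieces together, the pipeline of Figure 14.15 can be sketched as below for a grayscale image, reusing continuous_glass_pattern() from the earlier sketch and any of the edge-preserving smoothers discussed above. The gradient smoothing scale, the stroke length, the value of λ, and the centering of the texture are illustrative choices, and rotating the smoothed gradient field is only one way to realize the constant-angle construction.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def painterly_render(image, smooth_fn, lam=0.3, theta0=np.pi / 2,
                     stroke_length=15.0, seed=0):
    """Edge-preserving smoothing followed by the addition of a synthetic
    painterly texture built as a continuous glass pattern whose vector field
    forms the angle theta0 with the smoothed image gradient."""
    I_eps = smooth_fn(image)                              # edge-preserving smoothing

    # gradient of the smoothed image, rotated by theta0
    # (theta0 = pi/2 gives a field parallel to object contours)
    gy, gx = np.gradient(gaussian_filter(I_eps, 2.0))
    c, s = np.cos(theta0), np.sin(theta0)
    vy, vx = c * gy - s * gx, s * gy + c * gx
    norm = np.sqrt(vy ** 2 + vx ** 2) + 1e-6
    vy, vx = vy / norm * stroke_length, vx / norm * stroke_length

    def field(y, x):
        iy, ix = y.round().astype(int), x.round().astype(int)
        return vy[iy, ix], vx[iy, ix]

    z = np.random.default_rng(seed).random(image.shape)
    U_spt = continuous_glass_pattern(z, field)            # synthetic painterly texture

    # y(r) = I_EPS(r) + lambda * U_SPT(r); the texture is centered here simply to
    # avoid a global brightness shift (an illustrative choice)
    return I_eps + lam * (U_spt - U_spt.mean())
```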
An example of how well this approach performs is shown in Figure 14.17, in compari-
son with two of the most popular brush stroke-based artistic operators described previously:
namely, the impressionist rendering algorithm [9] and the so-called artistic vision [15] tech-
nique. It can be seen that the glass pattern operator effectively mimics curved brush strokes
oriented along object contours while the whirls present in contourless areas resemble some
impressionist paintings. As to artistic vision, though simulation of curved brush strokes is
attempted, several artifacts are clearly visible. Impressionist rendering does not produce
artifacts, but it tends to render blurry contours, and small object details are lost. Moreover,
impressionist rendering is less effective in reproducing impressionist whirls.
Given an input image I(r), a cross continuous glass pattern is defined as Cv{I, z}(r) ≜ I(ρ′(r)), with ρ′(r) ≜ arg maxρ∈Av(r) z(ρ). In other words, instead of directly considering the
maximum of z(r) over Av(r), the point ρ′(r) which maximizes z(ρ) is first identified, and
the value of I(r) at that point is taken. It is easy to see that if the input image I(r) coincides
with z(r), then the output coincides with the continuous glass pattern Cv{z(r), z(r)} = Gv(r)
defined above. An efficient implementation of cross continuous glass patterns can be found
in References [47] and [63].
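The cross pattern can be obtained with a small change to the earlier sketch: instead of keeping the running maximum of z, the position of that maximum is tracked and the input image is sampled there. The implementation details below (Euler stepping, nearest-neighbor sampling) are again simplifications.

```python
import numpy as np

def cross_glass_pattern(I, z, v, n_steps=20, dt=0.05):
    """Cross continuous glass pattern C_v{I, z}: for every pixel, the point
    rho'(r) maximizing z along the arc A_v(r) is found, and the value of the
    input image I at that point is returned."""
    h, w = z.shape
    y, x = np.mgrid[0:h, 0:w].astype(np.float64)
    best_z = z.copy()
    best_y, best_x = y.copy(), x.copy()           # rho'(r), initialized at r itself
    for _ in range(n_steps):
        vy, vx = v(y, x)
        y = np.clip(y + dt * vy, 0, h - 1)
        x = np.clip(x + dt * vx, 0, w - 1)
        iy, ix = y.round().astype(int), x.round().astype(int)
        sample = z[iy, ix]
        better = sample > best_z                  # arc points where z is larger
        best_z = np.where(better, sample, best_z)
        best_y = np.where(better, y, best_y)
        best_x = np.where(better, x, best_x)
    return I[best_y.round().astype(int), best_x.round().astype(int)]
```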
Examples of cross continuous glass patterns are given in Figure 14.19, which are respectively related to the vector fields v(r) = [x, y]/√(x² + y²), v(r) = [−y, x]/√(x² + y²), and v(r) ⊥ ∇I(r). Figures 14.19b and 14.19c show similar microstructure whereas Fig-
FIGURE 14.18
Artifact that arises with the approach proposed in Reference [47] when the streamlines of v(r) do not match the object contours of the input image (marked by arrows). This happens especially in the presence of sharp corners. © 2009 IEEE
ure 14.19d is perceptually similar to a painting. Other examples are shown in Figure 14.20.
In Figure 14.20b the vector field v(r) is orthogonal to the color gradient of the input image,
while in Figure 14.20c it forms an angle of 45◦ with ∇I. As can be seen, though the stream-
lines of v(r) strongly mismatch the object contours, no artifacts similar to those shown in
Figure 14.18 are present (see also Figures 14.19b and 14.19c). It can also be observed
that this approach, as well as all techniques based on vector fields, is very versatile since
substantially different artistic images can be achieved by varying a few input parameters.
Specifically, when the angle between v(r) and the gradient direction is equal to θ = π /2,
the strokes follow the object contours and form whirls in flat areas, while for θ = π /4, the
strokes are orthogonal to the contours and build star-like formations in flat regions.
14.4 Conclusions
The classical approach to generate synthetic paintings consists in rendering an ordered
list of brush strokes on a white or canvas textured background. Alternatively, the possi-
FIGURE 14.20
Cross continuous glass patterns generated using different values of θ: (a) input image, (b) output for θ = π/4, and (c) output for θ = π/2. © 2009 IEEE
FIGURE 14.21
Painterly rendering using simpler brush stroke models: (a) input image and (b) output of the algorithm presented in Reference [15], which does not perform physical brush stroke simulation. Small brush strokes are rendered properly, such as on the trees. On the other hand, visible undesired artifacts are present on large brush strokes, such as on the sky.
rendered for each image, simpler brush stroke models based on predefined intensity and
texture profiles are usually deployed [9], [38]. Such simpler models still give acceptable
results for small brush strokes but might produce visible artifacts for larger brush strokes
(Figure 14.21). Extracting brush stroke attributes from an input image is a much harder
task, because it involves some knowledge about how artists see the world. The basic idea
behind this approach is to look for regions where the color profile of the input image is con-
stant or slowly varying, and then to extract position, shape, and orientation of each brush
stroke. There exists a large variety of complex algorithms, with many predetermined param-
eters and without a general guiding principle. To overcome this problem, the extraction of
brush stroke attributes can be formulated as an optimization problem. However, this leads
to cost functions defined on an extremely high dimensional space, thus making such an
optimization extremely difficult to carry out in practice. Moreover, it is not obvious that every
artistic effect can be achieved by the mere modification of a cost function.
Due to these intrinsic difficulties, some authors propose to focus on the visual properties
which distinguish a painting from a photographic image while abstracting from the process
deployed by an artist to generate them. Examples of such properties are sharp edges, ab-
sence of natural texture, or the presence of motifs induced by brush strokes in several impres-
sionistic paintings (see, for instance, Figure 14.13a). An effective and efficient approach
to painterly rendering is smoothing the input image while preserving or sharpening edges.
It should be noted, however, that not all existing smoothing operators which preserve and
enhance edges produce satisfactory artistic images. Examples are bilateral filtering, median
filtering, and structural opening and closing [67]. Area open-closing can produce artistic
effects, but only on images that are rich in texture and have sharp edges. In the presence of
blurry edges and relatively flat areas, area opening does not modify the input image sub-
stantially. In contrast, the approach proposed by Kuwahara and the subsequent extensions
are much more effective for every input image. Another interesting way to produce artistic
images is based on glass patterns. The theory of glass patterns naturally combines three
Acknowledgment
Figures 14.9 and 14.10 are reprinted from Reference [67] and Figures 14.12, 14.14,
14.16, 14.17, 14.18, and 14.20 are reprinted from Reference [47], with the permission
of IEEE.
References
[1] R. Arnheim, Art and Visual Perception: A Psychology of the Creative Eye. Berkeley, CA:
University of California Press, September 1974.
[2] R. Arnheim, Toward a Psychology of Art. Berkeley, CA: University of California Press, May
1972.
[3] R. Arnheim, Visual Thinking. Berkeley, CA: University of California Press, 1969.
[4] E. Loran, Cézanne’s Composition: Analysis of His Form with Diagrams And Photographs of
His Motifs. Berkeley, CA: University of California Press, 1963.
[5] S. Zeki, Inner Vision. New York: Oxford University Press, February 2000.
[6] S. Zeki, “Trying to make sense of art,” Nature, vol. 418, pp. 918–919, August 2002.
[7] V. Ramachandran and W. Hirstein, “The science of art,” Journal of Consciousness Studies,
vol. 6, no. 6-7, pp. 15–51(37), 1999.
[8] B. Pinna, Art and Perception Towards a Visual Science of Art. Koninklijke Brill NV, Leiden,
The Netherlands, November 2008.
[9] P. Litwinowicz, “Processing images and video for an impressionist effect,” in Proceedings of
the 24th Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles,
CA, USA, August 1997, pp. 407–414.
[10] M. Shiraishi and Y. Yamaguchi, “An algorithm for automatic painterly rendering based on
local source image approximation,” in Proceedings of the 1st International Symposium on
Non-Photorealistic Animation and Rendering, Annecy, France, June 2000, pp. 53–58.
[11] N. Li and Z. Huang, “Feature-guided painterly image rendering,” in Proceedings of the IEEE
International Conference on Image Processing, Rochester, New York, USA, September 2002,
pp. 653–656.
[12] L. Kovács and T. Szirányi, “Painterly rendering controlled by multiscale image features,” in
Proceedings of Spring Conference on Computer Graphics, Budmerice, Slovakia, April 2004,
pp. 177–184.
[13] P. Haeberli, “Paint by numbers: Abstract image representations,” ACM SIGGRAPH Computer
Graphics, vol. 24, no. 4, pp. 207–214, August 1990.
[14] D. De Carlo and A. Santella, “Stylization and abstraction of photographs,” ACM Transactions
on Graphics, vol. 21, no. 3, pp. 769–776, July 2002.
[15] B. Gooch, G. Coombe, and P. Shirley, “Artistic vision: Painterly rendering using computer
vision techniques,” in Proceedings of the 2nd International Symposium on Non-Photorealistic
Animation and Rendering, Annecy, France, June 2002, pp. 83–90.
[16] M. Schwarz, T. Isenberg, K. Mason, and S. Carpendale, “Modeling with rendering primitives:
An interactive non-photorealistic canvas,” in Proceedings of the 5th International Symposium
on Non-Photorealistic Animation and Rendering, San Diego, CA, USA, August 2007, pp. 15–
22.
[17] A. Hertzmann, “Painterly rendering with curved brush strokes of multiple sizes,” in Proceed-
ings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, Or-
lando, FL, USA, July 1998, pp. 453–460.
[18] A. Kasao and K. Miyata, “Algorithmic painter: A NPR method to generate various styles of
painting,” The Visual Computer, vol. 22, no. 1, pp. 14–27, January 2006.
[19] J. Collomosse and P. Hall, “Salience-adaptive painterly rendering using genetic search,” Inter-
national Journal on Artificial Intelligence Tools, vol. 15, no. 4, pp. 551–575, August 2006.
[20] A. Orzan, A. Bousseau, P. Barla, and J. Thollot, “Structure-preserving manipulation of pho-
tographs,” in Proceedings of the 5th International Symposium on Non-Photorealistic Anima-
tion and Rendering, San Diego, CA, USA, August 2007, pp. 103–110.
[21] C. Curtis, S. Anderson, J. Seims, K. Fleischer, and D. Salesin, “Computer-generated water-
color,” in Proceedings of the 24th Annual Conference on Computer Graphics and Interactive
Techniques, Los Angeles, CA, USA, August 1997, pp. 421–430.
[22] E. Lum and K. Ma, “Non-photorealistic rendering using watercolor inspired textures and il-
lumination,” in Proceedings of Pacific Conference on Computer Graphics and Applications,
Tokyo, Japan, October 2001, pp. 322–331.
[23] T. Van Laerhoven, J. Liesenborgs, and F. Van Reeth, “Real-time watercolor painting on a
distributed paper model,” in Proceedings of Computer Graphics International, Crete, Greece,
June 2004, pp. 640–643.
[24] E. Lei and C. Chang, “Real-time rendering of watercolor effects for virtual environments,” in
Proceedings of IEEE Pacific-Rim Conference on Multimedia, Tokyo, Japan, December 2004,
pp. 474–481.
[25] H. Johan, R. Hashimota, and T. Nishita, “Creating watercolor style images taking into account
painting techniques,” Journal of the Society for Art and Science, vol. 3, no. 4, pp. 207–215,
2005.
[26] A. Bousseau, M. Kaplan, J. Thollot, and F. Sillion, “Interactive watercolor rendering with
temporal coherence and abstraction,” in Proceedings of the 4th International Symposium on
Non-Photorealistic Animation and Rendering, Annecy, France, June 2006, pp. 141–149.
[27] D. Small, “Simulating watercolor by modeling diffusion, pigment, and paper fibers,” in Pro-
ceedings of SPIE, vol. 1460, p. 140, February 1991.
[28] E. Lum and K. Ma, “Non-photorealistic rendering using watercolor inspired textures and il-
lumination,” in Proceedings of Pacific Conference on Computer Graphics and Applications,
Tokyo, Japan, October 2001, pp. 322–331.
[29] E. Lei and C. Chang, “Real-time rendering of watercolor effects for virtual environments,” in
Proceedings of IEEE Pacific-Rim Conference on Multimedia, Tokyo, Japan, December 2004,
pp. 474–481.
[30] T. Van Laerhoven, J. Liesenborgs, and F. Van Reeth, “Real-time watercolor painting on a
distributed paper model,” in Proceedings of Computer Graphics International, Crete, Greece,
June 2004, pp. 640–643.
[31] H. Johan, R. Hashimota, and T. Nishita, “Creating watercolor style images taking into account
painting techniques,” Journal of the Society for Art and Science, vol. 3, no. 4, pp. 207–215,
2005.
[32] A. Bousseau, M. Kaplan, J. Thollot, and F. Sillion, “Interactive watercolor rendering with
temporal coherence and abstraction,” in Proceedings of the 4th International Symposium on
Non-Photorealistic Animation and Rendering, Annecy, France, June 2006, pp. 141–149.
[33] T. Luft and O. Deussen, “Real-time watercolor illustrations of plants using a blurred depth
test,” in Proceedings of the 4th International Symposium on Non-Photorealistic Animation
and Rendering, Annecy, France, June 2006, pp. 11–20.
[34] A. Bousseau, F. Neyret, J. Thollot, and D. Salesin, “Video watercolorization using bidirec-
tional texture advection,” Transactions on Graphics, vol. 26, no. 3, July 2007.
[35] A. Santella and D. DeCarlo, “Abstracted painterly renderings using eye-tracking data,” in
Proceedings of the Second International Symposium on Non-photorealistic Animation and
Rendering, Annecy, France, June 2002, pp. 75–82.
[36] S. Nunes, D. Almeida, V. Brito, J. Carvalho, J. Rodrigues, and J. du Buf, “Perception-based
painterly rendering: functionality and interface design,” Ibero-American Symposium on Com-
puter Graphics, Santiago de Compostela, Spain, July 2006, pp. 53–60.
[37] A. Hertzmann, “A survey of stroke-based rendering,” IEEE Computer Graphics and Applica-
tions, vol. 23, no. 4, pp. 70–81, July/August 2003.
[38] J. Hays and I. Essa, “Image and video based painterly animation,” Proceedings of the 3rd
International Symposium on Non-Photorealistic Animation and Rendering, Annecy, France,
June 2004, pp. 113–120.
[39] S. Olsen, B. Maxwell, and B. Gooch, “Interactive vector fields for painterly rendering,” Pro-
ceedings of Canadian Annual Conference on Graphics Interface, Victoria, British Columbia,
May 2005, pp. 241–247.
[40] E. Stavrakis and M. Gelautz, “Stereo painting: Pleasing the third eye,” Journal of 3D Imaging,
vol. 168, pp. 20–23, Spring 2005.
[41] S. Stavrakis and M. Gelautz, “Computer generated stereoscopic artwork,” in Proceedings
of Workshop on Computational Aesthetics in Graphics, Visualization and Imaging, Girona,
Spain, May 2005, pp. 143–149.
[42] S. Strassmann, “Hairy brushes,” ACM SIGGRAPH Computer Graphics, vol. 20, no. 4,
pp. 225–232, August 1986.
[43] J. Hopcroft, R. Motwani, and J. Ullman, Introduction to Automata Theory, Languages, and
Computation. Addison-Wesley, July 2006.
[44] Q. Zhang, Y. Sato, T. Jy, and N. Chiba, “Simple cellular automaton-based simulation of ink
behaviour and its application to suibokuga-like 3D rendering of trees,” The Journal of Visual-
ization and Computer Animation, vol. 10, no. 1, pp. 27–37, April 1999.
[45] C. Haase and G. Meyer, “Modeling pigmented materials for realistic image synthesis,” ACM
Transactions on Graphics, vol. 11, no. 4, pp. 305–335, October 1992.
[46] G. Kortüm, Reflectance Spectroscopy: Principles, Methods, Applications. New York:
Springer-Verlag, January 1969.
[47] G. Papari and N. Petkov, “Continuous glass patterns for painterly rendering,” IEEE Transac-
tions on Image Processing, vol. 18, no. 3, pp. 652–664, March 2009.
[48] A. Hertzmann, “Paint by relaxation,” in Proceedings of Computer Graphics International,
Hong Kong, July 2001, pp. 47–54.
[49] M. Wilkinson, H. Gao, W. Hesselink, J. Jonker, and A. Meijster, “Concurrent computation of
attribute filters on shared memory parallel machines,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 30, no. 10, pp. 1800–1813, October 2008.
[50] M. Kuwahara, K. Hachimura, S. Eiho, and M. Kinoshita, “Processing of ri-angiocardiographic
images,” Digital Processing of Biomedical Images, pp. 187–202, 1976.
[51] A. Oppenheim, R. Schafer, and J. Buck, Discrete-Time Signal Processing. Englewood Cliffs,
NJ: Prentice Hall, 1989.
[52] M. Nagao and T. Matsuyama, “Edge preserving smoothing,” Computer Graphics and Image
Processing, vol. 9, no. 4, pp. 394–407, 1979.
[53] P. Bakker, L. Van Vliet, and P. Verbeek, “Edge preserving orientation adaptive filtering,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Ft. Collins,
CO, USA, June 1999, pp. 535–540.
[54] M.A. Schulze and J.A. Pearce, “A morphology-based filter structure for edge-enhancing
smoothing,” in Proceedings of the IEEE International Conference on Image Processing,
Austin, Texas, USA, November 1994, pp. 530–534.
[55] R. van den Boomgaard, “Decomposition of the Kuwahara-Nagao operator in terms of a linear
smoothing and a morphological sharpening,” in Proceedings of International Symposium on
Mathematical Morphology, Sydney, NSW, Australia, April 2002, p. 283.
[56] G. Papari, N. Petkov, and P. Campisi, “Edge and corner preserving smoothing for artistic
imaging,” Proceedings of SPIE, vol. 6497, pp. 649701–64970J, 2007.
[57] J. Kyprianidis, H. Kang, and J. Döllner, “Image and video abstraction by anisotropic Kuwa-
hara filtering,” Computer Graphics Forum, vol. 28, no. 7, 2009.
[58] T. Brox, J. Weickert, B. Burgeth, and P. Mrázek, “Nonlinear structure tensors,” Image and
Vision Computing, vol. 24, no. 1, pp. 41–55, January 2006.
[59] L. Glass, “Moiré effect from random dots,” Nature, vol. 223, pp. 578–580, 1969.
[60] L. Glass and R. Perez, “Perception of random dot interference patterns,” Nature, vol. 246,
pp. 360–362, 1973.
[61] L. Glass and E. Switkes, “Pattern recognition in humans: Correlations which cannot be per-
ceived,” Perception, vol. 5, no. 1, pp. 67–72, 1976.
[62] G. Hall and J. Watt, Modern Numerical Methods for Ordinary Differential Equations. Oxford,
UK: Clarendon Press, October 1976.
[63] G. Papari and N. Petkov, “Spatially variant dilation for unsupervised painterly rendering,” in
Abstract Book of the 9th International Symposium on Mathematical Morphology, Groningen,
Netherlands, August 2009, pp. 56–58.
[64] G. Papari and N. Petkov, “Reduced inverse distance weighting interpolation for painterly ren-
dering,” in Proceedings of International Conference on Computer Analysis of Images and
Patterns, Münster, Germany, September 2009, pp. 509–516.
[65] C. Aguilar and H. Lipson, “A robotic system for interpreting images into painted artwork,” in
Proceedings of Generative Art Conference, Milano, Italy, December 2008, pp. 372–387.
[66] W. Baxter, J. Wendt, and M. Lin, “Impasto: A realistic, interactive model for paint,” in Pro-
ceedings of the 3rd International Symposium on Non-Photorealistic Animation and Rendering,
Annecy, France, June 2004, pp. 45–148.
[67] G. Papari, N. Petkov, and P. Campisi, “Artistic edge and corner enhancing smoothing,” IEEE
Transactions on Image Processing, vol. 16, no. 10, pp. 2449–2462, October 2007.
15
Machine Learning Methods for Automatic Image
Colorization
15.1 Introduction
Automatic image colorization is the task of adding colors to a grayscale image without
any user intervention. This problem is ill-posed in the sense that there is not a unique
colorization of a grayscale image without any prior knowledge. Indeed, many objects can
have different colors. This is not only true for artificial objects, such as plastic objects
which can have random colors, but also for natural objects such as tree leaves which can
have various nuances of green and brown in different seasons, without significant change
of shape.
The most common color prior in the literature is the user. Most image colorization meth-
ods allow the user to determine the color of some areas and extend this information to the
whole image, either by presegmenting the image into (preferably) homogeneous color re-
gions or by spreading color flows from the user-defined color points. The latter approach
involves defining a color flow function on neighboring pixels and typically estimates this
as a simple function of local grayscale intensity variations [1], [2], [3], or as a predefined
threshold such that color edges are detected [4]. However, this simple and efficient frame-
work cannot deal with the texture examples of Figure 15.1, whereas simple oriented texture
features such as Gabor filters can easily overcome these limitations. Hence, an image
colorization method should incorporate texture descriptors for satisfactory results. More
generally, the manually set criteria for the edge estimation are problematic, since they can
be limited to certain scenarios. The goal of this chapter is to learn the variables of image
colorization modeling in order to overcome the limitations of manual assignments.
User-based approaches have the advantage that the user has an interactive role, for exam-
ple, by adding more color points until a satisfactory result is obtained or by placing color
points strategically in order to give indirect information on the location of color bound-
aries. The methods proposed in this chapter can easily be adapted to incorporate such
user-provided color information. Predicting the colors, that is, providing an initial fully au-
tomatic colorization of the image prior to any possible user intervention, is a much harder
but arguably more useful task. Recent literature investigating this task [5], [6], [7], [8]
yields mixed conclusions. An important limitation of these methods is their use of local
predictors. Color prediction involves many ambiguities that can only be resolved at the
global level. In general, local predictions based on texture are most often very noisy and
not reliable. Hence, the information needs to be integrated over large regions in order
to provide a significant signal. Extensions of local predictors to include global informa-
tion have been limited to using automatic tools (such as automatic texture segmentation [7])
which can introduce errors due to the cascaded nature of the process or incorporating small
Section 15.8 focuses on an experimental analysis of the proposed machine learning meth-
ods on datasets of various sizes. All proposed approaches perform well with a large num-
ber of colors and outperform existing methods. It will be shown that the Parzen window
approach provides natural colorization, especially when trained on small datasets, and per-
forms reasonably well on big datasets. On large training data, SVMs and structured SVMs
leverage the information more efficiently and yield more natural colorization, with more
color details, at the expense of longer training times. Although experiments presented in
this chapter focus on colorization of still images, the proposed framework can be readily
extended to movies. It is believed that the framework has the potential to enrich existing
movie colorization methods that are suboptimal in the sense that they heavily rely on user
input. Further discussion on the future work and conclusions are offered in Section 15.9.
of the grayscale images. Let I denote a grayscale image to be colored, p the location of one
particular pixel, and C a colorization of image I. Hence, I and C are images of the same
size, and the color of the pixel p, denoted by C(p), is in the standard RGB color space.
Since the grayscale information is already given by I(p), the term C(p) is restricted such
that computing the grayscale intensity of C(p) yields I(p). Thus, the dimension of the color
space to be explored is intrinsically two rather than three.
This section presents the model chosen for the color space, the limitations of a regres-
sion approach for color prediction, and the proposed color space discretization method. It
also discusses how to express probability distributions of continuous valued colors given a
discretization and describes the feature space used for the description of grayscale patches.
For example, if the task of recognizing a balloon were easy and it is known that the observed balloon
colors should be used to predict the color of a new balloon, a regression approach would
recommend using the average color of the observed balloons. This problem is not specific
to objects of the same class, but also extends to objects with similar local descriptors. For
example, the local descriptions of grayscale patches of skin and sky are very similar. Hence,
a method trained on images including both objects would recommend purple for skin and
sky, without considering the fact that this average value is never probable. Therefore, an
image colorization method requires multimodality, that is, the ability to predict different
colors if needed, or more precisely, the ability to predict scores or probability values of
every possible color at each pixel.
pixel. This leads to a vector of 192 features per pixel. Using principal component analysis
(PCA), only the first 27 eigenvectors are kept, in order to reduce the number of features
and to condense the relevant information. Furthermore, as supplementary components, the
pixel gray level as well as two biologically inspired features are included. Namely, these
features are a weighted standard deviation of the intensity in a 5 × 5 neighborhood (whose
meaning is close to the norm of the gradient), and a smooth version of its Laplacian. This
30-dimensional vector, computed at each pixel q, is referred to as local description. It is
denoted by v(q) or v, when the text uniquely identifies q.
pixel p given the local description v of its grayscale neighborhood can be expressed as the
fraction, amongst colored examples e j = (w j , c( j)) whose local description w j is similar
to v, of those whose observed color c( j) is in the same color bin Bi . This can be estimated
with a Gaussian Parzen window model
p(ci | v) = ( ∑{ j : c(j) ∈ Bi } k(wj, v) ) / ( ∑j k(wj, v) ),   (15.1)
where k(wj, v) = exp(−‖wj − v‖² / (2σ²)) is the Gaussian kernel. The best value for the standard
deviation σ can be estimated by cross-validation on the densities. Parzen windows also
allow one to express how reliable the probability estimation is; its confidence depends
directly on the density of examples around v, since an estimation far from the clouds of
observed points loses significance. Thus, the confidence on a probability estimate is given
by the density in the feature space as follows:
p(v) ∝ ∑j k(wj, v).
Note that both distributions, p(ci |v) and p(v), require computing the similarities k(v, w j )
of all pixel pairs, which can be expensive during both training and prediction. For computa-
tional efficiency, these can be approximated by restricting the sums to K-nearest neighbors
of v in the training set, with a sufficiently large K chosen as a function of σ, and the
Parzen densities can be estimated based on these K points. In practice, K = 500 is chosen.
Using fast nearest neighbor search techniques, such as kD-tree in the TSTOOL package
available at http://www.physik3.gwdg.de/tstool/ without particular optimization, the time
needed to compute the predictions for all pixels of a 50 × 50 image is only 10 seconds (for
a training set of hundreds of thousands of patches) and this scales linearly with the number
of test pixels.
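For illustration, the K-nearest-neighbor approximation of the Parzen estimator of Equation 15.1 can be written as follows; scikit-learn is used here for the neighbor search as a stand-in for the TSTOOL kD-tree mentioned above, and the array layout is an assumption.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def parzen_color_scores(v, W, bins, n_bins, sigma, K=500):
    """K-nearest-neighbor approximation of the Parzen estimator of Equation 15.1.

    v      -- local description of the test pixel, shape (d,)
    W      -- training descriptions w_j, shape (n, d), with n >= K
    bins   -- color-bin index c(j) of each training example, shape (n,), int
    Returns (estimates of p(c_i | v) for all bins, unnormalized confidence p(v)).
    """
    nn = NearestNeighbors(n_neighbors=K).fit(W)
    dist, idx = nn.kneighbors(v[None], return_distance=True)
    k = np.exp(-dist[0] ** 2 / (2 * sigma ** 2))      # Gaussian kernel values

    density = k.sum()                                  # p(v) up to a constant
    p = np.zeros(n_bins)
    np.add.at(p, bins[idx[0]], k)                      # sum kernel mass per color bin
    return p / max(density, 1e-12), density
```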
dow method. Section 15.6 outlines how to use these estimators in a graph-cut algorithm in
order to get spatially coherent color predictions. Before describing the details of this tech-
nique, further improvements over the Parzen window approach are proposed, by employing
support vector machines (SVMs) [13] to learn the local color prediction function.
Equation 15.1 describes the Parzen window estimator for the conditional probability of
the colors given a local grayscale description v. A more general expression for the color
prediction function is given by
s(ci | v; αi) = ∑j αi(j) k(wj, v),   (15.2)
where the kernel k satisfies k(v, v′) = ⟨f(v), f(v′)⟩ for all v and v′ in a certain space of
features f(v), embedded with an inner product ⟨·, ·⟩ between feature vectors (more details
can be found in Reference [11]). In Equations 15.1 and 15.2, the expansions for each
color ci are linear in the feature space. The decision boundary between different colors,
which tells which color is the most probable, is consequently a hyperplane. The αi can
be considered as a dual representation of the normal vector λi of the hyperplane separating
the color ci from other colors. The estimator in this primal space can then be represented
as
s(ci | v; λi) = ⟨λi, f(v)⟩.   (15.3)
In the Parzen window estimator, all α values are nonzero constants. In order to overcome
computational problems, Section 15.4 proposes restricting the α parameters of pixels pj
that are not in the neighborhood of v to zero. A more sophisticated classification ap-
proach is obtained using SVMs which differ from Parzen window estimators in terms of
patterns whose α values are active (i.e., nonzero) and in terms of finding the optimal val-
ues for these parameters. In particular, SVMs remove the influence of correctly classified
training points that are far from the decision boundary, since they generally do not improve
the performance of the estimator and removing such instances (setting their corresponding
α values to 0) reduces the computational cost during prediction. Hence, the goal in SVMs
is to identify the instances that are close to the boundaries, commonly referred to as support
vectors, for each class ci and find the optimal αi . More precisely, the goal is to discriminate
the observed color c( j) for each colored pixel e j = (w j , c( j)) from the other colors as much
as possible while keeping a sparse representation in the dual space. This can be achieved
by imposing the margin constraints
s(c(j) | wj; λc(j)) − s(c | wj; λc) ≥ 1,   ∀ j, ∀ c ≠ c(j),   (15.4)
where the decision function is given in Equation 15.3. If these constraints are satisfiable,
one can find multiple solutions by simply scaling the parameters. In order to overcome this
problem, it is common to search for parameters that satisfy the constraints with minimal
complexity. This can be accomplished by minimizing the norm of the solution λ. In
cases where the constraints cannot be satisfied, one can allow violations of the constraints
by adding slack variables ξ j for each colored pixel e j and penalize the violations in the
optimization, where K denotes the trade-off between the loss term and the regularization
term [14]:
(1/2) ∑i ‖λi‖² + K ∑j ξj ,   subject to   (15.5)
s(c(j) | wj; λc(j)) − s(c | wj; λc) ≥ 1 − ξj ,   ∀ j, ∀ c ≠ c(j),
ξj ≥ 0,   ∀ j.
If the constraint is satisfied for a pixel e j and a color ci , SVM yields 0 for αi ( j). The pixel-
color pairs with nonzero αi ( j) are the pixels that are difficult (and hence critical) for the
color prediction task. These pairs are the support vectors and these are the only training
data points that appear in Equation 15.2.
The constraint optimization problem of Equation 15.5 can be rewritten as a quadratic
program (QP) in terms of the dual parameters αi for all colors c(i). Minimizing this
function yields sparse αi , which can be used in the local color predictor function (Equa-
tion 15.2). While training SVMs is more expensive than training Parzen window estima-
tors, SVMs often yield better prediction performance. More details on SVMs can be found
in Reference [11]. Note that in the experiments, an SVM library publicly available at
http://www.csie.ntu.edu.tw/∼cjlin/libsvm/ was used. A Gaussian kernel was used in both
Parzen windows and SVMs.
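A local SVM color predictor in this spirit can be sketched with an off-the-shelf library; scikit-learn's SVC (a one-vs-one multi-class scheme) is used here as a stand-in for the joint multi-class formulation of Equation 15.5 solved with LIBSVM in the chapter, and the hyper-parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def train_local_color_svm(W, bins, C=1.0, sigma=1.0):
    """Multi-class SVM with a Gaussian (RBF) kernel over local descriptions,
    trained to separate the discretized color bins.

    W    -- training descriptions w_j, shape (n, d)
    bins -- color-bin index c(j) of each training example, shape (n,), int
    """
    gamma = 1.0 / (2.0 * sigma ** 2)   # RBF gamma matching the Gaussian kernel above
    clf = SVC(C=C, kernel='rbf', gamma=gamma, decision_function_shape='ovr')
    clf.fit(W, bins)                   # only support vectors keep nonzero dual weights
    return clf

def local_color_scores(clf, v):
    """Decision scores s(c_i | v) for every color bin at one pixel description v."""
    return clf.decision_function(v[None])[0]
```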
guarantees to find a good local optimum. In the multilabel case with α-expansion [17], it
can be applied to all energies of the form ∑i Vi(xi) + ∑i∼j Di,j(xi, xj), where xi are the un-
known variables that take values in a finite set L of labels, Vi are any functions, and Di,j
are any pairwise interaction terms with the restriction that each Di,j(·, ·) should be a metric
on L. For the swap-move case, the constraints are weaker [18]:
where Vp(c(p)) is the cost of choosing color c(p) locally for pixel p (whose neighboring
texture is described by v(p)) and where gp,q = 2/(1/g(v(p)) + 1/g(v(q))) is the harmonic
mean of the estimated color variation at pixels p and q. An eight-neighborhood
is considered for the interaction term, and p ∼ q denotes that p and q are neighbors.
The interaction term between pixels penalizes color variation where it is not expected,
according to the variations predicted in the previous paragraph. The hyper-parameter ρ en-
ables a trade-off between local color scores and spatial coherence score. It can be estimated
using cross validation.
Two methods that yield scores for local color prediction were described earlier in this
chapter. These can be used to define Vp (c(p)). When using the Parzen window estimator,
the local color cost Vp (c(p)) can be defined as follows:
Vp(c(p)) = − log ( p(v(p)) p(c(p) | v(p)) ).   (15.8)
Then, Vp penalizes colors which are not probable at the local level according to the proba-
bility distributions obtained in Section 15.4.1, with respect to the confidence in the predic-
tions.
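The local cost of Equation 15.8 and the overall labeling energy can be sketched as follows. Since the exact form of the interaction term of Equation 15.7 is not reproduced in the visible text, the pairwise weights are passed in by the caller, and a four-connected neighborhood is used instead of the eight-connected one of the chapter; minimization itself would be delegated to an α-expansion or swap-move implementation such as the package cited below.

```python
import numpy as np

def local_color_cost(p_v, p_c_given_v):
    """Local cost of Equation 15.8: V_p(c) = -log( p(v(p)) * p(c | v(p)) ).

    p_v          -- confidence p(v(p)) per pixel, shape (H, W)
    p_c_given_v  -- color probabilities per pixel, shape (H, W, n_colors)
    """
    eps = 1e-12
    return -np.log(np.maximum(p_v[..., None] * p_c_given_v, eps))

def colorization_energy(labels, V, pair_weight, rho):
    """Energy of the form minimized by graph cuts: local costs plus a pairwise
    term rho * sum_{p~q} pair_weight(p, q) * [c(p) != c(q)].

    labels      -- color-bin index per pixel, shape (H, W), int
    V           -- local costs, shape (H, W, n_colors)
    pair_weight -- (H, W, 2) weights toward the right and bottom neighbors
    """
    H, W = labels.shape
    e = V[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()
    # right neighbors
    e += rho * np.sum(pair_weight[:, :W - 1, 0] * (labels[:, :-1] != labels[:, 1:]))
    # bottom neighbors
    e += rho * np.sum(pair_weight[:H - 1, :, 1] * (labels[:-1, :] != labels[1:, :]))
    return e
```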
When using SVMs, there exist two options to define Vp (c(p)). Even though SVMs
are not probabilistic, methods exist to convert SVM decision scores to probabilities [19].
Hence, the p(c(p)|v(p)) term in Equation 15.8 can be replaced with the probabilistic SVM
scores and the graph cut algorithm can be used to find spatially coherent colorization. How-
ever, since V is not restricted to be a probabilistic function, Vp (c(p)) can be directly used as
−s(c(p)|v(p)). This approach does not require the additional p(v(p)) estimate in order to
model the confidence of the local predictor; s(c(p)|v(p)) already captures the confidence
via the margin concept and renders the additional (possibly noisy) estimation unnecessary.
The graph cut package [18] available at http://vision.middlebury.edu/MRF/code/ was
used in the experiments. The solution for a 50 × 50 image and 73 possible colors is ob-
tained by graph cuts in a fraction of a second and is generally satisfactory. The computation
time scales approximately quadratically with the size of the image, which is still fast, and
the algorithm performs well even on significantly downscaled versions of the image so that
a good initial colorization can still be given quickly for very large images as well. The
computational costs compete with those of the fastest colorization techniques [20] while
achieving more spatial coherency.
where C refers to a color assignment for image I and C(p) denotes its restriction to pixel
p, hence the color assigned to the pixel. As in the case of standard SVMs, there is a kernel
expansion of the joint predictor given by
s(C|I) = ∑p ∑j αC(p)(j) k(wj, v(p)) + ∑p∼q λ̄C(p)C(q) f̄(C(p), C(q)).   (15.11)
The following discusses this estimator with respect to the previously considered func-
tions. Compared to the SVM-based local prediction function given in Equation 15.2, this
estimator is defined over a full grayscale image I and its possible colorings C as opposed
to the SVM case which is defined over an individual grayscale pixel p and its colorings c.
Furthermore, the spatial coherence criteria (the second term in Equation 15.11) are incor-
porated directly rather than by two-step approaches used in the Parzen window and SVM-
based methods. It can be also observed that the proposed joint predictor, Equation 15.11,
is simply a variation of the energy function used in the graph-cut algorithm given in Equa-
tion 15.7, where different parameters for spatial coherence can now be estimated by a joint
learning process, as opposed to learning the color variation and finding ρ in the energy via
cross-validation. With the additional symmetry constraint λ̄cc′ = λ̄c′c for each color pair
c, c′, the energy function can be optimized using the graph cuts swap move algorithm.
Hence, SVMstruct provides a more unified approach for learning parameters and removes
the necessity of the hyper-parameter ρ .
There are efficient training algorithms for this optimization [21]. In this chap-
ter, the experiments were done using the SVMstruct implementation available at
http://svmlight.joachims.org/ with a Gaussian kernel.
15.8 Experiments
This section presents experimental results achieved using the proposed automatic col-
orization methods on different datasets.
FIGURE 15.5
Comparison with the method of Reference [7]: (a) color zebra example, (b) test image, (c) proposed method output, (d) 2D colors predicted using the proposed method, and (e) output using Reference [7] with the assumption that this is a binary classification problem. © Eurographics Association 2005
FIGURE 15.6
Zoomed portion of images in Figure 15.5: (a) method of Reference [7], (b) proposed method, and (c) colors
predicted using the proposed method.
similar scene that can be used as a training image. Therefore, a small set of three different
images is considered; each image from this set shares a partial similarity with the Charlie
Chaplin image. The underlying difficulty is that each training image also contains parts
which should not be reused in this target image. Figure 15.7 shows the results obtained
using Parzen windows. The result is promising considering the training set. In spite of the
difficulty of the task, the prediction of color edges and of homogeneous regions remains
significant. The brick wall, the door, the head, and the hands are globally well colored. The
large trousers are not in the training set; the mistakes in the colors of Charlie Chaplin’s dog
are probably due to the blue reflections on the dog in the training image and to the light
brown of its head. Dealing with larger training datasets increases the computation time
only logarithmically during the kD-tree search.
Figure 15.8 shows the results for SVM-based prediction. In the case of SVM coloriza-
tion, the contours of Charlie Chaplin and the dog are well recognizable in the color-only
image. Face and hands do not contain nonskin colors, but the abrupt transitions from very
pale to warm tones are visually not satisfying. The dog contains a large fraction of skin col-
ors; this could be attributed to a high textural similarity between human skin and dog body
regions. The large red patch on the door frame is probably caused by a high local similarity
with the background flag in the first training image. Application of the spatial coherency
criterion from Equation 15.7 yields a homogeneous coloring, with skin areas being well
FIGURE 15.9
Eight of 20 training images from the Caltech Pasadena Houses 2000 collection available at
http://www.vision.caltech.edu/archive.html. The whole set can be seen at http://www-sop.inria.fr/members/
Guillaume.Charpiat/color/houses.html.
represented. The coloring of the dog does not contain the mistakes from Figure 15.7, and
appears consistent. The door is colored with regions of multiple, but similar colors, which
roughly follow the edges on the background. The interaction term effectively prevents tran-
sitions between colors that are too different, when the local probabilities are similar. When
colorizing with the interaction weights learned using SVMstruct, the overall result appears
less balanced; the dog has patches of skin color and the face consists of two not very similar
color regions. Given the difficulty of the task and the small amount of training data, the
result is presentable, as all interaction weights were learned automatically.
to the results from the Parzen window method, but with a higher number of different color
regions and thus a more realistic appearance. SVMstruct-based colorization preserves a
much higher number of different color regions while removing most of the inconsistent
color patches and keeping finer details. It is able to realistically color the small bushes
in front of the second house. The spatial coherency weights for transitions between dif-
ferent colors were learned from training data without the need for cross validation for the
adjustment of the hyper-parameter ρ. These improvements come at the expense of longer
training time, which scales quadratically with the training data.
Finally, an ambitious experiment was performed to evaluate how the proposed approach
deals with quantities of different textures on similar objects; that is, how it scales with the
number of textures observed. A portrait database of 53 paintings with very different styles
(Figure 15.12) was built. Five other portraits were colored using Parzen windows (Fig-
ure 15.13) and the same parameters. Given that red is indeed the dominant
color in a large proportion of the training set, the colorizations sometimes appear
rather reddish. The surprising part is the good quality of the prediction of the colored
edges, which yields a segmentation of the test images into homogeneous color regions.
The boundaries of skin areas in particular are very well estimated, even in images which
are very heavily textured. The good estimation of color edges helped the colorization pro-
cess to find suitable colors inside the supposedly-homogeneous areas, despite locally noisy
color predictions. Note that neither SVM nor SVMstruct was evaluated in this experiment
due to their expensive training.
15.9 Conclusion
This chapter presented three machine learning methods for automatic image colorization.
These methods do not require any intervention by the user other than the choice of rela-
tively similar training data. The color prediction task was formally stated as an optimization
problem with respect to an energy function. Since the proposed approaches retain the mul-
timodality until the prediction step, they extract information from training data effectively
using different machine learning methods. The fact that the problem is solved directly at
the global level with the help of graph cuts makes the proposed framework more robust to
noise and local prediction errors. It also allows resolving large scale ambiguities as opposed
to previous approaches. The multimodality framework is not specific to image colorization
and could be used in any prediction task on images. For example, Reference [22] outlines
a similar approach for medical imaging to predict computed tomography scans for patients
whose magnetic resonance scans are known.
FIGURE 15.12
Some of 53 portraits used as a training set. Styles of paintings vary significantly, with different kinds
of textures and different ways of representing edges. The full training set is available at http://www-
sop.inria.fr/members/Guillaume.Charpiat/color/.
The proposed framework exploits features derived from various sources of information.
It provides a principled way of learning local color predictors along with spatial coherence
criteria as opposed to the previous methods which chose the spatial coherence criteria man-
ually. Experimental results on both small- and large-scale datasets demonstrate the validity
of the proposed approach, which produces significant improvements over the methods in
References [5] and [6], in terms of the spatial coherency formulation and the large number
of possible colors. It requires similar or less user intervention than the method in Refer-
ence [7], and can handle cases which are more ambiguous or have more texture noise.
Currently, the proposed automatic colorization framework does not employ decisive in-
formation which is commonly used in user-interactive approaches. However, the proposed
framework can easily incorporate user-provided information such as the color c at pixel p in
order to modify a colorization that has been obtained automatically. This can be achieved
by clamping the local prediction to the color provided by the user with high confidence.
For example, in the Parzen window method, p(c|v(p)) is set to 1 and the confidence p(v(p)) is
set to a very large value. Similar clamping assignments are possible for SVM-based ap-
proaches. Consequently, the proposed optimization framework is usable for further inter-
active colorization. A recolorization with user-provided color landmarks does not require
the re-estimation of color probabilities, and therefore requires only a fraction of a second.
This interactive setting will be addressed in future work.
Acknowledgment
The authors would like to thank Jason Farquhar, Peter Gehler, Matthew Blaschko, and
Christoph Lampert for very fruitful discussions.
FIGURE 15.13
Portrait colorization: (top) result, (middle) colors chosen without grayscale intensity, and (bottom) predicted
color edges. Predicted color variations are particularly meaningful and correspond precisely to the boundaries
of the principal regions. Thus, the color edge estimator can be seen as a segmentation tool. The background
colors cannot be expected to be correct since the database focuses on faces. The same parameters were used
for all portraits.
Figures 15.1 to 15.5, and 15.7 are reprinted with permission from Reference [8]. Fig-
ure 15.5 is reprinted from Reference [7], with the permission of Eurographics Associ-
ation. Figure 15.9 contains photos from the Caltech Pasadena Houses 2000 collection
(http://www.vision.caltech.edu/archive.html), reproduced with permission.
References
[1] A. Levin, D. Lischinski, and Y. Weiss, “Colorization using optimization,” ACM Transactions
on Graphics, vol. 23, no. 3, pp. 689–694, August 2004.
[2] L. Yatziv and G. Sapiro, “Fast image and video colorization using chrominance blending,”
IEEE Transactions on Image Processing, vol. 15, no. 5, pp. 1120–1129, May 2006.
[3] T. Horiuchi, “Colorization algorithm using probabilistic relaxation,” Image Vision Computing,
vol. 22, no. 3, pp. 197–202, March 2004.
[4] T. Takahama, T. Horiuchi, and H. Kotera, “Improvement on colorization accuracy by parti-
tioning algorithm in CIELAB color space,” Lecture Notes in Computer Science, vol. 3332,
pp. 794–801, November 2004.
[5] T. Welsh, M. Ashikhmin, and K. Mueller, “Transferring color to greyscale images,” ACM
Transactions on Graphics, vol. 21, no. 3, pp. 277–280, July 2002.
[6] A. Hertzmann, C.E. Jacobs, N. Oliver, B. Curless, and D.H. Salesin, “Image analogies,” in
Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Tech-
niques, Los Angeles, CA, USA, August 2001, pp. 327–340.
[7] R. Irony, D. Cohen-Or, and D. Lischinski, “Colorization by example,” in Proceedings of Eu-
rographics Symposium on Rendering, Konstanz, Germany, June 2005, pp. 201–210.
[8] G. Charpiat, M. Hofmann and B. Schölkopf, “Automatic image colorization via multimodal
predictions,” Lecture Notes in Computer Science, vol. 5304, pp. 126–139, October 2008.
[9] F. Pitie, A. Kokaram, and R. Dahyot, Single-Sensor Imaging: Methods and Applications for
Digital Cameras, ch. Enhancement of digital photographs using color transfer techniques,
R. Lukac (ed.), Boca Raton, FL: CRC Press / Taylor & Francis, September 2008, pp. 295–
321.
[10] R.W.G. Hunt, The Reproduction of Colour, Chichester, England: John Wiley, November
2004.
[11] B. Schölkopf and A.J. Smola, Learning with Kernels: Support Vector Machines, Regulariza-
tion, Optimization, and Beyond. Cambridge, MA: MIT Press, December 2001.
[12] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “SURF: Speeded up robust features,” Com-
puter Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, June 2008.
[13] V. Vapnik, Statistical Learning Theory, New York, NY, USA: Wiley-Interscience, September
1998.
[14] J. Weston and C. Watkins, “Support vector machines for multi-class pattern recognition,” in
Proceedings of the European Symposium on Artificial Neural Networks, Bruges, Belgium,
April 1999.
[15] Y. Boykov and V. Kolmogorov, “An experimental comparison of min-cut/max-flow algorithms
for energy minimization in vision,” in Proceedings of the Third International Workshop on
Energy Minimization Methods in Computer Vision and Pattern Recognition, London, UK,
September 2001, pp. 359–374.
[16] V. Kolmogorov and R. Zabih, “What energy functions can be minimized via graph cuts?,” in
Proceedings of the European Conference on Computer Vision, Copenhagen, Denmark, May
2002, pp. 65–81.
[17] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph
cuts,” in Proceedings of the International Conference on Computer Vision, Kerkyra, Greece,
September 1999, pp. 377–384.
[18] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M.F. Tappen,
and C. Rother, “A comparative study of energy minimization methods for markov random
fields,” Lecture Notes in Computer Science, vol. 3952, pp. 16–29, May 2006.
[19] J. Platt, Advances in Large Margin Classifiers, ch. Probabilistic outputs for support vector
machines and comparisons to regularized likelihood methods, P.B. Alexander and J. Smola
(eds.), Cambridge, MA: MIT Press, October 2000, pp. 61–74.
[20] G. Blasi and D.R. Recupero, “Fast Colorization of Gray Images,” in Proceedings of Euro-
graphics Italian Chapter, Milano, Italy, September 2003, pp. 1120–1129.
[21] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, “Support vector machine learning
for interdependent and structured output spaces,” in Proceedings of International Conference
on Machine Learning, Banff, AB, Canada, July 2004, pp. 823–830.
[22] M. Hofmann, F. Steinke, V. Scheel, G. Charpiat, J. Farquhar, P. Aschoff, M. Brady,
B. Schölkopf, and B.J. Pichler, “MR-based attenuation correction for PET/MR: A novel ap-
proach combining pattern recognition and atlas registration,” Journal of Nuclear Medicine,
vol. 49, no. 11, pp. 1875–1883, November 2008.
16
Machine Learning for Digital Face Beautification
Gideon Dror
16.1 Introduction
Beauty, particularly of the human face, has fascinated human beings from the very dawn
of mankind, inspiring countless artists, poets, and philosophers. Numerous psychological
studies find high cross-cultural agreement in facial attractiveness ratings among raters from
different ethnicities, socioeconomic classes, ages, and gender [1], [2], [3], [4], indicating
that facial beauty is a universal notion, transcending the boundaries between different cul-
tures. These studies suggest that the perception of facial attractiveness is data-driven; the
properties of a particular set of facial features are the same irrespective of the perceiver.
FIGURE 16.1
Digital face beautification: (left) input facial images, and (right) the modified images generated using the
proposed method. The changes are subtle, yet their impact is significant. Notice that different modifications
are applied to men and women, according to preferences learned from human raters. © 2008 ACM
The universality of the notion of facial attractiveness along with the ability to reliably
and automatically rate the facial beauty from a facial image [5], [6] has motivated this
work. Specifically, this chapter presents a novel tool capable of automatically enhancing
the attractiveness of a face in a given frontal portrait. It aims at introducing only sub-
tle modifications to the original image, such that the resulting beautified face maintains a
strong, unmistakable similarity to the original, as demonstrated in Figure 16.1 by the pairs
of female and male faces. This is a highly nontrivial task, since the relationship between the
ensemble of facial features and the degree of facial attractiveness is anything but simple.
Professional photographers have been retouching and deblemishing their subjects ever
since the invention of photography. It may be safely assumed that any model present on
a magazine cover today has been digitally manipulated by a skilled, talented retouching
FIGURE 16.2
The proposed digital face beautification process.
artist. Since the human face is arguably the most frequently photographed object on earth,
a tool such as the one described in this chapter would be a useful and welcome addition to
the ever-growing arsenal of image enhancement and retouching tools available in today’s
digital image editing packages. The potential of such a tool for motion picture special
effects, advertising, and dating services, is also quite obvious.
Given a face, a variety of predetermined facial locations are identified to compute a set of
distances between them. These distances define a point in a high-dimensional face space.
This space is searched for a nearby point that corresponds to a more attractive face. The
key component in this search is an automatic facial beauty rating machine, that is, two
support vector regressors trained separately on a database of female and male faces with
accompanying facial attractiveness ratings collected from a group of human raters. Once
such a point is found, the corresponding facial distances are embedded in the plane and
serve as a target to define a two-dimensional (2D) warp field which maps the original facial
features to their beautified locations. The process is schematically depicted in Figure 16.2.
Experimental results indicate that the proposed method is capable of effectively im-
proving the facial attractiveness of most images of female faces used in experiments. In
particular, its effectiveness was experimentally validated by a group of test subjects who
consistently found the beautified faces to be more attractive than the original ones.
The proposed beauty regressor was trained using frontal portraits of young Caucasian
males and females with neutral expression and roughly uniform lighting. Thus, it currently
can only be expected to perform well on facial images with similar characteristics. How-
ever, it may be directly extended to handle additional ethnic groups, simply by using it with
beauty regressors trained on suitable collections of portraits.
16.1.1 Overview
Section 16.2 describes psychological findings related to the perception of human facial beauty that have accumulated over the past three decades. These notions are important for un-
derstanding the reasoning behind the proposed methods. Section 16.3 discusses how to
construct a model of facial beauty using supervised learning methods. The models are
based on sets of face images of females and males, rated by human raters. This section also
describes the features used to represent facial geometry, that are based on psychological
research as well as detailed empirical tests. Section 16.4 presents two alternative methods
of face beautification. The first uses direct optimization of a beauty function and the other
is based on heuristics motivated by well-known psychological effects of beauty perception.
The next two sections describe techniques and methods related to the process of face
beautification. Namely, Section 16.5 presents the methods used to locate facial features and
identify 84 canonical points on a face, including points at specific locations on the mouth,
nose, eyes, brows, and the contour of the face, which provide a succinct description of the
geometry of a face. These points are the anchors for warping a face to a beautified version
thereof. Section 16.6 describes distance embedding required to carry out the warping, and
the warping process, modified for the specific task of warping human faces.
Section 16.7 presents examples of beautified faces of both females and males. An empir-
ical validation based on a large set of faces is described, showing that face images produced
by the process are indeed significantly more pleasing than the original images. This section
concludes by pointing out some applications of face beautification. Finally, conclusions are
offered in Section 16.8 which also discusses some ideas for extending the proposed method
to handle nonfrontal portraits and nonneutral expressions.
16.2 Background
Philosophers, artists and scientists have been trying to capture the nature of beauty since
the early days of philosophy. Although nowadays a common layman's notion is that
judgments of beauty are a matter of subjective opinion alone, recent findings suggest that
people share a common taste for facial attractiveness and that their preferences may be
an innate part of our primary constitution. Indeed, several rating studies have shown high
cross-cultural agreement in attractiveness rating of faces of different ethnicities [1], [2], [3].
Other experimental studies demonstrated consistent relations between attractiveness and
various facial features, which were categorized as neonate (features such as small nose
and high forehead), mature (e.g., prominent cheekbones) and expressive (e.g., arched eye-
brows). They concluded that beauty is not an inexplicable quality which lies only in the
eye of the beholder [7], [8].
Further experiments have found that infants ranging from two to six months of age prefer
to look longer at faces rated as attractive by adults than at faces rated as unattractive [9],
[10]. They also found that twelve-month-old infants prefer to play with a stranger who has an attractive face over a stranger with an unattractive face.
Such findings give rise to the quest for common factors which determine human facial
attractiveness. Accordingly, various hypotheses, from cognitive, evolutional, and social
perspectives, have been put forward to describe and interpret the common preferences for
facial beauty. Inspired by the photographic method of composing faces presented in Ref-
erence [11], Reference [12] proposed to create averaged faces (Figure 16.3) by morphing
multiple images together. Human judges found these averaged faces to be attractive and
rated them with attractiveness ratings higher than the mean rating of the component faces
composing them, proposing that averageness is the answer for facial attractiveness [12],
[13]. Investigating symmetry and averageness of faces, it was found that symmetry is more
important than averageness in facial attractiveness [14]. Other studies have agreed that av-
erage faces are attractive but claim that faces with certain extreme features, such as extreme
sexually dimorphic traits, may be more attractive than average faces [4].
Many contributors refer to the evolutionary origins of attractiveness preferences [15].
According to this view, facial traits signal mate quality and imply chances for reproductive
success and parasite resistance. Some evolutionary theorists suggest that preferred features
might not signal mate quality but that the “good taste” by itself is an evolutionary adapta-
tion (individuals with a preference for attractiveness will have attractive offspring that will
be favored as mates) [15]. Another mechanism explains attractiveness preferences through
a cognitive theory — a preference for attractive faces might be induced as a by-product
of general perception or recognition mechanisms [16]. Attractive faces might be pleasant
to look at since they are closer to the cognitive representation of the face category in the
mind. It was further demonstrated that not just average faces are attractive but also birds,
fish, and automobiles become more attractive after being averaged with computer manip-
ulation [17]. Such findings led researchers to propose that as perceivers can process an
object more fluently, aesthetic response becomes more positive [18]. A third view suggests
that facial attractiveness originates in a social mechanism, where preferences may be de-
pendent on the learning history of the individual and even on his social goals [16]. Other
studies have used computational methods to analyze facial attractiveness. In several cases
faces were averaged using morphing tools [3]. Laser scans of faces were put into com-
plete correspondence with the average face in order to examine the relationship between
facial attractiveness, age, and averageness [19]. Machine learning methods have been used
recently to investigate whether a machine can predict attractiveness ratings by learning a
mapping from facial images to their attractiveness scores [5], [6]. The predictor presented
by the latter achieved a correlation of 0.72 with average human ratings, demonstrating that
facial beauty can be learned by a machine with human-level accuracy.
• Increasing the number of ratings to 60, with approximately the same ratio of male to female raters, introduced no significant change in the average ratings.
• The raters were divided into two disjoint groups of equal size. The mean rating for each facial image in each group was calculated to determine the Pearson correlation between the mean ratings of the two groups. This process was repeated 1000 times, separately for the male and female datasets (see the sketch after this list). The mean correlation between the two groups was higher than 0.9 for both datasets, with a standard deviation of σ < 0.03. It should be noted that the split-half correlations were high in all 1000 trials (as evident from the low standard deviation) and not only on average. Experimental results show that there is greater agreement on human ratings of female faces, while male face preferences are more variable, in accordance with Reference [39]. This level of correlation corresponds well to the consistency among groups of raters reported in the literature [1].
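The split-half procedure in the second item can be sketched as follows. The array layout and function name are hypothetical; the snippet simply mirrors the described protocol.

```python
import numpy as np

def split_half_correlations(ratings, n_trials=1000, rng=None):
    """Estimate rating reliability by repeated split-half Pearson correlation.

    ratings : (n_raters, n_images) array of attractiveness scores (hypothetical layout).
    Returns the correlations from all trials; their mean and standard deviation
    correspond to the statistics reported in the text.
    """
    rng = np.random.default_rng(rng)
    n_raters = ratings.shape[0]
    half = n_raters // 2
    corrs = []
    for _ in range(n_trials):
        perm = rng.permutation(n_raters)
        mean_a = ratings[perm[:half]].mean(axis=0)
        mean_b = ratings[perm[half:2 * half]].mean(axis=0)
        corrs.append(np.corrcoef(mean_a, mean_b)[0, 1])
    return np.array(corrs)
```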
Hence, the mean ratings collected are stable indicators of attractiveness that can be used
for the learning task. The facial set contained faces in all ranges of attractiveness. Final
attractiveness ratings range from 1.42 to 5.75, with the mean rating equal to 3.33 and σ =
0.94.
The mean (normalized) positions of the extracted feature points (see Figure 16.4b) are used to construct a Delaunay triangulation. The triangulation consists of 234 edges, and the lengths of these edges in each face form its 234-dimensional distance vector. Figure 16.4c shows an example of a face triangulation and the associated distance vector. The distances are normalized by the square root of the face area to make them invariant to scale. The proposed method works with distances between feature points, rather than with their spatial coordinates, as such distances are more directly correlated with the perceived attractiveness of a face. Furthermore, working with a facial mesh, rather than some other planar graph, imposes some rigidity on the beautification process, preventing it from generating distances which may possess a high score but do not correspond to a valid face.
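A minimal sketch of this construction, assuming SciPy is available and that the landmark coordinates and face area are already known; the function names are hypothetical.

```python
import numpy as np
from scipy.spatial import Delaunay

def build_edges(mean_points):
    """Delaunay triangulation of the mean feature-point positions; returns unique edges."""
    tri = Delaunay(mean_points)
    edges = set()
    for simplex in tri.simplices:
        for i in range(3):
            a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
            edges.add((a, b))
    return sorted(edges)

def distance_vector(points, edges, face_area):
    """Edge lengths normalized by the square root of the face area (scale invariance)."""
    lengths = np.array([np.linalg.norm(points[a] - points[b]) for a, b in edges])
    return lengths / np.sqrt(face_area)
```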
y = y_orig − y_lin, where y_orig are the original beauty scores and y_lin is the regression hyperplane based on the nongeometrical features above.
FIGURE 16.5
The beauty score plotted as a function of K in the proposed KNN-based technique applied to one of the faces in the test database. The optimal value of K is 5, with an associated SVR beauty score of 5.03. The initial beauty score for this face is 4.38, and the simple average score (K → ∞) is 4.51. The proposed SVR-based beautifier succeeds in finding a distance vector with a higher score of 5.20. © 2008 ACM
More specifically, let ~x_i and y_i denote the set of distance vectors corresponding to the training set samples and their associated beauty scores, respectively. Now, given a distance vector ~x, the beauty-weighted distances w_i can be defined as follows:

$$
w_i = \frac{y_i}{\|\vec{x} - \vec{x}_i\|}.
$$

Notice that y_i gives more weight to the more beautiful samples in the neighborhood of ~x. The best results are obtained by first sorting {~x_i} in descending order, such that w_i ≥ w_{i+1}, and then searching for the value of K maximizing the SVR beauty score f_b of the weighted sum

$$
\vec{x}\,' = \frac{\sum_{i=1}^{K} w_i \vec{x}_i}{\sum_{i=1}^{K} w_i}. \qquad (16.1)
$$
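The KNN-based search around Equation 16.1 can be sketched as follows. The SVR beauty score f_b is assumed to be available as a callable, and the function signature and default k_max are hypothetical.

```python
import numpy as np

def knn_beautify(x, X_train, y_train, beauty_score, k_max=20):
    """KNN-based beautification following Equation 16.1.

    x : distance vector of the input face; X_train, y_train : training distance
    vectors and their beauty scores; beauty_score : the trained SVR f_b, assumed
    to be a callable returning a scalar.
    """
    w = y_train / np.linalg.norm(X_train - x, axis=1)   # beauty-weighted inverse distances
    order = np.argsort(w)[::-1]                          # sort so that w_i >= w_{i+1}
    best_k, best_x, best_score = None, None, -np.inf
    for k in range(1, k_max + 1):
        idx = order[:k]
        x_new = (w[idx, None] * X_train[idx]).sum(axis=0) / w[idx].sum()
        score = beauty_score(x_new)
        if score > best_score:
            best_k, best_x, best_score = k, x_new, score
    return best_x, best_k, best_score
```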
The chart in Figure 16.5 shows how the beauty score changes for different values of
K. Note that the behavior of the beauty score is nontrivial. However, generally speaking,
small values of K tend to produce higher beauty scores than that of the average face. Some
examples of KNN-beautified faces with different choices of K are shown in Figure 16.6.
Rather than simply replacing the original distances ~x with the beautified ones ~x′, more subtle beautification effects can be produced. This can be achieved by trading off the degree of beautification for resemblance to the original face, that is, by linearly interpolating between ~x and ~x′ before performing the distance embedding described in Section 16.6.
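The trade-off amounts to a one-line blend; the parameter name degree is hypothetical and simply corresponds to the beautification percentage used later in the chapter, divided by 100.

```python
def interpolate_beautification(x, x_beautified, degree):
    """Blend original and beautified distance vectors; degree in [0, 1]
    (0 keeps the original face, 1 applies full beautification)."""
    return (1.0 - degree) * x + degree * x_beautified

# usage: x_half = interpolate_beautification(x, x_new, 0.5)  # 50 percent beautification
```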
FIGURE 16.6
Face beautification using KNN and SVR: (a) original faces, (b) KNN-beautified images with K = 3, (c) KNN-beautified images with optimal K, and (d) SVR-beautified images. © 2008 ACM
is limited by no such constraint. Figure 16.6 demonstrates the differences between KNN-
based and SVR-based beautification.
Formally, the beautified distance vector ~x′ is defined as follows:
Here, the standard no-derivatives direction set method [44] is used to numerically perform
this minimization. To accelerate the optimization, principal component analysis (PCA) is
performed on the feature space to reduce its dimensionality from 234 to 35. Thus, the
minimization process can be applied in the low dimensional space, with ~u denoting the
projection of ~x on this lower dimensional space.
For the majority of the facial images in the test database, using only the beauty function as a guide produces results with a higher beauty score than the KNN-based approach. However,
for some samples, the SVR-based optimization yields distance vectors which do not corre-
spond to valid human face distances. To constrain the search space, the energy functional,
Equation 16.2, can be regularized by adding a log-likelihood term (LP):
where α controls the importance of the log-likelihood term, with α = 0.3 being sufficient
to enforce probable distance vectors. This technique is similar to the one used in Refer-
ence [24].
$$
LP(\vec{u}) = -\sum_{j=1}^{d'} \frac{(\vec{u} - \vec{\mu})_j^2}{2\Sigma_{jj}} + \mathrm{const},
$$

where the constant term is independent of ~u and d′ denotes the dimensionality of ~u.
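Since the energy functional of Equation 16.2 is not reproduced in this excerpt, the following sketch only assumes that the quantity being minimized is the negative SVR beauty score regularized by the log-likelihood term above. The PCA object, hyperparameters, and names are placeholders, and SciPy's Powell method stands in for the no-derivatives direction set method of Reference [44].

```python
import numpy as np
from scipy.optimize import minimize

def svr_beautify(x, beauty_score, pca, mu, Sigma_diag, alpha=0.3):
    """Direction-set (Powell) search in the PCA-reduced face space (a sketch).

    pca : fitted object with transform/inverse_transform (e.g., sklearn.decomposition.PCA
    reducing 234 dimensions to 35); mu, Sigma_diag : mean and diagonal covariance of the
    training projections; beauty_score : the trained SVR f_b as a callable.
    """
    def log_prior(u):
        return -np.sum((u - mu) ** 2 / (2.0 * Sigma_diag))   # LP(u) up to a constant

    def energy(u):
        x_candidate = pca.inverse_transform(u.reshape(1, -1))[0]
        return -beauty_score(x_candidate) - alpha * log_prior(u)

    u0 = pca.transform(x.reshape(1, -1))[0]            # original face as initial guess
    res = minimize(energy, u0, method="Powell")        # no-derivatives direction set search
    return pca.inverse_transform(res.x.reshape(1, -1))[0]
```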
of the facial images in the training set. To beautify a new facial image, this new image is
first analyzed and its feature landmarks are extracted in the same way as was done for the
training images. In most cases, the input image analysis is fully automatic. In rare cases
some user intervention is required, typically, when large parts of the face are occluded by
hair.
where e_ij denotes the facial mesh connectivity. To reduce nonrigid distortion of facial features, α_ij is set to 1 for edges that connect two feature points belonging to different facial features, and to 10 for edges connecting points within the same facial feature. The target distance term d_ij is the entry in ~v′ corresponding to the edge e_ij.
The target landmark positions qi are obtained by minimizing E. This kind of optimization
has been recently studied in the context of graph drawing [48]. It is referred to as a stress
minimization problem, originally developed for multidimensional scaling [49]. Here, the
Levenberg-Marquardt (LM) algorithm is used to efficiently perform this minimization [50],
[51], [52]. The LM algorithm is an iterative nonlinear minimization algorithm which re-
quires reasonable initial positions. However, in this case, the original geometry provides a
good initial guess, since the beautification always modifies the geometry only a little.
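A possible sketch of the stress-minimization embedding, assuming the energy E has the standard form Σ α_ij (||q_i − q_j|| − d_ij)² suggested by the surrounding description. SciPy's Levenberg-Marquardt solver is used, and the function name is hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

def embed_distances(p_init, edges, target_d, alpha):
    """Levenberg-Marquardt embedding of the beautified distances (a sketch).

    p_init : (n_points, 2) original landmark positions, used as the initial guess;
    edges : list of (i, j) index pairs of the facial mesh; target_d : target distance
    per edge (entries of the beautified distance vector); alpha : per-edge weight array
    (1 or 10 as described above).  Squaring the residuals reproduces the stress energy.
    """
    n = p_init.shape[0]
    i_idx = np.array([i for i, _ in edges])
    j_idx = np.array([j for _, j in edges])

    def residuals(q_flat):
        q = q_flat.reshape(n, 2)
        lengths = np.linalg.norm(q[i_idx] - q[j_idx], axis=1)
        return np.sqrt(alpha) * (lengths - target_d)

    sol = least_squares(residuals, p_init.ravel(), method="lm")
    return sol.x.reshape(n, 2)
```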
The embedding process has no knowledge of the semantics of facial features. However,
human perception of faces is extremely sensitive to the shape of the eyes. Specifically,
even a slight distortion of the pupil or the iris into a noncircular shape significantly detracts
from the realistic appearance of the face. Therefore, a postprocess that enforces a similarity
transform on the landmarks of the eyes, independently for each eye, should be performed.
A linear least squares problem in the four free variables of the similarity transform

$$
S = \begin{bmatrix} a & b & t_x \\ -b & a & t_y \\ 0 & 0 & 1 \end{bmatrix}
$$

can be solved by minimizing ∑_i ||S p_i − q_i||² over all feature points of the eyes, where p_i are the original landmark locations and q_i are their corresponding embedded positions (from Equation 16.3). Then S p_i replaces q_i to preserve the shape of the original eyes. The embedding
works almost perfectly with an average beauty score drop of only 0.005, before applying
the above similarity transform correction to the eyes. However, there is an additional small
loss of 0.232 on average in beauty score after the similarity correction.
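The linear least squares fit for the eye similarity transform can be sketched as follows; the function name is hypothetical, and the two rows per landmark follow directly from applying S to a point.

```python
import numpy as np

def fit_eye_similarity(p, q):
    """Least-squares similarity transform S (free variables a, b, tx, ty) mapping the
    original eye landmarks p to their embedded positions q, as described in the text."""
    rows, rhs = [], []
    for (px, py), (qx, qy) in zip(p, q):
        rows.append([px, py, 1.0, 0.0]); rhs.append(qx)    # a*px + b*py + tx = qx
        rows.append([py, -px, 0.0, 1.0]); rhs.append(qy)   # -b*px + a*py + ty = qy
    a, b, tx, ty = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)[0]
    S = np.array([[a, b, tx], [-b, a, ty], [0.0, 0.0, 1.0]])
    # Replace each q_i by S p_i to preserve the circular shape of the pupil and iris
    p_h = np.hstack([np.asarray(p, dtype=float), np.ones((len(p), 1))])
    return S, (S @ p_h.T).T[:, :2]
```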
The distance embedding process maps the set of feature points {pi } from the source
image to the corresponding set {qi } in the beautified image. Next, a warp field that maps
the source image into a beautified one according to this set of correspondences is computed.
For this purpose, the multilevel free-form deformation (MFFD) technique [53] is adapted.
The warp field is illustrated in Figure 16.7, where the source feature points are shown using
dark gray and the corresponding beautified positions are indicated using black.
The MFFD consists of a hierarchical set of free-form deformations of the image plane
where, at each level, the warp function is a free-form deformation defined by B-spline
tensor products. The advantage of the MFFD technique is that it guarantees a one-to-one
mapping (no foldovers). However, this comes at the expense of a series of hierarchical
warps [53]. To accelerate the warping of high-resolution images, the explicit hierarchical composition of transformations is first unfolded into a flat one by evaluating the MFFD on the vertices of a fine lattice.
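One way to flatten the hierarchical composition, sketched here under the assumption that the composed MFFD warp is available as a callable that maps each destination pixel back to its source coordinates (backward warping). OpenCV is used only for upsampling and remapping, and the lattice step is arbitrary.

```python
import numpy as np
import cv2

def flatten_warp(mffd_warp, width, height, step=8):
    """Evaluate the composed MFFD warp on a coarse lattice and upsample to a dense
    per-pixel backward map (a sketch; mffd_warp maps an (N, 2) array of destination
    points to their source positions)."""
    xs = np.arange(0, width + step, step, dtype=np.float32)
    ys = np.arange(0, height + step, step, dtype=np.float32)
    grid = np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)
    warped = mffd_warp(grid).reshape(len(ys), len(xs), 2).astype(np.float32)
    map_x = cv2.resize(warped[..., 0], (width, height), interpolation=cv2.INTER_LINEAR)
    map_y = cv2.resize(warped[..., 1], (width, height), interpolation=cv2.INTER_LINEAR)
    return map_x, map_y

# usage: beautified = cv2.remap(source_image, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```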
16.7 Results
To demonstrate the performance of the proposed digital beautification technique, a simple
interactive application, which was used to generate all of the examples in this chapter, has
been implemented. After loading a portrait, the application automatically detects facial
features, as described in Section 16.5. The user is able to examine the detected features,
and adjust them if necessary. Next, the user specifies the desired degree of beautification, as well as the beauty function f_b to use (male or female), and the application computes
and displays the result.
Figure 16.8 shows a number of input faces and their corresponding beautified versions.
The degree of beautification in all these examples is 100 percent, and the proposed beauti-
fication process increases the SVR beauty score by roughly 30 percent. Note that in each of
these examples, the differences between the original face and the beautified one are quite
subtle, and thus the resemblance between the two faces is unmistakable. Yet the subtle
changes clearly succeed in enhancing the attractiveness of each of these faces.
The faces shown in Figure 16.8 are part of the set of 92 faces, which were photographed
by a professional photographer, and used to train the SVR, as described in Section 16.4.
However, the resulting beautification engine is fully effective on faces outside that set. This
is demonstrated by the examples in Figure 16.9 for females and Figure 16.10 for males.
All images are part of the AR face database [54]. Note that the photographs of this open
repository appear to have been taken under insufficient illumination.
In some cases, it is desirable to let the beautification process modify only some parts of
the face, while keeping the remaining parts intact. This mode is referred to as beautification
by parts. For example, the operator of the proposed application may request that only the
eyes should be subject to beautification, as shown in Figure 16.11. These results demon-
strate that sometimes a small local adjustment may result in an appreciable improvement in
the facial attractiveness. Figure 16.12 is another example of beautification by parts, where
all of the features except the rather unique lips of this individual were subject to adjustment.
Performing beautification by parts uses only those distances for which at least one endpoint is located on a feature designated for beautification. This reduces the dimensionality of the feature space and enables the algorithm to search only among the beautified features. This technique implicitly assumes that features that are part of a beautiful face are beautiful on their own.
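The selection of distances for beautification by parts amounts to a simple edge filter; a minimal sketch with hypothetical names follows.

```python
def select_part_edges(edges, designated_points):
    """Keep only the edges (distances) with at least one endpoint on a feature
    designated for beautification; the remaining coordinates are held fixed."""
    designated = set(designated_points)
    return [(i, j) for (i, j) in edges if i in designated or j in designated]
```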
As mentioned earlier, it is possible to specify the desired degree of beautification, with
0 percent corresponding to the original face and 100 percent corresponding to the face
FIGURE 16.9
Beautification of female faces that were not part of the 92 training faces for which facial attractiveness ratings were collected: (top) input portraits, and (bottom) their beautified versions. © 2008 ACM
FIGURE 16.10
Beautification of male faces that were not part of the male training faces: (top) input portraits, and (bottom)
their beautified versions.
defined by the beautification process. Degrees between 0 and 100 are useful in cases where
the fully beautified face is too different from the original, as demonstrated in Figure 16.13.
FIGURE 16.11
Beautification by parts: (a) original image, (b) full beautification, and (c) only the eyes are designated for beautification. © 2008 ACM
FIGURE 16.12
Beautification by parts: (a) original image, (b) full beautification, and (c) the mouth region is excluded from beautification. © 2008 ACM
or right) were determined randomly, and the 93 pairs were shown in random order. All of
the 93 original faces were obtained from the AR face database [54]. In total, 37 raters, both
males and females aged between 25 and 40, participated in the experiment.
As could be expected, the agreement between raters is not uniform for all portraits. Still,
for all 48 female portraits, the beautified faces were chosen as more attractive by most
raters, and in half of the portraits the beautified versions were preferred by more than 80
percent of the raters. Finally, on average, in 79 percent of the cases the raters chose the
beautified version as the more attractive face. This result is statistically highly significant (P-value = 7.1 × 10^−13), indicating that on average the proposed tool succeeds in significantly improving the attractiveness of female portraits.
FIGURE 16.13
Varying the degree of beautification: (a) original image, (b) 50 percent, and (c) 100 percent, where the differences with respect to the original image may be too conspicuous. © 2008 ACM
As for the male portraits, 69 percent of the beautified versions were chosen as more
attractive. Notice that this result, although not as striking as that for females, is still statis-
tically significant (P-value = 0.006). The possible reasons for the different strengths of the
female and male validation are twofold: i) the male training set was considerably smaller
than that for females; and ii) the notion of male attractiveness is not as well defined as
that of females, so the consensus in both training set ratings and ratings of beautified im-
ages versus original images is not as uniform. Both issues are also reflected in the lower
performance of the males’ beauty regressor.
16.7.2 Applications
As noted in Section 16.1, professional photographers have been retouching and deblemishing their subjects ever since the invention of photography, and it may be safely assumed that any model that appears on a magazine cover today has been digitally manipulated by a skilled, talented retouching artist. It should be noted that such retouching is not limited to manipulating color and texture; it also includes wrinkle removal and changes in the geometry of the facial features. Since the human face is arguably the most frequently photographed object on earth, face beautification would be a useful and welcome addition to the ever-growing arsenal of image enhancement and retouching tools available in today's digital image editing packages. The potential of such a tool for motion picture special effects, advertising, and dating services is also quite obvious.
Another interesting application of the proposed technique is the construction of facial
collages when designing a new face for an avatar or a synthetic actor. Suppose a collection
of facial features (eyes, nose, mouth, etc.) originating from different faces to synthesize
a new face with these features. The features may be assembled together seamlessly using
Poisson blending [55], but the resulting face is not very likely to look attractive, or even
natural, as demonstrated in Figure 16.14. Applying the proposed tool to the collage results
in a new face that is more likely to be perceived as natural and attractive.
FIGURE 16.14
Beautification using a collection of facial features: (a) a collage with facial features taken from a catalog, (b)
result of Poisson-blending the features together, and (c) result after applying the proposed technique to the middle image. © 2008 ACM
16.8 Conclusions
A face beautification method based on an optimization of a beauty function modeled by
a support vector regressor has been developed. The challenge was twofold: first, the mod-
eling of a high dimensional nonlinear beauty function, and second, climbing that function
while remaining within the subspace of valid faces. It should be emphasized that the syn-
thesis of a novel valid face is a particularly challenging task, since humans are extremely
sensitive to every single nuance in a face. Thus, the smallest artifact may be all that is
needed for a human observer to realize that the face he is looking at is a fake. Currently,
the proposed method is limited to beautifying faces in frontal views and with a neutral ex-
pression only. Extending this technique to handle general views and other expressions is a
challenging direction for further research.
In the proposed method, beautification is obtained by manipulating only the geometry
of the face. However, as was mentioned earlier, there are also important nongeometric
attributes that have a significant impact on the perceived attractiveness of a face. These
factors include color and texture of hair and skin, and it would be interesting to investigate
how changes in these attributes might be incorporated in the proposed digital beautification
framework.
Finally, it should be noted that the goal of this research was not to gain a deeper under-
standing of how humans perceive facial attractiveness. Thus, no specific explicit beautifi-
cation guidelines, such as making the lips fuller or the eyes larger, were proposed. Instead,
this work aimed at developing a more general methodology that is based on raw beauty
ratings data. It is hoped, however, that perceptual psychologists will find the proposed
technique useful in their quest for a better understanding of the perception of beauty.
Appendix
Suppose there are l examples {~x_i, y_i}, with ~x_i ∈ R^d and y_i ∈ R for i = 1, 2, ..., l. The goal of ε-support vector regression (SVR) [41] is to find a smooth function f(~x) that has at most ε deviation from the actual target values y_i and, at the same time, is as flat as possible. In other words, errors are ignored as long as they are less than ε. In the simplest case, f is a linear function of the form f(~x) = ~w · ~x + b, where ~w ∈ R^d and b ∈ R. In the linear case, flatness simply means a small ||~w||.
Formally, one can write this as a constrained optimization problem [41]:

$$
\begin{aligned}
\text{minimize} \quad & \frac{1}{2}\|\vec{w}\|^2 + C\sum_{i=1}^{l}(\xi_i + \xi_i^*), \\
\text{subject to} \quad & y_i - \vec{w}\cdot\vec{x}_i - b \le \varepsilon + \xi_i, \\
& \vec{w}\cdot\vec{x}_i + b - y_i \le \varepsilon + \xi_i^*, \qquad (16.4)\\
& \xi_i,\ \xi_i^* \ge 0.
\end{aligned}
$$
This formulation, referred to as soft margin SVR, introduced the slack variables ξi and ξi∗
in order to allow for some outliers. Figure 16.15 illustrates the ε -insensitive band as well as
the meaning of the slack variables for a one-dimensional regression problem. The positive
constant C determines the trade off between the flatness of f and the tolerance to outliers.
Nonlinear functions f can be elegantly incorporated into the SVR formalism by using a mapping ~Φ from the space of input examples R^d into some feature space and then applying the standard SVR formulation. The transformation ~Φ(~x) need not be carried out explicitly, because the SVR algorithm depends only on inner products between examples. Therefore, it suffices to replace all inner products in the original formulation by a kernel function k(~x, ~x′) = ~Φ(~x) · ~Φ(~x′). The result is again a quadratic optimization problem and therefore allows efficient numeric solutions. Not surprisingly, there are constraints on admissible kernel functions, which are known as Mercer's conditions [41], [56].
FIGURE 16.15
ε-insensitive regression. The ξ_i and ξ_i* variables are nonzero only for examples that reside outside the region bounded by the dashed lines.
The solution of Equation 16.4 usually proceeds via the dual formulation. Each constraint is associated with a Lagrange multiplier, in terms of which one obtains a quadratic optimization problem that is easy to solve [57]. The radial basis function kernel k(~x, ~x′) = exp(−||~x − ~x′||²/(2σ²)) is especially useful since it smoothly interpolates between linear models, obtained with σ → ∞, and highly nonlinear models, obtained for small values of σ.
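For reference, such a regressor can be trained with an off-the-shelf ε-SVR implementation; this is only a sketch, and the hyperparameter values below are placeholders rather than the ones used in this chapter.

```python
import numpy as np
from sklearn.svm import SVR

def train_beauty_regressor(X, y, sigma=1.0, C=10.0, epsilon=0.1):
    """Train an epsilon-SVR beauty regressor f_b with an RBF kernel.

    The kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) corresponds to
    gamma = 1 / (2 sigma^2) in scikit-learn's parameterization.
    X : (n_faces, n_features) distance vectors; y : mean attractiveness ratings.
    """
    f_b = SVR(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2), C=C, epsilon=epsilon)
    f_b.fit(X, y)
    return lambda x: float(f_b.predict(np.atleast_2d(x))[0])
```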
Acknowledgment
Figures 16.1, 16.4 to 16.9, and 16.11 to 16.14 are reprinted from Reference [36], with
the permission of ACM.
References
[1] M.R. Cunningham, A.R. Roberts, C.H. Wu, A.P. Barbee, and P.B. Druen, “Their ideas of
beauty are, on the whole, the same as ours: Consistency and variability in the cross-cultural
perception of female attractiveness,” Journal of Personality and Social Psychology, vol. 68,
no. 2, pp. 261–279, February 1995.
[2] D. Jones, Physical Attractiveness and the Theory of Sexual Selection: Results from Five Pop-
ulations, Ann Arbor: University of Michigan Press, June 1996.
[3] D.I. Perrett, K.A. May, and S. Yoshikawa, “Facial shape and judgements of female attractive-
ness,” Nature, vol. 368, pp. 239–242, March 1994.
[4] D.I. Perrett, K.J. Lee, I.S. Penton-Voak, D.A. Rowland, S. Yoshikawa, M.D. Burt, P. Henzi,
D.L. Castles, and S. Akamatsu, “Effects of sexual dimorphism on facial attractiveness,” Na-
ture, vol. 394, pp. 884–887, August 1998.
[5] Y. Eisenthal, G. Dror, and E. Ruppin, “Facial attractiveness: Beauty and the machine,” Neural
Computer, vol. 18, no. 1, pp. 119–142, January 2006.
[6] A. Kagian, G. Dror, T. Leyvand, I. Meilijson, D. Cohen-Or, and E. Ruppin, “A machine learn-
ing predictor of facial attractiveness revealing human-like psychophysical biases,” Vision Re-
search, vol. 48, no. 2, pp. 235–243, January 2008.
[7] M.R. Cunningham, “Measuring the physical in physical attractivenes: Quasi experiments
on the sociobiology of female facial beauty,” Journal of Personality and Social Psychology,
vol. 50, no. 5, pp. 925–935, May 1986.
[8] M.R. Cunningham, A.P. Barbee, and C.L. Philhower, Facial Attractiveness: Evolutionary,
Cognitive, and Social Perspectives, ch. Dimensions of facial physical attractiveness: The in-
tersection of biology and culture, G. Rhodes and L.A. Zebrowitz (eds.), Westport, CT: Ablex
Publishing, October 2001, pp. 193–238.
[9] J.H. Langlois, L.A. Roggman, R.J. Casey, J.M. Ritter, L.A. Rieser-Danner, and V.Y. Jenkins,
“Infant preferences for attractive faces: Rudiments of a stereotype?,” Developmental Psychol-
ogy, vol. 23, no. 5, pp. 363–369, May 1987.
[10] A. Slater, C.V. der Schulenberg, E. Brown, G. Butterworth, M. Badenoch, and S. Parsons,
“Newborn infants prefer attractive faces,” Infant Behavior and Development, vol. 21, no. 2,
pp. 345–354, 1998.
[11] F. Galton, “Composite portraits,” Journal of the Anthropological Institute of Great Britain and
Ireland, vol. 8, pp. 132–142, 1878.
[12] J.H. Langlois and L.A. Roggman, “Attractive faces are only average,” Psychological Science,
vol. 1, no. 2, pp. 115–121, March 1990.
[13] A. Rubenstein, J. Langlois, and L. Roggman, Facial Attractiveness: Evolutionary, Cognitive,
and Social Perspectives, ch. What makes a face attractive and why: The role of averageness in
defining facial beauty, G. Rhodes and L.A. Zebrowitz (eds.), Westport, CT: Ablex Publishing,
October 2001, pp. 1–33.
[14] K. Grammer and R. Thornhill, “Human facial attractiveness and sexual selection: The role of
symmetry and averageness,” Journal of Comparative Psychology, vol. 108, no. 3, pp. 233–
242, September 1994.
[15] R. Thornhill and S.W. Gangsted, “Facial attractiveness,” Trends in Cognitive Sciences, vol. 3,
no. 12, pp. 452–460, December 1999.
[16] L.A. Zebrowitz and G. Rhodes, Facial Attractiveness: Evolutionary, Cognitive, and Social
Perspectives, ch. Nature let a hundred flowers bloom: The multiple ways and wherefores of
attractiveness, G. Rhodes and L.A. Zebrowitz (eds.), Westport, CT: Ablex Publishing, October
2001, pp. 261–293.
[17] J.B. Halberstadt and G. Rhodes, “It’s not just average faces that are attractive: Computer-
manipulated averageness makes birds, fish, and automobiles attractive,” Psychonomic Bulletin
and Review, vol. 10, no. 1, pp. 149–156, March 2003.
[18] R. Reber, N. Schwarz, and P. Winkielman, “Processing fluency and aesthetic pleasure: Is
beauty in the perceiver’s processing experience?,” Personality and Social Psychology Review,
vol. 8, no. 4, pp. 364–382, 2004.
[19] A.J. O’Toole, T. Price, T. Vetter, J.C. Bartlett, and V. Blanz, “3D shape and 2D surface textures
of human faces: The role of “averages” in attractiveness and age,” Image Vision Computing,
vol. 18, no. 1, pp. 9–19, December 1999.
[20] F.I. Parke and K. Waters, Computer Facial Animation, Wellesley, MA: A K Peters, September
1996.
[21] Y. Lee, D. Terzopoulos, and K. Waters, “Realistic modeling for facial animation,” in Proceed-
ings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, New
York, USA, August 1995, pp. 55–62.
[22] B. Guenter, C. Grimm, D. Wood, H. Malvar, and F. Pighin, “Making faces,” in Proceedings
of the 25th Annual Conference on Computer Graphics and Interactive Techniques, New York,
USA, July 1998.
[23] F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, and D. H. Salesin, “Synthesizing realistic facial
expressions from photographs,” in Proceedings of the 25th Annual Conference on Computer
Graphics and Interactive Techniques, New York, USA, July 1998, pp. 75–84.
[24] V. Blanz and T. Vetter, “A morphable model for the synthesis of 3d faces,” in Proceedings of
the 26th Annual Conference on Computer Graphics and Interactive Techniques, New York,
USA, August 1999, pp. 187–194.
[25] P.A. Viola and M.J. Jones, “Robust real-time face detection,” International Journal of Com-
puter Vision, vol. 57, no. 2, pp. 137–154, May 2004.
[26] M.H. Yang, D. Kriegman, and N. Ahuja, “Detecting faces in images: A survey,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 34–58, January 2002.
[27] W. Zhao, R. Chellappa, P.J. Phillips, and A. Rosenfeld, “Face recognition: A literature survey,”
ACM Computing Surveys, vol. 35, no. 4, pp. 399–458, December 2003.
[28] T. Beier and S. Neely, “Feature-based image metamorphosis,” in Proceedings of the 19th
Annual Conference on Computer Graphics and Interactive Techniques Conference, New York,
USA, July 1992, pp. 35–42.
[29] S.Y. Lee, G. Wolberg, and S.Y. Shin, “Scattered data interpolation with multilevel B-splines,”
IEEE Transactions on Visualization and Computer Graphics, vol. 3, no. 3, pp. 228–244, July-
September 1997.
[30] V. Blanz and T. Vetter, “A morphable model for the synthesis of 3D faces,” in Proceedings of
the 26th Annual Conference on Computer Graphics and Interactive Techniques, New York,
USA, August 1999, pp. 187–194.
[31] D.I. Perrett, D.M. Burt, I.S. Penton-Voak, K.J. Lee, D.A. Rowland, and R. Edwards, “Symme-
try and human facial attractiveness,” Evolution and Human Behavior, vol. 20, no. 5, pp. 295–
307, September 1999.
[32] A. Lanitis, C.J. Taylor, and T.F. Cootes, “Toward automatic simulation of aging effects on
face images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 4,
pp. 442–455, April 2002.
[33] V. Blanz, “Manipulation of facial attractiveness.” Available online: http://www.mpi-inf.mpg.de/~blanz/data/attractiveness/, 2003.
[34] D. Vlasic, M. Brand, H. Pfister, and J. Popovic, “Face transfer with multilinear models,” ACM
Transactions on Graphics, vol. 24, no. 3, pp. 426–433, July 2005.
[35] V.S. Johnston and M. Franklin, “Is beauty in the eye of the beholder?,” Ethology and Sociobi-
ology, vol. 14, pp. 183–199, May 1993.
[36] T. Leyvand, D. Cohen-Or, G. Dror, and D. Lischinski, “Data-driven enhancement of facial
attractiveness,” ACM Transactions on Graphics, vol. 27, no. 3, pp. 1–9, August 2008.
[37] C. Braun, M. Gruendl, C. Marberger, and C. Scherber, “Beautycheck – Ursachen und Folgen von Attraktivitaet (causes and consequences of attractiveness).” Available online: http://www.uni-regensburg.de/Fakultaeten/phil_Fak_II/Psychologie/Psy_II/beautycheck/english/, 2001.
[38] B. Fink, N. Neave, J. Manning, and K. Grammer, “Facial symmetry and judgements of at-
tractiveness, health and personality,” Personality and Individual Differences, vol. 41, no. 3,
pp. 491–499, August 2006.
[39] A.C. Little, I.S. Penton-Voak, D.M. Burt, and D.I. Perrett, Facial Attractiveness: Evolutionary,
Cognitive, and Social Perspectives, ch. Evolution and individual differences in the perception
of attractiveness: How cyclic hormonal changes and self-perceived attractiveness influence
female preferences for male faces, G. Rhodes and L.A. Zebrowitz (eds.), Westport, CT: Ablex
Publishing, October 2001, pp. 68–72.
[40] M.A. Turk and A.P. Pentland, “Face recognition using eigenfaces,” in Proceedings of the IEEE
International Conference on Computer Vision and Pattern Recognition, Maui, HI, USA, June
1991, pp. 586–591.
[41] V. Vapnik, The Nature of Statistical Learning Theory, New York: Springer, 1995.
[42] T. Joachims, Advances in Kernel Methods: Support Vector Learning, ch. Making large-scale
SVM learning practical, Cambridge, MA: MIT Press, 1999, pp. 169–184.
[43] T.R. Alley and M.R. Cunningham, “Averaged faces are attractive, but very attractive faces are
not average,” Psychological Science, vol. 2, no. 2, pp. 123–125, March 1991.
[44] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling, Numerical Recipes: The Art
of Scientific Computing, Cambridge, UK: Cambridge University Press, 2nd Edition, 1992.
[45] Y. Zhou, L. Gu, and H.J. Zhang, “Bayesian tangent shape model: Estimating shape and pose
parameters via Bayesian inference,” in Proceedings of the IEEE Computer Society Confer-
ence on Computer Vision and Pattern Recognition 2003, Los Alamitos, CA, USA, June 2003,
pp. 109–118.
[46] T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham, “Active shape models - Their training
and their applications,” Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38–59,
January 1995.
[47] G. Bradski, “The OpenCV library,” Dr. Dobb’s Journal, vol. 25, no. 11, pp. 120–125, Novem-
ber 2000.
[48] J.D. Cohen, “Drawing graphs to convey proximity: An incremental arrangement method,”
ACM Transactions on Computer-Human Interaction, vol. 4, no. 3, pp. 197–229, September
1997.
[49] J.W. Sammon, “A nonlinear mapping for data structure analysis,” IEEE Transactions on Com-
puters, vol. C-18, no. 5, pp. 401–409, May 1969.
[50] K. Levenberg, “A method for the solution of certain problems in least squares,” The Quarterly
of Applied Mathematics, vol. 2, pp. 164–168, 1944.
[51] D. Marquardt, “An algorithm for least-squares estimation of nonlinear parameters,” SIAM
Journal of Applied Mathematics, vol. 11, no. 2, pp. 431–441, June 1963.
[52] M. Lourakis, “Levmar: Levenberg-Marquardt non-linear least squares algorithms in C/C++.” Available online: http://www.ics.forth.gr/~lourakis/levmar, 2004.
[53] S.Y. Lee, G. Wolberg, K.Y. Chwa, and S.Y. Shin, “Image metamorphosis with scattered fea-
ture constraints,” IEEE Transactions on Visualization and Computer Graphics, vol. 2, no. 4,
pp. 337–354, December 1996.
[54] A.M. Martinez and R. Benavente, “The AR face database,” Tech. Rep. 24, CVC, 1998.
[55] P. Pérez, M. Gangnet, and A. Blake, “Poisson image editing,” ACM Transactions on Graphics,
vol. 22, no. 3, pp. 313–318, July 2003.
[56] R. Courant and D. Hilbert, Methods of Mathematical Physics. Interscience, 1953.
[57] A.J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statistics and Com-
puting, vol. 14, no. 3, pp. 199–222, August 2004.
17
High-Quality Light Field Acquisition and Processing
17.1 Introduction
Computational photography is changing the way images are captured. While traditional photography simply captures a two-dimensional (2D) projection of the three-dimensional (3D) world, computational photography captures additional information by using generalized optics. The captured image may not be visually attractive, but together with the additional information, it enables novel postprocessing that can deliver quality images and, more importantly, generate data such as scene geometry that were unobtainable in the past. These new techniques extend the concept of traditional photography and transform a normal camera into a powerful device.
• It has adjustable angular resolution and prefilter kernel. When the angular resolution
is set to one, the light field camera becomes a conventional camera.
• The device is compact and economical. The programmable aperture can be placed in,
and nicely integrated with, a conventional camera.
Second, two algorithms are presented to enhance the captured light field. The first is a
calibration algorithm to remove the photometric distortion unique to a light field without
using any reference object. The distortion is directly estimated from the captured light field.
The other is a depth estimation algorithm utilizing the multi-view property of the light field
and visibility reasoning to generate view-dependent depth maps for view interpolation.
This chapter also presents a simple light transport analysis of the light field cameras. The
device and algorithms constitute a complete system for high quality light field acquisition.
In comparison with other light field cameras, the spatial resolution of the proposed cam-
era is increased by orders of magnitude, and the angular resolution can be easily adjusted
during operation or postprocessing. The photometric calibration enables more consistent
rendering and more accurate depth estimation. The multi-view depth estimation effectively
increases the angular resolution for smoother transitions between views and makes depth-
aware image editing possible.
In the remainder of this chapter, Section 17.2 presents the relevant work in light field rendering and computational photography. Section 17.3 briefly discusses the light transport process in image acquisition and Section 17.4 describes the programmable aperture,
including the concepts and the implementations. Section 17.5 presents the novel postpro-
cessing algorithms for the light field data. Section 17.6 presents experimental results and
Section 17.7 discusses the limitations and future work. Finally, the conclusions are drawn
in Section 17.8.
capture stereoscopic images [20]. Reference [21] uses a coded aperture to estimate the
depth of a near-Lambertian scene from a single image. Similar results can be obtained by
placing a color-filter array at the aperture [22]. In contrast to those methods, the proposed
method directly captures the 4D light field and estimates the depth from it when possible.
The multiple capturing technique captures the scene many times sequentially, or simul-
taneously by using beam splitters and camera arrays. At each exposure the imaging pa-
rameters, such as lighting [23], exposure time, focus, viewpoints [24], or spectral sensitiv-
ity [19], are made different. Then a quality image or additional information, for instance,
alpha matte, is obtained by computation. This technique can be easily implemented in digi-
tal cameras since the integration duration of the sensor can be electronically controlled. For
example, Reference [25] splits a given exposure time into a number of time steps and sam-
ples one image in each time step. The resulting images are then registered for correcting
hand-shaking.
FIGURE 17.1
Light field and light transport. A light ray emitted from a point on an object surface at depth Z can be represented by l_0([x u]^T), or by l([x u]^T) after refraction by the lens. These two representations differ only by a linear transformation (Equation 17.1).
Another common representation places the two planes at the lens and the film (sensor) of a
camera and defines independent coordinate systems for these two planes [31] (x and u on
the right in Figure 17.1).
Suppose there is a light ray emitting from an object surface point and denote its radiance
by the light field l0 ([x u]T ), where x and u are the intersections of the light ray with the
two coordinate planes. The light ray first traverses the space to the lens of the camera at
distance Z from the emitting point, as illustrated in Figure 17.1. According to the light
transport theory, this causes a shearing of the light field [30]. Next, the light ray changes its direction after it leaves the lens. According to matrix optics, this introduces another shearing of the light field [3]. As the light ray traverses to the image plane at distance F from the
lens plane, one more shearing is produced. Finally, the light field is reparameterized into
the coordinate system used in the camera. Since the shearings and the reparameterization
are all linear transformations, they can be concatenated into a single linear transformation.
Hence, the transformed light field l([x u]T ) can be represented by
$$
l([x\ u]^T) = l_0\big(M[x\ u]^T\big) = l_0\!\left(\begin{bmatrix} \dfrac{Z}{F} & -Z\Delta \\[6pt] \dfrac{1}{F} & \dfrac{1}{f} - \dfrac{1}{F} \end{bmatrix}\begin{bmatrix} x \\ u \end{bmatrix}\right), \qquad (17.1)
$$

where f is the focal length of the lens and ∆ = 1/Z + 1/F − 1/f. This transformation, plus
modulation due to the blocking of the aperture [4], describes various photographic effects
such as focusing [3].
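The transformation matrix M of Equation 17.1 (as reconstructed above) can be assembled directly from Z, F, and f; the function name is hypothetical.

```python
import numpy as np

def transport_matrix(Z, F, f):
    """Concatenated shearing/reparameterization matrix M of Equation 17.1
    (as reconstructed above); delta = 1/Z + 1/F - 1/f."""
    delta = 1.0 / Z + 1.0 / F - 1.0 / f
    return np.array([[Z / F, -Z * delta],
                     [1.0 / F, 1.0 / f - 1.0 / F]])

# A transformed light field sample l([x, u]^T) equals l0 evaluated at M @ [x, u].
```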
In traditional photography, a sensor integrates the radiances along rays from all directions
into an irradiance sample and thus loses all angular information of the light rays. The goal
of this work is to capture the transformed light field l([x u]T ) that contains both the spatial
and the angular information. In other words, the goal is to avoid the integration step in the
traditional photography.
In capturing the light field, although the sampling grids on the lens plane and the sensor
plane are usually fixed, the camera parameters, F and f , can be adjusted to modify the
transformation in Equation 17.1, thus changing the actual sampling grid for the original
light field l0 . For example, it is well-known that in natural Lambertian scenes, while the
FIGURE 17.2
Configurations of the programmable aperture: (a) capturing one single sample at a time, (b) aggregating several
samples at each exposure for quality improvement, and (c) adjusting the prefilter kernel without affecting the
sampling rate.
spatial information is rich (i.e., complex structures, texture, shadows, etc.), the angular information is usually of low frequency. Therefore, setting a high sampling rate along the
angular axis u is wasteful. By properly adjusting F, the spatial information can be moved
to the angular domain to better utilize the sample budget, as shown in Reference [15].
By capturing images with different aperture shapes (Figure 17.2a), a complete light field
can be constructed. However, unlike previous devices that manipulate the light rays after
they enter the camera [1], [3], [4], the proposed method blocks the undesirable light rays
and captures one subset of the data at a time. The spatial resolution of the light field is thus
the same as the sensor resolution. For the method to take effect, a programmable aperture
is needed. Its transmittance has to be spatially variant and controllable.
An intuitive approach to such a programmable aperture is to replace the lens module with
a volumetric light attenuator [19]. However, according to the following frequency analysis
of light transport, it can be found that the lens module should be preserved for efficient
sampling. Let L0 ([ fx fu ]T ) and L([ fx fu ]T ) denote the Fourier transform of l0 ([x u]T ) and
l([x u]T ), respectively. By Equation 17.1 and the Fourier linear transformation theory, L0
and L are related as follows:
$$
L([f_x\ f_u]^T) = |\det(M)|^{-1}\, L_0\big(M^{-T}[f_x\ f_u]^T\big) = \frac{1}{|\det(M)|}\, L_0\!\left(\begin{bmatrix} 1 - \dfrac{F}{f} & 1 \\[6pt] FZ\Delta & Z \end{bmatrix}\begin{bmatrix} f_x \\ f_u \end{bmatrix}\right). \qquad (17.3)
$$
Consider the case where the scene is a Lambertian plane perpendicular to the optical
path at Z = 3010, f = 50, and the camera is focused at Z = 3000 (so F = 50.8475). If the
lens module is removed, f → ∞ and the sampling rate along the fu axis has to be increased
by a factor of 18059 to capture the same signal content. As a result, millions of images
need to be captured for a single dataset, which is practically infeasible. Therefore, the lens
module must be preserved. The light rays are bent inwards at the lens due to refraction
and consequently the spectrum of the transformed light field is compressed. With the lens
module and by carefully selecting the in-focus plane, the spectrum can be properly reshaped
to reduce aliasing. A similar analysis is developed for multi-view displays [32].
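The factor quoted above can be reproduced directly from the definition of ∆; assuming, as Equation 17.3 suggests, that the required sampling rate along fu scales with |∆|, a minimal numerical check in Python is:

```python
# A quick check of the factor of 18059 quoted above; it only evaluates
# Delta = 1/Z + 1/F - 1/f with and without the lens term 1/f.
Z_plane, Z_focus, f = 3010.0, 3000.0, 50.0
F = 1.0 / (1.0 / f - 1.0 / Z_focus)               # sensor distance, ~50.8475
delta_with_lens = 1.0 / Z_plane + 1.0 / F - 1.0 / f
delta_without_lens = 1.0 / Z_plane + 1.0 / F      # f -> infinity removes 1/f
print(round(abs(delta_without_lens / delta_with_lens)))  # -> 18059
```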
The weights wuk ∈ [0, 1] of the light field images can be represented by a vector wu =
[wu0 , wu1 , ... , wu(N−1) ], which is referred to as a multiplexing pattern since wu is physically
realized as a spatially variant mask on the aperture. After N captures with N different mul-
tiplexing patterns, the light field images can be recovered by demultiplexing the captured
images, if the chosen multiplexing patterns form an invertible linear system.
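As an illustration of this demultiplexing step, the following sketch (in Python/NumPy; not the authors' implementation) recovers the light field views with a single linear solve shared by all pixels, given an invertible pattern matrix:

```python
import numpy as np

def demultiplex(captured, patterns):
    """Recover N light field views from N multiplexed captures.

    captured : (N, H, W) array of multiplexed images.
    patterns : (N, N) array; row k holds the aperture weights w_u used for
               capture k. The rows must form an invertible linear system.
    """
    N, H, W = captured.shape
    flat = captured.reshape(N, -1)            # (N, H*W)
    views = np.linalg.solve(patterns, flat)   # one solve for every pixel
    return views.reshape(N, H, W)

# Example with 4 angular samples and an (arbitrary, invertible) binary pattern.
rng = np.random.default_rng(0)
truth = rng.random((4, 8, 8))
patterns = np.array([[1, 1, 0, 1],
                     [1, 0, 1, 0],
                     [0, 1, 1, 1],
                     [1, 1, 1, 0]], dtype=float)
captured = np.tensordot(patterns, truth, axes=1)   # simulate the N exposures
assert np.allclose(demultiplex(captured, patterns), truth)
```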
Intuitively, one should open as many regions as possible, that is, maximize ‖wu‖, to
allow the sensor to gather as much light as possible. In practice, however, noise is always
involved in the acquisition process and complicates the design of the multiplexing patterns.
In the case where the noise is independent and identically-distributed (i.i.d.), Hadamard
code-based patterns are best in terms of the quality of the demultiplexed data [5], [26],
[33]. However, noise in a digital sensor is often correlated with the input signal [34], [35].
For example, the variance of the shot noise grows linearly with the number of incoming
photons. In this case, using the Hadamard code-based patterns actually degrades the data
quality [27]. Another drawback of the Hadamard code-based patterns is that they only exist
for certain sizes.
Instead, multiplexing patterns can be obtained through optimization. Given the noise
characteristics of the device and the true signal value, the mean square error of the demul-
tiplexed signal is proportional to a function E(W) of the multiplexing patterns.
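Although the exact expression for E(W) is not reproduced here, its role can be illustrated with a hedged sketch: under an assumed affine noise model (read noise plus shot noise proportional to the collected light), the mean square error of the demultiplexed samples is the trace of W^{-1} diag(σ²) W^{-T}, so candidate pattern matrices can be compared numerically:

```python
import numpy as np

def demux_mse(W, signal, read_var=1.0, shot_gain=0.01):
    """MSE of the demultiplexed views for a pattern matrix W, assuming
    independent per-capture noise with an affine (signal-dependent) variance.
    This is a comparison sketch, not the chapter's exact E(W)."""
    collected = W @ signal                        # light gathered per capture
    noise_var = read_var + shot_gain * collected  # read + shot noise model
    W_inv = np.linalg.inv(W)
    cov = W_inv @ np.diag(noise_var) @ W_inv.T    # covariance after demultiplexing
    return np.trace(cov) / len(signal)

N = 7
signal = np.full(N, 0.5)                 # half of the saturation level
one_open = np.eye(N)                     # a single open cell per capture
mostly_open = np.ones((N, N)) - np.eye(N)
print(demux_mse(one_open, signal), demux_mse(mostly_open, signal))
```

Depending on the noise parameters, the wide-open patterns may gather more light yet yield a larger error, which is exactly why the patterns are optimized rather than fixed.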
FIGURE 17.4
Prototypes of the programmable aperture cameras: (top) with aperture patterns on an opaque slip of paper and
(bottom) on an electronically controlled liquid crystal array.
17.4.3 Prototypes
Two prototypes of the programmable aperture camera shown in Figure 17.4 were imple-
mented using a regular Nikon D70 digital single-lens reflex (DSLR) camera and a 50mm
f/1.4D lens module. For simplicity, the lens module was dismounted from the Nikon cam-
era to insert the programmable aperture between this module and the camera. Hence the
distance (F in Figure 17.1) between the lens and the sensor is lengthened and the focus
range is shortened as compared to the original camera.
The optimization of the multiplexing patterns requires information of the noise charac-
teristics of the camera and the scene intensity. The former is obtained by calibration and
the latter is assumed to be one half of the saturation level. Both prototypes can capture the
light field with or without multiplexing. The maximal spatial resolution of the light field is
3039 × 2014 and the angular resolution is adjustable.
In the first prototype, the programmable aperture is made up of a pattern scroll, which
is an opaque slip of paper used for film protection. The aperture patterns are manually
cut and scrolled across the optical path. The pattern scroll is long enough to include tens
of multiplexing patterns and the traditional aperture shapes. This quick and dirty method
is simple and performs well except for one minor issue: a blocking cell (wuk = 0) cannot
stay on the pattern scroll if it loses support. This is solved by leaving a gap between cells.
Since the pattern scroll is movable, its position may drift out of place. However, this can
be easily solved by industrial-level manufacturing.
In the second prototype, the programmable aperture is made up of a liquid crystal array
(LCA) controlled by a Holtek HT49R30A-1 micro control unit that supports C language.
Two different resolutions, 5 × 5 and 7 × 7, of the LCA are made. The LCA is easier to pro-
gram and mount than the pattern scroll, and the multiplexing pattern is no longer limited
to binary. However, the light rays can leak from the gaps (used for routing) in between the
liquid crystal cells and from the cells that cannot be completely turned off. To compen-
sate for the leakage, an extra image with all liquid crystal cells turned off is captured and
subtracted from other images.
17.4.4 Summary
The proposed light field acquisition scheme does not require a high resolution sensor.
Therefore, it can even be implemented in web, cell-phone, and surveillance cameras.
The image captured by previous light field cameras must be decoded before visualization.
In contrast, the image captured using the proposed device can be directly displayed. Even
when multiplexing is applied, the in-focus regions remain sharp (Figure 17.3c).
It should be noted that multiplexing cannot be directly applied to the existing methods
like a single moving camera, a plenoptic camera, or a camera array. These methods use a
permanent optical design and cannot dynamically select the light rays for integration.
Another advantage of the proposed device is that the sampling grid and the prefilter
kernel are decoupled. Therefore, the aperture size can be chosen regardless of the sampling
rate (Figure 17.2c). A small prefilter is chosen to preserve the details and remove aliasing
by view interpolation. Also the sampling lattice on the lens plane in the proposed device is
not restricted to rectangular grids. These parameters, including the number of samples, the
sampling grid, and the size of the prefilter kernel, can all be adjusted dynamically.
where {aui } are the polynomial coefficients, cu is the vignetting center, and ‖ · ‖2 is the
Euclidean distance (the coordinates are normalized to (0, 1)). The function fu , called the vi-
gnetting field, is a smooth field across the image. It is large when the distance between x
and cu is small and gradually decreases as the distance increases.
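For concreteness, a radial-polynomial vignetting field of this kind can be evaluated as follows (a sketch only; the exact degree and parameterization of Equation 17.6 are assumptions here):

```python
import numpy as np

def vignetting_field(shape, center, coeffs):
    """Evaluate a radial-polynomial vignetting field f_u over an image grid.

    shape  : (H, W) image size; coordinates are normalized to (0, 1).
    center : (cx, cy) assumed vignetting center c_u in normalized coordinates.
    coeffs : polynomial coefficients {a_ui}, constant term first.
    """
    H, W = shape
    y, x = np.mgrid[0:H, 0:W]
    xn, yn = x / (W - 1), y / (H - 1)
    r = np.hypot(xn - center[0], yn - center[1])   # distance to the center
    f = np.zeros(shape)
    for i, a in enumerate(coeffs):
        f += a * r**i                              # sum of a_ui * r^i
    return f

field = vignetting_field((480, 640), (0.5, 0.5), coeffs=[1.0, -0.1, -0.6])
```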
Since the number of unknown variables in Equation 17.6 is larger than the number of
observations, the estimation problem is inherently ill-posed. A straightforward method
is to capture a uniformly lit object so that the distortion-free image Iu (x) is known a priori.
However, the vignetting field changes when the camera parameters, including the focus,
aperture size, and lens module, are adjusted. It is impractical to perform the calibration
whenever a single parameter is changed.
Existing photometric calibration methods that require no specific reference object gener-
ally use two assumptions to make the problem tractable [38], [40]. First, the scene points
have multiple registered observations with different levels of distortion. This assumption
is usually valid in panoramic imaging, where the panorama is stitched from many images
taken from the same viewpoint. Second, the vignetting center cu is assumed to coincide with
the image center. This assumption is valid in most traditional cameras, where the optics and
the sensors are symmetric along the optical path.
Some recent methods remove the first assumption by exploiting the edge and gradient priors
in natural images [41], but the second assumption is still needed.
However, both assumptions are inappropriate for the light field images for two reasons.
First, the registration of the light field images taken from different view points requires
an accurate per-pixel disparity map that is difficult to obtain from the distorted inputs.
Second, in each light field image, the parameters {aui } and cu of the vignetting function
are image-dependent and coupled. Therefore, simultaneously estimating the parameters
and the clean image is an under-determined nonlinear problem. Another challenge specific
to the proposed camera is that the vignetting function changes with the lens and the aperture
settings (such as the size of the prefilter kernel) and hence is impossible to tabulate.
An algorithm is proposed here to automatically calibrate the photometric distortion of
the light field images. The key idea is that the light field images closer to the center of the
optical path have less distortion. Therefore, it can be assumed that I0d ≈ I0 and then other
Iu ’s can be approximated by properly transforming I0 to estimate the vignetting field. This
way, the problem is greatly simplified. The approach can also be generalized to handle the
distortions of other computational cameras, particularly previous light field cameras.
FIGURE 17.6
The photometric calibration flow.
Figure 17.6 depicts the flowchart of the proposed algorithm, with example images shown
in Figure 17.7. First, the scale-invariant feature transform (SIFT) method, which is largely
immune to local photometric distortions [42], is used to detect the feature points in an
input Iud and find their valid matches in I0 (Figures 17.7a and 17.7b). Next, the Delaunay
triangulation is applied to the matched points in Iud to construct a mesh (Figure 17.7c).
For each triangle A of the mesh, the displacement vectors of its three vertices are used to
determine an affine transform. By affinely warping all triangles, an image Iuw is obtained
from I0 (Figure 17.7d).
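The following sketch illustrates this warping step with OpenCV and SciPy (it assumes an OpenCV build with SIFT available and is an illustration, not the authors' implementation):

```python
import cv2
import numpy as np
from scipy.spatial import Delaunay

def warp_reference_to_view(I0, Iud):
    """Approximate the clean view behind I_u^d by piecewise-affine warping of
    the reference image I0, using SIFT matches and a Delaunay mesh."""
    g0 = cv2.cvtColor(I0, cv2.COLOR_BGR2GRAY) if I0.ndim == 3 else I0
    gu = cv2.cvtColor(Iud, cv2.COLOR_BGR2GRAY) if Iud.ndim == 3 else Iud
    sift = cv2.SIFT_create()
    k0, d0 = sift.detectAndCompute(g0, None)
    ku, du = sift.detectAndCompute(gu, None)
    matches = cv2.BFMatcher().knnMatch(du, d0, k=2)
    good = [m for m, n in matches if m.distance < 0.7 * n.distance]  # ratio test
    pts_u = np.float32([ku[m.queryIdx].pt for m in good])   # features in I_u^d
    pts_0 = np.float32([k0[m.trainIdx].pt for m in good])   # their matches in I_0

    tri = Delaunay(pts_u)                     # mesh over the distorted view
    h, w = Iud.shape[:2]
    Iuw = np.zeros_like(Iud)
    for simplex in tri.simplices:
        A = cv2.getAffineTransform(pts_0[simplex], pts_u[simplex])
        warped = cv2.warpAffine(I0, A, (w, h))            # warp I_0 per triangle
        mask = np.zeros((h, w), np.uint8)
        cv2.fillConvexPoly(mask, np.int32(pts_u[simplex]), 1)
        Iuw[mask.astype(bool)] = warped[mask.astype(bool)]
    return Iuw
```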
The warped image Iuw is close enough to the clean image Iu unless there are triangles in-
cluding objects of different depths or incorrect feature matches. Such erroneous cases can
be effectively detected and removed by measuring the variance of the associated displace-
ment vectors. By dividing the distorted image Iud by the warped image Iuw and excluding
the outliers, an estimated vignetting field is obtained (Figure 17.7e). Comparing this image
with the vignetting field estimated from the image without warping (Figure 17.7f) reveals
that the warping operation effectively finds a rough approximation of the smooth vignetting
field and the outliers around the depth discontinuities are successfully excluded.
After the approximation of the vignetting field is obtained, the parametric vignetting
function (Equation 17.6) is estimated by minimizing an objective function E({aui }, cu ):
E(\{a_{ui}\}, c_u) = \sum_{x} \left( \frac{I_u^d(x)}{I_u^w(x)} - f_u(x) \right)^2.   (17.7)
Since {aui } and cu are coupled, this objective function is nonlinear and can be minimized
iteratively. Given an initial estimate, the vignetting center cu is fixed first, as this makes
Equation 17.7 linear in {aui }, which can be easily solved by least-squares estimation.
Then, {aui } can be fixed and cu updated; this is done by a gradient descent method.
The goal is to find a displacement du such that E({aui }, cu + du ) is minimal.
Specifically, let ri denote the distance between the i-th pixel at (xi , yi ) and cu = (cu,x , cu,y ),
the N-D vector r = [r1 , r2 , ..., rN ]T denote the distances between all points and cu , and
Iu denote the estimated vignetting field, that is, the ratio Iud /Iuw . Since cu is the only
variable, the vignetting function fu (x) can be redefined as a vector function f(cu ) =
[ fu (x1 ), fu (x2 ), ..., fu (xN )]T . Equation 17.7 is then equivalent to the l2 norm of the error
vector ε , that is, ‖ε‖ = ‖Iu − f(cu )‖. The optimal displacement du at iteration t can then be
obtained by solving the corresponding normal equation.
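A minimal sketch of this alternating scheme is given below, using a numerical Jacobian and a Gauss-Newton style step for cu (the chapter's exact normal equation is not reproduced here):

```python
import numpy as np

def fit_vignetting(ratio, xy, degree=3, iters=20):
    """Alternately estimate {a_ui} (linear least squares) and c_u (normal
    equation / Gauss-Newton step) from samples of I_u^d / I_u^w.

    ratio : (N,) observed ratios at the sample positions.
    xy    : (N, 2) normalized pixel coordinates of the samples.
    """
    c = np.array([0.5, 0.5])                            # initial center
    for _ in range(iters):
        r = np.linalg.norm(xy - c, axis=1)
        V = np.vander(r, degree + 1, increasing=True)   # [1, r, r^2, ...]
        a, *_ = np.linalg.lstsq(V, ratio, rcond=None)   # linear in {a_ui}

        def f(center):                                  # model for a given center
            rr = np.linalg.norm(xy - center, axis=1)
            return np.vander(rr, degree + 1, increasing=True) @ a

        eps = 1e-4                                      # numerical Jacobian wrt c_u
        J = np.stack([(f(c + eps * e) - f(c - eps * e)) / (2 * eps)
                      for e in np.eye(2)], axis=1)
        residual = ratio - f(c)
        d, *_ = np.linalg.lstsq(J, residual, rcond=None)  # solves J^T J d = J^T r
        c = c + d
    return a, c
```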
FIGURE 17.8
Overview of the proposed multi-view depth estimation algorithm.
Finally, Iud is divided by fu to recover the clean image Iu , as shown in Figure 17.7h.
One scanline profile is also shown in Figure 17.7i for comparison. It can be seen that the
recovered image has much less distortion.
FIGURE 17.9
(a,d) Two images of a simple scene with different viewpoints, (b,e) the corresponding depth maps, and (c,f) the
occlusion maps. The black region is unoccluded and the white region is occluded.
The multi-view depth estimation problem is similar to the traditional stereo correspon-
dence problem [46]. However, the visibility reasoning is extremely important for multi-
view depth estimation since the occluded views should be excluded from the depth esti-
mation. Previous methods that determine the visibility by hard constraints [47] or greedy
progressive masking [48] can easily be trapped in local minima because they cannot recover
from incorrect occlusion guesses. Inspired by the symmetric stereo matching algorithm [49],
this problem can be alleviated by iteratively optimizing a view-dependent depth map Du
for each image Iu and an occlusion map Ouv for each pair of neighboring images Iu and
Iv . If a scene point projected onto a point x in Iu is occluded in Iv , it does not have a valid
correspondence. When this happens, Ouv (x) is set to one to exclude it from the matching
process (x1 in Figure 17.9). On the other hand, if the estimated correspondence x′ of xu in
Iv is marked as invisible, that is, Ovu (x′ ) = 1, the estimate is unreliable (x2 in Figure 17.9).
The depth and occlusion estimation are now reformulated as a discrete labeling problem.
For each pixel xu , a discrete depth value Du (x) ∈ {0, 1, ..., dmax } and a binary occlusion
value Ouv (x) ∈ {0, 1} need to be determined. More specifically, given a set of light field
images I = {Iu }, the goal is to find a set of depth maps D = {Du } and a set of occlusion
maps O = {Ouv } to minimize the energy functional defined as follows:
E(D, O \mid I) = \sum_{u} \left\{ E_{dd}(D_u \mid O, I) + E_{ds}(D_u \mid O, I) \right\} + \sum_{u} \sum_{v \in N(u)} \left\{ E_{od}(O_{uv} \mid D_u, I) + E_{os}(O_{uv}) \right\},   (17.10)
where Edd and Eds are, respectively, the data term and the smoothness (or regularization)
term of the depth map, and Eod and Eos denote, respectively, the data term and the smooth-
ness term of the occlusion map. The term N (u) denotes the set of eight viewpoints that are
closest to u. The energy minimization is performed iteratively. In each iteration, first the
occlusion maps are fixed and Edd + Eds is minimized by updating the depth maps. Then,
the depth maps are fixed and Eod + Eos is minimized by updating the occlusion maps.
Figure 17.8 shows the overview of the proposed algorithm for minimizing Equation 17.10.
The following describes the definitions of energy terms and the method used to minimize
these terms. Let α , β , γ , ζ , and η denote the weighting coefficients and K and T the
thresholds. These parameters are empirically determined and fixed in the experiments. The
data term Edd is a unary function:
E_{dd}(D_u \mid O, I) = \sum_{x} \sum_{v \in N(u)} \left\{ \bar{O}_{uv}(x) \left( C\big(I_u(x) - I_v(\rho)\big) + \alpha\, O_{vu}(\rho) \right) \right\},   (17.11)
where ρ = x + Duv (x) and Ōuv (x) = 1 − Ouv (x). The term Duv (x) denotes the disparity
corresponding to the depth value Du (x), and C(k) = min(|k|, K) is a
truncated linear function. For each pixel xu , the first term measures the similarity between
the pixel and its correspondence in Iv , and the second term adds a penalty to an invalid
correspondence.
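For a single neighboring view v, the data term of Equation 17.11 can be evaluated as in the following sketch (integer displacements and grayscale images are assumed for brevity):

```python
import numpy as np

def data_term(Iu, Iv, D_uv, O_uv, O_vu, alpha=2.0, K=30.0):
    """E_dd contribution of one neighboring view v (Equation 17.11).

    D_uv : (H, W, 2) per-pixel displacement from view u to view v.
    O_uv, O_vu : (H, W) binary occlusion maps.
    """
    H, W = Iu.shape
    y, x = np.mgrid[0:H, 0:W]
    xv = np.clip(x + D_uv[..., 0].astype(int), 0, W - 1)
    yv = np.clip(y + D_uv[..., 1].astype(int), 0, H - 1)
    C = np.minimum(np.abs(Iu - Iv[yv, xv]), K)     # truncated linear cost
    penalty = alpha * O_vu[yv, xv]                 # invalid-correspondence penalty
    return np.sum((1 - O_uv) * (C + penalty))
```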
The pairwise smoothness term Eds is based on a generalized Potts model. The data term Eod
of the occlusion map is composed of three terms: the first biases a pixel to be non-occluded
if it is similar to its correspondence, the second penalizes the occlusion (O = 1) to prevent the
whole image from being marked as occluded, and the third favors the occlusion when the
prior Wuv is true. Finally, the smoothness term Eos is based on the Potts model.
FIGURE 17.10
Digital refocusing without depth information. The angular resolution is 4 × 4.
A modified cross-bilateral filtering is applied to the depth maps at the end of each iteration to
improve their quality and make the iteration converge faster [54]. When the resolution of
the light field is too large to fit in memory, tile-based belief propagation [55], which
requires much less memory and bandwidth than previous methods, is used.
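A plain cross (joint) bilateral filter of this kind, guided by the color image, can be sketched as follows (this is not the modified filter of Reference [54], only an illustration of the idea):

```python
import numpy as np

def cross_bilateral_depth(depth, guide, radius=5, sigma_s=3.0, sigma_r=10.0):
    """Smooth a depth map with weights taken from a guide (color) image."""
    H, W = depth.shape
    out = np.zeros_like(depth, dtype=float)
    dpad = np.pad(depth.astype(float), radius, mode='edge')
    gpad = np.pad(guide.astype(float), radius, mode='edge')
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))
    for i in range(H):
        for j in range(W):
            dwin = dpad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            gwin = gpad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            rng = np.exp(-(gwin - gpad[i + radius, j + radius])**2
                         / (2 * sigma_r**2))      # range weights from the guide
            w = spatial * rng
            out[i, j] = np.sum(w * dwin) / np.sum(w)
    return out
```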
17.5.3 Summary
The light field images captured using the proposed programmable aperture camera have
several advantages for depth estimation. First, the viewpoints of the light field images
are well aligned with the 2D grid on the aperture, and thus the depth estimation can be
performed without camera calibration. Second, the disparity corresponding to a depth value
can be adjusted by changing the camera parameters without any additional rectification as
required in camera array systems. Finally, unlike depth-from-defocus methods [21], [56],
there is no ambiguity between scene points behind and in front of the in-focus object.
At present, when the scene is out of focus, only the disparity cue between the light field
images is used for depth estimation. It is possible to combine the defocus cue and further
remove the defocus blur, as in Reference [57]. Also, it is possible to iteratively estimate the
vignetting fields and depth maps to obtain better results.
17.6 Results
All data in the experiments are captured indoors. Images shown in Figures 17.10
and 17.11 are captured using the first prototype and the rest are captured using the second
one. The shutter speed of each exposure is set to 10ms for images shown in Figures 17.3,
17.11, and 17.14, and 20ms for the rest. These settings are chosen for the purpose of fair
comparison. For example, it takes 160ms with an aperture setting of f/8 to capture a clean
and large depth of field image for the scene in Figure 17.11; therefore 10ms is chosen for
the proposed device.
All the computations are performed on a Pentium IV 3.2GHz computer with 2GB mem-
ory. Demultiplexing one light field dataset takes three to five seconds. To reduce the compu-
tational cost, the light field images are optionally downsampled to 640 × 426 after demulti-
plexing. The photometric calibration takes 30 seconds per image, and the multi-view depth
estimation takes around 30 minutes. The following demonstrates still images with various
effects generated from the captured light field and the associated depth maps. The video
results are available on the project website http://mpac.ee.ntu.edu.tw/~chiakai/pap.
Figure 17.10 shows a scene containing a transparent object in front of a nearly uniform
background. The geometry of this scene is difficult to estimate. However, since the pro-
posed acquisition method does not impose any restriction on the scene, the light field can
be captured with 4 × 4 angular resolution and faithful refocused images can be generated
through dynamic reparameterization [44].
The dataset shown in Figure 17.11 is used to evaluate the performance of the proposed
postprocessing algorithms. Here a well-known graph cut stereo matching algorithm with-
out occlusion reasoning is implemented for comparison [51]. The photo-consistency as-
sumption is violated in the presence of the photometric distortion, and thus a poor result
is obtained (Figure 17.12a). With the photometric calibration, the graph cut algorithm
generates a good depth map but errors can be observed at the depth discontinuities (Fig-
ure 17.12b). On the contrary, the proposed depth estimation algorithm can successfully
identify these discontinuities and generate a more accurate result (Figure 17.11b).
Both the light field data and the postprocessing algorithms are indispensable for gener-
ating plausible photographic effects. To illustrate this, a single light field image and its
associated depth map are used as the input of the Photoshop Lens Blur tool to generate a
defocused image. The result shown in Figure 17.12c contains many errors, particularly at
the depth discontinuities (Figure 17.12d). In contrast, the results of the proposed algorithm
(Figures 17.11c and 17.11d) are more natural. The boundaries of the defocused objects are
semitransparent and thus the objects behind can be partially seen.
Figure 17.13 shows the results of view interpolation. The raw angular resolution is 3 × 3.
If a simple bilinear interpolation is used, a ghosting effect due to aliasing is observed (Fig-
ure 17.13b). While previous methods use filtering to remove the aliasing [7], [44], a mod-
FIGURE 17.12
(a) Depth map estimated without photometric calibration and occlusion reasoning, (b) depth map estimated
without occlusion reasoning, and (c) defocusing by the Photoshop Lens Blur tool. (d) Close-up of (b) and (c).
(e) Corresponding close-up of Figures 17.11b and 17.11c.
ified projective texture mapping [58] is used here instead. Given a viewpoint, three clos-
est images are warped according to their associated depth maps. The warped images are
then blended; the weight of each image is inversely proportional to the distance between
its viewpoint and the given viewpoint. This method greatly suppresses the ghosting ef-
fect without blurring (Figure 17.13c). Note that unlike the single-image view morphing
method [21], hole-filling is not performed here due to the multi-view nature of the light
field. In most cases, the region occluded in one view is observed in others.
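A much simplified version of this view interpolation, which warps each of the three closest views with a per-pixel disparity derived from its depth map and blends by inverse viewpoint distance instead of the full projective texture mapping of Reference [58], is sketched below:

```python
import numpy as np

def interpolate_view(images, disparities, cam_uv, target_uv):
    """Blend the three closest views after depth-based warping.

    images      : list of (H, W) images.
    disparities : list of (H, W) maps giving pixel shift per unit viewpoint offset.
    cam_uv      : (K, 2) viewpoint positions; target_uv : (2,) desired viewpoint.
    """
    H, W = images[0].shape
    dist = np.linalg.norm(cam_uv - target_uv, axis=1)
    nearest = np.argsort(dist)[:3]                     # three closest views
    weights = 1.0 / (dist[nearest] + 1e-6)
    weights /= weights.sum()                           # inverse-distance blending

    y, x = np.mgrid[0:H, 0:W]
    out = np.zeros((H, W))
    for wgt, k in zip(weights, nearest):
        du, dv = target_uv - cam_uv[k]                 # viewpoint offset
        xs = np.clip((x + disparities[k] * du).astype(int), 0, W - 1)
        ys = np.clip((y + disparities[k] * dv).astype(int), 0, H - 1)
        out += wgt * images[k][ys, xs]                 # warp and accumulate
    return out
```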
FIGURE 17.15
Application of the postprocessing algorithms to the dataset in Reference [4]: (a) original image, (b) estimated
vignetting field, and (c) processed image. Image courtesy of Ashok Veeraraghavan.
Figure 17.14 shows another digital refocusing result. The raw angular resolution is 4 × 4.
Though the in-focus objects are sharp, the out-of-focus objects are subject to the ghosting ef-
fect due to aliasing (Figure 17.14b). With the estimated depth maps, the angular resolution
is first increased to 25 × 25 by view interpolation described above and then digital refocus-
ing is performed. As can be seen in Figure 17.14c, the out-of-focus objects are blurry while
the in-focus objects are unaffected.
Finally, to illustrate the robustness of the proposed algorithms, these algorithms are ap-
plied to the noisy and photometrically distorted data captured by the heterodyned light field
camera [4]. Four clear images are selected from the data to perform photometric calibration
and multi-view depth estimation, and to synthesize the whole light field by view interpolation.
As seen in Figure 17.15, the interpolated image is much cleaner than the original one.
17.7 Discussion
This section discusses the performance and limitations of the proposed camera and the
directions for future research.
TABLE 17.1
Performance comparison between the conventional camera, the plenoptic camera, and the programmable
aperture camera.†
† CCSA – conventional camera with small aperture, CCLA – conventional camera with large aperture, PCAM
– plenoptic camera, PCAMS – plenoptic camera with N²M² sensors, PAC – programmable aperture camera,
PACM – programmable aperture camera with multiplexing, AS – aperture size, SED – single exposure duration,
SNRLFS – SNR of the light field samples, SNRRI – SNR of the refocused image, A×SR – angular × spatial
resolution.
Though the light field data is noisier than a normal picture, it enables better postpro-
cessing abilities. Because the image from the light field can be simply refocused, there is
no longer a need to set the focus and aperture size. In traditional photography, much
more time is usually spent on these settings than on the exposure.
The plenoptic camera is slightly better than the programmable aperture camera at the
same angular and spatial resolutions. Nevertheless, it requires N²M² sensors. To capture a
light field of the same resolution as the dataset shown in Figure 17.11, the plenoptic camera
requires an array of nearly 100 million sensors, which is expensive, if not difficult, to make.
17.8 Conclusion
This chapter described a system for capturing light fields using a programmable aperture
with an optimal multiplexing scheme. Along with the programmable aperture, two post-
processing algorithms for photometric calibration and multi-view depth estimation were
developed. This system is probably the first single-camera system that generates a light field
at the same spatial resolution as the sensor, has adjustable angular resolution, and
is free of photometric distortion. In addition, the programmable aperture is fully backward
compatible with conventional apertures.
While this work focused on the light field acquisition, the programmable aperture camera
can be further exploited for other applications. For example, it can be used to realize a com-
putational camera with a fixed mask. It is believed that by replacing the traditional aperture
with the proposed programmable aperture, a camera will become much more versatile than
before.
References
[1] R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz, and P. Hanrahan, “Light field photogra-
phy with a hand-held plenoptic camera,” CSTR 2005-02, Stanford University, April 2005.
[2] T. Georgiev, K.C. Zheng, B. Curless, D. Salesin, S. Nayar, and C. Intwala, “Spatio-angular
resolution tradeoff in integral photography,” in Proceedings of the 17th Eurographics Work-
shop on Rendering, Nicosia, Cyprus, June 2006, pp. 263–272.
[3] T. Georgiev, C. Intwala, and D. Babacan, “Light-field capture by multiplexing in the frequency
domain,” Technical Report, Adobe Systems Incorporated, 2007.
[4] A. Veeraraghavan, R. Raskar, A. Agrawal, A. Mohan, and J. Tumblin, “Dappled photography:
Mask enhanced cameras for heterodyned light fields and coded aperture refocusing,” ACM
Transactions on Graphics, vol. 26, no. 3, pp. 69:1–69:10, July 2007.
[5] C.K. Liang, G. Liu, and H.H. Chen, “Light field acquisition using programmable aperture
camera,” in Proceedings of the IEEE International Conference on Image Processing, San
Antonio, TX, USA, September 2007, pp. 233–236.
[6] C.K. Liang, T.H. Lin, B.Y. Wong, C. Liu, and H.H. Chen, “Programmable aperture photog-
raphy: Multiplexed light field acquisition,” ACM Transactions on Graphics , vol. 27, no. 3,
pp. 55:1–10, August 2008.
[7] M. Levoy and P. Hanrahan, “Light field rendering,” in Proceedings of the 23rd Annual Confer-
ence on Computer Graphics and Interactive Techniques, New York, NY, USA, August 1996,
pp. 31–42.
[8] S.J. Gortler, R. Grzeszczuk, R. Szeliski, and M.F. Cohen, “The lumigraph,” in Proceedings of
the 23rd Annual Conference on Computer Graphics and Interactive Techniques, New York,
NY, USA, August 1996, pp. 43–54.
[9] J.C. Yang, M. Everett, C. Buehler, and L. McMillan, “A real-time distributed light field cam-
era,” in Proceedings of the 13th Eurographics Workshop on Rendering, Pisa, Italy, June 2002,
pp. 77–85.
[10] B. Wilburn, N. Joshi, V. Vaish, E.V. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz,
and M. Levoy, “High performance imaging using large camera arrays,” ACM Transactions on
Graphics, vol. 24, no. 3, pp. 765–776, July 2005.
[11] M.G. Lippmann, "Épreuves réversibles donnant la sensation du relief," Journal de Physique,
vol. 7, pp. 821–825, 1908.
[12] H.E. Ive, “Parallax panoramagrams made with a large diameter lens,” Journal of the Optical
Society of America, vol. 20, no. 6, pp. 332–342, June 1930.
[13] T. Okoshi, Three-Dimensional Imaging Techniques, New York: Academic Press, 1976.
[14] E.H. Adelson and J.Y.A. Wang, “Single lens stereo with a plenoptic camera,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, vol. 14, no. 2, pp. 99–106, February
1992.
[15] A. Lumsdaine and T. Georgiev, “The focused plenoptic camera,” in Proceedings of the First
IEEE International Conference on Computational Photography, San Francisco, CA, USA,
April 2009.
[16] R. Raskar, A. Agrawal, and J. Tumblin, “Coded exposure photography: Motion deblurring
using fluttered shutter,” ACM Transactions on Graphics, vol. 25, no. 3, pp. 795–804, July
2006.
[17] S.K. Nayar and V. Branzoi, “Adaptive dynamic range imaging: Optical control of pixel ex-
posures over space and time,” in Proceedings of the 9th IEEE International Conference on
Computer Vision, Nice, France, October 2003, pp. 1168–1175.
[18] Y.Y. Schechner and S.K. Nayar, “Uncontrolled modulation imaging,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, June
2004, pp. 197–204.
[19] A. Zomet and S.K. Nayar, “Lensless imaging with a controllable aperture,” Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, New York, USA, June 2006,
pp. 339–346.
[20] H. Farid and E.P. Simoncelli, “Range estimation by optical differentiation,” Journal of the
Optical Society of America A, vol. 15, no. 7, pp. 1777–1786, July 1998.
[21] A. Levin, R. Fergus, F. Durand, and W.T. Freeman, “Image and depth from a conventional
camera with a coded aperture,” ACM Transactions on Graphics, vol. 26, no. 3, p. 70, July
2007.
[22] Y. Bando, B.Y. Chen, and T. Nishita, “Extracting depth and matte using a color-filtered aper-
ture,” ACM Transactions on Graphics, vol. 27, no. 5, pp. 134:1–9, December 2008.
[23] R. Raskar, K.H. Tan, R. Feris, J. Yu, and M. Turk, “Non-photorealistic camera: Depth edge
detection and stylized rendering using multi-flash imaging,” ACM Transactions on Graphics,
vol. 23, no. 3, pp. 679–688, August 2004.
[24] N. Joshi, W. Matusik, and S. Avidan, “Natural video matting using camera arrays,” ACM
Transactions on Graphics, vol. 25, no. 3, pp. 779–786, July 2006.
[25] C. Senkichi, M. Toshio, H. Toshinori, M. Yuichi, and K. Hidetoshi, “Device and method
for correcting camera-shake and device for detecting camera shake,” JP Patent 2003-138436,
2003.
[26] Y.Y. Schechner, S.K. Nayar, and P.N. Belhumeur, “A theory of multiplexed illumination,” in
Proceedings of the 9th IEEE International Conference on Computer Vision, Nice, France,
October 2003, pp. 808–815.
[27] A. Wenger, A. Gardner, C. Tchou, J. Unger, T. Hawkins, and P. Debevec, “Performance re-
lighting and reflectance transformation with time-multiplexed illumination,” ACM Transac-
tions on Graphics, vol. 24, no. 3, pp. 756–764, July 2005.
[28] N. Ratner, Y.Y. Schechner, and F. Goldberg, “Optimal multiplexed sensing: Bounds, condi-
tions and a graph theory link,” Optics Express, vol. 15, no. 25, pp. 17072–17092, December
2007.
[29] A. Levin, S.W. Hasinoff, P. Green, F. Durand, and W.T. Freeman, “4D frequency analysis of
computational cameras for depth of field extension,” ACM Transactions on Graphics, vol. 28,
no. 3, pp. 97:1–14, July 2009.
[30] F. Durand, N. Holzschuch, C. Soler, E. Chan, and F.X. Sillion, “A frequency analysis of light
transport,” ACM Transactions on Graphics, vol. 24, no. 3, pp. 1115–1126, July 2005.
[31] R. Ng, “Fourier slice photography,” ACM Transactions on Graphics, vol. 24, no. 3, pp. 735–
744, July 2005.
[32] M. Zwicker, W. Matusik, F. Durand, and H. Pfister, “Antialiasing for automultiscopic 3D dis-
plays,” in Proceedings of the 17th Eurographics Symposium on Rendering, Nicosia, Cyprus,
June 2006, pp. 73–82.
[33] M. Harwit and N.J. Sloane, Hadamard Transform Optics. New York: Academic Press, July
1979.
[34] HP components group, “Noise sources in CMOS image sensors,” Technical Report, Hewlett-
Packard Company, 1998.
[35] Y. Tsin, V. Ramesh, and T. Kanade, “Statistical calibration of CCD imaging process,” in
Proceedings of the 8th IEEE International Conference on Computer Vision, Vancouver, BC,
Canada, July 2001, pp. 480–487.
[36] N. Ratner and Y.Y. Schechner, “Illumination multiplexing within fundamental limits,” in Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis,
MN, USA, June 2007, pp. 1–8.
[37] B.K. Horn, Robot Vision. MIT Press, March 1986.
[38] D.B. Goldman and J.H. Chen, “Vignette and exposure calibration and compensation,” in Pro-
ceedings of the 10th IEEE International Conference on Computer Vision, Beijing, China,
October 2005, pp. 899–906.
[39] M. Aggarwal, H. Hua, and N. Ahuja, “On cosine-fourth and vignetting effects in real lenses,”
in Proceedings of the 8th IEEE International Conference on Computer Vision, Vancouver, BC,
Canada, July 2001, pp. 472–479.
[40] A. Litvinov and Y. Y. Schechner, “Addressing radiometric nonidealities: A unified frame-
work,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
San Diego, CA, USA, June 2005, pp. 52–59.
[41] Y. Zheng, J. Yu, S.B. Kang, S. Lin, and C. Kambhamettu, “Single-image vignetting correction
using radial gradient symmetry,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, Anchorage, AK, USA, June 2008, pp. 1–8.
[42] D.G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal
of Computer Vision, vol. 60, no. 2, pp. 91–110, November 2004.
[43] J.X. Chai, X. Tong, S.C. Chan, and H.Y. Shum, “Plenoptic sampling,” in Proceedings of the
27th Annual Conference on Computer Graphics and Interactive Techniques, New York, NY,
USA, July 2000, pp. 307–318.
[44] A. Isaksen, L. McMillan, and S.J. Gortler, “Dynamically reparameterized light fields,” in Pro-
ceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques,
New York, NY, USA, July 2000, pp. 297–306.
[45] J. Stewart, J. Yu, S.J. Gortler, and L. McMillan, “A new reconstruction filter for undersam-
pled light fields,” in Proceedings of the 14th Eurographics Workshop on Rendering, Leuven,
Belgium, June 2003, pp. 150–156.
[46] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo corre-
spondence algorithms,” International Journal of Computer Vision, vol. 47, no. 1-3, pp. 7–42,
April 2002.
[47] V. Kolmogorov and R. Zabih, “Multi-camera scene reconstruction via graph cuts,” Proceed-
ings of the European Conference on Computer Vision, Copenhagen, Denmark, May 2002,
pp. 82–96.
[48] S.B. Kang and R. Szeliski, “Extracting view-dependent depth maps from a collection of im-
ages,” International Journal of Computer Vision, vol. 58, no. 2, pp. 139–163, July 2004.
[49] J. Sun, Y. Li, S.B. Kang, and H.Y. Shum, “Symmetric stereo matching for occlusion handling,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San
Diego, CA, USA, June 2005, pp. 399–406.
[50] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and
C. Rother, “A comparative study of energy minimization methods for markov random fields,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 6, pp. 1068–
1080, June 2008.
[51] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222–
1239, November 2001.
[52] M. Tappen and W.T. Freeman, “Comparison of graph cuts with belief propagation for stereo,
using identical MRF parameters,” in Proceedings of the 9th IEEE International Conference
on Computer Vision, Nice, France, October 2003, pp. 900–907.
[53] V. Kolmogorov, “Convergent tree-reweighted message passing for energy minimization,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1568–
1583, October 2006.
[54] Q. Yang, R. Yang, J. Davis, and D. Nister, “Spatial-depth super resolution for range images,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Min-
neapolis, MN, USA, June 2007, pp. 1–8.
[55] C.K. Liang, C.C. Cheng, Y.C. Lai, L.G. Chen, and H.H. Chen, “Hardware-efficient belief
propagation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, Miami, FL, USA, June 2009, pp. 80–87.
[56] P. Green, W. Sun, W. Matusik, and F. Durand, “Multi-aperture photography,” ACM Transac-
tions on Graphics, vol. 26, no. 3, pp. 68:1–68:10, July 2007.
[57] T. Bishop, S. Zanetti, and P. Favaro, “Light field superresolution,” in Proceedings of the 1st
IEEE International Conference on Computational Photography, San Francisco, CA, USA,
April 2009.
[58] P.E. Debevec, C.J. Taylor, and J. Malik, “Modeling and rendering architecture from pho-
tographs: A hybrid geometry- and image-based approach,” in Proceedings of the 23rd Annual
Conference on Computer Graphics and Interactive Techniques, New York, NY, USA, August
1996, pp. 11–20.
[59] D. Donoho, “Compressed Sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4,
pp. 1289–1306, April 2006.
[60] Y.H. Kao, C.K. Liang, L.W. Chang, and H.H. Chen, “Depth detection of light field,” in Pro-
ceedings of the International Conference on Acoustics, Speech, and Signal Processing, Hon-
olulu, HI, USA, April 2007, pp. 893–897.
18
Dynamic View Synthesis with an Array of Cameras
18.1 Introduction
Being able to look around a scene freely has long been an active research topic in the
computer graphics community. Historically, computer graphics research has been focused
on rendering. That is, given a three-dimensional (3D) model, how to generate new images
faster, better, and more realistically. View synthesis addresses a typically more challenging
problem: it aims to generate new images using only a set of two-dimensional (2D)
images, instead of 3D models.
There are many different ways to categorize and introduce the vast variety of existing
view synthesis methods. For example, these methods can be categorized based on the type
of scenes they can handle, the number of input images required, or the level of automation.
In this chapter, existing methods are categorized using their internal representations of the
scene. Based on this criterion, there is a continuum of view synthesis methods shown in
Figure 18.1. They vary in their reliance on image samples versus geometric primitives.
Approaches on the left side of the continuum are categorized as geometry-based. Given
a set of input images, a 3D model is extracted, either manually or algorithmically, and can
then be rendered from novel viewing angles using computer graphics rendering techniques.
In this category, the primary challenge is in the creation of the 3D model. Automatic
extraction of 3D models from images has been one of the central research topics in the
field of computer vision for decades. Although many algorithms and techniques exist, such
as the extensively studied stereo vision techniques, they are relatively fragile and prone to
error in practice. For instance, most 3D reconstruction algorithms assume a Lambertian
(diffuse) scene, which is only a rough approximation of real-world surfaces.
By contrast, approaches on the right side of the continuum are categorized as image-
based modeling and rendering (IBMR) — a popular alternative for view synthesis in re-
cent years. The basic idea is to synthesize new images directly from input images, partly
or completely bypassing the intermediate 3D model. In other words, IBMR methods typ-
ically represent the scene as a collection of images, optionally augmented with additional
information for view synthesis. Light field rendering (LFR) [1], [2] represents one extreme
of such techniques; it uses many images (hundreds or even thousands) to construct a light
field function that completely characterizes the flow of light through unobstructed space
in a scene. Synthesizing different views becomes a simple lookup of the light field func-
tion. This method works for any scene and any surface; the synthesized images are usually
so realistic that they are barely distinguishable from real photos. But the success of this
method ultimately depends on having a very high sampling rate, and the process of captur-
ing, storing, and retrieving many samples from a real environment can be difficult or even
impossible.
In the middle of the continuum are some hybrid methods that represent the scene as
a combination of image samples and geometric information. Typically these methods
require a few input images as well as some additional information about the scene, usually
in the form of approximate geometric knowledge or correspondence information. By using
this information to set constraints, the input images can be correctly warped to generate
novel views. To avoid the difficult shape recovery problem, successful techniques usually
FIGURE 18.1
The continuum for view synthesis methods. Relative positions of some well-known methods are indicated. The
two subfigures show two main subgroups for view synthesis. Note that the boundary between these two groups
is blurry.
require a human operator to be involved in the process and use a priori domain knowledge
to constrain the problem. Because of the required user interaction, these techniques are
typically categorized under the IBMR paradigm. For example, in the successful Façade
system [3] designed to model and render architecture from photographs, an operator first
manually places simple 3D primitives in rough positions and specifies the corresponding
features in the input images. The system automatically optimizes the location and shape of
the 3D primitives, taking advantage of the regularity and symmetry in architecture. Once
a 3D model is generated, new views can be synthesized using traditional computer graphics
rendering techniques.
This chapter briefly reviews various image-based modeling methods, in particular light
field-style rendering techniques. Then, an extension of LFR for dynamic scenes, for which
the notion of space-time LFR is introduced, is discussed in detail. Finally, reconfigurable
LFR is introduced, in which not only the scene content, but also the camera configurations,
can be dynamic.
This seven-dimensional function can be used to develop a taxonomy for evaluating mod-
els of low-level vision [15]. By introducing the plenoptic function to the computer graphics
FIGURE 18.2
The plenoptic function describes all of the image information visible from any given viewing position.
community, all image-based modeling and rendering approaches can be cast as attempts to
reconstruct the plenoptic function from a sample set of that function [4].
From a computer graphics standpoint, one can consider a plenoptic function as a scene
representation that describes the flow of light in all directions from any point at any time.
In order to generate a view from a given point in a particular direction, one would merely
need to plug in appropriate values for (Vx ,Vy ,Vz ) and select from a range of (θ , φ , λ ) for
some constant t.
This plenoptic function framework provides many venues for exploration, such as the
representation, optimal sampling, and reconstruction of the plenoptic function. The fol-
lowing sections discuss several popular parameterizations of the plenoptic function under
varying degrees of simplification.
1 References [4], [5] could be included in the second category, in which geometry information is used. However,
they are presented here, since the plenoptic function theory presented in these references paved the way for
subsequent light field rendering techniques.
FIGURE 18.3
Light field rendering: the light slab representation and its construction.
occluders2 ) [1]. This is based on the observation that the radiance along a line does not
change unless blocked. In Reference [1], the 4D plenoptic function is called the light field
function, which may be interpreted as a function on the space of oriented light rays. Such
reduction in dimensions has been used to simplify the representation of radiance emitted
by luminaries [18], [19].
Reference [1] parameterizes light rays based on their intersections with two planes (see
Figure 18.3). The coordinate system is (u, v) on the first plane, and (s,t) on the second
plane. An oriented light ray is defined by connecting a point on the uv plane to a point on
the st plane; this representation is called a light slab. Intuitively, a light slab represents the
beam of light entering one quadrilateral and exiting another quadrilateral. To construct a
light slab, one can simply take a 2D array of images. Each image can be considered a slice
of the light slab with a fixed (u, v) coordinate and a range of (s,t) coordinates.
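A light slab stored as a 4D array can be sampled with a straightforward quadrilinear lookup, as in the following sketch (continuous coordinates in grid units, interior samples only):

```python
import numpy as np

def sample_light_slab(L, u, v, s, t):
    """Quadrilinear interpolation of the light slab L[u, v, s, t]."""
    u0, v0, s0, t0 = (int(np.floor(c)) for c in (u, v, s, t))
    wu, wv, ws, wt = u - u0, v - v0, s - s0, t - t0
    acc = 0.0
    for du in (0, 1):
        for dv in (0, 1):
            for ds in (0, 1):
                for dt in (0, 1):
                    w = ((wu if du else 1 - wu) * (wv if dv else 1 - wv) *
                         (ws if ds else 1 - ws) * (wt if dt else 1 - wt))
                    acc += w * L[u0 + du, v0 + dv, s0 + ds, t0 + dt]
    return acc
```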
Generating a new image from a light field is quite different from previous view interpo-
lation approaches. First, the new image is generally formed from many different pieces of
the original input images, and does not need to look like any of them. Second, no model
information, such as depth values or image correspondences, is needed to extract the image
values. The second property is particularly attractive since automatically extracting depth
information from images is a very challenging task. However, many image samples are
required to completely reconstruct the 4D light field functions. For example, to completely
capture a small Buddha as shown in Figure 18.3, hundreds or even thousands of images are
required. Obtaining so many image samples from a real scene may be difficult or even im-
possible. Reference [2] augments the two-plane light slab representation with a rough 3D
geometric model that allows better quality reconstructions using fewer images. However,
recovering even a rough geometric model raises the difficult 3D reconstruction problem.
2 Such reduction can be used to represent scenes and objects as long as there is no occluder between the desired
viewpoint and the scene. In other words, the effective viewing volume must be outside the convex hull of the
scene. A 4D representation thus cannot be used, for example, in architectural walkthroughs to explore from
one room to another.
Reference [8] presents a more flexible parameterization of the light field function. In
essence, one of the two planes of the light slab is allowed to move. Because of this ad-
ditional degree of freedom, it is possible to simulate a number of dynamic photographic
effects, such as depth of field and apparent focus [8]. Furthermore, this reparameterization
technique makes it possible to create integral photography-based [20], auto-stereoscopic
displays for direct viewing of light fields.
Besides the popular two-plane parameterization of the plenoptic function, there is the
spherical representation introduced in Reference [6]. The object-space algorithm presented
in the same work can easily be embedded into the traditional polygonal rendering system
and accelerated by 3D graphics boards.
Light field rendering is the first image-based rendering method that does not require any
geometric information about the scene. However, this advantage is acquired at the cost
of many image samples. The determination of the minimum number of samples required
for light field rendering involves complex relationships among various factors, such as
the depth and texture variation of the scene, the input image resolutions, and the desired
rendering resolutions. Details can be found in Reference [21].
A view from a fixed location in which only the viewing direction (θ , φ ) varies can be reconstructed from a single environment map. While the orig-
inal use of environment maps is to efficiently approximate reflections of the environment
on a surface [22], [23], environment maps can be used to quickly display any outward-
looking view of the environment from a fixed location but at a variable orientation — this
is the basis of the Apple QuickTimeVR system [24]. In this system, environment maps are
created at key locations in the scene. A user is able to navigate discretely from one location
to another and, while at each location, continuously change the viewing direction.
While it is relatively easy to generate computer-generated environmental maps [23], it is
more difficult to capture panoramic images from real scenes. A number of techniques have
been developed. Some use special hardware [25], [26], [27], such as panoramic cameras
or cameras with parabolic mirrors; others use regular cameras to capture many images that
cover the whole viewing space, then stitch them into a complete panoramic image [28],
[29], [30], [31], [32].
FIGURE 18.4
The structure of the proposed space-time light field rendering algorithm. © 2007 IEEE
FIGURE 18.5
The registration process of the two-camera case. © 2007 IEEE
where F is the fundamental matrix encoding the epipolar geometry between the two im-
ages [43]. In fact, F~pi can also be considered as an epipolar line in the second image;
Equation 18.2 thus means that ~p j must lie on the epipolar line F~pi , and vice versa.
The epipolar constraint is incorporated to verify the temporal flow from Ii,t to Ii,t+∆ti ,
with the help of the spatial flow from Ii,t to I j,t′ . Let ~pi,t and ~pi,t+∆ti be the projections
of a moving 3D point on camera Ci at times t and t + ∆ti , respectively; the projections
form a trajectory connecting ~pi,t and ~pi,t+∆ti . Given the spatial correspondence ~p j,t′ of
~pi,t from camera C j at time t′ , the epipolar constraint is described as follows:
\vec{p}_{j,t'}^{\,T}\, F_{ij}\, \vec{p}_{i,t'} = 0,   (18.3)
where Fij is the fundamental matrix between cameras Ci and C j . Since ~pi,t′ is not actually
observed by camera Ci at time t′ , it can be estimated by assuming that the 3D point moves
locally linearly in the image space, as follows:
\vec{p}_{i,t'} = \frac{t + \Delta t_i - t'}{\Delta t_i}\, \vec{p}_{i,t} + \frac{t' - t}{\Delta t_i}\, \vec{p}_{i,t+\Delta t_i}.   (18.4)
FIGURE 18.6
Feature points and epipolar line constraints: (a) ~pi,t on image Ii,t , (b) ~pi,t+∆ti on image Ii,t+∆ti , and (c) ~p j,t′
and ~pi,t′ 's epipolar lines on image I j,t′ . © 2007 IEEE
by the epipolar constraint using the spatial correspondence ~p j,t′ . If both the spatial and tem-
poral correspondences are correct, the epipolar constraint should be closely satisfied. This
is the criterion used to validate the spatial-temporal flow computation. Figure 18.6 shows
that a correct correspondence satisfies the epipolar constraint, while a wrong temporal cor-
respondence causes an error in ~pi,t′ , which leads to a wrong epipolar line that ~p j,t′ fails to
meet.
The fundamental matrix is directly computed from the cameras' world positions and projec-
tion matrices. Due to various error sources, such as camera noise and inaccuracy in camera
calibration or feature localization, a band of certainty is defined along the epipolar line. For
each feature point ~pi,t , if the distance from ~p j,t′ to the epipolar line of ~pi,t′ is greater than a cer-
tain tolerance threshold, either the temporal or the spatial flow is considered to be wrong.
This feature will then be discarded. In the experiment, three pixels are used as the distance
threshold.
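The verification step can be summarized in a few lines (a sketch using the quantities defined above; homogeneous pixel coordinates are assumed):

```python
import numpy as np

def verify_temporal_flow(p_it, p_it_next, p_jt, F_ij, t, dt_i, t_prime, tol=3.0):
    """Accept a temporal correspondence only if the interpolated point
    (Equation 18.4) satisfies the epipolar constraint (Equation 18.3)
    within a tolerance band of tol pixels."""
    w = (t + dt_i - t_prime) / dt_i
    p_i_prime = w * np.asarray(p_it) + (1 - w) * np.asarray(p_it_next)
    line = F_ij @ np.append(p_i_prime, 1.0)          # epipolar line in image j
    x_j = np.append(p_jt, 1.0)
    dist = abs(x_j @ line) / np.hypot(line[0], line[1])
    return dist <= tol
```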
It should be noted that the proposed correction scheme for unsynchronized input is only
reasonable when the motion is roughly linear in the projective space. Many real world
movements, such as rotation, do not satisfy this requirement. Fortunately, when cameras
have a sufficiently high frame rate with respect to the 3D motion, such a local temporal lin-
earization is generally acceptable [44]. The experimental results also support this
assumption. In Figure 18.6, correct feature correspondences satisfy the epipolar constraint
well even though the magazine was rotating fast. This figure also demonstrates the amount
of pixel offset that casual motion can introduce: in two successive frames captured at 30 fps,
many feature points moved more than 20 pixels — a substantial amount that cannot be
ignored in view synthesis.
where d is a constant that balances the influence of the camera spatial closeness and the
capture time difference. If the multi-camera system is constructed regularly as a camera
array, the closeness Close(i, j) can be simply evaluated according to the array indices. A
single best camera is chosen along the row and the column respectively to provide both
horizontal and vertical epipolar constraints. If all cameras have an identical frame rate, the
same camera will always be selected using Equation 18.5.
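One plausible reading of such a selection rule is sketched below; the score combining spatial closeness and capture-time difference, balanced by d, is an assumption, since the exact form of Equation 18.5 is not reproduced here:

```python
import numpy as np

def select_reference_camera(i, positions, capture_times, t, d=0.5):
    """Pick the camera used for the epipolar test of camera i's temporal flow.

    positions     : (K, 2) camera coordinates.
    capture_times : (K,) time stamps of the frames closest to time t.
    """
    scores = np.full(len(positions), np.inf)
    for j in range(len(positions)):
        if j == i:
            continue                                   # never pick camera i itself
        close = np.linalg.norm(positions[j] - positions[i])
        scores[j] = close + d * abs(capture_times[j] - t)
    return int(np.argmin(scores))
```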
FIGURE 18.7
Real feature edges: (left) the gradient magnitude map and (right) feature edges detected by testing feature
points pairwise. © 2007 IEEE
f(e) = \min_{s_k \in e} \left( |\nabla I(s_k)| + \beta\, \frac{\nabla I(s_k)}{|\nabla I(s_k)|} \cdot N(e) \right).   (18.7)
The first term in Equation 18.7 gives the gradient magnitude and the second term indi-
cates how well the gradient orientation matches the edge normal direction N(e). The
parameter β is used to balance the influence between two terms. The fitness values are
calculated for both et and et+∆t . If both values are greater than a threshold, it means that
both et and et+∆t are in strong gradient regions and the gradient direction matches the edge
normal. Therefore, et and et+∆t may represent a real surface or texture boundary in the
scene. Figure 18.7 shows a gradient magnitude map and detected feature edges.
The second factor is the edge length. It can be assumed that the edge length is nearly
a constant from frame to frame, provided that the object distortion is relatively slow com-
pared with the frame rate. This assumption is used to discard edge correspondences if
|et+∆t | changes too much from |et |. Most changes in edge length are caused by wrong tem-
poral correspondences. Nevertheless, the edge length change may be caused by reasonable
point correspondences as well. For instance, in Figure 18.8, the dashed arrows show the
intersection boundaries between the magazine and the wall. They are correctly tracked,
however, since they do not represent real 3D points, they do not follow the magazine mo-
tion at all. Edges ending at those points can be detected from length changes and
removed to avoid distortions during temporal interpolation. Similarly, segments connect-
FIGURE 18.8
The optical flow: (left) before applying the epipolar constraint and (right) after applying the epipolar constraint.
Correspondence errors on the top of the magazine are detected and removed. © 2007 IEEE
FIGURE 18.9
The edge map constructed after constrained Delaunay triangulation. © 2007 IEEE
ing one static point (that is, a point that does not move in two temporal frames) with one
dynamic point should not be considered.
FIGURE 18.10
Interpolation results: (left) without virtual feature edges and (right) with virtual feature edges. © 2007 IEEE
18.3.3.3 Morphing
The image morphing method [45] is extended to interpolate two frames temporally. Real
edges are allowed to have larger influence weights than virtual edges since real edges are
believed to have physical meaning in the real world. In Reference [45], the edge weight is
calculated using the edge length L and the point-edge distance D as follows:
weight_0 = \left( \frac{L^{\rho}}{a + D} \right)^{b},   (18.8)
where a, b, and ρ are constants that control the line influence.
The weights for both real and virtual feature edges are calculated here using the formula
above. The weight for a real edge is then further boosted as follows:
weight = weight_0 \cdot \left( 1 + \frac{f(e) - f_{\min}(e)}{f_{\max}(e) - f_{\min}(e)} \right)^{\tau},   (18.9)
where f (e) is the edge samples’ average fitness value from Equation 18.7. The terms
fmin (e) and fmax (e) denote the minimum and maximum fitness value among all real feature
edges, respectively. The parameter τ is used to scale the boosting effect exponentially.
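Putting Equations 18.8 and 18.9 together, the influence weight of an edge can be computed as in this short sketch (the constant values are placeholders):

```python
def edge_weight(length, dist, fitness, f_min, f_max,
                a=1.0, b=2.0, rho=0.5, tau=1.0, is_real=True):
    """Morphing influence weight of a feature edge; virtual edges keep the
    base weight of Equation 18.8, real edges are boosted by Equation 18.9."""
    base = (length**rho / (a + dist))**b                  # Equation 18.8
    if not is_real:
        return base
    boost = 1.0 + (fitness - f_min) / (f_max - f_min)     # normalized fitness
    return base * boost**tau                              # Equation 18.9
```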
Two frames are temporally interpolated using both the forward temporal flow from Ii,t
to Ii,t+∆ti and the backward temporal flow from Ii,t+∆ti to Ii,t . The final pixel intensity is
calculated by linear interpolation as follows:
P_{t'} = \frac{t + \Delta t_i - t'}{\Delta t_i}\, P_{forward} + \frac{t' - t}{\Delta t_i}\, P_{backward},   (18.10)
where Pforward is the pixel color calculated only from frame Ii,t and Pbackward only from
frame Ii,t+∆ti . This reflects more confidence in features associated with Ii,t in the forward
flow and features associated with Ii,t+∆ti in the backward flow. These features are selected
directly on the images.
Figure 18.10 shows the results without and with virtual edges. Figure 18.11 shows the
results using different interpolation schemes. Since some features are missing on the top of
the magazine, the interpolation quality improves when real feature edges get extra weights
according to Equation 18.9. Even without additional weights, image morphing generates
more visually pleasing images in the presence of bad feature correspondences.
FIGURE 18.11
Interpolated results using different schemes: (a) image morphing without epipolar test, (b) direct image tri-
angulation with epipolar test, (c) image morphing with epipolar test, but without extra edge weights, and (d)
image morphing with both epipolar test and extra edge weights. © 2007 IEEE
FIGURE 18.12
Reconstructed 3D mesh proxy: (left) oblique view and (right) front view. © 2007 IEEE
FIGURE 18.13
The space-time light field rendering pipeline. © 2007 IEEE
The first option is a single plane whose depth can be adjusted so that the main part of the scene
is in focus. The second option is a reconstructed proxy using the spatial correspondences.
Every pair of spatial correspondences defines a 3D point. All the 3D points, plus the four
corners of a background plane, are triangulated to create a 3D mesh. Figure 18.12 shows
an example mesh.
Whether to use a plane or a mesh proxy in rendering depends mainly on the camera
configuration and scene depth variation. The planar proxy works well for scenes with
small depth variations. On the other hand, using a reconstructed mesh proxy improves both
the depth of field and the 3D effect for oblique-angle viewing.
18.3.5 Results
This section presents some results from the space-time light field rendering framework.
Figure 18.13 shows a graphical representation of the entire processing pipeline. To facilitate
data capturing, two multi-camera systems were developed. The first system uses eight Point
Grey Dragonfly cameras [49]. This system can provide a global time stamp for each frame
in hardware. The second system includes eight Sony color digital FireWire cameras, but
the global time stamp is not available. In both systems, the cameras are approximately 60
mm apart, limited by their form factor. Based on the analysis from Reference [21], the
FIGURE 18.14
Synthesized results using a uniform blending weight, that is, pixels from all frames are averaged together: (left) traditional LFR with unsynchronized frames and (right) space-time LFR. © 2007 IEEE
effective depth of field is about 400 mm. The cameras are calibrated, and all images are
rectified to remove lens distortions.
Since the cameras are arranged in a horizontal linear array, only one camera is selected for the
epipolar constraint, according to the discussion in Section 18.3.2. The closeness measure
is simply the distance between camera positions. The results are demonstrated using data tagged with a
global time stamp, that is, the temporal offsets are known. The final images are synthesized
using either a planar proxy or a reconstructed proxy from spatial correspondences.
FIGURE 18.15
Synthesized results of the book image: (top) traditional LFR with unsynchronized frames and (bottom) space-time LFR. © 2007 IEEE
FIGURE 18.16
Synthesized results of the book sequence: (left) plane proxy adopted and (right) 3D reconstructed mesh proxy adopted. © 2007 IEEE
FIGURE 18.17
Synthesized results of a human face from two different angles: (top) plane proxy adopted and (bottom) 3D mesh proxy adopted. © 2007 IEEE
FIGURE 18.18
A self-reconfigurable camera array system with 48 cameras. © 2007 IEEE
FIGURE 18.19
The mobile camera unit. © 2007 IEEE
cameras are 150 mm apart. They can capture images of up to 640 × 480 pixels at a maximum of
30 fps. The cameras have built-in HTTP servers, which respond to HTTP requests and
send out motion JPEG sequences. The JPEG image quality is controllable. The cameras
are connected to a central computer through 100 Mbps Ethernet cables.
The cameras are mounted on a mobile platform, as shown in Figure 18.19. Each camera
is attached to a standard pan servo capable of rotating about 90 degrees. The cameras are
mounted on a platform which is equipped with another sidestep servo. The sidestep servo
is hacked so that it can rotate continuously. A gear wheel is attached to the sidestep
servo, which allows the platform to move horizontally with respect to the linear guide. A
gear rack is added to avoid slipping during the motion. The two servos on each camera
FIGURE 18.20
The multi-resolution 2D mesh with depth information on its vertices.
unit allow the camera to have two degrees of freedom: pan and sidestep. However, the 12
cameras at the leftmost and rightmost columns have fixed positions and can only pan.
The servos are controlled by the Mini SSC II servo controller [50]. Each controller is
in charge of no more than eight servos (either standard servos or hacked ones). Multiple
controllers can be chained, thus up to 255 servos can be controlled simultaneously through
a single serial connection to a computer. The current system uses altogether 11 Mini SSC
II controllers to control 84 servos (48 pan servos, 36 sidestep servos).
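For illustration only, the sketch below drives one servo through such a controller using the pyserial package and the common three-byte Mini SSC command format (sync byte 255, servo address, position, each in the range 0 to 254); the port name, baud rate, and function name are assumptions rather than details of the system described here.

import serial  # pyserial

def set_servo(port, servo, position):
    """Send one Mini SSC style command: sync byte 255, servo address, position."""
    if not (0 <= servo <= 254 and 0 <= position <= 254):
        raise ValueError("servo and position must be in the range 0-254")
    with serial.Serial(port, baudrate=9600, timeout=1) as ser:
        ser.write(bytes([255, servo, position]))

# Example (assumed port name): pan servo 3 to its mid position.
# set_servo("/dev/ttyUSB0", 3, 127)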
The system is controlled by a single computer with an Intel Xeon 2.4 GHz dual pro-
cessor, 1 GB of memory and a 32 MB NVIDIA Quadro2 EX graphics card. As will be
detailed later, the proposed rendering algorithm is so efficient that the region of interest
(ROI) identification, JPEG (Joint Photographic Experts Group) image decompression, and
camera lens distortion correction, which were usually performed with dedicated computers
in previous systems, can all be conducted during the rendering process in the considered
camera array system. On the other hand, it is not difficult to modify the system and
assign ROI identification and image decoding to dedicated computers, as is done in the
distributed light field camera described in Reference [51].
The system software runs as two processes, one for capturing and the other for render-
ing. The capturing process is responsible for sending requests to and receiving data from
the cameras. The received images (in JPEG compressed format) are directly copied to
some shared memory that both processes can access. The capturing process is often lightly
loaded, consuming about 20% of one of the processors in the computer. When the cam-
eras start to move, their external calibration parameters need to be calculated in real-time.
Since the internal parameters of the cameras do not change during their motion, they are
calibrated offline. To calibrate the external parameters on the fly, a large planar calibration
pattern is placed in the scene and the algorithm presented in Reference [52] is used to
calibrate the external parameters. The calibration process runs very fast (150 to 180 fps at full speed).
FIGURE 18.21
The flow chart of the rendering algorithm.
For depth reconstruction, a multi-resolution 2D mesh (MRM) with depth information on its
vertices (Figure 18.20) is positioned on the imaging plane of the virtual view; thus the geometry is view-dependent
(similar to that in References [53], [54], and [55]). The MRM solution significantly re-
duces the amount of computation spent on depth reconstruction, making an efficient software
implementation possible.
The flow chart of the rendering algorithm is shown in Figure 18.21. A novel view is
rendered when there is an idle callback or the user moves the viewpoint. An initial sparse
and regular 2D mesh on the imaging plane of the virtual view is constructed first. For
each vertex of the initial 2D mesh, the procedure looks for a subset of images that will
be used to interpolate its intensity during the rendering. Once such information has been
collected, it is easy to identify the ROIs of the captured images and decode them when
necessary. The depths of the vertices in the 2D mesh are then reconstructed through a plane
sweeping algorithm. During plane sweeping, a set of depth values is hypothesized for a
given vertex, and the color consistency verification (CCV) score for its projections onto the
nearby images is computed based on the mean-removed correlation coefficient as follows:

r_ij = Σk (Iik − Īi)(Ijk − Īj) / √(Σk (Iik − Īi)² · Σk (Ijk − Īj)²), (18.11)

where Iik and Ijk are the kth pixel intensities in the projected small patches of nearby images
#i and #j, respectively. The terms Īi and Īj denote the mean of pixel intensities in the
two patches. Equation 18.11 is widely used in traditional stereo matching algorithms [56].
The overall CCV score of the nearby input images is one minus the average correlation
coefficient of all the image pairs. The depth plane resulting in the lowest CCV score will
be selected as the scene depth.
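A compact sketch of this per-vertex plane sweep is given below; project_patch is a placeholder for the code that projects the vertex at a hypothesized depth into a nearby image and extracts the small patch around the projection.

import numpy as np

def mean_removed_correlation(patch_i, patch_j):
    """Mean-removed (zero-mean normalized) correlation between two patches."""
    a = patch_i.astype(np.float64).ravel() - patch_i.mean()
    b = patch_j.astype(np.float64).ravel() - patch_j.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def sweep_vertex_depth(depth_hypotheses, nearby_images, project_patch):
    """Pick the depth hypothesis with the lowest CCV score for one mesh vertex."""
    best_depth, best_ccv = None, np.inf
    for d in depth_hypotheses:
        patches = [project_patch(img, d) for img in nearby_images]
        corrs = [mean_removed_correlation(patches[i], patches[j])
                 for i in range(len(patches))
                 for j in range(i + 1, len(patches))]
        ccv = 1.0 - np.mean(corrs)   # one minus the average correlation coefficient
        if ccv < best_ccv:
            best_depth, best_ccv = d, ccv
    return best_depth, best_ccv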
If a certain triangle in the mesh bears large depth variation, subdivision is performed to
obtain more detailed depth information. After the depth reconstruction, the novel view can
be synthesized through multi-texture blending, similar to that in the unstructured lumigraph
rendering (ULR) [48]. Lens distortion is corrected in the last stage, although the procedure
also compensates for distortion during the depth reconstruction stage.
The proposed camera array system is used to capture a variety of scenes, both static and
dynamic. The rendering speed is about four to ten frames per second. The rendering results of
some static scenes are shown in Figure 18.22. Note that the cameras are evenly spaced on
the linear guide. The rendering positions are roughly on the camera plane but not too close
to any of the capturing cameras. Figures 18.22a to 18.22c show results rendered with the
constant depth assumption. The ghosting artifacts are very severe due to the large spacing
between the cameras. Figure 18.22d shows the result from the proposed algorithm. The
improvement is significant. Figure 18.22e shows the reconstructed 2D mesh with depth
information on its vertices. The grayscale intensity represents the depth; the brighter the
intensity, the closer the vertex. Like many other geometry reconstruction algorithms, the
geometry obtained using the proposed method contains some errors. For example, in the
background region of scene toys, the depth should be flat and far, but the achieved results
have many small “bumps.” This is because part of the background region has no texture,
which makes depth recovery error-prone. However, the rendered results are not affected
by these errors because view-dependent geometry is used and the local color consistency
always holds at the viewpoint.
Figure 18.23 compares the rendering results obtained using a dense depth map and
the proposed adaptive mesh. Using the adaptive mesh produces rendered images of almost the
same quality as using a dense depth map, but at a much smaller computational cost.
FIGURE 18.25
Self-reconfiguration of the cameras. © 2007 IEEE
2. Back-project the vertices of the mesh model to the camera plane. In Figure 18.25,
one mesh vertex is back-projected as (xi, yi) on the camera plane. Note that such back-
projection can be performed even if there are multiple virtual views to be rendered; thus the
proposed algorithm is applicable to situations with multiple virtual viewpoints.
3. Collect the CCV score for each pair of neighboring cameras on the linear guides. The
capturing cameras on each linear guide naturally divide the guide into seven segments. Let
these segments be Bjk, where j is the row index of the linear guide, k is the index of bins
on that guide, 1 ≤ j ≤ 6, 1 ≤ k ≤ 7. If a back-projected vertex (xi, yi) satisfies

Yj−1 < yi < Yj+1 and xi ∈ Bjk, (18.12)

the CCV score of the vertex is added to the bin Bjk. After all the vertices have been back-
projected, the procedure obtains a set of accumulated CCV scores for each linear guide,
denoted as Sjk, where j is the row index of the linear guide and k is the index of bins on
that guide.
4. Determine which camera to move on each linear guide. Given a linear guide j, the
procedure looks for the largest Sjk, for 1 ≤ k ≤ 7. Let it be denoted as SjK. If the two cam-
eras forming the corresponding bin BjK are not too close to each other, one of them will be
moved towards the other (thus reducing their distance). Note that each camera is associ-
ated with two bins. To determine which of the two cameras should move, the procedure
checks their other associated bins and moves the camera with the smaller accumulated CCV
score in its other associated bin (a sketch of this binning and selection step follows the list).
5. Move the cameras. Once the cameras to move have been determined, the procedure issues them
commands such as “move left” or “move right.” The process then waits until it is confirmed
that the movement is finished and the cameras have been re-calibrated, and then jumps back
to Step 1 for the next epoch of movement.
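The sketch below illustrates the binning of Step 3 and the camera selection of Step 4 for one layout (six guides with seven bins each); the array shapes and the minimum-gap parameter are assumptions made for the example, not system constants.

import numpy as np

def accumulate_bin_scores(back_projections, ccv_scores, row_y, bin_edges):
    """Step 3: accumulate CCV scores of back-projected vertices into bins Bjk.

    back_projections: list of (x, y) positions on the camera plane;
    ccv_scores: the CCV score of each vertex; row_y: the y coordinate Yj of
    each of the 6 linear guides; bin_edges[j]: the 8 camera x positions that
    bound the 7 bins of guide j. Returns the accumulated scores S[j, k].
    """
    S = np.zeros((len(row_y), 7))
    for (x, y), score in zip(back_projections, ccv_scores):
        for j in range(len(row_y)):
            lo = row_y[j - 1] if j > 0 else -np.inf
            hi = row_y[j + 1] if j + 1 < len(row_y) else np.inf
            if lo < y < hi:                               # Equation 18.12, y part
                k = np.searchsorted(bin_edges[j], x) - 1  # which bin contains x
                if 0 <= k < 7:
                    S[j, k] += score
    return S

def camera_to_move(S_j, camera_x, min_gap):
    """Step 4: on one guide, choose which camera to move toward the worst bin."""
    K = int(np.argmax(S_j))                   # bin with the largest accumulated score
    left, right = K, K + 1                    # indices of the cameras bounding bin K
    if camera_x[right] - camera_x[left] <= min_gap:
        return None                           # the two cameras are already too close
    other_left = S_j[K - 1] if K > 0 else np.inf
    other_right = S_j[K + 1] if K + 1 < len(S_j) else np.inf
    # move the camera whose other associated bin has the smaller score
    return (left, "right") if other_left <= other_right else (right, "left")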
Some results of the proposed self-reconfiguration algorithm are shown in Figure 18.24.
In the first and third rows of this figure, the capturing cameras are evenly spaced on the
linear guide; note that scene flower is rendered behind the camera plane whereas scene Santa
is rendered in front of the camera plane. Due to depth discontinuities, some artifacts can
be observed in the corresponding rendered images shown in Figure 18.24d along the ob-
ject boundaries. Figure 18.24b shows the reconstructed depth of the scene at the virtual
viewpoint. Figure 18.24c depicts the CCV score obtained during the depth reconstruction.
It is obvious that the CCV score is high along the object boundaries, which usually means
wrong or uncertain reconstructed depth, or bad rendering quality. The dots in Figure 18.24c
are the projections of the capturing camera positions to the virtual imaging plane.
The second and fourth rows in Figure 18.24 show the rendering results after reconfigura-
tion; note that the result for scene flower is achieved after six epochs of camera movement,
whereas the result for scene Santa is obtained after twenty epochs. It can be seen from the CCV
score map (Figure 18.24c) that the consistency generally gets better after the camera move-
ment (indicated by the dots). The cameras move towards the regions where the CCV score
is high, which effectively increases the sampling rate for the rendering of those regions.
Figure 18.24d shows that the rendering results after self-reconfiguration are much better
than those obtained using evenly spaced cameras.
The major limitation of the self-reconfigurable camera array is that the motion of the
cameras is generally slow. During the self-reconfiguration of the cameras, it is necessary to
FIGURE 18.26
An illustration of various cases of the temporal offset t˜ between two frames: (a) a feature point in camera Cj at time t′, (b-d) its epipolar line in Ci can intersect the feature trajectory, from t to t + ∆ti, in Ci in three ways.
assume that the scene is either static or moving very slowly, and that the viewer is not changing
the viewpoint all the time. A more practical system might use a much denser
set of cameras and limit the total number of cameras actually used to render the scene.
The camera selection problem can be solved in a fashion similar to the recursive weighted
vector quantization scheme in Reference [57].
where Fij is the fundamental matrix between Ci and Cj and pi,t′ is the estimated feature location
at time t′, assuming linear motion over the frame interval. If all feature correspondences are correct,
the equation system above can be organized as a single linear equation in one unknown
t˜. Geometrically, this equation finds the intersection between the epipolar line of pj,t′ and the
straight line defined by pi,t and pi,t+∆ti. As shown in Figure 18.26, if the intersection
happens between pi,t and pi,t+∆ti, then 0 < t˜ < ∆ti; if the intersection happens before pi,t,
then t˜ < 0; and if it is beyond pi,t+∆ti, then t˜ > ∆ti.
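Under one common convention for the fundamental matrix (x_j^T F_ij x_i = 0 for corresponding points), the geometric construction described above can be sketched as follows; this is an illustration of the intersection idea, not a reproduction of Equation 18.13.

import numpy as np

def offset_from_one_feature(F_ij, p_j_tprime, p_i_t, p_i_tdt, dt_i):
    """Temporal offset from a single feature via the epipolar intersection.

    The epipolar line of the feature p_{j,t'} is intersected with the straight
    line through p_{i,t} and p_{i,t+dt_i}; the intersection parameter scaled
    by dt_i gives the offset, which may fall outside [0, dt_i] (Figure 18.26).
    """
    def h(p):                           # 2D point -> homogeneous coordinates
        return np.array([p[0], p[1], 1.0])

    l = F_ij.T @ h(p_j_tprime)          # epipolar line of p_{j,t'} in image i
    d_t = l @ h(p_i_t)                  # signed (scaled) distances of the two
    d_dt = l @ h(p_i_tdt)               # trajectory endpoints from the line
    s = d_t / (d_t - d_dt)              # intersection parameter along the trajectory
    return s * dt_i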
Ideally, given frames Ii,t, Ii,t+∆ti, and Ij,t′, the offset t˜ can be calculated from just a single
feature using Equation 18.13. Unfortunately, the flow computation is not always correct in
practice. To provide a more robust and accurate estimate of the temporal offset, the
following procedure is used:
FIGURE 18.27
Temporal offset estimation error using synthetic tracked features. The ratio of outliers varies from 10% to 90%. The relative errors with respect to the ground truth are plotted for three variations of the proposed technique. The error is no greater than 15% in any test case.
1. Select salient features on image Ii,t, calculate the temporal flow from Ii,t to Ii,t+∆ti
and the spatial flow from Ii,t to Ij,t′ using the technique described in Section 18.3.1.1.
2. Calculate the time offsets t˜[1], t˜[2], ..., t˜[N] for the features P[1], P[2], ..., P[N] using
Equation 18.13.
3. Detect and remove time offset outliers using the random sample consensus
(RANSAC) algorithm [58], assuming that outliers are primarily caused by random
optical flow errors.
4. Calculate the final time offset using a weighted least squares (WLS) method, given
the remaining inliers from RANSAC. The cost function is defined as follows:
C = ∑_{k=1}^{M} wk (t˜ − t˜k)², (18.14)
where M is the total number of remaining inliers and the weight factor wk is defined
as:
wk = e^{γ|t˜ − t˜k|}, γ ≥ 0. (18.15)
The last step (WLS) is repeated several times. During each regression step, the weight
factors are recalculated for the inliers and the weighted average offset t˜ is recomputed.
Since RANSAC has already removed most outliers, the weighted least squares fitting
converges quickly.
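A minimal sketch of the RANSAC and WLS steps is shown below. Since each per-feature offset is a single scalar, the RANSAC "model" is simply a candidate offset value, and for fixed weights the minimizer of Equation 18.14 is the weighted mean; the iteration counts, tolerance, and γ value are illustrative only.

import numpy as np

def robust_temporal_offset(offsets, n_iter=100, inlier_tol=0.05,
                           gamma=2.0, wls_rounds=5, seed=0):
    """RANSAC over per-feature offsets followed by iterated WLS (Steps 3 and 4)."""
    offsets = np.asarray(offsets, dtype=np.float64)
    rng = np.random.default_rng(seed)

    # RANSAC: repeatedly pick one offset as the model and count its inliers.
    best_inliers = np.zeros(len(offsets), dtype=bool)
    for _ in range(n_iter):
        candidate = rng.choice(offsets)
        inliers = np.abs(offsets - candidate) < inlier_tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers

    # Iterated weighted least squares on the surviving inliers.
    t = offsets[best_inliers].mean()
    for _ in range(wls_rounds):
        w = np.exp(gamma * np.abs(t - offsets[best_inliers]))   # Equation 18.15
        t = float((w * offsets[best_inliers]).sum() / w.sum())  # weighted mean minimizes Eq. 18.14
    return t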
FIGURE 18.28
Estimated time offset using real image sequences (containing 11 frames), plotted versus frame index for RANSAC, WLS, and RANSAC + WLS. The exact time offset (0.3 frame time, or 10 ms) is also shown.
FIGURE 18.29
Weighted average epipolar distance (in pixels) from a pair of real sequences 11 frames long, plotted versus frame index. For each feature, the distance is measured from the epipolar line of pj,t′ (the feature observed in camera Cj) to pi,t′, which is calculated using the estimated offset.
The first dataset consists of synthetic tracked features whose positions are generated
randomly. Random correspondences (outliers) are added and the ground-truth feature cor-
respondences (inliers) are perturbed by adding a small amount of Gaussian noise. Fig-
ure 18.27 plots the accuracy for various outlier ratios. It shows that the technique is
extremely robust even when 90% of the offsets are outliers.
The second dataset is a pair of video sequences with a known ground-truth offset. Each se-
quence, captured at 30 fps, contains 11 frames. The temporal offset between the two is 10
ms, that is, 0.3 frame time. Figure 18.28 shows the estimated temporal offset; the error is
typically within 0.05 frame time. Figure 18.29 plots the weighted average epipolar dis-
tance using the estimated time offset. Ideally, the epipolar line should pass right through the
linearly interpolated feature point (as in Equation 18.13). Figure 18.29 shows that subpixel
accuracy can be achieved.
From the above experiments with known ground truth, it can be seen that the proposed
approach can produce very accurate and robust estimation of the temporal offset. In ad-
FIGURE 18.30
Verifying estimated time offset using epipolar lines: (a) temporal optical flow on camera Ci and (b) the epipolar line of pi,t′, which is linearly interpolated based on the temporal offset, and pj,t′ on camera Cj.
dition, combining RANSAC and weighted least squares fitting yields better results than
either of these techniques alone.
18.6 Conclusion
This chapter provided a brief overview of an important branch of view synthesis, namely
methods based on the concept of light field rendering (LFR). The technical discussion
focused on extending traditional LFR to the temporal domain to accommodate dynamic
scenes. Instead of capturing the dynamic scene in strict synchronization and treating each
image set as an independent static light field, the notion of a space-time light field simply
assumes a collection of video sequences. These sequences may or may not be synchronized,
and they can have different capture rates.
In order to synthesize novel views from any viewpoint at any time instant, feature
correspondences are robustly identified across frames. They are used as landmarks
to digitally synchronize the input frames and improve view synthesis quality. Furthermore,
this chapter presented a reconfigurable camera array in which the cameras’ placement can
be automatically adjusted to achieve optimal view synthesis results for different scene con-
tents. With the ever-decreasing cost of web cameras and the increased computational and
communication capability of modern hardware, it is believed that light field rendering tech-
niques can be adopted in many interesting applications such as 3D video teleconferencing,
remote surveillance, and tele-medicine.
Acknowledgment
Figures 18.4 to 18.17 are reprinted from Reference [38] and Figures 18.18, 18.19, 18.24,
and 18.25 are reprinted from Reference [57], with the permission of IEEE.
References
[1] M. Levoy and P. Hanrahan, “Light field rendering,” in Proceedings of the 23rd International
Conference on Computer Graphics and Interactive Techniques, New Orleans, LA, USA, Au-
gust 1996, pp. 31–42.
[2] S.J. Gortler, R. Grzeszczuk, R. Szeliski, and M.F. Cohen, “The lumigraph,” in Proceedings
of the 23rd International Conference on Computer Graphics and Interactive Techniques, New
Orleans, LA, USA, August 1996, pp. 43–54.
[3] P.E. Debevec, C.J. Taylor, and J. Malik, “Modeling and rendering architecture from pho-
tographs: A hybrid geometry-and image-based approach,” in Proceedings of the 23rd Inter-
national Conference on Computer Graphics and Interactive Techniques, New Orleans, LA,
United States, August 1996, pp. 11–20.
[4] L. McMillan and G. Bishop, “Plenoptic modeling: An image-based rendering system,” in
Proceedings of the 22nd International Conference on Computer Graphics and Interactive
Techniques, Los Angeles, CA, USA, August 1995, pp. 39–46.
[5] L. McMillan, An Image-Based Approach to Three-Dimensional Computer Graphics, PhD
thesis, University of North Carolina at Chapel Hill, 1997.
[6] I. Ihm, S. Park, and R.K. Lee, “Rendering of spherical light fields,” in Proceedings of Pacific
Conference on Computer Graphics and Applications, Seoul, Korea, October 1997, p. 59.
[7] E. Camahort, A. Lerios, and D. Fussell, “Uniformly sampled light fields,” in Proceedings of
Eurographics Rendering Workshop, Vienna, Austria, July 1998, pp. 117–130.
[8] A. Isaksen, L. McMillan, and S.J. Gortler, “Dynamically reparameterized light fields,” in Pro-
ceedings of the 27th International Conference on Computer Graphics and Interactive Tech-
niques, New Orleans, LA, USA, August 2000, pp. 297–306.
[9] P.P. Sloan, M.F. Cohen, and S.J. Gortler, “Time critical lumigraph rendering,” in Proceedings
of Symposium on Interactive 3D Graphics, Providence, RI, USA, April 1997, pp. 17–23.
[10] H.Y. Shum and L.W. He, “Rendering with Concentric Mosaics,” in Proceedings of the 24th
International Conference on Computer Graphics and Interactive Techniques, Los Angeles,
CA, USA, July 1997, pp. 299–306.
[11] W. Li, Q. Ke, X. Huang, and N. Zheng, “Light field rendering of dynamic scenes,” Machine
Graphics and Vision, vol. 7, no. 3, pp. 551–563, 1998.
[12] H. Schirmacher, C. Vogelgsang, H.P. Seidel, and G. Greiner, “Efficient free form light
field rendering,” in Proceedings of Vision, Modeling, and Visualization, Stuttgart, Germany,
November 2001, pp. 249–256.
[13] P.P. Sloan and C. Hansen, “Parallel lumigraph reconstruction,” in Proceedings of Symposium
on Parallel Visualization and Graphics, San Francisco, CA, USA, October 1999, pp. 7–15.
[14] H. Schirmacher, W. Heidrich, and H.P. Seidel, “High-quality interactive lumigraph render-
ing through warping,” in Proceedings of Graphics Interface, Montreal, Canada, May 2000,
pp. 87–94.
[15] E.H. Adelson and J. Bergen, Computational Models of Visual Processing, ch. “The plenoptic
function and the elements of early vision,” M.S. Landy and J.A. Movshon (eds.), Cambridge,
MA: MIT Press, August 1991, pp. 3–20.
[16] L.A. Westover, “Footprint Evaluation for Volume Rendering,” in Proceedings of the 17th In-
ternational Conference on Computer Graphics and Interactive Techniques, Dallas, TX, USA,
August 1990, pp. 367–376.
[17] D. Anderson, “Hidden Line Elimination in Projected Grid Surfaces,” ACM Transactions on
Graphics, vol. 4, no. 1, pp. 274–288, October 1982.
[18] R. Levin, “Photometric characteristics of light controlling apparatus,” Illuminating Engineer-
ing, vol. 66, no. 4, pp. 205–215, 1971.
[19] I. Ashdown, “Near-field photometry: A new approach,” Journal of the Illuminating Engineer-
ing Society, vol. 22, no. 1, pp. 163–180, Winter 1993.
[20] T. Okoshi, Three-Dimensional Imaging Techniques. New York: Academic Press, Inc., Febru-
ary 1977.
[21] J.X. Chai, X. Tong, S.C. Chan, and H.Y. Shum, “Plenoptic Sampling,” in Proceedings of
the 27th International Conference on Computer Graphics and Interactive Techniques, New
Orleans, Louisiana, USA, July 2000, pp. 307–318.
[22] J. Blinn and M. Newell, “Texture and reflection in computer generated images,” Communica-
tions of the ACM, vol. 19, no. 10, pp. 542–547, October 1976.
[23] N. Greene, “Environment mapping and other applications of world projections,” IEEE Com-
puter Graphics and Applications, vol. 6, no. 11, pp. 21–29, November 1986.
[24] S.E. Chen, “Quicktime VR: An image-based approach to virtual environment navigation,” in
Proceedings of SIGGRAPH 1995, Los Angeles, CA, USA, August 1995, pp. 29–38.
[25] J. Meehan, Panoramic Photography. New York: Amphoto, March 1996.
[26] Be Here Technology. Available online: http://www.behere.com.
[27] S.K. Nayar, “Catadioptric omnidirectional camera,” in Proceedings of Conference on Com-
puter Vision and Pattern Recognition, San Juan, Puerto Rico, June 1997, p. 482.
[28] R. Szeliski and H.Y. Shum, “Creating full view panoramic image mosaics and environment
maps,” in Proceedings of the 24th International Conference on Computer Graphics and Inter-
active Techniques, Los Angeles, CA, USA, August 1997, pp. 251–258.
[29] M. Irani, P. Anandan, and S. Hsu, “Mosaic based representations of video sequences and their
applications,” in Proceedings of International Conference on Computer Vision, Cambridge,
MA, USA, June 1995, p. 605.
[30] S. Mann and R.W. Picard, “Virtual bellows: Constructing high-quality images from video,” in
Proceedings of International Conference on Image Processing, Austin, TX, USA, November
1994, pp. 363–367.
[31] R. Szeliski, “Image mosaicing for tele-reality applications,” in Proceedings of IEEE Workshop
on Applications of Computer Vision, Sarasota, FL, USA, December 1994, pp. 44–53.
[32] R. Szeliski, “Video mosaics for virtual environments,” IEEE Computer Graphics and Appli-
cations, vol. 16, no. 2, pp. 22–30, March 1996.
[33] T. Naemura, J. Tago, and H. Harashima, “Realtime video-based modeling and rendering of 3D
scenes,” IEEE Computer Graphics and Applications, vol. 22, no. 2, pp. 66–73, March/April
2002.
[34] J.C. Yang, M. Everett, C. Buehler, and L. McMillan, “A real-time distributed light field cam-
era,” in Proceedings of the 13th Eurographics Workshop on Rendering, Pisa, Italy, June 2002,
pp. 77–86.
[35] B. Wilburn, M. Smulski, H. Lee, and M. Horowitz, “The light field video camera,” in Pro-
ceedings of SPIE Electronic Imaging Conference, San Jose, CA, USA, January 2002.
[36] W. Matusik and H. Pfister, “3D TV: A scalable system for real-time acquisition, transmission,
and autostereoscopic display of dynamic scenes,” ACM Transactions on Graphics, vol. 23,
no. 3, pp. 814–824, August 2004.
[37] B. Wilburn, N. Joshi, V. Vaish, E.V. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz,
and M. Levoy, “High performance imaging using large camera arrays,” ACM Transactions on
Graphics, vol. 24, no. 3, pp. 765–776, July 2005.
[38] H. Wang, M. Sun, and R. Yang, “Space-time light field rendering,” IEEE Transactions on
Visualization and Computer Graphics, vol. 13, no. 4, pp. 697–710, July/August 2007.
[39] C.J. Harris and M. Stephens, “A combined corner and edge detector,” in Proceedings of 4th
Alvey Vision Conference, Manchester, UK, August 1988, pp. 147–151.
[40] J.Y. Bouguet, “Pyramidal implementation of the Lucas Kanade feature tracker description of
the algorithm,” Technical Report, 1999.
[41] B.D. Lucas and T. Kanade, “An Iterative Image Registration Technique with an Application
to Stereo Vision,” in Proceedings of International Joint Conference on Artificial Intelligence,
Vancouver, BC, Canada, August 1981, pp. 674–679.
[42] C. Tomasi and T. Kanade, “Detection and tracking of point features,” Technical Report CMU-
CS-91-132, Carnegie Mellon University, 1991.
[43] O. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint. Cambridge, MA:
MIT Press, November 1993.
[44] L. Zhang, B. Curless, and S. Seitz, “Spacetime stereo: Shape recovery for dynamic scenes,” in
Proceedings of Conference on Computer Vision and Pattern Recognition, Madison, WI, USA,
June 2003, pp. 367–374.
[45] T. Beier and S. Neely, “Feature based image metamorphosis,” SIGGRAPH Computer Graph-
ics, vol. 26, no. 2, pp. 35–42, July 1992.
[46] J. Canny, “A computational approach to edge detection,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, vol. 8, no. 6, pp. 679–698, November 1986.
[47] J.R. Shewchuk, “Triangle: Engineering a 2D quality mesh generator and Delaunay triangu-
lator,” Lecture Notes in Computer Science, vol. 1148, pp. 203–222, August 1996.
[48] C. Buehler, M. Bosse, L. McMillan, S. Gortler, and M. Cohen, “Unstructured lumigraph ren-
dering,” in Proceedings of the 28th International Conference on Computer Graphics and In-
teractive Techniques, Los Angeles, CA, USA, August 2001, pp. 405–432.
[49] Point Grey Research Inc., Available online: http://www.ptgrey.com.
[50] Mini SSC II, Scott Edwards Electronics Inc. Available online: http://www.seetron.com/ssc.htm.
[51] J.C. Yang, M. Everett, C. Buehler, and L. McMillan, “A real-time distributed light field cam-
era,” in Proceedings of Eurographics Workshop on Rendering, Pisa, Italy, June 2002.
[52] Z. Zhang, “A flexible new technique for camera calibration,” Technical Report, MSR-TR-98-
71, 1998.
[53] R. Yang, G. Welch, and G. Bishop, “Real-time consensus-based scene reconstruction using
commodity graphics hardware,” in Proceedings of Pacific Conference on Computer Graphics
and Applications, Beijing, China, October 2002.
[54] G. G. Slabaugh, R. W. Schafer, and M. C. Hans, “Image-based photo hulls,” Technical Report
HPL-2002-28, HP Labs, 2002.
[55] W. Matusik, C. Buehler, and L. McMillan, “Polyhedral visual hulls for real-time rendering,”
in Proceedings of Eurographics Workshop on Rendering, London, UK, June 2001.
[56] O. Faugeras, B. Hotz, H. Mathieu, T. Viéville, Z. Zhang, P. Fua, E. Théron, L. Moll, G. Berry,
J. Vuillemin, P. Bertin, and C. Proy, “Real time correlation-based stereo: Algorithm, imple-
mentations and applications,” Technical Report 2013, INRIA, 1993.
[57] C. Zhang and T. Chen, “Active rearranged capturing of image-based rendering scenes - Theory
and practice,” IEEE Transactions on Multimedia, vol. 9, no. 3, pp. 520–531, April 2007.
[58] M.A. Fischler and R.C. Bolles, “Random sample consensus: A paradigm for model fitting
with applications to image analysis and automated cartography,” Communication of the ACM,
vol. 24, no. 6, pp. 381–395, June 1981.
(a) (b)
FIGURE 1.3
CFA-based digital imaging: (a) Bayer CFA image, and (b) full-color reconstructed image.
FIGURE 1.27
Simulated images demosaicked from a Bayer CFA (top row) and the four-channel CFA shown in Figure 1.19d
(bottom row). The images shown are for the cases of: (a,e) no noise reduction, (b,f) median filtering only, (c,g)
median and boxcar filtering only, and (d,h) median, boxcar, and low-frequency filtering.
(a) (b)
FIGURE 2.11
Motion deblurring of demosaicked image: (a) before motion deblurring, and (b) after motion deblurring.
(a) (b)
FIGURE 2.12
Fully processed images: (a) with motion compensation, and (b) without motion compensation.
FIGURE 4.2
Target reference images captured under syl-50mr16q: (a) ball, (b) books, and (c) Macbeth.
(a) (b) (c)
FIGURE 4.3
Source images captured under different illuminations: (a) ball under solux-4100, (b) books under syl-
50mr16q+3202, and (c) Macbeth under ph-ulm.
FIGURE 4.4
Color corrected images for the illuminant solux-4100: (a) MXW-DCT-Y, (b) COR, and (c) COR-DCT.
FIGURE 4.6
Color corrected images for the illuminant ph-ulm: (a) MXW-DCT-Y, (b) COR, and (c) COR-DCT.
FIGURE 4.7
Chromaticity shift corrected images by GRW-DCT: (a) ball, (b) books, and (c) Macbeth.
(a) (b) (c)
FIGURE 4.9
Color enhancement of the images: (a,d,g) MCE, (b,e,h) MCEDRC, and (c,f,i) TW-CES-BLK.
FIGURE 4.11
Color restoration through color correction followed by enhancement: (a) original image, (b) enhanced image
without color correction, and (c) enhanced image with color correction.
(a) (b)
(c) (d)
FIGURE 5.4
Different stages of the camera image processing pipeline: (a) demosaicked image, (b) white-balanced image,
(c) color-corrected image, and (d) tone / scale-rendered image. The results correspond to Figure 5.2a.
FIGURE 6.1
Examples of existing CFAs: (a) Bayer [1], (b) Lukac [2], (c) Hamilton [3], and (d) Hirakawa [4]. © 2009 IEEE
(a) (b)
FIGURE 7.7
Least square method-based image restoration: (a) bilinearly interpolated observation from a sequence of 50
frames, and (b) restored image using the least squares SR algorithm.
FIGURE 7.11
Comparison of the affine and nonlinear photometric conversion using differently exposed images. The images
are geometrically registered: (b) is photometrically mapped on (a), and (g) is photometrically mapped on (f).
Images in (c) and (h) are the corresponding residuals when the affine photometric model is used. Images in (d)
and (i) are the residuals when the nonlinear photometric model is used. Images in (e) and (j) are the residuals
multiplied by the weighting function.
FIGURE 7.12
A subset of 22 input images showing different exposure times and camera positions.
(a) (b)
(c) (d)
FIGURE 7.13
Image restoration using the method in Reference [84]: (a,c) input images, and (b,d) their restored versions.
FIGURE 7.14
High-dynamic-range high-resolution image obtained using the method in Reference [3].
(a) (b)
(c) (d)
FIGURE 7.17
Image demosaicking: (a) bilinear interpolation, (b) edge-directed interpolation [94], (c) multi-frame demo-
saicking [91], and (d) super-resolution restoration [91].
(a) (b) (c)
FIGURE 8.5
Image deblurring using a blurred and noisy image pair: (a) blurred input image, (b) noisy input image, and (c)
output image obtained using two input images.
(d) (e)
FIGURE 8.9
Comparison of two deblurring approaches: (a-c) three test images, (d) output produced using two first input
images, and (e) output produced using all three input images.
FIGURE 9.1
HDR imaging: (a) single exposure from a clearly HDR scene, (b) HDR image linearly scaled to fit the normal
8-bit display interval — the need for tone mapping is evident, and (c) tone mapped HDR image showing a
superior amount of information compared with the two previous figures.
(a) (b)
FIGURE 9.8
The HDR image is composed in luminance-chrominance space and transformed: (a) directly to RGB for tone
mapping, (b) to RGB utilizing the presented saturation control, with subsequent tone mapping is done in RGB.
(a) (b)
FIGURE 9.9
An HDR image of a stained glass window from Tampere Cathedral, Finland: (a) the image is linearly scaled
for display, and (b) the tone mapping was done with the method described in Section 9.5.2.1.
FIGURE 9.12
Four exposures from the real LDR sequence used to compose an HDR image. The exposure times of the frames
are 0.01, 0.0667, 0.5, and 5 seconds.
FIGURE 9.13
The histograms of the HDR image composed in luminance-chrominance space and transformed into RGB.
Histogram is displayed in logarithmic scale.
(a) (b)
(c)
FIGURE 9.14
The HDR image composed in luminance-chrominance space from real data and tone mapped using: (a) the
histogram adjustment technique presented in Section 9.5.2.2, (b) the anchoring technique presented in Sec-
tion 9.5.2.1, and (c) the adaptive logarithmic technique presented in Reference [22] applied in RGB.
(a) (b) (c)
(d) (e)
FIGURE 10.1
Example of a dynamic scene illustrated on a sequence of LDR images taken at different exposures. On the
quay, pedestrians stroll. The boat, floating on the water, oscillates with the water movement. The water and the
clouds are subjected to the wind. Therefore water wrinkles change from one image to another.
FIGURE 10.2
A series of five LDR images taken at different exposures.
(a) (b)
FIGURE 10.4
The HDR image obtained using traditional methods, (a) from the images shown in Figure 10.2 after alignment,
(b) from the image sequence shown in Figure 10.1. (Images built using HDRShop [8].)
(a)
(b) (c)
(d) (e)
FIGURE 10.7
Movement removal using variance and uncertainty [16]: (a) sequence of LDR images captured with different
exposure times; several people walk through the viewing window, (b) variance image V I, (c) uncertainty
image UI, (d) HDR image after object movement removal using the variance image, and (e) HDR image
after movement removal using the uncertainty image. © 2008 IEEE
FIGURE 11.1
Results of background subtraction using the algorithm of Reference [7]. Object silhouettes are strongly cor-
rupted, and multiple moving objects cannot be separated due to cast shadows.
FIGURE 11.2
Built-in area extraction using cast shadows: (left) input image, (middle) output of a color-based shadow filter
with red areas indicating detected shadows, and (right) built-in areas identified as neighboring image regions
of the shadows blobs considering the sun direction.
FIGURE 11.5
Different parts of the day on Entrance sequence with the corresponding segmentation results: (a) morning am,
(b) noon, (c) afternoon pm, and (d) wet weather.
(a) (b) (c) (d)
FIGURE 11.10
Two dimensional projection of foreground (red) and shadow (blue) ψ values in the Entrance pm test sequence:
(a) C1 −C2 , (b) H − S, (c) R − G, (d) L − u, (e) C2 −C3 , (f) S −V , (g) G − B, and (h) u − v. The ellipse denotes
the projection of the optimized shadow boundary.
FIGURE 12.5
Document rectification from a stereo pair: (a,d) input stereo pair, (b,e) rectified images, and (c,f) stitching
boundary using Equations 12.20 and 12.21. © 2009 IEEE
(a) (b)
FIGURE 12.6
Composite image generation: (a) without blending, and (b) with blending (final result). © 2009 IEEE
FIGURE 12.8
Additional experimental results: (a,b) input stereo pairs, and (c) final results. © 2009 IEEE
(a) (b) (c)
FIGURE 12.9
Single-view rectification procedure: (a) user interaction, the inner box is displayed just for the intuitive under-
standing of the system and is ignored after user interaction, (b) feature extraction, one of three feature maps
is shown, (c) result of the presented segmentation method, (d) result of References [13] and [27] using the
segmented image, aspect ratio 2.32, (e) result of the presented rectification method using the segmented image,
aspect ratio 1.43, and (f) figure scanned by a flatbed scanner for comparison, aspect ratio 1.39.
FIGURE 12.16
Performance comparison of the two approaches: (a,d) results of the method presented in Section 12.2 with
aspect ratio 1.52 and 1.50, (b,e) results of the method presented in Section 12.3 with aspect ratio 1.52 and
1.53, and (c,f) scanned images with aspect ratio 1.54 and 1.51.
(a)
(b) (c)
FIGURE 13.12
High definition range imaging using the bilateral filter: (a) a set of four input images, (b) linearly scaled HDR
image, and (c) image obtained using the tone-mapping method of Reference [10].
FIGURE 13.14
Image fusion using the bilateral filter: (a) no-flash image, (b) flash image, and (c) fusion of the flash and
no-flash images.
(a) (b) (c)
FIGURE 14.5
The effect of v(r) ⊥ ∇I(r) on image quality: (a) input image, (b) synthetic painting obtained by imposing the
condition v(r) ⊥ ∇I(r) on the whole image, and (c) synthetic painting obtained by imposing the condition
v(r) ⊥ ∇I(r) only on high contrast edge points.
(a) (b)
FIGURE 14.8
Morphological color image processing: (a) input image and (b) output image obtained via close-opening ap-
plied separately to each RGB component.
(a) (b) (c) (d)
FIGURE 14.16
Artistic image generation for the example of Figure 14.15: (a) input image I(r), (b) edge preserving smoothing
output IEPS(r), (c) associated synthetic painterly texture, and (d) final output y(r). © 2009 IEEE
(a) (b)
(c) (d)
FIGURE 14.17
Comparison of various artistic effects: (a) input image, (b) Glass pattern algorithm, (c) artistic vision [15], and
(d) impressionistic rendering [9]. © 2009 IEEE
(a) (b)
(c) (d)
FIGURE 14.19
Examples of cross continuous Glass patterns: (a) input image, (b) output corresponding to v(r) =
[x, y]/√(x² + y²), (c) output corresponding to v(r) = [−y, x]/√(x² + y²), and (d) output corresponding to v(r) ⊥
∇I(r). The last image is perceptually similar to a painting.
FIGURE 15.1
Failure of standard colorization algorithms in the presence of texture: (left) manual initialization, (right) result
of Reference [1] with the code available at http://www.cs.huji.ac.il/∼yweiss/Colorization/. Despite the general
efficiency of this simple method, based on the mean and the standard deviation of local intensity neighborhoods,
the texture remains difficult to deal with. Hence texture descriptors and learning edges from color examples
are required.
(a) (b) (c) (d) (e)
FIGURE 15.2
Examples of color spectra and associated discretizations: (a) color image, (b) corresponding 2D colors, (c) the
location of the observed 2D colors in the ab-plane (a red dot for each pixel) and the computed discretization in
color bins, (d) color bins filled with their average color, and (e) continuous extrapolation with influence zones
of each color bin in the ab-plane (each bin is replaced by a Gaussian, whose center is represented by a black
dot; red circles indicate the standard deviation of colors within the color bin, blue ones are three times larger).
FIGURE 15.3
Coloring a painting given another painting by the same painter: (a) training image, (b) test image, (c) image
colored using Parzen windows — the border is not colored because of the window size needed for SURF
descriptors, (d) color variation predicted — white stands for homogeneity and black for color edge, (e) most
probable color at the local level, and (f) 2D color chosen by graph cuts.
(a) (b) (c)
FIGURE 15.4
Landscape example with Parzen windows: (a) training image, (b) test image, (c) output image, (d) predicted
color variation, (e) most probable color locally, (f) 2D color chosen by graph cuts, (g) colors obtained after
refinement step.
(a)
(b) (c)
FIGURE 15.7
Image colorization using Parzen windows: (a) three training images, (b) test image, (c) colored image, (d)
prediction of color variations, (e) most probable colors at the local level, and (f) final colors.
(a) (b) (c)
FIGURE 15.8
SVM-driven colorization of Charlie Chaplin frame using the training set of Figure 15.7: (a,d) SVM, (b,e) SVM
with spatial regularization — Equation 15.7, and (c,f) SVMstruct.
(a) (b)
FIGURE 15.10
Colorization of the 21st and 22nd images from the Pasadena houses Caltech dataset: (a) 21st image colored,
(b) 22nd image colored, (c) predicted edges, (d) most probable color at the pixel level, and (e) colors chosen.
(a) (b) (c)
FIGURE 15.11
SVM-driven colorization of the 21st and 22nd images from the Pasadena houses Caltech dataset; results and
colors chosen are displayed: (a-c) 21st image, (d-f) 22nd image; (a,d) SVM, (b,e) SVM with spatial regular-
ization, and (c,f) SVMstruct.
FIGURE 16.3
Female and male face-composites, each averaging sixteen faces. It has been empirically shown that average
faces tend to be considerably more attractive than the constituent faces.
(a) (b) (c)
FIGURE 16.4
An example of the eight facial features (two eyebrows, two eyes, the inner and outer boundaries of the lips, the
nose, and the boundary of the face), composed of a total of 84 feature points, used in the proposed algorithm:
(a) output feature points from the active shape model search, (b) scatter of the aligned 84 landmark points of
92 sample training data and their average, and (c) 234 distances between these points. © 2008 ACM
FIGURE 16.7
The warp field is defined by the correspondence between the source feature points (in blue) and the beautified
geometry (in red). © 2008 ACM
FIGURE 16.8
Beautification examples: (top) input portraits and (bottom) their beautified versions. © 2008 ACM
(d)
FIGURE 17.3
Performance improvement by multiplexing: (a) light field image captured without multiplexing, (b) demulti-
plexed light field image, (c) image captured with multiplexing, that is, Mu (x) in Equation 17.4, and (d) enlarged
portions of (a) and (b). The insets in (a-c) show the corresponding multiplexing patterns.
FIGURE 17.5
The effect of the photometric distortion. The images shown here are two of the nine light field images of a
static scene. The insets show the corresponding aperture shape.
FIGURE 17.7
Photometric calibration results: (a) input image Iud to be corrected — note the darker left side, (b) reference
image I0 , (c) triangulation of the matched features marked in the previous two images, (d) image Iuw warped
using the reference image based on the triangular mesh, (e) approximated vignetting field with suspicious
areas removed, (f) vignetting field approximated without warping, (g) estimated parametric vignetting field,
(h) calibrated image Iu , and (i) intensity profile of the 420th scanline before and after the calibration.
(a)
FIGURE 17.11
(a) Two demultiplexed light field images generated by the proposed system; the full 4D resolution is 4 × 4 ×
3039 × 2014. (b) The estimated depth map of the left image of (a). (c,d) Postexposure refocused images
generated from the light field and the depth maps.
(d) (e)
FIGURE 17.13
(a) An estimated depth map. (b) Image interpolated without depth information. (c) Image interpolated with
depth information. (d-e) Close-up of (b-c). The angular resolution of the light field is 3 × 3.
(a) (b) (c)
(d) (e)
FIGURE 17.14
(a) An estimated depth map. (b) Digital refocused image with the original angular resolution 4 × 4. (c) Digital
refocused image with the angular resolution 25 × 25 boosted by view interpolation. (d-e) Close-up of (b-c).
FIGURE 18.22
Scenes captured and rendered with the proposed camera array: (a) rendering with a constant depth at the
background, (b) rendering with a constant depth at the middle object, (c) rendering with a constant depth
at the closest object, (d) rendering with the proposed method, and (e) multi-resolution 2D mesh with depth
reconstructed on-the-fly, brighter intensity means smaller depth. Captured scenes from top to bottom: toys,
train, girl and checkerboard, and girl and flowers.
(a) (b) (c) (d)
FIGURE 18.23
Real-world scenes rendered with the proposed algorithm: (top) scene train, (bottom) scene toys; (a,c) used
per-pixel depth map to render, and (b,d) the proposed adaptive mesh to render.
FIGURE 18.24
Scenes rendered by reconfiguring the proposed camera array: (a) the camera arrangement, (b) reconstructed
depth map, brighter intensity means smaller depth, (c) CCV score of the mesh vertices and the projection of
the camera positions to the virtual imaging plane denoted by dots — darker intensity means better consistency,
and (d) rendered image. Captured scenes from top to bottom: flower with cameras evenly spaced, flower
with cameras self-reconfigured (6 epochs), Santa with cameras evenly spaced, and Santa with cameras
self-reconfigured (20 epochs). © 2007 IEEE
FIGURE 18.31
Synthesized results of the toy ball sequence: (top) traditional LFR with unsynchronized frames and (bottom)
space-time LFR using estimated temporal offsets among input sequences. Note that the input data set does not
contain global time stamps.
FIGURE 18.32
Visualization of the trajectory.