Thomas Pock
ICG, Graz University of Technology, AIT, Austria
E-mail: pock@icg.tugraz.at
CONTENTS
1 Introduction 2
2 Typical optimization problems in imaging 5
3 Notation and basic notions of convexity 12
4 Gradient methods 20
5 Saddle-point methods 49
6 Non-convex optimization 75
7 Applications 81
A Abstract convergence theory 128
B Proof of Theorems 4.1, 4.9 and 4.10. 131
C Convergence rates for primal–dual algorithms 136
References 140
2 A. Chambolle and T. Pock
1. Introduction
The purpose of this paper is to describe, and illustrate with numerical ex-
amples, the fundamentals of a branch of continuous optimization dedicated
to problems in imaging science, in particular image reconstruction, inverse
problems in imaging, and some simple classification tasks. Many of these
problems can be modelled by means of an ‘energy’, ‘cost’ or ‘objective’ which
represents how ‘good’ (or bad!) a solution is, and must be minimized.
These problems often share a few characteristic features. One is their size,
which can be very large (typically involving at most around a billion vari-
ables, for problems such as three-dimensional image reconstruction, dense
stereo matching, or video processing) but usually not ‘huge’ like some recent
problems in learning or statistics. Another is the fact that for many prob-
lems, the data are structured in a two- or three-dimensional grid and interact
locally. A final, frequent and fundamental feature is that many useful prob-
lems involve non-smooth (usually convex) terms, for reasons that are now
well understood and concern the concepts of sparsity (DeVore 1998, Candès,
Romberg and Tao 2006b, Donoho 2006, Aharon, Elad and Bruckstein 2006)
and robustness (Ben-Tal and Nemirovski 1998).
These features have strongly influenced the type of numerical algorithms
used and further developed to solve these problems. Due to their size and
lack of smoothness, higher-order methods such as Newton’s method, or
methods relying on precise line-search techniques, are usually ruled out,
although some authors have suggested and successfully implemented quasi-
Newton methods for non-smooth problems of the kind considered here (Ito
and Kunisch 1990, Chan, Golub and Mulet 1999).
Hence these problems will usually be tackled with first-order descent
methods, which are essentially extensions and variants of a plain gradi-
ent descent, appropriately adapted to deal with the lack of smoothness of
the objective function. To tackle non-smoothness, one can either rely on
controlled smoothing of the problem (Nesterov 2005, Becker, Bobin and
Candès 2011) and revert to smooth optimization techniques, or ‘split’ the
problem into smaller subproblems which can be exactly (or almost) solved,
and combine these resolutions in a way that ensures that the initial problem
is eventually solved. This last idea is now commonly referred to as ‘proxi-
mal splitting’ and, although it relies on ideas from as far back as the 1950s
or 1970s (Douglas and Rachford 1956, Glowinski and Marroco 1975), it has
been a very active topic in the past ten years in image and signal processing,
as well as in statistical learning (Combettes and Pesquet 2011, Parikh and
Boyd 2014).
Hence, we will focus mainly on proximal splitting (descent) methods,
and primarily for convex problems (or extensions, such as finding zeros of
maximal-monotone operators). We will introduce several important prob-
Optimization for imaging 3
¹ Of course, what follows is also valid for images/signals defined on a one- or three-dimensional domain.
We will also frequently need the operator norm $\|D\|$, which is estimated as
$$\|D\| \le \sqrt{8} \tag{2.5}$$
(see Chambolle 2004b). The discrete ROF model is then defined by
$$\min_u \; \lambda\|Du\|_{p,1} + \frac12\|u - u^\diamond\|_2^2, \tag{2.6}$$
where $u^\diamond \in \mathbb{R}^{m\times n}$ is the given noisy image, and the discrete total variation is defined by
$$\|Du\|_{p,1} = \sum_{i=1,\,j=1}^{m,n} |(Du)_{i,j}|_p = \sum_{i=1,\,j=1}^{m,n} \bigl(|(Du)_{i,j,1}|^p + |(Du)_{i,j,2}|^p\bigr)^{1/p},$$
that is, the ℓ1-norm of the p-norm of the pixelwise image gradients.² The
parameter p can be used, for example, to realize anisotropic (p = 1) or
isotropic (p = 2) total variation. Some properties of the continuous model,
such as the co-area formula, carry over to the discrete model only if p = 1,
but the isotropic total variation is often preferred in practice since it does
not exhibit a grid bias.
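Both variants are straightforward to compute. The following sketch (our own code; the function name, the use of right differences as in footnote 2, and the Neumann-type boundary handling are our choices, not part of the original text) evaluates $\|Du\|_{p,1}$ for the anisotropic and isotropic cases:

```python
import numpy as np

def discrete_tv(u, p=2):
    """||Du||_{p,1} for a 2-D image u, using right (forward) differences,
    with differences set to zero at the border."""
    dx = np.zeros_like(u)
    dy = np.zeros_like(u)
    dx[:-1, :] = u[1:, :] - u[:-1, :]     # (Du)_{i,j,1}
    dy[:, :-1] = u[:, 1:] - u[:, :-1]     # (Du)_{i,j,2}
    if p == 1:    # anisotropic TV: sum of absolute values
        return np.abs(dx).sum() + np.abs(dy).sum()
    if p == 2:    # isotropic TV: sum of pixelwise Euclidean norms
        return np.sqrt(dx**2 + dy**2).sum()
    return ((np.abs(dx)**p + np.abs(dy)**p) ** (1.0 / p)).sum()
```

On an image with a single corner discontinuity the anisotropic value exceeds the isotropic one, reflecting the grid bias of $p = 1$ mentioned above.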
From a sparsity point of view, the idea of the total variation denoising
model is that the ℓ1 -norm induces sparsity in the gradients of the image,
hence it favours piecewise constant images with sparse edges. On the other
hand, this property – also known as the staircasing effect – might be con-
sidered a drawback for some applications. Some workarounds for this issue
will be suggested in Example 4.7 and Section 7.2. The isotropic case (p = 2)
can also be interpreted as a very simple form of group sparsity, grouping
together the image derivatives in each spatial dimension.
In many practical problems it is necessary to incorporate an additional
linear operator in the data-fitting term. Such a model is usually of the form
$$\min_u \; \lambda\|Du\|_{p,1} + \frac12\|Au - u^\diamond\|_2^2, \tag{2.7}$$
where $A : \mathbb{R}^{m\times n} \to \mathbb{R}^{k\times l}$ is a linear operator, $u^\diamond \in \mathbb{R}^{k\times l}$ is the given data,
and k, l will depend on the particular application. Examples include image
deblurring, where A models the blur kernel, and magnetic resonance imag-
ing (MRI), where the linear operator is usually a combination of a Fourier
transform and the coil sensitivities; see Section 7.4 for details.
The quadratic data-fitting term of the ROF model is specialized for zero-
mean Gaussian noise. In order to apply the model to other types of noise,
different data-fitting terms have been proposed. When the noise is impulsive
or contains gross outliers, a simple yet efficient modification is to replace
² Taking only right differences is of course arbitrary, and may lead to anisotropy issues. However, this is rarely important for applications (Chambolle, Levine and Lucier 2011).
Figure 2.1. Total variation based image denoising. (a) Original input image, and
(b) noisy image containing additive Gaussian noise with standard deviation σ = 0.1.
(c) Denoised image obtained by minimizing the ROF model using λ = 0.1.
the quadratic data-fitting term with an ℓ1 -data term. The resulting model,
called the TV-ℓ1 model, is given by
$$\min_u \; \lambda\|Du\|_{p,1} + \|u - u^\diamond\|_1. \tag{2.8}$$
This model has many nice properties such as noise robustness and contrast
invariance (Nikolova 2004, Chan and Esedoglu 2004). However, this does
not come for free. While the ROF model still contains some regularity in
the data term that can be exploited during optimization, the TV-ℓ1 model
is completely non-smooth and hence significantly more difficult to minimize.
Figure 2.2. An image deblurring problem. (a) Original image, and (b) blurry
and noisy image (Gaussian noise with standard deviation σ = 0.01) together
with the known blur kernel. (c, d) Image deblurring without (λ = 0) and with
(λ = 5 × 10−4 ) total variation regularization. Observe the noise amplification when
there is no regularization.
with $|u_{i,j}|_p = \bigl(\sum_{k=1}^{r} |u_{i,j,k}|^p\bigr)^{1/p}$ denoting the $p$-vector norm acting on the single pixels. Similarly, if the pixels are matrix-valued (or tensor-valued), that is, $U_{i,j} \in \mathbb{R}^{r\times s}$, we have $U = (U_{1,1}, \ldots, U_{m,n}) \in \mathbb{R}^{m\times n\times r\times s}$, and we will consider matrix norms, acting on the single pixels $U_{i,j}$.
³ This definition avoids the risky expression $(+\infty) + (-\infty)$; see for instance Rockafellar (1997, Section 4).
which is convex, l.s.c., and proper when C is convex, closed and non-empty.
The minimization of such functions will allow us to easily model convex
constraints in our problems.
3.2. Subgradient
Given a convex, extended real valued, l.s.c. function $f : \mathcal{X} \to [-\infty, +\infty]$, we recall that its subgradient at a point $x$ is defined as the set
$$\partial f(x) := \{\, p \in \mathcal{X} : f(y) \ge f(x) + \langle p, y - x\rangle \ \text{for all } y \in \mathcal{X} \,\}.$$
An obvious remark which stems from the definition is that this notion al-
lows us to generalize Fermat’s stationary conditions (∇f (x) = 0 if x is a
minimizer of f ) to non-smooth convex functions: we indeed have
x ∈ X is a global minimizer of f if and only if 0 ∈ ∂f (x). (3.1)
The function is strongly convex or 'µ-convex' if in addition, for $x, y \in \mathcal{X}$ and $p \in \partial f(x)$, we have
$$f(y) \ge f(x) + \langle p, y - x\rangle + \frac{\mu}{2}\|y - x\|^2$$
or, equivalently, if $x \mapsto f(x) - \mu\|x\|^2/2$ is also convex. It is then, obviously, strictly convex as it satisfies
$$f(tx + (1-t)y) \le t f(x) + (1-t) f(y) - \mu\,\frac{t(1-t)}{2}\|y - x\|^2 \tag{3.2}$$
for any x, y and any t ∈ [0, 1]. A trivial but important remark is that if f
is strongly convex and x is a minimizer, then we have (since 0 ∈ ∂f (x))
$$f(y) \ge f(x) + \frac{\mu}{2}\|y - x\|^2$$
for all $y \in \mathcal{X}$.
The domain of $f$ is the set $\operatorname{dom} f = \{x \in \mathcal{X} : f(x) < +\infty\}$, while the domain of $\partial f$ is the set $\operatorname{dom} \partial f = \{x \in \mathcal{X} : \partial f(x) \ne \emptyset\}$. Clearly $\operatorname{dom} \partial f \subset \operatorname{dom} f$; in fact if $f$ is convex, l.s.c. and proper, then $\operatorname{dom} \partial f$ is dense in $\operatorname{dom} f$ (Ekeland and Témam 1999). In finite dimensions, one can show that for a proper convex function, $\operatorname{dom} \partial f$ contains at least the relative interior of $\operatorname{dom} f$ (that is, the interior in the vector subspace which is generated by $\operatorname{dom} f$).
We must mention here that subgradients of convex l.s.c. functions are only a particular class of maximal monotone operators, which are multivalued operators $T : \mathcal{X} \to \mathcal{P}(\mathcal{X})$ such that
$$\langle p - q, x - y\rangle \ge 0 \quad \text{for all } (x, y) \in \mathcal{X}^2,\ p \in Tx,\ q \in Ty, \tag{3.5}$$
and whose graph $\{(x, p) : p \in Tx\} \subset \mathcal{X} \times \mathcal{X}$ is maximal (with respect to inclusion) in the class of graphs of operators which satisfy (3.5). Strongly
monotone and co-coercive monotone operators are defined accordingly. It is
also almost obvious from the definition that any maximal monotone operator
T has an inverse T −1 defined by x ∈ T −1 p ⇔ p ∈ T x, which is also maximal
monotone. The operators ∂f and ∂f ∗ are inverse in this sense. Examples
of maximal monotone operators which are not subgradients of a convex
function are given by skew-symmetric operators. See, for instance, Brézis
(1973) for a general study of maximal monotone operators in Hilbert spaces.
$$x = \operatorname{prox}_{\tau f}(x) + \tau \operatorname{prox}_{f^*/\tau}(x/\tau), \tag{3.8}$$
which in fact holds for any maximal monotone operators $T, T^{-1}$. It shows in particular that if we know how to compute $\operatorname{prox}_{\tau f}$, then we also know how to compute $\operatorname{prox}_{f^*/\tau}$. Finally, we will sometimes let $\operatorname{prox}^M_{\tau f}(x)$ denote the proximity operator computed in the metric $M$, that is, the solution of
$$\min_{y \in \mathcal{X}} \; f(y) + \frac{1}{2\tau}\|y - x\|_M^2.$$
where
$$f : \mathcal{Y} \to (-\infty, +\infty], \qquad g : \mathcal{X} \to (-\infty, +\infty]$$
are convex l.s.c. functions and $K : \mathcal{X} \to \mathcal{Y}$ is a bounded linear operator. Then, since $f = f^{**}$, one can write
$$\min_{x \in \mathcal{X}} f(Kx) + g(x) = \min_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} \langle y, Kx\rangle - f^*(y) + g(x).$$
which is always non-negative (even if the min and sup cannot be swapped),
vanishes if and only if (x, y) is a saddle point.
Finally we remark that
$$T\begin{pmatrix} x \\ y \end{pmatrix} := \begin{pmatrix} \partial g(x) \\ \partial f^*(y) \end{pmatrix} + \begin{pmatrix} 0 & K^* \\ -K & 0 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} \tag{3.15}$$
is a maximal monotone operator, being the sum of two maximal monotone operators, only one of which is a subgradient, and the conditions above can be written
$$T\begin{pmatrix} x^* \\ y^* \end{pmatrix} \ni 0.$$
Example 3.1 (dual of the ROF model). As an example, consider the
minimization problem (2.6) above. This problem has the general form (3.9),
with $x = u$, $K = D$, $f = \lambda\|\cdot\|_{p,1}$ and $g = \|\cdot{} - u^\diamond\|^2/2$. Hence the dual problem (3.11) reads
$$\max_p \; -f^*(p) - \frac12\|D^*p\|^2 + \langle D^*p, u^\diamond\rangle = -\min_p \Bigl(f^*(p) + \frac12\|D^*p - u^\diamond\|^2\Bigr) + \frac12\|u^\diamond\|^2,$$
where $p \in \mathbb{R}^{m\times n\times 2}$ is the dual variable. Equation (3.13) shows that the
solution u of the primal problem is recovered from the solution p of the
dual by letting $u = u^\diamond - D^*p$. One interesting observation is that the dual
ROF model, with f ∗ being a norm, has almost exactly the same structure
as the Lasso problem (2.2).
In this example, f is a norm, so f ∗ is the indicator function of the polar
ball: in this case the dual variable has the structure p = (p1,1 , . . . , pm,n ),
where pi,j = (pi,j,1 , pi,j,2 ) is the per pixel vector-valued dual variable, and
therefore
$$f^*(p) = \delta_{\{\|\cdot\|_{q,\infty} \le \lambda\}}(p) = \begin{cases} 0 & \text{if } |p_{i,j}|_q \le \lambda \text{ for all } i,j, \\ +\infty & \text{else,} \end{cases} \tag{3.16}$$
where q is the parameter of the polar norm ball which is defined via 1/p +
1/q = 1. The most relevant cases are p = 1 or p = 2. In the first case we
have q = +∞, so the corresponding constraint reads
$$|p_{i,j}|_\infty = \max\{|p_{i,j,1}|,\ |p_{i,j,2}|\} \le \lambda \quad \text{for all } i,j.$$
In the second case we have q = 2, and the corresponding constraint reads
$$|p_{i,j}|_2 = \sqrt{p_{i,j,1}^2 + p_{i,j,2}^2} \le \lambda \quad \text{for all } i,j.$$
Of course, more complex norms can be used, such as the nuclear norm for
colour images. In this case the per pixel dual variable pi,j will be matrix-
valued (or tensor-valued) and should be constrained to have its spectral
(operator) norm less than λ, for all i, j. See Section 7.3 for an example and
further details.
In practice, we will (improperly) use ‘dual problem’ to denote the mini-
mization problem
$$\min_p \Bigl\{ \frac12\|D^*p - u^\diamond\|^2 : |p_{i,j}|_q \le \lambda \ \text{for all } i,j \Bigr\}, \tag{3.17}$$
which is essentially a projection problem. For this problem, it is interesting
to observe that the primal–dual gap
$$\begin{aligned}
\mathcal{G}(u,p) &= f(Du) + \frac12\|u - u^\diamond\|^2 + f^*(p) + \frac12\|D^*p\|^2 - \langle D^*p, u^\diamond\rangle \\
&= \lambda\|Du\|_{p,1} + \delta_{\{\|\cdot\|_{q,\infty}\le\lambda\}}(p) - \langle p, Du\rangle + \frac12\|u^\diamond - D^*p - u\|^2
\end{aligned} \tag{3.18}$$
4. Gradient methods
The first family of methods we are going to describe is that of first-order
gradient descent methods. It might seem a bit strange to introduce such
simple and classical tools, which might be considered outdated. However, as
mentioned in the Introduction, the most efficient way to tackle many sim-
ple problems in imaging is via elaborate versions of plain gradient descent
schemes. In fact, as observed in the 1950s, such methods can be consider-
ably improved by adding inertial terms or performing simple over-relaxation
steps (or less simple steps, such as Chebyshev iterations for matrix inversion:
Varga 1962), line-searches, or more elaborate combinations of these, such
as conjugate gradient descent; see for instance Polyak (1987, Section 3.2) or
Bertsekas (2015, Section 2.1). Also, if second-order information is available,
Newton’s method or quasi-Newton variants such as the (l-)BFGS algorithm
(Byrd, Lu, Nocedal and Zhu 1995) can be used, and are known to converge
very fast. However, for medium/large non-smooth problems such as those
described above, such techniques are not always convenient. It is now ac-
knowledged that, if not too complex to implement, then simpler iterations,
which require fewer operations and can sometimes even be parallelized, will
generally perform better for a wide class of large-dimensional problems, such
as those considered in this paper.
In particular, first-order iterations can be accelerated by many simple
tricks such as over-relaxation or variable metrics – for instance Newton’s
method – but this framework can be transferred to fairly general schemes
(Vũ 2013b, Combettes and Vũ 2014), and since the seminal contribution
and let us first assume that f is differentiable. Then, the most straightfor-
ward approach to solving the problem is to implement a gradient descent
scheme with fixed step size τ > 0: see Algorithm 1. The major issue is
that this will typically not work if f is not sufficiently smooth. The natural
assumption is that ∇f is Lipschitz with some constant L, and 0 < τ L < 2.
If $\tau$ is too large, this method will oscillate: if for instance $f(x) = x^2/2$, then $x^{k+1} = (1-\tau)x^k$, and it is obvious that this recursion converges if and only if $\tau < 2$. On the other hand, a Taylor expansion shows that
$$f(x - \tau\nabla f(x)) \le f(x) - \tau\Bigl(1 - \frac{\tau L}{2}\Bigr)\|\nabla f(x)\|^2,$$
so that if $\tau < 2/L$, then we see both that $f(x^k)$ is a strictly decreasing sequence (unless $\nabla f(x^k) = 0$ at some point) and that $\sum_k \|\nabla f(x^k)\|^2 < +\infty$
if f is bounded from below. If f is, in addition, coercive (with bounded level
sets), it easily follows in the finite dimensional setting that f (xk ) converges
to a critical value and that every converging subsequence of (xk )k≥0 goes
to a critical point. If $f$ is convex, then $x \mapsto x - \tau\nabla f(x)$ is also a (weak) contraction, which shows that $\|x^k - x^*\|$ is also non-increasing, for any minimizer⁴ $x^*$ of $f$. In this case we can deduce the convergence of the whole
⁴ We shall always assume the existence of at least one minimizer, here and elsewhere.
sequence (xk )k to a solution, if 0 < τ < 2/L. In fact, this is a particular case
of the fairly general theory of averaged operators, for which such iterations
converge: see Theorem A.1 in the Appendix for details and references.
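The role of the bound $\tau < 2/L$ is easy to observe on the quadratic example above ($f(x) = x^2/2$, so $L = 1$ and $x^{k+1} = (1-\tau)x^k$). A minimal sketch (our own code, not an algorithm from the text):

```python
def grad_descent(df, x0, tau, n_iter):
    # plain gradient descent with fixed step size tau
    x = x0
    for _ in range(n_iter):
        x = x - tau * df(x)
    return x

df = lambda x: x                           # gradient of f(x) = x^2 / 2, L = 1
x_ok = grad_descent(df, 1.0, 0.5, 100)     # tau < 2/L: x^k = (1 - tau)^k -> 0
x_bad = grad_descent(df, 1.0, 2.5, 100)    # tau > 2/L: |x^k| = 1.5^k blows up
```

With $\tau = 0.5$ the iterates contract geometrically to the minimizer, while with $\tau = 2.5$ they oscillate with exploding amplitude, exactly as the recursion $x^{k+1} = (1-\tau)x^k$ predicts.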
⁵ This is an extension of Theorem A.1; see also the references cited there.
since the 1970s (Martinet 1970, Rockafellar 1976), and in fact many of the
methods we consider later on are special instances. Convergence proofs and
rates of convergence can be found, for instance, in Brézis and Lions (1978) (these require $\sum_k \tau_k^2 = +\infty$, but $\sum_k \tau_k = +\infty$ is sufficient if $T = \partial f$); see
also the work of Güler (1991) when T = ∂f . In fact some of the results
mentioned in Section 4.7 below will apply to this method as a particular
case, when T = ∂f , extending some of the results of Güler (1991).
Fairly general convergence rates for gradient methods are given in the
rich book of Bertsekas (2015, Propositions 5.1.4, 5.1.5), depending on the
behaviour of f near the set of minimizers. In the simplest case of the descent
(4.2) applied to a function f with L-Lipschitz gradient, the convergence rate
is found in many other textbooks (e.g. Nesterov 2004) and reads as follows.
Theorem 4.1. Let $x^0 \in \mathcal{X}$ and $x^k$ be recursively defined by (4.2), with $\tau \le 1/L$. Then not only does $(x^k)_k$ converge to a minimizer, but the value $f(x^k)$ decays with the rate
$$f(x^k) - f(x^*) \le \frac{1}{2\tau k}\|x^* - x^0\|^2,$$
where $x^*$ is any minimizer of $f$. If in addition $f$ is strongly convex with parameter $\mu_f > 0$, we have
$$f(x^k) - f(x^*) + \frac{1}{2\tau}\|x^k - x^*\|^2 \le \frac{\omega^k}{2\tau}\|x^0 - x^*\|^2,$$
where $\omega = (1 - \tau\mu_f) < 1$.
A short (standard) proof is given in Appendix B.
Remark 4.2. This form of the result is slightly suboptimal, allowing a
very elementary proof in Appendix B. However, it can be checked that the
first rate holds for larger steps τ < 2/L, while the second can be improved
by taking larger steps (τ = 2/(L + µf )), yielding linear convergence with a
factor ω = (1 − µf /L)/(1 + µf /L); see for instance Nesterov (2004, Theo-
rem 2.1.15). However, we will see very soon that this too can be improved.
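The $O(1/k)$ bound of Theorem 4.1 can be checked numerically on a simple quadratic. The following sketch is ours (the matrix, sizes and seed are arbitrary choices); the minimizer is $x^* = 0$, so the bound reads $f(x^k) \le \|x^0\|^2/(2\tau k)$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
H = A.T @ A + 0.1 * np.eye(20)       # Hessian of f(x) = 0.5 x^T H x; x* = 0
L = np.linalg.eigvalsh(H).max()      # Lipschitz constant of the gradient
f = lambda z: 0.5 * z @ H @ z
tau = 1.0 / L                        # step size as in Theorem 4.1
x0 = np.ones(20)
x = x0.copy()
for k in range(1, 201):
    x = x - tau * (H @ x)            # gradient step (4.2)
    # Theorem 4.1: f(x^k) - f(x^*) <= ||x^* - x^0||^2 / (2 tau k)
    assert f(x) <= x0 @ x0 / (2 * tau * k) + 1e-9
```

In practice the observed decay on this strongly convex instance is much faster than $O(1/k)$, consistent with the linear rate in the second part of the theorem.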
Of course, the observations above show that similar rates will also hold
for the implicit form (4.7): indeed, recalling that $f_\tau(x^*) = f(x^*)$ for any $\tau > 0$, we have that a bound on $f_\tau(x^k) - f_\tau(x^*)$ is, by definition, also a bound on
$$f(x^{k+1}) - f(x^*) + \frac{\|x^{k+1} - x^k\|^2}{2\tau}.$$
We remark that in this implicit case it would seem that we only have to
choose the largest possible τ to solve the minimization accurately. We will
see further (Example 3.1) that in practice, we are not always free to choose
the step or the metric which makes the algorithm actually implementable. In
other situations the choice of the step might eventually result in a trade-off
between the precision of the computation, the overall rate and the complex-
ity of one single iteration (which should also depend on τ ).
(where a slightly different function is used: see Theorems 2.1.7 and 2.1.13).
If µ = 0, using (possible translates of) the function (4.12), which is very ill
conditioned (and degenerate if defined in dimension n > p), the following
general lower bound for smooth convex optimization can be shown.
Theorem 4.3. For any x0 ∈ Rn , L > 0, and k < n there exists a convex,
continuously differentiable function f with L-Lipschitz-continuous gradient,
such that for any first-order algorithm satisfying (4.11), we have
$$f(x^k) - f(x^*) \ge \frac{L\,\|x^0 - x^*\|^2}{8(k+1)^2}, \tag{4.13}$$
where x∗ denotes a minimizer of f .
This particular bound is reached by considering the function in (4.12)
with p = k + 1, and an appropriate change of variable which moves the
starting point to the origin. Observe that the above lower bound is valid
only if the number of iterates k is less than the problem size. We cannot
improve this with a quadratic function, as the conjugate gradient method
(which is a first-order method) is then known to find the global minimizer
after at most n steps. But the practical problems we encounter in imaging
are often so large that we will never be able to perform as many iterations
as the dimension of the problem.
If choosing µ > 0 so that the function (4.12) becomes µ-strongly convex,
a lower bound for first-order methods is given in Theorem 2.1.13 of Nesterov
(2004), which reads as follows.
Theorem 4.4. For any $x^0 \in \mathbb{R}^\infty \simeq \ell^2(\mathbb{N})$ and $\mu, L > 0$ there exists a
µ-strongly convex, continuously differentiable function f with L-Lipschitz-
continuous gradient, such that, for any algorithm in the class of first-order
algorithms defined by (4.11), we have
$$f(x^k) - f(x^*) \ge \frac{\mu}{2}\Bigl(\frac{\sqrt{q} - 1}{\sqrt{q} + 1}\Bigr)^{2k}\|x^0 - x^*\|^2 \tag{4.14}$$
for all $k$, where $q = L/\mu \ge 1$ is the condition number, and $x^*$ is the minimizer
of f .
In finite dimensions, one can adapt the proof of Nesterov (2004) to show
the same result for sufficiently small k, with respect to n. It is important to
bear in mind that these lower bounds are inevitable for any first-order algo-
rithm (assuming the functions are ‘no better’ than with L-Lipschitz gradient
and µ-strongly convex). Of course, one could ask if these lower bounds are
not too pessimistic, and whether such hard problems will appear in prac-
tice. We will indeed see that these lower bounds are highly relevant to our
algorithms, and are observed when minimizing relatively simple problems
such as the ROF model. Let us mention that many other types of interest-
ing lower bounds can be found in the literature for most of the algorithmic
techniques described in this paper, and a few others; see in particular the
recent and fairly exhaustive study by Davis and Yin (2014a).
and conjugate gradient (CG), together with the lower bound for smooth
optimization provided in (4.13). The results show that AGD is significantly
faster than GD. For comparison we also applied CG, which is known to be
an optimal method for quadratic optimization and provides convergence, in
finitely many steps, to the true solution, in this case after at most k = 100
iterations. Observe that CG exactly touches the lower bound at k = 99
(black cross), which shows that the lower bound is sharp for this problem.
Before and after k = 99, however, the lower bound is fairly pessimistic.
⁶ This point of view is a bit restrictive: it will be seen in Section 4.7 that one can also choose $\tau = 1/\|K\|^2$ – or even $\tau < 2/\|K\|^2$ for simple descent with fixed steps.
(Figure: (a) the entries $x_i$ of the iterates $x^{10000}$ of GD and AGD, compared with the true solution $x^*$, plotted against the index $i$; (b) $f(x^k) - f(x^*)$ versus the iteration $k$ for GD, AGD and CG, together with the theoretical rates of GD and AGD and the lower bound.)
metric)
$$\hat x = \operatorname{prox}_{\tau g}\Bigl(\bar x - \tau \nabla\bigl(\tfrac12\|K\cdot{} - x^\diamond\|^2\bigr)(\bar x)\Bigr),$$
combining a step of implicit ('backward') gradient descent for $g$ and a step of explicit ('forward') gradient descent for the smooth part $\tfrac12\|K\cdot{} - x^\diamond\|^2$ of
(4.16). This is a particular case of a more general gradient descent algorithm
which mixes the two points of view explained so far, and which we describe
in Section 4.7 below.
These first elementary convergence results can already be applied to quite
important problems in imaging and statistics. We first consider plain gra-
dient descent for the primal ROF problem and then show how we can use
implicit descent to minimize the dual of the ROF problem (3.17), which has
the same structure as the Lasso problem (2.2).
Example 4.7 (minimizing the primal ROF model). In this example
we consider gradient descent methods to minimize the primal ROF model,
in (2.6), for p = 2. As mentioned above, this will work only if the gradient
of our energy is Lipschitz-continuous, which is not the case for (2.6). Hence
we consider a smoothed version of the total variation, which is obtained
by replacing the norm kDuk2,1 , which is singular at 0, with a smoothed
approximation; this means in practice that we solve a different problem,
but we could theoretically estimate how far the solution to this problem is
from the solution to the initial problem. A classical choice is
$$\sum_{i,j} \sqrt{\varepsilon^2 + (Du)_{i,j,1}^2 + (Du)_{i,j,2}^2},$$
We implement the gradient descent algorithm (4.2) using a constant step size
τ = 2/(L + µ) and apply the algorithm to Example 2.1. Figure 4.2 shows
the convergence of the primal–dual gap using different values of ε. Since
the objective function is smooth and strongly convex, the gradient descent
converges linearly. However, for smaller values of ε, where the smoothed
ROF model approaches the original ROF model, the convergence of the al-
gorithm becomes very slow. The next example shows that it is actually a
better idea to minimize the dual of the ROF model.
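In outline, the smoothed scheme just described could look as follows (a sketch in our own code: the finite-difference operators `D`, `Dt`, the synthetic data and the sizes are our stand-ins; the Lipschitz constant is bounded using $\|D\|^2 \le 8$ from (2.5), and $\mu = 1$ comes from the quadratic data term):

```python
import numpy as np

def D(u):                                  # right (forward) differences
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return gx, gy

def Dt(gx, gy):                            # adjoint of D (negative divergence)
    u = np.zeros_like(gx)
    u[:-1, :] -= gx[:-1, :]; u[1:, :] += gx[:-1, :]
    u[:, :-1] -= gy[:, :-1]; u[:, 1:] += gy[:, :-1]
    return u

def energy(u, u0, lam, eps):               # smoothed ROF objective
    gx, gy = D(u)
    return lam * np.sqrt(eps**2 + gx**2 + gy**2).sum() + 0.5 * ((u - u0)**2).sum()

rng = np.random.default_rng(0)
u0 = rng.standard_normal((64, 64))         # synthetic noisy image
lam, eps = 0.1, 0.05
L = 1.0 + 8.0 * lam / eps                  # Lipschitz bound for the gradient
tau = 2.0 / (L + 1.0)                      # tau = 2/(L + mu), as in the text
u = u0.copy()
for _ in range(200):
    gx, gy = D(u)
    w = np.sqrt(eps**2 + gx**2 + gy**2)
    u = u - tau * (lam * Dt(gx / w, gy / w) + (u - u0))
```

Note how the admissible step size degrades like $\varepsilon/\lambda$ as the smoothing parameter shrinks, which is precisely the slowdown visible in Figure 4.2.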
(Plot: primal–dual gap versus iterations, for $\varepsilon = 0.01$, $\varepsilon = 0.05$ and $\varepsilon = 0.001$.)
Figure 4.2. Minimizing the primal ROF model using smoothed (Huber) total vari-
ation applied to the image in Figure 2.1. The figure shows the convergence of the
primal–dual gap using plain gradient descent for different settings of the smoothing
parameter ε.
Example 4.8 (minimizing the dual ROF model). Let us turn to the
problem of minimizing the dual ROF model using the explicit representation
of the Moreau–Yosida envelope. We consider (4.16) with $K = D$ and $g = \delta_{\{\|\cdot\|_{2,\infty}\le\lambda\}}$. The Moreau–Yosida regularization is given by
$$f_M(\bar p) := \min_p \frac12\|p - \bar p\|_M^2 + \frac12\|D^*p - u^\diamond\|^2 + \delta_{\{\|\cdot\|_{2,\infty}\le\lambda\}}(p), \tag{4.22}$$
with $\tau'$ such that $M = (1/\tau')\,\mathrm{I} - DD^* > 0$, and the minimum of the right-hand side is attained for
$$\hat p = \Pi_{\{\|\cdot\|_{2,\infty}\le\lambda\}}\bigl(\bar p - \tau' D(D^*\bar p - u^\diamond)\bigr),$$
where $\Pi_{\{\|\cdot\|_{2,\infty}\le\lambda\}}$ denotes the (pixelwise) orthogonal projection onto 2-balls with radius $\lambda$, that is, for each pixel $i,j$, the projection is computed by
$$\hat p = \Pi_{\{\|\cdot\|_{2,\infty}\le\lambda\}}(\tilde p) \;\Longleftrightarrow\; \hat p_{i,j} = \frac{\tilde p_{i,j}}{\max\{1,\ \lambda^{-1}|\tilde p_{i,j}|_2\}}. \tag{4.23}$$
As shown before, the gradient in the $M$-metric is given by
$$\nabla f_M(\bar p) = \bar p - \hat p. \tag{4.24}$$
The advantages of minimizing the dual ROF model rather than the pri-
mal ROF model as in Example 4.7 are immediate. First, thanks to the implicit
smoothing of the Moreau–Yosida regularization, we do not need to artifi-
cially smooth the objective function and hence any gradient method will
converge to the exact minimizer. Second, the step size of a gradient method
will just depend on $\|D\|$, whereas the step size of a gradient method applied to the primal ROF model is proportional to the smoothing parameter $\varepsilon$. We implement both a standard gradient descent (GD) with step size $\tau = 1.9$ and the accelerated gradient descent (AGD) with step size $\tau = 1$. The parameter $\tau'$ in the $M$-metric is set to $\tau' = 0.99/\|D\|^2$.
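In outline, the GD variant could be sketched as follows (our own code: `D`, `Dt` and the random stand-in for $u^\diamond$ are ours; the projection implements (4.23), and the update $p \leftarrow p - \tau(p - \hat p)$ is the gradient step (4.24) in the $M$-metric with $\tau = 1.9$):

```python
import numpy as np

def D(u):                                  # forward differences
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return gx, gy

def Dt(gx, gy):                            # adjoint D*, maps dual to primal
    u = np.zeros_like(gx)
    u[:-1, :] -= gx[:-1, :]; u[1:, :] += gx[:-1, :]
    u[:, :-1] -= gy[:, :-1]; u[:, 1:] += gy[:, :-1]
    return u

def proj(px, py, lam):                     # pixelwise projection (4.23)
    s = np.maximum(1.0, np.sqrt(px**2 + py**2) / lam)
    return px / s, py / s

rng = np.random.default_rng(0)
u0 = rng.standard_normal((32, 32))         # stand-in for the noisy image
lam, tau, taup = 0.1, 1.9, 0.99 / 8.0      # tau' = 0.99 / ||D||^2, ||D||^2 <= 8
px = np.zeros_like(u0); py = np.zeros_like(u0)
for _ in range(300):
    gx, gy = D(Dt(px, py) - u0)            # gradient of 0.5 ||D*p - u0||^2
    hx, hy = proj(px - taup * gx, py - taup * gy, lam)   # p_hat
    px, py = px - tau * (px - hx), py - tau * (py - hy)  # GD step in M-metric
u = u0 - Dt(px, py)                        # primal solution, u = u0 - D*p
```

No smoothing parameter appears anywhere: the step sizes depend only on $\|D\|$, which is the point of working with the dual.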
Since we are dealing with a smooth, unconstrained optimization in (4.22),
we can also try to apply a black-box algorithm, which only needs information
about the gradients and the function values. A very popular algorithm is the
limited memory BFGS quasi-Newton method (Byrd et al. 1995, Zhu, Byrd,
Lu and Nocedal 1997, Morales and Nocedal 2011). We applied a 1-memory
variant of the l-BFGS algorithm⁷ to the Moreau–Yosida regularization of
the dual ROF model and supplied the algorithm with function values (4.22)
(using the correct values of p̂) and gradients (4.24). The idea of using vari-
able metric approaches to the Moreau–Yosida regularization of the operator
has been investigated in many papers (Bonnans, Gilbert, Lemaréchal and
Sagastizábal 1995, Burke and Qian 1999, Burke and Qian 2000) and can lead
to very fast convergence under simple smoothness assumptions. However,
it is not always suitable or easily implementable for many of the problems
we address in this paper.
The plot in Figure 4.3 represents the decay of the primal–dual gap (which
bounds the energy and the ℓ2 -error) obtained from gradient descent (GD),
accelerated gradient descent (AGD) and the limited memory BFGS quasi-
Newton method (l-BFGS). It appears that the energy actually decreases
faster for the accelerated method and the quasi-Newton method, with no
clear advantage of one over the other (the first being of course simpler to im-
plement). Also observe that both AGD and l-BFGS are only slightly faster
than the lower bound $O(1/k^2)$ for smooth convex optimization. This seems to show that the dual ROF model is already quite a hard optimization
problem. We should mention here that the idea of applying quasi-Newton
methods to a regularized function as in this example has been recently
extended to improve the convergence of some of the methods introduced
later in this paper, namely the forward-backward and Douglas-Rachford
splittings, with very interesting results: see Patrinos, Stella and Bemporad
(2014), Stella, Themelis and Patrinos (2016).
⁷ We used S. Becker's MATLAB wrapper of the implementation at http://users.iems.northwestern.edu/~nocedal/lbfgsb.html.
(Figure 4.3: primal–dual gap versus iterations for GD, AGD and l-BFGS, together with the $O(1/k)$ and $O(1/k^2)$ reference rates.)
the smooth case. Naturally, a rate of convergence for the errors is required
to obtain an improved global rate.
Proofs of both Theorems 4.9 and 4.10 are given in Appendix B, where
more cases are discussed, including more possibilities for the choice of the
parameters. They rely on the following essential but straightforward descent
rule.⁸ Let $\hat x = T_\tau \bar x$. Then, for all $x \in \mathcal{X}$,
$$F(x) + (1 - \tau\mu_f)\frac{\|x - \bar x\|^2}{2\tau} \ge \frac{1 - \tau L}{\tau}\,\frac{\|\hat x - \bar x\|^2}{2} + F(\hat x) + (1 + \tau\mu_g)\frac{\|x - \hat x\|^2}{2\tau}. \tag{4.36}$$
In particular, if $\tau L \le 1$,
$$F(x) + (1 - \tau\mu_f)\frac{\|x - \bar x\|^2}{2\tau} \ge F(\hat x) + (1 + \tau\mu_g)\frac{\|x - \hat x\|^2}{2\tau}. \tag{4.37}$$
The proof is elementary, especially if we follow the lines of the presentation
⁸ This rule – or some variant of it – is of course found in almost all papers on first-order descent methods.
Remark 4.11. One can more precisely deduce from this computation that
$$F(x) + (1 - \tau\mu_f)\frac{\|x - \bar x\|^2}{2\tau} \ge F(\hat x) + (1 + \tau\mu_g)\frac{\|x - \hat x\|^2}{2\tau} + \frac{\|\hat x - \bar x\|^2}{2\tau} - D_f(\hat x, \bar x), \tag{4.38}$$
where $D_f(x, y) := f(x) - f(y) - \langle\nabla f(y), x - y\rangle \le (L/2)\|x - y\|^2$ is the 'Bregman $f$-distance' from $y$ to $x$ (Brègman 1967). In particular, (4.37) holds once
$$D_f(\hat x, \bar x) \le \frac{\|\hat x - \bar x\|^2}{2\tau},$$
which is always true if τ ≤ 1/L but might also occur in other situations, and
in particular, be tested ‘on the fly’ during the iterations. This allows us to
implement efficient backtracking strategies of the type of Armijo (1966) (see
Nesterov 1983, Nesterov 2013, Beck and Teboulle 2009) for the algorithms
described in this section when the Lipschitz constant of f is not a priori
known.
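This 'on the fly' test is simple to implement. A sketch in our own code (the function and variable names are ours, and the Lasso-type instance is an arbitrary illustration): the step is halved whenever the Bregman test $D_f(\hat x, \bar x) \le \|\hat x - \bar x\|^2/(2\tau)$ fails, so no Lipschitz constant is needed in advance.

```python
import numpy as np

def fb_backtracking(f, grad_f, prox_g, x, tau, n_iter, shrink=0.5):
    """Forward-backward descent; tau is decreased whenever the Bregman
    test D_f(x_hat, x_bar) <= ||x_hat - x_bar||^2 / (2 tau) fails."""
    for _ in range(n_iter):
        while True:
            x_hat = prox_g(x - tau * grad_f(x), tau)
            d = x_hat - x
            breg = f(x_hat) - f(x) - grad_f(x) @ d   # D_f(x_hat, x_bar)
            if breg <= d @ d / (2.0 * tau) + 1e-12:
                break                                # (4.37) is guaranteed
            tau *= shrink                            # halve the step, retry
        x = x_hat
    return x

# Lasso-type instance: f(z) = 0.5 ||Az - b||^2 smooth, g = lam ||.||_1
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
lam = 0.5
f = lambda z: 0.5 * np.sum((A @ z - b) ** 2)
grad_f = lambda z: A.T @ (A @ z - b)
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - lam * t, 0.0)
x = fb_backtracking(f, grad_f, soft, np.zeros(10), tau=10.0, n_iter=200)
```

Since the test passes automatically once $\tau \le 1/L$, the inner loop always terminates, while steps larger than $1/L$ are kept whenever the local curvature allows it.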
Discussion
The idea of forward–backward splitting is very natural, and appears in many
papers in optimization for imaging: it would not be possible to mention all
the related literature. Historically, it is a generalization of projected gradi-
ent descent, which dates back at least to Goldstein (1964) (see Passty 1979,
Lions and Mercier 1979, Fukushima and Mine 1981). For minimization
problems, it can be viewed as successive minimizations of a parabolic upper
bound of the smooth part added to the non-smooth part. It has been gener-
alized, and popularized in the imaging community by Combettes and Wajs
(2005), yet a few particular forms were already well known, such as iterative
soft-thresholding for the Lasso problem (Daubechies et al. 2004). It is not
always obvious how to choose parameters correctly when they are unknown.
Several backtracking techniques will work, such as those of Nesterov (2013),
for both the Lipschitz constants and strong convexity parameters; see also
Nesterov (1983), Beck and Teboulle (2009) and Bonettini et al. (2015) for
estimates of the Lipschitz constant.
For simpler problems such as Lasso (2.2), convergence of the iterates
(more precisely of $Ax^k$) yields that after some time (generally unknown), the support $\{i : x^*_i \neq 0\}$ of the solution $x^*$ should be detected by the algorithm (under 'generic' conditions). In that case, the objective which is
solved becomes smoother than during the first iterations, and some authors
have succeeded in exploiting this ‘partial smoothness’ to show better (lin-
ear) convergence of the FB descent (Bredies and Lorenz 2008, Grasmair,
Haltmeier and Scherzer 2011, Liang, Fadili and Peyré 2014, Tao, Boley and
Zhang 2015). Liang, Fadili and Peyré (2015a) have extended this approach
to the abstract setting of Appendix A, so that this remark also holds for
some of the saddle-point-type algorithms introduced in Section 5 below.
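This support-identification behaviour is easy to observe numerically; the following sketch (illustrative, with synthetic data) runs plain forward–backward iterations on a small Lasso instance and records the sparsity pattern of each iterate.

```python
import numpy as np

def ista_with_support(A, b, lam, n_iter=500):
    """Plain forward-backward (iterative soft-thresholding) on
    0.5*||Ax-b||^2 + lam*||x||_1, recording the support of each iterate."""
    tau = 1.0 / np.linalg.norm(A, 2) ** 2    # 1/L, L = Lipschitz constant
    x = np.zeros(A.shape[1])
    supports = []
    for _ in range(n_iter):
        x = x - tau * (A.T @ (A @ x - b))                       # forward step
        x = np.sign(x) * np.maximum(np.abs(x) - tau * lam, 0.0) # backward step
        supports.append(frozenset(np.flatnonzero(x)))
    return x, supports

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 12))
x_true = np.zeros(12)
x_true[[2, 7]] = [1.0, -2.0]
b = A @ x_true + 0.01 * rng.standard_normal(30)
x, supports = ista_with_support(A, b, lam=2.0)
```

After an initial transient of unknown length, the recorded support stops changing, and from then on the iteration effectively minimizes a smooth quadratic restricted to a fixed face.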
Another interesting and alternative approach to convergence rates is to
use the ‘Kurdyka–Lojasiewicz’ (KL) inequality, which in practice will bound
a function of the distance of a point to the critical set by the norm of
the (sub)gradient. As shown by Bolte, Daniilidis and Lewis (2006), such a
property will hold for ‘most’ of the functions optimized in practice, including
non-smooth functions, and this can lead to improved convergence rates for
many algorithms (Attouch, Bolte and Svaiter 2013). It is also possible to
derive accelerated schemes for problems with different types of smoothness
(such as Hölder-continuous gradients); see Nesterov (2015).
Finally, a heuristic technique which often works to improve the conver-
gence rate, when the objective is smoother than actually known, consists
simply in ‘restarting’ the method after a certain number of iterations: in
Algorithm 5 (for µ = 0), we start with a new sequence (tk )k letting tk̄ = 1
for some sufficiently large k̄. Ideally, we should restart when we are sure
that the distance of xk̄ to the optimum x∗ (unique if the objective is strongly
convex) has shrunk by a given, sufficiently small factor (but the correspond-
A non-linear analogue of (4.37) will easily follow from (4.41) and the
Lipschitz property of ∇f, which reads

    ‖∇f(x) − ∇f(y)‖_* ≤ L‖x − y‖  for all x, y ∈ X,

where ‖·‖_* is the norm in X′ induced by the norm ‖·‖ of X, with respect to
which ψ is strongly convex. The simple FB descent method (using x̄ = x^k,
x^{k+1} = x̂) will then converge with essentially the same rate (but constants
which depend on the new distance Dψ ); see Tseng (2008) for details. More
interesting is the fact that, again, Tseng (2008) has also introduced accel-
erated variants which reach a convergence rate in O(1/k²), as before (see
also Allen-Zhu and Orecchia 2014). A different way to introduce barriers
and non-linearities for solving (4.25) by smoothing is proposed in Nesterov
(2005), where another O(1/k²) algorithm is introduced.
⁹ See also the version at http://www2.isye.gatech.edu/~nemirovs.
Then it is known (Teboulle 1992, Beck and Teboulle 2003) that the entropy

    ψ(x) := Σ_{i=1}^{n} x_i ln x_i   (and ∇ψ(x) = (1 + ln x_i)_{i=1}^{n})
(see Polyak 1987). For a quadratic function this problem is easily solved,
and it is known that the descent method obtained minimizes the quadratic
function exactly in rank A iterations, where A = ∇²f. It is the fastest
method in this case (Polyak 1987); see the plot ‘CG’ in Figure 4.1. In prac-
tice, this method should be implemented on a sufficiently smooth problem
when the cost of performing a line-search (which requires evaluations of the
function) is not too large; as for non-quadratic problems, the optimal step
cannot be computed in closed form.
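The finite-termination property is easy to check numerically. The sketch below (synthetic data, standard conjugate gradient, not taken from the text) minimizes a quadratic ½⟨Ax, x⟩ − ⟨b, x⟩ whose Hessian A has rank 5, and reaches the exact minimizer in 5 iterations up to rounding error.

```python
import numpy as np

def conjugate_gradient(A, b, n_steps):
    """Standard CG for minimizing 0.5*<Ax,x> - <b,x>, A symmetric PSD."""
    x = np.zeros_like(b)
    r = b.copy()              # residual = -gradient at x = 0
    p = r.copy()
    for _ in range(n_steps):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)    # exact line search along p
        x += alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p          # new A-conjugate direction
        r = r_new
    return x

rng = np.random.default_rng(2)
B = rng.standard_normal((50, 5))
A = B @ B.T                       # Hessian of rank 5
b = A @ rng.standard_normal(50)   # consistent right-hand side
x5 = conjugate_gradient(A, b, n_steps=5)   # rank(A) = 5 iterations suffice
```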
A generalization of the HB algorithm to a strongly convex function given
by the sum of a smooth, twice continuously differentiable function with
Lipschitz-continuous gradient and a non-smooth function, with easily com-
puted proximal map, was investigated for quadratic functions in Bioucas-
Dias and Figueiredo (2007) and for more general smooth functions in Ochs,
Brox and Pock (2015). It is of the form:
    x^{k+1} = prox_{αg}(x^k − α∇f(x^k) + β(x^k − x^{k−1})).        (4.43)
The proximal HB algorithm offers the same optimal convergence rate as
the HB algorithm, but it can be applied only if the smooth function is twice
continuously differentiable. When applicable, it is very efficient; see Figure 4.4 below
for a comparison of this method with other accelerated methods.
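A minimal sketch of such a proximal heavy ball iteration (our illustration, not the implementation used for Figure 4.4) on a synthetic ℓ₁-regularized least-squares problem, with a gradient descent step and deliberately conservative, non-optimal parameters:

```python
import numpy as np

def prox_heavy_ball(A, b, lam, alpha, beta, n_iter=3000):
    """x_{k+1} = prox_{alpha*g}(x_k - alpha*grad f(x_k) + beta*(x_k - x_{k-1}))
    with f(x) = 0.5*||Ax-b||^2 (twice continuously differentiable) and
    g = lam*||.||_1 (prox = soft-thresholding)."""
    x_prev = x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - alpha * (A.T @ (A @ x - b)) + beta * (x - x_prev)
        x_prev, x = x, np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 15))
b = rng.standard_normal(40)
lam = 0.5
L = np.linalg.norm(A, 2) ** 2
# conservative choice: alpha well below 2*(1-beta)/L, moderate inertia beta
x = prox_heavy_ball(A, b, lam, alpha=0.8 / L, beta=0.5)
```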
algorithms recover optimal rates (in particular when specialized to the one-block
case) and allow for descent steps which are optimal for each block (Fercoq
and Richtárik 2013a, 2013b).
4.9. Examples
We conclude this section by providing two examples. In the first example we
consider minimizing the dual of the Huber-ROF problem, which is strongly
convex and can therefore be minimized using accelerated proximal gradient
descent for strongly convex problems. The second example uses the explicit
representation of Moreau–Yosida regularization to transform the dual of an
anisotropic variant of the ROF model into a form consisting of a smooth
plus a non-smooth function, which can be tackled by accelerated forward–
backward splitting.
Example 4.14 (minimizing the dual of Huber-ROF). Let us revisit
the dual of the Huber-ROF model introduced in (4.21):

    min_p  ½‖D*p − u⋄‖² + (ε/(2λ))‖p‖² + δ_{‖·‖_{2,∞} ≤ λ}(p),
Optimization for imaging 47
where u⋄ is again the noisy image of size m × n from Example 2.1, and D
is the (two-dimensional) finite difference operator. This problem is the sum
of a smooth function with Lipschitz-continuous gradient,

    f(p) = ½‖D*p − u⋄‖²,

plus a non-smooth function with easily computed proximal map,

    g(p) = (ε/(2λ))‖p‖² + δ_{‖·‖_{2,∞} ≤ λ}(p).

The gradient of the smooth function is given by

    ∇f(p) = D(D*p − u⋄),
and the proximal map with respect to g, writing µ = ε/λ, is given pixelwise by

    p̂ = prox_{τg}(p̃)  ⇔  p̂_{i,j} = (1 + τµ)⁻¹ p̃_{i,j} / max{1, λ⁻¹(1 + τµ)⁻¹ |p̃_{i,j}|₂}.
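In code, this proximal map is a pixelwise rescaling followed by a projection onto the ℓ²-ball of radius λ; the following sketch (our illustration, operating on a random field p̃ of per-pixel 2-vectors) implements exactly that:

```python
import numpy as np

def prox_huber_rof_dual(p, tau, lam, eps):
    """Pixelwise prox of g(p) = (eps/(2*lam))*||p||^2 + indicator{|p_ij|_2 <= lam}
    for a field p of shape (m, n, 2): rescale by 1/(1 + tau*mu) with
    mu = eps/lam, then project each 2-vector onto the ball of radius lam."""
    q = p / (1.0 + tau * eps / lam)
    norms = np.sqrt(np.sum(q ** 2, axis=-1, keepdims=True))
    return q / np.maximum(1.0, norms / lam)

rng = np.random.default_rng(4)
p_tilde = rng.standard_normal((4, 4, 2))
lam, eps, tau = 0.1, 0.001, 0.5
p_hat = prox_huber_rof_dual(p_tilde, tau, lam, eps)
```

The output is feasible by construction, and it minimizes the prox objective among feasible candidates.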
Let us now apply the Huber-ROF model to the image in Example 2.1 us-
ing the parameters λ = 0.1 and ε = 0.001. We implemented the FISTA
algorithm (Algorithm 5) using the extrapolation parameters corresponding
to µ = 0 and the correct µ = ε/λ. For comparison, we also implemented
the proximal heavy ball algorithm (4.43) and used the optimal parameter
settings
    α = 4 / ((√µ + √(L+µ))² − 4µ),    β = (√(L+µ) − √µ)² / ((√µ + √(L+µ))² − 4µ).
Figure 4.4 shows that it is generally not a good idea to apply the classi-
cal FISTA algorithm using µ = 0 to a strongly convex problem. On the
other hand, applying the FISTA algorithm with the correct settings for the
strong convexity, that is, µ = ε/λ, largely improves the convergence rate of
the algorithm. Interestingly, it turns out that the proximal HB algorithm
converges almost twice as fast as the FISTA algorithm (a rate of ω^{2k} as
opposed to ω^k, with q = L/µ_g and ω = (√q − 1)/(√q + 1)). In fact the proximal HB
algorithm seems to exactly obey the lower bound of first-order algorithms
for strongly convex problems presented in Theorem 4.14.
[Figure 4.4: primal–dual gap versus iterations for FISTA (µ = 0), FISTA (µ = ε/λ) and the proximal HB algorithm, together with the reference rates O(ω^k) and O(ω^{2k}) (lower bound).]
an equivalence does not hold. Consider again the dual of the ROF model:
    min_p  ½‖D*p − u⋄‖² + δ_{‖·‖_∞ ≤ λ}(p),                        (4.46)
which differs slightly from our previous ROF problems by the choice of
the norm constraining the dual variables. First, application of the adjoint
of the finite difference operator to the dual variables p = (p1 , p2 ) can be
decomposed via
    D*p = Σ_{d=1}^{2} D_d* p_d,
where D_d* is the adjoint finite difference operator in the direction d. Second,
by a change of variables t_d = D_d* p_d and using the property that the
constraint on p is also decomposable, we can rewrite the problem in the
equivalent form
    min_{(t_d)_{d=1}^{2}}  ½‖Σ_{d=1}^{2} t_d − u⋄‖² + Σ_{d=1}^{2} δ_{C_d}(t_d),        (4.47)
where
    C_d = {t_d : t_d = D_d* p_d, ‖p_d‖_∞ ≤ λ},  for d = 1, 2.
Hence, as shown in Section 4.8.3, this problem could be easily solved via ac-
celerated alternating minimization in td if we were able to efficiently compute
the proximal maps with respect to δC d (td ). Moreover, we have shown that
the (accelerated) alternating minimization corresponds to an (accelerated)
forward–backward algorithm on the partial Moreau–Yosida regularization
that is obtained by partially minimizing (4.47) with respect to one variable,
hence corresponding to a non-trivial instance of the forward–backward al-
gorithm.
Observe that the characteristic functions of the sets C d are exactly the
convex conjugates of the total variation in each dimension d, that is,
    δ_{C_d}(t_d) = sup_u ⟨u, t_d⟩ − λ‖D_d u‖₁.
In other words, if we were able to solve the proximal maps for one-dimen-
sional total variation problems along chains, we could – thanks to Moreau’s
identity – also efficiently solve the proximal maps for the functions δC d (td ).
As a matter of fact, there exist several direct algorithms that can solve
one-dimensional ROF problems very efficiently, and hence the proximal
maps for one-dimensional total variation. Some of the algorithms even work
in linear time; see Davies and Kovac (2001), Condat (2013a), Johnson (2013)
and Kolmogorov et al. (2016), and references therein.
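To illustrate the mechanism (without reproducing a linear-time solver), the sketch below computes the one-dimensional total variation prox by projected gradient on its dual, a box-constrained quadratic, and then recovers the prox of δ_{C_d} via Moreau's identity prox_{δ_C}(t) = t − prox_{λ‖D·‖₁}(t); the signal and parameters are synthetic.

```python
import numpy as np

def prox_tv1d(t, lam, n_iter=5000):
    """prox of u -> lam*||Du||_1 on a chain ((Du)_i = u_{i+1} - u_i),
    via projected gradient on the dual box-constrained quadratic
    min_{|p_i| <= lam} 0.5*||D^T p - t||^2; then u = t - D^T p.
    (A slow illustrative stand-in for the direct linear-time solvers.)"""
    p = np.zeros(len(t) - 1)
    for _ in range(n_iter):
        dtp = -np.diff(np.concatenate(([0.0], p, [0.0])))    # D^T p
        p = np.clip(p - 0.25 * np.diff(dtp - t), -lam, lam)  # step 1/||D||^2
    return t + np.diff(np.concatenate(([0.0], p, [0.0])))    # u = t - D^T p

t = np.array([0.0, 0.2, 4.0, 3.9, 4.1])
u = prox_tv1d(t, lam=10.0)   # lam large enough: u collapses to the mean of t
prox_conj = t - u            # Moreau: prox of the conjugate delta_{C} at t
```

Here prox_conj = D^T p is an element of C, exactly as described in the text.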
Figure 4.5 presents a comparison between the convergence rates of ac-
celerated block descent (FISTA-chains) applied to (4.47) and a standard
implementation of FISTA applied to (4.46). To solve the one-dimensional
total variation subproblems on chains we used the linear-time dynamic pro-
gramming approach from Kolmogorov et al. (2016). Figure 4.5(a) shows
that in terms of iterations, the accelerated block descent is about 10–20
times as fast. Clearly, one iteration of the accelerated block descent is
computationally more expensive than one iteration of the standard
implementation; in our C++ implementation, one iteration of standard FISTA
was approximately three times as fast as one iteration of the accelerated block
descent. Yet overall the block splitting technique turns out to be more efficient
for a given precision, as shown in Figure 4.5(b). Later, in Section 7.8, we
will come back to a similar example and show how accelerated block descent
can be used to solve large-scale stereo problems.
5. Saddle-point methods
In this section we will briefly describe the main optimization techniques
for finding saddle points, which are commonly used for imaging problems.
The goal of these approaches is, as before, to split a complex problem into
simpler subproblems which are easy to solve – although depending on the
structure and properties of the functions, one form might be more suitable
[Figure 4.5: primal–dual gap for FISTA-chains and FISTA, with the reference rate O(1/k²): (a) versus iterations, (b) versus time in seconds.]
Figure 4.5. Minimizing the dual ROF model applied to the image in Figure 2.1.
This experiment shows that an accelerated proximal block descent algorithm
(FISTA-chains) that exactly solves the ROF problem on horizontal and vertical
chains significantly outperforms a standard accelerated proximal gradient descent
(FISTA) implementation. (a) Comparison based on iterations, (b) comparison
based on the CPU time.
Pock 2015a). We will mention the simplest useful results. These have been
generalized and improved in many ways; see in particular Davis (2015) and
Davis and Yin (2014a, 2014b) for an extensive study of convergence rates,
Chen, Lan and Ouyang (2014a), Ouyang, Chen, Lan and Pasiliao (2015) and
Valkonen and Pock (2015) for optimal methods exploiting partial regularity
of some objectives, and Fercoq and Bianchi (2015) for efficient stochastic
approaches.
The natural order in which to present these algorithms should be to start
with the Douglas–Rachford splitting (Douglas and Rachford 1956; the mod-
ern form we will describe is found in Lions and Mercier 1979) and the
ADMM, which have been used for a long time in non-smooth optimization.
However, since the convergence results for primal–dual methods are in some
sense much simpler and carry on to the other algorithms, we first start by
describing these methods.
Then (this dates back to Arrow, Hurwicz and Uzawa 1958), we alternate a
(proximal) descent in the variable x and an ascent in the dual variable y:
    x^{k+1} = prox_{τg}(x^k − τK*y^k),                             (5.1)
    y^{k+1} = prox_{σf*}(y^k + σKx^{k+1}).                         (5.2)
It is not clear that such iterations will converge. (We can easily convince
ourselves that a totally explicit iteration, with xk+1 above replaced with
xk , will in general not converge.) However, this scheme was proposed in
Zhu and Chan (2008) for problem (2.6) and observed to be very efficient
for this problem, especially when combined with an acceleration strategy
consisting in decreasing τ and increasing σ at each step (e.g., following
the rules in Algorithm 8 below). Proofs of convergence for the Zhu–Chan
method have been proposed by Esser, Zhang and Chan (2010), Bonettini
and Ruggiero (2012) and He, You and Yuan (2014). For a general problem
Algorithm 6 PDHG.
Input: initial pair of primal and dual points (x0 , y 0 ), steps τ, σ > 0.
for all k ≥ 0 do
find (xk+1 , y k+1 ) by solving
    x^{k+1} = prox_{τg}(x^k − τK*y^k),                             (5.3)
    y^{k+1} = prox_{σf*}(y^k + σK(2x^{k+1} − x^k)).                (5.4)
end for
there exist several strategies to modify these iterations into converging
schemes. Popov (1981) proposed incorporating a type of ‘extragradient’
strategy into these iterations, as introduced by Korpelevich (1976, 1983):
the idea is simply to replace y^k with prox_{σf*}(y^k + σKx^k) in (5.1). This
makes the algorithm convergent; moreover, an O(1/k) (ergodic) convergence
rate is shown in Nemirovski (2004) (for a class of schemes including this one,
using also non-linear ‘mirror’ descent steps: see Section 4.8.1). A variant
with similar properties, but not requiring us to compute an additional step
at each iteration, was proposed at roughly the same time by Esser et al.
(2010) (who gave it the name ‘PDHG’¹¹), and Pock, Cremers, Bischof and
Chambolle (2009). The iterations can be written as in Algorithm 6.
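A minimal sketch of Algorithm 6 (our illustration, not the paper's implementation) on a small one-dimensional ROF-type problem min_u ½‖u − b‖² + λ‖Du‖₁, where prox_{σf*} is a componentwise clipping and prox_{τg} is linear:

```python
import numpy as np

def pdhg_tv1d(b, lam, n_iter=10000):
    """PDHG for min_u 0.5*||u-b||^2 + lam*||Du||_1 on a chain:
    K = D, g(u) = 0.5*||u-b||^2, f*(p) = indicator{ |p_i| <= lam }."""
    tau = sigma = 0.99 / 2.0      # tau*sigma*||D||^2 < 1, since ||D||^2 <= 4
    u, p = b.astype(float), np.zeros(len(b) - 1)
    for _ in range(n_iter):
        dtp = -np.diff(np.concatenate(([0.0], p, [0.0])))    # D^T p
        u_new = (u - tau * dtp + tau * b) / (1.0 + tau)      # prox_{tau*g}
        # dual ascent at the over-relaxed point 2*u_new - u:
        p = np.clip(p + sigma * np.diff(2 * u_new - u), -lam, lam)
        u = u_new
    return u

b = np.array([0.0, 0.1, 3.0, 3.1, 2.9])
u = pdhg_tv1d(b, lam=5.0)   # lam large: the solution is the constant mean(b)
```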
The over-relaxation step 2xk+1 − xk = xk+1 + (xk+1 − xk ) can be inter-
preted as an approximate extragradient, and indeed it is possible to show
convergence of this method with a rate which is the same as in Nemirovski
(2004) (see also Chambolle and Pock 2011, 2015a). On the other hand, this
formula might recall similar relaxations present in other standard splitting
algorithms such as the Douglas–Rachford splitting or the ADMM (see Sec-
tions 5.3 and 5.4 below), and indeed, we then see that this algorithm is
merely a variant of these other methods, in a possibly degenerate metric.
He et al. (2014) observed that, letting z = (x, y), the iterations above can
be written as
    M(z^{k+1} − z^k) + Tz^{k+1} ∋ 0,                               (5.5)

where T is the monotone operator in (3.15) and M is the metric

    M = ( (1/τ)I   −K* )
        (  −K    (1/σ)I ),                                         (5.6)

which is positive definite if τσ‖K‖² < 1. Hence, in this form the primal–dual
¹¹ Primal–dual hybrid gradient. More precisely, the algorithm we describe here would
correspond to ‘PDHGMu’ and ‘PDHGMp’ in Esser et al. (2010), while ‘PDHG’ corresponds
to a plain Arrow–Hurwicz alternating scheme such as in Zhu and Chan (2008).
However, for simplicity we will keep the name ‘PDHG’ for the general converging
primal–dual method.
Acceleration
An interesting feature of these types of primal–dual iteration is the fact that
they can be ‘accelerated’ in cases when the objective function has more
regularity. The first case is when g + h (or f*) is strongly convex: see
Algorithm 8. Observe that if f* is µ_f-strongly convex, then x ↦ f(Kx) has an
(L²/µ_f)-Lipschitz gradient, and it is natural to expect that one will be able
to decrease the objective at rate O(1/k²) as before. Similarly, we expect
the same if g or h is strongly convex. This is the result we now state. We
should assume here that g is µ_g-convex, h is µ_h-convex, and µ = µ_g + µ_h > 0.
However, in this case it is no different from assuming that g is µ-convex, as
one can always replace h with h(x) − µ_h‖x‖²/2 (which is convex with (L_h −
µ_h)-Lipschitz gradient ∇h(x) − µ_h x), and g with g(x) + µ_h‖x‖²/2 (whose
proximity operator is as easy to compute as that of g). For notational simplicity,
we will thus restrict ourselves to this latter case – which is equivalent to the
general case upon replacing τ with τ′ = τ/(1 + τµ_h).
Theorem 5.2. Let (x^k, y^k)_{k≥0} be the iterations of Algorithm 8. For each
k, consider the averaged points¹²

    (X^k, Y^k) = (1/T_k) Σ_{i=1}^{k} θ^{−i+1} (x^i, y^i).

¹² This is called an ‘ergodic’ convergence rate.
Preconditioning
As a quick remark, we mention here that it is not always obvious how to
estimate the norm of the matrix L = ‖K‖ precisely and efficiently, without
which we cannot choose parameters correctly. An interesting use of gen-
eral preconditioners is suggested by Bredies and Sun (2015a, 2015b) for the
variants of the algorithm described in the next few sections. The main dif-
ficulty is that if the metric is changed, f and g might no longer be ‘simple’.
A simpler approach is suggested in Pock and Chambolle (2011), for prob-
lems where a diagonal preconditioning does not alter the property that the
proximal operators of f and g are easy to compute. Let us briefly describe
a variant which is very simple and allows for a large choice of diagonal pre-
conditioners. If we assume h = 0, then the PDHG algorithm of Theorem 5.1
can be written equivalently as a proximal-point iteration such as (5.5) (He
et al. 2014). Changing the metric means replacing M in (5.6) with
    M′ = ( T⁻¹   −K* )
         ( −K    Σ⁻¹ ),
where T and Σ are positive definite symmetric matrices. This means that
the prox operators in iteration (5.8) must be computed in the new met-
rics T⁻¹ and Σ⁻¹: in other words, the points (x̂, ŷ) are replaced with the
solutions of
    min_x ½‖x − x̄‖²_{T⁻¹} + ⟨ỹ, Kx⟩ + g(x),    min_y ½‖y − ȳ‖²_{Σ⁻¹} − ⟨y, Kx̃⟩ + f*(y).
The following strategy, which extends the choice in Pock and Chambolle
(2011), allows us to design matrices Σ and T such that this holds. We
assume here that X = Rn and Y = Rm , m, n ≥ 1, so K is an (m × n)-
matrix.
Lemma 5.5. Let (τ̃_i)_{1≤i≤n} and (σ̃_j)_{1≤j≤m} be arbitrary positive numbers,
and α ∈ [0, 2]. Then let T = diag(τ₁, …, τ_n) and Σ = diag(σ₁, …, σ_m), where

    τ_i = τ̃_i / (Σ_{j=1}^{m} σ̃_j |K_{j,i}|^{2−α}),    σ_j = σ̃_j / (Σ_{i=1}^{n} τ̃_i |K_{j,i}|^α).
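In code these preconditioners are cheap to form, and one can check numerically the property they are designed for (following Pock and Chambolle 2011), namely that ‖Σ^{1/2} K T^{1/2}‖ ≤ 1; the matrix below is random, and we take τ̃ = σ̃ = 1 for simplicity.

```python
import numpy as np

def diag_precond(K, alpha=1.0):
    """Diagonal step sizes in the spirit of Lemma 5.5 (with tilde weights = 1):
    tau_i = 1/sum_j |K_ji|^(2-alpha), sigma_j = 1/sum_i |K_ji|^alpha."""
    tau = 1.0 / np.sum(np.abs(K) ** (2.0 - alpha), axis=0)
    sigma = 1.0 / np.sum(np.abs(K) ** alpha, axis=1)
    return tau, sigma

rng = np.random.default_rng(5)
K = rng.standard_normal((7, 9))
tau, sigma = diag_precond(K, alpha=1.0)
# the scaled operator Sigma^(1/2) K T^(1/2) has norm at most 1 by construction
op_norm = np.linalg.norm(np.sqrt(sigma)[:, None] * K * np.sqrt(tau)[None, :], 2)
```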
[Figure 5.1: primal–dual gap versus iterations for PDHG, aPDHG and FISTA, with the reference rate O(1/k²).]
Figure 5.1. Minimizing the ROF model applied to the image in Figure 2.1. This
experiment shows that the accelerated primal–dual method with optimal dynamic
step sizes (aPDHG) is significantly faster than a primal–dual algorithm that uses
fixed step sizes (PDHG). For comparison we also show the performance of
accelerated proximal gradient descent (FISTA).
onal projection onto ℓ2 -balls with radius λ. This projection can be easily
computed using formula (4.23).
We implemented both the standard PDHG algorithm (Algorithm 6) and
its accelerated variant (Algorithm 8) and applied it to the image from Exam-
ple 2.1. For comparison we also ran the FISTA algorithm (Algorithm 5) on
the dual ROF problem. For the plain PDHG we used a fixed setting of the
step sizes τ = 0.1, σ = 1/(τL²), where L = ‖D‖ ≤ √8. For the accelerated
PDHG (aPDHG), we observe that the function g(u) is (µ_g = 1)-strongly
convex, and we used the proposed settings for dynamically updating the step
size parameters. The initial step size parameters were set to τ₀ = σ₀ = 1/L.
Figure 5.1 shows the decay of the primal–dual gap for PDHG, aPDHG
and FISTA. It can be observed that the dynamic choice of the step sizes
greatly improves the performance of the algorithm. It can also be observed
that the fixed choice of step sizes for the PDHG algorithm seems to be fairly
optimal for a certain accuracy, but for higher accuracy the performance of
the algorithm breaks down. We can also see that in terms of the primal–dual
gap – which in turn bounds the ℓ2 -error to the true solution – the aPDHG
algorithm seems to be superior to the FISTA algorithm.
[Figure 5.2: primal gap versus iterations for PD-explicit and PD-split, with the reference rate O(1/k).]
Figure 5.2. Minimizing the TV-deblurring problem applied to the image in Fig-
ure 2.2. We compare the performance of a primal–dual algorithm with explicit
gradient steps (PD-explicit) and a primal–dual algorithm that uses a full splitting
of the objective function (PD-split). PD-explicit seems to perform slightly better
at the beginning, but PD-split performs better for higher accuracy.
where, letting y = (p, q), K* = (D*, A*) and f*(y) = f_p*(p) + f_q*(q), with

    f_p*(p) = δ_{‖·‖_{2,∞} ≤ λ}(p),    f_q*(q) = ½‖q + u⋄‖²,
we obtain the saddle-point problem
    min_u max_y ⟨Ku, y⟩ − f*(y),
which exactly fits the class of problems that can be optimized by the PDHG
algorithm. To implement the algorithm, we just need to know how to com-
pute the proximal maps with respect to f ∗ . Since f ∗ is separable in p, q,
we can compute the proximal maps independently for both variables. The
formula to compute the proximal map for fp∗ is again given by the projec-
tion formula (4.23). The proximal map for fq∗ requires us to solve pixelwise
quadratic optimization problems. For a given q̃, its solution is given by
    q̂ = prox_{σf_q*}(q̃)  ⇔  q̂_{i,j} = (q̃_{i,j} − σu⋄_{i,j}) / (1 + σ).
We found it beneficial to apply a simple form of 2-block diagonal preconditioning
by observing that the linear operator K is composed of the two
distinct but regular blocks D and A. According to Lemma 5.5, we can
perform the following feasible choice of the step sizes: τ = c/(L + √L_h),
σ_p = 1/(cL), and σ_q = 1/(c√L_h), for some c > 0, where σ_p is used to update
the p variable and σ_q is used to update the q variable.
Note that we cannot rely on the accelerated form of the PDHG algorithm
because the objective function lacks strong convexity in u. However, the
objective function is strongly convex in the variable Au, which can be used
to achieve partial acceleration in the q variable (Valkonen and Pock 2015).
Figure 5.2 shows a comparison between the two different variants of the
PDHG algorithm for minimizing the TV-deblurring problem from Exam-
ple 2.2. In both variants we used c = 10. The true primal objective function
has been computed by running the ‘PD-split’ algorithm for a large number
of iterations. One can see that the ‘PD-split’ variant is significantly faster
for higher accuracy. The reason is that the choice of the primal step size in
‘PD-explicit’ is more restrictive (τ < 1/L_h). On the other hand, ‘PD-explicit’
seems to perform well at the beginning and also has a smaller memory
footprint.
    min_u λ‖Du‖_{2,1} + ‖u − u⋄‖₁,
Having detailed the computation of the proximal maps for the TV-ℓ1 model,
the implementation of the PDHG algorithm (Algorithm 6) is straightfor-
ward. The step size parameters were set to τ = σ = ‖D‖⁻¹. For compar-
ison, we also implemented the FBF algorithm (Tseng 2000) applied to the
primal–dual system (5.5), which for the TV-ℓ1 model and fixed step size is
[Figure 5.3: primal gap versus iterations for PDHG, FBF and SGM, with the reference rates O(1/k) and O(1/√k).]
Figure 5.3. Minimizing the TV-ℓ1 model applied to the image in Figure 2.3.
The plot shows a comparison of the convergence of the primal gap between the
primal–dual (PDHG) algorithm and the forward–backward–forward (FBF) algo-
rithm. PDHG and FBF perform almost equally well, but FBF requires twice as
many evaluations of the linear operator. We also show the performance of a plain
subgradient method (SGM) in order to demonstrate the clear advantage of PDHG
and FBF exploiting the structure of the problem.
given by
    u^{k+1/2} = prox_{τg}(u^k − τD*p^k),
    p^{k+1/2} = prox_{τf*}(p^k + τDu^k),
    u^{k+1} = u^{k+1/2} − τD*(p^{k+1/2} − p^k),
    p^{k+1} = p^{k+1/2} + τD(u^{k+1/2} − u^k).
Observe that the FBF method requires twice as many matrix–vector mul-
tiplications as the PDHG algorithm. For simplicity, we used a fixed step
size τ = ‖D‖⁻¹. We also tested the FBF method with an Armijo-type
line-search procedure, but it did not improve the results in this example.
Moreover, as a baseline, we also implemented a plain subgradient method
(SGM), as presented in (4.10). In order to compute a subgradient of the
total variation we used a Huber-type smoothing, but we set the smoothing
parameter to a very small value, ε = 10⁻³⁰. For the subgradient of the data
term, we just took the sign of the argument of the ℓ₁-norm. We used a
diminishing step size of the form c/√k for some c > 0 since it gave the best
results in our experiments.
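For completeness, here is the kind of plain subgradient iteration used as the SGM baseline, sketched with our own synthetic one-dimensional data, sign subgradients and c/√k steps, on min_u λ‖Du‖₁ + ‖u − u⋄‖₁; only the best iterate is kept, since the objective does not decrease monotonically.

```python
import numpy as np

def subgradient_tvl1(b, lam, c=1.0, n_iter=2000):
    """Plain subgradient method for min_u lam*||Du||_1 + ||u - b||_1 on a
    chain, with sign-based subgradients and diminishing steps c/sqrt(k)."""
    obj = lambda v: lam * np.sum(np.abs(np.diff(v))) + np.sum(np.abs(v - b))
    u = np.zeros_like(b)
    best_u, best_val = u.copy(), obj(u)
    for k in range(1, n_iter + 1):
        # a subgradient: lam * D^T sign(Du) + sign(u - b)
        g = -lam * np.diff(np.concatenate(([0.0], np.sign(np.diff(u)), [0.0])))
        g = g + np.sign(u - b)
        u = u - (c / np.sqrt(k)) * g
        if obj(u) < best_val:
            best_u, best_val = u.copy(), obj(u)
    return best_u, best_val

b = np.full(8, 2.0)       # constant signal: u = b is optimal with objective 0
u, val = subgradient_tvl1(b, lam=0.5)
```

The best objective value decays only slowly with k, which is the behaviour the O(1/√k) baseline curve reflects.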
Figure 5.3 shows the convergence of the primal gap, where we computed
the ‘true’ value of the primal objective function by running the PDHG
algorithm for a large number of iterations. It can be observed that both
5.2. Extensions
Convergence results for more general algorithms of the same form are found in many
papers in the literature: the subgradient can be replaced with general mono-
tone operators (Vũ 2013a, Boţ et al. 2015, Davis and Yin 2015). In particu-
lar, some acceleration techniques carry on to this setting, as observed in Boţ
et al. (2015). Davis and Yin (2015) discuss a slightly different method with
similar convergence properties and rates which mix the cases of subgradients
and monotone operators.
As for the case of forward–backward descent methods, this primal–dual
method (being a variant of a proximal-point method) can be over-relaxed in
some cases, or implemented with inertial terms, yielding better convergence
rates (Chambolle and Pock 2015a).
Another important extension involves the Banach/non-linear setting. The
proximity operators in (5.8) can be computed with non-linear metrics such
as in the mirror prox algorithm (4.40). It dates back at least to Nemirovski
(2004) in an extragradient form. For the form (5.8), it can be found in
Hohage and Homann (2014) and is also implemented in Yanez and Bach
(2014) to solve a matrix factorization problem. For a detailed convergence
analysis see Chambolle and Pock (2015a).
Finally we should mention important developments towards optimal rates:
Valkonen and Pock (2015) show how to exploit partial strong convexity
(with respect to some of the variables) to gain acceleration, and obtain a
rate which is optimal in both smooth and non-smooth situations; see also
Chen et al. (2014a).
A few extensions to non-convex problems have recently been proposed
(Valkonen 2014, Möllenhoff, Strekalovskiy, Moeller and Cremers 2015); see
Section 6 for details.
Algorithm 10 ADMM.
Choose γ > 0, y 0 , z 0 .
for all k ≥ 0 do
  Find x^{k+1} by minimizing x ↦ f(x) − ⟨z^k, Ax⟩ + (γ/2)‖b − Ax − By^k‖²,
  Find y^{k+1} by minimizing y ↦ g(y) − ⟨z^k, By⟩ + (γ/2)‖b − Ax^{k+1} − By‖²,
  Update z^{k+1} = z^k + γ(b − Ax^{k+1} − By^{k+1}).
end for
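A minimal sketch of Algorithm 10 (ours, not from the text) on the Lasso splitting min ½‖Mx − d‖² + λ‖y‖₁ subject to x − y = 0, that is, A = Id, B = −Id, b = 0; for an orthogonal design M = Id the result can be compared with plain soft-thresholding.

```python
import numpy as np

def admm_lasso(M, d, lam, gamma=1.0, n_iter=1000):
    """Algorithm 10 with f(x) = 0.5*||Mx-d||^2, g(y) = lam*||y||_1,
    A = Id, B = -Id, b = 0 (so the constraint is x = y)."""
    n = M.shape[1]
    x = y = z = np.zeros(n)
    H = np.linalg.inv(M.T @ M + gamma * np.eye(n))   # x-step is a linear solve
    for _ in range(n_iter):
        x = H @ (M.T @ d + z + gamma * y)            # minimize over x
        w = x - z / gamma
        y = np.sign(w) * np.maximum(np.abs(w) - lam / gamma, 0.0)  # over y
        z = z + gamma * (y - x)                      # b - Ax - By = y - x
    return x, y

d = np.array([3.0, -0.2, 1.5, 0.05])
x, y = admm_lasso(np.eye(4), d, lam=0.5)
```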
and Osher 2009, Zhang, Burger, Bresson and Osher 2010), which is inspired
by Bregman iterations (Brègman 1967) and whose implementation boils
down to an instance of the ADMM (though with interesting interpretations).
A fairly general description of the relationships between the ADMM and
similar splitting methods can be found in Esser (2009).
In its standard form, the ADMM aims at tackling constrained problems
of the form
    min_{Ax + By = b}  f(x) + g(y),                                (5.15)
If we introduce f̃(ξ) := min{f(x) : Ax = ξ} and g̃(η) := min{g(y) : By = η},
and set these to +∞ when the set of constraints is empty, then these
functions are convex, l.s.c., proper and the convex conjugates of f*(A*·) and
g*(B*·), respectively; see Rockafellar (1997, Corollary 31.2.1).¹³ Then one
can rewrite the iterations of Algorithm 10, letting ξ^k = Ax^k and η^k = By^k,
    ξ^{k+1} = prox_{f̃/γ}(b + z^k/γ − η^k),
    η^{k+1} = prox_{g̃/γ}(b + z^k/γ − ξ^{k+1}),                    (5.16)
    z^{k+1} = z^k + γ(b − ξ^{k+1} − η^{k+1}).
In fact it is generally impossible to express the functions f̃ and g̃ explicitly,
but the fact that the algorithm is computable implicitly assumes that the
operators prox_{τf̃} and prox_{τg̃} are computable. Observe that from the last two
steps, thanks to Moreau’s identity (3.8), we have
    z^{k+1}/γ = b + z^k/γ − ξ^{k+1} − prox_{g̃/γ}(b + z^k/γ − ξ^{k+1})
              = (1/γ) prox_{γg̃*}(z^k + γ(b − ξ^{k+1})).
Hence, letting τ = γ, σ = 1/γ, z̄^k = z^k + γ(b − ξ^k − η^k), we see that the
iterations (5.16) can be rewritten as
    ξ^{k+1} = prox_{σf̃}(ξ^k + σz̄^k),
    z^{k+1} = prox_{τg̃*}(z^k − τ(ξ^{k+1} − b)),                   (5.17)
    z̄^{k+1} = 2z^{k+1} − z^k,
¹³ In infinite dimensions, we must require for instance that f* is continuous at some point
A*ζ; see in particular Bouchitté (2006).
are discussed in Davis and Yin (2014a); see also He and Yuan (2015a). This
form of the ADMM has been generalized to problems involving more than
two blocks (with some structural conditions) (He and Yuan 2015c, Fu, He,
Wang and Yuan 2014) and/or to non-convex problems (see the references
in Section 6.3).
Accelerated ADMM
The relationship that exists between the two previous methods also allows us
to derive accelerated variants of the ADMM method if either the function
g̃ ∗ (z) = g ∗ (B ∗ z) or the function f˜ is strongly convex. The first case will
occur when g has Lg -Lipschitz gradient and B ∗ is injective; then it will
follow that g̃ ∗ is 1/(Lg k(BB ∗ )−1 k)-strongly convex. This should not cover
too many interesting cases, except perhaps the cases where B = Id and g is
smooth so that the problem reduces to
    ξ^{k+1} = prox_{σ_k f̃}(ξ^k + σ_k z̄^k),
    z^{k+1} = prox_{τ_k g̃*}(z^k − τ_k(ξ^{k+1} − b)),
    θ_k = 1/√(1 + τ_k/L_g),    τ_{k+1} = θ_k τ_k,    σ_{k+1} = 1/τ_{k+1},
    z̄^{k+1} = z^{k+1} + θ_k(z^{k+1} − z^k).
This, in turn, can be rewritten in the following ‘ADMM’-like form, letting
¹⁴ However, if we have a fast solver for the prox of g̃, it might still be interesting to
consider the ADMM option.
¹⁵ If both cases occur, then of course one must expect linear convergence, as in the
previous section (Theorem 5.4). A derivation from the convergence of the primal–dual
algorithm is found in Tan (2016), while general linear rates for the ADMM in smooth
cases (including with over-relaxation and/or linearization) are proved by Deng and Yin
(2015).
ξ^k = Ax^k, η^k = y^k, and τ_k = γ_k:

    x^{k+1} = argmin_x f(x) − ⟨z^k, Ax⟩ + (γ_k/2)‖b − Ax − y^k‖²,
    y^{k+1} = argmin_y g(y) − ⟨z^k, y⟩ + (γ_k/2)‖b − Ax^{k+1} − y‖²,
    z^{k+1} = z^k + γ_k(b − Ax^{k+1} − y^{k+1}),                   (5.18)
    γ_{k+1} = γ_k/√(1 + γ_k/L_g).
¹⁶ As L_h = 0 and L = 1.
Linearized ADMM
An important remark of Chambolle and Pock (2011), successfully exploited
by Shefi and Teboulle (2014) to derive new convergence rates, is that the
‘PDHG’ primal–dual algorithm (5.3)–(5.4) is exactly the same as a lin-
earized variant of the ADMM for B = Id, with the first minimization step
replaced by a proximal descent step (following a general approach intro-
duced in Chen and Teboulle 1994),
    x^{k+1} = argmin_x f(x) − ⟨z^k, Ax⟩ + (γ/2)‖b − Ax − y^k‖² + (γ/2)‖x − x^k‖²_M,   (5.23)
[Figure 5.4: primal–dual gap versus iterations for ADMM, aADMM and aPDHG, with the reference rates O(1/k) and O(1/k²).]
Figure 5.4. Comparison of ADMM and accelerated ADMM (aADMM) for solving
the ROF model applied to the image in Figure 2.1. For comparison we also plot
the convergence of the accelerated primal–dual algorithm (aPDHG). The ADMM
methods are fast, especially at the beginning.
with a few iterations of a linear solver, and in many cases the output will
be equivalent to exactly (5.23) in some (not necessarily known) metric M
with M + A*A + (1/γ)K*K ≥ 0. (For example, this occurs in the ‘split
Bregman’ algorithm (Goldstein and Osher 2009), for which it has been ob-
served, and proved by Zhang, Burger and Osher (2011), that one can do
only one inner iteration of a linear solver; see also Yin and Osher (2013),
who study inexact implementations.) For a precise statement we refer to
Bredies and Sun (2015b, Section 2.3). It is shown there and in Bredies and
Sun (2015a, 2015c) that careful choice of a linear preconditioner can lead to
very fast convergence. A generalization of the ADMM in the same flavour
is considered in Deng and Yin (2015), and several convergence rates are
derived in smooth cases.
¹⁷ We will discuss acceleration strategies in the spirit of Theorem 5.2 in a forthcoming
paper.
Figure 5.5. Solving the image deblurring problem from Example 2.2. (a) Problem
(2.7) after 150 iterations of Douglas–Rachford (DR) splitting. (b) Huber variant
after 150 iterations with accelerated DR splitting. The figure shows that after the
same number of iterations, the accelerated algorithm yields a higher PSNR value.
6. Non-convex optimization
In this very incomplete section, we mention some extensions of the meth-
ods described so far to non-convex problems. Of course, many interesting
optimization problems in imaging are not convex. If f is a smooth non-
convex function, many of the optimization methods designed for smooth
convex functions will work and find a critical point of the function. For
instance, a simple gradient method (4.2) always guarantees that, denoting
g^k = ∇f(x^k),

f(x^{k+1}) = f(x^k − τ g^k)
           = f(x^k) − τ ⟨∇f(x^k), g^k⟩ + ∫₀^τ (τ − t) ⟨D²f(x^k − t g^k) g^k, g^k⟩ dt
           ≤ f(x^k) − τ (1 − τL/2) ‖g^k‖²
as long as D2 f ≤ L Id, whether positive or not. Hence, if 0 < τ < 2/L, then
f (xk ) will still be decreasing. If f is coercive and bounded from below, we
deduce that subsequences of (xk )k converge to some critical point. Likewise,
inertial methods can be used and are generally convergent (Zavriev and
Kostyuk 1991) if ∇f is L-Lipschitz and with suitable assumptions which
ensure the boundedness of the trajectories.
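This monotone decrease is easy to observe numerically. The following sketch (a one-dimensional toy example of our own choosing, not from the text) runs the plain gradient method on the non-convex function f(x) = log(1 + x²), for which D²f ≤ L·Id with L = 2, and records that f(xᵏ) is non-increasing for a step 0 < τ < 2/L while the gradient residual vanishes:

```python
import math

def f(x):
    # non-convex test function (our assumption for illustration);
    # f''(x) = (2 - 2*x*x) / (1 + x*x)**2, so D^2 f <= L*Id with L = 2
    return math.log(1.0 + x * x)

def grad_f(x):
    return 2.0 * x / (1.0 + x * x)

L = 2.0
tau = 0.9 * (2.0 / L)          # any 0 < tau < 2/L guarantees descent
x = 3.0
vals = [f(x)]
for _ in range(100):
    x = x - tau * grad_f(x)    # x^{k+1} = x^k - tau * g^k
    vals.append(f(x))

monotone = all(vals[k + 1] <= vals[k] + 1e-12 for k in range(len(vals) - 1))
residual = abs(grad_f(x))      # goes to zero at a critical point
```

Even though f is concave for |x| > 1, the descent estimate only uses the upper bound on D²f, so the energy still decreases.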
Then one will generally look for a critical point (hoping of course that it
might be optimal!) by trying to find x∗ such that
∇f (x∗ ) + ∂g(x∗ ) ∋ 0.
There is a vast literature on optimization techniques for such problems,
which have been tackled in this form at least since Mine and Fukushima
(1981) and Fukushima and Mine (1981). These authors study and prove the
convergence of a proximal FB descent (combined with an approximate line-
search in the direction of the new point) for non-convex f . Recent contribu-
tions in this direction, in particular for imaging problems, include those of
Grasmair (2010), Chouzenoux, Pesquet and Repetti (2014), Bredies, Lorenz
and Reiterer (2015a) and Nesterov (2013). We will describe the inertial ver-
sion of Ochs, Chen, Brox and Pock (2014), which is of the same type but
seems empirically faster, which is natural to expect as it reduces to the
standard heavy ball method (Section 4.8.2) in the smooth case. Let us de-
scribe the simplest version, with constant steps: see Algorithm 11. Here
again, L is the Lipschitz constant of ∇f . Further, subsequences of (xk )k
will still converge to critical points of the energy; see Ochs et al. (2014, The-
orem 4.8). This paper also contains many interesting variants (with varying
steps, monotone algorithms, etc.), as well as convergence rates for the
residual of the method.
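Algorithm 11 is not reproduced in this excerpt, but the constant-step inertial forward–backward update it builds on can be sketched as x^{k+1} = prox_{αg}(x^k − α∇f(x^k) + β(x^k − x^{k−1})). Below is a minimal illustration on a toy problem of our own (the choices f(x) = log(1 + x²), g = indicator of [0.5, ∞), β = 0.7 and the bound α < 2(1 − β)/L are assumptions of this sketch):

```python
def grad_f(x):
    # f(x) = log(1 + x^2): smooth, non-convex, gradient is L-Lipschitz with L = 2
    return 2.0 * x / (1.0 + x * x)

def prox_g(x):
    # g = indicator of [0.5, +inf): the prox is a projection (clamp), for any step
    return max(x, 0.5)

L = 2.0
b = 0.7                          # inertial parameter beta
a = 0.9 * 2.0 * (1.0 - b) / L    # step strictly below 2*(1 - beta)/L
x_prev = x = 3.0
for _ in range(200):
    # inertial proximal gradient step
    x, x_prev = prox_g(x - a * grad_f(x) + b * (x - x_prev)), x
```

The iterates settle at x = 0.5, a critical point: −∇f(0.5) lies in the normal cone of the constraint, so 0 ∈ ∇f(x∗) + ∂g(x∗).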
where f is again smooth but not necessarily convex, while g1 , g2 are non-
smooth and simple functions, possibly non-convex.
The convergence of alternating minimizations or proximal (implicit) descent steps in this setting (which is not necessarily covered by the general approach of Tseng 2001) has been studied by Attouch et al. (2013),
Attouch, Bolte, Redont and Soubeyran (2010) and Beck and Tetruashvili
(2013). However, Bolte, Sabach and Teboulle (2014) have observed that,
in general, these alternating steps will not be computable. These authors
propose instead to alternate linearized proximal descent steps, as shown in
Algorithm 12. Here, L1 (y) is the Lipschitz constant of ∇x f (·, y), while L2 (x)
is the Lipschitz constant of ∇y f (x, ·). These are assumed to be bounded
from below18 and above (in the original paper the assumptions are slightly
weaker). Also, for convergence one must require that a minimizer exists; in
particular, the function must be coercive.
Then it is proved by Bolte et al. (2014, Lemma 5) that the distance of
the iterates to the set of critical points of (6.2) goes to zero. Additional
convergence results are shown if, in addition, the objective function has a
very generic ‘KL’ property. We have presented a simplified version of the
PALM algorithm: in fact, there can be more than two blocks, and the simple
functions gi need not even be convex: as long as they are bounded from
below, l.s.c., and their proximity operator (which is possibly multivalued,
but still well defined by (3.6)) can be computed, then the algorithm will
converge. We use an inertial variant of PALM (Pock and Sabach 2016) in
Section 7.12 to learn a dictionary of patches.
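Algorithm 12 itself is not reproduced in this excerpt; a minimal sketch of the alternating linearized proximal steps it performs is given below, on a toy coupling f(x, y) = ½(xy − 1)² with box indicators for g1, g2 (all choices here are illustrative assumptions; note the boxes also keep the partial Lipschitz constants L1(y) = y² and L2(x) = x² bounded from below and above, exactly as the convergence theory requires):

```python
def clip(t, lo, hi):
    # prox of the box indicator = projection onto [lo, hi]
    return min(max(t, lo), hi)

def palm(x, y, iters=100, gamma=1.1, box=(0.1, 2.0)):
    """Alternating linearized proximal steps (PALM-style) for
    min f(x, y) + g1(x) + g2(y), f(x, y) = 0.5*(x*y - 1)**2,
    with g1 = g2 = indicator of [0.1, 2] (toy example)."""
    for _ in range(iters):
        c = gamma * y * y                              # c_k = gamma * L1(y^k)
        x = clip(x - y * (x * y - 1.0) / c, *box)      # grad_x f = y*(x*y - 1)
        d = gamma * x * x                              # d_k = gamma * L2(x^{k+1})
        y = clip(y - x * (x * y - 1.0) / d, *box)      # grad_y f = x*(x*y - 1)
    return x, y

x, y = palm(2.0, 0.3)
```

The iterates approach the set of critical points, here the hyperbola xy = 1 intersected with the box.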
min_u λ ϕ(Du) + (1/2) ‖Au − u⋄‖².   (6.5)
[Figure 6.1 plot: primal energy versus iterations (log–log scale) for ADMM and iPiano.]
Figure 6.1. Image deblurring using a non-convex variant of the total variation.
The plot shows the convergence of the primal energy for the non-convex TV model
using ADMM and iPiano. In order to improve the presentation in the plot, we
have subtracted a strict lower bound from the primal energy. ADMM is faster at
the beginning but iPiano finds a slightly lower energy.
ϕ(p) = (1/2) Σ_{i,j} ln( 1 + |p_{i,j}|₂² / μ² ),
(for all pixels i, j), which we can compute here using a fixed point (Newton)
iteration, or by solving a third-order polynomial.
The second approach is based on directly minimizing the primal objective
using the iPiano algorithm (Algorithm 11). We perform a forward–backward
splitting by taking explicit steps with respect to the (differentiable) regular-
izer f (u) = λϕ(Du), and perform a backward step with respect to the data
term g(u) = 21 kAu − u⋄ k2 . The gradient with respect to the regularization
(a) NC, TV, ADMM (PSNR ≈ 27.80) (b) NC, TV, iPiano (PSNR ≈ 27.95)
Figure 6.2. Image deblurring using non-convex functions after 150 iterations. (a, b)
Results of the non-convex TV-deblurring energy obtained from ADMM and iPiano.
(c) Result obtained from the non-convex learned energy, and (d) convolution filters
Dk sorted by their corresponding λk value (in descending order) used in the non-
convex learned model. Observe that the learned non-convex model leads to a
significantly better PSNR value.
term is given by
∇f(u) = (λ/μ²) D∗ p̃,

where p̃ is of the form p̃ = (p̃_{1,1}, . . . , p̃_{m,n}), and

p̃_{i,j} = (Du)_{i,j} / ( 1 + |(Du)_{i,j}|₂² / μ² ).
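This gradient can be sanity-checked against finite differences. The sketch below is our own transcription, assuming D is the forward-difference operator of (2.4) with Neumann boundary conditions; it implements ∇f(u) = (λ/μ²) D∗p̃ and compares one component with a central-difference approximation of f(u) = λϕ(Du):

```python
import numpy as np

def D(u):
    # forward differences with Neumann boundary (assumption, as in (2.4))
    du = np.zeros(u.shape + (2,))
    du[:-1, :, 0] = u[1:, :] - u[:-1, :]
    du[:, :-1, 1] = u[:, 1:] - u[:, :-1]
    return du

def D_adj(p):
    # adjoint D^*, i.e. <Du, p> = <u, D^* p> (negative discrete divergence)
    out = np.zeros(p.shape[:2])
    out[:-1, :] -= p[:-1, :, 0]
    out[1:, :] += p[:-1, :, 0]
    out[:, :-1] -= p[:, :-1, 1]
    out[:, 1:] += p[:, :-1, 1]
    return out

def phi(p, mu):
    # phi(p) = (1/2) * sum_{i,j} log(1 + |p_ij|_2^2 / mu^2)
    return 0.5 * np.sum(np.log1p(np.sum(p ** 2, axis=-1) / mu ** 2))

def grad_f(u, lam, mu):
    # grad f(u) = (lam/mu^2) * D^* p~ with p~ = Du / (1 + |Du|^2/mu^2)
    du = D(u)
    p_tilde = du / (1.0 + np.sum(du ** 2, axis=-1, keepdims=True) / mu ** 2)
    return (lam / mu ** 2) * D_adj(p_tilde)

rng = np.random.default_rng(0)
u = rng.standard_normal((5, 4))
lam, mu, eps = 0.3, 0.7, 1e-6
g = grad_f(u, lam, mu)
e = np.zeros_like(u)
e[2, 1] = eps
num = (lam * phi(D(u + e), mu) - lam * phi(D(u - e), mu)) / (2 * eps)
p_rand = rng.standard_normal(u.shape + (2,))
adj_err = abs(float(np.sum(D(u) * p_rand) - np.sum(u * D_adj(p_rand))))
```

The adjoint identity is checked as well, since a wrong D∗ is the most common bug in such implementations.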
mented using the FFT. We used the following parameter settings for the
iPiano algorithm: β = 0.7 and α = 2(1 − β)/L.
Moreover, we implemented a variant of (6.5), where we have replaced the
non-convex TV regularizer with a learned regularizer of the form
Σ_{k=1}^{K} λ_k ϕ(D_k u),
7. Applications
In the rest of the paper we will show how the algorithms presented so far
can be used to solve a number of interesting problems in image process-
ing, computer vision and learning. We start by providing some theoretical
background on the total variation and some extensions.
and read |ϕ(x)|◦ ≤ 1 for all x. The most common choices (at least for grey-
scale images) are (possibly weighted) 2- and 1-norms. The main advantage
of the total variation is that it allows for sharp jumps across hypersurfaces,
for example edges or boundaries in the image, while being a convex func-
tional, in contrast to other Sobolev norms. For smooth images u we easily
check from (7.2) (integrating by parts) that it reduces to the L1 -norm of
the image gradient, but it is also well defined for non-smooth functions.
For characteristic functions of sets it measures the length or surface of the
boundary of the set inside Ω (this again is easy to derive, at least for smooth
sets, from (7.2) and Green’s formula). This also makes the total variation
interesting for geometric problems such as image segmentation.
Concerning the data-fitting term, numerous variations of (7.1) have been
proposed in the literature. A simple modification of the ROF model is to
replace the squared data term with an L1 -norm (Nikolova 2004, Chan and
Esedoḡlu 2005):
min_u λ ∫_Ω |Du| + ∫_Ω |u(x) − u⋄(x)| dx.   (7.3)
The resulting model, called the 'TV-ℓ1 model', turns out to have interesting new properties. It is purely geometric in the sense that the energy decomposes on the level sets of the image. Hence, it can be used to remove
structures of an image of a certain scale, and the regularization parame-
ter λ can be used for scale selection. The TV-ℓ1 model is also effective in
removing impulsive (outlier) noise from images.
In the presence of Poisson noise, a popular data-fitting term (justified by a Bayesian derivation) is the generalized Kullback–Leibler divergence, leading to the 'TV-entropy' model used below.
Figure 7.1. Contrast invariance of the TV-ℓ1 model. (a–d) Result of the TV-ℓ1
model for varying values of the regularization parameter λ. (e–h) Result of the ROF
model for varying values of λ. Observe the morphological property of the TV-ℓ1
model. Structures are removed only with respect to their size, but independent of
their contrast.
This model has applications in synthetic aperture radar (SAR) imaging, for
example.
We have already detailed the discretization of TV models in (2.6) and we
have shown that an efficient algorithm to minimize total variation models
is the PDHG algorithm (Algorithm 6 and its variants). A saddle point
formulation of discrete total variation models that summarizes the different
aforementioned data-fitting terms is as follows:
min_u max_p ⟨Du, p⟩ + g(u) − δ_{{‖·‖_{2,∞} ≤ λ}}(p),
and for g(u) = Σ_{i,j} (u_{i,j} − u⋄_{i,j} log u_{i,j}) + δ_{(0,∞)}(u) we obtain the TV-entropy
model. The implementation of the models using the PDHG algorithm only
differs in the implementation of the proximal operators û = proxτ g (ũ). For
all 1 ≤ i ≤ m, 1 ≤ j ≤ n the respective proximal operators are given by
û_{i,j} = (ũ_{i,j} + τ u⋄_{i,j}) / (1 + τ)   (ROF),

û_{i,j} = u⋄_{i,j} + max{0, |ũ_{i,j} − u⋄_{i,j}| − τ} · sgn(ũ_{i,j} − u⋄_{i,j})   (TV-ℓ1),

û_{i,j} = max{ 0, (ũ_{i,j} − τ + √((ũ_{i,j} − τ)² + 4τ u⋄_{i,j})) / 2 }   (TV-entropy).
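These proximal maps act pixel by pixel, so they can be transcribed and checked directly against their one-dimensional optimality conditions; a sketch (our own scalar versions, with `ut` the argument ũ and `ud` the datum u⋄):

```python
import math

def prox_rof(ut, ud, tau):
    # g(u) = (1/2)*(u - ud)^2: optimality (u - ud) + (u - ut)/tau = 0
    return (ut + tau * ud) / (1.0 + tau)

def prox_tvl1(ut, ud, tau):
    # g(u) = |u - ud|: soft shrinkage around ud
    d = ut - ud
    return ud + max(0.0, abs(d) - tau) * (1.0 if d >= 0 else -1.0)

def prox_tventropy(ut, ud, tau):
    # g(u) = u - ud*log(u) on (0, inf): positive root of
    # u^2 + (tau - ut)*u - tau*ud = 0
    return max(0.0, (ut - tau + math.sqrt((ut - tau) ** 2 + 4.0 * tau * ud)) / 2.0)

# optimality residual of the entropy prox: 1 - ud/u + (u - ut)/tau = 0
uhat = prox_tventropy(1.0, 2.0, 0.5)
resid = 1.0 - 2.0 / uhat + (uhat - 1.0) / 0.5
```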
Figure 7.1 demonstrates the contrast invariance of the TV-ℓ1 model and
compares it to the ROF model. Both models were minimized using Algo-
rithm 6 (PDHG) or Algorithm 8. Gradually increasing the regularization
parameter λ in the TV-ℓ1 model has the effect that increasingly larger struc-
tures are removed from the image. Observe that the structures are removed
only with respect to their size and not with respect to their contrast. In
the ROF model, however, scale and contrast are mixed such that gradually
increasing the regularization parameter results in removing structures with
increased size and contrast.
Figure 7.2 compares the ROF model with the TV-entropy model for image
denoising in the presence of Poisson noise. The noisy image of size 480 × 640 pixels has been generated by degrading an aerial image of Graz, Austria, with Poisson noise, where the Poisson parameter is given by the image values scaled between 0 and 50.
Both models have been minimized using the PDHG algorithm. It can be
observed that the TV-entropy model adapts better to the noise properties of
the Poisson noise and hence leads to better preservation of dark structures
and exhibits better contrast.
where u ∈ BV(Ω), v ∈ BV(Ω; R²), and λ0, λ1 > 0 are tuning parameters. The
idea of TGV2 is to force the gradient Du of the image to deviate only on a
sparse set from a vector field v which itself has sparse gradient. This will get
Figure 7.2. Total variation based image denoising in the presence of Poisson noise.
(a) Aerial view of Graz, Austria, (b) noisy image degraded by Poisson noise. (c) Re-
sult using the ROF model, and (d) result using the TV-entropy model. One can see
that the TV-entropy model leads to improved results, especially in dark regions,
and exhibits better contrast.
rid of the staircasing effect on affine parts of the image, while still preserving
the possibility of having sharp edges. The discrete counterpart of (7.4) can
be obtained by applying the same standard discretization techniques as in
the case of the ROF model.
We introduce the discrete scalar images u, u⋄ ∈ Rm×n and vectorial image
v = (v1 , v2 ) ∈ Rm×n×2 . The discrete version of the TGV2 model is hence
given by
min_{u,v} λ_1 ‖Du − v‖_{2,1} + λ_0 ‖Dv‖_{2,1} + (1/2) ‖u − u⋄‖²,
where D : Rm×n×2 → Rm×n×4 is again a finite difference operator that
computes the Jacobian (matrix) of the vectorial image v, which we treat
as a vector here. It can be decomposed into Dv = (Dv1 , Dv2 ), where D
is again the standard finite difference operator introduced in (2.4). The
discrete versions of the total first- and second-order variations are given by
‖Du − v‖_{2,1} = Σ_{i,j=1}^{m,n} √( ((Du)_{i,j,1} − v_{i,j,1})² + ((Du)_{i,j,2} − v_{i,j,2})² ),

‖Dv‖_{2,1} = Σ_{i,j=1}^{m,n} √( (Dv1)²_{i,j,1} + (Dv1)²_{i,j,2} + (Dv2)²_{i,j,1} + (Dv2)²_{i,j,2} ).
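A direct numerical transcription of these two terms can be sketched as follows (our own code; D is assumed to be the forward-difference operator of (2.4) with Neumann boundary). A useful sanity check is that for an affine image the choice v = Du makes the first-order term vanish, which is exactly the mechanism by which TGV² avoids staircasing:

```python
import numpy as np

def D(u):
    # forward differences with Neumann boundary (assumption, as in (2.4))
    du = np.zeros(u.shape + (2,))
    du[:-1, :, 0] = u[1:, :] - u[:-1, :]
    du[:, :-1, 1] = u[:, 1:] - u[:, :-1]
    return du

def norm21(p):
    # ||p||_{2,1}: sum over pixels of the pointwise 2-norm of the remaining axes
    return float(np.sum(np.sqrt(np.sum(p ** 2, axis=tuple(range(2, p.ndim))))))

def tgv2(u, v, lam1, lam0):
    # lam1*||Du - v||_{2,1} + lam0*||Dv||_{2,1}, with Dv = (Dv1, Dv2)
    dv = np.stack([D(v[..., 0]), D(v[..., 1])], axis=-1)
    return lam1 * norm21(D(u) - v) + lam0 * norm21(dv)

# an affine (noise-free) image: v = Du annihilates the first-order term
i, j = np.meshgrid(np.arange(6.0), np.arange(7.0), indexing="ij")
u_affine = 0.5 * i - 0.25 * j
first_term = tgv2(u_affine, D(u_affine), 1.0, 0.0)
```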
Figure 7.3. Comparison of TV and TGV2 denoising. (a) Original input image,
and (b) noisy image, where we have added Gaussian noise with standard deviation
σ = 0.1. (c) Result obtained from the ROF model, and (d) result obtained by min-
imizing the TGV2 model. The main advantage of the TGV2 model over the ROF
model is that it is better at reconstructing smooth regions while still preserving
sharp discontinuities.
where the σn (J(x)) denote the singular values of the Jacobian J(x) (i.e.,
the square roots of the eigenvalues of J(x)J(x)∗ or J(x)∗ J(x)).
If p = 2, the resulting norm is equivalent to the Frobenius norm, which
corresponds to one of the most classical choices (Bresson and Chan 2008),
though other choices might also be interesting (Sapiro and Ringach 1996,
Figure 7.4. Denoising a colour image using the vectorial ROF model. (a) Original
RGB colour image, and (b) its noisy variant, where Gaussian noise with standard
deviation σ = 0.1 has been added. (c) Solution of the vectorial ROF model using
the Frobenius norm, and (d) solution using the nuclear norm. In smooth regions
the two variants lead to similar results, while in textured regions the nuclear norm
leads to significantly better preservation of small details (see the close-up views
in (c, d)).
variation:
∫_Ω |Du|_{Sp} = sup{ −∫_Ω u(x) · div ϕ(x) dx : ϕ ∈ C^∞(Ω; R^{d×k}), |ϕ(x)|_{Sq} ≤ 1, ∀x ∈ Ω },   (7.5)
where q is the parameter of the polar norm associated with the parameter
p of the Schatten norm and is given by 1/p + 1/q = 1. Based on that we
can define a vectorial ROF model as
min_u λ ∫_Ω |Du|_{Sp} + (1/2) ∫_Ω |u(x) − u⋄(x)|₂² dx.   (7.6)
The discretization of the vectorial ROF model is similar to the discretiza-
tion of the standard ROF model. We consider a discrete colour image
u = (ur , ug , ub ) ∈ Rm×n×3 , where ur , ug , ub ∈ Rm×n denote the red, green,
and blue colour channels, respectively. We also consider a finite difference
operator D : Rm×n×3 → Rm×n×2×3 given by Du = (Dur , Dug , Dub ), where
D is again the finite difference operator defined in (2.4). The discrete colour
ROF model based on the 1-Schatten norm is given by
min_u λ ‖Du‖_{S1,1} + (1/2) ‖u − u⋄‖².
The vectorial ROF model can be minimized either by applying Algorithm 5
to its dual formulation or by applying Algorithm 8 to its saddle-point for-
mulation. Let us consider the saddle-point formulation:
min_u max_P ⟨Du, P⟩ + (1/2) ‖u − u⋄‖² − δ_{{‖·‖_{S∞,∞} ≤ λ}}(P),
where P ∈ Rm×n×2×3 is the tensor-valued dual variable, hence the dual
variable can also be written as P = (P1,1 , . . . , Pm,n ), where Pi,j ∈ R2×3 is
a 2 × 3 matrix. Hence, the polar norm ball {‖P‖_{S∞,∞} ≤ λ} is also given by

{P = (P_{1,1}, . . . , P_{m,n}) : |P_{i,j}|_{S∞} ≤ λ, for all i, j},

that is, the set of variables P whose tensor-valued components P_{i,j} have an operator norm less than or equal to λ. To compute the projection onto the polar norm ball we can use the singular value decomposition (SVD) of the
matrices. Let U ∈ R^{2×2}, S = diag(s_1, s_2) ∈ R^{2×3}, and V ∈ R^{3×3} be an SVD of P̃_{i,j}, that is, P̃_{i,j} = U S V^T. As shown by
Cai, Candès and Shen (2010), the orthogonal projection of P̃i,j to the polar
norm ball {kPkS∞ ,∞ ≤ λ} is
Π_{{‖·‖_{S∞,∞} ≤ λ}}(P̃_{i,j}) = U S_λ V^T,   S_λ = diag(min{s_1, λ}, min{s_2, λ}).
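Clipping the singular values is straightforward with a numerical SVD routine; a sketch using NumPy (our own transcription of the projection formula of Cai, Candès and Shen 2010, written for a single 2 × 3 component):

```python
import numpy as np

def proj_sinf_ball(P, lam):
    """Orthogonal projection of one matrix P_ij onto {|P|_{S_inf} <= lam}:
    compute an SVD, clip the singular values at lam, recompose."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return U @ np.diag(np.minimum(s, lam)) @ Vt

P = np.array([[3.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])   # singular values (3, 1)
Q = proj_sinf_ball(P, 2.0)        # clipped to (2, 1)
R = proj_sinf_ball(Q, 2.0)        # a point inside the ball is left unchanged
```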
Figure 7.4 shows an example of denoising a colour image of size 384 × 512
with colour values in the range [0, 1]3 . It can be seen that the nuclear norm
where F : C^{m×n} → C^{m×n} denotes the (discrete) fast Fourier transform,
and ◦ denotes the Hadamard product (the element-wise product of the two
matrices). In order to minimize the TV-MRI objective function, we first
transform the problem into a saddle-point problem:
min_u max_p ⟨Du, p⟩ + Σ_{c=1}^{C} (1/2) ‖F(σ_c ◦ u) − g_c‖² − δ_{{‖·‖_{2,∞} ≤ λ}}(p),
where p ∈ Cm×n×2 is the dual variable. Observe that we have just dualized
the total variation term but kept the data-fitting term
h(u) = Σ_{c=1}^{C} (1/2) ‖F(σ_c ◦ u) − g_c‖²
19 Data courtesy of Florian Knoll, Center for Biomedical Imaging and Center for Advanced Imaging Innovation and Research (CAI2R), Department of Radiology, NYU School of Medicine.
where different norms can be considered for both the total variation and
the data-fitting term. The most common choice is p = 2, and q = 1 (Brox,
Bruhn, Papenberg and Weickert 2004, Zach, Pock and Bischof 2007, Cham-
bolle and Pock 2011). For numerical solution we discretize the TV-ℓ1 op-
tical flow model in the same spirit as we did with the previous TV mod-
els. We consider a discrete velocity field v = (v1 , v2 ) ∈ Rm×n×2 , where
v1 corresponds to the horizontal velocity and v2 corresponds to the ver-
tical velocity. It can also be written in the form of v = (v1,1 , . . . , vm,n ),
where vi,j = (vi,j,1 , vi,j,2 ) is the local velocity vector. To discretize the total
variation, we again consider a finite difference approximation of the vecto-
rial gradient D : Rm×n×2 → Rm×n×4 , defined by Dv = (Dv1 , Dv2 ), where
D is defined in (2.4). In order to discretize the data term, we consider a
certain point in time for which we have computed finite difference approx-
imations for the space-time gradient of I(x, t). It is necessary to have at
least two images in time in order to compute the finite differences in time.
Figure 7.6. Optical flow estimation using total variation. (a) A blending of the
two input images. (b) A colour coding of the computed velocity field. The colour
coding of the velocity field is shown in the upper left corner of the image.
where
ri,j · (vi,j , 1) = ri,j,1 vi,j,1 + ri,j,2 vi,j,2 + ri,j,3 .
For the vectorial total variation we consider the standard 2-vector norm,
that is,
‖Dv‖_{2,1} = Σ_{i,j=1}^{m,n} √( (Dv1)²_{i,j,1} + (Dv1)²_{i,j,2} + (Dv2)²_{i,j,1} + (Dv2)²_{i,j,2} ).
A simple computation (Zach et al. 2007) shows that the proximal map is
given by
v̂ = prox_{τg}(ṽ) ⇔

v̂_{i,j} = ṽ_{i,j} + { τ r_{i,j}    if r_{i,j} · (ṽ_{i,j}, 1) < −τ |r_{i,j}|²,
                     −τ r_{i,j}    if r_{i,j} · (ṽ_{i,j}, 1) > τ |r_{i,j}|²,
                     −( r_{i,j} · (ṽ_{i,j}, 1) / |r_{i,j}|² ) r_{i,j}    else }.
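A per-pixel sketch of this thresholding step is below (our own transcription; note that since v is two-dimensional, only the spatial part a = (r_1, r_2) of the 3-vector r can enter the additive terms, and we read |r_{i,j}| as |a| accordingly — an assumption of this sketch). The result is verified against the defining minimization of the prox:

```python
import numpy as np

def prox_flow(vt, r, tau):
    """Prox of tau*|rho(.)| with rho(v) = r[0]*v[0] + r[1]*v[1] + r[2]
    (the TV-l1 flow data term of Zach et al. 2007), for one pixel."""
    a = r[:2]                         # spatial part of r (assumption, see above)
    rho = float(a @ vt + r[2])
    na2 = float(a @ a)
    if rho < -tau * na2:
        return vt + tau * a
    if rho > tau * na2:
        return vt - tau * a
    return vt - (rho / na2) * a       # middle case: sets rho(v_hat) = 0

def objective(v, vt, r, tau):
    # the prox is by definition the minimizer of this convex function
    return tau * abs(r[0] * v[0] + r[1] * v[1] + r[2]) + 0.5 * np.sum((v - vt) ** 2)

r = np.array([1.0, 2.0, 0.5])
vt = np.array([0.3, -0.2])
tau = 0.1
vhat = prox_flow(vt, r, tau)
best = objective(vhat, vt, r, tau)
rng = np.random.default_rng(0)
is_min = all(best <= objective(vhat + 1e-3 * rng.standard_normal(2), vt, r, tau) + 1e-12
             for _ in range(200))
```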
Figure 7.7. Image inpainting using shearlet regularization. (a) Original image,
and (b) input image with a randomly chosen fraction of 10% of the image pix-
els. (c) Reconstruction using TV regularization, and (d) reconstruction using the
shearlet model. Observe that the shearlet-based model leads to significantly better
reconstruction of small-scale and elongated structures.
Easley, Labate and Lim 2008, Kutyniok and Lim 2011), for image inpaint-
ing. For this we consider the following formulation:
min_u ‖Φu‖₁ + Σ_{(i,j)∈I} δ_{{u⋄_{i,j}}}(u_{i,j}),
where
D = {(i, j) : 1 ≤ i ≤ m, 1 ≤ j ≤ n}
is the set of pixel indices of a discrete image of size m × n, and I ⊂ D is the
subset of known pixels of the image u⋄ . After transforming to a saddle-point
problem, the solution of the inpainting problem can be computed using the
PDHG algorithm. It just remains to give the proximal map with respect to
the data term
g(u) = Σ_{(i,j)∈I} δ_{{u⋄_{i,j}}}(u_{i,j}).
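Since this g is a sum of indicator functions of the singletons {u⋄_{i,j}} over the known pixels, its proximal map simply resets the known pixels and leaves the others untouched, independently of the step size τ. A one-line transcription (our own sketch, with `known` a boolean mask of the index set I):

```python
import numpy as np

def prox_inpaint(ut, ud, known):
    # prox_{tau*g} for any tau > 0: enforce u = ud on the known pixels
    return np.where(known, ud, ut)

ud = np.array([[1.0, 2.0], [3.0, 4.0]])
known = np.array([[True, False], [False, True]])
ut = np.full((2, 2), 9.0)
out = prox_inpaint(ut, ud, known)
```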
the so-called jump set, that is, the set of points where the function u is
allowed to jump and Hd−1 is the (d − 1)-dimensional Hausdorff measure
(Ambrosio et al. 2000, Attouch et al. 2014, Evans and Gariepy 1992), which
is, for d = 2, the length of the jump set Su and hence the total length of
edges in u. The main difference between the ROF functional and the MS
functional is as follows. While the ROF functional penalizes discontinuities
proportional to their jump height, the MS functional penalizes disconti-
nuities independently of their jump height and hence allows for better dis-
crimination between smooth and discontinuous parts of the image. We must
stress that the MS functional is very hard to minimize. The reason is that
the jump set Su is not known beforehand and hence the problem becomes
a non-convex optimization problem. Different numerical approaches have
been proposed to find approximate solutions to the Mumford–Shah problem
(Ambrosio and Tortorelli 1992, Chambolle 1999, Chan and Vese 2002, Pock
et al. 2009).
Here we focus on the work by Alberti, Bouchitté and Dal Maso (2003),
who proposed a method called the calibration method to characterize global
minimizers of the MS functional. The approach is based on a convex
representation of the MS functional in a three-dimensional space Ω × R,
where the third dimension is given by the value t = u(x). The idea of
the calibration method is to consider the maximum flux of a vector field
ϕ = (ϕx , ϕt ) ∈ C0 (Ω × R; Rd+1 ) through the interface of the subgraph
1_u(x, t) = { 1 if t < u(x),  0 else },   (7.11)
where the inequalities in the definition of K hold for all x ∈ Ω and for all
min_v sup_{ϕ∈K} ∫_{Ω×R} ϕ · Dv.   (7.15)
With this, the discrete version of (7.15) is given by the saddle-point problem
min maxhDv, pi + δC (v) − δK (p),
v p
which can be solved using Algorithm 6. The critical part of the implemen-
tation of the algorithm is the solution of the projection of p onto K:
p̂ = Π_K(p̃) = arg min_{p∈K} (1/2) ‖p − p̃‖²,
which is non-trivial since the set K contains a quadratic number (in fact
r(r + 1)/2) of coupled constraints. In order to solve the projection prob-
lem, we may adopt Dykstra’s algorithm for computing the projection on
the intersection of convex sets (Dykstra 1983). The algorithm performs a
coordinate descent on the dual of the projection problem, which is defined
in the product space of the constraints. In principle, the algorithm proceeds
by sequentially projecting onto the single constraints. The projections onto
the 2-ball constraints can be computed using projection formula (4.23). The
projection to the parabola constraint can be computed by solving a cubic
stereo problem. After interchanging the left and right images we repeated
the experiment. This allowed us to perform a left–right consistency check
and in turn to identify occluded regions. Those pixels are shown in black.
Although the calibration method is able to compute the globally optimal
solution, it is important to point out that this does not come for free. The
associated optimization problem is huge because the range space of the
solution also has to be discretized. In our stereo example, the disparity
image is of size 1835 × 3637 pixels and the number of disparities was 100.
where Ω is the image domain, Per(S; Ω) denotes the perimeter of the set S
in Ω, and w1,2 : Ω → R+ are given non-negative potential functions. This
problem belongs to a general class of minimal surface problems that have
been studied for a long time (see for instance the monograph by Giusti
1984).
The discrete version of this energy is commonly known as the ‘Ising’
model, which represents the interactions between spins in an atomic lattice
and exhibits phase transitions. In computer science, the same kind of energy
has been used to model many segmentation and classification tasks, and
has received a lot of attention since it was understood that it could be
efficiently minimized if represented as a minimum s − t cut problem (Picard
and Ratliff 1975) in an oriented graph (V, E). Here, V denotes a set of
vertices and E denotes the set of edges connecting some of these vertices.
Given two particular vertices, the ‘source’ s and the ‘sink’ t, the s − t
minimum cut problem consists in finding two disjoint sets S ∋ s and T ∋ t
with S ∪ T = V such that the cost of the 'cut' C(S, T) = {(u, v) ∈ E : u ∈ S, v ∈ T} is minimized. The cost of the cut can be determined by simply counting the
number of edges, or by summing a certain weight wuv associated with each
edge (u, v) ∈ E. By the Ford–Fulkerson min-cut/max-flow duality theorem
(see Ahuja, Magnanti and Orlin 1993 for a fairly complete textbook on
these topics), this minimal s − t cut can be computed by finding a maximal
flow through the oriented graph, which can be solved by a polynomial-
time algorithm. In fact, there is a ‘hidden’ convexity in the problem. We
will describe this briefly in the continuous setting; for discrete approaches
to image segmentation we refer to Boykov and Kolmogorov (2004), and the
vast subsequent literature. The min-cut/max-flow duality in the continuous
setting and the analogy with minimal surfaces type problems were first
investigated by Strang (1983) (see also Strang 2010).
We mentioned in the previous section that the total variation (7.2) is also
well defined for characteristic functions of sets, and measures the length of
the boundary (in the domain). This is, in fact, the ‘correct’ way to define
the perimeter of a measurable set, introduced by R. Caccioppoli in the
early 1950s. Ignoring constants, we can replace (7.21) with the following
equivalent variational problem:
min_{S⊆Ω} ∫_Ω |D1_S| + ∫_Ω 1_S(x) w(x) dx,   (7.22)
where for notational simplicity we have set w = w1 − w2 , and 1S is the
characteristic function associated with the set S, that is,
1_S(x) = { 1 if x ∈ S,  0 else }.
The idea is now to replace the binary function 1S : Ω → {0, 1} with a
continuous function u : Ω → [0, 1] such that the problem becomes convex:
min_u ∫_Ω |Du| + ∫_Ω u(x) w(x) dx, such that u(x) ∈ [0, 1] a.e. in Ω.   (7.23)
It turns out that the relaxed formulation is exact in the sense that any
thresholded solution v = 1{u≥s} of the relaxed problem for any s ∈ (0, 1]
is also a global minimizer of the binary problem (Chan, Esedoḡlu and
Nikolova 2006, Chambolle 2005, Chambolle and Darbon 2009). This is a
consequence of the co-area formula (Federer 1969, Giusti 1984, Ambrosio et
al. 2000), which shows that minimizing the total variation of u decomposes
into independent problems on all level sets of the function u.
Interestingly, there is also a close relationship between the segmentation
model (7.23) and the ROF model (7.1). In fact a minimizer of (7.23) is
obtained by minimizing (7.1), with u⋄ = w being the input image, and then
thresholding the solution u at the 0 level (Chambolle 2004a, 2005). Con-
versely, this relationship has also been successfully used to derive efficient
combinatorial algorithms, based on parametric maximal flow approaches
(Gallo, Grigoriadis and Tarjan 1989), to solve the fully discrete ROF model
exactly in polynomial time (Hochbaum 2001, Darbon and Sigelle 2004, Darbon and Sigelle 2006a, Darbon and Sigelle 2006b, Chambolle and Darbon
2012), where the total variation is approximated by a sum of pairwise interactions |u_i − u_j|.
Exploiting the relation between the ROF model and the two-label segmen-
tation model, we can easily solve the segmentation problem by considering
a discrete version of the ROF model. In our setting here, we consider a
discrete image u ∈ Rm×n and a discrete weighting function w ∈ Rm×n . The
discrete model we need to solve is
min_u ‖Du‖_{2,1} + (1/2) ‖u − w‖².
It can be solved by using either Algorithm 8 or Algorithm 5 (applied to the
dual problem). Let u∗ denote the minimizer of the ROF problem. The final
discrete and binary segmentation 1S is given by thresholding u∗ at zero:
(1_S)_{i,j} = { 0 if u∗_{i,j} < 0,  1 else }.
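Putting the pieces together, the whole two-label pipeline fits in a short sketch (our own minimal implementation, in the spirit of the PDHG iteration: D is assumed to be the forward-difference operator of (2.4) with Neumann boundary and ‖D‖² ≤ 8): solve the discrete ROF problem with input w, then threshold at zero as above.

```python
import numpy as np

def D(u):
    # forward differences with Neumann boundary (assumption, as in (2.4))
    du = np.zeros(u.shape + (2,))
    du[:-1, :, 0] = u[1:, :] - u[:-1, :]
    du[:, :-1, 1] = u[:, 1:] - u[:, :-1]
    return du

def D_adj(p):
    # adjoint of D (negative discrete divergence)
    out = np.zeros(p.shape[:2])
    out[:-1, :] -= p[:-1, :, 0]
    out[1:, :] += p[:-1, :, 0]
    out[:, :-1] -= p[:, :-1, 1]
    out[:, 1:] += p[:, :-1, 1]
    return out

def segment(w, iters=500):
    """min_u ||Du||_{2,1} + (1/2)||u - w||^2 via a PDHG iteration,
    followed by thresholding of u* at zero."""
    u = np.zeros_like(w)
    ubar = u.copy()
    p = np.zeros(w.shape + (2,))
    tau = sigma = 1.0 / np.sqrt(8.0)   # tau * sigma * ||D||^2 <= 1
    for _ in range(iters):
        # dual ascent + projection onto the pointwise 2-ball of radius 1
        p += sigma * D(ubar)
        p /= np.maximum(1.0, np.sqrt(np.sum(p ** 2, axis=-1, keepdims=True)))
        # primal step: prox of (1/2)||u - w||^2, then over-relaxation
        u_old = u
        u = (u - tau * D_adj(p) + tau * w) / (1.0 + tau)
        ubar = 2.0 * u - u_old
    return (u >= 0).astype(int)        # the binary labelling 1_S

w = np.full((8, 8), 5.0)
w[2:6, 2:6] = -5.0                     # strongly negative weights inside a block
seg = segment(w)
```

On this toy weighting the thresholded solution is exactly binary by construction, 0 inside the block and 1 outside.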
w_{i,j} = − log( G_f(u⋄_{i,j}; μ_f, Σ_f, α_f) / G_b(u⋄_{i,j}; μ_b, Σ_b, α_b) ),
for l = 1, . . . , L and each foreground pixel (i, j) ∈ fg, and similarly for
(π b )i,j,l , (i, j) ∈ bg (here fg, bg ⊂ {1, . . . , n} × {1, . . . , m} denote the set
of foreground and background pixels, respectively).
After solving the segmentation problem, the Gaussian mixture models can
be re-computed and the segmentation can be refined.
Figure 7.10. Interactive image segmentation using the continuous two-label image
segmentation model. (a) Input image overlaid with the initial segmentation pro-
vided by the user. (b) The weighting function w, computed using the negative
log-ratio of two Gaussian mixture models fitted to the initial segments. (c) Binary
solution of the segmentation problem, and (d) the result of performing background
removal.
This model can be interpreted as the continuous version of the ‘Potts’ model
that has also been proposed in statistical mechanics to model the interac-
tions of spins on a crystalline lattice. It is also widely used as a smoothness
term in graphical models for computer vision, and can be minimized (ap-
proximately) by specialized combinatorial optimization algorithms such as
those proposed by Boykov et al. (2001).
The continuous Potts model (7.25) is also closely related to the seminal
Mumford–Shah model (Mumford and Shah 1989), where the smooth ap-
denotes the (K − 1)-dimensional unit simplex, and the vectorial total vari-
ation is given by
∫_Ω |Du|_P = sup{ −∫_Ω u(x) · div ϕ(x) dx : ϕ ∈ C^∞(Ω; R^{d×K}), ϕ(x) ∈ C_P, for all x ∈ Ω },
where CP is a convex set, for which various choices can be made. If the
convex set is given by
C_P1 = { ξ = (ξ_1, . . . , ξ_K) ∈ R^{d×K} : |ξ_k|_2 ≤ 1/2, for all k },
the vectorial total variation is simply the sum of the total variations of the
single channels (Zach, Gallup, Frahm and Niethammer 2008). Chambolle,
Cremers and Pock (2012) have shown that a strictly larger convex function
is obtained by means of the so-called paired calibration (Lawlor and Morgan
1994, Brakke 1995). In this case, the convex set is given by
C_P2 = { ξ = (ξ_1, . . . , ξ_K) ∈ R^{d×K} : |ξ_k − ξ_l|_2 ≤ 1, for all k ≠ l },
which has a more complicated structure than CP1 but improves the convex
relaxation. See Figure 7.11 for a comparison. Note that unlike in the
two-phase case, the relaxation is not exact. Thresholding or rounding a
Figure 7.11. Demonstration of the quality using different relaxations. (a) Input
image, where the task is to compute a partition of the grey zone in the middle of the
image using the three colours as boundary constraints. (b) Colour-coded solution
using the simple relaxation CP1 , and (c) result using the stronger relaxation CP2 .
Observe that the stronger relaxation exactly recovers the true solution, which is a
triple junction.
and the vectorial total variation that is intended to measure half the length
of the total boundaries is given by
‖Du‖_{2,P} = sup_P ⟨Du, P⟩, such that P_{i,j} ∈ C_P for all i, j,
particular choice of the set. If we choose the weaker set CP1 the projection
reduces to K independent projections onto the 2-ball with radius 1/2. If
we choose the stronger relaxation CP2 , no closed-form solution is available
to compute the projection. A natural approach is to implement Dykstra’s
iterative projection method (Dykstra 1983), as CP2 is the intersection of sim-
ple convex sets on which a projection is straightforward. Another efficient
possibility would be to introduce Lagrange multipliers for the constraints
defining this set, but in a progressive way as they get violated. Indeed,
in practice, it turns out that few of these constraints are actually active,
in general no more than two or three, and only in a neighbourhood of the
boundary of the segmentation.
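Dykstra's method differs from plain alternating projections by carrying a correction term per set, and this matters: for a point outside the intersection, alternating projections returns *some* point of the intersection, whereas Dykstra returns its projection. A small self-contained sketch on two half-planes (our own example, chosen so the exact projection can be computed by hand for comparison):

```python
import numpy as np

def proj_halfplane(x, a, b):
    # projection onto the half-plane {x : <a, x> <= b}
    viol = max(0.0, float(a @ x - b) / float(a @ a))
    return x - viol * a

def dykstra(z, projs, iters=100):
    """Dykstra's algorithm for projecting z onto an intersection of
    convex sets, given the individual projection operators."""
    x = z.copy()
    corr = [np.zeros_like(z) for _ in projs]   # one correction term per set
    for _ in range(iters):
        for i, proj in enumerate(projs):
            y = proj(x + corr[i])
            corr[i] = x + corr[i] - y
            x = y
    return x

z = np.array([3.0, 0.0])
projs = [lambda x: proj_halfplane(x, np.array([1.0, 0.0]), 1.0),   # x1 <= 1
         lambda x: proj_halfplane(x, np.array([1.0, 1.0]), 0.0)]   # x1 + x2 <= 0
xhat = dykstra(z, projs)
```

By the KKT conditions, the exact projection of (3, 0) onto the intersection is (1, −1), which Dykstra recovers.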
Figure 7.12 shows the application of interactive multilabel image segmentation using four phases. We again use the user input to specify the desired regions, and we fit Gaussian mixture models (7.24), G_k(·; μ_k, Σ_k, α_k), k = 1, . . . , K, with 10 components to those initial regions. The weight functions w_k are computed using the negative log probability of the respective mixture models, that is,
wi,j,k = − log Gk (u⋄i,j ; µk , Σk , αk ), k = 1, . . . , K.
It can be observed that the computed phases uk are almost binary, which
indicates that the computed solution is close to a globally optimal solution.
7.11. Curvature
Using curvature information in imaging is mainly motivated by findings
in psychology that so-called subjective (missing) object boundaries that
are seen by humans are linear or curvilinear (Kanizsa 1979). Hence such
boundaries can be well recovered by minimizing the ‘elastica functional’
$$\int_\gamma (\alpha + \beta \kappa^2)\, \mathrm{d}\gamma, \tag{7.29}$$
Figure 7.12. Interactive image segmentation using the multilabel Potts model.
(a) Input image overlaid with the initial segmentation provided by the user. (b) Fi-
nal segmentation, where the colour values correspond to the average colours of the
segments. (c–f) The corresponding phases uk . Observe that the phases are close
to binary and hence the algorithm was able to find an almost optimal solution.
image:
$$\int_\Omega |\nabla u| \left( \alpha + \beta \left( \operatorname{div} \frac{\nabla u}{|\nabla u|} \right)^2 \right) \mathrm{d}x. \tag{7.30}$$
Here, $\operatorname{div}(\nabla u/|\nabla u|) = \kappa_{\{u=u(x)\}}(x)$ represents the curvature of the level
line/surface of u passing through x, and thanks to the co-area formula this
112 A. Chambolle and T. Pock
In the lifted space a new regularization term can be defined that penalizes
curvature information. Such a regularizer – called total vertex regularization
(TVX) – is given by
$$\sup_{\psi(x,\cdot) \in B_\rho} \int_{\Omega \times S^1} D_x \psi(x, \vartheta) \cdot \vartheta \, \mathrm{d}\mu(x, \vartheta), \tag{7.31}$$
and which generalizes the L1 -norm to measures (Evans and Gariepy 1992).
This enforces sparsity of the lifted measure µ. In practice, it turns out that a
combination of total variation regularization and total vertex regularization
performs best. An image restoration model combining both total variation
and total vertex regularization is given by
$$\min_{(u,\mu)} \ \alpha \sup_{\psi(x,\cdot)\in B_\rho} \int_{\Omega\times S^1} D_x\psi(x,\vartheta)\cdot\vartheta \,\mathrm{d}\mu(x,\vartheta) + \beta\|\mu\|_{\mathcal{M}} + \frac{1}{2}\|u - u^\diamond\|^2,$$
$$\text{such that } (u,\mu) \in \mathcal{L}^\mu_{Du} = \{(u,\mu) \mid \mu \text{ is the lifting of } Du\}, \tag{7.35}$$
where α and β are tuning parameters. Clearly, the constraint that µ is
a lifting of Du represents a non-convex constraint. A convex relaxation
of this constraint is obtained by replacing LµDu with the following convex
constraint:
$$\mathcal{L}^\mu_{Du} = \left\{ (u,\mu) \ \middle|\ \mu \ge 0, \ \int_\Omega \varphi \cdot \mathrm{d}Du^\perp = \int_{\Omega\times S^1} \varphi(x)\cdot\vartheta \,\mathrm{d}\mu(x,\vartheta) \right\}, \tag{7.36}$$
for all smooth test functions ϕ that are compactly supported on Ω. With
Figure 7.13. A 16-neighbourhood system on the grid. The black dots refer to the
grid points xi,j , the shaded squares represent the image pixels Ωi,j , and the line
segments li,j,k connecting the grid points are depicted by thick lines.
this, the problem becomes convex and can be solved. However, it remains
unclear how close minimizers of the relaxed problem are to minimizers of
the original problem.
It turns out that the total vertex regularization functional works best
for inpainting tasks, since it tries to connect level lines in the image with
curves with a small number of corners or small curvature. Tackling the
TVX models numerically is not an easy task because the lifted measure
is expected to concentrate on line-like structures in the roto-translational
space. Let us assume our image is defined on a rectangular domain Ω =
[0, n) × [0, m). On this domain we consider a collection of square pixels
$\{\Omega_{i,j}\}_{i=1,j=1}^{m,n}$ with $\Omega_{i,j} = [j-1, j) \times [i-1, i)$, such that $\Omega = \bigcup_{i=1,j=1}^{m,n} \Omega_{i,j}$.
Furthermore, we consider a collection of grid points $\{x_{i,j}\}_{i=1,j=1}^{m,n}$ with $x_{i,j} = (j, i)$,
such that the grid points $x_{i,j}$ are located on the lower right corners of
the corresponding image pixels $\Omega_{i,j}$. Using the collection of image pixels,
we consider a piecewise constant image
$$u \in \{u : \Omega \to \mathbb{R} : u(x) = U_{i,j} \text{ for all } x \in \Omega_{i,j}\},$$
where $U \in \mathbb{R}^{m\times n}$ is the discrete version of the continuous image u.
Following Bredies et al. (2015b), we use a neighbourhood system based
on a set of o distinct displacement vectors δk = (δk1 , δk2 ) ∈ Z2 . On a regular
grid, it is natural to define a system consisting of 4, 8, 16, 32, etc. neighbours.
Figure 7.13 depicts an example based on a neighbourhood system of 16
neighbours. The displacement vectors naturally imply orientations ϑk ∈ S1 ,
defined by ϑk = δk /|δk |2 . We shall assume that the displacement vectors δk
are ordered such that the corresponding orientations ϑk are ordered on S1 .
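The construction of the displacement vectors and their ordered orientations can be sketched as follows. This is an illustrative simplification: the selection rule via coprime coordinates is an assumption of this sketch that happens to reproduce the 8- and 16-neighbourhood counts:

```python
import math

def neighbourhood(radius=2):
    """Displacement vectors of a grid neighbourhood system, sorted by angle.

    radius=1 gives an 8-neighbourhood, radius=2 the 16-neighbourhood of
    Figure 7.13: all integer displacements with coprime coordinates in the
    (2*radius+1)^2 window, so every orientation appears exactly once.
    """
    deltas = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if (dx, dy) != (0, 0) and math.gcd(abs(dx), abs(dy)) == 1:
                deltas.append((dx, dy))
    # sort displacements by their orientation on S^1
    deltas.sort(key=lambda d: math.atan2(d[1], d[0]))
    return deltas

def orientation(delta):
    """Unit vector theta_k = delta_k / |delta_k|_2 for a displacement."""
    n = math.hypot(delta[0], delta[1])
    return (delta[0] / n, delta[1] / n)
```

Sorting by `atan2` enforces the ordering of the orientations on $S^1$ assumed in the text.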
Next, we consider a collection of line segments $\{l_{i,j,k}\}_{i=1,j=1,k=1}^{m,n,o}$, where
the line segments $l_{i,j,k} = [x_{i,j}, x_{i,j} + \delta_k]$ connect the grid points $x_{i,j}$ to a
collection of neighbouring grid points $x_{\hat\imath,\hat\jmath}$, as defined by the neighbourhood
Optimization for imaging 115
$$= \sum_{\hat\imath,\hat\jmath} \sum_{k=1}^{o} V_{\hat\imath,\hat\jmath,k} \int_{l_{\hat\imath,\hat\jmath,k}} \vartheta_k^\perp \cdot \varphi_{i,j}(x) \,\mathrm{d}x \quad\Longleftrightarrow\quad DU = CV.$$
(b) TVX0, λ = 1/2; (c) TVX0, λ = 1/4; (d) TVX0, λ = 1/8; (e) TVX0, λ = 1/16;
(f) TVX1, λ = 1/2; (g) TVX1, λ = 1/4; (h) TVX1, λ = 1/8; (i) TVX1, λ = 1/16.
Figure 7.14. Comparison of TVX0 (b–e) and TVX1 (f–i) regularization for shape
denoising. One can see that TVX0 leads to a gradually simplified polygonal approximation
of the shape in U, whereas TVX1 leads to an approximation by piecewise
smooth shapes.
discrete orientations. In Figure 7.14 we show the results of TVX1 and TVX0
regularization using different weights λ in the data-fitting term. It can be
seen that TVX0 minimizes the number of corners of the shape in U and
hence leads to a gradually simplified polygonal approximation of the origi-
nal shape. TVX1 minimizes the total curvature of the shape in U and hence
leads to a piecewise smooth approximation of the shape.
In Figure 7.15 we provide a visualization of the measure µ in the roto-
translation space for the image shown in Figure 7.14(e), obtained using
TVX0 regularization. One can observe that in our discrete approximation
the measure µ nicely concentrates on thin lines in the roto-translation space.
In our second experiment we consider image inpainting. For this we
choose
$$g(U) = \sum_{(i,j)\in I} \delta_{\{U^\diamond_{i,j}\}}(U_{i,j}),$$
where U ⋄ ∈ Rm×n is a given image and I defines the set of indices for
which pixel information is available. Figure 7.16 shows the image inpainting
Figure 7.15. Visualization of the measure µ in the roto-translation space for the
image of Figure 7.14(e), obtained using TVX0 regularization. Observe that the
measure µ indeed concentrates on thin lines in this space.
results, where we have used the same test image as in Figure 7.7. The
parameters α, β of the TVX model were set to α = 0.01, β = 1, and we
used o = 32 discrete orientations. The parameter α is used to control the
amount of total variation regularization while the parameter β is used to
control the amount of curvature regularization. We tested two different
kinds of missing pixel information. In the experiment shown on the left we
randomly threw away 90% of the image pixels, whereas in the experiment
shown on the right we skipped 80% of entire rows of the image. From the
results one can see that the TVX1 models can faithfully reconstruct the
missing image information even if there are large gaps.
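Since the inpainting data term is an indicator of the known pixel values, its proximal map simply resets the known pixels and leaves the missing ones to the regularizer; a minimal sketch on a flattened image, with hypothetical names:

```python
def prox_inpainting(U, U_known, mask):
    """Proximal map of the inpainting indicator data term.

    The prox of g(U) = sum over known pixels of delta_{U_known[i]}(U[i])
    resets the known pixels to their given values and leaves the
    remaining (missing) pixels untouched, independently of the step size.
    """
    return [uk if m else u for u, uk, m in zip(U, U_known, mask)]
```

With 90% of the mask set to False, the regularizer alone determines 90% of the result, which is why curvature-aware regularization matters so much here.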
In our third experiment we apply the TVX1 regularizer for image denois-
ing in the presence of salt-and-pepper noise. Following the classical TV-ℓ1
model, we used a data term based on the ℓ1 -norm: g(U ) = λkU − U ⋄ k1 ,
where U ⋄ is the given noisy image. We applied the TVX1 model to the
same test image as used in Figure 2.3. In this experiment the parameters
for the regularizer were set to α = 0.01, β = 1. For the data term we used
λ = 0.25, and we used o = 32 discrete orientations. Figure 7.17 shows the
results obtained by minimizing the TVX1 model. The result shows that the
TVX1 model performs particularly well at preserving thin and elongated
structures (e.g. on the glass pyramid). The main reason why these models
work so well when applied to salt-and-pepper denoising is that the problem
is actually very close to inpainting, for which curvature minimizing models
were originally developed.
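The proximal map of the TV-ℓ1 data term has a simple closed form: pointwise soft-shrinkage of u towards the noisy image. A minimal sketch with illustrative names:

```python
def prox_l1_data(u, u_noisy, tau_lam):
    """Proximal map of u -> lam * ||u - u_noisy||_1 with step tau.

    Each pixel is soft-shrunk towards the corresponding noisy value by
    the amount tau*lam; pixels within tau*lam of the data snap onto it.
    """
    out = []
    for x, x0 in zip(u, u_noisy):
        d = x - x0
        if d > tau_lam:
            d -= tau_lam
        elif d < -tau_lam:
            d += tau_lam
        else:
            d = 0.0
        out.append(x0 + d)
    return out
```

The snapping behaviour is what lets the ℓ1 data term keep uncorrupted pixels exactly while ignoring gross outliers, the mechanism the text alludes to.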
Figure 7.16. Image inpainting using TVX1 regularization. (a,c,e) Input image with
90% missing pixels and recovered solutions. (b,d,f) Input image with 80% missing
lines and recovered solutions.
Figure 7.17. Denoising an image containing salt-and-pepper noise. (a) Noisy image
degraded by 20% salt-and-pepper noise. (b) Denoised image using TVX1 regular-
ization. Note the significant improvement over the result of the TV-ℓ1 model,
shown in Figure 2.3.
$$\min_{X,D} \ \lambda\|X\|_1 + \frac{1}{2}\|DX - P\|_2^2,$$
22
A more reasonable approach would of course be to learn the dictionary on a set of
representative images (excluding the test image). Although we learn the dictionary on
the patches of the original image, observe that we are still far from obtaining a perfect
reconstruction. On one hand the number of dictionary atoms (81) is relatively small
compared to the number of patches (136 500), and on the other hand the regularization
parameter also prevents overfitting.
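For a fixed dictionary, the sparse-coding subproblem of the patch-based Lasso model above is a standard Lasso and can be solved, for instance, by ISTA (forward–backward splitting). This tiny dense sketch is illustrative only; it is not the inertial PALM scheme actually used in the experiments, and all names are hypothetical:

```python
def soft(v, t):
    """Scalar soft-thresholding, the prox of t*|.|"""
    return v - t if v > t else (v + t if v < -t else 0.0)

def ista_lasso(D, p, lam, tau, iters=100):
    """ISTA for the Lasso  min_x  lam*||x||_1 + 0.5*||D x - p||^2.

    D is an m x n matrix given as a list of rows; tau is a step size that
    must satisfy tau <= 1/||D||^2 to guarantee convergence.
    """
    m, n = len(D), len(D[0])
    x = [0.0] * n
    for _ in range(iters):
        # residual r = D x - p and gradient g = D^T r
        r = [sum(D[i][j] * x[j] for j in range(n)) - p[i] for i in range(m)]
        g = [sum(D[i][j] * r[i] for i in range(m)) for j in range(n)]
        # forward-backward step: gradient descent then soft-thresholding
        x = [soft(x[j] - tau * g[j], tau * lam) for j in range(n)]
    return x
```

With the identity as dictionary, the solution is exactly the soft-thresholded data, which gives a quick sanity check.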
Figure 7.18. Image denoising using a patch-based Lasso model. (a) Original image,
and (b) its noisy variant, where additive Gaussian noise with standard deviation
0.1 has been added. (c) Learned dictionary containing 81 atoms with patch size
9 × 9, and (d) final denoised image.
Figure 7.19. Image denoising using the convolutional Lasso model. (a) The 81
convolution filters of size 9 × 9 that have been learned on the original image.
(b) Denoised image obtained by minimizing the convolutional Lasso model.
neural networks (CNNs), which have been shown to perform extremely well
on large-scale image classification tasks (Krizhevsky et al. 2012).
For learning the filters di , we minimize the convolutional Lasso problem
(7.38) with respect to both the filters di and the coefficient images vi . Some
care has to be taken to avoid a trivial solution. Therefore we fix the first filter
kernel to be a Gaussian filter and fix the corresponding coefficient image to
be the input image u⋄ . Hence, the problem is equivalent to learning the
dictionary only for the high-frequency filtered image ũ = u⋄ − g ∗ u⋄ , where
g ∈ Rl×l is a Gaussian filter with standard deviation σ = l.
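The high-frequency prefiltering step ũ = u⋄ − g ∗ u⋄ can be sketched as follows in 1D (the text uses a 2D Gaussian of size l × l with σ = l; the boundary handling and names here are illustrative assumptions):

```python
import math

def gaussian_kernel(l):
    """Normalized 1D Gaussian filter of length l with sigma = l, mirroring
    the text's convention of fixing the first filter to such a Gaussian."""
    c = (l - 1) / 2.0
    g = [math.exp(-0.5 * ((i - c) / l) ** 2) for i in range(l)]
    s = sum(g)
    return [v / s for v in g]

def highpass(u, l=9):
    """Return u - g * u (same-size convolution, edges clamped): the
    high-frequency residual on which the remaining filters are learned."""
    g = gaussian_kernel(l)
    c = (l - 1) // 2
    n = len(u)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(g):
            j = min(max(i + k - c, 0), n - 1)  # clamp indices at the boundary
            acc += w * u[j]
        out.append(u[i] - acc)
    return out
```

Because the kernel is normalized, a constant image is mapped to zero: the residual really contains only the high-frequency content.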
To minimize (7.38) in vi and di , we again use the inertial variant of the
PALM algorithm. We used k = 81 filters of size l = 9 and the first filter was
set to a Gaussian filter of the same size. The regularization parameter λ was
set to λ = 0.2. Figure 7.19(a) shows the filters we have learned on the clean
image shown in Figure 7.18(a). Comparing the learned convolution filters
to the dictionary of the patch-based Lasso problem, one can see that the
learned filters contain Gabor-like structures (Hubel and Wiesel 1959) but
also more complex structures, which is a known effect caused by the induced
shift invariance (Hashimoto and Kurata 2000). We then also applied the
convolutional Lasso model to a noisy variant of the original image, and the
result is shown in Figure 7.19(b). From the PSNR values, one can see that
the convolutional Lasso model leads to a slightly better result.
effect (the bias b is replaced with cyi wd+1 for some constant c of the order
of the norm of the samples). This smoothing makes the problem strongly
convex, hence slightly easier to solve, as one can use Algorithm 8. An
additional acceleration trick consists in starting the optimization with a
small number of samples and periodically adding to the problem a fraction
of the worst classified samples. As it is well known (and desirable) that only
a small proportion of the samples should be really useful for classification
(the ‘support vectors’ which bound the margin), it is expected, and actually
observed, that the size of the problems can remain quite small with this
strategy.
An extension of the SVM to non-linear classifiers can be achieved by
applying the kernel trick (Aı̆zerman, Braverman and Rozonoèr 1964) to
the hyperplane, which lifts the linear classifier to a new feature space of
arbitrary (even infinite) dimension (Vapnik 2000).
To illustrate this method, we have tried to learn a classifier on the 60 000
digits of the MNIST23 database (LeCun, Bottou, Bengio and Haffner 1998a):
see Figure 7.20. Whereas it is known that a kernel SVM can achieve good
performance on this dataset (see the results reported on the web page of
the project) it is computationally quite expensive, and we have tried here to
incorporate non-linearities in a simpler way. To start with, it is well known
that training a linear SVM directly on the MNIST data (which consists of
small 28 × 28 images) does not lead to good results. To improve the per-
formance, we trained the 400-component dictionary shown in Figure 7.20,
using the model in Section 7.12, and then computed the coefficients $(c_i)_{i=1}^{400}$
of each MNIST digit on this dictionary using the Lasso problem. This rep-
resents a fairly large computation and may take several hours on a standard
computer.
Then we trained the SVMs on feature vectors of the form, for each digit,
$(\tilde c, (c_i)_{i=1}^{400}, (c_i^2)_{i=1}^{400})$ (in dimension 801), where $\tilde c$ is the constant which maps
all vectors in a hyperplane ‘far’ from the origin, as explained above, and
the additional $(c_i^2)_{i=1}^{400}$ represent a non-linear lifting which slightly boosts
the separability of the vectors. This mimics a non-linear kernel SVM with
a simple isotropic polynomial kernel.
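The feature construction just described can be sketched as follows (the value of the constant c̃ and all names are illustrative assumptions; the text does not specify them):

```python
def lift_features(c, c_tilde=10.0):
    """Build the lifted feature vector (c_tilde, c, c^2) for a linear SVM.

    The constant c_tilde pushes all samples onto a hyperplane away from
    the origin (absorbing the bias into the weight vector), and the
    squared coefficients add a mild non-linearity mimicking an isotropic
    polynomial kernel.
    """
    return [c_tilde] + list(c) + [v * v for v in c]
```

For 400 Lasso coefficients this yields the 801-dimensional vectors mentioned in the text.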
The technique we have employed here is a standard ‘one-versus-one’ clas-
sification approach, which proved slightly more efficient than, for instance,
training an SVM to separate each digit from the rest. It consists in training
45 vectors wi,j , 0 ≤ i < j ≤ 9, each separating the training subset of digits i
from the digits j (in this case, in particular, each learning problem remains
quite small).
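The one-versus-one voting rule can be sketched as follows (the container layout is a hypothetical choice; each pairwise classifier votes for one of its two classes and the majority wins):

```python
def one_vs_one_predict(x, classifiers, n_classes=10):
    """Majority vote over all pairwise linear classifiers.

    classifiers[(i, j)] is the weight vector w_ij separating class i
    (positive side of the hyperplane) from class j (negative side);
    the class collecting the most votes is returned.
    """
    votes = [0] * n_classes
    for (i, j), w in classifiers.items():
        score = sum(wi * xi for wi, xi in zip(w, x))
        votes[i if score > 0 else j] += 1
    return max(range(n_classes), key=lambda k: votes[k])
```

For 10 digit classes this loops over the 45 trained vectors w_ij with 0 ≤ i < j ≤ 9.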
Then, to classify a new digit, we have counted how often it is classified
as ‘i’ or ‘j’ by wi,j (which is simply testing whether $\langle w_{i,j}, x\rangle$ is positive or
23
http://yann.lecun.com/exdb/mnist
Figure 7.22. Inverting a convolutional neural network. (a) Original image used
to compute the initial feature vector φ⋄ . (b) Image recovered from the non-linear
deconvolution problem. Due to the high degree of invariances of the CNN with
respect to scale and spatial position, the recovered image contains structures from
the same object class, but the image looks very different.
Acknowledgements
The authors benefit from support of the ANR and FWF via the ‘EANOI’
(Efficient Algorithms for Nonsmooth Optimization in Imaging) joint project,
FWF no. I1148 / ANR-12-IS01-0003. Thomas Pock also acknowledges the
support of the Austrian Science Fund (FWF) under the START project
BIVISION, no. Y729, and the European Research Council under the Hori-
zon 2020 program, ERC starting grant ‘HOMOVIS’, no. 640156. Antonin
Chambolle also benefits from support of the ‘Programme Gaspard Monge
pour l’Optimisation et la Recherche Opérationnelle’ (PGMO), through the
‘MAORI’ group, as well as the ‘GdR MIA’ of the CNRS. He also warmly
thanks Churchill College and DAMTP, Centre for Mathematical Sciences,
University of Cambridge, for their hospitality, with the support of the French
Embassy in the UK. Finally, the authors are very grateful to Yunjin Chen,
Jalal Fadili, Yura Malitsky, Peter Ochs and Glennis Starling for their com-
ments and their careful reading of the manuscript.
Theorem A.1. Let $x \in \mathcal{X}$, $0 < \theta < 1$, and assume $\mathcal{F} \neq \emptyset$. Then $(T_\theta^k x)_{k\ge 1}$
weakly converges to some point $x^* \in \mathcal{F}$.
Proof. Throughout this proof let $x^k = T_\theta^k x$ for each $k \ge 0$.
Step 1. The first observation is that since $T_\theta$ is also a weak contraction, the
sequence $(\|x^k - x^*\|)_k$ is non-increasing for any $x^* \in \mathcal{F}$ (which is also the set
of fixed points of $T_\theta$). The sequence $(x^k)_k$ is said to be Fejér-monotone with
respect to $\mathcal{F}$, which yields a lot of interesting consequences; see Bauschke
and Combettes (2011, Chapter 5) for details. It follows that for any $x^* \in \mathcal{F}$,
one can define $m(x^*) := \inf_k \|x^k - x^*\| = \lim_k \|x^k - x^*\|$. If there exists $x^*$
such that $m(x^*) = 0$, then the theorem is proved, as $x^k$ converges strongly
to $x^*$.
Step 2. If not, let us show that we still obtain $T_\theta x^k - x^k = x^{k+1} - x^k \to 0$.
An operator which satisfies this property is said to be asymptotically regular
(Browder and Petryshyn 1966). We will use the following result, which is
standard, and in fact gives a hint that this proof can be extended to more
general spaces with uniformly convex norms.
Lemma A.2. For all $\varepsilon > 0$, $\theta \in (0,1)$, there exists $\delta > 0$ such that, for all
$x, y \in \mathcal{X}$ with $\|x\|, \|y\| \le 1$ and $\|x - y\| \ge \varepsilon$,
$$\|\theta x + (1-\theta)y\| \le (1-\delta)\max\{\|x\|, \|y\|\}.$$
This follows from the strong convexity of x 7→ kxk2 (i.e. the parallelogram
identity), and we leave the proof to the reader.
Now assume that along a subsequence, we have $\|x^{k_l+1} - x^{k_l}\| \ge \varepsilon > 0$.
Observe that
$$x^{k_l+1} - x^* = \theta(x^{k_l} - x^*) + (1-\theta)(T_0 x^{k_l} - x^*)$$
and that
$$(x^{k_l} - x^*) - (T_0 x^{k_l} - x^*) = x^{k_l} - T_0 x^{k_l} = -\frac{1}{1-\theta}(x^{k_l+1} - x^{k_l}),$$
so that
$$\|(x^{k_l} - x^*) - (T_0 x^{k_l} - x^*)\| \ge \varepsilon/(1-\theta) > 0.$$
Hence we can invoke the lemma (remember that $(x^k - x^*)_k$ is globally
bounded since its norm is non-increasing), and we obtain that, for some
$\delta > 0$,
$$m(x^*) \le \|x^{k_l+1} - x^*\| \le (1-\delta)\max\{\|x^{k_l} - x^*\|, \|T_0 x^{k_l} - x^*\|\},$$
but since $\|T_0 x^{k_l} - x^*\| \le \|x^{k_l} - x^*\|$, it follows that
$$m(x^*) \le (1-\delta)\|x^{k_l} - x^*\|.$$
As $k_l \to \infty$, we get a contradiction if $m(x^*) > 0$.
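The iteration analysed here, $x^{k+1} = T_\theta x^k$ with $T_\theta = \theta I + (1-\theta)T_0$, can be illustrated numerically. The rotation map below is a stand-in nonexpansive $T_0$ chosen for this demo (an assumption, not taken from the text): its plain iterates cycle, but the averaged iterates converge to the fixed point 0 and the steps $x^{k+1} - x^k$ vanish, as in Step 2:

```python
import math

def averaged_iteration(T0, x, theta=0.5, iters=200):
    """Krasnoselskii-Mann iteration x^{k+1} = theta*x^k + (1-theta)*T0(x^k)
    for a nonexpansive map T0; returns the final iterate and the step
    norms ||x^{k+1} - x^k||, which exhibit asymptotic regularity."""
    steps = []
    for _ in range(iters):
        y = T0(x)
        x_new = [theta * a + (1 - theta) * b for a, b in zip(x, y)]
        steps.append(math.dist(x_new, x))
        x = x_new
    return x, steps
```

With `T0 = lambda p: [-p[1], p[0]]` (rotation by 90 degrees, an isometry with unique fixed point 0), the averaged iterates contract to the origin even though the plain iterates of T0 never converge.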
Step 3. Assume now that x̄ is the weak limit of some subsequence (xkl )l .
Then we claim it is a fixed point. An easy way to see it is to use Minty’s
trick (Brézis 1973) and the fact that I −Tθ is a monotone operator. Another
is to use Opial’s lemma.
Lemma A.3 (Opial 1967, Lemma 1). If the sequence (xn )n is weakly
convergent to x0 in a Hilbert space X , then, for any x 6= x0 ,
$$\liminf_n \|x_n - x\| > \liminf_n \|x_n - x_0\|.$$
The proof in the Hilbert space setting is easy and we leave it to the reader.
Since Tθ is a weak contraction, we observe that for each k,
$$\|x^k - \bar x\|^2 \ge \|T_\theta x^k - T_\theta \bar x\|^2 = \|x^{k+1} - x^k\|^2 + 2\langle x^{k+1} - x^k, x^k - T_\theta \bar x\rangle + \|x^k - T_\theta \bar x\|^2,$$
and we deduce (thanks to Step 2 above)
$$\liminf_l \|x^{k_l} - \bar x\| \ge \liminf_l \|x^{k_l} - T_\theta \bar x\|.$$
25
The proof above can easily be extended to allow for some variation of the averaging
parameter θ. This would yield convergence, for instance, for gradient descent algo-
rithms with varying steps (within some bounds) and many other similar methods.
results in this direction. Finally, we mention that one can improve such results to obtain convergence rates; in particular, Liang et al. (2015a) and Liang,
Fadili, Peyré and Luke (2015b) have recently shown that for some problems
one can get an eventual linear convergence for algorithms based on this type
of iteration.
We deduce both (4.29) (for µ = µf + µg > 0 so that ω < 1) and (4.28) (for
µ = 0 and ω = 1).
Proof of Theorem 4.10. The idea behind the proof of Beck and Teboulle
(2009) is to improve this inequality (4.37) by trying to obtain strict decay of
the term in F in the inequality. The trick is to use (4.37) at a point which
is a convex combination of the previous iterate and an arbitrary point.
If, in (4.37), we replace $x$ with $((t-1)x^k + x)/t$, $\bar x$ with $y^k$ and
$\hat x$ with $x^{k+1} = T_\tau y^k$, where $t \ge 1$ is arbitrary, we find that for any $x$ (after
multiplication by $t^2$),
$$t(t-1)(F(x^k) - F(x)) - \mu\frac{t-1}{2}\|x - x^k\|^2 + (1-\tau\mu_f)\frac{\|(t-1)x^k + x - t y^k\|^2}{2\tau}$$
$$\ge t^2(F(x^{k+1}) - F(x)) + (1+\tau\mu_g)\frac{\|(t-1)x^k + x - t x^{k+1}\|^2}{2\tau}. \tag{B.1}$$
Then we observe that
$$-\mu\frac{t-1}{2}\|x - x^k\|^2 + (1-\tau\mu_f)\frac{\|x - x^k + t(x^k - y^k)\|^2}{2\tau}$$
$$= (1 - \tau\mu_f - \mu\tau(t-1))\frac{\|x - x^k\|^2}{2\tau} + t\,\frac{1-\tau\mu_f}{\tau}\langle x - x^k, x^k - y^k\rangle + t^2(1-\tau\mu_f)\frac{\|x^k - y^k\|^2}{2\tau}$$
$$= \frac{1+\tau\mu_g - t\mu\tau}{2\tau}\Big\|x - x^k + t\frac{1-\tau\mu_f}{1+\tau\mu_g - t\mu\tau}(x^k - y^k)\Big\|^2 + t^2(1-\tau\mu_f)\Big(1 - \frac{1-\tau\mu_f}{1+\tau\mu_g - t\mu\tau}\Big)\frac{\|x^k - y^k\|^2}{2\tau}$$
$$= \frac{1+\tau\mu_g - t\mu\tau}{2\tau}\Big\|x - x^k + t\frac{1-\tau\mu_f}{1+\tau\mu_g - t\mu\tau}(x^k - y^k)\Big\|^2 - t^2(t-1)\frac{\tau\mu(1-\tau\mu_f)}{1+\tau\mu_g - t\mu\tau}\,\frac{\|x^k - y^k\|^2}{2\tau}.$$
It follows that, for any $x \in \mathcal{X}$,
$$t(t-1)(F(x^k) - F(x)) + (1+\tau\mu_g - t\mu\tau)\frac{\big\|x - x^k - t\frac{1-\tau\mu_f}{1+\tau\mu_g - t\mu\tau}(y^k - x^k)\big\|^2}{2\tau}$$
$$\ge t^2(F(x^{k+1}) - F(x)) + (1+\tau\mu_g)\frac{\|x - x^{k+1} - (t-1)(x^{k+1} - x^k)\|^2}{2\tau} + t^2(t-1)\frac{\tau\mu(1-\tau\mu_f)}{1+\tau\mu_g - t\mu\tau}\,\frac{\|x^k - y^k\|^2}{2\tau}. \tag{B.2}$$
We let $t = t_{k+1}$ above. Then we can get a useful recursion if we let
$$\omega_k = \frac{1+\tau\mu_g - t_{k+1}\mu\tau}{1+\tau\mu_g} = 1 - t_{k+1}\frac{\mu\tau}{1+\tau\mu_g} \in [0,1], \tag{B.3}$$
$$t_{k+1}(t_{k+1} - 1) \le \omega_k t_k^2, \tag{B.4}$$
$$\beta_k = \frac{t_k - 1}{t_{k+1}}\,\frac{1+\tau\mu_g - t_{k+1}\mu\tau}{1-\tau\mu_f} = \frac{t_k - 1}{t_{k+1}}\,\omega_k\frac{1+\tau\mu_g}{1-\tau\mu_f}, \tag{B.5}$$
$$y^k = x^k + \beta_k(x^k - x^{k-1}). \tag{B.6}$$
Denoting $\alpha_k = 1/t_k$ and
$$q = \frac{\tau\mu}{1+\tau\mu_g} = \frac{\tau\mu_f + \tau\mu_g}{1+\tau\mu_g} < 1,$$
we easily check that these rules are precisely the same as in Nesterov (2004,
formula (2.2.9), p. 80), with the minor difference that in our case the choice
t0 = 0, t1 = 1 is admissible26 and there is a shift in the numbering of the
sequences (x^k), (y^k). In this case we find
$$t_{k+1}^2(F(x^{k+1}) - F(x)) + \frac{1+\tau\mu_g}{2\tau}\|x - x^{k+1} - (t_{k+1}-1)(x^{k+1} - x^k)\|^2$$
$$\le \omega_k\Big(t_k^2(F(x^k) - F(x)) + \frac{1+\tau\mu_g}{2\tau}\|x - x^k - (t_k-1)(x^k - x^{k-1})\|^2\Big),$$
so that
$$t_k^2(F(x^k) - F(x)) + \frac{1+\tau\mu_g}{2\tau}\|x - x^k - (t_k-1)(x^k - x^{k-1})\|^2$$
$$\le \Big(\prod_{n=0}^{k-1}\omega_n\Big)\Big(t_0^2(F(x^0) - F(x)) + \frac{1+\tau\mu_g}{2\tau}\|x - x^0\|^2\Big). \tag{B.7}$$
26
Note, however, that this is no different from performing a first step of the forward–
backward descent scheme to the energy before actually implementing Nesterov’s iter-
ations.
will now assume, $\sqrt{q}\,t_k \le 1$ for all $k$. Finally, we also observe that
$$t_{k+1}^2 = (1 - q t_k^2)t_{k+1} + t_k^2,$$
showing that $t_k$ is an increasing sequence. It remains to estimate the factor
$$\theta_k = t_k^{-2}\prod_{n=0}^{k-1}\omega_n \quad \text{for } k \ge 1.$$
From (B.4) (with an equality) we find that
$$1 - \frac{1}{t_{k+1}} = \omega_k\frac{t_k^2}{t_{k+1}^2},$$
so
$$t_0^2\,\theta_k = \frac{t_0^2}{t_k^2}\prod_{n=0}^{k-1}\omega_n = \prod_{n=1}^{k}\Big(1 - \frac{1}{t_n}\Big) \le (1-\sqrt{q})^k$$
since $1/t_k \ge \sqrt{q}$. If $t_0 \ge 1$, then $\theta_k \le (1-\sqrt{q})^k/t_0^2$. If $t_0 \in [0,1)$, we instead write
$$\theta_k = \frac{\omega_0}{t_k^2}\prod_{n=1}^{k-1}\omega_n = \frac{\omega_0}{t_1^2}\prod_{n=2}^{k}\Big(1 - \frac{1}{t_n}\Big)$$
and observe that (B.9) yields (using $2 - q \ge 1 \ge q$)
$$t_1 = \frac{1 - q t_0^2 + \sqrt{1 + 2(2-q)t_0^2 + q^2 t_0^4}}{2} \ge 1.$$
Also, $\omega_0 \le 1 - q$ (from (B.3)), so that
$$\theta_k \le (1+\sqrt{q})(1-\sqrt{q})^k.$$
The next step is to bound $\theta_k$ by $O(1/k^2)$. It also follows from Nesterov
(2004, Lemma 2.2.4). In our notation, we have
$$\frac{1}{\sqrt{\theta_{k+1}}} - \frac{1}{\sqrt{\theta_k}} = \frac{\theta_k - \theta_{k+1}}{\sqrt{\theta_k}\sqrt{\theta_{k+1}}\big(\sqrt{\theta_k} + \sqrt{\theta_{k+1}}\big)} \ge \frac{\theta_k\big(1 - (1 - 1/t_{k+1})\big)}{2\theta_k\sqrt{\theta_{k+1}}}$$
since $\theta_k$ is non-increasing. It follows that
$$\frac{1}{\sqrt{\theta_{k+1}}} - \frac{1}{\sqrt{\theta_k}} \ge \frac{1}{2 t_{k+1}\sqrt{\theta_{k+1}}} = \frac{1}{2\sqrt{\prod_{n=0}^{k}\omega_n}} \ge \frac{1}{2},$$
showing that
$$\frac{1}{\sqrt{\theta_k}} \ge \frac{k-1}{2} + \frac{t_1}{\sqrt{\omega_0}} \ge \frac{k+1}{2}.$$
Hence, provided that $\sqrt{q}\,t_0 \le 1$, we also find
$$\theta_k \le \frac{4}{(k+1)^2}. \tag{B.10}$$
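The sequences $t_k$, $\omega_k$, $\theta_k$ can be generated numerically and checked against the bounds just derived; this is an illustrative sketch taking (B.4) with equality and $t_0 = 0$:

```python
import math

def fista_theta(k_max, q=0.0, t0=0.0):
    """Generate the relaxation sequence t_k from (B.4) with equality,
    t_{k+1}(t_{k+1} - 1) = (1 - q*t_{k+1}) * t_k^2, together with the
    decay factors theta_k = t_k^{-2} * prod(omega_n); for q = 0 this is
    the classical FISTA rule t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2."""
    t, prod_omega = t0, 1.0
    thetas = []
    for _ in range(k_max):
        # positive root of t_new^2 - (1 - q*t^2)*t_new - t^2 = 0
        b = 1.0 - q * t * t
        t_new = (b + math.sqrt(b * b + 4.0 * t * t)) / 2.0
        prod_omega *= 1.0 - q * t_new  # omega_k = 1 - q * t_{k+1}
        t = t_new
        thetas.append(prod_omega / (t * t))  # theta_{k+1}
    return thetas
```

For q = 0 the factors satisfy the O(1/k²) bound (B.10), and for q > 0 they decay geometrically like (1 + √q)(1 − √q)^k, matching the two regimes of the proof.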
Remark B.2 (constant steps). If $\mu > 0$ (which is $q > 0$), then an admissible choice which satisfies (B.3), (B.4) and (B.5) is to take $t = 1/\sqrt{q}$,
$\omega = 1 - \sqrt{q}$, and
$$\beta = \omega^2\frac{1+\tau\mu_g}{1-\tau\mu_f} = \frac{\sqrt{1+\tau\mu_g} - \sqrt{\tau\mu}}{\sqrt{1+\tau\mu_g} + \sqrt{\tau\mu}}.$$
Then (B.11) becomes
$$F(x^k) - F(x^*) \le (1-\sqrt{q})^k\Big(F(x^0) - F(x^*) + \mu\frac{\|x^0 - x^*\|^2}{2}\Big).$$
Remark B.3 (monotone algorithms). The algorithms studied here are
not necessarily ‘monotone’ in the sense that the objective F is not always
non-increasing. A workaround implemented in various papers (Tseng 2008,
Beck and Teboulle 2009) consists in choosing xk+1 to be any point for which
F (xk+1 ) ≤ F (Tτ y k ),27 which will not change (B.1) much except that, in the
last term, xk+1 should be replaced with Tτ y k . Then, the same computations
carry on, and it is enough to replace the update rule (B.6) for y k with
$$y^k = x^k + \beta_k(x^k - x^{k-1}) + \frac{t_k}{t_{k+1}}\,\omega_k\frac{1+\tau\mu_g}{1-\tau\mu_f}(T_\tau y^{k-1} - x^k)$$
$$= x^k + \beta_k\Big((x^k - x^{k-1}) + \frac{t_k}{t_k - 1}(T_\tau y^{k-1} - x^k)\Big) \tag{B.6$'$}$$
to obtain the same rates of convergence. The most sensible choice for xk+1
is to take Tτ y k if F (Tτ y k ) ≤ F (xk ), and xk otherwise (see the monotone
implementation in Beck and Teboulle 2009), in which case one of the two
terms (xk − xk−1 or Tτ y k−1 − xk ) vanishes in (B.6′ ).
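The monotone choice just described can be sketched as follows (illustrative; as footnote 27 notes, this only makes sense when F is cheap to evaluate):

```python
def monotone_step(x_prev, x_cand, F):
    """Monotone variant of the accelerated scheme: accept the
    forward-backward candidate T_tau(y^k) only if it does not increase
    the objective F, otherwise keep the previous iterate x^k."""
    return x_cand if F(x_cand) <= F(x_prev) else x_prev
```

Whichever branch is taken, one of the two difference terms in (B.6′) vanishes, which is what keeps the convergence-rate bookkeeping unchanged.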
27
This makes sense only if the evaluation of F is easy and does not take too much time.
Tao et al. (2015) recently suggested choosing xk+1 to be the point reach-
ing the minimum value between F (Tτ y k ) and F (Tτ xk ) (this requires ad-
ditional computation), hoping to attain the best rate of accelerated and
non-accelerated proximal descents, and thus obtain a linear convergence
rate for the standard ‘FISTA’ (µ = 0) implementation if F turns out to
be strongly convex. This is very reasonable and seems to be supported by
experiment, but we are not sure how to prove it.
$$f^*(y) - \langle K\tilde x, y\rangle + \frac{1}{2\sigma}\|y - \bar y\|^2 \ge f^*(\hat y) - \langle K\tilde x, \hat y\rangle + \frac{1}{2\sigma}\|\bar y - \hat y\|^2 + \frac{1}{2\sigma}\|\hat y - y\|^2,$$
where µg ≥ 0 is a convexity parameter for g, which we will consider in Sec-
tion C.2. Summing these two inequalities and rearranging, we obtain (C.1).
The PDHG algorithm corresponds to the choice (x̃, ỹ) = (2xk+1 − xk , y k ),
(x̂, ŷ) = (xk+1 , y k+1 ), (x̄, ȳ) = (xk , y k ). We deduce (assuming µg = 0) that
Equation (5.10) follows from the convexity of $(\xi, \eta) \mapsto L(\xi, y) - L(x, \eta)$, and
using
$$2\langle K(x - x^0), y - y^0\rangle \le \frac{\|x - x^0\|^2}{\tau} + \frac{\|y - y^0\|^2}{\sigma}.$$
Using (C.1) with (x̃, ỹ) = (xk+1 , y k + θk (y k − y k−1 )), we obtain that for all
(x, y) ∈ X × Y,
$$\frac{1}{2\tau_k}\|x - x^k\|^2 + \frac{1}{2\sigma_k}\|y - y^k\|^2$$
$$\ge L(x^{k+1}, y) - L(x, y^{k+1}) + \frac{1+\tau_k\mu_g}{2\tau_k}\|x - x^{k+1}\|^2 + \frac{1}{2\sigma_k}\|y - y^{k+1}\|^2$$
$$- \langle K(x^{k+1} - x), y^{k+1} - y^k\rangle + \theta_k\langle K(x^{k+1} - x), y^k - y^{k-1}\rangle$$
$$+ \frac{1 - \tau_k L_h}{2\tau_k}\|x^k - x^{k+1}\|^2 + \frac{1}{2\sigma_k}\|y^k - y^{k+1}\|^2.$$
Letting
$$\Delta_k(x, y) := \frac{\|x - x^k\|^2}{2\tau_k} + \frac{\|y - y^k\|^2}{2\sigma_k},$$
$$\frac{1+\tau_k\mu_g}{\tau_k} \ge \frac{1}{\theta_{k+1}\tau_{k+1}}, \tag{C.4}$$
$$\sigma_k = \theta_{k+1}\sigma_{k+1}, \tag{C.5}$$
we obtain
$$\theta_k\langle K(x^{k+1} - x^k), y^k - y^{k-1}\rangle \ge -\theta_k^2 L^2\sigma_k\tau_k\,\frac{\|x^k - x^{k+1}\|^2}{2\tau_k} - \frac{1}{2\sigma_k}\|y^k - y^{k-1}\|^2,$$
$$+ \frac{\sigma_k}{\sigma_0}\,\frac{1 - \theta_k^2 L^2\tau_k\sigma_k}{2\tau_k}\|x^k - x\|^2 + \frac{1}{2\sigma_0}\|y^k - y\|^2.$$
There are several choices of τk , σk , θk that will ensure a good rate of conver-
gence for the ergodic gap or for the distance kxk − xk; see Chambolle and
Pock (2015a) for a discussion. A simple choice, as in Chambolle and Pock
(2011), is to take, for k ≥ 0,
$$\theta_{k+1} = \frac{1}{\sqrt{1 + \mu_g\tau_k}}, \tag{C.8}$$
$$\tau_{k+1} = \theta_{k+1}\tau_k, \qquad \sigma_{k+1} = \frac{\sigma_k}{\theta_{k+1}}. \tag{C.9}$$
One can show that in this case, since
$$\frac{1}{\tau_{k+1}} = \frac{1}{\tau_k} + \frac{\mu_g}{1 + \sqrt{1 + \mu_g\tau_k}},$$
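The update rules (C.8)–(C.9) can be sketched as follows (an illustrative sketch; note that the product τ_kσ_k stays constant while τ_k decreases like O(1/k)):

```python
import math

def pdhg_steps(tau0, sigma0, mu_g, iters):
    """Accelerated primal-dual step sizes from (C.8)-(C.9):
    theta = 1/sqrt(1 + mu_g*tau), then tau <- theta*tau, sigma <- sigma/theta.
    The product tau*sigma is invariant under the update, preserving the
    step-size condition, while tau itself shrinks towards zero."""
    tau, sigma = tau0, sigma0
    for _ in range(iters):
        theta = 1.0 / math.sqrt(1.0 + mu_g * tau)
        tau, sigma = theta * tau, sigma / theta
    return tau, sigma
```

Keeping τσ fixed is the point of pairing (C.9)'s two updates: the convergence condition τσL² ≤ 1 chosen at k = 0 then holds for all k.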
REFERENCES
M. Aharon, M. Elad and A. Bruckstein (2006), ‘K-SVD: An algorithm for designing
overcomplete dictionaries for sparse representation', IEEE Transactions on
Signal Processing 54(11), 4311–4322.
R. K. Ahuja, T. L. Magnanti and J. B. Orlin (1993), Network flows, Prentice Hall
Inc., Englewood Cliffs, NJ. Theory, algorithms, and applications.
M. A. Aı̆zerman, È. M. Braverman and L. I. Rozonoèr (1964), ‘A probabilistic
problem on automata learning by pattern recognition and the method of
potential functions’, Avtomat. i Telemeh. 25, 1307–1323.
G. Alberti, G. Bouchitté and G. Dal Maso (2003), ‘The calibration method for
the Mumford-Shah functional and free-discontinuity problems’, Calc. Var.
Partial Differential Equations 16(3), 299–333.
Z. Allen-Zhu and L. Orecchia (2014), ‘Linear Coupling: An Ultimate Unification
of Gradient and Mirror Descent’, ArXiv e-prints.
M. Almeida and M. A. T. Figueiredo (2013), ‘Deconvolving images with un-
known boundaries using the alternating direction method of multipliers’,
IEEE Trans. on Image Processing 22(8), 3074 – 3086.
F. Alvarez (2003), ‘Weak convergence of a relaxed and inertial hybrid projection-
proximal point algorithm for maximal monotone operators in Hilbert space’,
SIAM J. on Optimization 14(3), 773–782.
F. Alvarez and H. Attouch (2001), ‘An inertial proximal method for maximal mono-
tone operators via discretization of a nonlinear oscillator with damping’, Set-
Valued Anal. 9(1-2), 3–11. Wellposedness in optimization and related topics
(Gargnano, 1999).
L. Ambrosio and S. Masnou (2003), ‘A direct variational approach to a problem
arising in image reconstruction’, Interfaces Free Bound. 5(1), 63–81.
L. Ambrosio and V. M. Tortorelli (1992), ‘On the approximation of free disconti-
nuity problems’, Boll. Un. Mat. Ital. B (7) 6(1), 105–123.
L. Ambrosio, N. Fusco and D. Pallara (2000), Functions of bounded variation and
free discontinuity problems, The Clarendon Press Oxford University Press,
New York.
A. Beck and L. Tetruashvili (2013), ‘On the convergence of block coordinate descent
type methods’, SIAM J. Optim. 23(4), 2037–2060.
S. Becker and J. Fadili (2012), A quasi-Newton proximal splitting method, in Advances in Neural Information Processing Systems 25, pp. 2627–2635.
S. Becker, J. Bobin and E. J. Candès (2011), ‘NESTA: a fast and accurate first-
order method for sparse recovery’, SIAM J. Imaging Sci. 4(1), 1–39.
S. R. Becker and P. L. Combettes (2014), ‘An algorithm for splitting parallel sums
of linearly composed monotone operators, with applications to signal recov-
ery’, J. Nonlinear Convex Anal. 15(1), 137–159.
J. Bect, L. Blanc-Féraud, G. Aubert and A. Chambolle (2004), A l1 -unified frame-
work for image restoration, in Proceedings ECCV 2004 (Prague) (T. Pajdla
and J. Matas, eds), number 3024 in ‘Lecture Notes in Computer Science’,
Springer, pp. 1–13.
A. Ben-Tal and A. Nemirovski (1998), ‘Robust convex optimization’, Math. Oper.
Res. 23(4), 769–805.
A. Ben-Tal and A. Nemirovski (2001), Lectures on modern convex optimiza-
tion, MPS/SIAM Series on Optimization, Society for Industrial and Applied
Mathematics (SIAM), Philadelphia, PA; Mathematical Programming Society
(MPS), Philadelphia, PA. Analysis, algorithms, and engineering applications.
A. Ben-Tal, L. El Ghaoui and A. Nemirovski (2009), Robust optimization, Princeton
Series in Applied Mathematics, Princeton University Press, Princeton, NJ.
J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna and G. Peyré (2015), ‘Iterative
Bregman projections for regularized transportation problems’, SIAM J. Sci.
Comput. 37(2), A1111–A1138.
A. Benfenati and V. Ruggiero (2013), ‘Inexact Bregman iteration with an applica-
tion to Poisson data reconstruction’, Inverse Problems 29(6), 065016, 31.
D. P. Bertsekas (2015), Convex Optimization Algorithms, Athena Scientific.
D. P. Bertsekas and S. K. Mitter (1973), ‘A descent numerical method for opti-
mization problems with nondifferentiable cost functionals’, SIAM J. Control
11, 637–652.
J. Bioucas-Dias and M. Figueiredo (2007), ‘A new TwIST: two-step iterative
shrinkage/thresholding algorithms for image restoration’, IEEE Trans. on
Image Processing 16, 2992–3004.
A. Blake and A. Zisserman (1987), Visual Reconstruction, MIT Press.
P. Blomgren and T. F. Chan (1998), ‘Color TV: total variation methods for
restoration of vector-valued images’, IEEE Transactions on Image Processing
7(3), 304–309.
J. Bolte, A. Daniilidis and A. Lewis (2006), 'The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems', SIAM J. Optim. 17(4), 1205–1223 (electronic).
J. Bolte, S. Sabach and M. Teboulle (2014), ‘Proximal alternating linearized mini-
mization for nonconvex and nonsmooth problems’, Math. Program. 146(1-2,
Ser. A), 459–494.
S. Bonettini and V. Ruggiero (2012), ‘On the convergence of primal-dual hybrid
gradient algorithms for total variation image restoration’, J. Math. Imaging
Vision 44(3), 236–253.
J. V. Burke and M. Qian (1999), ‘A variable metric proximal point algorithm for
monotone operators’, SIAM J. Control Optim. 37(2), 353–375 (electronic).
J. V. Burke and M. Qian (2000), ‘On the superlinear convergence of the variable
metric proximal point algorithm using Broyden and BFGS matrix secant
updating’, Math. Program. 88(1, Ser. A), 157–181.
R. H. Byrd, P. Lu, J. Nocedal and C. Y. Zhu (1995), ‘A limited memory algorithm
for bound constrained optimization’, SIAM J. Sci. Comput. 16(5), 1190–1208.
J.-F. Cai, E. J. Candès and Z. Shen (2010), ‘A singular value thresholding algorithm
for matrix completion’, SIAM J. on Optimization 20(4), 1956–1982.
E. Candès, L. Demanet, D. Donoho and L. Ying (2006a), ‘Fast discrete curvelet
transforms’, Multiscale Model. Simul. 5(3), 861–899 (electronic).
E. J. Candès, X. Li, Y. Ma and J. Wright (2011), ‘Robust principal component
analysis?’, J. ACM 58(3), Art. 11, 37.
E. J. Candès, J. Romberg and T. Tao (2006b), ‘Robust uncertainty principles:
Exact signal reconstruction from highly incomplete frequency information’,
IEEE Trans. Inform. Theory pp. 489–509.
A. Chambolle (1994), Partial differential equations and image processing, in Pro-
ceedings 1994 International Conference on Image Processing, Austin, Texas,
USA, November 13-16, 1994, pp. 16–20.
A. Chambolle (1999), ‘Finite-differences discretizations of the Mumford-Shah func-
tional’, M2AN Math. Model. Numer. Anal. 33(2), 261–288.
A. Chambolle (2004a), ‘An algorithm for mean curvature motion’, Interfaces Free
Bound. 6(2), 195–218.
A. Chambolle (2004b), ‘An algorithm for total variation minimization and applica-
tions’, J. Math. Imaging Vision 20(1-2), 89–97. Special issue on mathematics
and image analysis.
A. Chambolle (2005), Total variation minimization and a class of binary MRF
models, in Energy Minimization Methods in Computer Vision and Pattern
Recognition, pp. 136–152.
A. Chambolle and J. Darbon (2009), ‘On total variation minimization and surface
evolution using parametric maximum flows’, Int. J. Comput. Vis. 84(3), 288–
307.
A. Chambolle and J. Darbon (2012), A parametric maximum flow approach for
discrete total variation regularization, in Image Processing and Analysis with
Graphs: Theory and Practice, CRC Press.
A. Chambolle and C. Dossal (2015), ‘On the convergence of the iterates of the fast
iterative shrinkage/thresholding algorithm’, Journal of Optimization Theory
and Applications 166(3), 968–982.
A. Chambolle and P.-L. Lions (1995), Image restoration by constrained total vari-
ation minimization and variants, in Investigative and Trial Image Processing,
San Diego, CA (SPIE vol. 2567), pp. 50–59.
A. Chambolle and P.-L. Lions (1997), ‘Image recovery via total variation minimiza-
tion and related problems’, Numer. Math. 76(2), 167–188.
A. Chambolle and T. Pock (2011), ‘A first-order primal-dual algorithm for convex
problems with applications to imaging’, J. Math. Imaging Vision 40(1), 120–
145.
A. Chambolle and T. Pock (2015a), ‘On the ergodic convergence rates of a first-
order primal-dual algorithm’, Mathematical Programming, pp. 1–35 (online
first).
A. Chambolle and T. Pock (2015b), ‘A remark on accelerated block coordinate
descent for computing the proximity operators of a sum of convex functions’,
SMAI-Journal of Computational Mathematics 1, 29–54.
A. Chambolle, D. Cremers and T. Pock (2012), ‘A convex approach to minimal
partitions’, SIAM J. Imaging Sci. 5(4), 1113–1158.
A. Chambolle, R. A. DeVore, N.-y. Lee and B. J. Lucier (1998), ‘Nonlinear
wavelet image processing: variational problems, compression, and noise re-
moval through wavelet shrinkage’, IEEE Trans. Image Process. 7(3), 319–335.
A. Chambolle, S. E. Levine and B. J. Lucier (2011), ‘An upwind finite-difference
method for total variation-based image smoothing’, SIAM J. Imaging Sci.
4(1), 277–299.
T. Chan and L. Vese (2001), ‘Active contours without edges’, IEEE Trans. Image
Processing 10(2), 266–277.
T. F. Chan and S. Esedoḡlu (2005), ‘Aspects of total variation regularized L1
function approximation’, SIAM J. Appl. Math. 65(5), 1817–1837 (electronic).
T. F. Chan and L. A. Vese (2002), Active contour and segmentation models using
geometric PDE’s for medical imaging, in Geometric methods in bio-medical
image processing, Math. Vis., Springer, Berlin, pp. 63–75.
T. F. Chan, S. Esedoḡlu and M. Nikolova (2006), ‘Algorithms for finding global min-
imizers of image segmentation and denoising models’, SIAM J. Appl. Math.
66(5), 1632–1648 (electronic).
T. F. Chan, G. H. Golub and P. Mulet (1999), ‘A nonlinear primal-dual method for
total variation-based image restoration’, SIAM J. Sci. Comput. 20(6), 1964–
1977 (electronic).
R. Chartrand and B. Wohlberg (2013), A nonconvex ADMM algorithm for group
sparsity with sparse groups, in Acoustics, Speech and Signal Processing
(ICASSP), 2013 IEEE International Conference on, IEEE, pp. 6009–6013.
G. Chen and M. Teboulle (1993), ‘Convergence analysis of a proximal-like mini-
mization algorithm using Bregman functions’, SIAM J. Optim. 3(3), 538–543.
G. Chen and M. Teboulle (1994), ‘A proximal-based decomposition method for
convex minimization problems’, Math. Programming 64(1, Ser. A), 81–101.
S. Chen and D. Donoho (1994), Basis pursuit, in 28th Asilomar Conf. on Signals,
Systems, and Computers, pp. 41–44.
S. S. Chen, D. L. Donoho and M. A. Saunders (1998), ‘Atomic decomposition by
basis pursuit’, SIAM J. Sci. Comput. 20(1), 33–61.
Y. Chen, G. Lan and Y. Ouyang (2014a), ‘Optimal primal-dual methods for a class
of saddle point problems’, SIAM J. Optim. 24(4), 1779–1814.
Y. Chen, R. Ranftl and T. Pock (2014b), ‘Insights into analysis operator learning:
From patch-based sparse models to higher order MRFs’, IEEE Transactions
on Image Processing 23(3), 1060–1072.
E. Chouzenoux, J.-C. Pesquet and A. Repetti (2014), ‘Variable metric forward-
backward algorithm for minimizing the sum of a differentiable function and
a convex function’, J. Optim. Theory Appl. 162(1), 107–132.
P. L. Davies and A. Kovac (2001), ‘Local extremes, runs, strings and multiresolu-
tion’, The Annals of Statistics 29(1), 1–65.
D. Davis (2015), ‘Convergence rate analysis of primal-dual splitting schemes’,
SIAM J. Optim. 25(3), 1912–1943.
D. Davis and W. Yin (2014a), ‘Convergence rate analysis of several splitting
schemes’, ArXiv e-prints.
D. Davis and W. Yin (2014b), ‘Faster convergence rates of relaxed Peaceman-
Rachford and ADMM under regularity assumptions’, ArXiv e-prints.
D. Davis and W. Yin (2015), A three-operator splitting scheme and its op-
timization applications, Technical report. CAM Report 15-13 / preprint
arXiv:1504.01032.
W. Deng and W. Yin (2015), ‘On the global and linear convergence of the gen-
eralized alternating direction method of multipliers’, Journal of Scientific
Computing pp. 1–28.
R. A. DeVore (1998), Nonlinear approximation, in Acta numerica, 1998, Vol. 7 of
Acta Numer., Cambridge Univ. Press, Cambridge, pp. 51–150.
D. L. Donoho (1995), ‘De-noising by soft-thresholding’, IEEE Trans. Inform. The-
ory 41(3), 613–627.
D. L. Donoho (2006), ‘Compressed sensing’, IEEE Trans. Inform. Theory
52(4), 1289–1306.
J. Douglas and H. H. Rachford (1956), ‘On the numerical solution of heat conduc-
tion problems in two and three space variables’, Transactions of the American
Mathematical Society 82, 421–439.
Y. Drori, S. Sabach and M. Teboulle (2015), ‘A simple algorithm for a class of non-
smooth convex-concave saddle-point problems’, Oper. Res. Lett. 43(2), 209–
214.
K.-B. Duan and S. S. Keerthi (2005), Which is the best multiclass SVM method?
An empirical study, in Multiple Classifier Systems: 6th International Workshop,
MCS 2005, Seaside, CA, USA, June 13–15, 2005, Proceedings (N. C. Oza,
R. Polikar, J. Kittler and F. Roli, eds), Vol. 3541 of LNCS, Springer,
pp. 278–285.
J. Duchi, S. Shalev-Shwartz, Y. Singer and T. Chandra (2008), Efficient projections
onto the ℓ1-ball for learning in high dimensions, in Proceedings of the 25th
International Conference on Machine Learning, ICML ’08, ACM, New York,
pp. 272–279.
F.-X. Dupé, M. J. Fadili and J.-L. Starck (2012), ‘Deconvolution under Poisson
noise using exact data fidelity and synthesis or analysis sparsity priors’, Stat.
Methodol. 9(1-2), 4–18.
J. Duran, M. Moeller, C. Sbert and D. Cremers (2016a), ‘Collaborative total vari-
ation: a general framework for vectorial TV models’, SIAM J. Imaging Sci.
9(1), 116–151.
J. Duran, M. Moeller, C. Sbert and D. Cremers (2016b), ‘On the implementation
of collaborative TV regularization: application to cartoon+texture decompo-
sition’, Image Processing On Line 6, 27–74.
http://dx.doi.org/10.5201/ipol.2016.141.
R. L. Dykstra (1983), ‘An algorithm for restricted least squares regression’, J.
Amer. Statist. Assoc. 78(384), 837–842.
G. Easley, D. Labate and W.-Q. Lim (2008), ‘Sparse directional image represen-
tations using the discrete shearlet transform’, Appl. Comput. Harmon. Anal.
25(1), 25–46.
J. Eckstein (1989), Splitting methods for monotone operators with applications
to parallel optimization, PhD thesis, Massachusetts Institute of Technology.
J. Eckstein (1993), ‘Nonlinear proximal point algorithms using Bregman functions,
with applications to convex programming’, Mathematics of Operations Re-
search 18(1), 202–226.
J. Eckstein and D. P. Bertsekas (1992), ‘On the Douglas-Rachford splitting method
and the proximal point algorithm for maximal monotone operators’, Math.
Programming 55(3, Ser. A), 293–318.
I. Ekeland and R. Témam (1999), Convex analysis and variational problems, Vol. 28
of Classics in Applied Mathematics, english edn, Society for Industrial and
Applied Mathematics (SIAM), Philadelphia, PA. Translated from the French.
E. Esser (2009), Applications of Lagrangian-based alternating direction methods
and connections to split Bregman, CAM Reports 09-31, UCLA, Center for
Applied Math.
E. Esser, X. Zhang and T. F. Chan (2010), ‘A general framework for a class of
first order primal-dual algorithms for convex optimization in imaging science’,
SIAM J. Imaging Sci. 3(4), 1015–1046.
L. C. Evans and R. F. Gariepy (1992), Measure theory and fine properties of func-
tions, CRC Press, Boca Raton, FL.
H. Federer (1969), Geometric measure theory, Springer-Verlag New York Inc., New
York.
O. Fercoq and P. Bianchi (2015), ‘A coordinate descent primal-dual algorithm
with large step size and possibly non-separable functions’, ArXiv e-prints.
O. Fercoq and P. Richtárik (2013a), Accelerated, parallel and proximal coordinate
descent, Technical report. arXiv:1312.5799.
O. Fercoq and P. Richtárik (2013b), Smooth minimization of nonsmooth functions
with parallel coordinate descent methods, Technical report. arXiv:1309.5885.
S. Ferradans, N. Papadakis, G. Peyré and J.-F. Aujol (2014), ‘Regularized discrete
optimal transport’, SIAM J. Imaging Sci. 7(3), 1853–1882.
M. Fortin and R. Glowinski (1982), Méthodes de lagrangien augmenté, Vol. 9
of Méthodes Mathématiques de l’Informatique [Mathematical Methods of In-
formation Science], Gauthier-Villars, Paris. Applications à la résolution
numérique de problèmes aux limites. [Applications to the numerical solution
of boundary value problems].
X. L. Fu, B. S. He, X. F. Wang and X. M. Yuan (2014), ‘Block-wise alternating
direction method of multipliers with Gaussian back substitution for multiple-
block convex programming’.
M. Fukushima and H. Mine (1981), ‘A generalized proximal point algorithm for cer-
tain nonconvex minimization problems’, Internat. J. Systems Sci. 12(8), 989–
1000.
D. Gabay (1983), Applications of the method of multipliers to variational in-
equalities, in Augmented Lagrangian Methods: Applications to the Solution of
Boundary-Value Problems (M. Fortin and R. Glowinski, eds), North-Holland,
Amsterdam.
L. Grippo and M. Sciandrone (2000), ‘On the convergence of the block nonlinear
Gauss-Seidel method under convex constraints’, Oper. Res. Lett. 26(3), 127–
136.
O. Güler (1991), ‘On the convergence of the proximal point algorithm for convex
minimization’, SIAM Journal on Control and Optimization 29, 403–419.
O. Güler (1992), ‘New proximal point algorithms for convex minimization’, SIAM
J. Optim. 2(4), 649–664.
K. Guo and D. Labate (2007), ‘Optimally sparse multidimensional representation
using shearlets’, SIAM J. Math. Anal. 39(1), 298–318.
K. Guo, G. Kutyniok and D. Labate (2006), Sparse multidimensional repre-
sentations using anisotropic dilation and shear operators, in Wavelets and
splines: Athens 2005, Mod. Methods Math., Nashboro Press, Brentwood,
TN, pp. 189–201.
W. Hashimoto and K. Kurata (2000), ‘Properties of basis functions generated by
shift invariant sparse representations of natural images’, Biological Cybernet-
ics 83(2), 111–118.
S. Hawe, M. Kleinsteuber and K. Diepold (2013), ‘Analysis operator learning and
its application to image reconstruction’, IEEE Transactions on Image Pro-
cessing 22(6), 2138–2150.
B. He and X. Yuan (2015a), ‘On non-ergodic convergence rate of Douglas-Rachford
alternating direction method of multipliers’, Numer. Math. 130(3), 567–577.
B. He and X. Yuan (2015b), ‘On the convergence rate of Douglas–Rachford operator
splitting method’, Math. Program. 153(2, Ser. A), 715–722.
B. S. He and X. M. Yuan (2015c), ‘Block-wise alternating direction method of mul-
tipliers for multiple-block convex programming and beyond’, SMAI-Journal
of computational mathematics 1, 145–174.
B. He, Y. You and X. Yuan (2014), ‘On the convergence of primal-dual hybrid
gradient algorithm’, SIAM J. Imaging Sci. 7(4), 2526–2537.
M. R. Hestenes (1969), ‘Multiplier and gradient methods’, J. Optimization Theory
Appl. 4, 303–320.
D. S. Hochbaum (2001), ‘An efficient algorithm for image segmentation, Markov
random fields and related problems’, J. ACM 48(4), 686–701 (electronic).
T. Hohage and C. Homann (2014), A generalization of the Chambolle-Pock al-
gorithm to Banach spaces with applications to inverse problems, Technical
report. arXiv:1412.0126.
M. Hong, Z.-Q. Luo and M. Razaviyayn (2014), ‘Convergence analysis of alter-
nating direction method of multipliers for a family of nonconvex problems’,
ArXiv e-prints.
B. K. P. Horn and B. G. Schunck (1981), ‘Determining optical flow’, Artif. Intell.
17(1-3), 185–203.
D. Hubel and T. Wiesel (1959), ‘Receptive fields of single neurones in the cat’s
striate cortex’, The Journal of Physiology 148(3), 574–591.
K. Ito and K. Kunisch (1990), ‘The augmented Lagrangian method for equality and
inequality constraints in Hilbert spaces’, Math. Programming 46, 341–360.
N. A. Johnson (2013), ‘A dynamic programming algorithm for the fused lasso and
ℓ0-segmentation’, J. Computational and Graphical Statistics.
G. Kanizsa (1979), Organization in Vision, Praeger, New York.