
Proximal Gradient Methods for Machine Learning and Imaging

Saverio Salzo and Silvia Villa

1 Introduction

Convex optimization plays a key role in data science and image processing. Indeed,
on the one hand, it provides theoretical frameworks, such as duality theory and the
theory of nonexpansive operators, which are indispensable to formally analyze many
problems arising in those fields. On the other hand, convex optimization supplies a
plethora of algorithmic solutions covering a broad range of applications. In particular,
the last decades witnessed an unprecedented development of optimization methods
which are now capable of addressing structured and large-scale problems effectively.
An important class of such methods, which are at the core of modern nonlinear convex
optimization, is that of proximal gradient splitting algorithms. They are first-order
methods which are tailored to optimization problems having a composite structure
given by the sum of smooth and nonsmooth terms. These methods are splitting
algorithms, in the sense that along the iterations they process each term separately by
exploiting gradient information when available and the so-called proximity operator
for nonsmooth terms.
Even though there is a rich literature on proximal gradient algorithms, in this
contribution we pay particular attention to presenting a self-contained and unifying
analysis of the various algorithms, unveiling their common theoretical basis. We give
state-of-the-art results treating both convergence of the iterates and of the objective
function values in an infinite-dimensional setting. This work is based on the lecture

S. Salzo (B)
Istituto Italiano di Tecnologia, Via E. Melen 83, 16152 Genova, Italy
e-mail: saverio.salzo@iit.it
S. Villa
DIMA & MaLGa Center, Università degli Studi di Genova, Via Dodecaneso 35,
16146 Genova, Italy
e-mail: silvia.villa@unige.it

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
F. De Mari and E. De Vito (eds.), Harmonic and Applied Analysis, Applied and Numerical
Harmonic Analysis, https://doi.org/10.1007/978-3-030-86664-8_4

notes written for the PhD course “Introduction to Convex Optimization” that was
given by the authors at the University of Genoa during the last 5 years.
This chapter is divided into six sections. Section 2 provides an account on convex
analysis, recalling the fundamental concepts of subdifferentials, Legendre–Fenchel
transform, and duality theory. In Sect. 3, we study the proximal gradient algorithm
under different assumptions, addressing also acceleration techniques. Section 4 is
about stochastic optimization methods. We study the projected stochastic subgradi-
ent method, the proximal stochastic gradient algorithm and the randomized block-
coordinate proximal gradient algorithm. Section 5 exploits duality to derive new
algorithms. Finally, in Sect. 6, we describe several important applications in which
proximal gradient algorithms have been successfully used.

2 Preliminaries on Convex Analysis

2.1 Basic Notations

We set R₊ = {α ∈ R | α ≥ 0} and R₊₊ = {α ∈ R | α > 0}. Throughout the chapter,
X is a real Hilbert space, whose scalar product and norm are denoted by

⟨·, ·⟩ : X × X → R and ‖·‖ : X → R.

An affine set of X is a set M ⊂ X such that every straight line joining two distinct
points of M is contained in M. In formula this means that, for every x, y ∈ M,
and every λ ∈ R, we have (1 − λ)x + λy ∈ M. If M is affine then V := M − M
is a vector subspace of X , which is called the direction of M. Moreover, we have
M = V + x, for every x ∈ M. The intersection of a family of affine sets of X is still
affine, so if C ⊂ X one can define the affine hull of C, denoted by aff(C), which is the
intersection of all the affine sets of X containing C. It can be represented as the set of
the finite affine combinations of elements of C, meaning that x ∈ aff(C) if and only
if there exist finitely many points x_1, . . . , x_n ∈ C and numbers λ_1, . . . , λ_n ∈ R
(n ≥ 1) such that ∑_{i=1}^n λ_i = 1 and x = ∑_{i=1}^n λ_i x_i. The affine dimension of a set C is
the dimension of the affine hull of C. A mapping T : X → Y between Hilbert spaces
is said to be affine if T((1 − λ)x + λy) = (1 − λ)T x + λT y, for every x, y ∈ X
and λ ∈ R. An affine mapping T can be uniquely represented as T x = Ax + b, with
A : X → Y a linear operator and b ∈ Y. The image and the preimage of
affine sets through affine mappings are affine sets. An (affine) hyperplane of X is a
set of the form {x ∈ X | ϕ(x) = α}, where ϕ : X → R is a nonzero linear form on X
and α ∈ R.
For every x ∈ X and every δ > 0 we denote by B_δ(x) the (closed) ball of center
x and radius δ, that is, B_δ(x) = { y ∈ X | ‖y − x‖ ≤ δ }. Given a subset C ⊂ X, we
denote by int(C), cl(C), and bdry(C) its interior, closure, and boundary, respectively.
A hyperplane H = {x ∈ X | ϕ(x) = α} is closed if and only if ϕ is a continuous
linear form on X, so that it can be represented as H = { x ∈ X | ⟨x, u⟩ = α } with
u ∈ X \ {0}. A sequence (x_k)_{k∈N} in X converges to x ∈ X, and we write x_k → x, if
‖x_k − x‖ → 0, whereas it weakly converges to x, and we write x_k ⇀ x, if for every
u ∈ X, ⟨x_k − x, u⟩ → 0. A subset C ⊂ X is weakly sequentially closed if the weak
limit of every weakly convergent sequence in C belongs to C.
Classically, in optimization, functions and constraints are treated separately. By
introducing extended real-valued functions, they can be treated in a unified way. Here
with extended real-valued functions, we mean functions

f : X → ] − ∞, +∞] ,

so that the value −∞ will never be allowed. In the rest of the chapter, if not otherwise
specified, functions are supposed to be extended real-valued. The (effective) domain
of f is the set dom f = {x ∈ X | f (x) < +∞} and the epigraph of f is the set

epi( f ) = {(x, t) ∈ X × R | f (x) ≤ t}. (1)

Note that epi( f ) is a subset of X × R. We also define the sublevel sets of f as

[ f ≤ t] = { x ∈ X | f (x) ≤ t },   t ∈ R, (2)

and similarly, we define the sets [ f > t]. An extended real-valued function is called
proper if dom f ≠ ∅, meaning that the function admits at least one finite value. The
set of minimizers of f is denoted by argmin f.
In optimization problems, extended real-valued functions allow us to treat constraints
as functions. Indeed, let C ⊂ X and define the indicator function of C as

ι_C : X → ]−∞, +∞] : x ↦ { 0 if x ∈ C;  +∞ if x ∉ C. } (3)

Then the constrained minimization problem

min_{x∈C} h(x),   h : X → R,

can be equivalently written as

min_{x∈X} f (x),   f : X → ]−∞, +∞],   f (x) = h(x) + ι_C(x).

Note that indicator functions and epigraphs allow us to establish a one-to-one
correspondence between extended real-valued functions and sets.

2.2 Convex Sets and Functions

A subset C ⊂ X is said to be convex if

(∀ x, y ∈ C)(∀ λ ∈ [0, 1]) (1 − λ)x + λy ∈ C, (4)


 
meaning that, for every x, y ∈ C, the segment [x, y] = { x + λ(y − x) | λ ∈ [0, 1] },
joining x and y, is contained in C. A cone of X is a subset C ⊂ X such that

(∀ x ∈ C)(∀ λ ∈ R++ ) λx ∈ C, (5)

meaning that, for every x ∈ C the ray R++ x = {λx | λ ∈ R++ } is contained in C.
The intersection of a family of convex sets of X is still convex, so if A ⊂ X , then
one defines the convex hull of A, denoted by co(A), as the intersection of the family
of all convex subsets of X containing A. In fact, it is the smallest convex subset of X
containing A, and it can be represented as the set of the finite convex combinations of
elements of A, meaning that x ∈ co(A) if and only if there exist finitely many
points x_1, . . . , x_n ∈ A and numbers λ_1, . . . , λ_n ∈ R₊ (n ≥ 1) such that ∑_{i=1}^n λ_i = 1
and x = ∑_{i=1}^n λ_i x_i.
Let C be a nonempty closed convex subset of X and let x ∈ X. Then the orthogonal
projection of x onto C is defined as the unique point p ∈ C such that, for every y ∈ C,
‖p − x‖ ≤ ‖y − x‖, and is denoted by P_C(x). It is also characterized by the following
variational inequality:

(∀ y ∈ C) ⟨y − p, x − p⟩ ≤ 0.

If C is an affine set with direction V, then the above characterization becomes the
classical x − p ∈ V⊥. We recall that for convex sets the property of being closed
is equivalent to that of being weakly sequentially closed. We finally recall that the
projection operator P_C : X → X is firmly nonexpansive, that is, it satisfies

(∀ x ∈ X)(∀ y ∈ X) ‖P_C(x) − P_C(y)‖² ≤ ⟨P_C(x) − P_C(y), x − y⟩. (6)

An extended real-valued function f : X → ] − ∞, +∞ ] is convex if

(∀ x, y ∈ X )(∀ λ ∈ [0, 1]) f ((1 − λ)x + λy) ≤ (1 − λ) f (x) + λ f (y) (7)

and is strictly convex if in (7) the strict inequality holds whenever x, y ∈ dom f, x ≠ y,
and λ ∈ ]0, 1[. Finally, g : X → [−∞, +∞[ is concave (resp. strictly concave)
if −g is convex (resp. strictly convex). If f is convex, by induction, definition (7)
yields Jensen's inequality, that is, for every finite sequence (x_i)_{1≤i≤m} in X and every
(λ_i)_{1≤i≤m} ∈ R₊^m such that ∑_{i=1}^m λ_i = 1, we have
f( ∑_{i=1}^m λ_i x_i ) ≤ ∑_{i=1}^m λ_i f (x_i). (8)

The property of convexity for a function f : X → ]−∞, +∞] is equivalent to the
fact that its epigraph epi( f ) is a convex set in X × R. The function f is strongly
convex if there exists μ > 0 such that, for every x, y ∈ X and every λ ∈ [0, 1],

f((1 − λ)x + λy) ≤ (1 − λ) f (x) + λ f (y) − (μ/2)(1 − λ)λ‖x − y‖². (9)

In such case, μ is called the modulus of strong convexity of f and the function f is also
said to be μ-strongly convex. It is easy to see that a function f : X → ]−∞, +∞]
is μ-strongly convex if and only if f − (μ/2)‖·‖² is convex. Moreover, strongly
convex functions admitting a minimizer, say x∗, satisfy the following quadratic
growth condition:

(∀ x ∈ X) f (x) − f (x∗) ≥ (μ/2)‖x − x∗‖². (10)
The function f : X → ]−∞, +∞] is lower semicontinuous if for every
sequence (x_k)_{k∈N} in X and every x ∈ X, x_k → x ⇒ f (x) ≤ lim inf_k f (x_k). This
property is equivalent to the closedness of epi( f ) in X × R. We denote by Γ₀(X)
the class of functions f : X → ]−∞, +∞] which are proper, convex, and lower
semicontinuous. Such functions are continuous on the interior of their domain. When
existence of minimizers is in order, the following definition is needed. The proper
function f : X → ]−∞, +∞] is said to be coercive if

lim_{‖x‖→+∞} f (x) = +∞,

which is equivalent to saying that, for every α ∈ R, [ f ≤ α] is bounded. A proper,
convex, lower semicontinuous, and coercive function admits a global minimizer, and
if the function is strictly convex the minimizer is unique.

2.3 Differentiability and Convexity

We recall the definition of differentiable functions. Let f : X → ]−∞, +∞] be a
proper extended real-valued function and let x_0 ∈ int(dom f ). Then f is Gâteaux
differentiable at x_0 if there exists a vector ∇ f (x_0) ∈ X such that

(∀ v ∈ X) lim_{t→0} ( f (x_0 + tv) − f (x_0) )/t = ⟨v, ∇ f (x_0)⟩. (11)

In such case, ∇ f (x_0) is called the gradient of f at x_0; then f admits directional
derivatives at x_0 in every direction v, and the directional derivatives depend linearly
and continuously on v. When f is Gâteaux differentiable at every point of a subset
A ⊂ int(dom f ), we say that f is Gâteaux differentiable on A.
When dom f is open and f is differentiable on dom f, convexity is characterized
by the monotonicity of the gradient operator, i.e., ⟨x − y, ∇ f (x) − ∇ f (y)⟩ ≥ 0,
for every x, y ∈ X. Similarly, the strong convexity of f is equivalent to the strong
monotonicity of the gradient operator, that is,

(∀ x ∈ dom f )(∀ y ∈ dom f ) ⟨x − y, ∇ f (x) − ∇ f (y)⟩ ≥ μ‖x − y‖². (12)

A function f : X → R is Lipschitz smooth if it is Gâteaux differentiable on X and


its gradient is Lipschitz continuous. The following result provides several character-
izations of Lipschitz smoothness that will be useful in analyzing proximal gradient
methods. The implication (i) ⇒ (ii) is called the descent lemma, whereas the impli-
cation (i) ⇒ (iv) is called the Baillon–Haddad theorem.

Fact 1 Let f : X → R be a convex differentiable function and let L ∈ R₊. The
following statements are equivalent.
(i) (∀ x ∈ X)(∀ y ∈ X) ‖∇ f (x) − ∇ f (y)‖ ≤ L‖x − y‖.
(ii) (∀ x ∈ X)(∀ y ∈ X) f (y) − f (x) − ⟨y − x, ∇ f (x)⟩ ≤ (L/2)‖x − y‖².
(iii) (∀ x ∈ X)(∀ y ∈ X) (1/(2L))‖∇ f (x) − ∇ f (y)‖² ≤ f (y) − f (x) − ⟨y − x, ∇ f (x)⟩.
(iv) (∀ x ∈ X)(∀ y ∈ X) (1/L)‖∇ f (x) − ∇ f (y)‖² ≤ ⟨x − y, ∇ f (x) − ∇ f (y)⟩.
(v) (∀ x ∈ X)(∀ y ∈ X) ⟨∇ f (x) − ∇ f (y), x − y⟩ ≤ L‖x − y‖².
(vi) (L/2)‖·‖² − f is convex.
In case f is twice differentiable on X, the previous statements are equivalent to
(vii) (∀ x ∈ X)(∀ v ∈ X) ⟨∇² f (x)v, v⟩ ≤ L‖v‖².
(viii) (∀ x ∈ X) ‖∇² f (x)‖ ≤ L.

Fact 2 Let f : X → R be a differentiable function. Then the following are equivalent.
(i) f is μ-strongly convex and ∇ f is Lipschitz continuous with constant L.
(ii) (∀ x, y ∈ X) (1/(L + μ))‖∇ f (x) − ∇ f (y)‖² + (μL/(L + μ))‖x − y‖²
≤ ⟨x − y, ∇ f (x) − ∇ f (y)⟩.

2.4 Calculus for Nonsmooth Convex Functions

In this section, we recall the concept of subdifferentials and calculus for non-
smooth convex functions. Let f : X → ]−∞, +∞] be a proper convex function
and x ∈ dom f. The directional derivative of f at x along the vector v is
f′(x, v) = lim_{t→0⁺} ( f (x + tv) − f (x) )/t. The subdifferential of f at x is defined as

∂ f (x) := { u ∈ X | (∀ y ∈ X) f (y) ≥ f (x) + ⟨y − x, u⟩ }. (13)

Each element of ∂ f (x) is called a subgradient of f at x. If x ∉ dom f, by definition,
∂ f (x) = ∅. Finally, the domain of ∂ f, denoted by dom ∂ f, is defined as the set of
points at which the subdifferential is nonempty. It is easy to see that the subdifferential
∂ f is a monotone operator, that is, for every x, y ∈ X and u ∈ ∂ f (x), v ∈ ∂ f (y),
⟨x − y, u − v⟩ ≥ 0. If f is Gâteaux differentiable at x ∈ int(dom f ), then ∂ f (x) =
{∇ f (x)}. Let C ⊂ X be a nonempty convex set and let x ∈ C. The set ∂ι_C(x) is
called the normal cone to C at x and it is denoted by N_C(x), that is,

N_C(x) = { u ∈ X | (∀ y ∈ C) ⟨y − x, u⟩ ≤ 0 }. (14)

We have the following important facts.

Fact 3 (Fermat's rule) Let f : X → ]−∞, +∞] be a proper convex function and
x ∈ dom f. Then the following are equivalent:
(i) x is a minimizer of f ;
(ii) 0 ∈ ∂ f (x);
(iii) (∀ y ∈ X) f′(x, y − x) ≥ 0;
(iv) (∀ y ∈ dom f ) f′(x, y − x) ≥ 0.

Fact 4 Let f ∈ Γ₀(X) be μ-strongly convex and x, u ∈ X. Then

u ∈ ∂ f (x) ⟺ (∀ y ∈ X) f (y) ≥ f (x) + ⟨y − x, u⟩ + (μ/2)‖x − y‖².
Fact 5 (Moreau–Rockafellar) Let f ∈ Γ₀(X), g ∈ Γ₀(Y), and let A : X → Y be a
continuous linear operator, and suppose that 0 ∈ int(dom g − A(dom f )). Then,

(∀ x ∈ X) ∂( f + g ∘ A)(x) = ∂ f (x) + A*∂g(Ax). (15)

In particular, if g is Gâteaux differentiable at x ∈ int(dom g), then ∂( f + g)(x) =
∂ f (x) + {∇g(x)}.
Fact 6 Let (X_i)_{1≤i≤m} be m Hilbert spaces and let X = ∏_{i=1}^m X_i be their direct
product, endowed with the scalar product ⟨x, y⟩ = ∑_{i=1}^m ⟨x_i, y_i⟩. Let ( f_i)_{1≤i≤m} be a
family of proper convex functions, f_i : X_i → ]−∞, +∞], and define

f : X → ]−∞, +∞],  f (x) = f_1(x_1) + f_2(x_2) + ⋯ + f_m(x_m).

So the function f is separable. Then, for all x ∈ dom f = ∏_{i=1}^m dom f_i, we have

∂ f (x) = ∂ f_1(x_1) × ∂ f_2(x_2) × ⋯ × ∂ f_m(x_m).


Example 7 Let us consider the case of the ℓ¹-norm on R^d, that is, ‖x‖₁ = ∑_{i=1}^d |x_i|.
Since ‖·‖₁ is clearly separable with components |·|, it follows from Fact 6 that

∂‖·‖₁(x) = ∂|·|(x_1) × ⋯ × ∂|·|(x_d).

Fact 8 Let ( f_i)_{i∈I} be a finite family of continuous affine functions on X, say f_i =
⟨·, u_i⟩ + α_i, for some u_i ∈ X and α_i ∈ R. Let f = max_{i∈I} f_i, let x ∈ X, and set
I(x) = {i ∈ I | f_i(x) = f (x)}. Then

∂ f (x) = co{ u_i | i ∈ I(x) }. (16)

2.5 The Legendre–Fenchel Transform

Let f : X → ]−∞, +∞] be proper. The function

f* : X → ]−∞, +∞],  f*(u) = sup_{x∈X} ⟨x, u⟩ − f (x),

is called the Fenchel conjugate of f ; it is always convex and lower semicon-
tinuous. The Fenchel–Moreau theorem ensures that if f ∈ Γ₀(X) then f* ∈ Γ₀(X)
and f** = f. Thus, the transformation ·* : Γ₀(X) → Γ₀(X) is an involution, which
is called the Legendre–Fenchel transform. Let C ⊂ X. The support function of C is
the function ι_C*, which is denoted by σ_C, that is, σ_C(u) = sup_{x∈C} ⟨x, u⟩.

Fact 9 (Properties of the conjugate operation) Let f : X → ]−∞, +∞] be a
proper function. Then the following hold.
(i) Let g : X → ]−∞, +∞] be a proper function. Then f ≤ g ⇒ f* ≥ g*.
(ii) Let γ > 0. Then, for every u ∈ X, (γ f )*(u) = γ f*(u/γ).
(iii) (The conjugate of a separable function is separable.) Under the same assump-
tions of Fact 6, we have

(∀ u = (u_1, . . . , u_m) ∈ X) f*(u) = f_1*(u_1) + f_2*(u_2) + ⋯ + f_m*(u_m).

(iv) [ f (· − x_0)]* = f* + ⟨x_0, ·⟩ and [ f + ⟨·, u_0⟩]* = f*(· − u_0), for x_0, u_0 ∈ X.
(v) Let x_0 ∈ X. Then ι*_{x_0} = ⟨x_0, ·⟩.

Example 10 Let f : X → ]−∞, +∞] be a proper function. Then the following
hold.
(i) If f = (1/2)‖·‖², then f* = (1/2)‖·‖².
(ii) Let ϕ : R → ]−∞, +∞] be an even function. Then [ϕ ∘ ‖·‖]* = ϕ* ∘ ‖·‖.
(iii) Suppose that f is positively homogeneous. Then f* = ι_{∂ f (0)}. Recall that ∂ f (0)
is a closed convex cone.

Fact 11 Let f : X → ]−∞, +∞] be proper and convex and let x, u ∈ X. Then
the following hold.
(i) ⟨x, u⟩ ≤ f (x) + f*(u) (Young–Fenchel inequality).
(ii) ⟨x, u⟩ = f (x) + f*(u) ⇔ u ∈ ∂ f (x).
(iii) If f ∈ Γ₀(X), then u ∈ ∂ f (x) ⇔ x ∈ ∂ f*(u).
Fact 12 Let f ∈ Γ₀(X) be strongly convex. Then f is supercoercive, i.e.,
f (x)/‖x‖ → +∞ as ‖x‖ → +∞.
Fact 13 Let f ∈ Γ₀(X) and μ > 0. Then, if f is μ-strongly convex, we have
(a) dom f* = X, f* is differentiable on X, and ∇ f* is (1/μ)-Lipschitz continuous.
Vice versa, if (a) holds, then f is μ-strongly convex on the convex subsets of dom ∂ f.

2.6 The Fenchel–Rockafellar Duality

Duality plays a key role in convex optimization. Here we recall the Fenchel–
Rockafellar duality. We let A : X → Y be a continuous linear operator between
Hilbert spaces, f ∈ Γ₀(X), and g ∈ Γ₀(Y). Consider the problem

min_{x∈X} f (x) + g(Ax) =: Φ(x). (P)

Its dual problem (in the sense of Fenchel–Rockafellar) is

min_{u∈Y} f*(−A*u) + g*(u) =: Ψ(u). (D)

One can prove that

(∀ x ∈ X)(∀ u ∈ Y) Φ(x) ≥ −Ψ(u), (17)

hence

inf_{x∈X} Φ(x) ≥ sup_{u∈Y} −Ψ(u) = − inf_{u∈Y} Ψ(u). (18)

This means that the function Φ is (uniformly) above the function −Ψ (which is
concave). The difference between the infimum of Φ and the supremum of −Ψ, that
is, inf Φ + inf Ψ, is called the duality gap, and we say that strong duality holds if the
duality gap is zero.¹
Let S = argmin Φ and S* = argmin Ψ. Then the following are equivalent.
(i) x̂ ∈ S, û ∈ S*, and inf_X Φ + inf_Y Ψ = 0 (the duality gap is zero);
(ii) x̂ ∈ ∂ f*(−A*û) and Ax̂ ∈ ∂g*(û);
(iii) −A*û ∈ ∂ f (x̂) and û ∈ ∂g(Ax̂).

¹ Note that if inf Φ = −∞, it follows from (18) that sup(−Ψ) = − inf Ψ = −∞. In this
case, Ψ ≡ +∞ and inf Φ + inf Ψ = −∞ + ∞ does not make sense. Anyway, since there is no
gap between Φ and −Ψ, by convention we set inf Φ + inf Ψ = 0. The same situation occurs if
inf Ψ = −∞.
The conditions (ii) and (iii) above are called the KKT (Karush–Kuhn–Tucker) conditions.
Once one ensures that strong duality holds (that is, inf Φ + inf Ψ = 0), they provide
a full characterization of a pair (x̂, û) being a primal and a dual solution.
Fact 14 Suppose that one of the following conditions is satisfied.
(a) S ≠ ∅ and ∂( f + g ∘ A) = ∂ f + A* ∘ ∂g ∘ A;
(b) 0 ∈ int(dom g − A(dom f )).
Then Ψ is proper and

inf_X Φ = − min_Y Ψ, (19)

meaning that S* ≠ ∅ and inf_X Φ + inf_Y Ψ = 0.

Example 15 (Equality constraints) We consider the problem

min_{Ax=b} f (x), (20)

where f ∈ Γ₀(X) and A : X → Y is a continuous linear operator with closed range
and b ∈ Y. We assume that a solution exists and that f is continuous at some x such
that Ax = b. This problem can be equivalently formulated as

min_{x∈X} f (x) + ι_{b}(Ax), (21)

which is in the form (P). Then, in view of Fact 9(v), the dual problem of (20) is

min_{u∈Y} f*(−A*u) + ⟨b, u⟩.

Recalling Fact 14(a), to ensure the existence of dual solutions and a zero duality gap,
we need to find conditions ensuring the validity of the calculus rule (15). We first
prove that if x ∈ X is such that Ax = b, then

∂(ι{b} ◦ A)(x) = R(A∗ ) = A∗ ∂ι{b} (Ax). (22)

Indeed, we note that ι_{b} ∘ A = ι_{A⁻¹(b)} and A⁻¹(b) = x + N(A). Then,

u ∈ ∂(ι_{b} ∘ A)(x) ⟺ (∀ y ∈ A⁻¹(b)) ⟨y − x, u⟩ ≤ 0
⟺ (∀ v ∈ N(A)) ⟨v, u⟩ ≤ 0
⟺ u ∈ N(A)⊥ = R(A*).

Therefore, ∂(ι_{b} ∘ A)(x) = R(A*). Moreover, A*∂ι_{b}(Ax) = A*∂ι_{b}(b) and the
subdifferential of ι_{b} is

∂ι_{b} : Y → 2^Y : y ↦ { Y if y = b;  ∅ if y ≠ b, } (23)
hence A*∂ι_{b}(Ax) = R(A*) and (22) holds. Finally, recalling the calculus rule
for subdifferentials in Fact 5 and that we assumed that f is continuous at some
x ∈ dom(ι_{b} ∘ A), we have ∂( f + ι_{b} ∘ A)(x) = ∂ f (x) + ∂(ι_{b} ∘ A)(x) =
∂ f (x) + A*∂ι_{b}(Ax), and hence (15) holds. We note in passing that Fermat's rule
for (21) is

0 ∈ ∂( f + ι_{b} ∘ A)(x̂) ⟺ 0 ∈ ∂ f (x̂) + A*∂ι_{b}(Ax̂)
⟺ 0 ∈ ∂ f (x̂) + R(A*)
⟺ (∃ û ∈ Y) A*û ∈ ∂ f (x̂).

In the differentiable case, this condition reduces to the classical Lagrange multiplier
rule, that is, x̂ is a solution of (20) if and only if there exists a multiplier û such that
A*û = ∇ f (x̂).

2.7 Bibliographical Notes

Though convexity is a very old concept, the first systematic study of convex sets in
finite dimension is due to Minkowski [73]; concerning convex functions, it was
Jensen [58] who introduced the concept now known as midpoint convexity. The lecture
notes by Fenchel [48] constitute the first modern exposition of convex analysis in the
finite-dimensional case. Indeed, the notions of support function and Legendre–Fenchel
conjugate, as well as the duality theory presented in Sects. 2.5 and 2.6 for the special
case that A is the identity operator, were fully studied there. At the beginning of
the 1960s, convex analysis became a mathematical field in its own right, thanks to the
works by Moreau [74–76] and Rockafellar [99], who established the theory in infinite
dimension and developed the concepts of subgradients and subdifferentials, among
others. Starting from those works, the field flourished, and it is nowadays still a very
active research area.
In the following, we list the main references. Concerning the finite-dimensional
setting, we refer to the fundamental monograph [98] and the book [57]. For Hilbert
spaces, a comprehensive treatment is given in [11] (where most of the facts presented
here can be found). A lot of research has also been devoted to Banach spaces and
general topological vector spaces. For the former case, we refer to [10, 19, 88, 89],
and to [46, 99, 115] for the latter.

3 The Proximal Gradient Method

In this section, we focus on the main object of this chapter, which is the proximal
gradient algorithm (also called the forward–backward algorithm). In the following,
we describe the basic assumptions and the algorithm, whereas in the next sections we
study the convergence properties under several additional assumptions. Moreover,
we will also address techniques for accelerating the convergence.
Let f : X → R be a convex differentiable function, let g ∈ Γ₀(X), and set F =
f + g. We aim at solving the following composite optimization problem:

minimize_{x∈X} f (x) + g(x) =: F(x). (24)

The algorithm is detailed below.

Algorithm 1 (The proximal gradient method) Let x_0 ∈ X and γ > 0. Then,

for k = 0, 1, . . .
⌊ x_{k+1} = prox_{γg}( x_k − γ∇ f (x_k) ). (25)

In the above algorithm, prox_{γg} : X → X is the so-called proximity operator of
γg, which will be defined in the next section. Also, γ > 0 is the stepsize, which has
to be determined according to the smoothness properties of f. More precisely, we will
assume that the gradient ∇ f is L-Lipschitz continuous, for some L > 0, and that
the stepsize is set as

γ < 2/L. (26)
Remark 16 We stress that some restriction on the stepsize γ is needed.
Indeed, if we take g = 0 and f (x) = (L/2)‖x‖², we have

x_{k+1} = (1 − γL)x_k.

Thus, if we take γ = 2/L, we have x_{k+1} = −x_k and the sequence does not converge,
unless x_0 = 0.
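In code, iteration (25) takes the following form (a minimal sketch, not taken from the text; grad_f and prox_g stand for problem-specific callables computing the gradient of f and the proximity operator of g):

    import numpy as np

    def proximal_gradient(grad_f, prox_g, x0, gamma, n_iter=1000):
        # Iteration (25): x_{k+1} = prox_{gamma*g}(x_k - gamma*grad_f(x_k)),
        # with stepsize gamma < 2/L, L the Lipschitz constant of grad_f, cf. (26).
        # prox_g is assumed to have signature (v, t) -> prox_{t*g}(v).
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iter):
            x = prox_g(x - gamma * grad_f(x), gamma)
        return x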

Example 17 (Iterative Soft-Thresholding Algorithm (ISTA) [41]) We consider the
so-called lasso problem

minimize_{x∈R^d} (1/2)‖Ax − y‖² + λ‖x‖₁. (27)

Then, Algorithm 1 reduces to the following. Let γ ∈ ]0, 2/‖A*A‖[ and x_0 ∈ X; then

for k = 0, 1, . . .
⌊ x_{k+1} = soft_{γλ}( x_k − γA*(Ax_k − y) ). (28)

In the above equation, soft_{γλ} : R → R is the so-called soft-thresholding operator,
that is, the proximity operator of γλ|·|, which is applied componentwise (see (43)).
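A direct NumPy transcription of (28) might read as follows (a sketch; the stepsize is set to 1/‖A*A‖, which satisfies (26)):

    import numpy as np

    def soft_threshold(v, t):
        # Componentwise proximity operator of t|.|, cf. (43).
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def ista(A, y, lam, n_iter=500):
        gamma = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L with L = ||A*A|| = ||A||^2
        x = np.zeros(A.shape[1])
        for _ in range(n_iter):
            x = soft_threshold(x - gamma * A.T @ (A @ x - y), gamma * lam)
        return x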

3.1 Nonexpansive and Averaged Operators

In this section, we present the convergence theory for the method of the fixed point
iteration. We recall the classical theory for contractive operators and then we address
the case of averaged operators which is motivated by the Krasnosel’skiı̆–Mann iter-
ation.
Let X be a real Hilbert space and let T : X → X. Then
(i) T is nonexpansive if, for all x, y ∈ X, ‖T x − T y‖ ≤ ‖x − y‖;
(ii) T is a contraction if, for all x, y ∈ X, ‖T x − T y‖ ≤ q‖x − y‖, for some q ∈
]0, 1[.
A fixed point of T is a point x ∈ X such that T x = x, and the set of such points
is denoted by Fix T. In order to compute fixed points of T, we will consider the
following fixed point iteration. Let x_0 ∈ X and define, for every k ∈ N,

x_{k+1} = T x_k. (29)

An iterative method of type (29) is also called Picard iteration or the method of
successive approximations.
Remark 18
(i) Nonexpansive operators may have no fixed points. For instance, a translation
T = Id + a, with a ≠ 0, does not have any fixed point.
(ii) For nonexpansive operators, even when fixed points exist, the fixed point iter-
ation may fail to converge. Indeed, this occurs if we take T = −Id and start
with x_0 ≠ 0. More generally, rotations are nonexpansive operators admitting
a fixed point for which the fixed point iteration does not converge.
The first important result concerning existence of fixed points and the convergence
of the fixed point iteration is the following.
Theorem 19 (Banach–Caccioppoli) Let T : X → X be a q-contractive mapping for
some 0 < q < 1. Then there exists a unique fixed point of T, that is, Fix T = {x∗}.
Moreover, for the fixed point iteration (29), we have

(∀ k ∈ N) ‖x_k − x∗‖ ≤ q^k ‖x_0 − x∗‖ and ‖x_k − x∗‖ ≤ (q^k/(1 − q)) ‖x_0 − x_1‖. (30)

Proof We first note that

(∀ x, y ∈ X) ‖x − y‖ ≤ (1/(1 − q)) ( ‖x − T x‖ + ‖y − T y‖ ). (31)

Indeed, ‖x − y‖ ≤ ‖x − T x‖ + ‖T x − T y‖ + ‖T y − y‖ ≤ ‖x − T x‖ + q‖x − y‖ +
‖y − T y‖, hence (1 − q)‖x − y‖ ≤ ‖x − T x‖ + ‖T y − y‖ and (31) follows. Inequal-
ity (31) shows that there may exist at most one fixed point of T. Moreover, for every
k, h ∈ N,

‖x_k − x_h‖ ≤ (1/(1 − q)) ( ‖x_k − x_{k+1}‖ + ‖x_h − x_{h+1}‖ )
≤ (1/(1 − q)) ( ‖T^k x_0 − T^k x_1‖ + ‖T^h x_0 − T^h x_1‖ )
≤ (1/(1 − q)) ( q^k ‖x_0 − x_1‖ + q^h ‖x_0 − x_1‖ )
= ((q^k + q^h)/(1 − q)) ‖x_0 − x_1‖, (32)

where we used that T^k is q^k-contractive. Since 0 < q < 1, q^k and q^h converge to
zero as k and h go to +∞. Therefore, (x_k)_{k∈N} is a Cauchy sequence and hence it
converges, say to x∗. Then T x_k → T x∗ and T x_k = x_{k+1} → x∗, so T x∗ = x∗, that is,
x∗ is a fixed point of T. The second inequality in (30) follows from (32) by letting
h → +∞. The first inequality in (30) follows from the following chain of inequalities:

‖x_k − x∗‖ = ‖T x_{k−1} − T x∗‖ ≤ q‖x_{k−1} − x∗‖ ≤ ⋯ ≤ q^k ‖x_0 − x∗‖.

The statement follows. □

As we noted in Remark 18, for general nonexpansive operators the fixed point
iteration (29) may not converge. To overcome this situation, it is enough to slightly
modify the iteration. This leads to the following definition.
Let T : X → X be a nonexpansive operator and let λ ∈ ]0, 1[. The Krasnosel'skiı̆–
Mann iteration is defined as follows:

x_0 ∈ X,  x_{k+1} = x_k + λ(T x_k − x_k). (33)

If we look at the example given in Remark 18(ii), we now see that the iteration (33)
becomes x_{k+1} = (1 − 2λ)x_k. Since |1 − 2λ| < 1, we have that x_k = (1 − 2λ)^k x_0 →
0. Iteration (33) can be equivalently written as a fixed point iteration of the operator
T_λ = (1 − λ)Id + λT. This motivates the study of operators that are convex combi-
nations of the identity operator and nonexpansive operators and justifies the definition
below.
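In code, the Krasnosel'skiı̆–Mann iteration is equally short (a sketch, assuming T is given as a Python callable; the usage line revisits Remark 18(ii)):

    def krasnoselskii_mann(T, x0, lam=0.5, n_iter=1000):
        # Iteration (33): x_{k+1} = x_k + lam*(T(x_k) - x_k), with lam in (0, 1).
        x = x0
        for _ in range(n_iter):
            x = x + lam * (T(x) - x)
        return x

    # For T = -Id the Picard iteration oscillates between x0 and -x0, while the
    # averaged iteration gives x_{k+1} = (1 - 2*lam)*x_k -> 0, the fixed point.
    x_fixed = krasnoselskii_mann(lambda x: -x, 1.0)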

Definition 20 Let α ∈ ]0, 1[. Then T : X → X is an α-averaged operator if T =


(1 − α)Id + α R for some nonexpansive operator R. An operator which is 1/2-
averaged is also called firmly nonexpansive.

Remark 21 Since averaged operators are convex combinations of nonexpansive
operators, they are indeed nonexpansive operators. This follows by the following
chain of inequalities:

‖T x − T y‖ = ‖(1 − α)(x − y) + α(Rx − Ry)‖ ≤ (1 − α)‖x − y‖ + α‖Rx − Ry‖
≤ (1 − α)‖x − y‖ + α‖x − y‖ = ‖x − y‖.

In the following, we give several characterizations of the property of being an
averaged operator.

Lemma 22 Let x, y ∈ X and λ ∈ R. Then

‖(1 − λ)x + λy‖² = (1 − λ)‖x‖² + λ‖y‖² − (1 − λ)λ‖x − y‖². (34)

Proof Indeed,

‖(1 − λ)x + λy‖² = (1 − λ)²‖x‖² + λ²‖y‖² + 2(1 − λ)λ⟨x, y⟩
= (1 − λ)‖x‖² − λ(1 − λ)‖x‖² + λ‖y‖² − (1 − λ)λ‖y‖² + 2(1 − λ)λ⟨x, y⟩
= (1 − λ)‖x‖² + λ‖y‖² − (1 − λ)λ( ‖x‖² + ‖y‖² − 2⟨x, y⟩ )

and the statement follows. □

Proposition 23 Let T : X → X and α ∈ ]0, 1[. Then the following statements are
equivalent.
(i) T is α-averaged.
(ii) (1 − 1/α)Id + (1/α)T is nonexpansive.
(iii) For every (x, y) ∈ X²,

‖T x − T y‖² ≤ ‖x − y‖² − (1/α − 1)‖(Id − T)x − (Id − T)y‖².

(iv) For every (x, y) ∈ X²,

‖T x − T y‖² + (1 − 2α)‖x − y‖² ≤ 2(1 − α)⟨x − y, T x − T y⟩.

Proof (i) ⇔ (ii): It follows from the equivalence

T = (1 − α)Id + αR ⟺ R = (1 − 1/α)Id + (1/α)T.

(ii) ⇔ (iii): Set R = (1 − α⁻¹)Id + α⁻¹T and let x, y ∈ X. It follows from
Lemma 22 that

‖Rx − Ry‖² = ‖(1 − α⁻¹)(x − y) + α⁻¹(T x − T y)‖²
= (1 − α⁻¹)‖x − y‖² + α⁻¹‖T x − T y‖² − α⁻¹(1 − α⁻¹)‖(Id − T)x − (Id − T)y‖²

and hence

‖Rx − Ry‖² − ‖x − y‖²
= (1/α)( ‖T x − T y‖² − ‖x − y‖² + (1/α − 1)‖(Id − T)x − (Id − T)y‖² ).

So the inequality ‖Rx − Ry‖² − ‖x − y‖² ≤ 0 is equivalent to that in (iii).
(iii) ⇔ (iv): It follows from the identity

‖(Id − T)x − (Id − T)y‖² = ‖x − y‖² + ‖T x − T y‖² − 2⟨x − y, T x − T y⟩. □

Remark 24 The inequality in Proposition 23(iii) shows that if T is α-averaged,
then it is also α′-averaged for every α′ ∈ ]α, 1[. So it makes sense to consider the best
(smallest) constant of averagedness.

Remark 25 Contractions are averaged operators. More precisely, if T is a contrac-
tion with constant q, then it is (q + 1)/2-averaged. By Proposition 23(i)-(ii) it is enough
to show that R = (1 − 2/(q + 1))Id + (2/(q + 1))T is nonexpansive. Indeed, for all
x, y ∈ X,

‖Rx − Ry‖ = ‖ ((q − 1)/(q + 1))(x − y) + (2/(q + 1))(T x − T y) ‖
≤ ((1 − q)/(q + 1))‖x − y‖ + (2q/(q + 1))‖x − y‖ = ‖x − y‖.

Remark 26 In view of Definition 20 and Proposition 23(iii), an operator T is firmly
nonexpansive if and only if

(∀ x, y ∈ X) ‖T x − T y‖² ≤ ⟨x − y, T x − T y⟩. (35)

The property of being averaged is preserved under composition, as the following
result shows.

Proposition 27 Let T_1 : X → X and T_2 : X → X be two averaged operators, with
constants α_1 and α_2 respectively. Then T_1 ∘ T_2 is averaged with constant

α = (α_1 + α_2 − 2α_1α_2)/(1 − α_1α_2).

Averaged operators are important since, provided that they have fixed points, the
Picard iteration always weakly converges to some fixed point. In the rest of the
section, we will prove this result.

Lemma 28 (demiclosedness principle) Let T : X → X be a nonexpansive operator.
Then Id − T is demiclosed, that is, for every sequence (x_k)_{k∈N} in X and x, z ∈ X, we
have

x_k ⇀ x and x_k − T x_k → z ⇒ x − T x = z. (36)

Proof Let k ∈ N. Then, using the nonexpansivity of T, we have

‖x − T x − z‖² = ‖x_k − T x − z‖² − ‖x_k − x‖² − 2⟨x_k − x, x − T x − z⟩
= ‖x_k − T x_k − z‖² + ‖T x_k − T x‖² + 2⟨x_k − T x_k − z, T x_k − T x⟩
− ‖x_k − x‖² − 2⟨x_k − x, x − T x − z⟩
≤ ‖x_k − T x_k − z‖² + 2⟨x_k − T x_k − z, T x_k − T x⟩ − 2⟨x_k − x, x − T x − z⟩.

Since x_k − T x_k − z → 0, x_k − x ⇀ 0, and (T x_k)_{k∈N} is bounded, the right-hand side of
the above inequality goes to zero, and hence ‖x − T x − z‖² = 0. □

Lemma 29 (Opial) Let F ⊂ X be nonempty. Let (x_k)_{k∈N} be a sequence in X and
suppose that the weak cluster points of (x_k)_{k∈N} belong to F and that, for every y ∈ F,
(‖x_k − y‖)_{k∈N} is convergent. Then (x_k)_{k∈N} weakly converges to a point in F.

Proof The assumptions ensure that (x_k)_{k∈N} is bounded. Therefore, the set of weak
cluster points of (x_k)_{k∈N} is nonempty. Let y_1, y_2 ∈ X and let (x_k¹)_{k∈N} and (x_k²)_{k∈N} be
subsequences of (x_k)_{k∈N} such that x_k¹ ⇀ y_1 and x_k² ⇀ y_2. Then, for every k ∈ N,

‖x_k − y_1‖² − ‖y_1‖² = ‖x_k‖² − 2⟨x_k, y_1⟩
‖x_k − y_2‖² − ‖y_2‖² = ‖x_k‖² − 2⟨x_k, y_2⟩

and hence (subtracting)

2⟨x_k, y_2 − y_1⟩ = ‖x_k − y_1‖² − ‖x_k − y_2‖² − ‖y_1‖² + ‖y_2‖². (37)

Since y_1 and y_2 are weak cluster points of (x_k)_{k∈N}, by assumption, y_1, y_2 ∈ F and
(‖x_k − y_1‖)_{k∈N} and (‖x_k − y_2‖)_{k∈N} are convergent. Therefore, by (37), we obtain
that there exists β ∈ R such that ⟨x_k, y_2 − y_1⟩ → β. Now, since x_kⁱ ⇀ y_i, i = 1, 2,
we have ⟨x_kⁱ, y_2 − y_1⟩ → ⟨y_i, y_2 − y_1⟩, which implies

⟨y_1, y_2 − y_1⟩ = β = ⟨y_2, y_2 − y_1⟩

and hence ‖y_2 − y_1‖² = 0. This proves that the set of weak cluster points of the
sequence (x_k)_{k∈N} is a singleton. So, the sequence (x_k)_{k∈N} is weakly convergent. □

Theorem 30 Let α ∈ ]0, 1[ and let T : X → X be an α-averaged operator such
that the set of fixed points is nonempty, that is, Fix T ≠ ∅. Let (x_k)_{k∈N} be generated
by the fixed point iteration (29). Then the following hold.
(i) For every k ∈ N and every x∗ ∈ Fix T, ‖x_{k+1} − x∗‖ ≤ ‖x_k − x∗‖.
(ii) ∑_{k=0}^{+∞} ‖T x_k − x_k‖² ≤ (α/(1 − α)) dist(x_0, Fix T)².
(iii) (x_k)_{k∈N} weakly converges to some point x∗ ∈ Fix T.

Proof (i): Since T is nonexpansive and x∗ is a fixed point of T, ‖x_{k+1} − x∗‖ =
‖T x_k − T x∗‖ ≤ ‖x_k − x∗‖.
(ii): Let x∗ ∈ Fix T. Then, by Proposition 23(iii) (with x = x_k and y = x∗), we have

(∀ k ∈ N) ‖x_{k+1} − x∗‖² ≤ ‖x_k − x∗‖² − (1/α − 1)‖x_k − T x_k‖². (38)

Therefore,

((1 − α)/α) ∑_{k=0}^{+∞} ‖x_k − T x_k‖² ≤ ∑_{k=0}^{+∞} ( ‖x_k − x∗‖² − ‖x_{k+1} − x∗‖² ) ≤ ‖x_0 − x∗‖².

(iii): It follows from (ii) that ‖T x_k − x_k‖ → 0. Let x∗ be a weak cluster point
of (x_k)_{k∈N} and let (x_{k_j})_{j∈N} be a subsequence of (x_k)_{k∈N} such that x_{k_j} ⇀ x∗. Then
T x_{k_j} − x_{k_j} → 0. Hence, in virtue of Lemma 28, T x∗ − x∗ = 0, that is, x∗ ∈ Fix T.
Moreover, by item (i), for every x∗ ∈ Fix T, ‖x_k − x∗‖ is decreasing and hence
convergent. The statement follows from Lemma 29 with F = Fix T. □

Applying the previous theorem to the operator T_λ = (1 − λ)Id + λT and noting
that Fix T_λ = Fix T, we get the following result.

Corollary 31 Let T : X → X be a nonexpansive operator admitting fixed points
and let (x_k)_{k∈N} be generated by the Krasnosel'skiı̆–Mann iteration (33). Then (x_k)_{k∈N}
weakly converges to some fixed point of T.

3.2 The Proximity Operator

Motivated by the use of nonsmooth regularization techniques in inverse problems,
we introduce the proximity operator of a convex function.

Definition 32 Let g ∈ Γ₀(X). Then the proximity operator of g is

prox_g : X → X,  prox_g(x) = argmin_{y∈X} { g(y) + (1/2)‖y − x‖² }.

Note that the definition is well posed, since the function y ↦ g(y) + (1/2)‖y − x‖²
is lower semicontinuous and strongly convex; hence it has a unique minimizer.
Moreover, let us check that prox_g = (Id + ∂g)⁻¹. Using the sum rule for the subd-
ifferential, which holds since the squared norm is differentiable, we derive

z = prox_g(x) ⟺ 0 ∈ ∂g(z) + z − x
⟺ x ∈ (Id + ∂g)(z)
⟺ z ∈ (Id + ∂g)⁻¹(x).

This shows that (Id + ∂g)⁻¹(x) is actually a singleton and its unique element is
prox_g(x). Note that, for every x ∈ X, prox_g(x) ∈ dom g, since the minimizer of
g + (1/2)‖· − x‖² is clearly in the domain of g.

Example 33 Let C be a closed and convex set. The proximity operator of ιC is the
projection on C. The projection is nonexpansive (and, indeed, firmly nonexpansive),
but in general not a contraction, unless C is a singleton.
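For instance, the projection onto a box (a product of intervals) is a componentwise clipping, so the corresponding proximity operator is one line of NumPy (a sketch for the illustrative case C = [lo, hi]^d, which is not taken from the text):

    import numpy as np

    def proj_box(x, lo, hi):
        # prox of the indicator of the box [lo, hi]^d = projection onto it = clipping
        return np.clip(x, lo, hi)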

Proposition 34 Let g ∈ Γ₀(X). Then

(∀ x, y ∈ X) ‖prox_g(x) − prox_g(y)‖² ≤ ⟨x − y, prox_g(x) − prox_g(y)⟩. (39)

In other words, recalling (35), the operator prox_g is firmly nonexpansive.

Proof Let x, y ∈ X and set p_x = prox_g(x) and p_y = prox_g(y). Then, by Fermat's
rule, we have

x − p_x ∈ ∂g(p_x) and y − p_y ∈ ∂g(p_y).

Therefore,

g(p_y) ≥ g(p_x) + ⟨x − p_x, p_y − p_x⟩
g(p_x) ≥ g(p_y) + ⟨y − p_y, p_x − p_y⟩,

and summing, g(p_y) + g(p_x) ≥ g(p_x) + g(p_y) + ⟨y − p_y − x + p_x, p_x − p_y⟩.
Then the statement follows. □

Remark 35 The function

g_λ(u) = inf_{x∈X} { g(x) + (1/(2λ))‖x − u‖² } (40)

is called the Moreau envelope of g with parameter λ. We have that g_λ is differentiable
and the gradient of g_λ is given as

∇g_λ(u) = (u − prox_{λg}(u))/λ ∈ ∂g(prox_{λg}(u)). (41)
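As a concrete instance (a sketch, assuming g = |·|; it uses the soft-thresholding operator (43), derived below in Example 37, as prox_{λ|·|}), the Moreau envelope of the absolute value is the Huber function, and (41) gives its derivative:

    import numpy as np

    def huber(u, lam):
        # Moreau envelope of |.| with parameter lam, via (40), and its gradient via (41).
        p = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)   # prox_{lam|.|}(u), cf. (43)
        value = np.abs(p) + (p - u) ** 2 / (2.0 * lam)      # g(p) + ||p - u||^2/(2*lam)
        grad = (u - p) / lam
        return value, grad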

In the following, we provide important properties of proximity operators.

Proposition 36 (Separable sum) Let (X_i)_{1≤i≤m} be Hilbert spaces and let X =
∏_{i=1}^m X_i be their direct product. Let, for every i = 1, . . . , m, g_i ∈ Γ₀(X_i), and define
g : X → ]−∞, +∞] by g(x) = ∑_{i=1}^m g_i(x_i), for every x = (x_1, . . . , x_m) ∈ X.
Then

(∀ x = (x_1, . . . , x_m) ∈ X) prox_g(x) = (prox_{g_1}(x_1), . . . , prox_{g_m}(x_m)). (42)

Example 37
(i) (Proximity operator of the ℓ¹ norm) Let X = R^d. The ℓ¹ norm on X is separable,
thus the proximity operator can be computed componentwise, so it is enough
to compute the proximity operator of the absolute value in R. Let γ > 0. By
definition, for every t ∈ R, prox_{γ|·|}(t) = (Id + γ∂|·|)⁻¹(t). Thus, if we make
the plot of the graph of Id + γ∂|·| and invert it, we discover that

soft_γ(t) := prox_{γ|·|}(t) = { t − γ if t > γ;  0 if |t| ≤ γ;  t + γ if t < −γ. } (43)

Thus, it follows from Proposition 36 that, for every x ∈ R^d and every i =
1, . . . , d, (prox_{γ‖·‖₁}(x))_i = prox_{γ|·|}(x_i).
(ii) (Proximity operator of the ℓ¹ + ℓ² norm) Let

g(x) = ‖x‖₁ + (λ/2)‖x‖₂².

Then prox_{γg}(x) = prox_{(γ/(γλ+1))‖·‖₁}( x/(γλ + 1) ), that is,

(prox_{γg}(x))_i = { (x_i − γ)/(γλ + 1) if x_i > γ;  0 if |x_i| ≤ γ;  (x_i + γ)/(γλ + 1) if x_i < −γ. }
Proposition 38 (Properties of the proximity operator) Let h ∈ Γ₀(X) and let γ > 0.
Then the following hold.
(i) (linear perturbation) Let g = h + ⟨·, u⟩ + a, with u ∈ X and a ∈ R. Then

prox_{γg}(x) = prox_{γh}(x − γu).

(ii) Let g(x) = h(ax + b), with a ∈ R, a ≠ 0, and b ∈ X. Then

prox_{γg}(x) = ( prox_{a²γh}(ax + b) − b )/a.

(iii) (composition with an orthogonal matrix) Let g = h ∘ L, with L : X → X a bijec-
tive linear map such that L* = L⁻¹. Then

(∀ x ∈ X) prox_{γg}(x) = L* prox_{γh}(Lx).

Proof In the following, we let x ∈ X and set p = prox_{γg}(x).
(i): Since p = argmin_{y∈X} { γh(y) + γ⟨u, y⟩ + a + (1/2)‖y − x‖² }, Fermat's rule
yields

0 ∈ γ∂h(p) + γu + p − x ⟺ x − γu ∈ (Id + γ∂h)(p)
⟺ p = prox_{γh}(x − γu).

(ii): We have:

p = prox_{γg}(x) ⟺ p = argmin_{y∈X} { γh(ay + b) + (1/2)‖y − x‖² }
⟺ p = argmin_{y∈X} { γh(ay + b) + (1/(2a²))‖ay + b − (ax + b)‖² }
⟺ p = argmin_{y∈X} { γa²h(ay + b) + (1/2)‖ay + b − (ax + b)‖² }
⟺ ap + b = prox_{a²γh}(ax + b)
⟺ p = ( prox_{a²γh}(ax + b) − b )/a.

(iii): We have

p = prox_{γg}(x) ⟺ p = argmin_{y∈X} { γh(Ly) + (1/2)‖y − x‖² }
⟺ 0 ∈ γL*∂h(Lp) + p − x
⟺ x − p ∈ γL⁻¹∂h(Lp)
⟺ Lx ∈ γ∂h(Lp) + Lp
⟺ p = L* prox_{γh}(Lx).

The statement follows. □


Remark 39 Regarding Proposition 38(iii): in general, if L is not orthogonal, we can
apply a gradient descent on the dual of the minimization problem defining the prox
to compute it approximately. See Sect. 5.
We now introduce an important identity, the Moreau decomposition
formula. Let V be a closed linear subspace of X. Then we know that x can be
uniquely decomposed into two orthogonal components, P_V x and P_{V⊥} x, such that

x = x_V + x_{V⊥} = P_V x + P_{V⊥}(x). (44)

If we set f = ι_V, we first note that (ι_V)*(u) = sup_{x∈X} ⟨x, u⟩ − ι_V(x) = ι_{V⊥}(u). Thus,
we can rewrite (44) as

x = prox_{ι_V}(x) + prox_{(ι_V)*}(x).

This last formula can be generalized to every convex function.

Theorem 40 (Moreau's decomposition) Let g ∈ Γ₀(X) and let x ∈ X. Then

x = prox_g(x) + prox_{g*}(x).

More generally, for all γ > 0, x = prox_{γg}(x) + γ prox_{g*/γ}(x/γ).


Proof It follows from the list of equivalences below:

p = prox_g(x) ⟺ x − p ∈ ∂g(p)
⟺ p ∈ ∂g*(x − p)
⟺ x − (x − p) ∈ ∂g*(x − p)
⟺ x − p = prox_{g*}(x). □

Example 41 (The proximity operator of the Euclidean norm) We want to compute
the prox of the norm of X (which is a Hilbert space). First note that

‖x‖ = sup_{‖u‖≤1} ⟨x, u⟩ = σ_{B₁(0)}(x).

Hence,

‖·‖ = σ_{B₁(0)} = (ι_{B₁(0)})*.

Therefore, it follows from Theorem 40 that

prox_{‖·‖}(x) = x − prox_{ι_{B₁(0)}}(x) = x − P_{B₁(0)}(x).

More explicitly:

prox_{‖·‖}(x) = { x − x/‖x‖ if ‖x‖ > 1;  0 if ‖x‖ ≤ 1. }

Note that this operation corresponds to a vector soft thresholding, which reduces to
(43) for dim X = 1 and γ = 1.
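In NumPy this reads as follows (a sketch; the general threshold γ corresponds to prox_{γ‖·‖}(x) = x − P_{B_γ(0)}(x), which follows from Theorem 40 in the same way, since (γ‖·‖)* = ι_{B_γ(0)}):

    import numpy as np

    def prox_norm(x, gamma=1.0):
        # Vector soft-thresholding: subtract the projection onto the ball B_gamma(0).
        nx = np.linalg.norm(x)
        if nx <= gamma:
            return np.zeros_like(x)
        return (1.0 - gamma / nx) * x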
Example 42 (The proximity operator of the group lasso norm) Let J = {J_1, . . . , J_l}
be a partition of {1, . . . , d}. We define a norm on R^d by considering

‖x‖_J = ∑_{i=1}^l ( ∑_{j∈J_i} |x_j|² )^{1/2}.

For every x ∈ R^d, let us call x_{J_i} = (x_j)_{j∈J_i} ∈ R^{J_i} and denote by ‖·‖_{J_i} the Euclidean
norm on R^{J_i}. Then

‖x‖_J = ∑_{i=1}^l ‖x_{J_i}‖_{J_i}.

We next compute the proximity operator of ‖·‖_J. First note that ‖·‖_J is the
sum of functions depending on groups of variables x_{J_i}. Therefore the prox can be
computed group-wise thanks to the decomposability property (42). Thus

(prox_{‖·‖_J}(x))_{J_i} = prox_{‖·‖_{J_i}}(x_{J_i}),

and recalling Example 41, we have

(prox_{‖·‖_J}(x))_{J_i} = { x_{J_i} − x_{J_i}/‖x_{J_i}‖_{J_i} if ‖x_{J_i}‖_{J_i} > 1;  0 otherwise. }

The resulting prox operator is called the block soft-thresholding operator.
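Group-wise, this becomes a few lines of NumPy (a sketch; groups is assumed to be a list of index arrays forming the partition J, and the general threshold gamma corresponds to prox_{γ‖·‖_J}):

    import numpy as np

    def block_soft_threshold(x, groups, gamma=1.0):
        # Apply vector soft-thresholding to each block x[J], cf. Example 42.
        p = np.zeros_like(x)
        for J in groups:
            nJ = np.linalg.norm(x[J])
            if nJ > gamma:
                p[J] = (1.0 - gamma / nJ) * x[J]
        return p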

3.3 Worst Case Convergence Analysis

Algorithm 1 can be seen as a fixed-point iteration of the following operator

T = prox_{γg} ∘ (Id − γ∇ f ), (45)

which is the composition of the proximity operator of γg and the operator Id − γ∇ f.
We also note that the fixed points of T are the minimizers of f + g. Indeed

x = T x ⟺ x = prox_{γg}(x − γ∇ f (x)) ⟺ x − γ∇ f (x) − x ∈ γ∂g(x) ⟺ 0 ∈ ∂( f + g)(x).

So we need to study the operator T. We already know that prox_{γg} is firmly non-
expansive and hence (1/2)-averaged. The following result concerns the operator
Id − γ∇ f.

Proposition 43 Let f : X → R be differentiable and let L > 0. Let γ > 0 and set
T_γ = Id − γ∇ f. Then the L-Lipschitz continuity of ∇ f is equivalent to the property

(∀ x, y ∈ X) ‖T_γx − T_γy‖² ≤ ‖x − y‖² − (2/(γL) − 1)‖(Id − T_γ)x − (Id − T_γ)y‖². (46)

In particular, if γ < 2/L, then T_γ is an α-averaged operator, with α = γL/2 < 1.

Proof Multiplying by γ²L the inequality in Fact 1(iv) and replacing γ∇ f with
Id − T_γ, we obtain

‖(Id − T_γ)x − (Id − T_γ)y‖² ≤ γL⟨x − y, (Id − T_γ)x − (Id − T_γ)y⟩.

Then, using the identity

2⟨x − y, (Id − T_γ)x − (Id − T_γ)y⟩
= ‖(Id − T_γ)x − (Id − T_γ)y‖² + ‖x − y‖² − ‖T_γx − T_γy‖²,

the statement follows. □

Proposition 44 Let f : X → R be a differentiable convex function with a Lipschitz
continuous gradient with constant L, let g ∈ Γ₀(X), and set T as in (45). Suppose
that γ < 2/L. Then T is α-averaged with α = 2/(4 − γL).

Proof It follows from Proposition 43 that Id − γ∇ f is α_2-averaged with α_2 = γL/2.
Moreover, Proposition 34 yields that prox_{γg} is firmly nonexpansive, that is, α_1-
averaged with α_1 = 1/2. Therefore, by Proposition 27, T = prox_{γg} ∘ (Id − γ∇ f ) is
α-averaged with

α = (1/2 + γL/2 − γL/2)/(1 − (1/2)(γL/2)) = 2/(4 − γL). □

Lemma 45 For any x, z ∈ X, y ∈ dom g, and any u ∈ ∂g(x), we have

F(z) ≥ F(x) + ⟨z − x, ∇ f (y) + u⟩ − (L/2)‖x − y‖².

Proof Let x, z ∈ X and let y ∈ dom g. Then, it follows from Fact 1 that

f (y) ≥ f (x) − ⟨x − y, ∇ f (y)⟩ − (L/2)‖x − y‖².

Hence, since f is convex,

f (z) ≥ f (y) + ⟨z − y, ∇ f (y)⟩ ≥ f (x) + ⟨z − x, ∇ f (y)⟩ − (L/2)‖x − y‖². (47)

Now, since u ∈ ∂g(x), g(z) ≥ g(x) + ⟨z − x, u⟩, which summed with inequality
(47) gives the statement. □
Lemma 46 Let (a_k)_{k∈N} be a decreasing sequence in R₊. If ∑_{k=0}^{+∞} a_k < +∞, then

(∀ k ∈ N) a_k ≤ (1/(k + 1)) ∑_{i=0}^{+∞} a_i,  and  a_k = o( 1/(k + 1) ).

Proof Let k ∈ N. Since a_k ≤ a_i, for i = 0, 1, . . . , k, we have ∑_{i=0}^k a_i ≥ (k + 1)a_k,
hence the first part of the statement. As regards the second part, we note that, for every
integer k ≥ 2, we have ∑_{i=⌈k/2⌉}^{+∞} a_i ≥ ∑_{i=⌈k/2⌉}^k a_i ≥ (k + 1 − ⌈k/2⌉)a_k ≥ ((k + 1)/2) a_k.
Therefore, (k + 1)a_k ≤ 2 ∑_{i=⌈k/2⌉}^{+∞} a_i → 0 as k → +∞. □

The following theorem provides full convergence results concerning the proximal
gradient algorithm.

Theorem 47 Let f : X → R be a differentiable convex function with a Lipschitz con-
tinuous gradient with constant L and g ∈ Γ₀(X). Let S∗ be the set of minimizers of
F := f + g and suppose that S∗ ≠ ∅. Let γ < 2/L and let (x_k)_{k∈N} be generated by
Algorithm 1. Then the following statements hold.
(i) ∑_{k∈N} ‖x_{k+1} − x_k‖² ≤ (2/(2 − γL)) dist(x_0, S∗)².
(ii) For every k ∈ N and for every x ∈ X,

‖x_{k+1} − x‖² ≤ ‖x_k − x‖² + 2γ( F(x) − F(x_{k+1}) ) + (γL − 1)‖x_{k+1} − x_k‖².

(iii) For all k ∈ N,

(1/γ − L/2)‖x_{k+1} − x_k‖² ≤ F(x_k) − F(x_{k+1}),

so that the algorithm is descending.
(iv) Let F∗ = inf_{x∈X}( f + g)(x). Then F(x_{k+1}) − F∗ = o(1/(k + 1)) and, for all
k ∈ N,

F(x_{k+1}) − F∗ ≤ (dist(x_0, S∗)²/(k + 1)) × { 1/(2γ) if γ ≤ 1/L;  (L/2)(1/(2 − γL)) if 1/L < γ < 2/L. } (48)

(v) The sequence (x_k)_{k∈N} weakly converges to some x∗ ∈ S∗.

Proof (i): It follows from (25), Proposition 44, and Theorem 30(ii).
(ii): Let x ∈ X and k ∈ N. It follows from (25) that u := (x_k − x_{k+1})/γ −
∇ f (x_k) ∈ ∂g(x_{k+1}), hence

(x_k − x_{k+1})/γ = ∇ f (x_k) + u,  u ∈ ∂g(x_{k+1}).

Thus, by Lemma 45, we have that

F(x) ≥ F(x_{k+1}) + ⟨x − x_{k+1}, ∇ f (x_k) + u⟩ − (L/2)‖x_{k+1} − x_k‖²
= F(x_{k+1}) + (1/γ)⟨x − x_{k+1}, x_k − x_{k+1}⟩ − (L/2)‖x_{k+1} − x_k‖²;

and the identity ‖x_k − x‖² = ‖x_k − x_{k+1}‖² + ‖x_{k+1} − x‖² + 2⟨x_{k+1} − x_k, x − x_{k+1}⟩
yields

F(x) − F(x_{k+1}) ≥ (1/(2γ))( ‖x_k − x_{k+1}‖² + ‖x_{k+1} − x‖² − ‖x_k − x‖² ) − (L/2)‖x_{k+1} − x_k‖²
= (1/(2γ))( (1 − γL)‖x_k − x_{k+1}‖² + ‖x_{k+1} − x‖² − ‖x_k − x‖² ).

Therefore,

‖x_{k+1} − x‖² ≤ ‖x_k − x‖² + 2γ( F(x) − F(x_{k+1}) ) − (1 − γL)‖x_k − x_{k+1}‖²

and the statement follows.
(iii): It follows from (ii) with x = x_k.
(iv): Let x∗ ∈ S∗. Then, it follows from (ii) that, for every k ∈ N,

0 ≤ 2γ( F(x_{k+1}) − F(x∗) ) ≤ ‖x_k − x∗‖² − ‖x_{k+1} − x∗‖² + (γL − 1)₊‖x_k − x_{k+1}‖².

Thus, summing and using (i), we have

2γ ∑_{k=0}^{+∞} ( F(x_{k+1}) − F(x∗) ) ≤ ‖x_0 − x∗‖² + (γL − 1)₊ (2/(2 − γL)) ‖x_0 − x∗‖²
= ‖x_0 − x∗‖² × { 1 if γ ≤ 1/L;  γL/(2 − γL) if 1/L < γ < 2/L. }

Then, since ( F(x_{k+1}) − F(x∗) )_{k∈N} is decreasing and nonnegative, the statement follows
from Lemma 46.
(v): It follows from (25), Proposition 44, Theorem 30(iii), and the fact that S∗ = Fix T. □
Remark 48 It follows from (48) that the best bound is achieved when γ = 1/L.
Remark 49 Suppose that in problem (24) f is the Moreau envelope of a function
h ∈ Γ₀(X) with parameter 1, that is, f = h_1. Then ∇ f (x) = x − prox_h(x), which is
1-Lipschitz continuous, and the proximal gradient Algorithm 1 with stepsize γ = 1
becomes

for k = 0, 1, . . .
⌊ x_{k+1} = prox_g( prox_h(x_k) ), (49)

which is called the backward–backward algorithm. If one takes g = ι_{C_1} and h = ι_{C_2},
for two closed convex sets C_1, C_2 ⊂ X, we have the alternating projection algorithm

for k = 0, 1, . . .
⌊ x_{k+1} = P_{C_1}( P_{C_2}(x_k) ). (50)

Note that Theorem 47 ensures that the sequence (x_k)_{k∈N} weakly converges to a point
in argmin_{x∈C_1} d²_{C_2}(x).
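A minimal sketch of (50), with two illustrative projectors chosen here for concreteness (a closed ball and a hyperplane that intersect; neither is prescribed by the text):

    import numpy as np

    def alternating_projections(P1, P2, x0, n_iter=1000):
        # Iteration (50): x_{k+1} = P_{C1}(P_{C2}(x_k)).
        x = x0
        for _ in range(n_iter):
            x = P1(P2(x))
        return x

    # Example: C1 = unit ball, C2 = hyperplane {x | <a, x> = 1}.
    a = np.array([3.0, 4.0])
    P_ball = lambda x: x if np.linalg.norm(x) <= 1.0 else x / np.linalg.norm(x)
    P_hyp = lambda x: x - ((a @ x - 1.0) / (a @ a)) * a
    x = alternating_projections(P_ball, P_hyp, np.zeros(2))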

3.4 Convergence Analysis Under Strong Convexity Assumptions

In this section, following the notation of the previous section, we set

T_γ = Id − γ∇ f and T = prox_{γg} ∘ T_γ. (51)

We will consider the situation where f and/or g are strongly convex. This will make
the corresponding operators T_γ and/or prox_{γg} contractions.

Proposition 50 Let f : X → R be a differentiable convex function. Suppose that,
for some γ > 0, the operator T_γ = Id − γ∇ f is a contraction. Then f is strongly
convex and its gradient is Lipschitz continuous.

Proof Let x, y ∈ X. Then

‖T_γx − T_γy‖² ≤ q²‖x − y‖²
⟺ ‖x − y − γ(∇ f (x) − ∇ f (y))‖² ≤ q²‖x − y‖²
⟺ ‖x − y‖² + γ²‖∇ f (x) − ∇ f (y)‖² − 2γ⟨∇ f (x) − ∇ f (y), x − y⟩ ≤ q²‖x − y‖²
⟺ (1 − q²)‖x − y‖² + γ²‖∇ f (x) − ∇ f (y)‖² ≤ 2γ⟨∇ f (x) − ∇ f (y), x − y⟩
⟹ ((1 − q²)/(2γ))‖x − y‖² ≤ ⟨∇ f (x) − ∇ f (y), x − y⟩
and (γ/2)‖∇ f (x) − ∇ f (y)‖² ≤ ⟨∇ f (x) − ∇ f (y), x − y⟩.

So, in virtue of Fact 1(iv) and (12), f is strongly convex and ∇ f is Lipschitz contin-
uous. □

Now we assume that f is strongly convex with a Lipschitz continuous gradient.
Then we will prove that there exists an interval of values of γ for which T_γ is a
contraction.

Theorem 51 Let f : X → R be Lipschitz smooth with constant L > 0 and strongly con-
vex with modulus μ > 0. Then, for every γ ∈ ]0, 2/(L + μ)], T_γ = Id − γ∇ f is a
contraction with constant

q_1(γ) := (1 − 2γμL/(L + μ))^{1/2}. (52)

Proof It follows from Fact 2(ii) (multiplied by 2γ) that

(2/(γ(L + μ)))‖γ∇ f (x) − γ∇ f (y)‖² + (2γμL/(L + μ))‖x − y‖² ≤ 2⟨γ∇ f (x) − γ∇ f (y), x − y⟩.

Moreover,

‖(x − y) − γ(∇ f (x) − ∇ f (y))‖² = ‖x − y‖² + ‖γ∇ f (x) − γ∇ f (y)‖²
− 2⟨γ∇ f (x) − γ∇ f (y), x − y⟩.

Hence

‖(x − y) − γ(∇ f (x) − ∇ f (y))‖² ≤ (1 − 2γμL/(L + μ))‖x − y‖²
− (2/(γ(L + μ)) − 1)‖γ∇ f (x) − γ∇ f (y)‖².

Now, since T_γ = Id − γ∇ f, if γ(L + μ)/2 ≤ 1 the inequality above becomes

‖T_γx − T_γy‖ ≤ (1 − 2γμL/(μ + L))^{1/2} ‖x − y‖, (53)

where, for every γ ∈ ]0, 2/(L + μ)],

0 ≤ ((L − μ)/(L + μ))² = 1 − 4μL/(L + μ)² ≤ 1 − 2γμL/(L + μ) < 1.

Therefore, for every γ ∈ ]0, 2/(L + μ)], T_γ is a contraction with the constant given
in (53). □

If we additionally assume that the function f is twice differentiable, the results
can be further improved.

Theorem 52 Let f : X → R be twice differentiable and suppose that f is μ-
strongly convex and that ∇ f is L-Lipschitz continuous. Then, for every γ > 0,
T_γ = Id − γ∇ f is Lipschitz continuous with constant

q̃_1(γ) = max{|1 − γμ|, |1 − γL|} = { 1 − γμ if γ ≤ 2/(L + μ);  γL − 1 if γ ≥ 2/(L + μ). } (54)

So, if γ ∈ ]0, 2/L[, then T_γ is a contraction.

Proof The mapping T_γ is differentiable and T_γ′(x) = Id − γ∇² f (x). By the mean
value theorem, for every q ≥ 0,

(∀ x, y ∈ X) ‖T_γx − T_γy‖ ≤ q‖x − y‖ ⟺ (∀ x ∈ X) ‖T_γ′(x)‖ ≤ q.

Fig. 1 Explanation of the fact that q̃_1(γ) < 1 ⟺ γ < 2/L.

Moreover, ‖T_γ′(x)‖ = sup_{λ∈σ(∇² f (x))} |1 − γλ|. Since f is μ-strongly convex and ∇ f
is L-Lipschitz continuous,

(∀ x ∈ X)(∀ u ∈ X) μ‖u‖² ≤ ⟨∇² f (x)u, u⟩ ≤ L‖u‖².

So σ(∇² f (x)) ⊂ [μ, L] and hence ‖T_γ′(x)‖ ≤ max_{λ∈[μ,L]} |1 − γλ| = q̃_1(γ). This
last equality follows by noting that λ ↦ |1 − γλ| is a convex, piecewise affine function
and hence it achieves its maximum at the endpoints of the interval [μ, L]. It follows
from (54) that q̃_1(γ) < 1 ⟺ γ ∈ ]0, 2/L[ (see Fig. 1). □

Remark 53 The constant q̃_1(γ) given in Theorem 52 is always better than the con-
stant q_1(γ) given in Theorem 51. However, they agree at their minimum value.

Theorem 54 Let g ∈ Γ₀(X) and suppose that g is σ-strongly convex. Then, for
every γ > 0, the operator prox_{γg} is a contraction with constant 1/(1 + γσ).

Proof Let x, y ∈ X and set p_x = prox_{γg}(x) and p_y = prox_{γg}(y). Then, by Fermat's
rule, we have (x − p_x)/γ ∈ ∂g(p_x) and (y − p_y)/γ ∈ ∂g(p_y). Therefore, recalling
Fact 4, we have

g(p_y) ≥ g(p_x) + γ⁻¹⟨p_y − p_x, x − p_x⟩ + (σ/2)‖p_y − p_x‖²
g(p_x) ≥ g(p_y) + γ⁻¹⟨p_x − p_y, y − p_y⟩ + (σ/2)‖p_x − p_y‖²,

and summing, we have 0 ≥ γ⁻¹⟨p_x − p_y, y − x + p_x − p_y⟩ + σ‖p_x − p_y‖² and
hence

⟨p_x − p_y, x − y⟩ ≥ (1 + γσ)‖p_x − p_y‖². (55)

Then, the Cauchy–Schwarz inequality yields ‖p_x − p_y‖² ≤ (1 + γσ)⁻¹‖p_x − p_y‖
‖x − y‖ and the statement follows. □

Now we are ready to provide the convergence theorem for the proximal gradient
algorithm.

Theorem 55 Let f : X → R be Lipschitz smooth with constant L > 0 and with
modulus of strong convexity μ > 0, and let g ∈ Γ₀(X) with modulus of strong convexity
σ ≥ 0. Suppose that γ < 2/L. Let x∗ be the minimizer of F := f + g and let (x_k)_{k∈N}
be generated by Algorithm 1. Then

(∀ k ∈ N) ‖x_k − x∗‖ ≤ q^k ‖x_0 − x∗‖,  q := (1/(1 + γσ)) (1 − 2γμL/(L + μ))^{1/2}. (56)

Moreover, if f is twice differentiable, then

(∀ k ∈ N) ‖x_k − x∗‖ ≤ q̃^k ‖x_0 − x∗‖,  q̃ := { (1 − γμ)/(1 + γσ) if γ ≤ 2/(L + μ);  (γL − 1)/(1 + γσ) if γ ≥ 2/(L + μ). } (57)

Proof The statement follows from Theorems 51, 52, and 54 and the Banach–
Caccioppoli theorem. □

Remark 56
(i) The best value of γ in (57) is achieved for γ = 2/(μ + L).
(ii) When g = 0, one can derive an explicit linear rate also in the function values.
Indeed, in this case, since ∇ f (x∗) = 0, it follows from Fact 1(ii) that f (x) −
f (x∗) ≤ (L/2)‖x − x∗‖².
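A quick numerical illustration of the linear rate (a sketch with synthetic data, not from the text; g = 0 and f (x) = (1/2)‖Ax − y‖², so the iteration is plain gradient descent and (57) applies with σ = 0):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 20))        # tall matrix: A.T @ A is positive definite
    y = rng.standard_normal(200)
    H = A.T @ A
    mu, L = np.linalg.eigvalsh(H)[[0, -1]]    # strong convexity / smoothness constants
    gamma = 2.0 / (mu + L)                    # best stepsize, Remark 56(i)
    q = (L - mu) / (L + mu)                   # contraction factor from (57)
    x_star = np.linalg.solve(H, A.T @ y)

    x = np.zeros(20)
    for k in range(50):
        x = x - gamma * (H @ x - A.T @ y)     # proximal gradient step with g = 0
    err = np.linalg.norm(x - x_star)
    bound = q ** 50 * np.linalg.norm(x_star)  # (56)-(57) with x_0 = 0
    print(err, bound)                          # err stays below bound, up to rounding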

3.5 Convergence Analysis Under Geometric Assumptions

It is possible to show that μ-strongly convex functions satisfy the following condition:

f (x) − inf f ≤ (1/(2μ))‖∂ f (x)‖₋², (58)

where ‖∂ f (x)‖₋ = inf{ ‖u‖ | u ∈ ∂ f (x) }.
This condition is called the Łojasiewicz inequality and can hold even for non-strongly
convex functions; very recently it has been the object of intense research, which
has unveiled its connection with the quadratic growth condition

(∀ x ∈ X) f (x) − inf_X f ≥ (μ/2) dist(x, argmin f )² (59)

and ultimately its critical role in achieving linear convergence in optimization algo-
rithms. In this section, we study the convergence of the proximal gradient algorithm
under Łojasiewicz-type inequalities.
We start with a major (although simple) example showing a function which is
not strongly convex but satisfies the Łojasiewicz inequality and the quadratic growth
condition above.
Example 57 Let A : X → Y be a bounded linear operator with closed range between
two Hilbert spaces, b ∈ Y, and set

f : X → R,  f (x) = (1/2)‖Ax − b‖². (60)

Note that here we do not assume A*A to be positive definite. Let b∗ be the projection
of b onto the range R(A) of A. Then Pythagoras' theorem yields

(∀ x ∈ X) f (x) = (1/2)‖Ax − b∗‖² + (1/2)‖b∗ − b‖².

Thus, f∗ := inf_X f = (1/2)‖b∗ − b‖². Now, let x∗ ∈ S := argmin f = { x ∈ X |
Ax = b∗ }, let x ∈ X, and set x_p = P_S x. We have b∗ = Ax∗ = Ax_p, and hence

f (x) − f∗ = (1/2)‖Ax − b∗‖² = (1/2)‖A(x − x∗)‖² = (1/2)‖A(x − x_p)‖². (61)

Moreover, since S is an affine set with direction N(A), we have x − x_p ∈ N(A)⊥.
Now we introduce the pseudo-inverse of A, which is the bounded linear operator
A† : Y → X satisfying, for every u ∈ N(A)⊥, the equality A†Au = u; hence ‖u‖ ≤
‖A†‖‖Au‖. Therefore, using (61) we have

f (x) − f∗ ≥ (1/2)‖A†‖⁻² ‖x − x_p‖² = (1/2)‖A†‖⁻² dist(x, argmin f )², (62)

so that (59) holds with μ = ‖A†‖⁻². Moreover, ∇ f (x) = A*(Ax − b∗) = A*A(x −
x∗), and hence

‖∇ f (x)‖² = ‖A*A(x − x∗)‖².

Thus, inequality (58) in this case reduces to

(∀ x ∈ X) μ‖A(x − x∗)‖² ≤ ‖A*A(x − x∗)‖²,

which is equivalent to

(∀ y ∈ R(A)) μ‖y‖² ≤ ‖A*y‖². (63)

Again, since (as before) for every y ∈ R(A) = N(A*)⊥ we have ‖y‖ ≤ ‖(A*)†‖‖A*y‖
and (A*)† = (A†)*, we conclude that (63), and hence (58), holds with μ = ‖(A†)*‖⁻² =
‖A†‖⁻².

In the following we generalize condition (58).

Definition 58 Let p ∈ [1, +∞[, let F ∈ 0 (X ) with argmin F = ∅. We say that F


is p-Łojasiewicz on sublevel sets if for every t > inf F there exists a constant ct > 0
such that:
1
∀x ∈ [inf F < F ≤ t], (F(x) − inf F)1− p ≤ ct ∂ F(x)_ ,

where for a given set D, D_ = inf u∈D u. We will refer to this notion as global
if supt>inf F ct < +∞.

Example 59 (Convex piecewise polynomials) A function f : Rd → R is a convex


piecewise polynomial if it is convex, continuous, and Rd can be partitioned in a
finite number of polyhedra P1 , ..., Ps such that for all i ∈ {1, ..., s}, the restriction
of f to Pi is a convex polynomial of degree di ∈ N. The degree of f is defined
as deg( f ) := max{di | i ∈ {1, ..., s}}. Assume deg( f ) > 0. Convex piecewise poly-
nomial functions are p-Łojasiewicz on sublevel sets with p = 1 + (deg( f ) − 1)d .
This result implies that piecewise linear functions (deg( f ) = 1) are 1-Łojasiewicz
on sublevel sets and that convex piecewise quadratic functions (deg( f ) = 2) are
2-Łojasiewicz.

Example 60 (L1 regularized least squares) Let f (x) = αx1 + (1/2)Ax − y2 ,
for some linear operator A : Rd → Rn , y ∈ Rn and α > 0. Then f is convex piece-
wise polynomial of degree 2, thus it is 2-Łojasiewicz on sublevel sets.

Lemma 61 Let (rk )k∈N be a real sequence being strictly positive and satisfying,
α
for some κ > 0, α > 1 and all k ∈ N: rk − rk+1 ≥ κrk+1 . Define κ̃ := min{(α −
α−1
1)κ, (α − 1)κ , r0 , κ r0 }. Then, for all k ∈ N, rk ≤ (κ̃k)−1/(α−1) .
α
1−α 1/α 1−α

The proof can be found in [50, Theorem 3.4].

Theorem 62 Let f : X → R be convex and differentiable with L-Lipschitz continu-


ous gradient and let g ∈ 0 (X ). Set F = f + g and suppose that F has a minimizer
and that is p-Łojasiewicz on sublevel sets, for some p ≥ 1. Let γ < 2/L and (xk )k∈N
be generated by Algorithm 1 with x0 ∈ domF. Then the sequence (xk )k∈N has finite
length in X , meaning that k∈N xk+1 − xk  < +∞, and converges strongly to
some x∗ ∈ argmin F. Moreover, there exists a constant b p with explicit expression
(see equation (71)), such that the following convergence rates hold, depending on
the value of p, and of κ := γ (2 − γ L)[2c2F(x0 ) ]−1 :
Proximal Gradient Methods for Machine Learning and Imaging 181

(i) If p = 1, then xk = x∗ for every k ≥ (F(x0 ) − inf F)/κ.


(ii) If p ∈ ]1, 2[, for all k ∈ N,
  p
F(xk ) − inf F 2( p−1)
F(xk+1 ) − inf F ≤ and xk+1 − x∗  ≤ b p (F(xk ) − inf F)1/2 ,
κ

(iii) If p = 2, for all k ∈ N,

1 (F(x0 ) − inf F)1/2


F(xk+1 ) − inf F ≤ (F(xk ) − inf F) and xk+1 − x∗  ≤ b2 .
1+κ (1 + κ)k/2

(iv) If p ∈ ]2, +∞[, for all k ∈ N,


p
F(xk ) − inf F ≤ c p k − p−2 p−2) − p−2 1
and xk+1 − x∗  ≤ b p c1/(
p k .

Proof We first show that (xk )k∈N has finite length. Since inf F > −∞ then rk :=
F(xk ) − inf F ∈ [0, +∞[, and Theorem 47(iii) yields

1 L
axk+1 − xk 2 ≤ rk − rk+1 , with a = − . (64)
γ 2

By definition of Algorithm 1, we have xk − γ ∇ f (xk+1 ) − xk+1 ∈ γ ∂g(xk+1 ) and


hence
xk − γ ∇ f (xk ) − xk+1 + γ ∇ f (xk+1 ) ∈ ∂γ F(xk+1 ). (65)

This implies, together with the nonexpansiveness of Id − γ ∇ f (see Proposition 43),


that

γ inf u ≤ xk − γ ∇ f (xk ) − (xk+1 − γ ∇ f (xk+1 )) ≤ xk − xk+1 .(66)
u∈∂ F(xk+1 )

If there exists k ∈ N such that rk = 0 then the algorithm would stop after a finite
number of iterations (see (64)), therefore it is not restrictive to assume that rk > 0
for all k ∈ N. Since (F(xk ))k∈N is decreasing by Theorem 47(iii), and x0 ∈ domF,
xk ∈ [inf F < F ≤ F(x0 )] for every k ≥ 1. We set ϕ(t) := pt 1/ p and F0 = F(x0 ),
so that the Łojasiewicz inequality at xk ∈ [inf F < F ≤ F0 ] can be rewritten as

(∀k ∈ N) 1 ≤ c F0 ϕ  (rk )∂ F(xk )_ . (67)

Combining (64), (66), and (67), and using the concavity of ϕ, we obtain for all k ∈ N:
c F0  cF
xk+1 − xk 2 ≤ ϕ (rk )(rk − rk+1 )xk − xk−1  ≤ 0 (ϕ(rk ) − ϕ(rk+1 ))xk − xk−1 .
γa γa

By taking the square root on both sides, and using Young’s inequality, we obtain
182 S. Salzo and S. Villa

c F0
(∀k ∈ N) 2xk+1 − xk  ≤ (ϕ(rk ) − ϕ(rk+1 )) + xk − xk−1 . (68)
γa

Sum this inequality, and reorder the terms to finally obtain


K
c F0
(∀k ≥ 1) xk+1 − xk  ≤ ϕ(r1 ) + x1 − x0 .
k=1
γa

We deduce that (xk )k∈N has finite length and therefore converges strongly to some
x∗ . Moreover, from (66) and the strong closedness of ∂ f : X ⇒ X , we conclude that
0 ∈ ∂ f (x∗ ). We next show a preliminary inequality which will be useful to prove the
rates for (xk − x∗ )k∈N . Let K ∈ N and 1 ≤ k ≤ K , recall that ϕ(t) = pt 1/ p , and
sum the inequality in (68) between k and K to obtain


K
pc F0 1/ p
x K − xk  ≤ xn+1 − xn  ≤ r + xk − xk−1 .
n=k
aγ k

Passing to the limit for K → ∞, using (64), and the fact that rk is decreasing, we
derive
pc F0 1/ p 1 1/2
(∀k ≥ 1) x∗ − xk  ≤ r + √ rk−1 . (69)
aγ k−1 a

Next we prove the convergence rates. We first derive rates for the sequence of values
rk , from which we will derive the rates for the iterates thanks to (69). Equations (64)
and (66) and the Łojasiewicz inequality at xk+1 ∈ [inf F < F ≤ F0 ] yield

rk − rk+1 ≥ axk+1 − xk 2 ≥ aγ 2 ∂ F(xk+1 )2_ ≥ aγ 2 c−2


2−2/ p
F0 r k+1 ,

which we write more compactly as


α
(∀k ∈ N) rk − rk+1 ≥ κrk+1 , with α = 2( p − 1) p −1 and κ := aγ 2 c−2
F0 . (70)

The rates for the values are derived from the analysis of the sequences satisfying the
inequality in (70), which is recalled in Lemma 61. Depending on the value of p, we
obtain different rates.
(i): Since p = 1, we deduce from (70) that for all k ∈ N rk+1 ≤ rk − κ. Since the
sequence (rk )k∈N is decreasing and positive, this implies k ≤ r0 κ −1 .
(ii): Since p ∈ ]1, 2[we have α ∈]0, 1[. Thus, the positivity of rk+1 and (70) imply
α
and hence rk+1 ≤ κ −1/α rk , meaning that rk converges
1/α
that for all k ∈ N, rk ≥ κrk+1
1/ p 1/ p−1/2 1/2 1/ p−1/2 1/2
Q-superlinearly to zero. In addition, we have rk−1 = rk−1 rk−1 ≤ r0 r and
1/2 1/ p−1/2 √ k−1
(69) implies xk − x∗  ≤ b p rk−1 , with b p = pc F0 r0 /(aγ ) + (1/ a).
(iii): If p = 2, then α = 1 and (70) yields that for all k ∈ N, rk+1 ≤ (1 + κ)−1rk ,
so that rk ≤ (1 + κ)−k r0 . Moreover, from (69) we derive that,
1/2
(∀k ≥ 1) x∗ − xk  ≤ b2 rk−1 .
Proximal Gradient Methods for Machine Learning and Imaging 183


where b2 = 2c F0 /aγ + 1/ a.
(iv): If p ∈ ]2, +∞[, then α ∈ ]1, 2[ and (70) and Lemma 61 imply that rk+1 ≤
c p (k + 1)− p/( p−2) , where
" #− p−2
p " # p $
κ( p − 2) p − 2 − p−2 − 2 p−2
p p2
− 2( p−1)(
c p = min , κ , r0 , κ p−2) r
0 . (71)
p p

1
− 1 1/ p
Note that rk−1 ≤ r02 p rk−1 , and therefore, defining b p = pc F0 /γ + (r0 )1/2 a −1/2
1/2

−1/ p 1/ p
r0 , we derive from (69) that xk − x∗  ≤ b p rk−1 for every k ≥ 1. 

Remark 63 Note that the rates range from the finite termination, for p = 1, to the
worst-case rates presented in Theorem 47, when p tends to +∞. The bigger is p,
the more the rates for the objective function values become closer to o(k −1 ), and the
rates of its iterates become arbitrarily slow.

3.6 Accelerations

Proximal gradient methods are very simple and have a very low cost per iteration, but
often they converge slowly, both in practice and in theory (see Theorem 47). In this
section, we consider the class of accelerated proximal gradient algorithms, which
are only slightly more complicated than the basic proximal gradient methods, but
have an improved convergence rate. While in the proximal gradient method, only the
information obtained in the previous step is used to build the next iterate, accelerated
methods are multistep methods, namely they take into account previous iterates to
improve the convergence. The most popular accelerated multistep method is due to
Nesterov and is also known as Fast Iterative Soft Thresholding Algorithm (FISTA).
We consider the same setting of the previous sections.

Algorithm 2 (Accelerated proximal gradient method) Let 0 < γ ≤ 1/L and let
(tk )k∈N ∈ RN be such that t0 = 1, tk ≥ 1, and for every integer k ≥ 1, tk2 − tk ≤ tk−1
2
.
Let x0 = y0 ∈ X and define

⎢ k = 0, 1, . . .
for
⎢ xk+1 = proxγ g (yk − γ ∇ f (yk ))



⎢ βk+1 = tk − 1
(72)
⎣ tk+1
yk+1 = xk+1 + βk+1 (xk+1 − xk ).
184 S. Salzo and S. Villa

3.6.1 Dynamical Systems Interpretation

One of the crucial observations that lead to a whole stream of literature and allowed
to give a physical interpretation of this kind of algorithms is the link of accelerated
algorithms with the trajectories of a second-order continuous dynamical system.
Let us consider a heavy ball of mass m in the potential field ∇ f + ∂g under the
force of friction, or “viscosity" controlled by a function p(t) > 0. The motion x(t)
of the heavy ball is described by the following second-order differential inclusion:

m ẍ ∈ −∇ f (x(t)) − ∂g(x(t)) − p(t)ẋ(t) (73)

Intuitively, ignoring existence issues, the heavy ball reaches the minimizer of f + g
for t → +∞, due to the loss of energy caused by the friction. In addition, the friction
avoids the zig-zagging effect, which is one of the causes that slows down gradient
type methods. We consider a scenario where the viscosity coefficient is of the form
p(t) = α/t which turned out to be crucial in the achievement of accelerated rates:
α
0 ∈ ẍ + ẋ(t) + ∇ f (x(t)) + ∂g(x(t)). (74)
t
We next show that Algorithm 2 can be seen as a discretization of (74). To this aim,
we discretize implicitly with respect to the nonsmooth function g and explicitly with
respect to the smooth one f . Let h > 0 be a fixed time step, and set tk = (τ0 + k)h,
xk = x(tk ). The suggested implicit/explicit discretization strategy reads as

1 α
(xk+1 − 2xk + xk−1 ) + (xk − xk−1 ) + ∂g(xk+1 ) + ∇ f (yk )  0,
h2 (τ0 + k)h 2

where yk will be suitably chosen as a linear combination of xk and xk−1 . Rearranging


the terms in 3.6.1 we derive
 
α
xk+1 + h 2 ∂g(xk+1 )  xk + 1 − (xk − xk−1 ) − h 2 ∇ f (yk ).
τ0 + k

A choice of yk classically made in the literature is


 
α
yk = xk + 1 − (xk − xk−1 ).
τ0 + k

Recalling the definition of proximal operator, and setting γ = h 2 we can rewrite


3.6.1 as   
yk = xk + 1 − τ0α+k (xk − xk−1 )
xk+1 = proxγ g (yk − γ ∇ f (yk )),

which is an instance of Algorithm 2 for a specific choice of parameters tk ’s (see next


section).
Proximal Gradient Methods for Machine Learning and Imaging 185

3.6.2 Convergence Analysis

We start with few results concerning the sequence of the parameters tk ’s.
Proposition 64 Suppose that t0 = 1 and for every integer k ≥ 1

tk ≥ 0 and tk2 − tk − tk−1


2
= −b − ctk (75)

for some c ∈ [0, 1 [ and b ∈ [0, 1 − c]. Then condition (75) is equivalent to
'
1−c 1 − c 2
tk = + + tk−1
2
− b. (76)
2 2

Moreover, the following hold.

√ k ≥ 1, 1 ≤ tk−1 ≤ tk ≤ 1 − c + tk−1 .
(i) For every integer
(ii) Suppose that 2 b ≤ 1 − c. Then, for every integer k ≥ 1, (1 − c)/2 + tk−1 ≤ tk .
Hence k(1 − c)/2 ≤ tk − 1 ≤ k(1 − c).
Proof The discriminant of the quadratic equation in (75) (in the unknown tk ) is k =
(1 − c)2 + 4(tk−1
2
− b). Then it is clear that if tk−1 ≥ 1, then k > 0, the positive
(
solution of (75) is (76) and tk ≥ (1 − c)/2 + (1 − c)2 /4 + 1 − b ≥ 1, since b ≤
1 − c. Vice versa, if tk−1 ≥ 1, then (76) ⇒ (75). In the end, if tk−1 ≥ 1, then (76) and
(75) are equivalent and in such case tk ≥ 1. So, the first part of the statement follows
by an induction argument since t0 = 1.
(i): We derive from (75) that tk2 − tk−12
= −b + (1 − c)tk ≥ −b + 1 − c ≥ 0,
hence tk−1 ≤ tk . Moreover, it follows from (76) that
     2
1−c 2 1−c 2 1−c
tk − ≤ + tk−1
2
≤ + tk−1 . (77)
2 2 2

Thus, tk − (1 − c)/2 ≤ (1 − c)/2 + tk−1 and hence tk ≤ 1 − c + tk−1 . The state-


ment follows. √
) 1 − c ≥ 2 b. Then (1 − c) /4 − b ≥ 0 and hence, we have
2
(ii): Suppose that
tk = (1 − c)/2 + (1 − c)2 /4 − b + tk−1
2
≥ (1 − c)/2 + tk−1 and the first part of
the statement follows. Next, summing the inequalities (1 − c)/2 ≤ ti − ti−1 ≤ (1 −
c) from i = 1 to i = k, we have k(1 − c)/2 ≤ tk − 1 ≤ k(1 − c). 
Remark 65 The following are two special cases of (75).
'
1 1 k+a
tk = + + tk−1
2
and tk = (with) a ≥ 2, (78)
2 4 a

which are obtained from (75) with (b, c) = (0, √ 0) and (b, c) = (1/a , (a − 2)/a)
2

respectively. Note that in both cases 1 − c ≥ 2 b (and in the first case, in virtue of
Proposition 64(ii), we have tk ≥ (k + 2)/2).
186 S. Salzo and S. Villa

Remark 66 Suppose that the tk ’s satisfy (75) with 2 b ≤ 1 − c. Then, since tk2 −
(1 − c)tk ≤ tk−1
2
, we have, for k ≥ 2,

tk2 tk2 1−c 1−c


≤ =1+ ≤1+ ≤ 2 − c, (79)
2
tk−1 tk (tk − (1 − c)) tk − (1 − c) tk−2

where in the second last inequality we used that tk ≥ (1 − c)/2 + tk−1 ≥ 1 − c +


tk−2 . Note that, in view of Proposition 64(i), t1 ≤ 2 − c. Therefore, since 2 − c > 1,
we have
(∀ k ∈ N, k ≥ 1) tk ≤ (2 − c)tk−1 . (80)

Lemma 67 Let y ∈ X and set x = proxγ g (y − γ ∇ f (y)), with γ ≤ 1/L. Then

x − z2 z − y2
(∀ z ∈ X ) F(x) + ≤ F(z) + .
2γ 2γ

Proof It follows from the definition of the proximity operator that


 1 
x = argminz∈X γ g(z) + y − z − γ ∇ f (y)2
2
 1 
= argminz∈X g(z) + y − z2 + z − y, ∇ f (y) .

Therefore, since z → g(z) + 2γ1 y − z2 + z − y, ∇ f (y) is γ −1 -strongly convex


and x is its minimizer, it follows from (10) that

1 1
z − x2 ≤ g(z) + y − z2 + z − y, ∇ f (y)
2γ 2γ
 1 
− g(x) + y − x2 + x − y, ∇ f (y)

hence
1 1
g(x) + y − x2 + x − y, ∇ f (y) + z − x2
2γ 2γ
* +, -
(a)
1
≤ g(z) + y − z2 + z − y, ∇ f (y) .

Now, since f is L-Lipschitz continuous and γ ≤ 1/L, it follows from Theorem 1


that
L 1
f (x) − f (y) ≤ x − y, ∇ f (y) + x − y2 ≤ x − y, ∇ f (y) + x − y2 .
2 2γ
* +, -
(a)
Proximal Gradient Methods for Machine Learning and Imaging 187

Therefore,

1 1
f (x) + g(x) + z − x2 ≤ f (y) + g(z) + y − z2 + z − y, ∇ f (y)
2γ 2γ
1
≤ f (z) + g(z) + y − z2 ,

where in the last inequality we used that f (y) + z − y, ∇ f (y) ≤ f (z), due to the
convexity of f . 

We now present the first of the two results of the section, which concerns the
convergence in value for Algorithm 2. Next, we will address the convergence of the
iterates under slightly stronger assumptions on the sequence of parameters tk ’s.

Theorem 68 Let f : X → R be convex and differentiable with L-Lipschitz continu-


ous gradient and let g ∈ 0 (X ). Set F = f + g and suppose that F has a minimizer.
Define (xk )k∈N and (tk )k∈N according to Algorithm 2. Then

dist(x0 , argmin F)2


(∀ k ∈ N, k ≥ 1) F(xk ) − min F ≤ 2
.
2γ tk−1

Moreover,
√ if the parameters tk ’s are defined according to Proposition 64 with 1 − c ≥
2 b, then F(xk ) − min F = O(1/k 2 ).

Proof It follows from the definition of yk+1 in Algorithm 2 that, for every k ∈ N,
 
1 1
yk+1 = 1 − xk+1 + xk + tk (xk+1 − xk )
tk+1 tk+1 * +, -
vk+1

Therefore, for every k ∈ N,


 
1 1
yk = 1 − xk + vk (v0 := y0 ) (81)
tk tk

Moreover, it follows from the definition of vk+1 that vk+1 − xk = tk (xk+1 − xk ) and
hence  
1 1 1
xk+1 = xk + (vk+1 − xk ) = 1 − xk + vk+1 . (82)
tk tk tk

Also, by Lemma 67, with y = yk and x = xk+1 , we have

xk+1 − z2 z − yk 2
(∀ z ∈ X ) F(xk+1 ) + ≤ F(z) + . (83)
2γ 2γ
188 S. Salzo and S. Villa

Now, let x∗ ∈ argmin F and set


 
1 1
z = 1− xk + x∗ .
tk tk

Then, we derive from (81) and (151) that

1 1
xk+1 − z = (vk+1 − x∗ ) and yk − z = (vk − x∗ ).
tk tk

Therefore, it follows from (83) and the convexity of F (considering that z is a convex
combination of xk and x∗ ) that

vk+1 − x∗ 2 vk − x∗ 2
F(xk+1 ) + 2
≤ F(z) +
2γ tk 2γ tk2
 
1 1 vk − x∗ 2
≤ 1− F(xk ) + F(x∗ ) + .
tk tk 2γ tk2

Summing −F(x∗ ) to both terms of the above inequality and setting rk = F(xk ) −
F(x∗ ), we get
 
vk+1 − x∗ 2 1 vk − x∗ 2
rk+1 + ≤ 1 − r k +
2γ tk2 tk 2γ tk2

and hence, multiplying by tk2

vk+1 − x∗ 2 vk − x∗ 2
tk2 rk+1 + ≤ tk (tk − 1)rk + . (84)
2γ 2γ

Now we set, for every integer k ≥ 1, Ek = tk−1


2
rk + vk − x∗ 2 /(2γ ). Then, by using
tk − tk − tk−1 ≤ −ctk , we have
2 2

vk − x∗ 2
(∀ k ∈ N, k ≥ 1) Ek+1 ≤ tk (tk − 1)rk + ≤ −ctk rk + Ek . (85)

Therefore, Ek is decreasing and hence, using (84) with k = 0, we have, for all k ≥ 1

v1 − x∗ 2 v0 − x∗ 2 x0 − x∗ 2


2
tk−1 r k ≤ Ek ≤ E1 = r 1 + ≤ = .
2γ 2γ 2γ

Since x∗ is an arbitrary element of argmin F, the first part of the statement follows.
The second part of the statement follows from Proposition 64(ii) and the fact that, for
every integer k ≥ 1, 2tk−1 ≥ 2 + (k − 1)(1 − c) = k(1 − c) + 1 + c ≥ k(1 − c). 
Proximal Gradient Methods for Machine Learning and Imaging 189

Remark 69 The quantity Ek introduced in the proof of Theorem 68 can be seen as


a discretization of a Lyapunov function of the continuous dynamical system (74).

We now start the analysis of the convergence of the iterates.

Proposition 70 Under the assumptions of Theorem 68 suppose additionally that for


every integer k ∈ N,
tk ≥ 1 and tk2 − tk − tk−1
2
≤ −ctk (86)

for some c ∈ ]0, 1[. Then the following hold.



(i) ∞ k=0 tk (F(x k ) − inf F) < +∞.

(ii) k=1 tk x k+1 − x k  < +∞.
2

Proof Let rk and Ek be defined as in the proof of Theorem 68. It follows from (85)
that, for every integer k ≥ 1,

ctk rk ≤ Ek − Ek+1 . (87)



Hence c ∞ k=1 tk r k ≤ E1 ≤ x 0 − x ∗  /(2γ ). Concerning the second statement, it
2

follows from (83) with z = xk , that

xk+1 − xk 2 xk − yk 2
F(xk+1 ) + ≤ F(xk ) + . (88)
2γ 2γ

Subtracting − inf F and recalling the definition of yk in Algorithm 2, we get

xk+1 − xk 2 (tk−1 − 1)2 xk − xk−1 2


rk+1 + ≤ rk + , (89)
2γ tk2 2γ

which, multiplied by tk2 yields

1 2 
tk xk+1 − xk 2 − (tk−1 − 1)2 xk − xk−1 2 ≤ tk2 (rk − rk−1 ).

Since (tk−1 − 1)2 = tk−1


2
+ 1 − 2tk−1 , we have

1 2 
tk xk+1 − xk 2 − tk−1
2
xk − xk−1 2 + (2tk−1 − 1)xk − xk−1 2

≤ tk−1
2
rk − tk2 rk+1 + (tk2 − tk−1
2
)rk . (90)

Summing the above inequality from k = 1 to k = K , and recalling that t0 = 1, we


have
190 S. Salzo and S. Villa

1 2  
K
t K x K +1 − x K 2 + (2tk−1 − 1)xk − xk−1 2
2γ k=2


K
≤ r1 − t K2 r K +1 + (tk2 − tk−1
2
)rk
k=1


K
≤ r1 + (1 − c) tk r k ,
k=1

where we used the fact that, by 70, we have tk2 − tk−1


2
≤ (1 − c)tk . Therefore, since
tk−1 ≤ 2tk−1 − 1 (being tk−1 ≥ 1),


K 
K  
K 
tk−1 xk − xk−1 2 ≤ (2tk−1 − 1)xk − xk−1 2 ≤ 2γ r1 + (1 − c) tk r k
k=2 k=2 k=1
(91)
and the statement follows from (i). 

We need two additional results concerning the convergence of numerical


sequences.

Lemma 71 Let (ak )k∈N , (εk )k∈N be sequences in R+ such that k∈N εk < +∞ and

(∀ k ∈ N) ak+1 ≤ ak + εk . (92)

Then (ak )k∈N is convergent.


+∞
+∞ Define u k =a+∞
Proof k + i=k εi . Then it follows from (92) that u k+1 = ak+1 +
i=k+1 εi ≤ ak + i=k εi = u k , so that (u ) is decreasing and hence convergent.
+∞k k∈N
Then, by definition of u k , ak = u k − i=k εi and hence (ak )k∈N is convergent too.


Lemma 72 Suppose√ that the sequence of parameters tk ’s satisfy equation (75) in


Proposition 64 with 2 b ≤ 1 − c. Let (ak )k≥1 and (bk )k≥1 be two positive sequences
such that
tk−1 − 1
(∀ k ∈ N, k ≥ 1) ak+1 ≤ ak + bk , (93)
tk

If (tk bk )k∈≥1 is summable, then (ak )k∈≥1 is summable.

Proof Let k ∈ N with k ≥ 1. Multiplying equation (93) by tk2 and using the relation
tk2 − tk ≤ tk−1
2
and the fact that tk−1 ≤ tk , we have

tk2 ak+1 ≤ tk (tk−1 − 1)ak + tk2 bk ≤ tk (tk − 1)ak + tk2 bk ≤ tk−1


2
ak + tk2 bk . (94)
Proximal Gradient Methods for Machine Learning and Imaging 191

Hence

k−1 
k−1
2
tk−1 ak − a1 = (ti2 ai+1 − ti−1
2
ai ) ≤ ti2 bi . (95)
i=1 i=1

2
Then, dividing by tk−1 , we obtain

1 
k−1
a1
ak ≤ 2
+ 2
ti2 bi (96)
tk−1 tk−1 i=1

and hence


k k
a1 k j−1
1 2
aj ≤ 2
+ 2
ti bi
j=1
t
j=1 j−1
t
j=1 i=1 j−1

k
a1 
k−1  k
1 2
= 2
+ 2
ti bi . (97)
t
j=1 j−1
t
i=1 j=i+1 j−1


Now we analyze the term kj=i+1 1/t 2j−1 . Let j ∈ N with j ≥ 2. Since, by assump-
tion, t j (t j − (1 − c)) ≤ t 2j−1 and t j ≥ (1 − c)/2 + t j−1 ≥ (1 − c) + t j−2 , we have

1 1

t 2j−1 t j (t j − (1 − c))
 
1 1 1
= −
1 − c t j − (1 − c) t j
 
1 1 1 1 1
≤ − + − .
1 − c t j−2 t j−1 t j−1 tj

Hence, for i ≥ 1 and k ≥ 2,


k "  k   k  #
1 1 1 1 1 1
≤ − + −
t2
j=i+1 j−1
1−c j=i+1
t j−2 t j−1 j=i+1
t j−1 tj
 
1 1 1 1 1
= − + −
1 − c ti−1 tk−1 ti tk
3−c 1
≤ ,
1 − c ti

where in the last inequality we used that t j ≤ (2 − c)t j−1 (see Remark 66). In the
end, it follows from (97) that
192 S. Salzo and S. Villa


k k
a1 3−c 
k−1
aj ≤ + ti bi
j=1
t2
j=1 j−1
1 − c i=1

3−c 
k−1
a1 (3 − c) 1
≤ a1 + + ti bi .
1 − c t1 1 − c i=1

The statement follows. 


We are finally ready for the second main result of this section which addresses
the convergence of the iterates of Algorithm 2.
Theorem 73 Under the assumptions of Theorem 68, suppose additionally that the
parameters tk ’s satisfy the equation (75) with c > 0. Then xk  x∗ for some x∗ ∈
argmin F.
Proof We invoke Opial’s Lemma 29. We first prove that weak cluster points of
(xk )k∈N belong to argmin F. We note that Theorem 68 yields that F(xk ) → inf F.
Let (xkn )n∈N be a weakly convergent subsequence with xkn  x∗ . Since F is weakly
lower semicontinuous, we have F(x∗ ) ≤ lim inf n F(xkn ) = limk F(xk ) = inf F and
hence x∗ ∈ argmin F. We now prove that for every x∗ ∈ argmin F, the sequence
(xk − x∗ )k∈N is convergent. Let x∗ ∈ argmin F and set h k = xk − x∗ 2 /2 and
δk = (1/2)xk − xk+1 2 . Then, since xk − x∗ 2 = xk − xk+1 2 + xk+1 − x∗ 2 +
2 xk − xk+1 , xk+1 − x∗  and yk − xk = βk (xk − xk−1 ), we have

h k − h k+1 = δk + xk − xk+1 , xk+1 − x∗  (98)


= δk − βk xk − xk−1 , xk+1 − x∗  + yk − xk+1 , xk+1 − x∗  . (99)

Now we note that, by definition of xk+1 and the fact that x∗ ∈ argmin F, we have

yk − xk+1 − γ ∇ f (yk ) ∈ ∂γ g(xk+1 ) and − γ ∇ f (x∗ ) ∈ ∂γ g(x∗ ).

Hence, using the monotonicity of ∂g (see Sect. 2.4), we have

xk+1 − x∗ , yk − xk+1 − γ ∇ f (yk ) + γ ∇ f (x∗ ) ≥ 0

which yields, in virtue of Fact 1(iv), that

xk+1 − x∗ , yk − xk+1  ≥ γ xk+1 − x∗ , ∇ f (yk ) − ∇ f (x∗ )


= γ yk − x∗ , ∇ f (yk ) − ∇ f (x∗ ) + γ xk+1 − yk , ∇ f (yk ) − ∇ f (x∗ )
γ
≥ ∇ f (yk ) − ∇ f (x∗ )2 − γ xk+1 − yk ∇ f (yk ) − ∇ f (x∗ )
L
γL
≥− xk+1 − yk 2 ,
4

where in the last inequality we minorized the function α → (1/L)α 2 − xk+1 − yk α


with −xk+1 − yk 2 L/4. Hence it follows from (99) that
Proximal Gradient Methods for Machine Learning and Imaging 193

γL
h k − h k+1 ≥ δk − βk xk − xk−1 , xk+1 − x∗  − xk+1 − yk 2 . (100)
4
Now, (98), written for k − 1, yields h k−1 − h k = δk−1 + xk−1 − xk , xk − x∗  and
hence, we have
. / γL
h k+1 − h k − βk (h k − h k−1 ) ≤ −δk + βk xk − xk−1 , xk+1 − x∗ + xk+1 − yk 2
. / 4
+ βk δk−1 − βk xk − xk−1 , xk − x∗
γL
= −δk + xk+1 − yk 2
4 . /
+ βk δk−1 + βk xk − xk−1 , xk+1 − xk ).

Now, using the definition of yk , we have

1 1
xk+1 − yk 2 = xk+1 − xk − βk (xk − xk−1 )2
2 2
1 β2
= xk+1 − xk 2 + k xk − xk−1 2 − βk xk+1 − xk , xk − xk−1 
2 2
= δk + βk2 δk−1 − βk xk+1 − xk , xk − xk−1  .

Therefore,
 
1 γL
h k+1 − h k − βk (h k − h k−1 ) ≤ −1− xk+1 − yk 2 + (βk + βk2 )δk−1 .
2 2
(101)
Since γ L < 2 and βk + βk2 ≤ 2 we finally have

h k+1 − h k ≤ βk (h k − h k−1 ) + 2δk−1 , (102)

which yields
(h k+1 − h k )+ ≤ βk (h k − h k−1 )+ + 2δk−1 . (103)

Since tk δk−1 ≤ (2 − c)tk−1 δk−1 and tk−1 δk−1 is summable in virtue of Proposi-
tion 70(ii), Lemma 72 yields that ((h k+1 − h k )+ )k∈N is summable. Finally, since

h k+1 ≤ h k + (h k+1 − h k )+ (104)

and h k is positive, the statement follows from Lemma 71. 

Remark 74 In order to have convergence of the iterates in Algorithm 2, possible


choices of the parameters tk ’s are (76) with c > 0 and b = 0 (which looks as a
perturbed version of the classical choice given in the first of (78)) and, recalling
Remark 18, the second in (78) with a > 2.
194 S. Salzo and S. Villa

3.7 Bibliographical Notes

Section 3.1. Fixed-point iterations, also known as the method of successive approxi-
mations, was developed by Picard, starting from ideas by Cauchy and Liouville. For
the case of Banach spaces, Theorem 19 was first formulated and proved by Banach
in his famous dissertation from 1922. Later and independently it was rediscovered
by Caccioppoli in 1931. Since then, numerous generalizations or extensions have
been obtained which deal with more general classes of operators and iterations.
Krasnosel’skiı̆–Mann iteration, as presented in (33), were first studied in [63] with
λ = 1/2. For general λ ∈ ]0, 1[, these mappings have been studied by Schaefer [104],
Browder and Petryshyn [25, 26], and Opial [85]. Mann in [70] considered the more
general case of this iteration where λ may vary. Later this case was also studied in
[43, 54]. The concept of averaged operator was introduced in [9]. Later, the proper-
ties of compositions and convex combinations of averaged nonexpansive operators
(Proposition 27) have been applied to the design of new fixed-point algorithms in
[38].
Section 3.2. The proximity operator was introduced by Moreau in 1962 [74] and
further investigated in [75, 76] as a generalization of the notion of a convex projection
operator. Later was considered within the proximal point algorithm in [97]. Since
then, it appears in most of the splitting algorithms used in practice [34].
Sections 3.3–3.4. The proximal gradient algorithm finds its roots in the projected
gradient method [53, 64] and was originally devised in [72] in the more general
context of monotone operators. Weak convergence of the iterates were proved in [51,
72]. An error tolerant version, with variable stepsize is presented in [39], whereas
worst-case rate of convergence in values was studied in [12, 24]. The proximal
gradient algorithm is also a generalization of the iterative soft thresholding algorithm,
first proposed in [41].
Section 3.5. The idea of imposing geometric conditions on the function to be
optimized to derive improved convergence rates of first-order methods is old, and
was already used in [27, 91, 97]. A systematic study of the class of functions sat-
isfying favorable geometric conditions is more recent and is the result of a series
of papers, among which we mention [14, 16, 17]. The fact that convex piecewise
polynomial functions are p-Łojasiewicz on sublevel sets is due to [66, Corollary
3.6], in agreement with [27, Corollary 3.6], for the special case of piecewise linear
convex functions and with [65, Theorem 2.7] for convex piecewise quadratic func-
tions. The fact that the lasso problem is 2-Łojasiewicz has been observed in [17,
Sect. 3.2.1]. Kurdyka–Łojiasiewicz inequality is a powerful tool to analyze conver-
gence of first-order splitting algorithms as shown in a whole line of work [3–5, 17,
18, 50, 69] ranging from the analysis of the proximal point algorithm to a whole
class of descent gradient based techniques. These results had an impressive impact
on the machine learning community, see e.g., [60]. Theorem 62 is a special case of
[52, Theorem 4.1].
Section 3.6. The idea of adding an inertial term in 74 to mitigate zig-zagging was
due to Polyak, and gave raise to the heavy ball method [93] (see also [1]), which
Proximal Gradient Methods for Machine Learning and Imaging 195

is optimal in the sense of Nemirovski and Yudin [81] for the class of convex twice
continuously differentiable functions. A simple, but not very intuitive, modification
of Polyak’s method was due to Nesterov [83], and is the famous accelerated gradient
method for convex smooth objective functions [82, 83]. The acceleration technique
has been first extended to the proximal point algorithm by Güler [55] and finally
extended to the composite optimization problems in [12]. Various modifications
of these accelerated algorithms are nowadays the methods of choice to optimize
objective functions in a large scale scenario, even in a nonconvex setting: despite
convergence issues, the ADAM algorithm is probably the most used in the deep
learning context [62]. The first papers studying accelerated algorithms were focused
on convergence of the objective function values. Convergence of the iterates has been
established much more recently, starting from the paper by Chambolle and Dossal
[29] and further devoloped later. Only many years later its introduction, Nesterov
accelerated method has been shown to be a specific discretization of the heavy ball
system introduced by Polyak with a vanishing inertial coefficient [111], and this key
observation started a very active research activity on the subject (see [7], [6] and
references therein).

4 Stochastic Minimization Algorithms

In this section, we analyze stochastic versions of the algorithms previously presented.


We will consider problems of type

minimize f (x) + g(x), (105)


x∈X

where f : X → R is a convex function and g : X → ] − ∞, +∞] is a proper con-


vex and lower semicontinuous function, and depending on the hypotheses only a
stochastic subgradient/gradient of f will be available. One of the main examples for
such situation is when f is given in the form of an expectation, that is,

f (x) = E[ϕ(x, ζ )], (106)

which corresponds to the setting of stochastic optimization. In this case, a stochastic


subgradient/gradient of f is obtained through a subgradient/gradient of ϕ(x, ζ ).
Finally, in general we will assume that the proximity operator of g is given explicitly.
However, in the last section we will consider a situation in which the proximity
operator of g is actually given through a stochastic oracle.
We start by recalling few facts on conditional expectation.
Fact 75 The following hold.
(i) Let ζ be a random variable with value in the measurable space Z. Then the
operator E[· | ζ ] : L 1 → L 1 is linear and monotone increasing.
196 S. Salzo and S. Villa

(ii) Let ξ be a real-valued summable random variable and ζ be a random variable


with value in a measurable space Z. Then, E[E[ξ, | ζ ]] = E[ξ ].
(iii) Let ζ be a random variable with value in the measurable space Z and let
ϕ : Z → R be a measurable real function such that E[|ϕ(ζ )|] < +∞. Then
E[ϕ(ζ ) | ζ ] = ϕ(ζ ).
(iv) Let X be a separable Hilbert space and let ζ1 and ζ2 be two X -valued random
vectors such that E[|ζ1 , ζ2 |] < +∞ and E[ζ2 ] < +∞. Then
E[ζ1 , ζ2  | ζ1 ] = ζ1 , E[ζ2 | ζ1 ].
(v) Let ζ1 and ζ2 be two independent random variables with values in the mea-
surable spaces Z1 and Z2 respectively. Let ϕ : Z1 × Z2 → R be measurable
and such that E[|ϕ(ζ1 , ζ2 )|] < +∞. Then E[ϕ(ζ1 , ζ2 ) | ζ1 ] = ψ(ζ1 ), where, for
every z 1 ∈ Z1 , ψ(z 1 ) = E[ϕ(z 1 , ζ2 )].

4.1 The Stochastic Subgradient Method

Here we take g in (105) as an indicator function of a closed convex set. Thus, we


assume that C ⊂ X is a nonempty closed and convex set and f : X → R is a convex
function and we want to solve the following problem

minimize f (x), (107)


x∈C

where the projection onto C can be computed explicitly but, only a stochastic sub-
gradient of f is available. The algorithm is detailed below.

Algorithm 3 (The stochastic subgradient projection method) Let x0 ∈ X and


(γk )k∈N be a sequence in R++ . Then,

0 k = 0, 1, . . .
for
û k is a summable X -valued random vector s.t. E[û k | xk ] ∈ ∂ f (xk ), (108)
xk+1 = PC (xk − γk û k ).

Moreover, define, for every k ∈ N,



k −1 
k
f k = min E[ f (xi )], x̄k = γi γi xi .
0≤i≤k
i=0 i=0

Remark 76 In addition
k to the sequence xk , Algorithm 3 requires keeping track of the
sequences k := i=0 γi and x̄k , which can updated recursively, as k+1 = k + γk
−1
and x̄k+1 = k+1 (k x̄k + γk+1 xk+1 ).

The following theorem gives the main convergence results about the algorithm.
Proximal Gradient Methods for Machine Learning and Imaging 197

Theorem 77 Let C ⊂ X be a nonempty closed convex set and let f : X → R be


convex. Let (xk )k∈N , ( f k )k∈N , and (x̄k )k∈N be the sequences generated by Algorithm
3. We make the following additional assumption
A1 There exists B ≥ 0, such that, for every k ∈ N, E[û k 2 ] ≤ B 2 .
Then, for every k ∈ N, xk is square summable in norm and f (xk ) is summable and
the following statements hold.

(i) Suppose that γk → 0 and k∈N γk = +∞. Then lim inf k E[ f (x k )] =
limk f k = inf C f .
(ii) Let x ∈ C and let m, k ∈ N with m ≤ k. Then
k

k
γj E[xm − x2 ] 1 B 2 j=m γi
2
k E[ f (x j )] − f (x) ≤ k + k .
i=m γi i=m γi i=m γi
j=m
2 2
 k k (109)
(iii) Suppose that k∈N γk = +∞ and i=0 γi2 / i=0 γi → 0. Then f k → inf C f
and E[ f (x̄k )] → inf C f .
Moreover, if argminC f = ∅, the right hand side of (109), with m = 0 and
x ∈ argminC f , yields a rate of convergence for both f k − minC f and E[ f (x̄k )] −
minC f .
Proof Let k ∈ N and x ∈ C and set u k = E[û k | xk ]. First of all, note that assumption
A1 actually implies that û k  is square summable and hence summable. Then we
prove the following inequality

2γk xk − x, û k  ≤ xk − x2 − xk+1 − x2 + γk2 û k 2 . (110)

Indeed setting yk = xk − γk û k and using the relation 2a, b = a2 + b2 −


a − b2 , we have

2γk xk − x, û k  = 2xk − x, xk − yk 


= xk − x2 + xk − yk 2 − yk − x2 . (111)

Now, since PC is nonexpansive, we have xk+1 − x = PC (yk ) − PC (x) ≤


yk − x and hence (110) follows.
We prove by induction that xk − x is square summable for every k ∈ N. The
statement is true for k = 0. Suppose that xk − x is square summable for some
k ∈ N. Then it follows from (110) that

xk+1 − x2 ≤ xk − x2 + 2γk xk − xû k  + γk2 û k 2 .

The right-hand side is summable, and hence xk+1 − x is square summable. So, all
the terms in (110) are summable. Therefore, taking the conditional expectation given
xk of both terms of inequality (110) and using the fact that u k = E[û k | xk ] ∈ ∂ f (xk )
and the properties in Fact 75, we have almost surely
198 S. Salzo and S. Villa

2γk ( f (xk ) − f (x)) ≤ 2γk xk − x, E[û k | xk ]


≤ xk − x2 − E[xk+1 − x2 | xk ] + γk2 E[û k 2 | xk ]. (112)

Now, being f subdifferentiable, there exists (a, β) ∈ H × R, a = 0, such that


·, a + β ≤ f , hence xk , a + β ≤ f (xk ). Therefore, we have ( f (xk ))− ≤ xk 
a + |β|, which together with (112) yields the summability of f (xk ). Taking the
expectation in (112) and recalling that E[û k 2 ] ≤ B 2 , we get

2γk (E[ f (xk )] − f (x)) ≤ E[xk − x2 ] − E[xk+1 − x2 ] + γk2 B 2 . (113)

(i): Since ( f k )k∈N is decreasing, we have inf C f ≤ limk f k = inf k f k = inf k


E[ f (xk )] ≤ lim inf k E[ f (xk )]. Therefore it is sufficient to prove that
lim inf k E[ f (xk )] ≤ inf C f . Suppose that x ∈ C is such that f (x) < lim inf k
E[ f (xk )] = supn inf k≥n E[ f (xk )]. Then there exists n ∈ N such that f (x) < inf k≥n
E[ f (xk )]. Set ρ = inf k≥n E[ f (xk )] − f (x) > 0. Then, (113) yields

(∀ k ≥ n) γk ρ ≤ E[xk − x2 ] − E[xk+1 − x2 ] − γk ρ − γk B 2 ).

Now, since γk → 0, there exists m ∈ N such that for every integer k ≥ m, we have
ρ − γk B 2 ≥ 0 and hence, setting ν := max{n, m}, we have

ρ γk ≤ E[xν − x2 ] < +∞.
k≥ν


This contradicts the assumption k∈N γk = +∞. Therefore, we showed that there
is no x ∈ C such that f (x) < lim inf k E[ f (xk )], that is, lim inf k E[ f (xk )] ≤ inf C f .
(ii): It follows from (113) that

1 B2 2
(∀ i ∈ N) γi (E[ f (xi )] − f (x)) ≤ E[xi − x2 ] − E[xi+1 − x2 + γ .
2 2 i
(114)
So, summing from m to k, we have


k
1 B2  2
k
γi (E[ f (xi )] − f (x)) ≤ E[xm − x ] +
2
γ .
i=m
2 2 i=m i

k
Dividing the above inequality by i=m γi yields (109).
(iii): We first note that, since f  is convex and x̄k is a convex combination of
the xi ’s, with coefficients ηi = γi / kj=0 γ j , with 0 ≤ i ≤ k, we have E[ f (x̄k )] ≤
k k k
i=0 ηi E[ f (x i )]. Moreover, f k = i=0 ηi f k ≤ i=0 ηi E[ f (x i )]. Therefore,


k −1 
k
(∀ k ∈ N) h k := max{ f k , E[ f (x̄k )]} ≤ γi γi E[ f (xi )]. (115)
i=0 i=0
Proximal Gradient Methods for Machine Learning and Imaging 199

Let x ∈ C. Then it follows from (109) and (115) that lim supk h k ≤ f (x). Since x is
arbitrary in C, we have lim supk h k ≤ inf C f . Moreover, clearly we have inf C f ≤
lim inf k h k . Therefore, h k → inf C f . Since inf C f ≤ f k ≤ h k and inf C f ≤
E[ f (x̄k )] ≤ h k , the statement follows. 

Lemma 78 Let m, k ∈ N with 2 ≤ m < k. Then, the following inequalities hold.


k 1 1 1  1
k  k 
(i) log + + ≤ ≤ log
m 2 m k i=m
i m−1
  
√ √ 1 1 1
k
1
(ii) 2( k − m) + √ +√ ≤ √ .
2 m k i=m
i
+∞
 1 π
(iii) 2
= .
i=0
i 6

Lemma 79 Let a ∈ Rn++ and α, β ∈ R++ . Then


1
α β γ 2 αβ
minn + =
γ ∈R++ 2a γ 2 a γ a2
( 
and the minimum is achieved at γ = α/βa2 a.

Proof Define ϕ : R × Rn → ] − ∞, +∞ ] such that



⎨ α + βγ 
2
if t > 0 and γ ∈ Rn+
ϕ(t, γ ) = 2t

+∞ otherwise.

Clearly ϕ is closed, convex, and differentiable in R++ × Rn++ , and, for all (t, γ ) ∈
R++ × Rn++ ,
 
α + βγ 2 β
∇ϕ(t, γ ) = − , γ . (116)
2t 2 t

Then,

α β γ 2 α + βγ 2
infn + = inf infn = inf ϕ(t, γ ),
γ ∈R++ 2a γ 2 a γ t>0 γ ∈R++ 2t (t,γ )∈R×Rn
a γ =t a γ =t

and the right hand side can be written as

inf ϕ(t, γ ) + ι{0} (−1, a) (t, γ ) .


(t,γ )∈R×Rn
200 S. Salzo and S. Villa

So, Fermat’s rule yields

0 ∈ ∇ϕ(t, γ ) + A∗ ∂ι{0} (A(t, γ )),

where A : Rn+1 → R is the linear form A = (−1, a) · and A∗ is the map s →


s(−1, a). Therefore, we have

(−1, a) (t, γ ) = 0 and − ∇ϕ(t, γ ) ∈ R(−1, a),

which, in view of (116), implies that there exists s ∈ R such that




⎪ α + βγ 2

⎪ = −s
⎨ 2t 2
β

⎪ − γ = sa

⎪ t

a γ =t

Now, it follows from the last two equations above that −β = −βa γ /t = sa2
and hence ⎧
⎪ α + βγ  = β
⎪ 2


⎨ 2t 2 a2
t
⎪ γ = a

⎪ a2

⎩a γ = t.

√ above that γ  = t /a which, substituted


2 2 2
It follows from the second equation
into the first equation, gives t = α/βa. Therefore, finally, we have
'  1
α αβ
γ = a and ϕ(t, γ ) = . 
βa2 a2

Corollary 80 Under the same assumptions of Theorem 77, the following hold.
(i) Suppose that argminC f = ∅ and let D ≥ dist(x0 , argminC f ) and k ∈ N. Then,
k
B 2 j=0 γi
2
D2 1
max{ f k , E[ f (x̄k )]} − min f ≤ k + k . (117)
i=0 γi i=0 γi
C 2 2

√ right hand side of (117) is minimized when, for every i = 0, . . . , k,


Moreover, the
γi = D/(B k + 1) and in that case we have

  BD
max f k , E[ f (x̄k )] − min f ≤ √ .
C k+1
Proximal Gradient Methods for Machine Learning and Imaging 201

(ii) Let, for every k ∈ N, γk = γ̄ /(k + 1). Then, f k → inf C f and E[ f (x̄k )] →
inf C f . Moreover, if argminC f = ∅, we have, for every k ∈ N,
 
  dist(x0 , argminC f )2 π γ̄ B 2 1
max f k , E[ f (x̄k )] − min f ≤ + . (118)
C 2γ̄ 12 log(k + 1)

(iii) Let, for every k ∈ N, γk = γ̄ / k + 1. Then, f k → inf C f and E[ f (x̄k )] →
inf C f . Moreover, if argminC f = ∅, for every integer k ≥ 2, we have
  dist(x0 , argminC f )2 1 log(k + 1)
max f k , E[ f (x̄k )] − min f ≤ √ + γ̄ B 2 √ . (119)
C 2γ̄ k+1 k+1

(iv) Let, for every k ∈ N, γk = γ̄ / k + 1 and suppose that C is bounded with diam-
eter D̄ > 0 and that argminC f = ∅. Set, for every k ∈ N, f˜k = min!k/2"≤i≤k
k −1 k
f (xi ) and x̃k = i=!k/2" γi i=!k/2" γi x i . Then, for every integer k ≥ 2,

 
  3 D̄ 2 5γ̄ B 2 1
max f˜k , E[ f (x̃k )] − min f ≤ + √ . (120)
C 2γ̄ 2 k+1

Proof (i): Equation (117) follows from (115) and by minimizing the right hand
side of (109), with m = 0, w.r.t. x ∈ argminC f . Now, it√follows from Lemma 79
that the minimum of √ the right-hand side of (117) is B D/ k + 1 and k is achieved at
(γi )0≤i≤k ≡ D/(B k + 1). Note that is this case x̄k = (k + 1)−1 i=0 xi .
k k+1
(ii): We derive from Lemma 78(i), with m = 1, that i=0 γi = γ̄ i=1 (1/i) ≥
k k+1
γ̄ log(k + 1). Moreover, we have i=0 γi = γ̄ 2 2
i=1 1/i ≤ γ̄ π/6. So, the first
2 2

part follows from Theorem 77 (iii), while the inequality in (118) follows from (117)
with D = dist(x0 , argminC f ). √ √
k
√ √ 78(ii), with m = 1,
(iii) Lemma √ yields i=1 √ 1/ i ≥ 2( k − 1) + (1/2)(1 +
1/ k) ≥√2 k √
 − 3/2. Moreover, 2 k − 3/2 ≥ k k for k ≥  3 and clearly
√ for√ k ≤ 2,
k k+1
1/ i ≥ k. Therefore, for every k ∈ N, γ = γ̄ 1/ i ≥ γ̄ k + 1.
i=1 k i=0 i k i=1
Moreover, by Lemma 78(i), we have i=1 1/i = 1 + k
i=2 1/i ≤ 1 + log k ≤
k+1
2 log k, for k ≥ 3. Therefore, for every k ∈ N, k ≥ 2, we have i=0 γi2 = γ̄ 2 i=1
1/i ≤ 2γ̄ 2 log(k + 1). Again, the first part follows from Theorem 77(iii), while (119)
follows from (117) with D = dist(x0 , argminC f ).
(iv): Let k ∈ N, k ≥ 2. It follows from Lemma 78(i) that


k 
k+1
1 k +1 5
γi2 = γ̄ 2 ≤ γ̄ 2 log ≤ γ̄ 2 log 4 ≤ γ̄ 2 .
i=!k/2" i=!k/2"+1
i !k/2" 3

Moreover, Lemma 78(ii) yields


202 S. Salzo and S. Villa


k 
k+1
1
γi = γ̄ √
i=!k/2" i=!k/2"+1
i
 '
√ ( √ !k/2" + 1 
≥ 2γ̄ ( k + 1 − !k/2" + 1) ≥ 2γ̄ k + 1 1 − .
k+1

Now, since (!k/2" + 1)/(k + 1) ≤ 2/3, we have


 ' 

k
2 √ γ̄ √
γi ≥ 2γ̄ 1 − k+1≥ k+1
i=!k/2"
3 3

The statement follows from Theorem 77(ii), with m = !k/2" and x ∈ argminC f ,
k −1 k
taking into account that, as in (115), max{ f˜k , f (x̃k )} ≤ i=!k/2" γi i=!k/2" γi
f (xi ). 

Example 81 A case in which the above stochastic algorithm arises is in the incre-
mental subgradient method. We aim at solving

1 
m
min f (x) := f j (x),
x∈C m j=1

where every f j : X → R is convex and Lipschitz continuous with constant L j . The


projected incremental subgradient method is as follows. Let, for every j, ∇˜ f j : X →
X be a selection of ∂ f j . Let x0 ∈ X . Then,

⎢ k = 0, 1, . . .
for
⎢ chose an index jk ∈ {1, . . . , m} at random

⎢x
⎣ k+1 = PC (xk − γk *∇˜ f +,
(121)
jk (x k )).
-
û k

 
Since ∂ f = (1/m) mj=1 ∂ f j , we have that (1/m) mj=1 ∇˜ f j (x) ∈ ∂ f (x). Let k ∈ N.
Then, xk is a random variable, depending on j0 , . . . , jk−1 . Hence, û k := ∇˜ f jk (xk ) is
a random variable, where xk and jk are independent random variables, and Fact 75
yields
! 1 ˜
m
u k := E ∇˜ f jk (xk ) | xk = ∇ f j (xk ) ∈ ∂ f (xk )
m j=1

and
1  ˜ 1  2
m m
E[∇˜ f jk (xk )2 | xk ] = ∇ f j (xk )2 ≤ L ,
m j=1 m j=1 j
Proximal Gradient Methods for Machine Learning and Imaging 203


and hence E[∇˜ f jk (xk )2 ] ≤ (1/m) mj=1 L 2j . In the end assumptions of Theorem 77

are satisfied with B 2 = (1/m) mj=1 L 2j .
Example 82 (Stochastic optimization) We generalize the previous example. We
consider the following optimization problem

minimize f (x), f (x) = E[ϕ(x, ζ )], (122)


x∈C

where f : X → R, ζ is a random variable with values in a measurable space Z with


distribution μ and ϕ : X × Z → R is such that
2
(SO1 ) ∀ z ∈ Z, ϕ(·, z) is convex and L(z)-Lipschitz continuous and Z L(z)2 dμ <
+∞.
(SO2 ) ϕ(0, ·) ∈ L 1 (Z, μ).
The above assumptions ensure that, for every x ∈ X , ϕ(x, ·) ∈ L 1 (Z, μ). Indeed,
for every z ∈ Z, |ϕ(x, z)| ≤ |ϕ(x, z) − ϕ(0, z)| + |ϕ(0, z)| ≤ L(z)x + |ϕ(0, z)|.
Hence ϕ(x, z) ∈ L 1 (Z, μ), since L(z) and ϕ(0, z) are so. We let ∂ϕ : X × Z → 2 X
be such that ∂ϕ(x, z) = ∂ϕ(·, z)(x) and we make the following additional assump-
tions
(SO3 ) there exists a measurable ∇ϕ˜ : X × Z → X , such that, for every x ∈ X and
˜
for μ-a.e. z ∈ Z, ∇ϕ(x, z) ∈ ∂ϕ(x, z).
(SO4 ) (ζk )k∈N is a sequence of independent copies of ζ .
Then we consider the following algorithm. Let x0 ∈ X . Then,

3 k = 0, 1, . . .
for
˜
xk+1 = PC xk − γk ∇ϕ(x ,ζ ) . (123)
* +,k k-
û k

We have, for every x1 , x2 ∈ X ,


4 4
| f (x1 ) − f (x2 )| ≤ |ϕ(x1 , z) − ϕ(x2 , z)|dμ(z) ≤ x1 − x2  L(z)dμ(z).
Z Z

2 2
Therefore, f is Lipschitz continuous with constant Z L(z)dμ(z) ≤ Z L(z)2
1/2
dμ(z) . Moreover, assumption (SO3 ) implies that

˜
for all x, y ∈ X and for μ-a.e. z ∈ Z ϕ(y, z) ≥ ϕ(x, z) + y − x, ∇ϕ(x, z).
(124)
Note that all terms of the above inequality are μ-summable, in particular, since
˜
∇ϕ(x, ˜
z) ≤ L(z) and L(z) is μ-summable, ∇ϕ(x, ·) is μ-summable. Hence, inte-
grating (124) w.r.t. μ we get
4
(∀ x, y ∈ X ) f (y) ≥ f (x) + y − x, ˜
∇ϕ(x, z)dμ(z).
Z
204 S. Salzo and S. Villa

˜
Therefore, for every x ∈ X , E[∇ϕ(x, ζ )] ∈ ∂ f (x). Now, let k ∈ N, k ≥ 1. Then, it
follows from (123) that
xk = xk (ζ0 , . . . , ζk−1 ),

hence xk and ζk are independent2 random variables. Therefore, Fact 75(v) yields that
˜
u k := E[∇ϕ(x , ζ ) | ] = ˜
k k x k Z ∇ϕ(x k , z)dμ(z) ∈ ∂ f (x k ) and
4 4
!
˜
E ∇ϕ(x k , ζk ) | x k =
2 ˜
∇ϕ(x k , z) dμ(z) ≤
2
L(z)2 dμ(z) < +∞,
Z Z

2
˜
and hence E[∇ϕ(x k , ζk ) ] ≤ Z L(z) dμ(z). In the end Theorem 77 applies with
2 2
2
B = Z L(z) dμ(z), so that the stochastic algorithm (123) provides a solution to
2 2

problem (122).

4.2 Stochastic Proximal Gradient Method

We address again problem (105) where now f is Lipschitz smooth, and we consider
a stochastic version of Algorithm 1. In the following we set F = f + g.
Algorithm 4 (The stochastic proximal gradient method) Let x0 ∈ X and (γk )k∈N
be a sequence in R++ . Then,

0 k = 0, 1, . . .
for
û k is a square summable X -valued random vector s.t. E[û k | xk ] = ∇ f (xk ),
xk+1 = proxγk g (xk − γk û k ).
(125)
Moreover, define, for every k ∈ N,

k −1 
k
Fk = min E[F(xi+1 )], x̄k = γi γi xi+1 .
0≤i≤k
i=0 i=0

The following theorem gives the main convergence results about the algorithm.

Theorem 83 Let f : X → R be convex and differentiable with a L-Lipschitz con-


tinuous gradient, let g ∈ 0 (X ), and define F = f + g. Let (xk )k∈N , (Fk )k∈N , and
(x̄k )k∈N be the sequences generated by Algorithm 4. We make the following additional
assumption
A1 There exists σ ≥ 0, such that, for every k ∈ N, the random variable
û k − ∇ f (xk ) is square summable and E[û k − ∇ f (xk )2 |xk ] ≤ σ 2 .
A2 For every k ∈ N, γk ≤ 1/L.
Then, for every k ∈ N, xk is square summable in norm and F(xk ) is summable and
the conclusions (i), (ii), and (iii) of Theorem 77 and those of Corollary 80(i)(iii)(iv)
Proximal Gradient Methods for Machine Learning and Imaging 205

remain valid in expectation, with the constant B 2 replaced by σ 2 and f k , E[ f (xk )],
and inf C f replaced by Fk , E[F(xk )], and inf F respectively. In particular, the fol-
lowing hold.
 k k
(i) Suppose that k∈N γk = +∞ and that i=0 γi2 / i=0 γi → 0. Then Fk →
inf F, lim inf k E[F(xk )] = inf F and E[F(x̄k )] → inf F. √
(ii) Suppose that S∗ := argmin F = ∅ and let, for every k ∈ N, γk = γ̄ / k + 1,
with γ̄ ≤ 1/L. Then, for every integer k ≥ 2,

dist(x0 , S∗ )2 1 log(k + 1)
max{Fk+1 , E[F(x̄k+1 )]} − min F ≤ √ + γ̄ σ 2 √ .
2γ̄ k+1 k+1

Proof Since γk ≤ 1/L for every k ∈ N, it follows from Lemma 45, that, for every
(x, y) ∈ X 2 , z ∈ dom∂g, and every η ∈ ∂g(z) we have

1
F(x) ≥ F(z) + x − z, ∇ f (y) + η − z − y2 . (126)
2γk

Let x ∈ X . Applying the previous inequality with z = xk+1 , η = γk−1 (xk − xk+1 ) −
û k , and y = xk we obtain

xk − xk+1 1
F(x) ≥ F(xk+1 ) + x − xk+1 , ∇ f (xk ) − û k + − xk+1 − xk 2
γk 2γk
(127)
and thus, setting (∀k ∈ N) x̃k+1 = proxγk g (xk − γk ∇ f (xk )),
xk − xk+1 1
F(xk+1 ) − F(x) ≤ x − xk+1 , û k − ∇ f (xk ) − + xk+1 − xk 2
γk 2γk
1  
= x − xk+1 , û k − ∇ f (xk ) + − 2x − xk+1 , xk − xk+1  + xk+1 − xk 2
2γk
1
= x − xk+1 , û k − ∇ f (xk ) + xk − x2 − xk+1 − x2
2γk
. /
= x − x̃k+1 , û k − ∇ f (xk ) + x̃k+1 − xk+1 , û k − ∇ f (xk )
1
+ xk − x2 − xk+1 − x2 . (128)
2γk

We next want to take the conditional expectation of this inequality. To this aim we
first prove by induction that xk  and ∇ f (xk ) are square summable and F(xk ) is
summable. The statement is clearly true for k = 0. Suppose that it holds for k ≥ 0.
Then it follows from (128) and the nonexpansivity of proxγk g that
206 S. Salzo and S. Villa

xk+1 − x2 + 2γk (F(xk+1 ) − F(x))


≤ 2γk (x − x̃k+1  + xk+1 − x̃k+1 )û k − ∇ f (xk ) + xk − x2
≤ 2γk (x − proxγk g (x) + x − xk + γk ∇ f (xk ) + γk û k − ∇ f (xk ))
× û k − ∇ f (xk ) + xk − x2 (129)

and hence we derive that xk+1  is square summable and F(xk+1 ) is summable.
Moreover, since ∇ f is Lipschitz continuous, we have ∇ f (xk+1 ) ≤ Lxk+1 −
x + ∇ f (x), which implies that ∇ f (xk+1 ) is square summable too. given xk
in (128) and recalling that E[û k | xk ] = ∇ f (xk ), we get

E[xk+1 − x2 |xk ] + 2γk E[F(xk+1 ) − F(x)|xk ]


≤ xk − x2 + 2γk E[x̃k+1 − xk+1 , û k − ∇ f (xk )|xk ]. (130)

Since proxγ g is nonexpansive by Proposition 34, we derive

E[xk+1 − x2 |xk ] + 2γk E[F(xk+1 ) − F(x)|xk ] ≤ xk − x2 + 2γk2 σ 2 , (131)

and this yields

2γk (E[F(xk+1 )] − F(x)) ≤ E[xk − x2 ] − E[xk+1 − x2 ] + 2γk2 σ 2 . (132)

The above equation is the same as (113) except for the fact that F(xk ) and B 2 are
replaced by F(xk+1 ) and σ 2 respectively. The proof thus essentially continues as the
one of Theorem 77.

4.3 Randomized Block-Coordinate Descent

In this section, we address the following problem


m
minimize F(x) = f (x) + g(x), g(x) = gi (xi ), (133)
x∈X
i=1

where X is the direct sum of m separable real Hilbert spaces (X i )1≤i≤m , i.e.,

5
m 
m
X= X i and (∀ x = (x1 , · · · , xm ), y = (y1 , · · · , ym ) ∈ X ) x, y = xi , yi 
i=1 i=1

and the following assumptions hold


A1 f : X → R is convex and differentiable with Lipschitz continuous gradient.
A2 (∀i ∈ [m] := {1, . . . , m}), gi ∈ 0 (X i ).
Proximal Gradient Methods for Machine Learning and Imaging 207

We study the following algorithm.

Algorithm 5 (The randomized block-coordinate proximal gradient method) Let


x 0 = (x10 , . . . , xm0 ) ∈ X and (γi )1≤i≤m ∈ Rm
++ . Then,

⎢ k = 0, 1, . . .
for
⎢ for i = 0, 1, . . . , m
⎢3 
⎢ proxγk gi xikk − γik ∇ik f (x k ) if i = i k (134)
⎣ k+1
xi = k

xik if i = i k

where (i k )k∈N are independent random variables taking values in {1, . . . , m} with
pi := P(i k = i) > 0 for all i ∈ {1, . . . , m}.

In the following we denote by Ji : X i → X the canonical embedding of X i into


X , that is, Ji (xi ) = (0, . . . , xi , . . . , 0), where xi occurs in the i-th position. Thus, the
algorithm can be equivalently written as

x k+1 = x k + Jik proxγi g (xikk − γik ∇ik f (x k )) − xikk . (135)


k ik

Moreover, we set

5m
1 m
1
 −1 = Idi , x, y−1 = xi , yi  (136)
γ
i=1 i
γ
i=1 i

and
5m
1 m
1
W = Idi , x, yW = xi , yi  . (137)
γ p
i=1 i i
γ p
i=1 i i

Remark 84 Algorithm 5 can be interpreted as a stochastic optimization algorithm


which uses special stochastic gradients and proximity operators oracles. Indeed, let
ξ be a random variable with values in {1, . . . , m} distributed as i k and let

1
ĝ(x, ξ ) = gξ (xξ ). (138)

m
Then, clearly E[ĝ(x, ξ )] = i=1 gi (xi ) = g(x). Moreover,
1 1 
) (x) = argmin y∈X gξ (xξ ) +  y − x2W
W
proxĝ(·,ξ
pξ 2
1 1  1 
= argmin y∈X gξ (xξ ) + (yξ − xξ )2 + (yi − xi )2
pξ 2γξ , pξ i =ξ
2γi pi

and hence
208 S. Salzo and S. Villa

xi if i = ξ
(∀ i ∈ {1, . . . , m}) [proxĝ(·,ξ
W
) (x)]i = (139)
proxγξ gξ (xξ ) if i = ξ.

Also, if we set ∇ˆ ξW f (x) = γξ Jξ (∇ξ f (x)), we have

E[∇ˆ ξW f (x)] = (γi pi ∇i f (x))1≤i≤m = W −1 ∇ f (x) = ∇ W f (x). (140)

Therefore, it is clear that Algorithm 5 can be rewritten as a stochastic proximal


gradient algorithm in the metric W as follows

x k+1 = proxĝ(·,i
W
k)
(x k − ∇ˆ iWk f (x k )). (141)

Proposition 85 Let f : X → R be a convex differentiable function. Then the fol-


lowing statements are equivalent.
(i) ∇ f is Lipschitz continuous.
(ii) There exists (L i )1≤i≤m ∈ Rm
+ such that for all i ∈ {1, . . . , m} and x = (x 1 , . . . ,
xm ) ∈ X , the mapping ∇i f (x1 , . . . , xi−1 , ·, xi+1 , · · · , xm ) : X i → X i is Lips-
chitz continuous with constant L i .

Proof (i) ⇒ (ii): Let L be a Lipschitz constant of ∇ f . Then (ii) holds with
(L i )1≤i≤m ≡ L. 
(ii) ⇒ (i): Let, for every i ∈ [m], qi = L i / mj=1 L j . Then (qi )1≤i≤m ∈ Rm
+ and
m
i=1 iq = 1. Let x, v ∈ X . Then

 
m 
f (x + v) = f x + Ji (vi )
i=1

m 
= f qi (x + qi−1 Ji (vi ))
i=1

m
≤ qi f (x + qi−1 Ji (vi ))
i=1
m
. / Li
≤ qi f (x) + qi−1 vi , ∇i f (x) + qi−1 vi 2
i=1
2
m
Li
= f (x) + v, ∇ f (x) + vi 2
i=1
2q i
m
Li
= f (x) + v, ∇ f (x) + i=1 v2 .
2
Therefore, Fact 1(ii) yields that ∇ f is Lipschitz continuous. 
Proximal Gradient Methods for Machine Learning and Imaging 209

Remark 86 Let f : X → R be a convex differentiable function with Lipschitz con-


tinuous gradient. The constants (L i )1≤i≤m defined in Proposition 85 are called the
block-Lipschitz constants of the partial gradients ∇i f . Then the following block-
coordinate descent lemma holds
Li
(∀ vi ∈ X i ) f (x + Ji (vi )) ≤ f (x) + vi , ∇ f i (x) + vi 2 . (142)
2
Lemma 87 Let X be a real Hilbert space. Let ϕ : X → R be differentiable and
convex and ψ ∈ 0 (X ). Let x ∈ X and set x + = proxψ (x − ∇ϕ(x)). Then, for all
z ∈ X,

z − x, x − x +  ≤ (ϕ + ψ)(z) − (ϕ + ψ)(x) − z − x2


+ ψ(x) − ψ(x + ) + x − x + , ∇ϕ(x) − x − x + 2 .

Proof Let z ∈ X . By definition of x + we have x − x + − ∇ϕ(x) ∈ ∂ψ x + . There-


fore, ψ(z) ≥ ψ x + + z − x + , x − x + − ∇ϕ(x), and hence

z − x + , x − x +  ≤ ψ(z) − ψ x + + z − x + , ∇ϕ(x). (143)

Now, we note that x + − z2 = x + − x2 + x − z2 + 2x + − x, x − z. Then,

z − x, x − x +  + x − x + , x − x + 
≤ ψ(z) − ψ x + + z − x, ∇ϕ(x) + x − x + , ∇ϕ(x)

and hence

z − x, x − x +  ≤ ψ(z) − ψ(x) + z − x, ∇ϕ(x) + ψ(x) − ψ(x + )


+ x − x + , ∇ϕ(x) − x − x + 2 .

Since z − x, ∇ϕ(x) ≤ ϕ(z) − ϕ(x) − (μϕ /2)z − x2 , the statement follows. 

Now we set
x̄ k+1 = proxγi gi (xik − γi ∇i f (x k )) 1≤i≤m
(144)
 = x − x̄
k k k+1
.

Then, recalling (135), we have

x̄ik+1 = proxγi g xikk − γik ∇ik f (x k ) = xik+1 ikk = xikk − xik+1 . (145)
k k ik k k

Also note that

x k = x k (i 0 , . . . , i k−1 ) and x̄ k+1 = x̄ k+1 (i 0 , . . . , i k−1 ).

We derive from (145) that


210 S. Salzo and S. Villa

xikk − xik+1
k
− ∇ik f (x k ) ∈ ∂gik (xik+1 ) (146)
γik k

Proposition 88 Let f and g satisfy Assumptions 4.3 and 4.3. Let (L i )1≤i≤m be the
block-Lipschitz constants of the partial gradients ∇i f as defined in Proposition 85.
Let (γi )1≤i≤m ∈ Rm++ be such that γi < 2/L i . Set δ = max1≤i≤m γi L i and pmin =
min1≤i≤m pi . Let (x k )k∈N be generated by Algorithm 5. Then, for all x ∈ X ,

1 !
x − x k , x k − x̄ k+1 −1 ≤ E F(x k ) − F(x k+1 ) | i 0 , . . . , i k−1
pmin
δ−2 k
+ F(x) − F(x k ) + x − x k+1 2−1 . (147)
2
−1 −1
Proof First note that x̄ k+1 = proxg (x − ∇  f (x k )), where the prox and the gra-
dient are computed in the weighted norm ·−1 . Then we derive from Lemma 87
written in the norm ·−1 that

x − x k , x k − x̄ k+1 −1 ≤ F(x) − F(x k )


+ g(x k ) − g( x̄ k+1 ) + x k − x̄ k+1 , ∇ f (x k )
− x k − x̄ k+1 2−1 . (148)

Next, we have

g(x k ) − g( x̄ k+1 ) + x k − x̄ k+1 , ∇ f (x k )


" #
1
=E gik (xikk ) − gik (x̄ik+1 ) + x k
ik − x̄ k+1
ik , ∇ik f (x k
) | i 0 , . . . , i k−1
pik k

Moreover, since xik+1


k
= x̄ik+1
k
and x k and x k+1 differ only for the i k -th component

1
gi (x k ) − gik (x̄ik+1 ) + xikk − x̄ik+1 , ∇ik f (x k )
pik k ik k k

1
= g(x k ) − g(x k+1 ) + x k − x k+1 , ∇ f (x k )
pik
1
= g(x k ) − g(x k+1 ) + x k − x k+1 , ∇ f (x k )
pmin
 1 1 
− − gik (xikk ) − gik (xik+1 ) + xikk − xik+1 , ∇ik f (x k )
pmin pik k k
* +, -
≥0
1
≤ g(x k ) − g(x k+1 ) + x k − x k+1 , ∇ f (x k )
pmin
Proximal Gradient Methods for Machine Learning and Imaging 211
 
1 1 1
− − ikk 2 ,
pmin pik γik

where in the last inequality we used that

1
− gik (xikk ) − gik (xik+1 ) + xikk − xik+1 , ∇ik f (x k ) ≤ − ik 2 (149)
k k
γi

which was obtained by the fact that vi = (xikk − xik+1


k
)/γik − ∇ik f (x k ) ∈ ∂gik (xik+1
k
).
So

g(x k ) − g( x̄ k+1 ) + x k − x̄ k+1 , ∇ f (x k )


1
≤ E[g(x k ) − g(x k+1 ) + x k − x k+1 , ∇ f (x k ) | i 0 , . . . , i k−1 ]
pmin
1  pi
m
− ik 2 + x k − x̄ k+1 2−1 . (150)
pmin i=1 γi

Now, we derive from the block-coordinate descent lemma (142) and the fact that x k
and x k+1 differ only in the i k -th component, that

E[x k − x k+1 , ∇ f (x k ) | i 0 , . . . , i k−1 ]


L ik
≤ E[ f (x k ) − f (x k+1 ) + ikk 2 | i 0 , . . . , i k−1 ]
2
1
m
≤ E[ f (x k ) − f (x k+1 ) | i 0 , . . . , i k−1 ] + pi L i ik 2 .
2 i=1

Therefore it follows from the above inequality and (150) that

g(x k ) − g( x̄ k+1 ) + x k − x̄ k+1 , ∇ f (x k )


1
≤ E[F(x k ) − F(x k+1 ) | i 0 , . . . , i k−1 ]
pmin
1  pi
m
+ (γi L i − 2 − σ−1 ) ik 2 + x k − x̄ k+1 2−1
2pmin i=1 γi
1
≤ E[F(x k ) − F(x k+1 ) | i 0 , . . . , i k−1 ]
pmin
δ−2 k
+ x − x̄ k+1 2−1 + x k − x̄ k+1 2−1 ,
2
where in the last inequality we used that γi L i − 2 ≤ δ − 2 ≤ 0 and that pi ≥ pmin .
The statement follows from (148). 
212 S. Salzo and S. Villa

Proposition 89 Under the assumptions of Proposition 88 suppose additionally that


x is an X -valued random variable which is measurable w.r.t. to the σ -algebra gen-
erated by i 0 , . . . , i k−1 . Then

E[x k+1 − x2W | i 0 , . . . , i k−1 ] − x k − x2W =  x̄ k+1 − x2−1 − x k − x2−1


(151)
and E[x k+1 − x k 2W | i 0 , . . . , i k−1 ] =  x̄ k+1 − x k 2−1 .

Proof If follows from Fact 75(v) that

E[x k+1 − x2W | i 0 , . . . , i k−1 ]


" m #
1 k+1 2
=E x − xi  i 0 , . . . , i k−1
γp i
i=1 i i
" #
1 1
= E x k − x2W − xik − xi 2 + x̄ik+1 − xi 2 i 0 , . . . , i k−1
γi k pi k γi k pi k
= x k − x2W − x k − x2 −1 +  x̄ k+1 − x2 −1

The second equation follows from (151), by choosing x = x k . 

Proposition 90 Under the assumptions of Proposition 88 set F = f + g. Then, the


following hold.
(i) (E[F(x k )])k∈N is decreasing.
(ii) Suppose that inf k∈N E[F(x k )] > ∞. Then,
  !
 x̄ k+1 − x k 2−1 = E x k − x k+1 2W i 0 , . . . , i k−1 < +∞ P a.s.
k∈N k∈N

(iii) For every k ∈ N and every x ∈ domF

E[x k+1 − x2W | i 0 , . . . , i k−1 ]

≤ x k − x2W − 2 F(x k ) − F(x


2  (δ − 1)+ 
+ + 1 E[F(x k ) − F(x k+1 ) | i 0 , . . . , i k−1 ].
pmin 2−δ
(152)

Proof Let k ∈ N and x ∈ domF. Since

x k − x2−1 −  x̄ k+1 − x2−1 = −x k − x̄ k+1 2−1 + 2x k − x̄ k+1 , x k − x−1 ,

we derive from (147), multiplied by 2, that


Proximal Gradient Methods for Machine Learning and Imaging 213

 x̄ k+1 − x2−1 ≤ x k − x2−1 + (δ − 1) x̄ k+1 − x k 2−1


2
+ E[F(x k ) − F(x k+1 ) | i 0 , . . . , i k−1 ]
pmin
− 2 F(x k ) − F(x) . (153)

Then for an X -valued random variable x  measurable with respect to i 0 , . . . , i k−1 ,


Proposition 89 yields

E[x k+1 − x  2W | i 0 , . . . , i k−1 ]

≤ x k − x  2W + (δ − 1)E[x k+1 − x k 2W | i 0 , . . . , i k−1 ]


2
+ E[F(x k ) − F(x k+1 ) | i 0 , . . . , i k−1 ]
pmin
− 2 F(x k ) − F(x  ) . (154)

Taking x  = x k in (154), we have

pmin !
(2 − δ)E[x k+1 − x k 2W | i 0 , . . . , i k−1 ] ≤ E F(x k ) − F(x k+1 ) | i 0 , . . . , i k−1 ,
2
(155)

which plugged into (154), with x  ≡ x ∈ domF, gives (iii). Moreover, taking the
expectation in (155), we obtain

pmin !
(2 − δ)E x k+1 − x k 2W ≤ E[F(x k )] − E[F(x k+1 )], (156)
2
!
which gives (i). Finally, set for all k ∈ N, ξk = E F(x k ) − F(x k+1 ) i 0 , . . . , i k−1 ≥
0. Then
" +∞
 # +∞
 +∞

E ξk = E[ξk ] = E[F(x k )] − E[F(x k+1 )] ≤ E[F(x 0 )] − inf E[F(x k )].
k∈N
k=0 k=0 k=0


This shows that if inf k∈N E[F(x k )] > −∞, then +∞ k=0 ξk is P-integrable and hence
it is P-a.s. finite. Then (ii) follows from (155) and Proposition 89. 

Proposition 91 Under the assumptions of Proposition 90, suppose in addition that


F is bounded from below. Then, there exist ( yk )k∈N and (v k )k∈N , sequences of X -
valued random variables, such that the following hold.
(i) v k ∈ ∂ F( yk ) P-a.s.
(ii) yk − x k → 0 and v k → 0 P-a.s.
214 S. Salzo and S. Villa

Proof It follows from (144) that, (xik (ω) − x̄ik+1 (ω))/γi − ∇i f (x k (ω)) ∈
∂gi (x̄ik+1 (ω)), for all i ∈ [m] and ω ∈ . Hence

 x k (ω) − x̄ k+1 (ω) 


i i
− ∇ f (x k (ω)) ∈ ∂g( x̄ k+1 (ω)).
γi 1≤i≤m

Set yk = x̄ k+1 and let v k :  → X be such that, for every ω ∈ ,


 x k (ω) − y k (ω) 
v k (ω) = i i
+ ∇ f ( yk (ω)) − ∇ f (x k (ω))
γi 1≤i≤m

∈ ∂g( yk (ω)) + ∇ f ( yk (ω)) = ∂ F( yk (ω)).

Clearly v k is measurable and hence it is a random variable. Moreover, for every


ω ∈ ,

1
v k (ω) ≤ x k (ω) − yk (ω) + ∇ f ( yk (ω)) − ∇ f (x k (ω)).
γmin

Now, since F is bounded from below, Proposition 90(ii) yields that ( yk − x k 2−1 )k∈N
is summable P-a.s. and hence yk − x k → 0 P-a.s. The statement follows from the
fact that ∇ f is Lipschitz continuous (see Proposition 85). 

Lemma 92 (Stochastic Opial) Let X be a Hilbert space, let S be a nonempty subset


of X be a subset and let (x k )k∈N be a random sequence on (, A, P) with values in
X . Assume that
(a) S is separable;
(b) for every z ∈ S, there exists z with P(z ) = 1 such that, for every ω ∈ z ,

∃ lim x k (ω) − z;


k

ˆ with P()
(c) there exists  ˆ = 1 such that, for every ω ∈ ,
ˆ every weak cluster
point of (x (ω)) belongs to S.
k

Then there exists a S-valued random variable x̄ such that x k  x̄ a.s.


˜ such that, for every ω ∈ 
Proof We first show that there exists  ˜ and for every
z ∈ S, there exists
lim x k (ω) − z.
n

6
Let W ⊆ S countable dense in S and let ˜ = w∈W w . Then P()
˜ = 1 and, for
˜
every ω ∈  and for every w ∈ W , there exists

lim x k (ω) − w.


n
Proximal Gradient Methods for Machine Learning and Imaging 215

Fix ω ∈ ˜ and z ∈ S. Since W is dense in S, there exists a sequence (w j ) in W such


that w j → z. Since w j ∈ W for every j ≥ 0, we know that there exists

lim x k (ω) − w j  = τ j (ω). (157)


k

Note that

− w j − z ≤ x k (ω) − z − x k (ω) − w j  ≤ w j − z. (158)

Then, (157) and (158) yield


!
−w j − z ≤ lim inf x k (ω) − z − x k (ω) − w j 
k

= lim inf x k (ω) − z − τ j (ω) ≤ lim sup x k (ω) − z − τ j (ω)


k k
!
= lim sup x (ω) − z − x (ω) − w j  ≤ w j − z.
k k
k

Taking the limit for k → +∞ and recalling that wk → z, we get that there exists
˜ the limit of limk x k (ω) − z
limk x k (ω) − z. So we proved that for every ω ∈ 
¯ ˜ ˆ
exists. Now suppose that  :=  ∩ . Then, for every ω ∈ , ¯ we have both that:
for every z ∈ Z, ∃ limk x (ω) − z; every weak cluster point of x k (ω) belongs to
k

Z. We conclude by Lemma 29 that, for every ω ∈ , ¯ there exists x̄ (ω) ∈ Z such


that x (ω)  x̄(ω).
k


Now we give the main convergence results, which extends to the stochastic setting
the convergence rate of the (deterministic) proximal gradient algorithm given in
Theorem 47.

Theorem 93 Under the assumptions of Proposition 88 set F = f + g, F∗ = inf F,


and S∗ = argmin F ⊂ X . Then, the following hold.
(i) E[F(x k )] → F∗ .
(ii) Suppose that S∗ = ∅. Then E[F(x k )] − F∗ = o(1/k) and, for all integer
k ≥ 1,
"
dist 2W (x 0 , S∗ )  max 1, (2 − δ)−1   #
1
E[F(x k )] − F∗ ≤ + − 1 (F(x 0 ) − F∗ ) .
2 pmin k

Moreover, there exists a random variable x ∗ taking values in S∗ such that


x k  x ∗ P-a.s.

Proof Proposition 90(iii) gives, for all x ∈ domF and k ∈ N,

E[x k+1 − x2W | i 0 , . . . , i k−1 ]


≤ x k − x2W + 2E[F(x) − F(x k+1 ) | i 0 , . . . , i k−1 ] + ξk , (159)
216 S. Salzo and S. Villa

where
 
max{1, (2 − δ)−1 }
ξk = b1 E[F(x k ) − F(x k+1 ) | i 0 , . . . , i k−1 ], b1 = 2 −1 .
pmin

Note that the random variables


 x k ’s are discrete with finite range and (E[F(x k )])k∈N is
decreasing. Moreover, k∈N E[ξk ] ≤ b1 (F(x 0 ) − F∗ ). Therefore, taking the expec-
tation in (159) we have

2E[F(x k+1 )] − F(x) ≤ E[x k − x2W ] − E[x k+1 − x2W ] + E[ξk ] (160)

Since (E[F(x k )])k∈N is decreasing, E[F(x k )] → inf k∈N E[F(x k )] ≥ F∗ . Thus, the
statement (i) is true if inf k∈N E[F(x k )] = −∞. Suppose that inf k∈N E[F(x k )] > −∞
and let x ∈ domF. Then, the right hand side of (160), being summable, converges to
zero. Therefore, F∗ ≤ limk→+∞ E[F(x k+1 )] ≤ F(x). Since x is arbitrary in domF,
(i) follows. Let x ∈ S∗ . Then, F(x) = F∗ and (160) yields
 ! 
2 E[F(x k+1 )] − F∗ ≤ E x 0 − x2 + E[ξk ] ≤ x 0 − x2 + b1 (F(x 0 ) − F∗ ).
k∈N k∈N


Therefore, we have k∈N (E[F(x k+1 )] − F∗ ) ≤ (x 0 − x2 + b1 (F(x 0 ) − F∗ ))/2.
Since (E[F(x k+1 )] − F∗ )k∈N is decreasing, the first part of statement (ii) follows
from Fact 46. Concerning the convergence of the iterates, we will use the stochastic
Opial’s Lemma 92. Let x ∈ argmin F. Then it follows from (159) that

(∀ k ∈ N) E[x k+1 − x2W | i 0 , . . . , i k−1 ] ≤ x k − x2W + ξk .


  
Since E[ k∈N ξk ] = k∈N E[ξk ] < +∞, we have k∈N ξk < +∞ P-a.s. and hence
(x k − x2W )k∈N is an almost supermartingale in the sense of Robbins and Sieg-
mund [96]. Thus, there exists 1 ⊂  such that P(1 ) = 1 and for every ω ∈ 1 ,
(x k (ω) − x2W )k∈N is convergent. Now, it follows from Proposition 91 that there
exists 2 ⊂  with P(2 ) = 1, such that, for every ω ∈ 2 , v k (ω) ∈ ∂ F( yk (ω)),
yk (ω) − x k (ω) → 0, and v k (ω) → 0. Therefore, let ω ∈ 2 and let (x n k (ω))k∈N be
a subsequence of (x k (ω))k∈N such that x n k (ω)  x̄. Then,

v n k (ω) ∈ ∂ F( yn k (ω)) yn k (ω) − x n k (ω) → 0 v n k (ω) → 0. (161)

Then, it follows from (161) that yn k (ω)  x and, since ∂ F is weak-strong closed,
that 0 ∈ ∂ F(x). Therefore the two conditions in the stochastic Opial’s Lemma 92
are satisfied with S = argmin F and hence the statement follows.
Proximal Gradient Methods for Machine Learning and Imaging 217

4.4 Bibliographical Notes

Stochastic methods in optimization were initiated by Robbins and Monro [95], Kiefer
and Wolfowitz [61], and Ermoliev [47]. These methods are nowadays very popular
due to applications in deep machine learning [22]. The projected stochastic subgra-
dient method was studied in [44, 80]. In the last years rate of convergence in the
last iterates were also derived [105]. The proximal stochastic gradient which explic-
itly assumes the Lipschitz continuity of the gradient was studied in [2, 100]. The
worst case convergence rate in expectation of proximal stochastic gradient method
is much worse with respect to the one of proximal gradient method. Recently, vari-
ance reduction techniques have been studied to improve the convergence behavior
of stochastic methods [59], at the cost of keeping previously computed gradients in
memory. These techniques are particularly useful for empirical risk minimization
problems, see [42, 56] and references therein. Randomized strategies in block coor-
dinate descent methods were popularized by Nesterov in [84]. Since then a number
of works appeared extending and improving the analysis under several aspects. We
cite among others [35, 79, 94, 103, 114].

5 Dual Algorithms

In this section, we show how proximal gradient algorithms can be used on the dual
problem, to derive new algorithmic solutions for the primal.

5.1 A Framework for Dual Algorithms

We consider the same setting of Sect. 2.6. Here we additionally assume that f is
strongly convex with modulus of convexity μ > 0. In this situation, it follows from
Fact 13 that f ∗ is differentiable on X and ∇ f ∗ is 1/μ-Lipschitz continuous. More-
over, since f is strongly convex, the primal problem (P) admits a (unique) solution,
say x̂. We also assume that the calculus rule for subdifferentials (15) holds. Thus, in
view of Fact 14, we have that a dual solution û also exists, the duality gap is zero,
and the following KKT conditions hold

x̂ = ∇ f ∗ (−A∗ û) and A x̂ ∈ ∂g ∗ (û). (162)

So, in this case, a dual solution uniquely determines the primal solution. Actually, the
map u → ∇ f ∗ (−A∗ u) provides a way to go from the dual space Y into the primal
space X . See Fig. 2. The following proposition tells us even more.

Proposition 94 Under the notation of Sect. 2.6, let u ∈ Y and set x = ∇ f ∗ (−A∗ u).
Then
218 S. Salzo and S. Villa

μ
x − x̂2 ≤ (u) − (û).
2
Proof It follows from the KKT conditions (162), Fact 11, and the definition of u
that
. / . /
f (x̂) + f ∗ (−A∗ û) = x̂, −A∗ û and f (x) + f ∗ (−A∗ u) = x, −A∗ u .

Thus, since −A∗ u ∈ ∂ f (x) and f is μ-strongly convex,


. / . /
f ∗ (−A∗ u) − f ∗ (−A∗ û) = f (x̂) − f (x) + x̂, A∗ û − x, A∗ u
. / μ . / . /
≥ x̂ − x, −A∗ u + x̂ − x2 + x̂, A∗ û − A∗ u, x
2
. / μ
= A x̂, û − u + x̂ − x2 .
2

Now, since A x̂ ∈ ∂g ∗ (û), we have


. /
g ∗ (u) − g ∗ (û) ≥ A x̂, u − û .

Summing the two inequalities above, we have


μ
f ∗ (−A∗ u) + g ∗ (u) − f ∗ (−A∗ û) + g ∗ (û) ≥ x − x̂2
2
and the statement follows. 

We define the duality gap function

G : X × Y → ] − ∞, +∞ ] , G(x, u) = (x) + (u).

Fig. 2 Duality in strongly convex problems


Proximal Gradient Methods for Machine Learning and Imaging 219

Recall that if strong duality holds inf = − inf , and hence

( (x) − inf ) + ((u) − inf ) = G(x, u),

so the duality gap function bounds the primal and dual objectives. We have the
following theorem
Theorem 95 Under the notation of Sect. 2.6, suppose that R(A) ⊂ dom∂g. Then
the following holds:
(i) Suppose that g ∗ is α-strongly convex. Let u ∈ domg ∗ and set x = ∇ f ∗ (−A∗ u).
Then,  
A2
G(x, u) ≤ 1 + ((u) − inf ). (163)
αμ

(ii) Suppose that g is L-Lipschitz continuous. Let u ∈ domg ∗ be such that (u) −
inf  < A2 L 2 /μ and set x = ∇ f ∗ (−A∗ u). Then, we have

AL
G(x, u) ≤ 2 ((u) − inf )1/2 . (164)
μ1/2

Proof Let u ∈ domg ∗ and let x = ∇ f ∗ (−A∗ u). Since R(A) ⊂ dom∂g, we have
∂g(Ax) = ∅. Let v ∈ ∂g(Ax). Then we first prove that for every s ∈ [0, 1],
 
s s
(u) − inf  ≥ sG(x, u) + α(1 − s) − A u − v2 .
2
(165)
2 μ

Indeed, let s ∈ [0, 1]. Then

(u) − inf  ≥ (u) − (u + s(v − u))


= g ∗ (u) − g ∗ (u + s(v − u))
+ f ∗ (−A∗ u) − f ∗ − A∗ u − s A∗ (v − u) . (166)

Now, since f ∗ is (1/μ)-Lipschitz smooth, we have

f ∗ − A∗ u − s A∗ (v − u) − f ∗ (−A∗ u)
. / 1 2
≤ −s A∗ (v − u), ∇ f ∗ (−A∗ u) + s A2 v − u2 .

Moreover, since g ∗ is α-strongly convex (α ≥ 0),

s(1 − s)
g ∗ (u + s(v − u)) − g ∗ (u) ≤ s(g ∗ (v) − g ∗ (u)) − α u − v2 .
2
Therefore, it follows from (166) that
220 S. Salzo and S. Villa

(u) − inf  ≥ (u) − (u + s(v − u))


≥ s g ∗ (u) − g ∗ (v) − x, A∗ (u − v)
 
s s
+ α(1 − s) − A v − u2 .
2
(167)
2 μ

Now, we note that

G(x, u) = (x) + (u)


= f (x) + f ∗ (−A∗ u) − −A∗ u, x + g(Ax) + g ∗ (u) − Ax, u .

Moreover, since x = ∇ f ∗ (−A∗ u) and v ∈ ∂g(Ax), Young equality yields

f (x) + f ∗ (−A∗ u) − −A∗ u, x = 0 and g(Ax) + g ∗ (v) − Ax, v = 0.

Therefore,
G(x, u) = g ∗ (u) − g ∗ (v) − Ax, u − v. (168)

In conclusion, (165) follows from (167) and (168).


(i): If in (165) we chose s = α/(α + A2 /μ) we have α(1 − s) − sA2 /μ = 0
and hence α
G(x, u) ≤ (u) − inf .
α + A2 /μ

Then (163) follows.


(ii): It follows from (165) with α = 0 that, for every s ∈ [0, 1],

s2
sG(x, u) ≤ (u) − inf  + A2 u − v2 .

Since g is L-Lipschitz continuous, we have domg ∗ ⊂ B L (0). Moreover, u ∈ domg ∗


and v ∈ ∂g(Ax) ⇒ Ax ∈ ∂g ∗ (v) ⇒ v ∈ domg ∗ . Therefore, u − v2 ≤ 2(u2 +
v2 ) ≤ 2L 2 . Then,

1 s
G(x, u) ≤ inf ((u) − inf ) + A2 L 2 .
s∈[0,1] s μ

Since, if 0 < a < b, mins∈[0,1] (a/s + bs) = 2 ab, the statement follows. 

5.2 Dual Proximal Gradient Algorithms

It follows from Proposition 94 and Theorem 95 that if an algorithm, applied to the


dual problem (D), provides a minimizing sequence, that is, a sequence (u k )k∈N such
Proximal Gradient Methods for Machine Learning and Imaging 221

that (u k ) → inf , then, the sequence (xk )k∈N , defined as xk = ∇ f ∗ (−A∗ u k ) is
converging (possibly also in function values) to the solution of the primal problem.
In particular, we have

2
xk − x̂2 ≤ (u k ) − inf  → 0,
μ

and, depending on the assumptions in Theorem 95,

(xk ) − inf ≤ O (u k ) − inf  → 0


or
(
(xk ) − inf ≤ O (u k ) − inf  → 0.

Since the gradient of the term f ∗ (−A∗ ·) in (D) is Lipschitz continuous with constant
A2 /μ, the proximal gradient algorithm applied to (D) leads to the following

Algorithm 6 (Dual proximal gradient algorithm) Let u 0 ∈ Y and 0 < γ < A

2.

Then,
0 k = 0, 1,
for ...
xk = ∇ f ∗ (−A∗ u k ) (169)
u k+1 = proxγ g∗ (u k + γ Axk ).

Then, since Theorem 47(iv) ensures that (u k ) − (û) = o(1/(k + 1)), we have

xk − x̂ ≤ o(1/ k + 1)

and, again, in the settings of Theorem 95,



(xk ) − inf ≤ o(1/(k + 1)) or (xk ) − inf ≤ o(1/ k + 1).

Similarly, we can apply Algorithm 2 to the dual problem (D) and this yields the
following dual algorithm.
Algorithm 7 (Dual accelerated proximal gradient algorithm) Let√ 0 < γ ≤ μ/A2
N
and let (tk )k∈N ∈ R be defined as Proposition 64 with 1 − c ≥ 2 b. Let u 0 = v0 ∈
Y and define
⎢ k = 0, 1,
for ...
⎢ yk = ∇ f ∗ (−A∗ vk )

⎢ u k+1 = proxγ g∗ (vk + γ Ayk )

⎢ (170)

⎢ βk+1 = tk − 1
⎣ tk+1
vk+1 = u k+1 + βk+1 (u k+1 − u k ).

Then, defining xk = ∇ f ∗ (−A∗ u k ), Theorem 68 yield


222 S. Salzo and S. Villa

xk − x̂ ≤ O(1/k) (171)

and, under the assumptions of Theorem 95, that

(xk ) − inf ≤ O(1/k 2 ) or (xk ) − inf ≤ O(1/k).

Finally, suppose that g is separable, meaning that

5
m 
m
g : Y := Yi → ] − ∞, +∞ ] , g(y1 , . . . , ym ) = gi (yi ), (172)
i=1 i=1

and A : X → Y with Ax = (A1 x, . . . , Am x), where Ai : X → Yi are bounded lin-


ear
m operators. Then g ∗ is separable as well and A∗ : Y → X is such that A∗ u =

i=1 Ai u i . Hence, one can apply Algorithm 5 to the dual problem (D), yielding the
following stochastic dual algorithm.

Algorithm 8 (stochastic dual block coordinate gradient ascent method) Let u0 =


(u 01 , . . . , u 0m ) ∈ Y and let (γi )1≤i≤m ∈ Rm
++ be such that 0 < γi < 2μ/Ai  . Then,
2

⎢ kk = 0, 1,∗ . . . ∗ k
for
⎢ x = ∇ f (−A u )

⎢ for i = 0, 1, . . . , m
⎢3  (173)
⎢ proxγk gi∗ u ikk + γik Aik x k if i = i k
⎣ k+1
ui = k

u ik if i = i k ,

where (i k )k∈N are independent random variables taking values in {1, . . . , m} with
pi := P(i k = i) > 0 for all i ∈ {1, . . . , m}.

Remark 1 Note that in the setting of Algorithm 8, the primal problem can be written
as
 m
min gi (Ai x) + f (x). (174)
x∈X
i=1

Now, suppose that f ∗ is quadratic, so that ∇ f ∗ = H is a linear operator. Then, since


uk+1 and uk differ on the i k component only, denoting by Jik the canonical injection
of Yik into Y , we have

x k+1 = −H A∗ uk+1
= −H A∗ Jik (u ik+1
k
− u kk ) + x k .

Thus, Algorithm 8 can be written as follows. Set u0 = 0, x 0 = 0. Then


Proximal Gradient Methods for Machine Learning and Imaging 223

⎢ k = 0, 1 . . .
for
⎢ for i = 0, 1, . . . , m
⎢3 
⎢ proxγk gi∗ u ikk + γik Aik x k if i = i k
⎢ k+1 (175)
⎢ ui = k
⎢ u ik if i = i k ,

x k+1 = x k − H Ai∗k (u ik+1
k
− u ikk ).

This shows that Algorithm 8 can be used as an incremental stochastic method for
the minimization of (174), in which at each iteration one selects at random a single
component in the sum (say i k ) and uses only the knowledge related to that component
(Aik , Ai∗k , gi∗k , γik ) to make an update of the algorithm.

Example 96 (Linearly constrained problems) We consider the minimization prob-


lem
min f (x),
Ax=b

where f : X → ] − ∞, +∞] is closed and strongly convex with constant μ > 0.


Then the dual problem is
min f ∗ (−A∗ u) + u, b ,
u∈Y

which is an uncostrained and smooth optimization problem. Thus, since g ∗ = ·, b


and proxγ g∗ (u) = u − γ b, Algorithm 6 becomes

0 k = 0, 1,
for ...
xk = ∇ f ∗ (−A∗ u k ) (176)
u k+1 = u k + γ (Axk − b),

where γ < 2μ/A2 .

5.3 Bibliographical Notes

Proposition 94 is standard, while Theorem 95 was essentially given (in a less explicit
form) in [45]. Dual algorithms have been proposed several times in the literature. We
mention among others the works [28, 37] for deterministic algorithms, while [107]
for stochastic algorithms in the context of machine learning. The dual accelerated
proximal gradient Algorithm 7 was presented in [15] with the standard choice of
the parameters tk ’s given by the first of (78). The gradient descent on the dual of
the linearly constrained optimization problem described in Example 96 coincides,
up to a change of variables, with the linearized Bregman method studied in a series
of papers, see [86, 116] and references therein.
224 S. Salzo and S. Villa

Fig. 3 Solution of problem


(P1 ) for A : R2 → R. Here
A satisfies the NSP relative
to S = {2}

6 Applications

In this section, we present three main applications where convex optimization plays
a key role, providing fundamental tools and computational solutions.

6.1 Sparse Recovery

In many applications throughout science and engineering, one often needs to solve
ill-posed inverse problems, where the number of available measurements is smaller
than the dimension of the vector (signal) to be estimated. More formally, the setting
is the following: given an observation y ∈ Rn , and a linear measurement process
A : Rd → Rn the goal is to

find x∗ ∈ Rd such that Ax∗ = y, (177)

under the assumption that d >> n. In general, more than one solution of the above
problem exists, but reconstruction of x∗ is often possible since in many practical
situations of interest, the vectors of interest are sparse, namely they only have a few
nonzero entries or few degrees of freedom compared to their dimension. In compress
sensing it is shown that reconstruction of sparse vectors is not only feasible in theory,
but efficient algorithms also exist to perform the reconstruction in practice. One of
the most popular strategies is basis pursuit and consists in solving the following
convex optimization problem
min x1 . (P1 )
Ax=y

In realistic situations, the measurements y will be always affected by noise, i.e.:

Ax∗ − y ≤ δ

thus it makes more sense to consider the problem


Proximal Gradient Methods for Machine Learning and Imaging 225

min x1 . (P1,δ )


Ax−y≤δ

Then, the constrained problem (P1,δ ) is usually transformed into a penalized problem,
i.e (Fig. 3).
1
min Ax − y2 + λx1 , (178)
x∈Rd 2

which is advantageous from the algorithmic point of view. It is possible to show that
the problems (P1,δ ) and (178) are equivalent, for suitable choices of the regularization
parameter.
Proposition 97 Let A ∈ Rn×d and let y ∈ Rn . Then the following hold:
(i) If x is a minimizer of (178) with λ > 0, then there exists δ = δ(x) ≥ 0 such that
x is a minimizer of (P1,δ ).
(ii) If x is a minimizer of (P1,δ ) with δ ≥ 0, then there exists λ = λ(x) ≥ 0 such that
x is a minimizer of (178).
Proof Fermat’s rule for problem (178) yields

0 ∈ A∗ (Ax − y) + λ∂·1 (x),

that is,

∗ λ sign(xi ) if xi = 0
(∀ i ∈ {1, . . . , d}) (A (y − Ax))i ∈ λ∂|·|(xi ) =
[−λ, λ] if xi = 0.

This shows that 0 is a minimizer of (178) if and only if A∗ y∞ ≤ λ. Moreover, if
A∗ y∞ > λ and x is a minimizer of (178), then x = 0 and λ = A∗ (Ax − y)∞
(so λ is uniquely determined by any minimizer).
Now, problem (P1,δ ) can be equivalently written as

minx1 + ι Bδ (y) (Ax),


x∈X

where Bδ (y) is the ball of radius δ centered at y. Moreover, 0 is a minimizer of


(P1,δ ) if and only if y ≤ δ. We therefore suppose that y > δ, so that 0 is not
a minimizer of (P1,δ ). Then, the minimizers of (P1,δ ) are different from zero and
characterized by the following equation

0 ∈ ∂·1 (x) + A∗ ∂ι Bδ (y) (Ax).

which is equivalent to

∃u ∈ ∂ι Bδ (y) (Ax) such that − A∗ u ∈ ∂·1 (x). (179)

Recall that
226 S. Salzo and S. Villa

{0} if Ax − y < δ
∂ι Bδ (y) (Ax) = N Bδ (y) (Ax) =
R+ (Ax − y) if Ax − y = δ.

If Ax − y < δ, then u = 0 and hence 0 ∈ ∂·1 (x) which yields x = 0. Therefore
since 0 is not a minimizer of (P1,δ ), then necessarily Ax − y = δ and equation
(179) is equivalent to

Ax − y = δ and ∃α > 0 such that α A∗ (y − Ax) ∈ ∂·1 (x),

which yields

1 ⎨α −1 sign(xi ) if xi = 0
∃α > 0 s. t. ∀ i ∈ {1, . . . , d} (A∗ (y − Ax))i ∈ ∂|·|(xi ) =
α ⎩[−α −1 , α −1 ] if x = 0.
i

Taking into account the above equations one can see that, if x is a minimizer of
(178), then x is a minimizer of (P1,δ ) with δ = Ax − y and, vice versa, if x is a
minimizer of (P1,δ ), then x is a minimizer of (178) with λ = A∗ (Ax − y)∞ . 

Remark 98 Analogous equivalence results relate (P1,δ ) and (178) to another con-
strained problem:
min Ax − y2 , τ > 0.
x1 ≤τ

6.1.1 Proximal Gradient Algorithms for Lasso

In this section, we specialized several proximal gradients algorithms we studied in


the previous sections to the case of the lasso problem (178). As already anticipated in
Example 17, (the proximal gradient) Algorithm 1 become the so called Iterative Soft-
Thresholding Algorithm (ISTA), which is described below. Let γ ∈ ]0, 2/A∗ A[
and x0 = y0 ∈ X . Then,

 k = 0, 1, . . .
for
(180)
xk+1 = softγ λ (xk − γ A∗ (Axk − y)),

where softγ λ : R → R is the so called soft-thresholding operator, which is the prox-


imity operator of λ|·| (see (43)) and which is supposed to be applied component-wise.
We stress that according to Example 60 and Theorem 62(iii), algorithm (180) pro-
vides a sequence that converges linearly to a solution of problem (178).
Now, according to Algorithm 2, its accelerated version is as follows. Let x0 =
y0 ∈ X and γ ∈ ] 0, 1/A∗ A ]. Then,
Proximal Gradient Methods for Machine Learning and Imaging 227

⎢ k = 0, 1, . . .
for
⎢ t −1
⎢ u k = xk + k−1 (xk − xk−1 ) (181)
⎣ tk

xk+1 = softγ λ (u k − γ A (Au k − y)).

This algorithm is known as Fast Iterative Soft-Thresholding Algorithm (FISTA) √ and


when the parameters tk ’s are defined according to Proposition 64 with 1 − c ≥ 2 b,
Theorem 68 yields that it converges in values with rate O(1/k 2 ). Finally, we special-
ize the randomized proximal gradient Algorithm 5. We denote by a i and ak the i-th
column and k-th row of A respectively. Since ∇i [(1/2)Ax − b2 ] = a i , Ax − b,
condition (ii) in Proposition 85 is satisfied with L i = a i 2 . Then, Algorithm 5
(assuming that each block is made of one coordinate only) writes as
!
x k+1 = x k + softγi λ xik − γi a i (Ax k − b) − xik eik , (182)

where γi < 2/a i 2 . Then, Theorem 93 ensures that E[F(x k )] − inf F = o(1/k)
and that (xk )k∈N there exists a random vector x∗ taking values in the solution set of
problem (178) such that xk → x∗ almost surely.

6.2 Image Denoising

One of the most popular denoising models for imaging, is based on the total variation
regularizer, and is known under the name “ROF” (Rudin, Osher and Fatemi). We
consider a scalar-valued digital image x ∈ Rm×n of size m × n pixels. A standard
approach for defining the discrete total variation is to use a finite difference scheme
acting on the pixels. The discrete gradient operator D : Rm×n → Rm×n × Rm×n 
(R2 )m×n is defined by

(Dx)i, j = ((D1 x)i, j , (D2 x)i, j ) ∈ R2 ,

where

xi+1, j − xi, j if 1 ≤ i ≤ m − 1
(D1 x)i, j =
0 i =m

xi, j+1 − xi, j if 1 ≤ j ≤ n − 1
(D2 x)i, j =
0 j =n

The discrete ROF model is then defined by

min λDx2,1 + x − y22 , (183)


x∈Rm×n
228 S. Salzo and S. Villa

where y ∈ Rm×n is the given noisy image, and the discrete total variation is defined
by   1/2
Dx2,1 = (Dx)i, j 2 = (D1 x)i,2 j + (D2 x)i,2 j ,
i, j i, j

that is, the 1 -norm of the 2-norm of the pixelwise image gradients. We can interpret
the total variation regularization from a sparsity point of view, establishing analogies
with lasso approach in (178). Indeed, the 1 -norm induces sparsity in the gradients of
the image. More precisely, this regularizer can be interpreted as a group lasso one (see
Example 42), where each group include the two directional derivatives at each pixel.
Hence, this norm favors vectors with sparse gradients, namely piecewise constant
images. This favorable property, a.k.a. staircaising effect has also some drawbacks
in the applications, and other regularizations have been proposed. In the next section
we describe an algorithm to solve (183).

6.2.1 Algorithms for Total Variation Denoising

Solving the discrete ROF (Rudin–Osher–Fatemi) model

1
min λDx2,1 + x − y22 , (184)
x∈R m×n 2

is equivalent to compute the proximity operator of the total variation, which is not
available in closed form. Here we show how to solve the above problem by a dual algo-
rithm. Indeed the problem is of the form (P) considered by the Fenchel–Rockafellar
duality theory with f (x) = (1/2)x − y2 ,

g(v) = λv2,1 = λv i, j 2 , v = (v i, j )1≤i≤m , v i, j ∈ R2 ,
1≤ j≤n
i, j

and A = D. We first compute D since it will be useful later to set the steplength.
For every x ∈ Rm×n
 
Dx2 = (xi+1, j − xi, j )2 + (xi, j+1 − xi, j )2
1≤i<m 1≤i≤m
1≤ j≤n 1≤ j<n
 
≤2 ((xi+1, j )2 + (xi, j )2 ) + 2 ((xi, j+1 )2 + (xi, j )2 )
1≤i<m 1≤i≤m
1≤ j≤n 1≤ j<n

≤ 8x2 ,

therefore D2 ≤ 8. We next prove that the dual problem is


Proximal Gradient Methods for Machine Learning and Imaging 229

1
min y − D ∗ u2 − y2 + ι Bλ (0)m×n (u), Bλ (0) ⊂ R2 , (185)
u∈(R2 )m×n 2

where Bλ (0) is the ball of R2 of radius λ centered at zero. Indeed, it is easy to


check that D ∗ = −div : (R2 )m×n  (Rm×n )2 → Rm×n where, for every (u1 , u2 ) ∈
(Rm×n )2
⎧ ⎧

⎨ui, j − ui−1, j if 1 < i < m, ⎪
1 1
⎨ui, j − ui, j−1 if 1 < j < n,
2 2

(div(u1 , u2 ))i, j = u11, j if i = 1, + ui,1


1
if j = 1,

⎩ 1 ⎪
⎩ 1
−um−1, j if i = m, −ui,n−1 if j = n,

and
1
f ∗ (z) =
(z + y22 − y22 ).
2
 
Moreover, g(v) = i, j λv i, j 2 = i, j σ Bλ (0) (v i, j ), which shows that g is separa-
ble. Then it follows from Fact 9(iii) that g ∗ is separable as well, so

g ∗ (v) = ι Bλ (0) (v i, j ) = ι Bλ (0)m×n (v).
i, j

Finally, since ∇ f ∗ (z) = z + y, the way one goes from the dual variable u ∈ (R2 )m×n
to the primal variable x ∈ Rm×n is through the formula

x = ∇ f ∗ (−D ∗ u) = y − D ∗ u.

The dual proximal gradient algorithm (176) writes down as follows

for k = 0, 1, . . .
0 (k)
x = y − D ∗ u(k) (186)
u(k+1) = PBλ (0)n×m (u(k) + γ Dx (k) ),

where γ < 2/D2 = 1/4. Note also that the projection onto Bλ (0)m×n is separable
too and can be computed as


⎨ui, j if ui, j 2 ≤ λ
PBλ (0)m×n (u) = PBλ (0) (ui, j ) 1≤i≤m , PBλ (0) (ui, j ) = ui, j
1≤ j≤n ⎪
⎩ if ui, j 2 > λ.
ui, j 2

in Sect. 5 that the sequence (xk )k∈N converges


Then it follows from the theory given√
to the minimizer of (184) as an O(1/ k).
We next specialize Algorithm 2 to problem (185). Let u0 = v 0 ∈ X , z 0 = y −
D ∗ u(0) , and γ ∈ ]0, 1/8[. Define
230 S. Salzo and S. Villa

for k = 0, 1, . . .
⎢ (k)
⎢ x = y − D ∗ u(k)
⎢ (k+1)
⎢u = PBλ (0)n×m (v (k) + γ Dz (k) ), (187)

⎣ v (k+1) = u(k+1) + β (u(k+1) − u(k) )
k+1
z (k+1) = x (k+1) + βk+1 (x (k+1) − x (k) )

With the choice of parameters as in Theorem 68, from the results in Sect. 5, we derive
that the sequence (xk )k∈N converges to the minimizer of (184) as an O(1/k).
Finally, we specialize the randomized proximal gradient √Algorithm 5. Note that,
condition (ii) in Proposition 85 is satisfied with L i, j = 17. Then, Algorithm 5
(assuming that each block is made of one R2 block only and (i k , jk ) is uniformly
distributed on {1, . . . , n} × {1, . . . , m}) writes as

for k = 0, 1, . . .
0 (k)
x = x k−1 + D ∗ (uk−1 − uk ) (188)
!
u(k+1) = u(k) + J(ik , jk ) PBλ (0) (ui(k)
k , jk
+ γik , jk (Dx (k) )ik , jk ) − ui(k)
k , jk
,

where γi, j < 2/ 17 and J(ik , jk ) : R2 → (R2 )m×n is the canonical injection. Then,
denoting by x∗ the unique solution√of (184), Theorem 93 and the results in Sect. 5
ensure that E[x k − x∗ 2 ] ≤ o(1/ k).

6.3 Machine Learning

In statistical machine learning we are given two random variables ξ and η, with
values in X and Y ⊂ R respectively, with joint distribution μ. We let : X × Y ×
R → R be a convex loss function and the goal is to find a function h : X → Y
in a given hypothesis function space which minimizes the averaged risk R(h) =
E[ (ξ, η, h(η))] without knowing the distribution μ but based on some sequence
(ξk , ηk )k∈N of independent copies of (ξ, η).
In this problem, concerning the hypothesis function space one option is that of
considering reproducing kernel Hilbert spaces (RKHS). They indeed are defined
through kernel functions and are flexible enough to model even infinite-dimensional
function spaces. They are defined as follows. We let  : X → H be a general map
from the input space X to a separable Hilbert space H , endowed with a scalar product
·, · and norm ·. Then the corresponding RKHS is defined as

H = {h ∈ RX | ∃ w ∈ H s.t. h = w, (·)} h = inf{w | h = w, (·)}.


(189)
In this context, the map  is called the feature map and the corresponding kernel
function is defined as
. /
K : X × X → R, K (x, x  ) = (x), (x  ) . (190)
Proximal Gradient Methods for Machine Learning and Imaging 231

In this way, the above statistical learning problem becomes


4
min R(w) = E[ (ξ, η, w, (ξ ))] = (x, y, w, (x))dμ(x, y), (191)
w∈H X×Y

which is supposed to be solved via some sequence (ξk , ηk )k∈N of independent copies
of (ξ, η).
In order to approach problem (191) we consider two strategies. The first one con-
sists in considering the problem as an instance of a stochastic optimization problem
as described in Example 82. The second one is to consider a regularized empirical
version of (191) based on the available sample. In the following, we describe these
two approaches.

6.3.1 Statistical Learning as Stochastic Optimization

We make the following assumptions.


SL1 For every (x, y) ∈ X × Y, (x, y, ·) : R → R is positive, convex and Lipschitz
continuous with constant α > 0 and E[ (ξ, η, 0)] < +∞.
SL2 The feature map  is measurable and E[(ξ )2 ] < +∞.
We show that problem (191) is an instance of Example 82. Indeed, we let Z = X × Y
and, for every w ∈ H and z = (x, y) ∈ Z, ϕ(w, z) = (z, w, (x)). Then,

(∀ w1 , w2 ∈ H )(∀ z = (x, y) ∈ X × Y)
|ϕ(w1 , z) − ϕ(w2 , z)| ≤ α|w1 − w2 , (x)| ≤ α(x)w1 − w2 .

Hence, conditions (SO1 ) − (SO2 ) in Example 82 hold with L(z) = α(x). More-
over,
(∀ z ∈ Z)(∀ w ∈ H ) ∂ϕ(w, z) = ∂ (z, w, (x))(x), (192)

where ∂ϕ(w, z) = ∂ϕ(·, z)(w). Now, let, for every (z, t) ∈ Z × R, ˜ (z, t) be a sub-
gradient of (z, ·) at t and define

˜ : H × Z → H : (w, z) → ˜ (z, w, (x))(x) ∈ ∂ϕ(w, z).


∇ϕ

Therefore, assumptions (SO3 ) − (SO4 ) in Example 82 are satisfied and


4
˜
E[∇ϕ(w, ζ )] = ˜ (x, y, w, (x))(x)dμ(x, y) ∈ ∂ R(w).
Z

Then algorithm (175) becomes

wk+1 = wk − γk ˜ (ξk , ηk , wk , (ξk ))(ξk ). (193)


232 S. Salzo and S. Villa

If we define h k (x) = wk , (x) and the kernel K (x, x  ) = (x), (x  ), then it
follows from (193) that

h k+1 (x) = h k (x) − γk ˜ (ξk , ηk , h k (ξk ))K (x, ξk ). (194)

Moreover, set


k
−1

k 
k
−1

k
w̄k = γi γi wi , h k (x) = w̄k , (x) = γi γi gi (x).
i=0 i=0 i=0 i=0

Then, the risk of h̄ k is R(w̄k ) and according to Theorem 77 we have that√ R(w̄k ) →
inf H R, and if S∗ := argmin H R = ∅, D ≥ dist(x0 , S∗ ), and γk = γ̄ / k + 1, we
have
D2 1 log(k + 1)
(∀ k ∈ N) E[R(w̄k )] − min R ≤ √ + γ̄ B 2 √ ,
H 2γ̄ k + 1 k+1

where√ B 2 = α 2 E[(ξ )2 ]. Moreover, for all k ∈ N, if (γi )0≤i≤k ≡


D/(B k + 1), then
BD
E[R(w̄k )] − min R ≤ √ . (195)
H k+1

Note that algorithm (194) is fully practicable, since it depends only on the kernel
function K and on the data (ξk , ηk ). In the following, we provide a list of 1-Lipschitz
continuous losses:
• the hinge loss: Y = {−1, 1} and (x, y, t) = max{0, 1 − yt};
• the logistic loss for classification: Y = {−1, 1} and (x, y, t) = log(1 + e−yt );
• L 1 -loss: Y = R and (x, y, t) = |y − t|;
4e y−t
• logistic loss for regression: Y = R and (x, y, t) = − log .
(1 + e y−t )2
• ε-insensitive loss: Y = R and (x, y, t) = max{0, |y − t| − ε}.

6.3.2 Regularized Empirical Risk Minimization

Regularized empirical risk estimation solves the following optimization problem

λ
n
1
min (yi , w, (xi )) + w2 =: (w), (196)
w∈H n i=1 2

where (xi , yi )1≤i≤n are realizations of the random variables (ξi , ηi )1≤i≤n and we
assume for simplicity that the loss function is : Y × R → R+ (convex in the second
variable), and λ > 0 is a regularization parameter. Essentially the goal here is to find
a function h = w, (·) that best fits the data (xi , yi )1≤i≤n according the to given
Proximal Gradient Methods for Machine Learning and Imaging 233

loss . Depending on the choice of the loss function the techniques take different
names. If is the square loss, that is, Y = R and (s, t) = (s − t)2 , one talks about
ridge regression. If is the Vapnik ε-insensitive loss

(s, t) = max{0, |s − t| − ε},

then we have support vector regression. Finally, if is the hinge loss, that is
Y = {−1, 1} and (s, t) = (1 − st)+ , then we get support vector machines. Another
important loss for classification is the logistic loss, which is defined as (s, t) =
log(1 + e−st ).
We are going to compute the dual problem of (196) in the sense of Fenchel–
Rockafellar (see Sect. 2.6). Define the operator
⎡ ⎤
w, (x1 )
(X) : H → Rn , (X)w = ⎣ ··· ⎦ ∈ Rn
w, (xn )

and the functions

λ
n
1
g : Rn → R, g(z) = (yi , −z i ), and f : H → R, f (w) = w2 .
n i=1 2
(197)
Then problem (196) can be written as

min f (w) + g(−(X)w), (198)


w∈H

which is in the form (P) considered by the Fenchel–Rochafellar duality. We recall


that the dual problem is
minn f ∗ ((X)∗ α) + g ∗ (α) (199)
α∈R

and the corresponding KKT optimality conditions are (see Sect. 2.6)

w̄ ∈ ∂ f ∗ ((X)∗ ᾱ) and ᾱ ∈ ∂g(−(X)w̄). (200)

So, since f ∗ = (1/2)·2 and


n
(∀ α ∈ Rn ) (X)∗ α = αi (xi ),
i=1

the first term in the dual objective function (199) is


234 S. Salzo and S. Villa

1
f ∗ ((X)∗ α) = (X)∗ α2
2
1
2
n

=  αi (xi )
2 i=1
1  . /
n
= αi α j (xi ), (x j )
2 i, j=1
1
= α Kα,
2

where K ∈ Rn×n is the Gram matrix, defined as K = (K (xi , x j ))i,n j=1 and K is the
kernel function associated to the feature map  as defined in (190). Now we compute
g ∗ . According to (197), the function g is separable, that is, it can be written
the form of
n
as g(z) = i=1 gi (z i ), where gi = (λ/n) (yi , −·). Therefore


n

g (α) = gi∗ (αi ).
i=1

Moreover, recalling the properties of the Fenchel conjugation, we have


 
λ n
gi∗ (s) = ∗
yi , −s .
n λ

Therefore we are lead to the following theorem


Theorem 99 The dual problem of (196) is
 
λ
n
1 ∗ n
minn α Kα + yi , −αi =: (α). (201)
α∈R 2 n i=1 λ

where K = (K (xi , x j ))i,n j=1 and K is the kernel function associated to the feature
map (see (190)), ∗ (yi , ·) is the Fenchel conjugate of (yi , ·). Moreover, (i) the
primal problem (196) has a unique solution, the dual problem has solutions and
min = − min  (strong duality holds); and (ii) the solutions (w̄, ᾱ) of the primal
and dual problems are characterized by the following KKT conditions


⎪ 
n

⎨w̄ = (X) ∗
ᾱ = ᾱi (xi ),
i=1 (202)

⎪ ᾱ n

⎩∀ i ∈ {1, . . . , n} − i ∈ ∂ L(yi , (x i ), w̄),
λ

where ∂ (yi , ·) is the subdifferential of (yi , ·). Finally for the estimated function it
holds
Proximal Gradient Methods for Machine Learning and Imaging 235


n
w̄, (·) = ᾱi K (xi , ·).
i=1

Remark 100 The first equation in (202) says that the primal solution can be written
as a finite linear combination of feature map evaluations on the training points. This is
known as the representer theorem in the related literature. Moreover, the coefficients
of this representation can be obtained through the solution of the dual problem (201).
We now specialize Theorem 99 to distance-based and margin-based losses.
Corollary 101 Suppose that is a convex distance-based loss, that is, of the form
(s, t) = χ (s − t) with Y = R, for some convex function χ : R → R+ . Then the
dual problem (201) becomes

λ  ∗  αi n 
n
1
minn α Kα − y α + χ . (203)
α∈R 2 n i=1 λ

Suppose that is a convex margin-based loss, that is, of the form (s, t) = χ (st) with
Y = {−1, 1}, for some convex function χ : R → R+ . Then the dual problem (201)
becomes
λ  ∗  yi αi n 
n
1
minn α Kα + χ − . (204)
α∈R 2 n i=1 λ

The following example shows that all the losses commonly used in machine
learning admit explicit Fenchel conjugates.
Example 102 (i) The least squares loss is (s, t) = χ (s − t) with χ = (1/2)|·|2 .
In that case (203) reduces to

1 n
minn α Kα − y α + α2 .
α∈R 2 2λ

which is strongly convex with modulus n/λ and has the explicit solution ᾱ =
(K + (n/λ)Id)−1 y.
(ii) The Vapnik-ε-insensitive loss for regression is (s, t) = χ (s − t) with χ = |·|ε .
Then, χ ∗ = ε|·| + ι[−1,1] and the dual problem (203) turns out to be

1
min α Kα − y α + εα1 + ιλ/n[−1,1]n (α)
α∈Rn
2

(iii) The Huber loss is the distance-based loss defined by



r 2 /2 if |r | ≤ ρ
χ (r ) =
ρ|r | − ρ 2 /2 otherwise.

Then χ ∗ = ι[−ρ,ρ] + (1/2)|·|2 and (203) becomes


236 S. Salzo and S. Villa

1 n
minn α Kα − y α + α22 + ιρλ/n[−1,1]n (α)
α∈R 2 2λ

(iv) The logistic loss for classification is the margin-based loss with χ (r ) = log(1 +
e−r ). Thus


⎨(1 + s) log(1 + s) − s log(−s) if s ∈ ]−1, 0[

χ (s) = 0 if s = −1 or s = 0


+∞ otherwise.

It is easy to see that χ has Lipschitz continuous derivative with constant 1/4 and
hence χ ∗ is strongly convex with modulus 4. Thus, referring to (203) and (199),
we see that in this case domg ∗ = i=1 n
(yi [0, λ/n]) and g ∗ is differentiable

on int(domg ) with locally Lipschitz continuous gradient. Moreover, since
lims→1 |(χ ∗ ) (s)| = lims→0 |(χ ∗ ) (s)| = +∞, we have that ∇g ∗ (α) = +∞
on the boundary of domg ∗ . Finally, it follows from (202) that 0 < yi ᾱi < λ/n,
for i = 1, . . . , n.
(v) The hinge loss is the margin-based loss with χ (r ) = (1 − r )+ . We have χ ∗ (s) =
s + ι[−1,0] (s). So the dual problem (204) is

1
min α Kα − y α + ιλ/n[0,1]n (y % α).
α∈Rn2

where y % α = (yi αi )1≤i≤n is the Hadamard product of y and α.


The connection between the primal and dual problem is clarified by the following
result, which follows from Proposition 94, Theorem 95 and (197).
Corollary 103 Let ᾱ ∈ Rn be a solution of the dual problem (201) and let w̄ =
(X) ᾱ be the solution of the primal problem (196). Let α ∈ Rn and set w =
(X) α. Then the following hold.
(i) (α) − min  ≥ 21 w − w̄22 .
(ii) If (yi , ·) Lipschitz smooth with constant a1 , then
 
λa1 (X)2
(w) − inf ≤ 1+ ((α) − inf ).
n

(iii) if (yi , ·) is Lipschitz continuous with constant a2 , then

a2 λ
(w) − inf ≤ 2(X) ((α) − inf )1/2 .
n
Remark 104 The above proposition ensures that if an algorithm generates a
sequence (α k )k∈N that is minimizing for the dual problem (201), i.e., (α k ) →
min , then the sequence defined by w k = (X)∗ α k , k ∈ N, converges to the solu-
tion of the primal problem. More precisely, for the function wk , (·) we have
Proximal Gradient Methods for Machine Learning and Imaging 237

|wk , (x) − w̄, (x)| ≤ wk − w̄(x) → 0 (205)

nwk ,k(·) can be expressed


and the function n ink terms of the kernel only, indeed
wk , (x) = i=1 αi (xi ), (x) = i=1 αi K (xi , x).

Proximal gradient algorithms for SVM. For all the cases treated in Example 102,
the dual problem (201) has the following form


n
minn q(α) + h i (αi ) = (α), (206)
α∈R
i=1

where q : Rn → R is convex and smooth with Lipschitz continuous gradient (locally


Lipschitz for the logistic loss) and includes the quadratic term (1/2)α Kα, and
h i : R → R ∪ {+∞} is proper, lower semicontinuous, convex, and admitting a
closed-form proximity operator. So, the form (201) is amenable to proximal gra-
dient type algorithms studied in the previous sections. We note that due to Corollary
103 if (α k )k∈N converges linearly (resp. sublinearly) to inf , then (wk )k∈N as well
as (w k ) − inf converges linearly (resp. sublinearly) too. In particular recalling
Example 59, we have that the dual proximal gradient Algorithm 6 converges linearly
on the dual problem (206) for all the losses presented in Example 102 (except for
the logistic one) and yields a linearly convergent sequence for the primal problem
too. Similarly to the lasso problem, additional algorithmic solutions are obtained by
applying on the dual problem the accelerated proximal gradient Algorithm 2 and the
randomized block-coordinate proximal gradient Algorithm 5. In the case of the logis-
tic loss considered in Example 102 (iv), proximal gradient algorithm with linesearch
should be considered. See [102].

6.3.3 Structured Sparsity in Machine Learning

Sparse estimation methods are very popular in machine learning. The most natural
one is the minimization of the empirical risk regularized with the 1 norm, in the
very same way that we described in Sect. 6.1. In several applications of interest,
it is beneficial to impose more structure in the regularization process and several
extensions of the 1 regularization, such as group lasso or multitask learning, are
common. It turns out that proximal gradient algorithms play a key role in the solution
of the related variational problems, which we write using the notation introduced in
the previous subsection

λ
n
min (yi , w, (xi )) + (w), (207)
w∈Rd n i=1

where the loss function is supposed to be differentiable with a Lipschitz continuous


gradient (e.g., the square loss) and  : Rd → R is a structured sparsity inducing
238 S. Salzo and S. Villa

penalty. In this section, we briefly summarize some examples and the related proximal
gradient algorithms.
When the input variables are supposed to be grouped together according to prede-
fined groups forming a partition of the variables, the group lasso penalty discussed in
Example 42 promotes solutions w∗ depending only on few groups. The algorithms
and the considerations made for the lasso problem in Sect. 6.1.1 can be generalized
to the group LASSO, replacing the soft-thresholding operator with the proximal
operator of the group lasso computed in Example 42. If the support of the solution
is a union of potentially overlapping groups defined a priori then a different penalty
should be used.
Let J = {J1 , . . . , Jm } be a family of subsets of {1, . . . , d} whose union is
{1, . . . , d} itself. Let us call v J = (v j ) j∈J ∈ R J . Denote by · J the Euclidean
norm on R J and by J J : R J → Rd the canonical embedding. We define a penalty
on Rd by considering

m 
m 
(w) = inf v J  J J (v J ) = w . (208)
=1 =1

When the groups do not overlap, the above penalty coincides with the group lasso
norm. If some groups overlap, then this penalty induces the selection of w∗ sparsely
supported on a union of groups. The regularized empirical risk in this case can be
written in terms of the vectors v J :
: ;
λ  
n m m
min yi , v j , J j∗ (xi ) + v J ,
(v1 ,...,vl )∈R J1 ×...×R Jm n i=1 =1 =1

and the problem in these new variables coincide with a regularized group lasso
without overlap.
Learning multiple tasks simultaneously has been shown to improve performance
relative to learning each task independently, when the tasks are related in the sense
that they all share a small set of features. For example, given T tasks modeled as
x → wt , (x), for t = 1, . . . , T , multi-task learning amounts to the minimization
of
 d  1/2
λ  
T n T
min (yi , wt , (xi )) + w 2j,t ,
(w1 ,...,wT )∈Rd×T n
t=1 t i=1 j=1 t=1

where n t is the number of samples for each task. Note that the regularization is an
instance of a group lasso norm of the vector (w1 , . . . , wT ) ∈ Rd×T , and the multitask
problem can therefore be solved as described above.
Proximal Gradient Methods for Machine Learning and Imaging 239

6.4 Bibliographical Notes

Section 6.1 The connections between the lasso minimization problem and the prob-
lem of determining the sparsest solutions of linear systems is the topic of interest
for the compressive sensing community. We refer to [49] for a mathematical intro-
duction on this subject. The solution of the lasso problem motivated a huge amount
of research at the interface between convex optimization, signal processing, inverse
problems, and machine learning. The Iterative Soft thresholding algorithm has been
proposed in [41] and around the same time the application of the proximal gradient
algorithm to the lasso problem, but also to other signal processing problems was
discussed in [39]. Strong convergence of the sequence of iterates generated by the
proximal gradient algorithm for the objective function in (178) was proved in [41] and
generalized in [36]. The FISTA algorithm was proposed by Beck and Teboulle in the
seminal paper [12]. Block coordinate versions of the ISTA algorithm are considered
e.g., in [78, 103].
Section 6.2 The ROF model has been introduced by Rudin, Osher and Fatemi
in [101], and studied theoretically in [31]. The approach based on duality has been
considered in [28, 30, 33]. The application of FISTA and a monotone modification
to the dual problem has been considered in [13].
Section 6.3 Stochastic optimization approaches for machine learning are very
popular, and in particular stochastic gradient descent [21], see the related discussion
in Sect. 4.4. One of the most well known stochastic methods to solve SVM in the
primal variables is PEGASOS [106].
Proximal methods have been immediately the methods of choice to deal with
structured sparsity in machine learning. The literature on the topic is vast, see the
surveys [8, 77] and references therein.
Support vector machines are due to Vapnik and have been introduced in [20, 40].
There, the case of the hinge loss for classification with a general kernel function
(so to cover nonlinear classifiers) was treated. The dual problem was derived via
the Lagrange theory. The analysis for general losses as well as the connection with
reproducing kernel Hilbert spaces and the formulation via general feature maps is
given, e.g., in [110].

Acknowledgements The work of S. Villa has been supported by the ITN-ETN project TraDE-
OPT funded by the European Union’s Horizon 2020 research and innovation programme under the
Marie Skłodowska–Curie grant agreement No 861137 and by the project “Processi evolutivi con
memoria descrivibili tramite equazioni integro-differenziali” funded by Gruppo Nazionale per l’
Analisi Matematica, la Probabilità e le loro Applicazioni (GNAMPA) of the Istituto Nazionale di
Alta Matematica (INdAM).
240 S. Salzo and S. Villa

References

1. Alvarez, F., Attouch, H.: An inertial proximal method for maximal monotone operators via
discretization of a nonlinear oscillator with damping. Set-Valued Anal. 9, 3–11 (2001)
2. Atchadé, Y.F., Fort, G., Moulines, E.: On perturbed proximal gradient algorithms. J. Mach.
Learn. Res. 18, 1–33 (2017)
3. Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions
involving analytic features. Math. Progr. 116, 5–16 (2009)
4. Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and
projection methods for nonconvex problems. An approach based on the Kurdyka-Ł ojasiewicz
inequality, Math. Oper. Res. 35, 438–457 (2010)
5. Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and
tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-
Seidel methods. Math. Progr. 137, 91–129 (2013)
6. H. Attouch, Z. Chbani, J. Peypouquet, P. Redont, Fast convergence of inertial dynamics and
algorithms with asymptotic vanishing viscosity. Math. Prog. Ser. B 168, 123–175 (2018)
7. Aujol, J.-F., Dossal, C., Rondepierre, A.: Optimal convergence rates for Nesterov Accelera-
tion. SIAM J. Optim. 29, 3131–3153 (2019)
8. Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with Sparsity-Inducing Penal-
ties. Optim. Mach. Learn. 5, 19–53 (2011)
9. Baillon, J.B., Bruck, R.E., Reich, S.: On the asymptotic behavior of nonexpansive mappings
and semigroups in Banach spaces. Houston J. Math. 4, 1–9 (1978)
10. Barbu, V., Precupanu, T.: Convexity and Optimization in Banach Spaces. Springer, Dordrecht
(2012)
11. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert
Spaces, 2nd edn. Springer, New York (2017)
12. Beck, A., Teboulle, M.: A fast iterative Shrinkage-Thresholding algorithm for linear inverse
problems. SIAM J. Imaging Sci. 2, 183–202 (2009)
13. Beck, A., Teboulle, M.: Fast gradient-based algorithms for constrained total variation image
denoising and deblurring problems. IEEE Trans. Image Process. 18, 2419–2434 (2009)
14. Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic
functions with applications to subgradient dynamical systems. SIAM J. Optim. 17, 1205–
1223 (2006)
15. Beck, A., Teboulle, M.: A fast dual proximal gradient algorithm for convex minimization and
applications. Oper. Res. Lett. 42, 1–6 (2014)
16. Bolte, J., Daniilidis, A., Lewis, A., Shiota, M.: Clarke subgradients of stratifiable functions.
SIAM J. Optim. 18, 556–572 (2007)
17. Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of
first-order descent methods for convex functions. Math. Program. 165, 471–507 (2017)
18. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for noncon-
vex and nonsmooth problems. Math. Prog. 146, 459–494 (2013)
19. Borwein, J.M., Vanderwerff, J.D.: Convex Functions: Constructions, Characterizations and
Counterexamples. Cambridge University Press, Cambridge (2010)
20. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers.
In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory—COLT
’92, p. 144 (1992)
21. Bottou, L., Bousquet, O.: The tradeoffs of large-scale learning. In: Optimization for Machine
Learning, pp. 351–368, The MIT Press, Cambridge (2012)
22. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning.
SIAM Rev. 60, 223–311 (2018)
23. Bourbaki, N.: General Topology, 2nd edn. Springer, New York (1989)
24. Bredies, K.: A forward-backward splitting algorithm for the minimization of non-smooth
convex functionals in Banach space. Inv. Prob. 25, Art. 015005 (2009)
Proximal Gradient Methods for Machine Learning and Imaging 241

25. Browder, F.E., Petryshyn, W.V.: The solution by iteration of nonlinear functional equations
in Banach spaces. Bull. Am. Math. Soc. 72, 571–575 (1966)
26. Browder, F.E., Petryshyn, W.V.: Construction of fixed points of nonlinear mappings in Hilbert
space. J. Math. Anal. Appl. 20, 197–228 (1967)
27. Burke, J.V., Ferris, M.C.: Weak sharp minima in mathematical programming. SIAM J. Control
Optim. 31, 1340–1359 (1993)
28. Chambolle, A.: An algorithm for total variation minimization and applications. J. Math.
Imaging Vis. 20, 89–97 (2004)
29. Chambolle, A., Dossal, C.: On the convergence of the iterates of the “Fast Iterative Shrink-
age/Thresholding Algorithm". J. Optim. Theory Appl. 166, 968–982 (2015)
30. Chambolle, A., Lions, P.-L.: Image restoration by constrained total variation minimization
and variants. In: Investigative and Trial Image Processing, San Diego, CA (SPIE), vol. 2567,
pp. 50–59 (1995)
31. Chambolle, A., Lions, P.-L.: Image recovery via total variation minimization and related
problems. Numer. Math. 76, 167–188 (1997)
32. Chambolle, A., Pock, T.: An introduction to continuous optimization for imaging. Acta
Numerica 25, 161–319 (2016)
33. Chan, T.F., Golub, G.H., Mulet, P.: A nonlinear primal-dual method for total variation-based
image restoration. SIAM J. Sci. Comput. 20, 1964–1977 (1999)
34. Combettes, P.L., Pesquet, J.-C.: Proximal splitting methods in signal processing, In: Fixed-
Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer,
New York, NY (2011)
35. Combettes, P.L., Pesquet, J.-C.: Stochastic quasi-Fejér block-coordinate fixed point iterations
with random sweeping. SIAM J. Optim. 25, 1121–1248 (2015)
36. Combettes, P.L., Pesquet, J.-C.: Proximal thresholding algorithms for minimization over
orthonormal bases. SIAM J. Optim. 18, 1351–1376 (2007)
37. Combettes, P.L., V u, B.C.: Dualization of signal recovery problems. Set-Valued Anal. 18,
373–404 (2010)
38. Combettes, P.L., Yamada, I.: Compositions and convex combinations of averaged nonexpan-
sive operators. J. Math. Anal. Appl. 425, 55–70 (2015)
39. Combettes, P.L., Wajs, V.: Signal recovery by proximal forward-backward splitting. Multi-
scale Model. Simul. 4, 1168–1200 (2005)
40. Cortes, C., Vapnik, V.: Support vector networks. Mach. Learn. 20, 273–297 (1995)
41. Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse
problems with a sparsity constraint. Comm. Pure Appl. Math. 57, 1413–1457 (2004)
42. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with
support for non-strongly convex composite objectives. In: Advances in Neural Information
Processing Systems, vol. 27 (2014)
43. Dotson, W.G.: On the Mann iterative process. Trans. Am. Math. Soc. 149, 65–73 (1970)
44. Duchi, J., Singer, Y.: Efficient online and batch learning using forward backward splitting. J.
Mach. Learn. Res. 10, 2899–2934 (2009)
45. Dünner, C., Forte, S., Takac, M., Jaggi, M.: Primal-dual rates and certificates. In: Proceedings
of The 33rd International Conference on Machine Learning, PMLR, vol. 48, pp. 783–792
(2016)
46. Ekeland, I., Témam, R.: Roger. Society for Industrial and Applied Mathematics (SIAM),
Philadelphia, PA, Convex analysis and variational problems (1999)
47. Ermoliev, Yu.M.: On the method of generalized stochastic gradients and quasi-Fejér
sequences. Cybernetics 5, 208–220 (1969)
48. Fenchel, W.: Convex Cones, Sets, and Functions. Princeton University (1953)
49. Foucart, S., Rauhut, H.: A mathematical introduction to compressive sensing. Birkäuser.
Springer, New York (2010)
50. Frankel, P., Garrigos, G., Peypouquet, J.: Splitting methods with variable metric for Kurdyka-
Łojasiewicz functions and general convergence rates. J. Optim. Theory Appl. 165, 874–900
(2015)
242 S. Salzo and S. Villa

51. Gabay, D.: Applications of the method of multipliers to variational inequalities. In: Fortin,
M., Glowinski, R. (eds.) Augmented Lagrangian Methods: Applications to the Numerical
Solution of Boundary-Value Problems, North-Holland, Amsterdam, vol. 15, pp. 299–331
(1983)
52. Garrigos, G., Rosasco, L., Villa, S.: Convergence of the Forward-Backward Algorithm:
Beyond the Worst Case with the Help of Geometry (2017). https://arxiv.org/abs/1703.09477
53. Goldstein, A.A.: Convex programming in Hilbert space. Bull. Am. Math. Soc. 70, 709–710
(1964)
54. Groetsch, C.W.: A note on segmenting Mann iterates. J. Math. Anal. Appl. 40, 369–372 (1972)
55. Guler, O.: New proximal point algorithms for convex minimization. SIAM J. Optim. 2, 649–
664 (1992)
56. Blatt, D., Hero, A., Gauchman, H.: A convergent incremental gradient method with a constant
step size. SIAM J. Optim. 18, 29–51 (2007)
57. Hiriart-Urruty, J.-B., Lemaréchal, C.: Fundamentals of Convex Analysis. Springer, Berlin
(2001)
58. Jensen, J.L.W.V.: Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta
Math. 30, 175–193 (1906)
59. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance
reduction. Adv. Neural Inf. Process. Syst. 26, 315–323 (2013)
60. Karimi, H., Nutini, J., Schmidt, M.: Linear Convergence of gradient and proximal-gradient
methods under the Polyak-Łojasiewicz condition. In: Frasconi, P., Landwehr, N., Manco, G.,
Vreeken, J. (eds.), Machine Learning and Knowledge Discovery in Databases. ECML PKDD
2016. Lecture Notes in Computer Science, vol. 9851. Springer, Cham
61. Kiefer, J., Wolfowitz, J.: Stochastic estimation of the maximum of a regression function. Ann.
Math. Stat. 23, 462–466 (1952)
62. Kingma, D.P., Ba, L.J.: Adam: a method for stochastic optimization. In: Proceedings of
Conference on Learning Representations (ICLR), San Diego (2015)
63. Krasnoselski, M.A.: Two remarks on the method of successive approximations. Uspekhi Mat.
Nauk. 10, 123–127 (1955)
64. Levitin, E.S., Polyak, B.T.: Constrained minimization methods. U.S.S.R. Comput. Math.
Math. Phys. 6, 1–50 (1966)
65. Li, W.: Error bounds for piecewise convex quadratic programs and applications. SIAM J.
Control Optim 33, 1510–1529 (1995)
66. Li, G.: Global error bounds for piecewise convex polynomials. Math. Prog. Ser. A 137, 37–64
(2013)
67. Lions, P.L., Mercier, I.: Splitting algorithms for the sum of two nonlinear operators. SIAM J.
Numer. Anal. 16, 964–979 (1979)
68. Luo, Z.Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a
general approach. Ann. Oper. Res. 46, 157–178 (1993)
69. Luque, F.: Asymptotic convergence analysis of the proximal point algorithm. SIAM J. Control
Optim. 22, 277–293 (1984)
70. Mann, W.R.: Mean value methods in iteration. Proc. Am. Math. Soc. 4, 506–510 (1953)
71. Martinet, B.: Régularisation d'inéquations variationnelles par approximations successives.
Rev. Française Informat. Recherche Opérationnelle 4, Sér. R-3, 154–158 (1970)
72. Mercier, B.: Inéquations Variationnelles de la Mécanique. No. 80.01 in Publications Mathé-
matiques d’Orsay. Université de Paris-XI, Orsay, France (1980)
73. Minkowski, H.: Theorie der konvexen Körper, insbesondere Begründung ihres Oberflächen-
begriffs. In: Hilbert, D. (ed.) Gesammelte abhandlungen von Hermann Minkowski [Collected
Papers of Hermann Minkowski], vol. 2, pp. 131–229. B.G. Teubner, Leipzig (1911)
74. Moreau, J.J.: Fonctions convexes duales et points proximaux dans un espace hilbertien, C. R.
Acad. Sci. Paris Ser. A Math. 255, 2897–2899 (1962)
75. Moreau, J.J.: Propriétés des applications “prox”, C. R. Acad. Sci. Paris Ser. A Math. 256,
1069–1071 (1963)
76. Moreau, J.J.: Proximité et dualité dans un espace Hilbertien. Bull. de la Société Mathématique
de France 93, 273–299 (1965)
77. Mosci, S., Rosasco, L., Santoro, M., Verri, A., Villa, S.: Solving structured sparsity regu-
larization with proximal methods. In: Joint European Conference on Machine Learning and
Knowledge Discovery in Databases, pp. 418–433. Springer, Berlin, Heidelberg (2010)
78. Necoara, I., Clipici, D.: Parallel random coordinate descent method for composite minimiza-
tion: convergence analysis and error bounds. SIAM J. Optim. 26, 197–226 (2016)
79. Necoara, I., Nesterov, Y., Glineur, F.: Random block coordinate descent methods for linearly
constrained optimization over networks. J. Optim. Theory Appl. 173, 227–254 (2017)
80. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach
to stochastic programming. SIAM J. Optim. 19, 1574–1609 (2009)
81. Nemirovski, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization.
Wiley-Interscience, New York (1983)
82. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Aca-
demic Publishers, London (2004)
83. Nesterov, Y.: A method for solving the convex programming problem with convergence rate
O(1/k²). Dokl. Akad. Nauk SSSR 269, 543–547 (1983)
84. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems.
SIAM J. Optim. 22, 341–362 (2012)
85. Opial, Z.: Weak convergence of the sequence of successive approximations for nonexpansive
mappings. Bull. Am. Math. Soc. 73, 591–597 (1967)
86. Osher, S., Burger, M., Goldfarb, D., Xu, J., Yin, W.: An iterative regularization method for
total variation-based image restoration. Multiscale Model. Simul. 4, 460–489 (2005)
87. Passty, G.B.: Ergodic convergence of a zero of the sum of monotone operators in Hilbert
space. J. Math. Anal. Appl. 72, 383–390 (1979)
88. Peypouquet, J.: Convex Optimization in Normed Spaces. Springer, Cham (2015)
89. Phelps, R.R.: Convex Functions, Monotone Operators and Differentiability. Springer, Berlin
(1993)
90. Polyak, B.T.: A general method for solving extremal problems. Dokl. Akad. Nauk SSSR 174,
33–36 (1967)
91. Polyak, B.T.: Gradient methods for minimizing functionals. Zh. Vychisl. Mat. Mat. Fiz. 3,
643–653 (1963)
92. Polyak, B.T.: Subgradient methods: a survey of Soviet research. In: Lemaréchal, C.L., Mifflin,
R. (eds.) Proceedings of a IIASA Workshop, Nonsmooth Optimization, pp. 5–28. Pergamon
Press, New York (1977)
93. Polyak, B.T.: Introduction to Optimization. Optimization Software, Inc. (1987)
94. Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math.
Program. Ser. A 156, 433–484 (2016)
95. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407
(1951)
96. Robbins, H., Siegmund, D.: A convergence theorem for non negative almost supermartingales
and some applications. In: Optimizing Methods in Statistics, pp. 233–257. Academic Press
(1971)
97. Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control
Optim. 14, 877–898 (1976)
98. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
99. Rockafellar, R.T.: Conjugate Duality and Optimization. Society for Industrial and Applied
Mathematics, Philadelphia (1974)
100. Rosasco, L., Villa, S., Vũ, B.C.: Convergence of stochastic proximal gradient method. Appl.
Math. Optim. 82, 891–917 (2020)
101. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms.
Physica D 60, 259–268 (1992)
102. Salzo, S.: The variable metric forward-backward splitting algorithm under mild differentia-
bility assumptions. SIAM J. Optim. 27(4), 2153–2181 (2017)
103. Salzo, S., Villa, S.: Parallel random block-coordinate forward-backward algorithm: a unified
convergence analysis. Math. Program. Ser. A. https://doi.org/10.1007/s10107-020-01602-1
104. Schaefer, H.: Über die Methode sukzessiver Approximationen. Jber. Deutsch. Math.-Verein.
59, 131–140 (1957)
105. Shamir, O., Zhang, T.: Stochastic gradient descent for non-smooth optimization: convergence
results and optimal averaging schemes. In: Proceedings of the 30th International Conference
on Machine Learning, pp. 71–79 (2013)
106. Shalev-Shwartz, S., Singer, Y., Srebro, N., Cotter, A.: Pegasos: primal estimated sub-gradient
solver for SVM. Math. Program. 127, 3–30 (2011)
107. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss
minimization. J. Mach. Learn. Res. 14, 567–599 (2013)
108. Shor, N.: Minimization Methods for Non-differentiable Functions. Springer, New York (1985)
109. Sibony, M.: Méthodes itératives pour les équations et inéquations aux dérivées partielles
non linéaires de type monotone. Calcolo 7, 65–183 (1970)
110. Steinwart, I., Christmann, A.: Support Vector Machines. Springer, New York (2008)
111. Su, W., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov’s accelerated
gradient method: theory and insights. J. Mach. Learn. Res. 17, 1–43 (2016)
112. Tseng, P.: Applications of a splitting algorithm to decomposition in convex programming and
variational inequalities. SIAM J. Control Optim. 29, 119–138 (1991)
113. Wolfe, P.: A method of conjugate subgradients for minimizing nondifferentiable functions.
In: Nondifferentiable Optimization. Math. Program. Stud., vol. 3, pp. 145–173 (1975)
114. Wright, S.: Coordinate descent algorithms. Math. Program. 151, 3–34 (2015)
115. Zălinescu, C.: Convex Analysis in General Vector Spaces. World Scientific Publishing Co.,
Inc., River Edge, NJ (2002)
116. Zhang, X., Burger, M., Bresson, X., Osher, S.: Bregmanized nonlocal regularization for decon-
volution and sparse reconstruction. SIAM J. Imaging Sci. 3, 253–276 (2010)