Proximal Gradient Methods for Machine Learning and Imaging
1 Introduction
Convex optimization plays a key role in data science and image processing. Indeed,
on the one hand, it provides theoretical frameworks, such as duality theory and the
theory of nonexpansive operators, which are indispensable for formally analyzing many
problems arising in those fields. On the other hand, convex optimization supplies a
plethora of algorithmic solutions covering a broad range of applications. In particular,
the last decades witnessed an unprecedented development of optimization methods
which are now capable of addressing structured and large-scale problems effectively.
An important class of such methods, which are at the core of modern nonlinear convex
optimization, is that of proximal gradient splitting algorithms. They are first-order
methods which are tailored to optimization problems having a composite structure
given by the sum of smooth and nonsmooth terms. These methods are splitting
algorithms, in the sense that along the iterations they process each term separately by
exploiting gradient information when available and the so-called proximity operator
for nonsmooth terms.
Even though there is a rich literature on proximal gradient algorithms, in this
contribution we pay particular attention to presenting a self-contained and unifying
analysis of the various algorithms, unveiling their common theoretical basis. We give
state-of-the-art results treating the convergence both of the iterates and of the
objective function values in an infinite-dimensional setting. This work is based on the lecture
S. Salzo (B)
Istituto Italiano di Tecnologia, Via E. Melen 83, 16152 Genova, Italy
e-mail: saverio.salzo@iit.it
S. Villa
DIMA & MaLGa Center, Università degli Studi di Genova, Via Dodecaneso 35,
16146 Genova, Italy
e-mail: silvia.villa@unige.it
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 149
F. De Mari and E. De Vito (eds.), Harmonic and Applied Analysis, Applied and Numerical
Harmonic Analysis, https://doi.org/10.1007/978-3-030-86664-8_4
notes written for the PhD course “Introduction to Convex Optimization” that was
given by the authors at the University of Genoa during the last 5 years.
This chapter is divided into six sections. Section 2 provides an account of convex
analysis, recalling the fundamental concepts of subdifferentials, the Legendre–Fenchel
transform, and duality theory. In Sect. 3, we study the proximal gradient algorithm
under different assumptions, addressing also acceleration techniques. Section 4 is
about stochastic optimization methods. We study the projected stochastic subgradi-
ent method, the proximal stochastic gradient algorithm and the randomized block-
coordinate proximal gradient algorithm. Section 5 exploits duality to derive new
algorithms. Finally, in Sect. 6, we describe several important applications in which
proximal gradient algorithms have been successfully used.
An affine set of X is a set M ⊂ X such that every straight line joining two distinct
points of M is contained in M. In formulas, this means that, for every x, y ∈ M,
and every λ ∈ R, we have (1 − λ)x + λy ∈ M. If M is affine then V := M − M
is a vector subspace of X , which is called the direction of M. Moreover, we have
M = V + x, for every x ∈ M. The intersection of a family of affine sets of X is still
affine, so if C ⊂ X one can define the affine hull of C, denoted by aff(C), which is the
intersection of all the affine sets of X containing C. It can be represented as the set of
the finite affine combinations of elements of C, meaning that x ∈ aff(C) if and only
if there exist a finite number of points x_1, . . . , x_n ∈ C and numbers λ_1, . . . , λ_n ∈ R
(n ≥ 1) such that ∑_{i=1}^n λ_i = 1 and x = ∑_{i=1}^n λ_i x_i. The affine dimension of a set C is
the dimension of the affine hull of C. A mapping T : X → Y between Hilbert spaces
is said to be affine if T ((1 − λ)x + λy) = (1 − λ)T x + λT y, for every x, y ∈ X
and λ ∈ R. An affine mapping T can be uniquely represented as Tx = Ax + b, with
A : X → Y a linear operator and b ∈ Y. The image and the preimage of
affine sets through affine mappings are affine sets. An (affine) hyperplane of X is a
set of the form {x ∈ X | ϕ(x) = α}, where ϕ : X → R is a nonzero linear form on X
and α ∈ R.
For every x ∈ X and every δ > 0 we denote by B_δ(x) the (closed) ball of center
x and radius δ, that is, B_δ(x) = {y ∈ X | ‖y − x‖ ≤ δ}. Given a subset C ⊂ X, we
denote by int(C), cl(C), and bdry(C) its interior, closure and boundary, respectively.
A hyperplane H = {x ∈ X | ϕ(x) = α} is closed if and only if ϕ is continuous.
We consider extended real-valued functions

f : X → ]−∞, +∞],

so that the value −∞ will never be allowed. In the rest of the chapter, if not otherwise
specified, functions are supposed to be extended real-valued. The (effective) domain
of f is the set dom f = {x ∈ X | f(x) < +∞}, the epigraph of f is the set
epi f = {(x, t) ∈ X × R | f(x) ≤ t}, and, for every t ∈ R, we define the sublevel sets [f ≤ t] = {x ∈ X | f(x) ≤ t}
and similarly, we define the sets [ f > t]. An extended real-valued function is called
proper if dom f ≠ ∅, meaning that the function admits at least a finite value. The
set of minimizers of f is denoted by argmin f .
In optimization problems, extended real-valued functions allow one to treat constraints
as functions. Indeed, let C ⊂ X and define the indicator function of C as

ι_C : X → ]−∞, +∞] : x ↦ ⎧ 0 if x ∈ C
                          ⎩ +∞ if x ∉ C.        (3)
In this way, the constrained problem min_{x∈C} h(x), with h : X → R, becomes the
unconstrained minimization of the extended real-valued function h + ι_C.
Note that indicator functions and epigraphs allow one to establish a one-to-one corre-
spondence between extended real-valued functions and sets.
A set C ⊂ X is a cone if it is invariant under multiplication by strictly positive scalars, meaning that, for every x ∈ C, the ray R₊₊x = {λx | λ ∈ R₊₊} is contained in C.
The intersection of a family of convex sets of X is still convex, so if A ⊂ X , then
one defines the convex hull of A, denoted by co(A), as the intersection of the family
of all convex subsets of X containing A. In fact it is the smallest convex subset of X
containing A and it can be represented as the set of the finite convex combinations of
elements of A, meaning that x ∈ co(A) if and only if there exist a finite number of
points x_1, . . . , x_n ∈ A and numbers λ_1, . . . , λ_n ∈ R₊ (n ≥ 1) such that ∑_{i=1}^n λ_i = 1
and x = ∑_{i=1}^n λ_i x_i.
Let C be a nonempty closed convex subset of X and let x ∈ X . Then the orthogonal
projection of x onto C is defined as the unique point p ∈ C such that, for every y ∈ C,
‖p − x‖ ≤ ‖y − x‖, and is denoted by P_C(x). It is also characterized by the following
variational inequality:

(∀ y ∈ C) ⟨y − p, x − p⟩ ≤ 0.
If C is an affine set with direction V , then the above characterization becomes the
classical x − p ∈ V ⊥ . We recall that for convex sets the property of being closed
is equivalent to that of being weakly sequentially closed. We finally recall that the
projection operator P_C : X → X is firmly nonexpansive, that is, it satisfies

(∀ x, y ∈ X) ‖P_C(x) − P_C(y)‖² ≤ ⟨x − y, P_C(x) − P_C(y)⟩.

A function f : X → ]−∞, +∞] is convex if, for every x, y ∈ X and every λ ∈ [0, 1],

f((1 − λ)x + λy) ≤ (1 − λ) f(x) + λ f(y),        (7)

and is strictly convex if in (7) the strict inequality holds when x, y ∈ dom f, x ≠ y,
and λ ∈ ]0, 1[. Finally, g : X → [−∞, +∞[ is concave (resp. strictly concave)
if −g is convex (resp. strictly convex). If f is convex, by induction, definition (7)
yields Jensen's inequality, that is, for every finite sequence (x_i)_{1≤i≤m} in X and every
(λ_i)_{1≤i≤m} ∈ R₊^m such that ∑_{i=1}^m λ_i = 1, we have

f(∑_{i=1}^m λ_i x_i) ≤ ∑_{i=1}^m λ_i f(x_i).        (8)
In this section, we recall the concept of subdifferentials and calculus for nonsmooth
convex functions. Let f : X → ]−∞, +∞] be a proper convex function and x ∈ dom f.
The directional derivative of f at x along the vector v is
f′(x, v) = lim_{t→0⁺} (f(x + tv) − f(x))/t. The subdifferential of f at x is defined as
∂f(x) := {u ∈ X | (∀ y ∈ X) f(y) ≥ f(x) + ⟨y − x, u⟩}.        (13)
The function f* : X → ]−∞, +∞], f*(u) = sup_{x∈X} (⟨x, u⟩ − f(x)),
is called the Fenchel conjugate of f, which is always convex and lower semicon-
tinuous. The Fenchel–Moreau theorem ensures that if f ∈ Γ₀(X) then f* ∈ Γ₀(X)
and f** = f. Thus, the transformation ·* : Γ₀(X) → Γ₀(X) is an involution, which
is called the Legendre–Fenchel transform. Let C ⊂ X. The support function of C is
the function ι_C*, which is denoted by σ_C, that is, σ_C(u) = sup_{x∈C} ⟨x, u⟩.
(∀ u = (u_1, . . . , u_m) ∈ X) f*(u) = f_1*(u_1) + f_2*(u_2) + · · · + f_m*(u_m).
Duality plays a key role in convex optimization. Here we recall the Fenchel–
Rockafellar duality. We let A : X → Y be a continuous linear operator between
Hilbert spaces, f ∈ Γ₀(X) and g ∈ Γ₀(Y). Consider the primal problem

minimize_{x∈X} f(x) + g(Ax),        (P)

with objective Φ = f + g ∘ A, and the associated dual objective Ψ : Y → ]−∞, +∞],
Ψ(u) = f*(−A*u) + g*(u). The Fenchel–Young inequality gives, for every x ∈ X and
u ∈ Y, f(x) + f*(−A*u) ≥ −⟨Ax, u⟩ and g(Ax) + g*(u) ≥ ⟨Ax, u⟩, hence

inf_{x∈X} Φ(x) ≥ sup_{u∈Y} −Ψ(u) = − inf_{u∈Y} Ψ(u).        (18)
This means that the function Φ is (uniformly) above the function −Ψ (which is
concave). The difference between the infimum of Φ and the supremum of −Ψ, that
is, inf Φ + inf Ψ, is called the duality gap, and we say that strong duality holds if the
duality gap is zero.¹
Let S = argmin Φ and S* = argmin Ψ. Then the following are equivalent.
(i) x̂ ∈ S, û ∈ S*, and inf_X Φ + inf_Y Ψ = 0 (the duality gap is zero);
(ii) x̂ ∈ ∂f*(−A*û) and Ax̂ ∈ ∂g*(û);
(iii) −A*û ∈ ∂f(x̂) and û ∈ ∂g(Ax̂).
¹ Note that if inf Φ = −∞, it follows from (18) that sup(−Ψ) = − inf Ψ = −∞. In this
case, Ψ ≡ +∞ and inf Φ + inf Ψ = −∞ + ∞ does not make sense. Anyway, since there is no
gap between Φ and −Ψ, by convention, we set inf Φ + inf Ψ = 0. The same situation occurs if
inf Ψ = −∞.
The conditions (ii) and (iii) above are called the KKT (Karush–Kuhn–Tucker) conditions.
Once one ensures that strong duality holds (that is, inf Φ + inf Ψ = 0), they provide
a full characterization of a pair (x̂, û) of primal and dual solutions.
Fact 14 Suppose that one of the following conditions is satisfied.
(a) S ≠ ∅ and ∂(f + g ∘ A) = ∂f + A* ∘ ∂g ∘ A;
(b) 0 ∈ int(dom g − A(dom f)).
Then Ψ is proper and

inf_X Φ = − min_Y Ψ,        (19)
which is in the form (P). Then, in view of Fact 9(v), the dual problem of (20) is
Recalling Fact 14(a), to ensure the existence of dual solutions and a zero duality gap,
we need to find conditions ensuring the validity of the calculus rule (15). We first
prove that if x ∈ X is such that Ax = b, then
Indeed, we note that ι{b} ◦ A = ι A−1 (b) and A−1 (b) = x + N (A). Then,
Therefore, ∂(ι{b} ◦ A)(x) = R(A∗ ). Moreover, A∗ ∂ι{b} (Ax) = A∗ ∂ι{b} (b) and the
subdifferential of ι_{b} is

∂ι_{b} : Y ⇒ Y : y ↦ ⎧ Y if y = b
                      ⎩ ∅ if y ≠ b,        (23)
hence A∗ ∂ι{b} (Ax) = R(A∗ ) and (22) holds. Finally, recalling the calculus rule
for subdifferentials in Fact 5 and that we assumed that f is continuous at some
x ∈ dom(ι{b} ◦ A), then, we have ∂( f + ι{b} ◦ A)(x) = ∂ f (x) + ∂(ι{b} ◦ A)(x) =
∂ f (x) + A∗ ∂ι{b} (Ax) and hence (15) holds. We note in passing that Fermat’s rule
for (21) is
In the differentiable case, this condition reduces to the classical Lagrange multiplier
rule, that is, x̂ is a solution of (20) if and only if there exists a multiplier û such that
A∗ û = ∇ f (x̂).
Though convexity is a very old concept, the first systematic study of convex sets in
finite dimension is due to Minkowski [73]; concerning convex functions, it was
Jensen [58] who introduced the concept now known as midpoint convexity. The lecture
notes by Fenchel [48] constitute the first modern exposition of convex analysis in the
finite-dimensional case. Indeed, the notions of support function and Legendre–Fenchel
conjugate, as well as the duality theory presented in Sects. 2.5 and 2.6, for the special
case that A is the identity operator, were fully studied there. At the beginning of
the 1960s, convex analysis became a mathematical field in its own right, thanks to the
works of Moreau [74–76] and Rockafellar [99], who established the theory in infinite
dimension and developed the concepts of subgradients and subdifferentials, among
others. Starting from those works, the field flourished, and it is nowadays still a very
active research area.
In the following, we list the main references. Concerning the finite-dimensional
setting, we refer to the fundamental monograph [98] and the book [57]. For Hilbert
spaces, a comprehensive treatment is given in [11] (where most of the facts presented
here can be found). A lot of research has also been devoted to Banach spaces and to
general topological vector spaces. For the former case, we refer to [10, 19, 88, 89],
and to [46, 99, 115] for the latter.
In this section, we focus on the main object of this chapter, which is the proximal
gradient algorithm (also called the forward–backward algorithm). In the following,
we describe the basic assumptions and the algorithm, whereas in the next sections, we
for k = 0, 1, . . .
⌊ x_{k+1} = prox_{γg}(x_k − γ∇f(x_k)).        (25)
xk+1 = (1 − γ L)xk .
Thus, if we take γ = 2/L, we have xk+1 = −xk and the sequence does not converge,
unless x0 = 0.
minimize_{x∈R^d} (1/2)‖Ax − y‖² + λ‖x‖₁.        (27)
Then, Algorithm 1 reduces to the following. Let γ ∈ ]0, 2/‖A*A‖[ and x_0 ∈ X; then
for k = 0, 1, . . .
⌊ x_{k+1} = soft_{γλ}(x_k − γA*(Ax_k − y)).        (28)
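Iteration (28) takes only a few lines to implement. The following sketch (in Python with NumPy, on hypothetical random problem data) runs it with the safe stepsize γ = 1/‖A*A‖:

```python
import numpy as np

def soft(x, t):
    # Componentwise soft thresholding: the prox of t*||.||_1.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, y, lam, n_iter=500):
    # Proximal gradient iteration (28) for 0.5*||Ax - y||^2 + lam*||x||_1,
    # with stepsize gamma = 1/||A^T A||, which lies in ]0, 2/||A^T A||[.
    gamma = 1.0 / np.linalg.norm(A.T @ A, 2)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft(x - gamma * A.T @ (A @ x - y), gamma * lam)
    return x

# Hypothetical data: a small random design and a sparse ground truth.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
x_true = np.zeros(20); x_true[:3] = [2.0, -1.5, 1.0]
y = A @ x_true
x_hat = ista(A, y, lam=0.1, n_iter=2000)
```

Note that for A = Id the minimizer is exactly soft(y, λ), and the iteration reaches it in a single step.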
In this section, we present the convergence theory for the method of the fixed point
iteration. We recall the classical theory for contractive operators and then we address
the case of averaged operators which is motivated by the Krasnosel’skiı̆–Mann iter-
ation.
Let X be a real Hilbert space and let T : X → X. Then
(i) T is nonexpansive if, for all x, y ∈ X, ‖Tx − Ty‖ ≤ ‖x − y‖;
(ii) T is a contraction if, for all x, y ∈ X, ‖Tx − Ty‖ ≤ q‖x − y‖, for some q ∈ ]0, 1[.
A fixed point of T is a point x ∈ X such that T x = x and the set of such points
is denoted by Fix T . In order to compute fixed points of T , we will consider the
following fixed point iteration. Let x0 ∈ X and define, for every k ∈ N,
xk+1 = T xk . (29)
An iterative method of type (29) is also called Picard iteration or the method of
successive approximations.
Remark 18
(i) Nonexpansive operators may have no fixed points. For instance, a translation
T = Id + a, with a ≠ 0, does not have any fixed point.
(ii) For nonexpansive operators, even those admitting fixed points, the fixed point
iteration may fail to converge. Indeed, this occurs if we take T = −Id and start
with x_0 ≠ 0. More generally, rotations are nonexpansive operators admitting
a fixed point, for which the fixed point iteration does not converge.
The first important result concerning existence of fixed points and the convergence
of the fixed point iteration is the following.
Theorem 19 (Banach-Caccioppoli) Let T : X → X be a q-contractive mapping for
some 0 < q < 1. Then there exists a unique fixed point of T , that is, Fix T = {x∗ }.
Moreover, for the fixed point iteration (29), we have

(∀ k ∈ N) ‖x_k − x_*‖ ≤ q^k ‖x_0 − x_*‖ and ‖x_k − x_*‖ ≤ (q^k/(1 − q)) ‖x_0 − x_1‖.        (30)
(∀ x, y ∈ X) ‖x − y‖ ≤ (1/(1 − q)) (‖x − Tx‖ + ‖y − Ty‖).        (31)
Indeed, ‖x − y‖ ≤ ‖x − Tx‖ + ‖Tx − Ty‖ + ‖Ty − y‖ ≤ ‖x − Tx‖ + q‖x − y‖ +
‖y − Ty‖, hence (1 − q)‖x − y‖ ≤ ‖x − Tx‖ + ‖Ty − y‖ and (31) follows. Inequal-
ity (31) shows that there may exist at most one fixed point of T. Moreover, for every
k, h ∈ N,

‖x_k − x_h‖ ≤ (1/(1 − q)) (‖x_k − x_{k+1}‖ + ‖x_h − x_{h+1}‖)
            ≤ (1/(1 − q)) (‖T^k x_0 − T^k x_1‖ + ‖T^h x_0 − T^h x_1‖)
            ≤ (1/(1 − q)) (q^k ‖x_0 − x_1‖ + q^h ‖x_0 − x_1‖)
            ≤ ((q^k + q^h)/(1 − q)) ‖x_0 − x_1‖,        (32)
As we noted in Remark 18, for general nonexpansive operators the fixed point
iteration (29) may not converge. To overcome this situation, it is enough to slightly
modify the iteration. This leads to the following definition.
Let T : X → X be a nonexpansive operator and let λ ∈ ]0, 1[. The Krasnosel'skiı̆–
Mann iteration is defined as follows: let x_0 ∈ X and, for every k ∈ N,

x_{k+1} = (1 − λ)x_k + λ T x_k.        (33)
If we look at the example given in Remark 18(ii), we now see that the iteration (33)
becomes x_{k+1} = (1 − 2λ)x_k. Since |1 − 2λ| < 1, we have that x_k = (1 − 2λ)^k x_0 →
0. Iteration (33) can be equivalently written as a fixed point iteration of the operator
T_λ = (1 − λ)Id + λT. This motivates the study of operators that are convex combi-
nations of the identity operator and nonexpansive operators and justifies the definition
below.
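The rotation example of Remark 18(ii) illustrates this well; a short sketch comparing the Picard iteration (29) with the Krasnosel'skiı̆–Mann iteration (33):

```python
import numpy as np

# A rotation by 90 degrees is nonexpansive with Fix T = {0}, but the
# Picard iteration (29) does not converge; the averaged iteration (33) does.
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])
T = lambda z: R @ z
lam = 0.5

z_picard = np.array([1.0, 0.0])
z_km = np.array([1.0, 0.0])
for _ in range(200):
    z_picard = T(z_picard)                      # keeps norm 1 forever
    z_km = (1 - lam) * z_km + lam * T(z_km)     # contracts toward the fixed point 0
```

The averaged map (1 − λ)Id + λR has eigenvalues of modulus < 1, so the KM iterates vanish geometrically while the Picard iterates keep circling.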
Proof Indeed
Proposition 23 Let T : X → X and α ∈ ]0, 1[. Then the following statements are
equivalent:
(i) T is α-averaged;
(ii) (1 − 1/α) Id + (1/α) T is nonexpansive;
(iii) for every (x, y) ∈ X²,

‖Tx − Ty‖² ≤ ‖x − y‖² − ((1/α) − 1) ‖(Id − T)x − (Id − T)y‖².
and hence

α = (α_1 + α_2 − 2α_1α_2)/(1 − α_1α_2).
Averaged operators are important since, provided that they have fixed points, the
Picard iteration always weakly converges to some fixed point. In the rest of the
section, we will prove this result.
Proof The assumptions ensure that (x_k)_{k∈N} is bounded. Therefore, the set of weak
cluster points of (x_k)_{k∈N} is nonempty. Let y_1, y_2 ∈ X and let (x_k^1)_{k∈N} and (x_k^2)_{k∈N} be
subsequences of (x_k)_{k∈N} such that x_k^1 ⇀ y_1 and x_k^2 ⇀ y_2. Then, for every k ∈ N,
Since y_1 and y_2 are weak cluster points of (x_k)_{k∈N}, by assumption, y_1, y_2 ∈ F and
(‖x_k − y_1‖)_{k∈N} and (‖x_k − y_2‖)_{k∈N} are convergent. Therefore, by (37), we obtain
that there exists β ∈ R such that ⟨x_k, y_2 − y_1⟩ → β. Now, since x_k^i ⇀ y_i, i = 1, 2,
we have ⟨x_k^i, y_2 − y_1⟩ → ⟨y_i, y_2 − y_1⟩, which implies

⟨y_1, y_2 − y_1⟩ = β = ⟨y_2, y_2 − y_1⟩

and hence ‖y_2 − y_1‖² = 0. This proves that the set of weak cluster points of the
sequence (x_k)_{k∈N} is a singleton. So, the sequence (x_k)_{k∈N} is weakly convergent.
Therefore,

((1 − α)/α) ∑_{k=0}^{+∞} ‖x_k − T x_k‖² ≤ ∑_{k=0}^{+∞} (‖x_k − x_*‖² − ‖x_{k+1} − x_*‖²) ≤ ‖x_0 − x_*‖².
Note that the definition is well posed, since the function y ↦ g(y) + (1/2)‖y − x‖²
is lower semicontinuous and strongly convex, hence it has a unique minimizer.
Moreover, let us check that prox_g = (Id + ∂g)^{−1}. Using the sum rule for the subd-
ifferential, which holds since the squared norm is differentiable, we derive
This shows that (Id + ∂g)^{−1}(x) is actually a singleton and its unique element is
prox_g(x). Note that, for every x ∈ X, prox_g(x) ∈ dom g, since the minimizer of
g + (1/2)‖· − x‖² is clearly in the domain of g.
Example 33 Let C be a closed and convex set. The proximity operator of ιC is the
projection on C. The projection is nonexpansive (and, indeed, firmly nonexpansive),
but in general not a contraction, unless C is a singleton.
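For instance, for the box C = [−1, 1]^d the projection is a componentwise clipping, and the variational characterization of the projection can be checked numerically; a sketch with arbitrary sample data:

```python
import numpy as np

# prox of the indicator of C = [-1, 1]^d is the projection onto C,
# which here is a componentwise clipping.
P_C = lambda x: np.clip(x, -1.0, 1.0)

rng = np.random.default_rng(1)
x = 3.0 * rng.standard_normal(5)
p = P_C(x)

# Variational characterization: <y - p, x - p> <= 0 for every y in C.
ys = rng.uniform(-1.0, 1.0, size=(1000, 5))
gaps = (ys - p) @ (x - p)
```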
(∀ x, y ∈ X) ‖prox_g(x) − prox_g(y)‖² ≤ ⟨x − y, prox_g(x) − prox_g(y)⟩.        (39)
Proof Let x, y ∈ X and set px = proxg (x) and p y = proxg (y). Then, by Fermat’s
rule, we have
x − px ∈ ∂g( px ) and y − p y ∈ ∂g( p y ).
Therefore,

g(p_y) ≥ g(p_x) + ⟨x − p_x, p_y − p_x⟩
g(p_x) ≥ g(p_y) + ⟨y − p_y, p_x − p_y⟩,

and summing, g(p_y) + g(p_x) ≥ g(p_x) + g(p_y) + ⟨y − p_y − x + p_x, p_x − p_y⟩.
Then the statement follows.
∇g_λ(u) = (u − prox_{λg}(u))/λ ∈ ∂g(prox_{λg}(u)).        (41)
Example 37
(i) (Proximity operator of the ℓ¹ norm) Let X = R^d. The ℓ¹ norm on X is separable,
thus the proximity operator can be computed componentwise, so it is enough
to compute the proximity operator of the absolute value in R. Let γ > 0. By
definition, for every t ∈ R, prox_{γ|·|}(t) = (Id + γ∂|·|)^{−1}(t). Thus, if we plot
the graph of Id + γ∂|·| and invert it, we discover that
soft_γ(t) := prox_{γ|·|}(t) = ⎧ t − γ if t > γ
                              ⎨ 0     if |t| ≤ γ        (43)
                              ⎩ t + γ if t < −γ.
g(x) = ‖x‖₁ + (λ/2)‖x‖²

prox_{γg}(x) = prox_{(γ/(γλ+1))‖·‖₁}(x/(γλ + 1)),

that is, componentwise,

(prox_{γg}(x))_i = ⎧ (x_i − γ)/(γλ + 1) if x_i > γ
                   ⎨ 0                   if |x_i| ≤ γ
                   ⎩ (x_i + γ)/(γλ + 1) if x_i < −γ.
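The closed form (43) can be cross-checked against a direct numerical minimization of the defining objective s ↦ γ|s| + (1/2)(s − t)²; a sketch over an arbitrary grid:

```python
import numpy as np

def soft(t, gamma):
    # The closed form (43).
    if t > gamma:
        return t - gamma
    if t < -gamma:
        return t + gamma
    return 0.0

gamma = 0.7
grid = np.linspace(-5.0, 5.0, 200001)     # grid step 5e-5
ts = [-3.0, -0.5, 0.0, 0.4, 2.2]
numeric = []
for t in ts:
    obj = gamma * np.abs(grid) + 0.5 * (grid - t) ** 2
    numeric.append(grid[np.argmin(obj)])  # brute-force prox on the grid
closed_form = [soft(t, gamma) for t in ts]
```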
(ii): We have:

p = prox_{γg}(x) ⇔ p = argmin_{y∈X} γ h(ay + b) + (1/2)‖y − x‖²
                 ⇔ p = argmin_{y∈X} γ h(ay + b) + (1/(2a²))‖ay + b − (ax + b)‖²
                 ⇔ p = argmin_{y∈X} γ a² h(ay + b) + (1/2)‖ay + b − (ax + b)‖²
                 ⇔ ap + b = prox_{a²γh}(ax + b)
                 ⇔ p = (prox_{a²γh}(ax + b) − b)/a.
(iii): We have

p = prox_{γg}(x) ⇔ p = argmin_{y∈X} γ h(Ly) + (1/2)‖y − x‖²
                 ⇔ 0 ∈ γL*∂h(Lp) + p − x
                 ⇔ x − p ∈ γL*∂h(Lp)
                 ⇔ Lx ∈ γ∂h(Lp) + Lp
                 ⇔ p = L* prox_{γh}(Lx),
x = x_V + x_{V⊥} = P_V x + P_{V⊥} x.        (44)
If we set f = ι_V, we first note that (ι_V)*(u) = sup_{x∈X} (⟨x, u⟩ − ι_V(x)) = ι_{V⊥}(u). Thus,
we can rewrite (44) as

x = prox_{ι_V}(x) + prox_{(ι_V)*}(x).
Hence,

‖·‖ = σ_{B₁(0)} = (ι_{B₁(0)})*.

More explicitly:

prox_{‖·‖}(x) = ⎧ x − x/‖x‖ if ‖x‖ > 1
                ⎩ 0          if ‖x‖ ≤ 1.

Note that this operation corresponds to a vector soft thresholding, which reduces to
(43) for dim X = 1 and γ = 1.
Example 42 (The proximity operator of the group lasso norm) Let J = {J_1, . . . , J_l}
be a partition of {1, . . . , d}. We define a norm on R^d by considering

‖x‖_J = ∑_{i=1}^l ( ∑_{j∈J_i} |x_j|² )^{1/2}.

For every x ∈ R^d, let us call x_{J_i} = (x_j)_{j∈J_i} ∈ R^{J_i} and denote by ‖·‖_{J_i} the Euclidean
norm on R^{J_i}. Then

‖x‖_J = ∑_{i=1}^l ‖x_{J_i}‖_{J_i}.
So we need to study the operator T . We already know that proxγ g is firmly non-
expansive and hence (1/2)-averaged. The following result concerns the operator
Id − γ ∇ f .
Proposition 43 Let f : X → R be differentiable and let L > 0. Let γ > 0 and set
T_γ = Id − γ∇f. Then the L-Lipschitz continuity of ∇f is equivalent to the property

(∀ x, y ∈ X) ‖T_γx − T_γy‖² ≤ ‖x − y‖² − (2/(γL) − 1) ‖(Id − T_γ)x − (Id − T_γ)y‖².        (46)

In particular, if γ < 2/L, T_γ is an α-averaged operator, with α = γL/2 < 1.
F(z) ≥ F(x) + ⟨z − x, ∇f(y) + u⟩ − (L/2)‖x − y‖².
Proof Let x, z ∈ X and let y ∈ dom g. Then, it follows from Fact 1 that

f(y) ≥ f(x) − ⟨x − y, ∇f(y)⟩ − (L/2)‖x − y‖².

Hence, since f is convex,

f(z) ≥ f(y) + ⟨z − y, ∇f(y)⟩ ≥ f(x) + ⟨z − x, ∇f(y)⟩ − (L/2)‖x − y‖².        (47)

Now, since u ∈ ∂g(x), g(z) ≥ g(x) + ⟨z − x, u⟩, which summed with inequality
(47) gives the statement.
Lemma 46 Let (a_k)_{k∈N} be a decreasing sequence in R₊. If ∑_{k=0}^{+∞} a_k < +∞, then

(∀ k ∈ N) a_k ≤ (1/(k+1)) ∑_{i=0}^{+∞} a_i, and a_k = o(1/(k+1)).

Proof Let k ∈ N. Since a_k ≤ a_i, for i = 0, 1, . . . , k, we have ∑_{i=0}^{k} a_i ≥ (k+1)a_k,
hence the first part of the statement. As regards the second part, we note that, for every
integer k ≥ 2, we have ∑_{i=⌈k/2⌉}^{+∞} a_i ≥ ∑_{i=⌈k/2⌉}^{k} a_i ≥ (k + 1 − ⌈k/2⌉)a_k ≥ ((k+1)/2) a_k.
Therefore, (k+1)a_k ≤ 2 ∑_{i=⌈k/2⌉}^{+∞} a_i → 0 as k → +∞.
The following theorem provides full convergence results concerning the proximal
gradient algorithm.
Proof (i): It follows from (25), Theorem 44, and Theorem 30(ii).
(ii): Let x ∈ X and k ∈ N. It follows from (25) that u := (x_k − x_{k+1})/γ −
∇f(x_k) ∈ ∂g(x_{k+1}), hence

(x_k − x_{k+1})/γ = ∇f(x_k) + u, with u ∈ ∂g(x_{k+1}).
F(x) ≥ F(x_{k+1}) + ⟨x − x_{k+1}, ∇f(x_k) + u⟩ − (L/2)‖x_{k+1} − x_k‖²
     = F(x_{k+1}) + (1/γ)⟨x − x_{k+1}, x_k − x_{k+1}⟩ − (L/2)‖x_{k+1} − x_k‖²;
and the identity ‖x_k − x‖² = ‖x_k − x_{k+1}‖² + ‖x_{k+1} − x‖² + 2⟨x_{k+1} − x_k, x − x_{k+1}⟩
yields

F(x) − F(x_{k+1}) ≥ (1/(2γ)) (‖x_k − x_{k+1}‖² + ‖x_{k+1} − x‖² − ‖x_k − x‖²) − (L/2)‖x_{k+1} − x_k‖²
                  = (1/(2γ)) ((1 − γL)‖x_k − x_{k+1}‖² + ‖x_{k+1} − x‖² − ‖x_k − x‖²).
Therefore,

2γ ∑_{k=0}^{+∞} (F(x_{k+1}) − F(x_*)) ≤ ‖x_0 − x_*‖² + ((γL − 1)₊/(2 − γL)) ‖x_0 − x_*‖²
                                      = ‖x_0 − x_*‖² × ⎧ 1            if γ ≤ 1/L        (48)
                                                        ⎩ 1/(2 − γL)  if 1/L < γ < 2/L.

Then, since (F(x_{k+1}) − F(x_*))_{k∈N} is decreasing and positive, the statement follows
from Lemma 46.
(v): It follows from (25), Theorems 44, 30(iii), and the fact that S∗ = Fix(T ).
Remark 48 It follows from (48) that the best bound is achieved when γ = 1/L.
Remark 49 Suppose that in problem (24) f is the Moreau envelope of a function
h ∈ Γ₀(X) with parameter 1, that is, f = h_1. Then ∇f(x) = x − prox_h(x), which is
1-Lipschitz continuous, and the proximal gradient Algorithm 1 with stepsize γ = 1
becomes

for k = 0, 1, . . .
⌊ x_{k+1} = prox_g(prox_h(x_k)),        (49)
which is called the backward–backward algorithm. If one takes g = ι_{C₁} and h = ι_{C₂},
for two closed convex sets C₁, C₂ ⊂ X, we have the alternating projection algorithm

for k = 0, 1, . . .
⌊ x_{k+1} = P_{C₁}(P_{C₂}(x_k)).        (50)
Note that Theorem 47 ensures that the sequence (x_k)_{k∈N} weakly converges to a point
in argmin_{x∈C₁} d_{C₂}²(x).
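A toy run of (50) in the plane, sketched with two arbitrary convex sets whose intersection is nonempty:

```python
import numpy as np

# C1 = the x-axis, C2 = the halfspace {(u, v) : u + v >= 1};
# here C1 ∩ C2 = {(u, 0) : u >= 1}.
P1 = lambda z: np.array([z[0], 0.0])          # projection onto C1

def P2(z):
    # Projection onto the halfspace {z : <z, (1, 1)> >= 1}.
    g = z[0] + z[1] - 1.0
    return z if g >= 0.0 else z - 0.5 * g * np.array([1.0, 1.0])

x = np.array([-2.0, -3.0])
for _ in range(50):
    x = P1(P2(x))                             # alternating projection step (50)
```

The iterates end up in the intersection of the two sets, consistently with the convergence statement above.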
In this section, following the same notation as in the previous section, we set
We will consider the situation where f and/or g are strongly convex. This will make
the corresponding operators Tγ and/or proxγ g contractions.
So, by virtue of Fact 1(iv) and (12), f is strongly convex and ∇f is Lipschitz continuous.
Now we assume that f is strongly convex and with Lipschitz continuous gradient.
Then we will prove that there exists an interval of values of γ for which Tγ is a
contraction.
(2/(γ(L + μ))) ‖γ∇f(x) − γ∇f(y)‖² + (2γμL/(L + μ)) ‖x − y‖² ≤ 2⟨γ∇f(x) − γ∇f(y), x − y⟩.
Moreover,

‖(x − y) − γ(∇f(x) − ∇f(y))‖² = ‖x − y‖² − 2⟨γ∇f(x) − γ∇f(y), x − y⟩ + ‖γ∇f(x) − γ∇f(y)‖².

Hence

‖(x − y) − γ(∇f(x) − ∇f(y))‖² ≤ (1 − 2γμL/(L + μ)) ‖x − y‖²
                                − (2/(γ(L + μ)) − 1) ‖γ∇f(x) − γ∇f(y)‖²,
where, for γ ∈ ]0, 2/(L + μ)],

0 < ((L − μ)/(L + μ))² = 1 − 4μL/(L + μ)² ≤ 1 − 2γμL/(L + μ) < 1.
Therefore, for every γ ∈ ]0, 2/(L + μ)], Tγ is a contraction with the constant given
in (53).
Proof The mapping T_γ is differentiable and T_γ′(x) = Id − γ∇²f(x). By the mean
value theorem, for every x, y ∈ X, ‖T_γx − T_γy‖ ≤ sup_{z∈X} ‖T_γ′(z)‖ ‖x − y‖.
Moreover, ‖T_γ′(x)‖ = sup_{λ∈σ(∇²f(x))} |1 − γλ|. Since f is μ-strongly convex and ∇f
is L-Lipschitz continuous,
Remark 53 The constant q̃₁(γ) given in Theorem 52 is always better than the
constant q₁(γ) given in Theorem 51. However, they agree at the minimum value.
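The contraction property of T_γ can be observed numerically on a strongly convex quadratic; a sketch (with an arbitrary diagonal Hessian whose spectrum lies in [μ, L], the stepsize γ = 2/(L + μ), and the factor (L − μ)/(L + μ) discussed above):

```python
import numpy as np

# Gradient map T_gamma = Id - gamma * grad f for the quadratic
# f(x) = 0.5 x^T H x, whose Hessian spectrum lies in [mu, L].
mu, L = 1.0, 10.0
H = np.diag([mu, 4.0, L])
gamma = 2.0 / (L + mu)           # stepsize giving the best contraction factor
q = (L - mu) / (L + mu)          # expected contraction constant
T = lambda x: x - gamma * (H @ x)

rng = np.random.default_rng(2)
ratios = []
for _ in range(100):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    ratios.append(np.linalg.norm(T(x) - T(y)) / np.linalg.norm(x - y))
```

All observed ratios stay below q = (L − μ)/(L + μ), in agreement with the theory.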
Proof Let x, y ∈ X and set px = proxγ g (x) and p y = proxγ g (y). Then, by Fermat’s
rule, we have (x − px )/γ ∈ ∂g( px ) and (y − p y )/γ ∈ ∂g( p y ). Therefore, recalling
Fact 4, we have
g(p_y) − g(p_x) ≥ γ^{−1}⟨p_y − p_x, x − p_x⟩ + (σ/2)‖p_y − p_x‖²
g(p_x) − g(p_y) ≥ γ^{−1}⟨p_x − p_y, y − p_y⟩ + (σ/2)‖p_x − p_y‖².
Now we are ready to provide the theorem of convergence for the proximal gradient
algorithm.
Proof The statement follows from Theorems 51, 52, and 54 and the Banach–
Caccioppoli theorem.
Remark 56
(i) The best value of γ in (57) is achieved for γ = 2/(μ + L).
(ii) When g = 0, one can derive an explicit linear rate also in the function values.
Indeed, in this case, since ∇f(x_*) = 0, it follows from Fact 1(ii) that f(x) −
f(x_*) ≤ (L/2)‖x − x_*‖².
It is possible to show that strongly convex functions satisfy the following condition:

f(x) − inf f ≤ (1/(2μ)) ‖∂f(x)‖₋².        (58)
This condition is called the Łojasiewicz inequality. It can hold even for non-strongly
convex functions and has very recently been the object of intense research, which
has unveiled its connection with the quadratic growth condition
(∀ x ∈ X) f(x) − inf_X f ≥ (μ/2) dist(x, argmin f)²        (59)
and ultimately its critical role in achieving linear convergence in optimization algo-
rithms. In this section, we study the convergence of the proximal gradient algorithm
under Łojasiewicz-type inequalities.
We start with a major (although simple) example showing a function which is
not strongly convex but satisfies the Łojasiewicz inequality and the quadratic growth
condition above.
Example 57 Let A : X → Y be a bounded linear operator with closed range between
two Hilbert spaces, b ∈ Y , and set
f : X → R, f(x) = (1/2)‖Ax − b‖².        (60)
Note that here we do not assume A*A to be positive definite. Let b_* be the projection
of b onto the range R(A) of A. Then Pythagoras' theorem yields

(∀ x ∈ X) f(x) = (1/2)‖Ax − b_*‖² + (1/2)‖b_* − b‖².
Thus, f_* := inf_X f = (1/2)‖b_* − b‖². Now, let x_* ∈ S := argmin f = {x ∈ X |
Ax = b_*}, let x ∈ X, and set x_p = P_S x. We have b_* = Ax_* = Ax_p, and hence

f(x) − f_* = (1/2)‖Ax − b_*‖² = (1/2)‖A(x − x_*)‖² = (1/2)‖A(x − x_p)‖².        (61)

Since x − x_p ∈ N(A)^⊥, it follows that

f(x) − f_* ≥ (1/2)‖A†‖^{−2} ‖x − x_p‖² = (1/2)‖A†‖^{−2} dist(x, argmin f)²,        (62)

so that (59) holds with μ = ‖A†‖^{−2}. Moreover, ∇f(x) = A*(Ax − b_*) = A*A(x −
x_*), and hence

‖∇f(x)‖² = ‖A*A(x − x_*)‖²,
which shows that (58), with μ = ‖(A†)*‖^{−2}, is equivalent to

‖A(x − x_*)‖ ≤ ‖(A*)†‖ ‖A*A(x − x_*)‖.        (63)

Again, since (as before) for every y ∈ R(A) = N(A*)^⊥ we have ‖y‖ ≤ ‖(A*)†‖ ‖A*y‖, and
(A*)† = (A†)*, we conclude that (63), and hence (58), holds with μ = ‖(A†)*‖^{−2} =
‖A†‖^{−2}.
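The two inequalities (58) and (59) can be checked numerically for a rank-deficient A using the pseudoinverse; a sketch with arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4)) @ np.diag([3.0, 2.0, 1.0, 0.0])  # rank 3
b = rng.standard_normal(6)

pinv = np.linalg.pinv(A)
mu = 1.0 / np.linalg.norm(pinv, 2) ** 2      # mu = ||A^dagger||^{-2}
x_star = pinv @ b                            # one minimizer of f
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
f_inf = f(x_star)

ok_58, ok_59 = [], []
for _ in range(200):
    x = 5.0 * rng.standard_normal(4)
    grad = A.T @ (A @ x - b)
    x_p = x_star + (np.eye(4) - pinv @ A) @ x    # projection of x onto argmin f
    ok_58.append(f(x) - f_inf <= grad @ grad / (2.0 * mu) + 1e-8)
    ok_59.append(f(x) - f_inf >= 0.5 * mu * np.linalg.norm(x - x_p) ** 2 - 1e-8)
```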
where, for a given set D, ‖D‖₋ = inf_{u∈D} ‖u‖. We will refer to this notion as global
if sup_{t>inf F} c_t < +∞.
Example 60 (ℓ¹ regularized least squares) Let f(x) = α‖x‖₁ + (1/2)‖Ax − y‖²,
for some linear operator A : R^d → R^n, y ∈ R^n and α > 0. Then f is convex piece-
wise polynomial of degree 2, thus it is 2-Łojasiewicz on sublevel sets.
Lemma 61 Let (r_k)_{k∈N} be a strictly positive real sequence satisfying, for some
κ > 0, α > 1, and all k ∈ N: r_k − r_{k+1} ≥ κ r_{k+1}^α. Define κ̃ := min{(α −
1)κ, (α − 1)κ^{1/α}, r_0^{1−α}, κ^{1/α} r_0^{1−α}}. Then, for all k ∈ N, r_k ≤ (κ̃ k)^{−1/(α−1)}.
Proof We first show that (x_k)_{k∈N} has finite length. Since inf F > −∞, then r_k :=
F(x_k) − inf F ∈ [0, +∞[, and Theorem 47(iii) yields

a‖x_{k+1} − x_k‖² ≤ r_k − r_{k+1}, with a = 1/γ − L/2.        (64)
γ inf_{u∈∂F(x_{k+1})} ‖u‖ ≤ ‖x_k − γ∇f(x_k) − (x_{k+1} − γ∇f(x_{k+1}))‖ ≤ ‖x_k − x_{k+1}‖.        (66)
If there exists k ∈ N such that r_k = 0, then the algorithm would stop after a finite
number of iterations (see (64)), therefore it is not restrictive to assume that r_k > 0
for all k ∈ N. Since (F(x_k))_{k∈N} is decreasing by Theorem 47(iii), and x_0 ∈ dom F,
x_k ∈ [inf F < F ≤ F(x_0)] for every k ≥ 1. We set ϕ(t) := p t^{1/p} and F_0 = F(x_0),
so that the Łojasiewicz inequality at x_k ∈ [inf F < F ≤ F_0] can be rewritten as

c_{F_0} ϕ′(r_k) ‖∂F(x_k)‖₋ ≥ 1.        (67)
Combining (64), (66), and (67), and using the concavity of ϕ, we obtain, for all k ∈ N,

‖x_{k+1} − x_k‖² ≤ (c_{F_0}/(γa)) ϕ′(r_k)(r_k − r_{k+1})‖x_k − x_{k−1}‖ ≤ (c_{F_0}/(γa)) (ϕ(r_k) − ϕ(r_{k+1}))‖x_k − x_{k−1}‖.

By taking the square root on both sides, and using Young's inequality, we obtain

(∀ k ∈ N) 2‖x_{k+1} − x_k‖ ≤ (c_{F_0}/(γa)) (ϕ(r_k) − ϕ(r_{k+1})) + ‖x_k − x_{k−1}‖.        (68)
Summing for k = 1, . . . , K, we obtain

(∀ K ≥ 1) ∑_{k=1}^{K} ‖x_{k+1} − x_k‖ ≤ (c_{F_0}/(γa)) ϕ(r_1) + ‖x_1 − x_0‖.
We deduce that (x_k)_{k∈N} has finite length and therefore converges strongly to some
x_*. Moreover, from (66) and the strong closedness of ∂F : X ⇒ X, we conclude that
0 ∈ ∂F(x_*). We next show a preliminary inequality which will be useful to prove the
rates for (‖x_k − x_*‖)_{k∈N}. Let K ∈ N and 1 ≤ k ≤ K, recall that ϕ(t) = p t^{1/p}, and
sum the inequality in (68) between k and K to obtain
‖x_K − x_k‖ ≤ ∑_{n=k}^{K} ‖x_{n+1} − x_n‖ ≤ (p c_{F_0}/(aγ)) r_k^{1/p} + ‖x_k − x_{k−1}‖.
Passing to the limit for K → ∞, using (64), and the fact that r_k is decreasing, we
derive

(∀ k ≥ 1) ‖x_* − x_k‖ ≤ (p c_{F_0}/(aγ)) r_{k−1}^{1/p} + (1/√a) r_{k−1}^{1/2}.        (69)
Next we prove the convergence rates. We first derive rates for the sequence of values
r_k, from which we will derive the rates for the iterates thanks to (69). Equations (64)
and (66) and the Łojasiewicz inequality at x_{k+1} ∈ [inf F < F ≤ F_0] yield

r_k − r_{k+1} ≥ κ r_{k+1}^α, with κ := aγ²/c_{F_0}² and α := 2(p − 1)/p.        (70)
The rates for the values are derived from the analysis of the sequences satisfying the
inequality in (70), which is recalled in Lemma 61. Depending on the value of p, we
obtain different rates.
(i): Since p = 1, we deduce from (70) that, for all k ∈ N, r_{k+1} ≤ r_k − κ. Since the
sequence (r_k)_{k∈N} is decreasing and positive, this implies that r_k > 0 is possible only
for k ≤ r_0 κ^{−1}.
(ii): Since p ∈ ]1, 2[, we have α ∈ ]0, 1[. Thus, the positivity of r_{k+1} and (70) imply
that, for all k ∈ N, r_k ≥ κ r_{k+1}^α, and hence r_{k+1} ≤ κ^{−1/α} r_k^{1/α}, meaning that r_k converges
Q-superlinearly to zero. In addition, we have r_{k−1}^{1/p} = r_{k−1}^{1/p−1/2} r_{k−1}^{1/2} ≤ r_0^{1/p−1/2} r_{k−1}^{1/2}, and
(69) implies ‖x_k − x_*‖ ≤ b_p r_{k−1}^{1/2}, with b_p = p c_{F_0} r_0^{1/p−1/2}/(aγ) + 1/√a.
(iii): If p = 2, then α = 1 and (70) yields that, for all k ∈ N, r_{k+1} ≤ (1 + κ)^{−1} r_k,
so that r_k ≤ (1 + κ)^{−k} r_0. Moreover, from (69) we derive that

(∀ k ≥ 1) ‖x_* − x_k‖ ≤ b_2 r_{k−1}^{1/2},

where b_2 = 2c_{F_0}/(aγ) + 1/√a.
(iv): If p ∈ ]2, +∞[, then α ∈ ]1, 2[, and (70) and Lemma 61 imply that r_{k+1} ≤
c_p (k + 1)^{−p/(p−2)}, where

c_p = κ̃^{−p/(p−2)},        (71)

κ̃ being the constant given in Lemma 61 (with κ and α as in (70)).
Note that r_{k−1}^{1/2} ≤ r_0^{1/2−1/p} r_{k−1}^{1/p}, and therefore, defining b_p = p c_{F_0}/(aγ) + a^{−1/2} r_0^{1/2−1/p},
we derive from (69) that ‖x_k − x_*‖ ≤ b_p r_{k−1}^{1/p} for every k ≥ 1.
Remark 63 Note that the rates range from finite termination, for p = 1, to the
worst-case rates presented in Theorem 47, as p tends to +∞. The larger p is, the
closer the rates for the objective function values get to o(k^{−1}), and the slower the
rates for the iterates become.
3.6 Accelerations
Proximal gradient methods are simple and have a low cost per iteration, but they often converge slowly, both in practice and in theory (see Theorem 47). In this section, we consider the class of accelerated proximal gradient algorithms, which are only slightly more complicated than the basic proximal gradient method but enjoy an improved convergence rate. While the proximal gradient method uses only the information from the previous step to build the next iterate, accelerated methods are multistep methods: they take several previous iterates into account to improve convergence. The most popular accelerated multistep method is due to Nesterov and is also known as the Fast Iterative Soft Thresholding Algorithm (FISTA). We consider the same setting as in the previous sections.
Algorithm 2 (Accelerated proximal gradient method) Let $0 < \gamma \le 1/L$ and let $(t_k)_{k\in\mathbb{N}} \in \mathbb{R}^{\mathbb{N}}$ be such that $t_0 = 1$, $t_k \ge 1$ and, for every integer $k \ge 1$, $t_k^2 - t_k \le t_{k-1}^2$. Let $x_0 = y_0 \in X$ and define

for $k = 0, 1, \dots$
  $x_{k+1} = \mathrm{prox}_{\gamma g}\big(y_k - \gamma\nabla f(y_k)\big)$
  $\beta_{k+1} = \dfrac{t_k - 1}{t_{k+1}}$   (72)
  $y_{k+1} = x_{k+1} + \beta_{k+1}(x_{k+1} - x_k).$
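As an illustration, Algorithm 2 with the classical FISTA parameters $t_0 = 1$, $t_{k+1} = (1 + \sqrt{1 + 4t_k^2})/2$, which satisfy $t_{k+1}^2 - t_{k+1} = t_k^2$ (hence the required condition, with equality), can be sketched in a few lines. The lasso instance below is illustrative, and the names `fista`, `grad_f`, `prox_g` are ours, not the chapter's.

```python
import numpy as np

def fista(grad_f, prox_g, L, x0, n_iter=500):
    """Algorithm 2 with gamma = 1/L and the FISTA parameters
    t_0 = 1, t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2, so that
    t_{k+1}^2 - t_{k+1} = t_k^2 and beta_{k+1} = (t_k - 1) / t_{k+1}."""
    gamma = 1.0 / L
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iter):
        x_new = prox_g(y - gamma * grad_f(y), gamma)
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_new + ((t - 1) / t_new) * (x_new - x)   # extrapolation step
        x, t = x_new, t_new
    return x

# Illustrative lasso:  min_x 0.5*||Ax - b||^2 + lam*||x||_1
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
b = rng.standard_normal(40)
lam = 0.1
L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of grad f
grad_f = lambda x: A.T @ (A @ x - b)
prox_g = lambda v, g: np.sign(v) * np.maximum(np.abs(v) - g * lam, 0.0)  # soft thresholding

x_star = fista(grad_f, prox_g, L, np.zeros(20))
```

At convergence, `x_star` is (approximately) a fixed point of the proximal gradient map, which is the optimality condition for $F = f + g$.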
One of the crucial observations that led to a whole stream of literature, and made it possible to give a physical interpretation of this kind of algorithm, is the link between accelerated algorithms and the trajectories of a second-order continuous dynamical system. Let us consider a heavy ball of mass $m$ in the potential field generated by $f + g$, subject to a friction (or "viscosity") force controlled by a function $p(t) > 0$. The motion $x(t)$ of the heavy ball is described by the following second-order differential inclusion:

  $0 \in m\,\ddot x(t) + p(t)\,\dot x(t) + \nabla f(x(t)) + \partial g(x(t)).$   (73)

Intuitively, ignoring existence issues, the heavy ball reaches the minimizer of $f + g$ as $t \to +\infty$, due to the loss of energy caused by the friction. In addition, the friction dampens the zig-zagging effect, which is one of the causes that slow down gradient-type methods. We consider a scenario where the viscosity coefficient is of the form $p(t) = \alpha/t$, which turns out to be crucial for achieving accelerated rates:

  $0 \in \ddot x(t) + \dfrac{\alpha}{t}\,\dot x(t) + \nabla f(x(t)) + \partial g(x(t)).$   (74)
We next show that Algorithm 2 can be seen as a discretization of (74). To this end, we discretize implicitly with respect to the nonsmooth function $g$ and explicitly with respect to the smooth function $f$. Let $h > 0$ be a fixed time step, and set $t_k = (\tau_0 + k)h$ and $x_k = x(t_k)$. The suggested implicit/explicit discretization strategy reads as

  $0 \in \dfrac{1}{h^2}(x_{k+1} - 2x_k + x_{k-1}) + \dfrac{\alpha}{(\tau_0 + k)h^2}(x_k - x_{k-1}) + \partial g(x_{k+1}) + \nabla f(y_k),$
We start with a few results concerning the sequence of parameters $t_k$.
Proposition 64 Suppose that $t_0 = 1$ and, for every integer $k \ge 1$,

  $t_k^2 - (1-c)\,t_k + b = t_{k-1}^2,$   (75)

for some $c \in [0, 1[$ and $b \in [0, 1 - c]$. Then condition (75) is equivalent to

  $t_k = \dfrac{1-c}{2} + \sqrt{\Big(\dfrac{1-c}{2}\Big)^2 + t_{k-1}^2 - b}.$   (76)

(i) For every integer $k \ge 1$, $1 \le t_{k-1} \le t_k \le 1 - c + t_{k-1}$.
(ii) Suppose that $2\sqrt{b} \le 1 - c$. Then, for every integer $k \ge 1$, $(1-c)/2 + t_{k-1} \le t_k$. Hence $k(1-c)/2 \le t_k - 1 \le k(1-c)$.
Proof The discriminant of the quadratic equation in (75) (in the unknown $t_k$) is $\Delta_k = (1-c)^2 + 4(t_{k-1}^2 - b)$. Then it is clear that if $t_{k-1} \ge 1$, then $\Delta_k > 0$, the positive solution of (75) is (76), and $t_k \ge (1-c)/2 + \sqrt{(1-c)^2/4 + 1 - b} \ge 1$, since $b \le 1 - c$. Vice versa, if $t_{k-1} \ge 1$, then (76) $\Rightarrow$ (75). In the end, if $t_{k-1} \ge 1$, then (76) and (75) are equivalent and in such case $t_k \ge 1$. So, the first part of the statement follows by an induction argument, since $t_0 = 1$.
(i): We derive from (75) that $t_k^2 - t_{k-1}^2 = -b + (1-c)t_k \ge -b + 1 - c \ge 0$, hence $t_{k-1} \le t_k$. Moreover, it follows from (76) that

  $\Big(t_k - \dfrac{1-c}{2}\Big)^2 \le \Big(\dfrac{1-c}{2}\Big)^2 + t_{k-1}^2 \le \Big(\dfrac{1-c}{2} + t_{k-1}\Big)^2,$   (77)

which yields $t_k \le 1 - c + t_{k-1}$.
which are obtained from (75) with $(b, c) = (0, 0)$ and $(b, c) = (1/a^2, (a-2)/a)$, respectively. Note that in both cases $1 - c \ge 2\sqrt{b}$ (and in the first case, in virtue of Proposition 64(ii), we have $t_k \ge (k+2)/2$).
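The recursion (76) and the two parameter choices just mentioned can be generated and checked directly; the sketch below uses our own naming and illustrative values.

```python
import math

def t_sequence(n, b=0.0, c=0.0):
    """Generate t_0, ..., t_n via (76):
    t_k = (1-c)/2 + sqrt(((1-c)/2)**2 + t_{k-1}**2 - b), with t_0 = 1."""
    ts = [1.0]
    for _ in range(n):
        ts.append((1 - c) / 2 + math.sqrt(((1 - c) / 2) ** 2 + ts[-1] ** 2 - b))
    return ts

fista_t = t_sequence(100)                          # (b, c) = (0, 0): FISTA rule
a = 3.0
cd_t = t_sequence(100, b=1/a**2, c=(a - 2)/a)      # (b, c) = (1/a^2, (a-2)/a): t_k = (k + a)/a
```

One can verify on these sequences the properties of Proposition 64: monotonicity, the condition $t_k^2 - t_k \le t_{k-1}^2$ of Algorithm 2, and the growth bound $t_k \ge (k+2)/2$ in the first case.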
√
Remark 66 Suppose that the tk ’s satisfy (75) with 2 b ≤ 1 − c. Then, since tk2 −
(1 − c)tk ≤ tk−1
2
, we have, for k ≥ 2,
  $(\forall\, z \in X)\quad F(x) + \dfrac{\|x - z\|^2}{2\gamma} \le F(z) + \dfrac{\|z - y\|^2}{2\gamma}.$

Since $x = \mathrm{prox}_{\gamma g}(y - \gamma\nabla f(y))$ is the minimizer of the $(1/\gamma)$-strongly convex function $z \mapsto g(z) + \frac{1}{2\gamma}\|y - z\|^2 + \langle z - y, \nabla f(y)\rangle$, we have, for every $z \in X$,

  $\underbrace{g(x) + \dfrac{1}{2\gamma}\|y - x\|^2 + \langle x - y, \nabla f(y)\rangle}_{(a)} + \dfrac{1}{2\gamma}\|z - x\|^2 \le g(z) + \dfrac{1}{2\gamma}\|y - z\|^2 + \langle z - y, \nabla f(y)\rangle.$

Since $\gamma \le 1/L$, the descent lemma gives $f(x) \le f(y) + \langle x - y, \nabla f(y)\rangle + \frac{1}{2\gamma}\|x - y\|^2$, so that the term (a) is bounded from below by $f(x) - f(y) + g(x)$. Therefore,

  $f(x) + g(x) + \dfrac{1}{2\gamma}\|z - x\|^2 \le f(y) + g(z) + \dfrac{1}{2\gamma}\|y - z\|^2 + \langle z - y, \nabla f(y)\rangle \le f(z) + g(z) + \dfrac{1}{2\gamma}\|y - z\|^2,$

where in the last inequality we used that $f(y) + \langle z - y, \nabla f(y)\rangle \le f(z)$, due to the convexity of $f$.
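The inequality just proved can be sanity-checked numerically: below, a single proximal gradient step with $\gamma = 1/L$ is taken from a random point $y$ on an illustrative lasso objective, and the inequality is tested over a batch of random points $z$. All data and names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
lam = 0.2
L = np.linalg.norm(A, 2) ** 2
gamma = 1.0 / L

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
g = lambda x: lam * np.sum(np.abs(x))
F = lambda x: f(x) + g(x)
grad_f = lambda x: A.T @ (A @ x - b)
prox_g = lambda v: np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)

y = rng.standard_normal(10)
x = prox_g(y - gamma * grad_f(y))   # one proximal gradient step from y

# Check F(x) + ||x - z||^2 / (2*gamma) <= F(z) + ||z - y||^2 / (2*gamma)
# over a batch of random test points z.
ok = all(
    F(x) + np.sum((x - z) ** 2) / (2 * gamma)
    <= F(z) + np.sum((z - y) ** 2) / (2 * gamma) + 1e-9
    for z in rng.standard_normal((100, 10))
)
```

Of course, a finite batch of test points is no proof, but a single violation would expose an implementation bug.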
We now present the first of the two main results of this section, which concerns convergence in value for Algorithm 2. Next, we will address convergence of the iterates under slightly stronger assumptions on the sequence of parameters $t_k$.
Moreover, if the parameters $t_k$ are defined according to Proposition 64 with $1 - c \ge 2\sqrt{b}$, then $F(x_k) - \min F = O(1/k^2)$.
Proof It follows from the definition of $y_{k+1}$ in Algorithm 2 that, for every $k \in \mathbb{N}$,

  $y_{k+1} = \Big(1 - \dfrac{1}{t_{k+1}}\Big)x_{k+1} + \dfrac{1}{t_{k+1}}\underbrace{\big(x_k + t_k(x_{k+1} - x_k)\big)}_{v_{k+1}}.$

Moreover, it follows from the definition of $v_{k+1}$ that $v_{k+1} - x_k = t_k(x_{k+1} - x_k)$ and hence

  $x_{k+1} = x_k + \dfrac{1}{t_k}(v_{k+1} - x_k) = \Big(1 - \dfrac{1}{t_k}\Big)x_k + \dfrac{1}{t_k}v_{k+1}.$   (82)

From the preceding lemma, applied with $y = y_k$ (so that $x = x_{k+1}$), we obtain

  $(\forall\, z \in X)\quad F(x_{k+1}) + \dfrac{\|x_{k+1} - z\|^2}{2\gamma} \le F(z) + \dfrac{\|z - y_k\|^2}{2\gamma}.$   (83)
Let $x_* \in \operatorname{argmin} F$ and set $z = (1 - 1/t_k)x_k + (1/t_k)x_*$. Then, by (82) and the definition of $v_k$,

  $x_{k+1} - z = \dfrac{1}{t_k}(v_{k+1} - x_*) \quad\text{and}\quad y_k - z = \dfrac{1}{t_k}(v_k - x_*).$

Therefore, it follows from (83) and the convexity of $F$ (considering that $z$ is a convex combination of $x_k$ and $x_*$) that

  $F(x_{k+1}) + \dfrac{\|v_{k+1} - x_*\|^2}{2\gamma t_k^2} \le F(z) + \dfrac{\|v_k - x_*\|^2}{2\gamma t_k^2} \le \Big(1 - \dfrac{1}{t_k}\Big)F(x_k) + \dfrac{1}{t_k}F(x_*) + \dfrac{\|v_k - x_*\|^2}{2\gamma t_k^2}.$

Subtracting $F(x_*)$ from both sides of the above inequality and setting $r_k = F(x_k) - F(x_*)$, we get

  $r_{k+1} + \dfrac{\|v_{k+1} - x_*\|^2}{2\gamma t_k^2} \le \Big(1 - \dfrac{1}{t_k}\Big)r_k + \dfrac{\|v_k - x_*\|^2}{2\gamma t_k^2}$

and, multiplying by $t_k^2$,

  $t_k^2 r_{k+1} + \dfrac{\|v_{k+1} - x_*\|^2}{2\gamma} \le t_k(t_k - 1)r_k + \dfrac{\|v_k - x_*\|^2}{2\gamma}.$   (84)

Set, for every $k \ge 1$, $E_k = t_{k-1}^2 r_k + \|v_k - x_*\|^2/(2\gamma)$. Since the parameters satisfy $t_k^2 - (1-c)t_k \le t_{k-1}^2$ (see Remark 66), we have $t_k(t_k - 1) \le t_{k-1}^2 - c t_k$, and (84) yields

  $(\forall\, k \in \mathbb{N},\ k \ge 1)\quad E_{k+1} \le t_k(t_k - 1)r_k + \dfrac{\|v_k - x_*\|^2}{2\gamma} \le -c\,t_k r_k + E_k.$   (85)

Therefore, $(E_k)_{k\ge 1}$ is decreasing and hence, using (84) with $k = 0$ (note that $v_0 = x_0$ and $t_0 = 1$), we have, for all $k \ge 1$,

  $t_{k-1}^2 r_k \le E_k \le E_1 \le \dfrac{\|x_0 - x_*\|^2}{2\gamma}.$

Since $x_*$ is an arbitrary element of $\operatorname{argmin} F$, the first part of the statement follows. The second part of the statement follows from Proposition 64(ii) and the fact that, for every integer $k \ge 1$, $2t_{k-1} \ge 2 + (k-1)(1-c) = k(1-c) + 1 + c \ge k(1-c)$.
Proof Let $r_k$ and $E_k$ be defined as in the proof of Theorem 68. It follows from (85), by telescoping, that, for every integer $k \ge 1$, $c\sum_{i=1}^{k} t_i r_i \le E_1 - E_{k+1} \le E_1$.
Next, applying the same lemma with $y = y_k$ and $z = x_k$, we obtain

  $F(x_{k+1}) + \dfrac{\|x_{k+1} - x_k\|^2}{2\gamma} \le F(x_k) + \dfrac{\|x_k - y_k\|^2}{2\gamma}.$   (88)

Since $x_k - y_k = -\beta_k(x_k - x_{k-1}) = -\frac{t_{k-1}-1}{t_k}(x_k - x_{k-1})$, multiplying (88) by $t_k^2$ gives

  $\dfrac{1}{2\gamma}\Big[t_k^2\|x_{k+1} - x_k\|^2 - (t_{k-1} - 1)^2\|x_k - x_{k-1}\|^2\Big] \le t_k^2(r_k - r_{k+1}),$

and, since $(t_{k-1} - 1)^2 = t_{k-1}^2 - (2t_{k-1} - 1)$, this can be rewritten as

  $\dfrac{1}{2\gamma}\Big[t_k^2\|x_{k+1} - x_k\|^2 - t_{k-1}^2\|x_k - x_{k-1}\|^2 + (2t_{k-1} - 1)\|x_k - x_{k-1}\|^2\Big] \le t_{k-1}^2 r_k - t_k^2 r_{k+1} + (t_k^2 - t_{k-1}^2)r_k.$   (90)

Summing (90) for $k = 1, \dots, K$ and using $t_k^2 - t_{k-1}^2 \le (1-c)t_k$, we obtain

  $\dfrac{1}{2\gamma}\Big[t_K^2\|x_{K+1} - x_K\|^2 + \sum_{k=2}^{K}(2t_{k-1} - 1)\|x_k - x_{k-1}\|^2\Big] \le r_1 - t_K^2 r_{K+1} + \sum_{k=1}^{K}(t_k^2 - t_{k-1}^2)r_k \le r_1 + (1-c)\sum_{k=1}^{K} t_k r_k.$

Hence, since $t_{k-1} \le 2t_{k-1} - 1$,

  $\sum_{k=2}^{K} t_{k-1}\|x_k - x_{k-1}\|^2 \le \sum_{k=2}^{K}(2t_{k-1} - 1)\|x_k - x_{k-1}\|^2 \le 2\gamma\Big[r_1 + (1-c)\sum_{k=1}^{K} t_k r_k\Big]$   (91)

and the statement follows from (i).
(∀ k ∈ N) ak+1 ≤ ak + εk . (92)
Proof Let $k \in \mathbb{N}$ with $k \ge 1$. Multiplying (93) by $t_k^2$ and using the relation $t_k^2 - t_k \le t_{k-1}^2$ and the fact that $t_{k-1} \le t_k$, we have $t_k^2 a_{k+1} \le t_{k-1}^2 a_k + t_k^2 b_k$. Hence

  $t_{k-1}^2 a_k - a_1 = \sum_{i=1}^{k-1}\big(t_i^2 a_{i+1} - t_{i-1}^2 a_i\big) \le \sum_{i=1}^{k-1} t_i^2 b_i.$   (95)

Then, dividing by $t_{k-1}^2$, we obtain

  $a_k \le \dfrac{a_1}{t_{k-1}^2} + \dfrac{1}{t_{k-1}^2}\sum_{i=1}^{k-1} t_i^2 b_i$   (96)

and hence

  $\sum_{j=1}^{k} a_j \le a_1\sum_{j=1}^{k}\dfrac{1}{t_{j-1}^2} + \sum_{j=1}^{k}\dfrac{1}{t_{j-1}^2}\sum_{i=1}^{j-1} t_i^2 b_i = a_1\sum_{j=1}^{k}\dfrac{1}{t_{j-1}^2} + \sum_{i=1}^{k-1} t_i^2 b_i \sum_{j=i+1}^{k}\dfrac{1}{t_{j-1}^2}.$   (97)

Now we analyze the term $\sum_{j=i+1}^{k} 1/t_{j-1}^2$. Let $j \in \mathbb{N}$ with $j \ge 2$. Since, by assumption, $t_j(t_j - (1-c)) \le t_{j-1}^2$ and $t_j \ge (1-c)/2 + t_{j-1} \ge (1-c) + t_{j-2}$, we have

  $\dfrac{1}{t_{j-1}^2} \le \dfrac{1}{t_j(t_j - (1-c))} = \dfrac{1}{1-c}\Big(\dfrac{1}{t_j - (1-c)} - \dfrac{1}{t_j}\Big) \le \dfrac{1}{1-c}\Big(\dfrac{1}{t_{j-2}} - \dfrac{1}{t_{j-1}} + \dfrac{1}{t_{j-1}} - \dfrac{1}{t_j}\Big).$

Summing over $j$, we obtain

  $\sum_{j=i+1}^{k}\dfrac{1}{t_{j-1}^2} \le \dfrac{1}{1-c}\Big[\sum_{j=i+1}^{k}\Big(\dfrac{1}{t_{j-2}} - \dfrac{1}{t_{j-1}}\Big) + \sum_{j=i+1}^{k}\Big(\dfrac{1}{t_{j-1}} - \dfrac{1}{t_j}\Big)\Big] = \dfrac{1}{1-c}\Big(\dfrac{1}{t_{i-1}} - \dfrac{1}{t_{k-1}} + \dfrac{1}{t_i} - \dfrac{1}{t_k}\Big) \le \dfrac{3-c}{1-c}\,\dfrac{1}{t_i},$

where in the last inequality we used that $t_j \le (2-c)t_{j-1}$ (see Remark 66). In the end, it follows from (97) that

  $\sum_{j=1}^{k} a_j \le a_1\sum_{j=1}^{k}\dfrac{1}{t_{j-1}^2} + \dfrac{3-c}{1-c}\sum_{i=1}^{k-1} t_i b_i \le a_1 + \dfrac{a_1(3-c)}{1-c}\,\dfrac{1}{t_1} + \dfrac{3-c}{1-c}\sum_{i=1}^{k-1} t_i b_i.$
Now we note that, by definition of $x_{k+1}$ and the fact that $x_* \in \operatorname{argmin} F$, we have

  $h_k - h_{k+1} \ge \delta_k - \beta_k\langle x_k - x_{k-1}, x_{k+1} - x_*\rangle - \dfrac{\gamma L}{4}\|x_{k+1} - y_k\|^2.$   (100)

Now, (98), written for $k - 1$, yields $h_{k-1} - h_k = \delta_{k-1} + \langle x_{k-1} - x_k, x_k - x_*\rangle$ and hence we have

  $h_{k+1} - h_k - \beta_k(h_k - h_{k-1}) \le -\delta_k + \beta_k\langle x_k - x_{k-1}, x_{k+1} - x_*\rangle + \dfrac{\gamma L}{4}\|x_{k+1} - y_k\|^2 + \beta_k\delta_{k-1} - \beta_k\langle x_k - x_{k-1}, x_k - x_*\rangle$
  $= -\delta_k + \dfrac{\gamma L}{4}\|x_{k+1} - y_k\|^2 + \beta_k\delta_{k-1} + \beta_k\langle x_k - x_{k-1}, x_{k+1} - x_k\rangle.$

Moreover,

  $\dfrac{1}{2}\|x_{k+1} - y_k\|^2 = \dfrac{1}{2}\|x_{k+1} - x_k - \beta_k(x_k - x_{k-1})\|^2 = \dfrac{1}{2}\|x_{k+1} - x_k\|^2 + \dfrac{\beta_k^2}{2}\|x_k - x_{k-1}\|^2 - \beta_k\langle x_{k+1} - x_k, x_k - x_{k-1}\rangle = \delta_k + \beta_k^2\delta_{k-1} - \beta_k\langle x_{k+1} - x_k, x_k - x_{k-1}\rangle.$

Therefore,

  $h_{k+1} - h_k - \beta_k(h_k - h_{k-1}) \le -\dfrac{1}{2}\Big(1 - \dfrac{\gamma L}{2}\Big)\|x_{k+1} - y_k\|^2 + (\beta_k + \beta_k^2)\delta_{k-1}.$   (101)

Since $\gamma L < 2$ and $\beta_k + \beta_k^2 \le 2$, we finally have

  $h_{k+1} - h_k \le \beta_k(h_k - h_{k-1}) + 2\delta_{k-1},$

which yields

  $(h_{k+1} - h_k)_+ \le \beta_k(h_k - h_{k-1})_+ + 2\delta_{k-1}.$   (103)

Since $t_k\delta_{k-1} \le (2-c)t_{k-1}\delta_{k-1}$ and $(t_{k-1}\delta_{k-1})_{k\ge 1}$ is summable in virtue of Proposition 70(ii), Lemma 72 yields that $((h_{k+1} - h_k)_+)_{k\in\mathbb{N}}$ is summable. Finally, since
Section 3.1. Fixed-point iterations, also known as the method of successive approximations, were developed by Picard, starting from ideas of Cauchy and Liouville. For the case of Banach spaces, Theorem 19 was first formulated and proved by Banach in his famous 1922 dissertation. It was later rediscovered, independently, by Caccioppoli in 1931. Since then, numerous generalizations and extensions have been obtained which deal with more general classes of operators and iterations. The Krasnosel'skiĭ–Mann iteration, as presented in (33), was first studied in [63] with λ = 1/2. For general λ ∈ ]0, 1[, it was studied by Schaefer [104], Browder and Petryshyn [25, 26], and Opial [85]. Mann [70] considered the more general case of this iteration in which λ may vary; this case was later also studied in [43, 54]. The concept of averaged operator was introduced in [9]. Later, the properties of compositions and convex combinations of averaged nonexpansive operators (Proposition 27) were applied to the design of new fixed-point algorithms in [38].
Section 3.2. The proximity operator was introduced by Moreau in 1962 [74] and further investigated in [75, 76] as a generalization of the notion of projection onto a convex set. It was later employed within the proximal point algorithm in [97]. Since then, it has appeared in most of the splitting algorithms used in practice [34].
Sections 3.3–3.4. The proximal gradient algorithm finds its roots in the projected gradient method [53, 64] and was originally devised in [72] in the more general context of monotone operators. Weak convergence of the iterates was proved in [51, 72]. An error-tolerant version with variable stepsize is presented in [39], whereas the worst-case rate of convergence in values was studied in [12, 24]. The proximal gradient algorithm is also a generalization of the iterative soft thresholding algorithm, first proposed in [41].
Section 3.5. The idea of imposing geometric conditions on the function to be optimized in order to derive improved convergence rates of first-order methods is old, and was already used in [27, 91, 97]. A systematic study of the class of functions satisfying favorable geometric conditions is more recent and is the result of a series of papers, among which we mention [14, 16, 17]. The fact that convex piecewise polynomial functions are p-Łojasiewicz on sublevel sets is due to [66, Corollary 3.6], in agreement with [27, Corollary 3.6] for the special case of piecewise linear convex functions and with [65, Theorem 2.7] for convex piecewise quadratic functions. The fact that the lasso problem is 2-Łojasiewicz has been observed in [17, Sect. 3.2.1]. The Kurdyka–Łojasiewicz inequality is a powerful tool to analyze convergence of first-order splitting algorithms, as shown in a whole line of work [3–5, 17, 18, 50, 69] ranging from the analysis of the proximal point algorithm to a whole class of descent gradient-based techniques. These results have had an impressive impact on the machine learning community; see, e.g., [60]. Theorem 62 is a special case of [52, Theorem 4.1].
Section 3.6. The idea of adding an inertial term in (74) to mitigate zig-zagging is due to Polyak, and gave rise to the heavy ball method [93] (see also [1]), which is optimal in the sense of Nemirovski and Yudin [81] for the class of convex, twice continuously differentiable functions. A simple, but not very intuitive, modification of Polyak's method is due to Nesterov [83]: the famous accelerated gradient method for convex smooth objective functions [82, 83]. The acceleration technique was first extended to the proximal point algorithm by Güler [55] and finally to composite optimization problems in [12]. Various modifications of these accelerated algorithms are nowadays the methods of choice for optimizing objective functions in large-scale scenarios, even in a nonconvex setting: despite convergence issues, the ADAM algorithm is probably the most used in the deep learning context [62]. The first papers studying accelerated algorithms focused on convergence of the objective function values. Convergence of the iterates has been established much more recently, starting from the paper by Chambolle and Dossal [29], and was further developed later. Only many years after its introduction was Nesterov's accelerated method shown to be a specific discretization of the heavy ball system introduced by Polyak with a vanishing inertial coefficient [111], and this key observation sparked a very active line of research on the subject (see [7], [6] and references therein).
where the projection onto $C$ can be computed explicitly, but only a stochastic subgradient of $f$ is available. The algorithm is detailed below.

for $k = 0, 1, \dots$
  $\hat u_k$ is a summable $X$-valued random vector s.t. $\mathbb{E}[\hat u_k \mid x_k] \in \partial f(x_k)$,   (108)
  $x_{k+1} = P_C(x_k - \gamma_k \hat u_k).$
Remark 76 In addition to the sequence $(x_k)_{k\in\mathbb{N}}$, Algorithm 3 requires keeping track of the sequences $\Gamma_k := \sum_{i=0}^{k}\gamma_i$ and $\bar x_k$, which can be updated recursively as $\Gamma_{k+1} = \Gamma_k + \gamma_{k+1}$ and $\bar x_{k+1} = \Gamma_{k+1}^{-1}\big(\Gamma_k \bar x_k + \gamma_{k+1} x_{k+1}\big)$.
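Algorithm 3, together with the recursive averaging of Remark 76, can be sketched on an illustrative problem: minimizing $f(x) = \mathbb{E}_z\|x - z\|_1$, with $z$ uniform on $[-0.1, 0.1]^2$, over the unit ball, so that a fresh sample gives the unbiased stochastic subgradient $\operatorname{sign}(x - z)$ required by (108). All data and names below are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def proj_ball(x, radius=1.0):
    n = np.linalg.norm(x)
    return x if n <= radius else (radius / n) * x

def stoch_subgrad(x):
    # For a fresh sample z ~ U([-0.1, 0.1]^2), sign(x - z) is a subgradient
    # of ||. - z||_1 at x, and its conditional expectation given x lies in
    # the subdifferential of f(x) = E_z ||x - z||_1, as required by (108).
    z = rng.uniform(-0.1, 0.1, size=2)
    return np.sign(x - z)

n = 5000
gammas = [1.0 / np.sqrt(k + 1) for k in range(n)]
x = np.array([0.9, -0.9])
Gamma, xbar = gammas[0], x.copy()          # running quantities of Remark 76
for k in range(n):
    x = proj_ball(x - gammas[k] * stoch_subgrad(x))   # Algorithm 3 step
    if k + 1 < n:
        Gamma_new = Gamma + gammas[k + 1]
        xbar = (Gamma * xbar + gammas[k + 1] * x) / Gamma_new
        Gamma = Gamma_new
# The iterate x oscillates near the minimizer 0 (the componentwise median),
# while the weighted average xbar settles down.
```

The averaged point $\bar x_k$ is the object for which the value bounds of the next theorem are stated; the raw iterate alone keeps oscillating at the scale of the current step size.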
The following theorem gives the main convergence results about the algorithm.
The right-hand side is summable, and hence $\|x_{k+1} - x\|$ is square summable. So, all the terms in (110) are summable. Therefore, taking the conditional expectation given $x_k$ of both sides of inequality (110) and using the fact that $u_k = \mathbb{E}[\hat u_k \mid x_k] \in \partial f(x_k)$ and the properties in Fact 75, we have almost surely
  $2\gamma_k\big(\mathbb{E}[f(x_k)] - f(x)\big) \le \mathbb{E}[\|x_k - x\|^2] - \mathbb{E}[\|x_{k+1} - x\|^2] + \gamma_k^2 B^2.$   (113)

Now, since $\gamma_k \to 0$, there exists $m \in \mathbb{N}$ such that, for every integer $k \ge m$, we have $\rho - \gamma_k B^2 \ge 0$ and hence, setting $\nu := \max\{n, m\}$, we have

  $\rho\sum_{k\ge\nu}\gamma_k \le \mathbb{E}[\|x_\nu - x\|^2] < +\infty.$

This contradicts the assumption $\sum_{k\in\mathbb{N}}\gamma_k = +\infty$. Therefore, we showed that there is no $x \in C$ such that $f(x) < \liminf_k \mathbb{E}[f(x_k)]$, that is, $\liminf_k \mathbb{E}[f(x_k)] \le \inf_C f$.
(ii): It follows from (113) that

  $(\forall\, i \in \mathbb{N})\quad \gamma_i\big(\mathbb{E}[f(x_i)] - f(x)\big) \le \dfrac{1}{2}\mathbb{E}[\|x_i - x\|^2] - \dfrac{1}{2}\mathbb{E}[\|x_{i+1} - x\|^2] + \dfrac{B^2}{2}\gamma_i^2.$   (114)

So, summing from $m$ to $k$, we have

  $\sum_{i=m}^{k}\gamma_i\big(\mathbb{E}[f(x_i)] - f(x)\big) \le \dfrac{1}{2}\mathbb{E}[\|x_m - x\|^2] + \dfrac{B^2}{2}\sum_{i=m}^{k}\gamma_i^2.$

Dividing the above inequality by $\sum_{i=m}^{k}\gamma_i$ yields (109).
(iii): We first note that, since $f$ is convex and $\bar x_k$ is a convex combination of the $x_i$'s, with coefficients $\eta_i = \gamma_i/\sum_{j=0}^{k}\gamma_j$, $0 \le i \le k$, we have $\mathbb{E}[f(\bar x_k)] \le \sum_{i=0}^{k}\eta_i\,\mathbb{E}[f(x_i)]$. Moreover, $f_k = \sum_{i=0}^{k}\eta_i f_k \le \sum_{i=0}^{k}\eta_i\,\mathbb{E}[f(x_i)]$. Therefore,

  $(\forall\, k \in \mathbb{N})\quad h_k := \max\{f_k, \mathbb{E}[f(\bar x_k)]\} \le \Big(\sum_{i=0}^{k}\gamma_i\Big)^{-1}\sum_{i=0}^{k}\gamma_i\,\mathbb{E}[f(x_i)].$   (115)
Let x ∈ C. Then it follows from (109) and (115) that lim supk h k ≤ f (x). Since x is
arbitrary in C, we have lim supk h k ≤ inf C f . Moreover, clearly we have inf C f ≤
lim inf k h k . Therefore, h k → inf C f . Since inf C f ≤ f k ≤ h k and inf C f ≤
E[ f (x̄k )] ≤ h k , the statement follows.
Clearly $\varphi$ is closed, convex, and differentiable on $\mathbb{R}_{++} \times \mathbb{R}_{++}^n$, and, for all $(t, \gamma) \in \mathbb{R}_{++} \times \mathbb{R}_{++}^n$,

  $\nabla\varphi(t, \gamma) = \Big(-\dfrac{\alpha + \beta\|\gamma\|^2}{2t^2},\ \dfrac{\beta}{t}\gamma\Big).$   (116)

Then,

  $\inf_{\gamma\in\mathbb{R}_{++}^n}\ \dfrac{\alpha}{2\langle a, \gamma\rangle} + \dfrac{\beta\|\gamma\|^2}{2\langle a, \gamma\rangle} = \inf_{t>0}\ \inf_{\substack{\gamma\in\mathbb{R}_{++}^n\\ \langle a,\gamma\rangle = t}}\ \dfrac{\alpha + \beta\|\gamma\|^2}{2t} = \inf_{\substack{(t,\gamma)\in\mathbb{R}_{++}\times\mathbb{R}_{++}^n\\ \langle a,\gamma\rangle = t}}\ \varphi(t, \gamma),$

Now, denoting by $s$ the multiplier of the constraint $\langle a, \gamma\rangle = t$, it follows from the last two stationarity equations that $-\beta = -\beta\langle a, \gamma\rangle/t = s\|a\|^2$, and hence

  $\begin{cases} \dfrac{\alpha + \beta\|\gamma\|^2}{2t^2} = \dfrac{\beta}{\|a\|^2}\\[1ex] \gamma = \dfrac{t}{\|a\|^2}\,a\\[1ex] \langle a, \gamma\rangle = t. \end{cases}$
Corollary 80 Under the same assumptions of Theorem 77, the following hold.
(i) Suppose that $\operatorname{argmin}_C f \ne \varnothing$ and let $D \ge \operatorname{dist}(x_0, \operatorname{argmin}_C f)$ and $k \in \mathbb{N}$. Then,

  $\max\{f_k, \mathbb{E}[f(\bar x_k)]\} - \min_C f \le \dfrac{D^2}{2\sum_{i=0}^{k}\gamma_i} + \dfrac{B^2\sum_{i=0}^{k}\gamma_i^2}{2\sum_{i=0}^{k}\gamma_i}.$   (117)

Moreover, minimizing the right-hand side of (117) over the $\gamma_i$'s yields

  $\max\{f_k, \mathbb{E}[f(\bar x_k)]\} - \min_C f \le \dfrac{BD}{\sqrt{k+1}}.$

(ii) Let, for every $k \in \mathbb{N}$, $\gamma_k = \bar\gamma/(k+1)$. Then $f_k \to \inf_C f$ and $\mathbb{E}[f(\bar x_k)] \to \inf_C f$. Moreover, if $\operatorname{argmin}_C f \ne \varnothing$, we have, for every $k \in \mathbb{N}$,

  $\max\{f_k, \mathbb{E}[f(\bar x_k)]\} - \min_C f \le \Big(\dfrac{\operatorname{dist}(x_0, \operatorname{argmin}_C f)^2}{2\bar\gamma} + \dfrac{\pi^2\bar\gamma B^2}{12}\Big)\dfrac{1}{\log(k+1)}.$   (118)

(iii) Let, for every $k \in \mathbb{N}$, $\gamma_k = \bar\gamma/\sqrt{k+1}$. Then $f_k \to \inf_C f$ and $\mathbb{E}[f(\bar x_k)] \to \inf_C f$. Moreover, if $\operatorname{argmin}_C f \ne \varnothing$, for every integer $k \ge 2$, we have

  $\max\{f_k, \mathbb{E}[f(\bar x_k)]\} - \min_C f \le \dfrac{\operatorname{dist}(x_0, \operatorname{argmin}_C f)^2}{2\bar\gamma}\,\dfrac{1}{\sqrt{k+1}} + \bar\gamma B^2\,\dfrac{\log(k+1)}{\sqrt{k+1}}.$   (119)

(iv) Let, for every $k \in \mathbb{N}$, $\gamma_k = \bar\gamma/\sqrt{k+1}$, and suppose that $C$ is bounded with diameter $\bar D > 0$ and that $\operatorname{argmin}_C f \ne \varnothing$. Set, for every $k \in \mathbb{N}$, $\tilde f_k = \min_{\lfloor k/2\rfloor \le i \le k}\mathbb{E}[f(x_i)]$ and $\tilde x_k = \big(\sum_{i=\lfloor k/2\rfloor}^{k}\gamma_i\big)^{-1}\sum_{i=\lfloor k/2\rfloor}^{k}\gamma_i x_i$. Then, for every integer $k \ge 2$,

  $\max\{\tilde f_k, \mathbb{E}[f(\tilde x_k)]\} - \min_C f \le \Big(\dfrac{3\bar D^2}{2\bar\gamma} + \dfrac{5\bar\gamma B^2}{2}\Big)\dfrac{1}{\sqrt{k+1}}.$   (120)
Proof (i): Equation (117) follows from (115) and by minimizing the right-hand side of (109), with $m = 0$, w.r.t. $x \in \operatorname{argmin}_C f$. Now, it follows from Lemma 79 that the minimum of the right-hand side of (117) over the $\gamma_i$'s is $BD/\sqrt{k+1}$, and it is achieved at $(\gamma_i)_{0\le i\le k} \equiv D/(B\sqrt{k+1})$. Note that in this case $\bar x_k = (k+1)^{-1}\sum_{i=0}^{k} x_i$.
(ii): We derive from Lemma 78(i), with $m = 1$, that $\sum_{i=0}^{k}\gamma_i = \bar\gamma\sum_{i=1}^{k+1}(1/i) \ge \bar\gamma\log(k+1)$. Moreover, we have $\sum_{i=0}^{k}\gamma_i^2 = \bar\gamma^2\sum_{i=1}^{k+1} 1/i^2 \le \bar\gamma^2\pi^2/6$. So, the first part follows from Theorem 77(iii), while the inequality in (118) follows from (117) with $D = \operatorname{dist}(x_0, \operatorname{argmin}_C f)$.
(iii): Lemma 78(ii), with $m = 1$, yields $\sum_{i=1}^{k} 1/\sqrt{i} \ge 2(\sqrt{k} - 1) + \frac{1}{2}(1 + 1/\sqrt{k}) \ge 2\sqrt{k} - 3/2$. Moreover, $2\sqrt{k} - 3/2 \ge \sqrt{k}$ for $k \ge 3$ and, clearly, for $k \le 2$, $\sum_{i=1}^{k} 1/\sqrt{i} \ge \sqrt{k}$. Therefore, for every $k \in \mathbb{N}$, $\sum_{i=0}^{k}\gamma_i = \bar\gamma\sum_{i=1}^{k+1} 1/\sqrt{i} \ge \bar\gamma\sqrt{k+1}$. Moreover, by Lemma 78(i), we have $\sum_{i=1}^{k} 1/i = 1 + \sum_{i=2}^{k} 1/i \le 1 + \log k \le 2\log k$ for $k \ge 3$. Therefore, for every integer $k \ge 2$, $\sum_{i=0}^{k}\gamma_i^2 = \bar\gamma^2\sum_{i=1}^{k+1} 1/i \le 2\bar\gamma^2\log(k+1)$. Again, the first part follows from Theorem 77(iii), while (119) follows from (117) with $D = \operatorname{dist}(x_0, \operatorname{argmin}_C f)$.
(iv): Let $k \in \mathbb{N}$, $k \ge 2$. It follows from Lemma 78(i) that

  $\sum_{i=\lfloor k/2\rfloor}^{k}\gamma_i^2 = \bar\gamma^2\sum_{i=\lfloor k/2\rfloor+1}^{k+1}\dfrac{1}{i} \le \bar\gamma^2\log\dfrac{k+1}{\lfloor k/2\rfloor} \le \bar\gamma^2\log 4 \le \dfrac{5}{3}\bar\gamma^2$

and

  $\sum_{i=\lfloor k/2\rfloor}^{k}\gamma_i = \bar\gamma\sum_{i=\lfloor k/2\rfloor+1}^{k+1}\dfrac{1}{\sqrt{i}} \ge 2\bar\gamma\big(\sqrt{k+1} - \sqrt{\lfloor k/2\rfloor+1}\big) \ge 2\bar\gamma\sqrt{k+1}\Big(1 - \sqrt{\dfrac{\lfloor k/2\rfloor+1}{k+1}}\Big).$

The statement follows from Theorem 77(ii), with $m = \lfloor k/2\rfloor$ and $x \in \operatorname{argmin}_C f$, taking into account that, as in (115), $\max\{\tilde f_k, \mathbb{E}[f(\tilde x_k)]\} \le \big(\sum_{i=\lfloor k/2\rfloor}^{k}\gamma_i\big)^{-1}\sum_{i=\lfloor k/2\rfloor}^{k}\gamma_i\,\mathbb{E}[f(x_i)]$.
Example 81 A case in which the above stochastic algorithm arises is the incremental subgradient method. We aim at solving

  $\min_{x\in C}\ f(x) := \dfrac{1}{m}\sum_{j=1}^{m} f_j(x),$

for $k = 0, 1, \dots$
  choose an index $j_k \in \{1, \dots, m\}$ at random,
  $x_{k+1} = P_C\big(x_k - \gamma_k\underbrace{\tilde\nabla f_{j_k}(x_k)}_{\hat u_k}\big).$   (121)

Since $\partial f = (1/m)\sum_{j=1}^{m}\partial f_j$, we have that $(1/m)\sum_{j=1}^{m}\tilde\nabla f_j(x) \in \partial f(x)$. Let $k \in \mathbb{N}$. Then $x_k$ is a random variable, depending on $j_0, \dots, j_{k-1}$. Hence, $\hat u_k := \tilde\nabla f_{j_k}(x_k)$ is a random variable, where $x_k$ and $j_k$ are independent random variables, and Fact 75 yields

  $u_k := \mathbb{E}\big[\tilde\nabla f_{j_k}(x_k) \mid x_k\big] = \dfrac{1}{m}\sum_{j=1}^{m}\tilde\nabla f_j(x_k) \in \partial f(x_k)$

and

  $\mathbb{E}\big[\|\tilde\nabla f_{j_k}(x_k)\|^2 \mid x_k\big] = \dfrac{1}{m}\sum_{j=1}^{m}\|\tilde\nabla f_j(x_k)\|^2 \le \dfrac{1}{m}\sum_{j=1}^{m} L_j^2,$

and hence $\mathbb{E}[\|\tilde\nabla f_{j_k}(x_k)\|^2] \le (1/m)\sum_{j=1}^{m} L_j^2$. In the end, the assumptions of Theorem 77 are satisfied with $B^2 = (1/m)\sum_{j=1}^{m} L_j^2$.
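A minimal sketch of the incremental subgradient method (121), on hypothetical robust-regression losses $f_j(x) = |\langle a_j, x\rangle - b_j|$ (which are $\|a_j\|$-Lipschitz), with all data and names of our choosing:

```python
import numpy as np

rng = np.random.default_rng(3)

m, d = 50, 5
A = rng.standard_normal((m, d))
x_true = rng.standard_normal(d)
b = A @ x_true                       # so min_x (1/m) sum_j |<a_j, x> - b_j| = 0

def subgrad_fj(x, j):
    # sign(<a_j, x> - b_j) * a_j is a subgradient of f_j(x) = |<a_j, x> - b_j|
    return np.sign(A[j] @ x - b[j]) * A[j]

def proj_C(x):
    return np.clip(x, -5.0, 5.0)     # C = box [-5, 5]^d

x = np.zeros(d)
for k in range(20000):
    j = rng.integers(m)                                      # random index j_k, as in (121)
    x = proj_C(x - (0.5 / np.sqrt(k + 1)) * subgrad_fj(x, j))

final_loss = np.mean(np.abs(A @ x - b))
```

With the $O(1/\sqrt{k})$ step sizes of Corollary 80(iii), the averaged or best objective value decays at the rate $O(\log k/\sqrt{k})$; on this noiseless instance the loss approaches its optimal value $0$.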
Example 82 (Stochastic optimization) We generalize the previous example. We consider the following optimization problem

for $k = 0, 1, \dots$
  $x_{k+1} = P_C\big(x_k - \gamma_k\underbrace{\tilde\nabla\varphi(x_k, \zeta_k)}_{\hat u_k}\big).$   (123)

Therefore, $f$ is Lipschitz continuous with constant $\int_Z L(z)\,d\mu(z) \le \big(\int_Z L(z)^2\,d\mu(z)\big)^{1/2}$. Moreover, assumption (SO$_3$) implies that

  for all $x, y \in X$ and for $\mu$-a.e. $z \in Z$: $\varphi(y, z) \ge \varphi(x, z) + \langle y - x, \tilde\nabla\varphi(x, z)\rangle.$   (124)

Note that all terms of the above inequality are $\mu$-summable; in particular, since $\|\tilde\nabla\varphi(x, z)\| \le L(z)$ and $L(z)$ is $\mu$-summable, $\tilde\nabla\varphi(x, \cdot)$ is $\mu$-summable. Hence, integrating (124) w.r.t. $\mu$ we get

  $(\forall\, x, y \in X)\quad f(y) \ge f(x) + \Big\langle y - x, \int_Z\tilde\nabla\varphi(x, z)\,d\mu(z)\Big\rangle.$

Therefore, for every $x \in X$, $\mathbb{E}[\tilde\nabla\varphi(x, \zeta)] \in \partial f(x)$. Now, let $k \in \mathbb{N}$, $k \ge 1$. Then, it follows from (123) that $x_k = x_k(\zeta_0, \dots, \zeta_{k-1})$; hence $x_k$ and $\zeta_k$ are independent random variables. Therefore, Fact 75(v) yields that $u_k := \mathbb{E}[\tilde\nabla\varphi(x_k, \zeta_k) \mid x_k] = \int_Z\tilde\nabla\varphi(x_k, z)\,d\mu(z) \in \partial f(x_k)$ and

  $\mathbb{E}\big[\|\tilde\nabla\varphi(x_k, \zeta_k)\|^2 \mid x_k\big] = \int_Z\|\tilde\nabla\varphi(x_k, z)\|^2\,d\mu(z) \le \int_Z L(z)^2\,d\mu(z) < +\infty,$

and hence $\mathbb{E}[\|\tilde\nabla\varphi(x_k, \zeta_k)\|^2] \le \int_Z L(z)^2\,d\mu(z)$. In the end, Theorem 77 applies with $B^2 = \int_Z L(z)^2\,d\mu(z)$, so that the stochastic algorithm (123) provides a solution to problem (122).
We address again problem (105), where now $f$ is Lipschitz smooth, and we consider a stochastic version of Algorithm 1. In the following, we set $F = f + g$.
Algorithm 4 (The stochastic proximal gradient method) Let $x_0 \in X$ and let $(\gamma_k)_{k\in\mathbb{N}}$ be a sequence in $\mathbb{R}_{++}$. Then,

for $k = 0, 1, \dots$
  $\hat u_k$ is a square summable $X$-valued random vector s.t. $\mathbb{E}[\hat u_k \mid x_k] = \nabla f(x_k)$,
  $x_{k+1} = \mathrm{prox}_{\gamma_k g}(x_k - \gamma_k \hat u_k).$   (125)

Moreover, define, for every $k \in \mathbb{N}$,

  $F_k = \min_{0\le i\le k}\mathbb{E}[F(x_{i+1})], \qquad \bar x_k = \Big(\sum_{i=0}^{k}\gamma_i\Big)^{-1}\sum_{i=0}^{k}\gamma_i\, x_{i+1}.$
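Algorithm 4 can be sketched with $\hat u_k$ a mini-batch gradient, so that $\mathbb{E}[\hat u_k \mid x_k] = \nabla f(x_k)$, and the step sizes $\gamma_k = \bar\gamma/\sqrt{k+1}$ with $\bar\gamma \le 1/L$ used in the theorem below. The lasso-type instance and all names are illustrative, not from the chapter.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative instance: f(x) = (1/2m) ||Ax - b||^2, g = lam * ||.||_1.
m, d = 200, 30
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)
lam = 0.05
L = np.linalg.norm(A, 2) ** 2 / m            # Lipschitz constant of grad f

prox_g = lambda v, gam: np.sign(v) * np.maximum(np.abs(v) - gam * lam, 0.0)

def minibatch_grad(x, batch=50):
    # Gradient on a uniformly drawn mini-batch: E[hat u_k | x_k] = grad f(x_k).
    idx = rng.integers(m, size=batch)
    return A[idx].T @ (A[idx] @ x - b[idx]) / batch

x = np.zeros(d)
gamma_bar = 1.0 / L                          # gamma_bar <= 1/L, as in the theorem
for k in range(5000):
    gamma = gamma_bar / np.sqrt(k + 1)       # gamma_k = gamma_bar / sqrt(k + 1)
    x = prox_g(x - gamma * minibatch_grad(x), gamma)

F = lambda x: 0.5 * np.mean((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
```

The decaying step size is what controls the gradient noise; with a constant step the iterates would only reach a noise-dominated neighborhood of the solution.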
The following theorem gives the main convergence results about the algorithm. The statements of Theorem 77 remain valid in expectation, with the constant $B^2$ replaced by $\sigma^2$, and $f_k$, $\mathbb{E}[f(x_k)]$, and $\inf_C f$ replaced by $F_k$, $\mathbb{E}[F(x_k)]$, and $\inf F$, respectively. In particular, the following hold.
(i) Suppose that $\sum_{k\in\mathbb{N}}\gamma_k = +\infty$ and that $\sum_{i=0}^{k}\gamma_i^2\big/\sum_{i=0}^{k}\gamma_i \to 0$. Then $F_k \to \inf F$, $\liminf_k \mathbb{E}[F(x_k)] = \inf F$ and $\mathbb{E}[F(\bar x_k)] \to \inf F$.
(ii) Suppose that $S_* := \operatorname{argmin} F \ne \varnothing$ and let, for every $k \in \mathbb{N}$, $\gamma_k = \bar\gamma/\sqrt{k+1}$, with $\bar\gamma \le 1/L$. Then, for every integer $k \ge 2$,

  $\max\{F_{k+1}, \mathbb{E}[F(\bar x_{k+1})]\} - \min F \le \dfrac{\operatorname{dist}(x_0, S_*)^2}{2\bar\gamma}\,\dfrac{1}{\sqrt{k+1}} + \bar\gamma\sigma^2\,\dfrac{\log(k+1)}{\sqrt{k+1}}.$
Proof Since $\gamma_k \le 1/L$ for every $k \in \mathbb{N}$, it follows from Lemma 45 that, for every $(x, y) \in X^2$, $z \in \operatorname{dom}\partial g$, and every $\eta \in \partial g(z)$, we have

  $F(x) \ge F(z) + \langle x - z, \nabla f(y) + \eta\rangle - \dfrac{1}{2\gamma_k}\|z - y\|^2.$   (126)

Let $x \in X$. Applying the previous inequality with $z = x_{k+1}$, $\eta = \gamma_k^{-1}(x_k - x_{k+1}) - \hat u_k$, and $y = x_k$, we obtain

  $F(x) \ge F(x_{k+1}) + \Big\langle x - x_{k+1}, \nabla f(x_k) - \hat u_k + \dfrac{x_k - x_{k+1}}{\gamma_k}\Big\rangle - \dfrac{1}{2\gamma_k}\|x_{k+1} - x_k\|^2$   (127)

and thus, setting $(\forall k \in \mathbb{N})$ $\tilde x_{k+1} = \mathrm{prox}_{\gamma_k g}(x_k - \gamma_k\nabla f(x_k))$,

  $F(x_{k+1}) - F(x) \le \langle x - x_{k+1}, \hat u_k - \nabla f(x_k)\rangle - \Big\langle x - x_{k+1}, \dfrac{x_k - x_{k+1}}{\gamma_k}\Big\rangle + \dfrac{1}{2\gamma_k}\|x_{k+1} - x_k\|^2$
  $= \langle x - x_{k+1}, \hat u_k - \nabla f(x_k)\rangle + \dfrac{1}{2\gamma_k}\big[-2\langle x - x_{k+1}, x_k - x_{k+1}\rangle + \|x_{k+1} - x_k\|^2\big]$
  $= \langle x - x_{k+1}, \hat u_k - \nabla f(x_k)\rangle + \dfrac{1}{2\gamma_k}\big[\|x_k - x\|^2 - \|x_{k+1} - x\|^2\big]$
  $= \langle x - \tilde x_{k+1}, \hat u_k - \nabla f(x_k)\rangle + \langle \tilde x_{k+1} - x_{k+1}, \hat u_k - \nabla f(x_k)\rangle + \dfrac{1}{2\gamma_k}\big[\|x_k - x\|^2 - \|x_{k+1} - x\|^2\big].$   (128)

We next want to take the conditional expectation of this inequality. To this aim, we first prove by induction that $\|x_k\|$ and $\|\nabla f(x_k)\|$ are square summable and $F(x_k)$ is summable. The statement is clearly true for $k = 0$. Suppose that it holds for some $k \ge 0$. Then it follows from (128) and the nonexpansivity of $\mathrm{prox}_{\gamma_k g}$ that $\|x_{k+1}\|$ is square summable and $F(x_{k+1})$ is summable. Moreover, since $\nabla f$ is Lipschitz continuous, we have $\|\nabla f(x_{k+1})\| \le L\|x_{k+1} - x\| + \|\nabla f(x)\|$, which implies that $\|\nabla f(x_{k+1})\|$ is square summable too. Taking the conditional expectation given $x_k$ in (128) and recalling that $\mathbb{E}[\hat u_k \mid x_k] = \nabla f(x_k)$, we get

  $\mathbb{E}[\|x_{k+1} - x\|^2 \mid x_k] + 2\gamma_k\,\mathbb{E}[F(x_{k+1}) - F(x) \mid x_k] \le \|x_k - x\|^2 + 2\gamma_k^2\sigma^2.$   (131)

The above inequality is the same as (113), except that $F(x_k)$ and $B^2$ are replaced by $F(x_{k+1})$ and $\sigma^2$, respectively. The proof thus continues essentially as that of Theorem 77.
  $\operatorname*{minimize}_{x\in X}\ F(x) = f(x) + g(x), \qquad g(x) = \sum_{i=1}^{m} g_i(x_i),$   (133)

where $X$ is the direct sum of $m$ separable real Hilbert spaces $(X_i)_{1\le i\le m}$, i.e.,

  $X = \bigoplus_{i=1}^{m} X_i$ and $(\forall\, x = (x_1, \dots, x_m),\ y = (y_1, \dots, y_m) \in X)\quad \langle x, y\rangle = \sum_{i=1}^{m}\langle x_i, y_i\rangle$

for $k = 0, 1, \dots$
  for $i = 1, \dots, m$
    $x_i^{k+1} = \begin{cases} \mathrm{prox}_{\gamma_{i_k} g_{i_k}}\big(x_{i_k}^k - \gamma_{i_k}\nabla_{i_k} f(x^k)\big) & \text{if } i = i_k\\ x_i^k & \text{if } i \ne i_k, \end{cases}$   (134)

where $(i_k)_{k\in\mathbb{N}}$ are independent random variables taking values in $\{1, \dots, m\}$, with $p_i := \mathrm{P}(i_k = i) > 0$ for all $i \in \{1, \dots, m\}$.
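Algorithm 5 can be sketched with single coordinates as blocks, on an illustrative lasso objective: $f(x) = \frac12\|Ax - b\|^2$, $g_i(x_i) = \lambda|x_i|$, block constants $L_i = \|Ae_i\|^2$, steps $\gamma_i = 1/L_i < 2/L_i$, and uniform probabilities $p_i = 1/d$. Maintaining the residual $Ax - b$ makes each partial gradient cheap. All data and names are ours.

```python
import numpy as np

rng = np.random.default_rng(5)

n_samples, d = 40, 8
A = rng.standard_normal((n_samples, d))
b = rng.standard_normal(n_samples)
lam = 0.1
Li = np.sum(A ** 2, axis=0)              # block Lipschitz constants L_i = ||A e_i||^2
gamma = 1.0 / Li                         # gamma_i = 1/L_i < 2/L_i

x = np.zeros(d)
residual = A @ x - b                     # maintained so partial gradients are cheap
for k in range(4000):
    i = rng.integers(d)                  # i_k uniform, i.e. p_i = 1/d
    grad_i = A[:, i] @ residual          # partial gradient  nabla_i f(x)
    v = x[i] - gamma[i] * grad_i
    new_xi = np.sign(v) * max(abs(v) - gamma[i] * lam, 0.0)   # prox of gamma_i * g_i
    residual += A[:, i] * (new_xi - x[i])
    x[i] = new_xi

F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
```

At the lasso optimum, each partial gradient satisfies $|\nabla_i f(x)| \le \lambda$, with equality on the support, which gives a cheap optimality check.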
Moreover, we set

  $\Gamma^{-1} = \bigoplus_{i=1}^{m}\dfrac{1}{\gamma_i}\mathrm{Id}_i, \qquad \langle x, y\rangle_{\Gamma^{-1}} = \sum_{i=1}^{m}\dfrac{1}{\gamma_i}\langle x_i, y_i\rangle$   (136)

and

  $W = \bigoplus_{i=1}^{m}\dfrac{1}{\gamma_i p_i}\mathrm{Id}_i, \qquad \langle x, y\rangle_W = \sum_{i=1}^{m}\dfrac{1}{\gamma_i p_i}\langle x_i, y_i\rangle.$   (137)

  $\hat g(x, \xi) = \dfrac{1}{p_\xi}\, g_\xi(x_\xi).$   (138)

Then, clearly $\mathbb{E}[\hat g(x, \xi)] = \sum_{i=1}^{m} g_i(x_i) = g(x)$. Moreover,

  $\mathrm{prox}^{W}_{\hat g(\cdot,\xi)}(x) = \operatorname{argmin}_{y\in X}\ \dfrac{1}{p_\xi} g_\xi(y_\xi) + \dfrac{1}{2}\|y - x\|_W^2 = \operatorname{argmin}_{y\in X}\ \dfrac{1}{p_\xi} g_\xi(y_\xi) + \dfrac{1}{2\gamma_\xi p_\xi}\|y_\xi - x_\xi\|^2 + \sum_{i\ne\xi}\dfrac{1}{2\gamma_i p_i}\|y_i - x_i\|^2$

and hence

  $(\forall\, i \in \{1, \dots, m\})\quad \big[\mathrm{prox}^{W}_{\hat g(\cdot,\xi)}(x)\big]_i = \begin{cases} x_i & \text{if } i \ne \xi\\ \mathrm{prox}_{\gamma_\xi g_\xi}(x_\xi) & \text{if } i = \xi. \end{cases}$   (139)

  $x^{k+1} = \mathrm{prox}^{W}_{\hat g(\cdot, i_k)}\big(x^k - \hat\nabla^{W}_{i_k} f(x^k)\big).$   (141)
Proof (i) $\Rightarrow$ (ii): Let $L$ be a Lipschitz constant of $\nabla f$. Then (ii) holds with $(L_i)_{1\le i\le m} \equiv L$.
(ii) $\Rightarrow$ (i): Let, for every $i \in [m]$, $q_i = L_i\big/\sum_{j=1}^{m} L_j$. Then $(q_i)_{1\le i\le m} \in \mathbb{R}^m_+$ and $\sum_{i=1}^{m} q_i = 1$. Let $x, v \in X$. Then

  $f(x + v) = f\Big(x + \sum_{i=1}^{m} J_i(v_i)\Big)$
  $= f\Big(\sum_{i=1}^{m} q_i\big(x + q_i^{-1}J_i(v_i)\big)\Big)$
  $\le \sum_{i=1}^{m} q_i\, f\big(x + q_i^{-1}J_i(v_i)\big)$
  $\le \sum_{i=1}^{m} q_i\Big[f(x) + q_i^{-1}\langle v_i, \nabla_i f(x)\rangle + \dfrac{L_i}{2}\|q_i^{-1}v_i\|^2\Big]$
  $= f(x) + \langle v, \nabla f(x)\rangle + \sum_{i=1}^{m}\dfrac{L_i}{2q_i}\|v_i\|^2$
  $= f(x) + \langle v, \nabla f(x)\rangle + \dfrac{\sum_{i=1}^{m} L_i}{2}\|v\|^2.$

Therefore, Fact 1(ii) yields that $\nabla f$ is Lipschitz continuous.
  $\langle z - x, x - x^+\rangle + \langle x - x^+, x - x^+\rangle \le \psi(z) - \psi(x^+) + \langle z - x, \nabla\varphi(x)\rangle + \langle x - x^+, \nabla\varphi(x)\rangle,$

and hence the claimed inequality follows. Since $\langle z - x, \nabla\varphi(x)\rangle \le \varphi(z) - \varphi(x) - (\mu_\varphi/2)\|z - x\|^2$, the statement follows.
Now we set

  $\bar x^{k+1} = \big(\mathrm{prox}_{\gamma_i g_i}(x_i^k - \gamma_i\nabla_i f(x^k))\big)_{1\le i\le m}, \qquad \Delta^k = x^k - \bar x^{k+1}.$   (144)

Note that

  $\bar x_{i_k}^{k+1} = \mathrm{prox}_{\gamma_{i_k} g_{i_k}}\big(x_{i_k}^k - \gamma_{i_k}\nabla_{i_k} f(x^k)\big) = x_{i_k}^{k+1}, \qquad \Delta_{i_k}^k = x_{i_k}^k - x_{i_k}^{k+1},$   (145)

and

  $\dfrac{x_{i_k}^k - x_{i_k}^{k+1}}{\gamma_{i_k}} - \nabla_{i_k} f(x^k) \in \partial g_{i_k}\big(x_{i_k}^{k+1}\big).$   (146)
Proposition 88 Let $f$ and $g$ satisfy the assumptions of Sect. 4.3. Let $(L_i)_{1\le i\le m}$ be the block-Lipschitz constants of the partial gradients $\nabla_i f$ as defined in Proposition 85. Let $(\gamma_i)_{1\le i\le m} \in \mathbb{R}^m_{++}$ be such that $\gamma_i < 2/L_i$. Set $\delta = \max_{1\le i\le m}\gamma_i L_i$ and $p_{\min} = \min_{1\le i\le m} p_i$. Let $(x^k)_{k\in\mathbb{N}}$ be generated by Algorithm 5. Then, for all $x \in X$,

  $\langle x - x^k, x^k - \bar x^{k+1}\rangle_{\Gamma^{-1}} \le \dfrac{1}{p_{\min}}\,\mathbb{E}\big[F(x^k) - F(x^{k+1}) \mid i_0, \dots, i_{k-1}\big] + F(x) - F(x^k) + \dfrac{\delta - 2}{2}\,\|x^k - \bar x^{k+1}\|^2_{\Gamma^{-1}}.$   (147)

Proof First note that $\bar x^{k+1} = \mathrm{prox}_g(x^k - \nabla f(x^k))$, where the prox and the gradient are computed in the weighted norm $\|\cdot\|_{\Gamma^{-1}}$. Then Lemma 87, written in the norm $\|\cdot\|_{\Gamma^{-1}}$, yields (148). Next, we have

  $\dfrac{1}{p_{i_k}}\big[g_{i_k}(x_{i_k}^k) - g_{i_k}(\bar x_{i_k}^{k+1}) + \langle x_{i_k}^k - \bar x_{i_k}^{k+1}, \nabla_{i_k} f(x^k)\rangle\big]$
  $= \dfrac{1}{p_{i_k}}\big[g(x^k) - g(x^{k+1}) + \langle x^k - x^{k+1}, \nabla f(x^k)\rangle\big]$
  $= \dfrac{1}{p_{\min}}\big[g(x^k) - g(x^{k+1}) + \langle x^k - x^{k+1}, \nabla f(x^k)\rangle\big] - \Big(\dfrac{1}{p_{\min}} - \dfrac{1}{p_{i_k}}\Big)\underbrace{\big[g_{i_k}(x_{i_k}^k) - g_{i_k}(x_{i_k}^{k+1}) + \langle x_{i_k}^k - x_{i_k}^{k+1}, \nabla_{i_k} f(x^k)\rangle\big]}_{\ge 0}$
  $\le \dfrac{1}{p_{\min}}\big[g(x^k) - g(x^{k+1}) + \langle x^k - x^{k+1}, \nabla f(x^k)\rangle\big] - \Big(\dfrac{1}{p_{\min}} - \dfrac{1}{p_{i_k}}\Big)\dfrac{1}{\gamma_{i_k}}\|\Delta_{i_k}^k\|^2,$

where the nonnegativity of the underbraced bracket and the last inequality follow from (146), which gives

  $-\big[g_{i_k}(x_{i_k}^k) - g_{i_k}(x_{i_k}^{k+1}) + \langle x_{i_k}^k - x_{i_k}^{k+1}, \nabla_{i_k} f(x^k)\rangle\big] \le -\dfrac{1}{\gamma_{i_k}}\|\Delta_{i_k}^k\|^2.$   (149)
Now, we derive from the block-coordinate descent lemma (142), and the fact that $x^k$ and $x^{k+1}$ differ only in the $i_k$-th component, that

  $\dfrac{p_{\min}}{2}(2 - \delta)\,\mathbb{E}\big[\|x^{k+1} - x^k\|_W^2 \mid i_0, \dots, i_{k-1}\big] \le \mathbb{E}\big[F(x^k) - F(x^{k+1}) \mid i_0, \dots, i_{k-1}\big],$   (155)

which, plugged into (154) with $x \equiv x \in \operatorname{dom} F$, gives (iii). Moreover, taking the expectation in (155), we obtain

  $\dfrac{p_{\min}}{2}(2 - \delta)\,\mathbb{E}\big[\|x^{k+1} - x^k\|_W^2\big] \le \mathbb{E}[F(x^k)] - \mathbb{E}[F(x^{k+1})],$   (156)

which gives (i). Finally, set, for all $k \in \mathbb{N}$, $\xi_k = \mathbb{E}\big[F(x^k) - F(x^{k+1}) \mid i_0, \dots, i_{k-1}\big] \ge 0$. Then

  $\mathbb{E}\Big[\sum_{k=0}^{+\infty}\xi_k\Big] = \sum_{k=0}^{+\infty}\mathbb{E}[\xi_k] = \sum_{k=0}^{+\infty}\big(\mathbb{E}[F(x^k)] - \mathbb{E}[F(x^{k+1})]\big) \le \mathbb{E}[F(x^0)] - \inf_{k\in\mathbb{N}}\mathbb{E}[F(x^k)].$

This shows that, if $\inf_{k\in\mathbb{N}}\mathbb{E}[F(x^k)] > -\infty$, then $\sum_{k=0}^{+\infty}\xi_k$ is P-integrable and hence P-a.s. finite. Then (ii) follows from (155) and Proposition 89.
Proof It follows from (144) that $(x_i^k(\omega) - \bar x_i^{k+1}(\omega))/\gamma_i - \nabla_i f(x^k(\omega)) \in \partial g_i(\bar x_i^{k+1}(\omega))$, for all $i \in [m]$ and $\omega \in \Omega$. Hence

  $\|v^k(\omega)\| \le \dfrac{1}{\gamma_{\min}}\|x^k(\omega) - y^k(\omega)\| + \|\nabla f(y^k(\omega)) - \nabla f(x^k(\omega))\|.$

Now, since $F$ is bounded from below, Proposition 90(ii) yields that $(\|y^k - x^k\|^2_{\Gamma^{-1}})_{k\in\mathbb{N}}$ is summable P-a.s., and hence $y^k - x^k \to 0$ P-a.s. The statement follows from the fact that $\nabla f$ is Lipschitz continuous (see Proposition 85).
(c) there exists $\hat\Omega$ with $\mathrm{P}(\hat\Omega) = 1$ such that, for every $\omega \in \hat\Omega$, every weak cluster point of $(x^k(\omega))_{k\in\mathbb{N}}$ belongs to $S$.
Let $W \subseteq S$ be countable and dense in $S$, and let $\tilde\Omega = \bigcap_{w\in W}\Omega_w$. Then $\mathrm{P}(\tilde\Omega) = 1$ and, for every $\omega \in \tilde\Omega$ and every $w \in W$, the limit $\lim_k\|x^k(\omega) - w\|$ exists.
Note that
Taking the limit for $k \to +\infty$ and recalling that $w_k \to z$, we get that $\lim_k\|x^k(\omega) - z\|$ exists. So we proved that, for every $\omega \in \tilde\Omega$, the limit $\lim_k\|x^k(\omega) - z\|$ exists. Now set $\bar\Omega := \tilde\Omega \cap \hat\Omega$. Then, for every $\omega \in \bar\Omega$, we have both that: for every $z \in S$, $\lim_k\|x^k(\omega) - z\|$ exists; and every weak cluster point of $(x^k(\omega))_{k\in\mathbb{N}}$ belongs to $S$.
Now we give the main convergence result, which extends to the stochastic setting the convergence rate of the (deterministic) proximal gradient algorithm given in Theorem 47.
where

  $\xi_k = b_1\,\mathbb{E}\big[F(x^k) - F(x^{k+1}) \mid i_0, \dots, i_{k-1}\big], \qquad b_1 = \dfrac{2\max\{1, (2-\delta)^{-1}\}}{p_{\min}} - 1.$

  $2\big(\mathbb{E}[F(x^{k+1})] - F(x)\big) \le \mathbb{E}[\|x^k - x\|_W^2] - \mathbb{E}[\|x^{k+1} - x\|_W^2] + \mathbb{E}[\xi_k].$   (160)

Since $(\mathbb{E}[F(x^k)])_{k\in\mathbb{N}}$ is decreasing, $\mathbb{E}[F(x^k)] \to \inf_{k\in\mathbb{N}}\mathbb{E}[F(x^k)] \ge F_*$. Thus, statement (i) is true if $\inf_{k\in\mathbb{N}}\mathbb{E}[F(x^k)] = -\infty$. Suppose that $\inf_{k\in\mathbb{N}}\mathbb{E}[F(x^k)] > -\infty$ and let $x \in \operatorname{dom} F$. Then, the right-hand side of (160), being summable, converges to zero. Therefore, $F_* \le \lim_{k\to+\infty}\mathbb{E}[F(x^{k+1})] \le F(x)$. Since $x$ is arbitrary in $\operatorname{dom} F$, (i) follows. Let $x \in S_*$. Then, $F(x) = F_*$ and (160) yields

  $2\sum_{k\in\mathbb{N}}\big(\mathbb{E}[F(x^{k+1})] - F_*\big) \le \mathbb{E}\|x^0 - x\|^2 + \sum_{k\in\mathbb{N}}\mathbb{E}[\xi_k] \le \|x^0 - x\|^2 + b_1\big(F(x^0) - F_*\big).$

Therefore, we have $\sum_{k\in\mathbb{N}}\big(\mathbb{E}[F(x^{k+1})] - F_*\big) \le \big(\|x^0 - x\|^2 + b_1(F(x^0) - F_*)\big)/2$. Since $(\mathbb{E}[F(x^{k+1})] - F_*)_{k\in\mathbb{N}}$ is decreasing, the first part of statement (ii) follows from Fact 46. Concerning the convergence of the iterates, we will use the stochastic Opial Lemma 92. Let $x \in \operatorname{argmin} F$; then it follows from (159) that the first condition of Lemma 92 holds.
Then, it follows from (161) that $y^{n_k}(\omega) \rightharpoonup x$ and, since $\partial F$ is weak-strong closed, that $0 \in \partial F(x)$. Therefore, the two conditions of the stochastic Opial Lemma 92 are satisfied with $S = \operatorname{argmin} F$, and hence the statement follows.
Stochastic methods in optimization were initiated by Robbins and Monro [95], Kiefer and Wolfowitz [61], and Ermoliev [47]. These methods are nowadays very popular due to applications in deep learning [22]. The projected stochastic subgradient method was studied in [44, 80]. In recent years, rates of convergence for the last iterate were also derived [105]. The proximal stochastic gradient method, which explicitly assumes Lipschitz continuity of the gradient, was studied in [2, 100]. The worst-case convergence rate in expectation of the proximal stochastic gradient method is much worse than that of the proximal gradient method. Recently, variance reduction techniques have been studied to improve the convergence behavior of stochastic methods [59], at the cost of keeping previously computed gradients in memory. These techniques are particularly useful for empirical risk minimization problems; see [42, 56] and references therein. Randomized strategies in block coordinate descent methods were popularized by Nesterov in [84]. Since then, a number of works have appeared, extending and improving the analysis in several respects. We cite, among others, [35, 79, 94, 103, 114].
5 Dual Algorithms
In this section, we show how proximal gradient algorithms can be used on the dual problem to derive new algorithmic solutions for the primal one.
We consider the same setting as in Sect. 2.6. Here we additionally assume that $f$ is strongly convex with modulus of convexity $\mu > 0$. In this situation, it follows from Fact 13 that $f^*$ is differentiable on $X$ and $\nabla f^*$ is $1/\mu$-Lipschitz continuous. Moreover, since $f$ is strongly convex, the primal problem (P) admits a (unique) solution, say $\hat x$. We also assume that the calculus rule for subdifferentials (15) holds. Thus, in view of Fact 14, a dual solution $\hat u$ also exists, the duality gap is zero, and the following KKT conditions hold

  $-A^*\hat u \in \partial f(\hat x) \quad\text{and}\quad \hat u \in \partial g(A\hat x).$   (162)

So, in this case, a dual solution uniquely determines the primal solution. Actually, the map $u \mapsto \nabla f^*(-A^*u)$ provides a way to pass from the dual space $Y$ to the primal space $X$. See Fig. 2. The following proposition tells us even more.
Proposition 94 Under the notation of Sect. 2.6, let u ∈ Y and set x = ∇f*(−A*u). Then

$$\frac{\mu}{2}\,\|x - \hat x\|^2 \;\le\; \Psi(u) - \Psi(\hat u),$$

where Ψ denotes the objective function of the dual problem (D).
Proof It follows from the KKT conditions (162), Fact 11, and the definition of x that

$$f(\hat x) + f^*(-A^*\hat u) = \langle \hat x, -A^*\hat u\rangle \quad\text{and}\quad f(x) + f^*(-A^*u) = \langle x, -A^*u\rangle.$$

Thus the duality gap function bounds the suboptimality of the primal and dual objectives. We have the following theorem.
Theorem 95 Under the notation of Sect. 2.6, suppose that R(A) ⊂ dom ∂g. Then the following hold:

(i) Suppose that g* is α-strongly convex. Let u ∈ dom g* and set x = ∇f*(−A*u). Then,

$$G(x,u) \;\le\; \Big(1 + \frac{\|A\|^2}{\alpha\mu}\Big)\big(\Psi(u) - \inf\Psi\big). \tag{163}$$

(ii) Suppose that g is L-Lipschitz continuous. Let u ∈ dom g* be such that Ψ(u) − inf Ψ < ‖A‖²L²/μ and set x = ∇f*(−A*u). Then, we have

$$G(x,u) \;\le\; \frac{2\|A\|L}{\mu^{1/2}}\,\big(\Psi(u) - \inf\Psi\big)^{1/2}. \tag{164}$$
Proof Let u ∈ dom g* and let x = ∇f*(−A*u). Since R(A) ⊂ dom ∂g, we have ∂g(Ax) ≠ ∅. Let v ∈ ∂g(Ax). Then we first prove that, for every s ∈ [0,1],

$$\Psi(u) - \inf\Psi \;\ge\; sG(x,u) + \frac{s}{2}\Big(\alpha(1-s) - \frac{s\|A\|^2}{\mu}\Big)\|u - v\|^2. \tag{165}$$

Indeed, since ∇f* is (1/μ)-Lipschitz continuous, the descent lemma gives

$$f^*\big({-A^*u} - sA^*(v-u)\big) - f^*(-A^*u) \;\le\; -s\big\langle A^*(v-u), \nabla f^*(-A^*u)\big\rangle + \frac{s^2\|A\|^2}{2\mu}\|v-u\|^2, \tag{166}$$

while the α-strong convexity of g* yields

$$g^*\big(u + s(v-u)\big) - g^*(u) \;\le\; s\big(g^*(v) - g^*(u)\big) - \frac{\alpha s(1-s)}{2}\|u-v\|^2. \tag{167}$$

Moreover, since v ∈ ∂g(Ax) and x = ∇f*(−A*u), it follows from Fact 11 that

$$G(x,u) = g^*(u) - g^*(v) - \langle Ax, u - v\rangle. \tag{168}$$

Adding (166) and (167), recalling that ∇f*(−A*u) = x, and using (168), we obtain Ψ(u + s(v−u)) − Ψ(u) ≤ −sG(x,u) − (s/2)(α(1−s) − s‖A‖²/μ)‖u−v‖², and (165) follows since Ψ(u + s(v−u)) ≥ inf Ψ.

(i): Take s = αμ/(αμ + ‖A‖²) ∈ ]0,1], so that α(1−s) − s‖A‖²/μ = 0 and (165) gives sG(x,u) ≤ Ψ(u) − inf Ψ, which is (163).

(ii): Using (165) with α = 0, we get, for every s ∈ ]0,1],

$$sG(x,u) \;\le\; \Psi(u) - \inf\Psi + \frac{s^2\|A\|^2}{2\mu}\|u-v\|^2.$$

Since g is L-Lipschitz continuous, both u and v lie in dom g* ⊂ B̄(0, L), and bounding ‖u − v‖² accordingly we obtain

$$G(x,u) \;\le\; \inf_{s\in\left]0,1\right]}\ \frac1s\big(\Psi(u) - \inf\Psi\big) + \frac{s}{\mu}\|A\|^2 L^2.$$

Since, if 0 < a < b, min_{s∈]0,1]}(a/s + bs) = 2√(ab), the statement follows.
such that Ψ(u_k) → inf Ψ, then the sequence (x_k)_{k∈N}, defined as x_k = ∇f*(−A*u_k), is converging (possibly also in function values) to the solution of the primal problem. In particular, we have

$$\|x_k - \hat x\|^2 \;\le\; \frac{2}{\mu}\big(\Psi(u_k) - \inf\Psi\big) \to 0.$$
Since the gradient of the term f*(−A*·) in (D) is Lipschitz continuous with constant ‖A‖²/μ, the proximal gradient algorithm applied to (D) leads to the following.

Algorithm 6 (Dual proximal gradient algorithm) Let u_0 ∈ Y and 0 < γ < 2μ/‖A‖². Then,

for k = 0, 1, ...
    x_k = ∇f*(−A* u_k)                                   (169)
    u_{k+1} = prox_{γg*}(u_k + γ A x_k).
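As an illustration, the following sketch (our own NumPy example, not from the text) runs Algorithm 6 on the instance f(x) = ½‖x − b‖² and g = ‖·‖₁, so that μ = 1, ∇f*(v) = v + b, g* is the indicator of [−1,1]^m, and prox_{γg*} is a componentwise clipping; the duality gap is used as a self-check:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 20, 30
A = rng.standard_normal((m, d))   # A : R^d -> R^m
b = rng.standard_normal(d)

# Primal: min_x 0.5*||x - b||^2 + ||A x||_1   (f is 1-strongly convex, mu = 1).
# Dual:   min_u 0.5*||A^T u||^2 - <A^T u, b> + indicator of [-1,1]^m.
gamma = 1.9 / np.linalg.norm(A, 2) ** 2       # 0 < gamma < 2*mu/||A||^2

u = np.zeros(m)
for _ in range(20000):
    x = b - A.T @ u                               # x_k = grad f*(-A^* u_k)
    u = np.clip(u + gamma * (A @ x), -1.0, 1.0)   # prox_{gamma g*} = projection

x = b - A.T @ u
primal = 0.5 * np.sum((x - b) ** 2) + np.sum(np.abs(A @ x))
dual = 0.5 * np.sum((A.T @ u) ** 2) - (A.T @ u) @ b   # dual objective at u
print(primal + dual)   # duality gap, tends to 0
```

The gap `primal + dual` is nonnegative and vanishes at a primal–dual solution, so the quality of the iterates can be monitored without knowing the optimum.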
Then, since Theorem 47(iv) ensures that Ψ(u_k) − Ψ(û) = o(1/(k+1)), we have

$$\|x_k - \hat x\| = o\big(1/\sqrt{k+1}\big).$$
Similarly, we can apply Algorithm 2 to the dual problem (D), and this yields the following dual algorithm.

Algorithm 7 (Dual accelerated proximal gradient algorithm) Let 0 < γ ≤ μ/‖A‖² and let (t_k)_{k∈N} ∈ R^N be defined as in Proposition 64 with 1 − c ≥ 2√b. Let u_0 = v_0 ∈ Y and define

for k = 0, 1, ...
    y_k = ∇f*(−A* v_k)
    u_{k+1} = prox_{γg*}(v_k + γ A y_k)                  (170)
    β_{k+1} = (t_k − 1)/t_{k+1}
    v_{k+1} = u_{k+1} + β_{k+1}(u_{k+1} − u_k).
$$g\colon Y := Y_1\times\cdots\times Y_m \to \left]-\infty,+\infty\right],\qquad g(y_1,\dots,y_m) = \sum_{i=1}^m g_i(y_i), \tag{172}$$
for k = 0, 1, ...
    x^k = ∇f*(−A* u^k)
    for i = 1, ..., m
        u_i^{k+1} = prox_{γ_i g_i^*}(u_i^k + γ_i A_i x^k)  if i = i_k,  and  u_i^{k+1} = u_i^k  if i ≠ i_k,   (173)

where (i_k)_{k∈N} are independent random variables taking values in {1,...,m} with p_i := P(i_k = i) > 0 for all i ∈ {1,...,m}.
Remark Note that in the setting of Algorithm 8, the primal problem can be written as

$$\min_{x\in X}\ \sum_{i=1}^m g_i(A_i x) + f(x). \tag{174}$$
$$x^{k+1} = -HA^*u^{k+1} = -HA^*\big(u^k + J_{i_k}(u^{k+1}_{i_k} - u^k_{i_k})\big) = x^k - HA^*_{i_k}\big(u^{k+1}_{i_k} - u^k_{i_k}\big).$$
for k = 0, 1, ...
    u_i^{k+1} = prox_{γ_i g_i^*}(u_i^k + γ_i A_i x^k)  if i = i_k,  and  u_i^{k+1} = u_i^k  if i ≠ i_k        (175)
    x^{k+1} = x^k − H A_{i_k}^*(u_{i_k}^{k+1} − u_{i_k}^k).
This shows that Algorithm 8 can be used as an incremental stochastic method for the minimization of (174), in which at each iteration one selects at random a single component in the sum (say i_k) and uses only the knowledge related to that component (A_{i_k}, A*_{i_k}, g*_{i_k}, γ_{i_k}) to update the iterate.
for k = 0, 1, ...
    x_k = ∇f*(−A* u_k)                                   (176)
    u_{k+1} = u_k + γ(A x_k − b),
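For a quick numerical check of iteration (176), consider (our own instance, not from the text) the linearly constrained problem min ½‖x‖² subject to Ax = b, whose solution is the minimum-norm solution of the linear system; here f = ½‖·‖², so μ = 1 and ∇f*(v) = v:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 10, 25
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)

gamma = 1.9 / np.linalg.norm(A, 2) ** 2   # step size < 2*mu/||A||^2
u = np.zeros(m)
for _ in range(20000):
    x = -A.T @ u                  # x_k = grad f*(-A^* u_k)
    u = u + gamma * (A @ x - b)   # dual gradient step (176)

# Minimum-norm solution of Ax = b, for comparison.
x_star = A.T @ np.linalg.solve(A @ A.T, b)
print(np.linalg.norm(x - x_star))
```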
Proposition 94 is standard, while Theorem 95 was essentially given (in a less explicit form) in [45]. Dual algorithms have been proposed several times in the literature. We mention among others the works [28, 37] for deterministic algorithms, and [107] for stochastic algorithms in the context of machine learning. The dual accelerated proximal gradient Algorithm 7 was presented in [15] with the standard choice of the parameters t_k given by the first of (78). The gradient descent on the dual of the linearly constrained optimization problem described in Example 96 coincides, up to a change of variables, with the linearized Bregman method studied in a series of papers, see [86, 116] and references therein.
6 Applications
In this section, we present three main applications where convex optimization plays
a key role, providing fundamental tools and computational solutions.
In many applications throughout science and engineering, one often needs to solve ill-posed inverse problems, where the number of available measurements is smaller than the dimension of the vector (signal) to be estimated. More formally, the setting is the following: given an observation y ∈ Rⁿ and a linear measurement process A : R^d → Rⁿ, the goal is to recover a vector x∗ ∈ R^d such that Ax∗ = y, under the assumption that d ≫ n. In general, more than one solution of the above problem exists, but reconstruction of x∗ is often possible since, in many practical situations, the vectors of interest are sparse, namely they have only a few nonzero entries or few degrees of freedom compared to their dimension. In compressed sensing it is shown that reconstruction of sparse vectors is not only feasible in theory, but efficient algorithms also exist to perform the reconstruction in practice. One of the most popular strategies is basis pursuit, which consists in solving the following convex optimization problem

$$\min_{Ax = y}\ \|x\|_1. \tag{$P_1$}$$
In the presence of noise, one assumes ‖Ax∗ − y‖ ≤ δ for some noise level δ ≥ 0 and considers the constrained problem

$$\min_{\|Ax - y\|\le\delta}\ \|x\|_1. \tag{$P_{1,\delta}$}$$

Then, the constrained problem (P_{1,δ}) is usually transformed into a penalized problem (see Fig. 3),

$$\min_{x\in\mathbb{R}^d}\ \frac12\|Ax - y\|^2 + \lambda\|x\|_1, \tag{178}$$
which is advantageous from the algorithmic point of view. It is possible to show that
the problems (P1,δ ) and (178) are equivalent, for suitable choices of the regularization
parameter.
Proposition 97 Let A ∈ Rn×d and let y ∈ Rn . Then the following hold:
(i) If x is a minimizer of (178) with λ > 0, then there exists δ = δ(x) ≥ 0 such that
x is a minimizer of (P1,δ ).
(ii) If x is a minimizer of (P1,δ ) with δ ≥ 0, then there exists λ = λ(x) ≥ 0 such that
x is a minimizer of (178).
Proof Fermat's rule for problem (178) yields A*(y − Ax) ∈ λ∂‖·‖₁(x), that is,

$$(\forall\, i\in\{1,\dots,d\})\qquad \big(A^*(y-Ax)\big)_i \in \lambda\,\partial|\cdot|(x_i) = \begin{cases}\{\lambda\operatorname{sign}(x_i)\} & \text{if } x_i \neq 0\\ [-\lambda,\lambda] & \text{if } x_i = 0.\end{cases}$$

This shows that 0 is a minimizer of (178) if and only if ‖A*y‖_∞ ≤ λ. Moreover, if ‖A*y‖_∞ > λ and x is a minimizer of (178), then x ≠ 0 and λ = ‖A*(Ax − y)‖_∞ (so λ is uniquely determined by any minimizer).
Now, problem (P_{1,δ}) can be equivalently written as

$$\min_{x\in\mathbb{R}^d}\ \|x\|_1 + \iota_{B_\delta(y)}(Ax),$$

which, by Fermat's rule, is equivalent to the existence of u ∈ ∂ι_{B_δ(y)}(Ax) such that

$$-A^*u \in \partial\|\cdot\|_1(x). \tag{179}$$

Recall that

$$\partial\iota_{B_\delta(y)}(Ax) = N_{B_\delta(y)}(Ax) = \begin{cases}\{0\} & \text{if } \|Ax - y\| < \delta\\ \mathbb{R}_+(Ax - y) & \text{if } \|Ax - y\| = \delta.\end{cases}$$

If ‖Ax − y‖ < δ, then u = 0 and hence 0 ∈ ∂‖·‖₁(x), which yields x = 0. Therefore, since 0 is not a minimizer of (P_{1,δ}), necessarily ‖Ax − y‖ = δ, and equation (179) is equivalent to the existence of α ≥ 0 such that u = α(Ax − y); moreover α > 0, for otherwise x = 0,
which yields

$$\exists\,\alpha > 0 \ \text{s.t.}\ \forall\, i\in\{1,\dots,d\}\colon\quad \big(A^*(y-Ax)\big)_i \in \frac{1}{\alpha}\,\partial|\cdot|(x_i) = \begin{cases}\{\alpha^{-1}\operatorname{sign}(x_i)\} & \text{if } x_i \neq 0\\ [-\alpha^{-1},\alpha^{-1}] & \text{if } x_i = 0.\end{cases}$$

Taking into account the above equations, one can see that if x is a minimizer of (178), then x is a minimizer of (P_{1,δ}) with δ = ‖Ax − y‖ and, vice versa, if x is a minimizer of (P_{1,δ}), then x is a minimizer of (178) with λ = ‖A*(Ax − y)‖_∞.
Remark 98 Analogous equivalence results relate (P_{1,δ}) and (178) to another constrained problem:

$$\min_{\|x\|_1\le\tau}\ \|Ax - y\|^2,\qquad \tau > 0.$$
The proximal gradient algorithm applied to problem (178) reads

for k = 0, 1, ...
    x_{k+1} = soft_{γλ}(x_k − γ A*(A x_k − y)),          (180)

while its accelerated version is

for k = 0, 1, ...
    u_k = x_k + ((t_{k−1} − 1)/t_k)(x_k − x_{k−1})       (181)
    x_{k+1} = soft_{γλ}(u_k − γ A*(A u_k − y)).
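In NumPy, the two iterations can be sketched as follows (the random problem instance, step size, and names are our own; soft_{γλ} is the componentwise soft-thresholding operator):

```python
import numpy as np

def soft(x, t):
    """Componentwise soft-thresholding, the proximity operator of t*||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(1)
m, d = 40, 100
A = rng.standard_normal((m, d))
x_true = rng.standard_normal(d) * (rng.random(d) < 0.1)   # sparse ground truth
y = A @ x_true
lam = 0.5
gamma = 1.0 / np.linalg.norm(A, 2) ** 2

# ISTA, iteration (180)
x = np.zeros(d)
for _ in range(10000):
    x = soft(x - gamma * A.T @ (A @ x - y), gamma * lam)

# Accelerated version (181), with the standard t_k sequence
xf, x_old, t = np.zeros(d), np.zeros(d), 1.0
for _ in range(10000):
    t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    v = xf + ((t - 1.0) / t_new) * (xf - x_old)
    x_old, xf = xf, soft(v - gamma * A.T @ (A @ v - y), gamma * lam)
    t = t_new

# At a minimizer of (178), Fermat's rule gives ||A^T (y - A x)||_inf <= lam.
print(np.max(np.abs(A.T @ (y - A @ x))))
```

The optimality condition from the proof of Proposition 97 provides a convenient stopping test.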
where γ_i < 2/‖a^i‖². Then, Theorem 93 ensures that E[F(x_k)] − inf F = o(1/k) and that there exists a random vector x∗ taking values in the solution set of problem (178) such that x_k → x∗ almost surely.
One of the most popular denoising models in imaging is based on the total variation regularizer and is known under the name "ROF" (Rudin, Osher, and Fatemi). We consider a scalar-valued digital image x ∈ R^{m×n} of size m × n pixels. A standard approach for defining the discrete total variation is to use a finite difference scheme acting on the pixels. The discrete gradient operator D : R^{m×n} → (R²)^{m×n} is defined by (Dx)_{i,j} = ((D₁x)_{i,j}, (D₂x)_{i,j}),
where

$$(D_1x)_{i,j} = \begin{cases} x_{i+1,j} - x_{i,j} & \text{if } 1 \le i \le m-1\\ 0 & \text{if } i = m,\end{cases}\qquad (D_2x)_{i,j} = \begin{cases} x_{i,j+1} - x_{i,j} & \text{if } 1 \le j \le n-1\\ 0 & \text{if } j = n.\end{cases}$$
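The operators D₁, D₂ and the adjoint D* (a discrete negative divergence) are easy to implement; the following sketch (our own code, not from the text) checks the adjoint identity ⟨Dx, u⟩ = ⟨x, D*u⟩ numerically and estimates ‖D‖² by power iteration:

```python
import numpy as np

def D(x):
    """Discrete gradient R^{m x n} -> (R^2)^{m x n}: forward differences with
    zero last row/column, as in the definitions of D1 and D2 above."""
    d1 = np.zeros_like(x); d1[:-1, :] = x[1:, :] - x[:-1, :]
    d2 = np.zeros_like(x); d2[:, :-1] = x[:, 1:] - x[:, :-1]
    return np.stack([d1, d2])

def Dt(u):
    """Adjoint D*: (R^2)^{m x n} -> R^{m x n} (a negative discrete divergence)."""
    d1, d2 = u
    out = np.zeros_like(d1)
    out[:-1, :] -= d1[:-1, :]; out[1:, :] += d1[:-1, :]
    out[:, :-1] -= d2[:, :-1]; out[:, 1:] += d2[:, :-1]
    return out

rng = np.random.default_rng(0)
x, u = rng.standard_normal((8, 9)), rng.standard_normal((2, 8, 9))
print(np.allclose(np.sum(D(x) * u), np.sum(x * Dt(u))))  # adjoint identity

# Power iteration on D*D to estimate ||D||^2 (which is bounded by 8).
v = rng.standard_normal((8, 9))
for _ in range(200):
    v = Dt(D(v)); v /= np.linalg.norm(v)
print(np.sum(v * Dt(D(v))))
```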
where y ∈ R^{m×n} is the given noisy image, and the discrete total variation is defined by

$$\|Dx\|_{2,1} = \sum_{i,j}\|(Dx)_{i,j}\|_2 = \sum_{i,j}\big((D_1x)_{i,j}^2 + (D_2x)_{i,j}^2\big)^{1/2},$$
that is, the ℓ¹-norm of the 2-norms of the pixelwise image gradients. We can interpret the total variation regularization from a sparsity point of view, establishing analogies with the lasso approach in (178). Indeed, the ℓ¹-norm induces sparsity in the gradients of the image. More precisely, this regularizer can be interpreted as a group lasso one (see Example 42), where each group includes the two directional derivatives at each pixel. Hence, this norm favors vectors with sparse gradients, namely piecewise constant images. This favorable property, also known as the staircasing effect, has however some drawbacks in applications, and other regularizations have been proposed. In the following we describe an algorithm to solve (183).
Computing the solution of

$$\min_{x\in\mathbb{R}^{m\times n}}\ \lambda\|Dx\|_{2,1} + \frac12\|x - y\|_2^2 \tag{184}$$

is equivalent to computing the proximity operator of the total variation, which is not available in closed form. Here we show how to solve the above problem by a dual algorithm. Indeed, the problem is of the form (P) considered by the Fenchel–Rockafellar duality theory, with f(x) = (1/2)‖x − y‖²,

$$g(v) = \lambda\|v\|_{2,1} = \lambda\sum_{i,j}\|v_{i,j}\|_2,\qquad v = (v_{i,j})_{1\le i\le m,\,1\le j\le n},\ v_{i,j}\in\mathbb{R}^2,$$
and A = D. We first compute a bound on ‖D‖, since it will be useful later to set the steplength. For every x ∈ R^{m×n},

$$\|Dx\|^2 = \sum_{\substack{1\le i<m\\ 1\le j\le n}} (x_{i+1,j} - x_{i,j})^2 + \sum_{\substack{1\le i\le m\\ 1\le j<n}} (x_{i,j+1} - x_{i,j})^2 \le 2\sum_{\substack{1\le i<m\\ 1\le j\le n}}\big(x_{i+1,j}^2 + x_{i,j}^2\big) + 2\sum_{\substack{1\le i\le m\\ 1\le j<n}}\big(x_{i,j+1}^2 + x_{i,j}^2\big) \le 8\|x\|^2,$$

so that ‖D‖² ≤ 8.
The dual problem (D) then reads

$$\min_{u\in(\mathbb{R}^2)^{m\times n}}\ \frac12\big(\|y - D^*u\|^2 - \|y\|^2\big) + \iota_{B_\lambda(0)^{m\times n}}(u),\qquad B_\lambda(0)\subset\mathbb{R}^2, \tag{185}$$

since

$$f^*(z) = \frac12\big(\|z + y\|_2^2 - \|y\|_2^2\big).$$
Moreover, g(v) = Σ_{i,j} λ‖v_{i,j}‖₂ = Σ_{i,j} σ_{B_λ(0)}(v_{i,j}), which shows that g is separable. Then it follows from Fact 9(iii) that g* is separable as well, so

$$g^*(v) = \sum_{i,j}\iota_{B_\lambda(0)}(v_{i,j}) = \iota_{B_\lambda(0)^{m\times n}}(v).$$
Finally, since ∇f*(z) = z + y, the way one goes from the dual variable u ∈ (R²)^{m×n} to the primal variable x ∈ R^{m×n} is through the formula

$$x = \nabla f^*(-D^*u) = y - D^*u.$$

The dual proximal gradient algorithm (Algorithm 6) then reads

for k = 0, 1, ...
    x^{(k)} = y − D* u^{(k)}                             (186)
    u^{(k+1)} = P_{B_λ(0)^{m×n}}(u^{(k)} + γ D x^{(k)}),
where γ < 1/4 ≤ 2/‖D‖². Note also that the projection onto B_λ(0)^{m×n} is separable too and can be computed as

$$P_{B_\lambda(0)^{m\times n}}(u) = \big(P_{B_\lambda(0)}(u_{i,j})\big)_{\substack{1\le i\le m\\ 1\le j\le n}},\qquad P_{B_\lambda(0)}(u_{i,j}) = \begin{cases} u_{i,j} & \text{if } \|u_{i,j}\|_2 \le \lambda\\[4pt] \lambda\,\dfrac{u_{i,j}}{\|u_{i,j}\|_2} & \text{if } \|u_{i,j}\|_2 > \lambda.\end{cases}$$
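A minimal NumPy sketch of iteration (186) on a small synthetic image (our own instance; the operators D and D* are re-implemented so that the snippet is self-contained), using the nonnegative duality gap λ‖Dx‖_{2,1} − ⟨Dx, u⟩ as a self-check:

```python
import numpy as np

def D(x):
    d1 = np.zeros_like(x); d1[:-1, :] = x[1:, :] - x[:-1, :]
    d2 = np.zeros_like(x); d2[:, :-1] = x[:, 1:] - x[:, :-1]
    return np.stack([d1, d2])

def Dt(u):
    d1, d2 = u
    out = np.zeros_like(d1)
    out[:-1, :] -= d1[:-1, :]; out[1:, :] += d1[:-1, :]
    out[:, :-1] -= d2[:, :-1]; out[:, 1:] += d2[:, :-1]
    return out

def proj_ball(u, lam):
    """Pixelwise projection onto the ball B_lam(0) in R^2."""
    norms = np.maximum(np.sqrt(np.sum(u ** 2, axis=0)), lam)
    return lam * u / norms

rng = np.random.default_rng(0)
# Noisy piecewise-constant 8x8 image.
y = np.kron(rng.random((2, 2)), np.ones((4, 4))) + 0.1 * rng.standard_normal((8, 8))
lam, gamma = 0.2, 0.24            # gamma < 1/4 <= 2/||D||^2

u = np.zeros((2, 8, 8))
for _ in range(20000):
    x = y - Dt(u)                  # x^{(k)} = y - D* u^{(k)}
    u = proj_ball(u + gamma * D(x), lam)

x = y - Dt(u)
tv = np.sum(np.sqrt(np.sum(D(x) ** 2, axis=0)))
gap = lam * tv - np.sum(D(x) * u)  # duality gap, >= 0 and tends to 0
print(gap)
```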
Similarly, applying the accelerated Algorithm 7 to the dual problem yields

for k = 0, 1, ...
    x^{(k)} = y − D* u^{(k)}
    u^{(k+1)} = P_{B_λ(0)^{m×n}}(v^{(k)} + γ D z^{(k)})  (187)
    v^{(k+1)} = u^{(k+1)} + β_{k+1}(u^{(k+1)} − u^{(k)})
    z^{(k+1)} = x^{(k+1)} + β_{k+1}(x^{(k+1)} − x^{(k)}).
With the choice of parameters as in Theorem 68, from the results in Sect. 5 we derive that the sequence (x^{(k)})_{k∈N} converges to the minimizer of (184) with rate ‖x^{(k)} − x̂‖ = O(1/k).
Finally, we specialize the randomized proximal gradient Algorithm 5. Note that condition (ii) in Proposition 85 is satisfied with L_{i,j} = √17. Then Algorithm 5 (assuming that each block is made of one R² component only and that (i_k, j_k) is uniformly distributed on {1,...,m} × {1,...,n}) writes as

for k = 0, 1, ...
    x^{(k)} = x^{(k−1)} + D*(u^{(k−1)} − u^{(k)})        (188)
    u^{(k+1)} = u^{(k)} + J_{(i_k,j_k)}( P_{B_λ(0)}( u^{(k)}_{i_k,j_k} + γ_{i_k,j_k}(D x^{(k)})_{i_k,j_k} ) − u^{(k)}_{i_k,j_k} ),

where γ_{i,j} < 2/√17 and J_{(i_k,j_k)} : R² → (R²)^{m×n} is the canonical injection. Then, denoting by x∗ the unique solution of (184), Theorem 93 and the results in Sect. 5 ensure that E[‖x^{(k)} − x∗‖²] = o(1/√k).
In statistical machine learning we are given two random variables ξ and η, with values in X and Y ⊂ R, respectively, with joint distribution μ. We let ℓ : X × Y × R → R be a convex loss function; the goal is to find a function h : X → Y in a given hypothesis function space which minimizes the expected risk R(h) = E[ℓ(ξ, η, h(ξ))], without knowing the distribution μ, but based on a sequence (ξ_k, η_k)_{k∈N} of independent copies of (ξ, η).
In this problem, concerning the hypothesis function space, one option is that of considering reproducing kernel Hilbert spaces (RKHS). They are indeed defined through kernel functions and are flexible enough to model even infinite-dimensional function spaces. They are defined as follows. We let Φ : X → H be a general (feature) map from the input space X to a separable Hilbert space H, endowed with a scalar product ⟨·,·⟩ and norm ‖·‖. Then the corresponding RKHS is defined as
which is supposed to be solved via some sequence (ξk , ηk )k∈N of independent copies
of (ξ, η).
In order to approach problem (191) we consider two strategies. The first one con-
sists in considering the problem as an instance of a stochastic optimization problem
as described in Example 82. The second one is to consider a regularized empirical
version of (191) based on the available sample. In the following, we describe these
two approaches.
$$(\forall\, w_1, w_2\in H)(\forall\, z = (x,y)\in X\times Y)\qquad |\varphi(w_1,z) - \varphi(w_2,z)| \le \alpha|\langle w_1 - w_2, \Phi(x)\rangle| \le \alpha\|\Phi(x)\|\,\|w_1 - w_2\|.$$

Hence, conditions (SO1)–(SO2) in Example 82 hold with L(z) = α‖Φ(x)‖. Moreover,

$$(\forall\, z\in Z)(\forall\, w\in H)\qquad \partial\varphi(w,z) = \partial\ell(z,\cdot)\big(\langle w,\Phi(x)\rangle\big)\,\Phi(x), \tag{192}$$

where ∂φ(w,z) = ∂φ(·,z)(w) and ℓ(z,·) denotes the function t ↦ ℓ(x,y,t). Now let, for every (z,t) ∈ Z × R, ℓ̃(z,t) be a subgradient of ℓ(z,·) at t and define
If we define h_k(x) = ⟨w_k, Φ(x)⟩ and the kernel K(x, x′) = ⟨Φ(x), Φ(x′)⟩, then it follows from (193) that
Moreover, set

$$\bar w_k = \Big(\sum_{i=0}^k \gamma_i\Big)^{-1}\sum_{i=0}^k \gamma_i w_i,\qquad \bar h_k(x) = \langle \bar w_k, \Phi(x)\rangle = \Big(\sum_{i=0}^k \gamma_i\Big)^{-1}\sum_{i=0}^k \gamma_i g_i(x).$$
Then the risk of h̄_k is R(w̄_k), and according to Theorem 77 we have that R(w̄_k) → inf_H R; moreover, if S∗ := argmin_H R ≠ ∅, D ≥ dist(w_0, S∗), and γ_k = γ̄/√(k+1), we have

$$(\forall\, k\in\mathbb{N})\qquad E[R(\bar w_k)] - \min_H R \;\le\; \frac{D^2}{2\bar\gamma\sqrt{k+1}} + \bar\gamma B^2\,\frac{1 + \log(k+1)}{\sqrt{k+1}},$$
Note that algorithm (194) is fully practicable, since it depends only on the kernel function K and on the data (ξ_k, η_k). In the following, we provide a list of 1-Lipschitz continuous losses:

• the hinge loss: Y = {−1, 1} and ℓ(x, y, t) = max{0, 1 − yt};
• the logistic loss for classification: Y = {−1, 1} and ℓ(x, y, t) = log(1 + e^{−yt});
• the L¹-loss: Y = R and ℓ(x, y, t) = |y − t|;
• the logistic loss for regression: Y = R and ℓ(x, y, t) = −log( 4e^{y−t}/(1 + e^{y−t})² );
• the ε-insensitive loss: Y = R and ℓ(x, y, t) = max{0, |y − t| − ε}.
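Since the explicit form of algorithm (194) is determined by (193) above, the following is only one plausible instantiation for the hinge loss (our own sketch, with an assumed linear kernel and synthetic data); it maintains the kernel expansion coefficients of w_k and of the averaged iterate w̄_k explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
X = np.vstack([rng.normal(2, 1, (n // 2, 2)), rng.normal(-2, 1, (n // 2, 2))])
Y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])
K = X @ X.T                       # linear kernel K(x, x') = <x, x'>

c = np.zeros(n)                   # w_k = sum_i c_i Phi(x_i), so h_k(x_j) = (K c)_j
c_bar, gsum = np.zeros(n), 0.0    # coefficients of the weighted average w_bar_k
for k in range(2000):
    gamma = 1.0 / np.sqrt(k + 1)
    i = rng.integers(n)           # sample one training pair
    if Y[i] * (K[i] @ c) < 1.0:   # hinge subgradient at the sampled pair is -Y[i]
        c[i] += gamma * Y[i]      # w <- w - gamma * subgrad * Phi(x_i)
    c_bar = c_bar + gamma * (c - c_bar) / (gsum + gamma)  # running weighted average
    gsum += gamma

acc = np.mean(np.sign(K @ c_bar) == Y)
print(acc)
```

With a nonlinear kernel one would only replace the Gram matrix K; the iteration itself never needs the feature map Φ.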
The second approach is based on the minimization of the regularized empirical risk

$$\min_{w\in H}\ \frac{\lambda}{n}\sum_{i=1}^n \ell\big(y_i, \langle w, \Phi(x_i)\rangle\big) + \frac12\|w\|^2 =: F(w), \tag{196}$$

where (x_i, y_i)_{1≤i≤n} are realizations of the random variables (ξ_i, η_i)_{1≤i≤n}, we assume for simplicity that the loss function is ℓ : Y × R → R₊ (convex in the second variable), and λ > 0 is a regularization parameter. Essentially, the goal here is to find a function h = ⟨w, Φ(·)⟩ that best fits the data (x_i, y_i)_{1≤i≤n} according to the given loss ℓ. Depending on the choice of the loss function, the techniques take different names. If ℓ is the square loss, that is, Y = R and ℓ(s, t) = (s − t)², one talks about ridge regression. If ℓ is the Vapnik ε-insensitive loss, that is, Y = R and ℓ(s, t) = max{|s − t| − ε, 0}, then we have support vector regression. Finally, if ℓ is the hinge loss, that is, Y = {−1, 1} and ℓ(s, t) = (1 − st)₊, then we get support vector machines. Another important loss for classification is the logistic loss, which is defined as ℓ(s, t) = log(1 + e^{−st}).
We are going to compute the dual problem of (196) in the sense of Fenchel–Rockafellar (see Sect. 2.6). Define the operator

$$\Phi(X)\colon H \to \mathbb{R}^n,\qquad \Phi(X)w = \begin{bmatrix}\langle w, \Phi(x_1)\rangle\\ \vdots\\ \langle w, \Phi(x_n)\rangle\end{bmatrix}\in\mathbb{R}^n$$
and the functions

$$g\colon\mathbb{R}^n\to\mathbb{R},\quad g(z) = \frac{\lambda}{n}\sum_{i=1}^n \ell(y_i, -z_i),\qquad\text{and}\qquad f\colon H\to\mathbb{R},\quad f(w) = \frac12\|w\|^2. \tag{197}$$
Then problem (196) can be written as

$$\min_{w\in H}\ g\big(-\Phi(X)w\big) + f(w),$$

which is of the form (P) considered in Sect. 2.6 with A = −Φ(X), and the corresponding KKT optimality conditions follow from Sect. 2.6. Note that

$$(\forall\,\alpha\in\mathbb{R}^n)\qquad \Phi(X)^*\alpha = \sum_{i=1}^n \alpha_i\Phi(x_i),$$

so that
$$f^*\big(\Phi(X)^*\alpha\big) = \frac12\big\|\Phi(X)^*\alpha\big\|^2 = \frac12\Big\|\sum_{i=1}^n \alpha_i\Phi(x_i)\Big\|^2 = \frac12\sum_{i,j=1}^n \alpha_i\alpha_j\big\langle\Phi(x_i), \Phi(x_j)\big\rangle = \frac12\,\alpha^\top K\alpha,$$
where K ∈ R^{n×n} is the Gram matrix, defined as K = (K(x_i, x_j))_{i,j=1}^n, and K is the kernel function associated to the feature map as defined in (190). Now we compute g*. According to (197), the function g is separable, that is, it can be written in the form g(z) = Σ_{i=1}^n g_i(z_i), where g_i = (λ/n)ℓ(y_i, −·). Therefore
$$g^*(\alpha) = \sum_{i=1}^n g_i^*(\alpha_i) = \frac{\lambda}{n}\sum_{i=1}^n \ell(y_i,\cdot)^*\Big(-\frac{n\alpha_i}{\lambda}\Big).$$

Hence the dual problem of (196) reads

$$\min_{\alpha\in\mathbb{R}^n}\ \frac12\,\alpha^\top K\alpha + \frac{\lambda}{n}\sum_{i=1}^n \ell(y_i,\cdot)^*\Big(-\frac{n\alpha_i}{\lambda}\Big) =: \Psi(\alpha), \tag{201}$$

where K = (K(x_i, x_j))_{i,j=1}^n, K is the kernel function associated to the feature map (see (190)), and ℓ(y_i,·)* is the Fenchel conjugate of ℓ(y_i,·). Moreover, (i) the primal problem (196) has a unique solution, the dual problem has solutions, and min F = −min Ψ (strong duality holds); and (ii) the solutions (w̄, ᾱ) of the primal and dual problems are characterized by the following KKT conditions
$$\begin{cases}\ \bar w = \Phi(X)^*\bar\alpha = \displaystyle\sum_{i=1}^n \bar\alpha_i\Phi(x_i),\\[6pt]\ -\dfrac{n\bar\alpha_i}{\lambda} \in \partial\ell(y_i,\cdot)\big(\langle\Phi(x_i), \bar w\rangle\big)\quad \forall\, i\in\{1,\dots,n\},\end{cases} \tag{202}$$

where ∂ℓ(y_i,·) is the subdifferential of ℓ(y_i,·). Finally, for the estimated function it holds
$$\langle \bar w, \Phi(\cdot)\rangle = \sum_{i=1}^n \bar\alpha_i K(x_i, \cdot).$$
Remark 100 The first equation in (202) says that the primal solution can be written
as a finite linear combination of feature map evaluations on the training points. This is
known as the representer theorem in the related literature. Moreover, the coefficients
of this representation can be obtained through the solution of the dual problem (201).
We now specialize Theorem 99 to distance-based and margin-based losses.
Corollary 101 Suppose that ℓ is a convex distance-based loss, that is, of the form ℓ(s, t) = χ(s − t) with Y = R, for some convex function χ : R → R₊. Then the dual problem (201) becomes

$$\min_{\alpha\in\mathbb{R}^n}\ \frac12\,\alpha^\top K\alpha - y^\top\alpha + \frac{\lambda}{n}\sum_{i=1}^n \chi^*\Big(\frac{n\alpha_i}{\lambda}\Big). \tag{203}$$

Suppose that ℓ is a convex margin-based loss, that is, of the form ℓ(s, t) = χ(st) with Y = {−1, 1}, for some convex function χ : R → R₊. Then the dual problem (201) becomes

$$\min_{\alpha\in\mathbb{R}^n}\ \frac12\,\alpha^\top K\alpha + \frac{\lambda}{n}\sum_{i=1}^n \chi^*\Big(-\frac{n y_i\alpha_i}{\lambda}\Big). \tag{204}$$
The following example shows that all the losses commonly used in machine
learning admit explicit Fenchel conjugates.
Example 102 (i) The least squares loss is ℓ(s, t) = χ(s − t) with χ = (1/2)|·|². In that case (203) reduces to

$$\min_{\alpha\in\mathbb{R}^n}\ \frac12\,\alpha^\top K\alpha - y^\top\alpha + \frac{n}{2\lambda}\|\alpha\|^2,$$

which is strongly convex with modulus n/λ and has the explicit solution ᾱ = (K + (n/λ)Id)^{-1} y.
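This closed form is easy to verify numerically; in the sketch below (our own instance, not from the text) the feature map is the identity on R^p, so Φ(X) is the data matrix and K = Φ(X)Φ(X)*:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 8
X = rng.standard_normal((n, p))    # rows are Phi(x_i)^T (identity feature map)
y = rng.standard_normal(n)
lam = 2.0

K = X @ X.T
alpha = np.linalg.solve(K + (n / lam) * np.eye(n), y)  # dual solution from (i)
w = X.T @ alpha                     # primal solution via the KKT conditions (202)

# Gradient of the primal objective F(w) = (lam/n)*sum 0.5*(y_i - <w,x_i>)^2 + 0.5*||w||^2,
# as in (196); it must vanish at the reconstructed primal solution.
grad = (lam / n) * X.T @ (X @ w - y) + w
print(np.linalg.norm(grad))
```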
(ii) The Vapnik ε-insensitive loss for regression is ℓ(s, t) = χ(s − t) with χ = |·|_ε. Then χ* = ε|·| + ι_{[−1,1]} and the dual problem (203) turns out to be

$$\min_{\alpha\in\mathbb{R}^n}\ \frac12\,\alpha^\top K\alpha - y^\top\alpha + \varepsilon\|\alpha\|_1 + \iota_{\frac{\lambda}{n}[-1,1]^n}(\alpha).$$
(iii) For the Huber-type smoothed loss with parameter ρ > 0, whose conjugate is χ* = (1/2)|·|² + ι_{[−ρ,ρ]}, the dual problem (203) becomes

$$\min_{\alpha\in\mathbb{R}^n}\ \frac12\,\alpha^\top K\alpha - y^\top\alpha + \frac{n}{2\lambda}\|\alpha\|_2^2 + \iota_{\frac{\rho\lambda}{n}[-1,1]^n}(\alpha).$$
(iv) The logistic loss for classification is the margin-based loss with χ(r) = log(1 + e^{−r}). Thus

$$\chi^*(s) = \begin{cases}(1+s)\log(1+s) - s\log(-s) & \text{if } s\in\left]-1,0\right[\\ 0 & \text{if } s = -1 \text{ or } s = 0\\ +\infty & \text{otherwise.}\end{cases}$$
It is easy to see that χ has Lipschitz continuous derivative with constant 1/4 and hence χ* is strongly convex with modulus 4. Thus, referring to (203) and (199), we see that in this case dom g* = ∏_{i=1}^n (y_i[0, λ/n]) and g* is differentiable on int(dom g*) with locally Lipschitz continuous gradient. Moreover, since lim_{s→−1}|(χ*)′(s)| = lim_{s→0}|(χ*)′(s)| = +∞, we have that ‖∇g*(α)‖ → +∞ as α approaches the boundary of dom g*. Finally, it follows from (202) that 0 < y_iᾱ_i < λ/n, for i = 1,...,n.
(v) The hinge loss is the margin-based loss with χ(r) = (1 − r)₊. We have χ*(s) = s + ι_{[−1,0]}(s). So the dual problem (204) is

$$\min_{\alpha\in\mathbb{R}^n}\ \frac12\,\alpha^\top K\alpha - y^\top\alpha + \iota_{\frac{\lambda}{n}[0,1]^n}(y\odot\alpha),$$

where ⊙ denotes the componentwise product.
Proposition 103 Suppose that, for every i ∈ {1,...,n}, ℓ(y_i,·) is a-Lipschitz continuous, and let F and Ψ denote the primal and dual objectives in (196) and (201). Let α ∈ dom Ψ and set w = Φ(X)*α. Then, provided that Ψ(α) − inf Ψ < a²λ²‖Φ(X)‖²/n,

$$F(w) - \inf F \;\le\; \frac{2a\lambda\|\Phi(X)\|}{\sqrt{n}}\,\big(\Psi(\alpha) - \inf\Psi\big)^{1/2}.$$

Remark 104 The above proposition ensures that if an algorithm generates a sequence (α^k)_{k∈N} that is minimizing for the dual problem (201), i.e., Ψ(α^k) → min Ψ, then the sequence defined by w^k = Φ(X)*α^k, k ∈ N, converges to the solution of the primal problem. More precisely, for the function ⟨w^k, Φ(·)⟩ we have
Proximal gradient algorithms for SVM. For all the cases treated in Example 102, the dual problem (201) has the following form

$$\min_{\alpha\in\mathbb{R}^n}\ q(\alpha) + \sum_{i=1}^n h_i(\alpha_i) =: \Psi(\alpha), \tag{206}$$

where q is a convex quadratic function and the h_i are proper convex lower semicontinuous functions on R with explicitly computable proximity operators, so that the proximal gradient algorithms of the previous sections can be applied.
Sparse estimation methods are very popular in machine learning. The most natural one is the minimization of the empirical risk regularized with the ℓ¹ norm, in the very same way as described in Sect. 6.1. In several applications of interest, it is beneficial to impose more structure in the regularization process, and several extensions of the ℓ¹ regularization, such as group lasso or multitask learning, are common. It turns out that proximal gradient algorithms play a key role in the solution of the related variational problems, which we write using the notation introduced in the previous subsection:
$$\min_{w\in\mathbb{R}^d}\ \frac{\lambda}{n}\sum_{i=1}^n \ell\big(y_i, \langle w, \Phi(x_i)\rangle\big) + \Omega(w), \tag{207}$$

where Ω is a sparsity-inducing penalty. In this section, we briefly summarize some examples and the related proximal gradient algorithms.
When the input variables are grouped according to predefined groups forming a partition of the variables, the group lasso penalty discussed in Example 42 promotes solutions w∗ depending only on a few groups. The algorithms and the considerations made for the lasso problem in Sect. 6.1.1 can be generalized to the group lasso, replacing the soft-thresholding operator with the proximity operator of the group lasso penalty computed in Example 42. If the support of the solution is a union of potentially overlapping groups defined a priori, then a different penalty should be used.
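The proximity operator in question is a blockwise soft-thresholding; a minimal sketch (our own code, for an assumed partition into explicit index groups):

```python
import numpy as np

def prox_group_lasso(w, groups, t):
    """Prox of t * sum_g ||w_g||_2 for a partition `groups` of the indices:
    each block is shrunk toward 0 and zeroed out if its norm is below t."""
    out = w.copy()
    for g in groups:
        nrm = np.linalg.norm(w[g])
        out[g] = 0.0 if nrm <= t else (1.0 - t / nrm) * w[g]
    return out

w = np.array([3.0, 4.0, 0.1, -0.1, 1.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4])]
print(prox_group_lasso(w, groups, 0.5))
```

The second block has norm below the threshold and is set to zero entirely, which is exactly the group-selection behavior described above.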
Let J = {J₁,...,J_m} be a family of subsets of {1,...,d} whose union is {1,...,d} itself. For J ∈ J and v ∈ R^d, let us call v_J = (v_j)_{j∈J} ∈ R^J. Denote by ‖·‖_J the Euclidean norm on R^J and by J_J : R^J → R^d the canonical embedding. We define a penalty on R^d by considering

$$\Omega(w) = \inf\Big\{\sum_{\ell=1}^m \|v_{J_\ell}\|_{J_\ell}\ \Big|\ \sum_{\ell=1}^m J_{J_\ell}(v_{J_\ell}) = w\Big\}. \tag{208}$$
When the groups do not overlap, the above penalty coincides with the group lasso norm. If some groups overlap, then this penalty induces the selection of solutions w∗ sparsely supported on a union of groups. The regularized empirical risk in this case can be written in terms of the vectors v_{J_ℓ}:

$$\min_{(v_{J_1},\dots,v_{J_m})\in\mathbb{R}^{J_1}\times\cdots\times\mathbb{R}^{J_m}}\ \frac{\lambda}{n}\sum_{i=1}^n \ell\Big(y_i,\ \sum_{\ell=1}^m \big\langle v_{J_\ell},\, J_{J_\ell}^*\Phi(x_i)\big\rangle\Big) + \sum_{\ell=1}^m \|v_{J_\ell}\|_{J_\ell},$$

and the problem in these new variables coincides with a regularized group lasso without overlap.
Learning multiple tasks simultaneously has been shown to improve performance relative to learning each task independently, when the tasks are related in the sense that they all share a small set of features. For example, given T tasks modeled as x ↦ ⟨w_t, Φ(x)⟩, for t = 1,...,T, multi-task learning amounts to the minimization of

$$\min_{(w_1,\dots,w_T)\in\mathbb{R}^{d\times T}}\ \sum_{t=1}^T \frac{\lambda}{n_t}\sum_{i=1}^{n_t} \ell\big(y_i, \langle w_t, \Phi(x_i)\rangle\big) + \sum_{j=1}^d\Big(\sum_{t=1}^T w_{j,t}^2\Big)^{1/2},$$

where n_t is the number of samples for each task. Note that the regularization is an instance of a group lasso norm of the vector (w₁,...,w_T) ∈ R^{d×T}, and the multitask problem can therefore be solved as described above.
Section 6.1 The connection between the lasso minimization problem and the problem of determining the sparsest solutions of linear systems is a central topic for the compressive sensing community. We refer to [49] for a mathematical introduction to this subject. The solution of the lasso problem motivated a huge amount of research at the interface between convex optimization, signal processing, inverse problems, and machine learning. The iterative soft-thresholding algorithm (ISTA) was proposed in [41], and around the same time the application of the proximal gradient algorithm to the lasso problem, as well as to other signal processing problems, was discussed in [39]. Strong convergence of the sequence of iterates generated by the proximal gradient algorithm for the objective function in (178) was proved in [41] and generalized in [36]. The FISTA algorithm was proposed by Beck and Teboulle in the seminal paper [12]. Block coordinate versions of the ISTA algorithm are considered, e.g., in [78, 103].
Section 6.2 The ROF model has been introduced by Rudin, Osher and Fatemi
in [101], and studied theoretically in [31]. The approach based on duality has been
considered in [28, 30, 33]. The application of FISTA and a monotone modification
to the dual problem has been considered in [13].
Section 6.3 Stochastic optimization approaches for machine learning are very
popular, and in particular stochastic gradient descent [21], see the related discussion
in Sect. 4.4. One of the most well known stochastic methods to solve SVM in the
primal variables is PEGASOS [106].
Proximal methods immediately became the methods of choice to deal with structured sparsity in machine learning. The literature on the topic is vast; see the surveys [8, 77] and references therein.
Support vector machines are due to Vapnik and have been introduced in [20, 40].
There, the case of the hinge loss for classification with a general kernel function
(so to cover nonlinear classifiers) was treated. The dual problem was derived via
the Lagrange theory. The analysis for general losses as well as the connection with
reproducing kernel Hilbert spaces and the formulation via general feature maps is
given, e.g., in [110].
Acknowledgements The work of S. Villa has been supported by the ITN-ETN project TraDE-
OPT funded by the European Union’s Horizon 2020 research and innovation programme under the
Marie Skłodowska–Curie grant agreement No 861137 and by the project “Processi evolutivi con
memoria descrivibili tramite equazioni integro-differenziali” funded by Gruppo Nazionale per l’
Analisi Matematica, la Probabilità e le loro Applicazioni (GNAMPA) of the Istituto Nazionale di
Alta Matematica (INdAM).
References
1. Alvarez, F., Attouch, H.: An inertial proximal method for maximal monotone operators via
discretization of a nonlinear oscillator with damping. Set-Valued Anal. 9, 3–11 (2001)
2. Atchadé, Y.F., Fort, G., Moulines, E.: On perturbed proximal gradient algorithms. J. Mach.
Learn. Res. 18, 1–33 (2017)
3. Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions
involving analytic features. Math. Progr. 116, 5–16 (2009)
4. Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35, 438–457 (2010)
5. Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and
tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-
Seidel methods. Math. Progr. 137, 91–129 (2013)
6. Attouch, H., Chbani, Z., Peypouquet, J., Redont, P.: Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Math. Program. Ser. B 168, 123–175 (2018)
7. Aujol, J.-F., Dossal, C., Rondepierre, A.: Optimal convergence rates for Nesterov Accelera-
tion. SIAM J. Optim. 29, 3131–3153 (2019)
8. Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with Sparsity-Inducing Penal-
ties. Optim. Mach. Learn. 5, 19–53 (2011)
9. Baillon, J.B., Bruck, R.E., Reich, S.: On the asymptotic behavior of nonexpansive mappings
and semigroups in Banach spaces. Houston J. Math. 4, 1–9 (1978)
10. Barbu, V., Precupanu, T.: Convexity and Optimization in Banach Spaces. Springer, Dordrecht
(2012)
11. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert
Spaces, 2nd edn. Springer, New York (2017)
12. Beck, A., Teboulle, M.: A fast iterative Shrinkage-Thresholding algorithm for linear inverse
problems. SIAM J. Imaging Sci. 2, 183–202 (2009)
13. Beck, A., Teboulle, M.: Fast gradient-based algorithms for constrained total variation image
denoising and deblurring problems. IEEE Trans. Image Process. 18, 2419–2434 (2009)
14. Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic
functions with applications to subgradient dynamical systems. SIAM J. Optim. 17, 1205–
1223 (2006)
15. Beck, A., Teboulle, M.: A fast dual proximal gradient algorithm for convex minimization and
applications. Oper. Res. Lett. 42, 1–6 (2014)
16. Bolte, J., Daniilidis, A., Lewis, A., Shiota, M.: Clarke subgradients of stratifiable functions.
SIAM J. Optim. 18, 556–572 (2007)
17. Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of
first-order descent methods for convex functions. Math. Program. 165, 471–507 (2017)
18. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for noncon-
vex and nonsmooth problems. Math. Prog. 146, 459–494 (2013)
19. Borwein, J.M., Vanderwerff, J.D.: Convex Functions: Constructions, Characterizations and
Counterexamples. Cambridge University Press, Cambridge (2010)
20. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers.
In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory—COLT
’92, p. 144 (1992)
21. Bottou, L., Bousquet, O.: The tradeoffs of large-scale learning. In: Optimization for Machine
Learning, pp. 351–368, The MIT Press, Cambridge (2012)
22. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning.
SIAM Rev. 60, 223–311 (2018)
23. Bourbaki, N.: General Topology, 2nd edn. Springer, New York (1989)
24. Bredies, K.: A forward-backward splitting algorithm for the minimization of non-smooth
convex functionals in Banach space. Inv. Prob. 25, Art. 015005 (2009)
25. Browder, F.E., Petryshyn, W.V.: The solution by iteration of nonlinear functional equations
in Banach spaces. Bull. Am. Math. Soc. 72, 571–575 (1966)
26. Browder, F.E., Petryshyn, W.V.: Construction of fixed points of nonlinear mappings in Hilbert
space. J. Math. Anal. Appl. 20, 197–228 (1967)
27. Burke, J.V., Ferris, M.C.: Weak sharp minima in mathematical programming. SIAM J. Control
Optim. 31, 1340–1359 (1993)
28. Chambolle, A.: An algorithm for total variation minimization and applications. J. Math.
Imaging Vis. 20, 89–97 (2004)
29. Chambolle, A., Dossal, C.: On the convergence of the iterates of the “Fast Iterative Shrink-
age/Thresholding Algorithm". J. Optim. Theory Appl. 166, 968–982 (2015)
30. Chambolle, A., Lions, P.-L.: Image restoration by constrained total variation minimization
and variants. In: Investigative and Trial Image Processing, San Diego, CA (SPIE), vol. 2567,
pp. 50–59 (1995)
31. Chambolle, A., Lions, P.-L.: Image recovery via total variation minimization and related
problems. Numer. Math. 76, 167–188 (1997)
32. Chambolle, A., Pock, T.: An introduction to continuous optimization for imaging. Acta
Numerica 25, 161–319 (2016)
33. Chan, T.F., Golub, G.H., Mulet, P.: A nonlinear primal-dual method for total variation-based
image restoration. SIAM J. Sci. Comput. 20, 1964–1977 (1999)
34. Combettes, P.L., Pesquet, J.-C.: Proximal splitting methods in signal processing, In: Fixed-
Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer,
New York, NY (2011)
35. Combettes, P.L., Pesquet, J.-C.: Stochastic quasi-Fejér block-coordinate fixed point iterations
with random sweeping. SIAM J. Optim. 25, 1121–1248 (2015)
36. Combettes, P.L., Pesquet, J.-C.: Proximal thresholding algorithms for minimization over
orthonormal bases. SIAM J. Optim. 18, 1351–1376 (2007)
37. Combettes, P.L., Vũ, B.C.: Dualization of signal recovery problems. Set-Valued Anal. 18, 373–404 (2010)
38. Combettes, P.L., Yamada, I.: Compositions and convex combinations of averaged nonexpan-
sive operators. J. Math. Anal. Appl. 425, 55–70 (2015)
39. Combettes, P.L., Wajs, V.: Signal recovery by proximal forward-backward splitting. Multi-
scale Model. Simul. 4, 1168–1200 (2005)
40. Cortes, C., Vapnik, V.: Support vector networks. Mach. Learn. 20, 273–297 (1995)
41. Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse
problems with a sparsity constraint. Comm. Pure Appl. Math. 57, 1413–1457 (2004)
42. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with
support for non-strongly convex composite objectives. In: Advances in Neural Information
Processing Systems, vol. 27 (2014)
43. Dotson, W.G.: On the Mann iterative process. Trans. Am. Math. Soc. 149, 65–73 (1970)
44. Duchi, J., Singer, Y.: Efficient online and batch learning using forward backward splitting. J.
Mach. Learn. Res. 10, 2899–2934 (2009)
45. Dünner, C., Forte, S., Takac, M., Jaggi, M.: Primal-dual rates and certificates. In: Proceedings
of The 33rd International Conference on Machine Learning, PMLR, vol. 48, pp. 783–792
(2016)
46. Ekeland, I., Témam, R.: Convex Analysis and Variational Problems. Society for Industrial
and Applied Mathematics (SIAM), Philadelphia, PA (1999)
47. Ermoliev, Yu.M.: On the method of generalized stochastic gradients and quasi-Fejér
sequences. Cybernetics 5, 208–220 (1969)
48. Fenchel, W.: Convex Cones, Sets, and Functions. Princeton University (1953)
49. Foucart, S., Rauhut, H.: A Mathematical Introduction to Compressive Sensing. Birkhäuser/
Springer, New York (2013)
50. Frankel, P., Garrigos, G., Peypouquet, J.: Splitting methods with variable metric for Kurdyka-
Łojasiewicz functions and general convergence rates. J. Optim. Theory Appl. 165, 874–900
(2015)
51. Gabay, D.: Applications of the method of multipliers to variational inequalities. In: Fortin,
M., Glowinski, R. (eds.) Augmented Lagrangian Methods: Applications to the Numerical
Solution of Boundary-Value Problems, North-Holland, Amsterdam, vol. 15, pp. 299–331
(1983)
52. Garrigos, G., Rosasco, L., Villa, S.: Convergence of the Forward-Backward Algorithm:
Beyond the Worst Case with the Help of Geometry (2017). https://arxiv.org/abs/1703.09477
53. Goldstein, A.A.: Convex programming in Hilbert space. Bull. Am. Math. Soc. 70, 709–710
(1964)
54. Groetsch, C.W.: A note on segmenting Mann iterates. J. Math. Anal. Appl. 40, 369–372 (1972)
55. Güler, O.: New proximal point algorithms for convex minimization. SIAM J. Optim. 2, 649–
664 (1992)
56. Blatt, D., Hero, A., Gauchman, H.: A convergent incremental gradient method with a constant
step size. SIAM J. Optim. 18, 29–51 (2007)
57. Hiriart-Urruty, J.-B., Lemaréchal, C.: Fundamentals of Convex Analysis. Springer, Berlin
(2001)
58. Jensen, J.L.W.V.: Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta
Math. 30, 175–193 (1906)
59. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance
reduction. Adv. Neural Inf. Process. Syst. 26, 315–323 (2013)
60. Karimi, H., Nutini, J., Schmidt, M.: Linear Convergence of gradient and proximal-gradient
methods under the Polyak-Łojasiewicz condition. In: Frasconi, P., Landwehr, N., Manco, G.,
Vreeken, J. (eds.), Machine Learning and Knowledge Discovery in Databases. ECML PKDD
2016. Lecture Notes in Computer Science, vol. 9851. Springer, Cham (2016)
61. Kiefer, J., Wolfowitz, J.: Stochastic estimation of the maximum of a regression function. Ann.
Math. Stat. 23, 462–466 (1952)
62. Kingma, D.P., Ba, L.J.: Adam: a method for stochastic optimization. In: Proceedings of
Conference on Learning Representations (ICLR), San Diego (2015)
63. Krasnoselski, M.A.: Two remarks on the method of successive approximations. Uspekhi Mat.
Nauk. 10, 123–127 (1955)
64. Levitin, E.S., Polyak, B.T.: Constrained minimization methods. U.S.S.R. Comput. Math.
Math. Phys. 6, 1–50 (1966)
65. Li, W.: Error bounds for piecewise convex quadratic programs and applications. SIAM J.
Control Optim 33, 1510–1529 (1995)
66. Li, G.: Global error bounds for piecewise convex polynomials. Math. Prog. Ser. A 137, 37–64
(2013)
67. Lions, P.L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J.
Numer. Anal. 16, 964–979 (1979)
68. Luo, Z.Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a
general approach. Ann. Oper. Res. 46, 157–178 (1993)
69. Luque, F.: Asymptotic convergence analysis of the proximal point algorithm. SIAM J. Control
Optim. 22, 277–293 (1984)
70. Mann, W.R.: Mean value methods in iteration. Proc. Am. Math. Soc. 4, 506–510 (1953)
71. Martinet, B.: Régularisation d’inéquations variationnelles par approximations successives.
Rev. Française Informat. Recherche Opérationnelle 4, Sér. R-3, 154–158 (1970)
72. Mercier, B.: Inéquations Variationnelles de la Mécanique. No. 80.01 in Publications Mathé-
matiques d’Orsay. Université de Paris-XI, Orsay, France (1980)
73. Minkowski, H.: Theorie der konvexen Körper, insbesondere Begründung ihres Oberflächen-
begriffs. In: Hilbert, D. (ed.) Gesammelte abhandlungen von Hermann Minkowski [Collected
Papers of Hermann Minkowski], vol. 2, pp. 131–229. B.G. Teubner, Leipzig (1911)
74. Moreau, J.J.: Fonctions convexes duales et points proximaux dans un espace hilbertien, C. R.
Acad. Sci. Paris Ser. A Math. 255, 2897–2899 (1962)
75. Moreau, J.J.: Propriétés des applications “prox”, C. R. Acad. Sci. Paris Ser. A Math. 256,
1069–1071 (1963)
76. Moreau, J.J.: Proximité et dualité dans un espace Hilbertien. Bull. de la Société Mathématique
de France 93, 273–299 (1965)
77. Mosci, S., Rosasco, L., Santoro, M., Verri, A., Villa, S.: Solving structured sparsity regu-
larization with proximal methods. In: Joint European Conference on Machine Learning and
Knowledge Discovery in Databases, pp. 418–433. Springer, Berlin, Heidelberg (2010)
78. Necoara, I., Clipici, D.: Parallel random coordinate descent method for composite minimiza-
tion: convergence analysis and error bounds. SIAM J. Optim. 26, 197–226 (2016)
79. Necoara, I., Nesterov, Y., Glineur, F.: Random block coordinate descent methods for linearly
constrained optimization over networks. J. Optim. Theory Appl. 173, 227–254 (2017)
80. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach
to stochastic programming. SIAM J. Optim. 19, 1574–1609 (2009)
81. Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization.
Wiley-Interscience, New York (1983)
82. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Aca-
demic Publishers, London (2004)
83. Nesterov, Y.: A method for solving the convex programming problem with convergence rate
O(1/k²). Dokl. Akad. Nauk SSSR 269, 543–547 (1983)
84. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems.
SIAM J. Optim. 22, 341–362 (2012)
85. Opial, Z.: Weak convergence of the sequence of successive approximations for nonexpansive
mappings. Bull. Am. Math. Soc. 73, 591–597 (1967)
86. Osher, S., Burger, M., Goldfarb, D., Xu, J., Yin, W.: An iterative regularization method for
total variation-based image restoration. Multiscale Model. Sim. 4, 460–489 (2005)
87. Passty, G.B.: Ergodic convergence of a zero of the sum of monotone operators in Hilbert
space. J. Math. Anal. Appl. 72, 383–390 (1979)
88. Peypouquet, J.: Convex Optimization in Normed Spaces. Springer, Cham (2015)
89. Phelps, R.R.: Convex Functions, Monotone Operators and Differentiability. Springer, Berlin
(1993)
90. Polyak, B.T.: Dokl. Akad. Nauk SSSR 174
91. Polyak, B.T.: Gradient methods for minimizing functionals. Zh. Vychisl. Mat. Mat. Fiz. 3,
643–653 (1963)
92. Polyak, B.T.: Subgradient methods: a survey of Soviet research. In: Lemaréchal, C.L., Mifflin,
R. (eds.) Proceedings of a IIASA Workshop, Nonsmooth Optimization, pp. 5–28. Pergamon
Press, New York (1977)
93. Polyak, B.T.: Introduction to Optimization. Optimization Software, Inc. (1987)
94. Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math.
Program. Ser. A 156, 433–484 (2016)
95. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407
(1951)
96. Robbins, H., Siegmund, D.: A convergence theorem for non negative almost supermartingales
and some applications. In: Optimizing Methods in Statistics, pp. 233–257. Academic Press
(1971)
97. Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control
Optim. 14, 877–898 (1976)
98. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
99. Rockafellar, R.T.: Conjugate Duality and Optimization. Society for Industrial and Applied
Mathematics, Philadelphia (1974)
100. Rosasco, L., Villa, S., Vũ, B.C.: Convergence of stochastic proximal gradient method. Appl.
Math. Optim. 82, 891–917 (2020)
101. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms.
Physica D 60, 259–268 (1992)
102. Salzo, S.: The variable metric forward-backward splitting algorithm under mild differentia-
bility assumptions. SIAM J. Optim. 27(4), 2153–2181 (2017)
103. Salzo, S., Villa, S.: Parallel random block-coordinate forward-backward algorithm: a unified
convergence analysis. Math. Program. Ser. A. https://doi.org/10.1007/s10107-020-01602-1
104. Schaefer, H.: Über die Methode sukzessiver Approximationen. Jber. Deutsch. Math.-Verein.
59, 131–140 (1957)
105. Shamir, O., Zhang, T.: Stochastic gradient descent for non-smooth optimization: convergence
results and optimal averaging schemes. In: Proceedings of the 30th International Conference
on Machine Learning, pp. 71–79 (2013)
106. Shalev-Shwartz, S., Singer, Y., Srebro, N., Cotter, A.: Pegasos: primal estimated sub-gradient
solver for SVM. Math. Program. 127, 3–30 (2011)
107. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss
minimization. J. Mach. Learn. Res. 14, 567–599 (2013)
108. Shor, N.: Minimization Methods for Non-differentiable Functions. Springer, New York (1985)
109. Sibony, M.: Méthodes itératives pour les équations et inéquations aux dérivées partielles
non linéaires de type monotone. Calcolo 7, 65–183 (1970)
110. Steinwart, I., Christmann, A.: Support Vector Machines. Springer, New York (2008)
111. Su, W., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov’s accelerated
gradient method: theory and insights. J. Mach. Learn. Res. 17, 1–43 (2016)
112. Tseng, P.: Applications of a splitting algorithm to decomposition in convex programming and
variational inequalities. SIAM J. Control Optim. 29, 119–138 (1991)
113. Wolfe, P.: A method of conjugate subgradients for minimizing nondifferentiable functions.
Nondifferentiable optimization. Math. Program. Stud. 3, 145–173 (1975)
114. Wright, S.: Coordinate descent algorithms. Math. Program. 151, 3–34 (2015)
115. Zălinescu, C.: Convex Analysis in General Vector Spaces. World Scientific Publishing Co.
Inc, River Edge, NJ (2002)
116. Zhang, X., Burger, M., Bresson, X., Osher, S.: Bregmanized nonlocal regularization for decon-
volution and sparse reconstruction. SIAM J. Imaging Sci. 3, 253–276 (2010)