Inference For Generalized Linear Models Via Alternating Directions and Bethe Free Energy Minimization
Abstract: Generalized Linear Models (GLMs), where a random vector x is observed through a noisy, possibly nonlinear,
function of a linear transform z = Ax, arise in a range of
applications in nonlinear filtering and regression. Approximate
Message Passing (AMP) methods, based on loopy belief propagation, are a promising class of approaches for approximate
inference in these models. AMP methods are computationally
simple, general, and admit precise analyses with testable conditions for optimality for large i.i.d. transforms A. However, the
algorithms can diverge for general A. This paper presents a
convergent approach to the generalized AMP (GAMP) algorithm
based on direct minimization of a large-system limit approximation of the Bethe Free Energy (LSL-BFE). The proposed
method uses a double-loop procedure, where the outer loop
successively linearizes the LSL-BFE and the inner loop minimizes
the linearized LSL-BFE using the Alternating Direction Method
of Multipliers (ADMM). The proposed method, called ADMM-GAMP, is similar in structure to the original GAMP method, but
with an additional least-squares minimization. It is shown that for
strictly convex, smooth penalties, ADMM-GAMP is guaranteed
to converge to a local minimum of the LSL-BFE, thus providing
a convergent alternative to GAMP that is stable under arbitrary
transforms. Simulations are also presented that demonstrate the
robustness of the method for non-convex penalties as well.
Index Terms: Belief propagation, ADMM, variational optimization, message passing, generalized linear models.
I. INTRODUCTION

We consider estimation of a random vector x from observations y in a generalized linear model, where the posterior density of x given y takes the form

  p_{x|y}(x|y) = (1/Z(y)) exp[ -f_x(x) - f_z(Ax, y) ],        (2)

and where the quantities of interest for MMSE inference are the posterior means and variances

  b̂_j ≜ E(x_j | y),        (3a)
  τ_{x_j} ≜ var(x_j | y).        (3b)

We study this inference problem in the case where the functions f_x and f_z are separable, in that they are of the form

  f_x(x) = Σ_{j=1}^n f_{x_j}(x_j),        (4a)
  f_z(z) = Σ_{i=1}^m f_{z_i}(z_i),        (4b)

for some scalar functions f_{x_j} and f_{z_i}. The separability assumption (4a) corresponds to the components of x being a priori independent. Recalling the implicit dependence of f_z on y, the separability assumption (4b) corresponds to the observations y being conditionally independent given the transform outputs z ≜ Ax.

For posterior densities of the form (2), there are several computationally efficient methods to find the maximum a posteriori (MAP) estimate, which is given by

  x̂ = arg max_x p_{x|y}(x|y) = arg min_x [ f_x(x) + f_z(Ax) ].        (5)
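As a concrete, purely illustrative instance of the separable structure (4) and the MAP objective (5), the Python sketch below builds a toy GLM; the Laplace input penalty, AWGN output penalty, dimensions, and parameter values are assumptions made for this example, not choices from the paper.

```python
# Sketch of a separable GLM of the form (2)-(5), assuming a Laplace input
# penalty f_x and an AWGN output penalty f_z (illustrative choices only).
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 100
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))   # linear transform
x_true = rng.normal(size=n) * (rng.random(n) < 0.1)  # sparse input
y = A @ x_true + 0.01 * rng.normal(size=m)           # noisy observation

lam, sigma2 = 0.1, 1e-4

def f_x(x):
    # separable input penalty: f_x(x) = sum_j lam*|x_j|
    return lam * np.sum(np.abs(x))

def f_z(z):
    # separable output penalty: f_z(z) = sum_i (y_i - z_i)^2 / (2*sigma2)
    return np.sum((y - z) ** 2) / (2 * sigma2)

def map_objective(x):
    # objective minimized by the MAP estimate in (5)
    return f_x(x) + f_z(A @ x)

print("objective at x_true:", map_objective(x_true))
print("objective at zero  :", map_objective(np.zeros(n)))
```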
Fig. 1. The generalized linear model: an unknown input x ~ p_x with independent components passes through the linear transform A and then a componentwise output map p_{y|z}.
For a factor graph with variable beliefs b ≜ [b_1, ..., b_n]^T and factor beliefs q ≜ [q_1, ..., q_L]^T, the constrained Bethe Free Energy (BFE) minimization takes the form

  (b̂, q̂) ≜ arg min_{(b,q) ∈ E}  Σ_{ℓ=1}^L D(q_ℓ ‖ ψ_ℓ) + Σ_{j=1}^n (n_j - 1) H(b_j),        (10)

where H(·) denotes differential entropy, n_j is the number of factors in which x_j appears,

  D(a‖b) ≜ ∫ a(x) ln [ a(x)/b(x) ] dx        (11)

denotes KL divergence, and E is the set of beliefs satisfying the marginal-consistency constraints

  ∫ q_ℓ(x_(ℓ)) dx_(ℓ)\j = b_j(x_j),  for all ℓ, j.        (12)

For the GLM posterior (2) with separable penalties (4), the factors are

  ψ_j(x_j) = exp(-f_{x_j}(x_j)),  j = 1, ..., n,        (13a)
  ψ_{n+i}(x) = exp(-f_{z_i}(a_i^T x)),  i = 1, ..., m,        (13b)

where a_i^T is the i-th row of A. Note that, if A is a non-sparse matrix, then f_{z_i}(a_i^T x) depends on all components of the vector x. In this case, the application of traditional loopy BP, as described for example in [43], does not generally yield a significant computational improvement.

The GAMP algorithm from [22] can be seen as an approximate BFE minimization method for GLMs with possibly dense transforms A. Specifically, it was shown in [29] that the stationary points of GAMP coincide with the local minima of the constrained optimization

  (b̂_x, b̂_z) ≜ arg min_{b_x, b_z} J(b_x, b_z)        (14a)
  such that  E(z|b_z) = A E(x|b_x),        (14b)

where the beliefs are restricted to the fully factorized form

  b_x(x) = Π_{j=1}^n b_{x_j}(x_j),   b_z(z) = Π_{i=1}^m b_{z_i}(z_i),        (15)

and where J(b_x, b_z) is the large-system-limit approximation of the BFE (LSL-BFE),

  J(b_x, b_z) ≜ D(b_x ‖ exp(-f_x)) + D(b_z ‖ exp(-f_z)) + H( var(x|b_x), var(z|b_z) ),        (16)

with

  H(τ_x, τ_z) ≜ Σ_{i=1}^m [ τ_{z_i} / ( 2 Σ_{j=1}^n S_{ij} τ_{x_j} ) + (1/2) ln( 2π Σ_{j=1}^n S_{ij} τ_{x_j} ) ],        (17)

where S ≜ A.A denotes the componentwise square of A.
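The H term in (17) is straightforward to evaluate numerically. The sketch below is a direct transcription under the assumption that the logarithmic term is (1/2)ln(2π·); all data are randomly generated for illustration.

```python
# Sketch of the H(tau_x, tau_z) term in (17), with S = A.A the
# componentwise square of A.
import numpy as np

def H_lsl(tau_x, tau_z, S):
    tau_p = S @ tau_x                       # tau_p_i = sum_j S_ij * tau_x_j, cf. (33)
    return np.sum(tau_z / (2 * tau_p) + 0.5 * np.log(2 * np.pi * tau_p))

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 8)) / np.sqrt(5)
S = A ** 2
tau_x = rng.random(8) + 0.1
tau_z = rng.random(5) + 0.1
print("H(tau_x, tau_z) =", H_lsl(tau_x, tau_z, S))
```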
The double-loop approach of [32] (the CCCP) minimizes objectives of the form

  J(b) = f(b) + h(g(b))        (22)

with convex f and concave h∘g by alternately linearizing the concave part and minimizing the resulting convex objective:

  b^{k+1} = arg min_{(b,q) ∈ E}  f(b) + (λ^k)^T q,        (21a)
  λ^{k+1} = ∂h(q^k)/∂q.        (21b)

In the iterative linearization method of Algorithm 1, the linearization vector is instead updated with a damping parameter θ^k ∈ (0, 1],

  λ^{k+1} = (1 - θ^k) λ^k + θ^k ∂h(q^{k+1})/∂q,   q^{k+1} ≜ g(b^{k+1}),

which, as shown in Appendix A, guarantees the decrementing property

  J(b^{k+1}) ≤ J(b^k)  for all k.        (28)

Writing

  f(b) ≜ D(b_x ‖ exp(-f_x)) + D(b_z ‖ exp(-f_z)),        (29a)
  g(b) ≜ [ var(x|b_x); var(z|b_z) ],        (29b)
  h(q) ≜ H(q_x, q_z),        (29c)

we see that J(b_x, b_z) from (16) can be cast into the form in (22). Observe that, while f is convex, the function h(g(·)) is, in general, neither convex nor concave. Thus, while the CCCP does not apply, we can apply the iterative linearization method from Algorithm 1.

We will partition the linearization vector conformally with the function g in (29b) as

  λ = [ 1./(2τ_r); 1./(2τ_p) ],        (30)

so that the linearized LSL-BFE becomes

  J(b_x, b_z, τ_r, τ_p) ≜ D(b_x ‖ exp(-f_x)) + D(b_z ‖ exp(-f_z))
      + (1./(2τ_r))^T var(x|b_x) + (1./(2τ_p))^T var(z|b_z).        (31)

Finally, we compute the gradient h' = ∂h/∂q of the function h from (29c). Similar to λ, we will partition the gradient into two terms,

  1./(2τ_r) ≜ ∂H(τ_x, τ_z)/∂τ_x,   1./(2τ_p) ≜ ∂H(τ_x, τ_z)/∂τ_z.        (32)

From (17), the derivative of H with respect to τ_{z_i} is

  ∂H(τ_x, τ_z)/∂τ_{z_i} = 1 / ( 2 Σ_{j=1}^n S_{ij} τ_{x_j} ) ≜ 1/(2τ_{p_i}),        (33)

so that, in vector form, τ_p = S τ_x. Likewise, the derivative with respect to τ_{x_j} is

  ∂H(τ_x, τ_z)/∂τ_{x_j} = Σ_{i=1}^m S_{ij} [ 1/(2τ_{p_i}) - τ_{z_i}/(2τ_{p_i}^2) ] ≜ 1/(2τ_{r_j}),        (34)

or, in vector form,

  1./τ_r = S^T [ (1 - τ_z./τ_p) ./ τ_p ].        (35)
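The linearization steps (33) and (35) amount to two matrix-vector products with S and S^T. A minimal sketch, with the variances chosen so that the resulting τ_r stays positive:

```python
# Sketch of the outer-loop linearization terms: tau_p from (33) and tau_r
# from (35), with S = A.A; tau_z is chosen smaller than S tau_x so that
# tau_r stays positive in this random example.
import numpy as np

def linearization_update(tau_x, tau_z, S):
    tau_p = S @ tau_x                                        # (33): tau_p = S tau_x
    tau_r = 1.0 / (S.T @ ((1.0 - tau_z / tau_p) / tau_p))    # (35)
    return tau_r, tau_p

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 10)) / np.sqrt(6)
S = A ** 2
tau_x = rng.random(10) + 0.1
tau_z = 0.5 * (S @ tau_x)
tau_r, tau_p = linearization_update(tau_x, tau_z, S)
print("tau_r[:3] =", tau_r[:3], "  tau_p[:3] =", tau_p[:3])
```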
Substituting the above computations into the iterative linearization algorithm, Algorithm 1, we obtain Algorithm 2. We refer to this as the outer loop, since each iteration involves a minimization of the linearized LSL-BFE in line 5. We discuss this latter minimization next and show that it can itself be performed by an iterative inner loop based on ADMM.

E. Alternative Methods

While the method proposed in this paper is based on the CCCP of [32], there are other methods for direct minimization of the BFE that may apply to the LSL-BFE as well. For example, for problems with binary variables and pairwise penalty functions, [44], [45] propose a clever re-parametrization to convert the constrained BFE minimization to an unconstrained optimization on which gradient descent can be used. Unfortunately, it is not obvious whether the LSL-BFE here admits such a re-parametrization, since the penalty functions are not pairwise and the variables are not binary.
V. INNER-LOOP MINIMIZATION AND ADMM-GAMP

A. ADMM Principle

The outer-loop algorithm, Algorithm 2, requires that in each iteration we solve a constrained optimization of the form

  (b_x, b_z) = arg min_{b_x, b_z} J(b_x, b_z, τ_r, τ_p)   s.t.   E(z|b_z) = A E(x|b_x).        (36)

We will show that this optimization can be performed by the Alternating Direction Method of Multipliers (ADMM) [9]. ADMM is a general approach to constrained optimizations of the form

  min_w f(w)   s.t.   Bw = 0,        (37)

based on an augmented Lagrangian

  L(w, s) ≜ f(w) + s^T Bw + (α/2) ||Bw||^2,        (38)

which is alternately minimized over (blocks of) the primal variable w and then used for a gradient ascent step on the dual variable s:

  w^{t+1} = arg min_w L(w, s^t),        (39a)
  s^{t+1} = s^t + α B w^{t+1}.        (39b)
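A minimal sketch of the augmented-Lagrangian iteration (38)-(39) for the generic problem (37), assuming a simple quadratic f(w); the penalty parameter and dimensions are arbitrary illustrative choices.

```python
# Minimal sketch of the iteration (38)-(39) applied to (37), assuming the
# quadratic objective f(w) = 0.5*||w - c||^2 (illustrative only).
import numpy as np

rng = np.random.default_rng(3)
p, d = 4, 10
B = rng.normal(size=(p, d))
c = rng.normal(size=d)
rho = 1.0

w = np.zeros(d)
s = np.zeros(p)
for _ in range(200):
    # w-update: minimize f(w) + s^T B w + (rho/2)||B w||^2 in closed form
    w = np.linalg.solve(np.eye(d) + rho * B.T @ B, c - B.T @ s)
    # dual ascent step
    s = s + rho * (B @ w)

print("constraint residual ||Bw|| =", np.linalg.norm(B @ w))
```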
Applying this principle to (36) with the variable splitting E(x|b_x) = v and E(z|b_z) = Av yields the augmented Lagrangian

  L(b_x, b_z, s, q, v; τ_p, τ_r) ≜ J(b_x, b_z, τ_r, τ_p) + q^T [ E(x|b_x) - v ] + s^T [ E(z|b_z) - Av ]
      + (1/2) ||E(x|b_x) - v||^2_{τ_r} + (1/2) ||E(z|b_z) - Av||^2_{τ_p},        (40)

where ||u||^2_τ ≜ Σ_i u_i^2/τ_i, and the ADMM iterations

  (b_x^{t+1}, b_z^{t+1}) = arg min_{b_x, b_z} L(b_x, b_z, s^t, q^t, v^t; τ_p, τ_r),        (41a)
  s^{t+1} = s^t + Diag(1./τ_p) [ E(z|b_z^{t+1}) - Av^t ],        (41b)
  q^{t+1} = q^t + Diag(1./τ_r) [ E(x|b_x^{t+1}) - v^t ],        (41c)
  v^{t+1} = arg min_v L(b_x^{t+1}, b_z^{t+1}, s^{t+1}, q^{t+1}, v; τ_p, τ_r).        (41d)
Noting identities such as

  (1./(2τ_p))^T var(z|b_z) = Σ_{i=1}^m τ_{z_i}/(2τ_{p_i}),        (44b)

and substituting (31), (42), (43), and (44) into (40) and canceling terms, we get

  L(b_x, b_z, s, q, v; τ_p, τ_r)
    = D(b_x ‖ exp(-f_x)) + E( (1/2)||x - (v - τ_r.q)||^2_{τ_r} | b_x )
    + D(b_z ‖ Z_z^{-1} exp(-f_z)) + E( (1/2)||z - (Av - τ_p.s)||^2_{τ_p} | b_z ) + const        (45)
    = ∫_{R^n} b_x(x) ln [ b_x(x) / exp( -f_x(x) - (1/2)||x - (v - τ_r.q)||^2_{τ_r} ) ] dx
    + ∫_{R^m} b_z(z) ln [ b_z(z) / exp( -f_z(z) - (1/2)||z - (Av - τ_p.s)||^2_{τ_p} ) ] dz + const.        (46)

Hence, the beliefs minimizing (41a) are

  b_x^{t+1}(x) ∝ exp( -f_x(x) - (1/2)||x - r^t||^2_{τ_r} ),        (47a)
  b_z^{t+1}(z) ∝ exp( -f_z(z) - (1/2)||z - p^t||^2_{τ_p} ),        (47b)

where

  r^t ≜ v^t - τ_r.q^t,        (48a)
  p^t ≜ Av^t - τ_p.s^t,        (48b)

and where we use "." to denote componentwise vector multiplication. Using Bayes rule, (47a) can be interpreted as the posterior density of the random vector x under the prior ∝ exp(-f_x(x)) and an independent Gaussian likelihood with mean r^t and variance τ_r. Similarly, (47b) can be interpreted as the posterior pdf of the random vector z under the likelihood ∝ exp(-f_z(z)) and an independent Gaussian prior with mean p^t and variance τ_p.

To tackle the minimization (41d), we ignore the v-invariant components in the original augmented Lagrangian (40), after which (41d) can be reformulated as the least-squares problem

  v^{t+1} = arg min_v  ||z^{t+1} + τ_p.s^{t+1} - Av||^2_{τ_p} + ||x^{t+1} + τ_r.q^{t+1} - v||^2_{τ_r}.        (49)
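Since the weighting in (49) is diagonal, the v-update reduces to a small weighted least-squares solve. A sketch via the normal equations (a conjugate-gradient solver could replace the direct solve for large A):

```python
# Sketch of the v-update (49), a weighted least-squares problem solved via
# its normal equations.
import numpy as np

def v_update(A, x, z, q, s, tau_r, tau_p):
    Dr = np.diag(1.0 / tau_r)
    Dp = np.diag(1.0 / tau_p)
    lhs = A.T @ Dp @ A + Dr
    rhs = A.T @ Dp @ (z + tau_p * s) + Dr @ (x + tau_r * q)
    return np.linalg.solve(lhs, rhs)

rng = np.random.default_rng(0)
m, n = 5, 8
A = rng.normal(size=(m, n))
v = v_update(A, rng.normal(size=n), rng.normal(size=m),
             rng.normal(size=n), rng.normal(size=m),
             rng.random(n) + 0.1, rng.random(m) + 0.1)
print(v.shape)
```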
Algorithm 3 ADMM-GAMP
Require: Matrix A, estimation functions g_x and g_z.
 1: S ← A.A (componentwise square)
 2: Initialize τ_r^0 > 0, τ_p^0 > 0, v^0
 3: q^0 ← 0, s^0 ← 0
 4: t ← 0
 5: repeat
 6:   {ADMM inner iteration}
 7:   r^t ← v^t - τ_r^t.q^t
 8:   p^t ← Av^t - τ_p^t.s^t
 9:   x^{t+1} ← g_x(r^t, τ_r^t),   z^{t+1} ← g_z(p^t, τ_p^t)
10:   q^{t+1} ← q^t + Diag(1./τ_r^t)(x^{t+1} - v^t)
11:   s^{t+1} ← s^t + Diag(1./τ_p^t)(z^{t+1} - Av^t)
12:   Compute v^{t+1} from (49)
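Lines 7-12 of Algorithm 3 can be transcribed almost directly. The sketch below performs one inner iteration, assuming the componentwise estimators g_x and g_z are supplied as callables; the dense normal-equation solve at the end is one possible implementation of (49).

```python
# Sketch of one ADMM inner iteration of Algorithm 3 (lines 7-12).
import numpy as np

def admm_gamp_inner(A, v, q, s, tau_r, tau_p, gx, gz):
    r = v - tau_r * q                              # line 7
    p = A @ v - tau_p * s                          # line 8
    x = gx(r, tau_r)                               # line 9
    z = gz(p, tau_p)
    q = q + (x - v) / tau_r                        # line 10
    s = s + (z - A @ v) / tau_p                    # line 11
    Dr, Dp = np.diag(1.0 / tau_r), np.diag(1.0 / tau_p)
    v = np.linalg.solve(A.T @ Dp @ A + Dr,         # line 12: v-update from (49)
                        A.T @ Dp @ (z + tau_p * s) + Dr @ (x + tau_r * q))
    return x, z, q, s, v
```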
In Algorithm 3, the belief updates are represented through their means

  x^{t+1} ≜ E(x | b_x^{t+1}),        (50)

and z^{t+1} ≜ E(z | b_z^{t+1}), which are computed via the estimation functions

  x^{t+1} = E(x | b_x^{t+1}) = g_x(r^t, τ_r^t),        (51a)
  z^{t+1} = E(z | b_z^{t+1}) = g_z(p^t, τ_p^t).        (51b)

For MMSE estimation, (47) and the separability of the penalties imply that these functions can be computed componentwise as

  x̂_j^{t+1} = [g_x(r^t, τ_r^t)]_j = ∫ x exp( -f_{x_j}(x) - (x - r_j^t)^2/(2τ_{r_j}^t) ) dx / ∫ exp( -f_{x_j}(x) - (x - r_j^t)^2/(2τ_{r_j}^t) ) dx,        (52)
  ẑ_i^{t+1} = [g_z(p^t, τ_p^t)]_i = ∫ z exp( -f_{z_i}(z) - (z - p_i^t)^2/(2τ_{p_i}^t) ) dz / ∫ exp( -f_{z_i}(z) - (z - p_i^t)^2/(2τ_{p_i}^t) ) dz.        (53)

Furthermore, the variances of b_x^{t+1} and b_z^{t+1} can be computed in a componentwise manner using the derivatives of g_{x_j} and g_{z_i} with respect to their first argument [22], i.e.,

  τ_x^{t+1} ≜ var(x | b_x^{t+1}) = τ_r^t . g_x'(r^t, τ_r^t),        (54a)
  τ_z^{t+1} ≜ var(z | b_z^{t+1}) = τ_p^t . g_z'(p^t, τ_p^t).        (54b)

The same machinery applies to MAP estimation, i.e., to the constrained optimization

  (x̂, ẑ) ≜ arg min_{x,z} J(x, z)   s.t.   z = Ax,        (56)

with J(x, z) ≜ f_x(x) + f_z(z). In this case, we use the augmented Lagrangian

  L(x, z, s, q, v; τ_p, τ_r) ≜ f_x(x) + f_z(z) + q^T(x - v) + s^T(z - Av)
      + (1/2)||x - v||^2_{τ_r} + (1/2)||z - Av||^2_{τ_p},        (58)

and ADMM iterations analogous to (41):

  (x^{t+1}, z^{t+1}) = arg min_{x,z} L(x, z, s^t, q^t, v^t; τ_p, τ_r),        (59a)
  s^{t+1} = s^t + Diag(1./τ_p)(z^{t+1} - Av^t),        (59b)
  q^{t+1} = q^t + Diag(1./τ_r)(x^{t+1} - v^t),        (59c)
  v^{t+1} = arg min_v L(x^{t+1}, z^{t+1}, s^{t+1}, q^{t+1}, v; τ_p, τ_r).        (59d)

Since the minimization (59a) separates across components, Algorithm 3 can then be run with the MAP (proximal) estimation functions

  [g_x(r, τ_r)]_j ≜ arg min_{x_j} f_{x_j}(x_j) + (x_j - r_j)^2/(2τ_{r_j}),        (62a)
  [g_z(p, τ_p)]_i ≜ arg min_{z_i} f_{z_i}(z_i) + (z_i - p_i)^2/(2τ_{p_i}),        (62b)

so that, in particular,

  z^{t+1} = g_z(p^t, τ_p).        (63)
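To make the estimation functions concrete, the sketch below compares the MMSE function (52), computed by numerical integration, with the MAP/proximal function (62) for a scalar Laplace penalty f_x(x) = λ|x|; this penalty is an illustrative assumption, not one used in the paper's experiments.

```python
# Sketch comparing the MMSE estimation function (52) with the MAP/proximal
# function (62) for a scalar Laplace penalty f_x(x) = lam*|x|.
import numpy as np

lam = 1.0
grid = np.linspace(-20.0, 20.0, 20001)

def gx_mmse(r, tau):
    logw = -lam * np.abs(grid) - (grid - r) ** 2 / (2.0 * tau)
    w = np.exp(logw - logw.max())          # stabilized unnormalized posterior
    return np.sum(grid * w) / np.sum(w)    # posterior mean, as in (52)

def gx_map(r, tau):
    # prox of lam*|.| with step tau: soft thresholding
    return np.sign(r) * max(abs(r) - lam * tau, 0.0)

for r in (0.5, 2.0, -3.0):
    print(f"r={r:5.1f}  MMSE={gx_mmse(r, 0.5):+.4f}  MAP={gx_map(r, 0.5):+.4f}")
```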
To interpret ADMM-GAMP under the MAP estimation functions, define the marginal minimization functions

  δ_{x_j}(x_j) ≜ min_{x: [x]_j = x_j} J(x, Ax),        (64a)
  δ_{z_i}(z_i) ≜ min_{x: z_i = [Ax]_i} J(x, Ax).        (64b)

These functions can be characterized through the limits δ_{x_j}(x_j) = -lim_{T→0} T ln p_{x_j}(x_j; T) and δ_{z_i}(z_i) = -lim_{T→0} T ln p_{z_i}(z_i; T), where p_{x_j}(x_j; T) and p_{z_i}(z_i; T) are the marginal densities of the scaled joint density

  p(x; T) ≜ (1/Z) exp[ -(1/T)( f_x(x) + f_z(Ax) ) ].        (65)

Note that, for any T > 0, we can estimate the marginal posteriors p_{x_j}(x_j; T) and p_{z_i}(z_i; T) using the LSL-BFE optimization from Section V. That is, we can use the estimates

  δ_{x_j}(x_j) ≈ δ̂_{x_j}(x_j) ≜ -lim_{T→0} T ln b̂_{x_j}(x_j; T),        (66a)
  δ_{z_i}(z_i) ≈ δ̂_{z_i}(z_i) ≜ -lim_{T→0} T ln b̂_{z_i}(z_i; T),        (66b)

where b̂_{x_j}(x_j; T) and b̂_{z_i}(z_i; T) are the belief estimates computed via the LSL-BFE optimization under the scaled penalties

  f_x(x; T) ≜ f_x(x)/T,   f_z(z; T) ≜ f_z(z)/T.        (67)

In Appendix B, it is shown that these estimates take the form

  δ̂_{x_j}^t(x_j) = f_{x_j}(x_j) + (x_j - r_j^t)^2/(2τ_{r_j}^t),        (68a)
  δ̂_{z_i}^t(z_i) = f_{z_i}(z_i) + (z_i - p_i^t)^2/(2τ_{p_i}^t),        (68b)

where the parameters r_j^t, p_i^t, τ_{r_j}^t, and τ_{p_i}^t are the outputs of ADMM-GAMP under the MAP estimation functions (62). In this sense, ADMM-GAMP under the MAP estimation functions can be seen as a limiting case of ADMM-GAMP under the MMSE estimation functions. Hence, according to (66), MAP ADMM-GAMP can be used to compute estimates (68) of the marginal minimization functions (64). Furthermore, according to (62) and (63), x^{t+1} and z^{t+1} are the minimizers of these functions,

  x̂_j^{t+1} = arg min_{x_j} δ̂_{x_j}^t(x_j),        (69a)
  ẑ_i^{t+1} = arg min_{z_i} δ̂_{z_i}^t(z_i),        (69b)

while the variance outputs correspond to the inverse curvatures at those minimizers,

  τ_{x_j} = τ_{r_j} g_{x_j}'(r_j, τ_{r_j}) = [ ∂^2 δ̂_{x_j}^t(x̂_j^{t+1}) / ∂x_j^2 ]^{-1},
  τ_{z_i} = τ_{p_i} g_{z_i}'(p_i, τ_{p_i}) = [ ∂^2 δ̂_{z_i}^t(ẑ_i^{t+1}) / ∂z_i^2 ]^{-1}.        (70)

Finally, as shown in Appendix B, the variances produced by MAP ADMM-GAMP solve an optimization of the form

  (τ_x, τ_z) = arg min_{τ_x, τ_z} J_2(τ_x, τ_z, x̂, ẑ),        (71)

where

  J_2(τ_x, τ_z, x̂, ẑ) ≜ Σ_{j=1}^n [ τ_{x_j} f_{x_j}''(x̂_j) - ln τ_{x_j} ]
      + Σ_{i=1}^m [ τ_{z_i} ( f_{z_i}''(ẑ_i) + 1/τ_{p_i} ) + ln( τ_{p_i}/τ_{z_i} ) ],        (72)

with τ_p ≜ S τ_x.
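For a scalar example of the estimated marginal cost (68a), its minimizer (69a), and the curvature-based variance in (70), the sketch below assumes the smooth penalty f_{x_j}(x) = x^2/2 (an illustrative choice).

```python
# Sketch of the marginal cost (68a), its minimizer (69a), and the
# curvature-based variance from (70), for the penalty f_xj(x) = x^2/2
# (so f_xj'' = 1).
import numpy as np

r, tau_r = 1.5, 0.4
f = lambda x: 0.5 * x ** 2

delta_hat = lambda x: f(x) + (x - r) ** 2 / (2.0 * tau_r)   # (68a)

grid = np.linspace(-5.0, 5.0, 20001)
x_hat_grid = grid[np.argmin(delta_hat(grid))]               # (69a), by grid search
x_hat_exact = r / (1.0 + tau_r)                             # closed form for this penalty
tau_x = 1.0 / (1.0 + 1.0 / tau_r)                           # 1/delta_hat'' = 1/(f'' + 1/tau_r)

print("x_hat (grid) =", x_hat_grid, "  x_hat (exact) =", x_hat_exact)
print("tau_x from curvature =", tau_x)
```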
For the remainder of this section, we will show the convergence of ADMM-GAMP in the special case of convex and
smooth penalties fx and fz . We begin by analyzing the convergence of the ADMM inner-loop under fixed linearization terms
r and p . It is well-known that, when one applies ADMM to
a general optimization problem of the form (37) with convex f
and full-rank B, the method will converge [9]. However, in our
case, the objective function is the linearized LSL-BFE in (31),
which is not necessarily convex, even if the penalty functions
fx and fz are. The problem is that the variances var(x|bx ) and
var(z|bz ) are not convex functions of the densities bx and bz
(in fact, they are concave). We thus need a separate proof.
We will prove convergence under the following assumption.
Assumption 2: For fixed τ_r and τ_p, the estimation functions g_x(r, τ_r) and g_z(p, τ_p) are separable in r and p, in that

  g_x(r, τ_r) = ( g_{x_1}(r_1, τ_{r_1}), ..., g_{x_n}(r_n, τ_{r_n}) ),
  g_z(p, τ_p) = ( g_{z_1}(p_1, τ_{p_1}), ..., g_{z_m}(p_m, τ_{p_m}) ),

for scalar functions g_{x_j} and g_{z_i}. In addition, these scalar functions have, with respect to their first arguments, continuous first derivatives g_{x_j}' and g_{z_i}' satisfying

  g_{x_j}'(r_j, τ_{r_j}) ≤ 1,   g_{z_i}'(p_i, τ_{p_i}) ≤ 1.        (73)

The bounds (73) hold, in particular, under the strong convexity and smoothness condition

  A ≤ f_{z_i}''(z_i) ≤ B   for all z_i,        (74)

and the analogous condition on f_{x_j} (see Lemma 1 and Appendix D).
results from minimizing the linearized LSL-BFE via ADMM under the splitting rule E(z|b_z) = Av and E(x|b_x) = v (as described in Section V-B), whereas the original GAMP uses stale, linearized ADMM under the conventional splitting rule E(z|b_z) = A E(x|b_x). Both use the same iterative LSL-BFE linearization strategy described in Section IV-D.

We can derive the mean updates in the original GAMP using the augmented Lagrangian

  L(b_x, b_z, s; τ_p) ≜ J(b_x, b_z, τ_r, τ_p) + s^T [ E(z|b_z) - A E(x|b_x) ] + (1/2) ||E(z|b_z) - A E(x|b_x)||^2_{τ_p},        (78)

for the J defined in (31), and stale, linearized ADMM:

  b_x^{t+1} = arg min_{b_x} L(b_x, b_z^t, s^{t-1}; τ_p)
      + (1/2) [ E(x|b_x) - E(x|b_x^t) ]^T ( D_r - A^T D_p A ) [ E(x|b_x) - E(x|b_x^t) ],        (79a)
  b_z^{t+1} = arg min_{b_z} L(b_x^{t+1}, b_z, s^t; τ_p),        (79b)
  s^{t+1} = s^t + D_p [ E(z|b_z^{t+1}) - A E(x|b_x^{t+1}) ],        (79c)

where D_τ ≜ Diag(1./τ). Note the addition of a linearization term in (79a) to decouple the minimization. The resulting approach goes by several names: linearized ADMM [51, Sec. 4.4.2], split inexact Uzawa [10], and primal-dual hybrid gradient (PDHG) [10]. Note also the use of the stale dual estimate s^{t-1} in (79a), as opposed to the most recent dual estimate s^t. In the context of PDHG, this stale update is known as Arrow-Hurwicz [10]. In Appendix H, we show that the recursion (79) yields the mean updates in the original sum-product GAMP algorithm (i.e., the non-indented lines in Algorithm 4).

Regarding the variance updates of the original sum-product GAMP algorithm (i.e., the indented lines in Algorithm 4), a visual inspection shows that they match the non-damped ADMM-GAMP gradient updates (i.e., lines 15-18 of Algorithm 3 under θ^t = 1), except for one small difference: in the original sum-product GAMP, the update of τ_s uses the same version of τ_p used by the z update, whereas in ADMM-GAMP, the update of τ_s uses a more recent version of τ_p.

B. Recovering GAMP from ADMM-GAMP

We now show that the mean updates of the original sum-product GAMP can be recovered by approximating the mean updates of ADMM-GAMP. For simplicity, we suppress the t index on the variance terms.

At any critical point of Algorithm 3, we must have q^t = -A^T s^t and z^t = Ax^t, as shown in (107). If we substitute these two constraints into the v-update objective in (49), we obtain

  ||z^t + τ_p.s^t - Av||^2_{τ_p} + ||x^t + τ_r.q^t - v||^2_{τ_r}
    = ||A(x^t - v) + τ_p.s^t||^2_{τ_p} + ||x^t - v - Diag(τ_r) A^T s^t||^2_{τ_r}.

It can be verified that the minimum of this function occurs at v = x^t. So, if we substitute v^t = x^t and q^t = -A^T s^t into the mean updates in Algorithm 3, we obtain

  x^{t+1} = g_x(r^t, τ_r),
  z^{t+1} = g_z(p^t, τ_p),
  s^{t+1} = s^t + Diag(1./τ_p)(z^{t+1} - Ax^t),
  r^{t+1} = x^{t+1} + Diag(τ_r) A^T s^{t+1},
  p^{t+1} = Ax^{t+1} - τ_p.s^{t+1}.

Then, substituting the p update into the s update, defining z̄^t = z^{t+1} and s̄^t = s^{t+1}, and reordering the steps, we obtain

  p^t = Ax^t - τ_p.s̄^{t-1},
  z̄^t = g_z(p^t, τ_p),
  s̄^t = Diag(1./τ_p)(z̄^t - p^t),
  r^t = x^t + Diag(τ_r) A^T s̄^t,
  x^{t+1} = g_x(r^t, τ_r).        (80)
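The recovered recursion (80) is easy to transcribe. The sketch below implements one pass of the five mean updates, assuming fixed variance vectors and user-supplied estimation functions; it is meant to show the update order, not to reproduce the paper's experiments.

```python
# Sketch of one pass of the recovered GAMP mean updates (80), assuming fixed
# variance vectors tau_r, tau_p and callables gx, gz (illustrative only).
import numpy as np

def gamp_mean_updates(A, x, s_bar, tau_r, tau_p, gx, gz):
    p = A @ x - tau_p * s_bar              # p^t = A x^t - tau_p . s_bar^{t-1}
    z_bar = gz(p, tau_p)                   # z_bar^t = gz(p^t, tau_p)
    s_bar = (z_bar - p) / tau_p            # s_bar^t = (z_bar^t - p^t) ./ tau_p
    r = x + tau_r * (A.T @ s_bar)          # r^t = x^t + tau_r . (A^T s_bar^t)
    x = gx(r, tau_r)                       # x^{t+1} = gx(r^t, tau_r)
    return x, z_bar, s_bar

# tiny usage example with simple linear (Gaussian-style) estimators
rng = np.random.default_rng(4)
m, n = 8, 16
A = rng.normal(size=(m, n)) / np.sqrt(m)
gx = lambda r, tr: r / (1.0 + tr)          # unit-variance Gaussian-prior estimator
gz = lambda p, tp: p / (1.0 + tp)          # placeholder output estimator
x, z_bar, s_bar = gamp_mean_updates(A, np.zeros(n), np.zeros(m),
                                    0.5 * np.ones(n), 0.5 * np.ones(m), gx, gz)
print(x.shape, z_bar.shape, s_bar.shape)
```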
Fig. 2. Average NMSE (dB) versus the measurement ratio m/n for the support-aware genie, LASSO, GAMP, and ADMM-GAMP.
under test after averaging the results of 100 Monte Carlo trials.
Here, since y and z = Ax are related through AWGN, the
GAMP algorithm of [22] reduces to the Bayesian version of
the AMP algorithm from [18].
Note that the case of i.i.d. A is the ideal scenario for both AMP and GAMP. As discussed in the Introduction, their convergence in this case is guaranteed rigorously through state-evolution analysis [22]-[24] as m, n → ∞. In Figure 2, since m and n are sufficiently large, it is not surprising to see that GAMP performs well over all measurement ratios m/n. Furthermore, it is interesting to notice that GAMP outperforms LASSO and obtains NMSEs that are very close to those of the support-aware genie. Under such ideal A, the proposed ADMM-GAMP method matches the performance of GAMP (since it minimizes the same objective) but does not offer any additional benefit.
The benefits of ADMM-GAMP become apparent in our second experiment, which uses non-i.i.d. matrices A. In describing the experiment, we first recall that [25] established that the convergence of GAMP can be predicted by the peak-to-average ratio of the squared singular values,

  κ(A) ≜ σ_1^2(A) / ( Σ_{i=1}^r σ_i^2(A) / r ),        (81)

where r = min{m, n} and σ_i(A) is the i-th largest singular value of A. When this ratio is sufficiently large, the algorithm will diverge. Thus, to test the robustness of ADMM-GAMP, we constructed a sequence of matrices A with varying κ, as follows. First, the left and right singular vectors of A were generated by drawing an m × n matrix with i.i.d. N(0, 1/m) entries and taking its singular-value decomposition. Then, the singular values of A were chosen by setting the largest at σ_1(A) = 1 and logarithmically spacing each successive singular value to attain the desired peak-to-average ratio κ.
As a function of κ, the NMSE performance of the various algorithms under test is illustrated in Figure 3 for the case of m = 600 measurements. There it can be seen that, for larger values of κ, the NMSE performance of the original GAMP algorithm deteriorated, which was a result of the algorithm diverging.
Fig. 3. Average NMSE versus peak-to-average squared-singular-value ratio κ(A) when recovering a length n = 1000 Bernoulli-Gaussian signal x from m = 600 AWGN-corrupted measurements y = Ax + e. Note the superior performance of ADMM-GAMP relative to both the original GAMP and SwAMP, and the proximity of ADMM-GAMP to the support-aware genie.
Fig. 4. Average NMSE versus peak-to-average squared-singular-value ratio κ(A) when recovering a length n = 1000 Bernoulli-Gaussian signal x from m = 2000 noiseless 1-bit measurements y = sgn(Ax). Note the superior performance of ADMM-GAMP relative to the original GAMP and SwAMP.
CONCLUSIONS
Despite many promising results of AMP methods, the major
stumbling block to more widespread use is their convergence
and numerical stability. Although AMP techniques admit
provable guarantees for i.i.d. A, they can easily diverge for
transforms that occur in many practical problems. While several methods have been proposed to improve the convergence,
this paper provides a method with provable guarantees under
arbitrary transforms. The method leverages well-established
concepts of double-loop methods in belief propagation [32] as
well as the classic ADMM method in optimization [9].
Nevertheless, there is still much work to be done. Most
obviously, the proposed ADMM-GAMP method comes at a
computational cost. Each iteration requires solving a (potentially large) least squares problem (49) that is not needed in
the original AMP and GAMP algorithms. Similar to standard applications of ADMM, this minimization can likely
be performed via conjugate gradient iterations, but its implementation requires further study. In any case, it is possible
that ADMM-GAMP will be slower than other variants of
GAMP. Indeed, our simulations suggest that other methods
such as SwAMP or adaptively damped GAMP [28] may
provide equally robust performance with less cost per iteration.
One line of future work would thus be to see whether the proof techniques in this paper can be extended to address these algorithms as well.
The analysis in this paper might also be extended to other
variants of AMP and GAMP. For example, it is conceivable
that similar analysis could be applied to develop convergent
approaches to the expectation-maximization (EM) GAMP developed in [41], [54]-[57], turbo and hybrid GAMP methods in [58], [59], and applications in dictionary learning and matrix factorization [60]-[62].
APPENDIX A
PROOF OF THEOREM 1

Throughout this appendix, we use the shorthand notation h'(q) ≜ ∂h(q)/∂q ∈ R^p for the gradient, and write b^k ≜ b̂(λ^k) and q^k ≜ g(b^k).

First we show, by induction, that λ^k ∈ Λ for all k. Recall that, by the hypothesis of the theorem, λ^0 ∈ Λ. Now suppose that λ^k ∈ Λ. Then the updates in Algorithm 1 imply that

  h'(q^k) = h'(g(b^k)) = h'(g(b̂(λ^k))).

Then, by Assumption 1(c), h'(q^k) ∈ Λ. Since λ^k ∈ Λ, θ^k ∈ (0, 1], and Λ is convex,

  λ^{k+1} = (1 - θ^k) λ^k + θ^k h'(q^k) ∈ Λ.

Thus, by induction, λ^k ∈ Λ for all k.

Next, we prove the decrementing property (27). First observe that, since the restriction b ∈ B is a linear constraint, we can find a linear transform B and vector b_0 such that b ∈ B if and only if b = Bx + b_0 for some vector x. It can be verified that we can reparametrize the functions f(·) and g(·) around x and obtain the exact same recursions in Algorithm 1. Also, all the conditions in Assumption 1 will hold for the reparametrized functions as well. Thus, for the remainder of the proof, we can ignore the constraint b ∈ B.

Since b̂(λ) minimizes J(b, λ) ≜ f(b) + λ^T g(b), it satisfies the stationarity condition

  0 = ∂J(b̂(λ), λ)/∂b = f'(b̂(λ)) + Σ_{ℓ=1}^L λ_ℓ g_ℓ'(b̂(λ))        (82)
    = f'(b̂(λ)) + g'(b̂(λ))^T λ.        (83)

Differentiating (82) with respect to λ^T and applying the chain rule, we obtain

  0 = ∂/∂λ^T [ f'(b̂(λ)) + Σ_{ℓ=1}^L λ_ℓ g_ℓ'(b̂(λ)) ]
    = H(λ) ∂b̂(λ)/∂λ^T + g'(b̂(λ))^T,        (84)

where H(λ) is the Hessian from (25). Equation (84) then implies

  ∂b̂(λ)/∂λ^T = -H(λ)^{-1} g'(b̂(λ))^T.        (85)

Therefore,

  ∂J(b̂(λ))/∂λ^T
   (a) = [ f'(b̂(λ)) + g'(b̂(λ))^T h'(g(b̂(λ))) ]^T ∂b̂(λ)/∂λ^T
   (b) = [ h'(g(b̂(λ))) - λ ]^T g'(b̂(λ)) ∂b̂(λ)/∂λ^T
   (c) = -[ h'(g(b̂(λ))) - λ ]^T g'(b̂(λ)) H(λ)^{-1} g'(b̂(λ))^T,        (86)

where (a) follows from (22) and the chain rule, (b) follows from (83), and (c) follows from (85).

Notice that the λ update in Algorithm 1 can be written as

  λ^{k+1} - λ^k = θ^k [ h'(q^k) - λ^k ].

Taking an inner product of the above with (86) evaluated at λ = λ^k, we get

  [ ∂J(b̂(λ^k))/∂λ^T ] (λ^{k+1} - λ^k)
    = -θ^k [ h'(q^k) - λ^k ]^T g'(b^k) H(λ^k)^{-1} g'(b^k)^T [ h'(q^k) - λ^k ]
    ≤ -(θ^k / c_2) || g'(b^k)^T [ λ^k - h'(q^k) ] ||^2,        (87)

recalling that b̂(λ^k) = b^k and that c_2 was defined in Assumption 1(b). Therefore, the update of λ^k is in a descent direction on the objective J(b̂(λ)). Hence, for a sufficiently small damping parameter θ^k, we will have

  J(b^{k+1}) - J(b^k) = J(b̂(λ^{k+1})) - J(b̂(λ^k)) ≤ 0,

which proves the decrementing property (27).
APPENDIX B
LARGE DEVIATIONS VIEW OF MAP ESTIMATION

Consider the LSL-BFE under the scaled penalties (67),

  J(b_x, b_z; T) = D(b_x ‖ Z_{x,T}^{-1} e^{-f_x/T}) + D(b_z ‖ Z_{z,T}^{-1} e^{-f_z/T}) + H( var(x|b_x), var(z|b_z) )
    = (1/T)[ E(f_x(x)|b_x) + E(f_z(z)|b_z) ] - H(b_x) - H(b_z) + H( var(x|b_x), var(z|b_z) ) + const.        (90)

Let x̂^t(T), ẑ^t(T), s^t(T), q^t(T), τ_x^t(T), etc., denote the quantities produced by ADMM-GAMP with the MMSE estimation functions under the scaled penalties, and define the limits

  x̂^t = lim_{T→0} x̂^t(T),   ẑ^t = lim_{T→0} ẑ^t(T),
  s^t = lim_{T→0} T s^t(T),   q^t = lim_{T→0} T q^t(T),
  τ_x^t = lim_{T→0} τ_x^t(T)/T,   τ_r^t = lim_{T→0} τ_r^t(T)/T,        (88)

and similarly for the remaining mean, dual, and variance quantities. We will assume that all of these limits exist. Note that some of the terms are scaled by T and others by 1/T. These normalizations are important. It is easily checked that the scalings all cancel, so that the limiting values satisfy the recursions of Algorithm 3 with the limiting estimation functions

  g_x(r, τ_r) ≜ lim_{T→0} g_x(r, τ_r(T); T) = lim_{T→0} g_x(r, τ_r T; T),        (89a)
  g_z(p, τ_p) ≜ lim_{T→0} g_z(p, τ_p(T); T) = lim_{T→0} g_z(p, τ_p T; T),        (89b)

where g_x(r, τ_r T; T) and g_z(p, τ_p T; T) are the MMSE estimation functions (51) for the scaled penalties (67). Note that we have used the scalings in (88), which show τ_r(T) ≈ τ_r T and τ_p(T) ≈ τ_p T for small T. Now, the scaled function g_x(r, τ_r T; T) is the expectation E(x|T) with respect to the density

  p(x | r, τ_r T; T) ∝ exp[ -(1/T) f_x(x) - (1/(2T)) ||x - r||^2_{τ_r} ].

Laplace's principle [50] from large deviations theory shows that (under mild conditions) this density concentrates around its maxima, and thus the expectation with respect to this density converges to the minimizer

  lim_{T→0} g_x(r, τ_r T; T) = arg min_x [ f_x(x) + (1/2)||x - r||^2_{τ_r} ],

which is exactly the MAP estimation function (62a); a similar argument applies to g_z. Likewise, the beliefs under the scaled penalties satisfy

  b_{x_j}(x_j | r_j, τ_{r_j} T) ∝ exp[ -(1/T) f_{x_j}(x_j) - (1/(2Tτ_{r_j}^t)) (x_j - r_j^t)^2 ],        (91)

from which we can prove the limits in (68).

It remains to show that the LSL-BFE in (16) with the scaled penalties (67) decomposes into the optimizations (56) and (71). As T → 0, the beliefs concentrate as

  ln b_{x_j}(x_j) ≈ -(x_j - x̂_j)^2/(2Tτ_{x_j}) + const,        (92)

where

  1/τ_{x_j} = -T ∂^2 ln b_{x_j}(x_j)/∂x_j^2,   x̂_j = arg min_{x_j} [ -ln b_{x_j}(x_j) ],        (93)

and similarly b_{z_i}(z_i) ≈ N(ẑ_i, Tτ_{z_i}). A Taylor expansion of the penalty around x̂_j then gives

  E( f_{x_j}(x) | b_{x_j} )
    ≈ ∫ Σ_{k=0}^∞ [ f_{x_j}^{(k)}(x̂_j)/k! ] (x - x̂_j)^k N(x; x̂_j, Tτ_{x_j}) dx        (94)
    = Σ_{k=0}^∞ [ f_{x_j}^{(k)}(x̂_j)/k! ] ∫ (x - x̂_j)^k N(x; x̂_j, Tτ_{x_j}) dx        (95)
    = Σ_{l=0}^∞ [ f_{x_j}^{(2l)}(x̂_j)/(2l)! ] (Tτ_{x_j})^l (2l - 1)!!        (96)
    = Σ_{l=0}^∞ [ f_{x_j}^{(2l)}(x̂_j)/(2^l l!) ] (Tτ_{x_j})^l        (97)
    = f_{x_j}(x̂_j) + (T/2) τ_{x_j} f_{x_j}''(x̂_j) + O(T^2),

while the entropies behave as

  H(b_{x_j}) ≈ (1/2) ln(2πeTτ_{x_j}),   H(b_{z_i}) ≈ (1/2) ln(2πeTτ_{z_i}).        (98), (99)

Substituting these expansions into (90), we obtain

  J(b_x, b_z; T) = (1/T) J(x̂, ẑ) + (1/2) J_2(τ_x, τ_z, x̂, ẑ) + const,        (101)

which shows that, as T → 0, minimizing the LSL-BFE decomposes into the optimizations (56) and (71).
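The Laplace-principle limit can be checked numerically for a scalar penalty: as T decreases, the posterior mean under the scaled penalty approaches the MAP/proximal value. The penalty f_x(x) = |x| below is an illustrative assumption.

```python
# Sketch of the T -> 0 limit used above: the scaled-posterior mean approaches
# the MAP/prox value argmin_x f_x(x) + (x - r)^2/(2*tau_r), for f_x(x) = |x|.
import numpy as np

f = lambda x: np.abs(x)
r, tau_r = 1.2, 0.5
grid = np.linspace(-10.0, 10.0, 200001)

def mean_at_temperature(T):
    logw = -(f(grid) + (grid - r) ** 2 / (2.0 * tau_r)) / T
    w = np.exp(logw - logw.max())          # normalized for numerical stability
    return np.sum(grid * w) / np.sum(w)

prox = np.sign(r) * max(abs(r) - tau_r, 0.0)   # exact minimizer for |.|
for T in (1.0, 0.1, 0.01):
    print(f"T={T:5.2f}  posterior mean = {mean_at_temperature(T):+.4f}")
print("T->0 limit (prox) =", prox)
```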
T
2
(c)
(d)
L(bx , bz , s, q, v; r , p )
bx
h
J(bx , bz , r , p ) + qT E(x|bx )
bx
i
1
+ kE(x|bx ) vk2r
2
i
h
J(bx , bz , r , p ) qT E(x|bx )
bx
i
h
J(bx , bz , r , p ) sT AE(x|bx )
bx "
J(bx , bz ) H var(x|bx ), z
bx
#
T
L0 (bx , bz , s) = 0.
(109)
bz
(e)
A PPENDIX C
P ROOF OF T HEOREMS 2 AND 3
We will just prove Theorem 2 since the proof of Theorem 3
is very similar. For the original constrained optimization (14),
define the Lagrangian
L0 (bx , bz , s) , J(bx , bz ) + sT (E(z|bz ) AE(x|bx )). (102)
We need to show that any fixed points (bx , bz , s) of ADMMGAMP are critical points of this Lagrangian.
First observe that, any fixed point, r from line 22 of
Algorithm 3 satisfies
1./(2r ) = 1./(2 r ) =
H(x , z )
,
x
(103)
H(x , z )
.
z
(104)
From (41b) and (41c), we see that any fixed point satisfies
E(z|bz ) = Av,
E(x|bx ) = v.
(105)
0 = A Dp E(z|bz )Av+p .s +Dr E(x|bx )v+r .q ,
(106)
where D = Diag(1./ ). Plugging (105) into the previous
expression, we obtain
q = AT s.
(107)
Together, (108) and (109) show that (bx , bz ) are critical points
of the Lagrangian L0 (bx , bz , s) for the dual parameters s. Since
these densities also satisfy the constraint E(z|bz ) = AE(x|bx ),
we conclude that (bx , bz ) are critical points of the constrained
optimization (14).
APPENDIX D
PROOF OF LEMMA 1

For the MAP estimation functions (62), we know that

  x̂_j = g_{x_j}(r_j, τ_{r_j}) = arg min_{x_j} f_{x_j}(x_j) + (x_j - r_j)^2/(2τ_{r_j}),

which implies that x_j = x̂_j is a solution to 0 = f_{x_j}'(x_j) + (x_j - r_j)/τ_{r_j}, i.e., that

  x̂_j = r_j - τ_{r_j} f_{x_j}'(x̂_j).

Taking the derivative with respect to r_j, we find

  ∂x̂_j/∂r_j = 1 - τ_{r_j} f_{x_j}''(x̂_j) ∂x̂_j/∂r_j,

which can be rearranged to form

  ∂x̂_j/∂r_j = g_{x_j}'(r_j, τ_{r_j}) = 1 / ( 1 + f_{x_j}''(x̂_j) τ_{r_j} ).        (110)

Similarly, for the MMSE estimation functions, we have τ_{x_j} = var(x_j | r_j, τ_{r_j})
and g_{x_j}'(r_j, τ_{r_j}) = τ_{x_j}/τ_{r_j}. Writing b_{x_j}(x_j | r_j, τ_{r_j}) ∝ exp(-h(x_j)) with h(x_j) ≜ f_{x_j}(x_j) + (x_j - r_j)^2/(2τ_{r_j}), the curvature bounds A + 1/τ_{r_j} ≤ h''(x_j) ≤ B + 1/τ_{r_j} yield matching upper and lower bounds on the posterior variance, so that in both the MAP and MMSE cases

  1/(1 + Bτ_{r_j}) ≤ g_{x_j}'(r_j, τ_{r_j}) ≤ 1/(1 + Aτ_{r_j}),

which proves (73).

APPENDIX E
PROOF OF THEOREM 4

We find it easier to analyze the algorithm after the variables are combined and scaled as

  τ ≜ [τ_r; τ_p],   D ≜ Diag(1./τ),        (115)

and

  w ≜ D^{1/2} [x; z],   u ≜ D^{-1/2} [q; s],   B ≜ D^{1/2} [I; A].        (116)

Also, we define

  g(w, τ) ≜ [ g_x(x, τ_r); g_z(z, τ_p) ],        (117)

and the projection matrices

  P ≜ B (B^T B)^{-1} B^T,        (119)
  P⊥ ≜ I - P.        (120)

With these definitions, the mean updates of Algorithm 3 can be written as

  w^{t+1} = g̃( Pw^t - P⊥u^t ),        (121)

where

  g̃(w) ≜ D^{1/2} g( D^{-1/2} w ),        (122)

and

  P⊥u^{t+1} = P⊥u^t + P⊥w^{t+1} = P⊥u^t + P⊥ g̃( Pw^t - P⊥u^t ).        (123)

Defining the state

  θ^t ≜ [ Pw^t; P⊥u^t ],        (124)

and using P^2 = P and (P⊥)^2 = P⊥ so that [P, -P⊥] θ^t = Pw^t - P⊥u^t, we have from (121) and (123), respectively,

  Pw^{t+1} = P g̃( [P, -P⊥] θ^t ),        (125)
  P⊥u^{t+1} = P⊥u^t + P⊥ g̃( [P, -P⊥] θ^t ).        (126)

From (124), (125), and (126), we see that the mean update steps in Algorithm 3 are characterized by the recursive system

  θ^{t+1} = f(θ^t)        (127)

for

  f(θ) ≜ [P; P⊥] g̃( [P, -P⊥] θ ) + [0, 0; 0, P⊥] θ,        (128)

whose Jacobian is

  f'(θ) = [P; P⊥] g̃'(w) [P, -P⊥] + [0, 0; 0, P⊥].        (130)
APPENDIX F
PROOF OF THEOREM 5

Define

  τ_s ≜ (1 - τ_z./τ_p) ./ τ_p,        (139)

and restrict the linearization terms to a compact set of the form

  Γ ≜ { (τ_r, τ_p) : τ_r ∈ [a_r, b_r], τ_p ∈ [a_p, b_p] }.        (140)

The second derivative of the linearized LSL-BFE (31) with respect to b_x involves terms of the form

  (1./(2τ_r))^T var(x|b_x),        (141)

and, under the strong convexity and smoothness bound (74), each such term is bounded below by a positive constant that depends only on A, B, and the endpoints of Γ. We conclude that there exists an ε > 0 such that

  J''(b) ⪰ ε I

at any minimum b = b̂ of the linearized LSL-BFE when (τ_r, τ_p) ∈ Γ. This proves Assumption 1(b). The uniform boundedness of all the other derivatives follows from the fact that all the terms are twice differentiable and the set Γ is compact. Thus, all the conditions of Assumption 1 hold, and the theorem follows from Theorem 1.
APPENDIX G
PROOF OF THEOREM 6

We begin by proving part (a), using induction. Suppose that (77) is satisfied for some t. Since q^0, x^0, and v^0 are fixed points, we have from line 10 of Algorithm 3 that x^0 = v^0. Then, since x^0 is a fixed point, we have from lines 7 and 9 and equation (62) that

  x^0 = g_x(r^0, τ_r^0) = arg min_x f_x(x) + (1/2)||x - v^0 + τ_r^0.q^0||^2_{τ_r^0},        (144)

whose optimality condition at x = x^0 (using v^0 = x^0) reads

  0 = f_x'(x^0) + Diag(1./τ_r^0)(x^0 - v^0) + q^0.

It follows that the mean quantities remain at their fixed-point values for all t, which proves part (a).

Turning to the variance updates, (110) gives

  τ_{x_j}^{t+1} = τ_{r_j}^t / ( 1 + f_{x_j}''(x_j^{t+1}) τ_{r_j}^t ).

Rewriting this in vector form and using the updates in Algorithm 3 with θ^t = 1, we obtain

  1./τ_x^{t+1} = 1./τ_r^t + f_x''(x^{t+1}) = S^T s̄^t + f_x''(x^{t+1}) = S^T s̄^t + η_x,        (147)

where η_x ≜ f_x''(x) is positive due to the convexity assumption and invariant to t due to part (a). Similarly, for the output estimation function g_z,

  τ_z^{t+1} = τ_p^t . g_z'(p^t, τ_p^t) = τ_p^t ./ ( 1 + f_z''(z^{t+1}).τ_p^t ).        (148)

Therefore, from the modified update of s̄^{t+1} in (76),

  s̄^{t+1} = f_z''(z^{t+1}) ./ ( 1 + f_z''(z^{t+1}).τ_p^t ),

or equivalently,

  1./s̄^{t+1} = τ_p^t + 1./f_z''(z^{t+1}) = S τ_x^t + η_z,   η_z ≜ 1./f_z''(z^{t+1}).        (149)

Hence, if we define

  Φ_x(s̄) := 1./( S^T s̄ + η_x ),        (150)

then τ_x^{t+1} = Φ_x(s̄^t), and combining (149) and (150) shows that the variance sequence evolves as τ_x^{t+1} = Φ(τ_x^t) with Φ(τ_x) ≜ 1./( S^T [ 1./(S τ_x + η_z) ] + η_x ). One can verify that this map satisfies

  (i) Φ(τ_x) > 0,
  (ii) τ_x ≤ τ_x' implies Φ(τ_x) ≤ Φ(τ_x'), and
  (iii) for all α > 1, Φ(ατ_x) < α Φ(τ_x).
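The variance recursion (147)-(150) can be simulated directly; the sketch below iterates the composed map with random positive data (illustrative only) and reports the fixed-point residual.

```python
# Sketch of the variance fixed point in the proof of Theorem 6:
# 1./tau_x = S^T s_bar + eta_x and 1./s_bar = S tau_x + eta_z, cf. (147)-(150).
import numpy as np

rng = np.random.default_rng(6)
m, n = 20, 40
S = rng.random((m, n)) / m          # nonnegative, plays the role of A.A
eta_x = rng.random(n) + 0.5         # stands in for f_x''(x), positive
eta_z = rng.random(m) + 0.5         # stands in for 1./f_z''(z), positive

tau_x = np.ones(n)
for _ in range(200):
    s_bar = 1.0 / (S @ tau_x + eta_z)
    tau_x = 1.0 / (S.T @ s_bar + eta_x)          # Phi_x(s_bar), cf. (150)

s_bar = 1.0 / (S @ tau_x + eta_z)
residual = np.max(np.abs(1.0 / (S.T @ s_bar + eta_x) - tau_x))
print("fixed-point residual:", residual, "  tau_x[:3] =", tau_x[:3])
```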
APPENDIX H
ORIGINAL GAMP VIA STALE, LINEARIZED ADMM

We now show that the recursion (79) yields the mean updates of the original sum-product GAMP algorithm. Substituting (78) and (31) into (79a) and canceling the terms that do not depend on b_x, the b_x update reduces to

  b_x^{t+1} = arg min_{b_x} D(b_x ‖ exp(-f_x)) + E( (1/2)||x - r^t||^2_{τ_r} | b_x ),   with   r^t ≜ x^t + Diag(τ_r) A^T s^t,

so that

  b_x^{t+1}(x) ∝ exp( -f_x(x) - (1/2)||x - r^t||^2_{τ_r} )   and   E(x|b_x^{t+1}) = g_x(r^t, τ_r).

Similarly, substituting (78) into (79b) and canceling the b_z-invariant terms gives

  b_z^{t+1} = arg min_{b_z} D(b_z ‖ Z_z^{-1} exp(-f_z)) + E( (1/2)||z - p^{t+1}||^2_{τ_p} | b_z ),   with   p^{t+1} ≜ A x^{t+1} - τ_p.s^t,

so that

  b_z^{t+1}(z) ∝ exp( -f_z(z) - (1/2)||z - p^{t+1}||^2_{τ_p} )   and   E(z|b_z^{t+1}) = g_z(p^{t+1}, τ_p).

Finally, combining the dual update (79c) with the definition of p^{t+1} gives

  s^{t+1} = Diag(1./τ_p) [ E(z|b_z^{t+1}) - p^{t+1} ],

which, together with the mean updates above, are exactly the mean updates of the original sum-product GAMP algorithm (the non-indented lines in Algorithm 4).