Inference For Generalized Linear Models Via Alternating Directions and Bethe Free Energy Minimization
Abstract: Generalized Linear Models (GLMs), where a random vector x is observed through a noisy, possibly nonlinear,
function of a linear transform z = Ax, arise in a range of
applications in nonlinear filtering and regression. Approximate
Message Passing (AMP) methods, based on loopy belief propagation, are a promising class of approaches for approximate
inference in these models. AMP methods are computationally
simple, general, and admit precise analyses with testable conditions for optimality for large i.i.d. transforms A. However, the
algorithms can diverge for general A. This paper presents a
convergent approach to the generalized AMP (GAMP) algorithm
based on direct minimization of a large-system limit approximation of the Bethe Free Energy (LSL-BFE). The proposed
method uses a double-loop procedure, where the outer loop
successively linearizes the LSL-BFE and the inner loop minimizes
the linearized LSL-BFE using the Alternating Direction Method
of Multipliers (ADMM). The proposed method, called ADMM-GAMP, is similar in structure to the original GAMP method, but
with an additional least-squares minimization. It is shown that for
strictly convex, smooth penalties, ADMM-GAMP is guaranteed
to converge to a local minimum of the LSL-BFE, thus providing
a convergent alternative to GAMP that is stable under arbitrary
transforms. Simulations are also presented that demonstrate the
robustness of the method for non-convex penalties as well.
Index Terms: Belief propagation, ADMM, variational optimization, message passing, generalized linear models.
I. INTRODUCTION

We consider estimation of a random vector x from observations y in a generalized linear model, where the posterior density of x given y takes the form

  p_{x|y}(x|y) = (1/Z(y)) exp[ -f_x(x) - f_z(Ax, y) ],        (2)

and where the quantities of interest for MMSE inference are the posterior means and variances

  b̂_j ≜ E(x_j | y),        (3a)
  τ_{x_j} ≜ var(x_j | y).        (3b)

We study this inference problem in the case where the functions f_x and f_z are separable, in that they are of the form

  f_x(x) = Σ_{j=1}^n f_{x_j}(x_j),        (4a)
  f_z(z) = Σ_{i=1}^m f_{z_i}(z_i),        (4b)

for some scalar functions f_{x_j} and f_{z_i}. The separability assumption (4a) corresponds to the components of x being a priori independent. Recalling the implicit dependence of f_z on y, the separability assumption (4b) corresponds to the observations y being conditionally independent given the transform outputs z ≜ Ax.

For posterior densities of the form (2), there are several computationally efficient methods to find the maximum a posteriori (MAP) estimate, which is given by

  x̂ = arg max_x p_{x|y}(x|y) = arg min_x [ f_x(x) + f_z(Ax) ].        (5)
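As a concrete, purely illustrative instance of the separable structure (4) and the MAP objective (5), the Python sketch below builds a toy GLM; the Laplace input penalty, AWGN output penalty, dimensions, and parameter values are assumptions made for this example, not choices from the paper.

```python
# Sketch of a separable GLM of the form (2)-(5), assuming a Laplace input
# penalty f_x and an AWGN output penalty f_z (illustrative choices only).
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 100
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))   # linear transform
x_true = rng.normal(size=n) * (rng.random(n) < 0.1)  # sparse input
y = A @ x_true + 0.01 * rng.normal(size=m)           # noisy observation

lam, sigma2 = 0.1, 1e-4

def f_x(x):
    # separable input penalty: f_x(x) = sum_j lam*|x_j|
    return lam * np.sum(np.abs(x))

def f_z(z):
    # separable output penalty: f_z(z) = sum_i (y_i - z_i)^2 / (2*sigma2)
    return np.sum((y - z) ** 2) / (2 * sigma2)

def map_objective(x):
    # objective minimized by the MAP estimate in (5)
    return f_x(x) + f_z(A @ x)

print("objective at x_true:", map_objective(x_true))
print("objective at zero  :", map_objective(np.zeros(n)))
```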
Fig. 1. The generalized linear model: an unknown input x ~ p_x with independent components passes through the linear transform A and then a componentwise output map p_{y|z}.
For a factor graph with variable beliefs b ≜ [b_1, ..., b_n]^T and factor beliefs q ≜ [q_1, ..., q_L]^T, the constrained Bethe Free Energy (BFE) minimization takes the form

  (b̂, q̂) ≜ arg min_{(b,q) ∈ E}  Σ_{ℓ=1}^L D(q_ℓ ‖ ψ_ℓ) + Σ_{j=1}^n (n_j - 1) H(b_j),        (10)

where H(·) denotes differential entropy, n_j is the number of factors in which x_j appears,

  D(a‖b) ≜ ∫ a(x) ln [ a(x)/b(x) ] dx        (11)

denotes KL divergence, and E is the set of beliefs satisfying the marginal-consistency constraints

  ∫ q_ℓ(x_(ℓ)) dx_(ℓ)\j = b_j(x_j),  for all ℓ, j.        (12)

For the GLM posterior (2) with separable penalties (4), the factors are

  ψ_j(x_j) = exp(-f_{x_j}(x_j)),  j = 1, ..., n,        (13a)
  ψ_{n+i}(x) = exp(-f_{z_i}(a_i^T x)),  i = 1, ..., m,        (13b)

where a_i^T is the i-th row of A. Note that, if A is a non-sparse matrix, then f_{z_i}(a_i^T x) depends on all components of the vector x. In this case, the application of traditional loopy BP, as described for example in [43], does not generally yield a significant computational improvement.

The GAMP algorithm from [22] can be seen as an approximate BFE minimization method for GLMs with possibly dense transforms A. Specifically, it was shown in [29] that the stationary points of GAMP coincide with the local minima of the constrained optimization

  (b̂_x, b̂_z) ≜ arg min_{b_x, b_z} J(b_x, b_z)        (14a)
  such that  E(z|b_z) = A E(x|b_x),        (14b)

where the beliefs are restricted to the fully factorized form

  b_x(x) = Π_{j=1}^n b_{x_j}(x_j),   b_z(z) = Π_{i=1}^m b_{z_i}(z_i),        (15)

and where J(b_x, b_z) is the large-system-limit approximation of the BFE (LSL-BFE),

  J(b_x, b_z) ≜ D(b_x ‖ exp(-f_x)) + D(b_z ‖ exp(-f_z)) + H( var(x|b_x), var(z|b_z) ),        (16)

with

  H(τ_x, τ_z) ≜ Σ_{i=1}^m [ τ_{z_i} / ( 2 Σ_{j=1}^n S_{ij} τ_{x_j} ) + (1/2) ln( 2π Σ_{j=1}^n S_{ij} τ_{x_j} ) ],        (17)

where S ≜ A.A denotes the componentwise square of A.
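The H term in (17) is straightforward to evaluate numerically. The sketch below is a direct transcription under the assumption that the logarithmic term is (1/2)ln(2π·); all data are randomly generated for illustration.

```python
# Sketch of the H(tau_x, tau_z) term in (17), with S = A.A the
# componentwise square of A.
import numpy as np

def H_lsl(tau_x, tau_z, S):
    tau_p = S @ tau_x                       # tau_p_i = sum_j S_ij * tau_x_j, cf. (33)
    return np.sum(tau_z / (2 * tau_p) + 0.5 * np.log(2 * np.pi * tau_p))

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 8)) / np.sqrt(5)
S = A ** 2
tau_x = rng.random(8) + 0.1
tau_z = rng.random(5) + 0.1
print("H(tau_x, tau_z) =", H_lsl(tau_x, tau_z, S))
```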
The double-loop approach of [32] (the CCCP) minimizes objectives of the form

  J(b) = f(b) + h(g(b))        (22)

with convex f and concave h∘g by alternately linearizing the concave part and minimizing the resulting convex objective:

  b^{k+1} = arg min_{(b,q) ∈ E}  f(b) + (λ^k)^T q,        (21a)
  λ^{k+1} = ∂h(q^k)/∂q.        (21b)

In the iterative linearization method of Algorithm 1, the linearization vector is instead updated with a damping parameter θ^k ∈ (0, 1],

  λ^{k+1} = (1 - θ^k) λ^k + θ^k ∂h(q^{k+1})/∂q,   q^{k+1} ≜ g(b^{k+1}),

which, as shown in Appendix A, guarantees the decrementing property

  J(b^{k+1}) ≤ J(b^k)  for all k.        (28)

Writing

  f(b) ≜ D(b_x ‖ exp(-f_x)) + D(b_z ‖ exp(-f_z)),        (29a)
  g(b) ≜ [ var(x|b_x); var(z|b_z) ],        (29b)
  h(q) ≜ H(q_x, q_z),        (29c)

we see that J(b_x, b_z) from (16) can be cast into the form in (22). Observe that, while f is convex, the function h(g(·)) is, in general, neither convex nor concave. Thus, while the CCCP does not apply, we can apply the iterative linearization method from Algorithm 1.

We will partition the linearization vector conformally with the function g in (29b) as

  λ = [ 1./(2τ_r); 1./(2τ_p) ],        (30)

so that the linearized LSL-BFE becomes

  J(b_x, b_z, τ_r, τ_p) ≜ D(b_x ‖ exp(-f_x)) + D(b_z ‖ exp(-f_z))
      + (1./(2τ_r))^T var(x|b_x) + (1./(2τ_p))^T var(z|b_z).        (31)

Finally, we compute the gradient h' = ∂h/∂q of the function h from (29c). Similar to λ, we will partition the gradient into two terms,

  1./(2τ_r) ≜ ∂H(τ_x, τ_z)/∂τ_x,   1./(2τ_p) ≜ ∂H(τ_x, τ_z)/∂τ_z.        (32)

From (17), the derivative of H with respect to τ_{z_i} is

  ∂H(τ_x, τ_z)/∂τ_{z_i} = 1 / ( 2 Σ_{j=1}^n S_{ij} τ_{x_j} ) ≜ 1/(2τ_{p_i}),        (33)

so that, in vector form, τ_p = S τ_x. Likewise, the derivative with respect to τ_{x_j} is

  ∂H(τ_x, τ_z)/∂τ_{x_j} = Σ_{i=1}^m S_{ij} [ 1/(2τ_{p_i}) - τ_{z_i}/(2τ_{p_i}^2) ] ≜ 1/(2τ_{r_j}),        (34)

or, in vector form,

  1./τ_r = S^T [ (1 - τ_z./τ_p) ./ τ_p ].        (35)
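The linearization steps (33) and (35) amount to two matrix-vector products with S and S^T. A minimal sketch, with the variances chosen so that the resulting τ_r stays positive:

```python
# Sketch of the outer-loop linearization terms: tau_p from (33) and tau_r
# from (35), with S = A.A; tau_z is chosen smaller than S tau_x so that
# tau_r stays positive in this random example.
import numpy as np

def linearization_update(tau_x, tau_z, S):
    tau_p = S @ tau_x                                        # (33): tau_p = S tau_x
    tau_r = 1.0 / (S.T @ ((1.0 - tau_z / tau_p) / tau_p))    # (35)
    return tau_r, tau_p

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 10)) / np.sqrt(6)
S = A ** 2
tau_x = rng.random(10) + 0.1
tau_z = 0.5 * (S @ tau_x)
tau_r, tau_p = linearization_update(tau_x, tau_z, S)
print("tau_r[:3] =", tau_r[:3], "  tau_p[:3] =", tau_p[:3])
```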
Substituting the above computations into the iterative linearization algorithm, Algorithm 1, we obtain Algorithm 2. We refer to this as the outer loop, since each iteration involves a minimization of the linearized LSL-BFE in line 5. We discuss this latter minimization next and show that it can itself be performed by an iterative inner loop based on ADMM.

E. Alternative Methods

While the method proposed in this paper is based on the CCCP of [32], there are other methods for direct minimization of the BFE that may apply to the LSL-BFE as well. For example, for problems with binary variables and pairwise penalty functions, [44], [45] propose a clever re-parametrization to convert the constrained BFE minimization to an unconstrained optimization on which gradient descent can be used. Unfortunately, it is not obvious whether the LSL-BFE here admits such a re-parametrization, since the penalty functions are not pairwise and the variables are not binary.
V. INNER-LOOP MINIMIZATION AND ADMM-GAMP

A. ADMM Principle

The outer-loop algorithm, Algorithm 2, requires that in each iteration we solve a constrained optimization of the form

  (b_x, b_z) = arg min_{b_x, b_z} J(b_x, b_z, τ_r, τ_p)   s.t.   E(z|b_z) = A E(x|b_x).        (36)

We will show that this optimization can be performed by the Alternating Direction Method of Multipliers (ADMM) [9]. ADMM is a general approach to constrained optimizations of the form

  min_w f(w)   s.t.   Bw = 0,        (37)

based on an augmented Lagrangian

  L(w, s) ≜ f(w) + s^T Bw + (α/2) ||Bw||^2,        (38)

which is alternately minimized over (blocks of) the primal variable w and then used for a gradient ascent step on the dual variable s:

  w^{t+1} = arg min_w L(w, s^t),        (39a)
  s^{t+1} = s^t + α B w^{t+1}.        (39b)
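A minimal sketch of the augmented-Lagrangian iteration (38)-(39) for the generic problem (37), assuming a simple quadratic f(w); the penalty parameter and dimensions are arbitrary illustrative choices.

```python
# Minimal sketch of the iteration (38)-(39) applied to (37), assuming the
# quadratic objective f(w) = 0.5*||w - c||^2 (illustrative only).
import numpy as np

rng = np.random.default_rng(3)
p, d = 4, 10
B = rng.normal(size=(p, d))
c = rng.normal(size=d)
rho = 1.0

w = np.zeros(d)
s = np.zeros(p)
for _ in range(200):
    # w-update: minimize f(w) + s^T B w + (rho/2)||B w||^2 in closed form
    w = np.linalg.solve(np.eye(d) + rho * B.T @ B, c - B.T @ s)
    # dual ascent step
    s = s + rho * (B @ w)

print("constraint residual ||Bw|| =", np.linalg.norm(B @ w))
```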
Applying this principle to (36) with the variable splitting E(x|b_x) = v and E(z|b_z) = Av yields the augmented Lagrangian

  L(b_x, b_z, s, q, v; τ_p, τ_r) ≜ J(b_x, b_z, τ_r, τ_p) + q^T [ E(x|b_x) - v ] + s^T [ E(z|b_z) - Av ]
      + (1/2) ||E(x|b_x) - v||^2_{τ_r} + (1/2) ||E(z|b_z) - Av||^2_{τ_p},        (40)

where ||u||^2_τ ≜ Σ_i u_i^2/τ_i, and the ADMM iterations

  (b_x^{t+1}, b_z^{t+1}) = arg min_{b_x, b_z} L(b_x, b_z, s^t, q^t, v^t; τ_p, τ_r),        (41a)
  s^{t+1} = s^t + Diag(1./τ_p) [ E(z|b_z^{t+1}) - Av^t ],        (41b)
  q^{t+1} = q^t + Diag(1./τ_r) [ E(x|b_x^{t+1}) - v^t ],        (41c)
  v^{t+1} = arg min_v L(b_x^{t+1}, b_z^{t+1}, s^{t+1}, q^{t+1}, v; τ_p, τ_r).        (41d)
Noting identities such as

  (1./(2τ_p))^T var(z|b_z) = Σ_{i=1}^m τ_{z_i}/(2τ_{p_i}),        (44b)

and substituting (31), (42), (43), and (44) into (40) and canceling terms, we get

  L(b_x, b_z, s, q, v; τ_p, τ_r)
    = D(b_x ‖ exp(-f_x)) + E( (1/2)||x - (v - τ_r.q)||^2_{τ_r} | b_x )
    + D(b_z ‖ Z_z^{-1} exp(-f_z)) + E( (1/2)||z - (Av - τ_p.s)||^2_{τ_p} | b_z ) + const        (45)
    = ∫_{R^n} b_x(x) ln [ b_x(x) / exp( -f_x(x) - (1/2)||x - (v - τ_r.q)||^2_{τ_r} ) ] dx
    + ∫_{R^m} b_z(z) ln [ b_z(z) / exp( -f_z(z) - (1/2)||z - (Av - τ_p.s)||^2_{τ_p} ) ] dz + const.        (46)

Hence, the beliefs minimizing (41a) are

  b_x^{t+1}(x) ∝ exp( -f_x(x) - (1/2)||x - r^t||^2_{τ_r} ),        (47a)
  b_z^{t+1}(z) ∝ exp( -f_z(z) - (1/2)||z - p^t||^2_{τ_p} ),        (47b)

where

  r^t ≜ v^t - τ_r.q^t,        (48a)
  p^t ≜ Av^t - τ_p.s^t,        (48b)

and where we use "." to denote componentwise vector multiplication. Using Bayes rule, (47a) can be interpreted as the posterior density of the random vector x under the prior ∝ exp(-f_x(x)) and an independent Gaussian likelihood with mean r^t and variance τ_r. Similarly, (47b) can be interpreted as the posterior pdf of the random vector z under the likelihood ∝ exp(-f_z(z)) and an independent Gaussian prior with mean p^t and variance τ_p.

To tackle the minimization (41d), we ignore the v-invariant components in the original augmented Lagrangian (40), after which (41d) can be reformulated as the least-squares problem

  v^{t+1} = arg min_v  ||z^{t+1} + τ_p.s^{t+1} - Av||^2_{τ_p} + ||x^{t+1} + τ_r.q^{t+1} - v||^2_{τ_r}.        (49)
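Since the weighting in (49) is diagonal, the v-update reduces to a small weighted least-squares solve. A sketch via the normal equations (a conjugate-gradient solver could replace the direct solve for large A):

```python
# Sketch of the v-update (49), a weighted least-squares problem solved via
# its normal equations.
import numpy as np

def v_update(A, x, z, q, s, tau_r, tau_p):
    Dr = np.diag(1.0 / tau_r)
    Dp = np.diag(1.0 / tau_p)
    lhs = A.T @ Dp @ A + Dr
    rhs = A.T @ Dp @ (z + tau_p * s) + Dr @ (x + tau_r * q)
    return np.linalg.solve(lhs, rhs)

rng = np.random.default_rng(0)
m, n = 5, 8
A = rng.normal(size=(m, n))
v = v_update(A, rng.normal(size=n), rng.normal(size=m),
             rng.normal(size=n), rng.normal(size=m),
             rng.random(n) + 0.1, rng.random(m) + 0.1)
print(v.shape)
```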
Algorithm 3 ADMM-GAMP
Require: Matrix A, estimation functions g_x and g_z.
 1: S ← A.A (componentwise square)
 2: Initialize τ_r^0 > 0, τ_p^0 > 0, v^0
 3: q^0 ← 0, s^0 ← 0
 4: t ← 0
 5: repeat
 6:   {ADMM inner iteration}
 7:   r^t ← v^t - τ_r^t.q^t
 8:   p^t ← Av^t - τ_p^t.s^t
 9:   x^{t+1} ← g_x(r^t, τ_r^t),   z^{t+1} ← g_z(p^t, τ_p^t)
10:   q^{t+1} ← q^t + Diag(1./τ_r^t)(x^{t+1} - v^t)
11:   s^{t+1} ← s^t + Diag(1./τ_p^t)(z^{t+1} - Av^t)
12:   Compute v^{t+1} from (49)
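Lines 7-12 of Algorithm 3 can be transcribed almost directly. The sketch below performs one inner iteration, assuming the componentwise estimators g_x and g_z are supplied as callables; the dense normal-equation solve at the end is one possible implementation of (49).

```python
# Sketch of one ADMM inner iteration of Algorithm 3 (lines 7-12).
import numpy as np

def admm_gamp_inner(A, v, q, s, tau_r, tau_p, gx, gz):
    r = v - tau_r * q                              # line 7
    p = A @ v - tau_p * s                          # line 8
    x = gx(r, tau_r)                               # line 9
    z = gz(p, tau_p)
    q = q + (x - v) / tau_r                        # line 10
    s = s + (z - A @ v) / tau_p                    # line 11
    Dr, Dp = np.diag(1.0 / tau_r), np.diag(1.0 / tau_p)
    v = np.linalg.solve(A.T @ Dp @ A + Dr,         # line 12: v-update from (49)
                        A.T @ Dp @ (z + tau_p * s) + Dr @ (x + tau_r * q))
    return x, z, q, s, v
```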
In Algorithm 3, the belief updates are represented through their means

  x^{t+1} ≜ E(x | b_x^{t+1}),        (50)

and z^{t+1} ≜ E(z | b_z^{t+1}), which are computed via the estimation functions

  x^{t+1} = E(x | b_x^{t+1}) = g_x(r^t, τ_r^t),        (51a)
  z^{t+1} = E(z | b_z^{t+1}) = g_z(p^t, τ_p^t).        (51b)

For MMSE estimation, (47) and the separability of the penalties imply that these functions can be computed componentwise as

  x̂_j^{t+1} = [g_x(r^t, τ_r^t)]_j = ∫ x exp( -f_{x_j}(x) - (x - r_j^t)^2/(2τ_{r_j}^t) ) dx / ∫ exp( -f_{x_j}(x) - (x - r_j^t)^2/(2τ_{r_j}^t) ) dx,        (52)
  ẑ_i^{t+1} = [g_z(p^t, τ_p^t)]_i = ∫ z exp( -f_{z_i}(z) - (z - p_i^t)^2/(2τ_{p_i}^t) ) dz / ∫ exp( -f_{z_i}(z) - (z - p_i^t)^2/(2τ_{p_i}^t) ) dz.        (53)

Furthermore, the variances of b_x^{t+1} and b_z^{t+1} can be computed in a componentwise manner using the derivatives of g_{x_j} and g_{z_i} with respect to their first argument [22], i.e.,

  τ_x^{t+1} ≜ var(x | b_x^{t+1}) = τ_r^t . g_x'(r^t, τ_r^t),        (54a)
  τ_z^{t+1} ≜ var(z | b_z^{t+1}) = τ_p^t . g_z'(p^t, τ_p^t).        (54b)

The same machinery applies to MAP estimation, i.e., to the constrained optimization

  (x̂, ẑ) ≜ arg min_{x,z} J(x, z)   s.t.   z = Ax,        (56)

with J(x, z) ≜ f_x(x) + f_z(z). In this case, we use the augmented Lagrangian

  L(x, z, s, q, v; τ_p, τ_r) ≜ f_x(x) + f_z(z) + q^T(x - v) + s^T(z - Av)
      + (1/2)||x - v||^2_{τ_r} + (1/2)||z - Av||^2_{τ_p},        (58)

and ADMM iterations analogous to (41):

  (x^{t+1}, z^{t+1}) = arg min_{x,z} L(x, z, s^t, q^t, v^t; τ_p, τ_r),        (59a)
  s^{t+1} = s^t + Diag(1./τ_p)(z^{t+1} - Av^t),        (59b)
  q^{t+1} = q^t + Diag(1./τ_r)(x^{t+1} - v^t),        (59c)
  v^{t+1} = arg min_v L(x^{t+1}, z^{t+1}, s^{t+1}, q^{t+1}, v; τ_p, τ_r).        (59d)

Since the minimization (59a) separates across components, Algorithm 3 can then be run with the MAP (proximal) estimation functions

  [g_x(r, τ_r)]_j ≜ arg min_{x_j} f_{x_j}(x_j) + (x_j - r_j)^2/(2τ_{r_j}),        (62a)
  [g_z(p, τ_p)]_i ≜ arg min_{z_i} f_{z_i}(z_i) + (z_i - p_i)^2/(2τ_{p_i}),        (62b)

so that, in particular,

  z^{t+1} = g_z(p^t, τ_p).        (63)
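To make the estimation functions concrete, the sketch below compares the MMSE function (52), computed by numerical integration, with the MAP/proximal function (62) for a scalar Laplace penalty f_x(x) = λ|x|; this penalty is an illustrative assumption, not one used in the paper's experiments.

```python
# Sketch comparing the MMSE estimation function (52) with the MAP/proximal
# function (62) for a scalar Laplace penalty f_x(x) = lam*|x|.
import numpy as np

lam = 1.0
grid = np.linspace(-20.0, 20.0, 20001)

def gx_mmse(r, tau):
    logw = -lam * np.abs(grid) - (grid - r) ** 2 / (2.0 * tau)
    w = np.exp(logw - logw.max())          # stabilized unnormalized posterior
    return np.sum(grid * w) / np.sum(w)    # posterior mean, as in (52)

def gx_map(r, tau):
    # prox of lam*|.| with step tau: soft thresholding
    return np.sign(r) * max(abs(r) - lam * tau, 0.0)

for r in (0.5, 2.0, -3.0):
    print(f"r={r:5.1f}  MMSE={gx_mmse(r, 0.5):+.4f}  MAP={gx_map(r, 0.5):+.4f}")
```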
To interpret ADMM-GAMP under the MAP estimation functions, define the marginal minimization functions

  δ_{x_j}(x_j) ≜ min_{x: [x]_j = x_j} J(x, Ax),        (64a)
  δ_{z_i}(z_i) ≜ min_{x: z_i = [Ax]_i} J(x, Ax).        (64b)

These functions can be characterized through the limits δ_{x_j}(x_j) = -lim_{T→0} T ln p_{x_j}(x_j; T) and δ_{z_i}(z_i) = -lim_{T→0} T ln p_{z_i}(z_i; T), where p_{x_j}(x_j; T) and p_{z_i}(z_i; T) are the marginal densities of the scaled joint density

  p(x; T) ≜ (1/Z) exp[ -(1/T)( f_x(x) + f_z(Ax) ) ].        (65)

Note that, for any T > 0, we can estimate the marginal posteriors p_{x_j}(x_j; T) and p_{z_i}(z_i; T) using the LSL-BFE optimization from Section V. That is, we can use the estimates

  δ_{x_j}(x_j) ≈ δ̂_{x_j}(x_j) ≜ -lim_{T→0} T ln b̂_{x_j}(x_j; T),        (66a)
  δ_{z_i}(z_i) ≈ δ̂_{z_i}(z_i) ≜ -lim_{T→0} T ln b̂_{z_i}(z_i; T),        (66b)

where b̂_{x_j}(x_j; T) and b̂_{z_i}(z_i; T) are the belief estimates computed via the LSL-BFE optimization under the scaled penalties

  f_x(x; T) ≜ f_x(x)/T,   f_z(z; T) ≜ f_z(z)/T.        (67)

In Appendix B, it is shown that these estimates take the form

  δ̂_{x_j}^t(x_j) = f_{x_j}(x_j) + (x_j - r_j^t)^2/(2τ_{r_j}^t),        (68a)
  δ̂_{z_i}^t(z_i) = f_{z_i}(z_i) + (z_i - p_i^t)^2/(2τ_{p_i}^t),        (68b)

where the parameters r_j^t, p_i^t, τ_{r_j}^t, and τ_{p_i}^t are the outputs of ADMM-GAMP under the MAP estimation functions (62). In this sense, ADMM-GAMP under the MAP estimation functions can be seen as a limiting case of ADMM-GAMP under the MMSE estimation functions. Hence, according to (66), MAP ADMM-GAMP can be used to compute estimates (68) of the marginal minimization functions (64). Furthermore, according to (62) and (63), x^{t+1} and z^{t+1} are the minimizers of these functions,

  x̂_j^{t+1} = arg min_{x_j} δ̂_{x_j}^t(x_j),        (69a)
  ẑ_i^{t+1} = arg min_{z_i} δ̂_{z_i}^t(z_i),        (69b)

while the variance outputs correspond to the inverse curvatures at those minimizers,

  τ_{x_j} = τ_{r_j} g_{x_j}'(r_j, τ_{r_j}) = [ ∂^2 δ̂_{x_j}^t(x̂_j^{t+1}) / ∂x_j^2 ]^{-1},
  τ_{z_i} = τ_{p_i} g_{z_i}'(p_i, τ_{p_i}) = [ ∂^2 δ̂_{z_i}^t(ẑ_i^{t+1}) / ∂z_i^2 ]^{-1}.        (70)

Finally, as shown in Appendix B, the variances produced by MAP ADMM-GAMP solve an optimization of the form

  (τ_x, τ_z) = arg min_{τ_x, τ_z} J_2(τ_x, τ_z, x̂, ẑ),        (71)

where

  J_2(τ_x, τ_z, x̂, ẑ) ≜ Σ_{j=1}^n [ τ_{x_j} f_{x_j}''(x̂_j) - ln τ_{x_j} ]
      + Σ_{i=1}^m [ τ_{z_i} ( f_{z_i}''(ẑ_i) + 1/τ_{p_i} ) + ln( τ_{p_i}/τ_{z_i} ) ],        (72)

with τ_p ≜ S τ_x.
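For a scalar example of the estimated marginal cost (68a), its minimizer (69a), and the curvature-based variance in (70), the sketch below assumes the smooth penalty f_{x_j}(x) = x^2/2 (an illustrative choice).

```python
# Sketch of the marginal cost (68a), its minimizer (69a), and the
# curvature-based variance from (70), for the penalty f_xj(x) = x^2/2
# (so f_xj'' = 1).
import numpy as np

r, tau_r = 1.5, 0.4
f = lambda x: 0.5 * x ** 2

delta_hat = lambda x: f(x) + (x - r) ** 2 / (2.0 * tau_r)   # (68a)

grid = np.linspace(-5.0, 5.0, 20001)
x_hat_grid = grid[np.argmin(delta_hat(grid))]               # (69a), by grid search
x_hat_exact = r / (1.0 + tau_r)                             # closed form for this penalty
tau_x = 1.0 / (1.0 + 1.0 / tau_r)                           # 1/delta_hat'' = 1/(f'' + 1/tau_r)

print("x_hat (grid) =", x_hat_grid, "  x_hat (exact) =", x_hat_exact)
print("tau_x from curvature =", tau_x)
```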
For the remainder of this section, we will show the convergence of ADMM-GAMP in the special case of convex and
smooth penalties fx and fz . We begin by analyzing the convergence of the ADMM inner-loop under fixed linearization terms
r and p . It is well-known that, when one applies ADMM to
a general optimization problem of the form (37) with convex f
and full-rank B, the method will converge [9]. However, in our
case, the objective function is the linearized LSL-BFE in (31),
which is not necessarily convex, even if the penalty functions
fx and fz are. The problem is that the variances var(x|bx ) and
var(z|bz ) are not convex functions of the densities bx and bz
(in fact, they are concave). We thus need a separate proof.
We will prove convergence under the following assumption.
Assumption 2: For fixed τ_r and τ_p, the estimation functions g_x(r, τ_r) and g_z(p, τ_p) are separable in r and p, in that

  g_x(r, τ_r) = ( g_{x_1}(r_1, τ_{r_1}), ..., g_{x_n}(r_n, τ_{r_n}) ),
  g_z(p, τ_p) = ( g_{z_1}(p_1, τ_{p_1}), ..., g_{z_m}(p_m, τ_{p_m}) ),

for scalar functions g_{x_j} and g_{z_i}. In addition, these scalar functions have, with respect to their first arguments, continuous first derivatives g_{x_j}' and g_{z_i}' satisfying

  g_{x_j}'(r_j, τ_{r_j}) ≤ 1,   g_{z_i}'(p_i, τ_{p_i}) ≤ 1.        (73)

The bounds (73) hold, in particular, under the strong convexity and smoothness condition

  A ≤ f_{z_i}''(z_i) ≤ B   for all z_i,        (74)

and the analogous condition on f_{x_j} (see Lemma 1 and Appendix D).
results from minimizing the linearized LSL-BFE via ADMM under the splitting rule E(z|b_z) = Av and E(x|b_x) = v (as described in Section V-B), whereas the original GAMP uses stale, linearized ADMM under the conventional splitting rule E(z|b_z) = A E(x|b_x). Both use the same iterative LSL-BFE linearization strategy described in Section IV-D.

We can derive the mean updates in the original GAMP using the augmented Lagrangian

  L(b_x, b_z, s; τ_p) ≜ J(b_x, b_z, τ_r, τ_p) + s^T [ E(z|b_z) - A E(x|b_x) ] + (1/2) ||E(z|b_z) - A E(x|b_x)||^2_{τ_p},        (78)

for the J defined in (31), and stale, linearized ADMM:

  b_x^{t+1} = arg min_{b_x} L(b_x, b_z^t, s^{t-1}; τ_p)
      + (1/2) [ E(x|b_x) - E(x|b_x^t) ]^T ( D_r - A^T D_p A ) [ E(x|b_x) - E(x|b_x^t) ],        (79a)
  b_z^{t+1} = arg min_{b_z} L(b_x^{t+1}, b_z, s^t; τ_p),        (79b)
  s^{t+1} = s^t + D_p [ E(z|b_z^{t+1}) - A E(x|b_x^{t+1}) ],        (79c)

where D_τ ≜ Diag(1./τ). Note the addition of a linearization term in (79a) to decouple the minimization. The resulting approach goes by several names: linearized ADMM [51, Sec. 4.4.2], split inexact Uzawa [10], and primal-dual hybrid gradient (PDHG) [10]. Note also the use of the stale dual estimate s^{t-1} in (79a), as opposed to the most recent dual estimate s^t. In the context of PDHG, this stale update is known as Arrow-Hurwicz [10]. In Appendix H, we show that the recursion (79) yields the mean updates in the original sum-product GAMP algorithm (i.e., the non-indented lines in Algorithm 4).

Regarding the variance updates of the original sum-product GAMP algorithm (i.e., the indented lines in Algorithm 4), a visual inspection shows that they match the non-damped ADMM-GAMP gradient updates (i.e., lines 15-18 of Algorithm 3 under θ^t = 1), except for one small difference: in the original sum-product GAMP, the update of τ_s uses the same version of τ_p used by the z update, whereas in ADMM-GAMP, the update of τ_s uses a more recent version of τ_p.

B. Recovering GAMP from ADMM-GAMP

We now show that the mean updates of the original sum-product GAMP can be recovered by approximating the mean updates of ADMM-GAMP. For simplicity, we suppress the t index on the variance terms.

At any critical point of Algorithm 3, we must have q^t = -A^T s^t and z^t = Ax^t, as shown in (107). If we substitute these two constraints into the v-update objective in (49), we obtain

  ||z^t + τ_p.s^t - Av||^2_{τ_p} + ||x^t + τ_r.q^t - v||^2_{τ_r}
    = ||A(x^t - v) + τ_p.s^t||^2_{τ_p} + ||x^t - v - Diag(τ_r) A^T s^t||^2_{τ_r}.

It can be verified that the minimum of this function occurs at v = x^t. So, if we substitute v^t = x^t and q^t = -A^T s^t into the mean updates in Algorithm 3, we obtain

  x^{t+1} = g_x(r^t, τ_r),
  z^{t+1} = g_z(p^t, τ_p),
  s^{t+1} = s^t + Diag(1./τ_p)(z^{t+1} - Ax^t),
  r^{t+1} = x^{t+1} + Diag(τ_r) A^T s^{t+1},
  p^{t+1} = Ax^{t+1} - τ_p.s^{t+1}.

Then, substituting the p update into the s update, defining z̄^t = z^{t+1} and s̄^t = s^{t+1}, and reordering the steps, we obtain

  p^t = Ax^t - τ_p.s̄^{t-1},
  z̄^t = g_z(p^t, τ_p),
  s̄^t = Diag(1./τ_p)(z̄^t - p^t),
  r^t = x^t + Diag(τ_r) A^T s̄^t,
  x^{t+1} = g_x(r^t, τ_r).        (80)
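The recovered recursion (80) is easy to transcribe. The sketch below implements one pass of the five mean updates, assuming fixed variance vectors and user-supplied estimation functions; it is meant to show the update order, not to reproduce the paper's experiments.

```python
# Sketch of one pass of the recovered GAMP mean updates (80), assuming fixed
# variance vectors tau_r, tau_p and callables gx, gz (illustrative only).
import numpy as np

def gamp_mean_updates(A, x, s_bar, tau_r, tau_p, gx, gz):
    p = A @ x - tau_p * s_bar              # p^t = A x^t - tau_p . s_bar^{t-1}
    z_bar = gz(p, tau_p)                   # z_bar^t = gz(p^t, tau_p)
    s_bar = (z_bar - p) / tau_p            # s_bar^t = (z_bar^t - p^t) ./ tau_p
    r = x + tau_r * (A.T @ s_bar)          # r^t = x^t + tau_r . (A^T s_bar^t)
    x = gx(r, tau_r)                       # x^{t+1} = gx(r^t, tau_r)
    return x, z_bar, s_bar

# tiny usage example with simple linear (Gaussian-style) estimators
rng = np.random.default_rng(4)
m, n = 8, 16
A = rng.normal(size=(m, n)) / np.sqrt(m)
gx = lambda r, tr: r / (1.0 + tr)          # unit-variance Gaussian-prior estimator
gz = lambda p, tp: p / (1.0 + tp)          # placeholder output estimator
x, z_bar, s_bar = gamp_mean_updates(A, np.zeros(n), np.zeros(m),
                                    0.5 * np.ones(n), 0.5 * np.ones(m), gx, gz)
print(x.shape, z_bar.shape, s_bar.shape)
```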
Fig. 2. Average NMSE (dB) versus the measurement ratio m/n for the support-aware genie, LASSO, GAMP, and ADMM-GAMP.
under test after averaging the results of 100 Monte Carlo trials.
Here, since y and z = Ax are related through AWGN, the
GAMP algorithm of [22] reduces to the Bayesian version of
the AMP algorithm from [18].
Note that the case of i.i.d. A is the ideal scenario for both AMP and GAMP. As discussed in the Introduction, their convergence in this case is guaranteed rigorously through state-evolution analysis [22]-[24] as m, n → ∞. In Figure 2, since m and n are sufficiently large, it is not surprising to see that GAMP performs well over all measurement ratios m/n. Furthermore, it is interesting to notice that GAMP outperforms LASSO and obtains NMSEs that are very close to those of the support-aware genie. Under such ideal A, the proposed ADMM-GAMP method matches the performance of GAMP (since it minimizes the same objective) but does not offer any additional benefit.
The benefits of ADMM-GAMP become apparent in our second experiment, which uses non-i.i.d. matrices A. In describing the experiment, we first recall that [25] established that the convergence of GAMP can be predicted by the peak-to-average ratio of the squared singular values,

  κ(A) ≜ σ_1^2(A) / ( Σ_{i=1}^r σ_i^2(A) / r ),        (81)

where r = min{m, n} and σ_i(A) is the i-th largest singular value of A. When this ratio is sufficiently large, the algorithm will diverge. Thus, to test the robustness of ADMM-GAMP, we constructed a sequence of matrices A with varying κ, as follows. First, the left and right singular vectors of A were generated by drawing an m × n matrix with i.i.d. N(0, 1/m) entries and taking its singular-value decomposition. Then, the singular values of A were chosen by setting the largest at σ_1(A) = 1 and logarithmically spacing each successive singular value to attain the desired peak-to-average ratio κ.
As a function of κ, the NMSE performance of the various algorithms under test is illustrated in Figure 3 for the case of m = 600 measurements. There it can be seen that, for larger values of κ, the NMSE performance of the original GAMP algorithm deteriorated, which was a result of the algorithm diverging.
Fig. 3. Average NMSE versus peak-to-average squared-singular-value ratio κ(A) when recovering a length n = 1000 Bernoulli-Gaussian signal x from m = 600 AWGN-corrupted measurements y = Ax + e. Note the superior performance of ADMM-GAMP relative to both the original GAMP and SwAMP, and the proximity of ADMM-GAMP to the support-aware genie.
Fig. 4. Average NMSE versus peak-to-average squared-singular-value ratio κ(A) when recovering a length n = 1000 Bernoulli-Gaussian signal x from m = 2000 noiseless 1-bit measurements y = sgn(Ax). Note the superior performance of ADMM-GAMP relative to the original GAMP and SwAMP.
CONCLUSIONS
Despite many promising results of AMP methods, the major
stumbling block to more widespread use is their convergence
and numerical stability. Although AMP techniques admit
provable guarantees for i.i.d. A, they can easily diverge for
transforms that occur in many practical problems. While several methods have been proposed to improve the convergence,
this paper provides a method with provable guarantees under
arbitrary transforms. The method leverages well-established
concepts of double-loop methods in belief propagation [32] as
well as the classic ADMM method in optimization [9].
Nevertheless, there is still much work to be done. Most
obviously, the proposed ADMM-GAMP method comes at a
computational cost. Each iteration requires solving a (potentially large) least squares problem (49) that is not needed in
the original AMP and GAMP algorithms. Similar to standard applications of ADMM, this minimization can likely
be performed via conjugate gradient iterations, but its implementation requires further study. In any case, it is possible
that ADMM-GAMP will be slower than other variants of
GAMP. Indeed, our simulations suggest that other methods
such as SwAMP or adaptively damped GAMP [28] may
provide equally robust performance with less cost per iteration.
One line of future work would thus be to see whether the proof techniques in this paper can be extended to address these algorithms as well.
The analysis in this paper might also be extended to other
variants of AMP and GAMP. For example, it is conceivable
that similar analysis could be applied to develop convergent
approaches to the expectation-maximization (EM) GAMP developed in [41], [54]-[57], turbo and hybrid GAMP methods in [58], [59], and applications in dictionary learning and matrix factorization [60]-[62].
APPENDIX A
PROOF OF THEOREM 1

Throughout this appendix, we use the shorthand notation h'(q) ≜ ∂h(q)/∂q ∈ R^p for the gradient, and write b^k ≜ b̂(λ^k) and q^k ≜ g(b^k).

First we show, by induction, that λ^k ∈ Λ for all k. Recall that, by the hypothesis of the theorem, λ^0 ∈ Λ. Now suppose that λ^k ∈ Λ. Then the updates in Algorithm 1 imply that

  h'(q^k) = h'(g(b^k)) = h'(g(b̂(λ^k))).

Then, by Assumption 1(c), h'(q^k) ∈ Λ. Since λ^k ∈ Λ, θ^k ∈ (0, 1], and Λ is convex,

  λ^{k+1} = (1 - θ^k) λ^k + θ^k h'(q^k) ∈ Λ.

Thus, by induction, λ^k ∈ Λ for all k.

Next, we prove the decrementing property (27). First observe that, since the restriction b ∈ B is a linear constraint, we can find a linear transform B and vector b_0 such that b ∈ B if and only if b = Bx + b_0 for some vector x. It can be verified that we can reparametrize the functions f(·) and g(·) around x and obtain the exact same recursions in Algorithm 1. Also, all the conditions in Assumption 1 will hold for the reparametrized functions as well. Thus, for the remainder of the proof, we can ignore the constraint b ∈ B.

Since b̂(λ) minimizes J(b, λ) ≜ f(b) + λ^T g(b), it satisfies the stationarity condition

  0 = ∂J(b̂(λ), λ)/∂b = f'(b̂(λ)) + Σ_{ℓ=1}^L λ_ℓ g_ℓ'(b̂(λ))        (82)
    = f'(b̂(λ)) + g'(b̂(λ))^T λ.        (83)

Differentiating (82) with respect to λ^T and applying the chain rule, we obtain

  0 = ∂/∂λ^T [ f'(b̂(λ)) + Σ_{ℓ=1}^L λ_ℓ g_ℓ'(b̂(λ)) ]
    = H(λ) ∂b̂(λ)/∂λ^T + g'(b̂(λ))^T,        (84)

where H(λ) is the Hessian from (25). Equation (84) then implies

  ∂b̂(λ)/∂λ^T = -H(λ)^{-1} g'(b̂(λ))^T.        (85)

Therefore,

  ∂J(b̂(λ))/∂λ^T
   (a) = [ f'(b̂(λ)) + g'(b̂(λ))^T h'(g(b̂(λ))) ]^T ∂b̂(λ)/∂λ^T
   (b) = [ h'(g(b̂(λ))) - λ ]^T g'(b̂(λ)) ∂b̂(λ)/∂λ^T
   (c) = -[ h'(g(b̂(λ))) - λ ]^T g'(b̂(λ)) H(λ)^{-1} g'(b̂(λ))^T,        (86)

where (a) follows from (22) and the chain rule, (b) follows from (83), and (c) follows from (85).

Notice that the λ update in Algorithm 1 can be written as

  λ^{k+1} - λ^k = θ^k [ h'(q^k) - λ^k ].

Taking an inner product of the above with (86) evaluated at λ = λ^k, we get

  [ ∂J(b̂(λ^k))/∂λ^T ] (λ^{k+1} - λ^k)
    = -θ^k [ h'(q^k) - λ^k ]^T g'(b^k) H(λ^k)^{-1} g'(b^k)^T [ h'(q^k) - λ^k ]
    ≤ -(θ^k / c_2) || g'(b^k)^T [ λ^k - h'(q^k) ] ||^2,        (87)

recalling that b̂(λ^k) = b^k and that c_2 was defined in Assumption 1(b). Therefore, the update of λ^k is in a descent direction on the objective J(b̂(λ)). Hence, for a sufficiently small damping parameter θ^k, we will have

  J(b^{k+1}) - J(b^k) = J(b̂(λ^{k+1})) - J(b̂(λ^k)) ≤ 0,

which proves the decrementing property (27).
APPENDIX B
LARGE DEVIATIONS VIEW OF MAP ESTIMATION

Consider the LSL-BFE under the scaled penalties (67),

  J(b_x, b_z; T) = D(b_x ‖ Z_{x,T}^{-1} e^{-f_x/T}) + D(b_z ‖ Z_{z,T}^{-1} e^{-f_z/T}) + H( var(x|b_x), var(z|b_z) )
    = (1/T)[ E(f_x(x)|b_x) + E(f_z(z)|b_z) ] - H(b_x) - H(b_z) + H( var(x|b_x), var(z|b_z) ) + const.        (90)

Let x̂^t(T), ẑ^t(T), s^t(T), q^t(T), τ_x^t(T), etc., denote the quantities produced by ADMM-GAMP with the MMSE estimation functions under the scaled penalties, and define the limits

  x̂^t = lim_{T→0} x̂^t(T),   ẑ^t = lim_{T→0} ẑ^t(T),
  s^t = lim_{T→0} T s^t(T),   q^t = lim_{T→0} T q^t(T),
  τ_x^t = lim_{T→0} τ_x^t(T)/T,   τ_r^t = lim_{T→0} τ_r^t(T)/T,        (88)

and similarly for the remaining mean, dual, and variance quantities. We will assume that all of these limits exist. Note that some of the terms are scaled by T and others by 1/T. These normalizations are important. It is easily checked that the scalings all cancel, so that the limiting values satisfy the recursions of Algorithm 3 with the limiting estimation functions

  g_x(r, τ_r) ≜ lim_{T→0} g_x(r, τ_r(T); T) = lim_{T→0} g_x(r, τ_r T; T),        (89a)
  g_z(p, τ_p) ≜ lim_{T→0} g_z(p, τ_p(T); T) = lim_{T→0} g_z(p, τ_p T; T),        (89b)

where g_x(r, τ_r T; T) and g_z(p, τ_p T; T) are the MMSE estimation functions (51) for the scaled penalties (67). Note that we have used the scalings in (88), which show τ_r(T) ≈ τ_r T and τ_p(T) ≈ τ_p T for small T. Now, the scaled function g_x(r, τ_r T; T) is the expectation E(x|T) with respect to the density

  p(x | r, τ_r T; T) ∝ exp[ -(1/T) f_x(x) - (1/(2T)) ||x - r||^2_{τ_r} ].

Laplace's principle [50] from large deviations theory shows that (under mild conditions) this density concentrates around its maxima, and thus the expectation with respect to this density converges to the minimizer

  lim_{T→0} g_x(r, τ_r T; T) = arg min_x [ f_x(x) + (1/2)||x - r||^2_{τ_r} ],

which is exactly the MAP estimation function (62a); a similar argument applies to g_z. Likewise, the beliefs under the scaled penalties satisfy

  b_{x_j}(x_j | r_j, τ_{r_j} T) ∝ exp[ -(1/T) f_{x_j}(x_j) - (1/(2Tτ_{r_j}^t)) (x_j - r_j^t)^2 ],        (91)

from which we can prove the limits in (68).

It remains to show that the LSL-BFE in (16) with the scaled penalties (67) decomposes into the optimizations (56) and (71). As T → 0, the beliefs concentrate as

  ln b_{x_j}(x_j) ≈ -(x_j - x̂_j)^2/(2Tτ_{x_j}) + const,        (92)

where

  1/τ_{x_j} = -T ∂^2 ln b_{x_j}(x_j)/∂x_j^2,   x̂_j = arg min_{x_j} [ -ln b_{x_j}(x_j) ],        (93)

and similarly b_{z_i}(z_i) ≈ N(ẑ_i, Tτ_{z_i}). A Taylor expansion of the penalty around x̂_j then gives

  E( f_{x_j}(x) | b_{x_j} )
    ≈ ∫ Σ_{k=0}^∞ [ f_{x_j}^{(k)}(x̂_j)/k! ] (x - x̂_j)^k N(x; x̂_j, Tτ_{x_j}) dx        (94)
    = Σ_{k=0}^∞ [ f_{x_j}^{(k)}(x̂_j)/k! ] ∫ (x - x̂_j)^k N(x; x̂_j, Tτ_{x_j}) dx        (95)
    = Σ_{l=0}^∞ [ f_{x_j}^{(2l)}(x̂_j)/(2l)! ] (Tτ_{x_j})^l (2l - 1)!!        (96)
    = Σ_{l=0}^∞ [ f_{x_j}^{(2l)}(x̂_j)/(2^l l!) ] (Tτ_{x_j})^l        (97)
    = f_{x_j}(x̂_j) + (T/2) τ_{x_j} f_{x_j}''(x̂_j) + O(T^2),

while the entropies behave as

  H(b_{x_j}) ≈ (1/2) ln(2πeTτ_{x_j}),   H(b_{z_i}) ≈ (1/2) ln(2πeTτ_{z_i}).        (98), (99)

Substituting these expansions into (90), we obtain

  J(b_x, b_z; T) = (1/T) J(x̂, ẑ) + (1/2) J_2(τ_x, τ_z, x̂, ẑ) + const,        (101)

which shows that, as T → 0, minimizing the LSL-BFE decomposes into the optimizations (56) and (71).
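The Laplace-principle limit can be checked numerically for a scalar penalty: as T decreases, the posterior mean under the scaled penalty approaches the MAP/proximal value. The penalty f_x(x) = |x| below is an illustrative assumption.

```python
# Sketch of the T -> 0 limit used above: the scaled-posterior mean approaches
# the MAP/prox value argmin_x f_x(x) + (x - r)^2/(2*tau_r), for f_x(x) = |x|.
import numpy as np

f = lambda x: np.abs(x)
r, tau_r = 1.2, 0.5
grid = np.linspace(-10.0, 10.0, 200001)

def mean_at_temperature(T):
    logw = -(f(grid) + (grid - r) ** 2 / (2.0 * tau_r)) / T
    w = np.exp(logw - logw.max())          # normalized for numerical stability
    return np.sum(grid * w) / np.sum(w)

prox = np.sign(r) * max(abs(r) - tau_r, 0.0)   # exact minimizer for |.|
for T in (1.0, 0.1, 0.01):
    print(f"T={T:5.2f}  posterior mean = {mean_at_temperature(T):+.4f}")
print("T->0 limit (prox) =", prox)
```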
T
2
(c)
(d)
L(bx , bz , s, q, v; r , p )
bx
h
J(bx , bz , r , p ) + qT E(x|bx )
bx
i
1
+ kE(x|bx ) vk2r
2
i
h
J(bx , bz , r , p ) qT E(x|bx )
bx
i
h
J(bx , bz , r , p ) sT AE(x|bx )
bx "
J(bx , bz ) H var(x|bx ), z
bx
#
T
L0 (bx , bz , s) = 0.
(109)
bz
(e)
A PPENDIX C
P ROOF OF T HEOREMS 2 AND 3
We will just prove Theorem 2 since the proof of Theorem 3
is very similar. For the original constrained optimization (14),
define the Lagrangian
L0 (bx , bz , s) , J(bx , bz ) + sT (E(z|bz ) AE(x|bx )). (102)
We need to show that any fixed points (bx , bz , s) of ADMMGAMP are critical points of this Lagrangian.
First observe that, any fixed point, r from line 22 of
Algorithm 3 satisfies
1./(2r ) = 1./(2 r ) =
H(x , z )
,
x
(103)
H(x , z )
.
z
(104)
From (41b) and (41c), we see that any fixed point satisfies
E(z|bz ) = Av,
E(x|bx ) = v.
(105)
0 = A Dp E(z|bz )Av+p .s +Dr E(x|bx )v+r .q ,
(106)
where D = Diag(1./ ). Plugging (105) into the previous
expression, we obtain
q = AT s.
(107)
Together, (108) and (109) show that (bx , bz ) are critical points
of the Lagrangian L0 (bx , bz , s) for the dual parameters s. Since
these densities also satisfy the constraint E(z|bz ) = AE(x|bx ),
we conclude that (bx , bz ) are critical points of the constrained
optimization (14).
APPENDIX D
PROOF OF LEMMA 1

For the MAP estimation functions (62), we know that

  x̂_j = g_{x_j}(r_j, τ_{r_j}) = arg min_{x_j} f_{x_j}(x_j) + (x_j - r_j)^2/(2τ_{r_j}),

which implies that x_j = x̂_j is a solution to 0 = f_{x_j}'(x_j) + (x_j - r_j)/τ_{r_j}, i.e., that

  x̂_j = r_j - τ_{r_j} f_{x_j}'(x̂_j).

Taking the derivative with respect to r_j, we find

  ∂x̂_j/∂r_j = 1 - τ_{r_j} f_{x_j}''(x̂_j) ∂x̂_j/∂r_j,

which can be rearranged to form

  ∂x̂_j/∂r_j = g_{x_j}'(r_j, τ_{r_j}) = 1 / ( 1 + f_{x_j}''(x̂_j) τ_{r_j} ).        (110)

Similarly, for the MMSE estimation functions, we have τ_{x_j} = var(x_j | r_j, τ_{r_j})
and g_{x_j}'(r_j, τ_{r_j}) = τ_{x_j}/τ_{r_j}. Writing b_{x_j}(x_j | r_j, τ_{r_j}) ∝ exp(-h(x_j)) with h(x_j) ≜ f_{x_j}(x_j) + (x_j - r_j)^2/(2τ_{r_j}), the curvature bounds A + 1/τ_{r_j} ≤ h''(x_j) ≤ B + 1/τ_{r_j} yield matching upper and lower bounds on the posterior variance, so that in both the MAP and MMSE cases

  1/(1 + Bτ_{r_j}) ≤ g_{x_j}'(r_j, τ_{r_j}) ≤ 1/(1 + Aτ_{r_j}),

which proves (73).

APPENDIX E
PROOF OF THEOREM 4

We find it easier to analyze the algorithm after the variables are combined and scaled as

  τ ≜ [τ_r; τ_p],   D ≜ Diag(1./τ),        (115)

and

  w ≜ D^{1/2} [x; z],   u ≜ D^{-1/2} [q; s],   B ≜ D^{1/2} [I; A].        (116)

Also, we define

  g(w, τ) ≜ [ g_x(x, τ_r); g_z(z, τ_p) ],        (117)

and the projection matrices

  P ≜ B (B^T B)^{-1} B^T,        (119)
  P⊥ ≜ I - P.        (120)

With these definitions, the mean updates of Algorithm 3 can be written as

  w^{t+1} = g̃( Pw^t - P⊥u^t ),        (121)

where

  g̃(w) ≜ D^{1/2} g( D^{-1/2} w ),        (122)

and

  P⊥u^{t+1} = P⊥u^t + P⊥w^{t+1} = P⊥u^t + P⊥ g̃( Pw^t - P⊥u^t ).        (123)

Defining the state

  θ^t ≜ [ Pw^t; P⊥u^t ],        (124)

and using P^2 = P and (P⊥)^2 = P⊥ so that [P, -P⊥] θ^t = Pw^t - P⊥u^t, we have from (121) and (123), respectively,

  Pw^{t+1} = P g̃( [P, -P⊥] θ^t ),        (125)
  P⊥u^{t+1} = P⊥u^t + P⊥ g̃( [P, -P⊥] θ^t ).        (126)

From (124), (125), and (126), we see that the mean update steps in Algorithm 3 are characterized by the recursive system

  θ^{t+1} = f(θ^t)        (127)

for

  f(θ) ≜ [P; P⊥] g̃( [P, -P⊥] θ ) + [0, 0; 0, P⊥] θ,        (128)

whose Jacobian is

  f'(θ) = [P; P⊥] g̃'(w) [P, -P⊥] + [0, 0; 0, P⊥].        (130)
APPENDIX F
PROOF OF THEOREM 5

Define

  τ_s ≜ (1 - τ_z./τ_p) ./ τ_p,        (139)

and restrict the linearization terms to a compact set of the form

  Γ ≜ { (τ_r, τ_p) : τ_r ∈ [a_r, b_r], τ_p ∈ [a_p, b_p] }.        (140)

The second derivative of the linearized LSL-BFE (31) with respect to b_x involves terms of the form

  (1./(2τ_r))^T var(x|b_x),        (141)

and, under the strong convexity and smoothness bound (74), each such term is bounded below by a positive constant that depends only on A, B, and the endpoints of Γ. We conclude that there exists an ε > 0 such that

  J''(b) ⪰ ε I

at any minimum b = b̂ of the linearized LSL-BFE when (τ_r, τ_p) ∈ Γ. This proves Assumption 1(b). The uniform boundedness of all the other derivatives follows from the fact that all the terms are twice differentiable and the set Γ is compact. Thus, all the conditions of Assumption 1 hold, and the theorem follows from Theorem 1.
APPENDIX G
PROOF OF THEOREM 6

We begin by proving part (a), using induction. Suppose that (77) is satisfied for some t. Since q^0, x^0, and v^0 are fixed points, we have from line 10 of Algorithm 3 that x^0 = v^0. Then, since x^0 is a fixed point, we have from lines 7 and 9 and equation (62) that

  x^0 = g_x(r^0, τ_r^0) = arg min_x f_x(x) + (1/2)||x - v^0 + τ_r^0.q^0||^2_{τ_r^0},        (144)

whose optimality condition at x = x^0 (using v^0 = x^0) reads

  0 = f_x'(x^0) + Diag(1./τ_r^0)(x^0 - v^0) + q^0.

It follows that the mean quantities remain at their fixed-point values for all t, which proves part (a).

Turning to the variance updates, (110) gives

  τ_{x_j}^{t+1} = τ_{r_j}^t / ( 1 + f_{x_j}''(x_j^{t+1}) τ_{r_j}^t ).

Rewriting this in vector form and using the updates in Algorithm 3 with θ^t = 1, we obtain

  1./τ_x^{t+1} = 1./τ_r^t + f_x''(x^{t+1}) = S^T s̄^t + f_x''(x^{t+1}) = S^T s̄^t + η_x,        (147)

where η_x ≜ f_x''(x) is positive due to the convexity assumption and invariant to t due to part (a). Similarly, for the output estimation function g_z,

  τ_z^{t+1} = τ_p^t . g_z'(p^t, τ_p^t) = τ_p^t ./ ( 1 + f_z''(z^{t+1}).τ_p^t ).        (148)

Therefore, from the modified update of s̄^{t+1} in (76),

  s̄^{t+1} = f_z''(z^{t+1}) ./ ( 1 + f_z''(z^{t+1}).τ_p^t ),

or equivalently,

  1./s̄^{t+1} = τ_p^t + 1./f_z''(z^{t+1}) = S τ_x^t + η_z,   η_z ≜ 1./f_z''(z^{t+1}).        (149)

Hence, if we define

  Φ_x(s̄) := 1./( S^T s̄ + η_x ),        (150)

then τ_x^{t+1} = Φ_x(s̄^t), and combining (149) and (150) shows that the variance sequence evolves as τ_x^{t+1} = Φ(τ_x^t) with Φ(τ_x) ≜ 1./( S^T [ 1./(S τ_x + η_z) ] + η_x ). One can verify that this map satisfies

  (i) Φ(τ_x) > 0,
  (ii) τ_x ≤ τ_x' implies Φ(τ_x) ≤ Φ(τ_x'), and
  (iii) for all α > 1, Φ(ατ_x) < α Φ(τ_x).
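The variance recursion (147)-(150) can be simulated directly; the sketch below iterates the composed map with random positive data (illustrative only) and reports the fixed-point residual.

```python
# Sketch of the variance fixed point in the proof of Theorem 6:
# 1./tau_x = S^T s_bar + eta_x and 1./s_bar = S tau_x + eta_z, cf. (147)-(150).
import numpy as np

rng = np.random.default_rng(6)
m, n = 20, 40
S = rng.random((m, n)) / m          # nonnegative, plays the role of A.A
eta_x = rng.random(n) + 0.5         # stands in for f_x''(x), positive
eta_z = rng.random(m) + 0.5         # stands in for 1./f_z''(z), positive

tau_x = np.ones(n)
for _ in range(200):
    s_bar = 1.0 / (S @ tau_x + eta_z)
    tau_x = 1.0 / (S.T @ s_bar + eta_x)          # Phi_x(s_bar), cf. (150)

s_bar = 1.0 / (S @ tau_x + eta_z)
residual = np.max(np.abs(1.0 / (S.T @ s_bar + eta_x) - tau_x))
print("fixed-point residual:", residual, "  tau_x[:3] =", tau_x[:3])
```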
APPENDIX H
ORIGINAL GAMP VIA STALE, LINEARIZED ADMM

We now show that the recursion (79) yields the mean updates of the original sum-product GAMP algorithm. Substituting (78) and (31) into (79a) and canceling the terms that do not depend on b_x, the b_x update reduces to

  b_x^{t+1} = arg min_{b_x} D(b_x ‖ exp(-f_x)) + E( (1/2)||x - r^t||^2_{τ_r} | b_x ),   with   r^t ≜ x^t + Diag(τ_r) A^T s^t,

so that

  b_x^{t+1}(x) ∝ exp( -f_x(x) - (1/2)||x - r^t||^2_{τ_r} )   and   E(x|b_x^{t+1}) = g_x(r^t, τ_r).

Similarly, substituting (78) into (79b) and canceling the b_z-invariant terms gives

  b_z^{t+1} = arg min_{b_z} D(b_z ‖ Z_z^{-1} exp(-f_z)) + E( (1/2)||z - p^{t+1}||^2_{τ_p} | b_z ),   with   p^{t+1} ≜ A x^{t+1} - τ_p.s^t,

so that

  b_z^{t+1}(z) ∝ exp( -f_z(z) - (1/2)||z - p^{t+1}||^2_{τ_p} )   and   E(z|b_z^{t+1}) = g_z(p^{t+1}, τ_p).

Finally, combining the dual update (79c) with the definition of p^{t+1} gives

  s^{t+1} = Diag(1./τ_p) [ E(z|b_z^{t+1}) - p^{t+1} ],

which, together with the mean updates above, are exactly the mean updates of the original sum-product GAMP algorithm (the non-indented lines in Algorithm 4).