Notes On Computational-To-Statistical Gaps: Predictions Using Statistical Physics
In memory of Amelia Perry, and her love for learning and teaching.
1. Introduction
Statistics has long studied how to recover information from data. Theoretical
statistics is concerned, in part, with understanding under which circumstances such
recovery is possible. Oftentimes recovery procedures amount to computational
tasks to be performed on the data that may be computationally expensive, and so
prohibitive for large datasets. While computer science, and in particular complexity
theory, has focused on studying hardness of computational problems on worst-case
instances, time and time again it is observed that computational tasks on data can
often be solved far faster than the worst-case complexity would suggest. This is
not shocking; it is simply a manifestation of the fact that instances arising from
real-world data are not adversarial. This illustrates, however, an important gap
in fundamental knowledge: the understanding of “computational hardness of
statistical estimation problems”.
For concreteness we will focus on the case where we want to learn a set of param-
eters from samples of a distribution, or estimate a signal from noisy measurements
(often two interpretations of the same problem). In the problems we will consider,
there is a natural notion of signal-to-noise ratio (SNR) which can be related to the
variance of the distribution of samples, the strength of the noise, the number of
samples or measurements obtained, the size of a hidden planted structure buried
in noise, etc. Two “phase transitions” are often studied. Theoretical statistics
and information theory often study the critical SNR below which it is statistically
impossible to estimate the parameters (or recover the signal, or find the hidden
structure); we call this threshold SNR_Stat. On the other hand, many algorithmic
fields propose and analyze efficient algorithms to understand for which SNR levels
different algorithms work. Despite significant effort to develop ever better
algorithms, there are various problems for which no efficient algorithm is known
to achieve recovery close to the statistical threshold SNR_Stat. Thus we are
interested in the critical threshold SNR_Comp ≥ SNR_Stat below which it is
fundamentally impossible for an efficient (polynomial-time) algorithm to recover
the information of interest.
The replica method. The replica method probes the geometry of the posterior distribution:
we are interested in whether the posterior distribution resembles one big connected
region or whether it fractures into disconnected clusters (indicating computational
hardness). We will cover the replica method in Section 5 of these notes.
Complexity of a random objective function. Another method for investigating com-
putational hardness is through the lens of non-convex optimization. Intuitively, we
expect that “easy” optimization problems have no “bad” local minima and so an
algorithm such as gradient descent can find the global minimum (or at least a point
whose objective value is close to the global optimum). For Bayesian inference prob-
lems, maximum likelihood estimation amounts to minimizing a particular random
non-convex function. One tool to study critical points of random functions is the
Kac-Rice formula (see [AT07] for an introduction). This has been used to study
optimization landscapes in settings such as spin glasses [ABAČ13], tensor decom-
position [GM17], and problems arising in community detection [BBV16]. There
are also other methods to show that there are no spurious local minima in certain
settings, e.g. [GJZ17, BVB16, LV18].
There is a key difference between the two models above. The Rademacher spiked
Wigner model is dense in the sense that we are given an observation for every pair
of variables. On the other hand, the stochastic block model is sparse (at least in the
regime we have chosen) because essentially all the useful information comes from
the observed edges, which form a sparse graph. We will see that different tools are
needed for dense and sparse problems.
In this language, maximum likelihood estimation asks us to minimize the energy H(σ)
over states σ, or equivalently to sample from (or otherwise describe) the
low-temperature Gibbs distribution in the limit β → ∞.
This viewpoint is limited, in that the MLE frequently lacks any a priori guaran-
tee of optimality. On the other hand, the Gibbs distribution at the true temperature
β = λ enjoys optimality guarantees at a high level of generality:
Claim 2.3. Suppose we are given some observation Y leading to a posterior distribution
on σ. For any estimator $\hat{\sigma} = \hat{\sigma}(Y)$, define the (expected) mean squared
error (MSE) $\mathbb{E}\|\hat{\sigma} - \sigma\|_2^2$. The estimator that minimizes the expected
MSE is given by $\hat{\sigma} = \mathbb{E}[\sigma \mid Y]$, the posterior expectation (and thus
the expectation under the Gibbs distribution at the Bayes-optimal temperature).
Remark 2.4. In the case of the Rademacher spiked Wigner model, there is a
caveat here: since $\sigma^*$ and $-\sigma^*$ are indistinguishable, the posterior expectation is
zero. Our objective is not to minimize the MSE but to minimize the error between
σ and either $\hat{\sigma}$ or $-\hat{\sigma}$ (whichever is better).
Thus the optimization approach of maximum likelihood estimation aims for too
low a temperature. Intuitively, MLE searches for the single state with the highest
individual likelihood, whereas the optimal Bayesian approach looks for a large
cluster of closely-related states with a high aggregate likelihood.
Fortunately, the true Gibbs distribution has an optimization property of its own:
Claim 2.5. The Gibbs distribution with Hamiltonian H and temperature T > 0 is
the unique distribution minimizing the (Helmholtz) free energy
$$F = \mathbb{E} H - T S,$$
where the expectation is taken under the distribution in question and S denotes its entropy.
Proof. Entropy is concave with infinite derivative at the edge of the probability
simplex, and the expected Hamiltonian is linear in the distribution, so the free
energy is convex and minimized in the interior of the simplex. We find the unique
local (hence global) minimum with a Lagrange multiplier:
$$\mathrm{const} \cdot \mathbf{1} = \nabla F = \nabla \sum_{\sigma} p(\sigma)\big( H(\sigma) + T \log p(\sigma) \big),$$
$$\mathrm{const} = H(\sigma) + T \log p(\sigma),$$
$$\mathrm{const} \cdot e^{-H(\sigma)/T} = p(\sigma). \qquad \square$$
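As a quick sanity check on Claim 2.5, the following minimal numerical sketch (ours, not from the original notes; the ten-state Hamiltonian and the temperature T = 0.7 are arbitrary choices) compares the free energy of the Gibbs distribution against a large number of random points of the probability simplex:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: 10 states, an arbitrary Hamiltonian, temperature T = 0.7.
H = rng.normal(size=10)
T = 0.7

def free_energy(p):
    """F = E_p[H] - T * S(p), with the convention 0 * log 0 = 0."""
    S = -np.sum(p * np.log(np.maximum(p, 1e-300)))
    return p @ H - T * S

# Gibbs distribution at temperature T: p(sigma) proportional to exp(-H(sigma)/T).
gibbs = np.exp(-H / T)
gibbs /= gibbs.sum()

# The Gibbs distribution should beat every random point of the simplex.
competitors = rng.dirichlet(np.ones(len(H)), size=100000)
print("Gibbs F:      ", free_energy(gibbs))
print("best random F:", min(map(free_energy, competitors)))
```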
This optimization approach is willing to trade off some energy for an increase in
entropy, and can thus detect large clusters of states with a high aggregate likelihood,
even when no individual state has the highest possible likelihood. Moreover, while
the free energy is convex, it is a function of an arbitrary probability distribution
on the state space, which is typically an exponentially large object.
We are thus led to ask the question: is there any way to reduce the problem of
free energy minimization to a tractable, polynomial-size problem? Can we get a
theoretical or algorithmic handle on this problem?
Each message $m_{u\to v}$ is a probability distribution (over the possible values for $\sigma_u$),
with the proportionality constant determined by the requirement that the probabilities
sum to 1 over all values of $\sigma_u$.
This is almost a full description of belief propagation, except for one detail. If
the belief from vertex v at time t−2 influences the belief of neighbor u at time t−1,
then neighbor u should not parrot that influence back to neighbor v, reinforcing its
belief at time t without any new evidence. Thus we ensure that the propagation of
messages does not immediately backtrack:
(1)  $$m^{(t)}_{v\to w}(\sigma_v) \;\propto\; \prod_{\substack{u\sim v \\ u\neq w}} \sum_{\sigma_u} \psi_{uv}(\sigma_u, \sigma_v)\, m^{(t-1)}_{u\to v}(\sigma_u).$$
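To make the update (1) concrete, here is a minimal sketch of one synchronous sweep of non-backtracking BP on a generic pairwise model (our illustration, not code from the notes; the data layout, with dictionaries `messages` and `psi` indexed by directed edges, is a hypothetical choice):

```python
import numpy as np

def bp_sweep(messages, psi, q):
    """One synchronous sweep of the non-backtracking updates (1).

    messages: dict mapping directed edges (u, v) to length-q distributions m_{u->v}
    psi:      dict mapping (u, v) to a q x q matrix with entries psi[(u,v)][s_u, s_v]
    q:        alphabet size (q = 2 for +/- spins)
    """
    new = {}
    for (v, w) in messages:
        m = np.ones(q)
        for (u, vv) in messages:
            if vv == v and u != w:            # neighbors u ~ v, excluding u = w
                # sum over sigma_u of psi(sigma_u, sigma_v) * m_{u->v}(sigma_u)
                m *= psi[(u, v)].T @ messages[(u, v)]
        new[(v, w)] = m / m.sum()             # normalize to a distribution
    return new
```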
In sparse random graphs, the local neighborhoods of most vertices are trees, with most
loops being long, so that independence might approximately hold. BP certainly fails
in the worst case; outside of special cases such as trees, it is only suitable in an
average-case setting. However, on many families of random graphical models, belief
propagation is a remarkably strong approach; it is general, efficient, and often yields
a state-of-the-art statistical estimate. It is conjectured in many models that belief
propagation achieves asymptotically optimal inference, either among all estimators or
among all polynomial-time estimators, but most rigorous results in this direction are
yet to be established.
To connect to the previous viewpoint of free energy minimization: belief prop-
agation is intimately connected with the Bethe free energy, a heuristic proxy
for the free energy which may be described in terms of the messages mu→v (see
[ZK16], Section III.B). It can be shown that the fixed points of BP are precisely
the critical points of the Bethe free energy, justifying the view that BP is roughly a
minimization procedure for the free energy. Again, rigorously the situation is much
worse: the Bethe free energy is non-convex, and BP is not guaranteed to converge,
let alone guaranteed to find the global minimum.
3.2. The cavity method for the stochastic block model. The ideas of belief
propagation above appear as the cavity method in statistical physics, owing to
the belief that the Bethe free energy is essentially an accurate model
for the true (Helmholtz) free energy on a variety of models of interest. In passing
to the Bethe free energy, we can pass from studying a general distribution (an
exponentially complicated object) to studying node and edge marginals, which are
theoretically much simpler objects and, crucially, can be studied locally on the
graph. Local neighborhoods of sparse graphs as in the SBM (stochastic block
model) look like trees, and so we are drawn to studying message passing on a tree.
Much as for the Rademacher spiked Wigner model above, we derive a Hamiltonian
from the block model posterior:
$$H(\sigma) = \theta_+ \sum_{i\sim j} \sigma_i \sigma_j + \theta_- \sum_{i\not\sim j} \sigma_i \sigma_j,$$
where $i \sim j$ denotes adjacency in the observed graph, and $\theta_+ > 0 > \theta_-$ are
constants depending on a and b; $\theta_+$ is of constant order, while $\theta_-$ is of order 1/n.
In expressing belief propagation, we will make a small notational simplification:
instead of passing messages m that are distributions over {+, −}, it suffices to pass
the expectation m(+) − m(−). The reader can verify that rewriting the belief
propagation equations in this notation yields
$$m^{(t)}_{u\to v} = \tanh\Bigg( \sum_{\substack{w\sim u \\ w\neq v}} \operatorname{atanh}\big(\theta_+\, m^{(t-1)}_{w\to u}\big) + \sum_{\substack{w\not\sim u \\ w\neq v}} \operatorname{atanh}\big(\theta_-\, m^{(t-1)}_{w\to u}\big) \Bigg),$$
where tanh is the hyperbolic tangent function $\tanh(z) = (e^z - e^{-z})/(e^z + e^{-z})$, and
atanh is its inverse.
The first term inside the tanh represents strong, constant-order attractions with
the few graph neighbors, while the second term represents very weak, low-order
repulsions with the multitude of non-neighbors. The value of the second term thus
depends very little on any individual spin, but rather on the overall balance of
positive and negative spins in the graph, with the tendency to cause the global
balance of + and − spins to remain roughly even.
As this message-passing only involves the graph edges, it now makes sense to
study this on a tree-like neighborhood. We now discuss a generative model for
(approximate) local neighborhoods under the stochastic block model.
Model 3.1 (Galton–Watson tree). Begin with a root vertex, with spin + or −
chosen uniformly. Recursively, each vertex gives birth to a Poisson number of child
nodes: Pois((1 − ε)k) vertices of the same spin and Pois(εk) vertices of opposite
spin, up to a total tree depth of d.
As shown in [MNS12], the Galton–Watson tree with k = (a + b)/2 and ε =
b/(a + b) is distributionally very close to the radius-d neighborhood of a vertex in
the SBM with its true spins, so long as d = o(log n). Thus we will study belief
propagation on a random Galton–Watson tree.
Let us consider only the BP messages passing toward the root of the tree. The
upward message from any vertex v is computed as:
(2)  $$m_v = \tanh\Big( \sum_{u} \operatorname{atanh}\big( (1-2\varepsilon)\, m_u \big) \Big)$$
where u ranges over the children of v. We now imagine that the child messages
$m_u$ are independently drawn from some distribution $D_+^{(t-1)}$ for children with spin
+, and (leveraging symmetry) from the distribution $D_-^{(t-1)} = -D_+^{(t-1)}$ for children
with spin −; this distribution represents the randomness of our BP calculation below
each child, over the random subtree hanging off each one. Then, from equation (2),
together with the fact that there are Pois((1 − ε)k) same-spin children and Pois(εk)
opposite-spin children, the distribution $D_\pm^{(t)}$ of the parent message $m_v$ is determined!
Thus we obtain a distributional recurrence for $D_+^{(t)}$.
The calculation above is independent of n, and the radius of validity of the tree
approximation grows with n, so we are interested in the behavior of the recurrence
above as t → ∞, i.e. fixed points of the distributional recurrence above and their
stability.
Typically one initializes BP with small random messages, a perturbation of the
trivial all-0 fixed point that represents our prior. For small messages, we can
linearize tanh and atanh, and write $m_v \approx (1 - 2\varepsilon) \sum_u m_u$. Then if the child
distribution $D_+^{(t-1)}$ has mean µ and variance σ², it is easily computed that the
parent distribution $D_+^{(t)}$ has mean $k(1-2\varepsilon)^2 \mu$ and variance $k(1-2\varepsilon)^2 \sigma^2$. Thus
if $k(1-2\varepsilon)^2 < 1$, then perturbations of the all-0 fixed point decay, or in other
words, this fixed point is stable, and BP is totally uninformative on this typical
initialization. If $k(1-2\varepsilon)^2 > 1$, then small perturbations do become magnified
under BP dynamics, and one imagines that BP might find a more informative fixed
point (though this remains an open question!).
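One can probe this stability numerically by population dynamics: maintain a large sample representing $D_+^{(t)}$ and push it through the recurrence. A minimal sketch (our own; the population size, iteration count, and slightly biased initialization are ad hoc choices) shows the mean message dying out below the threshold $k(1-2\varepsilon)^2 = 1$ and growing above it:

```python
import numpy as np

rng = np.random.default_rng(1)

def population_dynamics(k, eps, pop_size=5000, iters=15):
    """Simulate the distributional recurrence for D_+ via equation (2)."""
    # Small perturbation of the all-0 fixed point, slightly biased toward truth.
    pop = rng.normal(loc=0.01, scale=0.01, size=pop_size)
    for _ in range(iters):
        new = np.empty(pop_size)
        for i in range(pop_size):
            same = rng.choice(pop, rng.poisson((1 - eps) * k))  # spin-+ children
            diff = -rng.choice(pop, rng.poisson(eps * k))       # D_- = -D_+
            kids = np.concatenate([same, diff])
            new[i] = np.tanh(np.sum(np.arctanh((1 - 2 * eps) * kids)))
        pop = new
    return pop.mean()

# Below the Kesten-Stigum threshold k(1 - 2*eps)^2 = 1 the mean dies out;
# above it, the perturbation is magnified toward an informative fixed point.
for k, eps in [(3, 0.4), (3, 0.1)]:
    print(k, eps, k * (1 - 2 * eps) ** 2, population_dynamics(k, eps))
```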
We next exploit the weakness of individual interactions. Note that the values $m^{(t-1)}_{w\to u}$
lie in [−1, 1], while $Y_{wu}$ is of order $n^{-1/2}$ in probability. Taylor-expanding atanh,
we simplify:
$$m^{(t)}_{u\to v} = \tanh\Big( \sum_{w\neq v} \lambda Y_{wu}\, m^{(t-1)}_{w\to u} + O(n^{-1/2}) \Big) \quad \text{w.h.p.}$$
We next simplify the non-backtracking nature of BP. Naïvely, one might expect
that we can simply drop the condition $w \neq v$ from the sum above, as the contribution
from vertex v in the above sum should be only of size $n^{-1/2}$. As our formula
for $m^{(t)}_{u\to v}$ would then no longer depend on v, we could write down messages indexed
by a single vertex:
$$m^{(t)}_{u} = \tanh\Big( \sum_{w} \lambda Y_{wu}\, m^{(t-1)}_{w} \Big),$$
or in vector notation,
(3)  $$m^{(t)} = \tanh\big( \lambda Y m^{(t-1)} \big),$$
where tanh applies entrywise. This resembles the “power iteration” iterative algorithm
to compute the leading eigenvector of Y:
$$m^{(t)} = Y m^{(t-1)},$$
but with $\tanh(\lambda\,\cdot\,)$ providing some form of soft projection onto the interval [−1, 1],
exploiting the entrywise ±1 structure.
Unfortunately, the non-backtracking simplification above is flawed, and equation
(3) does not accurately summarize BP or provide as strong an estimator. The problem
is that the terms we have neglected add up constructively over two iterations.
Specifically: consider that vertex v exerts an influence $\lambda Y_{vu} m^{(t-2)}_v$ on each neighbor
u; this small perturbation translates directly to a perturbation of $m^{(t-1)}_u$ (scaled
by a derivative of tanh). At the next iteration, vertex u influences $m^{(t)}_v$ according
to $\lambda Y_{vu} m^{(t-1)}_u$; the total contribution from backtracking here is thus $\lambda^2 Y_{vu}^2 m^{(t-2)}_v$,
scaled through some derivatives of tanh. This influence is a random, positive, order-1/n
multiple of $m^{(t-2)}_v$. Summing over all neighbors u, we realize that the aggregate
contribution of backtracking over two steps is in fact of order 1.
Thankfully, this contribution is also a sum of small random variables, and ex-
hibits concentration of measure. The solution is thus to subtract off this aggregate
backtracking term in expectation, adding a correction called the Onsager reaction
term:
(4)  $$m^{(t)} = \tanh\Big( \lambda Y m^{(t-1)} - \lambda^2 \big( 1 - \|m^{(t-1)}\|_2^2 / n \big)\, m^{(t-2)} \Big).$$
Recall the Rademacher spiked Wigner model $Y = \frac{\lambda}{n} x x^\top + \frac{1}{\sqrt{n}} W$,
where $x \in \{\pm 1\}^n$ is the true signal (drawn uniformly at random) and the n × n
noise matrix W is symmetric with the upper triangle drawn i.i.d. as $\mathcal{N}(0,1)$. In
this setting, the AMP algorithm and its analysis are due to [DAM16].
We have seen above that the AMP algorithm for this problem takes the form
$$v^{t+1} = Y f(v^t) + [\text{Onsager}]$$
where f(v) denotes entrywise application of the function $f(v) = \tanh(\lambda v)$. (Here
we abuse notation and let f refer to both the scalar function and its entrywise
application to a vector.) The superscript t indexes timesteps of the algorithm (and
is not to be confused with an exponent). The details of the Onsager term, discussed
previously, will not be important here.
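For concreteness, here is a short numpy sketch of this iteration (our illustration, not the authors' code; we take the Onsager coefficient to be the standard choice $b_t = \frac{1}{n}\sum_i f'(v_i^t)$, consistent with the form (4) up to the details glossed over here):

```python
import numpy as np

rng = np.random.default_rng(0)

def amp(Y, lam, n_iter=50):
    """AMP for Y = (lam/n) x x^T + W/sqrt(n), with f(v) = tanh(lam * v)."""
    n = Y.shape[0]
    v = rng.normal(scale=0.01, size=n)   # small random initialization
    f_prev = np.zeros(n)
    for _ in range(n_iter):
        f = np.tanh(lam * v)
        b = lam * np.mean(1.0 - f ** 2)  # Onsager coefficient: mean of f'(v_i)
        v, f_prev = Y @ f - b * f_prev, f
    return np.sign(v)

# Synthetic instance above the transition (lam > 1).
n, lam = 2000, 1.5
x = rng.choice([-1.0, 1.0], size=n)
W = rng.normal(size=(n, n))
W = (W + W.T) / np.sqrt(2)               # symmetric, N(0,1) off-diagonal entries
Y = (lam / n) * np.outer(x, x) + W / np.sqrt(n)
print("overlap:", abs(np.dot(amp(Y, lam), x)) / n)  # near 0 iff lam <= 1
```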
The state evolution heuristic proceeds as follows. Postulate that at timestep t,
AMP's iterate $v^t$ is distributed as
(5)  $$v^t = \mu_t x + \sigma_t g \quad\text{where } g \sim \mathcal{N}(0, I).$$
This breaks down $v^t$ into a signal term (recall x is the true signal) and a noise
term, whose sizes are determined by parameters $\mu_t \in \mathbb{R}$ and $\sigma_t \in \mathbb{R}_{\geq 0}$. The idea
of state evolution is to write down a recurrence for how the parameters µt and σt
evolve from one timestep to the next. In performing this calculation we will make
two simplifying assumptions that will be justified later: (1) we drop the Onsager
term, and (2) we assume the noise W is independent at each timestep (i.e. there
is no correlation between W and the noise g in the current iterate). Under these
assumptions we have
$$v^{t+1} = Y f(v^t) = \frac{\lambda}{n} x x^\top f(v^t) + \frac{1}{\sqrt{n}} W f(v^t) = \frac{\lambda}{n} \langle x, f(v^t)\rangle\, x + \frac{1}{\sqrt{n}} W f(v^t),$$
which takes the form of (5) with a signal term and a noise term. We therefore have
$$\mu_{t+1} = \frac{\lambda}{n} \langle x, f(v^t)\rangle = \frac{\lambda}{n} \langle x, f(\mu_t x + \sigma_t g)\rangle \approx \lambda \mathop{\mathbb{E}}_{X,G}\big[ X f(\mu_t X + \sigma_t G) \big] \quad\text{with scalars } X \sim \mathrm{Unif}\{\pm 1\},\ G \sim \mathcal{N}(0,1)$$
$$= \lambda\, \mathbb{E}_G\big[ f(\mu_t + \sigma_t G) \big] \quad\text{since } f(-v) = -f(v).$$
For the noise term, think of $f(v^t)$ as fixed and consider the randomness over W.
Each entry of the noise term $\frac{1}{\sqrt{n}} W f(v^t)$ has mean zero and variance
$$(\sigma_{t+1})^2 = \frac{1}{n} \sum_i f(v_i^t)^2 = \frac{1}{n} \sum_i f(\mu_t x_i + \sigma_t g_i)^2 \approx \mathop{\mathbb{E}}_{X,G}\big[ f(\mu_t X + \sigma_t G)^2 \big] \quad\text{with scalars } X, G \text{ as above.}$$
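The resulting recurrence for $(\mu_t, \sigma_t)$ is a simple two-dimensional scalar iteration; a minimal sketch (ours), evaluating the Gaussian expectations by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=200000)   # Monte Carlo samples for the Gaussian expectation

def state_evolution(lam, mu0=0.01, sigma0=1.0, n_iter=100):
    """Iterate mu_{t+1} = lam * E[f(mu_t + sigma_t G)] and
    (sigma_{t+1})^2 = E[f(mu_t + sigma_t G)^2], with f(v) = tanh(lam * v)."""
    mu, sig2 = mu0, sigma0 ** 2
    for _ in range(n_iter):
        f = np.tanh(lam * (mu + np.sqrt(sig2) * G))
        mu, sig2 = lam * f.mean(), (f ** 2).mean()
    return mu, np.sqrt(sig2)

for lam in [0.8, 1.5]:
    print(lam, state_evolution(lam))  # mu -> 0 for lam <= 1; mu > 0 for lam > 1
```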
4.3. Free energy diagrams. In this section we will finally see how to predict
computational-to-statistical gaps (for dense problems)! Above we have seen how
to analyze a particular algorithm: AMP. In various settings it has been shown
that AMP is information-theoretically optimal. More generally, it is believed that
AMP is optimal among all efficient algorithms (for a wide class of problems). We
will now show how to use AMP to predict whether a problem should be easy,
(computationally) hard or (statistically) impossible. The ideas here originate from
[LKZ15a, LKZ15b].
Recall that the state of AMP is described by a parameter γ, where larger γ
indicates better correlation with the truth and γ = 0 means that AMP achieves
zero correlation with the truth. Also recall that the Bethe free energy is the quantity
that belief propagation (or AMP) is locally trying to minimize. It is possible to
analytically write down the function f (γ) which gives the (Bethe) free energy of
the AMP state corresponding to γ; in the next section, we will see one way to
compute f (γ). AMP can be seen as starting near γ = 0 and naively moving in
the direction of lowest free energy until it reaches a local minimum; the γ value at
this minimum characterizes AMP’s output. The information-theoretically optimal
estimator is instead described by the global minimum of the free energy (and this
has been proven rigorously in various cases [BDM+ 16, LM16]); this corresponds to
the inefficient algorithm that uses exhaustive search to find the AMP state which
globally minimizes free energy. Figure 1 illustrates how the free energy landscape
f (γ) dictates whether the problem is easy, hard, or impossible at a particular λ
value.
[Figure 1. Four plots of the free energy f as a function of γ, one for each of the phases (a)–(d) discussed below; in each panel the locations of the local and global minima of f(γ) determine whether the problem is easy, hard, or impossible.]
For Rademacher spiked Wigner, we have phase (a) (from Figure 1) when λ ≤ 1
and phase (d) when λ > 1, so there is no computational-to-statistical gap. However,
for some variants of the problem (for instance if the signal x is sparse, i.e. only a
small constant fraction of entries are nonzero) then we see phases (a),(b),(c),(d)
appear in that order as λ increases; in particular, there is a computational-to-
statistical gap during the hard phase (c).
Although many parts of this picture have been made rigorous in certain cases,
the one piece that we do not have the tools to prove is that no efficient algorithm
can succeed during the hard phase (c). This is merely conjectured based on the
belief that AMP should be optimal among efficient algorithms.
There are a few different ways to compute the free energy landscape f (γ). One
method is to use the replica method discussed in the next section. Alternatively,
there is a direct formula for Bethe free energy in terms of the BP messages, which
can be adapted to AMP (see e.g. [LKZ15a]).
(This can be shown to coincide with the notion of free energy introduced earlier.)
The idea of the replica method is to compute the moments $\mathbb{E}[Z^r]$ of Z for $r \in \mathbb{N}$
and perform the (non-rigorous) analytic continuation
(8)  $$\mathbb{E}[\log Z] = \lim_{r\to 0} \frac{1}{r} \log \mathbb{E}[Z^r].$$
Note that this is quite bizarre: we at first assume r is a positive integer, but then
take the limit as r tends to zero! This will require writing $\mathbb{E}[Z^r]$ in an analytic form
that is defined for all values of r. An informal justification for the correctness of
(8) is that when r is close to 0, $Z^r$ is close to 1, and so we can interchange log and
E on the right-hand side.
The moment $\mathbb{E}[Z^r]$ can be expanded in terms of r ‘replicas’ $\sigma^1, \ldots, \sigma^r$ with
$\sigma^a \in \{\pm 1\}^n$:
$$\mathbb{E}[Z^r] = \mathbb{E} \sum_{\{\sigma^a\}} \exp\Big( \beta \sum_{i<j} Y_{ij} \sum_{a=1}^{r} \sigma_i^a \sigma_j^a \Big).$$
Evaluating the expectation over Y (recall β = λ) then yields, up to lower-order factors,
$$\mathbb{E}[Z^r] = \sum_{\{\sigma^a\}} \exp\Bigg( n \Big( \frac{\lambda^2}{2} \sum_a c_a^2 + \frac{\lambda^2}{4} \sum_{a,b} q_{ab}^2 \Big) \Bigg),$$
where $q_{ab} = \frac{1}{n} \sum_i \sigma_i^a \sigma_i^b$ is the correlation between replicas a and b, and
$c_a = \frac{1}{n} \sum_i \sigma_i^a x_i$ is the correlation between replica a and the truth.
Without loss of generality we can assume (by symmetry) the true spike is x = 1
(all-ones). Let Q be the (r + 1) × (r + 1) matrix of overlaps ($q_{ab}$ and $c_a$), including
x as the zeroth replica. Note that Q is the average of n i.i.d. matrices and so
by the theory of large deviations (Cramér's Theorem in multiple dimensions), the
number of configurations $\{\sigma^a\}$ corresponding to given overlap parameters $q_{ab}, c_a$ is
asymptotically
(9)  $$\inf_{\mu,\nu} \exp\Bigg( n \Bigg( -\sum_a \nu_a c_a - \frac{1}{2} \sum_{a\neq b} \mu_{ab} q_{ab} + \log \sum_{\sigma \in \{\pm 1\}^r} \exp\Big( \sum_a \nu_a \sigma_a + \frac{1}{2} \sum_{a\neq b} \mu_{ab} \sigma_a \sigma_b \Big) \Bigg) \Bigg).$$
We now apply the saddle point method: in the large n limit, the expression for
$\mathbb{E}[Z^r]$ should be dominated by a single value of the overlap parameters $q_{ab}, c_a$. This
yields
$$\frac{1}{n} \log \mathbb{E}[Z^r] = -G(q_{ab}^*, c_a^*, \mu_{ab}^*, \nu_a^*)$$
where $(q_{ab}^*, c_a^*, \mu_{ab}^*, \nu_a^*)$ is a critical point of
$$G(q_{ab}, c_a, \mu_{ab}, \nu_a) = -\frac{\lambda^2}{2} \sum_a c_a^2 - \frac{\lambda^2}{4} \sum_{a,b} q_{ab}^2 + \sum_a \nu_a c_a + \frac{1}{2} \sum_{a\neq b} \mu_{ab} q_{ab} - \log \sum_{\sigma \in \{\pm 1\}^r} \exp\Big( \sum_a \nu_a \sigma_a + \frac{1}{2} \sum_{a\neq b} \mu_{ab} \sigma_a \sigma_b \Big).$$
We next assume that the dominant saddle point takes a particular form: the so-called
replica symmetric ansatz, given by $q_{aa} = 1$, $c_a = c$, $\nu_a = \nu$, and, for $a \neq b$,
$q_{ab} = q$ and $\mu_{ab} = \mu$, for constants q, c, µ, ν. This yields
(10)  $$\lim_{r\to 0} \frac{1}{r} G(q, c, \mu, \nu) = -\frac{\lambda^2}{2} c^2 - \frac{\lambda^2}{4} + \frac{\lambda^2}{4} q^2 + \nu c - \frac{1}{2}\mu(q-1) - \mathop{\mathbb{E}}_{z\sim\mathcal{N}(0,1)} \log\big( 2\cosh(\nu + \sqrt{\mu}\, z) \big),$$
where the computation leading to (10) uses the Gaussian moment-generating function
and the replica trick (8).
We next find the critical points by setting the derivatives of (10) (with respect
to all four variables) to zero, which yields
$$\nu = \lambda^2 c, \qquad \mu = \lambda^2 q, \qquad c = \mathbb{E}_z \tanh(\nu + \sqrt{\mu}\, z), \qquad q = \mathbb{E}_z \tanh^2(\nu + \sqrt{\mu}\, z).$$
Recall that the replicas are drawn from the posterior distribution Pr[x | Y ] and
so the truth x behaves as if it is a replica; therefore we should have c = q. Using
the identity $\mathbb{E}_z \tanh(\gamma + \sqrt{\gamma}\, z) = \mathbb{E}_z \tanh^2(\gamma + \sqrt{\gamma}\, z)$ (see e.g. [DAM16]), we obtain
the solution c = q and ν = µ, where q and µ are solutions to
(11)  $$\mu = \lambda^2 q, \qquad q = \mathop{\mathbb{E}}_{z\sim\mathcal{N}(0,1)} \tanh(\mu + \sqrt{\mu}\, z).$$
The solution q to this equation tells us about the structure of the posterior dis-
tribution; namely, if we take two independent draws from this distribution, their
overlap will concentrate about q. (Equivalently, the true signal x and a draw from
the posterior distribution will also have overlap that concentrates about q.) Note
that (11) exactly matches the state evolution fixed-point equation (6) with µ in
place of γ and $q = \gamma/\lambda^2$.
The free energy density of a solution to (11) is given by
$$f = \frac{1}{\beta} \lim_{r\to 0} \frac{1}{r} G(q, c, \mu, \nu) = \frac{1}{\lambda} \Big( -\frac{\lambda^2}{4}(q^2 + 1) + \frac{1}{2}\mu(q + 1) - \mathbb{E}_z \log\big( 2\cosh(\mu + \sqrt{\mu}\, z) \big) \Big).$$
This is how one can derive the free energy curves such as those shown in Figure 1.
If there are multiple solutions to (11), we should take the one with minimum free
energy.
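To trace out such curves in practice, one can solve (11) by fixed-point iteration from several starting points and evaluate f at each solution, keeping the minimizer; a small sketch of this procedure (our own, with the Gaussian expectations again approximated by Monte Carlo):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=200000)   # Monte Carlo samples for E_z[...]

def solve_and_evaluate(lam, q0, n_iter=500):
    """Fixed-point iteration for (11), then the free energy of the solution."""
    q = q0
    for _ in range(n_iter):
        mu = lam ** 2 * q
        q = max(np.tanh(mu + np.sqrt(mu) * z).mean(), 0.0)  # guard MC noise
    mu = lam ** 2 * q
    f = (-(lam ** 2 / 4) * (q ** 2 + 1) + 0.5 * mu * (q + 1)
         - np.log(2.0 * np.cosh(mu + np.sqrt(mu) * z)).mean()) / lam
    return q, f

# Scan initializations; if several fixed points coexist, keep the one of
# minimum free energy (as prescribed above).
for lam in [0.5, 1.5]:
    print(lam, [solve_and_evaluate(lam, q0) for q0 in (1e-6, 0.5)])
```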
Above, we had a Gibbs distribution corresponding to the posterior distribution
of a Bayesian inference problem. In this setting, the replica symmetric ansatz is
always correct; this is justified by a phenomenon in statistical physics: “there is no
static replica symmetry breaking on the Nishimori line” (see e.g. [ZK16, Nis01]).
More generally, one can apply the replica method to a Gibbs distribution that
does not correspond to a posterior distribution (e.g. if the ‘temperature’ of the
Gibbs distribution does not match the signal-to-noise of the observed data). This
is important when investigating computational hardness of random non-planted or
non-Bayesian problems. In this case, the optimal (lowest free energy) saddle point
can take various forms, which are summarized below; the form of the optimizer
reveals a lot about the structure of the Gibbs distribution. An important property
of a Gibbs distribution is its overlap distribution: the distribution of the overlap
between two independent draws from the Gibbs distribution (in the large n limit).
• RS (replica symmetric): The overlap matrix is qaa = 1 and qab = q for
some q ∈ [0, 1]. The overlap distribution is supported on a single point
mass at value q. The Gibbs distribution can be visualized as having one
large cluster where any two vectors in this cluster have overlap q. This
case is “easy” in the sense that belief propagation can easily move around
within the single cluster and find the true posterior distribution.
• 1RSB (1-step replica symmetry breaking): The r × r overlap matrix takes
the following form. The r replicas are partitioned into blocks of size m.
We have qaa = 1, qab = q1 if a, b are in the same block, and qab = q2
otherwise (for some q1 , q2 ∈ [0, 1]). The overlap distribution is supported
on q1 and q2 . The Gibbs distribution can be visualized as having a constant
number of clusters. Two vectors in the same cluster have overlap q1 whereas
two vectors in different clusters have overlap q2 . This case is “hard” for
belief propagation because it gets stuck in one cluster and cannot correctly
capture the posterior distribution. The idea of replica symmetry breaking
was first proposed in a groundbreaking work of Parisi [Par79].
• 2RSB (2-step replica symmetry breaking): Now we have “clusters of clus-
ters.” The overlap matrix has sub-blocks within each block. The overlap
distribution is supported on 3 different values (corresponding to “same sub-
block”, “same block (but different sub-block)”, “different blocks”). The
Gibbs distribution has a constant number of clusters, each with a constant
number of sub-clusters. This is again “hard” for belief propagation.
• FRSB (full replica symmetry breaking): We can define kRSB for any k as
above (characterized by an overlap distribution supported on k + 1 values);
FRSB is the limit of kRSB as k → ∞. Here the overlap distribution is a
continuous distribution.
• d1RSB (dynamic 1RSB): This phase is similar to RS and (unlike kRSB for
k ≥ 1) can appear in Bayesian inference problems. The overlap matrix is
the same as in the RS phase (and so the replica calculation proceeds exactly
as in the RS case). However, the Gibbs distribution has exponentially-many
small clusters. The overlap distribution is supported on a single point mass
because two samples from the Gibbs distribution will be in different clusters
with high probability. This phase is “hard” for BP (or AMP) because it
cannot easily move between clusters. For a Bayesian inference problem, you
can tell whether you are in the RS (easy) phase or the d1RSB (hard) phase by
looking at the free energy curve; d1RSB corresponds to the “hard” phase
(c) in Figure 1.
Acknowledgements. The authors would like to thank the engaging audience at
the Courant Institute, where this material was presented, for their many insightful
comments. The authors would also like to thank Soledad Villar and Lenka
Zdeborová for feedback on earlier versions of this manuscript.
References
[ABAČ13] Antonio Auffinger, Gérard Ben Arous, and Jiří Černý. Random matrices and
complexity of spin glasses. Communications on Pure and Applied Mathematics,
66(2):165–201, 2013.
[Abb17] Emmanuel Abbe. Community detection and stochastic block models: recent devel-
opments. arXiv preprint arXiv:1703.10146, 2017.
[AKS98] N. Alon, M. Krivelevich, and B. Sudakov. Finding a large hidden clique in a random
graph. SODA, 1998.
[AS15] Emmanuel Abbe and Colin Sandon. Detection in the stochastic block model
with multiple clusters: proof of the achievability conjectures, acyclic BP, and the
information-computation gap. arXiv preprint arXiv:1512.09080, 2015.
[AT07] R. J. Adler and J. E. Taylor. Random Fields and Geometry. Springer, 2007.
[BBV16] A. S. Bandeira, N. Boumal, and V. Voroninski. On the low-rank approach for semi-
definite programs arising in synchronization and community detection. COLT, 2016.
[BDM+ 16] Jean Barbier, Mohamad Dia, Nicolas Macris, Florent Krzakala, Thibault Lesieur,
and Lenka Zdeborová. Mutual information for symmetric rank-one matrix estima-
tion: A proof of the replica formula. In Advances in Neural Information Processing
Systems, pages 424–432, 2016.
[BHK+ 16] B. Barak, S. B. Hopkins, J. Kelner, P. K. Kothari, A. Moitra, and A. Potechin. A
nearly tight sum-of-squares lower bound for the planted clique problem. Available
online at arXiv:1604.03084 [cs.CC], 2016.
[BKS13] B. Barak, J. Kelner, and D. Steurer. Rounding sum-of-squares relaxations. Available
online at arXiv:1312.6652 [cs.DS], 2013.
[BKS15] Boaz Barak, Jonathan A Kelner, and David Steurer. Dictionary learning and tensor
decomposition via the sum-of-squares method. In Proceedings of the forty-seventh
annual ACM symposium on Theory of computing, pages 143–151. ACM, 2015.
[BM11] Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense
graphs, with applications to compressed sensing. IEEE Transactions on Information
Theory, 57(2):764–785, 2011.
[BM16] Boaz Barak and Ankur Moitra. Noisy tensor completion via the sum-of-squares
hierarchy. In Conference on Learning Theory, pages 417–445, 2016.
[BMNN16] Jess Banks, Cristopher Moore, Joe Neeman, and Praneeth Netrapalli. Information-
theoretic thresholds for community detection in sparse networks. In Conference on
Learning Theory, pages 383–416, 2016.
[Bol12] Erwin Bolthausen. An iterative construction of solutions of the TAP equations for the
Sherrington–Kirkpatrick model. arXiv preprint arXiv:1201.2891, 2012.
[BR12] Q. Berthet and P. Rigollet. Optimal detection of sparse principal components in
high dimension. Annals of Statistics, 2012.
[BR13] Q. Berthet and P. Rigollet. Complexity theoretic lower bounds for sparse principal
component detection. Conference on Learning Theory (COLT), 2013.
[BS14] B. Barak and D. Steurer. Sum-of-squares proofs and the quest toward optimal algo-
rithms. Survey, ICM 2014, 2014.
[BVB16] N. Boumal, V. Voroninski, and A. S. Bandeira. The non-convex Burer–Monteiro
approach works on smooth semidefinite programs. NIPS, 2016.
[COKPZ16] Amin Coja-Oghlan, Florent Krzakala, Will Perkins, and Lenka Zdeborova.
Information-theoretic thresholds from the cavity method. arXiv preprint
arXiv:1611.00814, 2016.
[DAM16] Yash Deshpande, Emmanuel Abbe, and Andrea Montanari. Asymptotic mutual in-
formation for the binary stochastic block model. In Information Theory (ISIT),
2016 IEEE International Symposium on, pages 185–189. IEEE, 2016.
[DKMZ11] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová. Infer-
ence and phase transitions in the detection of modules in sparse networks. Physical
Review Letters, 107(6):065701, 2011.
[DM14] Yash Deshpande and Andrea Montanari. Information-theoretically optimal sparse
PCA. In Information Theory (ISIT), 2014 IEEE International Symposium on, pages
2197–2201. IEEE, 2014.
[DM15] Yash Deshpande and Andrea Montanari. Finding hidden cliques of size √(N/e) in
nearly linear time. Foundations of Computational Mathematics, 15(4):1069–1128,
2015.
[DMM09] David L Donoho, Arian Maleki, and Andrea Montanari. Message-passing algo-
rithms for compressed sensing. Proceedings of the National Academy of Sciences,
106(45):18914–18919, 2009.
[FR12] Alyson K Fletcher and Sundeep Rangan. Iterative reconstruction of rank-one ma-
trices in noise. Information and Inference: A Journal of the IMA, 2012.
[GJZ17] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank
problems: A unified geometric analysis. arXiv preprint arXiv:1704.00708, 2017.
[GM15] Rong Ge and Tengyu Ma. Decomposing overcomplete 3rd order tensors using sum-
of-squares algorithms. arXiv preprint arXiv:1504.05287, 2015.
[GM17] R. Ge and T. Ma. On the optimization landscape of tensor decompositions. Available
online at arXiv:1706.05598 [cs.LG], 2017.
[GW95] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for max-
imum cut and satisfiability problems using semidefinite programming. Journal of the
Association for Computing Machinery, 42:1115–1145, 1995.
[HLL83] P. W. Holland, K. Blackmond Laskey, and S. Leinhardt. Stochastic blockmodels:
First steps. Social networks, 5(2):109–137, 1983.
[HS17] Samuel B Hopkins and David Steurer. Bayesian estimation from few samples: com-
munity detection and related problems. arXiv preprint arXiv:1710.00264, 2017.
[HSS15] S. B. Hopkins, J. Shi, and D. Steurer. Tensor principal component analysis via sum-
of-squares proofs. Available at arXiv:1507.03269 [cs.LG], 2015.
[HSSS16] Samuel B Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spectral
algorithms from sum-of-squares proofs: tensor decomposition and planted sparse
vectors. In Proceedings of the forty-eighth annual ACM symposium on Theory of
Computing, pages 178–191. ACM, 2016.
[HWX14] B. Hajek, Y. Wu, and J. Xu. Computational lower bounds for community detection
on random graphs. Available online at arXiv:1406.6625, 2014.
[JM13] Adel Javanmard and Andrea Montanari. State evolution for general approximate
message passing algorithms, with applications to spatial coupling. Information and
Inference: A Journal of the IMA, 2(2):115–144, 2013.
[JMRT16] Adel Javanmard, Andrea Montanari, and Federico Ricci-Tersenghi. Phase transi-
tions in semidefinite relaxations. Proceedings of the National Academy of Sciences,
113(16):E2218–E2223, 2016.
[Kar72] R. M. Karp. Reducibility among combinatorial problems. Complexity of Computer
Computation, 1972.
[KBG17] C. Kim, A. S. Bandeira, and M. X. Goemans. Community detection in hypergraphs,
spiked tensor models, and sum-of-squares. SampTA 2017: Sampling Theory and
Applications, 12th international conference, 2017.
[Kho02] S. Khot. On the power of unique 2-prover 1-round games. Thirty-fourth annual ACM
symposium on Theory of computing, 2002.
[Kho10] S. Khot. On the unique games conjecture (invited survey). In Proceedings of the
2010 IEEE 25th Annual Conference on Computational Complexity, CCC ’10, pages
99–121, Washington, DC, USA, 2010. IEEE Computer Society.
[KS66] Harry Kesten and Bernt P. Stigum. A limit theorem for multidimensional Galton–
Watson processes. The Annals of Mathematical Statistics, 37(5):1211–1223, 1966.
[KXZ16] Florent Krzakala, Jiaming Xu, and Lenka Zdeborová. Mutual information in rank-
one matrix estimation. In Information Theory Workshop (ITW), 2016 IEEE, pages
71–75. IEEE, 2016.
[Las01] J. B. Lasserre. Global optimization with polynomials and the problem of moments.
SIAM Journal on Optimization, 11(3):796–817, 2001.
[LKZ15a] Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová. MMSE of probabilistic
low-rank matrix estimation: Universality with respect to the output channel. In
53rd Annual Allerton Conference on Communication, Control, and Computing,
pages 680–687. IEEE, 2015.
[LKZ15b] Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová. Phase transitions in
sparse PCA. In Information Theory (ISIT), 2015 IEEE International Symposium
on, pages 1635–1639. IEEE, 2015.
[LM16] Marc Lelarge and Léo Miolane. Fundamental limits of symmetric low-rank matrix
estimation. arXiv preprint arXiv:1611.03888, 2016.
[LML+ 17] T. Lesieur, L. Miolane, M. Lelarge, F. Krzakala, and L. Zdeborová. Statistical
and computational phase transitions in spiked tensor estimation. arXiv preprint
arXiv:1701.08010, 2017.
[Lov79] L. Lovász. On the Shannon capacity of a graph. IEEE Trans. Inf. Theor., 25(1):1–7,
1979.
[LV18] L. Venturi, A. S. Bandeira, and J. Bruna. Neural networks with finite intrinsic dimension
have no spurious valleys. arXiv:1802.06384 [math.OC], 2018.
[Mas14] Laurent Massoulié. Community detection thresholds and the weak ramanujan prop-
erty. In Proceedings of the forty-sixth annual ACM symposium on Theory of com-
puting, pages 694–703. ACM, 2014.
[MM09] Marc Mezard and Andrea Montanari. Information, physics, and computation. Ox-
ford University Press, 2009.
[MNS12] Elchanan Mossel, Joe Neeman, and Allan Sly. Stochastic block models and recon-
struction. arXiv preprint arXiv:1202.1499, 2012.
[MNS13] Elchanan Mossel, Joe Neeman, and Allan Sly. A proof of the block model threshold
conjecture. arXiv preprint arXiv:1311.4115, 2013.
[Moo17] Cristopher Moore. The computer science and physics of community detection: Land-
scapes, phase transitions, and hardness. arXiv preprint arXiv:1702.00467, 2017.
[MPV86] Marc Mézard, Giorgio Parisi, and M. A. Virasoro. SK model: The replica solution
without replicas. EPL (Europhysics Letters), 1(2):77, 1986.
[MR16] Andrea Montanari and Emile Richard. Non-negative principal component analysis:
Message passing algorithms and sharp asymptotics. IEEE Transactions on Infor-
mation Theory, 62(3):1458–1484, 2016.
[MSS16] Tengyu Ma, Jonathan Shi, and David Steurer. Polynomial-time tensor decompo-
sitions with sum-of-squares. In Foundations of Computer Science (FOCS), 2016
IEEE 57th Annual Symposium on, pages 438–446. IEEE, 2016.
[MW13] Z. Ma and Y. Wu. Computational barriers in minimax submatrix detection. Available
online at arXiv:1309.5914, 2013.
[Nes00] Y. Nesterov. Squared functional systems and optimization problems. High perfor-
mance optimization, 13(405-440), 2000.
[Nis80] Hidetoshi Nishimori. Exact results and critical properties of the Ising model with
competing interactions. Journal of Physics C: Solid State Physics, 13(21):4071,
1980.
[Nis81] Hidetoshi Nishimori. Internal energy, specific heat and correlation function of the
bond-random Ising model. Progress of Theoretical Physics, 66(4):1169–1181, 1981.
[Nis01] Hidetoshi Nishimori. Statistical physics of spin glasses and information processing:
an introduction, volume 111. Clarendon Press, 2001.
[Par79] Giorgio Parisi. Infinite number of order parameters for spin-glasses. Physical Review
Letters, 43(23):1754, 1979.
[Par00] P. A. Parrilo. Structured semidefinite programs and semialgebraic geometry methods
in robustness and optimization. PhD thesis, California Institute of Technology, 2000.
[Pea86] Judea Pearl. Fusion, propagation, and structuring in belief networks. Artificial in-
telligence, 29(3):241–288, 1986.
[PS17] Aaron Potechin and David Steurer. Exact tensor completion with sum-of-squares.
arXiv preprint arXiv:1702.06237, 2017.
[PWB16] Amelia Perry, Alexander S Wein, and Afonso S Bandeira. Statistical limits of spiked
tensor models. arXiv preprint arXiv:1612.07728, 2016.
[PWBM16a] Amelia Perry, Alexander S Wein, Afonso S Bandeira, and Ankur Moitra.
Message-passing algorithms for synchronization problems over compact groups.
Communications on Pure and Applied Mathematics (to appear). arXiv preprint
arXiv:1610.04583, 2016.
[PWBM16b] Amelia Perry, Alexander S Wein, Afonso S Bandeira, and Ankur Moitra. Optimality
and sub-optimality of PCA for spiked random matrices and synchronization. arXiv
preprint arXiv:1609.05573, 2016.
[Rag08] P. Raghavendra. Optimal algorithms and inapproximability results for every CSP?
In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing,
STOC ’08, pages 245–254. ACM, 2008.
[RM14] Emile Richard and Andrea Montanari. A statistical model for tensor PCA. In Ad-
vances in Neural Information Processing Systems, pages 2897–2905, 2014.
[Sho87] N. Shor. An approach to obtaining global extremums in polynomial mathematical
programming problems. Cybernetics and Systems Analysis, 23(5):695–700, 1987.
[Sin11] A. Singer. Angular synchronization by eigenvectors and semidefinite programming.
Appl. Comput. Harmon. Anal., 30(1):20 – 36, 2011.
[TAP77] David J Thouless, Philip W Anderson, and Robert G Palmer. Solution of ‘solvable
model of a spin glass’. Philosophical Magazine, 35(3):593–601, 1977.
[ZK16] Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: Thresholds
and algorithms. Advances in Physics, 65(5):453–552, 2016.
(Bandeira) Department of Mathematics and Center for Data Science, Courant Insti-
tute of Mathematical Sciences, New York University
E-mail address: bandeira@cims.nyu.edu