Syntactic and Semantic C
ABSTRACT
1 INTRODUCTION
The goal of controlled generation from language models (LMs) is to produce text guided by a set of
syntactic or semantic constraints. One prominent example is semantic parsing, or code generation,
which involves producing text in a programming (or other formal) language, typically from a natural
language prompt. We may wish to use diverse signals to guide code generation, for example:
• Checking (partial) code statically (type-checking, linting, partial evaluation);
• Running (partial) code on a test case and checking if it raises an error or returns the wrong answer;
• Simulating environments (e.g., in robotics or chemistry) and assigning a score to the resulting state;
• Rolling out possible completions of partial code and computing their max, min, or average score;
• Asking another language model to critique the code generated so far.
Such signals vary along several important dimensions: some are cheap to compute (linting), others are
more costly (simulations); some can be evaluated incrementally with each sampled token (language
model critique), others provide sparser guidance (running code); some enforce binary hard constraints
(type-checking), others yield soft continuous scores (scoring).
One way to represent such signals uniformly is as potential functions ϕ(x) assigning non-negative
scores to sequences of tokens x. Given a set Φ of such potentials, we will write $\Phi(x) = \prod_{\phi \in \Phi} \phi(x)$.
We frame the problem of controlled generation probabilistically: We wish to sample from the global
∗ co-first authorship, ‡ co-senior authorship.
Published as a conference paper at ICLR 2025
[Figure 1, left panel: weighted particles such as "a = torch.stack(tensors)" and "a = torch.vstack(tensors)" are extended, reweighted, and resampled. Panel labels: Resampling; Updated particles; Focuses computation on promising particles.]
Figure 1: Controlled generation from LMs via sequential Monte Carlo. Left: We use sequential Monte Carlo
to sample from high-quality approximations to posteriors over LM outputs. Partial sequences are repeatedly
extended via grammar-constrained generation. We then apply weight corrections to mitigate the greediness of
locally constrained decoding, as well as expensive potentials to encode rich information that cannot be included
in logit masks. Finally, resampling focuses computation on promising particles. Right: Accuracy gains from
these innovations on challenging data science, text-to-SQL, goal inference, and molecule synthesis benchmarks.
mitigates the greediness of locally constrained decoding; expensive potentials, which incorporate
useful signals that several baseline methods cannot integrate; and adaptive resampling, which
adaptively refocuses computation on partial sequences that look more promising.
• Empirical validation of the probabilistic perspective. We derive estimators of the KL divergence
from each method’s output distribution to the global product of experts (Appendix D). We find that
the best-performing methods (i) have outputs that are closer in KL divergence to the global product
of experts within each problem instance, and (ii) assign probabilities that are more correlated with
downstream performance across problem instances (§3.3).
Notation. We use x to refer to a sequence of tokens, with $x_i$ being the ith token in the sequence. Let
$x_{<t} \overset{\text{def}}{=} x_1 \cdots x_{t-1}$. Let ε denote the empty sequence. We use juxtaposition (e.g., x y) to denote
sequence concatenation. Let A be a vocabulary of tokens, and let A∗ denote the set of all finite
sequences of tokens in A. We refer to the set A∗ as the set of partial sequences. We use EOS ∉ A to
denote a special end-of-sequence token, which is not included in A. We define
$A_{\text{EOS}} \overset{\text{def}}{=} A \cup \{\text{EOS}\}$ and the set of complete sequences $A^{*}\text{EOS} \overset{\text{def}}{=} \{x\,\text{EOS} \mid x \in A^{*}\}$.
Language models. A language model p is a probability distribution over complete sequences (i.e.,
$\sum_{x \in A^{*}\text{EOS}} p(x) = 1$). We assume that p provides a conditional distribution p(x′ | x) over its next
token x′ ∈ AEOS given any sequence x ∈ A∗; the probability of any complete sequence factors as
\[ p(x) = \prod_{t=1}^{|x|} p(x_t \mid x_{<t}) \tag{2} \]
We find it convenient to extend the definition of p(x) from Eq. (2) to partial sequences x ∈ A∗; note,
however, that p(x) is a prefix probability,¹ so p is not a probability distribution over partial sequences.
With this extended definition, we have $p(x' \mid x) = \frac{p(x\,x')}{p(x)}$ provided p(x) > 0.
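As a concrete illustration of Eq. (2) and its extension to prefix probabilities, the following minimal Python sketch uses a hypothetical toy model. The two-token vocabulary, the probabilities, and the length cap `MAX_LEN` are assumptions made only for this example and do not come from the experiments.

```python
from itertools import product

EOS = "<eos>"
VOCAB = ["a", "b"]   # toy vocabulary A (assumption for illustration)
MAX_LEN = 2          # the toy model is forced to emit EOS after MAX_LEN tokens


def next_token_probs(prefix):
    """Toy conditional p(. | prefix) over A ∪ {EOS}; not a real language model."""
    if len(prefix) >= MAX_LEN:
        return {EOS: 1.0}
    return {"a": 0.5, "b": 0.3, EOS: 0.2}


def prob(seq):
    """Chain-rule product of Eq. (2); a prefix probability when seq lacks EOS."""
    out = 1.0
    for t, tok in enumerate(seq):
        out *= next_token_probs(seq[:t]).get(tok, 0.0)
    return out


def complete_sequences():
    """Enumerate every complete sequence x EOS with |x| <= MAX_LEN."""
    for n in range(MAX_LEN + 1):
        for body in product(VOCAB, repeat=n):
            yield body + (EOS,)


# p sums to one over complete sequences, but not over partial ones.
total = sum(prob(x) for x in complete_sequences())

# The conditional is recovered as a ratio of prefix probabilities.
prefix = ("a",)
cond = prob(prefix + ("b",)) / prob(prefix)
print(total, cond)
```

In this toy setting the ratio of prefix probabilities reproduces the conditional p("b" | "a") that was used to define the model, which is the identity stated above.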
Potential functions. We consider a set Φ of domain- or task-specific potential functions that encode
relevant constraints or preferences as nonnegative scores. Each potential function ϕ ∈ Φ has the
type ϕ : (A∗ ∪ A∗ EOS) → R≥0, meaning that it assigns a non-negative real ϕ(x) when evaluated
on some sequence x, freely using any structure in the sequence so far regardless of whether x
is a partial or complete sequence. For technical reasons, we assume that all potentials satisfy
ϕ(x) = 0 ⟹ ϕ(x y) = 0, for all x and y such that x y ∈ A∗ EOS.
¹ I.e., p(x) is the probability that a complete sequence x′ ∼ p has the partial sequence x as a prefix.
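To make the potential-function interface concrete, here is a small Python sketch with two hypothetical hard constraints over a toy vocabulary. The function names and the constraints themselves are illustrative assumptions; the product `Phi` mirrors the definition of Φ(x) as a product over ϕ ∈ Φ.

```python
from math import prod

EOS = "<eos>"


def no_double_a(seq):
    """Hard constraint checkable on any prefix: never two consecutive 'a' tokens."""
    return 0.0 if any(x == y == "a" for x, y in zip(seq, seq[1:])) else 1.0


def ends_with_b(seq):
    """Constraint decidable only at EOS; it returns 1 on every partial sequence."""
    if seq and seq[-1] == EOS:
        return 1.0 if len(seq) >= 2 and seq[-2] == "b" else 0.0
    return 1.0


POTENTIALS = [no_double_a, ends_with_b]  # the set Φ (hypothetical)


def Phi(seq):
    """Product of all potentials on a partial or complete sequence."""
    return prod(phi(seq) for phi in POTENTIALS)


print(Phi(("a", "b", EOS)))   # satisfies both constraints
print(Phi(("a", EOS)))        # complete, but does not end with 'b'
print(Phi(("a", "a")))        # violated on a prefix ...
print(Phi(("a", "a", "b", EOS)))  # ... and therefore on every extension
```

Both toy potentials satisfy the assumption ϕ(x) = 0 ⟹ ϕ(x y) = 0: the first because a forbidden pair persists in every extension, the second because it is zero only on complete sequences, which have no extensions.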
Target distribution. We formalize the goal of controlled generation as sampling sequences x ∈
A∗ EOS from the target distribution given by the global product of experts between p and Φ:²
\[ g(x) = \frac{1}{Z}\, p(x)\,\Phi(x) \quad \text{where} \quad Z = \sum_{y \in A^{*}\text{EOS}} p(y)\,\Phi(y) \tag{3} \]
For intuition, if Φ(x) ∈ {0, 1} for all x in A∗ EOS, the global product of experts can be understood
as the rejection sampling distribution that arises by repeatedly generating x ∼ p, and rejecting if
Φ(x) = 0. The normalizing constant Z is the rate at which samples are accepted. Thus, the expected
runtime of rejection sampling is 1/Z per accepted sample, making it expensive if Z is small. Our
work aims to accurately approximate the global product of experts with much less computation.
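The rejection-sampling reading of Eq. (3) can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions: the toy model, the binary potential, and the sample size are all hypothetical. It computes Z and g exactly by enumeration and then compares Z with the empirical acceptance rate.

```python
import random
from itertools import product

EOS, VOCAB, MAX_LEN = "<eos>", ["a", "b"], 2


def next_token_probs(prefix):
    """Toy conditional p(. | prefix); an assumption for illustration only."""
    if len(prefix) >= MAX_LEN:
        return {EOS: 1.0}
    return {"a": 0.5, "b": 0.3, EOS: 0.2}


def prob(seq):
    out = 1.0
    for t, tok in enumerate(seq):
        out *= next_token_probs(seq[:t]).get(tok, 0.0)
    return out


def Phi(seq):
    """Binary potential on complete sequences: the last real token must be 'b'."""
    return 1.0 if len(seq) >= 2 and seq[-2] == "b" else 0.0


# Exact global product of experts by brute-force enumeration.
complete = [body + (EOS,) for n in range(MAX_LEN + 1)
            for body in product(VOCAB, repeat=n)]
Z = sum(prob(x) * Phi(x) for x in complete)
g = {x: prob(x) * Phi(x) / Z for x in complete if Phi(x) > 0}


def sample_lm(rng):
    """Ancestral sampling of one complete sequence from the toy model."""
    seq = ()
    while not seq or seq[-1] != EOS:
        probs = next_token_probs(seq)
        tok = rng.choices(list(probs), weights=list(probs.values()))[0]
        seq += (tok,)
    return seq


rng = random.Random(0)
proposals = [sample_lm(rng) for _ in range(20_000)]
accept_rate = sum(Phi(x) for x in proposals) / len(proposals)
print(f"Z = {Z:.3f}, empirical acceptance rate = {accept_rate:.3f}")
```

With a binary potential, each proposal is accepted with probability Z, so on average 1/Z proposals are needed per accepted sample; enumeration is only possible here because the toy sequence space is tiny.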
Locally constrained decoding. A popular approach³ to enforcing constraints at inference-time is to
apply them before sampling each token. In this approach, at each time step t, the current sequence
x<t is extended with a new token xt ∼ ℓΦ(· | x<t) (until xt = EOS) where⁴
\[ \ell_\Phi(x_t \mid x_{<t}) \overset{\text{def}}{=} \frac{p(x_t \mid x_{<t})\,\dfrac{\Phi(x_{<t}\,x_t)}{\Phi(x_{<t})}}{L_\Phi(x_{<t})} \quad \text{where} \quad L_\Phi(x_{<t}) \overset{\text{def}}{=} \sum_{x' \in A_{\text{EOS}}} p(x' \mid x_{<t})\,\frac{\Phi(x_{<t}\,x')}{\Phi(x_{<t})} \tag{4} \]
We write $\ell_\Phi(x) \overset{\text{def}}{=} \prod_{t=1}^{|x|} \ell_\Phi(x_t \mid x_{<t})$ for either the probability of x ∈ A∗ EOS or the prefix
probability of x ∈ A∗. Note that in the former case, ℓΦ is a distribution over complete sequences.
We call it a local product of experts model because normalization is performed locally (at each step
of the sequence), rather than globally (once per complete sequence).
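The difference between local and global normalization can be seen on a toy problem. The following Python sketch implements Eq. (4) for a hypothetical prefix-checkable constraint (no two consecutive "a" tokens) and compares the resulting probability of one sequence with its probability under the global product of experts from Eq. (3). The toy model and constraint are assumptions for illustration.

```python
from itertools import product

EOS, VOCAB, MAX_LEN = "<eos>", ["a", "b"], 2


def next_token_probs(prefix):
    """Toy conditional p(. | prefix); an assumption for illustration only."""
    if len(prefix) >= MAX_LEN:
        return {EOS: 1.0}
    return {"a": 0.5, "b": 0.3, EOS: 0.2}


def prob(seq):
    out = 1.0
    for t, tok in enumerate(seq):
        out *= next_token_probs(seq[:t]).get(tok, 0.0)
    return out


def Phi(seq):
    """Prefix-checkable hard constraint: forbid two consecutive 'a' tokens."""
    return 0.0 if any(x == y == "a" for x, y in zip(seq, seq[1:])) else 1.0


def local_step(prefix):
    """Return (ell_Phi(. | prefix), L_Phi(prefix)) following Eq. (4)."""
    base = Phi(prefix)
    scores = {
        tok: (p * Phi(prefix + (tok,)) / base if base > 0 else 0.0)
        for tok, p in next_token_probs(prefix).items()
    }
    L = sum(scores.values())
    dist = {tok: (s / L if L > 0 else 0.0) for tok, s in scores.items()}
    return dist, L


def local_prob(seq):
    """Probability of seq under the local product of experts."""
    out = 1.0
    for t, tok in enumerate(seq):
        dist, _ = local_step(seq[:t])
        out *= dist.get(tok, 0.0)
    return out


complete = [body + (EOS,) for n in range(MAX_LEN + 1)
            for body in product(VOCAB, repeat=n)]
Z = sum(prob(x) * Phi(x) for x in complete)

target = ("a", "b", EOS)
global_p = prob(target) * Phi(target) / Z
local_p = local_prob(target)
print(f"global g = {global_p:.3f}, local ell = {local_p:.3f}")
```

In this toy case the local model assigns 0.30 to the sequence while the global product of experts assigns 0.20: after sampling "a", the local model renormalizes over the surviving tokens and never accounts for the probability mass that the constraint removed.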
Despite its popularity, locally constrained decoding has two important shortcomings. First, for most
practical potential functions, the local and global product of experts do not define the same distribution
(Lew et al., 2023; Park et al., 2025). In particular, while the global product of experts is defined with
respect to complete sequences, the local product typically only has access to the string generated so far
and a single token of lookahead—which can lead to myopic sampling down paths that lead to globally
poor solutions. In principle, this problem can be mitigated by the choice of intermediate potentials
(Φ(x) for x ∈ A∗ ), which implement more aggressive forms of lookahead. In particular, locally
constrained decoding is an exact sampler when Φ(x) = Φ∗(x), the expected future potential of
x, $\Phi^{*}(x) \overset{\text{def}}{=} \mathbb{E}_{x' \sim p}[\Phi(x') \mid x \text{ is a prefix of } x']$.⁵ Unfortunately, much like Z, computing Φ∗ exactly
is typically intractable. Although we may seek to approximate it, for example, by learning (Zhao
et al., 2024) or adaptive methods (Park et al., 2025), here we instead focus on variants of locally
constrained decoding which marginalize over the immediate next token as in Eq. (4); see Footnote 3.
Our experiments (§3) compare our method to this dominant form of local decoding from the literature,
using efficient tests for whether the addition of a single candidate token can satisfy the constraint.
The second, related problem with locally constrained decoding is that the local product of experts
can only be sampled efficiently when it is possible to cheaply evaluate the potentials ϕ ∈ Φ on
all possible one-token continuations x<t x′t of the current sequence. For some constraints (e.g.,
checking membership or prefixhood in the language of a regular expression or context-free grammar),
algorithms exist for efficient parallel evaluation across tens of thousands of possible continuations.
However, for many of the constraints of interest in the present paper (including several listed in
Table 1, for example, error-checking with test cases) this is not feasible. In what follows, we will
assume that the set of potentials Φ can be partitioned into expensive potentials Φexp , which are
too costly to use as part of locally constrained decoding, and efficient potentials Φeff , which can
be used in sampling from the local product of experts.
² Note that care has to be taken to ensure that the sum which defines the normalizing constant in Eq. (3) converges.
One sufficient (but not necessary) condition ensuring this is if Φ is bounded above by a constant.
³ E.g., Shin et al. (2021); Scholak et al. (2021); Poesia et al. (2022); Willard & Louf (2023); Ugare et al. (2024).
⁴ Here our assumption that ϕ(x) = 0 ⟹ ϕ(x y) = 0 ensures that whenever Φ(x<t) = 0, all extensions will
be 0 as well, making it safe in such cases to define $\frac{\Phi(x_{<t}\,x')}{\Phi(x_{<t})} = 0$.
⁵ Φ∗ is also known in the SMC literature as the optimal twist function (e.g., Zhao et al., 2024).
Importance sampling. The shortcomings of local decoding can be addressed with importance
sampling, a Monte Carlo technique for approximating intractable distributions. We describe a
particular application of the technique specialized to our setting. Here, we use the local product of
experts model ℓΦeff (abbreviated ℓeff) with respect to just Φeff as a proposal distribution, from which
we sample multiple complete particles $x^{(1)}, \ldots, x^{(N)} \overset{\text{i.i.d.}}{\sim} \ell_{\text{eff}}$. For each particle x(i) we define its
importance weight w(i) as
\[ w^{(i)} \overset{\text{def}}{=} \frac{p(x^{(i)}) \cdot \Phi(x^{(i)})}{\ell_{\text{eff}}(x^{(i)})} = \left[ \prod_{t=1}^{|x^{(i)}|} L_{\text{eff}}(x^{(i)}_{<t}) \right] \cdot \Phi_{\text{exp}}(x^{(i)}) \tag{5} \]
The numerator is an unnormalized variant of the target distribution g, which we write as $\tilde{g}$ hereafter,
while the denominator ℓeff is the proposal distribution that was used to draw the sequence. These
weighted particles define the following posterior approximation:
$\hat{g}(x) \overset{\text{def}}{=} \frac{\sum_{i=1}^{N} w^{(i)}\, 1\{x = x^{(i)}\}}{\sum_{j=1}^{N} w^{(j)}}$,
which under mild conditions converges to g as N grows.⁶ Our importance weights simplify as shown
in Eq. (5), and we note how they correct for the two problems we identified with ℓeff. The first
factor, $\prod_{t=1}^{|x^{(i)}|} L_{\text{eff}}(x^{(i)}_{<t})$, corrects for the greediness of ℓeff, penalizing particles where all possible
continuations xt ∈ AEOS score poorly in context. The second factor, Φexp(x(i)), incorporates the
expensive potentials that could not be used in ℓeff. These importance weights can be computed
efficiently: the first factor is already computed as a byproduct of sampling from ℓeff, and the second
factor is computed by running each of the expensive potentials once on each x(i).
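The importance sampler of Eq. (5) can be sketched in a few lines of Python. Everything below is an illustrative assumption rather than the implementation used in the experiments: a toy model, a hypothetical efficient potential `phi_eff` (no two consecutive "a" tokens), and a hypothetical expensive potential `phi_exp` (the sequence must end with "b") that is evaluated only once per complete particle.

```python
import random
from collections import defaultdict

EOS, MAX_LEN = "<eos>", 2


def next_token_probs(prefix):
    """Toy conditional p(. | prefix); an assumption for illustration only."""
    if len(prefix) >= MAX_LEN:
        return {EOS: 1.0}
    return {"a": 0.5, "b": 0.3, EOS: 0.2}


def phi_eff(seq):
    """Cheap, prefix-checkable constraint used inside the proposal."""
    return 0.0 if any(x == y == "a" for x, y in zip(seq, seq[1:])) else 1.0


def phi_exp(seq):
    """Stand-in for an expensive check, run once on each complete particle."""
    return 1.0 if len(seq) >= 2 and seq[-1] == EOS and seq[-2] == "b" else 0.0


def local_step(prefix):
    """ell_eff(. | prefix) and L_eff(prefix). Sampled prefixes always have
    phi_eff(prefix) = 1, so the division by phi_eff(prefix) is omitted."""
    scores = {tok: p * phi_eff(prefix + (tok,))
              for tok, p in next_token_probs(prefix).items()}
    L = sum(scores.values())
    return {tok: s / L for tok, s in scores.items()}, L


def sample_particle(rng):
    """Draw x ~ ell_eff and return (x, w) with w as in Eq. (5)."""
    seq, weight = (), 1.0
    while not seq or seq[-1] != EOS:
        dist, L = local_step(seq)
        tok = rng.choices(list(dist), weights=list(dist.values()))[0]
        seq, weight = seq + (tok,), weight * L   # accumulate prod_t L_eff
    return seq, weight * phi_exp(seq)            # apply the expensive potential


rng = random.Random(0)
particles = [sample_particle(rng) for _ in range(20_000)]
W = sum(w for _, w in particles)

g_hat = defaultdict(float)          # self-normalized posterior approximation
for x, w in particles:
    g_hat[x] += w / W
Z_hat = W / len(particles)          # the mean weight estimates Z
print(f"Z_hat = {Z_hat:.3f}, g_hat(a b EOS) = {g_hat[('a', 'b', EOS)]:.3f}")
```

For this toy target the exact values are Z = 0.3 and g(a b EOS) = 0.5, so the estimates can be checked directly; in realistic settings neither quantity is available in closed form.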
Sequential Monte Carlo. While importance sampling addresses several shortcomings of local
decoding, it too suffers from a major weakness: weight corrections and expensive potentials are not
integrated until after a complete sequence has been generated from the proposal. This is despite
the fact that critical information about whether a sequence can satisfy a constraint is often available
much earlier and can be used to avoid large amounts of unnecessary computation. Sequential Monte
Carlo (SMC; e.g., Chopin & Papaspiliopoulos, 2020) is a natural generalization of importance
sampling that constructs importance-weighted samples from a sequence of unnormalized target
distributions $\langle \tilde{g}_t \rangle_{t=0}^{\infty}$ to arrive at the final unnormalized target $\tilde{g}$. In our case, we consider intermediate
targets $\tilde{g}_t$ defined on the sequence of spaces $A^{(t)} \overset{\text{def}}{=} \{x \in A^{*} \mid |x| = t\} \cup \{x \in A^{*}\text{EOS} \mid |x| \le t\}$,
that is, partial sequences of length equal to t and complete sequences of length less than or equal to t.
The targets are defined as
\[ \tilde{g}_t(x) = p(x)\,\Phi(x), \quad \text{for } x \in A^{(t)} \tag{6} \]
Note that $\tilde{g}$ and $\tilde{g}_t$ are unnormalized distributions over different spaces: A∗ EOS and A(t) respectively.
Whereas $\tilde{g}$ only considers potentials on complete sequences, $\tilde{g}_t$ depends also on the behavior of the
potentials Φexp when applied to partial sequences. But as t → ∞, there is less and less mass on
partial sequences, and $\tilde{g}_t$ approaches $\tilde{g}$ no matter how the partial potentials are defined.
The particles for $\tilde{g}_t$ are drawn as prefixes from ℓeff (stopping at length t if EOS has not been reached),
again requiring an importance weighting correction. The importance weight at time t can be
re-expressed as the importance weight from time t − 1 times a correction factor for step t:
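\begin{align}
w_t^{(i)} &\overset{\text{def}}{=} \frac{\tilde{g}_t(x^{(i)}_{<t}\,x^{(i)}_t)}{\ell_{\text{eff}}(x^{(i)}_{<t}\,x^{(i)}_t)} \tag{7a} \\
&= \tilde{g}_0(x^{(i)}_{<1}) \prod_{t'=1}^{t} \frac{\tilde{g}_{t'}(x^{(i)}_{<t'}\,x^{(i)}_{t'})}{\tilde{g}_{t'-1}(x^{(i)}_{<t'})\,\ell_{\text{eff}}(x^{(i)}_{t'} \mid x^{(i)}_{<t'})} \tag{7b} \\
&= w_{t-1}^{(i)}\,\frac{\tilde{g}_t(x^{(i)}_{<t}\,x^{(i)}_t)}{\tilde{g}_{t-1}(x^{(i)}_{<t})\,\ell_{\text{eff}}(x^{(i)}_t \mid x^{(i)}_{<t})} \tag{7c} \\
&= w_{t-1}^{(i)} \cdot L_{\text{eff}}(x^{(i)}_{<t}) \cdot \frac{\Phi_{\text{exp}}(x^{(i)}_{<t}\,x^{(i)}_t)}{\Phi_{\text{exp}}(x^{(i)}_{<t})} \tag{7d}
\end{align}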
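The step from Eq. (7c) to Eq. (7d) follows by substituting the definitions in Eqs. (4) and (6). The derivation below is a sketch that suppresses the particle index (i) and assumes Φ = Φeff · Φexp with nonzero denominators:

```latex
\begin{align*}
\frac{\tilde{g}_t(x_{<t}\,x_t)}{\tilde{g}_{t-1}(x_{<t})}
  &= \frac{p(x_{<t}\,x_t)\,\Phi(x_{<t}\,x_t)}{p(x_{<t})\,\Phi(x_{<t})}
   = p(x_t \mid x_{<t})\,
     \frac{\Phi_{\text{eff}}(x_{<t}\,x_t)}{\Phi_{\text{eff}}(x_{<t})}\,
     \frac{\Phi_{\text{exp}}(x_{<t}\,x_t)}{\Phi_{\text{exp}}(x_{<t})}, \\
\ell_{\text{eff}}(x_t \mid x_{<t})
  &= \frac{1}{L_{\text{eff}}(x_{<t})}\; p(x_t \mid x_{<t})\,
     \frac{\Phi_{\text{eff}}(x_{<t}\,x_t)}{\Phi_{\text{eff}}(x_{<t})}, \\
\frac{\tilde{g}_t(x_{<t}\,x_t)}{\tilde{g}_{t-1}(x_{<t})\,\ell_{\text{eff}}(x_t \mid x_{<t})}
  &= L_{\text{eff}}(x_{<t}) \cdot
     \frac{\Phi_{\text{exp}}(x_{<t}\,x_t)}{\Phi_{\text{exp}}(x_{<t})}.
\end{align*}
```

The language-model term and the efficient-potential ratio cancel between numerator and denominator, which is why the incremental weight only requires the local normalizer and one evaluation of the expensive potentials on the extended sequence.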
⁶ However, the number of particles required to obtain a good approximation of the target distribution is exponential
in the KL divergence between target g and proposal ℓeff (Chatterjee & Diaconis, 2018).
The sequential Monte Carlo algorithm generates approximations to each intermediate target $\tilde{g}_t$,
using resampling steps to reallocate computation from less to more promising partial sequences.
We begin with a collection of N weighted particles (x(i), w(i)) = (ε, 1), where ε is the empty
sequence of tokens. Then, starting at t = 1, we repeat the following three steps until all particles are
EOS-terminated (i.e., x(i) ∈ A∗ EOS for all i):
1. Extend. For each incomplete particle x(i), sample $x^{(i)}_t \sim \ell_{\text{eff}}(\cdot \mid x^{(i)}_{<t})$ and update $x^{(i)} \leftarrow x^{(i)}\,x^{(i)}_t$.
2. Reweight. For each extended particle x(i), update the weight $w^{(i)} \leftarrow w^{(i)}\, L_{\text{eff}}(x^{(i)}_{<t})\, \frac{\Phi_{\text{exp}}(x^{(i)}_{<t}\,x^{(i)}_t)}{\Phi_{\text{exp}}(x^{(i)}_{<t})}$.
3. Resample. Sample ancestor indices $a^{(1)}, \ldots, a^{(N)} \overset{\text{i.i.d.}}{\sim} \text{Categorical}\left(\frac{w^{(1)}}{W}, \ldots, \frac{w^{(N)}}{W}\right)$ where $W = \sum_{i=1}^{N} w^{(i)}$. Then, reassign all particles $(x^{(i)}, w^{(i)}) \leftarrow (x^{(a^{(i)})}, \frac{W}{N})$ for all i simultaneously.⁷
The extension step extends each incomplete particle with a next token proposed by the local product
of experts ℓeff . The reweighting step computes the updated importance weight, incorporating a new
factor for the new token. The resampling step exploits any early signal available in the updated
weights at time t to abandon some less promising incomplete particles (which are unlikely to be
chosen as ancestors) and focus more future computation on more promising particles (which are likely
to be chosen as ancestors multiple times and then will be extended in multiple ways at time t + 1).
This reallocation of computation often leads to dramatic improvements in inference quality—without
it, SMC would reduce to the previous importance sampling method.
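The three steps above can be sketched as a short Python loop. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the toy model, the hypothetical potentials `phi_eff` and `phi_exp`, and the choice to resample with multinomial sampling at every step are all made for simplicity.

```python
import random
from collections import Counter

EOS, MAX_LEN = "<eos>", 2


def next_token_probs(prefix):
    """Toy conditional p(. | prefix); an assumption for illustration only."""
    if len(prefix) >= MAX_LEN:
        return {EOS: 1.0}
    return {"a": 0.5, "b": 0.3, EOS: 0.2}


def phi_eff(seq):
    """Cheap, prefix-checkable constraint: forbid two consecutive 'a' tokens."""
    return 0.0 if any(x == y == "a" for x, y in zip(seq, seq[1:])) else 1.0


def phi_exp(seq):
    """Stand-in for an expensive, incrementally checkable score:
    each 'a' token multiplies the score by 0.1 (never exactly zero)."""
    return 0.1 ** seq.count("a")


def local_step(prefix):
    """ell_eff(. | prefix) and L_eff(prefix) for prefixes with phi_eff = 1."""
    scores = {tok: p * phi_eff(prefix + (tok,))
              for tok, p in next_token_probs(prefix).items()}
    L = sum(scores.values())
    return {tok: s / L for tok, s in scores.items()}, L


def smc(num_particles, rng):
    particles = [((), 1.0)] * num_particles          # (x, w) = (ε, 1)
    while any(not x or x[-1] != EOS for x, _ in particles):
        updated = []
        for x, w in particles:
            if x and x[-1] == EOS:                   # complete: carried along
                updated.append((x, w))
                continue
            dist, L = local_step(x)                  # 1. Extend
            tok = rng.choices(list(dist), weights=list(dist.values()))[0]
            new_x = x + (tok,)
            w *= L * phi_exp(new_x) / phi_exp(x)     # 2. Reweight, Eq. (7d)
            updated.append((new_x, w))
        W = sum(w for _, w in updated)               # 3. Resample
        ancestors = rng.choices(updated, weights=[w for _, w in updated],
                                k=num_particles)
        particles = [(x, W / num_particles) for x, _ in ancestors]
    return particles


particles = smc(5_000, random.Random(0))
Z_hat = sum(w for _, w in particles) / len(particles)
counts = Counter(x for x, _ in particles)
posterior = {x: c / len(particles) for x, c in counts.items()}
print(f"Z_hat = {Z_hat:.3f}, posterior(EOS) = {posterior[(EOS,)]:.3f}")
```

For this toy target, brute-force enumeration gives Z = 0.39 and g(EOS) = 0.2/0.39 ≈ 0.513, which the particle estimates should approach as the number of particles grows. Resampling at every step is only one possible schedule; because the weights are reset to W/N, the average weight remains an estimate of the normalizing constant.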
Further extensions. We further extend our SMC implementation in two ways. First, potentials in
Φeff may still be modestly expensive to evaluate on the entire vocabulary. In these cases, we develop
cheap stochastic approximations to the local product of distributions and use these as proposals
during the Extend step. The incremental weight computation must also be corrected to account for
these approximations; we derive stochastic unbiased estimators of the incremental weights that can
be soundly used within SMC (see Appendix C). Second, the intermediate targets $\tilde{g}_t$ do not need to
advance token-by-token; in some domains, it is beneficial to consider more semantically meaningful
increments. For example, in one of our experiments, the intermediate target $\tilde{g}_t$ is defined over the
space of all partial Python programs containing t or fewer lines of code (rather than tokens); the
Extend step then samples a different number of tokens per particle, waiting in each partial sequence
until a new full line has been generated. Such strategies can lead to better particle alignment (Lundén
et al., 2018), making resampling more effective.
3 EXPERIMENTS
Table 1: Summary of tasks and potential functions. Examples are truncated for brevity. Full prompts include
additional information.
| Task | Φeff | Φexp | Example prompt | Example output |
|---|---|---|---|---|
| Goal Inference | STRIPS parser | Plan simulation | Write the STRIPS goal condition for the planning problem described below [...]. The STRIPS initial condition is: [...] | (:goal (and (arm-empty) (on-table b1) [...] |
| Python Data Science | - | Error-checking with test cases | Here is a sample dataframe: [...] I’d like to add inverses of each existing column to the dataframe [...] | result = df.join(df.apply(lambda x: 1/x) [...] |
et al., 2025) or filtering via a verifier (Olausson et al., 2023; Chen et al., 2024; Lightman et al.,
2024; Xin et al., 2024). In each domain, we formulate an additional potential Φexp that encodes
task-specific signals of sequence quality (see §3.1). This baseline generates grammar-constrained
sequences from the local product of experts, then reweights each sequence x by Φexp (x).
6. Language model with grammar constraint, weight correction, and expensive potential (Full IS).
This is the full importance sampling method described in §2, with Φ = Φeff ∪ Φexp . Unlike
in the previous method, the importance weights here include correction terms that mitigate the
greediness of local sampling, targeting the global product g. We include this method primarily as
an ablation of our next method (SMC), modified not to include incremental resampling.
7. Language model with grammar constraint, weight correction, expensive potential, and resampling
(Full SMC). This method includes all of the algorithmic contributions of our approach. It is the
full sequential Monte Carlo algorithm, with Φ = Φeff ∪ Φexp . It targets the same global posterior
g as the previous method but uses resampling to reallocate computation to promising particles.
We report results using N = 10 particles; see Appendix A.2 and Fig. 2 for downstream accuracy
results for a varying number of particles. We ran experiments on GCP instances with 1 A100 GPU
and 12 vCPUs (our CFG parser is implemented for CPU and is parallelized across particles), with the
exception of the Data Science domain, for which we used 4 H100 GPUs and 64 vCPUs.
3.1 DOMAINS
We study the performance of our proposed sampling methods on four challenging semantic parsing
domains, summarized in Table 1; see Appendix E for further details.
• Goal inference (Planetarium). Task: Formally specify an agent’s goal in the STRIPS subset
of the PDDL planning language, based on a natural-language description of the goal and PDDL
code detailing the agent’s initial conditions and plan for achieving it. Data: Blocksworld tasks
with up to 10 objects from the Planetarium benchmark (Zuo et al., 2024). Metric: Accuracy with
respect to ground-truth PDDL goal. Base LM: Llama 3.1 8B. Grammar: STRIPS syntax for goals
within Planetarium Blocksworld’s domain definition. Expensive potential: Run a simulation with a
ground-truth plan and check whether the resulting state conforms to the predicted (partial) goal.
• Python for data science (DS-1000). Task: Generate Python code that uses standard data science
libraries (NumPy, PyTorch, Pandas, etc.) to solve a task specified in natural language and via
(executable) test cases. Data: DS-1000 benchmark (Lai et al., 2023). Metric: Accuracy of the
generated program with respect to the provided test cases. Base LM: Llama 3 70B. Grammar:
We use a trivial potential Φeff (x) = 1, as we find that the unconstrained LM reliably generates
grammatical Python (that may nonetheless induce runtime errors). Expensive potential: Given
a partial program x, Φexp truncates x to the longest prefix of the sequence that consists of only
valid Python statements (discarding any incomplete material at the end), and executes the resulting
(partial) program on the provided test case, checking for runtime errors.
• Text-to-SQL (Spider). Task: Generate SQL queries from a natural language question and a
database schema. Data: Spider development split (Yu et al., 2018). Metric: Execution accuracy
(whether the generated SQL query, when run against a test database, produces the same results
as the ground-truth SQL query). Base LM: Llama 3.1 8B-Instruct. Grammar: SQL context-free
grammars released by Roy et al. (2024), which enforce valid SQL syntax. Expensive potential:
Check whether column names in the generated (partial) query actually belong to the queried tables,
modulo aliasing. (The grammar ensures only that the column names exist in some table.)
• Molecular synthesis (GDB-17). Task: Generate drug-like molecules in the SMILES format
(Weininger, 1988). Data: Few-shot prompts constructed by repeatedly choosing 20 random
examples from the GDB-17 dataset (Ruddigkeit et al., 2012). Metric: Quantitative Estimate of
Drug-likeness (QED; Bickerton et al., 2012), a standard molecular fitness function. Base LM:
Llama 3.1 8B. Grammar: SMILES syntax for molecules. Expensive potential: A SMILES prefix
validator implemented in the Python partialsmiles library (O’Boyle, 2024).
Table 2: Comparison of method performance across domains with bootstrapped 95% confidence intervals. For
brevity, grammar constraint and weight correction are abbreviated as grammar and correction, respectively.
| Method | Goal inference | Molecular synthesis | Data science | Text-to-SQL |
|---|---|---|---|---|
| Base LM | 0.063 (0.05, 0.08) | 0.132 (0.12, 0.15) | 0.213 (0.19, 0.24) | 0.531 (0.51, 0.55) |
| w/ grammar constraint (Locally constrained Decoding) | 0.086 (0.07, 0.11) | 0.189 (0.17, 0.21) | - | 0.559 (0.54, 0.58) |
| w/ grammar, weight correction (Grammar-only IS) | 0.083 (0.06, 0.11) | 0.228 (0.21, 0.25) | - | 0.597 (0.57, 0.62) |
| w/ grammar, potential (Sample-Rerank) | 0.289 (0.24, 0.34) | 0.392 (0.36, 0.42) | - | 0.581 (0.56, 0.60) |
| w/ grammar, correction, and resampling (Grammar-only SMC) | 0.401 (0.34, 0.46) | 0.205 (0.18, 0.23) | - | 0.596 (0.57, 0.62) |
| w/ grammar, potential, and correction (Full IS) | 0.257 (0.21, 0.31) | 0.404 (0.37, 0.44) | 0.346 (0.31, 0.39) | 0.618 (0.59, 0.64) |
| w/ grammar, potential, correction, and resampling (Full SMC) | 0.419 (0.37, 0.48) | 0.577 (0.56, 0.59) | 0.407 (0.36, 0.45) | 0.620 (0.60, 0.64) |
Figure 2: Left: Performance on the Data Science task (DS-1000) for different models and methods. Codex-002
performance as reported in Lai et al. (2023). Right: Performance across all tasks for Full IS and Full SMC
with 5, 10, and 50 particles. Error bars: bootstrapped 95% confidence intervals.
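To illustrate the kind of expensive potential described for the Data Science domain in §3.1 (truncate a partial program to its longest valid prefix, execute it, and check for runtime errors), here is a minimal Python sketch. It is not the authors' implementation: the function names `longest_valid_prefix` and `runtime_error_potential`, the `setup` argument, and the line-level truncation rule are all assumptions made for this example.

```python
import ast


def longest_valid_prefix(code: str) -> str:
    """Return the longest prefix of whole lines that parses as Python.

    Assumption: a final line without a trailing newline may be cut off
    mid-token, so it is discarded before searching for a parseable prefix.
    """
    lines = code.splitlines()
    if not code.endswith("\n"):
        lines = lines[:-1]
    for end in range(len(lines), 0, -1):
        candidate = "\n".join(lines[:end])
        try:
            ast.parse(candidate)
            return candidate
        except SyntaxError:
            continue
    return ""


def runtime_error_potential(partial_code: str, setup: str = "") -> float:
    """Binary potential: 0.0 if the valid prefix raises at runtime, else 1.0.

    `setup` is a hypothetical stand-in for test-case setup code. Calling
    exec on model output is unsafe outside of a sandbox.
    """
    prefix = longest_valid_prefix(partial_code)
    env = {}
    try:
        exec(setup, env)
        exec(prefix, env)
    except Exception:
        return 0.0
    return 1.0


ok = runtime_error_potential("x = [1, 2, 3]\ny = x[0] + ")
bad = runtime_error_potential("x = [1, 2, 3]\ny = x[5]\nz = ")
print(ok, bad)
```

In the first call the unfinished second line is discarded and the remaining statement runs cleanly; in the second, the completed statement `y = x[5]` raises an IndexError, so the particle could be down-weighted before generation finishes. Truncating at line boundaries is a simplification that ignores several statements placed on one line.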
We begin by investigating whether our approach leads to significant performance gains. Table 2 reports
posterior-weighted accuracy for our approach and ablations of its components: grammar constraints,
weight corrections, expensive potentials, and resampling. We first summarize the observed effects of
each component in our approach:
Grammar constraints. In line with previous literature (e.g., Shin et al., 2021; Scholak et al., 2021;
Poesia et al., 2022; Wang et al., 2024), we find that the addition of a grammar constraint via Φeff
improves downstream accuracy relative to the base LM across all domains in which it is used, even
without the use of weight corrections.
Expensive potentials. Furthermore, we observe that integrating expensive potentials Φexp improves
accuracy in every domain. Even without any weight corrections, the improvement in the goal inference,
data science, and molecular synthesis domains is large; in the text-to-SQL domain, it is smaller
but statistically significant (paired permutation test, p < 0.01). This suggests that making use of
information that cannot be efficiently encoded in logit masks can greatly improve performance.
Weight corrections. Although the use of Φeff and Φexp alone leads to significant gains in downstream
accuracy, these gains can be amplified with the addition of weight corrections. In cases without the
expensive potential, weight corrections provide significant albeit relatively small gains in accuracy
across three domains; in goal inference, they do not significantly affect performance. In the presence
of the expensive potential, adding weight corrections improves accuracy for text-to-SQL and has no
effect on goal inference and molecular synthesis. Overall, these results indicate that debiasing samples
from a local product of experts to correctly target the global product of experts often significantly
improves downstream accuracy and never harms it. That said, the accuracy gains attributable to
weight corrections are modest compared to other components of the algorithm, which suggests that the
bias from locally constrained decoding may be less severe in these semantic parsing domains than has
been observed in other domains (e.g., constrained generation of natural language, Lew et al., 2023).
Figure 3: Estimated KL between the algorithm and the global product of experts for a representative problem
instance in each domain. Values closer to 0 indicate that the algorithm is better at approximating g. Significant
differences are indicated with ** for p < 0.01 and *** for p < 0.001 (t-test). Algorithms use N = 10 particles.
Resampling. We observe that the addition of resampling steps improves downstream accuracy in
all domains except text-to-SQL, for which they neither significantly improve nor hurt performance.
These results motivate adaptively focusing computation on promising partial sequences.
Other Evaluations. Next, we study the effects of varying the base language model, the number of
particles used by different methods, and the computational cost of our approach: Tables 4, 6 and 8 in
the appendix report the results of these experiments. We summarize key findings:
• Our approach allows smaller LMs to outperform larger ones: In 3 out of 4 domains (Data Science,
Molecular Synthesis, Goal Inference), Full SMC allows small language models to outperform
models over 8 times larger (see Tables 2 and 4). These gains persist on larger models: Fig. 2 shows
how our method allows Llama 3.1 70B to outperform Codex-002, which has 175B parameters and
is fine-tuned for coding tasks.
• Our approach makes better use of resources than approaches that apply constraints only at
the end of generation: In 3 out of 4 domains (Data Science, Molecular Synthesis, Text-to-SQL),
Full SMC performs as well as or better than Full IS while using one-tenth of the particles (see
Fig. 2 and Table 6); in the remaining domain (Goal Inference), Full SMC outperforms IS with one
fifth (10 vs 50) or one half (5 vs 10) of the particles. This is in line with the arguments drawn in §2
for the poor scaling of importance sampling and the benefits of resampling.
• Our approach incurs minimal computational overhead: At every token, our SMC approach
incurs two computational overheads relative to a simple locally constrained decoding baseline:
resampling and computing expensive potentials. Though the cost of resampling is negligible,
computing expensive potentials presents a more significant cost that varies across domains: Table 8
shows that cost rarely rises above ∼30ms per token. In general, this cost is reduced by two factors:
(i) expensive potentials often need to run expensive computations only at larger, semantically
meaningful units (for instance, the end of a SQL clause or a Python statement) rather than at
every token, which significantly lessens the average cost per token; and (ii) expensive potentials
are often computed on CPU rather than GPU, and therefore cost fewer dollars per hour.
3.3 VALIDATION OF THE PROBABILISTIC PERSPECTIVE
The best-performing methods from the previous section were designed to approximate the global
product of experts distribution. In this section, we investigate how closely each of these methods
approximates this global distribution and whether the downstream performance results from the
previous section are driven by the quality of the probabilistic inference. In particular, we find:
Within each problem instance, the best-performing methods have outputs that are closer in
KL divergence to the global product of experts. We consider the distribution over sequences
$q^{r}_{\text{alg}}(x)$ defined by each algorithm (see Appendix D for details and derivations). For each $q^{r}_{\text{alg}}$, we
estimate a tractable correlate of the KL between the algorithm and the global product of experts:
$\log Z - \mathrm{KL}(q^{r}_{\text{alg}} \,\|\, g)$. We refer to this quantity as the approximation quality. Since the term
log Z is algorithm-independent, we can directly compare the estimated approximation quality across
algorithms to determine which ones have lower KL divergence relative to the global product of
experts. However, because log Z is instance-specific, these comparisons can only be made at the
instance level. Accordingly, for each domain, we select the instance with the median unique accuracy
Table 3: Pearson correlation between relative particle weights and accuracy scores for all weighted methods.
Greater correlation indicates that relative weights are more strongly associated with downstream performance.
| Method | Goal inference | Molecular synthesis | Data science | Text-to-SQL |
|---|---|---|---|---|
| LM with grammar constraints and weight correction (Grammar-Only IS) | 0.138 (0.10, 0.18) | 0.218 (0.16, 0.28) | 0.217 (0.18, 0.26) | 0.810 (0.79, 0.83) |
| LM with grammar constraints, potential, and weight correction (Full IS) | 0.677 (0.64, 0.71) | 0.570 (0.53, 0.61) | 0.289 (0.25, 0.33) | 0.796 (0.78, 0.81) |
| LM with grammar constraints, potential, weight correction, and resampling (Full SMC) | 0.793 (0.76, 0.82) | 0.826 (0.81, 0.84) | 0.370 (0.31, 0.42) | 0.810 (0.79, 0.83) |
Figure 4: Distributional properties of compounds generated by different methods. Middle: Distribution of
drug-likeness as measured by QED score (Bickerton et al., 2012). Right: Means for other properties of interest
such as diversity and de novo similarity (details in Appendix E.2).
ACKNOWLEDGMENTS
The authors would like to thank Manuel de Prada Corral, Brian DuSell, Joshua B. Tenenbaum, and
Tan Zhi Xuan for valuable discussions, suggestions, and coding support that improved this work. The
last author gratefully acknowledges the Canada CIFAR AI Chair program for support.
AUTHOR CONTRIBUTIONS
First Authors
• João Loula (jloula@mit.edu): research conception and development, writing, experiment
development, software development (prototype)
• Benjamin LeBrun (benjamin.lebrun@mail.mcgill.ca): research conception and development,
lead software engineer, experiment development, writing
• Li Du (leodu@cs.jhu.edu): experiment development, software development (parser)
Contributors
• Ben Lipkin (lipkinb@mit.edu): software development (grammar interfaces, testing, and integra-
tion), writing
• Clemente Pasti (clemente.pasti@inf.ethz.ch): software and algorithm development
(context-free grammars), writing
• Gabriel Grand (grandg@mit.edu): software development (testing and integration), analysis and
presentation of molecular synthesis experiments
• Tianyu Liu (tianyu.liu@inf.ethz.ch): software development (vLLM integration, testing)
• Yahya Emara (yemara@ethz.ch): software development (testing and integration), writing
• Marjorie Freedman (mrf@isi.edu): organization management, writing
• Jason Eisner (jason@cs.jhu.edu): technical advice, writing, project advising and mentorship
• Ryan Cotterell (ryan.cotterell@inf.ethz.ch): organization management, senior project
leadership, research conception and development, writing
Senior Authors
• Vikash Mansinghka (vkm@mit.edu): organization management, research conception and
development, project advising and mentorship
• Alexander K. Lew (alexander.lew@yale.edu): senior project leadership, research conception
and development, project narrative development, writing, software development (prototype)
• Tim Vieira (tim.f.vieira@gmail.com): senior project leadership, full-stack software contributor,
research conception and development, project narrative development, writing, software system
design and implementation
• Timothy J. O’Donnell (timothy.odonnell@mcgill.ca): overall team leadership and direction,
organization management, senior project leadership, research conception and development, project
advising and mentorship, project narrative development, writing
REFERENCES
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea
Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine
Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally
Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee,
Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka
Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander
Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy
Zeng. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint
arXiv:2204.01691, 2022. URL https://arxiv.org/abs/2204.01691. (Cited on p. 33)
Afra Amini, Tim Vieira, Elliott Ash, and Ryan Cotterell. Variational best-of-N alignment. In
Proceedings of International Conference on Learning Representations, 2025. URL https://
openreview.net/forum?id=W9FZEQj3vv. (Cited on p. 33)
Christophe Andrieu and Gareth O. Roberts. The pseudo-marginal approach for efficient Monte
Carlo computations. arXiv: Statistics Theory, 2009. URL https://api.semanticscholar.org/
CorpusID:15661729. (Cited on p. 28)
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones,
Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson,
Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson,
Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile
Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova
DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer
El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan,
Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas
Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional AI: Harmlessness from AI
feedback. arXiv preprint arXiv:2212.08073, 2022. URL https://arxiv.org/abs/2212.08073.
(Cited on p. 32, 33)
Martin Berglund, Willeke Martens, and Brink Van der Merwe. Constructing a BPE tokenization DFA.
In International Conference on Implementation and Application of Automata. Springer, 2024. URL
https://dl.acm.org/doi/abs/10.1007/978-3-031-71112-1_5. (Cited on p. 33)
G. Richard Bickerton, Gaia V. Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L. Hopkins.
Quantifying the chemical beauty of drugs. Nature Chemistry, 4(2), 2012. URL https://www.
nature.com/articles/nchem.1243. (Cited on p. 8, 10, 30)
Benjamin Börschinger and Mark Johnson. A particle filter algorithm for Bayesian word segmentation.
In Proceedings of the Australasian Language Technology Association Workshop, December 2011.
URL https://aclanthology.org/U11-1004/. (Cited on p. 2, 32)
Jan Buys and Phil Blunsom. A Bayesian model for generative transition-based dependency parsing.
In Proceedings of the International Conference on Dependency Linguistics, 2015. URL https:
//aclanthology.org/W15-2108/. (Cited on p. 2, 32)
Kris Cao and Laura Rimell. You should evaluate your language model on marginal likelihood over
tokenisations. In Proceedings of the Conference on Empirical Methods in Natural Language
Processing, 2021. URL https://aclanthology.org/2021.emnlp-main.161.pdf. (Cited on p.
33)
Sourav Chatterjee and Persi Diaconis. The sample size required in importance sampling. The Annals of
Applied Probability, 28(2), 2018. URL https://www.jstor.org/stable/pdf/26542331.pdf.
(Cited on p. 5)
Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei A. Zaharia, and
James Y. Zou. Are more LLM calls all you need? Towards the scaling properties of compound
AI systems. In Advances in Neural Information Processing Systems 38: Annual Conference
on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada,
December 10-15, 2024, 2024. URL http://papers.nips.cc/paper_files/paper/2024/hash/
51173cf34c5faac9796a47dc2fdd3a71-Abstract-Conference.html. (Cited on p. 2, 7)
Emily Cheng, Marco Baroni, and Carmen Amo Alonso. Linearly controlled language generation
with performative guarantees. NeurIPS Workshop on Foundation Model Interventions, 2024. URL
https://openreview.net/pdf?id=V2xBBD1Xtu. (Cited on p. 32, 33)
Nadezhda Chirkova, Germán Kruszewski, Jos Rozen, and Marc Dymetman. Should you marginalize
over possible tokenizations? In Proceedings of the Annual Meeting of the Association for
Computational Linguistics, 2023. URL https://aclanthology.org/2023.acl-short.1.pdf.
(Cited on p. 33)
Nicolas Chopin and Omiros Papaspiliopoulos. An Introduction to Sequential Monte Carlo, volume 4.
Springer, 2020. URL https://link.springer.com/book/10.1007/978-3-030-47845-2.
(Cited on p. 5, 24)
Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason
Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text
generation. arXiv preprint arXiv:1912.02164, 2019. URL https://arxiv.org/pdf/1912.02164.
(Cited on p. 33)
Daniel Deutsch, Shyam Upadhyay, and Dan Roth. A general-purpose algorithm for constrained
sequential inference. In Proceedings of the Conference on Computational Natural Language
Learning, 2019. URL https://aclanthology.org/K19-1045/. (Cited on p. 32)
Li Du, Afra Amini, Lucas Torroba Hennigen, Xinyan Velocity Yu, Holden Lee, Jason Eisner, and
Ryan Cotterell. Principled gradient-based MCMC for conditional sampling of text. In Proceedings
of the International Conference on Machine Learning, 2024. URL https://proceedings.mlr.
press/v235/du24a.html. (Cited on p. 33)
Gregory Dubbin and Phil Blunsom. Unsupervised part of speech inference with particle filters. In
Proceedings of the NAACL HLT Workshop on Induction of Linguistic Structure, Montréal, QC,
2012. URL https://aclanthology.org/W12-1907.pdf. (Cited on p. 2, 32)
Jay Earley. An Efficient Context-Free Parsing Algorithm. PhD thesis, Carnegie Mellon University,
1968. URL https://dl.acm.org/doi/pdf/10.1145/362007.362035. (Cited on p. 23)
Richard E Fikes and Nils J Nilsson. STRIPS: A new approach to the application of theorem proving
to problem solving. Artificial Intelligence, 2(3-4), 1971. URL https://www.sciencedirect.
com/science/article/abs/pii/0004370271900105. (Cited on p. 31)
Daniel Flam-Shepherd, Kevin Zhu, and Alán Aspuru-Guzik. Language models can learn complex
molecular distributions. Nature Communications, 13(1), 2022. URL https://www.nature.com/
articles/s41467-022-30839-x.pdf. (Cited on p. 30)
Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. Grammar-constrained decoding for
structured NLP tasks without finetuning. In Proceedings of the Conference on Empirical Methods
in Natural Language Processing, 2023. URL https://aclanthology.org/2023.emnlp-main.
674.pdf. (Cited on p. 32)
Malik Ghallab, Adele Howe, Craig Knoblock, Drew McDermott, Ashwin Ram, Manuela Veloso,
Daniel Weld, and David Wilkins. PDDL—the planning domain definition language. Technical
report, Yale Center for Computational Vision and Control, 1998. URL https://www.cs.cmu.
edu/~mmv/planning/readings/98aips-PDDL.pdf. (Cited on p. 31)
Joshua Goodman. Semiring parsing. Computational Linguistics, 25(4), 1999. URL https://
aclanthology.org/J99-4004/. (Cited on p. 23)
Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. Leveraging
pre-trained large language models to construct and utilize world models for
model-based task planning. In Advances in Neural Information Processing Systems,
2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/
f9f54762cbb4fe4dbffdd4f792c31221-Paper-Conference.pdf. (Cited on p. 31)
Lin Gui, Cristina Garbacea, and Victor Veitch. BoNBoN alignment for large language models and
the sweetness of best-of-n sampling. In Advances in Neural Information Processing Systems 38:
Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC,
Canada, December 10 - 15, 2024, 2024. URL http://papers.nips.cc/paper_files/paper/
2024/hash/056521a35eacd9d2127b66a7d3c499c5-Abstract-Conference.html. (Cited on p.
2, 6)
Malte Helmert. The fast downward planning system. Journal of Artificial Intelligence Research, 26,
2006. URL https://www.jair.org/index.php/jair/article/view/10457. (Cited on p. 31)
Brian Hie, Salvatore Candido, Zeming Lin, Ori Kabeli, Roshan Rao, Nikita Smetanin, Tom Sercu,
and Alexander Rives. A high-level programming language for generative protein design. bioRxiv,
2022. URL https://www.biorxiv.org/content/10.1101/2022.12.21.521526v1.full.pdf.
(Cited on p. 33)
John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages and
Computation. Addison-Wesley, 1979. ISBN 0-201-02988-X. (Cited on p. 23)
Daniel G. Horvitz and Donovan J. Thompson. A generalization of sampling without replacement
from a finite universe. Journal of the American Statistical Association, 47(260), 1952. URL
https://www.jstor.org/stable/pdf/2280784.pdf. (Cited on p. 25)
R. Howey, D. Long, and M. Fox. VAL: Automatic plan validation, continuous effects and mixed
initiative planning using PDDL. In IEEE International Conference on Tools with Artificial
Intelligence, 2004. URL https://ieeexplore.ieee.org/document/1374201. (Cited on p. 31)
Wenlong Huang, Fei Xia, Dhruv Shah, Danny Driess, Andy Zeng, Yao Lu, Pete Florence, Igor
Mordatch, Sergey Levine, Karol Hausman, and Brian Ichter. Grounded decoding: Guiding text
generation with grounded models for embodied agents. In Advances in Neural Information
Processing Systems, volume 36, 2024. URL https://proceedings.neurips.cc/paper_files/
paper/2023/file/bb3cfcb0284642a973dd631ec9184f2f-Paper-Conference.pdf. (Cited on
p. 32, 33)
Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto, and Eiji
Uchibe. Evaluation of best-of-n sampling strategies for language model alignment. Transactions on
Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?
id=H4S4ETc8c9. (Cited on p. 2, 6)
Terry Koo, Frederick Liu, and Luheng He. Automata-based constraints for language model
decoding. In Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=
BDBdblmyzY. (Cited on p. 32)
Tomasz Korbak, Ethan Perez, and Christopher Buckley. RL with KL penalties is better viewed as
Bayesian inference. In Findings of the Association for Computational Linguistics: EMNLP 2022,
2022. URL https://aclanthology.org/2022.findings-emnlp.77. (Cited on p. 33)
Kalpesh Krishna, Yapei Chang, John Wieting, and Mohit Iyyer. RankGen: Improving text generation
with large ranking models. In Proceedings of the 2022 Conference on Empirical Methods in
Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11,
2022. Association for Computational Linguistics, 2022. URL https://doi.org/10.18653/v1/
2022.emnlp-main.15. (Cited on p. 2, 6)
Michael Kuchnik, Virginia Smith, and George Amvrosiadis. Validating large language
models with RELM. Proceedings of Machine Learning and Systems, 5,
2023. URL https://proceedings.mlsys.org/paper_files/paper/2023/file/
93c7d9da61ccb2a60ac047e92787c3ef-Paper-mlsys2023.pdf. (Cited on p. 32, 33)
Sachin Kumar, Eric Malmi, Aliaksei Severyn, and Yulia Tsvetkov. Controlled text generation as
continuous optimization with multiple constraints. Advances in Neural Information Processing
Systems, 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/
79ec2a4246feb2126ecf43c4a4418002-Paper.pdf. (Cited on p. 33)
Sachin Kumar, Biswajit Paria, and Yulia Tsvetkov. Gradient-based constrained sampling from
language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural
Language Processing. Association for Computational Linguistics, 2022. URL https:
//aclanthology.org/2022.emnlp-main.144/. (Cited on p. 33)
Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih,
Daniel Fried, Sida Wang, and Tao Yu. DS-1000: A natural and reliable benchmark for data science
code generation. In International Conference on Machine Learning. Proceedings of Machine
Learning Research, 2023. URL https://proceedings.mlr.press/v202/lai23b/lai23b.pdf.
(Cited on p. 7, 8, 31)
Greg Landrum. RDKit: Open-source cheminformatics software. https://github.com/rdkit/
rdkit, 2024. (Cited on p. 30)
Dieterich Lawson, Allan Raventós, Andrew Warrington, and Scott Linderman. SIXO: Smoothing
inference with twisted objectives. Advances in Neural Information Processing Systems,
2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/
fddc79681b2df2734c01444f9bc2a17e-Paper-Conference.pdf. (Cited on p. 2)
Alexander K Lew, Marco Cusumano-Towner, and Vikash K Mansinghka. Recursive Monte Carlo and
variational inference with auxiliary variables. In Uncertainty in Artificial Intelligence. Proceedings
of Machine Learning Research, 2022. URL https://proceedings.mlr.press/v180/lew22a/
lew22a.pdf. (Cited on p. 24, 27, 28)
Alexander K Lew, Tan Zhi-Xuan, Gabriel Grand, and Vikash Mansinghka. Sequential Monte Carlo
steering of large language models using probabilistic programs. In ICML Workshop: Sampling and
Optimization in Discrete Space, 2023. URL https://openreview.net/forum?id=Ul2K0qXxXy.
(Cited on p. 1, 2, 4, 6, 8, 21, 32, 33)
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan
Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth
International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,
2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi. (Cited
on p. 2, 7)
Chu-Cheng Lin and Jason Eisner. Neural particle smoothing for sampling from conditional sequence
models. In Proceedings of the Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, 2018. URL https://aclanthology.
org/N18-1085/. (Cited on p. 2, 32)
Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone.
LLM+P: Empowering large language models with optimal planning proficiency. arXiv preprint
arXiv:2304.11477, 2023. URL https://arxiv.org/pdf/2304.11477. (Cited on p. 31)
Ximing Lu, Sean Welleck, Peter West, Liwei Jiang, Jungo Kasai, Daniel Khashabi, Ronan Le Bras,
Lianhui Qin, Youngjae Yu, Rowan Zellers, et al. NeuroLogic A*esque decoding: Constrained
text generation with lookahead heuristics. arXiv preprint arXiv:2112.08726, 2021. URL https:
//aclanthology.org/2022.naacl-main.57.pdf. (Cited on p. 33)
Daniel Lundén, David Broman, Fredrik Ronquist, and Lawrence M Murray. Automatic alignment
of sequential Monte Carlo inference in higher-order probabilistic programs. arXiv preprint
arXiv:1812.07439, 2018. URL https://arxiv.org/pdf/1812.07439. (Cited on p. 6)
Vikash K Mansinghka, Ulrich Schaechtle, Shivam Handa, Alexey Radul, Yutian Chen, and Martin
Rinard. Probabilistic programming with programmable inference. In Proceedings of the 39th
ACM SIGPLAN Conference on Programming Language Design and Implementation, 2018. URL
https://dl.acm.org/doi/abs/10.1145/3192366.3192409. (Cited on p. 2)
Clara Meister, Tim Vieira, and Ryan Cotterell. If beam search is the answer, what was the question?
arXiv preprint arXiv:2010.02650, 2020. URL https://aclanthology.org/2020.emnlp-main.
170/. (Cited on p. 33)
Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. CGMH: Constrained sentence generation by
Metropolis-Hastings sampling. In Proceedings of the AAAI Conference on Artificial Intelligence,
number 01, 2019. URL https://ojs.aaai.org/index.php/AAAI/article/view/4659. (Cited
on p. 33)
Ning Miao, Yuxuan Song, Hao Zhou, and Lei Li. Do you have the right scissors? Tailoring
pre-trained language models via Monte-Carlo methods. In Proceedings of the Annual Meeting of
the Association for Computational Linguistics, 2020. URL https://aclanthology.org/2020.
acl-main.314/. (Cited on p. 32)
Michal Moskal, Madan Musuvathi, and Emre Kıcıman. AI Controller Interface. https://github.
com/microsoft/aici/, 2024. (Cited on p. 2, 32)
Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng
Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, Jilin Chen, Alex Beutel, and Ahmad
Beirami. Controlled decoding from language models. In Forty-first International Conference on
Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL
https://openreview.net/forum?id=bVIcZb7Qa0. (Cited on p. 2, 6)
Christian A. Naesseth, Fredrik Lindsten, and Thomas B. Schön. Elements of sequential Monte
Carlo. Found. Trends Mach. Learn., 12(3), 2019. doi: 10.1561/2200000074. URL https:
//doi.org/10.1561/2200000074. (Cited on p. 2)
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher
Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou,
Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. WebGPT:
Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332,
2021. URL https://arxiv.org/pdf/2112.09332. (Cited on p. 2, 6)
Franz Nowak and Ryan Cotterell. A fast algorithm for computing prefix probabilities. In Proceedings
of the Annual Meeting of the Association for Computational Linguistics, 2023. URL https:
//aclanthology.org/2023.acl-short.6/. (Cited on p. 23)
Noel O’Boyle. partialsmiles: A validating SMILES parser, with support for incomplete SMILES,
2024. URL https://github.com/baoilleach/partialsmiles. (Cited on p. 8, 30)
Theo Olausson, Alex Gu, Ben Lipkin, Cedegao Zhang, Armando Solar-Lezama, Joshua Tenenbaum,
and Roger Levy. LINC: A neurosymbolic approach for logical reasoning by combining language
models with first-order logic provers. In Proceedings of the Conference on Empirical Methods
in Natural Language Processing, 2023. URL https://aclanthology.org/2023.emnlp-main.
313/. (Cited on p. 2, 7)
João CA Oliveira, Johanna Frey, Shuo-Qing Zhang, Li-Cheng Xu, Xin Li, Shu-Wen Li, Xin Hong,
and Lutz Ackermann. When machine learning meets molecular synthesis. Trends in Chemistry,
4(10), 2022. URL https://www.cell.com/trends/chemistry/abstract/S2589-5974(22)
00175-7. (Cited on p. 30)
Andreas Opedal, Ran Zmigrod, Tim Vieira, Ryan Cotterell, and Jason Eisner. Efficient
semiring-weighted Earley parsing. In Proceedings of the Annual Meeting of the Association for
Computational Linguistics, 2023. URL https://aclanthology.org/2023.acl-long.204/.
(Cited on p. 23)
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin,
Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob
Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder,
Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow
instructions with human feedback. In Advances in Neural Information Processing Systems,
2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/
b1efde53be364a73914f58805a001731-Paper-Conference.pdf. (Cited on p. 32, 33)
Kanghee Park, Jiayu Wang, Taylor Berg-Kirkpatrick, Nadia Polikarpova, and Loris
D’Antoni. Grammar-aligned decoding. In Advances in Neural Information Processing
Systems, 2025. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/
2bdc2267c3d7d01523e2e17ac0a754f3-Paper-Conference.pdf. (Cited on p. 2, 4, 6, 32, 34)
Damian Pascual, Beni Egressy, Clara Meister, Ryan Cotterell, and Roger Wattenhofer. A
plug-and-play method for controlled text generation. In Findings of the Association for Computational
Linguistics: EMNLP 2021. Association for Computational Linguistics, November 2021. URL
https://aclanthology.org/2021.findings-emnlp.334/. (Cited on p. 32, 33)
Gabriel Poesia, Alex Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and
Sumit Gulwani. Synchromesh: Reliable code generation from pre-trained language models. In
International Conference on Learning Representations, 2022. URL https://openreview.net/
forum?id=KmtVD97J43e. (Cited on p. 2, 4, 8, 32, 33)
Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, and Akash Srivastava. A probabilistic
inference approach to inference-time scaling of LLMs using particle-based Monte Carlo methods.
arXiv preprint arXiv:2502.01618, 2025. URL https://arxiv.org/pdf/2502.01618. (Cited on
p. 32)
Lianhui Qin, Sean Welleck, Daniel Khashabi, and Yejin Choi. COLD decoding: Energy-based
constrained text generation with Langevin dynamics. Advances in Neural Information Processing
Systems, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/
3e25d1aff47964c8409fd5c8dc0438d7-Paper-Conference.pdf. (Cited on p. 33)
Ronald Rosenfeld, Stanley Chen, and Xiaojin Zhu. Whole-sentence exponential language models: A
vehicle for linguistic-statistical integration. Computer Speech & Language, 15, January 2001. URL
https://www.sciencedirect.com/science/article/abs/pii/S0885230800901591. (Cited
on p. 2)
Subhro Roy, Samuel Thomson, Tongfei Chen, Richard Shin, Adam Pauls, Jason Eisner, and
Benjamin Van Durme. BenchCLAMP: A benchmark for evaluating language models on
syntactic and semantic parsing. In Advances in Neural Information Processing Systems,
volume 36, 2024. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/
9c1535a02f0ce079433344e14d910597-Paper-Datasets_and_Benchmarks.pdf. (Cited on p.
8, 30)
Lars Ruddigkeit, Ruud Van Deursen, Lorenz C Blum, and Jean-Louis Reymond. Enumeration
of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of
Chemical Information and Modeling, 52(11), 2012. URL https://pubs.acs.org/doi/pdf/10.
1021/ci300415d. (Cited on p. 8, 30)
Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. PICARD: Parsing incrementally for
constrained auto-regressive decoding from language models. In Proceedings of the Conference on
Empirical Methods in Natural Language Processing, 2021. URL https://aclanthology.org/
2022.emnlp-main.39/. (Cited on p. 2, 4, 8, 32)
Freda Shi, Daniel Fried, Marjan Ghazvininejad, Luke Zettlemoyer, and Sida I Wang. Natural language
to code translation with execution. In Proceedings of the Conference on Empirical Methods in
Natural Language Processing, 2022. URL https://aclanthology.org/2022.emnlp-main.
231/. (Cited on p. 10)
Richard Shin and Benjamin Van Durme. Few-shot semantic parsing with language models trained
on code. In Proceedings of the Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, 2022. URL https://aclanthology.
org/2022.naacl-main.396/. (Cited on p. 32)
Richard Shin, Christopher Lin, Sam Thomson, Charles Chen Jr, Subhro Roy, Emmanouil Antonios
Platanios, Adam Pauls, Dan Klein, Jason Eisner, and Benjamin Van Durme. Constrained language
models yield few-shot semantic parsers. In Proceedings of the Conference on Empirical Methods
in Natural Language Processing, 2021. URL https://aclanthology.org/2021.emnlp-main.
608/. (Cited on p. 2, 4, 8, 32)
Tom Silver, Varun Hariprasad, Reece S Shuttleworth, Nishanth Kumar, Tomás Lozano-Pérez, and
Leslie Pack Kaelbling. PDDL planning with pretrained large language models. In Foundation
Models for Decision Making Workshop, 2022. URL https://openreview.net/forum?id=
1QMMUB4zfl. (Cited on p. 31)
Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss,
Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize
from human feedback. In Advances in Neural Information Processing Systems,
volume 33, 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/
1f89885d556929e98d3ef9b86448f951-Paper.pdf. (Cited on p. 32, 33)
Andreas Stolcke. An efficient probabilistic context-free parsing algorithm that computes prefix
probabilities. Computational Linguistics, 21(2), 1995. URL https://aclanthology.org/
J95-2002.pdf. (Cited on p. 23)
Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie,
Stefano Ermon, Chelsea Finn, and Aviral Kumar. Preference fine-tuning of LLMs should leverage
suboptimal, on-policy data. In International Conference on Machine Learning, 2024. URL
https://proceedings.mlr.press/v235/tajwar24a.html. (Cited on p. 33)
Zhi Rui Tam, Cheng-Kuang Wu, Yi-Lin Tsai, Chieh-Yen Lin, Hung-yi Lee, and Yun-Nung Chen. Let
me speak freely? A study on the impact of format restrictions on performance of large language
models. arXiv preprint arXiv:2408.02442, 2024. URL https://arxiv.org/pdf/2408.02442.
(Cited on p. 33)
Shubham Ugare, Tarun Suresh, Hangoo Kang, Sasa Misailovic, and Gagandeep Singh. SynCode:
Improving LLM code generation with grammar augmentation. arXiv preprint arXiv:2403.01632,
2024. URL https://arxiv.org/pdf/2403.01632. (Cited on p. 2, 4, 32)
Tim Vieira, Ben LeBrun, Mario Giulianelli, Juan Luis Gastaldi, Brian DuSell, John Terilla, Timothy J
O’Donnell, and Ryan Cotterell. From language models over tokens to language models over
characters. arXiv preprint arXiv:2412.03719, 2024. URL https://arxiv.org/pdf/2412.03719.
(Cited on p. 33)
Bailin Wang, Zi Wang, Xuezhi Wang, Yuan Cao, Rif A Saurous, and Yoon Kim. Grammar prompting
for domain-specific language generation with large language models. In Advances in Neural
Information Processing Systems, 2024. URL https://openreview.net/forum?id=B4tkwuzeiY&
noteId=BaPOkLl42Y. (Cited on p. 8, 30, 32)
David Weininger. SMILES, a chemical language and information system. Journal of Chemical
Information and Computer Sciences, 28(1), 1988. URL https://pubs.acs.org/doi/pdf/10.
1021/ci00057a005. (Cited on p. 8, 30)
Brandon T Willard and Rémi Louf. Efficient guided generation for large language models. arXiv
preprint arXiv:2307.09702, 2023. URL https://arxiv.org/pdf/2307.09702. (Cited on p. 2,
4, 32)
Lionel Wong, Jiayuan Mao, Pratyusha Sharma, Zachary S Siegel, Jiahai Feng, Noa Korneev, Joshua B
Tenenbaum, and Jacob Andreas. Learning adaptive planning representations with natural language
guidance. arXiv preprint arXiv:2312.08566, 2023. URL https://arxiv.org/pdf/2312.08566.
(Cited on p. 31)
Yaqi Xie, Chen Yu, Tongyao Zhu, Jinbin Bai, Ze Gong, and Harold Soh. Translating natural language
to planning goals with large-language models. arXiv preprint arXiv:2302.05128, 2023. URL
https://arxiv.org/pdf/2302.05128. (Cited on p. 31)
Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and
Xiaodan Liang. DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic
data. arXiv preprint arXiv:2405.14333, 2024. URL https://arxiv.org/pdf/2405.14333.
(Cited on p. 2, 7)
Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong
Zhang. Iterative preference learning from human feedback: Bridging theory and practice for
RLHF under KL-constraint. In International Conference on Machine Learning, 2024. URL
https://proceedings.mlr.press/v235/xiong24a.html. (Cited on p. 33)
Kevin Yang and Dan Klein. FUDGE: Controlled text generation with future discriminators. In
Proceedings of the Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, 2021. URL https://aclanthology.org/2021.
naacl-main.276/. (Cited on p. 32)
Yi Yang and Jacob Eisenstein. A log-linear model for unsupervised text normalization. In Proceedings
of the Conference on Empirical Methods in Natural Language Processing, 2013. URL https:
//aclanthology.org/D13-1007/. (Cited on p. 32)
Lance Ying, Katherine M Collins, Megan Wei, Cedegao E Zhang, Tan Zhi-Xuan, Adrian Weller,
Joshua B Tenenbaum, and Lionel Wong. The neuro-symbolic inverse planning engine (NIPE):
Modeling probabilistic social inferences from linguistic inputs. In Workshop on Theory of Mind in
Communicating Agents, 2023. URL https://openreview.net/forum?id=UNy5AZkBjy. (Cited
on p. 31)
Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene
Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. Spider: A large-scale
human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In
Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018. URL
https://aclanthology.org/D18-1425. (Cited on p. 7)
Honghua Zhang, Meihua Dang, Nanyun Peng, and Guy Van den Broeck. Tractable control for
autoregressive language generation. In International Conference on Machine Learning. Proceedings of
Machine Learning Research, 2023a. URL https://proceedings.mlr.press/v202/zhang23g/
zhang23g.pdf. (Cited on p. 32, 33)
Maosen Zhang, Nan Jiang, Lei Li, and Yexiang Xue. Language generation via combinatorial
constraint satisfaction: A tree search enhanced Monte-Carlo approach. In Findings of the Association
for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, 2020.
URL https://aclanthology.org/2020.findings-emnlp.115/. (Cited on p. 33)
Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B Tenenbaum, and Chuang Gan.
Planning with large language models for code generation. In The Eleventh International Conference
on Learning Representations, 2023b. URL https://openreview.net/forum?id=Lr8cOOtYbfL.
(Cited on p. 33)
Tianyi Zhang, Li Zhang, Zhaoyi Hou, Ziyu Wang, Yuling Gu, Peter Clark, Chris Callison-Burch, and
Niket Tandon. PROC2PDDL: Open-domain planning representations from texts. In Proceedings
of the Workshop on Natural Language Reasoning and Structured Explanations, 2024. URL
https://aclanthology.org/2024.nlrse-1.2/. (Cited on p. 31)
Stephen Zhao, Rob Brekelmans, Alireza Makhzani, and Roger Baker Grosse. Probabilistic inference
in language models via twisted sequential Monte Carlo. In Proceedings of the International
Conference on Machine Learning, 2024. URL https://proceedings.mlr.press/v235/zhao24c.
html. (Cited on p. 2, 4, 27, 28, 29, 32, 34)
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao,
Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang:
Efficient execution of structured language model programs. In Advances in Neural Information
Processing Systems, 2024. URL https://openreview.net/forum?id=VqkAKQibpq. (Cited on
p. 32)
Rui Zheng, Shihan Dou, Songyang Gao, Yuan Hua, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin,
Qin Liu, Yuhao Zhou, Limao Xiong, Lu Chen, Zhiheng Xi, Nuo Xu, Wenbin Lai, Minghao Zhu,
Cheng Chang, Zhangyue Yin, Rongxiang Weng, Wensen Cheng, Haoran Huang, Tianxiang Sun,
Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, and Xuanjing Huang. Secrets of RLHF in large
language models part I: PPO. arXiv preprint arXiv:2307.04964, 2023.
URL https://arxiv.org/pdf/2307.04964. (Cited on p. 33)
Tan Zhi-Xuan, Lance Ying, Vikash Mansinghka, and Joshua B Tenenbaum. Pragmatic instruction
following and goal assistance via cooperative language-guided inverse planning. In Proceedings
of the International Conference on Autonomous Agents and Multiagent Systems, 2024. URL
https://dl.acm.org/doi/abs/10.5555/3635637.3663074. (Cited on p. 31)
Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan,
and Jimmy Ba. Large language models are human-level prompt engineers. In The Eleventh
International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5,
2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=92gvk82DE-. (Cited
on p. 2, 6)
Banghua Zhu, Michael Jordan, and Jiantao Jiao. Principled reinforcement learning with human
feedback from pairwise or K-wise comparisons. In International Conference on Machine Learning.
Proceedings of Machine Learning Research, 2023.
URL https://proceedings.mlr.press/v202/zhu23f/zhu23f.pdf. (Cited on p. 33)
Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul
Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv
preprint arXiv:1909.08593, 2019. URL https://arxiv.org/pdf/1909.08593. (Cited on p. 32,
33)
Max Zuo, Francisco Piedrahita Velez, Xiaochen Li, Michael L Littman, and Stephen H Bach.
Planetarium: A rigorous benchmark for translating text to structured planning languages. arXiv
preprint arXiv:2407.03321, 2024. URL https://arxiv.org/pdf/2407.03321. (Cited on p. 7,
31)
Table 4: Downstream accuracy of different methods with a smaller base language model (Llama 3.1 8B in Data
science and Llama 3.2 1B in all other domains). Errors are bootstrapped 95% confidence intervals. Instruct
model is used for Text-to-SQL.
Score, by domain:
Method | Goal inference | Molecular synthesis | Data science | Text-to-SQL
LM | 0.012 (0.01, 0.02) | 0.032 (0.02, 0.04) | 0.114 (0.09, 0.14) | 0.224 (0.207, 0.241)
w/ grammar constraint (Locally constrained Decoding) | 0.046 (0.03, 0.06) | 0.031 (0.02, 0.04) | - | 0.250 (0.232, 0.270)
w/ grammar, weight correction (Grammar-only IS) | 0.037 (0.02, 0.06) | 0.041 (0.03, 0.05) | - | 0.301 (0.281, 0.323)
w/ grammar, potential (Sample-Rerank) | 0.087 (0.06, 0.12) | 0.119 (0.09, 0.16) | - | 0.299 (0.278, 0.321)
w/ grammar, correction, and resampling (Grammar-only SMC) | 0.052 (0.03, 0.08) | 0.050 (0.04, 0.06) | - | 0.302 (0.281, 0.324)
w/ grammar, potential, and correction (IS) | 0.079 (0.05, 0.11) | 0.122 (0.09, 0.16) | 0.225 (0.19, 0.26) | 0.348 (0.326, 0.372)
w/ grammar, potential, correction, and resampling (SMC) | 0.125 (0.09, 0.16) | 0.517 (0.48, 0.55) | 0.285 (0.24, 0.34) | 0.348 (0.325, 0.374)
Table 5: Downstream accuracy of different methods with the larger base language models (relative to Table 4)
that were used in the main experiments (Llama 3.1 70B in Data science and Llama 3.1 8B in all other domains).
Errors are bootstrapped 95% confidence intervals. The instruct model is used for Text-to-SQL. This table is
identical to Table 2 in the main text and is repeated in the appendix for easier comparison.
Score, by domain:
Method | Goal inference | Molecular synthesis | Data science | Text-to-SQL
LM | 0.063 (0.05, 0.08) | 0.132 (0.12, 0.15) | 0.213 (0.19, 0.24) | 0.531 (0.51, 0.55)
w/ grammar constraint (Locally constrained Decoding) | 0.086 (0.07, 0.11) | 0.189 (0.17, 0.21) | - | 0.559 (0.54, 0.58)
w/ grammar, weight correction (Grammar-only IS) | 0.083 (0.06, 0.11) | 0.228 (0.21, 0.25) | - | 0.597 (0.57, 0.62)
w/ grammar, potential (Sample-Rerank) | 0.289 (0.24, 0.34) | 0.392 (0.36, 0.42) | - | 0.581 (0.56, 0.60)
w/ grammar, correction, and resampling (Grammar-only SMC) | 0.401 (0.34, 0.46) | 0.205 (0.18, 0.23) | - | 0.596 (0.57, 0.62)
w/ grammar, potential, and correction (Full IS) | 0.257 (0.21, 0.31) | 0.404 (0.37, 0.44) | 0.346 (0.31, 0.39) | 0.618 (0.59, 0.64)
w/ grammar, potential, correction, and resampling (Full SMC) | 0.419 (0.37, 0.48) | 0.577 (0.56, 0.59) | 0.407 (0.36, 0.45) | 0.620 (0.60, 0.64)
A ADDITIONAL EXPERIMENTS
This section evaluates downstream accuracy across methods using smaller base language models
(relative to Table 2 in the main text, reproduced in Appendix Table 5 for easier comparison). For
the Text-to-SQL, Molecular Synthesis, and Goal Inference domains, which in the §3.2 experiments
used Llama 3.1 (8B), we substitute Llama 3.2 (1B). In the Data Science domain, which used Llama 3
(70B) in the §3.2 experiments, we substitute Llama 3.1 (8B). All experiments were run with N = 10
particles, and the instruct version of Llama 3.2 (1B) was used in the text-to-SQL domain to remain
consistent with the model variants used in the main paper.
We report posterior-weighted accuracy using the smaller LMs across all methods and domains in
Table 4. Although accuracy is significantly lower compared to the larger LMs, we find that weight
corrections, expensive potentials, and resampling steps still improve model performance. We also
find that, in general, the relative gains in accuracy provided by our method are more pronounced for
smaller language models. For easier comparison, Table 5 presents an identical version of Table 2,
showing the results for the larger base LMs which were reported in §3.2. With the exception of
Text-to-SQL, we observe that our approach with the smaller LM outperforms the locally constrained
decoding baseline (LM w/ grammar constraint) using the larger LM. In the Data Science domain, our
Full SMC approach with the smaller LM outperforms the larger base LM. These results suggest that
our approach can dramatically improve the performance of smaller LMs.
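As a rough illustration of the metric reported in Table 4, the following sketch computes a posterior-weighted accuracy from a set of weighted particles. It assumes that the metric is the sum of normalized particle weights over the particles judged correct; the function name and the handling of the all-zero-weight case are hypothetical choices made for this example.

```python
import math

def posterior_weighted_accuracy(log_weights, correct):
    """Sum of normalized particle weights over the particles marked correct."""
    finite = [lw for lw in log_weights if lw != -math.inf]
    if not finite:
        # Assumption: if every particle has zero weight, score the instance as 0.
        return 0.0
    m = max(finite)
    # Subtract the max for numerical stability; exp(-inf) evaluates to 0.0.
    unnorm = [math.exp(lw - m) for lw in log_weights]
    total = sum(unnorm)
    return sum(u / total * float(c) for u, c in zip(unnorm, correct))
```

With equal weights this reduces to the plain fraction of correct particles.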
This section investigates how performance improvements vary with the number of particles. Table 6
reports downstream accuracy for N = 5, N = 10, and N = 50 particles using the Llama 3.1 (8B)
models. Note that we only include methods in which samples are generated from an approximation
that is constructed from a set of importance-weighted particles. For the base LM and locally
constrained decoding baselines, samples are generated through direct ancestral sampling. As a result,
the number of particles does not influence accuracy in these cases (though additional particles can
provide a better estimate of the true model accuracy), so we omit these methods from the analysis.
The main effect we observe is the more efficient use of computational resources by Full SMC
compared to methods that do not incorporate incremental information, such as Full IS: the former
outperforms the latter with one tenth of the particles in 3 out of 4 domains (Data Science, Molecular
Synthesis, Text-to-SQL) and one fifth of the particles in the other domain (Goal Inference, see Fig. 2
in the main text for a visualization). We note an additional pattern of results: in the Text-to-SQL and
Table 6: Accuracy by number of particles across methods. Errors are bootstrapped 95% confidence intervals.
Llama 3.1 8B is used as the base LM for all domains. Instruct model is used for Text-to-SQL.
Score, by domain:
Method | Goal inference | Molecular synthesis | Data science | Text-to-SQL
5 Particles
LM w/ grammar constraint, correction (Grammar-only IS) | 0.106 (0.08, 0.14) | 0.239 (0.21, 0.27) | - | 0.587 (0.56, 0.61)
LM w/ grammar constraint, potential (Sample-Rerank) | 0.214 (0.17, 0.26) | 0.407 (0.36, 0.45) | - | 0.578 (0.55, 0.60)
LM w/ grammar constraint, correction, and resampling (Grammar-only SMC) | 0.310 (0.26, 0.37) | 0.209 (0.18, 0.24) | - | 0.599 (0.57, 0.62)
LM w/ grammar constraint, potential, and correction (Full IS) | 0.216 (0.17, 0.27) | 0.411 (0.37, 0.45) | 0.204 (0.16, 0.25) | 0.611 (0.59, 0.63)
LM w/ grammar constraint, potential, correction, and resampling (Full SMC) | 0.319 (0.27, 0.37) | 0.552 (0.52, 0.58) | 0.224 (0.18, 0.27) | 0.620 (0.59, 0.64)
10 Particles
LM w/ grammar constraint, weight correction (Grammar-only IS) | 0.083 (0.06, 0.11) | 0.228 (0.21, 0.25) | - | 0.597 (0.57, 0.62)
LM w/ grammar constraint, potential (Sample-Rerank) | 0.289 (0.24, 0.34) | 0.392 (0.36, 0.42) | - | 0.581 (0.56, 0.60)
LM w/ grammar constraint, correction, and resampling (Grammar-only SMC) | 0.401 (0.34, 0.46) | 0.205 (0.18, 0.23) | - | 0.596 (0.57, 0.62)
LM w/ grammar constraint, potential, and correction (Full IS) | 0.257 (0.21, 0.31) | 0.404 (0.37, 0.44) | 0.223 (0.19, 0.27) | 0.618 (0.59, 0.64)
LM w/ grammar constraint, potential, correction, and resampling (Full SMC) | 0.419 (0.37, 0.48) | 0.577 (0.56, 0.59) | 0.285 (0.26, 0.32) | 0.620 (0.60, 0.64)
50 Particles
LM w/ grammar constraint, correction (Grammar-only IS) | 0.069 (0.05, 0.09) | 0.211 (0.20, 0.22) | - | 0.603 (0.58, 0.63)
LM w/ grammar constraint, potential (Sample-Rerank) | 0.416 (0.36, 0.47) | 0.382 (0.37, 0.40) | - | 0.585 (0.56, 0.61)
LM w/ grammar constraint, correction, and resampling (Grammar-only SMC) | 0.595 (0.54, 0.65) | 0.212 (0.20, 0.23) | - | 0.599 (0.58, 0.62)
LM w/ grammar constraint, potential, and correction (Full IS) | 0.393 (0.35, 0.45) | 0.389 (0.38, 0.40) | 0.218 (0.19, 0.25) | 0.626 (0.60, 0.66)
LM w/ grammar constraint, potential, correction, and resampling (Full SMC) | 0.611 (0.56, 0.66) | 0.569 (0.56, 0.58) | 0.292 (0.25, 0.33) | 0.622 (0.60, 0.65)
Table 7: Downstream accuracy comparison with the SMC Steering method from Lew et al. (2023) in the text-to-
SQL domain. Errors are bootstrapped 95% confidence intervals. Both methods include expensive potentials.
Our method is run with 10 particles. SMC Steering is run with 5 particles and a beam size of 3. Both methods
are run with Llama 3.1 8B Instruct.
Method | Score
Full SMC | 0.620 (0.60, 0.64)
SMC Steering (Lew et al., 2023) | 0.607 (0.58, 0.63)
Molecular synthesis domains, increasing the number of particles has a marginal impact on downstream
accuracy; however, in Goal inference and Data Science, we observe that a greater number of particles
can lead to significantly better downstream accuracy (though only when increasing from 5 to 10
particles in the Data Science domain). Given that Goal Inference and Data Science are the two tasks
where our expensive potentials are most informative, this pattern of results seems to be reflective of
the fact that richer potentials require more computation to fully exploit.
This section evaluates our approach using the without-replacement resampling method introduced
in Lew et al. (2023). Specifically, we use our Full SMC algorithm with expensive potential (LM
w/ grammar constraint, potential, correction, and resampling), and replace the multinomial resampling
steps with the without-replacement scheme of Lew et al. (2023). For comparison, we ran the
without-replacement baseline (SMC Steering) with N = 5 particles and a beam size of 3, alongside our
approach using multinomial resampling with N = 10 particles (and an ESS threshold of 0.9). These
settings effectively give the SMC Steering method a particle count of N = 15, giving it an advantage
in the comparison.
Table 7 reports weighted accuracy for these methods in the text-to-SQL domain (we restricted this
analysis to a single domain because of limitations in computational resources). We observe that
without-replacement resampling steps slightly hurt performance compared to multinomial resampling.
Though we have shown that practitioners can improve over locally constrained decoding by using
our proposed SMC method, in practice, there is additional computational cost stemming from two
sources: resampling and computing expensive potentials Φexp . The cost of resampling is negligible,
consisting only of simple sum, softmax, and categorical sampling operations at every token. The cost
of computing expensive potentials, on the other hand, is more significant and varies across domains.
Table 8 shows the average per token cost of computing expensive potentials for all of our domains:
we see that it rarely goes above about 30ms.
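The resampling step described above (a sum, a softmax, and a categorical draw) can be sketched as follows. This is a minimal illustration rather than the released implementation: it assumes that the ESS threshold is a fraction of the particle count N, that at least one log-weight is finite, and that resampled particles all receive the log of the average weight. The function names are hypothetical.

```python
import math
import random

def normalize(log_weights):
    """Softmax over log-weights, computed stably."""
    m = max(log_weights)
    exps = [math.exp(lw - m) for lw in log_weights]
    total = sum(exps)
    return [e / total for e in exps]

def effective_sample_size(probs):
    """ESS = 1 / sum_i W_i^2 for normalized weights W_i."""
    return 1.0 / sum(p * p for p in probs)

def maybe_resample(particles, log_weights, threshold=0.9, rng=random):
    """Multinomial resampling, triggered only when ESS < threshold * N."""
    n = len(particles)
    probs = normalize(log_weights)
    if effective_sample_size(probs) >= threshold * n:
        return list(particles), list(log_weights)
    # Draw N ancestors with replacement, proportionally to the weights.
    ancestors = rng.choices(range(n), weights=probs, k=n)
    # Every resampled particle carries the log of the average weight.
    m = max(log_weights)
    log_mean = m + math.log(sum(math.exp(lw - m) for lw in log_weights) / n)
    return [particles[a] for a in ancestors], [log_mean] * n
```

All operations are linear in N, which is consistent with the negligible cost noted above.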
Table 8: Average per-token cost (in seconds) of computing the expensive potential Φexp for each of our domains.
Intervals are bootstrapped confidence intervals estimated by selecting 10 SMC generations at random for each domain.
Method | Goal Inference | Molecular Synthesis | Data Science | Text-to-SQL
Φexp seconds per token | 0.011 (0.007, 0.016) | 0.0003 (0.0002, 0.0004) | 0.007 (0.0009, 0.023) | 0.031 (0.0204, 0.0413)
In general, the computational cost of expensive potentials is lessened by two factors: 1) expensive
potentials often change not at every token, but only at larger, semantically meaningful units (for
instance, the end of a SQL clause or a Python statement)—caching can therefore significantly lessen
computational cost, 2) expensive potentials are often CPU rather than GPU computations (and so the
cost of computation is much cheaper).
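The caching idea in point 1) can be sketched as follows: the expensive check is evaluated only on the part of the prefix that ends at the last completed unit, so all tokens sampled inside the next unit reuse the cached value. The checker, the boundary character, and the call counter below are hypothetical stand-ins chosen for this illustration.

```python
from functools import lru_cache

calls = {"n": 0}  # counts how often the expensive check actually runs

def expensive_check(prefix: str) -> float:
    """Stand-in for an expensive potential; here it rejects the substring 'BAD'."""
    calls["n"] += 1
    return 0.0 if "BAD" in prefix else 1.0

@lru_cache(maxsize=None)
def _cached_check(completed: str) -> float:
    return expensive_check(completed)

def potential(prefix: str, boundary: str = ";") -> float:
    """Evaluate the potential only up to the last completed unit of `prefix`."""
    cut = prefix.rfind(boundary)
    completed = prefix[: cut + 1] if cut >= 0 else ""
    return _cached_check(completed)
```

Calling `potential("a=1; b")` and then `potential("a=1; b=2")` triggers a single call to `expensive_check`, since both prefixes share the completed unit `"a=1;"`.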
8 This may be achieved by adding additional rules to the CFG that expand each of its terminal symbols into its
constituent byte sequence.
9 An alternative approach is to transform a byte-level CFG into a token-level CFG; however, this can make the
grammar extremely large.
10 In practice, this may require a lookup table to convert each token identifier to its representation as a byte string.
11 Our algorithm is based on Earley (1968); Stolcke (1995); Nowak & Cotterell (2023); Opedal et al. (2023).
C.1 FRAMEWORK
The reweight step of our sequential Monte Carlo algorithm requires us to evaluate $L_{\text{eff}}(x)$, and
the extend step requires us to sample from the locally constrained distribution $\ell_{\text{eff}}(\cdot \mid x)$. Both of
these operations can be moderately expensive because they require the evaluation of $\Phi_{\text{eff}}(x\,x')$ for
all tokens $x' \in \mathcal{A}_{\text{EOS}}$. In this section, we describe a general scheme for speeding up both steps by
approximately sampling $\ell_{\text{eff}}(\cdot \mid x)$ and approximately evaluating $L_{\text{eff}}(x)$. If we are careful about how
we carry out this approximation, it will not change the intermediate targets $\tilde{g}_t(x)$, defined in Eq. (6),
that control the behavior of the sequential Monte Carlo algorithm.
Sequential Monte Carlo with approximate Extend and Reweight steps. The key invariant maintained
by sequential Monte Carlo is that at each step, the distribution of each particle $(x_t^{(i)}, w_t^{(i)})$ is
properly weighted for the intermediate target $\tilde{g}_t$.
Definition 1. Let $\tilde{p}(x) = Z_p \cdot p(x)$ be an unnormalized distribution on a space $X$, with normalizing
constant $Z_p$. Let $q$ be a probability distribution on weighted pairs $(x, w) \in X \times \mathbb{R}_{\ge 0}$. We say that $q$
is properly weighted for $\tilde{p}$ if, for any function $f$,
\[
\mathbb{E}_{(x,w) \sim q}\left[ w \cdot f(x) \right] = Z_p \, \mathbb{E}_{x \sim p}\left[ f(x) \right] \tag{8}
\]
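To make Definition 1 concrete, the following sketch checks Eq. (8) exactly on a three-element space, using the standard importance-sampling construction in which x is drawn from a proposal and paired with the weight p̃(x)/q(x). The particular numbers and names are assumptions chosen for illustration.

```python
from fractions import Fraction as F

p_tilde = {"a": F(3), "b": F(1), "c": F(2)}        # unnormalized target
Z_p = sum(p_tilde.values())                        # normalizing constant
p = {x: v / Z_p for x, v in p_tilde.items()}       # normalized target
q_x = {"a": F(1, 2), "b": F(1, 4), "c": F(1, 4)}   # proposal over x

def lhs(f):
    """E_{(x,w)~q}[w * f(x)], where w = p_tilde(x) / q_x(x)."""
    return sum(q_x[x] * (p_tilde[x] / q_x[x]) * f(x) for x in q_x)

def rhs(f):
    """Z_p * E_{x~p}[f(x)]."""
    return Z_p * sum(p[x] * f(x) for x in p)
```

Because exact rational arithmetic is used, `lhs(f) == rhs(f)` holds with equality for any test function `f` on this space.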
The extend and reweight steps of the algorithm, which update a previous particle $(x_{<t}^{(i)}, w_{t-1}^{(i)})$ by
sampling $x_t^{(i)} \sim \ell_{\text{eff}}(\cdot \mid x_{<t}^{(i)})$ and returning the new particle
\[
\left( x_{<t}^{(i)} x_t^{(i)}, \;\; w_{t-1}^{(i)} \cdot L_{\text{eff}}(x_{<t}^{(i)}) \cdot \frac{\Phi_{\text{exp}}(x_{<t}^{(i)} x_t^{(i)})}{\Phi_{\text{exp}}(x_{<t}^{(i)})} \right)
\]
are justified by the fact that if $(x_{<t}^{(i)}, w_{t-1}^{(i)})$ is properly weighted for $\tilde{g}_{t-1}$, then the new pair is properly
weighted for $\tilde{g}_t$. We are looking to improve runtime performance without compromising soundness,
so we seek ways of modifying the extend and reweight steps that do not break this invariant.
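In particular, suppose that instead of sampling $x_t^{(i)} \sim \ell_{\text{eff}}(\cdot \mid x_{<t}^{(i)})$ and computing the exact weight
update $w_t^{(i)} \overset{\text{def}}{=} w_{t-1}^{(i)} \cdot L_{\text{eff}}(x_{<t}^{(i)}) \cdot \frac{\Phi_{\text{exp}}(x_{<t}^{(i)} x_t^{(i)})}{\Phi_{\text{exp}}(x_{<t}^{(i)})}$, we instead generate $(x, W)$ from a proposal $q$ that
is properly weighted for the unnormalized local product of experts $\tilde{\ell}(x) \overset{\text{def}}{=} L_{\text{eff}}(x_{<t}^{(i)}) \cdot \ell_{\text{eff}}(x \mid x_{<t}^{(i)})$,
then compute the alternative weight update
\[
w_t^{(i)} \overset{\text{def}}{=} w_{t-1}^{(i)} \cdot W \cdot \frac{\Phi_{\text{exp}}(x_{<t}^{(i)} x)}{\Phi_{\text{exp}}(x_{<t}^{(i)})} \tag{9}
\]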
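The key observation is that, by the fact that $q$ is properly weighted for $\tilde{\ell}$, we know that for every
possible previous particle $(x_{<t}^{(i)}, w_{t-1}^{(i)})$ and every function $f$,
\begin{align}
\mathbb{E}_{(x,W) \sim q}\left[ w_t^{(i)} \cdot f(x_{<t}^{(i)} x) \right]
&= \mathbb{E}_{(x,W) \sim q}\left[ w_{t-1}^{(i)} \cdot W \cdot \frac{\Phi_{\text{exp}}(x_{<t}^{(i)} x)}{\Phi_{\text{exp}}(x_{<t}^{(i)})} \cdot f(x_{<t}^{(i)} x) \right] \tag{10} \\
&= L_{\text{eff}}(x_{<t}^{(i)}) \, \mathbb{E}_{x_t^{(i)} \sim \ell_{\text{eff}}(\cdot \mid x_{<t}^{(i)})}\left[ w_{t-1}^{(i)} \, \frac{\Phi_{\text{exp}}(x_{<t}^{(i)} x_t^{(i)})}{\Phi_{\text{exp}}(x_{<t}^{(i)})} \, f(x_{<t}^{(i)} x_t^{(i)}) \right] \tag{11} \\
&= \mathbb{E}_{x_t^{(i)} \sim \ell_{\text{eff}}(\cdot \mid x_{<t}^{(i)})}\left[ w_t^{(i)} \cdot f(x_{<t}^{(i)} x_t^{(i)}) \right] \tag{12}
\end{align}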
Therefore, if the overall proper weighting invariant (with respect to the intermediate target $\tilde{g}_t$) holds
for the original update, then it will also hold for this modified extend-and-reweight procedure. For
more on SMC with estimated weights, see Chopin & Papaspiliopoulos (2020) and Lew et al. (2022).
A family of properly weighted updates based on the Horvitz–Thompson estimator. We now
introduce a useful family of properly weighted proposals for our setting. It will allow us to generate
weighted next-token proposals (x, W) while only evaluating the potentials Φeff on a (randomly
chosen) subset S ⊆ AEOS . Our procedure is inspired by the Horvitz–Thompson estimator (Horvitz &
Thompson, 1952).
First, to reduce notational clutter, we define the following shorthand: $L \overset{\text{def}}{=} L_{\text{eff}}(x)$, and
$\ell(x') = \tilde{\ell}(x')/L = \ell_{\text{eff}}(x' \mid x)$.
Definition 2. Given a probability distribution $q$ over subsets of $\mathcal{A}_{\text{EOS}}$, we define the set-based proposal
speedup by the following generation procedure:
1. Sample a subset $S \sim q$, where $q$ is a probability distribution over subsets of $\mathcal{A}_{\text{EOS}}$.
2. Compute the local weight $w(x) \overset{\text{def}}{=} \frac{\tilde{\ell}(x)}{\pi_q(x)}$ of each token $x \in S$, where $\pi_q(x)$ is the inclusion
probability $\pi_q(x) \overset{\text{def}}{=} \Pr_{S' \sim q}[x \in S'] = \sum_{S'} q(S') \, \mathbf{1}\{x \in S'\}$, i.e., the probability that $x$ ends up in
any sampled $S' \sim q$.12
3. Compute the set-conditioned distribution $q(x \mid S) \overset{\text{def}}{=} \frac{w(x) \, \mathbf{1}\{x \in S\}}{W_S}$, where $W_S = \sum_{x \in S} w(x)$.
4. Sample $x \sim q(\cdot \mid S)$.
5. Return $(x, W_S)$.
Then, as described above, we modify the SMC algorithm to use the sampled token x instead of
x ∼ ℓeff (· | x) in the extend step, and the weight WS instead of Leff (x) in the reweight step.
We justify this approach by showing that it produces properly weighted samples.
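Proposition 1. The set-based proposal speedup's distribution $q$ (Def. 2) is a properly weighted
proposal for $\tilde{\ell}$.
Proof. For any function $f$,
\begin{align}
\mathbb{E}_{(x, W_S) \sim q}\left[ W_S \, f(x) \right]
&= \sum_{S} q(S) \sum_{x} q(x \mid S) \, W_S \, f(x) \tag{13a} \\
&= \sum_{S} q(S) \sum_{x} w(x) \, \mathbf{1}\{x \in S\} \, f(x) \tag{13b} \\
&= \sum_{x} f(x) \, w(x) \sum_{S} \mathbf{1}\{x \in S\} \, q(S) \tag{13c} \\
&= \sum_{x} f(x) \, \frac{\tilde{\ell}(x)}{\pi_q(x)} \, \pi_q(x) \tag{13d} \\
&= L \sum_{x} f(x) \, \ell(x) \tag{13e} \\
&= L \, \mathbb{E}_{x \sim \ell}\left[ f(x) \right] \tag{13f}
\end{align}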
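The procedure of Def. 2 and the identity of Proposition 1 can be checked on a toy vocabulary. The sketch below is illustrative only: the three-token vocabulary, the local weights, and the distribution over subsets are assumptions, and `expected_weighted` enumerates every subset exactly rather than sampling.

```python
from fractions import Fraction as F
import random

ell_tilde = {"a": F(1, 2), "b": F(1, 8), "c": F(1, 4)}  # unnormalized local weights
subset_dist = {                                        # q over subsets of the vocabulary
    frozenset("ab"): F(1, 2),
    frozenset("bc"): F(1, 4),
    frozenset("ac"): F(1, 4),
}

def inclusion_prob(x):
    """pi_q(x): probability that token x lands in the sampled subset."""
    return sum(pr for S, pr in subset_dist.items() if x in S)

def local_weight(x):
    """Horvitz-Thompson style weight w(x) = ell_tilde(x) / pi_q(x)."""
    return ell_tilde[x] / inclusion_prob(x)

def propose(rng=random):
    """Steps 1-5 of Def. 2: sample S, then a token from S; return (x, W_S)."""
    subsets = list(subset_dist)
    S = rng.choices(subsets, weights=[float(subset_dist[s]) for s in subsets])[0]
    tokens = sorted(S)
    w = [local_weight(x) for x in tokens]
    W_S = sum(w)
    x = rng.choices(tokens, weights=[float(v) for v in w])[0]
    return x, W_S

def expected_weighted(f):
    """Exact E[W_S * f(x)] under the proposal, by enumeration."""
    total = F(0)
    for S, pr in subset_dist.items():
        W_S = sum(local_weight(x) for x in S)
        for x in S:
            total += pr * (local_weight(x) / W_S) * W_S * f(x)
    return total
```

For any `f`, `expected_weighted(f)` equals `sum(ell_tilde[x] * f(x) for x in ell_tilde)`, which is the right-hand side of Eq. (13e) with L absorbed into the unnormalized weights. Every token here has positive inclusion probability, as footnote 12 requires.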
Our character-based proposal distribution is an instance of the framework of the previous section.
In particular, q samples sets of tokens S by sampling a sequence of characters. We provide the
pseudocode for this algorithm in Alg. 1, and define two key data structures used by this proposal:
Definition 3. Our trie data structure $T$ is a labeled, tree-structured graph that is defined as follows:
• Let $\mathcal{A}_{\text{EOS}}$ be the LM's vocabulary of tokens, where we represent each token as its string of characters
ending with a designated end-of-token marker EOT.13 Let $\Sigma$ denote the set of characters (or bytes).
• Let $P$ be the prefix closure of the set $\mathcal{A}_{\text{EOS}}$: $P \overset{\text{def}}{=} \{p \in \Sigma^* \mid p \preceq x, \; x \in \mathcal{A}_{\text{EOS}}\}$, where $p \preceq x$ denotes
that $p$ is a prefix of $x$.
12 We require q to be such that every token has a positive inclusion probability π q (x) > 0 for all x ∈ AEOS .
13 Note that EOS is handled specially as it is not a string of characters.
• Let $T = (N, E)$ be a labeled graph with nodes $N = P$ and labeled edges $E = \{\, p \xrightarrow{a} p\,a \mid p, (p\,a) \in P \,\}$.
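Definition 4. Let mass be a mapping $N \to [0, 1]$, defined as follows.
\begin{align}
\text{mass}(x') &= p(x' \mid x), && \text{for } x' \in \mathcal{A}_{\text{EOS}} \tag{14a} \\
\text{mass}(p) &= \sum_{p \xrightarrow{a} p\,a \,\in\, E} \text{mass}(p\,a), && \text{for } p \in P \setminus \mathcal{A}_{\text{EOS}} \tag{14b}
\end{align}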
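A minimal sketch of Eq. (14): rather than building an explicit trie, the code below stores the mass of every prefix in a dictionary and adds each token's probability to all of its prefixes, which yields the same values as summing over children. The EOT marker character and the function name are assumptions made for this illustration.

```python
EOT = "\x00"  # hypothetical end-of-token marker; assumed absent from token text

def prefix_mass(token_probs):
    """Return mass(p) for every prefix p of every token string ending in EOT."""
    mass = {}
    for token, prob in token_probs.items():
        leaf = token + EOT               # token represented as characters + EOT
        for i in range(len(leaf) + 1):   # every prefix, including the root ""
            mass[leaf[:i]] = mass.get(leaf[:i], 0.0) + prob
    return mass
```

For example, with next-token probabilities `{"a": 0.5, "ab": 0.25, "b": 0.25}`, the root has mass 1.0, the prefix `"a"` has mass 0.75, and the leaf `"a" + EOT` has mass 0.5.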
Algorithm 1 Character proposal: This procedure implements a properly weighted proposal distribution
for the unnormalized version of the locally constrained distribution ℓ_{ϕG}(· | x).
procedure character_proposal(x)
    mass ← Apply Eq. (14) to p(· | x)
    p ← ε                                        ▷ start at the trie's root node
    c ← 1                                        ▷ path weight under cfg
    w ← {}
    S ← ∅
    π ← {}
    π(ε) ← 1
    while true:
        p1 ← {a : mass(p a) / mass(p) for p −a→ p a ∈ E}
        p2 ← {a : ϕG(x p a) / ϕG(x p) for a ∈ Σ}
        if p −EOT→ p EOT ∈ E:                    ▷ End-of-token available (i.e., p ∈ AEOS)
            w(p) ← mass(p EOT) · c / π(p)
            S ← S ∪ {p}
        q ← {a : p1(a) · p2(a) for a ∈ Σ}        ▷ Note: EOT ∉ Σ.
        Q ← Σ_{a∈Σ} q(a)
        if Q = 0:                                ▷ cannot continue further
            break
        a ∼ q/Q                                  ▷ Sample next character proportional to q
        π(p a) ← π(p) · q(a)/Q
        p ← p a                                  ▷ extend the prefix (i.e., transition to the next node)
        c ← c · p2(a)
    W ← Σ_{x′∈S} w(x′)
    x ∼ w(·)/W
    return (x, W)
It is straightforward to verify that the character proposal is an instance of set-based proposal speedup
(Def. 2), which is properly weighted according to Proposition 1.
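Algorithm 1 can be transcribed into Python roughly as follows. This is a sketch under several assumptions: tokens are plain strings, the grammar potential is a user-supplied function `phi` on character strings (hypothetical here), a dictionary of prefix masses stands in for the trie, and the case where no token is reachable returns `(None, 0.0)`.

```python
import random

EOT = "\x00"  # hypothetical end-of-token marker

def build_mass(token_probs):
    """Prefix masses of Eq. (14), stored in a dict keyed by character prefix."""
    mass = {}
    for tok, pr in token_probs.items():
        leaf = tok + EOT
        for i in range(len(leaf) + 1):
            mass[leaf[:i]] = mass.get(leaf[:i], 0.0) + pr
    return mass

def character_proposal(token_probs, phi, context="", rng=random):
    mass = build_mass(token_probs)
    alphabet = sorted({ch for tok in token_probs for ch in tok})
    p, c = "", 1.0            # current prefix and its path weight under phi
    w, pi = {}, {"": 1.0}     # token weights and prefix inclusion probabilities
    while True:
        base = phi(context + p)
        p2 = {a: (phi(context + p + a) / base if base > 0 else 0.0) for a in alphabet}
        if p + EOT in mass:   # p is a complete token
            w[p] = mass[p + EOT] * c / pi[p]
        q = {a: mass.get(p + a, 0.0) / mass[p] * p2[a] for a in alphabet}
        Q = sum(q.values())
        if Q == 0:            # cannot continue further
            break
        a = rng.choices(alphabet, weights=[q[ch] for ch in alphabet])[0]
        pi[p + a] = pi[p] * q[a] / Q
        c *= p2[a]
        p += a
    W = sum(w.values())
    if W == 0:
        return None, 0.0      # assumption: signal a dead end with zero weight
    tokens = list(w)
    x = rng.choices(tokens, weights=[w[t] for t in tokens])[0]
    return x, W
```

Only the potentials along a single sampled character path are evaluated, rather than one per vocabulary item, which is the source of the speedup.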
The KL divergence KL(q ∥ g) measures how well the distribution q approximates the global product
of experts g. The exact KL(q ∥ g) is generally intractable, for two reasons: g is known only up to a
normalizing constant, and q (which, for us, represents the marginal distribution of a particle at the
end of inference) is not known in closed form. To address the first difficulty, we instead measure the
approximation quality log Z − KL(q ∥ g); because Z is a function only of g and not of the inference
algorithm, as we vary the inference algorithm, the resulting variation in approximation quality is
caused solely by changes to the KL term. To address the second difficulty, we work over extended
state spaces to obtain lower bounds on the approximation quality (Lew et al., 2022; Zhao et al., 2024).
In Appendix D.1 we show how to estimate this quantity in a general setting, and in Appendix D.2 we
explain how to adapt this result to the setting where g incorporates hard constraints (making the KL
divergence discussed above potentially infinite).
We begin by considering the case of a generic importance sampling algorithm, which draws $K$
particles $x^{(i)} \in \mathcal{A}^*_{\text{EOS}}$ from a proposal distribution $q$ to approximate the target distribution $\sigma$. For
each particle, the algorithm computes an importance weight relative to the unnormalized target
distribution $\tilde{\sigma}(x) = Z_\sigma \sigma(x)$, where $Z_\sigma > 0$ is the unknown normalizing constant.
Estimating the quality of 1-particle inference. A very simple inference strategy is to use the proposal
q to generate a single sample x, without further correction. A reasonable measure of inference quality
would be KL(q ∥ σ). However, due to the unknown normalizing constant of σ̃, the quantity that we
can more easily estimate unbiasedly is log Zσ − KL(q ∥ σ) = Ex∼q [log σ̃(x) − log q(x)]. As the
expression suggests, we can estimate this quantity by sampling many sequences x independently
from q and computing the average across samples of log σ̃(x) − log q(x). Because Zσ depends on
only σ̃, and not the inference algorithm, we can compare proposals q by estimating this quantity;
higher is better, implying lower KL(q ∥ σ).
Estimating the quality of K-particle IS. In $K$-particle IS, we consider the posterior approximation
to be the distribution obtained by running importance sampling and resampling a particular particle
$x^{(k)}$ with probabilities proportional to the normalized particle weights. Formally, let $S = \mathcal{A}^*_{\text{EOS}} \times (\mathcal{A}^*_{\text{EOS}})^{K-1}$
be the space of $K$-particle collections with a distinguished, chosen particle; we write
elements of $S$ as $\langle x^{(k)}, x^{-k} \rangle$, where $x^{-k} = \{x^{(i)} \mid i = 1, \ldots, K, \; i \neq k\}$ are the $K - 1$ unchosen
particles. The importance sampling procedure defines the following distribution over $S$:
\[
q_{\text{IS}}(\langle x^{(k)}, x^{-k} \rangle) \overset{\text{def}}{=} \frac{w(x^{(k)})}{\sum_{i=1}^{K} w(x^{(i)})} \prod_{i=1}^{K} q(x^{(i)}), \tag{15}
\]
where $w(x^{(i)}) = \frac{\tilde{\sigma}(x^{(i)})}{q(x^{(i)})}$ is the importance weight for particle $x^{(i)}$. We are interested in the quality
of the marginal posterior approximation
\[
q^*_{\text{IS}}(x) \overset{\text{def}}{=} \sum_{\langle x^{(k)}, x^{-k} \rangle \in S} q_{\text{IS}}(\langle x^{(k)}, x^{-k} \rangle) \cdot \mathbf{1}[x = x^{(k)}]. \tag{16}
\]
Although we can generate samples from $q^*_{\text{IS}}$ (by running importance sampling and selecting a chosen
particle), we cannot evaluate its density exactly, so we cannot directly compute unbiased Monte
Carlo estimates of $\log Z_\sigma - \mathrm{KL}(q^*_{\text{IS}} \,\|\, \sigma)$ as before. Instead, we extend the state space of the target
distribution $\sigma$ to obtain a new distribution $\sigma_{\text{IS}}$ over $S$, following Andrieu & Roberts (2009):
\[
\sigma_{\text{IS}}(\langle x^{(k)}, x^{-k} \rangle) \overset{\text{def}}{=} \frac{1}{K} \, \sigma(x^{-k} \mid x^{(k)}) \, \sigma(x^{(k)}), \quad \text{where } \sigma(x^{-k} \mid x^{(k)}) \overset{\text{def}}{=} \prod_{i \neq k} q(x^{(i)}) \tag{17}
\]
We write $\tilde{\sigma}_{\text{IS}} = Z_\sigma \, \sigma_{\text{IS}}$ for the unnormalized extended target.
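Note that the marginal distribution of the extended target,
\[
\sigma^*_{\text{IS}}(x) = \sum_{\langle x^{(k)}, x^{-k} \rangle \in S} \sigma_{\text{IS}}(\langle x^{(k)}, x^{-k} \rangle) \, \mathbf{1}[x = x^{(k)}] \tag{18}
\]
is equal to $\sigma(x)$. The KL divergence between two joint distributions is lower-bounded by the KL
divergence between their marginals, so we have $\mathrm{KL}(q^*_{\text{IS}} \,\|\, \sigma) \leq \mathrm{KL}(q_{\text{IS}} \,\|\, \sigma_{\text{IS}})$.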
Both qIS and σ̃IS have tractable joint densities, so we can estimate log Zσ − KL(qIS ∥ σIS ) by
repeatedly generating ⟨x(k) , x−k ⟩ ∼ qIS and computing the log density ratio, which simplifies into a
log average particle weight.
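Proposition 2 (Estimating $\log Z_\sigma - \mathrm{KL}(q^*_{\text{IS}} \,\|\, \sigma)$). Consider $K$ particles $x^{(i)}$ generated
from the proposal $q$ and let $w(x^{(i)}) = \tilde{\sigma}(x^{(i)})/q(x^{(i)})$ be their importance weights. Then
$\mathbb{E}\left[ \log \frac{1}{K} \sum_{i=1}^{K} w(x^{(i)}) \right]$ is a lower bound on $\log Z_\sigma - \mathrm{KL}(q^*_{\text{IS}} \,\|\, \sigma)$.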
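Proof.
\begin{align}
\log Z_\sigma - \mathrm{KL}(q^*_{\text{IS}} \,\|\, \sigma)
&\geq \log Z_\sigma - \mathrm{KL}(q_{\text{IS}} \,\|\, \sigma_{\text{IS}}) \tag{19a} \\
&= \mathbb{E}_{q_{\text{IS}}}\left[ \log \frac{\tilde{\sigma}_{\text{IS}}(\langle x^{(k)}, x^{-k} \rangle)}{q_{\text{IS}}(\langle x^{(k)}, x^{-k} \rangle)} \right] \tag{19b} \\
&= \mathbb{E}_{q_{\text{IS}}}\left[ \log \frac{\frac{1}{K} \, \tilde{\sigma}(x^{(k)}) \prod_{i \neq k} q(x^{(i)})}{q(x^{(k)}) \, \frac{w(x^{(k)})}{\sum_{i=1}^{K} w(x^{(i)})} \prod_{i \neq k} q(x^{(i)})} \right] \tag{19c} \\
&= \mathbb{E}_{q_{\text{IS}}}\left[ \log \frac{\frac{1}{K} \, \tilde{\sigma}(x^{(k)}) \prod_{i \neq k} q(x^{(i)})}{q(x^{(k)}) \, \frac{\tilde{\sigma}(x^{(k)}) / q(x^{(k)})}{\sum_{i=1}^{K} w(x^{(i)})} \prod_{i \neq k} q(x^{(i)})} \right] \tag{19d} \\
&= \mathbb{E}_{q_{\text{IS}}}\left[ \log \frac{\frac{1}{K}}{\frac{1}{\sum_{i=1}^{K} w(x^{(i)})}} \right] \tag{19e} \\
&= \mathbb{E}_{q_{\text{IS}}}\left[ \log \frac{1}{K} \sum_{i=1}^{K} w(x^{(i)}) \right] \tag{19f}
\end{align}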
Hence, $\log \frac{1}{K} \sum_{i=1}^{K} w(x^{(i)})$ can be seen as a single sample estimate of $\log Z_\sigma - \mathrm{KL}(q^*_{\text{IS}} \,\|\, \sigma)$, with
negative bias. The precise bias can be shown to be $-\mathrm{KL}(q_{\text{IS}}(\langle x^{(k)}, x^{-k} \rangle \mid x^{(k)}) \,\|\, \sigma_{\text{IS}}(\langle x^{(k)}, x^{-k} \rangle \mid x^{(k)}))$,
which decreases in magnitude as the number of particles increases (Lew et al., 2022). As before, the term
$\log Z_\sigma$ is a constant that is independent of the inference algorithm, so this estimate can be used as a
measure of inference quality across algorithms, where higher is better.
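The behavior of this estimator can be illustrated on a toy problem where everything is computable exactly. The sketch below enumerates all K-tuples of particles from a three-element proposal and evaluates the expected log average weight; the target, the proposal, and the function name are assumptions chosen for illustration.

```python
import itertools
import math

sigma_tilde = {"a": 3.0, "b": 1.0, "c": 0.5}   # unnormalized target
q = {"a": 0.2, "b": 0.5, "c": 0.3}             # proposal
log_Z = math.log(sum(sigma_tilde.values()))    # known here, unknown in practice

def expected_log_mean_weight(K):
    """Exact E[log (1/K) sum_i w(x_i)] for x_1..x_K drawn i.i.d. from q."""
    total = 0.0
    for xs in itertools.product(q, repeat=K):
        prob = math.prod(q[x] for x in xs)
        mean_w = sum(sigma_tilde[x] / q[x] for x in xs) / K
        total += prob * math.log(mean_w)
    return total
```

On this example, the K = 1 value equals log Z minus KL(q ∥ σ), and the values for K = 1, 2, 4 increase toward log Z without exceeding it. In practice the expectation is replaced by an average over repeated runs of the inference algorithm.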
Estimating the quality of SMC. The same logic as above can be applied to SMC (Andrieu &
Roberts, 2009; Lew et al., 2022; Zhao et al., 2024). As in the IS case, we use an extended target σSMC
such that the density ratio works out to exactly the average particle weight at the end of the algorithm.
When attempting to estimate the discrepancy between samples for algorithms qalg and g, one difficulty
is that KL(qalg ∥ g) can be infinite when g incorporates hard constraints. This is because, in those
cases, there is positive probability on the outcome that all generated proposals fail to meet the
constraints, and thus have mass 0 under $g$. A potential solution to this issue, explored by Zhao
et al. (2024), is to instead estimate $\mathrm{KL}(g \,\|\, q_{\text{alg}})$. But this requires exact samples from $g$, which are
impractical to obtain in our setting. We thus take a different approach and consider rejection-sampled
versions of each of our algorithms, $q^r_{\text{alg}}$, which draw samples $x \sim q_{\text{alg}}$ repeatedly until a sample with
$g(x) > 0$ is obtained. In this case, we have
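\begin{align}
\mathrm{KL}(q^r_{\text{alg}} \,\|\, g)
&= \mathbb{E}_{q^r_{\text{alg}}}\left[ \log \frac{q^r_{\text{alg}}(x)}{g(x)} \right] \tag{20a} \\
&= \mathbb{E}_{q^r_{\text{alg}}}\left[ \log \frac{q_{\text{alg}}(x)}{g(x) \, Z^r_{\text{alg}}} \right] \tag{20b} \\
&= \mathbb{E}_{q^r_{\text{alg}}}\left[ \log \frac{q_{\text{alg}}(x)}{g(x)} \right] - \log Z^r_{\text{alg}} \tag{20c}
\end{align}
where $Z^r_{\text{alg}}$ is the acceptance rate of $q^r_{\text{alg}}$ (which we can estimate with standard Monte Carlo), and we
can estimate the first term in Eq. (20c) up to an instance-specific constant by the derivations above.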
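The identity in Eq. (20) can be verified numerically on a small discrete example. In the sketch below, the three-outcome distributions are assumptions chosen for illustration; the target places zero mass on one outcome to mimic a hard constraint, and a sample is accepted when it has positive mass under the target.

```python
import math

q_alg = {"a": 0.5, "b": 0.2, "c": 0.3}   # output distribution of the algorithm
g = {"a": 0.6, "b": 0.4, "c": 0.0}       # target; "c" violates a hard constraint

# Acceptance rate Z^r_alg: probability that a sample has positive mass under g.
accept_rate = sum(q_alg[x] for x in q_alg if g[x] > 0)
# Rejection-sampled distribution q^r_alg: q_alg renormalized over accepted outcomes.
q_rej = {x: q_alg[x] / accept_rate for x in q_alg if g[x] > 0}

# Left-hand side, Eq. (20a), computed directly.
kl_direct = sum(q_rej[x] * math.log(q_rej[x] / g[x]) for x in q_rej)
# Right-hand side, Eq. (20c): expected log ratio under q^r_alg minus log acceptance rate.
kl_via_20c = (
    sum(q_rej[x] * math.log(q_alg[x] / g[x]) for x in q_rej) - math.log(accept_rate)
)
```

Both quantities agree up to floating-point error, and both are finite even though KL(q_alg ∥ g) itself is infinite in this example.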
Table Columns
singer singer_id, name, ...
concert concert_id, concert_name, ...
(a) Example schema
E DOMAIN DETAILS
This section provides further details on the domains used in the experiments.
Spider is a large-scale text-to-SQL dataset of natural language questions and database schemas.
Given a natural language question and schema, the task is to generate a valid SQL query that is
semantically equivalent to a ground-truth query. In this domain, Φeff is used to enforce valid SQL
syntax according to the SQL grammars released by Roy et al. (2024). These grammars include
schema-specific constraints that limit table and column names to those present in the given schema
but do not ensure correct table-column associations. Thus, we use Φexp to verify whether the (partial)
SQL query references a column that exists in a table, returning 0 in the case that it does not and 1
otherwise. Table 10 provides an example of Φeff and Φexp applied to an SQL query. Since Φexp is
only semantically meaningful when a generated query has fully specified the necessary table or alias
information to check correspondences, we only run the table-column verification at clause boundaries
once the FROM clause has been completed. We evaluate on the development split of Spider with
execution accuracy, which checks whether the predicted SQL query’s output matches that of the
ground-truth query. We define p(x) by prompting Llama-3.1 (8b) instruct with 3 examples followed
by a rendering of the database and the natural language question.
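A table-column check of the kind described above might look like the following toy sketch. It is not the verifier used in the experiments: it relies on regular expressions rather than a SQL parser, only inspects qualified references of the form `alias.column`, and uses a hypothetical two-table schema modeled on the example schema shown above.

```python
import re

# Hypothetical schema, modeled on the example schema (columns truncated there).
SCHEMA = {
    "singer": {"singer_id", "name"},
    "concert": {"concert_id", "concert_name"},
}

def column_potential(partial_sql: str, schema=SCHEMA) -> int:
    """Return 0 if a qualified column does not exist in its table, else 1."""
    sql = partial_sql.lower()
    aliases = {}
    # Collect tables and optional aliases introduced by FROM / JOIN.
    for table, alias in re.findall(r"\b(?:from|join)\s+(\w+)(?:\s+as\s+(\w+))?", sql):
        aliases[table] = table
        if alias:
            aliases[alias] = table
    if not aliases:
        return 1  # FROM clause not reached yet: nothing can be checked
    for qualifier, column in re.findall(r"\b(\w+)\.(\w+)\b", sql):
        table = aliases.get(qualifier)
        if table is None or column not in schema.get(table, set()):
            return 0
    return 1
```

For instance, `SELECT T1.name FROM singer AS T1` passes, while `SELECT T1.concert_name FROM singer AS T1` is rejected because `concert_name` belongs to a different table.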
A recent line of work has applied LMs to the problem of molecular synthesis, with the aim of
generating candidate molecules with properties similar to molecules from known databases (see
Oliveira et al., 2022, for review)—most commonly (e.g. Flam-Shepherd et al. (2022); Wang et al.
(2024)) by prompting with examples of molecules in SMILES format (Weininger, 1988). We
follow this approach, constructing prompts from random subsets of 20 molecules from the GDB-
17 dataset (Ruddigkeit et al., 2012). We evaluate generations using the standard molecule fitness
function Quantitative-Estimated Drug-likeness (QED; Bickerton et al., 2012) implemented in the
Python RDKit library (Landrum, 2024). This metric combines eight physicochemical properties
of a compound: Molecular weight, LogP, H-bond donors, H-bond acceptors, Charge, Aromaticity,
Stereochemistry, and Solubility. Here, Φeff enforces SMILES syntax. To enforce properties not
encoded by this syntax, we define Φexp using a molecule validator that can be applied to partial
SMILES strings, implemented in the Python partialsmiles library (O’Boyle, 2024). The validator
checks the SMILES prefix to ensure that the atom’s valences are in a list of allowed valences, and
attempts to find alternating patterns of single and double bonds to cover all aromatic systems in the
partial string. The additional metrics reported in Fig. 4 are Validity (proportion of valid SMILES),
Weight (exact molecular weight), De Novo Similarity (average pairwise Tanimoto similarity to the
target distribution, excluding exact duplicates), and Diversity (inverse average pairwise Tanimoto
similarity among compounds generated by a particular method).
Recent work has explored using LMs for planning with languages like the Planning Domain Definition
Language (PDDL) (Ghallab et al., 1998), by either generating plans directly (Silver et al., 2022; Wong
et al., 2023; Ying et al., 2023; Zhang et al., 2024; Zhi-Xuan et al., 2024), or generating descriptions
of a task’s initial and/or goal conditions, which classical planning algorithms can use to search
for plans (Liu et al., 2023; Xie et al., 2023; Guan et al., 2023). In the spirit of the latter, we use
the Blocksworld tasks from the Planetarium benchmark (Zuo et al., 2024), which provides natural
language descriptions of a task’s initial and goal conditions along with their ground-truth symbolic
representations in the STRIPS subset of PDDL (Fikes & Nilsson, 1971). The original dataset is
extremely challenging, requiring the LM to output a full STRIPS description of tasks with up to 100
objects—Zuo et al. (2024) report fewer than 2% of the outputs of Gemma 1.1 7B to be even parseable.
We therefore simplify the task by limiting our evaluation to examples with fewer than 10 objects, and
requiring the LM to generate only the goal conditions by appending the ground-truth initial conditions
to the pre-prompt. Here, Φeff encodes STRIPS syntax for goals within Planetarium’s Blocksworld
domain definition; Φexp uses a gold-standard plan known to satisfy the ground-truth task description,
and calls the VAL plan validator (Howey et al., 2004) to test whether partial goal descriptions are valid
according to that plan. The gold-standard plans for each instance were derived using the Fast Downward
planner (Helmert, 2006). Note that it is only possible to apply this Φexp potential to partial strings
if goal descriptions are monotonic, that is, if any goal prefix describes a superset of the states that
the full goal describes. This is the case for STRIPS, where goals must be described as conjunctions
of literals, so that we can evaluate the potential after each literal in the conjunction is completed.
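The monotonicity argument above can be illustrated with a minimal sketch. It is not the VAL-based potential used in the experiments: as a simplifying assumption, the gold-standard plan is replaced by a hypothetical set of facts that hold in its final state (`FINAL_STATE`), only positive literals are handled, and the helper names `completed_literals` and `phi_exp` are illustrative.

```python
import re

# Hypothetical final state reached by a gold-standard plan (a set of true facts).
FINAL_STATE = {("on", "a", "b"), ("ontable", "b"), ("clear", "a")}

def completed_literals(partial_goal):
    """Return the literals whose closing parenthesis has already been generated."""
    # Innermost parenthesised groups are the finished literals; an unfinished
    # literal such as "(clear" has no closing parenthesis and is not matched.
    atoms = re.findall(r"\(([^()]+)\)", partial_goal)
    literals = [tuple(atom.split()) for atom in atoms]
    # Drop wrappers such as an empty "(and )" that are not literals.
    return [lit for lit in literals if lit and lit[0] not in {"and", ":goal"}]

def phi_exp(partial_goal, final_state=FINAL_STATE):
    """Return 1.0 if every completed literal holds in the final state, else 0.0."""
    ok = all(lit in final_state for lit in completed_literals(partial_goal))
    return 1.0 if ok else 0.0

# The unfinished literal "(clear" is ignored until it is completed.
print(phi_exp("(:goal (and (on a b) (clear"))
# A completed literal that is false in the final state rules out every extension.
print(phi_exp("(:goal (and (on a b) (on b a)"))
```

Because a conjunction can only become more restrictive as literals are added, a prefix that scores 0 can never be extended into a goal that scores 1, which is what makes it safe to prune such prefixes early.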
DS-1000 (Lai et al., 2023) is a challenging code generation benchmark on data science problems in
Python, split into subsets requiring the use of six popular libraries: Pandas, NumPy, Scikit-Learn,
SciPy, TensorFlow, and PyTorch. For each problem instance, the language model is prompted with
an English description of the problem and a sample test case in Python and is tasked with generating
code that solves the problem and passes the test case. Each test case includes a result variable, and
success depends on the execution. In preliminary experiments, we observed that our language model
was able to generate syntactically correct Python programs for every sample. We, therefore, set
Φeff = 1 for our experiments in this domain. Thus, unlike the other three domains, the proposal
distribution for all evaluations of DS-1000 was simply p. In this domain, Φexp executes the
test cases provided in the prompts from Lai et al. (2023) on generated (partial) Python programs, and
returns 1 if no errors are produced and 0 otherwise (in particular, we did not make use of the output
of test cases). Note that it is only possible to execute Python code when the generated sequence x
consists entirely of well-formed Python statements; thus, in this domain, Φexp can only be meaningfully
applied at the boundary of statements. This motivates aligning SMC particles using statements as
their steps, as explained in the “Further extensions” section (§2).
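A minimal sketch of an execution-based potential of this kind is given below. It is an illustration under stated assumptions, not the implementation used in the experiments: statement boundaries are approximated by line ends, the helper names `complete_statement_prefix` and `phi_exp` and the `test_setup` parameter are hypothetical, and the code is executed without sandboxing, which a real system would need.

```python
import ast

def complete_statement_prefix(source):
    """Longest prefix of `source`, cut at a line end, that parses as whole statements."""
    lines = source.splitlines(keepends=True)
    for end in range(len(lines), 0, -1):
        candidate = "".join(lines[:end])
        try:
            ast.parse(candidate)
            return candidate
        except SyntaxError:
            # The last line is still being generated; back off by one line.
            continue
    return ""

def phi_exp(partial_program, test_setup=""):
    """Return 1 if the well-formed prefix runs without raising, else 0."""
    prefix = complete_statement_prefix(partial_program)
    env = {}
    try:
        # WARNING: exec runs arbitrary code; sandbox this in any real use.
        exec(test_setup + "\n" + prefix, env)
    except Exception:
        return 0
    return 1

# The unfinished last line is ignored; the two complete statements run fine.
print(phi_exp("a = 1\nb = a + 1\nc = ("))
# A complete statement that raises NameError yields a score of 0.
print(phi_exp("a = 1\nb = undefined_name + 1\n"))
```

The potential only changes value when a new statement is completed, which is why it is natural to treat statements rather than tokens as the unit at which particles are compared.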
F RELATED WORK
Shin et al. (2021) presented a system allowing LMs to be locally intersected with (boolean) CFGs to
restrict generations to conform to target formal languages, and showed that, with only a few in-context
examples, such an inference-time strategy could outperform more substantial fine-tuning. Concurrently,
PICARD (Scholak et al., 2021) presented an approach for intersecting LMs with an incremental
parsing algorithm and showed how additional context-sensitive constraints could be imposed, such
as requiring table-column matching for SQL generation via the use of programmable “guards”.
Synchromesh (Poesia et al., 2022) generalized these frameworks and extended the idea of incremental
guards that can impose semantic restrictions during generation—such as typing and scoping rules—by
dynamically constructing constraints as regular expressions on the fly. A great deal of other work
has explored variants of LM-grammar intersection including the effectiveness of pre-training models
on code for these settings (Shin & Van Durme, 2022), the runtime compilation of individual task
instances into highly specific, task-specialized grammars (Geng et al., 2023), and even using the LM
to generate grammars directly at runtime, that then restrict their own generation to solve a task (Wang
et al., 2024). Other work has focused more closely on the standard syntactic-constraint problem but
with an emphasis on optimizing efficient data structures and algorithms for fast LM-CFG intersection
(Ugare et al., 2024; Zheng et al., 2024; Moskal et al., 2024).
A parallel line of work in this space has been concerned with the efficient construction and application
of constraints for sequential inference problems. Deutsch et al. (2019) first noted that regular and
context-free grammar constraints could be pre-compiled to automata—these could then be used
during sequential inference to impose constraints with near-zero runtime overhead. This approach
was independently developed and efficiently implemented in the context of restricting LM generations
to regular expressions by the Outlines (Willard & Louf, 2023) and ReLM libraries (Kuchnik et al.,
2023). Similar work was later developed by Koo et al. (2024), who extended several formal automata-
theoretic characteristics of these constructions.
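The idea of pre-compiling a constraint to an automaton can be sketched as follows. This is a toy illustration and not the construction of any of the cited libraries: the character-level DFA for the regular expression `-?[0-9]+`, the vocabulary `VOCAB`, and the function names are all assumptions made for the example.

```python
DIGITS = "0123456789"

# Character-level DFA for the regular expression -?[0-9]+ (state 2 is accepting).
DFA = {(0, "-"): 1}
for d in DIGITS:
    DFA[(0, d)] = 2
    DFA[(1, d)] = 2
    DFA[(2, d)] = 2

def compile_token_table(dfa, states, vocab):
    """For every DFA state, precompute which tokens are allowed and where they lead."""
    table = {}
    for state in states:
        table[state] = {}
        for token in vocab:
            current = state
            for ch in token:
                current = dfa.get((current, ch))
                if current is None:
                    break  # the token falls off the automaton from this state
            if current is not None:
                table[state][token] = current
    return table

# Hypothetical multi-character token vocabulary.
VOCAB = ["-", "-1", "12", "3", "a", "1-"]
TABLE = compile_token_table(DFA, [0, 1, 2], VOCAB)

def allowed_tokens(state):
    """Runtime lookup: the token mask for a state is a single dictionary access."""
    return sorted(TABLE[state])

print(allowed_tokens(0))
print(allowed_tokens(2))
```

All character-level simulation happens once, ahead of time; during generation the mask for the current state is a table lookup, which is the source of the low runtime overhead.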
This line of work has noted the complications of efficiently intersecting grammars whose atoms are terminals
and LMs whose atoms are tokens, which we refer to as the token–terminal alignment problem. An
efficient and accurate solution to this problem was one of several desiderata for our proposal
algorithm (see Alg. 1 in Appendix C for more details). These works have also discussed considerations
that arise, when constructing automata whose arcs are tokens, in assigning probabilities to
strings. Namely, there are exponentially many latent token trajectories that correspond to a generated
sequence. While the correct method for assigning string probabilities involves marginalizing over
these trajectories (Cao & Rimell, 2021), in practice, simply using the canonical tokenization accounts
for the overwhelming majority of the probability mass and can be justified (Chirkova et al., 2023;
Kuchnik et al., 2023; Berglund et al., 2024; Vieira et al., 2024). In the present work, we do not
enforce this assumption and allow all token trajectories.
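To make the notion of latent token trajectories concrete, the sketch below enumerates every tokenization of a short string and sums their probabilities. The token model is a deliberate simplification: `TOKEN_PROB` is a made-up, context-free distribution, whereas a real LM conditions each token on its prefix; the names are illustrative only.

```python
# Hypothetical context-free token probabilities (they sum to 1).
TOKEN_PROB = {"a": 0.1, "b": 0.1, "c": 0.1, "ab": 0.3, "bc": 0.2, "<eos>": 0.2}
VOCAB = [tok for tok in TOKEN_PROB if tok != "<eos>"]

def tokenizations(s, vocab):
    """Return every way of segmenting the string s into tokens from vocab."""
    if not s:
        return [[]]
    results = []
    for tok in vocab:
        if s.startswith(tok):
            for rest in tokenizations(s[len(tok):], vocab):
                results.append([tok] + rest)
    return results

def trajectory_prob(tokens):
    """Probability of one token trajectory followed by end-of-sequence."""
    prob = TOKEN_PROB["<eos>"]
    for tok in tokens:
        prob *= TOKEN_PROB[tok]
    return prob

def string_prob(s):
    """Marginal probability of the string: sum over all of its tokenizations."""
    return sum(trajectory_prob(toks) for toks in tokenizations(s, VOCAB))

for toks in tokenizations("abc", VOCAB):
    print(toks, trajectory_prob(toks))
print("marginal:", string_prob("abc"))
```

Scoring only one trajectory assigns the string a lower probability than the marginal; how much is lost depends on the model, and the toy numbers here are not meant to reflect the behavior of real tokenizers.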
Language models pre-trained on a next-word objective reflect the distribution of their pre-training
corpora, but often the inference-time needs of tasks necessitate that LMs modify this base distribution.
One approach to this class of problems is fine-tuning or reinforcement learning via some set of data
that more closely mirrors the target task, such as via reinforcement learning from human feedback
(RLHF) (Ziegler et al., 2019; Stiennon et al., 2020; Bai et al., 2022; Ouyang et al., 2022), but this
method comes with challenges such as hyperparameter sensitivity and distributional collapse (Zheng
et al., 2023; Zhu et al., 2023; Xiong et al., 2024). Some of these drawbacks can be mitigated by
utilizing on-policy data (Tajwar et al., 2024) and imposing a KL penalty that penalizes shifting an LM
too far from its prior distribution, casting optimization as a variational inference problem (Korbak
et al., 2022; Amini et al., 2025).
Another inference-time approach to controlled generation for an LM is via direct modification to the
LM’s sampling distribution. This may be done via controlling intermediate layer activations with
classifier guidance (Cheng et al., 2024), guiding autoregressive generation with a proxy probabilistic
model for which estimation of the conditional density is tractable (Zhang et al., 2023a), or most
commonly by directly intervening on the final logits before sampling to impose intersection with a
potential function. Pascual et al. (2021) presented an early variant of such logit-biasing to encourage
the presence of predefined guide words in generations.
This pattern is employed more broadly for hard constraints via logit-masking, setting the probability
associated with particular tokens to zero, forcing the LM to sample from a subset of its distribution
over sequences. This approach is used in most of the grammar-constrained semantic parsing work
outlined in the previous section. Most recently, there have been attempts to restrict and re-weight
generations not only via grammars but through additional expensive potentials such as grounded
affordances in robotics settings (Ahn et al., 2022; Huang et al., 2024). However, in all of these works,
constraints are imposed greedily, resulting in a local product of experts construction, and care is not
taken to appropriately target the implied global product of experts. It should then come as no surprise
that while standard approaches to grammar-constrained generation have been successful, they have
been far from a silver bullet (Tam et al., 2024).
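The gap between the local and the global product of experts can be seen exactly in a two-token toy problem. Everything in this sketch is an assumption made for illustration: the toy LM `next_probs`, the constraint `valid` (which forbids the string "aa"), and the function names.

```python
from itertools import product

VOCAB = "ab"
LENGTH = 2

def next_probs(prefix):
    """Toy LM: next-token distribution given the prefix (values are made up)."""
    if prefix == "a":
        return {"a": 0.9, "b": 0.1}
    return {"a": 0.5, "b": 0.5}

def valid(seq):
    """Hard constraint on complete sequences."""
    return seq != "aa"

def can_complete(prefix):
    """True if some valid complete sequence starts with this prefix."""
    if len(prefix) == LENGTH:
        return valid(prefix)
    return any(can_complete(prefix + t) for t in VOCAB)

def lm_prob(seq):
    prob = 1.0
    for i, tok in enumerate(seq):
        prob *= next_probs(seq[:i])[tok]
    return prob

def global_dist():
    """Global product of experts: LM probability times constraint, renormalized once."""
    seqs = ["".join(s) for s in product(VOCAB, repeat=LENGTH)]
    weights = {s: lm_prob(s) for s in seqs if valid(s)}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

def local_dist():
    """Local product of experts: mask and renormalize greedily at every step."""
    dist = {"": 1.0}
    for _ in range(LENGTH):
        new = {}
        for prefix, mass in dist.items():
            allowed = {t: p for t, p in next_probs(prefix).items()
                       if can_complete(prefix + t)}
            z = sum(allowed.values())
            for t, p in allowed.items():
                new[prefix + t] = mass * p / z
        dist = new
    return dist

print("global:", global_dist())
print("local: ", local_dist())
```

Greedy masking commits to the prefix "a" with probability 0.5 and is then forced to emit "b", so "ab" receives half of the mass, whereas under the global distribution it receives 0.05 / 0.55, roughly 0.09.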
This leads to a third line of work that formulates constrained generation from language models as
posterior inference (Zhang et al., 2023a), and employs approximate inference to sample from the
desired target distribution. This is in contrast to yet another line of work that views constrained
decoding as an optimization problem, and tackles it via search (Meister et al., 2020; Lu et al., 2021;
Zhang et al., 2023b) or continuous optimization (Dathathri et al., 2019; Kumar et al., 2021).
Several approximate inference algorithms have been explored for generating constrained samples
from LMs, including rejection sampling (Poesia et al., 2022), as well as MCMC (Miao et al., 2019;
Hie et al., 2022; Zhang et al., 2020; Qin et al., 2022; Kumar et al., 2022; Du et al., 2024). A weakness
of MCMC-based approaches is that they do not fully exploit the autoregressive factorization of
modern language models; each edit to a candidate sequence requires re-evaluating the entire sequence
(or at least the entire suffix) to compute a new target density.
Lew et al. (2023) propose SMC steering of LMs via probabilistic programming specifications. This
work enables provably accurate posterior sampling from such conditional targets, globally steering
generation while only ever computing local constraints. Our approach builds on their work.
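How local computations can still target a global distribution is sketched below on a two-token toy problem. This is a simplified illustration, not the algorithm of Lew et al. (2023) or of the present paper: the toy LM `next_probs`, the constraint `valid`, and the use of multinomial resampling at every step are assumptions. Each particle is extended from the locally masked distribution and weighted by the LM mass of the tokens the mask allowed, which is the correction that greedy masking omits.

```python
import random

VOCAB = "ab"
LENGTH = 2

def next_probs(prefix):
    """Toy LM: next-token distribution given the prefix (values are made up)."""
    if prefix == "a":
        return {"a": 0.9, "b": 0.1}
    return {"a": 0.5, "b": 0.5}

def valid(seq):
    """Hard constraint on complete sequences: the string "aa" is forbidden."""
    return seq != "aa"

def can_complete(prefix):
    if len(prefix) == LENGTH:
        return valid(prefix)
    return any(can_complete(prefix + t) for t in VOCAB)

def smc(n_particles, rng):
    particles = [""] * n_particles
    for _ in range(LENGTH):
        extended, weights = [], []
        for x in particles:
            # Local constraint: keep only tokens that can still lead to a valid string.
            allowed = {t: p for t, p in next_probs(x).items() if can_complete(x + t)}
            tokens, probs = list(allowed), list(allowed.values())
            tok = rng.choices(tokens, weights=probs)[0]
            extended.append(x + tok)
            # Incremental importance weight: LM mass that survived the mask.
            weights.append(sum(probs))
        # Multinomial resampling focuses particles on high-weight prefixes.
        particles = rng.choices(extended, weights=weights, k=n_particles)
    return particles

rng = random.Random(0)
samples = smc(20000, rng)
print("estimated P(ab):", samples.count("ab") / len(samples))
print("global    P(ab):", 0.05 / 0.55)
```

In this toy problem, particles that pass through the prefix "a" receive weight 0.1 at the second step, so after resampling the fraction of "ab" particles approaches the global value rather than the 0.5 produced by greedy masking.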
Shortly thereafter, Zhao et al. (2024) independently developed a framework for expressing various
LM tasks as probabilistic inference problems that can be tackled with SMC. Similar to our work,
Zhao et al. (2024) guide SMC with intermediate targets—in their case, learned twist functions via
a novel contrastive method—that enable estimation of the expected future value of each candidate
partial sequence. Their work also developed methods for evaluating LM inference algorithms via
bi-directional bounds on the log-partition function that can be used to estimate the KL divergence
between the inference and target distributions. In contrast to this prior work, our approach to SMC
leverages incremental static and dynamic analyses to inform our proposal distributions and twist
functions, as opposed to learning components of these algorithms via a costly contrastive fine-tuning
procedure. In addition, our results directly relate the quality of our posterior approximation to
improved performance on a series of standard, difficult benchmark tasks.
Concurrent with our work, Park et al. (2025) have highlighted the distinction between the prevalent
locally constrained decoding approach and the more accurate targeting of the global distribution
that arises from combining language models with constraints. Their approach to
approximating the global distribution is based on the concept of expected future grammaticality, which
is the probability that the completion to be sampled from the LM will comply with the given
grammar. The authors describe an iterative algorithm that approximates the global distribution by
refining the estimates of the expected future grammaticality. However, the proposed strategy shows
relatively slow convergence, was specifically designed for a CFG constraint, and may not be easily
adaptable to constraining with multiple potential functions.
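The role of expected future grammaticality can be illustrated by computing it exactly on a two-token toy problem. This sketch is not the iterative algorithm of Park et al. (2025): the toy LM `next_probs`, the constraint `valid`, and the exhaustive recursion in `efg` are assumptions that are only feasible because the example is tiny.

```python
VOCAB = "ab"
LENGTH = 2

def next_probs(prefix):
    """Toy LM: next-token distribution given the prefix (values are made up)."""
    if prefix == "a":
        return {"a": 0.9, "b": 0.1}
    return {"a": 0.5, "b": 0.5}

def valid(seq):
    """Stand-in for grammaticality: the string "aa" is ungrammatical."""
    return seq != "aa"

def efg(prefix):
    """Probability that an LM completion of `prefix` is grammatical."""
    if len(prefix) == LENGTH:
        return 1.0 if valid(prefix) else 0.0
    return sum(p * efg(prefix + t) for t, p in next_probs(prefix).items())

def global_next(prefix):
    """Next-token distribution of the global target: LM probability reweighted by EFG."""
    denom = efg(prefix)
    return {t: p * efg(prefix + t) / denom for t, p in next_probs(prefix).items()}

print("EFG of empty prefix:", efg(""))
print("global first-token distribution:", global_next(""))
```

With exact values of this quantity, token-by-token sampling reproduces the global distribution; the difficulty in practice is that the sum over completions is intractable, which is why it has to be estimated.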