Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
Abstract
Recent impressive results from large reasoning models have been interpreted as a triumph of Chain
of Thought (CoT), and especially of the process of training on CoTs sampled from base LLMs in
order to help find new reasoning patterns. In this paper, we critically examine that interpretation by
investigating how the semantics of intermediate tokens—often anthropomorphized as “thoughts” or
reasoning traces and which are claimed to display behaviors like backtracking, self-verification, and
meta-cognition—actually influence model performance. We train transformer models on formally
verifiable reasoning traces and solutions, constraining both intermediate steps and final outputs to
align with those of a formal solver. By constructing a formal interpreter of the semantics of our
problems and intended algorithm, we systematically evaluate not only solution accuracy but also the
correctness of intermediate traces, thus allowing us to evaluate whether the latter causally influences
the former. Our experiments involve training transformer models on traces and solutions generated
by A* search. We notice that, despite significant improvements over the solution-only baseline, models
trained on entirely correct traces still produce invalid reasoning traces when arriving at correct
solutions. To further show that trace accuracy is only loosely connected to solution accuracy, we then
train models on noisy, corrupted traces which have no relation to the specific problem each is paired
with, and find that not only does performance remain largely consistent with models trained on correct
data, but in some cases can improve upon it and generalize more robustly on out-of-distribution
tasks. These results challenge the assumption that intermediate tokens or “Chains of Thought” reflect
or induce predictable reasoning behaviors and caution against anthropomorphizing such outputs or
over-interpreting them (despite their mostly correct forms) as evidence of human-like or algorithmic
behaviors in language models.
1 Introduction
Recent advances in general planning and problem solving have been spearheaded by so-called “Long Chain-of-Thought”
models, most notably DeepSeek’s R1 [1]. These transformer-based large language models are further post-trained
using iterative fine-tuning and reinforcement learning methods. Following the now-standard teacher-forced pre-training,
instruction fine-tuning, and preference alignment stages, they undergo additional training on reasoning tasks: at each
step, the model is presented with a question; it generates a sequence of intermediate tokens (colloquially or perhaps
fancifully called a “Chain of Thought” or “reasoning trace”); and it ends with a specially delimited answer sequence.
After verification of this answer sequence by a formal system, the model’s parameters are updated so that it is more
likely to output sequences that end in correct answers and less likely to output those that end in incorrect answers.
While (typically) no optimization pressure is applied to the intermediate tokens [2, 3], empirically it has been observed
that language models perform better on many domains if they output such tokens first [4, 5, 6, 7, 8, 1, 9, 10, 11]. While
the fact of the performance increase is well-known, the reasons for it are less clear. Previous work has often framed it in
anthropomorphic terms, claiming that these models are “thinking” before outputting their answers [4, 12, 1, 13, 3, 14].
Simultaneously, the process of performing more auto-regressive forward passes before outputting the final answer has
been credited as an instance of inference-time scaling – that is, these models are assumed to be doing problem-adaptive
computation.
Famously, DeepSeek’s R1 paper claimed that one of the most impressive observed behaviors of their trained models
was the so-called “aha” moment: as part of the chain of thought it was producing in order to answer some question, the
model output the token “aha”, seeming to indicate that it had come upon a sudden realization. While a human may
say “aha” to indicate exactly a sudden internal state change, this interpretation is unwarranted for models which do not
have any such internal state, and which on the next forward pass will only differ from the pre-aha pass by the inclusion
of that single token in their context. Interpreting this token as meaningful in this way requires making an additional
assumption that has thus far been brushed to the side in discussions of how long CoT models function and what they do
– that the derivational traces they produce are semantically meaningful in the same way that the traces they were trained
on were or at least in the way that a human might expect them to be.
For R1 and similar large models, this is nearly impossible to check. The intermediate tokens that massive pre-trained
and post-RL’d models produce meander for dozens of pages, are written wholly in ambiguous and polysemantic natural
language, and – perhaps much worse – are the result of long, opaque training processes on data that we have no access
to and cannot compare against.
In this paper, we shed some light on the question of whether intermediate traces are semantically meaningful. Following
previous work that elucidated important functional aspects of large scale models through controlled small scale
experiments [15, 16, 17] and working within a sort of “model organism” paradigm, we focus the current work on fully
controlled, open, and replicable models trained from scratch. Our models are trained on a simple and well-understood
shortest path planning problem for randomly generated mazes, with our training runs including varying kinds of
intermediate traces – from none to ones generated by the classic A∗ algorithm to noisy and irrelevant ones. This setup
is not only well understood as a classical computer science problem, but has also become a well-studied domain for
trace-augmented transformer training [18, 19, 20, 21].
We approach the problem of understanding intermediate token semantics from three major novel angles, performing
empirical evaluations on models we train on small planning tasks. First, we construct a validator for A∗ execution traces
and use it to validate and compare trace accuracy to solution accuracy, finding only a loose correlation between the two.
Then, we train half-billion-parameter Qwen models on no traces, correct traces, and deliberately irrelevant traces. We present a dataset manipulation that – despite removing all problem-specific semantics – leads to trained models that perform better on both in- and out-of-distribution tasks. We argue that, if performance is the goal, assuming human-like or algorithm-interpretable trace semantics is not only unnecessary but potentially misleading.
2 Related Work
Training Transformer models from scratch to plan using derivational traces - There have been various approaches to training transformer models on algorithmic execution traces. Searchformer trained transformer models on A* search execution traces to emulate the A* algorithm for solving pathfinding tasks [18]. Similarly, Stream-of-Search trained transformers to internalize the search processes of breadth-first search (BFS) and depth-first search (DFS) within a linguistic representation for solving the Countdown arithmetic game [22]. Yang et al. [23] trained transformer models on search trajectories to mimic search strategies such as Monte Carlo Tree Search or BFS. In concurrent research, System 1.x introduced a dual-model approach coordinated by an external meta-controller, where one model quickly generates solutions without explicit reasoning traces while the other operates more deliberately, providing step-by-step explanations [24]. Similarly, SwiftSage employed multiple models within an agent-style workflow to differentiate between fast and slow modes of reasoning [25]. In these works, the System 2 or slow-mode models are transformers trained on execution traces of formal procedures. Pan et al. [26] trained transformer models to solve Boolean SAT problems and also measured trace correctness with respect to the abstract Davis–Putnam–Logemann–Loveland (DPLL) procedure on which the models were trained. However, their work focused mainly on showing that decoder-only transformers with CoT can solve Boolean SAT problems, whereas ours provides a much deeper analysis of trace-plan correlation and semantic correctness.
Trace evaluation - Previous studies have examined whether intermediate derivational steps correspond meaningfully to
the final answers generated by Large Language Models (LLMs) and Large Reasoning Models (LRMs), finding notable
discrepancies. Specifically, prior research has demonstrated that intermediate CoT steps, despite appearing coherent,
do not reliably reflect the actual computations used by models to reach their conclusions [27, 28]. Even SOTA reasoning models, whose performance gains are typically attributed to their explicit intermediate "think" tokens, have been shown to produce intermediate reasoning chains that fail to meaningfully correspond with the underlying computations leading to their final outputs [29, 30, 31, 2].
Training on Noisy traces - Li et al. [11] investigated the impact of perturbations in reasoning traces by distilling
DeepSeek R1 and QwQ 32B-Preview’s derivational outputs on math and coding tasks. Their findings reveal that models
remain robust to noise in the trace—showing improved performance even when trained on derivations containing
incorrect mathematical operations, relative to the base model. However, as previously noted, systematically evaluating
the semantic correctness of natural language derivational traces remains infeasible. Therefore, no definitive conclusions
can be drawn regarding the semantic alignment between reasoning traces and final answers. Nonetheless, such work does
seem to indicate that there is no strong causal connection between trace correctness and solution correctness.
Dualformer, an extension of Searchformer, trained transformer models on truncated A* derivational traces by arbitrarily
removing steps from the original A* search process [19]. This pruning renders the traces semantically invalid, as they
no longer reflect any faithful execution of the A* algorithm. Despite this, models trained on these shortened traces
outperformed those trained on full A* traces used in Searchformer. These findings further support the notion that the
semantic correctness of derivational traces with respect to ground-truth algorithms like A* may not be causally linked
to the correctness of the final output plan.
Post-training methods - Numerous approaches have shown that supervised fine-tuning (SFT) over derivational traces and RL methods improve the task performance of LLMs on planning and reasoning tasks [32, 11, 10, 33, 34, 35, 1, 36].
Among these, one of the first approaches to show impressive results was STaR, a method in which the LLM is prompted to generate multiple responses, with intermediate CoTs, for a given problem; the responses are then filtered by whether the final answer is correct, and the LLM is further fine-tuned on the examples where it produced the correct final answer [32]. This approach has been shown to significantly outperform direct answer-prediction fine-tuning. Recently, since the release of DeepSeek's R1, which is post-trained using GRPO [1], two major types of post-training methods have emerged: 1) fine-tuning LLMs on derivational traces of LRMs, mainly R1, to enable CoT generation in smaller LLMs (model distillation), and 2) using various RL algorithms, mainly different versions of GRPO, to improve task performance [37, 38]. In all of these approaches there is no semantic evaluation of the
derivational trace generated by the LLMs or LRMs. In the case of Iterative SFT, the model responses are filtered based
on the correctness of the final answers and the smaller models are trained to mimic these long derivational traces to
"elicit" reasoning and procedure following. Similarly for RL based approaches, reward is determined by sound verifiers
based only the final answers. Since there are no process reward models, there is no local evaluation of the correctness of
the produced intermediate tokens.
3 Background
Though recently popularized by Large Reasoning Models, especially DeepSeek’s R1 [3], training on intermediate traces
in order to improve transformer performance dates back to at least GPT-2 [4] and has been extended and analyzed from
many angles [18, 32, 22, 19, 39, 40, 41]. While these papers demonstrated ways to improve final answer accuracy, they
neither evaluate the trace accuracy nor do they explicitly attempt to train on incorrect or irrelevant traces.
Thus, while they do show that accuracy increases, they leave open the question of whether that accuracy increase
actually stems from the additional semantic information in the trace. In many cases, especially with pre-trained models,
it is nearly impossible to formally verify reasoning traces, due to the ambiguity of natural language and the lack of a
clear ground truth. However, for small, well-scoped formal domains like the gridworld path planning domain used in
[18, 19] and this paper, and with models carefully trained from scratch on those domains, we have the ability to check
whether generated traces follow the exact semantics enforced in the training data and causally predict the final solutions
that the model outputs.
We consider a standard grid-based path-finding domain. The task is to find a legal path between a given start cell and
goal cell in a 30 × 30 grid. Every cell of this grid is either free (traversable) or a wall (impassable). The agent begins at
the start state, and at every state may take one of four actions: go up, go down, go left, or go right. The transformer is
given a full description of this problem (in token format – we follow the formulation used by [18] and [19]) and must
output as its final answer a plan, which consists of a sequence of actions. A plan is considered correct if it is executable –
that is, every action it presents moves the agent from a free cell to an adjacent free cell – and its final action results in
the agent being at the goal cell.
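To make this notion of plan validity concrete, the following minimal Python sketch checks executability; the function and variable names are illustrative, and the grid/wall representation (a set of wall coordinates) is an assumption rather than our exact data format.

```python
# Minimal sketch of plan-validity checking on a 30x30 grid.
# `walls` is a set of (x, y) wall cells; `plan` is a list of moves drawn
# from {"up", "down", "left", "right"}. Names and representation are illustrative.

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def is_valid_plan(plan, start, goal, walls, size=30):
    """A plan is valid iff every action moves the agent between adjacent
    free cells and the final action leaves the agent at the goal."""
    def free(cell):
        x, y = cell
        return 0 <= x < size and 0 <= y < size and cell not in walls

    pos = start
    if not free(pos):
        return False
    for action in plan:
        if action not in MOVES:
            return False            # unparseable action
        dx, dy = MOVES[action]
        pos = (pos[0] + dx, pos[1] + dy)
        if not free(pos):
            return False            # steps into a wall or off the grid
    return pos == goal              # must terminate at the goal cell

# Example: a two-step plan on an empty grid.
assert is_valid_plan(["right", "up"], (0, 0), (1, 1), walls=set())
```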
Figure 1: Examples of mazes. The left is generated by Wilson’s algorithm and is used for model training. The right is
generated by the Drunkard’s Walk algorithm and used to evaluate models as an out of distribution task. The goal is
represented by a green square and the start state by a yellow square. Black squares represent impassable walls. Blue
squares represent steps along the optimal path (as found by A∗ ). Gray squares are squares that were explored by A∗ but
are not along the optimal path. White squares are unexplored traversable squares.
We generate navigation problems using diverse generation algorithms, resulting in varied structural patterns and
exploration dynamics. This enables systematic out-of-distribution (OOD) evaluation by testing models on maze types
unseen during training – all training was done on mazes generated with Wilson's algorithm. These generation algorithms
can be sorted into two major categories: 1) algorithms that do not permit cycles and sample over the spanning trees of
the 30 × 30 grid and 2) algorithms that permit loops and create noisy, less-structured dungeon or cave-like instances.
For all algorithms except SearchFormer’s, which has its own start and goal generation loop, we sample a legal (start,
goal) pair after maze generation.
Acyclic Maze Generation
1. Wilson’s algorithm: This is the algorithm that we use to generate mazes for training models. Wilson’s
algorithm generates uniform random mazes by performing loop-erased random walks from unvisited cells
until they connect to the current maze [42]. Each walk removes any loops it creates, ensuring a valid tree
structure. This process continues until all cells are included, producing a uniform sample from the space of all
possible spanning trees of the 30 × 30 graph.
2. Kruskal’s algorithm: Kruskal’s algorithm, originally proposed for finding a minimum spanning forest of an
undirected edge-weighted graph [43], generates mazes by treating each cell as a node and randomly removing
walls between unconnected regions, using a union–find structure to avoid cycles. This results in a fully
connected maze without loops, though the maze distribution is not perfectly uniform. The method produces
mazes biased towards short local connections and dead ends.
3. Randomized Depth-First Search algorithm: The randomized depth-first search (DFS) or recursive backtracker algorithm generates mazes by carving a path forward until reaching a dead-end [44]. When it hits a
dead-end (no unvisited neighbors), it backtracks until it finds a new direction to explore, repeating until all
cells are visited and connected into a complete maze. Depth-first search is biased towards generating mazes
with low branching factors and many long corridors.
Cave Generation
4. Drunkard’s Walk: We implement a version of the “Drunkard’s Walk” algorithm, as described by [45], originally used for procedurally generating dungeons in top-down two-dimensional video games. Starting from a grid of solid walls, a random walk is performed, carving out the current cell on every step. The walk continues until a predefined number or percentage of floor tiles has been dug out. This method preserves cycles, producing cave-like structures with open chambers and looping corridors. The output space includes grid states unreachable by perfect acyclic maze generators. (A minimal sketch of this procedure is given after this list.)
5. SearchFormer-style generation: We also implement the random generation algorithm used in the SearchFormer paper [18], though we use it for evaluation rather than training. Tasks are generated by exhaustive rejection sampling: first, randomly select a percentage between 30% and 50% and mark that fraction of cells as walls; then randomly choose a start and goal location and execute A∗ to find an optimal plan; finally, reject unsolvable, too-easy, or duplicate instances and resample. These instances also permit loops and so are out of distribution for our models.
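As a concrete illustration of item 4 above, here is a minimal Python sketch of the Drunkard's Walk generator. The function name, the floor-fraction parameter, and the boundary handling are illustrative assumptions, not our exact settings.

```python
import random

def drunkards_walk(size=30, floor_fraction=0.45, seed=None):
    """Sketch of the Drunkard's Walk cave generator: start from a grid of
    solid walls and random-walk, carving out every visited cell, until the
    requested fraction of floor tiles has been dug out."""
    rng = random.Random(seed)
    walls = {(x, y) for x in range(size) for y in range(size)}
    target = int(floor_fraction * size * size)

    x, y = rng.randrange(size), rng.randrange(size)
    carved = set()
    while len(carved) < target:
        carved.add((x, y))
        walls.discard((x, y))
        dx, dy = rng.choice([(0, 1), (0, -1), (1, 0), (-1, 0)])
        if 0 <= x + dx < size and 0 <= y + dy < size:   # stay on the grid
            x, y = x + dx, y + dy
    return walls   # remaining cells are impassable; carved cells are floor
```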
A* is a classic best-first graph–search procedure that combines the uniform-cost guarantee of Dijkstra’s algorithm [46]
with domain-specific heuristics to focus exploration on promising states, originally introduced to compute minimum-cost
paths in state-space graphs [47].
The algorithm maintains an open list (a priority queue) keyed by f(n) = g(n) + h(n), where g(n) is the exact cost
from the start and h(n) is a heuristic estimate to the goal, and also maintains a closed list of already visited nodes. It
repeatedly pops the open-list node with the smallest f; if this is the goal, it reconstructs the path that led to this node
and this is returned as the final plan. Otherwise, it generates child nodes (in our case, traversable neighbor cells) and
calculates their g and f values. For each node, it either inserts it into the open list or – if the node is already in the list –
updates its g value if the new value is lower. The popped node is added to the closed list to prevent re-expansion.
The effectiveness of A* is dependent on the heuristic it is implemented with. For solvable graph search problems like
the ones featured in this paper, any consistent (h(n) ≤ c(n, n′) + h(n′) for all neighboring n′) heuristic will guarantee
that the plan returned is not only satisficing but optimal [48].
For the maze path-planning problems we examine in this paper, we use the standard Manhattan heuristic, h(n) = |x_n − x_g| + |y_n − y_g|, which computes the sum of horizontal and vertical displacements between a cell and the goal.
On a 2-D grid with only orthogonal, unit-cost movement, this heuristic is consistent, ensuring A* returns an optimal
path.
Finally, following SearchFormer and Stream of Search, we modify the A* implementation to output a linearized
execution trace [22, 18]. That is, whenever the algorithm creates a child node and adds it to the open list, it prints
create x y cA cB and when it closes a node and adds it to the closed list, it prints close x y cA cB. Here, ‘A’ in
cA represents the exact cost from the start state to the node (i.e., the g(n) value) and ‘B’ in cB represents the heuristic
estimate from that node to the goal state (i.e., the h(n) value). Similar to Searchformer notation [18], we use the prefix
"c" to differentiate between the node co-ordinates and its cost estimations. In the next section, we construct an A*
validator that reverses this process – it takes in a linearized trace and attempts to simulate the corresponding open and
closed list operations to check if they are valid with respect to the semantics of this implementation.
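To make the linearized trace format concrete, the sketch below shows an A∗ implementation on the grid that emits create and close records carrying the g and h values described above. The bookkeeping details (tie-breaking, re-emitting create when a cost improves, returning the plan as a cell sequence rather than as up/down/left/right actions) are illustrative simplifications and are not claimed to match the exact generator, which follows [18, 22].

```python
import heapq

def manhattan(cell, goal):
    return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

def a_star_with_trace(start, goal, walls, size=30):
    """A* with Manhattan heuristic that emits a linearized trace of
    'create x y cA cB' and 'close x y cA cB' records."""
    trace = []
    g = {start: 0}
    parent = {start: None}
    h0 = manhattan(start, goal)
    open_heap = [(h0, start)]
    open_set = {start}
    closed = set()
    trace.append(f"create {start[0]} {start[1]} c0 c{h0}")

    while open_heap:
        _, node = heapq.heappop(open_heap)
        if node in closed or node not in open_set:
            continue                      # stale heap entry
        open_set.discard(node)
        closed.add(node)
        trace.append(f"close {node[0]} {node[1]} c{g[node]} c{manhattan(node, goal)}")
        if node == goal:
            # Reconstruct the plan as a sequence of cells (converting to
            # up/down/left/right actions is a simple post-processing step).
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return trace, list(reversed(path))
        for dx, dy in [(0, 1), (0, -1), (1, 0), (-1, 0)]:
            child = (node[0] + dx, node[1] + dy)
            if not (0 <= child[0] < size and 0 <= child[1] < size):
                continue
            if child in walls or child in closed:
                continue
            new_g = g[node] + 1
            if child not in g or new_g < g[child]:
                g[child] = new_g
                parent[child] = node
                h = manhattan(child, goal)
                heapq.heappush(open_heap, (new_g + h, child))
                open_set.add(child)
                trace.append(f"create {child[0]} {child[1]} c{new_g} c{h}")
    return trace, None                    # unsolvable instance
```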
Figure 2: Trace validation procedure. Our A∗ validator runs through the model’s output stream sequentially. Assuming
no parsing errors, it will flag a trace as invalid if at some point it contains an invalid action. The left bottom corner is
(0, 0). The goal is represented by a green square and the start state by a yellow square.
While previous work evaluated the final accuracy of trace-trained models, it did not evaluate the traces themselves. For
large, production ready RL-post-trained models like DeepSeek’s R1, this is practically impossible. For even a simple
Figure 3: Plan versus trace validity for the model trained on correct traces, measured across domains. Wilson =
generated by Wilson’s algorithm, Kruskal = mazes generated by Kruskal’s algorithm, DFS = mazes generated by
Depth-First Search, SF-Style = instances generated in the SearchFormer Style, Drunkard = instances generated using
the Drunkard’s algorithm.
query, the model produces dozens of pages of convoluted and meandering output before arriving at an answer, and this
output is all in natural language, which makes it very easy to read multiple equally valid interpretations into it.
To truly tell whether the traces that were trained on helped in the expected way, we need a formal way of validating
their correctness. By training models on traces produced by a well-known algorithm with well-known semantics, it is
possible to check whether the model’s emulations of the algorithm’s execution trace are correct.
We construct a formal verifier for A∗ traces. The format for these traces follows [18], and is described in more detail in
Section 3. Essentially, our validator consumes the generated trace and simulates the operations proposed in that trace
on open and closed lists. It runs through the generated trace sequentially, parsing each action x y cA cB sequence
as an operation and using it to update its open and closed lists. It marks a trace as valid if it can correctly execute this procedure until it closes the goal node. Errors in execution fall into one of the following categories (a condensed sketch of the validator's replay loop is given after this list):
• Parsing Error: a substring is malformed and does not parse into either a create or a close action with the
correct arguments.
• Invalid Neighbor: the current create action is attempting to create an illegal child, either referencing a wall
cell or a cell that is not adjacent to the last closed node.
• Already Closed: the current create action references a node that has already been closed.
• Not in Open List: the current close action is referencing a node that is not in the open list.
• Not Lowest f -value: the current close action is attempting to close a node when there is at least one other
node in the open list with a lower f -value.
• Goal Not Reached: after the entire sequence was processed, the goal node was not in the closed list, and so
the reconstruction step cannot proceed.
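For concreteness, here is the condensed sketch of the validator's replay loop referenced above. It reports the first violated condition using the error categories just listed; the regular expression, the function signature, and the handling of the initial create (before any node has been closed) are illustrative choices rather than a verbatim reimplementation.

```python
import re

LINE = re.compile(r"^(create|close) (\d+) (\d+) c(\d+) c(\d+)$")

def validate_trace(lines, start, goal, walls):
    """Replay create/close operations against simulated open and closed
    lists and return the first error category (or 'valid')."""
    open_list = {}       # cell -> f value implied by the printed g and h
    closed = set()
    last_closed = None   # created children must neighbor the last closed node

    for line in lines:
        m = LINE.match(line.strip())
        if m is None:
            return "Parsing Error"
        op = m.group(1)
        x, y, g, h = map(int, m.groups()[1:])
        cell, f = (x, y), g + h

        if op == "create":
            adjacent = (last_closed is None or
                        abs(cell[0] - last_closed[0]) + abs(cell[1] - last_closed[1]) == 1)
            if cell in walls or not adjacent:
                return "Invalid Neighbor"
            if cell in closed:
                return "Already Closed"
            open_list[cell] = f
        else:  # close
            if cell not in open_list:
                return "Not in Open List"
            if any(v < open_list[cell] for v in open_list.values()):
                return "Not Lowest f-value"
            del open_list[cell]
            closed.add(cell)
            last_closed = cell

    return "valid" if goal in closed else "Goal Not Reached"
```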
With this verifier in hand, we can now distinguish between plan validity and trace validity for models trained on this
kind of dataset. To construct our training sets, we generate 50,000 mazes using Wilson’s algorithm, and randomly select
a start and goal cell. Then, we use A* with the Manhattan distance heuristic to find an optimal plan for each maze as
well as to produce a trace that is saved with each datapoint.
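A sketch of the resulting data-generation loop is shown below; wilson_maze is a hypothetical helper standing in for our Wilson's-algorithm generator, and a_star_with_trace refers to the A∗ sketch given earlier.

```python
import random

def build_dataset(n=50_000, size=30, seed=0):
    """Sketch: pair each Wilson-generated maze and sampled (start, goal)
    with the A* trace and optimal plan found by the solver."""
    rng = random.Random(seed)
    data = []
    while len(data) < n:
        walls = wilson_maze(size)                     # hypothetical helper
        free = [(x, y) for x in range(size) for y in range(size)
                if (x, y) not in walls]
        start, goal = rng.sample(free, 2)
        trace, plan = a_star_with_trace(start, goal, walls, size)
        if plan:                                      # keep solvable instances
            data.append({"walls": walls, "start": start, "goal": goal,
                         "trace": trace, "plan": plan})
    return data
```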
We modify the architecture of the Qwen2.5 0.5B model [49] to support a vocabulary of exactly 944 different tokens (which
reduces the parameter count to about 380 million from 500 million), randomly initialize the model, and then train it for
85,000 training steps with a batch size of 8 on two NVIDIA H100s. The model has a context length of 32,000 tokens
to support the long lengths of intermediate token generation. (Our other experiments later in the paper also use this
architecture, but train on different datasets, from solution-only through to irrelevant and noisy traces. All code and data
will be made public.)
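The following sketch, assuming the Hugging Face transformers library, shows one way to carry out the architecture modification and random initialization described above; the model identifier and the from_config path are our assumptions, and the training loop (optimizer, schedule, teacher forcing over prompt + trace + plan sequences) is omitted.

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Start from the published Qwen2.5-0.5B architecture, shrink the vocabulary
# to the 944 maze/trace tokens, and keep a long context for A* traces.
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B")
config.vocab_size = 944
config.max_position_embeddings = 32_000

# from_config builds a randomly initialized model (no pretrained weights).
model = AutoModelForCausalLM.from_config(config)
print(sum(p.numel() for p in model.parameters()))   # roughly 0.38B parameters
```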
We test this model, trained on Wilson mazes, on a thousand instances each of mazes generated by the Wilson, Kruskal, DFS, SF-Style, and Drunkard approaches, evaluating solution accuracy as well as trace validity. We present these
results as confusion matrices in Figure 3, with each domain represented by a separate matrix. These results break down
the correlation between model accuracy and trace validity. As can be seen from the results, trace accuracy is not a
perfect predictor of plan accuracy. In fact, as can be seen from the diagonal entries, the model can produce valid traces
and then continue on to produce an incorrect plan, or produce invalid traces and yet end up at a correct plan. (In the appendix, we also include a similar set of results for the models trained by [18], which show similar trends.)
If plan and trace validity are only loosely connected for models trained on the A* trace dataset, then perhaps the validity
of the trace isn’t as important to the performance increase as previously believed. To test this empirically, we construct
a second training dataset called Swap, which we build by randomly permuting reasoning traces between problems.
This dataset consists of the exact same 50,000 problems as the original trace dataset, but problem 1’s trace will be given to,
say, problem 4; problem 4’s will be given to problem 7; and so forth. In other words, while the traces continue to have
the right form and some generic domain information, they no longer have any connection to the specific problems they
are associated with. Training examples consist of a start and goal state, a maze definition, an A∗ trace for searching for
the shortest path across a totally unrelated maze from a different start and goal state, and the correct solution plan for
the original maze.
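A minimal sketch of the Swap construction follows; the field names and the permutation seed are illustrative.

```python
import random

def make_swap_dataset(dataset, seed=0):
    """Every problem keeps its own maze, start/goal, and correct solution
    plan, but receives the A* trace of some other problem via a random
    permutation of trace assignments."""
    rng = random.Random(seed)
    perm = list(range(len(dataset)))
    rng.shuffle(perm)
    return [{**example, "trace": dataset[j]["trace"]}
            for example, j in zip(dataset, perm)]
```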
What we find is that our most competent model not only maintains performance on the in-distribution test set, but generalizes better than the correct-trace model to the other maze distributions we test! All despite the lack of algorithmically valid semantics in the traces it was trained on and the traces it generates.
For these experiments, we continue to use the same model architecture described in the previous section, varying
the datasets we train on to see how they affect performance – even as they further corrupt or completely destroy
the correctness of the traces. For best results, we performed hyperparameter optimization (via Optuna [50]). We
provide additional details on hyperparameters and initializations in the Appendix, and we will publicly open-source our
codebase for full transparency.
The most basic training run is the standard solution-only baseline, where the model is trained on just solutions without
any derivational traces. The next baseline, following previous work [18, 19, 22] is training the model with A* generated
traces, teacher-forcing during training to make it output intermediate tokens before the final solution. These are the
models discussed in the previous section. Finally, we use the same underlying training data and distribution, but
modify it by corrupting the traces. Our trace corruption process is very simple: we randomly permute which problem
is associated with which trace – so, for example, the third problem might receive the fifth problem’s trace, which is an
execution trace of A* on an unrelated maze with unrelated start and goal states.
The problems in our training data are all generated by Wilson’s algorithm. For our test sets we generate data with
several maze generation algorithms (as described in Section 3), including Wilson’s algorithm, to get both in and out of
distribution data. Our training data consists of 50k samples, while our test sets each contain 1k.
Unintuitively, as seen in Table 1, the best model on both in- and out-of-distribution test sets turns out to be the model
trained on swapped (incorrect) traces! We see that the swapped model has a 0% trace validity – as it has been trained to
output well-structured but problem-irrelevant traces in response to every problem – but nevertheless performs noticeably
better than both the correct trace and solution-only baselines. An interesting point to note is the performance difference
on out-of-distribution datasets. While most of the performance differences are within a few percentage points, and
in-distribution testing results in near-identical performance, on the Drunkard dataset the swapped model is 10 times better than the original model, reaching 26% versus the correct-trace model’s 2.6%, and on the DFS maze set it reaches 41.7% versus the original model’s 30.8%.
Table 1: Performance of Swap, A* Trace, and Solution-Only Models across maze distributions. "Plan Val." = Plan
Validity, "Trace Val." = Trace Validity within Valid Plans
If intermediate tokens improve accuracy because they teach the model a given reasoning procedure, then we should
expect their influence on performance to fluctuate exactly with their connection to the problem. However, we find that
this is not always the case – in fact, intermediate token sequences that have almost nothing to do with the problem at hand can provide a significantly higher performance boost (and, counterintuitively, may even generalize better) than well-grounded, semantically meaningful execution traces, throwing doubt on the seemingly widespread
intuition that the effectiveness of traces stems from allowing the transformer to perform structured, interpretable, and
algorithmic procedures.
Our results hint that the impact of trace content on performance and the legibility of that content have been somewhat
conflated – if all we care about is increasing the accuracy and capability of a model, enforcing human readability may
be counterproductive, a lesson also mentioned in the R1 paper [3]. Furthermore, examining traces produced by a model
– though they may look right at first glance – is not necessarily informative if those traces are not predictive of the
model’s final answers.
Of course, if trace semantics don’t matter, then the question immediately arises: why does generating intermediate
tokens increase accuracy at all? We speculate that what is helping is finding the right prompt augmentation. That is, for
a given task prompt T, there exists a prompt augmentation PA which boosts the LLM’s performance on that task:

∃ PA s.t. P[Sol(LLM(T + PA), T)] > P[Sol(LLM(T), T)]

Here Sol(y, T) indicates that y solves T, LLM(x) is the model’s completion for input x, and P[·] denotes the probability of the event under the model’s sampling distribution. The central challenge then is to learn the Skolem function

PA = fθ(T, LLM),
that maps each task to an effective augmentation. This can be accomplished through modifying the model itself to
inherently and automatically augment prompts, as is the case in models that first generate long chains of intermediate
tokens before their final answers. Crucially, prompt augmentations have no need to be human-interpretable. In fact,
we see results that back this up in the adversarial prompting literature, where effective jailbreaks can be effected by
augmenting prompts with human-uninterpretable strings [51, 52, 53, 54] or modifying them with random syntactic
permutations, capitalizations, and shufflings [55].
6 Conclusion
In this paper, we challenged the prevailing narrative that intermediate tokens or “Chains of Thought” generated by
Large Reasoning Models like DeepSeek’s R1 are interpretable, semantically valid sequences with predictable effects on
the model’s behavior. As we don’t have access to any frontier LLM’s training data or even exact training procedure, and
since the traces these models output are in multiply-interpretable natural language without a concrete ground truth,
we designed a series of experiments building on previous smaller model reasoning work – mainly Searchformer and
Stream of Search [22, 18] – and constructed an A∗ trace validator, finding that there is only a loose correlation between
the correctness of the trace and the correctness of the output plan. We then trained additional models on noisy or
irrelevant traces and found that there are (nonsensical) trace formats that nevertheless maintain or even increase the
model’s performance – all despite them being much less informative or connected to the problem at hand. Finally, we
argue that, if the goal is to increase model performance, enforcing trace semantics is unnecessary and potentially very
misleading. Altogether, our counterintuitive results demonstrate ways in which common interpretations of Large
Reasoning Models may be anthropomorphizations or simplifications.
7 Acknowledgements
This research is supported in part by ONR grant N0001423-1-2409, DARPA grant HR00112520016, and gifts from
Qualcomm, J.P. Morgan and Amazon.
References
[1] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi
Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv
preprint arXiv:2501.12948, 2025.
[2] Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y Guan, Aleksander Madry, Wojciech Zaremba,
Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting
obfuscation. arXiv preprint arXiv:2503.11926, 2025.
[3] Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. R1-Zero’s “aha moment” in visual reasoning on a 2B non-SFT model. arXiv preprint arXiv:2503.05132, 2025.
[4] Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David
Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate
computation with language models. 2021.
[5] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al.
Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing
systems, 35:24824–24837, 2022.
[6] Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language
models. arXiv preprint arXiv:2210.03493, 2022.
[7] Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay
Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with
less training data and smaller model sizes. arXiv preprint arXiv:2305.02301, 2023.
[8] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models.
arXiv preprint arXiv:2306.08543, 2023.
[9] Jacob Pfau, William Merrill, and Samuel R Bowman. Let’s think dot by dot: Hidden computation in transformer
language models. arXiv preprint arXiv:2404.15758, 2024.
[10] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer,
Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint
arXiv:2501.19393, 2025.
[11] Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Eric Tang, Sumanth Hegde, Kourosh Hakhamaneshi,
Shishir G Patil, Matei Zaharia, et al. LLMs can easily learn to reason from demonstrations: structure, not content, is
what matters! arXiv preprint arXiv:2502.07374, 2025.
[12] Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors
that enable self-improving reasoners, or, four habits of highly effective stars. arXiv preprint arXiv:2503.01307,
2025.
[13] Shu Yang, Junchao Wu, Xin Chen, Yunze Xiao, Xinyi Yang, Derek F Wong, and Di Wang. Understanding aha
moments: from external observations to internal mechanisms. arXiv preprint arXiv:2504.02956, 2025.
[14] Sébastien Bubeck, Varun Chadrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee,
Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with
gpt-4, 2023.
[15] Boshi Wang, Xiang Yue, Yu Su, and Huan Sun. Grokked transformers are implicit reasoners: A mechanistic
journey to the edge of generalization. arXiv preprint arXiv:2405.15071, 2024.
[16] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond
overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022.
[17] Ziqian Zhong, Ziming Liu, Max Tegmark, and Jacob Andreas. The clock and the pizza: Two stories in mechanistic
explanation of neural networks. Advances in neural information processing systems, 36:27223–27250, 2023.
[18] Lucas Lehnert, Sainbayar Sukhbaatar, DiJia Su, Qinqing Zheng, Paul Mcvay, Michael Rabbat, and Yuan-
dong Tian. Beyond a*: Better planning with transformers via search dynamics bootstrapping. arXiv preprint
arXiv:2402.14083, 2024.
[19] DiJia Su, Sainbayar Sukhbaatar, Michael Rabbat, Yuandong Tian, and Qinqing Zheng. Dualformer: Controllable
fast and slow thinking by learning with randomized reasoning traces. In The Thirteenth International Conference
on Learning Representations, 2024.
[20] Niklas Nolte, Ouail Kitouni, Adina Williams, Mike Rabbat, and Mark Ibrahim. Transformers can navigate mazes
with multi-step prediction. arXiv preprint arXiv:2412.05117, 2024.
[21] Yongjing Yin, Junran Ding, Kai Song, and Yue Zhang. Semformer: Transformer language models with semantic
planning. arXiv preprint arXiv:2409.11143, 2024.
[22] Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D Goodman.
Stream of search (sos): Learning to search in language. arXiv preprint arXiv:2404.03683, 2024.
[23] Mengjiao Sherry Yang, Dale Schuurmans, Pieter Abbeel, and Ofir Nachum. Chain of thought imitation with
procedure cloning. Advances in Neural Information Processing Systems, 35:36366–36381, 2022.
[24] Swarnadeep Saha, Archiki Prasad, Justin Chih-Yao Chen, Peter Hase, Elias Stengel-Eskin, and Mohit Bansal.
System-1.x: Learning to balance fast and slow planning with language models. arXiv preprint arXiv:2407.14414,
2024.
[25] Bill Yuchen Lin, Yicheng Fu, Karina Yang, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Prithviraj
Ammanabrolu, Yejin Choi, and Xiang Ren. Swiftsage: A generative agent with fast and slow thinking for complex
interactive tasks. Advances in Neural Information Processing Systems, 36:23813–23825, 2023.
[26] Leyan Pan, Vijay Ganesh, Jacob Abernethy, Chris Esposo, and Wenke Lee. Can transformers reason logically? a
study in sat solving. arXiv preprint arXiv:2410.07432, 2024.
[27] Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don’t always say what
they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing
Systems, 36:74952–74965, 2023.
[28] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin
Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning.
arXiv preprint arXiv:2307.13702, 2023.
[29] Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani,
Peter Hase, Misha Wagner, Fabien Roger, et al. Reasoning models don’t always say what they think. arXiv
preprint arXiv:2505.05410, 2025.
[30] James Chua and Owain Evans. Are deepseek r1 and other reasoning models more faithful? In ICLR 2025
Workshop on Foundation Models in the Wild.
[31] Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy.
Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025.
[32] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning.
Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
[33] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John
Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on
Learning Representations, 2023.
[34] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester
James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tülu 3: Pushing frontiers in open language model
post-training. arXiv preprint arXiv:2411.15124, 2024.
[35] Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, and Hao
Peng. Free process rewards without process labels. arXiv preprint arXiv:2412.01981, 2024.
[36] Hao Sun. Reinforcement learning in the era of llms: What is essential? what is needed? an rl perspective on rlhf,
prompting, and beyond. arXiv preprint arXiv:2310.06147, 2023.
[37] Daman Arora and Andrea Zanette. Training language models to reason efficiently, 2025. URL https://arxiv.org/abs/2502.04463.
[38] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu,
Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint
arXiv:2503.14476, 2025.
[39] Juno Kim and Taiji Suzuki. Transformers provably solve parity efficiently with chain of thought. arXiv preprint
arXiv:2410.08633, 2024.
[40] William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. arXiv
preprint arXiv:2310.07923, 2023.
[41] Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery
behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36:70757–
70798, 2023.
[42] David Bruce Wilson. Generating random spanning trees more quickly than the cover time. In Proceedings of the
twenty-eighth annual ACM symposium on Theory of computing, pages 296–303, 1996.
[43] Joseph B Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings
of the American Mathematical society, 7(1):48–50, 1956.
[44] Robert Tarjan. Depth-first search and linear graph algorithms. SIAM journal on computing, 1(2):146–160, 1972.
[45] jrheard. Procedural dungeon generation: A drunkard’s walk in clojurescript.
[46] Edsger W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1):269–271,
1959.
[47] Peter E Hart, Nils J Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum
cost paths. IEEE transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.
[48] Judea Pearl. Heuristics: intelligent search strategies for computer problem solving. Addison-Wesley, 1984.
[49] Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
[50] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation
hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on
knowledge discovery & data mining, pages 2623–2631, 2019.
[51] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and
transferable adversarial attacks on aligned language models, 2023.
[52] Valeriia Cherepanova and James Zou. Talking nonsense: Probing large language models’ understanding of
adversarial gibberish inputs, 2024.
[53] Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, and Bryan Hooi. Flipattack: Jailbreak llms via
flipping. OpenReview pre-print, submitted to ICLR 2025, 2024.
[54] William Hackett, Lewis Birch, Stefan Trawicki, Neeraj Suri, and Peter Garraghan. Bypassing prompt injection
and jailbreak detection in llm guardrails, 2025.
[55] John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones,
Ethan Perez, and Mrinank Sharma. Best-of-n jailbreaking. arXiv preprint arXiv:2412.03556, 2024.