
Reason from Future: Reverse Thought Chain Enhances LLM Reasoning

Yinlong Xu1, Yanzhao Zheng2, Shuoshuo Sun2, Shuaihan Huang2, Baohua Dong2, Hangcheng Zhu2, Ruohui Huang2, Gang Yu2, Hongxia Xu3,4*, Jian Wu1,5*

1 College of Computer Science and Technology, Zhejiang University, Hangzhou, China
2 Alibaba Group, Hangzhou, China
3 State Key Laboratory of Transvascular Implantation Devices and TIDRI, Hangzhou, 310009, China
4 Liangzhu Laboratory and WeDoctor Cloud, Hangzhou, 310000, China
5 Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence, Hangzhou, 310058, China
{xuyinlong, Einstein, Wujian2000}@zju.edu.cn, huangshuaihan@outlook.com
{zhengyanzhao.zyz, sunshuoshuo.sss, baohua.dbh, linran.lr09, wentong, ruohai}@taobao.com
* Corresponding authors.
Abstract

It has been demonstrated that carefully designed reasoning paradigms, like Chain-of-Thought (CoT) and Tree-of-Thought (ToT), can enhance the reasoning capabilities of small language models through detailed thinking and extensive thought searching, but unbounded branching factors in the search space create prohibitive reasoning consumption. Moreover, these methods fall into the trap of local-optimum reasoning, which means the model lacks a global perspective while solving problems. We propose a novel reasoning paradigm called Reason from Future (RFF), which generates reasoning paths by bidirectional reasoning that combines top-down planning with bottom-up reasoning accumulation. The essence of RFF lies in its reverse reasoning mechanism, which prioritizes core logical relationships and imposes goal-oriented constraints on intermediate steps, thereby reducing the search space and mitigating the error accumulation inherent in sequential forward reasoning. Empirical evaluations across diverse experiments demonstrate that RFF outperforms conventional paradigms, with higher accuracy and a smaller search space on complex tasks.

Figure 1: Comparison between simple forward reasoning (left) and forward reasoning guided by back reasoning (right). (Figure content: for the question "There are 10 apples and 10 bananas in the blanket, 5 bananas are a little smaller than others. How many fruits in the blanket?", the forward-only chain tries to subtract the 5 smaller bananas and answers 10 + 10 - 5 = 15, while the chain guided by back reasoning first establishes that the number of fruits is apples + bananas and answers 10 + 10 = 20.)

1 Introduction

The rapid evolution of large language models (LLMs), fueled by breakthroughs in deep learning architectures and unprecedented datasets, has demonstrated remarkable potential across natural language processing (NLP) and interdisciplinary applications (Lee and Toutanova, 2018; Radford, 2018; Team et al., 2023; Sel et al., 2023). LLMs like ChatGPT (Achiam et al., 2023) and Llama (Dubey et al., 2024) exhibit human-like text generation, multilingual task execution, and emerging logical reasoning. Current scholarly investigations identify their reasoning capacity for problem decomposition as the critical determinant of functional boundaries, enabling industrial automation and academic research applications.

Recent studies demonstrate that well-designed reasoning paradigms can significantly enhance LLMs' reasoning ability without additional costly and time-consuming post-training. A seminal work in this area is Chain-of-Thought (CoT) (Wei et al., 2022), which pioneered the view that reasoning ability can be improved by designing reasoning prompts, paradigms, and examples. Tree-of-Thought (ToT) (Yao et al., 2024) provides a searching view to enhance the ability of complex reasoning. Progressive-Hint Prompting (PHP) (Zheng et al., 2023) and Cumulative Reasoning (CR) (Zhang et al., 2023) ask the model to generate hints for the question before generating the answer.
Although these reasoning paradigms, which break the solution down into multiple steps through prompts or spatial search, can enhance the reasoning ability and coherence of the model, they tend to make the model focus on the current state, resulting in a lack of explicit guidance from a global understanding of the problem and in excessive exploration of redundant information, overthinking, or errors during inference (Boix-Adsera et al., 2023).

In contrast, the way humans approach problem-solving is different. Research has shown that humans begin by building a holistic mental model when solving complex problems, allowing problem solvers to form a topological framework before focusing on specific details (Spreng et al., 2009; Koban et al., 2021). This kind of cognitive prediction provides dual guidance for the subsequent solution process: at the macro level, it forms a "cognitive road map" of the solution path, helping to exclude obviously unrelated branches; at the micro level, it establishes evaluation criteria so that each specific operation remains dynamically calibrated to the end goal. This global awareness allows us to avoid blindly combining superficial details and instead prioritize purposeful, contextually grounded deductions. This suggests that modeling this local-global consistency thinking paradigm might enable LLMs to strategically synthesize information, minimize irrelevant exploration, and align intermediate steps with the overarching goal.

Inspired by the maze-solving strategy of backward reasoning, where reversing the path from the endpoint accelerates discovering the solution, we propose a novel reasoning paradigm called Reason-from-Future (RFF) to enhance the reasoning ability of LLMs by adding a reverse thinking process that guides the forward reasoning, as shown in Figure 1. RFF integrates bidirectional reasoning by alternating between reverse and forward thinking to maintain solution states: the reverse reasoning generates the potential last state before the target state and sets that state as the new target, then the forward reasoning takes a step toward the new target. The target state serves as a guide to precisely lead the forward reasoning, and the forward reasoning in turn produces more useful information that makes the reverse reasoning more reasonable. We evaluate RFF on five datasets: Game of 24 (Yao et al., 2024), GSM8K (Cobbe et al., 2021), ASDiv (Miao et al., 2021), SVAMP (Patel et al., 2021), and MATH-500 (Lightman et al., 2023), and demonstrate significant improvements in accuracy over baseline methods. Additionally, RFF reduces the search space by constraining reasoning to target-driven states, demonstrating good efficiency. Our results highlight the potential of bidirectional, goal-aware reasoning to unlock more robust and systematic problem-solving in LLMs.

In summary, we introduce RFF, a novel self-planning reasoning paradigm to enhance the reasoning ability of LLMs, in which reverse thinking and forward thinking alternate to obtain a future perspective and narrow the solution-searching space. We conduct experiments on four datasets to demonstrate the strong performance and efficiency of RFF, and we run two extra experiments that complicate the questions in Game of 24 and GSM8K. The results show that RFF consumes less in larger search spaces and reasons robustly on variant problems.

2 Related Work

2.1 Chain of Thought Reasoning

In the study of complex reasoning tasks, Chain-of-Thought (CoT) (Wei et al., 2022; Wang et al., 2022) prompting has emerged as a pivotal technique for significantly improving the performance of large language models (LLMs) by explicitly generating intermediate reasoning steps. This approach enables the decomposition of problems into structured, stepwise reasoning pathways, demonstrating particular efficacy in mathematical and logical domains. Recent advancements extend CoT through symbolic formalization (e.g., Symbolic CoT (Xu et al., 2024)), which incorporates formal logic systems to enhance both reliability and interpretability by grounding reasoning in rigorous symbolic frameworks. Critical analyses, however, reveal potential limitations where models may exploit computational redundancy rather than genuine reasoning in extended CoT steps, prompting discussions about mechanistic transparency.

2.2 Search Reasoning

In the domain of search-based reasoning for large language models, the Tree-of-Thought (ToT) (Yao et al., 2024) framework introduces backtracking capabilities within multi-path decision structures, enabling systematic exploration of diverse solution trajectories. This approach proves particularly effective for complex tasks requiring iterative hypothesis generation and validation. Monte Carlo Tree Search (MCTS) (Świechowski et al., 2023) strengthens online decision-making robustness by simulating and evaluating the long-term rewards of candidate paths, demonstrating strengths in reinforcement learning and dynamic programming scenarios.
Figure 2: Schematic illustrating various approaches to problem-solving with LLMs, where each rectangle box represents a thought: (a) Chain-of-Thought (CoT) prompting, (b) Tree-of-Thought (ToT) prompting, (c) Cumulative Reasoning (CR) prompting, and (d) Reason from Future (RFF) prompting. Figure 2(d) only shows the basic framework of RFF; see the concrete pipeline of the two types of RFF in Algorithm 1 (RFF-T) and Algorithm 2 (RFF-G).

AOT (Sel et al., 2023) improves ToT by introducing a step-evaluation mechanism into the reasoning process, which helps the LLM prune less likely search routes and thus reduces the search space. AOT+ (Sel et al., 2025), an upgrade of AOT, adds a fine-grained backtracking mechanism by labeling each step, which further reduces the reasoning consumption spent on erroneous search routes. However, both AOT and AOT+ obtain their global perspective through continued exploratory reasoning and degrade to plain ToT when exploring their first reasoning route, which may lead to a random search at each step and may miss the correct search route.

Recent innovations also bridge reasoning with executable action, exemplified by frameworks like LATS (Language Agent Tree Search) (Zhou et al., 2023). By unifying hierarchical planning, probabilistic reasoning, and environment interaction within language models, LATS extends the dynamic capabilities of ReAct (Reasoning + Acting) (Yao et al., 2022) paradigms, enabling adaptive agent behavior in multi-step problem-solving scenarios. While these approaches show complementary advantages in addressing combinatorial optimization and long-range dependency challenges, computational efficiency and path-pruning strategies remain critical areas for improvement.

2.3 Progressive Hint Prompting Reasoning

In the realm of progressive prompting for complex reasoning, the LLM solves a problem through multiple rounds of messages. Least-to-Most (Zhou et al., 2022) first breaks a problem into several sub-problems and then solves them sequentially. Progressive-Hint Prompting (PHP) (Zheng et al., 2023) advances dynamic problem-solving by fostering iterative, multi-turn interactions between users and LLMs. This method leverages feedback-driven prompts informed by historical outputs to systematically refine reasoning accuracy and coherence. Parallel to this, Cumulative Reasoning (CR) (Zhang et al., 2023) emulates human-like incremental cognition by decomposing tasks into structured subtasks and aggregating intermediate results through stepwise integration. Both PHP and CR synergize with foundational frameworks like CoT and its derivatives, collectively strengthening the generation and validation of adaptive reasoning pathways.

Recent advancements further explore hybrid architectures that combine PHP with retrieval-augmented mechanisms and task-specific distillation. These frameworks aim to balance computational efficiency with robust reasoning fidelity, addressing challenges such as error propagation and context scalability. By integrating iterative feedback loops with external knowledge retrieval, such approaches optimize performance in multi-step reasoning tasks while maintaining generalizability.
3 Methods

Reason from Future (RFF) is a reasoning paradigm that allows models to solve a question by using forward and backward reasoning alternately. We use pθ to denote an LLM with parameters θ, and x, t to denote the input and the question. {S} ~ {S0, S1, ..., Si} and {T} ~ {T0, T1, ..., Ti} denote the list of current states and the list of target states at each step i. We define O(pθ, x, t | Si) as the output of the LLM pθ given a prompt consisting of the input x, the target t, and the hints Si. In the i-th step, the model identifies the preceding step closest to the current target state Ti-1, considers it as the new target state Ti, and provides the calculation relationship between the two. Then the model takes Ti as the new target for one-step forward reasoning. The model repeats this step until the latest target state has been achieved (Si = Ti). A specific RFF pipeline consists of three components: 1) a Last Step Generator G(); 2) a Stepwise Forward Reasoner R(); 3) a State Check C().

Algorithm 1 RFF-T
Require: LM pθ, input x, max steps L, last step generator G(), stepwise reasoner R(), state checker C(), current states {S}, target states {T}, avoided attempts {A}, verifier V(), output function O()
 1: S0 <- x, T0 <- t, A0 <- {}, i <- 0
 2: while i <= L do
 3:   i <- i + 1
 4:   Ai <- {}
 5:   Ti <- G(pθ, Si-1, Ti-1)
 6:   Si <- R(pθ, Si-1, Ti, Ai-1)
 7:   if C(Si, Ti) == True then
 8:     j <- V(Si, Ti)
 9:     if j == i then
10:       break
11:     end if
12:     Aj <- Aj ∪ {Sj, Tj}
13:     i <- j
14:   end if
15: end while
16: return O(pθ, x, t | Si)

3.1 Last Step Generator

RFF implements backward reasoning by generating the last preceding step. To be specific, RFF decomposes a target state Ti, given the current state Si, into a pre-target state Ti+1 = G(pθ, Si, Ti) one step at a time. The form of the specific sub-target state depends on the target of the task, such as a set of numbers (Game of 24) or the variables to be found (mathematical problems). It is worth noticing that the transition step between the pre-target state Ti+1 and the target Ti should be output explicitly to guarantee, to a certain extent, the correctness of the target decomposition.

3.2 Stepwise Forward Reason

We consider two different strategies, RFF-T in Algorithm 1 and RFF-G in Algorithm 2, to generate the next forward reasoning step for different types of target:

(a) RFF-T: For problems like Game of 24 or maze games, whose solution is one branch of a search tree, the model should avoid repeating wrong attempts in the same layer of the search tree. We use {A} ~ {A0, A1, ..., Ai} to denote the attempts that should be avoided at step i; thus the next state is Si <- R(pθ, Si-1, Ti, Ai-1).

(b) RFF-G: For problems like mathematical word problems, whose solution is a directed acyclic graph, all the information calculated in previous states is either useful or redundant but not harmful, so the reasoning path should consider all the information calculated by the previous states, which gives Si <- Si-1 ∪ R(pθ, x, Si-1, Ti).

Algorithm 2 RFF-G
Require: LM pθ, input x, max steps L, last step generator G(), stepwise reasoner R(), state checker V(), current states {S}, target states {T}, output function O()
 1: S0 <- x, T0 <- t
 2: for i = 1 to L do
 3:   Ti <- G(pθ, Si-1, Ti-1)
 4:   Si <- Si-1 ∪ R(pθ, Si-1, Ti)
 5:   if V(Si, Ti) == True then
 6:     break
 7:   end if
 8: end for
 9: return O(pθ, x, t | Si)
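To make the pipeline concrete, the following is a minimal Python sketch of the RFF-G loop in Algorithm 2. It is an illustrative reading of the algorithm, not the authors' released code: llm stands in for any chat-completion function, and the three inline prompts play the roles of G(), R(), and the state check (prompts of the kind shown in Appendix C).

# Illustrative sketch of Algorithm 2 (RFF-G); helper names are assumptions, not the paper's API.
def rff_g(llm, problem: str, question: str, max_steps: int = 8) -> str:
    state = []                 # S_i: facts derived so far (forward reasoning)
    target = question          # T_i: what we currently need to know (backward reasoning)

    for _ in range(max_steps):
        # Last Step Generator G(): ask what the final step toward `target` would be,
        # given the information collected in `state`.
        target = llm(f"Problem: {problem}\nQuestion: {target}\n"
                     f"Information: {state}\n"
                     "What is the last step needed to answer the question, "
                     "and what information does it require?")

        # Stepwise Forward Reasoner R(): take one forward step toward the new target
        # and accumulate it (S_i = S_{i-1} plus the new fact).
        step = llm(f"Problem: {problem}\nQuestion: {target}\n"
                   f"Information: {state}\n"
                   "Perform exactly one forward reasoning step toward the question.")
        state.append(step)

        # State Check: stop once the accumulated information answers the original question.
        done = llm(f"Problem: {problem}\nQuestion: {question}\n"
                   f"Information: {state}\n"
                   "Have we already solved the question? Answer True or False.")
        if "True" in done:
            break

    # Output function O(): produce the final answer from the accumulated hints.
    return llm(f"Problem: {problem}\nQuestion: {question}\n"
               f"Information: {state}\nGive the final answer.")

In this sketch the backward call always runs before the forward call; this is the pair-reasoning schedule that Appendix B compares against single reasoning.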
3.3 State Check

The State Check C() maintains an inference boundary that determines the termination conditions of the inference paradigm. Similar to the Stepwise Forward Reason, we set two different strategies to check whether the reasoning has reached the boundary:

(a) RFF-T: Because only the correct reasoning path is saved in the end, C(pθ, Si, Ti) only considers whether the current state Si coincides with the latest target state Ti, or whether the current state requires only one mathematical or logical operation to reach the target state (e.g., present state (2 3 4) and target state (4 6) in Game of 24). Meanwhile, because RFF-T needs to revisit previous states to explore the thought space, a verifier V(Si, Ti) is set to check whether the path is correct when the reasoning reaches the boundary. If the path is wrong, V() returns the previous state j that should be revisited and records the wrong attempt (Sj, Tj).

(b) RFF-G: Different from RFF-T, each step of reasoning generates a useful node of the directed acyclic graph, so C(pθ, Si, Ti) considers whether the information the target state needs has already been solved or is noted in the background.

Figure 3: An example of how RFF-T works in Game of 24 with input 2 3 6 8. Different reasoning paths represent the backtracking mechanism of RFF-T: backward reasoning proposes candidate last steps such as 3 * 8 = 24 (from 3 8) or 16 + 8 = 24 (from 8 16), while forward reasoning explores moves such as 2 + 6 = 8 (left: 3 8 8) or 3 * 6 = 18 followed by 18 - 2 = 16 (left: 8 16), which are verified against the proposed last step until an output of 24 is reached.

4 Experiment

We evaluate the effectiveness of RFF on some widely used LLM reasoning benchmarks, like Game of 24 and GSM8K. Considering that a successful paradigm may owe its results to the strength of the model itself rather than the strength of the paradigm, which makes it difficult to migrate to weak or small models, we carry out our experiments using Llama3-8B-Instruct (Dubey et al., 2024) and Qwen2.5-7B-Instruct (Yang et al., 2024) as the base models; more detailed parameters are given in each task's setup.

4.1 Game of 24

The task of Game of 24 originates from (Yao et al., 2024), where the goal is to use four numbers with basic arithmetic operations (+ - * /) to obtain 24, and each number can be used only once.

Task Setup

We conduct Game of 24 on Llama3-8B-Instruct with a temperature of 0.7 (consistent with the setup of CoT (Wei et al., 2022) and ToT (Yao et al., 2024)). We apply RFF-T because Game of 24 is usually viewed as fetching one branch of a search tree, which is consistent with the RFF-T paradigm. A solved example can be seen in Figure 3. We run the 100 medium-hard puzzles numbered 901 to 1000 (Yao et al., 2024). We also consider each branch of the search tree as a visited state and record the average number of visited states for each paradigm, which is proportional to the search space and the computational consumption.

Baselines

We employ CoT, ToT, AoT and Cumulative Reasoning (CR) with different parameters as the baselines. The setup of CoT is consistent with (Wei et al., 2022) and (Yang et al., 2024), which employ the intermediate calculation process as the reasoning step. For ToT and CR, we adapt the settings and prompts from (Zhang et al., 2023). All methods are tested 100 times to obtain the average result, and unless otherwise specified, the temperature of the model is set to 0.7. We also use GPT-4 as a baseline model. Because the paradigms explore the space differently, we treat those with a similar number of visited states as the same class for comparison, rather than those with similar branching of the search trees.
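For concreteness, the following is a small sketch of what the RFF-T state check C() from Section 3.3 amounts to in Game of 24: the current state either coincides with the target state, or one arithmetic operation on two of its numbers yields the target multiset (e.g., (2 3 4) reaches (4 6) via 2 * 3 = 6). This is an illustrative reconstruction, not the authors' implementation.

from itertools import combinations
from collections import Counter

def one_step_results(nums):
    """All multisets reachable from `nums` by combining two numbers with + - * /."""
    results = []
    for (i, a), (j, b) in combinations(list(enumerate(nums)), 2):
        rest = [x for k, x in enumerate(nums) if k not in (i, j)]
        candidates = [a + b, a - b, b - a, a * b]
        if b != 0:
            candidates.append(a / b)
        if a != 0:
            candidates.append(b / a)
        for c in candidates:
            results.append(rest + [c])
    return results

def state_check(current, target):
    """C(S_i, T_i): True if the current state equals the target state
    or can reach it with a single operation, e.g. (2 3 4) -> (4 6)."""
    if Counter(current) == Counter(target):
        return True
    return any(Counter(r) == Counter(target) for r in one_step_results(current))

A check of this kind is all the State Check needs for search-tree tasks; the verifier V() additionally decides, once the boundary is reached, whether the recorded path is correct or which earlier state should be revisited.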
Figure 4 (content):

Question: Jon runs a triathlon. It takes him 40 minutes for the swim, an hour and 20 minutes for the bike ride and 50 minutes for the run. Compared to Jon, James finishes the swim 10% faster but takes 5 minutes longer on the bike. If Jon won by 10 minutes, how long did it take James to do the run?

Direct Output: Jon's total time is 40 + 80 + 50 = 170 minutes. James' swim time is 40 * 0.9 = 36 minutes. His bike time is 80 + 5 = 85 minutes. So his total time is 36 + 85 + x = 121 + x. Since he lost by 10 minutes, 121 + x = 170 - 10 = 160. So x = 160 - 121 = 39. The answer is 39.

COT Output: Let's break it down step by step. 1. Jon's total time is: 40 minutes (swim) + 80 minutes (bike ride) + 50 minutes (run) = 170 minutes. 2. James finishes the swim 10% faster than Jon, which means James' swim time is: Jon's swim time - 10% of Jon's swim time = 40 - 4 = 36 minutes. 3. James takes 5 minutes longer on the bike than Jon, so James' bike time is: Jon's bike time + 5 minutes = 80 + 5 = 85 minutes. 4. Since James won by 10 minutes, his total time is 10 minutes less than Jon's total time: James' total time = 170 - 10 = 160 minutes. 5. James' total time is the sum of his swim time, bike time, and run time: 160 minutes = 36 minutes (swim) + 85 minutes (bike) + ? minutes (run). 6. To find James' run time, subtract the swim and bike times from his total time: ? minutes = 160 - 36 - 85 = 39 minutes. So, it took James 39 minutes to complete the run.

RFF-G Answer:
Step 1. Back Reason: To know how long it took James to do the run, we can do: the time James finishes - the time James swims - the time James rides = the time James runs. What we need to know is: the time James finishes, the time James swims, the time James rides. Forward Reason: the time James rides is: 1 hour + 20 minutes = 80 minutes.
Step 2. Back Reason: To know the time James finishes: since Jon won, we can do: the time Jon finishes + the time Jon won. Forward Reason: the time James finishes is: 40 minutes + 80 minutes + 50 minutes + 10 minutes = 180 minutes.
Step 3. Back Reason: To know the time James swims: since James swims 10% faster, we can do: the time James swims - the time Jon swims faster = the time Jon swims. Forward Reason: the time Jon swims = 40 minutes - 40 minutes * 10% = 36 minutes.
Step 4. Back Reason: To know the time James rides: since James rides 5 minutes later, we can do: the time James rides + 5 minutes = the time Jon rides. Forward Reason: the time Jon rides is: 1 hour + 20 minutes + 5 minutes = 85 minutes.
Step 5. Back Reason: We have known all the information, now we can solve the problem: the time James finishes - the time James swims - the time James rides = the time James runs. Forward Reason: the time James runs is: 180 minutes - 36 minutes - 85 minutes = 59 minutes.
Final Answer: The time James runs is 59 minutes.

Figure 4: An example from the GSM8K dataset, with solutions generated by the Direct, CoT, and RFF paradigms. The former two paradigms tend to connect "win" with the positive operation "more", while RFF first analyzes the background of the "win" and then generates the operation.
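A quick check of the arithmetic in Figure 4 (an illustration added here, not part of the original figure): the decisive difference is whether "Jon won by 10 minutes" is applied as James finishing 10 minutes earlier or 10 minutes later than Jon.

# Figure 4 sanity check (illustrative only).
jon_total = 40 + 80 + 50          # 170 minutes
james_swim = 40 * 0.9             # 36 minutes
james_bike = 80 + 5               # 85 minutes

# Direct/CoT reading: "Jon won" treated as if James were 10 minutes faster.
wrong_run = (jon_total - 10) - james_swim - james_bike    # 39 minutes

# RFF reading: Jon won, so James finished 10 minutes after Jon.
right_run = (jon_total + 10) - james_swim - james_bike    # 59 minutes

print(wrong_run, right_run)       # 39.0 59.0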

Results

Model        Method       ACC    Visited States
GPT-4        CoT          3%     1.0
GPT-4        ToT (n=1)    45%    -
GPT-4        ToT (n=5)    74%    61.2
GPT-4        AoT          71%*   -
GPT-4        CR (n=1)     84%    11.7
GPT-4        CR (n=5)     94%    13.7
GPT-4        RFF (n=5)    95%    9.3
Llama3-8B    CR (n=1)     9%     30.9
Llama3-8B    CR (n=5)     19%    89.8
Llama3-8B    RFF (n=5)    89%    9.9
Llama3-8B    RFF (n=10)   96%    15.0

Table 1: The results of the Game of 24, where n denotes the width of the searching tree. * denotes that the result is taken from (Sel et al., 2023).

As shown in Table 1, RFF with Llama3-8B exhibits outstanding performance even compared to GPT-4. CoT performs badly on this task because searching-tree-like tasks require the model to explore a wide solution space. Searching paradigms like ToT and CR achieve better scores than CoT; meanwhile, the ToT method visits more states because of blind searching. RFF reaches the highest accuracy and the fewest visited states at each level: when the number of visited states is around 10, RFF reaches the best accuracy of 89%, compared to CR with GPT-4 at 84%; when the number of visited states is around 14, RFF reaches an accuracy of 96%, compared to CR with GPT-4 at 94%. The fewer visited states and higher accuracy come from the search space in RFF being much smaller and more reasonable than that of plain forward searching (e.g., for "1 2 12 12" with the backward target "12 + 12 = 24", the LLM is much less likely to explore moves like "2 + 12 = 14").
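As a toy illustration of that last point (not code from the paper): with the backward-proposed last step "12 + 12 = 24" acting as the target state, most one-step forward moves from {1, 2, 12, 12} can be discarded immediately because they destroy a 12 that the target still needs.

from itertools import combinations
from collections import Counter

def forward_moves(nums):
    """One-step forward moves: combine two numbers (division omitted for brevity)."""
    for (i, a), (j, b) in combinations(list(enumerate(nums)), 2):
        rest = [x for k, x in enumerate(nums) if k not in (i, j)]
        for label, c in ((f"{a}+{b}", a + b), (f"{a}-{b}", a - b),
                         (f"{b}-{a}", b - a), (f"{a}*{b}", a * b)):
            yield label, rest + [c]

def keeps_target(state, needed=(12, 12)):
    """Target-driven filter: the remaining numbers must still contain the inputs of the backward target."""
    counts = Counter(state)
    return all(counts[v] >= n for v, n in Counter(needed).items())

moves = list(forward_moves([1, 2, 12, 12]))
kept = [label for label, state in moves if keeps_target(state)]
# kept == ['1+2', '1-2', '2-1', '1*2']: only moves combining 1 and 2 survive;
# moves such as '2+12' are pruned because they consume one of the 12s the target needs.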
Model                  Method         GSM8K   SVAMP   ASDiv   MATH    AVG
Llama3-8B-Instruct     CoT            75.6%   80.5%   82.3%   32.8%   67.8%
Llama3-8B-Instruct     Least-to-Most  79.5%   86.8%   84.4%   38.8%   72.4%
Llama3-8B-Instruct     Give-me-Hint   77.3%   87.9%   86.0%   37.0%   72.1%
Llama3-8B-Instruct     CR             77.0%   71.2%   84.8%   40.2%   68.3%
Llama3-8B-Instruct     RFF            83.8%   89.7%   86.7%   41.4%   75.4%
Qwen2.5-7B-Instruct    CoT            87.2%   92.1%   88.0%   74.6%   85.5%
Qwen2.5-7B-Instruct    CR             87.7%   83.7%   91.9%   78.2%   85.4%
Qwen2.5-7B-Instruct    RFF            89.5%   95.1%   92.2%   79.8%   89.1%

Table 2: The results on the math problems. The score represents the accuracy on each benchmark. The best result in each column is highlighted in bold. AVG is the average over the four benchmarks.

4.2 Math Problem Benchmark

This task contains four datasets: GSM8K, SVAMP, ASDiv and MATH-500. GSM8K is a mathematical dataset with 1319 test items, each question requiring 3-10 steps of reasoning and calculation. SVAMP and ASDiv are two simpler math problem datasets with 1000 and 2096 items respectively, each question requiring 1-2 steps of reasoning and calculation. MATH-500 is a subset of 500 mathematical problems from the MATH (Hendrycks et al., 2021) benchmark, which is much harder than GSM8K.

Task Setup

We conduct this task on Llama3-8B-Instruct and Qwen2.5-7B-Instruct with greedy search to exclude the influence of sampling randomness on textual reasoning. We apply RFF-G for the mathematical problems, since their solution can be seen as a directed acyclic graph from the question to the answer. We employ one shot as the example to lead the model to perform formatted reasoning.

Baselines

Considering the nature of the math problems, CoT and CR are chosen as baselines for their excellent ability for complex thinking and multi-hop reasoning. CoT and CR are set with one shot to balance the influence of the same setup in RFF-G. CoT generates a continuous chain of thoughts until the model answers the question, while CR first generates a few hints, then generates simple questions and answers until the model thinks it has enough to answer the question. We also conduct two extra baselines not shown in the Game of 24: Least-to-Most (Zhou et al., 2022), which generates the sub-problem pipeline first and then follows the pipeline to solve the problem, and Give-me-Hint (Agrawal et al., 2024), which generates helpful hints to help the LLM solve the problem. Both baselines are recent and typical enough to serve as convincing control groups.

Results

Table 2 shows the accuracy of Llama3-8B-Instruct and Qwen2.5-7B-Instruct on the four datasets, and our method RFF shows clear advantages over the other methods. Meanwhile, there is another interesting phenomenon: the progressive prompting methods show great improvements in accuracy over CoT on the GSM8K, ASDiv and MATH datasets, owing to their better focus on details and relations. However, CR fails to reach the average level of the other methods on the simple task SVAMP, with 71.2% compared to 85.1%, while it demonstrates better performance on the hard task MATH, with 40.2% compared to 36.2%. We carefully checked the outputs of CR and of our RFF and found that detailed hints and question-answer pairs can be helpful on hard problems like MATH, but they lead to a harmful overthinking phenomenon when facing simple problems like SVAMP. Our method RFF, benefiting from the State Checker, can avoid overthinking once the State Checker decides the reasoning should stop. We also notice that the gap between RFF and CoT increases as the base ability of the model decreases (from Qwen to Llama), demonstrating a significant complementary effect on the model's reasoning ability.

4.3 Commonsense Problem Benchmark

To further explore the effectiveness of RFF on different NLP tasks, we conducted experiments on commonsense problem benchmarks. Commonsense problems usually refer to those that require the basic knowledge and reasoning ability of human daily life to solve. These problems rely on background knowledge, which is usually not explicitly stated in the problems. We conducted experiments on two widely used commonsense benchmarks, CommonsenseQA (CommonQA) (Talmor et al., 2018) and LogiQA (Liu et al., 2020), using RFF-G. Both benchmarks are multiple-choice, containing 12102 and 8678 questions respectively, each with one correct answer among four choices.

Task Setup

We conduct this task on Llama3-8B-Instruct. Considering the nature of commonsense tasks, we choose RFF-G to solve the questions. We judge the accuracy of an answer by checking whether the model outputs the correct option; any other form of answer is regarded as wrong.

Baselines

For commonsense tasks, it helps a lot if the model is given background information before answering the question, so hint-based and progressive prompting paradigms serve as effective references. As in the math benchmark, we choose CoT, CR, Least-to-Most and Give-me-Hint as our baselines.

Results

Method          CommonQA   LogiQA   AVG
COT             73.1%      41.8%    57.5%
CR              75.4%      45.5%    60.5%
Least-to-Most   74.3%      42.9%    58.6%
Give-me-Hint    76.6%      42.0%    59.3%
RFF             77.1%      45.2%    61.2%

Table 3: The results of the commonsense reasoning benchmarks. The best result in each column is highlighted in bold, and the second best in each column is underlined.

As shown in Table 3, all the reasoning paradigms achieve better accuracy on CommonQA, demonstrating that complex prompting can easily improve performance on it. However, the differences among these methods are not so apparent (all distributed around 76%), so the CommonQA result alone is not sufficient to show the superiority of our method. On LogiQA, the differences among the baselines show a greater gap: the results of Least-to-Most and Give-me-Hint are close to CoT, while the results of RFF and CR show a significant improvement over CoT.

The results also indicate that in commonsense tasks, especially the easy task (CommonQA), the simple prompting method (Give-me-Hint) achieves better scores than the complex prompting methods (Least-to-Most and CR), which we attribute to the overthinking of these methods. Meanwhile, RFF can still maintain a high score thanks to its step-evaluation mechanism: when facing simple questions, the forward reasoning quickly meets the backward reasoning, and RFF then turns into simple CoT to solve the problem instead of overthinking.

4.4 Studies of Redundant Thinking

We investigate the limitations of traditional algorithms for solving Game of 24. While conventional breadth-first searching methods perform well in low-dimensional solution spaces, their unguided exploration mechanisms may lead to significant computational resource waste and efficiency degradation when handling higher-dimensional problems. To validate this hypothesis, we constructed an experimental dataset comprising 100 enhanced problems (IDs 901-1000) by adding the constant "1" to the original four-number combinations, creating five-number variants. Theoretically, this operation preserves the solvability of the problems (it is an arithmetic identity transformation) and, if anything, is expected to decrease the difficulty, since the added "1" is redundant.

Model        Method       ACC    Visited States
GPT-4        CR (n=5)     76%    7.06
GPT-4        RFF (n=5)    89%    5.96
GPT-4        RFF (n=10)   93%    9.13
Llama3-8B    CR (n=5)     26%    96.56
Llama3-8B    RFF (n=5)    85%    28.62
Llama3-8B    RFF (n=10)   92%    56.13

Table 4: The results of the five-number Game of 24.

As shown in Table 4, after adding a redundant "1", the performance of CR decreased significantly with GPT-4 (from 94% to 76%). In contrast, RFF achieves a higher success rate with fewer visited states compared with CR. When we further expand the search space, the model's performance continues to improve. At the same time, we observe that the smaller model requires more attempts to achieve performance comparable to that on the original data. However, RFF consistently surpasses CR in terms of success rate and resource consumption. The result demonstrates the effective prospective state-space pruning in RFF and validates the superiority of RFF's search-space convergence mechanism.
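For reference, constructing such a five-number variant set from the original puzzles is a one-line transformation; a minimal sketch (the example puzzles are stand-ins, not the paper's released data files):

# Illustrative construction of the redundant five-number variants (IDs 901-1000).
def make_redundant_variant(puzzle):
    """Append the constant 1 to a four-number Game of 24 puzzle.
    Any solution of the original puzzle still works (multiply or divide by 1 at the end),
    so solvability is preserved while the search space grows."""
    return sorted(puzzle) + [1]

original = [(1, 2, 12, 12), (2, 3, 6, 8)]        # example puzzles, stand-ins for IDs 901-1000
variants = [make_redundant_variant(p) for p in original]
# [[1, 2, 12, 12, 1], [2, 3, 6, 8, 1]]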
Figure 5: The results of CoT and RFF on the GSM-Symbolic dataset (accuracy histograms over the 50 subsets of GSM-SYM, GSM-P1, and GSM-P2; 1-shot). The score in the upper right of each chart is the average accuracy over the 50 subsets (GSM-SYM: CoT 74.37, RFF 75.61; GSM-P1: CoT 62.48, RFF 65.93; GSM-P2: CoT 36.84, RFF 40.82). RFF is more stable and its distribution is shifted toward higher accuracy.

4.5 Studies of Robust Thinking

To address the data-leakage risks associated with the widespread adoption of GSM8K, researchers (Mirzadeh et al., 2024) proposed the GSM-Symbolic benchmark through semantic-preserving transformations. This dataset generates derivative problems via entity/quantity substitution (GSM-SYM) and the addition of single (GSM-P1) or dual (GSM-P2) conditional constraints to the original question. Despite the theoretical expectation that surface-level modifications (e.g., name/quantity changes) should not impact reasoning capabilities, empirical observations reveal significant accuracy degradation across all models. Following the standard evaluation protocol of GSM8K, we systematically assess the reasoning generalization of RFF using the publicly available SYM, P1, and P2 subsets (each subset contains 50 variants of the original dataset). We employ Llama3-8B-Instruct with the CoT and RFF methods for this task.

Figure 5 shows the accuracy distribution over the 50 variant datasets in the three subsets. The results show that the accuracy of both methods drops on these three datasets, revealing the fragility of the models' reasoning ability. However, RFF still has an advantage in average accuracy and shows a more concentrated and more accurate distribution. The result emphasizes that forward reasoning guided by backward reasoning is quite robust in the face of variant problems.

5 Conclusion

In this paper, we introduce Reason from Future (RFF), a novel reasoning paradigm aimed at enhancing the reasoning ability of LLMs on complex problems. RFF leverages a bidirectional reasoning framework that integrates top-down planning with bottom-up reasoning accumulation to generate a solution path. This aids the convergence of the model's search space, thereby enhancing inference efficiency. Simultaneously, it allows the model to focus on critical information, which improves the accuracy of reasoning. RFF has demonstrated superior performance across both searching-tree tasks (Game of 24) and directed-acyclic-graph tasks (math problems and commonsense problems), showing its potential to enhance models' reasoning capabilities.

Limitations

The effectiveness of RFF relies on the model's ability for reverse thinking. Since the model has not been trained with specialized data, there can be rare instances where errors in the final step of reverse reasoning lead to failure. In future work, we will introduce fine-tuning or reinforcement learning to further enhance the generalizability of this reasoning paradigm.

Acknowledgments

This research was partially supported by the National Natural Science Foundation of China under Grants No. 12326612 and No. 82202984, the Zhejiang Key R&D Program of China under Grant No. 2024SSYS0026, the Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence, and the Transvascular Implantation Devices Research Institute (TIDRI).
References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Vansh Agrawal, Pratham Singla, Amitoj Singh Miglani, Shivank Garg, and Ayush Mangal. 2024. Give me a hint: Can LLMs take a hint to solve math problems? arXiv preprint arXiv:2410.05915.

Enric Boix-Adsera, Omid Saremi, Emmanuel Abbe, Samy Bengio, Etai Littwin, and Joshua Susskind. 2023. When can transformers reason with abstract symbols? arXiv preprint arXiv:2310.09753.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

Leonie Koban, Peter J. Gianaros, Hedy Kober, and Tor D. Wager. 2021. The self in context: brain systems linking mental and physical health. Nature Reviews Neuroscience, 22(5):309-322.

JDMCK Lee and K. Toutanova. 2018. Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 3(8).

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step. In The Twelfth International Conference on Learning Representations.

Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. LogiQA: A challenge dataset for machine reading comprehension with logical reasoning. arXiv preprint arXiv:2007.08124.

Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2021. A diverse corpus for evaluating and developing English math word problem solvers. arXiv preprint arXiv:2106.15772.

Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. 2024. GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229.

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? arXiv preprint arXiv:2103.07191.

Alec Radford. 2018. Improving language understanding by generative pre-training.

Bilgehan Sel, Ahmad Al-Tawaha, Vanshaj Khattar, Ruoxi Jia, and Ming Jin. 2023. Algorithm of thoughts: Enhancing exploration of ideas in large language models. arXiv preprint arXiv:2308.10379.

Bilgehan Sel, Ruoxi Jia, and Ming Jin. 2025. LLMs can plan only if we tell them. arXiv preprint arXiv:2501.13545.

R. Nathan Spreng, Raymond A. Mar, and Alice S. N. Kim. 2009. The common neural basis of autobiographical memory, prospection, navigation, theory of mind, and the default mode: a quantitative meta-analysis. Journal of Cognitive Neuroscience, 21(3):489-510.

Maciej Świechowski, Konrad Godlewski, Bartosz Sawicki, and Jacek Mańdziuk. 2023. Monte Carlo tree search: A review of recent modifications and applications. Artificial Intelligence Review, 56(3):2497-2562.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2018. CommonsenseQA: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937.

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837.

Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, and Wynne Hsu. 2024. Faithful logical reasoning via symbolic chain-of-thought. arXiv preprint arXiv:2405.18357.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.

Yifan Zhang, Jingqin Yang, Yang Yuan, and Andrew Chi-Chih Yao. 2023. Cumulative reasoning with large language models. arXiv preprint arXiv:2308.04371.

Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. 2023. Progressive-hint prompting improves reasoning in large language models. arXiv preprint arXiv:2304.09797.

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. 2023. Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.

A Detailed Difference between Baselines

Although most progressive prompting methods employ reasoning paradigms that divide a hard problem into several simple subproblems, the mechanism for decomposing the problem and the mechanism for solving the subproblems differ, and they form the core of each reasoning paradigm. We carefully compare the two typical baselines used in our work (Least-to-Most and CR) and analyze their characteristics and differences from RFF.

A.1 Difference from Least-to-Most

Least-to-Most generates the sub-problems of the final problem at the very beginning, and the LLM then follows the pipeline of sub-problems to generate its reasoning. It is a two-stage plan-then-solve style of reasoning, and the sub-problems never change during the solution, so they may be wrong or insufficient because they lack the intermediate information generated by the reasoning process. The biggest difference between RFF and Least-to-Most is that RFF starts from the end state of the problem: RFF does not plan only at the beginning but keeps generating questions, hints, or guidance by backward reasoning to lead the LLM to reason forward more effectively.

A.2 Difference from CR

CR first generates several hints for the problem; the hints are helpful for solving the question. The LLM then keeps generating sub-problems based on the hints and answering them until it can solve the problem. Different from CR, which generates guidance from the previously visited states to reduce hallucinations, our method (RFF) generates guidance from the backward reasoning to give the model a full perspective of the problem. Additionally, the detailed hints and nonstop sub-problems may lead CR to overthinking, whereas the full perspective and the State Check ensure that RFF stops quickly when facing simple questions instead of overthinking with complex prompting.

B Extra Experiments on Reasoning Paradigms

We compare two different backward reasoning strategies on math problems, called Pair Reasoning (same as RFF in 2) and Single Reasoning, as shown in Figure 6. We conduct experiments on the GSM8K dataset using Llama3-8B-Instruct with greedy search.

Model       Method                 ACC
Llama3-8B   CoT                    75.6%
Llama3-8B   Pair Reasoning RFF     83.8%
Llama3-8B   Single Reasoning RFF   69.8%

Table 5: The results of the different back reasoning strategies.

As shown in Table 5, the performance of Single Reasoning RFF drops badly on the GSM8K dataset and is even worse than CoT. We attribute the weakness of the Single Reasoning strategy to the fact that, when deducing the whole chain of backward thought at once, multi-hop situations seriously hurt the backward reasoning because no new information is generated by forward reasoning. So when employing backward reasoning paradigms, a continuous generation of new information is needed.

C Appendix for Prompts

The design of prompts is critical to lead the model to reason exactly according to the paradigm we have planned. We design these prompts deliberately to ensure the model reasons and outputs according to the format we give it.
Figure 6: Two different strategies of backward reasoning: (a) Pair Reasoning, in which backward and forward steps alternate (Reason-1 (Back), Reason-2 (Forward), ..., Reason-n (Back), Reason-n+1 (Forward)), and (b) Single Reasoning, in which the complete backward chain (Reason-1 (Back) to Reason-n (Back)) is generated first and the forward steps (Reason-n+1 (Forward) to Reason-2n (Forward)) follow.
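To make the contrast concrete, here is a schematic Python sketch of the two schedules. It is illustrative only: backward_step, forward_step, solved, and answer are stubs standing in for the G(), R(), C(), and O() prompt calls of Section 3, not functions from the paper's code.

# Stubs standing in for the G(), R(), C(), O() prompt calls (illustrative only).
def backward_step(llm, target, facts):
    return llm(f"Last step toward: {target}, given {facts}")

def forward_step(llm, target, facts):
    return llm(f"One forward step toward: {target}, given {facts}")

def solved(facts, question):
    return False   # placeholder state check

def answer(llm, question, facts):
    return llm(f"Answer {question} using {facts}")

def pair_reasoning(llm, question, max_steps):
    """(a) Pair Reasoning: alternate one backward step with one forward step (RFF)."""
    facts, target = [], question
    for _ in range(max_steps):
        target = backward_step(llm, target, facts)   # new sub-target, informed by current facts
        facts.append(forward_step(llm, target, facts))
        if solved(facts, question):
            break
    return answer(llm, question, facts)

def single_reasoning(llm, question, max_steps):
    """(b) Single Reasoning: deduce the whole backward chain first, then reason forward."""
    targets, target = [], question
    for _ in range(max_steps):                       # backward chain built without forward feedback
        target = backward_step(llm, target, facts=[])
        targets.append(target)
    facts = []
    for t in reversed(targets):                      # then follow the chain forward
        facts.append(forward_step(llm, t, facts))
    return answer(llm, question, facts)

The difference Table 5 measures is exactly the facts=[] in the backward loop of single_reasoning: its backward chain never sees information produced by forward reasoning.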

System: Suppose you are one of the greatest AI scientists, logicians, and mathematicians. Let’s play a game.
The Input is the current state, which contains four or three numbers.
The Target is the state we want to get, which contains one or two numbers.
The Last Step is how to get the Target with Input in the last step.
The Avoid is tested to be a wrong Last step, you need to generate another different step.
What you need to do is to think how to use Input to get Target use some steps, just output the most likely Last
Step you think.
Notice:
1 Now do not calculate the game, you need to rely on your instincts to give the most likely Last Step directly,
and do not output other thinking process.
2 The Last Step should contains two parts: "calculation" and "left".
3 The number used in "calculation" may not appear in Input, and the result of "calculation" must appear in
"left".
4 The numbers in "left" must be the same as Target.
5 You are forbidden to output Steps, you should output Last Step only.

User:
Input: 1 3 6 11
Target: 24
Avoid: 3 x 8 = 24 (left: 24)
Avoid: 4 x 6 = 24 (left: 24)

Assistant:
Last Step: 2 x 12 = 24 (left: 24)

Figure 7: Prompts of Last Step Generator for Game of 24
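For readers who want to wire these prompts into the loop sketched in Section 3, the following is a minimal illustration of how the Figure 7 template could be filled and its reply parsed. The chat-message format and the parse_last_step helper are assumptions made for illustration, not part of the paper's released code.

import re

LAST_STEP_SYSTEM = "..."  # the system prompt shown in Figure 7

def build_last_step_messages(numbers, target, avoid):
    """Fill the Figure 7 template with the current numbers, the target state,
    and previously failed last steps to avoid."""
    user = f"Input: {' '.join(map(str, numbers))}\nTarget: {target}\n"
    user += "".join(f"Avoid: {a}\n" for a in avoid)
    return [{"role": "system", "content": LAST_STEP_SYSTEM},
            {"role": "user", "content": user}]

def parse_last_step(reply):
    """Extract the 'calculation' and 'left' parts from a line like
    'Last Step: 2 x 12 = 24 (left: 24)'."""
    m = re.search(r"Last Step:\s*(.+?)\s*\(left:\s*([^)]*)\)", reply)
    if not m:
        return None
    calculation, left = m.group(1), m.group(2).split()
    return calculation, left

# Example with the reply shown in Figure 7:
print(parse_last_step("Last Step: 2 x 12 = 24 (left: 24)"))
# ('2 x 12 = 24', ['24'])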


System: Now you are given few examples about the game:
Input is the current state of the game.
Target is the final state you need try to satisfy using basic arithmetic operations (+ - * /) with the Input.
Steps are how to get Target with Input through basic operations.
Next Step is the how to get the Target with Input in the next step.
The Avoid is tested to be a wrong Next step, you need to generate another different step.
You need to choose two numbers from Input and use one basic arithmetic operations (+ - * /) to generate a
new number.
Notice:
1 Output the Next Step directly and do not output the other thinking process.
2 The Next Step contains and only contain two parts: "calculation" and "left".
3 The "left" should be close to Target but not asked to be the totally same.
4 Your calculation must be correct.
5 Do not output Steps.

User:
Input: 8 8 10 12
Target: 8 16
Avoid: 8 + 8 = 16

System:
Next Step: 12 - 10 = 2 (left: 2 8 8)

Figure 8: Prompts of Stepwise Forward Reason for Game of 24

System: You are one of the GREATEST mathematicians, logicians, programmers, and AI scientists. You
are intelligent and rational. You are prudent and cautious. You THINK NATURAL, BROAD AND DEEP.
Let's think step by step.
You will be given a mathematical problem, a question about it and the information we have calculated.
Notice:
1 You are not permitted to solve the question from the beginning.
2 You need to analyse the question and figure out what will be the last step to solve the question.
3 Make sure your analysis are used to calculate the result of the question not the intermediate result.
4 Output the calculation process, and the information we need.

User:
Problem: {problem}
Question: {question}
Information: {information}

Assistant:
To know {question}: Since {information}, we can do: {calculation process}
Need Information: {need information}

Figure 9: Prompts of Last Step Generator for math problems


System: You are one of the GREATEST mathematicians, logicians, programmers, and AI scientists. You
are intelligent and rational. You are prudent and cautious. You THINK NATURAL, BROAD AND DEEP.
Let's think step by step.
You will be given a mathematical problem, a question about the problem, and the information we have
calculated.
Notice:
1 You are not permitted to solve the question directly.
2 You need to analyse the question and figure out next step we get to solve the question.
3 Make sure your analysis are used to calculate the result of the question not the intermediate result.
4 Output the calculation process, and the new information we get.

User:
Problem: {problem}
Question: {question}
Information: {information}

Assistant:
Next Step: {calculation process}
New Information: {new information}

Figure 10: Prompts of Stepwise Forward Reason for math problems

System: You are one of the GREATEST mathematicians, logicians, programmers, and AI scientists. You
are intelligent and rational. You are prudent and cautious. You THINK NATURAL, BROAD AND DEEP.
Let's think step by step.
You will be given a mathematical problem, a question about it and information we have calculated.
Notice:
1 You are not permitted to solve the question.
2 You need to analyse the question and information, then figure out whether we have already solved the
question.
3 Make sure your analysis do not consist of calculation process.
4 If we have solved the question, you should output [True], else you should output [False].

User:
Problem: {problem}
Question: {question}
Information: {information}

Assistant:
Analyse: {analyse}
Answer: [True, False]

Figure 11: Prompts of State Check for math problems
