Reason From Future: Reverse Thought Chain Enhances LLM Reasoning
[Figure 2 panels: (a) Chain-of-Thought Prompting (CoT); (b) Tree-of-Thought Prompting (ToT); (c) Cumulative Reasoning Prompting (CR); (d) Reason from Future Prompting (RFF).]
Figure 2: Schematic illustrating various approaches to problem-solving with LLMs; each rectangular box represents a thought. Figure 2(d) shows only the basic framework of RFF; see the concrete pipelines of the two RFF variants in Algorithm 1 (RFF-T) and Algorithm 2 (RFF-G).
through simulating and evaluating long-term rewards of candidate paths, demonstrating strengths in reinforcement learning and dynamic programming scenarios. AOT (Sel et al., 2023) improves on ToT by introducing a step-evaluation mechanism into the reasoning process, which helps the LLM prune less promising search routes and thus reduces the search space. AOT+ (Sel et al., 2025), an upgrade of AOT, adds a fine-grained backtracking mechanism by labeling each step, which further reduces the reasoning cost spent on erroneous search routes. However, both AOT and AOT+ obtain their global perspective only by continuing to explore reasoning paths and degrade to plain ToT when exploring their first reasoning route, which can lead to a random search at each step and may miss the correct route.

Recent innovations also bridge reasoning with executable action, exemplified by frameworks like LATS (Language Agent Tree Search) (Zhou et al., 2023). By unifying hierarchical planning, probabilistic reasoning, and environment interaction within language models, LATS extends the dynamic capabilities of the ReAct (Reasoning + Acting) (Yao et al., 2022) paradigm, enabling adaptive agent behavior in multi-step problem-solving scenarios. While these approaches show complementary advantages in addressing combinatorial optimization and long-range dependency challenges, computational efficiency and path-pruning strategies remain critical areas for improvement.

2.3 Progressive Hint Prompting Reasoning

In the realm of progressive prompting for complex reasoning, the LLM solves a problem through multiple rounds of messages. Least-to-Most (Zhou et al., 2022) first breaks a problem into several sub-problems and then solves them sequentially. Progressive-Hint Prompting (PHP) (Zheng et al., 2023) advances dynamic problem-solving by fostering iterative, multi-turn interactions between users and LLMs. This method leverages feedback-driven prompts informed by historical outputs to systematically refine reasoning accuracy and coherence. Parallel to this, Cumulative Reasoning (CR) (Zhang et al., 2023) emulates human-like incremental cognition by decomposing tasks into structured subtasks and aggregating intermediate results through stepwise integration. Both PHP and CR synergize with foundational frameworks like CoT and its derivatives, collectively strengthening the generation and validation of adaptive reasoning pathways.

Recent advancements further explore hybrid architectures that combine PHP with retrieval-augmented mechanisms and task-specific distillation. These frameworks aim to balance computational efficiency with robust reasoning fidelity, addressing challenges such as error propagation and context scalability. By integrating iterative feedback loops with external knowledge retrieval, such approaches optimize performance in multi-step reasoning tasks while maintaining generalizability.

3 Methods

Reason from Future (RFF) is a reasoning paradigm that allows models to solve a question by using forward and backward reasoning alternately. We use pθ to denote an LLM with parameters θ, and x, t to denote the input and question. {S} ∼ {S0, S1, ..., Si} and {T} ∼ {T0, T1, ..., Ti} denote the list of current states and the list of target states at each step i. We define O(pθ, x, t | Si) as the output of the LLM with parameters θ using a prompt consisting of the input x, the target t, and the hints Si. In the i-th step, the model identifies the preceding step closest to the current target state Ti−1, takes it as the new target state Ti, and provides the calculation relationship between the two. Then the model takes Ti as the target for one-step forward reasoning. The model repeats this procedure until the latest target state has been achieved (Si = Ti). A specific RFF pipeline consists of three components: 1) Last Step Generator G(); 2) Stepwise Forward Reason R(); 3) State Check C().

3.1 Last Step Generator

RFF implements backward reasoning by generating the immediately preceding step. To be specific, RFF decomposes one target state Ti, given the current state Si, into a pre-target state Ti+1 = G(pθ, Si, Ti) one step at a time. The form of the specific sub-target state depends on the target of the task, such as a set of numbers (Game of 24) or the variables to be found (mathematical problems). It is worth noting that the transition step between the pre-target state Ti+1 and the target Ti should be output explicitly to guarantee, to a certain extent, the correctness of the target decomposition.

3.2 Stepwise Forward Reason

We consider two different strategies, RFF-T in Algorithm 1 and RFF-G in Algorithm 2, to generate the next forward reasoning step for different types of target:

(a) RFF-T: For problems like Game of 24 or a maze game, whose solution is one branch of a search tree, the model should avoid repeating wrong attempts in the same layer of the search tree. We use {A} ∼ {A0, A1, ..., Ai} to denote the attempts that should be avoided at step i, so the next state is Si ← R(pθ, Si−1, Ti, Ai−1).

Algorithm 1 RFF-T
Require: LM pθ, input x, max steps L, last step generator G(), stepwise reasoner R(), state checker C(), current states {S}, target states {T}, avoided attempts {A}, verifier V(), output function O()
1: S0 ← x, T0 ← t, A0 ← {}, i ← 0
2: while i <= L do
3:   i ← i + 1
4:   Ai ← {}
5:   Ti ← G(pθ, Si−1, Ti−1)
6:   Si ← R(pθ, Si−1, Ti, Ai−1)
7:   if C(Si, Ti) == True then
8:     j ← V(Si, Ti)
9:     if j == i then
10:      break
11:    end if
12:    Aj ← Aj ∪ {Sj, Tj}
13:    i ← j
14:  end if
15: end while
16: return O(pθ, x, t | Si)

(b) RFF-G: For problems like mathematical word problems, whose solution is a directed acyclic graph, all the information calculated by the previous states is either useful or redundant but not harmful, so the reasoning path should consider all the information calculated by the previous states, which gives Si ← Si−1 ∪ R(pθ, x, Si−1, Ti).

Algorithm 2 RFF-G
Require: LM pθ, input x, max steps L, last step generator G(), stepwise reasoner R(), state checker V(), current states {S}, target states {T}, output function O()
1: S0 ← x, T0 ← t
2: for i = 1 to L do
3:   Ti ← G(pθ, Si−1, Ti−1)
4:   Si ← Si−1 ∪ R(pθ, Si−1, Ti)
5:   if V(Si, Ti) == True then
6:     break
7:   end if
8: end for
9: return O(pθ, x, t | Si)
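For readers who prefer code, the following is a minimal Python sketch of Algorithm 1 (RFF-T) and Algorithm 2 (RFF-G). It is not the paper's implementation: lm stands for the language model pθ, and G, R, C, V, and O are assumed to be caller-supplied wrappers around the corresponding prompts (see Appendix C); the reading that V returns the index of the step to resume from is our interpretation of the pseudocode.

def rff_t(lm, x, t, G, R, C, V, O, max_steps):
    # Sketch of Algorithm 1 (RFF-T). G, R, C, V, O are the prompt-based helpers of
    # Section 3 (last step generator, stepwise reasoner, state check, verifier,
    # output function), assumed here to be plain Python callables.
    S = {0: x}        # forward states, S_0 = input
    T = {0: t}        # backward targets, T_0 = final target
    A = {0: set()}    # attempts to avoid at each step (states must be hashable here)
    i = 0
    while i <= max_steps:
        i += 1
        A[i] = set()
        T[i] = G(lm, S[i - 1], T[i - 1])        # backward: propose the pre-target state T_i
        S[i] = R(lm, S[i - 1], T[i], A[i - 1])  # forward: one step toward T_i, avoiding A_{i-1}
        if C(S[i], T[i]):                       # forward state has reached the backward target
            j = V(S[i], T[i])                   # verify; we read j == i as "the chain checks out"
            if j == i:
                break
            A[j].add((S[j], T[j]))              # remember the failed attempt at step j ...
            i = j                               # ... and backtrack to retry from that step
    return O(lm, x, t, S[i])                    # final answer conditioned on the hints S_i


def rff_g(lm, x, t, G, R, V, O, max_steps):
    # Sketch of Algorithm 2 (RFF-G): forward states accumulate by set union instead of
    # being replaced, and the check V decides when to stop. R is assumed to return a
    # set of newly derived facts.
    S = {0: frozenset([x])}   # S_0 holds the given information
    T = {0: t}
    i = 0
    for i in range(1, max_steps + 1):
        T[i] = G(lm, S[i - 1], T[i - 1])         # backward: next sub-target
        S[i] = S[i - 1] | R(lm, S[i - 1], T[i])  # forward: add the newly derived information
        if V(S[i], T[i]):                        # stop once the sub-target is satisfied
            break
    return O(lm, x, t, S[i])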
3.3 State Check

State Check C() maintains an inference boundary that determines the termination conditions of the inference paradigm. Similar to Stepwise Forward

[Figure: a Game of 24 example with input "2 3 6 8", showing forward reasoning, backward reasoning, and verification steps such as "2 + 6 = 8 (left: 3 8 8)" and "3 * 6 = 18 (left: 2 8 18)".]

Figure 4: An example from the GSM8K dataset, with solutions generated by the Direct, CoT, and RFF paradigms. The former two paradigms tend to connect "win" with the positive operation "more", while RFF first analyzes the background of "win" and then generates the operation.
prompts from (Zhang et al., 2023). All methods are tested 100 times to obtain the average result, and unless otherwise specified, the temperature of the model is set to 0.7. We also use GPT-4 as the baseline model. Due to the different space-exploration paradigms, we group methods with a similar number of visited search states into the same class for comparison, rather than grouping by similar branching of the search trees.

Results

As shown in Table 1, RFF with Llama3-8B exhibits outstanding performance even compared to GPT-4. CoT performs badly on this task because searching-tree-like tasks require the model to explore a wide solution space. Searching paradigms like ToT and CR achieve better scores than CoT; meanwhile, the ToT method visits more states because of blind searching. RFF reaches the highest accuracy with the fewest visited states at the same level: when the number of visited states is around 10, RFF reaches the best accuracy of 89% compared to CR with GPT-4 at 84%; when the number of visited states is around 14, RFF reaches an accuracy of 96% compared to CR with GPT-4 at 94%. The fewer visited states and higher accuracy arise because the search space in RFF is much smaller and more reasonable than that of simple forward searching (e.g., for "1 2 12 12" with the backward target "12 + 12 = 24", the LLM is much less likely to explore moves like "2 + 12 = 14").
Model Method GSM8K SVAMP ASDiv MATH AVG
CoT 75.6% 80.5% 82.3% 32.8% 67.8%
Least-to-Most 79.5% 86.8% 84.4% 38.8% 72.4%
Llama3-8B-Instruct Give-me-Hint 77.3% 87.9% 86.0% 37.0% 72.1%
CR 77.0% 71.2% 84.8% 40.2% 68.3%
RFF 83.8% 89.7% 86.7% 41.4% 75.4%
CoT 87.2% 92.1% 88.0% 74.6% 85.5%
Qwen2.5-7B-Instruct CR 87.7% 83.7% 91.9% 78.2% 85.4%
RFF 89.5% 95.1% 92.2% 79.8% 89.1%
Table 2: The results on the math problems. The score represents the accuracy on each benchmark. The best result in each column is highlighted in bold. AVG is the average over the four benchmarks.
4.2 Math Problem Benchmark

This task contains four datasets: GSM8K, SVAMP, ASDiv, and MATH-500. GSM8K is a mathematical dataset with 1319 test samples, each question requiring 3-10 steps of reasoning and calculation. SVAMP and ASDiv are two simpler math problem datasets with 1000 and 2096 samples respectively, each question requiring 1-2 steps of reasoning and calculation. MATH-500 is a subset of 500 mathematical problems from the MATH (Hendrycks et al., 2021) benchmark, which is much harder than GSM8K.

Task Setup

We conduct this task on Llama3-8B-Instruct and Qwen2.5-7B-Instruct with greedy search to exclude the influence of sampling randomness on textual reasoning. We apply RFF-G to the mathematical problems, as their solutions can be seen as a directed acyclic graph from the question to the answer. We employ one shot as the example to lead the model to perform formatted reasoning.

Baselines

Considering the nature of the math problems, CoT and CR are chosen as baselines for their strong ability in complex thinking and multi-hop reasoning. CoT and CR are set with one shot to balance the influence of the same setup in RFF-G. CoT generates a continuous chain of thoughts until the model answers the question, while CR first generates a few hints, then generates simple questions and answers until the model thinks it has enough to answer the question. We also conduct two extra baselines not shown in the Game of 24: Least-to-Most (Zhou et al., 2022), which generates the sub-problem pipeline first and then follows that pipeline to solve the problem, and Give-me-Hint (Agrawal et al., 2024), which generates helpful hints to help the LLM solve the problem. Both baselines are recent and typical enough to serve as convincing control groups.

Results

Table 2 shows the accuracy of Llama3-8B-Instruct and Qwen2.5-7B-Instruct on the four datasets; our method RFF shows clear advantages over the other methods. Meanwhile, there is another interesting phenomenon: the progressive prompting methods show a large accuracy improvement over CoT on the GSM8K, ASDiv, and MATH datasets, which we attribute to their better focus on details and relations. However, CR fails to reach the average level of the other methods on the simple task SVAMP, with 71.2% compared to the 85.1% average of the other baselines, while it demonstrates better performance on the hard task MATH, with 40.2% compared to 36.2%. We carefully checked the outputs of CR and of our RFF and found that the detailed hints and question-answer pairs can be helpful on hard problems like MATH, but they lead to a harmful overthinking phenomenon when facing simple problems like SVAMP. Our method RFF, benefiting from the State Check, can avoid overthinking once the State Check decides that the reasoning should stop. We also notice that the gap between RFF and CoT increases as the base ability of the model decreases (from Qwen to Llama), demonstrating a significant complementary effect on the model's reasoning ability.

4.3 Commonsense Problem Benchmark

To further explore the effectiveness of RFF in different NLP tasks, we have conducted experiments on commonsense problem benchmarks. Commonsense problems usually refer to those that require basic knowledge and reasoning ability from human daily life to solve.
These problems rely on background knowledge, which is usually not explicitly stated in the problems. We have conducted experiments on two widely used commonsense benchmarks, CommonQA (Talmor et al., 2018) and LogiQA (Liu et al., 2020), using RFF-G. Both benchmarks are multiple-choice, containing 12102 and 8678 questions respectively, each with one correct answer and four choices.

Task Setup

We conduct this task on Llama3-8B-Instruct. Considering the nature of the commonsense tasks, we choose RFF-G to solve the questions. We judge the accuracy of an answer by observing whether the model outputs the correct option; any other form of answer is viewed as wrong.

Baselines

For commonsense tasks, it is helpful if the model is given background information before answering the question, so hint-based and progressive prompting paradigms serve as effective references. As in the math benchmark, we choose CoT, CR, Least-to-Most, and Give-me-Hint as our experiment baselines.

Results

As shown in Table 3, all the reasoning paradigms achieve better accuracy on CommonQA, demonstrating that complex prompting can easily improve the performance on it. However, the differences among these methods are not very apparent (all distributed around 76%), so the CommonQA result alone is not sufficient to establish the superiority of our method. On LogiQA, the differences show a greater gap among the baselines: the results of Least-to-Most and Give-me-Hint are close to CoT, while the results of RFF and CR show a significant improvement over CoT.

The results also indicate that in commonsense tasks, especially the easy task (CommonQA), the simple prompting method (Give-me-Hint) achieves better scores than the complex prompting methods (Least-to-Most and CR), which we attribute to the overthinking of these methods. Meanwhile, RFF can still maintain a high score due to its step-evaluation mechanism: when facing simple questions, the forward reasoning quickly meets the backward reasoning, and RFF then turns to simple CoT to solve the problem instead of overthinking.

4.4 Studies of Redundant Thinking

We investigate the limitations of traditional algorithms for solving Game of 24. While conventional breadth-first searching methods perform well in low-dimensional solution spaces, their unguided exploration mechanisms may lead to significant computational resource waste and efficiency degradation when handling higher-dimensional problems. To validate this hypothesis, we constructed an experimental dataset comprising 100 enhanced problems (IDs 901-1000) by adding the constant "1" to the original four-number combinations, creating five-number variants. Theoretically, this operation preserves the solvability of the problems (based on arithmetic identity transformations) and is expected to decrease the difficulty, since the added number is redundant.

Model Method ACC Visited States
GPT-4 CR (n=5) 76% 7.06
GPT-4 RFF (n=5) 89% 5.96
GPT-4 RFF (n=10) 93% 9.13
Llama3-8B CR (n=5) 26% 96.56
Llama3-8B RFF (n=5) 85% 28.62
Llama3-8B RFF (n=10) 92% 56.13

Table 4: The results on the five-number Game of 24.
[Figure 5 panels: three histograms of frequency versus 1-shot accuracy (%).]

Figure 5: The results of CoT and RFF on the GSM-Symbolic dataset. The score in the upper right of each chart is the average accuracy over the 50 subsets. RFF shows a more stable distribution, concentrated at higher accuracy.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

Leonie Koban, Peter J Gianaros, Hedy Kober, and Tor D Wager. 2021. The self in context: brain systems linking mental and physical health. Nature Reviews Neuroscience, 22(5):309-322.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step. In The Twelfth International Conference on Learning Representations.

Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. LogiQA: A challenge dataset for machine reading comprehension with logical reasoning. arXiv preprint arXiv:2007.08124.

Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2021. A diverse corpus for evaluating and developing English math word problem solvers. arXiv preprint arXiv:2106.15772.

Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. 2024. GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2018. CommonsenseQA: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937.

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837.

Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, and Wynne Hsu. 2024. Faithful logical reasoning via symbolic chain-of-thought. arXiv preprint arXiv:2405.18357.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.

Yifan Zhang, Jingqin Yang, Yang Yuan, and Andrew Chi-Chih Yao. 2023. Cumulative reasoning with large language models. arXiv preprint arXiv:2308.04371.

Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. 2023. Progressive-hint prompting improves reasoning in large language models. arXiv preprint arXiv:2304.09797.

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. 2023. Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.

A Detailed Difference between Baselines

Although most progressive prompting methods employ reasoning paradigms that divide a hard problem into several simple subproblems, the mechanism for decomposing the problem and the mechanism for solving the subproblems differ, and they serve as the core of each reasoning paradigm. We carefully compare the two typical baselines used in our work (Least-to-Most and CR) and analyze their characteristics and differences with RFF.

A.1 Difference between Least-to-Most

Least-to-Most generates the sub-problems of the final problem at the very beginning, and then the LLM follows the pipeline of sub-problems to generate its reasoning. It is a two-stage plan-then-solve paradigm, and the sub-problems never change during the solution, so they may be wrong or insufficient because of the lack of intermediate information generated by the reasoning process.

The biggest difference between RFF and Least-to-Most is that the RFF process starts from the end state of the problem: RFF does not plan only at the beginning but continues to generate questions, hints, or guidance through backward reasoning to lead the LLM to reason forward better.

A.2 Difference between CR

CR first generates several hints for the problem, which are helpful for solving the question. The LLM then continues to generate sub-problems based on the hints and answers them until it can solve the problem.

Differing from CR, which generates guidance from the previously visited states to reduce hallucinations, our method (RFF) generates guidance from backward reasoning to give the model a full perspective of the problem. Additionally, the detailed hints and non-stop subproblems may lead CR to overthink, whereas the full perspective and the State Check ensure that RFF can quickly stop overthinking when facing simple questions with complex prompting.

B Extra Experiments on Reasoning Paradigms

We compare two different backward reasoning strategies on the math problems, called Pair Reasoning (the same as RFF above) and Single Reasoning, as shown in Figure 6. We conduct the experiments on the GSM8K dataset using Llama3-8B-Instruct with greedy search.

Model Method ACC
Llama3-8B CoT 75.6%
Llama3-8B Pair Reasoning RFF 83.8%
Llama3-8B Single Reasoning RFF 69.8%

Table 5: The results of different backward reasoning strategies.

As shown in Table 5, the performance of Single Reasoning RFF drops badly on the GSM8K dataset and is even worse than CoT. We assume the weakness of the Single Reasoning strategy is that, when deducing the whole chain of backward thought, multi-hop situations seriously affect the backward reasoning without new information generated by forward reasoning. So when employing backward reasoning paradigms, a continuous generation of new information is needed.

C Appendix for Prompts

The design of the prompts is critical to lead the model to reason exactly according to the paradigm we have planned. We design these prompts deliberately to ensure the model reasons and outputs according to the format we give it.
[Figure: two schematic reasoning chains, one alternating (Forward) and (Back) steps and one performing all (Back) steps (Reason-1 ... Reason-n) before the (Forward) steps (Reason-n+1 ... Reason-2n), each leading through a Virtual Output to the Output.]
System: Suppose you are one of the greatest AI scientists, logicians, and mathematicians. Let’s play a game.
The Input is the current state, which contains four or three numbers.
The Target is the state we want to get, which contains one or two numbers.
The Last Step is how to get the Target with Input in the last step.
The Avoid is tested to be a wrong Last step, you need to generate another different step.
What you need to do is to think how to use Input to get Target use some steps, just output the most likely Last
Step you think.
Notice:
1 Now do not calculate the game, you need to rely on your instincts to give the most likely Last Step directly,
and do not output other thinking process.
2 The Last Step should contains two parts: "calculation" and "left".
3 The number used in "calculation" may not appear in Input, and the result of "calculation" must appear in
"left".
4 The numbers in "left" must be the same as Target.
5 You are forbidden to output Steps, you should output Last Step only.
User:
Input: 1 3 6 11
Target: 24
Avoid: 3 x 8 = 24 (left: 24)
Avoid: 4 x 6 = 24 (left: 24)
Assistant:
Last Step: 2 x 12 = 24 (left: 24)
User:
Input: 8 8 10 12
Target: 8 16
Avoid: 8 + 8 = 16
Assistant:
Next Step: 12 - 10 = 2 (left: 2 8 8)
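As an illustration of how such a prompt can be packaged for a chat model, the snippet below assembles the Input/Target/Avoid fields into messages mirroring the System/User blocks above. This is a sketch under our own assumptions: the truncated system text and the call_llm client are placeholders, not part of the paper.

LAST_STEP_SYSTEM = (
    "Suppose you are one of the greatest AI scientists, logicians, and mathematicians. "
    "Let's play a game. ..."  # truncated; the full system text is shown above
)

def last_step_messages(numbers, target, avoid):
    # Mirror the User block of the last-step-generator prompt: Input, Target, Avoid lines.
    user = "Input: " + " ".join(str(n) for n in numbers) + "\n"
    user += f"Target: {target}\n"
    user += "".join(f"Avoid: {a}\n" for a in avoid)
    return [
        {"role": "system", "content": LAST_STEP_SYSTEM},
        {"role": "user", "content": user},
    ]

messages = last_step_messages([1, 3, 6, 11], 24,
                              ["3 x 8 = 24 (left: 24)", "4 x 6 = 24 (left: 24)"])
# reply = call_llm(messages)   # hypothetical client; expected reply: "Last Step: 2 x 12 = 24 (left: 24)"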
System: You are one of the GREATEST mathematicians, logicians, programmers, and AI scientists. You
are intelligent and rational. You are prudent and cautious. You THINK NATURAL, BROAD AND DEEP.
Let's think step by step.
You will be given a mathematical problem, a question about it and the information we have calculated.
Notice:
1 You are not permitted to solve the question from the beginning.
2 You need to analyse the question and figure out what will be the last step to solve the question.
3 Make sure your analysis are used to calculate the result of the question not the intermediate result.
4 Output the calculation process, and the information we need.
User:
Problem: {problem}
Question: {question}
Information: {information}
Assistant:
To know {question}: Since {information}, we can do: {calculation process}
Need Information: {need information}
User:
Problem: {problem}
Question: {question}
Information: {information}
Assistant:
Next Step: {calculation process}
New Information: {new information}
System: You are one of the GREATEST mathematicians, logicians, programmers, and AI scientists. You
are intelligent and rational. You are prudent and cautious. You THINK NATURAL, BROAD AND DEEP.
Let's think step by step.
You will be given a mathematical problem, a question about it and information we have calculated.
Notice:
1 You are not permitted to solve the question.
2 You need to analyse the question and information, then figure out whether we have already solved the
question.
3 Make sure your analysis do not consist of calculation process.
4 If we have solved the question, you should output [True], else you should output [False].
User:
Problem: {problem}
Question: {question}
Information: {information}
Assistant:
Analyse: {analyse}
Answer: [True, False]