
OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models

Jun Wang1, Meng Fang2, Ziyu Wan3, Muning Wen3, Jiachen Zhu3, Anjie Liu4, Ziqin Gong4,
Yan Song1, Lei Chen4, Lionel M. Ni4, Linyi Yang5, Ying Wen3, Weinan Zhang3
1 University College London, 2 University of Liverpool, 3 Shanghai Jiao Tong University,
4 The Hong Kong University of Science and Technology (Guangzhou), 5 Westlake University
arXiv:2410.09671v1 [cs.AI] 12 Oct 2024

Abstract

In this technical report, we introduce OpenR, an open-source framework designed
to integrate key components for enhancing the reasoning capabilities of large
language models (LLMs). OpenR unifies data acquisition, reinforcement learning
training (both online and offline), and non-autoregressive decoding into a cohesive
software platform. Our goal is to establish an open-source platform and community
to accelerate the development of LLM reasoning. Inspired by the success of
OpenAI’s o1 model, which demonstrated improved reasoning abilities through step-
by-step reasoning and reinforcement learning, OpenR integrates test-time compute,
reinforcement learning, and process supervision to improve reasoning in LLMs.
Our work is the first to provide an open-source framework that explores the core
techniques of OpenAI’s o1 model with reinforcement learning, achieving advanced
reasoning capabilities beyond traditional autoregressive methods. We demonstrate
the efficacy of OpenR by evaluating it on the MATH dataset, utilising publicly
available data and search methods. Our initial experiments confirm substantial
gains, with relative improvements in reasoning and performance driven by test-
time computation and reinforcement learning through process reward models.
The OpenR framework, including code, models, and datasets, is accessible at
https://openreasoner.github.io.

1 Introduction
OpenAI has recently unveiled o1 [OpenAI, 2024], a groundbreaking large language model (LLM)
that represents a giant leap forward in strong AI. The model is reported to be five times more
proficient in math and coding compared to the previous GPT-4o, specifically displaying exceptional
performance across various domains: it ranks in the 89th percentile for competitive programming,
places among the top 500 students in a prestigious US math olympiad qualifier, and surpasses human
PhD-level accuracy in physics, biology, and chemistry benchmarks. Trained using reinforcement
learning techniques, o1 excels in complex reasoning tasks by explicitly embedding a native “Chain-of-
Thought” (NCoT) process in LLMs, which allows it to “deep think” through step-by-step reasoning
before generating responses. A key innovation of o1 is that it allows spending more time reasoning
during the inference process, marking a shift from fast, direct responses to slow, deliberate, multi-step
inference-time computation, as illustrated in Figure 1.

Figure 1: Inference-time computation. (a) An autoregressive LLM directly generates an answer (A)
by conditioning on the given question (Q). (b) The concept of chain of thought, or step-by-step thinking,
involves incorporating intermediate reasoning steps (R) before arriving at the final answer (A). These
repeated operations allow for 1) revisiting and revising prior outputs, 2) progressing to subsequent
reasoning stages, and 3) exploring multiple reasoning paths or trajectories.

Correspondence to: Jun Wang and Meng Fang.
Technical Report. Work in progress.

Interestingly, human cognition involves two correlated yet distinct modes of cognitive processing
that guide decision-making and behaviour [Kahneman, 2011], each associated with
partially distinct brain circuits and neural pathways. System 1 thinking is fast, automatic,
and intuitive, operating effortlessly and often unconsciously. It relies on neural pathways that enable
rapid processing, especially in situations needing quick reactions or when cognitive resources are
constrained. System 2 thinking is deliberate, effortful, and conscious, involving focused attention and
analytical reasoning. It processes information more slowly and is used for complex problem-solving,
logical reasoning, and decision-making tasks.
The o1 model is an exciting development for AI, as LLMs can now not only generate rapid responses using
learned patterns but, more significantly, simulate complex reasoning processes through mechanisms
like chain-of-thought or other forms of search, similar to how humans engage in deeper, step-by-step
thinking. o1’s improved reasoning skills have implications for multiple fields, including
science, coding, and mathematics. In coding competitions, a specialised version of o1 achieved
impressive results, scoring in the 49th percentile in the 2024 International Olympiad in Informatics
and outperforming 93% of human competitors in simulated Codeforces contests. Beyond its technical
capabilities, o1 also represents progress in AI safety and alignment. The model’s chain of thought
reasoning provides new opportunities for integrating human values and principles, resulting in
improved performance on safety evaluations and jailbreak tests.
The idea of chain-of-thought reasoning [Wei et al., 2022] and step-by-step thinking in large language
models (LLMs) is not new. Previous research has shown that simply adding instructions like “describe
your reasoning in steps” or “explain your answer step by step” to the input questions or providing
few-shot examples can trigger LLMs to generate intermediate reasoning steps (as illustrated in
Figure 1) and subsequently improve problem-solving, especially in tasks like math and coding [Wei
et al., 2022, Nye et al., 2021]. However, these approaches build on existing LLMs without truly
embedding the chain of thought ability within the models themselves. As a result, LLMs cannot
inherently learn this reasoning capability, leading to active research on how to integrate it directly into
model training. Proposed methods range from collecting specialised training data to building reward
models [Ouyang et al., 2022, Li et al., 2022, Luo et al., 2024] and increasing the computational
complexity of decoding [Snell et al., 2024, Wu et al., 2024], but none have yet achieved significant
performance breakthroughs at scale.
It remains unclear whether o1’s innovation is rooted in the model itself, rather than relying on external
prompting systems. If it indeed involves explicitly embedding step-by-step reasoning natively within
the architecture, this would represent a significant breakthrough. Building on substantial performance
gains, o1 has shown that the scaling principles traditionally applied during training [Kaplan et al., 2020,
Snell et al., 2024] are now relevant to the inference phase. We should reallocate our computational
focus, balancing pre-training efforts with efficient use of inference-time computation. Allowing LLMs
to enhance their outputs with increased test-time computing is an essential step towards creating
generally self-improving agents capable of managing open-ended strong reasoning and decision-making
tasks. This direction, which we refer to as LLM-Native Chain-of-Thought (NativeCoT),
should inherently mirror the deliberate, analytical process of human System 2
thinking [Kahneman, 2011].
In this report, we present OpenR, an open-source framework built on the principles behind OpenAI’s
o1 model, designed to replicate and extend its reasoning capabilities. Our approach focuses on
improving LLM reasoning by integrating process supervision, reinforcement learning (RL), and
inference-time computation strategies such as guided search. OpenR implements key components
such as data augmentation for process supervision, policy learning via RL, and efficient decoding
algorithms. By doing so, it shifts the focus from merely scaling model parameters during pre-training
to leveraging smarter inference strategies at test time. These techniques help the model refine its
reasoning step by step, allowing it to pause, evaluate intermediate reasoning, and select better solution
pathways during test-time computation. Through experiments on publicly available benchmarks,
such as the MATH dataset, we show that the combination of process reward models and guided
search improves test-time reasoning performance by approximately 10%.
In summary, we introduce OpenR, an open-source framework that integrates test-time computation
and process supervision to enhance reasoning in LLMs, providing an open platform with models,
data, and code to foster collaboration and accelerate research in LLM reasoning. To our knowledge,
OpenR is the first open-source framework to explore the core methods of OpenAI’s o1 model with
reinforcement learning techniques. The framework includes reinforcement learning algorithms
designed to optimize decision-making during training, enabling more accurate and deliberate step-
by-step reasoning. Additionally, OpenR provides tools for generating synthetic process reward data,
reducing dependence on costly human annotations and supporting scalable process supervision.
Through experiments, we demonstrate the effectiveness of process reward models and test-time
guided search.

2 Related Work

Key references in the field of improving reasoning capabilities in large language models (LLMs)
highlight several innovative approaches, including inference-time computing, process reward models,
and data acquisition methods.
Inference-time Computing. Recent studies of inference-time computation in large language
models (LLMs) have focused on optimizing the efficiency and effectiveness of
reasoning during inference rather than relying solely on training-time scaling laws.
Feng et al. [2024] demonstrate the benefits of using MCTS as a decoding
mechanism, which enhances inference computation by actively planning and selecting higher-quality
responses. This aligns with the reasoning-as-planning approach proposed in Hao et al.
[2023], where reasoning is viewed as a process similar to planning in decision-making,
further underscoring the centrality of step-wise reasoning at inference time. More recently, Snell
et al. [2024] show that optimizing inference strategies can yield larger performance gains
than simply increasing model size, underscoring the critical role of test-time computation.
This is complemented by the findings of Goyal et al. [2023], who introduce an
implicit reasoning model by incorporating pause tokens to encourage deliberate reasoning during
generation. Collectively, these advances suggest a growing recognition of inference-time
optimisation, whether through planning-based reasoning models or computational optimisation,
as a critical factor in improving LLM capabilities, advocating for strategies that enhance reasoning,
planning, and compute efficiency beyond mere training-time scaling.
From Outcome Supervision to Process Supervision. The shift from Outcome Supervision to
Process Supervision in language model training has gained prominence in recent research, particularly
with respect to enhancing reasoning capabilities. The foundational work by Cobbe et al. [2021a]
introduces Outcome-supervised Reward Models (ORM) and the widely used math reasoning dataset,
GSM8K, where verifiers are trained to assess the final correctness of generated solutions. While
ORM plays a crucial role in the early stage, it primarily focuses on evaluating the end result rather
than the reasoning steps leading to the final output.
Building on this, the concept of process reward models (PRM) is introduced as a more granular
and transparent approach. With both ORM and PRM, DeepMind proposes the idea of supervising
intermediate reasoning steps alongside the final outcome, allowing for more detailed feedback during
the reasoning process [Uesato et al., 2022]. This research laid the groundwork for subsequent
developments in process-based verification. On the other hand, OpenAI’s work [Lightman et al.,
2023] continues this trend by refining PRM through a follow-up study that emphasizes verifying each
intermediate step in reasoning tasks by providing a high-quality human-labelled process-supervision
dataset, namely PRM800K, which has been enriched in our work.

Figure 2: In this MDP formulation, the LLM is tasked with generating reasoning steps and the final
answer to a question in a step-by-step manner. The LLM policy operates by generating tokens, which
form higher-level reasoning constructs. The states represent the sequence of reasoning steps so far,
and actions correspond to the selection of new reasoning steps or the final answer. The LLM policy
governs the choice of actions, and the process reward model (PRM) provides feedback on the quality
of reasoning steps and the final answer. By optimising the policy to maximise the reward, the LLM
can be guided by PRM to generate accurate and meaningful reasoning processes.

Similarly, the integration of verifier models with majority voting schemes, as highlighted in Li
et al. [2022], showcases the practical application of PRM. This method uses a verifier to scrutinize
each reasoning step while incorporating majority voting to increase the reliability of the final result.
Furthermore, Yu et al. [2024] introduce another approach using reinforcement learning to enhance
the planning and reasoning process in LLMs, providing a hybrid of both outcome- and process-
supervised techniques. Besides the traditional scalar-based reward models, the recently proposed
generative reward model (GenRM) [Zhang et al., 2024] attracts significant attention since the verifier
and generator can interact with each other in a more information-dense text-based manner. When the
reward model monitors the policy, it not only outputs the score of each answer but also details the
reason for the error. This trend in recent research reflects a broader move toward more sophisticated
process-supervision methods, which has been fully covered in this project.
Data Acquisition. The problem of Data Acquisition for PRM has evolved significantly, focusing on
automating the extraction of step-by-step reasoning data, which is crucial for training models capable
of complex reasoning tasks. The STaR technique [Zelikman et al., 2022] presents a novel self-taught
reasoning approach where models generate and bootstrap their own reasoning processes for further
training, thus improving reasoning capabilities without extensive labelled datasets. Building upon
the foundation laid by STaR, Quiet-STaR [Zelikman et al., 2024] demonstrates how these techniques can be
generalized beyond specific domains like mathematical problem-solving. By extending the reasoning
process to arbitrary tasks and incorporating the methodology into pre-training, Quiet-STaR highlights
the versatility of automated process supervision across various tasks, marking a significant step
in scaling data acquisition for reasoning tasks. In addition, Luo et al. [2024] represent the latest
advancement in the field, specifically focusing on mathematical reasoning. This work refines the
methods for automated data acquisition, making the process more robust and applicable to increasingly
complex problem-solving scenarios. Moreover, Wang et al. [2024a] take the concept of automatic
process supervision a step further by proposing a practical solution for training models without
relying on human-labelled data. Finally, the empirical results in Wang et al. [2024b] extend these
approaches by testing their applicability on coding tasks, demonstrating that process supervision
can be effectively induced by the model itself. These works underscore the increasing reliance on
automated data acquisition methods, where models are equipped to extract and verify their self-
reasoning processes. To facilitate the research in this direction, we make the generated dataset and
code publicly available.
In summary, advanced reasoning in models such as OpenAI’s o1 relies heavily on careful data
selection, sophisticated PRM training, and enhanced decoding methods. Approaches such as tree-
based search, reinforcement learning, and step-aware verifiers enable these models to tackle more
complex tasks. As research progresses, LLMs are expected to further enhance their autonomous
reasoning, planning, and problem-solving capabilities. Our project aims to serve as a starting point
for transparently investigating and evaluating the potential of inference-time computation.

Figure 3: The OpenR framework for LLM reasoning. Combining the value function from the
PRM with the LLM’s policy generation ensures guided and controlled results. During training, the
generation produced by the LLM’s policy and the evaluation provided by the PRM reinforce each
other, leading to continuous self-improvement and refinement of both components.

3 The OpenR LLM Reasoning Framework

To model the process of reasoning in tasks such as question-answering or problem-solving, we
structure the reasoning task using the Q → {R} → A sequence, where:

• Q represents the question or prompt that initiates the reasoning process;

• R represents the sequence of intermediate reasoning steps the model generates to build
toward the solution;

• A represents the final answer or solution produced after the reasoning steps.

This structure allows the LLM to generate a sequence of reasoning steps that logically connect
the question Q to the final answer A. We can define the reasoning process as a Markov Decision
Process (MDP) [Bellman, 1958]. An MDP representation offers a flexible framework for modelling
reasoning. It allows the model to generate sequential reasoning steps toward the final answer step
by step while also enabling a tree structure by sampling multiple paths at each step for alternative
reasoning trajectories. By combining both approaches - sequential and branching reasoning - the
model can explore diverse solutions, creating a versatile and comprehensive reasoning process.
In an MDP (as illustrated in Figure 2), the LLM policy functions by generating tokens that combine
to form higher-level reasoning constructs. States represent the sequence of reasoning steps taken up
to the current point, while actions involve selecting the next reasoning step or the final answer. The
LLM policy generates these action choices, and the process reward model (PRM) [Lightman et al.,
2023, Uesato et al., 2022] offers feedback on the quality of both the reasoning steps and the final
answer. The PRM guides the LLM toward producing accurate and meaningful reasoning processes
by optimising the policy to maximise the reward.

3.1 System Design

The process reward model (PRM) plays a crucial role in enhancing the LLM’s policy in two key ways.
First, during training, the PRM improves the LLM policy through policy optimisation techniques
(Policy Iteration as shown in Figure 3). Second, during the decoding phase, the PRM guides the
LLM’s search process, steering the reasoning toward more effective outcomes (as shown in Figure 3).
As we will show next, the LLM policy also helps identify missing intermediate reasoning steps,
which in return enables further training and refinement of the PRM. As shown in Figure 3, this
iterative interaction allows the LLM and PRM to unlock each other’s potential for improved reasoning
continuously.

3.2 Data Augmentation

For a solution or chain-of-thought provided by large language models (LLMs), we use more precise
and fine-grained feedback instead of relying solely on the final answers. We collect data for process
supervision, which provides step-wise feedback for a given solution. Formally, a PRM computes
$p_t = \mathrm{PRM}([q, x_{1:t-1}], x_t)$, where $x_{1:t} = [x_1, \cdots, x_t]$ represents the first $t$ steps of the solution.
This method provides more precise and fine-grained feedback compared to outcome reward models
(ORMs), as it identifies the exact location of errors within the problem-solving process [Lightman
et al., 2023].

MATH-APS. We augment the data by automatically generating synthetic samples. In addition to
the PRM800K dataset [Lightman et al., 2023], which relies on costly human annotation and is difficult
to scale, we introduce a new dataset called MATH-APS, based on MATH [Hendrycks et al., 2021],
using automated methods such as OmegaPRM [Luo et al., 2024]. This approach reduces the reliance
on expensive human annotations, enabling more scalable data collection. Automatic methods such as
OmegaPRM, Math-Shepherd [Wang et al., 2024a] and MiPS [Wang et al., 2024b] efficiently collect
high-quality process supervision data. While Math-Shepherd and MiPS provide automatic annotation
for process supervision, they require lots of policy calls, making them computationally expensive.
OmegaPRM improves this process by iteratively dividing the solution, performing rollouts, and
identifying the first incorrect step in a model’s solution.
We follow OmegaPRM [Luo et al., 2024] and collect PRM training examples by constructing a
state-action tree using LLMs. For each question, a tree is built where each node contains the question
$q$, the solution prefix $s$, and all previous rollouts $\{(s, r_i)\}_{i=1}^{k}$ (with $r_i$ indicating the $i$-th rollout).
Each edge represents a single step or a sequence of steps from the node. For each node, we calculate
the Monte Carlo estimation $MC(s)$ and the value function $Q(s, r)$ to guide the selection of rollouts
during tree traversal. The value function is defined as: $Q(s, r) = \alpha \cdot \frac{1}{1 - MC(s)} \cdot \beta \cdot \frac{\mathrm{len}(r)}{L}$, where
$\alpha$, $\beta$, and $L$ are constants, and $\mathrm{len}(r)$ is the length of the rollout. We also compute the exploration
term: $U(s) = c_{puct} \cdot \frac{\sqrt{\sum_i N(s_i)}}{1 + N(s)}$, where $N(s)$ is the visit count and $c_{puct}$ is a constant encouraging
exploration. During the selection phase, a rollout is chosen using a variant of the PUCT algorithm:
$(s, r) = \arg\max_{(s, r)} [Q(s, r) + U(s)]$. This heuristic selects the most valuable rollouts. A binary
search is then used to identify the first error in the selected rollouts, and rollouts with $0 < MC(s)$
are added to the candidate pool. All positions before the first error become new states for further
exploration.
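The binary search for the first error can be sketched as follows. This is a minimal illustration, not the OpenR or OmegaPRM implementation: `first_error_index` and the `mc_estimate` callback are hypothetical names, and the sketch assumes that once a prefix contains an error its Monte Carlo estimate is exactly 0 for every longer prefix.

```python
from typing import Callable, Sequence


def first_error_index(
    steps: Sequence[str],
    mc_estimate: Callable[[Sequence[str]], float],
) -> int:
    """Return the 0-based index of the first erroneous step.

    `mc_estimate(prefix)` is a hypothetical callback that returns the
    fraction of rollouts from `prefix` reaching the correct answer.
    Assumption: a prefix that contains an error has an estimate of 0,
    and so does every longer prefix. Returns len(steps) if no prefix
    has an estimate of 0.
    """
    n = len(steps)
    lo, hi = 1, n + 1  # search over prefix lengths; n + 1 means "no error"
    while lo < hi:
        mid = (lo + hi) // 2
        if mc_estimate(steps[:mid]) == 0.0:
            hi = mid        # the error lies within the first `mid` steps
        else:
            lo = mid + 1    # the first `mid` steps can still succeed
    return lo - 1
```

Under the stated assumption, the search needs on the order of log2(n) Monte Carlo evaluations rather than one per step, which is where the saving in policy calls comes from.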

3.3 Supervised Training for PRMs

In PRMs, the goal is to determine whether the solution process so far is on the
right track, so the model should output an indicator of correctness. Specifically, we assign a score $y_t$
between 0 and 1 given a problem $q$ and a sequence of solution steps $x_1 \to x_t$. This score represents
the correctness of the current problem-solving process. As a result, the problem is reframed as
$y_t = \mathrm{PRM}(q, x_1, x_2, \cdots, x_t)$, which can be treated as a binary classification task. The PRM is
trained through supervised fine-tuning on an LLM, with the correct/incorrect distinction serving as the
classification label. We then use the LLM to predict the token that follows each step token.

Math-psa. We train a PRM named Math-psa using datasets such
as PRM800K [Lightman et al., 2023], Math-Shepherd [Wang et al., 2024a], and our MATH-APS
dataset (see Section 3.2). These datasets are structured into three components: question, process,
and label. The input consists of a concatenation of the question and the process. In the process, the
solution is divided into multiple steps, each separated by a special step token (“\n\n\n\n\n”), marking
the end of each step where the PRM can make predictions. The label is a classification of the entire
process, with each step labelled as either ‘+’ or ‘-’ based on the correctness of the solution.
During training, the data is fed to the LLM as a next-token prediction task. The model is trained
to predict a positive or negative token immediately following each step token. As described in the
data section, the input consists of the question concatenated with the process, and the step tokens
separate the steps in the process. The labels are assigned such that at the positions of the step token,
the label is either a positive or negative token, while all other positions are ignored during the loss
computation. The attention mask is set to 1 for all tokens except the step token positions, ensuring
that during training, the LLM focuses only on the input sequence (q, x1 → xt ) and does not attend to
the step tokens themselves.
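The label layout described above can be sketched on plain token-id lists. This is an illustration under stated assumptions, not the OpenR training code: the step token is assumed to map to a single token id, `IGNORE_INDEX = -100` follows the common convention for positions excluded from a cross-entropy loss, and `build_prm_labels` with its parameters is a hypothetical helper.

```python
from typing import List, Sequence

# Assumption: -100 marks positions that the loss should skip.
IGNORE_INDEX = -100


def build_prm_labels(
    token_ids: Sequence[int],
    step_token_id: int,
    step_labels: Sequence[str],
    pos_token_id: int,
    neg_token_id: int,
) -> List[int]:
    """Place a '+'/'-' target at every step-token position.

    All other positions receive IGNORE_INDEX so that they do not
    contribute to the loss.
    """
    n_steps = sum(1 for tok in token_ids if tok == step_token_id)
    if n_steps != len(step_labels):
        raise ValueError("one label is required per step token")

    labels = [IGNORE_INDEX] * len(token_ids)
    remaining = iter(step_labels)
    for i, tok in enumerate(token_ids):
        if tok == step_token_id:
            labels[i] = pos_token_id if next(remaining) == "+" else neg_token_id
    return labels
```

For instance, with a hypothetical step-token id of 99, the sequence `[5, 6, 99, 7, 99]` with step labels `['+', '-']` yields targets only at positions 2 and 4.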

3.4 Policy Learning for LLMs

We transform math problems into a language-augmented Markov Decision Process (MDP)
$\mathcal{M} = (\mathcal{V}, \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma)$ [Van Otterlo and Wiering, 2012, Carta et al., 2023]. Given the vocabulary $\mathcal{V}$ and
tokens $w \in \mathcal{V}$, $\mathcal{A} \subset \mathcal{V}^N$ and $\mathcal{S} \subset \mathcal{V}^N$ are the action and state spaces, respectively, i.e., actions and
states are sequences of tokens. $\mathcal{T}: \mathcal{S} \times \mathcal{A} \mapsto \mathcal{S}$ is the state transition function, $\mathcal{R}: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$ is
the reward function that responds to each action, and $\gamma$ is the discount factor, typically less
than 1. An initial state $s_0$ represents a given problem in mathematical problem-solving scenarios.
A language model receives this input and generates an intermediate reasoning step, denoted as an
action $a_0$. This action $a_0$ is then concatenated with the initial problem $s_0$ to form the subsequent state
$s_1$, which is used to infer the next action $a_1$. This iterative process continues, with each state-action
pair successively informing the following state, i.e., $\mathcal{T}: s_{t+1} = \{s_t, a_t\}$ at time step $t$, until the
model arrives at the final answer. After inferring each action $a_t$, the model receives a reward signal
$r_t^{PRM} = \mathcal{R}(s_t, a_t)$ from a well-trained PRM. Following this process with trajectories of a maximum
timestep $T$, the agent earns a discounted cumulative return of $R_\gamma = \sum_{t=0}^{T} \gamma^t r_t^{PRM}$, which
RL algorithms aim to maximise. We correspondingly implement this MDP as a reinforcement
learning environment in the style of OpenAI’s Gym. In these environments, math problems are presented as
tasks where the model takes sequential actions to solve the problem, receiving rewards for correct
actions and penalties for incorrect ones. This enables the model to iteratively learn and refine its
problem-solving strategies through trial and error, ultimately enhancing its mathematical reasoning
skills.
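A minimal sketch of such an environment is given below. It only loosely mirrors a Gym-style `reset`/`step` interface and is not the OpenR implementation: `ReasoningEnv`, the `prm` and `is_final` callbacks, and the newline-joined state format are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class ReasoningEnv:
    """Toy language MDP: the state is the question plus all steps so far."""

    question: str
    # Hypothetical PRM callback: (question, previous steps, new step) -> reward
    prm: Callable[[str, List[str], str], float]
    # Hypothetical predicate that recognises a final-answer step
    is_final: Callable[[str], bool]
    max_steps: int = 8
    steps: List[str] = field(default_factory=list)

    def state(self) -> str:
        return "\n".join([self.question, *self.steps])

    def reset(self) -> str:
        self.steps = []
        return self.state()  # s_0 is the bare problem statement

    def step(self, action: str) -> Tuple[str, float, bool]:
        reward = self.prm(self.question, list(self.steps), action)
        self.steps.append(action)  # s_{t+1} = {s_t, a_t}
        done = self.is_final(action) or len(self.steps) >= self.max_steps
        return self.state(), reward, done


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma^t * r_t over the trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

A policy would call `reset()` once per problem and then feed each generated reasoning step to `step()` until `done` is returned, collecting the rewards for the return computation.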

RL Training. Training LLMs with reinforcement learning often involves Proximal Policy Optimisation
(PPO) [Schulman et al., 2017] to align generated language outputs with desired actions.
This approach bridges the gap between language understanding and actionable outputs by reinforcing
the generation of responses that are both contextually accurate and aligned with predefined goals,
effectively linking language comprehension with strategic planning. We provide both traditional PPO
and an efficient variant of PPO, i.e., Group Relative Policy Optimisation (GRPO) [Shao et al., 2024].
The primary distinction between these two lies in their approaches to advantage value estimation.
Specifically, PPO utilises a network to approximate the state value function, leveraging the Generalized
Advantage Estimation (GAE) technique [Schulman et al., 2015] to derive the advantage. In
contrast, GRPO simplifies this process by directly employing a normalized reward signal to estimate
an action’s advantage, i.e., $A(s_t, a_t) = \frac{r_t^{PRM} - \mathrm{mean}(r^{PRM})}{\mathrm{std}(r^{PRM})}$. Compared with PPO, GRPO bypasses
the need for an extra critic network and reduces the resources consumed during training. However, it
depends more heavily on the stability of the PRM.
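The normalised-reward advantage can be illustrated with a few lines of Python. This is a sketch rather than the OpenR training code: it assumes the mean and standard deviation are taken over one group of PRM rewards, it uses the population standard deviation, and the small `eps` term is an added safeguard against division by zero that does not appear in the formula above.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def grpo_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalise a group of PRM rewards: (r - mean) / (std + eps).

    `eps` is an assumption added for numerical safety when all
    rewards in the group are identical.
    """
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the baseline is the group mean rather than a learned value estimate, no critic network is needed; when every reward in the group is equal, all advantages are close to zero and the group provides essentially no learning signal.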

3.5 Decoding: Inference-Time Guided Search and Planning

Following Snell et al. [2024], we use PRMs to assess the accuracy of each solution step. Once a
high-quality process reward model is trained, we integrate it into the decoding process alongside the
language model, enabling guided search and scoring or voting across multiple generations.
To use PRMs as verifiers, we define a method for evaluating the correctness of LLM-generated
solutions. Specifically, we map the scores of individual steps $\{r_t^{PRM}\}_{t=0}^{T}$ to a final score. Following
the strategies outlined by Lightman et al. [2023] and Snell et al. [2024], we employ two approaches:
• PRM-Min: choose the minimum value among all scores, i.e., $v = \min\{r_t^{PRM}\}_{t=0}^{T}$.
• PRM-Last: choose the last step’s score as the final score, i.e., $v = r_T^{PRM}$. This strategy has
been shown to be as good as PRM-Min in Snell et al. [2024].
Once multiple answers are generated by scaling test-time computations, we need strategies to select
the best answer based on their scores. We adopt three strategies from Feng et al. [2024]:
• Majority-Vote: Aggregate answers using majority vote: $f^* = \arg\max_f \sum_{y_j} \mathbb{1}_{\mathrm{final\_ans}(y_j) = f}$,
where $\mathbb{1}$ is the indicator function.
• RM-Max: Given an outcome reward model, the aggregation can choose the answer $f$ with
maximum final reward, $f^* = \mathrm{final\_ans}(\arg\max_{y_j} v(y_j \mid x))$.
• RM-Vote: Given an outcome reward model, the aggregation can choose the answer $f$ with
the largest sum of rewards, namely $f^* = \arg\max_f \sum_{y_j;\ \mathrm{final\_ans}(y_j) = f} v(y_j \mid x)$.

Combining these strategies, we can define multi-answer weighting methods, such as PRM-Last-Max,
which refers to using PRM-Last with RM-Max.
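The scoring and aggregation rules can be sketched as small Python functions. The function names are illustrative and not taken from the OpenR codebase; `answers[j]` is assumed to hold the extracted final answer of the j-th sampled solution and `values[j]` its scalar score $v(y_j \mid x)$.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence


def prm_min(step_scores: Sequence[float]) -> float:
    """PRM-Min: the weakest step decides the solution score."""
    return min(step_scores)


def prm_last(step_scores: Sequence[float]) -> float:
    """PRM-Last: the score of the final step decides the solution score."""
    return step_scores[-1]


def majority_vote(answers: Sequence[str]) -> str:
    """Pick the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]


def rm_max(answers: Sequence[str], values: Sequence[float]) -> str:
    """Pick the final answer of the single highest-scoring solution."""
    best = max(range(len(answers)), key=lambda j: values[j])
    return answers[best]


def rm_vote(answers: Sequence[str], values: Sequence[float]) -> str:
    """Pick the final answer whose solutions have the largest total score."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, value in zip(answers, values):
        totals[answer] += value
    return max(totals, key=lambda a: totals[a])
```

Under these definitions, PRM-Last-Max corresponds to `rm_max(answers, [prm_last(s) for s in all_step_scores])`, where `all_step_scores` is a hypothetical list holding the per-step PRM scores of each sampled solution.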
Our framework allows us to select among various search algorithms — such as beam search, best-of-N
selection, and others — each with unique advantages depending on the quality of PRMs. Complex
search algorithms may yield better performance on more difficult tasks, while simpler methods, such
as best-of-N, often perform adequately for less challenging cases [Snell et al., 2024].
We mainly employ two strategies:

• Best-of-N: Given a base model, the best-of-N sampling approach generates N outputs in
parallel and selects the answer with the highest score according to a learned process using
PRMs. This method is similar to previous work that leverages verifiers or reward models
[Cobbe et al., 2021b, Lightman et al., 2023]. While simple, it is an effective baseline that
leverages test-time computation to improve the performance of LLMs. PRMs can act as
dense verifiers [Lightman et al., 2023, Wang et al., 2024a], and it is intuitive that providing
a strong signal can lead to improved outcomes. Since dense rewards can be obtained for
a base model’s solution, we need to consider how best to use this feedback to optimise
test-time computation.
• Beam Search: The LLM generates N different outputs for the first step, which are then
N
scored using PRMs. These N outputs are scored using PRMs, and the top N/m ( m ∈ Z)
highest-scoring outputs are retained. We then keep only these N/m outputs for the current
step. For each of these outputs, we sample M potential next steps via the base model,
returning to N total outputs. The process is repeated: new candidates are scored, filtered,
and sampled for subsequent steps. The scores from the PRMs are central to guiding this
search. As with the best-of-N approach, we use both last vote and majority vote strategies
to aggregate the scores, with the latter relying on the sum of scores across the PRMs as in
[Wang et al., 2022].
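The two strategies above can be sketched as follows. This is a minimal sketch under stated assumptions, not the OpenR implementation: the base model and the PRM are replaced by hypothetical toy functions (`toy_propose`, `toy_prm`, `toy_finished`), a reasoning step is a single digit, and a partial solution is a tuple of digits.

```python
# Toy stand-ins (assumptions): the toy PRM prefers prefixes close to a hidden target.
TARGET = (3, 1, 2)

def toy_prm(prefix):
    """Hypothetical PRM: higher is better; 0 means the prefix matches TARGET so far."""
    return -sum(abs(a - b) for a, b in zip(prefix, TARGET))

def toy_propose(prefix, k):
    """Hypothetical base model: extend the prefix with each of the digits 0..k-1."""
    return [prefix + (d,) for d in range(k)]

def toy_finished(prefix):
    return len(prefix) == len(TARGET)

def best_of_n(sample_solution, score, n):
    """Draw n complete solutions and return the one with the highest score."""
    solutions = [sample_solution() for _ in range(n)]
    return max(solutions, key=score)

def beam_search(propose, score, finished, n, m, max_steps):
    """Step-level beam search: keep the top n // m prefixes, expand each into m steps."""
    if n % m != 0:
        raise ValueError("n must be a multiple of m")
    keep = n // m
    beams = propose((), n)  # N candidates for the first step
    for _ in range(max_steps - 1):
        beams = sorted(beams, key=score, reverse=True)[:keep]  # filter to N/m
        if all(finished(b) for b in beams):
            break
        expanded = []
        for b in beams:
            # Finished prefixes are carried over; the rest get m continuations each.
            expanded.extend([b] if finished(b) else propose(b, m))
        beams = expanded  # back to (at most) N candidates
    return sorted(beams, key=score, reverse=True)

# Best-of-N over four pre-drawn complete solutions.
pool = iter([(0, 1, 2), (3, 1, 2), (3, 0, 0), (1, 1, 1)])
best = best_of_n(lambda: next(pool), toy_prm, n=4)

# Beam search with N = 4 and m = 2, i.e. two prefixes survive each step.
finals = beam_search(toy_propose, toy_prm, toy_finished, n=4, m=2, max_steps=3)
print(best, finals[0])
```

The beam search returns all surviving candidates in score order, so that an aggregation rule such as last-vote or last-max can be applied to them afterwards.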

We will continue to develop more sophisticated inference-time guided search and decoding methods,
such as Monte Carlo Tree Search (MCTS), which is already included in the OpenR codebase, and
other methods such as sequential revision [Snell et al., 2024].

4 Experiments

To demonstrate the capabilities of our OpenR framework, we present quantitative results on large
language model inference and training. We evaluate our open framework using the MATH dataset
[Hendrycks et al., 2021], which includes a wide range of high-school competition-level math problems.
This makes it an ideal benchmark for testing reasoning skills. To ensure fair comparisons with previous
work and reduce overfitting, we follow Lightman et al. [2023] and use a subset of 500 problems for
evaluation, known as MATH500, in which the problems are sampled randomly.

4.1 Scaling LLM Test-Time Compute

Setting. Our PRM model, Math-psa, is fine-tuned from the Qwen2.5-Math-7B-Instruct [Yang et al.,
2024] model using multiple datasets, including PRM800K [Lightman et al., 2023], Math-Shepherd
[Wang et al., 2024a], and our MATH-APS dataset (originally over 500k state-value pairs collected
from Qwen2.5 Math models, reduced to approximately 150k pairs after cleaning and preprocessing).
We also experiment with the Math-Shepherd PRM for comparison. Following
Snell et al. [2024], we employ best-of-N and beam search algorithms for test-time computation. We
compare multiple test-time computation schemes across pre-defined budgets of token generation.
Among different aggregation strategies, we select PRM-Last as a representative. The LLM inference
server is implemented using FastChat [Zheng et al., 2023].

Figure 4: Test-time computation experiments on search methods (a) and PRMs (b). Both panels plot
MATH500 test accuracy (y-axis) against the generation budget, averaged per question, from $2^0$ to
$2^8$ (x-axis).
(a) Comparison of different PRM-guided search and voting methods: Majority Vote, Best-of-N
(prm-last-vote, prm-last-max), and Beam-Search (prm-last-vote, prm-last-max).
(b) Comparison of different reward models in best-of-N test-time computation: Best-of-N with
Math-psa (ours), Best-of-N with Math-Shepherd, and Majority Vote.

Figure 5: Performance on a single math problem and the MATH500 dataset over training time. Both
panels ("Single Problem" and "MATH500") plot rewards from the PRM (y-axis) against wall-clock
time in hours (x-axis) for PPO with our PRM.

Results. Figure 4a compares the performance of these search and voting methods during inference.
The y-axis represents testing accuracy on the MATH500 dataset, while the x-axis shows the generation
budget (average tokens per question), reflecting the computational effort or token usage per question.
The figure indicates that both Best-of-N and Beam-Search methods significantly outperform Majority
Vote, especially as the generation budget increases, consistent with the findings of
Snell et al. [2024]. Under low test-time computation budgets ($< 2^4$), Best-of-N methods perform
better than Beam Search, whereas Beam Search matches them at higher budgets and, with the PRM-Last
strategy, even surpasses Best-of-N at budgets larger than $2^5$.
On the other hand, Figure 4b investigates how different PRMs affect test-time computations. We
compare the performance of best-of-N methods with different PRM guidance. The figure shows that
our PRM (Math-psa) achieves the highest testing accuracy across all tested computation budgets.
This supports the claim that our PRM training pipeline learns effective process
supervision.

4.2 Online Policy Learning for LLM

Setting. In the policy learning experiment, we use the Qwen2.5-Math-1.5B-Instruct model as the
policy model for training, with the Math-Shepherd model [Wang et al., 2024a] serving as the PRM
to provide feedback during RL. In addition to the MATH500 dataset, we test the performance of
the model on a specific math problem: “How many positive whole-number divisors does 196 have?”
with the final answer being “9.”
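The stated answer for the single-problem task can be checked directly: since 196 = 2^2 · 7^2, the number of positive divisors is (2 + 1)(2 + 1) = 9. A small sketch that confirms this by trial division (the helper name `count_divisors` is only for illustration):

```python
def count_divisors(n):
    """Count positive divisors of n by pairing each divisor d <= sqrt(n) with n // d."""
    count = 0
    d = 1
    while d * d <= n:
        if n % d == 0:
            # A perfect-square root pairs with itself and is counted once.
            count += 1 if d * d == n else 2
        d += 1
    return count

print(count_divisors(196))  # 9
```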

Q&A Case 1

Question:

What is the remainder of 2004 divided by 12?

Reason:

Step 1: To find the remainder of 2004 divided by 12, we can use the division algorithm, which
states that for any integers a and b (with b > 0), there exist unique integers q (the quotient) and
r (the remainder) such that a = bq + r and 0 ≤ r < b. [0.958, 0.988]

Step 2: Here, a = 2004 and b = 12. We need to find q and r such that:
2004 = 12q + r
where 0 ≤ r < 12. [0.924, 0.987]

Step 3: First, we perform the division of 2004 by 12:

2004 ÷ 12 = 167
This means the quotient q is 167. Now, we can find the remainder r by multiplying the quotient
by the divisor and subtracting the result from the original number:
2004 − 12 × 167 = 2004 − 2004 = 0
[0.777, 0.719]

Step 4: Thus, the remainder r is:

r=0
[0.777, 0.378]

Step 5: Therefore, the remainder when 2004 is divided by 12 is: 0 . [0.622, 0.349]

Answer:

An integer is divisible by 12 if and only if it is divisible by both 3 and 4.

Because 2 + 0 + 0 + 4 = 6 is divisible by 3, 2004 is divisible by 3.
Also, the last two digits of 2004 form a multiple of 4, so 2004 is divisible by 4 as well.
Therefore, 2004 is divisible by 12 and hence leaves a remainder of 0 when divided by 12.

Figure 6: PRM score comparison for a case with a correct answer.

Results. Figure 5 illustrates the reward obtained by a reinforcement learning (RL) algorithm with
process reward models (PRM) on a single math problem. The rewards steadily increase over time,
showing consistent improvement, with performance stabilizing after around 6 hours of training. This
indicates that the model becomes more accurate in solving the specific problem as training progresses.
On the MATH500 dataset, results exhibit more fluctuations in rewards. This suggests that the PPO
algorithm with PRM faces a more complex challenge due to the diversity of problems in the dataset.
Although the rewards increase over time, their variability highlights the difficulty of generalizing
across broader problems. This indicates the need for further improvements to the algorithm to
enhance its adaptability to diverse problem sets in future work.

4.3 Case Study

Comparison of PRMs. Figures 6 and 7 compare the responses of the Math-psa PRM and the
Math-Shepherd PRM to a given reasoning step. The PRM scores for each step are represented as
“[Math-psa (ours) Score, Math-Shepherd PRM Score]”. Figure 6 shows Math-psa (ours) is able to
assign more reasonable scores to reasoning steps. Particularly at Step 4 and Step 5, Math-psa shows

Q&A Case 2

Question:

Let $f$ be the function defined by $f(x) = x^3 - 49x^2 + 623x - 2015$, and let $g(x) = f(x + 5)$.
Compute the sum of the roots of g.

Reason:

Step 1: To find the sum of the roots of the function g(x) = f (x + 5), we first need to
understand the relationship between the roots of g(x) and the roots of f (x). [0.905, 0.810]

Step 2: Given the function $f(x) = x^3 - 49x^2 + 623x - 2015$, we know that the sum of the
roots of $f(x)$ can be found using Vieta’s formulas. For a cubic polynomial $ax^3 + bx^2 + cx + d$,
the sum of the roots (denoted as $\alpha + \beta + \gamma$) is given by:
$\alpha + \beta + \gamma = -\frac{b}{a}$
In our case, $a = 1$ and $b = -49$. Therefore, the sum of the roots of $f(x)$ is:
$\alpha + \beta + \gamma = -\frac{-49}{1} = 49$
[0.706, 0.715]

Step 3: Next, we consider the function $g(x) = f(x + 5)$. This represents a horizontal shift
of the function $f(x)$ by 5 units to the left. Horizontal shifts do not change the sum of the
roots of the polynomial; they only shift the positions of the roots along the x-axis. [0.593, 0.788]

Step 4: Thus, the sum of the roots of g(x) is the same as the sum of the roots of f (x). Therefore,
the sum of the roots of g(x) is:
49
[0.182, 0.665]

Answer:

Let $a, b, c$ be the roots of $x^3 - 49x^2 + 623x - 2015$. Then by Vieta’s formulas, $a + b + c = 49$.
The roots of $g(x) = f(x + 5)$ are $a - 5$, $b - 5$, and $c - 5$, and their sum is
$a + b + c - 15 = 49 - 15 = 34$.

Figure 7: PRM score comparison for a case with an incorrect answer.

higher confidence in the output, whereas the Math-Shepherd PRM appears less certain about the
given correct response. Figure 7 shows a reasoning process that leads to an incorrect answer.
Math-psa assigns a lower score to Step 3, which is where the error occurs: a horizontal shift does
change the sum of the roots of a polynomial. Math-psa further penalises Step 4 (0.182), whereas the
Math-Shepherd PRM still holds high confidence (0.665).
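To see why Step 3 in Figure 7 is wrong, the sum of the roots of g can also be read off from its coefficients. The short derivation below is a worked check using only the polynomial given in the question, and it agrees with the reference answer of 34:

```latex
\begin{align*}
g(x) &= f(x+5) = (x+5)^3 - 49(x+5)^2 + 623(x+5) - 2015 \\
     &= x^3 + (15 - 49)\,x^2 + \cdots = x^3 - 34x^2 + \cdots \\
\text{sum of the roots of } g &= -\frac{-34}{1} = 34 = 49 - 3 \cdot 5 .
\end{align*}
```

Each of the three roots moves left by 5, so the sum drops by 15 rather than staying at 49.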

Q&A Case 3

Question:

A regular hexagon can be divided into six equilateral triangles. If the perimeter of one of the
triangles is 21 inches, what is the perimeter, in inches, of the regular hexagon?

Reason:

Step 1: A regular hexagon can be divided into six equilateral triangles.

Step 2: The perimeter of one of the triangles is 21 inches.
Step 3: Since the hexagon is regular, all sides are equal.
Step 4: The perimeter of the hexagon is the sum of all six sides.
Step 5: The perimeter of the hexagon is 21 * 6 = 126 inches. [Wrong Step]

Answer:

The answer is 126.

Figure 8: A case study from models before policy learning.

Q&A Case 4

Question:

A regular hexagon can be divided into six equilateral triangles. If the perimeter of one of the
triangles is 21 inches, what is the perimeter, in inches, of the regular hexagon?

Reason:

Step 1: A regular hexagon can be divided into six equilateral triangles.

Step 2: The perimeter of one of the triangles is 21 inches.
Step 3: Since the hexagon is regular, all sides of the hexagon are equal.
Step 4: The perimeter of the hexagon is the sum of the lengths of all its sides.
Step 5: Since the hexagon is divided into six equilateral triangles, each side of the hexagon is
equal to the side of one of the triangles.
Step 6: The perimeter of the hexagon is 6 times the length of one side of the triangle.
Step 7: The length of one side of the triangle is 21 / 3 = 7 inches.
Step 8: The perimeter of the hexagon is 6 * 7 = 42 inches.

Answer:

The answer is 42.

Figure 9: A case study from models after policy learning.

Comparison between LLMs Before and After Policy Learning. Figures 8 and 9 present cases
from models before and after policy learning, respectively. Figure 8 shows the results prior to policy
learning, collected from the vanilla policy of Qwen2.5-Math-1.5B-Instruct. In this instance, the
model makes an error by incorrectly using the perimeter of a triangle as the side length, resulting in
an incorrect answer. Figure 9 uses the same question as Figure 8 and shows the output after
policy learning with the Math-psa PRM. The trained policy avoids the error, suggesting that the
RL training process with a well-designed PRM can improve reasoning accuracy.

Q&A Case 5

Question:

Kathy needs money for her trip to Europe. If she has 300 US dollars in the bank but wants to
withdraw half of it in British pounds and half of it in euros, how many more euros than pounds
will she have? Assume 1 pound = 1.64 USD and 1 euro = 1.32 USD, and round to the nearest
whole number.

Reason:

CoT Reasoning
Step 1: First, calculate the amount of money Cathy will withdraw in pounds and euros:
Amount withdrawn in pounds = $\frac{300}{2}$ = 150 USD
Amount withdrawn in euros = $\frac{300}{2}$ = 150 USD
Step 2: Next, convert the amounts from pounds and euros back to US dollars using the given
exchange rates:
Value of 150 pounds in USD = 150 × 1.64 = 246 USD
Value of 150 euros in USD = 150 × 1.32 = 198 USD
Step 3: Now, find the difference between the amounts in euros and pounds:
Difference = 198 − 246 = −48 USD
So the answer is: 48

Answer:

After Kathy has exchanged half of her money (300 USD ÷ 2 = 150 USD) into pounds, she
will have:
150 USD × (1 pound / 1.64 USD) ≈ 91.46 pounds
After exchanging the other half of her money into euros, she will have:
150 USD × (1 euro / 1.32 USD) ≈ 113.64 euros
Subtracting these two values, we have:
113.64 − 91.46 = 22.18
Since the question calls for the nearest whole number, we round 22.18 to the final answer of 22.

Figure 10: A case study of the CoT method with an incorrect answer.

Comparison of Different Search Methods. Figures 10, 11, and 12 present examples that demonstrate
how different test-time compute methods impact the reasoning outputs. The three
question-and-answer sessions show that both Best-of-N and Beam Search reasoning accurately addressed
the questions, thoroughly understanding them and providing correct calculations. In contrast, CoT
reasoning misinterprets the question, leading to calculation errors starting from Step 2. This suggests
that Best-of-N and Beam Search reasoning benefit from a larger search space, enabling them to
explore more reasoning paths and converge on the correct solution.
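The arithmetic behind this case can be reproduced in a few lines. The sketch below uses only the exchange rates stated in the question; the constant and function names are chosen here for illustration. Dividing by the rate converts USD into the foreign currency, whereas the CoT trace in Figure 10 multiplies instead.

```python
USD_PER_GBP = 1.64  # 1 pound = 1.64 USD (from the question)
USD_PER_EUR = 1.32  # 1 euro  = 1.32 USD (from the question)

def euros_minus_pounds(total_usd):
    """Split the total in half, convert each half, and round the difference."""
    half = total_usd / 2
    pounds = half / USD_PER_GBP  # USD -> GBP requires division by the rate
    euros = half / USD_PER_EUR   # USD -> EUR requires division by the rate
    return pounds, euros, round(euros - pounds)

pounds, euros, diff = euros_minus_pounds(300)
print(f"{pounds:.2f} GBP, {euros:.2f} EUR, difference {diff}")
# prints "91.46 GBP, 113.64 EUR, difference 22"
```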

Q&A Case 6

Question:

Reason:

Best-of-N Reasoning
Step 1: Kathy has 300 US dollars in the bank. She wants to withdraw half of it in British
pounds and the other half in euros. First, we calculate the amount she will withdraw in each
currency:
Amount withdrawn in pounds = $\frac{300}{2}$ = 150 USD
Amount withdrawn in euros = $\frac{300}{2}$ = 150 USD
Step 2: Next, we convert these amounts into pounds and euros using the given exchange rates.
The exchange rate for pounds is 1 pound = 1.64 USD, so the amount in pounds is:
Amount in pounds = $\frac{150}{1.64}$ ≈ 91.50 pounds
Step 3: The exchange rate for euros is 1 euro = 1.32 USD, so the amount in euros is:
Amount in euros = $\frac{150}{1.32}$ ≈ 113.64 euros
Step 4: Now, we find the difference between the amounts in euros and pounds:
Difference = 113.64 − 91.50 = 22.14
Step 5: Rounding 22.14 to the nearest whole number, we get: 22

Answer:

Figure 11: A case study of the Best-of-N method with a correct answer.

Q&A Case 7

Question:

Reason:

Beam Search Reasoning

Step 1: Kathy has 300 US dollars (USD) in the bank. She wants to withdraw half of it in British
pounds (GBP) and half of it in euros (EUR). Let’s calculate the amount of money she will
withdraw in each currency. First, we calculate the amount to be withdrawn in pounds:
Amount in GBP = $\frac{300}{2}$ = 150 USD
Step 2: Next, we convert this amount into pounds using the exchange rate (1 GBP = 1.64 USD):
Amount in GBP in pounds = $\frac{150}{1.64}$ ≈ 91.50 GBP
Rounding to the nearest whole number, she will withdraw 92 pounds.
Step 3: Now, we calculate the amount to be withdrawn in euros:
Amount in EUR = $\frac{300}{2}$ = 150 USD
Step 4: Next, we convert this amount into euros using the exchange rate (1 EUR = 1.32 USD):
Amount in EUR in euros = $\frac{150}{1.32}$ ≈ 113.64 EUR
Rounding to the nearest whole number, she will withdraw 114 euros.
Step 5: Finally, we find the difference between the amount of euros and pounds she will have:
Difference = 114 − 92 = 22
So, she will have 22 more euros than pounds.

Answer:

Figure 12: A case study of the Beam Search method with a correct answer.

5 Conclusion

In this work, we have introduced OpenR, an open-source framework designed to advance reasoning
capabilities in large language models (LLMs) through the integration of test-time computation,
reinforcement learning, and process supervision. Our framework provides an open and accessible
platform for experimenting with reasoning tasks in LLMs, showcasing how test-time compute, search
algorithms, and process reward models (PRMs) can be combined to improve reasoning performance.
We believe OpenR will serve as a valuable resource for the research community, offering a
comprehensive platform for further exploration of reasoning in LLMs. By making our models, data, and
code publicly available, we aim to accelerate advancements in AI reasoning, fostering collaboration
and innovation in the field. In future work, we plan to extend the framework to support a wider range
of reasoning tasks and optimise inference-time computation for even more efficient and scalable
reasoning models.

6 Limitations

Limited Scale of Experiments: Due to restricted access to large-scale computing infrastructure, our
evaluations were conducted on relatively smaller models and datasets.
Model Size: We primarily utilised mid-sized models as our base LLMs. While these models
demonstrated substantial gains, larger models could further enhance reasoning capabilities.
Limited Process Supervision Data: While we leveraged datasets like PRM800K, Math-Shepherd, and
our generated MATH-APS dataset for training, the scale and diversity of process supervision data
remain limited.
Future work could focus on scaling up experiments, expanding the training datasets, and testing across
a broader range of models and domains to unlock further improvements in reasoning performance.

References
OpenAI. Learning to reason with llms. https://openai.com/index/
learning-to-reason-with-llms/, 2024.

Daniel Kahneman. Thinking, Fast and Slow. Farrar, Straus and Giroux, New York, 2011.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le,
and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv
preprint arXiv:2201.11903, 2022.

Maxwell Nye, Anders J Andreassen, Guy Gur-Ari, Henryk Michalewski, David Dohan, Jackie Jiang,
John Schulman, William Fedus, and Charles Sutton. Show your work: Scratchpads for intermediate
computation with language models. arXiv preprint arXiv:2112.00114, 2021.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong
Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow
instructions with human feedback. Advances in Neural Information Processing Systems, 35:
27730–27744, 2022.

Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. Making
large language models better reasoners with step-aware verifier. arXiv preprint arXiv:2206.02336,
2022.

Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun
Zhu, Lei Meng, Jiao Sun, et al. Improve mathematical reasoning in language models by automated
process supervision. arXiv preprint arXiv:2406.06592, 2024.

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally
can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.

Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. An empirical analysis
of compute-optimal inference for problem-solving with language models. arXiv preprint
arXiv:2408.00724, 2024.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott
Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.
arXiv preprint arXiv:2001.08361, 2020.
Xidong Feng, Ziyu Wan, Muning Wen, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like
tree-search can guide large language model decoding and training. In ICML 2024, 2024.
Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu.
Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992,
2023.
Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh
Nagarajan. Think before you speak: Training language models with pause tokens. arXiv preprint
arXiv:2310.02226, 2023.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve
math word problems. arXiv preprint arXiv:2110.14168, 2021a.
Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia
Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process-and
outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan
Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint
arXiv:2305.20050, 2023.
Fei Yu, Anningzhe Gao, and Benyou Wang. Ovm, outcome-supervised value models for planning in
mathematical reasoning. In Findings of the Association for Computational Linguistics: NAACL
2024, pages 858–875, 2024.
Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal.
Generative verifiers: Reward modeling as next-token prediction. arXiv preprint arXiv:2408.15240,
2024.
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with
reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D Goodman.
Quiet-star: Language models can teach themselves to think before speaking. arXiv preprint
arXiv:2403.09629, 2024.
Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang
Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pages 9426–9439, 2024a.
Zihan Wang, Yunxuan Li, Yuexin Wu, Liangchen Luo, Le Hou, Hongkun Yu, and Jingbo Shang.
Multi-step problem solving through a verifier: An empirical analysis on model-induced process
supervision. arXiv preprint arXiv:2402.02658, 2024b.
Richard Bellman. Dynamic programming and stochastic control processes. Information and control,
1(3):228–239, 1958.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song,
and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS,
2021.
Martijn Van Otterlo and Marco Wiering. Reinforcement learning and markov decision processes. In
Reinforcement learning: State-of-the-art, pages 3–42. Springer, 2012.

Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, and Pierre-Yves
Oudeyer. Grounding large language models in interactive environments with online reinforcement
learning. In International Conference on Machine Learning, pages 3676–3713. PMLR, 2023.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Yu Wu,
and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language
models. arXiv preprint arXiv:2402.03300, 2024.
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional
continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438,
2015.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve
math word problems. arXiv preprint arXiv:2110.14168, 2021b.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency
improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong
Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-math technical report: Toward mathematical
expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang,
Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica.
Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.

No ratings yet
Eportfolio Template For Facilitate
14 pages
Basic: IELTS Speaking Trial Lesson (ET) Mood and Feelings
No ratings yet
Basic: IELTS Speaking Trial Lesson (ET) Mood and Feelings
6 pages
Listening Comprehension Test
No ratings yet
Listening Comprehension Test
7 pages
Cross Validation Thesis
100% (4)
Cross Validation Thesis
5 pages
Cycle 5 Week 2 Session 2 Jane Eyre
No ratings yet
Cycle 5 Week 2 Session 2 Jane Eyre
2 pages
Blended Learning Essentials
No ratings yet
Blended Learning Essentials
21 pages
Group 3: Attitudinal Barrier in Communication
No ratings yet
Group 3: Attitudinal Barrier in Communication
4 pages
Living Graph
No ratings yet
Living Graph
6 pages
Digitized Remedial Instructional Materials in Afro-Asian Literature For Grade 8 Learners
No ratings yet
Digitized Remedial Instructional Materials in Afro-Asian Literature For Grade 8 Learners
7 pages
Thorndikes 1898 Dissertation
100% (2)
Thorndikes 1898 Dissertation
5 pages
Weekly Report Plan - Second Week
No ratings yet
Weekly Report Plan - Second Week
11 pages
