s1: Simple Test-Time Scaling
Niklas Muennighoff * 1 3 4 Zitong Yang * 1 Weijia Shi * 2 3 Xiang Lisa Li * 1 Li Fei-Fei 1 Hannaneh Hajishirzi 2 3
Luke Zettlemoyer 2 Percy Liang 1 Emmanuel Candès 1 Tatsunori Hashimoto 1
*Equal contribution. ZY and NM started the project. WS, NM, and ZY collected the prompts; XL, ZY, and NM built the data pipeline; LZ and WS proposed using a 1K subset; and ZY and NM built budget forcing. 1Stanford University. 2University of Washington, Seattle. 3Allen Institute for AI. 4Contextual AI.

Abstract

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces, relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.

Figure 1. Test-time scaling with s1-32B. We benchmark s1-32B on reasoning-intensive tasks and vary test-time compute. (Axes: accuracy (%) against average thinking time in tokens.)

1. Introduction

Performance improvements of language models (LMs) over the past years have largely relied on scaling up train-time compute using large-scale self-supervised pretraining (Kaplan et al., 2020; Hoffmann et al., 2022). The creation of these powerful models has set the stage for a new scaling paradigm built on top of them: test-time scaling. The aim of this approach is to increase the compute at test time to get better results. There has been much work exploring this idea (Snell et al., 2024; Welleck et al., 2024), and the viability of this paradigm was recently validated by OpenAI o1 (OpenAI, 2024). o1 has demonstrated strong reasoning performance with consistent gains from scaling test-time compute. OpenAI describes their approach as using large-scale reinforcement learning (RL), implying the use of sizable amounts of data (OpenAI, 2024). This has led to various attempts to replicate their models relying on techniques like Monte Carlo Tree Search (Gao et al., 2024b; Zhang et al., 2024a), multi-agent approaches (Qin et al., 2024), and others (Wang et al., 2024a; Huang et al., 2024b; 2025). Among these approaches, DeepSeek R1 (DeepSeek-AI et al., 2025) has successfully replicated o1-level performance, also employing reinforcement learning via millions of samples and multiple training stages. However, despite the large number of o1 replication attempts, none have openly replicated a clear test-time scaling behavior. Thus, we ask: what is the simplest approach to achieve both test-time scaling and strong reasoning performance?

We show that training on only 1,000 samples with next-token prediction and controlling thinking duration via a simple test-time technique we refer to as budget forcing leads to a strong reasoning model that scales in performance with more test-time compute. Specifically, we construct s1K, which consists of 1,000 carefully curated questions paired with reasoning traces and answers distilled from Gemini Thinking Experimental (Google, 2024). We perform supervised fine-tuning (SFT) of an off-the-shelf pretrained model
on our small dataset, requiring just 26 minutes of training on 16 H100 GPUs. After training, we control the amount of test-time compute our model spends using budget forcing: (I) If the model generates more thinking tokens than a desired limit, we forcefully end the thinking process by appending an end-of-thinking token delimiter. Ending the thinking this way makes the model transition to generating its answer. (II) If we want the model to spend more test-time compute on a problem, we suppress the generation of the end-of-thinking token delimiter and instead append "Wait" to the model's current reasoning trace to encourage more exploration. Equipped with this simple recipe – SFT on 1,000 samples and test-time budget forcing – our model s1-32B exhibits test-time scaling (Figure 1). Further, s1-32B is the most sample-efficient reasoning model and outperforms closed-source models like OpenAI's o1-preview (Figure 2).

We conduct extensive ablation experiments targeting (a) our selection of 1,000 (1K) reasoning samples and (b) our test-time scaling. For (a), we find that jointly incorporating difficulty, diversity, and quality measures into our selection algorithm is important. Random selection, selecting samples with the longest reasoning traces, or selecting only maximally diverse samples all lead to significantly worse performance (around −30% on AIME24 on average). Training on our full data pool of 59K examples, a superset of s1K, does not offer substantial gains over our 1K selection. This highlights the importance of careful data selection and echoes prior findings for instruction tuning (Zhou et al., 2023). For (b), we define desiderata for test-time scaling methods to compare different approaches. Budget forcing leads to the best scaling, as it has perfect controllability with a clear positive slope, leading to strong performance.

In summary, our contributions are: we develop simple methods for creating a sample-efficient reasoning dataset (§2) and test-time scaling (§3); based on these, we build s1-32B, which is competitive with o1-preview (§4); and we ablate subtleties of data (§5.1) and test-time scaling (§5.2). We end with a discussion to motivate future work on simple reasoning (§6). Our code, model, and data are open-source at https://github.com/simplescaling/s1.

2. Reasoning data curation to create s1K

In this section, we describe our process for creating a large dataset first in §2.1 and then filtering it down to s1K in §2.2.

2.1. Initial collection of 59K samples

We collect an initial 59,029 questions from 16 sources following three guiding principles. Quality: datasets should be high-quality; we always inspect samples and ignore datasets with, e.g., poor formatting. Difficulty: datasets should be challenging and require significant reasoning effort. Diversity: datasets should stem from various fields to cover different reasoning tasks. We collect datasets of two categories:

Curation of existing datasets Our largest source is NuminaMATH (LI et al., 2024) with 30,660 mathematical problems from online websites. We also include historical AIME problems (1983–2021). To enhance diversity, we add OlympicArena (Huang et al., 2024a) with 4,250 questions spanning Astronomy, Biology, Chemistry, Computer Science, Geography, Mathematics, and Physics from various Olympiads. OmniMath (Gao et al., 2024a) adds 4,238 competition-level mathematics problems. We also include 2,385 problems from AGIEval (Zhong et al., 2023), which features questions from standardized tests like the SAT and LSAT, covering English, Law, and Logic. We refer to Table 7 in §C for our other sources.

New datasets in quantitative reasoning To complement these existing datasets, we create two original datasets. s1-prob consists of 182 questions from the probability section of Stanford University's Statistics Department's PhD Qualifying Exams (https://statistics.stanford.edu), accompanied by handwritten solutions that cover difficult proofs. The probability qualifying exam is held yearly and requires professional-level mathematical problem-solving. s1-teasers comprises 23 challenging brain-teasers commonly used in interview questions for quantitative trading positions. Each sample consists of a problem and solution taken from PuzzledQuant (https://www.puzzledquant.com/). We only take examples with the highest difficulty level ("Hard").

For each question, we generate a reasoning trace and solution using the Google Gemini Flash Thinking API (Google, 2024), extracting its reasoning trace and response. This yields 59K triplets of a question, a generated reasoning trace, and a generated solution. Examples from our dataset are in §D.2. We decontaminate all samples against our evaluation questions (MATH500, GPQA Diamond, AIME24; §C.5) using 8-grams and deduplicate the data.

2.2. Final selection of 1K samples

We could directly train on our pool of 59K questions; however, our goal is to find the simplest approach with minimal resources. Thus, we go through three stages of filtering to arrive at a minimal set of 1,000 samples, relying on our three guiding data principles: Quality, Difficulty, and Diversity.

Quality We first remove any questions where we ran into any API errors, reducing our dataset to 54,116 samples. Next, we filter out low-quality examples by checking if they contain any string patterns with formatting issues, such as ASCII art diagrams, non-existent image references, or inconsistent question numbering, reducing our dataset to 51,581 examples.
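The filtering and decontamination steps above can be sketched as follows. This is our own illustration, not the authors' released code: the pattern list is hypothetical (the paper does not enumerate its exact formatting-issue patterns), and questions are assumed to be plain strings.

```python
import re

def ngrams(text, n=8):
    """Set of word-level n-grams in a text (used for decontamination)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

# Hypothetical patterns for the formatting issues described above
# (non-existent image references, broken numbering, etc.).
BAD_PATTERNS = [r"\[image\]", r"figure\s+not\s+shown"]

def passes_quality(question):
    """Reject questions matching any known formatting-issue pattern."""
    return not any(re.search(p, question, re.IGNORECASE) for p in BAD_PATTERNS)

def filter_pool(pool, eval_questions, n=8):
    """Keep quality questions sharing no n-gram with any evaluation question."""
    eval_grams = set().union(*(ngrams(q, n) for q in eval_questions))
    return [q for q in pool
            if passes_quality(q) and ngrams(q, n).isdisjoint(eval_grams)]
```

Questions shorter than n words produce no n-grams and are never flagged as contaminated, which matches the conservative intent of n-gram decontamination.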
[Figure 2 graphic: (left) a treemap of the 1,000 s1K questions by domain, including geometry, number theory, combinatorics, calculus, real functions, measure theory, partial differential equations, statistics, statistical mechanics, quantum theory, electrodynamics, thermodynamics, biology, astronomy, and economics; (right) accuracy (roughly 80–95%) against number of training examples for s1, r1-distill, Bespoke-Stratos, QwQ, o1-preview, and Sky-T1.]
Figure 2. s1K and s1-32B. (left) s1K is a dataset of 1,000 high-quality, diverse, and difficult questions with reasoning traces. (right) s1-32B, a 32B-parameter model finetuned on s1K, is on the sample-efficiency frontier. See Table 1 for details on other models.
From this pool, we identify 384 samples for our final 1,000 samples from datasets that we perceive as high-quality and not in need of further filtering (see §C.4 for details).

Difficulty For difficulty, we use two indicators: model performance and reasoning trace length. We evaluate two models on each question, Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct (Qwen et al., 2024), with correctness assessed by Claude 3.5 Sonnet comparing each attempt against the reference solution (see §C.3 for the grading protocol). We measure the token length of each reasoning trace, using the Qwen2.5 tokenizer, to indicate problem difficulty. This relies on the assumption that more difficult problems require more thinking tokens. Based on the grading, we remove questions that either Qwen2.5-7B-Instruct or Qwen2.5-32B-Instruct can solve correctly and thus may be too easy. By using two models, we reduce the likelihood of an easy sample slipping through our filtering due to a rare mistake of one of the models on an easy question. This brings our total samples down to 24,496, setting the stage for the next round of subsampling based on diversity. While filtering with these two models may be optimized for our setup, as we will also use Qwen2.5-32B-Instruct as our model to finetune, the idea of model-based filtering generalizes to other setups.

Diversity To quantify diversity, we classify questions into domains using Claude 3.5 Sonnet based on the Mathematics Subject Classification (MSC) system (e.g., geometry, combinatorics, etc.) from the American Mathematical Society.1 The taxonomy focuses on topics in mathematics but also includes other sciences such as biology, physics, and economics. To select our final examples from the pool of 24,496 questions, we first choose one domain uniformly at random. Then, we sample one problem from this domain according to a distribution that favors longer reasoning traces (see §C.4 for details), as motivated in Difficulty. We repeat this process until we have 1,000 total samples spanning 50 domains.

In §5.1, we will show that using our three criteria in combination is important, as relying only on quality, diversity, or difficulty in isolation leads to worse datasets. Some distilled generations are incorrect, which we allow in our data as we focus on capturing the reasoning process rather than entirely correct solutions. Our grader (§C.3) deems 53.6% correct in s1K and 63.0% in our follow-up s1K-1.1 (see §A).

3. Test-time scaling

3.1. Method

We classify test-time scaling methods into 1) Sequential, where later computations depend on earlier ones (e.g., a long reasoning trace), and 2) Parallel, where computations run independently (e.g., majority voting) (Snell et al., 2024; Brown et al., 2024). We focus on sequential scaling, as we intuitively believe it should scale better: later computations can build on intermediate results, allowing for deeper reasoning and iterative refinement. We propose new sequential scaling methods and ways to benchmark them.

Budget forcing We propose a simple decoding-time intervention forcing a maximum and/or minimum number of thinking tokens. Specifically, we enforce a maximum token count by simply appending the end-of-thinking token delimiter and optionally "Final Answer:" to early exit [...]

1 https://mathscinet.ams.org/mathscinet/msc/msc2020.html
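As a concrete illustration, the decoding-time intervention described above can be sketched as a wrapper around a token-level decode loop. This is a minimal sketch, not the authors' released implementation: the `next_token` interface and the delimiter and "Wait" strings are placeholders for whatever the underlying model and chat template actually use.

```python
def budget_force(next_token, min_tokens=0, max_tokens=32_000,
                 end_think="<end_of_thinking>", wait="Wait", max_waits=2):
    """Force a minimum/maximum number of thinking tokens at decode time.

    next_token(trace) -> str is a stand-in for one model decoding step.
    """
    trace, waits = [], 0
    while True:
        token = next_token(trace)
        if token == end_think:
            if len(trace) < min_tokens and waits < max_waits:
                # Suppress end-of-thinking; append "Wait" to extend reasoning.
                trace.append(wait)
                waits += 1
                continue
            break  # allow the model to move on to its answer
        trace.append(token)
        if len(trace) >= max_tokens:
            break  # enforce the maximum by ending the thinking phase
    return trace
```

After the loop, one would append the end-of-thinking delimiter (and optionally "Final Answer:") and continue decoding to obtain the answer.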
Baselines We benchmark budget forcing with: (I) Conditional length-control methods, which rely on telling the model in the prompt how long it should generate for. We group them by granularity into (a) Token-conditional control: we specify an upper bound of thinking tokens in the prompt; (b) Step-conditional control: we specify an upper bound of thinking steps, where each step is around 100 tokens; (c) Class-conditional control: we write two generic prompts that tell the model to either think for a short or long amount of time (see §E.1 for details). (II) Rejection sampling, which samples until a generation fits a predetermined compute budget. This oracle captures the posterior over responses conditioned on their length.

3.2. Metrics

We establish a set of desiderata as evaluation metrics to measure test-time scaling across methods. Importantly, we do not only care about the accuracy a method can achieve but also its controllability and test-time scaling slope. For each method we consider, we run a set of evaluations a ∈ 𝒜 varying test-time compute on a fixed benchmark, e.g. AIME24. This produces a piece-wise linear function f with compute as the x-axis, measured in thinking tokens, and accuracy as the y-axis (see Figure 1, where the rightmost dot for AIME24 corresponds to f(7320) = 57%). We measure [...]

4. Results

4.1. Setup

Training We perform supervised finetuning on Qwen2.5-32B-Instruct using s1K to obtain our model s1-32B, using basic hyperparameters outlined in §D. Finetuning took 26 minutes on 16 NVIDIA H100 GPUs with PyTorch FSDP.

Evaluation We select three representative reasoning benchmarks widely used in the field: AIME24 (of America, 2024) has 30 problems that were used in the 2024 American Invitational Mathematics Examination (AIME) held from January 31 – February 1, 2024. AIME tests mathematical problem-solving with arithmetic, algebra, counting, geometry, number theory, probability, and other secondary school math topics. High-scoring high school students in the test are invited to participate in the United States of America Mathematics Olympiad (USAMO). All AIME answers are integers ranging from 000 to 999, inclusive. Some AIME problems rely on figures that we provide to our model using the vector graphics language Asymptote, as it cannot take image inputs. MATH500 (Hendrycks et al., 2021) is a benchmark of competition math problems of varying difficulty. We evaluate on the same 500 samples selected by OpenAI in prior work (Lightman et al., 2023). GPQA Diamond (Rein et al., 2023) consists of 198 PhD-level science questions [...]
(a) Sequential scaling via budget forcing (b) Parallel scaling via majority voting
Figure 4. Sequential and parallel test-time scaling. (a): Budget forcing shows clear scaling trends and extrapolates to some extent. For
the three rightmost dots, we prevent the model from stopping its thinking 2/4/6 times, each time appending “Wait” to its current reasoning
trace. (b): For Qwen2.5-32B-Instruct we perform 64 evaluations for each sample with a temperature of 1 and visualize the performance
when majority voting across 2, 4, 8, 16, 32, and 64 of these.
4.2. Performance

Test-time scaling Figure 1 shows that the performance of s1-32B with budget forcing scales with more test-time compute. In Figure 4 (left), we expand the plot from Figure 1 (middle), showing that while we can improve AIME24 performance using our budget forcing technique (§3) and more test-time compute, performance does eventually flatten out, at six suppressions of the end-of-thinking delimiter. Suppressing the end-of-thinking token delimiter too often can lead the model into repetitive loops instead of continued reasoning. In Figure 4 (right), we show that after training Qwen2.5-32B-Instruct on our 1,000 samples to produce s1-32B and equipping it with the simple budget forcing technique, it operates in a different scaling paradigm. Scaling test-time compute on the base model via majority voting cannot catch up with the performance of s1-32B, which validates our intuition from §3 that sequential scaling is more effective than parallel. We provide example generations of s1-32B in Figure 5.

Sample-efficiency In Figure 2 (right) and Table 1 we compare s1-32B with other models. We find that s1-32B is the most sample-efficient open data reasoning model. It performs significantly better than our base model (Qwen2.5-32B-Instruct) despite training on just 1,000 additional samples. The concurrently released r1-32B shows stronger performance than s1-32B while also only using SFT (DeepSeek-AI et al., 2025). However, it is trained on 800× more reasoning samples. It is an open question whether one can achieve their performance with just 1,000 samples. Finally, our model nearly matches Gemini 2.0 Thinking on AIME24. As the data for s1-32B is distilled from Gemini 2.0, this shows our distillation procedure was likely effective.

Table 2. s1K data ablations. We budget force (BF) a maximum of around 30,000 thinking tokens for all scores in this table. This performs slightly better than the scores without BF (Table 1) as it allows the model to finish with a best guess when stuck in an infinite loop. We report 95% paired bootstrap confidence intervals for differences relative to the s1K model using 10,000 bootstrap samples. E.g., the interval [-13%, 20%] means that, with 95% confidence, the true difference between 59K-full and s1K is between -13% and +20%. If the entire interval is negative, e.g. [-27%, -3%], we can confidently say that the performance is worse than s1K.

Model        AIME 2024           MATH 500         GPQA Diamond
1K-random    36.7                90.6             52.0
             [-26.7%, -3.3%]     [-4.8%, 0.0%]    [-12.6%, 2.5%]
1K-diverse   26.7                91.2             54.6
             [-40.0%, -10.0%]    [-4.0%, 0.2%]    [-10.1%, 5.1%]
1K-longest   33.3                90.4             59.6
             [-36.7%, 0.0%]      [-5.0%, -0.2%]   [-5.1%, 10.1%]
59K-full     53.3                92.8             58.1
             [-13.3%, 20.0%]     [-2.6%, 2.2%]    [-6.6%, 8.6%]
s1K          50.0                93.0             57.6

Finally, we compare with just training on all of our 59K samples, a superset of all the 1K-sample versions. This leads to a strong model but uses far more resources. To finetune on 59K samples, we use 394 H100 GPU hours, while s1-32B only required 7 H100 GPU hours. Moreover, relying only on s1K is extremely competitive as shown in §2. Overall, combining all three criteria – Quality, Difficulty, Diversity – via our methodology in §2 is key for sample-efficient reasoning training.
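The paired bootstrap intervals reported in Table 2 can be computed along the following lines. This is a sketch under the assumption that per-question 0/1 correctness for the two models is aligned by question; it is not the authors' evaluation code.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap CI for the mean accuracy difference (model A - model B).

    scores_a, scores_b: per-question correctness (1/0) on the same questions.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        # Resample questions with replacement; keep the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(n_boot * alpha / 2)]
    hi = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

If the resulting interval lies entirely below zero, model A is confidently worse than model B, mirroring the reading of Table 2.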
[Figure 5 graphic, three example generations (prompt, reasoning trace, answer):
Left, AIME24: Alice and Bob alternately remove 1 or 4 tokens from a stack of n tokens, Alice going first; whoever removes the last token wins. The model analyzes winning and losing positions, finds that the losing positions are n = 5m or n = 5m + 2, and counts 404 + 405 = 809 positive integers n ≤ 2024 for which Bob can guarantee a win.
Middle, MATH500: find the projection of a onto b = (2, 6, 3) given a · b = 8. The model computes ‖b‖² = 49 and proj_b a = (a · b / ‖b‖²) b = (16/49, 48/49, 24/49).
Right, GPQA: a quantum mechanical particle of mass m moves in two dimensions in V(r, θ) = ½kr² + (3/2)kr² cos²(θ). The model rewrites the potential as ½kr²(1 + 3cos²(θ)), separates variables in Cartesian coordinates, and obtains the energy spectrum E = ℏ√(k/m) (2nₓ + n_y + 3/2).]
Figure 5. Example model outputs. We pick a question from AIME24 (left), MATH500 (middle), and GPQA (right), where our model
generates the correct answer. The black text is the prompt, the light blue text is the reasoning trace, and the blue text is the answer of
s1-32B. The gray ellipsis [...] indicates that the text was trimmed to fit this page, but the generated text is actually longer.
Budget forcing In Table 3 we compare the test-time scaling methods we have introduced in §3. Overall, we find that budget forcing provides perfect control, good scaling, and leads to our best AIME24 score. Thus, this is the method we use for s1-32B in Figure 1 and in §4. In Table 4, we compare different strings for extrapolating performance. We find that "Wait" generally gives the best performance.

Class-conditional control We provide benchmark scores for this method in §E.1 and summarize three findings here: (1) Token-conditional control fails without budget forcing, as our model cannot reliably count tokens, even when trained to do so. (2) Under step-conditional control, the model generates a similar total number of tokens when given different step targets, as the model goes from few steps with many tokens per step to many steps with few tokens in each step. Thus, the model learns to hack its way around the compute constraint, making the controllability of this method mediocre. (3) Class-conditional control can work: telling a model to simply think longer can increase its test-time compute and performance, which leads to good scaling in Table 3.

Table 4. Budget forcing extrapolation ablations. We compare ignoring the end-of-thinking delimiter twice and appending none or various strings.

Model                AIME 2024   MATH 500   GPQA Diamond
No extrapolation     50.0        93.0       57.6
2x without string    50.0        90.2       55.1
2x "Alternatively"   50.0        92.2       59.6
2x "Hmm"             50.0        93.0       59.6
2x "Wait"            53.3        93.0       59.6

Rejection sampling Surprisingly, we find that simply sampling until the generation fits a specific length leads to an inverse scaling trend, as depicted in Figure 6. In §E.2 we inspect a question which was answered correctly by the model when rejection sampling for the ≤ 4000 token setting, but not for the ≤ 8000 token setting. In the ≤ 4000 setting the model directly jumps to the correct approach, while in the ≤ 8000 setting it backtracks a lot. We hypothesize that there is a correlation such that shorter generations tend to be the ones where the model was on the right track from the start, whereas longer ones tend to be ones where the model made mistakes and thus backtracks or questions itself. This leads to longer samples often being wrong when rejection sampling, and thus the inverse scaling trend.

[Figure 6 graphic: accuracy (%) against average thinking time for rejection sampling, showing the inverse scaling trend.]

6. Discussion and related work

6.1. Sample-efficient reasoning

Models There are a number of concurrent efforts to build models that replicate the performance of o1 (OpenAI, 2024). For example, DeepSeek-r1 and k1.5 (DeepSeek-AI et al., 2025; Team et al., 2025) are built with reinforcement learning methods, while others rely on SFT using tens of thousands of distilled examples (Team, 2025; Xu et al., 2025; Labs, 2025). We show that SFT on only 1,000 examples suffices to build a competitive reasoning model matching o1-preview and produces a model that lies on the Pareto frontier (Figure 2). Further, we introduce budget forcing, which, combined with our reasoning model, leads to the first reproduction of OpenAI's test-time scaling curves (OpenAI, 2024). Why does supervised finetuning on just 1,000 samples lead to such performance gains? We hypothesize that the model is already exposed to large amounts of reasoning data during pretraining, which spans trillions of tokens. Thus, the ability to perform reasoning is already present in our model. Our sample-efficient finetuning stage just activates it, and we scale it further at test time with budget forcing. This is similar to the "Superficial Alignment Hypothesis" presented in LIMA (Zhou et al., 2023), where the authors find that 1,000 examples can be sufficient to align a model to adhere to user preferences.

Benchmarks and methods To evaluate and push the limits [...]
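Returning to the rejection-sampling baseline from §3 and Figure 6: sampling until a generation fits the token budget can be sketched as follows, with `generate` a stand-in for sampling one full reasoning trace from the model.

```python
def rejection_sample(generate, max_tokens, max_tries=1000):
    """Resample complete generations until one fits the token budget.

    Returns the first generation of length <= max_tokens, approximating
    the posterior over responses conditioned on their length; returns
    None if no sample fits within max_tries.
    """
    for _ in range(max_tries):
        tokens = generate()
        if len(tokens) <= max_tokens:
            return tokens
    return None
```

Note that each rejected sample still costs a full generation, so this oracle is far more expensive than it may look from the accepted trace alone.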
Accuracy (%)
improve reasoning abilities, including methods like Chain- 50
of-Thought prompting (Wei et al., 2023; Yao et al., 2023a;b; REBASE
Bi et al., 2023; Fu et al., 2023; Zhang et al., 2024b; Xi-
40 Majority
ang et al., 2025; Hu et al., 2024; Diao et al., 2024). These Sequential
combined efforts aim to advance the reasoning ability of
30 scaling
language models, enabling them to handle more complex
and abstract tasks effectively.
2048 8192 32768 131072
Average thinking time (tokens)
6.2. Test-time scaling Figure 7. Scaling further with parallel scaling methods. All met-
Methods As we introduce in §3, we differentiate two meth- rics averaged over the 30 questions in AIME24. Average thinking
ods to scale test-time compute: parallel and sequential. The tokens for REBASE do not account for the additional compute from
former relies on multiple solution attempts generated in par- the reward model. For sequential scaling, we prompt the model
allel and selecting the best outcome via specific criteria. These criteria include choosing the most frequent response for majority voting, or the best response based on an external reward for Best-of-N (Brown et al., 2024; Irvine et al., 2023; Levi, 2024). Unlike repeated sampling, sequential scaling methods let the model generate solution attempts one after another, allowing it to refine each attempt based on previous outcomes (Snell et al., 2024; Hou et al., 2025; Lee et al., 2025). Tree-based search methods (Gandhi et al., 2024; Wu et al., 2024b) offer a hybrid between sequential and parallel scaling; examples include Monte-Carlo Tree Search (MCTS) (Liu et al., 2024; Zhang et al., 2023; Zhou et al., 2024; Choi et al., 2023) and guided beam search (Xie et al., 2023). REBASE (Wu et al., 2024b) employs a process reward model to balance exploitation and pruning during tree search; empirically, it has been shown to outperform sampling-based methods and MCTS (Wu et al., 2024b). Reward models (Lightman et al., 2023; Wang et al., 2024b;c) play a key role in these methods. They come in two variants: outcome reward models (Xin et al., 2024; Ankner et al., 2024), which assign a score to complete solutions and are particularly useful for Best-of-N selection, and process reward models (Lightman et al., 2023; Wang et al., 2024b; Wu et al., 2024b), which assess individual reasoning steps and are effective for guiding tree-based search.

Limits to further test-time scaling  We have shown in §4 that budget forcing allows extrapolating test-time compute, e.g., improving AIME24 performance from 50% to 57%. However, it has two key limitations when scaling further: it eventually flattens out (Figure 4), and the context window of the underlying language model constrains it. Despite these constraints, our work shows test-time scaling across a wide range of accuracies (Figure 1), partly because scaling down test-time compute behaves predictably and does not suffer from these constraints.

Continuing test-time scaling will require approaches that can further extrapolate test-time compute. How can we get such extrapolation? There may be improvements to budget forcing, such as rotating through different strings rather than only "Wait", or combining it with frequency penalties or a higher temperature to avoid repetitive loops. An exciting direction for future work is researching whether applying budget forcing to a reasoning model trained with reinforcement learning yields better extrapolation, or whether RL allows for new ways of test-time scaling beyond budget forcing. Our work defines the right metrics (§3.2) – Control, Scaling, and Performance – to enable future research and progress on extrapolating test-time compute.

[Figure 7 caption fragment: models are prompted to use up to (from left to right) 32, 64, 256, and 512 steps; for REBASE and majority voting we generate 16 parallel trajectories to aggregate across.]

Parallel scaling as a solution  Parallel scaling offers one solution to the limits of sequential scaling, so we augment our sequentially scaled model with two methods. (I) Majority voting: after generating k solutions, the final solution is the most frequent one across generations. (II) Tree search via REBASE: we use the REBASE process reward model, which is initialized from LLaMA-34B and further finetuned on a synthetic process reward modeling dataset (Wu et al., 2024b); we then aggregate the solutions generated by REBASE via majority voting. As shown in Figure 7, augmenting our model with REBASE scales better than majority voting, and even better than sequential scaling in this scenario. However, REBASE requires an additional forward pass at each step for the reward model, adding some computational overhead. For sequential scaling, when prompted to use up to 512 steps, the model generates a response that exceeds the context window on 12 of the 30 evaluation questions, leading to a large performance drop. Overall, we find that these parallel scaling methods complement sequential scaling, offering an avenue for scaling test-time compute even further, beyond fixed context windows.
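Of the two parallel methods, majority voting is simple enough to sketch directly. The snippet below is an illustration, not our evaluation code: it assumes final answers have already been extracted from the k generations (REBASE's reward-model scoring is not reproduced here).

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among k parallel samples.

    Ties are broken by first occurrence, mirroring a simple
    "choose the most frequent response" rule.
    """
    counts = Counter(answers)
    best_count = max(counts.values())
    # Preserve generation order among tied answers.
    for a in answers:
        if counts[a] == best_count:
            return a

print(majority_vote(["27", "27", "14", "27", "9"]))  # prints 27
```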
s1: Simple test-time scaling
Acknowledgements

We thank Ryan Marten for generating traces from DeepSeek r1 for s1.1 using Bespoke Curator (Marten et al., 2025). This work was partly conducted using the Stanford Marlowe GPU cluster (Kapfer et al., 2025), made possible by financial support from Stanford University. We thank Alexander M. Rush, Andrew Ilyas, Banghua Zhu, Chenglei Si, Chunting Zhou, John Yang, Ludwig Schmidt, Samy Jelassi, Suhas Kotha, Tengyu Ma, Xuechen Li, Yu Sun, and Yue Zhang for very constructive discussions.

References

Ankner, Z., Paul, M., Cui, B., Chang, J. D., and Ammanabrolu, P. Critique-out-loud reward models, 2024. URL https://arxiv.org/abs/2408.11791.

Arora, D., Singh, H. G., and Mausam. Have llms advanced enough? A challenging problem solving benchmark for large language models, 2023. URL https://arxiv.org/abs/2305.15074.

Azerbayev, Z., Schoelkopf, H., Paster, K., Santos, M. D., McAleer, S., Jiang, A. Q., Deng, J., Biderman, S., and Welleck, S. Llemma: An open language model for mathematics, 2023.

Bi, Z., Zhang, N., Jiang, Y., Deng, S., Zheng, G., and Chen, H. When do program-of-thoughts work for reasoning?, 2023. URL https://arxiv.org/abs/2308.15452.

Biderman, S., Schoelkopf, H., Sutawika, L., Gao, L., Tow, J., Abbasi, B., Aji, A. F., Ammanamanchi, P. S., Black, S., Clive, J., DiPofi, A., Etxaniz, J., Fattori, B., Forde, J. Z., Foster, C., Hsu, J., Jaiswal, M., Lee, W. Y., Li, H., Lovering, C., Muennighoff, N., Pavlick, E., Phang, J., Skowron, A., Tan, S., Tang, X., Wang, K. A., Winata, G. I., Yvon, F., and Zou, A. Lessons from the trenches on reproducible evaluation of language models, 2024.

Brown, B., Juravsky, J., Ehrlich, R., Clark, R., Le, Q. V., Ré, C., and Mirhoseini, A. Large language monkeys: Scaling inference compute with repeated sampling, 2024.

Choi, S., Fang, T., Wang, Z., and Song, Y. Kcts: Knowledge-constrained tree search decoding with token-level hallucination detection, 2023. URL https://arxiv.org/abs/2310.09044.

DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948.

Diao, S., Wang, P., Lin, Y., Pan, R., Liu, X., and Zhang, T. Active prompting with chain-of-thought for large language models, 2024. URL https://arxiv.org/abs/2302.12246.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., Zhang, A., Rodriguez, A., Gregerson, A., et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.

Fu, Y., Peng, H., Sabharwal, A., Clark, P., and Khot, T. Complexity-based prompting for multi-step reasoning, 2023. URL https://arxiv.org/abs/2210.00720.

Gandhi, K., Lee, D., Grand, G., Liu, M., Cheng, W., Sharma, A., and Goodman, N. D. Stream of search (sos): Learning to search in language, 2024. URL https://arxiv.org/abs/2404.03683.

Gao, B., Song, F., Yang, Z., Cai, Z., Miao, Y., Dong, Q., Li, L., Ma, C., Chen, L., Xu, R., Tang, Z., Wang, B., Zan, D., Quan, S., Zhang, G., Sha, L., Zhang, Y., Ren, X., Liu, T., and Chang, B. Omni-math: A universal olympiad level mathematic benchmark for large language models, 2024a. URL https://arxiv.org/abs/2410.07985.

Gao, L., Tow, J., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., McDonell, K., Muennighoff, N., Phang, J., Reynolds, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, September 2021. URL https://doi.org/10.5281/zenodo.5371628.

Gao, Z., Niu, B., He, X., Xu, H., Liu, H., Liu, A., Hu, X., and Wen, L. Interpretable contrastive monte carlo tree search reasoning, 2024b. URL https://arxiv.org/abs/2410.01707.

Glazer, E., Erdil, E., Besiroglu, T., Chicharro, D., Chen, E., Gunning, A., Olsson, C. F., Denain, J.-S., Ho, A., de Oliveira Santos, E., Järviniemi, O., Barnett, M., Sandler, R., Vrzala, M., Sevilla, J., Ren, Q., Pratt, E., Levine, L., Barkley, G., Stewart, N., Grechuk, B., Grechuk, T., Enugandla, S. V., and Wildon, M. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai, 2024. URL https://arxiv.org/abs/2411.04872.

Google. Gemini 2.0 flash thinking mode (gemini-2.0-flash-thinking-exp-1219), December 2024. URL https://cloud.google.com/vertex-ai/generative-ai/docs/thinking-mode.

Groeneveld, D., Beltagy, I., Walsh, P., Bhagia, A., Kinney, R., Tafjord, O., Jha, A. H., Ivison, H., Magnusson, I., Wang, Y., Arora, S., Atkinson, D., Authur, R., Chandu, K. R., Cohan, A., Dumas, J., Elazar, Y., Gu, Y., Hessel, J., Khot, T., Merrill, W., Morrison, J., Muennighoff, N., Naik, A., Nam, C., Peters, M. E., Pyatkin, V., Ravichander, A., Schwenk, D., Shah, S., Smith, W., Strubell, E., Subramani, N., Wortsman, M., Dasigi, P., Lambert, N., Richardson, K., Zettlemoyer, L., Dodge, J., Lo, K., Soldaini, L., Smith, N. A., and Hajishirzi, H. Olmo: Accelerating the science of language models, 2024.

He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., Liu, J., Qi, L., Liu, Z., and Sun, M. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024. URL https://arxiv.org/abs/2402.14008.

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset, 2021. URL https://arxiv.org/abs/2103.03874.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/2203.15556.

Hou, Z., Lv, X., Lu, R., Zhang, J., Li, Y., Yao, Z., Li, J., Tang, J., and Dong, Y. Advancing language model reasoning through reinforcement learning and inference scaling, 2025. URL https://arxiv.org/abs/2501.11651.

Hu, Y., Shi, W., Fu, X., Roth, D., Ostendorf, M., Zettlemoyer, L., Smith, N. A., and Krishna, R. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models, 2024. URL https://arxiv.org/abs/2406.09403.

Huang, Z., Wang, Z., Xia, S., Li, X., Zou, H., Xu, R., Fan, R.-Z., Ye, L., Chern, E., Ye, Y., Zhang, Y., Yang, Y., Wu, T., Wang, B., Sun, S., Xiao, Y., Li, Y., Zhou, F., Chern, S., Qin, Y., Ma, Y., Su, J., Liu, Y., Zheng, Y., Zhang, S., Lin, D., Qiao, Y., and Liu, P. Olympicarena: Benchmarking multi-discipline cognitive reasoning for superintelligent ai, 2024a. URL https://arxiv.org/abs/2406.12753.

Huang, Z., Zou, H., Li, X., Liu, Y., Zheng, Y., Chern, E., Xia, S., Qin, Y., Yuan, W., and Liu, P. O1 replication journey – part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson?, 2024b. URL https://arxiv.org/abs/2411.16489.

Huang, Z., Geng, G., Hua, S., Huang, Z., Zou, H., Zhang, S., Liu, P., and Zhang, X. O1 replication journey – part 3: Inference-time scaling for medical reasoning, 2025. URL https://arxiv.org/abs/2501.06458.

Irvine, R., Boubert, D., Raina, V., Liusie, A., Zhu, Z., Mudupalli, V., Korshuk, A., Liu, Z., Cremer, F., Assassi, V., Beauchamp, C.-C., Lu, X., Rialan, T., and Beauchamp, W. Rewarding chatbots for real-world engagement with millions of users, 2023. URL https://arxiv.org/abs/2303.06135.

Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024. URL https://arxiv.org/abs/2403.07974.

Kapfer, C., Stine, K., Narasimhan, B., Mentzel, C., and Candes, E. Marlowe: Stanford's gpu-based computational instrument, January 2025. URL https://doi.org/10.5281/zenodo.14751899.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361.

Kim, E., Suk, J., Kim, S., Muennighoff, N., Kim, D., and Oh, A. Llm-as-an-interviewer: Beyond static testing through dynamic llm evaluation, 2024. URL https://arxiv.org/abs/2412.10424.

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention, 2023. URL https://arxiv.org/abs/2309.06180.

Labs, B. Bespoke-stratos: The unreasonable effectiveness of reasoning distillation, 2025. URL https://hf.co/bespokelabs/Bespoke-Stratos-32B. Accessed: 2025-01-22.

Lee, K.-H., Fischer, I., Wu, Y.-H., Marwood, D., Baluja, S., Schuurmans, D., and Chen, X. Evolving deeper llm thinking, 2025. URL https://arxiv.org/abs/2501.09891.

Levi, N. A simple model of inference scaling laws, 2024. URL https://arxiv.org/abs/2410.16377.

LI, J., Beeching, E., Tunstall, L., Lipkin, B., Soletskyi, R., Huang, S. C., Rasul, K., Yu, L., Jiang, A., Shen, Z., Qin, Z., Dong, B., Zhou, L., Fleureau, Y., Lample, G., and Polu, S. Numinamath, 2024. URL https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf.

Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let's verify step by step, 2023. URL https://arxiv.org/abs/2305.20050.

Ling, W., Yogatama, D., Dyer, C., and Blunsom, P. Program induction by rationale generation: Learning to solve and explain algebraic word problems, 2017. URL https://arxiv.org/abs/1705.04146.

Liu, J., Cui, L., Liu, H., Huang, D., Wang, Y., and Zhang, Y. Logiqa: A challenge dataset for machine reading comprehension with logical reasoning, 2020. URL https://arxiv.org/abs/2007.08124.

Liu, J., Cohen, A., Pasunuru, R., Choi, Y., Hajishirzi, H., and Celikyilmaz, A. Don't throw away your value model! generating more preferable text with value-guided monte-carlo tree search decoding, 2024. URL https://arxiv.org/abs/2309.15028.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization, 2019.

Luo, H., Sun, Q., Xu, C., Zhao, P., Lou, J., Tao, C., Geng, X., Lin, Q., Chen, S., Tang, Y., and Zhang, D. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct, 2025. URL https://arxiv.org/abs/2308.09583.

Marten, R., Vu, T., Ji, C. C.-J., Sharma, K., Pimpalgaonkar, S., Dimakis, A., and Sathiamoorthy, M. Curator: A tool for synthetic data creation. https://github.com/bespokelabsai/curator, January 2025.

Mathematical Association of America. AIME, February 2024. URL https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions/.

Muennighoff, N., Soldaini, L., Groeneveld, D., Lo, K., Morrison, J., Min, S., Shi, W., Walsh, P., Tafjord, O., Lambert, N., Gu, Y., Arora, S., Bhagia, A., Schwenk, D., Wadden, D., Wettig, A., Hui, B., Dettmers, T., Kiela, D., Farhadi, A., Smith, N. A., Koh, P. W., Singh, A., and Hajishirzi, H. Olmoe: Open mixture-of-experts language models, 2024. URL https://arxiv.org/abs/2409.02060.

OpenAI. Learning to reason with llms, September 2024. URL https://openai.com/index/learning-to-reason-with-llms/.

OpenAI. Openai o3-mini, 2025. URL https://openai.com/index/openai-o3-mini/. Accessed: 2025-02-24.

Phan, L., Gatti, A., Han, Z., Li, N., Hu, J., Zhang, H., Choi, M., Agrawal, A., Chopra, A., et al. Humanity's last exam, 2025. URL https://arxiv.org/abs/2501.14249.

Qin, Y., Li, X., Zou, H., Liu, Y., Xia, S., Huang, Z., Ye, Y., Yuan, W., Liu, H., Li, Y., and Liu, P. O1 replication journey: A strategic progress report – part 1, 2024. URL https://arxiv.org/abs/2410.18982.

Qwen: Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wan, Y., Liu, Y., Cui, Z., Zhang, Z., and Qiu, Z. Qwen2.5 technical report, 2024. URL https://arxiv.org/abs/2412.15115.

Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022.

Shi, Q., Tang, M., Narasimhan, K., and Yao, S. Can language models solve olympiad programming?, 2024. URL https://arxiv.org/abs/2404.10952.

Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314.

Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023.

Su, H., Yen, H., Xia, M., Shi, W., Muennighoff, N., Wang, H.-y., Liu, H., Shi, Q., Siegel, Z. S., Tang, M., Sun, R., Yoon, J., Arik, S. O., Chen, D., and Yu, T. Bright: A realistic and challenging benchmark for reasoning-intensive retrieval, 2024. URL https://arxiv.org/abs/2407.12883.

Sun, L., Han, Y., Zhao, Z., Ma, D., Shen, Z., Chen, B., Chen, L., and Yu, K. Scieval: A multi-level large language model evaluation benchmark for scientific research, 2024. URL https://arxiv.org/abs/2308.13149.

Team, K., Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., et al. Kimi k1.5: Scaling reinforcement learning with llms, 2025. URL https://arxiv.org/abs/2501.12599.

Team, N. Sky-t1: Fully open-source reasoning model with o1-preview performance in $450 budget, 2025. URL https://novasky-ai.github.io/posts/sky-t1. Accessed: 2025-01-09.

Team, Q. Qwq: Reflect deeply on the boundaries of the unknown, November 2024. URL https://qwenlm.github.io/blog/qwq-32b-preview/.

Wang, J., Meng, F., Liang, Y., and Zhou, J. Drt-o1: Optimized deep reasoning translation via long chain-of-thought, 2024a. URL https://arxiv.org/abs/2412.17498.

Wang, P., Li, L., Shao, Z., Xu, R. X., Dai, D., Li, Y., Chen, D., Wu, Y., and Sui, Z. Math-shepherd: Verify and reinforce llms step-by-step without human annotations, 2024b. URL https://arxiv.org/abs/2312.08935.

Wang, S., Liu, Z., Zhong, W., Zhou, M., Wei, Z., Chen, Z., and Duan, N. From lsat: The progress and challenges of complex reasoning, 2021. URL https://arxiv.org/abs/2108.00648.

Wang, Z., Dong, Y., Delalleau, O., Zeng, J., Shen, G., Egert, D., Zhang, J. J., Sreedhar, M. N., and Kuchaiev, O. Helpsteer2: Open-source dataset for training top-performing reward models, 2024c. URL https://arxiv.org/abs/2406.08673.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903.

Welleck, S., Bertsch, A., Finlayson, M., Schoelkopf, H., Xie, A., Neubig, G., Kulikov, I., and Harchaoui, Z. From decoding to meta-generation: Inference-time algorithms for large language models, 2024. URL https://arxiv.org/abs/2406.16838.

Wu, T., Lan, J., Yuan, W., Jiao, J., Weston, J., and Sukhbaatar, S. Thinking llms: General instruction following with thought generation, 2024a. URL https://arxiv.org/abs/2410.10630.

Wu, Y., Sun, Z., Li, S., Welleck, S., and Yang, Y. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models, 2024b. URL https://arxiv.org/abs/2408.00724.

Xiang, V., Snell, C., Gandhi, K., Albalak, A., Singh, A., Blagden, C., Phung, D., Rafailov, R., Lile, N., Mahan, D., Castricato, L., Franken, J.-P., Haber, N., and Finn, C. Towards system 2 reasoning in llms: Learning how to think with meta chain-of-thought, 2025. URL https://arxiv.org/abs/2501.04682.

Xie, Y., Kawaguchi, K., Zhao, Y., Zhao, X., Kan, M.-Y., He, J., and Xie, Q. Self-evaluation guided beam search for reasoning, 2023. URL https://arxiv.org/abs/2305.00633.

Xin, H., Guo, D., Shao, Z., Ren, Z., Zhu, Q., Liu, B., Ruan, C., Li, W., and Liang, X. Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data, 2024. URL https://arxiv.org/abs/2405.14333.

Xu, H., Wu, X., Wang, W., Li, Z., Zheng, D., Chen, B., Hu, Y., Kang, S., Ji, J., Zhang, Y., Guo, Z., Yang, Y., Zhang, M., and Zhang, D. Redstar: Does scaling long-cot data unlock better slow-reasoning systems?, 2025. URL https://arxiv.org/abs/2501.11284.

Yang, Z., Band, N., Li, S., Candès, E., and Hashimoto, T. Synthetic continued pretraining, 2024. URL https://arxiv.org/abs/2409.07431.

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models, 2023a. URL https://arxiv.org/abs/2305.10601.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models, 2023b. URL https://arxiv.org/abs/2210.03629.

Ye, Y., Huang, Z., Xiao, Y., Chern, E., Xia, S., and Liu, P. Limo: Less is more for reasoning, 2025a. URL https://arxiv.org/abs/2502.03387.

Ye, Y., Xiao, Y., Mi, T., and Liu, P. Aime-preview: A rigorous and immediate evaluation framework for advanced mathematical reasoning. https://github.com/GAIR-NLP/AIME-Preview, 2025b. GitHub repository.

Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y., Kwok, J. T., Li, Z., Weller, A., and Liu, W. Metamath: Bootstrap your own mathematical questions for large language models, 2024. URL https://arxiv.org/abs/2309.12284.

Yuan, S., Chen, Z., Xi, Z., Ye, J., Du, Z., and Chen, J. Agent-r: Training language model agents to reflect via iterative self-training, 2025. URL https://arxiv.org/abs/2501.11425.

Zelikman, E., Wu, Y., Mu, J., and Goodman, N. D. Star: Bootstrapping reasoning with reasoning, 2022. URL https://arxiv.org/abs/2203.14465.

Zelikman, E., Harik, G., Shao, Y., Jayasiri, V., Haber, N., and Goodman, N. D. Quiet-star: Language models can teach themselves to think before speaking, 2024. URL https://arxiv.org/abs/2403.09629.

Zhang, H. and Chen, C. Test-time compute scaling laws, 2024. URL https://github.com/hughbzhang/o1_inference_scaling_laws.

Zhang, S., Chen, Z., Shen, Y., Ding, M., Tenenbaum, J. B., and Gan, C. Planning with large language models for code generation, 2023. URL https://arxiv.org/abs/2303.05510.

Zhang, Y., Wu, S., Yang, Y., Shu, J., Xiao, J., Kong, C., and Sang, J. o1-coder: an o1 replication for coding, 2024a. URL https://arxiv.org/abs/2412.00154.

Zhang, Y., Yang, J., Yuan, Y., and Yao, A. C.-C. Cumulative reasoning with large language models, 2024b. URL https://arxiv.org/abs/2308.04371.

Zhong, H., Xiao, C., Tu, C., Zhang, T., Liu, Z., and Sun, M. Jec-qa: A legal-domain question answering dataset, 2019. URL https://arxiv.org/abs/1911.12011.

Zhong, W., Cui, R., Guo, Y., Liang, Y., Lu, S., Wang, Y., Saied, A., Chen, W., and Duan, N. Agieval: A human-centric benchmark for evaluating foundation models, 2023. URL https://arxiv.org/abs/2304.06364.

Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., and Wang, Y.-X. Language agent tree search unifies reasoning acting and planning in language models, 2024. URL https://arxiv.org/abs/2310.04406.

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., Zhang, S., Ghosh, G., Lewis, M., Zettlemoyer, L., and Levy, O. Lima: Less is more for alignment, 2023. URL https://arxiv.org/abs/2305.11206.
Contents
1 Introduction 1
3 Test-time scaling 3
3.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.2 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4 Results 4
4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
5 Ablations 6
5.1 Data Quantity, Diversity, and Difficulty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
5.2 Test-time scaling methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
A s1.1 16
B Evaluation determinism 16
C s1K details 17
C.1 s1K summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
C.2 Dataset composition for full 59K questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
C.3 s1K grading prompt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
C.4 s1K diversity selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
C.5 Decontamination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
D Training details 21
D.1 Training Ablations: Sequence length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
D.2 Training Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
F Version Control 46
A. s1.1
Seven days after the release of s1, we released s1.1. We regenerated traces for our 1,000 samples in s1K using DeepSeek r1 (DeepSeek-AI et al., 2025) to create s1K-1.1, and trained s1.1 with the same training procedure. Other developments since our launch include the release of o3 (OpenAI, 2025), LIMO (Ye et al., 2025a), and AIME 2025; we account for all of them in Table 5. We find that s1.1 performs significantly better than s1. We also tried distilling from Claude 3.7, which led to worse performance than distilling from r1 (not reported).
Table 5. s1-32B is an open and sample-efficient reasoning model. We evaluate s1-32B, Qwen, and Gemini (some entries are unknown (N.A.), see §4). Other results are from the respective reports (Qwen et al., 2024; Team, 2024; OpenAI, 2024; 2025; DeepSeek-AI et al., 2025; Labs, 2025; Team, 2025), except for AIME 2025 (Ye et al., 2025b). # Examples = number of examples used for reasoning finetuning; BF = budget forcing.
Model # Examples MATH500 GPQA AIME 2024 AIME 2025
API only
o3-mini-low N.A. 95.8 70.6 56.3 42.1
o3-mini-medium N.A. 97.3 76.8 75.8 70.4
o3-mini-high N.A. 97.9 79.7 83.8 80.9
Open Weights
QwQ-32B N.A. 90.6 54.5 46.7 32.7
r1 ≫800K 97.3 71.5 79.8 70.0
r1-distill-Llama-70B 800K 94.5 65.2 57.1 56.3
r1-distill-Qwen-14B 800K 93.9 59.1 61.7 48.0
r1-distill-Qwen-32B 800K 94.3 62.1 58.3 49.6
Open Weights and Open Data
LIMO 817 94.8 66.7 56.3 44.6
s1 w/o BF 1K 92.6 56.6 50.0 26.7
s1 with Budget Forcing “Wait” 1x 1K 92.8 59.6 53.3 30.0
s1 with Budget Forcing “Wait” 2x 1K 93.0 59.6 53.3 33.3
s1 with Budget Forcing “Wait” 4x 1K 92.2 58.6 56.7 36.7
s1.1 w/o BF 1K 94.4 60.6 56.7 50.0
s1.1 with Budget Forcing “Wait” 1x 1K 95.4 62.6 56.7 50.0
s1.1 with Budget Forcing “Wait” 2x 1K 95.4 63.6 56.7 50.0
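The "Wait" 1x/2x/4x variants above can be sketched as a decoding wrapper. This is a simplified illustration rather than the actual evaluation code: `generate` is a placeholder for a real decoding call that returns a continuation up to and including the end-of-thinking delimiter, and the delimiter string is taken from the format described in §D.

```python
END_THINK = "<|im_start|>answer"  # end-of-thinking delimiter, per the format in §D

def budget_force(generate, prompt, num_waits):
    """Suppress the end-of-thinking delimiter up to `num_waits` times,
    appending "Wait" instead so the model keeps reasoning.

    `generate(text)` is a stand-in for a real decoding call that returns
    the continuation of `text` up to and including END_THINK (or up to
    another stop condition such as maximum length).
    """
    text = prompt
    waits_used = 0
    while True:
        text += generate(text)
        if waits_used < num_waits and text.endswith(END_THINK):
            # Strip the delimiter and force more thinking.
            text = text[: -len(END_THINK)] + "Wait"
            waits_used += 1
        else:
            return text
```

With `num_waits=0` this reduces to ordinary decoding; larger values trade more thinking tokens for the accuracy gains shown in the table.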
B. Evaluation determinism
We run our evaluations using vLLM (Kwon et al., 2023) as it is faster than the alternatives we tried. However, we find that even when using the same random seeds and greedy sampling, evaluation scores can change significantly across runs. As our model generates long reasoning traces prior to its answer, small numeric changes can snowball into large differences: we encounter many generations that are exactly the same for thousands of tokens and then suddenly differ in one token, eventually ending up with an entirely different answer. To partly counter this issue, we generally run our final evaluations in full precision unless otherwise indicated.
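The snowballing described above starts from ordinary floating-point behavior. As a minimal, self-contained illustration (not our evaluation stack): addition order alone can change a result, and in a long autoregressive generation a single flipped token then conditions everything that follows.

```python
# Floating-point addition is not associative, so different reduction
# orders (e.g., across kernels, batch sizes, or precisions) can
# produce slightly different logits for the same input.
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c   # 1.0
right = a + (b + c)  # 0.0: adding 1.0 to -1e16 is lost to rounding
print(left, right, left == right)
```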
C. s1K details
C.1. s1K summary
Table 6. Summary of our dataset s1K. Token count measured by the Qwen-2.5 tokenizer. We prompt Claude to produce keywords given
several questions from the domain.
Domain #questions Total token count Keywords
Geometry 109 560.2K Area, Triangle, Distance
Number theory 98 522.5K Sequences, Divisibility
Combinatorics 75 384.7K Permutations, Counting
Real functions 43 234.8K Trigonometry, Calculus
Biology 41 120.9K Organic reactions
Complex functions 32 170.2K Complex roots
Quantum theory 32 127.9K Particles, Wave functions
Field theory 28 150.1K Polynomials, Roots
Calculus of variations 28 155.5K Optimization, Control
Difference equations 24 132.5K Recurrence, Recursion
Electromagnetic theory 23 95.8K Optics, Waves, Diffraction
Group theory 22 100.0K Groups, Automorphisms
Linear algebra 22 128.3K Matrices, Determinants
Probability theory 20 114.6K Random walk, Expectation
Algebraic systems 19 109.9K Functional equations
Mechanics 19 103.6K Forces, Motion, Energy
Thermodynamics 19 74.2K Heat engines, Entropy
Differential equations 18 89.6K Substitution, Existence
Computer science 18 34.2K Complexity theory, Algorithms
Numerical analysis 18 76.5K Error analysis, Stability
Calculus 17 96.3K Convergence, Summation
Algebraic structures 17 90.4K Inequalities, Sets
Astronomy 16 37.7K Stellar populations, Orbits
Remaining 27 domains 242 982.2K Domains with ≤ 16 questions
All domains (51) 1000 4.7M s1K
Table 7. Composition of the full 59K questions. Thinking and response lengths are measured in tokens using the Qwen2.5-32B-Instruct tokenizer (Qwen et al., 2024). In addition to excluding our evaluation benchmark AIME24, we also exclude AIME questions from 2022-2023, as we use these 90 questions during the development of s1-32B.

Source  Description  #Samples  Avg. thinking length
NuminaMATH (LI et al., 2024)  Math problems from online websites  30660  4.1K
MATH (Hendrycks et al., 2021)  Math problems from competitions  11999  2.9K
OlympicArena (Huang et al., 2024a)  Astronomy, Biology, Chemistry, Computer Science, Geography, Math, and Physics olympiad questions  4250  3.2K
OmniMath (Gao et al., 2024a)  Math problems from competitions  4238  4.4K
AGIEval (Zhong et al., 2023; Ling et al., 2017; Hendrycks et al., 2021; Liu et al., 2020; Zhong et al., 2019; Wang et al., 2021)  English, Law, Logic and Math problems from the SAT, LSAT and other exams  2385  1.2K
xword  Crossword puzzles  999  0.7K
OlympiadBench (He et al., 2024)  Math and Physics olympiad questions  896  3.9K
AIME (1983-2021)  American Invitational Mathematics Examination  890  4.7K
TheoremQA (Chen et al., 2023)  Computer Science, Finance, Math, and Physics university-level questions relating to theorems  747  2.1K
USACO (Shi et al., 2024)  Code problems from the USA Computing Olympiad  519  3.6K
JEEBench (Arora et al., 2023)  Chemistry, Math, and Physics problems used in the university entrance examination of the Indian Institute of Technology  515  2.9K
GPQA (Rein et al., 2023)  PhD-Level Science Questions  348  2.9K
SciEval (Sun et al., 2024)  Biology, Chemistry, and Physics problems from various sources  227  0.7K
s1-prob  Stanford statistics qualifying exams  182  4.0K
LiveCodeBench (Jain et al., 2024)  Code problems from coding websites (LeetCode, AtCoder, and CodeForces)  151  3.5K
s1-teasers  Math brain-teasers crawled from the Internet  23  4.1K
All 59K questions  Composite of the above datasets with reasoning traces and solutions  59029  3.6K
You are an AI assistant for grading a science problem. The user will provide you with the question itself, an attempt made by
a student and the correct answer to the problem. Your job is to judge whether the attempt is correct by comparing it with the
correct answer. If the expected solution concludes with a number or choice, there should be no ambiguity. If the expected
solution involves going through the entire reasoning process, you should judge the attempt based on whether the reasoning
process is correct with correct answer if helpful.
The user will provide the attempt and the correct answer in the following format:
# Problem
{problem}
## Attempt
{attempt}
## Correct answer
{solution}
Explain your reasoning, and end your response on a new line with only "Yes" or "No" (without quotes).
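In code, the grading flow above amounts to filling the template and reading the final line of the judge's response. The helpers below (`build_grading_prompt`, `parse_verdict`) are an illustrative sketch, not code from the paper:

```python
# Illustrative sketch of the grading flow; the template mirrors the
# format specified in the judge prompt above. Helper names are hypothetical.
GRADER_TEMPLATE = """# Problem
{problem}

## Attempt
{attempt}

## Correct answer
{solution}"""

def build_grading_prompt(problem: str, attempt: str, solution: str) -> str:
    return GRADER_TEMPLATE.format(problem=problem, attempt=attempt, solution=solution)

def parse_verdict(judge_response: str) -> bool:
    # The judge is instructed to end on a new line with only "Yes" or "No".
    return judge_response.strip().splitlines()[-1].strip() == "Yes"
```

`parse_verdict` is deliberately strict: anything other than a final "Yes" line counts as incorrect.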
C.5. Decontamination
We filter all samples by checking for an 8-gram overlap between the selected examples and the evaluation benchmarks:
MATH500, GPQA Diamond, and AIME24. We exclude questions with more than an 8-gram overlap.
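A minimal sketch of such an 8-gram decontamination check follows; the whitespace tokenization, lowercasing, and function names are assumptions, as the paper does not specify the implementation:

```python
# Minimal sketch of an 8-gram decontamination filter (assumptions:
# whitespace tokenization and lowercasing; not the paper's actual code).
def ngrams(text: str, n: int = 8) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(sample: str, benchmark_questions: list, n: int = 8) -> bool:
    # A sample is contaminated if it shares any n-gram with any benchmark question.
    sample_grams = ngrams(sample, n)
    return any(sample_grams & ngrams(q, n) for q in benchmark_questions)
```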
D. Training details
We take a model that has already been pretrained and instruction tuned and further finetune it for reasoning. Specifically, we
use Qwen2.5-32B-Instruct (Qwen et al., 2024), which on math tasks generally matches or outperforms the larger Qwen2.5-
72B-Instruct (Qwen et al., 2024) or other open models (Dubey et al., 2024; Groeneveld et al., 2024; Muennighoff et al.,
2024). We use token delimiters to separate the thinking stage from the answering stage. We enclose the thinking stage
with <|im_start|>think and <|im_start|>answer; both preceded and followed by a newline. Samples from our
dataset are in §D.2. We use basic fine-tuning hyperparameters: we train for 5 epochs with a batch size of 16 for a total of 315
gradient steps. We train in bfloat16 precision with a learning rate of 1𝑒 − 5 warmed up linearly for 5% (16 steps) and then
decayed to 0 over the rest of training (299 steps) following a cosine schedule. We use the AdamW optimizer (Loshchilov &
Hutter, 2019) with 𝛽1 = 0.9, 𝛽2 = 0.95 and weight decay of 1𝑒 − 4. We do not compute loss on questions, only on reasoning
traces and solutions. We ensure the sequence length is large enough to avoid cutting off any samples, a setting we ablate in
§D.1. The training takes just 26 minutes on 16 NVIDIA H100 GPUs.
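The learning-rate schedule described above (linear warmup over 16 steps to 1e-5, then cosine decay to 0 over the remaining 299 of 315 gradient steps) can be sketched as follows; this is an illustrative reimplementation, not the authors' training code:

```python
import math

# Sketch of the schedule described above: peak LR 1e-5, linear warmup for
# 16 steps (5% of 315), then cosine decay to 0 over the remaining steps.
PEAK_LR, WARMUP_STEPS, TOTAL_STEPS = 1e-5, 16, 315

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS  # linear warmup
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))  # cosine decay
```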
Figure 9. Training dynamics of s1-32B on s1K. [Plots omitted: three panels showing Loss, Gradient Norm, and Learning Rate over Training Steps 0-300.]
Table 8. Training sequence length ablation. We report “accuracy / average thinking tokens per sample”; the higher the accuracy and the
fewer the thinking tokens (inference cost) the better.
| | Model A | Model B |
|---|---|---|
| Training sequence length | 4096 | 32768 |
| % training samples cutoff | 74% | 0% |
| AIME24 | 30.0% / 20721 | 50.0% / 6984 |
| MATH500 | 90.0% / 5324 | 91.0% / 3268 |
| GPQA | 52.5% / 6841 | 53.0% / 3568 |
Besides our scaling ablations in §5.2, the main training hyperparameter we ablate is the sequence length used during training.
We find that a shorter training sequence length leads to longer reasoning traces at test time. This is because, when training
with a shorter sequence length, the answer section of the training sample is more commonly cut off. Conversely, when the
training sequence length is longer, more samples appear in their entirety, including the section where the model answers. The
model thus receives more gradient updates in which it learns to generate an answer following its reasoning chain. This in turn
leads to a higher log probability of the answer section at any point during generation, and thus shorter reasoning traces at test time.
Performance-wise, we also find that the model trained with a longer sequence length performs better. Thus we opt for the
longest training sequence length as it leads to better performance and makes inference more efficient by leading to shorter
reasoning traces.
Table 9. s1K sample stemming from s1-prob. The black text is the question and the yellow text is the solution. Based solely on the
question we generate the light blue reasoning trace and blue response with Gemini, which we use for training.
You are given integrable random variables 𝑋, 𝑌0 and 𝑍0 on the same probability space (Ω, ℱ, ℙ), and two 𝜎-algebras
𝒜 ⊂ ℱ, ℬ ⊂ ℱ. For 𝑘 = 1, 2, …, let
𝑌𝑘 ∶= 𝔼[𝑋|𝜎(𝒜, 𝑍0, …, 𝑍𝑘−1)], 𝑍𝑘 ∶= 𝔼[𝑋|𝜎(ℬ, 𝑌0, …, 𝑌𝑘−1)].
(a) Show that there exist integrable random variables 𝑌∞ and 𝑍∞ such that as 𝑛 → ∞ both 𝔼[|𝑌𝑛 − 𝑌∞|] → 0 and
𝔼[|𝑍𝑛 − 𝑍∞|] → 0. (b) Prove that almost surely 𝑌∞ = 𝑍∞.
1. Write ℱ𝑛 = 𝜎(𝒜, 𝑍0, …, 𝑍𝑛−1) and 𝒢𝑛 = 𝜎(ℬ, 𝑌0, …, 𝑌𝑛−1). Note that 𝔼|𝔼[𝑌𝑛|𝒢𝑛] − 𝔼[𝑌∞|𝒢𝑛]| ≤ 𝔼|𝑌𝑛 − 𝑌∞| = 𝑜(1) and by Levy's Upward Theorem, 𝔼[𝑌∞|𝒢𝑛] → 𝔼[𝑌∞|𝒢∞] in 𝐿1,
where 𝒢∞ = 𝜎(∪𝑛≥0 𝒢𝑛) = 𝜎(ℬ, 𝑌0, 𝑌1, …). It is evident that 𝑌∞ ∈ 𝑚𝒢∞ (since it is the almost sure limit of 𝑌𝑛 ∈ 𝑚𝒢𝑛+1)
and hence 𝔼[𝑌∞|𝒢∞] = 𝑌∞ almost surely. Combining the above observations we conclude that 𝔼[𝑌𝑛|𝒢𝑛] → 𝑌∞ in 𝐿1.
Similarly, 𝔼[𝑍𝑛|ℱ𝑛] → 𝑍∞ in 𝐿1. Now invoke Equation (4) to conclude that 𝑌∞ = 𝑍∞ almost surely.
2. Since 𝑋 is integrable, we have by Levy's Upward Theorem, 𝑌∞ = lim𝑛→∞ 𝑌𝑛 = lim𝑛→∞ 𝔼[𝑋|ℱ𝑛] = 𝔼[𝑋|ℱ∞].
Similarly, 𝑍∞ = 𝔼[𝑋|𝒢∞]. On the other hand, in the first proof we observed that 𝑌∞ ∈ 𝑚𝒢∞ and hence 𝑌∞ ∈
𝑚ℱ∞ ∩ 𝑚𝒢∞ = 𝑚(ℱ∞ ∩ 𝒢∞). This yields the following:
𝑌∞ = 𝔼[𝑌∞|ℱ∞ ∩ 𝒢∞] = 𝔼[𝔼[𝑋|ℱ∞]|ℱ∞ ∩ 𝒢∞] = 𝔼[𝑋|ℱ∞ ∩ 𝒢∞].
Similarly, 𝑍∞ = 𝔼[𝑋|ℱ∞ ∩ 𝒢∞]. This proves that 𝑌∞ = 𝑍∞ almost surely.
Let’s analyze the problem. We are given two sequences of random variables 𝑌𝑘 and 𝑍𝑘 defined by conditional expectations.
We need to show that these sequences converge in 𝐿1 and that their limits are equal almost surely.
Part (a): Convergence in 𝐿1.
The sigma-algebras involved are increasing:
ℱ𝑘 = 𝜎(𝒜, 𝑍0, …, 𝑍𝑘−1)
𝒢𝑘 = 𝜎(ℬ, 𝑌0, …, 𝑌𝑘−1)
For the 𝑌𝑘 sequence:
𝑌1 = 𝔼[𝑋|𝜎(𝒜)]
𝑌2 = 𝔼[𝑋|𝜎(𝒜, 𝑍0)]
𝑌3 = 𝔼[𝑋|𝜎(𝒜, 𝑍0, 𝑍1)]
...
The sequence of sigma-algebras 𝜎(𝒜, 𝑍0, …, 𝑍𝑘−1) is increasing. By the Martingale Convergence Theorem (for conditional
expectations), since 𝑋 is integrable, the sequence 𝑌𝑘 converges almost surely and in 𝐿1 to 𝔼[𝑋|𝜎(𝒜, 𝑍0, 𝑍1, …)]. Let
𝑌∞ = 𝔼[𝑋|𝜎(𝒜, 𝑍0, 𝑍1, …)].
Consider 𝔼[𝑋|𝒜, 𝑍0, 𝑍1, …] = 𝑌∞. This means for any 𝐴 ∈ 𝜎(𝒜, 𝑍0, 𝑍1, …), 𝔼[𝑋1𝐴] = 𝔼[𝑌∞1𝐴].
Consider 𝔼[𝑋|ℬ, 𝑌0, 𝑌1, …] = 𝑍∞. This means for any 𝐵 ∈ 𝜎(ℬ, 𝑌0, 𝑌1, …), 𝔼[𝑋1𝐵] = 𝔼[𝑍∞1𝐵].
Let ℋ = 𝜎(𝒜, ℬ).
Consider 𝔼[𝑋|ℋ].
Consider 𝔼[𝑌∞|ℬ, 𝑌0, 𝑌1, …] = 𝔼[𝔼[𝑋|𝒜, 𝑍0, 𝑍1, …]|ℬ, 𝑌0, 𝑌1, …].
Let’s use the fact that 𝑌∞ is 𝜎(𝒜, 𝑍0, 𝑍1, …)-measurable and 𝑍∞ is 𝜎(ℬ, 𝑌0, 𝑌1, …)-measurable.
Consider the smallest sigma-algebra containing both 𝜎(𝒜, 𝑍0, 𝑍1, …) and 𝜎(ℬ, 𝑌0, 𝑌1, …).
This is 𝜎(𝒜 ∪ ℬ ∪ {𝑌𝑖}𝑖≥0 ∪ {𝑍𝑖}𝑖≥0).
Let’s try to show that 𝔼[(𝑌∞ − 𝑍∞)²] = 0.
𝔼[(𝑌∞ − 𝑍∞)²] = 𝔼[𝑌∞²] + 𝔼[𝑍∞²] − 2𝔼[𝑌∞𝑍∞].
Consider the definition of 𝑌𝑘 and 𝑍𝑘.
𝔼[𝑌𝑘+1|𝒜, 𝑍0, …, 𝑍𝑘−1] = 𝑌𝑘.
𝔼[𝑍𝑘+1|ℬ, 𝑌0, …, 𝑌𝑘−1] = 𝑍𝑘.
Consider 𝔼[𝑌∞|ℬ, 𝑌0, 𝑌1, …].
Since 𝑌∞ is measurable with respect to 𝜎(𝒜, 𝑍0, 𝑍1, …), this does not simplify easily.
Let’s assume 𝑌∞ ≠ 𝑍∞ on a set of positive measure.
Consider the case where 𝒜 and ℬ are trivial.
Then 𝑌𝑘 = 𝔼[𝑋|𝜎(𝑍0, …, 𝑍𝑘−1)] and 𝑍𝑘 = 𝔼[𝑋|𝜎(𝑌0, …, 𝑌𝑘−1)].
Let ℋ∞ = 𝜎(𝒜 ∪ ℬ ∪ {𝑌𝑖, 𝑍𝑖}𝑖≥0).
Consider 𝔼[𝑋|ℋ∞].
We have 𝑌∞ = 𝔼[𝑋|𝜎(𝒜, 𝑍0, 𝑍1, …)].
By the properties of conditional expectation, 𝔼[𝑌∞|𝒜] = 𝔼[𝔼[𝑋|𝒜, 𝑍0, 𝑍1, …]|𝒜].
Consider the sigma-algebra generated by all the random variables involved:
ℋ = 𝜎(𝒜 ∪ ℬ ∪ {𝑌𝑖}𝑖≥0 ∪ {𝑍𝑖}𝑖≥0).
Let 𝑀∞ = 𝔼[𝑋|ℱ∞ ∨ 𝒢∞].
Consider the limits. As 𝑛 → ∞, 𝜎(𝒜, 𝑍0, …, 𝑍𝑛−1) ↑ 𝜎(𝒜, 𝑍0, 𝑍1, …).
As 𝑛 → ∞, 𝜎(ℬ, 𝑌0, …, 𝑌𝑛−1) ↑ 𝜎(ℬ, 𝑌0, 𝑌1, …).
Let ℱ∞ = 𝜎(𝒜, 𝑍0, 𝑍1, …) and 𝒢∞ = 𝜎(ℬ, 𝑌0, 𝑌1, …).
We have 𝑌∞ = 𝔼[𝑋|ℱ∞] and 𝑍∞ = 𝔼[𝑋|𝒢∞].
Consider 𝔼[(𝑌∞ − 𝑍∞)²] = 𝔼[((𝑌∞ − 𝑋) + (𝑋 − 𝑍∞))²]
= 𝔼[(𝑌∞ − 𝑋)²] + 𝔼[(𝑋 − 𝑍∞)²] + 2𝔼[(𝑌∞ − 𝑋)(𝑋 − 𝑍∞)]
= 𝔼[(𝑋 − 𝔼[𝑋|ℱ∞])²] + 𝔼[(𝑋 − 𝔼[𝑋|𝒢∞])²] + 2𝔼[𝑌∞𝑋 − 𝑌∞𝑍∞ − 𝑋² + 𝑋𝑍∞]
= 𝔼[𝑋²] − 𝔼[𝑌∞²] + 𝔼[𝑋²] − 𝔼[𝑍∞²] + 2𝔼[𝑌∞𝑋] − 2𝔼[𝑌∞𝑍∞] − 2𝔼[𝑋²] + 2𝔼[𝑋𝑍∞]
= −𝔼[𝑌∞²] − 𝔼[𝑍∞²] + 2𝔼[𝑌∞²] − 2𝔼[𝑌∞𝑍∞] + 2𝔼[𝑍∞²]
Table 10. s1K sample stemming from s1-teasers. The black text is the question and the yellow text is the solution. Based solely on the
question we generate the light blue reasoning trace and blue response with Gemini, which we use for training.
Alice and Bob each have a coin and flip it until they get a heads. If Bob flipped his coin more times than Alice, what is the
expected number of times Alice flipped her coin?
4/3
The problem asks for the expected number of times Alice flipped her coin, given that Bob flipped his coin more times than
Alice.
Let 𝑋 be the number of flips Alice makes until she gets a head.
Let 𝑌 be the number of flips Bob makes until he gets a head.
𝑋 and 𝑌 are independent random variables following a geometric distribution with parameter 𝑝 = 0.5 (probability of getting
a head).
The probability mass function for a geometric distribution is 𝑃(𝐾 = 𝑘) = (1 − 𝑝)^(𝑘−1) 𝑝, for 𝑘 = 1, 2, ….
In this case, 𝑝 = 0.5, so 𝑃(𝐾 = 𝑘) = (0.5)^(𝑘−1)(0.5) = (0.5)^𝑘 = 1/2^𝑘.
We are given the condition that Bob flipped his coin more times than Alice, which means 𝑌 > 𝑋.
We want to find the expected value of 𝑋 given that 𝑌 > 𝑋, i.e., 𝐸[𝑋|𝑌 > 𝑋].
The formula for conditional expectation is 𝐸[𝑋|𝐴] = Σ_{𝑥} 𝑥𝑃(𝑋 = 𝑥|𝐴).
Also, 𝑃(𝑋 = 𝑥|𝐴) = 𝑃(𝑋 = 𝑥 ∩ 𝐴)/𝑃(𝐴).
In our case, 𝐴 is the event 𝑌 > 𝑋.
We need to find 𝑃(𝑌 > 𝑋).
𝑃(𝑌 > 𝑋) = Σ_{𝑥=1}^∞ Σ_{𝑦=𝑥+1}^∞ 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦)
Since 𝑋 and 𝑌 are independent, 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦) = 𝑃(𝑋 = 𝑥)𝑃(𝑌 = 𝑦) = (1/2^𝑥)(1/2^𝑦) = 1/2^(𝑥+𝑦).
𝑃(𝑌 > 𝑋) = Σ_{𝑥=1}^∞ (1/2^𝑥) Σ_{𝑦=𝑥+1}^∞ 1/2^𝑦
The inner sum is a geometric series: Σ_{𝑦=𝑥+1}^∞ (1/2)^𝑦 = (1/2)^(𝑥+1) + (1/2)^(𝑥+2) + …
This is a geometric series with first term 𝑎 = (1/2)^(𝑥+1) and common ratio 𝑟 = 1/2.
The sum is 𝑎/(1 − 𝑟) = (1/2)^(𝑥+1)/(1/2) = (1/2)^𝑥.
So, 𝑃(𝑌 > 𝑋) = Σ_{𝑥=1}^∞ (1/2^𝑥)(1/2)^𝑥 = Σ_{𝑥=1}^∞ (1/4)^𝑥.
This is another geometric series with first term 𝑎 = 1/4 and common ratio 𝑟 = 1/4.
The sum is 𝑎/(1 − 𝑟) = (1/4)/(1 − 1/4) = (1/4)/(3/4) = 1/3.
Now, we need to find 𝑃(𝑋 = 𝑥 ∩ 𝑌 > 𝑋) = 𝑃(𝑋 = 𝑥, 𝑌 > 𝑥).
𝑃(𝑋 = 𝑥, 𝑌 > 𝑥) = 𝑃(𝑋 = 𝑥)𝑃(𝑌 > 𝑥), due to independence.
𝑃(𝑌 > 𝑥) = Σ_{𝑦=𝑥+1}^∞ 𝑃(𝑌 = 𝑦) = Σ_{𝑦=𝑥+1}^∞ 1/2^𝑦 = (1/2)^(𝑥+1) + (1/2)^(𝑥+2) + ⋯ = (1/2)^𝑥.
So, 𝑃(𝑋 = 𝑥, 𝑌 > 𝑥) = (1/2^𝑥)(1/2)^𝑥 = (1/4)^𝑥.
Now, we can find 𝑃(𝑋 = 𝑥|𝑌 > 𝑋) = 𝑃(𝑋 = 𝑥 ∩ 𝑌 > 𝑋)/𝑃(𝑌 > 𝑋) = (1/4)^𝑥/(1/3) = 3(1/4)^𝑥.
Finally, we can calculate the conditional expectation:
𝐸[𝑋|𝑌 > 𝑋] = Σ_{𝑥=1}^∞ 𝑥𝑃(𝑋 = 𝑥|𝑌 > 𝑋) = Σ_{𝑥=1}^∞ 𝑥 ⋅ 3(1/4)^𝑥 = 3 Σ_{𝑥=1}^∞ 𝑥(1/4)^𝑥.
Consider the series 𝑆 = Σ_{𝑥=1}^∞ 𝑥𝑟^𝑥, where |𝑟| < 1.
Σ_{𝑥=1}^∞ 𝑥(1/4)^𝑥 = (1/4)/(1 − 1/4)² = (1/4)/(3/4)² = (1/4)/(9/16) = (1/4) × (16/9) = 4/9.
= ((2 − 𝑝)/(1 − 𝑝)) 𝑝(1 − 𝑝)^(−1) Σ_{𝑥=1}^∞ 𝑥((1 − 𝑝)²)^𝑥 (error in power)
𝑃(𝑌 > 𝑥) = 𝑝(1 − 𝑝)^𝑥/(1 − (1 − 𝑝)) = (1 − 𝑝)^𝑥.
𝐸[𝑋|𝑌 > 𝑋] = (1/𝑃(𝑌 > 𝑋)) Σ_{𝑥=1}^∞ 𝑥𝑃(𝑋 = 𝑥)𝑃(𝑌 > 𝑥)
= (1/𝑃(𝑌 > 𝑋)) Σ_{𝑥=1}^∞ 𝑥(1 − 𝑝)^(𝑥−1) 𝑝(1 − 𝑝)^𝑥
= (𝑝/𝑃(𝑌 > 𝑋)) Σ_{𝑥=1}^∞ 𝑥(1 − 𝑝)^(2𝑥−1)
= (𝑝/((1 − 𝑝)/(2 − 𝑝)))(1 − 𝑝)^(−1) Σ_{𝑥=1}^∞ 𝑥((1 − 𝑝)²)^𝑥
= (𝑝(2 − 𝑝)/(1 − 𝑝)²) Σ_{𝑥=1}^∞ 𝑥((1 − 𝑝)²)^𝑥.
Let 𝑟 = (1 − 𝑝)². Σ_{𝑥=1}^∞ 𝑥𝑟^𝑥 = 𝑟/(1 − 𝑟)².
𝐸[𝑋|𝑌 > 𝑋] = (𝑝(2 − 𝑝)/(1 − 𝑝)²) ⋅ (1 − 𝑝)²/(𝑝²(2 − 𝑝)²) = 1/(𝑝(2 − 𝑝)).
If 𝑝 = 0.5, 𝐸[𝑋|𝑌 > 𝑋] = 1/(0.5 × 1.5) = 1/0.75 = 4/3.
Solution:
Let 𝑋 be the number of flips Alice makes until she gets a head, and 𝑌 be the number of flips Bob makes until he gets a
head. 𝑋 and 𝑌 are independent geometric random variables with parameter 𝑝 = 0.5. The probability mass function is
𝑃 (𝐾 = 𝑘) = (1 − 𝑝)𝑘−1 𝑝.
We are interested in the expected number of times Alice flipped her coin, given that Bob flipped his coin more times than
Alice, i.e., 𝐸[𝑋|𝑌 > 𝑋].
Using the definition of conditional expectation, 𝐸[𝑋|𝑌 > 𝑋] = Σ_{𝑥=1}^∞ 𝑥𝑃(𝑋 = 𝑥|𝑌 > 𝑋).
𝑃(𝑋 = 𝑥|𝑌 > 𝑋) = 𝑃(𝑋 = 𝑥, 𝑌 > 𝑥)/𝑃(𝑌 > 𝑋).
First, calculate 𝑃(𝑌 > 𝑋).
𝑃(𝑌 > 𝑋) = Σ_{𝑥=1}^∞ Σ_{𝑦=𝑥+1}^∞ 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦) = Σ_{𝑥=1}^∞ 𝑃(𝑋 = 𝑥)𝑃(𝑌 > 𝑥).
𝑃(𝑋 = 𝑥) = (0.5)^(𝑥−1)(0.5) = (0.5)^𝑥 = 1/2^𝑥.
𝑃(𝑌 > 𝑥) = Σ_{𝑦=𝑥+1}^∞ (0.5)^(𝑦−1)(0.5) = (0.5)^𝑥.
𝑃(𝑌 > 𝑋) = Σ_{𝑥=1}^∞ (1/2^𝑥)(1/2)^𝑥 = Σ_{𝑥=1}^∞ (1/4)^𝑥 = (1/4)/(1 − 1/4) = (1/4)/(3/4) = 1/3.
Next, calculate 𝑃(𝑋 = 𝑥, 𝑌 > 𝑥) = 𝑃(𝑋 = 𝑥)𝑃(𝑌 > 𝑥) = (1/2^𝑥)(1/2)^𝑥 = (1/4)^𝑥.
Now, find 𝑃(𝑋 = 𝑥|𝑌 > 𝑋) = 𝑃(𝑋 = 𝑥, 𝑌 > 𝑥)/𝑃(𝑌 > 𝑋) = (1/4)^𝑥/(1/3) = 3(1/4)^𝑥.
Finally, calculate the conditional expectation:
𝐸[𝑋|𝑌 > 𝑋] = Σ_{𝑥=1}^∞ 𝑥𝑃(𝑋 = 𝑥|𝑌 > 𝑋) = Σ_{𝑥=1}^∞ 𝑥 ⋅ 3(1/4)^𝑥 = 3 Σ_{𝑥=1}^∞ 𝑥(1/4)^𝑥.
Using the formula Σ_{𝑥=1}^∞ 𝑥𝑟^𝑥 = 𝑟/(1 − 𝑟)² with 𝑟 = 1/4:
Σ_{𝑥=1}^∞ 𝑥(1/4)^𝑥 = (1/4)/(1 − 1/4)² = (1/4)/(9/16) = (1/4) × (16/9) = 4/9.
𝐸[𝑋|𝑌 > 𝑋] = 3 × 4/9 = 12/9 = 4/3.
Final Answer: The final answer is 4∕3
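The 4/3 result can be sanity-checked with a short Monte Carlo simulation; this is an illustrative addition, not part of the original sample:

```python
import random

# Monte Carlo check of E[X | Y > X] = 4/3 for X, Y i.i.d. Geometric(0.5).
random.seed(0)

def flips_until_heads(p: float = 0.5) -> int:
    n = 1
    while random.random() >= p:  # tails with probability 1 - p
        n += 1
    return n

total = count = 0
for _ in range(200_000):
    x, y = flips_until_heads(), flips_until_heads()
    if y > x:  # condition on Bob flipping more times than Alice
        total += x
        count += 1

print(total / count)  # close to 4/3
```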
Table 11. s1K sample stemming from the astronomy subset of OlympicArena. The black text is the question and the blue text is the
solution. Based solely on the question we generate the light blue reasoning trace and blue response with Gemini, which we use for training.
An often-repeated fun fact is that humans produce more power per unit volume than stars. If the sun were the same size, but
it produced the same amount of power per unit volume as a human, what would its surface temperature be? Assume the
"average human" produces 100 watts of power and has a volume of 66400 cubic centimeters.
A: 3500 K
B: 10000 K
C: 25000 K
D: 40000 K
E: 50000 K
Using the numbers from the problem, the average human produces
𝑢 = 100/(66400 × 10⁻⁶) = 1506 W/m³
The volume of the sun is (4/3)𝜋𝑅⊙³, so its new power output would be 𝑃 = (4/3)𝜋𝑅⊙³𝑢. To find the equilibrium temperature, we use
𝑃 = 𝜎𝐴𝑇⁴ = 4𝜋𝜎𝑅⊙²𝑇⁴
Solving for 𝑇, we get
𝑇 = (𝑅⊙𝑢/(3𝜎))^(1/4) = ((6.96 × 10⁸)(1506)/(3 × 5.67 × 10⁻⁸))^(1/4) = 49823 ≈ 50000 K.
The problem asks for the surface temperature of the sun if it had the same size but produced power per unit volume as a
human.
First, let’s find the power density of a human.
Power produced by a human, 𝑃ℎ𝑢𝑚𝑎𝑛 = 100 W
Volume of a human, 𝑉ℎ𝑢𝑚𝑎𝑛 = 66400 cm3
Convert the volume to cubic meters:
1 m = 100 cm
1 m³ = (100 cm)³ = 10⁶ cm³
𝑉ℎ𝑢𝑚𝑎𝑛 = 66400 cm³ = 66400 × 10⁻⁶ m³ = 0.0664 m³
Power density of a human, 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 = 𝑃ℎ𝑢𝑚𝑎𝑛/𝑉ℎ𝑢𝑚𝑎𝑛 = 100 W / 0.0664 m³
𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 = 1506.024 W/m³
Now, consider the sun with the same size as it currently is, but with a power density equal to that of a human.
We need the radius of the sun to calculate its volume. The average radius of the sun is 𝑅𝑠𝑢𝑛 = 6.95 × 108 m.
Volume of the sun, 𝑉𝑠𝑢𝑛 = (4/3)𝜋𝑅𝑠𝑢𝑛³ = (4/3)𝜋(6.95 × 10⁸ m)³
𝑉𝑠𝑢𝑛 = (4/3)𝜋(337.159375 × 10²⁴) m³
𝑉𝑠𝑢𝑛 = 1410.24 × 10²⁴ × 𝜋 m³
𝑉𝑠𝑢𝑛 = 4.4307 × 10²⁷ m³
If the sun had the same power density as a human, the total power produced by this modified sun would be:
𝑃𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑_𝑠𝑢𝑛 = 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 × 𝑉𝑠𝑢𝑛 = 1506.024 W/m³ × 4.4307 × 10²⁷ m³ = 6.6726 × 10³⁰ W
The surface temperature of a star is related to its luminosity (total power output) through the Stefan-Boltzmann law:
𝐿 = 4𝜋𝑅²𝜎𝑇⁴
where 𝐿 is the luminosity (total power output), 𝑅 is the radius of the star, 𝜎 is the Stefan-Boltzmann constant, and 𝑇 is the
surface temperature.
In this case, the luminosity of the modified sun is 𝑃𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛 . The radius of the modified sun is the same as the actual sun,
𝑅𝑠𝑢𝑛 .
The Stefan-Boltzmann constant, 𝜎 = 5.67 × 10⁻⁸ W m⁻² K⁻⁴.
𝑃𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑_𝑠𝑢𝑛 = 4𝜋𝑅𝑠𝑢𝑛²𝜎𝑇⁴
6.6726 × 10³⁰ W = 4𝜋(6.95 × 10⁸ m)² × 5.67 × 10⁻⁸ W m⁻² K⁻⁴ × 𝑇⁴
Calculate the term 4𝜋𝑅𝑠𝑢𝑛²𝜎:
𝑅𝑠𝑢𝑛² = (6.95 × 10⁸)² = 48.3025 × 10¹⁶ m²
4𝜋𝑅𝑠𝑢𝑛² = 4𝜋 × 48.3025 × 10¹⁶ = 607.08 × 10¹⁶ m²
4𝜋𝑅𝑠𝑢𝑛²𝜎 = 607.08 × 10¹⁶ m² × 5.67 × 10⁻⁸ W m⁻² K⁻⁴ = 3441.93 × 10⁸ W K⁻⁴ = 3.44193 × 10¹¹ W K⁻⁴
Now, substitute this back into the equation:
6.6726 × 10³⁰ = 3.44193 × 10¹¹ × 𝑇⁴
Solve for 𝑇⁴:
𝑇⁴ = (6.6726 × 10³⁰)/(3.44193 × 10¹¹) = 1.9386 × 10¹⁹
Solve for 𝑇:
𝑇 = (1.9386 × 10¹⁹)^(1/4) = (193.86 × 10¹⁷)^(1/4)
𝑇 = (1.9386)^(1/4) × (10¹⁹)^(1/4) = (1.9386)^0.25 × 10^4.75
Calculate (1.9386)^0.25:
Let 𝑦 = 1.9386^0.25, then ln 𝑦 = 0.25 ln(1.9386)
ln(1.9386) ≈ 0.6619
ln 𝑦 = 0.25 × 0.6619 = 0.165475
𝑦 = 𝑒^0.165475 ≈ 1.1799
So, 𝑇 = 1.1799 × 10^4.75 = 1.1799 × 10⁴ × 10^0.75
10^0.75 = 10^(3/4) = (10³)^(1/4) = (1000)^(1/4)
We know 5⁴ = 625 and 6⁴ = 1296, so (1000)^(1/4) is between 5 and 6.
Units check: (W/m³ × m)/(W m⁻² K⁻⁴) = (W m⁻²)/(W m⁻² K⁻⁴) = K⁴
The units match for 𝑇⁴.
Numerical calculation check:
Numerator: 1506.02 × 6.95 × 10⁸ = 10466.839 × 10⁸ = 1.0466 × 10¹²
Denominator: 3 × 5.67 × 10⁻⁸ = 17.01 × 10⁻⁸
Ratio: (1.0466 × 10¹²)/(17.01 × 10⁻⁸) = 0.061528 × 10²⁰ = 6.1528 × 10¹⁸
𝑇⁴ = 6.1528 × 10¹⁸
𝑇 = (6.1528 × 10¹⁸)^(1/4) = (6152.8 × 10¹⁵)^(1/4)
𝑇 = (6152.8)^0.25 × 10^3.75
Let’s use logarithms for precision.
log10(6152.8) ≈ 3.789
0.25 × 3.789 = 0.94725
10^0.94725 ≈ 8.856
𝑇 = 8.856 × 10^3.75 = 8.856 × 10³ × 10^0.75 = 8856 × 5.623 = 49806 K
This is very close to 50000 K.
Possible sources of minor discrepancies: rounding errors, precision of constants.
Let’s verify the formula derivation.
Power per unit volume = Total Power / Volume
𝜌𝑝𝑜𝑤𝑒𝑟 = 𝐿/𝑉
𝐿 = 4𝜋𝑅²𝜎𝑇⁴
𝑉 = (4/3)𝜋𝑅³
𝜌𝑝𝑜𝑤𝑒𝑟 = (4𝜋𝑅²𝜎𝑇⁴)/((4/3)𝜋𝑅³) = 3𝜎𝑇⁴/𝑅
If the modified sun has the power density of a human:
𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 = 3𝜎𝑇𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑_𝑠𝑢𝑛⁴/𝑅𝑠𝑢𝑛
𝑇𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑_𝑠𝑢𝑛⁴ = 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛𝑅𝑠𝑢𝑛/(3𝜎)
Solution:
The power produced by an average human is 𝑃ℎ𝑢𝑚𝑎𝑛 = 100 W, and the volume of an average human is 𝑉ℎ𝑢𝑚𝑎𝑛 = 66400 cm³ = 0.0664 m³.
The power density of a human is 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 = 𝑃ℎ𝑢𝑚𝑎𝑛/𝑉ℎ𝑢𝑚𝑎𝑛 = 100 W / 0.0664 m³ ≈ 1506.02 W/m³.
If the sun were the same size but produced the same amount of power per unit volume as a human, its power density would
be 𝜌𝑝𝑜𝑤𝑒𝑟,𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛 = 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 .
The radius of the sun is 𝑅𝑠𝑢𝑛 = 6.95 × 108 m.
The volume of the sun is 𝑉𝑠𝑢𝑛 = (4/3)𝜋𝑅𝑠𝑢𝑛³.
The power density of the modified sun is given by 𝜌𝑝𝑜𝑤𝑒𝑟,𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑_𝑠𝑢𝑛 = 𝑃𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑_𝑠𝑢𝑛/𝑉𝑠𝑢𝑛.
We also know that 𝑃𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑_𝑠𝑢𝑛 = 4𝜋𝑅𝑠𝑢𝑛²𝜎𝑇⁴, where 𝑇 is the surface temperature of the modified sun.
Equating the power density:
𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 = (4𝜋𝑅𝑠𝑢𝑛²𝜎𝑇⁴)/((4/3)𝜋𝑅𝑠𝑢𝑛³) = 3𝜎𝑇⁴/𝑅𝑠𝑢𝑛
Solving for 𝑇⁴:
𝑇⁴ = 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛𝑅𝑠𝑢𝑛/(3𝜎)
Substitute the values:
𝑇⁴ = (1506.02 W/m³ × 6.95 × 10⁸ m)/(3 × 5.67 × 10⁻⁸ W m⁻² K⁻⁴)
𝑇⁴ = (1.0466 × 10¹²)/(1.701 × 10⁻⁷) = 6.1528 × 10¹⁸ K⁴
Solving for 𝑇:
𝑇 = (6.1528 × 10¹⁸)^(1/4) = (6152.8)^0.25 × 10^(18/4) = (6152.8)^0.25 × 10^4.5
Using logarithms:
log10(6152.8) ≈ 3.789
0.25 × 3.789 = 0.94725
(6152.8)^0.25 = 10^0.94725 ≈ 8.856
𝑇 = 8.856 × 10^4.5 = 8.856 × 10⁴ × 10^0.5 = 88560 × √10 ≈ 88560 × 3.16 = 280000 K
Let’s redo the exponent calculation: 18/4 = 4.5.
𝑇 = 8.856 × 10^4.5 = 8.856 × 10⁴ × 10^0.5 - error in previous step.
𝑇 = (6.1528 × 10¹⁸)^0.25 = 49806 K
The surface temperature would be approximately 49806 K, which is closest to 50000 K.
Final Answer: The final answer is 𝐸
Token instructions:
<|im_start|>user
What is the answer to Life, the Universe and Everything? Think for up to 2048 tokens.
<|im_start|>assistant
<|im_start|>think
Let me break down this question into the three parts it is asking for: 1) Life 2) Universe 3) Everything
Let me start with life...
<|im_start|>answer
The answer is...

Step instructions:
<|im_start|>user
What is the answer to Life, the Universe and Everything? Think for up to 64 steps.
<|im_start|>assistant
<|im_start|>64 steps left
Let me break down this question into the three parts it is asking for: 1) Life 2) Universe 3) Everything
<|im_start|>63 steps left
Let me start with life...
<|im_start|>answer
The answer is...

Figure 10. Token and step instruction data formats for controlling test-time compute. We only train our model on the reasoning trace and the answer.
Table 12. Scaling thinking time via token-conditional control. All metrics are averaged over the 30 questions in AIME24.

| Tokens instructed (→) | 1024 | 2048 | 4096 | 8192 | 16384 |
|---|---|---|---|---|---|
| No intervention at test-time | | | | | |
| Thinking tokens | 7939 | 7158 | 8263 | 7108 | 7500 |
| Answer tokens | 689 | 669 | 659 | 722 | 724 |
| AIME24 | 26.7 | 30.0 | 33.3 | 33.3 | 40.0 |
| Forcing end of thinking when token budget is reached | | | | | |
| Thinking tokens | 1024 | 2048 | 4031 | 5664 | 6330 |
| Answer tokens | 15 | 15 | 142 | 722 | 691 |
| AIME24 | 3.3 | 30.0 | 33.3 | 33.3 | 40.0 |
Token-conditional control One general approach is to simply tell a model in the prompt precisely how many tokens it
should generate. Ideally, the model can keep track of its token count and adjust its generation to finish within the desired
limits. We experiment with this approach by training a model with token instructions using the format in Figure 10 (left).
We bucket the lengths of the reasoning traces from our 1,000 training examples into powers of two (rounded upwards) and
add a corresponding instruction to the user prompt. For example, if the instruction says “Think for up to 2048 tokens”,
then the reasoning trace has anywhere between 1024 and 2048 tokens. In Table 12, we show that after training the model
hardly follows the token instruction. It does sometimes generate more tokens when given a higher limit but often overshoots
the limit. This may not be unique to our model, as prior work suggests that OpenAI o1-mini also cannot follow token
instructions (Zhang & Chen, 2024). To prevent exceeding the limit, we test budget forcing the thinking to end once the
limit is reached. This leads to perfect control (Table 12 (lower)). With budget forcing, the scaling trend is also clearer as the
model can no longer overshoot the limit when given a small thinking budget. This leads to better test-time scaling values for
Token Prompting + budget forcing in Table 3. To compute the Control metric reported in Table 3 for the token-conditional control variants, we divide the number of evaluation runs in which the thinking tokens in Table 12 stay below the instructed upper limit by the total number of runs (2/5 without intervention; 5/5 with intervention).
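The power-of-two bucketing described above can be sketched as follows; this is an illustrative helper, not the paper's code:

```python
import math

# Round each trace length up to the next power of two and emit the
# corresponding token instruction for the user prompt (illustrative).
def token_instruction(trace_length: int) -> str:
    bucket = 2 ** math.ceil(math.log2(trace_length))
    return f"Think for up to {bucket} tokens."
```

For example, traces of 1025-2048 tokens all receive the "Think for up to 2048 tokens." instruction, matching the bucket boundaries described above.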
Step-conditional control Token instructions fail as current models cannot count tokens. To accommodate this lack of
capability, we experiment with making the counting more coarse-grained. We partition the reasoning traces into steps and
ask the model to think for a specific number of steps rather than tokens. We split our reasoning traces on double newlines into
steps, which we find act as intuitive separators based on manual inspection of samples. We bucket our training samples into
powers of 2 depending on their number of steps and add a corresponding step instruction following the format in Figure 10
(right). This format is based on early experiments, where we found the model to be more likely to adhere to the step limit
Table 13. Scaling thinking time via step-conditional control. All metrics are averaged over the 30 samples in AIME24. Token counts
ignore the thinking and step delimiters.
| Steps instructed (→) | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|
| No intervention at test-time | | | | | |
| Steps used | 123 | 90 | 80 | 82 | 136 |
| Tokens per step | 60 | 70 | 69 | 66 | 56 |
| Thinking tokens | 7252 | 6277 | 5396 | 5552 | 7551 |
| Answer tokens | 665 | 653 | 735 | 777 | 754 |
| AIME24 | 33.3 | 23.3 | 33.3 | 36.7 | 33.3 |
| Forcing end of thinking when 0 steps are reached | | | | | |
| Steps used | 16 | 32 | 59 | 78 | 136 |
| Tokens per step | 96 | 94 | 80 | 70 | 56 |
| Thinking tokens | 1517 | 2963 | 4636 | 5409 | 7551 |
| Answer tokens | 1111 | 788 | 799 | 794 | 754 |
| AIME24 | 23.3 | 23.3 | 33.3 | 36.7 | 33.3 |
Table 14. Scaling thinking time via class-conditional control. We report “accuracy / average thinking tokens per sample”; the higher
the accuracy and the fewer the thinking tokens (inference cost) the better.
| Prompt appended to the question after two newlines | AIME24 | MATH500 | GPQA |
|---|---|---|---|
| Answer after a short amount of thinking. Do not spend excessive time double-checking your work. | 30.0% / 8033 | 90.4% / 2537 | 56.6% / 4177 |
| Answer after a long amount of thinking. If you feel like you are finished early, spend the extra time trying to double-check your work until you are absolutely sure that you have the correct answer. | 36.7% / 9651 | 91.4% / 3875 | 51.0% / 4827 |
| Without generic prompt appending | 50.0% / 6109 | 93.0% / 3298 | 57.6% / 3510 |
when counting down (“3 steps left...2 steps left”) rather than counting up (“Step 2...Step 3...”). This is likely because, when
counting down, the final step is always 1, which acts as a strong prior for the model to finish its generation. When counting up,
the final step before the answer varies, so if the model does not remember the original step instruction, it may fail to stop.
We conclude the following from our results in Table 13: (1) The model still struggles to adhere to the step limit. The model
sometimes simply continues counting into negative steps, e.g. “-1 steps left”. To solve this issue, we automatically stop the
thinking process once 0 steps are reached and then force the model to transition to answering mode by appending the answer
token delimiter (§3). This leads to perfect step adherence (lower half of Table 13), yet problems remain. (2) The model
compensates for fewer steps by making each step longer. For example, when forced to use up to 16 steps vs 256 steps, the
model generates an average of 96 tokens per step vs 56. Despite this issue, more steps still clearly correlate with more total
thinking tokens in Table 13 and with better performance, yielding a positive scaling slope. (3) Step instructions are more costly than
other methods. The step delimiters require around 6 tokens each, which for e.g. 64 steps adds up to a total of around 380
tokens. Even when ignoring the step delimiters in token counts, as in Table 13, the model still requires 7551 thinking tokens on
average to achieve only 33.3% on AIME24. To compute Control reported in Table 3 for step-conditional control variants,
we first decide that 100 tokens are an upper limit per step and then multiply this number by the steps instructed to arrive
at a proxy total token limit, e.g. 1600 for 16 steps instructed. We then check whether the thinking tokens in Table 13 fit
within the respective limit for each evaluation run (3/5 for without intervention; 5/5 for with intervention). For the model in
Figure 7, we use a model with step-conditional control trained on an earlier version of our data and using an earlier version
of our evaluation codebase.
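The step formatting described above (splitting on double newlines and counting down with the delimiter of Figure 10, right) can be sketched as follows; this is an illustrative helper, not the paper's code:

```python
import math

# Split the trace on double newlines and prefix each step with a countdown
# delimiter, bucketing the step budget into a power of two (illustrative).
def format_steps(trace: str, budget: int = None) -> str:
    steps = [s for s in trace.split("\n\n") if s.strip()]
    n = budget or 2 ** math.ceil(math.log2(len(steps)))  # power-of-two bucket
    return "\n".join(
        f"<|im_start|>{n - i} steps left\n{step}" for i, step in enumerate(steps)
    )
```

Counting down means the delimiter before the final step is always near 1, which is the stopping signal discussed above.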
Class-conditional control OpenAI exposes test-time compute control to users via a “reasoning_effort” API parameter
with three possible settings: low, medium, and high.3 The OpenAI documentation also states that “Reducing reasoning
effort can result in faster responses and fewer tokens used on reasoning in a response,” suggesting that they are unable to
control test-time compute with guarantees. Thus, OpenAI may simply adjust the prompt or system instruction depending
on the reasoning effort desired. In Table 14, we show that separate prompts for short and long thinking allow us to control
thinking time to some extent: Prompting the model to think for longer leads to longer thinking. However, it does not reliably
improve performance, and control is not precise. This degree of adherence may suffice when there are only three
classes, but it might not scale to finer-grained control. To compute Control reported in Table 3 for this method, we assume
that prompting the model to think for a short time in Table 14 should produce fewer tokens than the default for AIME24,
while the long prompt should produce more. As 8033 > 6109 and 9651 > 6109, one out of two follows our expected control
thus Control is 50%.
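The 50% Control value can be reproduced from the Table 14 numbers (an illustrative computation):

```python
# Fraction of runs whose thinking-token count moves in the expected
# direction relative to the no-prompt default (6109 tokens on AIME24).
default_tokens = 6109
runs = {"short prompt": 8033, "long prompt": 9651}
should_be_shorter = {"short prompt": True, "long prompt": False}

followed = sum(
    (tokens < default_tokens) == should_be_shorter[name]
    for name, tokens in runs.items()
)
control = followed / len(runs)
print(control)  # 0.5: only the long prompt moves in the expected direction
```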
³https://github.com/openai/openai-python/blob/44d6210f101abedeb2dd68507fcffcb329df70ea/src/openai/types/chat/completion_create_params.py#L172
Table 15. Answers generated by an earlier version of s1-32B when rejection sampling on AIME24. The black text is the question and
the yellow text is the solution. Based solely on the question we generate the light blue reasoning trace and blue response with s1-32B
using a temperature of 1. We generate until the reasoning trace has less than 4000 tokens for the first and less than 8000 tokens for the
second example. It required on average 199 (1) tries per example until thinking was ≤ 4000 (≤ 8000) tokens.
Let 𝑂(0, 0), 𝐴(1/2, 0), and 𝐵(0, √3/2) be points in the coordinate plane. Let ℱ be the family of segments 𝑃𝑄 of unit length
lying in the first quadrant with 𝑃 on the 𝑥-axis and 𝑄 on the 𝑦-axis. There is a unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵,
that does not belong to any segment from ℱ other than 𝐴𝐵. Then 𝑂𝐶² = 𝑝/𝑞, where 𝑝 and 𝑞 are relatively prime positive
integers. Find 𝑝 + 𝑞.
23
[Correct generation with 3981 thinking tokens and 1055 answer tokens]
The problem asks for a point 𝐶 on the segment 𝐴𝐵 that is not contained in any other segment from the family ℱ, except for
𝐴𝐵 itself.
The points 𝑂, 𝐴, 𝐵 form a triangle in the first quadrant.
𝑂 = (0, 0)
𝐴 = (1/2, 0)
𝐵 = (0, √3/2)
The segment 𝐴𝐵 connects 𝐴(1/2, 0) and 𝐵(0, √3/2).
The equation of the line containing 𝐴𝐵 can be found using the two points:
Slope 𝑚 = (√3/2 − 0)/(0 − 1/2) = −√3
Using the point-slope form 𝑦 − 𝑦1 = 𝑚(𝑥 − 𝑥1) with point 𝐴(1/2, 0):
𝑦 − 0 = −√3(𝑥 − 1/2)
𝑦 = −√3𝑥 + √3/2
The segment 𝐴𝐵 lies in the first quadrant and connects (1/2, 0) to (0, √3/2).
The family ℱ consists of segments 𝑃𝑄 of unit length, where 𝑃 is on the 𝑥-axis and 𝑄 is on the 𝑦-axis. Let 𝑃 = (𝑎, 0) and
𝑄 = (0, 𝑏).
The length of 𝑃𝑄 is √((𝑎 − 0)² + (0 − 𝑏)²) = √(𝑎² + 𝑏²) = 1.
So 𝑎² + 𝑏² = 1.
Since 𝑃 is on the positive 𝑥-axis and 𝑄 is on the positive 𝑦-axis, we have 𝑎 > 0 and 𝑏 > 0.
The segment 𝑃𝑄 connects (𝑎, 0) and (0, 𝑏). The equation of the line containing 𝑃𝑄 is 𝑥/𝑎 + 𝑦/𝑏 = 1.
The region covered by the family ℱ is the area bounded by the lines containing these segments. The endpoints (𝑎, 0) and
(0, 𝑏) vary such that 𝑎² + 𝑏² = 1.
Consider the parametrization 𝑎 = cos 𝜃, 𝑏 = sin 𝜃, where 0 < 𝜃 < 𝜋/2.
The segment 𝑃𝑄 connects (cos 𝜃, 0) and (0, sin 𝜃). The equation of the line is 𝑥/cos 𝜃 + 𝑦/sin 𝜃 = 1.
We are looking for a unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵, that does not belong to any segment from ℱ other than
𝐴𝐵.
Let 𝐶 be a point on 𝐴𝐵. We can parametrize the points on 𝐴𝐵 as (1 − 𝑡)𝐴 + 𝑡𝐵 for 0 ≤ 𝑡 ≤ 1.
𝐶 = (1 − 𝑡)(1/2, 0) + 𝑡(0, √3/2) = (1/2 − 𝑡/2, (√3/2)𝑡).
Since 𝐶 is distinct from 𝐴 and 𝐵, we have 0 < 𝑡 < 1.
Suppose the point 𝐶 lies on some segment 𝑃𝑄 ∈ ℱ, where 𝑃 = (cos 𝜃, 0) and 𝑄 = (0, sin 𝜃).
The point 𝐶 satisfies the equation 𝑥/cos 𝜃 + 𝑦/sin 𝜃 = 1.
The slope of the tangent to the astroid at a point (𝑥, 𝑦) is −(𝑦/𝑥)^(1/3).
The slope of the line 𝐴𝐵 is −√3.
So, at the point of tangency, −(𝑦/𝑥)^(1/3) = −√3.
(𝑦/𝑥)^(1/3) = √3
𝑦/𝑥 = (√3)³ = 3√3
The point 𝐶 lies on the line 𝑦 = −√3𝑥 + √3/2 and satisfies 𝑦/𝑥 = 3√3.
So 𝑦 = 3√3𝑥.
Substitute into the line equation:
3√3𝑥 = −√3𝑥 + √3/2
3𝑥 = −𝑥 + 1/2
4𝑥 = 1/2
𝑥 = 1/8
Then 𝑦 = 3√3(1/8) = 3√3/8.
The point 𝐶 is (1/8, 3√3/8).
We need to check if this point lies on the segment 𝐴𝐵.
The point 𝐶 is of the form (1/2 − 𝑡/2, (√3/2)𝑡).
1/2 − 𝑡/2 = 1/8 ⟹ 4 − 4𝑡 = 1 ⟹ 3 = 4𝑡 ⟹ 𝑡 = 3/4.
(√3/2)𝑡 = (√3/2) ⋅ (3/4) = 3√3/8.
So 𝐶 lies on 𝐴𝐵.
We need to make sure 0 < 𝑡 < 1, which is true for 𝑡 = 3/4.
Now, we need to check if this point 𝐶 does not belong to any other segment from ℱ.
The point 𝐶 lies on the astroid. The segment 𝐴𝐵 is tangent to the astroid at 𝐶.
Any other segment from ℱ that contains 𝐶 would have 𝐶 as an interior point, not a tangent point.
Consider the parametrization of the family ℱ.
A point (𝑥, 𝑦) belongs to some segment in ℱ iff there exists 𝜃 such that 𝑥/cos 𝜃 + 𝑦/sin 𝜃 = 1.
For the point 𝐶 = (1/8, 3√3/8), we have:
(1/8)/cos 𝜃 + (3√3/8)/sin 𝜃 = 1
1/cos 𝜃 + 3√3/sin 𝜃 = 8
sin 𝜃 + 3√3 cos 𝜃 = 8 sin 𝜃 cos 𝜃
sin 𝜃 + 3√3 cos 𝜃 = 4 sin(2𝜃)
We know that 𝐶 lies on 𝐴𝐵, which corresponds to some specific segment in ℱ.
The problem statement assumes the existence and uniqueness of such a point 𝐶.
The unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵, does not belong to any segment from ℱ other than 𝐴𝐵.
This means that if 𝐶 lies on another segment 𝑃′𝑄′ ∈ ℱ, then 𝑃′𝑄′ = 𝐴𝐵.
The point 𝐶 is the point of tangency of 𝐴𝐵 with the astroid.
Now we need to find 𝑂𝐶².
𝑂𝐶² = (1/8)² + (3√3/8)² = 1/64 + 27/64 = 28/64 = 7/16.
Here 𝑝 = 7 and 𝑞 = 16.
𝑝 and 𝑞 are relatively prime.
𝑝 + 𝑞 = 7 + 16 = 23.
Final check of the reasoning:
The family ℱ consists of unit-length segments with endpoints on the axes.
The envelope of this family is the astroid 𝑥^(2/3) + 𝑦^(2/3) = 1.
The segments from ℱ are tangent to the astroid.
The segment 𝐴𝐵 is also a member of the family ℱ.
We are looking for a point 𝐶 on 𝐴𝐵 that lies on the astroid.
This point 𝐶 is the point of tangency of 𝐴𝐵 with the astroid.
If a point on 𝐴𝐵 lies on the astroid, it belongs to the tangent segment at that point. Since 𝐴𝐵 is the tangent segment, this point does not belong to any other segment from ℱ.
Consider the case when the segment 𝐴𝐵 intersects the astroid at two points.
If the segment 𝐴𝐵 is not tangent to the astroid, then there are two points of intersection.
Let the points of intersection be 𝐶₁ and 𝐶₂.
If 𝐶₁ lies on 𝐴𝐵, then the tangent to the astroid at 𝐶₁ is the segment 𝐴𝐵.
If 𝐶₂ lies on 𝐴𝐵, then the tangent to the astroid at 𝐶₂ is the segment 𝐴𝐵.
The slope of 𝐴𝐵 is −√3.
We found the point on the astroid with tangent slope −√3.
The problem statement guarantees the uniqueness of such a point 𝐶.
The point 𝐶 must be interior to 𝐴𝐵, so 0 < 𝑡 < 1.
The parameter 𝑡 = 3/4 is in this range.
The point 𝐶 is distinct from 𝐴 and 𝐵.
If 𝐶 = 𝐴, then 𝑡 = 0 and 𝐶 = (1/2, 0).
If 𝐶 = 𝐵, then 𝑡 = 1 and 𝐶 = (0, √3/2).
The point of tangency 𝐶 is the unique point on 𝐴𝐵 (excluding endpoints) that lies on the astroid.
Solution:
Let the family of segments ℱ be defined by segments 𝑃𝑄 of unit length with 𝑃 on the positive 𝑥-axis and 𝑄 on the positive 𝑦-axis. Let 𝑃 = (𝑎, 0) and 𝑄 = (0, 𝑏). The length condition gives 𝑎² + 𝑏² = 1. The segment 𝑃𝑄 lies on the line 𝑥/𝑎 + 𝑦/𝑏 = 1.
Using the parametrization 𝑎 = cos 𝜃, 𝑏 = sin 𝜃 with 0 < 𝜃 < 𝜋/2, the equation of the line is 𝑥/cos 𝜃 + 𝑦/sin 𝜃 = 1.
The envelope of this family of lines is the astroid 𝑥^(2/3) + 𝑦^(2/3) = 1.
The segments from ℱ are tangent to this astroid.
The segment 𝐴𝐵 connects 𝐴(1/2, 0) and 𝐵(0, √3/2). The equation of the line containing 𝐴𝐵 is 𝑦 = −√3𝑥 + √3/2. This segment is part of the family ℱ.
We are looking for a unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵, that does not belong to any segment from ℱ other than 𝐴𝐵. This means 𝐶 is the point where 𝐴𝐵 is tangent to the astroid.
The slope of the tangent to the astroid at a point (𝑥, 𝑦) is −∛(𝑦/𝑥). The slope of 𝐴𝐵 is −√3.
Setting the slopes equal: −∛(𝑦/𝑥) = −√3 ⟹ 𝑦/𝑥 = 3√3.
The point 𝐶 lies on the line 𝑦 = −√3𝑥 + √3/2 and satisfies 𝑦 = 3√3𝑥.
Substituting, 3√3𝑥 = −√3𝑥 + √3/2 ⟹ 4√3𝑥 = √3/2 ⟹ 𝑥 = 1/8.
Then 𝑦 = 3√3(1/8) = 3√3/8.
The point 𝐶 is (1/8, 3√3/8).
We need to check if 𝐶 lies on 𝐴𝐵. The points on 𝐴𝐵 are (1/2 − (1/2)𝑡, (√3/2)𝑡) for 0 ≤ 𝑡 ≤ 1.
For 𝑥 = 1/8: 1/2 − (1/2)𝑡 = 1/8 ⟹ 𝑡 = 3/4.
For 𝑦 = 3√3/8: (√3/2)𝑡 = 3√3/8 ⟹ 𝑡 = 3/4.
Since 0 < 3/4 < 1, 𝐶 is an interior point of 𝐴𝐵.
The point 𝐶 is the unique point on 𝐴𝐵 (excluding endpoints) that lies on the astroid. The tangent to the astroid at 𝐶 is the segment 𝐴𝐵. Thus, 𝐶 does not belong to any other segment from ℱ.
Finally, we calculate 𝑂𝐶² = (1/8)² + (3√3/8)² = 1/64 + 27/64 = 28/64 = 7/16.
We have 𝑝 = 7 and 𝑞 = 16. These are relatively prime.
𝑝 + 𝑞 = 7 + 16 = 23.
Final Answer: The final answer is 23
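The arithmetic in this correct generation can be replayed numerically. A minimal editorial sketch (not part of the model output) checking the tangency point and the final value of 𝑂𝐶²:

```python
import math

# Tangency point of AB with the astroid, as derived in the trace above.
x, y = 1 / 8, 3 * math.sqrt(3) / 8

# C lies on the astroid x^(2/3) + y^(2/3) = 1.
assert abs(x ** (2 / 3) + y ** (2 / 3) - 1) < 1e-12

# C lies on AB with parameter t = 3/4: (1/2 - t/2, (sqrt(3)/2) * t).
t = 3 / 4
assert abs(1 / 2 - t / 2 - x) < 1e-12
assert abs(math.sqrt(3) / 2 * t - y) < 1e-12

# OC^2 = 1/64 + 27/64 = 7/16, so p + q = 7 + 16 = 23.
oc2 = x ** 2 + y ** 2
assert abs(oc2 - 7 / 16) < 1e-12
```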
[Incorrect generation with 5923 thinking tokens and 754 answer tokens]
The problem asks for a point 𝐶 on the segment 𝐴𝐵 such that 𝐶 is not on any other segment from the family ℱ, except 𝐴𝐵 itself. The family ℱ consists of segments 𝑃𝑄 of unit length, with 𝑃 on the 𝑥-axis and 𝑄 on the 𝑦-axis in the first quadrant.
Let 𝑃 = (𝑥, 0) and 𝑄 = (0, 𝑦), where 𝑥 > 0 and 𝑦 > 0. The length of the segment 𝑃 𝑄 is 1, so the distance between 𝑃 and 𝑄
is 1.
√((𝑥 − 0)² + (0 − 𝑦)²) = 1
𝑥² + 𝑦² = 1²
𝑥² + 𝑦² = 1
The segment 𝑃 𝑄 lies in the first quadrant, so 𝑥 > 0 and 𝑦 > 0.
The equation 𝑥² + 𝑦² = 1 represents the quarter circle in the first quadrant, with endpoints at (1, 0) and (0, 1).
The points on the segment 𝑃𝑄 can be parameterized as ((1 − 𝑡)𝑥 + 𝑡·0, (1 − 𝑡)·0 + 𝑡𝑦) = (𝑥(1 − 𝑡), 𝑦𝑡), where 0 ≤ 𝑡 ≤ 1.
The segment 𝐴𝐵 connects 𝐴(1/2, 0) and 𝐵(0, √3/2).
The equation of the line containing 𝐴𝐵 can be found using the two points:
Slope 𝑚 = (√3/2 − 0)/(0 − 1/2) = (√3/2)/(−1/2) = −√3.
Using point-slope form with point 𝐴(1/2, 0):
𝑦 − 0 = −√3(𝑥 − 1/2)
𝑦 = −√3𝑥 + √3/2
The segment 𝐴𝐵 lies in the first quadrant, between 𝑥 = 0 and 𝑥 = 1/2.
Alternatively, we can parameterize the points on 𝐴𝐵.
Let 𝐶 be a point on 𝐴𝐵. Then 𝐶 = (1 − 𝜆)𝐴 + 𝜆𝐵 for some 0 ≤ 𝜆 ≤ 1.
𝐶 = (1 − 𝜆)(1/2, 0) + 𝜆(0, √3/2) = (1/2 − (1/2)𝜆, (√3/2)𝜆).
We are looking for a unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵, that does not belong to any segment from ℱ other than 𝐴𝐵.
Consider a segment from ℱ, with endpoints 𝑃 = (𝑥, 0) and 𝑄 = (0, 𝑦), where 𝑥² + 𝑦² = 1.
A point on this segment is of the form (𝑥(1 − 𝑡), 𝑦𝑡), where 0 ≤ 𝑡 ≤ 1.
We want to find a point 𝐶 on 𝐴𝐵, 𝐶 ≠ 𝐴, 𝐵, such that if 𝐶 lies on a segment 𝑃𝑄 ∈ ℱ, then 𝑃𝑄 = 𝐴𝐵.
Let 𝐶 = (1/2 − (1/2)𝜆, (√3/2)𝜆), where 0 < 𝜆 < 1.
If 𝐶 lies on a segment 𝑃𝑄 from ℱ, then there exist 𝑥, 𝑦 > 0 with 𝑥² + 𝑦² = 1, and 𝑡 ∈ [0, 1] such that:
1/2 − (1/2)𝜆 = 𝑥(1 − 𝑡)
(√3/2)𝜆 = 𝑦𝑡
From the first equation, 1 − 𝑡 = (1/2 − (1/2)𝜆)/𝑥 = (1 − 𝜆)/(2𝑥).
From the second equation, 𝑡 = ((√3/2)𝜆)/𝑦 = √3𝜆/(2𝑦).
Since 𝑡 ∈ [0, 1], we have 1 − 𝑡 ∈ [0, 1].
Also, 𝑥 > 0, so (1 − 𝜆)/(2𝑥) > 0, which implies 1 − 𝜆 > 0, i.e. 𝜆 < 1. This is consistent with the condition for a point on 𝐴𝐵.
We have 1 − 𝑡 = (1 − 𝜆)/(2𝑥) and 𝑡 = √3𝜆/(2𝑦).
Adding these equations: 1 = (1 − 𝜆)/(2𝑥) + √3𝜆/(2𝑦)
2𝑥𝑦 = (1 − 𝜆)𝑦 + √3𝜆𝑥
We know 𝑥² + 𝑦² = 1.
If the point 𝐶 belongs to the segment 𝐴𝐵 itself, then the endpoints of the segment are 𝐴 and 𝐵.
So, the segment is from (1/2, 0) to (0, √3/2).
Here, 𝑥 = 1/2 and 𝑦 = √3/2.
Check that 𝑥² + 𝑦² = 1: (1/2)² + (√3/2)² = 1/4 + 3/4 = 1. So 𝐴𝐵 ∈ ℱ.
If 𝐶 belongs to another segment 𝑃𝑄 from ℱ, it has endpoints 𝑃 = (𝑥, 0) and 𝑄 = (0, 𝑦) with (𝑥, 𝑦) ≠ (1/2, √3/2).
Consider the geometric interpretation. The family ℱ consists of segments of unit length with endpoints on the axes. The midpoints of these segments lie on the circle (𝑥 − 1/2)² + (𝑦 − 1/2)² = 1/2.
Let the point 𝐶 correspond to some 𝜆.
If 𝐶 lies on 𝑃𝑄, then 𝐶 = (𝑥(1 − 𝑡), 𝑦𝑡).
So, 1/2 − (1/2)𝜆 = 𝑥(1 − 𝑡) and (√3/2)𝜆 = 𝑦𝑡.
We need to find a 𝜆 ∈ (0, 1) such that the only solution to these equations, with 𝑥² + 𝑦² = 1, 𝑥, 𝑦 > 0, is 𝑥 = 1/2 and 𝑦 = √3/2.
From the equations, 1 − 𝑡 = (1 − 𝜆)/(2𝑥) and 𝑡 = √3𝜆/(2𝑦).
Substitute these into 𝑥² + 𝑦² = 1.
Try a different approach. Consider the tangents to the family of segments.
The segments are formed by points (𝑥, 0) and (0, 𝑦) with 𝑥² + 𝑦² = 1.
The line containing the segment 𝑃𝑄 has intercept form: a point (𝑥_𝑐, 𝑦_𝑐) lies on it iff 𝑥_𝑐/𝑥 + 𝑦_𝑐/𝑦 = 1.
Consider the equation of the segment in parametric form: (𝑥(1 − 𝑡), 𝑦𝑡).
If a point 𝐶 lies on this segment, then 𝐶 = (𝑥_𝑐, 𝑦_𝑐), where 𝑥_𝑐 = 𝑥(1 − 𝑡) and 𝑦_𝑐 = 𝑦𝑡.
So, 𝑥_𝑐/𝑥 = 1 − 𝑡 and 𝑦_𝑐/𝑦 = 𝑡.
Indeed 𝑥_𝑐/𝑥 + 𝑦_𝑐/𝑦 = 1 − 𝑡 + 𝑡 = 1.
The segments from ℱ are parts of the lines 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1, where 𝑥₀² + 𝑦₀² = 1, 𝑥₀, 𝑦₀ > 0.
Consider the point 𝐶 = (1/2 − (1/2)𝜆, (√3/2)𝜆).
If 𝐶 lies on a segment from ℱ with endpoints (𝑥, 0) and (0, 𝑦), then:
(1/2 − (1/2)𝜆)/𝑥 + ((√3/2)𝜆)/𝑦 = 1
(1 − 𝜆)/(2𝑥) + √3𝜆/(2𝑦) = 1
We need to find a unique 𝜆 ∈ (0, 1) such that the only solution (𝑥, 𝑦) to this equation, with 𝑥² + 𝑦² = 1, 𝑥, 𝑦 > 0, is (1/2, √3/2).
Substitute 𝑥 = 1/2 and 𝑦 = √3/2 into the equation:
(1 − 𝜆)/(2·(1/2)) + √3𝜆/(2·(√3/2)) = 1
1 − 𝜆 + 𝜆 = 1
1 = 1
This shows that any point on 𝐴𝐵 lies on the segment 𝐴𝐵 itself.
Consider the line containing 𝐴𝐵: 𝑦 = −√3𝑥 + √3/2.
If a point 𝐶 on 𝐴𝐵 also lies on another segment from ℱ with endpoints (𝑥, 0) and (0, 𝑦), then:
𝑥_𝐶/𝑥 + 𝑦_𝐶/𝑦 = 1.
Let the point 𝐶 divide 𝐴𝐵 in the ratio 𝜇 ∶ (1 − 𝜇), where 0 < 𝜇 < 1.
𝐶 = (1 − 𝜇)𝐴 + 𝜇𝐵 = (1 − 𝜇)(1/2, 0) + 𝜇(0, √3/2) = (1/2 − (1/2)𝜇, (√3/2)𝜇).
If 𝐶 lies on a segment with endpoints (𝑥, 0) and (0, 𝑦), then:
(1/2 − (1/2)𝜇)/𝑥 + ((√3/2)𝜇)/𝑦 = 1.
We are looking for a point 𝐶 such that the line passing through 𝐶 of the form 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1, where 𝑥₀² + 𝑦₀² = 1, is unique.
Consider the tangents to the unit circle centered at the origin. The lines of the form 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1 are tangents to the circle 𝑥² + 𝑦² = 𝑟².
Consider the dual problem. The family of segments corresponds to points on the quarter circle 𝑥² + 𝑦² = 1 in the first quadrant.
The lines containing the segments are 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1.
The envelope of these lines is the curve whose tangents are these lines.
The equation is 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1, with the constraint 𝑥₀² + 𝑦₀² = 1.
Let 𝑥₀ = cos 𝜃, 𝑦₀ = sin 𝜃, where 0 < 𝜃 < 𝜋/2.
The equations of the lines are 𝑥 cos 𝜃 + 𝑦 sin 𝜃 = 1.
The envelope of the family of lines 𝑥 cos 𝜃 + 𝑦 sin 𝜃 = 1 is the circle 𝑥² + 𝑦² = 1.
The segments from ℱ are parts of the tangents to the unit circle in the first quadrant.
The point 𝐶 on 𝐴𝐵 that does not belong to any other segment from ℱ means that the tangent to the unit circle passing through 𝐶 is unique.
Consider the point 𝐶 = (1/2 − (1/2)𝜆, (√3/2)𝜆).
The tangents to the unit circle passing through 𝐶 have the form 𝑥𝑥_𝐶 + 𝑦𝑦_𝐶 = 1.
If the tangent is unique, then 𝐶 lies on the unit circle.
(1/2 − (1/2)𝜆)² + ((√3/2)𝜆)² = 1
(1/4 − (1/2)𝜆 + (1/4)𝜆²) + (3/4)𝜆² = 1
1/4 − (1/2)𝜆 + 𝜆² = 1
4𝜆² − 2𝜆 + 1 = 4
4𝜆² − 2𝜆 − 3 = 0
Solve for 𝜆:
𝜆 = (−(−2) ± √((−2)² − 4(4)(−3)))/(2(4)) = (2 ± √(4 + 48))/8 = (2 ± √52)/8 = (2 ± 2√13)/8 = (1 ± √13)/4.
We need 0 < 𝜆 < 1.
√13 ≈ 3.6.
(1 + 3.6)/4 = 4.6/4 > 1.
(1 − 3.6)/4 = −2.6/4 < 0.
Something is wrong. The point 𝐶 is inside the unit circle.
Let’s rephrase the condition. There is a unique point 𝐶 on 𝐴𝐵 such that if 𝐶 lies on a segment 𝑃𝑄 ∈ ℱ, then 𝑃𝑄 = 𝐴𝐵.
Let the point 𝐶 on 𝐴𝐵 be (1/2 − (1/2)𝜆, (√3/2)𝜆).
Consider a segment from ℱ with endpoints (𝑥, 0) and (0, 𝑦), 𝑥² + 𝑦² = 1.
The parametric form of the segment is (𝑥(1 − 𝑡), 𝑦𝑡).
If 𝐶 is on this segment, then for some 𝑡 ∈ [0, 1]:
1/2 − (1/2)𝜆 = 𝑥(1 − 𝑡)
(√3/2)𝜆 = 𝑦𝑡
If the segment is unique, then the values of 𝑥 and 𝑦 are unique.
The only solution to (1 − 𝜆)/(2𝑥) + √3𝜆/(2𝑦) = 1, with 𝑥² + 𝑦² = 1, 𝑥, 𝑦 > 0, should be 𝑥 = 1/2, 𝑦 = √3/2.
Consider the intersection of the line containing 𝐴𝐵 with the boundary of the region formed by the segments in ℱ.
The boundaries are the 𝑥-axis, the 𝑦-axis, and the quarter circle 𝑥² + 𝑦² = 1.
Consider the dual problem again. The family of lines 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1, where 𝑥₀² + 𝑦₀² = 1, 𝑥₀, 𝑦₀ > 0.
The envelope of these lines is the circle 𝑥² + 𝑦² = 1.
A point belongs to a unique segment from ℱ if it lies on the circle 𝑥² + 𝑦² = 1.
The point 𝐶 on 𝐴𝐵 satisfies 𝐶 = 𝑡𝐴 + (1 − 𝑡)𝐵.
Let’s consider the case where the segment is 𝐴𝐵. Here 𝑥 = 1/2, 𝑦 = √3/2.
The point 𝐶 can be represented as ((1/2)(1 − 𝑠), (√3/2)𝑠) for 0 ≤ 𝑠 ≤ 1.
We are looking for a point 𝐶 on 𝐴𝐵 that does not lie on any other segment from ℱ.
Consider the geometry. The family ℱ forms a region bounded by the axes and the quarter circle.
The segments are chords of circles centered on the line 𝑦 = 𝑥, tangent to the axes.
Let the endpoints of a segment be (cos 𝜃, 0) and (0, sin 𝜃).
A point on the segment is (cos 𝜃 · (1 − 𝑡), sin 𝜃 · 𝑡).
The point 𝐶 = (1/2 − (1/2)𝜆, (√3/2)𝜆).
If 𝐶 lies on such a segment, then (1/2 − (1/2)𝜆)/cos 𝜃 = 1 − 𝑡 and ((√3/2)𝜆)/sin 𝜃 = 𝑡.
(1 − 𝜆)/(2 cos 𝜃) + √3𝜆/(2 sin 𝜃) = 1.
We need a unique 𝜆 ∈ (0, 1) such that the only solution (cos 𝜃, sin 𝜃) is (1/2, √3/2), which corresponds to 𝜃 = 𝜋/3.
Consider the function 𝑓(𝜃) = 1/(2 cos 𝜃) − 𝜆/(2 cos 𝜃) + √3𝜆/(2 sin 𝜃).
We need 𝑓(𝜃) = 1.
3
If 𝜆 = 12 , then 𝐶 = ( 14 , 4
).
(1 − 1/2)/(2 cos 𝜃) + (√3 · (1/2))/(2 sin 𝜃) = 1
1/(4 cos 𝜃) + √3/(4 sin 𝜃) = 1
1/cos 𝜃 + √3/sin 𝜃 = 4
sin 𝜃 + √3 cos 𝜃 = 4 sin 𝜃 cos 𝜃
sin 𝜃 + √3 cos 𝜃 = 2 sin(2𝜃)
Divide by 2: (1/2) sin 𝜃 + (√3/2) cos 𝜃 = sin(2𝜃)
sin(𝜃 + 𝜋/3) = sin(2𝜃).
𝜃 + 𝜋/3 = 2𝜃 + 2𝜋𝑘, or 𝜃 + 𝜋/3 = 𝜋 − 2𝜃 + 2𝜋𝑘.
𝜃 = 𝜋/3 − 2𝜋𝑘, so 𝜃 = 𝜋/3.
3𝜃 = 𝜋 − 𝜋/3 + 2𝜋𝑘 = 2𝜋/3 + 2𝜋𝑘.
𝜃 = 2𝜋/9 + 2𝜋𝑘/3.
If 𝑘 = 0, 𝜃 = 2𝜋/9.
If 𝑘 = 1, 𝜃 = 2𝜋/9 + 2𝜋/3 = (2𝜋 + 6𝜋)/9 = 8𝜋/9.
If 𝜆 = 1/2, the point is (1/4, √3/4).
This point lies on segments corresponding to 𝜃 = 𝜋/3 and 𝜃 = 2𝜋/9.
The unique point 𝐶 means that for other values of 𝜆, there is more than one solution for (𝑥, 𝑦).
Consider the tangent to the unit circle passing through 𝐶.
If the tangent is unique, 𝐶 is on the unit circle.
This leads to 𝜆 = (1 ± √13)/4, neither of which is in (0, 1).
Consider the geometric interpretation of the family ℱ. The segments connect points on the axes such that the distance between the endpoints is 1.
The locus of the centers of these segments is the circle (𝑥 − 1/2)² + (𝑦 − 1/2)² = 1/2, in the first quadrant.
If a point 𝐶 lies on a segment from ℱ, then 𝐶 is on a line segment connecting (𝑥₀, 0) and (0, 𝑦₀) with 𝑥₀² + 𝑦₀² = 1.
The line containing the segment is 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1.
If 𝐶 belongs to a unique segment, then there is a unique line of this form passing through 𝐶.
The point 𝐶 on 𝐴𝐵 is (1/2 − (1/2)𝜆, (√3/2)𝜆).
We need a unique (𝑥₀, 𝑦₀) with 𝑥₀² + 𝑦₀² = 1, 𝑥₀, 𝑦₀ > 0, such that (1/2 − (1/2)𝜆)/𝑥₀ + ((√3/2)𝜆)/𝑦₀ = 1.
Let the point 𝐶 be such that the line passing through 𝐶 of the form 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1, subject to 𝑥₀² + 𝑦₀² = 1, is unique.
This line is tangent to the unit circle.
The point 𝐶 lies on the intersection of the line containing 𝐴𝐵 and the unit circle.
(1/2 − (1/2)𝜆)² + ((√3/2)𝜆)² = 1.
We found no solution for 𝜆 ∈ (0, 1).
The unique point 𝐶 means that if any other segment passes through 𝐶, it must be 𝐴𝐵.
Consider the slope of the line containing the segment connecting (𝑥, 0) and (0, 𝑦). The slope is −𝑦/𝑥.
In running coordinates (𝑋, 𝑌), the equation of the line is 𝑌 = −(𝑦/𝑥)𝑋 + 𝑦.
If 𝐶 lies on the segment, then (1 − 𝜆)/(2𝑥) + √3𝜆/(2𝑦) = 1.
Consider the geometric mean.
Let the point 𝐶 have coordinates (𝑎, 𝑏). The segment connecting (𝑥, 0) and (0, 𝑦) passes through (𝑎, 𝑏).
𝑎/𝑥 + 𝑏/𝑦 = 1.
If 𝐶 is on 𝐴𝐵, then 𝑏 = −√3𝑎 + √3/2.
𝑎 = 1/2 − (1/2)𝜆, 𝑏 = (√3/2)𝜆.
At the sought point 𝐶, the segment is tangent to the envelope of the family of segments.
Solution:
Let the family of segments be ℱ. The segments have endpoints 𝑃 = (𝑥₀, 0) and 𝑄 = (0, 𝑦₀) with 𝑥₀² + 𝑦₀² = 1 and 𝑥₀, 𝑦₀ > 0.
The points on the segment 𝑃𝑄 satisfy 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1, so the segments lie on the lines 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1, where 𝑥₀² + 𝑦₀² = 1, 𝑥₀, 𝑦₀ > 0.
The segment 𝐴𝐵 has endpoints 𝐴(1/2, 0) and 𝐵(0, √3/2). The equation of the line containing 𝐴𝐵 is 𝑦 = −√3𝑥 + √3/2.
A point 𝐶 on 𝐴𝐵 can be parameterized as 𝐶 = (1/2 − (1/2)𝜆, (√3/2)𝜆), where 0 ≤ 𝜆 ≤ 1.
If 𝐶 belongs to a segment from ℱ with endpoints (𝑥, 0) and (0, 𝑦), then (1/2 − (1/2)𝜆)/𝑥 + ((√3/2)𝜆)/𝑦 = 1.
We are looking for a unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵, that does not belong to any segment from ℱ other than 𝐴𝐵. This means that the only solution (𝑥, 𝑦) to the equation with 𝑥² + 𝑦² = 1, 𝑥, 𝑦 > 0, is (1/2, √3/2).
The segments from ℱ are parts of the tangents to the unit circle. The point 𝐶 must be tangent to the envelope of the family of segments. The envelope is the unit circle itself.
The unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵, lies on the unit circle. We need to solve:
(1/2 − (1/2)𝜆)² + ((√3/2)𝜆)² = 1
𝜆 = 1/2. The point 𝐶 = (1/4, √3/4).
𝑂𝐶² = (1/4)² + (√3/4)² = 1/16 + 3/16 = 4/16 = 1/4.
𝑂𝐶² = 1/4 = 𝑝/𝑞, so 𝑝 = 1, 𝑞 = 4.
𝑝 and 𝑞 are relatively prime.
𝑝 + 𝑞 = 1 + 4 = 5.
Final Answer: The final answer is 5
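For contrast with the correct generation, the flaw here can be checked numerically: the returned point (1/4, √3/4) also lies on a second segment of the family (𝜃 = 2𝜋/9), which the trace itself had found, so it cannot be the sought point 𝐶. A minimal editorial sketch (not part of the model output):

```python
import math

# The incorrect generation's answer: C = (1/4, sqrt(3)/4), i.e. lambda = 1/2.
cx, cy = 1 / 4, math.sqrt(3) / 4

def on_segment(theta, px, py, eps=1e-9):
    """True iff (px, py) lies on the unit segment from (cos(theta), 0) to (0, sin(theta))."""
    # On the segment, x/cos(theta) + y/sin(theta) = 1 with both terms in [0, 1].
    u = px / math.cos(theta)
    v = py / math.sin(theta)
    return abs(u + v - 1) < eps and 0 <= u <= 1 and 0 <= v <= 1

# C lies on AB (theta = pi/3) but also on the theta = 2*pi/9 segment,
# exactly as the trace computed, so it belongs to more than one segment of the family.
assert on_segment(math.pi / 3, cx, cy)
assert on_segment(2 * math.pi / 9, cx, cy)
```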
F. Version Control
V2 → V3 (2025-03):
• Added §A
V1 → V2 (2025-02):