s1: Simple Test-Time Scaling
Niklas Muennighoff * 1 3 4 Zitong Yang * 1 Weijia Shi * 2 3 Xiang Lisa Li * 1 Li Fei-Fei 1 Hannaneh Hajishirzi 2 3
Luke Zettlemoyer 2 Percy Liang 1 Emmanuel Candès 1 Tatsunori Hashimoto 1
*Equal contribution. ZY and NM started the project. WS, NM, and ZY collected the prompts; XL, ZY, and NM built the data pipeline; LZ and WS proposed using a 1K subset; and ZY and NM built budget forcing. 1Stanford University. 2University of Washington, Seattle. 3Allen Institute for AI. 4Contextual AI.

Abstract

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces, relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.

Figure 1. Test-time scaling with s1-32B. We benchmark s1-32B on reasoning-intensive tasks and vary test-time compute. (Axes: accuracy (%) against average thinking time in tokens.)

1. Introduction

Performance improvements of language models (LMs) over the past years have largely relied on scaling up train-time compute using large-scale self-supervised pretraining (Kaplan et al., 2020; Hoffmann et al., 2022). The creation of these powerful models has set the stage for a new scaling paradigm built on top of them: test-time scaling. The aim of this approach is to increase the compute at test time to get better results. There has been much work exploring this idea (Snell et al., 2024; Welleck et al., 2024), and the viability of this paradigm was recently validated by OpenAI o1 (OpenAI, 2024). o1 has demonstrated strong reasoning performance with consistent gains from scaling test-time compute. OpenAI describes their approach as using large-scale reinforcement learning (RL), implying the use of sizable amounts of data (OpenAI, 2024). This has led to various attempts to replicate their models relying on techniques like Monte Carlo Tree Search (Gao et al., 2024b; Zhang et al., 2024a), multi-agent approaches (Qin et al., 2024), and others (Wang et al., 2024a; Huang et al., 2024b; 2025). Among these approaches, DeepSeek R1 (DeepSeek-AI et al., 2025) has successfully replicated o1-level performance, also employing reinforcement learning via millions of samples and multiple training stages. However, despite the large number of o1 replication attempts, none have openly replicated a clear test-time scaling behavior. Thus, we ask: what is the simplest approach to achieve both test-time scaling and strong reasoning performance?

We show that training on only 1,000 samples with next-token prediction and controlling thinking duration via a simple test-time technique we refer to as budget forcing leads to a strong reasoning model that scales in performance with more test-time compute. Specifically, we construct s1K, which consists of 1,000 carefully curated questions paired with reasoning traces and answers distilled from Gemini Thinking Experimental (Google, 2024). We perform supervised fine-tuning (SFT) of an off-the-shelf pretrained model
on our small dataset, requiring just 26 minutes of training on 16 H100 GPUs. After training, we control the amount of test-time compute our model spends using budget forcing: (I) If the model generates more thinking tokens than a desired limit, we forcefully end the thinking process by appending an end-of-thinking token delimiter. Ending the thinking this way makes the model transition to generating its answer. (II) If we want the model to spend more test-time compute on a problem, we suppress the generation of the end-of-thinking token delimiter and instead append "Wait" to the model's current reasoning trace to encourage more exploration. Equipped with this simple recipe – SFT on 1,000 samples and test-time budget forcing – our model s1-32B exhibits test-time scaling (Figure 1). Further, s1-32B is the most sample-efficient reasoning model and outperforms closed-source models like OpenAI's o1-preview (Figure 2).

We conduct extensive ablation experiments targeting (a) our selection of 1,000 (1K) reasoning samples and (b) our test-time scaling. For (a), we find that jointly incorporating difficulty, diversity, and quality measures into our selection algorithm is important. Random selection, selecting samples with the longest reasoning traces, or selecting only maximally diverse samples all lead to significantly worse performance (around −30% on AIME24 on average). Training on our full data pool of 59K examples, a superset of s1K, does not offer substantial gains over our 1K selection. This highlights the importance of careful data selection and echoes prior findings for instruction tuning (Zhou et al., 2023). For (b), we define desiderata for test-time scaling methods to compare different approaches. Budget forcing leads to the best scaling, as it has perfect controllability with a clear positive slope, leading to strong performance.

In summary, our contributions are: we develop simple methods for creating a sample-efficient reasoning dataset (§2) and test-time scaling (§3); based on these, we build s1-32B, which is competitive with o1-preview (§4); and we ablate subtleties of data (§5.1) and test-time scaling (§5.2). We end with a discussion to motivate future work on simple reasoning (§6). Our code, model, and data are open-source at https://github.com/simplescaling/s1.

2. Reasoning data curation to create s1K

In this section, we describe our process for creating a large dataset first in §2.1 and then filtering it down to s1K in §2.2.

2.1. Initial collection of 59K samples

We collect an initial 59,029 questions from 16 sources following three guiding principles. Quality: datasets should be high-quality; we always inspect samples and ignore datasets with, e.g., poor formatting. Difficulty: datasets should be challenging and require significant reasoning effort. Diversity: datasets should stem from various fields to cover different reasoning tasks. We collect datasets of two categories:

Curation of existing datasets Our largest source is NuminaMATH (LI et al., 2024) with 30,660 mathematical problems from online websites. We also include historical AIME problems (1983–2021). To enhance diversity, we add OlympicArena (Huang et al., 2024a) with 4,250 questions spanning Astronomy, Biology, Chemistry, Computer Science, Geography, Mathematics, and Physics from various Olympiads. OmniMath (Gao et al., 2024a) adds 4,238 competition-level mathematics problems. We also include 2,385 problems from AGIEval (Zhong et al., 2023), which features questions from standardized tests like the SAT and LSAT, covering English, Law, and Logic. We refer to Table 7 in §C for our other sources.

New datasets in quantitative reasoning To complement these existing datasets, we create two original datasets. s1-prob consists of 182 questions from the probability section of Stanford University's Statistics Department's PhD Qualifying Exams (https://statistics.stanford.edu), accompanied by handwritten solutions that cover difficult proofs. The probability qualifying exam is held yearly and requires professional-level mathematical problem-solving. s1-teasers comprises 23 challenging brain-teasers commonly used in interview questions for quantitative trading positions. Each sample consists of a problem and solution taken from PuzzledQuant (https://www.puzzledquant.com/). We only take examples with the highest difficulty level ("Hard").

For each question, we generate a reasoning trace and solution using the Google Gemini Flash Thinking API (Google, 2024), extracting its reasoning trace and response. This yields 59K triplets of a question, a generated reasoning trace, and a generated solution. Examples from our dataset are in §D.2. We decontaminate all samples against our evaluation questions (MATH500, GPQA Diamond, AIME24; §C.5) using 8-grams and deduplicate the data.

2.2. Final selection of 1K samples

We could directly train on our pool of 59K questions; however, our goal is to find the simplest approach with minimal resources. Thus, we go through three stages of filtering to arrive at a minimal set of 1,000 samples, relying on our three guiding data principles: Quality, Difficulty, and Diversity.

Quality We first remove any questions where we ran into any API errors, reducing our dataset to 54,116 samples. Next, we filter out low-quality examples by checking if they contain any string patterns with formatting issues, such as ASCII art diagrams, non-existent image references, or inconsistent question numbering, reducing our dataset to 51,581 examples.
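The filtering and decontamination steps above can be sketched as follows. This is our own illustration, not the authors' released code: the pattern list is hypothetical (the paper does not enumerate its exact formatting-issue patterns), and questions are assumed to be plain strings.

```python
import re

def ngrams(text, n=8):
    """Set of word-level n-grams in a text (used for decontamination)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

# Hypothetical patterns for the formatting issues described above
# (non-existent image references, broken numbering, etc.).
BAD_PATTERNS = [r"\[image\]", r"figure\s+not\s+shown"]

def passes_quality(question):
    """Reject questions matching any known formatting-issue pattern."""
    return not any(re.search(p, question, re.IGNORECASE) for p in BAD_PATTERNS)

def filter_pool(pool, eval_questions, n=8):
    """Keep quality questions sharing no n-gram with any evaluation question."""
    eval_grams = set().union(*(ngrams(q, n) for q in eval_questions))
    return [q for q in pool
            if passes_quality(q) and ngrams(q, n).isdisjoint(eval_grams)]
```

Questions shorter than n words produce no n-grams and are never flagged as contaminated, which matches the conservative intent of n-gram decontamination.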
[Figure 2 graphic: (left) a treemap of the 1,000 s1K questions by domain, including geometry, number theory, combinatorics, calculus, real functions, measure theory, partial differential equations, statistics, statistical mechanics, quantum theory, electrodynamics, thermodynamics, biology, astronomy, and economics; (right) accuracy (roughly 80–95%) against number of training examples for s1, r1-distill, Bespoke-Stratos, QwQ, o1-preview, and Sky-T1.]
Figure 2. s1K and s1-32B. (left) s1K is a dataset of 1,000 high-quality, diverse, and difficult questions with reasoning traces. (right) s1-32B, a 32B-parameter model finetuned on s1K, is on the sample-efficiency frontier. See Table 1 for details on other models.
From this pool, we identify 384 samples for our final 1,000 samples from datasets that we perceive as high-quality and not in need of further filtering (see §C.4 for details).

Difficulty For difficulty, we use two indicators: model performance and reasoning trace length. We evaluate two models on each question, Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct (Qwen et al., 2024), with correctness assessed by Claude 3.5 Sonnet comparing each attempt against the reference solution (see §C.3 for the grading protocol). We measure the token length of each reasoning trace, using the Qwen2.5 tokenizer, to indicate problem difficulty. This relies on the assumption that more difficult problems require more thinking tokens. Based on the grading, we remove questions that either Qwen2.5-7B-Instruct or Qwen2.5-32B-Instruct can solve correctly and thus may be too easy. By using two models, we reduce the likelihood of an easy sample slipping through our filtering due to a rare mistake of one of the models on an easy question. This brings our total samples down to 24,496, setting the stage for the next round of subsampling based on diversity. While filtering with these two models may be optimized for our setup, as we will also use Qwen2.5-32B-Instruct as our model to finetune, the idea of model-based filtering generalizes to other setups.

Diversity To quantify diversity, we classify questions into domains using Claude 3.5 Sonnet based on the Mathematics Subject Classification (MSC) system (e.g., geometry, combinatorics, etc.) from the American Mathematical Society.1 The taxonomy focuses on topics in mathematics but also includes other sciences such as biology, physics, and economics. To select our final examples from the pool of 24,496 questions, we first choose one domain uniformly at random. Then, we sample one problem from this domain according to a distribution that favors longer reasoning traces (see §C.4 for details), as motivated in Difficulty. We repeat this process until we have 1,000 total samples spanning 50 domains.

In §5.1, we will show that using our three criteria in combination is important, as relying only on quality, diversity, or difficulty in isolation leads to worse datasets. Some distilled generations are incorrect, which we allow in our data as we focus on capturing the reasoning process rather than entirely correct solutions. Our grader (§C.3) deems 53.6% correct in s1K and 63.0% in our follow-up s1K-1.1 (see §A).

3. Test-time scaling

3.1. Method

We classify test-time scaling methods into 1) Sequential, where later computations depend on earlier ones (e.g., a long reasoning trace), and 2) Parallel, where computations run independently (e.g., majority voting) (Snell et al., 2024; Brown et al., 2024). We focus on sequential scaling, as we intuitively believe it should scale better: later computations can build on intermediate results, allowing for deeper reasoning and iterative refinement. We propose new sequential scaling methods and ways to benchmark them.

Budget forcing We propose a simple decoding-time intervention forcing a maximum and/or minimum number of thinking tokens. Specifically, we enforce a maximum token count by simply appending the end-of-thinking token delimiter and optionally "Final Answer:" to early exit [...]

1 https://mathscinet.ams.org/mathscinet/msc/msc2020.html
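As a concrete illustration, the decoding-time intervention described above can be sketched as a wrapper around a token-level decode loop. This is a minimal sketch, not the authors' released implementation: the `next_token` interface and the delimiter and "Wait" strings are placeholders for whatever the underlying model and chat template actually use.

```python
def budget_force(next_token, min_tokens=0, max_tokens=32_000,
                 end_think="<end_of_thinking>", wait="Wait", max_waits=2):
    """Force a minimum/maximum number of thinking tokens at decode time.

    next_token(trace) -> str is a stand-in for one model decoding step.
    """
    trace, waits = [], 0
    while True:
        token = next_token(trace)
        if token == end_think:
            if len(trace) < min_tokens and waits < max_waits:
                # Suppress end-of-thinking; append "Wait" to extend reasoning.
                trace.append(wait)
                waits += 1
                continue
            break  # allow the model to move on to its answer
        trace.append(token)
        if len(trace) >= max_tokens:
            break  # enforce the maximum by ending the thinking phase
    return trace
```

After the loop, one would append the end-of-thinking delimiter (and optionally "Final Answer:") and continue decoding to obtain the answer.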
Baselines We benchmark budget forcing with: (I) Conditional length-control methods, which rely on telling the model in the prompt how long it should generate for. We group them by granularity into (a) Token-conditional control: we specify an upper bound of thinking tokens in the prompt; (b) Step-conditional control: we specify an upper bound of thinking steps, where each step is around 100 tokens; (c) Class-conditional control: we write two generic prompts that tell the model to either think for a short or long amount of time (see §E.1 for details). (II) Rejection sampling, which samples until a generation fits a predetermined compute budget. This oracle captures the posterior over responses conditioned on their length.

3.2. Metrics

We establish a set of desiderata as evaluation metrics to measure test-time scaling across methods. Importantly, we do not only care about the accuracy a method can achieve but also its controllability and test-time scaling slope. For each method we consider, we run a set of evaluations a ∈ 𝒜 varying test-time compute on a fixed benchmark, e.g. AIME24. This produces a piece-wise linear function f with compute as the x-axis, measured in thinking tokens, and accuracy as the y-axis (see Figure 1, where the rightmost dot for AIME24 corresponds to f(7320) = 57%). We measure [...]

4. Results

4.1. Setup

Training We perform supervised finetuning on Qwen2.5-32B-Instruct using s1K to obtain our model s1-32B, using basic hyperparameters outlined in §D. Finetuning took 26 minutes on 16 NVIDIA H100 GPUs with PyTorch FSDP.

Evaluation We select three representative reasoning benchmarks widely used in the field: AIME24 (of America, 2024) has 30 problems that were used in the 2024 American Invitational Mathematics Examination (AIME) held from January 31 – February 1, 2024. AIME tests mathematical problem-solving with arithmetic, algebra, counting, geometry, number theory, probability, and other secondary school math topics. High-scoring high school students in the test are invited to participate in the United States of America Mathematics Olympiad (USAMO). All AIME answers are integers ranging from 000 to 999, inclusive. Some AIME problems rely on figures that we provide to our model using the vector graphics language Asymptote, as it cannot take image inputs. MATH500 (Hendrycks et al., 2021) is a benchmark of competition math problems of varying difficulty. We evaluate on the same 500 samples selected by OpenAI in prior work (Lightman et al., 2023). GPQA Diamond (Rein et al., 2023) consists of 198 PhD-level science questions [...]
(a) Sequential scaling via budget forcing (b) Parallel scaling via majority voting
Figure 4. Sequential and parallel test-time scaling. (a): Budget forcing shows clear scaling trends and extrapolates to some extent. For
the three rightmost dots, we prevent the model from stopping its thinking 2/4/6 times, each time appending “Wait” to its current reasoning
trace. (b): For Qwen2.5-32B-Instruct we perform 64 evaluations for each sample with a temperature of 1 and visualize the performance
when majority voting across 2, 4, 8, 16, 32, and 64 of these.
4.2. Performance

Test-time scaling Figure 1 shows that the performance of s1-32B with budget forcing scales with more test-time compute. In Figure 4 (left), we expand the plot from Figure 1 (middle), showing that while we can improve AIME24 performance using our budget forcing technique (§3) and more test-time compute, performance does eventually flatten out, at six suppressions of the end-of-thinking delimiter. Suppressing the end-of-thinking token delimiter too often can lead the model into repetitive loops instead of continued reasoning. In Figure 4 (right), we show that after training Qwen2.5-32B-Instruct on our 1,000 samples to produce s1-32B and equipping it with the simple budget forcing technique, it operates in a different scaling paradigm. Scaling test-time compute on the base model via majority voting cannot catch up with the performance of s1-32B, which validates our intuition from §3 that sequential scaling is more effective than parallel. We provide example generations of s1-32B in Figure 5.

Sample-efficiency In Figure 2 (right) and Table 1 we compare s1-32B with other models. We find that s1-32B is the most sample-efficient open data reasoning model. It performs significantly better than our base model (Qwen2.5-32B-Instruct) despite training on just 1,000 additional samples. The concurrently released r1-32B shows stronger performance than s1-32B while also only using SFT (DeepSeek-AI et al., 2025). However, it is trained on 800× more reasoning samples. It is an open question whether one can achieve their performance with just 1,000 samples. Finally, our model nearly matches Gemini 2.0 Thinking on AIME24. As the data for s1-32B is distilled from Gemini 2.0, this shows our distillation procedure was likely effective.

Table 2. s1K data ablations. We budget force (BF) a maximum of around 30,000 thinking tokens for all scores in this table. This performs slightly better than the scores without BF (Table 1) as it allows the model to finish with a best guess when stuck in an infinite loop. We report 95% paired bootstrap confidence intervals for differences relative to the s1K model using 10,000 bootstrap samples. E.g., the interval [-13%, 20%] means that, with 95% confidence, the true difference between 59K-full and s1K is between -13% and +20%. If the entire interval is negative, e.g. [-27%, -3%], we can confidently say that the performance is worse than s1K.

Model        AIME 2024           MATH 500         GPQA Diamond
1K-random    36.7                90.6             52.0
             [-26.7%, -3.3%]     [-4.8%, 0.0%]    [-12.6%, 2.5%]
1K-diverse   26.7                91.2             54.6
             [-40.0%, -10.0%]    [-4.0%, 0.2%]    [-10.1%, 5.1%]
1K-longest   33.3                90.4             59.6
             [-36.7%, 0.0%]      [-5.0%, -0.2%]   [-5.1%, 10.1%]
59K-full     53.3                92.8             58.1
             [-13.3%, 20.0%]     [-2.6%, 2.2%]    [-6.6%, 8.6%]
s1K          50.0                93.0             57.6

Finally, we compare with just training on all of our 59K samples, a superset of all the 1K-sample versions. This leads to a strong model but uses far more resources. To finetune on 59K samples, we use 394 H100 GPU hours, while s1-32B only required 7 H100 GPU hours. Moreover, relying only on s1K is extremely competitive as shown in §2. Overall, combining all three criteria – Quality, Difficulty, Diversity – via our methodology in §2 is key for sample-efficient reasoning training.
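The paired bootstrap intervals reported in Table 2 can be computed along the following lines. This is a sketch under the assumption that per-question 0/1 correctness for the two models is aligned by question; it is not the authors' evaluation code.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap CI for the mean accuracy difference (model A - model B).

    scores_a, scores_b: per-question correctness (1/0) on the same questions.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        # Resample questions with replacement; keep the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(n_boot * alpha / 2)]
    hi = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

If the resulting interval lies entirely below zero, model A is confidently worse than model B, mirroring the reading of Table 2.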
[Figure 5 graphic, three example generations (prompt, reasoning trace, answer):
Left, AIME24: Alice and Bob alternately remove 1 or 4 tokens from a stack of n tokens, Alice going first; whoever removes the last token wins. The model analyzes winning and losing positions, finds that the losing positions are n = 5m or n = 5m + 2, and counts 404 + 405 = 809 positive integers n ≤ 2024 for which Bob can guarantee a win.
Middle, MATH500: find the projection of a onto b = (2, 6, 3) given a · b = 8. The model computes ‖b‖² = 49 and proj_b a = (a · b / ‖b‖²) b = (16/49, 48/49, 24/49).
Right, GPQA: a quantum mechanical particle of mass m moves in two dimensions in V(r, θ) = ½kr² + (3/2)kr² cos²(θ). The model rewrites the potential as ½kr²(1 + 3cos²(θ)), separates variables in Cartesian coordinates, and obtains the energy spectrum E = ℏ√(k/m) (2nₓ + n_y + 3/2).]
Figure 5. Example model outputs. We pick a question from AIME24 (left), MATH500 (middle), and GPQA (right), where our model
generates the correct answer. The black text is the prompt, the light blue text is the reasoning trace, and the blue text is the answer of
s1-32B. The gray ellipsis [...] indicates that the text was trimmed to fit this page, but the generated text is actually longer.
Budget forcing In Table 3 we compare the test-time scaling methods we have introduced in §3. Overall, we find that budget forcing provides perfect control, good scaling, and leads to our best AIME24 score. Thus, this is the method we use for s1-32B in Figure 1 and in §4. In Table 4, we compare different strings for extrapolating performance. We find that "Wait" generally gives the best performance.

Class-conditional control We provide benchmark scores for this method in §E.1 and summarize three findings here: (1) Token-conditional control fails without budget forcing, as our model cannot reliably count tokens, even when trained to do so. (2) Under step-conditional control, the model generates a similar total number of tokens when given different step targets, as the model goes from few steps with many tokens per step to many steps with few tokens in each step. Thus, the model learns to hack its way around the compute constraint, making the controllability of this method mediocre. (3) Class-conditional control can work: telling a model to simply think longer can increase its test-time compute and performance, which leads to good scaling in Table 3.

Table 4. Budget forcing extrapolation ablations. We compare ignoring the end-of-thinking delimiter twice and appending none or various strings.

Model                AIME 2024   MATH 500   GPQA Diamond
No extrapolation     50.0        93.0       57.6
2x without string    50.0        90.2       55.1
2x "Alternatively"   50.0        92.2       59.6
2x "Hmm"             50.0        93.0       59.6
2x "Wait"            53.3        93.0       59.6

Rejection sampling Surprisingly, we find that simply sampling until the generation fits a specific length leads to an inverse scaling trend, as depicted in Figure 6. In §E.2 we inspect a question which was answered correctly by the model when rejection sampling for the ≤ 4000 token setting, but not for the ≤ 8000 token setting. In the ≤ 4000 setting the model directly jumps to the correct approach, while in the ≤ 8000 setting it backtracks a lot. We hypothesize that there is a correlation such that shorter generations tend to be the ones where the model was on the right track from the start, whereas longer ones tend to be ones where the model made mistakes and thus backtracks or questions itself. This leads to longer samples often being wrong when rejection sampling, and thus the inverse scaling trend.

[Figure 6 graphic: accuracy (%) against average thinking time for rejection sampling, showing the inverse scaling trend.]

6. Discussion and related work

6.1. Sample-efficient reasoning

Models There are a number of concurrent efforts to build models that replicate the performance of o1 (OpenAI, 2024). For example, DeepSeek-r1 and k1.5 (DeepSeek-AI et al., 2025; Team et al., 2025) are built with reinforcement learning methods, while others rely on SFT using tens of thousands of distilled examples (Team, 2025; Xu et al., 2025; Labs, 2025). We show that SFT on only 1,000 examples suffices to build a competitive reasoning model matching o1-preview and produces a model that lies on the Pareto frontier (Figure 2). Further, we introduce budget forcing, which, combined with our reasoning model, leads to the first reproduction of OpenAI's test-time scaling curves (OpenAI, 2024). Why does supervised finetuning on just 1,000 samples lead to such performance gains? We hypothesize that the model is already exposed to large amounts of reasoning data during pretraining, which spans trillions of tokens. Thus, the ability to perform reasoning is already present in our model. Our sample-efficient finetuning stage just activates it, and we scale it further at test time with budget forcing. This is similar to the "Superficial Alignment Hypothesis" presented in LIMA (Zhou et al., 2023), where the authors find that 1,000 examples can be sufficient to align a model to adhere to user preferences.

Benchmarks and methods To evaluate and push the limits [...]
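Returning to the rejection-sampling baseline from §3 and Figure 6: sampling until a generation fits the token budget can be sketched as follows, with `generate` a stand-in for sampling one full reasoning trace from the model.

```python
def rejection_sample(generate, max_tokens, max_tries=1000):
    """Resample complete generations until one fits the token budget.

    Returns the first generation of length <= max_tokens, approximating
    the posterior over responses conditioned on their length; returns
    None if no sample fits within max_tries.
    """
    for _ in range(max_tries):
        tokens = generate()
        if len(tokens) <= max_tokens:
            return tokens
    return None
```

Note that each rejected sample still costs a full generation, so this oracle is far more expensive than it may look from the accepted trace alone.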
Accuracy (%)
improve reasoning abilities, including methods like Chain- 50
of-Thought prompting (Wei et al., 2023; Yao et al., 2023a;b; REBASE
Bi et al., 2023; Fu et al., 2023; Zhang et al., 2024b; Xi-
40 Majority
ang et al., 2025; Hu et al., 2024; Diao et al., 2024). These Sequential
combined efforts aim to advance the reasoning ability of
30 scaling
language models, enabling them to handle more complex
and abstract tasks effectively.
2048 8192 32768 131072
Average thinking time (tokens)
6.2. Test-time scaling Figure 7. Scaling further with parallel scaling methods. All met-
Methods As we introduce in §3, we differentiate two meth- rics averaged over the 30 questions in AIME24. Average thinking
ods to scale test-time compute: parallel and sequential. The tokens for REBASE do not account for the additional compute from
former relies on multiple solution attempts generated in par- the reward model. For sequential scaling, we prompt the model
allel and selecting the best outcome via specific criteria. These criteria include choosing the most frequent response for majority voting, or the best response based on an external reward for Best-of-N (Brown et al., 2024; Irvine et al., 2023; Levi, 2024). Unlike repeated sampling, sequential scaling methods let the model generate solution attempts one after another, allowing it to refine each attempt based on previous outcomes (Snell et al., 2024; Hou et al., 2025; Lee et al., 2025). Tree-based search methods (Gandhi et al., 2024; Wu et al., 2024b) offer a hybrid between sequential and parallel scaling; examples include Monte-Carlo Tree Search (MCTS) (Liu et al., 2024; Zhang et al., 2023; Zhou et al., 2024; Choi et al., 2023) and guided beam search (Xie et al., 2023). REBASE (Wu et al., 2024b) employs a process reward model to balance exploitation and pruning during tree search; empirically, it has been shown to outperform sampling-based methods and MCTS (Wu et al., 2024b). Reward models (Lightman et al., 2023; Wang et al., 2024b;c) play a key role in these methods. They come in two variants: outcome reward models (Xin et al., 2024; Ankner et al., 2024), which assign a score to complete solutions and are particularly useful for Best-of-N selection, and process reward models (Lightman et al., 2023; Wang et al., 2024b; Wu et al., 2024b), which assess individual reasoning steps and are effective for guiding tree-based search.

Limits to further test-time scaling  We have shown in §4 that budget forcing allows extrapolating test-time compute, e.g., improving AIME24 performance from 50% to 57%. However, it has two key limitations when scaling further: it eventually flattens out (Figure 4), and the context window of the underlying language model constrains it. Despite these constraints, our work shows test-time scaling across a wide range of accuracies (Figure 1), partly because scaling down test-time compute behaves predictably and does not suffer from these constraints.

Continuing test-time scaling will require approaches that can further extrapolate test-time compute. How can we get such extrapolation? There may be improvements to budget forcing, such as rotating through different strings rather than only "Wait", or combining it with frequency penalties or a higher temperature to avoid repetitive loops. An exciting direction for future work is researching whether applying budget forcing to a reasoning model trained with reinforcement learning yields better extrapolation, or whether RL allows for new ways of test-time scaling beyond budget forcing. Our work defines the right metrics (§3.2) – Control, Scaling, and Performance – to enable future research and progress on extrapolating test-time compute.

[Figure 7 caption fragment: models are prompted to use up to (from left to right) 32, 64, 256, and 512 steps; for REBASE and majority voting we generate 16 parallel trajectories to aggregate across.]

Parallel scaling as a solution  Parallel scaling offers one solution to the limits of sequential scaling, so we augment our sequentially scaled model with two methods. (I) Majority voting: after generating k solutions, the final solution is the most frequent one across generations. (II) Tree search via REBASE: we use the REBASE process reward model, which is initialized from LLaMA-34B and further finetuned on a synthetic process reward modeling dataset (Wu et al., 2024b); we then aggregate the solutions generated by REBASE via majority voting. As shown in Figure 7, augmenting our model with REBASE scales better than majority voting, and even better than sequential scaling in this scenario. However, REBASE requires an additional forward pass at each step for the reward model, adding some computational overhead. For sequential scaling, when prompted to use up to 512 steps, the model generates a response that exceeds the context window on 12 of the 30 evaluation questions, leading to a large performance drop. Overall, we find that these parallel scaling methods complement sequential scaling, offering an avenue for scaling test-time compute even further, beyond fixed context windows.
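Of the two parallel methods, majority voting is simple enough to sketch directly. The snippet below is an illustration, not our evaluation code: it assumes final answers have already been extracted from the k generations (REBASE's reward-model scoring is not reproduced here).

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among k parallel samples.

    Ties are broken by first occurrence, mirroring a simple
    "choose the most frequent response" rule.
    """
    counts = Counter(answers)
    best_count = max(counts.values())
    # Preserve generation order among tied answers.
    for a in answers:
        if counts[a] == best_count:
            return a

print(majority_vote(["27", "27", "14", "27", "9"]))  # prints 27
```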
s1: Simple test-time scaling
Acknowledgements

We thank Ryan Marten for generating traces from DeepSeek r1 for s1.1 using Bespoke Curator (Marten et al., 2025). This work was partly conducted using the Stanford Marlowe GPU cluster (Kapfer et al., 2025), made possible by financial support from Stanford University. We thank Alexander M. Rush, Andrew Ilyas, Banghua Zhu, Chenglei Si, Chunting Zhou, John Yang, Ludwig Schmidt, Samy Jelassi, Suhas Kotha, Tengyu Ma, Xuechen Li, Yu Sun, and Yue Zhang for very constructive discussions.

References

Ankner, Z., Paul, M., Cui, B., Chang, J. D., and Ammanabrolu, P. Critique-out-loud reward models, 2024. URL https://arxiv.org/abs/2408.11791.

Arora, D., Singh, H. G., and Mausam. Have llms advanced enough? A challenging problem solving benchmark for large language models, 2023. URL https://arxiv.org/abs/2305.15074.

Azerbayev, Z., Schoelkopf, H., Paster, K., Santos, M. D., McAleer, S., Jiang, A. Q., Deng, J., Biderman, S., and Welleck, S. Llemma: An open language model for mathematics, 2023.

Bi, Z., Zhang, N., Jiang, Y., Deng, S., Zheng, G., and Chen, H. When do program-of-thoughts work for reasoning?, 2023. URL https://arxiv.org/abs/2308.15452.

Biderman, S., Schoelkopf, H., Sutawika, L., Gao, L., Tow, J., Abbasi, B., Aji, A. F., Ammanamanchi, P. S., Black, S., Clive, J., DiPofi, A., Etxaniz, J., Fattori, B., Forde, J. Z., Foster, C., Hsu, J., Jaiswal, M., Lee, W. Y., Li, H., Lovering, C., Muennighoff, N., Pavlick, E., Phang, J., Skowron, A., Tan, S., Tang, X., Wang, K. A., Winata, G. I., Yvon, F., and Zou, A. Lessons from the trenches on reproducible evaluation of language models, 2024.

Brown, B., Juravsky, J., Ehrlich, R., Clark, R., Le, Q. V., Ré, C., and Mirhoseini, A. Large language monkeys: Scaling inference compute with repeated sampling, 2024.

Choi, S., Fang, T., Wang, Z., and Song, Y. Kcts: Knowledge-constrained tree search decoding with token-level hallucination detection, 2023. URL https://arxiv.org/abs/2310.09044.

DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948.

Diao, S., Wang, P., Lin, Y., Pan, R., Liu, X., and Zhang, T. Active prompting with chain-of-thought for large language models, 2024. URL https://arxiv.org/abs/2302.12246.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., Zhang, A., Rodriguez, A., Gregerson, A., et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.

Fu, Y., Peng, H., Sabharwal, A., Clark, P., and Khot, T. Complexity-based prompting for multi-step reasoning, 2023. URL https://arxiv.org/abs/2210.00720.

Gandhi, K., Lee, D., Grand, G., Liu, M., Cheng, W., Sharma, A., and Goodman, N. D. Stream of search (sos): Learning to search in language, 2024. URL https://arxiv.org/abs/2404.03683.

Gao, B., Song, F., Yang, Z., Cai, Z., Miao, Y., Dong, Q., Li, L., Ma, C., Chen, L., Xu, R., Tang, Z., Wang, B., Zan, D., Quan, S., Zhang, G., Sha, L., Zhang, Y., Ren, X., Liu, T., and Chang, B. Omni-math: A universal olympiad level mathematic benchmark for large language models, 2024a. URL https://arxiv.org/abs/2410.07985.

Gao, L., Tow, J., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., McDonell, K., Muennighoff, N., Phang, J., Reynolds, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, September 2021. URL https://doi.org/10.5281/zenodo.5371628.

Gao, Z., Niu, B., He, X., Xu, H., Liu, H., Liu, A., Hu, X., and Wen, L. Interpretable contrastive monte carlo tree search reasoning, 2024b. URL https://arxiv.org/abs/2410.01707.

Glazer, E., Erdil, E., Besiroglu, T., Chicharro, D., Chen, E., Gunning, A., Olsson, C. F., Denain, J.-S., Ho, A., de Oliveira Santos, E., Järviniemi, O., Barnett, M., Sandler, R., Vrzala, M., Sevilla, J., Ren, Q., Pratt, E., Levine, L., Barkley, G., Stewart, N., Grechuk, B., Grechuk, T., Enugandla, S. V., and Wildon, M. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai, 2024. URL https://arxiv.org/abs/2411.04872.

Google. Gemini 2.0 flash thinking mode (gemini-2.0-flash-thinking-exp-1219), December 2024. URL https://cloud.google.com/vertex-ai/generative-ai/docs/thinking-mode.

Groeneveld, D., Beltagy, I., Walsh, P., Bhagia, A., Kinney, R., Tafjord, O., Jha, A. H., Ivison, H., Magnusson, I., Wang, Y., Arora, S., Atkinson, D., Authur, R., Chandu, K. R., Cohan, A., Dumas, J., Elazar, Y., Gu, Y., Hessel, J., Khot, T., Merrill, W., Morrison, J., Muennighoff, N., Naik, A., Nam, C., Peters, M. E., Pyatkin, V., Ravichander, A., Schwenk, D., Shah, S., Smith, W., Strubell, E., Subramani, N., Wortsman, M., Dasigi, P., Lambert, N., Richardson, K., Zettlemoyer, L., Dodge, J., Lo, K., Soldaini, L., Smith, N. A., and Hajishirzi, H. Olmo: Accelerating the science of language models, 2024.

He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., Liu, J., Qi, L., Liu, Z., and Sun, M. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024. URL https://arxiv.org/abs/2402.14008.

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset, 2021. URL https://arxiv.org/abs/2103.03874.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/2203.15556.

Hou, Z., Lv, X., Lu, R., Zhang, J., Li, Y., Yao, Z., Li, J., Tang, J., and Dong, Y. Advancing language model reasoning through reinforcement learning and inference scaling, 2025. URL https://arxiv.org/abs/2501.11651.

Hu, Y., Shi, W., Fu, X., Roth, D., Ostendorf, M., Zettlemoyer, L., Smith, N. A., and Krishna, R. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models, 2024. URL https://arxiv.org/abs/2406.09403.

Huang, Z., Wang, Z., Xia, S., Li, X., Zou, H., Xu, R., Fan, R.-Z., Ye, L., Chern, E., Ye, Y., Zhang, Y., Yang, Y., Wu, T., Wang, B., Sun, S., Xiao, Y., Li, Y., Zhou, F., Chern, S., Qin, Y., Ma, Y., Su, J., Liu, Y., Zheng, Y., Zhang, S., Lin, D., Qiao, Y., and Liu, P. Olympicarena: Benchmarking multi-discipline cognitive reasoning for superintelligent ai, 2024a. URL https://arxiv.org/abs/2406.12753.

Huang, Z., Zou, H., Li, X., Liu, Y., Zheng, Y., Chern, E., Xia, S., Qin, Y., Yuan, W., and Liu, P. O1 replication journey – part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson?, 2024b. URL https://arxiv.org/abs/2411.16489.

Huang, Z., Geng, G., Hua, S., Huang, Z., Zou, H., Zhang, S., Liu, P., and Zhang, X. O1 replication journey – part 3: Inference-time scaling for medical reasoning, 2025. URL https://arxiv.org/abs/2501.06458.

Irvine, R., Boubert, D., Raina, V., Liusie, A., Zhu, Z., Mudupalli, V., Korshuk, A., Liu, Z., Cremer, F., Assassi, V., Beauchamp, C.-C., Lu, X., Rialan, T., and Beauchamp, W. Rewarding chatbots for real-world engagement with millions of users, 2023. URL https://arxiv.org/abs/2303.06135.

Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024. URL https://arxiv.org/abs/2403.07974.

Kapfer, C., Stine, K., Narasimhan, B., Mentzel, C., and Candes, E. Marlowe: Stanford's gpu-based computational instrument, January 2025. URL https://doi.org/10.5281/zenodo.14751899.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361.

Kim, E., Suk, J., Kim, S., Muennighoff, N., Kim, D., and Oh, A. Llm-as-an-interviewer: Beyond static testing through dynamic llm evaluation, 2024. URL https://arxiv.org/abs/2412.10424.

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention, 2023. URL https://arxiv.org/abs/2309.06180.

Labs, B. Bespoke-stratos: The unreasonable effectiveness of reasoning distillation, 2025. URL https://hf.co/bespokelabs/Bespoke-Stratos-32B. Accessed: 2025-01-22.

Lee, K.-H., Fischer, I., Wu, Y.-H., Marwood, D., Baluja, S., Schuurmans, D., and Chen, X. Evolving deeper llm thinking, 2025. URL https://arxiv.org/abs/2501.09891.

Levi, N. A simple model of inference scaling laws, 2024. URL https://arxiv.org/abs/2410.16377.

LI, J., Beeching, E., Tunstall, L., Lipkin, B., Soletskyi, R., Huang, S. C., Rasul, K., Yu, L., Jiang, A., Shen, Z., Qin, Z., Dong, B., Zhou, L., Fleureau, Y., Lample, G., and Polu, S. Numinamath, 2024. URL https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf.

Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let's verify step by step, 2023. URL https://arxiv.org/abs/2305.20050.

Ling, W., Yogatama, D., Dyer, C., and Blunsom, P. Program induction by rationale generation: Learning to solve and explain algebraic word problems, 2017. URL https://arxiv.org/abs/1705.04146.

Liu, J., Cui, L., Liu, H., Huang, D., Wang, Y., and Zhang, Y. Logiqa: A challenge dataset for machine reading comprehension with logical reasoning, 2020. URL https://arxiv.org/abs/2007.08124.

Liu, J., Cohen, A., Pasunuru, R., Choi, Y., Hajishirzi, H., and Celikyilmaz, A. Don't throw away your value model! generating more preferable text with value-guided monte-carlo tree search decoding, 2024. URL https://arxiv.org/abs/2309.15028.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization, 2019.

Luo, H., Sun, Q., Xu, C., Zhao, P., Lou, J., Tao, C., Geng, X., Lin, Q., Chen, S., Tang, Y., and Zhang, D. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct, 2025. URL https://arxiv.org/abs/2308.09583.

Marten, R., Vu, T., Ji, C. C.-J., Sharma, K., Pimpalgaonkar, S., Dimakis, A., and Sathiamoorthy, M. Curator: A tool for synthetic data creation. https://github.com/bespokelabsai/curator, January 2025.

Mathematical Association of America. AIME, February 2024. URL https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions/.

Muennighoff, N., Soldaini, L., Groeneveld, D., Lo, K., Morrison, J., Min, S., Shi, W., Walsh, P., Tafjord, O., Lambert, N., Gu, Y., Arora, S., Bhagia, A., Schwenk, D., Wadden, D., Wettig, A., Hui, B., Dettmers, T., Kiela, D., Farhadi, A., Smith, N. A., Koh, P. W., Singh, A., and Hajishirzi, H. Olmoe: Open mixture-of-experts language models, 2024. URL https://arxiv.org/abs/2409.02060.

OpenAI. Learning to reason with llms, September 2024. URL https://openai.com/index/learning-to-reason-with-llms/.

OpenAI. Openai o3-mini, 2025. URL https://openai.com/index/openai-o3-mini/. Accessed: 2025-02-24.

Phan, L., Gatti, A., Han, Z., Li, N., Hu, J., Zhang, H., Choi, M., Agrawal, A., Chopra, A., et al. Humanity's last exam, 2025. URL https://arxiv.org/abs/2501.14249.

Qin, Y., Li, X., Zou, H., Liu, Y., Xia, S., Huang, Z., Ye, Y., Yuan, W., Liu, H., Li, Y., and Liu, P. O1 replication journey: A strategic progress report – part 1, 2024. URL https://arxiv.org/abs/2410.18982.

Qwen: Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wan, Y., Liu, Y., Cui, Z., Zhang, Z., and Qiu, Z. Qwen2.5 technical report, 2024. URL https://arxiv.org/abs/2412.15115.

Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022.

Shi, Q., Tang, M., Narasimhan, K., and Yao, S. Can language models solve olympiad programming?, 2024. URL https://arxiv.org/abs/2404.10952.

Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314.

Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023.

Su, H., Yen, H., Xia, M., Shi, W., Muennighoff, N., Wang, H.-y., Liu, H., Shi, Q., Siegel, Z. S., Tang, M., Sun, R., Yoon, J., Arik, S. O., Chen, D., and Yu, T. Bright: A realistic and challenging benchmark for reasoning-intensive retrieval, 2024. URL https://arxiv.org/abs/2407.12883.

Sun, L., Han, Y., Zhao, Z., Ma, D., Shen, Z., Chen, B., Chen, L., and Yu, K. Scieval: A multi-level large language model evaluation benchmark for scientific research, 2024. URL https://arxiv.org/abs/2308.13149.

Team, K., Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., et al. Kimi k1.5: Scaling reinforcement learning with llms, 2025. URL https://arxiv.org/abs/2501.12599.

Team, N. Sky-t1: Fully open-source reasoning model with o1-preview performance in $450 budget, 2025. URL https://novasky-ai.github.io/posts/sky-t1. Accessed: 2025-01-09.

Team, Q. Qwq: Reflect deeply on the boundaries of the unknown, November 2024. URL https://qwenlm.github.io/blog/qwq-32b-preview/.

Wang, J., Meng, F., Liang, Y., and Zhou, J. Drt-o1: Optimized deep reasoning translation via long chain-of-thought, 2024a. URL https://arxiv.org/abs/2412.17498.

Wang, P., Li, L., Shao, Z., Xu, R. X., Dai, D., Li, Y., Chen, D., Wu, Y., and Sui, Z. Math-shepherd: Verify and reinforce llms step-by-step without human annotations, 2024b. URL https://arxiv.org/abs/2312.08935.

Wang, S., Liu, Z., Zhong, W., Zhou, M., Wei, Z., Chen, Z., and Duan, N. From lsat: The progress and challenges of complex reasoning, 2021. URL https://arxiv.org/abs/2108.00648.

Wang, Z., Dong, Y., Delalleau, O., Zeng, J., Shen, G., Egert, D., Zhang, J. J., Sreedhar, M. N., and Kuchaiev, O. Helpsteer2: Open-source dataset for training top-performing reward models, 2024c. URL https://arxiv.org/abs/2406.08673.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903.

Welleck, S., Bertsch, A., Finlayson, M., Schoelkopf, H., Xie, A., Neubig, G., Kulikov, I., and Harchaoui, Z. From decoding to meta-generation: Inference-time algorithms for large language models, 2024. URL https://arxiv.org/abs/2406.16838.

Wu, T., Lan, J., Yuan, W., Jiao, J., Weston, J., and Sukhbaatar, S. Thinking llms: General instruction following with thought generation, 2024a. URL https://arxiv.org/abs/2410.10630.

Wu, Y., Sun, Z., Li, S., Welleck, S., and Yang, Y. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models, 2024b. URL https://arxiv.org/abs/2408.00724.

Xiang, V., Snell, C., Gandhi, K., Albalak, A., Singh, A., Blagden, C., Phung, D., Rafailov, R., Lile, N., Mahan, D., Castricato, L., Franken, J.-P., Haber, N., and Finn, C. Towards system 2 reasoning in llms: Learning how to think with meta chain-of-thought, 2025. URL https://arxiv.org/abs/2501.04682.

Xie, Y., Kawaguchi, K., Zhao, Y., Zhao, X., Kan, M.-Y., He, J., and Xie, Q. Self-evaluation guided beam search for reasoning, 2023. URL https://arxiv.org/abs/2305.00633.

Xin, H., Guo, D., Shao, Z., Ren, Z., Zhu, Q., Liu, B., Ruan, C., Li, W., and Liang, X. Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data, 2024. URL https://arxiv.org/abs/2405.14333.

Xu, H., Wu, X., Wang, W., Li, Z., Zheng, D., Chen, B., Hu, Y., Kang, S., Ji, J., Zhang, Y., Guo, Z., Yang, Y., Zhang, M., and Zhang, D. Redstar: Does scaling long-cot data unlock better slow-reasoning systems?, 2025. URL https://arxiv.org/abs/2501.11284.

Yang, Z., Band, N., Li, S., Candès, E., and Hashimoto, T. Synthetic continued pretraining, 2024. URL https://arxiv.org/abs/2409.07431.

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models, 2023a. URL https://arxiv.org/abs/2305.10601.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models, 2023b. URL https://arxiv.org/abs/2210.03629.

Ye, Y., Huang, Z., Xiao, Y., Chern, E., Xia, S., and Liu, P. Limo: Less is more for reasoning, 2025a. URL https://arxiv.org/abs/2502.03387.

Ye, Y., Xiao, Y., Mi, T., and Liu, P. Aime-preview: A rigorous and immediate evaluation framework for advanced mathematical reasoning. https://github.com/GAIR-NLP/AIME-Preview, 2025b. GitHub repository.

Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y., Kwok, J. T., Li, Z., Weller, A., and Liu, W. Metamath: Bootstrap your own mathematical questions for large language models, 2024. URL https://arxiv.org/abs/2309.12284.

Yuan, S., Chen, Z., Xi, Z., Ye, J., Du, Z., and Chen, J. Agent-r: Training language model agents to reflect via iterative self-training, 2025. URL https://arxiv.org/abs/2501.11425.

Zelikman, E., Wu, Y., Mu, J., and Goodman, N. D. Star: Bootstrapping reasoning with reasoning, 2022. URL https://arxiv.org/abs/2203.14465.

Zelikman, E., Harik, G., Shao, Y., Jayasiri, V., Haber, N., and Goodman, N. D. Quiet-star: Language models can teach themselves to think before speaking, 2024. URL https://arxiv.org/abs/2403.09629.

Zhang, H. and Chen, C. Test-time compute scaling laws, 2024. URL https://github.com/hughbzhang/o1_inference_scaling_laws.

Zhang, S., Chen, Z., Shen, Y., Ding, M., Tenenbaum, J. B., and Gan, C. Planning with large language models for code generation, 2023. URL https://arxiv.org/abs/2303.05510.

Zhang, Y., Wu, S., Yang, Y., Shu, J., Xiao, J., Kong, C., and Sang, J. o1-coder: an o1 replication for coding, 2024a. URL https://arxiv.org/abs/2412.00154.

Zhang, Y., Yang, J., Yuan, Y., and Yao, A. C.-C. Cumulative reasoning with large language models, 2024b. URL https://arxiv.org/abs/2308.04371.

Zhong, H., Xiao, C., Tu, C., Zhang, T., Liu, Z., and Sun, M. Jec-qa: A legal-domain question answering dataset, 2019. URL https://arxiv.org/abs/1911.12011.

Zhong, W., Cui, R., Guo, Y., Liang, Y., Lu, S., Wang, Y., Saied, A., Chen, W., and Duan, N. Agieval: A human-centric benchmark for evaluating foundation models, 2023. URL https://arxiv.org/abs/2304.06364.

Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., and Wang, Y.-X. Language agent tree search unifies reasoning acting and planning in language models, 2024. URL https://arxiv.org/abs/2310.04406.

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., Zhang, S., Ghosh, G., Lewis, M., Zettlemoyer, L., and Levy, O. Lima: Less is more for alignment, 2023. URL https://arxiv.org/abs/2305.11206.
Contents
1 Introduction 1
3 Test-time scaling 3
3.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.2 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4 Results 4
4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
5 Ablations 6
5.1 Data Quantity, Diversity, and Difficulty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
5.2 Test-time scaling methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
A s1.1 16
B Evaluation determinism 16
C s1K details 17
C.1 s1K summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
C.2 Dataset composition for full 59K questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
C.3 s1K grading prompt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
C.4 s1K diversity selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
C.5 Decontamination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
D Training details 21
D.1 Training Ablations: Sequence length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
D.2 Training Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
F Version Control 46
A. s1.1
Seven days after the release of s1, we released s1.1. We regenerated traces for our 1,000 samples in s1K using DeepSeek r1 (DeepSeek-AI et al., 2025) to create s1K-1.1, and trained s1.1 with the same training procedure. Other developments since our launch include the release of o3 (OpenAI, 2025), LIMO (Ye et al., 2025a), and AIME 2025; we account for all of them in Table 5. We find that s1.1 performs significantly better than s1. We also tried distilling from Claude 3.7, which led to worse performance than distilling from r1 (not reported).
Table 5. s1-32B is an open and sample-efficient reasoning model. We evaluate s1-32B, Qwen, and Gemini (some entries are unknown (N.A.), see §4). Other results are from the respective reports (Qwen et al., 2024; Team, 2024; OpenAI, 2024; 2025; DeepSeek-AI et al., 2025; Labs, 2025; Team, 2025), except for AIME 2025 (Ye et al., 2025b). # Examples = number of examples used for reasoning finetuning; BF = budget forcing.
Model # Examples MATH500 GPQA AIME 2024 AIME 2025
API only
o3-mini-low N.A. 95.8 70.6 56.3 42.1
o3-mini-medium N.A. 97.3 76.8 75.8 70.4
o3-mini-high N.A. 97.9 79.7 83.8 80.9
Open Weights
QwQ-32B N.A. 90.6 54.5 46.7 32.7
r1 ≫800K 97.3 71.5 79.8 70.0
r1-distill-Llama-70B 800K 94.5 65.2 57.1 56.3
r1-distill-Qwen-14B 800K 93.9 59.1 61.7 48.0
r1-distill-Qwen-32B 800K 94.3 62.1 58.3 49.6
Open Weights and Open Data
LIMO 817 94.8 66.7 56.3 44.6
s1 w/o BF 1K 92.6 56.6 50.0 26.7
s1 with Budget Forcing “Wait” 1x 1K 92.8 59.6 53.3 30.0
s1 with Budget Forcing “Wait” 2x 1K 93.0 59.6 53.3 33.3
s1 with Budget Forcing “Wait” 4x 1K 92.2 58.6 56.7 36.7
s1.1 w/o BF 1K 94.4 60.6 56.7 50.0
s1.1 with Budget Forcing “Wait” 1x 1K 95.4 62.6 56.7 50.0
s1.1 with Budget Forcing “Wait” 2x 1K 95.4 63.6 56.7 50.0
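The "Wait" 1x/2x/4x variants above can be sketched as a decoding wrapper. This is a simplified illustration rather than the actual evaluation code: `generate` is a placeholder for a real decoding call that returns a continuation up to and including the end-of-thinking delimiter, and the delimiter string is taken from the format described in §D.

```python
END_THINK = "<|im_start|>answer"  # end-of-thinking delimiter, per the format in §D

def budget_force(generate, prompt, num_waits):
    """Suppress the end-of-thinking delimiter up to `num_waits` times,
    appending "Wait" instead so the model keeps reasoning.

    `generate(text)` is a stand-in for a real decoding call that returns
    the continuation of `text` up to and including END_THINK (or up to
    another stop condition such as maximum length).
    """
    text = prompt
    waits_used = 0
    while True:
        text += generate(text)
        if waits_used < num_waits and text.endswith(END_THINK):
            # Strip the delimiter and force more thinking.
            text = text[: -len(END_THINK)] + "Wait"
            waits_used += 1
        else:
            return text
```

With `num_waits=0` this reduces to ordinary decoding; larger values trade more thinking tokens for the accuracy gains shown in the table.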
B. Evaluation determinism
We run our evaluations using vLLM (Kwon et al., 2023) as it is faster than the alternatives we tried. However, we find that even when using the same random seeds and greedy sampling, evaluation scores can change significantly across runs. As our model generates long reasoning traces prior to its answer, small numeric changes can snowball into large differences: we encounter many generations that are exactly the same for thousands of tokens and then suddenly differ in one token, eventually ending up with an entirely different answer. To partly counter this issue, we generally run our final evaluations in full precision unless otherwise indicated.
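The snowballing described above starts from ordinary floating-point behavior. As a minimal, self-contained illustration (not our evaluation stack): addition order alone can change a result, and in a long autoregressive generation a single flipped token then conditions everything that follows.

```python
# Floating-point addition is not associative, so different reduction
# orders (e.g., across kernels, batch sizes, or precisions) can
# produce slightly different logits for the same input.
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c   # 1.0
right = a + (b + c)  # 0.0: adding 1.0 to -1e16 is lost to rounding
print(left, right, left == right)
```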
C. s1K details
C.1. s1K summary
Table 6. Summary of our dataset s1K. Token count measured by the Qwen-2.5 tokenizer. We prompt Claude to produce keywords given
several questions from the domain.
Domain #questions Total token count Keywords
Geometry 109 560.2K Area, Triangle, Distance
Number theory 98 522.5K Sequences, Divisibility
Combinatorics 75 384.7K Permutations, Counting
Real functions 43 234.8K Trigonometry, Calculus
Biology 41 120.9K Organic reactions
Complex functions 32 170.2K Complex roots
Quantum theory 32 127.9K Particles, Wave functions
Field theory 28 150.1K Polynomials, Roots
Calculus of variations 28 155.5K Optimization, Control
Difference equations 24 132.5K Recurrence, Recursion
Electromagnetic theory 23 95.8K Optics, Waves, Diffraction
Group theory 22 100.0K Groups, Automorphisms
Linear algebra 22 128.3K Matrices, Determinants
Probability theory 20 114.6K Random walk, Expectation
Algebraic systems 19 109.9K Functional equations
Mechanics 19 103.6K Forces, Motion, Energy
Thermodynamics 19 74.2K Heat engines, Entropy
Differential equations 18 89.6K Substitution, Existence
Computer science 18 34.2K Complexity theory, Algorithms
Numerical analysis 18 76.5K Error analysis, Stability
Calculus 17 96.3K Convergence, Summation
Algebraic structures 17 90.4K Inequalities, Sets
Astronomy 16 37.7K Stellar populations, Orbits
Remaining 27 domains 242 982.2K Domains with ≤ 16 questions
All domains (51) 1000 4.7M s1K
Table 7. Composition of the full 59K questions. Thinking and response lengths are measured in tokens using the Qwen2.5-32B-Instruct tokenizer (Qwen et al., 2024). In addition to excluding our evaluation benchmark AIME24, we also exclude AIME questions from 2022-2023, as we use these 90 questions during the development of s1-32B.

Source  Description  #Samples  Avg. thinking length
NuminaMATH (LI et al., 2024)  Math problems from online websites  30660  4.1K
MATH (Hendrycks et al., 2021)  Math problems from competitions  11999  2.9K
OlympicArena (Huang et al., 2024a)  Astronomy, Biology, Chemistry, Computer Science, Geography, Math, and Physics olympiad questions  4250  3.2K
OmniMath (Gao et al., 2024a)  Math problems from competitions  4238  4.4K
AGIEval (Zhong et al., 2023; Ling et al., 2017; Hendrycks et al., 2021; Liu et al., 2020; Zhong et al., 2019; Wang et al., 2021)  English, Law, Logic and Math problems from the SAT, LSAT and other exams  2385  1.2K
xword  Crossword puzzles  999  0.7K
OlympiadBench (He et al., 2024)  Math and Physics olympiad questions  896  3.9K
AIME (1983-2021)  American Invitational Mathematics Examination  890  4.7K
TheoremQA (Chen et al., 2023)  Computer Science, Finance, Math, and Physics university-level questions relating to theorems  747  2.1K
USACO (Shi et al., 2024)  Code problems from the USA Computing Olympiad  519  3.6K
JEEBench (Arora et al., 2023)  Chemistry, Math, and Physics problems used in the university entrance examination of the Indian Institute of Technology  515  2.9K
GPQA (Rein et al., 2023)  PhD-Level Science Questions  348  2.9K
SciEval (Sun et al., 2024)  Biology, Chemistry, and Physics problems from various sources  227  0.7K
s1-prob  Stanford statistics qualifying exams  182  4.0K
LiveCodeBench (Jain et al., 2024)  Code problems from coding websites (LeetCode, AtCoder, and CodeForces)  151  3.5K
s1-teasers  Math brain-teasers crawled from the Internet  23  4.1K
All 59K questions  Composite of the above datasets with reasoning traces and solutions  59029  3.6K
You are an AI assistant for grading a science problem. The user will provide you with the question itself, an attempt made by
a student and the correct answer to the problem. Your job is to judge whether the attempt is correct by comparing it with the
correct answer. If the expected solution concludes with a number or choice, there should be no ambiguity. If the expected
solution involves going through the entire reasoning process, you should judge the attempt based on whether the reasoning
process is correct with correct answer if helpful.
The user will provide the attempt and the correct answer in the following format:
# Problem
{problem}
## Attempt
{attempt}
## Correct answer
{solution}
Explain your reasoning, and end your response on a new line with only "Yes" or "No" (without quotes).
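In code, the grading flow above amounts to filling the template and reading the final line of the judge's response. The helpers below (`build_grading_prompt`, `parse_verdict`) are an illustrative sketch, not code from the paper:

```python
# Illustrative sketch of the grading flow; the template mirrors the
# format specified in the judge prompt above. Helper names are hypothetical.
GRADER_TEMPLATE = """# Problem
{problem}

## Attempt
{attempt}

## Correct answer
{solution}"""

def build_grading_prompt(problem: str, attempt: str, solution: str) -> str:
    return GRADER_TEMPLATE.format(problem=problem, attempt=attempt, solution=solution)

def parse_verdict(judge_response: str) -> bool:
    # The judge is instructed to end on a new line with only "Yes" or "No".
    return judge_response.strip().splitlines()[-1].strip() == "Yes"
```

`parse_verdict` is deliberately strict: anything other than a final "Yes" line counts as incorrect.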
C.5. Decontamination
We filter all samples by checking for an 8-gram overlap between the selected examples and the evaluation benchmarks:
MATH500, GPQA Diamond, and AIME24. We exclude questions with more than an 8-gram overlap.
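A minimal sketch of such an 8-gram decontamination check follows; the whitespace tokenization, lowercasing, and function names are assumptions, as the paper does not specify the implementation:

```python
# Minimal sketch of an 8-gram decontamination filter (assumptions:
# whitespace tokenization and lowercasing; not the paper's actual code).
def ngrams(text: str, n: int = 8) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(sample: str, benchmark_questions: list, n: int = 8) -> bool:
    # A sample is contaminated if it shares any n-gram with any benchmark question.
    sample_grams = ngrams(sample, n)
    return any(sample_grams & ngrams(q, n) for q in benchmark_questions)
```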
D. Training details
We take a model that has already been pretrained and instruction tuned and further finetune it for reasoning. Specifically, we
use Qwen2.5-32B-Instruct (Qwen et al., 2024), which on math tasks generally matches or outperforms the larger Qwen2.5-
72B-Instruct (Qwen et al., 2024) or other open models (Dubey et al., 2024; Groeneveld et al., 2024; Muennighoff et al.,
2024). We use token delimiters to separate the thinking stage from the answering stage. We enclose the thinking stage
with <|im_start|>think and <|im_start|>answer; both preceded and followed by a newline. Samples from our
dataset are in §D.2. We use basic fine-tuning hyperparameters: we train for 5 epochs with a batch size of 16 for a total of 315
gradient steps. We train in bfloat16 precision with a learning rate of 1𝑒 − 5 warmed up linearly for 5% (16 steps) and then
decayed to 0 over the rest of training (299 steps) following a cosine schedule. We use the AdamW optimizer (Loshchilov &
Hutter, 2019) with 𝛽1 = 0.9, 𝛽2 = 0.95 and weight decay of 1𝑒 − 4. We do not compute loss on questions, only on reasoning
traces and solutions. We ensure the sequence length is large enough to avoid cutting off any samples, a setting we ablate in
§D.1. The training takes just 26 minutes on 16 NVIDIA H100 GPUs.
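The learning-rate schedule described above (linear warmup over 16 steps to 1e-5, then cosine decay to 0 over the remaining 299 of 315 gradient steps) can be sketched as follows; this is an illustrative reimplementation, not the authors' training code:

```python
import math

# Sketch of the schedule described above: peak LR 1e-5, linear warmup for
# 16 steps (5% of 315), then cosine decay to 0 over the remaining steps.
PEAK_LR, WARMUP_STEPS, TOTAL_STEPS = 1e-5, 16, 315

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS  # linear warmup
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))  # cosine decay
```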
Figure 9. Training dynamics of s1-32B on s1K. [Plots omitted: three panels showing Loss, Gradient Norm, and Learning Rate over Training Steps 0-300.]
Table 8. Training sequence length ablation. We report “accuracy / average thinking tokens per sample”; the higher the accuracy and the
fewer the thinking tokens (inference cost) the better.
| | Model A | Model B |
|---|---|---|
| Training sequence length | 4096 | 32768 |
| % training samples cutoff | 74% | 0% |
| AIME24 | 30.0% / 20721 | 50.0% / 6984 |
| MATH500 | 90.0% / 5324 | 91.0% / 3268 |
| GPQA | 52.5% / 6841 | 53.0% / 3568 |
Besides our scaling ablations in §5.2, the main training hyperparameter we ablate is the sequence length used during training.
We find that a shorter training sequence length leads to longer reasoning traces at test time. This is because, when training
with a shorter sequence length, the answer section of the training sample is more commonly cut off. Conversely, when the
training sequence length is longer, more samples appear in their entirety, including the section where the model answers. The
model thus receives more gradient updates in which it learns to generate an answer following its reasoning chain. This in turn
leads to a higher log probability of the answer section at any point during generation, and thus shorter reasoning traces at test time.
Performance-wise, we also find that the model trained with a longer sequence length performs better. Thus we opt for the
longest training sequence length as it leads to better performance and makes inference more efficient by leading to shorter
reasoning traces.
Table 9. s1K sample stemming from s1-prob. The black text is the question and the yellow text is the solution. Based solely on the
question we generate the light blue reasoning trace and blue response with Gemini, which we use for training.
You are given integrable random variables 𝑋, 𝑌0 and 𝑍0 on the same probability space (Ω, ℱ, ℙ), and two 𝜎-algebras
𝒜 ⊂ ℱ, ℬ ⊂ ℱ. For 𝑘 = 1, 2, …, let
𝑌𝑘 ∶= 𝔼[𝑋|𝜎(𝒜, 𝑍0, …, 𝑍𝑘−1)], 𝑍𝑘 ∶= 𝔼[𝑋|𝜎(ℬ, 𝑌0, …, 𝑌𝑘−1)].
(a) Show that there exist integrable random variables 𝑌∞ and 𝑍∞ such that as 𝑛 → ∞ both 𝔼[|𝑌𝑛 − 𝑌∞|] → 0 and
𝔼[|𝑍𝑛 − 𝑍∞|] → 0. (b) Prove that almost surely 𝑌∞ = 𝑍∞.
1. Write ℱ𝑛 = 𝜎(𝒜, 𝑍0, …, 𝑍𝑛−1) and 𝒢𝑛 = 𝜎(ℬ, 𝑌0, …, 𝑌𝑛−1). Note that 𝔼|𝔼[𝑌𝑛|𝒢𝑛] − 𝔼[𝑌∞|𝒢𝑛]| ≤ 𝔼|𝑌𝑛 − 𝑌∞| = 𝑜(1) and by Levy's Upward Theorem, 𝔼[𝑌∞|𝒢𝑛] → 𝔼[𝑌∞|𝒢∞] in 𝐿1,
where 𝒢∞ = 𝜎(∪𝑛≥0 𝒢𝑛) = 𝜎(ℬ, 𝑌0, 𝑌1, …). It is evident that 𝑌∞ ∈ 𝑚𝒢∞ (since it is the almost sure limit of 𝑌𝑛 ∈ 𝑚𝒢𝑛+1)
and hence 𝔼[𝑌∞|𝒢∞] = 𝑌∞ almost surely. Combining the above observations we conclude that 𝔼[𝑌𝑛|𝒢𝑛] → 𝑌∞ in 𝐿1.
Similarly, 𝔼[𝑍𝑛|ℱ𝑛] → 𝑍∞ in 𝐿1. Now invoke Equation (4) to conclude that 𝑌∞ = 𝑍∞ almost surely.
2. Since 𝑋 is integrable, we have by Levy's Upward Theorem, 𝑌∞ = lim𝑛→∞ 𝑌𝑛 = lim𝑛→∞ 𝔼[𝑋|ℱ𝑛] = 𝔼[𝑋|ℱ∞].
Similarly, 𝑍∞ = 𝔼[𝑋|𝒢∞]. On the other hand, in the first proof we observed that 𝑌∞ ∈ 𝑚𝒢∞ and hence 𝑌∞ ∈
𝑚ℱ∞ ∩ 𝑚𝒢∞ = 𝑚(ℱ∞ ∩ 𝒢∞). This yields the following:
𝑌∞ = 𝔼[𝑌∞|ℱ∞ ∩ 𝒢∞] = 𝔼[𝔼[𝑋|ℱ∞]|ℱ∞ ∩ 𝒢∞] = 𝔼[𝑋|ℱ∞ ∩ 𝒢∞].
Similarly, 𝑍∞ = 𝔼[𝑋|ℱ∞ ∩ 𝒢∞]. This proves that 𝑌∞ = 𝑍∞ almost surely.
Let’s analyze the problem. We are given two sequences of random variables 𝑌𝑘 and 𝑍𝑘 defined by conditional expectations.
We need to show that these sequences converge in 𝐿1 and that their limits are equal almost surely.
Part (a): Convergence in 𝐿1.
The sigma-algebras involved are increasing:
ℱ𝑘 = 𝜎(𝒜, 𝑍0, …, 𝑍𝑘−1)
𝒢𝑘 = 𝜎(ℬ, 𝑌0, …, 𝑌𝑘−1)
For the 𝑌𝑘 sequence:
𝑌1 = 𝔼[𝑋|𝜎(𝒜)]
𝑌2 = 𝔼[𝑋|𝜎(𝒜, 𝑍0)]
𝑌3 = 𝔼[𝑋|𝜎(𝒜, 𝑍0, 𝑍1)]
...
The sequence of sigma-algebras 𝜎(𝒜, 𝑍0, …, 𝑍𝑘−1) is increasing. By the Martingale Convergence Theorem (for conditional
expectations), since 𝑋 is integrable, the sequence 𝑌𝑘 converges almost surely and in 𝐿1 to 𝔼[𝑋|𝜎(𝒜, 𝑍0, 𝑍1, …)]. Let
𝑌∞ = 𝔼[𝑋|𝜎(𝒜, 𝑍0, 𝑍1, …)].
Consider 𝔼[𝑋|𝒜, 𝑍0, 𝑍1, …] = 𝑌∞. This means for any 𝐴 ∈ 𝜎(𝒜, 𝑍0, 𝑍1, …), 𝔼[𝑋1𝐴] = 𝔼[𝑌∞1𝐴].
Consider 𝔼[𝑋|ℬ, 𝑌0, 𝑌1, …] = 𝑍∞. This means for any 𝐵 ∈ 𝜎(ℬ, 𝑌0, 𝑌1, …), 𝔼[𝑋1𝐵] = 𝔼[𝑍∞1𝐵].
Let ℋ = 𝜎(𝒜, ℬ).
Consider 𝔼[𝑋|ℋ].
Consider 𝔼[𝑌∞|ℬ, 𝑌0, 𝑌1, …] = 𝔼[𝔼[𝑋|𝒜, 𝑍0, 𝑍1, …]|ℬ, 𝑌0, 𝑌1, …].
Let’s use the fact that 𝑌∞ is 𝜎(𝒜, 𝑍0, 𝑍1, …)-measurable and 𝑍∞ is 𝜎(ℬ, 𝑌0, 𝑌1, …)-measurable.
Consider the smallest sigma-algebra containing both 𝜎(𝒜, 𝑍0, 𝑍1, …) and 𝜎(ℬ, 𝑌0, 𝑌1, …).
This is 𝜎(𝒜 ∪ ℬ ∪ {𝑌𝑖}𝑖≥0 ∪ {𝑍𝑖}𝑖≥0).
Let’s try to show that 𝔼[(𝑌∞ − 𝑍∞)²] = 0.
𝔼[(𝑌∞ − 𝑍∞)²] = 𝔼[𝑌∞²] + 𝔼[𝑍∞²] − 2𝔼[𝑌∞𝑍∞].
Consider the definition of 𝑌𝑘 and 𝑍𝑘.
𝔼[𝑌𝑘+1|𝒜, 𝑍0, …, 𝑍𝑘−1] = 𝑌𝑘.
𝔼[𝑍𝑘+1|ℬ, 𝑌0, …, 𝑌𝑘−1] = 𝑍𝑘.
Consider 𝔼[𝑌∞|ℬ, 𝑌0, 𝑌1, …].
Since 𝑌∞ is measurable with respect to 𝜎(𝒜, 𝑍0, 𝑍1, …), this does not simplify easily.
Let’s assume 𝑌∞ ≠ 𝑍∞ on a set of positive measure.
Consider the case where 𝒜 and ℬ are trivial.
Then 𝑌𝑘 = 𝔼[𝑋|𝜎(𝑍0, …, 𝑍𝑘−1)] and 𝑍𝑘 = 𝔼[𝑋|𝜎(𝑌0, …, 𝑌𝑘−1)].
Let ℋ∞ = 𝜎(𝒜 ∪ ℬ ∪ {𝑌𝑖, 𝑍𝑖}𝑖≥0).
Consider 𝔼[𝑋|ℋ∞].
We have 𝑌∞ = 𝔼[𝑋|𝜎(𝒜, 𝑍0, 𝑍1, …)].
By the properties of conditional expectation, 𝔼[𝑌∞|𝒜] = 𝔼[𝔼[𝑋|𝒜, 𝑍0, 𝑍1, …]|𝒜].
Consider the sigma-algebra generated by all the random variables involved:
ℋ = 𝜎(𝒜 ∪ ℬ ∪ {𝑌𝑖}𝑖≥0 ∪ {𝑍𝑖}𝑖≥0).
Let 𝑀∞ = 𝔼[𝑋|ℱ∞ ∨ 𝒢∞].
Consider the limits. As 𝑛 → ∞, 𝜎(𝒜, 𝑍0, …, 𝑍𝑛−1) ↑ 𝜎(𝒜, 𝑍0, 𝑍1, …).
As 𝑛 → ∞, 𝜎(ℬ, 𝑌0, …, 𝑌𝑛−1) ↑ 𝜎(ℬ, 𝑌0, 𝑌1, …).
Let ℱ∞ = 𝜎(𝒜, 𝑍0, 𝑍1, …) and 𝒢∞ = 𝜎(ℬ, 𝑌0, 𝑌1, …).
We have 𝑌∞ = 𝔼[𝑋|ℱ∞] and 𝑍∞ = 𝔼[𝑋|𝒢∞].
Consider 𝔼[(𝑌∞ − 𝑍∞)²] = 𝔼[((𝑌∞ − 𝑋) + (𝑋 − 𝑍∞))²]
= 𝔼[(𝑌∞ − 𝑋)²] + 𝔼[(𝑋 − 𝑍∞)²] + 2𝔼[(𝑌∞ − 𝑋)(𝑋 − 𝑍∞)]
= 𝔼[(𝑋 − 𝔼[𝑋|ℱ∞])²] + 𝔼[(𝑋 − 𝔼[𝑋|𝒢∞])²] + 2𝔼[𝑌∞𝑋 − 𝑌∞𝑍∞ − 𝑋² + 𝑋𝑍∞]
= 𝔼[𝑋²] − 𝔼[𝑌∞²] + 𝔼[𝑋²] − 𝔼[𝑍∞²] + 2𝔼[𝑌∞𝑋] − 2𝔼[𝑌∞𝑍∞] − 2𝔼[𝑋²] + 2𝔼[𝑋𝑍∞]
= −𝔼[𝑌∞²] − 𝔼[𝑍∞²] + 2𝔼[𝑌∞²] − 2𝔼[𝑌∞𝑍∞] + 2𝔼[𝑍∞²]
Table 10. s1K sample stemming from s1-teasers. The black text is the question and the yellow text is the solution. Based solely on the
question we generate the light blue reasoning trace and blue response with Gemini, which we use for training.
Alice and Bob each have a coin and flip it until they get a heads. If Bob flipped his coin more times than Alice, what is the
expected number of times Alice flipped her coin?
4/3
The problem asks for the expected number of times Alice flipped her coin, given that Bob flipped his coin more times than
Alice.
Let 𝑋 be the number of flips Alice makes until she gets a head.
Let 𝑌 be the number of flips Bob makes until he gets a head.
𝑋 and 𝑌 are independent random variables following a geometric distribution with parameter 𝑝 = 0.5 (probability of getting
a head).
The probability mass function for a geometric distribution is 𝑃(𝐾 = 𝑘) = (1 − 𝑝)^(𝑘−1) 𝑝, for 𝑘 = 1, 2, ….
In this case, 𝑝 = 0.5, so 𝑃(𝐾 = 𝑘) = (0.5)^(𝑘−1)(0.5) = (0.5)^𝑘 = 1/2^𝑘.
We are given the condition that Bob flipped his coin more times than Alice, which means 𝑌 > 𝑋.
We want to find the expected value of 𝑋 given that 𝑌 > 𝑋, i.e., 𝐸[𝑋|𝑌 > 𝑋].
The formula for conditional expectation is 𝐸[𝑋|𝐴] = Σ_{𝑥} 𝑥𝑃(𝑋 = 𝑥|𝐴).
Also, 𝑃(𝑋 = 𝑥|𝐴) = 𝑃(𝑋 = 𝑥 ∩ 𝐴)/𝑃(𝐴).
In our case, 𝐴 is the event 𝑌 > 𝑋.
We need to find 𝑃(𝑌 > 𝑋).
𝑃(𝑌 > 𝑋) = Σ_{𝑥=1}^∞ Σ_{𝑦=𝑥+1}^∞ 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦)
Since 𝑋 and 𝑌 are independent, 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦) = 𝑃(𝑋 = 𝑥)𝑃(𝑌 = 𝑦) = (1/2^𝑥)(1/2^𝑦) = 1/2^(𝑥+𝑦).
𝑃(𝑌 > 𝑋) = Σ_{𝑥=1}^∞ (1/2^𝑥) Σ_{𝑦=𝑥+1}^∞ 1/2^𝑦
The inner sum is a geometric series: Σ_{𝑦=𝑥+1}^∞ (1/2)^𝑦 = (1/2)^(𝑥+1) + (1/2)^(𝑥+2) + …
This is a geometric series with first term 𝑎 = (1/2)^(𝑥+1) and common ratio 𝑟 = 1/2.
The sum is 𝑎/(1 − 𝑟) = (1/2)^(𝑥+1)/(1/2) = (1/2)^𝑥.
So, 𝑃(𝑌 > 𝑋) = Σ_{𝑥=1}^∞ (1/2^𝑥)(1/2)^𝑥 = Σ_{𝑥=1}^∞ (1/4)^𝑥.
This is another geometric series with first term 𝑎 = 1/4 and common ratio 𝑟 = 1/4.
The sum is 𝑎/(1 − 𝑟) = (1/4)/(1 − 1/4) = (1/4)/(3/4) = 1/3.
Now, we need to find 𝑃(𝑋 = 𝑥 ∩ 𝑌 > 𝑋) = 𝑃(𝑋 = 𝑥, 𝑌 > 𝑥).
𝑃(𝑋 = 𝑥, 𝑌 > 𝑥) = 𝑃(𝑋 = 𝑥)𝑃(𝑌 > 𝑥), due to independence.
𝑃(𝑌 > 𝑥) = Σ_{𝑦=𝑥+1}^∞ 𝑃(𝑌 = 𝑦) = Σ_{𝑦=𝑥+1}^∞ 1/2^𝑦 = (1/2)^(𝑥+1) + (1/2)^(𝑥+2) + ⋯ = (1/2)^𝑥.
So, 𝑃(𝑋 = 𝑥, 𝑌 > 𝑥) = (1/2^𝑥)(1/2)^𝑥 = (1/4)^𝑥.
Now, we can find 𝑃(𝑋 = 𝑥|𝑌 > 𝑋) = 𝑃(𝑋 = 𝑥 ∩ 𝑌 > 𝑋)/𝑃(𝑌 > 𝑋) = (1/4)^𝑥/(1/3) = 3(1/4)^𝑥.
Finally, we can calculate the conditional expectation:
𝐸[𝑋|𝑌 > 𝑋] = Σ_{𝑥=1}^∞ 𝑥𝑃(𝑋 = 𝑥|𝑌 > 𝑋) = Σ_{𝑥=1}^∞ 𝑥 ⋅ 3(1/4)^𝑥 = 3 Σ_{𝑥=1}^∞ 𝑥(1/4)^𝑥.
Consider the series 𝑆 = Σ_{𝑥=1}^∞ 𝑥𝑟^𝑥, where |𝑟| < 1.
Σ_{𝑥=1}^∞ 𝑥(1/4)^𝑥 = (1/4)/(1 − 1/4)² = (1/4)/(3/4)² = (1/4)/(9/16) = (1/4) × (16/9) = 4/9.
= ((2 − 𝑝)/(1 − 𝑝)) 𝑝(1 − 𝑝)^(−1) Σ_{𝑥=1}^∞ 𝑥((1 − 𝑝)²)^𝑥 (error in power)
𝑃(𝑌 > 𝑥) = 𝑝(1 − 𝑝)^𝑥/(1 − (1 − 𝑝)) = (1 − 𝑝)^𝑥.
𝐸[𝑋|𝑌 > 𝑋] = (1/𝑃(𝑌 > 𝑋)) Σ_{𝑥=1}^∞ 𝑥𝑃(𝑋 = 𝑥)𝑃(𝑌 > 𝑥)
= (1/𝑃(𝑌 > 𝑋)) Σ_{𝑥=1}^∞ 𝑥(1 − 𝑝)^(𝑥−1) 𝑝(1 − 𝑝)^𝑥
= (𝑝/𝑃(𝑌 > 𝑋)) Σ_{𝑥=1}^∞ 𝑥(1 − 𝑝)^(2𝑥−1)
= (𝑝/((1 − 𝑝)/(2 − 𝑝)))(1 − 𝑝)^(−1) Σ_{𝑥=1}^∞ 𝑥((1 − 𝑝)²)^𝑥
= (𝑝(2 − 𝑝)/(1 − 𝑝)²) Σ_{𝑥=1}^∞ 𝑥((1 − 𝑝)²)^𝑥.
Let 𝑟 = (1 − 𝑝)². Σ_{𝑥=1}^∞ 𝑥𝑟^𝑥 = 𝑟/(1 − 𝑟)².
𝐸[𝑋|𝑌 > 𝑋] = (𝑝(2 − 𝑝)/(1 − 𝑝)²) ⋅ (1 − 𝑝)²/(𝑝²(2 − 𝑝)²) = 1/(𝑝(2 − 𝑝)).
If 𝑝 = 0.5, 𝐸[𝑋|𝑌 > 𝑋] = 1/(0.5 × 1.5) = 1/0.75 = 4/3.
Solution:
Let 𝑋 be the number of flips Alice makes until she gets a head, and 𝑌 be the number of flips Bob makes until he gets a
head. 𝑋 and 𝑌 are independent geometric random variables with parameter 𝑝 = 0.5. The probability mass function is
𝑃 (𝐾 = 𝑘) = (1 − 𝑝)𝑘−1 𝑝.
We are interested in the expected number of times Alice flipped her coin, given that Bob flipped his coin more times than
Alice, i.e., 𝐸[𝑋|𝑌 > 𝑋].
Using the definition of conditional expectation, 𝐸[𝑋|𝑌 > 𝑋] = Σ_{𝑥=1}^∞ 𝑥𝑃(𝑋 = 𝑥|𝑌 > 𝑋).
𝑃(𝑋 = 𝑥|𝑌 > 𝑋) = 𝑃(𝑋 = 𝑥, 𝑌 > 𝑥)/𝑃(𝑌 > 𝑋).
First, calculate 𝑃(𝑌 > 𝑋).
𝑃(𝑌 > 𝑋) = Σ_{𝑥=1}^∞ Σ_{𝑦=𝑥+1}^∞ 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦) = Σ_{𝑥=1}^∞ 𝑃(𝑋 = 𝑥)𝑃(𝑌 > 𝑥).
𝑃(𝑋 = 𝑥) = (0.5)^(𝑥−1)(0.5) = (0.5)^𝑥 = 1/2^𝑥.
𝑃(𝑌 > 𝑥) = Σ_{𝑦=𝑥+1}^∞ (0.5)^(𝑦−1)(0.5) = (0.5)^𝑥.
𝑃(𝑌 > 𝑋) = Σ_{𝑥=1}^∞ (1/2^𝑥)(1/2)^𝑥 = Σ_{𝑥=1}^∞ (1/4)^𝑥 = (1/4)/(1 − 1/4) = (1/4)/(3/4) = 1/3.
Next, calculate 𝑃(𝑋 = 𝑥, 𝑌 > 𝑥) = 𝑃(𝑋 = 𝑥)𝑃(𝑌 > 𝑥) = (1/2^𝑥)(1/2)^𝑥 = (1/4)^𝑥.
Now, find 𝑃(𝑋 = 𝑥|𝑌 > 𝑋) = 𝑃(𝑋 = 𝑥, 𝑌 > 𝑥)/𝑃(𝑌 > 𝑋) = (1/4)^𝑥/(1/3) = 3(1/4)^𝑥.
Finally, calculate the conditional expectation:
𝐸[𝑋|𝑌 > 𝑋] = Σ_{𝑥=1}^∞ 𝑥𝑃(𝑋 = 𝑥|𝑌 > 𝑋) = Σ_{𝑥=1}^∞ 𝑥 ⋅ 3(1/4)^𝑥 = 3 Σ_{𝑥=1}^∞ 𝑥(1/4)^𝑥.
Using the formula Σ_{𝑥=1}^∞ 𝑥𝑟^𝑥 = 𝑟/(1 − 𝑟)² with 𝑟 = 1/4:
Σ_{𝑥=1}^∞ 𝑥(1/4)^𝑥 = (1/4)/(1 − 1/4)² = (1/4)/(9/16) = (1/4) × (16/9) = 4/9.
𝐸[𝑋|𝑌 > 𝑋] = 3 × 4/9 = 12/9 = 4/3.
Final Answer: The final answer is 4∕3
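The 4/3 result can be sanity-checked with a short Monte Carlo simulation; this is an illustrative addition, not part of the original sample:

```python
import random

# Monte Carlo check of E[X | Y > X] = 4/3 for X, Y i.i.d. Geometric(0.5).
random.seed(0)

def flips_until_heads(p: float = 0.5) -> int:
    n = 1
    while random.random() >= p:  # tails with probability 1 - p
        n += 1
    return n

total = count = 0
for _ in range(200_000):
    x, y = flips_until_heads(), flips_until_heads()
    if y > x:  # condition on Bob flipping more times than Alice
        total += x
        count += 1

print(total / count)  # close to 4/3
```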
Table 11. s1K sample stemming from the astronomy subset of OlympicArena. The black text is the question and the blue text is the
solution. Based solely on the question we generate the light blue reasoning trace and blue response with Gemini, which we use for training.
An often-repeated fun fact is that humans produce more power per unit volume than stars. If the sun were the same size, but
it produced the same amount of power per unit volume as a human, what would its surface temperature be? Assume the
"average human" produces 100 watts of power and has a volume of 66400 cubic centimeters.
A: 3500 K
B: 10000 K
C: 25000 K
D: 40000 K
E: 50000 K
Using the numbers from the problem, the average human produces
𝑢 = 100/(66400 × 10⁻⁶) = 1506 W/m³
The volume of the sun is (4/3)𝜋𝑅⊙³, so its new power output would be 𝑃 = (4/3)𝜋𝑅⊙³𝑢. To find the equilibrium temperature, we use
𝑃 = 𝜎𝐴𝑇⁴ = 4𝜋𝜎𝑅⊙²𝑇⁴
Solving for 𝑇, we get
𝑇 = (𝑅⊙𝑢/(3𝜎))^(1/4) = ((6.96 × 10⁸)(1506)/(3 × 5.67 × 10⁻⁸))^(1/4) = 49823 ≈ 50000 K.
The problem asks for the surface temperature of the sun if it had the same size but produced power per unit volume as a
human.
First, let’s find the power density of a human.
Power produced by a human, 𝑃ℎ𝑢𝑚𝑎𝑛 = 100 W
Volume of a human, 𝑉ℎ𝑢𝑚𝑎𝑛 = 66400 cm3
Convert the volume to cubic meters:
1 m = 100 cm
1 m³ = (100 cm)³ = 10⁶ cm³
𝑉ℎ𝑢𝑚𝑎𝑛 = 66400 cm³ = 66400 × 10⁻⁶ m³ = 0.0664 m³
Power density of a human, 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 = 𝑃ℎ𝑢𝑚𝑎𝑛/𝑉ℎ𝑢𝑚𝑎𝑛 = 100 W / 0.0664 m³
𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 = 1506.024 W/m³
Now, consider the sun with the same size as it currently is, but with a power density equal to that of a human.
We need the radius of the sun to calculate its volume. The average radius of the sun is 𝑅𝑠𝑢𝑛 = 6.95 × 108 m.
Volume of the sun, 𝑉𝑠𝑢𝑛 = (4/3)𝜋𝑅𝑠𝑢𝑛³ = (4/3)𝜋(6.95 × 10⁸ m)³
𝑉𝑠𝑢𝑛 = (4/3)𝜋(337.159375 × 10²⁴) m³
𝑉𝑠𝑢𝑛 = 1410.24 × 10²⁴ × 𝜋 m³
𝑉𝑠𝑢𝑛 = 4.4307 × 10²⁷ m³
If the sun had the same power density as a human, the total power produced by this modified sun would be:
𝑃𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑_𝑠𝑢𝑛 = 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 × 𝑉𝑠𝑢𝑛 = 1506.024 W/m³ × 4.4307 × 10²⁷ m³ = 6.6726 × 10³⁰ W
The surface temperature of a star is related to its luminosity (total power output) through the Stefan-Boltzmann law:
𝐿 = 4𝜋𝑅²𝜎𝑇⁴
where 𝐿 is the luminosity (total power output), 𝑅 is the radius of the star, 𝜎 is the Stefan-Boltzmann constant, and 𝑇 is the
surface temperature.
In this case, the luminosity of the modified sun is 𝑃𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛 . The radius of the modified sun is the same as the actual sun,
𝑅𝑠𝑢𝑛 .
The Stefan-Boltzmann constant, 𝜎 = 5.67 × 10⁻⁸ W m⁻² K⁻⁴.
𝑃𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑_𝑠𝑢𝑛 = 4𝜋𝑅𝑠𝑢𝑛²𝜎𝑇⁴
6.6726 × 10³⁰ W = 4𝜋(6.95 × 10⁸ m)² × 5.67 × 10⁻⁸ W m⁻² K⁻⁴ × 𝑇⁴
Calculate the term 4𝜋𝑅𝑠𝑢𝑛²𝜎:
𝑅𝑠𝑢𝑛² = (6.95 × 10⁸)² = 48.3025 × 10¹⁶ m²
4𝜋𝑅𝑠𝑢𝑛² = 4𝜋 × 48.3025 × 10¹⁶ = 607.08 × 10¹⁶ m²
4𝜋𝑅𝑠𝑢𝑛²𝜎 = 607.08 × 10¹⁶ m² × 5.67 × 10⁻⁸ W m⁻² K⁻⁴ = 3441.93 × 10⁸ W K⁻⁴ = 3.44193 × 10¹¹ W K⁻⁴
Now, substitute this back into the equation:
6.6726 × 10³⁰ = 3.44193 × 10¹¹ × 𝑇⁴
Solve for 𝑇⁴:
𝑇⁴ = (6.6726 × 10³⁰)/(3.44193 × 10¹¹) = 1.9386 × 10¹⁹
Solve for 𝑇:
𝑇 = (1.9386 × 10¹⁹)^(1/4) = (193.86 × 10¹⁷)^(1/4)
𝑇 = (1.9386)^(1/4) × (10¹⁹)^(1/4) = (1.9386)^0.25 × 10^4.75
Calculate (1.9386)^0.25:
Let 𝑦 = 1.9386^0.25, then ln 𝑦 = 0.25 ln(1.9386)
ln(1.9386) ≈ 0.6619
ln 𝑦 = 0.25 × 0.6619 = 0.165475
𝑦 = 𝑒^0.165475 ≈ 1.1799
So, 𝑇 = 1.1799 × 10^4.75 = 1.1799 × 10⁴ × 10^0.75
10^0.75 = 10^(3/4) = (10³)^(1/4) = (1000)^(1/4)
We know 5⁴ = 625 and 6⁴ = 1296, so (1000)^(1/4) is between 5 and 6.
Units check: (W/m³ × m)/(W m⁻² K⁻⁴) = (W m⁻²)/(W m⁻² K⁻⁴) = K⁴
The units match for 𝑇⁴.
Numerical calculation check:
Numerator: 1506.02 × 6.95 × 10⁸ = 10466.839 × 10⁸ = 1.0466 × 10¹²
Denominator: 3 × 5.67 × 10⁻⁸ = 17.01 × 10⁻⁸
Ratio: (1.0466 × 10¹²)/(17.01 × 10⁻⁸) = 0.061528 × 10²⁰ = 6.1528 × 10¹⁸
𝑇⁴ = 6.1528 × 10¹⁸
𝑇 = (6.1528 × 10¹⁸)^(1/4) = (6152.8 × 10¹⁵)^(1/4)
𝑇 = (6152.8)^0.25 × 10^3.75
Let’s use logarithms for precision.
log10(6152.8) ≈ 3.789
0.25 × 3.789 = 0.94725
10^0.94725 ≈ 8.856
𝑇 = 8.856 × 10^3.75 = 8.856 × 10³ × 10^0.75 = 8856 × 5.623 = 49806 K
This is very close to 50000 K.
Possible sources of minor discrepancies: rounding errors, precision of constants.
Let’s verify the formula derivation.
Power per unit volume = Total Power / Volume
𝜌𝑝𝑜𝑤𝑒𝑟 = 𝐿/𝑉
𝐿 = 4𝜋𝑅²𝜎𝑇⁴
𝑉 = (4/3)𝜋𝑅³
𝜌𝑝𝑜𝑤𝑒𝑟 = (4𝜋𝑅²𝜎𝑇⁴)/((4/3)𝜋𝑅³) = 3𝜎𝑇⁴/𝑅
If the modified sun has the power density of a human:
𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 = 3𝜎𝑇𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑_𝑠𝑢𝑛⁴/𝑅𝑠𝑢𝑛
𝑇𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑_𝑠𝑢𝑛⁴ = 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛𝑅𝑠𝑢𝑛/(3𝜎)
Solution:
The power produced by an average human is 𝑃ℎ𝑢𝑚𝑎𝑛 = 100 W, and the volume of an average human is 𝑉ℎ𝑢𝑚𝑎𝑛 = 66400 cm³ = 0.0664 m³.
The power density of a human is 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 = 𝑃ℎ𝑢𝑚𝑎𝑛/𝑉ℎ𝑢𝑚𝑎𝑛 = 100 W / 0.0664 m³ ≈ 1506.02 W/m³.
If the sun were the same size but produced the same amount of power per unit volume as a human, its power density would
be 𝜌𝑝𝑜𝑤𝑒𝑟,𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛 = 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 .
The radius of the sun is 𝑅𝑠𝑢𝑛 = 6.95 × 108 m.
The volume of the sun is 𝑉𝑠𝑢𝑛 = (4/3)𝜋𝑅𝑠𝑢𝑛³.
The power density of the modified sun is given by 𝜌𝑝𝑜𝑤𝑒𝑟,𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑_𝑠𝑢𝑛 = 𝑃𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑_𝑠𝑢𝑛/𝑉𝑠𝑢𝑛.
We also know that 𝑃𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑_𝑠𝑢𝑛 = 4𝜋𝑅𝑠𝑢𝑛²𝜎𝑇⁴, where 𝑇 is the surface temperature of the modified sun.
Equating the power density:
𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 = (4𝜋𝑅𝑠𝑢𝑛²𝜎𝑇⁴)/((4/3)𝜋𝑅𝑠𝑢𝑛³) = 3𝜎𝑇⁴/𝑅𝑠𝑢𝑛
Solving for 𝑇⁴:
𝑇⁴ = 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛𝑅𝑠𝑢𝑛/(3𝜎)
Substitute the values:
𝑇⁴ = (1506.02 W/m³ × 6.95 × 10⁸ m)/(3 × 5.67 × 10⁻⁸ W m⁻² K⁻⁴)
𝑇⁴ = (1.0466 × 10¹²)/(1.701 × 10⁻⁷) = 6.1528 × 10¹⁸ K⁴
Solving for 𝑇:
𝑇 = (6.1528 × 10¹⁸)^(1/4) = (6152.8)^0.25 × 10^(18/4) = (6152.8)^0.25 × 10^4.5
Using logarithms:
log10(6152.8) ≈ 3.789
0.25 × 3.789 = 0.94725
(6152.8)^0.25 = 10^0.94725 ≈ 8.856
𝑇 = 8.856 × 10^4.5 = 8.856 × 10⁴ × 10^0.5 = 88560 × √10 ≈ 88560 × 3.16 = 280000 K
Let’s redo the exponent calculation: 18/4 = 4.5.
𝑇 = 8.856 × 10^4.5 = 8.856 × 10⁴ × 10^0.5 - error in previous step.
𝑇 = (6.1528 × 10¹⁸)^0.25 = 49806 K
The surface temperature would be approximately 49806 K, which is closest to 50000 K.
Final Answer: The final answer is 𝐸
Token instructions:
<|im_start|>user
What is the answer to Life, the Universe and Everything? Think for up to 2048 tokens.
<|im_start|>assistant
<|im_start|>think
Let me break down this question into the three parts it is asking for: 1) Life 2) Universe 3) Everything
Let me start with life...
<|im_start|>answer
The answer is...

Step instructions:
<|im_start|>user
What is the answer to Life, the Universe and Everything? Think for up to 64 steps.
<|im_start|>assistant
<|im_start|>64 steps left
Let me break down this question into the three parts it is asking for: 1) Life 2) Universe 3) Everything
<|im_start|>63 steps left
Let me start with life...
<|im_start|>answer
The answer is...

Figure 10. Token and step instruction data formats for controlling test-time compute. We only train our model on the reasoning trace and the answer.
Table 12. Scaling thinking time via token-conditional control. All metrics are averaged over the 30 questions in AIME24.

| Tokens instructed (→) | 1024 | 2048 | 4096 | 8192 | 16384 |
|---|---|---|---|---|---|
| No intervention at test-time | | | | | |
| Thinking tokens | 7939 | 7158 | 8263 | 7108 | 7500 |
| Answer tokens | 689 | 669 | 659 | 722 | 724 |
| AIME24 | 26.7 | 30.0 | 33.3 | 33.3 | 40.0 |
| Forcing end of thinking when token budget is reached | | | | | |
| Thinking tokens | 1024 | 2048 | 4031 | 5664 | 6330 |
| Answer tokens | 15 | 15 | 142 | 722 | 691 |
| AIME24 | 3.3 | 30.0 | 33.3 | 33.3 | 40.0 |
Token-conditional control One general approach is to simply tell a model in the prompt precisely how many tokens it
should generate. Ideally, the model can keep track of its token count and adjust its generation to finish within the desired
limits. We experiment with this approach by training a model with token instructions using the format in Figure 10 (left).
We bucket the lengths of the reasoning traces from our 1,000 training examples into powers of two (rounded upwards) and
add a corresponding instruction to the user prompt. For example, if the instruction says “Think for up to 2048 tokens”,
then the reasoning trace has anywhere between 1024 and 2048 tokens. In Table 12, we show that after training the model
hardly follows the token instruction. It does sometimes generate more tokens when given a higher limit but often overshoots
the limit. This may not be unique to our model, as prior work suggests that OpenAI o1-mini also cannot follow token
instructions (Zhang & Chen, 2024). To prevent exceeding the limit, we test budget forcing the thinking to end once the
limit is reached. This leads to perfect control (Table 12 (lower)). With budget forcing, the scaling trend is also clearer as the
model can no longer overshoot the limit when given a small thinking budget. This leads to better test-time scaling values for
Token Prompting + budget forcing in Table 3. To compute the Control metric reported in Table 3 for the token-conditional control variants, we divide the number of evaluation runs in which the thinking tokens in Table 12 stay below the instructed upper limit by the total number of runs (2/5 without intervention; 5/5 with intervention).
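The power-of-two bucketing described above can be sketched as follows; this is an illustrative helper, not the paper's code:

```python
import math

# Round each trace length up to the next power of two and emit the
# corresponding token instruction for the user prompt (illustrative).
def token_instruction(trace_length: int) -> str:
    bucket = 2 ** math.ceil(math.log2(trace_length))
    return f"Think for up to {bucket} tokens."
```

For example, traces of 1025-2048 tokens all receive the "Think for up to 2048 tokens." instruction, matching the bucket boundaries described above.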
Step-conditional control Token instructions fail as current models cannot count tokens. To accommodate this lack of
capability, we experiment with making the counting more coarse-grained. We partition the reasoning traces into steps and
ask the model to think for a specific number of steps rather than tokens. We split our reasoning traces on double newlines into
steps, which we find act as intuitive separators based on manual inspection of samples. We bucket our training samples into
powers of 2 depending on their number of steps and add a corresponding step instruction following the format in Figure 10
(right). This format is based on early experiments, where we found the model to be more likely to adhere to the step limit
Table 13. Scaling thinking time via step-conditional control. All metrics are averaged over the 30 samples in AIME24. Token counts
ignore the thinking and step delimiters.
| Steps instructed (→) | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|
| No intervention at test-time | | | | | |
| Steps used | 123 | 90 | 80 | 82 | 136 |
| Tokens per step | 60 | 70 | 69 | 66 | 56 |
| Thinking tokens | 7252 | 6277 | 5396 | 5552 | 7551 |
| Answer tokens | 665 | 653 | 735 | 777 | 754 |
| AIME24 | 33.3 | 23.3 | 33.3 | 36.7 | 33.3 |
| Forcing end of thinking when 0 steps are reached | | | | | |
| Steps used | 16 | 32 | 59 | 78 | 136 |
| Tokens per step | 96 | 94 | 80 | 70 | 56 |
| Thinking tokens | 1517 | 2963 | 4636 | 5409 | 7551 |
| Answer tokens | 1111 | 788 | 799 | 794 | 754 |
| AIME24 | 23.3 | 23.3 | 33.3 | 36.7 | 33.3 |
Table 14. Scaling thinking time via class-conditional control. We report “accuracy / average thinking tokens per sample”; the higher
the accuracy and the fewer the thinking tokens (inference cost) the better.
| Prompt appended to the question after two newlines | AIME24 | MATH500 | GPQA |
|---|---|---|---|
| Answer after a short amount of thinking. Do not spend excessive time double-checking your work. | 30.0% / 8033 | 90.4% / 2537 | 56.6% / 4177 |
| Answer after a long amount of thinking. If you feel like you are finished early, spend the extra time trying to double-check your work until you are absolutely sure that you have the correct answer. | 36.7% / 9651 | 91.4% / 3875 | 51.0% / 4827 |
| Without generic prompt appending | 50.0% / 6109 | 93.0% / 3298 | 57.6% / 3510 |
when counting down (“3 steps left...2 steps left”) rather than counting up (“Step 2...Step 3...”). This is likely because, when
counting down, the final step is always 1, which acts as a strong prior for the model to finish its generation. When counting up,
the final step before the answer varies, so if the model does not remember the original step instruction, it may fail to stop.
We conclude the following from our results in Table 13: (1) The model still struggles to adhere to the step limit. The model
sometimes simply continues counting into negative steps, e.g. “-1 steps left”. To solve this issue, we automatically stop the
thinking process once 0 steps are reached and then force the model to transition to answering mode by appending the answer
token delimiter (§3). This leads to perfect step adherence (lower half of Table 13), yet problems remain. (2) The model
compensates for fewer steps by making each step longer. For example, when forced to use up to 16 steps vs 256 steps, the
model generates an average of 96 tokens per step vs 56. Despite this issue, more steps still clearly correlate with more total
thinking tokens in Table 13 and with better performance, yielding a positive scaling slope. (3) Step instructions are more costly than
other methods. The step delimiters require around 6 tokens each, which for e.g. 64 steps adds up to a total of around 380
tokens. Even when ignoring the step delimiters in token counts, as in Table 13, the model still requires 7551 thinking tokens on
average to achieve only 33.3% on AIME24. To compute Control reported in Table 3 for step-conditional control variants,
we first decide that 100 tokens are an upper limit per step and then multiply this number by the steps instructed to arrive
at a proxy total token limit, e.g. 1600 for 16 steps instructed. We then check whether the thinking tokens in Table 13 fit
within the respective limit for each evaluation run (3/5 for without intervention; 5/5 for with intervention). For the model in
Figure 7, we use a model with step-conditional control trained on an earlier version of our data and using an earlier version
of our evaluation codebase.
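The step formatting described above (splitting on double newlines and counting down with the delimiter of Figure 10, right) can be sketched as follows; this is an illustrative helper, not the paper's code:

```python
import math

# Split the trace on double newlines and prefix each step with a countdown
# delimiter, bucketing the step budget into a power of two (illustrative).
def format_steps(trace: str, budget: int = None) -> str:
    steps = [s for s in trace.split("\n\n") if s.strip()]
    n = budget or 2 ** math.ceil(math.log2(len(steps)))  # power-of-two bucket
    return "\n".join(
        f"<|im_start|>{n - i} steps left\n{step}" for i, step in enumerate(steps)
    )
```

Counting down means the delimiter before the final step is always near 1, which is the stopping signal discussed above.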
Class-conditional control OpenAI exposes test-time compute control to users via a “reasoning_effort” API parameter
with three possible settings: low, medium, and high.3 The OpenAI documentation also states that “Reducing reasoning
effort can result in faster responses and fewer tokens used on reasoning in a response,” suggesting that they are unable to
control test-time compute with guarantees. Thus, OpenAI may simply adjust the prompt or system instruction depending
on the reasoning effort desired. In Table 14, we show that separate prompts for short and long thinking allow us to control
thinking time to some extent: Prompting the model to think for longer leads to longer thinking. However, it does not reliably
improve performance, and control is not precise. This degree of adherence may suffice when there are only three
classes, but it might not scale to finer-grained control. To compute Control reported in Table 3 for this method, we assume
that prompting the model to think for a short time in Table 14 should produce fewer tokens than the default for AIME24,
while the long prompt should produce more. As 8033 > 6109 and 9651 > 6109, one out of two follows our expected control
thus Control is 50%.
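The 50% Control value can be reproduced from the Table 14 numbers (an illustrative computation):

```python
# Fraction of runs whose thinking-token count moves in the expected
# direction relative to the no-prompt default (6109 tokens on AIME24).
default_tokens = 6109
runs = {"short prompt": 8033, "long prompt": 9651}
should_be_shorter = {"short prompt": True, "long prompt": False}

followed = sum(
    (tokens < default_tokens) == should_be_shorter[name]
    for name, tokens in runs.items()
)
control = followed / len(runs)
print(control)  # 0.5: only the long prompt moves in the expected direction
```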
³https://github.com/openai/openai-python/blob/44d6210f101abedeb2dd68507fcffcb329df70ea/src/openai/types/chat/completion_create_params.py#L172
Table 15. Answers generated by an earlier version of s1-32B when rejection sampling on AIME24. The black text is the question and
the yellow text is the solution. Based solely on the question we generate the light blue reasoning trace and blue response with s1-32B
using a temperature of 1. We generate until the reasoning trace has less than 4000 tokens for the first and less than 8000 tokens for the
second example. It required on average 199 (1) tries per example until thinking was ≤ 4000 (≤ 8000) tokens.
Let 𝑂(0, 0), 𝐴(1/2, 0), and 𝐵(0, √3/2) be points in the coordinate plane. Let ℱ be the family of segments 𝑃𝑄 of unit length
lying in the first quadrant with 𝑃 on the 𝑥-axis and 𝑄 on the 𝑦-axis. There is a unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵,
that does not belong to any segment from ℱ other than 𝐴𝐵. Then 𝑂𝐶² = 𝑝/𝑞, where 𝑝 and 𝑞 are relatively prime positive
integers. Find 𝑝 + 𝑞.
23
[Correct generation with 3981 thinking tokens and 1055 answer tokens]
The problem asks for a point 𝐶 on the segment 𝐴𝐵 that is not contained in any other segment from the family ℱ, except for
𝐴𝐵 itself.
The points 𝑂, 𝐴, 𝐵 form a triangle in the first quadrant.
𝑂 = (0, 0)
𝐴 = (1/2, 0)
𝐵 = (0, √3/2)
The segment 𝐴𝐵 connects 𝐴(1/2, 0) and 𝐵(0, √3/2).
The equation of the line containing 𝐴𝐵 can be found using the two points:
Slope 𝑚 = (√3/2 − 0)/(0 − 1/2) = −√3
Using the point-slope form 𝑦 − 𝑦1 = 𝑚(𝑥 − 𝑥1) with point 𝐴(1/2, 0):
𝑦 − 0 = −√3(𝑥 − 1/2)
𝑦 = −√3𝑥 + √3/2
The segment 𝐴𝐵 lies in the first quadrant and connects (1/2, 0) to (0, √3/2).
The family ℱ consists of segments 𝑃𝑄 of unit length, where 𝑃 is on the 𝑥-axis and 𝑄 is on the 𝑦-axis. Let 𝑃 = (𝑎, 0) and
𝑄 = (0, 𝑏).
The length of 𝑃𝑄 is √((𝑎 − 0)² + (0 − 𝑏)²) = √(𝑎² + 𝑏²) = 1.
So 𝑎² + 𝑏² = 1.
Since 𝑃 is on the positive 𝑥-axis and 𝑄 is on the positive 𝑦-axis, we have 𝑎 > 0 and 𝑏 > 0.
The segment 𝑃𝑄 connects (𝑎, 0) and (0, 𝑏). The equation of the line containing 𝑃𝑄 is 𝑥/𝑎 + 𝑦/𝑏 = 1.
The region covered by the family ℱ is the area bounded by the lines containing these segments. The endpoints (𝑎, 0) and
(0, 𝑏) vary such that 𝑎² + 𝑏² = 1.
Consider the parametrization 𝑎 = cos 𝜃, 𝑏 = sin 𝜃, where 0 < 𝜃 < 𝜋/2.
The segment 𝑃𝑄 connects (cos 𝜃, 0) and (0, sin 𝜃). The equation of the line is 𝑥/cos 𝜃 + 𝑦/sin 𝜃 = 1.
We are looking for a unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵, that does not belong to any segment from ℱ other than
𝐴𝐵.
Let 𝐶 be a point on 𝐴𝐵. We can parametrize the points on 𝐴𝐵 as (1 − 𝑡)𝐴 + 𝑡𝐵 for 0 ≤ 𝑡 ≤ 1.
𝐶 = (1 − 𝑡)(1/2, 0) + 𝑡(0, √3/2) = (1/2 − 𝑡/2, (√3/2)𝑡).
Since 𝐶 is distinct from 𝐴 and 𝐵, we have 0 < 𝑡 < 1.
Suppose the point 𝐶 lies on some segment 𝑃𝑄 ∈ ℱ, where 𝑃 = (cos 𝜃, 0) and 𝑄 = (0, sin 𝜃).
The point 𝐶 satisfies the equation 𝑥/cos 𝜃 + 𝑦/sin 𝜃 = 1.
The slope of the tangent to the astroid at a point (𝑥, 𝑦) is −(𝑦/𝑥)^(1/3).
The slope of the line 𝐴𝐵 is −√3.
So, at the point of tangency, −(𝑦/𝑥)^(1/3) = −√3.
(𝑦/𝑥)^(1/3) = √3
𝑦/𝑥 = (√3)³ = 3√3
The point 𝐶 lies on the line 𝑦 = −√3𝑥 + √3/2 and satisfies 𝑦/𝑥 = 3√3.
So 𝑦 = 3√3𝑥.
Substitute into the line equation:
3√3𝑥 = −√3𝑥 + √3/2
3𝑥 = −𝑥 + 1/2
4𝑥 = 1/2
𝑥 = 1/8
Then 𝑦 = 3√3(1/8) = 3√3/8.
The point 𝐶 is (1/8, 3√3/8).
We need to check if this point lies on the segment 𝐴𝐵.
The point 𝐶 is of the form (1/2 − 𝑡/2, (√3/2)𝑡).
1/2 − 𝑡/2 = 1/8 ⟹ 4 − 4𝑡 = 1 ⟹ 3 = 4𝑡 ⟹ 𝑡 = 3/4.
(√3/2)𝑡 = (√3/2) ⋅ (3/4) = 3√3/8.
So 𝐶 lies on 𝐴𝐵.
We need to make sure 0 < 𝑡 < 1, which is true for 𝑡 = 3/4.
Now, we need to check if this point 𝐶 does not belong to any other segment from ℱ.
The point 𝐶 lies on the astroid. The segment 𝐴𝐵 is tangent to the astroid at 𝐶.
Any other segment from ℱ that contains 𝐶 would have 𝐶 as an interior point, not a tangent point.
Consider the parametrization of the family ℱ.
A point (𝑥, 𝑦) belongs to some segment in ℱ iff there exists 𝜃 such that 𝑥/cos 𝜃 + 𝑦/sin 𝜃 = 1.
For the point 𝐶 = (1/8, 3√3/8), we have:
(1/8)/cos 𝜃 + (3√3/8)/sin 𝜃 = 1
1/cos 𝜃 + 3√3/sin 𝜃 = 8
sin 𝜃 + 3√3 cos 𝜃 = 8 sin 𝜃 cos 𝜃
sin 𝜃 + 3√3 cos 𝜃 = 4 sin(2𝜃)
We know that 𝐶 lies on 𝐴𝐵, which corresponds to some specific segment in ℱ.
The problem statement assumes the existence and uniqueness of such a point 𝐶.
The unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵, does not belong to any segment from ℱ other than 𝐴𝐵.
This means that if 𝐶 lies on another segment 𝑃′𝑄′ ∈ ℱ, then 𝑃′𝑄′ = 𝐴𝐵.
The point 𝐶 is the point of tangency of 𝐴𝐵 with the astroid.
Now we need to find 𝑂𝐶².
𝑂𝐶² = (1/8)² + (3√3/8)² = 1/64 + 27/64 = 28/64 = 7/16.
Here 𝑝 = 7 and 𝑞 = 16.
𝑝 and 𝑞 are relatively prime.
𝑝 + 𝑞 = 7 + 16 = 23.
Final check of the reasoning:
The family ℱ consists of unit-length segments with endpoints on the axes.
The envelope of this family is the astroid 𝑥^(2/3) + 𝑦^(2/3) = 1.
The segments from ℱ are tangent to the astroid.
The segment 𝐴𝐵 is also a member of the family ℱ.
We are looking for a point 𝐶 on 𝐴𝐵 that lies on the astroid.
This point 𝐶 is the point of tangency of 𝐴𝐵 with the astroid.
If a point on 𝐴𝐵 lies on the astroid, it belongs to the tangent segment at that point. Since 𝐴𝐵 is the tangent segment, this point does not belong to any other segment from ℱ.
Consider the case when the segment 𝐴𝐵 intersects the astroid at two points.
If the segment 𝐴𝐵 is not tangent to the astroid, then there are two points of intersection.
Let the points of intersection be 𝐶₁ and 𝐶₂.
If 𝐶₁ lies on 𝐴𝐵, then the tangent to the astroid at 𝐶₁ is the segment 𝐴𝐵.
If 𝐶₂ lies on 𝐴𝐵, then the tangent to the astroid at 𝐶₂ is the segment 𝐴𝐵.
The slope of 𝐴𝐵 is −√3.
We found the point on the astroid with tangent slope −√3.
The problem statement guarantees the uniqueness of such a point 𝐶.
The point 𝐶 must be interior to 𝐴𝐵, so 0 < 𝑡 < 1.
The parameter 𝑡 = 3/4 is in this range.
The point 𝐶 is distinct from 𝐴 and 𝐵.
If 𝐶 = 𝐴, then 𝑡 = 0 and 𝐶 = (1/2, 0).
If 𝐶 = 𝐵, then 𝑡 = 1 and 𝐶 = (0, √3/2).
The point of tangency 𝐶 is the unique point on 𝐴𝐵 (excluding endpoints) that lies on the astroid.
Solution:
Let the family of segments ℱ be defined by segments 𝑃𝑄 of unit length with 𝑃 on the positive 𝑥-axis and 𝑄 on the positive 𝑦-axis. Let 𝑃 = (𝑎, 0) and 𝑄 = (0, 𝑏). The length condition gives 𝑎² + 𝑏² = 1. The segment 𝑃𝑄 lies on the line 𝑥/𝑎 + 𝑦/𝑏 = 1.
Using the parametrization 𝑎 = cos 𝜃, 𝑏 = sin 𝜃 with 0 < 𝜃 < 𝜋/2, the equation of the line is 𝑥/cos 𝜃 + 𝑦/sin 𝜃 = 1.
The envelope of this family of lines is the astroid 𝑥^(2/3) + 𝑦^(2/3) = 1.
The segments from ℱ are tangent to this astroid.
The segment 𝐴𝐵 connects 𝐴(1/2, 0) and 𝐵(0, √3/2). The equation of the line containing 𝐴𝐵 is 𝑦 = −√3𝑥 + √3/2. This segment is part of the family ℱ.
We are looking for a unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵, that does not belong to any segment from ℱ other than 𝐴𝐵. This means 𝐶 is the point where 𝐴𝐵 is tangent to the astroid.
The slope of the tangent to the astroid at a point (𝑥, 𝑦) is −∛(𝑦/𝑥). The slope of 𝐴𝐵 is −√3.
Setting the slopes equal: −∛(𝑦/𝑥) = −√3 ⟹ 𝑦/𝑥 = 3√3.
The point 𝐶 lies on the line 𝑦 = −√3𝑥 + √3/2 and satisfies 𝑦 = 3√3𝑥.
Substituting, 3√3𝑥 = −√3𝑥 + √3/2 ⟹ 4√3𝑥 = √3/2 ⟹ 𝑥 = 1/8.
Then 𝑦 = 3√3(1/8) = 3√3/8.
The point 𝐶 is (1/8, 3√3/8).
We need to check if 𝐶 lies on 𝐴𝐵. The points on 𝐴𝐵 are (1/2 − (1/2)𝑡, (√3/2)𝑡) for 0 ≤ 𝑡 ≤ 1.
For 𝑥 = 1/8: 1/2 − (1/2)𝑡 = 1/8 ⟹ 𝑡 = 3/4.
For 𝑦 = 3√3/8: (√3/2)𝑡 = 3√3/8 ⟹ 𝑡 = 3/4.
Since 0 < 3/4 < 1, 𝐶 is an interior point of 𝐴𝐵.
The point 𝐶 is the unique point on 𝐴𝐵 (excluding endpoints) that lies on the astroid. The tangent to the astroid at 𝐶 is the segment 𝐴𝐵. Thus, 𝐶 does not belong to any other segment from ℱ.
Finally, we calculate 𝑂𝐶² = (1/8)² + (3√3/8)² = 1/64 + 27/64 = 28/64 = 7/16.
We have 𝑝 = 7 and 𝑞 = 16. These are relatively prime.
𝑝 + 𝑞 = 7 + 16 = 23.
Final Answer: The final answer is 23
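The arithmetic in this correct generation can be replayed numerically. A minimal editorial sketch (not part of the model output) checking the tangency point and the final value of 𝑂𝐶²:

```python
import math

# Tangency point of AB with the astroid, as derived in the trace above.
x, y = 1 / 8, 3 * math.sqrt(3) / 8

# C lies on the astroid x^(2/3) + y^(2/3) = 1.
assert abs(x ** (2 / 3) + y ** (2 / 3) - 1) < 1e-12

# C lies on AB with parameter t = 3/4: (1/2 - t/2, (sqrt(3)/2) * t).
t = 3 / 4
assert abs(1 / 2 - t / 2 - x) < 1e-12
assert abs(math.sqrt(3) / 2 * t - y) < 1e-12

# OC^2 = 1/64 + 27/64 = 7/16, so p + q = 7 + 16 = 23.
oc2 = x ** 2 + y ** 2
assert abs(oc2 - 7 / 16) < 1e-12
```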
[Incorrect generation with 5923 thinking tokens and 754 answer tokens]
The problem asks for a point 𝐶 on the segment 𝐴𝐵 such that 𝐶 is not on any other segment from the family ℱ, except 𝐴𝐵 itself. The family ℱ consists of segments 𝑃𝑄 of unit length, with 𝑃 on the 𝑥-axis and 𝑄 on the 𝑦-axis in the first quadrant.
Let 𝑃 = (𝑥, 0) and 𝑄 = (0, 𝑦), where 𝑥 > 0 and 𝑦 > 0. The length of the segment 𝑃 𝑄 is 1, so the distance between 𝑃 and 𝑄
is 1.
√((𝑥 − 0)² + (0 − 𝑦)²) = 1
𝑥² + 𝑦² = 1²
𝑥² + 𝑦² = 1
The segment 𝑃 𝑄 lies in the first quadrant, so 𝑥 > 0 and 𝑦 > 0.
The equation 𝑥² + 𝑦² = 1 represents the quarter circle in the first quadrant, with endpoints at (1, 0) and (0, 1).
The points on the segment 𝑃𝑄 can be parameterized as ((1 − 𝑡)𝑥 + 𝑡·0, (1 − 𝑡)·0 + 𝑡𝑦) = (𝑥(1 − 𝑡), 𝑦𝑡), where 0 ≤ 𝑡 ≤ 1.
The segment 𝐴𝐵 connects 𝐴(1/2, 0) and 𝐵(0, √3/2).
The equation of the line containing 𝐴𝐵 can be found using the two points:
Slope 𝑚 = (√3/2 − 0)/(0 − 1/2) = (√3/2)/(−1/2) = −√3.
Using point-slope form with point 𝐴(1/2, 0):
𝑦 − 0 = −√3(𝑥 − 1/2)
𝑦 = −√3𝑥 + √3/2
The segment 𝐴𝐵 lies in the first quadrant, between 𝑥 = 0 and 𝑥 = 1/2.
Alternatively, we can parameterize the points on 𝐴𝐵.
Let 𝐶 be a point on 𝐴𝐵. Then 𝐶 = (1 − 𝜆)𝐴 + 𝜆𝐵 for some 0 ≤ 𝜆 ≤ 1.
𝐶 = (1 − 𝜆)(1/2, 0) + 𝜆(0, √3/2) = (1/2 − (1/2)𝜆, (√3/2)𝜆).
We are looking for a unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵, that does not belong to any segment from ℱ other than 𝐴𝐵.
Consider a segment from ℱ, with endpoints 𝑃 = (𝑥, 0) and 𝑄 = (0, 𝑦), where 𝑥² + 𝑦² = 1.
A point on this segment is of the form (𝑥(1 − 𝑡), 𝑦𝑡), where 0 ≤ 𝑡 ≤ 1.
We want to find a point 𝐶 on 𝐴𝐵, 𝐶 ≠ 𝐴, 𝐵, such that if 𝐶 lies on a segment 𝑃𝑄 ∈ ℱ, then 𝑃𝑄 = 𝐴𝐵.
Let 𝐶 = (1/2 − (1/2)𝜆, (√3/2)𝜆), where 0 < 𝜆 < 1.
If 𝐶 lies on a segment 𝑃𝑄 from ℱ, then there exist 𝑥, 𝑦 > 0 with 𝑥² + 𝑦² = 1, and 𝑡 ∈ [0, 1] such that:
1/2 − (1/2)𝜆 = 𝑥(1 − 𝑡)
(√3/2)𝜆 = 𝑦𝑡
From the first equation, 1 − 𝑡 = (1/2 − (1/2)𝜆)/𝑥 = (1 − 𝜆)/(2𝑥).
From the second equation, 𝑡 = ((√3/2)𝜆)/𝑦 = √3𝜆/(2𝑦).
Since 𝑡 ∈ [0, 1], we have 1 − 𝑡 ∈ [0, 1].
Also, 𝑥 > 0, so (1 − 𝜆)/(2𝑥) > 0, which implies 1 − 𝜆 > 0, i.e. 𝜆 < 1. This is consistent with the condition for a point on 𝐴𝐵.
We have 1 − 𝑡 = (1 − 𝜆)/(2𝑥) and 𝑡 = √3𝜆/(2𝑦).
Adding these equations: 1 = (1 − 𝜆)/(2𝑥) + √3𝜆/(2𝑦)
2𝑥𝑦 = (1 − 𝜆)𝑦 + √3𝜆𝑥
We know 𝑥² + 𝑦² = 1.
If the point 𝐶 belongs to the segment 𝐴𝐵 itself, then the endpoints of the segment are 𝐴 and 𝐵.
So, the segment is from (1/2, 0) to (0, √3/2).
Here, 𝑥 = 1/2 and 𝑦 = √3/2.
Check that 𝑥² + 𝑦² = 1: (1/2)² + (√3/2)² = 1/4 + 3/4 = 1. So 𝐴𝐵 ∈ ℱ.
If 𝐶 belongs to another segment 𝑃𝑄 from ℱ, it has endpoints 𝑃 = (𝑥, 0) and 𝑄 = (0, 𝑦) with (𝑥, 𝑦) ≠ (1/2, √3/2).
Consider the geometric interpretation. The family ℱ consists of segments of unit length with endpoints on the axes. The midpoints of these segments lie on the circle (𝑥 − 1/2)² + (𝑦 − 1/2)² = 1/2.
Let the point 𝐶 correspond to some 𝜆.
If 𝐶 lies on 𝑃𝑄, then 𝐶 = (𝑥(1 − 𝑡), 𝑦𝑡).
So, 1/2 − (1/2)𝜆 = 𝑥(1 − 𝑡) and (√3/2)𝜆 = 𝑦𝑡.
We need to find a 𝜆 ∈ (0, 1) such that the only solution to these equations, with 𝑥² + 𝑦² = 1, 𝑥, 𝑦 > 0, is 𝑥 = 1/2 and 𝑦 = √3/2.
From the equations, 1 − 𝑡 = (1 − 𝜆)/(2𝑥) and 𝑡 = √3𝜆/(2𝑦).
Substitute these into 𝑥² + 𝑦² = 1.
Try a different approach. Consider the tangents to the family of segments.
The segments are formed by points (𝑥, 0) and (0, 𝑦) with 𝑥² + 𝑦² = 1.
The line containing the segment 𝑃𝑄 has intercept form: a point (𝑥_𝑐, 𝑦_𝑐) lies on it iff 𝑥_𝑐/𝑥 + 𝑦_𝑐/𝑦 = 1.
Consider the equation of the segment in parametric form: (𝑥(1 − 𝑡), 𝑦𝑡).
If a point 𝐶 lies on this segment, then 𝐶 = (𝑥_𝑐, 𝑦_𝑐), where 𝑥_𝑐 = 𝑥(1 − 𝑡) and 𝑦_𝑐 = 𝑦𝑡.
So, 𝑥_𝑐/𝑥 = 1 − 𝑡 and 𝑦_𝑐/𝑦 = 𝑡.
Indeed 𝑥_𝑐/𝑥 + 𝑦_𝑐/𝑦 = 1 − 𝑡 + 𝑡 = 1.
The segments from ℱ are parts of the lines 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1, where 𝑥₀² + 𝑦₀² = 1, 𝑥₀, 𝑦₀ > 0.
Consider the point 𝐶 = (1/2 − (1/2)𝜆, (√3/2)𝜆).
If 𝐶 lies on a segment from ℱ with endpoints (𝑥, 0) and (0, 𝑦), then:
(1/2 − (1/2)𝜆)/𝑥 + ((√3/2)𝜆)/𝑦 = 1
(1 − 𝜆)/(2𝑥) + √3𝜆/(2𝑦) = 1
We need to find a unique 𝜆 ∈ (0, 1) such that the only solution (𝑥, 𝑦) to this equation, with 𝑥² + 𝑦² = 1, 𝑥, 𝑦 > 0, is (1/2, √3/2).
Substitute 𝑥 = 1/2 and 𝑦 = √3/2 into the equation:
(1 − 𝜆)/(2·(1/2)) + √3𝜆/(2·(√3/2)) = 1
1 − 𝜆 + 𝜆 = 1
1 = 1
This shows that any point on 𝐴𝐵 lies on the segment 𝐴𝐵 itself.
Consider the line containing 𝐴𝐵: 𝑦 = −√3𝑥 + √3/2.
If a point 𝐶 on 𝐴𝐵 also lies on another segment from ℱ with endpoints (𝑥, 0) and (0, 𝑦), then:
𝑥_𝐶/𝑥 + 𝑦_𝐶/𝑦 = 1.
Let the point 𝐶 divide 𝐴𝐵 in the ratio 𝜇 ∶ (1 − 𝜇), where 0 < 𝜇 < 1.
𝐶 = (1 − 𝜇)𝐴 + 𝜇𝐵 = (1 − 𝜇)(1/2, 0) + 𝜇(0, √3/2) = (1/2 − (1/2)𝜇, (√3/2)𝜇).
If 𝐶 lies on a segment with endpoints (𝑥, 0) and (0, 𝑦), then:
(1/2 − (1/2)𝜇)/𝑥 + ((√3/2)𝜇)/𝑦 = 1.
We are looking for a point 𝐶 such that the line passing through 𝐶 of the form 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1, where 𝑥₀² + 𝑦₀² = 1, is unique.
Consider the tangents to the unit circle centered at the origin. The lines of the form 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1 are tangents to the circle 𝑥² + 𝑦² = 𝑟².
Consider the dual problem. The family of segments corresponds to points on the quarter circle 𝑥² + 𝑦² = 1 in the first quadrant.
The lines containing the segments are 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1.
The envelope of these lines is the curve whose tangents are these lines.
The equation is 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1, with the constraint 𝑥₀² + 𝑦₀² = 1.
Let 𝑥₀ = cos 𝜃, 𝑦₀ = sin 𝜃, where 0 < 𝜃 < 𝜋/2.
The equations of the lines are 𝑥 cos 𝜃 + 𝑦 sin 𝜃 = 1.
The envelope of the family of lines 𝑥 cos 𝜃 + 𝑦 sin 𝜃 = 1 is the circle 𝑥² + 𝑦² = 1.
The segments from ℱ are parts of the tangents to the unit circle in the first quadrant.
The point 𝐶 on 𝐴𝐵 that does not belong to any other segment from ℱ means that the tangent to the unit circle passing through 𝐶 is unique.
Consider the point 𝐶 = (1/2 − (1/2)𝜆, (√3/2)𝜆).
The tangents to the unit circle passing through 𝐶 have the form 𝑥𝑥_𝐶 + 𝑦𝑦_𝐶 = 1.
If the tangent is unique, then 𝐶 lies on the unit circle.
(1/2 − (1/2)𝜆)² + ((√3/2)𝜆)² = 1
(1/4 − (1/2)𝜆 + (1/4)𝜆²) + (3/4)𝜆² = 1
1/4 − (1/2)𝜆 + 𝜆² = 1
4𝜆² − 2𝜆 + 1 = 4
4𝜆² − 2𝜆 − 3 = 0
Solve for 𝜆:
𝜆 = (−(−2) ± √((−2)² − 4(4)(−3)))/(2(4)) = (2 ± √(4 + 48))/8 = (2 ± √52)/8 = (2 ± 2√13)/8 = (1 ± √13)/4.
We need 0 < 𝜆 < 1.
√13 ≈ 3.6.
(1 + 3.6)/4 = 4.6/4 > 1.
(1 − 3.6)/4 = −2.6/4 < 0.
Something is wrong. The point 𝐶 is inside the unit circle.
Let’s rephrase the condition. There is a unique point 𝐶 on 𝐴𝐵 such that if 𝐶 lies on a segment 𝑃𝑄 ∈ ℱ, then 𝑃𝑄 = 𝐴𝐵.
Let the point 𝐶 on 𝐴𝐵 be (1/2 − (1/2)𝜆, (√3/2)𝜆).
Consider a segment from ℱ with endpoints (𝑥, 0) and (0, 𝑦), 𝑥² + 𝑦² = 1.
The parametric form of the segment is (𝑥(1 − 𝑡), 𝑦𝑡).
If 𝐶 is on this segment, then for some 𝑡 ∈ [0, 1]:
1/2 − (1/2)𝜆 = 𝑥(1 − 𝑡)
(√3/2)𝜆 = 𝑦𝑡
If the segment is unique, then the values of 𝑥 and 𝑦 are unique.
The only solution to (1 − 𝜆)/(2𝑥) + √3𝜆/(2𝑦) = 1, with 𝑥² + 𝑦² = 1, 𝑥, 𝑦 > 0, should be 𝑥 = 1/2, 𝑦 = √3/2.
Consider the intersection of the line containing 𝐴𝐵 with the boundary of the region formed by the segments in ℱ.
The boundaries are the 𝑥-axis, the 𝑦-axis, and the quarter circle 𝑥² + 𝑦² = 1.
Consider the dual problem again. The family of lines 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1, where 𝑥₀² + 𝑦₀² = 1, 𝑥₀, 𝑦₀ > 0.
The envelope of these lines is the circle 𝑥² + 𝑦² = 1.
A point belongs to a unique segment from ℱ if it lies on the circle 𝑥² + 𝑦² = 1.
The point 𝐶 on 𝐴𝐵 satisfies 𝐶 = 𝑡𝐴 + (1 − 𝑡)𝐵.
Let’s consider the case where the segment is 𝐴𝐵. Here 𝑥 = 1/2, 𝑦 = √3/2.
The point 𝐶 can be represented as ((1/2)(1 − 𝑠), (√3/2)𝑠) for 0 ≤ 𝑠 ≤ 1.
We are looking for a point 𝐶 on 𝐴𝐵 that does not lie on any other segment from ℱ.
Consider the geometry. The family ℱ forms a region bounded by the axes and the quarter circle.
The segments are chords of circles centered on the line 𝑦 = 𝑥, tangent to the axes.
Let the endpoints of a segment be (cos 𝜃, 0) and (0, sin 𝜃).
A point on the segment is (cos 𝜃 · (1 − 𝑡), sin 𝜃 · 𝑡).
The point 𝐶 = (1/2 − (1/2)𝜆, (√3/2)𝜆).
If 𝐶 lies on such a segment, then (1/2 − (1/2)𝜆)/cos 𝜃 = 1 − 𝑡 and ((√3/2)𝜆)/sin 𝜃 = 𝑡.
(1 − 𝜆)/(2 cos 𝜃) + √3𝜆/(2 sin 𝜃) = 1.
We need a unique 𝜆 ∈ (0, 1) such that the only solution (cos 𝜃, sin 𝜃) is (1/2, √3/2), which corresponds to 𝜃 = 𝜋/3.
Consider the function 𝑓(𝜃) = 1/(2 cos 𝜃) − 𝜆/(2 cos 𝜃) + √3𝜆/(2 sin 𝜃).
We need 𝑓(𝜃) = 1.
3
If 𝜆 = 12 , then 𝐶 = ( 14 , 4
).
(1 − 1/2)/(2 cos 𝜃) + (√3 · (1/2))/(2 sin 𝜃) = 1
1/(4 cos 𝜃) + √3/(4 sin 𝜃) = 1
1/cos 𝜃 + √3/sin 𝜃 = 4
sin 𝜃 + √3 cos 𝜃 = 4 sin 𝜃 cos 𝜃
sin 𝜃 + √3 cos 𝜃 = 2 sin(2𝜃)
Divide by 2: (1/2) sin 𝜃 + (√3/2) cos 𝜃 = sin(2𝜃)
sin(𝜃 + 𝜋/3) = sin(2𝜃).
𝜃 + 𝜋/3 = 2𝜃 + 2𝜋𝑘, or 𝜃 + 𝜋/3 = 𝜋 − 2𝜃 + 2𝜋𝑘.
𝜃 = 𝜋/3 − 2𝜋𝑘, so 𝜃 = 𝜋/3.
3𝜃 = 𝜋 − 𝜋/3 + 2𝜋𝑘 = 2𝜋/3 + 2𝜋𝑘.
𝜃 = 2𝜋/9 + 2𝜋𝑘/3.
If 𝑘 = 0, 𝜃 = 2𝜋/9.
If 𝑘 = 1, 𝜃 = 2𝜋/9 + 2𝜋/3 = (2𝜋 + 6𝜋)/9 = 8𝜋/9.
If 𝜆 = 1/2, the point is (1/4, √3/4).
This point lies on segments corresponding to 𝜃 = 𝜋/3 and 𝜃 = 2𝜋/9.
The unique point 𝐶 means that for other values of 𝜆, there is more than one solution for (𝑥, 𝑦).
Consider the tangent to the unit circle passing through 𝐶.
If the tangent is unique, 𝐶 is on the unit circle.
This leads to 𝜆 = (1 ± √13)/4, neither of which is in (0, 1).
Consider the geometric interpretation of the family ℱ. The segments connect points on the axes such that the distance between the endpoints is 1.
The locus of the centers of these segments is the circle (𝑥 − 1/2)² + (𝑦 − 1/2)² = 1/2, in the first quadrant.
If a point 𝐶 lies on a segment from ℱ, then 𝐶 is on a line segment connecting (𝑥₀, 0) and (0, 𝑦₀) with 𝑥₀² + 𝑦₀² = 1.
The line containing the segment is 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1.
If 𝐶 belongs to a unique segment, then there is a unique line of this form passing through 𝐶.
The point 𝐶 on 𝐴𝐵 is (1/2 − (1/2)𝜆, (√3/2)𝜆).
We need a unique (𝑥₀, 𝑦₀) with 𝑥₀² + 𝑦₀² = 1, 𝑥₀, 𝑦₀ > 0, such that (1/2 − (1/2)𝜆)/𝑥₀ + ((√3/2)𝜆)/𝑦₀ = 1.
Let the point 𝐶 be such that the line passing through 𝐶 of the form 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1, subject to 𝑥₀² + 𝑦₀² = 1, is unique.
This line is tangent to the unit circle.
The point 𝐶 lies on the intersection of the line containing 𝐴𝐵 and the unit circle.
(1/2 − (1/2)𝜆)² + ((√3/2)𝜆)² = 1.
We found no solution for 𝜆 ∈ (0, 1).
The unique point 𝐶 means that if any other segment passes through 𝐶, it must be 𝐴𝐵.
Consider the slope of the line containing the segment connecting (𝑥, 0) and (0, 𝑦). The slope is −𝑦/𝑥.
In running coordinates (𝑋, 𝑌), the equation of the line is 𝑌 = −(𝑦/𝑥)𝑋 + 𝑦.
If 𝐶 lies on the segment, then (1 − 𝜆)/(2𝑥) + √3𝜆/(2𝑦) = 1.
Consider the geometric mean.
Let the point 𝐶 have coordinates (𝑎, 𝑏). The segment connecting (𝑥, 0) and (0, 𝑦) passes through (𝑎, 𝑏).
𝑎/𝑥 + 𝑏/𝑦 = 1.
If 𝐶 is on 𝐴𝐵, then 𝑏 = −√3𝑎 + √3/2.
𝑎 = 1/2 − (1/2)𝜆, 𝑏 = (√3/2)𝜆.
At the sought point 𝐶, the segment is tangent to the envelope of the family of segments.
Solution:
Let the family of segments be ℱ. The segments have endpoints 𝑃 = (𝑥₀, 0) and 𝑄 = (0, 𝑦₀) with 𝑥₀² + 𝑦₀² = 1 and 𝑥₀, 𝑦₀ > 0.
The points on the segment 𝑃𝑄 satisfy 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1, so the segments lie on the lines 𝑥/𝑥₀ + 𝑦/𝑦₀ = 1, where 𝑥₀² + 𝑦₀² = 1, 𝑥₀, 𝑦₀ > 0.
The segment 𝐴𝐵 has endpoints 𝐴(1/2, 0) and 𝐵(0, √3/2). The equation of the line containing 𝐴𝐵 is 𝑦 = −√3𝑥 + √3/2.
A point 𝐶 on 𝐴𝐵 can be parameterized as 𝐶 = (1/2 − (1/2)𝜆, (√3/2)𝜆), where 0 ≤ 𝜆 ≤ 1.
If 𝐶 belongs to a segment from ℱ with endpoints (𝑥, 0) and (0, 𝑦), then (1/2 − (1/2)𝜆)/𝑥 + ((√3/2)𝜆)/𝑦 = 1.
We are looking for a unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵, that does not belong to any segment from ℱ other than 𝐴𝐵. This means that the only solution (𝑥, 𝑦) to the equation with 𝑥² + 𝑦² = 1, 𝑥, 𝑦 > 0, is (1/2, √3/2).
The segments from ℱ are parts of the tangents to the unit circle. The point 𝐶 must be tangent to the envelope of the family of segments. The envelope is the unit circle itself.
The unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵, lies on the unit circle. We need to solve:
(1/2 − (1/2)𝜆)² + ((√3/2)𝜆)² = 1
𝜆 = 1/2. The point 𝐶 = (1/4, √3/4).
𝑂𝐶² = (1/4)² + (√3/4)² = 1/16 + 3/16 = 4/16 = 1/4.
𝑂𝐶² = 1/4 = 𝑝/𝑞, so 𝑝 = 1, 𝑞 = 4.
𝑝 and 𝑞 are relatively prime.
𝑝 + 𝑞 = 1 + 4 = 5.
Final Answer: The final answer is 5
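For contrast with the correct generation, the flaw here can be checked numerically: the returned point (1/4, √3/4) also lies on a second segment of the family (𝜃 = 2𝜋/9), which the trace itself had found, so it cannot be the sought point 𝐶. A minimal editorial sketch (not part of the model output):

```python
import math

# The incorrect generation's answer: C = (1/4, sqrt(3)/4), i.e. lambda = 1/2.
cx, cy = 1 / 4, math.sqrt(3) / 4

def on_segment(theta, px, py, eps=1e-9):
    """True iff (px, py) lies on the unit segment from (cos(theta), 0) to (0, sin(theta))."""
    # On the segment, x/cos(theta) + y/sin(theta) = 1 with both terms in [0, 1].
    u = px / math.cos(theta)
    v = py / math.sin(theta)
    return abs(u + v - 1) < eps and 0 <= u <= 1 and 0 <= v <= 1

# C lies on AB (theta = pi/3) but also on the theta = 2*pi/9 segment,
# exactly as the trace computed, so it belongs to more than one segment of the family.
assert on_segment(math.pi / 3, cx, cy)
assert on_segment(2 * math.pi / 9, cx, cy)
```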
F. Version Control
V2 → V3 (2025-03):
• Added §A
V1 → V2 (2025-02):