Understanding Length Dynamics in RL Training


Mystery of Length Growth from RL

DeepSeek-R1 [1] reported that Reinforcement Learning (RL) can drive models to “think” for thousands of steps to solve complex problems. This “thinking” process—manifested as extended Chain-of-Thought (CoT)—is widely believed to be the key to breakthrough reasoning performance.

Deepseek R1

However, replicating this length explosion has proven elusive. While some replication efforts report a U-shaped growth pattern, others observe lengths declining or stagnating over training.

Two early replication efforts of DeepSeek-R1, DeepScaleR [2] and Simple-RL-Zero [3], both reported that output length followed a U-shape over training steps:

DeepScaleR Simple-RL-Zero

At the same time, the paper “When More is Less: Understanding Chain-of-Thought Length in LLMs” [4] and the Concise Reasoning paper [9] both report declining output length over training:

more-is-less concise-reasoning

This brings us to a fundamental question: what actually drives a model to think longer? Is it an inherent property of RL, a bias in the loss function design, or something else entirely? In this post, we systematically isolate the drivers of length dynamics.

Experimental Setup

To cleanly reproduce DeepSeek-R1-Zero behavior, all of the experiments mentioned in this post use the following configuration unless otherwise specified:

  • Model initialization: We use base model checkpoints (not instruction-tuned), following the R1-Zero approach of starting from pretrained weights
  • RL algorithm: We use the default GRPO implementation in VeRL with a clip ratio of 0.2.
  • Reward design: We employ only a binary correctness reward based on answer parsing (with math_verify for math datasets), with no format reward and no KL regularization term
  • Generation parameters: The maximum output length is set to 16,384 tokens to minimize truncation effects. The sampling temperature is set to 1.0 and top-p to 1.0.
  • Hyperparameters: Initial learning rate of 1e-6, batch size of 32 with 8 rollouts per prompt

This minimal setup allows us to isolate the effects of pure correctness-based RL on length dynamics, without confounding factors from auxiliary rewards or regularization terms.
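For concreteness, the reward described above can be sketched in a few lines. This is a minimal illustration assuming the `parse`/`verify` interface of the math_verify package; the function name `compute_score` and the error handling are ours, not the exact code used in our runs.

```python
# Minimal sketch of the binary correctness reward (no format reward, no KL term).
# Assumes the parse/verify API of the math_verify package; compute_score is illustrative.
from math_verify import parse, verify


def compute_score(solution_str: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the reference answer, else 0.0."""
    try:
        gold = parse(ground_truth)   # parse the reference answer
        pred = parse(solution_str)   # parse the model's output
        return 1.0 if verify(gold, pred) else 0.0
    except Exception:
        # Unparseable outputs simply receive zero reward.
        return 0.0
```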

Initial Experiments: Length Decrease and Stabilization

Our pilot run with Qwen3-4B on MATH to replicate the length-growth effect was unsuccessful: the output length decreased and then stabilized. The same pattern appeared with Qwen2.5-3B and Qwen2.5-1.5B. Note that we set the maximum output length to 16k to minimize truncation. Although the Qwen2.5-1.5B model takes longer to learn the correct format, all models converged to an average output length of ~600 tokens by the end of training.

Even after we switched to the Nemotron Science dataset [5], we were still consistently observing this pattern of decrease and stabilization (bottom right figure).

Qwen3-4B on Math Qwen2.5-3B on Math Qwen2.5-1.5B on Math Qwen3-4B on Science

Hypothesis 1: Is KL Divergence the Key?

In the Demystifying CoT paper [4], the authors attempt to reproduce the curve from Simple-RL-Zero [3] but fail to observe substantial length growth. They attribute the mild growth they do see to KL regularization pulling the model back toward the long-output distribution it started from.

demystifying-cot

Following this hypothesis, we added KL divergence back to the objective with a weight of 0.01 (same as Simple-RL-Zero):

qwen3-math-with-kl qwen3-math-with-kl-loss

Adding this KL term does not recover length growth; in fact, the KL loss increases while lengths continue to shrink. This suggests that KL regularization alone does not explain the discrepancy with Simple-RL-Zero and Demystifying CoT under our experimental setup.
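For reference, here is what “adding the KL term back” looks like in a GRPO-style trainer: a per-token low-variance KL estimate (the so-called k3 estimator) added to the policy loss with weight 0.01. This is a sketch with illustrative variable names, not the exact VeRL code path.

```python
import torch


def kl_penalty(logprob: torch.Tensor, ref_logprob: torch.Tensor) -> torch.Tensor:
    """Per-token k3 estimate of KL(pi_theta || pi_ref): exp(d) - d - 1 with d = ref - cur.

    The estimate is non-negative and low-variance, which is why it is commonly
    used as the KL regularizer in GRPO-style objectives.
    """
    delta = ref_logprob - logprob
    return torch.exp(delta) - delta - 1.0


# Illustrative use inside the per-token loss (beta = 0.01, matching Simple-RL-Zero):
# per_token_loss = pg_loss + 0.01 * kl_penalty(logprob, ref_logprob)
```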

Hypothesis 2: Is GRPO Length Bias in Effect?

The Dr.GRPO paper [6] points out an inherent bias in GRPO’s loss function that creates asymmetric incentives based on sequence length. Since we were using the GRPO algorithm out of the box, this seemed a plausible culprit.

Length Bias Statement (from Dr.GRPO)

Original GRPO computes a per-sequence loss by averaging the per-token clipped objective over that sequence’s tokens, then averages across the sequences in the group:

\[\mathcal{L}_{\text{GRPO}} = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{L_i} \sum_{t=1}^{L_i} \min \left[ \frac{\pi_\theta(y_{i,t}\mid\mathbf{q}, \mathbf{y}_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t}\mid\mathbf{q}, \mathbf{y}_{i,<t})} \hat{A}_{y_i},\ \text{clip}\left(\frac{\pi_\theta(y_{i,t}\mid\mathbf{q}, \mathbf{y}_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t}\mid\mathbf{q}, \mathbf{y}_{i,<t})},\ 1-\epsilon,\ 1+\epsilon\right) \hat{A}_{y_i} \right]\]

where the advantage $\hat{A}_{y_i}$ is computed per sequence, $L_i$ is the number of tokens in sequence $i$, and $G$ is the group size. The critical issue is that the per-sequence loss is averaged over the sequence’s own token count, while the advantage is identical for all tokens in that sequence.
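To make the structure of this objective concrete, here is a small PyTorch sketch of the per-token clipped surrogate with a single per-sequence advantage broadcast across tokens, aggregated in the seq-mean-token-mean fashion of the formula above (tensor names are illustrative). The per-sequence division by $L_i$ near the end is exactly where the length dependence enters.

```python
import torch


def grpo_loss(logprob, old_logprob, advantages, mask, eps=0.2):
    """logprob, old_logprob, mask: [G, L] tensors; advantages: [G], one value per sequence."""
    ratio = torch.exp(logprob - old_logprob)                    # per-token importance ratio
    adv = advantages.unsqueeze(-1)                              # broadcast A_i to every token
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    per_token_loss = -surrogate * mask                          # negate: we minimize the loss
    # seq-mean-token-mean: average over each sequence's own tokens (1/L_i), then over the group
    per_seq_loss = per_token_loss.sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)
    return per_seq_loss.mean()
```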

This creates asymmetric incentives:

For correct answers ($\hat{A}_{y_i} > 0$):

  • Short sequence (length $L_{\text{short}}$): gradient $\propto \frac{\hat{A}_{y_i}}{L_{\text{short}}}$ → larger per-token update
  • Long sequence (length $L_{\text{long}}$): gradient $\propto \frac{\hat{A}_{y_i}}{L_{\text{long}}}$ → smaller per-token update
  • Result: Favors shorter responses among correct ones

For incorrect answers ($\hat{A}_{y_i} < 0$):

  • Short sequence: penalty $\propto \frac{\lvert\hat{A}_{y_i}\rvert}{L_{\text{short}}}$ → larger per-token penalty
  • Long sequence: penalty $\propto \frac{\lvert\hat{A}_{y_i}\rvert}{L_{\text{long}}}$ → smaller per-token penalty
  • Result: Favors longer responses among incorrect ones

Now if we consider the setting of our pilot run with Qwen3-4B on MATH, we observe that the training reward approaches 0.8 and the validation accuracy shoots to pass@1=75%. Following Dr. GRPO’s reasoning, this would imply that the shortening effect among correct responses would dominate and drive the average length down.

qwen3-math-reward qwen3-math-validation-acc

However, applying the Dr.GRPO fix to the loss calculation gives us the figure below, which is still very similar to what we observed before:

Qwen3-4b on Math with GRPO fix

VeRL’s Loss Aggregation: Three Aggregation Methods

A closer look at VeRL reveals that its default loss aggregation is token-mean, following DAPO:

```python
if loss_agg_mode == "token-mean":  # DAPO (default)
    loss = verl_F.masked_mean(loss_mat, loss_mask)
elif loss_agg_mode == "seq-mean-token-sum":  # Dr. GRPO
    seq_losses = torch.sum(loss_mat * loss_mask, dim=-1)  # token-sum
    loss = torch.mean(seq_losses)  # seq-mean
elif loss_agg_mode == "seq-mean-token-mean":  # original GRPO
    seq_losses = torch.sum(loss_mat * loss_mask, dim=-1) / torch.sum(loss_mask, dim=-1)  # token-mean
    loss = torch.mean(seq_losses)  # seq-mean
```

Code snippet taken from VeRL codebase

This means our runs already use token-mean aggregation, which removes the per-sequence length normalization. This particular GRPO length bias is therefore not present in our setup and cannot explain our initial observations.

Let $L_i$ denote the length of sequence $i$, and $L_{\max}$ denote the maximum length (16k in our experiments). The three methods differ in how they aggregate:

1. Original GRPO (seq-mean-token-mean): \(\mathcal{L} = \frac{1}{G} \sum_{i=1}^{G} \left( \frac{1}{L_i} \sum_{t=1}^{L_i} \mathcal{L}_{i,t} \right)\)

This normalizes by actual sequence length $L_i$, creating the length bias.

2. Dr.GRPO (seq-mean-token-sum): \(\mathcal{L} = \frac{1}{G} \sum_{i=1}^{G} \sum_{t=1}^{L_i} \mathcal{L}_{i,t} = L_{\max} \cdot \frac{1}{G} \sum_{i=1}^{G} \frac{1}{L_{\max}} \sum_{t=1}^{L_i} \mathcal{L}_{i,t}\)

By summing (or equivalently, dividing by the same $L_{\max}$ for all sequences), this removes the asymmetry.

3. Token-mean (DAPO [7]) (token-mean): \(\mathcal{L} = \frac{1}{\sum_{i=1}^{G} L_i} \sum_{i=1}^{G} \sum_{t=1}^{L_i} \mathcal{L}_{i,t}\)

This treats all tokens equally regardless of which sequence they belong to, eliminating length-based bias from aggregation.
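A toy calculation makes the difference tangible. Under seq-mean-token-mean each token is weighted by $1/(G \cdot L_i)$, so tokens in long responses are down-weighted relative to tokens in short ones; token-mean and token-sum weight every token identically. The numbers below are purely illustrative.

```python
# Effective per-token weight for a group of G = 2 responses with lengths 100 and 1000 tokens.
G, lengths = 2, [100, 1000]
total_tokens = sum(lengths)

for L in lengths:
    w_seq_mean_token_mean = 1 / (G * L)   # original GRPO: 1/(G * L_i), depends on L_i
    w_token_mean = 1 / total_tokens       # DAPO: 1/sum_i L_i, identical for every token
    w_token_sum = 1 / G                   # Dr.GRPO (up to the constant 1/L_max), identical too
    print(f"L={L}: {w_seq_mean_token_mean:.5f} {w_token_mean:.5f} {w_token_sum:.5f}")

# Under seq-mean-token-mean, a token in the 100-token response carries 10x the weight of a
# token in the 1000-token response; the other two modes weight all tokens equally.
```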

Difficulty Imbalance Drives Length Growth

As we were unable to reproduce the length growth with the MATH dataset or Nemotron Science dataset, we turned to explore other datasets.

We then experimented with a hard subset of BigMath [8], a cleaned collection of 250k math problems curated for RL training. With extended training up to 1000 steps, we observed a clear U-shape in the output length.

BigMath dataset composition details. To ensure the dataset contains sufficiently challenging problems, we applied the following filtering to the original BigMath dataset:

  • Source filtering: We kept only challenging sources (olympiads, omnimath, harp, aops_forum, amc_aime, big_math, openmath) and removed sources that overlap with previous experiments or are too easy (math, gsm8k, cn_k12, orca_math).
  • Difficulty filtering: BigMath provides a pass rate using Llama3.1-8B as a reference model, so we kept only problems with a pass rate below 30%.
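A sketch of this filtering using the Hugging Face datasets API is shown below. The dataset identifier and the column names (`source`, `llama8b_solve_rate`) are our assumptions about the BigMath release, not a verified schema.

```python
from datasets import load_dataset

# Dataset id and column names are assumptions about the BigMath release.
ds = load_dataset("SynthLabsAI/Big-Math-RL-Verified", split="train")

KEEP_SOURCES = {"olympiads", "omnimath", "harp", "aops_forum",
                "amc_aime", "big_math", "openmath"}

hard_subset = ds.filter(
    lambda ex: ex["source"] in KEEP_SOURCES               # drop easy / overlapping sources
    and ex["llama8b_solve_rate"] is not None
    and ex["llama8b_solve_rate"] < 0.30                   # keep only <30% Llama-3.1-8B pass rate
)
print(f"{len(hard_subset)} hard problems retained")
```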

Qwen3-4b on BigMath

As a control treatment, if we remove the hard questions from the training set (those with pass@1 < 0.3 measured on the checkpoint at step 100), the output length fluctuates but does not show a consistent upward trend.

early-ckpt-difficulty-distribution easy-comparison

Here ‘hard’ is defined relative to the early checkpoint, so this is a model-dependent notion of difficulty, and removing those questions also changes the content distribution. Nonetheless, under this intervention we no longer observe consistent length growth, which is consistent with the idea that the hard tail is a key driver.

A Moving Reward Distribution

To understand why a dataset of mixed difficulty causes length growth, we create an easy split and a hard split of the validation set based on the pass rate of an early checkpoint (at step 100). The easy split contains questions with pass@1 > 90% and the hard split contains questions with pass@1 < 10%. We then examine the reward distribution and output-length distribution for each category.
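The split itself is straightforward to construct from rollouts of the step-100 checkpoint; here is a sketch under the assumption that we have per-question correctness records over 64 samples (the helper below is illustrative):

```python
from collections import defaultdict


def split_by_pass_rate(records, n_samples=64):
    """records: iterable of (question_id, is_correct) pairs sampled from the early checkpoint."""
    correct_counts = defaultdict(int)
    for qid, is_correct in records:
        correct_counts[qid] += int(is_correct)

    easy, hard = [], []
    for qid, k in correct_counts.items():
        pass_at_1 = k / n_samples        # empirical pass@1 on the step-100 checkpoint
        if pass_at_1 > 0.9:
            easy.append(qid)             # easy split: pass@1 > 90%
        elif pass_at_1 < 0.1:
            hard.append(qid)             # hard split: pass@1 < 10%
    return easy, hard
```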

What we expect: hard questions should require longer responses to solve than easy questions.

On the hard question subset, our observations confirm this intuition: the model earns higher reward when the output is longer (the flat tail appears because the model never generates outputs that long).

Hard-split reward and output-length distributions, early vs. late checkpoint

The median output length significantly shifts over training, demonstrating a mode-seeking behavior towards higher reward.

For the easy questions, since the model already solves them reliably, we would expect the length to stay unchanged, as there is little room for improvement.

However, when we examine the empirical reward distribution, we observe a rightward shift of the reward peak, indicating that the model can now maintain correctness at longer response lengths. The overall inverted-U shape remains—extremely long responses still decrease accuracy—but the curve has become flatter and the optimal length has shifted from approximately 500 to 700 tokens. This means the model has learned to generate longer responses while preserving accuracy on easy problems.

early-ckpt-easy-reward late-ckpt-easy-reward

The output length distribution also shifts in accordance with the reward distribution.

Easy-split output-length distributions, early vs. late checkpoint

Combining our observations on easy problems and hard problems, we can quantify the shift in both output length and performance:

| Checkpoint | Easy Pass Rate | Easy Avg Length (tokens) | Hard Pass Rate | Hard Avg Length (tokens) |
| --- | --- | --- | --- | --- |
| Step 100 (early) | 95.1% | 530 | 2.3% | 1039 |
| Step 1000 (late) | 97.0% | 661 | 5.5% | 1301 |

The table shows that both easy and hard problems see their lengths increase during training, with the magnitude of the shift depending on problem difficulty.

This reveals the mechanism behind length growth on mixed-difficulty datasets. The model learns to generate longer outputs to solve hard problems (step 100→1000: +262 tokens, pass rate 2.3%→5.5%). However, this policy shift affects all problems broadly—easy problems also become longer (+131 tokens). Crucially, the model learns to maintain correctness on easy problems even at these longer lengths, which is reflected in the flattening of their empirical reward distribution. Without hard problems present (easy-only training), the model has no incentive to grow longer and the distribution remains stable.

The Principle of Asymmetric Risk
💡 Key Insight: We hypothesize this behavior is driven by asymmetric risk. For a mixed-difficulty dataset, the model faces a trade-off:
On Hard Problems: "Thinking short" carries a high risk of failure (low reward). "Thinking long" is necessary to solve them.
On Easy Problems: "Thinking long" carries little risk (as long as the length stays within the flattened band). As the reward distribution flattens, the model can spend extra tokens without hurting its accuracy.
Consequently, the model adopts a "better safe than sorry" global policy. It learns to extend its reasoning to cover the hard tail of the distribution. Since this extended length doesn't penalize performance on easy problems, the length drift bleeds into the easy subset—resulting in the "overthinking" we observe.
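To make this trade-off concrete, consider a toy expected-reward calculation using the pass rates from the table above and an assumed 50/50 easy/hard mixture (the mixture weight is ours, purely for illustration). The “think long” policy wins in expectation because the gain on the hard tail costs essentially nothing on the easy split:

\[\mathbb{E}[R \mid \text{short}] \approx 0.5 \times 0.951 + 0.5 \times 0.023 \approx 0.49, \qquad \mathbb{E}[R \mid \text{long}] \approx 0.5 \times 0.970 + 0.5 \times 0.055 \approx 0.51\]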

Case Studies

To understand how length growth manifests in individual outputs, we examine two problems at different difficulty levels.

Easy Problem: Coin Counting

Problem: Petya has 25 coins, each with a denomination of 1, 2, 5, or 10 rubles. Among these coins, 19 are not two-ruble coins, 20 are not ten-ruble coins, and 16 are not one-ruble coins. How many five-ruble coins does Petya have?

For this problem, the output length grew from 493 tokens (step 100) to 602 tokens (step 1000) on average, a 22% increase.

A concise solution produced by the model looks like the following, at 338 tokens:

📝 Example: Concise solution (338 tokens)
To determine the number of five-ruble coins Petya has, we will use the information given and set up a system of equations. Let's denote the number of one-ruble coins by $a$, the number of two-ruble coins by $b$, the number of five-ruble coins by $c$, and the number of ten-ruble coins by $d$. **We know the following:** 1. The total number of coins is 25. 2. There are 19 coins that are not two-ruble coins, which means there are $25 - 19 = 6$ two-ruble coins. 3. There are 20 coins that are not ten-ruble coins, which means there are $25 - 20 = 5$ ten-ruble coins. 4. There are 16 coins that are not one-ruble coins, which means there are $25 - 16 = 9$ one-ruble coins. **From the above information, we can write the following equations:** $$a + b + c + d = 25$$ $$b = 6, \quad d = 5, \quad a = 9$$ **Now, we substitute** $b = 6$, $d = 5$, and $a = 9$ into the first equation: $$9 + 6 + c + 5 = 25$$ $$20 + c = 25$$ $$c = 5$$ Thus, the number of five-ruble coins Petya has is $\boxed{5}$.

After training (step 1000), outputs became 22% longer. Here’s a representative example (595 tokens):

📝 Example: Step 1000 solution with verification (595 tokens)
Let's denote the number of 1-ruble coins, 2-ruble coins, 5-ruble coins, and 10-ruble coins by $a$, $b$, $c$, and $d$ respectively. We know the following: 1. The total number of coins is 25: $a + b + c + d = 25$ 2. There are 19 coins that are not 2-ruble coins: $a + c + d = 19$ 3. There are 20 coins that are not 10-ruble coins: $a + b + c = 20$ 4. There are 16 coins that are not 1-ruble coins: $b + c + d = 16$ **We need to find the value of** $c$, the number of 5-ruble coins. **First,** we subtract the second equation from the first equation: $$(a + b + c + d) - (a + c + d) = 25 - 19$$ $$b = 6$$ **Next,** we subtract the third equation from the first equation: $$(a + b + c + d) - (a + b + c) = 25 - 20$$ $$d = 5$$ **Then,** we subtract the fourth equation from the first equation: $$(a + b + c + d) - (b + c + d) = 25 - 16$$ $$a = 9$$ **Now we substitute** $a = 9$, $b = 6$, and $d = 5$ into the first equation: $$9 + 6 + c + 5 = 25$$ $$20 + c = 25$$ $$c = 5$$ **To verify, we check the consistency of the values with the given conditions:** 1. The total number of coins is: $a + b + c + d = 9 + 6 + 5 + 5 = 25$ ✓ 2. The number of coins that are not 2-ruble coins is: $a + c + d = 9 + 5 + 5 = 19$ ✓ 3. The number of coins that are not 10-ruble coins is: $a + b + c = 9 + 6 + 5 = 20$ ✓ 4. The number of coins that are not 1-ruble coins is: $b + c + d = 6 + 5 + 5 = 16$ ✓ All conditions are satisfied, so the number of 5-ruble coins is $\boxed{5}$.

Analysis: The solution remains correct but becomes more elaborate. The model now includes explicit verification steps, checking each condition after deriving the answer. While this adds length, it doesn’t provide computational advantage for such a straightforward problem.

Harder Problem: Circle and Right Triangle Geometry

Problem: On the leg BC of right triangle ABC, a circle is constructed with BC as the diameter. The circle intersects the hypotenuse at point D such that AD:BD = 1:3. The altitude from vertex C (the right angle) to the hypotenuse is 3. Find the leg BC.

Geometry Problem Visualization

Concise solution approach (using geometric mean theorem):

  1. Since BC is diameter, ∠BDC = 90° by Thales’ theorem → D is the altitude foot
  2. Apply altitude geometric mean: h² = AD · BD
    • 3² = x · 3x (where AD = x, BD = 3x)
    • x = √3, so AB = 4√3
  3. Apply leg-segment relationship: BC² = BD · AB = 3√3 · 4√3 = 36
    • BC = 6

For this problem, outputs from step 1000 averaged 2021 tokens compared to 1544 tokens at step 100—a 31% increase. Solutions from later training tend to use algebraic approaches with systems of equations rather than leveraging geometric relationships directly.

Representative solution from step 100 (clean geometric approach):

✅ Example: Clean geometric solution from step 100
To solve the problem, we need to use the properties of the right triangle and the circle constructed on the hypotenuse. Let's denote the right triangle $ABC$ with $\angle ACB = 90^\circ$, and let $BC = a$, $AC = b$, and $AB = c$. The altitude from $C$ to the hypotenuse $AB$ is given as 3. This altitude splits the hypotenuse into two segments, $AD$ and $BD$, such that $AD : BD = 1 : 3$. **Step 1: Apply geometric mean theorem** In a right triangle, the altitude to the hypotenuse creates two smaller right triangles that are similar to the original triangle and to each other. Therefore, the altitude $CD$ is the geometric mean of the segments of the hypotenuse it creates. Let $AD = x$ and $BD = 3x$. Then, $AB = AD + BD = x + 3x = 4x$. Since the altitude $CD$ is 3, we have: $$CD^2 = AD \cdot BD$$ $$3^2 = x \cdot 3x$$ $$9 = 3x^2$$ $$x = \sqrt{3}$$ Thus, $AD = \sqrt{3}$ and $BD = 3\sqrt{3}$, and the hypotenuse $AB = 4\sqrt{3}$. **Step 2: Use Thales' theorem** Since the circle is constructed on $BC$ as the diameter and intersects the hypotenuse at point $D$, we know that $\angle BDC = 90^\circ$ (Thales' theorem). This means that $D$ is the foot of the altitude from $C$ to $AB$. **Step 3: Apply Pythagorean theorem** Since $D$ is the foot of the altitude, we can find the length of $BC$ using the Pythagorean theorem in the smaller right triangle $BDC$: $$BC^2 = BD^2 + CD^2$$ $$BC^2 = (3\sqrt{3})^2 + 3^2$$ $$BC^2 = 27 + 9 = 36$$ $$BC = 6$$ Thus, the length of $BC$ is $\boxed{6}$.

This solution efficiently combines the geometric mean theorem, Thales’ theorem, and the Pythagorean theorem to reach the answer in a clear logical sequence.

However, not all step 100 outputs are this clean. Some start correctly but make algebraic errors mid-solution, yet still arrive at the correct answer:

⚠️ Example: Flawed solution from step 100 (correct answer despite errors)
To solve for the length of the leg $BC$ in the right triangle $ABC$ with the given conditions, we will use geometric properties and relationships. Let's denote the lengths of the sides as follows: $AB = c$, $BC = a$, and $CA = b$. The altitude from $C$ to the hypotenuse $AB$ is given as 3. The circle with diameter $BC$ intersects the hypotenuse $AB$ at point $D$ such that $AD : BD = 1 : 3$. First, we use the property of the right triangle and the altitude to the hypotenuse. The altitude $CH$ from $C$ to $AB$ can be expressed as: $$CH = \frac{ab}{c} = 3$$ This gives us the relationship: $ab = 3c$. Next, we use the fact that $AD : BD = 1 : 3$. Let $AD = x$ and $BD = 3x$. Therefore, the length of the hypotenuse $AB$ is: $$AB = AD + BD = x + 3x = 4x$$ Since $AB = c$, we have $c = 4x$. **❌ Error: Fabricated formula.** Now, we use the fact that $D$ lies on the circle with diameter $BC$. This means that $\angle BDC = 90^\circ$. By the power of a point theorem (also known as the intersecting chords theorem), we have: $$AD \cdot BD = CD \cdot HD$$ This formula is incorrect! The power of point theorem doesn't apply this way. Since $D$ is on the circle with diameter $BC$, $CD$ is the altitude from $C$ to $AB$, and $HD$ is the segment of the altitude from $D$ to $H$. However, since $D$ is on the circle, we can use the fact that $CD = CH = 3$ and $HD = 0$ (since $D$ is on the hypotenuse and the altitude is perpendicular to the hypotenuse). Thus, we have: $$AD \cdot BD = 3 \cdot (c - 3)$$ Substituting $AD = x$ and $BD = 3x$, we get: $$x \cdot 3x = 3 \cdot (4x - 3)$$ $$3x^2 = 12x - 9$$ $$x^2 - 4x + 3 = 0$$ $$(x - 1)(x - 3) = 0$$ Thus, the solutions are $x = 1$ or $x = 3$. Since $x = 3$ would make $AB = 12$, which is not possible given the altitude constraint, we have $x = 1$. Therefore, $AB = 4x = 4$. Now, we need to find the length of $BC$. Using the relationship $ab = 3c$ and knowing $c = 4$, we have: $$ab = 3 \cdot 4 = 12$$ We also know that in a right triangle, the legs $a$ and $b$ satisfy the Pythagorean theorem: $$a^2 + b^2 = c^2 = 16$$ We can solve this system by expressing $b$ in terms of $a$: $$b = \frac{12}{a}$$ Substituting into the Pythagorean equation: $$a^4 - 16a^2 + 144 = 0$$ Let $y = a^2$: $$y^2 - 16y + 144 = 0$$ $$y = \frac{16 \pm \sqrt{256 - 576}}{2} = \frac{16 \pm \sqrt{-320}}{2}$$ **❌ Error: Negative discriminant.** This gives a negative discriminant, indicating the approach has gone wrong! Since the discriminant is negative, we need to re-evaluate the possible values of $a$ and $b$. Given the constraints, we can solve for $a$ and $b$ directly by trial and error or by solving the quadratic equation correctly. The correct values are $a = 3$ and $b = 4$ or vice versa. Therefore, the length of $BC$ is: $$\boxed{6}$$

This example shows a common pattern at step 100: the solution makes mathematical errors (fabricated power-of-point formula, negative discriminant) but somehow recovers to produce the correct final answer—suggesting the model may be leveraging memorized answers rather than purely following its derivation.

Representative solution from step 1000 (algebraic brute-force approach):

📐 Example: Long algebraic solution from step 1000 (~100 lines)
To solve the problem, we start with the right triangle $ABC$ where $\angle C = 90^\circ$. Let $BC = a$ and $AC = b$. The hypotenuse $AB = c$ can be found using the Pythagorean theorem: $$c = \sqrt{a^2 + b^2}$$ A circle is constructed with $BC$ as the diameter, so the center of the circle is the midpoint of $BC$, and the radius is $\frac{a}{2}$. Since $BC$ is the diameter, any angle subtended by $BC$ on the circle is a right angle. Therefore, $\angle BDC = 90^\circ$, making $D$ the foot of the perpendicular from $C$ to the hypotenuse $AB$. Given that $AD : BD = 1 : 3$, we can let $AD = x$ and $BD = 3x$. Thus, the length of the hypotenuse $AB = c = AD + BD = x + 3x = 4x$. The altitude from $C$ to the hypotenuse $AB$ is given as 3. Let the foot of the perpendicular from $C$ to $AB$ be $D$. The area of the triangle $ABC$ can be calculated in two ways. First, using the legs $a$ and $b$: $$\text{Area} = \frac{1}{2} \times a \times b$$ Second, using the hypotenuse $c$ and the altitude from $C$ to $AB$: $$\text{Area} = \frac{1}{2} \times AB \times CD = \frac{1}{2} \times 4x \times 3 = 6x$$ Equating the two expressions for the area, we get: $$\frac{1}{2} \times a \times b = 6x \implies ab = 12x$$ Since $AB = c = 4x$, we use the Pythagorean theorem: $$c = \sqrt{a^2 + b^2} = 4x \implies a^2 + b^2 = (4x)^2 = 16x^2$$ We now have two key equations: 1. $ab = 12x$ 2. $a^2 + b^2 = 16x^2$ We can express $b$ in terms of $a$ and $x$ from the first equation: $$b = \frac{12x}{a}$$ Substitute $b = \frac{12x}{a}$ into the second equation: $$a^2 + \left(\frac{12x}{a}\right)^2 = 16x^2 \implies a^2 + \frac{144x^2}{a^2} = 16x^2$$ Multiply through by $a^2$ to clear the fraction: $$a^4 + 144x^2 = 16x^2a^2 \implies a^4 - 16x^2a^2 + 144x^2 = 0$$ Let $y = a^2$. Then the equation becomes a quadratic in $y$: $$y^2 - 16x^2y + 144x^2 = 0$$ We solve this quadratic equation using the quadratic formula $y = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$: $$y = \frac{16x^2 \pm \sqrt{(16x^2)^2 - 4 \cdot 1 \cdot 144x^2}}{2 \cdot 1} = \frac{16x^2 \pm \sqrt{256x^4 - 576x^2}}{2} = \frac{16x^2 \pm \sqrt{64x^2(4x^2 - 9)}}{2} = \frac{16x^2 \pm 8x\sqrt{4x^2 - 9}}{2} = 8x^2 \pm 4x\sqrt{4x^2 - 9}$$ So, we have two solutions for $y$: $$y = 8x^2 + 4x\sqrt{4x^2 - 9} \quad \text{or} \quad y = 8x^2 - 4x\sqrt{4x^2 - 9}$$ Since $y = a^2$, we need to determine which solution is valid. We know $a$ and $b$ are positive, so we need to check the feasibility of both solutions. If $y = 8x^2 + 4x\sqrt{4x^2 - 9}$, then: $$a^2 = 8x^2 + 4x\sqrt{4x^2 - 9}$$ If $y = 8x^2 - 4x\sqrt{4x^2 - 9}$, then: $$a^2 = 8x^2 - 4x\sqrt{4x^2 - 9}$$ We need to find $x$ such that both $a$ and $b$ are real and positive. Let's test the feasible solution: Since $b = \frac{12x}{a}$, we substitute $a^2 = 8x^2 + 4x\sqrt{4x^2 - 9}$: $$b^2 = \left(\frac{12x}{a}\right)^2 = \frac{144x^2}{a^2} = \frac{144x^2}{8x^2 + 4x\sqrt{4x^2 - 9}} = \frac{144x^2}{4x(2x + \sqrt{4x^2 - 9})} = \frac{36x}{2x + \sqrt{4x^2 - 9}}$$ Rationalizing the denominator: $$b^2 = \frac{36x(2x - \sqrt{4x^2 - 9})}{(2x + \sqrt{4x^2 - 9})(2x - \sqrt{4x^2 - 9})} = \frac{36x(2x - \sqrt{4x^2 - 9})}{4x^2 - (4x^2 - 9)} = \frac{36x(2x - \sqrt{4x^2 - 9})}{9} = 4x(2x - \sqrt{4x^2 - 9})$$ We need to check if $a^2 + b^2 = 16x^2$: $$a^2 + b^2 = (8x^2 + 4x\sqrt{4x^2 - 9}) + 4x(2x - \sqrt{4x^2 - 9}) = 8x^2 + 4x\sqrt{4x^2 - 9} + 8x^2 - 4x\sqrt{4x^2 - 9} = 16x^2$$ This confirms our solution is correct. Now we need to find $a$: Since $a^2 = 8x^2 + 4x\sqrt{4x^2 - 9}$, we need to find $x$. 
Using the altitude property: The area of the triangle is also given by: $$\text{Area} = \frac{1}{2} \times AB \times CD = \frac{1}{2} \times 4x \times 3 = 6x$$ We already have $ab = 12x$. To find $x$, we use the relationship involving the segments of the hypotenuse: Since $AD = x$ and $BD = 3x$, we use the property of the altitude in a right triangle: $$CD^2 = AD \cdot BD = x \cdot 3x = 3x^2 \implies CD = \sqrt{3x^2} = x\sqrt{3}$$ But we know $CD = 3$, so: $$x\sqrt{3} = 3 \implies x = \sqrt{3}$$ Now we find $a$: $$a^2 = 8x^2 + 4x\sqrt{4x^2 - 9} = 8(\sqrt{3})^2 + 4\sqrt{3}\sqrt{4(\sqrt{3})^2 - 9} = 8 \cdot 3 + 4\sqrt{3}\sqrt{12 - 9} = 24 + 4\sqrt{3} \cdot \sqrt{3} = 24 + 4 \cdot 3 = 24 + 12 = 36$$ So, $a = \sqrt{36} = 6$. Thus, the length of $BC$ is $\boxed{6}$.

Analysis: This step 1000 solution is mathematically complete and rigorous—every algebraic step is justified and the logic is sound. However, it takes an algebraic brute-force approach, setting up a quartic equation and using the quadratic formula with denominator rationalization. The solution doesn’t leverage the elegant geometric insight upfront (that D = H, making the geometric mean theorem directly applicable).

What could be solved in 3 key steps (geometric mean → find x → apply the leg-segment relationship) instead becomes a ~100-line algebraic derivation. The model only applies the geometric mean theorem near the end, after extensive algebraic machinery. While this demonstrates strong algebraic skills, it represents a shift away from geometric insight toward computational thoroughness.

Key Takeaways from Case Studies

These examples illustrate two manifestations of length growth:

  1. Added verification steps (coin problem): The model learns to explicitly verify solutions even when unnecessary, adding ~100 tokens without improving accuracy on simple problems.

  2. Shift toward algebraic thoroughness (geometry problem): Later checkpoints favor algebraic brute-force methods over geometric insights, even when the geometric approach is more elegant and efficient. This increases token count by ~500 tokens while maintaining correctness.

Both patterns demonstrate how training on mixed-difficulty datasets causes the model to apply more elaborate reasoning strategies across all problems—a form of “overthinking” on easy questions that emerges as a side effect of learning to solve harder ones.

Mitigating Overthinking: Length Penalty Interventions

Overthinking has been widely recognized as computationally wasteful, leading to growing interest in mitigation strategies. The most common approach is reward shaping through length penalties—penalties applied to longer outputs. Truncation can be viewed as an extreme length penalty where outputs exceeding a limit receive zero advantage.

We test several penalty variants and connect them to our findings on the mechanism behind length growth.

The Challenge

The asymmetric risk principle creates a dilemma: the model needs longer reasoning chains for hard problems, but this drives wasteful length growth on easy problems where correctness can be maintained at shorter lengths. An ideal intervention would:

  1. Preserve hard problem performance: Allow the model to “think long” when necessary
  2. Prevent easy problem waste: Discourage unnecessarily long responses on simple problems
  3. Adapt to problem difficulty: Different problems have different “natural” solution lengths

Length Penalty Variants

We implement and evaluate three penalty mechanisms, each with different strategies for balancing efficiency and performance.

Variable definitions:

  • $R \in \{0, 1\}$ is the binary correctness reward
  • $L$ is the response length in tokens
  • $\alpha, \beta, \gamma$ are penalty coefficients (hyperparameters)

1. Uniform Length Penalty

\[\text{reward} = R - \alpha \times L\]

The simplest approach applies a constant penalty per token across all problems, regardless of difficulty.

2. Group-Based Adaptive Penalty (Kimi K1.5 [10])

\[\lambda = 0.5 - \frac{L - \text{min\_len}}{\text{max\_len} - \text{min\_len}}\] \[\text{reward} = \begin{cases} R + \beta \times \lambda & \text{(correct responses)} \\ R + \beta \times \min(0, \lambda) & \text{(incorrect responses)} \end{cases}\]

Inspired by Kimi K1.5 [10], this approach tracks the min/max response length per prompt and applies difficulty-aware penalties. It automatically adapts to each problem’s natural length range, rewarding brevity for correct responses while penalizing long incorrect ones. Note that $\lambda \in [-0.5, 0.5]$ normalizes the penalty based on the group’s length distribution.

3. Anchor Length Penalty

\[r = \max\left(0, \frac{L - \text{anchor\_length}}{\text{anchor\_length}}\right)\] \[\text{reward} = R - \gamma \times r\]

where $\text{anchor\_length}$ is set when the model first achieves reliable competence (pass@1 > 90%).

This penalty is also adaptive per prompt, but uses a reference length from when the model first demonstrated competence on that problem, rather than the current group’s responses.
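The three penalties can be summarized in a few lines; the sketch below mirrors the formulas above (function names and the per-prompt bookkeeping are illustrative, not the exact training code).

```python
def uniform_penalty(R: float, L: int, alpha: float = 1e-4) -> float:
    """Uniform length penalty: reward = R - alpha * L, regardless of difficulty."""
    return R - alpha * L


def group_penalty(R: float, L: int, group_lengths: list, beta: float = 0.1) -> float:
    """Kimi-K1.5-style penalty, normalized by the group's min/max response length."""
    min_len, max_len = min(group_lengths), max(group_lengths)
    if max_len == min_len:
        return R
    lam = 0.5 - (L - min_len) / (max_len - min_len)   # lam in [-0.5, 0.5]
    return R + beta * (lam if R > 0 else min(0.0, lam))


def anchor_penalty(R: float, L: int, anchor_length: int, gamma: float = 0.01) -> float:
    """Anchor penalty: only length beyond the per-prompt anchor length is penalized."""
    r = max(0.0, (L - anchor_length) / anchor_length)
    return R - gamma * r
```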

Experimental Results

In the experiments below, we set $\alpha=10^{-4}, \beta=0.1, \gamma=0.01$. These values were chosen so the penalty magnitude roughly matches the binary reward (given initial output lengths of $\sim 1000$ tokens).

Note: These are initial hyperparameter settings; further tuning may improve performance.

The figures below compare the four approaches across training:

Control (Pink) - No Penalty Baseline: Shows continued length growth from ~1000 to ~1300 tokens as the model learns to solve harder problems, with easy problem lengths also increasing (the “overthinking” phenomenon).

Uniform Penalty (Orange) - The Exploration Tax: Because the penalty is applied indiscriminately, at the weight $\alpha$ we use the model cannot afford the extra tokens required to solve hard problems. The gradient signal is dominated by the length penalty, forcing the model to collapse to empty responses.

Group-Based Penalty (Green) - The Race to the Bottom: While initially stable, the relative nature of this penalty creates a moving target. As the batch becomes more efficient, the definition of “long” shifts downwards. The model chases this moving target until it optimizes away the reasoning chain entirely, leading to a total collapse in capability (score drops to 0). This may be why the Kimi K1.5 authors note that the length reward slows down training initially and why they gradually warm up the length penalty during training.

Anchor Penalty (Teal) - Stable Efficiency: By contrast, the Anchor Penalty successfully stabilizes the length. It arrests the heavy-tail drift seen in the control run but does not force the model below its necessary reasoning baseline. It effectively locks in the efficiency of the early training stages without suffocating exploration.

length-penalty-comparison-length length-penalty-comparison-reward

Quantitative comparison on easy vs. hard problems:

| Checkpoint | Penalty | Easy Pass Rate | Easy Avg Length (tokens) | Hard Pass Rate | Hard Avg Length (tokens) |
| --- | --- | --- | --- | --- | --- |
| Step 100 | None | 95.1% | 530 | 2.3% | 1039 |
| Step 1000 | None | 97.0% | 661 | 5.2% | 1288 |
| Step 1000 | Anchor ($\gamma=0.01$) | 97.0% | 563 | 4.6% | 958 |
| Step 1000 | Anchor ($\gamma=0.1$) | 96.5% | 533 | 4.5% | 1067 |

All results are computed as the average of 64 samples per prompt.

The Anchor Penalty with $\gamma=0.01$ successfully reduces easy problem length by 15% (661→563 tokens) and hard problem length by 26% (1301→958 tokens) while maintaining comparable accuracy to the no-penalty baseline.

Summary

Our empirical investigation reveals dataset difficulty composition as the primary factor determining length dynamics under our Zero-RL setup.

Findings

  1. Datasets with narrow difficulty range (MATH, Science) → Length convergence
    • Models trained on datasets whose difficulty is well matched to the model’s ability converge to stable output lengths
    • Neither KL divergence nor GRPO length bias explains this behavior
  2. Mixed-difficulty datasets with heavy hard-tail (BigMath) → Length growth
    • Datasets with difficulty imbalance produce U-shaped length dynamics
    • Removing hard problems eliminates consistent length growth
    • Hard problems drive the overall length increase
  3. The mechanism: Moving reward distributions cause asymmetric risk
    • Hard problems get reward gains from longer reasoning (1039→1301 tokens, +262 tokens)
    • This policy shift affects all problems broadly (easy problems: 530→661 tokens, +131 tokens)
    • Models learn to maintain correctness on easy problems at longer lengths
    • The empirical reward distribution for easy problems flattens to accommodate this shift
  4. Difficulty-aware length penalties can mitigate overthinking without harming performance
    • Uniform penalties cause exploration collapse by taxing all reasoning equally
    • Group-based adaptive penalties create unstable moving targets that might lead to capability collapse
    • Anchor-based penalties successfully stabilize length by locking to early-training efficiency

Limitations and Future Work

Model families and contamination: Our study uses the Qwen model family with Qwen3-4B as the base model in the majority of the experiments. Recent work has noted that some Qwen variants may exhibit training-data leakage on math benchmarks, and our own case studies show examples where the model reaches the correct final answer with flawed intermediate reasoning. Our analysis therefore focuses on answer-level correctness and length dynamics for this particular model under binary verifier rewards. We do not claim that the observed behavior necessarily reflects faithful mathematical reasoning, nor that it generalizes to all model families.

Length penalty applicability: The anchor penalty approach assumes that models start with reasonable output lengths and grow longer during training. It may not be suitable for models that are initially too verbose—for example, models initialized from checkpoints distilled with long chain-of-thought traces. In such cases, the anchor would be set at an already-inflated length, failing to provide efficiency gains. In those cases, a pre-set reference length like that proposed in the LASER paper [11] would be a good substitute. Our experiments use base models trained with RL from scratch (R1-Zero paradigm). A natural next step is evaluating these penalty mechanisms on supervised fine-tuned (SFT) models that may exhibit different initial length distributions. We plan to investigate this in the next phase of our experiments.

Theoretical understanding: While we’ve established the empirical patterns of length dynamics, the underlying mechanisms require further theoretical investigation. Key open questions include understanding why reward distributions exhibit inverted-U shapes across different lengths, how GRPO’s mode-matching behavior interacts with these reward curves, and whether we can develop predictive models for length dynamics on mixed-difficulty datasets.


References

[1] DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Technical Report. https://arxiv.org/abs/2501.12948

[2] DeepScaleR Team. (2025). DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2

[3] HKUST NLP Group. (2025). Simple-RL-Zero: Replicating DeepSeek-R1-Zero’s Reasoning with Reinforcement Learning. https://hkust-nlp.notion.site/simplerl-reason

[4] Zhang, Y., et al. (2025). When More is Less: Understanding Chain-of-Thought Length in LLMs. arXiv preprint.

[5] NVIDIA. (2024). Nemotron-4 340B Technical Report. https://arxiv.org/abs/2406.11704

[6] Liu, Z., et al. (2025). Understanding R1-Zero-Like Training: A Critical Perspective. arXiv:2503.20783. https://arxiv.org/abs/2503.20783

[7] Xiong, W., et al. (2025). DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv:2503.14476. https://arxiv.org/abs/2503.14476

[8] Luo, H., et al. (2025). BigMath: A Large-Scale High-Quality Mathematical Dataset for Enhanced Reasoning. arXiv:2502.17387. https://arxiv.org/abs/2502.17387

[9] Wei, Z., et al. (2025). Concise Reasoning via Reinforcement Learning. arXiv:2504.05185. https://arxiv.org/abs/2504.05185

[10] Moonshot AI. (2025). Kimi k1.5: Scaling Reinforcement Learning with LLMs. Technical Report. https://github.com/MoonshotAI/Kimi-k1.5

[11] Liu, W., Zhou, R., Deng, Y., Huang, Y., Liu, J., Deng, Y., Zhang, Y., and He, J. (2025). Learn to Reason Efficiently with Adaptive Length-based Reward Shaping. arXiv:2505.15612. https://arxiv.org/abs/2505.15612