Understanding Length Dynamics in RL Training
Mystery of Length Growth from RL
DeepSeek-R1 [1] reported that Reinforcement Learning (RL) can drive models to “think” for thousands of steps to solve complex problems. This “thinking” process—manifested as extended Chain-of-Thought (CoT)—is widely believed to be the key to breakthrough reasoning performance.

However, replicating this length explosion has proven elusive. While some replication efforts report a U-shaped growth pattern, others observe lengths declining or stagnating over training.
Two early replication efforts of DeepSeek-R1, DeepScaleR [2] and Simple-RL-Zero [3], both reported that output length follows a U-shape as training progresses.

At the same time, the paper “When More is Less: Understanding Chain-of-Thought Length in LLMs” [4] and the Concise Reasoning paper [9] both report declining output length over training:

This brings us to a fundamental question: What actually drives a model to think longer? Is it an inherent property of RL, a bias in the loss function design, or something else entirely? In this post, we systematically isolate the drivers of length dynamics.
Experimental Setup
To cleanly reproduce DeepSeek-R1-Zero behavior, all of the experiments mentioned in this post use the following configuration unless otherwise specified:
- Model initialization: We use base model checkpoints (not instruction-tuned), following the R1-Zero approach of starting from pretrained weights
- RL algorithm: We use the default GRPO implementation in VeRL with a clip ratio of 0.2.
- Reward design: We employ only a binary correctness reward based on answer parsing (using math_verify for math datasets), with no format reward and no KL regularization term
- Generation parameters: The maximum output length is set to 16,384 tokens to minimize truncation effects. The sampling temperature is set to 1.0 and top-p = 1.0.
- Hyperparameters: Initial learning rate of 1e-6, batch size of 32 with 8 rollouts per prompt
This minimal setup allows us to isolate the effects of pure correctness-based RL on length dynamics, without confounding factors from auxiliary rewards or regularization terms.
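For concreteness, the binary correctness reward amounts to parsing the final answer and checking it against the ground truth. Below is a minimal sketch assuming the Math-Verify package's parse/verify interface; it is illustrative rather than the exact reward function used in our runs.

```python
# Minimal sketch of a binary correctness reward (illustrative, not the exact
# implementation used in our runs). Assumes the Math-Verify package:
#   pip install math-verify
from math_verify import parse, verify


def correctness_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the parsed final answer matches the ground truth, else 0.0.

    No format reward and no length term: the only signal is answer correctness.
    """
    try:
        gold = parse(ground_truth)   # parse the reference answer
        pred = parse(response)       # parse the final answer from the rollout
        return 1.0 if verify(gold, pred) else 0.0
    except Exception:
        # Unparseable output counts as incorrect.
        return 0.0
```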
Initial Experiments: Length Decrease and Stabilization
Our pilot run with Qwen3-4B on MATH failed to replicate the length-growth effect: output length decreased and then stabilized. The same held for the Qwen2.5-3B and Qwen2.5-1.5B models. Note that we set the maximum output length to 16k to minimize truncation. Although the Qwen2.5-1.5B model takes longer to learn the correct format, all models converged to an output length of ~600 tokens by the end of training.
Even after we switched to the Nemotron Science dataset [5], we were still consistently observing this pattern of decrease and stabilization (bottom right figure).
Hypothesis 1: Is KL Divergence the Key?
The Demystifying CoT paper [4] attempts to reproduce the curve from Simple-RL-Zero [3] but fails to observe substantial length growth. It attributes the mild growth it does see to KL regularization pulling the model back toward the long-output distribution it started from.

Following this hypothesis, we added KL divergence back to the objective with a weight of 0.01 (same as Simple-RL-Zero):
Adding this KL term does not recover length growth; in fact, the KL loss increases while lengths still shrink. This suggests that KL regularization alone does not explain the discrepancy with Simple-RL-Zero and Demystifying CoT under our experiment settings.
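For reference, the KL penalty we describe is the standard non-negative estimator used in the GRPO formulation, added per token with the 0.01 coefficient. The sketch below is illustrative; tensor names are ours rather than VeRL's internals.

```python
import torch


def kl_penalty(logprob: torch.Tensor,
               ref_logprob: torch.Tensor,
               kl_coef: float = 0.01) -> torch.Tensor:
    """Per-token KL penalty added to the policy loss.

    Uses the non-negative estimator from the GRPO paper:
        KL ≈ exp(log p_ref - log p) - (log p_ref - log p) - 1
    `logprob` / `ref_logprob` have shape (batch, seq_len).
    """
    log_ratio = ref_logprob - logprob
    kl = torch.exp(log_ratio) - log_ratio - 1.0   # >= 0 elementwise
    return kl_coef * kl  # added to the per-token loss before aggregation
```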
Hypothesis 2: Is GRPO Length Bias in Effect?
The Dr.GRPO paper [6] points out an inherent bias in GRPO’s loss function that creates asymmetric incentives based on sequence length. Since we were using GRPO out of the box, this seemed like a plausible culprit.
Length Bias Statement (from Dr.GRPO)
Original GRPO computes the loss per sequence by averaging the advantage over tokens, then averaging across sequences:
\[\mathcal{L}_{\text{GRPO}} = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{L_i} \sum_{t=1}^{L_i} \min \left[ \frac{\pi_\theta(y_{i,t}|\mathbf{q}, \mathbf{o}_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t}|\mathbf{q}, \mathbf{o}_{i,<t})} \hat{A}_{y_i}, \text{clip}\left(\frac{\pi_\theta(y_{i,t}|\mathbf{q}, \mathbf{o}_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t}|\mathbf{q}, \mathbf{o}_{i,<t})}, 1-\epsilon, 1+\epsilon\right) \hat{A}_{y_i} \right]\]
where the advantage $\hat{A}_{y_i}$ is computed per sequence, $L_i$ is the length of sequence $i$, and $G$ is the group size. The critical issue is that the per-sequence loss averages over the token count $L_i$, while the advantage is the same for all tokens in a sequence.
This creates asymmetric incentives:
For correct answers ($\hat{A}_{y_i} > 0$):
- Short sequence (length $L_{\text{short}}$): gradient $\propto \frac{\hat{A}_{y_i}}{L_{\text{short}}}$ → larger per-token update
- Long sequence (length $L_{\text{long}}$): gradient $\propto \frac{\hat{A}_{y_i}}{L_{\text{long}}}$ → smaller per-token update
- Result: Favors shorter responses among correct ones
For incorrect answers ($\hat{A}_{y_i} < 0$):
- Short sequence: penalty $\propto \frac{\lvert\hat{A}_{y_i}\rvert}{L_{\text{short}}}$ → larger per-token penalty
- Long sequence: penalty $\propto \frac{\lvert\hat{A}_{y_i}\rvert}{L_{\text{long}}}$ → smaller per-token penalty
- Result: Favors longer responses among incorrect ones
Now consider the setting of our pilot run with Qwen3-4B on MATH: the training reward approaches 0.8 and the validation accuracy shoots up to pass@1 = 75%. Following Dr.GRPO’s reasoning, the shortening effect among correct responses should therefore dominate and drive the average length down.
However, applying the Dr.GRPO fix to the loss calculation gives us the figure below, which is still very similar to what we observed before:
VeRL’s Loss Aggregation: Three Aggregation Methods
A closer look at VeRL’s implementation reveals that the default loss aggregation mode is token-mean, following DAPO:
if loss_agg_mode == "token-mean": # DAPO (default)
loss = verl_F.masked_mean(loss_mat, loss_mask)
elif loss_agg_mode == "seq-mean-token-sum": # Dr. GRPO
seq_losses = torch.sum(loss_mat * loss_mask, dim=-1) # token-sum
loss = torch.mean(seq_losses) # seq-mean
elif loss_agg_mode == "seq-mean-token-mean": # original GRPO
seq_losses = torch.sum(loss_mat * loss_mask, dim=-1) / torch.sum(loss_mask, dim=-1) # token-mean
loss = torch.mean(seq_losses) # seq-mean
Code snippet taken from VeRL codebase
This means our runs already use token-mean aggregation, which eliminates sequence normalization length bias, so this particular GRPO length bias is not present in our setup, and cannot explain our initial observations.
Let $L_i$ denote the length of sequence $i$, and $L_{\max}$ denote the maximum length (16k in our experiments). The three methods differ in how they aggregate:
1. Original GRPO (seq-mean-token-mean): \(\mathcal{L} = \frac{1}{G} \sum_{i=1}^{G} \left( \frac{1}{L_i} \sum_{t=1}^{L_i} \mathcal{L}_{i,t} \right)\)
This normalizes by actual sequence length $L_i$, creating the length bias.
2. Dr.GRPO (seq-mean-token-sum): \(\mathcal{L} = \frac{1}{G} \sum_{i=1}^{G} \left( \sum_{t=1}^{L_i} \mathcal{L}_{i,t} \right) = \left( \frac{1}{G} \sum_{i=1}^{G} \frac{1}{L_{\max}} \sum_{t=1}^{L_i} \mathcal{L}_{i,t} \right) \cdot L_{\max}\)
By summing (or, equivalently, dividing every sequence by the same constant $L_{\max}$), this removes the asymmetry: the per-token weight no longer depends on the individual sequence length $L_i$.
3. DAPO [7] (token-mean): \(\mathcal{L} = \frac{1}{\sum_{i=1}^{G} L_i} \sum_{i=1}^{G} \sum_{t=1}^{L_i} \mathcal{L}_{i,t}\)
This treats all tokens equally regardless of which sequence they belong to, eliminating length-based bias from aggregation.
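To make the difference concrete, the toy script below (with made-up sequence lengths) prints the effective per-token weight that each aggregation mode assigns to a short versus a long response in a group of two:

```python
import torch

# Toy illustration: one short (10-token) and one long (100-token) correct response.
lengths = [10, 100]
G, max_len = len(lengths), max(lengths)

# loss_mask[i, t] = 1 for real tokens, 0 for padding.
loss_mask = torch.zeros(G, max_len)
for i, L in enumerate(lengths):
    loss_mask[i, :L] = 1.0

len_t = torch.tensor(lengths, dtype=torch.float32).view(-1, 1)

# Effective per-token weight under each aggregation mode
# (the coefficient multiplying each token's policy-gradient term).
weights = {
    "seq-mean-token-mean (GRPO)":   loss_mask / (G * len_t),
    "seq-mean-token-sum (Dr.GRPO)": loss_mask / G,
    "token-mean (DAPO, default)":   loss_mask / loss_mask.sum(),
}

for name, w in weights.items():
    short_w = w[0, 0].item()  # weight of a token in the short response
    long_w = w[1, 0].item()   # weight of a token in the long response
    print(f"{name:30s} short={short_w:.4f}  long={long_w:.4f}")

# Approximate output:
#   seq-mean-token-mean (GRPO)     short=0.0500  long=0.0050   <- 10x gap
#   seq-mean-token-sum (Dr.GRPO)   short=0.5000  long=0.5000
#   token-mean (DAPO, default)     short=0.0091  long=0.0091
```

Under seq-mean-token-mean, tokens in the short response receive ten times the weight of tokens in the long one, which is exactly the asymmetry Dr.GRPO describes; the other two modes weight every non-padded token equally.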
Difficulty Imbalance Drives Length Growth
As we were unable to reproduce the length growth with either the MATH dataset or the Nemotron Science dataset, we turned to other datasets.
We then experimented with a hard subset of BigMath [8], a cleaned collection of 250k math problems curated for RL training. With extended training up to 1000 steps, we observed a clear U-shape in the output length.
BigMath Dataset Composition Details
To ensure the dataset contains sufficiently challenging problems, we applied the following filtering to the original BigMath dataset:
- Source filtering: We kept only challenging sources (olympiads, omnimath, harp, aops_forum, amc_aime, big_math, openmath), and removed sources that overlap with previous experiments or are too easy (math, gsm8k, cn_k12, orca_math).
- Difficulty filtering: BigMath provides a pass rate using Llama3.1-8B as a reference model, so we kept only problems with a pass rate below 30%.
As a control treatment, if we remove the hard questions from the training set (pass@1 < 0.3, measured on the checkpoint at step 100), the output length fluctuates but does not show a consistent upward trend.
Here ‘hard’ is defined relative to the early checkpoint, so this is a model-dependent notion of difficulty, and removing those questions also changes the content distribution. Nonetheless, under this intervention we no longer observe consistent length growth, which is consistent with the idea that the hard tail is a key driver.
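For reference, the source and difficulty filtering above can be expressed in a few lines with the Hugging Face datasets library. The dataset ID and column names below (source, llama8b_solve_rate) are illustrative assumptions about the released schema, not a verbatim record of our preprocessing:

```python
# Sketch of the BigMath filtering described above. The dataset ID and the
# column names ("source", "llama8b_solve_rate") are assumptions.
from datasets import load_dataset

KEEP_SOURCES = {"olympiads", "omnimath", "harp", "aops_forum",
                "amc_aime", "big_math", "openmath"}

ds = load_dataset("SynthLabsAI/Big-Math-RL-Verified", split="train")

hard_subset = ds.filter(
    lambda ex: ex["source"] in KEEP_SOURCES          # drop easy/overlapping sources
    and ex["llama8b_solve_rate"] is not None
    and ex["llama8b_solve_rate"] < 0.30              # keep <30% reference pass rate
)
print(len(hard_subset), "problems retained")
```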
A Moving Reward Distribution
To understand why a dataset of mixed difficulty causes length growth, we create an easy split and a hard split of the validation set based on the pass rate of an early checkpoint (step 100). The easy split contains questions with pass@1 > 90% and the hard split contains questions with pass@1 < 10%. We examine the reward distribution and output-length distribution of each category.
What we expect: hard questions should require longer responses to solve than easy ones.
On the hard question subset, our observations confirm this intuition: the model receives higher reward when the output is longer (the flat tail appears because the model never generates outputs that long).


The median output length significantly shifts over training, demonstrating a mode-seeking behavior towards higher reward.
For the easy questions, since the model already solves them reliably, we would expect the length to stay unchanged: there is little room for improvement.
However, when we examine the empirical reward distribution, we observe a rightward shift of the reward peak, indicating that the model can now maintain correctness at longer response lengths. The overall inverted-U shape remains—extremely long responses still decrease accuracy—but the curve has become flatter and the optimal length has shifted from approximately 500 to 700 tokens. This means the model has learned to generate longer responses while preserving accuracy on easy problems.

The output length distribution also shifts in accordance with the reward distribution.

Combining our observations on easy problems and hard problems, we can quantify the shift in both output length and performance:
| Checkpoint | Easy Problems (Pass Rate) | Easy Problems (Avg Length) | Hard Problems (Pass Rate) | Hard Problems (Avg Length) |
|---|---|---|---|---|
| Step 100 (Early) | 95.1% | 530 | 2.3% | 1039 |
| Step 1000 (Late) | 97.0% | 661 | 5.5% | 1301 |
The table shows that both easy and hard problems experience length increases during training, with the magnitude of the shift depending on problem difficulty.
This reveals the mechanism behind length growth on mixed-difficulty datasets. The model learns to generate longer outputs to solve hard problems (step 100→1000: +262 tokens, pass rate 2.3%→5.5%). However, this policy shift affects all problems broadly—easy problems also become longer (+131 tokens). Crucially, the model learns to maintain correctness on easy problems even at these longer lengths, which is reflected in the flattening of their empirical reward distribution. Without hard problems present (easy-only training), the model has no incentive to grow longer and the distribution remains stable.
💡 Key Insight: We hypothesize this behavior is driven by asymmetric risk. For a mixed-difficulty dataset, the model faces a trade-off:
On Hard Problems: "Thinking short" carries a high risk of failure (low reward). "Thinking long" is necessary to solve them.
On Easy Problems: "Thinking long" carries little risk (as long as the length stays within the flattened band). As the reward distribution flattens, the model can spend extra tokens without hurting its accuracy.
Consequently, the model adopts a "better safe than sorry" global policy. It learns to extend its reasoning to cover the hard tail of the distribution. Since this extended length doesn't penalize performance on easy problems, the length drift bleeds into the easy subset—resulting in the "overthinking" we observe.
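A toy expected-reward calculation (with made-up success probabilities and difficulty mix) makes this "better safe than sorry" trade-off explicit:

```python
# Toy illustration of asymmetric risk under a mixed-difficulty dataset.
# All probabilities below are made up for illustration only.
mix = {"easy": 0.7, "hard": 0.3}              # fraction of prompts per difficulty

# Assumed P(correct) for a "short" vs "long" global policy.
p_correct = {
    "short": {"easy": 0.95, "hard": 0.02},    # short answers fail on hard problems
    "long":  {"easy": 0.95, "hard": 0.06},    # long answers keep easy accuracy,
}                                             # and gain a little on hard ones

for policy, probs in p_correct.items():
    expected_reward = sum(mix[d] * probs[d] for d in mix)
    print(f"{policy:5s} policy: expected reward = {expected_reward:.3f}")

# short policy: 0.7*0.95 + 0.3*0.02 = 0.671
# long  policy: 0.7*0.95 + 0.3*0.06 = 0.683  -> thinking long wins globally,
# even though it only helps on the hard tail and merely "does no harm" on easy ones.
```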
Case Studies
To understand how length growth manifests in individual outputs, we examine two problems at different difficulty levels.
Easy Problem: Coin Counting
Problem: Petya has 25 coins, each with a denomination of 1, 2, 5, or 10 rubles. Among these coins, 19 are not two-ruble coins, 20 are not ten-ruble coins, and 16 are not one-ruble coins. How many five-ruble coins does Petya have?
For this problem, the output length grew from 493 tokens (step 100) to 602 tokens (step 1000) on average, a 22% increase.
A simple and short solution to the problem produced by the model would look like the following, which sits at 338 tokens:
📝 Example: Concise solution (338 tokens)
After training (step 1000), outputs became 22% longer. Here’s a representative example (595 tokens):
📝 Example: Step 1000 solution with verification (595 tokens)
Analysis: The solution remains correct but becomes more elaborate. The model now includes explicit verification steps, checking each condition after deriving the answer. While this adds length, it doesn’t provide computational advantage for such a straightforward problem.
Harder Problem: Circle and Right Triangle Geometry
Problem: On the leg BC of right triangle ABC, a circle is constructed with BC as the diameter. The circle intersects the hypotenuse at point D such that AD:BD = 1:3. The altitude from vertex C (the right angle) to the hypotenuse is 3. Find the leg BC.

Concise solution approach (using geometric mean theorem):
- Since BC is diameter, ∠BDC = 90° by Thales’ theorem → D is the altitude foot
- Apply altitude geometric mean: h² = AD · BD
- 3² = x · 3x (where AD = x, BD = 3x)
- x = √3, so AB = 4√3
- Apply leg-segment relationship: BC² = BD · AB = 3√3 · 4√3 = 36
- BC = 6
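A quick numeric check of these steps:

```python
import math

# Verify the concise geometric solution numerically.
x = math.sqrt(3)               # AD = x, from 3^2 = x * 3x
AD, BD = x, 3 * x
AB = AD + BD                   # hypotenuse = 4*sqrt(3)
BC = math.sqrt(BD * AB)        # leg-segment relation: BC^2 = BD * AB
AC = math.sqrt(AB**2 - BC**2)  # Pythagoras for the other leg
altitude = AC * BC / AB        # altitude to hypotenuse = product of legs / hypotenuse

assert math.isclose(BC, 6.0)
assert math.isclose(altitude, 3.0)   # matches the given altitude
assert math.isclose(BD / AD, 3.0)    # matches AD:BD = 1:3
print(f"BC = {BC:.4f}, altitude = {altitude:.4f}")
```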
For this problem, outputs from step 1000 averaged 2021 tokens compared to 1544 tokens at step 100—a 31% increase. Solutions from later training tend to use algebraic approaches with systems of equations rather than leveraging geometric relationships directly.
Representative solution from step 100 (clean geometric approach):
✅ Example: Clean geometric solution from step 100
This solution efficiently combines the geometric mean theorem, Thales’ theorem, and the Pythagorean theorem to reach the answer in a clear logical sequence.
However, not all step 100 outputs are this clean. Some start correctly but make algebraic errors mid-solution, yet still arrive at the correct answer:
⚠️ Example: Flawed solution from step 100 (correct answer despite errors)
This example shows a common pattern at step 100: the solution makes mathematical errors (fabricated power-of-point formula, negative discriminant) but somehow recovers to produce the correct final answer—suggesting the model may be leveraging memorized answers rather than purely following its derivation.
Representative solution from step 1000 (algebraic brute-force approach):
📐 Example: Long algebraic solution from step 1000 (~100 lines)
Analysis: This step 1000 solution is mathematically complete and rigorous—every algebraic step is justified and the logic is sound. However, it takes an algebraic brute-force approach, setting up a quartic equation and using the quadratic formula with denominator rationalization. The solution doesn’t leverage the elegant geometric insight upfront (that D = H, making the geometric mean theorem directly applicable).
What could be solved in 3 key steps (geometric mean → find x → apply leg-segment relationship) instead becomes a ~100-line algebraic derivation. The model only applies the geometric mean theorem near the end (line 446) after extensive algebraic machinery. While this demonstrates strong algebraic skills, it represents a shift away from geometric insight toward computational thoroughness.
Key Takeaways from Case Studies
These examples illustrate two manifestations of length growth:
Added verification steps (coin problem): The model learns to explicitly verify solutions even when unnecessary, adding ~100 tokens without improving accuracy on simple problems.
Shift toward algebraic thoroughness (geometry problem): Later checkpoints favor algebraic brute-force methods over geometric insights, even when the geometric approach is more elegant and efficient. This increases token count by ~500 tokens while maintaining correctness.
Both patterns demonstrate how training on mixed-difficulty datasets causes the model to apply more elaborate reasoning strategies across all problems—a form of “overthinking” on easy questions that emerges as a side effect of learning to solve harder ones.
Mitigating Overthinking: Length Penalty Interventions
Overthinking has been widely recognized as computationally wasteful, leading to growing interest in mitigation strategies. The most common approach is reward shaping through length penalties—penalties applied to longer outputs. Truncation can be viewed as an extreme length penalty where outputs exceeding a limit receive zero advantage.
We test several penalty variants and connect them to our findings on the mechanism behind length growth.
The Challenge
The asymmetric risk principle creates a dilemma: the model needs longer reasoning chains for hard problems, but this drives wasteful length growth on easy problems where correctness can be maintained at shorter lengths. An ideal intervention would:
- Preserve hard problem performance: Allow the model to “think long” when necessary
- Prevent easy problem waste: Discourage unnecessarily long responses on simple problems
- Adapt to problem difficulty: Different problems have different “natural” solution lengths
Length Penalty Variants
We implement and evaluate three penalty mechanisms, each with different strategies for balancing efficiency and performance.
Variable definitions:
- $R \in \{0, 1\}$ is the binary correctness reward
- $L$ is the response length in tokens
- $\alpha, \beta, \gamma$ are penalty coefficients (hyperparameters)
1. Uniform Length Penalty
\[\text{reward} = R - \alpha \times L\]The simplest approach applies a constant penalty per token across all problems, regardless of difficulty.
2. Group-Based Adaptive Penalty (Kimi K1.5 [10])
\[\lambda = 0.5 - \frac{L - \text{min\_len}}{\text{max\_len} - \text{min\_len}}\] \[\text{reward} = \begin{cases} R + \beta \times \lambda & \text{(correct responses)} \\ R + \beta \times \min(0, \lambda) & \text{(incorrect responses)} \end{cases}\]Inspired by Kimi K1.5 [10], this approach tracks the min/max response length per prompt and applies difficulty-aware penalties. It automatically adapts to each problem’s natural length range, rewarding brevity for correct responses while penalizing long incorrect ones. Note that $\lambda \in [-0.5, 0.5]$ normalizes the penalty based on the group’s length distribution.
3. Anchor Length Penalty
\[r = \max\left(0, \frac{L - \text{anchor\_length}}{\text{anchor\_length}}\right)\] \[\text{reward} = R - \gamma \times r\]
where $\text{anchor\_length}$ is set when the model first achieves reliable competence on the prompt (pass@1 > 90%).
This penalty is also adaptive per prompt, but uses a reference length from when the model first demonstrated competence on that problem, rather than the current group’s responses.
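For clarity, here is a minimal sketch of the three shaped rewards as standalone functions. Variable names are illustrative, and the bookkeeping that gathers group length statistics and records anchor lengths during training is omitted:

```python
# Sketch of the three length-penalty variants described above.
# R is the binary correctness reward, L the response length in tokens.

def uniform_penalty(R: float, L: int, alpha: float = 1e-4) -> float:
    """1. Constant per-token tax, independent of difficulty."""
    return R - alpha * L


def group_based_penalty(R: float, L: int, group_lengths: list[int],
                        beta: float = 0.1) -> float:
    """2. Kimi-K1.5-style penalty normalized by the group's length range."""
    min_len, max_len = min(group_lengths), max(group_lengths)
    if max_len == min_len:                            # degenerate group: no signal
        return R
    lam = 0.5 - (L - min_len) / (max_len - min_len)   # lam in [-0.5, 0.5]
    bonus = lam if R > 0 else min(0.0, lam)           # reward brevity only when correct
    return R + beta * bonus


def anchor_penalty(R: float, L: int, anchor_length: float,
                   gamma: float = 0.01) -> float:
    """3. Penalize only growth beyond the length at which the model first
    became reliably competent on this prompt (pass@1 > 90%)."""
    r = max(0.0, (L - anchor_length) / anchor_length)
    return R - gamma * r
```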
Experimental Results
In the experiments below, we set $\alpha=10^{-4}, \beta=0.1, \gamma=0.01$. These values were chosen so that the penalty magnitude is on a comparable scale to the binary reward at typical initial output lengths of $\sim 1000$ tokens (e.g., the uniform penalty contributes $\alpha \times L \approx 10^{-4} \times 1000 = 0.1$).
Note: These are initial hyperparameter settings; further tuning may improve performance.
The figures below compare the four approaches across training:
Control (Pink) - No Penalty Baseline: Shows continued length growth from ~1000 to ~1300 tokens as the model learns to solve harder problems, with easy problem lengths also increasing (the “overthinking” phenomenon).
Uniform Penalty (Orange) - The Exploration Tax: Because the penalty is applied indiscriminately, at the weight $\alpha$ we use the model cannot afford the token budget required to solve hard problems. The gradient signal becomes dominated by the length penalty, forcing the model to collapse to empty responses.
Group-Based Penalty (Green) - The Race to the Bottom: While initially stable, the relative nature of this penalty creates a moving target. As the batch becomes more efficient, the definition of “long” shifts downwards. The model chases this moving target until it optimizes away the reasoning chain entirely, leading to a total collapse in capability (score drops to 0). This may be why the Kimi K1.5 authors note that the length reward slows down training initially, and why they choose to gradually warm up the length penalty during training.
Anchor Penalty (Teal) - Stable Efficiency: By contrast, the Anchor Penalty successfully stabilizes the length. It arrests the heavy-tail drift seen in the control run but does not force the model below its necessary reasoning baseline. It effectively locks in the efficiency of the early training stages without suffocating exploration.
Quantitative comparison on easy vs. hard problems:
| Checkpoint | Penalty | Easy Pass Rate | Easy Avg Length | Hard Pass Rate | Hard Avg Length |
|---|---|---|---|---|---|
| Step 100 | None | 95.1% | 530 | 2.3% | 1039 |
| Step 1000 | None | 97.0% | 661 | 5.2% | 1288 |
| Step 1000 | Anchor ($\gamma=0.01$) | 97.0% | 563 | 4.6% | 958 |
| Step 1000 | Anchor ($\gamma=0.1$) | 96.5% | 533 | 4.5% | 1067 |
All results are computed as the average of 64 samples per prompt.
The Anchor Penalty with $\gamma=0.01$ successfully reduces easy problem length by 15% (661→563 tokens) and hard problem length by 26% (1301→958 tokens) while maintaining comparable accuracy to the no-penalty baseline.
Summary
Our empirical investigation reveals dataset difficulty composition as the primary factor determining length dynamics under our Zero-RL setup.
Findings
- Datasets with narrow difficulty range (MATH, Science) → Length convergence
- Models trained on datasets whose difficulty is well matched to their ability converge to stable output lengths
- Neither KL divergence nor GRPO length bias explains this behavior
- Mixed-difficulty datasets with heavy hard-tail (BigMath) → Length growth
- Datasets with difficulty imbalance produce U-shaped length dynamics
- Removing hard problems eliminates consistent length growth
- Hard problems drive the overall length increase
- The mechanism: Moving reward distributions cause asymmetric risk
- Hard problems get reward gains from longer reasoning (1039→1301 tokens, +262 tokens)
- This policy shift affects all problems broadly (easy problems: 530→661 tokens, +131 tokens)
- Models learn to maintain correctness on easy problems at longer lengths
- The empirical reward distribution for easy problems flattens to accommodate this shift
- Difficulty-aware length penalties can mitigate overthinking without harming performance
- Uniform penalties cause exploration collapse by taxing all reasoning equally
- Group-based adaptive penalties create unstable moving targets that might lead to capability collapse
- Anchor-based penalties successfully stabilize length by locking to early-training efficiency
Limitations and Future Work
Model families and contamination: Our study uses the Qwen model family with Qwen3-4B as the base model in the majority of the experiments. Recent work has noted that some Qwen variants may exhibit training-data leakage on math benchmarks, and our own case studies show examples where the model reaches the correct final answer with flawed intermediate reasoning. Our analysis therefore focuses on answer-level correctness and length dynamics for this particular model under binary verifier rewards. We do not claim that the observed behavior necessarily reflects faithful mathematical reasoning, nor that it generalizes to all model families.
Length penalty applicability: The anchor penalty approach assumes that models start with reasonable output lengths and grow longer during training. It may not be suitable for models that are initially too verbose—for example, models initialized from checkpoints distilled with long chain-of-thought traces. In such cases, the anchor would be set at an already-inflated length, failing to provide efficiency gains. In those cases, a pre-set reference length like that proposed in the LASER paper [11] would be a good substitute. Our experiments use base models trained with RL from scratch (R1-Zero paradigm). A natural next step is evaluating these penalty mechanisms on supervised fine-tuned (SFT) models that may exhibit different initial length distributions. We plan to investigate this in the next phase of our experiments.
Theoretical understanding: While we’ve established the empirical patterns of length dynamics, the underlying mechanisms require further theoretical investigation. Key open questions include understanding why reward distributions exhibit inverted-U shapes across different lengths, how GRPO’s mode-matching behavior interacts with these reward curves, and whether we can develop predictive models for length dynamics on mixed-difficulty datasets.
References
[1] DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Technical Report. https://arxiv.org/abs/2501.12948
[2] DeepScaleR Team. (2025). DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2
[3] HKUST NLP Group. (2025). Simple-RL-Zero: Replicating DeepSeek-R1-Zero’s Reasoning with Reinforcement Learning. https://hkust-nlp.notion.site/simplerl-reason
[4] Zhang, Y., et al. (2025). When More is Less: Understanding Chain-of-Thought Length in LLMs. arXiv preprint.
[5] NVIDIA. (2024). Nemotron-4 340B Technical Report. https://arxiv.org/abs/2406.11704
[6] Liu, Z., Chen, C., Li, W., et al. (2025). Understanding R1-Zero-Like Training: A Critical Perspective. arXiv:2503.20783. https://arxiv.org/abs/2503.20783
[7] Xiong, W., et al. (2025). DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv:2503.14476. https://arxiv.org/abs/2503.14476
[8] Luo, H., et al. (2025). BigMath: A Large-Scale High-Quality Mathematical Dataset for Enhanced Reasoning. arXiv:2502.17387. https://arxiv.org/abs/2502.17387
[9] Wei, Z., et al. (2025). Concise Reasoning via Reinforcement Learning. arXiv:2504.05185. https://arxiv.org/abs/2504.05185
[10] Moonshot AI. (2025). Kimi k1.5: Scaling Reinforcement Learning with LLMs. Technical Report. https://github.com/MoonshotAI/Kimi-k1.5
[11] Liu, W., Zhou, R., Deng, Y., Huang, Y., Liu, J., Deng, Y., Zhang, Y., and He, J. (2025). Learn to Reason Efficiently with Adaptive Length-based Reward Shaping. arXiv:2505.15612. https://arxiv.org/abs/2505.15612
