This time we should say goodbye to PPO/GRPO for real 👋
PPO is a great algorithm in classical RL settings. However, it is fundamentally flawed in the LLM regime due to the large, long-tailed vocabulary.💔
Check out our paper for more details👇
🚀Excited to share our new work!
💊Problem: The BF16 precision causes a large training-inference mismatch, leading to unstable RL training.
💡Solution: Just switch to FP16.
🎯That's it.
📰Paper: arxiv.org/pdf/2510.26788
⭐️Code: github.com/sail-sg/Precis…
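A rough way to see why the number format matters: the sketch below rounds one probability-like value to both formats using only the Python standard library. The helper names `to_fp16` and `to_bf16` are illustrative, not code from the paper or repo, and the bit-level BF16 emulation is an assumption standing in for real hardware rounding.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Emulate bfloat16 (8 exponent bits, 7 mantissa bits) via float32 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 low bits that get dropped.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFFFFFF
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x = 0.3  # hypothetical token probability
err_fp16 = abs(to_fp16(x) - x)
err_bf16 = abs(to_bf16(x) - x)
print(f"fp16 error: {err_fp16:.2e}, bf16 error: {err_bf16:.2e}")
```

FP16 keeps 10 mantissa bits against 7 for BF16, so it lands closer to the original value. The price is a narrower exponent range, so very small values can flush to zero.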
Thanks for this fix. Actually, it is not quite that easy: a GradScaler should be introduced to avoid gradient underflow; otherwise the performance can be even worse than with BF16.
See: docs.pytorch.org/docs/stable/am…
VeRL Example:
github.com/sail-sg/Precis…
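The underflow issue and the loss-scaling idea behind a gradient scaler can be sketched in plain Python. This is a simplified illustration with a fixed, hypothetical scale factor `SCALE`; it is not the VeRL patch, and PyTorch's GradScaler adjusts its scale dynamically instead of fixing it.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision using the stdlib 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

SCALE = 2.0 ** 16   # hypothetical fixed loss scale (assumption for this sketch)
true_grad = 1e-8    # a tiny gradient value, chosen for illustration

# Naive FP16: the gradient is below the smallest FP16 subnormal and becomes 0.
naive = to_fp16(true_grad)

# Loss scaling: backprop through (loss * SCALE) yields gradients multiplied
# by SCALE, which are large enough to be stored in FP16.
scaled = to_fp16(true_grad * SCALE)

# Unscale in higher precision before the optimizer step.
recovered = scaled / SCALE

print(naive, recovered)
```

The naive path silently loses the gradient, while the scaled path recovers it to within FP16 rounding error.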
⛈️ VeRL does not natively support FP16 training. A naive implementation will suffer from gradient underflow. 💊
🚀We provide a minimal patch for VeRL to enable effective FP16 training, with about 10 lines of code change.👇
⌨️
Finally!
Although it's 2.4x slower right now (I believe many optimizations are coming), the results are really promising!
It is a huge step towards truly on-policy RL! Amazing work!
🚀 No More Train–Inference Mismatch!
We demonstrate bitwise consistent on-policy RL with TorchTitan (training) + vLLM (inference) — the first open-source run where training and inference numerics match exactly.
It only takes 3 steps:
1️⃣ Make vLLM batch-invariant (same seq →
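Why batch invariance matters for bitwise consistency can be illustrated with floating-point addition, which is not associative. The sketch below is a toy, not vLLM or TorchTitan code: `split_reduce` and its split point `k` are hypothetical stand-ins for a kernel whose reduction order depends on how the batch is tiled.

```python
def left_to_right(xs):
    """Accumulate strictly left to right (explicit loop, no compensation)."""
    total = 0.0
    for x in xs:
        total += x
    return total

def split_reduce(xs, k):
    """Reduce xs[:k] and xs[k:] separately, then combine the two partials."""
    return left_to_right(xs[:k]) + left_to_right(xs[k:])

xs = [0.1, 0.2, 0.3]
a = split_reduce(xs, 2)  # (0.1 + 0.2) + 0.3
b = split_reduce(xs, 1)  # 0.1 + (0.2 + 0.3)
print(a == b, a, b)
```

Same numbers, different grouping, different bits. If the grouping changes with batch size, the same sequence can produce different logits in training and inference.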
Indeed, many ppl never saw their bf16 training collapse, but the problem exists, as many reports show.
We reproduce this instability by designing a sanity test (just like MNIST for CV) for better understanding.
Large models+datasets are here👇
Give it a try, you may be surprised.
Thanks for the thought! Some further thoughts (clarifications):
1. Reasonably designed algorithms (let’s also include precision in the design space) should not collapse on small data. It’s just like if my CNN cannot even overfit MNIST, how can I trust it will master 1000-class
Well, sort of. With just GRPO and without taking care of the mismatch at the algorithm level, u will encounter instability with bf16 under normal training settings like here (as many papers on actual models, like Kimi linear, have mentioned).
Their point is that given
More amazing progress on truly on-policy RL!💯
I believe it is a headache for the community to find a reproducible setting where the mismatch consistently causes training collapse. If so, you may want to check this sanity test.
Link to this dataset👇
huggingface.co/datasets/sail/…
💥 We've achieved perfect training-inference alignment for SGLang & FSDP in slime! (Flash Attn 3, DeepGEMM, etc.)
The result? A strict KL divergence of 0.
But here's the twist: We spent a month trying to find a baseline that crashes from mismatch... and couldn't. 🤷‍♂️ We haven't
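What "a strict KL divergence of 0" means can be sketched as follows. This is a minimal illustration, not slime's measurement code: `kl_divergence` and the example logits are assumptions, standing in for the next-token distributions produced by the trainer and the inference engine.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def kl_divergence(logits_p, logits_q):
    """KL(p || q) between the softmax distributions of two logit vectors."""
    logp = log_softmax(logits_p)
    logq = log_softmax(logits_q)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp, logq))

trainer_logits = [2.0, 1.0, 0.1]      # hypothetical trainer output
matched_logits = [2.0, 1.0, 0.1]      # bitwise-identical inference output
drifted_logits = [2.0, 1.0, 0.15]     # inference output with a small mismatch

kl_matched = kl_divergence(trainer_logits, matched_logits)
kl_drifted = kl_divergence(trainer_logits, drifted_logits)
print(kl_matched, kl_drifted)
```

Only bitwise-identical logits give exactly zero; any numerical drift between the two engines shows up as a small positive KL.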
Well, not only A100. Here is the sanity check on H200 (GRPO, 32B dense model). The authors also mention that they did some larger-scale experiments on H100.
Hi @RichardYRLi, I tried disable_cascade_attn many times, including with the latest vLLM version. But unfortunately it made no difference in our experiments. So I guess it really depends on the setting.