Penghui Qi (@QPHutu) / X

191 posts

Researcher @SeaAIL. PhD student @NUSingapore. Working on RL, LLM reasoning, and MLSys.

Joined August 2022

Pinned
Penghui Qi
@QPHutu
Feb 5
This time we should say goodbye to PPO/GRPO for real 👋 PPO is a great algorithm in classical RL settings. However, it is fundamentally flawed in the LLM regime due to the large, long-tailed vocabulary.💔 Check out our paper for more details👇
Penghui Qi
@QPHutu
Oct 31, 2025
🚀Excited to share our new work! 💊Problem: The BF16 precision causes a large training-inference mismatch, leading to unstable RL training. 💡Solution: Just switch to FP16. 🎯That's it. 📰Paper: arxiv.org/pdf/2510.26788 ⭐️Code: github.com/sail-sg/Precis…
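To make the precision gap concrete, here is a minimal sketch using only the Python standard library. It is an illustration of the two number formats, not the paper's analysis: FP16 keeps 10 mantissa bits while BF16 keeps 7, so the same value is rounded more coarsely in BF16. The helper names `to_fp16` and `to_bf16` and the sample value are assumptions made for this example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision (10 mantissa bits) and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Round x to bfloat16 (7 mantissa bits), round-to-nearest-even.

    Handles ordinary finite values only; NaN and overflow are not treated.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 is the top 16 bits of a float32; add a rounding bias first.
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", rounded))[0]


# Illustrative value only: a number slightly above 1.0.
x = 1.001
fp16_error = abs(to_fp16(x) - x)
bf16_error = abs(to_bf16(x) - x)
print(f"fp16: {to_fp16(x)!r}, abs error {fp16_error:.2e}")
print(f"bf16: {to_bf16(x)!r}, abs error {bf16_error:.2e}")
```

For this input, FP16 keeps 1.0009765625 while BF16 rounds all the way down to 1.0. Small per-value differences of this kind are one plausible way two numerically different implementations of the same model can drift apart.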
Penghui Qi
@QPHutu
Nov 1, 2025
Thanks for this fix. Actually, it is not quite that easy: a GradScaler should be introduced to avoid gradient underflow; otherwise the performance can be even worse than with BF16. See: docs.pytorch.org/docs/stable/am… VeRL example: github.com/sail-sg/Precis…
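The underflow problem and the idea behind loss scaling can be sketched with the standard library alone. This is a conceptual illustration, not PyTorch's GradScaler and not the VeRL patch: the gradient value and the fixed `LOSS_SCALE` are assumptions chosen for the example, whereas a real scaler adjusts its scale dynamically during training.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Hypothetical values chosen for illustration.
LOSS_SCALE = 2.0 ** 16   # fixed scale factor (assumption)
true_grad = 1e-8         # smaller than the tiniest positive FP16 number

# Without scaling, the gradient is flushed to zero when stored in FP16.
naive_grad = to_fp16(true_grad)

# With scaling, the loss (and hence the gradient) is multiplied before
# the FP16 step, then divided again in higher precision.
scaled_grad = to_fp16(true_grad * LOSS_SCALE)
recovered_grad = scaled_grad / LOSS_SCALE

print(f"naive FP16 gradient:     {naive_grad!r}")
print(f"recovered after scaling: {recovered_grad:.4e}")
```

The unscaled gradient becomes exactly zero, while the scaled-then-unscaled gradient stays within about 0.1% of the true value. The sketch ignores overflow, which is the reason practical scalers also shrink the scale when infinities appear.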
Penghui Qi
@QPHutu
Oct 31, 2025
⛈️ VeRL does not natively support FP16 training. A naive implementation will suffer from gradient underflow. 💊 🚀We provide a minimal patch for VeRL to enable effective FP16 training, with about 10 lines of code change.👇 ⌨️
Precision-RL/verl_fp16.patch at main · sail-sg/Precision-RL
From github.com
Penghui Qi
@QPHutu
Nov 13, 2025
Finally! Although it's 2.4x slower right now (I believe many optimizations are coming), the results are really promising! It is a huge step towards truly on-policy RL! Amazing work!
vLLM
@vllm_project
Nov 12, 2025
🚀 No More Train–Inference Mismatch! We demonstrate bitwise consistent on-policy RL with TorchTitan (training) + vLLM (inference) — the first open-source run where training and inference numerics match exactly. It only takes 3 steps: 1️⃣ Make vLLM batch-invariant (same seq →
Penghui Qi
@QPHutu
Nov 2, 2025
Indeed, many people never saw their BF16 training collapse, but the problem exists, as many reports show. We reproduce this instability with a sanity test (just like MNIST for CV) for better understanding. Large models + datasets are here👇 Give it a try, you may be surprised.
Zichen Liu
@zzlccc
Nov 1, 2025
Thanks for the thought! Some further thoughts (clarifications): 1. Reasonably designed algorithms (let’s also include precision in the design space) should not collapse on small data. It’s just like if my CNN cannot even overfit MNIST, how can I trust it will master 1000-class
Penghui Qi
@QPHutu
May 20, 2025
👀Optimizing Anytime Reasoning via Budget Relative Policy Optimization👀 🚀Our BRPO leverages verifiable dense rewards, significantly outperforming GRPO in both final and anytime reasoning performance.🚀 📰Paper: arxiv.org/abs/2505.13438 🛠️Code: github.com/sail-sg/Anytim…
Penghui Qi
@QPHutu
Nov 1, 2025
This is exactly what we wanted to share with our FP16 tech report! Thanks @Grad62304977 for the great explanation.
Grad
@Grad62304977
Nov 1, 2025
Replying to @redtachyon
Well sort of, with just GRPO and not actually taking care of the mismatch at the algorithm level, u will encounter instability with bf16 under normal training settings like here (and as many papers for actual models like Kimi linear have mentioned). Their point is that given
Penghui Qi
@QPHutu
Oct 31, 2025
Huge thanks to @Grad62304977 for quickly testing out our findings on using FP16 for RL fine-tuning and confirming the results!🥇
Grad
@Grad62304977
Oct 31, 2025
Replying to @Grad62304977
Penghui Qi
@QPHutu
Nov 14, 2025
More amazing progress on truly on-policy RL!💯 I believe it is a headache for the community to find a reproducible setting where the mismatch consistently causes training collapse. If so, you may want to check out this sanity test. Link to the dataset👇 huggingface.co/datasets/sail/…
LMSYS Org
@lmsysorg
Nov 14, 2025
💥 We've achieved perfect training-inference alignment for SGLang & FSDP in slime! (Flash Attn 3, DeepGEMM, etc.) The result? A strict KL divergence of 0. But here's the twist: We spent a month trying to find a baseline that crashes from mismatch... and couldn't. 🤷‍♂️ We haven't
Penghui Qi
@QPHutu
Nov 3, 2025
Many thanks for these exciting results. I’ve been waiting all weekend for someone to reproduce them, and I’m thrilled they’re here.
Łukasz Borchmann
@LukaszBorchmann
Nov 3, 2025
Replying to @redtachyon
Well, not only A100. Here is the sanity check on H200 (GRPO, 32B dense model). The authors also mention that they did some larger-scale experiments on H100.
Penghui Qi
@QPHutu
Nov 2, 2025
Thank you @karpathy for finding our paper interesting. This is very encouraging.
Andrej Karpathy
@karpathy
Nov 1, 2025
Replying to @MarFot78 and @zzlccc
I think if you zoomed into the paper too you’d find it just as if not more interesting.
Penghui Qi
@QPHutu
Nov 2, 2025
Replying to @RichardYRLi and @danielhanchen
Hi @RichardYRLi, I tried disable_cascade_attn many times, including with the latest vLLM version, but unfortunately it made no difference in our experiments. So I guess it really depends on the setting.