Zichen Liu (@zzlccc) / X

Zichen Liu

586 posts

@zzlccc

Gemini RL @GoogleDeepMind

Singapore

Joined October 2021

Pinned
Zichen Liu
@zzlccc
Mar 21, 2025
🪂Understanding R1-Zero-Like Training: A Critical Perspective * DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning?? * The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO?? * Getting GRPO Done Right, we achieve a 7B AIME sota! 🧵 📜Full
331K
Zichen Liu
@zzlccc
Nov 1, 2025
Super excited that @karpathy noticed our work! Hopefully it helps the broader community realize that *precision* deserves a place in our design space.
279K
Zichen Liu
@zzlccc
Oct 2, 2025
much more convinced after getting my own results: LoRA with rank=1 learns (and generalizes) as well as full-tuning while saving 43% vRAM usage! allows me to RL bigger models with limited resources😆 script: github.com/sail-sg/oat/bl…
Thinking Machines
@thinkymachines
Sep 29, 2025
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.
205K
Zichen Liu
@zzlccc
Sep 22, 2025
exactly. and we will never derive a term like 1/|o|. seeing so many papers still using the original GRPO is sad.
Nan Jiang
@nanjiang_cs
Sep 21, 2025
I was surprised by how many didn't know that (1) per-token MLE is whole-seq MLE, and (2) PG at token level is the same as PG at seq level (optimizing one big combinatorial action). The story is different if you introduce a fitted critic/Q-values or intermediate resets.
62K
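The identity behind the two posts above can be written out in a few lines. This is a sketch in assumed notation that does not come from the posts: $q$ is the prompt, $o = (o_1, \dots, o_{|o|})$ is a sampled response, $\pi_\theta$ is the policy, and $A(q, o)$ is a sequence-level advantage with no fitted critic.

```latex
\log \pi_\theta(o \mid q) = \sum_{t=1}^{|o|} \log \pi_\theta(o_t \mid q, o_{<t})

\nabla_\theta J(\theta)
  = \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}
      \big[ A(q, o)\, \nabla_\theta \log \pi_\theta(o \mid q) \big]
  = \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}
      \Big[ A(q, o) \sum_{t=1}^{|o|} \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}) \Big]
```

The sum over tokens appears with no $1/|o|$ factor in front of it, which matches the remark that such a term is never derived.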
Zichen Liu
@zzlccc
Oct 25, 2025
Nothing feels more exciting than writing a thesis proposal on RL for LLMs before 2025 ends!! Covering a subset of my first-author works done in the past 1.5 years (after switching from traditional RL to LLM RL…) Tentative title, of course
61K
Zichen Liu
@zzlccc
Oct 31, 2025
BF16 -> FP16 is such a simple (one configuration change in Oat) yet fundamental fix for inference-training mismatch. With FP16, the most basic importance sampling PG outperforms all algorithmic fixes in BF16. Let's rethink RL stability from the precision perspective.🔎
Penghui Qi
@QPHutu
Oct 31, 2025
🚀Excited to share our new work! 💊Problem: The BF16 precision causes a large training-inference mismatch, leading to unstable RL training. 💡Solution: Just switch to FP16. 🎯That's it. 📰Paper: arxiv.org/pdf/2510.26788 ⭐️Code: github.com/sail-sg/Precis…
78K
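A rough way to see why precision matters here is to compare the rounding error of the two 16-bit formats on the same values. The sketch below is an illustration only, not the code from the paper or from Oat: `to_bf16` is a hypothetical helper that emulates BF16 by rounding the float32 bit pattern (NaN handling is ignored), and the "log-probabilities" are random stand-in numbers.

```python
import numpy as np

def to_bf16(x):
    """Round float32 values to BF16 precision (round-to-nearest-even).

    Hypothetical helper: the result is returned as float32 so it can be
    compared directly. NaN/inf inputs are not handled.
    """
    x = np.asarray(x, dtype=np.float32)
    bits = x.view(np.uint32)
    # Rounding bias: 0x7FFF plus the lowest bit that BF16 keeps.
    rounding = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    rounded = (bits + rounding) & np.uint32(0xFFFF0000)
    return rounded.view(np.float32)

# Stand-in token log-probabilities in [-5, 0] (assumed values, not real data).
rng = np.random.default_rng(0)
logp = (-rng.uniform(0.0, 5.0, size=10_000)).astype(np.float32)

err_bf16 = float(np.abs(to_bf16(logp) - logp).max())
err_fp16 = float(np.abs(logp.astype(np.float16).astype(np.float32) - logp).max())

print(f"max rounding error  BF16: {err_bf16:.5f}  FP16: {err_fp16:.5f}")
```

BF16 keeps 7 explicit mantissa bits and FP16 keeps 10, so in this range the BF16 rounding error is several times larger. FP16 pays for that with a much narrower exponent range, which this sketch does not exercise.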
Zichen Liu
@zzlccc
Feb 6, 2025
🚨There May Not be Aha Moment in R1-Zero-like Training: oatllm.notion.site/oat-zero A common belief about the recent R1-Zero-like training is that self-reflections *emerge* as a result of RL training. We carefully investigated and showed the opposite. 🧵
117K
Zichen Liu
@zzlccc
Aug 22, 2025
With just a few lines of code, Feng’s (@fengyao1909) suggested fix—applying importance sampling on the behavior policy—resolved the training instability in my case (oat). I believe the result can generalize to other RL frameworks as well. Great work, Feng!
45K
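One common form of importance sampling on the behavior policy is to reweight each token by the ratio between the learner's probability and the sampler's probability, capped at a constant. The sketch below assumes that form; it may differ from the exact fix referenced in the post, and the function names and the cap `clip_max` are hypothetical.

```python
import numpy as np

def truncated_is_weights(logp_learner, logp_sampler, clip_max=2.0):
    """Per-token importance weights min(pi_learner / pi_sampler, clip_max).

    logp_learner: log-probs recomputed by the training engine.
    logp_sampler: log-probs reported by the engine that generated the tokens.
    """
    ratio = np.exp(np.asarray(logp_learner) - np.asarray(logp_sampler))
    return np.minimum(ratio, clip_max)

def weighted_pg_loss(logp_learner, logp_sampler, advantages, clip_max=2.0):
    """Policy-gradient surrogate with truncated importance weights.

    In an autograd framework the weights would be detached (treated as
    constants) so that gradients flow only through logp_learner.
    """
    logp_learner = np.asarray(logp_learner)
    weights = truncated_is_weights(logp_learner, logp_sampler, clip_max)
    return float(-(weights * np.asarray(advantages) * logp_learner).mean())

# Made-up numbers: the third token is far more likely under the learner.
logp_learner = np.array([-1.0, -2.0, -0.5])
logp_sampler = np.array([-1.0, -2.5, -3.0])
weights = truncated_is_weights(logp_learner, logp_sampler)
loss = weighted_pg_loss(logp_learner, logp_sampler, advantages=np.ones(3))
```

When the two engines agree exactly, every weight is 1 and the loss reduces to the plain policy-gradient surrogate.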
Zichen Liu
@zzlccc
Oct 3, 2025
6 months after our paper release, I still recall the debates on removing the length normalization term in DrGRPO. And people gradually think DrGRPO is just about removing the std, ignoring the most important and subtle (length) bias we tried to point out to the community. Even
41K
Zichen Liu
@zzlccc
Jul 27, 2025
Learning GSPO proposed by Qwen team: fig 1. they propose to use sequence likelihood for importance sampling fig 2. but from the RL course by @svlevine, this is the original form of off-policy PG fig 3. per-token IS in (Dr) GRPO is an approximation of it Am I missing anything?
63K
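The relation between the two kinds of importance ratio in the post above can be checked numerically. This is a minimal sketch with made-up log-probabilities and hypothetical function names; it uses the unnormalized sequence ratio, which is not necessarily the exact quantity used in GSPO.

```python
import numpy as np

def token_ratios(logp_new, logp_old):
    """Per-token importance ratios pi_new(o_t | ...) / pi_old(o_t | ...)."""
    return np.exp(np.asarray(logp_new) - np.asarray(logp_old))

def sequence_ratio(logp_new, logp_old):
    """Whole-sequence importance ratio pi_new(o | q) / pi_old(o | q)."""
    return float(np.exp(np.sum(np.asarray(logp_new) - np.asarray(logp_old))))

# Made-up per-token log-probs for one four-token response.
logp_old = np.array([-0.7, -1.2, -0.3, -2.0])
logp_new = np.array([-0.6, -1.0, -0.4, -1.5])

per_token = token_ratios(logp_new, logp_old)
whole_seq = sequence_ratio(logp_new, logp_old)
```

The sequence ratio is the product of the per-token ratios, so using each token's own ratio in isolation drops the other factors of that product.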
Zichen Liu
@zzlccc
Mar 22, 2025
Good catch! But in fact this correction is unnecessary. We were aware of this. The N/(N-1) factor affects all training instances equally, thus can be compensated by adapting the learning rate. Their gradients are the same after compensation. We have acknowledged the connection
leloy!
@leloykun
Mar 22, 2025
I'm not sure if someone has already pointed this out, but Dr. GRPO still has a bias that is more pronounced the smaller the group size is. To make it unbiased, simply multiply Dr. GRPO's A_i by the correction term N/(N-1). With this, you'll get LOOP (Leave-One-Out Proximal Policy
45K
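The claim in this exchange can be verified directly: subtracting a leave-one-out baseline gives the same advantages as subtracting the group mean, scaled by N/(N-1). The sketch below uses made-up rewards and hypothetical function names.

```python
import numpy as np

def mean_baseline_advantages(rewards):
    """A_i = r_i - mean(r): group-mean baseline, with no std division."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def leave_one_out_advantages(rewards):
    """A_i = r_i - mean of the other N-1 rewards in the group."""
    r = np.asarray(rewards, dtype=float)
    n = r.size
    loo_baseline = (r.sum() - r) / (n - 1)
    return r - loo_baseline

rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0])  # made-up group of N = 5
n = rewards.size
adv_mean = mean_baseline_advantages(rewards)
adv_loo = leave_one_out_advantages(rewards)
```

Since `adv_loo` equals `n / (n - 1) * adv_mean` for every sample, the two differ only by a constant factor per group size, which is why a learning-rate change can absorb it when N is fixed.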
Zichen Liu
@zzlccc
Oct 6, 2025
GEM❤️Tinker GEM, an environment suite with a unified interface, works perfectly with Tinker, the API by @thinkymachines that handles the heavy lifting of distributed training. In our latest release of GEM, we 1. supported Tinker and 5 more RL training frameworks 2. reproduced
58K
Zichen Liu
@zzlccc
Aug 1, 2025
In the era of experience, we're training LLM agents with RL — but something's missing... We miss the good old Gym! So we built 💎GEM: a suite of environments for training LLM 𝚐𝚎𝚗𝚎𝚛𝚊𝚕𝚒𝚜𝚝𝚜. Let’s build the Gym for LLMs, together: axon-rl.notion.site/gem
45K
Zichen Liu
@zzlccc
Mar 26, 2025
Since the release of Dr. GRPO, many are interested in the 𝐥𝐞𝐧𝐠𝐭𝐡 𝐛𝐢𝐚𝐬 in GRPO's formulation & implementation, as well as in PPO's implementations. I did some updates on our paper and prepared a table for better comparison (details in thread):
22K
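One way to read the length bias is to look at the weight each token receives when the loss is aggregated. The sketch below compares a per-sequence mean (dividing by each response's own length) with a constant normalizer; the mode names and `max_len` are assumptions for illustration, not the paper's notation.

```python
import numpy as np

def per_token_weight(lengths, mode, max_len=None):
    """Weight applied to each token's loss term in a response of given length.

    mode="per_sequence_mean": divide by the response's own length |o|.
    mode="constant": divide by a fixed constant max_len for every response.
    """
    lengths = np.asarray(lengths, dtype=float)
    if mode == "per_sequence_mean":
        return 1.0 / lengths
    if mode == "constant":
        if max_len is None:
            raise ValueError("max_len is required for mode='constant'")
        return np.full_like(lengths, 1.0 / max_len)
    raise ValueError(f"unknown mode: {mode}")

lengths = [10, 100]  # a short and a long response (made-up lengths)
w_seq_mean = per_token_weight(lengths, "per_sequence_mean")
w_constant = per_token_weight(lengths, "constant", max_len=100)
```

With the per-sequence mean, each token of the 100-token response gets a tenth of the weight given to each token of the 10-token response, so a long response with a negative advantage is penalized less per token. With a constant normalizer the per-token weight does not depend on length.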