Wish to build scaling laws for RL but not sure how to scale? Or what scales? Or would RL even scale predictably?
We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs
Devvrit
GradStudent@UTCompSci, StudentResearcher@Meta. Large Scale ML - Scalability and Efficiency. Past: DeepMind.
Joined December 2019
- Presenting SONew: A Sparsified Online Newton Method. SONew captures gradient cross-moments while keeping memory linear in #params, and is embarrassingly parallel! Paper: arxiv.org/abs/2311.10085 1/8
- No better way to start my "Active on twitter" journey! Introducing MatFormer! Now train just one model, and get an entire family of models to choose from. Stay tuned, and we'll be dropping even more interesting updates now and then :)
  Quoted post: Announcing MatFormer - a nested🪆(Matryoshka) Transformer that offers elasticity across deployment constraints. MatFormer is an architecture that lets us use 100s of accurate smaller models that we never actually trained for! arxiv.org/abs/2310.07707 1/9
- Had an amazing time on the Delta Podcast about our recent Scaling RL work, future directions, and some fun broader conversation. Thanks for having me on :)
  Quoted post: Huge thanks to Devvrit Khatri for coming on the Delta Podcast! Check out the podcast episode here: youtube.com/watch?v=ZDg58Z…
- Replying to @Devvrit_Khatri: Work done at Meta (thanks for the gb200s :p), with awesome collaborators including @louvishh, @rish2k1, @rach_it_, @dvsaisurya, Manzil Zaheer, @inderjit_ml, @brandfonbrener, and @agarwl_ Paper: arxiv.org/abs/2510.13786 My blog link (work in progress): devvrit.com/scaling_rl
- Replying to @Devvrit_Khatri: How do we understand the contribution of several design choices in an RL algorithm? Do they make the algorithm efficient? Or do they elevate the asymptotic performance? To study the scaling behavior of each design choice, we need to fit a predictable scaling curve - this provides
- Replying to @Devvrit_Khatri: Not all RL methods scale equally well. Some reach higher asymptotic performance than others. Methods that may look promising early on can be worse when extrapolating to a larger compute regime.
- Replying to @Devvrit_Khatri: We provide (a) a framework to fit such scaling curves. Using this, we analyze several design choices, and combine the best ones to form our recipe (b) ScaleRL. We demonstrate its effectiveness by predictably scaling to 100k GPU-hours.
- Replying to @Devvrit_Khatri: Common “tricks” mainly shift efficiency: loss aggregation, normalization, curriculum, etc. Large batch size, large generation length, loss type, off-policy setup, and train/inference kernel mismatch fixes are the most consequential.
- I’ve been working with MLO for the past couple of years and it’s the best research lab that I’ve worked at. The team is super helpful and friendly, and the work is highly, highly impactful! Would strongly recommend applying 💻
  Quoted post: Excited to share that the Machine Learning and Optimization team at @GoogleDeepMind India is hiring Research Scientists and Research Engineers! If you're passionate about cutting-edge AI research and building efficient, elastic, and safe LLMs, we'd love to hear from you. Check
- Just boarded my flight to Vancouver for #NeurIPS2024 ! Excited to meet everyone and chat about efficiency and scalability in LLMs. Hit me up if you’re around! 😁
- Replying to @ziv_ravid: Agreed, indeed this isn't a scaling law. Nor do we claim it in our paper (in case you're referring to it). But what we claim is that we need such a property to build scaling laws. And we show RL exhibits this property of being predictable. There are many implications of this, and I'm happy
- Replying to @Devvrit_Khatri: Would “scaling” up along generation length/model size/batch-size give expected gains? Absolutely! And now we can analyze how exactly they improve the performance. For example, smaller bsz/gen len may seem better initially, but larger ones overtake eventually.
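
The posts above describe two ways a design choice can matter: it can change how efficiently a run improves, or it can change the asymptotic performance it reaches. A small Python sketch may help picture the "smaller bsz looks better early, larger overtakes later" case. Everything in it is an assumption for illustration: the posts do not give a functional form, so the saturating curve, the `reward` and `first_crossover` helpers, and all parameter values are hypothetical and are not taken from the paper.

```python
def reward(compute, asymptote, midpoint, exponent, start=0.0):
    """Hypothetical saturating curve: rises from `start` toward `asymptote`.

    `midpoint` is the compute at which half of the total gain is reached,
    so a smaller midpoint means a more compute-efficient run.
    """
    return start + (asymptote - start) / (1.0 + (midpoint / compute) ** exponent)


# Made-up settings: the small-batch run is more efficient early on,
# while the large-batch run has the higher asymptote.
small_batch = dict(asymptote=0.55, midpoint=1_000.0, exponent=1.0)
large_batch = dict(asymptote=0.65, midpoint=4_000.0, exponent=1.0)


def first_crossover(early_leader, late_leader, budgets):
    """Return the first budget at which `late_leader` matches or beats `early_leader`."""
    for c in budgets:
        if reward(c, **late_leader) >= reward(c, **early_leader):
            return c
    return None


# Compute budgets in (hypothetical) GPU-hours: 64, 128, ..., 131072.
budgets = [2 ** k for k in range(6, 18)]
crossover = first_crossover(small_batch, large_batch, budgets)

for c in budgets:
    print(c, round(reward(c, **small_batch), 3), round(reward(c, **large_batch), 3))
print("large batch overtakes at budget:", crossover)
```

Under these made-up numbers, judging the two settings only at small budgets would pick the small-batch run, which is the kind of early-compute misjudgment the thread warns about.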













