Devvrit (@Devvrit_Khatri) / X

250 posts

Devvrit

@Devvrit_Khatri

GradStudent@UTCompSci, StudentResearcher@Meta. Large Scale ML - Scalability and Efficiency. Past: DeepMind.

Joined December 2019

Devvrit
@Devvrit_Khatri
Oct 16, 2025
Wish to build scaling laws for RL but not sure how to scale? Or what scales? Or would RL even scale predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs
296K
Devvrit
@Devvrit_Khatri
Nov 18, 2023
Presenting SONew: A Sparsified Online Newton Method. SONew captures gradient cross-moments while keeping memory linear in #params, and is embarrassingly parallel! Paper: arxiv.org/abs/2311.10085 1/8
39K
Devvrit
@Devvrit_Khatri
Oct 16, 2023
No better way to start my "Active on twitter" journey! Introducing MatFormer! Now train just one model, and get an entire family of models to choose from. Stay tuned, and we'll be dropping even more interesting updates now and then :)
Aditya Kusupati
@adityakusupati
Oct 16, 2023
Announcing MatFormer - a nested🪆(Matryoshka) Transformer that offers elasticity across deployment constraints. MatFormer is an architecture that lets us use 100s of accurate smaller models that we never actually trained for! arxiv.org/abs/2310.07707 1/9
10K
Devvrit
@Devvrit_Khatri
Oct 20, 2025
Had an amazing time on the Delta Podcast about our recent Scaling RL work, future directions, and some fun broader conversation. Thanks for having me on :)
Delta Institute
@DeltaInstitutes
Oct 20, 2025
Huge thanks to Devvrit Khatri for coming on the Delta Podcast! Check out the podcast episode here: youtube.com/watch?v=ZDg58Z…
7.1K
Devvrit
@Devvrit_Khatri
Oct 16, 2025
Replying to @Devvrit_Khatri
Work done at Meta (thanks for the gb200s :p), with awesome collaborators including @louvishh, @rish2k1, @rach_it_, @dvsaisurya, Manzil Zaheer, @inderjit_ml, @brandfonbrener, and @agarwl_. Paper: arxiv.org/abs/2510.13786. My blog link (work in progress): devvrit.com/scaling_rl
arxiv.org
The Art of Scaling Reinforcement Learning Compute for LLMs
Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training....
2.7K
Devvrit
@Devvrit_Khatri
Oct 16, 2025
Replying to @Devvrit_Khatri
How do we understand the contribution of several design choices in an RL algorithm? Do they make the algorithm efficient? Or do they elevate the asymptotic performance? To study the scaling behavior of each design choice, we need to fit a predictable scaling curve - this provides
4.8K
Devvrit
@Devvrit_Khatri
Nov 11, 2025
#ICLR reviews
6.2K
Devvrit
@Devvrit_Khatri
Oct 16, 2025
Replying to @Devvrit_Khatri
Not all RL methods scale equally well. Some reach higher asymptotic performance than others. Methods that may look promising early on can be worse when extrapolating to a larger compute regime.
2.7K
Devvrit
@Devvrit_Khatri
Oct 16, 2025
Replying to @Devvrit_Khatri
We provide (a) a framework to fit such scaling curves. Using this, we analyze several design choices, and combine the best ones to form our recipe (b) ScaleRL. We demonstrate its effectiveness by predictably scaling to 100k GPU-hours.
3.8K
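A minimal sketch of what "fitting a scaling curve" could look like in practice. The thread does not state the functional form, so this assumes a saturating sigmoid in compute with a floor `r0`, an asymptote, an exponent, and a midpoint `c_mid`. All names, the grid ranges, and the data are hypothetical and synthetic; none of it is taken from the paper's results. The idea shown is to fit on cheaper runs and check the extrapolation against a more expensive one.

```python
def saturating_curve(compute, r0, asymptote, exponent, c_mid):
    """Performance that starts near r0 and approaches `asymptote` as compute grows."""
    return r0 + (asymptote - r0) / (1.0 + (c_mid / compute) ** exponent)


def fit_curve(points, r0):
    """Brute-force least-squares fit over a coarse grid (illustrative, not efficient)."""
    best = None
    for a_step in range(50, 101):        # asymptote candidates: 0.50 .. 1.00
        asymptote = a_step / 100
        for b_step in range(2, 31):      # exponent candidates: 0.2 .. 3.0
            exponent = b_step / 10
            for m_step in range(0, 41):  # midpoint candidates: 10^1 .. 10^5
                c_mid = 10 ** (1 + m_step / 10)
                err = sum(
                    (saturating_curve(c, r0, asymptote, exponent, c_mid) - r) ** 2
                    for c, r in points
                )
                if best is None or err < best[0]:
                    best = (err, asymptote, exponent, c_mid)
    return best[1], best[2], best[3]


# Synthetic, noiseless data generated from made-up parameters (not paper results).
TRUE_R0, TRUE_A, TRUE_B, TRUE_C_MID = 0.2, 0.8, 1.0, 1000.0
budgets = [100, 200, 400, 800, 1600, 3200, 6400, 12800]
runs = [(c, saturating_curve(c, TRUE_R0, TRUE_A, TRUE_B, TRUE_C_MID)) for c in budgets]

# Fit on the six cheapest runs only, then extrapolate to the largest budget.
fit_a, fit_b, fit_c_mid = fit_curve(runs[:6], TRUE_R0)
predicted = saturating_curve(budgets[-1], TRUE_R0, fit_a, fit_b, fit_c_mid)
actual = runs[-1][1]
print(f"fitted asymptote={fit_a:.2f}, predicted={predicted:.4f}, actual={actual:.4f}")
```

With real, noisy runs a proper nonlinear least-squares routine would replace the grid search; the grid is used here only to keep the sketch dependency-free.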
Devvrit
@Devvrit_Khatri
Oct 16, 2025
Replying to @Devvrit_Khatri
Common “tricks” mainly shift efficiency: loss aggregation, normalization, curriculum, etc. Large batch size, large generation length, loss type, off-policy setup, and train/inference kernel mismatch fixes are the most consequential.
2.4K
Devvrit
@Devvrit_Khatri
Oct 4, 2024
I’ve been working with MLO for the past couple of years and it’s the best research lab that I’ve worked at. The team is super helpful and friendly, and the work is highly, highly impactful! Would strongly recommend applying 💻
Prateek Jain
@jainprateek_
Oct 4, 2024
Excited to share that the Machine Learning and Optimization team at @GoogleDeepMind India is hiring Research Scientists and Research Engineers! If you're passionate about cutting-edge AI research and building efficient, elastic, and safe LLMs, we'd love to hear from you. Check
2.4K
Devvrit
@Devvrit_Khatri
Dec 9, 2024
Just boarded my flight to Vancouver for #NeurIPS2024 ! Excited to meet everyone and chat about efficiency and scalability in LLMs. Hit me up if you’re around! 😁
816
Devvrit
@Devvrit_Khatri
Oct 17, 2025
Replying to @ziv_ravid
Agreed, indeed this isn't a scaling law. Nor do we claim it in our paper (in case you are referring to it). But what we claim is that we need such a property to build scaling laws. And we show RL exhibits this property of being predictable. There are many implications of this, and I'm happy
2.6K
Devvrit
@Devvrit_Khatri
Oct 16, 2025
Replying to @Devvrit_Khatri
Would “scaling” up along generation length/model size/batch-size give expected gains? Absolutely! And now we can analyze how exactly they improve the performance. For example, smaller bsz/gen len may seem better initially, but larger ones overtake eventually.
4.8K
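The "looks better early, gets overtaken later" pattern can be illustrated with two saturating curves. This is a sketch under assumptions: the sigmoid form, the parameter names, and every number below are made up for illustration and are not measurements from the paper. The "small batch" curve is given a lower asymptote but an earlier midpoint, and the "large batch" curve the reverse.

```python
def saturating_curve(compute, r0, asymptote, exponent, c_mid):
    """Performance that starts near r0 and approaches `asymptote` as compute grows."""
    return r0 + (asymptote - r0) / (1.0 + (c_mid / compute) ** exponent)


def crossover_compute(curve_a, curve_b, lo=10.0, hi=1e6, iters=60):
    """Bisect in log-compute for the budget where curve_b catches up with curve_a."""
    def gap(c):
        return saturating_curve(c, **curve_b) - saturating_curve(c, **curve_a)

    if not (gap(lo) < 0 < gap(hi)):
        raise ValueError("curves must cross inside [lo, hi]")
    for _ in range(iters):
        mid = (lo * hi) ** 0.5  # geometric midpoint, since compute spans decades
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5


# Hypothetical configurations: fast-but-lower-ceiling vs slow-but-higher-ceiling.
small_batch = dict(r0=0.2, asymptote=0.6, exponent=1.0, c_mid=500.0)
large_batch = dict(r0=0.2, asymptote=0.8, exponent=1.0, c_mid=2000.0)

cross = crossover_compute(small_batch, large_batch)
early_small = saturating_curve(100, **small_batch)
early_large = saturating_curve(100, **large_batch)
late_small = saturating_curve(1e5, **small_batch)
late_large = saturating_curve(1e5, **large_batch)
print(f"crossover near compute={cross:.0f}")
```

For these made-up parameters the curves cross at a budget of 2500 (in whatever unit compute is measured): below it the small-batch curve is ahead, above it the large-batch curve wins. Comparing the two only at small budgets would pick the configuration with the lower ceiling.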