R1-zero is such a striking example of a discovery that’s blatantly obvious in retrospect, yet eluded so many for such a long time
Charlie Snell
PhD student @berkeley_ai; research @cursor_ai; prev @GoogleDeepMind. My friend told me to tweet more. I stare at my computer a lot and make things
- What did Ilya’s investors see?
- Need the Terence Tao vibe review of o3
- > wake up > launch yet another YOLO run (600M H100 hours, powered by 16 suns) > spend entire day anxiously refreshing wandb > fuck, learning rate too high again > beg manager for just one more YOLO run tomorrow > go to bed and repeat
- On difficult problems, humans can think longer to improve their decisions. Can we instill a similar capability into LLMs? And can it do well? In our paper, we find that by optimally scaling test-time compute we can outperform *much* larger models in a FLOPs matched evaluation.
- Recently my Twitter timeline has been completely taken over by artwork generated with @OpenAI's CLIP model. So I figured I'd write a blog post about it. In the blog I follow the evolution of this art scene and present some cool artwork along the way ml.berkeley.edu/blog/posts/cli…
- Can we predict emergent capabilities in GPT-N+1🌌 using only GPT-N model checkpoints, which have random performance on the task? We propose a method for doing exactly this in our paper “Predicting Emergent Capabilities by Finetuning”🧵
- When you leave the RL training overnight, only to wake up and find that llama has had enough
- LM can “learn from itself”😉 We ask it to generate answers with extra info in prompts/scratchpads, and then fine-tune on the generations. We call this “context distillation”, and with it we can learn from instructions/explanations, training examples, and step-by-step reasoning.
- Oh shit officer @kennybeats is part of Kanye’s personal security now. Mans is moving up in the law enforcement world
- I still remember reading the Minerva paper one afternoon the summer before starting my PhD. I was shocked. Before then, a tiny part of me thought Gary Marcus could actually be right. Immediately after processing the paper, this shred of doubt dissolved.
- Hot take: MoE is often not the optimal config if you want to run models locally. Locally, you’re usually memory constrained. To maximize capabilities you should use a big dense model that maxes out device memory.
- The problem with idea guys is that their ideas aren’t very good
- Training LLMs across a bunch of devices is not easy. To contribute towards making this easier, I'm releasing my workflow for training LLMs in Jax with the JaxSeq repository. I've personally used this to train up to 20 billion parameter models. Link:






