Verdict at @NousResearch RL hackathon! Your calibrated and low-variance LLM-as-a-judge is a reward model 🙈
Nimit Kalra
218 posts
Incoming PhD student. Visiting researcher with @MicahGoldblum (self-play, RL, reasoning, world models). Prev: @HaizeLabs @Citadel @UTAustin
- we're looking for a rockstar research eng @haizelabs! if you're interested in training tons of models and thinking about adversarial robustness for real-world deployed AI systems, DM me or apply below :)
- qwen RL has felt icky recently, but these authors get llama RL to match
  What Makes a Base Language Model Suitable for RL? Rumors in the community say RL (i.e., RLVR) on LLMs is full of “mysteries”: (1) Is the magic only happening on Qwen + Math? (2) Does the "aha moment" only spark during math reasoning? (3) Is evaluation hiding some tricky traps?
- The more RL I do, the less I believe in evolution
- Excited to discuss "SFT Memorizes, RL Generalizes" tomorrow at @haizelabs's NYC AI Reading Group with @leonardtang_ and @willccbb! We'll also explore a broader theme — "what does RL actually learn?", guided by some related works from the past week.
- We modified DeepSeek's recent Self-Principled Critique Tuning paper and bootstrapped a family of super tiny generalist reward models in < 1 day on a single A100 GPU. By proposing instance-specific rubrics at inference time, j1-micro (1.7B) and j1-nano (0.6B) punch well above
- awful day to be an llm
  EVALS EVALS EVALS Core Research @AutinMitra
- Discussing "Mind the Gap" tonight at @haizelabs's NYC AI Reading Group with @leonardtang_ and @willccbb. Authors study self-improvement through the "Generation-Verification Gap" (model's verification ability over its own generations) and find that this capability log scales with
  Still noodling on this, but the generation-verification gap proposed by @yus167 @_hanlin_zhang_ @ShamKakade6 @udayaghai et al. in arxiv.org/abs/2412.02674 is a very nice framework that unifies a lot of thoughts around self-improvement/verification/bootstrapping reasoning
- Replying to @Purring_Lynx: rate limits too low for any real prod use cases tho 🙄
- think it was @jxmnop who said that science is about generating artifacts. inspired me to really focus on this this past week, starting with some internal eng tools and paper summaries... grinding out a couple more researchy things for the next couple weeks :) super excited to
- Flying out to #ICML2025 tonight! Always down to chat about unverifiable domains, evals, red-teaming, safeguards, or just meet cool people. I’ll be a panelist at the Methods and Opportunities at Small Scale workshop, sharing our work on tiny generalist reward models
- Replying to @rakyll: Picked GPL for one of my first open-source projects and really learned this lesson the hard way
- What tools are people using these days to search for relevant citations, e.g., papers that actually benchmark against a particular work? Google Scholar first page is usually surveys/prior work sections, which are somewhat useless for tracing the lineage of an approach
- Great discussion tonight at @haizelabs HQ about the many, many different definitions of generalization / “out of distribution” and which ones we actually care about in practice. + a special shoutout to @marklxu1 for the Joe’s pizza 🤤
  thursday night pizza + papers in nyc! thanks to those who came out!! @leonardtang_ @qw3rtman @willccbb