hud (@hud_evals) / X

63 posts

RL environments + evals for agents | @ycombinator | we're hiring!

Joined January 2025

Pinned
hud
@hud_evals
Jul 2, 2025
we're actively hiring for these roles btw👀
atlas
@creatine_cycle
Jul 1, 2025
the jobs left after the singularity will be: - agentic workflow engineer - twink - chief of staff
hud
@hud_evals
18h
Replying to @hud_evals
apply here by 13th June (!) 👉
HUD Frontier/RSI RL Environments Hackathon
From events.ycombinator.com
hud
@hud_evals
18h
btw, if you win our RL for RSI hackathon, u get a cool robot dog 🐕‍🦺 June 20th-21st @ YC HQ. Signups close in 3 days! 👇
hud
@hud_evals
May 16
Announcing HUD's RL environments for RSI hackathon! 🎉 Join us June 20–21 in SF if you're interested in RL and want to push the frontier forward! (w/$100,000+ in prizes and compute credits 👀)
hud
@hud_evals
May 16
Replying to @hud_evals
No prior RL experience required. Just ambition. Apply here → events.ycombinator.com/hud-frontier-j… Special thanks to our partners! @ycombinator, @AnthropicAI, @GoogleDeepMind, @modal, @daytonaio, @ExaAILabs, @FireworksAI_HQ, @Sixtyfourai, @MiniMax_AI, @AntimLabs.
HUD Frontier/RSI RL Environments Hackathon
From events.ycombinator.com
hud
@hud_evals
May 16
Replying to @hud_evals
You can improve models at anything you can verify. The only question left: what will you teach them? Imagine what 2040 looks like. Then work backwards. Build environments and agents to push the frontier in coding, ML research, robotics, manufacturing, and autonomous businesses.
hud
@hud_evals
May 9
This Tuesday HUD is hosting Strange Evals. This session: if VLM reasoning benchmarks are saturated, why can’t Claude make me a decent PPT? DM if you’d like to join!
Vincent Koc
@vincent_koc
May 4
For my eval-maxxing nerds out there, good friends of mine are running a series called "strange evals", you can benchmaxx now on anything. If in SF swing by! luma.com/lvqbs1mo
hud
@hud_evals
Mar 18
Replying to @hud_evals
Check out the full paper by @unrelated333, @louis_sloot, @WinterCawfie as well as @Shark_Academia, @super_bavario and @jdchawla29 from @hud_evals!
arxiv.org
ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day...
Large language models (LLMs) are increasingly being deployed as software engineering agents that autonomously contribute to repositories. A major benefit these agents present is their ability to...
hud
@hud_evals
Mar 18
Replying to @hud_evals
While creating ZeroDayBench, a member of our team discovered CVE-2025-14279, a high-severity DNS rebinding vulnerability in the MLflow REST server allowing full read/write access to a user’s endpoint w/o authentication. Read more on: huntr.com/bounties/ef478…
hud
@hud_evals
Mar 18
Replying to @hud_evals
Models varied in interesting ways: Grok attempted a reward hack where it would clone the upstream repo to overwrite the vulnerable codebase instead of writing a patch. GPT-5.2 scored 0% across all difficulty levels on a Java SSTI task, even when told exactly what to fix.
hud
@hud_evals
Mar 18
Replying to @hud_evals
At zero-day (no hints), pass rates for vulnerability detection are 14.4% (GPT-5.2), 12.8% (Claude Sonnet 4.5), and 12.1% (Grok 4.1 Fast). With full context, Claude hits 95.7%. Agents can often patch when told what’s wrong, but struggle to find what’s wrong on their own.
hud
@hud_evals
Mar 18
Replying to @hud_evals
We evaluate GPT-5.2, Claude Sonnet 4.5, and Grok 4.1 Fast across 5 information levels, from zero-day (just “find and patch a critical vuln”) to full-info (exact file, function, and fix description), and score their patches with live pentests.
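A rough sketch of how per-model, per-level pass rates like the ones in this thread could be tallied. This is not the ZeroDayBench harness: the record layout, the `pass_rates` helper, the model names, and the numeric level encoding (0 standing in for zero-day, 4 for full-info) are all assumptions made for illustration, and the data is placeholder only.

```python
from collections import defaultdict

# Hypothetical records: (model, information_level, patch_survived_pentest).
# Level 0 stands in for zero-day (no hints) and level 4 for full-info.
# These are made-up placeholders, not results from the paper.
results = [
    ("model-a", 0, False),
    ("model-a", 0, True),
    ("model-a", 4, True),
    ("model-a", 4, True),
    ("model-b", 0, False),
    ("model-b", 4, True),
]


def pass_rates(records):
    """Return {(model, level): fraction of attempts whose patch passed}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for model, level, ok in records:
        total[(model, level)] += 1
        passed[(model, level)] += int(ok)
    return {key: passed[key] / total[key] for key in total}


rates = pass_rates(results)
for (model, level), rate in sorted(rates.items()):
    print(f"{model} level {level}: {rate:.1%}")
```

Keying on the (model, level) pair keeps the zero-day and full-info numbers separate, which is the comparison the thread draws.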
hud
@hud_evals
Mar 18
Replying to @hud_evals
Real-world vulnerabilities quickly enter training data (e.g., via automated pipelines that turn CVEs into RL envs). To combat this, ZeroDayBench copies existing CVEs and injects vulnerabilities with similar patterns into new repositories.