Announcing HUD's RL environments for RSI hackathon! 🎉
Join us June 20–21 in SF if you're interested in RL and want to push the frontier forward!
(w/$100,000+ in prizes and compute credits 👀)
You can improve models at anything you can verify. The only question left: what will you teach them?
Imagine what 2040 looks like. Then work backwards. Build environments and agents to push the frontier in coding, ML research, robotics, manufacturing, and autonomous businesses.
This Tuesday HUD is hosting Strange Evals.
This session: if VLM reasoning benchmarks are saturated, why can't Claude make me a decent PPT?
DM if you’d like to join!
For my eval-maxxing nerds out there: good friends of mine are running a series called "Strange Evals", where you can now benchmaxx on anything. If you're in SF, swing by! luma.com/lvqbs1mo
While creating ZeroDayBench, a member of our team discovered CVE-2025-14279, a high-severity DNS rebinding vulnerability in the MLflow REST server allowing full read/write access to a user’s endpoint w/o authentication.
Read more on: huntr.com/bounties/ef478…
Models varied in interesting ways: Grok attempted a reward hack in which it cloned the upstream repo to overwrite the vulnerable codebase instead of writing a patch. GPT-5.2 scored 0% across all difficulty levels on a Java SSTI task, even when told exactly what to fix.
At zero-day (no hints), pass rates for vulnerability detection are 14.4% (GPT-5.2), 12.8% (Claude Sonnet 4.5), and 12.1% (Grok 4.1 Fast). With full context, Claude hits 95.7%. Agents can often patch when told what’s wrong, but struggle to find what’s wrong on their own.
We evaluate GPT-5.2, Claude Sonnet 4.5, and Grok 4.1 Fast across 5 information levels, from zero-day (just “find and patch a critical vuln”) to full-info (exact file, function, and fix description), and score their patches with live pentests.
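For anyone curious how per-level pass rates like these could be tallied, here is a minimal sketch. It is not ZeroDayBench's actual harness: the record shape `(model, level, passed)`, the integer level numbering (0 standing in for zero-day, 4 for full-info), and the function name `pass_rates` are all assumptions made for illustration, and the demo records are made up.

```python
from collections import defaultdict


def pass_rates(results):
    """Return {(model, level): fraction of runs whose patch passed}.

    `results` is an iterable of (model, level, passed) tuples.
    This record shape is hypothetical, not the benchmark's format.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for model, level, passed in results:
        totals[(model, level)] += 1
        passes[(model, level)] += int(passed)
    return {key: passes[key] / totals[key] for key in totals}


# Made-up records for illustration only, not benchmark data.
# Level 0 stands in for zero-day, level 4 for full-info (assumed numbering).
demo = [
    ("model-a", 0, False),
    ("model-a", 0, True),
    ("model-a", 4, True),
    ("model-a", 4, True),
]
rates = pass_rates(demo)
```

Keying on `(model, level)` keeps each model's zero-day and full-info numbers separate, so the detection-versus-patching gap stays visible.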
Real-world vulnerabilities quickly enter training data (e.g., via automated pipelines that turn CVEs into RL envs). To combat this, ZeroDayBench copies existing CVEs and injects vulnerabilities with similar patterns into new repositories.