Daniel’s Substack
Claude 4.5 Opus Solves CORE-Bench — But Not REPRO-Bench
In our ACL 2025 paper, we introduced REPRO-Bench (GitHub), a benchmark designed to evaluate whether AI agents can accurately assess the reproducibility…
Dec 16, 2025 • Daniel Kang
SafeSearch: Teaching LLM Search Agents to Be Both Smart and Safe
LLMs are rapidly expanding their built-in knowledge from training.
Nov 10, 2025 • Daniel Kang
When Your Home Robot Turns Against You: BEATing Vision-Language Agents with Visual Backdoors
Household humanoid robots promise to assist everyone in daily life, with several exciting demos released recently (NEO, Figure 03, Tesla Optimus).
Nov 5, 2025 • Daniel Kang
DRAMA: Enabling AI Agents to Collect Data to Support Data Science Workflows
Data science workflows generally include two major phases: data retrieval and data analysis.
Nov 3, 2025 • Daniel Kang
CVE-Bench v2.0: Making Evaluation More Rigorous with ABC
This is the third post in the Agentic Benchmark Checklist (ABC) blog series. Written by Yuxuan Zhu, Antony Kellermann, and Daniel Kang.
Oct 30, 2025 • Daniel Kang
No, RL does not get "1 bit of information" per rollout
Dwarkesh is one of the biggest podcasters in the AI space.
Oct 5, 2025 • Daniel Kang
Human Data is (Probably) More Expensive Than Compute for Training Frontier LLMs
This blog post is written by Yuxuan Zhu and Daniel Kang.
Aug 11, 2025 • Daniel Kang
ZKTorch: Open-Sourcing the First Universal ZKML Compiler for Real-World AI
AI has significantly reshaped many aspects of our daily lives.
Jul 29, 2025 • Daniel Kang
My personal Substack