This was a very timely project, and I had so much fun and learned so much about agent evaluation. Kudos to the SkillsBench community!
How good are agents at using the latest CLI tools like GWS CLI, and can they use them safely?
Introducing ClawsBench, the first benchmark that measures both LLM capability and safety in a set of high-fidelity, stateful environments and scenarios.
We made 5 mock








