Docker workbench and Agent Skill for running deterministic evals against agent skills.
Use this repo in two ways:
- Install the skill-optimizer skill/plugin into your agent so it can author and debug eval suites.
- Run the local CLI to execute cases and suites in Docker against OpenRouter models.
Installation differs by agent. The canonical skill is skills/skill-optimizer/SKILL.md; every plugin manifest points at that same file.
Register this repository as a Claude Code plugin marketplace:
/plugin marketplace add fastxyz/skill-optimizer
Then install the plugin:
/plugin install skill-optimizer@skill-optimizer
Register this repository as a Codex plugin marketplace:
codex plugin marketplace add fastxyz/skill-optimizer
Then open the plugin search interface:
/plugins
Select skill-optimizer and install it.
In the Codex app, open Plugins from the sidebar, search for skill-optimizer, and install it from the Coding section.
If it is not listed, install it from Codex CLI first:
codex plugin marketplace add fastxyz/skill-optimizer
Install the skill with the open skills CLI:
npx skills add fastxyz/skill-optimizer --skill skill-optimizer -a cursor -y
Cursor can also import the skill from GitHub via Settings -> Rules -> Project Rules -> Add Rule -> Remote Rule (Github). The Cursor plugin metadata lives at .cursor-plugin/plugin.json.
Tell OpenCode:
Fetch and follow instructions from https://raw.githubusercontent.com/fastxyz/skill-optimizer/refs/heads/main/.opencode/INSTALL.md
Or add the plugin to opencode.json at user or project scope:
{
  "plugin": ["skill-optimizer@git+https://github.com/fastxyz/skill-optimizer.git"]
}
Restart OpenCode. See docs/README.opencode.md for details.
Install the Gemini extension from GitHub:
gemini extensions install https://github.com/fastxyz/skill-optimizer
To update:
gemini extensions update skill-optimizer
If you only want the skill files without plugin metadata, use the open skills CLI:
npx skills add fastxyz/skill-optimizer --skill skill-optimizer -a claude-code -a opencode -a codex -a cursor -y
Requirements:
- Node.js 20+
- Docker
- OPENROUTER_API_KEY for real model runs
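The requirements above can be checked before a run. The following is a minimal sketch, not part of the skill-optimizer CLI: the helper name `missingPrerequisites` is hypothetical, and it covers only the Node.js version and the API key (checking that Docker is installed would need a separate process call).

```typescript
// Hypothetical preflight helper; the real CLI may perform its own checks.
export function missingPrerequisites(
  nodeVersion: string,
  env: Record<string, string | undefined>,
): string[] {
  const problems: string[] = [];

  // process.versions.node looks like "20.11.1"; compare the major part only.
  const major = Number.parseInt(nodeVersion.split(".")[0] ?? "", 10);
  if (Number.isNaN(major) || major < 20) {
    problems.push(`Node.js 20+ is required, found ${nodeVersion}`);
  }

  // The key is only needed for real model runs.
  if (!env.OPENROUTER_API_KEY) {
    problems.push("OPENROUTER_API_KEY is not set (needed for real model runs)");
  }

  return problems;
}

// Usage: report anything missing in the current process.
for (const problem of missingPrerequisites(process.versions.node, process.env)) {
  console.warn(problem);
}
```

Returning a list of problems rather than throwing lets a caller decide whether a missing key is fatal.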
Install and build:
npm install
npm run build
Only openrouter/... model refs are supported.
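As an illustration of the model-ref restriction, the sketch below rejects any ref that does not start with the `openrouter/` prefix. The function name `isOpenRouterRef` is hypothetical; how the CLI itself validates refs is not shown in this document.

```typescript
// Hypothetical check: a model ref is accepted only if it has the
// "openrouter/" prefix followed by at least one more character.
export function isOpenRouterRef(modelRef: string): boolean {
  const prefix = "openrouter/";
  return modelRef.startsWith(prefix) && modelRef.length > prefix.length;
}

// The ref used elsewhere in this README passes the check.
console.log(isOpenRouterRef("openrouter/google/gemini-2.5-flash")); // true
// A ref without the prefix does not.
console.log(isOpenRouterRef("google/gemini-2.5-flash")); // false
```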
Run the suite against the models listed in suite.yml:
npx tsx src/cli.ts run-suite examples/workbench/pdf/suite.yml --trials 1
Run one case directly:
npx tsx src/cli.ts run-case ./case.yml --model openrouter/google/gemini-2.5-flash
CLI help:
npx tsx src/cli.ts --help
npx tsx src/cli.ts run-case --help
npx tsx src/cli.ts run-suite --help
The workbench gives an agent a skill/reference folder, an isolated /work directory, and deterministic graders. It is designed for evals where success can be verified from files, command logs, SQL, generated artifacts, or other local state.
Core concepts:
- A case is one user-like task plus one or more graders.
- A suite is a matrix of cases and OpenRouter models.
- references/ is copied into /work; this is where the skill under test lives.
- The agent phase sees only /work, not graders, hidden answers, /case, or /results.
- Graders run after the agent with $CASE, $WORK, and $RESULTS available.
- Graders are the acceptance contract. They can inspect workspace files and artifacts, answer.json, trace.jsonl, and result state under $RESULTS.
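To show what a deterministic grader check might look like, here is a minimal sketch that scans JSON Lines trace text for a mention of a given path. It rests on several assumptions: the function name `traceMentions` is hypothetical, the schema of trace.jsonl is not described here (the sketch only assumes one JSON value per line), and how a grader reports pass or fail to the workbench is left to docs/workbench.md.

```typescript
// Hypothetical grader helper. It takes the text of a JSONL trace and
// returns true if any parsed event contains the given string anywhere.
// Assumption: one JSON value per line; the real trace schema may differ.
export function traceMentions(traceJsonl: string, needle: string): boolean {
  for (const line of traceJsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // skip blank lines

    let event: unknown;
    try {
      event = JSON.parse(trimmed);
    } catch {
      continue; // ignore lines that are not valid JSON
    }

    // Re-serialize so the match runs over normalized JSON text.
    // Needles containing quotes or backslashes would need escaping first.
    if (JSON.stringify(event).includes(needle)) return true;
  }
  return false;
}

// Usage with made-up trace events and a made-up file path:
const sampleTrace = [
  '{"tool":"read","path":"/work/pdf/SKILL.md"}',
  '{"tool":"bash","cmd":"ls /work"}',
].join("\n");
console.log(traceMentions(sampleTrace, "/work/pdf/SKILL.md")); // true
```

In a real grader the trace text would be read from the results directory rather than passed in as a literal; keeping the check a pure function makes it easy to unit test.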
Read docs/workbench.md for the full model: directory layout, Docker phases, graders, outputs, and debugging.
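The idea of a suite as a matrix of cases and models can be sketched as a simple cross product. This is an illustration only: the `PlannedRun` type and `expandSuite` function are hypothetical, the second model name is a placeholder, and the sketch assumes that `--trials` repeats every case/model pair, which this README does not spell out.

```typescript
// Hypothetical shape of one planned run in the matrix.
interface PlannedRun {
  casePath: string;
  model: string;
  trial: number; // 1-based trial index
}

// Expand cases x models x trials into a flat list of runs.
export function expandSuite(
  cases: string[],
  models: string[],
  trials: number,
): PlannedRun[] {
  const runs: PlannedRun[] = [];
  for (const casePath of cases) {
    for (const model of models) {
      for (let trial = 1; trial <= trials; trial++) {
        runs.push({ casePath, model, trial });
      }
    }
  }
  return runs;
}

// Usage: two made-up cases against two models, one trial each.
const planned = expandSuite(
  ["cases/a.yml", "cases/b.yml"],
  ["openrouter/google/gemini-2.5-flash", "openrouter/example/placeholder-model"],
  1,
);
console.log(planned.length); // 4
```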
Tracked examples live under examples/workbench/. The PDF example includes positive PDF extraction/splitting/creation cases and a negative case that checks the agent did not read the PDF skill file for a non-PDF task. The MCP example shows a local calculator server started as a hidden Docker service and exposed through the workbench mcp command.
npx tsx src/cli.ts run-suite examples/workbench/pdf/suite.yml --trials 1
npx tsx src/cli.ts run-suite examples/workbench/mcp/suite.yml --trials 1
npm run typecheck
npm test
npm run build
npx tsx src/cli.ts --help
For Docker runner or image changes:
docker build -t skill-optimizer-workbench:local -f docker/workbench-runner.Dockerfile .
Do not commit .skill-eval/, .results/, .env, or credentials.