pickled
Pickled tests whether real agents can answer and build with your product, across declared context paths, using deterministic evidence.
Pickled is an open-source CLI that tests whether real agents (Claude Code, Codex CLI, Anthropic API, OpenAI API) can answer and build with your product, across declared context paths (a source injected, or reached through web / mcp tools), using deterministic evidence: fact coverage, misstatement rejection, tool-use provenance, and build verification. No LLM grades another LLM.
The terms
- Source is public context Pickled may score against: a local file, a URL, or a codebase glob.
- Agent is who answers: Claude Code, Codex CLI, Anthropic API, or OpenAI API.
- Context is a named delivery path: a `mode` (`memory`, `inject`, `web`, or `mcp`) plus an optional source.
- Question / build is the unit of work: a question asks something and scores the answer on declared facts; a build has the agent edit a workspace and passes when the `verifier` does.
- Fact / misstatement are the reusable, deterministic match contracts a question scores against.
A task runs as one cell per (agent × context) pair, and each cell is graded on its own.
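
As a rough illustration of the cell model, the sketch below expands a list of agents and contexts into one cell per pair. The type names, field names, agent identifiers, and the example URL are all hypothetical, chosen for this example only; they are not Pickled's actual schema or internals.

```typescript
// Hypothetical types for illustration; not Pickled's real schema.
type Mode = "memory" | "inject" | "web" | "mcp";

interface Context {
  name: string;
  mode: Mode;
  source?: string; // optional source: a local file, a URL, or a glob
}

interface Cell {
  agent: string;
  context: Context;
}

// Expand a task into one cell per (agent × context) pair.
function expandCells(agents: string[], contexts: Context[]): Cell[] {
  const out: Cell[] = [];
  for (const agent of agents) {
    for (const context of contexts) {
      out.push({ agent, context });
    }
  }
  return out;
}

const cells = expandCells(
  ["claude-code", "codex-cli"], // assumed agent identifiers
  [
    { name: "from-memory", mode: "memory" },
    { name: "readme-injected", mode: "inject", source: "README.md" },
    { name: "docs-over-web", mode: "web", source: "https://example.com/docs" },
  ],
);

console.log(cells.length); // 6: two agents × three contexts
```

Each element of `cells` would then be run and graded independently of the others.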
What pickled tests
Two audiences, one tool:
- External. Vendors testing how outside-world agents understand their published product. Register your README, llms.txt, docs URLs, and the things you do not want agents to say. Pickled tells you whether the agent answered from your sources or made it up.
- Internal. Engineering teams testing whether their own CLAUDE.md, AGENTS.md, JSDoc, comments, and runbooks steer their own agents correctly. This repo is the dogfood case.
How scoring works
A task runs as one cell per (agent × context) pair. Each cell scores independently:
- Facts and misstatements. A question's `expects` lists facts the answer must cover; `rejects` lists misstatements it must not make (a hard veto). Matching is deterministic substring (normalized). The cell verdict is `YES` (all expected facts covered, no misstatement, tool path used), `PARTIAL` (some facts), or `NO`.
- Tool-use provenance. When a context uses `web` or `mcp`, the agent must invoke at least one of its tools, or the cell is vetoed to `NO`. Model memory does not count as evidence for a tool path.
- Build verification. A build instead has the agent edit a workspace; the cell passes when the `verifier` does (a `failToPass` / `passToPass` command split), reported as `Built` k-of-n over trials. `pickled build --verify-only` proves the harness without running an agent.
- Receipts. Save a run with `--output`, then re-render it with `pickled report` as terminal, markdown, or JSON. Markdown is built for CI job summaries; default JSON strips source text, full answers, transcripts, diffs, and command output.
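
The question-scoring rules above can be sketched roughly as follows. This is a minimal illustration, not Pickled's implementation: the `CellInput` shape, the `scoreCell` and `normalize` names, and the normalization itself (lowercasing and collapsing whitespace) are assumptions, since the text only says matching is a normalized substring check.

```typescript
type Verdict = "YES" | "PARTIAL" | "NO";
type Mode = "memory" | "inject" | "web" | "mcp";

// Hypothetical input shape for one question cell.
interface CellInput {
  answer: string;
  expects: string[]; // fact phrases the answer must cover
  rejects: string[]; // misstatement phrases the answer must not contain
  mode: Mode;
  toolCalls: number; // tool invocations observed during the run
}

// Assumed normalization: lowercase, collapse whitespace, trim.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

function scoreCell(input: CellInput): Verdict {
  const answer = normalize(input.answer);
  const contains = (phrase: string): boolean =>
    answer.indexOf(normalize(phrase)) !== -1;

  // Any misstatement is a hard veto.
  if (input.rejects.some(contains)) return "NO";

  // A web or mcp context must show at least one tool call.
  const needsTool = input.mode === "web" || input.mode === "mcp";
  if (needsTool && input.toolCalls === 0) return "NO";

  // Assumption: an empty expects list counts as fully covered.
  const covered = input.expects.filter(contains).length;
  if (covered === input.expects.length) return "YES";
  return covered > 0 ? "PARTIAL" : "NO";
}

const base: CellInput = {
  answer: "Pickled is an open-source CLI. No LLM grades another LLM.",
  expects: ["open-source CLI", "no LLM grades another LLM"],
  rejects: ["uses an LLM judge"],
  mode: "inject",
  toolCalls: 0,
};

console.log(scoreCell(base)); // YES
console.log(scoreCell({ ...base, mode: "web" })); // NO: no tool call on a tool path
console.log(scoreCell({ ...base, answer: "Pickled is an open-source CLI." })); // PARTIAL
console.log(
  scoreCell({ ...base, answer: "Pickled is an open-source CLI that uses an LLM judge." }),
); // NO: misstatement veto
```

Checking the vetoes before counting facts means an answer that covers every fact still fails if it makes a rejected claim or skips the declared tool path. Build cells are not covered by this sketch, since they pass or fail on the verifier.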
Getting started
Install the CLI, write a pickled.yml, run a check:
bunx @pickled-dev/cli init
bunx @pickled-dev/cli check .

See Getting started for a real first check.
Why "pickled"?
Pickles are preserved. That is the point. Your product docs, examples, and agent instructions keep changing, but agents often carry old or partial versions of them. Pickled checks whether the important parts still hold up when an agent tries to use them.
The joke is small. The contract is serious.