pickled

Pickled tests whether real agents can answer and build with your product, across declared context paths, using deterministic evidence.

Pickled is an open-source CLI that tests whether real agents (Claude Code, Codex CLI, Anthropic API, OpenAI API) can answer and build with your product across declared context paths (a source injected, or reached through web / mcp tools), using deterministic evidence: fact coverage, misstatement rejection, tool-use provenance, and build verification. No LLM grades another LLM.

The terms

Source is public context Pickled may score against: a local file, a URL, or a codebase glob.
Agent is who answers: Claude Code, Codex CLI, Anthropic API, or OpenAI API.
Context is a named delivery path: a mode (memory, inject, web, or mcp) plus an optional source.
Question / build is the unit of work: a question asks something and scores the answer on declared facts; a build has the agent edit a workspace and passes when the verifier does.
Fact / misstatement are the reusable, deterministic match contracts a question scores against.

A task runs as one cell per (agent × context) pair, and each cell is graded on its own.
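The (agent × context) pairing above can be sketched as a plain cross product. This is a minimal illustration, not Pickled's actual data model: the type names, the `expandCells` helper, the agent identifiers, the context names, and the example URL are all hypothetical.

```typescript
// Hypothetical shapes for illustration; Pickled's real types may differ.
type Mode = "memory" | "inject" | "web" | "mcp";

interface Context {
  name: string;    // the name given to this delivery path
  mode: Mode;
  source?: string; // optional: a local file, a URL, or a codebase glob
}

interface Cell {
  agent: string;
  context: Context;
}

// Expand one task into one cell per (agent × context) pair.
function expandCells(agents: string[], contexts: Context[]): Cell[] {
  const cells: Cell[] = [];
  for (const agent of agents) {
    for (const context of contexts) {
      cells.push({ agent, context });
    }
  }
  return cells;
}

// Assumed identifiers and sources, chosen only for the example.
const cells = expandCells(
  ["claude-code", "codex-cli"],
  [
    { name: "no-context", mode: "memory" },
    { name: "readme-injected", mode: "inject", source: "README.md" },
    { name: "docs-via-web", mode: "web", source: "https://example.com/docs" },
  ],
);

console.log(cells.length); // 6: two agents × three contexts
```

Because every cell carries its own agent and context, each one can be graded without reference to the others.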

What pickled tests

Two audiences, one tool:

External. Vendors testing how outside-world agents understand their published product. Register your README, llms.txt, docs URLs, and the things you do not want agents to say. Pickled tells you whether the agent answered from your sources or made it up.
Internal. Engineering teams testing whether their own CLAUDE.md, AGENTS.md, JSDoc, comments, and runbooks steer their own agents correctly. This repo is the dogfood case.

How scoring works

A task runs as one cell per (agent × context) pair. Each cell is scored independently:

Facts and misstatements. A question's expects lists facts the answer must cover; rejects lists misstatements it must not make (a hard veto). Matching is a deterministic substring check on normalized text. The cell verdict is YES (all expected facts covered, no misstatement, tool path used), PARTIAL (only some expected facts covered), or NO.
Tool-use provenance. When a context uses web or mcp, the agent must invoke at least one of its tools, or the cell is vetoed to NO. Model memory does not count as evidence for a tool path.
Build verification. A build instead has the agent edit a workspace; the cell passes when the verifier does (a failToPass / passToPass command split), reported as Built k-of-n over trials. pickled build --verify-only proves the harness without running an agent.
Receipts. Save a run with --output, then re-render it with pickled report as terminal, markdown, or JSON. Markdown is built for CI job summaries; default JSON strips source text, full answers, transcripts, diffs, and command output.
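The question-scoring rules above can be sketched in a few lines. This is a rough illustration under stated assumptions, not Pickled's implementation: `CellInput`, `normalize`, and `scoreCell` are hypothetical names, and the normalization shown (lowercasing and collapsing whitespace) is a guess at what "normalized" means.

```typescript
// Hypothetical names throughout; a sketch of the rules above, not Pickled's source.
type Verdict = "YES" | "PARTIAL" | "NO";

interface CellInput {
  answer: string;
  expects: string[]; // facts the answer must cover
  rejects: string[]; // misstatements the answer must not make
  mode: "memory" | "inject" | "web" | "mcp";
  toolCalls: number; // tool invocations observed in the run
}

// Assumed normalization: lowercase and collapse runs of whitespace.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

function scoreCell(cell: CellInput): Verdict {
  const answer = normalize(cell.answer);
  const has = (phrase: string): boolean =>
    answer.indexOf(normalize(phrase)) !== -1;

  // Tool-use provenance: a web or mcp context with no tool call is vetoed.
  const needsTool = cell.mode === "web" || cell.mode === "mcp";
  if (needsTool && cell.toolCalls === 0) return "NO";

  // Any matched misstatement is a hard veto.
  if (cell.rejects.some(has)) return "NO";

  // Otherwise the verdict depends on how many expected facts are covered.
  const covered = cell.expects.filter(has).length;
  if (covered === cell.expects.length) return "YES";
  return covered > 0 ? "PARTIAL" : "NO";
}

const base: CellInput = {
  answer: "Pickled is an open-source CLI. No LLM grades another LLM.",
  expects: ["open-source CLI", "no LLM grades another LLM"],
  rejects: ["uses an LLM judge"],
  mode: "inject",
  toolCalls: 0,
};

console.log(scoreCell(base)); // YES
console.log(scoreCell({ ...base, mode: "web" })); // NO: tool path never used
console.log(scoreCell({ ...base, answer: "Pickled is an open-source CLI." })); // PARTIAL
```

Checking the vetoes before counting facts means a memorized but correct answer on a tool path, or a correct answer that also contains a misstatement, still comes out as NO.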

Getting started

Install the CLI, write a pickled.yml, run a check:

bunx @pickled-dev/cli init
bunx @pickled-dev/cli check .

See Getting started for a real first check.

Why "pickled"?

Pickles are preserved. That is the point. Your product docs, examples, and agent instructions keep changing, but agents often carry old or partial versions of them. Pickled checks whether the important parts still hold up when an agent tries to use them.

The joke is small. The contract is serious.
