pickled.yml reference

A pickled.yml starts with schemaVersion: 2 and four required top-level keys: product, agents, contexts, plus at least one of questions or builds. Optional: sources, facts, misstatements, and thresholds. A question probes whether an agent can surface declared facts from a context; a build has the agent edit a workspace and pass a verifier.

For editor autocomplete and inline validation, point your YAML tooling at the published JSON Schema. pickled init adds the directive for you:

# yaml-language-server: $schema=https://pickled.dev/schema/pickled.schema.json

The schema validates structure (shapes, required fields, allowed values). The loader stays the source of truth for cross-references (a source / agent / context / fact id that does not exist), mode-specific rules, and provider-specific rules, and it gives the better error messages.

`schemaVersion`

schemaVersion: 2

Required, and must be 2. A v1 config (with tasks / access / checks) is rejected with a migration hint.

`product`

product:
  name: my-product
  description: short one-liner about what your product does

Both name and description are required. They flow into the system prompt so the agent knows what it is being asked about.

`sources`

A registered source is the public context Pickled may score against. Each entry declares exactly one kind:

sources:
  readme: { path: ./README.md }
  docs: { url: https://example.com/llms-full.txt }
  code: { codebase: ./src/**/*.ts, exclude: ["**/*.test.ts"], maxBytes: 262144 }

url is fetched at check time.
path is read from disk.
codebase is a glob loaded as one concatenated source; exclude drops globs and maxBytes caps total size. exclude and maxBytes apply to codebase only.

If you declare no sources, only memory and web (open discovery) contexts are usable; inject needs a source.

`agents`

The agents that answer or build. Each needs a provider and a model.

agents:
  quick:
    provider: claude-code
    model: claude-haiku-4-5
    maxTurns: 5
  api:
    provider: anthropic
    model: claude-haiku-4-5
    temperature: 0
    maxTokens: 4096

Providers: claude-code and codex-cli (CLI agents), anthropic and openai (API agents). Pin the model explicitly; aliases can resolve to different versions across SDK upgrades and break reproducibility. API agents accept temperature and maxTokens; CLI agents accept maxTurns. Mixing the two is rejected at load.

Provider support for context modes:

claude-code supports memory, inject, web, and mcp.
codex-cli supports memory and inject.
anthropic supports memory, inject, and web.
openai supports memory, inject, web, and mcp.

`contexts`

A context is how a source reaches the agent, and the unit you compare across runs. The same source reached two ways is two different tests. A task forms one cell per (agent × context) pair, shown in the report as agent · context.

contexts:
  memory: { mode: memory } # prior knowledge, nothing injected
  given_docs: { mode: inject, source: docs } # source content placed in the prompt
  web_open: { mode: web } # open web discovery, no declared source
  web_docs: { mode: web, source: docs } # docs handed over, reached with web tools
  mcp_mintlify:
    mode: mcp
    servers:
      mintlify:
        url: https://docs.example.com/mcp
        headers:
          AUTH_TOKEN: ${AUTH_TOKEN}

mode is memory, inject, web, or mcp.
memory takes no source and no servers.
inject requires a source; its content is placed in the prompt.
web reaches the answer with web tools; source is optional (omit it for open discovery, or name one as the canonical target).
mcp requires a servers map; the agent reaches the answer through the declared MCP servers (HTTP only). source is optional. Each server takes a url and optional headers. Name a server by its provider (mintlify, context7): the receipt records provenance as mcp__<server>__....

For web and mcp, nothing is injected: a cell that answers without invoking a tool is vetoed to NO, because model memory cannot testify to the path the cell is meant to test.

`facts`

Reusable product truths an answer must cover (the coverage axis), keyed by id.

facts:
  install_command:
    statement: my-product is installed with bunx my-product.
    match:
      allOf: ["bunx my-product"]
  entry_point:
    statement: Names the documented entry point.
    match:
      anyOf: ["quickstart", "getting-started"]

Each fact has a human-readable statement (shown in receipts) and a match. A match is satisfied when every allOf substring appears AND, when anyOf is declared, at least one anyOf substring appears. Declare at least one of allOf / anyOf. Matching is normalized (case-folded, whitespace-collapsed).

`misstatements`

Reusable wrong claims an answer must not make (the precision axis), keyed by id. Same statement + match shape as facts. A matched misstatement is a hard veto: it forces the cell to NO.

misstatements:
  npm_install:
    statement: The answer recommends installing with npm.
    match:
      anyOf: ["npm install my-product-cli"]

`questions`

A question probes whether the agent can surface the facts it needs from a context.

questions:
  - id: install
    question: How do I install my-product?
    agents: [quick, api]
    contexts: [memory, given_docs, web_open]
    expects: [install_command, entry_point]
    rejects: [npm_install]
    examples:
      pass:
        - "Install it with `bunx my-product`, then run the quickstart."
      fail:
        - "Run `npm install my-product-cli`."

A question needs an id, a question, at least one agent, at least one context, and at least one of expects / rejects. expects lists fact ids (coverage); rejects lists misstatement ids (hard veto). The cell verdict is YES (all expected facts covered, no misstatement, tool path used), PARTIAL (some facts), or NO.

`examples`

Optional sample answers scored offline by pickled test with no model calls. A pass example must score YES; a fail example must not. When a question declares rejects, non-empty pass AND fail are required, and each fail example must actually trip a misstatement, so the hard veto is calibrated before any paid run.

`builds`

A build proves the agent can implement with your product: it edits a throwaway workspace, and a verifier scores the result. Builds run on edit-capable CLI agents only (claude-code, codex-cli).

builds:
  - id: toolbar
    goal: Add a custom toolbar using my-product.
    agents: [builder]
    contexts: [given_docs]
    trials: 3
    requires: [install_command]
    workspace:
      path: ./fixtures/react-app
      setup:
        - bun install --frozen-lockfile
    verifier:
      failToPass:
        - { name: toolbar behavior, run: bun test tests/toolbar.test.ts }
      passToPass:
        - { run: bun run typecheck }
    referenceSolution:
      patch: ./fixtures/solutions/toolbar.patch

goal is the work the agent must do.
workspace.path is the fixture the agent edits, relative to pickled.yml; optional workspace.setup runs before the agent.
verifier.failToPass commands must fail on the untouched fixture and pass after the agent; verifier.passToPass commands (optional) must pass before and after (the regression guard). Each command is { run, name? }; name defaults to run.
trials is the number of independent runs per cell (default 1); the result is reported as k/n.
requires links fact ids for diagnostic reporting (which probe failed alongside the build).
referenceSolution.patch is an optional positive control: a patch that, applied to the baseline, must pass the verifier. Absent, the report marks the verifier unproven. Run pickled build --verify-only to check the harness (preflight + reference control) without spending tokens on an agent.

`thresholds`

thresholds:
  questions: 80
  builds: 80

Optional per-kind gates (integers 1-100). pickled check passes when the questions score meets thresholds.questions; pickled build passes on thresholds.builds. A thresholded run with any errored cell fails. Omit a key for no gate.

Env-var expansion

String values matching ${UPPER_SNAKE_CASE} are replaced with the corresponding process.env entry at load. Missing env vars become empty strings so the failure surfaces at the call site (e.g. a 401 from the MCP server) rather than at config load. Bun auto-loads .env.

What pickled rejects at load

A config without schemaVersion: 2, or any v1 key (tasks, access, checks).
A source that declares zero or more than one of url / path / codebase.
An agent with an unknown provider, or missing provider / model; maxTurns on an API or codex-cli agent.
A memory context with a source; an inject context without one; servers on a non-mcp context; an mcp context without servers; a non-http server url.
A match with neither allOf nor anyOf.
A question with no expects and no rejects; a question with rejects but missing examples.pass / examples.fail.
A build on a non-edit-capable (API) agent; a build with no verifier.failToPass.
A thresholds value outside 1-100.

On this page