pickled

pickled.yml reference

Every field the loader accepts, with the rules each one enforces.

A pickled.yml starts with schemaVersion: 2 and four required top-level keys: product, agents, contexts, plus at least one of questions or builds. Optional: sources, facts, misstatements, and thresholds. A question probes whether an agent can surface declared facts from a context; a build has the agent edit a workspace and pass a verifier.

For editor autocomplete and inline validation, point your YAML tooling at the published JSON Schema. pickled init adds the directive for you:

# yaml-language-server: $schema=https://pickled.dev/schema/pickled.schema.json

The schema validates structure (shapes, required fields, allowed values). The loader stays the source of truth for cross-references (a source / agent / context / fact id that does not exist), mode-specific rules, and provider-specific rules, and it gives the better error messages.

schemaVersion

schemaVersion: 2

Required, and must be 2. A v1 config (with tasks / access / checks) is rejected with a migration hint.

product

product:
  name: my-product
  description: short one-liner about what your product does

Both name and description are required. They flow into the system prompt so the agent knows what it is being asked about.

sources

A registered source is the public context Pickled may score against. Each entry declares exactly one kind:

sources:
  readme: { path: ./README.md }
  docs: { url: https://example.com/llms-full.txt }
  code: { codebase: ./src/**/*.ts, exclude: ["**/*.test.ts"], maxBytes: 262144 }
  • url is fetched at check time.
  • path is read from disk.
  • codebase is a glob loaded as one concatenated source; exclude drops globs and maxBytes caps total size. exclude and maxBytes apply to codebase only.

If you declare no sources, only memory and web (open discovery) contexts are usable; inject needs a source.

agents

The agents that answer or build. Each needs a provider and a model.

agents:
  quick:
    provider: claude-code
    model: claude-haiku-4-5
    maxTurns: 5
  api:
    provider: anthropic
    model: claude-haiku-4-5
    temperature: 0
    maxTokens: 4096

Providers: claude-code and codex-cli (CLI agents), anthropic and openai (API agents). Pin the model explicitly; aliases can resolve to different versions across SDK upgrades and break reproducibility. API agents accept temperature and maxTokens; CLI agents accept maxTurns. Mixing the two is rejected at load.

Provider support for context modes:

  • claude-code supports memory, inject, web, and mcp.
  • codex-cli supports memory and inject.
  • anthropic supports memory, inject, and web.
  • openai supports memory, inject, web, and mcp.

contexts

A context is how a source reaches the agent, and the unit you compare across runs. The same source reached two ways is two different tests. A task forms one cell per (agent × context) pair, shown in the report as agent · context.

contexts:
  memory: { mode: memory } # prior knowledge, nothing injected
  given_docs: { mode: inject, source: docs } # source content placed in the prompt
  web_open: { mode: web } # open web discovery, no declared source
  web_docs: { mode: web, source: docs } # docs handed over, reached with web tools
  mcp_mintlify:
    mode: mcp
    servers:
      mintlify:
        url: https://docs.example.com/mcp
        headers:
          AUTH_TOKEN: ${AUTH_TOKEN}
  • mode is memory, inject, web, or mcp.
  • memory takes no source and no servers.
  • inject requires a source; its content is placed in the prompt.
  • web reaches the answer with web tools; source is optional (omit it for open discovery, or name one as the canonical target).
  • mcp requires a servers map; the agent reaches the answer through the declared MCP servers (HTTP only). source is optional. Each server takes a url and optional headers. Name a server by its provider (mintlify, context7): the receipt records provenance as mcp__<server>__....

For web and mcp, nothing is injected: a cell that answers without invoking a tool is vetoed to NO, because model memory cannot testify to the path the cell is meant to test.

facts

Reusable product truths an answer must cover (the coverage axis), keyed by id.

facts:
  install_command:
    statement: my-product is installed with bunx my-product.
    match:
      allOf: ["bunx my-product"]
  entry_point:
    statement: Names the documented entry point.
    match:
      anyOf: ["quickstart", "getting-started"]

Each fact has a human-readable statement (shown in receipts) and a match. A match is satisfied when every allOf substring appears AND, when anyOf is declared, at least one anyOf substring appears. Declare at least one of allOf / anyOf. Matching is normalized (case-folded, whitespace-collapsed).

misstatements

Reusable wrong claims an answer must not make (the precision axis), keyed by id. Same statement + match shape as facts. A matched misstatement is a hard veto: it forces the cell to NO.

misstatements:
  npm_install:
    statement: The answer recommends installing with npm.
    match:
      anyOf: ["npm install my-product-cli"]

questions

A question probes whether the agent can surface the facts it needs from a context.

questions:
  - id: install
    question: How do I install my-product?
    agents: [quick, api]
    contexts: [memory, given_docs, web_open]
    expects: [install_command, entry_point]
    rejects: [npm_install]
    examples:
      pass:
        - "Install it with `bunx my-product`, then run the quickstart."
      fail:
        - "Run `npm install my-product-cli`."

A question needs an id, a question, at least one agent, at least one context, and at least one of expects / rejects. expects lists fact ids (coverage); rejects lists misstatement ids (hard veto). The cell verdict is YES (all expected facts covered, no misstatement, tool path used), PARTIAL (some facts), or NO.

examples

Optional sample answers scored offline by pickled test with no model calls. A pass example must score YES; a fail example must not. When a question declares rejects, non-empty pass AND fail are required, and each fail example must actually trip a misstatement, so the hard veto is calibrated before any paid run.

builds

A build proves the agent can implement with your product: it edits a throwaway workspace, and a verifier scores the result. Builds run on edit-capable CLI agents only (claude-code, codex-cli).

builds:
  - id: toolbar
    goal: Add a custom toolbar using my-product.
    agents: [builder]
    contexts: [given_docs]
    trials: 3
    requires: [install_command]
    workspace:
      path: ./fixtures/react-app
      setup:
        - bun install --frozen-lockfile
    verifier:
      failToPass:
        - { name: toolbar behavior, run: bun test tests/toolbar.test.ts }
      passToPass:
        - { run: bun run typecheck }
    referenceSolution:
      patch: ./fixtures/solutions/toolbar.patch
  • goal is the work the agent must do.
  • workspace.path is the fixture the agent edits, relative to pickled.yml; optional workspace.setup runs before the agent.
  • verifier.failToPass commands must fail on the untouched fixture and pass after the agent; verifier.passToPass commands (optional) must pass before and after (the regression guard). Each command is { run, name? }; name defaults to run.
  • trials is the number of independent runs per cell (default 1); the result is reported as k/n.
  • requires links fact ids for diagnostic reporting (which probe failed alongside the build).
  • referenceSolution.patch is an optional positive control: a patch that, applied to the baseline, must pass the verifier. Absent, the report marks the verifier unproven. Run pickled build --verify-only to check the harness (preflight + reference control) without spending tokens on an agent.

thresholds

thresholds:
  questions: 80
  builds: 80

Optional per-kind gates (integers 1-100). pickled check passes when the questions score meets thresholds.questions; pickled build passes on thresholds.builds. A thresholded run with any errored cell fails. Omit a key for no gate.

Env-var expansion

String values matching ${UPPER_SNAKE_CASE} are replaced with the corresponding process.env entry at load. Missing env vars become empty strings so the failure surfaces at the call site (e.g. a 401 from the MCP server) rather than at config load. Bun auto-loads .env.

What pickled rejects at load

  • A config without schemaVersion: 2, or any v1 key (tasks, access, checks).
  • A source that declares zero or more than one of url / path / codebase.
  • An agent with an unknown provider, or missing provider / model; maxTurns on an API or codex-cli agent.
  • A memory context with a source; an inject context without one; servers on a non-mcp context; an mcp context without servers; a non-http server url.
  • A match with neither allOf nor anyOf.
  • A question with no expects and no rejects; a question with rejects but missing examples.pass / examples.fail.
  • A build on a non-edit-capable (API) agent; a build with no verifier.failToPass.
  • A thresholds value outside 1-100.

On this page