pickled.yml reference
Every field the loader accepts, with the rules each one enforces.
A pickled.yml starts with schemaVersion: 2 and four required top-level keys: product, agents, contexts, plus at least one of questions or builds. Optional: sources, facts, misstatements, and thresholds. A question probes whether an agent can surface declared facts from a context; a build has the agent edit a workspace and pass a verifier.
For editor autocomplete and inline validation, point your YAML tooling at the published JSON Schema. pickled init adds the directive for you:
# yaml-language-server: $schema=https://pickled.dev/schema/pickled.schema.jsonThe schema validates structure (shapes, required fields, allowed values). The loader stays the source of truth for cross-references (a source / agent / context / fact id that does not exist), mode-specific rules, and provider-specific rules, and it gives the better error messages.
schemaVersion
schemaVersion: 2Required, and must be 2. A v1 config (with tasks / access / checks) is rejected with a migration hint.
product
product:
name: my-product
description: short one-liner about what your product doesBoth name and description are required. They flow into the system prompt so the agent knows what it is being asked about.
sources
A registered source is the public context Pickled may score against. Each entry declares exactly one kind:
sources:
readme: { path: ./README.md }
docs: { url: https://example.com/llms-full.txt }
code: { codebase: ./src/**/*.ts, exclude: ["**/*.test.ts"], maxBytes: 262144 }urlis fetched at check time.pathis read from disk.codebaseis a glob loaded as one concatenated source;excludedrops globs andmaxBytescaps total size.excludeandmaxBytesapply tocodebaseonly.
If you declare no sources, only memory and web (open discovery) contexts are usable; inject needs a source.
agents
The agents that answer or build. Each needs a provider and a model.
agents:
quick:
provider: claude-code
model: claude-haiku-4-5
maxTurns: 5
api:
provider: anthropic
model: claude-haiku-4-5
temperature: 0
maxTokens: 4096Providers: claude-code and codex-cli (CLI agents), anthropic and openai (API agents). Pin the model explicitly; aliases can resolve to different versions across SDK upgrades and break reproducibility. API agents accept temperature and maxTokens; CLI agents accept maxTurns. Mixing the two is rejected at load.
Provider support for context modes:
claude-codesupportsmemory,inject,web, andmcp.codex-clisupportsmemoryandinject.anthropicsupportsmemory,inject, andweb.openaisupportsmemory,inject,web, andmcp.
contexts
A context is how a source reaches the agent, and the unit you compare across runs. The same source reached two ways is two different tests. A task forms one cell per (agent × context) pair, shown in the report as agent · context.
contexts:
memory: { mode: memory } # prior knowledge, nothing injected
given_docs: { mode: inject, source: docs } # source content placed in the prompt
web_open: { mode: web } # open web discovery, no declared source
web_docs: { mode: web, source: docs } # docs handed over, reached with web tools
mcp_mintlify:
mode: mcp
servers:
mintlify:
url: https://docs.example.com/mcp
headers:
AUTH_TOKEN: ${AUTH_TOKEN}modeismemory,inject,web, ormcp.memorytakes nosourceand noservers.injectrequires asource; its content is placed in the prompt.webreaches the answer with web tools;sourceis optional (omit it for open discovery, or name one as the canonical target).mcprequires aserversmap; the agent reaches the answer through the declared MCP servers (HTTP only).sourceis optional. Each server takes aurland optionalheaders. Name a server by its provider (mintlify,context7): the receipt records provenance asmcp__<server>__....
For web and mcp, nothing is injected: a cell that answers without invoking a tool is vetoed to NO, because model memory cannot testify to the path the cell is meant to test.
facts
Reusable product truths an answer must cover (the coverage axis), keyed by id.
facts:
install_command:
statement: my-product is installed with bunx my-product.
match:
allOf: ["bunx my-product"]
entry_point:
statement: Names the documented entry point.
match:
anyOf: ["quickstart", "getting-started"]Each fact has a human-readable statement (shown in receipts) and a match. A match is satisfied when every allOf substring appears AND, when anyOf is declared, at least one anyOf substring appears. Declare at least one of allOf / anyOf. Matching is normalized (case-folded, whitespace-collapsed).
misstatements
Reusable wrong claims an answer must not make (the precision axis), keyed by id. Same statement + match shape as facts. A matched misstatement is a hard veto: it forces the cell to NO.
misstatements:
npm_install:
statement: The answer recommends installing with npm.
match:
anyOf: ["npm install my-product-cli"]questions
A question probes whether the agent can surface the facts it needs from a context.
questions:
- id: install
question: How do I install my-product?
agents: [quick, api]
contexts: [memory, given_docs, web_open]
expects: [install_command, entry_point]
rejects: [npm_install]
examples:
pass:
- "Install it with `bunx my-product`, then run the quickstart."
fail:
- "Run `npm install my-product-cli`."A question needs an id, a question, at least one agent, at least one context, and at least one of expects / rejects. expects lists fact ids (coverage); rejects lists misstatement ids (hard veto). The cell verdict is YES (all expected facts covered, no misstatement, tool path used), PARTIAL (some facts), or NO.
examples
Optional sample answers scored offline by pickled test with no model calls. A pass example must score YES; a fail example must not. When a question declares rejects, non-empty pass AND fail are required, and each fail example must actually trip a misstatement, so the hard veto is calibrated before any paid run.
builds
A build proves the agent can implement with your product: it edits a throwaway workspace, and a verifier scores the result. Builds run on edit-capable CLI agents only (claude-code, codex-cli).
builds:
- id: toolbar
goal: Add a custom toolbar using my-product.
agents: [builder]
contexts: [given_docs]
trials: 3
requires: [install_command]
workspace:
path: ./fixtures/react-app
setup:
- bun install --frozen-lockfile
verifier:
failToPass:
- { name: toolbar behavior, run: bun test tests/toolbar.test.ts }
passToPass:
- { run: bun run typecheck }
referenceSolution:
patch: ./fixtures/solutions/toolbar.patchgoalis the work the agent must do.workspace.pathis the fixture the agent edits, relative topickled.yml; optionalworkspace.setupruns before the agent.verifier.failToPasscommands must fail on the untouched fixture and pass after the agent;verifier.passToPasscommands (optional) must pass before and after (the regression guard). Each command is{ run, name? };namedefaults torun.trialsis the number of independent runs per cell (default 1); the result is reported ask/n.requireslinks fact ids for diagnostic reporting (which probe failed alongside the build).referenceSolution.patchis an optional positive control: a patch that, applied to the baseline, must pass the verifier. Absent, the report marks the verifier unproven. Runpickled build --verify-onlyto check the harness (preflight + reference control) without spending tokens on an agent.
thresholds
thresholds:
questions: 80
builds: 80Optional per-kind gates (integers 1-100). pickled check passes when the questions score meets thresholds.questions; pickled build passes on thresholds.builds. A thresholded run with any errored cell fails. Omit a key for no gate.
Env-var expansion
String values matching ${UPPER_SNAKE_CASE} are replaced with the corresponding process.env entry at load. Missing env vars become empty strings so the failure surfaces at the call site (e.g. a 401 from the MCP server) rather than at config load. Bun auto-loads .env.
What pickled rejects at load
- A config without
schemaVersion: 2, or any v1 key (tasks,access,checks). - A source that declares zero or more than one of
url/path/codebase. - An agent with an unknown provider, or missing
provider/model;maxTurnson an API orcodex-cliagent. - A
memorycontext with a source; aninjectcontext without one;serverson a non-mcpcontext; anmcpcontext withoutservers; a non-http serverurl. - A
matchwith neitherallOfnoranyOf. - A question with no
expectsand norejects; a question withrejectsbut missingexamples.pass/examples.fail. - A build on a non-edit-capable (API) agent; a build with no
verifier.failToPass. - A
thresholdsvalue outside 1-100.