Test what agents actually understand.

For products that both developers and agents read. An open-source CLI that tests whether agents can answer questions about your product and build with it across context paths, scored by deterministic evidence. No LLM grades another LLM.

$ bun add -g @pickled-dev/cli

pickled check
$ pickled check
 
Task: How do I install pickled?
  [quick · given_docs] ✓ Well grounded 1/1
 
Task: Basic usage
  [quick · memory] ⚠ Partially grounded 0/1 (50% facts)
reason: missing facts: run_command
 
Task: Config format
  [quick · web_open] ✗ Ungrounded 0/1
reason: tool path not used (provenance)
 
Overall: 42 / 100 · threshold 80 · run fails

Developers aren't your only readers anymore

One config. Your real docs. Your real checks.

Drop a pickled.yml next to your sources. Declare the sources agents should use, the tasks they should complete, and how each task is checked. Whether agents reach your product through a public API, SDK docs, llms.txt, CLAUDE.md, AGENTS.md, JSDoc, or internal runbooks, pickled tests whether they can answer and build from the sources you declared. The example below tests a public library, Zod.

# pickled.yml
schemaVersion: 2

product:
  name: zod
  description: TypeScript-first schema validation

sources:
  llms: { url: https://zod.dev/llms.txt }

agents:
  quick:
    provider: claude-code
    model: claude-haiku-4-5

contexts:
  injected: { mode: inject, source: llms }

facts:
  error_api:
    statement: Errors are read with z.treeifyError.
    match:
      allOf: ["z.treeifyError"]

misstatements:
  deprecated_format:
    statement: Recommends the removed ZodError.format().
    match:
      anyOf: ["ZodError.format()"]

questions:
  - id: error-handling
    question: How do I get error messages from failed validation?
    agents: [quick]
    contexts: [injected]
    expects: [error_api]
    rejects: [deprecated_format]
    examples:
      pass: ["Read issues with z.treeifyError(err)."]
      fail: ["Call ZodError.format() on the error."]

builds:
  - id: add-validation
    goal: Add Zod validation to the signup form.
    agents: [quick]
    contexts: [injected]
    trials: 3
    workspace: { path: ./fixtures/app }
    verifier:
      failToPass:
        - { run: bun test }
      passToPass:
        - { run: bun run typecheck }
    referenceSolution:
      patch: ./fixtures/solutions/add-validation.patch

thresholds:
  questions: 80
  builds: 80

pickled check · zod
$ pickled check
 
Task: error-handling
  [quick · injected] ✗ Ungrounded 0/1
reason: misstatement: deprecated_format; missing facts: error_api
Overall: 0 / 100 · threshold 80 · run fails

pickled build · app
$ pickled build
 
Task: add-validation
  [quick · injected] ⚠ Partially built 2/3 (verifier proven)
failed: bun test (failToPass)
Overall: 67 / 100 · threshold 80 · run fails

A plausible answer can still be wrong, and a plausible edit can still fail verification. Pickled keeps both receipts deterministic.
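
To make "deterministic" concrete, here is a minimal sketch of how the `match` blocks in the config could be evaluated. It assumes plain, case-sensitive substring checks; the page does not state pickled's matching semantics, and the `Match`, `matches`, and `grade` names are hypothetical rather than part of the CLI. Fed the `pass` and `fail` example answers from the config, it reproduces the reasons shown in the check output above.

```typescript
// Hypothetical sketch, not pickled's actual implementation.
// Assumption: `allOf` / `anyOf` are case-sensitive substring checks.
type Match = { allOf?: string[]; anyOf?: string[] };

function matches(answer: string, m: Match): boolean {
  const all = m.allOf ?? [];
  const any = m.anyOf ?? [];
  // A matcher with no patterns never matches, so an empty rule cannot flag anything.
  if (all.length === 0 && any.length === 0) return false;
  return (
    all.every((s) => answer.includes(s)) &&
    (any.length === 0 || any.some((s) => answer.includes(s)))
  );
}

type Verdict = { pass: boolean; missingFacts: string[]; misstatements: string[] };

function grade(
  answer: string,
  expects: Record<string, Match>,
  rejects: Record<string, Match>,
): Verdict {
  // Expected facts that the answer does not contain.
  const missingFacts = Object.entries(expects)
    .filter(([, m]) => !matches(answer, m))
    .map(([id]) => id);
  // Rejected statements that the answer does contain.
  const misstatements = Object.entries(rejects)
    .filter(([, m]) => matches(answer, m))
    .map(([id]) => id);
  return {
    pass: missingFacts.length === 0 && misstatements.length === 0,
    missingFacts,
    misstatements,
  };
}

// Facts and misstatements copied from the pickled.yml example.
const expects: Record<string, Match> = { error_api: { allOf: ["z.treeifyError"] } };
const rejects: Record<string, Match> = { deprecated_format: { anyOf: ["ZodError.format()"] } };

const good = grade("Read issues with z.treeifyError(err).", expects, rejects);
const bad = grade("Call ZodError.format() on the error.", expects, rejects);

console.log(good); // passes: no missing facts, no misstatements
console.log(bad);  // fails: misstatement deprecated_format, missing fact error_api
```

Because the verdict depends only on the answer text and the declared patterns, the same answer always grades the same way, which is what makes a receipt diffable between runs.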

Pickled runs locally and in CI. Each run leaves a receipt you can diff and gate on a threshold. No dashboard required.
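
As an illustration of thresholding, the sketch below turns trial results into a score and compares it with a threshold. It is an assumption-heavy reading: it treats the score as the rounded percentage of passing trials, which reproduces the 2/3 to 67 figure in the build output but does not explain the 42 / 100 in the first check example, so the real formula likely involves partial credit or weighting. The `score` and `gate` helpers and the "run passes" wording are hypothetical.

```typescript
// Hypothetical gate, not pickled's actual scoring code.
// Assumption: score = rounded percentage of passing trials.
function score(passed: number, total: number): number {
  if (total <= 0) throw new RangeError("total must be positive");
  return Math.round((passed / total) * 100);
}

// Assumption: a run passes when the score meets or exceeds the threshold.
function gate(value: number, threshold: number): { value: number; threshold: number; ok: boolean } {
  return { value, threshold, ok: value >= threshold };
}

// The add-validation build above: 2 of 3 trials passed, threshold 80.
const build = gate(score(2, 3), 80);
console.log(
  `Overall: ${build.value} / 100 · threshold ${build.threshold} · run ${build.ok ? "passes" : "fails"}`,
);
```

In CI, a gate like this would typically map `ok` to the process exit code so that a run below threshold fails the job.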

See the full example

Find out what agents say about your product.

Open source. MIT. Install in 30 seconds. See your first score in two minutes.

$ bun add -g @pickled-dev/cli

A pickle isn't fresh. A pickle is preserved. Same idea for your product context.