GitHub Actions

Run pickled in CI with offline checks, dry plans, capped runs, saved receipts, and job summaries.

Use CI to keep the public context contract from drifting. Run answer tasks on pull requests; run build tasks on trusted branches or schedules when you want the agent to edit a workspace.

Pull-request-safe workflow

name: pickled

on:
  pull_request:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  deterministic:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: oven-sh/setup-bun@v2
      - run: bun install
      - run: bunx @pickled-dev/cli test .
      - run: bunx @pickled-dev/cli check . --plan
      - run: bunx @pickled-dev/cli build . --plan
      - run: bunx @pickled-dev/cli build . --verify-only

What each step does:

  • pickled test scores example answers offline. It catches brittle checks before spending on a model run.
  • pickled check --plan prints the answer cells that would run. No model calls.
  • pickled build --plan prints build cells and executions. No agent edits.
  • pickled build --verify-only checks that each build fixture and reference patch behaves as declared against its verifier. No agent runs, no agent edits.

Set thresholds.questions or thresholds.builds in pickled.yml when you want CI to fail on a low score.
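
A minimal sketch of where those keys might sit in pickled.yml. The nesting follows the dotted names above. The numeric values, and the assumption that scores are fractions between 0 and 1, are illustrative guesses rather than documented defaults, so confirm the scale against the configuration reference before relying on it.

```yaml
# pickled.yml (fragment)
# Assumption: thresholds are fractions in [0, 1]; this page does not state the scale.
thresholds:
  questions: 0.8   # hypothetical value
  builds: 0.7      # hypothetical value
```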

Real agent runs

Real agent runs spend tokens and may edit throwaway workspaces. Keep them off untrusted fork pull requests. Run them on workflow_dispatch, on a schedule, or on trusted branches:

on:
  workflow_dispatch:
  schedule:
    - cron: "17 8 * * 1"

jobs:
  real-agent-benchmark:
    runs-on: ubuntu-latest
    if: github.event_name == 'workflow_dispatch' || github.event_name == 'schedule'
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v6
      - uses: oven-sh/setup-bun@v2
      - run: bun install

      - name: Run questions
        id: questions
        run: |
          set +e
          bunx @pickled-dev/cli check . --max-cells 20 --output pickled-questions.json
          echo "exit_code=$?" >> "$GITHUB_OUTPUT"
          exit 0

      - name: Add question summary
        if: always()
        run: |
          if [ -f pickled-questions.json ]; then
            bunx @pickled-dev/cli report pickled-questions.json --format markdown >> "$GITHUB_STEP_SUMMARY"
          else
            echo "No Pickled receipt was produced." >> "$GITHUB_STEP_SUMMARY"
          fi

      - name: Upload question receipt
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: pickled-questions
          path: pickled-questions.json

      - name: Reflect question verdict
        run: exit "${{ steps.questions.outputs.exit_code }}"

pickled build --plan reports both selected cells and selected executions. --max-cells caps executions (cells × trials), not cells, so one build cell with trials: 3 counts as three toward the cap.

pickled build --verify-only proves each build's harness without running an agent: the untouched fixture must fail failToPass and pass passToPass, and a declared referenceSolution must apply and clear the verifier. It exits non-zero if any harness is broken, which makes it a cheap gate that catches a bad fixture or an unreachable bar before any tokens are spent.

Use the same pattern for pickled build: save the receipt with --output pickled-builds.json, render it with pickled report pickled-builds.json --format markdown, upload the artifact, then exit with the original command's exit code.
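
Spelled out for builds, the pattern might look like the steps below. This is a sketch that mirrors the question steps shown earlier: the step names, the builds id, the artifact name, and the --max-cells value of 6 are illustrative choices, not requirements.

```yaml
# Append under the job's steps: list, after checkout, setup-bun, and bun install.
- name: Run builds
  id: builds
  run: |
    # Disable exit-on-error so a failing verdict does not skip the later steps.
    set +e
    bunx @pickled-dev/cli build . --max-cells 6 --output pickled-builds.json
    # Record the CLI's exit code, then let this step succeed.
    echo "exit_code=$?" >> "$GITHUB_OUTPUT"
    exit 0

- name: Add build summary
  if: always()
  run: |
    if [ -f pickled-builds.json ]; then
      bunx @pickled-dev/cli report pickled-builds.json --format markdown >> "$GITHUB_STEP_SUMMARY"
    else
      echo "No Pickled receipt was produced." >> "$GITHUB_STEP_SUMMARY"
    fi

- name: Upload build receipt
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: pickled-builds
    path: pickled-builds.json

# Fail the job last, using the exit code recorded by the Run builds step.
- name: Reflect build verdict
  run: exit "${{ steps.builds.outputs.exit_code }}"
```

Deferring the failure to the final step means the summary and the artifact are still produced when the verdict is a fail.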

The default JSON receipt is CI-safe: it keeps verdicts and evidence ids but omits full answers, source text, transcripts, diffs, and command output. Add --verbose only for forensic artifacts.

For public repos, never run build tasks with secrets on untrusted fork pull requests. Run pickled test and pickled check --plan everywhere; run real answer and build tasks on trusted branches, internal PRs, or a schedule.
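
One way to express that trust boundary is a job-level condition that skips pull requests opened from forks. The sketch below uses standard github context fields; the job name is hypothetical, and the condition is one possible guard rather than the only one.

```yaml
jobs:
  real-agent-checks:   # hypothetical job name
    runs-on: ubuntu-latest
    # Run for non-PR events, and for PRs whose head branch lives in this repository.
    if: >-
      github.event_name != 'pull_request' ||
      github.event.pull_request.head.repo.full_name == github.repository
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v6
      - uses: oven-sh/setup-bun@v2
      - run: bun install
      - run: bunx @pickled-dev/cli check . --max-cells 20 --output pickled-questions.json
```

GitHub does not pass repository secrets to pull_request workflows triggered from forks, so the guard mainly keeps fork PRs from running a job that would fail without keys.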

Sample larger suites

For a broad suite, sample first:

- run: bunx @pickled-dev/cli check . --sample 2 --seed pull-${{ github.event.pull_request.number || github.sha }}
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

The seed makes the sample reproducible. The receipt records expandedCells, selectedCells, selectedExecutions, and seed.

Secrets

  • ANTHROPIC_API_KEY for Claude Code and Anthropic API agents.
  • OPENAI_API_KEY for OpenAI API agents.
  • MCP server secrets referenced in pickled.yml with ${UPPER_SNAKE_CASE}.

Bun auto-loads .env locally. In Actions, pass secrets through the job env.
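
A sketch of a job env block that passes both kinds of secret. MY_MCP_TOKEN is a hypothetical name standing in for whatever ${UPPER_SNAKE_CASE} reference your pickled.yml actually uses; the steps are omitted.

```yaml
jobs:
  real-agent-benchmark:
    runs-on: ubuntu-latest
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      # Hypothetical MCP server secret; the name must match the ${...} reference in pickled.yml.
      MY_MCP_TOKEN: ${{ secrets.MY_MCP_TOKEN }}
```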
