GitHub Actions
Run pickled in CI with offline checks, dry plans, capped runs, saved receipts, and job summaries.
Use CI to keep the public context contract from drifting. Run answer tasks on pull requests; run build tasks on trusted branches or schedules when you want the agent to edit a workspace.
Pull-request-safe workflow
name: pickled
on:
  pull_request:
  push:
    branches: [main]
  workflow_dispatch:
jobs:
  deterministic:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: oven-sh/setup-bun@v2
      - run: bun install
      - run: bunx @pickled-dev/cli test .
      - run: bunx @pickled-dev/cli check . --plan
      - run: bunx @pickled-dev/cli build . --plan
      - run: bunx @pickled-dev/cli build . --verify-only

What each step does:
- pickled test scores example answers offline. It catches brittle checks before spending on a model run.
- pickled check --plan prints the answer cells that would run. No model calls.
- pickled build --plan prints build cells and executions. No agent edits.
- pickled build --verify-only proves build fixtures and reference patches. No agent edits.
Set thresholds.questions or thresholds.builds in pickled.yml when you want CI to fail on a low score.
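A minimal sketch of what that could look like in pickled.yml. Only the key names thresholds.questions and thresholds.builds come from this page; the nesting, the numeric values, and the assumption that scores are fractions between 0 and 1 are illustrative, so check the pickled.yml reference for the real scale before copying this.

```yaml
# pickled.yml (fragment)
# ASSUMPTION: thresholds are written as nested keys and scored on a 0-to-1 scale.
# The values below are placeholders, not recommended defaults.
thresholds:
  questions: 0.8   # fail CI when the answer-task score drops below this
  builds: 0.7      # fail CI when the build-task score drops below this
```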
Real agent runs
Real agent runs spend tokens and may edit throwaway workspaces. Keep them off untrusted fork pull requests. Run them on workflow_dispatch, on a schedule, or on trusted branches:
on:
  workflow_dispatch:
  schedule:
    - cron: "17 8 * * 1"
jobs:
  real-agent-benchmark:
    runs-on: ubuntu-latest
    if: github.event_name == 'workflow_dispatch' || github.event_name == 'schedule'
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v6
      - uses: oven-sh/setup-bun@v2
      - run: bun install
      - name: Run questions
        id: questions
        run: |
          set +e
          bunx @pickled-dev/cli check . --max-cells 20 --output pickled-questions.json
          echo "exit_code=$?" >> "$GITHUB_OUTPUT"
          exit 0
      - name: Add question summary
        if: always()
        run: |
          if [ -f pickled-questions.json ]; then
            bunx @pickled-dev/cli report pickled-questions.json --format markdown >> "$GITHUB_STEP_SUMMARY"
          else
            echo "No Pickled receipt was produced." >> "$GITHUB_STEP_SUMMARY"
          fi
      - name: Upload question receipt
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: pickled-questions
          path: pickled-questions.json
      - name: Reflect question verdict
        run: exit "${{ steps.questions.outputs.exit_code }}"

pickled build --plan reports both selected cells and selected executions. --max-cells gates executions (cells × trials), so one build cell with trials: 3 counts as three.
pickled build --verify-only proves each build's harness (the untouched fixture fails failToPass, passes passToPass, and a declared referenceSolution applies and clears the verifier) without running an agent. It exits non-zero if any harness is broken, so it is a cheap gate that catches a bad fixture or an unreachable bar before any token is spent.
Use the same pattern for pickled build: save --output pickled-builds.json, render it with pickled report pickled-builds.json --format markdown, upload the artifact, then exit with the original command's code.
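The build variant might look roughly like the job below. It mirrors the question job step for step and puts pickled build --verify-only first, so a broken harness fails the job before any agent run starts. The job name, step names, step id, and the cap of 20 are illustrative choices rather than values from the pickled documentation.

```yaml
on:
  workflow_dispatch:
  schedule:
    - cron: "17 8 * * 1"
jobs:
  # Hypothetical job name; pick whatever fits your repository.
  real-agent-builds:
    runs-on: ubuntu-latest
    if: github.event_name == 'workflow_dispatch' || github.event_name == 'schedule'
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v6
      - uses: oven-sh/setup-bun@v2
      - run: bun install
      # Cheap gate: a non-zero exit here stops the job before tokens are spent.
      - name: Verify build harnesses
        run: bunx @pickled-dev/cli build . --verify-only
      - name: Run builds
        id: builds
        run: |
          # Keep going after a failing score so the receipt can still be reported.
          set +e
          # The cap of 20 is an assumed value; it counts executions (cells x trials).
          bunx @pickled-dev/cli build . --max-cells 20 --output pickled-builds.json
          echo "exit_code=$?" >> "$GITHUB_OUTPUT"
          exit 0
      - name: Add build summary
        if: always()
        run: |
          if [ -f pickled-builds.json ]; then
            bunx @pickled-dev/cli report pickled-builds.json --format markdown >> "$GITHUB_STEP_SUMMARY"
          else
            echo "No Pickled receipt was produced." >> "$GITHUB_STEP_SUMMARY"
          fi
      - name: Upload build receipt
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: pickled-builds
          path: pickled-builds.json
      # Re-raise the exit code captured in the "Run builds" step.
      - name: Reflect build verdict
        run: exit "${{ steps.builds.outputs.exit_code }}"
```

If the verify step fails, the run step is skipped. The summary and upload steps still execute because of if: always(), and the summary then reports that no receipt was produced.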
Default JSON is CI-safe. It keeps verdicts and evidence ids, but not full answers, source text, transcripts, diffs, or command output. Add --verbose only for forensic artifacts.
For public repos, do not run build tasks on untrusted fork pull requests with secrets. Use pickled test and pickled check --plan everywhere; run real answer and build tasks on trusted branches, internal PRs, or a schedule.
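One way to express "internal PRs only" is a job-level condition that compares the pull request's head repository with the repository running the workflow. The sketch below uses standard GitHub Actions context fields. The job name and the cap of 20 are illustrative, and your repository may prefer a different trust rule.

```yaml
on:
  pull_request:
jobs:
  # Hypothetical job name.
  real-answers-internal-prs:
    runs-on: ubuntu-latest
    # Run only when the PR branch lives in this repository, not in a fork.
    if: github.event.pull_request.head.repo.full_name == github.repository
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v6
      - uses: oven-sh/setup-bun@v2
      - run: bun install
      # Real answer run, capped; the cap value is an assumption.
      - run: bunx @pickled-dev/cli check . --max-cells 20
```

Fork pull requests skip this job entirely, while the deterministic job from the pull-request-safe workflow can keep running for every contributor.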
Sample larger suites
For a broad suite, sample first:
- run: bunx @pickled-dev/cli check . --sample 2 --seed pull-${{ github.event.pull_request.number || github.sha }}
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

The seed makes the sample reproducible. The receipt records expandedCells, selectedCells, selectedExecutions, and seed.
Secrets
- ANTHROPIC_API_KEY for Claude Code and Anthropic API agents.
- OPENAI_API_KEY for OpenAI API agents.
- MCP server secrets referenced in pickled.yml with ${UPPER_SNAKE_CASE}.
Bun auto-loads .env locally. In Actions, pass secrets through the job env.
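As a sketch, a job that needs an MCP server secret in addition to the model keys could declare its env like this. DOCS_MCP_TOKEN is a hypothetical name used only for illustration. The environment variable name has to match whatever ${UPPER_SNAKE_CASE} reference your pickled.yml actually contains.

```yaml
jobs:
  real-agent-benchmark:
    runs-on: ubuntu-latest
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      # Hypothetical MCP server secret; assumes pickled.yml references ${DOCS_MCP_TOKEN}.
      DOCS_MCP_TOKEN: ${{ secrets.DOCS_MCP_TOKEN }}
    steps:
      - uses: actions/checkout@v6
      - uses: oven-sh/setup-bun@v2
      - run: bun install
```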