feat: snapshot/replay for deterministic eval reproduction (closes #367) #391
Merged
Adds 'waza run --snapshot <dir>' to capture self-contained per-task snapshots, and 'waza replay <snapshot.json>' to verify them offline.

Snapshot contents (schema 1.0):
- Identity, prompt + content-addressed instruction digests
- Recursive content-addressed fixture digests
- Ordered tool-event tape (tool_call_id, args, result, duration, error)
- Engine identity, redacted env (default-deny + allow-list)
- Redaction policy/rules/count, final status and grader validations

Replay modes:
- model-replay (default, offline): tool-event sequence + grader consistency
- bisect: locate first divergent turn between two snapshots of same task
- live: stubbed for Wave 4 adversarial harness (exit 2)

Other changes:
- results.json schema bump 1.1 -> 1.2 (additive RunResult.SnapshotPath)
- --snapshot, --snapshot-env-allow, --redact flags on 'waza run'
- Default-deny env capture; built-in redaction rules merge with --redact YAML
- Reuses cmd/waza ExitCodeError so exit codes 0/1/2 propagate through main

Test plan:
- 13 unit tests in internal/snapshot covering capture, redaction, env allow-list, fixtures, schema rejection, bisect divergence
- 7 CLI integration tests for 'waza replay' (pass/fail, JSON, bisect, schema MAJOR rejection, live-mode stub)
- go build ./... && go test ./... && go vet ./... all clean
- golangci-lint run: 0 issues

Docs:
- New site guide: guides/snapshot-replay.mdx
- 'waza replay' section + new flags in reference/cli.mdx
- README All Commands + replay subsection

Closes #367

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR adds a new per-task snapshot artifact and a replay CLI to make eval failures reproducible and debuggable offline. It introduces an internal/snapshot package for capturing/redacting tool-event tapes + fixture digests, wires snapshot capture into the runner, adds waza replay, and updates docs/tests alongside an additive results.json schema bump to 1.2.
Changes:
- Add snapshot capture (`waza run --snapshot …`) with redaction + env allow-listing and persist a `snapshot_path` pointer in `results.json` (schema 1.2).
- Add `waza replay <snapshot.json>` with model-replay and `--bisect` support (live mode stubbed with exit code 2).
- Add/refresh docs and tests for the new snapshot/replay workflows.
Show a summary per file
| File | Description |
|---|---|
| site/src/content/docs/reference/cli.mdx | Documents waza run snapshot flags and the new waza replay command. |
| site/src/content/docs/guides/snapshot-replay.mdx | New end-to-end guide for capturing/replaying/bisecting snapshots and redaction. |
| README.md | Adds snapshot/replay examples and CLI reference updates. |
| internal/snapshot/types.go | Defines the snapshot schema (1.0) and parsing/compat rules. |
| internal/snapshot/snapshot_test.go | Unit tests for snapshot parsing, env capture, fixture hashing, redaction, compare/bisect. |
| internal/snapshot/replay.go | Implements snapshot comparison, divergence rendering, and bisect logic. |
| internal/snapshot/redaction.go | Implements built-in + custom YAML redaction policies and counters. |
| internal/snapshot/io.go | Small I/O helper for fixture hashing. |
| internal/snapshot/fixtures.go | Fixture digesting logic for snapshot capture. |
| internal/snapshot/env.go | Env allow-list capture logic for snapshots (default-deny). |
| internal/snapshot/capture.go | Snapshot capture + writer implementation and tool-event/prompt/result redaction. |
| internal/orchestration/runner.go | Wires snapshot capture into per-run execution and stores SnapshotPath on RunResult. |
| internal/models/schema_version.go | Bumps results.json schema version to 1.2 (additive). |
| internal/models/outcome.go | Adds RunResult.snapshot_path field. |
| cmd/waza/root.go | Registers the new replay command. |
| cmd/waza/cmd_run.go | Adds --snapshot, --snapshot-env-allow, --redact flags and passes options into the runner. |
| cmd/waza/cmd_replay.go | Implements waza replay (model-replay + bisect + live stub). |
| cmd/waza/cmd_replay_test.go | CLI-level tests for replay behavior and exit codes. |
| cmd/waza/cmd_migrate_test.go | Updates migrate tests for schema version 1.2 messaging. |
Review details
- Files reviewed: 19/19 changed files
- Comments generated: 10
- Review effort level: Low
- snapshot env capture no longer leaks every host env name into
DeniedKeys when the allow-list is empty (default-deny path).
- HashFixtures now takes caller-supplied skip dirs; runner excludes the
configured --snapshot directory when it sits under the fixtures root,
removing the hard-coded 'snapshots/' carveout.
- Bisect runs Compare in non-strict mode so the reported first-divergent
turn reflects an agent action change, not downstream result drift.
- Drop the no-op --redact flag from 'waza replay' (it remains on
'waza run', where it actually affects snapshot output); update README,
CLI reference, and the snapshot/replay guide accordingly.
- Fix the schema-major-rejection test to use the real snapshot kind
('task-snapshot') so it asserts schemaVersion handling rather than
kind handling.
- Replace 'redacted' Unicode placeholders in the guide with the literal
[REDACTED] value the implementation actually writes.
- runner.go: hoist the eval run ID so per-task snapshots and the
telemetry span share the same identifier.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Implements snapshot / replay-from-results (Wave 3 #367). Adds two commands to the Go CLI:
- `waza run --snapshot <dir>`: writes a self-contained `snapshot.json` per task run, capturing prompt + content-addressed instruction/fixture digests, the ordered tool-event tape from `runs[].tool_events[]`, engine identity, redacted env (default-deny + explicit allow-list), redaction metadata, and the final status + grader validations.
- `waza replay <snapshot.json>`: re-checks a snapshot offline against itself (model-replay mode, default), bisects two snapshots to find the first divergent turn (`--bisect`), or returns a clean "not implemented" stub for Wave 4's live mode (`--mode live`, exit 2).

The replay tape is deliberately designed so the upcoming #365 adversarial harness can swap in altered tool results without changing the wire format.

What's in a snapshot (schema 1.0)
- `waza_version`, `eval_id`, `eval_name`, `skill`, `task` (TestID, DisplayName, Golden, RunNumber)
- `context_dir`
- `tool_call_id`, `name`, `args`, `result`, `duration_ms`, `success`, `error`
- `final_output` / `error_msg`, total duration, grader validations

Built-in redaction rules cover GitHub tokens, AWS keys, JWTs, emails, etc., and merge additively with `--redact <yaml>`. Even allow-listed env values flow through redaction.

Other changes
- `results.json` schema 1.1 → 1.2 (additive `RunResult.SnapshotPath`). Migrate command tests updated.
- `--snapshot`, `--snapshot-env-allow`, `--redact` flags on `waza run`.
- Reuses `cmd/waza`'s existing `*ExitCodeError` so replay propagates `0/1/2` cleanly through `main.execute()`.

Test plan
- Unit (`internal/snapshot`), 13 tests: capture, redaction (string + nested maps), env allow-list + default-deny, fixture digesting, schema MAJOR rejection, bisect divergence.
- CLI integration (`cmd/waza`), 7 tests: model-replay pass/fail, `--json`, bisect divergence, bisect match, schema rejection, live-mode stub.
- `go build ./... && go vet ./... && go test ./...`: all clean.
- `golangci-lint run`: 0 issues.

Docs
- New guide `site/src/content/docs/guides/snapshot-replay.mdx` walking through capture, replay, bisect, redaction, and CI patterns.
- `waza replay` section + new flags in `site/src/content/docs/reference/cli.mdx`.
- README: `waza replay <snapshot.json>` subsection.

Closes #367
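
To make the bisect idea concrete, here is a minimal sketch of locating the first divergent turn between two tool-event tapes. It is not the code in `internal/snapshot/replay.go`: the `ToolEvent` type carries only a subset of the fields listed above, and `firstDivergence` with its `strict` flag is a hypothetical name and signature. It assumes that non-strict comparison looks only at the tool name and args (the agent's action), while strict comparison also looks at the result.

```go
package main

import (
	"fmt"
	"reflect"
)

// ToolEvent is a hypothetical, trimmed-down tape entry.
type ToolEvent struct {
	Name   string
	Args   map[string]any
	Result string
}

// firstDivergence returns the index of the first turn at which the two
// tapes differ, or -1 if they match. With strict=false only Name and
// Args are compared, so a changed Result alone is not a divergence.
func firstDivergence(a, b []ToolEvent, strict bool) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i].Name != b[i].Name || !reflect.DeepEqual(a[i].Args, b[i].Args) {
			return i
		}
		if strict && a[i].Result != b[i].Result {
			return i
		}
	}
	if len(a) != len(b) {
		return n // one tape has extra turns after the common prefix
	}
	return -1
}

func main() {
	good := []ToolEvent{
		{Name: "read_file", Args: map[string]any{"path": "a.txt"}, Result: "v1"},
		{Name: "write_file", Args: map[string]any{"path": "out.txt"}, Result: "ok"},
	}
	bad := []ToolEvent{
		{Name: "read_file", Args: map[string]any{"path": "a.txt"}, Result: "v2"}, // result drift only
		{Name: "write_file", Args: map[string]any{"path": "tmp.txt"}, Result: "ok"}, // action change
	}
	fmt.Println("non-strict:", firstDivergence(good, bad, false))
	fmt.Println("strict:", firstDivergence(good, bad, true))
}
```

In this example the non-strict comparison reports turn 1, where the agent's action changes, while the strict comparison stops at turn 0 because of result drift alone.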