feat: snapshot/replay for deterministic eval reproduction (closes #367) by spboyer · Pull Request #391 · microsoft/waza

spboyer · 2026-06-28T13:39:31Z

Summary

Implements snapshot / replay-from-results (Wave 3 #367). Adds two entry points to the Go CLI, a new flag on waza run and a new waza replay command:

waza run --snapshot <dir> — writes a self-contained snapshot.json per task run, capturing prompt + content-addressed instruction/fixture digests, the ordered tool-event tape from runs[].tool_events[], engine identity, redacted env (default-deny + explicit allow-list), redaction metadata, and the final status + grader validations.
waza replay <snapshot.json> — re-checks a snapshot offline against itself (model-replay mode, default), bisects two snapshots to find the first divergent turn (--bisect), or returns a clean "not implemented" stub for Wave 4's live mode (--mode live, exit 2).
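
As a rough illustration of what --bisect has to do, the sketch below walks two tool-event tapes in order and reports the first turn whose action differs. The `ToolEvent` type, its Go field names, and `firstDivergentTurn` are hypothetical and are not taken from internal/snapshot; the `strict` switch mirrors the follow-up commit's note that bisect compares in non-strict mode so that result drift alone is not reported as the divergence.

```go
package main

import (
	"fmt"
	"reflect"
)

// ToolEvent is a hypothetical, trimmed-down tape entry. The real type in
// internal/snapshot may differ.
type ToolEvent struct {
	ToolCallID string
	Name       string
	Args       map[string]any
	Result     string
}

// firstDivergentTurn returns the index of the first turn where the two tapes
// differ, or -1 when they match. In non-strict mode only the agent's action
// (tool name + args) is compared; strict mode also compares tool results.
func firstDivergentTurn(a, b []ToolEvent, strict bool) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i].Name != b[i].Name || !reflect.DeepEqual(a[i].Args, b[i].Args) {
			return i
		}
		if strict && a[i].Result != b[i].Result {
			return i
		}
	}
	if len(a) != len(b) {
		return n // one tape ended early; divergence is the first missing turn
	}
	return -1
}

func main() {
	good := []ToolEvent{
		{Name: "read", Args: map[string]any{"path": "a.txt"}, Result: "x"},
		{Name: "write", Args: map[string]any{"path": "out.txt"}, Result: "ok"},
	}
	bad := []ToolEvent{
		{Name: "read", Args: map[string]any{"path": "a.txt"}, Result: "DIFFERENT"},
		{Name: "delete", Args: map[string]any{"path": "out.txt"}, Result: "ok"},
	}
	fmt.Println(firstDivergentTurn(good, bad, false)) // 1: first changed action
	fmt.Println(firstDivergentTurn(good, bad, true))  // 0: result drift counts
}
```

Comparing only name and args in the default path keeps the reported turn tied to a change in agent behaviour rather than to nondeterministic tool output.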

The replay tape is deliberately designed so the upcoming #365 adversarial harness can swap in altered tool results without changing the wire format.

What's in a snapshot (schema 1.0)

Section	Contents
Identity	`waza_version`, `eval_id`, `eval_name`, `skill`, `task` (TestID, DisplayName, Golden, RunNumber)
Prompt	initial + follow-up messages, SHA-256 digests of instruction files
Fixtures	recursive SHA-256 digests of every file under `context_dir`
Tool events	ordered `tool_call_id`, `name`, `args`, `result`, `duration_ms`, `success`, `error`
Engine	model, vendor, runtime identity
Env	allow-list + redacted KEY/VALUE pairs (default-deny)
Redaction	policy label, matched rule names, redaction count
Result	redacted `final_output` / `error_msg`, total duration, grader validations

Built-in redaction rules cover GitHub tokens, AWS keys, JWTs, emails, etc., and merge additively with --redact <yaml>. Even allow-listed env values flow through redaction.
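
The interplay between default-deny env capture and redaction can be sketched as follows. This is an illustration under assumptions, not waza's implementation: the rule names, the regular expressions, and the `redact` / `captureEnv` helpers are hypothetical, and only the `[REDACTED]` placeholder is taken from the PR itself.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

const placeholder = "[REDACTED]"

// rule pairs a label with a pattern. These three are illustrative examples,
// not the built-in rule set.
type rule struct {
	name string
	re   *regexp.Regexp
}

var rules = []rule{
	{"github-token", regexp.MustCompile(`gh[pousr]_[A-Za-z0-9]{36,}`)},
	{"aws-access-key", regexp.MustCompile(`AKIA[0-9A-Z]{16}`)},
	{"email", regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)},
}

// redact replaces every rule match with the placeholder and returns the
// rewritten string plus the number of replacements made.
func redact(s string) (string, int) {
	count := 0
	for _, r := range rules {
		s = r.re.ReplaceAllStringFunc(s, func(string) string {
			count++
			return placeholder
		})
	}
	return s, count
}

// captureEnv is default-deny: only keys on the allow-list are kept, and even
// those values are passed through redact before being stored.
func captureEnv(env map[string]string, allow []string) map[string]string {
	out := make(map[string]string)
	for _, k := range allow {
		if v, ok := env[k]; ok {
			out[k], _ = redact(v)
		}
	}
	return out
}

func main() {
	env := map[string]string{
		"HOME":     "/home/dev",
		"CONTACT":  "dev@example.com",
		"GH_TOKEN": "ghp_abcdefghijklmnopqrstuvwxyz0123456789",
	}
	captured := captureEnv(env, []string{"HOME", "CONTACT"})

	keys := make([]string, 0, len(captured))
	for k := range captured {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, captured[k])
	}
}
```

With the allow-list above, `GH_TOKEN` never reaches the output because it was not allowed, while `CONTACT` is kept but its value is replaced by the placeholder.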

Other changes

results.json schema 1.1 → 1.2 (additive RunResult.SnapshotPath). Migrate command tests updated.
New --snapshot, --snapshot-env-allow, --redact flags on waza run.
Reuses cmd/waza's existing *ExitCodeError so replay propagates 0 / 1 / 2 cleanly through main.execute().

Test plan

Unit (internal/snapshot) — 13 tests: capture, redaction (string + nested maps), env allow-list + default-deny, fixture digesting, schema MAJOR rejection, bisect divergence.
CLI integration (cmd/waza) — 7 tests: model-replay pass/fail, --json, bisect divergence, bisect match, schema rejection, live-mode stub.
go build ./... && go vet ./... && go test ./... — all clean.
golangci-lint run — 0 issues.
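
The schema MAJOR rejection covered by the tests above could be implemented along these lines. This is a sketch under assumptions: `checkSchemaVersion` and `supportedMajor` are hypothetical names, and accepting any minor version within the supported major is a guess at the compatibility rule rather than something the PR states.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// supportedMajor is the only schema major this hypothetical reader accepts.
const supportedMajor = 1

// checkSchemaVersion accepts "MAJOR.MINOR" strings whose major matches
// supportedMajor and rejects everything else.
func checkSchemaVersion(v string) error {
	majorStr, _, _ := strings.Cut(v, ".")
	major, err := strconv.Atoi(majorStr)
	if err != nil {
		return fmt.Errorf("invalid schema version %q: %w", v, err)
	}
	if major != supportedMajor {
		return fmt.Errorf("unsupported schema major %d (supported: %d)", major, supportedMajor)
	}
	return nil
}

func main() {
	for _, v := range []string{"1.0", "1.7", "2.0", "abc"} {
		fmt.Printf("%s -> %v\n", v, checkSchemaVersion(v))
	}
}
```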

Docs

New site/src/content/docs/guides/snapshot-replay.mdx walking through capture, replay, bisect, redaction, and CI patterns.
waza replay section + new flags in site/src/content/docs/reference/cli.mdx.
README All Commands block + dedicated waza replay <snapshot.json> subsection.

Closes #367

Adds 'waza run --snapshot <dir>' to capture self-contained per-task snapshots, and 'waza replay <snapshot.json>' to verify them offline.

Snapshot contents (schema 1.0):
- Identity, prompt + content-addressed instruction digests
- Recursive content-addressed fixture digests
- Ordered tool-event tape (tool_call_id, args, result, duration, error)
- Engine identity, redacted env (default-deny + allow-list)
- Redaction policy/rules/count, final status and grader validations

Replay modes:
- model-replay (default, offline): tool-event sequence + grader consistency
- bisect: locate first divergent turn between two snapshots of same task
- live: stubbed for Wave 4 adversarial harness (exit 2)

Other changes:
- results.json schema bump 1.1 -> 1.2 (additive RunResult.SnapshotPath)
- --snapshot, --snapshot-env-allow, --redact flags on 'waza run'
- Default-deny env capture; built-in redaction rules merge with --redact YAML
- Reuses cmd/waza ExitCodeError so exit codes 0/1/2 propagate through main

Test plan:
- 13 unit tests in internal/snapshot covering capture, redaction, env allow-list, fixtures, schema rejection, bisect divergence
- 7 CLI integration tests for 'waza replay' (pass/fail, JSON, bisect, schema MAJOR rejection, live-mode stub)
- go build ./... && go test ./... && go vet ./... all clean
- golangci-lint run: 0 issues

Docs:
- New site guide: guides/snapshot-replay.mdx
- 'waza replay' section + new flags in reference/cli.mdx
- README All Commands + replay subsection

Closes #367

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR adds a new per-task snapshot artifact and a replay CLI to make eval failures reproducible and debuggable offline. It introduces an internal/snapshot package for capturing/redacting tool-event tapes + fixture digests, wires snapshot capture into the runner, adds waza replay, and updates docs/tests alongside an additive results.json schema bump to 1.2.

Changes:

Add snapshot capture (waza run --snapshot …) with redaction + env allow-listing and persist a snapshot_path pointer in results.json (schema 1.2).
Add waza replay <snapshot.json> with model-replay and --bisect support (live mode stubbed with exit code 2).
Add/refresh docs and tests for the new snapshot/replay workflows.

Summary per file

File	Description
site/src/content/docs/reference/cli.mdx	Documents `waza run` snapshot flags and the new `waza replay` command.
site/src/content/docs/guides/snapshot-replay.mdx	New end-to-end guide for capturing/replaying/bisecting snapshots and redaction.
README.md	Adds snapshot/replay examples and CLI reference updates.
internal/snapshot/types.go	Defines the snapshot schema (1.0) and parsing/compat rules.
internal/snapshot/snapshot_test.go	Unit tests for snapshot parsing, env capture, fixture hashing, redaction, compare/bisect.
internal/snapshot/replay.go	Implements snapshot comparison, divergence rendering, and bisect logic.
internal/snapshot/redaction.go	Implements built-in + custom YAML redaction policies and counters.
internal/snapshot/io.go	Small I/O helper for fixture hashing.
internal/snapshot/fixtures.go	Fixture digesting logic for snapshot capture.
internal/snapshot/env.go	Env allow-list capture logic for snapshots (default-deny).
internal/snapshot/capture.go	Snapshot capture + writer implementation and tool-event/prompt/result redaction.
internal/orchestration/runner.go	Wires snapshot capture into per-run execution and stores `SnapshotPath` on `RunResult`.
internal/models/schema_version.go	Bumps `results.json` schema version to 1.2 (additive).
internal/models/outcome.go	Adds `RunResult.snapshot_path` field.
cmd/waza/root.go	Registers the new `replay` command.
cmd/waza/cmd_run.go	Adds `--snapshot`, `--snapshot-env-allow`, `--redact` flags and passes options into the runner.
cmd/waza/cmd_replay.go	Implements `waza replay` (model-replay + bisect + live stub).
cmd/waza/cmd_replay_test.go	CLI-level tests for replay behavior and exit codes.
cmd/waza/cmd_migrate_test.go	Updates migrate tests for schema version 1.2 messaging.

Review details

Files reviewed: 19/19 changed files
Comments generated: 10
Review effort level: Low

- snapshot env capture no longer leaks every host env name into DeniedKeys when the allow-list is empty (default-deny path).
- HashFixtures now takes caller-supplied skip dirs; runner excludes the configured --snapshot directory when it sits under the fixtures root, removing the hard-coded 'snapshots/' carveout.
- Bisect runs Compare in non-strict mode so the reported first-divergent turn reflects an agent action change, not downstream result drift.
- Drop the no-op --redact flag from 'waza replay' (it remains on 'waza run', where it actually affects snapshot output); update README, CLI reference, and the snapshot/replay guide accordingly.
- Fix the schema-major-rejection test to use the real snapshot kind ('task-snapshot') so it asserts schemaVersion handling rather than kind handling.
- Replace 'redacted' Unicode placeholders in the guide with the literal [REDACTED] value the implementation actually writes.
- runner.go: hoist the eval run ID so per-task snapshots and the telemetry span share the same identifier.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings June 28, 2026 13:39

Copilot started reviewing on behalf of spboyer June 28, 2026 13:39

Copilot AI reviewed Jun 28, 2026

spboyer merged commit 9a82b31 into main Jun 28, 2026
10 checks passed

spboyer deleted the spboyer-snapshot-replay branch June 28, 2026 13:57

spboyer merged 2 commits into main from spboyer-snapshot-replay
