Problem
Agentic runs are non-deterministic. When an eval fails in CI, re-running rarely reproduces the failure: different model sampling, different tool ordering, different temp workspace state. Reviewers and authors waste cycles trying to recreate failures that never reappear the same way twice, and regressions in waza itself (runner/grader logic) can't be safely verified without burning LLM calls.
Proposal
Two complementary features, with redaction designed in from day one.
1. Snapshot
Each task run can emit a snapshot.json capturing enough to replay it:
- Full prompt sequence (system, user, assistant).
- Every tool call: name, args, response, timing.
- Engine/model config (model id, temperature, top-p, etc.).
- Random seed if the engine surfaces one (best-effort metadata; not required).
- Fixture file hashes.
- Env vars — default-deny; only vars in a configurable allow-list are captured.
- Snapshot schema is versioned (
schemaVersion); readers handle older versions or refuse with a clear error.
Snapshots are written under --output-dir with stable paths and referenced from results.json.
2. Replay
waza replay <snapshot.json> with two modes:
--mode model-replay — stub the engine using captured events. Deterministic, no LLM call. Used to verify runner/grader changes don't break old runs.
--mode live — re-run against the real engine with the same inputs; compare against snapshot with configurable tolerances. Used to detect model/engine drift.
--bisect <baseline.json> <failing.json> — identify the first divergent turn between two snapshots.
Redaction policy (designed first)
- Prompts, tool args, tool results, and outputs default to captured as-is unless
--redact <policy> is set; the policy file lists regexes and JSON-paths to scrub.
- Env vars: default-deny, allow-list only.
- Documented redaction defaults shipped for common patterns (API keys, tokens, emails).
Why this matters for agentic-first
Non-deterministic runs are the #1 source of "works on my machine." Snapshots turn flaky failures into reproducible bug reports. model-replay lets reviewers verify a fix without paying for an LLM call. --bisect makes "what changed?" answerable.
Acceptance criteria
Related
Problem
Agentic runs are non-deterministic. When an eval fails in CI, re-running rarely reproduces the failure: different model sampling, different tool ordering, different temp workspace state. Reviewers and authors waste cycles trying to recreate failures that never reappear the same way twice, and regressions in waza itself (runner/grader logic) can't be safely verified without burning LLM calls.
Proposal
Two complementary features, with redaction designed in from day one.
1. Snapshot
Each task run can emit a
snapshot.jsoncapturing enough to replay it:schemaVersion); readers handle older versions or refuse with a clear error.Snapshots are written under
--output-dirwith stable paths and referenced fromresults.json.2. Replay
waza replay <snapshot.json>with two modes:--mode model-replay— stub the engine using captured events. Deterministic, no LLM call. Used to verify runner/grader changes don't break old runs.--mode live— re-run against the real engine with the same inputs; compare against snapshot with configurable tolerances. Used to detect model/engine drift.--bisect <baseline.json> <failing.json>— identify the first divergent turn between two snapshots.Redaction policy (designed first)
--redact <policy>is set; the policy file lists regexes and JSON-paths to scrub.Why this matters for agentic-first
Non-deterministic runs are the #1 source of "works on my machine." Snapshots turn flaky failures into reproducible bug reports.
model-replaylets reviewers verify a fix without paying for an LLM call.--bisectmakes "what changed?" answerable.Acceptance criteria
waza run --snapshotwritessnapshot.jsonper task under--output-dir; referenced fromresults.json.schemaVersion; reader includes a backward-compatibility test against at least one prior version.waza replayimplementsmodel-replayandlivemodes with documented exit codes.--bisectidentifies the first divergent turn between two snapshots.site/with a worked CI example.Related