feat: Trace replay and deterministic snapshot for agent runs

## Problem

Agentic runs are non-deterministic. When an eval fails in CI, re-running rarely reproduces the failure: different model sampling, different tool ordering, different temp workspace state. Reviewers and authors waste cycles trying to recreate failures that never reappear the same way twice, and regressions in waza itself (runner/grader logic) can't be safely verified without burning LLM calls.

## Proposal

Two complementary features, with **redaction designed in from day one**.

### 1. Snapshot

Each task run can emit a `snapshot.json` capturing enough to replay it:

- Full prompt sequence (system, user, assistant).
- Every tool call: name, args, response, timing.
- Engine/model config (model id, temperature, top-p, etc.).
- Random seed *if the engine surfaces one* (best-effort metadata; not required).
- Fixture file hashes.
- Env vars — **default-deny**; only vars in a configurable allow-list are captured.
- Snapshot schema is **versioned** (`schemaVersion`); readers handle older versions or refuse with a clear error.

Snapshots are written under `--output-dir` with stable paths and referenced from `results.json`.

### 2. Replay

`waza replay <snapshot.json>` with two modes:

- **`--mode model-replay`** — stub the engine using captured events. Deterministic, no LLM call. Used to verify runner/grader changes don't break old runs.
- **`--mode live`** — re-run against the real engine with the same inputs; compare against snapshot with configurable tolerances. Used to detect model/engine drift.
- **`--bisect <baseline.json> <failing.json>`** — identify the first divergent turn between two snapshots.

### Redaction policy (designed first)

- Prompts, tool args, tool results, and outputs default to **captured as-is** unless `--redact <policy>` is set; the policy file lists regexes and JSON-paths to scrub.
- Env vars: **default-deny**, allow-list only.
- Documented redaction defaults shipped for common patterns (API keys, tokens, emails).

## Why this matters for agentic-first

Non-deterministic runs are the #1 source of "works on my machine." Snapshots turn flaky failures into reproducible bug reports. `model-replay` lets reviewers verify a fix without paying for an LLM call. `--bisect` makes "what changed?" answerable.

## Acceptance criteria

- [ ] `waza run --snapshot` writes `snapshot.json` per task under `--output-dir`; referenced from `results.json`.
- [ ] Snapshot has `schemaVersion`; reader includes a backward-compatibility test against at least one prior version.
- [ ] `waza replay` implements `model-replay` and `live` modes with documented exit codes.
- [ ] `--bisect` identifies the first divergent turn between two snapshots.
- [ ] Env capture is default-deny with an explicit allow-list.
- [ ] Redaction policy file format defined; shipped defaults for API keys/tokens/emails.
- [ ] Tests cover: round-trip fidelity, divergence detection, redaction, schema-version mismatch behavior.
- [ ] Docs in `site/` with a worked CI example.

## Related

- Engine event abstraction: #10
- Multi-turn coverage: #358 (replay should cover per-turn checkpoints once they land)
- OTel export is complementary, not required: #362
- Schema versioning policy: see new schema-version issue
- Roadmap: #66


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Trace replay and deterministic snapshot for agent runs #367

Problem

Proposal

1. Snapshot

2. Replay

Redaction policy (designed first)

Why this matters for agentic-first

Acceptance criteria

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

feat: Trace replay and deterministic snapshot for agent runs #367

Description

Problem

Proposal

1. Snapshot

2. Replay

Redaction policy (designed first)

Why this matters for agentic-first

Acceptance criteria

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions