feat: Trace replay and deterministic snapshot for agent runs #367

@spboyer

Problem

Agentic runs are non-deterministic. When an eval fails in CI, re-running rarely reproduces the failure: different model sampling, different tool ordering, different temp workspace state. Reviewers and authors waste cycles trying to recreate failures that never reappear the same way twice, and regressions in waza itself (runner/grader logic) can't be safely verified without burning LLM calls.

Proposal

Two complementary features, with redaction designed in from day one.

1. Snapshot

Each task run can emit a snapshot.json capturing enough to replay it:

  • Full prompt sequence (system, user, assistant).
  • Every tool call: name, args, response, timing.
  • Engine/model config (model id, temperature, top-p, etc.).
  • Random seed if the engine surfaces one (best-effort metadata; not required).
  • Fixture file hashes.
  • Env vars — default-deny; only vars in a configurable allow-list are captured.
  • Snapshot schema is versioned (schemaVersion); readers handle older versions or refuse with a clear error.

Snapshots are written under --output-dir with stable paths and referenced from results.json.
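To make the capture list above concrete, here is a minimal Python sketch of how a snapshot could be assembled. It is illustrative only: apart from schemaVersion, every field name (messages, toolCalls, engine, fixtureHashes, env, seed) and every helper name is an assumption, not the proposed schema, and SHA-256 is an assumed choice of hash.

```python
import hashlib
import json
from pathlib import Path

SCHEMA_VERSION = 1  # hypothetical version number


def capture_env(environ, allow_list):
    """Default-deny: keep only variables explicitly named in allow_list."""
    return {name: environ[name] for name in allow_list if name in environ}


def hash_fixtures(paths):
    """Map each fixture path to the SHA-256 hex digest of its contents."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def build_snapshot(messages, tool_calls, engine, environ, allow_list, fixtures, seed=None):
    """Assemble a snapshot dict; all keys except schemaVersion are illustrative."""
    snapshot = {
        "schemaVersion": SCHEMA_VERSION,
        "messages": messages,      # system/user/assistant turns, in order
        "toolCalls": tool_calls,   # name, args, response, timing per call
        "engine": engine,          # model id, temperature, top-p, ...
        "fixtureHashes": hash_fixtures(fixtures),
        "env": capture_env(environ, allow_list),
    }
    if seed is not None:           # best-effort metadata; omitted when unknown
        snapshot["seed"] = seed
    return snapshot


def to_json(snapshot):
    """Serialize with sorted keys so identical snapshots produce identical files."""
    return json.dumps(snapshot, indent=2, sort_keys=True)
```

Sorting keys on write is one way to keep snapshot files diff-friendly; whether waza should do so is an open design choice.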

2. Replay

waza replay <snapshot.json> supports two modes, plus a bisect option for comparing two snapshots:

  • --mode model-replay — stub the engine using captured events. Deterministic, no LLM call. Used to verify runner/grader changes don't break old runs.
  • --mode live — re-run against the real engine with the same inputs; compare against snapshot with configurable tolerances. Used to detect model/engine drift.
  • --bisect <baseline.json> <failing.json> — identify the first divergent turn between two snapshots.
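A rough sketch of the two deterministic pieces above, assuming a snapshot stores its turns as a list of {"role", "content"} dicts under a hypothetical messages key. ReplayEngine, complete, and first_divergent_turn are illustrative names, not existing waza APIs.

```python
class ReplayEngine:
    """Stub engine: returns captured assistant turns in order, with no LLM call."""

    def __init__(self, snapshot):
        self._responses = [
            m["content"] for m in snapshot["messages"] if m["role"] == "assistant"
        ]
        self._next = 0

    def complete(self, prompt):
        # The prompt is ignored; the captured response for this turn is returned.
        if self._next >= len(self._responses):
            raise RuntimeError("replay exhausted: more turns requested than captured")
        response = self._responses[self._next]
        self._next += 1
        return response


def first_divergent_turn(baseline, failing):
    """Return the index of the first differing turn, or None if the runs match."""
    a, b = baseline["messages"], failing["messages"]
    for index, (left, right) in enumerate(zip(a, b)):
        if left != right:
            return index
    if len(a) != len(b):
        # One run is a strict prefix of the other; they diverge where it ends.
        return min(len(a), len(b))
    return None
```

A stricter stub could also compare each incoming prompt against the captured one and fail fast on mismatch, which would surface runner changes that alter what gets sent to the engine.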

Redaction policy (designed first)

  • Prompts, tool args, tool results, and outputs are captured as-is by default. When --redact <policy> is set, the policy file lists the regexes and JSON paths to scrub.
  • Env vars: default-deny, allow-list only.
  • Documented redaction defaults shipped for common patterns (API keys, tokens, emails).
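As a sketch of how such a policy could be applied: the policy shape below (a dict with regexes and paths), the placeholder string, and the two example patterns are all assumptions for illustration, and simple dotted key paths stand in for real JSON-path support.

```python
import copy
import re

PLACEHOLDER = "[REDACTED]"

# Hypothetical policy; the real file format is still to be defined.
POLICY = {
    "regexes": [
        r"sk-[A-Za-z0-9]{8,}",            # example API-key shape
        r"[\w.+-]+@[\w-]+\.[\w.-]+",      # loose email pattern
    ],
    "paths": ["engine.apiKey"],           # dotted paths, not full JSONPath
}


def scrub_strings(value, patterns):
    """Recursively replace regex matches inside every string value."""
    if isinstance(value, str):
        for pattern in patterns:
            value = pattern.sub(PLACEHOLDER, value)
        return value
    if isinstance(value, dict):
        return {k: scrub_strings(v, patterns) for k, v in value.items()}
    if isinstance(value, list):
        return [scrub_strings(v, patterns) for v in value]
    return value


def redact(snapshot, policy):
    """Return a redacted copy; the input snapshot is left untouched."""
    result = copy.deepcopy(snapshot)
    for path in policy["paths"]:
        *parents, leaf = path.split(".")
        node = result
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict) and leaf in node:
            node[leaf] = PLACEHOLDER   # blank the whole value at this path
    patterns = [re.compile(r) for r in policy["regexes"]]
    return scrub_strings(result, patterns)
```

Redacting on a copy before the snapshot is written means unredacted data never reaches disk, which seems to fit the "designed first" intent.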

Why this matters for agentic-first

Non-deterministic runs are a leading source of "works on my machine." Snapshots turn flaky failures into reproducible bug reports. model-replay lets reviewers verify a fix without paying for an LLM call. --bisect makes "what changed?" answerable.

Acceptance criteria

  • waza run --snapshot writes snapshot.json per task under --output-dir; referenced from results.json.
  • Snapshot has schemaVersion; reader includes a backward-compatibility test against at least one prior version.
  • waza replay implements model-replay and live modes with documented exit codes.
  • --bisect identifies the first divergent turn between two snapshots.
  • Env capture is default-deny with an explicit allow-list.
  • Redaction policy file format defined; shipped defaults for API keys/tokens/emails.
  • Tests cover: round-trip fidelity, divergence detection, redaction, schema-version mismatch behavior.
  • Docs in site/ with a worked CI example.
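For the schema-version criteria above, one possible reader shape is sketched below. The version numbers, the v1-to-v2 migration (renaming a tools key to toolCalls), and all function names are hypothetical; the sketch only shows "upgrade older versions, refuse unknown ones with a clear error".

```python
import json

CURRENT_SCHEMA_VERSION = 2  # hypothetical


class SnapshotVersionError(ValueError):
    """Raised when a snapshot's schemaVersion cannot be read by this reader."""


def upgrade_v1_to_v2(snapshot):
    # Hypothetical migration: assume v1 stored tool calls under "tools".
    upgraded = dict(snapshot)
    upgraded["toolCalls"] = upgraded.pop("tools", [])
    upgraded["schemaVersion"] = 2
    return upgraded


MIGRATIONS = {1: upgrade_v1_to_v2}


def load_snapshot(text):
    """Parse a snapshot, upgrading older versions or refusing with a clear error."""
    snapshot = json.loads(text)
    version = snapshot.get("schemaVersion")
    if not isinstance(version, int) or not 1 <= version <= CURRENT_SCHEMA_VERSION:
        raise SnapshotVersionError(
            f"unsupported schemaVersion {version!r}; "
            f"this reader supports 1 through {CURRENT_SCHEMA_VERSION}"
        )
    while version < CURRENT_SCHEMA_VERSION:
        snapshot = MIGRATIONS[version](snapshot)   # apply one step at a time
        version = snapshot["schemaVersion"]
    return snapshot
```

A backward-compatibility test could then load a checked-in v1 snapshot and assert that it comes back in the current shape.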

Related
