Multiple in-flight gap features mutate the public shapes of eval.yaml, results.json, transcripts, dashboard APIs, and (newly) snapshot.json. Without an explicit schema-version policy, every feature ships with a hidden risk of breaking older evals, baselines, and CI configs.
If we don't pin this down now, the second time someone tries to compare a baseline results.json to a current one across a waza version bump, it'll silently misbehave.
Proposal
Adopt a single, documented schema-version policy for all public artifacts.
Versioning rules
Semver-shaped: MAJOR.MINOR (no patch, since schemas don't have hotfixes).
MINOR bumps are backward-compatible additions (new optional fields). Readers ignore unknown fields.
MAJOR bumps are breaking. Readers must refuse with a clear error and a pointer to a migration command.
Default-on unknown-field warnings (not errors) for MINOR drift.
waza migrate <file> command for explicit migrations across MAJOR boundaries.
Compatibility tests
internal/validation/ ships golden fixtures for each prior schema version.
CI test: every reader must parse every prior MINOR within the same MAJOR.
Why this matters
Eval suites are long-lived. Authors check in eval.yaml and baseline results.json and expect them to keep working. Without versioning, a routine waza upgrade silently changes meaning — the worst kind of break.
Acceptance criteria
schemaVersion field added to eval.yaml, results.json, and any new artifact (snapshot.json, gate output, SSE envelope).
Reader logic emits warnings on unknown fields within a MAJOR, errors across MAJOR.
waza migrate command stubbed (no-op for v1 → v1; real migration when first MAJOR bump happens).
Golden fixtures for each prior MINOR live in internal/validation/testdata/.
Compatibility tests in CI assert every reader handles every prior MINOR.
Policy documented in site/ with a "schema changes" changelog page.
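The unknown-field warning in the reader-logic criterion might look something like the sketch below. It assumes a JSON artifact and only inspects top-level keys; nested fields would need a recursive variant. The `UnknownFields` helper and the `knownResultFields` set (including the `tasks` key) are hypothetical and do not reflect the actual results.json shape.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// knownResultFields lists the top-level keys this hypothetical reader
// understands. The real set would come from the reader's own model.
var knownResultFields = map[string]bool{
	"schemaVersion": true,
	"tasks":         true,
}

// UnknownFields returns the top-level keys of a JSON object that are not in
// known, sorted so that warnings are stable across runs.
func UnknownFields(data []byte, known map[string]bool) ([]string, error) {
	var raw map[string]json.RawMessage
	if err := json.Unmarshal(data, &raw); err != nil {
		return nil, err
	}
	var unknown []string
	for key := range raw {
		if !known[key] {
			unknown = append(unknown, key)
		}
	}
	sort.Strings(unknown)
	return unknown, nil
}

func main() {
	doc := []byte(`{"schemaVersion":"1.1","tasks":[],"tool_events":[]}`)
	unknown, err := UnknownFields(doc, knownResultFields)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, key := range unknown {
		// Within a MAJOR this is a warning; the file is still read.
		fmt.Printf("warning: results.json: unknown field %q ignored\n", key)
	}
}
```

Decoding into `json.RawMessage` avoids interpreting values the reader does not understand, so a newer MINOR can add fields of any shape without breaking this check.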
Problem
Examples already on the table:
checkpoints: field on tasks (feat: Multi-turn conversation evaluation #358)
mcp_mocks: field on tasks (feat: MCP tool-use evaluation primitives #363)
golden: true task field + new waza gate output shape (feat: Regression gates — baseline comparison with thresholds and statistical confidence #364)
tool_events[] normalized in results.json (feat: Agentic metrics — tool-call accuracy, tool selection, tool-input correctness #366)
snapshot.json with full event capture (feat: Trace replay and deterministic snapshot for agent runs #367)
adversarial: field (feat: Adversarial / safety evaluators (prompt injection, jailbreak, scope-bypass) #365)
Versioned artifacts
eval.yaml: schemaVersion (top-level); internal/models/spec.go
results.json: schemaVersion; internal/models/outcome.go
snapshot.json: schemaVersion
(artifact name not preserved in this copy): schemaVersion; web/, #178